FailSafe
SGI’s High Availability Solution

Mayank Vasa
MTS, Linux FailSafe Gatekeeper
[email protected]

FailSafe - What is it?
• High Availability for business critical applications at a low cost
• User level software running in a clustered environment providing
  – single point of failure recovery
  – cluster administration services GUI
  – a simple way to make applications HA aware

FailSafe - What it looks like

FailSafe - Terminology
• Node: a single Linux image
• Cluster: one or more nodes connected via some interconnect
• Pool: entire set of nodes involved with a group of clusters
• Node Membership: list of nodes in a cluster on which FailSafe can allocate resource groups

FailSafe - Terminology (contd.)
• Process Membership: list of process instances in a cluster which form a process group
• Resource: a single physical or logical entity
• Resource Group: collection of inter-dependent resources
  – Cannot overlap
  – Behaves like an atomic unit of failover
  – Must have a unique name throughout the cluster

FailSafe - Terminology (contd.)
• Failover: process of moving a resource group from one node to another
• Failover Policy: method used by FailSafe to determine the destination node of a failover
• Failover Domain: ordered list of nodes on which a given resource group can be allocated

FailSafe - Terminology (contd.)
• Failover Attributes: Auto Failback, Controlled Failback, InPlace Recovery
• Failover policy script: shell script which generates an ordered set of node names on which the resource group can be placed
• Action scripts: scripts which determine how a resource is started, stopped and monitored

FailSafe - Architecture
• FailSafe
• Cluster Infrastructure (CI) {CMS, GCS, SRM, CRS}
• Cluster Manager GUI and CLI
• Cluster Administration Services (CAS) {CAD, CDBD, CDB}

FailSafe - Acronyms (so many!)
CMS = Cluster Membership Service
GCS = Group Communication Service
SRM = System Resource Manager
CRS = Cluster Reset Service
CAD = Cluster Administration Daemon
CDB = Cluster Database
CDBD = Cluster Database Daemon

FailSafe - Cluster Database
• Repository for all cluster configuration
• Dynamic changes supported
• Consistency is automatically supported
• Replicated in all nodes of the pool
• Provides read and write transactional semantics

FailSafe - Cluster Database Daemon
• Controls read and write accesses to the CDB
• Notifies clients of dynamic changes to the CDB
• Keeps global portions of the CDB consistent across the pool

FailSafe - Cluster Administration Daemon
• Daemon responsible for dynamically updating the GUI
• CAD is a client of CDBD
• CDBD notifies CAD of any changes
• Provides notification (default = email) of status changes in node, cluster or resource groups

FailSafe - Cluster Membership Service
• Provides cluster node membership information to its clients
• Node membership information includes
  – Nodes that are currently part of the cluster
  – Node status, i.e. up, down or unknown
  – Node name
  – IP address currently being used for inter-CMSD communication
• Inactive cluster node membership information is also provided

FailSafe - Cluster Membership Service (contd.)
• Any change in cluster status results in a node membership message issued by CMSD to its clients on all nodes of the cluster
• CMSD implements failstop and quorum policy
• CMSDs monitor each other by exchanging heartbeat messages directly with each other

FailSafe - Group Communication Service
• Provides a consistent view of process group memberships in the presence of process failures, new processes joining, and changing node memberships
• Provides a reliable, ordered, atomic messaging service to members of the process group under changing node and group memberships
• GCS operates in the context of a cluster as defined by CMS

FailSafe - System Resource Manager
• Manages the resources and resource groups in a cluster
• Co-ordinates access to physically shared resources
• Monitors availability of resources
• Performs local failover of resources
• Maps a set of resources into a resource group
• Atomically allocates resource groups

FailSafe - Failsafe Daemon
• A policy implementor for Resource Groups (RG)
• Provides the ability to enable/disable monitoring of an application dynamically
• Provides the ability to fail over an application if monitoring fails
• Failover can be either local (restart) or remote

FailSafe - Failsafe Daemon (contd.)
• Failover Policy Module (PM)
• PM’s components
  – Failover script
  – Initial Failover Domain
  – Attributes

FailSafe - Cluster Reset Service
• Provides a reset facility in a cluster upon request from one of its clients
• Provides a facility to monitor each reset line that connects to a machine that it is expected to reset
• Special reset network to ensure connectivity for resetting remote machines

FailSafe - Agents
• Glue between a resource type and the Failsafe daemon
• Collection of action scripts and the binaries that the action scripts may call
• Goal: make a resource a highly available service
• Examples: a file server agent, a web server agent, an agent for making an IP address, a filesystem or a volume highly available

FailSafe - Action Scripts
• Determine how a resource is started, stopped and monitored
• Action scripts are per resource type
• Types: start, stop, monitor, exclusive, restart
• Return status for each resource acted on
• Called by SRM

FailSafe - Related HA Technologies
• A journaled file system for fast recovery
• FailSafe can support multiple journaled filesystems such as XFS, GFS, ext3fs
• Volume manager for disk failures (lvm)
• Network mirroring
• Monitoring tool (mon)

FailSafe - Docs, Contacts
• Documentation: http://oss.sgi.com/projects/failsafe/
• Contact: [email protected]

FailSafe - Q & A
• Questions - Sure!
• Answers …. Well maybe :)
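
The terminology slides describe a failover policy script as something that generates an ordered set of node names on which a resource group can be placed, starting from an ordered failover domain. The idea can be illustrated with a minimal sketch. It is written in Python for readability, whereas the slides say the real policy scripts are shell scripts, and it is not FailSafe's actual interface: the function names, parameters and node names are all hypothetical. Only the node status values (up, down, unknown) come from the Cluster Membership Service slide.

```python
from typing import Dict, List, Optional


def ordered_candidates(
    initial_domain: List[str],
    node_status: Dict[str, str],
    failed_node: Optional[str] = None,
) -> List[str]:
    """Return the nodes that could host a resource group.

    Hypothetical helper: keeps the order of the initial failover domain
    and drops any node that is not "up" or that just failed.
    """
    return [
        node
        for node in initial_domain
        # A node missing from the membership view is treated as "unknown".
        if node_status.get(node, "unknown") == "up" and node != failed_node
    ]


def pick_destination(
    initial_domain: List[str],
    node_status: Dict[str, str],
    failed_node: Optional[str] = None,
) -> Optional[str]:
    """Pick the first usable node, or None if no node qualifies."""
    candidates = ordered_candidates(initial_domain, node_status, failed_node)
    return candidates[0] if candidates else None


if __name__ == "__main__":
    # Assumed example cluster: three nodes, one of them down.
    status = {"node1": "up", "node2": "down", "node3": "up"}
    domain = ["node1", "node2", "node3"]
    print(pick_destination(domain, status, failed_node="node1"))
```

Keeping the ordering step separate from the final choice mirrors the split in the slides: the script only produces the ordered list, and the policy decides what to do with it.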