Page 1: FailSafe SGI’s High Availability Solution Mayank Vasa MTS, Linux FailSafe Gatekeeper vasa@sgi.com.

FailSafe SGI’s High Availability Solution

Mayank Vasa

MTS, Linux FailSafe Gatekeeper

vasa@sgi.com

FailSafe - What is it?

•High Availability for business critical applications at a low cost

•User level software running in a clustered environment providing
– single point of failure recovery
– cluster administration services GUI
– a simple way to make applications HA aware

FailSafe - What it looks like

FailSafe - Terminology

•Node : a single Linux image

•Cluster : one or more nodes connected via some interconnect

•Pool : entire set of nodes involved with a group of clusters

•Node Membership : list of nodes in a cluster on which FailSafe can allocate resource groups

FailSafe - Terminology (contd.)

•Process Membership : list of process instances in a cluster which form a process group

•Resource : a single physical or logical entity

•Resource Group : Collection of inter-dependent resources
– cannot overlap
– Behaves like an atomic unit of failover
– Must have a unique name throughout the cluster

FailSafe - Terminology (contd.)

•Failover : process of moving a resource group from one node to another

•Failover Policy : method used by FailSafe to determine the destination node of a failover

•Failover Domain : ordered list of nodes on which a given resource group can be allocated

FailSafe - Terminology (contd.)

•Failover Attributes: Auto Failback, Controlled Failback, InPlace Recovery

•Failover policy script : shell script which generates an ordered set of node names on which the resource group can be placed

•Action scripts : scripts which determine how a resource is started, stopped and monitored
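To make the failover domain and failover policy script terms above concrete, here is a minimal sketch of an ordered policy. The slides describe the policy script as a shell script; Python is used here purely for illustration, and the function name `ordered_policy` and the node names are hypothetical, not part of FailSafe.

```python
def ordered_policy(failover_domain, members):
    """Return candidate nodes for a resource group, in preference order.

    failover_domain: ordered list of node names (hypothetical input).
    members: set of node names currently in the node membership.
    """
    # Keep the domain's ordering, but drop nodes that are not current members.
    return [node for node in failover_domain if node in members]


domain = ["node1", "node2", "node3"]
candidates = ordered_policy(domain, members={"node2", "node3"})
print(candidates)
```

The first entry of the returned list would be the preferred destination node; an empty list means no node in the domain is currently available.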

FailSafe - Architecture

Cluster Administration services (CAS) {CAD, CDBD, CDB}

Cluster Infrastructure (CI) {CMS, GCS, SRM, CRS}

FailSafe

Cluster Manager GUI and CLI

FailSafe - Acronyms (so many!)

CMS = Cluster Membership Service

GCS = Group Communication Service

SRM = System Resource Manager

CRS = Cluster Reset Service

CAD = Cluster Administration Daemon

CDB = Cluster Database

CDBD = Cluster Database Daemon

FailSafe - Cluster Database

•Repository for all cluster configuration

•Dynamic changes supported

•Consistency is automatically supported

•Replicated in all nodes of the pool

•Provides read and write transactional semantics

FailSafe - Cluster Database Daemon

•Controls read and write accesses to the CDB

•Notifies clients of dynamic changes to the CDB

•Keeps global portions of the CDB consistent across the pool
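The read, write and notify roles listed above can be pictured with a toy in-memory store. This is only a sketch of the general pattern (validated all-or-nothing writes plus change callbacks); the class `ConfigStore`, its methods and the key names are hypothetical and are not the CDB or CDBD interface.

```python
class ConfigStore:
    """Toy stand-in for a configuration database with change notification."""

    def __init__(self):
        self._data = {}
        self._subscribers = []

    def subscribe(self, callback):
        # Register a client callback that is called after every committed write.
        self._subscribers.append(callback)

    def read(self, key):
        return self._data.get(key)

    def write(self, changes):
        # Validate every key first so a bad write changes nothing.
        for key in changes:
            if not isinstance(key, str):
                raise TypeError("keys must be strings")
        self._data.update(changes)
        # Notify clients only after the whole change set is applied.
        for callback in self._subscribers:
            callback(dict(changes))


seen = []
store = ConfigStore()
store.subscribe(seen.append)
store.write({"cluster/name": "demo"})
print(store.read("cluster/name"), seen)
```

Replication across the pool, which the slides also attribute to the CDB, is deliberately left out of this sketch.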

FailSafe - Cluster Administration Daemon

•Daemon responsible for dynamically updating the GUI

•CAD is a client of CDBD

•CDBD notifies CAD of any changes

•Provides notification (default = email) of status changes in node, cluster or resource groups

FailSafe - Cluster Membership Service

•Provides cluster node membership information to its clients

•Node membership information includes
– nodes that are currently part of the cluster
– Node status i.e. up, down or unknown
– Node name
– IP address currently being used for inter-CMSD communication

•Inactive cluster node membership information is also provided

FailSafe - Cluster Membership Service (contd.)

•Any change in cluster status results in a node membership message issued by CMSD to its clients on all nodes of the cluster

•CMSD implements failstop and quorum policy

•CMSDs monitor each other by exchanging heartbeat messages directly with each other
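The heartbeat and quorum bullets above can be sketched as two small checks. The slides do not define the quorum rule or any timeout, so the strict-majority rule and the `timeout` parameter below are assumptions made for illustration; only the status values (up, down, unknown) come from the slides.

```python
def has_quorum(up_nodes, configured_nodes):
    """Assumed rule: strictly more than half of the configured nodes are up."""
    return 2 * len(up_nodes) > len(configured_nodes)


def node_status(last_heartbeat, now, timeout):
    """Classify a node from the time of its last heartbeat (seconds)."""
    if last_heartbeat is None:
        return "unknown"  # never heard from this node
    return "up" if now - last_heartbeat <= timeout else "down"


print(has_quorum({"a", "b"}, {"a", "b", "c"}))
print(node_status(last_heartbeat=8.0, now=10.0, timeout=5.0))
```

A strict majority is a common choice because two disjoint halves of a partitioned cluster cannot both satisfy it.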

FailSafe - Group Communication Service

•Provides a consistent view of process group memberships in presence of process failures, new processes joining, and changing node memberships

•Provides a reliable ordered atomic messaging service to members of the process group under changing node and group memberships

•GCS operates in the context of a cluster as defined by CMS
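As a rough picture of the "ordered" part of the messaging service above, the sketch below delivers messages strictly by sequence number, buffering anything that arrives early. It assumes some sender already assigns sequence numbers; that assumption, and the class `OrderedReceiver`, are hypothetical and say nothing about how GCS is actually implemented. Atomicity and membership changes are not modelled.

```python
class OrderedReceiver:
    """Deliver messages in sequence-number order, buffering gaps."""

    def __init__(self):
        self.next_seq = 0    # next sequence number to deliver
        self.pending = {}    # early arrivals, keyed by sequence number
        self.delivered = []  # payloads in delivery order

    def receive(self, seq, payload):
        # Ignore duplicates of delivered or already buffered messages.
        if seq < self.next_seq or seq in self.pending:
            return
        self.pending[seq] = payload
        # Deliver as long as the next expected message is available.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


receiver = OrderedReceiver()
receiver.receive(1, "b")  # arrives early, held back
receiver.receive(0, "a")  # fills the gap, both are delivered
print(receiver.delivered)
```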

FailSafe - System Resource Manager

•Manages the resources and resource groups in a cluster

•Co-ordinates access to physically shared resources

•Monitors availability of resources

•Performs local failover of resources

•Maps a set of resources into a resource group

•Atomically allocates resource groups

FailSafe - Failsafe Daemon

•A policy implementor for Resource Groups (RG)

•Provides the ability to enable/disable monitoring an application dynamically

•Provides the ability to fail over an application if monitoring fails

•Failover can be either local (restart) or remote
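The choice between a local restart and a remote failover might be sketched as follows. The restart limit (`max_local_restarts`) and the function `choose_recovery` are hypothetical; the slides only state that failover can be local (restart) or remote.

```python
def choose_recovery(local_restarts, max_local_restarts, candidates, current_node):
    """Pick a recovery action after a monitoring failure.

    candidates: ordered node names, for example from a failover policy.
    Returns an (action, node) tuple.
    """
    # Assumed rule: try restarting in place until the limit is reached.
    if local_restarts < max_local_restarts:
        return ("restart", current_node)
    # Otherwise move to the first candidate other than the current node.
    for node in candidates:
        if node != current_node:
            return ("failover", node)
    return ("none", None)  # nowhere left to go


print(choose_recovery(0, 1, ["node1", "node2"], "node1"))
print(choose_recovery(1, 1, ["node1", "node2"], "node1"))
```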

FailSafe - Failsafe Daemon (contd.)

•Failover Policy Module (PM)

•PM’s components
– Failover script
– Initial Failover Domain
– Attributes

FailSafe - Cluster Reset Service

•Provides reset facility in a cluster upon request from one of its clients

•Provides facility to monitor each reset line that connects to a machine that it is expected to reset

•Special reset network to ensure connectivity for resetting remote machines

FailSafe - Agents

•Glue between a resource type and the Failsafe daemon

•Collection of action scripts and binaries that the action scripts could be calling

•Goal : Make a resource a highly available service

•Examples: a file server agent, a web server agent, an agent for making an IP address, a filesystem or a volume highly available

FailSafe - Action Scripts

•Determine how a resource is started, stopped and monitored

•Action scripts are per resource type

•Types: start, stop, monitor, exclusive, restart

•Return a status for each resource acted on

•Called by SRM
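The per-resource status idea above can be sketched as a small dispatcher. The five action names come from the slide; everything else (the function `run_action`, the handler table, the status strings and the resource names) is hypothetical, and real action scripts are separate scripts per resource type rather than Python functions.

```python
ACTIONS = ("start", "stop", "monitor", "exclusive", "restart")


def run_action(action, resources, handlers):
    """Run one action on each resource and collect a status per resource."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    handler = handlers[action]
    status = {}
    for name in resources:
        try:
            handler(name)
            status[name] = "ok"
        except Exception as exc:
            # One failing resource does not stop the others from being handled.
            status[name] = f"failed: {exc}"
    return status


def fake_monitor(name):
    # Hypothetical check: pretend the resource "web2" is not responding.
    if name == "web2":
        raise RuntimeError("no response")


result = run_action("monitor", ["web1", "web2"], {"monitor": fake_monitor})
print(result)
```

Returning one status per resource lets the caller react to a single failed resource without losing the results for the rest.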

FailSafe - Related HA Technologies

•A journaled file system for fast recovery

•FailSafe can support multiple journaled filesystems such as XFS, GFS, ext3fs

•Volume manager for disk failures (lvm)

•Network mirroring

•Monitoring tool (mon)

FailSafe - Docs, Contacts

•Documentation : http://oss.sgi.com/projects/failsafe/

•Contact : [email protected]

FailSafe - Q & A

•Questions - Sure!

•Answers …. Well maybe :)