Towards Highly Available, Scalable, and Secure HPC Clusters with HA-OSCAR
Dr. Chokchai Box Leangsuksun, Louisiana Tech University, [email protected]
Ibrahim Haddad, OSDL, [email protected]
Dr. Stephen L. Scott, Oak Ridge National Laboratory, [email protected]
IEEE CLUSTER CONFERENCE, September 26, 2005 -- Boston, MA, USA
1
Towards Highly Available, Scalable, and Secure HPC Clusters with HA-OSCAR
Dr. Chokchai Box Leangsuksun, Louisiana Tech University
distribution: Dynamic mechanisms that detect and react to
the unavailability, addition and removal of components in the
system
• Availability: provide HA services to its end users
• Operation and maintenance: Performed remotely without
affecting the system performance and availability
• Fast response time: Minimize serialized executions
10
General HA Clustering Wish-list (2/2)
• Geographical Diversity
– Spread across several Points of Presence
– Support geographic mirroring
• Provide a single cluster IP Interface: Clients access the server
application through a single IP address
• Security: High security requirements depending on
deployment scenarios (open vs. closed networks)
11
2 Main Challenges
• System Availability
• System Capacity
12
System Availability – What does it mean?
• Availability is defined as:
  A = MTBF / (MTBF + MTTR)
  MTBF: Mean Time Between Failures
  MTTR: Mean Time To Repair
• Example:
  If a system offered an MTBF of 20,000 hours with an MTTR of 2 hours, then its availability would be 20,000 / 20,002 ≈ 99.99%, “4-nines.”
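To make the arithmetic concrete, here is a minimal sketch (Python) that evaluates the formula with the example values above and converts the result into downtime per year:

# Availability from MTBF/MTTR, plus the equivalent downtime per year.
# Values are the example above: MTBF = 20,000 h, MTTR = 2 h.

MINUTES_PER_YEAR = 365 * 24 * 60

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_minutes_per_year(a: float) -> float:
    """Unavailability expressed as minutes of downtime per year."""
    return (1.0 - a) * MINUTES_PER_YEAR

a = availability(20_000, 2)
print(f"Availability:  {a * 100:.3f}%")                         # ~99.990% ("4-nines")
print(f"Downtime/year: {downtime_minutes_per_year(a):.0f} min") # ~53 minutes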
13
Means to achieve higher availability
• Increase MTBF
  – Improve “quality” or “robustness”
  – Use redundancy / remove single points of failure
• Decrease MTTR
  – Streamline and accelerate fail-over
    • Optimize boot / reboot time
    • Respond to fault conditions in real time
  – Make faults more granular in time and scope
    • Better to have many short faults than a few long ones
    • Limit the scope of faults to smaller s/w and h/w components
14
Degrees of HA
Source: Gartner
15
System Availability Redundancy Models
[Chart: availability in number of nines (3–6) versus system redundancy model (1+0, 1+1, N+M)]
16
Beowulf Cluster
17
Beowulf Cluster
• Beowulf is one approach to clustering COTS components to form a supercomputer
• A Beowulf cluster is a collection of COTS computers networked together to harvest high performance computing
• A typical Beowulf cluster has:
  – a single head node
  – multiple identical client nodes
18
Beowulf Cluster Architecture
Head Node: Entry point to the cluster. Responsible for serving user requests. Distributes jobs to compute clients via scheduling and queuing software.
Compute Clients: Dedicated for computation.
Communication: Using an Ethernet network and/or fast interconnects: Myrinet, InfiniBand, etc.
Head Node
Compute Nodes
Communication
End Users
19
Beowulf Cluster – Advantages
1. COTS HW and SW components
2. Toward High Performance Computation (HPC)
3. Allows flexible configurations
4. Heterogeneous environment
5. Scalability (add more nodes)
6. Alternative choice for monolithic supercomputers
7. Good price/performance
20
Beowulf Cluster – Issues
Head Node
Compute Nodes
Communication
End Users
• Single head node architecture – vulnerable to a SPOF
• Single communication path architecture – vulnerable to a SPOF
• Compute nodes are not accessible after either failure occurs, or when a cluster-service or OS upgrade takes place.
21
HA & Linux Clusters
22
Providing High Availability
• One technique of providing HA is by distributing
functionality across multiple nodes
• In response to HW and SW failures, HA systems
facilitate the rapid transfer of control from a faulty
CPU, peripheral, or software component to a
functional one, while preserving operations or
transactions in-progress at the time of failure.
23
HA Supporting mechanisms
• HA systems must support mechanisms for:
  – Error Detection
  – Damage Containment
  – Error Recovery
  – Fault Treatment (incl. dynamic reconfiguration)
• Box will discuss how HA-OSCAR supports these mechanisms
• Assumption: we are dealing with systems comprising clusters of processors which share nothing.
24
Redundancy in HA Systems
• Redundancy of key subsystems is important– Redundant Ethernet to ensure constant networking
connections
– Redundant power supplies
– Redundant disks
– etc.
25
Other means to support HA
• Disk mirroring to ensure high levels of data reliability
• Hot swap (hot insert, hot remove, identity maintenance)
• Options for booting compressed and remotely hosted kernel images
• Support of compressed r/w and read-only Flash file systems
• Accelerated boot and daemon start times
• Fast shutdown / reboot
• Eliminating costly file system operations with journaling file systems
• Etc.
26
Uptime
• Example from the telecom industry:
The main operator requirement is:
No more than 30 seconds of service interruption per year (roughly 99.9999% availability)
– Applies to the overall solution: hardware, software (OS and middleware), and the applications.
– Includes software and hardware upgrade and maintenance.
27
High Availability of Cluster Hardware
• Hardware availability is very important
• In some cases, the platform may be available but not the application
– Software has bugs that may cause applications to crash.
– Keeping redundancy in applications and maintaining processes state is complex.
• In telecom, the required uptime includes both platform and applications uptime
– End-users don’t care about running platforms when the required application is unavailable
28
Cluster Concurrent Maintenance
• Allow (un)scheduled maintenance to be performed on a node of a cluster while other nodes continue to provide service without noticeable degradation
• Allow it to be done remotely
29
Failover
• Failover is the ability to detect problems in a node and to accommodate ongoing processing by routing applications to other nodes.
– This process may be programmed or scripted so that steps are taken automatically without operator/admin intervention
– Fundamental to failover is communication among nodes signaling that they are functioning correctly and reporting problems when they occur
30
Characteristics of Failover
• Transparency
– Applications and users are automatically and transparently reconnected to another node/system
• Performance
– Depends on hardware configuration, instance recovery time and workload at time of failover
• Robustness
– The cluster should be able to survive multiple failures and still provide mission critical applications
31
Linear Scalability
• We want clusters to support linear growth
– New processors can be added without disturbance
• Capacity grows linearly as processors are added
– Modular addition of HW and SW components happens
• If we double the number of processors, we should expect to almost double the throughput of the system.
[Chart: throughput grows with the number of processors (2, 4, 6, 8, 10, 12, 14, 16, 18, …)]
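As an illustration only (the throughput numbers below are hypothetical, not measurements from this tutorial), scaling efficiency can be checked by comparing measured throughput against the ideal linear projection:

# Scaling efficiency: measured throughput at n nodes divided by the ideal
# linear projection n * (throughput at 1 node); 1.0 means perfectly linear.
# The throughput numbers below are made up purely for illustration.

measured = {1: 100.0, 2: 195.0, 4: 380.0, 8: 720.0}   # jobs/hour (hypothetical)

base = measured[1]
for n, throughput in sorted(measured.items()):
    ideal = n * base
    print(f"{n:2d} nodes: {throughput:6.1f} jobs/h "
          f"(ideal {ideal:6.1f}, efficiency {throughput / ideal:.0%})")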
32
Highly Available Storage
• Storage is a critical and necessary part of a HA cluster
• Data should be available to users even when a storage node fails or when errors occur with the distributed file system
• One popular technique is providing RAID support
– Other: NAS, SAN, etc.
33
Online OS and Application Upgrades (required for telecom-grade clusters)
• A requirement in mission critical environments
– Telecom, defense.
• When upgrading software (applications), old and new version of same process can coexist
– provide mechanisms to upgrade a running application
– The system will deal with the old and new running versions of the application simultaneously
• Applications will transfer state between the old and new process instances
34
Manageability
• Single point of control
– Applications
– Software
– Hardware
– Data (Data movement, Security, Backup, Recovery, etc)
• Online configurability to reduce downtime
• Capacity Control
  – Overload protection by selectively rejecting jobs/requests when a threshold is reached
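A minimal sketch of that overload-protection idea (the threshold value and the pending-job-count metric are assumptions for illustration):

# Capacity-control sketch: selectively reject incoming jobs/requests once a
# load threshold is reached. The threshold and the pending-job-count metric
# are illustrative assumptions.
from collections import deque

MAX_PENDING_JOBS = 100     # assumed overload threshold

pending = deque()

def submit(job: str) -> bool:
    """Accept the job if below the threshold, otherwise reject it."""
    if len(pending) >= MAX_PENDING_JOBS:
        print(f"rejected {job}: {len(pending)} jobs already pending")
        return False
    pending.append(job)
    return True

for i in range(105):
    submit(f"job-{i}")     # jobs 100..104 are rejected once the threshold is hit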
35
Heartbeat Mechanisms
• Linux-HA project
http://www.linux-ha.org
Release 2 is out.
Linux-HA BOF is on Friday
[Diagram: Master Node 1 and Master Node 2 exchange heartbeat messages over two redundant paths (1 and 2) through a router]
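The basic mechanism can be sketched as a toy example (Python UDP heartbeat; this is not the Linux-HA implementation, and the port, interval, and deadline values are placeholders): one node periodically announces that it is alive, and the peer initiates failover when announcements stop arriving.

# Toy heartbeat: one side periodically sends "alive" datagrams, the other
# declares the peer failed when nothing arrives within a deadline.
# Port, interval and deadline are illustrative values only.
import socket, sys, time

PORT, INTERVAL, DEADLINE = 6940, 1.0, 3.0

def send(peer: str) -> None:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        s.sendto(b"heartbeat", (peer, PORT))
        time.sleep(INTERVAL)

def watch() -> None:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))
    s.settimeout(DEADLINE)
    while True:
        try:
            s.recvfrom(64)                 # heartbeat received: peer is alive
        except socket.timeout:
            print("missed heartbeat deadline -- initiate failover")
            return                         # a real system would start takeover here

if __name__ == "__main__":
    if len(sys.argv) > 2 and sys.argv[1] == "send":
        send(sys.argv[2])                  # e.g. python heartbeat.py send 10.0.0.2
    else:
        watch()                            # e.g. python heartbeat.py watch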
36
Non-Stop Operations – Summary (required for telecom-grade clusters)
• No single point of failure
• No scheduled downtime
• In-service upgrade of software and hardware with
no disturbance to operation
• HA failover software
• Software Configuration Control
– Automatic restart, on the working processors, of processes that originally executed on a faulty processor
37
Fast Recovery of Applications
• Maximizing availability of applications and services is a priority
• If an application dies for some reason, it is very important to restart it ASAP
• Provide automatic failover and recovery capabilities with very minimum interruptions to the users.
• Monitoring mechanisms to detect failures and trigger actions
The active Ethernet adapter provides the connection to the network. The standby Ethernet adapter is hidden from applications and is known only to the Ethernet redundancy daemon.
Active Standby Active Standby
The active Ethernet adapter has failed and the Ethernet redundancy daemon designates the former standby Ethernet adapter as the new active adapter.
Before network adapter swap | After network adapter swap
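A rough sketch of what such an Ethernet redundancy daemon does (illustrative only; the interface names, the service address, and the use of iproute2 are assumptions, and a production daemon is considerably more involved): watch the active adapter's link state and move the cluster address to the standby adapter when the link drops.

# Sketch of an Ethernet redundancy daemon: poll the active NIC's link state
# and, on failure, move the service address to the standby NIC with iproute2.
# Interface names and the address below are placeholders.
import subprocess, time

ACTIVE, STANDBY = "eth0", "eth1"          # placeholder interface names
SERVICE_ADDR = "192.168.1.10/24"          # placeholder cluster/service address

def link_up(iface: str) -> bool:
    with open(f"/sys/class/net/{iface}/operstate") as f:
        return f.read().strip() == "up"

def swap_to_standby() -> None:
    subprocess.run(["ip", "addr", "del", SERVICE_ADDR, "dev", ACTIVE], check=False)
    subprocess.run(["ip", "link", "set", STANDBY, "up"], check=True)
    subprocess.run(["ip", "addr", "add", SERVICE_ADDR, "dev", STANDBY], check=True)
    print(f"{STANDBY} is now the active adapter")

if __name__ == "__main__":
    while link_up(ACTIVE):                # poll the active adapter's link state
        time.sleep(1)
    swap_to_standby()                     # link lost: promote the standby adapter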
41
1+1 active/standby
Public Network
Active Master Node
Standby Master Node
Heartbeat Messages
Shared Network Storage
Dual Redundant Data Paths
Physically connected but not logically in use
Clients
42
1+1 active/standby …
Public Network
Active Master Node
Standby Master Node
Heartbeat Messages
Shared Network Storage
Failed Node
Now Active Master Node
Physically connected and in use
Physically connected but not logically in use due to the failure of the master node
Clients
43
1+1 active/active …
Public Network
Active Master Node 1
Active Master Node 2
Heartbeat Messages
Shared Network Storage
Physically connected and in use, providing a redundant data path for master node 2
Clients
Physically connected and in use, providing a redundant data path for master node 1
44
N+M
[Diagram: N+M redundancy – Nodes A–D run the active processes, Standby Nodes 1 and 2 hold standby processes, and process state is replicated through HA shared storage (steps 1–3)]
45
N-way
Node A  Node B  Node C  Node D
HA Shared Storage
(1) State information is written to shared storage
(2) State information is available on shared storage
(3) Traffic nodes have access to the state information
46
HA Clusters: Nodes Topology & Redundancy Models
HA Tier – Master Nodes: 1+1 redundancy model, Active/Hot-standby or Active/Active
Service Scalability Tier – N traffic nodes, N ≥ 2; redundancy is at the node level; redundancy models: N active, or N active and M standby (N ≥ 2, M ≥ 0)
Storage Tier – Redundant specialized storage nodes (N = 2), or redundant storage through the implementation of an HA NFS server (modified NFS implementation)
N active | N active / M standby
47
Challenges
48
Challenges
• How to automatically build and boot the nodes?– Installation infrastructure
• Which (HA?) distributed file systems to use in the cluster?– File Systems
• What types of traffic distribution and load balancing mechanisms to use?
• How to build redundancy, and to what extent? – Redundancy at the Network, File System, Disk and CPU levels
• How to manage the cluster, remotely? – System management
• How to add/remove nodes without affecting the operations?– Dynamic reconfiguration
• How to achieve linear scalability?– Scalability
• How to secure a cluster running on open networks? – Security
49
Conclusion to the intro!
• From the start, the design of the software architecture of the application should take into account:
– Scalability
– Failure Handling
– Error (software bug) handling
– Future modification
– Hot software upgrade
• A complete HA solution requires close integration of:
  – HA hardware,
  – HA software solution,
  – HA middleware, and
  – Application software that can cause failover to redundant components
• Framework for cluster installation configuration and management
• Commonly used cluster tools
• Wizard-based cluster software installation
  – Operating system
  – Cluster environment
• Administration
• Operation
• Automatically configures cluster components
• Increases consistency among cluster builds
• Reduces time to build / install a cluster
• Reduces need for expertise
Open Source Cluster Application Resources
[Diagram: OSCAR install wizard flow – Step 1 (Start…) through Step 8 (Done!)]
54
OSCAR - the beginning
55
• Extreme Linux• May 13, 1998• $29.95 CD
First cluster “distro”
Oak Ridge National Laboratory -- U.S. Department of Energy
56
OSCAR Background
• Concept first discussed in January 2000
• First organizational meeting in April 2000– Cluster assembly is time consuming & repetitive
– Nice to offer a toolkit to automate
• First public release in April 2001
• Use “best practices” for HPC clusters– Leverage wealth of open source components
• Form umbrella organization to oversee cluster efforts– Open Cluster Group (OCG)
57
Open Cluster Group
• Informal group formed to make cluster computing more practical for HPC research and development
• Membership is open, directed by a steering committee – Research/Academic – Industry
• Current active working groups– OSCAR (core group)– Thin-OSCAR (Diskless Beowulf)– HA-OSCAR (High Availability)– SSS-OSCAR (Scalable Systems Software)– SSI-OSCAR (Single System Image)– BIO-OSCAR (Bioinformatics cluster system)
58
OSCAR Core Participants
• Dell• IBM• Intel• Bald Guy Software• RevolutionLinux• INRIA• EDF• Canada’s Michael Smith Genome Sciences
Center
• Indiana University• NCSA• Oak Ridge National Laboratory• Université de Sherbrooke• Louisiana Tech Univ.• NEC Europe• Air Force Research Lab (USA)
November 2004
59
Offer Variety of Flavors
HA-OSCAR, Thin-OSCAR, SSS-OSCAR, SSI-OSCAR,
SSS-OSCAR
• OSCAR is a snap-shot of best-known-methods for building, programming and using clusters of a “reasonable” size.
• To bring uniformity to clusters, foster commercial versions of OSCAR, and make clusters more broadly acceptable.
• Consortium of research, academic & industry members cooperating in the spirit of open source.
• HPC Services/Tools– Parallel Libs: MPICH, LAM/MPI, PVM– Torque, Maui, OpenPBS– HDF5– Ganglia, Clumon, … [monitoring systems]– Other 3rd party OSCAR Packages
• Core Infrastructure/Management– System Installation Suite (SIS), Cluster Command & Control (C3), Env-Switcher, – OSCAR DAtabase (ODA), OSCAR Package Downloader (OPD)
62
System Installation Suite (SIS)
Enhanced suite to the SystemImager tool.
Adds SystemInstaller and SystemConfigurator
• SystemInstaller – interface to installation, includes a stand-alone GUI – Tksis. Allows for description based image creation.
• SystemImager – base tool used to construct & distribute machine images.
• SystemConfigurator – extension that allows for on-the-fly style configurations once the install reaches the node, e.g. ‘/etc/modules.conf’.
63
System Installation Suite (SIS)
• Used in OSCAR to install nodes– partitions disks, formats disks and installs nodes
• Construct “image” of compute node on headnode– Directory structure of what the node will contain– This is a “virtual”, chroot–able environment
/var/lib/systemimager/images/oscarimage/ (containing etc/, …, usr/)
• Use rsync to copy only differences in files, so can be used for cluster management – maintain image and sync nodes to image
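As a sketch of what “sync nodes to image” can look like (the image path follows the example above; the node names, the use of ssh, and the exact rsync options are illustrative assumptions rather than OSCAR's own tooling):

# Sketch: push only the differences between the stored image and each compute
# node, using rsync over ssh. The image path matches the example above; the
# node names are placeholders, and a real deployment would exclude more
# virtual filesystems than shown here.
import subprocess

IMAGE = "/var/lib/systemimager/images/oscarimage/"   # image kept on the head node
NODES = ["oscarnode001", "oscarnode002"]             # placeholder node names

for node in NODES:
    # -a preserves attributes, --delete removes files not present in the image,
    # and the trailing "/" on IMAGE syncs the image contents into the node root.
    subprocess.run(
        ["rsync", "-a", "--delete",
         "--exclude=/proc", "--exclude=/sys", "--exclude=/dev",
         "-e", "ssh", IMAGE, f"root@{node}:/"],
        check=True,
    )
    print(f"synced {node} against the image")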
64
Switcher
• Switcher provides a clean interface to edit environment without directly tweaking .dot files.– e.g. PATH, MANPATH, path for ‘mpicc’, etc.
• Edit/Set at both system and user level.
• Leverages existing Modules system
• Changes are made to future shells– To help prevent simple operator errors while making shell edits– Modules already offers facility for current shell manipulation, but no
persistent changes.
65
OSCAR DAtabase (ODA)
• Used to store OSCAR cluster data
• Currently uses MySQL as DB engine
• User and program friendly interface for database access
• Capability to extend database commands as necessary.
66
OSCAR Package Downloader (OPD)
Tool to download and extract OSCAR Packages.
• Can be used for timely package updates
• Packages that are not included, i.e. “3rd Party”
• Distribute packages with licensing constraints.
67
C3 Power Tools
• Command-line interface for cluster system administration and parallel user tools.
• Parallel execution cexec – Execute across a single cluster or multiple clusters at same time
• Scatter/gather operations cpush/cget – Distribute or fetch files for all node(s)/cluster(s)
• Used throughout OSCAR and as underlying mechanism for tools like OPIUM’s useradd enhancements.
68
C3 Building Blocks
• System administration
  • cpushimage – “push” an image across the cluster
  • cshutdown – remote shutdown to reboot or halt the cluster
• User & system tools
  • cpush – push a single file to a directory on each node
  • crm – delete a single file or directory on each node
  • cget – retrieve files from each node
  • ckill – kill a process on each node
  • cexec – execute an arbitrary command on each node
  • cexecs – serial mode, useful for debugging
  • clist – list each cluster available and its type
  • cname – returns a node name from a given node position
  • cnum – returns a node position from a given node name
69
C3 Power Tools
Example to run hostname on all nodes of default cluster:$ cexec hostname
Example to push an RPM to /tmp on the first 3 nodes$ cpush :1-3 helloworld-1.0.i386.rpm /tmp
Example to get a file from node1 and nodes 3-6$ cget :1,3-6 /tmp/results.dat /tmp
* Can leave off the destination with cget and will use the same location as source.
70
Current OSCAR Release Notes (v4.1)
• Supported Distros:
– Red Hat 9
– Red Hat Enterprise Linux (RHEL) 3
• Supports both x86 and Itanium systems
– Fedora Core 2 support
– Mandrake 10.0 (experimental)
• Torque is included as the default scheduler (OpenPBS can still be downloaded from OPD)
• DepMan / PackMan
– Resolves dependencies during “build node image”
– Used in install/uninstall packages
• APITest now part of OSCAR testing framework
• Versions of key software components:
– Ganglia 2.5.6-1B
– LAM-MPI 7.0.6-1
– MPICH 1.2.5
– Torque (PBS Replacement) 1.0.1
– MAUI 3.2.5
– SIS 3.3.2
71
OSCAR Installation
72
Server Installation and Configuration
• Install Linux on server machine (cluster head node)– workstation install w/ software development tools– 50+ page installation document!
• (quick install available)
• Download a copy of OSCAR and unpack it on the server
• Configure and install OSCAR on the server
– readies the wizard install process
• Configure server Ethernet adapters– public– private
• Launch OSCAR Installer (wizard)
73
OSCAR Wizard
Demo install
Demo add/delete node
Demo add/delete package
version 4.0
74
OSCAR Wizard
75
Step 0
Enables you to download additional packages
OPD – OSCAR Package Downloader performs the download
OPDer – GUI front end to OPD
76
OPDer
clumon and PVFS selected for download
77
Alternate repositories, possibly a local machine
OPDer (2)
78
Create your own flavor of cluster distribution
Select OSCAR packages to install.
Step 1
79
Core packages are automatically selected for you and cannot be unselected
Download does not equal installation!
Packages downloaded with OPDer are selected for installation here
Package Selector
80
Configure OSCAR packages that require special configuration tasks
Step 2
81
Environment Switcher performs the configuration for default MPI use
make selection
Package configuration
82
Install OSCAR Server (cluster head node) specific packages on cluster head node
May take a few minutes
Wait for button…
Step 3
83
success
Install server packages
84
Specify and build system image for client (compute) nodes
Step 4
85
name your image
list of packages
package file location
disk partition file location
static or dynamic
halt, reboot, beep
enable multicast
Build image configure
86
showing progress
Building image
87
success
Building image finished
88
Define client nodes
Step 5
89
specify image name (from step 4 – or other saved image)
client IP domain name
client base name (oscarnodeXXX)
node count
starting index to append to base
padding to client names (3 = oscarnode009)
starting IP address
Subnet Mask
Default Gateway
Define client nodes
90
success
Define client nodes
91
in one operation – setup networking for all cluster client nodes
for the first time in the installation process we will “touch” the client nodes
Step 6
92
machines named as specified in prior step 5
IP address as specified in prior step 5
Scan network for MACs or import from file
Setup network – initial window
93
found MAC addresses will show here for network setup
stop collecting when done
Setup network - scanning network
94
found and assigned all MAC addresses
Setup network – initial window
95
reboot on their own – “post install action” from step 4
or
manually reboot
Reboot Clients
96
only after ALL clients have rebooted
runs “post install” scripts for packages that have them
cleanup and reinitialize where needed
Step 7
97
success
Complete setup
98
test suite provided to ensure that key cluster components are functioning properly
Step 8
99
All Passed!!!
Test cluster setup
100
Your OSCAR cluster is now ready to use
Quit OSCAR Wizard
101
Demo install
Demo add/delete node
Demo add/delete package
version 4.0
OSCAR Wizard
102
OSCAR Cluster Maintenance
Add Compute Nodes
103
increase the number of compute nodes in the cluster
Add OSCAR Clients
104
Operates in similar manner to steps 5, 6, and 7 in OSCAR installation
Behind-the-scenes action differs somewhat…
compare to the standard install process: step 5, step 6, step 7
Add OSCAR Clients
105
Delete Compute Nodes
OSCAR Cluster Maintenance
106
decrease the number of compute nodes in the cluster
Delete OSCAR Clients
107
ready to select client(s) to delete
Delete OSCAR Clients
108
client selected to delete
Delete OSCAR Clients
109
success
Delete OSCAR Clients
110
Demo install
Demo add/delete node
Demo add/delete package
version 4.0
OSCAR Wizard
111
Install / Uninstall Packages
OSCAR Cluster Maintenance
112
select to install or uninstall an OSCAR package
Install/UninstallOSCAR packages
113
Install/UninstallOSCAR packages
114
Open Cluster Group: www.OpenClusterGroup.org/
OSCAR Home Page: oscar.OpenClusterGroup.org/
OSCAR Development site: sourceforge.net/projects/oscar/
OSCAR Research supported by the Mathematics, Information and Computational Sciences Office, Office of Advanced Scientific Computing Research, Office of Science, U. S. Department of Energy, under contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.
More OSCAR Information
115
Agenda
1:00 – 1:05 Introduction Box
1:05 – 1:50 Clustering and HA Ibrahim
1:50 – 2:30 OSCAR Stephen
2:30 – 3:00 Break, Q&A
3:00 – 4:30 HA-OSCAR & Demo Box
116
HA-OSCAR: High Availability - Open Source Cluster Application Resource
• Installation requirements
  • Red Hat 9.0 Linux distribution
  • OSCAR 3.0
• HA-OSCAR 1.1 release
  • Supports OSCAR 4.1
161
Installation Process
• Adopt ease of build (self-build, config w/o OS loaded)
• 30 min – 1.5 hrs installation (retrofit)
• Takes almost the same time for disaster recovery
• Webmin for new services
Step1
Step2: create head image
Step3: clone image
Step4: config & Build Standby
Optional Step5: web admin to add/config more services
162
Installation Walkthrough 1/5
• Download HA-OSCAR: http://xcr.cenit.latech.edu/ha-oscar
• Extract the tar file
• Run ‘./haoscar_install eth0’ to launch the following screen
• It takes four simple steps to install HA-OSCAR
163
Installation Walkthrough 2/5
1. Installation of server packages to build an HA-OSCAR base.
2. The second step launches a Fetch Image wizard, by which the Primaryserver image is grabbed and stored on Primaryserver.
  — The user can accept the default values in this window.
  — Finally, the user clicks the Fetch Image button and the image is fetched.
164
Installation Walkthrough 3/5
3. The next step involves the configuration of the standby server.
  — The image name from the previous step (Serverimage) is selected to install on Standbyserver.
  — Standbyserver's local IP, public alias IP and gateway can be changed according to their network addresses.
  — After entering all the fields, click on the Add Standby Server button.
(screenshot example addresses: 10.0.0.3, 10.0.0.1)
165
Installation Walkthrough 4/5
4. The fourth step involves network setup (for PXE boot) to transfer the clone image on Primaryserver to the remote Standbyserver.
  — First click on Setup Network Boot (A).
  — Configure the Standbyserver boot sequence to network boot and reboot the Standbyserver.
  — Next, the Collect MAC Address (B) button is clicked to collect the MAC address of Standbyserver.
Note: For Build Autoinstall Floppy method refer to appendix 1
166
Installation Walkthrough 5/5
After the MAC address is collected, it is associated with the IP address (from the previous step) of Standbyserver by clicking on Assign MAC to Node (E).
Then the Configure DHCP Server (F) button is clicked to configure DHCP on the primary node to assign an IP address to Standbyserver. Setup Network Boot (G) is clicked and the Standbyserver is booted via PXE.
Once the Standbyserver is up, the last and final step, complete installation, is clicked, which finishes the HA-OSCAR setup.
167
Monitoring & Configuration with Webmin
• This procedure is optional; normally you don't need it, except for customization by advanced users.
• HA-OSCAR Webmin is used:
  • Default configuration should be sufficient for a standard Beowulf environment
  • Only for advanced users
  • Available at http://localhost:10000
  • For customized detection channel configuration
  • Add/edit services for monitoring
  • Customized resource management
168
HA-OSCAR Webmin 1/13
• User enters HA-OSCAR configuration monitoring screen by clicking HA-OSCAR icon
169
HA-OSCAR Webmin 2/13
• The ‘Detection channel configuration’ icon is used to set up and configure both detection channels (ethX) between the Primary server and the Standby server.
170
HA-OSCAR Webmin 3/13
• The ‘Add/Edit Network interface’ icon is used to add a customized detection channel for Primaryserver.
171
HA-OSCAR Webmin 4/13
• Clicking the ‘Add a new interface’ link launches a window from which the user can create a new interface.
172
HA-OSCAR Webmin 5/13
• Example of adding new private virtual interface (eth0:1)– Name of the interface– Virtual IP address– Activate at boot – Commit settings
click here
173
HA-OSCAR Webmin 6/13
• Similarly add customized public virtual interface.
174
HA-OSCAR Webmin 7/13
• The configuration window for the HA-OSCAR (heartbeat) detection channel on the Primary server is launched by clicking this icon
175
HA-OSCAR Webmin 8/13
• Primary server network and detection channel configuration
– Name of the public virtual IP– Public virtual IP address– Name of the private virtual IP– Private virtual IP address– Private IP address– Commit settings
176
HA-OSCAR Webmin 9/13
• The ‘HA-OSCAR Service Monitor’ icon is clicked to launch the HA-OSCAR policy configuration main window
177
HA-OSCAR Webmin 10/13
The ‘Monitoring Lists’ icon lists the monitored services.
178
HA-OSCAR Webmin 11/13
• ‘process_server’ watches that the critical services running on primary_server (itself) are up and running
• ‘loadaverage_server’ watches that the CPU load stays within the threshold level
• similarly, ‘freespace_server’ monitors available memory/swap space
• To add a new service
179
HA-OSCAR Webmin 12/13
• Adding new services
  – Name of the service
  – Monitoring time interval
  – Monitoring daemon
  – Default monitored services
  – Append to mon conf file
  – Duration in days
  – Event
  – Alert triggered
• This snapshot details the ‘process_server’ monitoring policy, with pbs_server, maui and http as the default monitored processes. The same window pops up without any populated data when the add_service link is clicked.
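As a rough illustration of what such a monitoring policy does (a Python sketch under assumptions, not HA-OSCAR's actual mon configuration; the pgrep-based check and the alert script path are illustrative), a poller could verify the default monitored processes and fire an alert script when one disappears:

# Minimal polling monitor in the spirit of 'process_server': check that the
# default monitored processes named above (pbs_server, maui, httpd) are alive
# and run an alert script when one is not. The pgrep-based check, the alert
# path and the loop structure are illustrative assumptions.
import subprocess, time

SERVICES = ["pbs_server", "maui", "httpd"]   # default monitored processes
INTERVAL = 2                                 # seconds; the tunable polling interval
ALERT_SCRIPT = "./service_down.alert"        # hypothetical alert script

def is_running(name: str) -> bool:
    """True if at least one process with this exact name exists."""
    return subprocess.run(["pgrep", "-x", name],
                          stdout=subprocess.DEVNULL).returncode == 0

if __name__ == "__main__":
    while True:
        for svc in SERVICES:
            if not is_running(svc):
                print(f"{svc} is down -- triggering alert")
                subprocess.run([ALERT_SCRIPT, svc], check=False)
        time.sleep(INTERVAL)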
180
HA-OSCAR Webmin 13/13
• The same procedure is followed on the Standby Server to add and configure customized ‘Detection channels’ and ‘services’
181
Experiments and Test Results
Experiments · Standbyserver Alert History · Primaryserver Alert History · HA-OSCAR Measurements
182
Experiment – HA-OSCAR Stack
• HA-OSCAR is successfully verified on a cluster system with OSCAR release 3.0 and RedHat 9.0
• Experimental cluster configuration:
  — Two dual-Xeon server head nodes: 1 GB RAM each, 40 GB hard drive with 2 GB of free space, two NIC cards
  — 16 Intel client nodes: 512 MB RAM, 40 GB hard drive, two NIC cards
  — Network switch
183
Standby Server Alert History
• Testing Failover
  — The private Ethernet cable (eth0) of the Primary server is unplugged
  — Log in to the Primary server via the public IP and run an MPI program
  — The client node gets 100% response from the failover Standby server
• Testing Failback
  — The private Ethernet cable (eth0) of the Primary server is plugged back in
  — The client node pings (eth0) the Primary server and gets 100% response
#  Group           Service  Type     Time                        Alert              Args
1  primary_server  ping     alert    Mon Sept 29 21:28:07 2003   server_down.alert  -
2  primary_server  ping     upalert  Mon Sept 29 21:36:21 2003   server_up.alert    -
Table shows an example of alert history log on Standbyserver
184
Primary Server Alert History
#  Group           Service  Type     Time                        Alert            Args
1  primary_server  ping     alert    Mon Sept 29 21:36:17 2003   self_down.alert  -
2  primary_server  ping     upalert  Mon Sept 29 21:36:26 2003   self_up.alert    -
3  service_mon     PBS      alert    Mon Sept 29 21:35:16 2003   PBS.alert        -
4  service_mon     PBS      upalert  Mon Sept 29 21:36:26 2003   mail.alert       -
Table demonstrates Primary server alert history
185
HA-OSCAR Measurements
• 3–5 sec manual failover time
• 2-second polling interval (tunable)
• 0.9% CPU usage at each monitoring interval
[Chart: HA-OSCAR network load in packets/min (0–300), measured by TCPtrace, versus Mon polling intervals of 1, 2, 5, 10, 15, 20, 30 and 60 s]
Comparison of network usage for different HA-OSCAR polling intervals
186
HA-OSCAR Availability Modeling: Present Availability Modeling and Analysis
Stochastic Petri Net modeling Comparison Results
187
Reality Checks
• Great! We got HA Beowulf!
• But how much improvement?
  – The total uptime?
  – Performance?
• Analytical model and prediction
  – How many 9's? (downtime per year)
  – Stochastic Reward Net using the SPNP package from Duke U.
188
HA-OSCAR SRN Model
Server sub-model
Switches
Compute nodes
189
Server Sub-Model
Places/transitions: P (primary) server up, P server down, Failover, P server repair, Failback
S (standby) is up and ready, S takes control, S server down, S repair
192
Instantaneous Availability
Steady-state availability A = 99.993% (≈36 min downtime/year) vs.
Beowulf A = 99.65% (≈30 hr downtime/year)
193
Availability Analysis
HA-OSCAR solution vs. traditional Beowulf: total availability impacted by service nodes
Model assumptions:
  - scheduled downtime = 200 hrs
  - nodal MTTR = 24 hrs
  - failover time = 10 s
  - during maintenance on the head, the standby node acts as primary
Total availability analysis of HA-OSCAR versus Beowulf architecture
194
Work in Progress
195
Different flavors of HA-OSCAR
Monitoring            HA-OSCAR Active-Hot Standby    HA-OSCAR 2+1 Multi-Active (lab grade)
Monitored services    pbs, maui, nfs, httpd          SGE, NFS, nis, httpd, gmond, gmetad
Heartbeat (3 sec)
CPU fan speed         IPMI option                    IPMI option
CPU temperature       IPMI option                    IPMI option
CPU status            IPMI option                    IPMI option
Memory bit error      IPMI option                    IPMI option
200
HA-OSCAR and GRID (Lab Grade)
Grid-enabled HA cluster · High Availability architecture stack for grid · HA-enabled grid architecture
201
Grid enabled HA Cluster
• Grid computing allows:
  • Various organizations to work together to achieve common goals and high performance operations
  • Provides local autonomy
  • Distributed computing
• Potential pitfall is a single point of failure at the head node
  • Site unavailability
  • Reduces number of resources available
202
HA Architecture Stack for Globus Grid
Operating System Applications
Cluster Software Stack
Grid Layer
HA-OSCAR Service and Job Level Monitoring
HA-OSCAR policy-based recovery mechanism
203
Critical Service Monitoring & Failover-Failback capability for site-manager
Site C
Site B
Site A
Standby HEAD Node
Primary Head Node
Service Nodes
GRID
HA-OSCAR
HA-OSCAR Modified Failover-Aware Client
HA-OSCAR
HA-OSCAR
Client
Client submits MPI job
Site-Manager
HA-OSCAR failover if critical services (Gatekeeper, GridFTP, PBS) die
Compute nodes
Stand-By
204
Basic Components for Smart Failover
HA-OSCAR Smart Failover Architecture
Job Queue Monitor
Scheduler-jobID-to-Globus-assigned-jobID mapper
Backup updater
Service Monitor
HW Health Monitor
Monitoring Core Daemon
Resource Monitor
Gatekeeper, GridFTP, PBS
Notify on critical event
Event Monitoring Core Daemon
Notify system events
Notify on job addition & completion
The mapping between the GjobID and the SjobID is the key information for transparent recovery
Event Generator
Scheduler
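The jobID bookkeeping at the heart of this can be sketched roughly as follows (Python; the file locations, the on-disk format, and the function names are illustrative assumptions, not HA-OSCAR code): record the Globus-assigned jobID (GjobID) to scheduler jobID (SjobID) mapping on every job addition, drop it on completion, and push each update to a location the standby can read.

# Sketch of the GjobID <-> SjobID bookkeeping used for transparent recovery.
# The on-disk format, the paths, and the function names are assumptions.
import json, shutil
from typing import Dict

MAP_FILE = "/var/lib/ha-oscar/jobid_map.json"     # assumed location on the primary
BACKUP_COPY = "/backup/ha-oscar/jobid_map.json"   # assumed path visible to the standby

mapping: Dict[str, str] = {}   # Globus-assigned jobID -> scheduler jobID

def on_job_added(globus_jobid: str, scheduler_jobid: str) -> None:
    """Called when the job queue monitor sees a new job."""
    mapping[globus_jobid] = scheduler_jobid
    _persist()

def on_job_completed(globus_jobid: str) -> None:
    """Completed jobs no longer need recovery information."""
    mapping.pop(globus_jobid, None)
    _persist()

def _persist() -> None:
    with open(MAP_FILE, "w") as f:
        json.dump(mapping, f)
    shutil.copy(MAP_FILE, BACKUP_COPY)            # backup updater keeps the standby current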
205
In-progress Work – Security
Integrating DSI components with HA-OSCAR
206
Security – Different types of attacks
• Denial of Service (DoS)
• Impersonation
• Exploits of misconfiguration
• Exploits of implementation flaws
• Data-driven attacks
• Network infrastructure attacks
207
Goals
• Design an architecture as a platform to support different security mechanisms for carrier class Internet servers running on a clustered system.
• Requirements:
  – Scalable, flexible, no single point of failure, no performance bottleneck, supporting run-time changes in security context and policy
• The platform must provide mechanisms to protect the system against:
  – External attacks: originating from the Internet
  – Internal attacks:
    • Break-through of a node in the cluster
    • O&M security
    • Attacks originating from the Intranet
208
Functionality
• Access control– Finer grained, wide range of operations
• Authentication– Between cluster nodes, and processes
• Confidentiality and integrity for communications– Securing distributed IPCs
• Auditing– Collection and analysis of alarms and warnings through
O&M
209
DSI Overview
A primary Security Server (SS)
Multiple Security Managers (SM) (one per node)
SS and SMs communicate through an encrypted and authenticated channel
Security policy is enforced at kernel level
[Diagram: a Primary Security Server node (with a secondary), plus Nodes 1–3; each node runs a Security Manager (SM) and a kernel-level Distributed Security Module (DSM) guarding its processes (e.g. Proc123, Proc978, Proc222); a Security Broker carries authenticated, encrypted communications between the SS and the SMs; data traffic stays inside the cluster, while security and O&M/IDS traffic comes from outside the cluster]
Legend: SS = Security Server, SM = Security Manager, DSM = Distributed Security Module
210
DSI and HA-OSCAR
• One of the goals for 2005 is to integrate DSI components into HA-OSCAR to provide advanced security features.
211
Distributed Security Infrastructure (DSI)
• Developed for Cluster Environment
  – Fine grained, Flexible, Adaptable
• High level of abstraction for access control
  – Separating administrative, network, computation into different security zones
• Process level + User level
  – Kernel level module (DSM)
  – Real-time checks based on the LSM hooks / SELinux
212
DSI – components (1/2)
• Security Server
  – Central point of management, policy holder
• Security Manager
  – Node-based enforcement of policy
• Secure Communication Channel
  – Encrypted, authenticated, SS <-> SM
• Porting from DSM + LSM -> SELinux
  – 5 major classes
  – DSP -> SELinux TE
  – Process and network mapping are done
• HA-OSCAR is ready to integrate
• Design and develop security provisioning – tricky
216
Fault Tolerant Scheduling for Computational Grid/Cluster Environments
217
Objectives
• To provide fault tolerance for jobs at cluster level.
• Retain the job run sequence in case of failure
218
Architecture
[Diagram: Fault-tolerant scheduling architecture. On the HA-OSCAR Primary Server, job submissions enter the PBS Job Queue; a FAM job-event monitor gets queue events and updates the HA-OSCAR Backup Server's PBS Job Queue on job ADD/COMPLETE. Each job's lifecycle (Prologue → Job Run → Epilogue, i.e. job initialization through job cleanup) checkpoints the job and creates/updates a restart job spec file (JobID.spec) in the Job Spec Directory; the spec file is removed after completion. The backup is updated with checkpoint files and user home directories, and the HA-OSCAR monitoring daemon monitors the Primary Server.]
219
Demo Steps
• Submit MPI jobs through Torque
• View job queue status on the primary
• Simulate an outage by bringing the network down
• View the job queue status on the standby
220
Final Thoughts
• It took us a lot of work to arrive at our current results.
• HA-OSCAR is an Open Source project – open for contributions from all
• Several HA-OSCAR enabled works toward mission critical HPC clusters
• Please give us your feedback on how we can improve HA-OSCAR and make it your preferred open source HA clustering stack
• Participation is open!
221
Thank you. Feedback? Questions?
This slide show is available for download from: http://xcr.cenit.latech.edu/ha-oscar
222
Resources
• HA-OSCAR: xcr.cenit.latech.edu/ha-oscar
• Open Cluster Group: OpenClusterGroup.org
• OSCAR: oscar.OpenClusterGroup.org
• OSCAR Development: sourceforge.net/projects/oscar