Towards Highly Available, Scalable, and Secure HPC Clusters with HA-OSCAR
Dr. Chokchai Box Leangsuksun, Louisiana Tech University, [email protected]
Ibrahim Haddad, OSDL, [email protected]
Dr. Stephen L. Scott, Oak Ridge National Laboratory, [email protected]
IEEE CLUSTER CONFERENCE, September 26, 2005 -- Boston, MA, USA
1
Towards Highly Available, Scalable, and Secure HPC Clusters with HA-OSCAR
Dr. Chokchai Box Leangsuksun, Louisiana Tech University
distribution: Dynamic mechanisms that detect and react to
the unavailability, addition and removal of components in the
system
• Availability: provide HA services to its end users
• Operation and maintenance: Performed remotely without
affecting the system performance and availability
• Fast response time: Minimize serialized executions
10
General HA Clustering Wish-list (2/2)
• Geographical Diversity
– Spread across several Points of Presence
– Support geographic mirroring
• Provide a single cluster IP Interface: Clients access the server
application through a single IP address
• Security: High security requirements depending on
deployment scenarios (open vs. closed networks)
11
2 Main Challenges
• System Availability
• System Capacity
12
System Availability – What does it mean?
• Availability is defined as:
  A = MTBF / (MTBF + MTTR)
  MTBF: Mean Time Between Failures
  MTTR: Mean Time To Repair
• Example:
  If a system offered an MTBF of 20,000 hours with an MTTR of 2 hours, then its availability would be 20,000 / 20,002 ≈ 99.99%, “4-nines.”
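To make the arithmetic concrete, here is a minimal sketch (Python) that evaluates the formula with the example values above and converts the result into downtime per year:

# Availability from MTBF/MTTR, plus the equivalent downtime per year.
# Values are the example above: MTBF = 20,000 h, MTTR = 2 h.

MINUTES_PER_YEAR = 365 * 24 * 60

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_minutes_per_year(a: float) -> float:
    """Unavailability expressed as minutes of downtime per year."""
    return (1.0 - a) * MINUTES_PER_YEAR

a = availability(20_000, 2)
print(f"Availability:  {a * 100:.3f}%")                         # ~99.990% ("4-nines")
print(f"Downtime/year: {downtime_minutes_per_year(a):.0f} min") # ~53 minutes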
13
Means to achieve higher availability
• Increase MTBF
  – Improve “quality” or “robustness”
  – Use redundancy / remove single points of failure
• Decrease MTTR
  – Streamline and accelerate fail-over
    • Optimize boot / reboot time
    • Respond to fault conditions in real time
  – Make faults more granular in time and scope
    • Better to have many short faults than a few long ones
    • Limit the scope of faults to smaller s/w and h/w components
14
Degrees of HA
Source: Gartner
15
System Availability Redundancy Models
[Chart: availability in number of nines (3–6) versus system redundancy model (1+0, 1+1, N+M)]
16
Beowulf Cluster
17
Beowulf Cluster
• Beowulf is one approach to clustering COTS components to form a supercomputer
• A Beowulf cluster is a collection of COTS computers networked together to harvest high performance computing
• A typical Beowulf cluster has:
  – a single head node
  – multiple identical client nodes
18
Beowulf Cluster Architecture
Head Node: Entry point to the cluster. Responsible for serving user requests. Distributes jobs to compute clients via scheduling and queuing software.
Compute Clients: Dedicated for computation.
Communication: Using an Ethernet network and/or fast interconnects: Myrinet, InfiniBand, etc.
Head Node
Compute Nodes
Communication
End Users
19
Beowulf Cluster – Advantages
1. COTS HW and SW components
2. Toward High Performance Computation (HPC)
3. Allows flexible configurations
4. Heterogeneous environment
5. Scalability (add more nodes)
6. Alternative choice for monolithic supercomputers
7. Good price/performance
20
Beowulf Cluster – Issues
Head Node
Compute Nodes
Communication
End Users
• Single head node architecture – vulnerable to a SPOF
• Single communication path architecture – vulnerable to a SPOF
• Compute nodes are not accessible after either failure occurs, or when a cluster-service or OS upgrade takes place.
21
HA & Linux Clusters
22
Providing High Availability
• One technique of providing HA is by distributing
functionality across multiple nodes
• In response to HW and SW failures, HA systems
facilitate the rapid transfer of control from a faulty
CPU, peripheral, or software component to a
functional one, while preserving operations or
transactions in-progress at the time of failure.
23
HA Supporting mechanisms
• HA systems must support mechanisms for:
  – Error Detection
  – Damage Containment
  – Error Recovery
  – Fault Treatment (incl. dynamic reconfiguration)
• Box will discuss how HA-OSCAR supports these mechanisms
• Assumption: we are dealing with systems comprising clusters of processors which share nothing.
24
Redundancy in HA Systems
• Redundancy of key subsystems is important– Redundant Ethernet to ensure constant networking
connections
– Redundant power supplies
– Redundant disks
– etc.
25
Other means to support HA
• Disk mirroring to ensure high levels of data reliability
• Hot swap (hot insert, hot remove, identity maintenance)
• Options for booting compressed and remotely hosted kernel images
• Support of compressed r/w and read-only Flash file systems
• Accelerated boot and daemon start times
• Fast shutdown / reboot
• Eliminating costly file system operations with journaling file systems
• Etc.
26
Uptime
• Example from the telecom industry:
The main operator requirement is:
No more than 30 seconds of service interruption per year (roughly 99.9999% availability)
– Applies to the overall solution: hardware, software (OS and middleware), and the applications.
– Includes software and hardware upgrade and maintenance.
27
High Availability of Cluster Hardware
• Hardware availability is very important
• In some cases, the platform may be available but not the application
– Software has bugs that may cause applications to crash.
– Keeping redundancy in applications and maintaining processes state is complex.
• In telecom, the required uptime includes both platform and applications uptime
– End-users don’t care about running platforms when the required application is unavailable
28
Cluster Concurrent Maintenance
• Allow (un)scheduled maintenance to be performed on a node of a cluster while other nodes continue to provide service without noticeable degradation
• Allow it to be done remotely
29
Failover
• Failover is the ability to detect problems in a node and to accommodate ongoing processing by routing applications to other nodes.
– This process may be programmed or scripted so that steps are taken automatically without operator/admin intervention
– Fundamental to failover is communication among nodes signaling that they are functioning correctly and reporting problems when they occur
30
Characteristics of Failover
• Transparency
– Applications and users are automatically and transparently reconnected to another node/system
• Performance
– Depends on hardware configuration, instance recovery time and workload at time of failover
• Robustness
– The cluster should be able to survive multiple failures and still provide mission critical applications
31
Linear Scalability
• We want clusters to support linear growth
– New processors can be added without disturbance
• Capacity grows linearly as processors are added
– Modular addition of HW and SW components happens
• If we double the number of processors, we should expect to almost double the throughput of the system.
[Chart: throughput grows with the number of processors (2, 4, 6, 8, 10, 12, 14, 16, 18, …)]
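As an illustration only (the throughput numbers below are hypothetical, not measurements from this tutorial), scaling efficiency can be checked by comparing measured throughput against the ideal linear projection:

# Scaling efficiency: measured throughput at n nodes divided by the ideal
# linear projection n * (throughput at 1 node); 1.0 means perfectly linear.
# The throughput numbers below are made up purely for illustration.

measured = {1: 100.0, 2: 195.0, 4: 380.0, 8: 720.0}   # jobs/hour (hypothetical)

base = measured[1]
for n, throughput in sorted(measured.items()):
    ideal = n * base
    print(f"{n:2d} nodes: {throughput:6.1f} jobs/h "
          f"(ideal {ideal:6.1f}, efficiency {throughput / ideal:.0%})")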
32
Highly Available Storage
• Storage is a critical and necessary part of a HA cluster
• Data should be available to users even when a storage node fails or when errors occur with the distributed file system
• One popular technique is providing RAID support
– Other: NAS, SAN, etc.
33
Online OS and Application Upgrades (required for telecom-grade clusters)
• A requirement in mission critical environments
– Telecom, defense.
• When upgrading software (applications), old and new version of same process can coexist
– provide mechanisms to upgrade a running application
– The system will deal with the old and new running versions of the application simultaneously
• Applications will transfer state between the old and new process instances
34
Manageability
• Single point of control
– Applications
– Software
– Hardware
– Data (Data movement, Security, Backup, Recovery, etc)
• Online configurability to reduce downtime
• Capacity Control
  – Overload protection by selectively rejecting jobs/requests when a threshold is reached
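A minimal sketch of that overload-protection idea (the threshold value and the pending-job-count metric are assumptions for illustration):

# Capacity-control sketch: selectively reject incoming jobs/requests once a
# load threshold is reached. The threshold and the pending-job-count metric
# are illustrative assumptions.
from collections import deque

MAX_PENDING_JOBS = 100     # assumed overload threshold

pending = deque()

def submit(job: str) -> bool:
    """Accept the job if below the threshold, otherwise reject it."""
    if len(pending) >= MAX_PENDING_JOBS:
        print(f"rejected {job}: {len(pending)} jobs already pending")
        return False
    pending.append(job)
    return True

for i in range(105):
    submit(f"job-{i}")     # jobs 100..104 are rejected once the threshold is hit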
35
Heartbeat Mechanisms
• Linux-HA project
http://www.linux-ha.org
Release 2 is out.
Linux-HA BOF is on Friday
[Diagram: Master Node 1 and Master Node 2 exchange heartbeat messages over two redundant paths (1 and 2) through a router]
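The basic mechanism can be sketched as a toy example (Python UDP heartbeat; this is not the Linux-HA implementation, and the port, interval, and deadline values are placeholders): one node periodically announces that it is alive, and the peer initiates failover when announcements stop arriving.

# Toy heartbeat: one side periodically sends "alive" datagrams, the other
# declares the peer failed when nothing arrives within a deadline.
# Port, interval and deadline are illustrative values only.
import socket, sys, time

PORT, INTERVAL, DEADLINE = 6940, 1.0, 3.0

def send(peer: str) -> None:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        s.sendto(b"heartbeat", (peer, PORT))
        time.sleep(INTERVAL)

def watch() -> None:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))
    s.settimeout(DEADLINE)
    while True:
        try:
            s.recvfrom(64)                 # heartbeat received: peer is alive
        except socket.timeout:
            print("missed heartbeat deadline -- initiate failover")
            return                         # a real system would start takeover here

if __name__ == "__main__":
    if len(sys.argv) > 2 and sys.argv[1] == "send":
        send(sys.argv[2])                  # e.g. python heartbeat.py send 10.0.0.2
    else:
        watch()                            # e.g. python heartbeat.py watch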
36
Non-Stop Operations – Summary (required for telecom-grade clusters)
• No single point of failure
• No scheduled downtime
• In-service upgrade of software and hardware with
no disturbance to operation
• HA failover software
• Software Configuration Control
– Automatic restart, on the working processors, of processes that originally executed on a faulty processor
37
Fast Recovery of Applications
• Maximizing availability of applications and services is a priority
• If an application dies for some reason, it is very important to restart it ASAP
• Provide automatic failover and recovery capabilities with very minimum interruptions to the users.
• Monitoring mechanisms to detect failures and trigger actions
The active Ethernet adapter provides the connection to the network. The standby Ethernet adapter is hidden from applications and is known only to the Ethernet redundancy daemon.
Active Standby Active Standby
The active Ethernet adapter has failed and the Ethernet redundancy daemon designates the former standby Ethernet adapter as the new active adapter.
Before network adapter swap | After network adapter swap
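A rough sketch of what such an Ethernet redundancy daemon does (illustrative only; the interface names, the service address, and the use of iproute2 are assumptions, and a production daemon is considerably more involved): watch the active adapter's link state and move the cluster address to the standby adapter when the link drops.

# Sketch of an Ethernet redundancy daemon: poll the active NIC's link state
# and, on failure, move the service address to the standby NIC with iproute2.
# Interface names and the address below are placeholders.
import subprocess, time

ACTIVE, STANDBY = "eth0", "eth1"          # placeholder interface names
SERVICE_ADDR = "192.168.1.10/24"          # placeholder cluster/service address

def link_up(iface: str) -> bool:
    with open(f"/sys/class/net/{iface}/operstate") as f:
        return f.read().strip() == "up"

def swap_to_standby() -> None:
    subprocess.run(["ip", "addr", "del", SERVICE_ADDR, "dev", ACTIVE], check=False)
    subprocess.run(["ip", "link", "set", STANDBY, "up"], check=True)
    subprocess.run(["ip", "addr", "add", SERVICE_ADDR, "dev", STANDBY], check=True)
    print(f"{STANDBY} is now the active adapter")

if __name__ == "__main__":
    while link_up(ACTIVE):                # poll the active adapter's link state
        time.sleep(1)
    swap_to_standby()                     # link lost: promote the standby adapter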
41
1+1 active/standby
Public Network
Active Master Node
Standby Master Node
Heartbeat Messages
Shared Network Storage
Dual Redundant Data Paths
Physically connected but not logically in use
Clients
42
1+1 active/standby …
Public Network
Active Master Node
Standby Master Node
Heartbeat Messages
Shared Network Storage
Failed Node
Now Active Master Node
Physically connected and in use
Physically connected but not logically in use due to the failure of the master node
Clients
43
1+1 active/active …
Public Network
Active Master Node 1
Active Master Node 2
Heartbeat Messages
Shared Network Storage
Physically connected and in use, providing a redundant data path for master node 2
Clients
Physically connected and in use, providing a redundant data path for master node 1
44
N+M
[Diagram: N+M redundancy – Nodes A–D run the active processes, Standby Nodes 1 and 2 hold standby processes, and process state is replicated through HA shared storage (steps 1–3)]
45
N-way
Node A  Node B  Node C  Node D
HA Shared Storage
(1) State information is written to shared storage
(2) State information is available on shared storage
(3) Traffic nodes have access to the state information
46
HA Clusters: Nodes Topology & Redundancy Models
HA Tier – Master Nodes: 1+1 redundancy model, Active/Hot-standby or Active/Active
Service Scalability Tier – N traffic nodes, N ≥ 2; redundancy is at the node level; redundancy models: N active, or N active and M standby (N ≥ 2, M ≥ 0)
Storage Tier – Redundant specialized storage nodes (N = 2), or redundant storage through the implementation of an HA NFS server (modified NFS implementation)
N active | N active / M standby
47
Challenges
48
Challenges
• How to automatically build and boot the nodes?– Installation infrastructure
• Which (HA?) distributed file systems to use in the cluster?– File Systems
• What types of traffic distribution and load balancing mechanisms to use?
• How to build redundancy, and to what extent? – Redundancy at the Network, File System, Disk and CPU levels
• How to manage the cluster, remotely? – System management
• How to add/remove nodes without affecting the operations?– Dynamic reconfiguration
• How to achieve linear scalability?– Scalability
• How to secure a cluster running on open networks? – Security
49
Conclusion to the intro!
• From the start, the design of the software architecture of the application should take into account:
– Scalability
– Failure Handling
– Error (software bug) handling
– Future modification
– Hot software upgrade
• A complete HA solution requires close integration of:
  – HA hardware,
  – HA software solution,
  – HA middleware, and
  – Application software that can cause failover to redundant components
• Framework for cluster installation configuration and management
• Commonly used cluster tools
• Wizard-based cluster software installation
  – Operating system
  – Cluster environment
• Administration
• Operation
• Automatically configures cluster components
• Increases consistency among cluster builds
• Reduces time to build / install a cluster
• Reduces need for expertise
Open Source Cluster Application Resources
[Diagram: OSCAR install wizard flow – Step 1 (Start…) through Step 8 (Done!)]
54
OSCAR - the beginning
55
• Extreme Linux• May 13, 1998• $29.95 CD
First cluster “distro”
Oak Ridge National Laboratory -- U.S. Department of Energy
56
OSCAR Background
• Concept first discussed in January 2000
• First organizational meeting in April 2000– Cluster assembly is time consuming & repetitive
– Nice to offer a toolkit to automate
• First public release in April 2001
• Use “best practices” for HPC clusters– Leverage wealth of open source components
• Form umbrella organization to oversee cluster efforts– Open Cluster Group (OCG)
57
Open Cluster Group
• Informal group formed to make cluster computing more practical for HPC research and development
• Membership is open, directed by a steering committee – Research/Academic – Industry
• Current active working groups– OSCAR (core group)– Thin-OSCAR (Diskless Beowulf)– HA-OSCAR (High Availability)– SSS-OSCAR (Scalable Systems Software)– SSI-OSCAR (Single System Image)– BIO-OSCAR (Bioinformatics cluster system)
58
OSCAR Core Participants
• Dell• IBM• Intel• Bald Guy Software• RevolutionLinux• INRIA• EDF• Canada’s Michael Smith Genome Sciences
Center
• Indiana University• NCSA• Oak Ridge National Laboratory• Université de Sherbrooke• Louisiana Tech Univ.• NEC Europe• Air Force Research Lab (USA)
November 2004
59
Offer Variety of Flavors
HA-OSCAR, Thin-OSCAR, SSS-OSCAR, SSI-OSCAR,
SSS-OSCAR
• OSCAR is a snap-shot of best-known-methods for building, programming and using clusters of a “reasonable” size.
• To bring uniformity to clusters, foster commercial versions of OSCAR, and make clusters more broadly acceptable.
• Consortium of research, academic & industry members cooperating in the spirit of open source.
• HPC Services/Tools– Parallel Libs: MPICH, LAM/MPI, PVM– Torque, Maui, OpenPBS– HDF5– Ganglia, Clumon, … [monitoring systems]– Other 3rd party OSCAR Packages
• Core Infrastructure/Management– System Installation Suite (SIS), Cluster Command & Control (C3), Env-Switcher, – OSCAR DAtabase (ODA), OSCAR Package Downloader (OPD)
62
System Installation Suite (SIS)
Enhanced suite to the SystemImager tool.
Adds SystemInstaller and SystemConfigurator
• SystemInstaller – interface to installation, includes a stand-alone GUI – Tksis. Allows for description based image creation.
• SystemImager – base tool used to construct & distribute machine images.
• SystemConfigurator – extension that allows for on-the-fly style configurations once the install reaches the node, e.g. ‘/etc/modules.conf’.
63
System Installation Suite (SIS)
• Used in OSCAR to install nodes– partitions disks, formats disks and installs nodes
• Construct “image” of compute node on headnode– Directory structure of what the node will contain– This is a “virtual”, chroot–able environment
/var/lib/systemimager/images/oscarimage/ (containing etc/, …, usr/)
• Use rsync to copy only differences in files, so can be used for cluster management – maintain image and sync nodes to image
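As a sketch of what “sync nodes to image” can look like (the image path follows the example above; the node names, the use of ssh, and the exact rsync options are illustrative assumptions rather than OSCAR's own tooling):

# Sketch: push only the differences between the stored image and each compute
# node, using rsync over ssh. The image path matches the example above; the
# node names are placeholders, and a real deployment would exclude more
# virtual filesystems than shown here.
import subprocess

IMAGE = "/var/lib/systemimager/images/oscarimage/"   # image kept on the head node
NODES = ["oscarnode001", "oscarnode002"]             # placeholder node names

for node in NODES:
    # -a preserves attributes, --delete removes files not present in the image,
    # and the trailing "/" on IMAGE syncs the image contents into the node root.
    subprocess.run(
        ["rsync", "-a", "--delete",
         "--exclude=/proc", "--exclude=/sys", "--exclude=/dev",
         "-e", "ssh", IMAGE, f"root@{node}:/"],
        check=True,
    )
    print(f"synced {node} against the image")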
64
Switcher
• Switcher provides a clean interface to edit environment without directly tweaking .dot files.– e.g. PATH, MANPATH, path for ‘mpicc’, etc.
• Edit/Set at both system and user level.
• Leverages existing Modules system
• Changes are made to future shells– To help prevent simple operator errors while making shell edits– Modules already offers facility for current shell manipulation, but no
persistent changes.
65
OSCAR DAtabase (ODA)
• Used to store OSCAR cluster data
• Currently uses MySQL as DB engine
• User and program friendly interface for database access
• Capability to extend database commands as necessary.
66
OSCAR Package Downloader (OPD)
Tool to download and extract OSCAR Packages.
• Can be used for timely package updates
• Packages that are not included, i.e. “3rd Party”
• Distribute packages with licensing constraints.
67
C3 Power Tools
• Command-line interface for cluster system administration and parallel user tools.
• Parallel execution cexec – Execute across a single cluster or multiple clusters at same time
• Scatter/gather operations cpush/cget – Distribute or fetch files for all node(s)/cluster(s)
• Used throughout OSCAR and as underlying mechanism for tools like OPIUM’s useradd enhancements.
68
C3 Building Blocks
• System administration
  • cpushimage – “push” an image across the cluster
  • cshutdown – remote shutdown to reboot or halt the cluster
• User & system tools
  • cpush – push a single file to a directory on each node
  • crm – delete a single file or directory on each node
  • cget – retrieve files from each node
  • ckill – kill a process on each node
  • cexec – execute an arbitrary command on each node
  • cexecs – serial mode, useful for debugging
  • clist – list each cluster available and its type
  • cname – returns a node name from a given node position
  • cnum – returns a node position from a given node name
69
C3 Power Tools
Example to run hostname on all nodes of default cluster:$ cexec hostname
Example to push an RPM to /tmp on the first 3 nodes$ cpush :1-3 helloworld-1.0.i386.rpm /tmp
Example to get a file from node1 and nodes 3-6$ cget :1,3-6 /tmp/results.dat /tmp
* Can leave off the destination with cget and will use the same location as source.
70
Current OSCAR Release Notes (v4.1)
• Supported Distros:
– Red Hat 9
– Red Hat Enterprise Linux (RHEL) 3
• Supports both x86 and Itanium systems
– Fedora Core 2 support
– Mandrake 10.0 (experimental)
• Torque is included as the default scheduler (OpenPBS can still be downloaded from OPD)
• DepMan / PackMan
– Resolves dependencies during “build node image”
– Used in install/uninstall packages
• APITest now part of OSCAR testing framework
• Versions of key software components:
– Ganglia 2.5.6-1B
– LAM-MPI 7.0.6-1
– MPICH 1.2.5
– Torque (PBS Replacement) 1.0.1
– MAUI 3.2.5
– SIS 3.3.2
71
OSCAR Installation
72
Server Installation and Configuration
• Install Linux on server machine (cluster head node)– workstation install w/ software development tools– 50+ page installation document!
• (quick install available)
• Download a copy of OSCAR and unpack it on the server
• Configure and install OSCAR on the server
– readies the wizard install process
• Configure server Ethernet adapters– public– private
• Launch OSCAR Installer (wizard)
73
OSCAR Wizard
Demo install
Demo add/delete node
Demo add/delete package
version 4.0
74
OSCAR Wizard
75
Step 0
Enables you to download additional packages
OPD – OSCAR Package Downloader performs the download
OPDer – GUI front end to OPD
76
OPDer
clumon and PVFS selected for download
77
Alternate repositories, possibly a local machine
OPDer (2)
78
Create your own flavor of cluster distribution
Select OSCAR packages to install.
Step 1
79
Core packages are automatically selected for you and cannot be unselected
Download does not equal installation!
Packages downloaded with OPDer are selected for installation here
Package Selector
80
Configure OSCAR packages that require special configuration tasks
Step 2
81
Environment Switcher performs the configuration for default MPI use
make selection
Package configuration
82
Install OSCAR Server (cluster head node) specific packages on cluster head node
May take a few minutes
Wait for button…
Step 3
83
success
Install server packages
84
Specify and build system image for client (compute) nodes
Step 4
85
name your image
list of packages
package file location
disk partition file location
static or dynamic
halt, reboot, beep
enable multicast
Build image configure
86
showing progress
Building image
87
success
Building image finished
88
Define client nodes
Step 5
89
specify image name (from step 4 – or other saved image)
client IP domain name
client base name (oscarnodeXXX)
node count
starting index to append to base
padding to client names (3 = oscarnode009)
starting IP address
Subnet Mask
Default Gateway
Define client nodes
90
success
Define client nodes
91
in one operation – setup networking for all cluster client nodes
for the first time in the installation process we will “touch” the client nodes
Step 6
92
machines named as specified in prior step 5
IP address as specified in prior step 5
Scan network for MACs or import from file
Setup network – initial window
93
found MAC addresses will show here for network setup
stop collecting when done
Setup network - scanning network
94
found and assigned all MAC addresses
Setup network – initial window
95
reboot on their own – “post install action” from step 4
or
manually reboot
Reboot Clients
96
only after ALL clients have rebooted
runs “post install” scripts for packages that have them
cleanup and reinitialize where needed
Step 7
97
success
Complete setup
98
test suite provided to ensure that key cluster components are functioning properly
Step 8
99
All Passed!!!
Test cluster setup
100
Your OSCAR cluster is now ready to use
Quit OSCAR Wizard
101
Demo install
Demo add/delete node
Demo add/delete package
version 4.0
OSCAR Wizard
102
OSCAR Cluster Maintenance
Add Compute Nodes
103
increase the number of compute nodes in the cluster
Add OSCAR Clients
104
Operates in similar manner to steps 5, 6, and 7 in OSCAR installation
Behind-the-scenes action differs somewhat…
compare to the standard install process: step 5, step 6, step 7
Add OSCAR Clients
105
Delete Compute Nodes
OSCAR Cluster Maintenance
106
decrease the number of compute nodes in the cluster
Delete OSCAR Clients
107
ready to select client(s) to delete
Delete OSCAR Clients
108
client selected to delete
Delete OSCAR Clients
109
success
Delete OSCAR Clients
110
Demo install
Demo add/delete node
Demo add/delete package
version 4.0
OSCAR Wizard
111
Install / Uninstall Packages
OSCAR Cluster Maintenance
112
select to install or uninstall an OSCAR package
Install/UninstallOSCAR packages
113
Install/UninstallOSCAR packages
114
Open Cluster Group: www.OpenClusterGroup.org/
OSCAR Home Page: oscar.OpenClusterGroup.org/
OSCAR Development site: sourceforge.net/projects/oscar/
OSCAR Research supported by the Mathematics, Information and Computational Sciences Office, Office of Advanced Scientific Computing Research, Office of Science, U. S. Department of Energy, under contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.
More OSCAR Information
115
Agenda
1:00 – 1:05 Introduction Box
1:05 – 1:50 Clustering and HA Ibrahim
1:50 – 2:30 OSCAR Stephen
2:30 – 3:00 Break, Q&A
3:00 – 4:30 HA-OSCAR & Demo Box
116
HA-OSCAR: High Availability - Open Source Cluster Application Resource
• Installation requirements
  • Red Hat 9.0 Linux distribution
  • OSCAR 3.0
• HA-OSCAR 1.1 release
  • Supports OSCAR 4.1
161
Installation Process
• Adopt ease of build (self-build, config w/o OS loaded)
• 30 min – 1.5 hrs installation (retrofit)
• Takes almost the same time for disaster recovery
• Webmin for new services
Step1
Step2: create head image
Step3: clone image
Step4: config & Build Standby
Optional Step5: web admin to add/config more services
162
Installation Walkthrough 1/5
• Download HA-OSCAR: http://xcr.cenit.latech.edu/ha-oscar
• Extract the tar file
• Run ‘./haoscar_install eth0’ to launch the following screen
• It takes four simple steps to install HA-OSCAR
163
Installation Walkthrough 2/5
1. Installation of server packages to build an HA-OSCAR base.
2. The second step launches a Fetch Image wizard, by which the Primaryserver image is grabbed and stored on Primaryserver.
  — The user can accept the default values in this window.
  — Finally, the user clicks the Fetch Image button and the image is fetched.
164
Installation Walkthrough 3/5
3. The next step involves the configuration of the standby server.
  — The image name from the previous step (Serverimage) is selected to install on Standbyserver.
  — Standbyserver's local IP, public alias IP and gateway can be changed according to their network addresses.
  — After entering all the fields, click on the Add Standby Server button.
(screenshot example addresses: 10.0.0.3, 10.0.0.1)
165
Installation Walkthrough 4/5
4. The fourth step involves network setup (for PXE boot) to transfer the clone image on Primaryserver to the remote Standbyserver.
  — First click on Setup Network Boot (A).
  — Configure the Standbyserver boot sequence to network boot and reboot the Standbyserver.
  — Next, the Collect MAC Address (B) button is clicked to collect the MAC address of Standbyserver.
Note: For Build Autoinstall Floppy method refer to appendix 1
166
Installation Walkthrough 5/5
After the MAC address is collected, it is associated with the IP address (from the previous step) of Standbyserver by clicking on Assign MAC to Node (E).
Then the Configure DHCP Server (F) button is clicked to configure DHCP on the primary node to assign an IP address to Standbyserver. Setup Network Boot (G) is clicked and the Standbyserver is booted via PXE.
Once the Standbyserver is up, the last and final step, complete installation, is clicked, which finishes the HA-OSCAR setup.
167
Monitoring & Configuration with Webmin
• This procedure is optional; normally you don't need it, except for customization by advanced users.
• HA-OSCAR Webmin is used:
  • Default configuration should be sufficient for a standard Beowulf environment
  • Only for advanced users
  • Available at http://localhost:10000
  • For customized detection channel configuration
  • Add/edit services for monitoring
  • Customized resource management
168
HA-OSCAR Webmin 1/13
• User enters HA-OSCAR configuration monitoring screen by clicking HA-OSCAR icon
169
HA-OSCAR Webmin 2/13
• The ‘Detection channel configuration’ icon is used to set up and configure both detection channels (ethX) between the Primary server and the Standby server.
170
HA-OSCAR Webmin 3/13
• The ‘Add/Edit Network interface’ icon is used to add a customized detection channel for Primaryserver.
171
HA-OSCAR Webmin 4/13
• Clicking the ‘Add a new interface’ link launches a window from which the user can create a new interface.
172
HA-OSCAR Webmin 5/13
• Example of adding new private virtual interface (eth0:1)– Name of the interface– Virtual IP address– Activate at boot – Commit settings
click here
173
HA-OSCAR Webmin 6/13
• Similarly add customized public virtual interface.
174
HA-OSCAR Webmin 7/13
• The configuration window for the HA-OSCAR (heartbeat) detection channel on the Primary server is launched by clicking this icon
175
HA-OSCAR Webmin 8/13
• Primary server network and detection channel configuration
– Name of the public virtual IP– Public virtual IP address– Name of the private virtual IP– Private virtual IP address– Private IP address– Commit settings
176
HA-OSCAR Webmin 9/13
• The ‘HA-OSCAR Service Monitor’ icon is clicked to launch the HA-OSCAR policy configuration main window
177
HA-OSCAR Webmin 10/13
The ‘Monitoring Lists’ icon lists the monitored services.
178
HA-OSCAR Webmin 11/13
• ‘process_server’ watches that the critical services running on primary_server (itself) are up and running
• ‘loadaverage_server’ watches that the CPU load stays within the threshold level
• similarly, ‘freespace_server’ monitors available memory/swap space
• To add a new service
179
HA-OSCAR Webmin 12/13
• Adding new services
  – Name of the service
  – Monitoring time interval
  – Monitoring daemon
  – Default monitored services
  – Append to mon conf file
  – Duration in days
  – Event
  – Alert triggered
• This snapshot details the ‘process_server’ monitoring policy, with pbs_server, maui and http as the default monitored processes. The same window pops up without any populated data when the add_service link is clicked.
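As a rough illustration of what such a monitoring policy does (a Python sketch under assumptions, not HA-OSCAR's actual mon configuration; the pgrep-based check and the alert script path are illustrative), a poller could verify the default monitored processes and fire an alert script when one disappears:

# Minimal polling monitor in the spirit of 'process_server': check that the
# default monitored processes named above (pbs_server, maui, httpd) are alive
# and run an alert script when one is not. The pgrep-based check, the alert
# path and the loop structure are illustrative assumptions.
import subprocess, time

SERVICES = ["pbs_server", "maui", "httpd"]   # default monitored processes
INTERVAL = 2                                 # seconds; the tunable polling interval
ALERT_SCRIPT = "./service_down.alert"        # hypothetical alert script

def is_running(name: str) -> bool:
    """True if at least one process with this exact name exists."""
    return subprocess.run(["pgrep", "-x", name],
                          stdout=subprocess.DEVNULL).returncode == 0

if __name__ == "__main__":
    while True:
        for svc in SERVICES:
            if not is_running(svc):
                print(f"{svc} is down -- triggering alert")
                subprocess.run([ALERT_SCRIPT, svc], check=False)
        time.sleep(INTERVAL)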
180
HA-OSCAR Webmin 13/13
• The same procedure is followed on the Standby Server to add and configure customized ‘Detection channels’ and ‘services’
181
Experiments and Test Results
Experiments · Standbyserver Alert History · Primaryserver Alert History · HA-OSCAR Measurements
182
Experiment – HA-OSCAR Stack
• HA-OSCAR is successfully verified on a cluster system with OSCAR release 3.0 and RedHat 9.0
• Experimental cluster configuration:
  — Two dual-Xeon server head nodes: 1 GB RAM each, 40 GB hard drive with 2 GB of free space, two NIC cards
  — 16 Intel client nodes: 512 MB RAM, 40 GB hard drive, two NIC cards
  — Network switch
183
Standby Server Alert History
• Testing Failover
  — The private Ethernet cable (eth0) of the Primary server is unplugged
  — Log in to the Primary server via the public IP and run an MPI program
  — The client node gets 100% response from the failover Standby server
• Testing Failback
  — The private Ethernet cable (eth0) of the Primary server is plugged back in
  — The client node pings (eth0) the Primary server and gets 100% response
#  Group           Service  Type     Time                        Alert              Args
1  primary_server  ping     alert    Mon Sept 29 21:28:07 2003   server_down.alert  -
2  primary_server  ping     upalert  Mon Sept 29 21:36:21 2003   server_up.alert    -
Table shows an example of alert history log on Standbyserver
184
Primary Server Alert History
#  Group           Service  Type     Time                        Alert            Args
1  primary_server  ping     alert    Mon Sept 29 21:36:17 2003   self_down.alert  -
2  primary_server  ping     upalert  Mon Sept 29 21:36:26 2003   self_up.alert    -
3  service_mon     PBS      alert    Mon Sept 29 21:35:16 2003   PBS.alert        -
4  service_mon     PBS      upalert  Mon Sept 29 21:36:26 2003   mail.alert       -
Table demonstrates Primary server alert history
185
HA-OSCAR Measurements
• 3–5 sec manual failover time
• 2-second polling interval (tunable)
• 0.9% CPU usage at each monitoring interval
[Chart: HA-OSCAR network load in packets/min (0–300), measured by TCPtrace, versus Mon polling intervals of 1, 2, 5, 10, 15, 20, 30 and 60 s]
Comparison of network usage for different HA-OSCAR polling intervals
186
HA-OSCAR Availability Modeling: Present Availability Modeling and Analysis
Stochastic Petri Net modeling Comparison Results
187
Reality Checks
• Great! We got HA Beowulf!
• But how much improvement?
  – The total uptime?
  – Performance?
• Analytical model and prediction
  – How many 9's? (downtime per year)
  – Stochastic Reward Net using the SPNP package from Duke U.
188
HA-OSCAR SRN Model
Server sub-model
Switches
Compute nodes
189
Server Sub-Model
Places/transitions: P (primary) server up, P server down, Failover, P server repair, Failback
S (standby) is up and ready, S takes control, S server down, S repair
192
Instantaneous Availability
Steady-state availability A = 99.993% (≈36 min downtime/year) vs.
Beowulf A = 99.65% (≈30 hr downtime/year)
193
Availability Analysis
HA-OSCAR solution vs. traditional Beowulf: total availability impacted by service nodes
Model assumptions:
  - scheduled downtime = 200 hrs
  - nodal MTTR = 24 hrs
  - failover time = 10 s
  - during maintenance on the head, the standby node acts as primary
Total availability analysis of HA-OSCAR versus Beowulf architecture
194
Work in Progress
195
Different flavors of HA-OSCAR
Monitoring            HA-OSCAR Active-Hot Standby    HA-OSCAR 2+1 Multi-Active (lab grade)
Monitored services    pbs, maui, nfs, httpd          SGE, NFS, nis, httpd, gmond, gmetad
Heartbeat (3 sec)
CPU fan speed         IPMI option                    IPMI option
CPU temperature       IPMI option                    IPMI option
CPU status            IPMI option                    IPMI option
Memory bit error      IPMI option                    IPMI option
200
HA-OSCAR and GRID (Lab Grade)
Grid-enabled HA cluster · High Availability architecture stack for grid · HA-enabled grid architecture
201
Grid enabled HA Cluster
• Grid computing allows:
  • Various organizations to work together to achieve common goals and high performance operations
  • Provides local autonomy
  • Distributed computing
• Potential pitfall is a single point of failure at the head node
  • Site unavailability
  • Reduces number of resources available
202
HA Architecture Stack for Globus Grid
Operating System Applications
Cluster Software Stack
Grid Layer
HA-OSCAR Service and Job Level Monitoring
HA-OSCAR policy-based recovery mechanism
203
Critical Service Monitoring & Failover-Failback capability for site-manager
Site C
Site B
Site A
Standby HEAD Node
Primary Head Node
Service Nodes
GRID
HA-OSCAR
HA-OSCAR Modified Failover-Aware Client
HA-OSCAR
HA-OSCAR
Client
Client submits MPI job
Site-Manager
HA-OSCAR failover if critical services (Gatekeeper, GridFTP, PBS) die
Compute nodes
Stand-By
204
Basic Components for Smart Failover
HA-OSCAR Smart Failover Architecture
Job Queue Monitor
Scheduler-jobID-to-Globus-assigned-jobID mapper
Backup updater
Service Monitor
HW Health Monitor
Monitoring Core Daemon
Resource Monitor
Gatekeeper, GridFTP, PBS
Notify on critical event
Event Monitoring Core Daemon
Notify system events
Notify on job addition & completion
The mapping between the GjobID and the SjobID is the key information for transparent recovery
Event Generator
Scheduler
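The jobID bookkeeping at the heart of this can be sketched roughly as follows (Python; the file locations, the on-disk format, and the function names are illustrative assumptions, not HA-OSCAR code): record the Globus-assigned jobID (GjobID) to scheduler jobID (SjobID) mapping on every job addition, drop it on completion, and push each update to a location the standby can read.

# Sketch of the GjobID <-> SjobID bookkeeping used for transparent recovery.
# The on-disk format, the paths, and the function names are assumptions.
import json, shutil
from typing import Dict

MAP_FILE = "/var/lib/ha-oscar/jobid_map.json"     # assumed location on the primary
BACKUP_COPY = "/backup/ha-oscar/jobid_map.json"   # assumed path visible to the standby

mapping: Dict[str, str] = {}   # Globus-assigned jobID -> scheduler jobID

def on_job_added(globus_jobid: str, scheduler_jobid: str) -> None:
    """Called when the job queue monitor sees a new job."""
    mapping[globus_jobid] = scheduler_jobid
    _persist()

def on_job_completed(globus_jobid: str) -> None:
    """Completed jobs no longer need recovery information."""
    mapping.pop(globus_jobid, None)
    _persist()

def _persist() -> None:
    with open(MAP_FILE, "w") as f:
        json.dump(mapping, f)
    shutil.copy(MAP_FILE, BACKUP_COPY)            # backup updater keeps the standby current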
205
In-progress Work – Security
Integrating DSI components with HA-OSCAR
206
Security – Different types of attacks
• Denial of Service (DoS)
• Impersonation
• Exploits of misconfiguration
• Exploits of implementation flaws
• Data-driven attacks
• Network infrastructure attacks
207
Goals
• Design an architecture as a platform to support different security mechanisms for carrier class Internet servers running on a clustered system.
• Requirements:
  – Scalable, flexible, no single point of failure, no performance bottleneck, supporting run-time changes in security context and policy
• The platform must provide mechanisms to protect the system against:
  – External attacks: originating from the Internet
  – Internal attacks:
    • Break-through of a node in the cluster
    • O&M security
    • Attacks originating from the Intranet
208
Functionality
• Access control– Finer grained, wide range of operations
• Authentication– Between cluster nodes, and processes
• Confidentiality and integrity for communications– Securing distributed IPCs
• Auditing– Collection and analysis of alarms and warnings through
O&M
209
DSI Overview
A primary Security Server (SS)
Multiple Security Managers (SM) (one per node)
SS and SMs communicate through an encrypted and authenticated channel
Security policy is enforced at kernel level
[Diagram: a Primary Security Server node (with a secondary), plus Nodes 1–3; each node runs a Security Manager (SM) and a kernel-level Distributed Security Module (DSM) guarding its processes (e.g. Proc123, Proc978, Proc222); a Security Broker carries authenticated, encrypted communications between the SS and the SMs; data traffic stays inside the cluster, while security and O&M/IDS traffic comes from outside the cluster]
Legend: SS = Security Server, SM = Security Manager, DSM = Distributed Security Module
210
DSI and HA-OSCAR
• One of the goals for 2005 is to integrate DSI components into HA-OSCAR to provide advanced security features.
211
Distributed Security Infrastructure (DSI)
• Developed for Cluster Environment
  – Fine grained, Flexible, Adaptable
• High level of abstraction for access control
  – Separating administrative, network, computation into different security zones
• Process level + User level
  – Kernel level module (DSM)
  – Real-time checks based on the LSM hooks / SELinux
212
DSI – components (1/2)
• Security Server
  – Central point of management, policy holder
• Security Manager
  – Node-based enforcement of policy
• Secure Communication Channel
  – Encrypted, authenticated, SS <-> SM
• Porting from DSM + LSM -> SELinux
  – 5 major classes
  – DSP -> SELinux TE
  – Process and network mapping are done
• HA-OSCAR is ready to integrate
• Design and develop security provisioning – tricky
216
Fault Tolerant Scheduling for Computational Grid/Cluster Environments
217
Objectives
• To provide fault tolerance for jobs at cluster level.
• Retain the job run sequence in case of failure
218
Architecture
[Diagram: Fault-tolerant scheduling architecture. On the HA-OSCAR Primary Server, job submissions enter the PBS Job Queue; a FAM job-event monitor gets queue events and updates the HA-OSCAR Backup Server's PBS Job Queue on job ADD/COMPLETE. Each job's lifecycle (Prologue → Job Run → Epilogue, i.e. job initialization through job cleanup) checkpoints the job and creates/updates a restart job spec file (JobID.spec) in the Job Spec Directory; the spec file is removed after completion. The backup is updated with checkpoint files and user home directories, and the HA-OSCAR monitoring daemon monitors the Primary Server.]
219
Demo Steps
• Submit MPI jobs through Torque
• View job queue status on the primary
• Simulate an outage by bringing the network down
• View the job queue status on the standby
220
Final Thoughts
• It took us a lot of work to arrive at our current results.
• HA-OSCAR is an Open Source project – open for contributions from all
• Several HA-OSCAR enabled works toward mission critical HPC clusters
• Please give us your feedback on how we can improve HA-OSCAR and make it your preferred open source HA clustering stack
• Participation is open!
221
Thank you. Feedback? Questions?
This slide show is available for download from: http://xcr.cenit.latech.edu/ha-oscar
222
Resources
• HA-OSCAR: xcr.cenit.latech.edu/ha-oscar
• Open Cluster Group: OpenClusterGroup.org
• OSCAR: oscar.OpenClusterGroup.org
• OSCAR Development: sourceforge.net/projects/oscar