Adding High Availability to Condor Central Manager
Artyom Sharov, Technion – Israel Institute of Technology, Haifa
Condor week – April 2006

Condor Pool without High Availability
[Diagram: a single Central Manager, running the Collector and the Negotiator, serves a pool of Startd and Schedd machines.]

Why Highly Available CM?
Central Manager is a single point of failure
  No additional matches are possible
  Condor tools do not work
  Unfair resource sharing and user priorities
Our goal: continuous pool functioning in case of failure

Highly Available Condor Pool
[Diagram: a Highly Available Central Manager serves the pool of Startd and Schedd machines.]

Solution Requirements
Automatic failure detection
Transparent failover
“Split brain” reconciliation
Persistency of CM state
No changes to CM code

Condor Pool with HA
[Diagram: the Highly Available Central Manager consists of three replicas, each running a Collector, a HAD and a Replicator; the Negotiator runs on one of them.]

HA – Election + Main
[Message diagram between three HAD instances: Backup 1, Backup 2 and Backup 3.]
#1 Election: the backups exchange election messages. One of them decides "I win", raises the Negotiator and becomes Active; the other two decide "I lose".
#2 Main: the Active sends "I am alive" messages to the backups.
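
The election step can be illustrated with a minimal Python sketch. It assumes that every HAD instance has a numeric ID and that the highest ID among the candidates wins, which is the comparison the split-brain slides of this talk use. The function name and the integer IDs are hypothetical and are not part of Condor.

```python
def run_election(candidate_ids):
    """Return (winner, losers) for one election round.

    Assumption: the candidate with the highest ID wins.
    """
    if not candidate_ids:
        raise ValueError("an election needs at least one candidate")
    winner = max(candidate_ids)
    losers = sorted(i for i in candidate_ids if i != winner)
    return winner, losers


# Three backups exchange election messages carrying their IDs.
winner, losers = run_election([1, 2, 3])
print(f"HAD {winner}: I win, raising the Negotiator")
for had_id in losers:
    print(f"HAD {had_id}: I lose, staying Backup")
```

Because every participant applies the same deterministic rule to the same set of IDs, all of them reach the same decision without a coordinator.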
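
HA – Crash
[Message diagram between Active, Backup 1 and Backup 2; the Active crashes.]
#3 The backups exchange election messages. One of them decides "I win", raises the Negotiator and becomes Active; the other decides "I lose".
#4 The new Active sends "I am alive" messages.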
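
One way a backup could notice the crash is by timing out on the "I am alive" messages. The sketch below is an illustration under that assumption only: the slides show that an election follows the crash, but not how the silence is measured. The class name and the timeout value are hypothetical.

```python
class Backup:
    """Tracks the last "I am alive" message seen from the Active."""

    def __init__(self, alive_timeout):
        self.alive_timeout = alive_timeout  # seconds of silence tolerated (assumed)
        self.last_alive = None              # time of the last "I am alive"

    def on_i_am_alive(self, now):
        self.last_alive = now

    def should_start_election(self, now):
        # No Active has ever been heard, or it has been silent for too long.
        if self.last_alive is None:
            return True
        return now - self.last_alive > self.alive_timeout


backup = Backup(alive_timeout=5.0)
backup.on_i_am_alive(now=10.0)
print(backup.should_start_election(now=12.0))  # Active was heard recently
print(backup.should_start_election(now=20.0))  # Active has been silent too long
```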

Replication – Main + Joining
[Message diagram between Active, Backup and a Joining replica, steps #1 to #3.]
Messages shown: State update (twice), Solicit version, Solicit version reply, Downloading request.
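
The "Solicit version" exchange suggests that a joining replica first asks the others which state version they hold and then sends a downloading request. A minimal sketch of that choice follows, assuming versions are comparable integers and that the replica with the highest version is chosen. Both points are assumptions for illustration; the slide does not state the selection rule, and the names are hypothetical.

```python
def choose_download_source(replies):
    """Pick the replica to download the state from.

    replies maps a replica name to the version number it reported in its
    "Solicit version reply". Ties are broken by name to keep the choice
    deterministic.
    """
    if not replies:
        return None  # nobody answered, nothing to download
    return max(replies, key=lambda name: (replies[name], name))


replies = {"active": 7, "backup": 5}
source = choose_download_source(replies)
print(f"Downloading request goes to: {source}")
```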

Replication – Crash
[Message diagram between Active, Backup 1 and Backup 2, steps #4 and #5: after the Active crashes, a backup becomes Active and State update messages are sent again.]

Configuration
Stabilization time
  Depends on number of CMs and network performance
  HAD_CONNECT_TIMEOUT – upper bound on the time to establish TCP connection
  Example: HAD_CONNECT_TIMEOUT = 2 and 2 CMs - new Negotiator is guaranteed to be up and running after 48 seconds
Replication frequency
  REPLICATION_INTERVAL
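
The example above is consistent with a bound of 12 × (number of CMs) × HAD_CONNECT_TIMEOUT seconds. The factor 12 is not stated on the slide; it is an assumption chosen so that the sketch reproduces the 48-second figure, and the function name is hypothetical.

```python
def stabilization_time(num_cms, had_connect_timeout, factor=12):
    """Upper bound, in seconds, until a new Negotiator is up and running.

    factor=12 is an assumption inferred from the slide's example
    (2 CMs, timeout of 2 seconds, 48 seconds).
    """
    if num_cms < 1 or had_connect_timeout <= 0:
        raise ValueError("need at least one CM and a positive timeout")
    return factor * num_cms * had_connect_timeout


print(stabilization_time(num_cms=2, had_connect_timeout=2))  # 48
```

Under this assumption the bound grows linearly with both the number of Central Managers and the connection timeout, so a larger replica set trades faster recovery for more redundancy.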

Testing
Automatic distributed testing framework:
  simulation of node crashes, network disconnections, network partitions and merges
Extensive testing:
  distributed testing on 5 machines in the Technion
  interactive distributed testing in Wisconsin pool
  automatic testing with NMI framework

HA in Production
Already deployed and fully functioning for more than a year in:
  Technion
  GLOW, UW
  California Department of Water Resources, Delta Modeling Section, Sacramento, CA
  Hartford Life
  Cycle Computing
  Additional commercial users

Usability and Administration
HAD Monitoring System
Configuration/administration utilities
Detailed manual section
Full support by Technion team

Future Work
HA in WAN
HAIFA – High Availability Is For Anyone
  HA for any Condor service (e.g., HA for schedd)
  More consistency schemes and HA semantics
  Dynamic registration of services requiring HA
  Dynamic addition/removal of replicas
More details in "Materializing Highly Available Grids", hot topic paper, to appear in HPDC 2006.

Collaboration with Condor Team
Ongoing collaboration for 3 years
Compliance with Condor coding standards
Peer-reviewed code
Integration with NMI framework
Automation of testing
Open-minded attitude of Condor team to numerous requests and questions
Unique experience of working with large peer-managed group of talented programmers

Collaboration with Condor Team
This work was a collaborative effort of:
Distributed Systems Laboratory in Technion
  Prof. Assaf Schuster, Gabi Kliot, Mark Zilberstein, Artyom Sharov
Condor team
  Prof. Miron Livny, Nick, Todd, Derek, Greg, Anatoly, Peter, Becky, Bill, Tim

You Should Definitely Try It
Part of the official 6.7.18 development release
Will soon appear in stable 6.8 release
More information:
  http://dsl.cs.technion.ac.il/projects/gozal/project_pages/ha/ha.html
  http://dsl.cs.technion.ac.il/projects/gozal/project_pages/replication/replication.html
  more details + configuration in my tutorial
Contact:
  {gabik,marks,sharov}@cs.technion.ac.il
  [email protected]

If time permits

Replication – “Split Brain”
[Diagram: two pools, each with its own HAD and Replication daemons, led by Active 1 and Active 2. The networks merge.]
Active 1 sends "I am alive, Active 1"; Active 2 sends "I am alive, Active 2".
Decision making at Active 1: my ID > ‘Active 2’ ID, I am a leader
Decision making at Active 2: my ID < ‘Active 1’ ID, give up
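
The decision rule on this slide can be written as a short Python sketch: when an Active receives "I am alive" from another Active, it compares IDs and either stays leader or gives up. The function name and the integer IDs are hypothetical; only the comparison itself comes from the slide.

```python
def resolve_split_brain(my_id, other_active_id):
    """Decide what an Active does on hearing "I am alive" from another Active."""
    if my_id == other_active_id:
        raise ValueError("two distinct Actives must have distinct IDs")
    return "leader" if my_id > other_active_id else "give up"


# Assume Active 1 has ID 2 and Active 2 has ID 1 (illustrative values).
print("Active 1:", resolve_split_brain(my_id=2, other_active_id=1))
print("Active 2:", resolve_split_brain(my_id=1, other_active_id=2))
```

Since both sides evaluate the same comparison with the roles swapped, exactly one of them remains leader after the merge.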

Replication – “Split Brain”
[Diagram: after the merge of networks one machine is Active and the other is Backup; each runs HAD and Replication daemons.]
Labels shown: "You’re leader"; ‘Active 2’ last version before merge; merging versions from two pools; State update.
HAD State Diagram
RD State Diagram