MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
joint work with
A.Bouteiller, F.Cappello, G.Krawezik, P.Lemarinier, F.Magniette
Parallelism team, Grand Large Project
Thomas Hérault herault@lri.fr
http://www.lri.fr/~herault
MPICH-V2
• Computing nodes of clusters are subject to failures
• Many applications use MPI as their communication library
 – Design a fault-tolerant MPI library
• MPICH-V1 is a fault-tolerant MPI implementation
 – It requires many stable components to provide high performance
• MPICH-V2 addresses this requirement
 – And provides higher performance
Outline
• Introduction
• Architecture
• Performances
• Perspective & Conclusion
Large Scale Parallel and Distributed Systems and Node Volatility

• Industry and academia are building larger and larger computing facilities for technical computing (research and production). Platforms with 1000s of nodes are becoming common: Tera-scale machines (US ASCI, French Tera), large scale clusters (Score III, etc.), Grids, PC Grids (Seti@home, XtremWeb, Entropia, UD, Boinc)
• These large scale systems have frequent failures/disconnections: the full-system MTBF of ASCI-Q is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job on 4096 processors has less than a 50% chance of terminating. PC Grid nodes are volatile: disconnections/interruptions are expected to be very frequent (several per hour)
• When failures/disconnections cannot be avoided, they become a characteristic of the system, called Volatility

→ We need a Volatility tolerant Message Passing library
Goal: execute existing or new MPI Apps
Programmer's view unchanged: PC client MPI_send() → PC client MPI_recv()

Objective summary:
1) Automatic fault tolerance
2) Transparency for the programmer & user
3) Tolerate n faults (n being the number of MPI processes)
4) Scalable infrastructure/protocols
5) Avoid global synchronizations (ckpt/restart)
6) Theoretical verification of protocols

Problems:
1) volatile nodes (any number, at any time)
2) non named receptions (they should be replayed in the same order as in the previous, failed execution)
Related works

A classification of fault tolerant message passing environments, considering A) the level in the software stack where fault tolerance is managed (framework, API, communication library) and B) the fault tolerance technique (checkpoint based: coordinated checkpoint; log based: optimistic, causal or pessimistic log; sender based), automatic vs. non automatic:

• Cocheck: independent of MPI [Ste96]
• Starfish: enrichment of MPI [AF99]
• Clip: semi-transparent checkpoint [CLP97]
• Optimistic recovery in distributed systems: n faults with coherent checkpoint [SY85]
• Sender based Message Logging: 1 fault, sender based [JZ87]
• Pruitt 98: 2 faults, sender based [PRU98]
• Manetho: n faults [EZ92]
• Egida [RAV99]
• MPI/FT: redundancy of tasks [BNC01]
• FT-MPI: modification of MPI routines, user fault treatment [FD00]
• MPI-FT: n faults, centralized server [LNLE00]
• MPICH-V2: n faults, distributed logging
Checkpoint techniques

Coordinated Checkpoint (Chandy/Lamport):
The objective is to checkpoint the application when there is no in-transit message between any two nodes. This requires a global synchronization (network flush) and is not scalable. On a failure, detection triggers a global stop and all nodes restart from their checkpoints.

Uncoordinated Checkpoint:
No global synchronization (scalable). Nodes may checkpoint at any time (independently of the others). On a failure, detection triggers the restart of the failed node only, but nondeterministic events need to be logged: in-transit messages.
Outline
• Introduction
• Architecture
• Performances
• Perspective & Conclusion
MPICH-V1

[Architecture: computing nodes connected through the network to a Dispatcher, Channel Memories and Checkpoint servers; every message goes through a Channel Memory (Put on send, Get on receive) via the ch_cm device.]

[Plot: ping-pong time (s) vs. message size (0 to 384 KB), mean over 100 measurements: MPICH-V1 reaches 5.6 MB/s against 10.5 MB/s for P4, a ~x2 slowdown.]
MPICH-V2 protocol

A new protocol (never published yet) based on:
1) Splitting message logging and event logging
2) Sender based message logging
3) Pessimistic approach (reliable event logger)

Definition 3 (Pessimistic Logging protocol). Let P be a communication protocol, and E an execution of P with at most f concurrent failures. Let M_C denote the set of messages transmitted between the initial configuration and the configuration C of E.

P is a pessimistic message logging protocol if and only if:

∀C ∈ E, ∀m ∈ M_C, (|Depend_C(m)| > 1) ⇒ ReExecutable(m)

Theorem 2. The protocol of MPICH-V2 is a pessimistic message logging protocol.

Key points of the proof:
A. Every non deterministic event has its logical clock logged on reliable media
B. Every message reception logged on reliable media is re-executable:
 – the message payload is saved on the sender
 – the sender will produce the message again and associate the same unique logical clock
Message logger and event logger

[Figure: processes q, p and r exchange messages; each message m carries a unique pair (id, l), where l is the logical clock of its reception. p logs its reception events on its event logger. When p crashes and restarts from its last checkpoint, the re-execution phase replays receptions A, B, C and D in the order given by the logged logical clocks, the senders re-sending the payloads from their message loggers.]
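A minimal sketch of this receive path in C (not the actual MPICH-V2 code): the reception event (id, l) is logged on the reliable event logger, and the process may only proceed once the logger has acknowledged it. The helper names are assumptions:

```c
#include <stddef.h>

/* Assumed helpers, not part of any real API: */
void event_logger_send(const void *buf, size_t len);
void event_logger_wait_ack(void);
void deliver_to_application(const void *payload, size_t len);

/* Reception event: the unique (id, l) pair attached to message m. */
typedef struct { int sender_id; long logical_clock; } recv_event_t;

static long clock_counter = 0;   /* receiver's logical clock l */

void on_receive(int sender_id, const void *payload, size_t len) {
    recv_event_t ev = { sender_id, ++clock_counter };

    /* Log the reception event on the reliable event logger. */
    event_logger_send(&ev, sizeof ev);

    /* Pessimistic: block until the logger acknowledges. This ack is
       why a communication costs 3 TCP messages instead of 1. */
    event_logger_wait_ack();

    /* Only now may this process deliver, and emit new messages that
       depend on this reception. */
    deliver_to_application(payload, len);
}
```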
Computing node

[Figure: each node runs the MPI process linked with the V2 daemon, which performs Send/Receive on its behalf. For every reception, the daemon sends the reception event to the Event Logger and waits for the ack; for every emission, it keeps the payload (sender based message logging). Under Ckpt Control, CSAC streams the Checkpoint Image to the Ckpt Server.]
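The emission side ("keep payload") admits a similarly small sketch; every name below is hypothetical:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers: */
void log_append(int dest, long msg_id, void *copy, size_t len); /* sender's log */
int  tcp_send(int dest, long msg_id, const void *buf, size_t len);

static long next_msg_id = 0;

/* Keep a copy of the payload before sending, so that on the receiver's
   re-execution this sender can produce the message again under the
   same unique id. */
int v2_send_sketch(int dest, const void *buf, size_t len) {
    long id = ++next_msg_id;
    void *copy = malloc(len);
    if (copy == NULL) return -1;
    memcpy(copy, buf, len);
    log_append(dest, id, copy, len);   /* sender based message logging */
    return tcp_send(dest, id, buf, len);
}
```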
Impact of uncoordinated checkpoint + sender based message logging

[Figure: timeline with the event logger (EL), processes P0 and P1, P1's message logger (ML) and the checkpoint server (CS); the payloads of messages 1 and 2 sit in P1's message logger when the checkpoint images are taken.]

• Obligation to checkpoint Message Loggers on computing nodes
• A garbage collector is required for reducing the ML checkpoint size
Garbage collection

[Figure: same timeline; once the receiver's checkpoint completes on the checkpoint server, messages 1 and 2 can be deleted from the sender's message logger by the garbage collector.]

Receiver checkpoint completion triggers the garbage collector of the senders.
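A minimal sketch of this rule with invented names: when a receiver reports a completed checkpoint, the sender frees every logged payload that the checkpoint now covers:

```c
#include <stddef.h>
#include <stdlib.h>

/* One logged payload in the sender based message logger (illustrative). */
typedef struct logged_msg {
    int  dest;              /* receiver rank */
    long logical_clock;     /* clock of the reception event */
    void *payload;
    struct logged_msg *next;
} logged_msg_t;

static logged_msg_t *message_log = NULL;

/* Called when receiver `dest` reports a completed checkpoint covering
   every reception up to `last_clock`: those payloads will never be
   replayed, so they can be freed. */
void garbage_collect(int dest, long last_clock) {
    logged_msg_t **cur = &message_log;
    while (*cur) {
        logged_msg_t *m = *cur;
        if (m->dest == dest && m->logical_clock <= last_clock) {
            *cur = m->next;
            free(m->payload);
            free(m);
        } else {
            cur = &m->next;
        }
    }
}
```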
Scheduling Checkpoint

[Figure: same timeline with P0's and P1's message loggers; after each checkpoint the garbage collector deletes messages 1 and 2, then 1, 2 and 3; an in-transit message 3 still needs to be checkpointed; when no message is logged, no checkpoint is needed.]

• Uncoordinated checkpoint leads to logging in-transit messages (unlike coordinated checkpoint (Chandy/Lamport), which avoids the logs but requires a global synchronization)
• Scheduling checkpoints simultaneously will lead to bursts in the network traffic
• Checkpoint size can be reduced by removing message logs

→ Checkpoint traffic should be flattened
→ Checkpoint scheduling should evaluate the cost and benefit of each checkpoint, as illustrated below
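The slides state the policy goals but not the algorithm; purely as an illustration of a cost/benefit test consistent with the bullets above (all names and constants are invented), a scheduler might decide like this:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cost/benefit test for scheduling a node's checkpoint:
   checkpoint when the message log it would reclaim outweighs the cost
   of shipping the process image, and stagger the nodes to avoid
   network bursts. */
bool should_checkpoint(size_t log_bytes,    /* reclaimable message log   */
                       size_t image_bytes,  /* process image to transfer */
                       int    rank,
                       long   round)        /* scheduler round counter   */
{
    /* Benefit: bytes of log removed. Cost: bytes of image transferred. */
    bool worth_it = log_bytes > image_bytes / 4;   /* illustrative ratio */

    /* Flatten traffic: only a rotating subset checkpoints each round. */
    bool my_turn = (round % 8) == (rank % 8);

    return worth_it && my_turn;
}
```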
Node (Volatile): Checkpointing

User-level checkpoint: Condor Stand Alone Checkpointing (CSAC), clone checkpointing + non blocking checkpoint.

[Figure: the application code is linked with libmpichv and CSAC. Upon a checkpoint order: (1) fork; the clone (2) terminates ongoing communications, (3) closes its sockets and (4) calls ckpt_and_exit().]

The checkpoint image is sent to the CS on the fly (not stored locally). Execution resumes, using CSAC, just after (4): reopen sockets and return.
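A minimal sketch of this fork-based, non-blocking checkpoint sequence in C; ckpt_and_exit() is the CSAC entry point named in the slide, while the communication helpers are illustrative assumptions:

```c
#include <sys/types.h>
#include <unistd.h>

/* Assumed helpers, illustrative only: */
void terminate_ongoing_comms(void);
void close_all_sockets(void);
void reopen_sockets(void);
void ckpt_and_exit(void);   /* CSAC: stream the image to the CS, then exit */

void on_checkpoint_order(void) {
    pid_t pid = fork();                 /* (1) clone the process */
    if (pid == 0) {
        /* Clone: checkpoints while the parent keeps computing. */
        terminate_ongoing_comms();      /* (2) */
        close_all_sockets();            /* (3) */
        ckpt_and_exit();                /* (4) image sent on the fly;
                                           a restarted image resumes
                                           here instead of exiting */
        /* Only reached when restarting from the image: */
        reopen_sockets();
    }
    /* Parent: returns immediately; the checkpoint is non blocking. */
}
```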
Binding

• Library: based on MPICH 1.2.5
• A new device: the 'ch_v2' device
• All ch_v2 device functions are blocking communication functions built over the TCP layer

[Figure: MPI_Send goes through the ADI (MPID_SendControl, MPID_SendChannel), the Channel Interface and the Chameleon Interface down to the V2 device interface.]

V2 device interface:
• _v2bsend: blocking send
• _v2brecv: blocking receive
• _v2probe: check for any message available
• _v2from: get the source of the last message
• _v2Init: initialize the client
• _v2Finalize: finalize the client
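As an illustration of what a blocking device send over TCP must guarantee (this is not the MPICH source; the length-prefix framing is an assumption), consider:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Loop until the whole buffer is on the wire: TCP's send() may write
   fewer bytes than requested, so a blocking device send must retry. */
static int send_all(int sock, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(sock, p, len, 0);
        if (n <= 0) return -1;   /* connection failure: node volatility */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Illustrative _v2bsend-style wrapper: frame the message with its
   length so the receiver knows how much to read. */
int v2bsend_sketch(int sock, const void *payload, uint32_t len) {
    if (send_all(sock, &len, sizeof len) != 0) return -1;
    return send_all(sock, payload, len);
}
```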
Outline
• Introduction
• Architecture
• Performances
• Perspective & Conclusion
Performance evaluation

Cluster: 32 Athlon XP 1800+ CPUs, 1 GB RAM, IDE disk
+ 16 dual Pentium III, 500 MHz, 512 MB, IDE disk
+ 48-port 100 Mb/s Ethernet switch
Linux 2.4.18, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp)

A single reliable node hosts the Checkpoint Server + Event Logger + Checkpoint Scheduler + Dispatcher; all other nodes are computing nodes connected through the network.
Bandwidth and Latency

Latency for a 0 byte MPI message: MPICH-P4 (77 µs), MPICH-V1 (154 µs), MPICH-V2 (277 µs).

Latency is high due to the event logging: a receiving process can send a new message only when the reception event has been successfully logged (3 TCP messages for a communication). Bandwidth is high because event messages are short.
NAS Benchmarks, Class A and B: MPICH-P4 vs. MPICH-V2

[Plots: Megaflops vs. number of processors (1 to 16 for MG, CG, FT and LU; 1 to 25 for SP and BT), Classes A and B, comparing MPICH-P4 and MPICH-V2; the panels are annotated "Latency" and "Memory capacity (logging on disc)" where those effects dominate.]
Breakdown of the execution time
Faulty execution performance

[Plot: execution time under an increasing fault rate, up to 1 fault every 45 seconds: the run still completes, with +190 s (+80%) overhead.]
Outline
• Introduction
• Architecture
• Performances
• Perspective & Conclusion
Perspectives

• Compare to coordinated techniques
 – Threshold of fault frequency where logging techniques are more valuable
 – MPICH-V/CL, Cluster 2003
• Hierarchical logging for Grids
 – Tolerate node failures & cluster failures
 – MPICH-V3, SC 2003 poster session
• Address the latency of MPICH-V2
 – Use causal logging techniques?
Conclusion

• MPICH-V2 is a completely new protocol replacing MPICH-V1, removing the channel memories
• The new protocol is pessimistic sender based
• MPICH-V2 reaches a ping-pong bandwidth close to the one of MPICH-P4
• MPICH-V2 cannot compete with MPICH-P4 on latency
• However, for applications with large messages, performance is close to the one of P4
• In addition, MPICH-V2 resists up to one fault every 45 seconds
• Main conclusion: MPICH-V2 requires far fewer stable nodes than MPICH-V1, with better performance

Come to see the MPICH-V demos at the INRIA booth (3315).
Re-execution performance (1)

[Plot: time for the re-execution of a token ring on 8 nodes after a crash, according to the token size and the number of re-started nodes.]
Re-execution performance (2)
Logging techniques

[Figure: initial execution with a checkpoint (ckpt), then a crash; the replayed execution starts from the last checkpoint of the crashed process.]

The system must provide the messages to be replayed, and discard the re-emissions.

Main problems:
• Discard re-emissions (technical)
• Ensure that messages are replayed in a consistent order, as sketched below
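A minimal sketch of these two rules under the (id, l) scheme from the protocol slides; all names are hypothetical, and a real implementation would queue out-of-order messages rather than reject them:

```c
#include <stdbool.h>

/* Hypothetical replay state of a recovering process. */
typedef struct {
    long next_replay_clock;   /* next logged reception to replay        */
    long last_delivered;      /* highest clock delivered before the crash */
} replay_state_t;

/* A message re-sent by a sender's message logger carries the same
   unique logical clock as in the initial execution. */
bool should_deliver(replay_state_t *st, long msg_clock) {
    if (msg_clock <= st->last_delivered)
        return false;                 /* re-emission of an already
                                         delivered message: discard */
    if (msg_clock != st->next_replay_clock)
        return false;                 /* out of the logged order: hold
                                         back (a real system queues it) */
    st->next_replay_clock++;
    st->last_delivered = msg_clock;
    return true;                      /* replay in a consistent order */
}
```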
Large Scale Parallel and Distributed Systems and Programming

• Many HPC applications use the message passing paradigm
• Message passing: MPI

→ We need a Volatility tolerant Message Passing Interface implementation

• Based on MPICH-1.2.5, which implements the MPI 1.1 standard
Checkpoint Server (stable)

A multiprocess server:
• polls, treats events and dispatches the jobs to other processes
• open sockets: one per attached Node, plus one per home CM of the attached Nodes
• incoming messages: Put ckpt transactions; outgoing messages: Get ckpt transactions + control
• checkpoint images are stored on reliable media (disc): 1 file per Node (name given by the Node)
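A minimal sketch of a multiprocess poll-and-dispatch loop of this shape; poll(), fork() and signal() are standard POSIX, while the transaction helpers are illustrative assumptions:

```c
#include <poll.h>
#include <signal.h>
#include <unistd.h>

/* Assumed helpers, illustrative only: */
int  accept_or_read_event(int fd);          /* returns a client fd or -1   */
void handle_ckpt_transaction(int client);   /* Put/Get of checkpoint images */

void checkpoint_server_loop(struct pollfd fds[], int nfds) {
    signal(SIGCHLD, SIG_IGN);   /* auto-reap finished transaction children */
    for (;;) {
        if (poll(fds, nfds, -1) <= 0) continue;   /* wait for any socket */
        for (int i = 0; i < nfds; i++) {
            if (!(fds[i].revents & POLLIN)) continue;
            int client = accept_or_read_event(fds[i].fd);
            if (client < 0) continue;
            /* Dispatch the transaction to a dedicated process so a slow
               checkpoint transfer does not block the polling loop. */
            if (fork() == 0) {
                handle_ckpt_transaction(client);
                _exit(0);
            }
            close(client);   /* parent: the child owns the connection */
        }
    }
}
```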