SC 2003 (11/18). MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging. Joint work with A. Bouteiller, F. Cappello, G. Krawezik, P. Lemarinier, F. Magniette. Parallelism team, Grand Large Project. Thomas Hérault, herault@lri.fr, http://www.lri.fr/~herault
Large Scale Parallel and Distributed Systems and Node Volatility

• Industry and academia are building larger and larger computing facilities for technical computing (research and production). Platforms with thousands of nodes are becoming common: tera-scale machines (US ASCI, French Tera), large-scale clusters (Score III, etc.), Grids, and PC Grids (SETI@home, XtremWeb, Entropia, UD, BOINC).
• These large-scale systems suffer frequent failures/disconnections: the full-system MTBF of ASCI-Q is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job with 4096 processors has less than a 50% chance of terminating. PC Grid nodes are volatile: disconnections/interruptions are expected to be very frequent (several per hour).
• When failures/disconnections cannot be avoided, they become a characteristic of the system called volatility.

We need a volatility-tolerant message passing library.
Goal: execute existing or new MPI applications

Programmer's view unchanged: PC client MPI_send(), PC client MPI_recv()

Objective summary:
1) Automatic fault tolerance
2) Transparency for the programmer and user
3) Tolerate n faults (n being the number of MPI processes)
4) Scalable infrastructure/protocols
5) Avoid global synchronizations (ckpt/restart)
6) Theoretical verification of protocols

Problems:
1) volatile nodes (any number, at any time)
2) non-named receptions (they must be replayed in the same order as in the previous, failed execution)
Related works

A classification of fault tolerant message passing environments, considering A) the level in the software stack where fault tolerance is managed (framework, API, or communication library) and B) the fault tolerance technique:

Non automatic:
- FT-MPI: modification of MPI routines, user fault treatment [FD00]
- MPI/FT: redundancy of tasks [BNC01]

Automatic, checkpoint based:
- Cocheck: independent of MPI [Ste96]
- Starfish: enrichment of MPI [AF99]
- Clip: semi-transparent checkpoint [CLP97]
- Coordinated checkpoint: n faults with coherent checkpoint [SY85]

Automatic, log based:
- Optimistic log (sender based): optimistic recovery in distributed systems [SY85]
- Causal log: Manetho, n faults [EZ92]; Egida [RAV99]
- Pessimistic log: Sender Based Message Logging, 1 fault, sender based [JZ87]; Pruitt 98, 2 faults, sender based [PRU98]; MPI-FT, n faults, centralized server [LNLE00]; MPICH-V2, n faults, distributed logging
Checkpoint techniques

Coordinated Checkpoint (Chandy/Lamport): the objective is to checkpoint the application when there are no in-transit messages between any two nodes. This requires a global synchronization and a network flush, so it is not scalable. On a failure, all nodes are stopped (detection/global stop) and restarted from their checkpoints.

Uncoordinated Checkpoint: no global synchronization (scalable); nodes may checkpoint at any time, independently of the others. On a failure, only the failed node is restarted, but non-deterministic events, in particular in-transit messages, must be logged.
Definition 3 (Pessimistic Logging protocol). Let P be a communication protocol, and E an execution of P with at most f concurrent failures. Let M_C denote the set of messages transmitted between the initial configuration and the configuration C of E.

P is a pessimistic message logging protocol if and only if

∀C ∈ E, ∀m ∈ M_C, (|Depend_C(m)| > 1) ⇒ Re-Executable(m)
MPICH-V2 protocol: a new protocol (not published before) based on
1) splitting message logging and event logging,
2) sender based message logging,
3) a pessimistic approach (reliable event logger).

Theorem 2. The protocol of MPICH-V2 is a pessimistic message logging protocol.

Key points of the proof:
A. Every non-deterministic event has its logical clock logged on reliable media.
B. Every message reception logged on reliable media is re-executable:
- the message payload is saved on the sender;
- the sender will produce the message again and associate the same unique logical clock.
Message logger and event logger

[Diagram: processes p, q and r, with an event logger for p. Each message m delivered to p carries a pair (id, l), which p logs on its event logger. When p crashes, it restarts and enters a re-execution phase: the event logger supplies the logged pairs and the senders replay the corresponding messages in the same order.]
Computing node

[Diagram: on each node, the MPI process communicates through a V2 daemon (Send/Receive). For each reception, the reception event is sent to the Event Logger, which acknowledges it (ack); the payload of each sent message is kept on the sender. CSAC, under Ckpt Control, produces the Checkpoint Image, which is sent to the Ckpt Server.]
Impact of uncoordinated checkpoint + sender based message logging

[Diagram: processes P0 and P1 with an event logger (EL) and a checkpoint server (CS); messages 1 and 2 sit in P1's message logger (ML) when the checkpoint images are taken.]

Obligation to checkpoint the message loggers on the computing nodes; a garbage collector is required to reduce the ML checkpoint size.
Garbage collection

[Diagram: once P0's checkpoint image is complete on the checkpoint server (CS), messages 1 and 2 can be deleted from P1's message logger (ML).]

Receiver checkpoint completion triggers the garbage collector of the senders.
Scheduling Checkpoint

[Diagram: messages 1, 2 and 3 sit in P0's and P1's message loggers (ML); message 3 still needs to be checkpointed on the checkpoint server (CS), after which 1, 2 and 3 can be garbage collected. When there is no in-transit message, no checkpoint is needed.]

• Uncoordinated checkpoint leads to logging in-transit messages (whereas coordinated checkpoint (Chandy/Lamport) avoids this at the price of a global synchronization).
• Scheduling checkpoints simultaneously would lead to bursts in the network traffic.
• Checkpoint size can be reduced by removing message logs.
⇒ Checkpoint traffic should be flattened, and checkpoint scheduling should evaluate the cost and benefit of each checkpoint.
Node (volatile): checkpointing

User-level checkpoint: Condor Stand Alone Checkpointing (CSAC), with clone checkpointing + non-blocking checkpoint (code, CSAC, libmpichv).

On a checkpoint order:
(1) fork;
(2) terminate ongoing communications;
(3) close sockets;
(4) call ckpt_and_exit().

The checkpoint image is sent to the checkpoint server on the fly (not stored locally). On restart, execution resumes in CSAC just after (4): sockets are reopened and the call returns.
The ch_v2 device

– A new device: the 'ch_v2' device, plugged under the MPICH stack (ADI, Channel Interface, Chameleon Interface), so that MPI_Send goes through MPID_SendControl / MPID_SendChannel down to the _v2 primitives.
– All ch_v2 device functions are blocking communication functions built over the TCP layer:
  - _v2bsend: blocking send
  - _v2brecv: blocking receive
  - _v2probe: check for any message available
  - _v2from: get the source of the last message
  - _v2Init: initialize the client
  - _v2Finalize: finalize the client