Transcript
Page 1

MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging

joint work with A.Bouteiller, F.Cappello, G.Krawezik, P.Lemarinier, F.Magniette

Parallelism team, Grand Large Project
Thomas Hérault, herault@lri.fr
http://www.lri.fr/~herault

Page 2

MPICH-V2

• Computing nodes of clusters are subject to failures

• Many applications use MPI as their communication library
– hence the need to design a fault-tolerant MPI library

• MPICH-V1 is a fault-tolerant MPI implementation
– but it requires many stable components to provide high performance

• MPICH-V2 addresses this requirement
– and provides higher performance

Page 3

Outline

• Introduction
• Architecture
• Performance
• Perspectives & Conclusion

Page 4

Large Scale Parallel and Distributed Systems and Node Volatility

• Industry and academia are building larger and larger computing facilities for technical computing (research and production). Platforms with 1000s of nodes are becoming common: tera-scale machines (US ASCI, French Tera), large-scale clusters (Score III, etc.), Grids, and PC Grids (SETI@home, XtremWeb, Entropia, UD, BOINC).

• These large-scale systems suffer frequent failures/disconnections. The full-system MTBF of ASCI-Q is estimated (analytically) at a few hours (Petrini, LANL): a 5-hour job on 4096 processors has less than a 50% chance of terminating. PC Grid nodes are volatile: disconnections/interruptions are expected to be very frequent (several per hour).

• When failures/disconnections cannot be avoided, they become a characteristic of the system, called volatility.

We need a volatility-tolerant message passing library.

Page 5

Goal: execute existing or new MPI Apps

Programmer's view unchanged: PC client MPI_Send() → PC client MPI_Recv() (a minimal example follows the lists below).

Objective summary:
1) Automatic fault tolerance
2) Transparency for the programmer & user
3) Tolerate n faults (n being the number of MPI processes)
4) Scalable infrastructure/protocols
5) Avoid global synchronizations (ckpt/restart)
6) Theoretical verification of protocols

Problems:
1) Volatile nodes (any number, at any time)
2) Non-named receptions (they must be replayed in the same order as in the previous, failed execution)
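To make the unchanged programmer's view concrete, here is an ordinary MPI-1.1 ping-pong; nothing in it is MPICH-V2-specific, and it is a minimal illustration rather than code from the talk:

    /* Ordinary MPI-1.1 ping-pong: nothing MPICH-V2-specific appears in
       the source; fault tolerance is provided below the MPI library. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("ping-pong done\n");
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }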

Page 6

Related works

A classification of fault tolerant message passing environments considering (A) the level in the software stack where fault tolerance is managed (framework, API, or communication library) and (B) the fault tolerance technique (checkpoint based vs. log based, the latter split into optimistic, causal, and pessimistic logging), distinguishing automatic from non-automatic approaches. The original slide is a 2D chart; its recoverable entries are:

• Non-automatic: FT-MPI, modification of MPI routines with user fault treatment [FD00]; MPI/FT, redundancy of tasks [BNC01].
• Automatic, checkpoint based: Cocheck, independent of MPI [Ste96]; Starfish, enrichment of MPI [AF99]; Clip, semi-transparent checkpoint [CLP97]; coordinated checkpoint.
• Automatic, log based, optimistic log (sender based): Sender Based Message Logging, 1 fault, sender based [JZ87]; Pruitt 98, 2 faults, sender based [PRU98]; Optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85].
• Automatic, log based, causal log: Manetho, n faults [EZ92]; Egida [RAV99].
• Automatic, log based, pessimistic log: MPI-FT, n faults, centralized server [LNLE00]; MPICH-V2, n faults, distributed logging.

Page 7

Checkpoint techniques

Coordinated checkpoint (Chandy/Lamport): the objective is to checkpoint the application when there are no in-transit messages between any two nodes. This requires a global synchronization (network flush) and is therefore not scalable. On a failure: detection, global stop, and restart of all nodes from their checkpoints.

Uncoordinated checkpoint: no global synchronization (scalable); nodes may checkpoint at any time, independently of the others. On a failure: detection and restart of the failed node only. This requires logging nondeterministic events: in-transit messages.

Page 8

Outline

• Introduction
• Architecture
• Performance
• Perspectives & Conclusion

Page 9

MPICH-V1

Figure (architecture): computing nodes, a Dispatcher, Channel Memories and Checkpoint Servers on the network; every communication goes through a stable Channel Memory (Put on send, Get on receive) via the ch_cm device.

Figure (ping-pong time vs. message size, 0 to 384 KB, mean over 100 measurements): MPICH-V1 (ch_cm) reaches 5.6 MB/s versus 10.5 MB/s for MPICH-P4, i.e. roughly a factor of 2 slower, since each message crosses a Channel Memory.

Page 10

Definition 3 (Pessimistic Logging protocol). Let P be a communication protocol, and E an execution of P with at most f concurrent failures. Let M_C denote the set of messages transmitted between the initial configuration and configuration C of E.

P is a pessimistic message logging protocol if and only if

    ∀ C ∈ E, ∀ m ∈ M_C : (|Depend_C(m)| > 1) ⇒ Re-Executable(m)

MPICH-V2 protocol: a new protocol (never published before) based on
1) splitting message logging and event logging,
2) sender-based message logging,
3) a pessimistic approach (reliable event logger).

Theorem 2. The protocol of MPICH-V2 is a pessimistic message logging protocol.

Key points of the proof:
A. Every nondeterministic event has its logical clock logged on reliable media.
B. Every message reception logged on reliable media is re-executable (a sketch follows below):
– the message payload is saved on the sender;
– the sender will produce the message again and associate the same unique logical clock.
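A minimal in-memory sketch of the two halves of this protocol. The names (v2_send, v2_recv, log_event) and the data layout are invented for illustration, and the synchronous round-trip to the reliable event logger is reduced to a local function call:

    #include <stdio.h>

    #define MAX_LOG 128

    struct msg { int id; int payload; };

    /* Sender side: keep every payload so the message can be produced
       again, identically, during a re-execution. */
    static struct msg sender_log[MAX_LOG];
    static int sender_log_sz = 0;

    static void v2_send(int id, int payload)
    {
        struct msg m = { id, payload };
        sender_log[sender_log_sz++] = m;   /* sender-based payload log */
        /* ... then transmit m over the network ... */
    }

    /* Event-logger side: store the (id, logical clock) pair of every
       reception; the ack is sent only once the pair is stored. */
    static int event_log[MAX_LOG][2];
    static int event_log_sz = 0;

    static void log_event(int id, int clock)
    {
        event_log[event_log_sz][0] = id;
        event_log[event_log_sz][1] = clock;
        event_log_sz++;
    }

    /* Receiver side: pessimistic means blocking until the event is
       safely logged before any new message may be sent. */
    static int logical_clock = 0;

    static void v2_recv(struct msg *m)
    {
        int l = ++logical_clock;           /* order of this reception */
        log_event(m->id, l);               /* stands for a synchronous
                                              round-trip to the logger */
        /* only after the ack may this process emit new messages */
    }

    int main(void)
    {
        struct msg m = { 1, 42 };
        v2_send(m.id, m.payload);
        v2_recv(&m);
        printf("%d event(s) logged, %d payload(s) kept\n",
               event_log_sz, sender_log_sz);
        return 0;
    }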

Page 11

Message logger and event logger

Figure: processes q, p and r, with an event logger for p. Each message m received by p is assigned a pair (id, l), a unique identifier and the logical clock of its reception, which p logs to its event logger. When p crashes, it restarts from its last checkpoint and enters a re-execution phase: the logged (id, l) pairs fix the order of the receptions (A, B, C, D), and the senders re-produce the message payloads.

Page 12

Computing node (figure): the MPI process is coupled to a V2 daemon on the node. Sends and receives go through the daemon, which forwards each reception event to the Event Logger and waits for its ack, and keeps the payload of sent messages. The CSAC layer produces the Checkpoint Image, which is sent to the Ckpt Server under Ckpt Control.

Page 13

Impact of uncoordinated checkpoint + sender-based message logging

Figure: processes P0 and P1, P1's message logger (ML), the event logger (EL) and the checkpoint server (CS). The payloads logged on a sender are part of its state, so they end up inside its checkpoint image (messages 1 and 2 appear both in the log and in the image).

Consequences: obligation to checkpoint the message loggers on the computing nodes; a garbage collector is required to reduce the ML part of the checkpoint size.

Page 14

Garbage collection

Figure: once the receiver's checkpoint covering receptions 1 and 2 completes at the checkpoint server, the sender's garbage collector deletes the logged payloads 1 and 2; only message 3 remains in the log.

Receiver checkpoint completion triggers the garbage collector of the senders. A sketch of this rule follows.
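A compact illustration of that rule, under the assumption that each logged payload is stamped with the clock of its reception at the receiver; the names and data layout are invented for the example:

    #include <stdio.h>

    #define MAX_LOG 128

    /* One logged payload, stamped with the receiver's reception clock. */
    struct logged { int recv_clock; int payload; };

    static struct logged ml[MAX_LOG];   /* this sender's message logger */
    static int ml_sz = 0;

    /* The receiver reports the clock covered by its completed
       checkpoint; every payload at or below it becomes useless. */
    static void on_receiver_checkpoint_complete(int covered_clock)
    {
        int i, j = 0;
        for (i = 0; i < ml_sz; i++)
            if (ml[i].recv_clock > covered_clock)
                ml[j++] = ml[i];        /* keep only uncovered payloads */
        ml_sz = j;
    }

    int main(void)
    {
        ml[ml_sz++] = (struct logged){ 1, 10 };
        ml[ml_sz++] = (struct logged){ 2, 20 };
        ml[ml_sz++] = (struct logged){ 3, 30 };
        on_receiver_checkpoint_complete(2); /* 1 and 2 can be deleted */
        printf("%d payload(s) still logged\n", ml_sz); /* prints 1 */
        return 0;
    }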

Page 15

Scheduling Checkpoint

Figure: P0 and P1 with their message loggers (ML) and the checkpoint server (CS). After garbage collection, payloads 1 and 2 can be deleted; payload 3 is still in the ML and needs to be checkpointed, while a process whose log is empty needs no checkpoint at all.

• Uncoordinated checkpoint leads to logging in-transit messages (coordinated checkpoint (Chandy/Lamport) avoids this, but requires global synchronization).
• Scheduling checkpoints simultaneously would lead to bursts in the network traffic: checkpoint traffic should be flattened.
• Checkpoint size can be reduced by removing message logs, so checkpoint scheduling should evaluate the cost and the benefit of each checkpoint (a toy version of such a test follows).
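A deliberately small sketch of such a cost/benefit test. The heuristic, the 1/2 threshold and both inputs are assumptions for illustration, not the MPICH-V2 scheduler:

    #include <stdio.h>

    /* Checkpointing pays off when the message log it would let the
       garbage collector delete outweighs the cost of writing the
       image; the 1/2 threshold is an arbitrary stand-in. */
    static int should_checkpoint(long image_bytes, long log_bytes_freed)
    {
        return log_bytes_freed > image_bytes / 2;
    }

    int main(void)
    {
        long image = 100L << 20;           /* 100 MB checkpoint image */
        long freed = 80L << 20;            /* 80 MB of log deletable  */
        printf("%s\n", should_checkpoint(image, freed)
                           ? "checkpoint now" : "wait");
        return 0;
    }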

Page 16

Node (volatile): checkpointing

User-level checkpoint: Condor Stand Alone Checkpointing (CSAC), combined with clone checkpointing + non-blocking checkpoint (the application code sits on top of CSAC and libmpichv). On a checkpoint order:
(1) fork a clone of the process;
(2) the clone terminates the ongoing communications,
(3) closes its sockets,
(4) and calls ckpt_and_exit().
The checkpoint image is sent to the checkpoint server on the fly (not stored locally). The original process resumes execution using CSAC just after (4): it reopens its sockets and returns. A minimal sketch follows.
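A minimal sketch of the fork-based, non-blocking checkpoint, assuming a CSAC-like ckpt_and_exit() primitive (stubbed out here) that streams the image to the checkpoint server:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void ckpt_and_exit(void)
    {
        /* stub for the CSAC call: dump the clone's image to the
           checkpoint server socket on the fly, then exit the clone */
        _exit(0);
    }

    static void on_checkpoint_order(void)
    {
        pid_t clone = fork();              /* (1) fork a clone */
        if (clone == 0) {
            /* clone: (2) terminate ongoing comms, (3) close sockets */
            ckpt_and_exit();               /* (4) checkpoint and exit */
        }
        /* parent: resumes computing at once */
    }

    int main(void)
    {
        signal(SIGCHLD, SIG_IGN);          /* auto-reap checkpoint clones */
        on_checkpoint_order();
        /* ... computation continues while the clone is checkpointed ... */
        return 0;
    }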

Page 17

Library: based on MPICH 1.2.5

– A new device: the 'ch_v2' device, bound under the MPICH stack: MPI_Send → ADI → Channel Interface (MPID_SendControl / MPID_SendChannel) → Chameleon Interface → V2 device interface.
– All ch_v2 device functions are blocking communication functions built over the TCP layer (grouped as a sketch below):
  _v2Init: initialize the client
  _v2Finalize: finalize the client
  _v2bsend: blocking send
  _v2brecv: blocking receive
  _v2probe: check for any message available
  _v2from: get the source of the last message
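One hypothetical way to picture those entry points together, as a C vtable; the struct and the signatures are invented for illustration, only the function names come from the slide:

    /* Hypothetical vtable grouping the ch_v2 entry points; the
       signatures are assumptions, the names are from the slide. */
    struct v2_device {
        void (*_v2Init)(int *argc, char ***argv);  /* initialize the client */
        void (*_v2Finalize)(void);                 /* finalize the client   */
        void (*_v2bsend)(int dst, const void *buf, int len); /* blocking send */
        void (*_v2brecv)(int src, void *buf, int len);       /* blocking recv */
        int  (*_v2probe)(void);                    /* any message available? */
        int  (*_v2from)(void);                     /* source of last message */
    };

    int main(void)
    {
        struct v2_device dev = { 0 };      /* to be filled by the binding */
        (void)dev;
        return 0;
    }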

Page 18

Outline

• Introduction
• Architecture
• Performance
• Perspectives & Conclusion

Page 19

Performance evaluation

Cluster: 32 Athlon XP 1800+ CPUs (1 GB RAM, IDE disc) + 16 dual Pentium III 500 MHz (512 MB, IDE disc) + a 48-port 100 Mb/s Ethernet switch. Linux 2.4.18, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp).

A single reliable node hosts the Checkpoint Server + Event Logger + Checkpoint Scheduler + Dispatcher; the other nodes on the network are computing nodes.

Page 20

Bandwidth and Latency

Latency for a 0-byte MPI message: MPICH-P4 (77 µs), MPICH-V1 (154 µs), MPICH-V2 (277 µs).

Latency is high due to the event logging: a receiving process can send a new message only once the reception event has been successfully logged (3 TCP messages per communication). Bandwidth remains high because event messages are short.
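As a rough, assumed decomposition of these numbers (not a measurement from the talk):

    latency(V2) ≈ latency(P4) + event round-trip to the logger
    277 µs ≈ 77 µs + ~200 µs

i.e. the extra TCP messages (reception event plus its acknowledgment) on 100 Mb/s Ethernet plausibly account for on the order of 200 µs per reception.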

Page 21

NAS Benchmarks Class A and B: MPICH-P4 vs MPICH-V2

Figure: Megaflops versus number of processors (1, 4, 16 for MG, CG, FT and LU; 1, 4, 9, 16, 25 for SP and BT), for each benchmark in Classes A and B (FT shown for Class A only). Two annotations mark where MPICH-V2 falls behind MPICH-P4: "Latency" and "Memory capacity (logging on disc)".

Page 22

Breakdown of the execution time

Page 23

Faulty execution performance

Figure: with 1 fault every 45 seconds, the execution time increases by +190 s (+80%).

Page 24

Outline

• Introduction
• Architecture
• Performance
• Perspectives & Conclusion

Page 25

Perspectives

• Compare to coordinated techniques: find the threshold of fault frequency below which logging techniques are more valuable (MPICH-V/CL, Cluster 2003).
• Hierarchical logging for Grids: tolerate node failures & cluster failures (MPICH-V3, SC 2003 poster session).
• Address the latency of MPICH-V2: use causal logging techniques?

Page 26

Conclusion

• MPICH-V2 is a completely new protocol replacing MPICH-V1, removing the Channel Memories.
• The new protocol is pessimistic and sender based.
• MPICH-V2 reaches a ping-pong bandwidth close to that of MPICH-P4.
• MPICH-V2 cannot compete with MPICH-P4 on latency; however, for applications with large messages, performance is close to that of P4.
• In addition, MPICH-V2 withstands up to one fault every 45 seconds.
• Main conclusion: MPICH-V2 requires far fewer stable nodes than MPICH-V1, with better performance.

Come and see the MPICH-V demos at booth 3315 (INRIA).

Page 27

Re-execution performance (1)

Figure: time for the re-execution of a token ring on 8 nodes after a crash, according to the token size and the number of restarted nodes.

Page 28

Re-execution performance (2)

Page 29

Logging techniques

Figure: the initial execution takes a checkpoint (ckpt), then crashes; the replayed execution starts from the last checkpoint of the crashed process. The system must provide the messages to be replayed, and discard the re-emissions.

Main problems:
• discarding re-emissions (technical; see the sketch below);
• ensuring that messages are replayed in a consistent order.
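A small sketch of the discard test, assuming every message carries the unique logical clock assigned at its first delivery; the names and layout are invented for the example:

    #include <stdio.h>

    /* Restored from the checkpoint: clock of the last delivered message. */
    static int last_delivered_clock = 2;

    /* During replay, a message whose clock is not beyond the last
       delivered one is a re-emission and must be discarded. */
    static int should_deliver(int msg_clock)
    {
        if (msg_clock <= last_delivered_clock)
            return 0;                      /* re-emission: discard */
        last_delivered_clock = msg_clock;  /* deliver and advance */
        return 1;
    }

    int main(void)
    {
        printf("%d\n", should_deliver(2)); /* 0: already delivered */
        printf("%d\n", should_deliver(3)); /* 1: new message */
        printf("%d\n", should_deliver(3)); /* 0: duplicate of 3 */
        return 0;
    }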

Page 30

Large Scale Parallel and Distributed Systems and programming

• Many HPC applications use the message passing paradigm.
• Message passing: MPI.

We need a volatility-tolerant Message Passing Interface implementation, based on MPICH-1.2.5, which implements MPI standard 1.1.

Page 31

Checkpoint Server (stable)

A multiprocess server: a front-end polls, treats events and dispatches the jobs to other processes. Open sockets: one per attached node, and one per home Channel Memory of the attached nodes. Incoming messages carry put-checkpoint transactions; outgoing messages carry get-checkpoint transactions plus control. Checkpoint images are stored on a reliable medium (disc): one file per node, with the name given by the node. A sketch of the polling front-end follows.
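A bare-bones sketch of such a polling front-end, assuming a select()-based loop; every name here is illustrative, not the actual server code:

    #include <stdio.h>
    #include <sys/select.h>

    /* Illustrative hand-off of a put/get checkpoint transaction to a
       worker process (e.g. via fork in the multiprocess pattern). */
    static void dispatch_to_worker(int fd)
    {
        (void)fd;
    }

    int main(void)
    {
        /* Assumption: fds[] holds one socket per attached node; a
           single placeholder descriptor keeps the sketch runnable. */
        int fds[1] = { 0 };
        int nfds = 1, maxfd = fds[0];
        fd_set rset;

        for (;;) {
            int i;
            FD_ZERO(&rset);
            for (i = 0; i < nfds; i++)
                FD_SET(fds[i], &rset);
            if (select(maxfd + 1, &rset, NULL, NULL, NULL) < 0)
                break;                          /* poll */
            for (i = 0; i < nfds; i++)
                if (FD_ISSET(fds[i], &rset))
                    dispatch_to_worker(fds[i]); /* treat event, dispatch job */
        }
        return 0;
    }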

Page 32

NAS Benchmark Class A and B (backup: the figure from Page 21, with its "Latency" and "Memory capacity (logging on disc)" annotations).