Parallel Checkpointing - Sathish Vadhiyar
Parallel Checkpointing - Sathish Vadhiyar. Introduction Checkpointing? storing application’s state in order to resume later.

Dec 13, 2015

Parallel Checkpointing

- Sathish Vadhiyar

Introduction

Checkpointing: storing an application's state in order to resume later

Motivation

The largest parallel systems in the world have from 130,000 to 213,000 parallel processing elements (http://www.top500.org).

Large-scale applications that can use this many processing elements are continuously being built.

These applications are also long-running.

As the number of processing elements increases, the Mean Time Between Failures (MTBF) decreases.

So how can long-running applications execute in this highly failure-prone environment? Through checkpointing and fault tolerance.
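A rough way to see why the MTBF shrinks with scale is sketched below. It assumes, which the slides do not state, that the N processing elements fail independently with the same exponentially distributed lifetime. The 5-year element MTBF and N = 100,000 are illustrative numbers only.

```latex
\mathrm{MTBF}_{\text{system}} \approx \frac{\mathrm{MTBF}_{\text{element}}}{N}
\qquad\text{e.g.}\qquad
\frac{5 \times 8760\ \text{h}}{100{,}000} \approx 0.44\ \text{h} \approx 26\ \text{min}
```

Under these assumptions, a run that lasts several hours would be expected to see multiple failures, which is why it needs some form of checkpointing.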

Uses of Checkpointing

Fault tolerance & rollback recovery

Process migration & job swapping

Debugging

Types of Checkpointing

OS-level: process preemption, e.g. Berkeley's checkpointing Linux

User-level transparent: checkpointing is performed by the program itself; transparency is achieved by linking the application with a special library, e.g. Plank's libckpt

User-level non-transparent: users insert checkpointing calls in their programs, e.g. SRS (Vadhiyar)

Rules for Consistent Checkpointing

In a parallel program, each process has events and a local state. An event changes the local state of a process.

Global state: an external view of the parallel application (e.g. lines S, S', S''), used for checkpointing and restarting. It consists of the local states and the messages in transit.

Rules for Consistent Checkpointing

Types of global states

Consistent global state: one from which the program can be restarted correctly

Inconsistent global state: any other

Rules for Consistent Checkpointing

Chandy & Lamport: 2 rules for consistent global states

1. If a receive event is part of the local state of a process, the corresponding send event must be part of the local state of the sender.

2. If a send event is part of the local state of a process and the matching receive is not part of the local state of the receiver, then the message must be part of the state of the network.

S'' violates rule 1, and hence cannot lead to a consistent global state.
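The two rules can be checked mechanically. The sketch below is a minimal illustration, not code from the slides: it assumes each process's history is a numbered sequence of events, that a global state is given as `cut[p]` (the number of events of process `p` it includes), and that each message records whether the global state saved it as in transit. The `Message` type and `is_consistent_cut` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* One message, described by event indices in the sender's and
 * receiver's local histories (hypothetical representation). */
typedef struct {
    int sender;             /* rank of the sending process */
    int receiver;           /* rank of the receiving process */
    int send_event;         /* index of the send event at the sender */
    int recv_event;         /* index of the receive event at the receiver */
    bool in_channel_state;  /* saved as "in transit" by the global state? */
} Message;

/* cut[p] = number of events of process p included in the global state.
 * Returns true if the cut satisfies both rules for every message. */
bool is_consistent_cut(const int *cut, const Message *msgs, size_t n_msgs)
{
    for (size_t k = 0; k < n_msgs; k++) {
        const Message *m = &msgs[k];
        bool sent = m->send_event < cut[m->sender];
        bool received = m->recv_event < cut[m->receiver];

        /* Rule 1: a recorded receive needs its send recorded too. */
        if (received && !sent)
            return false;

        /* Rule 2: sent but not yet received means the message must be
         * part of the recorded network state. */
        if (sent && !received && !m->in_channel_state)
            return false;
    }
    return true;
}
```

A cut that includes a receive but not the matching send is rejected by the first check, which is the situation the slides describe for S''.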

Independent checkpointing

Processors checkpoint periodically without coordination

Can lead to the domino effect: each rollback of a processor to a previous checkpoint forces another processor to roll back even further

Checkpointing Methods

1. Coordinated checkpointing

   1. All processes coordinate to take a consistent checkpoint (e.g. using a barrier)

   2. Will always lead to consistent checkpoints

2. Checkpointing with message logging

   1. Processes take independent checkpoints, and each process logs the messages it receives after its last checkpoint

   2. Recovery thus uses the previous checkpoint and the logged messages

Message Logging

Introduction

In message-logging protocols, each process stores the contents and sequence numbers of all messages it has sent or received in a message log.

To trim message logs, a process can also checkpoint periodically.

Once a process checkpoints, all messages sent or received before this checkpoint can be removed from the log.
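The log-and-trim idea can be sketched as follows. This is a minimal in-memory illustration under assumptions not in the slides: a fixed-capacity log, string payloads, and a checkpoint identified by the sequence number of the first message it does not cover. `MessageLog`, `log_append`, and `log_trim` are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>

#define LOG_CAPACITY 64
#define PAYLOAD_MAX 32

typedef struct {
    unsigned long seq;          /* sequence number of the message */
    char payload[PAYLOAD_MAX];  /* message contents */
} LogEntry;

typedef struct {
    LogEntry entries[LOG_CAPACITY];
    size_t count;
} MessageLog;

/* Append a message to the log. Returns 0 on success, -1 if the log is full. */
int log_append(MessageLog *log, unsigned long seq, const char *payload)
{
    if (log->count == LOG_CAPACITY)
        return -1;
    LogEntry *e = &log->entries[log->count++];
    e->seq = seq;
    snprintf(e->payload, PAYLOAD_MAX, "%s", payload);
    return 0;
}

/* After a checkpoint that covers every message with seq < checkpoint_seq,
 * drop those entries. Returns the number of entries removed. */
size_t log_trim(MessageLog *log, unsigned long checkpoint_seq)
{
    size_t kept = 0;
    for (size_t i = 0; i < log->count; i++) {
        if (log->entries[i].seq >= checkpoint_seq)
            log->entries[kept++] = log->entries[i];
    }
    size_t removed = log->count - kept;
    log->count = kept;
    return removed;
}
```

A real protocol would keep the log on stable storage; the array here only shows how a checkpoint bounds the log's growth.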

Pessimistic and Optimistic Message Logging

Pessimistic: a sender does not send a message m until it knows that all messages sent before m are logged.

It waits for logging to complete; hence there is communication overhead even when there are no failures.

But recovery is fast when a failure occurs.

Pessimistic and Optimistic Message Logging

Optimistic: assumes messages are logged.

Does not wait for logging to complete before sending a message m.

Hence, on failure, the messages logged earlier will be resent but message m will not, which leads to inconsistent states.

Hence failure recovery is complex: the application states must be traced back and carefully reconstructed.

But the overhead is minimal when there are no failures.
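The trade-off between the two schemes can be illustrated with a toy counter model. It performs no real communication or I/O, and every name in it (`Sender`, `send_pessimistic`, `send_optimistic`, `unlogged_at_failure`) is hypothetical. It assumes that, in the optimistic case, no background log write has completed by the time of the failure.

```c
/* Toy model: counters only, no real messages or disk writes. */
typedef struct {
    int sent;       /* messages handed to the network */
    int logged;     /* messages whose log entry is on stable storage */
    int log_waits;  /* times the sender had to wait for logging to finish */
} Sender;

/* Pessimistic: before sending m, wait until every earlier message is logged. */
void send_pessimistic(Sender *s)
{
    if (s->logged < s->sent) {
        s->logged = s->sent;  /* stands in for blocking on the log write */
        s->log_waits++;
    }
    s->sent++;
}

/* Optimistic: send immediately and assume logging completes in the background. */
void send_optimistic(Sender *s)
{
    s->sent++;
}

/* Messages that were sent but would be missing from the log on a failure now. */
int unlogged_at_failure(const Sender *s)
{
    return s->sent - s->logged;
}
```

After five sends in this model, the pessimistic sender has waited four times and has at most one unlogged message, while the optimistic sender has never waited but has five unlogged messages to reconstruct.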

Message Logging – Sender or Receiver Based

Message logging can be sender based or receiver based.

Receiver based (e.g. MPICH-V): message logs are stored at the receiver side.

Each process has an associated channel memory (CM), called its home CM.

When P1 wants to send a message to P2, it contacts P2's home CM and sends the message to it.

P2 retrieves the message from its home CM.

During a checkpoint, the process state is stored on a checkpoint server (CS).

When P2 crashes, on restart it retrieves its checkpoint from the CS and its messages from the CM.

Sender based (e.g. MPICH-V2): sender-based logging avoids channel memories.

When p sends a message m to q, p logs m. m is associated with an id.

When q receives m, it stores (id, l), where l is the logical clock.

When q crashes and restarts from a checkpoint, it retrieves the (id, l) sets from storage and asks the other processes to resend the messages in those sets.
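The sender-based scheme described above can be sketched in a single process as follows. This is an illustration under stated assumptions, not MPICH-V2 code: there is one sender p and one receiver q, q's (id, l) records are assumed to survive its crash, and the application state is a single integer whose value depends on delivery order. All type and function names are hypothetical.

```c
#include <stddef.h>

#define MAX_MSGS 16

typedef struct { int id; int payload; } SentMsg;  /* entry in sender p's log */
typedef struct { int id; int clock; } Receipt;    /* (id, l) stored by receiver q */

typedef struct { SentMsg log[MAX_MSGS]; size_t n; } SenderLog;

typedef struct {
    Receipt receipts[MAX_MSGS];  /* assumed to survive a crash of q */
    size_t n;
    int clock;                   /* logical clock l */
    int state;                   /* application state, order-sensitive */
} Receiver;

/* q applies a message; the result depends on the delivery order. */
static void apply(Receiver *q, int payload)
{
    q->state = q->state * 31 + payload;
}

/* p sends m to q: p logs m under its id, q delivers it and stores (id, l).
 * Returns 0 on success, -1 if either log is full. */
int send_and_log(SenderLog *p, Receiver *q, int id, int payload)
{
    if (p->n == MAX_MSGS || q->n == MAX_MSGS)
        return -1;
    p->log[p->n].id = id;
    p->log[p->n].payload = payload;
    p->n++;
    q->clock++;
    q->receipts[q->n].id = id;
    q->receipts[q->n].clock = q->clock;
    q->n++;
    apply(q, payload);
    return 0;
}

/* q restarts from a checkpoint taken at logical time ckpt_clock and asks p
 * to resend, in clock order, every message received after that point. */
void recover(Receiver *q, const SenderLog *p, int ckpt_state, int ckpt_clock)
{
    q->state = ckpt_state;
    for (size_t i = 0; i < q->n; i++) {
        if (q->receipts[i].clock <= ckpt_clock)
            continue;  /* already reflected in the checkpoint */
        for (size_t j = 0; j < p->n; j++) {
            if (p->log[j].id == q->receipts[i].id) {
                apply(q, p->log[j].payload);  /* p resends from its log */
                break;
            }
        }
    }
}
```

The receipts are appended in clock order, so replaying them in array order reproduces the original delivery order and therefore the pre-crash state.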

Coordinated Checkpointing

Coordinated Checkpointing

During checkpointing, all processes coordinate by means of a barrier and then checkpoint the respective data

This way, all checkpoints are consistent; messages are cleared at the time of the checkpoints.

During recovery, all processes roll back to their respective consistent state (coordinated)

No need to replay messages

Coordinated Checkpointing

Example: Stop Restart Software (SRS)

A user-level non-transparent checkpointing library.

The user inserts checkpointing calls specifying the data to be checkpointed.

Advantage: small amount of checkpointed data.

Disadvantage: user burden.

Allows reconfiguration of applications: the number of processors and/or the data distribution can be changed.

SRS Example – Original Code

    MPI_Init(&argc, &argv);
    local_size = global_size/size;

    if(rank == 0){
        for(i=0; i<global_size; i++){
            global_A[i] = i;
        }
    }

    MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
    iter_start = 0;
    for(i=iter_start; i<global_size; i++){
        proc_number = i/local_size;
        local_index = i%local_size;
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
    }
    MPI_Finalize();

SRS Example – Modified Code

    MPI_Init(&argc, &argv);
    SRS_Init();
    local_size = global_size/size;
    restart_value = SRS_Restart_Value();

    if(restart_value == 0){
        /* fresh start: initialize and distribute the data */
        if(rank == 0){
            for(i=0; i<global_size; i++){
                global_A[i] = i;
            }
        }
        MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
        iter_start = 0;
    }
    else{
        /* restart: read the checkpointed data instead */
        SRS_Read("A", local_A, BLOCK, NULL);
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }
    SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);

SRS Example – Modified Code (Contd..)

    for(i=iter_start; i<global_size; i++){
        stop_value = SRS_Check_Stop();
        if(stop_value == 1){
            MPI_Finalize();
            exit(0);
        }
        proc_number = i/local_size;
        local_index = i%local_size;
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
    }

    SRS_Finish();
    MPI_Finalize();

Checkpointing Performance

Checkpoint overhead: time added to the running time of the application due to checkpointing.

Checkpoint latency hiding:

Checkpoint buffering: during checkpointing, copy the data to a local buffer, then store the buffer to disk in parallel with application progress.

Copy-on-write buffering: only the modified pages are copied to a buffer; other pages can be stored directly without copying to a buffer. Can be implemented using fork() (forked checkpointing).
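Forked checkpointing can be sketched with plain POSIX calls. The example below is a minimal illustration, assuming a POSIX system and an application state that is just an array of ints; `forked_checkpoint` and `wait_for_checkpoint` are hypothetical helper names. The child sees a copy-on-write snapshot of the parent's memory at the moment of `fork()`, so the parent can keep modifying its data while the child writes the snapshot to disk.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that writes n ints from data to path.
 * Returns the child's pid in the parent, or -1 if fork() failed. */
pid_t forked_checkpoint(const char *path, const int *data, size_t n)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: works on a copy-on-write snapshot of the parent's memory. */
        FILE *f = fopen(path, "wb");
        int ok = f != NULL && fwrite(data, sizeof *data, n, f) == n;
        if (f != NULL && fclose(f) != 0)
            ok = 0;
        _exit(ok ? 0 : 1);  /* _exit avoids flushing the parent's stdio buffers */
    }
    return pid;
}

/* Wait for the checkpointing child.
 * Returns 0 if the checkpoint was written successfully, -1 otherwise. */
int wait_for_checkpoint(pid_t pid)
{
    int status;
    if (pid < 0 || waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

In use, the parent would call `forked_checkpoint`, continue computing, and call `wait_for_checkpoint` only before starting the next checkpoint, so the disk write overlaps with application progress.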

Checkpointing Performance

Reducing checkpoint size: memory exclusion and checkpoint compression.

Memory exclusion: there is no need to store dead and read-only variables.

A dead variable is one whose current value will not be used by the program: either the variable will not be accessed again, or it will be overwritten before it is read.

A read-only variable is one whose value has not changed since the previous checkpoint.

Incremental Checkpointing

Memory exclusion can be made automatic by using incremental checkpointing.

Store only the part of the data that has been modified since the previous checkpoint.

Following a checkpoint, all pages in memory are set to read-only.

When the program attempts to write to a page, an access violation occurs.

During the next checkpoint, only the pages that have caused access violations are checkpointed.
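The page-protection mechanism above can be sketched with `mmap`, `mprotect`, and a fault handler. This is a minimal illustration that assumes a Linux-like POSIX system and tracks only one anonymous region rather than the whole address space. The checkpoint step merely counts dirty pages where a real checkpointer would write them out. `tracker_init`, `checkpoint_dirty_pages`, and `tracked_memory` are hypothetical names.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 64

static char *region;  /* page-aligned memory being tracked */
static size_t page_size, n_pages;
static volatile sig_atomic_t dirty[MAX_PAGES];

/* Fault handler: a write hit a read-only page. Mark the page dirty and
 * make it writable so that the faulting instruction can be retried. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    if (addr < base || addr >= base + n_pages * page_size)
        _exit(1);  /* a genuine fault outside the tracked region */
    size_t page = (addr - base) / page_size;
    dirty[page] = 1;
    mprotect(region + page * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Allocate the tracked region and install the handler. Returns 0 on success. */
int tracker_init(size_t pages)
{
    struct sigaction sa;
    if (pages > MAX_PAGES)
        return -1;
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    n_pages = pages;
    region = mmap(NULL, pages * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGBUS, &sa, NULL);  /* some systems raise SIGBUS instead */
}

/* "Take a checkpoint": count the pages written since the last call,
 * clear their dirty flags, and set the whole region read-only again. */
size_t checkpoint_dirty_pages(void)
{
    size_t saved = 0;
    for (size_t i = 0; i < n_pages; i++) {
        if (dirty[i]) {
            saved++;  /* a real checkpointer would write page i to disk here */
            dirty[i] = 0;
        }
    }
    mprotect(region, n_pages * page_size, PROT_READ);
    return saved;
}

char *tracked_memory(void)
{
    return region;
}
```

Each page faults at most once per checkpoint interval, so the run-time cost is one fault per modified page. The first call to `checkpoint_dirty_pages` plays the role of the initial full checkpoint that arms the protection.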

Checkpointing performance – using compression

Using a standard compression algorithm

This is beneficial only if the extra processing time for compression is lower than the savings that result from writing a smaller file to disk
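The break-even condition can be written as an inequality. The symbols are assumptions introduced here for illustration: S is the uncompressed checkpoint size, S_compressed the compressed size, B the disk write bandwidth, and t_compress the time spent compressing.

```latex
t_{\text{compress}} + \frac{S_{\text{compressed}}}{B} < \frac{S}{B}
\quad\Longleftrightarrow\quad
t_{\text{compress}} < \frac{S - S_{\text{compressed}}}{B}
```

In words, compression pays off when the time it takes is less than the time saved by not writing the bytes it removed.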

Redundancy/replication + checkpointing for fault tolerance

Replication

Every node/process N has a shadow node/process N', so that if one of them fails, the other can still continue the application; failure of the primary node no longer stalls the application.

Redundancy scales: as more nodes are added to the system, the probability that both a node and its shadow fail rapidly decreases.

Only one of the remaining n-1 nodes is the shadow of the failed node.
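The scaling argument can be made concrete with a short calculation. It assumes, for illustration only, that once a node has failed, each of the remaining n-1 nodes is equally likely to be the next to fail.

```latex
P(\text{next failure hits the shadow of the failed node}) = \frac{1}{n-1}
```

For n = 1000 this is about 0.1%, and it keeps shrinking as n grows, so a second failure rarely takes out both copies of the same process.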

Replication

Less overhead for checkpointing: a higher checkpointing interval/period can be used for periodic checkpointing.

Recomputation and restart overheads are nearly eliminated.

Still need checkpointing: why?

Total Redundancy

Partial Redundancy

Replication vs No Replication

References

James S. Plank. "An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance". University of Tennessee Technical Report CS-97-372, July 1997.

James Plank and Thomason. "Processor Allocation and Checkpointing Interval Selection in Cluster Computing Systems". JPDC, 2001.

References

George Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fédak, Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Néri, Anton Selikhov. "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes". SuperComputing 2002, Baltimore, USA, November 2002.

Aurélien Bouteiller, Franck Cappello, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, Frédéric Magniette. "MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on the Pessimistic Sender Based Message Logging". To appear in SuperComputing 2003, Phoenix, USA, November 2003.

References

Vadhiyar, S. and Dongarra, J. "SRS - A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems". Parallel Processing Letters, Vol. 13, No. 2, pp. 291-312, June 2003.

References

Schulz et al. Implementation and Evaluation of a Scalable Application-level Checkpoint-Recovery Scheme for MPI Programs. SC 2004.

References for Replication

Evaluating the viability of process replication reliability for exascale systems. SC 2011.

Combining Partial Redundancy and Checkpointing for HPC. ICDCS 2012