Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12
Fall 2006
UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Fault Tolerant Computing ECE 655
Checkpointing III
Coordinated Checkpointing Algorithms
Uncoordinated checkpointing may lead to the domino effect or to livelock
Two basic approaches to checkpoint coordination:
The Koo-Toueg algorithm, in which a single process initiates the system-wide checkpointing
An algorithm which staggers checkpoints in time; staggering checkpoints can help avoid near-simultaneous heavy loading of the disk system
A third option is communication-induced checkpointing: simultaneously using coordinated and uncoordinated checkpointing algorithms - the latter is sufficient to deal with most isolated failures
Koo-Toueg Algorithm
Suppose P wants to establish a checkpoint at P_3
This checkpoint will record that q1 was received from Q - to prevent q1 from becoming an orphan, Q must checkpoint as well
Thus, establishing a checkpoint at P_3 by P forces Q to take a checkpoint recording that q1 was sent
An algorithm for such coordinated checkpointing has two types of checkpoints - tentative and permanent
P first records its current state in a tentative checkpoint, then sends a message to every process from which it has received a message since taking its last checkpoint
Call the set of such processes S
Koo-Toueg Algorithm - Cont.
The message tells each process in S (e.g., Q) the last message, m_qp, that P received from it before the tentative checkpoint was taken
If m_qp has not been recorded in a checkpoint by Q: to prevent m_qp from becoming an orphan, Q is asked to take a tentative checkpoint recording the sending of m_qp
If all processes in S that need to checkpoint confirm doing so as requested, then all tentative checkpoints are converted to permanent ones
If some members of S are unable to checkpoint as requested, P and all members of S abandon their tentative checkpoints, and none is made permanent
This may set off a chain reaction of checkpoints: each member of S can potentially spawn a set of checkpoints among the processes in its own corresponding set
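The two-phase structure above can be sketched in code. This is a simplified, single-threaded Python sketch (class and function names are illustrative, and it asks every member of the set to checkpoint rather than only those whose last message is unrecorded):

```python
# Simplified single-threaded sketch of the Koo-Toueg two-phase idea
# (class/function names are illustrative; unlike the full algorithm,
# every process in the set is asked to checkpoint, not only those
# whose last message to the requester is unrecorded).
class Process:
    def __init__(self, name, received_from, willing=True):
        self.name = name
        self.received_from = received_from  # processes it received messages from
        self.willing = willing              # able to take a checkpoint right now?
        self.tentative = False
        self.permanent = False

def request_tentative(p, visited):
    """Phase 1: p takes a tentative checkpoint, then recursively asks
    every process it received messages from since its last checkpoint."""
    if p.name in visited:
        return True
    visited.add(p.name)
    if not p.willing:
        return False
    p.tentative = True
    return all(request_tentative(q, visited) for q in p.received_from)

def coordinated_checkpoint(initiator, everyone):
    ok = request_tentative(initiator, set())
    for p in everyone:                      # phase 2: commit or abort everywhere
        p.permanent = p.tentative and ok
        p.tentative = False
    return ok

# Q sent q1 to P, so a checkpoint by P forces Q to checkpoint too.
Q = Process("Q", [])
P = Process("P", [Q])
assert coordinated_checkpoint(P, [P, Q]) and P.permanent and Q.permanent
```

If any process in the chain cannot checkpoint, `coordinated_checkpoint` returns False and no tentative checkpoint is made permanent, mirroring the abort rule above.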
Staggered Checkpointing
The Koo-Toueg algorithm - and others like it - can lead to a large number of processes taking checkpoints at nearly the same time
If they are all writing to a shared stable storage, e.g., a set of common disks, this surge can lead to congestion at the disks or network or both
Either of two approaches can be used to ensure that, at any time, at most one process is taking its checkpoint:
(1) Write the checkpoint into a local buffer, then stagger the writes from buffer to stable storage (this assumes a buffer of sufficiently large capacity)
(2) Stagger the checkpoints themselves in time
Staggered Checkpointing - Cont.
Staggered checkpoints may not be consistent - there may be orphan messages in the system
This can be avoided by a coordinating phase in which each process logs to stable storage all the messages it has sent out since its previous checkpoint
The message-logging phases of the processes will overlap in time
If the volume of messages is smaller than the size of the individual checkpoints, the disks and network will see a reduced surge
Recovery From Failure
If a process fails, it can be restarted after being rolled back to its last checkpoint, with all the messages stored in its log played back
This combination of a checkpoint and a message log is called a logical checkpoint
The staggered checkpointing algorithm guarantees that the logical checkpoints form a consistent recovery line
Phase One of Staggering Algorithm
Phase 1 - the checkpointing phase:
for (i = 0; i <= n-1; i++) {
    P_i takes a checkpoint
    P_i sends a message to P_{(i+1) mod n}, ordering the latter to take a checkpoint
}
When P_0 gets a message from P_{n-1} ordering it to checkpoint - this is the cue for P_0 to initiate the second (message-logging) phase
It sends out a marker message on each of its outgoing channels. When a process P_i receives a marker message, it goes to phase 2
Phase Two of Staggering Algorithm
Phase 2 - the message-logging phase:
if (no previous marker message was received in this round by P_i) then {
    P_i sends a marker message on each of its outgoing channels
    P_i logs all the messages received by it after the preceding checkpoint
}
else
    P_i updates its message log by adding all the messages received by it since the last time the log was updated
end if
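Both phases can be illustrated with a small simulation. The sketch below is an assumption-laden simplification (a fully connected network, one application message per process, synchronous delivery via a queue), not the lecture's algorithm verbatim:

```python
from collections import deque

# Minimal simulation of both phases (names such as Process and
# run_staggered are illustrative): checkpoints are taken one process
# at a time around the ring, then marker messages flood the system
# and each process logs what it has received.
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.checkpoint = None
        self.unlogged = []        # received since last checkpoint/log update
        self.log = []             # message log in stable storage
        self.saw_marker = False

def run_staggered(n=3):
    procs = [Process(i) for i in range(n)]
    # Phase 1: P_i checkpoints, then orders P_{(i+1) mod n} to do the same.
    for p in procs:
        p.checkpoint = f"P{p.pid}-state"
        p.unlogged = [f"app-msg-to-P{p.pid}"]   # pretend one message arrived after
    # Phase 2: P0 (cued by P_{n-1}'s order) logs and floods markers.
    p0 = procs[0]
    p0.saw_marker = True
    p0.log, p0.unlogged = p0.unlogged, []
    net = deque((0, dst) for dst in range(1, n))
    while net:
        _, dst = net.popleft()
        p = procs[dst]
        p.log.extend(p.unlogged)                # add messages since last update
        p.unlogged = []
        if not p.saw_marker:                    # first marker: forward on all channels
            p.saw_marker = True
            net.extend((dst, o) for o in range(n) if o != dst)
    return procs

procs = run_staggered()
assert [p.log for p in procs] == [["app-msg-to-P0"],
                                  ["app-msg-to-P1"],
                                  ["app-msg-to-P2"]]
```

Each process forwards markers exactly once, so the flood terminates; later markers merely trigger log updates, as in the pseudocode above.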
Example of Staggering Algorithm - Phase One
P0 takes a checkpoint and sends a take_checkpoint order to P1
P1 sends such an order to P2 after taking its own checkpoint
P2 sends a take_checkpoint order back to P0
At this point, each of the processes has taken a checkpoint and the second phase can begin
Example - Phase 2
P0 sends a message_log order to P1 and P2, telling them to log the messages they received since their last checkpoint
P1 and P2 send out similar message_log orders
Each time such a message is received, the process logs the messages it has received
If it is the first time the process has received such a message_log order, it also sends out marker messages on each of its outgoing channels
Recovery
Assumption - given the checkpoint and messages received, a process can be recovered
We may have orphan messages with respect to the physical checkpoints taken in the first phase
Orphan messages will not exist with respect to the latest (in time) logical checkpoints that are generated using the physical checkpoint and the message log
Time-Based Synchronization
Orphan messages cannot happen if each process checkpoints at exactly the same time
Practically impossible - clock skews and message communication times cannot be reduced to zero
Time-based synchronization can still be used to facilitate checkpointing - we have to take account of nonzero clock skews
Time-based synchronization - processes are checkpointed at previously agreed times
Example - ask each process to checkpoint when its local clock reads a multiple of 100 seconds
Such a procedure by itself is not enough to avoid orphan messages
Creation of an Orphan Message - Example
Each process is checkpointing at time 1100 (local clock)
Skew between the two clocks is such that process P0 checkpoints much earlier (in real time) than process P1
As a result, P0 sends out a message to P1 after its checkpoint, which is received by P1 before its checkpoint
This message is a potential orphan
Preventing Creation of an Orphan Message
Suppose the skew between any two clocks in the distributed system is bounded by ε, and each process is asked to checkpoint when its local clock reads T
Following its checkpoint, a process Px should not send out messages to any process Py until it is certain that Py's local clock reads more than T
Px should remain silent over the interval [T, T+ε] (all times as measured by Px's local clock)
If the inter-process message delivery time has a lower bound δ, then to prevent orphan messages Px needs to remain silent only during the shorter interval [T, T+ε−δ]
If δ > ε, this interval is of zero length - there is no need for Px to remain silent
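The silence interval can be computed directly. A minimal illustrative helper (the function name is an assumption):

```python
# Interval, on Px's local clock, during which Px must stay silent
# after checkpointing at local time T, given a clock-skew bound eps
# and an optional lower bound delta on message delivery time.
def silence_interval(T, eps, delta=0.0):
    end = T + eps - delta
    return (T, end) if end > T else None      # None: no silence needed

assert silence_interval(1100, 5) == (1100, 1105)
assert silence_interval(1100, 5, delta=2) == (1100, 1103)
assert silence_interval(1100, 5, delta=7) is None   # delta > eps
```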
Different Method of Prevention
Suppose message m is received by process Py when its clock reads t
m must have been sent (by Px) at least δ earlier - before Py's clock read t−δ
Since the clock skew is at most ε, at that time Px's clock could have read at most t−δ+ε
If t−δ+ε < T, the sending of m would have been recorded in Px's checkpoint - m cannot be an orphan
Hence a message m received by Py when its clock reads less than T−ε+δ cannot be an orphan
Orphan messages can be avoided by Py not using, and not including in its checkpoint at T, any message received during [T−ε+δ, T] (on Py's clock) until after it takes its checkpoint at T
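The resulting window of potentially-orphaned receptions can be expressed as a one-line check (illustrative, with assumed parameter names):

```python
# A message received when Py's clock reads t, with clock-skew bound
# eps and minimum delivery time delta, is a potential orphan only
# inside the window [T - eps + delta, T).
def potential_orphan(t, T, eps, delta):
    return T - eps + delta <= t < T

# Earlier arrivals were provably sent before Px's checkpoint at T;
# arrivals at T or later follow Py's own checkpoint.
assert potential_orphan(1098, 1100, eps=5, delta=1)       # inside [1096, 1100)
assert not potential_orphan(1095, 1100, eps=5, delta=1)   # recorded by Px
assert not potential_orphan(1100, 1100, eps=5, delta=1)   # after Py's checkpoint
```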
Diskless Checkpointing
Memory is volatile and thus normally unsuitable for storing a checkpoint
However, with extra processors, we can permit checkpointing in main memory
By avoiding disk writes, checkpointing can be faster
Best used as one level in a two-level checkpointing scheme
Have redundant processors using RAID-like techniques to deal with failures
Example: a distributed system with five executing processors and one extra processor
Each executing processor stores its checkpoint in its own memory; the extra processor stores the parity of these checkpoints
If an executing processor fails, its checkpoint can be reconstructed from the remaining four checkpoints plus the parity checkpoint
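The parity arithmetic in this example is plain XOR. A minimal sketch:

```python
# The extra processor holds the bytewise XOR (parity) of the five
# in-memory checkpoints; any single lost checkpoint equals the XOR
# of the surviving checkpoints and the parity.
def parity(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

ckpts = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"]  # five executing processors
p = parity(ckpts)                                      # stored on the extra processor
# Processor 2 fails: rebuild its checkpoint from the other four plus the parity.
assert parity(ckpts[:2] + ckpts[3:] + [p]) == b"CCCC"
```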
RAID-like Diskless Checkpointing
The inter-processor network must have enough bandwidth for sending checkpoints
Example: with n executing processors and one checkpointing processor, if all the executing processors send their checkpoints to the checkpointing processor to calculate the parity, that processor becomes a potential hotspot
Solution: distribute the parity computation
Two-Level Recovery
Coordinating checkpoints prevents orphan messages but imposes overhead
Forgoing coordination will not affect correctness if failures are isolated, i.e., at most one process is in a failed/recovering state at any time
The vast majority of failures are isolated
Make recovery from isolated failures fast; accept longer recovery times for simultaneous failures
This suggests a two-level recovery scheme
First level: each process takes its own checkpoints without coordination (useful only when recovering from isolated failures)
The checkpoint need not be written to disk; it can be written into the memory of another processor
Second level: occasionally the entire system undergoes coordinated checkpointing (with higher overhead), which guards against non-isolated failures
Two-Level Recovery - Example
P0 fails at t0; the system rolls P0 back to its latest first-level checkpoint; recovery is successful
P1 fails at t1 and rolls back; at point tx (during recovery), P2 also fails
Non-isolated failures: the system rolls both processes back to the latest second-level checkpoint
In general, the more common non-isolated failures are, the greater must be the frequency at which second-level checkpoints are taken
Message Logging
To continue computation beyond the latest checkpoint, a recovering process may require all the messages it received since then, played back in their original order
With coordinated checkpointing, each process can be rolled back to its latest checkpoint and restarted: those messages will be resent during re-execution
To avoid the overhead of coordination and let processes checkpoint independently, logging messages is an option
Two approaches to message logging:
Pessimistic logging - ensures that rollback will not spread, i.e., if a process fails, no other process needs to be rolled back to ensure consistency
Optimistic logging - a process failure may trigger the rollback of other processes as well
Pessimistic Message Logging
Simplest approach - the receiver stops whatever it is doing when it receives a message, logs the message to stable storage, and then resumes execution
To recover a process from failure, roll it back to its latest checkpoint and play back to it the messages it received since that checkpoint, in the right order
No orphan messages will exist - every message will either have been received before the latest checkpoint or explicitly saved in the message log
Rolling back one process will therefore not trigger the rollback of any other process
Sender-Based Message Logging
Logging messages into stable storage can impose a significant overhead
To guard against one isolated failure at a time, sender-based message logging can be used
The sender of a message records it in a log; when required, the log is read to replay the message
Each process has a send-counter and a receive-counter, incremented every time the process sends or receives a message
Each message carries a Send Sequence Number (SSN) - the value of the send-counter when it is transmitted
A received message is allocated a Receive Sequence Number (RSN) - the value of the receive-counter when it is received
The receiver also sends an ACK to the sender, including the RSN it has allocated to the message
Upon receiving this ACK, the sender acknowledges the ACK in a message to the receiver
Sender-Based Message Logging - Cont’d
Between the time the receiver receives the message and sends its ACK, and the time it receives the sender's ACK of its own ACK, the receiver is forbidden to send messages to other processes - this is essential to maintaining correct functioning upon recovery
A message is said to be fully logged when the sending node knows both its SSN and its RSN; it is partially logged when the sending node does not yet know its RSN
When a process rolls back and restarts computation from its latest checkpoint, it sends the other processes a message listing, for each of them, the SSN of the latest message from that process recorded in its checkpoint
When a process receives this message, it knows which messages need to be retransmitted, and retransmits them
The recovering process must now consume these messages in the same order as before it failed - easy to do for fully-logged messages, since their RSNs are available and they can be sorted by this number
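The sender-side bookkeeping can be sketched as follows (class and method names are assumptions; the replay order puts fully-logged messages first, sorted by RSN):

```python
# Illustrative sketch of sender-based logging: each sent message is
# kept with its SSN, the RSN is filled in when the ACK arrives, and
# replay orders fully-logged messages by RSN.
class SenderLog:
    def __init__(self):
        self.send_counter = 0
        self.log = {}                  # SSN -> [message, RSN or None]

    def send(self, msg):
        self.send_counter += 1
        self.log[self.send_counter] = [msg, None]   # partially logged for now
        return self.send_counter

    def on_ack(self, ssn, rsn):
        self.log[ssn][1] = rsn         # fully logged: both SSN and RSN known

    def replay_after(self, last_ssn_in_checkpoint):
        """Messages a recovering receiver needs: fully-logged ones first,
        in RSN order; partially-logged ones (RSN unknown) come last -
        their order cannot matter, since the receiver sent nothing
        between receiving them and (not) ACKing them."""
        pending = [(rsn, msg) for ssn, (msg, rsn) in self.log.items()
                   if ssn > last_ssn_in_checkpoint]
        pending.sort(key=lambda x: (x[0] is None, x[0] or 0))
        return [msg for _, msg in pending]

s = SenderLog()
for m in ("a", "b", "c"):
    s.send(m)
s.on_ack(2, 1)   # "b" was received first (RSN 1)
s.on_ack(1, 2)   # then "a" (RSN 2); "c" remains partially logged
assert s.replay_after(0) == ["b", "a", "c"]
```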
Partially-Logged Messages
The remaining problem is the partially-logged messages, whose RSNs are not available
They were sent out, but their ACKs were never received by the sender
Either the receiver failed before the message could be delivered to it, or it failed after receiving the message but before it could send out the ACK
The receiver is forbidden to send out messages of its own to other processes between receiving a message and sending out its ACK
As a result, receiving the partially-logged messages in a different order the second time cannot affect any other process in the system - correctness is preserved
Clearly, this approach is only guaranteed to work if there is at most one failed node at any time
Optimistic Message Logging
Optimistic message logging has lower overhead than pessimistic logging; however, recovery from failure is much more complex
Optimistic logging is mainly of theoretical interest
When messages are received, they are written into a volatile buffer which, at a suitable time, is copied to stable storage
Process execution is not disrupted, so the logging overhead is very low
Upon failure, the contents of the buffer may be lost, leading to multiple processes having to be rolled back
We need a scheme to handle this situation
Checkpointing in Shared-Memory Systems
A variant of CARER for shared-memory bus-based multiprocessors - each processor has its own cache
The algorithm is changed to maintain cache coherence among the multiple caches
Instead of a single bit marking a line as unwritable, we have a multi-bit scheme:
A checkpoint identifier, C_id, with each cache line
A (per-processor) checkpoint counter, C_count, keeping track of the current checkpoint number
Shared Memory - Cont.
To take a checkpoint, increment the counter
A line last modified before the current checkpoint will have its C_id less than the counter
When a line is updated, set C_id = C_count
If a line has been modified since being brought into the cache and C_id < C_count, the line is part of the checkpoint state and is therefore unwritable; any write into such a line must wait until the line has first been written back to main memory
If the counter has k bits, it rolls over to 0 after reaching 2^k - 1
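A minimal sketch of the write rule (names are assumptions; note that with a k-bit wrap-around counter, the comparison C_id < C_count becomes an inequality test):

```python
# Illustrative sketch of the C_id / C_count rule: a modified line
# whose C_id differs from the current checkpoint counter belongs to
# the checkpoint and must reach main memory before being overwritten.
K = 2                                   # counter width in bits

class CacheLine:
    def __init__(self):
        self.modified = False
        self.c_id = 0

def take_checkpoint(c_count):
    return (c_count + 1) % (2 ** K)     # rolls over to 0 after 2**K - 1

def write_line(line, c_count, flush):
    if line.modified and line.c_id != c_count:
        flush(line)                     # write checkpointed line back first
    line.modified = True
    line.c_id = c_count

flushed = []
line, c_count = CacheLine(), 0
write_line(line, c_count, flushed.append)   # fresh line: no flush needed
c_count = take_checkpoint(c_count)
write_line(line, c_count, flushed.append)   # checkpointed line: flushed first
assert len(flushed) == 1
```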
Bus-Based Coherence Protocol
Modify a cache coherence algorithm to take account of checkpointing
All traffic between the caches and memory must use the bus, i.e., all caches can watch the traffic on the bus
A cache line can be in one of the following states: invalid, shared unmodified, exclusive modified, and exclusive unmodified
Exclusive - this is the only valid copy of the line in any cache
Modified - the line has been modified since it was brought into the cache from memory
Bus-Based Coherence Protocol - Cont’d
If processor wants to update a line in shared unmodified state, it moves into exclusive modified state
Other caches holding the same line must invalidate their copies - no longer current
When in the exclusive modified or exclusive unmodified states, another cache puts out a read request on the bus, this cache must service that request (only current copy of that line)
Byproduct- memory is also updated if necessary Then, move to shared unmodified Write miss, line into cache - exclusive modified
Bus-Based Coherence and Checkpointing Protocol
How can we modify this protocol to account for checkpointing?
The original exclusive modified state now splits into two:
Exclusive modified
Unwritable
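The lecture does not spell out the resulting transitions; the following is a hedged sketch of how the two split states might behave (state names and transitions here are assumptions), consistent with the earlier rule that writes to checkpointed lines must wait for a write-back:

```python
# Sketch: on a checkpoint, exclusive-modified lines become unwritable
# (part of the checkpoint state); a later local write must first
# flush the checkpointed value back to memory.
INVALID, SHARED_UNMOD, EXCL_UNMOD, EXCL_MOD, UNWRITABLE = range(5)

def on_checkpoint(state):
    return UNWRITABLE if state == EXCL_MOD else state

def on_local_write(state, flush):
    if state == UNWRITABLE:
        flush()                  # write checkpointed line back to memory first
    return EXCL_MOD

flushes = []
state = on_checkpoint(EXCL_MOD)                  # now part of the checkpoint
state = on_local_write(state, lambda: flushes.append(1))
assert state == EXCL_MOD and len(flushes) == 1
```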
Directory-Based Protocol
In this approach, a directory is maintained centrally, recording the status of each line
We can regard this directory as being controlled by a shared-memory controller
This controller handles all read and write misses and all other operations that change a line's state
Example: if a line is in the exclusive unmodified state and the cache holding it wants to modify it, the cache notifies the controller of its intention
The controller can then change the line's state to exclusive modified
It is then a simple matter to implement the checkpointing scheme atop such a protocol
Other Uses of Checkpointing
(1) Process Migration
A checkpoint represents process state - migrating a process from one processor to another means moving the checkpoint, and computation can then resume on the new processor; this can be used to recover from permanent or intermittent faults
The nature of the checkpoint determines whether the new processor must be of the same model and run the same operating system
(2) Load Balancing
Better utilization of a distributed system by ensuring that the computational load is appropriately shared among the processors
(3) Debugging
Core files are dumped when a program exits abnormally - these are essentially checkpoints, containing full state information about the affected process; debuggers can read core files and aid in the debugging process
(4) Snapshots
Observing the program state at discrete epochs gives a deeper understanding of program behavior