Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12
Fall 2006
UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Fault Tolerant Computing ECE 655
Checkpointing III
Coordinated Checkpointing Algorithms
Uncoordinated checkpointing may lead to the domino effect or to livelock
Two basic approaches to checkpoint coordination:
The Koo-Toueg algorithm, in which a single process initiates the system-wide checkpointing
An algorithm which staggers checkpoints in time; staggering checkpoints can help avoid near-simultaneous heavy loading of the disk system
A third option is communication-induced checkpointing: simultaneously using coordinated and uncoordinated checkpointing algorithms - the latter is sufficient to deal with most isolated failures
Koo-Toueg Algorithm
Suppose P wants to establish a checkpoint at P_3
This checkpoint will record that q1 was received from Q - to prevent q1 from becoming an orphan, Q must checkpoint as well
Thus, establishing a checkpoint at P_3 by P forces Q to take a checkpoint recording that q1 was sent
An algorithm for such coordinated checkpointing has two types of checkpoints - tentative and permanent
P first records its current state in a tentative checkpoint, then sends a message to every process from which it has received a message since taking its last checkpoint
Call the set of such processes S
Koo-Toueg Algorithm - Cont.
The message tells each process in S (e.g., Q) the last message, m_qp, that P received from it before the tentative checkpoint was taken
If m_qp has not been recorded in a checkpoint by Q: to prevent m_qp from becoming an orphan, Q is asked to take a tentative checkpoint recording the sending of m_qp
If all processes in S that need to checkpoint confirm doing so as requested, then all tentative checkpoints are converted to permanent ones
If some members of S are unable to checkpoint as requested, P and all members of S abandon their tentative checkpoints, and none is made permanent
This may set off a chain reaction of checkpoints: each member of S can potentially spawn a set of checkpoints among the processes in its own corresponding set
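The two-phase structure above can be sketched in code. This is a simplified, single-threaded Python sketch (class and function names are illustrative, and it asks every member of the set to checkpoint rather than only those whose last message is unrecorded):

```python
# Simplified single-threaded sketch of the Koo-Toueg two-phase idea
# (class/function names are illustrative; unlike the full algorithm,
# every process in the set is asked to checkpoint, not only those
# whose last message to the requester is unrecorded).
class Process:
    def __init__(self, name, received_from, willing=True):
        self.name = name
        self.received_from = received_from  # processes it received messages from
        self.willing = willing              # able to take a checkpoint right now?
        self.tentative = False
        self.permanent = False

def request_tentative(p, visited):
    """Phase 1: p takes a tentative checkpoint, then recursively asks
    every process it received messages from since its last checkpoint."""
    if p.name in visited:
        return True
    visited.add(p.name)
    if not p.willing:
        return False
    p.tentative = True
    return all(request_tentative(q, visited) for q in p.received_from)

def coordinated_checkpoint(initiator, everyone):
    ok = request_tentative(initiator, set())
    for p in everyone:                      # phase 2: commit or abort everywhere
        p.permanent = p.tentative and ok
        p.tentative = False
    return ok

# Q sent q1 to P, so a checkpoint by P forces Q to checkpoint too.
Q = Process("Q", [])
P = Process("P", [Q])
assert coordinated_checkpoint(P, [P, Q]) and P.permanent and Q.permanent
```

If any process in the chain cannot checkpoint, `coordinated_checkpoint` returns False and no tentative checkpoint is made permanent, mirroring the abort rule above.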
Staggered Checkpointing
The Koo-Toueg algorithm - and others like it - can lead to a large number of processes taking checkpoints at nearly the same time
If they are all writing to a shared stable storage, e.g., a set of common disks, this surge can lead to congestion at the disks or network or both
Either of two approaches can be used to ensure that, at any time, at most one process is taking its checkpoint:
(1) Write the checkpoint into a local buffer, then stagger the writes from buffer to stable storage (this assumes a buffer of sufficiently large capacity)
(2) Stagger the checkpoints themselves in time
Staggered Checkpointing - Cont.
Staggered checkpoints may not be consistent - there may be orphan messages in the system
This can be avoided by a coordinating phase in which each process logs to stable storage all the messages it has sent out since its previous checkpoint
The message-logging phases of the processes will overlap in time
If the volume of messages is smaller than the size of the individual checkpoints, the disks and network will see a reduced surge
Recovery From Failure
If a process fails, it can be restarted after being rolled back to its last checkpoint, with all the messages stored in its log played back
This combination of a checkpoint and a message log is called a logical checkpoint
The staggered checkpointing algorithm guarantees that the logical checkpoints form a consistent recovery line
Phase One of Staggering Algorithm
Phase 1 - the checkpointing phase:
for (i = 0; i <= n-1; i++) {
    P_i takes a checkpoint
    P_i sends a message to P_{(i+1) mod n}, ordering the latter to take a checkpoint
}
When P_0 gets a message from P_{n-1} ordering it to checkpoint - this is the cue for P_0 to initiate the second (message-logging) phase
It sends out a marker message on each of its outgoing channels. When a process P_i receives a marker message, it goes to phase 2
Phase Two of Staggering Algorithm
Phase 2 - the message-logging phase:
if (no previous marker message was received in this round by P_i) then {
    P_i sends a marker message on each of its outgoing channels
    P_i logs all the messages received by it after the preceding checkpoint
}
else
    P_i updates its message log by adding all the messages received by it since the last time the log was updated
end if
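Both phases can be illustrated with a small simulation. The sketch below is an assumption-laden simplification (a fully connected network, one application message per process, synchronous delivery via a queue), not the lecture's algorithm verbatim:

```python
from collections import deque

# Minimal simulation of both phases (names such as Process and
# run_staggered are illustrative): checkpoints are taken one process
# at a time around the ring, then marker messages flood the system
# and each process logs what it has received.
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.checkpoint = None
        self.unlogged = []        # received since last checkpoint/log update
        self.log = []             # message log in stable storage
        self.saw_marker = False

def run_staggered(n=3):
    procs = [Process(i) for i in range(n)]
    # Phase 1: P_i checkpoints, then orders P_{(i+1) mod n} to do the same.
    for p in procs:
        p.checkpoint = f"P{p.pid}-state"
        p.unlogged = [f"app-msg-to-P{p.pid}"]   # pretend one message arrived after
    # Phase 2: P0 (cued by P_{n-1}'s order) logs and floods markers.
    p0 = procs[0]
    p0.saw_marker = True
    p0.log, p0.unlogged = p0.unlogged, []
    net = deque((0, dst) for dst in range(1, n))
    while net:
        _, dst = net.popleft()
        p = procs[dst]
        p.log.extend(p.unlogged)                # add messages since last update
        p.unlogged = []
        if not p.saw_marker:                    # first marker: forward on all channels
            p.saw_marker = True
            net.extend((dst, o) for o in range(n) if o != dst)
    return procs

procs = run_staggered()
assert [p.log for p in procs] == [["app-msg-to-P0"],
                                  ["app-msg-to-P1"],
                                  ["app-msg-to-P2"]]
```

Each process forwards markers exactly once, so the flood terminates; later markers merely trigger log updates, as in the pseudocode above.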
Example of Staggering Algorithm - Phase One
P0 takes a checkpoint and sends a take_checkpoint order to P1
P1 sends such an order to P2 after taking its own checkpoint
P2 sends a take_checkpoint order back to P0
At this point, each of the processes has taken a checkpoint and the second phase can begin
Example - Phase 2
P0 sends a message_log order to P1 and P2, telling them to log the messages they received since their last checkpoint
P1 and P2 send out similar message_log orders
Each time such a message is received, the process logs the messages it has received
If it is the first time the process has received such a message_log order, it also sends out marker messages on each of its outgoing channels
Recovery
Assumption - given the checkpoint and messages received, a process can be recovered
We may have orphan messages with respect to the physical checkpoints taken in the first phase
Orphan messages will not exist with respect to the latest (in time) logical checkpoints that are generated using the physical checkpoint and the message log
Time-Based Synchronization
Orphan messages cannot happen if each process checkpoints at exactly the same time
Practically impossible - clock skews and message communication times cannot be reduced to zero
Time-based synchronization can still be used to facilitate checkpointing - we have to take account of nonzero clock skews
Time-based synchronization - processes are checkpointed at previously agreed times
Example - ask each process to checkpoint when its local clock reads a multiple of 100 seconds
Such a procedure by itself is not enough to avoid orphan messages
Creation of an Orphan Message - Example
Each process is checkpointing at time 1100 (local clock)
Skew between the two clocks is such that process P0 checkpoints much earlier (in real time) than process P1
As a result, P0 sends out a message to P1 after its checkpoint, which is received by P1 before its checkpoint
This message is a potential orphan
Preventing Creation of an Orphan Message
Suppose the skew between any two clocks in the distributed system is bounded by ε, and each process is asked to checkpoint when its local clock reads T
Following its checkpoint, a process Px should not send out messages to any process Py until it is certain that Py's local clock reads more than T
Px should remain silent over the interval [T, T+ε] (all times as measured by Px's local clock)
If the inter-process message delivery time has a lower bound δ, then to prevent orphan messages Px needs to remain silent only during the shorter interval [T, T+ε−δ]
If δ > ε, this interval is of zero length - there is no need for Px to remain silent
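The silence interval can be computed directly. A minimal illustrative helper (the function name is an assumption):

```python
# Interval, on Px's local clock, during which Px must stay silent
# after checkpointing at local time T, given a clock-skew bound eps
# and an optional lower bound delta on message delivery time.
def silence_interval(T, eps, delta=0.0):
    end = T + eps - delta
    return (T, end) if end > T else None      # None: no silence needed

assert silence_interval(1100, 5) == (1100, 1105)
assert silence_interval(1100, 5, delta=2) == (1100, 1103)
assert silence_interval(1100, 5, delta=7) is None   # delta > eps
```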
Different Method of Prevention
Suppose message m is received by process Py when its clock reads t
m must have been sent (by Px) at least δ earlier - before Py's clock read t−δ
Since the clock skew is at most ε, at that time Px's clock could have read at most t−δ+ε
If t−δ+ε < T, the sending of m would have been recorded in Px's checkpoint - m cannot be an orphan
Hence a message m received by Py when its clock reads less than T−ε+δ cannot be an orphan
Orphan messages can be avoided by Py not using, and not including in its checkpoint at T, any message received during [T−ε+δ, T] (on Py's clock) until after it takes its checkpoint at T
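The resulting window of potentially-orphaned receptions can be expressed as a one-line check (illustrative, with assumed parameter names):

```python
# A message received when Py's clock reads t, with clock-skew bound
# eps and minimum delivery time delta, is a potential orphan only
# inside the window [T - eps + delta, T).
def potential_orphan(t, T, eps, delta):
    return T - eps + delta <= t < T

# Earlier arrivals were provably sent before Px's checkpoint at T;
# arrivals at T or later follow Py's own checkpoint.
assert potential_orphan(1098, 1100, eps=5, delta=1)       # inside [1096, 1100)
assert not potential_orphan(1095, 1100, eps=5, delta=1)   # recorded by Px
assert not potential_orphan(1100, 1100, eps=5, delta=1)   # after Py's checkpoint
```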
Diskless Checkpointing
Memory is volatile and thus normally unsuitable for storing a checkpoint
However, with extra processors, we can permit checkpointing in main memory
By avoiding disk writes, checkpointing can be faster
Best used as one level in a two-level checkpointing scheme
Have redundant processors using RAID-like techniques to deal with failures
Example: a distributed system with five executing processors and one extra processor
Each executing processor stores its checkpoint in its own memory; the extra processor stores the parity of these checkpoints
If an executing processor fails, its checkpoint can be reconstructed from the remaining four checkpoints plus the parity checkpoint
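The parity arithmetic in this example is plain XOR. A minimal sketch:

```python
# The extra processor holds the bytewise XOR (parity) of the five
# in-memory checkpoints; any single lost checkpoint equals the XOR
# of the surviving checkpoints and the parity.
def parity(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

ckpts = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"]  # five executing processors
p = parity(ckpts)                                      # stored on the extra processor
# Processor 2 fails: rebuild its checkpoint from the other four plus the parity.
assert parity(ckpts[:2] + ckpts[3:] + [p]) == b"CCCC"
```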
RAID-like Diskless Checkpointing
The inter-processor network must have enough bandwidth for sending checkpoints
Example: with n executing processors and one checkpointing processor, if all the executing processors send their checkpoints to the checkpointing processor to calculate the parity, that processor becomes a potential hotspot
Solution: distribute the parity computation
Two-Level Recovery
Coordinating checkpoints prevents orphan messages but imposes overhead
Forgoing coordination will not affect correctness if failures are isolated, i.e., at most one process is in a failed/recovering state at any time
The vast majority of failures are isolated
Make recovery from isolated failures fast; accept longer recovery times for simultaneous failures
This suggests a two-level recovery scheme
First level: each process takes its own checkpoints without coordination (useful only when recovering from isolated failures)
The checkpoint need not be written to disk; it can be written into the memory of another processor
Second level: occasionally the entire system undergoes coordinated checkpointing (with higher overhead), which guards against non-isolated failures
Two-Level Recovery - Example
P0 fails at t0; the system rolls P0 back to its latest first-level checkpoint; recovery is successful
P1 fails at t1 and rolls back; at point tx (during recovery), P2 also fails
Non-isolated failures: the system rolls both processes back to the latest second-level checkpoint
In general, the more common non-isolated failures are, the greater must be the frequency at which second-level checkpoints are taken
Message Logging
To continue computation beyond the latest checkpoint, a recovering process may require all the messages it received since then, played back in their original order
With coordinated checkpointing, each process can be rolled back to its latest checkpoint and restarted: those messages will be resent during re-execution
To avoid the overhead of coordination and let processes checkpoint independently, logging messages is an option
Two approaches to message logging:
Pessimistic logging - ensures that rollback will not spread, i.e., if a process fails, no other process needs to be rolled back to ensure consistency
Optimistic logging - a process failure may trigger the rollback of other processes as well
Pessimistic Message Logging
Simplest approach - the receiver stops whatever it is doing when it receives a message, logs the message to stable storage, and then resumes execution
To recover a process from failure, roll it back to its latest checkpoint and play back to it the messages it received since that checkpoint, in the right order
No orphan messages will exist - every message will either have been received before the latest checkpoint or explicitly saved in the message log
Rolling back one process will therefore not trigger the rollback of any other process
Sender-Based Message Logging
Logging messages into stable storage can impose a significant overhead
To guard against one isolated failure at a time, sender-based message logging can be used
The sender of a message records it in a log; when required, the log is read to replay the message
Each process has a send-counter and a receive-counter, incremented every time the process sends or receives a message
Each message carries a Send Sequence Number (SSN) - the value of the send-counter when it is transmitted
A received message is allocated a Receive Sequence Number (RSN) - the value of the receive-counter when it is received
The receiver also sends an ACK to the sender, including the RSN it has allocated to the message
Upon receiving this ACK, the sender acknowledges the ACK in a message to the receiver
Sender-Based Message Logging - Cont’d
Between the time the receiver receives the message and sends its ACK, and the time it receives the sender's ACK of its own ACK, the receiver is forbidden to send messages to other processes - this is essential to maintaining correct functioning upon recovery
A message is said to be fully logged when the sending node knows both its SSN and its RSN; it is partially logged when the sending node does not yet know its RSN
When a process rolls back and restarts computation from its latest checkpoint, it sends the other processes a message listing, for each of them, the SSN of the latest message from that process recorded in its checkpoint
When a process receives this message, it knows which messages need to be retransmitted, and retransmits them
The recovering process must now consume these messages in the same order as before it failed - easy to do for fully-logged messages, since their RSNs are available and they can be sorted by this number
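The sender-side bookkeeping can be sketched as follows (class and method names are assumptions; the replay order puts fully-logged messages first, sorted by RSN):

```python
# Illustrative sketch of sender-based logging: each sent message is
# kept with its SSN, the RSN is filled in when the ACK arrives, and
# replay orders fully-logged messages by RSN.
class SenderLog:
    def __init__(self):
        self.send_counter = 0
        self.log = {}                  # SSN -> [message, RSN or None]

    def send(self, msg):
        self.send_counter += 1
        self.log[self.send_counter] = [msg, None]   # partially logged for now
        return self.send_counter

    def on_ack(self, ssn, rsn):
        self.log[ssn][1] = rsn         # fully logged: both SSN and RSN known

    def replay_after(self, last_ssn_in_checkpoint):
        """Messages a recovering receiver needs: fully-logged ones first,
        in RSN order; partially-logged ones (RSN unknown) come last -
        their order cannot matter, since the receiver sent nothing
        between receiving them and (not) ACKing them."""
        pending = [(rsn, msg) for ssn, (msg, rsn) in self.log.items()
                   if ssn > last_ssn_in_checkpoint]
        pending.sort(key=lambda x: (x[0] is None, x[0] or 0))
        return [msg for _, msg in pending]

s = SenderLog()
for m in ("a", "b", "c"):
    s.send(m)
s.on_ack(2, 1)   # "b" was received first (RSN 1)
s.on_ack(1, 2)   # then "a" (RSN 2); "c" remains partially logged
assert s.replay_after(0) == ["b", "a", "c"]
```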
Partially-Logged Messages
The remaining problem is the partially-logged messages, whose RSNs are not available
They were sent out, but their ACKs were never received by the sender
Either the receiver failed before the message could be delivered to it, or it failed after receiving the message but before it could send out the ACK
The receiver is forbidden to send out messages of its own to other processes between receiving a message and sending out its ACK
As a result, receiving the partially-logged messages in a different order the second time cannot affect any other process in the system - correctness is preserved
Clearly, this approach is only guaranteed to work if there is at most one failed node at any time
Optimistic Message Logging
Optimistic message logging has lower overhead than pessimistic logging; however, recovery from failure is much more complex
Optimistic logging is mainly of theoretical interest
When messages are received, they are written into a volatile buffer which, at a suitable time, is copied to stable storage
Process execution is not disrupted, so the logging overhead is very low
Upon failure, the contents of the buffer may be lost, leading to multiple processes having to be rolled back
We need a scheme to handle this situation
Checkpointing in Shared-Memory Systems
A variant of CARER for shared-memory bus-based multiprocessors - each processor has its own cache
The algorithm is changed to maintain cache coherence among the multiple caches
Instead of a single bit marking a line as unwritable, we have a multi-bit scheme:
A checkpoint identifier, C_id, with each cache line
A (per-processor) checkpoint counter, C_count, keeping track of the current checkpoint number
Shared Memory - Cont.
To take a checkpoint, increment the counter
A line last modified before the current checkpoint will have its C_id less than the counter
When a line is updated, set C_id = C_count
If a line has been modified since being brought into the cache and C_id < C_count, the line is part of the checkpoint state and is therefore unwritable; any write into such a line must wait until the line has first been written back to main memory
If the counter has k bits, it rolls over to 0 after reaching 2^k - 1
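A minimal sketch of the write rule (names are assumptions; note that with a k-bit wrap-around counter, the comparison C_id < C_count becomes an inequality test):

```python
# Illustrative sketch of the C_id / C_count rule: a modified line
# whose C_id differs from the current checkpoint counter belongs to
# the checkpoint and must reach main memory before being overwritten.
K = 2                                   # counter width in bits

class CacheLine:
    def __init__(self):
        self.modified = False
        self.c_id = 0

def take_checkpoint(c_count):
    return (c_count + 1) % (2 ** K)     # rolls over to 0 after 2**K - 1

def write_line(line, c_count, flush):
    if line.modified and line.c_id != c_count:
        flush(line)                     # write checkpointed line back first
    line.modified = True
    line.c_id = c_count

flushed = []
line, c_count = CacheLine(), 0
write_line(line, c_count, flushed.append)   # fresh line: no flush needed
c_count = take_checkpoint(c_count)
write_line(line, c_count, flushed.append)   # checkpointed line: flushed first
assert len(flushed) == 1
```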
Bus-Based Coherence Protocol
Modify a cache coherence algorithm to take account of checkpointing
All traffic between the caches and memory must use the bus, i.e., all caches can watch the traffic on the bus
A cache line can be in one of the following states: invalid, shared unmodified, exclusive modified, and exclusive unmodified
Exclusive - this is the only valid copy of the line in any cache
Modified - the line has been modified since it was brought into the cache from memory
Bus-Based Coherence Protocol - Cont’d
If processor wants to update a line in shared unmodified state, it moves into exclusive modified state
Other caches holding the same line must invalidate their copies - no longer current
When in the exclusive modified or exclusive unmodified states, another cache puts out a read request on the bus, this cache must service that request (only current copy of that line)
Byproduct- memory is also updated if necessary Then, move to shared unmodified Write miss, line into cache - exclusive modified
Bus-Based Coherence and Checkpointing Protocol
How can we modify this protocol to account for checkpointing?
The original exclusive modified state now splits into two:
Exclusive modified
Unwritable
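The lecture does not spell out the resulting transitions; the following is a hedged sketch of how the two split states might behave (state names and transitions here are assumptions), consistent with the earlier rule that writes to checkpointed lines must wait for a write-back:

```python
# Sketch: on a checkpoint, exclusive-modified lines become unwritable
# (part of the checkpoint state); a later local write must first
# flush the checkpointed value back to memory.
INVALID, SHARED_UNMOD, EXCL_UNMOD, EXCL_MOD, UNWRITABLE = range(5)

def on_checkpoint(state):
    return UNWRITABLE if state == EXCL_MOD else state

def on_local_write(state, flush):
    if state == UNWRITABLE:
        flush()                  # write checkpointed line back to memory first
    return EXCL_MOD

flushes = []
state = on_checkpoint(EXCL_MOD)                  # now part of the checkpoint
state = on_local_write(state, lambda: flushes.append(1))
assert state == EXCL_MOD and len(flushes) == 1
```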
Directory-Based Protocol
In this approach, a directory is maintained centrally, recording the status of each line
We can regard this directory as being controlled by a shared-memory controller
This controller handles all read and write misses and all other operations that change a line's state
Example: if a line is in the exclusive unmodified state and the cache holding it wants to modify it, the cache notifies the controller of its intention
The controller can then change the line's state to exclusive modified
It is then a simple matter to implement the checkpointing scheme atop such a protocol
Other Uses of Checkpointing
(1) Process Migration
A checkpoint represents process state - migrating a process from one processor to another means moving the checkpoint, and computation can then resume on the new processor; this can be used to recover from permanent or intermittent faults
The nature of the checkpoint determines whether the new processor must be of the same model and run the same operating system
(2) Load Balancing
Better utilization of a distributed system by ensuring that the computational load is appropriately shared among the processors
(3) Debugging
Core files are dumped when a program exits abnormally - these are essentially checkpoints, containing full state information about the affected process; debuggers can read core files and aid in the debugging process
(4) Snapshots
Observing the program state at discrete epochs gives a deeper understanding of program behavior