Scalable Algorithms for Global Snapshots in Distributed Systems

Rahul Garg IBM India Research Lab

Vijay K. Garg Univ. of Texas at Austin

Yogish Sabharwal IBM India Research Lab

Motivation for Global Snapshot

Checkpoint to tolerate faults
  Take global snapshot periodically
  On failure, restart from the last checkpoint

Global property detection
  Detecting deadlock, loss-of-a-token etc.

Distributed Debugging
  Inspecting the global state

Consistent and inconsistent cuts

G1 is not consistent

G2 is consistent but m3 must be recorded

[Figure: processes P1, P2 and P3; messages m1, m2 and m3; cuts G1 and G2]

Model of the System

No shared clock

No shared memory

Processes communicate using messages

Messages are reliable

No upper bound on delivery of messages

Checkpoint

A process must be red to receive a red message.

A white process turns red on receiving a red message.

Any white message received by a red process must be recorded as an in-transit message.

[Figure: processes P and Q exchanging messages labeled ww, wr, rw and rr]

Classification of Messages
  w – white process (pre-recording local state)
  r – red process (post-recording)
  e.g. rw – sent by a red process, received by a white process
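The coloring rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Process` class, its integer `state`, and the tuple message format are hypothetical and not part of the slides. Each message carries its sender's color, a white process records its state and turns red before it accepts a red message, and a red process logs every white message as in-transit.

```python
from dataclasses import dataclass, field
from typing import List, Optional

WHITE, RED = "white", "red"


@dataclass
class Process:
    """Hypothetical process model; `state` is just a running sum of payloads."""
    name: str
    color: str = WHITE
    state: int = 0
    recorded_state: Optional[int] = None
    in_transit: List[int] = field(default_factory=list)

    def record(self) -> None:
        # Record the local state once; the process is red from now on.
        if self.color == WHITE:
            self.recorded_state = self.state
            self.color = RED

    def send(self, payload: int):
        # A message is tagged with the color of its sender.
        return (self.color, payload)

    def receive(self, msg) -> str:
        msg_color, payload = msg
        # Classify as <sender color><receiver color on arrival>, e.g. "rw".
        kind = msg_color[0] + self.color[0]
        if msg_color == RED:
            self.record()                      # must be red to receive a red message
        elif self.color == RED:
            self.in_transit.append(payload)    # white message at a red process
        self.state += payload
        return kind


p, q = Process("P"), Process("Q")
early = p.send(5)       # white message, delayed in the network
p.record()
late = p.send(7)        # red message, overtakes `early` (non-FIFO channel)
kinds = [q.receive(late), q.receive(early)]
```

In this run the red message reaches Q first, so Q records its state before processing it, and the delayed white message is then logged as in-transit.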

Previous Work

Chandy and Lamport’s algorithm
  Assumes FIFO channels
  Requires one message (marker) per channel
  Marker indicates the end of white messages

Mattern’s algorithm; Schulz, Bronevetsky et al.
  Work for non-FIFO channels
  Require a message that indicates the total number of white messages sent on the channel

Results

Algorithm     Message Complexity       Message Size   Space
CLM           O(N^2)                   O(1)           O(N)
Grid-based    O(N^(3/2))               O(N)           O(N)
Tree-based    O(N log N log(W/n))      O(1)           O(1)
Centralized   O(N log(W/n))            O(1)           O(1)

Grid-based Algorithm

Idea 1
  Previously: send number of white messages/channel
  This algorithm: the total number of white messages destined to a process

Idea 2
  Previously: send N messages of size O(1)
  Now: send N messages of size N

Algorithm for P(r,c)
  Step 1: send row i of matrix to P(r,i)
  Step 2: compute cumulative count for row c
          Send this count to P(c,c)
  Step 3: if (r=c) // diagonal entry
          Receive count from all processes in the column
          Send jth entry to P(c,j)

Example

whiteSent =
  [ 1 0 3 ]
  [ 2 1 0 ]
  [ 4 0 0 ]

  [ 1 2 3 ] + [ 2 1 0 ] + [ 1 4 1 ] = [ 4 7 4 ]

For each processor of second row: count of messages sent to it from processors in third row

  [ 4 7 4 ] + [ 2 1 2 ] + [ 1 0 1 ] = [ 7 8 6 ]

For each processor of second row: count of messages sent to it from all processors

Tree/Centralized Algorithms

Idea
  Previously: maintain white messages sent for every destination
  These algorithms: nodes maintain local deficits
    Local deficit = white messages sent – white messages received
    Total deficit = Sum of all local deficits

Distributed Message Counting Problem
  W in-transit messages destined for N processors
  Detect when all messages have been received
  W tokens: a token is consumed when a message is received
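A small worked example of the deficit bookkeeping, as a sketch: the per-node counts below are made up for illustration. Because channels are reliable, every white message that was sent but not yet received is still in transit, so the sum of local deficits is the number of in-transit white messages (the W of the counting problem).

```python
def local_deficit(white_sent: int, white_received: int) -> int:
    # Local deficit = white messages sent - white messages received.
    return white_sent - white_received


def total_deficit(nodes) -> int:
    # nodes: iterable of (white_sent, white_received) pairs, one per process.
    return sum(local_deficit(s, r) for s, r in nodes)


# Hypothetical counts at the moment each process records its state.
nodes = [(5, 1), (2, 3), (0, 1)]
deficits = [local_deficit(s, r) for s, r in nodes]
W = total_deficit(nodes)
```

Here 7 white messages were sent and 5 received, so W is 2. A local deficit can be negative even though the total never is.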

Tree/Centralized Algorithms

Distributed Message Counting Algorithm
  Arrange nodes in suitable data structure
  Distribute tokens equally to all processors at start: w = W/n
  Each node has a color:
    Green (Rich): has more than w/2 tokens
    Yellow (Debt-free): has <= w/2 tokens
    Orange (Poor): has no tokens and has received a white message
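The three colors can be expressed as a function of a node's token count. In this sketch the `Node` class and its `unpaid` counter (white messages that arrived when no token was left to consume) are hypothetical modeling choices, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Node:
    tokens: int       # tokens currently held
    unpaid: int = 0   # hypothetical: white messages received with no token left

    def receive_white(self) -> None:
        # A token is consumed when a white message is received.
        if self.tokens > 0:
            self.tokens -= 1
        else:
            self.unpaid += 1

    def color(self, w: float) -> str:
        # w is the per-node share W/n handed out at the start of the round.
        if self.tokens == 0 and self.unpaid > 0:
            return "orange"   # poor
        if self.tokens > w / 2:
            return "green"    # rich
        return "yellow"       # debt-free


w = 4
node = Node(tokens=w)
history = [node.color(w)]
for _ in range(5):
    node.receive_white()
    history.append(node.color(w))
```

With w = 4 the node stays green while it holds more than 2 tokens, is yellow from 2 tokens down to 0, and turns orange on the first white message it cannot pay for.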

Tree-based Algorithm: High level idea

Arrange nodes as a binary tree

Progresses in rounds
  In each round all the nodes start off rich
  A token is consumed on receiving a message
  Debt-free node cannot have a rich child
    Ensured by transfer of tokens

Starting a new round
  When root is no longer rich: at least ½ of the tokens have been consumed

Tree-based Algorithm

Invariants
  I1: Yellow process cannot have green child
  I2: Root is always green
  I3: Any orange node eventually becomes yellow

[Figure: tree annotated with invariants I1 and I2]

Tree-based Algorithm - Example

[Figure: example run of the tree-based algorithm, with steps labeled: Violates I1; Swap Request; Swap Accept; Split Request; Split Accept; Violates I3; Violates I2]

Reset Round
  Recalculate remaining tokens W’ ( <= nw/2 = W/2 )
  Start new round with W’
  Redistribute tokens equally
  All nodes turn Green

Tree-based Algorithm – Analysis

Number of rounds
  If W < 2n, only O( n ) messages are required
  Tokens reduce by half in every round
  # of rounds = O( log W/n )

Number of control messages per round
  O( log n ) control messages per color change
  Whenever color changes, some green node turns yellow
  O( n ) color changes per round
  # of control messages per round = O( n log n )

Total control messages = O( n log n log W/n )
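The round count can be checked numerically. The sketch below assumes the worst case allowed by the analysis, where exactly half of the tokens survive each round, and stops once W < 2n. The function name `max_rounds` is illustrative.

```python
import math


def max_rounds(W: int, n: int) -> int:
    """Rounds needed if each round leaves exactly half of the tokens."""
    rounds = 0
    while W >= 2 * n:
        W //= 2          # at most W/2 tokens remain after a round
        rounds += 1
    return rounds


samples = [(40_000, 32), (40_000, 512), (63, 32), (2_880_992, 32)]
results = {(W, n): max_rounds(W, n) for W, n in samples}
# The loop never runs more than floor(log2(W/n)) times.
within_bound = all(
    rounds <= max(0, math.floor(math.log2(W / n)))
    for (W, n), rounds in results.items()
)
```

Multiplying the round count by the O( n log n ) control messages per round gives the total stated above.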

Centralized Algorithm

Idea
  In tree-based algorithm, every color change requires search for a green node to split/swap tokens with
    Requires O( log n ) control messages
  Can we find a green node with O(1) control messages?
    Master node (tail) maintains list of all green nodes

[Figure: Master node]

Centralized Algorithm - Example

[Figure: example with the Master node, with steps labeled: Swap Request; Swap Request; Swap Accept]

Centralized Algorithm – Analysis

Number of rounds
  If W < 2n, only O( n ) messages are required
  Tokens reduce by half in every round
  # of rounds = O( log W/n )

Number of control messages per round
  O( 1 ) control messages per color change
  Whenever color changes, some green node turns yellow
  O( n ) color changes per round
  # of control messages per round = O( n )

Total control messages = O( n log W/n )
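One way a master could keep its list of green nodes so that finding a green node takes constant time is sketched below. This is an assumption-laden illustration, not the protocol from the slides: the `Master` class, its method names, and the swap-with-last removal trick are all hypothetical, and the actual exchange of tokens is left out.

```python
class Master:
    """Hypothetical bookkeeping for the green nodes of the current round."""

    def __init__(self, nodes):
        # All nodes start a round green.
        self.green = list(nodes)
        self.pos = {node: i for i, node in enumerate(self.green)}

    def remove(self, node) -> None:
        """Forget a node that is no longer green (O(1), swap with last)."""
        i = self.pos.pop(node, None)
        if i is None:
            return                      # node was not listed as green
        last = self.green.pop()
        if last != node:
            self.green[i] = last        # move the last entry into the gap
            self.pos[last] = i

    def pick_green(self):
        """Return some green node in O(1), or None if none is left."""
        return self.green[-1] if self.green else None


master = Master(["a", "b", "c"])
first_pick = master.pick_green()
master.remove("a")
after_first = sorted(master.green)
master.remove("a")                      # repeated report is a no-op
master.remove("c")
master.remove("b")
last_pick = master.pick_green()
```

With a structure like this, a request to the master can be forwarded to a green node directly, which is consistent with a constant number of control messages per color change.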

Lower Bound

Observation
  Suppose there are W outstanding tokens
  Some process must generate a control message on receiving W/n white messages

[Figure: processors, each labeled W/n]

  Send W/n white messages to that processor
  Remaining tokens = (n-1)W/n
  Repeat argument recursively

Tokens remaining after i control messages >= ((n-1)/n)^i . W
# of control messages = Ω( n log W/n )
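The recursion can be unrolled as follows. This derivation is a sketch that assumes n >= 2 and that the argument keeps applying as long as at least n tokens remain (so that W/n is at least 1). Neither assumption is stated on the slide.

```latex
% Another control message is forced while the tokens remaining after
% i control messages are still at least n:
\[
  \left(\frac{n-1}{n}\right)^{i} W \;\ge\; n
  \quad\Longleftrightarrow\quad
  i \;\le\; \frac{\ln (W/n)}{\ln \frac{n}{n-1}} .
\]
% Using \ln(1+x) \le x with x = 1/(n-1):
\[
  \ln \frac{n}{n-1} \;=\; \ln\!\left(1 + \frac{1}{n-1}\right) \;\le\; \frac{1}{n-1}
  \quad\Longrightarrow\quad
  \frac{\ln (W/n)}{\ln \frac{n}{n-1}} \;\ge\; (n-1)\,\ln \frac{W}{n} .
\]
% Hence the number of control messages grows at least like n log(W/n):
\[
  \#\,\text{control messages} \;=\; \Omega\!\left(n \log \frac{W}{n}\right).
\]
```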

Experimental Results

Total Latencies

[Figure: total latency in milliseconds (axis 0 to 80) for Grid, Tree and Centralized, at N/W settings N=32, W=2880992; N=64, W=5764032; N=128, W=11536256; N=256, W=23105280; N=512, W=46341632]

Experimental Results

Average Message Counts (W=40,000)

[Figure: message count (axis 0 to 300) against N = 32, 64, 128, 256, 512 for Grid, Tree and Centralized]

Conclusions

Global Snapshots in distributed systems

Distributed Message Counting problem

Optimal algorithm
  Message Complexity O( n log W/n )
  Matching lower bound
  Centralized algorithm

Open Problem
  Decentralized algorithm ?

Thank You

Questions?
