Scalable Algorithms for Global Snapshots in Distributed Systems Rahul Garg IBM India Research Lab Vijay K. Garg Univ. of Texas at Austin Yogish Sabharwal IBM India Research Lab

Dec 14, 2015

Page 1

Scalable Algorithms for Global Snapshots in Distributed Systems

Rahul Garg IBM India Research Lab

Vijay K. Garg Univ. of Texas at Austin

Yogish Sabharwal IBM India Research Lab

Page 2

Motivation for Global Snapshot

Checkpoint to tolerate faults
  Take global snapshot periodically
  On failure, restart from the last checkpoint

Global property detection
  Detecting deadlock, loss of a token, etc.

Distributed debugging
  Inspecting the global state

Page 3

Consistent and inconsistent cuts

G1 is not consistent

G2 is consistent but m3 must be recorded

[Figure: space-time diagram of processes P1, P2, P3 exchanging messages m1, m2, m3, with cuts G1 and G2]
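The consistency condition on this slide can be checked mechanically. The following is a minimal Python sketch, assuming events on each process are numbered from 0 and a cut is given as the number of events it includes per process. The helper names (`is_consistent`, `in_transit`) and the event positions are hypothetical; they are not read off the original figure.

```python
from typing import Dict, List, Tuple

# A message is (sender, send_index, receiver, receive_index); indices count
# events on each process starting from 0. This encoding is an assumption.
Message = Tuple[str, int, str, int]


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """A cut includes the first cut[p] events of process p.

    It is consistent if no message is received inside the cut while its
    send event lies outside the cut.
    """
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_cut = recv_idx < cut[receiver]
        sent_in_cut = send_idx < cut[sender]
        if received_in_cut and not sent_in_cut:
            return False
    return True


def in_transit(cut: Dict[str, int], messages: List[Message]) -> List[Message]:
    """Messages sent inside the cut but received outside it must be recorded."""
    return [m for m in messages
            if m[1] < cut[m[0]] and not (m[3] < cut[m[2]])]


# Hypothetical event positions for three messages.
msgs = [("P1", 0, "P2", 0), ("P2", 1, "P3", 0), ("P3", 1, "P1", 2)]
bad_cut = {"P1": 1, "P2": 1, "P3": 1}   # includes a receive whose send is excluded
good_cut = {"P1": 1, "P2": 2, "P3": 2}  # every included receive has its send included
```

With these inputs the second cut is consistent, but one message crosses it and would have to be saved as part of the snapshot.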

Page 4

Model of the System

No shared clock

No shared memory

Processes communicate using messages

Messages are reliable

No upper bound on message delivery time

Page 5

Checkpoint

A process must be red to receive a red message
  A white process turns red on receiving a red message

Any white message received by a red process must be recorded as an in-transit message

Classification of messages
  w: white process (before recording its local state)
  r: red process (after recording its local state)
  e.g. rw: sent by a red process, received by a white process

[Figure: processes P and Q exchanging messages labeled ww, wr, rw, rr]
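The two rules on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the `Process` class, its fields, and the integer `state` standing in for application state are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

WHITE, RED = "white", "red"


@dataclass
class Process:
    name: str
    color: str = WHITE
    state: int = 0                         # stand-in for the application state
    recorded_state: Optional[int] = None   # local checkpoint, once taken
    in_transit: List[str] = field(default_factory=list)

    def record_local_state(self) -> None:
        """Take the local checkpoint and turn red (only the first call has an effect)."""
        if self.color == WHITE:
            self.recorded_state = self.state
            self.color = RED

    def send(self, payload: str) -> Tuple[str, str]:
        """Every message carries the sender's color at send time."""
        return (self.color, payload)

    def receive(self, message: Tuple[str, str]) -> str:
        msg_color, payload = message
        if msg_color == RED and self.color == WHITE:
            # A white process turns red before it accepts a red message.
            self.record_local_state()
        elif msg_color == WHITE and self.color == RED:
            # A white message reaching a red process was in transit.
            self.in_transit.append(payload)
        self.state += 1
        return payload


p, q = Process("P"), Process("Q")
early = p.send("m1")      # white message
p.record_local_state()    # P turns red
late = p.send("m2")       # red message
q.receive(late)           # non-FIFO delivery: Q turns red first
q.receive(early)          # white message at a red process: recorded
```

In this run Q checkpoints before processing the red message, and the overtaken white message ends up in Q's in-transit record.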

Page 6

Previous Work

Chandy and Lamport's algorithm
  Assumes FIFO channels
  Requires one message (marker) per channel
  Marker indicates the end of white messages

Mattern's algorithm; Schulz, Bronevetsky et al.
  Work for non-FIFO channels
  Require a message that indicates the total number of white messages sent on the channel

Page 7

Results

Algorithm     Message Complexity       Message Size   Space
CLM           O(N^2)                   O(1)           O(N)
Grid-based    O(N^(3/2))               O(√N)          O(N)
Tree-based    O(N log N log(W/N))      O(1)           O(1)
Centralized   O(N log(W/N))            O(1)           O(1)

Page 8

Grid-based Algorithm

Idea 1
  Previously: send the number of white messages per channel
  This algorithm: send the total number of white messages destined to a process

Idea 2
  Previously: send N messages of size O(1)
  Now: send √N messages of size √N

Page 9

Grid-based Algorithm

Algorithm for P(r,c)
  Step 1: send row i of the whiteSent matrix to P(r,i)
  Step 2: compute the cumulative count for row c
    Send this count to P(c,c)
  Step 3: if (r = c) // diagonal entry
    Receive counts from all processes in the column
    Send the jth entry to P(c,j)

whiteSent =
  [ 1 0 3 ]
  [ 2 1 0 ]
  [ 4 0 0 ]

Page 10

Grid-based Algorithm

[ 1 2 3 ]
[ 2 1 0 ]
[ 1 4 1 ]

Page 11

Grid-based Algorithm

For each processor of the second row: count of messages sent to it from processors in the third row

  [ 1 2 3 ]
+ [ 2 1 0 ]
+ [ 1 4 1 ]
= [ 4 7 4 ]

Page 12

Grid-based Algorithm

[ 4 7 4 ]

Page 13

Grid-based Algorithm

[ 4 7 4 ]
[ 2 1 2 ]
[ 1 0 1 ]

Page 14

Grid-based Algorithm

For each processor of the second row: count of messages sent to it from all processors

[ 7 8 6 ]
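The three steps of the grid-based algorithm can be simulated sequentially to check that every process ends up with the total number of white messages destined to it. The sketch below is a hedged illustration: it assumes N = k*k processes indexed from 0, treats a process sending to itself as a message, and uses randomly generated whiteSent matrices rather than the values on the slides. The function name `grid_totals` and the inbox dictionaries are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Pos = Tuple[int, int]
Matrix = List[List[int]]


def grid_totals(white_sent: Dict[Pos, Matrix], k: int) -> Tuple[Dict[Pos, int], int]:
    """Return (white messages destined to each process, control messages used)."""
    control_messages = 0

    # Step 1: P(r,c) sends row i of its whiteSent matrix to P(r,i).
    inbox1: Dict[Pos, List[List[int]]] = {
        (r, c): [] for r in range(k) for c in range(k)
    }
    for (r, c), matrix in white_sent.items():
        for i in range(k):
            inbox1[(r, i)].append(matrix[i])
            control_messages += 1

    # Step 2: P(r,c) adds the rows it received (counts for grid row c coming
    # from grid row r) and sends the sum to the diagonal process P(c,c).
    inbox2: Dict[int, List[List[int]]] = {c: [] for c in range(k)}
    for (r, c), rows in inbox1.items():
        partial = [sum(column) for column in zip(*rows)]
        inbox2[c].append(partial)
        control_messages += 1

    # Step 3: P(c,c) adds the partial sums and sends entry j to P(c,j).
    expected: Dict[Pos, int] = {}
    for c in range(k):
        total = [sum(column) for column in zip(*inbox2[c])]
        for j in range(k):
            expected[(c, j)] = total[j]
            control_messages += 1
    return expected, control_messages


k = 3
rng = random.Random(0)
procs = [(r, c) for r in range(k) for c in range(k)]
white_sent = {p: [[rng.randint(0, 4) for _ in range(k)] for _ in range(k)]
              for p in procs}
expected, used = grid_totals(white_sent, k)

# Cross-check against a direct sum over all senders.
direct = {(i, j): sum(white_sent[p][i][j] for p in procs) for (i, j) in procs}
```

Counting self-sends, this simulation uses k^3 + 2k^2 control messages, which is N^(3/2) + 2N for N = k^2 and matches the O(N^(3/2)) entry in the results table.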

Page 15

Tree/Centralized Algorithms

Idea
  Previously: maintain white messages sent for every destination
  These algorithms: nodes maintain local deficits
    Local deficit = white messages sent - white messages received
    Total deficit = sum of all local deficits

Distributed Message Counting Problem
  W in-transit messages destined for N processors
  Detect when all messages have been received
  W tokens: a token is consumed when a message is received
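The deficit bookkeeping needs only one counter per node. The following Python sketch illustrates this, assuming a hypothetical message log of (sender, receiver, delivered) triples; the function name `local_deficits` and the node names are illustrative only.

```python
from typing import Dict, List, Tuple


def local_deficits(white_messages: List[Tuple[str, str, bool]],
                   nodes: List[str]) -> Dict[str, int]:
    """Local deficit = white messages sent - white messages received.

    Each log entry is (sender, receiver, delivered).
    """
    deficit = {node: 0 for node in nodes}
    for sender, receiver, delivered in white_messages:
        deficit[sender] += 1
        if delivered:
            deficit[receiver] -= 1
    return deficit


log = [("A", "B", True), ("A", "C", False), ("B", "C", True), ("C", "A", False)]
deficits = local_deficits(log, ["A", "B", "C"])
total_deficit = sum(deficits.values())
undelivered = sum(1 for _, _, delivered in log if not delivered)
```

The total deficit equals the number of white messages still in transit, so detecting that all of them have arrived reduces to counting W tokens down to zero.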

Page 16

Tree/Centralized Algorithms

Distributed Message Counting Algorithm
  Arrange nodes in a suitable data structure
  Distribute tokens equally to all processors at start
    w = W/n
  Each node has a color:
    Green (Rich): has more than w/2 tokens
    Yellow (Debt-free): has <= w/2 tokens
    Orange (Poor): has no tokens and has received a white message
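The color rule can be written as a small function. This sketch makes one modeling assumption that is not on the slide: a node's position is tracked as a single `balance` (tokens held minus white messages received without a token to pay for them), so an orange node is one whose balance is negative. The names `node_color` and `initial_tokens` are hypothetical.

```python
from typing import List


def initial_tokens(W: int, n: int) -> List[int]:
    """Split W tokens as evenly as possible over n nodes."""
    base, extra = divmod(W, n)
    return [base + (1 if i < extra else 0) for i in range(n)]


def node_color(balance: int, w: int) -> str:
    """Classify a node given its balance and the per-node share w = W/n."""
    if balance < 0:
        return "orange"   # poor: no tokens left and a white message arrived
    if 2 * balance > w:
        return "green"    # rich: more than w/2 tokens
    return "yellow"       # debt-free: at most w/2 tokens


tokens = initial_tokens(40, 4)   # w = 10 tokens per node
colors = [node_color(t, 10) for t in tokens]
```

Comparing `2 * balance > w` avoids fractional thresholds when w is odd.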

Page 17

Tree-based Algorithm: High-level idea

Arrange nodes as a binary tree

Progresses in rounds
  In each round all the nodes start off rich
  A token is consumed on receiving a message
  Debt-free node cannot have a rich child
    Ensured by transfer of tokens

Starting a new round
  When root is no longer rich
  At least half of the tokens have been consumed

Page 18

Tree-based Algorithm

Invariants
  I1: Yellow process cannot have green child
  I2: Root is always green
  I3: Any orange node eventually becomes yellow

[Figure: tree of colored nodes annotated with invariants I1 and I2]

Page 19

Tree-based Algorithm - Example

Page 20

Tree-based Algorithm - Example

[Figure labels: Violates I1; Swap Request; Swap Accept]

Page 21

Tree-based Algorithm - Example

Page 22

Tree-based Algorithm - Example

[Figure labels: Split Request; Split Accept; Violates I3]

Page 23

Tree-based Algorithm - Example

[Figure label: Violates I2]

Page 24

Tree-based Algorithm - Example

Reset Round
  Recalculate remaining tokens W' ( <= nw/2 = W/2 )
  Start new round with W'
  Redistribute tokens equally
  All nodes turn Green

[Figure label: Violates I2]

Page 25

Tree-based Algorithm - Analysis

Number of rounds
  If W < 2n, only O( n ) messages are required
  Tokens reduce by half in every round
  # of rounds = O( log(W/n) )

Number of control messages per round
  O( log n ) control messages per color change
  Whenever color changes, some green node turns yellow
  O( n ) color changes per round
  # of control messages per round = O( n log n )

Total control messages = O( n log n log(W/n) )
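The bound above can be spelled out as a short derivation. The symbols W_k (tokens at the start of round k) and w_k = W_k/n are introduced here for illustration and do not appear on the slides; the halving step uses the Reset Round bound W' <= nw/2.

```latex
\begin{align*}
W_0 &= W, \qquad w_k = W_k / n \\
W_{k+1} &\le n \cdot \frac{w_k}{2} = \frac{W_k}{2}
  && \text{(Reset Round bound } W' \le nw/2\text{)} \\
W_k &\le \frac{W}{2^{k}} \\
W_k &< 2n \ \text{ once } \ k > \log_2 \frac{W}{2n}
  \;\Longrightarrow\; \#\text{rounds} = O\!\left(\log \frac{W}{n}\right) \\
\text{total} &= \underbrace{O\!\left(\log \tfrac{W}{n}\right)}_{\text{rounds}}
  \cdot \underbrace{O(n)}_{\text{color changes per round}}
  \cdot \underbrace{O(\log n)}_{\text{messages per change}}
  = O\!\left(n \log n \log \frac{W}{n}\right)
\end{align*}
```

Once fewer than 2n tokens remain, the slide's first bullet applies and the remaining cost is O(n), which does not change the total.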

Page 26

Centralized Algorithm

Idea
  In the tree-based algorithm, every color change requires a search for a green node to split/swap tokens with
    Requires O( log n ) control messages
  Can we find a green node with O(1) control messages?
    Master node (tail) maintains a list of all green nodes

[Figure: nodes and the Master node]

Page 27

Centralized Algorithm - Example

[Figure labels: Master; Swap Request (twice); Swap Accept]
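The idea of a master that hands out a green partner in constant time can be sketched as follows. This is a hypothetical illustration, not the authors' protocol: the `Master` class and its methods are invented names, the three-message flow (request to the master, forwarded request, accept) is inferred from the labels on this slide, and what happens to the two nodes' tokens after the swap is not modeled.

```python
from typing import Hashable, Iterable, Optional


class Master:
    """Hypothetical master that tracks which nodes are currently green."""

    def __init__(self, node_ids: Iterable[Hashable]) -> None:
        # A dict keeps insertion order and gives O(1) removal and O(1) pick.
        self.green = dict.fromkeys(node_ids)  # every node starts a round green
        self.control_messages = 0

    def report_not_green(self, node_id: Hashable) -> None:
        """A node tells the master it is no longer green."""
        self.green.pop(node_id, None)

    def forward_swap_request(self, requester: Hashable) -> Optional[Hashable]:
        """Pick a green partner for the requester; None means no green node is left."""
        self.control_messages += 1        # Swap Request: requester -> master
        self.green.pop(requester, None)   # assumption: the requester is not green
        if not self.green:
            return None                   # caller would start a new round
        partner = next(iter(self.green))
        self.control_messages += 2        # forwarded request + Swap Accept
        return partner


master = Master(["a", "b", "c"])
first_partner = master.forward_swap_request("a")
master.report_not_green("b")
second_partner = master.forward_swap_request("c")
```

Each request costs a constant number of control messages regardless of n, which is the property the centralized algorithm relies on.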

Page 28

Centralized Algorithm - Analysis

Number of rounds
  If W < 2n, only O( n ) messages are required
  Tokens reduce by half in every round
  # of rounds = O( log(W/n) )

Number of control messages per round
  O( 1 ) control messages per color change
  Whenever color changes, some green node turns yellow
  O( n ) color changes per round
  # of control messages per round = O( n )

Total control messages = O( n log(W/n) )

Page 29

Lower Bound

Observation
  Suppose there are W outstanding tokens
  Some process must generate a control message on receiving W/n white messages

[Figure: n processes, each labeled W/n]

  Send W/n white messages to that processor
  Remaining tokens = (n-1)W/n
  Repeat the argument recursively

Tokens remaining after i control messages >= ((n-1)/n)^i · W
# of control messages = Ω( n log(W/n) )
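The last step can be worked out explicitly. In this sketch T_i denotes the tokens remaining after i control messages (a symbol introduced here), rounding is ignored, and the adversary is assumed to keep forcing control messages as long as at least n tokens remain.

```latex
\begin{align*}
T_i &\ge \left(\frac{n-1}{n}\right)^{i} W \\
\left(\frac{n-1}{n}\right)^{i} W \ge n
  &\iff i \le \frac{\ln(W/n)}{\ln\!\left(\frac{n}{n-1}\right)} \\
\ln\!\left(\frac{n}{n-1}\right) &= \ln\!\left(1 + \frac{1}{n-1}\right) \le \frac{1}{n-1} \\
\Longrightarrow\quad \#\text{control messages}
  &\ge (n-1)\,\ln\frac{W}{n} = \Omega\!\left(n \log \frac{W}{n}\right)
\end{align*}
```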

Page 30

Experimental Results

[Figure: Total latencies in milliseconds (y-axis, 0 to 80) for the Grid, Tree, and Centralized algorithms; x-axis N/W with points N=32, W=2880992; N=64, W=5764032; N=128, W=11536256; N=256, W=23105280; N=512, W=46341632]

Page 31

Experimental Results

[Figure: Average message counts (W=40,000) for the Grid, Tree, and Centralized algorithms; x-axis N = 32, 64, 128, 256, 512; y-axis count, 0 to 300]

Page 32

Conclusions

Global snapshots in distributed systems
  Distributed Message Counting problem

Optimal algorithm
  Message complexity O( n log(W/n) )
  Matching lower bound
  Centralized algorithm

Open problem
  Decentralized algorithm?

Page 33

Thank You

Questions?