Implementing Fault-Tolerant Services Using State Machines Vijay K. Garg Electrical and Computer Engineering The University of Texas at Austin Email: [email protected] Disc’2010 Implementing Fault-Tolerant Services Using State Machines : Beyond Replication

Feb 25, 2016

Page 1: Implementing Fault-Tolerant Services Using State Machines

Implementing Fault-Tolerant Services Using State Machines: Beyond Replication

Vijay K. Garg
Electrical and Computer Engineering
The University of Texas at Austin
Email: [email protected]

Disc’2010

Page 2: Implementing Fault-Tolerant Services Using State Machines

Fault Tolerance: Replication

[Figure: Server 1, Server 2, Server 3, with replicated backups for 1-fault tolerance and 2-fault tolerance]

Page 3: Implementing Fault-Tolerant Services Using State Machines

Fault Tolerance: Fusion

[Figure: Server 1, Server 2, Server 3, with 1-fault tolerance]

Page 4: Implementing Fault-Tolerant Services Using State Machines

Fault Tolerance: Fusion

[Figure: Server 1, Server 2, Server 3, with 2-fault tolerance]

'Fused' servers: fewer backups than replication

Page 5: Implementing Fault-Tolerant Services Using State Machines

Motivation

            Coding      Replication   Fusion
Space       Efficient   Wasteful      Efficient
Recovery    Expensive   Efficient     Expensive
Updates     Expensive   Efficient     Efficient

Probability of failure is low => expensive recovery is ok

Page 6: Implementing Fault-Tolerant Services Using State Machines

Outline

Crash Faults
  Space savings
  Message savings
  Complex Data Structures

Byzantine Faults
  Single Fault (f=1), O(1) data
  Single Fault, O(m) data
  Multiple Faults (f>1), O(m) data

Conclusions & Future Work

Page 7: Implementing Fault-Tolerant Services Using State Machines

Example 1: Event Counter

n different counters counting n different items
count_i = entry(i) - exit(i)

What if one of the processes may crash?

Page 8: Implementing Fault-Tolerant Services Using State Machines

Event Counter: Single Fault

fCount1 keeps the sum of all counts.
Any crashed count can be recovered using fCount1 and the remaining counts.
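The single-fault scheme on this slide can be sketched in a few lines. This is a minimal illustration, assuming integer counters; the `FusedCounter` class and the `demo_recover` helper are hypothetical names, not from the slides, which only state that fCount1 holds the sum of all counts.

```python
class FusedCounter:
    """Backup that tracks the sum of all primary counters (fCount1 on the slide)."""

    def __init__(self):
        self.total = 0

    def entry(self, i):
        # An entry event on any counter i raises the sum by one.
        self.total += 1

    def exit(self, i):
        # An exit event on any counter i lowers the sum by one.
        self.total -= 1

    def recover(self, surviving_counts):
        # The crashed counter equals the fused sum minus the surviving counts.
        return self.total - sum(surviving_counts)


def demo_recover():
    counts = [0, 0, 0]
    fused = FusedCounter()
    events = [("entry", 0), ("entry", 1), ("entry", 1), ("exit", 1), ("entry", 2)]
    for kind, i in events:
        if kind == "entry":
            counts[i] += 1
            fused.entry(i)
        else:
            counts[i] -= 1
            fused.exit(i)
    # Simulate a crash of counter 1 and rebuild it from the other two counters.
    survivors = [counts[0], counts[2]]
    return counts[1], fused.recover(survivors)
```

One fused counter protects all n primaries against a single crash, whereas replication would need n backup counters for the same guarantee.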

Page 9: Implementing Fault-Tolerant Services Using State Machines

Event Counter: Multiple Faults

9

Page 10: Implementing Fault-Tolerant Services Using State Machines

Event Counter: Theorem

10

Page 11: Implementing Fault-Tolerant Services Using State Machines

Shared Events: Aggregation

11

Suppose all processes act on entry(0) and exit(0)

Page 12: Implementing Fault-Tolerant Services Using State Machines

Aggregation of Events

12

Page 13: Implementing Fault-Tolerant Services Using State Machines

Some Applications of Fusion

Causal Ordering of Messages for n Processes
  O(n^2) matrix at each process
  Replication to tolerate one fault: O(n^3) storage
  Fusion to tolerate one fault: O(n^2) storage

Ricart and Agrawala's Algorithm
  O(n) storage per process, 2(n-1) messages per mutex
  Replication: n backup processes each with O(n) storage, 2(n-1) additional messages
  Fusion: 1 fused process with O(n) storage, only n additional messages

Page 14: Implementing Fault-Tolerant Services Using State Machines

Outline

Crash Faults
  Space savings
  Message savings
  Complex Data Structures

Byzantine Faults
  Single Fault (f=1), O(1) data
  Single Fault, O(m) data
  Multiple Faults (f>1), O(m) data

Conclusions & Future Work

Page 15: Implementing Fault-Tolerant Services Using State Machines

Example: Resource Allocation, P(i)

user: int initially 0;            // resource idle
waiting: queue of int initially null;

On receiving acquire from client pid
    if (user == 0) {
        send(OK) to client pid;
        user = pid;
    } else
        waiting.append(pid);

On receiving release
    if (waiting.isEmpty())
        user = 0;
    else {
        user = waiting.head();
        send(OK) to user;
        waiting.removeHead();
    }
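For readers who want to run the state machine, here is a direct Python rendering of the pseudocode on this slide. It is a sketch under stated assumptions: the `sent` list is a hypothetical stand-in for `send(OK)`, client pids are assumed to be nonzero (0 marks an idle resource), and the class and helper names are illustrative.

```python
from collections import deque


class ResourceAllocator:
    """Python rendering of the P(i) pseudocode; `sent` stands in for the network."""

    def __init__(self):
        self.user = 0            # 0 means the resource is idle
        self.waiting = deque()   # pids of clients waiting for the resource
        self.sent = []           # (message, pid) pairs, hypothetical stand-in for send()

    def acquire(self, pid):
        if self.user == 0:
            self.sent.append(("OK", pid))
            self.user = pid
        else:
            self.waiting.append(pid)

    def release(self):
        if not self.waiting:
            self.user = 0
        else:
            # Hand the resource to the client at the head of the queue.
            self.user = self.waiting.popleft()
            self.sent.append(("OK", self.user))


def demo_allocator():
    p = ResourceAllocator()
    p.acquire(7)   # resource idle, so client 7 gets it immediately
    p.acquire(9)   # client 9 has to wait
    p.release()    # client 7 releases, client 9 is granted the resource
    return p.user, list(p.waiting), p.sent
```

The state that a backup has to protect is the pair (`user`, `waiting`), which is why the following slides focus on fusing queues.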

Page 16: Implementing Fault-Tolerant Services Using State Machines

Complex Data Structures: Fused Queue

[Figure: (i) Primary Queue A holding a1 to a8, with head and tail pointers; (ii) Primary Queue B holding b1 to b5, with head and tail pointers; (iii) Fused Queue F holding a1, a2, a3 + b1, a4 + b2, a5 + b3, a6 + b4, a7 + b5, a8 + b6, with pointers HeadA, HeadB, tailA, tailB]

Fused Queue that can tolerate one crash fault
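The idea of a fused queue can be illustrated with a simplified slot-wise variant. This sketch assumes fixed-capacity circular arrays with integer elements and a fused array that stores, per slot, the sum of the primaries' values in that slot; it does not reproduce the exact element alignment drawn in the figure. All class and method names are hypothetical, and overflow and underflow checks are omitted.

```python
class CircularQueue:
    """Primary queue stored in a fixed-size circular array (no bounds checks)."""

    def __init__(self, capacity):
        self.slots = [0] * capacity   # an unused slot holds 0
        self.head = 0
        self.count = 0

    def enqueue(self, value):
        slot = (self.head + self.count) % len(self.slots)
        self.slots[slot] = value
        self.count += 1
        return slot

    def dequeue(self):
        slot = self.head
        value = self.slots[slot]
        self.slots[slot] = 0
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return slot, value


class FusedQueue:
    """Slot-wise sum of all primaries, plus each primary's head and count."""

    def __init__(self, capacity, num_primaries):
        self.sums = [0] * capacity
        self.heads = [0] * num_primaries
        self.counts = [0] * num_primaries

    def on_enqueue(self, q, slot, value):
        self.sums[slot] += value
        self.counts[q] += 1

    def on_dequeue(self, q, slot, value):
        # The primary reports the removed value so it can be subtracted out.
        self.sums[slot] -= value
        self.heads[q] = (self.heads[q] + 1) % len(self.sums)
        self.counts[q] -= 1

    def recover(self, q, surviving):
        # Rebuild crashed primary q from the fused sums and the surviving primaries.
        cap = len(self.sums)
        out = []
        for k in range(self.counts[q]):
            slot = (self.heads[q] + k) % cap
            out.append(self.sums[slot] - sum(p.slots[slot] for p in surviving))
        return out


def demo_fused_queue():
    a, b = CircularQueue(4), CircularQueue(4)
    f = FusedQueue(4, 2)
    for v in (5, 6, 7):
        f.on_enqueue(0, a.enqueue(v), v)
    for v in (10, 20):
        f.on_enqueue(1, b.enqueue(v), v)
    f.on_dequeue(0, *a.dequeue())
    # Pretend queue A crashed and recover its contents from F and B.
    return f.recover(0, [b])
```

The fused array has the size of one primary array rather than the sum of both, which is the space saving over keeping a full replica of each queue.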

Page 17: Implementing Fault-Tolerant Services Using State Machines

Fused Queues: Circular Arrays

17

Page 18: Implementing Fault-Tolerant Services Using State Machines

Resource Allocation: Fused Processes

18

Page 19: Implementing Fault-Tolerant Services Using State Machines

Outline

Crash Faults
  Space savings
  Message savings
  Complex Data Structures

Byzantine Faults
  Single Fault (f=1), O(1) data
  Single Fault, O(m) data
  Multiple Faults (f>1), O(m) data

Conclusions & Future Work

Page 20: Implementing Fault-Tolerant Services Using State Machines

Byzantine Fault Tolerance: Replication

[Figure: three identical rows of primary states 13, 8, 45]

(2f+1)*n processes

Page 21: Implementing Fault-Tolerant Services Using State Machines

Goals for Byzantine Fault Tolerance
  Efficient during error-free operations
  Efficient detection of faults
    No need to decode for fault detection
  Efficient in space requirements

Page 22: Implementing Fault-Tolerant Services Using State Machines

Byzantine Fault Tolerance: Fusion

[Figure: primaries P(i) with states 13, 8, 45; copies Q(i) with states 13, 8, 45; fused machine F(1) with state 66]

Page 23: Implementing Fault-Tolerant Services Using State Machines

Byzantine Faults (f=1)

Assume n primary state machines P(1)..P(n), each with an O(1) data structure.

Theorem 2: There exists an algorithm with n+1 additional backup machines that has
  the same overhead as replication during normal operations
  an additional O(n) overhead during recovery.
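One way to read Theorem 2 is sketched below, using the copies Q(i) and a fused sum F(1) as on the preceding slide. This is an illustrative interpretation, not the author's stated algorithm: it assumes integer states, F(1) equal to the sum of the correct primary states, and at most one lying machine overall. The function name is hypothetical.

```python
def detect_and_correct(p, q, fused_sum):
    """Return (index, correct_value) for a single mismatch, or None if P and Q agree.

    p, q: states reported by the primaries P(i) and their copies Q(i).
    fused_sum: state of F(1), assumed to be the sum of the correct primary states.
    """
    for i, (pi, qi) in enumerate(zip(p, q)):
        if pi != qi:
            # With at most one liar, every other primary and the fused sum are
            # correct, so the true value of state i follows by subtraction.
            others = sum(p[j] for j in range(len(p)) if j != i)
            return i, fused_sum - others
    return None
```

During normal operation only the pairwise comparison of P(i) and Q(i) runs, which matches the cost of replication; the O(n) subtraction is needed only after a mismatch is detected.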

23

Page 24: Implementing Fault-Tolerant Services Using State Machines

Byzantine FT: O(m) data

[Figure: primaries P(i) holding queue A (a1 to a8) and queue B (b1 to b5); identical copies Q(i); fused queue F(1) holding a1, a2, a3 + b1, a4 + b2, a5 + b3, a6 + b4, a7 + b5, a8 + b6, with pointers HeadA, HeadB, tailA, tailB. Labels in the figure: g, x, "Crucial location"]

Page 25: Implementing Fault-Tolerant Services Using State Machines

Byzantine Faults (f=1), O(m)

Theorem 3: There exists an algorithm with n+1 additional backup machines such that
  normal operations have the same overhead as replication
  recovery has an additional O(m+n) overhead.

No need to decode F(1)

Page 26: Implementing Fault-Tolerant Services Using State Machines

Byzantine Fault Tolerance: Fusion

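[Figure: four unfused copies of the primary states P(i): 3 1 4, 3 8 4, 3 1 4, 3 1 4; fused machines F(1) to F(3) holding 8, 17, 43]

F(1) = 1*3 + 1*1 + 1*4 = 8
F(2) = 1*3 + 2*1 + 3*4 = 17
F(3) = 1*3 + 4*1 + 9*4 = 43

Single mismatched primary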
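The fused values 8, 17, and 43 on this slide can be reproduced with a short function. The weight pattern used here, i**(j-1) for the i-th primary in the j-th fused machine, is inferred from the three sums shown on the slide and is an assumption about the general rule; `fused_values` is a hypothetical name.

```python
def fused_values(states, f):
    """Compute f weighted sums of the primary states.

    Fused machine j (1-based) weights the i-th primary (1-based) by i**(j-1),
    so the weights are 1,1,1 for j=1, then 1,2,3 for j=2, then 1,4,9 for j=3.
    """
    return [
        sum((i ** (j - 1)) * x for i, x in enumerate(states, start=1))
        for j in range(1, f + 1)
    ]
```

Calling `fused_values([3, 1, 4], 3)` gives the three values on the slide, while the mismatched copy 3 8 4 yields a different first sum (15 instead of 8), which is what lets the fused machines expose it.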

Page 27: Implementing Fault-Tolerant Services Using State Machines

Byzantine Fault Tolerance: Fusion

[Figure: four unfused copies of the primary states P(i): 3 7 4, 3 8 4, 3 1 4, 3 1 4; fused machines F(1) to F(3) holding 8, 17, 43]

Multiple mismatched primary

Page 28: Implementing Fault-Tolerant Services Using State Machines

Byzantine Faults (f>1), O(1) data

Theorem 4: There exists an algorithm that tolerates f Byzantine faults using fn+f additional state machines, with the same overhead as replication during normal operations.

28

Page 29: Implementing Fault-Tolerant Services Using State Machines

Liar Detection (f > 1), O(m) data

Z := set of all f+1 unfused copies
While (not all copies in Z identical) do
    w := first location where copies differ
    Use fused copies to find v, the correct value of state[w]
    Delete unfused copies with state[w] != v

Invariant: Z contains a correct machine.

No need to decode the entire fused state machine!
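The liar-detection loop translates almost line for line into Python. In this sketch the step "use fused copies to find v" is abstracted as a caller-supplied function `correct_value(w)`, which is a hypothetical oracle; how the fused machines actually compute v is not spelled out on the slide and is not implemented here.

```python
def liar_detection(copies, correct_value):
    """Filter the unfused copies down to those consistent with the fused machines.

    copies: list of equal-length state lists (the f+1 unfused copies).
    correct_value: hypothetical oracle, w -> true value of state[w],
                   standing in for the lookup in the fused copies.
    """
    z = list(copies)
    while any(c != z[0] for c in z):
        # First location where at least two remaining copies disagree.
        w = next(
            k for k in range(len(z[0]))
            if any(c[k] != z[0][k] for c in z)
        )
        v = correct_value(w)
        # Drop every copy that is wrong at location w.
        z = [c for c in z if c[w] == v]
    return z
```

Each iteration removes at least one copy, so the loop runs at most f times, and only the disputed locations are ever looked up in the fused state.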

29

Page 30: Implementing Fault-Tolerant Services Using State Machines

Fusible Structures

Fusible Data Structures [Garg and Ogale, ICDCS 2007]
  Linked Lists, Stacks, Queues, Hash tables
  Data structure specific algorithms
  Partial Replication for efficient updates
  Multiple faults tolerated using Reed-Solomon Coding

Fusible Finite State Machines [Ogale, Balasubramanian, Garg, IPDPS 09]
  Automatic Generation of minimal fused state machines

30

Page 31: Implementing Fault-Tolerant Services Using State Machines

Conclusions

                    Coding    Replication   Fusion
Crash Faults                  n+nf          n+f
Byzantine Faults              n+2nf         n+nf+f

Replication: recovery and updates are simple; tolerates f faults for each of the primaries
Fusion: space efficient

Can combine them for tradeoffs

n: the number of different servers
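The machine counts in the table are easy to compare for concrete values of n and f. The helper below simply evaluates the four formulas from the slide; the function name and the dictionary layout are illustrative choices.

```python
def machine_counts(n, f):
    """Total number of machines (primaries plus backups) for n servers and f faults."""
    return {
        "crash": {"replication": n + n * f, "fusion": n + f},
        "byzantine": {"replication": n + 2 * n * f, "fusion": n + n * f + f},
    }
```

For example, with n = 3 and f = 1 the Byzantine row gives 9 machines for replication, matching (2f+1)*n, and 7 for fusion, matching (f+1)*n + f.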

Page 32: Implementing Fault-Tolerant Services Using State Machines

Future Work

Optimal Algorithms for Complex Data Structures
Different Fusion Operators
Concurrent Updates on Backup Structures

32

Page 33: Implementing Fault-Tolerant Services Using State Machines

Thank You!

33

Page 34: Implementing Fault-Tolerant Services Using State Machines

Questions?

Crash Faults
  Event Counters: Space savings
  Mutex Algorithm: Message savings
  Resource Allocator: Complex Data Structures

Byzantine Faults
  Single Fault (f=1), Detection and Correction
  Liar Detection
  Multiple Faults (f>1)

Conclusions & Future Work

34

Page 35: Implementing Fault-Tolerant Services Using State Machines

Backup Slides

35

Page 36: Implementing Fault-Tolerant Services Using State Machines

Event Counter: Proof Sketch

36

Page 37: Implementing Fault-Tolerant Services Using State Machines

Model
  The servers (primary and backups) execute independently (in parallel)
  Primaries and backups do not operate in lock-step
  Events/Updates are applied on all the servers
  All backups act on the same sequence of events

37

Page 38: Implementing Fault-Tolerant Services Using State Machines

Model contd…
  Faults:
    Fail Stop (crash): Loss of current state
    Byzantine: Servers can `lie` about their current state
  For crash faults, we assume the presence of a failure detector
  For Byzantine faults, we provide detection algorithms
  Infrequent Faults

38

Page 39: Implementing Fault-Tolerant Services Using State Machines

Byzantine Faults (f=1), O(m)

Theorem 3: There exists an algorithm with n+1 additional backup machines such that
  normal operations have the same overhead as replication
  recovery has an additional O(m+n) overhead.

Proof Sketch:
  Normal Operation: Responses by P(i) and Q(i) are identical
  Detection: A fault is detected when P(i) and Q(i) differ on any response
  Correction: Use liar detection
    O(m) time to determine crucial location
    Use F(1) to determine who is correct
    No need to decode F(1)

39

Page 40: Implementing Fault-Tolerant Services Using State Machines

Byzantine Faults (f>1)

Proof Sketch:
  f copies of each primary state machine and f overall fused machines
  Normal Operation: all f+1 unfused copies result in the same output
  Case 1: single mismatched primary state machine
    Use liar detection
  Case 2: multiple mismatched primary state machines
    Unfused copy with the largest tally is correct

40

Page 41: Implementing Fault-Tolerant Services Using State Machines

Resource Allocation Machine

[Figure: Lock Server 1, Lock Server 2, and Lock Server 3, each with its own Request Queue (1, 2, 3) holding pending requests labeled R1 to R4]

Page 42: Implementing Fault-Tolerant Services Using State Machines

Byzantine Fault Tolerance: Fusion

[Figure: primaries P(i) with states 13, 8, 45; copies Q(i) with states 13, 8, 45; fused machine F(1) with state 66]

(f+1)*n + f processes