Moving away from the independent and identically distributed failure assumption


University of California, San Diego

Flavio Junqueira

Research Exam/Thesis Proposal

Advisors: Keith Marzullo and Geoffrey M. Voelker

Motivation

Common approach for distributed systems: replicate!
  Cheaper than investing in ultra-reliable, specialized components
  Enhances performance and availability
  E.g., processes in software-based systems

Typical replication strategy
  Compute a threshold t on the number of process failures
  Determine the degree of replication required, depending on the problem (e.g., n > 3t for Consensus with arbitrary failures)
  Replicate to this degree

Well suited for independent and identically distributed failures (IID failure assumption)
  Non-negligible probability of t failures in any subset of size t+1
  Is it often a reasonable assumption?

Where IID does not apply…

Systems for the Internet
  Hosts execute the same popular software systems
  Hosts share the same vulnerabilities

Some major outbreaks
  Code Red: over 360,000 hosts [Moore02]
  Sapphire: over 75,000 hosts [Moore03a]

A threshold on the number of failures is unrealistic.


Quorum systems in a wide-area network [Amir96]
  Failures are strongly correlated
    Power outages
    Network partitions

Software bugs [Little01]
  Single version
    A demand may cause all replicas to crash
  Multiple independently developed versions
    Difficulty of a demand: difficulty in handling it
    Level of difficulty varies among the demands
    More difficult demands tend to cause multiple versions to fail

Where IID does not apply…

Multi-computer systems [Tang92]
  Correlated failures due to shared resources
    Network errors
    Shared memory
  Impact on availability, reliability, and performance

Grid computing
  Master delegates computation
  Master waits for replies from slaves
  Replicate to achieve fault tolerance
  Dependent failures: same sub-network, same software systems, etc.

Outline

System model

Modeling failures
  The classical approach: the threshold model
  An alternative to the threshold model: cores/survivor sets

Applying it to problems: Consensus
  Traditional results on Consensus
  Consensus in the core/survivor set model

Generalizing the results for Consensus
  General bounds on process replication

Coping with dependent failures in the real world
  A few systems that assume dependent failures
  An application: the Phoenix Recovery System

System model

Set of processes Π = {p1, p2, …, pn}
  A process is a unit of computation
  Processes communicate by exchanging messages

Reliable channels
  Validity: if a correct process p sends a message m to a correct process q, then q eventually receives m
  Integrity: a process p receives a message m from some process q only if q sent m to p


[Figure: a set of processes that exchange messages over reliable channels. A distributed algorithm is a collection of state machines, shown for processes p and q. A step of a process is atomic and changes its state; an execution is a sequence of steps of processes.]

Distributed algorithm

Collection of state machines, one for each process p

Proceeds in steps of processes

In a step, a process p
  Sends a message to a single process
  Receives a message from a single process
  Undergoes a state transition

Execution
  Sequence of steps of processes in Π

Timing assumptions

Synchronous systems
  Clock drift, message delay, and processor speed are bounded
  Execution in synchronous rounds
  In a synchronous round, a process
    sends messages to any number of processes
    receives messages from any number of processes
    undergoes a state transition

Asynchronous systems
  No bounds on clock drift, message delay, or processor speed

Failure modes for processes

Crash failures
  For every faulty process p in some execution of an algorithm A, there is a time tp after which p stops executing steps of A

Arbitrary failures
  A faulty process can deviate arbitrarily from the specification of the algorithm
  E.g., crashing, sending messages selectively, arbitrarily modifying the content of messages

Receive-omission failures
  A faulty process either crashes or selectively fails to receive messages

Assumptions
  Once a process fails, it does not recover
  Probability of a total failure is negligible

Modeling failures

The threshold model

Threshold t on the number of process failures
  Degree of reliability: R ∈ [0,1]
  The probability of t+1 process failures is smaller than 1−R
  Simple and compact representation (n > f(t))

SIFT project [Wensley76]
  Ultra-reliable computer system
  Process failures are arbitrary, but non-malicious
  Hardware designed to isolate faults (independent failures)
  Similar hardware (identically distributed process failures)
  IID failure assumption is valid

What if failures are not IID?
  Still safe
    t is the size of the largest subset of faulty processes in any execution
    It does not hurt to consider more
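As a rough illustration of how a threshold might be derived when the IID assumption does hold, the sketch below treats the number of failures among n processes as binomial. The per-process failure probability `p`, the reading of the reliability condition as "P(at least t+1 failures) < 1−R", and the function names are assumptions made for this example, not values or definitions from the talk.

```python
from math import comb


def prob_at_least(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p): k or more of n processes fail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def smallest_threshold(n: int, p: float, R: float) -> int:
    """Smallest t such that P(at least t+1 failures) < 1 - R (assumed reading)."""
    for t in range(n + 1):
        if prob_at_least(n, p, t + 1) < 1 - R:
            return t
    return n  # only reached when R == 1: every process may fail


if __name__ == "__main__":
    t = smallest_threshold(7, 0.01, 0.99)
    print(t, 7 > 3 * t)
```

Under these assumptions, seven processes with p = 0.01 and R = 0.99 give t = 1, which satisfies n > 3t. With correlated failures the binomial step is exactly what stops being valid.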

Limitations of the threshold model

[Figure legend: R is the target degree of reliability; ">R" marks a subset of processes whose reliability is greater than R.]

An alternative to the threshold model

Desirable properties
  Expressive: scenarios in the previous slide
  Flexible: not tied to any particular way of characterizing failures
  General: widely applicable

Cores [JM03a]
  A core c: minimal reliable subset of processes
  At least one process in c is correct in every execution of the system
  Generalize subsets of size t+1

Survivor sets [JM03a]
  A survivor set: contains all the correct processes of some execution
  Generalize subsets of size n-t

Cores and Survivor sets

R: desired degree of reliability
r(X), X ⊆ Π: evaluates to the reliability of X

A subset C ⊆ Π is a core of Π iff
  r(C) ≥ R
  ∀p ∈ C, r(C − {p}) < R
C_Π: set of cores of Π

A subset S ⊆ Π is a survivor set of Π iff
  ∀C ∈ C_Π, S ∩ C ≠ ∅
  ∀p ∈ S, ∃C ∈ C_Π such that (p ∈ C) and ((S − {p}) ∩ C = ∅)
S_Π: set of survivor sets of Π

Cores and survivor sets are the dual of each other
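The duality can be made concrete with a small brute-force sketch: by the definition above, survivor sets are the minimal subsets that intersect every core, and the same computation applied to the survivor sets gives back the cores. The function names are hypothetical and the enumeration is exponential, so this is only meant for toy systems.

```python
from itertools import combinations


def minimal_hitting_sets(universe, family):
    """All minimal subsets of `universe` that intersect every set in `family`."""
    elements = sorted(universe)
    found = []
    # Candidates are visited by increasing size, so a hitting set that has no
    # previously found hitting set inside it is necessarily minimal.
    for size in range(1, len(elements) + 1):
        for candidate in combinations(elements, size):
            c = frozenset(candidate)
            if any(smaller <= c for smaller in found):
                continue
            if all(c & member for member in family):
                found.append(c)
    return set(found)


def survivor_sets(processes, cores):
    """Survivor sets as the minimal sets that intersect every core."""
    return minimal_hitting_sets(processes, cores)


def cores_from(processes, survivors):
    """Dual direction: cores as the minimal sets that intersect every survivor set."""
    return minimal_hitting_sets(processes, survivors)


if __name__ == "__main__":
    procs = set("abcd")  # threshold system with n = 4, t = 1
    cores = {frozenset(c) for c in combinations(sorted(procs), 2)}  # size t+1
    for s in sorted(survivor_sets(procs, cores), key=sorted):
        print("".join(sorted(s)))
```

For this threshold system the survivor sets come out as the subsets of size n−t = 3, which matches the claim that cores and survivor sets generalize subsets of size t+1 and n−t.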

An alternative definition

Design of algorithms
  Φ: the set of allowed executions
  up(φ): the set of correct processes in execution φ ∈ Φ

A subset C ⊆ Π is a core of Π iff
  ∀φ ∈ Φ, C ∩ up(φ) ≠ ∅
  ∀C' ⊂ C, ∃φ ∈ Φ s.t. C' ∩ up(φ) = ∅
C_Π: set of cores of Π

A subset S ⊆ Π is a survivor set of Π iff
  ∃φ ∈ Φ s.t. S = up(φ)
  ∀S' ⊂ S, ∀φ ∈ Φ, S' ≠ up(φ)
S_Π: set of survivor sets of Π

⟨Π, C_Π, S_Π⟩: system configuration

An example

Blue, Red, and Yellow fail independently
Failures of Yellow processes are highly correlated
r({Red, Blue, Yellow}) = R

Another example

Blue: highly reliable server
Red: client
Failures of Blue and Red are negatively correlated
Probability of more than 3 Red processes failing is negligible

Determining cores and survivor sets

Probability models
  E.g., Markov models used in the analysis of dynamic fault trees [Ren98]
  To find cores: minimal subset of processes s.t. the probability of total failure in the subset is negligible
  Often difficult in practice

Attribute-based model [JM02]
  Processes characterized by attributes
  Attributes determine failure correlation
  Finding a core is NP-hard

Color-based model [JM02]
  Single attribute characterizes a process
  Polynomial-time algorithm to find cores

Cores/Survivor sets vs. Quorum systems

Cores, survivor sets, quorums
  Subsets of processes

Quorums [Giff79]
  Enforce mutual exclusion [GM85]
  E.g., one-copy serializability
  Quorums necessarily intersect
  Execute operations on behalf of the system

Cores/Survivor sets
  Do not necessarily execute operations on behalf of the system
  Weaker than quorums: no intersection requirement a priori
  Generalize objects commonly used in proofs and algorithms
    Cores: subsets of size t+1
    Survivor sets: subsets of size n-t

Consensus

Motivation for Consensus

Replication often requires coordination

Coordination problems Atomic broadcast

Clock synchronization

Agreement on fault-tolerant processors (FTP)

Consensus specification

Each process begins with a proposed value v ∈ V
Goal: agree on a single value

Typical Consensus definition [Attiya98]
  Agreement: no two correct processes decide on different values
  Termination: every correct process eventually decides
  Validity: if a process p decides on value v, then v was proposed by some process q

Strong validity: if every process has v as its initial value, then v is the only possible decision value [Attiya98]

Vector validity: a correct process decides on a vector vect such that [Doudou98]
  1. If pi is correct, then vect[i] has the initial value of pi or null
  2. At least t+1 elements of vect are initial values of correct processes

Synchronous systems - Crash failures

Solution for any number of failures
  Full-information algorithm (t+1 rounds)

Early-deciding algorithms [LF82, CB00]
  For any execution with f failures, correct processes decide in at most f+1 rounds
  Clean round: round in which no process fails
    Detected when a process receives messages from the same set of processes in two consecutive rounds
  Message complexity: O(f·|Π|²)

[Figure: processes p0, p1, and p2 run rounds 1 and 2 and then decide. The bounds relating t and n that accompany the two round counts are illegible in this copy.]
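To make the t+1-round bound concrete, here is a minimal simulation sketch of a flooding algorithm for synchronous crash Consensus. It is a simplification chosen for this example (processes flood the set of values they know and decide the minimum after t+1 rounds), not the early-deciding algorithm of [LF82, CB00]. The crash schedule format, where a crashing process reaches only some recipients in its last round, is an assumption of the sketch.

```python
def flood_set(initial, t, crashes):
    """Simulate t+1 synchronous rounds of value flooding with crash failures.

    initial: list of proposed values, one per process (process ids are indexes).
    crashes: dict pid -> (round, recipients); in that round the process sends
             only to `recipients` and then stops forever (hypothetical format).
    Returns a dict pid -> decision for the processes that never crashed.
    """
    n = len(initial)
    known = [{v} for v in initial]
    alive = set(range(n))
    for rnd in range(1, t + 2):  # rounds 1 .. t+1
        inbox = [set() for _ in range(n)]
        crashed_now = set()
        for p in sorted(alive):
            if p in crashes and crashes[p][0] == rnd:
                recipients = crashes[p][1]  # partial send, then crash
                crashed_now.add(p)
            else:
                recipients = range(n)
            for q in recipients:
                inbox[q] |= known[p]
        alive -= crashed_now
        for q in alive:
            known[q] |= inbox[q]
    return {p: min(known[p]) for p in sorted(alive)}


if __name__ == "__main__":
    # t = 2: p0 crashes in round 1 after reaching only p1,
    # and p1 crashes in round 2 after reaching only p2.
    schedule = {0: (1, {1}), 1: (2, {2})}
    print(flood_set([0, 5, 7, 9], t=2, crashes=schedule))
```

In this schedule the value 0 reaches p3 only in the third round, which is why stopping after two rounds would break agreement while t+1 = 3 rounds is enough.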

In the core/survivor set model

Algorithm SyncCrash [JM03a, JM03d]
  Choose a core C, preferably the smallest
  Execute the early-deciding algorithm among the processes of C
  Every process in Π has an array of |C| positions, one for each process in C
  Processes in C send messages to processes in Π − C as well
  A process decides when a round with no failures in C happens

Replication requirement: a condition on C, marked as sufficient (formula illegible in this copy)

Decision in at most |C| rounds
  If |C|−1 < t, then it improves on the number of rounds
  Message complexity: O(f·|C|·|Π|)

Synchronous systems - Arbitrary failures

Impossible if n ≤ 3·t [Lamport82]
  Strong Consensus

Proof idea
  Suppose a Consensus algorithm that solves the problem for |Π| ≤ 3·t
  Construct an execution in which agreement is violated
  Assume |Π| ≤ 3·t and take a partition (A, B, C) of Π s.t. each subset has at most t processes

[Figure: three executions over the partition (A, B, C), showing the values each subset reports for the others. Execution 1: A, B, and C start with v. Execution 2: A, B, and C start with v'. Execution 3: A starts with v, B starts with v', and C is arbitrary (*).]

Lower bound on process replication [JM03a, JM03d]
  Byzantine Partition: every partition (A, B, C) of Π is such that at least one of the subsets contains a core
  Byzantine Intersection:
    The intersection of every pair of survivor sets in S_Π contains a core
    The intersection of every three survivor sets in S_Π is not empty

[Figure: the three scenarios of the proof idea over the partition (A, B, C). Scenario 1: A, B, and C start with v. Scenario 2: A, B, and C start with v'. Scenario 3: A starts with v, B starts with v', and C is arbitrary (*).]
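The Byzantine Partition property can be checked by brute force on small systems. The sketch below enumerates every assignment of processes to three subsets and looks for one in which no subset contains a core. Allowing empty subsets in the enumeration and deriving cores from a threshold (`threshold_cores`) are assumptions of this example.

```python
from itertools import combinations, product


def contains_core(subset, cores):
    """True if some core is entirely inside `subset`."""
    return any(core <= subset for core in cores)


def byzantine_partition(processes, cores):
    """True if every split of `processes` into (A, B, C) has a part containing a core.

    Empty parts are allowed here, which is an assumption of this sketch.
    """
    procs = sorted(processes)
    for labels in product(range(3), repeat=len(procs)):
        parts = [
            frozenset(p for p, label in zip(procs, labels) if label == i)
            for i in range(3)
        ]
        if not any(contains_core(part, cores) for part in parts):
            return False
    return True


def threshold_cores(processes, t):
    """Cores of a threshold system: all subsets of size t+1."""
    return {frozenset(c) for c in combinations(sorted(processes), t + 1)}


if __name__ == "__main__":
    for n, t in [(3, 1), (4, 1), (6, 2), (7, 2)]:
        ok = byzantine_partition(range(n), threshold_cores(range(n), t))
        print(f"n={n}, t={t}: {ok}")
```

For threshold-derived cores the check agrees with n > 3·t: it fails for (n, t) = (3, 1) and (6, 2) and holds for (4, 1) and (7, 2).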

Equivalence of Byzantine Intersection and Partition

In a partition (A, B, C) in which no subset contains a core:
  All processes in A can be faulty: B ∪ C contains a survivor set S2
  All processes in B can be faulty: A ∪ C contains a survivor set S1
  All processes in C can be faulty: A ∪ B contains a survivor set S3
  S1 ∩ S2 ∩ S3 is empty

Solving Consensus for arbitrary failures

In the threshold model: Lamport et al. [Lamport82]
  Solution for n > 3·t in t+1 rounds

In the core/survivor set model
  Modified algorithm by Lamport et al.
  Solution for systems satisfying Byzantine Partition
  Replace subsets of processes of size n−t by survivor sets
  Replace majority by intersection of two survivor sets

Enables a solution for some systems, e.g.:
  Π = {pa, pb, pc, pd, pe}
  C_Π = {papbpc, papd, pape, pbpd, pbpe, pcpd, pcpe, pdpe}
  S_Π = {papbpcpd, papbpcpe, papdpe, pbpdpe, pcpdpe}
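A short check of the five-process example, as a sketch: it hard-codes the cores and survivor sets listed on the slide (with pa..pe abbreviated to the letters a..e, a naming choice of this example) and tests both parts of Byzantine Intersection. It also derives what a threshold view of the same system would require, taking t as the largest number of processes outside a survivor set.

```python
from itertools import combinations

PROCESSES = frozenset("abcde")
CORES = [frozenset(c) for c in ("abc", "ad", "ae", "bd", "be", "cd", "ce", "de")]
SURVIVORS = [frozenset(s) for s in ("abcd", "abce", "ade", "bde", "cde")]


def pairwise_intersections_contain_core(survivors, cores):
    """Every pair of survivor sets intersects in a superset of some core."""
    return all(
        any(core <= (s1 & s2) for core in cores)
        for s1, s2 in combinations(survivors, 2)
    )


def triple_intersections_nonempty(survivors):
    """Every three survivor sets share at least one process."""
    return all(s1 & s2 & s3 for s1, s2, s3 in combinations(survivors, 3))


def threshold_view(processes, survivors):
    """(n, t, n > 3t), with t = n minus the size of the smallest survivor set."""
    n = len(processes)
    t = n - min(len(s) for s in survivors)
    return n, t, n > 3 * t


if __name__ == "__main__":
    print(pairwise_intersections_contain_core(SURVIVORS, CORES))
    print(triple_intersections_nonempty(SURVIVORS))
    print(threshold_view(PROCESSES, SURVIVORS))
```

Both intersection checks pass, while the threshold view gives n = 5 and t = 2, so n > 3·t fails. That is the sense in which the core/survivor set model enables a solution the threshold model rules out.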

Lower bound on the number of rounds

Definitions
  A replication requirement (e.g., Byzantine Partition)
  sys' = ⟨Π', C', S'⟩ is a subsystem of sys = ⟨Π, C, S⟩ iff Π' ⊆ Π, C' ⊆ C, and sys' satisfies the replication requirement
  A subsystem is minimal if there is no smaller subsystem

Theorem [JM03a, JM03b]: given a system sys = ⟨Π, C, S⟩, a minimal subsystem sys' = ⟨Π', C', S'⟩ of sys, and a Consensus algorithm A
  The maximum number of failures in the subsystem is derived from min{|S| : S ∈ S'}
  If there are at least two correct processes: f+1 rounds to decide
  If all processes but one can be faulty: a bound of the form min(…, f+1) rounds to decide

Back to the example

Π = {pa, pb, pc, pd, pe}

Crash failures
  Subsystem: smallest core (|C| = 2)
  Lower bound on the number of rounds: f + 1 = 1 + 1 = 2 (worst case)

Arbitrary failures
  Lower bound on the number of rounds: f + 1 = 2 + 1 = 3 (worst case)

Bound is different for crash and arbitrary failures!

Asynchronous systems

No solution for purely asynchronous systems, even for a single crash failure [FLP85]
  Slow process vs. faulty process: requires a liveness property

Common approaches
  Partially synchronous systems [DLS88]
  Extend the model with failure detectors [CT96]

Crash failures (◇S [CT96])
  Crash Partition: every partition (A, B) of Π is such that either A or B contains a core
  Crash Intersection: the intersection of every two survivor sets contains a core (coterie [GM85])

Arbitrary failures (◇M [Doudou98])
  Byzantine Partition/Intersection

Related work - Hybrid failure models

Moves away only from the identically distributed failure assumption

Different failure modes, one class for each mode [LR94]
  Manifest (c): detectable failures (e.g., corrupted messages)
  Symmetric (s): behavior deviates arbitrarily, but it is the same for every other processor (e.g., sending the same erroneous value to every other process)
  Arbitrary (a): behavior deviates arbitrarily (e.g., sending different values to different processes)

Algorithm for the Oral Messages problem: n > 2a + 2s + c + m, a ≤ m

Replication requirements elsewhere

More general descriptions of failure scenarios
  Fail-prone systems [Malkhi97]
  Collusion and adversary structures (malicious players) [Hirt97]

Martin et al. [Martin02]
  Confirmable writes in quorum systems
  Property: for every subset B in a fail-prone system and every pair of quorums Q1, Q2, we have that (Q1 ∩ Q2) \ B ≠ ∅
    ⇒ the intersection of every pair of quorums contains a core

Hirt and Maurer [Hirt97]
  Secure multi-party protocols
  Passive model: no pair of collusions can add up to the set of players
    ⇒ the set of correct players is a coterie
  Active model: no three adversaries can add up to the set of players
    ⇒ the intersection of three sets of correct players is not empty

Generalizing n > k·t (work in progress)

Motivation: k integer

Properties establishing bounds on process replication are similar across problems
(TM: threshold model; C/SS: core/survivor set model)

Asynchronous crash Consensus (◇W)
  TM: n > 2·t
  C/SS: ∀S1, S2 ∈ S_Π: S1 ∩ S2 ≠ ∅

State-machine replication: arbitrary failures
  TM: n > 2·t
  C/SS: ∀S1, S2 ∈ S_Π: S1 ∩ S2 ≠ ∅

Synchronous arbitrary Consensus
  TM: n > 3·t
  C/SS: ∀S1, S2, S3 ∈ S_Π: S1 ∩ S2 ∩ S3 ≠ ∅

Motivation: k rational

Consensus for synchronous systems with receive-omission faults

In the threshold model: n > 3t/2

Proof idea (partition of the processes into A, B, and C; the figure labels each subset with t/2)
  Execution 1
    Processes in B and C crash
    Processes in A propose 0 and decide upon 0
  Execution 2
    Processes in A and C crash
    Processes in B propose 1 and decide upon 1
  Execution 3
    Processes in A fail to receive messages from processes not in A
    Processes in B fail to receive messages from processes not in B
    Processes in A propose 0 and decide upon 0
    Processes in B propose 1 and decide upon 1
    Agreement is violated!

Generalizing the partition and the intersection properties

( , )-Partition. For every partition A = {A1, A2, …} of Π, there is a subset A' = {A_k1, A_k2, …} of A such that the union of the subsets in A' contains a core.

( , )-Intersection. [The statement, which quantifies over collections of subsets of S_Π, is garbled in this copy.]

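Some intuition on the generalized properties

For the parameter values 3 and 2, and a partition (A, B, C):

Threshold model
  B ∪ C contains t+1 processes
  A ∪ C contains t+1 processes
  A ∪ B contains t+1 processes

Core/Survivor set model
  B ∪ C contains a core
  A ∪ C contains a core
  A ∪ B contains a core

[Figure: two diagrams of the partition (A, B, C), one per model, with subsets labeled t/2 in the threshold case. The remaining formulas on this slide, including a condition over triples (S1, S2, S3) of survivor sets built from the pairs S1 and S2, S1 and S3, and S2 and S3, are garbled in this copy.]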

Bounds on process replication

Lower bound
  Every set of processes that satisfies the threshold-model bound on n (a generalization of n > k·t) also satisfies ( , )-Partition
  In every partition of Π into subsets, there are subsets s.t. the union contains at least t+1 processes, and consequently a core

Upper bound (work in progress)
  If a problem P can be solved by an algorithm A in a system satisfying the threshold-model bound for integer k, then P can be solved in a system satisfying (k,1)-Partition
  Simulate a system under the threshold model

Rational k
  Looking for a candidate algorithm to motivate

Implications

Algorithms designed under the threshold model can be automatically translated to our model, for integer k

There is no need to rethink the whole FT distributed systems world

If it simplifies, one may design an algorithm under the threshold model and later translate using our technique

Correlated failures in the real world

(work in progress)

Background: Systems considering dependent failures

Oceanstore [WMK02]
  Online mechanism to correlate failures
  Identify subsets of maximally independent failures
  Problem
    Correlates failures only after they have happened
    Not useful for malicious behavior

PASIS [BWWG02]
  Survivable storage systems
  Add correlation level to the classical model of availability
  Two models to determine the correlation level
    Conditional probabilities
    Beta-binomial distribution
  Problem: requires the computation of failure distributions

Coping with Internet catastrophes: Phoenix

Possible approaches
  Contain Internet pathogens: very challenging [Moore03b]
  Recover from catastrophes: replicate data

Typical replication strategy
  Assume independent host failures
  Compute a threshold t on the number of failures
  Replicate to this degree

Shared vulnerabilities
  Dependent host failures
  Independence of host failures is not a suitable assumption

Threshold t on the number of host failures
  From previous events, t can be large
  Code Red worm infected over 360,000 hosts

Our replication strategy

Desirable properties
  Enable recovery of data after an Internet catastrophe
  Small replica sets

Informed strategy for replica placement [JBMSV03C]
  Sets of hosts that fail independently
  Hosts executing different sets of software systems

Classes of software systems: attributes
  E.g., operating system

Potentially vulnerable software systems: attribute values
  E.g., Linux, Windows

Replicate data on a set of hosts that have different values for each attribute: cores
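One way to picture this placement rule is a greedy sketch: starting from the host that owns the data, keep adding the host that differs from it in the most attributes not yet covered, until every attribute has at least one differing host. This is an illustrative heuristic under that reading of "different values for each attribute", not necessarily the heuristic used by Phoenix. It does not guarantee a minimal set, and the host names and the web server and browser values are hypothetical.

```python
def greedy_core(start, hosts):
    """Pick hosts so that, for every attribute, some chosen host differs from `start`.

    hosts: dict mapping a host name to a tuple of attribute values.
    Returns the list of chosen hosts (starting with `start`), or None if some
    attribute cannot be covered.
    """
    base = hosts[start]
    uncovered = set(range(len(base)))
    core = [start]
    candidates = sorted(h for h in hosts if h != start)

    def gain(h):
        # Number of still-uncovered attributes on which h differs from start.
        return sum(1 for i in uncovered if hosts[h][i] != base[i])

    while uncovered:
        best = max(candidates, key=gain, default=None)
        if best is None or gain(best) == 0:
            return None
        core.append(best)
        uncovered -= {i for i in uncovered if hosts[best][i] != base[i]}
        candidates.remove(best)
    return core


if __name__ == "__main__":
    # (operating system, web server, web browser); values are made up.
    hosts = {
        "red": ("Linux", "Apache", "Firefox"),
        "green": ("Windows", "IIS", "IE"),
        "yellow": ("Windows", "Apache", "Firefox"),
        "blue": ("Linux", "IIS", "IE"),
    }
    print(greedy_core("red", hosts))
```

With these made-up configurations, red pairs with green, which differs in every attribute. If green is removed, the sketch falls back to a three-host set with blue and yellow.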

An example

Attributes (two values each, shown as icons on the slide)
  Operating system
  Web server
  Web browser

Cores
  Red and Green (orthogonal core)
  Red, Yellow, and Blue

[Figure: hosts in Phoenix and their attribute configurations.]

In this presentation…

Feasibility of this approach
  What is the impact of diversity on storage overhead and load?
  Diversity: distribution of attribute configurations
  Storage overhead: size of the replica set (core)
  Storage load: given a host h, the number of cores in which h participates

Simulations
  Levels of diversity
  Varying attribute sets


A set H of hosts
A set A of attributes
  Every attribute has the same cardinality y
A mapping M from hosts to attribute configurations

Diversity
  Determined by M
  Often skewed in practice (93% Windows) [OneStat]

Modeling diversity
  Single parameter f ∈ [0.5, 1)
  A share f of the hosts has a share (1−f) of the attribute configurations
  Example 1: f = 0.5
  Example 2: f = 0.75

[Figure: distribution of hosts over attribute configurations for the two examples.]
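The single-parameter diversity model can be sketched as a host-to-configuration mapping: a share f of the hosts is spread over a share (1−f) of the configurations, and the remaining hosts over the remaining configurations. Spreading hosts evenly (round-robin) inside each group, the rounding choices, and the function name are assumptions of this example; the slides do not say how hosts are distributed within each share.

```python
from itertools import product


def assign_configurations(num_hosts, num_attrs, cardinality, f):
    """Return one attribute configuration per host, skewed by parameter f.

    Assumes at least two configurations and 0.5 <= f < 1.
    """
    configs = list(product(range(cardinality), repeat=num_attrs))
    popular = max(1, round((1 - f) * len(configs)))  # size of the crowded share
    crowded_hosts = round(f * num_hosts)
    rest = configs[popular:]
    assignment = []
    for h in range(num_hosts):
        if h < crowded_hosts:
            assignment.append(configs[h % popular])
        else:
            assignment.append(rest[(h - crowded_hosts) % len(rest)])
    return assignment


if __name__ == "__main__":
    mapping = assign_configurations(1000, num_attrs=3, cardinality=2, f=0.75)
    crowded = {(0, 0, 0), (0, 0, 1)}
    print(sum(1 for c in mapping if c in crowded))
```

With 3 binary attributes there are 8 configurations, so f = 0.75 places 750 of 1,000 hosts on 2 of them, while f = 0.5 gives an even split.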

Heuristic to find cores

Attributes (two values each, shown as icons on the slide)
  Operating system
  Web server
  Web browser

Cores
  Red and Green
  Red, Yellow, and Blue

[Figure: hosts in Phoenix and their attribute configurations, shown in two steps of the heuristic.]

Summary of results

Simulated for 1,000 hosts
  8 attributes, 2 values per attribute
    f=0.7: core size=2/2.34/6 (min/avg/max), storage load=21
    f=0.95: core size=2/3.49/7 (min/avg/max), storage load=151
  8 attributes, 4 values per attribute
    f=0.7: core size=2/2.00/2 (min/avg/max), storage load=6
    f=0.95: core size=2/2.01/3 (min/avg/max), storage load=52

Conclusions
  Even for highly skewed diversity, average core size is small
  More attribute values reduce core size variation

Wrapping up

Conclusions

Process failures are often non-IID

Core/survivor set model
  Enables one to model non-IID failures
  Abstracts failure probability distributions
  Generalizes objects commonly used in algorithms and proofs

Consensus
  Improves on the number of rounds
  Enables solutions in systems in which Consensus is not solvable under the threshold model

Generalizing the results for Consensus
  General lower bound on process replication
  Automatic translation of algorithms
  Compatible with previous works

The Phoenix recovery system
  An application that uses the core abstraction
  Determines cores by using attributes of hosts (shared vulnerabilities)
  Significantly reduces storage overhead compared with a solution under the threshold model
  Current status: we are working on the design of a prototype

Future work

Impact on reliability and performance
  Fewer executions allowed
  Another requirement: compute cores/survivor sets

Static vs. dynamic cores/survivor sets
  Processes joining and leaving
  Changes in reliability

Implementation issues
  Representation of cores and survivor sets
  Determining the cores/survivor sets of a system
  Applicability to the various systems

Phoenix
  Determining good sets of attributes
  Heuristics to find cores: storage overhead vs. storage load

Dissertation plan

Representation of non-IID failures
  Core/Survivor set model [JM03c]
  Application to Consensus [JM03a]

General bounds
  Lower bound
  Algorithm translation (work in progress)
  Submission to PODC 2004 (January 2004)

Phoenix [JBMSV03C]
  Method for determining attribute sets
  Heuristics to find cores that consider both storage load and overhead
  Implementation details
  Submissions to OSDI (March 2004) and NSDI (September 2004)

Bibliography

[Amir96] Amir, Y., and Wool, A., “Evaluating quorum systems over the Internet,” in 26th Symposium on Fault-tolerant Computing (FTCS’96), (Sendai, Japan) pp. 26-35, IEEE Computer Society, June 1996.

[Moore02] Moore, D., Shannon, C., and Brown, J., “Code-Red: A case study on the spread and victims of an Internet worm,” in Proceedings of the 2002 ACM SIGCOMM Internet Measurement Workshop, (Marseille, France), pp. 273-284, Nov. 2002.

[Moore03a] Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., and Weaver, N., “The spread of the Sapphire/Slammer worm,” Tech. Rep., CAIDA, La Jolla, CA.

[Little01] Littlewood, B., Popov, P., and Strigini, L., “Modeling software design diversity - A review,” ACM Computing Surveys, vol. 33, pp. 177-208, June 2001.

[Tang92] Tang, D., and Iyer, R., “Analysis and modeling of correlated failures in multicomputer systems,” IEEE Transactions on Computers, vol. 41, pp. 567-577, May 1992.

[JM03a] Junqueira, F., and Marzullo, K., “Synchronous Consensus for dependent process failures,” in International Conference on Distributed Computing Systems (ICDCS’03), May 2003

[Ren98] Ren, Y., and Dugan, J. B., “Optimal design of reliable systems using static and dynamic fault trees,” IEEE Transactions on Reliability, vol. 47, no. 3, pp. 234-244, Dec. 1998.


[Wensley76] Wensley, J. et al., “SIFT: Design and analysis of a fault-tolerant computer for aircraft control,” in 2nd IEEE International Conference on Software Engineering, pp. 458-469, 1976.

[Lamport82] Lamport, L., Shostak, R., and Pease, M., “The Byzantine generals problem,” ACM Transactions on Programming Languages and Systems, vol. 4, pp. 382-401, July 1982.

[FLP85] Fischer, M., Lynch, N., and Paterson, M., “Impossibility of distributed Consensus with one faulty process,” Journal of the ACM, vol. 32, pp. 374-382, April 1985.

[DLS88] Dwork, C., Lynch, N., and Stockmeyer, L., “Consensus in the presence of partial synchrony,” Journal of the ACM, vol. 35, pp. 288-323, Apr. 1988.

[CT96] Chandra, T., and Toueg, S., “Unreliable failure detectors for reliable distributed systems,” Journal of the ACM, vol. 43, pp. 225-267, March 1996.

[CHT96] Chandra, T., Hadzilacos, V., and Toueg, S., “The weakest failure detector for solving Consensus,” Journal of the ACM, vol. 43, pp. 685-722, July 1996.

[Doudou98] Doudou, A., and Schiper, A., “Muteness detectors for Consensus with Byzantine processes,” in 17th ACM Symposium on Principles of Distributed Computing, (Puerto Vallarta, Mexico), p. 315, July 1998. (Brief Announcement).


[LR94] Lincoln, P., and Rushby, J., “Formal verification of an interactive consistency algorithm for the Draper FTP architecture under a hybrid failure model,” in Proceedings of the 9th IEEE Annual Conference on Computer Assurance, pp. 107-120, June 1994.

[JM03b] Junqueira, F., and Marzullo, K., “Lower bound on the number of rounds for synchronous Consensus with dependent process failures,” Tech. Rep. CS2003-0734, UCSD, La Jolla, CA, January 2003.

[JM03c] Junqueira, F., and Marzullo, K., “Designing algorithms for dependent process failures,” in Future Directions in Distributed Computing, number 2584 in LNCS, pp. 24-28, Springer-Verlag, 2003.

[JM02] Junqueira, F., and Marzullo, K., “Coping with dependent process failures,” Tech Rep. CS2002-0723, UCSD, La Jolla, October 2002.

[JBMSV03C] Junqueira, F., Bhagwan, R., Marzullo, K., Savage, S., and Voelker, G. M., “The Phoenix Recovery System: Rebuilding from the ashes of an Internet catastrophe,” in IX Workshop on Hot Topics in Operating Systems (HotOS-IX), May 2003.

[Giff79] Gifford, D., “Weighted voting for replicated data,” in 7th Symposium on Operating Systems Principles (SOSP), pp. 150-162, 1979.

[Attiya98] Attiya, H., and Welch, J., “Distributed computing: Fundamentals, simulation, and advanced topics,” chapter 5, McGraw-Hill, 1998.


[GM85] Garcia-Molina, H., and Barbara, D., “How to assign votes in a distributed system,” Journal of the ACM, vol. 32, pp. 841-860, October 1985.

[Malkhi97] Malkhi, D., and Reiter, M., “Byzantine quorum systems,” in 29th ACM Symposium on Theory of Computing, pp. 569-578, May 1997.

[LF82] Lamport, L., and Fischer, M., “Byzantine generals and transaction commit protocols,” Tech. Rep. 62, SRI International, Apr. 1982.

[CB00] Charron-Bost, B., and Schiper A., “Uniform Consensus is harder than Consensus,” Tech. Rep. DSC/2000/028, École Polytechnique Fédérale de Lausanne, Switzerland, May 2000.

[JM03d] Junqueira, F. and Marzullo, K., “Consensus for dependent process failures,” Tech. Rep. CS2003-0737, UCSD, La Jolla, CA, February 2003.

[Martin02] Martin, J.-P., Alvisi, L., and Dahlin, M., “Minimal Byzantine storage,” in Proceedings of the 17th Symposium on Distributed Computing, LNCS 2508, pp. 311-325, Ed. Malkhi, D., 2002.


[Hirt97] Hirt, M. and Maurer, U., “Complete characterization of adversaries tolerable in secure multi-party computation,” in ACM Symposium on Principles of Distributed Computing (PODC’97), (Santa Barbara, CA), pp. 25-34, 1997.

[JM03e] Junqueira, F. and Marzullo, K., “On the generalization of n>k·t,” Tech. Rep. CS2003-0743, UCSD, La Jolla, CA, April 2003.

[WMK02] Weatherspoon, H., Moscovitz, T., and Kubiatowicz, J., “Introspective failure analysis: Avoiding correlated failures in peer-to-peer systems,” in Proceedings of the International Workshop on Reliable Peer-to-Peer Distributed Systems, October 2002.

[BWWG02] Bakkaloglu, M., Wylie, J., Wang, C., Ganger, G., “On correlated failures in survivable storage systems,” Tech. Rep. CMU-CS-02-129, CMU, May 2002.

[Moore03b] Moore, D., Shannon, C., Voelker, G. M., and Savage, S., “Internet quarantine: Requirements for containing self-propagating code,” in Proceedings of the IEEE INFOCOM, Apr. 2003.

[ICAT] National Institute of Standards and Technology (NIST), “ICAT Vulnerability Database.” http://icat.nist.gov/icat.cfm.











Timeline (Jul 1, 2003 to Dec 1, 2004)

  8/2 - 10/27: Microsoft internship
  10/31 - 1/26: Generalization
  10/31 - 3/15: Phoenix
  1/04: PODC
  3/04: OSDI
  3/30 - 12/1: Phoenix, journal papers, and dissertation
  12/1: Thesis defense

Where IID does not apply…

Evaluation of quorum systems in a wide-area network [Amir96]
  Crashes are strongly correlated
  Power outages
    Computers in the same room
    Total failure during the experiment
    Wide-area outage
  Network partitions
    Quorums partially unreachable
    Computers in different segments
    Computers in the same segment
      Switching devices
      Bridges


Software bugs [Little01]
  Single version
    A demand may cause all replicas to crash
    E.g., state-machine replication
  Multiple independently developed versions
    Difficulty of a demand: difficulty in handling it
    Level of difficulty varies among the demands
    More difficult demands tend to cause multiple versions to fail


Failures are not independent
  Computing a threshold is not practical
  Model of dependent failures based on shared vulnerabilities

Storage overhead is small even for highly skewed diversity

Storage load can be large
  Has to be considered by the heuristic that finds cores
  Increase average core size

Synchronous systems - Crash failures

Solution for any number of failures
  Early-deciding algorithm: decision in f+1 rounds, where f is the number of failures in a given execution [Charron-Bost and Schiper]
  1. Every process keeps an array of initial values
  2. In every round, a process:
     1. sends its array of initial values to all the other processes
     2. receives messages from other processes (array of initial values or decide)
     3. updates its array according to the received arrays
     4. decides if it receives a decide message, and then sends a decide message to all the other processes in the next round
     5. decides if the round is detected as clean, and then sends a decide message to all the other processes in the next round

Upper bound on process replication

Conjecture: Suppose a correct algorithm A that requires under the threshold model

Replace (n-’·t), 0 < ’ , for intersection of ’ survivor sets in A to generate algorithm A’

Transformed algorithm A’ is correct

Intuition: k=3
  In every execution: at least 2·t+1 correct processes (a subset of Π)
  Survivor sets: subsets of processes of size 2·t+1
  Cores: subsets of size t+1
  Every two subsets of size 2·t+1 (survivor sets) intersect in at least t+1 processes (a core)
  The intersection of two survivor sets contains at least one correct process
  At least one intersection of two survivor sets contains only correct processes
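The counting step behind this intuition can be written out explicitly. The derivation below assumes n = 3·t+1, the smallest system size with n > 3·t; that value is an assumption made for the worked example.

```latex
\[
|S_1 \cap S_2| \;=\; |S_1| + |S_2| - |S_1 \cup S_2|
\;\ge\; 2(2t+1) - (3t+1) \;=\; t+1 .
\]
```

Since at most t processes are faulty, an intersection of t+1 processes always contains at least one correct process.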

1 ,

1,tn

69Solving Consensus for arbitrary failuresSolving Consensus for arbitrary failures

Algorithm by Lamport, Shostak, and Pease [Lamport82]Each process keeps a copy of the treeLevel i of the tree: values received at round iIn round 0, a correct process broadcast its initial valueIn round i, a correct process sends the values at level i-1

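[Figure: tree of process-identifier sequences. Depth 0 is the root; depth 1 holds p1, p2, p3, …, pn; depth 2 holds pairs such as p1p2 and p1p3; a node at depth t is a sequence p1p2...pt, and its children at depth t+1 each append one further process.]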

[Figure: a node at depth l-1 and its children at depth l, each child appending one further process identifier to the parent's sequence.]

Each correct process
  Traverses the tree in post-order
  Evaluates each node as follows
    If the node is a leaf, then it evaluates to its current value
    Otherwise, use the majority (null if there is no majority)

Claim: if process p is correct, then
  The value of node(p) is the same in every correct process q
  It is the value p sent at round |p| regarding [symbol lost in extraction]

Proof idea
  Recursion on the levels of the tree
  Base case: level t+1 (leaves)
  Ind. hypothesis: claim valid for every level l ≤ t+1
  Ind. step: prove for l-1

[Figure: leaves at depth t+1, holding the values received at round t+1.]
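The tree construction and the bottom-up evaluation can be sketched as follows. This is a minimal simulation under stated assumptions, not a transcription of [Lamport82]: rounds are indexed by tree depth 1..t+1 rather than from 0, a node with no strict majority evaluates to None, and a faulty process is modeled by a hypothetical `lie(sender, receiver, label)` callback that picks what it sends. All function names are illustrative.

```python
from collections import Counter

def majority(values):
    """Strict majority of values, or None (null) when there is none."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None

def run_rounds(initial, faulty, lie, t):
    """Build every process's tree; labels are tuples of distinct process ids."""
    procs = sorted(initial)
    trees = {p: {} for p in procs}
    for depth in range(1, t + 2):
        received = {p: {} for p in procs}
        for sender in procs:
            if depth == 1:
                labels = [()]  # first round: relay nothing, send the initial value
            else:
                labels = [lab for lab in trees[sender]
                          if len(lab) == depth - 1 and sender not in lab]
            for lab in labels:
                for receiver in procs:
                    if sender in faulty:
                        value = lie(sender, receiver, lab)
                    elif depth == 1:
                        value = initial[sender]
                    else:
                        value = trees[sender][lab]
                    received[receiver][lab + (sender,)] = value
        for p in procs:
            trees[p].update(received[p])
    return trees

def evaluate(tree, label, procs, t):
    """Leaves evaluate to their stored value; inner nodes to the majority of children."""
    if len(label) == t + 1:
        return tree[label]
    children = [evaluate(tree, label + (q,), procs, t)
                for q in procs if q not in label]
    return majority(children)

def decide(initial, faulty, lie, t):
    """Decision of every non-faulty process: the evaluation of the root."""
    procs = sorted(initial)
    trees = run_rounds(initial, faulty, lie, t)
    return {p: evaluate(trees[p], (), procs, t) for p in procs if p not in faulty}
```

With n=4 and t=1, a faulty process that tells different receivers different values cannot make the three correct processes disagree, because each correct process sees the same relayed values at depth 2.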

Algorithm for the core/survivor set model

Adapt the algorithm from Lamport et al.
  In the original algorithm
    [Figure: a node at depth t and its leaves at depth t+1]
  In our algorithm: given a system ⟨Π, C, S⟩
    Intersection of two survivor sets instead of majority
    [Figure: a node at depth l and its leaves at depth l+1]

Π = {pa, pb, pc, pd, pe}

In the threshold model
  Threshold on the number of failures: 2
  Minimum number of processes: 3·t+1 = 7
  There is no solution!

[Figure: tree rooted at {}. Depth 1: pa, pb, pc, pd, pe. Depth 2: every ordered pair of distinct processes (papb, papc, …, pepd). Depth 3: only under papb, papc, pbpa, pbpc, pcpa, and pcpb (for example papbpc, papbpd, papbpe).]

Executing the algorithm

Π = {pa, pb, pc, pd, pe}

pa and pc are faulty

[Figure: the subtree rooted at pa (children papb, papc, papd, pape; grandchildren papbpc, papbpd, papbpe, papcpb, papcpd, papcpe) shown at Time 1, Time 2, and Time 3. Nodes are marked either "Possibly different values across correct processes" or "Same value across correct processes".]

Asynchronous systems

No solution for pure asynchronous systems even for a single crash failure [FLP85]
  Slow process vs. faulty process
  Requires a liveness property

Approach 1: consider more realistic timing assumptions
  Partially synchronous systems [DLS88]
  Difficult to evaluate parameters in practice

Approach 2: extend model with failure detectors [CT96]
  Unreliable failure detectors

Asynchronous systems - Crash failures

◇W is the weakest class of failure detectors that enable a solution to Consensus [CHT96]
  Weak completeness: eventually every process that crashes is suspected by some correct process
  Eventual weak accuracy: there is some correct process which is eventually not suspected by any other correct process

Lower bound on process replication: n ≥ 2·t+1
Proof idea:
  Scenario with A: initial value of processes in A is v; processes in A decide v
  Scenario with B: initial value of processes in B is v'; processes in B decide v'
  Scenario with A and B: no faulty process; messages from A to B are delayed; processes in A decide v and processes in B decide v'; agreement is violated

An algorithm

Rotating coordinator paradigm [CT96]
  Assumes a failure detector D in ◇S (equivalent to ◇W)
    Strong completeness: eventually every correct process suspects forever every faulty process

t=2, n=5: execution with no suspicions or failures
[Figure: message diagram with a coordinator and processes p1, p2, p3, p4]
  Every process sends an estimate message to the coordinator
  Coordinator gathers t+1 estimates and proposes a new estimate
  Processes acknowledge the reception of an estimate from the coordinator
  Coordinator gathers t+1 acks and broadcasts a decide message

Proof of correctness

No correct process stops (does not decide, does not move on) in a round i
  A correct process either
    Decides in a round
    Eventually suspects the coordinator (Strong Completeness) and moves on to the next round

Eventually there is a round in which the coordinator is not suspected by any correct process
  Ensured by Eventual Weak Accuracy

If not all processes decide in the same round
  Once some process decides, the decision value is “locked”


Lower bound on process replication [JM03d]
  Crash Partition: There is no partition (A,B) of the processes in Π such that neither A nor B contains a core
  Crash Intersection: The intersection of every two survivor sets contains a core
  Crash Partition ⇔ Crash Intersection

Bound is tight: Chandra and Toueg’s algorithm modified
  In the original algorithm: coordinator waits for n-t replies
  In our algorithm: coordinator waits for replies from the processes of a survivor set
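The change to the coordinator's wait condition can be sketched as two predicates. The names below are hypothetical, and the representation of survivor sets as frozensets is an assumption made for the illustration. In the threshold model, where the survivor sets are exactly the subsets of size n-t, the two conditions coincide.

```python
from itertools import combinations

def threshold_survivor_sets(processes, t):
    """Survivor sets of the threshold model: all subsets of size n-t."""
    size = len(processes) - t
    return [frozenset(c) for c in combinations(sorted(processes), size)]

def enough_replies_threshold(replied, n, t):
    """Original rule: the coordinator has n-t replies."""
    return len(replied) >= n - t

def enough_replies_survivor(replied, survivor_sets):
    """Modified rule: every process of some survivor set has replied."""
    return any(s <= set(replied) for s in survivor_sets)
```

For a system with dependent failures, only the second predicate is meaningful, since the survivor sets need not all have the same size.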

Proof idea

Layering technique [Keidar]
  Layer: [p,[i]]
    Process p fails but sends messages to processes pi,…, pn

Apply layers to system states
  State is composed of states of processes
  Similar states x, y: only a single process can distinguish x from y
  A set of states is similarly connected iff for every pair of states in the set, there is a chain of similar states connecting them

Set of initial states is similarly connected
Applying layers to a similarly connected set of states generates another similarly connected set of states
Cannot apply layers indefinitely

Asynchronous systems - Arbitrary failures

Faulty processes can behave arbitrarily
  Correct to a subset of processes
  Strong completeness does not make sense

Mute process [Doudou98]
  A process pi is mute to a process pj iff there is a time t after which pi stops sending messages to pj forever

Mute completeness
  Every process pi eventually suspects forever a process pj that is mute to pi
  Equivalent to ◇S if processes fail only by crashing

Lower bound on process replication

Lower bound: n ≥ 3·t+1 (Strong Consensus)
Proof idea: assume n ≤ 3·t, and a partition (A,B,C) such that |A|, |B|, |C| ≤ t

Scenario 1:
  All processes in B crash at time 0
  Processes in A and C propose value v and decide v

Scenario 2:
  All processes in A crash at time 0
  Processes in B and C propose value v' and decide v'

Scenario 3:
  All processes in C are arbitrarily faulty
  Processes in C behave to processes in A as in Scenario 1, and to processes in B as in Scenario 2
  Messages from A to B and conversely are delayed until after the last process decides
  Processes in A propose v, and processes in B propose v'

Processes in A cannot distinguish Scenario 1 from Scenario 3
Processes in B cannot distinguish Scenario 2 from Scenario 3
Processes in A and B decide upon different values (agreement violation)

An algorithm for Vector Consensus

Requires
  n ≥ 3·t+1
  Digitally signed messages
  Certificates
    Certify message content
    E.g. Decision message has to contain enough Estimate messages from other processes

Each process has a list of faulty processes
  FIFO channels: out of order messages
  Corrupted messages

1st stage
  Each process broadcasts its initial value
  Each process composes a proposed vector with received values

An algorithm for Vector Consensus (cont.)

2nd stage: asynchronous rounds of message exchange

[Figure: two message diagrams, each with a marked coordinator]
  Coordinator sends its estimate:
    Forward estimate received from the coordinator
    Decide after receiving at least 2·t+1 estimate msgs
  Coordinator crashes and does not send estimate:
    Processes exchange suspicion messages
    Processes exchange current estimates after receiving at least 2·t+1 suspicion messages
    Move on to the next round after receiving at least 2·t+1 current estimates

In the core/survivor set model

Byzantine Intersection/Partition is necessary and sufficient [JM03d]

Necessity proof
  Assume Byzantine Partition does not hold
  Scenario in which processes decide upon different values

Sufficiency proof
  Modify algorithm by Doudou and Schiper
  Original algorithm: process waits for messages from 2/3 of the processes
  In our algorithm: process waits for messages from a survivor set

Observation
  In the original protocol: wait for t+1 suspicion messages
  In our algorithm: wait for messages from processes in S1 ∩ S2, with S1, S2 ∈ S

Generalizing the partition and the intersection properties

[Formulas lost in extraction. Recoverable fragments: a partition property with a condition over blocks A_i that "contains a core", and an intersection property stated over a subset of S.]

Physical system
  Π = {a, b, c, d, e, f}
  C = {abc, ad, ae, af, bd, be, bf, cd, ce, cf, de, df, ef}

Virtual system
  Π = {a, b, c, d, e, f, g, h, i}
  C = {xyz | x, y, z ∈ Π}

[Figure: physical processes a, b, c, d, e, f and the simulated processes a', b', c', d', g', e', h', f', i' that they run]

Every core in the virtual system (subset of 3 processes) is simulated by a core in the physical system

Every subset of size 3 in the virtual system contains at least one correct process

Proposed algorithm

Algorithm: given a system ⟨Π, C, S⟩, let x be the size of the largest core
  Any process p in Π simulates at most (x-xp+1) virtual processes

Conjecture: necessary and sufficient for any subset of t+1 processes in the virtual system to map to a core in the physical system
  Necessity: straightforward (counterexample)
  Sufficiency:
    There are sufficient physical processes to simulate virtual processes
    Byzantine Partition ⇒ t+1 processes map to a core

Our replication strategy

Classes of software systems: attributes
  E.g. Operating system
Potentially vulnerable software systems: attribute values
  E.g. Linux, Windows
Replicate data on a set of hosts that have different values for each attribute: cores

Tolerating the failure of k values
  No permutation of k attribute values covers all the hosts in a core
  Current assumption: k=1
    At least two distinct values per attribute in a core

Definitions
  Attribute configuration: attribute values of a host
  Diversity: distribution of attribute configurations
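For k=1, the condition "at least two distinct values per attribute" is easy to state in code. The sketch below is illustrative only: it assumes a host is represented as a dict from attribute name to attribute value, the function name is hypothetical, and the attribute values other than Linux and Windows used in any example are made up. It checks the coverage condition, not minimality of the set.

```python
def has_two_values_per_attribute(hosts):
    """True if, for every attribute, the hosts exhibit at least two distinct values.

    hosts: list of attribute configurations (dict attribute -> value),
    all defined over the same attributes.
    """
    if not hosts:
        return False
    return all(len({host[a] for host in hosts}) >= 2 for a in hosts[0])
```

A set of hosts passing this check survives the compromise of any single attribute value, because no single value is shared by all hosts in the set.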

Choosing a core

Decision problem is NP-Complete (Set cover)

Finding a core for host hi
  1. Make a list L of hosts orthogonal to hi
  2. If L is not empty
    1. Choose a host hj s.t. hj ∈ L;
    2. Return {hi, hj};
  3. Else
    1. R ← {hi};
    2. Make a list L’ of hosts that have different attribute configurations;
    3. For each attribute a in A, choose randomly a host hj in L’ s.t. hj has a different value for a;
    4. R ← R ∪ {hj};
    5. Repeat 2 to 4 until R covers all attributes or L’ is empty;
    6. Return R.
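The heuristic above can be sketched in a few lines. This is a simplified variant under stated assumptions, not the authors' implementation: hosts are dicts from attribute to value, two hosts are taken to be orthogonal when they differ on every attribute, an attribute counts as covered once the core holds two distinct values for it, and already covered attributes are skipped instead of repeating the selection loop. The function names are hypothetical.

```python
import random

def orthogonal(h1, h2):
    """Assumed meaning of orthogonal: the two hosts differ on every attribute."""
    return all(h1[a] != h2[a] for a in h1)

def covered(core, attribute):
    """The core holds at least two distinct values for this attribute."""
    return len({h[attribute] for h in core}) >= 2

def find_core(host, others, rng=random):
    """Greedy search for a small set of hosts that covers every attribute of `host`."""
    candidates = [h for h in others if orthogonal(host, h)]
    if candidates:
        # A single orthogonal host covers all attributes at once.
        return [host, rng.choice(candidates)]
    core = [host]
    different = [h for h in others if h != host]
    for attribute in host:
        if covered(core, attribute):
            continue
        options = [h for h in different if h[attribute] != host[attribute]]
        if options:
            core.append(rng.choice(options))
    return core  # may leave an attribute uncovered if no host differs on it
```

Because the choice among candidates is random, repeated calls can return different cores for the same host, which spreads the storage load across hosts.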

Core size for scenario 8/2

1,000 hosts
8 attributes [ICAT]
2 values per attribute
  “Linux vs. Windows”

Average core size is small even for highly skewed diversity

Core size for scenario 8/4

1,000 hosts
8 attributes
4 values per attribute

More attribute values reduce core size variation

Storage load

1,000 hosts

For highly skewed diversity, storage load can be high

System design issues

Fully-distributed system
  No single point of failure
  Leverage research on P2P systems
Announcing available configurations
  DHT-based approach
Encryption scheme to protect against data corruption
Recovering from a catastrophe
  Time to recover is not critical
Coping with a large number of requests
  Threshold on the number of accepted requests
  Exponential backoff
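The last two items can be illustrated with a small sketch. Nothing here comes from the proposed system: the class name, the function name, and the parameters `base`, `cap`, and `max_accepted` are hypothetical, and the retry delays are deterministic (no jitter) to keep the example simple.

```python
def backoff_delays(base, cap, attempts):
    """Delay before each retry: base, 2*base, 4*base, ..., never above cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]

class RequestGate:
    """Accepts at most `max_accepted` requests at a time; the rest must retry later."""

    def __init__(self, max_accepted):
        self.max_accepted = max_accepted
        self.in_progress = 0

    def try_accept(self):
        if self.in_progress >= self.max_accepted:
            return False  # caller is expected to back off and retry
        self.in_progress += 1
        return True

    def release(self):
        self.in_progress = max(0, self.in_progress - 1)
```

A host that is refused would wait for the next delay in the sequence before asking again, which fits the observation that time to recover is not critical.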

Lower bound on process replication

Claim: Every set of processes that satisfies [bound on n and t lost in extraction] also satisfies ( , )-Partition [parameters lost in extraction]

Proof idea. Given such a set of processes, construct a partition A1, …, Ak, Ak+1, … as follows:
  A1…Al: t/[divisor lost] processes
  Al+1…A[index lost]: t/[divisor lost] processes
  Where: I is the integral part and f the fractional part [of a quotient lost in extraction]
  There is at least one subset of elements Ai such that the union of these subsets contains t processes
  Add one process to [target lost in extraction]

Claim: If a problem P can be solved by an algorithm A in a system satisfying n ≥ k·t+1 (k > 1, k integer), then P can be solved by a system ⟨Π, C, S⟩ satisfying (k,1)-Partition

Suppose that A requires k=4
  System ⟨Π, C, S⟩ satisfying (4,1)-Partition
    Π = {a, b, c, d, e, f}
    C = {abc, ad, ae, af, bd, be, bf, cd, ce, cf, de, df, ef}
    Maximum number of failures: 2
  Virtual system defined under the threshold model
    Π = {a, b, c, d, e, f, g, h, i}
    Satisfies n ≥ 4·t+1 for t=2
  Simulate the virtual system with ⟨Π, C, S⟩
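The example system can be checked by brute force. The definition of (k,1)-Partition is not legible in this part of the slides, so the sketch assumes it means that every partition of Π into k blocks has at least one block containing a core, by analogy with Crash Partition for two blocks. The function name is hypothetical; the process and core sets are the ones listed above.

```python
from itertools import product

def satisfies_k_partition(processes, cores, k):
    """Assumed (k,1)-Partition: every split into k blocks leaves a core inside some block."""
    procs = sorted(processes)
    for assignment in product(range(k), repeat=len(procs)):
        blocks = [set() for _ in range(k)]
        for p, b in zip(procs, assignment):
            blocks[b].add(p)
        if not any(core <= block for block in blocks for core in cores):
            return False  # found a partition in which no block contains a core
    return True

PROCESSES = set("abcdef")
CORES = [frozenset(c) for c in
         ["abc", "ad", "ae", "af", "bd", "be", "bf",
          "cd", "ce", "cf", "de", "df", "ef"]]
```

Under this reading the system satisfies the property for k=4 but not for k=5: d, e, and f must each sit alone, which forces a, b, and c into one block when only four blocks are available.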


Impact on reliability and performance
  Fewer executions allowed
    What are the chances that an execution not assumed happens?
  Another requirement: compute cores/survivor sets

Static vs. dynamic cores/survivor sets
  Processes joining and leaving
  Changes in reliability

Implementation issues
  Representation of cores and survivor sets
  Determining the cores/survivor sets of a system
  Applicability to the various systems


Applicability of the Consensus solutions
  Look at existing systems that use Consensus as a primitive
  Evaluate the benefits in practice of using our solutions

Solutions for hybrid failure models
  Translate [bound lost in extraction] to our model


No protocols with rational k so far
  Any known candidate?

Finish formal proof of algorithm translation


How do we determine the attributes?
  Resilience depends on the attributes
  Vulnerability databases
  Dynamic attributes: new attributes and values

How many attributes do we need?
  The number of attributes impacts storage overhead

What is a good level of granularity for the attributes?
  E.g. {Windows} vs. {Win_95, Win_98, Win_2000, Win_XP}

Other challenges
  Heuristics for finding cores: storage overhead and storage load
  Efficacy
    How do we assess the efficacy of a prototype?
    Major Internet incidents are not so frequent






[Definitions; most symbols were lost in extraction. Recoverable text:]
  C: collection of subsets of Π
  ( , )-Partition. For every partition of Π, [condition over blocks A_i] contains a core
  ( , )-Intersection. For every [collection {S1, S2, …, Sk} of survivor sets], there is a subset such that: [condition lost]
