Moving away from the independent and identically distributed failure assumption
University of California, San Diego
Flavio Junqueira
Research Exam/Thesis Proposal
Advisors: Keith Marzullo and Geoffrey M. Voelker
Motivation
Common approach for distributed systems: replicate!
  Cheaper than investing in ultra-reliable, specialized components
  Enhances performance and availability
  E.g., processes in software-based systems
Typical replication strategy
  Compute a threshold t on the number of process failures
  Determine the degree of replication required, depending on the problem (e.g., n > 3t for Consensus with arbitrary failures)
  Replicate to this degree
Well suited for independent and identically distributed failures (IID failure assumption)
  Non-negligible probability of t failures in any subset of size t+1
  Is it often a reasonable assumption?
Where IID does not apply…
Systems for the Internet
  Hosts execute the same popular software systems
  Hosts share the same vulnerabilities
Some major outbreaks
  Code Red: over 360,000 hosts [Moore02]
  Sapphire: over 75,000 hosts [Moore03a]
A threshold on the number of failures is unrealistic.
Where IID does not apply…
Quorum systems in a wide-area network [Amir96]
  Failures are strongly correlated
    Power outages
    Network partitions
Software bugs [Little01]
  Single version
    A demand may cause all replicas to crash
  Multiple independently developed versions
    Difficulty of a demand: difficulty in handling it
    Level of difficulty varies among the demands
    More difficult demands tend to cause multiple versions to fail
Where IID does not apply…
Multi-computer systems [Tang92]
  Correlated failures due to shared resources
    Network errors
    Shared memory
  Impact on availability, reliability, and performance
Grid computing
  Master delegates computation
  Master waits for replies from slaves
  Replicate to achieve fault tolerance
  Dependent failures: same sub-network, same software systems, etc.
Outline
System model
Modeling failures
  The classical approach: the threshold model
  An alternative to the threshold model: cores/survivor sets
Applying it to problems: Consensus
  Traditional results on Consensus
  Consensus in the core/survivor set model
Generalizing the results for Consensus
  General bounds on process replication
Coping with dependent failures in the real world
  A few systems that assume dependent failures
  An application: the Phoenix Recovery System
System model
Set of processes $\Pi = \{p_1, p_2, \ldots, p_n\}$
  A process is a unit of computation
  Processes communicate by exchanging messages
Reliable channels
  Validity: if a correct process p sends a message m to a correct process q, then q eventually receives m
  Integrity: a process p receives a message m from some process q only if q sent m to p
System model
[Figure: a set of processes exchanging messages over reliable channels. A distributed algorithm is a collection of state machines (shown for processes p and q), each with its own state. A step of a process is atomic, and an execution is a sequence of steps of processes.]
Distributed algorithm
Collection of state machines, one for each process p
Proceeds in steps of processes
In a step, a process p
  Sends a message to a single process
  Receives a message from a single process
  Undergoes a state transition
Execution
  Sequence of steps of processes in $\Pi$
Timing assumptions
Synchronous systems
  Clock drift, message delay, and processor speed are bounded
  Execution proceeds in synchronous rounds
  In a synchronous round, a process
    sends messages to any number of processes
    receives messages from any number of processes
    undergoes a state transition
Asynchronous systems
  No bounds on clock drift, message delay, or processor speed
Failure modes for processes
Crash failures
  For every faulty process p in some execution of an algorithm A, there is a time $t_p$ after which p stops executing steps of A
Arbitrary failures
  A faulty process can deviate arbitrarily from the specification of the algorithm
  E.g., crashing, sending messages selectively, arbitrarily modifying the content of messages
Receive-omission failures
  A faulty process either crashes or selectively fails to receive messages
Assumptions
  Once a process fails, it does not recover
  The probability of a total failure is negligible
Modeling failures
The threshold model
Threshold t on the number of process failures
  Degree of reliability: $R \in [0,1]$
  The probability of t+1 process failures is smaller than 1-R
  Simple and compact representation (n > f(t))
SIFT project [Wensley76]
  Ultra-reliable computer system
  Process failures are arbitrary, but non-malicious
  Hardware designed to isolate faults (independent failures)
  Similar hardware (identically distributed process failures)
  IID failure assumption is valid
What if failures are not IID?
  Still safe
    t is the size of the largest subset of faulty processes in any execution
    It does not hurt to consider more
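As an illustration of how a threshold could be derived when the IID assumption does hold, here is a minimal Python sketch. It assumes every process fails independently with the same probability p, and it reads "the probability of t+1 process failures" as "t+1 or more failures". The names `prob_at_least` and `smallest_threshold` and the parameter p are hypothetical and do not come from the slides.

```python
from math import comb


def prob_at_least(n: int, p: float, k: int) -> float:
    """Probability that at least k of n processes fail,
    each failing independently with probability p (IID assumption)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def smallest_threshold(n: int, p: float, R: float) -> int:
    """Smallest t such that the probability of t+1 or more failures is below 1 - R."""
    for t in range(n):
        if prob_at_least(n, p, t + 1) < 1 - R:
            return t
    return n  # no smaller threshold meets the target reliability


if __name__ == "__main__":
    # Hypothetical numbers: 10 processes, each failing with probability 0.01.
    print(smallest_threshold(10, 0.01, 0.999))
```

The binomial sum is only valid under IID failures. With correlated failures the same t can badly underestimate the risk, which is the motivation for the alternative model that follows.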
Limitations of the threshold model
[Figure: subsets of processes compared against the target degree of reliability R; ">R" marks a subset of processes whose reliability is greater than R.]
An alternative to the threshold model
Desirable properties
  Expressive: scenarios in the previous slide
  Flexible: not tied to any particular way of characterizing failures
  General: widely applicable
Cores [JM03a]
  A core c: minimal reliable subset of processes
  At least one process in c is correct in every execution of the system
  Generalize subsets of size t+1
Survivor sets [JM03a]
  A survivor set: contains all the correct processes of some execution
  Generalize subsets of size n-t
Cores and Survivor sets
R: desired degree of reliability
$r(X)$, $X \subseteq \Pi$: evaluates to the reliability of X
A subset $C \subseteq \Pi$ is a core of $\Pi$ iff
  $r(C) \ge R$
  $\forall p \in C,\ r(C - \{p\}) < R$
$C_\Pi$: set of cores of $\Pi$
A subset $S \subseteq \Pi$ is a survivor set of $\Pi$ iff
  $\forall C \in C_\Pi,\ S \cap C \neq \emptyset$
  $\forall p \in S,\ \exists C \in C_\Pi$ such that $(p \in C)$ and $((S - \{p\}) \cap C = \emptyset)$
$S_\Pi$: set of survivor sets of $\Pi$
Cores and survivor sets are duals of each other
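The duality can be made concrete with a small Python sketch. Under the definition above, a survivor set is a minimal subset of processes that intersects every core, so the survivor sets can be enumerated by brute force from the cores. The function names and process labels are hypothetical, and the enumeration is exponential in the number of processes, so this is only meant for tiny examples.

```python
from itertools import combinations


def threshold_cores(processes, t):
    """Cores in the threshold model: every subset of size t+1."""
    return [frozenset(c) for c in combinations(sorted(processes), t + 1)]


def survivor_sets(processes, cores):
    """Minimal subsets of processes that intersect every core (brute force)."""
    cores = [frozenset(c) for c in cores]
    procs = sorted(processes)
    found = []
    # Candidates are visited by increasing size, so any superset of an
    # already accepted set is not minimal and can be skipped.
    for size in range(1, len(procs) + 1):
        for candidate in combinations(procs, size):
            s = frozenset(candidate)
            if any(smaller <= s for smaller in found):
                continue
            if all(s & c for c in cores):
                found.append(s)
    return found


if __name__ == "__main__":
    procs = {"p1", "p2", "p3", "p4"}
    for s in survivor_sets(procs, threshold_cores(procs, 1)):
        print(sorted(s))
```

With the threshold cores for n = 4 and t = 1, the sketch yields the subsets of size n-t = 3, matching the claim that survivor sets generalize subsets of size n-t.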
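An alternative definition
Design of algorithms
  Let $\Phi$ be the set of allowed executions
  Let $up(\phi)$ be the set of correct processes in execution $\phi \in \Phi$
A subset $C \subseteq \Pi$ is a core of $\Pi$ iff
  $\forall \phi \in \Phi,\ C \cap up(\phi) \neq \emptyset$
  $\forall C' \subset C,\ \exists \phi \in \Phi$ s.t. $C' \cap up(\phi) = \emptyset$
$C_\Pi$: set of cores of $\Pi$
A subset $S \subseteq \Pi$ is a survivor set of $\Pi$ iff
  $\exists \phi \in \Phi$ s.t. $S = up(\phi)$
  $\forall S' \subset S,\ \forall \phi \in \Phi,\ S' \neq up(\phi)$
$S_\Pi$: set of survivor sets of $\Pi$
System configuration: $SC = \langle \Pi, C_\Pi, S_\Pi \rangle$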
An example
Blue, Red, and Yellow fail independently
Failures of Yellow processes are highly correlated
r({Red, Blue, Yellow}) = R
Another example
Blue: highly reliable server
Red: client
Failures of Blue and Red are negatively correlated
Probability of more than 3 Red processes failing is negligible
Determining cores and survivor sets
Probability models
  E.g., Markov models used in the analysis of dynamic fault trees [Ren98]
  To find cores: minimal subset of processes s.t. the probability of total failure in the subset is negligible
  Often difficult in practice
Attribute-based model [JM02]
  Processes are characterized by attributes
  Attributes determine failure correlation
  Finding a core is NP-hard
Color-based model [JM02]
  A single attribute characterizes a process
  Polynomial-time algorithm to find cores
Cores/Survivor sets vs. Quorum systems
Cores, survivor sets, quorums
  Subsets of processes
Quorums [Giff79]
  Enforce mutual exclusion [GM85]
    E.g., one-copy serializability
  Quorums necessarily intersect
  Execute operations on behalf of the system
Cores/Survivor sets
  Do not necessarily execute operations on behalf of the system
  Weaker than quorums: no intersection requirement a priori
  Generalize objects commonly used in proofs and algorithms
    Cores: subsets of size t+1
    Survivor sets: subsets of size n-t
Consensus
Motivation for Consensus
Replication often requires coordination
Coordination problems Atomic broadcast
Clock synchronization
Agreement on fault-tolerant processors (FTP)
Consensus specification
Each process begins with a proposed value $v \in V$
Goal: agree on a single value
Typical Consensus definition [Attiya98]
  Agreement: no two correct processes decide on different values
  Termination: every correct process eventually decides
  Validity: if a process p decides on value v, then v was proposed by some process q
Strong validity: if every process has v as its initial value, then v is the only possible decision value [Attiya98]
Vector validity: a correct process decides on a vector $vect$ such that [Doudou98]
  1. If $p_i$ is correct, then $vect[i]$ has the initial value of $p_i$ or null
  2. At least t+1 elements of $vect$ are initial values of correct processes
Synchronous systems - Crash failures
Solution for any number of failures
  Full-information algorithm (t+1 rounds)
Early-deciding algorithms [LF82, CB00]
  For any execution with f failures, correct processes decide in at most f+1 rounds
  Clean round: round in which no process fails
    A process receives messages from the same set of processes in two consecutive rounds
  Message complexity: $O(f \cdot |\Pi|^2)$
[Figure: processes p0, p1, and p2 exchanging messages in rounds 1 and 2 and then deciding.]
In the core/survivor set model
Algorithm SyncCrash [JM03a, JM03d]
  Choose a core C, preferably the smallest
  Execute the early-deciding algorithm among the processes of C
  Every process in $\Pi$ has an array of |C| positions, one for each process in C
  Processes in C send messages to processes in $\Pi - C$ as well
  A process decides when a round with no failures in C happens
$C_\Pi \neq \emptyset$ (sufficient)
Decision in at most |C| rounds
  If |C|-1 < t, then it improves on the number of rounds
  Message complexity: $O(f \cdot |C| \cdot |\Pi|)$
Synchronous systems - Arbitrary failures
Impossible if $n \le 3 \cdot t$ [Lamport82]
  Strong Consensus
Proof idea
  Assume a Consensus algorithm that solves the problem for $|\Pi| \le 3 \cdot t$
  Construct an execution in which agreement is violated
  Assume $|\Pi| \le 3 \cdot t$
  Partition (A, B, C) of $\Pi$ s.t. each subset has at most t processes
[Figure: three executions over the partition (A, B, C), showing the values each subset reports to the others. Execution 1: (A, B, C: v). Execution 2: (A, B, C: v'). Execution 3: (A: v; B: v', C: *).]
In the core/survivor set model
Lower bound on process replication [JM03a, JM03d]
  Byzantine Partition: every partition (A, B, C) of $\Pi$ is such that at least one of the subsets contains a core
  Byzantine Intersection:
    The intersection of every pair of survivor sets in $S_\Pi$ contains a core
    The intersection of every three survivor sets in $S_\Pi$ is not empty
[Figure: the three scenarios (A, B, C: v), (A, B, C: v'), and (A: v; B: v', C: *) over a partition (A, B, C).]
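The Byzantine Intersection property can be checked mechanically on small systems. The following Python sketch assumes that cores and survivor sets are given explicitly as collections of process names, and that "every pair" and "every three" range over distinct survivor sets. Both the function name and that reading are assumptions made for illustration.

```python
from itertools import combinations


def byzantine_intersection(cores, survivor_sets):
    """True iff every two distinct survivor sets intersect in a superset of
    some core, and every three distinct survivor sets have a common process."""
    cores = [frozenset(c) for c in cores]
    sets = [frozenset(s) for s in survivor_sets]
    pairs_ok = all(
        any(c <= (s1 & s2) for c in cores) for s1, s2 in combinations(sets, 2)
    )
    triples_ok = all(s1 & s2 & s3 for s1, s2, s3 in combinations(sets, 3))
    return pairs_ok and triples_ok


if __name__ == "__main__":
    # Threshold model with n = 4, t = 1: cores have size 2, survivor sets size 3.
    cores = list(combinations("abcd", 2))
    survivors = list(combinations("abcd", 3))
    print(byzantine_intersection(cores, survivors))
```

In the threshold model the check agrees with n > 3·t: it holds for n = 4, t = 1 and fails for n = 3, t = 1.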
Equivalence of Byzantine Intersection and Partition
In a partition (A, B, C) in which no subset contains a core:
  All processes in A can be faulty: $B \cup C$ contains a survivor set $S_2$
  All processes in B can be faulty: $A \cup C$ contains a survivor set $S_1$
  All processes in C can be faulty: $A \cup B$ contains a survivor set $S_3$
  $S_1 \cap S_2 \cap S_3$ is empty
In the threshold model: Lamport et al. [Lamport82]
  Solution for n > 3·t in t+1 rounds
In the core/survivor set model
  Modified algorithm by Lamport et al.
  Solution for systems satisfying Byzantine Partition
  Replace subsets of processes of size n-t by survivor sets
  Replace majority by intersection of two survivor sets
Enables a solution for some systems: $\Pi = \{p_a, p_b, p_c, p_d, p_e\}$
Related work - Hybrid failure models
Moves away only from the identically distributed failure assumption
Different failure modes, one class for each mode [LR94]
  Manifest (c): detectable failures (e.g., corrupted messages)
  Symmetric (s): behavior deviates arbitrarily, but it is the same for every other processor (e.g., sends the same erroneous value to every other process)
  Arbitrary (a): behavior deviates arbitrarily (e.g., sends different values to different processes)
Algorithm for the Oral Messages problem: $n > 2a + 2s + c + m$, $a \le m$
Feasibility of this approach
  What is the impact of diversity on storage overhead and load?
    Diversity: distribution of attribute configurations
    Storage overhead: size of the replica set (core)
    Storage load: given a host h, the number of cores in which h participates
Simulations
  Levels of diversity
  Varying attribute sets
System model
A set H of hosts
A set A of attributes
  Every attribute has the same cardinality y
A mapping M from hosts to attribute configurations
Diversity
  Determined by M
  Often skewed in practice (93% Windows) [OneStat]
Modeling diversity
  Single parameter $f \in [0.5, 1)$
  A share f of the hosts has a
Process failures are often non-IID
Core/survivor set model
  Enables one to model non-IID failures
  Abstracts failure probability distributions
  Generalizes objects commonly used in algorithms and proofs
Consensus
  Improves on the number of rounds
  Enables solutions in systems in which Consensus is not solvable under the threshold model
Generalizing the results for Consensus
  General lower bound on process replication
  Automatic translation of algorithms
Compatible with previous work
Conclusions
The Phoenix recovery system
  An application that uses the core abstraction
  Determines cores by using attributes of hosts (shared vulnerabilities)
  Significantly reduces storage overhead compared with a solution under the threshold model
  Current status: we are working on the design of a prototype
Future work
Impact on reliability and performance
  Fewer executions allowed
  Another requirement: compute cores/survivor sets
Static vs. dynamic cores/survivor sets
  Processes joining and leaving
  Changes in reliability
Implementation issues
  Representation of cores and survivor sets
  Determining the cores/survivor sets of a system
  Applicability to the various systems
Phoenix
  Determining good sets of attributes
  Heuristics to find cores: storage overhead vs. storage load
Dissertation plan
Representation of non-IID failures
  Core/survivor set model [JM03c]
  Application to Consensus [JM03a]
General bounds
  Lower bound
  Algorithm translation (work in progress)
  Submission to PODC 2004 (January 2004)
Phoenix [JBMSV03C]
  Method for determining attribute sets
  Heuristics to find cores that consider both storage load and overhead
  Implementation details
  Submissions to OSDI (March 2004) and NSDI (September 2004)
Bibliography
[Amir96] Amir, Y., and Wool, A., “Evaluating quorum systems over the Internet,” in 26th Symposium on Fault-tolerant Computing (FTCS’96), (Sendai, Japan) pp. 26-35, IEEE Computer Society, June 1996.
[Moore02] Moore, D., Shannon, C., and Brown, J., “Code-Red: A case study on the spread and victims of an Internet worm,” in Proceedings of the 2002 ACM SIGCOMM Internet Measurement Workshop, (Marseille, France), pp. 273-284, Nov. 2002.
[Moore03a] Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., and Weaver, N., “The spread of the Sapphire/Slammer worm,” Tech. Rep., CAIDA, La Jolla, CA.
[Little01] Littlewood, B., Popov, P., and Strigini, L., “Modeling software design diversity - A review,” ACM Computing Surveys, vol. 33, pp. 177-208, June 2001.
[Tang92] Tang, D., and Iyer, R., “Analysis and modeling of correlated failures in multicomputer systems,” IEEE Transactions on Computers, vol. 41, pp. 567-577, May 1992.
[JM03a] Junqueira, F., and Marzullo, K., “Synchronous Consensus for dependent process failures,” in International Conference on Distributed Computing Systems (ICDCS’03), May 2003
[Ren98] Ren, Y., and Dugan, J. B., “Optimal design of reliable systems using static and dynamic fault trees,” IEEE Transactions on Reliability, vol. 47, no. 3, pp. 234-244, Dec. 1998.
[Wensley76] Wensley, J. et al., “SIFT: Design and analysis of a fault-tolerant computer for aircraft control,” in 2nd IEEE International Conference on Software Engineering, pp. 458-469, 1976.
[Lamport82] Lamport, L., Shostak, R., and Pease, M., “The Byzantine generals problem,” ACM Transactions on Programming Languages and Systems, vol. 4, pp. 382-401, July 1982.
[FLP85] Fischer, M., Lynch, N., and Paterson, M., “Impossibility of distributed Consensus with one faulty process,” Journal of the ACM, vol. 32, pp. 374-382, April 1985.
[DLS88] Dwork, C., Lynch, N., and Stockmeyer, L., “Consensus in the presence of partial synchrony,” Journal of the ACM, vol. 35, no. 2, pp. 288-323, Apr. 1988.
[CT96] Chandra, T., and Toueg, S., “Unreliable failure detectors for reliable distributed systems,” Journal of the ACM, vol. 43, pp. 225-267, March 1996.
[CHT96] Chandra, T., Hadzilacos, V., and Toueg, S., “The weakest failure detector for solving Consensus,” Journal of the ACM, vol. 43, pp. 685-722, July 1996.
[Doudou98] Doudou, A., and Schiper, A., “Muteness detectors for Consensus with Byzantine processes,” in 17th ACM Symposium on Principles of Distributed Computing, (Puerto Vallarta, Mexico), p. 315, July 1998. (Brief Announcement).
[LR94] Lincoln, P., and Rushby, J., “Formal verification of an interactive consistency algorithm for the Draper FTP architecture under a hybrid failure model,” in Proceedings of the 9th IEEE Annual Conference on Computer Assurance, pp. 107-120, June 1994.
[JM03b] Junqueira, F., and Marzullo, K., “Lower bound on the number of rounds for synchronous Consensus with dependent process failures,” Tech. Rep. CS2003-0734, UCSD, La Jolla, CA, January 2003.
[JM03c] Junqueira, F., and Marzullo, K., “Designing algorithms for dependent process failures,” in Future Directions in Distributed Computing, number 2584 in LNCS, pp. 24-28, Springer-Verlag, 2003.
[JM02] Junqueira, F., and Marzullo, K., “Coping with dependent process failures,” Tech Rep. CS2002-0723, UCSD, La Jolla, October 2002.
[JBMSV03C] Junqueira, F., Bhagwan, R., Marzullo, K., Savage, S., and Voelker, G. M., “The Phoenix Recovery System: Rebuilding from the ashes of an Internet catastrophe,” in IX Workshop on Hot Topics in Operating Systems (HotOS-IX), May 2003.
[Giff79] Gifford, D., “Weighted voting for replicated data,” in 7th Symposium on Operating Systems Principles (SOSP), pp. 150-162, 1979.
[Attiya98] Attiya, H., and Welch, J., “Distributed computing: Fundamentals, simulation, and advanced topics,” chapter 5, McGraw-Hill, 1998.
[GM85] Garcia-Molina, H., and Barbara, D., “How to assign votes in a distributed system,” Journal of the ACM, vol. 32, pp. 841-860, October 1985.
[Malkhi97] Malkhi, D., and Reiter, M., “Byzantine quorum systems,” in 29th ACM Symposium on Theory of Computing, pp. 569-578, May 1997.
[LF82] Lamport, L., and Fischer, M., “Byzantine generals and transaction commit protocols,” Tech. Rep. 62, SRI International, Apr. 1982.
[CB00] Charron-Bost, B., and Schiper A., “Uniform Consensus is harder than Consensus,” Tech. Rep. DSC/2000/028, École Polytechnique Fédérale de Lausanne, Switzerland, May 2000.
[JM03d] Junqueira, F. and Marzullo, K., “Consensus for dependent process failures,” Tech. Rep. CS2003-0737, UCSD, La Jolla, CA, February 2003.
[Martin02] Martin, J.-P., Alvisi, L., and Dahlin, M., “Minimal Byzantine storage,” in Proceedings of the 17th Symposium on Distributed Computing, LNCS 2508, pp. 311-325, Ed. Malkhi, D., 2002.
[Hirt97] Hirt, M. and Maurer, U., “Complete characterization of adversaries tolerable in secure multi-party computation,” in ACM Symposium on Principles of Distributed Computing (PODC’97), (Santa Barbara, CA), pp. 25-34, 1997.
[JM03e] Junqueira, F. and Marzullo, K., “On the generalization of n>k·t,” Tech. Rep. CS2003-0743, UCSD, La Jolla, CA, April 2003.
[WMK02] Weatherspoon, H., Moscovitz, T., and Kubiatowicz, J., “Introspective failure analysis: Avoiding correlated failures in peer-to-peer systems,” in Proceedings of the International Workshop on Reliable Peer-to-Peer Distributed Systems, October 2002.
[BWWG02] Bakkaloglu, M., Wylie, J., Wang, C., Ganger, G., “On correlated failures in survivable storage systems,” Tech. Rep. CMU-CS-02-129, CMU, May 2002.
[Moore03b] Moore, D., Shannon, C., Voelker, G. M., and Savage, S., “Internet quarantine: Requirements for containing self-propagating code,” in Proceedings of the IEEE INFOCOM, Apr. 2003.
[ICAT] National Institute of Standards and Technology (NIST), “ICAT Vulnerability Database.” http://icat.nist.gov/icat.cfm.
3/30 - 12/1: Phoenix, journal papers, and dissertation
12/1: Thesis defense
Where IID does not apply…
Evaluation of quorum systems in a wide-area network [Amir96]
  Crashes are strongly correlated
    Power outages
      Computers in the same room
      Total failure during the experiment
    Wide-area outage
  Network partitions
    Quorums partially unreachable
    Computers in different segments
    Computers in the same segment
      Switching devices
      Bridges
Where IID does not apply…
Software bugs [Little01]
  Single version
    A demand may cause all replicas to crash
    E.g., state-machine replication
  Multiple independently developed versions
    Difficulty of a demand: difficulty in handling it
    Level of difficulty varies among the demands
    More difficult demands tend to cause multiple versions to fail
Conclusions
Failures are not independent
  Computing a threshold is not practical
  Model of dependent failures based on shared vulnerabilities
Storage overhead is small even for highly skewed diversity
Storage load can be large
  Has to be considered by the heuristic that finds cores
  Increases average core size
Synchronous systems - Crash failures
Solution for any number of failures
Early-deciding algorithm: decision in f+1 rounds, where f is the number of failures in a given execution [Charron-Bost and Schiper]
  1. Every process keeps an array of initial values
  2. In every round, a process:
    1. sends its array of initial values to all the other processes
    2. receives messages from other processes (array of initial values or decide)
    3. updates its array according to the received arrays
    4. decides if it receives a decide message, and then sends a decide message to all the other processes in the next round
    5. decides if the round is detected as clean, and then sends a decide message to all the other processes in the next round
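The steps above can be sketched as a small round-by-round simulation in Python. This is an illustrative sketch under stated assumptions, not the algorithm from [CB00] verbatim: a round is treated as clean when a process hears from the same set of processes as in the previous round (everyone is assumed up before round 1), the decision value is the minimum known initial value, and a crash is described by the round in which it happens plus the set of processes reached in that round. The function name and the `crashes` encoding are hypothetical.

```python
def early_deciding_consensus(initial, crashes=None):
    """Simulate synchronous early-deciding consensus with crash failures.

    initial: dict process -> proposed value.
    crashes: dict process -> (round, recipients); the process sends only to
             `recipients` in that round and takes no further steps.
    Returns dict process -> (decided value, round of decision).
    """
    crashes = crashes or {}
    procs = sorted(initial)
    known = {p: {p: initial[p]} for p in procs}   # array of initial values
    heard_prev = {p: set(procs) for p in procs}   # assumption: all up before round 1
    decided, announce, halted = {}, {}, set()

    for rnd in range(1, len(procs) + 2):
        announcing, announce = announce, {}
        inbox = {p: [] for p in procs}
        # Send phase: deciders announce their decision, others send their array.
        for p in procs:
            crash_round, reached = crashes.get(p, (None, None))
            if p in halted or (crash_round is not None and crash_round < rnd):
                continue
            targets = reached if crash_round == rnd else procs
            msg = ("decide", announcing[p]) if p in announcing else ("values", dict(known[p]))
            for q in targets:
                inbox[q].append((p, msg))
        halted |= set(announcing)

        # Receive phase: only processes that survive the round take a step.
        for q in procs:
            crash_round = crashes.get(q, (None, None))[0]
            if q in decided or (crash_round is not None and crash_round <= rnd):
                continue
            decisions = [body for _, (kind, body) in inbox[q] if kind == "decide"]
            if decisions:
                value = decisions[0]
            else:
                heard = set()
                for sender, (_, values) in inbox[q]:
                    known[q].update(values)
                    heard.add(sender)
                clean = heard == heard_prev[q]
                heard_prev[q] = heard
                if not clean:
                    continue
                value = min(known[q].values())
            decided[q] = (value, rnd)
            announce[q] = value
    return decided
```

For example, with `initial = {0: 5, 1: 3, 2: 7}` and `crashes = {1: (1, {0})}`, process 1 crashes in round 1 after reaching only process 0. Process 0 detects a clean round and decides immediately, and process 2 decides in round 2 on receiving the decide message, that is, within f+1 rounds for f = 1.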
Upper bound on process replication
Conjecture: Suppose a correct algorithm A that requires under the threshold model
Replace (n-’·t), 0 < ’ , for intersection of ’ survivor sets in A to generate algorithm A’
Transformed algorithm A’ is correct
Intuition: k=3
  In every execution: at least 2·t+1 correct processes (subset )
  Survivor sets: subsets of processes of size 2·t+1
  Cores: subsets of size t+1
  Every two subsets of size 2·t+1 (survivor sets) intersect in at least t+1 processes (a core)
  The intersection of two survivor sets contains at least one correct process
  At least one intersection of two survivor sets contains only correct processes
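The intersection size in this intuition follows from a counting argument. As a worked case, assume $n = 3 \cdot t + 1$ (the smallest n with n > 3·t; this value of n is an assumption made here) and let $S_1$, $S_2$ be two survivor sets of size $n - t = 2 \cdot t + 1$:

```latex
|S_1 \cap S_2| \;=\; |S_1| + |S_2| - |S_1 \cup S_2|
\;\ge\; (2t + 1) + (2t + 1) - (3t + 1) \;=\; t + 1
```

Since at most t processes are faulty, a set of t+1 processes always contains at least one correct process, which is why the intersection behaves as a core.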
1 ,
1,tn
Solving Consensus for arbitrary failures
Algorithm by Lamport, Shostak, and Pease [Lamport82]
  Each process keeps a copy of the tree
  Level i of the tree: values received at round i
  In round 0, a correct process broadcasts its initial value
  In round i, a correct process sends the values at level i-1
[Figure: the tree kept by each process, with depths 0 through t+1; nodes are labeled by sequences of distinct processes, such as p1, p1p2, p1p3, up to p1p2...pt.]
At round t+1…
Each correct process
  Traverses the tree in post-order
  Evaluates each node as follows
    If the node is a leaf, then it evaluates to its current value
    Otherwise, use the majority (null if there is no majority)
Claim: if process p is correct, then
  The value of node(p) is the same in every correct process q
  It is the value p sent at round |p| regarding
Proof idea
  Recursion on the levels of the tree
  Base case: level t+1 (leaves)
  Induction hypothesis: claim valid for every level $l \le t+1$
  Induction step: prove for l-1
[Figure: levels l-1 and l of the tree, down to depth t+1.]
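The evaluation rule can be sketched in a few lines of Python. The sketch assumes the tree has already been filled in during the message rounds and that "majority" means a strict majority of the evaluated children of a node. The `Node` class and `resolve` function are hypothetical names used only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Node:
    value: Any = None                                   # value stored at this node
    children: List["Node"] = field(default_factory=list)


def resolve(node: Node) -> Optional[Any]:
    """Post-order evaluation: a leaf evaluates to its stored value, an inner
    node to the strict majority of its children (None if there is none)."""
    if not node.children:
        return node.value
    votes = Counter(resolve(child) for child in node.children)
    value, count = votes.most_common(1)[0]
    return value if 2 * count > len(node.children) else None


if __name__ == "__main__":
    tree = Node(children=[Node(1), Node(1), Node(0)])
    print(resolve(tree))
```

Children are evaluated before their parent, which is what the post-order traversal on the slide amounts to.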
Algorithm for the core/survivor set model
Adapt the algorithm from Lamport et al. In the original algorithm
In our algorithm: given a system Intersection of two survivor sets instead of majority
No solution for purely asynchronous systems, even for a single crash failure [FLP85]
  Slow process vs. faulty process
  Requires a liveness property
Approach 1: consider more realistic timing assumptions
  Partially synchronous systems [DLS88]
  Difficult to evaluate parameters in practice
Approach 2: extend the model with failure detectors [CT96]
  Unreliable failure detectors
Asynchronous systems - Crash failures
$\Diamond W$ is the weakest class of failure detectors that enables a solution to Consensus [CHT96]
  Weak completeness: eventually every process that crashes is suspected by some correct process
  Eventual weak accuracy: there is some correct process that is eventually not suspected by any other correct process
Lower bound on process replication: $n \ge 2 \cdot t + 1$
Proof idea: partition (A, B) of the processes
  Initial value of processes in A: v; processes in A decide v
  Initial value of processes in B: v'; processes in B decide v'
  No faulty process; messages from A to B are delayed; processes in A decide v and processes in B decide v'; agreement is violated
An algorithm
Rotating coordinator paradigm [CT96]
Assumes $\Diamond S$ (equivalent to $\Diamond W$)
  Strong completeness: eventually every correct process suspects forever every faulty process
[Figure: execution with t=2, n=5 and no suspicions or failures, among the coordinator and processes p1 to p4.]
  Every process sends an estimate message to the coordinator
  Coordinator gathers t+1 estimates and proposes a new estimate
  Processes acknowledge the reception of an estimate from the coordinator
  Coordinator gathers t+1 acks and broadcasts a decide message
Proof of correctness
No correct process stops (does not decide, does not move on) in a round i
  A correct process either
    Decides in a round
    Eventually suspects the coordinator (Strong Completeness) and moves on to the next round
Eventually there is a round in which the coordinator is not suspected by any correct process
  Ensured by Eventual Weak Accuracy
If not all processes decide in the same round
  Once some process decides, the decision value is “locked”
In the core/survivor set model
Lower bound on process replication [JM03d]
  Crash Partition: there is no partition (A, B) of the processes in $\Pi$ such that neither subset contains a core
  Crash Intersection: the intersection of every two survivor sets is not empty
Crash Partition $\Leftrightarrow$ Crash Intersection
Bound is tight: Chandra and Toueg’s algorithm modified
  In the original algorithm: coordinator waits for n-t replies
  In our algorithm: coordinator waits for replies from a survivor set
Proof idea
Layering technique [Keidar]
  Layer: [p,[i]]
    Process p fails but sends messages to processes $p_i, \ldots, p_n$
  Apply layers to system states
    A state is composed of the states of the processes
    Similar states x, y: only a single process can distinguish x from y
    A set of states is similarly connected iff for every pair of states in the set, there is a chain of similar states connecting them
Set of initial states is similarly connected
Applying layers to a similarly connected set of states generates another similarly connected set of states
Cannot apply layers indefinitely
Asynchronous systems - Arbitrary failures
Faulty processes can behave arbitrarily
  Correct to a subset of processes
  Strong completeness does not make sense
Mute process [Doudou98]
  A process $p_j$ is mute to a process $p_i$ iff there is a time t after which $p_j$ stops sending messages to $p_i$ forever
Mute completeness
  Every process $p_i$ eventually suspects forever a process $p_j$ that is mute to $p_i$
Equivalent to $\Diamond S$ if processes fail only by crashing
Lower bound on process replication
Lower bound: $n \ge 3 \cdot t + 1$ (Strong Consensus)
Proof idea: assume $n \le 3 \cdot t$, and a partition (A, B, C) such that $|A|, |B|, |C| \le t$
Scenario 1: all processes in B crash at time 0
  Processes in A and C propose value v and decide v
Scenario 2: all processes in A crash at time 0
  Processes in B and C propose value v' and decide v'
Scenario 3: all processes in C are arbitrarily faulty
  Processes in C behave to processes in A as in Scenario 1, and to processes in B as in Scenario 2
  Messages from A to B and conversely are delayed until after the last process decides
  Processes in A propose v, and processes in B propose v'
Processes in A cannot distinguish Scenario 1 from Scenario 3
Processes in B cannot distinguish Scenario 2 from Scenario 3
Processes in A and B decide upon different values (agreement violation)
An algorithm for Vector Consensus
Requires
  $n \ge 3 \cdot t + 1$
  Digitally signed messages
  Certificates
    Certify message content
    E.g., a decision message has to contain enough estimate messages from other processes
Each process has a list of faulty processes
  FIFO channels: out-of-order messages
  Corrupted messages
1st stage
  Each process broadcasts its initial value
  Each process composes a proposed vector with received values
An algorithm for Vector Consensus (cont.)
2nd stage: asynchronous rounds of message exchange
  Processes forward the estimate received from the coordinator
  Decide after receiving at least 2·t+1 estimate messages
  The coordinator crashes and does not send an estimate
    Processes exchange suspicion messages
    Processes exchange current estimates after receiving at least 2·t+1 suspicion messages
    Move on to the next round after receiving at least 2·t+1 current estimates
In the core/survivor set model
Byzantine Intersection/Partition is necessary and sufficient [JM03d]
Necessity proof
  Assume Byzantine Partition does not hold
  Scenario in which processes decide upon different values
Sufficiency proof
  Modify the algorithm by Doudou and Schiper
  Original algorithm: a process waits for messages from 2/3 of the processes
  In our algorithm: a process waits for messages from a survivor set
Observation
  In the original protocol: wait for t+1 suspicion messages
  In our algorithm: wait for messages from processes in $S_1 \cap S_2$, $S_1, S_2 \in S_\Pi$
(, )-Partition. For every partition of
, there is a subset such that:
(, )-Intersection. For every :
Generalizing the partition and the intersection properties
},,,{ 21 AAAA AAAAA kkk },,,{'
21
core a contains )(i
kiA
S,
)(,
s
s
,, ,S
1
1
1
,: subset of S
86 Upper bound on process replication
Physical system: $\Pi = \{a, b, c, d, e, f\}$, $C_\Pi = \{abc, ad, ae, af, bd, be, bf, cd, ce, cf, de, df, ef\}$
Virtual system: $\Pi = \{a, b, c, d, e, f, g, h, i\}$, $C_\Pi = \{xyz \mid x, y, z \in \Pi\}$
Figure: physical processes a b c d e f; simulated processes a' b' c' d' g' e' h' f' i'
Every core in the virtual system (subset of 3 processes) is simulated by a core in the physical system
Every subset of size 3 in the virtual system contains at least one correct process
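The claim that every virtual core is simulated by a physical core can be checked by brute force. The sketch below is only an illustration: the assignment of simulated processes to physical ones (`simulated_by`) is an assumed reading of the figure, in which a, b, c each simulate one virtual process and d, e, f each simulate two.

```python
from itertools import combinations

# Cores of the physical system: abc plus every pair that includes d, e or f.
physical_cores = [set(c) for c in
                  ["abc", "ad", "ae", "af", "bd", "be", "bf",
                   "cd", "ce", "cf", "de", "df", "ef"]]

# Assumed mapping from each simulated (virtual) process to its physical host.
simulated_by = {
    "a'": "a", "b'": "b", "c'": "c",
    "d'": "d", "g'": "d",
    "e'": "e", "h'": "e",
    "f'": "f", "i'": "f",
}

def hosts_contain_core(virtual_subset):
    """True if the physical processes simulating the subset include a core."""
    hosts = {simulated_by[v] for v in virtual_subset}
    return any(core <= hosts for core in physical_cores)

triples = list(combinations(sorted(simulated_by), 3))
uncovered = [tr for tr in triples if not hosts_contain_core(tr)]
print(len(triples), len(uncovered))  # 84 0
```

Under this mapping all 84 subsets of three virtual processes are hosted by a set of physical processes that includes a core, so at least one of the three is simulated by a correct physical process.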
87 Proposed algorithm
Algorithm: given a system $\langle \Pi, C_\Pi, S_\Pi \rangle$, let x be the size of the largest core
Any process p in $\Pi$ simulates at most $(x - x_p + 1)$ virtual processes
Conjecture: necessary and sufficient for any subset of t+1 processes in the virtual system to map to a core in the physical system
Classes of software systems: attributes E.g. Operating system
Potentially vulnerable software systems: attribute values E.g. Linux, Windows
Replicate data on a set of hosts that have different values for each attribute: cores
Tolerating the failure of k values No permutation of k attribute values covers all the hosts in a core Current assumption: k=1
At least two distinct values per attribute in a core
Definitions Attribute configuration: attribute values of a host Diversity: distribution of attribute configurations
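For the current assumption k=1, the core condition reduces to "at least two distinct values per attribute". A minimal Python sketch of that check follows, assuming a host is represented as a dictionary from attribute name to value. The attribute name `web_server` and the values `Apache` and `IIS` are hypothetical examples, not taken from the slides.

```python
def is_core(hosts, attributes):
    """k = 1: every attribute shows at least two distinct values across the hosts."""
    return all(len({h[a] for h in hosts}) >= 2 for a in attributes)

attributes = ["os", "web_server"]          # hypothetical attribute names
h1 = {"os": "Linux", "web_server": "Apache"}
h2 = {"os": "Windows", "web_server": "IIS"}
h3 = {"os": "Windows", "web_server": "Apache"}

print(is_core([h1, h2], attributes))       # True: the hosts differ on both attributes
print(is_core([h1, h3], attributes))       # False: both hosts run the same web server
print(is_core([h1, h3, h2], attributes))   # True: adding h2 covers web_server
```

In the second case a single vulnerability in the shared web server value would compromise every replica, which is what the condition rules out.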
89 Choosing a core
Decision problem is NP-Complete (Set cover)
Finding a core for host $h_i$:
1. Make a list L of hosts orthogonal to $h_i$
2. If L is not empty
  1. Choose a host $h_j$ s.t. $h_j \in L$;
  2. Return $\{h_i, h_j\}$;
3. Else
  1. $R \leftarrow \{h_i\}$;
  2. Make a list L' of hosts that have different attribute configurations;
  3. For each attribute a in A, choose randomly a host $h_j$ in L' s.t. $h_j$ has a different value for a;
  4. $R \leftarrow R \cup \{h_j\}$;
  5. Repeat 2 and 3 until R covers all attributes or L' is empty;
  6. Return R.
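The heuristic can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: hosts are dictionaries from attribute to value, "orthogonal" is read as "differs on every attribute", "covers all attributes" is read as "some host in R differs from the starting host on each attribute", and `find_core` is a hypothetical name.

```python
import random

def find_core(hi, hosts, attributes, rng=random):
    """Greedy sketch: returns a list of hosts starting with hi."""
    others = [h for h in hosts if h is not hi]
    # Step 1: hosts orthogonal to hi differ from it on every attribute.
    orthogonal = [h for h in others if all(h[a] != hi[a] for a in attributes)]
    if orthogonal:                       # Step 2: a two-host core exists
        return [hi, rng.choice(orthogonal)]
    # Step 3: grow the core one uncovered attribute at a time.
    core = [hi]
    candidates = [h for h in others if any(h[a] != hi[a] for a in attributes)]
    for a in attributes:
        if any(h[a] != hi[a] for h in core):
            continue                     # attribute a is already covered
        differing = [h for h in candidates if h[a] != hi[a]]
        if not differing:
            continue                     # no host can cover a; result stays partial
        hj = rng.choice(differing)
        core.append(hj)
        candidates.remove(hj)
    return core

hosts = [
    {"os": "Linux", "web": "A"},
    {"os": "Windows", "web": "A"},
    {"os": "Linux", "web": "B"},
]
print(len(find_core(hosts[0], hosts, ["os", "web"])))  # 3
```

In the example no host is orthogonal to the first one, so the sketch falls back to the greedy branch and needs both remaining hosts to cover the two attributes.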
90 Core size for scenario 8/2
1,000 hosts, 8 attributes [ICAT], 2 values per attribute (“Linux vs. Windows”)
Average core size is small even for highly skewed diversity
91 Core size for scenario 8/4
1,000 hosts, 8 attributes, 4 values per attribute
More attribute values reduce core size variation
92 Storage load
1,000 hosts
For highly skewed diversity, storage load can be high
93 System design issues
Fully-distributed system: no single point of failure; leverage research on P2P systems
Announcing available configurations: DHT-based approach
Encryption scheme to protect against data corruption
Recovering from a catastrophe: time to recover is not critical
Coping with a large number of requests: threshold on the number of accepted requests; exponential backoff
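The last item combines two mechanisms that can be sketched briefly in Python. Everything below is a hypothetical illustration: the slides do not specify the `Host` class, the `backoff_delays` helper, or the parameters `max_accepted`, `base`, and `cap`. The backoff shown is the common randomized variant, which is one possible choice.

```python
import random

class Host:
    """A host that stops accepting requests once a threshold is reached."""
    def __init__(self, max_accepted):
        self.max_accepted = max_accepted   # threshold on accepted requests
        self.accepted = 0

    def try_accept(self):
        if self.accepted >= self.max_accepted:
            return False                   # over the threshold: reject
        self.accepted += 1
        return True

def backoff_delays(base, attempts, cap, rng=random):
    """Delays a rejected client waits: random in [0, min(cap, base * 2**i)]."""
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]

host = Host(max_accepted=2)
print([host.try_accept() for _ in range(3)])   # [True, True, False]
```

A rejected client would sleep for the next delay in the list before retrying, so the retry rate drops as the overload persists.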
94 Lower bound on process replication
Claim: Every set of processes that satisfies , also satisfies (, )-Partition
Proof idea. Given a set , , construct a partition as follows:
1 ,
,tn
A1 Ak Ak+1. . . . . .
t
part) fractional : part,
Integral : -
f
IfIt
fl ,(
A1…Al: t/ processes
Al+1…A: t/ processes
Where:
There is at least one subset of elements Ai such that the union of these subsets contains t processes
Add one process to
95 Upper bound on process replication
Claim: If a problem P can be solved by an algorithm A in a system satisfying $n \geq kt + 1$, $k \geq 1$, integer $k$, then P can be solved in a system satisfying (k,1)-Partition
Suppose that A requires k=4
System $\langle \Pi, C_\Pi, S_\Pi \rangle$ satisfying (4,1)-Partition: $\Pi = \{a, b, c, d, e, f\}$, $C_\Pi = \{abc, ad, ae, af, bd, be, bf, cd, ce, cf, de, df, ef\}$
Maximum number of failures: 2
Virtual system defined under the threshold model: $\Pi = \{a, b, c, d, e, f, g, h, i\}$; satisfies $n \geq 4t + 1$ for $t = 2$
Simulate the virtual system with $\langle \Pi, C_\Pi, S_\Pi \rangle$
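The example can be checked mechanically. The Python sketch below assumes that (k,1)-Partition reads as "every partition of the processes into k blocks has at least one block that contains a core", and it allows empty blocks. Both are assumptions of the sketch. The process and core lists are the six-process example system from this slide.

```python
from itertools import product

processes = "abcdef"
cores = [set(c) for c in
         ["abc", "ad", "ae", "af", "bd", "be", "bf",
          "cd", "ce", "cf", "de", "df", "ef"]]

def satisfies_k1_partition(k):
    """Brute force over every assignment of the processes to k blocks."""
    for assignment in product(range(k), repeat=len(processes)):
        blocks = [set() for _ in range(k)]
        for p, b in zip(processes, assignment):
            blocks[b].add(p)
        if not any(core <= block for block in blocks for core in cores):
            return False                 # found a partition with no core in any block
    return True

print(satisfies_k1_partition(4))   # True
print(satisfies_k1_partition(5))   # False: {a,b},{c},{d},{e},{f} has no core in a block
print(9 >= 4 * 2 + 1)              # True: the 9-process virtual system meets n >= 4t+1 for t = 2
```

Under this reading the six physical processes satisfy (4,1)-Partition, whereas a threshold system with t = 2 needs nine processes for the same k.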
96 Future work
Impact on reliability and performance Fewer executions allowed
What are the chances that an execution not assumed happens?
Another requirement: compute cores/survivor sets
Static vs. dynamic cores/survivor sets Processes joining and leaving Changes in reliability
Implementation issues Representation of cores and survivor sets Determining the cores/survivor sets of a system Applicability on the various systems
97 Future work
Applicability of the Consensus solutions Look at existing systems that use Consensus as a primitive Evaluate the benefits in practice of using our solutions
Solutions for hybrid failure models Translate , to our modelmamcsan ,22
98 Future work
No protocols with rational k so far Any known candidate?
Finish formal proof of algorithm translation
99 Future work
How do we determine the attributes? Resilience depends on the attributes Vulnerability databases Dynamic attributes: new attributes and values
How many attributes do we need? The number of attributes impacts storage overhead
What is a good level of granularity for the attributes? E.g. {Windows} vs. {Win_95, Win_98, Win_2000, Win_XP}
Other challenges Heuristics for finding cores: storage overhead and storage load Efficacy
How do we assess the efficacy of a prototype? Major Internet incidents are not so frequent
100 Generalizing the partition and the intersection properties
(, )-Partition. For every partition $A = \{A_1, A_2, \ldots\}$ of $\Pi$, there is a subset $A' = \{A_{k_1}, A_{k_2}, \ldots\} \subseteq A$ such that $\bigcup_i A_{k_i}$ contains a core
(, )-Intersection. For every :
,, and S
))((,
s
s
,, ,S
1
1
),
),,
:;||
,,
,,
,,
SCCCS
SCSCC
sss
sss
,()-( :
,( :-
\, ,,
1
,: subset of S
,, of subsets of collection :
101 Generalizing the partition and the intersection properties
, and integers (, )-Partition. For every partition of
, there is a subset such that:
(, )-Intersection. For every , there is a subset , such that: