Smart Redundancy for Distributed Computation George Edwards Blue Cell Software, LLC Yuriy Brun University of Washington Jae young Bang University of Southern California Nenad Medvidovic University of Southern California
Mar 23, 2016
Smart Redundancy for Distributed Computation
George Edwards, Blue Cell Software, LLC
Yuriy Brun, University of Washington
Jae young Bang, University of Southern California
Nenad Medvidovic, University of Southern California
Distributed Computation Architectures
• Solve large computational problems and/or process large data sets
• Provide a platform and API for applications
• Transparently parallelize computation across a pool of computers
• Examples:
  – Clouds
  – Grids
  – Volunteer computing
DCA Applications
• Highly parallelizable problems
  – Find the 10^100th digit of π
  – Factor 2^2011 − 1
• Driven by:
  – Basic research
  – Pharmaceutical applications
  – Web analytics
  – …
Volunteer Computing
• Attempts to leverage the more than 1 billion (mostly idle) machines on the Internet
  – Volunteers install a client
  – When idle, the client requests work from a server and sends back results
• Aids projects that have limited funding but large public appeal
Dealing with Faults
• Context:
  – Volunteers may fail or maliciously return false results
  – Volunteers are not accountable
  – Malicious volunteers may collude
  – Well-formed but incorrect results are hard to detect
  – The reliability of volunteers is difficult to estimate
• Solution:
  – Redundancy and voting
System Model
• A task server subdivides computations into tasks
• The task server replicates each task into multiple identical jobs
• The task server assigns each job to a node in the node pool
• Nodes perform work, send results, and rejoin the pool
• New volunteer nodes may join the pool while other nodes may leave
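The replication and assignment steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `replicate` and `assign`, the `(task_id, replica)` job representation, and the random assignment policy are all hypothetical.

```python
import random


def replicate(task_id, copies):
    """Replicate one task into `copies` identical jobs, labeled (task_id, replica)."""
    return [(task_id, replica) for replica in range(copies)]


def assign(jobs, node_pool, rng):
    """Assign each job to a distinct node drawn at random from the pool.

    Random assignment is an assumption of this sketch; the slides do not
    specify an assignment policy.
    """
    if len(jobs) > len(node_pool):
        raise ValueError("not enough nodes in the pool for this wave of jobs")
    chosen = rng.sample(node_pool, len(jobs))
    return dict(zip(jobs, chosen))


if __name__ == "__main__":
    pool = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
    print(assign(replicate("task-1", 3), pool, random.Random(0)))
```

Drawing distinct nodes for the replicas of one task is a design choice of the sketch: it keeps a single faulty node from casting more than one vote on the same task.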
k-vote Traditional Redundancy (TR)
• Performs k independent executions of each task
• Takes a vote on the correctness of the result
• Requires expending a factor of k resources or suffering a factor of k slowdown in performance
Example
• k = 19
• r = 0.7
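For these example parameters, the reliability of a k-vote can be estimated with a short calculation. The sketch below assumes that nodes fail independently, that each returns the correct result with probability r, and that the vote succeeds only when a strict majority of the k jobs is correct. The function name `tr_reliability` is hypothetical.

```python
from math import comb


def tr_reliability(k, r):
    """Probability that a strict majority of k independent jobs is correct."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number so the vote cannot tie")
    majority = k // 2 + 1
    # Sum the binomial probabilities of every outcome with at least
    # `majority` correct results.
    return sum(
        comb(k, i) * r**i * (1 - r) ** (k - i) for i in range(majority, k + 1)
    )


if __name__ == "__main__":
    print(tr_reliability(19, 0.7))
```

The cost of this scheme is always k jobs per task, regardless of how the votes fall.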
Insights
• Redundant computations need not be simultaneous
• DCAs can dynamically adjust the level of redundancy based on run-time information
• k-vote traditional redundancy wastes computations
Example
• 19 independent computations (k = 19)
• 70% node reliability (r = 0.7)
• (0.7)^10 ≈ 2.8% of the time, the first 10 of them will return the correct result
• The last 9 results are irrelevant
k-vote Progressive Redundancy (PR)
• Distributes jobs in waves
• In each wave, distributes the minimum number of jobs needed to produce a consensus (assuming all agree)
• Repeats until a consensus is reached
Example
• k = 19
• r = 0.7
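The wave structure of progressive redundancy can be sketched as follows. This is a minimal illustration rather than the authors' code. It assumes a consensus means that one result has collected a majority of k votes, that is (k + 1) / 2, and that there are only two possible results (for instance, because faulty nodes collude on one wrong answer). The names `progressive_redundancy` and `simulated_node` are hypothetical.

```python
import random


def simulated_node(r, rng):
    """Return a job runner that yields the correct result with probability r."""
    return lambda: "correct" if rng.random() < r else "wrong"


def progressive_redundancy(k, run_job):
    """Run a k-vote in waves and return (winning_result, jobs_used)."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number")
    consensus = k // 2 + 1  # votes one result needs to win
    counts = {}
    jobs_used = 0
    while True:
        leader = max(counts, key=counts.get) if counts else None
        best = counts.get(leader, 0)
        if best >= consensus:
            return leader, jobs_used
        # Minimum number of extra jobs that could still produce a consensus,
        # assuming every one of them agrees with the current leader.
        wave = consensus - best
        for _ in range(wave):
            result = run_job()
            counts[result] = counts.get(result, 0) + 1
        jobs_used += wave


if __name__ == "__main__":
    node = simulated_node(0.7, random.Random(1))
    print(progressive_redundancy(19, node))
```

With two possible results, this loop never uses more than k jobs, and it uses only (k + 1) / 2 when the first wave is unanimous.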
Insights
• The confidence level associated with a result can be computed
• k-vote progressive redundancy produces results with varying confidence
Example
• k = 19, r = 0.7
• If the vote is 10-0, confidence level ≈ 99.98%
• If the vote is 10-9, confidence level = 70%
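One way to reproduce these confidence levels is a direct application of Bayes' rule. The sketch below assumes two possible results that are equally likely a priori, and independent nodes that each return the correct result with probability r. The function name `confidence` is hypothetical.

```python
def confidence(a, b, r):
    """Probability that the majority result is correct, given an a-to-b vote."""
    if a < b:
        raise ValueError("a is the majority count and must be at least b")
    majority_right = r**a * (1 - r) ** b  # likelihood if the majority is correct
    minority_right = r**b * (1 - r) ** a  # likelihood if the minority is correct
    return majority_right / (majority_right + minority_right)


if __name__ == "__main__":
    print(round(confidence(10, 0, 0.7), 4))  # 0.9998
    print(round(confidence(10, 9, 0.7), 4))  # 0.7
```

Under these assumptions the expression simplifies to 1 / (1 + ((1 − r) / r)^(a − b)), so the confidence depends only on the vote margin a − b, not on the total number of votes.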
Iterative Redundancy (IR)
• Distributes jobs in waves
• In each wave, distributes the minimum number of jobs required to achieve a desired confidence level
• Repeats until desired confidence level is reached
Example
• d = 4
• r = 0.7
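A minimal sketch of iterative redundancy follows. It rests on assumptions that the slides do not spell out: that d is the vote margin (leading count minus runner-up count) corresponding to the desired confidence level, that there are two possible results, and that confidence for a margin m is 1 / (1 + ((1 − r) / r)^m). The names `margin_for_confidence`, `iterative_redundancy`, and `simulated_node` are hypothetical.

```python
import random


def simulated_node(r, rng):
    """Return a job runner that yields the correct result with probability r."""
    return lambda: "correct" if rng.random() < r else "wrong"


def margin_for_confidence(target, r):
    """Smallest vote margin d whose confidence reaches `target` (assumed model)."""
    if not (0.5 < r < 1.0 and 0.5 < target < 1.0):
        raise ValueError("r and target must lie strictly between 0.5 and 1")
    d = 1
    while 1 / (1 + ((1 - r) / r) ** d) < target:
        d += 1
    return d


def iterative_redundancy(d, run_job):
    """Distribute jobs in waves until one result leads by d votes.

    Returns (winning_result, jobs_used).
    """
    if d < 1:
        raise ValueError("d must be at least 1")
    counts = {}
    jobs_used = 0
    while True:
        ranked = sorted(counts.values(), reverse=True)
        first = ranked[0] if ranked else 0
        second = ranked[1] if len(ranked) > 1 else 0
        lead = first - second
        if lead >= d:
            return max(counts, key=counts.get), jobs_used
        # Minimum number of extra jobs that could reach margin d,
        # assuming every one of them agrees with the current leader.
        wave = d - lead
        for _ in range(wave):
            result = run_job()
            counts[result] = counts.get(result, 0) + 1
        jobs_used += wave


if __name__ == "__main__":
    d = margin_for_confidence(0.96, 0.7)
    print(d, iterative_redundancy(d, simulated_node(0.7, random.Random(1))))
```

Unlike a k-vote, this loop has no fixed upper bound on jobs per task. It stops as soon as the margin is reached, so unanimous early waves finish cheaply and contested tasks receive more jobs.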
Algorithm Comparison
• System reliability approaches 1 exponentially for TR, PR, and IR
• IR produces the same reliability at a lower cost
  – Or, equivalently, higher reliability at the same cost
• IR is optimal with respect to cost
  – Guaranteed to use the minimum computation needed to achieve desired system reliability
[Figure: two plots of System Reliability (y-axis) versus Cost Factor (x-axis)]
Algorithm Comparison
• PR and IR perform best when the reliability of the node pool is high
[Figure: Ratio Improvement Over Traditional Recovery (y-axis) versus Node Reliability (x-axis)]
Adaptive Behavior
• IR maintains a constant system reliability as node reliability fluctuates
  – Injects redundancy where it is needed
    • “Unlucky” situations
  – Removes redundancy where it is unnecessary
[Figure: three plots over Time: Node Reliability, Cost Factor, and System Reliability]
Node Reliability Estimation
• Incorrectly estimating node reliability does not affect the performance of IR
[Figure: System Reliability (y-axis) versus Cost Factor (x-axis)]
Conclusions
• Iterative redundancy automatically replicates computation with optimal efficiency
• Iterative redundancy can be used when:
  – A computation can be broken down into independent tasks
  – Computation is performed by a pool of independent processing resources
  – Task deployment decisions can be made at runtime
  – The reliability of resources in the pool is unknown
For More Information
To appear in ICDCS 2011:
Smart Redundancy for Distributed Computation
by Yuriy Brun, George Edwards, Jae young Bang and Nenad Medvidovic
http://www.cs.washington.edu/homes/brun/pubs/pubs/Brun11icdcs.pdf