Exploring Tradeoffs in Failure Detection in P2P Networks Shelley Zhuang, Ion Stoica, Randy Katz HIIT Short Course August 18-20, 2003
Dec 20, 2015
Problem Statement
• One of the key challenges in achieving robustness in overlay networks: quickly detecting node failures
• Canonical solution: each node periodically pings its neighbors
• Propose keep-alive techniques
• Study the fundamental limitations and tradeoffs between detection time, control overhead, and probability of false positives
Outline
• Motivation
• Network Model and Assumptions
• Keep-alive Techniques
• Performance Evaluation
• Conclusion
Network Model and Assumptions
• P2P system with n nodes
• Each node A knows d other nodes
• Average path length = l
• Node up-times are i.i.d.: $T \sim \text{exponential}(\lambda_f)$
• Fail-stop failures
• If a neighbor is lost, a node can use another neighbor to route the packet without affecting the path length
Packet Loss Probability
• δ = average time it takes a node to detect that a neighbor has failed
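• Probability that a node forwards a packet to a neighbor that has failed is $1 - e^{-\lambda_f \delta} \approx \delta \lambda_f$ for $\delta \ll T$
– Memoryless property: $P(T - t \le \delta \mid T > t) = P(T \le \delta)$
• Probability that the packet is lost is $p_l \approx l \delta \lambda_f$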
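The approximation above can be checked numerically. The following is a minimal sketch, assuming each of the l hops independently hits a failed next hop with the per-hop probability given on this slide; the function names and the example parameter values are illustrative and do not come from the talk.

```python
import math

def p_next_hop_failed(delta, failure_rate):
    """Exact probability that a neighbor failed within the last `delta` seconds."""
    return 1.0 - math.exp(-failure_rate * delta)

def p_packet_lost(path_length, delta, failure_rate):
    """Loss probability over `path_length` hops, assuming independent hops."""
    p_hop = p_next_hop_failed(delta, failure_rate)
    return 1.0 - (1.0 - p_hop) ** path_length

def p_packet_lost_approx(path_length, delta, failure_rate):
    """First-order approximation l * delta * lambda_f."""
    return path_length * delta * failure_rate

# Illustrative values: 5 hops, 2 s detection time, mean up-time of one hour.
exact = p_packet_lost(5, 2.0, 1.0 / 3600.0)
approx = p_packet_lost_approx(5, 2.0, 1.0 / 3600.0)
```

When the detection time is small relative to the mean up-time, the two values should agree closely, with the approximation slightly overestimating the loss probability.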
Aliveness Techniques
• Baseline
– Each node sends a ping message to each of its neighbors every Δ seconds
– $\delta \approx \frac{\Delta}{2}$, $p_l \approx \frac{l \Delta \lambda_f}{2}$
[Figure: nodes A, B, C, D]
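A small Monte Carlo sketch can illustrate the baseline detection time. It assumes that a neighbor fails at a uniformly random point inside a ping interval and that the failure is noticed at the next ping (probe timeouts are ignored); under that assumption the mean detection time should come out close to half the ping period. The function name and parameters are hypothetical.

```python
import random

def baseline_mean_detection_time(ping_period, trials=100_000, seed=0):
    """Estimate the mean time from a neighbor's failure to the next ping."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Offset into the ping interval at which the neighbor fails (assumed uniform).
        fail_offset = rng.uniform(0.0, ping_period)
        # The failure is noticed at the next scheduled ping.
        total += ping_period - fail_offset
    return total / trials

estimate = baseline_mean_detection_time(10.0)
```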
Aliveness Techniques
• Information Sharing
– Piggyback failures of neighbors in acknowledgement messages
– Best case: completely connected graph of degree d
– $\delta \approx \frac{\Delta \log d}{d}$, $p_l \approx \frac{l \Delta \lambda_f \log d}{d}$
[Figure: nodes A, B, C, D]
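The piggybacking idea can be sketched as a pair of message handlers. This is an illustrative data-structure sketch only, not the protocol implementation from the talk: the `Node` class, the dictionary-shaped acknowledgement, and the field names are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    neighbors: set = field(default_factory=set)
    known_failed: set = field(default_factory=set)

    def make_ack(self):
        """Build a ping acknowledgement that carries the failures this node knows of."""
        return {"from": self.name, "failed": set(self.known_failed)}

    def handle_ack(self, ack):
        """Merge piggybacked failures that concern this node's own neighbors."""
        learned = (ack["failed"] & self.neighbors) - self.known_failed
        self.known_failed |= learned
        return learned

# B has already detected that D failed; A learns of it from B's acknowledgement.
b = Node("B", neighbors={"A", "C", "D"}, known_failed={"D"})
a = Node("A", neighbors={"B", "C", "D"})
learned = a.handle_ack(b.make_ack())
```

In this sketch A learns about D's failure without waiting for its own next ping to D, at no extra message cost beyond a larger acknowledgement.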
Aliveness Techniques
• Boosting
– When a node detects failure of a neighbor, D, it announces to all other nodes that have D as their neighbor
– Best case: completely connected graph of degree d
– $\delta \approx \frac{\Delta}{d + 1}$, $p_l \approx \frac{l \Delta \lambda_f}{d + 1}$
[Figure: nodes A, B, C, D]
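The best-case benefit of boosting can be illustrated with a Monte Carlo sketch. It assumes that the d neighbors of the failed node ping it with independent, uniformly distributed phases and that the first detector's announcement reaches the others instantly, so everyone's detection time is the minimum of d uniform delays; the expected minimum of d uniform samples on [0, Δ] is Δ/(d+1). The function name and parameters are hypothetical.

```python
import random

def boosting_mean_detection_time(ping_period, degree, trials=100_000, seed=0):
    """Estimate the time until the first of `degree` neighbors notices a failure."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each neighbor's next ping arrives after an independent uniform delay (assumed).
        first_detection = min(rng.uniform(0.0, ping_period) for _ in range(degree))
        # With boosting, the first detector announces the failure to all the others.
        total += first_detection
    return total / trials

estimate = boosting_mean_detection_time(10.0, degree=4)
```

With a ping period of 10 seconds and degree 4, the estimate should be close to 2 seconds, compared with about 5 seconds for a single node pinging on its own.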
Performance Evaluation
• Case studies
– d-regular network
– Chord lookup protocol
• Chord event-driven simulator
– Gnutella join/leave trace
– Packet loss rate
– Control overhead
• PlanetLab experiments
– PlanetLab event-driven simulator
– False positives
Loss Rate – Gnutella
• Loss Rate = # Lookup timeouts / # Lookups
• 20 lookups per second
• Boosting (simple): no additional state
Loss Rate – Gnutella
• Wait $T_{to}$ seconds before deciding that a probe is lost
• Require multiple probe losses before deciding that a neighbor has failed
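The two rules above can be sketched as a small per-neighbor monitor. This is a minimal illustration, not the detector used in the experiments: the class name, the `max_missed` threshold, and the convention of passing `None` for a probe that was never acknowledged are assumptions.

```python
class ProbeMonitor:
    """Declare a neighbor failed only after several consecutive lost probes."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed  # consecutive losses required (assumed parameter)
        self.missed = 0

    @property
    def failed(self):
        return self.missed >= self.max_missed

    def record_probe(self, rtt, timeout):
        """Record one probe; `rtt` is None if no acknowledgement ever arrived."""
        if rtt is not None and rtt <= timeout:
            self.missed = 0   # acknowledgement arrived in time: reset the counter
        else:
            self.missed += 1  # probe counted as lost after `timeout` seconds
        return self.failed

monitor = ProbeMonitor(max_missed=2)
results = [
    monitor.record_probe(0.1, timeout=1.0),   # acknowledged in time
    monitor.record_probe(None, timeout=1.0),  # first loss
    monitor.record_probe(2.0, timeout=1.0),   # late acknowledgement counts as a loss
]
```

Raising either the timeout or the loss threshold trades a longer detection time for fewer false positives.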
Overhead (count) – Gnutella
• Constant probing overhead (1 probe/second)
• Small difference due to boost messages
False Positive – PlanetLab
• Propagation of positive information
• Most false positives are of TO = 0, 1: increase probe timeout threshold
Overhead (bps) – PlanetLab
• Overhead from boost messages and positive information correlates with the loss rate
Conclusion
• Examined three keep-alive techniques in Chord with Gnutella join/leave trace
• By carefully designing keep-alive algorithms, it is possible to significantly reduce packet loss probability
• Probability of a false positive for boosting with backpointers is < 0.01 at a loss rate of ~8.6% when positive information is propagated and the probe timeout threshold is increased