Page 1
© 2005 Andreas Haeberlen, Rice University
Glacier: Highly durable, decentralized storage despite massive correlated failures
Andreas Haeberlen
Alan Mislove
Peter Druschel
Rice University, Houston, TX
2nd Symposium on Networked Systems Design & Implementation (NSDI)
Boston, MA, May 2-4, 2005
Page 2
Introduction
Many distributed applications require storage
Cooperative storage: Aggregate storage on participating nodes
Advantages: resilient, highly scalable
Examples: Farsite, PAST, OceanStore
[Figure: a structured overlay network]
Page 3
Motivation
Common assumption: high node diversity, failure independence
Unrealistic! The node population may have low diversity (e.g., a shared OS)
Worms can cause large-scale correlated Byzantine failures
Reactive systems are too slow to prevent data loss
Page 4
Related Work
Phoenix and OceanStore use introspection: build a failure model, then store data on nodes with low correlation
Limitations:
The model must reflect all possible correlations
Even small inaccuracies may lead to data loss
Users have an incentive to report incorrect data
Page 5
Our Approach: Glacier
Create massive redundancy to ensure that data survives any correlated failure with high probability
Assumption: the magnitude of a correlated failure can be bounded by a fraction fmax of the nodes
Challenges:
Minimize storage and bandwidth requirements
Withstand attacks and Byzantine failures
Page 6
Glacier: Insertion
When a new object is inserted:
1. Apply an erasure code
2. Attach a manifest with the hashes of all fragments
3. Send each fragment to a different node
There is no remote delete operation, but object lifetimes can be limited: storage is lease-based, and unused storage is reclaimed when leases expire (see the sketch below)
[Figure: object X is erasure-coded into fragments 1-6, each sent to a different node]
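A minimal sketch of the manifest step, assuming the fragments were already produced by an erasure code; the names Manifest, make_manifest, and verify_fragment (and the use of SHA-1) are illustrative, not Glacier's actual API:

```python
import hashlib
import time
from dataclasses import dataclass

@dataclass
class Manifest:
    object_key: bytes      # identifier of the stored object
    fragment_hashes: list  # one digest per erasure-coded fragment
    expiration: float      # lease expiry (Unix time); there is no remote delete

def make_manifest(object_key: bytes, fragments: list, lease_secs: int) -> Manifest:
    """Hash every fragment so that each storage node can later verify
    the fragment it receives against the attached manifest."""
    return Manifest(
        object_key=object_key,
        fragment_hashes=[hashlib.sha1(f).digest() for f in fragments],
        expiration=time.time() + lease_secs,
    )

def verify_fragment(manifest: Manifest, index: int, fragment: bytes) -> bool:
    """Run by the receiving node before it stores fragment `index`."""
    return hashlib.sha1(fragment).digest() == manifest.fragment_hashes[index]
```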
Page 7
Glacier: Maintenance
Nodes that are close in the identifier space store similar sets of fragments
Periodic maintenance: ask a peer node for its list of fragments, compare it with the local list, and recover any missing fragments (see the sketch below)
Fragments remain on their nodes during offline periods
[Figure: a node asks a peer whether it holds fragment Xk and restores the fragment from the peer's copy]
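A toy sketch of one maintenance round, assuming fragments are indexed by (object, fragment) keys; `fetch_fragment` stands in for the network request and is not Glacier's real interface:

```python
def maintenance_round(local_store: dict, peer_keys: set, fetch_fragment) -> None:
    """Compare a peer's fragment list with the local one and recover
    any fragment this node should hold but is missing."""
    missing = peer_keys - local_store.keys()
    for key in missing:
        fragment = fetch_fragment(key)  # ask the peer (or another holder)
        if fragment is not None:        # the peer may have lost it, too
            local_store[key] = fragment

# Example: the local node learns about fragment ("X", 3) from a peer.
store = {("X", 1): b"...", ("X", 2): b"..."}
maintenance_round(store, {("X", 1), ("X", 2), ("X", 3)},
                  lambda key: b"recovered-bytes")
assert ("X", 3) in store
```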
Page 8
Glacier: Recovery
During a failure, some fragments are damaged or lost, and communication may not be possible
Unaffected nodes do not take any special action: failed nodes are eventually repaired, and maintenance gradually restores the lost fragments
[Figure: timeline of fragments 1-6; after the object is inserted, a correlated failure at Tfail and the subsequent offline period destroy some fragments, which maintenance later restores]
Page 9
Glacier: Durability
Example configuration: 48 fragments, any 5 of which suffice for recovery
Bad news: storage overhead is 9.6x
Good news: a single object survives a 60% correlated failure with P = 0.999999
fmax   Durability   Fragments needed (m)   Total fragments (n)   Storage overhead
0.30   0.9999                3                      13                 4.33x
0.50   0.99999               4                      29                 7.25x
0.60   0.999999              5                      48                 9.60x
0.70   0.999999              5                      68                13.60x
0.85   0.999999              5                     149                29.80x

(Tolerating a larger fmax costs more storage and yields higher durability.)
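The durability column follows from the standard binomial survival bound, assuming each fragment is lost independently with probability at most fmax; a small sketch that reproduces the table:

```python
from math import comb

def durability(n: int, m: int, f_max: float) -> float:
    """P(at least m of the n fragments survive a failure of size f_max)."""
    p = 1.0 - f_max  # survival probability of a single fragment
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(m, n + 1))

print(f"{durability(48, 5, 0.60):.7f}")  # ~0.999999, matching the n=48 row
```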
Page 10
Aggregation
If objects are small, there is a huge number of fragments and high overhead for storage and management
Solution: aggregate objects before storing them in Glacier
Challenge: the environment is untrusted, so aggregates must be self-authenticating
[Figure: an aggregation layer is inserted between the application and Glacier]
Page 11
Aggregation: Links
The mapping from objects to aggregates is crucial: it needs both durability and authentication
Solution: link the aggregates; the result is a DAG
The mapping can be recovered by traversing the DAG
The DAG forms a hash tree, so it is easy to authenticate (see the sketch below)
The top-level pointer is kept in Glacier itself
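A minimal sketch of self-authenticating, linked aggregates under content-hash keys; the serialization format and names are illustrative, not Glacier's:

```python
import hashlib

def make_aggregate(objects: list, links: list) -> tuple:
    """Pack small objects plus links (keys of earlier aggregates) into one
    blob whose key is its own content hash; since links are hashes, the
    aggregates form a DAG that doubles as a hash tree."""
    parts = [len(links).to_bytes(4, "big")] + list(links)  # links: 20-byte keys
    parts += [len(obj).to_bytes(4, "big") + obj for obj in objects]
    body = b"".join(parts)
    return hashlib.sha1(body).digest(), body

def verify(key: bytes, body: bytes) -> bool:
    """Check an aggregate fetched from untrusted storage by rehashing;
    verified links then authenticate the children, level by level."""
    return hashlib.sha1(body).digest() == key

# Example: a child aggregate, then a parent that links to it.
child_key, child_body = make_aggregate([b"mail-1", b"mail-2"], [])
root_key, root_body = make_aggregate([b"mail-3"], [child_key])
assert verify(root_key, root_body) and verify(child_key, child_body)
```

Starting from the top-level pointer stored in Glacier and verifying each aggregate before following its links authenticates the entire DAG.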
Page 12
Evaluation
Two sets of experiments:
Trace-driven simulations (scalability, churn, ...)
An actual deployment: ePOST
ePOST is a cooperative, serverless e-mail system in production use (initially 17 users, 20 nodes), based on FreePastry, PAST, Scribe, and POST; Glacier was added for durability
Glacier configuration in ePOST: 48 fragments, 0.2 encoding, fmax = 0.6, P = 0.999999
140 days of practical experience (including some failures)
Page 13
Evaluation: Storage
Inherent storage overhead: 48/5 = 9.6x
17 GB of on-disk storage for 1.3 GB of data; the actual on-disk overhead is about 12.6x
Page 14
Evaluation: Network load
During stable periods, traffic is comparable to PAST
In the ePOST experiment, a misconfiguration caused frequent traffic spikes: long offline periods were mistaken for failures
Page 15
Evaluation: Recovery
Experiment: created a 'clone' of the ePOST ring with only 13 of the 31 nodes (a 58% failure!)
Started the recovery process on a freshly installed node:
The user entered an e-mail address and the date of last use
Glacier located the head of the aggregate tree and recovered it
The system was again ready for use, with no data loss
Page 16
Conclusions
Large-scale correlated failures are a realistic threat to distributed storage systems
Glacier provides hard durability guarantees with minimal assumptions about the failure model
Glacier transforms abundant but unreliable disk space into reliable storage
Bandwidth cost is low
Thank you!
Page 17
Glacier is available!
Download: www.epostmail.org
• Serverless, secure e-mail
• Easy to set up
• Uses Glacier for durability