Pond: the OceanStore Prototype Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz University of California, Berkeley Proc. of the 2nd USENIX Conf. on File and Storage Technologies (FAST '03) Presented by Park, Seon-Yeong

Dec 29, 2015

Page 1

Pond: the OceanStore Prototype

Sean Rhea, Patrick Eaton, Dennis Geels,

Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz

University of California, Berkeley

Proc. of the 2nd USENIX Conf. on File and Storage Technologies (FAST '03)

Presented by Park, Seon-Yeong

Page 2

Ubiquitous Computing

< Figure: Devices (Telephone, SPO Watch, PDA, Cell Phone, Digital TV, PC) and a Storage Pool >

Page 3

OceanStore Overview

Internet-scale, Cooperative File System

Applications:
– Calendars, Email, Contact Lists, Large Digital Libraries, Repositories for Scientific Data, Distributed Design Tools, etc.

Requirements:
– Universal Availability
– Durability
– Understandable Consistency Model
– Privacy vs. Information Sharing

Page 4

Data Model (1/2)

Data Object:
– Analogous to a File in a Traditional File System
– Named by an Active Globally-Unique Identifier (AGUID)
• Location Independent
• Prevents Name Space Collisions

AGUID = SHA-1(Application-specified Name + Owner’s Public Key)
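The naming rule above can be sketched in a few lines. This is only an illustration: the byte encoding and the plain concatenation are assumptions, and `make_aguid` and the placeholder keys are hypothetical names, not part of Pond.

```python
import hashlib


def make_aguid(name: str, owner_public_key: bytes) -> str:
    """Return a 160-bit identifier as 40 hex digits.

    Assumption: the name is UTF-8 encoded and concatenated with the raw
    key bytes before hashing; the slide does not specify the encoding.
    """
    digest = hashlib.sha1(name.encode("utf-8") + owner_public_key)
    return digest.hexdigest()


# Placeholder byte strings standing in for real public keys.
alice_key = b"alice-public-key-bytes"
bob_key = b"bob-public-key-bytes"

aguid_alice = make_aguid("calendar", alice_key)
aguid_bob = make_aguid("calendar", bob_key)
```

Two owners who pick the same application-level name still obtain different identifiers, which is how including the key prevents name space collisions.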

Page 5

Data Model (2/2)

Data Object: Sequence of Read-only Versions

Block Reference: Cryptographically-secure Hash of Child Block’s Contents
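A minimal sketch of hash-based block references, assuming a flat layout in which one root block lists the hashes of its data blocks (the real structure is a tree). The in-memory `store` and the helper names are hypothetical.

```python
import hashlib

# Hypothetical content-addressed store: block hash -> block bytes.
store = {}


def put_block(data: bytes) -> str:
    """Store a block under the SHA-1 hash of its contents."""
    ref = hashlib.sha1(data).hexdigest()
    store[ref] = data
    return ref


def put_version(data_blocks) -> str:
    """Write a read-only version: a root block listing child references."""
    child_refs = [put_block(block) for block in data_blocks]
    return put_block("\n".join(child_refs).encode("ascii"))


def read_version(root_ref: str) -> bytes:
    """Follow the references from the root and verify every child block."""
    parts = []
    for ref in store[root_ref].decode("ascii").split("\n"):
        block = store[ref]
        if hashlib.sha1(block).hexdigest() != ref:
            raise ValueError("block does not match its reference")
        parts.append(block)
    return b"".join(parts)


v1 = put_version([b"hello ", b"world"])
v2 = put_version([b"hello ", b"there"])  # reuses the unchanged first block
```

Because a reference is derived from the contents, an unchanged block is stored once and shared by both versions, and older versions stay readable after newer ones are written.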

< Structure of Data Object >

Page 6

Underlying Technology

Access Control

Data Update:
– Primary Replica
– Archival Storage
– Secondary Replica

Data Read

Data Location & Routing: Tapestry

Page 7

Access Control

Reader Restriction:
– Encrypt All Data
– Distribute the Encryption Key to Users with Read Permission

Writer Restriction:
– Access Control List (ACL) for Each Object
– All Writes Must be Signed so that Well-behaved Servers and Clients can Verify them against the ACL

Page 8

Underlying Technology

Access Control

Data Update:
– Primary Replica
– Archival Storage
– Secondary Replica

Data Read

Data Location & Routing

Page 9

Data Update (1/2)

Update: Adding a New Version to the Head of the Version Stream

Array of Potential Actions, each Guarded by a Predicate
– Predicate Examples
• Checking the Latest Version Number, Comparing a Region of Bytes to an Expected Value, etc.
– Action Examples
• Replacing a Set of Bytes, Appending New Data, Truncating the Object, etc.

< Update Message Format >
Timestamp
Client ID
<Predicate 1, Action 1>
<Predicate 2, Action 2>
. . .
<Predicate N, Action N>
Client Signature
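The predicate/action array might be evaluated roughly as follows. This sketch assumes that the first pair whose predicate holds is the one applied, and that the update has no effect otherwise. It leaves out the timestamp, client ID and signature fields, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Version:
    number: int
    data: bytes


Predicate = Callable[[Version], bool]
Action = Callable[[bytes], bytes]


def apply_update(versions: List[Version],
                 pairs: List[Tuple[Predicate, Action]]) -> bool:
    """Append a new version if some predicate holds for the current head.

    Existing versions are never modified; a successful update only adds
    a new head to the version stream.
    """
    head = versions[-1]
    for predicate, action in pairs:
        if predicate(head):
            versions.append(Version(head.number + 1, action(head.data)))
            return True
    return False


versions = [Version(1, b"abc")]
# "If the latest version is 1, append new data."
update = [(lambda v: v.number == 1, lambda d: d + b"def")]

committed = apply_update(versions, update)  # head is version 1, so it applies
stale = apply_update(versions, update)      # head is now version 2, so it does not
```

Guarding the append with a version check gives compare-and-swap behavior: a client working from stale state cannot silently overwrite a newer version.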

Page 10

Data Update (2/2)

< OceanStore Update Path: Applications, Primary Replica (Inner Ring), Secondary Replicas, Archival Storage >

Page 11

Primary Replica

Inner Ring: A Set of Servers that Implement an Object’s Primary Replica

Applies Updates and Creates New Versions
– Serialization
– Access Control
– Creation of Archival Fragments

Update Agreement
– Byzantine Agreement Protocol
• Distributed Decision Process in which All Non-faulty Participants Reach the Same Decision
• A Group of Size 3f+1 Tolerates no more than f Faulty Servers
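The 3f+1 bound translates directly into arithmetic on the ring size. The helper names below are hypothetical.

```python
def max_faulty(ring_size: int) -> int:
    """Largest f such that 3f + 1 <= ring_size."""
    if ring_size < 1:
        raise ValueError("ring_size must be positive")
    return (ring_size - 1) // 3


def min_ring_size(faulty: int) -> int:
    """Smallest ring that tolerates `faulty` faulty servers."""
    if faulty < 0:
        raise ValueError("faulty must be non-negative")
    return 3 * faulty + 1
```

A ring of 4 servers tolerates 1 faulty member and a ring of 7 tolerates 2. Growing the ring from 4 to 6 buys no extra tolerance.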

Page 12

Archival Storage

Simple Replication: Tolerates One Failure for an Additional 100% Storage Cost

Erasure Codes:
– Efficient and Stable Storage for Archival Copies
– Storage Cost Increases by a Factor of N/M
– Original Block can be Reconstructed from Any M Fragments

< Figure: a Block is Encoded by an Erasure Code into N Fragments (Fragment 1 ... Fragment N); Any M of them (M < N) Rebuild the Block >
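The "any M of N" property can be seen with a toy code where M = 2 and N = 3: two data halves plus their XOR. This single-parity scheme is an assumption chosen for brevity, not the code Pond uses, and the function names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block: bytes) -> dict:
    """Split a block into 2 data fragments and add 1 parity fragment."""
    if len(block) % 2 != 0:
        raise ValueError("toy code needs an even-length block")
    half = len(block) // 2
    first, second = block[:half], block[half:]
    return {0: first, 1: second, 2: xor_bytes(first, second)}


def decode(fragments: dict) -> bytes:
    """Rebuild the block from any 2 of the 3 fragments."""
    if len(fragments) < 2:
        raise ValueError("need at least 2 fragments")
    if 0 in fragments and 1 in fragments:
        first, second = fragments[0], fragments[1]
    elif 0 in fragments:
        first = fragments[0]
        second = xor_bytes(first, fragments[2])
    else:
        second = fragments[1]
        first = xor_bytes(second, fragments[2])
    return first + second


block = b"OceanStore"
fragments = encode(block)
```

The three fragments together take N/M = 1.5 times the size of the block, yet the loss of any one fragment is survivable. Plain replication needs 2 times the storage for the same single-failure tolerance.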

Page 13

Secondary Replica

Whole-block Caching to Avoid Erasure Codes on Frequently-read Objects

Push-based Update: Every Time the Primary Replica Applies an Update

Dissemination Tree:
– Application-level Multicast Tree
– Rooted at the Primary Replica
– Parent Nodes are Pre-existing Replicas that Serve the Object
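A push down the dissemination tree can be pictured as a recursive walk from the root. The `Replica` class is a hypothetical in-process stand-in; in the real system each edge is a network link.

```python
class Replica:
    """One node of an application-level multicast tree."""

    def __init__(self, name: str):
        self.name = name
        self.children = []
        self.versions = []

    def attach(self, child: "Replica") -> "Replica":
        """A new replica joins the tree under an existing one."""
        self.children.append(child)
        return child

    def push(self, update: str) -> None:
        """Record the update locally, then forward it to every child."""
        self.versions.append(update)
        for child in self.children:
            child.push(update)


primary = Replica("primary")              # root of the tree
near = primary.attach(Replica("near"))    # secondary replica
far = near.attach(Replica("far"))         # joins under a pre-existing replica

primary.push("version-1")
```

The root sends each update only to its direct children, so the fan-out load on the primary replica stays bounded as more secondary replicas join.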

Page 14

Underlying Technology

Access Control

Data Update:
– Primary Replica
– Archival Storage
– Secondary Replica

Data Read

Data Location & Routing

Page 15

Data Read

< Figure: Read Path among Application, Primary Replica (Inner Ring), Secondary Replica, and Archival Storage >

1. AGUID
2. Latest VGUID
3. Search Blocks from Secondary Replicas
4. Search Enough Fragments from Archival Storage
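The four steps above can be sketched as a lookup with a fallback. The simplifications are deliberate: each version is treated as a single block, the dictionaries stand in for remote services, and `reconstruct` is a caller-supplied placeholder for erasure decoding. All names are hypothetical.

```python
def read_object(aguid, primary, secondary_cache, archive, m, reconstruct):
    """Resolve an AGUID and return the latest version's contents."""
    vguid = primary[aguid]                 # steps 1-2: AGUID -> latest VGUID
    if vguid in secondary_cache:           # step 3: whole-block cache hit
        return secondary_cache[vguid]
    fragments = archive.get(vguid, [])     # step 4: fall back to the archive
    if len(fragments) < m:
        raise LookupError("not enough fragments to rebuild the block")
    block = reconstruct(fragments[:m])
    secondary_cache[vguid] = block         # cache the rebuilt block
    return block


decode_calls = []


def join_fragments(fragments):
    """Placeholder decoder: concatenation instead of real erasure decoding."""
    decode_calls.append(len(fragments))
    return b"".join(fragments)


primary = {"aguid-1": "vguid-7"}
cache = {}
archive = {"vguid-7": [b"he", b"llo", b"spare"]}

first = read_object("aguid-1", primary, cache, archive, 2, join_fragments)
second = read_object("aguid-1", primary, cache, archive, 2, join_fragments)
```

Only the first read pays the reconstruction cost. The second is served from the whole-block cache, which is the reason secondary replicas exist.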

Page 16

Underlying Technology

Access Control

Data Update:
– Primary Replica
– Archival Storage
– Secondary Replica

Data Read

Data Location & Routing

Page 17

Data Location & Routing (1/4)

Tapestry: Decentralized Object Location and Routing System
– Uses Globally Unique Identifiers (GUIDs) to Name Hosts and Resources
– Location Independent
– Locality Aware

Page 18

Data Location & Routing (2/4)

Routing Example

Messages are Routed to the Destination ID Digit by Digit: ***8 => **98 => *598 => 4598

< Figure: Routing Mesh of Nodes B4F8, 9098, 0325, 2BB8, 7598, 4598, 87CA, 0098, 3E98, 1598, D598, 2118; Links are Labeled with Routing Levels L1 to L4 >
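Digit-by-digit routing toward 4598 can be sketched with the node IDs from the figure. Two assumptions are loud ones: the function sees the full node list, whereas a real Tapestry node only consults its own neighbor map, and ties are broken by ID order rather than by network locality. All names are hypothetical.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits that two IDs have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def route(source: str, dest: str, nodes) -> list:
    """Return the hops from source to dest, matching more digits each hop."""
    if dest not in nodes:
        raise ValueError("destination is not a known node")
    path = [source]
    current = source
    while current != dest:
        need = shared_suffix_len(current, dest) + 1
        candidates = [n for n in nodes if shared_suffix_len(n, dest) >= need]
        # Prefer the smallest improvement to show one digit resolved per hop.
        current = min(candidates, key=lambda n: (shared_suffix_len(n, dest), n))
        path.append(current)
    return path


NODES = ["B4F8", "9098", "0325", "2BB8", "7598", "4598",
         "87CA", "0098", "3E98", "1598", "D598", "2118"]

path = route("0325", "4598", NODES)
# path == ["0325", "2118", "0098", "1598", "4598"]
```

The hops match ***8, **98, *598 and finally 4598. The route length is bounded by the number of digits in an ID, not by the number of nodes.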

Page 19

Data Location & Routing (3/4)

Location Independent & Locality Aware

< Figure: Routing Mesh (Link Levels L1 to L4) Showing a Replica and its Location Pointers >

Page 20

Data Location & Routing (4/4)

Routing Table

< Neighbor Map in Memory for Tapestry Node 0642 >

Page 21

Prototype

Prototype Software Architecture

Page 22

Experimental Results (1/2)

Update Performance

< Table. Results of Latency Microbenchmark >

< Figure. Throughput in Local Area >

Page 23

Experimental Results (2/2)

Comparison with NFS

< Figure. Andrew Benchmark, with Phases Labeled Write, Read, and Read/Write >

Page 24

Related Work

Other Peer-to-peer File Systems

PAST [Rows01] and CFS [Dabe01]
– No Write Sharing

IVY [Muth02] and Pangaea [Sait02]
– Provide Both Read and Write Sharing, but
– No Single Point of Consistency

Page 25

Conclusion

Operational OceanStore Prototype: Universal Accessibility, Fault Tolerance, Security, and Information Sharing

Future Research:
– Improving Performance
• Efficient Threshold Schemes and Archival Data Generation
– Self-Maintenance
– Stability and Fault Tolerance
– Supporting More Applications

Page 26

Discussion

System Design Choice:
– Security vs. Fast Response
– Simple vs. Complicated Design

Storage Service Provider (SSP):
– Independent SSP vs. Confederation of Companies such as IBM, AT&T

Efficient Storage Usage

Page 27

Primary Replica (Ext.)

Modification of Byzantine Agreement Protocol

Public Key Cryptography
– Symmetric-key Message Authentication Codes (MACs) for Inner Ring
– Public-key Cryptography for All Other Machines

Proactive Threshold Signatures
– Flexibility in Choosing the Membership of Inner Ring
– Single Public Key with l Private Key Shares
– Any k Correctly Generated Signature Shares among l Suffice to Produce a Full Signature
– Independent Sets of Key Shares can be Used to Control Membership

Responsible Party
– Chooses the Hosts that Make Up Inner Rings