HYDRAstor: a Scalable Secondary Storage
C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, M. Welnicki
C. Ungureanu
7th USENIX Conference on File and Storage Technologies (FAST '09)
February 26th, 2009
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC 2
Scalable secondary storage

Characteristics                       Requirements
Huge amount of data                   - Scalability (dynamic)
                                      - Low cost per TB
Small backup windows                  - Very high write performance
Duplication between backup streams    - Global deduplication
Reliable, on-line retrieval           - Failure tolerance
                                      - High restore performance
Varying value of data                 - Adjust resilience overhead
                                      - Data deletion

Challenges
● High-performance, decentralized global deduplication
  ... in a dynamic, distributed system
  ... with deletion and failures
● Combination introduces complexity
● Tension between:
  ● Deduplication and dynamic scalability
  ● Deduplication and on-demand deletion
  ● Failure tolerance and deletion

HYDRAstor
● Satisfies scalable secondary storage requirements
● Started as a research project at NEC Laboratories America, in Princeton, NJ
● Successfully commercialized
  ● Today: real-world, commercial system
  ● Sold by NEC in the US and Japan
● Development of back-end continues at 9LivesData, LLC in Warsaw, Poland
  ● Spinoff from NEC Laboratories

HYDRAstor functionality
● Content addressable storage (CAS)
  ● Vast data repository
  ● Storing and extracting streams of blocks
● Single system image built of independent nodes
● Support for standard access methods
  ● Filesystem, VTL
● Dynamic capacity sharing
● Self-recovery from failures
● On-demand deletion

Programming Model
● Repository of blocks
  ● Content-addressed
  ● Immutable
  ● Variable-sized
● Exposed pointers to other blocks
● Trees of blocks
  ● DAGs due to deduplication
  ● No cycles possible
● Deletion of whole trees
[Figure: block trees under Root1 (hash=010..1) and Root2 (hash=110..0) both point to a block with hash=011..0; after the Root1 tree is deleted, Root2 and the block with hash=011..0 remain]
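
The programming model can be illustrated with a minimal in-memory sketch. It assumes SHA-256 as the content hash and a hypothetical `BlockStore` class with `write`, `read` and `delete_root` methods; none of these names come from HYDRAstor's actual API, and deletion is modeled here as a simple mark-and-sweep from the remaining roots rather than the system's distributed algorithm.

```python
import hashlib


class BlockStore:
    """Toy content-addressed store of immutable blocks (illustrative only)."""

    def __init__(self):
        self.blocks = {}    # address -> (data, tuple of child addresses)
        self.roots = set()  # addresses of retained tree roots

    def write(self, data, pointers=(), root=False):
        # A block may only point at blocks that already exist. Because the
        # address is derived from the content and the pointers, a cycle
        # cannot be constructed.
        for p in pointers:
            if p not in self.blocks:
                raise KeyError("pointer to unknown block")
        h = hashlib.sha256(data)
        for p in pointers:
            h.update(bytes.fromhex(p))
        addr = h.hexdigest()
        # Writing identical content again yields the same address (dedup).
        self.blocks.setdefault(addr, (data, tuple(pointers)))
        if root:
            self.roots.add(addr)
        return addr

    def read(self, addr):
        return self.blocks[addr]

    def delete_root(self, addr):
        # Delete a whole tree: drop the root, then sweep every block that
        # is no longer reachable from any remaining root.
        self.roots.discard(addr)
        live, stack = set(), list(self.roots)
        while stack:
            a = stack.pop()
            if a not in live:
                live.add(a)
                stack.extend(self.blocks[a][1])
        self.blocks = {a: b for a, b in self.blocks.items() if a in live}


store = BlockStore()
shared = store.write(b"shared data")
only1 = store.write(b"only in tree 1")
root1 = store.write(b"root 1", pointers=(shared, only1), root=True)
root2 = store.write(b"root 2", pointers=(shared,), root=True)
assert store.write(b"shared data") == shared  # duplicate write, same address
store.delete_root(root1)
```

After `delete_root(root1)`, the shared block survives because it is still reachable from `root2`, while the block referenced only by the first tree is reclaimed.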

Architecture overview
● Standard server-grade hardware running Linux
● Scalability on data-center level
[Figure: Front-end: Access Nodes (NFS / CIFS); Back-end (CAS Layer): Storage Nodes; Internal Network]

Data organization: selected requirements

Requirements on scalable storage    Required internal data services
Failure tolerance                   ● Identify data resilience reduction
                                    ● Fast data rebuilding
High performance                    ● Preserve locality of data streams
                                    ● Prefetching
Dynamic scalability                 ● Decentralized data management
                                    ● Load balancing
                                    ● Fast data transfer to new location
Deduplication                       ● Location of potential duplicates
                                    ● Availability & resiliency verification
On-demand deletion                  ● Failure-tolerant, distributed deletion

Failure tolerance: erasure coding
● Block erasure-coded into N fragments
● Storage overhead tunable
● Example: N=8, m=5: any 3 fragments can be lost
[Figure: Original block, Encode, Original Fragments and Redundant Fragments, Decode]
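
The arithmetic behind the example can be checked with a short sketch. It assumes that m is the number of original fragments, that each fragment holds 1/m of the block, and that any m of the N fragments suffice to decode, which is the usual meaning of these parameters; the function names are made up for illustration.

```python
from fractions import Fraction


def tolerated_losses(n, m):
    """Fragments that may be lost when any m of the n fragments suffice."""
    return n - m


def storage_overhead(n, m):
    """Extra space relative to the original block size, as an exact fraction."""
    return Fraction(n - m, m)


n, m = 8, 5  # the example from the slide
print(tolerated_losses(n, m))         # 3
print(float(storage_overhead(n, m)))  # 0.6
```

Under these assumptions, N=8 and m=5 cost 60% extra space for tolerance of 3 lost fragments; choosing a different N and m trades space against resilience.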

Scalability with DHT: data placement
● Block location: DHT with prefix routing
● Block mapped to hash prefix
● Prefix components
  ● Hosted on SNs
  ● N components per prefix
  ● Store fragments
  ● Distributed consensus
  ● Load balancing
[Figure: prefix tree (empty prefix; 0, 1; 00, 01, 10, 11); a block with hash=011..0 maps to prefix 01; with N=4, each prefix has components 0 to 3 hosted on Nodes 1 to 6]
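
Prefix routing and component placement can be sketched as follows. This is only an illustration: hashes are written as bit strings, `route` and `place_components` are hypothetical names, and the round-robin placement stands in for the distributed consensus and load balancing that the system actually uses.

```python
def route(hash_bits, prefixes):
    """Return the one prefix in `prefixes` that the block hash starts with."""
    matches = [p for p in prefixes if hash_bits.startswith(p)]
    if len(matches) != 1:
        raise ValueError("prefixes must partition the hash space")
    return matches[0]


def place_components(prefixes, nodes, n):
    """Naive round-robin placement of n components per prefix on the nodes."""
    placement, i = {}, 0
    for prefix in sorted(prefixes):
        for component in range(n):
            placement[(prefix, component)] = nodes[i % len(nodes)]
            i += 1
    return placement


prefixes = ["00", "01", "10", "11"]
nodes = [f"Node {k}" for k in range(1, 7)]
placement = place_components(prefixes, nodes, n=4)

target = route("0110", prefixes)  # a block whose hash starts with 011..
holders = [placement[(target, c)] for c in range(4)]
```

With 4 components per prefix and 6 nodes, the components of one prefix land on distinct nodes, so a single node failure costs each block at most one fragment.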

Data organization: synchrun chains
● Data stream split to blocks
● Hashes of blocks computed
● Routing through DHT
● Erasure-coded fragments stored by components
● Grouped into synchruns
● Containers stored on disks
  ● Fragment metadata separately from data
● Ordered synchrun chains
  ● Preserve order & locality
  ● Manageable
[Figure: stream blocks A to G with hashes 010…, 101…, 110…, 011…, 000…, 011…, 100…; blocks A, D and F are routed to Prefix 01, pass through Compression and Erasure Coding, and their fragments are stored by Components 0 to 3 in Synchrun 1, Synchrun 2 and Synchrun 3, which are kept in Containers]
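
The grouping steps can be sketched in a few lines. The sketch assumes SHA-256 hashes, a fixed prefix length of 2 bits, and a synchrun limit counted in blocks; all three are illustrative choices. Compression and erasure coding are left out, so each entry stands for the fragment that one component would store for that block.

```python
import hashlib


def hash_bits(block, nbits=16):
    """First nbits of the block's SHA-256 digest as a bit string."""
    digest = hashlib.sha256(block).digest()
    return "".join(f"{byte:08b}" for byte in digest)[:nbits]


def build_chains(blocks, prefix_len=2, synchrun_size=2):
    """Group blocks by hash prefix, in stream order, into synchrun chains."""
    chains = {}
    for block in blocks:
        prefix = hash_bits(block)[:prefix_len]
        chain = chains.setdefault(prefix, [[]])
        if len(chain[-1]) == synchrun_size:
            chain.append([])  # start the next synchrun in the chain
        chain[-1].append(block)
    return chains


stream = [bytes([i]) * 4 for i in range(7)]  # stand-ins for blocks A..G
chains = build_chains(stream)
```

Each chain keeps its blocks in the order in which they appeared in the stream, which is what lets later reads and prefetching benefit from locality.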

Synchrun chains in a dynamic system
[Figure: Component 01:1]

System growth: split
[Figure: Component 01:1 split into Component 010:1 and Component 011:1]

Concatenation
[Figure: Component 01:1 and Component 010:1]

Marking blocks to reclaim
[Figure: Component 01:1 and Component 010:1]

Space reclamation & Concatenation
[Figure: Component 01:1 and Component 010:1]
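
One way to read the split and concatenation steps is sketched below. The slides only name the operations, so the data layout (a chain as a list of synchruns, a synchrun as a list of `(hash_bits, fragment)` pairs) and the size limit used for concatenation are assumptions made for illustration.

```python
def split_chain(chain, prefix):
    """Split the chain of `prefix` into chains for prefix+'0' and prefix+'1'.

    Every hash in the chain must start with `prefix`. Order within each
    resulting chain follows the order in the original chain.
    """
    children = {prefix + "0": [], prefix + "1": []}
    for synchrun in chain:
        parts = {key: [] for key in children}
        for bits, fragment in synchrun:
            parts[bits[: len(prefix) + 1]].append((bits, fragment))
        for key, part in parts.items():
            if part:
                children[key].append(part)
    return children


def concatenate(chain, max_size):
    """Merge neighbouring synchruns while the result fits in max_size."""
    merged = []
    for synchrun in chain:
        if merged and len(merged[-1]) + len(synchrun) <= max_size:
            merged[-1] = merged[-1] + synchrun
        else:
            merged.append(list(synchrun))
    return merged


chain_01 = [[("0100", "a"), ("0111", "b")], [("0110", "c")]]
children = split_chain(chain_01, "01")
compact_011 = concatenate(children["011"], max_size=4)
```

Splitting leaves shorter synchruns behind, and concatenation merges neighbours again so that chains stay manageable without reordering their contents.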

Data services: identification of data resiliency level
[Figure: Components 01:0, 01:1, 01:2 and 01:3; labels "Missing fragments" and "Chain scanning"]
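
Chain scanning can be pictured as counting, for every block, how many peer components still hold a fragment. The sketch below assumes N=4 components per prefix as on the data placement slides and m=3 fragments needed to decode; the value of m and the function name `scan` are assumptions, and each chain is reduced to the set of block hashes it holds.

```python
def scan(peer_chains, n, m):
    """Report missing fragments and remaining tolerance for every block.

    peer_chains maps a component index to the set of block hashes for
    which that component currently holds a fragment.
    """
    counts = {}
    for held in peer_chains.values():
        for block_hash in held:
            counts[block_hash] = counts.get(block_hash, 0) + 1
    return {
        h: {"missing": n - c, "can_still_lose": c - m}
        for h, c in counts.items()
    }


peer_chains = {
    0: {"A", "D", "F"},
    1: {"A", "D", "F"},
    2: {"A", "F"},       # this component has lost its fragment of D
    3: {"A", "D", "F"},
}
levels = scan(peer_chains, n=4, m=3)
```

Blocks with a non-zero `missing` count are the ones whose resilience has dropped and which need rebuilding.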

Data services: reconstruction
● Sequential read/write of entire Containers
● Erasure decoding and re-encoding
[Figure: Components 01:0, 01:1, 01:2 and 01:3]

Data services: fast data transfer
[Figure: Components 01:0, 01:1, 01:2 and 01:3; labels "Old component 01:3", "Location of new node (DHT)" and "Data transfer"]

Data services for deduplication
● Global: duplicates detected in entire system
  ● DHT routing based on content
● Inline deduplication: has to be high-performance
  ● Prefetching Containers for streams of duplicates
  ● Block hashes stored separately
[Figure sequence: a block with hash=011.. and Components 01:0, 01:1, 01:2 and 01:3; labels in order: "Choose complete chain", "Query", "Local candidate found", "Candidate verification", "Successful dedup"]
● Completeness: "definitely not a duplicate"
● Deletion interaction: wasn't the block scheduled for deletion?
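
The lookup sequence can be sketched as one function. Everything specific here is an assumption made for illustration: the name `try_dedup`, the representation of each component's index as a set of hashes, and the verification rule that at least m peers must hold a fragment.

```python
def try_dedup(block_hash, indexes, complete, m, scheduled_for_deletion=()):
    """Return True if a write can be resolved as a duplicate.

    indexes maps a component index to the set of block hashes it holds;
    complete is the set of components whose chains are known to be complete.
    """
    # Choose a component with a complete chain, so that a miss is definitive.
    chosen = next((c for c in sorted(indexes) if c in complete), None)
    if chosen is None:
        return False  # no complete chain available: store as a new block
    if block_hash not in indexes[chosen]:
        return False  # definitely not a duplicate
    if block_hash in scheduled_for_deletion:
        return False  # the candidate is about to be reclaimed
    # Candidate verification: enough fragments must still be available.
    available = sum(block_hash in index for index in indexes.values())
    return available >= m


indexes = {0: {"011a"}, 1: {"011a"}, 2: set(), 3: {"011a"}}
complete = {0, 1, 3}
ok = try_dedup("011a", indexes, complete, m=3)
```

Querying a complete chain is what makes a negative answer trustworthy; a positive answer is only a candidate until availability and the deletion state have been checked.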

On-demand data deletion
● Distributed garbage collection
  ● Per-block reference counter stored per-fragment
● Failure-tolerant
  ● Block reference counter calculated independently on peer Container chains
● Interference with duplicate elimination:
  ● read-only phase for block tree traversal
  ● space reclamation in background
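
The counter computation can be sketched on a single machine as a traversal of the block DAG from the roots that are still retained. This ignores distribution and failures entirely: in the system each peer chain computes the counters independently, while here `compute_counters` and `reclaimable` are hypothetical helpers over one in-memory dictionary.

```python
def compute_counters(blocks, live_roots):
    """Count references to each block from the trees under live_roots.

    blocks maps a block address to the list of addresses it points to.
    """
    counters = {addr: 0 for addr in blocks}
    for root in live_roots:
        counters[root] += 1  # a retained root keeps itself alive
    visited, stack = set(), list(live_roots)
    while stack:
        addr = stack.pop()
        if addr in visited:
            continue
        visited.add(addr)
        for child in blocks[addr]:
            counters[child] += 1
            stack.append(child)
    return counters


def reclaimable(counters):
    """Blocks whose counter is zero can be reclaimed in the background."""
    return sorted(addr for addr, count in counters.items() if count == 0)


blocks = {
    "root1": ["shared", "only1"],
    "root2": ["shared"],
    "shared": [],
    "only1": [],
}
counters = compute_counters(blocks, live_roots=["root2"])
garbage = reclaimable(counters)
```

Because the traversal only reads the block trees, it matches the read-only phase on the slide; actual space reclamation can then proceed later without blocking writes.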

Writes during node failure
[Figure: labels "Writing" and "Reconstruction"]

Write scaling: nodes added while writing
[Figure]

Questions?