Tradeoffs in Scalable Data Routing for Deduplication Clusters (FAST '11)
Wei Dong, from Princeton University; Fred Douglis, Kai Li, Hugo Patterson, Sazzala Reddy, Philip Shilane, from EMC
Presented 2011. 04. 21 (Thu) by HoSeok Seo, System Software Lab., Kwangwoon Univ.
Introduction
A deduplication cluster storage system has a primary node with a hard disk.
Cluster storage systems are a well-known technique to increase capacity, but they have two problems:
- Less deduplication than a single-node system
- Performance does not scale linearly
Introduction: Goal
Scalable throughput
- Use super-chunks for data transfer
- Maximize the parallelism of disk I/O by routing data evenly across nodes
- Reduce the disk I/O bottleneck by exploiting cache locality
Scalable capacity
- Use a cluster storage system
- Route repeated data to the same node
- Maintain balanced utilization across nodes
High deduplication, comparable to a single-node system
- Use super-chunks that consist of consecutive chunks
Introduction: Chunk
Definition
- A segment of a data stream
Merits
- A small chunk size gives high deduplication
- A large chunk size gives high throughput
Introduction: Super-chunk
Definition
- Consists of consecutive chunks
Merits
- Maintains high cache locality
- Reduces system overhead
- Achieves a deduplication rate similar to chunk-level routing
Demerits
- Risk of creating duplicates (the same chunk can end up stored on more than one node)
- Can result in imbalanced utilization across nodes
Issues of super-chunks
- How they are formed
- How they are assigned to nodes
- How they are routed to nodes so that the load stays balanced
Dataflow of Deduplication Cluster
1. Divide data streams into chunks
2. Create a fingerprint for each chunk
3. Group chunks into a super-chunk
4. Select a representative chunk for the super-chunk
5. Route the super-chunk to one of the nodes
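The five steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the fixed 8 KB chunk size, the function names, and the choice of the minimum fingerprint as the representative are assumptions made for the example, and the node is picked directly with a modulo rather than through bins.

```python
import hashlib

CHUNK_SIZE = 8 * 1024              # assumed fixed chunk size, for illustration only
SUPER_CHUNK_TARGET = 1024 * 1024   # roughly 1 MB per super-chunk


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Step 1: divide a data stream into chunks (fixed-size here for simplicity)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fingerprint(chunk):
    """Step 2: SHA-1 fingerprint of a chunk."""
    return hashlib.sha1(chunk).digest()


def group_into_super_chunks(chunks, target=SUPER_CHUNK_TARGET):
    """Step 3: group consecutive chunks until the target size is reached."""
    super_chunks, current, size = [], [], 0
    for chunk in chunks:
        current.append(chunk)
        size += len(chunk)
        if size >= target:
            super_chunks.append(current)
            current, size = [], 0
    if current:  # keep the final, possibly smaller, super-chunk
        super_chunks.append(current)
    return super_chunks


def route(super_chunk, num_nodes):
    """Steps 4 and 5: pick a representative fingerprint, then pick a node."""
    representative = min(fingerprint(c) for c in super_chunk)  # assumed policy
    return int.from_bytes(representative, "big") % num_nodes
```

Because the node depends only on the content of the super-chunk, two super-chunks with identical content always land on the same node.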
Deduplication flow at a node
Input: a chunk. The dedup logic checks whether it is a duplicate:
1. Is the fingerprint in the cache? If yes, deduplication is done.
2. If no, is the fingerprint in the index? If yes, load the fingerprints that were written at the same time into the cache; deduplication is done.
3. If no, write the fingerprint and chunk to a container.
4. Is the container full? If yes, write the container to disk.
Note: colored boxes in the original figure mark the steps that require disk access.
What is a Container?
Definition
- A large, fixed-size unit on disk
- Consists of two parts: fingerprints and chunk data
Usage
- Stores the fingerprints and chunks of non-duplicate data on disk
Layout: [ Fingerprints | Chunk Data ]
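The node-side flow and the container layout described above can be sketched as below. This is a simplified in-memory model under stated assumptions: `Container`, `DedupNode`, and `index_lookups` are hypothetical names, the cache is unbounded, newly written fingerprints are assumed to be cached immediately, and the "disk" is just a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Fixed-capacity unit with two parts: fingerprints and chunk data."""
    capacity: int
    fingerprints: list = field(default_factory=list)
    chunks: list = field(default_factory=list)

    def add(self, fp, chunk):
        self.fingerprints.append(fp)
        self.chunks.append(chunk)

    def is_full(self):
        return sum(len(c) for c in self.chunks) >= self.capacity


class DedupNode:
    def __init__(self, container_capacity=4 * 1024 * 1024):
        self.capacity = container_capacity
        self.cache = set()       # in-memory fingerprints (unbounded here)
        self.index = {}          # stands in for the on-disk index: fp -> container id
        self.disk = []           # sealed containers
        self.open = Container(container_capacity)
        self.index_lookups = 0   # counts simulated on-disk index lookups

    def store(self, fp, chunk):
        """Return True if the chunk was a duplicate, False if it was written."""
        if fp in self.cache:                  # fingerprint in cache?
            return True
        self.index_lookups += 1               # fingerprint in index? (disk access)
        container_id = self.index.get(fp)
        if container_id is not None:
            # Load the fingerprints written at the same time into the cache.
            self.cache.update(self.disk[container_id].fingerprints)
            return True
        self.open.add(fp, chunk)              # new data: write to the open container
        self.cache.add(fp)                    # assumption: cache new fingerprints
        if self.open.is_full():               # container full? write it to disk
            container_id = len(self.disk)
            self.disk.append(self.open)
            for stored_fp in self.open.fingerprints:
                self.index[stored_fp] = container_id
            self.open = Container(self.capacity)
        return False
```

Loading a whole container's fingerprints on an index hit is what lets later chunks from the same stream be answered from the cache instead of the disk.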
Issue 1: How are super-chunks formed?
Determine an average super-chunk size
- Experimented with a variety of sizes, from 8KB to 4MB
- Generally, 1MB is a good choice
Issue 2: How are super-chunks assigned to nodes?
- Use a bin manager running on the master node
- The bin manager rebalances the nodes by bin migration (for stateless routing)
1. Assign a bin number to a super-chunk
2. Find the node by bin number, using the bin manager's map from bins (bin 1, bin 2, bin 3, ..., bin M) to nodes (node 1, node 2, node 3, ..., node N), where M > N
3. Route the super-chunk to that node
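A bin-to-node map with migration might look like the sketch below. The slides do not describe the migration policy, so everything here is an assumption for illustration: the `BinManager` class, its method names, the round-robin initial placement, and the rule of moving one bin from the fullest node to the emptiest node only when that strictly lowers the fullest node's load.

```python
class BinManager:
    """Maps M bins onto N nodes (M > N) and rebalances by moving whole bins."""

    def __init__(self, num_bins, num_nodes):
        if num_bins <= num_nodes:
            raise ValueError("expected more bins than nodes (M > N)")
        self.num_nodes = num_nodes
        # Assumed initial placement: round-robin.
        self.bin_to_node = [b % num_nodes for b in range(num_bins)]
        self.bin_bytes = [0] * num_bins

    def node_for(self, bin_id):
        return self.bin_to_node[bin_id]

    def record(self, bin_id, nbytes):
        """Account for data stored under a bin."""
        self.bin_bytes[bin_id] += nbytes

    def node_loads(self):
        loads = [0] * self.num_nodes
        for bin_id, node in enumerate(self.bin_to_node):
            loads[node] += self.bin_bytes[bin_id]
        return loads

    def migrate_once(self):
        """Move one bin from the fullest to the emptiest node; return True if moved."""
        loads = self.node_loads()
        src = loads.index(max(loads))
        dst = loads.index(min(loads))
        gap = loads[src] - loads[dst]
        # Only bins smaller than the gap leave both nodes below the old maximum.
        candidates = [b for b, node in enumerate(self.bin_to_node)
                      if node == src and 0 < self.bin_bytes[b] < gap]
        if not candidates:
            return False
        # Prefer the bin whose size is closest to half the gap.
        best = min(candidates, key=lambda b: abs(gap - 2 * self.bin_bytes[b]))
        self.bin_to_node[best] = dst
        return True
```

Routing itself stays stateless: a super-chunk always maps to the same bin, and only the bin's owner changes when the manager migrates it.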
Issue 3: How are super-chunks routed to nodes for balance?
Two data routing techniques are used to overcome the demerits of super-chunks:
Stateless technique with bin migration
- Lightweight and well suited to most balanced workloads
Stateful technique
- Improves deduplication while avoiding data skew
Stateless Technique: Basics
1. Create a fingerprint for each chunk
2. Select a representative fingerprint from among the fingerprints
3. Allocate a bin to the super-chunk (e.g., representative fingerprint mod number of bins)
How to create a fingerprint
- Hash the whole chunk (a.k.a. hash(*))
- Hash the first N bytes of the chunk (a.k.a. hash(N))
※ Uses the SHA-1 hash function
How to select the representative fingerprint
- First
- Maximum
- Minimum
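The stateless choices above (hash(*) versus hash(N), and first, maximum, or minimum) can be sketched as follows. The function names and the default of N = 64 with the minimum policy are assumptions chosen for the example; only SHA-1 and the modulo step come from the slide.

```python
import hashlib


def feature(chunk, n=None):
    """hash(*) when n is None, otherwise hash(N): SHA-1 of the first n bytes."""
    data = chunk if n is None else chunk[:n]
    return hashlib.sha1(data).digest()


def representative(features, policy="min"):
    """Pick the representative fingerprint: first, maximum, or minimum."""
    if policy == "first":
        return features[0]
    if policy == "max":
        return max(features)
    if policy == "min":
        return min(features)
    raise ValueError(f"unknown policy: {policy}")


def stateless_bin(super_chunk, num_bins, n=64, policy="min"):
    """Map a super-chunk (a list of byte strings) to a bin number."""
    features = [feature(chunk, n) for chunk in super_chunk]
    rep = representative(features, policy)
    return int.from_bytes(rep, "big") % num_bins
```

With the maximum or minimum policy the result does not depend on the order of chunks inside the super-chunk, whereas the first policy depends only on the leading chunk.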
Stateful Technique (cont.)
Merits compared to stateless
- Higher deduplication, close to a single-node backup system
- Balanced load
- Bin migration is no longer needed
Demerits
- More operations
- Higher cost in memory or communication
Stateful Technique: Process
1. Calculate the "weighted vote" of each node:
   weighted vote = number of matches * (1 / overloaded value)
2. Select the node that has the highest weighted vote
- number of matches: the number of duplicate chunks found at each node
- overloaded value: the node's storage utilization relative to the average storage utilization (1.0 means the node is at the average)
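A small sketch of the weighted vote, reading the formula as the number of matches divided by the node's utilization relative to the average. That reading, the function names, the use of exact fingerprint sets per node, and the fallback to the least-used node when nothing matches are all assumptions made for this example.

```python
def weighted_votes(fingerprints, node_fingerprints, node_usage):
    """Return one weighted vote per node.

    fingerprints: fingerprints of the chunks in the super-chunk
    node_fingerprints: one set of stored fingerprints per node
    node_usage: bytes stored per node
    """
    average = sum(node_usage) / len(node_usage)
    votes = []
    for stored, used in zip(node_fingerprints, node_usage):
        matches = sum(1 for fp in fingerprints if fp in stored)
        overloaded = used / average if average > 0 else 1.0
        # An empty node stores nothing, so it is given a vote of zero.
        votes.append(matches / overloaded if overloaded > 0 else 0.0)
    return votes


def choose_node(fingerprints, node_fingerprints, node_usage):
    """Pick the node with the highest weighted vote."""
    votes = weighted_votes(fingerprints, node_fingerprints, node_usage)
    best = max(range(len(votes)), key=lambda i: votes[i])
    if votes[best] > 0:
        return best
    # Assumed fallback when no node holds any match: the least-used node.
    return min(range(len(node_usage)), key=lambda i: node_usage[i])
```

In the test case below, a node with fewer matches still wins because it is well under the average utilization, which is how the weighting trades deduplication against skew.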
Datasets
Evaluation Metrics
Capacity
- Total Deduplication (TD): original dataset size / deduplicated size
- Data Skew: max node utilization / average node utilization
- Effective Deduplication (ED): TD / Data Skew
- Normalized ED: shows how close the deduplication is to that of a single-node system
Throughput
- Number of on-disk fingerprint index lookups
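The capacity metrics above reduce to a few ratios, worked through in the sketch below. The function names are hypothetical, the deduplicated size is taken to be the sum of the per-node utilizations, and normalizing ED by the TD of a single-node system is an assumption about how "Normalized ED" is computed.

```python
def total_deduplication(original_size, deduplicated_size):
    """TD = original dataset size / deduplicated size."""
    return original_size / deduplicated_size


def data_skew(node_utilization):
    """Data Skew = max node utilization / average node utilization."""
    average = sum(node_utilization) / len(node_utilization)
    return max(node_utilization) / average


def effective_deduplication(original_size, node_utilization):
    """ED = TD / Data Skew, with the deduplicated size summed over nodes."""
    td = total_deduplication(original_size, sum(node_utilization))
    return td / data_skew(node_utilization)


def normalized_ed(ed, single_node_td):
    """Assumed definition: ED relative to the TD of a single-node system."""
    return ed / single_node_td
```

For example, a 1000-unit dataset stored as 100, 100, and 200 units on three nodes gives TD = 2.5 and Data Skew = 1.5, so ED is about 1.67. Algebraically, ED equals the original size divided by (number of nodes × the fullest node's utilization), so the fullest node determines the effective capacity.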
Experimental Results: Overall Effectiveness
Using trace-driven simulation
[Figures not preserved in the transcript; one series is labeled "with mig" (with bin migration).]
Experimental Results: Feature Selection
HYDRAstor
- Routes chunks to nodes according to content
- Good performance
- Worse deduplication rate due to 64KB chunks
Experimental Results: Cache Locality and Throughput
- Logical Skew: max(size before dedupe) / avg(size before dedupe)
- Max lookup: maximum normalized total number of fingerprint index lookups
- ED: Effective Deduplication
[Figures not preserved in the transcript; both panels are labeled "(32 nodes)".]
Experimental Results: Effect of Bin Migration
The ED drops between migration points due to increasing skew.
Summary
                 Stateless          Stateless          Stateful
                 (small clusters)   (large clusters)   (all clusters)
Deduplication    Good               Bad                Good
Data Skew        Good               Bad                Good
Overhead         Good               Good               Bad
Conclusion
1. Using super-chunks for data routing is superior to using individual chunks: it achieves scalable throughput while maximizing deduplication.
2. The stateless routing method (hash(64)) with bin migration is simple and efficient.
3. The effective deduplication of a stateless-routed cluster may drop quickly as the number of nodes increases. To solve this problem, the authors propose a stateful data routing approach. Simulations show good performance with up to 64 nodes in a cluster.