Fast SIMD-Based Chunking Algorithm
Yehonatan Dude, Michael Hirsch, Yair Toaff
PSC 2019, 2019-Aug-27, Johnny Dude
Outline
1. Background
2. Chunking Problem
3. Traditional Solutions
4. Our Solution
5. Future Work
Background - Deduplication
Deduplication is a technique for eliminating duplicate copies of repeating data.
Deduplication process in a nutshell
1. Divide into chunks
2. Calculate the chunks' hashes
3. Store chunks uniquely
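The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: fixed-size chunks of 4 bytes stand in for a real chunker, SHA1 is used because it appears later in the deck, and the names `deduplicate`, `restore`, `store`, and `recipes` are hypothetical.

```python
import hashlib

def deduplicate(objects, chunk_size=4):
    """Store each distinct chunk once and describe each object by its chunk hashes."""
    store = {}    # hash -> chunk bytes (each distinct chunk kept once)
    recipes = {}  # object name -> ordered list of chunk hashes
    for name, data in objects.items():
        recipe = []
        for i in range(0, len(data), chunk_size):     # 1. divide into chunks
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha1(chunk).hexdigest()  # 2. calculate the chunk's hash
            store.setdefault(digest, chunk)           # 3. store chunks uniquely
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes

def restore(store, recipe):
    """Rebuild an object from its list of chunk hashes."""
    return b"".join(store[digest] for digest in recipe)

# Two toy objects that share the chunks BBBB and CCCC (made-up data).
objects = {"object1": b"AAAABBBBCCCC", "object2": b"DDDDBBBBCCCCDDDD"}
store, recipes = deduplicate(objects)
```

The two objects reference seven chunks in total, while the store keeps only the distinct ones.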
[Figure: Object 1 and Object 2 are divided into chunks drawn from A, B, C, D; the hashes hA, hB, hC, hD are calculated; each object is represented by its sequence of chunk hashes, and the chunks A, B, C, D are each stored once.]
Background - Chunking Methods
How to chunk the input data
1. Simple - fixed size.
2. Content aware - files, objects, applications.
3. Content sensitive - rolling hash.
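To illustrate why the simple fixed-size method is fragile, the sketch below cuts two versions of the sample passage into 32-byte chunks and checks which chunks they still share. The 32-byte size and the helper name `fixed_chunks` are assumptions made for this example only.

```python
def fixed_chunks(data: bytes, size: int = 32):
    """Cut data into consecutive chunks of `size` bytes (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

# The same passage twice, differing in a single word.
TEMPLATE = (
    "The Opera ghost really existed. He was not, as was long believed, "
    "a creature of the imagination of the artists, the superstition of "
    "the managers, or a product of the {} and impressionable brains of "
    "the young ladies of the ballet, their mothers, the box-keepers, "
    "the cloak-room attendants or the concierge."
)
old = fixed_chunks(TEMPLATE.format("joy").encode())
new = fixed_chunks(TEMPLATE.format("absurd").encode())

# Chunks before the edit are identical; after it, every boundary is shifted
# by the length difference, so the remaining chunks no longer match.
shared = set(old) & set(new)
print(len(shared), "of", len(old), "chunks are shared")
```

Only the chunks that end before the changed word survive, which is the motivation for the content-sensitive method.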
The Opera ghost really existed._ He was not, as was long believed
, a creature of the imagination_ of the artists, the superstition
of the managers, or a product o f the joy and impressionable bra
ins of the young ladies of the b allet, their mothers, the box-ke
epers, the cloak-room attendants or the concierge. Yes, he exist
ed in flesh and blood, although_ he assumed the complete appearan
ce of a real phantom; that is to say, of a spectral shade.
The Opera ghost really existed._ He was not, as was long believed
, a creature of the imagination_ of the artists, the superstition
of the managers, or a product o f the absurd and impressionable
brains of the young ladies of th e ballet, their mothers, the box
-keepers, the cloak-room attenda nts or the concierge. Yes, he ex
isted in flesh and blood, althou gh he assumed the complete appea
rance of a real phantom; that is to say, of a spectral shade.
The Opera ghost really existed.
He was not, as was long believed, a creature of the imagination of the
artists, the superstition of the managers, or a product of the joy and
impressionable brains of the young ladies of the ballet, their
mothers, the box-keepers, the cloak-room attendants or the concierge.
Yes, he existed in flesh and blood, although he assumed the complete
appearance of a real phantom; that is to say, of a spectral shade.
The Opera ghost really existed.
He was not, as was long believed, a creature of the imagination of the
artists, the superstition of the managers, or a product of the absurd
and impressionable brains of the young ladies of the ballet, their
mothers, the box-keepers, the cloak-room attendants or the concierge.
Yes, he existed in flesh and blood, although he assumed the complete
appearance of a real phantom; that is to say, of a spectral shade.
The Opera ghost really existed. He was not, as was long believed
, a creature of the imagination of the artists, the superstition
of the managers, or a product of the joy and impressionable bra
ins of the young ladies of the ballet, their mothers, the box-ke
epers, the cloak-room attendants or the concierge. Yes, he exist
ed in flesh and blood, although he assumed the complete appearan
ce of a real phantom; that is to say, of a spectral shade.
The Opera ghost really existed. He was not, as was long believed
, a creature of the imagination of the artists, the superstition
of the managers, or a product of the absurd and impressionable
brains of the young ladies of the ballet, their mothers, the box
-keepers, the cloak-room attendants or the concierge. Yes, he ex
isted in flesh and blood, although he assumed the complete appea
rance of a real phantom; that is to say, of a spectral shade.
Background - Deduplication Performance
In 2017 we worked on a deduplication engine, and we tried to improve its performance.
[Figure: bar chart of seconds per GB spent in LZ4, Karp-Rabin, SHA1, and Other; axis ticks at 1 and 2.]
Chunking Problem
Given a stream of bytes, divide it into chunks for deduplication.
1. Output identical chunks for identical data.
2. Good chunk size distribution.
3. Good performance.
4. Works for any input (photo, DB, text, random, etc.).
Traditional Solutions
Karp-Rabin
Cyclic Polynomial
H_k = Hash(X_{k-63}, X_{k-62}, X_{k-61}, ..., X_{k-2}, X_{k-1}, X_k)
If Criteria(H_k) holds, then mark a boundary after X_k.
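A minimal sketch of this rule, with the hash recomputed from scratch for every window. The polynomial hash, its base, and the criteria (top six bits of H_k equal to zero) are placeholders chosen for the example; the slides do not specify them.

```python
WINDOW = 64        # the window spans X_{k-63} .. X_k
BASE = 257         # hypothetical hash multiplier
MOD = 1 << 32      # hypothetical 32-bit hash
CRITERIA_BITS = 6  # hypothetical: roughly one boundary per 64 positions

def window_hash(window: bytes) -> int:
    """H_k over one full window (a simple polynomial hash, assumed)."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h

def criteria(h: int) -> bool:
    """Holds when the top CRITERIA_BITS bits of the hash are zero."""
    return (h >> (32 - CRITERIA_BITS)) == 0

def boundaries(data: bytes):
    """Return the offsets at which a boundary is marked (just after X_k)."""
    cuts = []
    for k in range(WINDOW - 1, len(data)):
        if criteria(window_hash(data[k - WINDOW + 1:k + 1])):
            cuts.append(k + 1)
    return cuts
```

Because each decision depends only on the 64 bytes in the window, inserting bytes at the front of the stream shifts the boundaries along with the data instead of destroying them.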
H_k = Hash(X_{k-63}, X_{k-62}, X_{k-61}, ..., X_{k-2}, X_{k-1}, X_k)
H_{k+1} = Hash(X_{k-62}, X_{k-61}, X_{k-60}, ..., X_{k-1}, X_k, X_{k+1})
H_{k+1} = RollHash(H_k, X_{k-63}, X_{k+1})
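For a Karp-Rabin style polynomial hash, the rolling update can be written out as follows. The base and modulus are arbitrary choices for this sketch, and `full_hash` and `roll_hash` are hypothetical names; the point is only that the rolled value equals the hash recomputed over the shifted window.

```python
WINDOW = 64
BASE = 257                        # hypothetical multiplier
MOD = 1 << 32                     # hypothetical 32-bit hash
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window

def full_hash(window: bytes) -> int:
    """Hash(X_{k-63}, ..., X_k) computed from scratch."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h

def roll_hash(h: int, outgoing: int, incoming: int) -> int:
    """RollHash(H_k, X_{k-63}, X_{k+1}): drop the oldest byte, append the newest."""
    return ((h - outgoing * TOP) * BASE + incoming) % MOD

data = bytes((i * 73 + 5) % 256 for i in range(300))  # made-up input
h = full_hash(data[:WINDOW])
for k in range(WINDOW, len(data)):
    h = roll_hash(h, data[k - WINDOW], data[k])
    assert h == full_hash(data[k - WINDOW + 1:k + 1])
```

Each step costs a constant number of operations, independent of the window length.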
Proposed Solution
How does it work:
1. Work with rolling vectors
2. Calculate a hash of byte size
3. Calculate the criteria, in a way that:
   - The number of calculations is constant, unrelated to the vector size
   - A cutting point can be found at a byte offset
h_{k-3} = Hash(X_{k-63}, X_{k-59}, X_{k-55}, ..., X_{k-11}, X_{k-7}, X_{k-3})
h_{k-2} = Hash(X_{k-62}, X_{k-58}, X_{k-54}, ..., X_{k-10}, X_{k-6}, X_{k-2})
h_{k-1} = Hash(X_{k-61}, X_{k-57}, X_{k-53}, ..., X_{k-9}, X_{k-5}, X_{k-1})
h_k = Hash(X_{k-60}, X_{k-56}, X_{k-52}, ..., X_{k-8}, X_{k-4}, X_k)
h_{k+1} = Hash(X_{k-59}, X_{k-55}, ..., X_{k-3}, X_{k+1})
h_{k+2} = Hash(X_{k-58}, X_{k-54}, ..., X_{k-2}, X_{k+2})
h_{k+3} = Hash(X_{k-57}, X_{k-53}, ..., X_{k-1}, X_{k+3})
h_{k+4} = Hash(X_{k-56}, X_{k-52}, ..., X_k, X_{k+4})
h_{k+1} = RollHash(h_{k-3}, X_{k-63}, X_{k+1})
h_{k+2} = RollHash(h_{k-2}, X_{k-62}, X_{k+2})
h_{k+3} = RollHash(h_{k-1}, X_{k-61}, X_{k+3})
h_{k+4} = RollHash(h_k, X_{k-60}, X_{k+4})
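The lane-wise update can be modelled in plain Python, with a list standing in for a SIMD register. This is a sketch under assumptions: four lanes as in the example above, a polynomial hash kept to 8 bits per lane (one possible reading of "a hash of byte size"), and hypothetical names `lane_hash` and `roll_lanes`. A real implementation would apply the same step to all lanes with vector instructions.

```python
LANES = 4     # four interleaved hashes, as in the example above
TAPS = 16     # bytes per lane: X_{k-60}, X_{k-56}, ..., X_{k-4}, X_k
BASE = 31     # hypothetical multiplier
MOD = 1 << 8  # hypothetical 8-bit lane hash
TOP = pow(BASE, TAPS - 1, MOD)  # weight of the oldest byte in a lane

def lane_hash(data: bytes, k: int) -> int:
    """h_k computed from scratch over every fourth byte ending at X_k."""
    h = 0
    for i in range(k - LANES * (TAPS - 1), k + 1, LANES):
        h = (h * BASE + data[i]) % MOD
    return h

def roll_lanes(hashes, outgoing, incoming):
    """One RollHash step applied to all lanes at once (element-wise)."""
    return [((h - out * TOP) * BASE + inc) % MOD
            for h, out, inc in zip(hashes, outgoing, incoming)]

data = bytes((i * 37 + 11) % 256 for i in range(256))  # made-up input
k = 63
hashes = [lane_hash(data, k - 3 + j) for j in range(LANES)]  # h_{k-3} .. h_k
# Outgoing bytes X_{k-63}..X_{k-60}, incoming bytes X_{k+1}..X_{k+4}.
hashes = roll_lanes(hashes, data[k - 63:k - 59], data[k + 1:k + 5])
assert hashes == [lane_hash(data, k + 1 + j) for j in range(LANES)]
```

One rolling step advances the stream by a whole vector of bytes while performing the same amount of work per lane as the scalar version.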
c_k = Criteria(H_k)
c_{k+1} = Criteria(H_{k+1})
c_{k+2} = Criteria(H_{k+2})
c_{k+3} = Criteria(H_{k+3})
Results: Pass, Fail, Pass, Pass
c_k ... c_{k+3} = 0 1 0 0
c_{k-4} ... c_{k-1} = 1 1 0 0
c_k ... c_{k+3} = 0 1 0 0
c_{k+4} ... c_{k+7} = 0 0 0 1
[Figure labels: Trailing Zeroes, Leading Zeroes, Boundary]
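The two bit-counting primitives named on this slide can be sketched as below, assuming each criteria vector is packed into an integer with its first element (for example c_k) as the most significant bit. How the counts are combined into the final boundary decision is not spelled out here, so the run-length computation at the end is only one plausible use, and all function names are hypothetical.

```python
WIDTH = 4  # bits per criteria vector in the example above

def pack(bits):
    """Pack a list of 0/1 criteria results into an integer, first element on top."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def leading_zeroes(value: int, width: int = WIDTH) -> int:
    """Number of zero bits before the first set bit."""
    return width - value.bit_length()

def trailing_zeroes(value: int, width: int = WIDTH) -> int:
    """Number of zero bits after the last set bit."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1

previous = pack([1, 1, 0, 0])  # c_{k-4} ... c_{k-1}
current = pack([0, 1, 0, 0])   # c_k ... c_{k+3}
following = pack([0, 0, 0, 1]) # c_{k+4} ... c_{k+7}

# Length of the zero run that crosses from one vector into the next.
run_into_current = trailing_zeroes(previous) + leading_zeroes(current)
run_into_following = trailing_zeroes(current) + leading_zeroes(following)
```

On hardware, both counts are single instructions on the packed mask, so the cost does not grow with the number of lanes in the vector.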
Algorithm           Random Data   Corpus Data
Karp-Rabin          975 MB/s      927 MB/s
Cyclic-Polynomial   1675 MB/s     1676 MB/s
Ours                6715 MB/s     7136 MB/s
Chunking Alg.   Dedup Perf.   LZ4     SHA1   Other   Chunking
Karp-Rabin      262 MB/s      63.9%   4.1%   3.8%    28.1%
Ours            345 MB/s      84.5%   5.4%   5.1%    4.7%
Same Distribution
Faster Chunking Performance
Faster Overall Performance
Future Work
A chunking algorithm that is*   Past   Presented           Future
Backward Compatible             N/A    f(k, x) ≠ f(l, x)   f(k, x) = f(l, x)
Work                            cn     cn                  cn
Speed                           cn     c(n log k) / k      c(n log k) / k
Thanks
https://github.com/dudejohnny/PSC2019