Fast SIMD-Based Chunking Algorithm
Yehonatan Dude, Michael Hirsch, Yair Toaff
PSC 2019, 2019-Aug-27, Johnny Dude
Page 1

Fast SIMD-Based Chunking Algorithm, PSC 2019-Aug-27, Johnny Dude

Fast SIMD-Based Chunking Algorithm

Yehonatan Dude, Michael Hirsch, Yair Toaff

PSC 2019

Page 2

Outline

1. Background
2. Chunking Problem
3. Traditional Solutions
4. Our Solution
5. Future Work

Page 3

Page 4

Background - Deduplication

Deduplication is a technique for eliminating duplicate copies of repeating data.

Page 5

Page 6

Deduplication process in a nutshell

1. Divide into chunks
2. Calculate the chunks' hashes
3. Store chunks uniquely
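The three steps can be sketched as follows. This is a minimal illustration, not the engine discussed in the talk: the fixed 4-byte chunk size, the in-memory dictionary store, and the names `deduplicate` and `restore` are assumptions made for the example. SHA-1 is used because it appears later in the slides.

```python
import hashlib

def deduplicate(objects, chunk_size=4):
    """Store each distinct chunk once; describe every object as a list of chunk hashes."""
    store = {}    # hash -> chunk bytes, each unique chunk kept once
    recipes = {}  # object name -> ordered list of chunk hashes
    for name, data in objects.items():
        recipe = []
        # Step 1: divide into chunks (fixed size here, the simplest option).
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            # Step 2: calculate the chunk's hash.
            digest = hashlib.sha1(chunk).hexdigest()
            # Step 3: store the chunk only if it has not been seen before.
            store.setdefault(digest, chunk)
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes

def restore(store, recipe):
    """Rebuild an object from its list of chunk hashes."""
    return b"".join(store[digest] for digest in recipe)

# Hypothetical sample data: two objects that share chunks.
objects = {
    "object1": b"AAAABBBBCCCCDDDDDDDD",
    "object2": b"BBBBDDDDCCCC",
}
store, recipes = deduplicate(objects)
print(len(store), "unique chunks for",
      sum(len(r) for r in recipes.values()), "chunk references")
```

Each object remains recoverable from its hash list, while the store holds every distinct chunk only once.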

Page 7

[Figure: Object 1 and Object 2.]

Page 8

[Figure: Object 1 and Object 2 divided into chunks labeled A, B, C and D; some chunks appear more than once.]

Page 9

[Figure: Object 1 and Object 2 divided into chunks A, B, C and D. Each distinct chunk is stored once alongside its hash (hA, hB, hC, hD), and each object is represented by a list of chunk hashes.]

Page 10

Background - Chunking Methods

How to chunk the input data

1. Simple - fixed size.
2. Content aware - files, objects, applications.
3. Content sensitive - rolling hash.

Page 11

Page 12

The Opera ghost really existed._ He was not, as was long believed

, a creature of the imagination_ of the artists, the superstition

of the managers, or a product o f the joy and impressionable bra

ins of the young ladies of the b allet, their mothers, the box-ke

epers, the cloak-room attendants or the concierge. Yes, he exist

ed in flesh and blood, although_ he assumed the complete appearan

ce of a real phantom; that is to say, of a spectral shade.

The Opera ghost really existed._ He was not, as was long believed

, a creature of the imagination_ of the artists, the superstition

of the managers, or a product o f the absurd and impressionable

brains of the young ladies of th e ballet, their mothers, the box

-keepers, the cloak-room attenda nts or the concierge. Yes, he ex

isted in flesh and blood, althou gh he assumed the complete appea

rance of a real phantom; that is to say, of a spectral shade.

Page 13

The Opera ghost really existed.

He was not, as was long believed, a creature of the imagination of the

artists, the superstition of the managers, or a product of the joy and

impressionable brains of the young ladies of the ballet, their

mothers, the box-keepers, the cloak-room attendants or the concierge.

Yes, he existed in flesh and blood, although he assumed the complete

appearance of a real phantom; that is to say, of a spectral shade.

The Opera ghost really existed.

He was not, as was long believed, a creature of the imagination of the

artists, the superstition of the managers, or a product of the absurd

and impressionable brains of the young ladies of the ballet, their

mothers, the box-keepers, the cloak-room attendants or the concierge.

Yes, he existed in flesh and blood, although he assumed the complete

appearance of a real phantom; that is to say, of a spectral shade.

Page 14

The Opera ghost really existed. He was not, as was long believed

, a creature of the imagination of the artists, the superstition

of the managers, or a product of the joy and impressionable bra

ins of the young ladies of the ballet, their mothers, the box-ke

epers, the cloak-room attendants or the concierge. Yes, he exist

ed in flesh and blood, although he assumed the complete appearan

ce of a real phantom; that is to say, of a spectral shade.

The Opera ghost really existed. He was not, as was long believed

, a creature of the imagination of the artists, the superstition

of the managers, or a product of the absurd and impressionable

brains of the young ladies of the ballet, their mothers, the box

-keepers, the cloak-room attendants or the concierge. Yes, he ex

isted in flesh and blood, although he assumed the complete appea

rance of a real phantom; that is to say, of a spectral shade.
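The passage appears to be split into 32-byte pieces on pages 12 and 14 and into lines on page 13, with one word changed between the two versions ("joy" versus "absurd"). The sketch below compares the two splitting strategies on a simplified copy of that passage. The line breaks in `TEMPLATE`, the 32-byte size, and the helper names are assumptions made for the example.

```python
def fixed_chunks(data, size=32):
    """Simple chunking: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def line_chunks(data):
    """Content-aware chunking: cut at line ends."""
    return data.splitlines(keepends=True)

def shared(a, b):
    """Number of chunks of b that also occur among the chunks of a."""
    seen = set(a)
    return sum(1 for chunk in b if chunk in seen)

TEMPLATE = (
    "The Opera ghost really existed.\n"
    "He was not, as was long believed, a creature of the imagination of the\n"
    "artists, the superstition of the managers, or a product of the {word}\n"
    "and impressionable brains of the young ladies of the ballet, their\n"
    "mothers, the box-keepers, the cloak-room attendants or the concierge.\n"
    "Yes, he existed in flesh and blood, although he assumed the complete\n"
    "appearance of a real phantom; that is to say, of a spectral shade.\n"
)
old = TEMPLATE.format(word="joy").encode()
new = TEMPLATE.format(word="absurd").encode()

fixed_old, fixed_new = fixed_chunks(old), fixed_chunks(new)
lines_old, lines_new = line_chunks(old), line_chunks(new)
print("fixed size:", shared(fixed_old, fixed_new), "of", len(fixed_new), "chunks reused")
print("line based:", shared(lines_old, lines_new), "of", len(lines_new), "chunks reused")
```

With fixed-size chunks, every chunk after the edit shifts and no longer matches. With content-based boundaries, only the chunk containing the edit changes.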

Page 15

Background - Deduplication Performance

In 2017 we worked on a deduplication engine, and we tried to improve its performance.

Page 16

Page 17

[Figure: chart of seconds per GB spent in each part of the deduplication process: LZ4, Karp-Rabin, SHA1, Other.]

Page 18

Chunking Problem

Given a stream of bytes, divide it into chunks for deduplication.

1. Output identical chunks for identical data.
2. Good chunk size distribution.
3. Good performance.
4. Works for any input (photo, DB, text, random, etc.).

Page 19

Page 20

Traditional Solutions

Karp-Rabin
Cyclic Polynomial

Page 21

Page 22

$H_k = \mathrm{Hash}(X_{k-63}, X_{k-62}, X_{k-61}, \dots, X_{k-2}, X_{k-1}, X_k)$

If $\mathrm{Criteria}(H_k)$ holds, then mark a boundary after $X_k$.
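A direct, unoptimized reading of this rule is sketched below. The 64-byte window follows the slide; the polynomial hash, its `BASE` and `MOD`, and the "low 6 bits are zero" criteria are assumptions chosen for illustration, not the parameters used in the talk.

```python
import random

WINDOW = 64          # bytes X_{k-63} .. X_k, as on the slide
BASE = 257           # assumed polynomial base
MOD = (1 << 61) - 1  # assumed modulus
MASK = (1 << 6) - 1  # assumed criteria: the low 6 bits of H_k are zero

def window_hash(window):
    """Polynomial hash of one window, computed from scratch."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h

def criteria(h):
    return h & MASK == 0

def boundaries(data):
    """Offsets k such that a boundary is marked after X_k."""
    return [k for k in range(WINDOW - 1, len(data))
            if criteria(window_hash(data[k - WINDOW + 1:k + 1]))]

def split(data):
    """Cut data after every boundary offset."""
    chunks, start = [], 0
    for k in boundaries(data):
        chunks.append(data[start:k + 1])
        start = k + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Hypothetical input: pseudo-random bytes with a fixed seed.
sample = bytes(random.Random(0).randrange(256) for _ in range(4096))
print(len(split(sample)), "chunks")
```

Because each decision depends only on the last 64 bytes, inserting bytes in front of the data moves the boundaries along with the content instead of invalidating them.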

Page 23

$H_{k+1} = \mathrm{Hash}(X_{k-62}, X_{k-61}, X_{k-60}, \dots, X_{k-1}, X_k, X_{k+1})$

$H_k = \mathrm{Hash}(X_{k-63}, X_{k-62}, X_{k-61}, \dots, X_{k-2}, X_{k-1}, X_k)$

Page 24

$H_{k+1} = \mathrm{Hash}(X_{k-62}, X_{k-61}, X_{k-60}, \dots, X_{k-1}, X_k, X_{k+1})$

$H_k = \mathrm{Hash}(X_{k-63}, X_{k-62}, X_{k-61}, \dots, X_{k-2}, X_{k-1}, X_k)$

$H_{k+1} = \mathrm{RollHash}(H_k, X_{k-63}, X_{k+1})$
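The rolling update can be checked against the from-scratch hash with a small Karp-Rabin-style sketch. `BASE` and `MOD` are assumed values, and the argument order of `roll_hash` (previous hash, outgoing byte, incoming byte) is one reading of the slide.

```python
import random

WINDOW = 64
BASE = 257                        # assumed polynomial base
MOD = (1 << 61) - 1               # assumed modulus
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window

def hash_window(window):
    """H = sum of X_i * BASE^(WINDOW-1-i) mod MOD, computed from scratch."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h

def roll_hash(h, outgoing, incoming):
    """H_{k+1} from H_k, the byte leaving the window and the byte entering it."""
    h = (h - outgoing * TOP) % MOD     # drop X_{k-63}
    return (h * BASE + incoming) % MOD  # shift the rest and append X_{k+1}

def check(data):
    """True if rolling gives the same hash as recomputing every window."""
    h = hash_window(data[:WINDOW])
    for k in range(WINDOW - 1, len(data) - 1):
        h = roll_hash(h, data[k - WINDOW + 1], data[k + 1])
        if h != hash_window(data[k - WINDOW + 2:k + 2]):
            return False
    return True

sample = bytes(random.Random(1).randrange(256) for _ in range(1024))
print(check(sample))
```

Each step costs a constant number of operations, where recomputing the window would cost 64.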

Page 25

Karp-Rabin Cyclic Polynomial
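For comparison with the Karp-Rabin update, here is a sketch of a cyclic polynomial rolling hash, which replaces multiplication with bit rotation and XOR. The 64-bit width, the seeded random byte table, and the function names are assumptions for the example.

```python
import random

WINDOW = 64
BITS = 64
MASK64 = (1 << BITS) - 1
_rng = random.Random(2019)
TABLE = [_rng.getrandbits(BITS) for _ in range(256)]  # assumed random value per byte

def rotl(x, r):
    """Rotate a BITS-wide value left by r positions."""
    r %= BITS
    return ((x << r) | (x >> (BITS - r))) & MASK64

def hash_window(window):
    """XOR of the table entries, each rotated by its distance from the window end."""
    h = 0
    for byte in window:
        h = rotl(h, 1) ^ TABLE[byte]
    return h

def roll_hash(h, outgoing, incoming):
    """Rotate the window hash, cancel the outgoing byte, add the incoming byte."""
    return rotl(h, 1) ^ rotl(TABLE[outgoing], WINDOW) ^ TABLE[incoming]

def check(data):
    """True if rolling gives the same hash as recomputing every window."""
    h = hash_window(data[:WINDOW])
    for k in range(WINDOW - 1, len(data) - 1):
        h = roll_hash(h, data[k - WINDOW + 1], data[k + 1])
        if h != hash_window(data[k - WINDOW + 2:k + 2]):
            return False
    return True

sample = bytes(random.Random(2).randrange(256) for _ in range(1024))
print(check(sample))
```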

Page 26

Proposed Solution

How does it work:

1. Work with rolling vectors
2. Calculate a hash of byte size
3. Calculate the criteria, in a way that:
   - The number of calculations is constant, unrelated to the vector size
   - Can find a cutting point at a byte offset

Page 27

Page 28

$h_{k-3} = \mathrm{Hash}(X_{k-63}, X_{k-59}, X_{k-55}, \dots, X_{k-11}, X_{k-7}, X_{k-3})$
$h_{k-2} = \mathrm{Hash}(X_{k-62}, X_{k-58}, X_{k-54}, \dots, X_{k-10}, X_{k-6}, X_{k-2})$
$h_{k-1} = \mathrm{Hash}(X_{k-61}, X_{k-57}, X_{k-53}, \dots, X_{k-9}, X_{k-5}, X_{k-1})$
$h_k = \mathrm{Hash}(X_{k-60}, X_{k-56}, X_{k-52}, \dots, X_{k-8}, X_{k-4}, X_k)$

Page 29

$h_{k+1} = \mathrm{Hash}(X_{k-59}, X_{k-55}, \dots, X_{k+1}) = \mathrm{RollHash}(h_{k-3}, X_{k-63}, X_{k+1})$
$h_{k+2} = \mathrm{Hash}(X_{k-58}, X_{k-54}, \dots, X_{k+2}) = \mathrm{RollHash}(h_{k-2}, X_{k-62}, X_{k+2})$
$h_{k+3} = \mathrm{Hash}(X_{k-57}, X_{k-53}, \dots, X_{k+3}) = \mathrm{RollHash}(h_{k-1}, X_{k-61}, X_{k+3})$
$h_{k+4} = \mathrm{Hash}(X_{k-56}, X_{k-52}, \dots, X_{k+4}) = \mathrm{RollHash}(h_k, X_{k-60}, X_{k+4})$
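The interleaved update can be simulated in plain Python, with a list of four values standing in for one SIMD register. The lane count and the stride-4 windows follow the slides. The polynomial lane hash, its `BASE`, and keeping each lane hash in 8 bits (one possible reading of "a hash of byte size") are assumptions.

```python
import random

LANES = 4      # the slides show four interleaved hashes
PER_LANE = 16  # bytes per lane window: X_{k-60}, X_{k-56}, ..., X_k
BASE = 31      # assumed base
MOD = 256      # assumed: each lane hash fits in one byte
TOP = pow(BASE, PER_LANE - 1, MOD)

def lane_hash(data, k):
    """h_k = Hash(X_{k-60}, X_{k-56}, ..., X_{k-4}, X_k), computed from scratch."""
    h = 0
    for i in range(k - LANES * (PER_LANE - 1), k + 1, LANES):
        h = (h * BASE + data[i]) % MOD
    return h

def roll_lanes(hashes, outgoing, incoming):
    """Advance every lane by one step; real SIMD code would update all lanes at once."""
    return [((h - out * TOP) * BASE + inc) % MOD
            for h, out, inc in zip(hashes, outgoing, incoming)]

def check_lanes(data):
    """True if rolling all lanes matches recomputing each lane hash."""
    k = LANES * PER_LANE - 1  # first position where all four windows are full
    hashes = [lane_hash(data, k - 3 + j) for j in range(LANES)]
    while k + LANES < len(data):
        outgoing = data[k - 63:k - 59]  # X_{k-63} .. X_{k-60}
        incoming = data[k + 1:k + 5]    # X_{k+1} .. X_{k+4}
        hashes = roll_lanes(hashes, outgoing, incoming)
        k += LANES
        if hashes != [lane_hash(data, k - 3 + j) for j in range(LANES)]:
            return False
    return True

sample = bytes(random.Random(3).randrange(256) for _ in range(512))
print(check_lanes(sample))
```

One step consumes four input bytes and yields four hashes, so the work per byte falls as the vector gets wider.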

Page 30

$c_k = \mathrm{Criteria}(H_k)$: Pass
$c_{k+1} = \mathrm{Criteria}(H_{k+1})$: Fail
$c_{k+2} = \mathrm{Criteria}(H_{k+2})$: Pass
$c_{k+3} = \mathrm{Criteria}(H_{k+3})$: Pass

Page 31

$c_k = \mathrm{Criteria}(H_k)$: Pass
$c_{k+1} = \mathrm{Criteria}(H_{k+1})$: Fail
$c_{k+2} = \mathrm{Criteria}(H_{k+2})$: Pass
$c_{k+3} = \mathrm{Criteria}(H_{k+3})$: Pass

$(c_k, \dots, c_{k+3}) = (0, 1, 0, 0)$

Page 32

$(c_{k-4}, \dots, c_{k-1}) = (1, 1, 0, 0)$
$(c_k, \dots, c_{k+3}) = (0, 1, 0, 0)$
$(c_{k+4}, \dots, c_{k+7}) = (0, 0, 0, 1)$

Page 33

$(c_{k-4}, \dots, c_{k-1}) = (1, 1, 0, 0)$
$(c_k, \dots, c_{k+3}) = (0, 1, 0, 0)$
$(c_{k+4}, \dots, c_{k+7}) = (0, 0, 0, 1)$

[Figure labels: Trailing Zeroes, Leading Zeroes]

Page 34

$(c_{k-4}, \dots, c_{k-1}) = (1, 1, 0, 0)$
$(c_k, \dots, c_{k+3}) = (0, 1, 0, 0)$
$(c_{k+4}, \dots, c_{k+7}) = (0, 0, 0, 1)$

[Figure labels: Trailing Zeroes, Leading Zeroes, Boundary]
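The leading and trailing zero counts of neighbouring criteria vectors can be combined to measure a run of zeroes that crosses a vector edge. The sketch below computes those counts for the three vectors on the slide. The list representation and the helper names are assumptions. The transcript does not state how the counts are turned into the marked boundary, so the sketch stops at the run lengths.

```python
def leading_zeroes(mask):
    """Number of zero results at the start of a vector, before the first 1."""
    count = 0
    for bit in mask:
        if bit:
            break
        count += 1
    return count

def trailing_zeroes(mask):
    """Number of zero results at the end of a vector, after the last 1."""
    return leading_zeroes(mask[::-1])

def junction_runs(masks):
    """Length of the zero run across each junction between consecutive vectors."""
    return [trailing_zeroes(prev) + leading_zeroes(cur)
            for prev, cur in zip(masks, masks[1:])]

# The three criteria vectors from the slide, oldest first.
masks = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
print(junction_runs(masks))  # [3, 5]
```

On hardware, counts like these are typically obtained with a single bit-scan instruction on a packed mask. This is consistent with the stated goal of a constant number of calculations regardless of vector size.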

Page 35

Measured Results

Page 36

Page 37

Algorithm          Random Data  Corpus Data
Karp-Rabin         975 MB/s     927 MB/s
Cyclic-Polynomial  1675 MB/s    1676 MB/s
Ours               6715 MB/s    7136 MB/s

Page 38

Chunking Alg.  Dedup Perf.  LZ4    SHA1  Other  Chunking
Karp-Rabin     262 MB/s     63.9%  4.1%  3.8%   28.1%
Ours           345 MB/s     84.5%  5.4%  5.1%   4.7%
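The two result tables can be cross-checked with Amdahl's law. If chunking takes 28.1% of the time under Karp-Rabin and only chunking gets faster, the end-to-end gain should be close to the measured 345 / 262. The sketch assumes that the percentages are shares of total run time and that the standalone chunking throughputs apply to the same workload. Neither is stated in the transcript.

```python
def overall_speedup(fraction, component_speedup):
    """Amdahl's law: whole-system speedup when one fraction of the time gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

measured = 345 / 262    # end-to-end throughput ratio from the table
chunk_fraction = 0.281  # chunking share of the time with Karp-Rabin

# Standalone chunking throughput in MB/s: (ours, Karp-Rabin).
throughput = {"random": (6715, 975), "corpus": (7136, 927)}
predicted = {label: overall_speedup(chunk_fraction, ours / karp_rabin)
             for label, (ours, karp_rabin) in throughput.items()}

for label, value in predicted.items():
    print(f"{label}: predicted {value:.3f}x, measured {measured:.3f}x")
```

Under these assumptions, both predictions land within about one percent of the measured ratio. This suggests the overall improvement comes almost entirely from the faster chunking stage.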

Page 39

Same Distribution
Faster Chunking Performance
Faster Overall Performance

Page 40

Future Work

A chunking algorithm that is*  Past  Presented          Future
Backward Compatible            N/A   f(k, x) ≠ f(l, x)  f(k, x) = f(l, x)
Work                           cn    cn                 cn
Speed                          cn    c(n log k) / k     c(n log k) / k

Page 41

Page 42

Thanks

https://github.com/dudejohnny/PSC2019