Page 1: The Language of Compression

The Language of Compression

Leif Walsh

Two Sigma Investments
[email protected]

@leifwalsh

September 22, 2015

Page 2: The Language of Compression

Scope

Today’s talk is about compression:

▶ In data storage systems (databases, filesystems)
▶ Using general-purpose (lossless) algorithms
▶ On disk, not in memory or over the wire


Page 6: The Language of Compression

Scope

We’ll talk about systems like:

▶ MySQL (InnoDB, TokuDB)
▶ MongoDB (WiredTiger, TokuMX, RocksDB)
▶ Cassandra
▶ PostgreSQL
▶ Vertica
▶ zfs, btrfs

Page 7: The Language of Compression

Goal of the Talk


Page 8: The Language of Compression

Goal

A framework for answering:

▶ How do compression algorithms even work?
▶ How do storage systems use compression?
▶ How should I evaluate the compression of a storage system?
▶ How should I read articles about compression?
▶ How should I write articles about compression?


Page 14: The Language of Compression

About Me


Page 15: The Language of Compression

Me

Engineer at Two Sigma

▶ We have a lot of data
▶ We care a lot about compression

Previously at Tokutek

▶ Worked on TokuMX, TokuFT
▶ We thought a lot about compression
▶ We evaluated a lot of compression algorithms
▶ We wrote a lot about compression


Page 17: The Language of Compression

How to Talk About Compression


Page 18: The Language of Compression

How to Talk About Compression

Scenario
I have a database which can store 1TB of “user data” in only 200GB of disk.

How much is my database compressing? 80%? 20%? 5x? 1/5? 5:1?

Let’s talk about what this number is going to mean to us…
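Those candidate numbers all describe the same scenario; a quick sketch of how each is computed from the figures above:

```python
original_gb = 1000   # 1TB of user data
on_disk_gb = 200     # what the database actually writes

ratio = original_gb / on_disk_gb     # "5x" or "5:1"
fraction = on_disk_gb / original_gb  # "1/5", i.e. 20% of original size
space_saved = 1 - fraction           # "80% compression"

print(f"{ratio:.0f}x | {fraction:.0%} of original | {space_saved:.0%} saved")
```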


Page 22: The Language of Compression

Why Compress?


Page 23: The Language of Compression

Why Compress?

Data storage is expensive (you’ve heard this)

▶ Replication magnifies your data costs
▶ Maintenance/operations cost scales superlinearly with hardware
▶ SSD is expensive


Page 27: The Language of Compression

Why Compress?

Compression magnifies your capacity to store data at a fixed cost.
We ask “by what factor does compression multiply my capacity?”

Compression minimizes your cost to provide a fixed capacity.
We ask “by what factor does compression divide my cost?”

We should always talk about compression in terms of the multiplicative factor by which you increase your cost-effectiveness.


Page 30: The Language of Compression

How to Talk About Compression

Say “5x compression”, not “80% compression”.


Page 31: The Language of Compression

Cost Model


Page 32: The Language of Compression

Cost Model

Compression is more expensive than decompression.

▶ Compression is searching for repeated patterns in data. Searching is expensive.
▶ Decompression is copying bytes out in the order described by the encoding, which isn’t very hard.

Bandwidth for typical compression algorithms (cp is a no-op baseline), measured on my laptop (Haswell CPU, Samsung SSD) with a 362MB tarball of /usr/include, in MB/s (higher is better):

             zlib   bz2   lzma   lzo   lz4   zstd     cp
Compress       39     8      3   366   405    293   1466
Decompress    179    28    138   395   774    500   1466
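Numbers like these can be roughly reproduced with Python’s standard-library codecs (zlib, bz2, lzma; lzo, lz4, and zstd need third-party bindings, so they’re left out of this sketch, and a small synthetic buffer stands in for the tarball):

```python
import bz2
import lzma
import time
import zlib

def bandwidth(name, compress, decompress, data):
    """Time one compress and one decompress pass and report MB/s."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    decompress(packed)
    t2 = time.perf_counter()
    mb = len(data) / 1e6
    print(f"{name:>4}: compress {mb / (t1 - t0):7.0f} MB/s, "
          f"decompress {mb / (t2 - t1):7.0f} MB/s, "
          f"ratio {len(data) / len(packed):.1f}x")

# Stand-in input; benchmark against your own data for meaningful numbers.
data = b"GET /index.html HTTP/1.1 200 1024 example.com\n" * 20_000

for name, mod in (("zlib", zlib), ("bz2", bz2), ("lzma", lzma)):
    bandwidth(name, mod.compress, mod.decompress, data)
```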


Page 36: The Language of Compression

Cost Model

How does compression impact perceived performance?

Compression:
▶ Usually infrequent and done in the background
▶ Can reduce overall throughput

Decompression:
▶ More frequent (“Write Once, Read Many”) and on the critical path
▶ High impact on user-visible latency


Page 39: The Language of Compression

Cost Model: Corollaries

1. Do compression in the background and in large batches
   ▶ Implement backpressure to avoid falling behind
   ▶ If backpressure reaches users, try a faster compressor

2. Be sensitive to decompression latency
   ▶ Hit the highest nail: other latency sources may be more important
   ▶ Experiment with block sizes and faster compression algorithms


Page 42: The Language of Compression

How Compression Even Works


Page 43: The Language of Compression

How Compression Even Works

All* compression algorithms, at their core, use a form of dictionary encoding:

▶ Write down a dictionary of “common phrases” with shorter names
▶ Encode the input stream by referencing the short names in the dictionary

*A Universal Algorithm for Sequential Data Compression, J. Ziv, A. Lempel, 1977

abbabbabbcdcdcdcdabb => abb|abb|abb|cd|cd|cd|cd|abb

Symbol  Phrase
x       abb
y       cd

=> xxxyyyyx

To decompress: read the dictionary, use it to interpret the compressed stream.
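The slide’s toy encoding can be executed directly; a minimal sketch with the dictionary above hard-coded (real compressors discover the dictionary while scanning the input):

```python
# Hand-built dictionary for the slide's example input.
dictionary = {"x": "abb", "y": "cd"}

def encode(text, dictionary):
    """Greedily replace each known phrase with its short symbol."""
    out = []
    i = 0
    while i < len(text):
        for symbol, phrase in dictionary.items():
            if text.startswith(phrase, i):
                out.append(symbol)
                i += len(phrase)
                break
        else:
            out.append(text[i])  # literal: byte not covered by the dictionary
            i += 1
    return "".join(out)

def decode(encoded, dictionary):
    """Expand each symbol back into its phrase."""
    return "".join(dictionary.get(ch, ch) for ch in encoded)

encoded = encode("abbabbabbcdcdcdcdabb", dictionary)
print(encoded)  # xxxyyyyx
assert decode(encoded, dictionary) == "abbabbabbcdcdcdcdabb"
```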


Page 48: The Language of Compression

How Compression Even Works

Most compressors have a dynamic dictionary which is modified (optimized) as it compresses the input.

The dictionary takes up some space in the file header, so to be worthwhile, we want to compress a lot of input with it at once.


Page 50: The Language of Compression

How Compression Even Works

We cannot seek directly to an offset in the decompressed output because:

▶ We need to read the compressed stream to modify the dictionary
▶ We don’t know how much output any given chunk of input will produce

We also can’t update a compressed file without recompressing the whole thing.


Page 54: The Language of Compression

How Compression Even Works

Systems that provide seeking in compressed data do so by dividing the input into blocks, and compressing them individually.

▶ When writing, recompress the whole block being written (but not the whole data set)
▶ When reading, decompress the whole block being read
▶ Overall compression ratio depends on the size of the blocks
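A minimal sketch of that scheme: compress fixed-size blocks independently, and serve a read by decompressing only the blocks that cover the target offset (block size and helper names here are illustrative):

```python
import zlib

BLOCK_SIZE = 4096  # uncompressed bytes per block

def compress_blocks(data):
    """Compress each fixed-size block independently."""
    return [zlib.compress(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

def read_at(blocks, offset, length):
    """Read uncompressed bytes [offset, offset+length) touching only the needed blocks."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    chunk = b"".join(zlib.decompress(blocks[b]) for b in range(first, last + 1))
    start = offset - first * BLOCK_SIZE
    return chunk[start:start + length]

data = bytes(range(256)) * 100  # 25600 bytes of sample data
blocks = compress_blocks(data)
assert read_at(blocks, 5000, 64) == data[5000:5064]
```

A write works the same way in reverse: recompress only the block (or blocks) the write touches, not the whole file.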


Page 58: The Language of Compression

Block Sizes


Page 59: The Language of Compression

Block Sizes

Compression algorithms are pattern finders. Give them more data to search in, and they find more patterns.

Compressors use block sizes to limit their runtime and memory usage.

As block size increases:

▶ Compression throughput decreases
▶ Compression and decompression memory usage increases
▶ Decompression throughput may increase if disk throughput increases
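The ratio side of this tradeoff is easy to observe directly; a sketch compressing the same buffer at several block sizes with zlib (substitute your own data for a meaningful measurement):

```python
import zlib

# Repetitive sample data; real measurements should use real data.
data = b"GET /index.html HTTP/1.1 200 1024 example.com\n" * 4000

for block_size in (512, 4096, 32768, len(data)):
    # Total compressed size when the input is cut into independent blocks.
    compressed = sum(len(zlib.compress(data[i:i + block_size]))
                     for i in range(0, len(data), block_size))
    print(f"block {block_size:>7}: ratio {len(data) / compressed:5.1f}x")
```

Per-block overhead (headers, reset dictionaries) is amortized as blocks grow, so the ratio generally improves with block size.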


Page 65: The Language of Compression

Block Sizes

[Chart: Compression Ratio vs. Block Size (higher is better), ratios 1 to 8 over block sizes 512 bytes to 1M, for btrfs, InnoDB, PostgreSQL, WiredTiger, Vertica, Sybase IQ, RocksDB, TokuDB/TokuMX, Cassandra, and zfs.]

The compression ratio sweet spot is ∼128k, for gzip on this data set.

Most systems use small blocks, ∼8k, to reduce decompression latency.


Page 69: The Language of Compression

Fragmentation

Another corollary of compressing in blocks is fragmentation.

Blocks need to be allocated locations on disk. As the data grows, shrinks, and moves around, these locations (and for some systems, allocation sizes) change.


Page 71: The Language of Compression

Fragmentation


Page 72: The Language of Compression

Fragmentation

Fragmentation hurts you in two ways:

1. Fragmented files occupy more effective space than defragmented ones
2. Fragmentation degrades range query throughput by reducing data locality

For some systems, the overall compression ratio will be reduced once fragmentation develops.


Page 76: The Language of Compression

Entropy


Page 77: The Language of Compression

Entropy

Not all data compresses equally!

Information Theory* can tell us how much real information is present in a set of data (“bits of entropy”).

*A Mathematical Theory of Communication, C. E. Shannon, 1948

A general-purpose, lossless compression algorithm can’t hope to compress data smaller than that. If it could, it would have to produce the same compressed output for multiple inputs, which would mean it isn’t lossless.

High entropy data is highly incompressible. Low entropy data is easily compressed.
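Shannon’s formula applied to the byte histogram gives a quick first-order estimate of compressibility; a sketch (byte-level entropy ignores longer-range patterns, so it is only a lower-bound intuition, not the true entropy of the source):

```python
import math
import os
from collections import Counter

def byte_entropy(data):
    """Shannon entropy of the byte histogram, in bits per byte (0 to 8)."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(byte_entropy(os.urandom(100_000)))  # near 8 bits/byte: incompressible
print(byte_entropy(bytes(100_000)))       # 0 bits/byte: trivially compressible
print(byte_entropy(b"To be, or not to be, that is the question."))
```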


Page 83: The Language of Compression

Entropy: Experiment

Built 8 data sources (∼50k each):1. Random bytes2. Sequential numbers, encoded as ASCII decimals3. All zeroes4. The beginning of The Iliad5. 1000 randomWikipedia URLs6. 1000 randomWikipedia URLs, sorted7. RAW image (CR2)8. JPEG-compressed image

31
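A cut-down version of that experiment, with three of the sources and zlib standing in for the full algorithm matrix:

```python
import os
import zlib

sources = {
    "rand": os.urandom(50_000),                                            # source 1
    "seq-nums": "".join(str(i) for i in range(15_000)).encode()[:50_000],  # source 2
    "zeroes": bytes(50_000),                                               # source 3
}

for name, data in sources.items():
    ratio = len(data) / len(zlib.compress(data, level=9))
    print(f"{name:>8}: {ratio:7.1f}x")
```

Random bytes refuse to compress at all, while the all-zeroes source compresses by orders of magnitude.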


Page 86: The Language of Compression

Entropy: Experiment

[Chart: Compression Ratio by data source (higher is better), for gzip, bz2, lzma, lzo, lz4, and zstd, over rand, image-jpg, seq-nums, iliad-head, urls, urls-sorted, image-raw, and zeroes; the best cases reach 160x to 2000x.]

JPEG-compressed data has high entropy, and doesn’t compress well.


Page 89: The Language of Compression

Entropy

Homogeneous data has lower entropy than heterogeneous data.

▶ Integers compress better than documents with complex internal structure

Column stores have a compression advantage over row stores.


Page 92: The Language of Compression

Entropy

Know your data! Don’t waste your time compressing JPEG blobs.

Some compressors are fantastic in specific data domains (VLQ, delta coding, JPEG, MP3, …)

(But 95% of the time, gzip is fine)


Page 95: The Language of Compression

Before we use compression, we need to understand the costs and benefits to our application.

Page 96: The Language of Compression

Benchmarking


Page 97: The Language of Compression

Benchmarking

When designing a compression benchmark, you should consider:

▶ Execution
▶ Measurement
▶ Presentation


Page 101: The Language of Compression

Benchmarking: Execution

Main Question: Is the workload representative of a real-world use-case?

38

Page 102: The Language of Compression

Benchmarking: Execution

1. Sample real data if you can get it.

If not, generate plausibly realistic data:
▶ Zeroes: bad
▶ Random: bad
▶ 25% random and 75% zeroes: meh
▶ JSON blobs: good

39
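
A quick way to sanity-check generated data is to compress each candidate corpus and compare ratios. A sketch with zlib, using a made-up JSON record schema:

```python
import json
import os
import random
import zlib

random.seed(0)
SIZE = 1 << 20  # 1 MiB per corpus

def json_blobs(size):
    # Hypothetical records; any schema resembling your real data works.
    out, total = [], 0
    while total < size:
        rec = json.dumps({
            "user": f"user{random.randrange(10_000)}",
            "ts": random.randrange(1_400_000_000, 1_500_000_000),
            "score": round(random.random(), 3),
        }).encode()
        out.append(rec)
        total += len(rec)
    return b"".join(out)[:size]

corpora = {
    "zeroes": bytes(SIZE),
    "random": os.urandom(SIZE),
    "25% random, 75% zeroes": os.urandom(SIZE // 4) + bytes(3 * SIZE // 4),
    "json blobs": json_blobs(SIZE),
}

for name, blob in corpora.items():
    ratio = len(blob) / len(zlib.compress(blob, 6))
    print(f"{name}: {ratio:.1f}x")
```

Zeroes compress absurdly well and random bytes barely at all; neither tells you anything about your application. The JSON corpus lands in between, which is the regime real data lives in.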

Page 105: The Language of Compression

Benchmarking: Execution

2. Use a realistic read/insert/update mixture:

▶ Most applications are read-heavy
▶ Favors fast decompressors

40

Page 107: The Language of Compression

Benchmarking: Execution

3. Use a realistic insert/update distribution:

▶ Most applications don’t write uniformly over the keyspace
▶ Zipfian or Pareto (or sometimes sequential, or nearly) distributions are more realistic, and cache-friendlier
▶ Vadim wrote a sysbench workload generator that uses a Zipfian distribution: http://j.mp/sysbench-zipf

41
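
If you can’t use sysbench, a Zipfian key chooser is a few lines of standard-library Python (the exponent and key count below are arbitrary picks for the sketch):

```python
import collections
import random

random.seed(1)
N_KEYS = 10_000
S = 1.2  # Zipf exponent; higher skews harder toward the hot keys

# P(rank k) is proportional to k**-S over ranks 1..N_KEYS; random.choices
# does the weighted sampling (precompute a CDF if you need more speed).
weights = [k ** -S for k in range(1, N_KEYS + 1)]
writes = random.choices(range(N_KEYS), weights=weights, k=100_000)

counts = collections.Counter(writes)
hot = sum(c for _, c in counts.most_common(N_KEYS // 100))
print(f"top 1% of keys receive {hot / len(writes):.0%} of writes")
```

A small fraction of hot keys absorbs most of the writes, which is both more realistic and much friendlier to the cache than a uniform spray over the keyspace.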

Page 111: The Language of Compression

Benchmarking: Execution

4. To measure latency, throttle your workload.

▶ Full-throughput workloads will induce artificial latency spikes (fsyncs, GC)

To measure max throughput, run at full speed.

You should do both.

42
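
A throttled driver is easy to sketch: schedule operations on a fixed cadence so queueing delay in the benchmark itself doesn’t pollute the latency numbers. `op` here is a placeholder for a real query against your store:

```python
import time

def run_throttled(op, ops_per_sec, n_ops):
    """Fire op() on a fixed schedule and record per-operation latency."""
    interval = 1.0 / ops_per_sec
    latencies = []
    start = time.monotonic()
    for i in range(n_ops):
        next_fire = start + i * interval  # absolute schedule avoids drift
        now = time.monotonic()
        if now < next_fire:
            time.sleep(next_fire - now)
        t0 = time.monotonic()
        op()
        latencies.append(time.monotonic() - t0)
    return latencies

# Placeholder op: swap in a real read or write against your store.
lats = sorted(run_throttled(lambda: time.sleep(0.001), ops_per_sec=100, n_ops=100))
p50, p99 = lats[len(lats) // 2], lats[int(len(lats) * 0.99)]
print(f"p50={p50 * 1e3:.1f}ms  p99={p99 * 1e3:.1f}ms")
```

Because each operation starts at its scheduled time rather than immediately after the previous one, a slow operation (an fsync, a GC pause) shows up as a latency spike instead of silently delaying everything behind it.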

Page 115: The Language of Compression

Benchmarking: Execution

5. Run for a long time. Lots of important properties don’t become visible immediately (e.g. fragmentation), and you need to understand them.

Your application is hopefully going to run for months or years. You don’t want to be surprised by degradation after you think everything’s stable.

43

Page 117: The Language of Compression

Benchmarking: Execution

6. Parameterize your workload:

▶ Read/insert/update mixture
▶ Write distribution
▶ Number of threads
▶ Data size
▶ Throttling
▶ Duration
▶ System configuration (cache size, isolation levels, log commit)

You are going to want to explore these parameter spaces. Save yourself the pain later and think about parameterization up front.

44
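
One lightweight way to do this (a sketch; every field name below is illustrative, not from any particular harness) is a frozen config object, so sweeping a single dimension is mechanical:

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class WorkloadConfig:
    # Illustrative parameter names; adapt to your own benchmark harness.
    read_pct: float = 0.90
    insert_pct: float = 0.08
    update_pct: float = 0.02
    write_dist: str = "zipf"            # "zipf" | "pareto" | "sequential" | "uniform"
    threads: int = 16
    data_gb: int = 100
    throttle_ops: Optional[int] = None  # None means run at full speed
    duration_hours: int = 24
    cache_gb: int = 8

base = WorkloadConfig()
# Sweep one dimension while holding everything else fixed:
thread_sweep = [replace(base, threads=t) for t in (1, 4, 16, 64)]
print([cfg.threads for cfg in thread_sweep])  # [1, 4, 16, 64]
```

Freezing the config means each run’s parameters are immutable and hashable, so you can key result files by configuration and never wonder later which knobs a chart was measured under.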

Page 126: The Language of Compression

Benchmarking: Execution

Great example: https://github.com/ParsePlatform/flashback

Captures a MongoDB workload with profiling, then replays operations either at their original timestamps, or at full speed.

45

Page 127: The Language of Compression

Benchmarking: Measurement

46

Page 129: The Language of Compression

Benchmarking: Measurement

Application metrics:

▶ Throughput
▶ Latency
▶ Aborted/retried transactions

Instrument your application so you know which operations are expensive.

47

Page 132: The Language of Compression

Benchmarking: Measurement

System metrics:

▶ CPU
▶ Memory (RSS)
▶ I/O
▶ Network
▶ Actual storage usage (du)

perf(1), iostat(1), dstat(1), oprofile(1), collectd(1), Datadog, Librato, …

48

Page 135: The Language of Compression

Benchmarking: Measurement

Database/filesystem metrics (product-specific):

▶ Cache hits/misses
▶ Replication lag
▶ Checkpoint lag

Talk to your storage vendor about what’s important.

49

Page 138: The Language of Compression

Benchmarking: Presentation

1. Describe the workload, and make a case for why it’s realistic
2. Choose key metrics that reflect the benefits of compression (e.g. users stored per TB) as well as the costs (e.g. operation latency)
3. Demonstrate which parameter choices influence the costs and benefits you think are important
4. Explain which parameters have little or no effect on your metrics
5. Explain how much of your measurement is overhead.
6. If you show charts, normalize your data. Only present important differences.

50

Page 143: The Language of Compression

Review

51

Page 144: The Language of Compression

Review

Say “5x compression”

52

Page 145: The Language of Compression

Review

Compression is slower than decompression, but decompression is more frequent

53

Page 146: The Language of Compression

Review

Large blocks compress better

54

Page 147: The Language of Compression

Review

Fragmentation degrades effective compression over time

55

Page 148: The Language of Compression

Review

High entropy data is less compressible

56

Page 149: The Language of Compression

Review

Benchmark realistic workloads over a long period

57

Page 150: The Language of Compression

Review

Present responsibly (and distrust benchmarketers who don’t)

58

Page 151: The Language of Compression

Thanks!

▶ Tim and Mark Callaghan for being exemplar benchmarkers (http://acmebenchmarking.com and http://smalldatum.blogspot.com)
▶ Bohu Tang for introducing me to zstd
▶ Andrew Bolin, Corey Milloy, Effie Baram, Li Jin, Wil Yegelwel for making this talk better
▶ Tokutek engineering
▶ Percona (they’re also good benchmarkers)

59

Page 152: The Language of Compression

Questions?

Leif Walsh
[email protected]

@leifwalsh

60