CASSANDRA & SOLID STATE DRIVES Rick Branson, DataStax
Cassandra and Solid State Drives

Jan 15, 2015

Rick Branson

Transcript
Page 1: Cassandra and Solid State Drives

CASSANDRA & SOLID STATE DRIVES

Rick Branson, DataStax

Page 2: Cassandra and Solid State Drives

FACT

CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED FOR SPINNING DISKS

Page 3: Cassandra and Solid State Drives

LSM-TREES

Page 4: Cassandra and Solid State Drives

WRITE PATH

Page 5: Cassandra and Solid State Drives

Client Cassandra

On-Disk Node Commit Log

{ cf1: { row1: { col1: abc } } }

{ cf1: { row1: { col2: def } } }

{ cf1: { row1: { col1: <del> } } }

{ cf1: { row2: { col1: xyz } } }

{ cf1: { row1: { col3: foo } } }

insert({ cf1: { row1: { col3: foo } } })

In-Memory Memtable for “cf1”

row1: col1: [del], col2: “def”, col3: “foo”

row2: col1: “xyz”

COMMIT
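The write path drawn on this slide can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the `Node` class, the `TOMBSTONE` marker, and the use of a Python list for the commit log are all assumptions made for illustration.

```python
TOMBSTONE = "<del>"  # illustrative marker for a deleted column, as drawn on the slide


class Node:
    """Hypothetical toy node: one commit log plus one memtable per column family."""

    def __init__(self):
        self.commit_log = []  # a list standing in for the append-only on-disk log
        self.memtables = {}   # column family -> row key -> column -> value

    def insert(self, mutation):
        # 1. Append the mutation to the commit log (a sequential write).
        self.commit_log.append(mutation)
        # 2. Apply it to the in-memory memtable; later writes shadow earlier ones.
        for cf, rows in mutation.items():
            memtable = self.memtables.setdefault(cf, {})
            for row_key, columns in rows.items():
                memtable.setdefault(row_key, {}).update(columns)


node = Node()
for mutation in [
    {"cf1": {"row1": {"col1": "abc"}}},
    {"cf1": {"row1": {"col2": "def"}}},
    {"cf1": {"row1": {"col1": TOMBSTONE}}},
    {"cf1": {"row2": {"col1": "xyz"}}},
    {"cf1": {"row1": {"col3": "foo"}}},
]:
    node.insert(mutation)

print(node.memtables["cf1"])
```

After the five mutations from the slide are replayed, the memtable holds the same rows and columns as the diagram, while the commit log still contains every mutation in arrival order.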

Page 6: Cassandra and Solid State Drives

SSTable 1

SSTable 2

SSTable 3

SSTable 4

In-Memory Memtable for “cf1”

row1: col1: [del], col2: “def”, col3: “foo”

row2: col1: “xyz”

FLUSH
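A flush can be pictured as writing the memtable out once, in sorted order, and never touching the result again. The sketch below is an assumption-laden illustration: `flush` is a hypothetical helper, and a nested tuple stands in for the on-disk file.

```python
def flush(memtable):
    """Return an immutable, key-sorted snapshot of a memtable (a stand-in for an SSTable)."""
    rows = []
    for row_key in sorted(memtable):
        # Columns are sorted too, so the whole table can be written in one pass.
        columns = tuple(sorted(memtable[row_key].items()))
        rows.append((row_key, columns))
    return tuple(rows)


memtable = {
    "row2": {"col1": "xyz"},
    "row1": {"col3": "foo", "col1": "<del>", "col2": "def"},
}
sstable = flush(memtable)
print(sstable)
```

Because the snapshot is a tuple, any later change has to go into a new memtable and a new SSTable rather than into this one.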

Page 7: Cassandra and Solid State Drives

SSTable 1, SSTable 2, SSTable 3, SSTable 4

SSTables are merged to maintain read performance

COMPACT

SSTable

Page 8: Cassandra and Solid State Drives

SSTable SSTable SSTable SSTable

X X X X

SSTable

New SSTable is streamed to disk and old SSTables are erased

Page 9: Cassandra and Solid State Drives

TAKEAWAYS

• All disk writes are sequential, append-only operations

• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)

• SSTables are completely immutable
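Because every SSTable is already sorted, compaction reduces to a merge of sorted streams. The sketch below assumes a simplified entry layout of `(key, timestamp, value)` and a hypothetical `compact` function; it keeps only the newest version of each key and does not model tombstone expiry. `heapq.merge` reads each input once, so for a fixed number of input tables the work grows linearly with the total number of entries.

```python
import heapq
from itertools import groupby


def compact(sstables):
    """Merge key-sorted SSTables, keeping the newest version of each key."""
    # Order by key, then newest timestamp first, without loading everything at once.
    merged = heapq.merge(*sstables, key=lambda entry: (entry[0], -entry[1]))
    # The first entry in each key group is the newest version.
    return [next(versions) for _, versions in groupby(merged, key=lambda entry: entry[0])]


older = [("row1", 1, "abc"), ("row2", 1, "xyz")]
newer = [("row1", 2, None), ("row3", 2, "foo")]  # None stands in for a delete marker
print(compact([older, newer]))
```

The inputs are never modified; the merged result is a new table, and the old ones can then be dropped as whole files.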

Page 10: Cassandra and Solid State Drives

TAKEAWAYS

• All disk writes are sequential, append-only operations

• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)

• SSTables are completely immutable

IMPORTANT

Page 11: Cassandra and Solid State Drives

COMPARED

• Most popular data storage engines rewrite modified data in-place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc.

• Most perform similar buffering of writes before flushing to disk

• ... but flushes are RANDOM writes

Page 12: Cassandra and Solid State Drives

SPINNING DISKS

• Dirt cheap: $0.08/GB

• Seek time limited by time it takes for drive to rotate: IOPS = RPM/60

• 7,200 RPM = ~120 IOPS

• 15,000 RPM has been the max for decades

• Sequential operations are best: 125MB/sec for modern SATA drives
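The rule of thumb above, IOPS = RPM/60, is easy to check. The function name `hdd_iops` is made up for this sketch, and the formula is the slide's approximation rather than a full drive model.

```python
def hdd_iops(rpm):
    """Slide's rule of thumb: roughly one random I/O per platter rotation."""
    return rpm / 60


for rpm in (7200, 15000):
    print(rpm, "RPM ->", hdd_iops(rpm), "IOPS")
```

By the same rule, even a 15,000 RPM drive tops out at about 250 random operations per second.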

Page 13: Cassandra and Solid State Drives

THAT WAS THE WORLD IN WHICH CASSANDRA WAS BUILT

Page 14: Cassandra and Solid State Drives

2012: MLC NAND FLASH*

• Affordable: ~$1.75/GB street

• Massive IOPS: 39,500/sec read, 23,000/sec write

• Latency of less than 100µs

• Good sequential throughput: 270MB/sec read, 205MB/sec write

• Way cheaper per IOPS: $0.02 vs $1.25

* based on specifications provided by Intel for 300GB Intel 320 drive
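The per-IOPS figures can be roughly reproduced from the numbers on these slides. The SSD side uses only stated values (300GB, ~$1.75/GB, 23,000 write IOPS). The slides do not say which spinning disk was priced, so the hard-drive price below is back-calculated from $1.25 per IOPS and ~120 IOPS, and should be read as an assumption.

```python
# SSD: figures taken from the slide (300GB Intel 320).
ssd_price = 300 * 1.75                       # dollars
ssd_cost_per_write_iops = ssd_price / 23000
ssd_cost_per_read_iops = ssd_price / 39500

# HDD: the drive is not specified in the slides, so work backwards (assumption).
implied_hdd_price = 1.25 * 120               # dollars, from $1.25/IOPS at ~120 IOPS
implied_hdd_capacity_gb = implied_hdd_price / 0.08

print(f"SSD: ${ssd_cost_per_write_iops:.3f} per write IOPS, "
      f"${ssd_cost_per_read_iops:.3f} per read IOPS")
print(f"HDD: implied price ${implied_hdd_price:.0f}, "
      f"about {implied_hdd_capacity_gb:.0f} GB at $0.08/GB")
```

On these assumptions the $0.02 figure matches the SSD's write IOPS, and the $1.25 figure corresponds to a 7,200 RPM drive of a little under 2TB.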

Page 15: Cassandra and Solid State Drives

WITH RANDOM ACCESS STORAGE, ARE CASSANDRA’S LSM-TREES OBSOLETE?

Page 16: Cassandra and Solid State Drives
Page 17: Cassandra and Solid State Drives

SOLID STATE HAS SOME MAJOR BUTS...

Page 18: Cassandra and Solid State Drives

... BUT

• Cannot overwrite directly: must erase first, then write


• Can write in small increments (4KB), but only erase in ~512KB blocks

• Latency: write is ~100µs, erase is ~2ms

• Limited durability: ~5,000 cycles (MLC) for each erase block
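The constraints in this list can be captured in a small model. The `EraseBlock` class is hypothetical; the 4KB page and 512KB block sizes are the approximate figures from the slide, and real drives vary.

```python
PAGE_SIZE = 4 * 1024                        # smallest writable unit (from the slide)
BLOCK_SIZE = 512 * 1024                     # smallest erasable unit (approximate)
PAGES_PER_BLOCK = BLOCK_SIZE // PAGE_SIZE   # 128 pages share one erase block


class EraseBlock:
    """Toy flash block: pages are written once, then only erased all together."""

    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count = 0                # each erase consumes one of the ~5,000 cycles

    def program(self, index, data):
        if self.pages[index] is not None:
            raise ValueError("page already written; erase the whole block first")
        self.pages[index] = data

    def erase(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1


block = EraseBlock()
block.program(0, "v1")
try:
    block.program(0, "v2")                  # an in-place overwrite is not allowed
except ValueError as error:
    print("rejected:", error)
block.erase()                               # wipes all 128 pages, not just page 0
block.program(0, "v2")
```

Changing a single 4KB page in place therefore costs an erase of the full block, which is why drives avoid doing it that way.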

Page 19: Cassandra and Solid State Drives

WEAR LEVELING spreads erase operations evenly across blocks, so that no single block uses up its erase cycles early

Page 20: Cassandra and Solid State Drives

WEAR LEVELING

Page 21: Cassandra and Solid State Drives

WEAR LEVELING

Erase Block

Page 22: Cassandra and Solid State Drives

WEAR LEVELING

Page 23: Cassandra and Solid State Drives

WEAR LEVELING

Page 24: Cassandra and Solid State Drives

WEAR LEVELING

Disk Page

Page 25: Cassandra and Solid State Drives

WEAR LEVELING

Write 1

Page 26: Cassandra and Solid State Drives

WEAR LEVELING

Write 1

Write 2

Page 27: Cassandra and Solid State Drives

WEAR LEVELING

Write 1

Write 2

Write 3

Page 28: Cassandra and Solid State Drives

Write 1

Write 2

Write 3

How is data from only Write 2 modified?

Remember: the whole block must be erased

Page 29: Cassandra and Solid State Drives

Mark Garbage

Page 30: Cassandra and Solid State Drives

Mark Garbage

Append Modified Data

Empty Block
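The "mark garbage, append modified data" step from these slides can be sketched as follows. The page states and the `modify` helper are invented for illustration, and the list is a much smaller block than a real one.

```python
EMPTY, VALID, GARBAGE = "empty", "valid", "garbage"


def modify(pages, index, new_data):
    """Rewrite page `index` out of place and return the slot that received the new data."""
    # Find the next empty page; raises StopIteration if the block is full.
    target = next(i for i, (state, _) in enumerate(pages) if state == EMPTY)
    pages[index][0] = GARBAGE               # the old copy stays on flash until GC erases it
    pages[target] = [VALID, new_data]
    return target


pages = [
    [VALID, "write 1"],
    [VALID, "write 2"],
    [VALID, "write 3"],
    [EMPTY, None],
]
slot = modify(pages, 1, "write 2 (modified)")
print(slot, pages)
```

No erase happens at update time; the cost is deferred, and the garbage page keeps occupying space until the whole block is reclaimed.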

Page 31: Cassandra and Solid State Drives

Wait... GARBAGE?

Page 32: Cassandra and Solid State Drives

THAT MEANS...

Page 33: Cassandra and Solid State Drives

... fragmentation, WHICH MEANS...

Page 34: Cassandra and Solid State Drives

Garbage Collection!

Page 35: Cassandra and Solid State Drives

GARBAGE COLLECTION

• Compacts fragmented disk blocks

• Erase operations drag on performance

• Modern SSDs do this in the background... as much as possible

• If no empty blocks are available, GC must be done before ANY writes can complete

Page 36: Cassandra and Solid State Drives

WRITE AMPLIFICATION

• When only a few kilobytes are written, but fragmentation causes a whole block to be rewritten

• The smaller & more random the writes, the worse this gets

• Modern “mark and sweep” GC reduces it, but cannot eliminate it
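A small simulation makes the effect concrete. Everything here is an assumption for illustration: the drive geometry (64 blocks of 32 pages), the 75% fill level, and the greedy garbage collector are toy choices and do not describe any real SSD firmware. The function reports write amplification as pages written to flash divided by pages written by the host.

```python
import random


def write_amplification(workload, num_blocks=64, pages_per_block=32):
    """Return flash page writes / host page writes for a stream of logical page numbers."""
    location = {}                                # logical page -> block holding its live copy
    valid = [set() for _ in range(num_blocks)]   # live logical pages in each block
    used = [0] * num_blocks                      # pages written since the last erase
    free = list(range(1, num_blocks))            # fully erased blocks
    open_block = 0                               # block currently being appended to
    flash_writes = 0

    def append(lpn):
        nonlocal open_block, flash_writes
        if used[open_block] == pages_per_block:  # open block is full: take a fresh one
            open_block = free.pop()
        old = location.get(lpn)
        if old is not None:
            valid[old].discard(lpn)              # the old copy becomes garbage
        valid[open_block].add(lpn)
        used[open_block] += 1
        location[lpn] = open_block
        flash_writes += 1

    def collect():
        # Greedy GC: take the full block with the fewest live pages,
        # copy those pages forward, then erase the block.
        full = [b for b in range(num_blocks)
                if b != open_block and used[b] == pages_per_block]
        victim = min(full, key=lambda b: len(valid[b]))
        for lpn in list(valid[victim]):
            append(lpn)
        used[victim] = 0
        free.append(victim)

    host_writes = 0
    for lpn in workload:
        while len(free) < 2:                     # keep a small reserve of erased blocks
            collect()
        append(lpn)
        host_writes += 1
    return flash_writes / host_writes


LOGICAL_PAGES = 1536                             # 75% of the 2,048 physical pages
rng = random.Random(42)
sequential = [i % LOGICAL_PAGES for i in range(20000)]
scattered = [rng.randrange(LOGICAL_PAGES) for _ in range(20000)]

sequential_wa = write_amplification(sequential)
random_wa = write_amplification(scattered)
print("sequential:", sequential_wa)
print("random:    ", random_wa)
```

In the sequential run, whole blocks turn into garbage at once, so the collector never has to copy anything. In the random run, every block it reclaims still holds some live pages that must be rewritten, which is the extra work the bullets above describe.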

Page 37: Cassandra and Solid State Drives

Torture test shows massive write performance drop-off for heavily fragmented drive

Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6

Page 38: Cassandra and Solid State Drives

Some poorly designed drives COMPLETELY fall apart

Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6

Page 39: Cassandra and Solid State Drives

Even a well-behaved drive suffers significantly from the torture test

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11

Page 40: Cassandra and Solid State Drives

Post-torture, all disk blocks were marked empty, and the “fast” comes back...

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11

Page 41: Cassandra and Solid State Drives
Page 42: Cassandra and Solid State Drives

“TRIM”

• Filesystems typically don’t erase data immediately when files are deleted; they just mark it as deleted and erase it later

• TRIM allows the OS to actively tell the drive when a region of disk is no longer used

• If an entire erase block is marked as unused, GC is avoided; otherwise, TRIM just hastens the collection process
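The difference between the two cases can be shown with sets. `pages_to_copy` is a hypothetical helper, and a four-page block is used only to keep the example short.

```python
def pages_to_copy(stored_pages, trimmed_pages):
    """Pages the drive must relocate before it can erase a block."""
    return stored_pages - trimmed_pages


block = {0, 1, 2, 3}                          # logical pages stored in one erase block

print(pages_to_copy(block, set()))            # no TRIM: every page must be copied
print(pages_to_copy(block, {0, 1}))           # partial TRIM: less copying
print(pages_to_copy(block, {0, 1, 2, 3}))     # whole block trimmed: erase with no copying
```

Without TRIM, the drive cannot tell that a deleted file's pages are dead until the filesystem overwrites them, so it keeps copying them during collection.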

Page 43: Cassandra and Solid State Drives

TRIM only reduces the write amplification effect; it can’t eliminate it.

Page 44: Cassandra and Solid State Drives

THEN THERE’S LIFETIME...

Page 45: Cassandra and Solid State Drives
Page 46: Cassandra and Solid State Drives
Page 47: Cassandra and Solid State Drives

AnandTech estimates that modern MLC SSDs only last about 1.5 years under heavy MySQL load, which causes around 10x write amplification
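A back-of-the-envelope model shows how write amplification feeds into lifetime. The 300GB capacity and ~5,000 erase cycles come from earlier slides; the host write rate of 270GB/day is a made-up figure chosen for illustration, and the model assumes perfectly even wear. It is not AnandTech's method.

```python
def lifetime_years(capacity_gb, erase_cycles, host_gb_per_day, write_amplification):
    """Years until every block has used up its erase cycles, assuming even wear."""
    total_flash_gb = capacity_gb * erase_cycles              # data the flash can absorb
    flash_gb_per_day = host_gb_per_day * write_amplification
    return total_flash_gb / flash_gb_per_day / 365


# host_gb_per_day=270 is a hypothetical workload, not a figure from the slides.
amplified = lifetime_years(300, 5000, host_gb_per_day=270, write_amplification=10)
unamplified = lifetime_years(300, 5000, host_gb_per_day=270, write_amplification=1)
print(f"10x write amplification: {amplified:.2f} years")
print(f"no write amplification:  {unamplified:.2f} years")
```

Under this model lifetime is inversely proportional to write amplification, so the same host workload with no amplification would take ten times as long to wear the drive out.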

Page 48: Cassandra and Solid State Drives

REMEMBER THIS?

Page 49: Cassandra and Solid State Drives

TAKEAWAYS

• All disk writes are sequential, append-only operations

• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)

• SSTables are completely immutable

Page 50: Cassandra and Solid State Drives

CASSANDRA ONLY WRITES SEQUENTIALLY

Page 51: Cassandra and Solid State Drives

“For a sequential write workload, write amplification is equal to 1, i.e., there is no write amplification.”

Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”

Page 52: Cassandra and Solid State Drives

THANK YOU.

~ @rbranson