Page 1: Mark Callaghan, Facebook

MySQL versus something else
Evaluating alternative databases

Mark Callaghan
Small Data Engineer
October, 2013

Friday, October 25, 13

Page 2: Mark Callaghan, Facebook

What metric is important?
▪ Throughput
▪ Throughput while minimizing response time variance
▪ Efficiency - reduce cost while meeting response time goals

Page 3: Mark Callaghan, Facebook

My focus is storage efficiency
▪ Use flash to get IOPs
▪ Use spinning disks to get capacity
▪ Use both to reduce cost while improving quality of service

device           frequent reads  frequent writes  read IOPs  write IOPs
flash            yes             yes              yes        maybe
flash            yes             no               yes        no
SATA, /dev/null  no              yes              no         maybe
SATA, /dev/null  no              no               no         no

Page 4: Mark Callaghan, Facebook

What technology would you choose today?
▪ How do you value flexibility?
▪ Newer & faster hardware arrives each year
  ▪ Servers you buy today will be in production for a few years
  ▪ Software can last even longer in production
▪ We have several generations of HW on the small data tiers
  ▪ Pure-disk (SAS array + HW RAID)
  ▪ Flashcache (SATA array + HW RAID, flash)
  ▪ Pure-flash

Page 5: Mark Callaghan, Facebook

Common definitions
▪ Sorted run - rows stored in key order
  ▪ may be stored using many range-partitioned files
▪ Memtable - sorted run in memory
▪ L0 - 1 or more sorted runs on disk
▪ L1, L2, ... Lmax - each is 1 sorted run on disk
  ▪ Lmax is the largest level
  ▪ by size L1 < L2 ... < Lmax
▪ live% - percentage of live data in the database

Page 6: Mark Callaghan, Facebook

Amplification factors
▪ Framework for describing efficiency of database algorithms
▪ How much is done physically in response to a logical change?
  ▪ Read amplification
  ▪ Write amplification
  ▪ Space amplification
▪ Can determine
  ▪ How many disks or flash you must buy
  ▪ How long your flash might last
  ▪ Whether you can buy lower endurance flash

Page 7: Mark Callaghan, Facebook

Read amplification
▪ Read-amp == disk reads per query
  ▪ Separate results for point query versus short range scan
  ▪ Assume some data is in cache
  ▪ Assume the index is covering for the query
▪ Example: b-tree with all non-leaf levels in cache
  ▪ Point read-amp - 1 disk read to get the leaf block
  ▪ Short range read-amp - 1 or 2 disk reads to get the leaf blocks

Page 8: Mark Callaghan, Facebook

Read amplification and bloom filters
▪ Bloom filter summary
  ▪ f(key) -> { no, maybe }
  ▪ Use ~10 bits/row to get reasonable false positive rate
  ▪ Great for avoiding disk reads on point queries
  ▪ Bonus - prevent disk reads for keys that don’t exist
▪ Useless for general range scans like “select x where y < 100”
▪ Can be useful for equality prefix like “select x where q = 10 and y < 100”
  ▪ use bloom filter on q
▪ Too many bloom filter checks can hurt response time
  ▪ each sorted run on disk needs a bloom filter check
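The f(key) -> { no, maybe } behavior can be sketched in a few lines. This is a minimal illustration, not the filter used by any particular engine: the class name `BloomFilter`, the use of SHA-256 as the hash, and the choice of 7 hash functions are assumptions; only the ~10 bits/row budget comes from the slide.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: answers 'no' (definitely absent) or 'maybe'."""

    def __init__(self, expected_rows, bits_per_row=10, num_hashes=7):
        # ~10 bits/row follows the slide; 7 hash functions is an assumption.
        self.size = max(1, expected_rows * bits_per_row)
        self.num_hashes = num_hashes
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, key):
        # False means the key was never added, so the disk read can be skipped.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A point query would call `maybe_contains` once per sorted run and read from disk only for runs that answer "maybe", which is why many sorted runs mean many filter checks.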

Page 9: Mark Callaghan, Facebook

Write amplification
▪ Write-amp == bytes written per byte changed
  ▪ Insert 100 bytes with write-amp=5 and 500 bytes will be written
  ▪ For now ignore penalty from small random writes
▪ Some writes done immediately, others are deferred
  ▪ Immediate -> redo log
  ▪ Deferred -> b-tree dirty pages not forced on commit, LSM compaction

Page 10: Mark Callaghan, Facebook

Write amplification, part 2
▪ HW can increase write-amp
  ▪ Read live pages and write them elsewhere when cleaning flash blocks
  ▪ Only a cost for algorithms that do small random writes
▪ Redo log writes can increase write-amp
  ▪ Writes must be done to a multiple of 512 or larger
  ▪ Insert 100 byte row, force 512 byte sector for redo has write-amp=5
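The redo-log example can be checked with a short calculation. This is a sketch under the assumption that each commit forces its own whole sectors (no group commit); the function name `redo_write_amp` is hypothetical.

```python
import math


def redo_write_amp(row_bytes, sector_bytes=512):
    """Bytes forced to the redo log per byte changed when every commit
    writes whole sectors (assumes one row per forced write)."""
    sectors = math.ceil(row_bytes / sector_bytes)
    return sectors * sector_bytes / row_bytes


# A 100 byte row forced in one 512 byte sector: 512 / 100 = 5.12,
# which the slide rounds to a write-amp of 5.
print(redo_write_amp(100))
```

Batching several commits into one sector write would lower this figure, which the sketch does not model.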

Page 11: Mark Callaghan, Facebook

Why write amplification matters
▪ Write endurance for flash device
  ▪ The wrong algorithm can wear out the device too soon
  ▪ The right algorithm might let you buy lower cost/endurance device
▪ Write-amp can predict peak performance
  ▪ If storage can sustain 400 MB/second of writes
  ▪ And write-amp is 10
  ▪ Then database can sustain 40 MB/second of changes
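The peak-performance estimate is a single division. A minimal sketch follows; the function name `sustainable_change_rate` is an assumption introduced for illustration.

```python
def sustainable_change_rate(storage_write_mb_s, write_amp):
    """Logical change rate (MB/s) the database can sustain when every
    changed byte costs write_amp bytes of physical writes."""
    if write_amp <= 0:
        raise ValueError("write_amp must be positive")
    return storage_write_mb_s / write_amp


# The slide's numbers: 400 MB/second of device writes, write-amp of 10.
print(sustainable_change_rate(400, 10))  # 40.0
```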

Page 12: Mark Callaghan, Facebook

Simple request - make counting faster
▪ Some web-scale workloads need to maintain counts
  ▪ Database is IO-bound
  ▪ Workload should be write-heavy, counters might not be read
▪ update foo set count = count + 1 where key = ‘bar’
  ▪ Read-modify-write
  ▪ Write-only: write delta, merge deltas later when queried/compacted
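The two ways of handling a counter update can be contrasted with a toy model. This is only a sketch: the class names, the in-memory dictionaries standing in for disk, and the `reads` field that counts would-be disk reads are all assumptions made for illustration.

```python
from collections import defaultdict


class ReadModifyWriteCounters:
    """Each increment first reads the current row, then writes it back."""

    def __init__(self):
        self.rows = {}
        self.reads = 0  # would be disk reads when the row is not cached

    def increment(self, key, delta=1):
        self.reads += 1
        self.rows[key] = self.rows.get(key, 0) + delta

    def get(self, key):
        return self.rows.get(key, 0)


class WriteOnlyCounters:
    """Each increment is a blind write of a delta; deltas are merged later."""

    def __init__(self):
        self.base = {}
        self.deltas = defaultdict(list)
        self.reads = 0

    def increment(self, key, delta=1):
        self.deltas[key].append(delta)  # no read needed

    def get(self, key):
        # Merge on read: base value plus any deltas not yet compacted.
        self.reads += 1
        return self.base.get(key, 0) + sum(self.deltas.get(key, ()))

    def compact(self):
        # Merge on compaction: fold the deltas into the base values.
        for key, pending in self.deltas.items():
            self.base[key] = self.base.get(key, 0) + sum(pending)
        self.deltas.clear()
```

With a write-heavy workload where counters are rarely queried, the write-only variant performs no reads at update time, which is the saving the slide is after.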

Page 13: Mark Callaghan, Facebook

Space amplification
▪ Space-amp == sizeof(database files) / sizeof(data)
  ▪ Ignore secondary indexes
  ▪ Assume database files are in steady state (fragmented & compacted)
  ▪ Space-amp == 100 / %live
▪ Things that change space amplification
  ▪ B-tree fragmentation
  ▪ Old versions of rows that are yet to be collected
  ▪ Compression
  ▪ Per row/page metadata (rollback pointer, transaction ID, ...)

Page 14: Mark Callaghan, Facebook

Space versus write amplification
▪ Sorry for the confusion
  ▪ Databases store N blocks in 1 extent
  ▪ Flash devices store N pages in 1 block
▪ Copy out
  ▪ Read live data from the cleaned extent, write it elsewhere
  ▪ Cost is a function of the percentage of live data
  ▪ Larger live% means less space and more write amplification
  ▪ Smaller live% means more space and less write amplification

Page 15: Mark Callaghan, Facebook

Space versus write amplification

[Figure: an old flash block with 75 dead pages and 25 live pages, assuming all blocks have 25% live pages. Block cleaning copies the 25 live pages into a new flash block, which then holds 25 copied pages and 75 pages ready for new writes.]

Write 100 pages total per 75 new page writes:
* %live is 25%
* write-amp is 100 / (100 - %live) == 100 / 75
* space-amp is 100 / %live == 4
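The two formulas can be evaluated together to see the trade-off. This is a minimal sketch; the function name `copy_out_amplification` is an assumption, and it assumes every cleaned block has the same percentage of live pages, as the slide does.

```python
def copy_out_amplification(live_pct):
    """Return (write_amp, space_amp) for a copy-out cleaner when every
    cleaned block or extent holds live_pct percent live data."""
    if not 0 < live_pct < 100:
        raise ValueError("live_pct must be between 0 and 100, exclusive")
    write_amp = 100 / (100 - live_pct)
    space_amp = 100 / live_pct
    return write_amp, space_amp


# 25% live: 100 pages written per 75 new pages, and 4x the space.
write_amp, space_amp = copy_out_amplification(25)
print(round(write_amp, 2), space_amp)  # 1.33 4.0
```

Raising `live_pct` moves the result toward less space and more write amplification, and lowering it does the opposite.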

Page 16: Mark Callaghan, Facebook

Disclaimer
▪ There are many assumptions in the rest of the slides.
  ▪ Assumption #1: workloads have no skew.
    ▪ Most real workloads have skew.
    ▪ Let's save skew for a much longer discussion
  ▪ Assumption #2: workload is update-only
▪ I am trying to start a discussion rather than solve everything.
  ▪ This won’t be confused with a lecture on algorithm analysis.
  ▪ We might disagree on technology, but we can agree on terminology

Page 17: Mark Callaghan, Facebook

Database algorithms
▪ B-tree
  ▪ Update-in-place (UIP)
  ▪ Copy-on-write using sequential (COW-S) and random (COW-R) writes
▪ Log structured merge tree (LSM)
  ▪ LevelDB-style compaction (leveled)
  ▪ HBase-style compaction (n-files, size-tiered)
▪ Other
  ▪ Log-only - Bitcask
  ▪ Memtable + L1 - Sophia via sphia.org
  ▪ Memtable, L0, L1 - MaSM
  ▪ TokuDB/TokuMX - fractional cascading

Page 18: Mark Callaghan, Facebook

B-tree

algorithm  fixed-page (fragments)  in-place  write-back    needs garbage collection (block or extent cleaning)  example
UIP        yes                     yes       single-block  HW GC if flash                                       InnoDB
COW-R      yes                     no        single-block  HW GC if flash                                       LMDB
COW-S      no                      no        multi-block   SW GC                                                ?

Page 19: Mark Callaghan, Facebook

B-tree: UIP and COW-R
▪ When non-leaf levels are in cache
  ▪ Point read-amp is 1, range read-amp is 1 or 2
▪ When dirty pages are forced after each row change
  ▪ Write-amp is sizeof(page) / sizeof(row)
  ▪ More write-amp from torn-page protection
  ▪ Add +1 for redo log
  ▪ Include HW write-amp when using flash
  ▪ Forcing data pages too soon increases write-amp
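A rough estimate of b-tree write-amp follows from the page/row ratio. In this sketch the function name `btree_write_amp` and the parameter `rows_changed_per_page_write` are assumptions, the 8192 byte page and 128 byte row are hypothetical values chosen for the example, and torn-page protection and HW write-amp are deliberately left out.

```python
def btree_write_amp(page_bytes, row_bytes, rows_changed_per_page_write=1,
                    redo=True):
    """Estimate write-amp for an update-in-place or COW-R b-tree.

    Each page write costs page_bytes and is amortized over the rows
    changed in that page since it was last written. Adds 1 for the redo
    log. Ignores torn-page protection and flash (HW) write-amp.
    """
    page_amp = page_bytes / (row_bytes * rows_changed_per_page_write)
    return page_amp + (1 if redo else 0)


# Page forced after every row change: 8192 / 128 + 1 = 65.
print(btree_write_amp(8192, 128))
# Page written after 8 row changes accumulate: 8192 / (128 * 8) + 1 = 9.
print(btree_write_amp(8192, 128, rows_changed_per_page_write=8))
```

The second call shows why forcing data pages too soon increases write-amp: deferring the write lets more changes share one page write.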

Page 20: Mark Callaghan, Facebook

B-tree: UIP and COW-R, space amplification
▪ Fragmentation because b-tree pages are not full on average
  ▪ After a page split 1 full page becomes 2 half-full pages
  ▪ With InnoDB we have many indexes with pages that are ~60% full
▪ Fixed page size reduces compression, with InnoDB 2X compression
  ▪ Default fixed page size is 8kb
  ▪ Compress 16kb to 6kb, still write out 8kb
▪ It is hard to use a compression window larger than one page
▪ Per-row metadata uses 13+ bytes on InnoDB
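The effect of a fixed compressed page size can be shown numerically. This is a simplified sketch: the function names are hypothetical, and rounding the compressed size up to whole fixed-size pages is a modeling assumption rather than a description of how InnoDB handles a page that does not fit.

```python
import math


def stored_size_kb(compressed_kb, page_kb=8):
    """Space actually written when output is padded to fixed-size pages."""
    return math.ceil(compressed_kb / page_kb) * page_kb


def effective_compression(uncompressed_kb, compressed_kb, page_kb=8):
    """Compression ratio after padding to the fixed page size."""
    return uncompressed_kb / stored_size_kb(compressed_kb, page_kb)


# 16kb compresses to 6kb (about 2.7X) but 8kb is written, so only 2X is realized.
print(effective_compression(16, 6))  # 2.0
```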

Page 21: Mark Callaghan, Facebook

B-tree: COW-S
▪ Read amplification is the same as for UIP and COW-R
▪ Write amplification
  ▪ Smaller page size from better compression and no fragmentation
  ▪ Has SW write-amp, cost of cleaning previously written extents
  ▪ No HW write-amp on flash
▪ Space amplification
  ▪ Compresses better than UIP/COW-R because page size not fixed
  ▪ Almost no fragmentation
  ▪ Space-amp from old versions of pages that have yet to be cleaned
  ▪ More (less) space-amp means less (more) write-amp

Page 22: Mark Callaghan, Facebook

LSM with leveled compaction
▪ Implemented by LevelDB and Cassandra
▪ Database is memtable, L0, L1, ..., Lmax
▪ Less read-amp and space-amp, more write-amp
▪ Similar to original LSM design from paper by O’Neil
  ▪ Difference is the use of many range-partitioned files per level
  ▪ Increases write-amp by a small amount
  ▪ Prevents temporary doubling of Lmax during compaction
▪ Compaction from L1 to L2
  ▪ reads N bytes from L1
  ▪ reads 10*N bytes from L2
  ▪ writes 10*N + N bytes back to L2
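A compaction step merges a newer sorted run into an older one, and its IO cost follows from the level size ratio. The sketch below is illustrative only: `compact` and `compaction_io` are hypothetical names, runs are modeled as lists of (key, value) pairs, and the fanout of 10 is the ratio used on the slide.

```python
def compact(newer_run, older_run):
    """Merge two sorted runs of (key, value) pairs into one sorted run.
    When a key appears in both, the value from the newer run wins."""
    merged = dict(older_run)
    merged.update(dict(newer_run))
    return sorted(merged.items())


def compaction_io(n_bytes, fanout=10):
    """Bytes read and written when n_bytes from one level are merged
    into the overlapping fanout * n_bytes of the next level."""
    bytes_read = n_bytes + fanout * n_bytes
    bytes_written = fanout * n_bytes + n_bytes
    return bytes_read, bytes_written


print(compact([(1, "new"), (3, "c")], [(1, "old"), (2, "b")]))
# [(1, 'new'), (2, 'b'), (3, 'c')]
print(compaction_io(100))  # (1100, 1100)
```

Writing 11*N bytes to move N bytes down one level is where the "10 per level" write-amp estimate for leveled compaction comes from.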

Page 23: Mark Callaghan, Facebook

LSM with leveled compaction

[Figure: memtable above three levels. Level 0 (1GB) has files that each cover the full key range (keys: 0..99). Level 1 (1GB) has range-partitioned files (keys: 00..01, keys: 11..19, keys: 90..99). Level 2 (10GB) holds 10X more data in range-partitioned files (keys: 000..001, keys: 002..003, ..., keys: 90..99).]

Page 24: Mark Callaghan, Facebook

LSM with leveled compaction
▪ Point read amplification
  ▪ 1 bloom filter check per L0 file and per level for L1->Lmax + 1 disk read
▪ Range read amplification
  ▪ 1 disk read per level and per L0 file, bloom filters don’t help
▪ Write amplification
  ▪ 10 per level starting with L2 + 1 for redo + 1 for L0 + ~1 for L1
▪ Space amplification
  ▪ 1.1 assuming 90% of data is on the maximum level
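The write-amplification estimate can be summed per level. A minimal sketch, assuming the slide's accounting (1 for redo, 1 for L0, about 1 for L1, then 10 for each level from L2 to Lmax); the function name `leveled_write_amp` and the example with Lmax = L4 are assumptions.

```python
def leveled_write_amp(max_level, per_level=10):
    """Estimated write-amp for leveled compaction with levels L0..Lmax.

    Counts 1 for the redo log, 1 for the flush to L0, roughly 1 for L1,
    and per_level for each level from L2 through Lmax.
    """
    if max_level < 1:
        raise ValueError("max_level must be at least 1")
    redo, l0, l1 = 1, 1, 1
    deeper_levels = max_level - 1  # number of levels in L2..Lmax
    return redo + l0 + l1 + per_level * deeper_levels


# With Lmax = L4 there are three levels (L2, L3, L4) that each cost 10.
print(leveled_write_amp(4))  # 33
```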

Page 25: Mark Callaghan, Facebook

LSM with n-files compaction
▪ Implemented by HBase, WiredTiger and Cassandra
▪ Database is memtable, L0, L1
  ▪ Files in L0 have varying sizes
▪ Less write-amp, more read-amp and space-amp
▪ Compaction cost determined by:
  ▪ #files merged at a time
  ▪ sizeof(L1) / sizeof(file created by memtable flush)
▪ If memtable is 1 GB, L1 is 64 GB, 2 files are merged at a time
  ▪ then a row is written to files of size 1, 2, 4, 8, 16, 32 and 64 GB
  ▪ write-amp is 7
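The example can be reproduced by listing the file sizes a row passes through. This is a sketch assuming that equal-sized files are always merged together and that the redo log is counted separately; the function name `n_files_write_amp` is hypothetical.

```python
def n_files_write_amp(memtable_gb, l1_gb, files_merged=2):
    """Return (write_amp, file_sizes) for n-files compaction.

    A row is first written to a file the size of the memtable, then
    rewritten each time files_merged equal-sized files are merged, until
    it reaches a file the size of L1. Redo log writes are not counted.
    """
    if files_merged < 2:
        raise ValueError("files_merged must be at least 2")
    sizes = []
    size = memtable_gb
    while size <= l1_gb:
        sizes.append(size)
        size *= files_merged
    return len(sizes), sizes


print(n_files_write_amp(1, 64))  # (7, [1, 2, 4, 8, 16, 32, 64])
```

Merging more files at a time shortens the list and lowers write-amp, at the price of more files to check on reads.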

Page 26: Mark Callaghan, Facebook

LSM with n-files compaction, L1=64 GB

[Figure: a 1 GB memtable flushes to L0 files, which are merged in pairs of equal size (1 GB, 2 GB + 2 GB, 4 GB + 4 GB, 8 GB + 8 GB, 16 GB + 16 GB, 32 GB + 32 GB) until they form the 64 GB L1.]

Page 27: Mark Callaghan, Facebook

LSM with n-files compaction
▪ Point read amplification
  ▪ 1 bloom filter check per file + 1 disk read
▪ Range read amplification
  ▪ 1 disk read per file, bloom filters don’t help with range scans
▪ Write amplification
  ▪ Usually much less than leveled compaction
  ▪ Trade write for space amplification
  ▪ Add 1 for redo
▪ Space amplification
  ▪ Usually greater than 2

Page 28: Mark Callaghan, Facebook

Log-only
▪ Bitcask (part of Riak/Basho) is an example of this
▪ Data is written 1+ times
  ▪ Write data once to a log
  ▪ Write again when row is live during log cleaning
▪ Copy data from tail to head of log when out of disk space

Page 29: Mark Callaghan, Facebook

Log-only

[Figure: log files Log 4 (newest log file) down to Log 1 (oldest log file). New data is appended to the newest log. The cleaner reads the oldest log, copies live data forward and sends dead data to /dev/null.]

Page 30: Mark Callaghan, Facebook

Log-only
▪ Point read amplification is 1
▪ Range read amplification is one per value in the range
▪ Write and space amplification are related
  ▪ Write amplification is 100 / (100 - %live)
  ▪ Space amplification is 100 / %live
▪ When 66% of the data in the logs is live
  ▪ Space-amp is 1.5
  ▪ Write-amp is 3

Page 31: Mark Callaghan, Facebook

Memtable + L1
▪ I think Sophia (sphia.org) is an example of this
▪ Database is memtable, L1
▪ Do compaction between memtable & L1 when memtable is full
▪ Great when database on disk not too much bigger than RAM

Page 32: Mark Callaghan, Facebook

Memtable + L1

[Figure: the memtable and the existing L1 are compacted into a new L1.]

Page 33: Mark Callaghan, Facebook

Memtable + L1
▪ Point read amplification is 1
▪ Range read amplification is 1
▪ Write amplification
  ▪ The ratio sizeof(database) / sizeof(memtable)
  ▪ +1 for redo log
▪ Space amplification is 1
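Because every compaction rewrites all of L1 to absorb one memtable of changes, write-amp grows with the ratio of database size to memtable size. A minimal sketch; the function name `memtable_l1_write_amp` and the 10 GB and 1 GB sizes are hypothetical.

```python
def memtable_l1_write_amp(database_gb, memtable_gb):
    """Write-amp for memtable + L1: the whole database is rewritten for
    each memtable of changes, plus 1 for the redo log."""
    if memtable_gb <= 0:
        raise ValueError("memtable_gb must be positive")
    return database_gb / memtable_gb + 1


# A 10 GB database with a 1 GB memtable: 10 / 1 + 1 = 11.
print(memtable_l1_write_amp(10, 1))  # 11.0
```

This is why the design suits databases that are not much bigger than RAM: the ratio, and with it the write-amp, stays small.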

Page 34: Mark Callaghan, Facebook

Memtable + L0 + L1
▪ MaSM is an example of this
▪ Database is memtable, L0, L1
  ▪ sizeof(L0) == sizeof(L1)
  ▪ Looks like file structures from 2-pass external sort
▪ Tradeoffs
  ▪ Minimize write-amp
  ▪ Maximize read-amp

Page 35: Mark Callaghan, Facebook

Memtable + L0 + L1

[Figure: the memtable flushes to several L0 files. On compaction all L0 files are merged with L1.]

Page 36: Mark Callaghan, Facebook

Memtable + L0 + L1
▪ Point read amplification is 1 disk read + many bloom filter checks
▪ Range read amplification is 1 disk read per L0 file + 1
▪ Write amplification is 3
  ▪ Write to redo log, L0 and L1
▪ Space amplification is 2

Page 37: Mark Callaghan, Facebook

TokuDB, TokuMX
▪ Read amplification
  ▪ 1 disk read for point queries
  ▪ 1 or 2 disk reads for range read queries
▪ Write amplification
  ▪ 10 per level + 1 for redo
  ▪ Won’t use as many levels as LevelDB
▪ Space amplification
  ▪ No internal fragmentation, variable sized pages written
  ▪ Similar to LevelDB

Page 38: Mark Callaghan, Facebook

Database algorithms

algorithm       point read-amp  range read-amp  write-amp         space-amp
UIP b-tree      1               1 or 2          page/row * HW GC  1.5 to 2
COW-R b-tree    1               1 or 2          page/row * HW GC  1.5 to 2
COW-S b-tree    1               1 or 2          page/row * SW GC  1
LSM leveled     1 + N*bloom     N               10 per level      1.1 X
LSM n-files     1 + N*bloom     N               can be < 10       can be > 2
log-only        1               N               1 / (1 - %live)   1 / %live
memtable+L1     1               1               database/mem      1
memtable+L0+L1  1 + N*bloom     N               3                 2
tokudb          1               2               10 per level      1.1 X

Page 39: Mark Callaghan, Facebook

Two things to remember
▪ You can trade space/read versus write amplification
  ▪ Switch database algorithms or tune existing algorithm
  ▪ Hard to minimize read, write & space amplification
▪ One size doesn’t fit all
  ▪ The workload I care about has different types of indexes
  ▪ Some indexes should be optimized for short range scans
  ▪ Other indexes can be optimized for write amplification
  ▪ would be nice to support both in one database engine

Page 40: Mark Callaghan, Facebook

Thank you
facebook.com/MySQLatFacebook

Mark Callaghan
Small Data Engineer
October, 2013