MySQL versus something else
Evaluating alternative databases

Mark Callaghan, Small Data Engineer
October 2013

Friday, October 25, 13
What metric is important?
▪ Throughput
▪ Throughput while minimizing response time variance
▪ Efficiency - reduce cost while meeting response time goals
My focus is storage efficiency
▪ Use flash to get IOPs
▪ Use spinning disks to get capacity
▪ Use both to reduce cost while improving quality of service

device     frequent reads  frequent writes  read IOPs  write IOPs
flash      yes             yes              yes        maybe
flash      yes             no               yes        no
SATA       no              yes              no         maybe
/dev/null  no              no               no         no
What technology would you choose today?
▪ How do you value flexibility?
▪ Newer & faster hardware arrives each year
▪ Servers you buy today will be in production for a few years
  ▪ Software can last even longer in production
▪ We have several generations of HW on the small data tiers
  ▪ Pure-disk (SAS array + HW RAID)
  ▪ Flashcache (SATA array + HW RAID, flash)
  ▪ Pure-flash
Common definitions
▪ Sorted run - rows stored in key order
  ▪ may be stored using many range-partitioned files
▪ Memtable - sorted run in memory
▪ L0 - 1 or more sorted runs on disk
▪ L1, L2, ... Lmax - each is 1 sorted run on disk
▪ Lmax is the largest level
  ▪ by size L1 < L2 ... < Lmax
▪ live% - percentage of live data in the database
Amplification factors
▪ Framework for describing the efficiency of database algorithms
▪ How much is done physically in response to a logical change?
  ▪ Read amplification
  ▪ Write amplification
  ▪ Space amplification
▪ Can determine
  ▪ How many disks or flash devices you must buy
  ▪ How long your flash might last
  ▪ Whether you can buy lower endurance flash
Read amplification
▪ Read-amp == disk reads per query
▪ Separate results for point query versus short range scan
▪ Assume some data is in cache
▪ Assume the index is covering for the query
▪ Example: b-tree with all non-leaf levels in cache
  ▪ Point read-amp - 1 disk read to get the leaf block
  ▪ Short range read-amp - 1 or 2 disk reads to get the leaf blocks
Read amplification and bloom filters
▪ Bloom filter summary
  ▪ f(key) -> { no, maybe }
  ▪ Use ~10 bits/row to get a reasonable false positive rate
  ▪ Great for avoiding disk reads on point queries
  ▪ Bonus - prevent disk reads for keys that don't exist
▪ Useless for general range scans like "select x where y < 100"
  ▪ Can be useful for an equality prefix like "select x where q = 10 and y < 100"
  ▪ use the bloom filter on q
▪ Too many bloom filter checks can hurt response time
  ▪ each sorted run on disk needs a bloom filter check
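The f(key) -> { no, maybe } contract can be sketched in a few lines of Python. This is an illustrative sketch only; the class name, sizing and hash scheme are my choices, not from the talk:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch: check(key) returns "no" or "maybe"."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            h = hashlib.sha256(b"%d:%s" % (i, key.encode())).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def check(self, key):
        # "no" is always correct; "maybe" can be a false positive.
        for pos in self._positions(key):
            if not self.bits[pos // 8] & (1 << (pos % 8)):
                return "no"
        return "maybe"

# ~10 bits/row with 7 hashes gives roughly a 1% false positive rate.
bf = BloomFilter(num_bits=1000, num_hashes=7)
for k in ("key-%d" % i for i in range(100)):
    bf.add(k)
print(bf.check("key-5"))        # maybe (the key was added)
print(bf.check("missing-key"))  # no, unless a rare false positive
```

A "no" lets an LSM skip the disk read for a sorted run entirely, which is why point queries benefit but range scans (which must visit every run anyway) do not.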
Write amplification
▪ Write-amp == bytes written per byte changed
▪ Insert 100 bytes with write-amp=5 and 500 bytes will be written
▪ For now ignore the penalty from small random writes
▪ Some writes are done immediately, others are deferred
  ▪ Immediate -> redo log
  ▪ Deferred -> b-tree dirty pages not forced on commit, LSM compaction
Write amplification, part 2
▪ HW can increase write-amp
  ▪ Read live pages and write them elsewhere when cleaning flash blocks
  ▪ Only a cost for algorithms that do small random writes
▪ Redo log writes can increase write-amp
  ▪ Writes must be done to a multiple of 512 bytes or larger
  ▪ Insert a 100 byte row, force a 512 byte sector for redo: write-amp=5
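The sector-rounding effect can be worked out directly (a sketch; the function name is mine):

```python
SECTOR = 512  # redo log writes are rounded up to a sector multiple

def redo_write_amp(row_bytes, sector=SECTOR):
    """Bytes physically written per byte logically changed when each
    commit forces one sector-aligned redo write."""
    physical = -(-row_bytes // sector) * sector  # round up to a sector multiple
    return physical / row_bytes

print(redo_write_amp(100))  # 5.12, roughly the write-amp=5 from the slide
print(redo_write_amp(512))  # 1.0, a full sector has no rounding penalty
```

Group commit amortizes this cost by packing several transactions into one sector-aligned write.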
Why write amplification matters
▪ Write endurance for flash devices
  ▪ The wrong algorithm can wear out the device too soon
  ▪ The right algorithm might let you buy a lower cost/endurance device
▪ Write-amp can predict peak performance
  ▪ If storage can sustain 400 MB/second of writes
  ▪ And write-amp is 10
  ▪ Then the database can sustain 40 MB/second of changes
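The peak-performance estimate is a single division (a sketch; the function name is mine and the 400 MB/second figure is the slide's example):

```python
def sustained_change_rate(device_write_mbps, write_amp):
    """Peak rate of logical changes the database can absorb, given the
    device's sustained write bandwidth and the algorithm's write-amp."""
    return device_write_mbps / write_amp

print(sustained_change_rate(400, 10))  # 40.0 MB/second of changes
```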
Simple request - make counting faster
▪ Some web-scale workloads need to maintain counts
▪ Database is IO-bound
▪ Workload should be write-heavy, counters might not be read
▪ update foo set count = count + 1 where key = 'bar'
  ▪ Read-modify-write
  ▪ Write-only: write a delta, merge deltas later when queried/compacted
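The write-only alternative can be sketched as blind writes of deltas that are merged when the counter is read or compacted (an in-memory sketch of the idea, not any engine's API):

```python
from collections import defaultdict

class DeltaCounter:
    """Write-only counter maintenance: appending a delta avoids the
    disk read that read-modify-write would need."""

    def __init__(self):
        self.deltas = defaultdict(list)  # key -> list of unmerged deltas

    def increment(self, key, delta=1):
        # Blind write: no read is needed to apply the change.
        self.deltas[key].append(delta)

    def read(self, key):
        # Merge-on-read: collapse the deltas, as compaction eventually would.
        value = sum(self.deltas[key])
        self.deltas[key] = [value]
        return value

c = DeltaCounter()
for _ in range(3):
    c.increment("bar")
print(c.read("bar"))  # 3
```

For an IO-bound, write-heavy workload this turns one read plus one write per update into just one write.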
Space amplification
▪ Space-amp == sizeof(database files) / sizeof(data)
▪ Ignore secondary indexes
▪ Assume database files are in steady state (fragmented & compacted)
▪ Space-amp == 100 / %live
▪ Things that change space amplification
  ▪ B-tree fragmentation
  ▪ Old versions of rows that have yet to be collected
  ▪ Compression
  ▪ Per row/page metadata (rollback pointer, transaction ID, ...)
Space versus write amplification
▪ Sorry for the confusion
  ▪ Databases store N blocks in 1 extent
  ▪ Flash devices store N pages in 1 block
▪ Copy out
  ▪ Read live data from the cleaned extent, write it elsewhere
  ▪ Cost is a function of the percentage of live data
  ▪ Larger live% means less space and more write amplification
  ▪ Smaller live% means more space and less write amplification
Space versus write amplification

[Diagram: block cleaning with 25% live pages. An old flash block has 75
dead pages and 25 live pages; cleaning copies the 25 live pages to a new
block, leaving 75 pages ready for new writes.]

▪ Write 100 pages total per 75 new page writes:
  ▪ %live is 25%
  ▪ write-amp is 100 / (100 - %live) == 100 / 75
  ▪ space-amp is 100 / %live == 4
Disclaimer
▪ There are many assumptions in the rest of the slides.
▪ Assumption #1: workloads have no skew.
  ▪ Most real workloads have skew.
  ▪ Let's save skew for a much longer discussion
▪ Assumption #2: workload is update-only
▪ I am trying to start a discussion rather than solve everything.
  ▪ This won't be confused with a lecture on algorithm analysis.
  ▪ We might disagree on technology, but we can agree on terminology
Database algorithms
▪ B-tree
  ▪ Update-in-place (UIP)
  ▪ Copy-on-write using sequential (COW-S) and random (COW-R) writes
▪ Log structured merge tree (LSM)
  ▪ LevelDB-style compaction (leveled)
  ▪ HBase-style compaction (n-files, size-tiered)
▪ Other
  ▪ Log-only - Bitcask
  ▪ Memtable + L1 - Sophia via sphia.org
  ▪ Memtable, L0, L1 - MaSM
  ▪ TokuDB/TokuMX - fractional cascading
B-tree

algorithm  fixed-page   in-place    needs garbage collection     example
           (fragments)  write-back  (block or extent cleaning)
UIP        yes          yes         single-block HW GC if flash  InnoDB
COW-R      yes          no          single-block HW GC if flash  LMDB
COW-S      no           no          multi-block SW GC            ?
B-tree: UIP and COW-R
▪ When non-leaf levels are in cache
  ▪ Point read-amp is 1, range read-amp is 1 or 2
▪ When dirty pages are forced after each row change
  ▪ Write-amp is sizeof(page) / sizeof(row)
  ▪ More write-amp from torn-page protection
  ▪ Add +1 for the redo log
  ▪ Include HW write-amp when using flash
  ▪ Forcing data pages too soon increases write-amp
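The worst case above can be put into one formula (a sketch; the torn_page_factor modeling doublewrite-style protection and the 16kb-page/100-byte-row example are my assumptions):

```python
def btree_write_amp(page_bytes, row_bytes, torn_page_factor=2, redo=1):
    """Worst-case b-tree write-amp when each row change forces its page:
    the whole page is written per changed row, doubled by doublewrite-style
    torn-page protection, plus the redo log write."""
    return (page_bytes / row_bytes) * torn_page_factor + redo

# 16kb page, 100 byte rows, doublewrite + redo
print(round(btree_write_amp(16 * 1024, 100), 2))  # 328.68
```

Real workloads do much better because dirty pages absorb many row changes before being written back, which is why forcing pages too soon hurts.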
B-tree: UIP and COW-R, space amplification
▪ Fragmentation because b-tree pages are not full on average
  ▪ After a page split, 1 full page becomes 2 half-full pages
  ▪ With InnoDB we have many indexes with pages that are ~60% full
▪ Fixed page size reduces compression, with InnoDB 2X compression
  ▪ Default fixed page size is 8kb
  ▪ Compress 16kb to 6kb, still write out 8kb
▪ It is hard to use a compression window larger than one page
▪ Per-row metadata uses 13+ bytes in InnoDB
B-tree: COW-S
▪ Read amplification is the same as for UIP and COW-R
▪ Write amplification
  ▪ Smaller page size from better compression and no fragmentation
  ▪ Has SW write-amp, the cost of cleaning previously written extents
  ▪ No HW write-amp on flash
▪ Space amplification
  ▪ Compresses better than UIP/COW-R because the page size is not fixed
  ▪ Almost no fragmentation
  ▪ Space-amp from old versions of pages that have yet to be cleaned
  ▪ More (less) space-amp means less (more) write-amp
LSM with leveled compaction
▪ Implemented by LevelDB and Cassandra
▪ Database is memtable, L0, L1, ..., Lmax
▪ Less read-amp and space-amp, more write-amp
▪ Similar to the original LSM design from the paper by O'Neil
  ▪ Difference is the use of many range-partitioned files per level
  ▪ Increases write-amp by a small amount
  ▪ Prevents temporary doubling of Lmax during compaction
▪ Compaction from L1 to L2
  ▪ reads N bytes from L1
  ▪ reads 10*N bytes from L2
  ▪ writes 10*N + N bytes back to L2
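Adding up the per-level costs gives a rough total write-amp (a sketch; the function name and the example level count are my assumptions):

```python
def leveled_write_amp(fanout=10, deep_levels=4, redo=1, l0=1, l1=1):
    """Approximate write-amp for leveled compaction: each byte is rewritten
    ~fanout times at each level from L2 to Lmax (deep_levels of them), plus
    1 for the redo log, 1 for the L0 flush and ~1 for the L0->L1 merge."""
    return redo + l0 + l1 + fanout * deep_levels

# memtable -> L0 -> L1 -> L2..L5 with 10x fanout between levels
print(leveled_write_amp(fanout=10, deep_levels=4))  # 43
```

This is why leveled compaction trades high write-amp for low read-amp and space-amp.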
LSM with leveled compaction

[Diagram: the memtable flushes into Level 0 (1GB), whose files have
overlapping key ranges (each spans keys 0..99). Level 1 (1GB) holds
range-partitioned files with disjoint key ranges, and Level 2 (10GB)
holds 10X more data, also range-partitioned.]
LSM with leveled compaction
▪ Point read amplification
  ▪ 1 bloom filter check per L0 file and per level for L1->Lmax + 1 disk read
▪ Range read amplification
  ▪ 1 disk read per level and per L0 file, bloom filters don't help
▪ Write amplification
  ▪ 10 per level starting with L2 + 1 for redo + 1 for L0 + ~1 for L1
▪ Space amplification
  ▪ 1.1 assuming 90% of the data is on the maximum level
LSM with n-files compaction
▪ Implemented by HBase, WiredTiger and Cassandra
▪ Database is memtable, L0, L1
  ▪ Files in L1 have varying sizes
▪ Less write-amp, more read-amp and space-amp
▪ Compaction cost determined by:
  ▪ #files merged at a time
  ▪ sizeof(L1) / sizeof(file created by memtable flush)
▪ If the memtable is 1 GB, L1 is 64 GB and 2 files are merged at a time
  ▪ then a row is written to files of size 1, 2, 4, 8, 16, 32 and 64 GB
  ▪ write-amp is 7
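The doubling argument can be checked with a few lines (a sketch; the function name is mine):

```python
def nfiles_write_amp(memtable_gb, max_file_gb):
    """Times a row is rewritten when equal-size files are merged two at a
    time: once for the memtable flush, then once per doubling up to the
    largest file."""
    writes, size = 1, memtable_gb
    while size < max_file_gb:
        size *= 2      # merging two files yields one file twice as big
        writes += 1
    return writes

# 1 GB memtable, files of 1, 2, 4, 8, 16, 32 and 64 GB
print(nfiles_write_amp(1, 64))  # 7
```

So write-amp grows only logarithmically with the data size, at the cost of keeping many files (and their duplicate key ranges) around.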
LSM with n-files compaction, L1 = 64 GB

[Diagram: a 1 GB memtable flushes into L0, which holds files of 1, 2, 2,
4, 4, 8, 8, 16, 16, 32 and 32 GB; merging two same-size files at a time
eventually produces the 64 GB L1.]
LSM with n-files compaction
▪ Point read amplification
  ▪ 1 bloom filter check per file + 1 disk read
▪ Range read amplification
  ▪ 1 disk read per file, bloom filters don't help with range scans
▪ Write amplification
  ▪ Usually much less than leveled compaction
  ▪ Trade write for space amplification
  ▪ Add 1 for redo
▪ Space amplification
  ▪ Usually greater than 2
Log-only
▪ Bitcask (part of Riak from Basho) is an example of this
▪ Data is written 1+ times
  ▪ Write data once to a log
  ▪ Write it again when the row is live during log cleaning
▪ Copy data from the tail to the head of the log when out of disk space
Log-only

[Diagram: Log 1 is the oldest log file and Log 4 the newest. The cleaner
reads the oldest file, copies live data to the head of the log alongside
new data, and sends dead data to /dev/null.]
Log-only
▪ Point read amplification is 1
▪ Range read amplification is one per value in the range
▪ Write and space amplification are related
  ▪ Write amplification is 100 / (100 - %live)
  ▪ Space amplification is 100 / %live
▪ When 66% of the data in the logs is live
  ▪ Space-amp is 1.5
  ▪ Write-amp is 3
Memtable + L1
▪ I think Sophia (sphia.org) is an example of this
▪ Database is memtable, L1
▪ Do compaction between the memtable & L1 when the memtable is full
▪ Great when the database on disk is not too much bigger than RAM
Memtable + L1

[Diagram: the memtable and the current L1 are compacted (merged) to
produce a new L1.]
Memtable + L1
▪ Point read amplification is 1
▪ Range read amplification is 1
▪ Write amplification
  ▪ The ratio sizeof(database) / sizeof(memtable)
  ▪ +1 for the redo log
▪ Space amplification is 1
Memtable + L0 + L1
▪ MaSM is an example of this
▪ Database is memtable, L0, L1
  ▪ sizeof(L0) == sizeof(L1)
  ▪ Looks like the file structures from a 2-pass external sort
▪ Tradeoffs
  ▪ Minimize write-amp
  ▪ Maximize read-amp
Memtable + L0 + L1

[Diagram: the memtable flushes to L0, which holds several same-size
sorted runs; compaction merges the memtable, all L0 runs and L1 into a
new L1 ("merge all on compaction").]
Memtable + L0 + L1
▪ Point read amplification is 1 disk read + many bloom filter checks
▪ Range read amplification is 1 disk read per L0 file + 1
▪ Write amplification is 3
  ▪ Write to the redo log, L0 and L1
▪ Space amplification is 2
TokuDB, TokuMX
▪ Read amplification
  ▪ 1 disk read for point queries
  ▪ 1 or 2 disk reads for range read queries
▪ Write amplification
  ▪ 10 per level + 1 for redo
  ▪ Won't use as many levels as LevelDB
▪ Space amplification
  ▪ No internal fragmentation, variable size pages are written
  ▪ Similar to LevelDB
Database algorithms

algorithm       point read-amp  range read-amp  write-amp           space-amp
UIP b-tree      1               1 or 2          page/row * HW GC    1.5 to 2
COW-R b-tree    1               1 or 2          page/row * HW GC    1.5 to 2
COW-S b-tree    1               1 or 2          page/row * SW GC    1
LSM leveled     1 + N*bloom     N               10 per level        1.1X
LSM n-files     1 + N*bloom     N               can be < 10         can be > 2
log-only        1               N               100/(100 - %live)   100/%live
memtable+L1     1               1               database/memtable   1
memtable+L0+L1  1 + N*bloom     N               3                   2
tokudb          1               2               10 per level        1.1X
Two things to remember
▪ You can trade space/read versus write amplification
  ▪ Switch database algorithms or tune an existing algorithm
  ▪ Hard to minimize read, write & space amplification at the same time
▪ One size doesn't fit all
  ▪ The workload I care about has different types of indexes
  ▪ Some indexes should be optimized for short range scans
  ▪ Other indexes can be optimized for write amplification
  ▪ Would be nice to support both in one database engine
Thank you
facebook.com/MySQLatFacebook

Mark Callaghan, Small Data Engineer
October 2013