HBase Low Latency Nick Dimiduk, Hortonworks (@xefyr) Nicolas Liochon, Scaled Risk (@nkeywal) HBaseCon May 5, 2014
Aug 27, 2014
Agenda
• Latency: what it is, how to measure it
• Write path
• Read path
• Next steps
What’s low latency?
• Latency is about percentiles
  • Long-tail issue
  • There are often orders of magnitude between the « average » and the « 95th percentile »
  • Post 99% = the « magical 1% ». Work in progress here.
• Meaning ranges from microseconds (High Frequency Trading) to seconds (interactive queries)
  • In this talk: milliseconds
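To make the long-tail point concrete, here is a minimal sketch that computes mean and percentiles over a set of latency samples. The samples are made up for illustration (they are not measurements from this talk), and the nearest-rank percentile definition used here is one of several; other tools may interpolate.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct (1..100)."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, to avoid floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical latencies in ms: mostly fast calls, a few slow ones, two very slow ones.
latencies_ms = [1.0] * 90 + [20.0] * 8 + [250.0, 900.0]

mean = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean)
print("p50 :", percentile(latencies_ms, 50))
print("p95 :", percentile(latencies_ms, 95))
print("p99 :", percentile(latencies_ms, 99))
print("max :", max(latencies_ms))
```

With samples like these, the median, the 95th percentile and the 99th percentile each sit roughly an order of magnitude apart, and the mean describes none of them well.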
Measure latency – during test
• bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
  • More options related to HBase: autoflush, replicas, …
  • Latency measured in microseconds
  • Easier for internal analysis
• YCSB
  • Useful for comparison between tools
  • Set of workloads already defined
Measure latency: exposed by HBase
"QueueCallTime_num_ops" : 33044,
"QueueCallTime_min" : 0,
"QueueCallTime_max" : 86,
"QueueCallTime_mean" : 0.2525420651252875,
"QueueCallTime_median" : 0.0,
"QueueCallTime_75th_percentile" : 0.0,
"QueueCallTime_95th_percentile" : 1.0,
"QueueCallTime_99th_percentile" : 1.0,

"SyncTime_num_ops" : 379081,
"SyncTime_min" : 0,
"SyncTime_max" : 865,
"SyncTime_mean" : 3.0293341000999785,
"SyncTime_median" : 2.0,
"SyncTime_75th_percentile" : 3.0,
"SyncTime_95th_percentile" : 4.0,
"SyncTime_99th_percentile" : 253.5899999999999,
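One way to read such metrics is to compare the 99th percentile with the median. The sketch below does this for a subset of the values shown above, wrapped in braces so they parse as JSON. The helper name `tail_ratio` is hypothetical, not part of HBase.

```python
import json

# Subset of the metrics shown above, wrapped in braces to form valid JSON.
metrics_json = """
{
  "QueueCallTime_median": 0.0,
  "QueueCallTime_99th_percentile": 1.0,
  "SyncTime_median": 2.0,
  "SyncTime_95th_percentile": 4.0,
  "SyncTime_99th_percentile": 253.5899999999999
}
"""

def tail_ratio(metrics, name):
    """Return p99 / median for one metric family, or None if the median is 0."""
    median = metrics[name + "_median"]
    p99 = metrics[name + "_99th_percentile"]
    return None if median == 0 else p99 / median

metrics = json.loads(metrics_json)
for name in ("QueueCallTime", "SyncTime"):
    print(name, "p99/median =", tail_ratio(metrics, name))
```

For SyncTime the 95th percentile is still close to the median, while the 99th percentile is two orders of magnitude higher: the long tail only shows up in the last percentiles.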
HBase write path – high level
Deeper in the write path
• Two parts
  • Single put (WAL)
    • The client just sends the put
  • Multiple puts from the client (new behavior since 0.96)
    • The client is much smarter
• Four stages to look at for latency
  • Start (establish TCP connections, etc.)
  • Steady: when expected conditions are met
  • Machine failure: expected as well
  • Overloaded system: you may need to add machines or tune your workload
Single put: communication
• Create a « Call » object, with an id, as queries are multiplexed
• protobuf it
• TCP write (in trunk it can be queued for a separate thread as well)
• Wait for the answer
  • Separate thread, separate queue
• unprotobuf the answer
• Implies locks and multiple threads communicating with queues
Single put: server side scheduling
• Threads to receive « Call »
• Threads to handle the call execution
• Threads to write the answer on the wire
• Multiple threads, communicating with queues
Single put: real work
• The server must
  • Take a row lock (HBase strong consistency)
  • Write into the WAL queue
  • Write into the memstore
  • Sync the queue (HDFS flush)
  • Free the lock
• The WAL queue is shared between all the regions/handlers
  • Sync is avoided if another handler did the work
  • You may flush more than expected
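The shared WAL queue behaviour can be sketched with a toy model. This is a single-threaded illustration with hypothetical names (`WalQueue`, `sync_up_to`), not HBase's actual classes: one sync makes everything appended so far durable, so later handlers can skip theirs.

```python
class WalQueue:
    """Toy model of a WAL queue shared by all handlers."""

    def __init__(self):
        self.appended = 0      # sequence id of the last appended edit
        self.synced = 0        # highest sequence id known to be durable
        self.sync_calls = 0    # number of (expensive) syncs actually performed

    def append(self, edit):
        self.appended += 1
        return self.appended   # sequence id assigned to this edit

    def sync_up_to(self, seq_id):
        """Make seq_id durable; skip the sync if another handler already covered it."""
        if seq_id <= self.synced:
            return False                  # nothing to do
        self.synced = self.appended       # one sync flushes everything appended so far
        self.sync_calls += 1
        return True

wal = WalQueue()
a = wal.append("put row1")       # handler A
b = wal.append("put row2")       # handler B
c = wal.append("put row3")       # handler C
synced_by_b = wal.sync_up_to(b)  # B syncs first: flushes A, B and C together
synced_by_a = wal.sync_up_to(a)  # A finds its edit already durable
synced_by_c = wal.sync_up_to(c)  # so does C, whose edit was flushed along with B's
print(wal.sync_calls, synced_by_b, synced_by_a, synced_by_c)
```

Handler B's sync also flushed C's edit, which is the "you may flush more than expected" case.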
Latency sources
• Candidate one: network
  • 0.5ms within a datacenter.
• Candidate two: HDFS flush
• Millisecond world: everything can go wrong
  • Network
  • OS scheduler
  • All this goes into the post-99th percentile

Metric   Time in ms
Mean     0.33
50%      0.26
95%      0.59
99%      1.24
Latency sources
• Split (and presplits)
  • Autosharding is great!
  • Puts have to wait
  • Impacts: seconds
• Balance
  • Regions move
  • Triggers a retry for the client
    • hbase.client.pause = 100ms since HBase 0.96
• Garbage Collection
  • Impacts: 10’s of ms, even with a good config
  • Covered in the read path part of this talk
From steady to loaded and overloaded
• Number of concurrent tasks is a function of
  • Number of cores
  • Number of disks
  • Number of remote machines used
• Difficult to estimate
• Queues are doomed to happen
• So for low latency
  • Specific scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
  • Priorities: work in progress.
Loaded & overloaded
• Step 1: Loaded system
  • Tasks are queued: creates latency
  • Specific metric in HBase
• Step 2: Limit reached
  • MemStore takes too much room: blocks until it’s flushed
    • hbase.regionserver.global.memstore.size.lower.limit
    • hbase.regionserver.global.memstore.size
    • hbase.hregion.memstore.block.multiplier
  • Too many HFiles: blocks until compactions keep up
    • hbase.hstore.blockingStoreFiles
  • Too many WAL files
    • Don’t change this
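The "limit reached" conditions can be sketched as a simple check. This is a simplified reading of how the settings above interact, not HBase's actual logic. The per-region flush size (`flush_size`) is an assumption not named on the slide, the `cfg` keys are hypothetical shorthand, and all values are illustrative rather than defaults.

```python
MB = 1024 * 1024
GB = 1024 * MB

# Illustrative values only; keys are shorthand for the settings listed above.
cfg = {
    "flush_size": 128 * MB,           # assumed per-region memstore flush size
    "block_multiplier": 2,            # hbase.hregion.memstore.block.multiplier
    "heap_bytes": 8 * GB,             # RegionServer heap
    "global_memstore_fraction": 0.4,  # hbase.regionserver.global.memstore.size
    "blocking_store_files": 10,       # hbase.hstore.blockingStoreFiles
}

def blocking_reasons(region_memstore_bytes, global_memstore_bytes, store_files, cfg):
    """Return the reasons (possibly none) why writes to a region would block."""
    reasons = []
    if region_memstore_bytes >= cfg["flush_size"] * cfg["block_multiplier"]:
        reasons.append("region memstore too large")
    if global_memstore_bytes >= cfg["heap_bytes"] * cfg["global_memstore_fraction"]:
        reasons.append("global memstore limit reached")
    if store_files > cfg["blocking_store_files"]:
        reasons.append("too many HFiles")
    return reasons

print(blocking_reasons(300 * MB, 1 * GB, 3, cfg))
print(blocking_reasons(10 * MB, 4 * GB, 11, cfg))
```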
Machine failure
• Failure
  • Detect
  • Reallocate
  • Replay WAL
• Replaying WAL is NOT required for puts
• Failure = Detect + Reallocate + Retry
  • That’s in the range of ~1s for simple failures
  • Silent failures put you in the 10s range if the hardware does not help
Single puts
• Millisecond range
• Spikes do happen in steady mode
  • 100ms
  • Causes: GC, load, splits
Streaming puts
HTable#setAutoFlushTo(false)
HTable#put
HTable#flushCommits
Streaming puts
• Write into a buffer
• When the buffer is full, in the background
  • Select the puts that match load conditions
  • Send them
  • Manage retries and delay
• The buffer is freed for other client operations
• Blocks only if there is a non-retryable error or if the buffer is full
Multiple puts
• hbase.client.max.total.tasks (default 100)
• hbase.client.max.perserver.tasks (default 5)
• hbase.client.max.perregion.tasks (default 1)
• Decouples the client from a latency peak of a region server
• Increases the throughput by 50%
• Does not solve the problem of an unbalanced cluster
• But makes splits and GC more transparent
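The "select the puts that match load conditions" step can be sketched as follows. This is a simplified model of the three limits above, not the client's actual code. The function name `select_puts`, the data layout and the region/server names are all assumptions made for illustration.

```python
def select_puts(buffered, location, tasks_per_region, tasks_per_server,
                max_total=100, max_per_server=5, max_per_region=1):
    """Split buffered puts into (send_now, keep_buffered).

    buffered:          list of (row, region) pairs waiting in the client buffer
    location:          region -> server currently hosting it
    tasks_per_region:  region -> number of tasks already in flight
    tasks_per_server:  server -> number of tasks already in flight
    """
    if sum(tasks_per_server.values()) >= max_total:
        return [], list(buffered)          # client-wide limit reached
    send_now, keep = [], []
    for row, region in buffered:
        server = location[region]
        if (tasks_per_region.get(region, 0) < max_per_region
                and tasks_per_server.get(server, 0) < max_per_server):
            send_now.append((row, region))
        else:
            keep.append((row, region))     # stays buffered until the region/server frees up
    return send_now, keep

# Hypothetical cluster: region r3 on rs2 still has a task in flight (e.g. rs2 is in a GC pause).
location = {"r1": "rs1", "r2": "rs1", "r3": "rs2"}
send_now, keep = select_puts(
    buffered=[("a", "r1"), ("m", "r3"), ("b", "r2")],
    location=location,
    tasks_per_region={"r3": 1},
    tasks_per_server={"rs2": 1},
)
print(send_now, keep)
```

Puts for the slow region stay in the buffer while puts for the other regions go out, which is how a latency peak on one region server is hidden from the client.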
Conclusion on write path
• Single puts can be very fast
  • It’s not a « hard real time » system: there are peaks
• Latency peaks can be hidden when streaming puts
  • Including autosplits
And now for the read path
HBase read path – high level
Deeper in the read path
• Gets and short scans are assumed for low-latency operations
• Again, two APIs
  • Single get: HTable#get(Get)
  • Multi-get: HTable#get(List<Get>)
• Four stages, same as write path
  • Start (TCP connection, …)
  • Steady: when expected conditions are met
  • Machine failure: expected as well
  • Overloaded system: you may need to add machines or tune your workload
Multi get / Client
• Group Gets by RegionServer
• Execute them one by one
Multi get / Server
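The client-side grouping step can be sketched as below. The region boundaries and server names are invented for the example; a real client looks region locations up rather than hard-coding them.

```python
import bisect
from collections import defaultdict

# Hypothetical table layout: sorted region start keys and the server hosting each region.
region_start_keys = ["", "g", "p"]
region_servers = ["rs1", "rs2", "rs1"]

def server_for(row):
    """Find the region whose start key is the greatest one <= row, return its server."""
    idx = bisect.bisect_right(region_start_keys, row) - 1
    return region_servers[idx]

def group_by_server(rows):
    """Group the rows of a multi-get by the server that hosts them."""
    groups = defaultdict(list)
    for row in rows:
        groups[server_for(row)].append(row)
    return dict(groups)

print(group_by_server(["apple", "kiwi", "zebra", "grape"]))
```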
http://hadoop-hbase.blogspot.com/2012/05/hbasecon.html
Access latency magnitudes
Dean/2009
Memory is 100,000x faster than disk!
Disk seek = 10ms
Known unknowns
• For each candidate HFile
  • Exclude by file metadata
    • Timestamp
    • Rowkey range
  • Exclude by bloom filter
• StoreFileManager (0.96, HBASE-7678)
StoreFileScanner#shouldUseScanner()
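The exclusion checks listed above can be sketched as a filter over file metadata. This loosely mirrors the idea only; it is not the actual StoreFileScanner code. `HFileMeta` and `should_use` are hypothetical names, and a Python set stands in for the bloom filter (a real bloom filter can return false positives, a set cannot).

```python
from dataclasses import dataclass, field

@dataclass
class HFileMeta:
    name: str
    first_row: str
    last_row: str
    min_ts: int
    max_ts: int
    bloom: set = field(default_factory=set)  # stand-in for a row bloom filter

def should_use(hfile, row, ts_min, ts_max):
    """Return False as soon as one cheap check proves the file cannot hold the row."""
    if row < hfile.first_row or row > hfile.last_row:
        return False                      # excluded by rowkey range
    if hfile.max_ts < ts_min or hfile.min_ts > ts_max:
        return False                      # excluded by timestamp range
    return row in hfile.bloom             # excluded by bloom filter if absent

hfiles = [
    HFileMeta("f1", "a", "f", 100, 200, {"a", "c", "f"}),
    HFileMeta("f2", "b", "z", 10, 50, {"b", "c", "z"}),
    HFileMeta("f3", "b", "z", 150, 300, {"b", "k", "z"}),
    HFileMeta("f4", "a", "d", 150, 300, {"a", "c", "d"}),
]
candidates = [f.name for f in hfiles if should_use(f, "c", 100, 400)]
print(candidates)
```

Every excluded file is one fewer seek on the read path.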
Unknown knowns
• Merge sort results polled from Stores
• Seek each scanner to a reference KeyValue
• Retrieve candidate data from disk
  • Multiple HFiles => multiple seeks
    • hbase.storescanner.parallel.seek.enable=true
  • Short Circuit Reads
    • dfs.client.read.shortcircuit=true
  • Block locality
    • Happy clusters compact!
HFileBlock#readBlockData()
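The merge-sort step can be sketched with a k-way merge over per-file sorted streams. The tuples and data are invented for illustration. Timestamps are stored negated so that, for the same row, the newest version sorts first. A real scanner would seek to the requested key rather than scan from the start as this toy does.

```python
import heapq

def merged_get(scanners, row):
    """Merge sorted scanners and return the newest value for row, or None."""
    for r, neg_ts, value in heapq.merge(*scanners):
        if r == row:
            return value      # first hit carries the largest timestamp
        if r > row:
            break             # sorted order: the row cannot appear later
    return None

# Each "scanner" yields (row, -timestamp, value) tuples in sorted order, one list per HFile.
hfile_old = [("row1", -10, "a-old"), ("row2", -10, "b-old")]
hfile_new = [("row2", -20, "b-new"), ("row3", -20, "c")]

print(merged_get([hfile_old, hfile_new], "row2"))
print(merged_get([hfile_old, hfile_new], "row0"))
```

Each additional HFile adds one more stream to the merge, and on disk one more seek.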
Remembered knowns: BlockCache
• Reuse previously read data
• Smaller BLOCKSIZE => better utilization
• TODO: compression (HBASE-8894)
BlockCache#getBlock()
BlockCache Showdown
• LruBlockCache
  • Quite good most of the time
  • < 30 GB
• BucketCache
  • Offheap alternative
  • > 30 GB
http://www.n10k.com/blog/blockcache-showdown/
Latency enemies: Compactions
• Fewer HFiles => fewer seeks
• Evict data blocks!
• Evict index blocks!!
  • hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
  • hfile.block.bloom.cacheonwrite
• OS buffer cache to the rescue
  • Compacted data is still fresh
  • Better than going all the way back to disk
Latency enemies: Garbage Collection
• Use heap. Not too much. With CMS.
• Max heap: 30GB, probably less
• Healthy cluster load
  • Regular, reliable collections
  • 25-100ms pause on regular interval
• Overloaded RegionServer suffers GC overmuch
Off-heap to the rescue?
• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al (HBASE-10191)
Failure
• Machine failure
  • Detect + Reallocate + Replay
  • Strong consistency requires replay
• Cache starts from scratch
Read latency in summary
• Steady mode
  • Cache hit: < 1 ms
  • Cache miss: + 10 ms per seek
  • Writing while reading: cache churn
  • GC: 25-100ms pause on regular interval

Network request + (1 - P(cache hit)) * 10 ms

• Same long-tail issues as write
• Overloaded: same scheduling issues as write
• Partial failures hurt a lot
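The steady-mode formula above can be turned into a small calculator. The 0.5 ms network figure and the 10 ms seek come from earlier slides. The `seeks_per_miss` parameter is an added assumption to reflect the "per seek" wording, and the hit ratios are illustrative.

```python
def expected_read_latency_ms(network_ms, cache_hit_ratio, seek_ms=10.0, seeks_per_miss=1):
    """Network request + (1 - P(cache hit)) * seek cost, in milliseconds."""
    return network_ms + (1.0 - cache_hit_ratio) * seek_ms * seeks_per_miss

for hit_ratio in (0.99, 0.9, 0.5):
    print(hit_ratio, expected_read_latency_ms(0.5, hit_ratio))
```

This is an average: an individual read takes either about the network time (cache hit) or the network time plus one or more seeks (cache miss), so the percentiles still matter.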
Hedging our bets
• HDFS hedged reads (since HDFS 2.4)
  • Strongly consistent
  • Works at the HDFS level
• Timeline consistency (HBASE-10070)
  • Reads on secondary regions
    • If a region does not answer quickly enough, go to another one
  • Not strongly consistent
  • Helps read-path latency a lot.
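The "if one does not answer quickly enough, go to another one" idea can be sketched generically. This is not the HDFS or HBase implementation; `hedged_call` and `make_replica` are hypothetical names, and the replicas are simulated with sleeps.

```python
import concurrent.futures as cf
import time

def hedged_call(primary, secondary, hedge_after_s):
    """Call primary; if it has not answered after hedge_after_s, also call
    secondary and return whichever answers first."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        first = pool.submit(primary)
        done, _ = cf.wait([first], timeout=hedge_after_s)
        if done:
            return first.result()          # primary was fast enough, no hedge sent
        second = pool.submit(secondary)
        done, _ = cf.wait([first, second], return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)          # do not wait for the slower call

def make_replica(name, delay_s):
    """Simulated replica that answers with its own name after delay_s seconds."""
    def call():
        time.sleep(delay_s)
        return name
    return call

# Primary stuck for 300 ms (say, in a GC pause), secondary healthy.
answer = hedged_call(make_replica("primary", 0.3), make_replica("secondary", 0.01), 0.05)
print(answer)
```

The caller waits roughly the hedge delay plus the secondary's response time instead of the full pause. With timeline consistency, the price is that the secondary may return slightly stale data.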
HBase ranges for 99% latency

         Put                   Streamed Multiput  Get                   Timeline get
Steady   milliseconds          milliseconds       milliseconds          milliseconds
Failure  seconds               seconds            seconds               milliseconds
GC       10’s of milliseconds  milliseconds       10’s of milliseconds  milliseconds
What’s next
• Less GC
  • Use fewer objects
  • Offheap
• Preferred location (HBASE-4755)
• The « magical 1% »
  • Most tools stop at the 99% latency
    • YCSB for example
  • What happens after is much more complex
  • But key to improving the average
Thanks!
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
HBaseCon May 5, 2014