Operating HBase – Things You Need to Know
Christian Gügi
2
Outline
● HBase internals
● Overview of HBase utilities
● HBase split visualisation with Hannibal
● Challenges & lessons learned
● Resources to get started
3
About me
● Software Architect @ Sentric
● Founder and organizer of the Swiss Big Data User Group: http://www.bigdata-usergroup.ch
● Contact: [email protected], http://www.sentric.ch, @chrisgugi
4
HBase Internals
5
Data Model
● A sparse, multi-dimensional, sorted map
● A table consists of rows, each identified by a row key
● Each row may have any number of columns
● Rows are sorted lexicographically based on row key
● Column = Column Family : Column Qualifier
– A cell is addressed by {row key, column, timestamp}
● Region: contiguous set of sorted rows
● Region: unit of distribution and availability
[Bigtable: A Distributed Storage System for Structured Data]
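The data model above can be sketched as a nested map. This is a toy illustration only, not the HBase client API: the class `SparseTable` and its methods are hypothetical names, and the row and column values follow the webtable example from the Bigtable paper.

```python
class SparseTable:
    """Toy model of one table: {row key: {column: {timestamp: value}}}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, timestamp, value):
        # A column is written as "family:qualifier".
        versions = self.rows.setdefault(row_key, {}).setdefault(column, {})
        versions[timestamp] = value

    def get(self, row_key, column):
        # Sparse: a missing cell costs nothing and simply returns None.
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        # The newest timestamp wins.
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        # Rows come back in lexicographic row-key order, stop_row exclusive.
        for row_key in sorted(self.rows):
            if start_row <= row_key < stop_row:
                yield row_key, self.rows[row_key]
```

Because the ordering is lexicographic, a row key such as `row10` sorts before `row2`, which is one reason numeric keys are often zero-padded.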
6
Physical Data Organization
[Figure: a Region holds one Store per column family (here "content" and "anchor"); each Store has a Memstore and one or more HFiles (on HDFS). Alongside sits the HLog (WAL on HDFS).]
● Column families are stored separately on disk
– Unit of access control; families can have different access patterns
● Writes are held (sorted) in memory until flush
● Sorted on disk in predictable order
– By row key, column key, descending timestamp
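A minimal sketch of the write path described on this slide, assuming a simplified store that flushes after a fixed number of cells. Real HBase flushes based on memstore size in bytes; the class `Store` and the parameter `flush_threshold` are hypothetical and only illustrate the sort order.

```python
class Store:
    """Toy model of one column family: a memstore plus flushed files."""

    def __init__(self, flush_threshold=3):
        # Assumption: threshold counted in cells, not bytes.
        self.flush_threshold = flush_threshold
        self.memstore = {}
        self.hfiles = []  # each flushed file is an immutable, sorted tuple

    @staticmethod
    def sort_key(cell):
        (row, column, timestamp), _value = cell
        # Row key, then column key, then newest timestamp first.
        return (row, column, -timestamp)

    def put(self, row, column, timestamp, value):
        self.memstore[(row, column, timestamp)] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffered cells out in sorted order and start afresh.
        if self.memstore:
            cells = sorted(self.memstore.items(), key=self.sort_key)
            self.hfiles.append(tuple(cells))
            self.memstore = {}
```

Each flush produces a new file, so a busy store accumulates several files over time, which is what compaction later cleans up.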
7
Flushes and Compaction
● Flushing/compaction per Region
– One thread (CompactSplitThread) per region server
● Minor compaction
– Merges two or more HFiles into one
● Major compaction
– Picks up all HFiles in the region, merges them and removes deleted key/values
● Regions are split when they grow too large
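The difference between the two compaction types can be sketched as follows. This is a simplified model, not HBase's implementation: files are lists of `((row, column, timestamp), value)` cells, a delete is represented by the hypothetical marker `TOMBSTONE`, and version limits and TTLs are ignored.

```python
import heapq

TOMBSTONE = None  # hypothetical delete marker used only in this sketch


def sort_key(cell):
    (row, column, timestamp), value = cell
    # Newest first; at equal timestamps a delete marker sorts before a put.
    return (row, column, -timestamp, value is not TOMBSTONE)


def minor_compact(hfiles):
    """Merge several sorted files into one sorted file, keeping every cell."""
    return list(heapq.merge(*hfiles, key=sort_key))


def major_compact(hfiles):
    """Merge all files and drop delete markers plus the cells they mask."""
    result = []
    deleted_at = {}  # (row, column) -> timestamp of the newest delete marker
    for (row, column, ts), value in heapq.merge(*hfiles, key=sort_key):
        if value is TOMBSTONE:
            deleted_at.setdefault((row, column), ts)
            continue
        if ts <= deleted_at.get((row, column), float("-inf")):
            continue  # masked by a delete at the same or a later timestamp
        result.append(((row, column, ts), value))
    return result
```

In this model a minor compaction only reduces the number of files, while a major compaction is the step that actually reclaims the space held by deleted data.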
8
System Architecture
[Figure: HBase system architecture with API, Master, RegionServer, Write-Ahead Log, Memstore, HFile, HDFS and ZooKeeper. Source: HBase: The Definitive Guide]
9
Key Design & Distribution
● Bad idea: continuous number or timestamp (sequential row keys)
– RegionServer hot-spotting
● Better: use a hash function and/or composite key
– Distributes keys over random regions
– Uniform reads/writes across key space
● Proper key design is essential
– E.g. reversed URL (Bigtable paper)
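Two of the key design ideas above, sketched under stated assumptions: `salted_key` prefixes a key with a hash-derived bucket, and `reversed_host_key` builds a reversed-URL key in the style of the Bigtable paper. Both function names, the bucket count and the key layout are hypothetical choices for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def salted_key(row_key, buckets=16):
    """Prefix the key with a bucket derived from its hash.

    Sequential keys (counters, timestamps) then land in different
    parts of the key space instead of hitting a single region.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{row_key}"


def reversed_host_key(url):
    """Turn 'http://www.cnn.com/index.html' into 'com.cnn.www/index.html'.

    Pages of the same domain end up next to each other in sort order.
    """
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path
```

The trade-off of salting is that rows are no longer stored in their original order, so a range scan over the original keys has to query every bucket.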
10
Overview HBase Utilities
11
Useful Tools
● hbck – checks and fixes table integrity and region consistency
● HFile – examines the contents of an HFile
● HLog – examines the contents of an HLog file
● OfflineMetaRepair – rebuilds the meta table from the file system
● HBase web interfaces
– Master
– RegionServer
12
Monitoring Tools
● Ganglia
● Nagios
● OpenTSDB
● …
All tools use metrics provided through JMX
13
Manual Splitting
● Via master web interface
– Split
● HBase shell split command
● RegionSplitter
– Create table with pre-split regions
– Rolling split of all regions on existing table
– ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter
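To illustrate what pre-splitting means, the sketch below computes evenly spaced split keys for row keys that are fixed-width hex strings. It is similar in spirit to an evenly spaced hex split but is not RegionSplitter's code; the function name `hex_split_points` and the `width` parameter are assumptions.

```python
def hex_split_points(num_regions, width=8):
    """Return num_regions - 1 evenly spaced split keys.

    Assumes row keys are lowercase hex strings of the given width,
    for example the leading characters of a hash.
    """
    if num_regions < 2:
        return []
    key_space = 16 ** width
    return [
        format(i * key_space // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]
```

For example, `hex_split_points(4)` yields the three boundaries between four regions. Evenly spaced boundaries only balance load when the keys themselves are evenly distributed, as hashed keys are.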
14
Disable Automatic Splitting
● Automatic splitting is determined by hbase.hregion.max.filesize
● Set it to a high value (max. 100GB) so regions no longer split on their own
● OK, but:
– How do I monitor my region growth?
– Where do I split when I have irregular data growth?
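One possible answer to the second question, sketched under the assumption that per-key size estimates for a region are available from somewhere (a hypothetical input; HBase does not hand out this list directly). The function picks the split key that leaves the two daughter regions closest in size, rather than the middle of the key range.

```python
def choose_split_key(key_sizes):
    """Pick the split key that best balances the two halves by size.

    key_sizes: list of (row_key, size_in_bytes), sorted by row key.
    Rows below the returned key go left, the rest go right.
    Returns None if the region cannot be split (fewer than two keys).
    """
    total = sum(size for _, size in key_sizes)
    best_key, best_gap = None, None
    left = 0  # bytes that would end up in the left daughter
    for index, (key, size) in enumerate(key_sizes):
        if index > 0:
            gap = abs(total - 2 * left)  # |left - right|
            if best_gap is None or gap < best_gap:
                best_key, best_gap = key, gap
        left += size
    return best_key
```

With skewed data, such as one very large row among small ones, the chosen key can sit far from the lexical midpoint of the region.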
15
HBase Split Visualisation with Hannibal
16
Hannibal
● Open source project on GitHub
– https://github.com/sentric/hannibal
● Web based
● Implemented in Scala
● Compatible with HBase 0.90
● Support for versions > 0.92 to be added soon
● Check it out!
17
How well are regions balanced over the cluster?
18
How well are the regions split for the table?
19
How did the region evolve over time?
20
Future Plans
● HBase 0.92 client API changes allow querying the compaction state of regions through HBaseAdmin → differentiate major from minor compactions
● Add a tool to find the best region split key for irregular data growth
● Expose metrics through JMX
21
Challenges & Lessons Learned
22
Challenges
● Everyone is still learning
● Some issues only appear at scale
– At scale, nothing works as advertised
● Production cluster configuration
– Hardware issues
– Tuning cluster configuration to our workloads
● HBase stability
● Monitoring health of HBase
23
Lessons Learned
● Schema & key design
– What’s queried together should be stored together
● Monitoring/Operational tooling is most important
● Forget “emergency actions”, it takes some time
● You need DevOps in production
● Steep learning curve, you need to know the whole ecosystem
– Hadoop, HDFS, MapReduce, ZooKeeper
24
Resources to get started
● https://github.com/sentric/hannibal
● http://hbase.apache.org/book.html
● https://github.com/jmhsieh/hbase-repair-scripts
● http://www.sentric.ch/blog/best-practice-why-monitoring-hbase-is-important
● HBase: The Definitive Guide
25
Questions?
@chrisgugi
Thank you!