May 10, 2015
Outline
HDFS Overview
HDFS meets HBase
Solving the HDFS-HBase problems
  Small Random Reads
  Single-Client Fault Tolerance
  Durable Record Appends
Summary
HDFS Overview
What is HDFS?

- Hadoop's Distributed File System
- Modeled after Google's GFS
- Scalable, reliable data storage
- All persistent HBase storage is on HDFS
- HDFS reliability and performance are key to HBase reliability and performance
HDFS Architecture
HDFS Design Goals

- Store large amounts of data
- Data should be reliable
- Storage and performance should scale with the number of nodes
- Primary use: bulk processing with MapReduce
Requirements for MapReduce

- MR task outputs
  - Large streaming writes of entire files
- MR task inputs
  - Medium-size partial reads
- Each task usually has 1 reader, 1 writer; 8-16 tasks per node
- DataNodes usually service few concurrent clients
- MapReduce can restart tasks with ease (they are idempotent)
Requirements for HBase
All of the requirements of MapReduce, plus:

- Constantly append small records to an edit log (WAL)
- Small-size random reads
- Many concurrent readers
- Clients cannot restart → single-client fault tolerance is necessary
HDFS Requirements Matrix

Requirement                     MR    HBase
Scalable storage                X     X
System fault tolerance          X     X
Large streaming writes          X     X
Large streaming reads           X     X
Small random reads              -     X
Single-client fault tolerance   -     X
Durable record appends          -     X
HDFS Requirements Matrix

Requirement                     MR    HBase
Scalable storage                X     X ☺
System fault tolerance          X     X ☺
Large streaming writes          X     X ☺
Large streaming reads           X     X ☺
Small random reads              -     X ☹
Single-client fault tolerance   -     X ☹
Durable record appends          -     X ☹
Solutions
...turn that frown upside-down

From easy to hard:
- Configuration Tuning
- HBase-side workarounds
- HDFS Development/Patching
Small Random Reads
Configuration Tuning

- HBase often has more concurrent clients than MapReduce.
- Typical problems:

    xceiverCount 257 exceeds the limit of
    concurrent xcievers 256

  - Increase dfs.datanode.max.xcievers → 1024 (or greater)

    Too many open files

  - Edit /etc/security/limits.conf to increase nofile → 32768
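The two fixes above might look like the following on each DataNode (values are the illustrative ones from this slide, not universal recommendations; note that the property name really is misspelled "xcievers" in Hadoop):

```xml
<!-- hdfs-site.xml: raise the cap on concurrent transceiver threads -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1024</value>
</property>
```

```
# /etc/security/limits.conf: raise the open-file limit for the user
# running the DataNode ("hdfs" here is an assumption about your setup)
hdfs  soft  nofile  32768
hdfs  hard  nofile  32768
```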
Small Random Reads
HBase Features

- HBase block cache
  - Avoids the need to hit HDFS for many reads
- Finer-grained synchronization in HFile reads (HBASE-2180)
  - Allows parallel clients to read data in parallel for higher throughput
- Seek-and-read vs pread API (HBASE-1505)
  - In current HDFS, these have different performance characteristics
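The seek-and-read vs pread distinction can be illustrated with plain POSIX-style file IO (a conceptual sketch of the two access patterns, not the actual HDFS client API):

```python
import os

def seek_and_read(f, offset, length):
    # Stateful read path: moves the shared file position, so concurrent
    # readers of one handle must serialize around seek()+read().
    f.seek(offset)
    return f.read(length)

def positional_read(fd, offset, length):
    # pread-style path: reads at an absolute offset without touching the
    # file position, so parallel clients need no synchronization.
    return os.pread(fd, length, offset)

with open("/tmp/demo.bin", "wb") as f:
    f.write(b"abcdefghij")

with open("/tmp/demo.bin", "rb") as f:
    a = seek_and_read(f, 2, 3)
    b = positional_read(f.fileno(), 2, 3)
    assert a == b == b"cde"
```

Both return the same bytes; the difference is that only the pread-style call is safe for many concurrent readers sharing one open file.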
Small Random Reads
HDFS Development in Progress

- Client↔DN connection reuse (HDFS-941, HDFS-380)
  - Eliminates TCP handshake latency
  - Avoids restarting the TCP slow-start algorithm for each read
- Multiplexed BlockSender (HDFS-918)
  - Reduces the number of threads and open files in the DN
- Netty DataNode (hack in progress)
  - Non-blocking IO may be more efficient for high concurrency
Single-Client Fault Tolerance
What exactly do I mean?

- If a MapReduce task fails to write, the MR framework will restart the task.
- MR relies on idempotence → task failures are not a big deal.
- Thus, fault tolerance of a single client is not as important to MR.
- If an HBase region fails to write, it cannot recreate the data easily.
- HBase may access a single file for a day at a time → must ride over transient errors.
Single-Client Fault Tolerance
HDFS Patches

- HDFS-127 / HDFS-927
  - Clients used to give up after N read failures on a file, with no regard for time. This patch resets the failure count after successful reads.
- HDFS-630
  - Fixes block allocation to exclude nodes the client knows to be bad
  - Important for small clusters!
  - Backported to 0.20 in CDH2
- Various other write-pipeline recovery fixes in 0.20.2 (HDFS-101, HDFS-793)
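The HDFS-127 behavior change can be sketched abstractly (hypothetical names, not the actual DFSClient code):

```python
class ReadRetryPolicy:
    """Sketch of the HDFS-127 idea: a long-lived reader should not be
    killed by N failures accumulated over hours or days. Resetting the
    counter on success means only N *consecutive* failures give up."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def record_success(self):
        self.failures = 0          # the fix: reset after a good read

    def record_failure(self):
        self.failures += 1

    def should_give_up(self):
        return self.failures >= self.max_failures

policy = ReadRetryPolicy(max_failures=3)
# Two failures, a success, then two more failures: a counter that never
# resets would have given up (4 total >= 3); the reset keeps this
# long-running client alive through transient errors.
for outcome in ["fail", "fail", "ok", "fail", "fail"]:
    if outcome == "fail":
        policy.record_failure()
    else:
        policy.record_success()
assert not policy.should_give_up()
```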
Durable Record Appends
What exactly is the infamous sync()/append()?

- Well, it's really hflush()
- HBase accepts writes into memory (the MemStore)
- It also logs them to disk (the HLog / WAL)
- Each write needs to be on disk before claiming durability.
- hflush() provides this guarantee (almost)
- Unfortunately, it doesn't work in Apache Hadoop 0.20.x
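The durability contract HBase needs from its WAL can be sketched with a plain local file, using os.fsync as a stand-in for what hflush() is meant to guarantee (names are illustrative; the real HLog writes to HDFS, not a local file):

```python
import os

class WriteAheadLog:
    """Toy WAL: every append is forced to stable storage before the
    write is acknowledged, so an acked edit survives a crash."""

    def __init__(self, path):
        self.f = open(path, "wb")  # fresh log for this sketch

    def append(self, record: bytes):
        self.f.write(record + b"\n")
        self.f.flush()                 # user-space buffer -> kernel
        os.fsync(self.f.fileno())      # kernel -> disk (hflush-like)

wal = WriteAheadLog("/tmp/wal.log")
wal.append(b"put row1 cf:col=v1")
wal.append(b"put row2 cf:col=v2")

with open("/tmp/wal.log", "rb") as f:
    assert f.read().count(b"\n") == 2  # both edits on disk
```

The "(almost)" on the slide matters: real hflush() pushes data to the DataNode pipeline rather than all the way to physical disk, so this local-fsync model is slightly stronger than what HDFS actually promises.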
Durable Record Appends
HBase Workarounds

- HDFS files are durable once closed
- Currently, HBase rolls the edit log periodically
- After a roll, previous edits are safe
- Not much of a workaround ☹
  - A crash will lose any edits since the last roll.
  - Rolling constantly results in small files
    - Bad for NN metadata efficiency
    - Triggers frequent flushes → bad for region server efficiency
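The rolling workaround boils down to: only closed segments are safe. A toy in-memory model of that failure mode (hypothetical names, crash simulated rather than real):

```python
class RollingEditLog:
    """Toy model of the pre-hflush workaround: edits in the currently
    open segment are lost on a crash; rolling closes the segment,
    which (like closing an HDFS file) makes its edits durable."""

    def __init__(self):
        self.closed_segments = []   # durable once closed
        self.open_segment = []      # lost if we crash now

    def append(self, edit):
        self.open_segment.append(edit)

    def roll(self):
        self.closed_segments.append(self.open_segment)
        self.open_segment = []

    def crash_and_recover(self):
        # Only edits in closed segments survive the crash.
        return [e for seg in self.closed_segments for e in seg]

log = RollingEditLog()
log.append("edit-1")
log.append("edit-2")
log.roll()                # edit-1 and edit-2 are now safe
log.append("edit-3")      # still only in the open segment
assert log.crash_and_recover() == ["edit-1", "edit-2"]  # edit-3 lost
```

This makes the trade-off on the slide concrete: rolling more often shrinks the window of lost edits, but at the cost of many small files and frequent flushes.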
Durable Record Appends
HDFS Development

- On Apache trunk: HDFS-265
  - New append re-implementation for 0.21/0.22
  - Will work great, but essentially a very large set of patches
  - Not released yet - running unreleased Hadoop is "daring"
- In 0.20.x distributions: HDFS-200 patch
  - Fixes bugs in the old hflush() implementation
  - Not quite as efficient as HDFS-265, but good enough and simpler
  - Dhruba Borthakur from Facebook is testing and improving it
  - Cloudera will test and merge this into CDH3
Summary

- HDFS's original target workload was MapReduce, and HBase has different (harder) requirements.
- Engineers from the HBase team plus Facebook, Cloudera, and Yahoo are working together to improve things.
- Cloudera will integrate all necessary HDFS patches in CDH3, available for testing soon.
- Contact me if you'd like to help test in April.