May 10, 2015
Outline
HDFS Overview
HDFS meets HBase
Solving the HDFS-HBase problems
  Small Random Reads
  Single-Client Fault Tolerance
  Durable Record Appends
Summary
HDFS Overview
What is HDFS?

- Hadoop's Distributed File System
- Modeled after Google's GFS
- Scalable, reliable data storage
- All persistent HBase storage is on HDFS
- HDFS reliability and performance are key to HBase reliability and performance
HDFS Architecture
HDFS Design Goals

- Store large amounts of data
- Data should be reliable
- Storage and performance should scale with the number of nodes
- Primary use: bulk processing with MapReduce
Requirements for MapReduce

- MR task outputs
  - Large streaming writes of entire files
- MR task inputs
  - Medium-size partial reads
- Each task usually has 1 reader, 1 writer; 8-16 tasks per node
- DataNodes usually service few concurrent clients
- MapReduce can restart tasks with ease (they are idempotent)
Requirements for HBase
All of the requirements of MapReduce, plus:

- Constantly append small records to an edit log (WAL)
- Small-size random reads
- Many concurrent readers
- Clients cannot restart → single-client fault tolerance is necessary
HDFS Requirements Matrix

Requirement                     MR    HBase
Scalable storage                X     X
System fault tolerance          X     X
Large streaming writes          X     X
Large streaming reads           X     X
Small random reads              -     X
Single-client fault tolerance   -     X
Durable record appends          -     X
HDFS Requirements Matrix

Requirement                     MR    HBase
Scalable storage                X     X ☺
System fault tolerance          X     X ☺
Large streaming writes          X     X ☺
Large streaming reads           X     X ☺
Small random reads              -     X ☹
Single-client fault tolerance   -     X ☹
Durable record appends          -     X ☹
Solutions
...turn that frown upside-down

From easy to hard:
- Configuration Tuning
- HBase-side workarounds
- HDFS Development/Patching
Small Random Reads
Configuration Tuning

- HBase often has more concurrent clients than MapReduce.
- Typical problems:

    xceiverCount 257 exceeds the limit of
    concurrent xcievers 256

  - Increase dfs.datanode.max.xcievers → 1024 (or greater)

    Too many open files

  - Edit /etc/security/limits.conf to increase nofile → 32768
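The two fixes above might look like the following on each DataNode (values are the illustrative ones from this slide, not universal recommendations; note that the property name really is misspelled "xcievers" in Hadoop):

```xml
<!-- hdfs-site.xml: raise the cap on concurrent transceiver threads -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1024</value>
</property>
```

```
# /etc/security/limits.conf: raise the open-file limit for the user
# running the DataNode ("hdfs" here is an assumption about your setup)
hdfs  soft  nofile  32768
hdfs  hard  nofile  32768
```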
Small Random Reads
HBase Features

- HBase block cache
  - Avoids the need to hit HDFS for many reads
- Finer-grained synchronization in HFile reads (HBASE-2180)
  - Allows parallel clients to read data in parallel for higher throughput
- Seek-and-read vs pread API (HBASE-1505)
  - In current HDFS, these have different performance characteristics
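The seek-and-read vs pread distinction can be illustrated with plain POSIX-style file IO (a conceptual sketch of the two access patterns, not the actual HDFS client API):

```python
import os

def seek_and_read(f, offset, length):
    # Stateful read path: moves the shared file position, so concurrent
    # readers of one handle must serialize around seek()+read().
    f.seek(offset)
    return f.read(length)

def positional_read(fd, offset, length):
    # pread-style path: reads at an absolute offset without touching the
    # file position, so parallel clients need no synchronization.
    return os.pread(fd, length, offset)

with open("/tmp/demo.bin", "wb") as f:
    f.write(b"abcdefghij")

with open("/tmp/demo.bin", "rb") as f:
    a = seek_and_read(f, 2, 3)
    b = positional_read(f.fileno(), 2, 3)
    assert a == b == b"cde"
```

Both return the same bytes; the difference is that only the pread-style call is safe for many concurrent readers sharing one open file.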
Small Random Reads
HDFS Development in Progress

- Client↔DN connection reuse (HDFS-941, HDFS-380)
  - Eliminates TCP handshake latency
  - Avoids restarting the TCP slow-start algorithm for each read
- Multiplexed BlockSender (HDFS-918)
  - Reduces the number of threads and open files in the DN
- Netty DataNode (hack in progress)
  - Non-blocking IO may be more efficient for high concurrency
Single-Client Fault Tolerance
What exactly do I mean?

- If a MapReduce task fails to write, the MR framework will restart the task.
- MR relies on idempotence → task failures are not a big deal.
- Thus, fault tolerance of a single client is not as important to MR.
- If an HBase region fails to write, it cannot recreate the data easily.
- HBase may access a single file for a day at a time → must ride over transient errors.
Single-Client Fault Tolerance
HDFS Patches

- HDFS-127 / HDFS-927
  - Clients used to give up after N read failures on a file, with no regard for time. This patch resets the failure count after successful reads.
- HDFS-630
  - Fixes block allocation to exclude nodes the client knows to be bad
  - Important for small clusters!
  - Backported to 0.20 in CDH2
- Various other write-pipeline recovery fixes in 0.20.2 (HDFS-101, HDFS-793)
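The HDFS-127 behavior change can be sketched abstractly (hypothetical names, not the actual DFSClient code):

```python
class ReadRetryPolicy:
    """Sketch of the HDFS-127 idea: a long-lived reader should not be
    killed by N failures accumulated over hours or days. Resetting the
    counter on success means only N *consecutive* failures give up."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def record_success(self):
        self.failures = 0          # the fix: reset after a good read

    def record_failure(self):
        self.failures += 1

    def should_give_up(self):
        return self.failures >= self.max_failures

policy = ReadRetryPolicy(max_failures=3)
# Two failures, a success, then two more failures: a counter that never
# resets would have given up (4 total >= 3); the reset keeps this
# long-running client alive through transient errors.
for outcome in ["fail", "fail", "ok", "fail", "fail"]:
    if outcome == "fail":
        policy.record_failure()
    else:
        policy.record_success()
assert not policy.should_give_up()
```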
Durable Record Appends
What exactly is the infamous sync()/append()?

- Well, it's really hflush()
- HBase accepts writes into memory (the MemStore)
- It also logs them to disk (the HLog / WAL)
- Each write needs to be on disk before claiming durability.
- hflush() provides this guarantee (almost)
- Unfortunately, it doesn't work in Apache Hadoop 0.20.x
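The durability contract HBase needs from its WAL can be sketched with a plain local file, using os.fsync as a stand-in for what hflush() is meant to guarantee (names are illustrative; the real HLog writes to HDFS, not a local file):

```python
import os

class WriteAheadLog:
    """Toy WAL: every append is forced to stable storage before the
    write is acknowledged, so an acked edit survives a crash."""

    def __init__(self, path):
        self.f = open(path, "wb")  # fresh log for this sketch

    def append(self, record: bytes):
        self.f.write(record + b"\n")
        self.f.flush()                 # user-space buffer -> kernel
        os.fsync(self.f.fileno())      # kernel -> disk (hflush-like)

wal = WriteAheadLog("/tmp/wal.log")
wal.append(b"put row1 cf:col=v1")
wal.append(b"put row2 cf:col=v2")

with open("/tmp/wal.log", "rb") as f:
    assert f.read().count(b"\n") == 2  # both edits on disk
```

The "(almost)" on the slide matters: real hflush() pushes data to the DataNode pipeline rather than all the way to physical disk, so this local-fsync model is slightly stronger than what HDFS actually promises.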
Durable Record Appends
HBase Workarounds

- HDFS files are durable once closed
- Currently, HBase rolls the edit log periodically
- After a roll, previous edits are safe
- Not much of a workaround ☹
  - A crash will lose any edits since the last roll.
  - Rolling constantly results in small files
    - Bad for NN metadata efficiency
    - Triggers frequent flushes → bad for region server efficiency
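The rolling workaround boils down to: only closed segments are safe. A toy in-memory model of that failure mode (hypothetical names, crash simulated rather than real):

```python
class RollingEditLog:
    """Toy model of the pre-hflush workaround: edits in the currently
    open segment are lost on a crash; rolling closes the segment,
    which (like closing an HDFS file) makes its edits durable."""

    def __init__(self):
        self.closed_segments = []   # durable once closed
        self.open_segment = []      # lost if we crash now

    def append(self, edit):
        self.open_segment.append(edit)

    def roll(self):
        self.closed_segments.append(self.open_segment)
        self.open_segment = []

    def crash_and_recover(self):
        # Only edits in closed segments survive the crash.
        return [e for seg in self.closed_segments for e in seg]

log = RollingEditLog()
log.append("edit-1")
log.append("edit-2")
log.roll()                # edit-1 and edit-2 are now safe
log.append("edit-3")      # still only in the open segment
assert log.crash_and_recover() == ["edit-1", "edit-2"]  # edit-3 lost
```

This makes the trade-off on the slide concrete: rolling more often shrinks the window of lost edits, but at the cost of many small files and frequent flushes.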
Durable Record Appends
HDFS Development

- On Apache trunk: HDFS-265
  - New append re-implementation for 0.21/0.22
  - Will work great, but essentially a very large set of patches
  - Not released yet - running unreleased Hadoop is "daring"
- In 0.20.x distributions: HDFS-200 patch
  - Fixes bugs in the old hflush() implementation
  - Not quite as efficient as HDFS-265, but good enough and simpler
  - Dhruba Borthakur from Facebook is testing and improving it
  - Cloudera will test and merge this into CDH3
Summary

- HDFS's original target workload was MapReduce, and HBase has different (harder) requirements.
- Engineers from the HBase team plus Facebook, Cloudera, and Yahoo are working together to improve things.
- Cloudera will integrate all necessary HDFS patches in CDH3, available for testing soon.
- Contact me if you'd like to help test in April.