Page 1
1© Cloudera, Inc. All rights reserved.
13 April 2016
Ted Malaska | Principal Solutions Architect @ Cloudera
Jonathan Hsieh | HBase Tech Lead @ Cloudera, Apache HBase PMC
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications
Page 2
About Ted and Jon
Ted Malaska
• Principal Solutions Architect @ Cloudera
• Apache HBase SparkOnHBase Contributor
• Contact
  • [email protected]
Jon Hsieh
• Tech Lead/Eng Manager HBase Team @ Cloudera
• Apache HBase PMC
• Apache Flume founder
• Contact
  • [email protected]
  • @jmhsieh
Hsieh and Malaska, Hadoop Summit EU Dublin 2016
Page 3
Outline
• Introduction
• Architecture and integration patterns
• Typing and API usage examples
• Future work and Conclusion
Page 4
Apache HBase + Apache Spark
• Apache HBase is a distributed non-relational datastore that specializes in strongly consistent, low-latency random-access reads and writes, and short scans. As a storage system, it is an obvious source for reading RDDs and a destination for writing RDDs.
• Apache Spark is a distributed in-memory processing system that can be used for batch jobs and for continuous, near-real-time streaming jobs. Spark’s programming model is built on the RDD (resilient distributed dataset) abstraction.
Page 5
Example Use cases
• Streaming analytics into HBase to replace Lambda architectures (with Kafka)
  • Weblogs
• ETL in Spark to bulk load into HBase
  • 25-50B records per weekly batch
• Using SQL as an extraction layer to query HBase entity-centric time-series data
Page 6
Architecture and Integration Patterns
Page 7
How does data get in and out of HBase?
[Diagram: paths for data into and out of HBase, on an axis from low latency to high throughput. Labels: HBase Client (Put, Incr, Append); HBase Client (Get, Scan); Bulk Import; HBase Replication; HBase Scanner; Gets; Short scan; Full Scan, Snapshot, MapReduce.]
Page 8
HBase + MapReduce: Batch processing patterns
• Read dataset from HBase table
  • Use HBase’s MR InputFormats
    • TableInputFormat
    • MultiTableInputFormat
    • TableSnapshotInputFormat
• Write dataset to HBase table
  • Use HBase’s MR OutputFormats
    • TableOutputFormat
    • MultiTableOutputFormat
    • HFileOutputFormat
[Diagram labels: Read from HBase Table; Write to HBase Table]
Page 9
HBase + Spark: Batch processing patterns
• Read dataset (RDD) from HBase table
  • Use HBase’s MR InputFormats
    • TableInputFormat
    • MultiTableInputFormat
    • TableSnapshotInputFormat
• Write dataset (RDD) to HBase table
  • Use HBase’s MR OutputFormats
    • TableOutputFormat
    • MultiTableOutputFormat
    • HFileOutputFormat
[Diagram labels: Read HBase Table as RDD; Write RDD as HBase Table]
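The batch pattern above can be sketched with Spark's generic Hadoop input/output hooks and HBase's MR formats. This is a minimal sketch, not the presenters' code: the table name "t2" and the qualifier "seen" are made up for illustration ("t1" and family "c" appear elsewhere in the deck), and it assumes the HBase client configuration is on the classpath.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object HBaseBatchSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-batch-sketch"))

    // Read: TableInputFormat yields (row key, Result) pairs, one partition per region.
    val readConf = HBaseConfiguration.create()
    readConf.set(TableInputFormat.INPUT_TABLE, "t1")
    val rows = sc.newAPIHadoopRDD(
      readConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Copy the key bytes out, because Hadoop input formats may reuse the writable.
    val rowKeys = rows.map { case (key, _) => Bytes.toString(key.copyBytes()) }

    // Write: TableOutputFormat turns each (key, Put) pair into a client-side put.
    val writeConf = HBaseConfiguration.create()
    writeConf.set(TableOutputFormat.OUTPUT_TABLE, "t2") // hypothetical target table
    val job = Job.getInstance(writeConf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    val puts = rowKeys.map { k =>
      val put = new Put(Bytes.toBytes(k))
      // "seen" is a hypothetical qualifier used only for this sketch.
      put.addColumn(Bytes.toBytes("c"), Bytes.toBytes("seen"), Bytes.toBytes("1"))
      (new ImmutableBytesWritable(Bytes.toBytes(k)), put)
    }
    puts.saveAsNewAPIHadoopDataset(job.getConfiguration)

    sc.stop()
  }
}
```

TableOutputFormat goes through the normal write path; for very large loads the HFileOutputFormat route listed above (bulk load) is the usual alternative.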
Page 10
Spark Streaming
• Take a data source
• Partition it into mini-batch RDDs
• Compute using the Spark engine
• Output mini-batch RDDs
[Diagram labels: Data source; Mini batch input RDD; Mini batch output RDD]
Page 11
HBase + Spark Streaming – Enriching With HBase Data
• “Join” a dataset with HBase data
  • Enrich a streaming data source with HBase data
  • Extract information from each mini-batch
  • Read/write/update HBase data during processing
  • Output an HBase-data-enriched stream of output RDDs
[Diagram labels: Data source; Mini batch input RDD; HBase-enriched mini batch output RDD]
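A rough sketch of the enrichment pattern, using only Spark Streaming and the plain HBase client API. The socket source, the idea that each line is a row key, and the column c:a are assumptions for illustration. Opening a connection per partition per mini-batch is the simplest thing that works, not necessarily the most efficient.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EnrichStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("enrich-stream-sketch")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second mini-batches

    // Hypothetical source: one row key per line on a local socket.
    val keys = ssc.socketTextStream("localhost", 9999)

    val enriched = keys.mapPartitions { ids =>
      val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("t1"))
      try {
        // Materialize the results before closing the table: the iterator is lazy.
        ids.map { id =>
          val result = table.get(new Get(Bytes.toBytes(id)))
          val value = result.getValue(Bytes.toBytes("c"), Bytes.toBytes("a"))
          // getValue returns null when the cell is absent.
          (id, Option(value).map(v => Bytes.toString(v)))
        }.toList.iterator
      } finally {
        table.close()
        conn.close()
      }
    }

    enriched.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```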
Page 12
How does Spark get data in and out of HBase?
[Diagram: the HBase data-path diagram from "How does data get in and out of HBase?" (HBase Client, Bulk Import, HBase Replication, HBase Scanner; low latency to high throughput), repeated with the question reframed for Spark.]
Page 13
How does Spark get data in and out of HBase?
[Diagram: the same HBase data-path diagram, annotated with where Spark fits:
• Batch RDD via HBase’s MR Input/Output Formats
• Streaming using HBase to enrich stream data]
Page 14
Typing and API Usage
Page 15
Under the covers
[Diagram: a Driver holding Configs, and two Worker Nodes. Each Worker Node runs an Executor whose static space holds Configs and an HConnection; the Executor's Tasks run alongside that static space.]
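One way to read the "static space" in the diagram: state that lives once per executor JVM and is shared by every task running there. The sketch below illustrates that idea in plain Scala with a stand-in connection class. `FakeConnection` and `ConnectionCache` are hypothetical names, not part of HBase or the hbase-spark module, and the real implementation may differ.

```scala
object StaticSpaceSketch {
  // Hypothetical stand-in for an HBase connection.
  final class FakeConnection(val id: Int)

  // A Scala object exists once per JVM, so all tasks in one executor share it.
  object ConnectionCache {
    private var opened = 0
    private var cached: Option[FakeConnection] = None

    // Open the connection on first use, then hand back the cached one.
    def getOrCreate(): FakeConnection = synchronized {
      cached.getOrElse {
        opened += 1
        val c = new FakeConnection(opened)
        cached = Some(c)
        c
      }
    }

    def openedCount: Int = synchronized(opened)
  }

  // Each "task" asks the cache rather than opening its own connection.
  def runTask(taskId: Int): Int = ConnectionCache.getOrCreate().id

  def main(args: Array[String]): Unit = {
    val ids = (1 to 4).map(runTask)
    println(s"connection ids seen by tasks: $ids")
    println(s"connections opened: ${ConnectionCache.openedCount}")
  }
}
```

The point of the design is that connection setup cost is paid once per executor rather than once per task or per record.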
Page 16
Key Addition: HBaseContext
• Create an HBaseContext
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes

    // a Hadoop/HBase Configuration object
    val conf = HBaseConfiguration.create()
    conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
    conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
    // sc is the SparkContext; the HBaseContext corresponds to an HBase Connection
    val hbaseContext = new HBaseContext(sc, conf)

    // A sample RDD of row keys
    val rdd = sc.parallelize(Array(
      Bytes.toBytes("1"),
      Bytes.toBytes("2"),
      Bytes.toBytes("3"),
      Bytes.toBytes("4"),
      Bytes.toBytes("5"),
      Bytes.toBytes("6"),
      Bytes.toBytes("7")))
Page 17
Operations on the HBaseContext
• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete
Page 18
Foreach
• Read HBase data in parallel for each partition and compute
    rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
      // do something
      val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
      it.foreach(r => {
        ... // HBase API put/incr/append/cas calls
      })
      // push any buffered mutations to HBase before releasing the mutator
      bufferedMutator.flush()
      bufferedMutator.close()
    })
Page 19
Map
• Take an HBase dataset and map it in parallel for each partition to produce a new RDD
    val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
      val table = conn.getTable(TableName.valueOf("t1"))
      var res = mutable.MutableList[String]()
      it.map(r => {
        ... // HBase API Scan Results
      })
    })
Page 20
BulkLoad
• Bulk load a data set into HBase (for all cases, generally wide tables)
    // hbaseContext is passed first, as in the other implicit RDD functions
    rdd.hbaseBulkLoad(hbaseContext,
      tableName,
      t => {
        Seq((new KeyFamilyQualifier(t.rowKey, t.family, t.qualifier), t.value)).iterator
      },
      stagingFolder)
Page 21
BulkLoadThinRows
• Bulk load a data set into HBase (for skinny tables, <10k cols)
    hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
      TableName.valueOf(tableName),
      t => {
        val rowKey = Bytes.toBytes(t._1)
        // collect every (family, qualifier, value) cell of this row
        val familyQualifiersValues = new FamiliesQualifiersValues
        t._2.foreach(f => {
          val family: Array[Byte] = f._1
          val qualifier = f._2
          val value: Array[Byte] = f._3
          familyQualifiersValues += (family, qualifier, value)
        })
        (new ByteArrayWrapper(rowKey), familyQualifiersValues)
      },
      stagingFolder.getPath)
Page 22
Scan vs Bulk Get (Parallel HBase Multigets)
[Diagram labels: Scan HBase Table; Bulk Get HBase Table]
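The deck lists BulkGet among the HBaseContext operations but shows no call. A minimal sketch, assuming the `bulkGet` signature of the upstream hbase-spark module (table name, batch size, RDD, a function building a `Get`, a function converting a `Result`); the signature in a given release may differ, and the row keys and table name are illustrative.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object BulkGetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bulk-get-sketch"))
    val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

    // Row keys to look up, distributed across partitions.
    val rowKeys = sc.parallelize(Seq("1", "2", "3").map(s => Bytes.toBytes(s)))

    val found = hbaseContext.bulkGet[Array[Byte], String](
      TableName.valueOf("t1"),
      2, // batch size: number of Gets sent per multiget
      rowKeys,
      rowKey => new Get(rowKey),
      (result: Result) =>
        // An empty Result means the row was not found.
        if (result.isEmpty) "(missing)" else Bytes.toString(result.getRow))

    found.collect().foreach(println)
    sc.stop()
  }
}
```

Compared with a scan, this touches only the requested rows, which tends to pay off when the key set is small relative to the table.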
Page 23
BulkPut
• Parallelized HBase Multiput
    hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
      tableName,
      (putRecord) => {
        val put = new Put(putRecord._1)
        // addColumn replaces the deprecated Put.add(family, qualifier, value)
        putRecord._2.foreach((putValue) =>
          put.addColumn(putValue._1, putValue._2, putValue._3))
        put
      })
Page 24
BulkDelete
• Parallelized HBase Multi-deletes
    // Through the HBaseContext
    hbaseContext.bulkDelete[Array[Byte]](rdd,
      tableName,
      putRecord => new Delete(putRecord),
      4) // batch size

    // The same operation through the implicit RDD function
    rdd.hbaseBulkDelete(hbaseContext,
      tableName,
      putRecord => new Delete(putRecord),
      4) // batch size
Page 25
SparkSQL
• Using SparkSQL to query HBase Data
    // Set up schema mapping
    val dataframe = sqlContext.load("org.apache.hadoop.hbase.spark",
      Map("hbase.columns.mapping" ->
            "KEY_FIELD STRING :key, A_FIELD STRING c:a, B_FIELD STRING c:b,",
          "hbase.table" -> "t1"))
    dataframe.registerTempTable("hbaseTmp")

    // Query
    sqlContext.sql("SELECT KEY_FIELD FROM hbaseTmp " +
        "WHERE " +
        "(KEY_FIELD = 'get1' and B_FIELD < '3') or " +
        "(KEY_FIELD <= 'get3' and B_FIELD = '8')")
      .foreach(r => println(" - " + r))
Page 26
SparkSQL + MLlib
• Process data extracted from SparkSQL
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val resultDf = sqlContext.sql("SELECT gamer_id, oks, games_won, games_played FROM gamer")
    // Parse data to apply typing information (column 0, gamer_id, is not a feature)
    val parsedData = resultDf.map(r => {
      val array = Array(r.getInt(1).toDouble, r.getInt(2).toDouble, r.getInt(3).toDouble)
      Vectors.dense(array)
    })
    val dataCount = parsedData.count()
    if (dataCount > 0) {
      // k = 3 clusters, at most 5 iterations
      val clusters = KMeans.train(parsedData, 3, 5)
      clusters.clusterCenters.foreach(v => println(" Vector Center:" + v))
    }
Page 27
Future work and Conclusion
Page 28
Development and Distribution Status
• Today
  • Batch analysis patterns with existing MR Input/Output Formats
  • Streaming analysis patterns
• Committed to HBase trunk branch (2.0) as part of the HBase project
• Available in CDH 5.7.0 with commercial support
• Used in production and pre-production today at ~10 Cloudera customers
• Recent additions
  • Kerberos and secure HBase access
  • To come: Kerberos ticket renewals for Spark Streaming
  • New JSON-based HBase table schema specification
Page 29
How does Spark get data in and out of HBase?
[Diagram: the same HBase data-path diagram (HBase Client, Bulk Import, HBase Replication, HBase Scanner; low latency to high throughput), annotated with where Spark fits:
• Batch RDD via HBase’s MR Input/Output Formats
• Streaming using HBase to enrich stream data
• HBase data as a Spark Streaming data source]
Page 30
Future: HBase Data as a Source
• HBase edits as a Spark Streaming data source (with Kafka?)
• Gather other data
• Do some computation
• Write the data out
[Diagram labels: HBase Replication; Data source; Mini batch input RDD; HBase-enriched mini batch output RDD]
Page 31
Thank you!
Page 32
Use Case – Streaming Counting
• Puts vs Increments
• Bulk Puts/Gets are good
• You can get perfect counting
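One plausible reading of "Puts vs Increments" is about replay safety: if a mini-batch is reprocessed after a failure, an increment is applied twice, while a put of the running total simply overwrites the same value. The toy model below illustrates that difference with an in-memory map standing in for an HBase table; none of these names come from the HBase or Spark APIs, and the deck does not spell out this argument itself.

```scala
import scala.collection.mutable

object CountingSketch {
  // Stand-in for an HBase table of counters (hypothetical, in-memory only).
  type Store = mutable.Map[String, Long]

  // Count the keys seen in one mini-batch.
  def countBatch(batch: Seq[String]): Map[String, Long] =
    batch.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }

  // Increment-style write: add this batch's delta to the stored value.
  def writeIncrements(store: Store, deltas: Map[String, Long]): Unit =
    deltas.foreach { case (k, d) => store(k) = store.getOrElse(k, 0L) + d }

  // Fold a batch's counts into running totals kept by the streaming job.
  def updateTotals(totals: Map[String, Long], deltas: Map[String, Long]): Map[String, Long] =
    deltas.foldLeft(totals) { case (acc, (k, d)) => acc.updated(k, acc.getOrElse(k, 0L) + d) }

  // Put-style write: overwrite the stored value with the running total.
  def writePuts(store: Store, totals: Map[String, Long]): Unit =
    totals.foreach { case (k, t) => store(k) = t }

  def main(args: Array[String]): Unit = {
    val batch = Seq("a", "b", "a")

    val incStore: Store = mutable.Map.empty
    writeIncrements(incStore, countBatch(batch))
    writeIncrements(incStore, countBatch(batch)) // same batch replayed
    println(s"increments after replay: ${incStore.toMap}")

    val putStore: Store = mutable.Map.empty
    val totals = updateTotals(Map.empty, countBatch(batch))
    writePuts(putStore, totals)
    writePuts(putStore, totals) // replaying the write changes nothing
    println(s"puts after replay: ${putStore.toMap}")
  }
}
```

In a real job the running totals would live in Spark Streaming state rather than a local value, and the writes would be HBase Increments or Puts.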
Page 33
[Diagram: Spark Streaming counting with HBase Increments. Sources feed Receivers, which produce the RDDs of each DStream. For each batch (First Batch, Second Batch), a single pass applies Filter, then Count, then HBase Increments.]
Page 34
[Diagram: Spark Streaming counting with HBase Puts. Sources feed Receivers, which produce RDD partitions for each DStream. For each batch, a single pass applies Filter and Count, followed by HBase Puts. Batch labels: Pre-first Batch, First Batch, Second Batch. State labels: Stateful RDD 1, Stateful RDD 2.]