HBaseConEast2016: Splice machine open source rdbms

Splice Machine: Open Source RDBMS

September 26, 2016

Daniel Gómez Ferro, John Leach

Open Source Stack: Spark, Hadoop and Derby

Apache Derby
▪ ANSI SQL-99 RDBMS
▪ Java-based
▪ ODBC/JDBC compliant

Apache HBase/Hadoop
▪ Auto-sharding
▪ High availability
▪ Scalability to 100s of PBs

Apache Spark
▪ Analytical engine
▪ Fast, in-memory technology
▪ Memory resilient to node failure

Splice Machine: Query Execution


1. Parse SQL
   • Generate Abstract Syntax Tree (AST)
   • Bind AST to Transactional Dictionary
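The binding step can be pictured as resolving every name in the parsed tree against the dictionary. The sketch below is a toy illustration in Python, not Splice Machine or Derby code: the `DICTIONARY` contents, the `BoundColumn` type and the dict-shaped tree are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical data dictionary: table name -> {column name: SQL type}.
DICTIONARY = {
    "orders": {"o_orderkey": "BIGINT", "o_totalprice": "DECIMAL"},
}

@dataclass(frozen=True)
class BoundColumn:
    table: str
    name: str
    sql_type: str

def bind(ast):
    """Resolve the column names of a parsed SELECT against the dictionary."""
    table = ast["from"]
    if table not in DICTIONARY:
        raise LookupError(f"unknown table: {table}")
    columns = DICTIONARY[table]
    bound = []
    for name in ast["select"]:
        if name not in columns:
            raise LookupError(f"unknown column: {table}.{name}")
        bound.append(BoundColumn(table, name, columns[name]))
    return bound

# Toy tree for: SELECT o_orderkey, o_totalprice FROM orders
ast = {"select": ["o_orderkey", "o_totalprice"], "from": "orders"}
bound_columns = bind(ast)
```

A query that names an unknown table or column fails here, before any planning or execution happens.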


2. Optimize query plan
   • Determine join order and storage structure (e.g., base table, index) using table statistics (e.g., cardinality estimates)
   • Push predicates
   • Unroll nested subqueries
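To make the role of cardinality estimates concrete, here is a minimal Python sketch of a smallest-table-first join ordering. This is a deliberately simple greedy heuristic, not the optimizer described in the deck, which also weighs storage structures and predicates; the function name `greedy_join_order` is hypothetical.

```python
def greedy_join_order(cardinalities):
    """Order tables smallest-first, a simple stand-in for cost-based ordering."""
    return sorted(cardinalities, key=lambda table: (cardinalities[table], table))

# Row counts taken from the TPCH 100 load table later in this deck.
stats = {
    "LINEITEM": 600037902,
    "ORDERS": 150000000,
    "NATION": 25,
    "SUPPLIER": 1000000,
}
order = greedy_join_order(stats)
```

Starting from the smallest inputs tends to keep intermediate results small, which is why stale or missing statistics can lead to poor plans.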



3. Generate optimal byte code
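The deck does not show the generated code. As a rough analogy only, the Python sketch below compiles a filter expression once and then evaluates the compiled code object per row, instead of re-interpreting the expression for every row. `compile_predicate` and the row layout are assumptions for illustration, and Python's `compile` stands in for Java byte code generation.

```python
def compile_predicate(column, op, literal):
    """Compile `column op literal` once into a code object, then reuse it per row."""
    allowed_ops = {"<", "<=", "=", ">=", ">"}
    if op not in allowed_ops:
        raise ValueError(f"unsupported operator: {op}")
    if not column.isidentifier():
        raise ValueError(f"invalid column name: {column}")
    py_op = "==" if op == "=" else op  # SQL equality maps to Python ==
    source = f"row[{column!r}] {py_op} {literal!r}"
    code = compile(source, "<generated>", "eval")
    # Evaluate with no builtins so only the row lookup and comparison can run.
    return lambda row: eval(code, {"__builtins__": {}}, {"row": row})

is_large = compile_predicate("l_quantity", ">", 40)
rows = [{"l_quantity": 10}, {"l_quantity": 45}]
matches = [r for r in rows if is_large(r)]
```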




OLTP Execution on HBase
4a. Execute OLTP query from byte code
5a. Use block cache and Bloom filters to optimize data access
6a. Return results







OLAP Execution on Spark
4b. Generate Spark execution plan
5b. Submit Spark plan with byte code
6b. Fair scheduling of distributed tasks
7b. Generate RDD from HFiles and Memstore
8b. Execute query and return results
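Step 6b relies on fair scheduling, which in Spark means handing out tasks to concurrent jobs in round-robin fashion rather than strictly first-in, first-out. A minimal Python sketch of that idea follows; it models only the interleaving, not Spark's pools or weights, and the job names are made up.

```python
from collections import deque

def fair_schedule(jobs):
    """Interleave tasks round-robin across jobs so no job monopolizes the slots."""
    queues = deque((name, deque(tasks)) for name, tasks in jobs.items() if tasks)
    order = []
    while queues:
        name, tasks = queues.popleft()
        order.append((name, tasks.popleft()))
        if tasks:
            queues.append((name, tasks))  # job still has work: back of the line
    return order

jobs = {"long_report": ["t1", "t2", "t3", "t4"], "short_lookup": ["t1"]}
schedule = fair_schedule(jobs)
```

The short job gets its single task scheduled second, instead of waiting behind all four tasks of the long job.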

Architectural Differences: Don’t we already have SQL on HBase?

                       Phoenix                       Trafodion                     Splice Machine
Transactional System   Tephra Centralized SI         Two Phase Commit              Hierarchical Distributed SI
Analytical Engine      HBase Coprocessors,           HBase Coprocessors,           Spark on YARN
                       JDBC Client                   Executor Services Processes
Import Process         Python or MapReduce           MapReduce via Hive            JDBC Command, Spark job
Scanning Data          Coprocessor Internal Scans,   Coprocessor Internal Scans,   File Oriented Hybrid Scanner
                       HBase Scans                   HBase Scans
Compaction             HBase Compaction              HBase Compaction              Spark Compaction
Resource Management    HBase Call Queues             Workload Management System    Spark Job Scheduling (FAIR)

TPCH 100 Load Times

Tables      Row Count    Load time (h:mm:ss)
LINEITEM    600037902    5:19:27    1:25:46    0:22:34
ORDERS      150000000    0:51:28    0:15:29    0:09:58
PARTSUPP    80000000     0:18:41    0:08:52    0:06:28
PART        20000000     0:07:26    0:02:27    0:02:14
CUSTOMERS   15000000     0:05:37    0:02:03    0:01:42
SUPPLIER    1000000      0:01:48    0:00:26    0:00:18
NATION      25           0:00:41    0:00:07    0:00:01
REGION      5            0:00:43    0:00:05    0:00:01

TPCH 100 Load Throughput
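Load throughput follows directly from the load times table as rows loaded divided by elapsed seconds. The Python sketch below works this out for the LINEITEM row; the helper names are illustrative, and the three values are in the same column order as the table.

```python
def to_seconds(hms):
    """Convert an h:mm:ss string into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def rows_per_second(row_count, hms):
    """Average load throughput for one table."""
    return row_count / to_seconds(hms)

# LINEITEM row from the load times table: row count and the three reported times.
lineitem_rows = 600037902
lineitem_times = ["5:19:27", "1:25:46", "0:22:34"]
throughputs = [rows_per_second(lineitem_rows, t) for t in lineitem_times]
```

For LINEITEM this comes to roughly 31 thousand, 117 thousand and 443 thousand rows per second for the three columns respectively.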

Write Pipeline
▪ Features
   ▪ Batched writes per region server
   ▪ Congestion control, retries
   ▪ Asynchronous writes
   ▪ Constraint checking (PK, FK…)
   ▪ Index updates
▪ One-for-all pipeline
   ▪ OLTP queries
   ▪ Batch data ingestion (Imports, Hadoop OutputFormat, OLAP query inserts...)
   ▪ Streaming data ingestion (Kafka, Spark Streaming…)
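Two of the listed features, batching per region server and retries, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Splice Machine pipeline: `locate_server` and `send_batch` are hypothetical callbacks, the writes are synchronous, and constraint checking and index updates are left out.

```python
import time
from collections import defaultdict

def write_batched(mutations, locate_server, send_batch, batch_size=2, max_retries=3):
    """Group mutations per region server and send them in batches with retries."""
    per_server = defaultdict(list)
    for row_key, value in mutations:
        per_server[locate_server(row_key)].append((row_key, value))

    for server, pending in per_server.items():
        for start in range(0, len(pending), batch_size):
            batch = pending[start:start + batch_size]
            for attempt in range(max_retries + 1):
                try:
                    send_batch(server, batch)
                    break
                except ConnectionError:
                    if attempt == max_retries:
                        raise  # give up after the last allowed retry
                    time.sleep(0.01 * 2 ** attempt)  # exponential backoff

# Usage with in-memory stand-ins for the cluster.
received = []
write_batched(
    [("a", 1), ("z", 2), ("b", 3)],
    locate_server=lambda key: "rs1" if key < "m" else "rs2",
    send_batch=lambda server, batch: received.append((server, batch)),
)
```

Backing off between retries is one simple form of congestion control: a busy server gets progressively more time before the same batch is sent again.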

Spark Compactions

[Screenshot: Spark UI]

▪ Out of process compactions
   ▪ Minor and Major
   ▪ Decrease Regionserver load
   ▪ Increase stability
   ▪ Remote compactions
   ▪ Prioritized by Spark's fair scheduler
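At its core a compaction is a merge of several sorted files into one. The Python sketch below shows that merge in a simplified model, assuming one surviving version per key and a `TOMBSTONE` marker for deletes; both are assumptions for the example. It does not show the part the slide emphasizes, namely that the work runs in Spark rather than inside the region server.

```python
import heapq

TOMBSTONE = object()  # marker for a deleted cell

def compact(sorted_files, major=False):
    """Merge sorted (key, timestamp, value) runs, keeping the newest cell per key."""
    # Order by key ascending, then timestamp descending, so the newest cell comes first.
    merged = heapq.merge(*sorted_files, key=lambda cell: (cell[0], -cell[1]))
    output, last_key = [], None
    for key, timestamp, value in merged:
        if key == last_key:
            continue  # older version of a key that was already handled
        last_key = key
        if major and value is TOMBSTONE:
            continue  # in this model, a major compaction drops deleted cells
        output.append((key, timestamp, value))
    return output

file_a = [("k1", 1, "a"), ("k2", 1, "b")]
file_b = [("k1", 2, "a2"), ("k2", 3, TOMBSTONE)]
minor_result = compact([file_a, file_b])
major_result = compact([file_a, file_b], major=True)
```

Because every input is already sorted, the merge streams through the files once and never needs to hold a whole file in memory.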

TPCH 100 Query Times (seconds)

Query   Phoenix        Trafodion        Splice Machine
1       395            TRAFODION-2237   99
2       PHOENIX-3322   516              44
3       PHOENIX-3322   TRAFODION-2237   126
4       PHOENIX-3322   TBD              133
6       74             3178             38
7       PHOENIX-3322   4442             220
9       PHOENIX-3322   941              273
11      PHOENIX-3317   463              56

TPCH 100 Query Times (seconds)

Query   Phoenix        Trafodion   Splice Machine
12      379            TBD         85
18      PHOENIX-3322   TBD         SPLICE-34
20      PHOENIX-3320   TBD         SPLICE-410



Splice Machine: Advanced Spark Integration

Innovative, High-Performance RDD Creation
▪ Fast access to HFiles in HDFS
▪ Merged with deltas from Memstore
▪ Avoids slower HBase API
▪ Reduces load in HBase

Universal Execution Plan and Byte Code
▪ Optimizer, plan and code shared across Spark or HBase execution
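The idea of one plan shared by both engines can be illustrated with a small Python analogy: the plan is built once as a per-row function, and two different runners apply it either to a single scan or to several partitions. All names here are hypothetical, and the runners are plain loops standing in for HBase and Spark execution.

```python
def make_plan(predicate, projection):
    """Build one plan (a row-stream transformer) usable by either runner."""
    def plan(rows):
        for row in rows:
            if predicate(row):
                yield projection(row)
    return plan

def run_local(plan, rows):
    """Stand-in for the OLTP path: run the plan over a single scan."""
    return list(plan(rows))

def run_partitioned(plan, partitions):
    """Stand-in for the OLAP path: run the same plan per partition, then combine."""
    results = []
    for partition in partitions:
        results.extend(plan(partition))
    return results

rows = [{"id": 1, "qty": 5}, {"id": 2, "qty": 50}, {"id": 3, "qty": 60}]
plan = make_plan(lambda r: r["qty"] > 40, lambda r: r["id"])
local_result = run_local(plan, rows)
partitioned_result = run_partitioned(plan, [rows[:2], rows[2:]])
```

Both runners return the same rows, since only the way the input is fed to the plan differs.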

[Diagram: each physical node runs an HBase Region Server (Region 1 … Region N, each with its own Memstore) alongside a Spark Worker (RDD 1 … RDD N), with the HFiles stored in HDFS.]

Resources
▪ Do you trust us? Nah...
   ▪ Give it a shot yourself and let us know what you find...
   ▪ https://github.com/splicemachine/benchmarks
▪ Want to get involved?
   ▪ http://community.splicemachine.com/
▪ Want to code? Yeah, me too...
   ▪ https://github.com/splicemachine/spliceengine
