Splice Machine Open Source RDBMS September 26, 2016 Daniel Gómez Ferro John Leach

HBaseConEast2016: Splice machine open source rdbms

Jan 06, 2017


Michael Stack
Transcript
Page 1: HBaseConEast2016: Splice machine open source rdbms

Splice Machine Open Source RDBMS
September 26, 2016
Daniel Gómez Ferro, John Leach

Page 2: HBaseConEast2016: Splice machine open source rdbms

Open Source Stack: Spark, Hadoop and Derby

Apache Derby
▪ ANSI SQL-99 RDBMS
▪ Java-based
▪ ODBC/JDBC compliant

Apache HBase/Hadoop
▪ Auto-sharding
▪ High availability
▪ Scalability to 100s of PBs

Apache Spark
▪ Analytical engine
▪ Fast, in-memory technology
▪ Memory resilient to node failure

Page 3: HBaseConEast2016: Splice machine open source rdbms

Splice Machine: Query Execution


Page 4: HBaseConEast2016: Splice machine open source rdbms

Splice Machine: Query Execution

1. Parse SQL
• Generate Abstract Syntax Tree (AST)
• Bind AST to Transactional Dictionary

Page 5: HBaseConEast2016: Splice machine open source rdbms

Splice Machine: Query Execution

1. Parse SQL
2. Optimize query plan
• Determine join order and storage structure (e.g., base table, index) using table statistics (e.g., cardinality estimates)
• Push predicates
• Unroll nested subqueries
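The join-ordering step can be illustrated with a small sketch. This is not Splice Machine's or Derby's optimizer: it assumes a simple greedy heuristic, and the selectivity numbers are made up for illustration. The names `TableStats` and `greedy_join_order` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    name: str
    row_count: int  # cardinality estimate taken from table statistics


def greedy_join_order(tables, selectivity):
    """Pick a left-deep join order greedily from cardinality estimates.

    Start with the smallest table, then repeatedly add the table that
    keeps the estimated intermediate result smallest. `selectivity` maps
    frozenset({a, b}) to an estimated join selectivity; a missing pair is
    treated as a cross join (selectivity 1.0). This is a simplification.
    """
    remaining = sorted(tables, key=lambda t: t.row_count)
    first = remaining.pop(0)
    order = [first.name]
    est_rows = float(first.row_count)

    while remaining:
        def cost(t):
            # Most selective predicate against any table already joined.
            sel = min(selectivity.get(frozenset({t.name, j}), 1.0) for j in order)
            return est_rows * t.row_count * sel

        best = min(remaining, key=cost)
        est_rows = cost(best)
        remaining.remove(best)
        order.append(best.name)
    return order, est_rows


# Row counts below come from the TPCH 100 load slide; selectivities are invented.
tables = [
    TableStats("LINEITEM", 600_037_902),
    TableStats("NATION", 25),
    TableStats("SUPPLIER", 1_000_000),
]
selectivity = {
    frozenset({"NATION", "SUPPLIER"}): 1 / 25,
    frozenset({"SUPPLIER", "LINEITEM"}): 1 / 1_000_000,
}
order, _ = greedy_join_order(tables, selectivity)
print(order)
```

A real cost-based optimizer would also weigh access paths (base table versus index) and join strategies; the sketch only shows how cardinality estimates can drive the ordering.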

Page 6: HBaseConEast2016: Splice machine open source rdbms

Splice Machine: Query Execution

1. Parse SQL
2. Optimize query plan
3. Generate optimal byte code

Page 7: HBaseConEast2016: Splice machine open source rdbms

Splice Machine: Query Execution

1. Parse SQL
2. Optimize query plan
3. Generate optimal byte code

OLTP Execution on HBase
4a. Execute OLTP query from byte code
5a. Use block cache and bloom filters to optimize data access
6a. Return results
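To show why bloom filters help the OLTP path, here is a minimal, generic bloom filter sketch. It is not HBase's implementation; the sizes, the hash construction, and the class name `BloomFilter` are assumptions chosen for brevity. The idea is that a "definitely not present" answer lets a point lookup skip a file without reading it.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def files_to_read(row_key: bytes, files):
    """Return only the files whose filter says the key might be present.

    `files` is a hypothetical list of (file_name, BloomFilter) pairs.
    """
    return [name for name, bf in files if bf.might_contain(row_key)]
```

A file whose filter has never seen any key is always skipped, while a file that does hold the key is always read; files in between may be read unnecessarily at the filter's false-positive rate.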

Page 8: HBaseConEast2016: Splice machine open source rdbms

Splice Machine: Query Execution

1. Parse SQL
2. Optimize query plan
3. Generate optimal byte code

OLTP Execution on HBase
4a. Execute OLTP query from byte code
5a. Use block cache and bloom filters to optimize data access
6a. Return results

OLAP Execution on Spark
4b. Generate Spark execution plan
5b. Submit Spark plan with byte code
6b. Fair scheduling of distributed tasks
7b. Generate RDD from HFiles and Memstore
8b. Execute query and return results
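The slides show two execution paths that share the same plan and byte code, but they do not say how a query is assigned to one path or the other. The sketch below assumes one plausible rule, a cutoff on the optimizer's estimated row count. The threshold value and the names `Plan` and `choose_engine` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the slides do not give a value or confirm this rule.
OLAP_ROW_THRESHOLD = 20_000


@dataclass
class Plan:
    sql: str
    estimated_rows_scanned: int  # produced by the optimizer from statistics


def choose_engine(plan: Plan, threshold: int = OLAP_ROW_THRESHOLD) -> str:
    """Route small, selective queries to HBase and large scans to Spark."""
    return "spark" if plan.estimated_rows_scanned > threshold else "hbase"


print(choose_engine(Plan("SELECT * FROM orders WHERE o_orderkey = 7", 1)))
print(choose_engine(Plan("SELECT COUNT(*) FROM lineitem", 600_037_902)))
```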

Page 9: HBaseConEast2016: Splice machine open source rdbms

Architectural Differences: Don’t we already have SQL on HBase?

Area | Phoenix | Trafodion | Splice Machine
Transactional System | Tephra Centralized SI | Two Phase Commit | Hierarchical Distributed SI
Analytical Engine | HBase Coprocessors, JDBC Client | HBase Coprocessors, Executor Services Processes | Spark on Yarn
Import Process | Python or MapReduce | MapReduce via Hive | JDBC Command, Spark job
Scanning Data | Coprocessor Internal Scans, HBase Scans | Coprocessor Internal Scans, HBase Scans | File Oriented Hybrid Scanner
Compaction | HBase Compaction | HBase Compaction | Spark Compaction
Resource Management | HBase Call Queues | Workload Management System | Spark Job Scheduling (FAIR)

Page 10: HBaseConEast2016: Splice machine open source rdbms

TPCH 100 Load Times

Tables | Row Count | Load times (h:mm:ss)
LINEITEM | 600037902 | 5:19:27 | 1:25:46 | 0:22:34
ORDERS | 150000000 | 0:51:28 | 0:15:29 | 0:09:58
PARTSUPP | 80000000 | 0:18:41 | 0:08:52 | 0:06:28
PART | 20000000 | 0:07:26 | 0:02:27 | 0:02:14
CUSTOMERS | 15000000 | 0:05:37 | 0:02:03 | 0:01:42
SUPPLIER | 1000000 | 0:01:48 | 0:00:26 | 0:00:18
NATION | 25 | 0:00:41 | 0:00:07 | 0:00:01
REGION | 5 | 0:00:43 | 0:00:05 | 0:00:01

Page 11: HBaseConEast2016: Splice machine open source rdbms

TPCH 100 Load Throughput
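The throughput chart itself is not in the transcript, but throughput can be derived from the load-time table as rows divided by elapsed seconds. The sketch below does this for the LINEITEM row, assuming the times are in h:mm:ss. The function names are illustrative only.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def rows_per_second(row_count: int, hms: str) -> float:
    return row_count / to_seconds(hms)


LINEITEM_ROWS = 600_037_902  # row count from the load-time table

# The three LINEITEM load times, in the order they appear in the table.
for load_time in ("5:19:27", "1:25:46", "0:22:34"):
    rate = rows_per_second(LINEITEM_ROWS, load_time)
    print(f"{load_time}: {rate:,.0f} rows/s")
```

For LINEITEM this works out to roughly 31 thousand, 117 thousand, and 443 thousand rows per second for the three columns respectively.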

Page 12: HBaseConEast2016: Splice machine open source rdbms

Write Pipeline

▪ Features
  ▪ Batched writes per region server
  ▪ Congestion control, retries
  ▪ Asynchronous writes
  ▪ Constraint checking (PK, FK…)
  ▪ Index updates

▪ One-for-all pipeline
  ▪ OLTP queries
  ▪ Batch data ingestion (Imports, Hadoop OutputFormat, OLAP query inserts...)
  ▪ Streaming data ingestion (Kafka, Spark streaming…)
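Two of the listed features, batching writes per region server and retrying failed batches, can be sketched as follows. This is an illustration under stated assumptions rather than Splice Machine's pipeline: `locate` stands in for a region lookup, `send` stands in for a network call, and the batch size and retry limit are arbitrary.

```python
from collections import defaultdict


def batch_by_region_server(mutations, locate, max_batch=2):
    """Group (row_key, value) mutations by region server, then cut batches.

    `locate(row_key)` is a hypothetical lookup returning the server that
    hosts the row's region.
    """
    per_server = defaultdict(list)
    for row_key, value in mutations:
        per_server[locate(row_key)].append((row_key, value))

    batches = []
    for server, rows in per_server.items():
        for start in range(0, len(rows), max_batch):
            batches.append((server, rows[start:start + max_batch]))
    return batches


def send_with_retries(batch, send, max_attempts=3):
    """Call `send(batch)` until it returns True; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        if send(batch):
            return attempt
    raise RuntimeError(f"batch failed after {max_attempts} attempts")
```

A real pipeline would send batches asynchronously and back off under congestion; the sketch keeps the calls synchronous so the grouping and retry logic stay visible.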

Page 13: HBaseConEast2016: Splice machine open source rdbms

Spark Compactions

Spark UI

▪ Out of process compactions
  ▪ Minor and Major
  ▪ Decrease Regionserver load
  ▪ Increase stability
  ▪ Remote compactions
  ▪ Prioritized by Spark’s fair scheduler
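To give a feel for how fair scheduling lets compactions and queries share a cluster, here is a toy round-robin scheduler over named pools. It is a deliberate simplification: Spark's actual FAIR scheduler also takes pool weights and minimum shares into account, and the pool names below are hypothetical.

```python
from collections import deque


def fair_schedule(pools):
    """Interleave tasks from several pools, one task per pool per round.

    `pools` maps a pool name to its list of pending tasks. No pool can
    starve another, which is the property the sketch is meant to show.
    """
    queues = {name: deque(tasks) for name, tasks in pools.items()}
    order = []
    while any(queues.values()):  # an empty deque is falsy
        for name, queue in queues.items():
            if queue:
                order.append((name, queue.popleft()))
    return order


schedule = fair_schedule({
    "queries": ["q1", "q2"],
    "compactions": ["c1", "c2", "c3"],
})
print(schedule)
```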

Page 14: HBaseConEast2016: Splice machine open source rdbms

TPCH 100 Query Times (seconds)

Query | Phoenix | Trafodion | Splice Machine
1 | 395 | TRAFODION-2237 | 99
2 | PHOENIX-3322 | 516 | 44
3 | PHOENIX-3322 | TRAFODION-2237 | 126
4 | PHOENIX-3322 | TBD | 133
5 | PHOENIX-3322 | TBD | 192
6 | 74 | 3178 | 38
7 | PHOENIX-3322 | 4442 | 220
8 | PHOENIX-3322 | TRAFODION-2239 | 620
9 | PHOENIX-3322 | 941 | 273
10 | PHOENIX-3322 | TRAFODION-2241 | 101
11 | PHOENIX-3317 | 463 | 56

Page 15: HBaseConEast2016: Splice machine open source rdbms

TPCH 100 Query Times (seconds)

Query | Phoenix | Trafodion | Splice Machine
12 | 379 | TBD | 85
13 | PHOENIX-3318 | TBD | 71
14 | PHOENIX-3322 | TBD | 50
15 | PHOENIX-3319 | TBD | 102
16 | PHOENIX-3322 | TBD | 33
17 | PHOENIX-3322 | TBD | 929
18 | PHOENIX-3322 | TBD | SPLICE-34
19 | PHOENIX-3322 | TBD | 57
20 | PHOENIX-3320 | TBD | SPLICE-410
21 | PHOENIX-3321 | TBD | 479
22 | PHOENIX-3322 | TBD | 219

Page 16: HBaseConEast2016: Splice machine open source rdbms

Splice Machine: Advanced Spark Integration

Innovative, High-Performance RDD Creation
▪ Fast access to HFiles in HDFS
▪ Merged with deltas from Memstore
▪ Avoids slower HBase API
▪ Reduces load in HBase

Universal Execution Plan and Byte Code
▪ Optimizer, plan and code shared across Spark or HBase execution

[Diagram: each physical node runs an HBase Region Server (Region 1 ... Region N, each with a Memstore) and a Spark Worker (RDD 1 ... RDD N) on top of HDFS, which holds the HFiles]
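The idea of reading HFiles directly and merging them with Memstore deltas can be sketched as a sorted merge in which the newest version of each row wins. This is only an illustration: the cell layout `(row_key, timestamp, value)`, the use of `None` as a delete marker, and the function name `merged_scan` are assumptions, and each source is assumed to hold at most one version per row, sorted by row key.

```python
import heapq


def merged_scan(hfiles, memstore):
    """Yield (row_key, value) in key order, newest version per row.

    Each source is a list of (row_key, timestamp, value) cells sorted by
    row_key. A value of None is treated as a delete marker (hypothetical
    convention) and hides older versions of that row.
    """
    sources = [memstore, *hfiles]
    # Order by row key, then newest timestamp first within a row.
    merged = heapq.merge(*sources, key=lambda cell: (cell[0], -cell[1]))

    last_key = None
    for row_key, _timestamp, value in merged:
        if row_key == last_key:
            continue  # older version of a row already handled
        last_key = row_key
        if value is not None:
            yield row_key, value


hfile_1 = [("a", 1, "a1"), ("b", 1, "b1")]
hfile_2 = [("b", 2, "b2"), ("c", 2, "c2")]
memstore = [("a", 3, "a3"), ("c", 3, None)]
print(list(merged_scan([hfile_1, hfile_2], memstore)))
```

Because the merge is streaming, no source has to be loaded fully into memory, which fits the goal of avoiding the slower HBase API path for large scans.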

Page 17: HBaseConEast2016: Splice machine open source rdbms

Resources

▪ Do you trust us? Nah...
  ▪ Give it a shot yourself and let us know what you find...
  ▪ https://github.com/splicemachine/benchmarks

▪ Want to get involved?
  ▪ http://community.splicemachine.com/

▪ Want to code? Yeah, me too...
  ▪ https://github.com/splicemachine/spliceengine