
SparkSQL: A Compiler from Queries to RDDs

Page 1: SparkSQL: A Compiler from Queries to RDDs

SparkSQL: A Compiler from Queries to RDDs

Sameer Agarwal
Spark Summit | Boston | February 9th, 2017

Page 2: SparkSQL: A Compiler from Queries to RDDs

About Me

• Software Engineer at Databricks (Spark Core/SQL)
• PhD in Databases (AMPLab, UC Berkeley)
• Research on BlinkDB (Approximate Queries in Spark)

Page 3: SparkSQL: A Compiler from Queries to RDDs

Background: What is an RDD?

• Dependencies
• Partitions
• Compute function: Partition => Iterator[T]

Page 4: SparkSQL: A Compiler from Queries to RDDs

Background: What is an RDD?

• Dependencies
• Partitions
• Compute function: Partition => Iterator[T]

Opaque computation: Spark sees the compute function only as an arbitrary closure.

Page 5: SparkSQL: A Compiler from Queries to RDDs

Background: What is an RDD?

• Dependencies
• Partitions
• Compute function: Partition => Iterator[T]

Opaque data: the records flowing through Iterator[T] are just objects to Spark.
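To make these three ingredients concrete, here is a minimal sketch (not from the talk) of a custom RDD; RangeRDD and RangePartition are hypothetical names, but getPartitions and compute are the members a real RDD subclass overrides:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: one contiguous range of integers.
class RangePartition(override val index: Int, val start: Int, val end: Int)
  extends Partition

// An RDD is fully described by its dependencies (here: none),
// its partitions, and a compute function Partition => Iterator[T].
class RangeRDD(sc: SparkContext, n: Int, slices: Int)
  extends RDD[Int](sc, Nil) {  // Nil = no parent dependencies

  // Partitions: split [0, n) into `slices` contiguous ranges.
  override protected def getPartitions: Array[Partition] =
    (0 until slices).map { i =>
      new RangePartition(i, i * n / slices, (i + 1) * n / slices)
    }.toArray

  // Compute function: Partition => Iterator[Int]. To Spark this body is
  // an opaque closure, and the Ints it yields are opaque records.
  override def compute(split: Partition, ctx: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}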

Page 6: SparkSQL: A Compiler from Queries to RDDs

RDD Programming Model

Construct an execution DAG using low-level RDD operators.
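For example, a word count written directly against low-level RDD operators (a sketch; the input path is a placeholder, and sc is an existing SparkContext):

val counts = sc.textFile("hdfs://...")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Each call adds a node to the DAG, and Spark executes the DAG exactly as written; nothing reorders or rewrites it.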

Page 9: SparkSQL: A Compiler from Queries to RDDs

SQL/Structured Programming Model

• High-level APIs (SQL, DataFrame/Dataset): programs describe what data operations are needed without specifying how to execute them
• More efficient: an optimizer can automatically find the most efficient plan to execute a query (see the sketch below)
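As a sketch of the declarative style, the running example from the following slides could be written like this (assuming t1 and t2 are registered tables and spark is a SparkSession):

import org.apache.spark.sql.functions.{lit, sum}

val t1 = spark.table("t1")
val t2 = spark.table("t2")
val result = t1.join(t2, t1("id") === t2("id"))
  .filter(t2("id") > 50 * 1000)
  .select((lit(1) + lit(2) + t1("value")).as("v"))
  .agg(sum("v"))

Unlike the RDD version, the optimizer is free to reorder and rewrite this plan, which is exactly what the next slides walk through.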

Page 10: SparkSQL: A Compiler from Queries to RDDs

Spark SQL Overview

[Diagram: SQL AST, DataFrame, and Dataset programs all become a Query Plan; Catalyst transforms these trees (abstractions of users' programs) into an Optimized Query Plan; Tungsten then turns that plan into RDDs.]

Page 11: SparkSQL: A Compiler from Queries to RDDs

How Catalyst Works: An Overview

[Diagram: the same pipeline: SQL AST / DataFrame / Dataset → Query Plan → Optimized Query Plan → RDDs, with Catalyst performing the tree transformations.]

Page 12: SparkSQL: A Compiler from Queries to RDDs

Trees: Abstractions of Users' Programs

SELECT sum(v)
FROM (SELECT t1.id, 1 + 2 + t1.value AS v
      FROM t1 JOIN t2
      WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp

Page 13: SparkSQL: A Compiler from Queries to RDDs

Trees: Abstractions of Users' Programs

(the same query as above)

Expression
• An expression represents a new value, computed based on input values
• e.g., 1 + 2 + t1.value

Page 14: SparkSQL: A Compiler from Queries to RDDs

Trees: Abstractions of Users' Programs

Query Plan (for the query above):

Aggregate [sum(v)]
 └─ Project [t1.id, 1+2+t1.value AS v]
     └─ Filter [t1.id = t2.id AND t2.id > 50 * 1000]
         └─ Join
             ├─ Scan(t1)
             └─ Scan(t2)

Page 15: SparkSQL: A Compiler from Queries to RDDs

Logical Plan

• A Logical Plan describes computation on datasets without defining how to conduct the computation

[Diagram: the query plan tree above is a logical plan.]

Page 16: SparkSQL: A Compiler from Queries to RDDs

Physical Plan

• A Physical Plan describes computation on datasets with specific definitions on how to conduct the computation

Hash-Aggregate [sum(v)]
 └─ Project [t1.id, 1+2+t1.value AS v]
     └─ Filter [t1.id = t2.id AND t2.id > 50 * 1000]
         └─ Sort-Merge Join
             ├─ Parquet Scan(t1)
             └─ JSON Scan(t2)
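These trees can be inspected directly (a sketch; queryExecution is a developer API on Dataset, and query is assumed to hold the SQL string from the earlier slide):

val q = spark.sql(query)
q.queryExecution.analyzed       // resolved logical plan
q.queryExecution.optimizedPlan  // logical plan after Catalyst's rules
q.queryExecution.executedPlan   // physical plan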

Page 17: SparkSQL: A Compiler from Queries to RDDs

How Catalyst Works: An Overview

[Diagram: as before: SQL AST / DataFrame / Dataset (Java/Scala) → Query Plan → Optimized Query Plan → RDDs, with Catalyst performing the tree transformations.]

Page 18: SparkSQL: A Compiler from Queries to RDDs

Transform

• A function associated with every tree, used to implement a single rule

1 + 2 + t1.value (evaluates 1 + 2 for every row):

Add
 ├─ Add
 │   ├─ Literal(1)
 │   └─ Literal(2)
 └─ Attribute(t1.value)

3 + t1.value (evaluates 1 + 2 once):

Add
 ├─ Literal(3)
 └─ Attribute(t1.value)
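As a self-contained illustration of the idea, here is a toy expression tree (a simplified ADT, not Catalyst's actual classes) with a transform that applies a rule bottom-up; Catalyst's TreeNode offers both pre-order and post-order variants:

sealed trait Expr {
  // Rewrite the children first, then try the rule at this node;
  // if the rule is not defined here, keep the node unchanged.
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    val rewritten = this match {
      case Add(l, r) => Add(l.transform(rule), r.transform(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(rewritten, identity[Expr])
  }
}
case class Lit(v: Int) extends Expr
case class Attr(name: String) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// The constant-folding rule from this slide, on 1 + 2 + t1.value:
val folded = Add(Add(Lit(1), Lit(2)), Attr("t1.value")).transform {
  case Add(Lit(x), Lit(y)) => Lit(x + y)
}
// folded == Add(Lit(3), Attr("t1.value")), i.e. 3 + t1.value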

Page 19: SparkSQL: A Compiler from Queries to RDDs

Transform

• A transform is defined as a partial function
• Partial function: a function that is defined for only a subset of its possible arguments (see the plain-Scala illustration below)

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

The case clause determines whether the partial function is defined for a given input.
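In plain Scala terms (an illustration, not Catalyst code), a partial function can be probed with isDefinedAt:

val halve: PartialFunction[Int, Int] = {
  case n if n % 2 == 0 => n / 2  // defined only for even numbers
}
halve.isDefinedAt(4)  // true
halve.isDefinedAt(3)  // false: no case matches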

Page 20: SparkSQL: A Compiler from Queries to RDDs

Transform

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

Applied to the tree for 1 + 2 + t1.value:

Add
 ├─ Add
 │   ├─ Literal(1)
 │   └─ Literal(2)
 └─ Attribute(t1.value)

Page 23: SparkSQL: A Compiler from Queries to RDDs

Transform

Applying the rule rewrites the tree:

Before: 1 + 2 + t1.value

Add
 ├─ Add
 │   ├─ Literal(1)
 │   └─ Literal(2)
 └─ Attribute(t1.value)

After: 3 + t1.value

Add
 ├─ Literal(3)
 └─ Attribute(t1.value)

Page 24: SparkSQL: A Compiler from Queries to RDDs

Combining Multiple Rules

Predicate Pushdown:

Before:

Aggregate [sum(v)]
 └─ Project [t1.id, 1+2+t1.value AS v]
     └─ Filter [t1.id = t2.id AND t2.id > 50 * 1000]
         └─ Join
             ├─ Scan(t1)
             └─ Scan(t2)

After:

Aggregate [sum(v)]
 └─ Project [t1.id, 1+2+t1.value AS v]
     └─ Join [t1.id = t2.id]
         ├─ Scan(t1)
         └─ Filter [t2.id > 50 * 1000]
             └─ Scan(t2)

Page 25: SparkSQL: A Compiler from Queries to RDDs

Combining Multiple Rules

Constant Folding:

Before: the pushed-down plan from the previous slide.

After:

Aggregate [sum(v)]
 └─ Project [t1.id, 3+t1.value AS v]
     └─ Join [t1.id = t2.id]
         ├─ Scan(t1)
         └─ Filter [t2.id > 50000]
             └─ Scan(t2)

Page 26: SparkSQL: A Compiler from Queries to RDDs

Combining Multiple Rules

Column Pruning:

Before: the constant-folded plan from the previous slide.

After:

Aggregate [sum(v)]
 └─ Project [t1.id, 3+t1.value AS v]
     └─ Join [t1.id = t2.id]
         ├─ Project [t1.id, t1.value]
         │   └─ Scan(t1)
         └─ Filter [t2.id > 50000]
             └─ Project [t2.id]
                 └─ Scan(t2)

Page 27: SparkSQL: A Compiler from Queries to RDDs

Combining Multiple Rules

Before transformations:

Aggregate [sum(v)]
 └─ Project [t1.id, 1+2+t1.value AS v]
     └─ Filter [t1.id = t2.id AND t2.id > 50 * 1000]
         └─ Join
             ├─ Scan(t1)
             └─ Scan(t2)

After transformations:

Aggregate [sum(v)]
 └─ Project [t1.id, 3+t1.value AS v]
     └─ Join [t1.id = t2.id]
         ├─ Project [t1.id, t1.value]
         │   └─ Scan(t1)
         └─ Filter [t2.id > 50000]
             └─ Project [t2.id]
                 └─ Scan(t2)
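The combined effect of these rules can be observed end to end with explain (a sketch, reusing the query from the earlier slide; with extended = true it prints the parsed, analyzed, optimized, and physical plans):

spark.sql("""
  SELECT sum(v)
  FROM (SELECT t1.id, 1 + 2 + t1.value AS v
        FROM t1 JOIN t2
        WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp
""").explain(true)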

Page 28: SparkSQL: A Compiler from Queries to RDDs

Spark SQL Overview

[Diagram: as before: SQL AST / DataFrame / Dataset → Query Plan → Optimized Query Plan → RDDs; Catalyst covers the plan transformations, Tungsten the execution.]

Page 29: SparkSQL: A Compiler from Queries to RDDs

select count(*) from store_sales where ss_item_sk = 1000

[Diagram: the query as an operator pipeline: Scan → Filter → Project → Aggregate.]

Page 30: SparkSQL: A Compiler from Queries to RDDs

G. Graefe, "Volcano: An Extensible and Parallel Query Evaluation System," IEEE Transactions on Knowledge and Data Engineering, 1994.

Page 31: SparkSQL: A Compiler from Queries to RDDs

Volcano Iterator Model

• Standard for 30 years: almost all databases do it
• Each operator is an "iterator" that consumes records from its input operator

class Filter(child: Operator, predicate: Row => Boolean)
  extends Operator {

  def next(): Row = {
    var current = child.next()
    // Keep pulling rows from the child until one passes the
    // predicate (or the input is exhausted and next() returns null).
    while (current != null && !predicate(current)) {
      current = child.next()
    }
    current
  }
}

Page 32: SparkSQL: A Compiler from Queries to RDDs

Downside of the Volcano Model

1. Too many virtual function calls
   o at least 3 calls for each row in Aggregate
2. Extensive memory access
   o "row" is a small segment in memory (or in L1/L2/L3 cache)
3. Can't take advantage of modern CPU features
   o SIMD, pipelining, prefetching, branch prediction, ILP, instruction cache, ...

Page 33: SparkSQL: A Compiler from Queries to RDDs

Whole-stage Codegen: Spark as a "Compiler"

The operator pipeline Scan → Filter → Project → Aggregate collapses into a single hand-written-style loop:

long count = 0;
for (long ss_item_sk : store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}

Page 34: SparkSQL: A Compiler from Queries to RDDs

Whole-stage Codegen

• Fuse operators together so the generated code looks like hand-optimized code:
  - Identify chains of operators ("stages")
  - Compile each stage into a single function
  - The functionality of a general-purpose execution engine, with performance as if the system were hand-built just to run your query
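To see what this produces for a query, Spark ships a debug helper that prints the generated Java source for each stage (a sketch, reusing the count(*) query from the earlier slide):

import org.apache.spark.sql.execution.debug._

val q = spark.sql("select count(*) from store_sales where ss_item_sk = 1000")
q.debugCodegen()  // prints the fused code generated for each whole stage

In explain() output, operators fused by whole-stage codegen are marked with a leading *.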

Page 35: SparkSQL: A Compiler from Queries to RDDs

T. Neumann, "Efficiently Compiling Efficient Query Plans for Modern Hardware," VLDB 2011.

Page 36: SparkSQL: A Compiler from Queries to RDDs

Putting it All Together

Page 37: SparkSQL: A Compiler from Queries to RDDs

Operator Benchmarks: Cost/Row (ns)

[Chart: per-operator cost per row, showing 5-30x speedups.]

Page 38: SparkSQL: A Compiler from Queries to RDDs

Operator Benchmarks: Cost/Row (ns)

[Chart: radix sort, showing 10-100x speedups.]

Page 39: SparkSQL: A Compiler from Queries to RDDs

Operator Benchmarks: Cost/Row (ns)

[Chart: shuffling is still the bottleneck.]

Page 40: SparkSQL: A Compiler from Queries to RDDs

Operator Benchmarks: Cost/Row (ns)

[Chart: a 10x speedup.]

Page 41: SparkSQL: A Compiler from Queries to RDDs

TPC-DS (Scale Factor 1500, 100 cores)

[Chart: query time by query number, Spark 2.0 vs. Spark 1.6; lower is better.]

Page 42: SparkSQL: A Compiler from Queries to RDDs

What’s Next?

Page 43: SparkSQL: A Compiler from Queries to RDDs

Spark 2.2 and beyond

1. SPARK-16026: Cost-Based Optimizer
   - Leverage table/column-level statistics to optimize joins and aggregates
   - Statistics Collection Framework (Spark 2.1); see the sketch after this list
   - Cost-Based Optimizer (Spark 2.2)

2. Boosting Spark's Performance on Many-Core Machines
   - In-memory/single-node shuffle

3. Improving the quality of generated code and better integration with the in-memory column format in Spark
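The statistics collection in item 1 is driven by ANALYZE TABLE (a sketch of the Spark 2.1 SQL syntax; the table and column names are from the running example):

spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, value")

The first command collects table-level statistics (e.g., row count and size); the second collects per-column statistics that the cost-based optimizer can use for join and aggregate planning.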

Page 44: SparkSQL: A Compiler from Queries to RDDs

Thank you.