Beyond SQL: Speeding up Spark with DataFrames
Transcript
Beyond SQL: Speeding up Spark with DataFrames
Michael Armbrust - @michaelarmbrust
March 2015, Spark Summit East
[Chart: # of Unique Contributors and # of Commits Per Month.]
Graduated from Alpha in 1.3
About Me and SQL

Spark SQL
Part of the core distribution since Spark 1.0 (April 2014)
Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments, e.g.:
SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)
Connect existing BI tools to Spark through JDBC
Bindings in Python, Scala, and Java

@michaelarmbrust
Lead developer of Spark SQL @databricks
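A minimal sketch of running that HiveQL query from PySpark (assuming an existing Hive deployment with a hiveTable and a registered hive_udf):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="spark-sql-hiveql")
sqlCtx = HiveContext(sc)   # HiveContext understands HiveQL and Hive UDFs

# The query from the slide, run alongside an existing Hive deployment
count = sqlCtx.sql("SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)")
print(count.collect())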
The not-so-secret truth...
Spark SQL is not about SQL.
Execution Engine Performance
[Chart: TPC-DS Performance on individual queries (3 through 98), comparing Shark and Spark SQL.]
The not-so-secret truth...
Spark SQL is about more than SQL.
Spark SQL: The whole story
Creating and Running Spark Programs Faster:
Write less code
Read less data
Let the optimizer do the hard work
DataFrame noun [dey-tuh-freym]
1. A distributed collection of rows organized into named columns.
2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
3. Archaic: Previously SchemaRDD (cf. Spark < 1.3).
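As a quick illustration of this definition, here is a minimal sketch of building a small DataFrame in PySpark 1.3 (the column names and values are invented for the example):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="dataframe-example")
sqlCtx = SQLContext(sc)

# A distributed collection of rows organized into named columns
people = sqlCtx.createDataFrame(
    sc.parallelize([Row(name="alice", age=34), Row(name="bob", age=46)]))

people.printSchema()                     # the named columns and their inferred types
people.filter(people.age > 40).show()    # selecting / filtering structured data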
Write Less Code: Input & Output
Spark SQL's Data Source API can read and write DataFrames using a variety of formats.
[Diagram: built-in and external data sources, including JSON, JDBC, and more.]
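A minimal sketch of the Spark 1.3-era load/save calls from Python (the paths and format names below are illustrative assumptions):

# sqlCtx is the SQLContext (or HiveContext) used throughout the talk.
# Read a DataFrame from a built-in source (JSON), then write it back out
# through another source (Parquet).
events = sqlCtx.load("/path/to/events.json", "json")
events.save("/path/to/events_parquet", "parquet")

# External sources (for example a JDBC database) are addressed the same way,
# by naming the data source when loading.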
Write Less Code: High-Level Operations
Common operations can be expressed concisely as calls to the DataFrame API:
Selecting required columns
Joining different data sources
Aggregation (count, sum, average, etc.)
Filtering
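For instance, a minimal sketch of these operations chained together (the table and column names are assumed for illustration):

from pyspark.sql.functions import avg

# sqlCtx is the SQLContext/HiveContext from earlier; "users" and "events"
# are assumed tables with user_id, name, city, and duration columns.
users = sqlCtx.table("users")
events = sqlCtx.table("events")

per_user = (events
            .join(users, events.user_id == users.user_id)   # joining data sources
            .where(events.city == "New York")                # filtering
            .groupBy(users.name)                             # grouping for aggregation
            .agg(avg(events.duration)))                      # aggregation (avg)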
Write Less Code: Compute an Average
Using Hadoop MapReduce (Java):

// Mapper
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
}

// Reducer
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
}
Using Spark RDDs (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames
from pyspark.sql.functions import avg

sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()
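For comparison, the same aggregation can also be written as a SQL query against the registered table; a minimal sketch:

# Equivalent SQL formulation, run through the same engine
sqlCtx.sql("SELECT name, AVG(age) FROM people GROUP BY name").collect()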
Full API Docs: Python, Scala, Java
Not Just Less Code: Faster Implementations
[Chart: Time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL.]
Demo: Data Sources API
Using Spark SQL to read, write, and transform data in a variety of formats.
http://people.apache.org/~marmbrus/talks/dataframe.demo.pdf
Read Less Data
The fastest way to process big data is to never read it.
Spark SQL can help you read less data automatically:
Converting to more efficient formats
Using columnar formats (e.g., Parquet)
Using partitioning (e.g., /year=2014/month=02/)¹
Skipping data using statistics (e.g., min, max)²
Pushing predicates into storage systems (e.g., JDBC)

¹ Only supported for Parquet and Hive; more support coming in Spark 1.4.
² Turned off by default in Spark 1.3.
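A minimal sketch of how partitioning and predicate pushdown look from the 1.3 API (the path, partition layout, and column names are assumptions for the example):

# Events stored as Parquet under /data/events/year=2014/month=02/... are read
# as one partitioned dataset; partition columns appear as ordinary columns.
events = sqlCtx.load("/data/events", "parquet")

# Filters on partition columns can skip whole directories, and (where supported)
# column statistics let the scan skip data instead of reading and discarding it.
feb_2014 = events.where((events.year == 2014) & (events.month == 2))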
Plan Optimization & Execution
[Diagram: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces candidate Physical Plans; a Cost Model selects the Physical Plan; Code Generation turns it into RDDs.]
DataFrames and SQL share the same optimization/execution pipeline.
Optimization happens as late as possible, therefore Spark SQL can optimize even across functions.
def add_demographics(events):
    u = sqlCtx.table("users")                        # Load Hive table
    return (events
            .join(u, events.user_id == u.user_id)    # Join on user_id
            .withColumn("city", zipToCity(u.zip)))   # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "New York").select(events.timestamp).collect()
[Diagram: Logical Plan: a filter on top of a join of the events file and the users table. The join is expensive; ideally we only join the relevant users. Physical Plan: a join of scan (events) with a filter over scan (users).]
def add_demographics(events):
    u = sqlCtx.table("users")                        # Load partitioned Hive table
    return (events
            .join(u, events.user_id == u.user_id)    # Join on user_id
            .withColumn("city", zipToCity(u.zip)))   # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "New York").select(events.timestamp).collect()

[Diagram: the same Logical Plan (a filter over a join of the events file and the users table) and the same Physical Plan as before, next to the Physical Plan with Predicate Pushdown and Column Pruning: a join of an optimized scan (events) with an optimized scan (users).]
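One way to see whether pushdown and pruning actually happened is to print the query plans; a minimal sketch (the exact plan output varies by version and data source):

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "New York").select(events.timestamp)

# Prints the parsed, analyzed, optimized, and physical plans; with Parquet the
# physical plan should show the filter and column selection pushed into the scan.
training_data.explain(extended=True)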
Machine Learning Pipelines
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)
[Diagram: the Pipeline runs ds0 → tokenizer → ds1 → hashingTF → ds2 → lr → ds3; fitting it produces a Pipeline Model in which lr is replaced by lr.model.]
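A sketch of applying the fitted pipeline to new data (the test path and the selected columns are assumptions):

# The fitted PipelineModel runs the same tokenizer -> hashingTF -> lr stages
# on new data and appends a prediction column.
test_df = sqlCtx.load("/path/to/test_data")
predictions = model.transform(test_df)
predictions.select("text", "prediction").show()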
Create and Run Spark Programs Faster:
Write less code
Read less data
Let the optimizer do the hard work
Questions?