Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analytics on Structured Data

Michael Armbrust (Databricks), Spark Summit Amsterdam 2015 - October 28th
Transcript
Page 1: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Spark DataFrames: Simple and Fast Analytics on Structured Data

Michael Armbrust, Spark Summit Amsterdam 2015 - October 28th

Page 2: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

About Me and Spark SQL

• Spark SQL
  • Part of the core distribution since Spark 1.0 (April 2014)

Graduated from Alpha in 1.3

[Charts: # of commits per month and # of contributors]

Page 3: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

About Me and Spark SQL

• Spark SQL
  • Part of the core distribution since Spark 1.0 (April 2014)
  • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments

SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)

Improved multi-version support in 1.4

Page 4: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

About Me and Spark SQL

• Spark SQL
  • Part of the core distribution since Spark 1.0 (April 2014)
  • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
  • Connect existing BI tools to Spark through JDBC

Page 5: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

About Me and Spark SQL

• Spark SQL
  • Part of the core distribution since Spark 1.0 (April 2014)
  • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
  • Connect existing BI tools to Spark through JDBC
  • Bindings in Python, Scala, Java, and R

Page 6: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

About Me and Spark SQL

• Spark SQL
  • Part of the core distribution since Spark 1.0 (April 2014)
  • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
  • Connect existing BI tools to Spark through JDBC
  • Bindings in Python, Scala, Java, and R

• @michaelarmbrust
  • Creator of Spark SQL @databricks

Page 7: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

The not-so-secret truth...

Spark SQL is about more than SQL.

Page 8: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Create and run Spark programs faster:

• Write less code
• Read less data
• Let the optimizer do the hard work

Page 9: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

DataFrame, noun – [dey-tuh-freym]

1. A distributed collection of rows organized into named columns.

2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).

3. Archaic: Previously SchemaRDD (cf. Spark < 1.3).

Page 10: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Write Less Code: Input & Output

Unified interface to reading/writing data in a variety of formats:

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")

Page 11: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Write Less Code: Input & Output

Unified interface to reading/writing data in a variety of formats:

read and write functions create new builders for doing I/O:

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")

Page 12: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Write Less Code: Input & Output

Unified interface to reading/writing data in a variety of formats:

Builder methods specify:
• Format
• Partitioning
• Handling of existing data

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")

Page 13: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Write Less Code: Input & Output

Unified interface to reading/writing data in a variety of formats:

load(…), save(…) or saveAsTable(…) finish the I/O specification:

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
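As a quick round-trip illustration (a minimal sketch, not from the slides; the Parquet path is a placeholder), the table written above can be read back through the same unified interface:

# Read the saved table back by name.
faster = sqlContext.table("fasterData")
faster.printSchema()

# Or point the reader directly at the Parquet files (placeholder path).
parquet_df = sqlContext.read.format("parquet").load("/path/to/fasterData")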

Page 14: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Read Less Data: Efficient Formats

Parquet and ORC are efficient columnar storage formats:

• Compact binary encoding with intelligent compression (delta, RLE, etc)
• Each column stored separately with an index that allows skipping of unread columns
• Support for partitioning (/data/year=2015)
• Data skipping using statistics (column min/max, etc)
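For example (a minimal sketch, not from the slides; the events DataFrame and its columns are hypothetical), writing Parquet partitioned by year lets a later query skip whole directories and unread columns:

# Write events partitioned by year; each value gets its own directory, e.g. /data/events/year=2015/.
events.write.format("parquet").partitionBy("year").save("/data/events")

# A read with a partition filter and a narrow projection only touches
# the year=2015 directory and the selected column.
stored = sqlContext.read.format("parquet").load("/data/events")
user_ids = stored.where(stored.year == 2015).select("user_id").collect()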

Page 15: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Write Less Code: Data Source API

Spark SQL's Data Source API can read and write DataFrames using a variety of formats.

Built-In: JSON, JDBC, Parquet, ORC, plain text*, and more…
External: find more sources at http://spark-packages.org/
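An external source plugs in the same way as a built-in one (a sketch, not from the slides; the spark-csv package from spark-packages is used purely as an illustration, and the package coordinate shown is an assumption):

# Launch with the package available, e.g.:
#   spark-submit --packages com.databricks:spark-csv_2.10:1.2.0 my_job.py   (coordinate is illustrative)
csv_df = sqlContext.read \
  .format("com.databricks.spark.csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load("/path/to/data.csv")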

Page 16: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

ETL Using Custom Data Sources

Load data from JIRA (Spark's bug tracker) using a custom data source, and write the converted data out to a table stored in Parquet:

sqlContext.read
  .format("com.databricks.spark.jira")
  .option("url", "https://issues.apache.org/jira/rest/api/latest/search")
  .option("user", "marmbrus")
  .option("password", "*******")
  .option("query", """
    |project = SPARK AND
    |component = SQL AND
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
  .load()
  .repartition(1)
  .write.format("parquet")
  .saveAsTable("sparkSqlJira")

Page 17: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Write Less Code: High-Level Operations

Solve common problems concisely using DataFrame functions:

• Selecting columns and filtering
• Joining different data sources
• Aggregation (count, sum, average, etc)
• Plotting results with Pandas
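These operations compose naturally. A minimal sketch (not from the slides; the table names, columns, and the visits DataFrame are hypothetical):

from pyspark.sql.functions import avg, count

users = sqlCtx.table("users")
adults = users.where(users.age >= 18).select("user_id", "age", "city")       # select and filter

joined = adults.join(visits, "user_id")                                      # join two data sources on user_id

per_city = joined.groupBy("city") \
    .agg(count("user_id").alias("num_visits"), avg("age").alias("avg_age"))  # aggregate (count, average)

# Plot with Pandas (requires pandas/matplotlib on the driver).
per_city.toPandas().plot(x="city", y="num_visits", kind="bar")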

Page 18: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Write Less Code: Compute an Average

Hadoop MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Using RDDs (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()

Page 19: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Write Less Code: Compute an Average

Using RDDs

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()

Using DataFrames

sqlCtx.table("people") \
  .groupBy("name") \
  .agg(avg("age")) \
  .map(lambda …) \
  .collect()

Using SQL

SELECT name, avg(age)
FROM people
GROUP BY name

Full API Docs: Python, Scala, Java, R
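For completeness (a minimal sketch, assuming people is a registered table), the SQL variant runs through the same SQLContext and also returns a DataFrame:

averages = sqlCtx.sql("SELECT name, avg(age) AS avg_age FROM people GROUP BY name")
averages.show()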

Page 20: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Not Just Less Code, Faster Too!

[Bar chart: time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]

Page 21: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Plan Optimization & Execution

[Diagram: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces candidate Physical Plans; a Cost Model picks the Selected Physical Plan; Code Generation turns it into RDDs]

DataFrames and SQL share the same optimization/execution pipeline
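You can inspect these stages for any DataFrame or SQL query (a small sketch; df is any DataFrame built earlier and people is assumed to be a registered table):

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.explain(True)

# SQL queries go through the same pipeline and produce DataFrames too.
sqlCtx.sql("SELECT count(*) FROM people").explain(True)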

Page 22: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Seamlessly Integrated

Intermix DataFrame operations with custom Python, Java, R, or Scala code

zipToCity = udf(lambda zipCode: <custom logic here>)

def add_demographics(events):
    u = sqlCtx.table("users")
    return events \
        .join(u, events.user_id == u.user_id) \
        .withColumn("city", zipToCity(u.zip))

Augments any DataFrame that contains user_id
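As a side note (a sketch, not from the slides; lookup_city is a placeholder for the custom logic), the same UDF can be given an explicit return type and registered for use from SQL:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

zipToCity = udf(lambda zipCode: lookup_city(zipCode), StringType())   # explicit return type

# Register the same logic so it can be called from SQL queries.
sqlCtx.registerFunction("zip_to_city", lambda zipCode: lookup_city(zipCode), StringType())
# e.g. SELECT zip_to_city(zip) AS city FROM users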

Page 23: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Optimize Full Pipelines

Optimization happens as late as possible, so Spark SQL can optimize even across functions.

events = add_demographics(sqlCtx.load("/data/events", "json"))

training_data = events \
    .where(events.city == "Amsterdam") \
    .select(events.timestamp) \
    .collect()

Page 24: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

def add_demographics(events):
    u = sqlCtx.table("users")                       # Load Hive table
    return (events
        .join(u, events.user_id == u.user_id)       # Join on user_id
        .withColumn("city", zipToCity(u.zip)))      # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()

[Logical Plan: the events file is joined with the users table, then filtered by city. Physical Plan: scan(events) is joined with scan(users), then the filter by city is applied. The full scan and join of users is expensive; ideally we would only join the relevant users.]

Page 25: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

def add_demographics(events):
    u = sqlCtx.table("users")                       # Load partitioned Hive table
    return (events
        .join(u, events.user_id == u.user_id)       # Join on user_id
        .withColumn("city", zipToCity(u.zip)))      # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()

[Logical Plan: the events file is joined with the users table, then filtered by city. Naive Physical Plan: scan(events) joined with scan(users), then filter by city. Optimized Physical Plan with Predicate Pushdown and Column Pruning: optimized scan(events) joined with optimized scan(users).]

Page 26: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram: fitting the Pipeline transforms ds0 –tokenizer→ ds1 –hashingTF→ ds2 –lr→ ds3, producing a Pipeline Model that contains lr.model]
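A follow-up sketch (not on the slide; test_df is a hypothetical DataFrame with a text column): the fitted PipelineModel is itself a transformer that runs every stage on new data.

# Runs tokenizer -> hashingTF -> lr.model over the new data.
predictions = model.transform(test_df)
predictions.select("text", "prediction").show(5)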

Page 27: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Rich Function Library

• 100+ native functions with optimized codegen implementations
  – String manipulation – concat, format_string, lower, lpad
  – Date/Time – current_timestamp, date_format, date_add, …
  – Math – sqrt, randn, …
  – Other – monotonicallyIncreasingId, sparkPartitionId, …

Python:
from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)

Scala:
import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)

Added in Spark 1.5

Page 28: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Optimized Execution with Project Tungsten

Compact encoding, cache-aware algorithms, and runtime code generation

Page 29: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

The overheads of JVM objects

“abcd”

• Native: 4 bytes with UTF-8 encoding
• Java: 48 bytes

java.lang.String object internals:
 OFFSET  SIZE  TYPE     DESCRIPTION       VALUE
      0     4           (object header)   ...
      4     4           (object header)   ...
      8     4           (object header)   ...
     12     4  char[]   String.value      []
     16     4  int      String.hash       0
     20     4  int      String.hash32     0
Instance size: 24 bytes (reported by Instrumentation API)

12 byte object header, 8 byte hashcode, 20 bytes data + overhead

Page 30: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Tungsten's Compact Encoding

(123, "data", "bricks") is laid out as:

[ 0x0 null bitmap | 123 | 32L offset to data | 48L offset to data | field length 4, "data" | field length 6, "bricks" ]

"abcd" with Tungsten encoding: ~5-6 bytes

Page 31: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Runtime Bytecode Generation

DataFrame Code / SQL:
df.where(df("year") > 2015)

Catalyst Expressions:
GreaterThan(year#234, Literal(2015))

Low-level generated bytecode:
bool filter(Object baseObject) {
  int offset = baseOffset + bitSetWidthInBytes + 3*8L;
  int value = Platform.getInt(baseObject, offset);
  return value > 2015;
}

Platform.getInt(baseObject, offset) is a JVM intrinsic that is JIT-ed to pointer arithmetic.

Page 32: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Coming soon: Datasets (preview in Spark 1.6)

• Type-safe: operate on domain objects with compiled lambda functions
• Fast: code-generated encoders for fast serialization
• Interoperable: easily convert DataFrame ↔ Dataset without boilerplate

val df = ctx.read.json("people.json")

// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)

// Compute histogram of age by name.
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people: Iter[Person]) =>
    val buckets = new Array[Int](10)
    people.map(_.age).foreach { a =>
      buckets(a / 10) += 1
    }
    (name, buckets)
}

Page 33: Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

Create and run your Spark programs faster:

• Write less code• Read less data• Let the optimizer do the hard work

Questions?

Committer Office Hours
Weds, 4:00-5:00 pm: Michael
Thurs, 10:30-11:30 am: Reynold
Thurs, 2:00-3:00 pm: Andrew