Spark DataFrames for Data Munging

Jan 18, 2017

Page 1: Spark DataFrames for Data Munging

© 2016 Mesosphere, Inc. All Rights Reserved.

SPARK DATAFRAMES FOR DATA MUNGING

Susan X. Huynh, Scala by the Bay, Nov. 2016

Page 2: Spark DataFrames for Data Munging

OUTLINE

Motivation

Spark DataFrame API

Demo

Beyond Data Munging

Page 3: Spark DataFrames for Data Munging

MOTIVATION

• Your job: analyze 100 GB of log data:

{"created_at":"Tue Sep 13 19:54:43 +0000 2016","id":775784797046124544,"id_str":"775784797046124544","text":"@abcd4321 ur icon is good","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":1180432963,"in_reply_to_user_id_str":"1180432963","in_reply_to_screen_name":"fakejoshler","user":{"id":4795786058,"id_str":"4795786058","name":"maggie","screen_name":"wxyz1234","location":"she her - gabby, mily","url":"http:\/\/666gutz.tumblr.com","description":"one too many skeletons","protected":false,"verified":false,"followers_count":2168,"friends_count":84,"listed_count":67,"favourites_count":22298,"statuses_count":29769,"created_at":"Fri Jan 22 00:04:30 +0000 2016","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"000000","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,

Page 4: Spark DataFrames for Data Munging

WHAT DO YOU MEAN BY “ANALYZE”?

AKA data munging, ETL, or data cleaning. Acronym: PETS (or PEST? :)

Parse

Explore

Transform

Summarize

Data pipeline

"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"…

Page 5: Spark DataFrames for Data Munging

BEST TOOL FOR THE JOB?

DataFrame: Pandas (Python), R

Big data + SQL: Hive, Impala

DataFrame + Big data / SQL: Spark DataFrame

https://flic.kr/p/fnCVbL

Page 6: Spark DataFrames for Data Munging

WHY SPARK?

Open source

Scalable

Fast ad-hoc queries

Page 7: Spark DataFrames for Data Munging

WHY SPARK DATAFRAME?

Parse: Easy to read structured, semi-structured (JSON) formats

Explore: DataFrame

Transform / Summarize:

SQL queries + procedural processing

Utilities for math, string, date / time manipulation

Scala

Page 8: Spark DataFrames for Data Munging

PARSE: READING JSON DATA

> spark
res4: org.apache.spark.sql.SparkSession@3fc09112

> val df = spark.read.json("/path/to/mydata.json")
df: org.apache.spark.sql.DataFrame = [contributors: string ... 33 more fields]

DataFrame: a table with rows and columns (fields)

"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"…
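
As a rough illustration of the parse step, the sketch below reads two made-up JSON records instead of a real file. The sample records, the app name, and the local master setting are assumptions for experimentation, and reading from a `Dataset[String]` assumes Spark 2.2 or later (the slides use Spark 2.0, where a file path is the usual input).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for experimentation; in spark-shell or Zeppelin, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("parse-sketch").getOrCreate()
import spark.implicits._

// Two hypothetical records, one JSON object per line (the layout spark.read.json expects).
val sampleLines = Seq(
  """{"created_at":"Thu Sep 15 02:00:00 +0000 2016","lang":"en","user":{"screen_name":"alice"}}""",
  """{"created_at":"Thu Sep 15 02:00:03 +0000 2016","lang":"es","user":{"screen_name":"bob"}}"""
)

// For real data, pass a file path instead: spark.read.json("/path/to/mydata.json").
val df: DataFrame = spark.read.json(sampleLines.toDS())

df.printSchema()                              // schema is inferred; nested "user" becomes a struct
df.select("user.screen_name", "lang").show()  // nested fields are addressed with dot notation
```

No schema is declared anywhere: Spark infers it by scanning the input, which is convenient for exploration but costs an extra pass over large files.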

Page 9: Spark DataFrames for Data Munging

EXPLORE

> df.printSchema() // lists the columns in a DataFrame
root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- delete: struct (nullable = true)
 |    |-- status: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |-- lang: string (nullable = true)
…

Page 10: Spark DataFrames for Data Munging

EXPLORE (CONT’D)

> df.filter(col("coordinates").isNotNull) // filters on rows, with given condition
    .select("coordinates", "created_at") // filters on columns
    .show()

+------------------------------------------------+------------------------------+
|coordinates                                     |created_at                    |
+------------------------------------------------+------------------------------+
|[WrappedArray(104.86544034, 15.23611896),Point] |Thu Sep 15 02:00:00 +0000 2016|
|[WrappedArray(-43.301755, -22.990065),Point]    |Thu Sep 15 02:00:03 +0000 2016|
|[WrappedArray(100.3833729, 6.13822131),Point]   |Thu Sep 15 02:00:30 +0000 2016|
|[WrappedArray(-122.286, 47.5592),Point]         |Thu Sep 15 02:00:38 +0000 2016|
|[WrappedArray(110.823004, -6.80342),Point]      |Thu Sep 15 02:00:42 +0000 2016|

Other DataFrame ops: count(), describe(), create new columns, …
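
The other operations mentioned here can be sketched on a toy DataFrame. The data, the column name `text_len`, and the local session settings below are hypothetical stand-ins, not taken from the talk.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length}

// Local session for experimentation; in spark-shell or Zeppelin, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("explore-sketch").getOrCreate()
import spark.implicits._

// Hypothetical toy data standing in for the tweet DataFrame.
val tweets = Seq(
  ("en", "ur icon is good"),
  ("es", "hola"),
  ("en", "hello world")
).toDF("lang", "text")

val n = tweets.count()                                            // number of rows
val withLen = tweets.withColumn("text_len", length(col("text")))  // derive a new column

withLen.describe("text_len").show()  // summary statistics: count, mean, stddev, min, max
```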

Page 11: Spark DataFrames for Data Munging

TRANSFORM/SUMMARIZE: SQL QUERIES + PROCEDURAL PROC.

SQL

> val langCount = df.select("lang")
    .where(col("lang").isNotNull)
    .groupBy("lang")
    .count()
    .orderBy(col("count").desc)

+----+-----+
|lang|count|
+----+-----+
|  en|61644|
|  es|22937|
|  pt|21610|
|  ja|19160|
| und|10376|

Also: joins

PROCEDURAL

> val result = langCount.map{row:Row => …} // or flatMap, filter, …
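
To make the mix of SQL-style and procedural processing concrete, here is a minimal sketch covering a join, a SQL text query, and a `map` over rows. The lookup table `langNames`, the view name `lang_count`, and the output format of `labels` are invented for illustration; the counts are the first three rows shown above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for experimentation; in spark-shell or Zeppelin, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("transform-sketch").getOrCreate()
import spark.implicits._

// Per-language counts, plus a hypothetical lookup table of language names.
val langCount = Seq(("en", 61644L), ("es", 22937L), ("pt", 21610L)).toDF("lang", "count")
val langNames = Seq(("en", "English"), ("es", "Spanish")).toDF("lang", "name")

// Join: a left outer join keeps languages that have no entry in the lookup table.
val named = langCount.join(langNames, Seq("lang"), "left_outer")
  .orderBy(col("count").desc)

// The same data can be queried with SQL text once it is registered as a view.
langCount.createOrReplaceTempView("lang_count")
val big = spark.sql("SELECT lang FROM lang_count WHERE `count` > 22000")

// Procedural step: map each Row to a string (the Encoder comes from spark.implicits).
val labels = langCount.map(row => s"${row.getString(0)}=${row.getLong(1)}").collect()
```

The method-chaining form and the SQL string compile to the same kind of query plan, so the choice between them is largely a matter of taste and tooling.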

Page 12: Spark DataFrames for Data Munging

MATH, STRING, DATE / TIME FUNCTIONS

> df.select("created_at")
    .withColumn("day_of_week", col("created_at").substr(0, 3))
    .show()

+--------------------+-----------+
|          created_at|day_of_week|
+--------------------+-----------+
|                null|       null|
|Thu Sep 15 01:59:...|        Thu|
|Thu Sep 15 01:59:...|        Thu|
|Thu Sep 15 01:59:...|        Thu|
|                null|       null|
|Thu Sep 15 01:59:...|        Thu|

Also: sin, cos, exp, log, pow, toDegrees, toRadians, ceil, floor, round, concat, format_string, lower, regexp_extract, split, trim, upper, current_timestamp, datediff, from_unixtime, …
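
A few of the listed string functions can be combined to clean the `created_at` and `source` fields. This is only a sketch: the single sample row is modeled on the record format shown earlier, and the derived column names (`year`, `client`, `client_lc`) and the regular expression are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, regexp_extract, split, substring}

// Local session for experimentation; in spark-shell or Zeppelin, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("functions-sketch").getOrCreate()
import spark.implicits._

// One hypothetical row shaped like the tweet records.
val raw = Seq(
  ("Thu Sep 15 01:59:58 +0000 2016",
   """<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>""")
).toDF("created_at", "source")

val cleaned = raw
  .withColumn("day_of_week", substring(col("created_at"), 1, 3))        // first three characters
  .withColumn("year", split(col("created_at"), " ").getItem(5))         // sixth space-separated token
  .withColumn("client", regexp_extract(col("source"), ">([^<]*)<", 1))  // text between the tags
  .withColumn("client_lc", lower(col("client")))                        // lower-cased copy

cleaned.select("day_of_week", "year", "client", "client_lc").show(false)
```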

Page 13: Spark DataFrames for Data Munging

DEMO

Spark 2.0

Zeppelin notebook 0.6.1

8 GB JSON-formatted public Tweet data

Page 14: Spark DataFrames for Data Munging

BEYOND DATA MUNGING

Machine learning

Data pipeline in production

Streaming data

Page 15: Spark DataFrames for Data Munging

BEYOND DATA MUNGING

Machine learning => DataFrame-based ML API

Data pipeline in production => Dataset API, with type safety

Streaming data => Structured Streaming API, based on DataFrame

Spark 2.0
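
The type-safety point about the Dataset API might look like the following in practice. The case class `LangCount` and the sample values are hypothetical; the sketch only shows how an untyped DataFrame becomes a typed Dataset.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; field names must match the DataFrame's column names.
case class LangCount(lang: String, count: Long)

// Local session for experimentation; in spark-shell or Zeppelin, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("en", 61644L), ("es", 22937L)).toDF("lang", "count")

// .as[T] turns the untyped DataFrame into a typed Dataset[LangCount].
val ds: Dataset[LangCount] = df.as[LangCount]

// Field access is checked at compile time: a misspelled field name would not compile,
// whereas a misspelled column name in col("...") fails only at run time.
val total = ds.map(_.count).reduce(_ + _)
```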

Page 16: Spark DataFrames for Data Munging

RECAP

Spark DataFrames combine the “data frame” abstraction with Big Data and SQL

Spark DataFrames simplify data munging tasks (“PETS”):

Parse => structured and semi-structured formats (JSON)

Explore => DataFrame: printSchema, filter by row / column, show

Transform, Summarize => SQL + procedural processing, math / string / date-time utility functions

All in Scala

Page 18: Spark DataFrames for Data Munging

THANK YOU!