Adopting Dataframes and Parquet in an Already Existing Warehouse
Sol Ackerman & Franklyn D'souza

Spark Summit EU talk by Sol Ackerman and Franklyn D'souza

Jan 06, 2017

Transcript
Page 1

Sol Ackerman & Franklyn D’souza

Adopting Dataframes and Parquet in an Already Existing Warehouse

Page 2
Page 3
Page 4

Page 5

[Architecture diagram of the existing warehouse: MySQL operational DBs and HTTP APIs feed Kafka; PySpark Streaming and PySpark RDD jobs land JSON on HDFS; Redshift and Presto serve a Tier 1 Frontroom and a Tier 2 Frontroom (the Website) to Shopify users.]

Page 6

Warehouse Factoids

>300k STORES
2 PB DATA
4000 JOBS / DAY
1.5 TB NEW DATA / DAY

Page 7

Why Dataframes + Parquet?

Page 8
Page 9

● Processing time, time to fresh data

● Development/Iteration time

● Cost overhead

Page 10

● Columnar format

● Python simplicity, JVM performance

● Unlocks SQL on HDFS
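
A minimal PySpark sketch of what those three points look like in practice (the path and column names are illustrative, not from the talk):

    # Illustrative only: paths, columns, and the `sc` SparkContext are assumed.
    from pyspark.sql import SQLContext

    sql_context = SQLContext(sc)

    # Columnar format: a Parquet scan reads only the columns the query touches.
    orders = sql_context.read.parquet('hdfs://warehouse/orders.parquet')

    # Python simplicity, JVM performance: the filter is planned and executed
    # on the JVM by Catalyst, not row-by-row in Python.
    big = orders.where(orders.total > 100).select('shop_id', 'total')

    # Unlocks SQL on HDFS: register the file and query it like a table.
    orders.registerTempTable('orders')
    totals = sql_context.sql('SELECT shop_id, SUM(total) FROM orders GROUP BY shop_id')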

Page 11

Goals

• Zero downtime for analysts / users
• Fully backwards compatible
• No data loss / corruption
• Incremental rollout to mitigate risk, limit downtime, and keep each step revertible

Page 12

Adoption Plan

• Keep old code
• Introduce Parquet + Dataframes
• Rewrite slow jobs as Dataframe pipelines

Page 13

[Diagram: a backwards-compatible, data-agnostic reader / writer takes Input (JSON / Parquet) and produces Output (JSON / Parquet).]

Page 14

Data agnostic reader

    rdd1 = read('hdfs://file1.parquet', load_type='rdd')
    rdd2 = read('hdfs://file2.json', load_type='rdd')
    df1 = read('hdfs://file1.parquet', load_type='dataframe')
    df2 = read('hdfs://file2.json', load_type='dataframe')
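
A sketch of how such a data-agnostic read helper might dispatch on file extension and requested type (the body is an assumption; only the call signature above comes from the talk):

    # Hypothetical implementation of the reader shown above.
    def read(path, load_type='dataframe'):
        if path.endswith('.parquet'):
            df = sql_context.read.parquet(path)
        else:
            df = sql_context.read.json(path)
        if load_type == 'rdd':
            # Legacy jobs keep consuming plain dicts, whatever the file format.
            return df.rdd.map(lambda row: row.asDict())
        return df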

Page 15

Configurable Output-format

    sample-job:
      owner: [email protected]
      build:
        resource_class: large
      command_options:
        some-file-on-hdfs: hdfs://input/to/job
        output: hdfs://output/from/job
        output-format: parquet
      file: job.py

Page 16
Page 17

Configurable Output-format

    def write_output(self, path, out):
        if self.output_format == 'json':
            save_as_json_file(out, path)
        elif self.output_format == 'parquet':
            save_as_parquet_file(out, path)
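
One way the YAML option and the writer could plausibly meet (the Job class and YAML loading here are assumptions): the runner reads output-format from the job's entry, defaulting to JSON so untouched jobs behave exactly as before.

    # Hypothetical glue between the job config and write_output.
    import yaml

    class Job(object):
        def __init__(self, options):
            # Default to JSON: unmigrated jobs keep their old output format.
            self.output_format = options.get('output-format', 'json')

        def write_output(self, path, out):
            if self.output_format == 'json':
                save_as_json_file(out, path)
            elif self.output_format == 'parquet':
                save_as_parquet_file(out, path)

    options = yaml.safe_load(open('jobs.yml'))['sample-job']['command_options']
    job = Job(options)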

Page 18

Job Types

• Full Drop: jobs that rebuild their output from scratch every time.

[Diagram: several inputs combined into a single output that is fully rewritten on every run.]

Page 19

Job Types

• Incremental Drop: jobs that build their output incrementally.

[Diagram: input partitions Part 1, Part 2, Part 3 mapped one-to-one onto output partitions Part 1, Part 2, Part 3.]
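
A rough skeleton of the incremental-drop pattern (list_partitions and transform are hypothetical helpers): diff the output parts against the input parts and process only what is missing.

    # Hypothetical incremental-drop loop; helper names are made up.
    existing = set(list_partitions('hdfs://output/from/job'))
    incoming = set(list_partitions('hdfs://input/to/job'))

    for part in sorted(incoming - existing):
        df = sql_context.read.parquet('hdfs://input/to/job/' + part)
        transform(df).write.parquet('hdfs://output/from/job/' + part)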

Page 20

Forwards with Dataframes

• Is the dataset Dataframe-able?

• Checking data integrity

• Regression blocking

Page 21

Will it Dataframe?

Python                          Dataframe
long (any size)                 sql.LongType: [-2^63, 2^63 - 1]
Decimal (28 digits precision)   sql.DecimalType(38, 18): [20 digits].[18 digits]

Page 22

    from decimal import Context, DecimalException, Inexact

    # Long: warn when a value will not fit in sql.LongType.
    if v < -9223372036854775808 or v > 9223372036854775807:
        logger.warn('Field: {}, Value: {}. long will not fit inside sql.LongType'.format(k, v))

    # Decimal: warn when a value will not fit in sql.DecimalType(38, 18).
    # (Python's decimal module requires Emin <= 0, hence -19.)
    DECIMAL_CONTEXT = Context(prec=38, Emax=19, Emin=-19, traps=[Inexact])
    try:
        DECIMAL_CONTEXT.create_decimal(v)
    except DecimalException:
        logger.warn('Field: {}, Value: {}. Decimal will not fit inside sql.DecimalType(38,18)'.format(k, v))
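
Presumably checks like these run over every field of every record before conversion; a hypothetical wrapper (check_long and check_decimal stand for the two snippets above):

    # Hypothetical driver for the fit checks above (Python 2 era, hence `long`).
    from decimal import Decimal

    def check_record(record):
        for k, v in record.items():
            if isinstance(v, (int, long)):
                check_long(k, v)        # the Long check above
            elif isinstance(v, Decimal):
                check_decimal(k, v)     # the Decimal check above
        return record

    rdd.map(check_record)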

Page 23

Will it Dataframe?

Record based format:

    [{'a': 1}, {'a': 2, 'b': True}, ...]

    if 'b' in data:
        ...  # DataFrame path: every row has the column, so this is always taken
    else:
        ...  # RDD path: taken whenever a record simply lacks the key

Column based format:

    a  b
    1  None
    2  True
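
A tiny illustration of the hazard (values made up): the same membership test answers differently on the two representations, which is exactly the kind of code the optional-blocking test on the next page guards against.

    # Record (RDD) side: a missing key is simply absent.
    record = {'a': 1}
    'b' in record              # False

    # Column (DataFrame) side: every row carries every column.
    row = df.collect()[0]      # e.g. Row(a=1, b=None)
    'b' in row.asDict()        # True; the value is just None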

Page 24

Optional Blocking

    def test_no_new_optional_jobs():
        white_list = {...}
        jobs_with_optionals = find_jobs_with_optionals()
        new_jobs_with_optionals = jobs_with_optionals - white_list
        assert len(new_jobs_with_optionals) == 0

Page 25

Checking Data Integrity

[Diagram: the same Input (JSON / Parquet) is written out as both JSON and Parquet; a Reconcile Job checks the two outputs for equality.]
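
A minimal sketch of what such a reconcile job could look like (paths are illustrative; the two schemas are assumed to line up): load both outputs and assert their symmetric difference is empty.

    # Hypothetical reconcile job for the JSON vs Parquet outputs.
    json_out = sql_context.read.json('hdfs://output/from/job-json')
    parquet_out = sql_context.read.parquet('hdfs://output/from/job-parquet')

    # Set-style comparison, so row order does not matter.
    assert json_out.subtract(parquet_out).count() == 0
    assert parquet_out.subtract(json_out).count() == 0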

Page 26

Checking Data Integrity

[Diagram: the same Parquet input is run through both the old RDD pipeline and the new DF pipeline, each writing Parquet; a Reconcile Job checks the two outputs for equality.]

Page 27

Regression Blocking

    def test_no_new_json_jobs():
        white_list = {...}
        jobs_outputting_json = find_json_jobs()
        new_json_jobs = jobs_outputting_json - white_list
        assert len(new_json_jobs) == 0

Page 28

Results

10x FASTER
0.7x RESOURCES USED
1.3x JAVA MEM
SQL DATA ACCELERATED

Page 29

Lessons Learned

● Dataframes pay for themselves, especially at scale
● There is no Dataframe switch
● Safety first

Page 30

We're Hiring

shopify.com/careers