Some things you learn running Apache Spark in production for three years William Benton (@willb) Red Hat, Inc.

Jul 06, 2020


About me

Since 2013: data engineering and data science focus

Since 2008: distributed systems focus

Ancient history: compiler/VM design, static analysis, logic programming

Forecast

Introducing Apache Spark

How my team has used Spark

Lessons we’ve learned

From analytics as a workload to insightful applications


Meet Apache Spark


Parallel execution models


A user-friendly abstraction

Partitioned collections are spread across multiple processors or machines. This means we’ll have failures.

Immutable collections are copied, never modified in place.

Lazy operations only execute when necessary.

Immutability and laziness mean we’ll always have a recipe for how to recover.
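
The three properties above can be illustrated without Spark. The following is a minimal plain-Python sketch and not Spark's implementation: the `LazyPartitioned` class and its `map`, `compute`, `lose`, and `collect` methods are hypothetical names made up for this illustration. It shows how a collection that remembers its parent and its transformation can rebuild a lost partition on demand.

```python
from typing import Callable, List, Optional


class LazyPartitioned:
    """Toy partitioned, immutable, lazy collection (hypothetical; not Spark's API)."""

    def __init__(self, partitions: List[List[int]],
                 parent: Optional["LazyPartitioned"] = None,
                 fn: Optional[Callable[[int], int]] = None):
        self._source = partitions   # raw data; only read by the root collection
        self._parent = parent       # lineage: where this collection came from
        self._fn = fn               # lineage: how each element was derived
        self._cache = {}            # partition index -> computed values

    def map(self, fn):
        # Immutable: return a new collection. Lazy: nothing is computed yet.
        return LazyPartitioned(self._source, parent=self, fn=fn)

    def compute(self, index):
        # Work happens only when a partition is actually requested.
        if index not in self._cache:
            if self._parent is None:
                self._cache[index] = list(self._source[index])
            else:
                upstream = self._parent.compute(index)
                self._cache[index] = [self._fn(x) for x in upstream]
        return self._cache[index]

    def lose(self, index):
        # Simulate a failed machine: its computed partition disappears.
        self._cache.pop(index, None)

    def collect(self):
        return [x for i in range(len(self._source)) for x in self.compute(i)]


numbers = LazyPartitioned([[1, 2], [3, 4], [5, 6]])
doubled = numbers.map(lambda x: x * 2)   # no work done here
first = doubled.collect()                # work happens here
doubled.lose(1)                          # partition 1 is lost...
recovered = doubled.collect()            # ...and rebuilt from its lineage
```

Because `map` never mutates its input, recomputing partition 1 after the simulated failure gives the same values as the first run.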


A simple example program

file = sc.textFile("file://...")

counts = (file.flatMap(lambda l: l.split(" "))
              .map(lambda w: (w, 1))
              .reduceByKey(lambda x, y: x + y))

# computation actually occurs here
counts.saveAsTextFile("file://...")


Spark core (collections and scheduler)

Libraries: Graph, SQL, ML, Streaming

Cluster managers: ad hoc, Mesos, YARN


A more interesting example

v("Madrid") - v("Spain") + v("France") ≈ v("Paris")

[Figure: the word vectors for Madrid, Spain, France, and Paris.]

A more interesting example

from pyspark.ml.feature import Word2Vec
from pyspark.sql.functions import split

# read text file as a data frame
df = (spark.read.text("file://...")
      .select(split("value", "\\s+").alias("text")))

# fit and use a Word2Vec model
w2v = Word2Vec(inputCol="text", outputCol="result")
model = w2v.fit(df)
model.findSynonyms("data", 5).show()


How we've used Spark


Prototyping new techniques

[Map: Middleton, Waunakee, and Verona, with US 12 and Verona Road labeled.]


Machine configuration analysis

not libhugetlbfs && not fontpackages-filesystem && httpclient && transitional-eap6-jars
p11-kit && not libtheora && not perl-Text-ParseWords && jboss-metadata-appclient
not python-suds && rt61pci-firmware && avahi && apache-commons-io-ea
not sanlock-python && shrinkwrap-parent && not cli-tools-zend-server
not perl-parent && not python-stevedore && jdom && not openshift-origin-cartridge-diy && not redhat-sso-login-module-eap6 && iwl6050-firmware-41
not pytalloc && not libldb && xorg-x11-fonts-Typ && not nodejs010-gyp && hibernate4-entitymanager
...


Diagnosing community health

[Diagram: the fedmsg bus and the Fedora services attached to it: package SCM, fedmsg-hub, ansible, MediaWiki (via mod_php), COPR and secondary arch builds, meetbot, koji, Apache mod_wsgi, bodhi, accounts, ticketing, package database, and many others, along with the IRC gateway, web gateway, CLI tools, notifications, datanommer (archive), and Fedora badges.]

Is there anyone the community couldn’t live without?

Do people work on Fedora for love or money?

How do we characterize breadth and depth of community engagement?


Outlier detection in log data

[Figure: streams of DEBUG, INFO, and WARN log messages from host01, host02, and host03.]


Out of 310 million log records, we identified 0.0012% (roughly 3,700 records) as outliers.


Thirty most extreme outliers

Count  Message
10     Can not communicate with power supply 2.
9      Power supply 2 failed.
8      Power supply redundancy is lost.
1      Drive A is removed.
1      Can not communicate with power supply 1.
1      Power supply 1 failed.


Modeling infrastructure costs

[Figure: system metrics and cloud spending.]


SELECT COUNT(value), MIN(value), MAX(value), AVG(value),
       items.key_, items.hostid
FROM history, items
WHERE history.itemid = items.itemid
GROUP BY history.itemid

~120 GB of data on one node with 40 threads and 384 GB of RAM

RDBMS: ~15 hours

Spark: ~15 minutes
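
To make concrete what this query computes, here is a small sketch that runs it against an in-memory SQLite database. The table and column names come from the query itself; everything else is an assumption for illustration: the column types, the sample rows, and the metric names `cpu.load` and `mem.used`. SQLite merely stands in for the unnamed RDBMS, and the sketch says nothing about the timings quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (itemid INTEGER PRIMARY KEY, key_ TEXT, hostid INTEGER);
    CREATE TABLE history (itemid INTEGER, value REAL);
    -- Made-up sample data: two monitored items with a few readings each.
    INSERT INTO items VALUES (1, 'cpu.load', 10), (2, 'mem.used', 11);
    INSERT INTO history VALUES (1, 0.5), (1, 1.5), (1, 1.0), (2, 40.0), (2, 60.0);
""")

# The query from the slide, unchanged: per-item summary statistics.
rows = conn.execute("""
    SELECT COUNT(value), MIN(value), MAX(value), AVG(value),
           items.key_, items.hostid
    FROM history, items
    WHERE history.itemid = items.itemid
    GROUP BY history.itemid
""").fetchall()

# Index the results by metric key so they do not depend on row order.
summary = {key: (count, low, high, mean)
           for count, low, high, mean, key, _hostid in rows}
print(summary)
```

Each output row summarizes every reading for one monitored item, which is why the query has to scan the whole `history` table.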


Lessons we learned


METALESSON: HOW TO MASTER DECLARATIVE PROGRAMMING


Three steps to mastery

Understand the programming model. “What does this mean?”

Understand the execution model. “What does this do?”

Understand when to let the environment work for you. “How can I get out of its way?”


LESSON: LEARN THE API; USE THE RIGHT METHODS


Take advantage of the model

[Diagram: the Spark driver (application) and the Spark workers.]

Aggregating at the driver: increased memory pressure, decreased parallelism.


Distributed aggregation

[Diagram: the Spark driver (application) and the Spark workers.]

Aggregate at the workers instead of in the driver!
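
The difference between the two approaches can be sketched in plain Python, with lists standing in for partitions held by workers. This is an illustration of the idea and not Spark code: the `partitions` data and both function names are hypothetical. In Spark itself, RDD methods such as `reduce`, `aggregate`, and `treeAggregate` follow the second pattern, while calling `collect()` and then processing locally follows the first.

```python
from functools import reduce

# Made-up data, already split into partitions (one list per worker).
partitions = [[3, 1, 4], [1, 5, 9, 2], [6, 5, 3, 5]]


def aggregate_at_driver(parts):
    # Every element is shipped to the driver, which then does all the work.
    shipped = [x for part in parts for x in part]
    return sum(shipped), len(shipped)


def aggregate_at_workers(parts):
    # Each worker reduces its own partition to one small (sum, count) pair...
    partials = [(sum(part), len(part)) for part in parts]
    # ...and the driver only combines one pair per partition.
    total = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
    return total, len(partials)


driver_total, values_sent_naive = aggregate_at_driver(partitions)
(worker_sum, worker_count), values_sent_partial = aggregate_at_workers(partitions)

print(driver_total, values_sent_naive)                    # total and elements shipped
print(worker_sum, worker_count, values_sent_partial)      # same total, fewer values shipped
```

Both functions compute the same total, but the second sends the driver one pair per partition instead of one value per element, and the per-partition sums can run in parallel.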


LESSON: LET SPARK WORK FOR YOU WHENEVER POSSIBLE


Two favorite features

Query planning makes dumb code run faster.

Typed APIs prevent really dumb code from running at all.


Query planning

SELECT * FROM A, B WHERE A.ID = B.ID AND uncommon(A.X) AND extremelyRare(B.Y)

A naïve plan

[Diagram: JOIN the two tables first, then FILTER the joined rows.]

An optimized plan

[Diagram: FILTER each table first, then JOIN the filtered results.]
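
The two plans can be compared by hand in plain Python. This is a minimal sketch that applies the rewrite manually; in Spark SQL the query optimizer performs this kind of predicate pushdown for you. The tables, the stand-in predicates `uncommon` and `extremely_rare`, and the `hash_join` helper are all assumptions made up for this illustration.

```python
# Made-up tables A and B as lists of (id, payload) rows.
A = [(i, i * 7 % 100) for i in range(1000)]
B = [(i, i * 13 % 100) for i in range(1000)]


def uncommon(x):
    return x < 10          # stand-in predicate: keeps 10% of A


def extremely_rare(y):
    return y == 0          # stand-in predicate: keeps 1% of B


def hash_join(left, right):
    """Equi-join two lists of (id, payload) rows on id."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, x, y) for key, x in left for y in index.get(key, [])]


def naive_plan():
    joined = hash_join(A, B)                       # join everything first
    result = [r for r in joined
              if uncommon(r[1]) and extremely_rare(r[2])]
    return result, len(joined)


def optimized_plan():
    a = [r for r in A if uncommon(r[1])]           # filter each input first
    b = [r for r in B if extremely_rare(r[1])]
    joined = hash_join(a, b)                       # join only the survivors
    return joined, len(joined)


naive_rows, naive_joined = naive_plan()
optimized_rows, optimized_joined = optimized_plan()
print(naive_joined, optimized_joined)
```

Both plans return the same rows, but the naive plan materializes every joined row before discarding almost all of them, while the optimized plan joins only the rows that already passed the filters.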


LESSON: STORAGE FORMATS MATTER MORE THAN LOCALITY

History lesson: Hadoop (2005)

[Diagram: events, a row of HDFS storage nodes, and compute added alongside each HDFS node.]

“Disks are too slow”

“Memories are too small”

“Your network is a bottleneck”

“Locality is king”


“For the workloads from Facebook and Bing, we see that 96% and 89% of the active jobs respectively can have their data entirely fit in memory, given an allowance of 32GB memory per server for caching”

—“PACMan: Coordinated Memory Caching for Parallel Jobs.” G. Ananthanarayanan et al., in Proceedings of NSDI ’12.


“Recent studies have shown that reading data from local disks is only about 8% faster than reading it from remote disks over the network … [and] this 8% number is decreasing.”

—Tom Phelan, “The Elephant in the Big Data Room: Data Locality is Irrelevant for Hadoop” (goo.gl/MnCKuM)


“Three out of ten hours of job runtime were spent moving files from the staging directory to the final directory in HDFS… We were essentially compressing, serializing, and replicating three copies for a single read.”

—“Apache Spark @Scale: a 60+ TB production use case” Facebook Engineering Blog Post

Do you need node locality?

Working set sizes typically fit in cluster memory even if raw data don’t.

I/O-heavy frameworks designed for colocated compute and storage perform worse than iterative processing in memory.

Colocating compute and storage prevents independent scale-out of compute and storage.


An optimization that matters

ROW-ORIENTED FORMAT

COLUMNAR FORMAT

10% of the space

1-10% of the time
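A minimal sketch of the idea behind the two layouts, in plain Python rather than any real storage engine. The sample records, the field names, and the helper functions `to_columns` and `run_length_encode` are all hypothetical; they only illustrate why a column-at-a-time layout can let a query skip unrelated fields and why repeated values in a column tend to compress well. This is not how Parquet is implemented.

```python
from itertools import groupby

# Hypothetical log records in a row-oriented layout: one tuple per record.
FIELDS = ("day", "service", "status")
rows = [
    ("2017-01-01", "web", 200),
    ("2017-01-01", "web", 200),
    ("2017-01-01", "api", 500),
    ("2017-01-02", "api", 200),
]


def to_columns(rows, fields):
    """Pivot a list of row tuples into one list of values per field."""
    return {name: [row[i] for row in rows] for i, name in enumerate(fields)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]


columns = to_columns(rows, FIELDS)

# A query over one field scans a single list instead of every whole row.
error_count = sum(1 for status in columns["status"] if status >= 500)

# Adjacent repeats within a column collapse to a few pairs.
encoded_days = run_length_encode(columns["day"])
```

In this toy layout, counting error statuses touches only the `status` list, and the `day` column shrinks to one pair per distinct run of dates.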


LESSON: THINGS TO CONSIDER WHEN PREDICTING THE FUTURE


Feature engineering

Features: HANDLEBAR TYPE (DROP, FLAT), TIRE SIZE, SUSPENSION? (FRONT, REAR), TIRE KNOBS

LABEL: mountain bike; feature vector: 0 1 0.35 1 1 1

LABEL: cyclocross bike; feature vector: 1 0 0.13 1 0 0
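The encoding on this slide can be sketched as a small function that turns a bike description into a numeric vector. This is only an illustration: the `Bike` class, its field names, and the `featurize` function are hypothetical, and the order of the last three features (tire knobs, front suspension, rear suspension) is an assumption, since the slide layout does not survive in this transcript.

```python
from dataclasses import dataclass

# Categories for the one-hot encoded handlebar feature.
HANDLEBAR_TYPES = ("drop", "flat")


@dataclass
class Bike:
    """Hypothetical raw record describing one bike."""
    handlebar: str
    tire_size: float
    tire_knobs: bool
    front_suspension: bool
    rear_suspension: bool


def featurize(bike):
    """Encode a Bike as a flat list of floats.

    Layout (assumed order): one slot per handlebar type, then tire size,
    then three 0/1 flags for knobs, front suspension, rear suspension.
    """
    one_hot = [1.0 if bike.handlebar == kind else 0.0 for kind in HANDLEBAR_TYPES]
    flags = [
        float(bike.tire_knobs),
        float(bike.front_suspension),
        float(bike.rear_suspension),
    ]
    return one_hot + [bike.tire_size] + flags


mountain = featurize(Bike("flat", 0.35, True, True, True))
```

The categorical handlebar type becomes two 0/1 slots so that a model never treats "drop" and "flat" as points on a numeric scale.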


Interpretable models

bird, animal, outdoors

food, macro, fruit

outdoors, racing, bike, wout van aert

Accurate predictions are only part of a model’s value!


From analytics as a workload to insightful applications

[Diagram: Mesos, a networked POSIX FS, and six Spark executors, with items labeled 1 to 4]

[Diagram: Kubernetes, object stores, and databases, with app 1 through app 6; app 2 and app 4 are shown twice]

http://radanalytics.io


Takeaways

Let Spark work for you.

Use efficient storage formats but don’t worry about data locality (yet).

ETL input data to Parquet early in your process.

Feature engineering effort often trumps fancy models.

Prefer interpretable models and easy-to-implement algorithms.
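As one possible reading of the Parquet takeaway, the sketch below converts raw input to Parquet once and has later steps read the Parquet copy. The function name `etl_to_parquet` and both paths are hypothetical, and the input is assumed to be CSV with a header row; `spark` is expected to be an existing `pyspark.sql.SparkSession`, which the snippet does not create.

```python
def etl_to_parquet(spark, source_path, parquet_path):
    """Rewrite header-bearing CSV as Parquet and return the Parquet-backed DataFrame.

    `spark` is assumed to be a pyspark.sql.SparkSession supplied by the caller.
    """
    # Read the raw text input once, up front.
    raw = spark.read.option("header", "true").csv(source_path)

    # Persist it in a columnar format, replacing any earlier output.
    raw.write.mode("overwrite").parquet(parquet_path)

    # Downstream jobs start from the Parquet copy, not the raw files.
    return spark.read.parquet(parquet_path)
```

A caller would typically build the session with `SparkSession.builder.getOrCreate()` and then select only the columns each later job needs from the returned DataFrame.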


@willb • https://chapeau.freevariable.com • http://radanalytics.io