Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive, Language-Integrated Cluster Computing UC BERKELEY www.spark-project.org

Dec 17, 2015


Ruby Wilkinson
Page 1

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

Spark
Fast, Interactive, Language-Integrated Cluster Computing

UC BERKELEY
www.spark-project.org

Page 2

Project Goals

Extend the MapReduce model to better support two common classes of analytics apps:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining

Enhance programmability:
» Integrate into the Scala programming language
» Allow interactive use from the Scala interpreter

Page 3

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

[Figure: Input flows through Map tasks, then Reduce tasks, to Output]

Page 4

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

[Figure: Input flows through Map tasks, then Reduce tasks, to Output]

Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures

Page 5

Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable storage on each query

Page 6

Solution: Resilient Distributed Datasets (RDDs)

Allow apps to keep working sets in memory for efficient reuse

Retain the attractive properties of MapReduce
» Fault tolerance, data locality, scalability

Support a wide range of applications

Page 7

RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Figure: lineage chain from HDFS File to Filtered RDD to Mapped RDD, via filter (func = _.contains(...)) and map (func = _.split(...))]
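
The lineage idea on this slide can be illustrated with a small toy model in plain Scala. This is only a sketch, not Spark's implementation: the `Dataset`, `Source`, `Filtered`, and `Mapped` types and the sample log lines are hypothetical names invented here. Each derived dataset remembers its parent and the function applied to it, so a lost partition can be rebuilt by replaying that chain on the matching source partition alone.

```scala
// Toy model of lineage-based recovery. These are NOT Spark's classes;
// every name here is an assumption made for illustration.
sealed trait Dataset[A] {
  // Recompute the contents of one partition by walking the lineage.
  def compute(partition: Int): Seq[A]
}

// Stand-in for a file in stable storage, already split into partitions.
final case class Source[A](partitions: Vector[Seq[A]]) extends Dataset[A] {
  def compute(partition: Int): Seq[A] = partitions(partition)
}

// Remembers its parent and the predicate, not the filtered data.
final case class Filtered[A](parent: Dataset[A], pred: A => Boolean) extends Dataset[A] {
  def compute(partition: Int): Seq[A] = parent.compute(partition).filter(pred)
}

// Remembers its parent and the mapping function, not the mapped data.
final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
}

object LineageDemo {
  // Made-up tab-separated log lines in two partitions.
  val file: Dataset[String] = Source(Vector(
    Seq("ERROR\tdisk\tfull", "INFO\tok\tfine"),
    Seq("ERROR\tnet\ttimeout")
  ))

  // Same shape as the slide's example: filter, then take the third field.
  val messages: Dataset[String] =
    Mapped(
      Filtered(file, (s: String) => s.startsWith("ERROR")),
      (s: String) => s.split('\t')(2)
    )

  // If a cached copy of one partition is lost, only that partition is rebuilt.
  def recover(partition: Int): Seq[String] = messages.compute(partition)
}
```

Calling `LineageDemo.recover(1)` touches only the second source partition, which is the property that makes lineage cheaper than replicating the derived data.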

Page 8

Example: Logistic Regression

Goal: find best line separating two sets of points

[Figure: scatter of + and - points, showing a random initial line and the target line]

Page 9

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
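
To make the update rule above concrete without a cluster, here is a minimal local sketch in plain Scala collections. It is not the slide's Spark program: `Point`, `dot`, `pointGradient`, `train`, `classify`, and the four sample points are assumptions introduced here, and the weights start at zero rather than at a random vector so the run is repeatable. The per-point gradient term is the same expression as on the slide.

```scala
object LogisticRegressionSketch {
  // A labelled point; y is +1.0 or -1.0. (Hypothetical type for this sketch.)
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (u, v) => u + v }

  // Per-point term from the slide: (1 / (1 + exp(-y * (w dot x))) - 1) * y * x
  def pointGradient(w: Array[Double], p: Point): Array[Double] = {
    val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * scale)
  }

  // Same loop structure as the slide, over an in-memory Seq instead of an RDD.
  def train(data: Seq[Point], iterations: Int, w0: Array[Double]): Array[Double] = {
    var w = w0
    for (_ <- 1 to iterations) {
      val gradient = data.map(p => pointGradient(w, p)).reduce(add)
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    w
  }

  // Predict +1.0 or -1.0 from the side of the line the point falls on.
  def classify(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0

  // Tiny made-up, linearly separable data set.
  val data: Seq[Point] = Seq(
    Point(Array(2.0, 1.0), 1.0),
    Point(Array(1.5, 2.0), 1.0),
    Point(Array(-1.0, -1.5), -1.0),
    Point(Array(-2.0, -0.5), -1.0)
  )
}
```

In the Spark version, `data` is read once and cached, so every pass of the loop after the first reuses the in-memory points instead of re-reading the input.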

Page 10

Logistic Regression Performance

[Figure: running time (s), 0 to 4500, versus number of iterations (1, 5, 10, 20, 30)]

Hadoop: 127 s / iteration

Spark: first iteration 174 s, further iterations 6 s
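
As a rough worked example, the labels on this slide can be turned into total running times. The sketch assumes a simple linear cost model (every Hadoop iteration costs 127 s; the Spark run pays 174 s once and 6 s for each later iteration). That model and the names below are assumptions made here, not measurements from the plot.

```scala
object RunningTime {
  // Per-iteration figures as labelled on the slide (seconds).
  val hadoopPerIteration = 127
  val sparkFirstIteration = 174
  val sparkLaterIteration = 6

  // Assumed model: every Hadoop iteration reloads the data and costs the same.
  def hadoopSeconds(iterations: Int): Int = hadoopPerIteration * iterations

  // Assumed model: the first iteration loads and caches the data, later ones reuse it.
  def sparkSeconds(iterations: Int): Int =
    if (iterations <= 0) 0
    else sparkFirstIteration + sparkLaterIteration * (iterations - 1)
}
```

Under this model, 30 iterations come to 3810 s on Hadoop versus 348 s on Spark, while a single iteration is slower on Spark (174 s versus 127 s) because of the one-time cost of loading the working set into memory.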

Page 11

Spark Applications

In-memory data mining on Hive data (Conviva)

Predictive analytics (Quantifind)

City traffic prediction (Mobile Millennium)

Twitter spam classification (Monarch)

Collaborative filtering via matrix factorization

Page 12

Conviva GeoReport

Aggregations on many keys w/ same WHERE clause

40× gain comes from:
» Not re-reading unused columns or filtered records
» Avoiding repeated decompression
» In-memory storage of deserialized objects

[Figure: bar chart of time (hours): Spark 0.5, Hive 20]

Page 13

Frameworks Built on Spark

Pregel on Spark (Bagel)
» Google message passing model for graph computation
» 200 lines of code

Hive on Spark (Shark)
» 3000 lines of code
» Compatible with Apache Hive
» ML operators in Scala

Page 14

Implementation

Runs on Apache Mesos to share resources with Hadoop & other apps

Can read from any Hadoop input source (e.g. HDFS)

[Figure: Spark, Hadoop, and MPI running on top of Mesos, which spans four nodes]

No changes to Scala compiler

Page 15

Spark Scheduler

Dryad-like DAGs

Pipelines functions within a stage

Cache-aware work reuse & locality

Partitioning-aware to avoid shuffles

[Figure: DAG of datasets A through G connected by groupBy, map, union, and join, divided into Stage 1, Stage 2, and Stage 3; a legend marks cached data partitions]
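
The "pipelines functions within a stage" point can be illustrated with ordinary Scala iterators. This is a sketch of the general technique only, not the scheduler's code; `PipeliningSketch`, `parse`, `stage`, and the `parsed` counter are hypothetical names used here. Because `map` and `filter` on an `Iterator` are lazy, each element passes through both functions before the next element is read, and no intermediate collection is built between them.

```scala
object PipeliningSketch {
  // Counts how many input elements the first function has processed so far.
  var parsed = 0

  // First function in the pipeline: parse a line into an Int.
  def parse(line: String): Int = {
    parsed += 1
    line.trim.toInt
  }

  // One "stage" over one partition: parse, then keep even numbers.
  // Nothing runs until a caller pulls elements from the result.
  def stage(partition: Iterator[String]): Iterator[Int] =
    partition.map(parse).filter(_ % 2 == 0)
}
```

Pulling the first result from `stage(Iterator("1", "2", "3", "4"))` parses only the first two inputs, which shows that the two functions run element by element rather than as two full passes over the partition.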

Page 16

Interactive Spark

Modified Scala interpreter to allow Spark to be used interactively from the command line

Required two changes:
» Modified wrapper code generation so that each line typed has references to objects for its dependencies
» Distribute generated classes over the network

Page 17

Conclusion

Spark provides a simple, efficient, and powerful programming model for a wide range of apps

Download our open source release:

www.spark-project.org

[email protected]