Page 1:

Spark

Slides from Matei Zaharia and Databricks

CS 5450

Page 2:

Goals

• Extend the MapReduce model to better support two common classes of analytics apps
  • Iterative algorithms (machine learning, graphs)
  • Interactive data mining

• Enhance programmability
  • Integrate into Scala programming language
  • Allow interactive use from Scala interpreter
  • Also support for Java, Python…

slide 2

Page 3:

Cluster Programming Models

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

slide 3

[Diagram: acyclic data flow from Input through Map tasks and Reduce tasks to Output]

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures

Page 4:

Acyclic Data Flow Inefficient ...

slide 4

... for applications that repeatedly reuse a working set of data

• Iterative algorithms (machine learning, graphs)
• Interactive data mining (R, Excel, Python)

because apps have to reload data from stable storage on every query
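In PySpark terms, the problem looks like this; a minimal sketch with a hypothetical dataset path, assuming a SparkContext sc:

points = sc.textFile("hdfs://.../points.txt")   # hypothetical path

for i in range(10):
    # Without caching, every pass re-reads and re-parses the file
    # from stable storage before doing any useful work.
    total = points.map(lambda line: len(line)).reduce(lambda x, y: x + y)

Keeping the parsed data in memory across passes is exactly the gap the next slides fill.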

Page 5:

Resilient Distributed Datasets

• Resilient distributed datasets (RDDs)
  • Immutable, partitioned collections of objects spread across a cluster, stored in RAM or on disk
  • Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage

• Allow apps to cache working sets in memory for efficient reuse

• Retain the attractive properties of MapReduce
  • Fault tolerance, data locality, scalability

• Actions on RDDs support many applications
  • Count, reduce, collect, save…

slide 5

Page 6:

Example: Log Mining

slide 6

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count

[Diagram: the Driver ships tasks to three Workers; each Worker reads one block (Block 1-3) of the base RDD from HDFS, builds the transformed RDD, and keeps it in memory (Cache 1-3); results of each action flow back to the Driver]

Full-text search of Wikipedia: 60 GB on 20 EC2 machines; 0.5 sec from memory vs. 20 s for on-disk data
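The same pipeline in PySpark, as a sketch assuming a SparkContext sc (the elided HDFS path stays elided):

lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda line: line.startswith("ERROR"))
messages = errors.map(lambda line: line.split("\t")[2])
cachedMsgs = messages.cache()

cachedMsgs.filter(lambda msg: "foo" in msg).count()
cachedMsgs.filter(lambda msg: "bar" in msg).count()

Only the first action pays for reading the log and populating the cache; later queries run against the in-memory copy.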

Page 7:

Spark Operations

slide 7

Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
collect, reduce, count, save, lookup(key)

Page 8:

Creating RDDs

slide 8

# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")

# Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Page 9:

Basic Transformations

slide 9

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x * x)  # => {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)  # => {4}

# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))  # => {0, 0, 1, 0, 1, 2}

range(x) yields the sequence of numbers 0, 1, …, x-1

Page 10:

Basic Actions

slide 10

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()  # => [1, 2, 3]

# Return first K elements
> nums.take(2)  # => [1, 2]

# Count number of elements
> nums.count()  # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)  # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

Page 11:

Working with Key-Value Pairs

slide 11

Spark’s “distributed reduce” transformations operate on RDDs of key-value pairs

Python:  pair = (a, b)
         pair[0]  # => a
         pair[1]  # => b

Scala:   val pair = (a, b)
         pair._1  // => a
         pair._2  // => b

Java:    Tuple2 pair = new Tuple2(a, b);
         pair._1  // => a
         pair._2  // => b

Page 12:

Some Key-Value Operations

slide 12

> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

> pets.reduceByKey(lambda x, y: x + y)  # => {(cat, 3), (dog, 1)}

> pets.groupByKey()  # => {(cat, [1, 2]), (dog, [1])}

> pets.sortByKey()  # => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side
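To make the map-side combine concrete, a small sketch contrasting the two approaches on the pets RDD above (both are standard pair-RDD calls):

# reduceByKey pre-sums values within each partition, so only one
# partial sum per key per partition crosses the network:
> pets.reduceByKey(lambda x, y: x + y).collect()
# => [('cat', 3), ('dog', 1)]

# groupByKey ships every individual value through the shuffle
# before any aggregation can happen:
> pets.groupByKey().mapValues(sum).collect()
# => [('cat', 3), ('dog', 1)]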

Page 13:

Example: Word Count

slide 13

> lines = sc.textFile("hamlet.txt")

> counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda x, y: x + y)

[Diagram: "to be or" and "not to be" are flatMapped to the words to, be, or / not, to, be; mapped to (to, 1), (be, 1), (or, 1) / (not, 1), (to, 1), (be, 1); and reduced by key to (be, 2), (not, 1) and (or, 1), (to, 2)]

Page 14:

Other Key-Value Operations

slide 14

> visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                            ("about.html", "3.4.5.6"),
                            ("index.html", "1.3.3.1") ])

> pageNames = sc.parallelize([ ("index.html", "Home"),
                               ("about.html", "About") ])

> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))

Page 15:

Example: Logistic Regression

slide 15

Goal: find best line separating two sets of points

[Figure: two classes of points, with a random initial line iteratively adjusted toward the target separating line]

Page 16:

Example: Logistic Regression

slide 16

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
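The same loop reads almost line for line in PySpark; a hedged sketch assuming NumPy, a hypothetical readPoint parser returning objects with fields x (a NumPy vector) and y (a ±1 label), and D and ITERATIONS defined elsewhere:

import numpy as np

data = sc.textFile("hdfs://.../points.txt").map(readPoint).cache()  # hypothetical path and parser

w = np.random.rand(D)   # random initial plane

for i in range(ITERATIONS):
    # each task computes the gradient over its partition; reduce sums them
    gradient = data.map(
        lambda p: (1.0 / (1.0 + np.exp(-p.y * w.dot(p.x))) - 1.0) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("Final w:", w)

The cache() call is what makes iterations after the first fast: the parsed points stay in memory, as the next slide's measurements show.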

Page 17:

Logistic Regression Performance

slide 17

[Chart: Hadoop takes 127 s / iteration; Spark takes 174 s for the first iteration (loading and caching the data), then 6 s for further iterations]

Page 18:

Setting the Level of Parallelism

slide 18

All pair RDD operations take an optional second parameter for the number of tasks

> words.reduceByKey(lambda x, y: x + y, 5)

> words.groupByKey(5)

> visits.join(pageViews, 5)

Page 19:

Using Local Variables

slide 19

Any external variables used in a closure are automatically shipped to the cluster

> query = sys.stdin.readline()
> pages.filter(lambda x: query in x).count()

Some caveats:
• Each task gets a new copy (updates aren't sent back)
• Variable must be serializable / pickle-able
• Don't use fields of an outer object (ships all of it! see the sketch below)
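A sketch of the last caveat; the class is hypothetical, for illustration only:

class LogSearcher(object):
    def __init__(self, pattern, pages):
        self.pattern = pattern
        self.pages = pages

    def count_matches_bad(self):
        # Referencing self inside the lambda pickles the whole
        # LogSearcher object and ships it with every task.
        return self.pages.filter(lambda x: self.pattern in x).count()

    def count_matches_good(self):
        # Copy the field into a local variable first; only the
        # small string is captured by the closure.
        pattern = self.pattern
        return self.pages.filter(lambda x: pattern in x).count()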

Page 20:

RDD Fault Tolerance

slide 20

RDDs maintain lineage information that can be used to reconstruct lost partitions

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Diagram: lineage chain: HDFS File → filter(func = _.contains(...)) → Filtered RDD → map(func = _.split(...)) → Mapped RDD]
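The lineage graph can be inspected directly; a sketch of the same pipeline in PySpark, assuming a SparkContext sc (toDebugString is standard RDD API):

messages = sc.textFile("hdfs://...") \
             .filter(lambda line: line.startswith("ERROR")) \
             .map(lambda line: line.split("\t")[2])

print(messages.toDebugString())   # prints the chain of parent RDDs

If a cached partition is lost, Spark re-runs just these transformations on the corresponding input block to rebuild it, so no replication of the data itself is needed.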

Page 21:

Spark Applications

• In-memory data mining on Hive data (Conviva)
• Predictive analytics (Quantifind)
• City traffic prediction (Mobile Millennium)
• Twitter spam classification (Monarch)
• … many others

slide 21

Page 22:

Conviva GeoReport

slide 22

• Aggregations on many keys w/ same WHERE clause
• 40× gain comes from:
  • Not re-reading unused columns or filtered records
  • Avoiding repeated decompression
  • In-memory storage of de-serialized objects

[Bar chart: query time in hours]

Page 23:

Frameworks Built on Spark

slide 23

• Pregel on Spark (Bagel)
  • Google's message-passing model for graph computation
  • 200 lines of code

• Hive on Spark (Shark)
  • 3000 lines of code
  • Compatible with Apache Hive
  • ML operators in Scala

Page 24:

Implementation

slide 24

Runs on Apache Mesos to share resources with Hadoop & other apps

Can read from any Hadoop input source (e.g. HDFS)

[Diagram: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes]

Page 25:

Spark Scheduler

slide 25

• Dryad-like DAGs
• Pipelines functions within a stage
• Cache-aware work reuse & locality
• Partitioning-aware to avoid shuffles

[Diagram: a DAG of RDDs A-G cut into Stages 1-3 at wide dependencies (groupBy, join); narrow dependencies (map, union) are pipelined within a stage; shaded boxes mark cached data partitions]

Page 26:

Example: PageRank

slide 26

• Basic idea: gives pages ranks (scores) based on links to them
  • Links from many pages ⇒ high rank
  • Link from a high-rank page ⇒ high rank

• Good example of a more complex algorithm
  • Multiple stages of map & reduce

• Benefits from Spark's in-memory caching
  • Multiple iterations over the same data

Page 27:

Algorithm

slide 27

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs

[Diagram: a four-page link graph; every rank starts at 1.0, and each page sends contributions of 1 or 0.5 along its outgoing links]

Page 28:

Algorithm

slide 28

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs

[Diagram: after one iteration the ranks are 0.58, 1.0, 1.85, and 0.58, with contributions of 0.58, 0.29, 0.29, 0.5, 1.85, and 0.5 along the links]
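Written out, step 3 computes

rank(p) = 0.15 + 0.85 × Σ_{q→p} rank(q) / |neighbors(q)|

As a check against the figure: a page whose only incoming contribution is 0.5 gets 0.15 + 0.85 × 0.5 ≈ 0.58, and the page receiving 1 + 0.5 + 0.5 = 2 gets 0.15 + 0.85 × 2 = 1.85.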

Page 29:

Algorithm

slide 29

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs

[Diagram: after the next iteration the ranks are 0.39, 1.72, 1.31, and 0.58]

Page 30:

Algorithm

slide 30

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs

[Diagram: final state: the ranks converge to 0.46, 1.37, 1.44, and 0.73]

Page 31:

Spark Implementation (in Scala)

slide 31

val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
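A hedged PySpark version of the same loop (parse_neighbors is a hypothetical parser mapping a line to (url, [neighbors]); sc, the paths, and ITERATIONS are assumptions):

links = (sc.textFile("hdfs://.../links.txt")
           .map(parse_neighbors)   # hypothetical: line -> (url, [neighbors])
           .cache())               # reused every iteration, so keep it in memory
ranks = links.mapValues(lambda _: 1.0)

for i in range(ITERATIONS):
    # join pairs each page's neighbor list with its current rank
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]]
    )
    ranks = contribs.reduceByKey(lambda x, y: x + y) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

ranks.saveAsTextFile("hdfs://.../ranks")   # hypothetical output path

Caching links (but not ranks, which changes every iteration) is the key win over the Hadoop version, as the next slide's measurements show.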

Page 32:

PageRank Performance

slide 32

[Bar chart: iteration time (s) on 30 and 60 machines; Hadoop: 171 s and 80 s, Spark: 23 s and 14 s]