Modern Database Systems Lecture 6
Aristides Gionis Michael Mathioudakis
Spring 2016
logistics
• tutorial on monday, TU6@2:15pm
• assignment 2 is out - due by march 14th
• for the programming part, check the updated tutorial
• a total of 5 late days are allowed
today
mapreduce & spark
as they were introduced, with emphasis on high-level concepts
introduction
intro recap
structured data, semi-structured data, text
query optimization vs flexibility of data model
disk access a central issue
indexing
now: big data
scale so big that new issues take front seat:
distributed, parallel computation
fault tolerance
how to accommodate those within a simple computational model?
remember this task from lecture 0...
data records that contain information about products viewed or purchased from an online store
task: for each pair of Games products, count the number of customers that have purchased both
Product | Category | Customer | Date | Price | Action | other...
Portal 2 | Games | Michael M. | 12/01/2015 | 10€ | Purchase
...
FLWR Plant Food | Garden | Aris G. | 19/02/2015 | 32€ | View
Chase the Rabbit | Games | Michael M. | 23/04/2015 | 1€ | View
Portal 2 | Games | Orestis K. | 13/05/2015 | 10€ | Purchase
...
what challenges does case B pose compared to case A?
hint: limited main memory, disk access, distributed setting
case A: 10,000 records (0.5MB per record, 5GB total disk space), 10GB of main memory
case B: 10,000,000 records (~5TB total disk space), stored across 100 nodes (50GB per node), 10GB of main memory per node
mapreduce
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
[email protected], [email protected]
Google, Inc.
Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
1 Introduction
Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.
As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis
appeared at the Symposium on Operating Systems Design & Implementation, 2004
some context
in the early 2000s, google was developing systems to accommodate storage and processing of big data volumes
google file system (gfs): “a scalable distributed file system for large distributed data-intensive applications”
“provides fault tolerance while running on inexpensive commodity hardware”
bigtable: “distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers”
mapreduce: “programming model and implementation for processing and generating large data sets”
motivation
hundreds of special-purpose computations over raw data: crawled webpages & documents, search & web request logs
inverted indexes, web graphs, document summaries, frequent queries
conceptually straightforward computation, however...
a lot of data, distributed over many machines: hundreds or thousands of machines...
a lot of practical issues arise that obscure the simplicity of the computation
at google in the early 2000s...
developed solution
programming model: simple
based on the map and reduce primitives found in functional languages (e.g., Lisp)
system
hides the messy details in a library: parallelization, fault-tolerance, data distribution, load balancing
mapreduce
programming model
system
programming model
input: a set of (key,value) pairs
computation: two functions, map and reduce, written by the user
output: a set of (key,value) pairs
map function
input: one (key,value) pair
output: a set of intermediate (key,value) pairs
mapreduce groups together pairs with the same key and passes them to the reduce function
map function
[diagram: each input (key, value) pair is passed to map, which emits a set of intermediate (key, value) pairs; the type of the intermediate keys/values generally differs from the type of the input keys/values]
reduce function
input: (key, list(values))
an intermediate key and the set of values for that key
list(values) is supplied as an iterator, convenient when there is not enough memory
output: list(values)
typically only 0 or 1 values are output per invocation
reduce function
[diagram: intermediate (key, value) pairs with the same key are grouped into (key, [value1, value2, ...]) and passed to a reduce invocation]
programming model
input: a set of (key,value) pairs
map: (key,value) → list( (key,value) )
reduce: (key, list(values)) → (key, list(values))
output: list( (key, list(values)) )
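To make the model concrete, here is a minimal single-machine sketch in Python (all names are illustrative, not part of any MapReduce library): it applies a user-supplied map function to every input pair, groups the intermediate pairs by key, and applies reduce to each group.

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, input_pairs):
        # map phase: each input (key, value) pair yields
        # intermediate (key, value) pairs
        groups = defaultdict(list)
        for key, value in input_pairs:
            for ikey, ivalue in map_fn(key, value):
                groups[ikey].append(ivalue)
        # reduce phase: each intermediate key is reduced together
        # with the list of all values emitted for it
        return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}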
example task
count the number of occurrences of each word in a collection of documents
input: a set of (key,value) pairs
key: document file location (id)
value: document contents (list of words)
how would you approach this?
map: (key,value) → list( (key,value) )
reduce: (key, list(values)) → (key, list(values))
example - solution
[diagram: map is applied to each document (doc1, doc2, doc3) and emits a (word, 1) pair for every word occurrence; the pairs are grouped by word and reduce sums the counts, e.g., (word1, [1,1,1,1]) → (word1, 4) and (word2, [1,1,1]) → (word2, 3)]
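As a runnable counterpart, here is the word-count solution expressed with the run_mapreduce sketch from above (wc_map, wc_reduce, and the sample documents are illustrative):

    def wc_map(doc_id, contents):
        # emit (word, 1) for every word occurrence in the document
        for word in contents.split():
            yield word, 1

    def wc_reduce(word, counts):
        # sum all counts emitted for this word
        return sum(counts)

    docs = [("doc1", "the cat sat"), ("doc2", "the cat")]
    print(run_mapreduce(wc_map, wc_reduce, docs))
    # {'the': 2, 'cat': 2, 'sat': 1}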
for a rewrite of our production indexing system. Section 7 discusses related and future work.
2 Programming Model
The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
2.1 Example
Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.
In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.
2.2 Types
Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

    map    (k1,v1)       → list(k2,v2)
    reduce (k2,list(v2)) → list(v2)

I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.
2.3 More Examples
Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.
Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
Count of URL Access Frequency: The map function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair.
Reverse Web-Link Graph: The map function outputs ⟨target, source⟩ pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair ⟨target, list(source)⟩.
Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of ⟨word, frequency⟩ pairs. The map function emits a ⟨hostname, term vector⟩ pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final ⟨hostname, term vector⟩ pair.
programming model - types
map: (key,value) → list( (key,value) )
reduce: (key, list(values)) → (key, list(values))
[diagram: the type of the input (key, value) pairs differs from the type of the intermediate (key, value) pairs; intermediate and output pairs have the same types]
more examples
grep: search a set of documents for a string pattern in a line
input: a set of (key,value) pairs
key: document file location (id)
value: document contents (lines of characters)
more examples
map: emits a (document file location, line) pair for each line that matches the pattern
reduce: identity function (see the sketch below)
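A minimal sketch of distributed grep in the style of the run_mapreduce simulation from earlier (the pattern and function names are illustrative):

    import re

    PATTERN = re.compile("error")  # illustrative pattern

    def grep_map(doc_id, contents):
        # emit (doc_id, line) for every line that matches the pattern
        for line in contents.splitlines():
            if PATTERN.search(line):
                yield doc_id, line

    def grep_reduce(doc_id, lines):
        # identity: copy the matching lines through unchanged
        return lines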
more examples
count of URL access frequency
process logs of web page requests
logs are stored in documents, one line per request; each line contains the URL of the requested page
input: a set of (key,value) pairs
key: log file location
value: log contents (lines of requests)
more examples
map: process logs of web page requests, output (URL, 1) pairs
reduce: add together the counts for the same URL
more examples
reverse web-link graph
process a set of webpages
for each target webpage, find the list of all source webpages that contain a link to target
input: a set of (key,value) pairs
key: webpage URL
value: webpage contents (html)
more examples
map: output a (target, source) pair for each link to a target URL found in a page named source
reduce: concatenate the list of sources per target, output (target, list(source)) pairs (see the sketch below)
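A sketch of the reverse web-link graph in the same style; the crude regular expression for extracting links is illustrative, not production HTML parsing:

    import re

    def links_map(source_url, html):
        # emit (target, source) for every link found in the page
        for target in re.findall(r'href="([^"]+)"', html):
            yield target, source_url

    def links_reduce(target, sources):
        # all pages that link to this target
        return sources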
more examples
term vector per host
process a collection of webpages
each webpage has a URL of the form [host]/[page address], e.g. http://www.aalto.fi/en/current/news/2016-03-02/
find a term vector per host
input: a set of (key,value) pairs
key: webpage URL
value: webpage contents (html-stripped text)
more examples
map: emit a (hostname, term vector) pair for each webpage; the hostname is extracted from the document URL
reduce: adds the per-document term vectors together and emits one final (hostname, term vector) pair per hostname
more examples
simple inverted index (no counts)
process a collection of documents to construct an inverted index
for each word, keep a list of the documents in which it occurs
input: a set of (key,value) pairs
key: document file location (id)
value: document contents (list of words)
more examples
map: parse each document, emit a sequence of (word, document ID) pairs
reduce: output a (word, list(document ID)) pair for each word (see the sketch below)
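A sketch of the simple inverted index in the same style (names are illustrative); reduce sorts and de-duplicates the posting list:

    def index_map(doc_id, contents):
        # emit (word, doc_id) for every word occurrence
        for word in contents.split():
            yield word, doc_id

    def index_reduce(word, doc_ids):
        # sorted, de-duplicated list of documents containing this word
        return sorted(set(doc_ids))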
system
at google (back in 2004): large clusters of commodity PCs, connected with switched ethernet
dual-processor x86, linux, 2-4GB of memory per machine
100 Mbit/s or 1 Gbit/s network, 100's or 1000's of machines per cluster
storage: inexpensive IDE disks attached to the machines
google file system (GFS) - uses replication
users submit jobs to a scheduling system
execution
a job is submitted, then what?
map and reduce invocations are distributed over machines
input data is automatically partitioned into a set of M splits
each of the M splits is fed into a map instance
intermediate results are partitioned into R partitions according to a hash function provided by the user (see the sketch below)
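A one-line sketch of the default partitioning scheme (hashing the intermediate key modulo R); the names are illustrative:

    R = 4  # number of reduce partitions (illustrative)

    def partition(intermediate_key):
        # assign an intermediate key to one of the R reduce partitions
        return hash(intermediate_key) % R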
execution
[Figure 1: Execution overview - (1) the user program forks a master and workers; (2) the master assigns map and reduce tasks; (3) map workers read input splits; (4) map workers write intermediate files to local disk; (5) reduce workers remote-read the intermediate files; (6) reduce workers write the output files]
Inverted Index: The map function parses each document, and emits a sequence of ⟨word, document ID⟩ pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a ⟨word, list(document ID)⟩ pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
Distributed Sort: The map function extracts the key from each record, and emits a ⟨key, record⟩ pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.
3 Implementation
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines. This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:
(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.
(2) Commodity networking hardware is used - typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.
(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.
3.1 Execution Overview
The Map invocations are distributed across multiple machines by automatically partitioning the input data
(1) split input files into M pieces (16-64MB each) and fork many copies of the user program
(2) master assigns M + R tasks to idle workers
(3) a worker assigned to a map task reads the corresponding split, passes the input data to the map function, and stores intermediate results in memory
(4) periodically, buffered intermediate results are written to local disk, into R partitions, according to the hash function; their locations are passed to the master
(5) the master notifies the reduce workers; a reduce worker collects the intermediate data for one partition from the local disks of the map workers and sorts it by intermediate key
(6) the reduce worker passes each intermediate key and the corresponding values to the reduce function; the output is appended to the file for this reduce partition
(7) after all tasks are completed, the master wakes up the user program
final output: R files
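Putting the steps together, here is a toy single-process simulation of the execution flow (no cluster, no failures; everything is illustrative), reusing the map/reduce function style from earlier:

    from collections import defaultdict

    def execute(map_fn, reduce_fn, inputs, M=3, R=2):
        # (1) partition the input into M splits
        splits = [inputs[i::M] for i in range(M)]
        # (3)-(4) each "map worker" processes one split and hash-partitions
        # its intermediate pairs into R regions
        regions = [defaultdict(list) for _ in range(R)]
        for split in splits:
            for key, value in split:
                for ikey, ivalue in map_fn(key, value):
                    regions[hash(ikey) % R][ikey].append(ivalue)
        # (5)-(6) each "reduce worker" sorts its region by intermediate key
        # and applies reduce; (7) the final output is R result sets ("files")
        return [
            {ikey: reduce_fn(ikey, region[ikey]) for ikey in sorted(region)}
            for region in regions
        ]

    # e.g., execute(wc_map, wc_reduce, docs) returns R dictionaries whose
    # union equals the single-machine word-count result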
master data structures
state for each map & reduce task: idle, in-progress, or completed
+ identity of the assigned worker
for each completed map task: locations and sizes of the R intermediate file regions
received as map tasks are completed, pushed incrementally to reduce workers with in-progress tasks (see the sketch below)
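A minimal reconstruction of this bookkeeping as a Python dataclass (purely illustrative; the paper does not specify the actual structures):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class TaskState:
        status: str = "idle"          # idle / in-progress / completed
        worker: Optional[str] = None  # identity of the assigned worker
        # for completed map tasks: locations and sizes of the R
        # intermediate file regions on the worker's local disk
        regions: list = field(default_factory=list)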
fault tolerance
worker failure: the master pings each worker periodically
if no response, the worker is considered failed
completed map tasks are reset to idle (why?)
in-progress tasks are set to idle
idle tasks: up for grabs by other workers
fault tolerance
master failure
the master writes periodic checkpoints of the master data structures (state)
a new master re-starts from the last checkpoint
“stragglers”
tasks that take too long to complete
solution: when a mapreduce operation is close to completion, schedule backup executions of the remaining tasks
locality
master tries to assign tasks to nodes that contain a replica of the input data
task granularity
M map tasks and R reduce tasks
ideally, M and R should be much larger than the number of workers
why? load-balancing & speedy recovery
ordering guarantees
within each partition, intermediate key/value pairs are processed in increasing key order
makes it easy to generate a sorted output file per partition (why?)
combiner functions: an optional user-defined function
executed on the machines that perform map tasks; "combines" results before they are passed to the reducer
what would the combiner be for the word-count example?
typically the combiner is the same as the reducer; the only difference is the output:
the reducer writes to the final output, the combiner writes to intermediate output (see the sketch below)
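A sketch of map-side combining in the style of the earlier simulation (map_with_combiner is an illustrative name); for word count, the combiner is simply wc_reduce itself:

    from collections import defaultdict

    def map_with_combiner(map_fn, combine_fn, key, value):
        # apply map, then pre-aggregate the intermediate pairs locally
        # before they would be written to intermediate files
        local = defaultdict(list)
        for ikey, ivalue in map_fn(key, value):
            local[ikey].append(ivalue)
        for ikey, ivalues in local.items():
            yield ikey, combine_fn(ikey, ivalues)

    # map_with_combiner(wc_map, wc_reduce, "doc1", "the cat and the hat")
    # emits ('the', 2), ('cat', 1), ('and', 1), ('hat', 1)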
counters
objects updated within map and reduce functions
periodically propagated to the master
useful for debugging
counters - example
    Counter* uppercase;
    uppercase = GetCounter("uppercase");

    map(String name, String contents):
        for each word w in contents:
            if (IsCapitalized(w)):
                uppercase->Increment();
            EmitIntermediate(w, "1");
The counter values from individual worker machines are periodically propagated to the master (piggybacked on the ping response). The master aggregates the counter values from successful map and reduce tasks and returns them to the user code when the MapReduce operation is completed. The current counter values are also displayed on the master status page so that a human can watch the progress of the live computation. When aggregating counter values, the master eliminates the effects of duplicate executions of the same map or reduce task to avoid double counting. (Duplicate executions can arise from our use of backup tasks and from re-execution of tasks due to failures.)
Some counter values are automatically maintained by the MapReduce library, such as the number of input key/value pairs processed and the number of output key/value pairs produced.
Users have found the counter facility useful for sanity checking the behavior of MapReduce operations. For example, in some MapReduce operations, the user code may want to ensure that the number of output pairs produced exactly equals the number of input pairs processed, or that the fraction of German documents processed is within some tolerable fraction of the total number of documents processed.
5 Performance
In this section we measure the performance of MapReduce on two computations running on a large cluster of machines. One computation searches through approximately one terabyte of data looking for a particular pattern. The other computation sorts approximately one terabyte of data.
These two programs are representative of a large subset of the real programs written by users of MapReduce - one class of programs shuffles data from one representation to another, and another class extracts a small amount of interesting data from a large data set.
5.1 Cluster Configuration
All of the programs were executed on a cluster that consisted of approximately 1800 machines. Each machine had two 2GHz Intel Xeon processors with Hyper-Threading enabled, 4GB of memory, two 160GB IDE
[Figure 2: Data transfer rate over time - input scan rate (MB/s) against elapsed seconds]
disks, and a gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. All of the machines were in the same hosting facility and therefore the round-trip time between any pair of machines was less than a millisecond.
Out of the 4GB of memory, approximately 1-1.5GB was reserved by other tasks running on the cluster. The programs were executed on a weekend afternoon, when the CPUs, disks, and network were mostly idle.
5.2 Grep
The grep program scans through 10^10 100-byte records, searching for a relatively rare three-character pattern (the pattern occurs in 92,337 records). The input is split into approximately 64MB pieces (M = 15000), and the entire output is placed in one file (R = 1).
Figure 2 shows the progress of the computation over time. The Y-axis shows the rate at which the input data is scanned. The rate gradually picks up as more machines are assigned to this MapReduce computation, and peaks at over 30 GB/s when 1764 workers have been assigned. As the map tasks finish, the rate starts dropping and hits zero about 80 seconds into the computation. The entire computation takes approximately 150 seconds from start to finish. This includes about a minute of startup overhead. The overhead is due to the propagation of the program to all worker machines, and delays interacting with GFS to open the set of 1000 input files and to get the information needed for the locality optimization.
5.3 Sort
The sort program sorts 10^10 100-byte records (approximately 1 terabyte of data). This program is modeled after the TeraSort benchmark [10].
The sorting program consists of less than 50 lines of user code. A three-line Map function extracts a 10-byte sorting key from a text line and emits the key and the
performance
1800 machines
each machine: two 2GHz Xeon processors, 4GB of memory (2.5-3GB available), two 160GB disks, gigabit Ethernet
performance: grep
10^10 100-byte records
search for a pattern found in <10^5 records
M = 15000, R = 1
150 seconds from start to finish
exercise: today, how big a file would you grep on one machine in 150 seconds?
performance: sort
10^10 100-byte records
extract a 10-byte sorting key from each record (line)
M = 15000, R = 4000
850 seconds from start to finish
exercise: how would you implement sort?
summary
original mapreduce paper
simple programming model based on functional language primitives
system takes care of scheduling and fault-tolerance
great impact on cluster computing
hadoop
mapreduce and hadoop
mapreduce is implemented in apache hadoop
an open-source software ecosystem for distributed data storage and processing
hadoop
[diagram: hadoop components - common; hdfs (hadoop distributed filesystem); yarn (scheduling & resource management); mapreduce]
hadoop
[diagram: hadoop ecosystem - common; hdfs (hadoop distributed filesystem); yarn (scheduling & resource management); mapreduce; plus related projects: mahout (machine learning library), hive (data warehouse, sql-like querying), pig (data-flow language and system for parallel computation), spark (cluster-computing engine), and a lot of other projects!]
spark
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley
Abstract
MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
1 Introduction
A new model of cluster computing has become widely popular, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing. MapReduce [11] pioneered this model, while systems like Dryad [17] and Map-Reduce-Merge [24] generalized the types of data flows supported. These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention.
While this data flow programming model is useful for a large class of applications, there are applications that cannot be expressed efficiently as acyclic data flows. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes two use cases where we have seen Hadoop users report that MapReduce is deficient:
• Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.
• Interactive analytics: Hadoop is often used to run ad-hoc exploratory queries on large datasets, through SQL interfaces such as Pig [21] and Hive [1]. Ideally, a user would be able to load a dataset of interest into memory across a number of machines and query it repeatedly. However, with Hadoop, each query incurs significant latency (tens of seconds) because it runs as a separate MapReduce job and reads data from disk.
This paper presents a new cluster computing framework called Spark, which supports applications with working sets while providing similar scalability and fault tolerance properties to MapReduce.
The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition. Although RDDs are not a general shared memory abstraction, they represent a sweet-spot between expressivity on the one hand and scalability and reliability on the other hand, and we have found them well-suited for a variety of applications.
Spark is implemented in Scala [5], a statically typed high-level programming language for the Java VM, and exposes a functional programming interface similar to DryadLINQ [25]. In addition, Spark can be used interactively from a modified version of the Scala interpreter, which allows the user to define RDDs, functions, variables and classes and use them in parallel operations on a cluster. We believe that Spark is the first system to allow an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster.
Although our implementation of Spark is still a prototype, early experience with the system is encouraging. We show that Spark can outperform Hadoop by 10x in iterative machine learning workloads and can be used interactively to scan a 39 GB dataset with sub-second latency.
This paper is organized as follows. Section 2 describes
appeared at HotCloud, 2010
appeared at the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley
Abstract
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
1 Introduction
Cluster computing frameworks like MapReduce [10] and Dryad [19] have been widely adopted for large-scale data analytics. These systems let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance.
Although current frameworks provide numerous abstractions for accessing a cluster's computational resources, they lack abstractions for leveraging distributed memory. This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations. Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. Another compelling use case is interactive data mining, where a user runs multiple ad-hoc queries on the same subset of the data. Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.
Recognizing this problem, researchers have developed specialized frameworks for some applications that require data reuse. For example, Pregel [22] is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop [7] offers an iterative MapReduce interface. However, these frameworks only support specific computation patterns (e.g., looping a series of MapReduce steps), and perform data sharing implicitly for these patterns. They do not provide abstractions for more general reuse, e.g., to let a user load several datasets into memory and run ad-hoc queries across them.
In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.
In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.¹ If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute
¹ Checkpointing the data in some RDDs may be useful when a lineage chain grows large, however, and we discuss how to do it in §5.4.
why not mapreduce?
mapreduce flows are acyclic
not efficient for some applications
why not mapreduce?
iterative jobs: many common machine learning algorithms
repeatedly apply the same function to the same dataset (e.g., gradient descent)
mapreduce repeatedly reloads (reads & writes) the data
why not mapreduce?
interactive analytics: load data in memory and query it repeatedly
mapreduce would re-read the data
spark’s proposal
generalize the mapreduce model to accommodate such applications
allow us to treat data as available across repeated queries and updates
resilient distributed datasets (rdds)
resilient distributed datasets (rdd)
read-only collection of objects partitioned across machines
users can explicitly cache rdds in memory
and re-use them across mapreduce-like parallel operations (see the sketch below)
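A small PySpark sketch (assuming a local Spark installation; the file path and filters are illustrative) of an RDD that is cached and then reused by several parallel operations, in the spirit of the paper's log mining example:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    # an RDD: a read-only, partitioned collection of objects
    errors = sc.textFile("logs.txt").filter(lambda line: "ERROR" in line)

    errors.cache()  # explicitly keep this working set in memory

    # the cached RDD is reused across multiple parallel operations
    print(errors.count())
    print(errors.filter(lambda line: "timeout" in line).count())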
main challenge
efficient fault-tolerance
to treat data as available in-memory, it should be easy to re-build if part of the data (e.g., a partition) is lost
achieved through coarse-grained transformations and lineage
michael mathioudakis 68
fault-tolerance
coarse transformations
e.g., map operations applied to many (even all) data items
lineage
the series of transformations that led to a dataset
if a partition is lost, there is enough information to re-apply the transformations and re-compute it (see the snippet below)
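Continuing the PySpark sketch above, the lineage of an RDD can be inspected with toDebugString (the exact output format varies across Spark versions): if a partition of errors is lost, Spark re-reads the corresponding block of logs.txt and re-applies the filter to rebuild just that partition.

    # lineage: the chain of transformations that produced the RDD
    print(errors.toDebugString())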
programming model
developers write a driver program: high-level control flow
think of rdds as 'variables' that represent datasets
on which you apply parallel operations
can also use restricted types of shared variables (see the sketch below)
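A minimal driver-program sketch using one of Spark's restricted shared variables, a broadcast variable (continuing the SparkContext from above; the data is illustrative):

    # the driver defines RDDs and invokes parallel operations on them
    stopwords = sc.broadcast({"the", "a", "of"})  # read-only shared variable

    words = sc.parallelize(["the", "cat", "sat", "of", "mat"])
    content_words = words.filter(lambda w: w not in stopwords.value)
    print(content_words.collect())  # ['cat', 'sat', 'mat']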
spark runtime
Figure 2: Spark runtime. The user's driver program launches multiple workers, which read data blocks from a distributed file system and can persist computed RDD partitions in memory.
…schedule tasks based on data locality to improve performance. Second, RDDs degrade gracefully when there is not enough memory to store them, as long as they are only being used in scan-based operations. Partitions that do not fit in RAM can be stored on disk and will provide similar performance to current data-parallel systems.
2.4 Applications Not Suitable for RDDs
As discussed in the Introduction, RDDs are best suited for batch applications that apply the same operation to all elements of a dataset. In these cases, RDDs can efficiently remember each transformation as one step in a lineage graph and can recover lost partitions without having to log large amounts of data. RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler. For these applications, it is more efficient to use systems that perform traditional update logging and data checkpointing, such as databases, RAMCloud [25], Percolator [26] and Piccolo [27]. Our goal is to provide an efficient programming model for batch analytics and leave these asynchronous applications to specialized systems.
3 Spark Programming Interface
Spark provides the RDD abstraction through a language-integrated API similar to DryadLINQ [31] in Scala [2], a statically typed functional programming language for the Java VM. We chose Scala due to its combination of conciseness (which is convenient for interactive use) and efficiency (due to static typing). However, nothing about the RDD abstraction requires a functional language.
To use Spark, developers write a driver program that connects to a cluster of workers, as shown in Figure 2. The driver defines one or more RDDs and invokes actions on them. Spark code on the driver also tracks the RDDs' lineage. The workers are long-lived processes that can store RDD partitions in RAM across operations.
As we showed in the log mining example in Section 2.2.1, users provide arguments to RDD operations like map by passing closures (function literals). Scala represents each closure as a Java object, and these objects can be serialized and loaded on another node to pass the closure across the network. Scala also saves any variables bound in the closure as fields in the Java object. For example, one can write code like var x = 5; rdd.map(_ + x) to add 5 to each element of an RDD.5
RDDs themselves are statically typed objects parametrized by an element type. For example, RDD[Int] is an RDD of integers. However, most of our examples omit types since Scala supports type inference.
Although our method of exposing RDDs in Scala is conceptually simple, we had to work around issues with Scala's closure objects using reflection [33]. We also needed more work to make Spark usable from the Scala interpreter, as we shall discuss in Section 5.2. Nonetheless, we did not have to modify the Scala compiler.
3.1 RDD Operations in Spark
Table 2 lists the main RDD transformations and actions available in Spark. We give the signature of each operation, showing type parameters in square brackets. Recall that transformations are lazy operations that define a new RDD, while actions launch a computation to return a value to the program or write data to external storage.
Note that some operations, such as join, are only available on RDDs of key-value pairs. Also, our function names are chosen to match other APIs in Scala and other functional languages; for example, map is a one-to-one mapping, while flatMap maps each input value to one or more outputs (similar to the map in MapReduce).
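To make the map/flatMap distinction concrete, a small sketch with invented toy data (not part of the paper excerpt):

val docs = sc.parallelize(Seq("to be", "or not"))

docs.map(_.split(" ")).collect()
// one array per input line: Array(Array("to","be"), Array("or","not"))

docs.flatMap(_.split(" ")).collect()
// inputs flattened into words: Array("to", "be", "or", "not")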
In addition to these operators, users can ask for an RDD to persist. Furthermore, users can get an RDD's partition order, which is represented by a Partitioner class, and partition another dataset according to it. Operations such as groupByKey, reduceByKey and sort automatically result in a hash or range partitioned RDD.
3.2 Example Applications
We complement the data mining example in Section 2.2.1 with two iterative applications: logistic regression and PageRank. The latter also showcases how control of RDDs' partitioning can improve performance.
3.2.1 Logistic Regression
Many machine learning algorithms are iterative in nature because they run iterative optimization procedures, such as gradient descent, to maximize a function. They can thus run much faster by keeping their data in memory.
As an example, the following program implements logistic regression [14], a common classification algorithm
5We save each closure at the time it is created, so that the map in this example will always add 5 even if x changes.
michael mathioudakis 71
rdd: read-only collection of objects partitioned across a set of machines, that can be re-built if a partition is lost
constructed in the following ways:
from a file in a shared file system (e.g., hdfs)
parallelizing a collection (e.g., an array): divide into partitions and send to multiple nodes
transforming an existing rdd, e.g., applying a map operation
changing the persistence of an existing rdd: hint to cache the rdd or save it to the filesystem
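One line per construction method, as a sketch (paths and values invented; `sc` is the SparkContext):

val fromFile    = sc.textFile("hdfs://.../data.txt")      // 1. from a shared file system
val fromArray   = sc.parallelize(Array(1, 2, 3, 4), 4)    // 2. parallelize a collection into 4 partitions
val transformed = fromArray.map(_ * 2)                    // 3. transform an existing rdd
val persisted   = transformed.persist()                   // 4. change persistence (cache hint)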
michael mathioudakis 72
rdd
need not exist physically at all times; instead, there is enough information to compute the rdd
rdds are lazily-created and ephemeral
lazy: materialized only when information is extracted from them (through actions!)
ephemeral: discarded after use
michael mathioudakis 73
transformations and actions
transformations
lazy operations that define a new rdd
actions
launch a computation on an rdd to return a value to the program or write data to external storage
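The laziness is observable in a sketch like the following (invented data; saveAsTextFile is the released-Spark name for the save action listed in Table 2):

val squares = sc.parallelize(1 to 10).map(x => x * x)
// transformation: defines a new rdd, no cluster work happens yet

val total = squares.reduce(_ + _)
// action: triggers the computation and returns 385 to the driver

squares.saveAsTextFile("hdfs://.../squares")
// action: writes the dataset to external storage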
michael mathioudakis 74
shared variables
broadcast variables
read-only variables, sent to all workers
typical use-case
large read-only piece of data (e.g., a lookup table) that is used across multiple parallel operations
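For example (table contents invented), a broadcast variable ships the lookup table to each worker once, rather than inside every closure:

// small read-only lookup table, shipped to all workers once
val countryNames = sc.broadcast(Map("fi" -> "Finland", "gr" -> "Greece"))

val codes = sc.parallelize(Seq("fi", "gr", "fi"))
val named = codes.map(c => countryNames.value.getOrElse(c, "unknown")).collect()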
michael mathioudakis 75
shared variables
accumulators
write-only variables that workers can update, using an operation that is commutative and associative
only the driver can read them
typical use-case: counters
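A counter sketch in the Spark 1.x API of the period (the malformed-line condition is invented):

// write-only from the workers' side; updates must be commutative and associative
val badLines = sc.accumulator(0)

sc.textFile("hdfs://.../input.txt").foreach { line =>
  if (line.split('\t').length < 3) badLines += 1   // workers only add
}
println(badLines.value)   // only the driver reads the result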
michael mathioudakis 76
example: text search
suppose that a web service is experiencing errors and you want to search over terabytes of logs to find the cause
the logs are stored in the Hadoop Filesystem (HDFS)
errors are written in the logs as lines that start with the keyword "ERROR"
michael mathioudakis 77
example: text search
michael mathioudakis 78
Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations:
lines → filter(_.startsWith("ERROR")) → errors → filter(_.contains("HDFS")) → HDFS errors → map(_.split('\t')(3)) → time fields
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure.
At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:
errors.count()
The user can also perform further transformations on the RDD and use their results, as in the following lines:
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM. This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).
Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.
Aspect                      | RDDs                                        | Distr. Shared Mem.
Reads                       | Coarse- or fine-grained                     | Fine-grained
Writes                      | Coarse-grained                              | Fine-grained
Consistency                 | Trivial (immutable)                         | Up to app / runtime
Fault recovery              | Fine-grained and low-overhead using lineage | Requires checkpoints and program rollback
Straggler mitigation        | Possible using backup tasks                 | Difficult
Work placement              | Automatic based on data locality            | Up to app (runtimes aim for transparency)
Behavior if not enough RAM  | Similar to existing data flow systems       | Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.
2.3 Advantages of the RDD Model
To understand the benefits of RDDs as a distributed memory abstraction, we compare them against distributed shared memory (DSM) in Table 1. In DSM systems, applications read and write to arbitrary locations in a global address space. Note that under this definition, we include not only traditional shared memory systems [24], but also other systems where applications make fine-grained writes to shared state, including Piccolo [27], which provides a shared DHT, and distributed databases. DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.
The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained transformations, while DSM allows reads and writes to each memory location.3 This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage.4 Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.
A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce [10]. Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other's updates.
Finally, RDDs provide two other benefits over DSM. First, in bulk operations on RDDs, a runtime can schedule tasks based on data locality to improve performance.
3Note that reads on RDDs can still be fine-grained. For example, an application can treat an RDD as a large read-only lookup table.
4In some applications, it can still help to checkpoint RDDs with long lineage chains, as we discuss in Section 5.4. However, this can be done in the background because RDDs are immutable, and there is no need to take a snapshot of the whole application as in DSM.
in Scala...
lines and errors are rdds: lines is created from a file, errors via a transformation
persist is a hint: keep in memory!
no work on the cluster so far
count is an action! note that lines is not loaded to ram!
example - text search ctd.
let us find errors related to “MySQL”
michael mathioudakis 79
example - text search ctd.
michael mathioudakis 80
a transformation (filter), followed by an action (count)
example - text search ctd. again
let us find errors related to "HDFS" and extract their time field
assuming time is field no. 3 in tab-separated format
michael mathioudakis 81
example - text search ctd. again
michael mathioudakis 82
two transformations (filter, map), followed by an action (collect)
example: text search - lineage of time fields
michael mathioudakis 83
errors is cached; the last two transformations are pipelined
if a partition of errors is lost, the filter is applied to only the corresponding partition of lines
transformations and actions
Transformations
  map(f: T => U)                 : RDD[T] => RDD[U]
  filter(f: T => Bool)           : RDD[T] => RDD[T]
  flatMap(f: T => Seq[U])        : RDD[T] => RDD[U]
  sample(fraction: Float)        : RDD[T] => RDD[T]   (deterministic sampling)
  groupByKey()                   : RDD[(K, V)] => RDD[(K, Seq[V])]
  reduceByKey(f: (V, V) => V)    : RDD[(K, V)] => RDD[(K, V)]
  union()                        : (RDD[T], RDD[T]) => RDD[T]
  join()                         : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (V, W))]
  cogroup()                      : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (Seq[V], Seq[W]))]
  crossProduct()                 : (RDD[T], RDD[U]) => RDD[(T, U)]
  mapValues(f: V => W)           : RDD[(K, V)] => RDD[(K, W)]   (preserves partitioning)
  sort(c: Comparator[K])         : RDD[(K, V)] => RDD[(K, V)]
  partitionBy(p: Partitioner[K]) : RDD[(K, V)] => RDD[(K, V)]

Actions
  count()                : RDD[T] => Long
  collect()              : RDD[T] => Seq[T]
  reduce(f: (T, T) => T) : RDD[T] => T
  lookup(k: K)           : RDD[(K, V)] => Seq[V]   (on hash/range partitioned RDDs)
  save(path: String)     : outputs RDD to a storage system, e.g., HDFS

Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.
that searches for a hyperplane w that best separates two sets of points (e.g., spam and non-spam emails). The algorithm uses gradient descent: it starts w at a random value, and on each iteration, it sums a function of w over the data to move w in a direction that improves it.
val points = spark.textFile(...)
                  .map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w dot p.x)))-1) * p.y
  }.reduce((a,b) => a+b)
  w -= gradient
}
We start by defining a persistent RDD called points as the result of a map transformation on a text file that parses each line of text into a Point object. We then repeatedly run map and reduce on points to compute the gradient at each step by summing a function of the current w. Keeping points in memory across iterations can yield a 20× speedup, as we show in Section 6.1.
3.2.2 PageRank
A more complex pattern of data sharing occurs in PageRank [6]. The algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors. It then updates its rank to α/N + (1 − α)Σc_i, where the sum is over the contributions it received and N is the total number of documents. We can write PageRank in Spark as follows:
Figure 3: Lineage graph for datasets in PageRank. The input file is mapped to links and ranks0; each iteration joins links with ranks_i to produce contribs_i, and a reduce + map over contribs_i yields ranks_{i+1}.
// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}
This program leads to the RDD lineage graph in Figure 3. On each iteration, we create a new ranks dataset based on the contribs and ranks from the previous iteration and the static links dataset.6 One interesting feature of this graph is that it grows longer with the number of iterations.
6. Note that although RDDs are immutable, the variables ranks and contribs in the program point to different RDDs on each iteration.
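To check the update rule against the code, here is a plain-Scala sketch of one PageRank iteration on a toy three-page graph (the graph and α = 0.15 are illustrative assumptions; pages with no in-links are ignored for brevity):

val links = Map("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))
var ranks = links.keys.map(_ -> 1.0 / links.size).toMap   // start uniform at 1/N
val alpha = 0.15
val n = links.size
// each page sends rank/outdegree to every neighbor...
val contribs = links.toSeq.flatMap { case (url, outs) =>
  outs.map(dest => (dest, ranks(url) / outs.size))
}
// ...and each page's new rank is alpha/N + (1 - alpha) * (sum of received contributions)
ranks = contribs.groupBy(_._1).map { case (url, cs) =>
  url -> (alpha / n + (1 - alpha) * cs.map(_._2).sum)
}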
example: pagerank setting
N documents that contain links to other documents (e.g., webpages)
pagerank iteratively updates a rank score for each document by
adding up contributions from documents that link to it
iteration: each document sends a contribution of rank/n to its neighbors
rank: own document rank, n: number of neighbors
updates its rank to α/N + (1-α)Σci
ci: contribution received
example: pagerank (Spark code above)
example: pagerank - lineage (Figure 3 above)
representing rdds
internal information about rdds:
partitions & partitioning scheme
dependencies on parent RDDs
function to compute it from parents
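The paper wraps this internal information in a small common interface; the trait below is a simplified, hedged rendering of it (the names approximate the paper's, and the supporting types are stubs):

trait Partition
trait Dependency
trait Partitioner

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                        // list of partitions
  def dependencies: Seq[Dependency]                     // parent RDDs and how they are used
  def iterator(p: Partition,
               parents: Seq[Iterator[_]]): Iterator[T]  // compute one partition from its parents
  def partitioner: Option[Partitioner]                  // partitioning scheme, if any
  def preferredLocations(p: Partition): Seq[String]     // data-locality hints (e.g., HDFS nodes)
}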
rdd dependencies
narrow dependencies: each partition of the parent rdd is used by at
most one partition of the child rdd
otherwise, wide dependencies
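A hedged sketch of how the distinction shows up in code (spark is a SparkContext; the data is made up):

val nums = spark.parallelize(1 to 1000000)
// narrow: each output partition depends on exactly one input partition,
// so filter and map can be pipelined within a partition
val evens = nums.filter(_ % 2 == 0).map(_ * 2)
// wide: groupByKey may need records from every input partition,
// so it forces a shuffle (and, as discussed below, a stage boundary)
val groups = evens.map(x => (x % 10, x)).groupByKey()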
rdd dependencies
[Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter; union; join with inputs co-partitioned. Wide dependencies: groupByKey; join with inputs not co-partitioned.]
map: The resulting RDD has the same partitions and preferred locations as its parent, but applies the function passed to map to the parent's records in its iterator method.
union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.7
sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.
join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).
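To illustrate the narrow-join case, a hedged sketch using Spark's HashPartitioner and partitionBy (the data is made up):

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)
// pre-partitioning both inputs with the same partitioner (and persisting them)
// lets the join use narrow dependencies, avoiding a shuffle
val users  = spark.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(p).persist()
val orders = spark.parallelize(Seq((1, 9.99), (2, 4.50))).partitionBy(p).persist()
val joined = users.join(orders)   // co-partitioned inputs: two narrow dependencies
// joining two un-partitioned RDDs instead would give two wide dependencies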
5 Implementation
We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.
Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.
We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).
5.1 Job Scheduling
Spark's scheduler uses our representation of RDDs, described in Section 4.
Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.
7. Note that our union operation does not drop duplicate values.
[Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3. (Operations shown: groupBy, map, union, and join over RDDs A through G, split into stages 1, 2, and 3.)]
Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute, as illustrated in Figure 5. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.
Our scheduler assigns tasks to machines based on data locality using delay scheduling [32]. If a task needs to process a partition that is available in memory on a node, we send it to that node. Otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.
For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.
If a task fails, we re-run it on another node as long as its stage's parents are still available. If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel. We do not yet tolerate scheduler failures, though replicating the RDD lineage graph would be straightforward.
Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.
scheduling
when an action is performed... (e.g., count() or save())
...the scheduler examines the lineage graph and builds a DAG of stages to execute
each stage is a maximal pipeline of transformations over narrow dependencies
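In current Spark you can inspect these stage boundaries directly: toDebugString prints an RDD's lineage, with shuffle (wide) dependencies marking stage boundaries (a sketch; the input path is made up):

val counts = spark.textFile("hdfs://.../docs.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // wide dependency: stage boundary
println(counts.toDebugString)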
memory management
when not enough memory: apply LRU eviction policy at rdd level
evict partition from least recently used rdd
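A minimal sketch of this policy (illustrative only, not Spark's actual code): track a last-access time per RDD and, when space is needed for a new partition, pick a victim from the least recently used RDD, skipping the RDD currently being loaded.

import scala.collection.mutable

class RddLru {
  private val lastUse = mutable.Map.empty[Int, Long]   // rddId -> last access time

  def touch(rddId: Int): Unit =
    lastUse(rddId) = System.nanoTime()

  // choose an RDD to evict a partition from, skipping the one being loaded into
  def victim(loadingInto: Int): Option[Int] = {
    val candidates = lastUse.filter { case (id, _) => id != loadingInto }
    if (candidates.isEmpty) None else Some(candidates.minBy(_._2)._1)
  }
}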
performance
logistic regression and k-means on amazon EC2
10 iterations on 100GB datasets, 100-node clusters
performance
In general, the read-only nature of RDDs makes them simpler to checkpoint than general shared memory. Because consistency is not a concern, RDDs can be written out in the background without requiring program pauses or distributed snapshot schemes.
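In current Spark this corresponds to the checkpoint API (a sketch; the paths are made up):

spark.setCheckpointDir("hdfs://.../checkpoints")
val deep = spark.textFile("hdfs://.../in.txt").map(_.length)  // imagine a long chain of steps
deep.checkpoint()   // saved to the checkpoint dir when first computed;
                    // its lineage back to parent RDDs is then truncated
deep.count()        // the action that triggers computation (and the checkpoint)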
6 Evaluation
We evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications. Overall, our results show the following:
• Spark outperforms Hadoop by up to 20× in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
• Applications written by our users perform and scale well. In particular, we used Spark to speed up an analytics report that was running on Hadoop by 40×.
• When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
• Spark can be used to query a 1 TB dataset interactively with latencies of 5–7 seconds.
We start by presenting benchmarks for iterative machine learning applications (§6.1) and PageRank (§6.2) against Hadoop. We then evaluate fault recovery in Spark (§6.3) and behavior when a dataset does not fit in memory (§6.4). Finally, we discuss results for user applications (§6.5) and interactive data mining (§6.6).
Unless otherwise noted, our tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks. Before each test, we cleared OS buffer caches to measure IO costs accurately.
6.1 Iterative Machine Learning Applications
We implemented two iterative machine learning applications, logistic regression and k-means, to compare the performance of the following systems:
• Hadoop: The Hadoop 0.20.2 stable release.
• HadoopBinMem: A Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration to eliminate text parsing in later ones, and stores it in an in-memory HDFS instance.
• Spark: Our implementation of RDDs.
We ran both algorithms for 10 iterations on 100 GB datasets using 25–100 machines. The key difference between the two applications is the amount of computation they perform per byte of data. The iteration time of k-means is dominated by computation, while logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.
Since typical learning algorithms need tens of iterations to converge, we report times for the first iteration and subsequent iterations separately. We find that sharing data via RDDs greatly speeds up future iterations.
[Figure 7: Duration of the first and later iterations in Hadoop, HadoopBinMem and Spark for logistic regression and k-means using 100 GB of data on a 100-node cluster. Iteration times (s), reconstructed from the chart:
  logistic regression, first iteration:  Hadoop 80,  HadoopBinMem 139, Spark 46
  logistic regression, later iterations: Hadoop 76,  HadoopBinMem 62,  Spark 3
  k-means, first iteration:              Hadoop 115, HadoopBinMem 182, Spark 82
  k-means, later iterations:             Hadoop 106, HadoopBinMem 87,  Spark 33]
[Figure 8: Running times for iterations after the first in Hadoop, HadoopBinMem, and Spark. The jobs all processed 100 GB. Iteration times (s) for 25 / 50 / 100 machines:
  (a) logistic regression: Hadoop 184 / 111 / 76; HadoopBinMem 116 / 80 / 62; Spark 15 / 6 / 3
  (b) k-means: Hadoop 274 / 157 / 106; HadoopBinMem 197 / 121 / 87; Spark 143 / 61 / 33]
First Iterations All three systems read text input from HDFS in their first iterations. As shown in the light bars in Figure 7, Spark was moderately faster than Hadoop across experiments. This difference was due to signaling overheads in Hadoop's heartbeat protocol between its master and workers. HadoopBinMem was the slowest because it ran an extra MapReduce job to convert the data to binary, and it had to write this data across the network to a replicated in-memory HDFS instance.
Subsequent Iterations Figure 7 also shows the average running times for subsequent iterations, while Figure 8 shows how these scaled with cluster size. For logistic regression, Spark was 25.3× and 20.7× faster than Hadoop and HadoopBinMem respectively on 100 machines. For the more compute-intensive k-means application, Spark still achieved a speedup of 1.9× to 3.2×.
Understanding the Speedup We were surprised to find that Spark outperformed even Hadoop with in-memory storage of binary data (HadoopBinMem) by a 20× margin. In HadoopBinMem, we had used Hadoop's standard binary format (SequenceFile) and a large block size of 256 MB, and we had forced HDFS's data directory to be on an in-memory file system. However, Hadoop still ran slower due to several factors:
1. Minimum overhead of the Hadoop software stack,
2. Overhead of HDFS while serving data, and
3. Deserialization cost to convert binary records to usable in-memory Java objects.
performance
[Example: logistic regression (2015 benchmark). Running time (s) vs. number of iterations (1, 5, 10, 20, 30): Hadoop takes about 110 s per iteration; Spark takes 80 s for the first iteration and about 1 s for each further iteration.]
summary
spark generalized map-reduce
tailored to iterative computation and interactive querying
simple programming model
centered on rdds
references
1. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI 2004.
2. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 2010.
3. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012.
4. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, 2015.
5. Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. "Bigtable: A Distributed Storage System for Structured Data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.
6. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." ACM SIGOPS Operating Systems Review 37.5 (2003).
next week: spark programming

spark programming
• creating rdds
• transformations
• actions
• lazy evaluation
• persistence
• passing custom functions
• working with key-value pairs
  – creation, transformations, actions
• advanced data partitioning
• global variables
  – accumulators (write-only)
  – broadcast (read-only)
• reading and writing data