Modern Database Systems Lecture 6
Aristides Gionis Michael Mathioudakis
Spring 2016
logistics
• tutorial on monday, TU6@2:15pm
• assignment 2 is out - due by march 14th
• for the programming part, check the updated tutorial
• a total of 5 late days are allowed
today
mapreduce & spark
as they were introduced, with emphasis on high-level concepts
introduction
intro recap
structured data, semi-structured data, text
query optimization vs flexibility of data model
disk access a central issue
indexing
now: big data
scale so big that new issues take front seat:
distributed, parallel computation
fault tolerance
how to accommodate those within a simple computational model?
remember this task from lecture 0...
data records that contain information about products viewed or purchased from an online store
task: for each pair of Games products, count the number of customers that have purchased both
Product | Category | Customer | Date | Price | Action | other...
Portal 2 | Games | Michael M. | 12/01/2015 | 10€ | Purchase
...
FLWR Plant Food | Garden | Aris G. | 19/02/2015 | 32€ | View
Chase the Rabbit | Games | Michael M. | 23/04/2015 | 1€ | View
Portal 2 | Games | Orestis K. | 13/05/2015 | 10€ | Purchase
...
what challenges does case B pose compared to case A?
hint: limited main memory, disk access, distributed setting
case A: 10,000 records (0.5MB per record, 5GB total disk space), 10GB of main memory
case B: 10,000,000 records (~5TB total disk space), stored across 100 nodes (50GB per node), 10GB of main memory per node
mapreduce
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
[email protected], [email protected]
Google, Inc.
Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
1 Introduction
Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.
As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis
appeared at the Symposium on Operating Systems Design & Implementation, 2004
some context
in the early 2000s, google was developing systems to accommodate storage and processing of big data volumes
google file system (gfs): “a scalable distributed file system for large distributed data-intensive applications”
“provides fault tolerance while running on inexpensive commodity hardware”
bigtable: “distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers”
mapreduce: “programming model and implementation for processing and generating large data sets”
motivation
hundreds of special-purpose computations over raw data: crawled webpages & documents, search & web request logs
inverted indexes, web graphs, document summaries, frequent queries
conceptually straightforward computation, however...
a lot of data, distributed over many machines: hundreds or thousands of machines...
a lot of practical issues arise that obscure the simplicity of the computation
at google in the early 2000s...
developed solution
programming model: simple
based on the map and reduce primitives found in functional languages (e.g., Lisp)
system
hides the messy details in a library: parallelization, fault-tolerance, data distribution, load balancing
mapreduce
programming model
system
programming model
input: a set of (key,value) pairs
computation: two functions, map and reduce, written by the user
output: a set of (key,value) pairs
map function
input: one (key,value) pair
output: a set of intermediate (key,value) pairs
mapreduce groups together pairs with the same key and passes them to the reduce function
map function
[diagram: each input (key, value) pair is passed to map, which emits a set of intermediate (key, value) pairs; the type of the intermediate keys/values generally differs from the type of the input keys/values]
reduce function
input: (key, list(values))
an intermediate key and the set of values for that key
list(values) is supplied as an iterator, convenient when there is not enough memory
output: list(values)
typically only 0 or 1 values are output per invocation
reduce function
[diagram: intermediate (key, value) pairs with the same key are grouped into (key, [value1, value2, ...]) and passed to a reduce invocation]
programming model
input: a set of (key,value) pairs
map: (key,value) → list( (key,value) )
reduce: (key, list(values)) → (key, list(values))
output: list( (key, list(values)) )
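To make the model concrete, here is a minimal single-machine sketch in Python (all names are illustrative, not part of any MapReduce library): it applies a user-supplied map function to every input pair, groups the intermediate pairs by key, and applies reduce to each group.

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, input_pairs):
        # map phase: each input (key, value) pair yields
        # intermediate (key, value) pairs
        groups = defaultdict(list)
        for key, value in input_pairs:
            for ikey, ivalue in map_fn(key, value):
                groups[ikey].append(ivalue)
        # reduce phase: each intermediate key is reduced together
        # with the list of all values emitted for it
        return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}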
example task
count the number of occurrences of each word in a collection of documents
input: a set of (key,value) pairs
key: document file location (id)
value: document contents (list of words)
how would you approach this?
map: (key,value) → list( (key,value) )
reduce: (key, list(values)) → (key, list(values))
example - solution
[diagram: map is applied to each document (doc1, doc2, doc3) and emits a (word, 1) pair for every word occurrence; the pairs are grouped by word and reduce sums the counts, e.g., (word1, [1,1,1,1]) → (word1, 4) and (word2, [1,1,1]) → (word2, 3)]
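As a runnable counterpart, here is the word-count solution expressed with the run_mapreduce sketch from above (wc_map, wc_reduce, and the sample documents are illustrative):

    def wc_map(doc_id, contents):
        # emit (word, 1) for every word occurrence in the document
        for word in contents.split():
            yield word, 1

    def wc_reduce(word, counts):
        # sum all counts emitted for this word
        return sum(counts)

    docs = [("doc1", "the cat sat"), ("doc2", "the cat")]
    print(run_mapreduce(wc_map, wc_reduce, docs))
    # {'the': 2, 'cat': 2, 'sat': 1}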
for a rewrite of our production indexing system. Section 7 discusses related and future work.
2 Programming Model
The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
2.1 Example
Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.
In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.
2.2 Types
Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

    map    (k1,v1)       → list(k2,v2)
    reduce (k2,list(v2)) → list(v2)

I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.
2.3 More Examples
Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.
Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
Count of URL Access Frequency: The map function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair.
Reverse Web-Link Graph: The map function outputs ⟨target, source⟩ pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair ⟨target, list(source)⟩.
Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of ⟨word, frequency⟩ pairs. The map function emits a ⟨hostname, term vector⟩ pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final ⟨hostname, term vector⟩ pair.
programming model - types
map: (key,value) → list( (key,value) )
reduce: (key, list(values)) → (key, list(values))
[diagram: the type of the input (key, value) pairs differs from the type of the intermediate (key, value) pairs; intermediate and output pairs have the same types]
more examples
grep: search a set of documents for a string pattern in a line
input: a set of (key,value) pairs
key: document file location (id)
value: document contents (lines of characters)
more examples
map: emits a (document file location, line) pair for each line that matches the pattern
reduce: identity function (see the sketch below)
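A minimal sketch of distributed grep in the style of the run_mapreduce simulation from earlier (the pattern and function names are illustrative):

    import re

    PATTERN = re.compile("error")  # illustrative pattern

    def grep_map(doc_id, contents):
        # emit (doc_id, line) for every line that matches the pattern
        for line in contents.splitlines():
            if PATTERN.search(line):
                yield doc_id, line

    def grep_reduce(doc_id, lines):
        # identity: copy the matching lines through unchanged
        return lines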
more examples
count of URL access frequency
process logs of web page requests
logs are stored in documents, one line per request; each line contains the URL of the requested page
input: a set of (key,value) pairs
key: log file location
value: log contents (lines of requests)
more examples
map: process logs of web page requests, output (URL, 1) pairs
reduce: add together the counts for the same URL
more examples
reverse web-link graph
process a set of webpages
for each target webpage, find the list of all source webpages that contain a link to target
input: a set of (key,value) pairs
key: webpage URL
value: webpage contents (html)
more examples
map: output a (target, source) pair for each link to a target URL found in a page named source
reduce: concatenate the list of sources per target, output (target, list(source)) pairs (see the sketch below)
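A sketch of the reverse web-link graph in the same style; the crude regular expression for extracting links is illustrative, not production HTML parsing:

    import re

    def links_map(source_url, html):
        # emit (target, source) for every link found in the page
        for target in re.findall(r'href="([^"]+)"', html):
            yield target, source_url

    def links_reduce(target, sources):
        # all pages that link to this target
        return sources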
more examples
term vector per host
process a collection of webpages
each webpage has a URL of the form [host]/[page address], e.g. http://www.aalto.fi/en/current/news/2016-03-02/
find a term vector per host
input: a set of (key,value) pairs
key: webpage URL
value: webpage contents (html-stripped text)
more examples
map: emit a (hostname, term vector) pair for each webpage; the hostname is extracted from the document URL
reduce: adds the per-document term vectors together and emits one final (hostname, term vector) pair per hostname
more examples
simple inverted index (no counts)
process a collection of documents to construct an inverted index
for each word, keep a list of the documents in which it occurs
input: a set of (key,value) pairs
key: document file location (id)
value: document contents (list of words)
more examples
map: parse each document, emit a sequence of (word, document ID) pairs
reduce: output a (word, list(document ID)) pair for each word (see the sketch below)
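A sketch of the simple inverted index in the same style (names are illustrative); reduce sorts and de-duplicates the posting list:

    def index_map(doc_id, contents):
        # emit (word, doc_id) for every word occurrence
        for word in contents.split():
            yield word, doc_id

    def index_reduce(word, doc_ids):
        # sorted, de-duplicated list of documents containing this word
        return sorted(set(doc_ids))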
system
at google (back in 2004): large clusters of commodity PCs, connected with switched ethernet
dual-processor x86, linux, 2-4GB of memory per machine
100 Mbit/s or 1 Gbit/s network, 100's or 1000's of machines per cluster
storage: inexpensive IDE disks attached to the machines
google file system (GFS) - uses replication
users submit jobs to a scheduling system
execution
a job is submitted, then what?
map and reduce invocations are distributed over machines
input data is automatically partitioned into a set of M splits
each of the M splits is fed into a map instance
intermediate results are partitioned into R partitions according to a hash function provided by the user (see the sketch below)
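A one-line sketch of the default partitioning scheme (hashing the intermediate key modulo R); the names are illustrative:

    R = 4  # number of reduce partitions (illustrative)

    def partition(intermediate_key):
        # assign an intermediate key to one of the R reduce partitions
        return hash(intermediate_key) % R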
execution
[Figure 1: Execution overview - (1) the user program forks a master and workers; (2) the master assigns map and reduce tasks; (3) map workers read input splits; (4) map workers write intermediate files to local disk; (5) reduce workers remote-read the intermediate files; (6) reduce workers write the output files]
Inverted Index: The map function parses each document, and emits a sequence of ⟨word, document ID⟩ pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a ⟨word, list(document ID)⟩ pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
Distributed Sort: The map function extracts the key from each record, and emits a ⟨key, record⟩ pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.
3 Implementation
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines. This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:
(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.
(2) Commodity networking hardware is used - typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.
(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.
3.1 Execution Overview
The Map invocations are distributed across multiple machines by automatically partitioning the input data
(1) split input files into M pieces (16-64MB each) and fork many copies of the user program
(2) master assigns M + R tasks to idle workers
(3) a worker assigned to a map task reads the corresponding split, passes the input data to the map function, and stores intermediate results in memory
(4) periodically, buffered intermediate results are written to local disk, into R partitions, according to the hash function; their locations are passed to the master
(5) the master notifies the reduce workers; a reduce worker collects the intermediate data for one partition from the local disks of the map workers and sorts it by intermediate key
(6) the reduce worker passes each intermediate key and the corresponding values to the reduce function; the output is appended to the file for this reduce partition
(7) after all tasks are completed, the master wakes up the user program
final output: R files
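Putting the steps together, here is a toy single-process simulation of the execution flow (no cluster, no failures; everything is illustrative), reusing the map/reduce function style from earlier:

    from collections import defaultdict

    def execute(map_fn, reduce_fn, inputs, M=3, R=2):
        # (1) partition the input into M splits
        splits = [inputs[i::M] for i in range(M)]
        # (3)-(4) each "map worker" processes one split and hash-partitions
        # its intermediate pairs into R regions
        regions = [defaultdict(list) for _ in range(R)]
        for split in splits:
            for key, value in split:
                for ikey, ivalue in map_fn(key, value):
                    regions[hash(ikey) % R][ikey].append(ivalue)
        # (5)-(6) each "reduce worker" sorts its region by intermediate key
        # and applies reduce; (7) the final output is R result sets ("files")
        return [
            {ikey: reduce_fn(ikey, region[ikey]) for ikey in sorted(region)}
            for region in regions
        ]

    # e.g., execute(wc_map, wc_reduce, docs) returns R dictionaries whose
    # union equals the single-machine word-count result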
master data structures
state for each map & reduce task: idle, in-progress, or completed
+ identity of the assigned worker
for each completed map task: locations and sizes of the R intermediate file regions
received as map tasks are completed, pushed incrementally to reduce workers with in-progress tasks (see the sketch below)
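A minimal reconstruction of this bookkeeping as a Python dataclass (purely illustrative; the paper does not specify the actual structures):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class TaskState:
        status: str = "idle"          # idle / in-progress / completed
        worker: Optional[str] = None  # identity of the assigned worker
        # for completed map tasks: locations and sizes of the R
        # intermediate file regions on the worker's local disk
        regions: list = field(default_factory=list)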
fault tolerance
worker failure: the master pings each worker periodically
if no response, the worker is considered failed
completed map tasks are reset to idle (why?)
in-progress tasks are set to idle
idle tasks: up for grabs by other workers
fault tolerance
master failure
the master writes periodic checkpoints of the master data structures (state)
a new master re-starts from the last checkpoint
“stragglers”
tasks that take too long to complete
solution: when a mapreduce operation is close to completion, schedule backup executions of the remaining tasks
locality
master tries to assign tasks to nodes that contain a replica of the input data
task granularity
M map tasks and R reduce tasks
ideally, M and R should be much larger than the number of workers
why? load-balancing & speedy recovery
ordering guarantees
within each partition, intermediate key/value pairs are processed in increasing key order
makes it easy to generate a sorted output file per partition (why?)
combiner functions: an optional user-defined function
executed on the machines that perform map tasks; "combines" results before they are passed to the reducer
what would the combiner be for the word-count example?
typically the combiner is the same as the reducer; the only difference is the output:
the reducer writes to the final output, the combiner writes to intermediate output (see the sketch below)
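A sketch of map-side combining in the style of the earlier simulation (map_with_combiner is an illustrative name); for word count, the combiner is simply wc_reduce itself:

    from collections import defaultdict

    def map_with_combiner(map_fn, combine_fn, key, value):
        # apply map, then pre-aggregate the intermediate pairs locally
        # before they would be written to intermediate files
        local = defaultdict(list)
        for ikey, ivalue in map_fn(key, value):
            local[ikey].append(ivalue)
        for ikey, ivalues in local.items():
            yield ikey, combine_fn(ikey, ivalues)

    # map_with_combiner(wc_map, wc_reduce, "doc1", "the cat and the hat")
    # emits ('the', 2), ('cat', 1), ('and', 1), ('hat', 1)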
counters
objects updated within map and reduce functions
periodically propagated to the master
useful for debugging
counters - example
    Counter* uppercase;
    uppercase = GetCounter("uppercase");

    map(String name, String contents):
        for each word w in contents:
            if (IsCapitalized(w)):
                uppercase->Increment();
            EmitIntermediate(w, "1");
The counter values from individual worker machines are periodically propagated to the master (piggybacked on the ping response). The master aggregates the counter values from successful map and reduce tasks and returns them to the user code when the MapReduce operation is completed. The current counter values are also displayed on the master status page so that a human can watch the progress of the live computation. When aggregating counter values, the master eliminates the effects of duplicate executions of the same map or reduce task to avoid double counting. (Duplicate executions can arise from our use of backup tasks and from re-execution of tasks due to failures.)
Some counter values are automatically maintained by the MapReduce library, such as the number of input key/value pairs processed and the number of output key/value pairs produced.
Users have found the counter facility useful for sanity checking the behavior of MapReduce operations. For example, in some MapReduce operations, the user code may want to ensure that the number of output pairs produced exactly equals the number of input pairs processed, or that the fraction of German documents processed is within some tolerable fraction of the total number of documents processed.
5 Performance
In this section we measure the performance of MapReduce on two computations running on a large cluster of machines. One computation searches through approximately one terabyte of data looking for a particular pattern. The other computation sorts approximately one terabyte of data.
These two programs are representative of a large subset of the real programs written by users of MapReduce - one class of programs shuffles data from one representation to another, and another class extracts a small amount of interesting data from a large data set.
5.1 Cluster Configuration
All of the programs were executed on a cluster that consisted of approximately 1800 machines. Each machine had two 2GHz Intel Xeon processors with Hyper-Threading enabled, 4GB of memory, two 160GB IDE
[Figure 2: Data transfer rate over time - input scan rate (MB/s) against elapsed seconds]
disks, and a gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. All of the machines were in the same hosting facility and therefore the round-trip time between any pair of machines was less than a millisecond.
Out of the 4GB of memory, approximately 1-1.5GB was reserved by other tasks running on the cluster. The programs were executed on a weekend afternoon, when the CPUs, disks, and network were mostly idle.
5.2 Grep
The grep program scans through 10^10 100-byte records, searching for a relatively rare three-character pattern (the pattern occurs in 92,337 records). The input is split into approximately 64MB pieces (M = 15000), and the entire output is placed in one file (R = 1).
Figure 2 shows the progress of the computation over time. The Y-axis shows the rate at which the input data is scanned. The rate gradually picks up as more machines are assigned to this MapReduce computation, and peaks at over 30 GB/s when 1764 workers have been assigned. As the map tasks finish, the rate starts dropping and hits zero about 80 seconds into the computation. The entire computation takes approximately 150 seconds from start to finish. This includes about a minute of startup overhead. The overhead is due to the propagation of the program to all worker machines, and delays interacting with GFS to open the set of 1000 input files and to get the information needed for the locality optimization.
5.3 Sort
The sort program sorts 10^10 100-byte records (approximately 1 terabyte of data). This program is modeled after the TeraSort benchmark [10].
The sorting program consists of less than 50 lines of user code. A three-line Map function extracts a 10-byte sorting key from a text line and emits the key and the
performance
1800 machines
each machine: two 2GHz Xeon processors, 4GB of memory (2.5-3GB available), two 160GB disks, gigabit Ethernet
performance: grep
10^10 100-byte records
search for a pattern found in <10^5 records
M = 15000, R = 1
150 seconds from start to finish
exercise: today, how big a file would you grep on one machine in 150 seconds?
performance: sort
10^10 100-byte records
extract a 10-byte sorting key from each record (line)
M = 15000, R = 4000
850 seconds from start to finish
exercise: how would you implement sort?
summary
original mapreduce paper
simple programming model based on functional language primitives
system takes care of scheduling and fault-tolerance
great impact on cluster computing
hadoop
mapreduce and hadoop
mapreduce is implemented in apache hadoop
an open-source software ecosystem for distributed data storage and processing
hadoop
[diagram: hadoop components - common; hdfs (hadoop distributed filesystem); yarn (scheduling & resource management); mapreduce]
hadoop
[diagram: hadoop ecosystem - common; hdfs (hadoop distributed filesystem); yarn (scheduling & resource management); mapreduce; plus related projects: mahout (machine learning library), hive (data warehouse, sql-like querying), pig (data-flow language and system for parallel computation), spark (cluster-computing engine), and a lot of other projects!]
spark
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley
Abstract
MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
1 Introduction
A new model of cluster computing has become widely popular, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing. MapReduce [11] pioneered this model, while systems like Dryad [17] and Map-Reduce-Merge [24] generalized the types of data flows supported. These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention.
While this data flow programming model is useful for a large class of applications, there are applications that cannot be expressed efficiently as acyclic data flows. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes two use cases where we have seen Hadoop users report that MapReduce is deficient:
• Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.
• Interactive analytics: Hadoop is often used to run ad-hoc exploratory queries on large datasets, through SQL interfaces such as Pig [21] and Hive [1]. Ideally, a user would be able to load a dataset of interest into memory across a number of machines and query it repeatedly. However, with Hadoop, each query incurs significant latency (tens of seconds) because it runs as a separate MapReduce job and reads data from disk.
This paper presents a new cluster computing framework called Spark, which supports applications with working sets while providing similar scalability and fault tolerance properties to MapReduce.
The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition. Although RDDs are not a general shared memory abstraction, they represent a sweet-spot between expressivity on the one hand and scalability and reliability on the other hand, and we have found them well-suited for a variety of applications.
Spark is implemented in Scala [5], a statically typed high-level programming language for the Java VM, and exposes a functional programming interface similar to DryadLINQ [25]. In addition, Spark can be used interactively from a modified version of the Scala interpreter, which allows the user to define RDDs, functions, variables and classes and use them in parallel operations on a cluster. We believe that Spark is the first system to allow an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster.
Although our implementation of Spark is still a prototype, early experience with the system is encouraging. We show that Spark can outperform Hadoop by 10x in iterative machine learning workloads and can be used interactively to scan a 39 GB dataset with sub-second latency.
This paper is organized as follows. Section 2 describes
appeared at HotCloud, 2010
appeared at the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley
Abstract
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
1 Introduction
Cluster computing frameworks like MapReduce [10] and Dryad [19] have been widely adopted for large-scale data analytics. These systems let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance.
Although current frameworks provide numerous abstractions for accessing a cluster's computational resources, they lack abstractions for leveraging distributed memory. This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations. Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. Another compelling use case is interactive data mining, where a user runs multiple ad-hoc queries on the same subset of the data. Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.
Recognizing this problem, researchers have developed specialized frameworks for some applications that require data reuse. For example, Pregel [22] is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop [7] offers an iterative MapReduce interface. However, these frameworks only support specific computation patterns (e.g., looping a series of MapReduce steps), and perform data sharing implicitly for these patterns. They do not provide abstractions for more general reuse, e.g., to let a user load several datasets into memory and run ad-hoc queries across them.
In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.
In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.¹ If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute
¹ Checkpointing the data in some RDDs may be useful when a lineage chain grows large, however, and we discuss how to do it in §5.4.
why not mapreduce?
mapreduce flows are acyclic
not efficient for some applications
why not mapreduce?
iterative jobs: many common machine learning algorithms
repeatedly apply the same function to the same dataset (e.g., gradient descent)
mapreduce repeatedly reloads (reads & writes) the data
why not mapreduce?
interactive analytics: load data in memory and query it repeatedly
mapreduce would re-read the data
spark’s proposal
generalize the mapreduce model to accommodate such applications
allow us to treat data as available across repeated queries and updates
resilient distributed datasets (rdds)
resilient distributed datasets (rdd)
read-only collection of objects partitioned across machines
users can explicitly cache rdds in memory
and re-use them across mapreduce-like parallel operations (see the sketch below)
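A small PySpark sketch (assuming a local Spark installation; the file path and filters are illustrative) of an RDD that is cached and then reused by several parallel operations, in the spirit of the paper's log mining example:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    # an RDD: a read-only, partitioned collection of objects
    errors = sc.textFile("logs.txt").filter(lambda line: "ERROR" in line)

    errors.cache()  # explicitly keep this working set in memory

    # the cached RDD is reused across multiple parallel operations
    print(errors.count())
    print(errors.filter(lambda line: "timeout" in line).count())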
main challenge
efficient fault-tolerance
to treat data as available in-memory, it should be easy to re-build if part of the data (e.g., a partition) is lost
achieved through coarse-grained transformations and lineage
michael mathioudakis 68
fault-tolerance
coarse transformations
e.g., map operations applied to many (even all) data items
lineage
the series of transformations that led to a dataset
if a partition is lost, there is enough information to re-apply the transformations and re-compute it (see the snippet below)
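Continuing the PySpark sketch above, the lineage of an RDD can be inspected with toDebugString (the exact output format varies across Spark versions): if a partition of errors is lost, Spark re-reads the corresponding block of logs.txt and re-applies the filter to rebuild just that partition.

    # lineage: the chain of transformations that produced the RDD
    print(errors.toDebugString())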
programming model
developers write a driver program: high-level control flow
think of rdds as 'variables' that represent datasets
on which you apply parallel operations
can also use restricted types of shared variables (see the sketch below)
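A minimal driver-program sketch using one of Spark's restricted shared variables, a broadcast variable (continuing the SparkContext from above; the data is illustrative):

    # the driver defines RDDs and invokes parallel operations on them
    stopwords = sc.broadcast({"the", "a", "of"})  # read-only shared variable

    words = sc.parallelize(["the", "cat", "sat", "of", "mat"])
    content_words = words.filter(lambda w: w not in stopwords.value)
    print(content_words.collect())  # ['cat', 'sat', 'mat']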
spark runtime
Figure 2: Spark runtime. The user's driver program launches multiple workers, which read data blocks from a distributed file system and can persist computed RDD partitions in memory.
…schedule tasks based on data locality to improve performance. Second, RDDs degrade gracefully when there is not enough memory to store them, as long as they are only being used in scan-based operations. Partitions that do not fit in RAM can be stored on disk and will provide similar performance to current data-parallel systems.
2.4 Applications Not Suitable for RDDs
As discussed in the Introduction, RDDs are best suited for batch applications that apply the same operation to all elements of a dataset. In these cases, RDDs can efficiently remember each transformation as one step in a lineage graph and can recover lost partitions without having to log large amounts of data. RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler. For these applications, it is more efficient to use systems that perform traditional update logging and data checkpointing, such as databases, RAMCloud [25], Percolator [26] and Piccolo [27]. Our goal is to provide an efficient programming model for batch analytics and leave these asynchronous applications to specialized systems.
3 Spark Programming Interface
Spark provides the RDD abstraction through a language-integrated API similar to DryadLINQ [31] in Scala [2], a statically typed functional programming language for the Java VM. We chose Scala due to its combination of conciseness (which is convenient for interactive use) and efficiency (due to static typing). However, nothing about the RDD abstraction requires a functional language.
To use Spark, developers write a driver program that connects to a cluster of workers, as shown in Figure 2. The driver defines one or more RDDs and invokes actions on them. Spark code on the driver also tracks the RDDs' lineage. The workers are long-lived processes that can store RDD partitions in RAM across operations.
As we showed in the log mining example in Section 2.2.1, users provide arguments to RDD operations like map by passing closures (function literals). Scala represents each closure as a Java object, and these objects can be serialized and loaded on another node to pass the closure across the network. Scala also saves any variables bound in the closure as fields in the Java object. For example, one can write code like var x = 5; rdd.map(_ + x) to add 5 to each element of an RDD.5
RDDs themselves are statically typed objects parametrized by an element type. For example, RDD[Int] is an RDD of integers. However, most of our examples omit types since Scala supports type inference.
Although our method of exposing RDDs in Scala is conceptually simple, we had to work around issues with Scala's closure objects using reflection [33]. We also needed more work to make Spark usable from the Scala interpreter, as we shall discuss in Section 5.2. Nonetheless, we did not have to modify the Scala compiler.
3.1 RDD Operations in Spark
Table 2 lists the main RDD transformations and actions available in Spark. We give the signature of each operation, showing type parameters in square brackets. Recall that transformations are lazy operations that define a new RDD, while actions launch a computation to return a value to the program or write data to external storage.
Note that some operations, such as join, are only available on RDDs of key-value pairs. Also, our function names are chosen to match other APIs in Scala and other functional languages; for example, map is a one-to-one mapping, while flatMap maps each input value to one or more outputs (similar to the map in MapReduce).
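To make the map/flatMap distinction concrete, a small sketch with invented toy data (not part of the paper excerpt):

val docs = sc.parallelize(Seq("to be", "or not"))

docs.map(_.split(" ")).collect()
// one array per input line: Array(Array("to","be"), Array("or","not"))

docs.flatMap(_.split(" ")).collect()
// inputs flattened into words: Array("to", "be", "or", "not")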
In addition to these operators, users can ask for an RDD to persist. Furthermore, users can get an RDD's partition order, which is represented by a Partitioner class, and partition another dataset according to it. Operations such as groupByKey, reduceByKey and sort automatically result in a hash or range partitioned RDD.
3.2 Example Applications
We complement the data mining example in Section 2.2.1 with two iterative applications: logistic regression and PageRank. The latter also showcases how control of RDDs' partitioning can improve performance.
3.2.1 Logistic Regression
Many machine learning algorithms are iterative in nature because they run iterative optimization procedures, such as gradient descent, to maximize a function. They can thus run much faster by keeping their data in memory.
As an example, the following program implements logistic regression [14], a common classification algorithm
5We save each closure at the time it is created, so that the map in this example will always add 5 even if x changes.
michael mathioudakis 71
rdd: read-only collection of objects partitioned across a set of machines, that can be re-built if a partition is lost
constructed in the following ways:
from a file in a shared file system (e.g., hdfs)
parallelizing a collection (e.g., an array): divide into partitions and send to multiple nodes
transforming an existing rdd, e.g., applying a map operation
changing the persistence of an existing rdd: hint to cache the rdd or save it to the filesystem
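One line per construction method, as a sketch (paths and values invented; `sc` is the SparkContext):

val fromFile    = sc.textFile("hdfs://.../data.txt")      // 1. from a shared file system
val fromArray   = sc.parallelize(Array(1, 2, 3, 4), 4)    // 2. parallelize a collection into 4 partitions
val transformed = fromArray.map(_ * 2)                    // 3. transform an existing rdd
val persisted   = transformed.persist()                   // 4. change persistence (cache hint)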
michael mathioudakis 72
rdd
need not exist physically at all times; instead, there is enough information to compute the rdd
rdds are lazily-created and ephemeral
lazy: materialized only when information is extracted from them (through actions!)
ephemeral: discarded after use
michael mathioudakis 73
transformations and actions
transformations
lazy operations that define a new rdd
actions
launch a computation on an rdd to return a value to the program or write data to external storage
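The laziness is observable in a sketch like the following (invented data; saveAsTextFile is the released-Spark name for the save action listed in Table 2):

val squares = sc.parallelize(1 to 10).map(x => x * x)
// transformation: defines a new rdd, no cluster work happens yet

val total = squares.reduce(_ + _)
// action: triggers the computation and returns 385 to the driver

squares.saveAsTextFile("hdfs://.../squares")
// action: writes the dataset to external storage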
michael mathioudakis 74
shared variables
broadcast variables
read-only variables, sent to all workers
typical use-case
large read-only piece of data (e.g., a lookup table) that is used across multiple parallel operations
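For example (table contents invented), a broadcast variable ships the lookup table to each worker once, rather than inside every closure:

// small read-only lookup table, shipped to all workers once
val countryNames = sc.broadcast(Map("fi" -> "Finland", "gr" -> "Greece"))

val codes = sc.parallelize(Seq("fi", "gr", "fi"))
val named = codes.map(c => countryNames.value.getOrElse(c, "unknown")).collect()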
michael mathioudakis 75
shared variables
accumulators
write-only variables that workers can update, using an operation that is commutative and associative
only the driver can read them
typical use-case: counters
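A counter sketch in the Spark 1.x API of the period (the malformed-line condition is invented):

// write-only from the workers' side; updates must be commutative and associative
val badLines = sc.accumulator(0)

sc.textFile("hdfs://.../input.txt").foreach { line =>
  if (line.split('\t').length < 3) badLines += 1   // workers only add
}
println(badLines.value)   // only the driver reads the result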
michael mathioudakis 76
example: text search
suppose that a web service is experiencing errors and you want to search over terabytes of logs to find the cause
the logs are stored in the Hadoop Filesystem (HDFS)
errors are written in the logs as lines that start with the keyword "ERROR"
michael mathioudakis 77
example: text search
michael mathioudakis 78
Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations:
lines → filter(_.startsWith("ERROR")) → errors → filter(_.contains("HDFS")) → HDFS errors → map(_.split('\t')(3)) → time fields
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure.
At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:
errors.count()
The user can also perform further transformations on the RDD and use their results, as in the following lines:
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM. This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).
Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.
Aspect                      | RDDs                                        | Distr. Shared Mem.
Reads                       | Coarse- or fine-grained                     | Fine-grained
Writes                      | Coarse-grained                              | Fine-grained
Consistency                 | Trivial (immutable)                         | Up to app / runtime
Fault recovery              | Fine-grained and low-overhead using lineage | Requires checkpoints and program rollback
Straggler mitigation        | Possible using backup tasks                 | Difficult
Work placement              | Automatic based on data locality            | Up to app (runtimes aim for transparency)
Behavior if not enough RAM  | Similar to existing data flow systems       | Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.
2.3 Advantages of the RDD Model
To understand the benefits of RDDs as a distributed memory abstraction, we compare them against distributed shared memory (DSM) in Table 1. In DSM systems, applications read and write to arbitrary locations in a global address space. Note that under this definition, we include not only traditional shared memory systems [24], but also other systems where applications make fine-grained writes to shared state, including Piccolo [27], which provides a shared DHT, and distributed databases. DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.
The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained transformations, while DSM allows reads and writes to each memory location.3 This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage.4 Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.
A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce [10]. Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other's updates.
Finally, RDDs provide two other benefits over DSM. First, in bulk operations on RDDs, a runtime can schedule tasks based on data locality to improve performance.
3Note that reads on RDDs can still be fine-grained. For example, an application can treat an RDD as a large read-only lookup table.
4In some applications, it can still help to checkpoint RDDs with long lineage chains, as we discuss in Section 5.4. However, this can be done in the background because RDDs are immutable, and there is no need to take a snapshot of the whole application as in DSM.
in Scala...
lines and errors are rdds: lines is created from a file, errors via a transformation
persist is a hint: keep in memory!
no work on the cluster so far
count is an action! note that lines is not loaded to ram!
example - text search ctd.
let us find errors related to “MySQL”
michael mathioudakis 79
example - text search ctd.
michael mathioudakis 80
a transformation (filter), followed by an action (count)
example - text search ctd. again
let us find errors related to "HDFS" and extract their time field
assuming time is field no. 3 in tab-separated format
michael mathioudakis 81
example - text search ctd. again
michael mathioudakis 82
two transformations (filter, map), followed by an action (collect)
example: text search - lineage of time fields
michael mathioudakis 83
errors is cached; the last two transformations are pipelined
if a partition of errors is lost, the filter is applied to only the corresponding partition of lines
transformations and actions
Transformations
  map(f: T => U)                 : RDD[T] => RDD[U]
  filter(f: T => Bool)           : RDD[T] => RDD[T]
  flatMap(f: T => Seq[U])        : RDD[T] => RDD[U]
  sample(fraction: Float)        : RDD[T] => RDD[T]   (deterministic sampling)
  groupByKey()                   : RDD[(K, V)] => RDD[(K, Seq[V])]
  reduceByKey(f: (V, V) => V)    : RDD[(K, V)] => RDD[(K, V)]
  union()                        : (RDD[T], RDD[T]) => RDD[T]
  join()                         : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (V, W))]
  cogroup()                      : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (Seq[V], Seq[W]))]
  crossProduct()                 : (RDD[T], RDD[U]) => RDD[(T, U)]
  mapValues(f: V => W)           : RDD[(K, V)] => RDD[(K, W)]   (preserves partitioning)
  sort(c: Comparator[K])         : RDD[(K, V)] => RDD[(K, V)]
  partitionBy(p: Partitioner[K]) : RDD[(K, V)] => RDD[(K, V)]

Actions
  count()                : RDD[T] => Long
  collect()              : RDD[T] => Seq[T]
  reduce(f: (T, T) => T) : RDD[T] => T
  lookup(k: K)           : RDD[(K, V)] => Seq[V]   (on hash/range partitioned RDDs)
  save(path: String)     : outputs RDD to a storage system, e.g., HDFS

Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.
that searches for a hyperplane w that best separates two sets of points (e.g., spam and non-spam emails). The algorithm uses gradient descent: it starts w at a random value, and on each iteration, it sums a function of w over the data to move w in a direction that improves it.
val points = spark.textFile(...)
                  .map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w dot p.x)))-1) * p.y
  }.reduce((a,b) => a+b)
  w -= gradient
}
We start by defining a persistent RDD called points as the result of a map transformation on a text file that parses each line of text into a Point object. We then repeatedly run map and reduce on points to compute the gradient at each step by summing a function of the current w. Keeping points in memory across iterations can yield a 20× speedup, as we show in Section 6.1.
3.2.2 PageRank
A more complex pattern of data sharing occurs in PageRank [6]. The algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors. It then updates its rank to α/N + (1 − α)Σc_i, where the sum is over the contributions it received and N is the total number of documents. We can write PageRank in Spark as follows:
Figure 3: Lineage graph for datasets in PageRank. The input file is mapped to links and ranks0; each iteration joins links with ranks_i to produce contribs_i, and a reduce + map over contribs_i yields ranks_{i+1}.
// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}
This program leads to the RDD lineage graph in Figure 3. On each iteration, we create a new ranks dataset based on the contribs and ranks from the previous iteration and the static links dataset.6 One interesting feature of this graph is that it grows longer with the number of iterations.
6. Note that although RDDs are immutable, the variables ranks and contribs in the program point to different RDDs on each iteration.
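To check the update rule against the code, here is a plain-Scala sketch of one PageRank iteration on a toy three-page graph (the graph and α = 0.15 are illustrative assumptions; pages with no in-links are ignored for brevity):

val links = Map("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))
var ranks = links.keys.map(_ -> 1.0 / links.size).toMap   // start uniform at 1/N
val alpha = 0.15
val n = links.size
// each page sends rank/outdegree to every neighbor...
val contribs = links.toSeq.flatMap { case (url, outs) =>
  outs.map(dest => (dest, ranks(url) / outs.size))
}
// ...and each page's new rank is alpha/N + (1 - alpha) * (sum of received contributions)
ranks = contribs.groupBy(_._1).map { case (url, cs) =>
  url -> (alpha / n + (1 - alpha) * cs.map(_._2).sum)
}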
example: pagerank setting
N documents that contain links to other documents (e.g., webpages)
pagerank iteratively updates a rank score for each document by
adding up contributions from documents that link to it
iteration: each document sends a contribution of rank/n to its neighbors
rank: own document rank, n: number of neighbors
updates its rank to α/N + (1-α)Σci
ci: contribution received
example: pagerank (Spark code above)
example: pagerank - lineage (Figure 3 above)
representing rdds
internal information about rdds:
partitions & partitioning scheme
dependencies on parent RDDs
function to compute it from parents
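The paper wraps this internal information in a small common interface; the trait below is a simplified, hedged rendering of it (the names approximate the paper's, and the supporting types are stubs):

trait Partition
trait Dependency
trait Partitioner

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                        // list of partitions
  def dependencies: Seq[Dependency]                     // parent RDDs and how they are used
  def iterator(p: Partition,
               parents: Seq[Iterator[_]]): Iterator[T]  // compute one partition from its parents
  def partitioner: Option[Partitioner]                  // partitioning scheme, if any
  def preferredLocations(p: Partition): Seq[String]     // data-locality hints (e.g., HDFS nodes)
}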
rdd dependencies
narrow dependencies: each partition of the parent rdd is used by at
most one partition of the child rdd
otherwise, wide dependencies
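A hedged sketch of how the distinction shows up in code (spark is a SparkContext; the data is made up):

val nums = spark.parallelize(1 to 1000000)
// narrow: each output partition depends on exactly one input partition,
// so filter and map can be pipelined within a partition
val evens = nums.filter(_ % 2 == 0).map(_ * 2)
// wide: groupByKey may need records from every input partition,
// so it forces a shuffle (and, as discussed below, a stage boundary)
val groups = evens.map(x => (x % 10, x)).groupByKey()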
rdd dependencies
[Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter; union; join with inputs co-partitioned. Wide dependencies: groupByKey; join with inputs not co-partitioned.]
map: The resulting RDD has the same partitions and preferred locations as its parent, but applies the function passed to map to the parent's records in its iterator method.
union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.7
sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.
join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).
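To illustrate the narrow-join case, a hedged sketch using Spark's HashPartitioner and partitionBy (the data is made up):

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)
// pre-partitioning both inputs with the same partitioner (and persisting them)
// lets the join use narrow dependencies, avoiding a shuffle
val users  = spark.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(p).persist()
val orders = spark.parallelize(Seq((1, 9.99), (2, 4.50))).partitionBy(p).persist()
val joined = users.join(orders)   // co-partitioned inputs: two narrow dependencies
// joining two un-partitioned RDDs instead would give two wide dependencies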
5 Implementation
We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.
Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.
We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).
5.1 Job Scheduling
Spark's scheduler uses our representation of RDDs, described in Section 4.
Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.
7. Note that our union operation does not drop duplicate values.
[Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3. (Operations shown: groupBy, map, union, and join over RDDs A through G, split into stages 1, 2, and 3.)]
Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute, as illustrated in Figure 5. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.
Our scheduler assigns tasks to machines based on data locality using delay scheduling [32]. If a task needs to process a partition that is available in memory on a node, we send it to that node. Otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.
For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.
If a task fails, we re-run it on another node as long as its stage's parents are still available. If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel. We do not yet tolerate scheduler failures, though replicating the RDD lineage graph would be straightforward.
Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.
scheduling
when an action is performed... (e.g., count() or save())
...the scheduler examines the lineage graph and builds a DAG of stages to execute
each stage is a maximal pipeline of transformations over narrow dependencies
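In current Spark you can inspect these stage boundaries directly: toDebugString prints an RDD's lineage, with shuffle (wide) dependencies marking stage boundaries (a sketch; the input path is made up):

val counts = spark.textFile("hdfs://.../docs.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // wide dependency: stage boundary
println(counts.toDebugString)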
memory management
when not enough memory: apply LRU eviction policy at rdd level
evict partition from least recently used rdd
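A minimal sketch of this policy (illustrative only, not Spark's actual code): track a last-access time per RDD and, when space is needed for a new partition, pick a victim from the least recently used RDD, skipping the RDD currently being loaded.

import scala.collection.mutable

class RddLru {
  private val lastUse = mutable.Map.empty[Int, Long]   // rddId -> last access time

  def touch(rddId: Int): Unit =
    lastUse(rddId) = System.nanoTime()

  // choose an RDD to evict a partition from, skipping the one being loaded into
  def victim(loadingInto: Int): Option[Int] = {
    val candidates = lastUse.filter { case (id, _) => id != loadingInto }
    if (candidates.isEmpty) None else Some(candidates.minBy(_._2)._1)
  }
}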
performance
logistic regression and k-means on amazon EC2
10 iterations on 100GB datasets, 100-node clusters
performance
In general, the read-only nature of RDDs makes them simpler to checkpoint than general shared memory. Because consistency is not a concern, RDDs can be written out in the background without requiring program pauses or distributed snapshot schemes.
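In current Spark this corresponds to the checkpoint API (a sketch; the paths are made up):

spark.setCheckpointDir("hdfs://.../checkpoints")
val deep = spark.textFile("hdfs://.../in.txt").map(_.length)  // imagine a long chain of steps
deep.checkpoint()   // saved to the checkpoint dir when first computed;
                    // its lineage back to parent RDDs is then truncated
deep.count()        // the action that triggers computation (and the checkpoint)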
6 Evaluation
We evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications. Overall, our results show the following:
• Spark outperforms Hadoop by up to 20× in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
• Applications written by our users perform and scale well. In particular, we used Spark to speed up an analytics report that was running on Hadoop by 40×.
• When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
• Spark can be used to query a 1 TB dataset interactively with latencies of 5–7 seconds.
We start by presenting benchmarks for iterative machine learning applications (§6.1) and PageRank (§6.2) against Hadoop. We then evaluate fault recovery in Spark (§6.3) and behavior when a dataset does not fit in memory (§6.4). Finally, we discuss results for user applications (§6.5) and interactive data mining (§6.6).
Unless otherwise noted, our tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks. Before each test, we cleared OS buffer caches to measure IO costs accurately.
6.1 Iterative Machine Learning Applications
We implemented two iterative machine learning applications, logistic regression and k-means, to compare the performance of the following systems:
• Hadoop: The Hadoop 0.20.2 stable release.
• HadoopBinMem: A Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration to eliminate text parsing in later ones, and stores it in an in-memory HDFS instance.
• Spark: Our implementation of RDDs.
We ran both algorithms for 10 iterations on 100 GB datasets using 25–100 machines. The key difference between the two applications is the amount of computation they perform per byte of data. The iteration time of k-means is dominated by computation, while logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.
Since typical learning algorithms need tens of iterations to converge, we report times for the first iteration and subsequent iterations separately. We find that sharing data via RDDs greatly speeds up future iterations.
[Figure 7: Duration of the first and later iterations in Hadoop, HadoopBinMem and Spark for logistic regression and k-means using 100 GB of data on a 100-node cluster. Iteration times (s), reconstructed from the chart:
  logistic regression, first iteration:  Hadoop 80,  HadoopBinMem 139, Spark 46
  logistic regression, later iterations: Hadoop 76,  HadoopBinMem 62,  Spark 3
  k-means, first iteration:              Hadoop 115, HadoopBinMem 182, Spark 82
  k-means, later iterations:             Hadoop 106, HadoopBinMem 87,  Spark 33]
[Figure 8: Running times for iterations after the first in Hadoop, HadoopBinMem, and Spark. The jobs all processed 100 GB. Iteration times (s) for 25 / 50 / 100 machines:
  (a) logistic regression: Hadoop 184 / 111 / 76; HadoopBinMem 116 / 80 / 62; Spark 15 / 6 / 3
  (b) k-means: Hadoop 274 / 157 / 106; HadoopBinMem 197 / 121 / 87; Spark 143 / 61 / 33]
First Iterations All three systems read text input from HDFS in their first iterations. As shown in the light bars in Figure 7, Spark was moderately faster than Hadoop across experiments. This difference was due to signaling overheads in Hadoop's heartbeat protocol between its master and workers. HadoopBinMem was the slowest because it ran an extra MapReduce job to convert the data to binary, and it had to write this data across the network to a replicated in-memory HDFS instance.
Subsequent Iterations Figure 7 also shows the average running times for subsequent iterations, while Figure 8 shows how these scaled with cluster size. For logistic regression, Spark was 25.3× and 20.7× faster than Hadoop and HadoopBinMem respectively on 100 machines. For the more compute-intensive k-means application, Spark still achieved a speedup of 1.9× to 3.2×.
Understanding the Speedup We were surprised to find that Spark outperformed even Hadoop with in-memory storage of binary data (HadoopBinMem) by a 20× margin. In HadoopBinMem, we had used Hadoop's standard binary format (SequenceFile) and a large block size of 256 MB, and we had forced HDFS's data directory to be on an in-memory file system. However, Hadoop still ran slower due to several factors:
1. Minimum overhead of the Hadoop software stack,
2. Overhead of HDFS while serving data, and
3. Deserialization cost to convert binary records to usable in-memory Java objects.
performance
[Example: logistic regression (2015 benchmark). Running time (s) vs. number of iterations (1, 5, 10, 20, 30): Hadoop takes about 110 s per iteration; Spark takes 80 s for the first iteration and about 1 s for each further iteration.]
summary
spark generalized map-reduce
tailored to iterative computation and interactive querying
simple programming model
centered on rdds
references
1. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI 2004.
2. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 2010.
3. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012.
4. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, 2015.
5. Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. "Bigtable: A Distributed Storage System for Structured Data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.
6. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." ACM SIGOPS Operating Systems Review 37.5 (2003).
next week: spark programming

spark programming
• creating rdds
• transformations
• actions
• lazy evaluation
• persistence
• passing custom functions
• working with key-value pairs
  – creation, transformations, actions
• advanced data partitioning
• global variables
  – accumulators (write-only)
  – broadcast (read-only)
• reading and writing data