Scalable Data Analysis (CIS 602-02), Fall 2015
Lecture 18: General Cluster Computing
Dr. David Koop

(Transcript of dkoop/cis602-2015fa/lectures/lecture18.pdf)
Page 1

D. Koop, CIS 602-02, Fall 2015

Scalable Data Analysis (CIS 602-02)

General Cluster Computing

Dr. David Koop

Page 2

MapReduce Overview

[Yahoo! Hadoop Tutorial]

Page 3

MapReduce: A Flexible Data Processing Tool
By Jeffrey Dean and Sanjay Ghemawat
Communications of the ACM, Vol. 53, No. 1 (January 2010), DOI: 10.1145/1629175.1629198

MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.

MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all intermediate values associated with the same intermediate key. We built a system around this programming model in 2003 to simplify construction of the inverted index for handling searches at Google.com. Since then, more than 10,000 distinct programs have been implemented using MapReduce at Google, including algorithms for large-scale graph processing, text processing, machine learning, and statistical machine translation. The Hadoop open source implementation of MapReduce has been used extensively outside of Google by a number of organizations.

To help illustrate the MapReduce programming model, consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code like the following pseudocode:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just "1" in this simple example). The reduce function sums together all counts emitted for a particular word.
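As an illustration (not part of the quoted article), the same word count can be written as a minimal single-machine Scala program; the map and reduce functions mirror the pseudocode above, and the groupBy stands in for the shuffle that the MapReduce runtime would perform:

    // Single-machine sketch of the word-count MapReduce pattern (illustrative only).
    object WordCount {
      // "map": emit a (word, 1) pair for every word in a document
      def mapFn(docName: String, contents: String): Seq[(String, Int)] =
        contents.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

      // "reduce": sum all counts emitted for one word
      def reduceFn(word: String, counts: Iterable[Int]): (String, Int) = (word, counts.sum)

      def main(args: Array[String]): Unit = {
        val documents = Seq("doc1" -> "the quick brown fox", "doc2" -> "the lazy dog and the fox")
        val intermediate = documents.flatMap { case (name, text) => mapFn(name, text) }
        // the "shuffle": group intermediate pairs by word
        val grouped = intermediate.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }
        val counts = grouped.map { case (w, cs) => reduceFn(w, cs) }
        counts.toSeq.sortBy(-_._2).foreach { case (w, c) => println(s"$w\t$c") }
      }
    }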

MapReduce automatically parallelizes and executes the program on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing required inter-machine communication. MapReduce allows programmers with no experience with parallel and distributed systems to easily utilize the resources of a large distributed system. A typical MapReduce computation processes many terabytes of data on hundreds or thousands of machines. Programmers find the system easy to use, and more than 100,000 MapReduce jobs are executed on Google's clusters every day.

Compared to Parallel Databases: The query languages built into parallel database systems are also used to …

Word Count Example

[Dean and Ghemawat, 2004]

Page 4

Word Count Example

Page 5

Dewitt and Stonebraker on MapReduce (2008)

"We are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications.
1. A giant step backward in the programming paradigm for large-scale data intensive applications
2. A sub-optimal implementation, in that it uses brute force instead of indexing
3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
4. Missing most of the features that are routinely included in current DBMS
5. Incompatible with all of the tools DBMS users have come to depend on"

[via Bill Howe]

Page 6

Architectural Elements

  Element                        Parallel DBMS                MapReduce
  Schema support                 ✓                            Not out of the box
  Indexing                       ✓                            Not out of the box
  Programming model              Declarative (SQL)            Imperative (C/C++, Java, …);
                                                              extensions through Pig and Hive
  Optimizations (compression,    ✓                            Not out of the box
  query optimization)
  Flexibility                    Not out of the box           ✓
  Fault tolerance                Coarse-grained techniques    ✓

[Pavlo et al., SIGMOD 2009, Stonebraker et al., CACM 2010, …]

Parallel Databases vs. MapReduce

[via J. Freire, from Pavlo et al., 2009 & Stonebraker et al., 2010]

Page 7

[Figure 1: Load Times – Grep Task Data Set (535MB/node). Bar chart of load time in seconds for Vertica, DBMS-X, and Hadoop at 1, 10, 25, 50, and 100 nodes.]

[Figure 2: Load Times – Grep Task Data Set (1TB/cluster). Bar chart of load time in seconds at 25, 50, and 100 nodes.]

[Figure 3: Load Times – UserVisits Data Set (20GB/node). Bar chart of load time in seconds for Vertica, DBMS-X, and Hadoop at 1, 10, 25, 50, and 100 nodes.]

Since Hadoop needs a total of 3TB of disk space in order to store three replicas of each block in HDFS, we were limited to running this benchmark only on 25, 50, and 100 nodes (at fewer than 25 nodes, there is not enough available disk space to store 3TB).

4.2.1 Data Loading

We now describe the procedures used to load the data from the nodes' local files into each system's internal storage representation.

Hadoop: There are two ways to load data into Hadoop's distributed file system: (1) use Hadoop's command-line file utility to upload files stored on the local filesystem into HDFS or (2) create a custom data loader program that writes data using Hadoop's internal I/O API. We did not need to alter the input data for our MR programs, therefore we loaded the files on each node in parallel directly into HDFS as plain text using the command-line utility. Storing the data in this manner enables MR programs to access data using Hadoop's TextInputFormat data format, where the keys are line numbers in each file and their corresponding values are the contents of each line. We found that this approach yielded the best performance in both the loading process and task execution, as opposed to using Hadoop's serialized data formats or compression features.

DBMS-X: The loading process in DBMS-X occurs in two phases. First, we execute the LOAD SQL command in parallel on each node in the cluster to read data from the local filesystem and insert its contents into a particular table in the database. We specify in this command that the local data is delimited by a special character, thus we did not need to write a custom program to transform the data before loading it. But because our data generator simply creates random keys for each record on each node, the system must redistribute the tuples to other nodes in the cluster as it reads each record from the input files based on the target table's partitioning attribute. It would be possible to generate a "hash-aware" version of the data generator that would allow DBMS-X to just load the input files on each node without this redistribution process, but we do not believe that this would improve load times very much.

Once the initial loading phase is complete, we then execute an administrative command to reorganize the data on each node. This process executes in parallel on each node to compress data, build each table's indexes, and perform other housekeeping.

Vertica: Vertica also provides a COPY SQL command that is issued from a single host and then coordinates the loading process on multiple nodes in parallel in the cluster. The user gives the COPY command as input a list of nodes to execute the loading operation for. This process is similar to DBMS-X: on each node the Vertica loader splits the input data files on a delimiter, creates a new tuple for each line in an input file, and redistributes that tuple to a different node based on the hash of its primary key. Once the data is loaded, the columns are automatically sorted and compressed according to the physical design of the database.

Results & Discussion: The results for loading both the 535MB/node and 1TB/cluster data sets are shown in Figures 1 and 2, respectively. For DBMS-X, we separate the times of the two loading phases, which are shown as a stacked bar in the graphs: the bottom segment represents the execution time of the parallel LOAD commands and the top segment is the reorganization process.

The most striking feature of the results for the load times in the 535MB/node data set shown in Figure 1 is the difference in performance of DBMS-X compared to Hadoop and Vertica. Despite issuing the initial LOAD command in the first phase on each node in parallel, the data was actually loaded on each node sequentially. Thus, as the total amount of data is increased, the load times also increased proportionately. This also explains why, for the 1TB/cluster data set, the load times for DBMS-X do not decrease as less data is stored per node. However, the compression and housekeeping on DBMS-X can be done in parallel across nodes, and thus the execution time of the second phase of the loading process is cut in half when twice as many nodes are used to store the 1TB of data.

Without using either block- or record-level compression, Hadoop clearly outperforms both DBMS-X and Vertica since each node is simply copying each data file from the local disk into the local HDFS instance and then distributing two replicas to other nodes in the cluster. If we load the data into Hadoop using only a single replica per block, then the load times are reduced by a factor of three. But as we will discuss in Section 5, the lack of multiple replicas often increases the execution times of jobs.

4.2.2 Task Execution

SQL Commands: A pattern search for a particular field is simply the following query in SQL. Neither SQL system contained an index on the field attribute, so this query requires a full table scan.

    SELECT * FROM Data WHERE field LIKE '%XYZ%';

MapReduce Program: The MR program consists of just a Map function that is given a single record already split into the appropriate key/value pair and then performs a sub-string match on the value. If the search pattern is found, the Map function simply outputs the input key/value pair to HDFS. Because no Reduce function is defined, the output generated by each Map instance is the final output of the program.
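As an illustration (not the benchmark's actual Hadoop code), the same map-only pattern in a small Scala sketch; the input file name, record keys, and search pattern are hypothetical:

    import scala.io.Source

    // Map-only "grep": emit each (lineNumber, line) record whose value contains the pattern.
    // With no reduce step, this map output is the final output of the job.
    val pattern = "XYZ"
    val records = Source.fromFile("grep-data.txt").getLines().zipWithIndex
    val matches = records.collect { case (line, i) if line.contains(pattern) => (i.toLong, line) }
    matches.foreach { case (key, value) => println(s"$key\t$value") }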

Results & Discussion: The performance results for the three systems for this task are shown in Figures 4 and 5. Surprisingly, the relative differences between the systems are not consistent in the …

Comparing Load Times

[Pavlo et al., 2009]

Page 8

[Figure 4: Grep Task Results – 535MB/node Data Set. Execution time in seconds at 1, 10, 25, 50, and 100 nodes.]

[Figure 5: Grep Task Results – 1TB/cluster Data Set. Execution time in seconds at 25, 50, and 100 nodes.]

…two figures. In Figure 4, the two parallel databases perform about the same, more than a factor of two faster than Hadoop. But in Figure 5, both DBMS-X and Hadoop perform more than a factor of two slower than Vertica. The reason is that the amount of data processing varies substantially between the two experiments. For the results in Figure 4, very little data is being processed (535MB/node). This causes Hadoop's non-insignificant start-up costs to become the limiting factor in its performance. As will be described in Section 5.1.2, for short-running queries (i.e., queries that take less than a minute), Hadoop's start-up costs can dominate the execution time. In our observations, we found that it takes 10–25 seconds before all Map tasks have been started and are running at full speed across the nodes in the cluster. Furthermore, as the total number of allocated Map tasks increases, there is additional overhead required for the central job tracker to coordinate node activities. Hence, this fixed overhead increases slightly as more nodes are added to the cluster, and for longer data processing tasks, as shown in Figure 5, this fixed cost is dwarfed by the time to complete the required processing.

The upper segments of each Hadoop bar in the graphs represent the execution time of the additional MR job to combine the output into a single file. Since we ran this as a separate MapReduce job, these segments consume a larger percentage of overall time in Figure 4, as the fixed start-up overhead cost again dominates the work needed to perform the rest of the task. Even though the Grep task is selective, the results in Figure 5 show how this combine phase can still take hundreds of seconds due to the need to open and combine many small output files. Each Map instance produces its output in a separate HDFS file, and thus even though each file is small there are many Map tasks and therefore many files on each node.

For the 1TB/cluster data set experiments, Figure 5 shows that all systems executed the task on twice as many nodes in nearly half the amount of time, as one would expect since the total amount of data was held constant across nodes for this experiment. Hadoop and DBMS-X perform approximately the same, since Hadoop's start-up cost is amortized across the increased amount of data processing for this experiment. However, the results clearly show that Vertica outperforms both DBMS-X and Hadoop. We attribute this to Vertica's aggressive use of data compression (see Section 5.1.3), which becomes more effective as more data is stored per node.

4.3 Analytical Tasks

To explore more complex uses of both types of systems, we developed four tasks related to HTML document processing. We first generate a collection of random HTML documents, similar to that which a web crawler might find. Each node is assigned a set of 600,000 unique HTML documents, each with a unique URL. In each document, we randomly generate links to other pages set using a Zipfian distribution.

We also generated two additional data sets meant to model log files of HTTP server traffic. These data sets consist of values derived from the HTML documents as well as several randomly generated attributes. The schema of these three tables is as follows:

    CREATE TABLE Documents (
      url VARCHAR(100) PRIMARY KEY,
      contents TEXT );

    CREATE TABLE Rankings (
      pageURL VARCHAR(100) PRIMARY KEY,
      pageRank INT,
      avgDuration INT );

    CREATE TABLE UserVisits (
      sourceIP VARCHAR(16),
      destURL VARCHAR(100),
      visitDate DATE,
      adRevenue FLOAT,
      userAgent VARCHAR(64),
      countryCode VARCHAR(3),
      languageCode VARCHAR(6),
      searchWord VARCHAR(32),
      duration INT );

Our data generator created unique files with 155 million UserVisits records (20GB/node) and 18 million Rankings records (1GB/node) on each node. The visitDate, adRevenue, and sourceIP fields are picked uniformly at random from specific ranges. All other fields are picked uniformly from sampling real-world data sets. Each data file is stored on each node as a column-delimited text file.

4.3.1 Data Loading

We now describe the procedures for loading the UserVisits and Rankings data sets. For reasons to be discussed in Section 4.3.5, only Hadoop needs to directly load the Documents files into its internal storage system. DBMS-X and Vertica both execute a UDF that processes the Documents on each node at runtime and loads the data into a temporary table. We account for the overhead of this approach in the benchmark times, rather than in the load times. Therefore, we do not provide results for loading this data set.

Hadoop: Unlike the Grep task's data set, which was uploaded directly into HDFS unaltered, the UserVisits and Rankings data sets needed to be modified so that the first and second columns are separated by a tab delimiter and all other fields in each line are separated by a unique field delimiter. Because there are no schemas in the MR model, in order to access the different attributes at run time, the Map and Reduce functions in each task must manually split the value by the delimiter character into an array of strings.

We wrote a custom data loader executed in parallel on each node to read in each line of the data sets, prepare the data as needed, and then write the tuple into a plain text file in HDFS. Loading the data sets in this manner was roughly three times slower than using the command-line utility, but did not require us to write custom …

Comparing Execution Times

[Pavlo et al., 2009]

Page 9

Dean and Ghemawat's Response

• Heterogeneous systems:
  - There is data in lots of different storage systems: files, Google Bigtable, column stores
  - Can write your own storage backend to combine data
  - Parallel databases require all data to be loaded
• Indices:
  - Database indexing techniques can be used for MapReduce
  - Push data processing to the database
  - Or, use partitioning of data (see the sketch below)
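As an illustration of the partitioning point (not from the slides or the papers; the directory layout and date range are hypothetical), a job can restrict its input to the partitions a query needs instead of relying on a database index:

    import java.io.File
    import java.time.LocalDate

    // Hypothetical date-partitioned layout: /data/uservisits/date=YYYY-MM-DD/part-*
    val start = LocalDate.parse("2015-11-01")
    val end   = LocalDate.parse("2015-11-07")
    val base  = new File("/data/uservisits")

    // Keep only the partition directories whose date falls in [start, end].
    val partitions = Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(d => !d.isAfter(end))
      .map(d => new File(base, s"date=$d"))
      .filter(_.isDirectory)
      .toList

    println(s"Job input restricted to ${partitions.size} partition directories")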

[via J. Freire, Dean and Ghemawat, 2010]

Page 10

Dean and Ghemawat's Response (continued)

• Complex functions:
  - MapReduce was designed for complex tasks that do not fit well into the relational paradigm
  - RDBMSs have user-defined functions, but support is often buggy or missing
• Structured data and schemas:
  - Google's MapReduce supports Protocol Buffers, which allow an optimized binary representation
  - https://developers.google.com/protocol-buffers/

[via J. Freire, Dean and Ghemawat, 2010]

Page 11

Quiz

Page 12

Project Example

• Jake VanderPlas's report on Seattle's bike sharing service:
  - http://nbviewer.jupyter.org/github/jakevdp/ProntoData/blob/master/ProntoData.ipynb

Page 13

Spark: Focus on Data Reads/Writes

• MapReduce is very useful, but data is written to disk between operations (see the sketch below)
• What about multiple operations?
• What about different types of operations?
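As an illustration (not from the slides), the sketch below shows the idea Spark pursues: a dataset shared by several operations can stay in memory rather than being re-read from disk for each job, as a chain of MapReduce programs would require. It assumes a SparkContext named sc, a hypothetical input path, and a tab-separated log format.

    // Keep the filtered data in memory so later operations reuse it.
    val logs   = sc.textFile("hdfs://.../app-logs")
    val errors = logs.filter(_.contains("ERROR")).cache()

    // Two different operations over the same in-memory dataset:
    val totalErrors = errors.count()
    val byComponent = errors.map(line => (line.split("\t")(1), 1)).reduceByKey(_ + _).collect()
    println(s"$totalErrors errors across ${byComponent.length} components")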

Page 14

Spark: Cluster Computing with Working Sets

M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica

Page 15

[Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations: lines → (filter(_.startsWith("ERROR"))) → errors → (filter(_.contains("HDFS"))) → HDFS errors → (map(_.split('\t')(3))) → time fields.]

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(_.startsWith("ERROR"))
    errors.persist()

Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:

    errors.count()

The user can also perform further transformations on the RDD and use their results, as in the following lines:

    // Count errors mentioning MySQL:
    errors.filter(_.contains("MySQL")).count()

    // Return the time fields of errors mentioning
    // HDFS as an array (assuming time is field
    // number 3 in a tab-separated format):
    errors.filter(_.contains("HDFS"))
          .map(_.split('\t')(3))
          .collect()

After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM. This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).

Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.

  Aspect                       RDDs                                     Distr. Shared Mem.
  Reads                        Coarse- or fine-grained                  Fine-grained
  Writes                       Coarse-grained                           Fine-grained
  Consistency                  Trivial (immutable)                      Up to app / runtime
  Fault recovery               Fine-grained and low-overhead            Requires checkpoints and
                               using lineage                            program rollback
  Straggler mitigation         Possible using backup tasks              Difficult
  Work placement               Automatic based on data locality         Up to app (runtimes aim
                                                                        for transparency)
  Behavior if not enough RAM   Similar to existing data flow systems    Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributed memory abstraction, we compare them against distributed shared memory (DSM) in Table 1. In DSM systems, applications read and write to arbitrary locations in a global address space. Note that under this definition, we include not only traditional shared memory systems [24], but also other systems where applications make fine-grained writes to shared state, including Piccolo [27], which provides a shared DHT, and distributed databases. DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained transformations, while DSM allows reads and writes to each memory location.[3] This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage.[4] Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.

A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce [10]. Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other's updates.

Finally, RDDs provide two other benefits over DSM. First, in bulk operations on RDDs, a runtime can schedule …

[3] Note that reads on RDDs can still be fine-grained. For example, an application can treat an RDD as a large read-only lookup table.

[4] In some applications, it can still help to checkpoint RDDs with long lineage chains, as we discuss in Section 5.4. However, this can be done in the background because RDDs are immutable, and there is no need to take a snapshot of the whole application as in DSM.

Comparison with Distributed Shared Memory

[Zaharia et al., 2012]

Page 16

Transformations

  map(f : T ⇒ U)                  : RDD[T] ⇒ RDD[U]
  filter(f : T ⇒ Bool)            : RDD[T] ⇒ RDD[T]
  flatMap(f : T ⇒ Seq[U])         : RDD[T] ⇒ RDD[U]
  sample(fraction : Float)        : RDD[T] ⇒ RDD[T]  (deterministic sampling)
  groupByKey()                    : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
  reduceByKey(f : (V, V) ⇒ V)     : RDD[(K, V)] ⇒ RDD[(K, V)]
  union()                         : (RDD[T], RDD[T]) ⇒ RDD[T]
  join()                          : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
  cogroup()                       : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
  crossProduct()                  : (RDD[T], RDD[U]) ⇒ RDD[(T, U)]
  mapValues(f : V ⇒ W)            : RDD[(K, V)] ⇒ RDD[(K, W)]  (preserves partitioning)
  sort(c : Comparator[K])         : RDD[(K, V)] ⇒ RDD[(K, V)]
  partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]

Actions

  count()                         : RDD[T] ⇒ Long
  collect()                       : RDD[T] ⇒ Seq[T]
  reduce(f : (T, T) ⇒ T)          : RDD[T] ⇒ T
  lookup(k : K)                   : RDD[(K, V)] ⇒ Seq[V]  (on hash/range partitioned RDDs)
  save(path : String)             : outputs RDD to a storage system, e.g., HDFS

Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.
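As an illustration (not part of the quoted paper), a small Scala sketch exercising a few of the operations in Table 2; it assumes a SparkContext named sc and uses made-up data:

    val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
    val prices = sc.parallelize(Seq(("apples", 0.5), ("pears", 0.8)))

    val totals  = sales.reduceByKey(_ + _)                           // RDD[(String, Int)]
    val joined  = totals.join(prices)                                // RDD[(String, (Int, Double))]
    val revenue = joined.mapValues { case (qty, price) => qty * price }
    println(revenue.collect().toSeq)                                 // action: results come back to the driver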

…that searches for a hyperplane w that best separates two sets of points (e.g., spam and non-spam emails). The algorithm uses gradient descent: it starts w at a random value, and on each iteration, it sums a function of w over the data to move w in a direction that improves it.

    val points = spark.textFile(...)
                      .map(parsePoint).persist()
    var w = // random initial vector
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map { p =>
        p.x * (1/(1+exp(-p.y*(w dot p.x)))-1) * p.y
      }.reduce((a,b) => a+b)
      w -= gradient
    }

We start by defining a persistent RDD called points as the result of a map transformation on a text file that parses each line of text into a Point object. We then repeatedly run map and reduce on points to compute the gradient at each step by summing a function of the current w. Keeping points in memory across iterations can yield a 20× speedup, as we show in Section 6.1.
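The quoted snippet assumes a Point class, a parsePoint helper, vector operations (dot, *, +, -), and scala.math.exp in scope, none of which are shown in the excerpt. One plausible, purely illustrative shape for these helpers (an assumption, not the paper's actual code) is:

    import scala.math.exp   // used by the gradient expression in the quoted loop

    // Assumed helper types for the logistic regression snippet (illustrative only).
    case class Vec(values: Array[Double]) {
      def dot(o: Vec): Double = values.zip(o.values).map { case (a, b) => a * b }.sum
      def *(s: Double): Vec   = Vec(values.map(_ * s))
      def +(o: Vec): Vec      = Vec(values.zip(o.values).map { case (a, b) => a + b })
      def -(o: Vec): Vec      = Vec(values.zip(o.values).map { case (a, b) => a - b })
    }
    case class Point(x: Vec, y: Double)   // x: feature vector, y: label in {-1, +1}

    // Parse a line of the form "label f1 f2 f3 ..." into a Point.
    def parsePoint(line: String): Point = {
      val parts = line.trim.split("\\s+").map(_.toDouble)
      Point(Vec(parts.tail), parts.head)
    }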

3.2.2 PageRank

A more complex pattern of data sharing occurs in PageRank [6]. The algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors. It then updates its rank to a/N + (1 - a) Σ cᵢ, where the sum is over the contributions it received and N is the total number of documents. We can write PageRank in Spark as follows:

    // Load graph as an RDD of (URL, outlinks) pairs
    val links = spark.textFile(...).map(...).persist()
    var ranks = // RDD of (URL, rank) pairs
    for (i <- 1 to ITERATIONS) {
      // Build an RDD of (targetURL, float) pairs
      // with the contributions sent by each page
      val contribs = links.join(ranks).flatMap {
        (url, (links, rank)) =>
          links.map(dest => (dest, rank/links.size))
      }
      // Sum contributions by URL and get new ranks
      ranks = contribs.reduceByKey((x,y) => x+y)
                      .mapValues(sum => a/N + (1-a)*sum)
    }

[Figure 3: Lineage graph for datasets in PageRank. The input file is mapped to links and ranks0; each iteration joins links with ranks_i and applies a reduce and map to produce contribs_i and ranks_i+1.]

This program leads to the RDD lineage graph in Figure 3. On each iteration, we create a new ranks dataset based on the contribs and ranks from the previous iteration and the static links dataset.[6] One interesting feature of this graph is that it grows longer with the number …

[6] Note that although RDDs are immutable, the variables ranks and contribs in the program point to different RDDs on each iteration.

Spark Transformations and Actions

[Zaharia et al., 2012]

Page 17


PageRank in Spark

    // Load graph as an RDD of (URL, outlinks) pairs
    val links = spark.textFile(...).map(...).persist()
    var ranks = // RDD of (URL, rank) pairs
    for (i <- 1 to ITERATIONS) {
      // Build an RDD of (targetURL, float) pairs
      // with the contributions sent by each page
      val contribs = links.join(ranks).flatMap {
        (url, (links, rank)) =>
          links.map(dest => (dest, rank/links.size))
      }
      // Sum contributions by URL and get new ranks
      ranks = contribs.reduceByKey((x,y) => x+y)
                      .mapValues(sum => a/N + (1-a)*sum)
    }

[Zaharia et al., 2012]

Page 18

[Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter, union, and join with co-partitioned inputs. Wide dependencies: groupByKey and join with inputs not co-partitioned.]

…map to the parent's records in its iterator method.

union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.[7]

sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.

join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).

5 Implementation

We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.

Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.

We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling

Spark's scheduler uses our representation of RDDs, described in Section 4.

Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.

[7] Note that our union operation does not drop duplicate values.

[Figure 5: Example of how Spark computes job stages (RDDs A–G grouped into stages 1–3 by map, union, groupBy, and join). Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3.]

Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute, as illustrated in Figure 5. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.

Our scheduler assigns tasks to machines based on data locality using delay scheduling [32]. If a task needs to process a partition that is available in memory on a node, we send it to that node. Otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.

For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.

If a task fails, we re-run it on another node as long as its stage's parents are still available. If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel. We do not yet tolerate scheduler failures, though replicating the RDD lineage graph would be straightforward.

Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.

Narrow vs. Wide Dependencies


[Zaharia et al., 2012]

Spark stores datasets in a graph-based data representation where partitions are atomic.
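As an illustration (not from the paper or the slides), the sketch below shows where a stage boundary falls: map and filter are narrow and get pipelined into one stage, while reduceByKey is wide and forces a shuffle, so the scheduler starts a new stage there. It assumes a SparkContext named sc.

    val nums  = sc.parallelize(1 to 1000000, 8)              // 8 partitions
    val pairs = nums.map(n => (n % 10, n))                   // narrow: stays within each partition
    val evens = pairs.filter { case (_, n) => n % 2 == 0 }   // narrow: pipelined with the map
    val sums  = evens.reduceByKey(_ + _)                     // wide: shuffle, so a new stage starts here
    sums.collect().foreach(println)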

Page 19


Spark Job Stages

Example of how Spark computes job stages: boxes with solid outlines are RDDs; partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3. [Zaharia et al., 2012]

Page 20

…them simpler to checkpoint than general shared memory. Because consistency is not a concern, RDDs can be written out in the background without requiring program pauses or distributed snapshot schemes.

6 Evaluation

We evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications. Overall, our results show the following:

• Spark outperforms Hadoop by up to 20× in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
• Applications written by our users perform and scale well. In particular, we used Spark to speed up an analytics report that was running on Hadoop by 40×.
• When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
• Spark can be used to query a 1 TB dataset interactively with latencies of 5–7 seconds.

We start by presenting benchmarks for iterative machine learning applications (§6.1) and PageRank (§6.2) against Hadoop. We then evaluate fault recovery in Spark (§6.3) and behavior when a dataset does not fit in memory (§6.4). Finally, we discuss results for user applications (§6.5) and interactive data mining (§6.6).

Unless otherwise noted, our tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks. Before each test, we cleared OS buffer caches to measure IO costs accurately.

6.1 Iterative Machine Learning Applications

We implemented two iterative machine learning applications, logistic regression and k-means, to compare the performance of the following systems:

• Hadoop: The Hadoop 0.20.2 stable release.
• HadoopBinMem: A Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration to eliminate text parsing in later ones, and stores it in an in-memory HDFS instance.
• Spark: Our implementation of RDDs.

We ran both algorithms for 10 iterations on 100 GB datasets using 25–100 machines. The key difference between the two applications is the amount of computation they perform per byte of data. The iteration time of k-means is dominated by computation, while logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.

Since typical learning algorithms need tens of iterations to converge, we report times for the first iteration and subsequent iterations separately. We find that sharing data via RDDs greatly speeds up future iterations.

[Figure 7: Duration of the first and later iterations in Hadoop, HadoopBinMem, and Spark for logistic regression and k-means using 100 GB of data on a 100-node cluster.]

[Figure 8: Running times for iterations after the first in Hadoop, HadoopBinMem, and Spark on 25, 50, and 100 machines; (a) logistic regression, (b) k-means. The jobs all processed 100 GB.]

First Iterations: All three systems read text input from HDFS in their first iterations. As shown in the light bars in Figure 7, Spark was moderately faster than Hadoop across experiments. This difference was due to signaling overheads in Hadoop's heartbeat protocol between its master and workers. HadoopBinMem was the slowest because it ran an extra MapReduce job to convert the data to binary, and it had to write this data across the network to a replicated in-memory HDFS instance.

Subsequent Iterations: Figure 7 also shows the average running times for subsequent iterations, while Figure 8 shows how these scaled with cluster size. For logistic regression, Spark was 25.3× and 20.7× faster than Hadoop and HadoopBinMem respectively on 100 machines. For the more compute-intensive k-means application, Spark still achieved a speedup of 1.9× to 3.2×.

Understanding the Speedup: We were surprised to find that Spark outperformed even Hadoop with in-memory storage of binary data (HadoopBinMem) by a 20× margin. In HadoopBinMem, we had used Hadoop's standard binary format (SequenceFile) and a large block size of 256 MB, and we had forced HDFS's data directory to be on an in-memory file system. However, Hadoop still ran slower due to several factors:

1. Minimum overhead of the Hadoop software stack,
2. Overhead of HDFS while serving data, and

Evaluation

[Zaharia et al., 2012]

Page 21


Evaluation

[Zaharia et al., 2012]

Page 22

[Figure 9: Iteration times for logistic regression using 256 MB data on a single machine for different sources of input (in-memory HDFS, in-memory local file, Spark RDD), with text and binary input.]

3. Deserialization cost to convert binary records to usable in-memory Java objects.

We investigated each of these factors in turn. To measure (1), we ran no-op Hadoop jobs, and saw that these incurred at least 25s of overhead to complete the minimal requirements of job setup, starting tasks, and cleaning up. Regarding (2), we found that HDFS performed multiple memory copies and a checksum to serve each block.

Finally, to measure (3), we ran microbenchmarks on a single machine to run the logistic regression computation on 256 MB inputs in various formats. In particular, we compared the time to process text and binary inputs from both HDFS (where overheads in the HDFS stack will manifest) and an in-memory local file (where the kernel can very efficiently pass data to the program).

We show the results of these tests in Figure 9. The differences between in-memory HDFS and local file show that reading through HDFS introduced a 2-second overhead, even when data was in memory on the local machine. The differences between the text and binary input indicate the parsing overhead was 7 seconds. Finally, even when reading from an in-memory file, converting the pre-parsed binary data into Java objects took 3 seconds, which is still almost as expensive as the logistic regression itself. By storing RDD elements directly as Java objects in memory, Spark avoids all these overheads.

6.2 PageRank

We compared the performance of Spark with Hadoop for PageRank using a 54 GB Wikipedia dump. We ran 10 iterations of the PageRank algorithm to process a link graph of approximately 4 million articles. Figure 10 demonstrates that in-memory storage alone provided Spark with a 2.4× speedup over Hadoop on 30 nodes. In addition, controlling the partitioning of the RDDs to make it consistent across iterations, as discussed in Section 3.2.2, improved the speedup to 7.4×. The results also scaled nearly linearly to 60 nodes.

We also evaluated a version of PageRank written using our implementation of Pregel over Spark, which we describe in Section 7.1. The iteration times were similar to the ones in Figure 10, but longer by about 4 seconds because Pregel runs an extra operation on each iteration to let the vertices "vote" whether to finish the job.

[Figure 10: Performance of PageRank on Hadoop and Spark (Hadoop, Basic Spark, Spark + Controlled Partitioning) on 30 and 60 machines; y-axis is iteration time in seconds.]

[Figure 11: Iteration times for k-means in the presence of a failure. One machine was killed at the start of the 6th iteration, resulting in partial reconstruction of an RDD using lineage.]

6.3 Fault Recovery

We evaluated the cost of reconstructing RDD partitions using lineage after a node failure in the k-means application. Figure 11 compares the running times for 10 iterations of k-means on a 75-node cluster under a normal operating scenario with one where a node fails at the start of the 6th iteration. Without any failure, each iteration consisted of 400 tasks working on 100 GB of data.

Until the end of the 5th iteration, the iteration times were about 58 seconds. In the 6th iteration, one of the machines was killed, resulting in the loss of the tasks running on that machine and the RDD partitions stored there. Spark re-ran these tasks in parallel on other machines, where they re-read the corresponding input data and reconstructed RDDs via lineage, which increased the iteration time to 80 s. Once the lost RDD partitions were reconstructed, the iteration time went back down to 58 s.

Note that with a checkpoint-based fault recovery mechanism, recovery would likely require rerunning at least several iterations, depending on the frequency of checkpoints. Furthermore, the system would need to replicate the application's 100 GB working set (the text input data converted into binary) across the network, and would either consume twice the memory of Spark to replicate it in RAM or would have to wait to write 100 GB to disk. In contrast, the lineage graphs for the RDDs in our examples were all less than 10 KB in size.
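For a sense of how small that lineage metadata is, the PySpark sketch below (hypothetical input path) builds a cached RDD and prints its lineage with toDebugString(); this dependency record, not a replicated copy of the data, is what Spark replays to rebuild lost partitions:

# Sketch: an RDD's lineage is just its chain of transformations.
from pyspark import SparkContext

sc = SparkContext(appName="lineage-demo")

points = (sc.textFile("hdfs:///data/points.txt")       # hypothetical input
            .map(lambda l: [float(x) for x in l.split()])
            .filter(lambda p: len(p) > 2)
            .cache())

# Prints the dependency chain Spark would re-execute for lost partitions.
print(points.toDebugString())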

6.4 Behavior with Insufficient Memory

So far, we ensured that every machine in the cluster had enough memory to store all the RDDs across iterations.

Performance on PageRank

22
D. Koop, CIS 602-02, Fall 2015

[Zaharia et al., 2012]


[Figure 12: Performance of logistic regression using 100 GB data on 25 machines with varying amounts of data in memory. Iteration time in seconds vs. percent of the dataset in memory (0%–100%).]

A natural question is how Spark runs if there is not enough memory to store a job's data. In this experiment, we configured Spark not to use more than a certain percentage of memory to store RDDs on each machine. We present results for various amounts of storage space for logistic regression in Figure 12. We see that performance degrades gracefully with less space.
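A minimal sketch of that kind of setup is shown below; the configuration key is an assumption (spark.storage.memoryFraction in Spark releases of that era, replaced by spark.memory.fraction in later releases), and MEMORY_ONLY storage is what lets partitions that do not fit simply be recomputed from lineage instead of failing the job:

# Sketch: cap the fraction of executor memory available for cached RDDs.
# Config key and path are assumptions, not taken from the paper's setup.
from pyspark import SparkConf, SparkContext, StorageLevel

conf = (SparkConf()
        .setAppName("limited-cache")
        .set("spark.storage.memoryFraction", "0.25"))   # assumed legacy knob

sc = SparkContext(conf=conf)

data = (sc.textFile("hdfs:///data/lr_points.txt")
          .map(lambda l: [float(x) for x in l.split()]))

# MEMORY_ONLY: partitions that don't fit in the capped cache are dropped and
# recomputed from lineage on demand, so performance degrades gracefully.
data.persist(StorageLevel.MEMORY_ONLY)
print(data.count())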

6.5 User Applications Built with Spark

In-Memory Analytics  Conviva Inc., a video distribution company, used Spark to accelerate a number of data analytics reports that previously ran over Hadoop. For example, one report ran as a series of Hive [1] queries that computed various statistics for a customer. These queries all worked on the same subset of the data (records matching a customer-provided filter), but performed aggregations (averages, percentiles, and COUNT DISTINCT) over different grouping fields, requiring separate MapReduce jobs. By implementing the queries in Spark and loading the subset of data shared across them once into an RDD, the company was able to speed up the report by 40×. A report on 200 GB of compressed data that took 20 hours on a Hadoop cluster now runs in 30 minutes using only two Spark machines. Furthermore, the Spark program only required 96 GB of RAM, because it only stored the rows and columns matching the customer's filter in an RDD, not the whole decompressed file.
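The pattern is roughly the following (a PySpark sketch with an assumed record layout and filter value, not Conviva's code): filter the shared subset once, cache it, and run each aggregation over the cached RDD instead of launching a separate MapReduce job per query.

# Sketch: several aggregations over one cached, filtered subset.
from pyspark import SparkContext

sc = SparkContext(appName="shared-subset-reports")

def parse(line):
    customer, group, value = line.split(",")     # assumed CSV layout
    return customer, group, float(value)

records = sc.textFile("hdfs:///data/view_logs.csv").map(parse)

# The customer-provided filter is applied once; the subset stays in memory.
subset = records.filter(lambda r: r[0] == "customer-42").cache()

avg_by_group = (subset.map(lambda r: (r[1], (r[2], 1)))
                      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                      .mapValues(lambda s: s[0] / s[1]))
distinct_groups = subset.map(lambda r: r[1]).distinct().count()

print(avg_by_group.take(5), distinct_groups)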

Traffic Modeling  Researchers in the Mobile Millennium project at Berkeley [18] parallelized a learning algorithm for inferring road traffic congestion from sporadic automobile GPS measurements. The source data were a 10,000-link road network for a metropolitan area, as well as 600,000 samples of point-to-point trip times for GPS-equipped automobiles (travel times for each path may include multiple road links). Using a traffic model, the system can estimate the time it takes to travel across individual road links. The researchers trained this model using an expectation maximization (EM) algorithm that repeats two map and reduceByKey steps iteratively. The application scales nearly linearly from 20 to 80 nodes with 4 cores each, as shown in Figure 13(a).
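The shape of that loop, very schematically, is sketched below. This is not the Mobile Millennium code: the trip-record layout and update rule are placeholders, and a single flatMap/reduceByKey pass stands in for the two map and reduceByKey steps the paper describes.

# Schematic EM-style loop: the E-step spreads each observed trip time over its
# road links, the M-step re-estimates a per-link time from the aggregated shares.
from pyspark import SparkContext

sc = SparkContext(appName="em-traffic-skeleton")

def parse(line):
    parts = line.split(",")            # assumed: total_time,link1,link2,...
    return float(parts[0]), parts[1:]

trips = sc.textFile("hdfs:///data/trips.csv").map(parse).cache()

link_time = {}                          # current per-link estimates
for _ in range(10):
    est = sc.broadcast(link_time)

    def e_step(trip):
        total, links = trip
        weights = [est.value.get(l, 1.0) for l in links]
        s = sum(weights) or 1.0
        # share of the trip time attributed to each link, plus a count
        return [(l, (total * w / s, 1)) for l, w in zip(links, weights)]

    sums = trips.flatMap(e_step).reduceByKey(
        lambda a, b: (a[0] + b[0], a[1] + b[1]))
    # M-step: new estimate = mean time assigned to the link
    link_time = dict(sums.mapValues(lambda s: s[0] / s[1]).collect())

print(sorted(link_time.items())[:5])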

[Figure 13: Per-iteration running time of two user applications implemented with Spark: (a) traffic modeling and (b) spam classification. Iteration time in seconds vs. number of machines (20, 40, 80); error bars show standard deviations.]

[Figure 14: Response times for interactive queries on Spark, scanning increasingly larger input datasets on 100 machines. Query response time in seconds vs. data size (100 GB, 500 GB, 1 TB) for exact-match, substring-match, and total view count queries.]

Twitter Spam Classification  The Monarch project at Berkeley [29] used Spark to identify link spam in Twitter messages. They implemented a logistic regression classifier on top of Spark similar to the example in Section 6.1, but they used a distributed reduceByKey to sum the gradient vectors in parallel. In Figure 13(b) we show the scaling results for training a classifier over a 50 GB subset of the data: 250,000 URLs and 10^7 features/dimensions related to the network and content properties of the pages at each URL. The scaling is not as close to linear due to a higher fixed communication cost per iteration.
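A rough sketch of that idea in PySpark follows; the feature layout, dimensionality, step size, and number of reducer keys are all assumptions. Each partition computes a partial gradient, and reduceByKey combines the partials in parallel rather than collecting every contribution at the driver.

# Sketch: logistic-regression gradient summed with a distributed reduceByKey.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="lr-reducebykey-gradient")

D = 1000                               # assumed feature dimensionality
w = np.zeros(D)

def parse(line):
    vals = np.array(line.split(), dtype=float)
    return vals[0], vals[1:]           # (label in {-1, +1}, features)

points = sc.textFile("hdfs:///data/spam_features.txt").map(parse).cache()

def partial_gradients(index, it):
    g = np.zeros(D)
    for y, x in it:
        g += (1.0 / (1.0 + np.exp(-y * w.dot(x))) - 1.0) * y * x
    # key the partial sum so reduceByKey can merge partials on the cluster
    yield index % 8, g

for _ in range(5):
    grads = (points.mapPartitionsWithIndex(partial_gradients)
                   .reduceByKey(lambda a, b: a + b)
                   .values()
                   .collect())
    w -= 0.1 * sum(grads)

print(w[:10])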

6.6 Interactive Data Mining

To demonstrate Spark’ ability to interactively query bigdatasets, we used it to analyze 1TB of Wikipedia pageview logs (2 years of data). For this experiment, we used100 m2.4xlarge EC2 instances with 8 cores and 68 GBof RAM each. We ran queries to find total views of (1)all pages, (2) pages with titles exactly matching a givenword, and (3) pages with titles partially matching a word.Each query scanned the entire input data.

Figure 14 shows the response times of the queries on the full dataset and on half and one-tenth of the data. Even at 1 TB of data, queries on Spark took 5–7 seconds. This was more than an order of magnitude faster than working with on-disk data; for example, querying the 1 TB file from disk took 170 s. This illustrates that RDDs make Spark a powerful tool for interactive data mining.
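The three queries amount to simple scans over a cached RDD; a minimal PySpark sketch (assumed log format and search word) looks like this:

# Sketch: total, exact-title, and substring-title view counts over cached logs.
from pyspark import SparkContext

sc = SparkContext(appName="interactive-pageviews")

def parse(line):
    title, views = line.rsplit(" ", 1)    # assumed "title view_count" records
    return title, int(views)

views = sc.textFile("hdfs:///data/wiki_pageviews").map(parse).cache()

total = views.map(lambda tv: tv[1]).sum()                                    # (1)
exact = views.filter(lambda tv: tv[0] == "Spark").map(lambda tv: tv[1]).sum()      # (2)
partial = views.filter(lambda tv: "Spark" in tv[0]).map(lambda tv: tv[1]).sum()    # (3)

print(total, exact, partial)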

Memory Scaling

23
D. Koop, CIS 602-02, Fall 2015

[Zaharia et al., 2012]


Node Scaling

24
D. Koop, CIS 602-02, Fall 2015

[Zaharia et al., 2012]


Conclusions
• Spark focuses on dealing with the data first and writing operations that interact with the data in a lazy fashion
• More flexibility, but also more complexity
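A tiny PySpark example of that laziness (hypothetical file path): transformations only record lineage, and nothing is read or computed until an action is called.

# Sketch: transformations are lazy; the action triggers the actual work.
from pyspark import SparkContext

sc = SparkContext(appName="lazy-demo")

lines = sc.textFile("hdfs:///data/logs.txt")       # nothing is read yet
errors = lines.filter(lambda l: "ERROR" in l)      # still nothing: just lineage
codes = errors.map(lambda l: l.split("\t")[0])     # still lazy

print(codes.count())    # the action: only now is the file scanned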

25
D. Koop, CIS 602-02, Fall 2015


Demo

26
D. Koop, CIS 602-02, Fall 2015


Next…
• Streaming data…
• More on using Spark with EC2
• Continue to work on projects

27
D. Koop, CIS 602-02, Fall 2015