Resilient Distributed Datasets (NSDI 2012)

Resilient Distributed Datasets (NSDI 2012)A Fault-Tolerant Abstraction forIn-Memory Cluster Computing

Piccolo (OSDI 2010)Building Fast, Distributed Programs with Partitioned Tables

Discretized Streams (HotCloud 2012)An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters MapReduce Online(NSDI 2010)

MotivationMapReduce greatly simplified “big data” analysis on large, unreliable clustersBut as soon as it got popular, users wanted more:

»More complex, multi-stage applications(e.g. iterative machine learning & graph processing)

»More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph

processing)

MotivationComplex apps and interactive queries both two things that MapReduce lacks:• Efficient primitives for data sharing• A means for pipelining or continuous

processing

MapReduce System Model•Designed for batch-oriented computations over large data sets

–Each operator runs to completion before producing any output– Barriers between stages

–Only way to share data is by using stable storage• Map output to local disk, reduce output to HDFS

Examplesiter. 1 iter. 2 . .

.Input

HDFSread

HDFSwrite

HDFSread

HDFSwrite

Input

query 1

query 2

query 3

result 1

result 2

result 3

. . .

HDFSread

Slow due to replication and disk I/O,but necessary for fault tolerance

iter. 1 iter. 2 . . .

Input

Goal: In-Memory Data Sharing

Input

query 1

query 2

query 3. . .

one-timeprocessing

10-100× faster than network/disk, but how to get FT?

Challenge

How to design a distributed memory abstraction that is both fault-tolerant

and efficient?

Approach 1: Fine-grainedE.g., Piccolo (Others: RAMCloud; DSM)Distributed Shared TableImplemented as an in-memory (dist) key-value store

Kernel FunctionsOperate on in-memory state concurrently on many machinesSequential code that reads from and writes to distributed table

Using the storeget(key)put(key,value)update(key,value)flush()get_iterator(partition)

User specified policies… For partitioningHelps programmers express data locality preferencesPiccolo ensures all entries in a partition reside on the same machineE.g., user can locate kernel with partition, and/or co-locate partitions of different related tables

User-specified policies … for resolving conflicts (multiple kernels writing)User defines an accumulation function (works if results independent of update order)… for checkpointing and restorePiccolo stores global state snapshot; relies on user to check-point kernel execution state

Fine-Grained: ChallengeExisting storage abstractions have interfaces based on fine-grained updates to mutable stateRequires replicating data or logs across nodes for fault tolerance

»Costly for data-intensive apps»10-100x slower than memory write

Coarse Grained: Resilient Distributed Datasets (RDDs)Restricted form of distributed shared memory

»Immutable, partitioned collections of records

»Can only be built through coarse-grained deterministic transformations (map, filter, join, …)

Efficient fault recovery using lineage»Log one operation to apply to many

elements»Recompute lost partitions on failure»No cost if nothing fails

Input

query 1

query 2

query 3

. . .

RDD Recovery

one-timeprocessing

iter. 1 iter. 2 . . .

Input

Generality of RDDsDespite their restrictions, RDDs can express surprisingly many parallel algorithms

»These naturally apply the same operation to many items

Unify many current programming models»Data flow models: MapReduce, Dryad, SQL, …»Specialized models for iterative apps: BSP (Pregel),

iterative MapReduce (Haloop), bulk incremental, …

Support new apps that these models don’t

Memorybandwidth

Networkbandwidth

Tradeoff Space

Granularityof Updates

Write Throughput

Fine

CoarseLow High

K-V stores,databases,RAMCloud

Best for batchworkloads

Best fortransactional

workloads

HDFS RDDs

Spark Programming InterfaceDryadLINQ-like API in the Scala languageUsable interactively from Scala interpreterProvides:

»Resilient distributed datasets (RDDs)»Operations on RDDs: transformations (build

new RDDs), actions (compute and output results)

»Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc)

Spark Operations

Transformations

(define a new RDD)

mapfilter

samplegroupByKeyreduceByKey

sortByKey

flatMapunionjoin

cogroupcross

mapValues

Actions(return a result

to driver program)

collectreducecountsave

lookupKey

Task SchedulerDryad-like DAGsPipelines functionswithin a stageLocality & data reuse awarePartitioning-awareto avoid shuffles

join

union

groupBy

map

Stage 3

Stage 1

Stage 2

A: B:

C: D:

E:

F:

G:

= cached data partition

Example: Log MiningLoad error messages from a log into memory, then interactively search for various patternslines = spark.textFile(“hdfs://...”)

errors = lines.filter(_.startsWith(“ERROR”))messages = errors.map(_.split(‘\t’)(2))messages.persist()

Block 1

Block 2

Block 3

Worker

Worker

Worker

Master

messages.filter(_.contains(“foo”)).countmessages.filter(_.contains(“bar”)).count

tasksresults

Msgs. 1

Msgs. 2

Msgs. 3

Base RDDTransformed

RDD

Action

Result: full-text search of Wikipedia in <1 sec (vs 20

sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec

(vs 170 sec for on-disk data)

RDDs track the graph of transformations that built them (their lineage) to rebuild lost dataE.g.:

messages = textFile(...).filter(_.contains(“error”)) .map(_.split(‘\t’)(2))

HadoopRDDpath = hdfs://…

FilteredRDDfunc =

_.contains(...)

MappedRDDfunc = _.split(…)

Fault Recovery

HadoopRDD FilteredRDD MappedRDD

Fault Recovery Results

1 2 3 4 5 6 7 8 9 10020406080

100120140 119

57 56 58 5881

57 59 57 59

Iteration

Iter

atri

on t

ime

(s) Failure happens

Example: PageRank1. Start each page with a rank of 12. On each iteration, update each page’s rank to

Σi∈neighbors ranki / |neighborsi|links = // RDD of (url, neighbors) pairsranks = // RDD of (url, rank) pairs

Example: PageRank1. Start each page with a rank of 12. On each iteration, update each page’s rank to

Σi∈neighbors ranki / |neighborsi|links = // RDD of (url, neighbors) pairsranks = // RDD of (url, rank) pairsfor (i <- 1 to ITERATIONS) { ranks = links.join(ranks).flatMap { (url, (links, rank)) => links.map(dest => (dest, rank/links.size)) }.reduceByKey(_ + _)}

Optimizing Placement

links & ranks repeatedly joinedCan co-partition them (e.g. hash both on URL) to avoid shufflesCan also use app knowledge, e.g., hash on DNS name

links = links.partitionBy( new URLPartitioner())

reduceContribs0

join

joinContribs2

Ranks0(url, rank)

Links(url,

neighbors)

. . .

Ranks2

reduce

Ranks1

PageRank Performance

01020304050607080 72

23

HadoopBasic SparkSpark + Con-trolled Partition-ingTi

me

per

iter

-at

ion

(s)

Programming Models Implemented on SparkRDDs can express many existing parallel models

»MapReduce, DryadLINQ»Pregel graph processing [200 LOC]»Iterative MapReduce [200 LOC]»SQL: Hive on Spark (Shark) [in progress]

Enables apps to efficiently intermix these models

All are based oncoarse-grained operations

Spark: SummaryRDDs offer a simple and efficient programming model for a broad range of applicationsLeverage the coarse-grained nature of many parallel algorithms for low-overhead recovery

Issues?

Discretized Streams

Putting in-memory frameworks to work…

Motivation• Many important applications need to process

large data streams arriving in real time– User activity statistics (e.g. Facebook’s Puma)– Spam detection– Traffic estimation– Network intrusion detection

• Target: large-scale apps that must run on tens-hundreds of nodes with O(1 sec) latency

Challenge• To run at large scale, system has to

be both:– Fault-tolerant: recover quickly from

failures and stragglers– Cost-efficient: do not require

significant hardware beyond that needed for basic processing

• Existing streaming systems don’t have both properties

Traditional Streaming Systems

• “Record-at-a-time” processing model– Each node has mutable state– For each record, update state & send new records

mutable state

node 1 node

3

input records push

node 2

input records

Traditional Streaming SystemsFault tolerance via replication or upstream backup:

node 1 node

3node

2

node 1’ node

3’node

2’

synchronization

node 1 node

3

node 2

standby

input

input

input

input


node 1 node

3node

2

node 1’ node

3’node

2’

synchronization

node 1 node

3

node 2

standby

input

input

input

input

Fast recovery, but 2x hardware

cost

Only need 1 standby, but

slow to recover


node 1 node

3node

2

node 1’ node

3’node

2’

synchronization

node 1 node

3

node 2

standby

input

input

input

input

Neither approach tolerates stragglers

Observation• Batch processing models for clusters (e.g.

MapReduce) provide fault tolerance efficiently– Divide job into deterministic tasks– Rerun failed/slow tasks in parallel on other nodes

• Idea: run a streaming computation as a series of very small, deterministic batches– Same recovery schemes at much smaller timescale– Work to make batch size as small as possible

Discretized Stream Processing

t = 1:

t = 2:

stream 1 stream 2

batch operation

pullinput

… …

input

immutable dataset

(stored reliably)

immutable dataset

(output or state);

stored in memorywithout

replication

…

Parallel Recovery• Checkpoint state datasets periodically• If a node fails/straggles, recompute its

dataset partitions in parallel on other nodesmap

input dataset

Faster recovery than upstream backup,

without the cost of replication

output dataset

Programming Model• A discretized stream (D-stream) is a

sequence of immutable, partitioned datasets– Specifically, resilient distributed

datasets (RDDs), the storage abstraction in Spark

• Deterministic transformations operators produce new streams

D-Streams Summary• D-Streams forgo traditional

streaming wisdom by batching data in small timesteps

• Enable efficient, new parallel recovery scheme

MapReduce Online

…..pipelining in map-reduce

Stream Processing with HOP

• Run MR jobs continuously, and analyze data as it arrives

• Map and reduce tasks run continuously• Reduce function divides stream into windows

– “Every 30 seconds, compute the 1, 5, and 15 minute average network utilization; trigger an alert if …”

– Window management done by user (reduce)

Dataflow in Hadoop

map

map

reduce

reduce

Local FS

Local FS

HTTP GET

Hadoop Online Prototype• HOP supports pipelining within and between

MapReduce jobs: push rather than pull– Preserve simple fault tolerance scheme– Improved job completion time (better cluster utilization)– Improved detection and handling of stragglers

• MapReduce programming model unchanged– Clients supply same job parameters

• Hadoop client interface backward compatible– No changes required to existing clients

• E.g., Pig, Hive, Sawzall, Jaql– Extended to take a series of job

Pipelining Batch Size

• Initial design: pipeline eagerly (for each row)– Prevents use of combiner– Moves more sorting work to mapper– Map function can block on network I/O

• Revised design: map writes into buffer– Spill thread: sort & combine buffer, spill to disk– Send thread: pipeline spill files => reducers

• Simple adaptive algorithm

Pipeline request

Dataflow in HOP

Schedule

Schedule + Location

map

map

reduce

reduce

Online Aggregation• Traditional MR: poor UI for data analysis• Pipelining means that data is available at

consumers “early”– Can be used to compute and refine an approximate

answer– Often sufficient for interactive data analysis,

developing new MapReduce jobs, ...• Within a single job: periodically invoke reduce

function at each reduce task on available data• Between jobs: periodically send a “snapshot”

to consumer jobs

Intra-Job Online Aggregation• Approximate answers published to HDFS by

each reduce task• Based on job progress: e.g. 10%, 20%, …• Challenge: providing statistically meaningful

approximations– How close is an approximation to the final answer?– How do you avoid biased samples?

• Challenge: reduce functions are opaque– Ideally, computing 20% approximation should reuse

results of 10% approximation– Either use combiners, or HOP does redundant work

Inter-Job Online Aggregation

Write Answer

HDFS

map

mapJob 2 Mappers

reduce

reduce

Job 1 Reducers

Inter-Job Online Aggregation

• Like intra-job OA, but approximate answers are pipelined to map tasks of next job– Requires co-scheduling a sequence of jobs

• Consumer job computes an approximation– Can be used to feed an arbitrary chain of

consumer jobs with approximate answers• Challenge: how to avoid redundant work

– Output of reduce for 10% progress vs. for 20%

Resilient Distributed Datasets (NSDI 2012)

Documents

large data

data sharinga

data miningbut

dataintensive apps10

memory data sharing

distributed memory abstraction

big data analysis

memory state