Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury...
2012, University of California, Berkeley
OUTLINE
• Introduction
• Resilient Distributed Datasets (RDDs)
• Representing RDDs
• Evaluation
• Conclusion
Introduction
Cluster computing frameworks such as MapReduce handle iterative machine learning and graph algorithms poorly,
because of data replication, disk I/O, and serialization overhead.
Introduction
Pregel is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop offers an iterative MapReduce interface.
However, these systems support only specific computation patterns; they do not provide abstractions for more general data reuse.
Introduction
RDDs define a programming interface that provides fault tolerance efficiently.
RDDs vs. distributed shared memory (DSM):
• DSM: fine-grained updates to mutable state
• RDDs: coarse-grained transformations (e.g., map, filter, and join), recovered through lineage
Resilient Distributed Datasets (RDDs)
An RDD's transformations are lazy operations that define a new RDD, while actions launch a computation to return a value to the program or write data to external storage.
Resilient Distributed Datasets (RDDs)
An RDD is a read-only, partitioned collection of records that can only be created from (1) data in stable storage or (2) other RDDs.
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.count()
Resilient Distributed Datasets (RDDs)
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
number = errors.count()
[Diagram: lines (RDD1) → filter (transformation) → errors (RDD2) → count (action) → Long]
Resilient Distributed Datasets (RDDs)
DEMO
Resilient Distributed Datasets (RDDs)
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()   // or equivalently: errors.cache()
[Diagram: lines (RDD1) → errors (RDD2) → persisted errors (RDD3); RDD3 is kept in memory]
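A minimal sketch of how persistence changes the example (assuming a live SparkContext named `sc`; in Spark, `persist()` with the default storage level is the same as `cache()`):

```scala
// Sketch, assuming a running SparkContext `sc`.
val lines  = sc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR")).persist() // mark for in-memory reuse
errors.count()  // first action: reads from HDFS, filters, and caches the result
errors.count()  // later actions reuse the in-memory partitions instead of re-reading HDFS
```

Note that `persist()` is itself lazy: it only marks the RDD, and the partitions are materialized in memory the first time an action computes them.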
Resilient Distributed Datasets (RDDs)
[Diagram: RDD1 → (transformation) → RDD2 → (action) → Long]
Lineage enables fault tolerance:
if RDD2 is lost,
Spark recomputes it from RDD1 by replaying the recorded transformation.
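The recovery idea can be sketched without Spark at all: if each dataset remembers its parent and the function that derived it, a lost result is rebuilt by recomputation rather than replication. (The `Lineage` class below is purely illustrative, not Spark's API.)

```scala
// Illustrative model of lineage-based recovery; not Spark's actual classes.
case class Lineage[A, B](parent: Seq[A], transform: Seq[A] => Seq[B]) {
  def compute(): Seq[B] = transform(parent) // recompute from the parent on demand
}

val lines  = Seq("ERROR disk full", "INFO ok", "ERROR timeout")
val errors = Lineage(lines, (xs: Seq[String]) => xs.filter(_.startsWith("ERROR")))

// If the materialized result of `errors` (RDD2) is lost, replay the transformation:
val recovered = errors.compute() // Seq("ERROR disk full", "ERROR timeout")
```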
Resilient Distributed Datasets (RDDs)
Spark provides the RDD abstraction through a language-integrated API in Scala,
a functional programming language for the Java VM.
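Language integration means users pass ordinary Scala closures to RDD operations; Spark serializes each closure, including the variables it captures, and ships it to the workers. A small sketch (assuming a SparkContext `sc`):

```scala
// Sketch, assuming a running SparkContext `sc`.
val threshold = 100 // a local variable captured by the closure below
val longLines = sc.textFile("hdfs://...")
  .filter(line => line.length > threshold) // the closure (with `threshold`) runs on the workers
```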
Representing RDDs
Dependencies between RDDs:
• narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD, allowing pipelined execution on one cluster node
• wide dependencies: require data from all parent partitions to be available and shuffled across the nodes using a MapReduce-like operation
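A sketch of which common operations produce which kind of dependency (assuming a SparkContext `sc`; `join` is wide in general but narrow when both parents share a partitioner):

```scala
// Sketch, assuming a running SparkContext `sc`.
val pairs = sc.textFile("hdfs://...").map(line => (line.split(" ")(0), 1))

val narrow = pairs.filter(_._2 > 0)  // narrow: each child partition depends on one parent partition
val wide   = pairs.groupByKey()      // wide: every child partition may need all parent partitions
```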
Representing RDDs
[Diagram: narrow dependencies stay within the same node; wide dependencies move data between different nodes]
Representing RDDs
How Spark computes job stages
[Diagram legend: outer box = RDD, inner rectangle = partition, shaded = RDD already in memory]
Resilient Distributed Datasets (RDDs)
Each stage contains as many pipelined transformations with narrow dependencies as possible,
because this avoids shuffling data across the nodes.
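The same rule can be seen in a word-count pipeline: the narrow `flatMap` and `map` are fused into one stage, and the shuffle for `reduceByKey` starts a new stage. A sketch assuming a SparkContext `sc` (`toDebugString` prints the lineage, with indentation marking shuffle boundaries):

```scala
// Sketch, assuming a running SparkContext `sc`.
val counts = sc.textFile("hdfs://...")
  .flatMap(_.split(" "))      // narrow: pipelined
  .map(word => (word, 1))     // narrow: pipelined into the same stage
  .reduceByKey(_ + _)         // wide: the shuffle here ends the first stage
println(counts.toDebugString) // shows the lineage and the shuffle boundary
```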
Evaluation
Amazon EC2: m1.xlarge nodes with 4 cores and 15 GB of RAM each; HDFS for storage, with 256 MB blocks.
Evaluation
10 iterations on 100 GB datasets using 25–100 machines,
running logistic regression and k-means.
Logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.
Evaluation
HadoopBinMem: a Hadoop deployment that converts the input data to a binary format and stores it in memory.
Evaluation
PageRank on a 54 GB Wikipedia dump (about 4 million articles), 10 iterations.
Evaluation
PageRank, 10 iterations.
Evaluation
k-means, 100 GB of data, 75 nodes, 10 iterations.
Fault recovery: one node fails at the start of the 6th iteration.
Evaluation
k-means, 100 GB of data, 75 nodes, 10 iterations.
Evaluation
Logistic regression, 100 GB of data, 25 machines.
Behavior with insufficient memory.
Evaluation
k-means, 100 GB of data, 25 machines.
Conclusion
RDDs: an efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications.
RDDs offer an API based on coarse-grained transformations that lets them recover data efficiently using lineage.
Spark vs. Hadoop: up to 20× faster in iterative applications, and Spark can be used interactively to query hundreds of gigabytes of data.