Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury... 2012, University of California, Berkeley
Transcript
Page 1: RDD

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster

Computing

Matei Zaharia, Mosharaf Chowdhury...

2012 University of California, Berkeley

Page 2: RDD

OUTLINE

• Introduction

• Resilient Distributed Datasets (RDDs)

• Representing RDDs

• Evaluation

• Conclusion

Page 3: RDD

Introduction

Cluster computing frameworks like MapReduce perform poorly on iterative machine learning and graph algorithms

because of data replication, disk I/O, and serialization overhead

Page 4: RDD

Introduction

Pregel is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop offers an iterative MapReduce interface.

However, they support only specific computation patterns and do not provide abstractions for more general data reuse.

Page 5: RDD

Introduction

RDDs define a programming interface that can provide fault tolerance efficiently.

RDDs vs. distributed shared memory:

• distributed shared memory: fine-grained updates to mutable state

• RDDs: coarse-grained transformations (e.g., map, filter, and join), recoverable via lineage

Page 6: RDD

Resilient Distributed Datasets (RDDs)

An RDD's transformations are lazy operations that define a new RDD, while actions launch a computation to return a value to the program or write data to external storage.
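The lazy/eager split can be sketched with plain Python generators (an analogy only, not Spark's API or implementation): a transformation merely describes a computation, and nothing executes until an action consumes the result.

```python
# Analogy in plain Python (not Spark): "transformations" build a lazy
# pipeline; only the "action" at the end forces any work to happen.

log = ["ERROR disk full", "INFO ok", "ERROR timeout"]

# "Transformation": defines the filtered dataset but does not run yet.
errors = (line for line in log if line.startswith("ERROR"))

# "Action": consuming the generator actually performs the filtering.
count = sum(1 for _ in errors)
print(count)  # 2
```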

Page 7: RDD

Resilient Distributed Datasets (RDDs)

Page 8: RDD

Resilient Distributed Datasets (RDDs)

An RDD is a read-only, partitioned collection of records that can only be created from (1) data in stable storage or (2) other RDDs.

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.count()

Page 9: RDD

Resilient Distributed Datasets (RDDs)

lines = spark.textFile("hdfs://...")

errors = lines.filter(_.startsWith("ERROR"))

number = errors.count()

[Diagram: lines (RDD1) is filtered (a transformation) into errors (RDD2); count (an action) returns a Long.]

Page 10: RDD

Resilient Distributed Datasets (RDDs)

DEMO

Page 11: RDD

Resilient Distributed Datasets (RDDs)

lines = spark.textFile("hdfs://...")

errors = lines.filter(_.startsWith("ERROR"))

errors.persist() // or errors.cache(), which is equivalent

[Diagram: RDD1 (lines), RDD2 (errors), RDD3 (persisted errors); after persist(), the errors RDD stays in memory.]
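What persisting buys can be sketched in plain Python rather than Spark (an analogy, not Spark's caching machinery): materialize the filtered result once, and every later "action" reuses the in-memory copy instead of re-running the pipeline.

```python
# Analogy in plain Python (not Spark): persisting = materializing a
# derived dataset in memory so repeated actions do not recompute it.

log = ["ERROR disk full", "INFO ok", "ERROR timeout"]
compute_calls = 0

def filter_errors(lines):
    global compute_calls
    compute_calls += 1  # count how often the pipeline really runs
    return [l for l in lines if l.startswith("ERROR")]

# "persist()" analog: run the pipeline once and keep the result.
errors_cached = filter_errors(log)

# Several "actions" against the cached copy: no recomputation.
total = len(errors_cached)
first = errors_cached[0]

print(compute_calls)  # 1: the filter ran exactly once
```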

Page 12: RDD

Resilient Distributed Datasets (RDDs)

[Diagram: RDD1 is transformed into RDD2; an action returns a Long.]

Lineage enables fault tolerance:

if RDD2 is lost, Spark recomputes it from RDD1 using the recorded transformation.
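The recovery idea can be sketched in a few lines of Python (a toy model, not Spark's internals): each dataset records its parent and the transformation that derived it, so a lost result is rebuilt by replaying the lineage.

```python
# Toy model of lineage-based recovery (not Spark's internals): a
# dataset remembers its parent and the transformation that derived it,
# so lost data can be recomputed instead of restored from replicas.

class Dataset:
    def __init__(self, parent=None, transform=None, data=None):
        self.parent = parent        # lineage: source dataset
        self.transform = transform  # lineage: how it was derived
        self.data = data            # may be lost on node failure

    def compute(self):
        if self.data is None:       # lost: replay the lineage
            self.data = self.transform(self.parent.compute())
        return self.data

rdd1 = Dataset(data=["ERROR a", "INFO b", "ERROR c"])
rdd2 = Dataset(parent=rdd1,
               transform=lambda rows: [r for r in rows if r.startswith("ERROR")])

rdd2.compute()              # first computation
rdd2.data = None            # simulate losing RDD2
recovered = rdd2.compute()  # recomputed from RDD1 via lineage
print(recovered)            # ['ERROR a', 'ERROR c']
```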

Page 13: RDD

Resilient Distributed Datasets (RDDs)

Spark provides the RDD abstraction through a language-integrated API in Scala,

a functional programming language for the Java VM.

Page 14: RDD

Representing RDDs

dependencies between RDDs

• narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD, allowing pipelined execution on one cluster node

• wide dependencies: require data from all parent partitions to be available and shuffled across the nodes using a MapReduce-like operation
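The difference can be sketched in plain Python (an analogy, not Spark code): with narrow dependencies each partition is processed independently, while a key-grouping step needs records from every partition, which is what forces the shuffle.

```python
# Analogy in plain Python (not Spark): narrow dependencies let each
# partition be transformed independently; a grouping step has wide
# dependencies, since any key may appear in any partition.

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: each output partition needs only its own input partition,
# so this work pipelines per partition with no data movement.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]

# Wide: grouping by key must consult every partition (the "shuffle"),
# because records sharing a key can live anywhere.
grouped = {}
for part in doubled:
    for k, v in part:
        grouped.setdefault(k, []).append(v)

print(grouped)  # {'a': [2, 6], 'b': [4], 'c': [8]}
```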

Page 15: RDD

Representing RDDs

[Figure: narrow dependencies (data stays in the same node) vs. wide dependencies (data moves between different nodes).]

Page 16: RDD

Representing RDDs

How Spark computes job stages

[Figure legend: boxes are RDDs; rectangles inside are partitions, shown in black if the RDD is already in memory.]

Page 17: RDD

Resilient Distributed Datasets (RDDs)

Each stage contains as many pipelined transformations with narrow dependencies as possible,

because this avoids shuffling data across the nodes.

Page 18: RDD

Evaluation

Amazon m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM each. HDFS was used for storage, with 256 MB blocks.

Page 19: RDD

Evaluation

10 iterations on 100 GB datasets using 25–100 machines.

Workloads: logistic regression and k-means.

Logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.

Page 20: RDD

Evaluation

HadoopBinMem: a Hadoop variant that converts the input data to a binary format and stores it in memory.

Page 21: RDD

Evaluation

54 GB Wikipedia dump, 4 million articles.

PageRank, 10 iterations.

Page 22: RDD

Evaluation

PageRank, 10 iterations.

Page 23: RDD

Evaluation

100 GB data, 75 nodes, 10 iterations.

k-means

Fault recovery: one node fails at the start of the 6th iteration.

Page 24: RDD

Evaluation

100 GB data, k-means, 75 nodes, 10 iterations.

Page 25: RDD

Evaluation

100 GB data, 25 machines.

logistic regression

Behavior with insufficient memory

Page 26: RDD

Evaluation

100 GB data, k-means, 25 machines.

Page 27: RDD

Conclusion

RDDs: an efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications.

RDDs offer an API based on coarse-grained transformations that lets them recover data efficiently using lineage.

Spark vs. Hadoop: up to 20× faster in iterative applications, and usable interactively to query hundreds of gigabytes of data.