Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury...
2012, University of California, Berkeley
OUTLINE
• Introduction
• Resilient Distributed Datasets (RDDs)
• Representing RDDs
• Evaluation
• Conclusion
Introduction
Cluster computing frameworks such as MapReduce handle iterative machine learning and graph algorithms poorly,
because of data replication, disk I/O, and serialization overhead.
Introduction
Pregel is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop offers an iterative MapReduce interface.
However, these systems support only specific computation patterns; they do not provide abstractions for more general data reuse.
Introduction
RDDs define a programming interface that provides fault tolerance efficiently.
RDDs vs. distributed shared memory (DSM):
• DSM: fine-grained updates to mutable state
• RDDs: coarse-grained transformations (e.g., map, filter, and join), recovered through lineage
Resilient Distributed Datasets (RDDs)
An RDD's transformations are lazy operations that define a new RDD, while actions launch a computation to return a value to the program or write data to external storage.
Resilient Distributed Datasets (RDDs)
An RDD is a read-only, partitioned collection of records that can only be created from (1) data in stable storage or (2) other RDDs.
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.count()
Resilient Distributed Datasets (RDDs)
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
number = errors.count()
[Diagram: lines (RDD1) → filter (transformation) → errors (RDD2) → count (action) → Long]
Resilient Distributed Datasets (RDDs)
DEMO
Resilient Distributed Datasets (RDDs)
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()   // or equivalently: errors.cache()
[Diagram: lines (RDD1) → errors (RDD2) → persisted errors (RDD3); RDD3 is kept in memory]
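A minimal sketch of how persistence changes the example (assuming a live SparkContext named `sc`; in Spark, `persist()` with the default storage level is the same as `cache()`):

```scala
// Sketch, assuming a running SparkContext `sc`.
val lines  = sc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR")).persist() // mark for in-memory reuse
errors.count()  // first action: reads from HDFS, filters, and caches the result
errors.count()  // later actions reuse the in-memory partitions instead of re-reading HDFS
```

Note that `persist()` is itself lazy: it only marks the RDD, and the partitions are materialized in memory the first time an action computes them.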
Resilient Distributed Datasets (RDDs)
[Diagram: RDD1 → (transformation) → RDD2 → (action) → Long]
Lineage enables fault tolerance:
if RDD2 is lost,
Spark recomputes it from RDD1 by replaying the recorded transformation.
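The recovery idea can be sketched without Spark at all: if each dataset remembers its parent and the function that derived it, a lost result is rebuilt by recomputation rather than replication. (The `Lineage` class below is purely illustrative, not Spark's API.)

```scala
// Illustrative model of lineage-based recovery; not Spark's actual classes.
case class Lineage[A, B](parent: Seq[A], transform: Seq[A] => Seq[B]) {
  def compute(): Seq[B] = transform(parent) // recompute from the parent on demand
}

val lines  = Seq("ERROR disk full", "INFO ok", "ERROR timeout")
val errors = Lineage(lines, (xs: Seq[String]) => xs.filter(_.startsWith("ERROR")))

// If the materialized result of `errors` (RDD2) is lost, replay the transformation:
val recovered = errors.compute() // Seq("ERROR disk full", "ERROR timeout")
```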
Resilient Distributed Datasets (RDDs)
Spark provides the RDD abstraction through a language-integrated API in Scala,
a functional programming language for the Java VM.
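Language integration means users pass ordinary Scala closures to RDD operations; Spark serializes each closure, including the variables it captures, and ships it to the workers. A small sketch (assuming a SparkContext `sc`):

```scala
// Sketch, assuming a running SparkContext `sc`.
val threshold = 100 // a local variable captured by the closure below
val longLines = sc.textFile("hdfs://...")
  .filter(line => line.length > threshold) // the closure (with `threshold`) runs on the workers
```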
Representing RDDs
Dependencies between RDDs:
• narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD, allowing pipelined execution on one cluster node
• wide dependencies: require data from all parent partitions to be available and shuffled across the nodes using a MapReduce-like operation
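A sketch of which common operations produce which kind of dependency (assuming a SparkContext `sc`; `join` is wide in general but narrow when both parents share a partitioner):

```scala
// Sketch, assuming a running SparkContext `sc`.
val pairs = sc.textFile("hdfs://...").map(line => (line.split(" ")(0), 1))

val narrow = pairs.filter(_._2 > 0)  // narrow: each child partition depends on one parent partition
val wide   = pairs.groupByKey()      // wide: every child partition may need all parent partitions
```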
Representing RDDs
[Diagram: narrow dependencies stay within the same node; wide dependencies move data between different nodes]
Representing RDDs
How Spark computes job stages
[Diagram legend: outer box = RDD, inner rectangle = partition, shaded = RDD already in memory]
Resilient Distributed Datasets (RDDs)
Each stage contains as many pipelined transformations with narrow dependencies as possible,
because this avoids shuffling data across the nodes.
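The same rule can be seen in a word-count pipeline: the narrow `flatMap` and `map` are fused into one stage, and the shuffle for `reduceByKey` starts a new stage. A sketch assuming a SparkContext `sc` (`toDebugString` prints the lineage, with indentation marking shuffle boundaries):

```scala
// Sketch, assuming a running SparkContext `sc`.
val counts = sc.textFile("hdfs://...")
  .flatMap(_.split(" "))      // narrow: pipelined
  .map(word => (word, 1))     // narrow: pipelined into the same stage
  .reduceByKey(_ + _)         // wide: the shuffle here ends the first stage
println(counts.toDebugString) // shows the lineage and the shuffle boundary
```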
Evaluation
Amazon EC2: m1.xlarge nodes with 4 cores and 15 GB of RAM each; HDFS for storage, with 256 MB blocks.
Evaluation
10 iterations on 100 GB datasets using 25–100 machines,
running logistic regression and k-means.
Logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.
Evaluation
HadoopBinMem: a Hadoop deployment that converts the input data to a binary format and stores it in memory.
Evaluation
PageRank on a 54 GB Wikipedia dump (about 4 million articles), 10 iterations.
Evaluation
PageRank, 10 iterations.
Evaluation
k-means, 100 GB of data, 75 nodes, 10 iterations.
Fault recovery: one node fails at the start of the 6th iteration.
Evaluation
k-means, 100 GB of data, 75 nodes, 10 iterations.
Evaluation
Logistic regression, 100 GB of data, 25 machines.
Behavior with insufficient memory.
Evaluation
k-means, 100 GB of data, 25 machines.
Conclusion
RDDs: an efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications.
RDDs offer an API based on coarse-grained transformations that lets them recover data efficiently using lineage.
Spark vs. Hadoop: up to 20× faster in iterative applications, and Spark can be used interactively to query hundreds of gigabytes of data.