Data-Intensive Distributed Computing
Part 2: From MapReduce to Spark (2/2)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
CS 451/651 431/631 (Winter 2018)
Jimmy Lin
David R. Cheriton School of Computer Science
University of Waterloo
January 23, 2018
These slides are available at http://lintool.github.io/bigdata-2018w/
Source: Wikipedia (The Scream)
An Apt Quote
All problems in computer science can be solved by another level of indirection... Except for the
problem of too many layers of indirection.
- David Wheeler
Source: Google
The datacenter is the computer!
What’s the instruction set?
What are the abstractions?
MapReduce:
map f: (K1, V1) ⇒ List[(K2, V2)]
reduce g: (K2, Iterable[V2]) ⇒ List[(K3, V3)]
Input: List[(K1, V1)] ⇒ Output: List[(K3, V3)]
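As a concrete illustration of these signatures, here is a minimal plain-Python sketch (not Hadoop code; all names hypothetical) that simulates the map, shuffle/sort, and reduce phases, with word count as f and g:

```python
from itertools import groupby
from operator import itemgetter

# map f: (K1, V1) => List[(K2, V2)]  -- here (docid, text) => [(word, 1)]
def map_f(docid, text):
    return [(word, 1) for word in text.split()]

# reduce g: (K2, Iterable[V2]) => List[(K3, V3)] -- here (word, counts) => [(word, total)]
def reduce_g(word, counts):
    return [(word, sum(counts))]

def run_mapreduce(records, map_f, reduce_g):
    # Map phase: apply f to every input record.
    intermediate = [kv for k, v in records for kv in map_f(k, v)]
    # Shuffle/sort: the framework groups intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    grouped = groupby(intermediate, key=itemgetter(0))
    # Reduce phase: apply g to each key and its iterable of values.
    return [kv for key, pairs in grouped
            for kv in reduce_g(key, (v for _, v in pairs))]

output = run_mapreduce([("d1", "a b a"), ("d2", "b c")], map_f, reduce_g)
# output: [("a", 2), ("b", 2), ("c", 1)]
```

The point of the sketch is that the framework owns the middle step (sort and group by key); the programmer supplies only f and g.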
Spark transformations:
map f: (T) ⇒ U (RDD[T] ⇒ RDD[U])
filter f: (T) ⇒ Boolean (RDD[T] ⇒ RDD[T])
flatMap f: (T) ⇒ TraversableOnce[U] (RDD[T] ⇒ RDD[U])
mapPartitions f: (Iterator[T]) ⇒ Iterator[U] (RDD[T] ⇒ RDD[U])
groupByKey (RDD[(K, V)] ⇒ RDD[(K, Iterable[V])])
reduceByKey f: (V, V) ⇒ V (RDD[(K, V)] ⇒ RDD[(K, V)])
aggregateByKey seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U (RDD[(K, V)] ⇒ RDD[(K, U)])
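To make the semantics of these transformations concrete, here is a plain-Python sketch that models an RDD as a list (an assumption for illustration only; function names are hypothetical, not the Spark API):

```python
from collections import defaultdict

def rdd_map(rdd, f):          # map f: (T) => U
    return [f(x) for x in rdd]

def rdd_filter(rdd, f):       # filter f: (T) => Boolean
    return [x for x in rdd if f(x)]

def rdd_flat_map(rdd, f):     # flatMap f: (T) => TraversableOnce[U]
    return [y for x in rdd for y in f(x)]

def rdd_reduce_by_key(rdd, f):  # reduceByKey f: (V, V) => V
    acc = {}
    for k, v in rdd:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

def rdd_aggregate_by_key(rdd, zero, seq_op, comb_op, partitions=2):
    # aggregateByKey: seqOp folds each value into a per-partition
    # accumulator of type U; combOp merges accumulators across partitions.
    per_part = [defaultdict(lambda: zero) for _ in range(partitions)]
    for i, (k, v) in enumerate(rdd):
        part = per_part[i % partitions]       # toy partitioning scheme
        part[k] = seq_op(part[k], v)
    merged = {}
    for part in per_part:
        for k, u in part.items():
            merged[k] = comb_op(merged[k], u) if k in merged else u
    return sorted(merged.items())

pairs = [("a", 1), ("b", 2), ("a", 3)]
# rdd_reduce_by_key(pairs, lambda x, y: x + y) -> [("a", 4), ("b", 2)]
```

Note the design distinction: reduceByKey requires value and result to have the same type V, while aggregateByKey separates the within-partition fold (seqOp) from the cross-partition merge (combOp), allowing a result type U different from V.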
Approach 1: turn synchronization into an ordering problem
Sort keys into the correct order of computation
Partition the key space so each reducer receives the appropriate set of partial results
Hold state in the reducer across multiple key-value pairs to perform the computation
Illustrated by the “pairs” approach

Approach 2: data structures that bring partial results together
Each reducer receives all the data it needs to complete the computation
Illustrated by the “stripes” approach
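A plain-Python sketch (toy corpus and names hypothetical) contrasting the two approaches on word co-occurrence counting; in-memory dictionaries stand in for the mapper emissions and reducer merging:

```python
from collections import Counter

# Toy corpus; a co-occurrence is two distinct words on the same line.
doc = ["a b", "a b", "a c"]

# Pairs approach: emit one ((w, u), 1) record per co-occurrence;
# the reducer only has to sum the counts arriving for each pair key.
pairs = Counter()
for line in doc:
    words = line.split()
    for w in words:
        for u in words:
            if u != w:
                pairs[(w, u)] += 1

# Stripes approach: emit (w, {u: count}) maps; all partial results for w
# reach a single reducer together and are merged element-wise.
stripes = {}
for line in doc:
    words = line.split()
    for w in words:
        stripe = stripes.setdefault(w, Counter())
        for u in words:
            if u != w:
                stripe[u] += 1

# Both encode the same counts: pairs[("a", "b")] == 2 and stripes["a"]["b"] == 2
```

Pairs produces many small records and relies on the shuffle to group them; stripes produces fewer, larger records and pre-groups partial results in the data structure itself.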
Because you can’t avoid this…
But commutative monoids help
Synchronization: Pairs vs. Stripes
Approach 1: turn synchronization into an ordering problem
Sort keys into the correct order of computation
Partition the key space so each reducer receives the appropriate set of partial results
Hold state in the reducer across multiple key-value pairs to perform the computation
Illustrated by the “pairs” approach

Approach 2: data structures that bring partial results together
Each reducer receives all the data it needs to complete the computation

For this to work:
Emit extra (a, *) for every bn in mapper
Make sure all a’s get sent to the same reducer (use partitioner)
Make sure (a, *) comes first (define sort order)
Hold state in the reducer across different key-value pairs
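The steps above are the order-inversion design pattern. A plain-Python sketch (data and names hypothetical) for computing relative frequencies f(b | a) = count(a, b) / count(a, *), where sorting guarantees the marginal arrives before the pairs that need it:

```python
# Hypothetical mapper input: observed (a, b) co-occurrences.
observed = [("a", "b"), ("a", "b"), ("a", "c")]

emitted = []
for a, b in observed:
    emitted.append(((a, "*"), 1))  # extra marginal record for count(a, *)
    emitted.append(((a, b), 1))

# The framework sorts keys before reducing; "*" sorts before letters in
# ASCII, so every (a, "*") reaches the reducer before any (a, b).
emitted.sort(key=lambda kv: kv[0])

rel_freq = {}
current_a, marginal = None, 0
for (a, b), count in emitted:
    if b == "*":
        if a != current_a:              # new left word: reset held state
            current_a, marginal = a, 0
        marginal += count               # accumulate count(a, *)
    else:
        # marginal is complete by the time any (a, b) arrives
        rel_freq[(a, b)] = rel_freq.get((a, b), 0) + count / marginal

# rel_freq == {("a", "b"): 2/3, ("a", "c"): 1/3}
```

In a real job the partitioner must also hash on a alone, so that (a, *) and every (a, b) land in the same reducer.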
Two superpowers:
Associativity
Commutativity
(sorting)
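A minimal sketch of why these two properties matter: the framework (combiners, partial aggregation) may merge partial results in any order, so only an associative and commutative operation with an identity, i.e. a commutative monoid, gives a deterministic answer. Plain Python, names hypothetical:

```python
from functools import reduce
from itertools import permutations

def merge_results(order, op, identity):
    # Fold partial results in the given order, as a combiner might.
    return reduce(op, order, identity)

values = [3, 1, 4]

# Addition is a commutative monoid (associative, commutative, identity 0):
# every ordering of the partial results yields the same total.
sums = {merge_results(p, lambda x, y: x + y, 0) for p in permutations(values)}
# sums == {8}

# String concatenation is associative but NOT commutative: different
# orderings disagree, so it is unsafe to merge such partial results
# in arbitrary order.
concats = {merge_results(p, lambda x, y: x + y, "") for p in permutations("abc")}
# len(concats) == 6
```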
When you can’t “monoidify”…
Sequence your computations by sorting
An Apt Quote
All problems in computer science can be solved by another level of indirection... Except for the
problem of too many layers of indirection.
- David Wheeler
Source: Google
The datacenter is the computer!
What’s the instruction set?
What are the abstractions?
Exploit associativity and commutativity via commutative monoids (if you can)
Source: Wikipedia (Walnut)
Exploit framework-based sorting to sequence computations (if you can’t)