MapReduce for the Cell B.E. Architecture
Marc de Kruijf
University of Wisconsin-Madison, Department of Computer Science
Advised by Professor Sankaralingam
MapReduce
  A model for parallel programming
  Proposed by Google
  Large scale distributed systems – 1,000 node clusters
  Applications:
    Distributed sort
    Distributed grep
    Indexing
  Simple, high-level interface
  Runtime handles: parallelization, scheduling, synchronization, and communication
Cell B.E. Architecture
  A heterogeneous computing platform: 1 PPE, 8 SPEs
  Programming is hard
    Multi-threading is explicit
    SPE local memories are software-managed
  The Cell is like a “cluster-on-a-chip”
Motivation
  MapReduce
    Scalable parallel model
    Simple interface
  Cell B.E.
    Complex parallel architecture
    Hard to program
  MapReduce for the Cell B.E. Architecture
Overview
  Motivation
    MapReduce
    Cell B.E. Architecture
  MapReduce Example
  Design
  Evaluation
    Workload Characterization
    Application Performance
  Conclusions and Future Work
MapReduce Example
  Counting word occurrences in a set of documents
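A minimal sketch of the word-count example in C. It is not the runtime's code: the `Pair` type, the fixed key length, and the function names `map_words` and `reduce_key` are assumptions made for illustration. Map emits a (word, 1) pair per word, and Reduce sums the values that share a key.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEN 32  /* hypothetical limit on word length, including the terminator */

typedef struct { char key[MAX_LEN]; int value; } Pair;  /* hypothetical key/value pair */

/* Map: emit (word, 1) for every alphabetic word in doc.
   Returns the number of pairs written to out (at most cap). */
size_t map_words(const char *doc, Pair *out, size_t cap) {
    size_t n = 0;
    while (*doc && n < cap) {
        /* skip separators */
        while (*doc && !isalpha((unsigned char)*doc)) doc++;
        size_t len = 0;
        /* copy one lower-cased word, truncating overlong words */
        while (isalpha((unsigned char)*doc)) {
            if (len < MAX_LEN - 1)
                out[n].key[len++] = (char)tolower((unsigned char)*doc);
            doc++;
        }
        if (len > 0) {
            out[n].key[len] = '\0';
            out[n].value = 1;
            n++;
        }
    }
    return n;
}

/* Group + Reduce for one key: sum the values of all pairs carrying that key. */
int reduce_key(const Pair *pairs, size_t n, const char *key) {
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        if (strcmp(pairs[i].key, key) == 0)
            sum += pairs[i].value;
    return sum;
}
```

Here the grouping is a linear scan per key to keep the sketch short; a real runtime groups keys once for all pairs before calling Reduce.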
Design: Flow of Execution
  Five stages: Map, Partition, Quick-sort, Merge-sort, Reduce
  1. Map streams key/value pairs
  Key grouping implemented as:
    2. Partition – hash and distribute
    3. Quick-sort
    4. Merge-sort
       (stages 3 and 4 together form a two-phase external sort)
  5. Reduce “reduces” key/list-of-values pairs to key/value pairs
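The grouping and reduction stages above can be sketched in plain C as follows. This is an illustration under stated assumptions, not the runtime's implementation: the `Pair` type with integer keys, the modulo hash, the summing Reduce, and every function name are hypothetical, and the standard library's `qsort` stands in for the Quick-sort stage.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct { unsigned key; int value; } Pair;  /* hypothetical pair type */

/* Stage 2, Partition: hash the key to choose a partition (placeholder hash). */
unsigned partition_of(unsigned key, unsigned num_partitions) {
    return key % num_partitions;
}

static int by_key(const void *a, const void *b) {
    unsigned ka = ((const Pair *)a)->key, kb = ((const Pair *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Stage 3, Quick-sort: sort each fixed-size run independently
   (phase one of the external sort). */
void sort_runs(Pair *p, size_t n, size_t run_len) {
    for (size_t lo = 0; lo < n; lo += run_len) {
        size_t len = (n - lo < run_len) ? n - lo : run_len;
        qsort(p + lo, len, sizeof *p, by_key);
    }
}

/* Stage 4, Merge-sort: merge two sorted runs into dst
   (phase two of the external sort). */
void merge_runs(const Pair *a, size_t na, const Pair *b, size_t nb, Pair *dst) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        dst[k++] = (b[j].key < a[i].key) ? b[j++] : a[i++];
    while (i < na) dst[k++] = a[i++];
    while (j < nb) dst[k++] = b[j++];
}

/* Stage 5, Reduce: collapse each group of equal keys into one (key, sum)
   pair, in place. Returns the number of output pairs. */
size_t reduce_sum(Pair *sorted, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        Pair acc = sorted[i++];
        while (i < n && sorted[i].key == acc.key)
            acc.value += sorted[i++].value;
        sorted[out++] = acc;
    }
    return out;
}
```

Sorting in runs and then merging mirrors the reason for a two-phase sort: each run can be sized to fit in fast local memory, and the merge only needs to stream through already-sorted data.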
Evaluation Methodology
  MapReduce Model Characterization
    Synthetic micro-benchmark with six parameters
    Run on a 3.2 GHz Cell Blade
    Measured effect of each parameter on execution time
  Application Performance Comparison
    Six full applications
    MapReduce versions run on 3.2 GHz Cell Blade
    Single-threaded versions run on 2.4 GHz Core 2 Duo
  Evaluation
    Measured speedup by comparing execution times
    Measured overheads on the Cell by monitoring SPE idle time
    Measured ideal speedup assuming no Cell overheads
MapReduce Model Characterization
  Model Characteristics
    Characteristic     Description
    Map intensity      Execution cycles per input byte to Map
    Reduce intensity   Execution cycles per input byte to Reduce
    Map fan-out        Ratio of input size to output size in Map
    Reduce fan-in      Number of values per key in Reduce
    Partitions         Number of partitions
    Input size         Input size in bytes
  Effect on Execution Time
Application Performance
  Applications
    histogram: counts bitmap RGB occurrences
    kmeans: clustering algorithm
    linearReg: least-squares linear regression
    wordCount: word count
    NAS_EP: EP benchmark from NAS suite
    distSort: distributed sort
Speedup Over Core 2 Duo
Runtime Overheads
Conclusions and Future Work
  Conclusions
    Programmability benefits
    High performance on computationally intensive workloads
    Not applicable to all application types
  Future Work
    Additional performance tuning
    Extend for clusters of Cell processors
    Hierarchical MapReduce
Questions?
Backup Slides
MapReduce API
  void MapReduce_exec(MapReduceSpecification specification);
    The exec function initializes the MapReduce runtime and executes MapReduce according to the user specification.
  void MapReduce_emitIntermediate(void **key, void **value);
  void MapReduce_emit(void **value);
    These two functions are called by the user-defined Map and Reduce functions, respectively. They take references to pointers as arguments and modify each referenced pointer to point to pre-allocated storage. It is then the responsibility of the application to provision this storage.
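A sketch of how a user Map function might use the emit call. Only the signature of `MapReduce_emitIntermediate` comes from the slide; the body below is a stand-in stub, and the `int` key and value types, the fixed-size regions, and the `my_map` function are all assumptions made for illustration.

```c
#include <stddef.h>

#define SLOTS 8  /* hypothetical capacity of the pre-allocated output region */

/* Stand-ins for the runtime's pre-allocated Map output region (assumption). */
int key_region[SLOTS];
int value_region[SLOTS];
size_t next_slot = 0;

/* Stub with the slide's signature: redirect the caller's pointers
   to the next pre-allocated key and value slots. */
void MapReduce_emitIntermediate(void **key, void **value) {
    *key = &key_region[next_slot];
    *value = &value_region[next_slot];
    next_slot++;
}

/* Hypothetical user Map function: emit (byte, 1) for each input byte. */
void my_map(const unsigned char *input, size_t len) {
    for (size_t i = 0; i < len && next_slot < SLOTS; i++) {
        void *k, *v;
        MapReduce_emitIntermediate(&k, &v);  /* runtime chooses the storage */
        *(int *)k = input[i];                /* application fills it in */
        *(int *)v = 1;
    }
}
```

Because the runtime hands out the storage, the Map function never allocates or frees memory itself; it only writes through the pointers it receives.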
Optimizations
  1) Priority work queue
       Distributes load
       Avoids serialization
       Pipelined execution maximizes concurrency
  2) Double-buffering
  3) Application support
       Map only
       Map with sorted output
       Chaining invocations
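Double-buffering can be sketched in portable C as below. This is a simulation, not Cell code: `fetch` is a hypothetical stand-in that uses `memcpy`, where an SPE would start an asynchronous DMA transfer into its local store and wait for it only just before using the data. `CHUNK` and `sum_double_buffered` are likewise assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* hypothetical number of elements per local buffer */

/* Stand-in for an asynchronous transfer from main memory into local store. */
static void fetch(int *local, const int *remote, size_t n) {
    memcpy(local, remote, n * sizeof *local);
}

/* Sum n remote integers using two local buffers: while one buffer is
   being processed, the next chunk is fetched into the other. */
long sum_double_buffered(const int *remote, size_t n) {
    int buf[2][CHUNK];
    long total = 0;
    size_t cur = 0;
    if (n == 0) return 0;

    fetch(buf[cur], remote, n < CHUNK ? n : CHUNK);  /* prime the first buffer */
    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;
        size_t next_off = off + CHUNK;
        if (next_off < n) {
            /* start the next transfer into the idle buffer */
            size_t next_len = (n - next_off < CHUNK) ? n - next_off : CHUNK;
            fetch(buf[cur ^ 1], remote + next_off, next_len);
        }
        for (size_t i = 0; i < len; i++)  /* compute on the current buffer */
            total += buf[cur][i];
        cur ^= 1;                         /* swap buffer roles */
    }
    return total;
}
```

With a truly asynchronous transfer, the fetch of chunk k+1 overlaps the computation on chunk k, which is where the benefit comes from; the `memcpy` version only shows the buffer rotation.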
Optimizations
  4) Balanced merge (n / log(n) better bandwidth utilization as n → ∞)
  5) Map and Reduce output regions pre-allocated
       optimal memory alignment
       bulk memory transfers
       no user memory management
       no dynamic allocation overhead
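One reading of "balanced merge" is a pairwise, level-by-level merge of sorted runs, so that r runs are combined in about log2(r) passes over the data rather than by folding in one run at a time. That reading is an assumption, and the sketch below illustrates it in plain C with hypothetical names (`merge_two`, `balanced_merge`) and `int` elements; it is not the runtime's code.

```c
#include <stddef.h>
#include <string.h>

/* Merge sorted src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_two(const int *src, int *dst, size_t lo, size_t mid, size_t hi) {
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* data holds n ints arranged as sorted runs of run_len elements each
   (run_len > 0); scratch has room for n ints. Adjacent runs are merged
   pairwise, doubling the run width each pass. The sorted result ends up
   in data. Returns the number of passes over the data. */
unsigned balanced_merge(int *data, int *scratch, size_t n, size_t run_len) {
    unsigned passes = 0;
    int *src = data, *dst = scratch;
    for (size_t width = run_len; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_two(src, dst, lo, mid, hi);
        }
        int *tmp = src; src = dst; dst = tmp;  /* ping-pong the buffers */
        passes++;
    }
    if (src != data)
        memcpy(data, src, n * sizeof *data);
    return passes;
}
```

Each pass streams every element once, so the total data moved grows with the number of passes, which is logarithmic in the number of runs under this scheme.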