MapReduceMerge

8/14/2019 MapReduceMerge

1/32

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker

presented by Nate Roberts


2/32

Outline

1. Introduction: principles of databases rather than the artifacts.

2. MapReduce

3. Map-Reduce-Merge: extending MapReduce

4. Using Map-Reduce-Merge to implement relational algebra operators


3/32

Principles of DB, Not the Artifacts

New data-processing systems should consider alternatives to using big, traditional databases.

MapReduce does a good job, in a limited context, with extraordinary simplicity.

Map-Reduce-Merge will try to extend the applicability without giving up too much simplicity.


4/32

Introduction to MapReduce

1. Why MapReduce?

2. What is MapReduce?

3. How do you use it?

4. What's it good for?

5. What are its limitations?


5/32

Why MapReduce?

For (single core) CPUs, Moore's Law is beginning to slow down

The future is multi-core (large clusters of commodity hardware are the new supercomputers)

But parallel programming is hard to think about!


6/32

What is MapReduce?

MapReduce handles dispatching tasks across a large cluster

You just have to define the tasks, in two stages:

1. Map: (k1, v1) → [(k2, v2)]

2. Reduce: (k2, [v2]) → [v3]


7/32

How do you use it?

Example: count the number of occurrences of each word in a large collection of documents. For each document d with contents v:

map: given (d, v), for each word w in v, emit (w, 1).

reduce: given (w, [v]), sum the counts in [v]. Emit the sum.
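The word-count example above can be sketched in a few lines of Python. This is a toy, single-process illustration, not how a real MapReduce cluster runs: the `map_reduce` driver, the function names, and the sample documents are all assumptions made up for this sketch, and the reducer returns one value rather than a list.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy driver: run the mapper, group values by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    return {k2: reducer(k2, values) for k2, values in groups.items()}

def word_count_map(doc_id, contents):
    # given (d, v): emit (w, 1) for each word w in v
    for word in contents.split():
        yield (word, 1)

def word_count_reduce(word, counts):
    # given (w, [v]): sum the counts
    return sum(counts)

# Hypothetical sample documents.
docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
counts = map_reduce(docs, word_count_map, word_count_reduce)
```

Here `counts["the"]` is 3, since "the" appears once in `d1` and twice in `d2`.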


8/32

Word Counting Behind the Scenes

A single master server dispatches tasks and keeps a scoreboard.

1. Mappers are dispatched for each document. They do local writes with their results.

2. Once mapping finishes, a shuffle phase assigns reducers to each word. Reducers do remote reads from mappers.


9/32

MapReduce Schematic

http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0007.html


10/32

MapReduce in Parallel

http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0008.html


11/32

Three Optimizations

Fault and slow node tolerance: once tasks are all dispatched and some nodes are finished, assign some unfinished tasks redundantly. (First to return wins.)

Combiner: have mappers do some local reduction.

Locality: assign mappers in such a way that most have their input available for local reading.
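To make the combiner idea concrete, here is a minimal sketch for the word-count case. The function names and the sample sentence are assumptions for illustration; the point is only that the mapper's node sums its own counts before any data crosses the network.

```python
from collections import defaultdict

def word_count_map(doc_id, contents):
    # Plain mapper: one (word, 1) pair per occurrence.
    return [(word, 1) for word in contents.split()]

def combine(pairs):
    """Local reduction on the mapper's node: sum counts per word."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

raw = word_count_map("d1", "to be or not to be")
combined = combine(raw)
```

The mapper emits six pairs, but only four leave the node after combining. This works for word counting because addition is associative and commutative, so partial sums can be summed again by the reducer.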


12/32

What's it good for?

Data processing tasks on homogeneous data sets:

Distributed grep

Building an index mapping words to documents in which those words occur

Distributed sort
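As one illustration of these tasks, the word-to-documents index could be sketched as below. The helper `build_index` stands in for the framework's shuffle, and all names and sample documents are hypothetical.

```python
from collections import defaultdict

def index_map(doc_id, contents):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(contents.split()):
        yield (word, doc_id)

def index_reduce(word, doc_ids):
    # Collect the documents in which the word occurs.
    return sorted(set(doc_ids))

def build_index(docs):
    """Toy stand-in for the shuffle: group doc ids by word, then reduce."""
    groups = defaultdict(list)
    for doc_id, contents in docs:
        for word, d in index_map(doc_id, contents):
            groups[word].append(d)
    return {word: index_reduce(word, ds) for word, ds in groups.items()}

index = build_index([("d1", "map reduce"), ("d2", "map merge")])
```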


13/32

What isn't it good for?

Not good at heterogeneous data sets.


14/32

Heterogeneous Data

emp-id  dept-id  bonus
1       B        innov. award ($100)
1       B        hard worker ($50)
2       A        high perform. ($150)
3       A        innov. award ($100)

dept-id  bonus adjustment
B        1.1
A        0.9


15/32

Map-Reduce-Merge: Extending MapReduce

1. Change to reduce phase

2. Merge phase

3. Additional user-definable operations

a. partition selector

b. processor

c. merger

d. configurable iterators


16/32

Reduce & Merge Phases

1. Map: (k1, v1) → [(k2, v2)]

2. Reduce: (k2, [v2]) → [v3]

becomes:

1. Map: (k1, v1) → [(k2, v2)]

2. Reduce: (k2, [v2]) → (k2, [v3])

3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])


17/32

Programmer-Definable Operations

1. Partition selector - which data should go to which merger?

2. Processor - process data on an individual source.

3. Merger - analogous to the map and reduce definitions, define logic to do the merge operation.

4. Configurable iterators - how to step through each of the lists as you merge.


18/32

Employee Bonus Example, Revisited
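A rough sketch of how the two tables might flow through map, reduce, and merge is shown below. It assumes the goal is each employee's total bonus scaled by the department's bonus adjustment, and it uses the dollar amounts from the earlier table; the function names and the single-process structure are illustrative assumptions, not the paper's API.

```python
from collections import defaultdict

# (emp-id, dept-id, bonus in dollars), taken from the employee table
employees = [(1, "B", 100), (1, "B", 50), (2, "A", 150), (3, "A", 100)]
# (dept-id, bonus adjustment), taken from the department table
departments = [("B", 1.1), ("A", 0.9)]

def reduce_employees(rows):
    """Map keys each row by (dept-id, emp-id); reduce sums the bonuses."""
    totals = defaultdict(int)
    for emp_id, dept_id, bonus in rows:
        totals[(dept_id, emp_id)] += bonus
    return sorted(totals.items())

def reduce_departments(rows):
    """Second data source: one adjustment per dept-id, sorted by dept-id."""
    return sorted(rows)

def merge(emp_out, dept_out):
    """Merge the two reduced outputs on dept-id and apply the adjustment."""
    adjustment = dict(dept_out)
    return {emp_id: round(total * adjustment[dept_id], 2)
            for (dept_id, emp_id), total in emp_out}

adjusted = merge(reduce_employees(employees), reduce_departments(departments))
```

Employee 1 ends up with (100 + 50) x 1.1 = 165, which plain MapReduce cannot compute in one pass because the bonus rows and the adjustment live in different data sets.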


19/32

Implementing Relational Algebra Operations

1. Projection

2. Aggregation

3. Selection

4. Set Operations: Union, Intersection, Difference

5. Cartesian Product

6. Rename

7. Join


20/32

Projection

All we have to do is emit a subset of the data passed in.

Just a mapper can do this.
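A map-only projection might look like the following sketch. The record layout (a dict per row) and the column names are assumptions chosen for illustration.

```python
def project_map(key, record, columns=("emp_id", "dept_id")):
    """Map-only projection: emit just the requested attributes of each record."""
    yield (key, {c: record[c] for c in columns})

# Hypothetical input: one keyed record with three attributes.
rows = [(0, {"emp_id": 1, "dept_id": "B", "bonus": 100})]
projected = [out for k, r in rows for out in project_map(k, r)]
```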


21/32

Aggregation

By choosing appropriate keys, we can implement the SQL GROUP BY and aggregate operators in MapReduce.

(You do have to be careful here, though: choose badly, and you might not have enough tasks for MapReduce to do you any good.)
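As a sketch of the idea, a SUM grouped by department can be written by making the group-by column the map output key. The names and the in-process grouping helper are assumptions; the rows reuse the amounts from the employee table.

```python
from collections import defaultdict

def agg_map(_, row):
    emp_id, dept_id, bonus = row
    yield (dept_id, bonus)  # the group-by column becomes the key

def agg_reduce(dept_id, bonuses):
    return sum(bonuses)     # the aggregate is computed per key

def group_by_sum(rows):
    """Toy stand-in for the shuffle: group by key, then reduce each group."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        for k, v in agg_map(i, row):
            groups[k].append(v)
    return {k: agg_reduce(k, vs) for k, vs in groups.items()}

totals = group_by_sum([(1, "B", 100), (1, "B", 50), (2, "A", 150), (3, "A", 100)])
```

This also shows the caveat: with only two distinct departments there are only two groups, so at most two reduce tasks have any work to do.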


22/32

Selection

If the selection condition involves only the attributes of one data source, it can be implemented in mappers.

If it's on aggregates or a group of values contained in one data source, it can be implemented in reducers.

If it involves attributes or aggregates from both data sources, implement it in mergers.
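The first two cases can be sketched as follows: a row-level condition filtered in the mapper and an aggregate condition filtered in the reducer. The thresholds, names, and sample rows are assumptions for illustration only.

```python
from collections import defaultdict

rows = [(1, "B", 100), (1, "B", 50), (2, "A", 150), (3, "A", 100)]

def select_map(row):
    """Row-level condition (like WHERE bonus >= 100): decided in the mapper."""
    emp_id, dept_id, bonus = row
    if bonus >= 100:
        yield (emp_id, bonus)

def select_reduce(emp_id, bonuses):
    """Aggregate condition (like HAVING SUM(bonus) > 120): decided in the reducer."""
    total = sum(bonuses)
    return total if total > 120 else None

# Toy shuffle: group the surviving rows by employee.
groups = defaultdict(list)
for row in rows:
    for emp_id, bonus in select_map(row):
        groups[emp_id].append(bonus)

selected = {}
for emp_id, bonuses in groups.items():
    total = select_reduce(emp_id, bonuses)
    if total is not None:
        selected[emp_id] = total
```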


23/32

Set Union

Let each of the two MapReduces emit a sorted list of unique elements

Merges just iterate simultaneously over the lists:

if there is a lesser value, store the lesser value and increment its iterator

if the two are equal, store one of the two, and increment both iterators
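The merge step for union can be sketched as a standard two-pointer walk over the sorted, duplicate-free lists. The function name is an assumption, and the final two `extend` calls cover a case the rules leave implicit: once one list runs out, the rest of the other is stored as-is.

```python
def merge_union(a, b):
    """Union of two sorted lists of unique elements."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # lesser value: store it, advance its iterator
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])  # equal: store one, advance both
            i += 1
            j += 1
    out.extend(a[i:])  # leftovers from whichever list is not exhausted
    out.extend(b[j:])
    return out

union = merge_union([1, 3, 5], [2, 3, 6, 7])
```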


24/32

Set Intersection



As with set union, merges iterate simultaneously over the two sorted lists:

if there is a lesser value, increment its iterator

if the two are equal, store one of the two, and increment both iterators
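A minimal sketch of the intersection merge, assuming both inputs are sorted lists of unique elements (the function name is made up for this example):

```python
def merge_intersection(a, b):
    """Intersection of two sorted lists of unique elements."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1            # lesser value: skip it
        elif b[j] < a[i]:
            j += 1
        else:
            out.append(a[i])  # equal: store one, advance both
            i += 1
            j += 1
    return out

common = merge_intersection([1, 3, 5, 7], [3, 4, 7, 9])
```

Unlike union, nothing is stored after one list is exhausted, since leftover elements cannot appear in both.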


25/32

Set Difference

To compute A - B:

if A's value is less than B's, store A's and increment A's iterator

if B's value is less than A's, increment B's iterator

if the two are equal, increment both
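The rules for A - B can be sketched the same way, again assuming sorted inputs with unique elements. The function name is hypothetical, and the trailing `extend` handles the case where B runs out first, so everything left in A belongs to the result.

```python
def merge_difference(a, b):
    """Elements of sorted list a that do not appear in sorted list b."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # A's value cannot be in B: store it
            i += 1
        elif b[j] < a[i]:
            j += 1            # B's value is irrelevant: skip it
        else:
            i += 1            # present in both: drop it
            j += 1
    out.extend(a[i:])         # B exhausted: the rest of A survives
    return out

diff = merge_difference([1, 2, 3, 5], [2, 4, 5])
```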


26/32

Cartesian Product

Set the reducers up to output the two sets you want the Cartesian product of.

Each merger will get one partition F from the first set of reducers, and the full set of partitions S from the second.

Each merger emits F x S.
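A sketch of what each merger does, using `itertools.product` from the standard library. The partition contents and names below are invented for illustration; each inner list stands for one reducer's output.

```python
from itertools import product

def merge_cartesian(f_partition, s_partitions):
    """One merger: pair every row of its partition F with every row of all of S."""
    s_all = [row for part in s_partitions for row in part]
    return list(product(f_partition, s_all))

# Hypothetical reducer outputs: two partitions per side.
first = [["a", "b"], ["c"]]
second = [[1], [2, 3]]

# One merger per partition of the first set; together they cover the full product.
result = [pair for f in first for pair in merge_cartesian(f, second)]
```

With three rows on each side the mergers emit nine pairs in total, and no pair is produced twice because each row of the first set lives in exactly one partition.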


27/32

Rename

Trivial


28/32

Sort-Merge Join

Map: partition records into key ranges according to the values of the attributes on which you're sorting, aiming for even distribution of values to mappers.

Reduce: sort the data.

Merge: join the sorted data for each key range.
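The three phases could be sketched as below for rows of the form (key, value). The split points in `BOUNDARIES`, the function names, and the sample rows are assumptions; a real system would pick range boundaries by sampling the data.

```python
from bisect import bisect_right

BOUNDARIES = [10, 20]  # hypothetical split points, giving three key ranges

def range_partition(rows):
    """Map phase: route each (key, value) row to its key range."""
    parts = [[] for _ in range(len(BOUNDARIES) + 1)]
    for key, value in rows:
        parts[bisect_right(BOUNDARIES, key)].append((key, value))
    return parts

def merge_join(left, right):
    """Merge phase: join two key-sorted lists, allowing duplicate keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif rk < lk:
            j += 1
        else:
            j_end = j  # find the run of matching keys on the right
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

def sort_merge_join(left_rows, right_rows):
    out = []
    for lp, rp in zip(range_partition(left_rows), range_partition(right_rows)):
        out.extend(merge_join(sorted(lp), sorted(rp)))  # reduce phase: sort
    return out

joined = sort_merge_join([(25, "x"), (5, "a"), (12, "b")],
                         [(12, "B"), (5, "A"), (5, "AA"), (30, "Z")])
```

Because both sides use the same boundaries, matching keys always land in the same range, so each merger only needs the two sorted partitions for its own range.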


29/32

Hash Join

Map: use the same hash function for both sets of mappers.

Reduce: produce a hash table from the values mapped.

Merge: operates on corresponding hash buckets. Use one bucket as a build set, and the other as a probe.
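A single-process sketch of these three steps follows. `NUM_BUCKETS`, the use of Python's built-in `hash` as the shared hash function, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical number of hash buckets

def partition(rows):
    """Map phase: the same hash function routes both inputs to buckets."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[hash(key) % NUM_BUCKETS].append((key, value))
    return buckets

def reduce_bucket(rows):
    """Reduce phase: build a hash table (key -> values) for one bucket."""
    table = defaultdict(list)
    for key, value in rows:
        table[key].append(value)
    return table

def merge_bucket(build, probe):
    """Merge phase: probe the build table with each key of the other bucket."""
    return [(key, b, p)
            for key, ps in probe.items() if key in build
            for b in build[key]
            for p in ps]

def hash_join(left_rows, right_rows):
    left, right = partition(left_rows), partition(right_rows)
    out = []
    for b in range(NUM_BUCKETS):  # only corresponding buckets are merged
        out.extend(merge_bucket(reduce_bucket(left.get(b, [])),
                                reduce_bucket(right.get(b, []))))
    return out

joined = sorted(hash_join([(1, "a"), (2, "b"), (3, "c")],
                          [(2, "B"), (3, "C"), (3, "CC"), (4, "D")]))
```

Equal keys hash to the same bucket on both sides, so no merger ever has to look outside its own pair of buckets.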


30/32

Nested Loop Join

Just like a hash join, except in the merge step, do a nested loop, scanning the right-hand relation for matches to the left.
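The merge step alone might be sketched as below; it would be applied to each pair of corresponding buckets, just as in the hash join. The function name and sample rows are hypothetical.

```python
def nested_loop_merge(left_rows, right_rows):
    """For each left row, scan the whole right-hand side for matching keys."""
    out = []
    for lk, lv in left_rows:
        for rk, rv in right_rows:
            if lk == rk:
                out.append((lk, lv, rv))
    return out

matches = nested_loop_merge([(1, "a"), (2, "b")],
                            [(2, "B"), (2, "BB"), (3, "C")])
```

The cost per bucket pair is the product of the two bucket sizes, which is why the earlier partitioning step matters: it keeps each scanned right-hand side small.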


31/32

Conclusion

MapReduce & GFS represent a paradigm shift in data processing: use a simplified interface instead of overly-general DBMS

Map-Reduce-Merge adds the ability to execute arbitrary relational algebra queries

Next steps: develop SQL-like interface and a query optimizer


32/32

References

J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137-150, 2004. Slides available: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html

A. Kimball. Problem Solving on Large-Scale Clusters. Lecture given on July 3, 2007. Available at http://www.youtube.com/watch?v=-vD6PUdf3Js
