MapReduceMerge

May 31, 2018

  • 8/14/2019 MapReduceMerge

    1/32

    Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

    Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker

    presented by Nate Roberts

    2/32

    Outline

    1. Introduction: principles of databases rather than the artifacts.

    2. MapReduce

    3. Map-Reduce-Merge: extending MapReduce

    4. Using Map-Reduce-Merge to implement relational algebra operators

    3/32

    Principles of DB, Not the Artifacts

    New data-processing systems should consider alternatives to using big, traditional databases.

    MapReduce does a good job, in a limited context, with extraordinary simplicity.

    Map-Reduce-Merge will try to extend the applicability without giving up too much simplicity.

    4/32

    Introduction to MapReduce

    1. Why MapReduce?

    2. What is MapReduce?

    3. How do you use it?

    4. What's it good for?

    5. What are its limitations?

    5/32

    Why MapReduce?

    For (single core) CPUs, Moore's Law is beginning to slow down.

    The future is multi-core (large clusters of commodity hardware are the new supercomputers).

    But parallel programming is hard to think about!

    6/32

    What is MapReduce?

    MapReduce handles dispatching tasks across a large cluster.

    You just have to define the tasks, in two stages:

    1. Map: (k1, v1) → [(k2, v2)]

    2. Reduce: (k2, [v2]) → [v3]
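The two signatures above can be sketched as a tiny single-process driver. This is only an illustration of the data flow, not how a cluster implementation dispatches work; `run_mapreduce`, `parity_map`, and `sum_reduce` are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def run_mapreduce(
    records: Iterable[Tuple[Hashable, object]],
    map_fn: Callable[[Hashable, object], List[Tuple[Hashable, object]]],
    reduce_fn: Callable[[Hashable, List[object]], List[object]],
) -> Dict[Hashable, List[object]]:
    """In-memory stand-in for the framework (assumption: everything fits in RAM)."""
    grouped = defaultdict(list)
    # Map: (k1, v1) -> [(k2, v2)]
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    # Reduce: (k2, [v2]) -> [v3]
    return {k2: reduce_fn(k2, values) for k2, values in grouped.items()}


def parity_map(k1, v1):
    # Re-key every number by its parity.
    return [("even" if v1 % 2 == 0 else "odd", v1)]


def sum_reduce(k2, values):
    # Collapse each group to a single-element list holding its sum.
    return [sum(values)]


result = run_mapreduce([("a", 3), ("b", 4), ("c", 5)], parity_map, sum_reduce)
```

Here the user supplies only `parity_map` and `sum_reduce`; the grouping by `k2` between the two stages is the framework's job.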

    7/32

    How do you use it?

    Example: count the number of occurrences of each word in a large collection of documents. For each document d with contents v:

    map: given (d, v), for each word w in v, emit (w, 1).

    reduce: given (w, [v]), sum the counts in [v]. Emit the sum.
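A minimal sketch of the word-count example, run in a single process. The function names and the sample documents are made up for illustration, and splitting on whitespace is an assumed definition of "word".

```python
from collections import defaultdict


def map_words(doc_id, contents):
    # map: given (d, v), emit (w, 1) for each word w in v
    return [(word, 1) for word in contents.split()]


def reduce_counts(word, counts):
    # reduce: given (w, [v]), emit the sum of the counts
    return sum(counts)


def word_count(documents):
    grouped = defaultdict(list)
    for doc_id, contents in documents.items():
        for word, one in map_words(doc_id, contents):
            grouped[word].append(one)
    return {word: reduce_counts(word, counts) for word, counts in grouped.items()}


docs = {"d1": "the quick fox", "d2": "the lazy dog the end"}
counts = word_count(docs)
```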

    8/32

    Word Counting Behind the Scenes

    A single master server dispatches tasks and keeps a scoreboard.

    1. Mappers are dispatched for each document. They do local writes with their results.

    2. Once mapping finishes, a shuffle phase assigns reducers to each word. Reducers do remote reads from mappers.
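One common way to assign words to reducers is to hash each key modulo the number of reducers. The slide does not say which scheme is used, so the sketch below is an assumption; `reducer_for` and `shuffle` are hypothetical names, and `zlib.crc32` is used only because it gives the same value in every process.

```python
import zlib


def reducer_for(word, num_reducers):
    # Stable hash, so every mapper sends a given word to the same reducer.
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Group the (word, count) pairs written by each mapper into one inbox per reducer."""
    inboxes = [dict() for _ in range(num_reducers)]
    for output in mapper_outputs:  # one list of pairs per mapper
        for word, count in output:
            inbox = inboxes[reducer_for(word, num_reducers)]
            inbox.setdefault(word, []).append(count)
    return inboxes


mapper_outputs = [[("the", 1), ("fox", 1)], [("the", 1)]]
inboxes = shuffle(mapper_outputs, num_reducers=3)
```

Whatever the hash values turn out to be, both occurrences of "the" land in the same inbox, which is what lets a single reducer sum them.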

    9/32

    MapReduce Schematic

    http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0007.html

    10/32

    MapReduce in Parallel

    http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0008.html

    11/32

    Three Optimizations

    Fault and slow node tolerance: once tasks are all dispatched and some nodes are finished, assign some unfinished tasks redundantly. (First to return wins.)

    Combiner: have mappers do some local reduction.

    Locality: assign mappers in such a way that most have their input available for local reading.
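For word counting, the combiner idea can be sketched as a mapper that sums its own counts before writing them out, so fewer pairs cross the network. `map_with_combiner` is a hypothetical name; real frameworks usually let you register the combiner separately from the mapper.

```python
from collections import Counter


def map_with_combiner(doc_id, contents):
    # Instead of emitting (w, 1) once per occurrence, pre-sum locally
    # and emit a single (w, n) pair per distinct word in this document.
    return list(Counter(contents.split()).items())


pairs = map_with_combiner("d1", "to be or not to be")
```

This document has six word occurrences but only four distinct words, so the mapper writes four pairs instead of six. The reducer is unchanged, since it already sums whatever counts it receives.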

    12/32

    What's it good for?

    Data processing tasks on homogeneous data sets:

    Distributed grep

    Building an index mapping words to documents in which those words occur

    Distributed sort

    13/32

    What isn't it good for?

    Not good at heterogeneous data sets.

    14/32

    Heterogeneous Data

    emp-id  dept-id  bonus
    1       B        innov. award ($100)
    1       B        hard worker ($50)
    2       A        high perform. ($150)
    3       A        innov. award ($100)

    dept-id  bonus adjustment
    B        1.1
    A        0.9

    15/32

    Map-Reduce-Merge: Extending MapReduce

    1. Change to reduce phase

    2. Merge phase

    3. Additional user-definable operations

    a. partition selector

    b. processor

    c. merger

    d. configurable iterators

    16/32

    Reduce & Merge Phases

    1. Map: (k1, v1) → [(k2, v2)]

    2. Reduce: (k2, [v2]) → [v3]

    becomes:

    1. Map: (k1, v1) → [(k2, v2)]

    2. Reduce: (k2, [v2]) → (k2, [v3])

    3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])

    17/32

    Programmer-Definable Operations

    1. Partition selector - which data should go to which merger?

    2. Processor - process data on an individual source.

    3. Merger - analogous to the map and reduce definitions, define logic to do the merge operation.

    4. Configurable iterators - how to step through each of the lists as you merge.

    18/32

    Employee Bonus Example, Revisited
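The figure for this slide did not survive in the transcript, so the following is only a plausible sketch of how the two tables from the Heterogeneous Data slide could flow through reduce and merge. All function names are hypothetical, the dollar amounts are taken from the bonus descriptions, and the merger uses a dictionary lookup rather than configurable iterators to keep the example short.

```python
from collections import defaultdict

# (emp-id, dept-id, bonus amount) and (dept-id, bonus adjustment)
bonuses = [(1, "B", 100), (1, "B", 50), (2, "A", 150), (3, "A", 100)]
adjustments = [("B", 1.1), ("A", 0.9)]


def reduce_employee_bonuses(rows):
    # First lineage: total bonus per employee, keyed by (dept-id, emp-id)
    # so that the output can be matched against departments later.
    totals = defaultdict(int)
    for emp_id, dept_id, amount in rows:
        totals[(dept_id, emp_id)] += amount
    return sorted(totals.items())


def reduce_adjustments(rows):
    # Second lineage: one adjustment factor per department.
    return sorted(rows)


def merge_bonuses(employee_totals, dept_adjustments):
    # Merge: match both reduced outputs on dept-id and apply the factor.
    factor = dict(dept_adjustments)
    return {
        emp_id: round(total * factor[dept_id], 2)
        for (dept_id, emp_id), total in employee_totals
        if dept_id in factor
    }


final_bonuses = merge_bonuses(
    reduce_employee_bonuses(bonuses), reduce_adjustments(adjustments)
)
```

With the slide's data, employee 1 ends up with (100 + 50) x 1.1 = 165, employee 2 with 150 x 0.9 = 135, and employee 3 with 100 x 0.9 = 90.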

    19/32

    Implementing Relational Algebra Operations

    1. Projection

    2. Aggregation

    3. Selection

    4. Set Operations: Union, Intersection, Difference

    5. Cartesian Product

    6. Rename

    7. Join

    20/32

    Projection

    All we have to do is emit a subset of the data passed in.

    Just a mapper can do this.
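A projection mapper can be as small as the sketch below. Representing a record as a dictionary, and the name `project_mapper`, are assumptions made for this example.

```python
def project_mapper(key, record, attributes):
    # Emit the same key with only the requested attributes kept.
    return [(key, {name: record[name] for name in attributes})]


row = {"emp-id": 1, "dept-id": "B", "bonus": 100}
projected = project_mapper(1, row, ["emp-id", "bonus"])
```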

    21/32

    Aggregation

    By choosing appropriate keys, we can implement the group by and aggregate SQL operators in MapReduce.

    (We do have to be careful here, though: choose badly, and you might not have enough tasks for MapReduce to do you any good.)
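As a sketch, a SQL query like `SELECT dept_id, SUM(bonus) ... GROUP BY dept_id` becomes a mapper that emits the grouping attribute as the key and a reducer that applies the aggregate. The rows reuse the bonus amounts from the Heterogeneous Data slide; the function names are hypothetical.

```python
from collections import defaultdict

rows = [(1, "B", 100), (1, "B", 50), (2, "A", 150), (3, "A", 100)]


def map_by_dept(row):
    # The group-by attribute becomes the intermediate key.
    emp_id, dept_id, amount = row
    return [(dept_id, amount)]


def reduce_sum(dept_id, amounts):
    # The aggregate function is applied once per key.
    return sum(amounts)


grouped = defaultdict(list)
for row in rows:
    for dept_id, amount in map_by_dept(row):
        grouped[dept_id].append(amount)

totals = {dept_id: reduce_sum(dept_id, amounts) for dept_id, amounts in grouped.items()}
```

This also shows the caveat: with only two distinct departments there are only two reduce keys, so at most two reduce tasks can run in parallel no matter how large the cluster is.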

    22/32

    Selection

    If the selection condition involves only the attributes of one data source, it can be implemented in mappers.

    If it's on aggregates or a group of values contained in one data source, it can be implemented in reducers.

    If it involves attributes or aggregates from both data sources, implement it in mergers.
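The first two cases correspond roughly to SQL's WHERE and HAVING. The sketch below filters rows in the mapper and filters groups in the reducer; the thresholds, function names, and sample rows are made up for illustration, and the merger case is not shown.

```python
from collections import defaultdict

rows = [(1, "B", 100), (1, "B", 50), (2, "A", 150), (3, "A", 100)]


def map_select(row, min_amount=100):
    # Selection on attributes of a single source: drop rows in the mapper.
    emp_id, dept_id, amount = row
    return [(dept_id, amount)] if amount >= min_amount else []


def reduce_select(dept_id, amounts, min_total=200):
    # Selection on an aggregate: drop whole groups in the reducer.
    total = sum(amounts)
    return [(dept_id, total)] if total > min_total else []


grouped = defaultdict(list)
for row in rows:
    for dept_id, amount in map_select(row):
        grouped[dept_id].append(amount)

selected = dict(
    pair for dept_id, amounts in grouped.items() for pair in reduce_select(dept_id, amounts)
)
```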

    23/32

    Set Union

    Let each of the two MapReduces emit a sorted list of unique elements.

    Merges just iterate simultaneously over the lists:

    if there is a lesser value, store it and increment its iterator

    if the two are equal, store one of the two, and increment both iterators
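The union rules can be sketched with two list indices standing in for the iterators. `merge_union` is a hypothetical name, and copying the leftover tail once one list runs out is a step the slide leaves implicit.

```python
def merge_union(a, b):
    """Union of two sorted lists of unique elements."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # lesser value: store it, advance its iterator
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])  # equal: store one copy, advance both
            i += 1
            j += 1
    # Assumption: whatever remains in the unfinished list belongs to the union.
    out.extend(a[i:])
    out.extend(b[j:])
    return out


union = merge_union([1, 3, 5], [2, 3, 6, 7])
```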

    24/32

    Set Intersection

    Let each of the two MapReduces emit a sorted list of unique elements.

    Merges just iterate simultaneously over the lists:

    if there is a lesser value, increment its iterator

    if the two are equal, store one of the two, and increment both iterators
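A sketch of the intersection rules, again with list indices playing the role of iterators (`merge_intersection` is a hypothetical name):

```python
def merge_intersection(a, b):
    """Intersection of two sorted lists of unique elements."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1  # lesser value cannot be in both lists: skip it
        elif b[j] < a[i]:
            j += 1
        else:
            out.append(a[i])  # equal: store one copy, advance both
            i += 1
            j += 1
    return out


common = merge_intersection([1, 3, 5, 7], [3, 4, 7, 9])
```

Unlike union, nothing is copied once either list is exhausted, because a leftover element has no possible match.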

    25/32

    Set Difference

    Let each of the two MapReduces emit a sorted list of unique elements.

    Merges just iterate simultaneously over the lists. To compute A - B:

    if A's value is less than B's, store A's and increment A's iterator

    if B's value is less than A's, increment B's iterator

    if the two are equal, increment both
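A sketch of A - B over two sorted lists of unique elements. `merge_difference` is a hypothetical name, and keeping the tail of A after B is exhausted is an assumption the slide does not spell out.

```python
def merge_difference(a, b):
    """Elements of sorted list a that do not appear in sorted list b."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # A's value cannot occur later in B: keep it
            i += 1
        elif b[j] < a[i]:
            j += 1  # B's value is irrelevant to A: skip it
        else:
            i += 1  # present in both: drop it
            j += 1
    # Assumption: once B is exhausted, the rest of A survives.
    out.extend(a[i:])
    return out


difference = merge_difference([1, 3, 5, 7], [3, 4, 7, 9])
```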

    26/32

    Cartesian Product

    Set the reducers up to output the two sets you want the Cartesian product of.

    Each merger will get one partition F from the first set of reducers, and the full set of partitions S from the second.

    Each merger emits F x S.
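The per-merger work can be sketched with `itertools.product`. The partition contents and the name `merge_cartesian` are made up for this example; each call stands for one merger.

```python
from itertools import product


def merge_cartesian(first_partition, second_partitions):
    # One merger: its own partition F against every partition of S.
    full_second = [item for part in second_partitions for item in part]
    return list(product(first_partition, full_second))


first_partitions = [[1, 2], [3]]       # output of the first set of reducers
second_partitions = [["x"], ["y"]]     # output of the second set of reducers

pairs = [
    pair
    for f in first_partitions          # one merger per partition F
    for pair in merge_cartesian(f, second_partitions)
]
```

Because every element of the first set lives in exactly one partition F, the mergers together produce each pair exactly once.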

    27/32

    Rename

    Trivial

    28/32

    Sort-Merge Join

    Map: partition records into key ranges according to the values of the attributes on which you're sorting, aiming for even distribution of values to mappers.

    Reduce: sort the data.

    Merge: join the sorted data for each key range.
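The reduce and merge steps for a single key range can be sketched as follows. `sort_merge_join` is a hypothetical name, rows are assumed to be (key, value) pairs, and the sketch assumes the right-hand side may repeat keys, so it rescans the matching run for each left row.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) rows that fall in the same key range."""
    left, right = sorted(left), sorted(right)  # the reduce step: sort each side
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif right_key < left_key:
            j += 1
        else:
            # Emit one joined row per matching right row, then move to the
            # next left row; j stays put so later left rows with the same
            # key see the same run of right rows.
            k = j
            while k < len(right) and right[k][0] == left_key:
                out.append((left_key, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


employees = [("B", 1), ("A", 2), ("A", 3)]      # (dept-id, emp-id)
departments = [("B", 1.1), ("A", 0.9)]          # (dept-id, bonus adjustment)
joined = sort_merge_join(employees, departments)
```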

    29/32

    Hash Join

    Map: use the same hash function for both sets of mappers.

    Reduce: produce a hash table from the values mapped.

    Merge: operates on corresponding hash buckets. Use one bucket as a build set, and the other as a probe.
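A single-process sketch of the three steps. The function names, the bucket count, and the use of `zlib.crc32` as the shared hash function are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def bucket_of(key, num_buckets):
    # Map step: the same hash function must be used for both inputs.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def partition(rows, num_buckets):
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def join_bucket(build_rows, probe_rows):
    # Reduce step: build a hash table from one side of the bucket.
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)
    # Merge step: probe the table with the other side.
    return [
        (key, build_value, probe_value)
        for key, probe_value in probe_rows
        for build_value in table.get(key, [])
    ]


def hash_join(build_side, probe_side, num_buckets=4):
    out = []
    for build_rows, probe_rows in zip(
        partition(build_side, num_buckets), partition(probe_side, num_buckets)
    ):
        out.extend(join_bucket(build_rows, probe_rows))  # corresponding buckets only
    return out


departments = [("A", 0.9), ("B", 1.1)]                  # build set
employees = [("A", 2), ("A", 3), ("B", 1), ("C", 9)]    # probe set
joined = sorted(hash_join(departments, employees))
```

Using the smaller input as the build set keeps the in-memory table small, which is the usual reason for distinguishing build from probe.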

    30/32

    Nested Loop Join

    Just like a hash join, except in the merge step, do a nested loop, scanning the right-hand relation for matches to the left.

    31/32

    Conclusion

    MapReduce & GFS represent a paradigm shift in data processing: use a simplified interface instead of an overly general DBMS.

    Map-Reduce-Merge adds the ability to execute arbitrary relational algebra queries.

    Next steps: develop a SQL-like interface and a query optimizer.

    32/32

    References

    J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137-150, 2004. Slides available: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html

    A. Kimball. Problem Solving on Large-Scale Clusters. Lecture given on July 3, 2007. Available at http://www.youtube.com/watch?v=-vD6PUdf3Js
