MAP REDUCE Leroy Garcia

Leroy Garcia. What is Map Reduce? A patented programming model developed by Google Derived from LISP and other forms of functional programming Used.

Dec 26, 2015


Alannah Phelps
Transcript
Page 1:

MAP REDUCE
Leroy Garcia

Page 2:

What is Map Reduce?
• A patented programming model developed by Google
• Derived from LISP and other forms of functional programming
• Used for processing large data and generating large data sets
• Exploits a large set of commodity computers
• Executes processes in a distributed manner
• Easy to use, no messy code

Page 3:

Implementation at Google
• Machines w/ Multiple Processors
• Commodity Networking Hardware
• Cluster of Hundreds or Thousands of Machines
• IDE Disks used for storage
• Input Data managed by GFS
• Users submit jobs to a scheduling system

Page 4:

Introduction

How does Map Reduce work?

Page 5:

Overview

• Programming Model
• Implementation
• Refinement
• Performance
• Related Topics
• Conclusion

Page 6:

Programming Model

Page 7:

Programming Model

Map
• Input: key/value pair
  • Key: ex. Document Name
  • Value: ex. Document Contents
• Output: Set of Intermediate key/values

Page 8:

Programming Model

Reduce
• Input: Intermediate key, values
  • Key: ex. A Word
  • Values: all intermediate values for that key
• Output: List of Values or a Single Value

Page 9:

[Diagram: Big Data -> MAP -> Partitioning Function -> REDUCE -> Result]

Page 10:

Execution

[Diagram, top to bottom:]
Input: fed to the map calls (M M M M M M M)
Intermediate: k1:v k1:v k2:v k1:v k3:v k2:v k4:v k5:v k4:v k1:v k3:v
Group by Key
Grouped: k1:v,v,v,v k2:v k3:v,v k4:v,v,v k5:v
One reduce call (R) per group: R R R R R
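The data flow in this diagram (map, group by key, reduce) can be sketched in a few lines of Python. This is a minimal single-process sketch, not Google's implementation; the names `run_mapreduce`, `emit_lengths`, and `keep_max` are made up for illustration, and the toy job (longest word per first letter) is only there to give the driver something to run.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]  # hypothetical alias for one intermediate key/value pair


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterator[Pair]],
    reduce_fn: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, group by key, then reduce, all in one process."""
    # Map: every input pair may emit any number of intermediate pairs.
    intermediate = [pair for key, value in inputs for pair in map_fn(key, value)]

    # Group by key: collect all values that share an intermediate key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce: one call per unique key, visited in sorted key order.
    return {key: reduce_fn(key, grouped[key]) for key in sorted(grouped)}


def emit_lengths(name: str, contents: str) -> Iterator[Pair]:
    """Toy map function: emit (first letter, word length) for each word."""
    for word in contents.split():
        yield word[0], len(word)


def keep_max(key: str, values: List[int]) -> int:
    """Toy reduce function: keep the largest value seen for a key."""
    return max(values)


result = run_mapreduce(
    [("doc1", "kiwi kale"), ("doc2", "plum kumquat")], emit_lengths, keep_max
)
```

A real run spreads the map and reduce calls over many machines; the grouping step is the part the framework handles for the user.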

Page 11:

Parallel Execution

[Diagram: the same data flow run in parallel. Map Task 1, Map Task 2, and Map Task 3 each run several map calls (M) and apply the Partition Function to their intermediate pairs (k1:v k1:v k2:v k1:v k3:v k2:v k4:v k5:v k4:v k1:v k3:v). Each partition goes through Sort and Group (k1:v,v,v,v k2:v k3:v,v k4:v,v,v k5:v) and is handed to a reduce task (R).]

Page 12:

The Map Step

[Diagram: each input key-value pair (k, v) passes through map, which emits intermediate key-value pairs (k, v).]

Page 13:

Reduce Step

[Diagram: intermediate key-value pairs (k, v) are grouped into key-value groups (k, [v, v, ...]); reduce turns each group into output key-value pairs.]

Page 14:

Word Count

[Diagram, values as shown on the slide:]
MAP output: {Girl,3} {Girl,8} {Girl,5} {Boy,34} {Boy,16} {Boy,12} {Boy,23} {Girl,18}
Grouped by key: {Boy,34} {Boy,12} {Boy,23} {Boy,16} / {Girl,18} {Girl,5} {Girl,8} {Girl,5} {Girl,12}
Reduce output: {Boy,85} {Girl,43}
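Word count can be written as one map function and one reduce function. The sketch below is a single-process illustration with made-up sample text, so its totals are not the ones on the slide; the function names are hypothetical.

```python
from collections import defaultdict


def map_word_count(doc_name, contents):
    """Map: emit (word, 1) for every word in one document."""
    for word in contents.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce: add up all the counts emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    """Group the map output by word, then reduce each group."""
    grouped = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, count in map_word_count(doc_name, contents):
            grouped[word].append(count)
    return dict(reduce_word_count(word, counts) for word, counts in grouped.items())


# Made-up sample input; the counts on the slide come from a different text.
sample = {"a.txt": "boy girl boy", "b.txt": "girl boy"}
totals = word_count(sample)
```

Here `totals` maps each word to its total across both documents.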

Page 15:

Examples

• Distributed Grep
• Count of URL Access Frequency
• Reverse Web-Link Graph
• Term-Vector per Host
• Inverted Index
• Distributed Sort
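As one illustration from this list, a reverse web-link graph maps each target URL to the pages that link to it. The sketch below assumes the input is a dictionary from page URL to its outgoing links; the URLs and function names are placeholders, and the grouping is done in one process rather than on a cluster.

```python
from collections import defaultdict


def map_links(source_url, outgoing_links):
    """Map: for each link on a page, emit (target, source)."""
    for target_url in outgoing_links:
        yield target_url, source_url


def reduce_links(target_url, source_urls):
    """Reduce: collect every page that links to the target."""
    return target_url, sorted(source_urls)


def reverse_link_graph(pages):
    """Group the (target, source) pairs by target, then reduce each group."""
    grouped = defaultdict(list)
    for source_url, outgoing_links in pages.items():
        for target_url, source in map_links(source_url, outgoing_links):
            grouped[target_url].append(source)
    return dict(reduce_links(t, s) for t, s in grouped.items())


# Hypothetical three-page web; the URLs are placeholders.
pages = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": [],
}
incoming = reverse_link_graph(pages)
```

Pages with no incoming links, such as `a.example` here, simply do not appear in the output.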

Page 16:

Practical Examples

• Large PDF Generation
• Artificial Intelligence
• Statistical Data
• Geographical Data

Page 17:

Large-Scale PDF Generation

• The New York Times needs to generate PDF files for 11,000,000 articles (every article from 1851-1980) in the form of images scanned from the original paper

• Each article is composed of numerous TIFF images which are scaled and glued together

• Code for generating a PDF is relatively straightforward

Page 18:

Artificial Intelligence

• Compute statistics
• Central Limit Theorem
• N voting nodes cast votes (map)
• Tally votes and take action (reduce)

Page 19:

Statistical Analysis

Photos from: stockcharts.com

• Statistical analysis of current stock against historical data

• Each node (map) computes similarity and ROI.

• Tally Votes (reduce) to generate expected ROI and standard deviation

Page 20:

Geographical Data

• Large data sets including road, intersection, and feature data
• Problems that Google Maps has used MapReduce to solve:
  • Locating roads connected to a given intersection
  • Rendering of map tiles
  • Finding nearest feature to a given address or location

Page 21:

Geographical Data

• Input: Graph describing node network with all gas stations marked
• Map: Search five mile radius of each gas station and mark distance to each node
• Sort: Sort by key
• Reduce: For each node, emit path and gas station with the shortest distance
• Output: Graph marked and nearest gas station to each node
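The nearest-gas-station job can be sketched as follows. This is an illustration under several assumptions: the graph is a dictionary of node to {neighbor: miles}, the radius search uses Dijkstra's algorithm (the slide does not name the search), and the reduce step emits only the station and distance, not the path. All names and the sample graph are made up.

```python
import heapq
from collections import defaultdict

RADIUS_MILES = 5.0  # search radius named on the slide


def map_station(station, graph):
    """Map: emit (node, (distance, station)) for nodes within the radius.

    Dijkstra's algorithm is an assumption of this sketch.
    """
    best = {station: 0.0}
    queue = [(0.0, station)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        yield node, (dist, station)
        for neighbor, miles in graph[node].items():
            new_dist = dist + miles
            if new_dist <= RADIUS_MILES and new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(queue, (new_dist, neighbor))


def reduce_node(node, candidates):
    """Reduce: keep the station with the shortest distance to this node."""
    distance, station = min(candidates)
    return node, (station, distance)


def nearest_stations(graph, stations):
    grouped = defaultdict(list)
    for station in stations:
        for node, candidate in map_station(station, graph):
            grouped[node].append(candidate)
    # Sort by key before reducing, matching the Sort step.
    return dict(reduce_node(node, grouped[node]) for node in sorted(grouped))


# Made-up road network: s1 and s2 are gas stations, a and b are intersections.
graph = {
    "s1": {"a": 2.0},
    "a": {"s1": 2.0, "b": 2.0},
    "b": {"a": 2.0, "s2": 1.0},
    "s2": {"b": 1.0},
}
nearest = nearest_stations(graph, ["s1", "s2"])
```

Each map call works on one station independently, which is what lets the searches run in parallel.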

Page 22:

Implementation

Page 23:

Map/Reduce Walkthrough
• Map: (functional programming) applies a function to each element of the array
• Mapper: the node that performs a function on one element of the set
• Reduce: (functional programming) iterates a function across an array
• Reducer: the node that reduces across all the like-keyed elements
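The functional-programming meaning of map and reduce can be shown with Python's built-in `map` and `functools.reduce`. The numbers are arbitrary sample data.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Map: apply a function to each element independently.
squares = list(map(lambda n: n * n, numbers))

# Reduce: fold a two-argument function across the sequence,
# starting from an initial accumulator of 0.
total = reduce(lambda acc, n: acc + n, squares, 0)
```

Because each map call touches only its own element, the calls can run on different machines; that independence is what the MapReduce model builds on.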

Page 24:

Execution Overview

1. Split the input files.

2. Start up copies of the program on the cluster.
   • One copy of the program becomes the Master
   • Master assigns either map or reduce responsibilities to the other copies (workers)

3. Map worker reads its split.
   • Parses key/value pairs out of the input data
   • Passes each pair to the user-defined Map function

4. Buffered pairs are written to local disk, partitioned into regions by the partitioning function.
   • The locations of these buffered pairs on the local disk are passed back to the master
   • Master is responsible for forwarding these locations to the reduce workers

5. The master gives the locations of the buffered pairs to a reduce worker, which reads them and sorts them by intermediate key.

6. The reduce worker iterates over the sorted intermediate data. For each unique intermediate key, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program.
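The steps of the execution overview can be imitated in one process to see how splits, regions, and per-partition output fit together. This is a rough sketch: there is no master, no disk, and no network, the function names are invented, and `crc32` stands in for whatever hash a real system uses.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
from zlib import crc32

R = 2  # number of reduce tasks, chosen by the user


def partition(key):
    """Assign an intermediate key to one of R regions (deterministic hash)."""
    return crc32(key.encode("utf-8")) % R


def run_map_task(split, map_fn):
    """Steps 3-4: run Map over one split and bucket the output into R regions."""
    regions = defaultdict(list)
    for key, value in split:
        for out_key, out_value in map_fn(key, value):
            regions[partition(out_key)].append((out_key, out_value))
    return regions


def run_reduce_task(region_id, map_outputs, reduce_fn):
    """Steps 5-6: gather one region from every map task, sort, then reduce."""
    pairs = [pair for regions in map_outputs for pair in regions.get(region_id, [])]
    pairs.sort(key=itemgetter(0))
    return [
        (key, reduce_fn(key, [value for _, value in group]))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def run_job(records, num_splits, map_fn, reduce_fn):
    # Step 1: split the input.
    splits = [records[i::num_splits] for i in range(num_splits)]
    map_outputs = [run_map_task(split, map_fn) for split in splits]
    # One output list per reduce partition, like one output file per reducer.
    return [run_reduce_task(r, map_outputs, reduce_fn) for r in range(R)]
```

`run_job` returns R lists, each sorted by key, which mirrors the one-output-file-per-reduce-partition layout described in step 6.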

Page 25:

Distributed Execution Overview

[Diagram: the User Program forks the Master and the Workers. The Master assigns map and reduce tasks. Map workers read the Input Data (Split 0, Split 1, Split 2) and do a local write of intermediate data; reduce workers do a remote read and sort, then write Output File 0 and Output File 1.]

Page 26:

Fault Tolerance

• Worker Failure
• Master Failure
• Dealing with Stragglers
• Locality
• Task Granularity
• Skipping Bad Records

Page 27:

Worker Failure

[Diagram: the Master pings its workers.
Worker A: Map Task 1, Complete.
Worker B: Map Task 2, Complete, then Failed. Map Task 2 returns to Idle and is reassigned to Worker C (In Progress).
Worker AZ: Reduce Task 1, Failed. Reduce Task 1 returns to Idle and is reassigned to Worker BX (In Progress).]
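The master's bookkeeping for a failed worker can be sketched as a small state table. This is an illustration, not Google's code: the `Task` class and both function names are invented, and the ping mechanism itself is left out. It follows the rule that completed map tasks on a failed worker are rerun (their output sat on that machine's local disk), while completed reduce tasks are not (their output is already in the global file system).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = "idle"            # "idle", "in_progress", or "completed"
    worker: Optional[str] = None   # worker currently holding the task


def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks that must be rerun after a worker stops answering pings."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output lives on the failed machine's local disk, so it
        # is lost; completed reduce output is already in the global file system.
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in_progress" or lost_map_output:
            task.state = "idle"
            task.worker = None


def assign_idle_task(tasks, worker):
    """Hand the first idle task to a healthy worker, if any task is idle."""
    for task in tasks:
        if task.state == "idle":
            task.state, task.worker = "in_progress", worker
            return task
    return None
```

Calling `handle_worker_failure(tasks, "B")` followed by `assign_idle_task(tasks, "C")` reproduces the Worker B to Worker C handoff shown in the diagram.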

Page 28:

Master Failure

[Diagram: the MASTER writes periodic Checkpoints (Checkpoint 123, Checkpoint 124, Checkpoint 125). When the master fails, a NEW MASTER is started from the last checkpoint, Checkpoint 125.]

Page 29:

Dealing with Stragglers

Straggler: a machine in a cluster that is running significantly slower than the rest

[Diagram: the Map Task on the straggler is copied to a good machine, and both copies run toward the finish line.]

Page 30:

Locality

• Input data is stored locally
• GFS divides files into 64 MB blocks
• Stores 3 copies of the blocks on different machines
• Master finds the replicas of the input data before it schedules map tasks
• Map tasks are scheduled so that a GFS input block replica is on the same machine or the same rack

Page 31:

Task Granularity

• Minimizes time for fault recovery
• Can pipeline shuffling with map execution
• Better dynamic load balancing
• Often use 200,000 map/5000 reduce tasks w/ 2000 machines

Page 32:

Refinements

Page 33:

Partitioning Function
• The users of MapReduce specify the number of reduce tasks/output files that they desire.
• Data gets partitioned across these tasks using a partitioning function on the intermediate key.
• Special partitioning function
  • e.g. hash(Hostname(urlkey)) mod R
• Ordering Guarantee
  • Intermediate keys are processed in increasing key order.
  • Generates sorted output per partition.
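The hash(Hostname(urlkey)) mod R partitioner can be sketched in Python. The function names are invented, and `zlib.crc32` is used as the hash only because Python's built-in `hash()` for strings changes between processes; a real system may use a different hash.

```python
from urllib.parse import urlsplit
from zlib import crc32


def default_partition(key, num_reducers):
    """Default style of partitioner: hash(key) mod R."""
    return crc32(key.encode("utf-8")) % num_reducers


def host_partition(url_key, num_reducers):
    """Special partitioner: hash(Hostname(urlkey)) mod R.

    All URLs from the same host land in the same reduce partition,
    and therefore in the same output file.
    """
    hostname = urlsplit(url_key).hostname or ""
    return crc32(hostname.encode("utf-8")) % num_reducers
```

With `host_partition`, `http://example.com/a` and `http://example.com/b` always go to the same reducer, whereas `default_partition` may separate them.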

Page 34:

Combiner Function (Optional)

• Used by the Map Task when there is significant repetition in the intermediate keys produced by each Map Task

[Diagram: inside the Map Worker, a Text Document goes through the Map Function, which emits (Girls, 1), (Girls, 1), (Girls, 2), (Girls, 2); the Combiner Function merges these into (Girls, 6).]
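A combiner for a counting job can be sketched as a local sum over the map output, run on the map worker before anything is sent to the reducers. The function name is made up; the sample pairs are the ones from the slide.

```python
from collections import Counter


def combine(pairs):
    """Merge (word, count) pairs with the same word on the map worker."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())


# The four pairs shown on the slide collapse into a single pair.
map_output = [("Girls", 1), ("Girls", 1), ("Girls", 2), ("Girls", 2)]
combined = combine(map_output)
```

This works for word count because addition is associative and commutative, so partial sums on each map worker give the same final totals.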

Page 35:

Input and Output Types

Input:
• Supports reading data of various formats
• Support for a new input type using a simple implementation of a reader interface
  • Ex. Database
  • Ex. Data structure mapped in memory

Output:
• User code can add support for new output types

Page 36:

Skipping Bad Records
• Map/Reduce functions sometimes fail for particular inputs
• Best solution is to debug & fix, but not always possible
• On seg fault:
  • Send UDP packet to master from signal handler
  • Include sequence number of record being processed
• If master sees two failures for same record:
  • Next worker is told to skip the record
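The master's side of bad-record skipping comes down to counting failures per record sequence number. The sketch below shows only that bookkeeping; the signal handler and the UDP packet are left out, and the class and method names are invented for illustration.

```python
from collections import Counter


class SkipTracker:
    """Master-side bookkeeping for records that crash the user function."""

    FAILURES_BEFORE_SKIP = 2  # "two failures for same record"

    def __init__(self):
        self.failures = Counter()

    def report_failure(self, record_seq):
        """Record one failure message naming a record's sequence number."""
        self.failures[record_seq] += 1

    def should_skip(self, record_seq):
        """Tell the next worker whether to skip this record."""
        return self.failures[record_seq] >= self.FAILURES_BEFORE_SKIP
```

A worker that is about to process a record would first ask `should_skip(seq)` and move on to the next record if it returns `True`.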

Page 37:

Status Pages

Page 38:

Status Pages

Page 39:

Status Pages

Page 40:

Performance

Tests run on cluster of 1800 machines:
• 4 GB of memory
• Dual-processor 2 GHz Xeons with Hyperthreading
• Dual 160 GB IDE disks
• Gigabit Ethernet per machine
• Bisection bandwidth approximately 100 Gbps

Two Benchmarks
• MR_Grep: Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
• MR_Sort: Sort 10^10 100-byte records (modeled after TeraSort benchmark)

Page 41:

MR_Grep

Locality optimization helps:
• 1800 machines read 1 TB of data at peak of ~31 GB/s
• Without this, rack switches would limit to 10 GB/s

[Chart: Inputs Scanned]

Page 42:

MR_Sort

[Charts: Normal; No Backup Tasks; 200 Processes Killed]

Page 43:

Related Topics

Page 44:

Other Notable Implementations of MapReduce

• Hadoop
  – Open-source implementation of MapReduce
• HDFS
  – Primary storage system used by Hadoop applications
  – HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations
• Amazon Elastic Compute Cloud (EC2)
  – Virtualized computing environment designed for use with other Amazon services (especially S3)
• Amazon Simple Storage Service (S3)
  – Scalable, inexpensive internet storage which can store and retrieve any amount of data at any time from anywhere on the web
  – Asynchronous, decentralized system which aims to reduce scaling bottlenecks and single points of failure

Page 45:

Conclusion
• MapReduce has proven to be a useful abstraction
• Greatly simplifies large-scale computations at Google
• Easily handles machine failure
• Allows users to focus on the problem, without having to deal with the complicated code behind the scenes

Page 46:

Questions?????