
MapReduce: Simplified Data Processing on Large Clusters
Google, Inc.

Presented by Noha El-Prince, Winter 2011

Problem and Motivations
• Large data sizes
• Limited CPU power
• Difficulties of distributed, parallel computing

MapReduce
• A software framework introduced by Google.
• Enables automatic parallelization and distribution of large-scale computations.
• Hides the details of parallelization, data distribution, load balancing, and fault tolerance.
• Achieves high performance.

Outline
• MapReduce: Execution Example
• Programming Model
• MapReduce: Distributed Execution
• More Examples
• Customizations on Clusters
• Refinements
• Performance Measurement
• Conclusion and Future Work
• MapReduce in Other Companies

Programming Model

[Figure: raw data flows through the MapReduce library and comes out as reduced, processed data. Map (M) takes input pairs (k, v) and emits intermediate pairs (k', v'); Reduce (R) takes each grouped intermediate pair (k', <v'>*) and emits the output <k', v'>*.]

Example
• Input:
  - Page 1: the weather is good
  - Page 2: today is good
  - Page 3: good weather is good
• Desired output: the frequency with which each word is encountered across all pages:
  (the 1), (is 3), (weather 2), (today 1), (good 4)

Input data:
  the weather is good
  today is good
  good weather is good

Map phase (intermediate data):
  (the,1) (weather,1) (is,1) (good,1) (today,1) (is,1) (good,1) (good,1) (weather,1) (is,1) (good,1)

map(key, value):
    for each word w in value:
        emit(w, 1)

Group by key (grouped data):
  (the,[1]) (weather,[1,1]) (is,[1,1,1]) (good,[1,1,1,1]) (today,[1])

Reduce phase (output data):
  (the,1) (weather,2) (is,3) (good,4) (today,1)

reduce(key, values):
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
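
This pipeline can be exercised end to end on a single machine. Below is a minimal Python sketch of the map, group-by-key, and reduce steps for this word-count example; the in-memory list standing in for the shuffle only illustrates what the library does across machines, it is not how the library is implemented.

from collections import defaultdict

pages = [
    "the weather is good",
    "today is good",
    "good weather is good",
]

def map_fn(key, value):
    # key: page id (unused here), value: page text
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word
    return (key, sum(values))

# Map phase: emit (word, 1) for every word on every page.
intermediate = []
for i, page in enumerate(pages):
    intermediate.extend(map_fn(i, page))

# Group by key (the "shuffle" the library performs between phases).
grouped = defaultdict(list)
for k, v in intermediate:
    grouped[k].append(v)

# Reduce phase: sum the counts for each word.
output = [reduce_fn(k, vs) for k, vs in grouped.items()]
print(sorted(output))
# [('good', 4), ('is', 3), ('the', 1), ('today', 1), ('weather', 2)]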

Programming Model
• Input: a set of key/value pairs.
• The programmer specifies two functions:
  - Map: map(k, v) -> <k', v'>
  - Reduce: reduce(k', <v'>*) -> <k', v'>*
• All v' with the same k' are reduced together.

Distributed Execution Overview

[Figure: the user program forks a master and many workers. The master assigns map tasks and reduce tasks to idle workers. Map workers read their input splits (Split 0, Split 1, Split 2) and write intermediate data to local disk; reduce workers remotely read and sort that data, then write the final output files (Output File 0, Output File 1).]
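
As a loose single-machine analogy of this figure (not the real system, which forks remote processes, tracks task state, and handles failures), a Python process pool can play the role of the workers:

from multiprocessing import Pool
from collections import defaultdict

def map_task(split):
    # A map worker processes one input split and produces
    # intermediate (key, value) pairs (here: word counts).
    return [(w, 1) for w in split.split()]

def reduce_task(item):
    key, values = item
    return (key, sum(values))

if __name__ == "__main__":
    splits = ["the weather is good", "today is good", "good weather is good"]
    with Pool(processes=3) as pool:   # the "workers"
        # "assign map": hand one split to each map worker
        map_outputs = pool.map(map_task, splits)
        # "remote read, sort": group the intermediate data by key
        grouped = defaultdict(list)
        for out in map_outputs:
            for k, v in out:
                grouped[k].append(v)
        # "assign reduce": reduce workers produce the output
        print(pool.map(reduce_task, list(grouped.items())))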

MapReduce Examples


Distributed Grep:
  Search pattern (key): virus
  Input (web pages): A..virus.., B......, C..virus..
  MAP emits a (pattern, page) pair for every page containing the pattern: (virus, A..), (virus, C..)
  RED collects the matching pages: (virus, [A.., C..])
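
A minimal sketch of distributed grep, with illustrative page texts: the map function emits a pair only for inputs matching the pattern, and the reduce function just collects them.

import re

PATTERN = re.compile(r"virus")  # the search pattern (key)

def map_fn(page_id, text):
    # Emit a pair only for pages that match the pattern.
    if PATTERN.search(text):
        yield ("virus", page_id)

def reduce_fn(key, values):
    # Identity-style reduce: just collect the matching pages.
    return (key, sorted(values))

pages = {"A": "...virus...", "B": ".......", "C": "..virus..."}
matches = [p for pid, text in pages.items() for p in map_fn(pid, text)]
print(reduce_fn("virus", [pid for _, pid in matches]))  # ('virus', ['A', 'C'])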

Count of URL Access Frequency:
  Input (web server logs): www.cbc.com, www.cnn.com, www.bbc.com, www.cbc.com, www.cbc.com, www.bbc.com
  MAP emits (URL, 1) for each request; grouped: (cbc, [1,1,1]), (bbc, [1,1]), (cnn, [1])
  RED sums the counts per URL: (cbc, 3), (bbc, 2), (cnn, 1)

Reverse Web-Link Graph:
  Input: crawled web pages and their links; each link has a source page and a target URL (e.g. www.youtube.com and www.disney.com both link to www.facebook.com).
  MAP emits (target, source) for each link: (facebook, youtube), (facebook, disney)
  RED concatenates all sources for each target: (facebook, [youtube, disney])
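
A sketch of the inversion, using illustrative link data: map flips each (source, target) edge to (target, source), and reduce concatenates the sources per target.

from collections import defaultdict

# (source page, target URL) pairs parsed out of crawled pages; illustrative data
links = [
    ("www.youtube.com", "www.facebook.com"),
    ("www.disney.com", "www.facebook.com"),
    ("www.facebook.com", "www.twitter.com"),
]

def map_fn(source, target):
    yield (target, source)      # invert the edge

def reduce_fn(target, sources):
    return (target, sources)    # concatenate all sources per target

grouped = defaultdict(list)
for src, tgt in links:
    for k, v in map_fn(src, tgt):
        grouped[k].append(v)

print([reduce_fn(k, vs) for k, vs in grouped.items()])
# [('www.facebook.com', ['www.youtube.com', 'www.disney.com']),
#  ('www.twitter.com', ['www.facebook.com'])]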

Term-Vector per Host:
  Input: all documents belonging to one host (e.g. the facebook hostname).
  MAP emits (hostname, term) pairs for each document: <facebook, word1>, <facebook, word2>, <facebook, word2>, <facebook, word2>, ...
  RED adds the term frequencies together and keeps the most popular words: <facebook, [word2, ...]>, a summary of the most popular words on that host.
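
A sketch with made-up documents and hostnames: map emits a per-document term vector keyed by host; reduce adds the vectors and keeps only the most frequent terms (here just the top one).

from collections import Counter, defaultdict
from urllib.parse import urlparse

docs = {  # illustrative URL -> document text
    "http://facebook.com/a": "word1 word2 word2",
    "http://facebook.com/b": "word2",
}

def map_fn(url, text):
    host = urlparse(url).hostname
    yield (host, Counter(text.split()))    # per-document term vector

def reduce_fn(host, vectors):
    total = Counter()
    for v in vectors:
        total += v                          # add the term vectors together
    return (host, total.most_common(1))    # keep only the most popular terms

grouped = defaultdict(list)
for url, text in docs.items():
    for k, v in map_fn(url, text):
        grouped[k].append(v)

print([reduce_fn(k, vs) for k, vs in grouped.items()])
# [('facebook.com', [('word2', 3)])]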

Inverted Index:
  Input: documents.
  MAP parses each document and emits <word, docID> pairs: <word1, docID1>, <word2, docID1>, ..., <word3, docID2>, <word1, docID2>, ..., <word1, docID3>
  RED emits, for each word, the list of documents containing it: <word1, [docID1, docID2, docID3]>, <word2, [docID1]>
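
A sketch with illustrative documents: map emits a <word, docID> pair for each word in a document; reduce produces the sorted posting list per word.

from collections import defaultdict

docs = {  # illustrative docID -> text
    "docID1": "word1 word2",
    "docID2": "word3 word1",
    "docID3": "word1",
}

def map_fn(doc_id, text):
    for word in set(text.split()):    # each (word, doc) pair emitted once
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    return (word, sorted(doc_ids))    # sorted posting list

index = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in map_fn(doc_id, text):
        index[word].append(d)

print(sorted(reduce_fn(w, ds) for w, ds in index.items()))
# [('word1', ['docID1', 'docID2', 'docID3']), ('word2', ['docID1']), ('word3', ['docID2'])]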


Customizations on Clusters
• Coordination
• Scheduling
• Fault Tolerance
• Task Granularity
• Backup Tasks

Coordination

Master data structure: for each task, the master records its type, the worker running it, its state, and the location of its file:

  Task  Worker          State        File
  M     250.133.22.7    completed    Root/intFile.txt
  M     250.133.22.8    in progress  Root/intFile.txt
  R     250.123.23.3    idle         Root/outFile.txt
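
The table maps directly onto a small record type; a minimal sketch (the field names are illustrative, not the paper's):

from dataclasses import dataclass

@dataclass
class TaskState:
    kind: str      # "M" (map) or "R" (reduce)
    worker: str    # address of the worker machine executing the task
    status: str    # "idle", "in-progress", or "completed"
    file: str      # location of the task's intermediate or output file

tasks = [
    TaskState("M", "250.133.22.7", "completed", "Root/intFile.txt"),
    TaskState("M", "250.133.22.8", "in-progress", "Root/intFile.txt"),
    TaskState("R", "250.123.23.3", "idle", "Root/outFile.txt"),
]

# The master scans this table to assign idle tasks to free workers.
print([t for t in tasks if t.status == "idle"])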

Scheduling

Master scheduling policy (objective: conserve network bandwidth):
1. GFS divides each file into 64 MB blocks.
2. Input data are stored on the workers' local disks (managed by GFS).
   - Locality: the same cluster is used for both data storage and data processing.
3. GFS stores multiple copies of each block (typically 3 copies) on different machines.

Fault Tolerance

On worker failure:
• Detect failure via periodic heartbeats.
• Re-execute completed and in-progress map tasks.
• Re-execute in-progress reduce tasks.
• Task completion is committed through the master.

On master failure:
• Could be handled, but isn't yet (master failure is unlikely).
• The MapReduce task is aborted and the client is notified.
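
A toy sketch of the heartbeat mechanism; the timeout value and data layout are assumptions, and the re-execution rule follows the bullets above (completed map output lives on the failed worker's local disk, so it must be redone; completed reduce output is already in the global file system):

import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a heartbeat; illustrative value

last_heartbeat = {}   # worker address -> time of last heartbeat received
tasks = [             # task records as in the master data structure above
    {"kind": "M", "worker": "250.133.22.7", "status": "completed"},
    {"kind": "R", "worker": "250.123.23.3", "status": "in-progress"},
]

def record_heartbeat(worker):
    last_heartbeat[worker] = time.monotonic()

def reschedule_failed():
    now = time.monotonic()
    dead = {w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT}
    for t in tasks:
        if t["worker"] in dead:
            # Re-execute completed and in-progress map tasks, and
            # in-progress reduce tasks; completed reduce tasks are safe.
            if t["kind"] == "M" or t["status"] == "in-progress":
                t["status"] = "idle"   # the master will reassign it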

Task Granularity (how are tasks divided?)

Rule of thumb: make M and R much larger than the number of worker machines.
→ Improves dynamic load balancing.
→ Speeds recovery from worker failure.
Usually R is smaller than M.

Backup Tasks

• Problem of stragglers: machines taking a long time to complete one of the last few tasks.
• When a MapReduce operation is about to complete:
  - The master schedules backup executions of the remaining in-progress tasks.
  - A task is marked "complete" whenever either the primary or the backup execution completes.
• Effect: dramatically shortens job completion time.


Refinements
• Partitioning functions
• Skipping bad records
• Status info
• Other refinements

Refinements: Partitioning Function

• MapReduce users specify the number of reduce tasks/output files desired (R).
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker.
• The system uses a default partition function, e.g. hash(key) mod R (results in fairly well-balanced partitions).
• Sometimes useful to override:
  - E.g. hash(hostname(URL key)) mod R ensures URLs from the same host end up in the same output file.
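
Both partition functions fit in a few lines. In the sketch below, crc32 stands in for whatever hash the library actually uses (Python's built-in hash() is randomized across runs, so a stable hash is used instead); R = 4 and the URLs are illustrative:

from urllib.parse import urlparse
from zlib import crc32

R = 4  # number of reduce tasks / output files chosen by the user

def default_partition(key: str) -> int:
    # Default: hash(key) mod R.
    return crc32(key.encode()) % R

def host_partition(url_key: str) -> int:
    # Override for URL keys: hash(hostname(URL)) mod R, so all URLs
    # from one host land in the same output file.
    return crc32(urlparse(url_key).hostname.encode()) % R

print(default_partition("http://cbc.ca/news"), default_partition("http://cbc.ca/sports"))
print(host_partition("http://cbc.ca/news"), host_partition("http://cbc.ca/sports"))  # equal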

Refinements: Skipping Bad Records

• Map/Reduce functions sometimes fail deterministically for particular inputs.
• MapReduce has special treatment for 'bad' input data, i.e. input data that repeatedly leads to the crash of a task.
  - The master, which tracks task crashes, recognizes such situations and, after a number of failed retries, decides to skip that piece of data.
• Effect: can work around bugs in third-party libraries.

Refinements: Status Information

• Status pages show the progress of the computation.
• Links to the standard error and output files generated by each task.
• The user can:
  - Predict how long the computation will take.
  - Add more resources if needed.
  - See which workers have failed.
• Useful for diagnosing bugs in user code.

Other Refinements

• Combiner function: partial compression of intermediate data on the map worker.
  - Useful for saving network bandwidth.
• User-defined counters:
  - Periodically propagated from the worker machines to the master.
  - Useful for checking the behavior of MapReduce operations (they appear on the master status page).
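
For word count, the combiner can be the same logic as the reducer, run on the map worker before intermediate data crosses the network; a sketch with an illustrative document:

from collections import defaultdict

def map_fn(doc):
    return [(w, 1) for w in doc.split()]

def combine(pairs):
    # Runs on the map worker: partially sum the counts for each word
    # so that less intermediate data crosses the network.
    partial = defaultdict(int)
    for k, v in pairs:
        partial[k] += v
    return list(partial.items())

pairs = map_fn("good weather is good good")
print(len(pairs), len(combine(pairs)))  # 5 pairs shrink to 3
print(combine(pairs))                   # [('good', 3), ('weather', 1), ('is', 1)]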


Performance

• Tests were run on a cluster of 1800 machines; each machine has:
  - 4 GB of memory
  - Dual-processor 2 GHz Xeons with Hyper-Threading
  - Dual 160 GB IDE disks
  - A gigabit Ethernet link
  - Bisection bandwidth of approximately 100-200 Gbps
• Two benchmarks:
  - Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
  - Sort: sort 10^10 100-byte records

Grep

• Setup: M = 15,000 (input split = 64 MB), R = 1, 1764 workers; the 3-character search pattern is found in 92,337 records.
• 1800 machines read 1 TB of data at a peak of ~31 GB/s.
• Startup overhead is significant for short jobs (entire computation = 80 s of scanning plus about 1 minute of startup).

Sort

• Setup: M = 15,000 (input split = 64 MB), R = 4,000, 1746 workers.
• Three runs measured: (a) normal execution, (b) no backup tasks, (c) 200 tasks killed.
• Fig. (a): normal execution beats the reported TeraSort benchmark result of 1057 s.
  - Locality optimization → input rate > shuffle rate and output rate.
  - The output phase writes 2 copies of the sorted data → shuffle rate > output rate.
• Fig. (b): with no backup tasks, 5 stragglers increase the total elapsed time by 44% over normal execution.

Experience: Rewrite of the Production Indexing System

• The new code is simpler and easier to understand.
• MapReduce takes care of failures and slow machines.
• Easy to make indexing faster by adding more machines.


Conclusion & Future Work

• MapReduce has proven to be a useful abstraction.
• It greatly simplifies large-scale computations.
• Fun to use: focus on the problem, let the library deal with the messy details.

MapReduce Advantages/Disadvantages

Now it's easy to program for many CPUs:
• Communication management is effectively gone.
  - I/O scheduling is done for us.
• Fault tolerance and monitoring:
  - Machine failures, suddenly slow machines, etc. are handled.
• Can be much easier to design and program.
• Can cascade several (many?) MapReduce tasks.

But... it restricts the set of solvable problems:
• It might be hard to express a problem in MapReduce.
• Data parallelism is key:
  - You need to be able to break a problem up into data chunks.
• MapReduce is closed-source (to Google) C++.
  - Hadoop is an open-source Java-based rewrite.


Companies using MapReduce

• Amazon: Amazon Elastic MapReduce:
  - A web service.
  - Enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data.
  - Utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
  - Allows you to use Hadoop with no hardware investment.
  - http://aws.amazon.com/elasticmapreduce/

• Amazon: to build product search indices.
• Facebook: processing of web logs, via both MapReduce and Hive.
• IBM and Google: making large compute clusters available to higher-education and research organizations.
• New York Times: large-scale image conversions.
• Yahoo: uses MapReduce and Pig for web log processing, data model training, web map construction, and much, much more.
• Many universities: for teaching parallel and large-data systems.

And many more; see them all at http://wiki.apache.org/hadoop/PoweredBy
