SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB
Transcript
Page 1:

SIDDHARTH MEHTA
PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008)
INTERESTS: SYSTEMS, WEB

Page 2:

A programming model and an associated implementation (library) for processing and generating large data sets on large clusters.

A new abstraction that lets programmers express simple computations while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.

Page 3:

Large-Scale Data Processing
◦ Want to use 1000s of CPUs
◦ But don't want the hassle of managing things

MapReduce provides
◦ Automatic parallelization & distribution
◦ Fault tolerance
◦ I/O scheduling
◦ Monitoring & status updates

Page 4:

The MapReduce programming model has been successfully used at Google for many different purposes.

◦ First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing.

◦ Second, a large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used to generate data for Google's production web search service, for sorting, for data mining, for machine learning, and for many other systems.

◦ Third, the authors developed an implementation of MapReduce that scales to clusters comprising thousands of machines. The implementation makes efficient use of these machine resources and is therefore suitable for many of the large computational problems encountered at Google.

Page 5:
Page 6:

map(key=url, val=contents):
  For each word w in contents, emit (w, "1")

reduce(key=word, values=uniq_counts):
  Sum all "1"s in values list
  Emit result "(word, sum)"

Input:
  see bob throw
  see spot run

Intermediate:
  see 1
  bob 1
  run 1
  see 1
  spot 1
  throw 1

Output:
  bob 1
  run 1
  see 2
  spot 1
  throw 1
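The word-count example above can be simulated in a few lines of Python. This is a single-machine sketch, not Google's implementation: the `defaultdict` grouping step stands in for the framework's shuffle phase.

```python
from collections import defaultdict

def map_fn(url, contents):
    # map(key=url, val=contents): for each word w, emit (w, 1)
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce(key=word, values=counts): sum all counts for the word
    return (word, sum(counts))

def map_reduce(inputs):
    # "Shuffle": group intermediate values by key before reducing
    groups = defaultdict(list)
    for url, contents in inputs.items():
        for key, value in map_fn(url, contents):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = {"doc1": "see bob throw", "doc2": "see spot run"}
print(map_reduce(docs))
# {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```

The output matches the slide's example: "see" appears twice across the two inputs, every other word once.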

Page 7:

Distributed grep:
◦ Map: (key, whole doc / a line) → (the matched line, key)
◦ Reduce: identity function

Count of URL Access Frequency:
◦ Map: logs of web page requests → (URL, 1)
◦ Reduce: (URL, total count)

Reverse Web-Link Graph:
◦ Map: (source, target) → (target, source)
◦ Reduce: (target, list(source)) → (target, list(source))

Inverted Index:
◦ Map: (docID, document) → (word, docID)
◦ Reduce: (word, list(docID)) → (word, sorted list(docID))
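As an illustration of the last entry, the inverted-index computation can be sketched in plain Python (again collapsing map, shuffle, and reduce onto one machine):

```python
from collections import defaultdict

def inverted_index(docs):
    # Map: (docID, document) -> (word, docID)
    # Reduce: (word, list(docID)) -> (word, sorted list(docID))
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

docs = {1: "see bob throw", 2: "see spot run"}
index = inverted_index(docs)
print(index["see"])  # [1, 2]
```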

Page 8:
Page 9:
Page 10:

Google clusters are composed of top-of-the-line PCs:
◦ Intel Xeon, 2 × 2 MB, HyperThreading
◦ 2-4 GB memory
◦ 100 Mbps - 1 Gbps network
◦ Local IDE disks + Google File System
◦ Jobs are submitted to a scheduling system

Page 11:

[Figure: execution overview — the input is split across M map tasks, whose intermediate output is partitioned across R reduce tasks]
Page 12:

Fault Tolerance - in a word: redo
◦ The master pings workers and re-schedules failed tasks.
◦ Note: Completed map tasks are re-executed on failure because their output is stored on the local disk.
◦ Master failure: redo
◦ Semantics in the presence of failures:

Deterministic map/reduce functions produce the same output as would have been produced by a non-faulting sequential execution of the entire program.

The implementation relies on atomic commits of map and reduce task outputs to achieve this property.

Page 13:

◦ Partitioning
◦ Ordering guarantees
◦ Combiner function
◦ Side effects
◦ Skipping bad records
◦ Local execution
◦ Status information
◦ Counters

Page 14:

Straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.

Cause: bad disk, …
Resolution: schedule backup executions of the remaining in-progress tasks near the end of the MapReduce operation.
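The effect of backup tasks can be shown with a toy model (all task timings here are hypothetical): near the end of the job, each remaining in-progress task gets a duplicate, and the task finishes as soon as either copy does.

```python
def job_completion(primary, backup, remaining_for_backup):
    # primary[i]: time task i takes on its original worker
    # backup[i]:  time the same task would take on a backup worker
    # The last `remaining_for_backup` tasks get a backup copy;
    # each such task finishes at min(primary, backup) time.
    times = list(primary)
    n = len(times)
    for i in range(n - remaining_for_backup, n):
        times[i] = min(times[i], backup[i])
    # The job finishes when its slowest task finishes.
    return max(times)

# One straggler (e.g. a bad disk) inflates task 3 from ~10s to 300s.
primary = [10, 11, 12, 300]
backup  = [13, 13, 13, 13]
print(job_completion(primary, backup, remaining_for_backup=1))  # 13
print(job_completion(primary, backup, remaining_for_backup=0))  # 300
```

Duplicating just the final in-progress task cuts the toy job from 300s to 13s, which is the intuition behind the paper's observation that backup tasks significantly reduce completion time at a small cost in extra work.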

Page 15:

Partition the output of a map task into R pieces
◦ Default: hash(key) mod R
◦ User provided, e.g. hash(Hostname(url)) mod R

[Figure: M map tasks each produce R output pieces; one partition collects the piece with the same index from every map task]

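Both partitioning schemes from the slide can be sketched as follows. A stable hash is used instead of Python's built-in `hash()` (which is randomized per process and so unsuitable for partitioning across machines):

```python
import hashlib
from urllib.parse import urlparse

def default_partition(key, R):
    # Default partitioner: hash(key) mod R, using a stable hash.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % R

def host_partition(url, R):
    # User-provided partitioner: hash(Hostname(url)) mod R, so that
    # all URLs from the same host end up in the same output file.
    return default_partition(urlparse(url).hostname, R)

R = 4
a = host_partition("http://example.com/a", R)
b = host_partition("http://example.com/b", R)
print(a == b)  # True: same host -> same partition
```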
Page 16:

Guarantee: within a given partition, the intermediate key/value pairs are processed in increasing key order.

MapReduce implementation of distributed sort:
◦ Map: (key, value) → (key for sort, value)
◦ Reduce: emit unchanged.
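This sort can be sketched in Python. The key point is that range partitioning (rather than the default hash partitioning) plus the per-partition ordering guarantee makes the concatenation of partitions 0..R-1 globally sorted; the partition boundaries here are hand-picked, where a real job would choose them from a sample of the key distribution.

```python
def range_partition(key, boundaries):
    # Assign a key to the first range whose upper boundary exceeds it.
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)

def distributed_sort(pairs, boundaries):
    R = len(boundaries) + 1
    partitions = [[] for _ in range(R)]
    for key, value in pairs:  # "map": emit (sort key, value)
        partitions[range_partition(key, boundaries)].append((key, value))
    # Within each partition, the framework delivers keys in increasing
    # order; the reduce function emits them unchanged.
    output = []
    for part in partitions:
        output.extend(sorted(part))
    return output

data = [("delta", 4), ("alpha", 1), ("charlie", 3), ("bravo", 2)]
print(distributed_sort(data, boundaries=["c"]))
# [('alpha', 1), ('bravo', 2), ('charlie', 3), ('delta', 4)]
```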

Page 17:

Combiner function
◦ E.g. word count emits many <the, 1> pairs
◦ Combine once before the reduce task to save network bandwidth
◦ Executed on the machine performing the map task
◦ Typically the same code as the reduce function
◦ Output is written to an intermediate file
◦ Example: counting words
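A minimal sketch of the combiner's effect on word count, using `Counter` to play the role of the combine step (which for word count is the same aggregation the reducer performs):

```python
from collections import Counter

def map_with_combiner(doc):
    # The map function emits (word, 1) pairs; the combiner pre-sums
    # them on the map worker, so e.g. many <the, 1> pairs become a
    # single <the, N> record before anything crosses the network.
    return Counter(doc.split())

emitted = map_with_combiner("the cat sat on the mat the end")
print(emitted["the"])  # 3 -> one record <the, 3> instead of three <the, 1>
```

Without the combiner this map task would ship 8 intermediate records; with it, one record per distinct word.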

Page 18:

Skipping Bad Records
◦ Certain records deterministically make tasks crash; it is sometimes acceptable to ignore them
◦ An optional mode of execution
◦ Install a signal handler to catch segmentation violations and bus errors.

Page 19:

Status Information
◦ The master runs an internal HTTP server and exports a set of status pages.
◦ These monitor the progress of the computation: how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, processing rates, etc. The pages also contain links to the standard error and standard output files generated by each task.
◦ In addition, the top-level status page shows which workers have failed, and which map and reduce tasks they were processing when they failed.

Page 20:
Page 21:

Tests on grep and sort

Cluster characteristics:
◦ 1800 machines (!)
◦ Intel Xeon, 2 × 2 MB, HyperThreading
◦ 2-4 GB memory
◦ 100 Mbps - 1 Gbps network
◦ Local IDE disks + Google File System

Page 22:

◦ 1 terabyte: 10^10 100-byte records
◦ Rare three-character pattern (~10^5 occurrences)
◦ Input split into 64 MB pieces, M = 15000
◦ R = 1 (output is small)
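The benchmark numbers check out with a quick back-of-the-envelope calculation:

```python
# Sanity-check the grep benchmark setup (values from the slide).
records = 10**10          # 10^10 records
record_bytes = 100        # 100 bytes each
total_bytes = records * record_bytes
print(total_bytes)        # 10^12 bytes ~ 1 TB

split_bytes = 64 * 2**20             # 64 MB input splits
M = -(-total_bytes // split_bytes)   # ceiling division
print(M)                  # 14902, which the slide rounds to M = 15000
```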

Page 23:
Page 24:

Input rate peaks at 30 GB/s (1764 workers)

~1 minute startup time:
◦ Propagation of the program to workers
◦ GFS: opening 1000 input files
◦ Locality optimization

Completed in under 1.5 minutes

Page 25:

◦ 1 terabyte: 10^10 100-byte records
◦ Extract a 10-byte sorting key
◦ Map: emit <key, value> = <10-byte key, 100-byte record>
◦ Reduce: identity
◦ 2-way replication of output (for redundancy, typical in GFS)
◦ M = 15000, R = 4000
◦ May need a pre-pass MapReduce to compute the distribution of keys

Page 26:

◦ Input rate is lower than for grep
◦ Two humps: 2 × 1700 ≈ 4000
◦ Final output is delayed because of sorting
◦ Rates: input > shuffle, output (locality!)
◦ Rates: shuffle > output (writing 2 copies)
◦ Effect of backup tasks
◦ Effect of machine failures

Page 27:
Page 28:

◦ Restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant.

◦ Network bandwidth is a scarce resource. A number of optimizations in the system are therefore targeted at reducing the amount of data sent across the network: the locality optimization lets the system read data from local disks, and writing a single copy of the intermediate data to local disk saves network bandwidth.

◦ Redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.