MapReduce: Simplified Data Processing on Large Clusters
J. Dean and S. Ghemawat (Google), OSDI 2004
Presented by Shimin Chen, DISC Reading Group

Dec 21, 2015

Page 1

MapReduce: Simplified Data Processing on Large Clusters

J. Dean and S. Ghemawat (Google), OSDI 2004

Shimin Chen, DISC Reading Group

Page 2

Motivation: Large Scale Data Processing

- Process lots of data to produce other derived data
  - Input: crawled documents, web request logs, etc.
  - Output: inverted indices, web page graph structure, top queries in a day, etc.
- Want to use hundreds or thousands of CPUs, but want to focus only on the functionality
- MapReduce hides the messy details in a library:
  - Parallelization
  - Data distribution
  - Fault tolerance
  - Load balancing

Page 3

Outline

Programming Model Implementation Refinements Evaluation Conclusion

Page 4

Programming model

- Input & output: each a set of key/value pairs
- Programmer specifies two functions:
  - map(in_key, in_value) -> list(out_key, intermediate_value)
    Processes an input key/value pair to generate intermediate pairs
  - reduce(out_key, list(intermediate_value)) -> list(out_value)
    Given all intermediate values for a particular key, produces a set of merged output values (usually just one)
- Inspired by similar primitives in LISP and other functional languages

Page 5

Example: Count word occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Page 6

Looking at Actual Code (Appendix A)

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i]))
        i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

Page 7

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

Page 8

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);

  MapReduceSpecification spec;

  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

  // Optional: do partial sums within map tasks
  out->set_combiner_class("Adder");

  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  // Done: 'result' structure contains info about counters, time
  // taken, number of machines used, etc.
  return 0;
}

Page 9

More Examples

- Inverted index: word -> documents
  - Map: parse each document, emit <word, document ID>
  - Reduce: emit <word, list of document IDs>
- Distributed grep
  - Map: emit a line if the input document matches a given pattern
  - Reduce: identity function
- Distributed sort
  - Map: extract the key from each record, emit <key, record>
  - Reduce: emit all pairs unchanged
  - Relies on the partitioning function and ordering guarantees (later in the talk)
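To make the inverted-index example above concrete, here is a hypothetical single-process simulation (not the Google library API): `IndexMap` plays the map role, emitting <word, document ID> pairs, and `IndexReduce` stands in for the shuffle-plus-reduce step, grouping the pairs by word. All names here are illustrative.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// An intermediate <word, document ID> pair.
using Pair = std::pair<std::string, int>;

// Map: parse one document, emit <word, document ID> for every word.
std::vector<Pair> IndexMap(int doc_id, const std::string& contents) {
  std::vector<Pair> out;
  std::istringstream in(contents);
  std::string word;
  while (in >> word) out.emplace_back(word, doc_id);
  return out;
}

// Shuffle + Reduce: group intermediate pairs by word and produce
// <word, set of document IDs> (duplicates within a document collapse).
std::map<std::string, std::set<int>> IndexReduce(
    const std::vector<Pair>& pairs) {
  std::map<std::string, std::set<int>> index;
  for (const auto& p : pairs) index[p.first].insert(p.second);
  return index;
}
```

In the real system the grouping happens via the partitioning function and the per-reduce-task sort; the `std::map` here just simulates that outcome in one process.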

Page 10

Page 11

MapReduce Jobs Run in August 2004 (Table 1)

Page 12

Outline

Programming Model Implementation Refinements Evaluation Conclusion

Page 13

Implementation Overview

- Execution environment: Google cluster
  - 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
  - 100 Mbps or 1 Gbps Ethernet, but limited (average) bisection bandwidth
  - Storage on local IDE disks
- GFS: distributed file system manages the data
- Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines

Page 14

Parallelization

- Map
  - Divide the input into M equal-sized splits
  - Each split is 16-64 MB
- Reduce
  - Partition the intermediate key space into R pieces
  - hash(intermediate_key) mod R
- Typical setting:
  - 2,000 machines
  - M = 200,000
  - R = 5,000
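The default partitioning scheme above can be sketched in a few lines; `std::hash` stands in for whatever hash function the library actually uses (the paper also notes that users may supply their own partitioner, e.g. hashing only the hostname of a URL key):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Route an intermediate key to one of R reduce partitions.
// std::hash is a stand-in for the library's internal hash function.
std::size_t Partition(const std::string& intermediate_key, std::size_t R) {
  return std::hash<std::string>{}(intermediate_key) % R;
}
```

Because the same key always hashes to the same partition, every intermediate value for a given key lands at the same reduce task, which is what lets reduce see all values for that key together.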

Page 15

Execution Overview

(Execution overview figure from the paper: the user program calls (0) mapreduce(spec, &result); the input is divided into M splits of 16-64 MB each; map workers partition intermediate pairs into R regions using hash(intermediate_key) mod R; each reduce worker reads all of its intermediate data and sorts it by intermediate keys.)

Page 16

Timeline

Page 17

More Details

- Master data structures:
  - Map task: state (idle/in-progress/completed), R intermediate file locations, worker machine
  - Reduce task: state (idle/in-progress/completed), worker machine
  - O(M+R) scheduling decisions, O(MR) space
- Locality-preserving scheduling
  - Schedule a map task close to its input's location
- Prefer fine-grain tasks
  - Dynamic load balancing
  - Speeds up recovery
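A hypothetical sketch of the per-task bookkeeping described above (these struct and field names are illustrative, not the paper's): the master holds one entry per map task and per reduce task, and the R intermediate file locations stored per map task account for the O(MR) space.

```cpp
#include <string>
#include <vector>

// Task lifecycle states tracked by the master.
enum class TaskState { kIdle, kInProgress, kCompleted };

// Per-map-task state: besides the lifecycle state, the master records
// the R intermediate files the task produced and where it ran.
struct MapTask {
  TaskState state = TaskState::kIdle;
  std::vector<std::string> intermediate_files;  // R file locations
  std::string worker;                           // assigned machine
};

// Per-reduce-task state is smaller: just state and worker.
struct ReduceTask {
  TaskState state = TaskState::kIdle;
  std::string worker;
};
```

With M map entries each carrying R file locations, the master's memory is O(MR), while it makes only O(M+R) scheduling decisions over the job.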

Page 18

Fault Tolerance via Re-Execution

- On worker failure:
  - Detect failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
    - Fine granularity: the completed tasks can be re-executed on multiple machines quickly
  - Re-execute in-progress reduce tasks
  - Task completion is committed through the master
- Master failure:
  - Could handle, but don't yet (master failure unlikely)

Page 19

Outline

Programming Model Implementation Refinements Evaluation Conclusion

Page 20

Backup Tasks

- Problem: "stragglers"
  - A machine takes an unusually long time to complete one of the last few map or reduce tasks
  - E.g., a bad disk with frequent correctable errors, other jobs running, machine configuration problems, etc.
- Near the end of a phase, the master schedules backup executions of the remaining in-progress tasks
- Whichever copy finishes first "wins"

Page 21

Combiner Function

- Purpose: reduce the data sent over the network
- Combiner function: performs partial merging of intermediate data at the map worker
- Typically, combiner function == reducer function
  - Requires the function to be commutative and associative
  - E.g., word count
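A small single-process sketch (not the library API) of why a commutative, associative function can double as a combiner: partially summing word counts at each map worker and then reducing the partial sums gives exactly the same totals as reducing all raw pairs at once.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using Counts = std::map<std::string, int>;
using KV = std::pair<std::string, int>;

// The word-count reduce function: sum the values for each key.
// This same function serves as the combiner.
Counts SumByKey(const std::vector<KV>& pairs) {
  Counts out;
  for (const auto& p : pairs) out[p.first] += p.second;
  return out;
}

// Run the combiner on each map worker's output first, then reduce
// the (much smaller) stream of partial sums.
Counts CombineThenReduce(const std::vector<std::vector<KV>>& per_worker) {
  std::vector<KV> partial;
  for (const auto& worker_pairs : per_worker)
    for (const auto& kv : SumByKey(worker_pairs)) partial.push_back(kv);
  return SumByKey(partial);
}
```

Commutativity and associativity are what make the two orders of summation agree; the network win is that each map worker ships one partial count per distinct word instead of one "1" per occurrence.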

Page 22

Skipping Bad Records

- Map/Reduce functions sometimes fail for particular inputs
  - The best solution is to debug & fix, but that is not always possible
- On a seg fault:
  - Send a UDP packet to the master from the signal handler
  - Include the sequence number of the record being processed
- If the master sees two failures for the same record:
  - The next worker is told to skip the record
- Effect: can work around bugs in third-party libraries
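The protocol above can be simulated in one process as a sketch. In the real system a signal handler reports the crashing record's sequence number to the master over UDP and the whole task is re-executed; here a thrown exception stands in for the crash, a failure counter stands in for the master's bookkeeping, and re-execution is simplified to retrying from the same record. All names are illustrative.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <vector>

// Run user_fn over records, skipping any record that has "crashed"
// (thrown) twice -- mirroring the master's skip-after-two-failures rule.
std::vector<int> RunWithSkipping(const std::vector<int>& records,
                                 const std::function<int(int)>& user_fn) {
  std::map<std::size_t, int> failures;  // seqno -> crash count (master state)
  std::vector<int> outputs;
  for (std::size_t seq = 0; seq < records.size();) {
    if (failures[seq] >= 2) {  // master told the worker to skip this record
      ++seq;
      continue;
    }
    try {
      outputs.push_back(user_fn(records[seq]));
      ++seq;  // success: advance to the next record
    } catch (const std::exception&) {
      ++failures[seq];  // the "UDP packet" reporting this record's seqno
      // the re-executed attempt resumes at the same record
    }
  }
  return outputs;
}
```

The output simply omits the bad record's result, which is acceptable for workloads like statistical analysis over huge data sets.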

Page 23

Other Refinements

- Extensible input and output types
- Local execution for debugging
- Status web page
- User-defined counters
  - Counter values returned to user code
  - Displayed on the status web page

Page 24

Outline

Programming Model Implementation Refinements Evaluation Conclusion

Page 25

Setup

Tests run on a cluster of 1800 machines:
- 4 GB of memory
- Dual-processor 2 GHz Xeons with Hyperthreading
- Dual 160 GB IDE disks
- Gigabit Ethernet per machine
- Bisection bandwidth approximately 100 Gbps

Page 26

Benchmarks

Two benchmarks:
- Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
  - M = 15,000 (input split size about 64 MB)
  - R = 1
- Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
  - M = 15,000 (input split size about 64 MB)
  - R = 4,000

Page 27

Grep

- Locality optimization helps: 1800 machines read 1 TB of data at a peak of ~31 GB/s
  - Without it, rack switches would limit the rate to 10 GB/s
- Startup overhead is significant for short jobs
  - Total time about 150 seconds, including about 1 minute of startup time

Page 28

Sort

- Execution without backup tasks: 44% longer
- Execution with machine failures: 5% longer

Page 29

Conclusion

- MapReduce has proven to be a useful abstraction
- Greatly simplifies large-scale computations at Google
- Fun to use: focus on the problem, let the library deal with the messy details