MapReduce Design Patterns
Will Shen, 2013/02/01

MapReduce Design Patterns explained:
* Summarization
** Numerical Summarizations
** Inverted Index Summarizations
** Counting with Counters
* Filtering
** Filtering
** Bloom Filtering
** Top Ten
** Distinct
* Data Organization
** Structured to Hierarchical
** Partitioning
** Binning
** Total Order Sorting
** Shuffling
* Join
** Reduce Side Join
** Replicated Join
** Composite Join
** Cartesian Product
* Metapatterns
** Job Chaining
** Chain Folding
** Job Merging
* Input and Output
** Generating Data
** External Source Output
** External Source Input
** Partition Pruning
Transcript
Page 1: 20130201 MapReduce Design Patterns

MapReduce Design

Patterns

Will Shen 2013/02/01

Page 2: 20130201 MapReduce Design Patterns

Outline Part I: MapReduce Basics • Map and Reduce • A WordCount example • Open-source framework: Hadoop

Part II: MapReduce Design Patterns • Summarization Patterns • Filtering Patterns • Data Organization Patterns • Join Patterns • Meta Patterns • Input and Output Patterns


Reference: Donald Miner and Adam Shook, “MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems”, 230 pages, O'Reilly Media, 1st edition (December 22, 2012)

Page 3: 20130201 MapReduce Design Patterns

Part I: MapReduce Basics

Motivation: Large Scale Data Processing • Process lots of data (>1TB) • Want to use hundreds of CPUs MapReduce - Google (2005), US patent (2010)

• Automatic parallelization and distribution • Fault-tolerance • I/O scheduling • Status and monitoring


Google, “MapReduce: Simplified Data Processing on Large Clusters”, 2005/04/06

Page 4: 20130201 MapReduce Design Patterns

What is Map and Reduce
Borrows from Functional Programming
• Functional operations do not modify data structures; they create new ones.
• Stateless functional operations have no side effects, so the order of operations does not matter.

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

fun foo(li: int list) = sum(li) + mul(li) + length(li)

Page 5: 20130201 MapReduce Design Patterns

What is MapReduce


map: (k1, v1) → [(k2, v2)]

reduce: (k2, [v2]) → [(k3, v3)]

Page 6: 20130201 MapReduce Design Patterns

Parallel Execution Bottleneck: Reduce phase cannot start until map phase completes


Page 7: 20130201 MapReduce Design Patterns

Big Picture of MapReduce
• Input Reader - divides the input into appropriate-size 'splits' (16 to 128 MB).
• Map - partitions the data (computes part of the problem across several servers).
• Shuffle - groups together the values returned by the map function.
• Reduce - processes the partitions (aggregates the partial results from all servers into a single result set).
• Output Writer - writes the output of the Reducer.

Page 8: 20130201 MapReduce Design Patterns

Example – counting words in documents

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(output_key, AsString(result));
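The pseudocode above can be checked outside Hadoop. A minimal Python simulation of the map → shuffle → reduce flow (illustrative only; not the Hadoop API):

```python
from collections import defaultdict

def map_phase(doc_name, contents):
    # Emit (word, 1) for every word, as in the map pseudocode above.
    return [(w, 1) for w in contents.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework's shuffle would.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(word, counts):
    return (word, sum(counts))

pairs = map_phase("doc1", "the quick fox the fox")
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'the': 2, 'quick': 1, 'fox': 2}
```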

Page 9: 20130201 MapReduce Design Patterns

Open-source framework Apache Hadoop

Hadoop - http://hadoop.apache.org/ Hadoop: not only a Map/Reduce implementation! • HDFS – distributed file system • Pig – high level query language (SQL like) • HBase – distributed column store • Hive – Hadoop based data warehouse • ZooKeeper, Chukwa, Pipes/Streaming, …


Page 10: 20130201 MapReduce Design Patterns

How Hadoop runs a MapReduce Job

• Client submits the MapReduce job.
• JobTracker coordinates the job run.
• TaskTrackers run the tasks that the job has been split into.
• HDFS is used for sharing job files between the other entities.

Page 11: 20130201 MapReduce Design Patterns

WordCount Java Code in Hadoop


Page 12: 20130201 MapReduce Design Patterns

General Considerations
• Map execution order is not deterministic.
• Map processing time cannot be predicted.
• Reduce tasks cannot start before all Maps have finished (the dataset needs to be fully partitioned).
• Not suitable for continuous input streams.
• There will be a spike in network utilization after the Map / before the Reduce phase.
• Number & size of key/value pairs: object creation & serialisation overhead (Amdahl’s law!).
Aggregate partial results when possible! Use Combiners.

Page 13: 20130201 MapReduce Design Patterns

Using MapReduce to Solve Problems

Map
• Word Count: texts → (word, 1)
• Inverted Index: documents → (word, doc_id)
• Max Temperature: formatted data → (year, temperature)
• Mean Rain Precipitation: daily data → (<year-month, lat, long>, precipitation)

Reduce applies a count, list, max, and average to the set of values for each key, respectively.
Reusable solutions?

Page 14: 20130201 MapReduce Design Patterns

What is a “Design Pattern”
Design Pattern: a general, reusable solution to a commonly occurring problem within a given context in software design.

(GoF)

Page 15: 20130201 MapReduce Design Patterns

Part II: MapReduce Design Patterns
1. Summarization: get a top-level view by summarizing and grouping data
2. Filtering: view data subsets, such as records generated by one user
3. Data Organization: reorganize data to work with other systems, or to make MapReduce analysis easier
4. Join: analyze different datasets together to discover interesting relationships
5. Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
6. Input and Output: customize the way you use Hadoop to load or store data

Total: 23 patterns. Each is a template for solving a common and general data manipulation problem with MapReduce.

Page 16: 20130201 MapReduce Design Patterns

The 23 Patterns of MapReduce Summarization • Numerical Summarizations • Inverted Index Summarizations • Counting with Counters

Filtering • Filtering • Bloom Filtering • Top Ten • Distinct

Data Organization • Structured to Hierarchical • Partitioning • Binning • Total Order Sorting • Shuffling

Join • Reduce Side Join • Replicated Join • Composite Join • Cartesian Product

Metapatterns • Job Chaining • Chain Folding • Job Merging

Input and Output • Generating Data • External Source Output • External Source Input • Partition Pruning


Page 17: 20130201 MapReduce Design Patterns

The End

Thanks for your attention. Any questions?

Page 18: 20130201 MapReduce Design Patterns

Pattern Template in this Book
• Name: a well-chosen name for the pattern.
• Intent: a quick problem description.
• Motivation: why you would want to solve this problem, or where it would appear.
• Applicability: a set of criteria that must be true to be able to apply this pattern to a problem.
• Structure: the layout of the MapReduce job itself.
• Consequences: the end goal of the output this pattern produces.
• Resemblances: analogies of how this problem would be solved with other languages, like SQL and Pig.
• Known Uses: some common use cases.
• Performance Analysis: the performance profile of the analytic produced by the pattern.

Page 19: 20130201 MapReduce Design Patterns

2.1 Summarization Patterns

Your data is large and vast, with more data coming into the system every day (e.g., web user logs).
• You want to produce a top-level, summarized view of the data.
• You can glean insights not available from looking at a localized set of records alone.
Patterns:
• Numerical Summarizations
• Inverted Index Summarizations
• Counting with Counters

Page 20: 20130201 MapReduce Design Patterns

Numerical Summarizations 1/4
Intent - Group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set.
Motivation
• Many data sets these days are too large for a human to get any real meaning out of them by reading through them manually, e.g., terabytes of website log files.
• Typical aggregates: minimum, maximum, average, median, and standard deviation.
Applicability
• You are dealing with numerical data or counting.
• The data can be grouped by specific fields.

Page 21: 20130201 MapReduce Design Patterns

Numerical Summarizations 2/4
Structure
• Mapper: outputs keys that consist of each field to group by, and values consisting of any pertinent numerical items.
• Reducer: receives the set of numerical values (v1, v2, v3, …, vn) associated with each group-by key and performs the aggregation function λ. The value of λ is output with the given input key.
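A minimal Python sketch of this structure, with the aggregation function λ taken to be (min, max, count); the record layout and group-by field (a country code) are invented for illustration:

```python
from collections import defaultdict

# Hypothetical records: (group-by field, numerical item).
records = [("us", 3), ("us", 7), ("tw", 5), ("tw", 1), ("us", 2)]

# Mapper: emit (group-by key, numerical value).
mapped = [(country, value) for country, value in records]

# Shuffle: group values by key.
groups = defaultdict(list)
for k, v in mapped:
    groups[k].append(v)

# Reducer: apply the aggregation function to each group.
summary = {k: (min(vs), max(vs), len(vs)) for k, vs in groups.items()}
print(summary)  # {'us': (2, 7, 3), 'tw': (1, 5, 2)}
```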

Page 22: 20130201 MapReduce Design Patterns

Numerical Summarizations 3/4
Consequences
• A set of part files containing a single record per reducer input group. Each record consists of the key and all aggregate values.
Known uses
• Word count, record count
• Min, max, count of a particular event
• Average, median, standard deviation
Resemblances
• SQL
• Pig

SELECT MIN(numericalcol1), MAX(numericalcol1), COUNT(*) FROM table GROUP BY groupcol2;

b = GROUP a BY groupcol2; c = FOREACH b GENERATE group, MIN(a.numericalcol1), MAX(a.numericalcol1), COUNT_STAR(a);

Page 23: 20130201 MapReduce Design Patterns

Numerical Summarizations 4/4
Performance analysis
• Aggregations perform well when the combiner is properly used.
• Data skew of reduce groups: with many more intermediate key/value pairs for a specific key than for other keys, one reducer is going to have a lot more work to do than the others.

Page 24: 20130201 MapReduce Design Patterns

Inverted Index Summarizations 1/4 Intent - Generate an index from a data set to allow for faster searches.


storing a mapping from content to its locations

Page 25: 20130201 MapReduce Design Patterns

Inverted Index Summarizations 2/4

Motivation
• To index large data sets on keywords, so that searches can trace terms back to records that contain specific values.
• Improves the search performance of a search engine.
Applicability
• You require quick query responses.
• The results of such a query can be preprocessed and ingested into a database.

Page 26: 20130201 MapReduce Design Patterns

Inverted Index Summarizations 3/4

Structure

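The structure figure is not reproduced in this transcript; the mapper/reducer flow it depicts can be sketched in Python (the document IDs and contents are invented for illustration):

```python
from collections import defaultdict

docs = {1: "hadoop mapreduce", 2: "mapreduce patterns", 3: "hadoop patterns"}

# Mapper: emit (term, doc_id) for each keyword in each record.
mapped = [(term, doc_id)
          for doc_id, text in docs.items()
          for term in text.split()]

# Reducer: collect the unique record IDs seen for each term.
index = defaultdict(set)
for term, doc_id in mapped:
    index[term].add(doc_id)

print(index["hadoop"])     # {1, 3}
print(index["mapreduce"])  # {1, 2}
```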

Page 27: 20130201 MapReduce Design Patterns

Inverted Index Summarizations 4/4

Consequences
• “field value” → [unique IDs of records]
Performance analysis
• Parsing the content in the Mapper is the most computationally intensive part.
• The higher the cardinality of the index keys, the more reducers can be used, increasing parallelism.
• Watch the number of content identifiers per key: for a hot key such as “the”, a few reducers will take much longer than the others, which may require a custom partitioner.

Page 28: 20130201 MapReduce Design Patterns

Counting with Counters 1/3
Intent
• An efficient means to retrieve count summarizations of large data sets.
Motivation
• A count or summation can tell you a lot about your data as a whole.
• Simply use the framework’s counters: no reduce phase and no summation needed.
Applicability
• You want to gather counts or summations over large data sets.
• The number of counters you are going to create is small.

Page 29: 20130201 MapReduce Design Patterns

Counting with Counters 2/3
Structure
• Mapper: processes one input record at a time, incrementing counters based on certain criteria.
• Counter: (a) incremented by one if counting a single instance; (b) incremented by some number if executing a summation.
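In Hadoop the counters live in the job framework; as a stand-in, a map-only Python sketch of the counting logic (the log-line format and counter names are invented):

```python
from collections import Counter

log_lines = ["GET /a 200", "GET /b 404", "GET /c 200", "GET /d 500"]

counters = Counter()
for line in log_lines:           # map-only: one pass, no reduce phase
    status = line.split()[-1]    # criterion: the HTTP status field
    counters["status." + status] += 1

print(counters["status.200"])  # 2
```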

Page 30: 20130201 MapReduce Design Patterns

Counting with Counters 3/3
Consequences
• The final output is a set of counters grabbed from the job framework (no actual file output).
Known uses
• Count the number of records (over a given time period)
• Count a small number of unique instances
• Sum fields of data together
Performance analysis
• Using counters is very fast, as data is simply read in through the mapper and no output is written.
• Performance depends largely on the number of map tasks being executed and how much time it takes to process each record.

Page 31: 20130201 MapReduce Design Patterns

2.2 Filtering Patterns

To understand a smaller piece of the data: find a subset of data, such as a top-ten listing, the results of a de-duplication, or a sample.
Filtering Patterns:
• Filtering
• Bloom Filtering
• Top Ten
• Distinct

Page 32: 20130201 MapReduce Design Patterns

Filtering 1/4

Intent
• Filter out records that are not of interest.
Motivation
• Your data set is large and you want to take a subset of this data to focus in on, and perhaps do follow-on analysis.
Applicability
• The data can be parsed into “records” that can be categorized through some well-specified criterion determining whether they are to be kept.

Page 33: 20130201 MapReduce Design Patterns

Filtering 2/4

Structure • No “Reducer”

map(key, record):
  if we want to keep record:
    emit(key, value)

Page 34: 20130201 MapReduce Design Patterns

Filtering 3/4
Consequences
• A subset of the records that pass the selection criteria.
• If the format is kept the same, any job that ran over the larger data set should be able to run over this filtered data set as well.
Known uses
• Closer view of data
• Tracking a thread of events
• Distributed grep
• Data cleansing
• Simple random sampling
• Removing low-scoring data (if you can score your data)

Page 35: 20130201 MapReduce Design Patterns

Filtering 4/4
Resemblances
• SQL: SELECT * FROM table WHERE value < 3
• Pig: b = FILTER a BY value < 3;
Performance analysis
• No reducers: data never has to be transmitted between the map and reduce phase.
• Most of the map tasks pull data off of their locally attached disks and then write back out to that node.
• Both the sort phase and the reduce phase are cut out.

Page 36: 20130201 MapReduce Design Patterns

Bloom Filtering 1/5
Intent
• Filter such that we keep records that are members of some predefined set of values (hot values).
Motivation
• To filter records based on a set membership operation against the hot values.
• The set membership is evaluated with a Bloom filter.

(Figure: Bloom filter example with m = 18 bits and k = 3 hash functions; w is not in the set {x, y, z}.)

Page 37: 20130201 MapReduce Design Patterns

Bloom Filtering 2/5
Applicability
• Data can be separated into records, as in filtering.
• A feature can be extracted from each record that could be in a set of hot values.
• There is a predetermined set of items for the hot values.
• Some false positives are acceptable (i.e., some records will get through when they should not have).

Page 38: 20130201 MapReduce Design Patterns

Bloom Filtering 3/5
Structure – training + actual filtering
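The two stages can be illustrated with a toy Bloom filter in Python. The bit-array size, hash scheme, and hot-value list are all invented; a real job would use a proper implementation (Hadoop ships one in its util classes):

```python
# Toy Bloom filter: m bits, k hash functions.
m, k = 64, 3

def hashes(item):
    # Derive k hash positions from Python's built-in hash (illustrative only).
    return [hash((i, item)) % m for i in range(k)]

# Training: set the bits for every hot value.
bits = [False] * m
for hot in ["alice", "bob", "carol"]:
    for h in hashes(hot):
        bits[h] = True

def maybe_member(item):
    # Filtering: all k bits set -> "probably in the set" (false positives possible,
    # false negatives impossible).
    return all(bits[h] for h in hashes(item))

records = ["alice", "dave", "bob", "erin"]
kept = [r for r in records if maybe_member(r)]
print(kept)  # always contains "alice" and "bob"; may contain false positives
```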

Page 39: 20130201 MapReduce Design Patterns

Bloom Filtering 4/5

Consequences
• A subset of the records that passed the Bloom filter membership test.
• False positives exist among the kept records.
Known uses
• Removing most of the non-watched values
• Prefiltering a data set before an expensive set membership check

Page 40: 20130201 MapReduce Design Patterns

Bloom Filtering 5/5

Performance analysis
• Loading up the Bloom filter is not that expensive, since the file is relatively small.
• Checking a value against the Bloom filter is also relatively cheap: O(1) hashing.

Page 41: 20130201 MapReduce Design Patterns

Top Ten 1/4
Intent
• Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
Motivation
• The top records are typically the most interesting ones.
• To find the best records for a specific criterion.
Applicability
• It is possible to compare one record to another to determine which is “larger”.
• The number of output records should be significantly fewer than the number of input records; otherwise this degenerates into a total ordering of the data set.

Page 42: 20130201 MapReduce Design Patterns

Top Ten 2/4
Structure
• Mapper: find the local top K.
• (Only one) Reducer: receives the K*M records (K from each of M mappers) and selects the final top K.
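The local-top-K / single-reducer structure can be sketched in Python with `heapq` (the input splits are invented for illustration):

```python
import heapq

def mapper_local_topk(records, k):
    # Each mapper keeps only its local top K.
    return heapq.nlargest(k, records)

def reducer_final_topk(local_tops, k):
    # The single reducer sees at most K*M records and picks the final top K.
    merged = [x for top in local_tops for x in top]
    return heapq.nlargest(k, merged)

splits = [[5, 1, 9, 3], [8, 2, 7], [6, 10, 4]]           # M = 3 input splits
local = [mapper_local_topk(s, 3) for s in splits]         # at most K*M = 9 records
print(reducer_final_topk(local, 3))  # [10, 9, 8]
```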

Page 43: 20130201 MapReduce Design Patterns

Top Ten 3/4

Consequences
• The top K records are returned.
Known uses
• Outlier analysis
• Selecting interesting data (the most valuable data)
• Catchy dashboards
Resemblances
• SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10
• Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;

Page 44: 20130201 MapReduce Design Patterns

Top Ten 4/4
Performance analysis – one single Reducer
• How many records (K*M) is the reducer getting?
• The sort can become an expensive operation when it has too many records and has to do most of the sorting on local disk instead of in memory.
• The reducer host will receive a lot of data over the network, creating a network resource hot spot.
• Naturally, scanning through all the data in the reduce will take a long time if there are many records to look through.
• Any memory growth in the reducer has the possibility of blowing through the Java virtual machine’s memory.
• Writes to the output file are not parallelized.

Page 45: 20130201 MapReduce Design Patterns

Distinct 1/4

Intent
• To find a unique set of values from similar records.
Motivation
• Reducing a data set to a unique set of values has several uses.
Applicability
• You have duplicate values in the data set; it is silly to use this pattern otherwise.

Page 46: 20130201 MapReduce Design Patterns

Distinct 2/4
Structure
• Exploits MapReduce’s ability to group keys together to remove duplicates.
• The Mapper transforms the data; little work is done in the reducer.
• Duplicate records are often located close to one another in a data set, so a combiner will deduplicate them in the map phase.
• The Reducer groups the nulls together by key, so we’ll have one null per key; it simply outputs the key.

map(key, record):
  emit(record, null)

reduce(key, records):
  emit(key)

Page 47: 20130201 MapReduce Design Patterns

Distinct 3/4
Consequences
• The output records are guaranteed to be unique, but their order is not preserved, due to the random partitioning of the records.
Known uses
• Deduplicating data
• Getting distinct values
• Protecting against an inner join explosion
Resemblances
• SQL: SELECT DISTINCT * FROM table;
• Pig: b = DISTINCT a;

Page 48: 20130201 MapReduce Design Patterns

Distinct 4/4

Performance analysis
• Consider the number of reducers you think you will need.
• Basically, if duplicates are very rare within an input split, pretty much all of the data is going to be sent to the reduce phase.

Page 49: 20130201 MapReduce Design Patterns

2.3 Data Organization Patterns

The value of individual records is often multiplied by the way they are partitioned, sharded, or sorted; this is especially true in distributed systems.
Patterns:
• Structured to Hierarchical
• Partitioning
• Binning
• Total Order Sorting
• Shuffling

Page 50: 20130201 MapReduce Design Patterns

Structured to Hierarchical 1/3
Intent
• Transform your row-based data to a hierarchical format (JSON or XML).
Motivation
• Migrating data from an RDBMS to Hadoop (a table join).
• Reformatting your data into a more conducive structure.
Applicability
• You have data sources that are linked by some set of foreign keys.
• Your data is structured and row-based.

(Figure: a Posts hierarchy — each Post with its nested Comments.)

Page 51: 20130201 MapReduce Design Patterns

Structured to Hierarchical 2/3
Structure
• The Mapper loads the data and parses the records into one cohesive format.
• A Combiner isn’t going to help.
• The Reducer builds the hierarchical data structure from the list of data items.

Page 52: 20130201 MapReduce Design Patterns

Structured to Hierarchical 3/3

Consequences
• The output will be in a hierarchical form, grouped by the key that you specified.
Known uses
• Pre-joining data
• Preparing data for HBase or MongoDB
Performance analysis
• How much data is being sent to the reducers from the mappers?
• The memory footprint of the object that the reducer builds: what about a post that has a million comments?

Page 53: 20130201 MapReduce Design Patterns

Partitioning 1/3
Intent
• Move the records into categories; the order of records doesn’t matter.
• Take similar records in a data set and partition them into distinct, smaller data sets.
Motivation
• If you want to look at a particular set of data, the data items are normally spread out across the entire data set, which requires an entire scan of all of the data.
Applicability
• You know how many partitions you are going to have ahead of time, e.g., partitioning by day of the week gives 7 partitions.

Page 54: 20130201 MapReduce Design Patterns

Partitioning 2/3

Structure - determine which partition a record is going to go to

Page 55: 20130201 MapReduce Design Patterns

Partitioning 3/3
Known uses
• Partition pruning by continuous value (e.g., date)
• Partition pruning by category (country, phone area code, language)
• Sharding (to different disks)
Performance analysis
• The resulting partitions will likely not have similar numbers of records; perhaps one partition holds 50% of the data.
• If implemented naively, all of this data will get sent to one reducer and will slow down processing significantly.

Page 56: 20130201 MapReduce Design Patterns

Binning 1/3
Intent
• For each record in the data set, file it into one or more categories.
Motivation
• Binning is very similar to partitioning and can often be used to solve the same problem.
• Binning splits data up in the map phase instead of in the partitioner.
• Each mapper will now have one file per possible output bin: 1,000 bins × 1,000 mappers = 1,000,000 files.

Page 57: 20130201 MapReduce Design Patterns

Binning 2/3
Structure
• Mapper: if the record meets the criteria of a bin, it is sent to that bin.
• No combiner, partitioner, or reducer is used in this pattern.

Page 58: 20130201 MapReduce Design Patterns

Binning 3/3
Consequences
• Each mapper outputs one small file per bin.
Resemblances
• Pig
Performance analysis
• Map-only jobs: performance comes down to how efficiently records are processed.
• No sort, shuffle, or reduce to be performed.
• Most of the processing is going to be done on data that is local.

SPLIT data INTO eights IF col1 == 8, bigs IF col1 > 8, smalls IF (col1 < 8 AND col1 > 0);
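The Pig SPLIT above can be mirrored by a small map-side sketch in Python (the records are invented; in a real Hadoop job each bin would become a separate output file rather than an in-memory list):

```python
records = [3, 8, 12, 8, 1, 9]

# Map-only: each record is filed into a bin; no partitioner or reducer.
bins = {"eights": [], "bigs": [], "smalls": []}   # bin names echo the Pig SPLIT
for r in records:
    if r == 8:
        bins["eights"].append(r)
    elif r > 8:
        bins["bigs"].append(r)
    elif 0 < r < 8:
        bins["smalls"].append(r)

print(bins)  # {'eights': [8, 8], 'bigs': [12, 9], 'smalls': [3, 1]}
```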

Page 59: 20130201 MapReduce Design Patterns

Total Order Sorting 1/3

Intent
• Sort your data in parallel on a sort key.
Motivation
• Each Reducer will sort its data by key, but this is not a global sort across all data.
• Sorting in parallel is not easy.
Applicability
• Your sort key has to be comparable so the data can be ordered.

Page 60: 20130201 MapReduce Design Patterns

Total Order Sorting 2/3
Structure
• Analyze phase - determines the range boundaries.
  • Idea: partitions that evenly split a random sample should evenly split the larger data set as well.
  • A Mapper does the random sampling, based on the number of records in the total data set and the percentage of records you’ll need to analyze.
  • Only one reducer collects the sort keys together into a sorted list; the list of keys is then sliced into the data range boundaries.
• Order phase - actually sorts the data.
  • # of Reducers == # of Partitions.
  • A custom partitioner loads up the partition file’s data ranges.
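The two phases can be simulated in Python (the data, sample size, and reducer count are invented; concatenating the per-partition sorts yields a global sort because the partitions are range-based):

```python
import bisect
import random

random.seed(7)
data = [random.randrange(1000) for _ in range(200)]

# Analyze phase: sort a random sample and slice out partition boundaries.
sample = sorted(random.sample(data, 20))
num_reducers = 4
boundaries = [sample[i * len(sample) // num_reducers]
              for i in range(1, num_reducers)]

# Order phase: a custom partitioner routes each record by range...
partitions = [[] for _ in range(num_reducers)]
for x in data:
    partitions[bisect.bisect_right(boundaries, x)].append(x)

# ...and each reducer sorts its own partition; concatenation is globally sorted.
result = [x for part in partitions for x in sorted(part)]
print(result == sorted(data))  # True
```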

Page 61: 20130201 MapReduce Design Patterns

Total Order Sorting 3/3
Consequences
• The output files will contain sorted data.
Resemblances
• SQL: SELECT * FROM data ORDER BY col1;
• Pig: c = ORDER b BY col1;
Performance analysis
• Expensive! The data is loaded and parsed twice:
  • Step 1. Build the partition ranges.
  • Step 2. Actually sort the data.

Page 62: 20130201 MapReduce Design Patterns

Shuffling 1/3
Intent
• To completely randomize a set of records.
Motivation
• Shuffling to anonymize the data.
• Shuffling for repeatable random sampling.

Page 63: 20130201 MapReduce Design Patterns

Shuffling 2/3
Structure
• Mappers emit [random key, record].
• The Reducer sorts the random keys, randomizing the data.
Consequences
• Each reducer outputs a file containing random records.
Resemblances
• SQL: SELECT * FROM data ORDER BY RAND()
• Pig: c = GROUP b BY RANDOM(); d = FOREACH c GENERATE FLATTEN(b);
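The random-key structure is a one-liner to simulate in Python (the records are invented; the framework's sort on keys does the actual shuffling):

```python
import random

random.seed(42)
records = ["r%d" % i for i in range(10)]

# Mappers emit (random key, record); sorting by key randomizes the order.
keyed = [(random.random(), r) for r in records]
shuffled = [r for _, r in sorted(keyed)]

print(sorted(shuffled) == sorted(records))  # same records, new order
```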

Page 64: 20130201 MapReduce Design Patterns

Shuffling 3/3
Performance analysis
• Nice performance properties.
• Data distribution across reducers is completely balanced.
• With more reducers, the data will be more spread out.
• The sizes of the output files are also very predictable: each is the size of the data set divided by the number of reducers. This makes it easy to get a specific desired file size as output.

Page 65: 20130201 MapReduce Design Patterns

2.4 Join Patterns

A refresher on RDBMS joins:
• Inner Join
• Outer Join
• Cartesian Product
• Anti Join = full outer join - inner join
Patterns:
• Reduce Side Join
• Replicated Join
• Composite Join
• Cartesian Product

An SQL query walks into a bar, sees two tables and asks them “May I join you?”

Page 66: 20130201 MapReduce Design Patterns

Reduce Side Join 1/3
Intent
• Join multiple large data sets together by some foreign key.
Motivation
• Simple to implement in the Reducer.
• Supports all the different join operations.
• No limitation on the size of your data sets.
Applicability
• Multiple large data sets are being joined by a foreign key.
• You want the flexibility of being able to execute any join operation.
• A large amount of network bandwidth is available.

Page 67: 20130201 MapReduce Design Patterns

Reduce Side Join 2/3 Structure • Mapper prepares [(foreign key, record)] • Reducer performs join operation

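A minimal Python sketch of this structure for an inner join: mappers tag each record with its source and emit it under the foreign key, and the reducer joins the two sides within each key group (the users/comments records are invented, echoing the SQL example on the next slide):

```python
from collections import defaultdict

users = [(1, "NY"), (2, "SF")]
comments = [(1, "hi"), (1, "yo"), (2, "ok")]

# Mappers: tag records by source, keyed on the foreign key (user ID).
mapped = ([(uid, ("U", loc)) for uid, loc in users]
          + [(uid, ("C", text)) for uid, text in comments])

groups = defaultdict(list)
for key, tagged in mapped:
    groups[key].append(tagged)

# Reducer: inner-join the user side with the comment side per key group.
joined = [(key, loc, text)
          for key, vals in groups.items()
          for _, loc in [v for v in vals if v[0] == "U"]
          for _, text in [v for v in vals if v[0] == "C"]]
print(sorted(joined))  # [(1, 'NY', 'hi'), (1, 'NY', 'yo'), (2, 'SF', 'ok')]
```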

Page 68: 20130201 MapReduce Design Patterns

Reduce Side Join 3/3
Consequences
• # of part files == # of reduce tasks; each part file contains its portion of the joined records.
Resemblances
• SQL
Performance analysis
• Watch the cluster’s network bandwidth!
• Use relatively more reducers than for a typical analytic.

SELECT users.ID, users.Location, comments.upVotes FROM users [INNER|LEFT|RIGHT] JOIN comments ON users.ID=comments.UserID

Page 69: 20130201 MapReduce Design Patterns

Replicated Join 1/3
Intent
• Eliminates the need to shuffle any data to the reduce phase.
Motivation
• All the data sets except the very large one are essentially read into memory during the setup phase of each map task, which is limited by the JVM heap.
Applicability
• All of the data sets, except for the large one, can fit into the main memory of each map task.

Page 70: 20130201 MapReduce Design Patterns

Replicated Join 2/3
Structure
• Map-only pattern.
• Read all the small files from the distributed cache and store them into in-memory lookup tables.
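The map-side probe is simple to sketch in Python (the tables are invented; `small` stands for the data set loaded from the distributed cache during each map task's setup):

```python
# Replicated (map-side) join: the small data set fits in memory on every mapper.
small = {1: "NY", 2: "SF"}              # loaded once per map task (setup phase)

big = [(1, "hi"), (3, "zz"), (2, "ok")]  # streamed through the mappers

# Map-only: probe the in-memory table; unmatched records drop out (inner join).
joined = [(uid, small[uid], text) for uid, text in big if uid in small]
print(joined)  # [(1, 'NY', 'hi'), (2, 'SF', 'ok')]
```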

Page 71: 20130201 MapReduce Design Patterns

Replicated Join 3/3
Consequences
• # of part files == # of map tasks; the part files contain the full set of joined records.
Performance analysis
• A replicated join can be the fastest type of join executed, because no reducer is required.
• Limited by the amount of data that can be stored safely inside the JVM.

Page 72: 20130201 MapReduce Design Patterns

Composite Join 1/4
Intent
• Performed on the map side with many very large formatted inputs.
• Completely eliminates the need to shuffle and sort all the data to the reduce phase.
• The data must already be organized or prepared in a very specific way.
Motivation
• Particularly useful if you want to join very large data sets together.
• The data sets must first be sorted by foreign key, partitioned by foreign key, and read in a very particular manner.

Page 73: 20130201 MapReduce Design Patterns

Composite Join 2/4
Applicability
• An inner or full outer join is desired.
• All the data sets are sufficiently large.
• All data sets can be read with the foreign key as the input key to the mapper.
• All data sets have the same number of partitions.
• Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set.
• The data sets do not change often (since they have to be prepared).

Page 74: 20130201 MapReduce Design Patterns

Composite Join 3/4

Structure
• Map-only.
• The Mapper is very trivial: two values are retrieved from the input tuple and output to the file system.

Page 75: 20130201 MapReduce Design Patterns

Composite Join 4/4

Consequences
• Output # of part files == # of map tasks.
Performance analysis
• Can be executed relatively quickly over large data sets.
• Data preparation = sorting cost.
• The cost of producing the prepared data sets is averaged out over all of the runs.

Page 76: 20130201 MapReduce Design Patterns

Cartesian Product 1/3
Intent
• Pair up and compare every single record with every other record in a data set.
Motivation
• Simply pairs every record of a data set with every record of all the other data sets.
• To analyze relationships between one or more data sets.
Applicability
• You want to analyze relationships between all pairs of individual records.
• You’ve exhausted all other means to solve this problem.
• You have no time constraints on execution time.

Page 77: 20130201 MapReduce Design Patterns

Cartesian Product 2/3
Structure
• Map-only.
• The pairing of input splits is handled by a custom input format and RecordReader.

Page 78: 20130201 MapReduce Design Patterns

Cartesian Product 3/3
Consequences
• The final data set is made up of tuples equivalent to the number of input data sets.
• Every possible tuple combination from the input records is represented in the final output.
Resemblances
• SQL: SELECT * FROM tableA, tableB;
Performance analysis
• A massive explosion in data size: O(n^2).
• If a single input split contains a thousand records, the right input split needs to be read a thousand times before the task can finish.
• If a single task fails for an odd reason, the whole thing needs to be restarted.

Page 79: 20130201 MapReduce Design Patterns

2.5 Metapatterns (skipped)

Patterns about using patterns
• Job Chaining - piecing together several patterns to solve complex, multistage problems
• Chain Folding
• Job Merging - an optimization for performing several analytics in the same MapReduce job

Page 80: 20130201 MapReduce Design Patterns

2.6 Input and Output Patterns
Customizing input and output in Hadoop
• Configuring how contiguous chunks of input are generated from blocks in HDFS
• Configuring how records appear in the map phase
• RecordReader and InputFormat classes
• RecordWriter and OutputFormat classes
Patterns
• Generating Data
• External Source Output
• External Source Input
• Partition Pruning

Page 81: 20130201 MapReduce Design Patterns

Generating Data 1/3

Intent
• You want to generate a lot of data from scratch.
Motivation
• It doesn’t load data; it generates the data and stores it back in the distributed file system.

Page 82: 20130201 MapReduce Design Patterns

Generating Data 2/3

Structure • map-only


Page 83: 20130201 MapReduce Design Patterns

Generating Data 3/3

Consequences
• Each mapper outputs a file containing random data.
Performance analysis
• Consider how many worker map tasks are needed to generate the data.
• In general, the more map tasks you have, the faster you can generate data.

Page 84: 20130201 MapReduce Design Patterns

External Source Output 1/3
Intent
• To write MapReduce output to a nonnative location (outside of Hadoop and HDFS).
Motivation
• To output data from the MapReduce framework directly to an external source.
• This is extremely useful for loading directly into a system instead of staging the data to be delivered to the external source.

Page 85: 20130201 MapReduce Design Patterns

External Source Output 2/3

Structure


Page 86: 20130201 MapReduce Design Patterns

External Source Output 3/3
Consequences
• The output data has been sent to the external source, and that external source has loaded it successfully.
Performance analysis
• Make sure the receiver of the data can handle the parallel connections.
• Having a thousand tasks writing to a single SQL database is not going to work well.

Page 87: 20130201 MapReduce Design Patterns

External Source Input 1/3
Intent
• You want to load data in parallel from a source that is not part of your MapReduce framework.
Motivation
• The typical model for using MapReduce to analyze data is to store it in HDFS first.
• With this pattern, you can hook the MapReduce framework into an external source, such as a database or a web service, and pull the data directly into the mappers.

Page 88: 20130201 MapReduce Design Patterns

External Source Input 2/3

Structure


Page 89: 20130201 MapReduce Design Patterns

External Source Input 3/3

Consequences
• Data is loaded from the external source into the MapReduce job.
• The map phase doesn’t care where that data came from.
Performance analysis
• The bottleneck is the source or the network.
• The source may not scale well with multiple connections (e.g., a single-threaded SQL database).
• If the source is not in the cluster’s network, the connections may be reaching out over a slower public network.

Page 90: 20130201 MapReduce Design Patterns

Partition Pruning 1/3

Intent
• You have a set of data that is partitioned by a predetermined value, which you can use to dynamically load the data based on what is requested by the application.
Motivation
• Loading all of the files is a large waste of processing time.
• By partitioning the data by a common value, you can avoid significant amounts of processing time by looking only where the data would exist.

Page 91: 20130201 MapReduce Design Patterns

Partition Pruning 2/3

Structure

91

Page 92: 20130201 MapReduce Design Patterns

Partition Pruning 3/3
Consequences
• Partition pruning changes only the amount of data that is read by the MapReduce job, not the eventual outcome of the analytic.
Performance analysis
• Utilizing this pattern can provide massive gains by reducing the number of tasks that need to be created that would not have generated output anyway.
• Outside of the I/O, the performance depends on the other pattern being applied in the map and reduce phases of the job.

Page 93: 20130201 MapReduce Design Patterns

The End (Finally…)

Thanks for your attention.
• MapReduce has proven to be a useful abstraction.
• It greatly simplifies large-scale computations.
• Hadoop is widely used.
• Focus on your problems; let MapReduce deal with the messy details.
Any questions?