MapReduce Design Patterns
Will Shen, 2013/02/01 (Jan 27, 2015)
Outline
Part I: MapReduce Basics
• Map and Reduce
• A WordCount example
• Open-source framework: Hadoop
Part II: MapReduce Design Patterns
• Summarization Patterns
• Filtering Patterns
• Data Organization Patterns
• Join Patterns
• Meta Patterns
• Input and Output Patterns
Reference: Donald Miner and Adam Shook, "MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems", O'Reilly Media, 1st edition (December 22, 2012), 230 pages.
Part I: MapReduce Basics
Motivation: Large-Scale Data Processing
• Process lots of data (> 1 TB)
• Want to use hundreds of CPUs
MapReduce: Google (2004), US patent (2010)
• Automatic parallelization and distribution
• Fault tolerance
• I/O scheduling
• Status and monitoring
Jeffrey Dean and Sanjay Ghemawat (Google), "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004.
What is Map and Reduce?
Borrows from functional programming:
• Functional operations do not modify data structures; they create new ones.
• Functional operations are stateless and have no side effects, so the order of operations does not matter.

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]
(* SML: a purely functional operation; the result depends only on the input list *)
fun foo(li: int list) = sum(li) + mul(li) + length(li)
What is MapReduce?
(Dataflow diagram in the original slides: input → map → shuffle → reduce → output.)
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]
Parallel Execution
Bottleneck: the reduce phase cannot start until the map phase completes.
Big Picture of MapReduce
Input Reader - divides the input into appropriately sized 'splits' (16 to 128 MB)
Map - partitions the data (computes part of the problem across several servers)
Shuffle - groups together the values returned by the map function, by key
Reduce - processes the partitions (aggregates the partial results from all servers into a single result set)
Output Writer - writes the output of the Reducer
Example – counting words in documents

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(output_key, AsString(result));
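The same logic as a small, runnable Python simulation (plain dictionaries stand in for Hadoop's shuffle; an illustrative sketch, not Hadoop API code):

from collections import defaultdict

def map_phase(documents):
    # Emit (word, 1) for every word in every document.
    for name, contents in documents.items():
        for word in contents.split():
            yield word, 1

def shuffle(pairs):
    # Group intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}
print(reduce_phase(shuffle(map_phase(docs))))  # {'the': 2, ...}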
Open-source framework: Apache Hadoop
Hadoop - http://hadoop.apache.org/
Hadoop is not only a MapReduce implementation!
• HDFS - distributed file system
• Pig - high-level query language (SQL-like)
• HBase - distributed column store
• Hive - Hadoop-based data warehouse
• ZooKeeper, Chukwa, Pipes/Streaming, …
How Hadoop runs a MapReduce Job
• The client submits the MapReduce job.
• The JobTracker coordinates the job run.
• TaskTrackers run the tasks that the job has been split into.
• HDFS is used for sharing job files between the other entities.
WordCount Java Code in Hadoop
General Considerations
• Map execution order is not deterministic.
• Map processing time cannot be predicted.
• Reduce tasks cannot start before all maps have finished (the dataset needs to be fully partitioned).
• Not suitable for continuous input streams.
• There will be a spike in network utilization after the map phase and before the reduce phase.
• Watch the number and size of key/value pairs: object creation and serialization overhead (Amdahl's law!).
Aggregate partial results when possible: use combiners, as in the sketch below.
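A combiner is a mini-reducer applied to each mapper's local output before the shuffle. A minimal Python sketch of the effect (simulated locally; in a real Hadoop job you would register a combiner class via Job.setCombinerClass instead):

from collections import Counter

def mapper(text):
    # Raw map output: one (word, 1) pair per occurrence.
    return [(w, 1) for w in text.split()]

def combiner(pairs):
    # Pre-aggregate on the map side, shrinking what gets shuffled.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

raw = mapper("to be or not to be")
print(len(raw), len(combiner(raw)))  # 6 pairs shrink to 4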
Using MapReduce to Solve Problems
Map
• Word Count: texts → (word, 1)
• Inverted Index: documents → (word, doc_id)
• Max Temperature: formatted data → (year, temperature)
• Mean Rain Precipitation: daily data → (<year-month, lat, long>, precipitation)
Reduce then applies a count, list, max, or average to the set of values for each key, respectively.
Reusable solutions?
What is a "Design Pattern"
Design Pattern: a general, reusable solution to a commonly occurring problem within a given context in software design. (GoF)
Part II: MapReduce Design Patterns
1. Summarization: get a top-level view by summarizing and grouping data
2. Filtering: view data subsets, such as records generated by one user
3. Data Organization: reorganize data to work with other systems, or to make MapReduce analysis easier
4. Join: analyze different datasets together to discover interesting relationships
5. Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
6. Input and Output: customize the way you use Hadoop to load or store data

23 patterns in total, each a template for solving a common and general data manipulation problem with MapReduce.
The 23 Patterns of MapReduce
Summarization
• Numerical Summarizations
• Inverted Index Summarizations
• Counting with Counters
Filtering
• Filtering
• Bloom Filtering
• Top Ten
• Distinct
Data Organization
• Structured to Hierarchical
• Partitioning
• Binning
• Total Order Sorting
• Shuffling
Join
• Reduce Side Join
• Replicated Join
• Composite Join
• Cartesian Product
Metapatterns
• Job Chaining
• Chain Folding
• Job Merging
Input and Output
• Generating Data
• External Source Output
• External Source Input
• Partition Pruning
The End
Thanks for your attention. Any questions?
Pattern Template in this Book
Name: a well-selected name for the pattern
Intent: a quick problem description
Motivation: why you would want to solve this problem, or where it would appear
Applicability: a set of criteria that must be true to be able to apply this pattern to a problem
Structure: the layout of the MapReduce job itself
Consequences: the end goal of the output this pattern produces
Resemblances: analogies of how this problem would be solved with other languages, such as SQL and Pig
Known Uses: some common use cases
Performance Analysis: the performance profile of the analytic produced by the pattern
2.1 Summarization Patterns
Your data is large and vast, with more data coming into the system every day (e.g., web user logs).
• You want to produce a top-level, summarized view of the data.
• You can glean insights not available from looking at a localized set of records alone.
Patterns
• Numerical Summarizations
• Inverted Index Summarizations
• Counting with Counters
Numerical Summarizations 1/4
Intent - group records together by a key field and calculate a numerical aggregate per group, to get a top-level view of the larger data set.
Motivation
• Many data sets these days are too large for a human to get any real meaning out of by reading through them manually, e.g., terabytes of website log files.
• Typical aggregates: minimum, maximum, average, median, and standard deviation.
Applicability
• You are dealing with numerical data or counting.
• The data can be grouped by specific fields.
Numerical Summarizations 2/4
Structure
• Mapper: outputs keys that consist of the fields to group by, and values consisting of any pertinent numerical items.
• Reducer: receives the set of numerical values (v1, v2, v3, …, vn) associated with a group-by key and applies the aggregation function λ. The value of λ is output with the given input key.
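A minimal Python simulation of this structure, computing MIN, MAX, and COUNT per group to mirror the SQL resemblance two slides below (the record layout is an illustrative assumption):

from collections import defaultdict

def mapper(record):
    # Hypothetical record layout: (group_field, numeric_field).
    group, value = record
    return group, value

def reducer(group, values):
    # The aggregation function "lambda" of the pattern.
    return group, (min(values), max(values), len(values))

records = [("a", 3), ("b", 7), ("a", 1), ("b", 2)]
groups = defaultdict(list)   # stands in for the shuffle
for rec in records:
    k, v = mapper(rec)
    groups[k].append(v)
for g, vals in groups.items():
    print(reducer(g, vals))  # ('a', (1, 3, 2)) ...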
Numerical Summarizations 3/4
Consequences
• A set of part files containing a single record per reducer input group. Each record consists of the key and all aggregate values.
Known uses
• Word count, record count
• Min, max, count of a particular event
• Average, median, standard deviation
Resemblances
• SQL
• Pig
SELECT MIN(numericalcol1), MAX(numericalcol1), COUNT(*)
FROM table
GROUP BY groupcol2;

b = GROUP a BY groupcol2;
c = FOREACH b GENERATE group, MIN(a.numericalcol1),
    MAX(a.numericalcol1), COUNT_STAR(a);
Numerical Summarizations 4/4
Performance analysis
• Aggregations perform well when the combiner is properly used.
• Beware data skew across reduce groups: if there are many more intermediate key/value pairs with one specific key than with other keys, one reducer will have a lot more work to do than the others.
Inverted Index Summarizations 1/4
Intent - generate an index from a data set to allow for faster searches: store a mapping from content to its locations.
Inverted Index Summarizations 2/4
Motivation
• To index large data sets on keywords, so that searches can trace terms back to the records that contain specific values.
• Improves the search performance of a search engine.
Applicability
• You require quick query responses.
• The results of such a query can be preprocessed and ingested into a database.
Inverted Index Summarizations 3/4
Structure (diagram in the original slides; see the sketch below)
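The structure is the usual map/shuffle/reduce: the mapper emits (keyword, document ID) pairs and the reducer collects the unique IDs per keyword. A runnable Python sketch (the doc IDs and whitespace tokenization are illustrative assumptions):

from collections import defaultdict

def mapper(doc_id, text):
    # Emit each distinct keyword with the record's identifier.
    for word in set(text.lower().split()):
        yield word, doc_id

def reducer(word, doc_ids):
    # The inverted index entry: term -> sorted unique locations.
    return word, sorted(set(doc_ids))

docs = {1: "hadoop map reduce", 2: "map side join"}
index = defaultdict(list)  # stands in for the shuffle
for doc_id, text in docs.items():
    for word, d in mapper(doc_id, text):
        index[word].append(d)
print(dict(reducer(w, ids) for w, ids in index.items()))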
Inverted Index Summarizations 4/4
Consequences
• "field value" → [unique IDs of records]
Performance analysis
• Parsing the content in the mapper is usually the most computationally expensive part.
• As the cardinality of the index keys increases, you can increase the number of reducers to increase parallelism.
• Watch the number of content identifiers per key: for a hot key such as "the", a few reducers will take much longer than the others, which may require a custom partitioner.
Counting with Counters 1/3
Intent
• An efficient means to retrieve count summarizations of large data sets.
Motivation
• A count or summation can tell you a lot about your data as a whole.
• Simply use the framework's counters: no reduce phase and no summation needed.
Applicability
• You want to gather counts or summations over large data sets.
• The number of counters you are going to create is small.
Counting with Counters 2/3
Structure
• Mapper: processes one input record at a time, incrementing counters based on certain criteria.
• Counter: (a) incremented by one if counting single instances; (b) incremented by some number if performing a summation.
A Hadoop Streaming sketch follows.
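With Hadoop Streaming, a mapper increments a counter by writing a reporter line to stderr. A minimal map-only sketch (the counter group, names, and "ERROR" criterion are illustrative assumptions):

#!/usr/bin/env python
# Map-only streaming job: counts records, emits no output.
import sys

def increment_counter(group, counter, amount=1):
    # Hadoop Streaming parses these reporter lines from stderr.
    sys.stderr.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))

for line in sys.stdin:
    increment_counter("MyStats", "records")
    if "ERROR" in line:  # illustrative criterion
        increment_counter("MyStats", "error_records")
# The counters are read from the job framework after completion.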
Counting with Counters 3/3
Consequences
• The final output is a set of counters grabbed from the job framework (no actual file output).
Known uses
• Count the number of records (over a given time period)
• Count a small number of unique instances
• Sum fields of data together
Performance analysis
• Using counters is very fast: data is simply read in through the mapper and no output is written.
• Performance depends largely on the number of map tasks being executed and how much time it takes to process each record.
2.2 Filtering Patterns
To understand a smaller piece of data, find a subset of it: a top-ten listing, the results of a de-duplication, a sample.
Filtering Patterns:
• Filtering
• Bloom Filtering
• Top Ten
• Distinct
Filtering 1/4
Intent
• Filter out records that are not of interest.
Motivation
• Your data set is large and you want to take a subset of it to focus on, and perhaps do follow-on analysis.
Applicability
• The data can be parsed into "records" that can be categorized through some well-specified criterion determining whether they are to be kept.
Filtering 2/4
Structure
• No reducer.

map(key, record):
  if we want to keep record:
    emit(key, value)
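As a concrete instance, a runnable map-only Python sketch of distributed grep, one of the known uses on the next slide (the pattern and line-oriented records are illustrative assumptions):

import re
import sys

PATTERN = re.compile(r"ERROR")  # illustrative filter criterion

def mapper(line):
    # Emit the record unchanged iff it matches; no reducer needed.
    if PATTERN.search(line):
        yield line

for line in sys.stdin:
    for kept in mapper(line.rstrip("\n")):
        print(kept)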
Filtering 3/4
Consequences
• A subset of the records that pass the selection criterion.
• If the format is kept the same, any job that ran over the larger data set should be able to run over this filtered data set as well.
Known uses
• Closer view of data
• Tracking a thread of events
• Distributed grep
• Data cleansing
• Simple random sampling
• Removing low-scoring data (if you can score your data)
Filtering 4/4
Resemblances
• SQL: SELECT * FROM table WHERE value < 3;
• Pig: b = FILTER a BY value < 3;
Performance analysis
• No reducers: data never has to be transmitted between the map and reduce phases.
• Most of the map tasks pull data off of their locally attached disks and then write back out to that node.
• Both the sort phase and the reduce phase are cut out.
Bloom Filtering 1/4
Intent
• Filter such that we keep records that are members of some predefined set of values (the "hot" values).
Motivation
• To filter records based on a set-membership test against the hot values.
• The set membership is evaluated with a Bloom filter.
(Figure: a Bloom filter with m = 18 bits and k = 3 hash functions; w is not in the set {x, y, z}.)
Bloom Filtering 2/4
Applicability
• Data can be separated into records, as in Filtering.
• A feature can be extracted from each record that could be in the set of hot values.
• There is a predetermined set of items for the hot values.
• Some false positives are acceptable (i.e., some records will get through when they should not have).
Bloom Filtering 3/4
Structure: training + actual filtering (see the sketch below)
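A minimal pure-Python Bloom filter makes the two phases concrete: training builds the bit array from the hot values, and filtering tests each record against it. The single-digest hashing scheme is an illustrative assumption, sized to match the m = 18, k = 3 figure earlier:

import hashlib

M, K = 18, 3  # bit-array size and number of hash functions

def positions(value):
    # Derive K bit positions from one digest (illustrative scheme).
    digest = hashlib.sha256(value.encode()).digest()
    return [digest[i] % M for i in range(K)]

def train(hot_values):
    bits = [0] * M
    for v in hot_values:
        for pos in positions(v):
            bits[pos] = 1
    return bits

def might_contain(bits, value):
    # False positives are possible; false negatives are not.
    return all(bits[pos] for pos in positions(value))

bits = train(["x", "y", "z"])
print(might_contain(bits, "x"))  # True
print(might_contain(bits, "w"))  # usually False (may be a false positive)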
Bloom Filtering 4/5
Consequences
• A subset of the records: those that passed the Bloom filter membership test.
• False positive records may exist.
Known uses
• Removing most of the non-watched values
• Prefiltering a data set before an expensive set-membership check
Bloom Filtering 5/5
Performance analysis
• Loading the Bloom filter is not that expensive, since the file is relatively small.
• Checking a value against the Bloom filter is also a relatively cheap operation: O(1) hashing.
Top Ten 1/4
Intent
• Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
Motivation
• Finding the records that are typically the most interesting.
• To find the best records for a specific criterion.
Applicability
• One record can be compared to another to determine which is "larger".
• The number of output records should be significantly fewer than the number of input records (otherwise, simply do a total ordering of the data set).
Top Ten 2/4
Structure
• Mapper: finds its local top K.
• (Only one) Reducer: receives the K*M candidate records from the M mappers and selects the final top K.
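A runnable Python sketch of this two-level structure using heapq (natural integer ordering stands in for the ranking scheme):

import heapq

K = 3

def mapper_local_top_k(records):
    # Each mapper keeps only its local top K.
    return heapq.nlargest(K, records)

def reducer_final_top_k(candidate_lists):
    # One reducer sees only K*M candidates, not the whole data set.
    merged = [r for cand in candidate_lists for r in cand]
    return heapq.nlargest(K, merged)

splits = [[5, 1, 9, 3], [8, 2, 7], [6, 10, 4]]
candidates = [mapper_local_top_k(s) for s in splits]
print(reducer_final_top_k(candidates))  # [10, 9, 8]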
Top Ten 3/4
Consequences
• The top K records are returned.
Known uses
• Outlier analysis
• Selecting interesting data (the most valuable records)
• Catchy dashboards
Resemblances
• SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
• Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;
Top Ten 4/4
Performance analysis - one single reducer
• How many records (K*M) is the reducer getting?
• The sort can become an expensive operation when it has too many records and has to do most of the sorting on local disk instead of in memory.
• The reducer host will receive a lot of data over the network: a network resource hot spot.
• Naturally, scanning through all the data in the reducer will take a long time if there are many records to look through.
• Any memory growth in the reducer has the possibility of blowing through the Java virtual machine's memory.
• Writes to the output file are not parallelized.
Distinct 1/4
Intent
• To find a unique set of values from similar records.
Motivation
• Reducing a data set to a unique set of values has several uses.
Applicability
• You have duplicate values in the data set; it is silly to use this pattern otherwise.
Distinct 2/4
Structure
• Exploits MapReduce's ability to group keys together to remove duplicates.
• The mapper transforms the data; little work is done in the reducer.
• Duplicate records are often located close to one another in a data set, so a combiner will deduplicate them in the map phase.
• The reducer groups the nulls together by key, so there is one null per key: simply output the key.

map(key, record):
  emit(record, null)

reduce(key, records):
  emit(key)
Distinct 3/4
Consequences
• The output records are guaranteed to be unique, but their order is not preserved, due to the random partitioning of the records.
Known uses
• Deduplicating data
• Getting distinct values
• Protecting against an inner-join explosion
Resemblances
• SQL: SELECT DISTINCT * FROM table;
• Pig: b = DISTINCT a;
Distinct 4/4
Performance analysis
• Think about the number of reducers you will need.
• If duplicates are very rare within an input split, pretty much all of the data is going to be sent to the reduce phase.
2.3 Data Organization Patterns
The value of individual records is often multiplied by the way they are partitioned, sharded, or sorted; this is especially true in distributed systems.
Patterns:
• Structured to Hierarchical
• Partitioning
• Binning
• Total Order Sorting
• Shuffling
Structured to Hierarchical 1/3
Intent
• Transform your row-based data into a hierarchical format (JSON or XML).
Motivation
• Migrating data from an RDBMS to Hadoop (pre-joining the tables).
• Reformatting your data into a more conducive structure.
Applicability
• You have data sources that are linked by some set of foreign keys.
• Your data is structured and row-based.
(Figure: a Posts hierarchy, with Comments nested under each Post.)
Structured to Hierarchical 2/3
Structure
• Mappers load the data and parse the records into one cohesive format.
• A combiner isn't going to help here.
• The reducer builds the hierarchical data structure from the list of data items, as sketched below.
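A small Python sketch of the reduce side, nesting comment rows under their post row as JSON (the record shapes and tagging scheme are illustrative assumptions):

import json
from collections import defaultdict

# Mapper output, keyed by post_id; each value is tagged with its source.
rows = [
    ("p1", ("post", {"id": "p1", "title": "Hello"})),
    ("p1", ("comment", {"text": "First!"})),
    ("p1", ("comment", {"text": "Nice post"})),
]

def reducer(post_id, tagged_values):
    post, comments = None, []
    for tag, value in tagged_values:
        if tag == "post":
            post = value
        else:
            comments.append(value)
    post["comments"] = comments  # build the hierarchy in memory
    return json.dumps(post)

groups = defaultdict(list)  # stands in for the shuffle
for k, v in rows:
    groups[k].append(v)
for post_id, values in groups.items():
    print(reducer(post_id, values))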
Structured to Hierarchical 3/3
Consequences
• The output will be in a hierarchical form, grouped by the key that you specified.
Known uses
• Pre-joining data
• Preparing data for HBase or MongoDB
Performance analysis
• How much data is being sent to the reducers from the mappers?
• The memory footprint of the object that the reducer builds: what about a post that has a million comments?
Partitioning 1/3
Intent
• Move the records into categories; the order of records within a partition doesn't matter.
• Take similar records in a data set and partition them into distinct, smaller data sets.
Motivation
• If you want to look at a particular set of data, and the data items are spread out across the entire data set, finding them requires a scan of all of the data.
Applicability
• You know how many partitions you are going to have ahead of time, e.g., partitioning by day of the week gives 7 partitions.
Partitioning 2/3
Structure - a custom partitioner determines which partition a record goes to (see the sketch below).
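In Hadoop this routing is done by a custom Partitioner; here is a small Python simulation of the logic, using the day-of-week example from the previous slide (the record layout is an illustrative assumption):

import datetime

NUM_PARTITIONS = 7  # known ahead of time: one per weekday

def get_partition(record):
    # Identity mapper + custom partitioner: route by day of week.
    date = datetime.date.fromisoformat(record["date"])
    return date.weekday()  # 0 = Monday ... 6 = Sunday

partitions = [[] for _ in range(NUM_PARTITIONS)]
for rec in [{"date": "2013-02-01", "v": 1}, {"date": "2013-02-04", "v": 2}]:
    partitions[get_partition(rec)].append(rec)
print([len(p) for p in partitions])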
Partitioning 3/3
Known uses
• Partition pruning by a continuous value (e.g., date)
• Partition pruning by category (country, phone area code, language)
• Sharding (to different disks)
Performance analysis
• The resulting partitions will likely not have similar numbers of records; perhaps one partition holds 50% of the data.
• If implemented naively, all of this data will get sent to one reducer and will slow down processing significantly.
Binning 1/3
Intent
• For each record in the data set, file it into one or more categories.
Motivation
• Binning is very similar to Partitioning and can often be used to solve the same problem.
• Binning splits the data up in the map phase, instead of in the partitioner.
• Caveat: each mapper will have one file per possible output bin; 1,000 bins x 1,000 mappers = 1,000,000 files.
Binning 2/3
Structure
• Mapper: if the record meets a bin's criteria, it is sent to that bin.
• No combiner, partitioner, or reducer is used in this pattern.
Binning 3/3
Consequences
• Each mapper outputs one small file per bin.
Resemblances
• Pig:

SPLIT data INTO
  eights IF col1 == 8,
  bigs IF col1 > 8,
  smalls IF (col1 < 8 AND col1 > 0);

Performance analysis
• A map-only job, so processing records is efficient.
• No sort, shuffle, or reduce is performed.
• Most of the processing is done on data that is local.
Total Order Sorting 1/3
Intent
• Sort your data in parallel on a sort key.
Motivation
• Each reducer will sort its own data by key, but that is not a global sort across all data.
• Sorting in parallel is not easy.
Applicability
• Your sort key has to be comparable so the data can be ordered.
Total Order Sorting 2/3
Structure
• Analyze phase: determines the key ranges.
  • Idea: partitions that evenly split a random sample should also evenly split the larger data set.
  • The mapper does a random sampling; choose the percentage of records to analyze based on the number of records in the total data set.
  • Only one reducer: it collects the sort keys into a sorted list, then slices the list into the data-range boundaries.
• Order phase: actually sorts the data.
  • Number of reducers == number of partitions.
  • A custom partitioner loads up the data ranges from the partition file.
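A compact Python simulation of the two phases: sample-based boundary selection, then range partitioning; each partition's local sort then yields a globally sorted concatenation (sizes are illustrative):

import random

def analyze_phase(sample, num_partitions):
    # Slice a sorted sample into num_partitions - 1 boundary keys.
    keys = sorted(sample)
    step = len(keys) // num_partitions
    return [keys[(i + 1) * step] for i in range(num_partitions - 1)]

def order_phase(data, boundaries):
    parts = [[] for _ in range(len(boundaries) + 1)]
    for x in data:  # custom partitioner: route by range
        i = sum(b <= x for b in boundaries)
        parts[i].append(x)
    return [sorted(p) for p in parts]  # each reducer sorts locally

data = [random.randint(0, 999) for _ in range(1000)]
bounds = analyze_phase(random.sample(data, 100), 4)
parts = order_phase(data, bounds)
assert [x for p in parts for x in p] == sorted(data)
print(bounds, [len(p) for p in parts])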
Total Order Sorting 3/3
Consequences
• The output files will contain globally sorted data.
Resemblances
• SQL: SELECT * FROM data ORDER BY col1;
• Pig: c = ORDER b BY col1;
Performance analysis
• Expensive! The data is loaded and parsed twice:
  • Step 1: build the partition ranges.
  • Step 2: actually sort the data.
Shuffling 1/3
Intent
• To completely randomize a set of records.
Motivation
• Shuffling for anonymizing the data.
• Shuffling for repeatable random sampling.
Shuffling 2/3
Structure
• Mappers emit [random key, record].
• The reducer sorts the random keys, thereby randomizing the data.
Consequences
• Each reducer outputs a file containing random records.
Resemblances
• SQL: SELECT * FROM data ORDER BY RAND();
• Pig: c = GROUP b BY RANDOM(); d = FOREACH c GENERATE FLATTEN(b);
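A minimal Python version of this structure: attach a random sort key, let the sort order by it, then strip the key (the framework's shuffle/sort would do the middle step in a real job):

import random

def mapper(record):
    return random.random(), record  # [random key, record]

def reduce_side(pairs):
    # The sort on the random key does the shuffling.
    return [rec for _, rec in sorted(pairs)]

records = ["a", "b", "c", "d", "e"]
print(reduce_side([mapper(r) for r in records]))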
Shuffling 3/3
Performance analysis
• Nice performance properties.
• The data distribution across reducers is completely balanced.
• With more reducers, the data will be more spread out.
• The size of the files is also very predictable: each is the size of the data set divided by the number of reducers. This makes it easy to get a specific desired file size as output.
2.4 Join Patterns
A refresher on RDBMS joins
• Inner join
• Outer join
• Cartesian product
• Anti join = full outer join minus inner join
Patterns
• Reduce Side Join
• Replicated Join
• Composite Join
• Cartesian Product
An SQL query walks into a bar, sees two tables and asks them “May I join you?”
Reduce Side Join 1/3
Intent
• Join multiple large data sets together by some foreign key.
Motivation
• Simple to implement in the reducers.
• Supports all the different join operations.
• No limitation on the size of your data sets.
Applicability
• Multiple large data sets are being joined by a foreign key.
• You want the flexibility of being able to execute any join operation.
• A large amount of network bandwidth will be used (everything is shuffled).
Reduce Side Join 2/3
Structure
• Mapper: tags each record with its source and emits (foreign key, record).
• Reducer: performs the join operation per key, as sketched below.
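A runnable Python sketch of an inner reduce-side join, using users and comments as in the SQL resemblance on the next slide (record shapes are illustrative):

from collections import defaultdict

users = [(1, "NYC"), (2, "SF")]        # (ID, Location)
comments = [(1, 10), (1, 3), (2, 7)]   # (UserID, upVotes)

# Map phase: tag each record with its source table, key by foreign key.
pairs = [(uid, ("U", loc)) for uid, loc in users] + \
        [(uid, ("C", up)) for uid, up in comments]

groups = defaultdict(list)             # stands in for the shuffle
for k, v in pairs:
    groups[k].append(v)

# Reduce phase: inner join within each key group.
for uid, vals in groups.items():
    locations = [x for tag, x in vals if tag == "U"]
    up_votes = [x for tag, x in vals if tag == "C"]
    for loc in locations:
        for up in up_votes:
            print(uid, loc, up)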
Reduce Side Join 3/3
Consequences
• Number of part files == number of reduce tasks; each part contains a portion of the joined records.
Resemblances
• SQL:

SELECT users.ID, users.Location, comments.upVotes
FROM users
[INNER|LEFT|RIGHT] JOIN comments
ON users.ID = comments.UserID;

Performance analysis
• Watch your cluster's network bandwidth!
• Use relatively more reducers than for a typical analytic.
Replicated Join 1/3
Intent
• Eliminate the need to shuffle any data to the reduce phase.
Motivation
• All the data sets except the very large one are read into memory during the setup phase of each map task; this is limited by the JVM heap.
Applicability
• All of the data sets, except for the large one, can fit into the main memory of each map task.
Replicated Join 2/3
Structure
• A map-only pattern.
• Read all the small files from the distributed cache and store them in in-memory lookup tables (see the sketch below).
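A Python sketch of the map side: a dict stands in for the in-memory lookup table built from the distributed cache, and the data reuses the illustrative users/comments shapes from above:

# Setup phase of each map task: load the small data set into memory.
users_by_id = {1: "NYC", 2: "SF"}  # small table, fits in the heap

def mapper(comment):
    # Inner map-side join: look up the foreign key, no shuffle needed.
    uid, up_votes = comment
    loc = users_by_id.get(uid)
    if loc is not None:
        yield uid, loc, up_votes

for comment in [(1, 10), (1, 3), (2, 7), (9, 5)]:
    for joined in mapper(comment):
        print(joined)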
Replicated Join 3/3
Consequences
• Number of part files == number of map tasks; the part files contain the full set of joined records.
Performance analysis
• A replicated join can be the fastest type of join, because no reducer is required.
• Limited by the amount of data that can be stored safely inside the JVM heap.
Composite Join 1/4
Intent
• A join performed on the map side over many very large, specially formatted inputs.
• Completely eliminates the need to shuffle and sort all the data to the reduce phase.
• Requires the data to be already organized or prepared in a very specific way.
Motivation
• Particularly useful if you want to join very large data sets together.
• The data sets must first be sorted by foreign key, partitioned by foreign key, and read in a very particular manner.
Composite Join 2/4
Applicability
• An inner or full outer join is desired.
• All the data sets are sufficiently large.
• All data sets can be read with the foreign key as the input key to the mapper.
• All data sets have the same number of partitions.
• Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set.
• The data sets do not change often (since they have to be prepared).
Composite Join 3/4
Structure
• Map-only; the mapper is very trivial.
• The two values are retrieved from the input tuple and output to the file system.
Composite Join 4/4
Consequences
• Output: number of part files == number of map tasks.
Performance analysis
• Can be executed relatively quickly over large data sets.
• Data preparation = sorting cost.
• The cost of producing these prepared data sets is amortized over all of the runs.
Cartesian Product 1/3
Intent
• Pair up and compare every single record with every other record in a data set.
Motivation
• Simply pairs every record of a data set with every record of all the other data sets.
• Useful for analyzing relationships between one or more data sets.
Applicability
• You want to analyze relationships between all pairs of individual records.
• You've exhausted all other means to solve this problem.
• You have no time constraints on execution time.
Cartesian Product 2/3
Structure
• Map-only: the pairing of input splits is done by a custom input format / RecordReader, not by reducers.
Cartesian Product 3/3
Consequences
• The final data set is made up of tuples whose width equals the number of input data sets.
• Every possible tuple combination from the input records is represented in the final output.
Resemblances
• SQL: SELECT * FROM tableA, tableB;
Performance analysis
• A massive explosion in data size: O(n^2).
• If a single left input split contains a thousand records, the right input split needs to be read a thousand times before the task can finish.
• If a single task fails for an odd reason, the whole thing needs to be restarted.
2.5 Metapatterns (skipped)
Patterns about using patterns:
• Job Chaining - piecing together several patterns to solve complex, multistage problems
• Chain Folding
• Job Merging - an optimization for performing several analytics in the same MapReduce job
2.6 Input and Output Patterns
Customizing input and output in Hadoop: how data is laid out on disk
• Configuring how contiguous chunks of input are generated from blocks in HDFS
• Configuring how records appear in the map phase
• The RecordReader and InputFormat classes
• The RecordWriter and OutputFormat classes
Patterns
• Generating Data
• External Source Output
• External Source Input
• Partition Pruning
Generating Data 1/3
Intent
• You want to generate a lot of data from scratch.
Motivation
• This pattern doesn't load data; it generates the data and stores it back in the distributed file system.
Generating Data 2/3
Structure
• Map-only: each map task generates its own slice of the data (see the sketch below).
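A Python sketch of one such generator task (in Hadoop the "input" would be a dummy split per task; the task id, record count, and seed are illustrative parameters):

import random

def generator_map_task(task_id, num_records, seed=42):
    # No input is read; each task fabricates its own slice of data.
    rng = random.Random(seed + task_id)  # per-task, reproducible
    for i in range(num_records):
        yield "task%d-rec%d" % (task_id, i), rng.randint(0, 1_000_000)

# Two "map tasks", each writing its own output file's worth of data.
for task_id in range(2):
    for key, value in generator_map_task(task_id, num_records=3):
        print(key, value)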
Generating Data 3/3
Consequences
• Each mapper outputs a file containing random data.
Performance analysis
• The key question is how many worker map tasks are needed to generate the data.
• In general, the more map tasks you have, the faster you can generate data.
External Source Output 1/3
Intent
• To write MapReduce output to a nonnative location (outside of Hadoop and HDFS).
Motivation
• To output data from the MapReduce framework directly to an external source.
• This is extremely useful for loading directly into a system, instead of staging the data to be delivered to the external source.
External Source Output 2/3
Structure (diagram in the original slides)
External Source Output 3/3
Consequences
• The output data has been sent to the external source, and that external source has loaded it successfully.
Performance analysis
• Can the receiver of the data handle the parallel connections?
• Having a thousand tasks writing to a single SQL database is not going to work well.
External Source Input 1/3
Intent
• You want to load data in parallel from a source that is not part of your MapReduce framework.
Motivation
• The typical model is to store the data in HDFS before analyzing it with MapReduce.
• With this pattern, you can hook the MapReduce framework up to an external source, such as a database or a web service, and pull the data directly into the mappers.
External Source Input 2/3
Structure (diagram in the original slides)
External Source Input 3/3
Consequences
• Data is loaded from the external source into the MapReduce job; the map phase doesn't care where the data came from.
Performance analysis
• The bottleneck is the source or the network.
• The source may not scale well with multiple connections (e.g., a single-threaded SQL database).
• If the source is not on the cluster's network, the connections may be reaching out over a slower public network.
Partition Pruning 1/3
Intent
• You have a set of data that is partitioned by a predetermined value, which you can use to dynamically load only the data requested by the application.
Motivation
• Loading all of the files is a large waste of processing time.
• By partitioning the data by a common value, you can avoid significant amounts of processing time by looking only where the data would exist.
Partition Pruning 2/3
Structure (diagram in the original slides)
Partition Pruning 3/3
Consequences
• Partition pruning changes only the amount of data that is read by the MapReduce job, not the eventual outcome of the analytic.
Performance analysis
• This pattern can provide massive gains by reducing the number of tasks that would never have generated output anyway.
• Outside of the I/O, performance depends on the other pattern being applied in the map and reduce phases of the job.
The End (Finally…)
Thanks for your attention.
• MapReduce has proven to be a useful abstraction.
• It greatly simplifies large-scale computations.
• Hadoop is widely used.
• Focus on your problem, and let MapReduce deal with the messy details.
Any questions?