Tutorial for MapReduce (Hadoop) & Large Scale Processing Le Zhao (LTI, SCS, CMU) Database Seminar & Large Scale Seminar 2010-Feb-15 Some slides adapted.
Tutorial for MapReduce (Hadoop) & Large Scale Processing
Le Zhao (LTI, SCS, CMU)
Database Seminar & Large Scale Seminar
2010-Feb-15
Some slides adapted from IR course lectures by Jamie Callan
© 2010, Le Zhao
Outline
• Why MapReduce (Hadoop)
• MapReduce basics
• The MapReduce way of thinking
• Manipulating large data
Outline
• Why MapReduce (Hadoop)
– Why go large scale
– Compared to other parallel computing models
– Hadoop related tools
• MapReduce basics
• The MapReduce way of thinking
• Manipulating large data
Why NOT to do parallel computing
• Concerns: a parallel system needs to provide:
– Data distribution
– Computation distribution
– Fault tolerance
– Job scheduling
Why MapReduce (Hadoop)
• Previous parallel computation models
– 1) scp + ssh
» Manual everything
– 2) network cross-mounted disks + condor/torque
» No data distribution; disk access is the bottleneck
» Can only partition totally distributive computation
» No fault tolerance
» Prioritized job scheduling
Hadoop
• Parallel batch computation
– Data distribution
» Hadoop Distributed File System (HDFS)
» Like Linux FS, but with automatic data replication
– Computation distribution
» Automatic; the user only needs to specify #input_splits
» Can distribute aggregation computations as well
– Fault tolerance
» Automatic recovery from failure
» Speculative execution (a backup task)
– Job scheduling
» Ok, but still relies on the politeness of users
How you can use Hadoop
• Hadoop Streaming
– Quick hacking – much like shell scripting
» Uses STDIN & STDOUT to carry data
» cat file | mapper | sort | reducer > output
– Easier to use legacy code, all programming languages
• Hadoop Java API
– Build large systems
» More data types
» More control over Hadoop’s behavior
» Easier debugging with Java’s error stacktrace display
– NetBeans plugin for Hadoop provides easy programming
» http://hadoopstudio.org/docs.html
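The streaming pipeline `cat file | mapper | sort | reducer` can be imitated locally. The following is a minimal sketch, assuming word counting as the task; the names `mapper`, `reducer` and `run_pipeline` are illustrative and not part of Hadoop. In an actual Hadoop Streaming job the mapper and reducer would be separate scripts that read STDIN and write tab-separated key/value lines to STDOUT.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each run of identical keys (input must be sorted)."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_pipeline(lines):
    """Local stand-in for: cat file | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


print(run_pipeline(["a b a", "b a"]))
```

The `sorted` call plays the role of the shell `sort`: the reducer only works because equal keys arrive next to each other.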
© 2009, Jamie Callan
Map and Reduce
MapReduce is a new use of an old idea in Computer Science
• Map: Apply a function to every object in a list
– Each object is independent
» Order is unimportant
» Maps can be done in parallel
– The function produces a result
• Reduce: Combine the results to produce a final result
You may have seen this in a Lisp or functional programming course
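As a small illustration of the functional idea, here is a sketch using Python's built-in `map` and `functools.reduce`. The example documents are made up for illustration.

```python
from functools import reduce

docs = ["long ago", "once upon a time", "the end"]

# Map: apply a function to every object independently (order does not matter).
lengths = list(map(lambda doc: len(doc.split()), docs))

# Reduce: combine the per-object results into one final result.
total = reduce(lambda a, b: a + b, lengths, 0)

print(lengths, total)
```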
© 2010, Jamie Callan
MapReduce
• Input reader
– Divide input into splits, assign each split to a Map processor
• Map
– Apply the Map function to each record in the split
– Each Map function returns a list of (key, value) pairs
• Shuffle/Partition and Sort
– Shuffle distributes sorting & aggregation to many reducers
– All records for key k are directed to the same reduce processor
– Sort groups the same keys together, and prepares for aggregation
• Reduce
– Apply the Reduce function to each key
– The result of the Reduce function is a list of (key, value) pairs
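The four stages can be sketched as a tiny in-memory driver. This is a toy model under stated assumptions, not how Hadoop is implemented: `run_mapreduce`, `wc_map` and `wc_reduce` are hypothetical names, partitioning uses Python's `hash`, and everything runs serially in one process.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2):
    """Toy single-process model of map -> shuffle/partition -> sort -> reduce."""
    # Map + shuffle: route every (key, value) pair to one reducer partition,
    # so that all records for a key end up at the same reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in map_fn(record):
            partitions[hash(key) % num_reducers][key].append(value)

    # Sort + reduce: each reducer sees its keys in sorted order.
    output = []
    for partition in partitions:
        for key in sorted(partition):
            output.extend(reduce_fn(key, partition[key]))
    return output


def wc_map(line):
    return [(word, 1) for word in line.split()]


def wc_reduce(word, counts):
    return [(word, sum(counts))]


result = sorted(run_mapreduce(["a b a", "b c"], wc_map, wc_reduce))
print(result)
```

The final `sorted` is only there to make the printed order stable; across reducers the output order depends on the partitioning.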
MapReduce in One Picture
[Figure: MapReduce data flow, from Tom White, Hadoop: The Definitive Guide]
Outline
• Why MapReduce (Hadoop)
• MapReduce basics
• The MapReduce way of thinking
– Two simple use cases
– Two more advanced & useful MapReduce tricks
– Two MapReduce applications
• Manipulating large data
MapReduce Use Case (1) – Map Only
Data distributive tasks – Map Only
• E.g. classify individual documents
• Map does everything
– Input: (docno, doc_content), …
– Output: (docno, [class, class, …]), …
• No reduce
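A map-only job can be sketched as a function applied to each record with no aggregation step. The keyword rules in `classify` below are a hypothetical stand-in for a real classifier, and the documents are made up.

```python
def classify(doc_content):
    """Hypothetical rule-based classifier standing in for a real model."""
    rules = {"sports": ("game", "score"), "tech": ("hadoop", "cluster")}
    text = doc_content.lower()
    return [label for label, words in rules.items()
            if any(word in text for word in words)]


def map_only(records):
    """Map does everything: (docno, doc_content) -> (docno, [class, ...])."""
    return [(docno, classify(content)) for docno, content in records]


result = map_only([("d1", "Hadoop cluster setup"), ("d2", "Final game score")])
print(result)
```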
MapReduce Use Case (2) – Filtering and Accumulation
Filtering & Accumulation – Map and Reduce
• E.g. Counting total enrollments of two given classes
• Map selects records and outputs initial counts
– In: (Jamie, 11741), (Tom, 11493), …
– Out: (11741, 1), (11493, 1), …
• Shuffle/Partition by class_id
• Sort
– In: (11741, 1), (11493, 1), (11741, 1), …
– Out: (11493, 1), …, (11741, 1), (11741, 1), …
• Reduce accumulates counts
– In: (11493, [1, 1, …]), (11741, [1, 1, …])
– Sum and Output: (11493, 16), (11741, 35)
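The enrollment count can be sketched as follows. This is a single-process illustration; the students "Ann" and "Bo" and the class 15213 are made-up records added to show the filtering, and the function names are hypothetical.

```python
from collections import defaultdict

TARGET_CLASSES = {11741, 11493}  # the two given classes


def map_enrollment(record):
    """Select records for the target classes and emit an initial count of 1."""
    student, class_id = record
    if class_id in TARGET_CLASSES:
        yield (class_id, 1)


def reduce_counts(class_id, counts):
    """Accumulate the counts for one class."""
    return (class_id, sum(counts))


def count_enrollments(records):
    grouped = defaultdict(list)  # shuffle: group values by class_id
    for record in records:
        for class_id, one in map_enrollment(record):
            grouped[class_id].append(one)
    return [reduce_counts(c, grouped[c]) for c in sorted(grouped)]  # sort, reduce


records = [("Jamie", 11741), ("Tom", 11493), ("Ann", 11741), ("Bo", 15213)]
result = count_enrollments(records)
print(result)
```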
MapReduce Use Case (3) – Database Join
Problem: Massive lookups
– Given two large lists: (URL, ID) and (URL, doc_content) pairs
– Produce (ID, doc_content)
Solution: Database join
• Input stream: both (URL, ID) and (URL, doc_content) lists
– (http://del.icio.us/post, 0), (http://digg.com/submit, 1), …
– (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), …
• Map simply passes input along
• Shuffle and Sort on URL (group ID & doc_content for the same URL together)
– Out: (http://del.icio.us/post, 0), (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), (http://digg.com/submit, 1), …
• Reduce outputs result stream of (ID, doc_content) pairs
– In: (http://del.icio.us/post, [0, <html0>]), (http://digg.com/submit, [<html1>, 1]), …
– Out: (0, <html0>), (1, <html1>), …
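A reduce-side join along these lines might look like the sketch below. One assumption goes beyond the slide: the mapper tags each value with its source list ("id" or "content") so the reducer can tell the two apart regardless of arrival order, whereas the slide passes values through unchanged. All function names are hypothetical.

```python
from collections import defaultdict


def map_tag(record, source):
    """Pass the record along, tagging the value with the list it came from."""
    url, value = record
    return (url, (source, value))


def reduce_join(url, tagged_values):
    """Pair every ID with every doc_content seen for the same URL."""
    ids = [v for tag, v in tagged_values if tag == "id"]
    contents = [v for tag, v in tagged_values if tag == "content"]
    return [(doc_id, content) for doc_id in ids for content in contents]


def join(id_pairs, content_pairs):
    mapped = ([map_tag(r, "id") for r in id_pairs]
              + [map_tag(r, "content") for r in content_pairs])
    grouped = defaultdict(list)  # shuffle on URL
    for url, tagged in mapped:
        grouped[url].append(tagged)
    output = []
    for url in sorted(grouped):  # sort on URL
        output.extend(reduce_join(url, grouped[url]))
    return output


ids = [("http://del.icio.us/post", 0), ("http://digg.com/submit", 1)]
pages = [("http://digg.com/submit", "<html1>"), ("http://del.icio.us/post", "<html0>")]
result = join(ids, pages)
print(result)
```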
MapReduce Use Case (4) – Secondary Sort
Problem: Sorting on values
• E.g. Reverse graph edge directions & output in node order
– Input: adjacency list of graph (3 nodes and 4 edges): (3, [1, 2]) (1, [2, 3])
– Output: (1, [3]) (2, [1, 3]) (3, [1])
• Note, the node_ids in the output values are also sorted. But Hadoop only sorts on keys!
Solution: Secondary sort
• Map
– In: (3, [1, 2]), (1, [2, 3]).
– Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]). (reverse edge direction)
– Out: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1]).
– Copy node_ids from value to key.
[Figure: the 3-node graph, before and after reversing the edge directions]
MapReduce Use Case (4) – Secondary Sort
Secondary Sort (ctd.)
• Shuffle on Key.field1, and Sort on whole Key (both fields)
– In: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])
– Out: (<1, 3>, [3]), (<2, 1>, [1]), (<2, 3>, [3]), (<3, 1>, [1])
• Grouping comparator
– Merge according to part of the key
– Out: (<1, 3>, [3]), (<2, 1>, [1, 3]), (<3, 1>, [1]); this will be the reducer’s input
• Reduce
– Merge & output: (1, [3]), (2, [1, 3]), (3, [1])
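The secondary-sort trick can be sketched in a single process as below. Sorting on the whole composite key and then grouping on its first field mimics the sort and grouping comparators. The function names are hypothetical, and a real Hadoop job would configure a partitioner and comparators instead of calling `sort` and `groupby`.

```python
from itertools import groupby


def map_reverse(node, out_links):
    """Reverse each edge and copy the source node_id from the value into the key."""
    for target in out_links:
        yield ((target, node), node)


def reverse_graph(adjacency):
    pairs = [kv for node, outs in adjacency for kv in map_reverse(node, outs)]
    pairs.sort(key=lambda kv: kv[0])  # sort on the whole key (both fields)
    result = []
    # Grouping comparator: merge records that share the first key field.
    for target, group in groupby(pairs, key=lambda kv: kv[0][0]):
        result.append((target, [source for _, source in group]))
    return result


result = reverse_graph([(3, [1, 2]), (1, [2, 3])])
print(result)
```

Because the values reach the grouping step already ordered by the second key field, the reducer never has to sort them itself.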
Using MapReduce to Construct Indexes: Preliminaries
Construction of binary inverted lists
• Input: documents: (docid, [term, term..]), (docid, [term, ..]), ..
• Output: (term, [docid, docid, …])
– E.g., (apple, [1, 23, 49, 127, …])
• Binary inverted lists fit on a slide more easily
• Everything also applies to frequency and positional inverted lists
A document id is an internal document id, e.g., a unique integer
• Not an external document id such as a url
MapReduce elements
• Combiner, Secondary Sort, complex keys, Sorting on keys’ fields
Using MapReduce to Construct Indexes: A Simple Approach
A simple approach to creating binary inverted lists
• Each Map task is a document parser
– Input: A stream of documents
– Output: A stream of (term, docid) tuples
» (long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) …
• Shuffle sorts tuples by key and routes tuples to Reducers
• Reducers convert streams of keys into streams of inverted lists
– Input: (long, 1) (long, 127) (long, 49) (long, 23) …
– The reducer sorts the values for a key and builds an inverted list
» Longest inverted list must fit in memory
– Output: (long, [df:492, docids:1, 23, 49, 127, …])
Using MapReduce to Construct Indexes: A Simple Approach
A more succinct representation of the previous algorithm
• Map: (docid1, content1) → (t1, docid1) (t2, docid1) …
• Shuffle by t
• Sort by t
(t5, docid1) (t4, docid3) … → (t4, docid3) (t4, docid1) (t5, docid1) …
• Reduce: (t4, [docid3 docid1 …]) → (t, ilist)
docid: a unique integer
t: a term, e.g., “apple”
ilist: a complete inverted list
But this approach a) is inefficient, b) sorts docids in the reducers, and c) assumes the inverted list of a term fits in memory
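The simple approach can be sketched as below. This is a single-process illustration with hypothetical function names and made-up documents; note that the reducer sorts each term's docids in memory, which is exactly limitation c).

```python
from collections import defaultdict


def map_parse(docid, content):
    """Document parser: emit one (term, docid) tuple per distinct term."""
    for term in set(content.lower().split()):
        yield (term, docid)


def reduce_build(term, docids):
    """Sort the docids in memory and build a binary inverted list."""
    postings = sorted(docids)
    return (term, {"df": len(postings), "docids": postings})


def build_index(docs):
    grouped = defaultdict(list)  # shuffle by term
    for docid, content in docs:
        for term, d in map_parse(docid, content):
            grouped[term].append(d)
    return dict(reduce_build(t, grouped[t]) for t in sorted(grouped))


index = build_index([(2, "once upon a time"), (1, "long long ago"), (3, "a long time")])
print(index["long"])
```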
Using MapReduce to Construct Indexes: Using Combine
• Map: (docid1, content1) → (t1, ilist_{1,1}) (t2, ilist_{2,1}) (t3, ilist_{3,1}) …
– Each output inverted list covers just one document
• Combine
Sort by t
Combine: (t1, [ilist_{1,2} ilist_{1,3} ilist_{1,1} …]) → (t1, ilist_{1,27})
– Each output inverted list covers a sequence of documents
• Shuffle by t
• Sort by t
(t4, ilist_{4,1}) (t5, ilist_{5,3}) … → (t4, ilist_{4,2}) (t4, ilist_{4,4}) (t4, ilist_{4,1}) …
• Reduce: (t7, [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → (t7, ilist_final)
ilist_{i,j}: the j'th inverted list fragment for term i
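The effect of the combiner can be sketched as follows. Each map task merges its per-document lists into one fragment per term before anything is shuffled, so the reducer merges a few fragments instead of many single-document lists. Names and data are hypothetical, and the reducer here still merges in memory.

```python
from collections import defaultdict


def map_and_combine(split):
    """One map task: parse its documents, then combine per-term lists locally.

    split is a list of (docid, content); the result maps each term to one
    sorted inverted list fragment covering all documents in the split.
    """
    fragments = defaultdict(list)
    for docid, content in split:
        for term in set(content.lower().split()):
            fragments[term].append(docid)
    return {term: sorted(docids) for term, docids in fragments.items()}


def reduce_merge(term, fragments):
    """Merge all fragments for a term into the final inverted list."""
    return (term, sorted(d for fragment in fragments for d in fragment))


def build_index(splits):
    shuffled = defaultdict(list)  # shuffle by term: one fragment per map task
    for split in splits:
        for term, fragment in map_and_combine(split).items():
            shuffled[term].append(fragment)
    return dict(reduce_merge(t, shuffled[t]) for t in sorted(shuffled))


splits = [[(3, "long ago"), (4, "long time")], [(1, "long ago"), (2, "once upon")]]
index = build_index(splits)
print(index["long"])
```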
Using MapReduce to Construct Indexes
[Figure: Documents → Map/Combine processors (Parser/Indexer) → inverted list fragments → Shuffle/Sort → Reduce processors (Merger) → inverted lists for term ranges A-F, G-P, Q-Z]
Using MapReduce to Construct Partitioned Indexes
• Map: (docid1, content1) → ([p, t1], ilist_{1,1})
• Combine to sort and group values
([p, t1], [ilist_{1,2} ilist_{1,3} ilist_{1,1} …]) → ([p, t1], ilist_{1,27})
• Shuffle by p
• Sort values by [p, t]
• Reduce: ([p, t7], [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → ([p, t7], ilist_final)
p: partition (shard) id
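A sketch of the partitioned variant follows. How a document is assigned to a shard is not stated on the slide; `shard_of` below uses docid modulo the number of shards purely as an illustrative assumption, and the other names are hypothetical too.

```python
from collections import defaultdict

NUM_SHARDS = 2  # assumed number of partitions, for illustration only


def shard_of(docid):
    """Hypothetical shard assignment; the slides do not specify one."""
    return docid % NUM_SHARDS


def map_doc(docid, content):
    """Emit ([p, t], single-document inverted list) for each distinct term."""
    p = shard_of(docid)
    for term in set(content.lower().split()):
        yield ((p, term), [docid])


def build_shards(docs):
    shards = defaultdict(lambda: defaultdict(list))
    for docid, content in docs:
        for (p, term), fragment in map_doc(docid, content):
            shards[p][term].extend(fragment)  # shuffle by p, group by [p, t]
    return {p: {t: sorted(shards[p][t]) for t in sorted(shards[p])}
            for p in sorted(shards)}


shards = build_shards([(1, "long ago"), (2, "long time"), (3, "ago")])
print(shards)
```

Each shard ends up with a complete, independent index over its own documents.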
Using MapReduce to Construct Indexes: Secondary Sort
So far, we have assumed that Reduce can sort values in memory, but what if there are too many to fit in memory?
• Map: (docid1, content1) → ([t1, fd_{1,1}], ilist_{1,1})
• Combine to sort and group values
• Shuffle by t
• Sort by [t, fd], then Group by t (Secondary Sort)
([t7, fd_{7,2}], ilist_{7,2}), ([t7, fd_{7,1}], ilist_{7,1}) … → (t7, [ilist_{7,1}, ilist_{7,2}, …])
• Reduce: (t7, [ilist_{7,1}, ilist_{7,2}, …]) → (t7, ilist_final)
Values arrive in order, so Reduce can stream its output
fd_{i,j} is the first docid in ilist_{i,j}
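The streaming reducer can be sketched as below. The sketch assumes that the fragments for one term cover non-overlapping docid ranges (for example, each map task handles a contiguous range of docids), so ordering fragments by first docid yields a fully sorted list. The names are hypothetical, and `sorted`/`groupby` stand in for the framework's sort and grouping steps.

```python
from itertools import groupby


def reduce_stream(fragments_in_order):
    """Emit docids one at a time; no fragment list is ever sorted in memory."""
    for fragment in fragments_in_order:
        yield from fragment


def index_with_secondary_sort(keyed_fragments):
    """keyed_fragments: ((term, first_docid), ilist) pairs in any order."""
    ordered = sorted(keyed_fragments, key=lambda kv: kv[0])  # sort by [t, fd]
    return {
        term: list(reduce_stream(ilist for _, ilist in group))  # group by t
        for term, group in groupby(ordered, key=lambda kv: kv[0][0])
    }


fragments = [(("long", 7), [7, 9]), (("ago", 2), [2]), (("long", 1), [1, 4])]
index = index_with_secondary_sort(fragments)
print(index)
```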
Using MapReduce to Construct Indexes: Putting it All Together
• Map: (docid1, content1) → ([p, t1, fd_{1,1}], ilist_{1,1})
• Combine to sort and group values
([p, t1, fd_{1,1}], [ilist_{1,2} ilist_{1,3} ilist_{1,1} …]) → ([p, t1, fd_{1,27}], ilist_{1,27})
• Shuffle by p
• Secondary Sort by [(p, t), fd]
([p, t7], [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → ([p, t7], [ilist_{7,1}, ilist_{7,2}, ilist_{7,4}, …])
• Reduce: ([p, t7], [ilist_{7,1}, ilist_{7,2}, ilist_{7,4}, …]) → ([p, t7], ilist_final)
Using MapReduce to Construct Indexes
[Figure: Documents → Map/Combine processors (Parser/Indexer) → inverted list fragments → Shuffle/Sort → Reduce processors (Merger) → inverted lists, one Shard per reducer]
PageRank Calculation: Preliminaries
One PageRank iteration:
• Input:
– (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..
• Output:
– (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..
MapReduce elements
• Score distribution and accumulation
• Database join
• Side-effect files
PageRank: Score Distribution and Accumulation
• Map
– In: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..
– Out: (out11, score1(t)/n1), (out12, score1(t)/n1) .., (out21, score2(t)/n2), ..
• Shuffle & Sort by node_id
– In: (id2, score1), (id1, score2), (id1, score1), ..
– Out: (id1, score1), (id1, score2), .., (id2, score1), ..
• Reduce
– In: (id1, [score1, score2, ..]), (id2, [score1, ..]), ..
– Out: (id1, score1(t+1)), (id2, score2(t+1)), ..
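The distribution and accumulation step can be sketched as below. As on the slide, the sketch applies no damping factor here and assumes every node has at least one outlink; the three-node graph and the function names are made up for illustration.

```python
from collections import defaultdict


def map_distribute(node_id, score, out_links):
    """Split a node's score evenly among its n outlinks."""
    share = score / len(out_links)
    for target in out_links:
        yield (target, share)


def reduce_accumulate(node_id, shares):
    """Sum the shares a node receives from its inlinks."""
    return (node_id, sum(shares))


def distribute_and_accumulate(graph):
    """graph maps id -> (score at iteration t, [outlinks])."""
    incoming = defaultdict(list)  # shuffle by node_id
    for node_id, (score, out_links) in graph.items():
        for target, share in map_distribute(node_id, score, out_links):
            incoming[target].append(share)
    return dict(reduce_accumulate(n, incoming[n]) for n in sorted(incoming))


graph = {1: (1.0, [2, 3]), 2: (1.0, [3]), 3: (1.0, [1])}
new_scores = distribute_and_accumulate(graph)
print(new_scores)
```

The output carries scores only; the outlink lists are lost at this point, which is why a join is needed before the next iteration.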
PageRank: Database Join to associate outlinks with score
• Map
– In & Out: (id1, score1(t+1)), (id2, score2(t+1)), .., (id1, [out11, out12, ..]), (id2, [out21, out22, ..]) ..
• Shuffle & Sort by node_id
– Out: (id1, score1(t+1)), (id1, [out11, out12, ..]), (id2, [out21, out22, ..]), (id2, score2(t+1)), ..
• Reduce
– In: (id1, [score1(t+1), out11, out12, ..]), (id2, [out21, out22, .., score2(t+1)]), ..
– Out: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..
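The join that re-attaches outlinks to the new scores can be sketched as follows. Two assumptions go beyond the slide: values are tagged by source so the reducer can put the score first whatever the arrival order, and a node with no incoming score defaults to 0.0. The function name is hypothetical.

```python
from collections import defaultdict


def join_scores_with_links(new_scores, out_links):
    """Reduce-side join on node_id of (id, score) and (id, [outlinks]) records."""
    grouped = defaultdict(list)  # shuffle & sort by node_id
    for node_id, score in new_scores:
        grouped[node_id].append(("score", score))
    for node_id, links in out_links:
        grouped[node_id].append(("links", links))

    result = []
    for node_id in sorted(grouped):
        values = grouped[node_id]
        score = next((v for tag, v in values if tag == "score"), 0.0)
        links = next((v for tag, v in values if tag == "links"), [])
        result.append((node_id, [score] + links))  # score first, then outlinks
    return result


joined = join_scores_with_links([(1, 1.0), (2, 0.5)], [(2, [1]), (1, [2])])
print(joined)
```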
PageRank: Side Effect Files for dangling nodes
• Dangling Nodes
– Nodes with no outlinks (observed but not crawled URLs)
– Score has no outlet
» need to distribute to all graph nodes evenly
• Map for dangling nodes:
– In: .., (id3, [score3]), ..
– Out: .., ("*", 0.85×score3), ..
• Reduce
– In: .., ("*", [score1, score2, ..]), ..
– Out: .., everything else, ..
– Output to side-effect: ("*", score), fed to Mapper of next iteration
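The dangling-node handling can be sketched as below. In this illustration the "side-effect file" is simply a second return value, and the names are hypothetical. The 0.85 factor is taken from the slide; how the next iteration's mapper spreads the collected mass over all nodes is not shown here.

```python
from collections import defaultdict

DAMPING = 0.85  # factor applied to dangling scores on the slide


def map_node(node_id, score, out_links):
    """Dangling nodes send their score to the special key "*"."""
    if not out_links:
        yield ("*", DAMPING * score)
    else:
        for target in out_links:
            yield (target, score / len(out_links))


def reduce_all(pairs):
    """Sum scores per key; return the "*" total separately (the side effect)."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    dangling_mass = totals.pop("*", 0.0)
    return dict(totals), dangling_mass


graph = {1: (1.0, [2]), 2: (1.0, [])}  # node 2 has no outlinks
pairs = [kv for node_id, (score, links) in graph.items()
         for kv in map_node(node_id, score, links)]
scores, dangling_mass = reduce_all(pairs)
print(scores, dangling_mass)
```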
Manipulating Large Data
• Do everything in Hadoop (and HDFS)
– Make sure every step is parallelized!
– Any serial step breaks your design
• E.g. storing the URL list for a Web graph
– Each node in Web graph has an id
– [URL1, URL2, …], use line number as id – bottleneck (assigning line numbers is a serial step)
– [(id1, URL1), (id2, URL2), …], explicit id
Hadoop based Tools
• For Developing in Java, NetBeans plugin
– http://www.hadoopstudio.org/docs.html
• Pig Latin, a SQL-like high level data processing script language
• Hive, Data warehouse, SQL
• Cascading, Data processing
• Mahout, Machine Learning algorithms on Hadoop
• HBase, Distributed data store as a large table
• More
– http://hadoop.apache.org/
– http://en.wikipedia.org/wiki/Hadoop
– Many other toolkits, Nutch, Cloud9, Ivory
Get Your Hands Dirty
• Hadoop Virtual Machine
– http://www.cloudera.com/developers/downloads/virtual-machine/
» This runs Hadoop 0.20
– An earlier Hadoop 0.18.0 version is here http://code.google.com/edu/parallel/tools/hadoopvm/index.html
• Amazon EC2
• Various other Hadoop clusters around
• The NetBeans plugin simulates Hadoop
– The workflow view works on Windows
– Local running & debugging works on MacOS and Linux
– http://www.hadoopstudio.org/docs.html
Conclusions
• Why large scale
• MapReduce advantages
• Hadoop uses
• Use cases
– Map only: for totally distributive computation
– Map+Reduce: for filtering & aggregation
– Database join: for massive dictionary lookups
– Secondary sort: for sorting on values
– Inverted indexing: combiner, complex keys
– PageRank: side effect files
• Large data
For More Information
• L. A. Barroso, J. Dean, and U. Hölzle. “Web search for a planet: The Google cluster architecture.” IEEE Micro, 2003.
• J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. 2004.
• S. Ghemawat, H. Gobioff, and S.-T. Leung. “The Google File System.” Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29-43. 2003.
• I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes. Morgan Kaufmann. 1999.
• J. Zobel and A. Moffat. “Inverted files for text search engines.” ACM Computing Surveys, 38 (2). 2006.
• http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. “Map/Reduce Tutorial”. Fetched January 21, 2010.
• Tom White. Hadoop: The Definitive Guide. O'Reilly Media. June 5, 2009
• J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce, Book Draft. February 7, 2010.