Map/Reduce
Large Scale Duplicate Detection
Prof. Felix Naumann, Arvid Heise
Transcript
Page 1: Map/Reduce

Map/Reduce

Large Scale Duplicate Detection

Prof. Felix Naumann, Arvid Heise

Page 2: Map/Reduce


Agenda

■ Big Data

■ Word Count Example

■ Hadoop Distributed File System

■ Hadoop Map/Reduce

■ Advanced Map/Reduce

■ Stratosphere

Map/Reduce | Arvid Heise | April 15, 2013


Page 4: Map/Reduce

What is Big Data?

“collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” [http://en.wikipedia.org/wiki/Big_data]

terabytes, petabytes, in a few years exabytes

Challenges

■ Capturing, storage, analysis, search, ...

Sources

■ Web, social platforms

■ Science


Page 5: Map/Reduce

Example: Climate Data Analysis

[Figure: excerpt of a climate parameter table; 3 months at 1 h resolution, 950 km at 2 km resolution, 10 TB in total]

PS,1,1,0,Pa,surface pressure
T_2M,11,105,0,K,air_temperature
TMAX_2M,15,105,2,K,2m maximum temperature
TMIN_2M,16,105,2,K,2m minimum temperature
U,33,110,0,ms-1,U-component of wind
V,34,110,0,ms-1,V-component of wind
QV_2M,51,105,0,kgkg-1,2m specific humidity
CLCT,71,1,0,1,total cloud cover
… (up to 200 parameters)

■ Analysis Tasks on Climate Data Sets

□ Validate climate models

□ Locate "hot-spots" in climate models

◊ Monsoon

◊ Drought

◊ Flooding

□ Compare climate models

◊ Based on different parameter settings

■ Necessary Data Processing Operations

□ Filter, aggregation (sliding window), join

□ Advanced pattern recognition


Page 6: Map/Reduce


Big Data Landscape


Page 7: Map/Reduce


Agenda

■ Big Data

■ Word Count Example

■ Hadoop Distributed File System

■ Hadoop Map/Reduce

■ Advanced Map/Reduce

■ Stratosphere


Page 8: Map/Reduce


Programming Model

■ Inspired by functional programming concepts map and reduce

■ Operates on key/value pairs

Map

■ Process key/value pairs individually

■ Generate intermediate key/value pairs

■ Example (LISP): (mapcar ’1+ ’(1 2 3 4)) ⇒ (2 3 4 5)

Reduce

■ Merge intermediate key/value pairs with same key

■ Example (LISP): (reduce ’+ ’(1 2 3 4)) ⇒ 10
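The same two second-order functions exist in Python; a minimal sketch of the LISP examples above:

```python
# Python analogue of the LISP examples: map applies a first-order
# function to every element independently; reduce folds all elements
# into a single result.
from functools import reduce

incremented = list(map(lambda x: x + 1, [1, 2, 3, 4]))
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])
print(incremented, total)  # → [2, 3, 4, 5] 10
```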


Page 9: Map/Reduce


Programmer’s Perspective: Word Count

Input (text lines):

1 to be, or not to be, that is the question:
2 whether 'tis nobler in the mind to suffer
3 the slings and arrows of outrageous fortune,
4 or to take arms against a sea of troubles
… …

Output of the Map/Reduce job (word counts):

To 4
Be 2
Or 2
Not 1
… …

Page 10: Map/Reduce


Programmer’s Perspective: WC Map

Input splits:

1 to be, or not to be, that is the question:
2 whether 'tis nobler in the mind to suffer

Map UDF output for split 1:

to 1
be 1
or 1
not 1
to 1
… …

Map UDF output for split 2:

whether 1
'tis 1
nobler 1
in 1
the 1
… …


Page 11: Map/Reduce


Programmer’s Perspective: WC Reduce

Reduce UDF input (sorted and grouped by key):

to 1
to 1
… …
be 1
be 1
… …
not 1
… …

Reduce UDF output:

to 2
be 2
not 1
… …
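The three slides above can be simulated in a few lines of Python. This is a single-process sketch, not Hadoop code; the tokenizer is a simplistic stand-in for a real one:

```python
# Word count as map -> shuffle (sort) -> reduce, all in one process.
from itertools import groupby
from operator import itemgetter

def map_udf(line):
    # Emit (word, 1) for each word; crude punctuation handling.
    for word in line.lower().replace(',', ' ').replace(':', ' ').split():
        yield (word, 1)

def reduce_udf(word, counts):
    return (word, sum(counts))

lines = ["to be, or not to be, that is the question:",
         "whether 'tis nobler in the mind to suffer"]

# Map phase: emit intermediate pairs from every input line.
pairs = [kv for line in lines for kv in map_udf(line)]
# Shuffle phase: sort by key so equal keys become adjacent.
pairs.sort(key=itemgetter(0))
# Reduce phase: one UDF invocation per distinct key.
counts = dict(reduce_udf(k, (c for _, c in group))
              for k, group in groupby(pairs, key=itemgetter(0)))
print(counts["to"], counts["be"])  # → 3 2
```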


Page 12: Map/Reduce


Agenda

■ Big Data

■ Word Count Example

■ Hadoop Distributed File System

■ Hadoop Map/Reduce

■ Advanced Map/Reduce

■ Stratosphere


Page 13: Map/Reduce


Behind the Scenes

■ Map/Reduce framework takes care of

□ Data partitioning

□ Data distribution

□ Data replication

□ Parallel execution of tasks

□ Fault tolerance

□ Status reporting


Page 14: Map/Reduce


Hadoop Architecture

[Diagram: the Master node runs the Namenode (HDFS) and the Job-tracker (Map/Reduce); each of the slaves 1…N runs a Datanode (HDFS) and a Task-tracker (Map/Reduce). A submitted Job is split into Tasks.]


Page 15: Map/Reduce


HDFS Upload

■ First step: User uploads data to HDFS



Page 16: Map/Reduce


HDFS Upload

■ Block/split-based format (usually 64 MB)

■ Splits are replicated over several nodes (usually 3 times)

■ On average, each slave receives #Splits * 3 / #Slaves splits
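A back-of-the-envelope check of the arithmetic above. The block size and replication factor are the defaults stated on this slide; the 10 TB / 100-slave figures are illustrative:

```python
# Splits per slave = ceil(file size / block size) * replicas / slaves.
BLOCK_MB, REPLICAS = 64, 3

def splits_per_slave(file_mb, num_slaves):
    num_splits = -(-file_mb // BLOCK_MB)   # ceiling division
    return num_splits * REPLICAS / num_slaves

# The 10 TB climate data set from earlier, spread over 100 slaves:
print(splits_per_slave(10 * 1024 * 1024, 100))  # → 4915.2 block copies each
```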


[Diagram: the HDFS client 1) requests block locations from the Namenode, 2) uploads the splits to the Datanodes on the slaves, and the Datanodes 3) register the stored blocks with the Namenode.]

Page 17: Map/Reduce


Agenda

■ Big Data

■ Word Count Example

■ Hadoop Distributed File System

■ Hadoop Map/Reduce

■ Advanced Map/Reduce

■ Stratosphere


Page 18: Map/Reduce


Job Submission

■ Second step: User submits job



Page 19: Map/Reduce


Job Submission

■ Job tracker allocates resources for submitted job

■ Uses the name node to determine which node processes which split

■ Distributes tasks to nodes


[Diagram: the Job-tracker on the Master distributes the Tasks of the Job to the Task-trackers on the slaves.]

Page 20: Map/Reduce



Job Execution

■ Third step: job execution


[Diagram: input splits feed the Map tasks on the slaves; the shuffle redistributes the intermediate pairs to the Reduce tasks, which write the output splits.]

Page 21: Map/Reduce


Map tasks

■ Third step: job execution, map task

■ Nodes process tasks independently

■ Task tracker receives tasks and spawns one map process per task


[Diagram: the Task-tracker on each slave spawns one Map Task per received task.]

Page 22: Map/Reduce


Map Execution

■ Task tracker receives input as map waves

■ Each wave consists of at most #processors splits

■ Spawns a new JVM(!) for each split

■ Each wave has at least ~6s overhead

■ For each split, the map task reads the key value pairs

■ Invokes the map UDF for each key/value pair

■ Collects emitted results and spills them immediately to a local file

■ Optionally reuses JVM to reduce time per wave


Page 23: Map/Reduce



Job Execution, Shuffle



Page 24: Map/Reduce


Shuffle

■ Partitioner distributes data to the different nodes

□ Uses unique mapping from key to node

□ Often: key.hashCode() % numReducer

■ Key/Value-pairs are serialized and sent over network

■ Spilled to local disk of the reducer

■ Sorted by key with two-phase merge sort

■ Usually most costly phase
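The partitioning rule above can be sketched in Python. The Java-style rolling hash below is an illustrative stand-in for `key.hashCode()`; any deterministic hash works, as long as every mapper agrees on it:

```python
# Hash partitioner: map each key to one of num_reducers partitions.
def partition(key, num_reducers):
    # Deterministic rolling hash (Python's built-in hash() is salted
    # per process, so it would break the "unique mapping" requirement).
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0x7FFFFFFF   # keep it non-negative
    return h % num_reducers

# Every occurrence of the same key lands on the same reducer:
print(partition("to", 4), partition("be", 4))
```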



Page 26: Map/Reduce


Reducer Execution

■ Basic idea

□ Scans over sorted list

□ Invokes reducer UDF for subset of data with same keys

■ In reality, a bit more complicated

□ Provides reducer UDF with iterator

□ Iterator returns all values with same key

□ UDF is invoked as long as at least one element is left

□ Only one scan with little memory overhead

■ Stores result on local disk

■ Replicates splits (two times)
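The scan-and-iterate scheme above can be sketched with `itertools.groupby`, which likewise hands the UDF a lazy iterator per key instead of materializing each group:

```python
# Single scan over sorted pairs: one reducer-UDF call per distinct key,
# with values delivered through a lazy iterator.
from itertools import groupby
from operator import itemgetter

sorted_pairs = [("be", 1), ("be", 1), ("not", 1), ("to", 1), ("to", 1)]

def reduce_udf(key, values):           # values is an iterator, not a list
    return key, sum(values)

output = [reduce_udf(k, (v for _, v in grp))
          for k, grp in groupby(sorted_pairs, key=itemgetter(0))]
print(output)  # → [('be', 2), ('not', 1), ('to', 2)]
```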


Page 27: Map/Reduce


Combiner

■ Local reducer

■ Invoked in map phase for smaller groups of keys

□ Not the complete list of values in general

□ Preaggregates result to reduce network cost!

■ Can even be invoked recursively on preaggregated results
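A sketch of the combiner's effect: pre-aggregating a mapper's local output shrinks what must cross the network from one pair per word occurrence to one pair per distinct word.

```python
# A combiner for word count is just the sum-reducer applied locally.
from collections import Counter

mapper_output = [("to", 1), ("be", 1), ("to", 1), ("to", 1), ("be", 1)]

def combine(pairs):
    agg = Counter()
    for k, v in pairs:
        agg[k] += v
    return list(agg.items())

combined = combine(mapper_output)
print(len(mapper_output), "->", len(combined), "pairs shipped")  # → 5 -> 2 pairs shipped
# Recursive invocation is safe because addition is associative and
# commutative: combining the pre-aggregated pairs changes nothing.
assert combine(combined) == combined
```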


Page 28: Map/Reduce


Word Count Recap, Data Upload

■ During upload, split input

■ (In general, more than one line)


1 to be, or not to be, that is the question:
2 whether 'tis nobler in the mind to suffer
3 the slings and arrows of outrageous fortune,
4 or to take arms against a sea of troubles
… …

First split:

1 to be, or not to be, that is the question:
2 whether 'tis nobler in the mind to suffer

Page 29: Map/Reduce


Word Count Recap, Map Phase

■ For each input split invoke map task

■ Map task receives each line in the split

■ Tokenizes line, emits (word, 1) for each word

■ Locally combines results!

□ Decreases I/O from #words to #distinct words per split (64 MB)


[Diagram: the Map Tasks on the slaves each process one split, e.g. "1 to be, or not to be, that is the question:" and "2 whether 'tis nobler in the mind to suffer".]

Page 30: Map/Reduce


Word Count Recap, Shuffle+Reduce

■ Assigns each word to reducer

■ Sends all preaggregated results to reducer

□ For example, (to, 3512)

■ Reducer sorts results and UDF sums preaggregated results up

■ Each reducer outputs a partial word histogram

■ Client is responsible for putting output splits together


Page 31: Map/Reduce


Behind the Scenes

■ Map/Reduce framework takes care of

□ Data partitioning

□ Data distribution

□ Data replication

□ Parallel execution of tasks

□ Fault tolerance

□ Status reporting


Page 32: Map/Reduce


Fault Tolerance

On Map/Reduce level

■ Each task tracker sends progress report

■ If a node does not respond within 10 minutes (configurable)

□ It is declared dead

□ The assigned tasks are redistributed over the remaining nodes

□ Because of replication, 2 nodes can be down at any time

On HDFS level

■ Each data node sends periodic heartbeat to name node

■ In case of down time

□ Receives no new I/O

□ Lost replicas are restored on other nodes


Page 33: Map/Reduce


Agenda

■ Big Data

■ Word Count Example

■ Hadoop Distributed File System

■ Hadoop Map/Reduce

■ Advanced Map/Reduce

■ Stratosphere


Page 34: Map/Reduce


Record Reader

■ For WC, we used LineRecordReader

□ Splits text files at line ends (‘\n’)

□ Generates key/value pair of (byte offset, line)

■ Hadoop users can supply own readers

□ Could already tokenize the lines

□ Emits (word, 1)

□ No mapper needed

■ Necessary for custom/complex file formats

■ Useful when having different file formats but same mapper
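A record reader in this spirit can be sketched as a generator. The name `word_record_reader` is illustrative, not the Hadoop API; instead of emitting (offset, line) like LineRecordReader, it already tokenizes and emits (word, 1), so no mapper is needed:

```python
# Custom record reader sketch: turn a text split directly into
# (word, 1) pairs, skipping the identity mapper.
def word_record_reader(split_text):
    for line in split_text.split("\n"):
        for word in line.split():
            yield (word, 1)

records = list(word_record_reader("to be or\nnot to be"))
print(records[:3])  # → [('to', 1), ('be', 1), ('or', 1)]
```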


Page 35: Map/Reduce


Dealing with Multiple Inputs

■ Map and reduce take only one input

■ Operations with two inputs are tricky to implement

■ Input splits of map can originate in several different files

□ Logical concatenation of files

■ Standard trick: tagged union

□ In the record reader/mapper, output (key, (inputId, value))

□ Mapper and reducer UDFs can distinguish inputs
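The tagged-union trick in miniature, with two illustrative inputs: records are wrapped as (key, (inputId, value)) so a single pipeline can tell them apart again downstream.

```python
# Tag two logical inputs so they survive logical concatenation.
students = [(1, "Ada"), (2, "Max")]
grades   = [(1, "A"),   (2, "B"), (1, "B+")]

tagged = [(k, (0, v)) for k, v in students] + \
         [(k, (1, v)) for k, v in grades]

# A downstream UDF branches on the tag to recover each input:
left  = [(k, v) for k, (tag, v) in tagged if tag == 0]
right = [(k, v) for k, (tag, v) in tagged if tag == 1]
print(left, right)
```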


Page 36: Map/Reduce


Join

■ Reduce-side join

□ Tagged union (joinKey, (inputId, record))

□ All records with same join key are handled by same reducer

□ Cache all values in local memory

□ Perform inner/outer join

◊ Emit all pairs of values with different inputIds

□ May generate OOM for larger partitions

■ Map-side join

□ Presort and prepartition input

□ All relevant records should reside in same split

□ Load and cache split

□ Perform inner/outer join
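A single-process sketch of the reduce-side join described above, with illustrative sample data: grouping by join key stands in for the shuffle, and caching the per-key value lists in memory is exactly the OOM risk the slide mentions.

```python
# Reduce-side inner join: group tagged records by join key, then emit
# the per-key cross product of the two sides.
from collections import defaultdict

def reduce_side_join(left, right):
    groups = defaultdict(lambda: ([], []))   # key -> (left values, right values)
    for k, v in left:
        groups[k][0].append(v)               # caching here is the OOM risk
    for k, v in right:
        groups[k][1].append(v)
    for k, (ls, rs) in groups.items():
        for l in ls:                         # inner join: pairs across inputs
            for r in rs:
                yield (k, l, r)

students = [(1, "Ada"), (2, "Max")]
grades   = [(1, "A"), (1, "B+"), (3, "C")]
print(sorted(reduce_side_join(students, grades)))
# → [(1, 'Ada', 'A'), (1, 'Ada', 'B+')]
```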

Page 37: Map/Reduce


Secondary Grouping/Sort

■ Exploit that partitioner and grouping are two different UDFs

■ Map emits ((key1, key2), value)

■ Partitioner partitions data only on the first key, key1

■ All KV-pairs ((keyX, ?), ?) are on the same physical machine

■ However, reducer is invoked on partitions ((keyX, keyY), ?)

■ Useful to further subdivide partitions

□ Join data could also be tagged ((joinKey, inputId), record)

□ Only need to cache one input and iterate over other partition

■ Hadoop Reducer always sorts data

□ Data is grouped by first key and sorted by second key
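The mechanics above in a local sketch with made-up keys: sort on the composite (key1, key2), then group on key1 only, so each group arrives with its values already ordered by key2.

```python
# Secondary sort: partition/group on key1, order within the group by key2.
from itertools import groupby

# ((key1, key2), value) pairs as a mapper would emit them
pairs = [(("b", 2), "y"), (("a", 2), "q"), (("b", 1), "x"), (("a", 1), "p")]

# The partitioner would look at key1 only, so all ("a", ?) pairs land
# on the same node; here we just sort by the full composite key.
pairs.sort(key=lambda kv: kv[0])

grouped = {k1: [v for _, v in grp]
           for k1, grp in groupby(pairs, key=lambda kv: kv[0][0])}
print(grouped)  # → {'a': ['p', 'q'], 'b': ['x', 'y']}
```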


Page 38: Map/Reduce


Side-effect Files

■ Sometimes even these tricks are not enough

■ Example: triangle enumeration/three way join

■ SELECT x, y, z WHERE x.p2=y.p1 AND y.p2=z.p1 AND z.p2=x.p1

■ Cohen’s approach with two map/reduce jobs

■ Generate triad (SELECT x, y, z WHERE x.p2=y.p1 AND y.p2=z.p1)

■ Probe missing edge with a reducer on input data

■ Huge intermediate results on skewed data sets!

■ Way faster: one map/reduce job

■ Generate triad and immediately test if missing edge is in data

■ Needs to load data set into main memory in reducer

■ Might run into OOM
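The faster one-job variant can be sketched locally on a toy edge set: generate each triad x → y → z and immediately probe whether the closing edge z → x exists in an in-memory edge set (the part that may OOM at scale).

```python
# Triangle enumeration: triads plus an in-memory probe of the missing edge.
from collections import defaultdict

edges = {(1, 2), (2, 3), (3, 1), (2, 4)}
out = defaultdict(list)
for a, b in edges:
    out[a].append(b)

triangles = set()
for x, y in edges:                 # triad: x -> y -> z
    for z in out[y]:
        if (z, x) in edges:        # probe the closing edge, no second job
            triangles.add(frozenset((x, y, z)))

print(triangles)  # → {frozenset({1, 2, 3})}
```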

Page 39: Map/Reduce


■ Complete pipeline in: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Jörg Schad. PVLDB 3(1): 518-529 (2010)

■ More than 10 UDFs!


Page 40: Map/Reduce


Agenda

■ Big Data

■ Word Count Example

■ Hadoop Distributed File System

■ Hadoop Map/Reduce

■ Advanced Map/Reduce

■ Stratosphere


Page 41: Map/Reduce


Overview over Stratosphere

■ Research project by HU Berlin, TU Berlin, and HPI

■ Overcome shortcomings of Map/Reduce

■ Allow optimization of queries similar to a DBMS


Page 42: Map/Reduce


Extensions of Map/Reduce

■ Additional second-order functions

■ Complex workflows instead of Map/Reduce pipelines

■ More flexible data model

■ Extensible operator model

■ Optimization of workflows

■ Sophisticated checkpointing

■ Dynamic machine booking


Page 43: Map/Reduce

Intuition for Parallelization Contracts

Map and reduce are second-order functions

■ Call first-order functions (user code)

■ Provide first-order functions with subsets of the input data

Define dependencies between the records that must be obeyed when splitting them into subsets

■ Contract: required partition properties

Map

■ All records are independently processable

Reduce

■ Records with identical key must be processed together


[Diagram: an input set of key/value pairs is split into independent subsets according to the contract.]

Page 44: Map/Reduce

Contracts beyond Map and Reduce

Cross

■ Two inputs

■ Each combination of records from the two inputs is built and is independently processable

Match

■ Two inputs, each combination of records with equal key from the two inputs is built

■ Each pair is independently processable

CoGroup

■ Multiple inputs

■ Pairs with identical key are grouped for each input

■ Groups of all inputs with identical key are processed together
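Local, single-process sketches of the three contracts, assuming lists of (key, value) pairs as inputs; Stratosphere's real operators are of course parallel:

```python
# Cross, Match and CoGroup over (key, value) lists.
from itertools import product
from collections import defaultdict

def cross(left, right):                      # every combination of records
    return list(product(left, right))

def match(left, right):                      # combinations with equal key
    return [(l, r) for l, r in product(left, right) if l[0] == r[0]]

def cogroup(left, right):                    # per-key groups of both inputs
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return dict(groups)

L, R = [(1, "a"), (2, "b")], [(1, "x"), (3, "y")]
assert len(cross(L, R)) == 4                 # 2 x 2 combinations
assert match(L, R) == [((1, "a"), (1, "x"))]
assert cogroup(L, R)[1] == (["a"], ["x"])    # both groups for key 1
```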


Page 45: Map/Reduce


Complex Workflows

■ Directed acyclic graphs

■ More natural programming

■ Holistic view on query

□ Map/Reduce queries scattered over several jobs

■ Higher abstraction

□ Allows optimization

□ Less data is shipped


[Diagram: example workflow as a directed acyclic graph over two inputs, Students and News articles. Operators: Map (annotate entities), Map (annotate sentences), Reduce (pivot), Map (pivot), Map (filter students), Match on name (pivotization), Cross (find similar students), and CoGroup on sid (merge student with its duplicates), producing the Results.]

Page 46: Map/Reduce


Motivation for Record Model

■ Key/Value-pairs are not very flexible

■ In Map/Reduce

□ Map performs calculation and sets key

□ Reducer uses key and performs aggregation

■ Strong implicit interdependence between Map and Reduce

■ In Stratosphere, we want to reorder PACTs

□ Need to reduce interdependence

■ Record data model

□ Array of values

□ Keys are explicitly set by contract (Reduce, Match, CoGroup)


Page 47: Map/Reduce


Record Model

■ All fields are serialized into a byte stream

■ User code is responsible for

□ Managing the indices

□ Knowing the correct type of the field

■ Huge performance gain through lazy deserialization

□ Deserialize only accessed fields

□ Serialize only modified fields
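Lazy deserialization in miniature; class and method names here are illustrative, not Stratosphere's actual API. Fields stay in their serialized form until the user code asks for them:

```python
# Decode a field only on first access; untouched fields stay as bytes.
class LazyRecord:
    def __init__(self, raw_fields):
        self._raw = raw_fields            # serialized fields (bytes)
        self._cache = {}                  # decoded fields, filled on demand

    def get(self, index, decode):
        if index not in self._cache:      # pay the deserialization cost once,
            self._cache[index] = decode(self._raw[index])  # and only if accessed
        return self._cache[index]

rec = LazyRecord([b"42", b"3.14", b"hello"])
print(rec.get(0, int), rec.get(2, bytes.decode))  # fields 0 and 2 decoded
assert 1 not in rec._cache                        # field 1 never touched
```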


Page 48: Map/Reduce


Composite Keys

■ Composite keys in Map/Reduce

□ New tuple data structure

□ Map copies values into the fields

□ Emits (keys, value)

■ Stratosphere allows specifying composite keys

□ Reduce, Match, CoGroup can be configured to take several indices/types in the record as key


Page 49: Map/Reduce


More Documentation

■ Project website https://stratosphere.eu/

■ MapReduce and PACT - Comparing Data Parallel Programming Models. Alexander Alexandrov, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, Daniel Warneke. In Proceedings of Datenbanksysteme für Business, Technologie und Web (BTW) 2011, pp. 25-44
