Transcript
Page 1:

Advanced topics on Mapreduce with Hadoop

Jiaheng Lu

Department of Computer Science

Renmin University of China

www.jiahenglu.net

Page 2:

Outline

Brief Review
Chaining MapReduce Jobs
Join in MapReduce
Bloom Filter

Page 3:

Brief Review

A parallel programming framework
Divide the input, process the pieces in parallel, and merge the results

[Diagram: the input data is divided into splits (split0, split1, split2); map tasks (mappers) process the splits; the map output is shuffled; reduce tasks (reducers) merge it into the output files (output0, output1).]

Page 4:

Chaining MapReduce jobs

Chaining in a sequence
Chaining with complex dependency
Chaining preprocessing and postprocessing steps

Page 5:

Chaining in a sequence

Simple and straightforward: [MAP | REDUCE]+ or MAP+ | REDUCE | MAP*
The output of one job is the input to the next
Similar to Unix pipes (sketch below)
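A minimal sketch of sequential chaining with the old mapred API, assuming hypothetical Step1/Step2 mapper and reducer classes and placeholder paths; JobClient.runJob() blocks until a job finishes, so the second job starts only after the first has written the intermediate output it reads.

Configuration conf = getConf();
Path input = new Path("in");               // placeholder input path
Path intermediate = new Path("step1-out"); // output of job 1 = input of job 2
Path output = new Path("out");             // final output path

// Job 1
JobConf job1 = new JobConf(conf);
job1.setJobName("step1");
job1.setMapperClass(Step1Mapper.class);
job1.setReducerClass(Step1Reducer.class);
FileInputFormat.setInputPaths(job1, input);
FileOutputFormat.setOutputPath(job1, intermediate);
JobClient.runJob(job1);                    // blocks until job 1 completes

// Job 2 reads what job 1 wrote, like a pipe between two commands.
JobConf job2 = new JobConf(conf);
job2.setJobName("step2");
job2.setMapperClass(Step2Mapper.class);
job2.setReducerClass(Step2Reducer.class);
FileInputFormat.setInputPaths(job2, intermediate);
FileOutputFormat.setOutputPath(job2, output);
JobClient.runJob(job2);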

Page 6:

// Driver for a chained job (old mapred API): the chain is assembled one
// mapper at a time with ChainMapper.addMapper(); only the first mapper is shown here.
Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

// Each mapper in the chain gets its own (here empty) local JobConf.
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job,
    Map1.class,              // mapper class
    LongWritable.class,      // input key type
    Text.class,              // input value type
    Text.class,              // output key type
    Text.class,              // output value type
    true,                    // pass key/value pairs by value
    map1Conf);               // local configuration for Map1

Page 7:

Chaining with complex dependency

Jobs are not chained in a linear fashion

Use the addDependingJob() method to add dependency information:

x.addDependingJob(y): job x will not start until job y has finished (sketch below)
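A minimal sketch with org.apache.hadoop.mapred.jobcontrol.Job and JobControl, assuming jobConfX and jobConfY are already-configured JobConf objects; JobControl submits a job only after the jobs it depends on have finished.

Job x = new Job(jobConfX);          // wrap each configured JobConf
Job y = new Job(jobConfY);
x.addDependingJob(y);               // x is not submitted until y has finished

JobControl control = new JobControl("dependent jobs");
control.addJob(x);
control.addJob(y);

// Run the controller in its own thread and wait for the whole graph to finish.
Thread controller = new Thread(control);
controller.start();
while (!control.allFinished()) {
    try { Thread.sleep(1000); } catch (InterruptedException e) { /* keep waiting */ }
}
control.stop();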

Page 8:

Chaining preprocessing and postprocessing steps

Example: removing stop words in IR
Approaches:

Separate jobs: inefficient
Chain the preprocessing and postprocessing steps into a single job

Use ChainMapper.addMapper() and ChainReducer.setReducer()

Map+ | Reduce | Map* (sketch below)

Page 9:

Join in MapReduce

Reduce-side join
Broadcast join
Map-side filtering plus reduce-side join, filtering by:
  a given key
  a range from the (broadcast) dataset
  a Bloom filter

Page 10:

Reduce-side join

Map: output <key, value> pairs, where key is the join key and value is the record tagged with its data source

Reduce: compute a full cross-product of the values from the different sources and output the combined results (sketch below)
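A minimal sketch of a reduce-side join with the old mapred API, assuming both inputs are tab-separated with the join key in the first field and that the file path tells the two tables apart (a hypothetical "tableX" naming convention); the reducer buffers the values of each side to form the cross-product.

// Mapper: tag each record with its source table and emit it under the join key.
public static class JoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private String tag;

  public void configure(JobConf job) {
    // "map.input.file" names the file this split comes from (old mapred API).
    tag = job.get("map.input.file").contains("tableX") ? "X" : "Y";
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String[] fields = value.toString().split("\t", 2);          // fields[0] = join key
    out.collect(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
  }
}

// Reducer: separate the values by tag, then emit the full cross-product per key.
public static class JoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    List<String> fromX = new ArrayList<String>();
    List<String> fromY = new ArrayList<String>();
    while (values.hasNext()) {
      String[] tagged = values.next().toString().split("\t", 2);
      if ("X".equals(tagged[0])) fromX.add(tagged[1]);
      else fromY.add(tagged[1]);
    }
    for (String x : fromX)
      for (String y : fromY)
        out.collect(key, new Text(x + "\t" + y));               // one output row per combination
  }
}

Buffering both sides in memory keeps the sketch short; heavily skewed join keys would need a more careful (e.g. secondary-sort) design.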

Page 11:

Example

Table x (columns a, b):      Table y (columns a, c):
a   b                        a   c
1   ab                       1   b
1   cd                       2   d
4   ef                       4   c

map(): the join key (column a) becomes the key; the value is the record tagged with its source table:
from x: <1, (x, ab)>, <1, (x, cd)>, <4, (x, ef)>
from y: <1, (y, b)>, <2, (y, d)>, <4, (y, c)>

shuffle(): the tagged values are grouped by join key:
1 -> [(x, ab), (x, cd), (y, b)]
2 -> [(y, d)]
4 -> [(x, ef), (y, c)]

reduce(): cross-product of the x-values and y-values for each key (key 2 has no partner and produces nothing):

output (columns a, b, c):
1   ab   b
1   cd   b
4   ef   c

Page 12:

Broadcast join (replicated join)

Broadcast the smaller table to every mapper
Do the join in map()

Use the distributed cache:

DistributedCache.addCacheFile() (sketch below)
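A minimal sketch of a broadcast (replicated) join, assuming the smaller table is a tab-separated file of <join key, value> lines small enough to hold in a HashMap; the file name is a placeholder.

// Driver: ship the small table to every task through the distributed cache.
DistributedCache.addCacheFile(new Path("/data/small_table.txt").toUri(), job);

// Mapper: load the cached table once in configure(), then join each record in map().
public static class BroadcastJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private Map<String, String> smallTable = new HashMap<String, String>();

  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split("\t", 2);        // fields[0] = join key
        smallTable.put(fields[0], fields[1]);
      }
      reader.close();
    } catch (IOException e) {
      throw new RuntimeException("failed to load the cached table", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String[] fields = value.toString().split("\t", 2);
    String match = smallTable.get(fields[0]);
    if (match != null) {                              // inner join: only matching keys survive
      out.collect(new Text(fields[0]), new Text(fields[1] + "\t" + match));
    }
  }
}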

Page 13:

Map-side filtering and Reduce-side join

Join key: student IDs from the info dataset
Generate a file of the IDs from info and broadcast it (as in a broadcast join) to filter the other dataset in map()

What if the IDs file cannot be held in memory? Use a Bloom filter

Page 14:

A Bloom Filter

Introduction
Implementation of a Bloom filter
Use in a MapReduce join

Page 15:

Introduction to Bloom Filter

A space-efficient, constant-size data structure for testing set membership, with two operations: add() and contains()

No false negatives, but a small probability of false positives

Page 16:

Implementation of bloom filter

Use a bit array of m bits and k hash functions

add(element): generate k indexes and set the corresponding k bits to 1

contains(element): generate k indexes; if all k bits are 1, return true, otherwise return false (sketch below)
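A minimal sketch of the bit-array idea, deriving the k indexes by double hashing from two base hash values; Hadoop also provides a ready-made implementation in org.apache.hadoop.util.bloom.BloomFilter, so this is only illustrative.

import java.util.BitSet;

// Minimal Bloom filter: m-bit array, k indexes per element.
public class SimpleBloomFilter {
  private final BitSet bits;
  private final int m;   // number of bits
  private final int k;   // number of hash functions

  public SimpleBloomFilter(int m, int k) {
    this.bits = new BitSet(m);
    this.m = m;
    this.k = k;
  }

  // Derive the i-th index from two base hashes (double hashing); a crude
  // second hash is enough for a sketch, not for production use.
  private int index(byte[] data, int i) {
    int h1 = java.util.Arrays.hashCode(data);
    int h2 = h1 * 31 + data.length;
    return Math.abs((h1 + i * h2) % m);
  }

  public void add(byte[] data) {
    for (int i = 0; i < k; i++) bits.set(index(data, i));      // set k bits to 1
  }

  public boolean contains(byte[] data) {
    for (int i = 0; i < k; i++)
      if (!bits.get(index(data, i))) return false;             // a 0 bit means definitely absent
    return true;                                               // all 1s: present, or a false positive
  }
}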

Page 17:

Example

A Bloom filter with a 10-bit array (positions 0 to 9) and k = 3 indexes per element:

① initial state:                0 0 0 0 0 0 0 0 0 0
② add x (indexes 0, 2, 6):      1 0 1 0 0 0 1 0 0 0
③ add y (indexes 0, 3, 9):      1 0 1 1 0 0 1 0 0 1
④ contains m (indexes 1, 3, 9): bit 1 is 0, so the result is false (×)
⑤ contains n (indexes 0, 2, 9): bits 0, 2 and 9 are all 1, so the result is true (√), although n was never added: a false positive

Page 18:

Use in MapReduce join

Run a separate subjob first to build a Bloom filter over the join keys of the smaller dataset

Broadcast the Bloom filter and test against it in map() of the join job

Drop the records that fail the test, then do the join in reduce() (sketch below)
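A minimal sketch of the map() side of the join job, assuming the filter-building subjob serialized a Hadoop org.apache.hadoop.util.bloom.BloomFilter to a file that was then added to the distributed cache; records whose join key fails the membership test are dropped before the shuffle, and the survivors are joined in reduce() as in the reduce-side join.

// Mapper of the join job: keep only records whose join key might be
// in the smaller dataset, according to the broadcast Bloom filter.
public static class BloomFilterJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private BloomFilter filter = new BloomFilter();

  public void configure(JobConf job) {
    try {
      // The subjob is assumed to have written a serialized BloomFilter,
      // shipped here via DistributedCache.addCacheFile().
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      DataInputStream in = new DataInputStream(new FileInputStream(cached[0].toString()));
      filter.readFields(in);   // BloomFilter is a Writable
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("failed to load the Bloom filter", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String[] fields = value.toString().split("\t", 2);   // fields[0] = join key
    if (filter.membershipTest(new Key(fields[0].getBytes()))) {
      // Possible match (no false negatives): pass it on; the real join happens in reduce().
      out.collect(new Text(fields[0]), value);
    }
    // Keys rejected by the filter cannot be in the other dataset and are dropped here.
  }
}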

Page 19:

References

Chuck Lam, “Hadoop in Action”

Jairam Chandar, “Join Algorithms using Map/Reduce”

Page 20:

THANK YOU

