Page 1: Map Reduce

MapReduce

Page 2: Map Reduce

TABLE OF CONTENTS

Map Reduce

Map Reduce Features

Mapper

Mapper – An Example

Reducer

Reducer - An Example

Map Reduce – The Big Picture

Word Count – A Map Reduce Example

Word Count, Code Walk Through

How Map Reduce works in Word Count

Word Count – Execution

Only for TCS Internal Training - NextGen Solutions, Kochi

Page 3: Map Reduce

Map Reduce

MapReduce is the system used to process data in the Hadoop cluster. It is basically a software framework for writing applications that process vast amounts of data in parallel on large clusters.

MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.

Each Map task operates on a discrete portion of the overall dataset – typically one HDFS block of data.

After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes that perform the Reduce phase.


Page 4: Map Reduce

Map Reduce - Features

Automatic parallelization and distribution

Fault-tolerance

A clean abstraction for programmers

- Developers can concentrate on writing the Map and Reduce functions

- Map and Reduce functions can be written in scripting languages other than Java, via Hadoop Streaming


Page 5: Map Reduce

Mapper

Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffic.

Multiple Mappers run in parallel, each processing a portion of the input data

The Mapper reads data in the form of key/value pairs

The Mapper outputs zero or more key/value pairs

If the Mapper writes anything out, the output must be in the form of key/value pairs

The Mapper may use or completely ignore the input key

– For example, a standard pattern is to read a line of a file at a time

– The key is the byte offset into the file at which the line starts

– The value is the contents of the line itself

– Typically the key is considered irrelevant
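This (byte offset, line) pairing is what Hadoop's standard text input produces for each Mapper. As an illustration only – the class and method names below are hypothetical, and in a real job the framework computes these pairs for you – the pattern can be sketched in plain Java:

public class LineOffsets {
    // Illustration of the standard Mapper input pattern:
    // key = byte offset at which the line starts, value = the line itself.
    static java.util.Map<Long, String> toKeyValuePairs(String fileContents) {
        java.util.Map<Long, String> pairs = new java.util.LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            pairs.put(offset, line);
            offset += line.getBytes().length + 1; // +1 for the '\n' separator
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Second line starts at byte 12: "Hello World" (11 bytes) plus '\n'.
        System.out.println(toKeyValuePairs("Hello World\nBye World"));
    }
}
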


Page 6: Map Reduce

Mapper – An Example

A Mapper that turns the input key and value to their corresponding upper case
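The slide's code is not reproduced in this transcript. A minimal sketch of the mapper logic, assuming String inputs for brevity (in a real Hadoop job this would sit inside a class extending Mapper with Text keys and values, and the pair would be emitted via context.write), might look like:

public class UpperCaseMapper {
    // Mapper logic only: emit the upper-cased key and the upper-cased value.
    static String[] map(String key, String value) {
        return new String[] { key.toUpperCase(), value.toUpperCase() };
    }

    public static void main(String[] args) {
        String[] out = map("hello", "world");
        System.out.println(out[0] + "\t" + out[1]);
    }
}
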


Page 7: Map Reduce

Reducer

After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list.

This list is given to a Reducer:

– There may be a single Reducer, or multiple Reducers

– This is specified as part of the job configuration

– All values associated with a particular intermediate key are guaranteed to go to the same Reducer

– The intermediate keys, and their value lists, are passed to the Reducer in sorted key order

– This step is known as the ‘shuffle and sort’

The Reducer outputs zero or more final key/value pairs

– These are written to HDFS

– In practice, the Reducer usually emits a single key/value pair for each input key


Page 8: Map Reduce

Reducer – An Example

A Reducer that adds up all the values associated with each intermediate key
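The slide's code is not reproduced here either; the full Hadoop version of this reducer appears as IntSumReducer in the WordCount listing later in this deck. Stripped of the Hadoop types, the reducer logic for a single intermediate key is simply:

public class SumReducerSketch {
    // Reducer logic for one intermediate key: sum up its list of values.
    // In Hadoop this runs inside
    // reduce(Text key, Iterable<IntWritable> values, Context context).
    static int reduce(java.util.List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(reduce(java.util.Arrays.asList(1, 1, 1)));
    }
}
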


Page 9: Map Reduce

Map Reduce – The Big Picture


Page 10: Map Reduce

WordCount - A Map Reduce Example

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }


Page 11: Map Reduce

WordCount - A Map Reduce Example (contd.)

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    if (args.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


Page 12: Map Reduce

Word Count, Code Walk Through - main()...1

Configuration conf = new Configuration();

Provides access to configuration parameters, here we create a new Configuration object

Job job = new Job(conf, "word count");

Create a new MapReduce job with the name "word count".

Once the job is run, we can see the name "word count" in the JobTracker.

job.setJarByClass(WordCount.class);

Specify the main class for the execution of the job.

job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

Specify the Mapper, Combiner, and Reducer classes for the job.


Page 13: Map Reduce

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

Since the expected output is a "word" and its occurrence "count", set the Reducer output key class to Text and the value class to IntWritable.

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Specify the input and output directories for the job using the command-line arguments.

System.exit(job.waitForCompletion(true) ? 0 : 1);

Submits the job to the cluster and waits for it to complete.

Word Count, Code Walk Through - main()...2


Page 14: Map Reduce

Word Count, Code Walk Through – Map/Reduce Logic

Map

Create objects of type Text and IntWritable, named word and one respectively.

Each line of the input file arrives as the variable value (defined in the map function).

Each token of the input line is taken and assigned to word.

The Context is populated with the pair (word, one) for each token.

Reduce

An intermediate shuffle and sort mechanism ensures that the reducer receives each key and an associated list of values.

The reducer iterates through the values for each key, adding each value to the local variable sum.

The output will be generated for all words in the input file.


Page 15: Map Reduce

How Map Reduce Works in Word Count

Suppose the job runs two Map tasks, each processing one line of the input file.

Map 1 emits:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>

Map 2 emits:
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>

The combiner performs local aggregation after the Map output has been sorted on the keys.

Combiner 1 emits:
< Bye, 1> < Hello, 1> < World, 2>

Combiner 2 emits:
< Goodbye, 1> < Hadoop, 2> < Hello, 1>

The reducer then sums up the values, which are the occurrence counts for each key (i.e., each word).

Reducer Output

< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
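The whole flow on this slide can be simulated without a cluster. The sketch below (class and method names are illustrative, not Hadoop API) mirrors the slide: each split is tokenized and locally combined in sorted key order, then the per-split counts are merged in the reduce step:

public class WordCountSimulation {
    // Map + local combine for one input split: tokenize and count locally.
    // A TreeMap keeps keys sorted, matching the combiner output on the slide.
    static java.util.Map<String, Integer> mapAndCombine(String split) {
        java.util.Map<String, Integer> counts = new java.util.TreeMap<>();
        for (String word : split.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Reduce: merge the per-split counts by summing the values per key.
    static java.util.Map<String, Integer> reduce(java.util.Map<String, Integer> a,
                                                 java.util.Map<String, Integer> b) {
        java.util.Map<String, Integer> result = new java.util.TreeMap<>(a);
        b.forEach((k, v) -> result.merge(k, v, Integer::sum));
        return result;
    }

    public static void main(String[] args) {
        java.util.Map<String, Integer> c1 = mapAndCombine("Hello World Bye World");
        java.util.Map<String, Integer> c2 = mapAndCombine("Hello Hadoop Goodbye Hadoop");
        System.out.println(c1);             // combiner 1 output
        System.out.println(c2);             // combiner 2 output
        System.out.println(reduce(c1, c2)); // final reducer output
    }
}
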


Page 16: Map Reduce

Word Count - Execution

Create an input directory in HDFS:

bin/hadoop fs -mkdir <input-directory>

Copy input.txt from the local machine to HDFS:

hadoop fs -put input.txt <input-directory>

Create the word count jar, wordCount.jar (the Eclipse IDE can be used).

Run the jar:

bin/hadoop jar wordCount.jar <MainClassName> <input-directory> <output-directory>

Input File (input.txt):
Hello World Bye World
Hello Hadoop Goodbye Hadoop

Output File:
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

Check the output using

hadoop fs -cat <output-directory-path>/<generated-output-file>


Page 17: Map Reduce

REFERENCES

Hadoop Wiki
Yahoo Hadoop Tutorials
Introduction to HDFS, developerWorks, IBM
Hadoop in Action, Chuck Lam


Page 18: Map Reduce

THANK YOU