INTRODUCTION TO HADOOP & MAP-REDUCE
IBM Research ® © 2007 IBM Corporation

Transcript
Page 1

IBM Research ® © 2007 IBM Corporation

INTRODUCTION TO HADOOP & MAP-REDUCE

Page 2

IBM Research | India Research Lab

Outline

Map-Reduce Features: Combiner / Partitioner / Counter

Passing Configuration Parameters

Distributed-Cache

Hadoop I/O

Passing Custom Objects as Key-Values

Input and Output Formats: Introduction

Input/Output Formats provided by Hadoop

Writing Custom Input/Output-Formats

Miscellaneous: Chaining Map-Reduce Jobs

Compression

Hadoop Tuning and Optimization

Page 3

Combiner

A local reduce

Processes the output of each map function

Same signature as a reduce

Often reduces the number of intermediate key-value pairs

Page 4

Word-Count

Input (three map inputs):
    Split 1: Hadoop Map Map Reduce Hadoop Map Map
    Split 2: Map Map Reduce Map Hadoop
    Split 3: Key Key Value Value

Map output:
    Mapper 1: (Hadoop, 1) (Map, 1) (Map, 1) (Reduce, 1) (Hadoop, 1) (Map, 1) (Map, 1)
    Mapper 2: (Map, 1) (Map, 1) (Reduce, 1) (Map, 1) (Hadoop, 1)
    Mapper 3: (Key, 1) (Key, 1) (Value, 1) (Value, 1)

Sort/Shuffle (partitions A-I, J-Q, R-Z):
    (Hadoop, [1,1,1])
    (Key, [1,1])
    (Map, [1,1,1,1,1,1,1])
    (Reduce, [1,1])
    (Value, [1,1])

Reduce output:
    A-I: (Hadoop, 3)
    J-Q: (Key, 2) (Map, 7)
    R-Z: (Reduce, 2) (Value, 2)

Page 5

Word-Count

Input and map output are the same as on the previous page.

Combiner output (a local reduce at each mapper):
    Mapper 1: (Hadoop, [1,1]) (Map, [1,1,1,1]) (Reduce, [1]) -> (Hadoop, 2) (Map, 4) (Reduce, 1)
    Mapper 2: (Map, [1,1,1]) (Reduce, [1]) (Hadoop, [1]) -> (Map, 3) (Reduce, 1) (Hadoop, 1)
    Mapper 3: (Key, [1,1]) (Value, [1,1]) -> (Key, 2) (Value, 2)

Sort/Shuffle (partitions A-I, J-Q, R-Z):
    (Hadoop, [2,1])
    (Key, [2])
    (Map, [4,3])
    (Reduce, [1,1])
    (Value, [2])

Reduce output:
    A-I: (Hadoop, 3)
    J-Q: (Key, 2) (Map, 7)
    R-Z: (Reduce, 2) (Value, 2)

The combiner pre-aggregates each mapper's output, so far fewer intermediate key-value pairs are shuffled.
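The pair-count saving shown in this dataflow can be simulated outside Hadoop. The sketch below (plain Java; all class and method names are hypothetical, for illustration only) runs the word-count map locally over the same three splits, applies a combiner-style local aggregation per mapper, and counts the intermediate pairs each way.

```java
import java.util.*;

public class CombinerDemo {
    // Map phase: one (word, 1) pair per token.
    static List<String[]> mapPhase(String split) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : split.split(" "))
            pairs.add(new String[]{token, "1"});
        return pairs;
    }

    // Combiner: locally sum the pairs produced by a single mapper.
    static Map<String, Integer> combine(List<String[]> mapOutput) {
        Map<String, Integer> local = new TreeMap<>();
        for (String[] kv : mapOutput)
            local.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        return local;
    }

    // Intermediate pairs shuffled without the combiner: one per token.
    static int pairsWithoutCombiner(String... splits) {
        int n = 0;
        for (String s : splits) n += mapPhase(s).size();
        return n;
    }

    // With the combiner: one per distinct word per mapper.
    static int pairsWithCombiner(String... splits) {
        int n = 0;
        for (String s : splits) n += combine(mapPhase(s)).size();
        return n;
    }

    public static void main(String[] args) {
        String[] splits = {"Hadoop Map Map Reduce Hadoop Map Map",
                           "Map Map Reduce Map Hadoop",
                           "Key Key Value Value"};
        System.out.println("without combiner: " + pairsWithoutCombiner(splits)); // 16
        System.out.println("with combiner: " + pairsWithCombiner(splits));       // 8
    }
}
```

Sixteen pairs shrink to eight before the shuffle, which is exactly the saving a combiner buys on real clusters.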

Page 6

COMBINER

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Generic parameters: input key/value types, then output key/value types
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, new IntWritable(count(values)));
    }
}

Page 7

Word-Count Runner Class

public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(WordCountRunner.class);
        job.setMapperClass(WordCountMap.class);
        job.setCombinerClass(WordCountCombiner.class);
        job.setReducerClass(WordCountReduce.class);
        FileInputFormat.addInputPath(job, inputFilesPath);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }
}

Page 8

Counters

Page 9

Counters: Built-in Counters

Report metrics for various aspects of a job.

Task Counters
• Gather information about tasks over the course of a job
• Results are aggregated across all tasks
• e.g. MAP_INPUT_RECORDS, REDUCE_INPUT_GROUPS

FileSystem Counters
• BYTES_READ, BYTES_WRITTEN
• Bytes read/written by each file-system (HDFS, KFS, Local, S3, etc.)

FileInputFormat Counters
• BYTES_READ (bytes read through FileInputFormat)

FileOutputFormat Counters
• BYTES_WRITTEN (bytes written through FileOutputFormat)

Job Counters
• Maintained by the Job-Tracker
• e.g. TOTAL_LAUNCHED_MAPS, TOTAL_LAUNCHED_REDUCES

Page 10

User-Defined Counters

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    enum WCCounters {NOUNS, PRONOUNS, ADJECTIVES};

    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = Tokenize(line);
        for (int i = 0; i < tokens.length; i++) {
            if (isNoun(tokens[i]))
                context.getCounter(WCCounters.NOUNS).increment(1);
            else if (isProNoun(tokens[i]))
                context.getCounter(WCCounters.PRONOUNS).increment(1);
            else if (isAdjective(tokens[i]))
                context.getCounter(WCCounters.ADJECTIVES).increment(1);
            context.write(new Text(tokens[i]), new IntWritable(1));
        }
    }
}

Page 11

Retrieving the values of a Counter

Counters counters = job.getCounters();
Counter counter = counters.findCounter(WCCounters.NOUNS);
long value = counter.getValue();

Page 12

Output

13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.NOUNS=2342
13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.PRONOUNS=2124
13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.ADJECTIVES=1897
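The counter bookkeeping pattern itself can be sketched without Hadoop: a plain-Java enum plus an EnumMap reproduces the increment/lookup behaviour of context.getCounter and findCounter (class and method names below are hypothetical stand-ins, not Hadoop API).

```java
import java.util.EnumMap;

public class CounterDemo {
    enum WCCounters { NOUNS, PRONOUNS, ADJECTIVES }

    // One long tally per enum constant, like Hadoop's per-job counter group.
    private final EnumMap<WCCounters, Long> counters = new EnumMap<>(WCCounters.class);

    // Mirrors context.getCounter(WCCounters.X).increment(1)
    void increment(WCCounters c) {
        counters.merge(c, 1L, Long::sum);
    }

    // Mirrors counters.findCounter(WCCounters.X).getValue()
    long getValue(WCCounters c) {
        return counters.getOrDefault(c, 0L);
    }

    public static void main(String[] args) {
        CounterDemo demo = new CounterDemo();
        demo.increment(WCCounters.NOUNS);
        demo.increment(WCCounters.NOUNS);
        demo.increment(WCCounters.PRONOUNS);
        System.out.println("NOUNS=" + demo.getValue(WCCounters.NOUNS)); // NOUNS=2
    }
}
```

In Hadoop the framework additionally aggregates these tallies across all task attempts before the job-level values above are reported.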

Page 13

Partitioner

Map keys to reducers/partitions

Determines which reducer receives a certain key

Identical keys produced by different map functions must map to same partition/reducer

If n reducers are used, then n partitions are filled. The number of reducers is set by the call setNumReduceTasks.

Hadoop uses HashPartitioner as default partitioner
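The default HashPartitioner assigns a key to a partition from its hash code; its logic reduces to the one-line expression below, sketched here in plain Java outside the Hadoop API (class and method names are hypothetical).

```java
public class HashPartitionDemo {
    // Same arithmetic HashPartitioner uses: mask off the sign bit, then mod.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Identical keys always land in the same partition,
        // regardless of which mapper produced them.
        System.out.println(partitionFor("Hadoop", 3) == partitionFor("Hadoop", 3)); // true
    }
}
```

This determinism is what guarantees the requirement above: identical keys produced by different map functions reach the same reducer.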

Page 14

Defining a Custom Partitioner

Implement a class which extends the Partitioner class

Partitioning impacts load-balancing aspect of a map-reduce program

Word-Count: Many words starting with vowels

Words starting with different characters are sent to different reducers

For words starting with vowels, the second character may be taken into account
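One possible routing rule matching these bullets is sketched below in plain Java (hypothetical names; a real Hadoop partitioner would wrap this logic in a class extending Partitioner and return it from getPartition):

```java
public class VowelAwarePartition {
    private static final String VOWELS = "aeiou";

    // Route by first character; for words starting with a vowel,
    // also take the second character into account to spread the load.
    static int partitionFor(String word, int numPartitions) {
        String w = word.toLowerCase();
        int h = w.charAt(0);
        if (VOWELS.indexOf(w.charAt(0)) >= 0 && w.length() > 1)
            h = 31 * h + w.charAt(1);
        return (h & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Vowel-initial words are spread by their second character instead of
        // all hashing to the bucket of their shared first letter.
        System.out.println(partitionFor("apple", 8));
        System.out.println(partitionFor("actor", 8));
    }
}
```

The point of the extra rule is load balancing: if many words start with vowels, partitioning on the first character alone would overload a few reducers.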

Page 15

Word-Count Runner Class

public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(WordCountRunner.class);
        job.setMapperClass(WordCountMap.class);
        job.setCombinerClass(WordCountCombiner.class);
        job.setReducerClass(WordCountReduce.class);
        job.setPartitionerClass(WordCountPartitioner.class);
        FileInputFormat.addInputPath(job, inputFilesPath);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }
}

Page 16

Passing Configuration Parameters

Map-Reduce jobs may require certain input parameters

One may want to avoid counting words starting with certain prefixes

Prefixes can be set in the configuration

Page 17

Word-Count Runner Class

public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        Configuration conf = job.getConfiguration();
        conf.set("PrefixesToAvoid", "abs bts bnm swe");
        ...
        job.waitForCompletion(true);
    }
}

Page 18

Word-Count Map

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    private String[] prefixesToAvoid;

    public void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        String prefixes = conf.get("PrefixesToAvoid");
        this.prefixesToAvoid = prefixes.split(" ");
    }

    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = Tokenize(line);
        for (int i = 0; i < tokens.length; i++) {
            context.write(new Text(tokens[i]), new IntWritable(1));
        }
    }
}

Page 19

Distributed Cache

A file may need to be broadcast to each map node, for example a dictionary in a spell-check. Such file names can be added to a distributed cache; Hadoop copies files added to the cache to all map nodes.

Step 1: Put the file into HDFS
    hdfs dfs -put /tmp/file1 /cachefile1

Step 2: Add the cache file in the job configuration
    Configuration conf = job.getConfiguration();
    DistributedCache.addCacheFile(new URI("/cachefile1"), conf);

Step 3: Access the cache file locally at each map
    Path[] cacheFiles = context.getLocalCacheFiles();
    FileInputStream fInputStream = new FileInputStream(cacheFiles[0].toString());

Page 20

Hadoop I/O : Reading an HDFS File

// Get FileSystem instance
FileSystem fs = FileSystem.get(conf);

// Get file stream
Path infile = new Path(filePath);
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(infile)));

// Read file line by line
StringBuilder fileContent = new StringBuilder();
String line = br.readLine();
while (line != null) {
    fileContent.append(line).append("\n");
    line = br.readLine();
}

Page 21

Hadoop I/O : Writing to an HDFS file

// Get FileSystem instance
FileSystem fs = FileSystem.get(conf);

// Get file stream
Path path = new Path(filePath);
FSDataOutputStream outputStream = fs.create(path);

// Write to file
byte[] bytes = content.getBytes();
outputStream.write(bytes, 0, bytes.length);
outputStream.close();
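The same create/write/close and open/read flow can be exercised against the local filesystem with java.nio, as a stand-in for the HDFS FileSystem calls above (the class name is hypothetical; paths here are temporary local files, not HDFS paths):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalIoDemo {
    // Write content to a file and read it back, mirroring the
    // HDFS create/write/close and open/read sequence above.
    static String roundTrip(String content) {
        try {
            Path path = Files.createTempFile("hdfs-demo", ".txt");
            Files.write(path, content.getBytes());               // like outputStream.write(...)
            String readBack = new String(Files.readAllBytes(path)); // like reading via fs.open(...)
            Files.deleteIfExists(path);
            return readBack;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("10 20 1 10 99F")); // prints the same line back
    }
}
```

The HDFS API deliberately mirrors this stream-based shape, which is why the slide's code looks so close to ordinary java.io usage.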

Page 22

Hadoop I/O : Getting the File being Processed

A map-reduce job may need to process multiple files

The functionality of a map may depend upon which file is being processed

FileSplit fileSplit = (FileSplit) context.getInputSplit();

String filename = fileSplit.getPath().getName();

Page 23

Custom Objects as Key-Values

Keys and values passed from map functions to reducers are typically of built-in types: IntWritable, DoubleWritable, LongWritable, Text, ArrayWritable.

Passing keys and values of custom classes may be desirable.

Objects that are passed around must implement certain interfaces:
Writable for passing as values

WritableComparable for passing as keys

Page 24

Example Use-Case

Consider weather data: temperature and pressure values at different latitude-longitude-elevation-timestamp quadruples. The data is hence 4-dimensional, with temperature and pressure data in separate files.

File format: latitude, longitude, elevation, timestamp, temperature-value
• Ex: 10 20 10 1 99F
      10 21 10 2 98F
• Similarly for pressure. Ex: 10 20 10 1 101kPa

We want to read the two data files and combine the data
• Ex: 10 20 10 1 99F 101kPa

Let class STPoint represent the coordinates:

class STPoint {
    double latitude, longitude, elevation;
    long timestamp;
}

Page 25

Map to Reduce Flow

Temperature file: 10 20 1 10 99F / 10 21 1 10 98F
Pressure file: 10 20 1 10 101kPa / 10 21 1 10 109kPa

MAP emits (Text key, DoubleWritable value) pairs:
    ("10 20 1 10", 99F)
    ("10 20 1 10", 101kPa)

REDUCE combines the values for each key:
    (10 20 1 10, 99F, 101kPa)

Page 26

Map to Reduce Flow

Temperature file: 10 20 1 10 99F / 10 21 1 10 98F
Pressure file: 10 20 1 10 101kPa / 10 21 1 10 109kPa

MAP emits (STPoint key, DoubleWritable value) pairs:
    (STPoint(10 20 1 10), 99F)
    (STPoint(10 20 1 10), 101kPa)

REDUCE combines the values for each key:
    (STPoint(10, 20, 1, 10), 99F, 101kPa)

Page 27

Map

public class MyMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    // Generic parameters: map input key/value types, then map output key/value types
    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().split(" ");
        double latitude = Double.parseDouble(tokens[0]);
        double longitude = Double.parseDouble(tokens[1]);
        double elevation = Double.parseDouble(tokens[2]);
        long timestamp = Long.parseLong(tokens[3]);
        double attrVal = Double.parseDouble(tokens[4]);

        String keyString = latitude + " " + longitude + " " + elevation + " " + timestamp;
        context.write(new Text(keyString), new DoubleWritable(attrVal));
    }
}

Page 28

New Map

public class MyMap extends Mapper<LongWritable, Text, STPoint, DoubleWritable> {
    // Map output key is now STPoint, output value is DoubleWritable
    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().split(" ");
        double latitude = Double.parseDouble(tokens[0]);
        double longitude = Double.parseDouble(tokens[1]);
        double elevation = Double.parseDouble(tokens[2]);
        long timestamp = Long.parseLong(tokens[3]);
        double attrVal = Double.parseDouble(tokens[4]);

        STPoint stpoint = new STPoint(latitude, longitude, elevation, timestamp);
        context.write(stpoint, new DoubleWritable(attrVal));
    }
}

More intuitive and human readable; reduces processing on the reduce side.

Page 29

New Reduce

public class DataReadReduce extends Reducer<STPoint, DoubleWritable, Text, DoubleWritable> {
    // Generic parameters: input key/value types, then output key/value types
    public void reduce(STPoint key, Iterable<DoubleWritable> values, Context context) {
    }
}

Page 30

Passing Custom Objects as Key-Values

Key-value pairs are written to local disk by map functions: the user must specify how to write a custom object.

Key-value pairs are read by reducers from local disk: the user must specify how to read a custom object.

Keys are sorted and compared: the user must specify how to compare two keys.

Page 31

WritableComparable Interface

Three Methods public void readFields(DataInput in) {}

public void write(DataOutput out) {}

public int compareTo(Object other) {}

Objects that are passed as keys must implement WritableComparable interface.

Objects that are passed as values must implement Writable Interface Writable interface does not have compareTo method

Only keys are compared and not values and hence compareTo method not required for objects being passed only as keys.

Page 32

Implementing WritableComparable for STPoint

public void readFields(DataInput in) throws IOException {
    this.latitude = in.readDouble();
    this.longitude = in.readDouble();
    this.elevation = in.readDouble();
    this.timestamp = in.readLong();
}

public void write(DataOutput out) throws IOException {
    out.writeDouble(this.latitude);
    out.writeDouble(this.longitude);
    out.writeDouble(this.elevation);
    out.writeLong(this.timestamp);
}

public int compareTo(STPoint other) {
    return this.toString().compareTo(other.toString());
}
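A write/readFields pair like this can be checked with plain java.io streams, no Hadoop needed. The sketch below (class name hypothetical; fields mirror the slide's STPoint) serializes a point and reads it back, relying on the same contract Writable serialization depends on: write and readFields must touch the fields in the same order.

```java
import java.io.*;

public class STPointDemo {
    double latitude, longitude, elevation;
    long timestamp;

    // Same field order in write and readFields -- this ordering contract
    // is what makes Writable-style serialization work.
    void write(DataOutput out) throws IOException {
        out.writeDouble(latitude);
        out.writeDouble(longitude);
        out.writeDouble(elevation);
        out.writeLong(timestamp);
    }

    void readFields(DataInput in) throws IOException {
        latitude = in.readDouble();
        longitude = in.readDouble();
        elevation = in.readDouble();
        timestamp = in.readLong();
    }

    // Serialize p to bytes and deserialize into a fresh object.
    static STPointDemo roundTrip(STPointDemo p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            STPointDemo q = new STPointDemo();
            q.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return q;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        STPointDemo p = new STPointDemo();
        p.latitude = 10; p.longitude = 20; p.elevation = 1; p.timestamp = 10;
        STPointDemo q = roundTrip(p);
        System.out.println(q.latitude + " " + q.timestamp); // 10.0 10
    }
}
```

Hadoop applies exactly this round trip when map output is spilled to local disk and later read by reducers.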

Page 33

InputFormat and OutputFormat

InputFormat Defines how to read data from file and feed it to the map functions

OutputFormat Defines how to write data on to a file

Hadoop provides various Input and Output Formats

A user can also implement custom input and output formats

Defining custom input and output formats is a very useful feature of map-reduce

Page 34

Input-Format

Defines how to read data from file and feed it to the map functions

How to define Splits? getSplits()

How to define Record? getRecordReader()

Hadoop provides various Input and Output Formats

A user can also implement custom input and output formats

Defining custom input and output formats is a very useful feature of map-reduce

Page 35

Split

A sample data-set with columns A, B, C and rows R1-R20 (e.g. R1 = 1 2 3, R2 = 2 3 5, ..., R20 = 3 3 3) is stored in four 64 MB blocks. By default, each block becomes one split, and each split is processed by one map task:

    Block 1 (64 MB) -> Split 1 -> MAP-1
    Block 2 (64 MB) -> Split 2 -> MAP-2
    Block 3 (64 MB) -> Split 3 -> MAP-3
    Block 4 (64 MB) -> Split 4 -> MAP-4

Page 36

Split

The same data, with the four 64 MB blocks grouped into two splits:

    Split 1 -> MAP-1
    Split 2 -> MAP-2

Page 37

Split

The same data again grouped into two splits:

    Split 1 -> MAP-1
    Split 2 -> MAP-2

Page 38

Split

The same data grouped into three splits:

    Split 1 -> MAP-1
    Split 2 -> MAP-2
    Split 3 -> MAP-3

How the blocks are grouped into splits is up to getSplits(); the number of splits determines the number of map tasks.

Page 39

Record-Reader

R1 1 2 3
R2 2 3 5
R3 2 4 6
R4 6 4 2
R5 1 3 6

All records are fed to the map task one by one.

Page 40

Record-Reader

Record 1: R1 1 2 3, R2 2 3 5
Record 2: R3 2 4 6, R4 6 4 2
Record 3: R5 1 3 6

There are three records now.

Page 41

Record-Reader

Record 1: R1 1 2 3, R5 1 3 6
Record 2: R2 2 3 5, R3 2 4 6
Record 3: R4 6 4 2

All the tuples with identical values in column 1 are bunched in the same record.
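The grouping a record reader like this performs can be sketched as a plain-Java pass over the rows (class and method names hypothetical; row format assumed to be space-separated as on the slide):

```java
import java.util.*;

public class GroupingDemo {
    // Bunch all rows with the same value in column 1 into one record,
    // as the custom record reader sketched on the slide does.
    static Map<String, List<String>> groupByFirstColumn(List<String> rows) {
        Map<String, List<String>> records = new LinkedHashMap<>();
        for (String row : rows) {
            String col1 = row.split(" ")[0];
            records.computeIfAbsent(col1, k -> new ArrayList<>()).add(row);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1 2 3", "2 3 5", "2 4 6", "6 4 2", "1 3 6");
        Map<String, List<String>> records = groupByFirstColumn(rows);
        System.out.println(records.size()); // 3 records, keyed 1, 2, 6
    }
}
```

In a real input format this logic would live inside nextKeyValue(), emitting one grouped record to the map per call instead of one raw line.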

Page 42

TextInputFormat

Default input format. Key is the byte offset, value is the line content. Suitable for reading raw text files.

Input file:
    10 20 1 10 99F
    10 21 1 10 98F

TEXTINPUTFORMAT feeds the map with (offset, line-as-string) pairs:
    (0, "10 20 1 10 99F")
    (10, "10 21 1 10 98F")

Page 43

KeyValueInputFormat

Input data is in the form key \t value: anything before the tab is the key, anything after the tab is the value. Input that is not in the correct format results in an error.

Input file:
    10 20 1 10 \t 99F
    10 21 1 10 \t 98F

KEYVALUEINPUTFORMAT feeds the map with the content before the tab as the key and the content after it as the value:
    ("10 20 1 10", "99F")
    ("10 21 1 10", "98F")

Page 44

SequenceFileInputFormat

A Hadoop-specific, high-performance binary input format. Key and value types are user-defined.

SEQUENCEFILEINPUTFORMAT reads a binary file and feeds the map with user-defined key-value pairs, e.g.:
    ("10 20 1 10", "99F")
    ("10 21 1 10", "98F")

Page 45

OutputFormats

TextOutputFormat
• Default output format
• Writes data in key \t value format
• This output can be read subsequently by KeyValueInputFormat

SequenceFileOutputFormat
• Writes binary files suitable for reading into subsequent MR jobs
• Keys and values are user-defined

Page 46

Text Input and Output Format

Reduce output:
    ("10 20 1 10", "99F")
    ("10 21 1 10", "98F")

TEXTOUTPUTFORMAT writes:
    10 20 1 10 \t 99F
    10 21 1 10 \t 98F

Read back with TEXTINPUTFORMAT:
    (0, "10 20 1 10 \t 99F")
    (10, "10 21 1 10 \t 98F")

Read back with KEYVALUEINPUTFORMAT:
    ("10 20 1 10", "99F")
    ("10 21 1 10", "98F")

Page 47

Custom Input Formats

Allows a user control over how to read data and subsequently feed it to the map functions

Advisable to implement custom input formats for specific use-cases

Simplifies the process of implementing map-reduce algorithms

Page 48

CustomInputFormat

Input file:
    10 20 1 10 99F
    10 21 1 10 98F

MYINPUTFORMAT feeds the map with the key already parsed as an STPoint:
    (STPoint(10 20 1 10), 99F)
    (STPoint(10 21 1 10), 98F)

Page 49

Map

public class MyMap extends Mapper<LongWritable, Text, STPoint, DoubleWritable> {
    // Map output key is STPoint, output value is DoubleWritable
    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().split(" ");
        double latitude = Double.parseDouble(tokens[0]);
        double longitude = Double.parseDouble(tokens[1]);
        double elevation = Double.parseDouble(tokens[2]);
        long timestamp = Long.parseLong(tokens[3]);
        double attrVal = Double.parseDouble(tokens[4]);

        STPoint stpoint = new STPoint(latitude, longitude, elevation, timestamp);
        context.write(stpoint, new DoubleWritable(attrVal));
    }
}

Page 50

New Map With Custom Input Format

class MyMap extends Mapper<STPoint, DoubleWritable, STPoint, DoubleWritable> {
    // Map input key/value and output key/value are now STPoint / DoubleWritable
    public void map(STPoint point, DoubleWritable attrValue, Context context)
            throws IOException, InterruptedException {
        context.write(point, attrValue);
    }
}

More intuitive and human readable: the parsing has moved out of the map and into the input format.

Page 51

Specifying Input and Output Format

public class MyRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(MyRunner.class);
        job.setMapperClass(MyMap.class);
        job.setCombinerClass(MyCombiner.class);
        job.setReducerClass(MyReduce.class);
        job.setPartitionerClass(MyPartitioner.class);
        job.setInputFormatClass(MyInputFormat.class);
        job.setOutputFormatClass(MyOutputFormat.class);
        FileInputFormat.addInputPath(job, inputFilesPath);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }
}

Page 52

Implementing a Custom Input Format

Specify how to split the data:
• Data splitting is handled by the class FileInputFormat
• A custom input format can extend this class

RecordReader:
• Reads the data in each split, parses it, and passes it to the map
• An iterator over the input data

Page 53

Page 54

Custom Input Format

public class MyInputFormat extends FileInputFormat<STPoint, DoubleWritable> {
    public RecordReader<STPoint, DoubleWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new MyRecordReader();
    }
}

Page 55

Custom Record-Reader

Page 56

Custom Record-Reader

public class MyRecordReader extends RecordReader<STPoint, DoubleWritable> {

    private STPoint point;
    private DoubleWritable attrVal;
    private LineRecordReader lineRecordReader;

    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineRecordReader = new LineRecordReader();
        lineRecordReader.initialize(split, context);
    }

    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!lineRecordReader.nextKeyValue()) {
            this.point = null;
            this.attrVal = null;
            return false;
        }
        String lineString = lineRecordReader.getCurrentValue().toString();
        this.point = getSTPoint(lineString);
        this.attrVal = getAttributeValue(lineString);
        return true;
    }

    public STPoint getCurrentKey() { return this.point; }
    public DoubleWritable getCurrentValue() { return this.attrVal; }
}
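The record reader relies on two elided helpers, getSTPoint and getAttributeValue. One plausible implementation of the parsing they would do, sketched in plain Java for the slide's "lat lon elev ts value" line format (class and method names hypothetical; the unit suffix such as "F" or "kPa" is simply stripped):

```java
public class LineParserDemo {
    // Parse the four coordinate fields of a "lat lon elev ts value" line.
    static double[] getCoordinates(String line) {
        String[] t = line.split(" ");
        return new double[]{Double.parseDouble(t[0]), Double.parseDouble(t[1]),
                            Double.parseDouble(t[2]), Double.parseDouble(t[3])};
    }

    // Parse the attribute value, dropping a trailing unit such as "F" or "kPa".
    static double getAttributeValue(String line) {
        String[] t = line.split(" ");
        return Double.parseDouble(t[4].replaceAll("[^0-9.+-]", ""));
    }

    public static void main(String[] args) {
        String line = "10 20 1 10 99F";
        double[] c = getCoordinates(line);
        System.out.println(c[0] + " " + getAttributeValue(line)); // 10.0 99.0
    }
}
```

In the real record reader, getSTPoint would wrap the four parsed coordinates in an STPoint and getAttributeValue would wrap the value in a DoubleWritable.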

Page 57

Chaining Map-Reduce Jobs

Simple tasks may be completed by a single map and reduce.

Complex tasks will require multiple map and reduce cycles.

Multiple map and reduce cycles need to be chained together:

Chaining multiple jobs in a sequence

Chaining multiple jobs in complex dependency

Chaining multiple maps in a sequence

Page 58

Chaining Map-Reduce Jobs In Sequence

Most commonly done. The output of a reducer is the input to the map functions of the next cycle; a new job starts only after the prior job has finished.

INPUT -> MAP JOB 1 -> REDUCE JOB 1 -> MAP JOB 2 -> REDUCE JOB 2 -> MAP JOB 3 -> REDUCE JOB 3 -> OUTPUT

Page 59

Chaining Multiple Jobs In Sequence

Job job1 = new Job();
job1.setInputPath(inputPath1);
job1.setOutputPath(outputPath1);
// set all other parameters
job1.setMapperClass(...);
...
job1.waitForCompletion(true);

Job job2 = new Job();
job2.setInputPath(outputPath1);
job2.setOutputPath(outputPath2);
// set all other parameters
job2.setMapperClass(...);
job2.waitForCompletion(true);

Page 60

Chaining Multiple Jobs in Complex Dependency

Chaining in sequence assumes that the jobs are dependent in a chain fashion.

That may not be so.

Example: Job1 may process a data-set of certain type, Job2 may process a data-set of another type and Job3 combines the results of Job1 and Job2

Job1 and Job2 are independent while Job3 is dependent on Job1 and Job2

Running Jobs in sequence Job1, Job2 and then Job3 may not be ideal.

Page 61

Chaining Multiple Jobs in Complex Dependency

Use addDependingJob to specify the dependencies:
    job3.addDependingJob(job1);
    job3.addDependingJob(job2);

Define a JobControl object:
    JobControl jc = new JobControl();
    jc.addJob(job1);
    jc.addJob(job2);
    jc.addJob(job3);
    jc.run();

Page 62

Chaining Multiple Maps In A Sequence

Multiple Map tasks can also be chained in a sequence followed by a reducer.

Avoids development of large map methods

Avoids multiple MR jobs with additional IO overheads

Greater ease of development and code re-use.

Use ChainMapper API

Page 63

Chaining Multiple Maps In A Sequence

Page 64

Chaining Multiple Maps In A Sequence

Job chainJob = new Job(conf);
ChainMapper.addMapper(chainJob, Map-1 Info);
ChainMapper.addMapper(chainJob, Map-2 Info);
ChainMapper.addMapper(chainJob, Map-3 Info);
chainJob.setReducer(Reducer-Info);
chainJob.waitForCompletion(true);

Page 65

Compression

MR jobs produce large output

The output of an MR job can be compressed.

Saves a lot of space.

Need to ensure that the compression algorithm used is one that produces splittable files

bzip2 is one such compression algorithm

If a compression algorithm does not produce splittable files, the output will not be split and a single map will process the whole data in a subsequent job.

gzip output is not splittable.

Page 66

Compressing the Output

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

Page 67

Hadoop Tuning and Optimization

A number of parameters may impact the performance of a job:

Whether to compress output or not

Number of Reduce Tasks

Block Size (64 MB or 128 MB or 256 MB etc)

Speculative Execution or Not

Buffer Size for Sorting

Temporary Space Allocation

Many more such parameters

Tuning these parameters is not an exact science

Some recommendations have been developed for how to set these parameters

Page 68

Compression

mapred.compress.map.output

Default: false

Pros: faster disk writes, lower disk space usage, less time spent on data transfer.

Cons: overhead in compression and decompression.

Recommendation: for large jobs and large clusters, compress.

Page 69

Speculative Execution

mapred.map/reduce.tasks.speculative.execution

Default: true

Pros: reduces the job time if task progress is slow due to memory unavailability or hardware degradation.

Cons: increases the job time if task progress is slow due to complex and large calculations.

Recommendation: set it to false in case of high average task duration (> 1 hr) due to complex and large calculations.

Page 70

Block Size

dfs.block.size

Default: 64 MB

Recommendations: for a small cluster and large data-sets, many map tasks will be needed.
• Data size 160 GB and block size 64 MB => # splits = 2560
• Data size 160 GB and block size 128 MB => # splits = 1280
• Data size 160 GB and block size 256 MB => # splits = 640

In small clusters (6-10 nodes), the map-task creation overhead is significant, so the block size should be large, yet small enough to utilize all resources. Block size should be set according to the size of the cluster, the map-task capacity, and the average size of the input files.
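The split counts quoted above follow directly from dataSize / blockSize; a quick check of the arithmetic (assuming the data size divides evenly by the block size):

```java
public class SplitCountDemo {
    // Number of splits when a file of dataSizeMB is divided into
    // blockSizeMB blocks (even division assumed).
    static long splits(long dataSizeMB, long blockSizeMB) {
        return dataSizeMB / blockSizeMB;
    }

    public static void main(String[] args) {
        long dataMB = 160L * 1024; // 160 GB expressed in MB
        System.out.println(splits(dataMB, 64));  // 2560
        System.out.println(splits(dataMB, 128)); // 1280
        System.out.println(splits(dataMB, 256)); // 640
    }
}
```

Doubling the block size halves the number of splits, and with it the number of map tasks the cluster must launch.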

Page 71

References

Hadoop: The Definitive Guide. O'Reilly Press.

Pro Hadoop: Build scalable, distributed applications in the cloud.

Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/

www.slideshare.net