Transcript
Page 1: Introduction to MapReduce and Hadoop
Page 2: Introduction to MapReduce and Hadoop

What to expect in this session

● History.
● What is Hadoop?
● Hadoop vs SQL.
● MapReduce.
● Hadoop Building Blocks.
● Installing, Configuring, and Running Hadoop.
● Anatomy of a MapReduce program.

Page 3: Introduction to MapReduce and Hadoop

Hadoop Series Resources

Page 4: Introduction to MapReduce and Hadoop
Page 5: Introduction to MapReduce and Hadoop

How was Hadoop born?

Doug Cutting

Page 6: Introduction to MapReduce and Hadoop

Challenges of Distributed Processing of Large Data

● How to distribute the work?
● How to store and distribute the data itself?
● How to overcome failures?
● How to balance the load?
● How to deal with unstructured data?
● ...

Page 7: Introduction to MapReduce and Hadoop

Hadoop tackles these challenges!

So, what’s Hadoop?

Page 8: Introduction to MapReduce and Hadoop

What is Hadoop?

Hadoop is an open source framework for writing and running distributed applications that process large amounts of data.

Key distinctions of Hadoop:
● Accessible
● Robust
● Scalable
● Simple

Page 9: Introduction to MapReduce and Hadoop

Hadoop vs SQL

● Structured vs. unstructured data.
● Datastore vs. data analysis.
● Scale-out vs. scale-up.
● Offline batch processing vs. online transactions.

Page 10: Introduction to MapReduce and Hadoop

Hadoop Uses MapReduce

What is MapReduce?...

Page 11: Introduction to MapReduce and Hadoop

What is MapReduce?

● Parallel programming model for clusters of commodity machines.

● MapReduce provides:
  o Automatic parallelization & distribution.
  o Fault tolerance.
  o Locality of data.

Page 12: Introduction to MapReduce and Hadoop

MapReduce … Map then Reduce

Page 13: Introduction to MapReduce and Hadoop

Keys and Values

● Key/Value pairs.

● Keys divide the Reduce space.

         Input             Output
Map      <k1, v1>          list(<k2, v2>)
Reduce   <k2, list(v2)>    list(<k3, v3>)

Page 14: Introduction to MapReduce and Hadoop

WordCount in Action

Input:

foo.txt: “This is the foo file”

bar.txt: “And this is the bar one”

Mapper(s) output:
<this, 1>, <is, 1>, <the, 1>, <foo, 1>, <file, 1>, <and, 1>, <this, 1>, <is, 1>, <the, 1>, <bar, 1>, <one, 1>

Reduce#1: Input: this, [1, 1]   Output: this, 2
Reduce#2: Input: is, [1, 1]     Output: is, 2
Reduce#3: Input: foo, [1]       Output: foo, 1
...

Final output:
this 2
is 2
the 2
foo 1
file 1
and 1
bar 1
one 1
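The flow above can be simulated in plain Java with no Hadoop installation: map emits a <token, 1> pair per word, the framework groups pairs by key, and reduce sums each key's list. The class and method names here are illustrative, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the WordCount data flow (not Hadoop API):
// map -> <token, 1> pairs; reduce -> group by key and sum.
public class WordCountFlow {

    // "map": emit a <token, 1> pair for every token in the document.
    static List<String[]> map(String document) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : document.toLowerCase().split("\\s+")) {
            pairs.add(new String[] { token, "1" });
        }
        return pairs;
    }

    // "shuffle" + "reduce": group pairs by key, then sum each group.
    static Map<String, Integer> reduce(List<String[]> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] pair : pairs) {
            counts.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>(map("This is the foo file"));
        pairs.addAll(map("And this is the bar one"));
        // Prints each word with its total count, matching the slide's output.
        System.out.println(reduce(pairs));
    }
}
```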

Page 15: Introduction to MapReduce and Hadoop

WordCount with MapReduce

map(String filename, String document) {
  List<String> T = tokenize(document);
  for each token in T {
    emit((String) token, (Integer) 1);
  }
}

reduce(String token, List<Integer> values) {
  Integer sum = 0;
  for each value in values {
    sum = sum + value;
  }
  emit((String) token, (Integer) sum);
}

Page 16: Introduction to MapReduce and Hadoop

Hadoop Building Blocks

How does Hadoop work?...

Page 17: Introduction to MapReduce and Hadoop

Hadoop Building Blocks

1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker

Page 18: Introduction to MapReduce and Hadoop

HDFS: NameNode and DataNodes

Page 19: Introduction to MapReduce and Hadoop

JobTracker and TaskTracker

Page 20: Introduction to MapReduce and Hadoop

Typical Hadoop Cluster

Page 21: Introduction to MapReduce and Hadoop

Running Hadoop

Three modes to run Hadoop:
1. Local (standalone) mode.
2. Pseudo-distributed mode ("cluster of one").
3. Fully distributed mode.
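As a sketch of what mode 2 looks like on disk: a minimal pseudo-distributed setup typically points the default filesystem at a local HDFS daemon and drops block replication to 1, since there is only one node. Property names vary by release (older releases used fs.default.name instead of fs.defaultFS), so treat this as an illustrative fragment, not a drop-in config.

```xml
<!-- core-site.xml: point the default filesystem at the local HDFS daemon -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: one node, so keep a single replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```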

Page 22: Introduction to MapReduce and Hadoop

In Action: Running Hadoop on a Local Machine

Page 23: Introduction to MapReduce and Hadoop

Actions ...

1. Installing Hadoop.
2. Configuring Hadoop (pseudo-distributed mode).
3. Running the WordCount example.
4. Web-based cluster UI.

Page 24: Introduction to MapReduce and Hadoop

HDFS

1. HDFS is a filesystem designed for large-scale distributed data processing.

2. HDFS isn’t a native Unix filesystem.

Basic File Commands:
$ hadoop fs -cmd <args>
$ hadoop fs -ls
$ hadoop fs -mkdir /user/chuck
$ hadoop fs -copyFromLocal

Page 25: Introduction to MapReduce and Hadoop

Anatomy of a MapReduce program

MapReduce and beyond

Page 26: Introduction to MapReduce and Hadoop

Hadoop

1. Data Types
2. Mapper
3. Reducer
4. Partitioner
5. Combiner
6. Reading and Writing
   a. InputFormat
   b. OutputFormat

Page 27: Introduction to MapReduce and Hadoop

Anatomy of a MapReduce program

Page 28: Introduction to MapReduce and Hadoop

Hadoop Data Types

● Hadoop has its own defined way of serializing key/value pairs.
● Values should implement the Writable interface.
● Keys should implement the WritableComparable interface.
● Some predefined classes:
  o BooleanWritable
  o ByteWritable
  o IntWritable
  o ...

Page 29: Introduction to MapReduce and Hadoop

Mapper

Page 30: Introduction to MapReduce and Hadoop

Mapper

1. Extend Mapper<K1, V1, K2, V2>.
2. Override the method:
   void map(K1 key, V1 value, Context context)
3. Use context.write(K2, V2) to emit key/value pairs.

Page 31: Introduction to MapReduce and Hadoop

WordCount Mapper

public static class Map
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

Page 32: Introduction to MapReduce and Hadoop

Predefined Mappers

Page 33: Introduction to MapReduce and Hadoop

Reducer

Page 34: Introduction to MapReduce and Hadoop

Reducer

1. Extend Reducer<K2, V2, K3, V3>.
2. Override the method:
   void reduce(K2 key, Iterable<V2> values, Context context)
3. Use context.write(K3, V3) to emit key/value pairs.

Page 35: Introduction to MapReduce and Hadoop

WordCount Reducer

public static class Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Page 36: Introduction to MapReduce and Hadoop

Predefined Reducers

Page 37: Introduction to MapReduce and Hadoop

Partitioner

Page 38: Introduction to MapReduce and Hadoop

Partitioner

The partitioner decides which reducer each key goes to.

class WordSizePartitioner
    extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text word, IntWritable count, int numPartitions) {
    return 0;  // placeholder: sends every key to partition 0
  }
}
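The slide's WordSizePartitioner returns 0, so every key lands in one partition. A plain-Java sketch of what an actual "word size" rule might compute (the length-modulo policy is an assumption for illustration, not Hadoop API):

```java
// Illustrative partitioning rule: route words to reducers by word length.
// This is what a real getPartition() body might return; plain Java, no Hadoop.
public class WordSizePartitionDemo {

    static int getPartition(String word, int numPartitions) {
        // Words of the same length always reach the same reducer.
        return word.length() % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        for (String word : new String[] { "the", "file", "a" }) {
            System.out.println(word + " -> partition "
                + getPartition(word, numPartitions));
        }
    }
}
```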

Page 39: Introduction to MapReduce and Hadoop

Combiner

Page 40: Introduction to MapReduce and Hadoop

Combiner

It's a local reduce task that runs on the mapper's output.

WordCount Mapper output:
1. Without combiner: <the, 1>, <file, 1>, <the, 1>, ...
2. With combiner: <the, 2>, <file, 2>, ...
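Why this helps can be shown in plain Java (class and method names are illustrative): combining on the mapper side collapses duplicate keys before any pair crosses the network, so fewer pairs are shipped to the reducers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the combiner's effect: a local reduce on one mapper's output.
public class CombinerDemo {

    // Raw mapper output: one <word, 1> pair per token.
    static List<String[]> mapperOutput(String doc) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : doc.split("\\s+")) {
            pairs.add(new String[] { token, "1" });
        }
        return pairs;
    }

    // Combiner: sum counts per key locally, before the shuffle.
    static Map<String, Integer> combine(List<String[]> pairs) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            combined.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String[]> raw = mapperOutput("the file the file");
        System.out.println("Without combiner: " + raw.size() + " pairs shipped");
        System.out.println("With combiner:    " + combine(raw).size() + " pairs shipped");
    }
}
```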

Page 41: Introduction to MapReduce and Hadoop

Reading and Writing

Page 42: Introduction to MapReduce and Hadoop

Reading and Writing

1. Input data usually resides in large files.
2. MapReduce's processing power comes from splitting the input data into chunks (InputSplits).
3. Hadoop's FileSystem provides the class FSDataInputStream for file reading. It extends DataInputStream with random read access.

Page 43: Introduction to MapReduce and Hadoop

InputFormat Classes

● TextInputFormat
  o <offset, line>

● KeyValueTextInputFormat
  o key\tvalue => <key, value>

● NLineInputFormat
  o <offset, nLines>

You can define your own InputFormat class ...
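The key\tvalue split done by KeyValueTextInputFormat (whose default separator is a tab, and which splits on the first occurrence only) can be sketched in plain Java; the parse helper below is illustrative, not Hadoop API:

```java
// Sketch of how a key\tvalue line is interpreted: everything before the
// FIRST tab is the key, the rest of the line is the value.
public class KeyValueLineDemo {

    static String[] parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No separator: the whole line becomes the key, value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("user42\tclicked\tbutton");
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```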

Page 44: Introduction to MapReduce and Hadoop

OutputFormat

1. The output has no splits.
2. Each reducer generates an output file named part-nnnnn, where nnnnn is the partition ID of the reducer.

Predefined OutputFormat classes:
> TextOutputFormat: <k, v> => k\tv
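The part-nnnnn convention is simply the partition ID zero-padded to five digits, as this small sketch shows (newer MapReduce releases also insert the task type into the name, e.g. part-r-00000):

```java
// Reproduces the classic reducer output file naming: partition ID
// zero-padded to five digits.
public class PartFileName {

    static String partFileName(int partitionId) {
        return String.format("part-%05d", partitionId);
    }

    public static void main(String[] args) {
        System.out.println(partFileName(0));
        System.out.println(partFileName(2));
    }
}
```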

Page 45: Introduction to MapReduce and Hadoop

Recap

Page 46: Introduction to MapReduce and Hadoop

END OF SESSION #1

Page 47: Introduction to MapReduce and Hadoop

Q