Apache Hadoop DFS and Map Reduce Víctor Sánchez Anguix Universitat Politècnica de València MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image Course 2014/2015
Jul 16, 2015
Apache HadoopDFS and Map Reduce
Víctor Sánchez AnguixUniversitat Politècnica de València
MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Who has not heard about Hadoop?
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Who knows exactly what is Hadoop?
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Being simplistic:
What is Apache Hadoop?
DFS MapReduce
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Google publishes paper about GFS (2003). http://research.google.com/archive/gfs.html
➢ Distributed data among cluster of computers
➢ Fault tolerant
➢ Highly scalable with commodity hardware
A bit of history: Distributed File System (DFS)
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Google publishes paper about MR (2004). http://research.google.com/archive/mapreduce.html
➢ Algorithm for processing distributed data in parallel
➢ Simple in concept, extremely useful in practice
A bit of history: Map Reduce (MR)
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Doug Cutting and Mike Caffarella → Apache Nutch
➢ Doug Cutting goes to Yahoo
➢ Yahoo implements Apache Hadoop
A bit of history: Hadoop is born
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Framework for distributed computing
➢ Still based on DFS and MR
➢ It is the main actor in Big Data
➢ Last major release: Apache Hadoop 2.6.0 (Nov 2014)http://hadoop.apache.org/
Apache Hadoop now
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
DFS architecture
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS: creating dirs
➢ Examples:
hdfs dfs -mkdir data
hdfs dfs -mkdir results
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS: uploading files
➢ Examples:
hdfs dfs -put datasets/students.tsv data/students.tsv
hdfs dfs -put datasets/grades.tsv data/grades.tsv
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS: listing
➢ Examples:
hdfs dfs -ls data
Found 2 items
-rw-r--r-- 3 sanguix supergroup 450 2015-02-09 10:50 data/grades.tsv
-rw-r--r-- 3 sanguix supergroup 194 2015-02-09 10:45 data/students.tsv
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS: get a file
➢ Examples:
hdfs dfs -get data/students.tsv
hdfs dfs -get data/grades.tsv
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS: deleting files
➢ Examples:
hdfs dfs -rm data/students.tsv
hdfs dfs -rm data/grades.tsv
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Interacting with Hadoop DFS: space use info
➢ Examples:
hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://localhost 1.5 T 12 K 491.6 G 0%
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce: Overview
Input data
Input data
Input data
Map task
Map task
Map task
Reduce task
Reduce task
Reduce task
Output data
Output data
Output data
chunk of data (key,value) value’
chunk of data (key,value) value’
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map: Transform data to (key, value)
Input data
Input data
Input data
Map task
Map task
Map task
chunk of data
chunk of data
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Shuffle: Send (key, values)
Reduce task
Reduce task
Reduce task
(key,value)
(key,value)
Map task
Map task
Map task
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Reduce: Aggregating (key,values)
Reduce task
Reduce task
Reduce task
Output data
Output data
Output data
value’
value’
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce
Input data
Input data
Input data
Map task
Map task
Map task
Reduce task
Reduce task
Reduce task
Output data
Output data
Output data
chunk of data (key,value) value’
chunk of data (key,value) value’
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce example: word count
CHUNK 1this class is about big data and artificial intelligence
CHUNK 2there is nothing big about this example
CHUNK 3I am a big artificial intelligence enthusiast
➢ The file is divided in chunks to be processed in parallel
➢ Data is sent untransformed to map nodes
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce example: word count
this class is about big data and artificial intelligence
[this, class, is, about, big, data, and, artificial, intelligence]
Tokenize
(this,1), (class,1), (is,1), (about,1), (big,1), (class, 1), (is, 1), (about 1), (big, 1), (data, 1), (and, 1), (artificial,1), (intelligence, 1)
Prepare (key,value) pairs
MAP TASK
Raw chunk
Ready to shuffle
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce example: word countMap Reduce example: word count
(big,1)(big,1)(big,1)
(big,3)Sum
REDUCE TASK
Fromshuffle Output
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Exercise: Matrix power
row column value1 1 3.2
2 3 4.3
3 3 5.1
1 3 0.1
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce variants: No reduce
Input data
Input data
Input data
Map task
Map task
Map task
Output data
Output data
Output data
chunk of data (key,value)
chunk of data (key,value)
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce variants: chaining
Input data
Input data
Input data
Map task
Map task
Map task
Reduce task
Reduce task
Reduce task
Output data
Output data
Output data
Map task
Map task
Map task
Reduce task
Reduce task
Reduce task
Output data
Output data
Output data
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Maps are executed in parallel➢ Reducers do not start until all maps are
finished➢ Output is not finished until all reducers are
finished➢ Bottleneck: Unbalanced map/reduce taks
○ Change key distribution
○ Increase reduces for increasing parallelism
Map Reduce: bottlenecks
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Hadoop is implemented in Java
➢ It is possible to program jobs formed by maps and reduces in Java
➢ We won’t go deep in these matters (bear with me!)
Map Reduce in Hadoop
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image http://hadoop.apache.org/
Hadoop architecture
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
public class WordCount {
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1); private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } }
Map Reduce job in Hadoop
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } }
...
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); }}
Map Reduce job in Hadoop
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ Compilingjavac -cp opt/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar:opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0.jar -d WordCount source/hadoop/WordCount.java
jar -cvf WordCount.jar -C WordCount/ .
➢ Submitting
hadoop jar WordCount.jar es.upv.dsic.iarfid.haia.WordCount
/user/your_username/data/students.tsv /user/your_username/wc
Compiling and submitting a MR job
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Hadoop ecosystem
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
➢ http://hadoop.apache.org
➢ Hadoop in Practice. Alex Holmes. Ed. Manning Publications
➢ Hadoop: The Definitive Guide. Tom White. Ed. O’Reilly.
➢ StackOverflow
Extra information
Apache HadoopDFS and Map Reduce
Víctor Sánchez AnguixUniversitat Politècnica de València
MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015