SLIDES CREATED BY: SHRIDEEP PALLICKARA L13.1
CS455: Introduction to Distributed Systems [Spring 2020]
Dept. Of Computer Science, Colorado State University
COMPUTER SCIENCE DEPARTMENT
CS455: Introduction to Distributed Systems
http://www.cs.colostate.edu/~cs455
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS
[HADOOP]
Shrideep Pallickara
Computer Science
Colorado State University
What’s this hullabaloo about an elephant?
No, not the one named Horton
Who has fun in the Jungle of Nool
This one’s named Hadoop, and is just as cool
Crunching through data and having fun
Professor: SHRIDEEP PALLICKARA
Frequently asked questions from the previous class survey
¨ Why does a Mapper produce R intermediate outputs?
¨ Difference between intermediate output and final output.
¨ Possibilities for daisy-chained MapReduce tasks? E.g. M-M-M-M-R or M-R-M-R-M-R
¨ Are there backup tasks for reducers?
¨ New MapReduce runtime, called MapReduce 2, implemented on a new system called YARN
    ¤ YARN: Yet Another Resource Negotiator
    ¤ Replaces the “classic” runtime in previous releases
¨ HDFS federation
    ¤ HDFS namespace can be dispersed across multiple name nodes
¨ HDFS high-availability
    ¤ Removes name node as a single point of failure; supports standby nodes for failover
¨ HDFS storage policies
    ¤ Hierarchical storage – Archival, Disk (default), SSD, and RamDisk
    ¤ Users can define the type of storage when storing data
    ¤ Blocks can be moved between different storage types
Format of a record in the dataset
    0057
    332130   # USAF weather station identifier
    99999    # WBAN weather station identifier
    19500101 # Observation date
    300      # Observation time
    4
    +51317   # latitude (degrees x 1000)
    +028783  # longitude (degrees x 1000)
    FM-12
    +0171    # elevation (meters)
    99999
    V020
    320      # wind direction (degrees)
    1        # quality code
    …
    -0128    # air temperature (degrees Celsius x 10)
    1        # quality code
    -0139    # dew point temperature (degrees Celsius x 10)
¨ Choose a Text input format
    ¤ Each line in the dataset is given as a text value
    ¤ key is the offset of the beginning of the line from the beginning of the file
¨ Our map function
    ¤ Pulls out year and the air temperature
    ¤ Think of this as a data preparation phase
        ▪ Reducer will work on data generated by the maps
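The data preparation step can be sketched in plain Java, without the Hadoop classes. This is a minimal sketch under stated assumptions: the fixed-width offsets (year at characters 15–19, signed temperature at 87–92) are not given on the slides and are assumed here, and `MaxTempMapSketch`, `map`, and `syntheticRecord` are hypothetical names. The record it parses is synthetic, with only the two fields of interest filled in.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;

class MaxTempMapSketch {
    // Assumed fixed-width offsets; the slides do not state them.
    static final int YEAR_START = 15, YEAR_END = 19;
    static final int TEMP_START = 87, TEMP_END = 92;

    // Mirrors map(key, value): key is the byte offset of the line (unused here),
    // value is the line itself. Emits (year, temperature in tenths of a degree Celsius).
    static Map.Entry<String, Integer> map(long offset, String line) {
        String year = line.substring(YEAR_START, YEAR_END);
        // Integer.parseInt accepts a leading '+' or '-' sign.
        int temperature = Integer.parseInt(line.substring(TEMP_START, TEMP_END));
        return new SimpleEntry<>(year, temperature);
    }

    // Builds a synthetic record: every position is '0' except the two fields read above.
    static String syntheticRecord(String year, String signedTemp) {
        char[] buf = new char[TEMP_END + 1];
        Arrays.fill(buf, '0');
        year.getChars(0, 4, buf, YEAR_START);
        signedTemp.getChars(0, 5, buf, TEMP_START);
        return new String(buf);
    }

    public static void main(String[] args) {
        String line = syntheticRecord("1950", "-0128");
        Map.Entry<String, Integer> kv = map(0L, line);
        System.out.println(kv.getKey() + "\t" + kv.getValue());
    }
}
```

The emitted value stays in tenths of a degree, as in the record format, so -0128 becomes -128 (that is, -12.8 °C).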
Rather than use built-in Java types, Hadoop uses its own set of basic types
¨ Optimized for network serialization
¨ These are in the org.apache.hadoop.io package
    ¤ LongWritable corresponds to Java Long
    ¤ Text corresponds to Java String
    ¤ IntWritable corresponds to Java Integer
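To see why compact types matter on the wire, the sketch below compares two ways of writing the same integer using only the JDK. It does not use Hadoop; it relies on the fact that IntWritable writes its value through `DataOutput.writeInt`, and contrasts that with Java's built-in object serialization of an `Integer`. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class SerializationSizeSketch {
    // Bytes produced by Java's built-in object serialization for a boxed Integer.
    static int javaSerializedSize(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(Integer.valueOf(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Bytes produced by writing the raw int through DataOutput, the way a
    // Writable-style type would.
    static int rawIntSize(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("java.io object serialization: " + javaSerializedSize(42) + " bytes");
        System.out.println("raw DataOutput int:           " + rawIntSize(42) + " bytes");
    }
}
```

The raw form carries only the four bytes of the value, while object serialization also writes stream and class metadata.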
¨ A context object is available at any point of the MapReduce execution
¨ Provides a convenient mechanism for exchanging required system and job-wide information
¨ Context coordination happens only when an appropriate phase (driver, map, reduce) of a MapReduce job starts
    ¤ Values set by one mapper are not available in another mapper but are available in any reducer
The reduce function is represented by the Reducer class
¨ Subclasses override its reduce() method; in the new API the class is concrete, and the default reduce() passes each value through unchanged
¨ Reducer class is a generic type
    ¤ 4 formal type parameters
    ¤ Used to specify the input and output types of the reduce function
    ¤ The input types should match the output types of the map function
        ▪ In the example, Text and IntWritable
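A reduce function with matching input types can be sketched in plain Java, with `String` and `Integer` standing in for Text and IntWritable. The retained slide text does not say what the reducer computes; this sketch assumes it keeps the maximum temperature seen for each year. `MaxTempReduceSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;

class MaxTempReduceSketch {
    // Mirrors reduce(key, values): every value the maps emitted for one key
    // arrives in a single call. Assumption: the job wants the maximum per year.
    static int reduce(String year, Iterable<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    public static void main(String[] args) {
        // Temperatures are in tenths of a degree Celsius, as emitted by the map step.
        List<Integer> readings = Arrays.asList(-128, 0, 22, -11);
        System.out.println("1950\t" + reduce("1950", readings));
    }
}
```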
¨ In the new API, job control is done using the Job class rather than using the JobClient
¨ Output files are named slightly differently
    ¤ Old API: Both map and reduce outputs are named part-nnnnn
    ¤ New API: Map outputs are named part-m-nnnnn and reduce outputs are named part-r-nnnnn
    ¤ nnnnn is the part number, starting from zero
¨ The new API’s reduce() method passes values as Iterable rather than as Iterator
    ¤ Makes it easier to iterate over values using the for-each loop construct
        for (VALUEIN value : values) {
            …
        }
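The difference is easiest to see side by side. The sketch below uses only the JDK; the two methods are hypothetical stand-ins for the body of an old-style and a new-style reduce(), each summing its values.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class IterableVsIteratorSketch {
    // Old-API style: values arrive as an Iterator and are driven by hand.
    static int sumWithIterator(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // New-API style: values arrive as an Iterable, so for-each works directly.
    static int sumWithIterable(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println(sumWithIterator(values.iterator()));
        System.out.println(sumWithIterable(values));
    }
}
```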
MAPREDUCE TASKS & SPLIT STRATEGIES
¨ Time taken to process a split is small compared to the time taken to process the whole input
¨ Quality of load balancing increases as splits become fine-grained
    ¤ Faster machines process proportionally more splits than slower machines
    ¤ Even if machines are identical, this feature is desirable
        ▪ Failed tasks get relaunched, and there are other jobs executing concurrently
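A toy simulation makes the load-balancing claim concrete. It is a sketch under simplifying assumptions and not a model of Hadoop's scheduler: a machine pulls the next split as soon as it is idle, all splits are the same size, and machine speeds are in arbitrary work units per time unit. All names are hypothetical.

```java
import java.util.PriorityQueue;

class SplitLoadBalanceSketch {
    static final class Result {
        final int[] splitsPerMachine;
        final double makespan; // time at which the last split finishes

        Result(int[] splitsPerMachine, double makespan) {
            this.splitsPerMachine = splitsPerMachine;
            this.makespan = makespan;
        }
    }

    // Cuts totalWork into numSplits equal splits and hands each one to the
    // machine that becomes idle first (ties go to the lower machine index).
    static Result simulate(double[] speeds, int numSplits, double totalWork) {
        double workPerSplit = totalWork / numSplits;
        int[] counts = new int[speeds.length];
        // Each entry is {time the machine becomes idle, machine index}.
        PriorityQueue<double[]> idleAt = new PriorityQueue<>((a, b) -> {
            int byTime = Double.compare(a[0], b[0]);
            return byTime != 0 ? byTime : Double.compare(a[1], b[1]);
        });
        for (int m = 0; m < speeds.length; m++) {
            idleAt.add(new double[] {0.0, m});
        }
        double makespan = 0.0;
        for (int s = 0; s < numSplits; s++) {
            double[] next = idleAt.poll();
            int machine = (int) next[1];
            double finish = next[0] + workPerSplit / speeds[machine];
            counts[machine]++;
            makespan = Math.max(makespan, finish);
            idleAt.add(new double[] {finish, machine});
        }
        return new Result(counts, makespan);
    }

    public static void main(String[] args) {
        double[] speeds = {2.0, 1.0}; // machine 0 is twice as fast as machine 1
        Result coarse = simulate(speeds, 2, 12.0);
        Result fine = simulate(speeds, 12, 12.0);
        System.out.println("2 splits:  makespan " + coarse.makespan);
        System.out.println("12 splits: makespan " + fine.makespan
                + ", splits per machine " + fine.splitsPerMachine[0]
                + " and " + fine.splitsPerMachine[1]);
    }
}
```

With two coarse splits each machine gets one, and the slower machine holds the job to 6 time units. With twelve fine splits the faster machine processes 8 of them, the slower one 4, and the job finishes in 4 time units.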
¨ Hadoop does its best to run a map task on the node where input data resides in HDFS
    ¤ Data locality
¨ What if all three nodes holding the HDFS block replicas are busy?
    ¤ Find a free map slot on a node in the same rack
    ¤ Only when this is not possible is an off-rack node utilized
        ▪ Inter-rack network transfer
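The preference order (node-local, then rack-local, then off-rack) can be sketched as a simple selection function. This is an illustration only, not Hadoop's scheduler code: the node names, rack names, and the `place` method are hypothetical, and every node is assumed to have an entry in the rack map.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class LocalityPlacementSketch {
    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    private static Optional<Map.Entry<String, Locality>> chosen(String node, Locality locality) {
        Map.Entry<String, Locality> entry = new SimpleEntry<>(node, locality);
        return Optional.of(entry);
    }

    // Picks a node for a map task given the nodes holding replicas of its
    // input block, a node-to-rack map, and the nodes that have a free slot.
    static Optional<Map.Entry<String, Locality>> place(
            Set<String> replicaNodes, Map<String, String> rackOf, List<String> freeNodes) {
        // 1. Prefer a free node that already holds a replica.
        for (String node : freeNodes) {
            if (replicaNodes.contains(node)) {
                return chosen(node, Locality.NODE_LOCAL);
            }
        }
        // 2. Otherwise prefer a free node in the same rack as some replica.
        Set<String> replicaRacks = new HashSet<>();
        for (String node : replicaNodes) {
            replicaRacks.add(rackOf.get(node));
        }
        for (String node : freeNodes) {
            if (replicaRacks.contains(rackOf.get(node))) {
                return chosen(node, Locality.RACK_LOCAL);
            }
        }
        // 3. Fall back to any free node; the input then crosses racks.
        if (!freeNodes.isEmpty()) {
            return chosen(freeNodes.get(0), Locality.OFF_RACK);
        }
        return Optional.empty(); // no free slot anywhere
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("n1", "r1");
        rackOf.put("n2", "r1");
        rackOf.put("n3", "r2");
        rackOf.put("n4", "r2");
        rackOf.put("n5", "r3");
        Set<String> replicas = new HashSet<>(Arrays.asList("n1", "n3", "n4"));
        // All three replica holders are busy; n2 shares rack r1 with n1.
        System.out.println(place(replicas, rackOf, Arrays.asList("n5", "n2")));
    }
}
```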
Why the optimal split size is the same as the block size …
¨ Largest size of input that can be guaranteed to be stored on a single node
¨ If split size spanned two blocks?
    ¤ Unlikely that any HDFS node has stored both blocks
    ¤ Some of the split will have to be transferred across the network to the node running the map task
        ▪ Less efficient than operating on local data without the network movement
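The arithmetic behind this argument can be sketched as follows. The 128 MB block size is an assumption for illustration, the method names are hypothetical, and `remoteBytes` assumes the map task runs on a node that holds only the first block the split touches.

```java
class SplitBlockSpanSketch {
    // Number of blocks touched by a split covering bytes [start, start + length).
    static long blocksSpanned(long start, long length, long blockSize) {
        if (length <= 0) {
            return 0;
        }
        long firstBlock = start / blockSize;
        long lastBlock = (start + length - 1) / blockSize;
        return lastBlock - firstBlock + 1;
    }

    // Bytes that must cross the network if only the first touched block is local.
    static long remoteBytes(long start, long length, long blockSize) {
        long endOfFirstBlock = (start / blockSize + 1) * blockSize;
        long localBytes = Math.min(length, endOfFirstBlock - start);
        return length - localBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // assumed block size
        System.out.println("split = 1 block:  spans " + blocksSpanned(0, blockSize, blockSize)
                + ", remote MB " + remoteBytes(0, blockSize, blockSize) / mb);
        System.out.println("split = 2 blocks: spans " + blocksSpanned(0, 2 * blockSize, blockSize)
                + ", remote MB " + remoteBytes(0, 2 * blockSize, blockSize) / mb);
    }
}
```

A split equal to one aligned block needs no remote data, while a split of twice the block size pulls a full block over the network under these assumptions.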