Hadoop, HDFS, MapReduce and Pig

Jun 30, 2015

Tomasz Bednarz

Open presentation and training material, presented at the CSIRO Big Data 2.0 workshop in September 2013 in North Ryde, Australia, and illustrated with hands-on examples.
Page 1: Hadoop, HDFS, MapReduce and Pig

Page 11: Hadoop, HDFS, MapReduce and Pig

$ hadoop fs

● ls

$ hadoop fs -help ls

Page 12: Hadoop, HDFS, MapReduce and Pig

$ hadoop fs -ls <path>
$ hadoop fs -ls /

$ hadoop fs -ls
$ hadoop fs -ls /user/cloudera

Page 13: Hadoop, HDFS, MapReduce and Pig

$ hadoop fs -mkdir data
$ hadoop fs -ls

$ cd ~/bigdata/Exercises/hadoop/data
$ ls -l
$ hadoop fs -put mammograms.zip data

Page 14: Hadoop, HDFS, MapReduce and Pig

● http://localhost:50070

● fsck: an HDFS utility

$ hadoop fsck /user/cloudera/data/mammograms.zip \
    -blocks -locations -files

$ head -n 100 ato_centenary.txt \
    | hadoop fs -put - data/ato100.txt

Page 15: Hadoop, HDFS, MapReduce and Pig

$ head -n 1000 ato_centenary.txt \
    | hadoop fs -put - data/ato100.txt

put: `data/ato100.txt': File exists

$ hadoop fs -rm data/ato100.txt
$ head -n 1000 ato_centenary.txt \
    | hadoop fs -put - data/ato100.txt

Page 16: Hadoop, HDFS, MapReduce and Pig

$ hadoop fs -cat data/ato100.txt | less

$ hadoop fs -get data/ato100.txt ato100.txt

-mv, -cp, -rmdir, -stat ...

Page 22: Hadoop, HDFS, MapReduce and Pig

$ javac -classpath `hadoop classpath` *.java

$ jar cvf csiro.jar *.class

$ hadoop jar csiro.jar Csiro input_dir output_dir

Page 23: Hadoop, HDFS, MapReduce and Pig

map(in_key, in_value) -> (inter_key, inter_value) list

Page 25: Hadoop, HDFS, MapReduce and Pig

let map(key, value) =
    emit(key.toUpper(), value.toUpper())

('csiro', 'cci') -> ('CSIRO', 'CCI')
('csiro', 'cesre') -> ('CSIRO', 'CESRE')
('csiro', 'cmse') -> ('CSIRO', 'CMSE')
('toyota', 'yaris') -> ('TOYOTA', 'YARIS')

Page 26: Hadoop, HDFS, MapReduce and Pig

let map(key, value) =
    foreach char c in value:
        emit(key, c)

('cci', 'csiro') -> ('cci', 'c'), ('cci', 's'), ('cci', 'i'), ('cci', 'r'), ('cci', 'o')

('open', 'nasa') -> ('open', 'n'), ('open', 'a'), ('open', 's'), ('open', 'a')

Page 27: Hadoop, HDFS, MapReduce and Pig

let map(key, value) =
    emit(value.length(), value)

('csiro', 'cci') -> ('3', 'cci')
('csiro', 'cesre') -> ('5', 'cesre')
('csiro', 'cmse') -> ('4', 'cmse')
('toyota', 'yaris') -> ('5', 'yaris')
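
The three map examples above can be tried locally without a cluster. The sketch below is plain Python, not the Hadoop API: each map function is a generator that yields (key, value) pairs in place of emit(), and the names map_upper, map_explode, map_length and run_map are made up for this illustration.

```python
# Local illustration only: plain Python generators, not Hadoop API calls.
# All function names here are hypothetical.

def map_upper(key, value):
    """Emit one pair with both key and value upper-cased."""
    yield key.upper(), value.upper()

def map_explode(key, value):
    """Emit one pair per character of the value."""
    for c in value:
        yield key, c

def map_length(key, value):
    """Emit the length of the value (as text, as on the slide) as the new key."""
    yield str(len(value)), value

def run_map(map_fn, records):
    """Apply map_fn to every (key, value) record and collect the emitted pairs."""
    out = []
    for key, value in records:
        out.extend(map_fn(key, value))
    return out

records = [("csiro", "cci"), ("csiro", "cesre"),
           ("csiro", "cmse"), ("toyota", "yaris")]

print(run_map(map_upper, records))
print(run_map(map_explode, [("open", "nasa")]))
print(run_map(map_length, records))
```

Note that a single input record may produce one output pair (map_upper, map_length) or several (map_explode), which is why the map signature returns a list of pairs.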

Page 29: Hadoop, HDFS, MapReduce and Pig

map(String input_key, String input_value)
    foreach word w in input_value:
        emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_values)
    set count = 0
    foreach v in intermediate_values:
        count += v
    emit(output_key, count)
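
The word-count pseudocode above can be simulated on one machine to see how the map, shuffle and reduce steps fit together. This is a minimal sketch in plain Python, not the WordCount.java used in the exercise: run_job is a hypothetical helper that stands in for the framework's grouping of intermediate values by key, and the two input lines are made-up sample text.

```python
from collections import defaultdict

def map_words(input_key, input_value):
    """Map step: emit (word, 1) for every whitespace-separated word."""
    for w in input_value.split():
        yield w, 1

def reduce_sum(output_key, intermediate_values):
    """Reduce step: add up all the 1s emitted for one word."""
    count = 0
    for v in intermediate_values:
        count += v
    yield output_key, count

def run_job(lines, map_fn, reduce_fn):
    """Hypothetical local driver: map every line, group by key, reduce per key."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):      # offset plays the role of input_key
        for key, value in map_fn(offset, line):
            groups[key].append(value)          # the "shuffle": group values by key
    result = []
    for key in sorted(groups):                 # keys reach the reducer in sorted order
        result.extend(reduce_fn(key, groups[key]))
    return result

lines = ["big data big ideas", "data at csiro"]  # made-up sample input
counts = run_job(lines, map_words, reduce_sum)
print(counts)
```

On a real cluster the grouping step happens across machines, but the map and reduce functions keep the same shape.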

Page 30: Hadoop, HDFS, MapReduce and Pig

● Wordcount

$ cd ~/bigdata/Exercises/hadoop/wordcount; ls

$ javac -classpath `hadoop classpath` *.java

$ jar cvf wc.jar *.class

WordCount.java WordMapper.java SumReducer.java

Page 31: Hadoop, HDFS, MapReduce and Pig

$ hadoop jar wc.jar WordCount data/ato100.txt ato_wc

$ hadoop fs -ls ato_wc
$ hadoop fs -cat ato_wc/part-r-00000 | less
$ hadoop fs -cat ato_wc/* | grep 'ATO\|CSIRO'

$ hadoop fs -rm -r ato_wc

Page 32: Hadoop, HDFS, MapReduce and Pig

● Average max temperature

Page 33: Hadoop, HDFS, MapReduce and Pig

$ cd ~/bigdata/Exercises/hadoop/data
$ less nsw_temp.csv
$ less bom_data_Note.txt

Page 34: Hadoop, HDFS, MapReduce and Pig

map(String input_key, String input_value):
    emit(input_value[3], input_value[5])

('IDCJAC0010,061087,1965,01,02,32.2,1,Y') -> ('01', 32.2)

('IDCJAC0010,066062,1890,04,27,20.2,1,Y') -> ('04', 20.2)

('IDCJAC0010,066062,2012,02,03,21.0,1,Y') -> ('02', 21.0)

Page 35: Hadoop, HDFS, MapReduce and Pig

reduce(String month, Iterator<double> values)
    set count = 0
    set sum = 0
    foreach v in values:
        sum += v
        count++
    set mean = sum/count
    emit(month, mean)
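
The average-temperature map and reduce steps above can be checked locally before writing the Java classes. The sketch below is plain Python, not the exercise's sample solution: it assumes each record is one comma-separated line with the month in field 3 and the temperature in field 5, with no header row and no missing values. The first three records are the ones shown on the slide; the fourth is a made-up row added so that one month has two values to average. The function names are hypothetical.

```python
from collections import defaultdict

def map_month_temp(input_key, input_value):
    """Map step: split the CSV line and emit (month, max temperature)."""
    fields = input_value.split(",")
    yield fields[3], float(fields[5])

def reduce_mean(month, values):
    """Reduce step: average all temperatures seen for one month."""
    count = 0
    total = 0.0
    for v in values:
        total += v
        count += 1
    yield month, total / count

def average_by_month(lines):
    """Hypothetical local driver: map, group by month, then reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for month, temp in map_month_temp(offset, line):
            groups[month].append(temp)
    means = {}
    for month in sorted(groups):
        for key, mean in reduce_mean(month, groups[month]):
            means[key] = mean
    return means

lines = [
    "IDCJAC0010,061087,1965,01,02,32.2,1,Y",
    "IDCJAC0010,066062,1890,04,27,20.2,1,Y",
    "IDCJAC0010,066062,2012,02,03,21.0,1,Y",
    "IDCJAC0010,061087,1965,01,03,30.2,1,Y",  # made-up row for illustration
]
means = average_by_month(lines)
for month, mean in means.items():
    print(month, round(mean, 1))
```

A real run over nsw_temp.csv would also need to skip any header line and records with an empty temperature field; the sketch leaves that out.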

Page 36: Hadoop, HDFS, MapReduce and Pig

$ cd ../averagetemp
$ gedit *.java &

$ cd ../wordcount
$ gedit *.java &

AverageTemp.java AverageTempMapper.java AverageReducer.java

Page 37: Hadoop, HDFS, MapReduce and Pig

$ hadoop fs -put ../data/nsw_temp.csv data

$ javac -classpath `hadoop classpath` *.java
$ jar cvf avt.jar *.class
$ hadoop jar avt.jar AverageTemp data/nsw_temp.csv avt

Page 38: Hadoop, HDFS, MapReduce and Pig

$ hadoop fs -cat avt/part-r-00000

~/bigdata/Exercises/hadoop/averagetemp/sample_solution

Page 54: Hadoop, HDFS, MapReduce and Pig

https://github.com/tomaszbednarz/pig-abc-toilets

● We have a list of local ABC Radio stations in Australia

● We have a list of all public toilets across Australia

● We want to find the closest toilet to each radio station

Demonstration of:

● Data Schemas
● Use of external libraries
● Google Maps API
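
The core of the task, finding the nearest toilet to a station, can be sketched outside Pig as follows. This is not taken from the linked repository: it is a brute-force illustration in plain Python that assumes each record is a (name, latitude, longitude) tuple and uses the haversine great-circle formula as the distance measure. The station and toilet entries are invented placeholders, and the function names are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def closest(station, toilets):
    """Return the toilet record with the smallest distance to the station."""
    _, lat, lon = station
    return min(toilets, key=lambda t: haversine_km(lat, lon, t[1], t[2]))

# Invented placeholder records: (name, latitude, longitude).
stations = [("Station A", -33.80, 151.10)]
toilets = [("Toilet 1", -33.81, 151.11), ("Toilet 2", -34.50, 150.90)]

for station in stations:
    nearest = closest(station, toilets)
    print(station[0], "->", nearest[0])
```

Comparing every station with every toilet is the simplest approach and corresponds to a cross join followed by a per-station minimum; it is only practical while one of the two lists stays small.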
