Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 1
Big Data using Hadoop
Hands On Workshop
March 2015
Dr. Thanachart Numnonda, Certified Java Programmer
[email protected]
Danairat T., Certified Java Programmer, TOGAF – Silver
[email protected] , +66-81-559-1446
Danairat T., , [email protected] : Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Launch a virtual server on EC2 Amazon Web Services
Hadoop Installation
Hadoop provides three installation choices:
1. Local mode: an unzip-and-run mode that gets you started right away; all parts of Hadoop run within the same JVM
2. Pseudo-distributed mode: the different parts of Hadoop run as separate Java processes, but within a single machine
3. Distributed mode: the real setup, which spans multiple machines
Virtual Server
This lab uses an EC2 virtual server to install a Hadoop server, with the following settings:
● Ubuntu Server 14.04 LTS
● m3.medium: 1 vCPU, 3.75 GB memory
● Security group: default
● Key pair: imchadoop
Select the EC2 service and click Launch Instance
Select an Amazon Machine Image (AMI): Ubuntu Server 14.04 LTS (PV)
Choose the m3.medium instance type
Leave configuration details as default
Add Storage: 20 GB
Name the instance
Select an existing security group > Select Security Group Name: default
Click Launch and choose imchadoop as a key pair
Review the instance, then click Connect for instructions on connecting to it
Connect to an instance from Mac/Linux
Connect to an instance from Windows using PuTTY
Connect to the instance
Hands-On: Installing Hadoop
Installing Hadoop and Ecosystem
1. Update the system
2. Configuring SSH
3. Installing JDK 1.7
4. Download/Extract Hadoop
5. Installing Hadoop
6. Configure xml files
7. Formatting HDFS
8. Start Hadoop
9. Hadoop Web Console
10. Stop Hadoop
Notes:
Hadoop and IPv6: Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6
1) Update the system: sudo apt-get update
2. Configuring SSH: ssh-keygen
Enabling SSH access to your local machine
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Testing the SSH setup by connecting to your local machine
$ ssh 54.68.149.232
Type exit to close the SSH session
$ exit
3) Install JDK 1.7: sudo apt-get install openjdk-7-jdk
(Enter Y when prompted)
Type command > java -version
4) Download/Extract Hadoop
1) Type command > wget http://mirror.issp.co.th/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
2) Type command > tar -xvzf hadoop-1.2.1.tar.gz
3) Type command > sudo mv hadoop-1.2.1 /usr/local/hadoop
5) Installing Hadoop
1) Type command > sudo vi $HOME/.bashrc
2) Add the configuration shown in the figure below
3) Type command > exec bash
4) Type command > sudo vi /usr/local/hadoop/conf/hadoop-env.sh
5) Edit the file as shown in the figure below
6) Configuring Hadoop conf/*-site.xml
1. core-site.xml (hadoop.tmp.dir, fs.default.name)
2. hdfs-site.xml (dfs.replication)
3. mapred-site.xml (mapred.job.tracker)
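All three files share one layout: a `<configuration>` root that holds `<property>` entries, each with a `<name>` and a `<value>`. The plain-Java sketch below prints that layout for the properties listed above. It is an illustration only: `SiteXmlSketch` is a hypothetical helper, not part of Hadoop, and the port numbers 54310 and 54311 and the replication factor 1 are assumptions for a single-node setup. Use the values from your own configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: prints name/value pairs in the
// <configuration>/<property> layout used by conf/*-site.xml.
class SiteXmlSketch {
    static String render(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("  <property>\n")
              .append("    <name>").append(e.getKey()).append("</name>\n")
              .append("    <value>").append(e.getValue()).append("</value>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> core = new LinkedHashMap<>();
        core.put("hadoop.tmp.dir", "/usr/local/hadoop/tmp");
        core.put("fs.default.name", "hdfs://172.31.12.11:54310"); // port is an assumption

        Map<String, String> hdfs = new LinkedHashMap<>();
        hdfs.put("dfs.replication", "1"); // assumed value for a single node

        Map<String, String> mapred = new LinkedHashMap<>();
        mapred.put("mapred.job.tracker", "172.31.12.11:54311"); // port is an assumption

        System.out.println("core-site.xml:\n" + render(core));
        System.out.println("hdfs-site.xml:\n" + render(hdfs));
        System.out.println("mapred-site.xml:\n" + render(mapred));
    }
}
```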
Configuring core-site.xml
1) Type command > sudo vi /usr/local/hadoop/conf/core-site.xml
2) Add the private IP of the server as shown in the figure below
(in this case the private IP is 172.31.12.11)
Configuring mapred-site.xml
1) Type command > sudo vi /usr/local/hadoop/conf/mapred-site.xml
2) Add the private IP of the JobTracker server as shown in the figure below
(in this case the private IP is 172.31.12.11)
Configuring hdfs-site.xml
1) Type command > sudo vi /usr/local/hadoop/conf/hdfs-site.xml
2) Add the configuration shown in the figure below
7) Formatting HDFS
1) Type command > sudo mkdir /usr/local/hadoop/tmp
2) Type command > sudo chown ubuntu /usr/local/hadoop
3) Type command > sudo chown ubuntu /usr/local/hadoop/tmp
4) Type command > hadoop namenode -format
Starting Hadoop
Start a NameNode, DataNode, JobTracker and TaskTracker on your machine:
ubuntu@ip-172-31-12-11:~$ start-all.sh
Check the Java processes; Hadoop is now running in pseudo-distributed mode:
ubuntu@ip-172-31-12-11:~$ jps
11567 Jps
10766 NameNode
11099 JobTracker
11221 TaskTracker
10899 DataNode
11018 SecondaryNameNode
ubuntu@ip-172-31-12-11:~$
Hadoop is up!
Viewing the Hadoop HDFS using WebUI http://54.68.149.232:50070/
Stopping Hadoop
ubuntu@ip-172-31-12-11:~$ /usr/local/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Hands-On: Importing Data to HDFS using Hadoop Command Line
Importing Data to Hadoop
Download War and Peace Full Text
www.gutenberg.org/ebooks/2600
Download the file pg2600.txt
$ wget https://dl.dropboxusercontent.com/u/12655380/pg2600.txt
Import to Hadoop
$ hadoop fs -mkdir /input
$ hadoop fs -mkdir /output
$ hadoop fs -copyFromLocal pg2600.txt /input
Hands-On: Reviewing, Retrieving, Deleting Data from HDFS
Review file in Hadoop HDFS
List HDFS File
Read HDFS File
ubuntu@ip-172-31-12-11:~$ hadoop fs -cat /input/pg2600.txt
Retrieve HDFS File to Local File System
ubuntu@ip-172-31-12-11:~$ hadoop fs -copyToLocal /input/pg2600.txt /tmp/file.txt
Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
Review file in Hadoop HDFS using WebUI
Hadoop Port Numbers
Group | Daemon | Default Port | Configuration Parameter in conf/*-site.xml
HDFS | Namenode | 50070 | dfs.http.address
HDFS | Datanodes | 50075 | dfs.datanode.http.address
HDFS | Secondarynamenode | 50090 | dfs.secondary.http.address
MR | JobTracker | 50030 | mapred.job.tracker.http.address
MR | Tasktrackers | 50060 | mapred.task.tracker.http.address
Review Content from System shell
Removing data from HDFS using Shell Command
[hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt
Deleted hdfs://localhost:54310/input/input_test.txt
[hdadmin@localhost detach]$
Lecture: Understanding Map Reduce Processing
[Figure: a Client; a Name Node and Job Tracker; three worker nodes, each with a Data Node and a Task Tracker; Map Reduce]
High Level Architecture of MapReduce
Before MapReduce…
● Large scale data processing was difficult!
– Managing hundreds or thousands of processors
– Managing parallelization and distribution
– I/O Scheduling
– Status and monitoring
– Fault/crash tolerance
● MapReduce provides all of these, easily!
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
MapReduce Overview
● What is it?
– Programming model used by Google
– A combination of the Map and Reduce models with an associated implementation
– Used for processing and generating large data sets
● How does it solve our previously mentioned problems?
– MapReduce is highly scalable and can be used across many computers.
– Many small machines can be used to process jobs that normally could not be processed by a large machine.
MapReduce Framework
Source: www.bigdatauniversity.com
How Map and Reduce Work Together
● Map returns information
● Reduce accepts information
● Reduce applies a user-defined function to reduce the amount of data
Map Abstraction
● Inputs a key/value pair
– Key is a reference to the input value
– Value is the data set on which to operate
● Evaluation
– Function defined by user
– Applied to every value in the input
● Might need to parse input
● Produces a new list of key/value pairs
– Can be different type from input pair
Reduce Abstraction
● Starts with intermediate Key / Value pairs
● Ends with finalized Key / Value pairs
● Starting pairs are sorted by key
● Iterator supplies the values for a given key to the Reduce function.
● Typically a function that:
– Starts with a large number of key/value pairs
  ● One key/value for each word in all files being grepped (including multiple entries for the same word)
– Ends with very few key/value pairs
  ● One key/value for each unique word across all the files, with the number of instances summed into this entry
● Broken up so a given worker works with input of the same key.
Other Applications
● Yahoo!
– Webmap application uses Hadoop to create a database of information on all known webpages
● Facebook
– Hive data center uses Hadoop to provide business statistics to application developers and advertisers
● Rackspace
– Analyzes server log files and usage data using Hadoop
Why is this approach better?
● Creates an abstraction for dealing with complex overhead
– The computations are simple, the overhead is messy
● Removing the overhead makes programs much smaller and thus easier to use
– Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested
● Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation
MapReduce Framework
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
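To make the two signatures concrete, here is a minimal in-memory sketch in plain Java. It is not the Hadoop API: `MiniMapReduce`, `kv` and `run` are hypothetical names, everything runs in one JVM, and a `TreeMap` stands in for the shuffle-and-sort step that groups the `V2` values by `K2`. The two input lines and their byte offsets are assumed sample data.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical single-JVM model of map: (K1, V1) -> list(K2, V2)
// and reduce: (K2, list(V2)) -> list(K3, V3).
class MiniMapReduce {
    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<Map.Entry<K3, V3>>> reduce) {
        // Stand-in for shuffle and sort: group intermediate values by key, keys sorted.
        TreeMap<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> pair : map.apply(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        // One reduce call per distinct intermediate key.
        List<Map.Entry<K3, V3>> output = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            output.addAll(reduce.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> input = new ArrayList<>();
        input.add(kv(0L, "Deer Bear River"));   // (byte offset, line), assumed sample data
        input.add(kv(16L, "Car Car River"));
        List<Map.Entry<String, Integer>> counts =
            MiniMapReduce.<Long, String, String, Integer, String, Integer>run(
                input,
                (offset, line) -> {
                    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                    for (String w : line.split("\\s+")) pairs.add(kv(w, 1));
                    return pairs;
                },
                (word, ones) -> {
                    int sum = 0;
                    for (int one : ones) sum += one;
                    return Collections.singletonList(kv(word, sum));
                });
        System.out.println(counts);
    }
}
```

Here K1/V1 are the line offset and line text, K2/V2 are a word and the count 1, and K3/V3 are the word and its total.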
How does the MapReduce work?
[Figure: MapReduce dataflow. InputSplit and RecordReader feed the map step, which outputs a list of (Key, Value) in the intermediate file; after Partitioning and Sorting, the reduce step receives a list of (Key, List of Values) in the intermediate file; RecordWriter writes the result.]
How does the MapReduce work?
[Figure: the same dataflow (InputSplit, RecordReader, map, Partitioning, Sorting, reduce, RecordWriter) with a Combining step on the map output. Sample values shown: the combiner emits (Car, 2); the reduce input is Bear, {1,1}; Car, {2,1}; River, {1,1}; Deer, {1,1}.]
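The Combining step can be sketched as a local sum over the output of a single map task, so that fewer pairs cross the network. The example below is plain Java, not the Hadoop API; `CombinerSketch` is a hypothetical name, and the input line `Car Car River` is an assumption chosen to be consistent with the `Car, 2` and `Car, {2,1}` values in the figure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: locally sums the (word, 1) pairs produced by ONE map
// task before they are shuffled to the reducers.
class CombinerSketch {
    static Map<String, Integer> combine(String[] mapOutputKeys) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String word : mapOutputKeys) {
            combined.merge(word, 1, Integer::sum); // each map output pair carries the value 1
        }
        return combined;
    }

    public static void main(String[] args) {
        // Map output for the assumed input line "Car Car River":
        // (Car, 1), (Car, 1), (River, 1) becomes (Car, 2), (River, 1).
        System.out.println(combine(new String[] {"Car", "Car", "River"}));
    }
}
```

In the old `org.apache.hadoop.mapred` API a combiner is registered with `JobConf.setCombinerClass(...)`. For a word count the reducer class can usually double as the combiner, because summing is associative and commutative.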
MapReduce Processing – The Dataflow
1. InputFormat, InputSplits, RecordReader
2. Mapper - your focus is here
3. Partition, Shuffle & Sort
4. Reducer - your focus is here
5. OutputFormat, RecordWriter
InputFormat
InputFormat | Description | Key | Value
TextInputFormat | Default format; reads lines of text files | The byte offset of the line | The line contents
KeyValueInputFormat | Parses lines into key, val pairs | Everything up to the first tab character | The remainder of the line
SequenceFileInputFormat | A Hadoop-specific high-performance binary format | user-defined | user-defined
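The first two rows of the table can be illustrated in plain Java. This is a sketch of the key/value conventions only, not the Hadoop classes themselves; `InputRecordSketch` and its method names are hypothetical, and the sample data is assumed to be newline-terminated text.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of how two input formats turn text into (key, value) records.
class InputRecordSketch {
    // TextInputFormat-style view: key = byte offset of the line, value = the line.
    static List<Map.Entry<Long, String>> textRecords(String data) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        long offset = 0;
        for (String line : data.split("\n")) {
            out.add(new SimpleEntry<>(offset, line));
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
        }
        return out;
    }

    // KeyValueInputFormat-style view: key = text before the first tab, value = the rest.
    static Map.Entry<String, String> keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>(line, ""); // no tab: the whole line is the key
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        System.out.println(textRecords("Deer Bear River\nCar Car River\n"));
        System.out.println(keyValueRecord("Car\t2"));
    }
}
```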
InputSplit
An InputSplit describes a unit of work that comprises a single map task.
InputSplit presents a byte-oriented view of the input.
You can influence the split size by setting the mapred.min.split.size parameter in core-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.
RecordReader
RecordReader reads <key, value> pairs from an InputSplit.
Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view to the Mapper.
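How the minimum split size interacts with the block size can be sketched with a little arithmetic. The sketch assumes the formula `max(minSize, min(goalSize, blockSize))`, which is how the split size choice of the Hadoop 1.x `FileInputFormat` is usually described; `goalSize` (total input size divided by the requested number of map tasks) and the class and method names are illustrative assumptions. The split count ignores the small slack Hadoop allows for a slightly oversized final split.

```java
// Hypothetical sketch of split-size arithmetic; not the Hadoop implementation.
class SplitSizeSketch {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        // minSize corresponds to mapred.min.split.size and acts as a lower bound.
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Approximate number of splits (and therefore map tasks) for one file.
    static long approxSplitCount(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(32 * mb, 1, 64 * mb);
        System.out.println("split size: " + splitSize + " bytes");
        System.out.println("splits for a 200 MB file: " + approxSplitCount(200 * mb, splitSize));
    }
}
```

Raising the minimum above the block size yields fewer, larger splits, at the cost of some map tasks reading data that is not local to their node.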
Mapper
Mapper: The Mapper applies the user-defined logic to each input (key, value) pair and emits (key, value) pair(s), which are forwarded to the Reducers.
Partition, Shuffle & Sort
After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
The Partitioner controls how map outputs are assigned to reduce tasks. The total number of partitions is the same as the number of reduce tasks for the job.
The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
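Partitioning by key hash can be sketched in a few lines. The arithmetic below is the same as in Hadoop's default `HashPartitioner`, but it is applied here to a plain Java `String` rather than a Hadoop `Text` key, and `PartitionSketch` is a hypothetical name.

```java
// Hypothetical sketch: assigns an intermediate key to one of numReduceTasks partitions.
class PartitionSketch {
    static int partition(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Bear", "Car", "Deer", "River"}) {
            System.out.println(word + " -> reducer " + partition(word, 2));
        }
    }
}
```

Because the partition depends only on the key, every pair with the same key reaches the same reduce task, whichever map task produced it.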
Reducer
This is an instance of user-provided code that reads each key, and the iterator of values for that key, in the partition assigned to it. The OutputCollector object in the Reducer phase has a method named collect(), which collects a (key, value) output.
OutputFormat, RecordWriter
OutputFormat governs the format in which the OutputCollector output is written, and RecordWriter writes the output into HDFS.
OutputFormat | Description
TextOutputFormat | Default; writes lines in "key \t value" form
SequenceFileOutputFormat | Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat | Generates no output files
Hands-On: Writing your own MapReduce Program
Wordcount (HelloWorld in Hadoop)

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment
Packaging Map Reduce Program
Usage
Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:
$ wget https://dl.dropboxusercontent.com/u/12655380/WordCount.java
$ mkdir hduser
$ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d hduser WordCount.java
$ jar -cvf ./wordcount.jar -C hduser/ .
$ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir
Output:
…….
$ hadoop fs -cat /output/wordcount_output_dir/part-00000
Reviewing MapReduce Output Result
Hands-On: Writing Map/Reduce Program on Eclipse
Starting Eclipse
Create a Java Project
Let's name it HadoopWordCount
Add dependencies to the project
● Add the following two JARs to your build path: hadoop-common.jar and hadoop-mapreduce-client-core.jar. Both can be found at /usr/lib/hadoop/client
● Perform the following steps:
– Add a folder named lib to the project
– Copy the mentioned JARs into this folder
– Right-click on the project name >> select Build Path >> then Configure Build Path
– Click on Add Jars, select these two JARs from the lib folder
Writing a source code
● Right click the project, then select New >> Package
● Name the package org.myorg
● Right click on org.myorg, then select New >> Class
● Name the class WordCount
● Write the source code as shown in the previous slides
Building a Jar file
● Right click the project, then select Export
● Select Java and then JAR file
● Provide the JAR name, as wordcount.jar
● Leave the JAR package options as default
● In the JAR Manifest Specification section, at the bottom, specify the Main class
● In this case, select WordCount
● Click on Finish
● The JAR file will be built and will be located at cloudera/workspace
Note: you may need to resize the dialog font by selecting Windows >> Preferences >> Appearance >> Colors and Fonts
Lecture: Understanding Hive
Introduction
A Petabyte Scale Data Warehouse Using Hadoop
Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL.
What Hive is NOT
Hive is not designed for online transaction processing and does not offer real-time queries and row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).
Hive Metastore
● Store Hive metadata
● Configurations
– Embedded: in-process metastore, in-process database
– Local: in-process metastore, out-of-process database
– Remote: out-of-process metastore, out-of-process database
Hive Schema-On-Read
● Faster loads into the database (simply copy or move)
● Slower queries
● Flexibility – multiple schemas for the same data
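Schema-on-read means the file is stored unchanged and a schema is applied only when a query reads it. The plain-Java sketch below interprets the same raw line under two different schemas. It is an illustration rather than Hive code: `SchemaOnReadSketch`, its method names, and the sample row `66,Thailand` are assumptions.

```java
// Hypothetical sketch: one stored line, two schemas applied at read time.
class SchemaOnReadSketch {
    // Schema A: (id INT, country STRING), fields separated by ','.
    static int idOf(String rawLine) {
        return Integer.parseInt(rawLine.split(",", -1)[0].trim());
    }

    static String countryOf(String rawLine) {
        return rawLine.split(",", -1)[1];
    }

    // Schema B: a single STRING column holding the whole line.
    static String wholeLine(String rawLine) {
        return rawLine;
    }

    public static void main(String[] args) {
        String stored = "66,Thailand"; // assumed sample row, written once and never rewritten
        System.out.println("schema A: id=" + idOf(stored) + ", country=" + countryOf(stored));
        System.out.println("schema B: line=" + wholeLine(stored));
    }
}
```

Loading costs nothing beyond the copy, but every query pays for parsing, and a malformed row only shows up when it is read.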
HiveQL
● Hive Query Language
● SQL dialect
● No support for:
– UPDATE, DELETE
– Transactions
– Indexes
– HAVING clause in SELECT
– Updateable or materialized views
– Stored procedures
Hive Tables
● Managed: CREATE TABLE
– LOAD: file moved into Hive's data warehouse directory
– DROP: both data and metadata are deleted
● External: CREATE EXTERNAL TABLE
– LOAD: no file moved
– DROP: only metadata deleted
– Use when sharing data between Hive and Hadoop applications, or when you want to use multiple schemas on the same data
Running Hive
Hive Shell
● Interactive
hive
● Script
hive -f myscript
● Inline
hive -e 'SELECT * FROM mytable'
Source: hive.apache.org
System Architecture and Components
• Metastore: stores the metadata.
• Query compiler and execution engine: converts SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.
• SerDe and ObjectInspectors: programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system.
• UDF and UDAF: programmable interfaces and implementations for user-defined functions (scalar and aggregate functions).
• Clients: command line client similar to the MySQL command line.
Source: hive.apache.org
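The Serializer/Deserializer pairing described above can be sketched for a comma-delimited row. The interfaces below are simplified, hypothetical stand-ins and not Hive's actual SerDe API; the `Row` type and its two fields (`id`, `country`) are assumptions for illustration.

```java
// Hypothetical sketch of the SerDe idea: stored record <-> Java object.
class SerDeSketch {
    static final class Row {
        final int id;
        final String country;

        Row(int id, String country) {
            this.id = id;
            this.country = country;
        }
    }

    interface Deserializer {
        Row deserialize(String record); // stored representation -> object to work with
    }

    interface Serializer {
        String serialize(Row row);      // object -> representation that can be written out
    }

    static final class CommaSerDe implements Serializer, Deserializer {
        @Override
        public Row deserialize(String record) {
            String[] fields = record.split(",", 2);
            return new Row(Integer.parseInt(fields[0]), fields[1]);
        }

        @Override
        public String serialize(Row row) {
            return row.id + "," + row.country;
        }
    }

    public static void main(String[] args) {
        CommaSerDe serde = new CommaSerDe();
        Row row = serde.deserialize("1,Thailand");
        System.out.println(row.id + " / " + row.country);
        System.out.println(serde.serialize(row));
    }
}
```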
Architecture Overview
[Figure: Hive architecture. Entry points: Hive CLI (DDL, Queries, Browsing), Mgmt. Web UI, Thrift API. Inside Hive: Hive QL (Parser, Planner, Execution), MetaStore, SerDe (Thrift, Jute, JSON, ..). Underneath: Map Reduce and HDFS.]
Source: hive.apache.org
Sample HiveQL
The Query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following queries:
SELECT * FROM t where t.c = 'xyz'
SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)
SELECT t1.c1, count(1) from t1 group by t1.c1
Hive.apache.org
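As a rough intuition for what such a conversion produces, the first query needs only a filtering map step, while the GROUP BY query needs a map step that emits (c1, 1) and a reduce step that sums per key. The plain-Java sketch below mimics both in memory. It does not reflect Hive's real query plans; the class name, method names and sample rows are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory analogue of two simple queries.
class QueryAsMapReduceSketch {
    // SELECT * FROM t WHERE t.c = 'xyz': a map-only pass that keeps matching rows.
    static List<String[]> filter(List<String[]> rows, int col, String wanted) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (wanted.equals(row[col])) kept.add(row);
        }
        return kept;
    }

    // SELECT c1, count(1) FROM t1 GROUP BY c1: map emits (c1, 1), reduce sums per key.
    static Map<String, Long> groupCount(List<String[]> rows, int col) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] row : rows) {
            counts.merge(row[col], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"a", "xyz"},
            new String[] {"b", "other"},
            new String[] {"a", "xyz"});
        System.out.println(filter(rows, 1, "xyz").size() + " rows match");
        System.out.println(groupCount(rows, 0));
    }
}
```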
Hands-On: Creating Table and Retrieving Data using Hive
Hive Hands-On Labs
1. Installing Hive
2. Configuring / Starting Hive
3. Creating Hive Table
4. Reviewing Hive Table in HDFS
5. Alter and Drop Hive Table
6. Preparing Dataset
7. Loading Data to Hive Table
8. Querying Data from Hive Table
9. Reviewing Hive Table Content from HDFS Command and WebUI
1. Installing Hive
Install Hive binary file
# wget http://apache.mesi.com.ar/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
# tar -xvzf apache-hive-1.1.0-bin.tar.gz
# sudo mv apache-hive-1.1.0-bin /usr/local
# rm apache-hive-1.1.0-bin.tar.gz
1. Installing Hive
Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc
2. Configuring Hive
Creating HDFS Directory for Hive
Create the HDFS /tmp/hive and /user/hive/warehouse directories
[hdadmin@localhost ~]$ hadoop fs -mkdir /tmp/hive
[hdadmin@localhost ~]$ hadoop fs -mkdir /user/hive/warehouse
[hdadmin@localhost ~]$ hadoop fs -chmod 777 /tmp/hive
[hdadmin@localhost ~]$ hadoop fs -chmod 777 /user/hive/warehouse
Page 100
2. Starting Hive
hive> quit;
Quit from Hive
Page 101
3. Creating Hive Table
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 4.069 seconds
hive (default)> show tables;
OK
test_tbl
Time taken: 0.138 seconds
hive (default)> describe test_tbl;
OK
id int
country string
Time taken: 0.147 seconds
hive (default)>
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
Page 102
4. Reviewing Hive Table in HDFS
[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl
[hdadmin@localhost hdadmin]$
Review Hive Table from HDFS WebUI
Page 103
5. Alter and Drop Hive Table
hive (default)> alter table test_tbl add columns (remarks STRING);
hive (default)> describe test_tbl;
OK
id int
country string
remarks string
Time taken: 0.077 seconds
hive (default)> drop table test_tbl;
OK
Time taken: 0.9 seconds
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
Page 104
6. Preparing Large Dataset
http://grouplens.org/datasets/movielens/
Page 105
MovieLens Dataset
1) Type command > wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
2) Type command > sudo apt-get install unzip
3) Type command > unzip ml-100k.zip
4) Type command > more ml-100k/u.user
Page 106
6. Loading Data to Hive Table
hive (default)> exit;
ubuntu@ip-172-31-12-11:~/ml-100k$ hadoop fs -put u.user /dataset/movielens/users
Loading data to Hive table
$ hive
hive (default)> CREATE EXTERNAL TABLE users (userid INT, age INT,
gender STRING, occupation STRING, zipcode STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '/dataset/movielens/users';
Creating Hive table
Page 107
7. Querying Data from Hive Table
Page 108
8. Loading Data to test_tbl Table
$ hive
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Creating Hive table
hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl;
Copying data from file:/tmp/test_tbl_data.csv
Copying file: file:/tmp/test_tbl_data.csv
Loading data to table default.test_tbl
OK
Time taken: 0.241 seconds
hive (default)>
Loading data to Hive table
Page 109
9. Reviewing Hive Table Content from HDFS Command and WebUI
[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl
Found 1 items
-rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08 /user/hive/warehouse/test_tbl/test_tbl_data.csv
[hdadmin@localhost hdadmin]$
[hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data.csv
1,USA
62,Indonesia
63,Philippines
65,Singapore
66,Thailand
[hdadmin@localhost hdadmin]$
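The file in the warehouse directory is the unchanged CSV text; the table definition (ROW FORMAT DELIMITED FIELDS TERMINATED BY ',') is only applied when the data is read. The following Python sketch imitates that read step for the five rows shown above. It is an illustration, not Hive code: SCHEMA and parse_line are names made up for this example, and None stands in for the NULL that Hive returns when a field cannot be converted to the declared column type.

```python
# Contents of test_tbl_data.csv as printed by `hadoop fs -cat` above.
DATA = """1,USA
62,Indonesia
63,Philippines
65,Singapore
66,Thailand
"""

# Column names with Python stand-ins for the Hive types of test_tbl.
SCHEMA = [("id", int), ("country", str)]

def parse_line(line, delimiter=","):
    """Split one text line into typed columns; unparsable values become None."""
    fields = line.split(delimiter)
    row = {}
    for index, (name, cast) in enumerate(SCHEMA):
        try:
            row[name] = cast(fields[index])
        except (IndexError, ValueError):
            row[name] = None  # stand-in for a NULL column value
    return row

rows = [parse_line(line) for line in DATA.splitlines()]
for row in rows:
    print(row)

# The text is 59 bytes long, matching the size in the `hadoop fs -ls` listing.
print(len(DATA.encode("utf-8")))
```

A line such as "x,Laos" would give {'id': None, 'country': 'Laos'} in this sketch, because "x" is not an integer.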
Page 110
Loading Data to Hive Table
$ hive
hive (default)> CREATE TABLE products
(
prod_name STRING,
description STRING,
category STRING,
qty_on_hand INT,
prod_num STRING,
packaged_with ARRAY<STRING>
)
row format delimited
fields terminated by ','
collection items terminated by ':'
stored as textfile;
Creating Hive table
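The products table uses two delimiters: ',' between fields and ':' between the items of the packaged_with array. A minimal Python sketch of how one text line would be split under that definition is given below. The input line and the function name parse_product are invented for illustration; no sample data for this table appears in the slides.

```python
def parse_product(line, field_sep=",", item_sep=":"):
    """Parse one text line laid out like the products table definition."""
    (prod_name, description, category,
     qty_on_hand, prod_num, packaged_with) = line.split(field_sep)
    return {
        "prod_name": prod_name,
        "description": description,
        "category": category,
        "qty_on_hand": int(qty_on_hand),
        "prod_num": prod_num,
        # ARRAY<STRING>: items inside the last field are separated by ':'
        "packaged_with": packaged_with.split(item_sep) if packaged_with else [],
    }

# Hypothetical input line, made up for this example.
line = "Widget,Small widget,Hardware,12,W-100,manual:battery:case"
product = parse_product(line)
print(product["packaged_with"])  # ['manual', 'battery', 'case']
```

Note that a description containing a comma would break this simple split, which is also why the field and collection delimiters must not occur inside the data values.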
Page 111
Lecture: Understanding Pig
Page 112
Introduction
A high-level platform for creating MapReduce programs using Hadoop
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Page 113
Pig Components
● Two Components
  ● Language (Pig Latin)
  ● Compiler
● Two Execution Environments
  ● Local: pig -x local
  ● Distributed: pig -x mapreduce
Page 114
Running Pig
● Script: pig myscript
● Command line (Grunt): pig
● Embedded: writing a Java program
Page 115
Pig Latin
Page 116
Pig Execution Stages
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi
Page 117
Why Pig?
● Makes writing Hadoop jobs easier
  ● 5% of the code, 5% of the time
  ● You don't need to be a programmer to write Pig scripts
● Provides the major functionality required for data warehousing and analytics
  ● Load, Filter, Join, Group By, Order, Transform
● Users can write custom UDFs (User Defined Functions)
Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi
Page 118
Pig vs. Hive
Page 119
Hands-On: Running a Pig script
Page 120
Installing Pig
# wget http://archive.apache.org/dist/hadoop/pig/stable/pig-0.7.0.tar.gz
# tar -xvzf pig-0.7.0.tar.gz
# sudo mv pig-0.7.0 /usr/local/
# rm pig-0.7.0.tar.gz
Install Pig binary file
Page 121
Installing Pig: Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc
Page 122
Danairat T., , [email protected] : Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting Pig Command Line
Page 123
Writing a Pig Script
#Preparing Data
ubuntu@ip-172-31-12-11:~$ wget https://www.dropbox.com/s/pp168a6oiwqkxyu/hdi-data.csv
#Edit Your Script
ubuntu@ip-172-31-12-11:~$ vi countryFilter.pig
countryFilter.pig
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;
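For readers new to Pig Latin, the four statements of countryFilter.pig (load, FILTER, ORDER, dump) can be mirrored step by step in plain Python. This is only a sketch of the data flow, not how Pig executes the script. The three sample rows are invented and merely follow the column order the script declares; they are not taken from hdi-data.csv.

```python
# Hypothetical rows in the column order declared by the script:
# id, country, hdi, lifeex, mysch, eysch, gni
SAMPLE = """1,Alpha,0.90,81,12,17,45000
2,Beta,0.50,60,5,9,1500
3,Gamma,0.70,72,8,12,9000
"""

COLUMNS = [("id", int), ("country", str), ("hdi", float), ("lifeex", int),
           ("mysch", int), ("eysch", int), ("gni", int)]

def load(text, delimiter=","):
    """A = load ... using PigStorage(',') AS (...): split lines and apply the schema."""
    rows = []
    for line in text.splitlines():
        values = line.split(delimiter)
        rows.append({name: cast(value)
                     for (name, cast), value in zip(COLUMNS, values)})
    return rows

a = load(SAMPLE)
b = [row for row in a if row["gni"] > 2000]   # B = FILTER A BY gni > 2000;
c = sorted(b, key=lambda row: row["gni"])     # C = ORDER B BY gni;
for row in c:                                 # dump C;
    print(tuple(row.values()))
```

With the sample rows, the row for Beta is filtered out (gni 1500) and the remaining two rows are printed in ascending order of gni.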
Page 124
ubuntu@ip-172-31-12-11:~$ pig -x local
grunt> run countryFilter.pig
Running a Pig Script
Page 125
Lecture: Understanding Sqoop
Page 126
Introduction
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
• Imports individual tables or entire databases to files in HDFS
• Generates Java classes to allow you to interact with your imported data
• Provides the ability to import from SQL databases straight into your Hive data warehouse
See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
Page 127
Architecture Overview
Page 128
Hands-On: Loading Data from DBMS to Hadoop HDFS
Page 129
Sqoop Hands-On Labs
1. Loading Data into MySQL DB
2. Installing Sqoop
3. Configuring Sqoop
4. Installing DB driver for Sqoop
5. Importing data from MySQL to Hive Table
6. Reviewing data from Hive Table
7. Reviewing HDFS Database Table files
Page 130
1. MySQL RDS Server on AWS
An RDS server is running on AWS with the following configuration:
> database: imc_db
> username: admin
> password: imcinstitute
> addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com
[This address may change]
Page 131
1. country_tbl data
Testing data query from MySQL DB
Table name > country_tbl
Page 132
2. Installing Sqoop
# wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
# tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
# sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/
# rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
Page 133
Installing Sqoop: Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc
Page 134
Danairat T., , [email protected] : Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
3. Configuring Sqoop
ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf$ vi sqoop-env.sh
Page 135
4. Installing DB driver for Sqoop
ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit
Page 136
5. Importing data from MySQL to Hive Table
[hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db --username admin -P --table country_tbl --hive-import --hive-table country -m 1
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Enter password: <enter here>
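The import command is long, so it can help to see it broken into its individual options. The Python sketch below only assembles and prints the same command line with one comment per option; it does not run Sqoop. The host and database values are copied from the RDS slide above (the address may change), and the variable names are chosen for this example.

```python
import shlex

# Connection values from the RDS configuration slide; the address may change.
host = "imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com"
database = "imc_db"

command = [
    "sqoop", "import",
    "--connect", f"jdbc:mysql://{host}/{database}",  # JDBC URL of the source database
    "--username", "admin",       # database user
    "-P",                        # prompt for the password on the console
    "--table", "country_tbl",    # source table in MySQL
    "--hive-import",             # load the imported data into Hive
    "--hive-table", "country",   # name of the target Hive table
    "-m", "1",                   # use a single map task for the import
]

# Print the command as one shell-safe line.
print(shlex.join(command))
```

On a machine where Sqoop is installed, the same list could be passed to subprocess.run(command) instead of being printed.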
Page 137
6. Reviewing data from Hive Table
Page 138
7. Reviewing HDFS Database Table files
Open a web browser at http://54.68.149.232:50070, then navigate to /user/hive/warehouse
Page 139
7. Reviewing HDFS Database Table files
Page 140
Lecture: Understanding HBase
Page 141
Introduction
An open source, non-relational, distributed database
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
Page 142
HBase Features
● Hadoop database modelled after Google's Bigtable
● Column-oriented data store, known as the Hadoop Database
● Supports random real-time CRUD operations (unlike HDFS)
● NoSQL database
● Open source, written in Java
● Runs on a cluster of commodity hardware
Page 143
When to use HBase?
● When you need high volume data to be stored
● Unstructured data
● Sparse data
● Column-oriented data
● Versioned data (same data template, captured at various times; time-lapse data)
● When you need high scalability
Page 144
Which one to use?
● HDFS
  ● Append-only dataset (no random write)
  ● Read the whole dataset (no random read)
● HBase
  ● Need random write and/or read
  ● Has thousands of operations per second on TB+ of data
● RDBMS
  ● Data fits on one big node
  ● Need full transaction support
  ● Need real-time query capabilities
Page 147
HBase Components
● Region
  ● Where the rows of a table are stored
● Region Server
  ● Hosts the tables
● Master
  ● Coordinates the Region Servers
● ZooKeeper
● HDFS
● API
  ● The Java Client API
Page 148
HBase Architecture
Page 149
HBase Shell Commands
Page 150
Hands-On: Running HBase
Page 151
Installing HBase
# wget http://apache.cs.utah.edu/hbase/hbase-1.0.0/hbase-1.0.0-bin.tar.gz
# tar -xvzf hbase-1.0.0-bin.tar.gz
# sudo mv hbase-1.0.0 /usr/local/
# rm hbase-1.0.0-bin.tar.gz
Page 152
Installing HBase: Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc
Page 153
Starting HBase shell
ubuntu@ip-172-31-12-11:~$ start-hbase.sh
starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out
ubuntu@ip-172-31-12-11:~$ jps
3064 TaskTracker
2836 SecondaryNameNode
2588 NameNode
3513 Jps
3327 HMaster
2938 JobTracker
2707 DataNode
ubuntu@ip-172-31-12-11:~$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013
hbase(main):001:0>
Page 154
Create a table and insert data in HBase
hbase(main):009:0> create 'test', 'cf'
0 row(s) in 1.0830 seconds
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'
0 row(s) in 0.0750 seconds
hbase(main):011:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1375363287644, value=val1
1 row(s) in 0.0640 seconds
hbase(main):002:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=1375363287644, value=val1
1 row(s) in 0.0370 seconds
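The shell session above shows the HBase data model in miniature: a table has column families ('cf'), each cell is addressed by row key and column ('cf:a'), and every value carries a timestamp. A toy in-memory model in Python can make that structure explicit. This is an illustration only, not the HBase client API; the class ToyTable and its methods are invented for this sketch.

```python
import time
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for an HBase table: row -> column -> {timestamp: value}."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # column families are fixed at creation
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp=None):
        """Store one versioned cell; the column is written as 'family:qualifier'."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if timestamp is None:
            timestamp = int(time.time() * 1000)  # milliseconds, as in the shell output
        self.rows[row][column][timestamp] = value

    def get(self, row):
        """Return the newest (timestamp, value) pair of every column in one row."""
        return {column: max(versions.items())
                for column, versions in self.rows.get(row, {}).items()}

    def scan(self):
        """Yield (row key, newest cells) in sorted row-key order."""
        for row in sorted(self.rows):
            yield row, self.get(row)

test = ToyTable("test", ["cf"])                             # create 'test', 'cf'
test.put("row1", "cf:a", "val1", timestamp=1375363287644)   # put 'test', 'row1', 'cf:a', 'val1'
print(list(test.scan()))                                    # scan 'test'
print(test.get("row1"))                                     # get 'test', 'row1'
```

A second put to the same cell with a later timestamp adds a new version rather than overwriting the old one, and get then returns the newer value.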
Page 155
Recommendation to Further Study
Page 156
Thank you
www.imcinstitute.com
www.facebook.com/imcinstitute