Aims
This exercise aims to get you to:
Import data into HBase using bulk load
Read MapReduce input from HBase and write MapReduce output to
HBase
Manage data using Hive
Manage data using Pig
Background
In HBase-speak, bulk loading is the process of preparing and loading HFiles
(HBase’s own file format) directly into the RegionServers. Bulk load steps:
1. Extract the data from a source, typically text files or another database.
2. Transform the data into HFiles. This step requires a MapReduce job and for
most input types you will have to write the Mapper yourself. The job will
need to emit the row key as the Key, and either a KeyValue, a Put, or a Delete
as the Value. The Reducer is handled by HBase; you configure it using
HFileOutputFormat2.configureIncrementalLoad().
3. Load the files into HBase by telling the RegionServers where to find them. This step uses LoadIncrementalHFiles (more commonly known as the completebulkload tool): given a URL that locates the files in HDFS, it loads each file into the relevant region via the RegionServer that serves it.
In this process, the data flows from the original source to HDFS, where the RegionServers simply move the files into their regions’ directories.
See more details at: http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/.
Because HBase is not installed in the VM image on the lab computers, you need to install HBase again following the instructions in Lab 5.
Create a project “Lab6” and create a package “comp9313.lab6” in this project. Put all your Java code in this package and keep a copy. Right click the project -> Properties -> Java Build Path -> Libraries -> Add External JARs -> go to the folder “comp9313/hbase-1.2.2/lib”, and add all the jar files to the project.
Data Set
Download the two files “Votes” and “Comments” from the course homepage. The data set contains many questions asked on http://www.stackexchange.com and the corresponding answers. The two files used in this week’s lab were obtained from https://archive.org/details/stackexchange, as part of “datascience.stackexchange.com.7z”. The format of the data set is described at: https://ia800500.us.archive.org/22/items/stackexchange/readme.txt.
The data format of Votes is (the field BountyAmount is ignored):
- **votes**.xml
- Id
- PostId
- VoteTypeId
- ` 1`: AcceptedByOriginator
- ` 2`: UpMod
- ` 3`: DownMod
- ` 4`: Offensive
- ` 5`: Favorite - if VoteTypeId = 5 UserId will be populated
- ` 6`: Close
- ` 7`: Reopen
- ` 8`: BountyStart
- ` 9`: BountyClose
- `10`: Deletion
- `11`: Undeletion
- `12`: Spam
- `13`: InformModerator
- CreationDate
- UserId (only for VoteTypeId 5)
- BountyAmount (only for VoteTypeId 9)
For example, each record is a single XML “row” element whose attributes are the fields listed above.
The data format of Comments is:
- **comments**.xml
- Id
- PostId
- Score
- Text, e.g.: "@Stu Thompson: Seems possible to me - why not
try it?"
- CreationDate, e.g.:"2008-09-06T08:07:10.730"
- UserId
For example, each record is again a single XML “row” element with these fields as attributes.
HBase Data Bulk Load
Import “Votes” as a table in HBase.
1. HBase will use a “staging” folder to store temporary data, and we need to
configure this directory for HBase. Create a folder /tmp/hbase-staging in
HDFS, and change its mode to 711 (i.e., rwx--x--x).
$ hdfs dfs -mkdir /tmp/hbase-staging
$ hdfs dfs -chmod 711 /tmp/hbase-staging
Add the following lines to $HBASE_HOME/conf/hbase-site.xml (in between <configuration> and </configuration>):
<property>
<name>hbase.bulkload.staging.dir</name>
<value>/tmp/hbase-staging</value>
</property>
<property>
<name>hbase.coprocessor.region.classes</name>
<value>org.apache.hadoop.hbase.security.token.TokenProvider,org.apache.hadoop.hbase.security.access.AccessController,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint</value>
</property>
In your MapReduce code, you need to configure the two properties:
“hbase.fs.tmp.dir” and “hbase.bulkload.staging.dir”. After creating a
Configuration object, you need to:
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.fs.tmp.dir", "/tmp/hbase-staging");
conf.set("hbase.bulkload.staging.dir", "/tmp/hbase-staging");
2. The code for bulk loading Votes into HBase is available at the course homepage, i.e., “Vote.java” and “HBaseBulkLoadExample.java”. Some explanations of the code:
Only the mapper is required in a bulk load, because the Reducer is handled by HBase and you configure it using HFileOutputFormat2.configureIncrementalLoad(). The map output key data type must be ImmutableBytesWritable, and the map output value must be a KeyValue, Put, or Delete object. In this example, you create a Put object, which will be used to insert the data into the HBase table.
The table can be created either using the HBase shell or the HBase Java API. In the given code, the table is created using the Java API.
In the example code, the class HBaseBulkLoadExample implements the
interface Tool, and the job is configured and started in the run() function.
Then, ToolRunner.run() is used to invoke HBaseBulkLoadExample.run(). You
can also configure and start the job in the main function, as you did in the
previous labs on MapReduce.
Before starting the job, you need to use HFileOutputFormat2.configureIncrementalLoad() to configure the bulk load. After the job is completed, that is, after the mapper has generated the Put objects for all input data, you use LoadIncrementalHFiles to do the bulk load. It is the tool that loads the output of HFileOutputFormat2 into an existing table.
3. After “Votes” is loaded into the table “votes”, open the HBase shell to
check the table and its contents.
Your Task: Import “Comments” as a table in HBase.
Create a class “HBaseBulkLoadComments.java” and a class “Comment.java”
in package “comp9313.lab6” to finish this task.
Use “Id” as the rowkey, and create three column families, “postInfo”
(containing PostId), “commentInfo” (containing Score, Text, and
CreationDate), and “userInfo” (containing “UserId”).
Read MapReduce Input from HBase
Problem 1.
Read input data from table “votes” in HBase, and count for each post the
number of each type of vote for this post. The output data is of format:
(PostID, {<VoteTypeId, count>}).
For example, if post with ID “1” has two votes, one is of type “1” and
another is of type “2”, then you should output (1, {<1, 1>, <2, 1>}).
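Before wiring this into the HBase classes, it can help to prototype the aggregation in plain Java. The sketch below (class and method names are illustrative, not part of the provided lab code) counts vote types per post from (PostId, VoteTypeId) pairs, which is the computation your job needs to perform:

```java
import java.util.*;

public class VoteTypeCount {

    // For each post, count how many votes of each type it received.
    // Input: (PostId, VoteTypeId) pairs; output: PostId -> (VoteTypeId -> count).
    static Map<String, Map<String, Integer>> countVoteTypes(List<String[]> votes) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String[] v : votes) {
            counts.computeIfAbsent(v[0], k -> new TreeMap<>())
                  .merge(v[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> votes = Arrays.asList(
                new String[]{"1", "1"},   // post 1, vote type 1
                new String[]{"1", "2"},   // post 1, vote type 2
                new String[]{"2", "2"});  // post 2, vote type 2
        System.out.println(countVoteTypes(votes)); // prints {1={1=1, 2=1}, 2={2=1}}
    }
}
```

In the real job, the mapper would emit (PostId, VoteTypeId) pairs read from the table, and the reducer would build the per-type counts for one post at a time.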
Please refer to https://hbase.apache.org/book.html#mapreduce.example for
the examples of HBase MapReduce read.
Hints:
1. Your mapper should extend TableMapper<K, V>. The input key data type is ImmutableBytesWritable, and the value data type is Result. Each call to map() reads one row from the HBase table, and you can use Result.getValue(CF, COLUMN) to get the value in a cell. Your mapper code will be like:
public static class AggregateMapper extends TableMapper<Text, Text> {
    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        … //do your job
    }
}
2. The reducer is just like a normal MapReduce reducer.
3. In the main function, you will need to use the function
TableMapReduceUtil.initTableMapperJob() to configure the mapper.
4. Because the data is read from HBase, you do not need to configure the
data input path. You only need to specify the output path in Eclipse.
The code “ReadHBaseExample.java” is available at the course webpage.
Try to write the mapper by yourself, and learn how to configure the HBase
read job from that file.
Problem 2:
Read input data from table “comments” in HBase, and calculate the number
of comments per UserID. Refer to the code “ReadHBaseExample.java” and
write your code in “ReadHBaseComment.java” in package
“comp9313.lab6”.
Write MapReduce Output to HBase
Problem 1.
Read input data from “Votes”, and count the number of votes per user. The result will be written to an HBase table ‘votestats’, rather than being stored in files generated by reducers.
Please refer to https://hbase.apache.org/book.html#mapreduce.example for
the examples of HBase MapReduce Write.
Hints:
1. The mapper is just like a normal MapReduce mapper.
2. Your reducer should extend TableReducer<K, V>. The output key data type is ImmutableBytesWritable (HBase ignores its value and uses the row key inside the emitted Put), and the output value must be a Put or Delete object. The reduce() function will aggregate the number of votes for each user. You need to create a Put object to store the information, and HBase will use this object to insert the information into table ‘votestats’. Your reducer code will be like:
public static class UserVotesReducer extends
        TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        … //do your job
    }
}
3. In the main function, you will need to use the function
TableMapReduceUtil.initTableReducerJob() to configure the reducer.
4. You can create the table in the main function, or using the HBase shell.
5. Because the data is written to HBase, you do not need to configure the
data output path. You only need to specify the input path in Eclipse.
The code “WriteHBaseExample.java” is available at the course webpage.
Try to write the reducer by yourself, and learn how to configure the HBase
write job from that file.
Problem 2:
Read input data from “Comments”, and calculate the average score of
comments for each question. The result will be written to an HBase table
“post_comment_score”, with only one column family “stats”.
Refer to the code “WriteHBaseExample.java” and write your code in
“WriteHBaseComment.java” in package “comp9313.lab6”.
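The per-post averaging itself can again be prototyped in plain Java before moving it into a reducer. This is an illustrative sketch (the names are made up for this exercise, not taken from the provided code):

```java
import java.util.*;

public class AvgCommentScore {

    // Average comment score per post, from (PostId, Score) pairs.
    static Map<String, Double> averageScores(List<String[]> comments) {
        Map<String, int[]> acc = new TreeMap<>(); // PostId -> {sum, count}
        for (String[] c : comments) {
            int[] a = acc.computeIfAbsent(c[0], k -> new int[2]);
            a[0] += Integer.parseInt(c[1]); // running sum of scores
            a[1]++;                         // number of comments seen
        }
        Map<String, Double> avg = new TreeMap<>();
        for (Map.Entry<String, int[]> e : acc.entrySet()) {
            avg.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return avg;
    }

    public static void main(String[] args) {
        List<String[]> comments = Arrays.asList(
                new String[]{"1", "4"}, new String[]{"1", "2"}, new String[]{"2", "5"});
        System.out.println(averageScores(comments)); // prints {1=3.0, 2=5.0}
    }
}
```

In the real job, the reducer receives all scores for one post and writes the average into the “stats” column family via a Put.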
Manage Data Using Hive
Hive Installation and Configuration
1. Download Hive 2.1.0
$ wget http://apache.uberglobalmirror.com/hive/stable-2/apache-hive-2.1.0-bin.tar.gz
Then unpack the package:
$ tar xvf apache-hive-2.1.0-bin.tar.gz
2. Define environment variables for Hive
We need to configure the working directory of Hive, i.e., HIVE_HOME.
Open the file ~/.bashrc and add the following lines at the end of this file:
export HIVE_HOME=~/apache-hive-2.1.0-bin
export PATH=$HIVE_HOME/bin:$PATH
Save the file, and then run the following command to take these
configurations into effect:
$ source ~/.bashrc
3. Create /tmp and /user/hive/warehouse in HDFS, and make them group-writable (chmod g+w) so that more than one user can use them:
$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /user/hive/warehouse
4. Run the schematool command to initialize Hive
$ schematool -dbType derby -initSchema
Now you have already done the basic configuration of Hive, and it is ready
to use. Start Hive shell by the following command (start HDFS and YARN
first!):
$ hive
Practice Hive
1. Download the test file “employees.txt” from the course webpage. The file contains only 7 records. Put the file in your home folder.
2. Create a database
$ hive> create database employee_data;
$ hive> use employee_data;
3. All databases are created under the /user/hive/warehouse directory.
$ hdfs dfs -ls /user/hive/warehouse
4. Create the employee table
$ hive> CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING,
zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Because '\001', '\002', '\003', and '\n' are the default delimiters, you can omit the “ROW FORMAT DELIMITED” clause. “STORED AS TEXTFILE” is also the default, and can be omitted as well.
5. Show all tables in the current database
$ hive> show tables;
6. Load data from local file system into table
$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt'
OVERWRITE INTO TABLE employees;
After loading the data into the table, you can check in HDFS what happened:
$ hdfs dfs -ls /user/hive/warehouse/employee_data.db/employees
The file employees.txt is copied into this folder corresponding to the table.
7. Check the data in the table
$ hive> select * from employees;
8. You can do various queries based on the employees table, just as in an
RDBMS. For example:
Question 1: show the number of employees and their average salary
Hint: use count() and avg()
Question 2: find the employee who has the highest salary
Hint: use max(), IN clause, and subquery in where clause
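One possible way to write these two queries, following the hints (a sketch; adjust to your table definition):

```sql
-- Question 1: number of employees and their average salary
SELECT count(*), avg(salary) FROM employees;

-- Question 2: the employee with the highest salary
-- (subquery in the WHERE clause, combined with IN)
SELECT name, salary
FROM employees
WHERE salary IN (SELECT max(salary) FROM employees);
```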
9. Usage of explode(). Find all employees who are subordinates of another person. explode() takes an array (or a map) as input and outputs the elements of the array (map) as separate rows.
$ hive> SELECT explode(subordinates) FROM employees;
10. Hive partitions. The table employees was not defined with partitions, and thus you cannot add a partition to it. You can only add a new partition to a table that has already been partitioned!
Create a table employees2, and load the same file into it.
$ hive> CREATE TABLE employees2 (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING,
zip:INT>
) PARTITIONED BY (join_year STRING);
$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt'
OVERWRITE INTO TABLE employees2 PARTITION (join_year='2015');
Now check HDFS again to see what happened:
$ hdfs dfs -ls /user/hive/warehouse/employee_data.db/employees2
You will see a folder “join_year=2015” created in this folder, corresponding to the partition join_year=“2015”.
Add a new partition join_year=“2016” to the table.
$ hive> ALTER TABLE employees2 ADD PARTITION (join_year='2016')
LOCATION '/user/hive/warehouse/employee_data.db/employees2/join_year=2016';
Check in HDFS, and you will see a new folder created for this partition.
11. Insert a record to partition join_year=“2016”.
Because Hive does not support literals for complex types (array, map, struct, union), it is not possible to use them in INSERT INTO ... VALUES clauses. You need to create a file to store the new record, and then load it into the partition.
$ cp employees.txt employees2016.txt
Then use vim or gedit to edit employees2016.txt to add some records, and
then load the file into the partition.
12. Query on a partition. Question: find all employees joined in the year
2016 whose salary is more than 60000.
13. (optional) Do word count in Hive, using the file employees.txt.
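For the optional word count, a common approach in Hive is to load each line into a single-column table and combine split() with explode(). This is a sketch (the table name docs and whitespace splitting are assumptions; employees.txt also contains '\001'-style delimiters, which this simple split does not treat specially):

```sql
CREATE TABLE docs (line STRING);
LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt' OVERWRITE INTO TABLE docs;

SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word;
```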
Manage Data Using Pig
Pig Installation and Configuration
1. Download Pig 0.16.0
$ wget http://mirror.ventraip.net.au/apache/pig/pig-0.16.0/pig-0.16.0.tar.gz
Then unpack the package:
$ tar xvf pig-0.16.0.tar.gz
2. Define environment variables for Pig
We need to configure the working directory of Pig, i.e., PIG_HOME.
Open the file ~/.bashrc and add the following lines at the end of this file:
export PIG_HOME=~/pig-0.16.0
export PATH=$PIG_HOME/bin:$PATH
Save the file, and then run the following command to take these
configurations into effect:
$ source ~/.bashrc
3. Now you have already done the basic configuration of Pig, and it is ready
to use. Start Pig Grunt shell by the following command (start HDFS and
YARN first!):
$ pig
Practice Pig
1. Download the test file “NYSE_dividends.txt” from the course webpage. The file contains 670 records. Put the file into HDFS.
$ hdfs dfs -put NYSE_dividends.txt
Start the Hadoop job history server.
$ mr-jobhistory-daemon.sh start historyserver
2. Load the data using the load command, with schema (exchange, symbol, date, dividend).
$ grunt> dividends = load 'NYSE_dividends.txt' as (exchange:chararray,
symbol:chararray, date:chararray, dividend:float);
$ grunt> dump dividends;
You should see the contents of the dividends relation printed as (exchange, symbol, date, dividend) tuples.
3. Group rows by symbol.
$ grunt> grouped = group dividends by symbol;
4. Compute the average dividend for each symbol. The dividend value is obtained using the expression dividends.dividend (or dividends.$3). Store this result in a relation avg.
$ grunt> avg = foreach grouped generate group, AVG(dividends.$3);
Use dump to check the contents of “avg”; you should see one (symbol, average dividend) tuple per symbol.
5. Store result avg into HDFS using store command
$ grunt> store avg into 'average_dividend';
6. Check the stored result in HDFS using the fs command:
$ grunt> fs -cat /user/comp9313/average_dividend/*
7. (optional) Do word count in Pig, using the file employees.txt.
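For the optional word count in Pig, a minimal sketch (assuming employees.txt has been put into HDFS; TOKENIZE splits each line on whitespace) is:

```pig
lines = load 'employees.txt' as (line:chararray);
words = foreach lines generate flatten(TOKENIZE(line)) as word;
grouped = group words by word;
wordcount = foreach grouped generate group, COUNT(words);
dump wordcount;
```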
More Practices
More practice with Hive and Pig is included in the second assignment.