Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Essentials of Hive Mastering Hadoop Map-reduce for Data Analysis Shashank Tiwari blog: shanky.org | twitter: @tshanky st@treasuryofideas.com
What is Hive?
• A data warehouse system for Hadoop
• Facilitates data summarization and ad-hoc queries
• Allows SQL-like querying using HiveQL, by imposing structure (metadata) on data stored in HDFS
• Can also plug in custom mappers and reducers
Supported Platforms
• Linux/Unix and Mac OSX
• Does not work on Cygwin
Required Software
• Java 1.6.x
• Hadoop 0.17.x to 0.20.x
Install
• Extract: tar zxvf hive-0.7.0-bin.tar.gz
• Move and Create Symbolic Link: ln -s hive-0.7.0-bin hive
• Set environment variable HIVE_HOME to point to the hive directory
• Add $HIVE_HOME/bin to your PATH environment variable
Build From Source
• $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
Hive Needs Hadoop
• Hive runs on top of an existing Hadoop installation
• Add Hadoop distribution to your path or set HADOOP_HOME
• Start Hadoop daemons
• bin/start-all.sh
Configure Hive
• Create /tmp in HDFS and set appropriate permissions
• bin/hadoop fs -mkdir /tmp
• bin/hadoop fs -chmod g+w /tmp
• Create /user/hive/warehouse and set appropriate permissions
• bin/hadoop fs -mkdir /user/hive/warehouse
• bin/hadoop fs -chmod g+w /user/hive/warehouse
Default Hive Configuration
• Default configuration: conf/hive-default.xml
• Override default configuration by redefining properties in:
• conf/hive-site.xml
• Set HIVE_CONF_DIR to set a new location for the config file
• Hive configuration is an overlay on top of the Hadoop configuration
• Set HIVE_OPTS to "-hiveconf prop1=val1 -hiveconf prop2=val2" to pass configuration properties to the Hive command line
Hive by Example -- Getting Started
• Start the cli: bin/hive
• Basic DDL statements
• List the existing tables
• SHOW TABLES;
Create Table
• CREATE TABLE books (isbn INT, title STRING);
• DESCRIBE books;
• isbn int
• title string
• CREATE TABLE users (id INT, name STRING) PARTITIONED BY (vcol STRING);
• What does PARTITIONED BY (vcol STRING) mean?
Logical Table Partitions
• A Hive table can be logically partitioned by a virtual column
• The virtual column's value is derived from the partition in which the data is stored
• A table can have multiple partitions
• Each partition is uniquely identified by a virtual column value
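To make the idea of a virtual column concrete: Hive keeps each partition in its own subdirectory named `column=value` under the table's directory, so the partition value never has to be written inside the data file. The sketch below only illustrates that naming scheme. The function name `partition_path` and the sample value `2011-01` are made up for illustration, and the escaping Hive applies to special characters is ignored.

```python
def partition_path(warehouse_dir, table, partition_spec):
    """Build the directory that would hold one partition of a table.

    partition_spec is an ordered list of (column, value) pairs.
    Illustration only: real Hive also escapes special characters.
    """
    parts = ["{}={}".format(col, val) for col, val in partition_spec]
    return "/".join([warehouse_dir.rstrip("/"), table] + parts)


if __name__ == "__main__":
    # Hypothetical partition value for the users table created above.
    print(partition_path("/user/hive/warehouse", "users", [("vcol", "2011-01")]))
```

Every row read from that directory would report the directory's value as its `vcol`, which is why the column is called virtual.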
Alter Table
• ALTER TABLE books ADD COLUMNS (author STRING, category STRING);
Alter Table Column Property
• ALTER TABLE books CHANGE author author ARRAY<STRING> COMMENT "multi-valued";
• Both the old and the new column name must be specified (here they are the same)
• The data type changes from STRING to ARRAY<STRING>
Data Types Supported
• Primitives: INT, STRING, etc...
• Complex types: map, array, struct
Rename Table
• ALTER TABLE books RENAME TO published_contents;
• DESCRIBE published_contents;
• DESCRIBE books; (Execution error!)
Drop Tables
• DROP TABLE published_contents;
• DROP TABLE users;
GroupLens Example -- Getting the Data Set
• Movie ratings -- 1 million records
• Available in tar.gz format: million-ml-data.tar__0.gz
• Extract: tar zxvf million-ml-data.tar__0.gz
Loading Rating Data
• Format of data in ratings.dat:
• UserID::MovieID::Rating::Timestamp
• Replace the delimiter '::' with '#'
• In vi: :%s/::/#/g
• Save as ratings.dat.hash_delimited
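For files too large to edit comfortably in an editor, the same substitution can be scripted. This is a minimal sketch, not part of the original walkthrough: the function names are made up, the `latin-1` encoding is an assumption chosen so that any byte value survives the round trip, and the sample line in the comment is illustrative only. It also assumes no field already contains a '#'.

```python
def convert_delimiter(line, old="::", new="#"):
    """Replace every occurrence of the old field delimiter in one line."""
    return line.replace(old, new)


def convert_file(src_path, dst_path, old="::", new="#"):
    """Copy src_path to dst_path, line by line, with the delimiter replaced."""
    # latin-1 is an assumption: it maps every byte to a character and back.
    with open(src_path, encoding="latin-1") as src, \
            open(dst_path, "w", encoding="latin-1") as dst:
        for line in src:
            dst.write(convert_delimiter(line, old, new))


# Example (illustrative line in the UserID::MovieID::Rating::Timestamp format):
# convert_delimiter("1::1193::5::978300760") gives "1#1193#5#978300760"
```

Calling `convert_file("ratings.dat", "ratings.dat.hash_delimited")` would produce the file name used in the load step.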
Creating Metadata and Loading the File
• hive> CREATE TABLE ratings( userid INT, movieid INT, rating INT, tstamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
• LOAD DATA LOCAL INPATH '<path/to/flat/file>' OVERWRITE INTO TABLE <table_name>;
File Load Properties
• No validation at load time: it is the developer's responsibility to make sure the file matches the table schema
• Data can be on the local filesystem or on HDFS
• Local data is copied into the Hive warehouse in HDFS; data already in HDFS is moved there
• If OVERWRITE is not specified, the data is appended to the table
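Because the load step does not check the file against the schema, a quick check before loading can catch malformed rows early. The sketch below is one possible pre-load check for the ratings table defined earlier (three INT columns followed by a STRING, '#'-delimited). The function name `invalid_lines` is hypothetical, and the rules it applies are an assumption about what "matches the schema" should mean here.

```python
def invalid_lines(lines, delimiter="#", expected_fields=4, int_fields=(0, 1, 2)):
    """Yield (line_number, line) for rows that do not fit the ratings schema."""
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != expected_fields:
            yield number, line
            continue
        try:
            # userid, movieid and rating are declared INT in the table.
            for index in int_fields:
                int(fields[index])
        except ValueError:
            yield number, line


if __name__ == "__main__":
    sample = ["1#2#3#4\n", "1#2#x#4\n", "1#2#3\n"]  # made-up rows
    for number, line in invalid_lines(sample):
        print("line {}: {!r}".format(number, line))
```

Without such a check, values that cannot be converted to the declared column type typically surface later as NULLs in query results rather than as load errors.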
Rating Data Load
• hive> LOAD DATA LOCAL INPATH '/path/to/ratings.dat.hash_delimited'
• > OVERWRITE INTO TABLE ratings;
A SQL Style Query
• SELECT COUNT(*) FROM ratings;
Loading movies and users data
• Now load the movies and users data in the same way as the ratings data.
• Details on the console...
• CREATE TABLE users_2(userid INT, gender STRING, age INT, occupation STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
• INSERT OVERWRITE TABLE users_2 SELECT TRANSFORM (userid, gender, age, occupation, zipcode) USING 'python occupation_mapper.py' AS (userid, gender, age, occupation_str, zipcode) FROM users;
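The slides do not show occupation_mapper.py itself, so the following is only a guess at its shape. With TRANSFORM, Hive streams each row to the script on standard input as tab-separated text and reads tab-separated rows back from standard output. The occupation table below is a small illustrative subset, not the full mapping; consult the data set's README for the real codes.

```python
import sys

# Illustrative subset only (assumption); the real list comes from the data set's README.
OCCUPATIONS = {
    "0": "other",
    "1": "academic/educator",
    "12": "programmer",
}


def map_row(line):
    """Turn one tab-separated users row into a row with a readable occupation."""
    userid, gender, age, occupation, zipcode = line.rstrip("\n").split("\t")
    occupation_str = OCCUPATIONS.get(occupation, "unknown")
    return "\t".join([userid, gender, age, occupation_str, zipcode])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from Hive on stdin and write transformed rows to stdout."""
    for line in stdin:
        if line.strip():
            stdout.write(map_row(line) + "\n")


if __name__ == "__main__":
    main()
```

On a cluster, the script usually has to be shipped to the task nodes first, for example with `ADD FILE occupation_mapper.py;` in the Hive session before running the INSERT.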
Good Old SQL
• SELECT * FROM movies LIMIT 5;
• SELECT * FROM ratings WHERE movieid = 1;
• SELECT COUNT(*) FROM ratings WHERE movieid < 10;
• SELECT COUNT(*) FROM ratings WHERE movieid = 1 and rating = 5;
• SELECT title FROM movies WHERE title RLIKE '^Toy+';
More Than Good Old SQL
• SELECT `.+(id)` FROM ratings WHERE movieid = 1;
• regular expression based search on column name
• SELECT ratings.rating, COUNT(ratings.rating) FROM ratings WHERE movieid = 1 GROUP BY ratings.rating; (group by)
• SELECT * FROM movies ORDER BY movieid DESC;
• DISTRIBUTE BY and SORT BY (combined: CLUSTER BY) -- distribute rows across reducers and sort within each reducer
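The regular-expression column selection can be tried outside Hive. Hive evaluates the backquoted pattern as a Java regular expression against each column name; the sketch below uses Python's `re` module as a stand-in, which behaves the same for a simple pattern such as `.+(id)`. Treating the match as a whole-name match is an assumption about Hive's behaviour, and `select_columns` is a made-up helper name.

```python
import re


def select_columns(columns, pattern):
    """Return the column names that the pattern matches in full."""
    regex = re.compile(pattern)
    return [name for name in columns if regex.fullmatch(name)]


if __name__ == "__main__":
    ratings_columns = ["userid", "movieid", "rating", "tstamp"]
    # Keeps the names ending in "id" that have at least one character before it.
    print(select_columns(ratings_columns, ".+(id)"))
```

A pattern that begins with `*` is rejected by both Java and Python because there is nothing for the `*` to repeat, so the leading `.` matters.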
JOIN(s) in HiveQL
• equality joins, outer joins, left semi-joins
• SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) LIMIT 5;
• More than 2 tables:
• SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title, users.gender FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) JOIN users ON (ratings.userid = users.userid) LIMIT 5;
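For readers new to joins, the first query's result can be pictured with a few lines of Python. This is a sketch of what an equality (inner) join returns, not of how Hive executes it. It assumes `movieid` is unique in movies, and the rows in the demo are made up.

```python
def equi_join(ratings, movies):
    """Inner-join ratings to movies on movieid and project four columns."""
    # Build a lookup from the smaller table (assumes movieid is unique there).
    titles = {movie["movieid"]: movie["title"] for movie in movies}
    for rating in ratings:
        if rating["movieid"] in titles:  # rows without a match are dropped
            yield (rating["userid"], rating["rating"], rating["tstamp"],
                   titles[rating["movieid"]])


if __name__ == "__main__":
    ratings = [  # made-up rows
        {"userid": 1, "movieid": 10, "rating": 5, "tstamp": "100"},
        {"userid": 2, "movieid": 99, "rating": 3, "tstamp": "200"},
    ]
    movies = [{"movieid": 10, "title": "Example Movie"}]
    for row in equi_join(ratings, movies):
        print(row)
```

Joining a third table, as in the second query, amounts to applying the same lookup step again on `userid`.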
Explain Plan: MapReduce Under the Hood
• EXPLAIN SELECT COUNT(*) FROM ratings WHERE movieid = 1 and rating = 5;
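Before reading the EXPLAIN output, it can help to picture the query as a map step and a reduce step. The sketch below is a conceptual model only, under the assumption that the filter runs on the map side and the count is summed on the reduce side; it does not reproduce the plan Hive prints, and the function names and sample rows are made up.

```python
def map_phase(rows):
    """Emit a 1 for every row that passes the WHERE clause."""
    for userid, movieid, rating, tstamp in rows:
        if movieid == 1 and rating == 5:
            yield 1


def reduce_phase(values):
    """Sum the emitted values to obtain COUNT(*)."""
    total = 0
    for value in values:
        total += value
    return total


def count_matching(rows):
    """Run the two phases back to back on an in-memory list of rows."""
    return reduce_phase(map_phase(rows))


if __name__ == "__main__":
    sample = [  # made-up (userid, movieid, rating, tstamp) rows
        (1, 1, 5, "100"),
        (2, 1, 4, "200"),
        (3, 2, 5, "300"),
        (4, 1, 5, "400"),
    ]
    print(count_matching(sample))
```

In the EXPLAIN output, the corresponding pieces typically appear as a filter operator in the map stage and a group-by (aggregation) operator that produces the final count.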