Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Essentials of Hive Mastering Hadoop Map-reduce for Data Analysis Shashank Tiwari blog: shanky.org | twitter: @tshanky st@treasuryofideas.com
SDEC2011 Essentials of Hive

Nov 01, 2014

Korea Sdec

Transcript
Page 1: SDEC2011 Essentials of Hive

Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.

Essentials of Hive
Mastering Hadoop Map-reduce for Data Analysis

Shashank Tiwari
blog: shanky.org | twitter: @tshanky | st@treasuryofideas.com

Page 2: SDEC2011 Essentials of Hive

What is Hive?

• A data warehouse system for Hadoop

• Facilitates data summarization and ad-hoc queries

• Allows SQL-like querying with HiveQL by projecting structure (metadata) onto data stored in HDFS

• Custom mappers and reducers can also be plugged in

Page 3: SDEC2011 Essentials of Hive

Supported Platforms

• Linux/Unix and Mac OSX

• Does not work on Cygwin

Page 4: SDEC2011 Essentials of Hive

Required Software

• Java 1.6.x

• Hadoop 0.17.x to 0.20.x

Page 5: SDEC2011 Essentials of Hive

Download

• Source: http://hive.apache.org/releases.html

• Version:

• hive-0.7.0

• Both binary and source distributions available

Page 6: SDEC2011 Essentials of Hive

Install

• Extract: tar zxvf hive-0.7.0-bin.tar.gz

• Move and Create Symbolic Link: ln -s hive-0.7.0-bin hive

• Set environment variable HIVE_HOME to point to the hive directory

• Add $HIVE_HOME/bin to your PATH environment variable

Page 7: SDEC2011 Essentials of Hive

Build From Source

• $ svn co http://svn.apache.org/repos/asf/hive/trunk hive

• $ cd hive

• $ ant clean package

• The binary distribution is in build/dist

Page 8: SDEC2011 Essentials of Hive

Hive Needs Hadoop

• Hive runs on top of Hadoop, so a working Hadoop installation is required

• Add Hadoop distribution to your path or set HADOOP_HOME

• Start Hadoop daemons

• bin/start-all.sh

Page 9: SDEC2011 Essentials of Hive

Configure Hive

• Create /tmp in HDFS and set appropriate permissions

• bin/hadoop fs -mkdir /tmp

• bin/hadoop fs -chmod g+w /tmp

• Create /user/hive/warehouse and set appropriate permissions

• bin/hadoop fs -mkdir /user/hive/warehouse

• bin/hadoop fs -chmod g+w /user/hive/warehouse

Page 10: SDEC2011 Essentials of Hive

Default Hive Configuration

• Default configuration: conf/hive-default.xml

• Override default configuration by redefining properties in:

• conf/hive-site.xml

• Set HIVE_CONF_DIR to set a new location for the config file

• Hive configuration is an overlay on top of Hadoop configuration
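
The overlay idea can be illustrated outside Hive. The following Python sketch is only an analogy, assuming that more specific layers win over more general ones (Hadoop settings first, then hive-default.xml, then hive-site.xml). The property values are made up for illustration and nothing here reads real configuration files.

```python
from collections import ChainMap

# Hypothetical property values, for illustration only.
hadoop_conf = {"fs.default.name": "hdfs://localhost:9000"}
hive_default = {
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "hive.exec.scratchdir": "/tmp/hive-default",
}
hive_site = {"hive.exec.scratchdir": "/tmp/hive-custom"}

# ChainMap searches left to right, so the most specific layer goes first.
effective = ChainMap(hive_site, hive_default, hadoop_conf)


def lookup(prop):
    """Return the effective value of a property, or None if it is unset."""
    return effective.get(prop)


print(lookup("hive.exec.scratchdir"))
```

Here the value from the hive_site layer shadows the one from hive_default, while properties defined only in a lower layer remain visible.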

Page 11: SDEC2011 Essentials of Hive

Hive Configuration Manipulation

• Edit: conf/hive-site.xml

• Use SET command on the Hive cli

• Pass parameters to Hive

• bin/hive -hiveconf prop1=val1 -hiveconf prop2=val2

• Or set the HIVE_OPTS environment variable to "-hiveconf prop1=val1 -hiveconf prop2=val2"

Page 12: SDEC2011 Essentials of Hive

Hive by Example -- Getting Started

• Start the cli: bin/hive

• Basic DDL statements

• List the existing tables

• SHOW TABLES;

Page 13: SDEC2011 Essentials of Hive

Create Table

• CREATE TABLE books (isbn INT, title STRING);

• DESCRIBE books;

• isbn int

• title string

• CREATE TABLE users (id INT, name STRING) PARTITIONED BY (vcol STRING);

• What is PARTITIONED BY (vcol STRING)?

Page 14: SDEC2011 Essentials of Hive

Logical Table Partitions

• A Hive table can be logically partitioned by a virtual column

• The virtual column's value is derived from the partition in which the data is stored

• A table can have multiple partitions

• Each partition is uniquely identified by a virtual column value
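
One way to picture the virtual column: Hive keeps each partition in its own subdirectory named column=value under the table's directory, so the value can be read from the path rather than from the data file. The Python sketch below only mimics that idea. The helper name partition_values and the example path are hypothetical.

```python
def partition_values(path):
    """Extract partition column values from a Hive-style directory path."""
    values = {}
    for part in path.strip("/").split("/"):
        if "=" in part:
            column, _, value = part.partition("=")
            values[column] = value
    return values


# Hypothetical location of one partition of the users table.
example = "/user/hive/warehouse/users/vcol=2011-06/data.txt"
print(partition_values(example))
```

A path with no column=value component, such as the directory of an unpartitioned table, yields an empty dictionary.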

Page 15: SDEC2011 Essentials of Hive

Alter Table

• ALTER TABLE books ADD COLUMNS (author STRING, category STRING);

• Change Column Property

• ALTER TABLE table_name CHANGE [COLUMN] old_column_name new_column_name column_type [COMMENT column_comment] [FIRST|AFTER column_name]

Page 16: SDEC2011 Essentials of Hive

Alter Table Column Property

• ALTER TABLE books CHANGE author author ARRAY<STRING> COMMENT "multi-valued";

• Both the old and the new column name must be specified, even when the name stays the same

• Here only the data type changes, from STRING to ARRAY<STRING>

Page 17: SDEC2011 Essentials of Hive

Data Types Supported

• Primitives: INT, STRING, etc...

• Complex types: maps, arrays, structs

Page 18: SDEC2011 Essentials of Hive

Rename Table

• ALTER TABLE books RENAME TO published_contents;

• DESCRIBE published_contents;

• DESCRIBE books; (execution error: no table exists under the old name any more)

Page 19: SDEC2011 Essentials of Hive

Drop Tables

• DROP TABLE published_contents;

• DROP TABLE users;

Page 20: SDEC2011 Essentials of Hive

GroupLens Example -- Getting the Data Set

• Movie ratings -- 1 million records

• Available in tar.gz format: million-ml-data.tar__0.gz

• Extract: tar zxvf million-ml-data.tar__0.gz

Page 21: SDEC2011 Essentials of Hive

Loading Rating Data

• Format of data in ratings.dat:

• UserID::MovieID::Rating::Timestamp

• Replace the delimiter '::' with '#', for example in vi:

• :%s/::/#/g

• Save as ratings.dat.hash_delimited
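
The substitution above is a vi command. For files too large to edit comfortably, the same replacement can be scripted. This is a minimal Python sketch and not the author's method. The function name and the file names in the usage comment are assumptions.

```python
def convert_delimiter(src_path, dst_path, old="::", new="#"):
    """Copy src_path to dst_path, replacing the field delimiter line by line.

    latin-1 is used because it can decode any byte sequence; adjust the
    encoding if the actual encoding of the data files is known.
    Returns the number of lines written.
    """
    count = 0
    with open(src_path, "r", encoding="latin-1") as src, \
         open(dst_path, "w", encoding="latin-1") as dst:
        for line in src:
            dst.write(line.replace(old, new))
            count += 1
    return count


# Example call (paths are assumptions):
# convert_delimiter("ratings.dat", "ratings.dat.hash_delimited")
```

Processing one line at a time keeps memory use flat regardless of file size.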

Page 22: SDEC2011 Essentials of Hive

Creating Metadata and Loading the File

• hive> CREATE TABLE ratings( userid INT, movieid INT, rating INT, tstamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;

• LOAD DATA LOCAL INPATH '<path/to/flat/file>' OVERWRITE INTO TABLE <table_name>;

Page 23: SDEC2011 Essentials of Hive

File Load Properties

• No validation on load. It is the developer's responsibility to make sure the file matches the table schema.

• Data can be on the local filesystem or on HDFS

• Data is copied (from the local filesystem) or moved (from HDFS) into the Hive warehouse namespace in HDFS

• If OVERWRITE is not specified, the data is appended
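
Because nothing is checked at load time, a mismatched file only shows up later, at query time. A quick pre-load check can catch problems earlier. The Python sketch below is one assumed approach and is not part of Hive. The helper name find_bad_rows is hypothetical, and the schema is expressed as a list of Python converters mirroring the ratings table.

```python
def find_bad_rows(lines, converters, delimiter="#"):
    """Return (line_number, reason) for each row that does not fit the schema."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != len(converters):
            problems.append((number, "expected %d fields, got %d"
                             % (len(converters), len(fields))))
            continue
        for position, (field, convert) in enumerate(zip(fields, converters)):
            try:
                convert(field)
            except ValueError:
                problems.append((number, "field %d is not a valid %s"
                                 % (position, convert.__name__)))
                break
    return problems


# ratings table: userid INT, movieid INT, rating INT, tstamp STRING
ratings_schema = [int, int, int, str]
# Made-up sample rows: one good, one with a bad type, one too short.
sample = ["1#1193#5#978300760\n", "2#661#three#978302109\n", "3#914\n"]
print(find_bad_rows(sample, ratings_schema))
```

In practice the lines argument can be an open file object, so the check streams through the file without loading it into memory.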

Page 24: SDEC2011 Essentials of Hive

Rating Data Load

• hive> LOAD DATA LOCAL INPATH '/path/to/ratings.dat.hash_delimited' OVERWRITE INTO TABLE ratings;

Page 25: SDEC2011 Essentials of Hive

A SQL Style Query

• SELECT COUNT(*) FROM ratings;

Page 26: SDEC2011 Essentials of Hive

Loading movies and users data

• Now load the movies and users data in the same way as the ratings data.

• Details on the console...

• CREATE TABLE users_2(userid INT, gender STRING, age INT, occupation STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;

• add FILE /Users/tshanky/workspace/hadoop_workspace/hive_workspace/occupation_mapper.py;

• INSERT OVERWRITE TABLE users_2 SELECT TRANSFORM (userid, gender, age, occupation, zipcode) USING 'python occupation_mapper.py' AS (userid, gender, age, occupation_str, zipcode) FROM users;
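
The slides do not show occupation_mapper.py itself. A script used with TRANSFORM reads tab-separated rows on standard input and writes tab-separated rows to standard output, so a minimal version might look like the sketch below. The code-to-label table is a partial, illustrative assumption and should be replaced with the mapping from the data set's README.

```python
import sys

# Partial, illustrative mapping of occupation codes to labels (assumption).
OCCUPATIONS = {
    "0": "other",
    "1": "academic/educator",
    "2": "artist",
    "12": "programmer",
}


def transform_line(line):
    """Replace the occupation code in one tab-separated row with its label."""
    userid, gender, age, occupation, zipcode = line.rstrip("\n").split("\t")
    label = OCCUPATIONS.get(occupation, "unknown")
    return "\t".join([userid, gender, age, label, zipcode])


if __name__ == "__main__":
    # Hive streams the selected columns to stdin, one row per line.
    for row in sys.stdin:
        print(transform_line(row))
```

Keeping the per-row logic in its own function makes the script easy to try locally, for example by piping a few tab-separated lines into it, before registering it with add FILE.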

Page 27: SDEC2011 Essentials of Hive

Good Old SQL

• SELECT * FROM movies LIMIT 5;

• SELECT * FROM ratings WHERE movieid = 1;

• SELECT COUNT(*) FROM ratings WHERE movieid < 10;

• SELECT COUNT(*) FROM ratings WHERE movieid = 1 and rating = 5;

• SELECT title FROM movies WHERE title RLIKE '^Toy+';

Page 28: SDEC2011 Essentials of Hive

More Than Good Old SQL

• SELECT `.+(id)` FROM ratings WHERE movieid = 1;

• A regular expression in backticks selects every column whose name matches (here, userid and movieid)

• SELECT ratings.rating, COUNT(ratings.rating) FROM ratings WHERE movieid = 1 GROUP BY ratings.rating; (group by)

• SELECT * FROM movies ORDER BY movieid DESC;

• DISTRIBUTE BY & SORT BY (CLUSTER BY) -- rows are distributed to reducers by key and sorted within each reducer, not globally
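
CLUSTER BY routes rows to reducers by key and sorts them within each reducer, which differs from the single total order that ORDER BY produces. This Python sketch is a simplified model of that behaviour, assuming rows are routed by key modulo the number of reducers (Hive applies a hash function to the key). All names and the sample rows are hypothetical.

```python
def cluster_by(rows, key, num_reducers):
    """Route rows to reducers by key, then sort each reducer's rows by key."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        # DISTRIBUTE BY: equal keys always land on the same reducer.
        buckets[row[key] % num_reducers].append(row)
    # SORT BY: ordering is per reducer only, not across reducers.
    return [sorted(bucket, key=lambda r: r[key]) for bucket in buckets]


rows = [{"movieid": 4}, {"movieid": 1}, {"movieid": 3}, {"movieid": 2}]
print(cluster_by(rows, "movieid", 2))
```

Each inner list is sorted, but concatenating the lists does not give a globally sorted result, which is why ORDER BY needs a single reducer for a total order.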

Page 29: SDEC2011 Essentials of Hive

JOIN(s) in HiveQL

• equality joins, outer joins, left semi-joins

• SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) LIMIT 5;

• More than 2 tables:

• SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title, users.gender FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) JOIN users ON (ratings.userid = users.userid) LIMIT 5;
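
To make the semantics of an equality join concrete, here is a small in-memory model in Python. It is a sketch only: Hive executes joins as MapReduce jobs over files in HDFS, not like this. The function name equi_join and the sample rows are illustrative assumptions.

```python
from collections import defaultdict


def equi_join(left, right, left_key, right_key):
    """Inner-join two lists of dicts on equal key values."""
    # Index the right-hand table by its join key.
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    joined = []
    for row in left:
        # Rows with no match on the other side are dropped (inner join).
        for match in index.get(row[left_key], []):
            merged = dict(match)
            merged.update(row)
            joined.append(merged)
    return joined


# Made-up sample rows standing in for the ratings and movies tables.
ratings = [
    {"userid": 1, "movieid": 1, "rating": 5},
    {"userid": 2, "movieid": 9, "rating": 3},
]
movies = [{"movieid": 1, "title": "Toy Story (1995)"}]
print(equi_join(ratings, movies, "movieid", "movieid"))
```

The second rating has no matching movie, so it does not appear in the result. An outer join would keep it with the movie columns left empty.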

Page 31: SDEC2011 Essentials of Hive

Explain Plan: A Look at the MapReduce Under the Hood

• EXPLAIN SELECT COUNT(*) FROM ratings WHERE movieid = 1 and rating = 5;
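
The plan printed by EXPLAIN describes map and reduce stages. As a rough mental model only, and not Hive's generated code, the query can be thought of as a map phase that filters rows and emits partial counts, followed by a reduce phase that sums them. The Python sketch below uses hypothetical function names and made-up in-memory rows in place of HDFS input splits.

```python
def map_phase(split):
    """Filter one input split and emit a partial count of matching rows."""
    return sum(1 for (userid, movieid, rating, tstamp) in split
               if movieid == 1 and rating == 5)


def reduce_phase(partial_counts):
    """Combine the partial counts produced by all mappers."""
    return sum(partial_counts)


# Two made-up splits of (userid, movieid, rating, tstamp) rows.
splits = [
    [(1, 1, 5, "978824268"), (2, 1, 4, "978299026")],
    [(3, 1, 5, "978297512"), (4, 2, 5, "978300275")],
]
print(reduce_phase(map_phase(s) for s in splits))
```

Each mapper sees only its own split, so the filter runs in parallel and only the small partial counts travel to the reducer.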

Page 32: SDEC2011 Essentials of Hive

Questions?

• blog: shanky.org | twitter: @tshanky

• st@treasuryofideas.com