Learning Apache HIVE - Data Warehouse and Query Language for Hadoop

©2012, Cognizant

Data Warehouse and Query Language for Hadoop

August 2013, by Someshwar Kale


HIVE

Data Warehousing Solution built on top of Hadoop

Provides a SQL-like query language named HiveQL
– Minimal learning curve for people with SQL expertise
– Data analysts are the target audience

Early Hive development work started at Facebook in 2007

Today, Facebook counts 29% of its employees (and growing!) as Hive users.

https://www.facebook.com/note.php?note_id=114588058858

Today Hive is a top-level Apache project (it began as a Hadoop subproject)
– http://hive.apache.org





Hive Provides


• Ability to bring structure to various data formats

• Simple interface for ad hoc querying, analyzing and summarizing large amounts of data

• Access to files on various data stores such as HDFS and HBase


Hive

Hive does NOT provide low-latency or real-time queries.

Even querying small amounts of data may take minutes.

Designed for scalability and ease-of-use rather than low latency responses


Hive

Translates HiveQL statements into a set of MapReduce Jobs which are then executed on a Hadoop Cluster.
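To make the translation step concrete, here is a minimal sketch in plain Python (not Hive code) of how a grouped count such as SELECT user, COUNT(*) ... GROUP BY user can be expressed as a map phase, a shuffle, and a reduce phase. The sample rows and function names are hypothetical, and the real plans Hive generates are considerably more involved.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit (key, 1) for every input row; the key is the GROUP BY column."""
    for user, _title in rows:
        yield user, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical sample rows: (user, post title)
posts = [("user1", "Funny Story"), ("user2", "Cool Deal"), ("user1", "Another Story")]
counts = reduce_phase(shuffle(map_phase(posts)))
print(counts)
```

Each phase only needs to see one row or one key at a time, which is what lets the work be spread across a cluster.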


Hive Metastore

To support features like schemas and data partitioning, Hive keeps its metadata in a relational database

• Packaged with Derby, a lightweight embedded SQL database
• The default Derby-based metastore is good for evaluation and testing
• The schema is not shared between users, as each user has their own instance of embedded Derby
• Stored in the metastore_db directory, which resides in the directory that Hive was started from
• Can easily be switched to another SQL installation such as MySQL


Metastore Deployment Modes : Embedded Mode

Default metastore deployment mode for CDH.

Both the database and the metastore service run embedded in the main HiveServer process

Both are started for you when you start the HiveServer process.

Supports only one active user at a time and is not certified for production use.


Metastore Deployment Modes : Local Mode

Hive metastore service runs in the same process as the main HiveServer process.

The metastore database runs in a separate process, and can be on a separate host.

The embedded metastore service communicates with the metastore database over JDBC.


Metastore Deployment Modes : Remote Mode


Hive Architecture


Hive Interface Options

• Command Line Interface (CLI)
– Used exclusively in these slides

• Hive Web Interface
– https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface

• Java Database Connectivity (JDBC)
– https://cwiki.apache.org/confluence/display/Hive/HiveClient

• Beeline for HiveServer2 (new in CDH4)
– http://sqlline.sourceforge.net/#manual





Data Types

[cts318692@aster4 ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.2.1.jar!/hive-log4j.properties
Hive history file=/tmp/cts318692/hive_job_log_cts318692_201308071622_2005272769.txt
hive>

Launch Hive Command Line Interface (CLI)

Location of the session’s log file

hive> !cat data/user-posts.txt;
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394
hive>

You can execute local shell commands from within the CLI: place the command between ! and ;


Data Types

Numeric Types
• TINYINT
• SMALLINT
• INT
• BIGINT
• FLOAT
• DOUBLE
• DECIMAL (Note: Only available starting with Hive 0.11.0)

Date/Time Types
• TIMESTAMP (Note: Only available starting with Hive 0.8.0)
• DATE (Note: Only available starting with Hive 0.12.0)

Misc Types
• BOOLEAN
• STRING
• BINARY (Note: Only available starting with Hive 0.8.0)

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types





https://issues.apache.org/jira/browse/HIVE-2693








Complex Data Types


Check physical storage of hive

[cts318692@aster4 ~]$ hive -S -e "set" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true

This is the location where hive stores its data.


Creating DataBase

hive> CREATE DATABASE IF NOT EXISTS som COMMENT 'my database'
    > LOCATION '/user/cts318692/someshwar/hivestore/'
    > WITH DBPROPERTIES ('creator'='someshwar kale','date'='2013-06-08');
OK
Time taken: 0.046 seconds

• IF NOT EXISTS: suppresses the error Hive would otherwise report if the database already exists
• som: the database name. Hive opens the default database when you start a new session
• LOCATION: overrides the default ‘/user/hive/warehouse’ location for the new directory; this is the physical storage for the som database
• DBPROPERTIES: key-value properties attached to the database


Exploring Data

STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>

For complex data types: maps, arrays, structs

field


Creating Table

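• FIELDS TERMINATED BY: column separator definition
• COLLECTION ITEMS TERMINATED BY: separator for the elements of complex data types (maps, arrays, structs)
• MAP KEYS TERMINATED BY: separator between a map key and its value, e.g. ‘key’ ^C ‘value’ (\003 = Ctrl-C = ^C)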


hive> DESCRIBE FORMATTED som.employees;


Creating External Table


CREATE TABLE ... LIKE

If you omit the EXTERNAL keyword and the original table is external, the new table will also be external.

If you omit EXTERNAL and the original table is managed, the new table will also be managed. However, if you include the EXTERNAL keyword and the original table is managed, the new table will be external. Even in this scenario, the LOCATION clause will still be optional.
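The three cases above reduce to a single rule, sketched here in plain Python (a model of the behaviour described, not Hive code; the function name is hypothetical): the new table is external if either the EXTERNAL keyword is present or the original table is external.

```python
def new_table_is_external(external_keyword: bool, original_is_external: bool) -> bool:
    """Model of CREATE [EXTERNAL] TABLE ... LIKE as described in the text above."""
    return external_keyword or original_is_external

# Enumerate every combination of keyword and original table kind.
for keyword in (False, True):
    for original in (False, True):
        kind = "external" if new_table_is_external(keyword, original) else "managed"
        print(f"EXTERNAL keyword={keyword}, original external={original} -> {kind}")
```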


Select Clause


Describe External Table


Dropping DataBase and Table

By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which will cause Hive to drop the tables in the database first.


Partitions

To increase performance, Hive has the capability to partition data
– The values of the partitioned column divide a table into segments
– Entire partitions can be ignored at query time
– Similar to relational databases’ indexes but not as granular

Partitions have to be properly created by users
– When inserting data, you must specify a partition

At query time, whenever appropriate, Hive will automatically filter out partitions
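The effect of filtering out partitions can be sketched in plain Python (a model, not Hive code). The table contents, the country/state partition columns, and the scan function are hypothetical; the point is that a filter on a partition column lets whole partitions be skipped without reading any of their rows.

```python
# Hypothetical table: each (country, state) partition holds the rows in its directory.
PARTITIONS = {
    ("US", "CA"): ["John Doe", "Mary Smith"],
    ("US", "NY"): ["Todd Jones"],
    ("CA", "ON"): ["Bill King"],
}

def scan(partitions, country=None, state=None):
    """Read only the partitions that match the filters; return (rows, partitions_read)."""
    rows, partitions_read = [], 0
    for (part_country, part_state), part_rows in partitions.items():
        if country is not None and part_country != country:
            continue  # whole partition skipped without reading its rows
        if state is not None and part_state != state:
            continue
        partitions_read += 1
        rows.extend(part_rows)
    return rows, partitions_read

rows, partitions_read = scan(PARTITIONS, country="US")
print(rows, partitions_read)
```

With no filter on a partition column, every partition has to be read, which is the case partitioning is meant to avoid.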


Creating Partitioned Table

Partition the table based on the values of country and state


Loading data to table

LOAD DATA LOCAL ... copies the local data to the final location in the distributed filesystem, while LOAD DATA ... (i.e., without LOCAL) moves the data to the final location.

The PARTITION clause is necessary if the table into which we are loading the data is partitioned. This is known as static partitioning, because we provide the partition values in the query.

Partitions are physically stored under separate directories
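As an illustration of that directory layout, the following plain-Python sketch builds the path of a partition from the table directory and the partition values, using the column=value naming Hive applies to partition directories. The table directory and the helper name are hypothetical.

```python
import posixpath

def partition_dir(table_dir, partition_spec):
    """Build the directory for one partition: one 'column=value' level per partition column."""
    levels = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_dir, *levels)

# Hypothetical table directory under the default warehouse location.
path = partition_dir("/user/hive/warehouse/employees", [("country", "US"), ("state", "CA")])
print(path)
```

The order of the pairs follows the order of the partition columns in the table definition, so country is the outer directory and state the inner one.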


Schema Violations

hive> LOAD DATA LOCAL INPATH
    > 'data/user-posts-inconsistentFormat.txt'
    > OVERWRITE INTO TABLE posts;
OK
Time taken: 0.612 seconds

hive> select * from posts;
OK
user1 Funny Story 1343182026191
user2 Cool Deal NULL
user4 Interesting Post 1343182154633
user5 Yet Another Blog 13431839394
Time taken: 0.136 seconds

NULL is set for any value that violates the pre-defined schema
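The behaviour can be modelled in plain Python (not Hive code): the schema is applied when a line is read, and any field that cannot be converted to its column type becomes NULL (None here) instead of causing an error. The three-column schema and the malformed sample lines are assumptions for illustration, since the slides do not show the posts table definition or the inconsistent file, and Python's int() only approximates Hive's number parsing.

```python
def to_bigint(text):
    """Convert text to an integer, or return None (NULL) when it is not a valid number."""
    try:
        return int(text)
    except ValueError:
        return None

# Assumed schema for the posts table: (user STRING, post STRING, time BIGINT).
SCHEMA = [("user", str), ("post", str), ("time", to_bigint)]

def parse_line(line):
    """Split a comma-delimited line and apply the schema; missing fields become None."""
    fields = line.split(",")
    row = []
    for index, (_name, convert) in enumerate(SCHEMA):
        row.append(convert(fields[index]) if index < len(fields) else None)
    return tuple(row)

print(parse_line("user1,Funny Story,1343182026191"))
print(parse_line("user2,Cool Deal,2012-07-25"))  # hypothetical malformed timestamp
```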


External Partitioned Tables


• There is no difference in syntax
• When the partitioned column is specified in the WHERE clause, entire directories/partitions can be ignored


Bucketing

• Break data into a set of buckets based on a hash function of a "bucket column"
– Capability to execute queries on a subset of random data

• Hive doesn’t automatically enforce bucketing
– The user is required to specify the number of buckets by setting the number of reducers

hive> set mapred.reduce.tasks = 256;
OR
hive> set hive.enforce.bucketing = true;

Either manually set the number of reducers to match the number of buckets, or use ‘hive.enforce.bucketing’, which will set it on your behalf.
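The bucket assignment itself can be sketched in plain Python (a model, not Hive code). The sketch assumes a string bucket column hashed in the style of Java's String.hashCode and a bucket index computed as the non-negative hash modulo the number of buckets; the hash Hive actually applies depends on the column type, so treat the hash function here as an illustrative assumption.

```python
def java_string_hash(text):
    """Java-style String.hashCode (for characters in the Basic Multilingual Plane)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, as Java int arithmetic does
    return h - (1 << 32) if h >= (1 << 31) else h  # reinterpret as a signed 32-bit value

def bucket_for(value, num_buckets):
    """Map a bucket-column value to a bucket index in the range [0, num_buckets)."""
    return (java_string_hash(value) & 0x7FFFFFFF) % num_buckets

# Hypothetical user ids spread over 4 buckets.
for user in ["user1", "user2", "user4", "user5"]:
    print(user, bucket_for(user, 4))
```

Because the bucket depends only on the column value, rows with the same value always land in the same bucket, which is what makes sampling a single bucket meaningful.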


Create and Use Table with Buckets


ALTER TABLE


Partition columns are not deleted


Inserting Data into Tables from Queries


Dynamic Partition Inserts


Exporting Data


Functions


Table generating functions

Return zero or more rows, one row for each element of the input array

Only a single expression in the SELECT clause is supported with UDTFs.


LIMIT clause


CASE … WHEN … THEN Statements


Where and Group by .. having clause


Joins


Outer Join


Points to remember

Only equality joins are allowed.

More than two tables can be joined in the same query, e.g.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

is a valid join.

Hive converts a join over multiple tables into a single map/reduce job if, for every table, the same column is used in the join clauses, e.g.

ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

In contrast,

ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

is converted into two map/reduce jobs because the key1 column from b is used in the first join condition and the key2 column from b is used in the second one.


ORDER BY and SORT BY

ORDER BY uses a single reducer to sort the data, which may take an unacceptably long time to execute for larger data sets.

Hive adds an alternative, SORT BY, that orders the data only within each reducer, thereby performing a local ordering, where each reducer’s output will be sorted.
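The difference between the two orderings can be seen in a small plain-Python model (not Hive code). The round-robin assignment of rows to reducers is a stand-in chosen for illustration; Hive decides the distribution itself unless told otherwise.

```python
def order_by(rows):
    """Total ordering: all rows pass through one reducer and are sorted together."""
    return sorted(rows)

def sort_by(rows, num_reducers):
    """Local ordering: rows are spread over reducers and each reducer sorts only its share."""
    reducers = [[] for _ in range(num_reducers)]
    for index, row in enumerate(rows):
        reducers[index % num_reducers].append(row)  # stand-in for Hive's row distribution
    return [sorted(part) for part in reducers]

salaries = [5, 3, 8, 1, 9, 2]
print(order_by(salaries))    # one globally sorted list
print(sort_by(salaries, 2))  # one sorted list per reducer
```

Each reducer's output is sorted, but concatenating the outputs does not generally give a globally sorted result.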


Casting

What if a salary value is not a valid string for a floating-point number? In this case, Hive returns NULL.
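A plain-Python model of this rule (not Hive code; the function name is hypothetical, and Python's float() only approximates the strings Hive accepts) returns None in place of NULL when the conversion fails:

```python
def cast_to_float(text):
    """Return the value as a float, or None (NULL) if it is not a valid number."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

print(cast_to_float("100000.0"))
print(cast_to_float("not a number"))
```

The failed cast does not abort the query; the row simply carries a NULL in that column, so a filter on the cast value drops it.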


UNION ALL and Nested select

Each subquery of the union query must produce the same number of columns, and for each column, its type must match all the column types in the same position.


View

• Creating a view is similar to writing a function in a programming language.

• Views are virtual.


Lateral view

Lateral view is used in conjunction with user-defined table generating functions such as explode().

A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table having the supplied table alias.

Syntax:
LATERAL VIEW udtf(expression) tableAlias AS columnAlias
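The two steps (apply the UDTF per row, then join its output back to that row) can be modelled in plain Python (not Hive code). The employee rows, the subordinates column, and the helper names are hypothetical.

```python
def explode(array):
    """Table generating function: one output row per element of the input array."""
    for element in array:
        yield element

def lateral_view_explode(rows, array_column, column_alias):
    """Join every element produced by explode() back to the row it came from."""
    for row in rows:
        for element in explode(row[array_column]):
            joined = dict(row)
            joined[column_alias] = element
            yield joined

# Hypothetical base table with an ARRAY<STRING> column.
employees = [
    {"name": "John Doe", "subordinates": ["Mary Smith", "Todd Jones"]},
    {"name": "Mary Smith", "subordinates": []},
]
result = [(r["name"], r["sub"]) for r in lateral_view_explode(employees, "subordinates", "sub")]
print(result)
```

In this model a row whose array is empty contributes no output rows, because explode() yields nothing to join.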


Lateral view Example


UDF

Hive uses reflection to find a method named evaluate whose parameters match the arguments used in the HiveQL function call.

Hive can work with both the Hadoop Writables and the Java primitives, but it’s recommended to work with the Writables since they can be reused.

Input arguments type and return type must be same.


UDF vs. GenericUDF


BETWEEN operator

hive> select name,salary from employees2 where salary between 80000 and 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
....
OK
John Doe 100000.0
John Doe 100000.0
Mary Smith 80000.0
Mary Smith 80000.0
Time taken: 14.39 seconds

Both values (lower and upper) are inclusive.


HiveServer2

As of CDH4.1, you can deploy HiveServer2, an improved version of HiveServer that supports a new Thrift API tailored for JDBC and ODBC clients, Kerberos authentication, and multi-client concurrency.

There is also a new CLI for HiveServer2 named BeeLine.

HiveServer2
Connection URL: jdbc:hive2://<host>:<port>
Driver class: org.apache.hive.jdbc.HiveDriver

HiveServer1
Connection URL: jdbc:hive://<host>:<port>
Driver class: org.apache.hadoop.hive.jdbc.HiveDriver
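Since the URL scheme and the driver class must be changed together, a small lookup helps keep them in step. This plain-Python sketch only assembles the two strings listed above; the helper name and dictionary keys are hypothetical, and it does not open a connection.

```python
# URL scheme and driver class for each server generation, as listed above.
HIVE_JDBC = {
    "hiveserver2": ("jdbc:hive2", "org.apache.hive.jdbc.HiveDriver"),
    "hiveserver1": ("jdbc:hive", "org.apache.hadoop.hive.jdbc.HiveDriver"),
}

def connection_settings(server, host, port):
    """Return (connection URL, driver class) for the given server generation."""
    scheme, driver = HIVE_JDBC[server]
    return f"{scheme}://{host}:{port}", driver

url, driver = connection_settings("hiveserver2", "localhost", 10000)
print(url, driver)
```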


BEELINE

$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000>


Connecting database using properties file


References

Programming Hive. Edward Capriolo, Dean Wampler, Jason Rutherglen. O'Reilly Media; 1st edition (October 3, 2012)

Chapter about Hive in Hadoop in Action. Chuck Lam. Manning Publications; 1st edition (December 2010)


Thank You
