Data Warehouse and Query Language for Hadoop
August 2013, by Someshwar Kale
©2012, Cognizant
Apr 15, 2017
HIVE
Data Warehousing Solution built on top of Hadoop
Provides a SQL-like query language named HiveQL
– Minimal learning curve for people with SQL expertise
– Data analysts are the target audience
Early Hive development work started at Facebook in 2007
Today, Facebook counts 29% of its employees (and growing!) as Hive users
– https://www.facebook.com/note.php?note_id=114588058858
Hive is now a top-level Apache project (it began as a Hadoop subproject)
– http://hive.apache.org
Hive Provides
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analyzing and summarizing large amounts of data
• Access to files on various data stores such as HDFS and HBase
Hive
Hive does NOT provide low-latency or real-time queries.
Even querying small amounts of data may take minutes.
Designed for scalability and ease-of-use rather than low latency responses
Hive
Translates HiveQL statements into a set of MapReduce Jobs which are then executed on a Hadoop Cluster.
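As a rough illustration of what "translating a query into MapReduce" means, the Python sketch below models the map, shuffle and reduce steps for a hypothetical query such as SELECT user, COUNT(*) FROM posts GROUP BY user. The table contents and function names are assumptions made for this example; Hive's real planner and execution engine are far more involved.

```python
from collections import defaultdict

# Hypothetical rows of a posts table: (user, post, time).
rows = [
    ("user1", "Funny Story", 1343182026191),
    ("user2", "Cool Deal", 1343182133839),
    ("user1", "Another Story", 1343182154633),
]

def map_phase(rows):
    # Emit one (key, 1) pair per row, keyed by the GROUP BY column.
    for user, _post, _time in rows:
        yield user, 1

def shuffle(pairs):
    # Group all values that share a key, as happens between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Aggregate the values of each key; here the aggregate is COUNT(*).
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

Each distinct user ends up in exactly one reduce call, which is why a GROUP BY maps naturally onto a single MapReduce job.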
Hive Metastore
To support features like schema(s) and data partitioning Hive keeps its metadata in a Relational Database
Packaged with Derby, a lightweight embedded SQL DB
The default Derby-based metastore is good for evaluation and testing
– Schema is not shared between users, as each user has their own instance of embedded Derby
– Stored in the metastore_db directory, which resides in the directory that Hive was started from
Can easily switch to another SQL installation such as MySQL
Metastore Deployment Modes : Embedded Mode
Default metastore deployment mode for CDH.
Both the database and the metastore service run embedded in the main HiveServer process
Both are started for you when you start the HiveServer process.
Supports only one active user at a time and is not certified for production use.
Metastore Deployment Modes : Local Mode
Hive metastore service runs in the same process as the main HiveServer process.
The metastore database runs in a separate process, and can be on a separate host.
The embedded metastore service communicates with the metastore database over JDBC.
Metastore Deployment Modes : Remote Mode
Hive Architecture
Hive Interface Options
• Command Line Interface (CLI)
– Used exclusively in these slides
• Hive Web Interface
– https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface
• Java Database Connectivity (JDBC)
– https://cwiki.apache.org/confluence/display/Hive/HiveClient
• Beeline for HiveServer2 (new in CDH4)
– http://sqlline.sourceforge.net/#manual
Data Types
[cts318692@aster4 ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.2.1.jar!/hive-log4j.properties
Hive history file=/tmp/cts318692/hive_job_log_cts318692_201308071622_2005272769.txt
hive>
Launch Hive Command Line Interface (CLI)
Location of the session’s log file
hive> !cat data/user-posts.txt;
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394
hive>
Local commands can be executed within the CLI: place the command between ! and ;
Data Types
Numeric Types
• TINYINT
• SMALLINT
• INT
• BIGINT
• FLOAT
• DOUBLE
• DECIMAL (Note: Only available starting with Hive 0.11.0)
Date/Time Types
• TIMESTAMP (Note: Only available starting with Hive 0.8.0)
• DATE (Note: Only available starting with Hive 0.12.0)
Misc Types
• BOOLEAN
• STRING
• BINARY (Note: Only available starting with Hive 0.8.0)
Complex Data Types
Check physical storage of hive
[cts318692@aster4 ~]$ hive -S -e "set" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true
This is the location where hive stores its data.
Creating DataBase
hive> CREATE DATABASE IF NOT EXISTS som COMMENT 'my database'
    > LOCATION '/user/cts318692/someshwar/hivestore/'
    > WITH DBPROPERTIES ('creator'='someshwar kale','date'='2013-06-08');
OK
Time taken: 0.046 seconds
IF NOT EXISTS: suppresses the error Hive would otherwise raise if the database already exists
som: the database name; Hive opens the default database when you open a new session
LOCATION: overrides the default location (/user/hive/warehouse) for the new database's directory; this is the physical storage for the som database
DBPROPERTIES: database properties
Exploring Data
STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
For complex data types: maps, arrays, structs
field
Creating Table
For complex data types (maps, arrays, structs)
For map keys and values, e.g. 'key' ^C 'value' (\003 = Ctrl-C = ^C)
Column separator definition
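To make the role of the three separators concrete, here is a small Python sketch that splits one text row the way such delimiters are meant to be read. It assumes Hive's default control characters (\001 between columns, \002 between collection items, \003 between a map key and its value) and a hypothetical employees layout of name, salary, subordinates, deductions and address; the actual table definition on the slide may differ.

```python
# Assumed delimiters: Hive's default control characters.
FIELD_SEP = "\x01"       # ^A, separates columns
COLLECTION_SEP = "\x02"  # ^B, separates array items, struct fields and map entries
MAP_KEY_SEP = "\x03"     # ^C, separates a map key from its value

def parse_employee(line):
    # Split the row into columns first, then split each complex column.
    name, salary, subordinates, deductions, address = line.split(FIELD_SEP)
    street, city, state, zip_code = address.split(COLLECTION_SEP)
    return {
        "name": name,
        "salary": float(salary),
        "subordinates": subordinates.split(COLLECTION_SEP),
        "deductions": {
            key: float(value)
            for key, value in (
                entry.split(MAP_KEY_SEP) for entry in deductions.split(COLLECTION_SEP)
            )
        },
        "address": {"street": street, "city": city, "state": state, "zip": int(zip_code)},
    }

# Hypothetical sample row, assembled with the same separators.
line = FIELD_SEP.join([
    "John Doe",
    "100000.0",
    COLLECTION_SEP.join(["Mary Smith", "Todd Jones"]),
    COLLECTION_SEP.join(["Federal Taxes" + MAP_KEY_SEP + "0.2",
                         "State Taxes" + MAP_KEY_SEP + "0.05"]),
    COLLECTION_SEP.join(["1 Michigan Ave.", "Chicago", "IL", "60600"]),
])

record = parse_employee(line)
print(record["subordinates"], record["deductions"], record["address"])
```

Control characters are used as separators because they are unlikely to appear inside ordinary field values.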
hive> DESCRIBE FORMATTED som.employees;
Creating External Table
Create ..like
If you omit the EXTERNAL keyword and the original table is external, the new table will also be external.
If you omit EXTERNAL and the original table is managed, the new table will also be managed. However, if you include the EXTERNAL keyword and the original table is managed, the new table will be external. Even in this scenario, the LOCATION clause will still be optional.
Select Clause
Describe External Table
Dropping DataBase and Table
By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which will cause Hive to drop the tables in the database first.
Partitions
To increase performance, Hive has the capability to partition data
– The values of the partitioned column divide a table into segments
– Entire partitions can be ignored at query time
– Similar to relational databases’ indexes, but not as granular
Partitions have to be properly created by users
– When inserting data, a partition must be specified
At query time, whenever appropriate, Hive will automatically filter out partitions
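The idea of ignoring whole partitions can be sketched in Python as filtering directory names before any file is read. The directory names below assume a table partitioned by country and state and are made up for illustration; this is a model of the concept, not Hive's implementation.

```python
# Hypothetical partition directories of a table partitioned by country and state.
partition_dirs = [
    "country=US/state=CA",
    "country=US/state=NY",
    "country=IN/state=MH",
]

def parse_partition(path):
    # "country=US/state=CA" -> {"country": "US", "state": "CA"}
    return dict(part.split("=", 1) for part in path.split("/"))

def prune(dirs, **wanted):
    # Keep only the directories whose partition values match every filter;
    # all other directories are never scanned.
    return [
        d for d in dirs
        if all(parse_partition(d).get(key) == value for key, value in wanted.items())
    ]

print(prune(partition_dirs, country="US"))
print(prune(partition_dirs, country="US", state="NY"))
```

A filter on a non-partition column cannot be answered this way, which is why the choice of partition columns matters for query performance.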
Creating Partitioned Table
Partition the table based on the values of country and state
Loading data to table
LOAD DATA LOCAL ... copies the local data to the final location in the distributed filesystem, while LOAD DATA ... (i.e., without LOCAL) moves the data to the final location.
The PARTITION clause is necessary if the table into which we are loading the data is partitioned. This is known as static partitioning, as we provide the partition value in the query.
Partitions are physically stored under separate directories
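The copy-versus-move distinction and the per-partition directory layout can be mimicked on a local filesystem. The Python sketch below is only an analogy under assumed names (load_data, the warehouse path and the employees.txt file are hypothetical); Hive performs the equivalent operations against the distributed filesystem.

```python
import shutil
import tempfile
from pathlib import Path

def load_data(src, table_dir, partition, local):
    # Build the partition directory, e.g. <table_dir>/country=US/state=CA.
    target_dir = Path(table_dir).joinpath(*(f"{k}={v}" for k, v in partition.items()))
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(src).name
    if local:
        shutil.copy(src, target)            # like LOAD DATA LOCAL: the source is kept
    else:
        shutil.move(str(src), str(target))  # like LOAD DATA: the source is moved
    return target

with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    src = tmp / "employees.txt"
    src.write_text("John Doe\n")
    table_dir = tmp / "warehouse" / "employees"

    copied = load_data(src, table_dir, {"country": "US", "state": "CA"}, local=True)
    source_kept_after_copy = src.exists()
    copied_path = copied.relative_to(tmp).as_posix()

    load_data(src, table_dir, {"country": "US", "state": "OR"}, local=False)
    source_kept_after_move = src.exists()

print(copied_path, source_kept_after_copy, source_kept_after_move)
```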
Schema Violations
hive> LOAD DATA LOCAL INPATH
    > 'data/user-posts-inconsistentFormat.txt'
    > OVERWRITE INTO TABLE posts;
OK
Time taken: 0.612 seconds
hive> select * from posts;
OK
user1 Funny Story 1343182026191
user2 Cool Deal NULL
user4 Interesting Post 1343182154633
user5 Yet Another Blog 13431839394
Time taken: 0.136 seconds
NULL is set for any value that violates the pre-defined schema
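The behaviour shown above, where loading succeeds and the bad value only surfaces as NULL when the table is read, can be modelled in a few lines of Python. The column layout (user, post, time as an integer) and the malformed value "2012-07-25" are assumptions, since the slide does not show the contents of the inconsistent file.

```python
def to_bigint(text):
    # Return the value as an integer, or None (standing in for NULL)
    # when the text cannot be read as one.
    try:
        return int(text)
    except ValueError:
        return None

def read_posts(lines):
    # Apply the schema while reading: nothing is validated at load time.
    rows = []
    for line in lines:
        user, post, time = line.rstrip("\n").split(",")
        rows.append((user, post, to_bigint(time)))
    return rows

# Hypothetical file contents; the second row has a malformed time value.
raw_lines = [
    "user1,Funny Story,1343182026191",
    "user2,Cool Deal,2012-07-25",
    "user4,Interesting Post,1343182154633",
]

posts = read_posts(raw_lines)
for row in posts:
    print(row)
```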
External Partitioned Tables
External Partitioned Tables (continued)
• There is no difference in syntax
• When the partitioned column is specified in the WHERE clause, entire directories/partitions can be ignored
Bucketing
• Break data into a set of buckets based on a hash function of a "bucket column"
– Capability to execute queries on a subset of random data
• Hive doesn’t automatically enforce bucketing
– The user is required to specify the number of buckets by setting the number of reducers
hive> set mapred.reduce.tasks = 256;
OR
hive> set hive.enforce.bucketing = true;
Either manually set the number of reducers to be the number of buckets, or use ‘hive.enforce.bucketing’, which will set it on your behalf.
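How a hash function assigns a row to a bucket can be sketched in Python as "hash the bucket column, then take the remainder by the number of buckets". The hash below imitates Java's String.hashCode purely as a stand-in, and the sign-bit masking is an assumption about the general approach, not a statement of Hive's exact code.

```python
def java_string_hash(text):
    # Stand-in hash: the same formula as Java's String.hashCode,
    # kept within 32 bits.
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def bucket_for(value, num_buckets):
    # Clear the sign bit so the result is non-negative, then take the
    # remainder to pick one of num_buckets buckets.
    return (java_string_hash(value) & 0x7FFFFFFF) % num_buckets

# The same value always lands in the same bucket.
for user in ["user1", "user2", "user1"]:
    print(user, bucket_for(user, 256))
```

Because the assignment depends only on the column value, rows with equal values are always stored together, which is what makes sampling and bucketed joins possible.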
Create and Use Table with Buckets
ALTER TABLE
ALTER TABLE (continued)
Partition columns are not deleted
Inserting Data into Tables from Queries
Dynamic Partition Inserts
Exporting Data
Functions
Table generating functions
Return 0 to many rows, one row for each element from the input array
Table generating functions
Only a single expression in the SELECT clause is supported with UDTFs.
LIMIT clause
CASE … WHEN … THEN Statements
Where and Group by .. having clause
Joins
Outer Join
Points to remember
Only equality joins are allowed.
More than 2 tables can be joined in the same query, e.g.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
is a valid join.
Hive uses a single map/reduce job if, for every table, the same column is used in the join clauses:
ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
By contrast,
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
is converted into two map/reduce jobs, because the key1 column from b is used in the first join condition and the key2 column from b is used in the second one.
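The single-job rule can be checked mechanically: collect the columns each table contributes to the ON conditions and verify that every table uses exactly one. The Python sketch below encodes only that stated rule with a hypothetical helper name (uses_single_job); it does not reproduce Hive's query planner.

```python
import re
from collections import defaultdict

def uses_single_job(join_conditions):
    # Collect, per table alias, every column referenced in the ON conditions.
    columns = defaultdict(set)
    for condition in join_conditions:
        for table, column in re.findall(r"(\w+)\.(\w+)", condition):
            columns[table].add(column)
    # One job is enough only if each table joins on a single column.
    return all(len(cols) == 1 for cols in columns.values())

same_key = ["a.key = b.key1", "c.key = b.key1"]
different_keys = ["a.key = b.key1", "c.key = b.key2"]

print(uses_single_job(same_key))
print(uses_single_job(different_keys))
```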
ORDER BY and SORT BY
ORDER BY uses single reducer to sort the data, which may take an unacceptably long time to execute for larger data sets.
Hive adds an alternative, SORT BY, that orders the data only within each reducer, thereby performing a local ordering, where each reducer’s output will be sorted.
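The difference between a total order and a per-reducer order is easy to see in a small Python model. The round-robin distribution of rows to reducers is an assumption chosen to keep the example deterministic; Hive decides the distribution itself unless told otherwise.

```python
def order_by(rows):
    # A single reducer sees every row, so the output is totally ordered.
    return sorted(rows)

def sort_by(rows, num_reducers):
    # Rows are spread over several reducers (round-robin here, as a stand-in),
    # and each reducer sorts only its own share.
    shares = [[] for _ in range(num_reducers)]
    for index, row in enumerate(rows):
        shares[index % num_reducers].append(row)
    return [sorted(share) for share in shares]

salaries = [70, 100, 60, 80, 90]

total_order = order_by(salaries)
local_order = sort_by(salaries, 2)
concatenated = [value for share in local_order for value in share]

print(total_order)
print(local_order)
print(concatenated)
```

Each reducer's output is sorted, but concatenating the outputs does not give a globally sorted result, which is the trade-off SORT BY makes for speed.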
Casting
What if a salary value is not a valid string for a floating-point number? In this case, Hive returns NULL.
UNION ALL and Nested select
Each subquery of the union query must produce the same number of columns, and for each column, its type must match all the column types in the same position.
View
• Similar to writing a function in a programming language.
• Views are virtual.
Lateral view
Lateral view is used in conjunction with user-defined table generating functions such as explode().
A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table having the supplied table alias.
Syntax: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
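The two steps described above (apply the UDTF to each row, then join its output back to that row) can be modelled in Python. The table contents and the column alias "sub" are hypothetical, and explode here is a plain generator standing in for Hive's explode().

```python
def explode(array):
    # Table generating function: one output row per array element,
    # and no rows at all for an empty array.
    for element in array:
        yield (element,)

def lateral_view(rows, array_column, column_alias):
    # Apply the UDTF to each input row and join every generated row
    # back to the row it came from.
    for row in rows:
        for (element,) in explode(row[array_column]):
            yield {**row, column_alias: element}

# Hypothetical base table with an ARRAY column named subordinates.
employees = [
    {"name": "John Doe", "subordinates": ["Mary Smith", "Todd Jones"]},
    {"name": "Bill King", "subordinates": []},
]

result = [(row["name"], row["sub"]) for row in lateral_view(employees, "subordinates", "sub")]
print(result)
```

Note that a row whose array is empty produces no output rows in this model, because the join has nothing to pair it with.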
Lateral view Example
UDF
UDF
Hive uses reflection to find methods named evaluate whose parameters match the arguments used in the HiveQL function call.
Hive can work with both the Hadoop Writables and the Java primitives, but it’s recommended to work with the Writables since they can be reused.
Input arguments type and return type must be same.
UDF
UDF vs. GenericUDF
between operator
hive> select name,salary from employees2 where salary between 80000 and 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
....
OK
John Doe 100000.0
John Doe 100000.0
Mary Smith 80000.0
Mary Smith 80000.0
Time taken: 14.39 seconds
Both values (lower and upper) are inclusive.
HiveServer2
As of CDH4.1, you can deploy HiveServer2, an improved version of HiveServer that supports a new Thrift API tailored for JDBC and ODBC clients, Kerberos authentication, and multi-client concurrency.
There is also a new CLI for HiveServer2 named BeeLine.
HiveServer2
– Connection URL: jdbc:hive2://<host>:<port>
– Driver class: org.apache.hive.jdbc.HiveDriver
HiveServer1
– Connection URL: jdbc:hive://<host>:<port>
– Driver class: org.apache.hadoop.hive.jdbc.HiveDriver
BEELINE
$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000>
Connecting database using properties file
References
Hive
Edward Capriolo (Author), Dean Wampler (Author), Jason Rutherglen (Author)
O'Reilly Media; 1st edition (October 3, 2012)
Chapter About Hive
Hadoop in Action
Chuck Lam (Author)
Manning Publications; 1st edition (December, 2010)
Thank You