Apache Hive Prashant Gupta
Hive
• Data warehousing package built on top of Hadoop.
• Used for data analysis on structured data.
• Targeted towards users comfortable with SQL.
• Its query language, HiveQL, is similar to SQL.
• Abstracts the complexity of Hadoop.
• No Java is required.
• Developed by Facebook.
Features of Hive
How is it Different from SQL
• The major difference is that a Hive query executes on a Hadoop infrastructure rather than a traditional database.
• This allows Hive to handle huge data sets, data sets so large that high-end, expensive, traditional databases would fail.
• The internal execution of a Hive query is via a series of automatically generated MapReduce jobs.
When not to use Hive
• Semi-structured or completely unstructured data.
• Hive is not designed for online transaction processing.
• It is best for batch jobs over large sets of data.
• Latency for Hive queries is generally very high (minutes), even when data sets are very small (say, a few hundred megabytes).
• It cannot be compared with systems such as Oracle, where analyses are conducted on a significantly smaller amount of data.
Install Hive
• To install Hive, untar the .gz file:
    tar -xvzf hive-0.13.0-bin.tar.gz
• To initialize the environment variables, export the following:
    export HADOOP_HOME=/home/usr/hadoop-0.20.2
    (Specifies the Hadoop installation directory.)
    export HIVE_HOME=/home/usr/hive-0.13.0-bin
    (Specifies the Hive installation directory.)
    export PATH=$PATH:$HIVE_HOME/bin
Hive configurations
• Hive's default configuration is stored in the hive-default.xml file in the conf directory.
• Hive comes configured to use Derby as the metastore.
Hive Modes
To start the Hive shell, type hive and press Enter.
• Hive in local mode
  No HDFS is required; all files are read from the local file system.
    hive> SET mapred.job.tracker=local;
• Hive in MapReduce (Hadoop) mode
    hive> SET mapred.job.tracker=master:9001;
Introducing data types
• The primitive data types in Hive include integers, Boolean, floating point, date, timestamp and strings.
• The table below lists the size of each data type:

Type        Size
--------------------------------------------------------------
TINYINT     1 byte
SMALLINT    2 bytes
INT         4 bytes
BIGINT      8 bytes
FLOAT       4 bytes (single precision floating point numbers)
DOUBLE      8 bytes (double precision floating point numbers)
BOOLEAN     TRUE/FALSE value
STRING      Max size is 2GB.

• Complex data types: Array, Map, Struct
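The byte sizes in the table determine the range of values each integer type can hold. The following is a minimal Python sketch, assuming signed two's-complement storage (which is how Hive's integer types behave); the helper name signed_range is made up for this illustration.

```python
# Byte widths taken from the table above.
INTEGER_WIDTHS = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def signed_range(num_bytes):
    """Return (min, max) for a signed integer stored in num_bytes bytes."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for type_name, width in INTEGER_WIDTHS.items():
    low, high = signed_range(width)
    print(f"{type_name:<9}{width} byte(s)  {low} .. {high}")
```

A value outside the range of the declared column type cannot be stored in that column, so pick the narrowest type that still covers the data.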
Configuring Hive
• Hive is configured using an XML configuration file called hive-site.xml, located in Hive’s conf directory.
• Execution engines
  Hive was originally written to use MapReduce as its execution engine, and that is still the default.
  Apache Tez can also be used as the execution engine, and work is underway to support Spark. Both Tez and Spark are general directed acyclic graph (DAG) engines that offer more flexibility and higher performance than MapReduce.
  The execution engine is controlled by the hive.execution.engine property, which defaults to “mr” (for MapReduce).
  It’s easy to switch the execution engine on a per-query basis, so you can see the effect of a different engine on a particular query. To set Hive to use Tez:
    hive> SET hive.execution.engine=tez;
Hive Architecture
Components
• Thrift Client
  It is possible to interact with Hive through the Hive Thrift server from any programming language that supports Thrift, for example Python or Ruby.
• JDBC Driver
  Hive provides a pure Java JDBC driver for Java applications to connect to Hive, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
• ODBC Driver
  An ODBC driver allows applications that support the ODBC protocol to connect to Hive.
Components
• Metastore
  This is the central repository for Hive metadata. By default, Hive is configured to use Derby as the metastore. As a result of this configuration, a metastore_db directory is created in each working folder.
• What are the problems with the default metastore?
  Users cannot see the tables created by others if they do not use the same metastore_db.
  Only one embedded Derby database can access the database files at any given point in time.
  This results in only one open Hive session per metastore; it is not possible to have multiple sessions with embedded Derby as the metastore.
• Solution
  We can use a standalone database, either on the same machine or on a remote machine, as the metastore. Any JDBC-compliant database can be used; MySQL is a popular choice for the standalone metastore.
Configuring MySQL as metastore
• Install the MySQL admin/client tools.
• Create a hadoop user and grant permissions to the user:
    mysql -u root -p
    mysql> CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';
    mysql> GRANT ALL ON *.* TO 'hadoop'@'localhost' WITH GRANT OPTION;
• Modify the following properties in hive-site.xml to use MySQL instead of Derby. This creates a database named Hive in MySQL:
    name  : javax.jdo.option.ConnectionURL
    value : jdbc:mysql://localhost:3306/Hive?createDatabaseIfNotExist=true
    name  : javax.jdo.option.ConnectionDriverName
    value : com.mysql.jdbc.Driver
    name  : javax.jdo.option.ConnectionUserName
    value : hadoop
    name  : javax.jdo.option.ConnectionPassword
    value : hadoop
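The name/value pairs above go into hive-site.xml as property elements inside a configuration element, which is the usual Hadoop configuration layout. The following is a minimal Python sketch that renders them; the helper name build_hive_site is made up for this illustration, and the values are the sample ones from the slide, not recommended settings.

```python
import xml.etree.ElementTree as ET

# Sample values from the slide; replace them for a real deployment.
METASTORE_PROPERTIES = {
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://localhost:3306/Hive?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hadoop",
    "javax.jdo.option.ConnectionPassword": "hadoop",
}


def build_hive_site(properties):
    """Build a <configuration> element with one <property> per setting."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


hive_site = build_hive_site(METASTORE_PROPERTIES)
print(ET.tostring(hive_site, encoding="unicode"))
```

The printed fragment can be compared against the hive-site.xml in Hive's conf directory; the MySQL JDBC driver jar also has to be available to Hive for the connection to work.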
Hive Program Structure
• The Hive Shell
  The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL.
  HiveQL is heavily influenced by MySQL, so if you are familiar with MySQL, you should feel at home using Hive.
  A command must be terminated with a semicolon to tell Hive to execute it. HiveQL is generally case insensitive. The Tab key will autocomplete Hive keywords and functions.
• Hive can run in non-interactive mode.
  Use the -f option to run the commands in the specified file:
    hive -f script.hql
  For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:
    hive -e 'SELECT * FROM dummy'
Ser-de
• A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De).
• The Serializer takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system.
• The Serializer is used when writing data, such as through an INSERT-SELECT statement.
• The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate.
• The Deserializer is used at query time to execute SELECT statements.
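The two directions of a SerDe can be sketched in a few lines. This is a toy Python model, not Hive's SerDe interface: it assumes comma-delimited text rows, represents a row object as a dict, and the column names and sample record are made up.

```python
FIELD_DELIMITER = ","  # assumption: comma-delimited text rows


def deserialize(line, columns):
    """Read path: turn one stored text line into a row object (here, a dict)."""
    values = line.rstrip("\n").split(FIELD_DELIMITER)
    return dict(zip(columns, values))


def serialize(row, columns):
    """Write path: turn a row object back into the text line to be stored."""
    return FIELD_DELIMITER.join(str(row[name]) for name in columns)


columns = ["sno", "sname", "year"]
row = deserialize("1,Asha,2011\n", columns)   # what a SELECT would do per record
line = serialize(row, columns)                # what an INSERT ... SELECT would do
print(row)
print(line)
```

A real SerDe also carries type information for each column; the sketch keeps every value as a string to stay short.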
Hive Tables
A Hive table is logically made up of the data being stored in HDFS and the associated metadata describing the layout of the data in the table; the metadata is kept in the metastore (for example, MySQL).
• Managed Table
  When you load data into a managed table, it is moved into Hive’s warehouse directory.
    CREATE TABLE managed_table (dummy STRING);
    LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
• External Table
  Alternatively, you may create an external table, which tells Hive to refer to data at an existing location outside the warehouse directory. The location of the external data is specified at table creation time:
    CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/tom/external_table';
    LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
• When you drop an external table, Hive leaves the data untouched and deletes only the metadata. (Dropping a managed table deletes both the data and the metadata.)
• Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
Storage Format
Text File
When you create a table with no ROW FORMAT or STORED AS clauses, the default format is delimited text with one row per line.
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Storage Format
RC: Record Columnar File
The RC format was designed for clusters with MapReduce in mind. It is a huge step up over standard text files. It’s a mature format with ways to ingest into the cluster without ETL. It is supported in several Hadoop system components.
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
Storage Format
ORC: Optimized Row Columnar File
The ORC format has been available since Hive 0.11. As the name implies, it is more optimized than the RC format. If you want to keep query speed while compressing the data as much as possible, then ORC is the best choice.
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
Practice Session
• CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
  For example:
    hive> CREATE SCHEMA testdb;
    hive> SHOW DATABASES;
    hive> DROP SCHEMA userdb;
• CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
    [(col_name data_type [COMMENT col_comment], ...)]
    [COMMENT table_comment]
    [ROW FORMAT row_format]
    [STORED AS file_format]
• Loading data
    LOAD DATA [LOCAL] INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE table_name
Create Table
• Managed Table
    CREATE TABLE Student (sno int, sname string, year int)
    row format delimited fields terminated by ',';
• External Table
    CREATE EXTERNAL TABLE Student (sno int, sname string, year int)
    row format delimited fields terminated by ','
    LOCATION '/user/external_table';
Load Data to table
To load a local file into the Hive table location:
• LOAD DATA LOCAL INPATH '/home/cloudera/SampleDataFile/student_marks.csv' INTO TABLE Student;
To load a file located in HDFS into the Hive table location:
• LOAD DATA INPATH '/user/cloudera/Student_Year.csv' INTO TABLE Student;
Table Commands
• Insert Data
    INSERT OVERWRITE TABLE targettable SELECT col1, col2 FROM source;  (to overwrite data)
    INSERT INTO TABLE targettbl SELECT col1, col2 FROM source;  (to append data)
• Multitable insert
    FROM sourcetable
    INSERT OVERWRITE TABLE table1 SELECT col1, col2 WHERE condition1
    INSERT OVERWRITE TABLE table2 SELECT col1, col2 WHERE condition2;
• Create table..as Select
    CREATE TABLE table1 AS SELECT col1, col2 FROM source;
• Create a new table with the same schema as an existing table
    CREATE TABLE newtable LIKE existingtable;
Database Commands
• Displays the list of all created databases.
    Show Databases;
• To create a new database with default properties.
    Create Database DBName;
• Create a database with a comment.
    Create Database DBName comment 'holds backup data';
• To use a database.
    Use DBName;
• To view the database details.
    DESCRIBE DATABASE EXTENDED DbName;
Table Commands
• To list all tables
    Show Tables;
• Displaying all contents of the table
    select * from <table-name>;
    select * from Student_Year where year = 2011;
• Display header information along with data
    set hive.cli.print.header=true;
• Using Group by
    select year, count(sno) from Student_Year group by year;
Table Commands
• SubQueries
  A subquery is a SELECT statement that is embedded in another SQL statement.
  Hive has limited support for subqueries, permitting a subquery in the FROM clause of a SELECT statement, or in the WHERE clause in certain cases.
  The following query finds the average maximum temperature for every year and weather station:
    SELECT station, year, AVG(max_temperature)
    FROM (
      SELECT station, year, MAX(temperature) AS max_temperature
      FROM records2
      GROUP BY station, year
    ) mt
    GROUP BY station, year;
Table Commands
Alter table
• To add a column
    ALTER TABLE student ADD COLUMNS (Year string);
• To modify a column
    ALTER TABLE table_name CHANGE old_col_name new_col_name new_data_type;
• Changes the table name
    Alter table Employee RENAME to emp;
• Drops a partition
    ALTER table MyTable DROP PARTITION (age=17);
• Drop Table
    DROP TABLE operatordetails;
• Describe Table Schema
    Desc Employee;
    Describe extended Employee;  -- displays detailed information
View
• A view is a sort of “virtual table” that is defined by a SELECT statement.
• Views may also be used to restrict users’ access to particular subsets of tables that they are authorized to see.
• In Hive, a view is not materialized to disk when it is created; rather, the view’s SELECT statement is executed when the statement that refers to the view is run.
• Views are included in the output of the SHOW TABLES command, and you can see more details about a particular view, including the query used to define it, by issuing the DESCRIBE EXTENDED view_name command.
• Create a view
    CREATE VIEW view_name (id, name) AS SELECT * FROM users;
• Drop a view
    DROP VIEW viewName;
Joins
• Only equality joins, outer joins, and left semi joins are supported in Hive.
• Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be joined in Hive
Example-Join
• hive> SELECT * FROM sales;
    Joe   2
    Hank  4
    Ali   0
    Eve   3
    Hank  2
• hive> SELECT * FROM items;
    2  Tie
    4  Coat
    3  Hat
    1  Scarf
Table Commands
• Using Join
  One of the nice things about using Hive, rather than raw MapReduce, is that Hive makes performing commonly used operations very simple.
• We can perform an inner join on the two tables as follows:
    hive> SELECT sales.*, items.* FROM sales JOIN items ON (sales.id = items.id);
• More than two tables can be joined in the same query:
    hive> SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);
• You can see how many MapReduce jobs Hive will use for any particular query by prefixing it with the EXPLAIN keyword:
    EXPLAIN SELECT sales.*, items.* FROM sales JOIN items ON (sales.id = items.id);
• For even more detail, prefix the query with EXPLAIN EXTENDED.
• Outer joins
  Outer joins allow you to find non-matches in the tables being joined.
    hive> SELECT sales.*, items.* FROM sales LEFT OUTER JOIN items ON (sales.id = items.id);
    hive> SELECT sales.*, items.* FROM sales RIGHT OUTER JOIN items ON (sales.id = items.id);
    hive> SELECT sales.*, items.* FROM sales FULL OUTER JOIN items ON (sales.id = items.id);
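To see which rows each join type keeps, the sample sales and items data from the Example-Join slide can be joined by hand. This is a small Python sketch of the join semantics only; it is not how Hive executes joins, the function names are made up, and Hive does not guarantee any particular row order.

```python
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]  # (name, id)
items = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]             # (id, name)


def inner_join(left, right):
    """Keep only pairs whose ids match."""
    return [(s, i) for s in left for i in right if s[1] == i[0]]


def left_outer_join(left, right):
    """Keep every sales row; None stands for NULL item columns."""
    rows = []
    for s in left:
        matches = [i for i in right if s[1] == i[0]]
        rows.extend((s, i) for i in matches or [None])
    return rows


def right_outer_join(left, right):
    """Keep every items row; None stands for NULL sales columns."""
    rows = []
    for i in right:
        matches = [s for s in left if s[1] == i[0]]
        rows.extend((s, i) for s in matches or [None])
    return rows


def full_outer_join(left, right):
    """Keep every row from both sides."""
    rows = left_outer_join(left, right)
    sales_ids = {s[1] for s in left}
    rows.extend((None, i) for i in right if i[0] not in sales_ids)
    return rows


for name, join in [("inner", inner_join), ("left", left_outer_join),
                   ("right", right_outer_join), ("full", full_outer_join)]:
    print(name, len(join(sales, items)), "rows")
```

Ali (id 0) has no matching item, so he appears only in the left and full outer joins; Scarf (id 1) was never sold, so it appears only in the right and full outer joins.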
Table Commands
Map Side Join
• If all but one of the tables being joined are small, the join can be performed as a map-only job.
• The query does not need a reducer. For every mapper of a, b is read completely. A restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.
• SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;
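The idea behind a map-side join can be sketched as a hash join in which the small table is held in memory. This Python model is an illustration under assumptions: each inner list stands in for one input split of the large table a, the function name map_side_join is made up, and the sample rows are invented.

```python
def map_side_join(big_table_splits, small_table):
    """Join each split of the big table against an in-memory copy of the small table."""
    # Build the lookup table once; in a real job every mapper loads its own copy.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    joined = []
    for split in big_table_splits:        # one iteration stands in for one mapper
        for key, value in split:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined                         # no shuffle or reduce step was needed


a_splits = [[(1, "a1"), (2, "a2")], [(3, "a3"), (2, "a2b")]]  # large table a, two splits
b = [(2, "b2"), (3, "b3")]                                    # small table b
result = map_side_join(a_splits, b)
print(result)
```

Because each mapper sees only its own split of a, it can emit unmatched rows of a but can never know which rows of b went unmatched across all mappers, which is why a FULL or RIGHT OUTER JOIN on b is not possible this way.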
Partitioning in Hive
• Using partitions, you can make it faster to execute queries on slices of the data.
• A table can have one or more partition columns.
• A separate data directory is created for each distinct value combination in the partition columns.
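The directory-per-partition layout can be illustrated with a short Python sketch. It assumes the commonly used default warehouse location /user/hive/warehouse and the column=value directory naming; the table names, partition columns and values are hypothetical.

```python
WAREHOUSE_DIR = "/user/hive/warehouse"  # assumption: default warehouse location


def partition_path(table_name, partition_spec):
    """Return the directory holding one partition of a managed table.

    partition_spec is an ordered list of (column, value) pairs; each pair
    adds one directory level named column=value.
    """
    parts = [f"{column}={value}" for column, value in partition_spec]
    return "/".join([WAREHOUSE_DIR, table_name.lower()] + parts)


print(partition_path("logs", [("dt", "2011-01-01")]))
print(partition_path("logs", [("dt", "2011-01-01"), ("country", "IN")]))
```

A query that filters on the partition columns only has to read the matching directories, which is where the speed-up comes from.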
Partitioning in Hive
• Partitions are defined at table creation time using the PARTITIONED BY clause.
Static Partition (Example-1)
    CREATE TABLE student_partnew (name STRING, id int, marks String) PARTITIONED BY (pyear STRING) row format delimited fields terminated by ',';
    LOAD DATA LOCAL INPATH '/home/notroot/std_2011.csv' INTO TABLE student_partnew PARTITION (pyear='2011');
    LOAD DATA LOCAL INPATH '/home/notroot/std_2012.csv' INTO TABLE student_partnew PARTITION (pyear='2012');
    LOAD DATA LOCAL INPATH '/home/notroot/std_2013.csv' INTO TABLE student_partnew PARTITION (pyear='2013');
Partitioning in Hive
Static Partition (Example-2)
• CREATE TABLE student_New (id int, name string, marks int, year int) row format delimited fields terminated by ',';
• LOAD DATA LOCAL INPATH '/home/notroot/Sandeep/DataSamples/Student_new.csv' INTO TABLE student_New;
• CREATE TABLE student_part (id int, name string, marks int) PARTITIONED BY (pyear STRING);
• INSERT INTO TABLE student_part PARTITION (pyear='2012') SELECT id, name, marks FROM student_New WHERE year = 2012;
SHOW Partition
• SHOW PARTITIONS student_part;
Partitioning in Hive
Dynamic Partition
• To enable dynamic partitions:
    set hive.exec.dynamic.partition=true;
    (Enables dynamic partitioning; the default is false in older Hive releases.)
    set hive.exec.dynamic.partition.mode=nonstrict;
    (In the default strict mode at least one partition column must be given a static value; nonstrict mode allows all partition columns to be dynamic.)
    set hive.exec.max.dynamic.partitions.pernode=300;
    (The default value is 100; adjust it to the number of partitions expected in your case.)
    set hive.exec.max.created.files=150000;
    (The default value is 100000; larger tables can exceed it, so it may need to be raised.)
Partitioning in Hive
• CREATE TABLE Stage_oper_Month (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int, opr_status string, EYEAR STRING, EMONTH STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• LOAD DATA LOCAL INPATH '/home/notroot/Sandeep/DataSamples/user_info.csv' INTO TABLE Stage_oper_Month;
• CREATE TABLE Fact_oper_Month (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int) PARTITIONED BY (opr_status string, eyear STRING, eMONTH STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• FROM Stage_oper_Month INSERT OVERWRITE TABLE Fact_oper_Month PARTITION (opr_status, eyear, eMONTH) SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR, EMONTH DISTRIBUTE BY opr_status, eyear, eMONTH;
• (Select from the partitioned table) SELECT oper_id, oper_name, oper_dept FROM Fact_oper_Month WHERE eyear=2010 AND emonth=1;
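In a dynamic-partition insert, the partition a row lands in is decided by the values of the trailing columns of the SELECT list, in the order of the PARTITION clause. The following Python sketch models that routing; it is an illustration only, the function name and sample rows are made up, and max_partitions loosely mirrors the per-node limit mentioned earlier.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, num_partition_columns, max_partitions=100):
    """Group rows by the values of their trailing partition columns."""
    partitions = defaultdict(list)
    for row in rows:
        data = row[:-num_partition_columns]
        key = row[-num_partition_columns:]
        partitions[key].append(data)
        if len(partitions) > max_partitions:
            raise RuntimeError("too many dynamic partitions created")
    return dict(partitions)


# Hypothetical staging rows: (oper_id, oper_name, opr_status, eyear, emonth)
staging_rows = [
    ("op1", "Asha", "A", "2010", "1"),
    ("op2", "Ravi", "A", "2010", "1"),
    ("op3", "Mina", "I", "2011", "2"),
]
result = dynamic_partition_insert(staging_rows, num_partition_columns=3)
for key, data_rows in result.items():
    print(key, data_rows)
```

Every distinct (opr_status, eyear, emonth) combination becomes its own partition, which is why a high-cardinality partition column can quickly hit the configured limits.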
Bucketing Features
• Partitioning gives effective results when there is a limited number of partitions and the partitions are of comparatively equal size.
• To overcome the limitations of partitioning, Hive provides bucketing, another technique for decomposing table data sets into more manageable parts.
• Bucketing is based on (hash function of the bucketed column) mod (total number of buckets).
• Use the CLUSTERED BY clause to divide the table into buckets.
• Bucketing can be done on Hive tables with or without partitioning.
• Bucketed tables produce almost equally distributed data file parts.
• To populate a bucketed table, we need to set the property:
    set hive.enforce.bucketing = true;
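The "hash mod number of buckets" rule can be tried out in a few lines. This Python sketch rests on two assumptions that should be checked against your Hive version: that an INT column hashes to its own value, and that the hash is masked to a non-negative 32-bit value before the modulo. The function names are made up.

```python
from collections import Counter

JAVA_INT_MAX = 0x7FFFFFFF  # mask that keeps the hash non-negative


def int_hash(value):
    """Assumption: an INT column hashes to its own value."""
    return value


def bucket_for(hash_code, num_buckets):
    """Pick the bucket as (non-negative hash) mod (number of buckets)."""
    return (hash_code & JAVA_INT_MAX) % num_buckets


ids = range(20)
distribution = Counter(bucket_for(int_hash(i), 4) for i in ids)
print(sorted(distribution.items()))
```

With evenly spread keys, each bucket receives about the same number of rows, which is the property that makes bucket-based sampling and bucketed joins effective.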
Bucketing Advantages
• Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a fraction of the data for testing and debugging purposes when the original data sets are very large.
• As the data files are equal-sized parts, map-side joins will be faster on bucketed tables than on non-bucketed tables. In a map-side join, a mapper processing a bucket of the left table knows that the matching rows in the right table will be in its corresponding bucket, so it only retrieves that bucket (which is a small fraction of all the data stored in the right table).
• Similar to partitioning, bucketed tables provide faster query responses than non-bucketed tables.
Bucketing Example
• We can create bucketed tables with the CLUSTERED BY clause and the optional SORTED BY clause in the CREATE TABLE statement, and the DISTRIBUTE BY clause in the INSERT statement that loads them.
• CREATE TABLE Month_bucketed (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int, opr_status string, eyear string, emonth string) CLUSTERED BY (oper_id) SORTED BY (oper_id, Creation_Date) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• As with dynamically partitioned tables, bucketed tables should not be loaded directly with the LOAD DATA (LOCAL) INPATH command, because it does not distribute rows into buckets; instead, we need to use INSERT OVERWRITE TABLE … SELECT … FROM another table to populate them.
• INSERT OVERWRITE TABLE Month_bucketed SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR, EMONTH FROM stage_oper_month DISTRIBUTE BY oper_id SORT BY oper_id, Creation_Date;
Partitioning with Bucketing
• CREATE TABLE Month_Part_bucketed (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int) PARTITIONED BY (opr_status string, eyear STRING, eMONTH STRING) CLUSTERED BY (oper_id) SORTED BY (oper_id, Creation_Date) INTO 12 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• FROM Stage_oper_Month stg INSERT OVERWRITE TABLE Month_Part_bucketed PARTITION (opr_status, eyear, eMONTH) SELECT stg.oper_id, stg.Creation_Date, stg.oper_name, stg.oper_age, stg.oper_dept, stg.oper_dept_id, stg.opr_status, stg.EYEAR, stg.EMONTH DISTRIBUTE BY opr_status, eyear, eMONTH;
Note: Unlike partition columns (which are not included in the table's column definition), bucketed columns are included in the table definition, as shown in the code above for the oper_id and Creation_Date columns.
Table Sampling in Hive
Table sampling in Hive extracts a small fraction of the data from the original large data set. It is similar to the LIMIT operator in Hive.
Difference between LIMIT and TABLESAMPLE in Hive:
• In many cases a LIMIT clause executes the entire query and then returns only a limited number of results.
• Sampling selects only a portion of the data to run the query on.
To see the performance difference between bucketed and non-bucketed tables:
    Query-1: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM month_bucketed TABLESAMPLE (BUCKET 12 OUT OF 12 ON oper_id);
    Query-2: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM stage_oper_month LIMIT 18;
Note: Query-1 should generally perform faster than Query-2.
To perform random sampling with Hive:
    SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM month_bucketed TABLESAMPLE (1 PERCENT);
Hive UDF
• A UDF is Java code that must satisfy the following two properties:
• A UDF must implement at least one evaluate() method.
• A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
Sample UDF
    package com.example.hive.udf;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public final class Lower extends UDF {
      public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
      }
    }
• hive> add jar my_jar.jar;
• hive> create temporary function my_lower as 'com.example.hive.udf.Lower';
• hive> select empid, my_lower(empname) from employee;
Hive UDAF
• A UDAF works on multiple input rows and creates a single output row. Aggregate functions include such functions as COUNT and MAX.
• An aggregate function is more difficult to write than a regular UDF.
• A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF.
• It must contain one or more nested static classes implementing org.apache.hadoop.hive.ql.exec.UDAFEvaluator.
An evaluator must implement five methods:
• init()
  The init() method initializes the evaluator and resets its internal state. In MaximumIntUDAFEvaluator, we set the IntWritable object holding the final result to null.
Hive UDAF
• iterate()
  The iterate() method is called every time there is a new value to be aggregated. The evaluator should update its internal state with the result of performing the aggregation. The arguments that iterate() takes correspond to those in the Hive function from which it was called.
  In this example, there is only one argument. The value is first checked to see whether it is null, and if it is, it is ignored. Otherwise, the result instance variable is set either to value’s integer value (if this is the first value that has been seen) or to the larger of the current result and value (if one or more values have already been seen). We return true to indicate that the input value was valid.
• terminatePartial()
  The terminatePartial() method is called when Hive wants a result for the partial aggregation. The method must return an object that encapsulates the state of the aggregation.
  In this case, an IntWritable suffices because it encapsulates either the maximum value seen or null if no values have been processed.
Hive UDAF
• merge()
  The merge() method is called when Hive decides to combine one partial aggregation with another. The method takes a single object, whose type must correspond to the return type of the terminatePartial() method.
  In this example, the merge() method can simply delegate to the iterate() method because the partial aggregation is represented in the same way as a value being aggregated. This is not generally the case, and the method should implement the logic to combine the evaluator’s state with the state of the partial aggregation.
• terminate()
  The terminate() method is called when the final result of the aggregation is needed. The evaluator should return its state as a value. In this case, we return the result instance variable.
Hive UDAF
    package com.hadoopbook.hive;

    import org.apache.hadoop.hive.ql.exec.UDAF;
    import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
    import org.apache.hadoop.io.IntWritable;

    public class HiveUDAFSample extends UDAF {

      public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {

        private IntWritable result;

        public void init() {
          result = null;
        }

        public boolean iterate(IntWritable value) {
          if (value == null) {
            return true;
          }
          if (result == null) {
            result = new IntWritable(value.get());
          } else {
            result.set(Math.max(result.get(), value.get()));
          }
          return true;
        }

        public IntWritable terminatePartial() {
          return result;
        }

        public boolean merge(IntWritable other) {
          return iterate(other);
        }

        public IntWritable terminate() {
          return result;
        }
      }
    }
Hive UDAF
To use the UDAF in Hive:
• hive> add jar my_jar.jar;
• hive> CREATE TEMPORARY FUNCTION maximum AS 'com.hadoopbook.hive.HiveUDAFSample';
• hive> SELECT maximum(salary) FROM employee;
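The order in which the five evaluator methods are called can be followed with a small simulation. This Python sketch mirrors the maximum evaluator described above, with plain integers in place of IntWritable; the driver function run_aggregation is made up and only approximates how map-side partial results are merged on the reduce side.

```python
class MaximumEvaluator:
    """Python stand-in for the MaximumIntUDAFEvaluator described above."""

    def init(self):
        self.result = None

    def iterate(self, value):
        if value is None:           # NULL inputs are ignored
            return True
        self.result = value if self.result is None else max(self.result, value)
        return True

    def terminate_partial(self):
        return self.result          # the state of a partial aggregation

    def merge(self, other):
        return self.iterate(other)  # a partial result looks like a plain value

    def terminate(self):
        return self.result


def run_aggregation(input_splits):
    """Aggregate each split separately, then merge the partial results."""
    partials = []
    for values in input_splits:
        evaluator = MaximumEvaluator()
        evaluator.init()
        for value in values:
            evaluator.iterate(value)
        partials.append(evaluator.terminate_partial())

    final = MaximumEvaluator()
    final.init()
    for partial in partials:
        final.merge(partial)
    return final.terminate()


print(run_aggregation([[3, None, 7], [5], []]))
```

The empty third split produces a null partial result, which merge() skips in the same way iterate() skips a null input value.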
Performance Tuning
Partitioning Tables:
• Hive partitioning is an effective method to improve query performance on larger tables. Partitioning allows you to store data in separate sub-directories under the table location. It greatly helps queries that filter on the partition key(s). The choice of partition key is a sensitive decision; it should always be a low-cardinality attribute. For example, if your data is associated with a time dimension, then date could be a good partition key. Similarly, if the data is associated with a location, like a country or state, then it’s a good idea to have hierarchical partitions like country/state.
Performance Tuning
De-normalizing data:
• Normalization is a standard process used to model your data tables with certain rules to deal with redundancy of data and anomalies. In simpler words, if you normalize your data sets, you end up creating multiple relational tables which can be joined at run time to produce the results. Joins are expensive and difficult operations to perform and are one of the common reasons for performance issues. Because of that, it’s a good idea to avoid highly normalized table structures because they require join queries to derive the desired metrics.
Performance Tuning
Compress map/reduce output:
• Compression significantly reduces the intermediate data volume, which in turn reduces the amount of data transferred between mappers and reducers, generally over the network. Compression can be applied to the mapper and reducer output individually. Keep in mind that gzip-compressed files are not splittable, so gzip should be applied with caution: a compressed file should not be larger than a few hundred megabytes, otherwise it can lead to an imbalanced job.
• Other compression codec options include snappy, lzo, bzip, etc.
• For map output compression, set mapred.compress.map.output to true.
• For job output compression, set mapred.output.compress to true.
Performance Tuning
Map join:
• Map joins are really efficient if the table on the other side of a join is small enough to fit in memory. Hive supports a parameter, hive.auto.convert.join, which when set to “true” tells Hive to try to convert joins to map joins automatically. When relying on this behavior, be sure the parameter is enabled in the Hive environment.
Performance Tuning
Bucketing:
• Bucketing improves join performance if the bucket key and the join keys are the same. Bucketing in Hive distributes the data into different buckets based on the hash of the bucket key. It also reduces the I/O scans during the join process if the join is on the same keys (columns).
• Additionally, it’s important to ensure the bucketing flag is set (SET hive.enforce.bucketing=true;) every time before writing data to the bucketed table. To leverage bucketing in the join operation we should SET hive.optimize.bucketmapjoin=true. This setting hints to Hive to do a bucket-level join during the map stage. It also reduces the scan cycles needed to find a particular key, because bucketing ensures that the key is present in a certain bucket.
Performance Tuning
Parallel execution:
• Hive queries are internally translated into a number of MapReduce jobs. Having multiple MapReduce jobs is not enough, though; the real advantage comes from executing them in parallel, and simply writing a query does not achieve this.
• SELECT table1.a FROM table1 JOIN table2 ON (table1.a = table2.a) JOIN table3 ON (table3.a = table1.a) JOIN table4 ON (table4.b = table3.b);
• Output: Execution time: 800 sec
But let us check the execution plan for this. Observations (see picture highlighted area):
• Total Map-Reduce Jobs: 2.
• Serially launched and run.
Performance Tuning
Parallel execution:
• To achieve this, we rewrote the query to split it into independent units that Hive could run as independent MapReduce jobs in parallel. Following is what we did to our query:
• SELECT r1.a FROM (SELECT table1.a FROM table1 JOIN table2 ON table1.a = table2.a) r1 JOIN (SELECT table3.a FROM table3 JOIN table4 ON table3.b = table4.b) r2 ON (r1.a = r2.a);
• Output: Same results, but execution time: 464 sec
Observations:
• Total Map-Reduce Jobs: 5 (see picture highlighted area).
• Jobs are launched and run in parallel (see highlighted area).
• Decrease in query execution time (from 800 sec to 464 sec, about 42% in our case).
Points to Note:
• The hive.exec.parallel parameter needs to be set to TRUE.
• To control how many jobs at most can be executed in parallel, set the hive.exec.parallel.thread.number parameter.
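Why independent jobs finish sooner when run in parallel can be shown with a small scheduling model. This Python sketch is an illustration under assumptions: the stage names and durations are invented (they are not the measurements quoted above), and it assumes enough parallel slots for every independent stage to start at once.

```python
def serial_time(durations):
    """Total time when every stage runs one after another."""
    return sum(durations.values())


def parallel_time(durations, depends_on):
    """Finish time when a stage starts as soon as all its dependencies are done."""
    finish = {}

    def finish_time(stage):
        if stage not in finish:
            start = max((finish_time(dep) for dep in depends_on.get(stage, [])),
                        default=0)
            finish[stage] = start + durations[stage]
        return finish[stage]

    return max(finish_time(stage) for stage in durations)


# Hypothetical stage durations in seconds for a query split into r1, r2 and a final join.
durations = {"r1": 300, "r2": 250, "final_join": 150}
depends_on = {"final_join": ["r1", "r2"]}

print("serial:", serial_time(durations))
print("parallel:", parallel_time(durations, depends_on))
```

In this model the parallel run is bounded by the slowest dependency chain (r1 followed by the final join), so splitting a query into independent sub-queries only helps when those sub-queries do not depend on each other.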