Pig, Hive, and Jaql
IBM Information Management
Cloud Computing Center of Competence
IBM Toronto Lab
Nov 21, 2015
Agenda
Overview
Pig
Hive
Jaql
Similarities of Pig, Hive and Jaql
All translate their respective high-level languages to MapReduce jobs
All offer significant reductions in program size over Java
All provide points of extension to cover gaps in functionality
All provide interoperability with other languages
None support random reads/writes or low-latency queries
Comparing Pig, Hive, and Jaql
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo! | Facebook | IBM
Language name | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures it operates on | Complex, nested | | JSON
Schema optional? | Yes | No, but data can have many schemas | Yes
Relational complete? | Yes | Yes |
Turing complete? | Yes when extended with Java UDFs | Yes when extended with Java UDFs | Yes
Pig components
Two components
- Language (called Pig Latin)
- Compiler
Two execution environments
- Local (single JVM): pig -x local
- Distributed (Hadoop cluster): pig -x mapreduce, or simply pig
[Figure: Pig comprises the Pig Latin language and a compiler; it runs in a local or a distributed execution environment]
Running Pig
Script: pig scriptfile.pig
Grunt (command line): pig (launches the command-line tool)
Embedded: call into Pig from Java
Sample code
# pig
grunt> records = LOAD 'econ_assist.csv'
       USING PigStorage(',')
       AS (country:chararray, sum:long);
grunt> grouped = GROUP records BY country;
grunt> thesum = FOREACH grouped
       GENERATE group,
       SUM(records.sum);
grunt> DUMP thesum;
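To make the data flow concrete, here is a minimal Python sketch of what the Pig Latin script above computes: load delimited rows, group them by country, and sum each group. It is only a model of the semantics, not Pig's API; the sample rows and the helper names (load, group_by_country, sum_per_group) are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for econ_assist.csv (country,sum).
SAMPLE_CSV = "Kenya,100\nPeru,40\nKenya,25\n"

def load(text):
    """Models LOAD ... USING PigStorage(',') AS (country:chararray, sum:long)."""
    return [(country, int(amount)) for country, amount in csv.reader(io.StringIO(text))]

def group_by_country(records):
    """Models GROUP records BY country: one bag of tuples per key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    return dict(groups)

def sum_per_group(grouped):
    """Models FOREACH grouped GENERATE group, SUM(records.sum)."""
    # Sorted only to make the output deterministic; Pig does not sort here.
    return sorted((key, sum(amount for _, amount in bag)) for key, bag in grouped.items())

thesum = sum_per_group(group_by_country(load(SAMPLE_CSV)))
for row in thesum:  # models DUMP thesum
    print(row)
```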
Pig Latin
Statements, operations and commands
A Pig Latin program is a collection of statements
A statement is an operation or a command
Example of an operation: LOAD 'statement.txt';
Example of a command: ls *.txt
Logical plan/physical plan
As each statement is processed, it is added to the logical plan
When a statement such as 'DUMP relation' is reached, the logical plan is compiled to a physical plan and executed
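The deferred execution described above can be sketched in Python as a toy plan object: each operation is only recorded, and nothing runs until a DUMP-like call. The class LazyPlan and its methods are hypothetical and skip the compile-to-physical-plan step entirely; this is a sketch of the idea, not of how Pig is implemented.

```python
class LazyPlan:
    """Toy model: operations are recorded, then run only on dump()."""

    def __init__(self):
        self.steps = []        # the "logical plan": (name, function) pairs
        self.executed = False

    def add(self, name, func):
        # Processing a statement only appends to the plan; nothing runs yet.
        self.steps.append((name, func))
        return self

    def dump(self, data):
        # A DUMP-like statement triggers execution of the whole plan.
        self.executed = True
        for _name, func in self.steps:
            data = func(data)
        return data

plan = LazyPlan()
plan.add("filter", lambda rows: [r for r in rows if r[1] > 1986])
plan.add("project", lambda rows: [r[0] for r in rows])
ran_before_dump = plan.executed          # still False at this point
result = plan.dump([(1, 1987), (2, 1980)])
```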
Pig Latin statements
UDF statements: REGISTER, DEFINE
Commands
- Hadoop filesystem (cat, ls, etc.)
- Hadoop MapReduce (kill)
- Utility (exec, help, quit, run, set)
Diagnostic operators: DESCRIBE, EXPLAIN, ILLUSTRATE
Pig Latin Relational operators
Loading and storing (LOAD, STORE, DUMP)
Filtering (FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE)
Grouping and joining (JOIN, COGROUP, GROUP, CROSS)
Sorting (ORDER, LIMIT)
Combining and splitting (UNION, SPLIT)
Pig Latin Relations and schemata
Result of a relational operator is a relation
A relation is a bag of tuples
Relations can be named using an alias
x = LOAD 'sample.txt' AS (id: int, year: int);   -- x is the alias
DUMP x;
(1,1987)   -- a tuple
Structure of a relation is a schema
DESCRIBE x;
x: {id: int, year: int}   -- the schema
Pig Latin expressions
An expression is part of a statement that contains a relational operator
Categories of expressions: Constant, Field, Projection, Map lookup, Cast, Arithmetic, Conditional, Boolean, Comparison, Functional, Flatten
Pig Latin Data types
Simple types: int, long, float, double, chararray, bytearray
Complex types:
Tuple: sequence of fields of any type
Bag: unordered collection of tuples
Map: set of key-value pairs; keys must be chararray
Pig Latin Function types
Eval
Input: One or more expressions
Output: An expression
Example: MAX
Filter
Input: Bag or map
Output: boolean
Example: IsEmpty
Load
Input: Data from external storage
Output: A relation
Example: PigStorage
Store
Input: A relation
Output: Data to external storage
Example: PigStorage
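The four function types differ mainly in the shape of their input and output. The Python sketch below mirrors those shapes with plain functions; the names are hypothetical stand-ins (eval_max for MAX, filter_is_empty for IsEmpty, load_delimited and store_delimited for PigStorage) and none of this is Pig's actual function interface.

```python
def eval_max(values):
    """Eval: takes expressions (here, a bag of numbers), returns a value."""
    return max(values)

def filter_is_empty(bag_or_map):
    """Filter: takes a bag or map, returns a boolean."""
    return len(bag_or_map) == 0

def load_delimited(text, delimiter=","):
    """Load: turns external data into a relation (a list of tuples)."""
    return [tuple(line.split(delimiter)) for line in text.splitlines() if line]

def store_delimited(relation, delimiter=","):
    """Store: turns a relation back into external data (text)."""
    return "\n".join(delimiter.join(fields) for fields in relation)

relation = load_delimited("Kenya,100\nPeru,40")
round_trip = store_delimited(relation)
```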
Pig Latin User-Defined Functions
Written in Java
Packaged in a JAR file
Register JAR file using the REGISTER statement
Optionally, alias it with DEFINE statement
Hive - Configuration
Three ways to configure Hive:
hive-site.xml
- fs.default.name
- mapred.job.tracker
- Metastore configuration settings
hive --hiveconf
SET command in the Hive shell
Running Hive
Hive shell
- Interactive: hive
- Script: hive -f myscript
- Inline: hive -e 'SELECT * FROM mytable'
Hive services
hive --service servicename
where servicename can be:
hiveserver: server for Thrift, JDBC, and ODBC clients
hwi: web interface
jar: hadoop jar with Hive jars in the classpath
metastore: out-of-process metastore
Sample code
# hive
hive> CREATE TABLE foreign_aid
      (country STRING, sum BIGINT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
hive> SHOW TABLES;
hive> DESCRIBE foreign_aid;
hive> LOAD DATA INPATH 'econ_assist.csv'
      OVERWRITE INTO TABLE foreign_aid;
hive> SELECT * FROM foreign_aid LIMIT 10;
hive> SELECT country, SUM(sum) FROM foreign_aid
      GROUP BY country;
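For readers without a Hive installation, the aggregate query above can be tried against Python's built-in sqlite3 module. This is a rough stand-in only: SQLite is not Hive, there is no MapReduce involved, the rows are made up, and the ORDER BY is added purely to make the result deterministic.

```python
import sqlite3

# Hypothetical rows standing in for econ_assist.csv.
ROWS = [("Kenya", 100), ("Peru", 40), ("Kenya", 25)]

conn = sqlite3.connect(":memory:")
# "sum" is quoted because it is also the name of an aggregate function.
conn.execute('CREATE TABLE foreign_aid (country TEXT, "sum" INTEGER)')
conn.executemany("INSERT INTO foreign_aid VALUES (?, ?)", ROWS)

# Same shape as the HiveQL query: total per country.
totals = conn.execute(
    'SELECT country, SUM("sum") FROM foreign_aid '
    "GROUP BY country ORDER BY country"
).fetchall()
conn.close()
```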
Hive - Metastore
Stores Hive metadata
Configurations
Embedded: in-process metastore, in-process database
Local: in-process metastore, out-of-process database
Remote: out-of-process metastore, out-of-process database
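Hive - Schema-On-Read
Faster loads into the database (files are simply copied or moved; data is not checked against the schema at load time)
Slower queries (the schema is applied when the data is read)
Flexibility: multiple schemas for the same data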
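A minimal Python sketch of schema-on-read, under the assumption that the stored data is comma-delimited text: the raw lines are kept untouched, and a schema is applied only at read time, so two different schemas can be laid over the same data. The function read_with_schema and the sample lines are hypothetical.

```python
RAW_LINES = ["Kenya,100", "Peru,40"]   # hypothetical file contents, stored as-is

def read_with_schema(lines, schema):
    """Apply (name, type) pairs to raw text only when the data is read."""
    rows = []
    for line in lines:
        fields = line.split(",")
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows

# Two schemas over the same stored lines; "loading" did no parsing at all.
as_typed = read_with_schema(RAW_LINES, [("country", str), ("sum", int)])
as_text = read_with_schema(RAW_LINES, [("place", str), ("amount", str)])
```

The parsing cost is paid on every read, which is the trade-off behind the slower queries noted above.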
Hive Query Language (HiveQL)
SQL dialect
Does not support full SQL92 specification
No support for:
HAVING clause in SELECT
Correlated subqueries
Subqueries outside FROM clauses
Updateable or materialized views
Stored procedures
Hive Query Language (HiveQL)
Extensions
MySQL-like extensions
MapReduce extensions
Multi-table insert, MAP, REDUCE, TRANSFORM clauses
Data types
- Simple: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING
- Complex: ARRAY, MAP, STRUCT
Hive Query Language (HiveQL)
Built-in functions
SHOW FUNCTIONS (lists the available functions)
DESCRIBE FUNCTION (describes a given function)
Hive - Tables
Managed (CREATE TABLE)
- LOAD: file moved into Hive's data warehouse directory
- DROP: both metadata and data deleted
External (CREATE EXTERNAL TABLE)
- LOAD: no files moved
- DROP: only metadata deleted
Use EXTERNAL when:
- Sharing data between Hive and other Hadoop applications
- You wish to use multiple schemas on the same data
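The difference between managed and external tables, as described above, comes down to who owns the files. The Python sketch below models it in memory with a hypothetical ToyWarehouse class; the paths and table names are invented, and no real filesystem or Hive metastore is involved.

```python
class ToyWarehouse:
    """In-memory stand-in for a filesystem plus a metastore."""

    def __init__(self, files):
        self.files = set(files)      # paths that currently exist
        self.tables = {}             # table name -> (kind, data location)

    def create(self, name, kind, location):
        self.tables[name] = (kind, location)

    def load(self, name, src):
        kind, location = self.tables[name]
        if kind == "managed":
            # Managed: the file is moved under the table's directory.
            self.files.remove(src)
            self.files.add(location + "/" + src.rsplit("/", 1)[-1])
        # External: no files are moved; the table points at data in place.

    def drop(self, name):
        kind, location = self.tables.pop(name)   # metadata is always removed
        if kind == "managed":
            # Managed: data under the table directory is removed too.
            self.files = {f for f in self.files if not f.startswith(location + "/")}

wh = ToyWarehouse({"/in/aid.csv", "/shared/aid.csv"})
wh.create("managed_aid", "managed", "/warehouse/managed_aid")
wh.create("external_aid", "external", "/shared")
wh.load("managed_aid", "/in/aid.csv")
after_load = set(wh.files)
wh.drop("managed_aid")
wh.drop("external_aid")
after_drop = set(wh.files)
```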
Hive - Partitioning
Can make some queries faster
Divides data based on the partition column
Use the PARTITIONED BY clause when creating a table
Use the PARTITION clause when loading data
SHOW PARTITIONS shows a table's partitions
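Why partitioning can speed up queries is easier to see with a small model. Hive keeps each partition in its own subdirectory named column=value, so a query that filters on the partition column only needs to read the matching directory. The Python sketch below imitates that layout with a dictionary; the year column and the sample rows are assumptions, and the functions are illustrative rather than anything Hive exposes.

```python
from collections import defaultdict

def partition_path(table, column, value):
    """Hive-style layout: one subdirectory per partition value."""
    return f"{table}/{column}={value}"

def write_partitioned(table, column, rows):
    """Group rows into partition directories keyed by the partition column."""
    layout = defaultdict(list)
    for row in rows:
        layout[partition_path(table, column, row[column])].append(row)
    return dict(layout)

def scan(layout, table, column, value):
    """A query filtering on the partition column reads one directory only."""
    return layout.get(partition_path(table, column, value), [])

rows = [
    {"country": "Kenya", "year": 2009, "sum": 100},
    {"country": "Peru", "year": 2010, "sum": 40},
    {"country": "Kenya", "year": 2010, "sum": 25},
]
layout = write_partitioned("foreign_aid", "year", rows)
hits = scan(layout, "foreign_aid", "year", 2010)
```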
Hive - Bucketing
Can make some queries faster
Supports sampling data
Use the CLUSTERED BY clause when creating a table
For sorted buckets, use the SORTED BY clause when creating a table
To query a sample of your data, use the TABLESAMPLE clause, which uses bucketing
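Bucketing assigns each row to one of a fixed number of buckets by hashing the clustering column, which is what makes sampling cheap: reading one bucket yields a slice of the data without a full scan. The Python sketch below illustrates the idea; zlib.crc32 is a stand-in chosen for determinism and is not the hash function Hive uses, and the function names are hypothetical.

```python
import zlib

def bucket_for(value, num_buckets):
    """Assign a value to a bucket by hashing it (stand-in hash, not Hive's)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def bucketize(rows, column, num_buckets):
    """Distribute rows across buckets based on the clustering column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets

def sample_bucket(buckets, index):
    """Sampling reads one whole bucket instead of scanning every row."""
    return buckets[index]

rows = [{"id": i} for i in range(100)]
buckets = bucketize(rows, "id", 4)
sample = sample_bucket(buckets, 0)
```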
Hive - Storage formats
Delimited text (default): ROW FORMAT DELIMITED
SerDe (Serializer/Deserializer): ROW FORMAT SERDE serdename
e.g. ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy'
Binary SerDe
- Row-oriented (sequence file): STORED AS SEQUENCEFILE
- Column-oriented (RCFile): STORED AS RCFILE
Hive User-Defined Functions
Written in Java
Three UDF types:
UDF: input is a single row, output is a single row
UDAF: input is multiple rows, output is a single row
UDTF: input is a single row, output is multiple rows
Register UDF using ADD JAR
Create alias using CREATE TEMPORARY FUNCTION
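The three UDF types are distinguished by their row cardinality. Real Hive UDFs are written in Java, as noted above; the Python functions below only illustrate the input and output shapes, and their names (udf_upper, udaf_total, udtf_split) are invented for this sketch.

```python
def udf_upper(value):
    """UDF shape: one row in, one row out."""
    return value.upper()

def udaf_total(values):
    """UDAF shape: many rows in, one row out."""
    return sum(values)

def udtf_split(value, sep=","):
    """UDTF shape: one row in, many rows out."""
    return value.split(sep)

names = [udf_upper(v) for v in ["kenya", "peru"]]
total = udaf_total([100, 40, 25])
parts = udtf_split("a,b,c")
```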
Jaql
A JSON Query Language
JSON = JavaScript Object Notation
Running Jaql
Jaql shell
- Interactive: jaqlshell
- Batch: jaqlshell -b myscript.jaql
- Inline: jaqlshell -e jaqlstatement
Modes
- Cluster: jaqlshell -c
- Minicluster: jaqlshell
Jaql
Query Language
Sources and sinks
e.g. Copy data from a local file to a new file on HDFS
read(file('input.json')) -> write(hdfs('output'));   // read(...) is the source, write(...) is the sink
Core operators: Filter, Transform, Expand, Group, Join, Union, Tee, Sort, Top
Jaql
Query Language
Variables
= operator binds source output to a variable
e.g. $tweets = read(hdfs('twitterfeed'));
Pipes, streams, and consumers
Pipe operator (->) streams data to a consumer
Pipe expects an array as input
e.g. $tweets -> filter $.from_src == 'tweetdeck';
$ is an implicit variable referencing the current array value
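The pipe-and-consumer model above can be approximated in Python with method chaining: an array flows from a source through operators to a sink, and each operator sees one element at a time, much like Jaql's implicit $. The Stream class is a hypothetical illustration, the sample records are invented (only the from_src field comes from the example above), and the sink returns a list rather than writing to HDFS.

```python
class Stream:
    """Wraps an array so steps can be chained, loosely like Jaql's -> operator."""

    def __init__(self, items):
        self.items = list(items)

    def filter(self, predicate):
        # predicate receives the current element, like Jaql's implicit $.
        return Stream(item for item in self.items if predicate(item))

    def transform(self, func):
        # Produces one output element per input element.
        return Stream(func(item) for item in self.items)

    def write(self):
        # Sink: here it just returns a list instead of writing to storage.
        return self.items

# Hypothetical records standing in for the twitter feed.
tweets = Stream([
    {"from_src": "tweetdeck", "text": "hello"},
    {"from_src": "web", "text": "hi"},
])
kept = tweets.filter(lambda t: t["from_src"] == "tweetdeck").write()
texts = tweets.transform(lambda t: t["text"]).write()
```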
Jaql
Query Language
Categories of Built-in Functions
system, core, hadoop, io, array, index, schema, xml, regex, binary, date, nil, agg, number, string, function, random, record
Jaql
Data Storage
Data store examples: Amazon S3, DB2, HBase, HDFS, HTTP, JDBC, local FS
Data format examples: JSON, CSV, XML
Thank you!