Pig, Hive, and Jaql

IBM Information Management

Cloud Computing Center of Competence

IBM Toronto Lab

Agenda

Overview

Pig

Hive

Jaql

Similarities of Pig, Hive and Jaql

All translate their respective high-level languages to MapReduce jobs

All offer significant reductions in program size over Java

All provide points of extension to cover gaps in functionality

All provide interoperability with other languages

None support random reads/writes or low-latency queries
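
The first point above can be made concrete with a toy model. The following is a minimal Python sketch, not the output of any of the three compilers, of how a grouped sum breaks down into map, shuffle, and reduce phases. The function names and sample rows are hypothetical.

```python
from collections import defaultdict

def map_phase(rows):
    # Emit one (key, value) pair per input row.
    for country, amount in rows:
        yield country, amount

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Collapse each key's values into a single sum.
    return {key: sum(values) for key, values in groups.items()}

rows = [("A", 10), ("B", 5), ("A", 7)]  # made-up sample data
totals = reduce_phase(shuffle(map_phase(rows)))
print(totals)
```

A one-line grouped sum in a high-level language stands in for all three phases, which is where the reduction in program size comes from.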

Comparing Pig, Hive, and Jaql

Characteristic | Pig | Hive | Jaql
Developed by | Yahoo! | Facebook | IBM
Language name | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures it operates on | Complex, nested | | JSON
Schema optional? | Yes | No, but data can have many schemas | Yes
Relational complete? | Yes | Yes |
Turing complete? | Yes when extended with Java UDFs | Yes when extended with Java UDFs | Yes

Pig components

Two Components

Language (called Pig Latin)

Compiler

Two execution environments

Local (Single JVM)

pig -x local

Distributed (Hadoop cluster)

pig -x mapreduce, or simply pig

[Figure: Pig consists of the Pig Latin language and its compiler, running in a Local or Distributed execution environment]

Running Pig

Script

pig scriptfile.pig

Grunt (command line)

pig (to launch command line tool)

Embedded

Call into Pig from Java

Sample code

# pig
grunt> records = LOAD 'econ_assist.csv'
           USING PigStorage(',')
           AS (country:chararray, sum:long);
grunt> grouped = GROUP records BY country;
grunt> thesum = FOREACH grouped
           GENERATE group,
           SUM(records.sum);
grunt> DUMP thesum;

Pig Latin

Statements, operations and commands

A Pig Latin program is a collection of statements.

A statement is an operation or a command

Example of an operation: LOAD 'statement.txt';

Example of a command: ls *.txt

Logical plan/physical plan

As each statement is processed, it is added to the logical plan

When a statement such as 'DUMP relation' is reached, the logical plan is compiled to a physical plan and executed
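
The deferred execution described above can be illustrated with a toy model. This is a minimal Python sketch, not Pig's implementation: the `Session` class and its methods are hypothetical, and the "plan" is just a list of recorded steps that runs only when `dump` is called.

```python
class Session:
    """Toy model of deferred execution: statements build a plan; dump runs it."""

    def __init__(self):
        self.plan = []      # recorded steps, standing in for the logical plan
        self.executed = 0   # how many times the plan has been run

    def load(self, rows):
        # Record a load step; no data is read yet.
        self.plan.append(lambda _: list(rows))
        return self

    def filter(self, predicate):
        # Record a filter step; nothing is filtered yet.
        self.plan.append(lambda data: [r for r in data if predicate(r)])
        return self

    def dump(self):
        # Only here are the recorded steps actually executed, in order.
        data = None
        for step in self.plan:
            data = step(data)
        self.executed += 1
        return data

s = Session().load([(1, 1987), (2, 1990)]).filter(lambda r: r[1] > 1988)
steps_before_dump = s.executed   # still 0: building the plan ran nothing
result = s.dump()
print(result)
```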

Pig Latin statements

UDF Statements

REGISTER, DEFINE

Commands

Hadoop Filesystem (cat, ls, etc.)

Hadoop MapReduce (kill)

Utility (exec, help, quit, run, set)

Diagnostic Operators

DESCRIBE, EXPLAIN, ILLUSTRATE

Pig Latin Relational operators

Loading and storing (LOAD, STORE, DUMP)

Filtering (FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE)

Grouping and joining (JOIN, COGROUP, GROUP, CROSS)

Sorting (ORDER, LIMIT)

Combining and splitting (UNION, SPLIT)

Pig Latin Relations and schemata

Result of a relational operator is a relation

A relation is a set of tuples

Relations can be named using an alias

x = LOAD 'sample.txt' AS (id: int, year: int);

DUMP x;    <- alias

(1,1987)    <- tuple

Structure of a relation is a schema

DESCRIBE x;

x: {id: int, year: int}    <- schema

Pig Latin expressions

Part of a statement containing a relational operator

Categories of expressions:

Constant, Field, Projection, Map lookup, Cast, Arithmetic, Conditional, Boolean, Comparison, Functional, Flatten

Pig Latin Data types

Simple types:

int, long, float, double, chararray, bytearray

Complex types:

Tuple: Sequence of fields of any type

Bag: Unordered collection of tuples

Map: Set of key-value pairs. Keys must be chararray.
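
As a rough analogy only, the three complex types can be pictured with Python built-ins. The variable names and values below are made up, and a Python list is used for the bag even though a list has an order and a bag does not.

```python
# Hypothetical Python stand-ins for Pig's complex types.
record = (1, "Canada", 3.5)                # tuple: sequence of fields of any type
bag = [(1, 1987), (2, 1990), (1, 1987)]    # bag: collection of tuples, order not significant
properties = {"source": "csv", "rows": 3}  # map: key-value pairs with string keys

# Types nest: a tuple can hold a bag, which is the shape GROUP produces.
grouped = ("Canada", [(1, 1987), (2, 1990)])

print(record, bag, properties, grouped)
```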

Pig Latin Function types

Eval

Input: One or more expressions

Output: An expression

Example: MAX

Filter

Input: Bag or map

Output: boolean

Example: IsEmpty

Load

Input: Data from external storage

Output: A relation

Example: PigStorage

Store

Input: A relation

Output: Data to external storage

Example: PigStorage

Pig Latin User-Defined Functions

Written in Java

Packaged in a JAR file

Register JAR file using the REGISTER statement

Optionally, alias it with DEFINE statement

Hive - Configuration

Three ways to configure Hive:

hive-site.xml

- fs.default.name

- mapred.job.tracker

- Metastore configuration settings

hive --hiveconf property=value

SET command in the Hive shell

Running Hive

Hive Shell

Interactive

hive

Script

hive -f myscript

Inline

hive -e 'SELECT * FROM mytable'

Hive services

hive --service servicename

where servicename can be:

hiveserver

server for Thrift, JDBC, ODBC clients

hwi

web interface

jar

hadoop jar with Hive jars in classpath

metastore

out of process metastore

Sample code

# hive
hive> CREATE TABLE foreign_aid
      (country STRING, sum BIGINT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
hive> SHOW TABLES;
hive> DESCRIBE foreign_aid;
hive> LOAD DATA INPATH 'econ_assist.csv'
      OVERWRITE INTO TABLE foreign_aid;
hive> SELECT * FROM foreign_aid LIMIT 10;
hive> SELECT country, SUM(sum) FROM foreign_aid
      GROUP BY country;

Hive - Metastore

Stores Hive metadata

Configurations

Embedded

in-process metastore, in-process database

Local

in-process metastore, out-of-process database

Remote

out-of-process metastore, out-of-process database

Hive Schema-On-Read

Faster loads into the database (simply copy or move)

Slower queries

Flexibility: multiple schemas for the same data
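
The trade-off above can be sketched in a few lines. This is a minimal Python illustration of the idea, not Hive's mechanism: the raw line is stored untouched and a schema is applied only when the data is read. The function `read_with_schema` and both schemas are hypothetical.

```python
raw_line = "Canada,2009,1500"   # made-up row; stored once, never rewritten

def read_with_schema(line, schema):
    # Apply (name, converter) pairs to the delimited fields at query time.
    fields = line.split(",")
    return {name: convert(value) for (name, convert), value in zip(schema, fields)}

schema_a = [("country", str), ("year", int), ("amount", int)]
schema_b = [("country", str), ("year", str)]   # a second view of the same bytes

row_a = read_with_schema(raw_line, schema_a)
row_b = read_with_schema(raw_line, schema_b)
print(row_a, row_b)
```

Loading costs nothing beyond copying the file, while every read pays for the parsing, which matches the faster-load, slower-query trade-off listed above.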

Hive Query Language (HiveQL)

SQL dialect

Does not support full SQL92 specification

No support for:

HAVING clause in SELECT

Correlated subqueries

Subqueries outside FROM clauses

Updateable or materialized views

Stored procedures

Extensions

MySQL-like extensions

MapReduce extensions

Multi-table insert, MAP, REDUCE, TRANSFORM clauses

Data Types

Simple

TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING

Complex

ARRAY, MAP, STRUCT

Built-in Functions

SHOW FUNCTIONS

DESCRIBE FUNCTION

Hive - Tables

Managed: CREATE TABLE

LOAD: File moved into Hive's data warehouse directory

DROP: Both metadata and data deleted

External: CREATE EXTERNAL TABLE

LOAD: No files moved

DROP: Only metadata deleted

Use EXTERNAL when:

Sharing data between Hive and other Hadoop applications

You wish to use multiple schemas on the same data
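
The difference in DROP behaviour can be modelled with two dictionaries. This is a toy Python sketch under stated assumptions, not Hive code: `metastore`, `files`, the table names, and the paths are all hypothetical.

```python
def drop_table(metastore, files, name):
    """Toy DROP: always forget the metadata; delete data only if managed."""
    table = metastore.pop(name)
    if not table["external"]:
        files.pop(table["location"], None)

metastore = {
    "managed_t": {"external": False, "location": "/warehouse/managed_t"},
    "external_t": {"external": True, "location": "/data/shared"},
}
files = {"/warehouse/managed_t": ["part-0"], "/data/shared": ["part-0"]}

drop_table(metastore, files, "managed_t")
drop_table(metastore, files, "external_t")
print(metastore, files)
```

After both drops the metadata is gone for both tables, but the external table's files remain for other applications to use.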

Hive - Partitioning

Can make some queries faster

Divide data based on partition column

Use PARTITIONED BY clause when creating table

Use PARTITION clause when loading data

SHOW PARTITIONS will show a table's partitions
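
Why partitioning can speed up a query is easiest to see in a small model. The sketch below is plain Python, not Hive: it assumes one group of rows per partition value (Hive keeps one directory per partition) and counts how many partitions a query has to read. The names `partitions` and `scan` and the sample rows are hypothetical.

```python
# Hypothetical layout: one entry per partition value.
partitions = {
    "year=2009": [("Canada", 10), ("Kenya", 4)],
    "year=2010": [("Canada", 12)],
}

def scan(partitions, year=None):
    """Return (rows, partitions_read); skip partitions that cannot match."""
    rows, read = [], 0
    for name, data in partitions.items():
        if year is not None and name != f"year={year}":
            continue   # pruned: this partition's data is never read
        read += 1
        rows.extend(data)
    return rows, read

rows, read = scan(partitions, year=2010)
all_rows, all_read = scan(partitions)
print(rows, read, all_read)
```

Only queries that restrict the partition column benefit, which is why the slide says "some queries".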

Hive - Bucketing

Can make some queries faster

Supports sampling data

Use CLUSTERED BY clause when creating table

For sorted buckets, use SORTED BY clause when creating table

To query a sample of your data, use the TABLESAMPLE clause, which uses bucketing
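
The idea behind bucketing and sampling can be sketched as hashing a column value modulo a bucket count. This is a minimal Python illustration with assumptions labeled: the bucket count is arbitrary, the keys are made up, and `zlib.crc32` is only a stand-in, since Hive uses its own hash function.

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count

def bucket_for(key, num_buckets=NUM_BUCKETS):
    # Stand-in hash; Hive's own hash function differs.
    return zlib.crc32(key.encode("utf-8")) % num_buckets

keys = ["Canada", "Kenya", "Peru", "India", "Chad", "Fiji"]  # made-up values
buckets = {b: [] for b in range(NUM_BUCKETS)}
for key in keys:
    buckets[bucket_for(key)].append(key)

# Sampling a single bucket reads only that bucket's rows, not the whole table.
sample = buckets[0]
print(buckets)
```

Because the same value always lands in the same bucket, a sample can be taken by reading one bucket, and rows with equal keys are guaranteed to be stored together.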

Hive Storage formats

Delimited Text (default): ROW FORMAT DELIMITED

SerDe (Serializer/Deserializer): ROW FORMAT SERDE serdename

e.g. ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy'

Binary SerDe

Row-oriented (Sequence file): STORED AS SEQUENCEFILE

Column-oriented (RCFile): STORED AS RCFILE

Hive User-Defined Functions

Written in Java

Three UDF types:

UDF

Input: single row, output: single row

UDAF

Input: multiple rows, output: single row

UDTF

Input: single row, output: multiple rows

Register UDF using ADD JAR

Create alias using CREATE TEMPORARY FUNCTION
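
The three input/output shapes listed above can be shown with plain functions. Real Hive UDFs are Java classes; the Python below is only an analogy for the row cardinality of each type, and all three function names are hypothetical.

```python
def upper_udf(value):
    # UDF shape: one row in, one row out.
    return value.upper()

def total_udaf(values):
    # UDAF shape: many rows in, one row out.
    return sum(values)

def split_udtf(value):
    # UDTF shape: one row in, many rows out.
    for part in value.split(","):
        yield part

one_to_one = upper_udf("canada")
many_to_one = total_udaf([1, 2, 3])
one_to_many = list(split_udtf("a,b,c"))
print(one_to_one, many_to_one, one_to_many)
```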

Jaql

A JSON Query Language

JSON = JavaScript Object Notation

Running Jaql

Jaql Shell

Interactive - jaqlshell

Batch - jaqlshell -b myscript.jaql

Inline - jaqlshell -e jaqlstatement

Modes

Cluster - jaqlshell -c

Minicluster - jaqlshell

Jaql

Query Language

Sources and sinks

e.g. Copy data from a local file to a new file on HDFS

read(file('input.json')) -> write(hdfs('output'));    <- source on the left, sink on the right

Core Operators

Filter, Transform, Expand, Group, Join, Union, Tee, Sort, Top

Jaql

Query Language

Variables

= operator binds source output to a variable

e.g. $tweets = read(hdfs('twitterfeed'));

Pipes, streams, and consumers

Pipe operator (->) streams data to a consumer

Pipe expects array as input

e.g. $tweets -> filter $.from_src == 'tweetdeck';

$ is an implicit variable referencing the current array value
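
For readers more familiar with general-purpose languages, the filter example above behaves roughly like the Python below. This is an analogy, not Jaql: the records are made up, `jaql_filter` is a hypothetical helper, and the lambda's `item` parameter plays the role of `$`.

```python
tweets = [  # made-up records standing in for the $tweets array
    {"from_src": "tweetdeck", "text": "a"},
    {"from_src": "web", "text": "b"},
]

def jaql_filter(array, predicate):
    # Each element in turn plays the role of `$` in the Jaql expression.
    return [item for item in array if predicate(item)]

result = jaql_filter(tweets, lambda item: item["from_src"] == "tweetdeck")
print(result)
```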

Jaql

Query Language

Categories of Built-in Functions

system, core, hadoop, io, array, index, schema, xml, regex, binary, date, nil, agg, number, string, function, random, record

Jaql

Data Storage

Data store examples

Amazon S3, DB2, HBase, HDFS, HTTP, JDBC, Local FS

Data format examples

JSON, CSV, XML

Thank you!
