Transcript
  • Pig, Hive, and Jaql

    IBM Information Management

    Cloud Computing Center of Competence

    IBM Toronto Lab

    1

  • Agenda

    2

    Overview

    Pig

    Hive

    Jaql

  • Agenda

    3

    Overview

    Pig

    Hive

    Jaql

  • Similarities of Pig, Hive and Jaql

    4

    All translate their respective high-level languages to MapReduce jobs

    All offer significant reductions in program size over Java

    All provide points of extension to cover gaps in functionality

    All provide interoperability with other languages

    None support random reads/writes or low-latency queries

  • Comparing Pig, Hive, and Jaql

    5

    Characteristic                 | Pig                              | Hive                               | Jaql

    Developed by                   | Yahoo!                           | Facebook                           | IBM

    Language name                  | Pig Latin                        | HiveQL                             | Jaql

    Type of language               | Data flow                        | Declarative (SQL dialect)          | Data flow

    Data structures it operates on | Complex, nested                  |                                    | JSON

    Schema optional?               | Yes                              | No, but data can have many schemas | Yes

    Relational complete?           | Yes                              | Yes                                |

    Turing complete?               | Yes when extended with Java UDFs | Yes when extended with Java UDFs   | Yes

  • Agenda

    6

    Overview

    Pig

    Hive

    Jaql

  • Pig components

    Two Components

    Language (called Pig Latin)

    Compiler

    Two execution environments

    Local (Single JVM)

    pig -x local

    Distributed (Hadoop cluster)

    pig -x mapreduce, or simply pig

    7

    [Diagram: Pig consists of the Pig Latin language and a compiler; the execution environment is either local or distributed]

  • Running Pig

    Script

    pig scriptfile.pig

    Grunt (command line)

    pig (to launch command line tool)

    Embedded

    Call into Pig from Java

    8

  • Sample code

    9

    # pig

    grunt> records = LOAD 'econ_assist.csv'
           USING PigStorage(',')
           AS (country:chararray, sum:long);
    grunt> grouped = GROUP records BY country;
    grunt> thesum = FOREACH grouped
           GENERATE group,
           SUM(records.sum);
    grunt> DUMP thesum;

  • Pig Latin

    Statements, operations, and commands

    A Pig Latin program is a collection of statements

    A statement is an operation or a command

    Example of an operation: LOAD 'statement.txt';

    Example of a command: ls *.txt

    Logical plan/physical plan

    As each statement is processed, it is added to the logical plan

    When a statement such as DUMP relation is reached, the logical plan is compiled into a physical plan and executed

    10
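
    For illustration, a minimal sketch of this deferred execution (the sales.txt file and its fields are hypothetical): the LOAD and FILTER statements only extend the logical plan, and nothing runs until DUMP is reached.

    grunt> sales = LOAD 'sales.txt' AS (item:chararray, amount:long);  -- added to the logical plan only
    grunt> big = FILTER sales BY amount > 100;                         -- still no job submitted
    grunt> DUMP big;   -- logical plan compiled to a physical plan and executed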

  • Pig Latin statements

    UDF Statements

    REGISTER, DEFINE

    Commands

    Hadoop Filesystem (cat, ls, etc.)

    Hadoop MapReduce (kill)

    Utility (exec, help, quit, run, set)

    Diagnostic Operators

    DESCRIBE, EXPLAIN, ILLUSTRATE

    11
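
    A brief sketch of how the commands and diagnostic operators above might be used at the Grunt prompt, reusing the records and thesum aliases from the earlier sample:

    grunt> ls *.csv            -- Hadoop filesystem command
    grunt> DESCRIBE records;   -- print the schema of a relation
    grunt> EXPLAIN thesum;     -- show the logical, physical, and MapReduce plans
    grunt> ILLUSTRATE thesum;  -- step through the plan with sample data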

  • Pig Latin Relational operators

    Loading and storing (LOAD, STORE, DUMP)

    Filtering (FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE)

    Grouping and joining (JOIN, COGROUP, GROUP, CROSS)

    Sorting (ORDER, LIMIT)

    Combining and splitting (UNION, SPLIT)

    12
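
    A minimal sketch chaining several of these operators; the visits.txt file and its fields are hypothetical:

    grunt> visits   = LOAD 'visits.txt' AS (user:chararray, url:chararray, hits:long);
    grunt> filtered = FILTER visits BY hits > 0;
    grunt> grouped  = GROUP filtered BY user;
    grunt> counts   = FOREACH grouped GENERATE group, SUM(filtered.hits);
    grunt> ordered  = ORDER counts BY $1 DESC;
    grunt> top10    = LIMIT ordered 10;
    grunt> STORE top10 INTO 'top_users';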

  • Pig Latin Relations and schemata

    Result of a relational operator is a relation

    A relation is a set of tuples

    Relations can be named using an alias:

    x = LOAD 'sample.txt' AS (id: int, year: int);   -- x is the alias

    DUMP x;
    (1,1987)                                         -- a tuple

    The structure of a relation is its schema:

    DESCRIBE x;
    x: {id: int, year: int}                          -- the schema

    13

  • Pig Latin expressions

    Part of a statement containing a relational operator

    Categories of expressions:

    Constant, Field, Projection, Map lookup, Cast, Arithmetic, Conditional, Boolean, Comparison, Functional, Flatten

    14

  • Pig Latin Data types

    Simple types:

    int, long, float, double, chararray, bytearray

    Complex types:

    Tuple: sequence of fields of any type

    Bag: unordered collection of tuples

    Map: set of key-value pairs; keys must be chararray

    15
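
    For illustration, literal values of the complex types might look like this (the field values are made up):

    (19, 'pig')                          -- a tuple
    {(19, 'pig'), (23, 'hive')}          -- a bag of two tuples
    ['lang'#'pig', 'year'#2008]          -- a map with chararray keys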

  • Pig Latin Function types

    Eval

    Input: One or more expressions

    Output: An expression

    Example: MAX

    Filter

    Input: Bag or map

    Output: boolean

    Example: IsEmpty

    16
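
    A short sketch of both function types, reusing the grouped and records aliases from the earlier sample:

    grunt> maxima   = FOREACH grouped GENERATE group, MAX(records.sum);   -- eval function
    grunt> nonempty = FILTER grouped BY NOT IsEmpty(records);             -- filter function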

  • Load

    Input: Data from external storage

    Output: A relation

    Example: PigStorage

    Store

    Input: A relation

    Output: Data to external storage

    Example: PigStorage

    17

    Pig Latin Function types
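
    A small sketch of PigStorage acting as both load and store function; the file names and the tab delimiter are assumptions:

    grunt> rows = LOAD 'input.tsv' USING PigStorage('\t')
                  AS (country:chararray, sum:long);         -- load function
    grunt> STORE rows INTO 'output' USING PigStorage(',');  -- store function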

  • Pig Latin User-Defined Functions

    Written in Java

    Packaged in a JAR file

    Register JAR file using the REGISTER statement

    Optionally, alias it with DEFINE statement

    18
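
    A minimal sketch of the REGISTER/DEFINE flow; the JAR name and the com.example.pig.Upper class are hypothetical:

    grunt> REGISTER myudfs.jar;                   -- make the JAR visible to Pig
    grunt> DEFINE UPPER com.example.pig.Upper();  -- alias the UDF class
    grunt> caps = FOREACH records GENERATE UPPER(country);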

  • Agenda

    19

    Overview

    Pig

    Hive

    Jaql

  • Hive - Configuration

    Three ways to configure Hive:

    hive-site.xml

    - fs.default.name

    - mapred.job.tracker

    - Metastore configuration settings

    hive --hiveconf <property>=<value>

    SET command in the Hive shell

    20
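
    For illustration, the same property could be set at any of the three levels; hive.metastore.warehouse.dir and its value are just placeholders:

    <!-- hive-site.xml -->
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/warehouse</value>
    </property>

    # on the command line, for one session
    hive --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse

    # inside the Hive shell
    hive> SET hive.metastore.warehouse.dir=/user/hive/warehouse;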

  • Running Hive

    Hive Shell

    Interactive

    hive

    Script

    hive -f myscript

    Inline

    hive -e 'SELECT * FROM mytable'

    21

  • Hive services

    hive --service servicename

    where servicename can be:

    hiveserver

    server for Thrift, JDBC, ODBC clients

    hwi

    Hive Web Interface

    jar

    hadoop jar with the Hive JARs on the classpath

    metastore

    out-of-process metastore

    22

  • Sample code

    23

    # hive

    hive> CREATE TABLE foreign_aid
          (country STRING, sum BIGINT)
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY ','
          STORED AS TEXTFILE;
    hive> SHOW TABLES;
    hive> DESCRIBE foreign_aid;
    hive> LOAD DATA INPATH 'econ_assist.csv'
          OVERWRITE INTO TABLE foreign_aid;
    hive> SELECT * FROM foreign_aid LIMIT 10;
    hive> SELECT country, SUM(sum) FROM foreign_aid
          GROUP BY country;

  • Hive - Metastore

    Stores Hive metadata

    Configurations

    Embedded

    in-process metastore, in-process database

    Local

    in-process metastore, out-of-process database

    Remote

    out-of-process metastore, out-of-process database

    24
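
    As a sketch, a local or remote metastore is typically pointed at an external database through hive-site.xml properties like these; the MySQL host, database name, and port are placeholders:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://dbhost/metastore_db</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- remote: clients connect to the out-of-process metastore -->
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastorehost:9083</value>
    </property>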

  • Hive Schema-On-Read

    Faster loads into the database (simply copy or move)

    Slower queries

    Flexibility: multiple schemas for the same data

    25

  • Hive Query Language (HiveQL)

    SQL dialect

    Does not support the full SQL-92 specification

    No support for:

    HAVING clause in SELECT

    Correlated subqueries

    Subqueries outside FROM clauses

    Updateable or materialized views

    Stored procedures

    26

  • Hive Query Language (HiveQL)

    Extensions

    MySQL-like extensions

    MapReduce extensions

    Multi-table insert, MAP, REDUCE, TRANSFORM clauses

    Data Types

    Simple

    TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING

    Complex

    ARRAY, MAP, STRUCT

    27
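
    A small sketch of a table using the complex types; the column names and delimiters are illustrative:

    hive> CREATE TABLE employees
          (name STRING,
           skills ARRAY<STRING>,
           attributes MAP<STRING, STRING>,
           address STRUCT<city:STRING, country:STRING>)
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY ','
          COLLECTION ITEMS TERMINATED BY '|'
          MAP KEYS TERMINATED BY ':';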

  • Hive Query Language (HiveQL)

    Built-in functions

    List them with SHOW FUNCTIONS

    Get details with DESCRIBE FUNCTION <function_name>

    28

  • Hive - Tables

    Managed (CREATE TABLE)

    LOAD: file is moved into Hive's data warehouse directory

    DROP: both metadata and data are deleted

    External (CREATE EXTERNAL TABLE)

    LOAD: no files are moved

    DROP: only metadata is deleted

    Use EXTERNAL when:

    Sharing data between Hive and other Hadoop applications

    You wish to use multiple schemas on the same data

    29
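
    For example, an external table only records metadata and leaves the files in place; the path and columns here are illustrative:

    hive> CREATE EXTERNAL TABLE raw_aid
          (country STRING, sum BIGINT)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
          LOCATION '/data/econ_assist';
    hive> DROP TABLE raw_aid;   -- deletes only the metadata; files stay in /data/econ_assist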

  • Hive - Partitioning

    Can make some queries faster

    Divide data based on partition column

    Use PARTITIONED BY clause when creating table

    Use PARTITION clause when loading data

    SHOW PARTITIONS will show a table's partitions

    30
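
    A minimal sketch, assuming a year partition column and a hypothetical econ_assist_2011.csv file:

    hive> CREATE TABLE aid_by_year
          (country STRING, sum BIGINT)
          PARTITIONED BY (year INT);
    hive> LOAD DATA INPATH 'econ_assist_2011.csv'
          INTO TABLE aid_by_year PARTITION (year = 2011);
    hive> SHOW PARTITIONS aid_by_year;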

  • Hive - Bucketing

    Can make some queries faster

    Supports sampling data

    Use CLUSTERED BY clause when creating table

    For sorted buckets, use SORTED BY clause when creating table

    To query a sample of your data, use the TABLESAMPLE clause, which relies on bucketing

    31
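
    A minimal sketch, reusing the country and sum columns from the earlier sample; the bucket count is arbitrary:

    hive> CREATE TABLE aid_bucketed
          (country STRING, sum BIGINT)
          CLUSTERED BY (country) SORTED BY (country) INTO 32 BUCKETS;
    hive> SELECT * FROM aid_bucketed
          TABLESAMPLE (BUCKET 1 OUT OF 32 ON country);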

  • Hive Storage formats

    Delimited text (default)

    ROW FORMAT DELIMITED

    SerDe (Serializer/Deserializer)

    ROW FORMAT SERDE 'serdename'

    e.g. ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy'

    Binary SerDe

    Row-oriented (SequenceFile)

    STORED AS SEQUENCEFILE

    Column-oriented (RCFile)

    STORED AS RCFILE

    32
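
    For illustration, the binary formats are selected with STORED AS when creating a table; the table names are made up:

    hive> CREATE TABLE aid_seq (country STRING, sum BIGINT)
          STORED AS SEQUENCEFILE;   -- row-oriented binary format
    hive> CREATE TABLE aid_rc (country STRING, sum BIGINT)
          STORED AS RCFILE;         -- column-oriented binary format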

  • Hive User-Defined Functions

    Written in Java

    Three UDF types:

    UDF

    Input: single row, output: single row

    UDAF

    Input: multiple rows, output: single row

    UDTF

    Input: single row, output: multiple rows

    Register UDF using ADD JAR

    Create alias using CREATE TEMPORARY FUNCTION

    33
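
    A minimal sketch of registering and calling a UDF; the JAR name and the com.example.hive.ToUpper class are hypothetical:

    hive> ADD JAR myudfs.jar;
    hive> CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.ToUpper';
    hive> SELECT to_upper(country) FROM foreign_aid LIMIT 10;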

  • Agenda

    34

    Overview

    Pig

    Hive

    Jaql

  • Jaql

    A JSON Query Language

    JSON = JavaScript Object Notation

    Running Jaql

    Jaql Shell

    Interactive - jaqlshell

    Batch - jaqlshell -b myscript.jaql

    Inline - jaqlshell -e jaqlstatement

    Modes

    Cluster - jaqlshell -c

    Minicluster - jaqlshell

    35

  • Jaql

    Query Language

    Sources and sinks

    e.g. Copy data from a local file to a new file on HDFS

    read(file("input.json"))       // source
      -> write(hdfs("output"));    // sink

    Core Operators

    Filter, Transform, Expand, Group, Join, Union, Tee, Sort, Top

    36
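
    A small sketch chaining a few of the core operators; the tweets data set and its fields are hypothetical:

    read(hdfs("tweets"))
      -> filter $.lang == "en"                         // keep only English tweets
      -> transform { user: $.from_user, msg: $.text }  // project two fields
      -> top 10                                        // keep the first 10 records
      -> write(hdfs("english_tweets"));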

  • Jaql

    Query Language

    Variables

    = operator binds source output to a variable

    e.g. $tweets = read(hdfs("twitterfeed"));

    Pipes, streams, and consumers

    Pipe operator (->) streams data to a consumer

    The pipe expects an array as input

    e.g. $tweets -> filter $.from_src == 'tweetdeck';

    $ is an implicit variable referencing the current array value

    37

  • Jaql

    Query Language

    Categories of Built-in Functions

    system, core, hadoop, io, array, index, schema, xml, regex, binary, date, nil, agg, number, string, function, random, record

    38

  • Jaql

    Data Storage

    Data store examples

    Amazon S3, DB2, HBase, HDFS, HTTP, JDBC, local file system

    Data format examples

    JSON, CSV, XML

    39

  • Thank you!
