Apache HCatalog Overview & Next-gen Hive
Toronto Hadoop User Group
Adam Muise
February 20, 2013
© Hortonworks Inc. 2012
HCatalog
Apache HCatalog
• Incubator project at Apache.org
• Good adoption
• Will likely merge with the Hive project, as it adds important functionality to the metastore
• Allows a “schema-on-read” approach to Big Data in HDFS
• The seat of a lot of innovation in data management
• Used by platform partners to enhance Hadoop integration
• Will likely be used to enhance existing enterprise data management products and to create new ones
Great Tooling Options
MapReduce
• Early adopters
• Non-relational algorithms
• Performance-sensitive applications

Pig
• ETL
• Data modeling
• Iterative algorithms

Hive
• Analysis
• Connectors to BI tools

Strength: the right tool for the right application. Weakness: hard to share data between tools.
HCatalog

[Diagram: HCatalog replaces raw Hadoop data (inconsistent, unknown, tool-specific access) with table access, aligned metadata, and a REST API.]
Apache HCatalog provides flexible metadata services across tools and external access.
HCatalog Changes the Game

• Consistency of metadata and data models across the enterprise (MapReduce, Pig, HBase, Hive, external systems)
• Accessibility: share data as tables in and out of HDFS
• Availability: enables flexible, thin-client access via a REST API

Shared table and schema management opens the platform.
Options == Complexity
Feature        MapReduce        Pig                          Hive
Record format  Key-value pairs  Tuple                        Record
Data model     User defined     int, float, string, bytes,   int, float, string,
                                maps, tuples, bags           maps, structs, lists
Schema         Encoded in app   Declared in script or        Read from metadata
                                read by loader
Data location  Encoded in app   Declared in script           Read from metadata
Data format    Encoded in app   Declared in script           Read from metadata

• Pig and MR users need to know a lot to write their apps
• When data schema, location, or format changes, Pig and MR apps must be rewritten, retested, and redeployed
• Hive users have to load data produced by Pig/MR users before they can access it
HCatalog == Simple, Consistent
Feature        MapReduce + HCatalog    Pig + HCatalog              Hive
Record format  Record                  Tuple                       Record
Data model     int, float, string,     int, float, string, bytes,  int, float, string,
               maps, structs, lists    maps, tuples, bags          maps, structs, lists
Schema         Read from metadata      Read from metadata          Read from metadata
Data location  Read from metadata      Read from metadata          Read from metadata
Data format    Read from metadata      Read from metadata          Read from metadata

• Pig/MR users can read schema from metadata
• Pig/MR users are insulated from schema, location, and format changes
• All users have access to other users’ data as soon as it is committed
Hadoop Ecosystem

[Diagram: Pig (scripting), Hive (SQL), and MapReduce (Java) each sit on top of HDFS (datanodes dn1 … dnN). Pig goes through its Load/Store interface and an Input/OutputFormat; Hive goes through its SQL interface, DDL/DML, a SerDe, and an Input/OutputFormat, backed by the metastore (tables, partitions, files, types); MapReduce uses a raw Input/OutputFormat. Only Hive sees the metastore.]
Opening up Metadata to MR & Pig

[Diagram: With the HCat metadata layer, all three tools share the metastore (tables, partitions, files, types) over HDFS (datanodes dn1 … dnN): Pig through HCatLoad/Store, MapReduce through HCatInput/OutputFormat, and Hive through its SQL interface and SerDe as before.]
Pig Example

Assume you want to count how many times each of your users went to each of your URLs.

Without HCatalog:

raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group) as (url, user), COUNT(botless);
store cntd into '/data/counted/20120530';

Using HCatalog:

raw = load 'rawevents' using HCatLoader();  -- no need to know the file location or declare a schema
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';  -- partition filter
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group) as (url, user), COUNT(botless);
store cntd into 'counted' using HCatStorer();
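Both Pig scripts above boil down to the same computation: count visits per (url, user) while filtering out bots. A minimal Python sketch of that logic, with made-up sample events and a made-up bot check standing in for myudfs.NotABot:

```python
from collections import Counter

# Hypothetical raw events: (url, user) pairs, including one bot.
events = [
    ("http://a.com", "alice"),
    ("http://a.com", "alice"),
    ("http://a.com", "crawler-1"),
    ("http://b.com", "bob"),
]

def is_bot(user):
    # Illustrative stand-in for the inverse of myudfs.NotABot.
    return user.startswith("crawler")

# Group by (url, user) and count, skipping bots.
counts = Counter((url, user) for url, user in events if not is_bot(user))
print(counts[("http://a.com", "alice")])  # 2
```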
Working with HCatalog in MapReduce

Setting input (the database and table to read from, plus a filter specifying which partitions to read):

HCatInputFormat.setInput(job,
    InputJobInfo.create(dbname, tableName, filter));

Setting output (specifying which partition to write):

HCatOutputFormat.setOutput(job,
    OutputJobInfo.create(dbname, tableName, partitionSpec));

Obtaining the schema:

schema = HCatInputFormat.getOutputSchema();

In the mapper the key is unused and the value is an HCatRecord; fields are accessed by name:

String url = value.get("url", schema);
output.set("cnt", schema, cnt);
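The value.get("url", schema) / output.set("cnt", schema, cnt) calls work because an HCatRecord stores values positionally while the schema maps field names to positions; that indirection is what insulates apps from layout changes. A toy Python sketch of the idea (plain Python, not the real HCatalog API; all names here are illustrative):

```python
class Schema:
    """Maps field names to positions, like an HCatSchema."""
    def __init__(self, fields):
        self.pos = {name: i for i, name in enumerate(fields)}

class Record:
    """Stores values positionally, like an HCatRecord."""
    def __init__(self, values):
        self.values = list(values)

    def get(self, name, schema):
        return self.values[schema.pos[name]]

    def set(self, name, schema, value):
        self.values[schema.pos[name]] = value

schema = Schema(["url", "user", "cnt"])
rec = Record(["http://example.com", "alice", 0])
rec.set("cnt", schema, 7)
print(rec.get("url", schema), rec.get("cnt", schema))
```

If the table layout changes, only the schema object changes; code that reads fields by name keeps working.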
Managing Metadata

• If you are a Hive user, you can use your Hive metastore with no modifications.
• If not, you can use the HCatalog command-line tool to issue Hive DDL (Data Definition Language) commands:

/usr/bin/hcat -e "create table rawevents (url string, user string) partitioned by (ds string);"

• Starting in Pig 0.11, you will be able to issue DDL commands from Pig.
Templeton - REST API

• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop

Example operations against Hadoop/HCatalog: get a list of all tables in the default database, create a new table “rawevents”, describe table “rawevents”.
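The three example operations can be sketched as (method, URL) pairs. In this sketch the host, port, and user name are placeholders, and the /templeton/v1/ddl/... paths follow the WebHCat (Templeton) REST convention; the requests are built but not sent, since sending them needs a live server:

```python
# Placeholder WebHCat endpoint; real deployments typically expose port 50111.
BASE = "http://webhcat.example.com:50111/templeton/v1"

def ddl_call(method, *path, user="hcatuser"):
    """Return the (method, url) pair for a Templeton DDL request."""
    url = "/".join((BASE, "ddl") + path) + "?user.name=" + user
    return method, url

# Get a list of all tables in the default database.
list_tables = ddl_call("GET", "database", "default", "table")

# Create a new table "rawevents" (column definitions go in the PUT body).
create_table = ddl_call("PUT", "database", "default", "table", "rawevents")

# Describe table "rawevents".
describe_table = ddl_call("GET", "database", "default", "table", "rawevents")

print(list_tables[1])
```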
HCatalog is in Hortonworks Data Platform…

[Diagram: The Hortonworks Data Platform (HDP) runs on OS, cloud, VM, or appliance. Hadoop Core (HDFS, WebHDFS, MapReduce, YARN in 2.0) sits beneath Platform Services for enterprise readiness (high availability, disaster recovery, snapshots, security, etc.), Data Services (HCatalog, Hive, Pig, HBase, Sqoop, Flume), and Operational Services (Oozie, Ambari).]
Key 2013 “Enterprise Hadoop” Initiatives

Invest in:
– Platform Services – DR, snapshots, …
– Data Services – in support of Refine, Explore, Enrich
– Operational Services – manageability, security, …

[Diagram: Mapped onto the HDP stack: Hive/“Stinger” (interactive query), “Knox” (secure access), “Continuum” (business continuity), Ambari (manage & operate), “Herd” (data integration), HBase (online data).]
Hive/Stinger: Interactive Query
Near-realtime queries in good old Hive…
Top BI Vendors Support Hive Today
Goal: Enhance Hive for BI Use Cases

[Diagram: BI use cases (enterprise reports, dashboards/scorecards, parameterized reports, visualization, data mining) span batch and interactive workloads, driving the goal of more SQL and better performance in Hive.]
Differing Needs For Scale / Interaction

Interactive (5s – 1m): parameterized reports, drilldown, visualization, exploration. Interactivity is key.
Non-Interactive (1m – 1h): data preparation, incremental batch processing, dashboards/scorecards.
Batch (1h+): operational batch processing, enterprise reports, data mining. Scalability and reliability are key.

Data size grows from the interactive to the batch end of the spectrum.
Stinger: Make Hive Best for All Needs

Interactive (5s – 1m): parameterized reports, drilldown, visualization, exploration.
Non-Interactive (1m – 1h): data preparation, incremental batch processing, dashboards/scorecards.
Batch (1h+): operational batch processing, enterprise reports, data mining.

Improve Latency & Throughput
• Query engine improvements
• New “Optimized RCFile” column store
• Next-gen runtime (eliminates M/R latency)

Extend Deep Analytical Ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases
Analytic Function Use Cases
• OVER
  – Rankings, top 10, bottom 10
  – Running balances
  – Statistics within time windows (e.g. last 3 months, last 6 months)
• LEAD / LAG
  – Trend identification
  – Sessionization
  – Forecasting / prediction
• Distributions
  – Histograms and bucketing
• Good for enterprise reports, dashboards, data mining, and business processing.
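As a rough illustration of what these analytic functions compute, here is a plain Python sketch of a running balance (SUM over an ordered window) and LAG over a toy list of amounts; the data is made up:

```python
# Toy daily transaction amounts, already in date order.
amounts = [100, -20, 50, 30]

# Running balance: what SUM(amount) OVER (ORDER BY day) would return.
running = []
total = 0
for a in amounts:
    total += a
    running.append(total)

# LAG(amount, 1): the previous row's value, None for the first row.
lagged = [None] + amounts[:-1]

print(running)  # [100, 80, 130, 160]
print(lagged)   # [None, 100, -20, 50]
```

Comparing each row to its lagged value is the basis of the trend-identification and sessionization use cases above.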
Stinger 2013 Roadmap Summary
• HDP 1.x (aka Hadoop 1.x …)
  – Additional SQL types
  – SQL analytic functions (OVER, subqueries in WHERE, etc.)
  – Modern optimized column store (ORC file)
  – Hive query enhancements: startup time, star joins, optimized M/R DAGs, vectorization, etc.
• HDP 2.x (aka Hadoop 2.x …)
  – Features in HDP 1.3 & 1.4
  – Next-gen runtime that eliminates startup time
  – Persistent function registry
  – Other features
Questions?