Toronto Hadoop User Group
Apache HCatalog Overview & Next-gen Hive
Adam Muise
February 20, 2013
© Hortonworks Inc. 2012

2013 feb 20_thug_h_catalog

Jan 27, 2015

Technology

Adam Muise

Toronto Hadoop User Group
THUG
February 20 2013
HCatalog
Hive - Stinger Initiative
Transcript
Page 1: 2013 feb 20_thug_h_catalog

© Hortonworks Inc. 2012

Toronto Hadoop User Group

Apache HCatalog Overview & Next-gen Hive Adam Muise

February 20, 2013

Page 2: 2013 feb 20_thug_h_catalog

HCatalog

Page 3: 2013 feb 20_thug_h_catalog

Apache HCatalog

• Incubator project at Apache.org
• Good adoption
• Will likely merge with the Hive project, as it adds important functionality to the metastore
• Allows for a “schema-on-read” approach to Big Data in HDFS
• Seat of a lot of innovation in data management
• Used by platform partners to enhance Hadoop integration
• Will likely be used to enhance existing data management products in the Enterprise and to create new products

Page 4: 2013 feb 20_thug_h_catalog

Great Tooling Options

MapReduce
• Early adopters
• Non-relational algorithms
• Performance-sensitive applications

Pig
• ETL
• Data modeling
• Iterative algorithms

Hive
• Analysis
• Connectors to BI tools

Strength: Right tool for the right application
Weakness: Hard to share their data

Page 5: 2013 feb 20_thug_h_catalog

HCatalog Changes the Game

Apache HCatalog provides flexible metadata services across tools and external access

[Diagram: raw Hadoop data on one side, HCatalog on the other]
• Raw Hadoop data: inconsistent, unknown, tool-specific access
• HCatalog: table access, aligned metadata, REST API

• Consistency of metadata and data models across the Enterprise (MapReduce, Pig, HBase, Hive, External Systems)
• Accessibility: share data as tables in and out of HDFS
• Availability: enables flexible, thin-client access via REST API

Shared table and schema management opens the platform

Page 6: 2013 feb 20_thug_h_catalog

Options == Complexity

Feature | MapReduce | Pig | Hive
Record format | Key value pairs | Tuple | Record
Data model | User defined | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema | Encoded in app | Declared in script or read by loader | Read from metadata
Data location | Encoded in app | Declared in script | Read from metadata
Data format | Encoded in app | Declared in script | Read from metadata

• Pig and MR users need to know a lot to write their apps
• When data schema, location, or format change, Pig and MR apps must be rewritten, retested, and redeployed
• Hive users have to load data from Pig/MR users to have access to it

Page 7: 2013 feb 20_thug_h_catalog

HCatalog == Simple, Consistent

Feature | MapReduce + HCatalog | Pig + HCatalog | Hive
Record format | Record | Tuple | Record
Data model | int, float, string, maps, structs, lists | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema | Read from metadata | Read from metadata | Read from metadata
Data location | Read from metadata | Read from metadata | Read from metadata
Data format | Read from metadata | Read from metadata | Read from metadata

• Pig/MR users can read schema from metadata
• Pig/MR users are insulated from schema, location, and format changes
• All users have access to other users’ data as soon as it is committed
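The insulation described above can be illustrated with a toy, in-memory stand-in for a shared table catalog. This is only a sketch under stated assumptions: `ToyCatalog`, `TableInfo`, and the paths and format names are hypothetical and are not part of HCatalog's API. The point is that a job which resolves a table by name keeps working when the table's location or format changes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: none of these types belong to HCatalog.
public class ToyCatalog {
    // What a tool needs to know to read a table (all values hypothetical).
    public static final class TableInfo {
        final String location;
        final String format;
        final List<String> columns;

        public TableInfo(String location, String format, List<String> columns) {
            this.location = location;
            this.format = format;
            this.columns = columns;
        }
    }

    private final Map<String, TableInfo> tables = new HashMap<>();

    // Register or update a table definition in the shared catalog.
    public void put(String name, TableInfo info) {
        tables.put(name, info);
    }

    // Tools resolve a table by name instead of hard-coding path and schema.
    public TableInfo lookup(String name) {
        TableInfo info = tables.get(name);
        if (info == null) {
            throw new IllegalArgumentException("unknown table: " + name);
        }
        return info;
    }

    // Stand-in for a job: it only knows the table name.
    public static String describeRead(ToyCatalog catalog, String table) {
        TableInfo t = catalog.lookup(table);
        return "read " + t.columns + " as " + t.format + " from " + t.location;
    }

    public static void main(String[] args) {
        ToyCatalog catalog = new ToyCatalog();
        catalog.put("rawevents",
                new TableInfo("/data/rawevents", "text", Arrays.asList("url", "user")));
        System.out.println(describeRead(catalog, "rawevents"));

        // The data owner moves and re-encodes the table; the job code is unchanged.
        catalog.put("rawevents",
                new TableInfo("/warehouse/rawevents", "rcfile", Arrays.asList("url", "user")));
        System.out.println(describeRead(catalog, "rawevents"));
    }
}
```

Only the catalog entry changes between the two reads; `describeRead` itself is never edited, which is the property the table above attributes to the "+ HCatalog" columns.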

Page 8: 2013 feb 20_thug_h_catalog

Hadoop Ecosystem

[Diagram. Tools: Pig (scripting), Hive (SQL), MapReduce (Java). Interfaces: Load/Store, SerDe, SQL (DDL, DML), and an Input/OutputFormat for each tool. Storage: HDFS data nodes dn1, dn2, dn3 … dnN. Metastore: tables, partitions, files, types.]

Page 9: 2013 feb 20_thug_h_catalog

Opening up Metadata to MR & Pig

[Diagram. Tools: Pig (scripting), Hive (SQL), MapReduce (Java). Interfaces: HCatLoad/Store, SerDe, SQL, HCatInput/OutputFormat. An HCat metadata layer sits over the metastore (tables, partitions, files, types). Storage: HDFS data nodes dn1, dn2, dn3 … dnN.]

Page 10: 2013 feb 20_thug_h_catalog

Pig Example

Assume you want to count how many times each of your users went to each of your URLs:

raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
-- the grouping key is exposed as 'group'; flatten it back into url and user
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into '/data/counted/20120530';

Using HCatalog:

-- no need to know the file location or declare the schema
raw = load 'rawevents' using HCatLoader();
-- ds == '20120530' is the partition filter
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into 'counted' using HCatStorer();
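For readers less familiar with Pig, the same logic (drop bot traffic, group by URL and user, count) can be sketched with plain Java collections. This illustrates the computation only, not how Pig or HCatalog executes it; the `Event` class, the `notABot` rule, and the sample rows are hypothetical stand-ins for the table data and for `myudfs.NotABot`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class VisitCounts {
    // Hypothetical row type standing in for one (url, user) event.
    public static final class Event {
        final String url;
        final String user;

        public Event(String url, String user) {
            this.url = url;
            this.user = user;
        }
    }

    // Hypothetical stand-in for myudfs.NotABot: users ending in "bot" are bots.
    static boolean notABot(String user) {
        return !user.endsWith("bot");
    }

    // filter -> group by (url, user) -> count; keys are "url<TAB>user".
    public static Map<String, Long> count(List<Event> events) {
        return events.stream()
                .filter(e -> notABot(e.user))
                .collect(Collectors.groupingBy(
                        e -> e.url + "\t" + e.user,
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("/home", "alice"),
                new Event("/home", "alice"),
                new Event("/home", "crawlbot"),
                new Event("/docs", "bob"));
        // One line per (url, user) pair with its visit count.
        count(events).forEach((key, n) -> System.out.println(key + "\t" + n));
    }
}
```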

Page 11: 2013 feb 20_thug_h_catalog

Working with HCatalog in MapReduce

Setting input (database to read from, table to read from, filter specifying which partitions to read):
HCatInputFormat.setInput(job,
    InputJobInfo.create(dbname, tableName, filter));

Setting output (partitionSpec specifies which partition to write):
HCatOutputFormat.setOutput(job,
    OutputJobInfo.create(dbname, tableName, partitionSpec));

Obtaining schema:
schema = HCatInputFormat.getOutputSchema();

Key is unused, Value is HCatRecord; access fields by name:
String url = (String) value.get("url", schema);  // get() returns Object, so cast
output.set("cnt", schema, cnt);

Page 12: 2013 feb 20_thug_h_catalog

Managing Metadata

•  If you are a Hive user, you can use your Hive metastore with no modifications

•  If not, you can use the HCatalog command line tool to issue Hive DDL (Data Definition Language) commands:

> /usr/bin/hcat -e "create table rawevents (url string, user string) partitioned by (ds string);"

•  Starting in Pig 0.11, you will be able to issue DDL commands from Pig

Page 13: 2013 feb 20_thug_h_catalog

Templeton - REST API

• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop

[Diagram: example calls against Hadoop/HCatalog]
• Get a list of all tables in the default database
• Create new table “rawevents”
• Describe table “rawevents”
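The three example calls named on this slide could be addressed roughly as follows. This is a sketch, not the slide's original request text: the host, the port 50111, the `user.name` value, and the `/templeton/v1/ddl/...` path layout follow the Templeton (WebHCat) reference as best understood here and should be checked against the installed version. The helper class is hypothetical and only builds "METHOD url" strings; it sends nothing, and the JSON body that a create call needs is not shown.

```java
public class TempletonUrls {
    // Assumed base path of the Templeton REST service.
    static String base(String host, int port) {
        return "http://" + host + ":" + port + "/templeton/v1";
    }

    // GET lists the tables in a database.
    public static String listTables(String host, int port, String db, String user) {
        return "GET " + base(host, port)
                + "/ddl/database/" + db + "/table?user.name=" + user;
    }

    // PUT creates a table (the table definition travels in the request body).
    public static String createTable(String host, int port, String db, String table, String user) {
        return "PUT " + tableUrl(host, port, db, table, user);
    }

    // GET on a single table describes it.
    public static String describeTable(String host, int port, String db, String table, String user) {
        return "GET " + tableUrl(host, port, db, table, user);
    }

    // DELETE drops the table.
    public static String dropTable(String host, int port, String db, String table, String user) {
        return "DELETE " + tableUrl(host, port, db, table, user);
    }

    // Names are assumed to be URL-safe; real code should encode them.
    private static String tableUrl(String host, int port, String db, String table, String user) {
        return base(host, port)
                + "/ddl/database/" + db + "/table/" + table + "?user.name=" + user;
    }

    public static void main(String[] args) {
        // "localhost" and "alice" are placeholder values.
        System.out.println(listTables("localhost", 50111, "default", "alice"));
        System.out.println(createTable("localhost", 50111, "default", "rawevents", "alice"));
        System.out.println(describeTable("localhost", 50111, "default", "rawevents", "alice"));
    }
}
```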

Page 14: 2013 feb 20_thug_h_catalog

HCatalog is in Hortonworks Data Platform…

[Diagram: Hortonworks Data Platform (HDP) stack, shown over OS, Cloud, VM, Appliance]
• Layers: Operational Services, Data Services, Hadoop Core, Platform Services
• Components shown: HCatalog, Hive, Pig, HBase, Oozie, Ambari, HDFS, YARN (in 2.0), WebHDFS, MapReduce, Sqoop, Flume
• Enterprise Readiness: High Availability, Disaster Recovery, Snapshots, Security, etc.

Page 15: 2013 feb 20_thug_h_catalog

Key 2013 “Enterprise Hadoop” Initiatives

Invest in:
– Platform Services: DR, Snapshot, …
– Data Services: in support of Refine, Explore, Enrich
– Operational Services: Manageability, Security, …

[Diagram: Hortonworks Data Platform (HDP) layers (Platform Services, Hadoop Core, Data Services, Operational Services) annotated with these initiatives]
• Hive / “Stinger”: Interactive Query
• “Knox”: Secure Access
• “Continuum”: Biz Continuity
• Ambari: Manage & Operate
• “Herd”: Data Integration
• HBase: Online Data

Page 16: 2013 feb 20_thug_h_catalog

Hive/Stinger: Interactive Query
Near-realtime queries in good old Hive…

Page 17: 2013 feb 20_thug_h_catalog

Top BI Vendors Support Hive Today

Page 18: 2013 feb 20_thug_h_catalog

Goal: Enhance Hive for BI Use Cases

[Diagram: BI use cases placed along a Batch-to-Interactive axis: Enterprise Reports, Dashboard / Scorecard, Parameterized Reports, Visualization, Data Mining]

More SQL & Better Performance

Page 19: 2013 feb 20_thug_h_catalog

Differing Needs For Scale / Interaction

Interactive (5s – 1m)
• Parameterized Reports
• Drilldown
• Visualization
• Exploration

Non-Interactive (1m – 1h)
• Data preparation
• Incremental batch processing
• Dashboards / Scorecards

Batch (1h+)
• Operational batch processing
• Enterprise Reports
• Data Mining

[Diagram axis: Data Size]

Interactivity is key
Scalability and Reliability are key

Page 20: 2013 feb 20_thug_h_catalog

Stinger: Make Hive Best for All Needs

Interactive (5s – 1m): Parameterized Reports, Drilldown, Visualization, Exploration
Non-Interactive (1m – 1h): Data preparation, Incremental batch processing, Dashboards / Scorecards
Batch (1h+): Operational batch processing, Enterprise Reports, Data Mining
[Diagram axis: Data Size]

Improve Latency & Throughput
• Query engine improvements
• New “Optimized RCFile” column store
• Next-gen runtime (eliminates M/R latency)

Extend Deep Analytical Ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases

Page 21: 2013 feb 20_thug_h_catalog

Analytic Function Use Cases

• OVER
  – Rankings, top 10, bottom 10
  – Running balances
  – Statistics within time windows (e.g. last 3 months, last 6 months)
• LEAD / LAG
  – Trend identification
  – Sessionization
  – Forecasting / prediction
• Distributions
  – Histograms and bucketing
• Good for Enterprise Reports, Dashboards, Data Mining and Business Processing.
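To make two of these patterns concrete, here is a small plain-Java sketch of a running balance (a cumulative sum over ordered rows, as an OVER clause would produce) and of LAG (the previous row's value). It imitates what the SQL functions compute on an already ordered list; it is not Hive code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowSketch {
    // Running balance: like SUM(amount) OVER (ORDER BY ...), with the window
    // running from the first row up to the current row.
    public static List<Long> runningTotal(List<Long> amounts) {
        List<Long> out = new ArrayList<>();
        long sum = 0;
        for (long a : amounts) {
            sum += a;
            out.add(sum);
        }
        return out;
    }

    // LAG with offset 1: the previous row's value, or null for the first row.
    public static List<Long> lag(List<Long> values) {
        List<Long> out = new ArrayList<>();
        Long previous = null;
        for (Long v : values) {
            out.add(previous);
            previous = v;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> amounts = Arrays.asList(100L, -30L, 50L);
        System.out.println(runningTotal(amounts)); // [100, 70, 120]
        System.out.println(lag(amounts));          // [null, 100, -30]
    }
}
```

Comparing each value with its lagged neighbour is the basis of the trend and sessionization use cases listed above, for example by flagging a new session when the gap to the previous event exceeds a threshold.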

Page 22: 2013 feb 20_thug_h_catalog

Stinger 2013 Roadmap Summary

• HDP 1.x (aka Hadoop 1.x …)
  – Additional SQL Types
  – SQL Analytic Functions (OVER, Subqueries in WHERE, etc.)
  – Modern Optimized Column Store (ORC file)
  – Hive Query Enhancements: startup time, star joins, optimize M/R DAGs, vectorization, etc.
• HDP 2.x (aka Hadoop 2.x …)
  – Features in HDP 1.3 & 1.4
  – Next-gen runtime that eliminates startup time
  – Persistent function registry
  – Other features

Page 23: 2013 feb 20_thug_h_catalog

Questions?
