Apache HCatalog Overview & Next-gen Hive
Toronto Hadoop User Group
Adam Muise
February 20, 2013
© Hortonworks Inc. 2012
HCatalog
Apache HCatalog
• Incubator project at Apache.org
• Good adoption
• Will likely merge with the Hive project, as it adds important functionality to the metastore
• Allows a “schema-on-read” approach to Big Data in HDFS
• The seat of a lot of innovation in data management
• Used by platform partners to enhance Hadoop integration
• Will likely be used to enhance existing enterprise data management products and to create new ones
Great Tooling Options
MapReduce
• Early adopters
• Non-relational algorithms
• Performance-sensitive applications

Pig
• ETL
• Data modeling
• Iterative algorithms

Hive
• Analysis
• Connectors to BI tools

Strength: the right tool for the right application. Weakness: hard to share data between tools.
HCatalog

[Diagram: HCatalog replaces raw Hadoop data (inconsistent, unknown, tool-specific access) with table access, aligned metadata, and a REST API.]
Apache HCatalog provides flexible metadata services across tools and external access.
HCatalog Changes the Game

• Consistency of metadata and data models across the enterprise (MapReduce, Pig, HBase, Hive, external systems)
• Accessibility: share data as tables in and out of HDFS
• Availability: enables flexible, thin-client access via a REST API

Shared table and schema management opens the platform.
Options == Complexity
Feature        MapReduce        Pig                          Hive
Record format  Key-value pairs  Tuple                        Record
Data model     User defined     int, float, string, bytes,   int, float, string,
                                maps, tuples, bags           maps, structs, lists
Schema         Encoded in app   Declared in script or        Read from metadata
                                read by loader
Data location  Encoded in app   Declared in script           Read from metadata
Data format    Encoded in app   Declared in script           Read from metadata

• Pig and MR users need to know a lot to write their apps
• When data schema, location, or format changes, Pig and MR apps must be rewritten, retested, and redeployed
• Hive users have to load data produced by Pig/MR users before they can access it
HCatalog == Simple, Consistent
Feature        MapReduce + HCatalog    Pig + HCatalog              Hive
Record format  Record                  Tuple                       Record
Data model     int, float, string,     int, float, string, bytes,  int, float, string,
               maps, structs, lists    maps, tuples, bags          maps, structs, lists
Schema         Read from metadata      Read from metadata          Read from metadata
Data location  Read from metadata      Read from metadata          Read from metadata
Data format    Read from metadata      Read from metadata          Read from metadata

• Pig/MR users can read schema from metadata
• Pig/MR users are insulated from schema, location, and format changes
• All users have access to other users’ data as soon as it is committed
Hadoop Ecosystem

[Diagram: Pig (scripting), Hive (SQL), and MapReduce (Java) each sit on top of HDFS (datanodes dn1 … dnN). Pig goes through its Load/Store interface and an Input/OutputFormat; Hive goes through its SQL interface, DDL/DML, a SerDe, and an Input/OutputFormat, backed by the metastore (tables, partitions, files, types); MapReduce uses a raw Input/OutputFormat. Only Hive sees the metastore.]
Opening up Metadata to MR & Pig

[Diagram: With the HCat metadata layer, all three tools share the metastore (tables, partitions, files, types) over HDFS (datanodes dn1 … dnN): Pig through HCatLoad/Store, MapReduce through HCatInput/OutputFormat, and Hive through its SQL interface and SerDe as before.]
Pig Example

Assume you want to count how many times each of your users went to each of your URLs.

Without HCatalog:

raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group) as (url, user), COUNT(botless);
store cntd into '/data/counted/20120530';

Using HCatalog:

raw = load 'rawevents' using HCatLoader();  -- no need to know the file location or declare a schema
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';  -- partition filter
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group) as (url, user), COUNT(botless);
store cntd into 'counted' using HCatStorer();
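Both Pig scripts above boil down to the same computation: count visits per (url, user) while filtering out bots. A minimal Python sketch of that logic, with made-up sample events and a made-up bot check standing in for myudfs.NotABot:

```python
from collections import Counter

# Hypothetical raw events: (url, user) pairs, including one bot.
events = [
    ("http://a.com", "alice"),
    ("http://a.com", "alice"),
    ("http://a.com", "crawler-1"),
    ("http://b.com", "bob"),
]

def is_bot(user):
    # Illustrative stand-in for the inverse of myudfs.NotABot.
    return user.startswith("crawler")

# Group by (url, user) and count, skipping bots.
counts = Counter((url, user) for url, user in events if not is_bot(user))
print(counts[("http://a.com", "alice")])  # 2
```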
Working with HCatalog in MapReduce

Setting input (the database and table to read from, plus a filter specifying which partitions to read):

HCatInputFormat.setInput(job,
    InputJobInfo.create(dbname, tableName, filter));

Setting output (specifying which partition to write):

HCatOutputFormat.setOutput(job,
    OutputJobInfo.create(dbname, tableName, partitionSpec));

Obtaining the schema:

schema = HCatInputFormat.getOutputSchema();

In the mapper the key is unused and the value is an HCatRecord; fields are accessed by name:

String url = value.get("url", schema);
output.set("cnt", schema, cnt);
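The value.get("url", schema) / output.set("cnt", schema, cnt) calls work because an HCatRecord stores values positionally while the schema maps field names to positions; that indirection is what insulates apps from layout changes. A toy Python sketch of the idea (plain Python, not the real HCatalog API; all names here are illustrative):

```python
class Schema:
    """Maps field names to positions, like an HCatSchema."""
    def __init__(self, fields):
        self.pos = {name: i for i, name in enumerate(fields)}

class Record:
    """Stores values positionally, like an HCatRecord."""
    def __init__(self, values):
        self.values = list(values)

    def get(self, name, schema):
        return self.values[schema.pos[name]]

    def set(self, name, schema, value):
        self.values[schema.pos[name]] = value

schema = Schema(["url", "user", "cnt"])
rec = Record(["http://example.com", "alice", 0])
rec.set("cnt", schema, 7)
print(rec.get("url", schema), rec.get("cnt", schema))
```

If the table layout changes, only the schema object changes; code that reads fields by name keeps working.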
Managing Metadata

• If you are a Hive user, you can use your Hive metastore with no modifications.
• If not, you can use the HCatalog command-line tool to issue Hive DDL (Data Definition Language) commands:

/usr/bin/hcat -e "create table rawevents (url string, user string) partitioned by (ds string);"

• Starting in Pig 0.11, you will be able to issue DDL commands from Pig.
Templeton - REST API

• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop

Example operations against Hadoop/HCatalog: get a list of all tables in the default database, create a new table “rawevents”, describe table “rawevents”.
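The three example operations can be sketched as (method, URL) pairs. In this sketch the host, port, and user name are placeholders, and the /templeton/v1/ddl/... paths follow the WebHCat (Templeton) REST convention; the requests are built but not sent, since sending them needs a live server:

```python
# Placeholder WebHCat endpoint; real deployments typically expose port 50111.
BASE = "http://webhcat.example.com:50111/templeton/v1"

def ddl_call(method, *path, user="hcatuser"):
    """Return the (method, url) pair for a Templeton DDL request."""
    url = "/".join((BASE, "ddl") + path) + "?user.name=" + user
    return method, url

# Get a list of all tables in the default database.
list_tables = ddl_call("GET", "database", "default", "table")

# Create a new table "rawevents" (column definitions go in the PUT body).
create_table = ddl_call("PUT", "database", "default", "table", "rawevents")

# Describe table "rawevents".
describe_table = ddl_call("GET", "database", "default", "table", "rawevents")

print(list_tables[1])
```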
HCatalog is in Hortonworks Data Platform…

[Diagram: The Hortonworks Data Platform (HDP) runs on OS, cloud, VM, or appliance. Hadoop Core (HDFS, WebHDFS, MapReduce, YARN in 2.0) sits beneath Platform Services for enterprise readiness (high availability, disaster recovery, snapshots, security, etc.), Data Services (HCatalog, Hive, Pig, HBase, Sqoop, Flume), and Operational Services (Oozie, Ambari).]
Key 2013 “Enterprise Hadoop” Initiatives

Invest in:
– Platform Services – DR, snapshots, …
– Data Services – in support of Refine, Explore, Enrich
– Operational Services – manageability, security, …

[Diagram: Mapped onto the HDP stack: Hive/“Stinger” (interactive query), “Knox” (secure access), “Continuum” (business continuity), Ambari (manage & operate), “Herd” (data integration), HBase (online data).]
Hive/Stinger: Interactive Query
Near-realtime queries in good old Hive…
Top BI Vendors Support Hive Today
Goal: Enhance Hive for BI Use Cases

[Diagram: BI use cases (enterprise reports, dashboards/scorecards, parameterized reports, visualization, data mining) span batch and interactive workloads, driving the goal of more SQL and better performance in Hive.]
Differing Needs For Scale / Interaction

Interactive (5s – 1m): parameterized reports, drilldown, visualization, exploration. Interactivity is key.
Non-Interactive (1m – 1h): data preparation, incremental batch processing, dashboards/scorecards.
Batch (1h+): operational batch processing, enterprise reports, data mining. Scalability and reliability are key.

Data size grows from the interactive to the batch end of the spectrum.
Stinger: Make Hive Best for All Needs

Interactive (5s – 1m): parameterized reports, drilldown, visualization, exploration.
Non-Interactive (1m – 1h): data preparation, incremental batch processing, dashboards/scorecards.
Batch (1h+): operational batch processing, enterprise reports, data mining.

Improve Latency & Throughput
• Query engine improvements
• New “Optimized RCFile” column store
• Next-gen runtime (eliminates M/R latency)

Extend Deep Analytical Ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases
Analytic Function Use Cases
• OVER
  – Rankings, top 10, bottom 10
  – Running balances
  – Statistics within time windows (e.g. last 3 months, last 6 months)
• LEAD / LAG
  – Trend identification
  – Sessionization
  – Forecasting / prediction
• Distributions
  – Histograms and bucketing
• Good for enterprise reports, dashboards, data mining, and business processing.
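As a rough illustration of what these analytic functions compute, here is a plain Python sketch of a running balance (SUM over an ordered window) and LAG over a toy list of amounts; the data is made up:

```python
# Toy daily transaction amounts, already in date order.
amounts = [100, -20, 50, 30]

# Running balance: what SUM(amount) OVER (ORDER BY day) would return.
running = []
total = 0
for a in amounts:
    total += a
    running.append(total)

# LAG(amount, 1): the previous row's value, None for the first row.
lagged = [None] + amounts[:-1]

print(running)  # [100, 80, 130, 160]
print(lagged)   # [None, 100, -20, 50]
```

Comparing each row to its lagged value is the basis of the trend-identification and sessionization use cases above.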
Stinger 2013 Roadmap Summary
• HDP 1.x (aka Hadoop 1.x …)
  – Additional SQL types
  – SQL analytic functions (OVER, subqueries in WHERE, etc.)
  – Modern optimized column store (ORC file)
  – Hive query enhancements: startup time, star joins, optimized M/R DAGs, vectorization, etc.
• HDP 2.x (aka Hadoop 2.x …)
  – Features in HDP 1.3 & 1.4
  – Next-gen runtime that eliminates startup time
  – Persistent function registry
  – Other features
Questions?