Page 1 © Hortonworks Inc. 2014 Discover HDP 2.2: Even Faster SQL Queries with Apache Hive & Stinger.next Hortonworks. We do Hadoop.
May 28, 2015
Page 2 © Hortonworks Inc. 2014
Speakers
Justin Sears
Hortonworks Product Marketing Manager
Alan Gates
Hortonworks Co-Founder and Apache Hive Committer & PMC Member
Raj Bains
Hortonworks Senior Manager of Product Management for Apache Hive
Page 3 © Hortonworks Inc. 2014
Agenda
• Introduction to Stinger.next
• New Innovation in Apache Hive 0.14
  § SQL: Transactions with ACID semantics
  § Speed: Cost based optimizer for star and bushy joins
  § Scale: Dynamic query optimizations
• The Road Ahead for Stinger.next
• Q & A
We’ll move quickly:
• Attendee phone lines are muted
• Text any questions to Raj Bains using Webex chat
• Questions answered at the end
• Unanswered questions and answers in upcoming blog post
Page 4 © Hortonworks Inc. 2014
Big Data, Hadoop & Data Center Re-platforming
Business Drivers
• From reactive analytics to proactive interactions
• Insights that drive competitive advantage & optimal returns
Financial Drivers
• Cost of data systems, as % of IT spend, continues to grow
• Cost advantages of commodity hardware & open source software
Technical Drivers
• Data is growing exponentially & existing systems are overwhelmed
• Predominantly driven by NEW types of data that can inform analytics
There is an inequitable balance between vendor and customer in the market
Page 5 © Hortonworks Inc. 2014
New Types of Data: Hadoop Value
• Clickstream: Capture and analyze website visitors’ data trails and optimize your website
• Sensors: Discover patterns in data streaming automatically from remote sensors and machines
• Server Logs: Research logs to diagnose process failures and prevent security breaches
• Sentiment: Understand how your customers feel about your brand and products, right now
• Geographic: Analyze location-based data to manage operations where they occur
• Unstructured: Understand patterns in files across millions of web pages, emails, and documents
Page 6 © Hortonworks Inc. 2014
A Shift from Reactive to Proactive Interactions
HDP and Hadoop allow organizations to use data to shift interactions from reactive (post transaction) to proactive (pre decision):
• A shift in Advertising: from mass branding to 1x1 targeting
• A shift in Financial Services: from educated investing to automated algorithms
• A shift in Healthcare: from mass treatment to designer medicine
• A shift in Retail: from static branding to real-time personalization
• A shift in Telco: from break then fix to repair before break
Page 7 © Hortonworks Inc. 2014
Enterprise Goals for the Modern Data Architecture
• Consolidate siloed data sets, structured and unstructured
• Central data set on a single cluster
• Multiple workloads across batch, interactive and real time
• Central services for security, governance and operation
• Preserve existing investment in current tools and platforms
• Single view of the customer, product, supply chain
[Diagram: Modern Data Architecture. Applications (Business Analytics, Custom Applications, Packaged Applications) sit on data systems (RDBMS, EDW, MPP, and Hadoop with YARN: Data Operating System over HDFS, serving interactive, real-time and batch workloads). Sources feed the data systems: existing systems (CRM, ERP, other) plus clickstream, web & social, geolocation, sensor & machine, server logs and unstructured data.]
Page 8 © Hortonworks Inc. 2014
YARN Transformed Hadoop & Opened a New Era
YARN The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
[Diagram: batch, interactive & real-time data access engines on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System). Engines shown: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark) and Others (ISV Engines), with Tez and Slider as the layers between several engines and YARN.]
Page 9 © Hortonworks Inc. 2014
YARN Extends Hadoop to Other Data Center Leaders
YARN The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools (e.g., SAS, Syncsort, Actian)
YARN Ready Applications: facilitate ongoing innovation and enterprise adoption via an ecosystem of new and existing “YARN Ready” solutions
[Diagram: the same YARN data access stack shown on the previous slide.]
Page 10 © Hortonworks Inc. 2014
Enterprise Hadoop: Central Set of Services
Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
• Governance
• Operations
• Security
Everything that plugs into Hadoop inherits these services
SECURITY: Provide a layered approach to security through Authentication, Authorization, Accounting, and Data Protection
GOVERNANCE: Load data and manage it according to policy
OPERATIONS: Deploy and effectively manage the platform
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
[Diagram: security, governance and operations services surrounding the batch, interactive & real-time data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV Engines, with Tez and Slider), all on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).]
Page 11 © Hortonworks Inc. 2014
Hortonworks Development Investment for the Enterprise
Vertical Integration with YARN and HDFS
[Diagram: the enterprise Hadoop services stack shown on the previous slide: security, governance and operations around the data access engines, on YARN and HDFS.]
• Ensure engines can run reliably and respectfully in a YARN based cluster
• Implement features throughout the stack to accommodate
Page 12 © Hortonworks Inc. 2014
Hortonworks Development Investment for the Enterprise
Horizontal Integration for Enterprise Services
[Diagram: the enterprise Hadoop services stack: security, governance and operations spanning the data access engines, on YARN and HDFS.]
• Ensure consistent enterprise services are applied across the entire Hadoop stack
• Integrate with and extend existing data center solutions for these key requirements
Page 13 © Hortonworks Inc. 2014
Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
[Diagram: HDP 2.2 components]
• Data access (batch, interactive & real-time) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System): Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), with Tez and Slider
• Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
• Security (Authentication, Authorization, Audit, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox
• Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
• Deployment choice: Linux, Windows, On-Premises, Cloud
YARN is the architectural center of HDP
• Common data set across all applications
• Batch, interactive & real-time workloads
• Multi-tenant access & processing
Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations
Enables broad ecosystem adoption
• ISVs can plug directly into Hadoop
The widest range of deployment options
• Linux & Windows
• On premises & cloud
Page 14 © Hortonworks Inc. 2014
Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
[Diagram: the HDP 2.2 component stack shown on the previous slide.]
YARN is the architectural center of HDP
• Common data set across all applications
• Batch, interactive & real-time workloads
• Multi-tenant access & processing
Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations
Enables broad ecosystem adoption
• ISVs can plug directly into Hadoop
The widest range of deployment options
• Linux & Windows
• On premises & cloud
Page 15 © Hortonworks Inc. 2014
Introduction to Stinger.next
Page 16 © Hortonworks Inc. 2014
Stinger.next – Enterprise SQL at Hadoop Scale
Stinger (Hive 0.13, Tez, ORC File)
• Scale to Petabytes
• Batch to Interactive Queries
• Read-Only Data
• Substantial SQL Support
• Single Tool for Multiple SQL workloads: Interactive, Reporting and ETL
• MapReduce, Tez Engines
Stinger.next
• Scale to Petabytes
• Sub-Second Queries
• Modify Data with Transactions
• Comprehensive SQL:2011 Analytics
• Single Tool for Multiple SQL workloads: Interactive, Reporting, ETL, ML
• MapReduce, Tez, Spark Engines
Page 17 © Hortonworks Inc. 2014
SQL in Hive 0.14: Transactions with ACID Semantics
Page 18 © Hortonworks Inc. 2014
Transaction Use Cases
Reporting with Analytics (YES)
• Reporting on data with occasional updates
• Corrections to the fact tables, evolving dimension tables
• Low concurrency updates, low TPS
Operational Reporting (YES, next)
• High throughput ingest from operational (OLTP) database
• Periodic inserts every 5-30 minutes
• Requires tool support and changes in our Transactions
Operational (OLTP) Database (NO)
• Small transactions, each doing single line inserts
• High concurrency: hundreds to thousands of connections
[Diagram: analytics modifications made directly in Hive; replication from an OLTP database into Hive; high-concurrency OLTP on Hive.]
Page 19 © Hortonworks Inc. 2014
Deep Dive: Transactions
Transaction Support in Hive with ACID semantics
• Hive native support for INSERT, UPDATE, DELETE
• Split into phases:
  • Phase 1: Hive Streaming Ingest (append) [Done]
  • Phase 2: INSERT / UPDATE / DELETE Support [HDP 2.2]
  • Phase 3: BEGIN / COMMIT / ROLLBACK Txn [Next]
How a task reads the data:
1. Original File: Task reads the latest read-optimized ORCFile
2. Edits Made: Task reads the ORCFile and merges the delta file with the edits
3. Edits Merged: Task reads the updated (merged) read-optimized ORCFile
The Hive ACID Compactor periodically merges the delta files in the background.
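The read path above (base file, then base plus delta, then compacted base) can be illustrated with a small Python sketch. It is a conceptual model only: the `Edit` record, the dictionary used as a "base file", and the `merged_read` and `compact` helpers are hypothetical names invented here and do not reflect Hive's actual ORC delta format or compactor implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    """One record in a delta file (hypothetical layout, not the real ORC schema)."""
    txn_id: int                  # transaction that wrote the edit
    row_id: int                  # row the edit applies to
    op: str                      # "insert", "update" or "delete"
    value: Optional[str] = None  # new value for insert/update


def merged_read(base: Dict[int, str], deltas: List[List[Edit]]) -> Dict[int, str]:
    """Rows a reader sees: the base rows with every delta applied in txn order."""
    rows = dict(base)  # copy, so the read-optimized base is never mutated
    edits = sorted((e for d in deltas for e in d), key=lambda e: e.txn_id)
    for e in edits:
        if e.op == "delete":
            rows.pop(e.row_id, None)
        else:  # insert and update both set the latest value
            rows[e.row_id] = e.value
    return rows


def compact(base: Dict[int, str], deltas: List[List[Edit]]) -> Tuple[Dict[int, str], list]:
    """Fold all deltas into a new base; readers then need no merge step."""
    return merged_read(base, deltas), []


# 1. Original file
base = {1: "a", 2: "b", 3: "c"}
# 2. Edits made: two transactions, each writing its own delta
deltas = [
    [Edit(10, 2, "update", "B")],
    [Edit(11, 3, "delete"), Edit(11, 4, "insert", "d")],
]
before = merged_read(base, deltas)
# 3. Edits merged by the background compaction step
new_base, new_deltas = compact(base, deltas)
after = merged_read(new_base, new_deltas)
```

Readers get the same rows before and after compaction; compaction only removes the per-read merge cost.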
Page 20 © Hortonworks Inc. 2014
Speed in Hive 0.14: Cost Based Optimizer
Page 21 © Hortonworks Inc. 2014
TPC-DS Query 17
SELECT i_item_id,
       i_item_desc,
       s_state,
       Count(ss_quantity) AS store_sales_quantitycount,
       Avg(ss_quantity) AS store_sales_quantityave,
       Stddev_samp(ss_quantity) AS store_sales_quantitystdev,
       Stddev_samp(ss_quantity) / Avg(ss_quantity) AS store_sales_quantitycov,
       Count(sr_return_quantity) as_store_returns_quantitycount,
       Avg(sr_return_quantity) as_store_returns_quantityave,
       Stddev_samp(sr_return_quantity) as_store_returns_quantitystdev,
       Stddev_samp(sr_return_quantity) / Avg(sr_return_quantity) AS store_returns_quantitycov,
       Count(cs_quantity) AS catalog_sales_quantitycount,
       Avg(cs_quantity) AS catalog_sales_quantityave,
       Stddev_samp(cs_quantity) / Avg(cs_quantity) AS catalog_sales_quantitystdev,
       Stddev_samp(cs_quantity) / Avg(cs_quantity) AS catalog_sales_quantitycov
FROM   store_sales,
       store_returns,
       catalog_sales,
       date_dim d1,
       date_dim d2,
       date_dim d3,
       store,
       item
WHERE  d1.d_quarter_name = '2000Q1'
       AND d1.d_date_sk = store_sales.ss_sold_date_sk
       AND ss_sold_date BETWEEN '2000-01-01' AND '2000-03-31'
       AND item.i_item_sk = store_sales.ss_item_sk
       AND store.s_store_sk = store_sales.ss_store_sk
       AND store_sales.ss_customer_sk = store_returns.sr_customer_sk
       AND store_sales.ss_item_sk = store_returns.sr_item_sk
       AND store_sales.ss_ticket_number = store_returns.sr_ticket_number
       AND store_returns.sr_returned_date_sk = d2.d_date_sk
       AND d2.d_quarter_name IN ( '2000Q1', '2000Q2', '2000Q3' )
       AND sr_returned_date BETWEEN '2000-01-01' AND '2000-09-01'
       AND store_returns.sr_customer_sk = catalog_sales.cs_bill_customer_sk
       AND store_returns.sr_item_sk = catalog_sales.cs_item_sk
       AND catalog_sales.cs_sold_date_sk = d3.d_date_sk
       AND d3.d_quarter_name IN ( '2000Q1', '2000Q2', '2000Q3' )
       AND cs_sold_date BETWEEN '2000-01-01' AND '2000-09-31'
GROUP  BY i_item_id, i_item_desc, s_state
ORDER  BY i_item_id, i_item_desc, s_state
LIMIT  100;
Page 22 © Hortonworks Inc. 2014
CBO on Selected Queries – 17
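[Diagram: join graph for Query 17]
• Fact tables: store_sales, store_returns, catalog_sales
• Each fact table joins its own date_dim instance (d1, d2, d3) on date_sk; each date_dim has a quarter filter and a date filter
• store_sales joins store_returns on customer_sk, item_sk and ticket_number
• store_returns joins catalog_sales on customer_sk and item_sk
• store_sales joins item on item_sk and store on store_sk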
Page 23 © Hortonworks Inc. 2014
OLD: Left Deep Plan
[Plan diagram, listed by vertex]
• Map 12: Table scan store_returns
• Map 9: Table scan store_sales
• Reducer 10: Merge join 12 & 9
• Map 2: Table scan catalog_sales
• Map 1: Table scan d1, filter
• Map 6: Table scan d2, filter
• Map 7: Table scan d3, filter
• Map 8: Table scan store
• Map 11: Table scan item
• Reducer 3: Merge join 2 & 10; Map join 1; Map join 6; Map join 7; Map join 8 (store); Map join 11 (item); Filter; Group By; Reduce
• Reducer 4: Group By, Reduce
• Reducer 5: Limit
Large fact tables are joined together without filters.
Page 24 © Hortonworks Inc. 2014
NEW: Complex Bushy Plan
[Plan diagram, listed by vertex]
• Map 9: Table scan d1, filter
• Map 1: Table scan d1, filter
• Map 2: Table scan d1, filter
• Map 3: store_sales, Map join
• Map 8: store_returns, Map join
• Map 11: catalog_sales, Map join
• Map 10: Table scan store
• Map 12: Table scan item
• Reducer 4: Merge join 3 & 8; Map join store; Map join item; Reduce
• Reducer 5: Merge join; Group By; Reduce
• Reducer 6: Group By, Reduce
• Reducer 7: Limit
All 3 large fact tables are joined with the date dimension, limiting the data to a few quarters.
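To see why a cost based optimizer would prefer the bushy plan, the following Python sketch compares the rows each plan shape produces under a textbook cardinality estimate. Everything in it is an assumption made for illustration: the row counts, the distinct-key count, the filter selectivity, the two-fact-table simplification and the "sum of rows produced" cost model are invented here and are not Hive's actual statistics or cost functions.

```python
# Hypothetical statistics, invented for illustration (not from the benchmark).
STORE_SALES_ROWS = 1_000_000
STORE_RETURNS_ROWS = 100_000
JOIN_KEY_NDV = 500_000   # distinct join-key values shared by the two fact tables
DATE_KEEP_ONE_IN = 20    # the quarter filter keeps 1 row in 20


def join_rows(left: int, right: int, ndv: int) -> int:
    """Textbook equi-join estimate: |L| * |R| / distinct join keys."""
    return left * right // ndv


def left_deep_plan() -> list:
    """Join the fact tables first, then apply the two date filters."""
    facts = join_rows(STORE_SALES_ROWS, STORE_RETURNS_ROWS, JOIN_KEY_NDV)
    after_d1 = facts // DATE_KEEP_ONE_IN
    after_d2 = after_d1 // DATE_KEEP_ONE_IN
    return [facts, after_d1, after_d2]


def bushy_plan() -> list:
    """Filter each fact table through its date dimension, then join the results."""
    sales = STORE_SALES_ROWS // DATE_KEEP_ONE_IN
    returns = STORE_RETURNS_ROWS // DATE_KEEP_ONE_IN
    return [sales, returns, join_rows(sales, returns, JOIN_KEY_NDV)]


def cost(plan_rows: list) -> int:
    """Cost model for this sketch: total rows produced across all joins."""
    return sum(plan_rows)


plans = {"left_deep": left_deep_plan(), "bushy": bushy_plan()}
best = min(plans, key=lambda name: cost(plans[name]))
print(best, {name: cost(rows) for name, rows in plans.items()})
```

Both plans end with the same number of result rows in this model; the difference is how much intermediate data is materialized on the way there, which is the quantity the performance slide reports.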
Page 25 © Hortonworks Inc. 2014
Performance Improvement – Query 17
Scale = 30 TB, input records ~186 mil

CBO   Elapsed Time (sec)   Elapsed Time   Intermediate Data (GB)   Output and Intermediate Records
OFF   10,683               ~3 hrs         5,017                    135,647,792,123
ON    1,284                ~20 mins       275                      8,543,232,360
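The improvement factors implied by these figures can be worked out directly. The short Python sketch below only divides the numbers reported on the slide; the dictionary keys are labels chosen here for readability.

```python
# Figures copied from the slide; key names are labels chosen for this sketch.
runs = {
    "OFF": {"elapsed_sec": 10_683, "intermediate_gb": 5_017, "records": 135_647_792_123},
    "ON": {"elapsed_sec": 1_284, "intermediate_gb": 275, "records": 8_543_232_360},
}


def improvement(metric: str) -> float:
    """Ratio of the CBO-off value to the CBO-on value for one metric."""
    return runs["OFF"][metric] / runs["ON"][metric]


for metric in ("elapsed_sec", "intermediate_gb", "records"):
    print(f"{metric}: {improvement(metric):.1f}x")
```

With CBO on, the query runs roughly 8 times faster while producing about 18 times less intermediate data and about 16 times fewer output and intermediate records.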
Page 26 © Hortonworks Inc. 2014
Scale in Hive 0.14: Dynamic Query Optimization
Page 27 © Hortonworks Inc. 2014
Auto Reducer Parallelism
Use the actual data volume observed during execution, rather than estimates from query compilation, to determine the number of reducers.
Leads to:
• faster query execution
• better resource utilization
[Diagram: App Master with a Vertex Manager tracking vertex state, shown at two points in time across the tasks of a single map vertex and a single reduce vertex. Labeled steps: 1. Data size statistics; 2. Set parallelism; 3. Re-route; 4. Cancel task; 5. Tasks completed; 6. Start tasks; 7. Start.]
Page 28 © Hortonworks Inc. 2014
Auto Reducer Parallelism
use tpcds_bin_partitioned_orc_30000;
set hive.tez.auto.reducer.parallelism=true;
set hive.tez.min.partition.factor=0.125;
SELECT ss_promo_sk,
       Sum(ss_sales_price),
       Count(*)
FROM   store_sales
WHERE  ss_sold_date < '1998-03-01'
GROUP  BY ss_promo_sk
ORDER  BY 2 DESC
LIMIT  10;
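A rough Python sketch of the idea behind these settings: the compile-time estimate is over-partitioned, and at run time the reducer count shrinks toward what the observed map output needs, bounded below by the minimum partition factor. The function names, the byte sizes and the exact clamping rule are assumptions made for illustration; they simplify whatever Tez and Hive actually do, and only `hive.tez.min.partition.factor` comes from the slide.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def initial_reducers(estimated_bytes: int, bytes_per_reducer: int,
                     max_partition_factor: float) -> tuple:
    """Compile time: over-partition the estimate so there is room to shrink."""
    estimate = max(1, math.ceil(estimated_bytes / bytes_per_reducer))
    return math.ceil(estimate * max_partition_factor), estimate


def final_reducers(observed_bytes: int, bytes_per_reducer: int, estimate: int,
                   min_partition_factor: float, launched: int) -> int:
    """Run time: size the reduce vertex from observed map output, within bounds."""
    wanted = max(1, math.ceil(observed_bytes / bytes_per_reducer))
    floor = max(1, math.ceil(estimate * min_partition_factor))
    return min(launched, max(floor, wanted))


# Assumed inputs: the compiler guesses 64 GB of shuffle data, 256 MB per reducer.
launched, estimate = initial_reducers(64 * GB, 256 * MB, max_partition_factor=2.0)

# The filter turns out to be selective: only 10 GB reaches the shuffle.
selective = final_reducers(10 * GB, 256 * MB, estimate, 0.125, launched)

# Very little data: the minimum partition factor sets the lower bound.
tiny = final_reducers(1 * GB, 256 * MB, estimate, 0.125, launched)

print(launched, estimate, selective, tiny)
```

In this model a lower minimum partition factor lets the engine shrink the reduce stage further when a filter such as `ss_sold_date < '1998-03-01'` removes most of the input.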
Page 29 © Hortonworks Inc. 2014
Dynamic Partition Pruning
Example
Join of
• a large fact table with multiple partitions (store_sales, partitions d1, d2, d3, d4, …)
• with a dimension table that has a filter (date_dim d1)
on ss_sold_date_sk = date_sk

Table Definition
create table store_sales (...) partitioned by (ss_sold_date_sk int) stored as orc;

The ss_sold_date_sk partitions that can be pruned away at join time are not known until the filter is applied at runtime.

Compile Time Design
• Insert synthetic conditions for each join representing "x in (keys of other side in join)". The optimizer will push them as far down as possible.
• If the condition hits a table scan and the column involved is a partition column:
  • Set up an operator to send key events to the AM
• else:
  • Remove the synthetic predicate

[Diagram: App Master with a Vertex Manager tracking vertex state, and the tasks of two map vertices. Labeled step: 1. Send events for partition pruning.]
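The runtime effect of dynamic partition pruning can be sketched in Python as follows: run the dimension-side filter first, collect the surviving join keys, and scan only the fact partitions whose partition value is in that key set. The in-memory tables and the helper names (`dimension_keys`, `prune_partitions`, `pruned_join`) are hypothetical; the real mechanism sends key events through the Tez App Master rather than calling functions in one process.

```python
def dimension_keys(date_dim: list, predicate) -> set:
    """Run the dimension-side filter and collect the surviving join keys."""
    return {row["d_date_sk"] for row in date_dim if predicate(row)}


def prune_partitions(partitions: dict, keys: set) -> dict:
    """Keep only fact partitions whose partition value appears in the key set."""
    return {k: rows for k, rows in partitions.items() if k in keys}


def pruned_join(date_dim: list, partitions: dict, predicate) -> tuple:
    """Join the filtered dimension with the fact table, scanning pruned partitions only."""
    keys = dimension_keys(date_dim, predicate)
    scanned = prune_partitions(partitions, keys)
    dim_by_key = {r["d_date_sk"]: r for r in date_dim if r["d_date_sk"] in keys}
    result = [(dim_by_key[k]["d_moy"], price)
              for k, rows in scanned.items() for price in rows]
    return result, sorted(scanned)


# Toy data: store_sales partitioned by ss_sold_date_sk (values are sales prices).
date_dim = [
    {"d_date_sk": 1, "d_moy": 11},
    {"d_date_sk": 2, "d_moy": 12},
    {"d_date_sk": 3, "d_moy": 12},
    {"d_date_sk": 4, "d_moy": 1},
]
store_sales = {1: [10.0], 2: [20.0, 21.0], 3: [30.0], 4: [40.0]}

rows, scanned_partitions = pruned_join(date_dim, store_sales, lambda r: r["d_moy"] == 12)
print(scanned_partitions, rows)
```

Only two of the four partitions are read, even though which two could not be known until the `d_moy = 12` filter had run.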
Page 30 © Hortonworks Inc. 2014
Dynamic Pruning
TPC-DS Query 3
SELECT dt.d_year,
       item.i_brand_id brand_id,
       item.i_brand brand,
       Sum(ss_ext_sales_price) sum_agg
FROM   date_dim dt,
       store_sales,
       item
WHERE  dt.d_date_sk = store_sales.ss_sold_date_sk
       AND store_sales.ss_item_sk = item.i_item_sk
       AND item.i_manufact_id = 436
       AND dt.d_moy = 12
GROUP  BY dt.d_year, item.i_brand, item.i_brand_id
ORDER  BY dt.d_year, sum_agg DESC, brand_id
LIMIT  100;
Page 31 © Hortonworks Inc. 2014
Stinger.next: The Road Ahead
Page 32 © Hortonworks Inc. 2014
Stinger.next - Delivery Themes
Beyond Read-Only (2nd Half 2014)
• Transactions with ACID allowing insert, update and delete
• Temporary Tables
• Cost Based Optimizer optimizes star and bushy join queries
Sub-Second (1st Half 2015)
• Sub-Second queries with LLAP
• Hive-Spark Machine Learning integration
• Operational reporting with Hive Streaming Ingest and Transactions
• Replication and SQL/CBO improvements
Richer Analytics (2nd Half 2015)
• Toward SQL:2011 Analytics
• Materialized Views
• Cross-Geo Queries
• Workload Management via YARN and LLAP integration
Page 33 © Hortonworks Inc. 2014
Q & A
Page 34 © Hortonworks Inc. 2014
Thank you! Learn more at: hortonworks.com/hadoop/hive/
Register for the remaining 6 Discover HDP 2.2 Webinars
Hortonworks.com/webinars