
Hive Correlation Optimizer

Jan 27, 2015

Yin Huai

Presented at Hadoop Summit 2013 Hive User Group Meetup
Page 1: Hive Correlation Optimizer

© Hortonworks Inc. 2011

Hive Correlation Optimizer

Yin Huai

[email protected]

[email protected]

Page 1

Hadoop Summit 2013 Hive User Group Meetup

Page 2: Hive Correlation Optimizer

About me

• Hive contributor
• Summer intern at Hortonworks
• 4th-year Ph.D. student at The Ohio State University
• Research interests: query optimizations, file formats, distributed systems, and storage systems

Page 2Architecting the Future of Big Data

Page 3: Hive Correlation Optimizer

Outline

• Query planning in Hive
• Correlations in a query (intra-query correlations)
• Case studies
• Automatically exploiting correlations (HIVE-2206: Correlation Optimizer)

Page 4: Hive Correlation Optimizer

Query planning

SELECT t1.c2, count(*)
FROM t1 JOIN t2 ON (t1.c1=t2.c1)
GROUP BY t1.c2

[Diagram: operator tree. t1 and t2 feed JOIN (t1.c1=t2.c1); JOIN feeds AGG, which calculates count(*) for every group of t1.c2.]

Page 5: Hive Correlation Optimizer

Query planning

SELECT t1.c2, count(*)
FROM t1 JOIN t2 ON (t1.c1=t2.c1)
GROUP BY t1.c2

Evaluate this query in distributed systems

[Diagram: the same operator tree with a shuffle in front of each operator. t1 and t2 are shuffled on c1 for JOIN; the JOIN output is shuffled on c2 for AGG.]

How to shuffle? Use the key column(s).
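To make "use the key column(s)" concrete, here is a minimal Python sketch of hash partitioning, under stated assumptions. It is not Hive's implementation: the function names `reducer_for` and `partition_by_key`, the CRC32 hash, the reducer count, and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
import zlib


def reducer_for(key, num_reducers):
    """Map a key to a reducer index with a stable hash."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def partition_by_key(rows, key_column, num_reducers):
    """Group rows by the reducer chosen from their key column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[reducer_for(row[key_column], num_reducers)].append(row)
    return dict(partitions)


# Hypothetical sample rows; column names follow the query on the slide.
t1 = [{"c1": 1, "c2": "a"}, {"c1": 2, "c2": "b"}, {"c1": 1, "c2": "b"}]
t2 = [{"c1": 1}, {"c1": 3}]

# Shuffling both inputs on the join key c1 sends matching rows
# from t1 and t2 to the same reducer, so the join can run there.
t1_parts = partition_by_key(t1, "c1", num_reducers=2)
t2_parts = partition_by_key(t2, "c1", num_reducers=2)
```

Because both tables are partitioned with the same function on the same column, every row with c1 = 1 ends up on one reducer regardless of which table it came from.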

Page 6: Hive Correlation Optimizer

Generating MapReduce jobs

One MapReduce job can shuffle data only once, so a plan with two shuffles needs two jobs.

[Diagram: the plan with two shuffles (on c1 for JOIN, on c2 for AGG) is split into two jobs. Job 1 shuffles t1 and t2 on c1, runs JOIN, and writes tmp. Job 2 reads tmp, shuffles it on c2, and runs AGG.]
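The two-job plan can be simulated in a few lines of Python. This is only a sketch under assumptions: `run_job`, the map and reduce functions, and the sample rows are hypothetical, and the in-memory grouping stands in for the real shuffle.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run one simulated MapReduce job: map, shuffle once by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)      # the single shuffle of this job
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Hypothetical sample data as (table_name, row) pairs.
t1 = [("t1", {"c1": 1, "c2": "a"}), ("t1", {"c1": 2, "c2": "a"}),
      ("t1", {"c1": 1, "c2": "b"})]
t2 = [("t2", {"c1": 1}), ("t2", {"c1": 1}), ("t2", {"c1": 2})]


def join_map(record):
    table, row = record
    yield row["c1"], (table, row)          # shuffle key: c1


def join_reduce(c1, tagged_rows):
    left = [row for table, row in tagged_rows if table == "t1"]
    right = [row for table, row in tagged_rows if table == "t2"]
    # Emit the t1 row once per matching t2 row.
    return [left_row for left_row in left for _ in right]


def agg_map(row):
    yield row["c2"], 1                     # shuffle key: c2


def agg_reduce(c2, ones):
    return [(c2, sum(ones))]


tmp = run_job(t1 + t2, join_map, join_reduce)   # Job 1: JOIN, writes tmp
result = run_job(tmp, agg_map, agg_reduce)      # Job 2: AGG over tmp
```

Each call to `run_job` groups its input exactly once, which is why the JOIN on c1 and the AGG on c2 cannot share a job here.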

Page 7: Hive Correlation Optimizer

Generating MapReduce jobs

MapReduce will shuffle data for us; we just need to emit outputs from the Map phase.

We use ReduceSinkOperator (RS) to emit Map outputs. RSs are the end of a Map phase.

[Diagram: Job 1 Map: t1 and t2 each end in an RS (RS1 and RS2). Job 1 Reduce: JOIN, whose output is written to tmp. Job 2 Map: tmp ends in an RS. Job 2 Reduce: AGG.]

Page 9: Hive Correlation Optimizer

Intra-query correlations

SELECT x.c1, count(*)
FROM t1 x JOIN t1 y ON (x.c1=y.c1)
GROUP BY x.c1

[Diagram: operator tree. t1 as x and t1 as y feed JOIN (x.c1=y.c1); JOIN feeds AGG, which calculates count(*) for every group of x.c1.]

Correlations:
1. Same input tables
2. JOIN and AGG using the same key

Page 10: Hive Correlation Optimizer

Intra-query correlations

SELECT p.c1, q.c2, q.cnt
FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
ON (p.c1=q.c1)

[Diagram: operator tree. t1 as x and t2 as y feed JOIN1 (x.c1=y.c1). t1 as z feeds AGG1, which calculates count(*) for every group of z.c1. JOIN1 and AGG1 feed JOIN2 (p.c1=q.c1).]

Correlations:
1. Same input tables (t1)
2. JOIN1 and AGG1 using the same key
3. JOIN2 and all of its parents using the same key

Page 11: Hive Correlation Optimizer

Intra-query correlations

• Defined in "YSmart: Yet Another SQL-to-MapReduce Translator"
  – http://ysmart.cse.ohio-state.edu/
  – http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
• Targets operators that need to shuffle data, and their inputs
• Three kinds of correlations
  – Input correlation (IC): independent operators share the same input tables
  – Transit correlation (TC): independent operators have input correlation and also shuffle the data in the same way (e.g. using the same keys)
  – Job flow correlation (JFC): two dependent operators shuffle the data in the same way

[Diagrams: IC: JOIN1 (t1 as x, t2 as y) and AGG1 (t1 as z) both read t1. TC: JOIN1 (x.c1=y.c1) and AGG1 (group by z.c1) both read t1 and shuffle on c1. JFC: JOIN (x.c1=y.c1) feeds AGG (group by z.c1).]
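The three definitions above can be sketched as a small classifier. This is an illustrative Python sketch, not YSmart's or Hive's code: the `Op` class, its fields, and the `classify` function are hypothetical, shuffle keys are assumed to be already resolved to base column names, and "dependent" is simplified to a direct parent/child relationship.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """A hypothetical operator that needs a shuffle (JOIN or AGG)."""
    name: str
    tables: frozenset      # base tables read directly by this operator
    keys: tuple            # shuffle key columns, resolved to base columns
    inputs: list = field(default_factory=list)   # operators feeding this one


def is_direct_input(consumer, producer):
    """True if `producer` feeds `consumer` directly."""
    return any(op is producer for op in consumer.inputs)


def classify(a, b):
    """Return the kinds of correlation between operators a and b."""
    kinds = set()
    if is_direct_input(a, b) or is_direct_input(b, a):
        # Dependent operators: only job flow correlation applies.
        if a.keys == b.keys:
            kinds.add("JFC")
        return kinds
    if a.tables & b.tables:
        kinds.add("IC")                 # independent, shared input table
        if a.keys == b.keys:
            kinds.add("TC")             # and shuffled in the same way
    return kinds


# The example plan from the slides: JOIN1 and AGG1 both read t1 and
# shuffle on c1; JOIN2 consumes both and also shuffles on c1.
join1 = Op("JOIN1", frozenset({"t1", "t2"}), ("c1",))
agg1 = Op("AGG1", frozenset({"t1"}), ("c1",))
join2 = Op("JOIN2", frozenset(), ("c1",), [join1, agg1])
```

With these definitions, `classify(join1, agg1)` yields both IC and TC, while `classify(join2, join1)` and `classify(join2, agg1)` yield JFC.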

Page 12: Hive Correlation Optimizer

Correlation-unaware query planning

[Diagram: t1 is read twice and shuffled on c1 for JOIN; the JOIN output is shuffled on c1 again for AGG. The plan becomes two jobs. Job 1 shuffles both t1 inputs on c1, runs JOIN, and writes tmp. Job 2 reads tmp, shuffles it on c1, and runs AGG.]

Hive does not care:
1. If a table has been used multiple times
2. If data really needs to be shuffled

Drawbacks:
1. Unnecessary data loading
2. Unnecessary data shuffling
3. Unnecessary data materialization
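The drawbacks can be seen in a toy comparison for the self-join query (SELECT x.c1, count(*) FROM t1 x JOIN t1 y ON (x.c1=y.c1) GROUP BY x.c1). This Python sketch is an assumption-laden illustration, not Hive's execution model: the function names are hypothetical, t1 is reduced to a list of c1 values, and shuffles are simply counted.

```python
from collections import Counter, defaultdict


def correlation_unaware(t1_c1):
    """Two scans of t1, two shuffles on c1, and a materialized join result."""
    shuffles = 0
    # Job 1: scan t1 twice (as x and as y), shuffle on c1, then join.
    x_groups, y_groups = defaultdict(list), defaultdict(list)
    for c1 in t1_c1:
        x_groups[c1].append(c1)
    for c1 in t1_c1:
        y_groups[c1].append(c1)
    shuffles += 1
    tmp = [k for k in x_groups
           for _ in x_groups[k] for _ in y_groups.get(k, [])]
    # Job 2: read tmp, shuffle on c1 again, then count per group.
    shuffles += 1
    return dict(Counter(tmp)), shuffles


def correlation_aware(t1_c1):
    """One scan of t1 and one shuffle on c1; join and count per key group."""
    group_sizes = Counter(t1_c1)
    # Inside one key group, the self-join produces n * n rows that all
    # carry the same c1, so the count is available without a second shuffle.
    return {k: n * n for k, n in group_sizes.items()}, 1


sample = [1, 1, 2, 3, 3, 3]     # hypothetical c1 values of t1
unaware_result, unaware_shuffles = correlation_unaware(sample)
aware_result, aware_shuffles = correlation_aware(sample)
```

Both functions return the same counts; the correlation-aware version gets there with one scan, one shuffle, and no intermediate tmp.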

Page 14: Hive Correlation Optimizer

Case studies: TPC-H Q17 (Flattened)

SELECT sum(l_extendedprice) / 7.0 as avg_yearly
FROM (SELECT l_partkey, l_quantity, l_extendedprice
      FROM lineitem JOIN part ON (p_partkey=l_partkey)
      WHERE p_brand='Brand#35' AND
            p_container = 'MED PKG') touter
JOIN (SELECT l_partkey as lp, 0.2 * avg(l_quantity) as lq
      FROM lineitem
      GROUP BY l_partkey) tinner
ON (touter.l_partkey = tinner.lp)
WHERE touter.l_quantity < tinner.lq

Page 15: Hive Correlation Optimizer

Case studies: TPC-H Q17 (Flattened)

[Diagram: lineitem and part feed JOIN1; lineitem feeds AGG1; JOIN1 and AGG1 feed JOIN2; JOIN2 feeds AGG2.]

lineitem is used by JOIN1 and AGG1.
JOIN1, AGG1, and JOIN2 share the same key.

Page 16: Hive Correlation Optimizer

Case studies: TPC-H Q17 (Flattened)

Without Correlation Optimizer

[Diagram: the same plan split into four jobs. JOIN1 and AGG1 run as Job 1 and Job 2, JOIN2 as Job 3, and AGG2 as Job 4. lineitem is read twice.]

Page 17: Hive Correlation Optimizer

Case studies: TPC-H Q17 (Flattened)

Without Correlation Optimizer

[Diagram: four jobs (Job 1 to Job 4); lineitem is read twice, once for JOIN1 and once for AGG1.]

With Correlation Optimizer

[Diagram: two jobs (Job 1 and Job 2); lineitem is read once and shared by JOIN1 and AGG1.]
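The drop from four jobs to two can be reproduced with a simple counting rule. The Python sketch below is a simplification under assumptions, not the optimizer's actual logic: `count_jobs` and the tuple layout are hypothetical, and an operator is folded into the job of the operator it feeds only when both shuffle on the same key.

```python
def count_jobs(op, consumer_key=None, correlation_aware=False):
    """Count MapReduce jobs for a tree of operators that need a shuffle.

    `op` is a hypothetical (name, shuffle_key, inputs) tuple. A key of
    None means the operator has no key in common with anything else.
    """
    name, key, inputs = op
    # A correlation-aware planner keeps an operator in its consumer's job
    # when both shuffle on the same key; otherwise it gets its own job.
    merged = (correlation_aware
              and consumer_key is not None
              and key == consumer_key)
    jobs = 0 if merged else 1
    for child in inputs:
        jobs += count_jobs(child, key, correlation_aware)
    return jobs


# TPC-H Q17 (flattened): JOIN1, AGG1, and JOIN2 share the key l_partkey.
# AGG2 is the final sum without a grouping key.
join1 = ("JOIN1", "l_partkey", [])
agg1 = ("AGG1", "l_partkey", [])
join2 = ("JOIN2", "l_partkey", [join1, agg1])
agg2 = ("AGG2", None, [join2])

jobs_without = count_jobs(agg2)
jobs_with = count_jobs(agg2, correlation_aware=True)
```

For this plan the rule gives four jobs without correlation awareness and two with it, matching the job counts shown on the slide.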

Page 18: Hive Correlation Optimizer

Case studies: TPC-DS Q95 (Flattened)

SELECT count(distinct ws1.ws_order_number) as order_count,
       sum(ws1.ws_ext_ship_cost) as total_shipping_cost,
       sum(ws1.ws_net_profit) as total_net_profit
FROM web_sales ws1
JOIN customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk)
JOIN web_site s ON (ws1.ws_web_site_sk = s.web_site_sk)
JOIN date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk)
LEFT SEMI JOIN (SELECT ws2.ws_order_number as ws_order_number
                FROM web_sales ws2 JOIN web_sales ws3
                ON (ws2.ws_order_number = ws3.ws_order_number)
                WHERE ws2.ws_warehouse_sk <> ws3.ws_warehouse_sk) ws_wh1
ON (ws1.ws_order_number = ws_wh1.ws_order_number)
LEFT SEMI JOIN (SELECT wr_order_number
                FROM web_returns wr
                JOIN (SELECT ws4.ws_order_number as ws_order_number
                      FROM web_sales ws4 JOIN web_sales ws5
                      ON (ws4.ws_order_number = ws5.ws_order_number)
                      WHERE ws4.ws_warehouse_sk <> ws5.ws_warehouse_sk) ws_wh2
                ON (wr.wr_order_number = ws_wh2.ws_order_number)) tmp1
ON (ws1.ws_order_number = tmp1.wr_order_number)
WHERE d.d_date >= '2001-05-01' AND
      d.d_date <= '2001-06-30' AND
      ca.ca_state = 'NC' AND
      s.web_company_name = 'pri'

Page 19: Hive Correlation Optimizer

Case studies: TPC-DS Q95 (Flattened)

[Diagram: web_sales is joined with customer_address, web_site, and date_dim by MapJoin. Two JOIN1 operators each self-join web_sales; one JOIN1 is joined with web_returns by JOIN2. SemiJoin combines the MapJoin output with the other JOIN1 and with JOIN2, and feeds AGG.]

Page 20: Hive Correlation Optimizer

Case studies: TPC-DS Q95 (Flattened)

Without Correlation Optimizer
• 6 MapReduce jobs
• Unnecessary data loading (black web_sales nodes)
• Unnecessary data shuffling

[Diagram: the Q95 plan divided into Job 1 to Job 6; each JOIN1 reads web_sales twice.]

Page 21: Hive Correlation Optimizer

Case studies: TPC-DS Q95 (Flattened)

With Correlation Optimizer
• Black web_sales nodes share the same data loading

[Diagram: the Q95 plan with a single web_sales node feeding both JOIN1 operators.]

Page 22: Hive Correlation Optimizer

Case studies: TPC-DS Q95 (Flattened)

With Correlation Optimizer
• Black web_sales nodes share the same data loading
• 3 MapReduce jobs

[Diagram: the Q95 plan with a single web_sales node feeding both JOIN1 operators, divided into Job 1, Job 2, and Job 3.]

Page 23: Hive Correlation Optimizer

Case studies: TPC-DS Q95 (Flattened)

Follow-up work
• Evaluate JOIN1 only once without materializing a temporary table

[Diagram: the Q95 plan with a single JOIN1 over web_sales, whose output feeds both JOIN2 and SemiJoin.]

Page 24: Hive Correlation Optimizer

Case studies: TPC-DS Q95 (Flattened)

Follow-up work
• Evaluate JOIN1 only once without materializing a temporary table
• Only use 2 MapReduce jobs

[Diagram: the Q95 plan with a single JOIN1 over web_sales, divided into Job 1 and Job 2.]

Page 26: Hive Correlation Optimizer

Objectives

• Eliminate unnecessary data loading
  – The query planner will be aware of what data will be loaded
  – Do as many things as possible with loaded data
• Eliminate unnecessary data shuffling
  – The query planner will be aware of when data really needs to be shuffled
  – Do as many things as possible before shuffling the data again

Page 27: Hive Correlation Optimizer

ReduceSink Deduplication

• HIVE-2340
• Handles chained job flow correlations
  – e.g. generating a single job for both Group By and Order By
• Cannot handle complex patterns
  – e.g. patterns involving multiple joins
• Need a fundamental solution
• Need to exploit shared input tables

[Diagram: before, t1 feeds RS1, AGG1, and then RS2; after, t1 feeds RS1 and AGG1 only, with RS2 removed.]
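The idea behind removing RS2 can be sketched on a linear chain of operators. This Python sketch is a deliberately simplified assumption, not the HIVE-2340 implementation: the function name and the list-of-tuples plan layout are hypothetical, and only key equality is checked, whereas a real optimizer would have to check more than that.

```python
def deduplicate_reduce_sinks(chain):
    """Drop an RS whose keys match the nearest RS upstream of it.

    `chain` is a hypothetical list of (operator_name, keys) tuples ordered
    from the table scan upward; `keys` is None for non-RS operators.
    """
    result = []
    last_rs_keys = None
    for name, keys in chain:
        if name.startswith("RS"):
            if keys == last_rs_keys:
                continue        # data is already partitioned on these keys
            last_rs_keys = keys
        result.append((name, keys))
    return result


# The chain from the slide, assuming RS1 and RS2 use the same key.
plan = [("t1", None), ("RS1", ("c1",)), ("AGG1", None), ("RS2", ("c1",))]
deduplicated = deduplicate_reduce_sinks(plan)

# With a different key, the second shuffle is still needed and RS2 stays.
other_plan = [("t1", None), ("RS1", ("c1",)), ("AGG1", None), ("RS2", ("c2",))]
unchanged = deduplicate_reduce_sinks(other_plan)
```

A linear scan like this only sees one chain at a time, which is consistent with the limitation listed above: patterns with several joins need a tree-level analysis.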

Page 28: Hive Correlation Optimizer

Correlation Optimizer

• 2-phase optimizer
  – Phase 1: Correlation detection
  – Phase 2: Query plan tree transformation
• This work is not just about the optimizer
  – New operators to support the execution of an optimized plan
  – A mechanism to coordinate the operator tree inside the Reduce phase

Page 29: Hive Correlation Optimizer

Correlation detection

SELECT p.c1, q.c2, q.cnt
FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
ON (p.c1=q.c1)

1. Traverse the tree all the way down to find matching keys in ReduceSinkOperators
2. Then, check input tables to find shared data loading opportunities

[Diagram: t1 as x feeds RS1 (key: x.c1), t2 as y feeds RS2 (key: y.c1), and t1 as z feeds RS3 (key: z.c1). RS1 and RS2 feed JOIN1; RS3 feeds AGG1. JOIN1 feeds RS4 (key: p.c1) and AGG1 feeds RS5 (key: q.c1); RS4 and RS5 feed JOIN2.]
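The two detection steps can be sketched on the plan above. This is a minimal Python sketch under assumptions, not the HIVE-2206 code: the `Node` class and both functions are hypothetical, and RS keys are assumed to be already traced back to the base column name (so x.c1, z.c1, p.c1, and q.c1 all appear as "c1"). Following Hive's usage, `parents` are the operators that feed a node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str                 # "TS" (table scan), "RS", "JOIN", or "AGG"
    key: str = None           # for RS: key column, resolved by lineage
    table: str = None         # for TS: the scanned table
    parents: list = field(default_factory=list)   # upstream (input) nodes


def upstream_reduce_sinks(node):
    """Nearest RS nodes upstream of `node`, without crossing an RS."""
    found = []
    for parent in node.parents:
        if parent.kind == "RS":
            found.append(parent)
        else:
            found.extend(upstream_reduce_sinks(parent))
    return found


def detect(top_rs):
    """Step 1: walk down from `top_rs`, collecting RSs with a matching key."""
    correlated = [top_rs]
    for rs in upstream_reduce_sinks(top_rs):
        if rs.key == top_rs.key:
            correlated.extend(detect(rs))
    return correlated


def shared_tables(reduce_sinks):
    """Step 2: tables scanned directly below more than one of the RSs."""
    readers = {}
    for rs in reduce_sinks:
        for parent in rs.parents:
            if parent.kind == "TS":
                readers.setdefault(parent.table, []).append(rs.name)
    return {t: names for t, names in readers.items() if len(names) > 1}


# The plan from the slide.
rs1 = Node("RS1", "RS", key="c1", parents=[Node("t1 as x", "TS", table="t1")])
rs2 = Node("RS2", "RS", key="c1", parents=[Node("t2 as y", "TS", table="t2")])
rs3 = Node("RS3", "RS", key="c1", parents=[Node("t1 as z", "TS", table="t1")])
join1 = Node("JOIN1", "JOIN", parents=[rs1, rs2])
agg1 = Node("AGG1", "AGG", parents=[rs3])
rs4 = Node("RS4", "RS", key="c1", parents=[join1])
rs5 = Node("RS5", "RS", key="c1", parents=[agg1])
join2 = Node("JOIN2", "JOIN", parents=[rs4, rs5])

correlated = [rs for top in join2.parents for rs in detect(top)]
correlated_names = [rs.name for rs in correlated]
shared = shared_tables(correlated)
```

Here all five RSs turn out to be correlated, and t1 is identified as a table that RS1 and RS3 could load once and share.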

Page 30: Hive Correlation Optimizer

Query plan tree transformation

SELECT p.c1, q.c2, q.cnt
FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
ON (p.c1=q.c1)

[Diagram, before: t1 as x, t2 as y, and t1 as z feed RS1 (key: x.c1), RS2 (key: y.c1), and RS3 (key: z.c1). RS1 and RS2 feed JOIN1; RS3 feeds AGG1. JOIN1 and AGG1 feed RS4 (key: p.c1) and RS5 (key: q.c1), which feed JOIN2.]

[Diagram, after: t1 is scanned once for both x and z, and t2 is scanned as y. RS1, RS2, and RS3 remain; JOIN1, AGG1, and JOIN2 follow them without RS4 and RS5.]

Page 31: Hive Correlation Optimizer

Thanks
