Overview of the Hive Stinger Initiative. Eric N. Hanson, Principal Software Development Engineer, Microsoft HDInsight Team. 30 June 2014

May 10, 2015

Dr. Eric N. Hanson, Principal Software Development Engineer at Microsoft and Apache Hive committer, presents the recent improvements in Hive.
Transcript
Page 1: Overview of the Hive Stinger Initiative

Overview of the Hive Stinger Initiative

Eric N. Hanson

Principal Software Development Engineer

Microsoft HDInsight Team

30 June 2014

Page 2: Overview of the Hive Stinger Initiative

What is Stinger? Umbrella term for…

• Faster queries in Hive
– ORC
– Vectorization
– Tez

• Better language features for analysis
– Window functions etc.

Page 3: Overview of the Hive Stinger Initiative

Why Stinger?

• Hive has good functionality

• But it started out sloooowww

• Need to speed it up
– keep it competitive
– make it fun to use

Page 4: Overview of the Hive Stinger Initiative

ORC

• A good columnstore format

• Run length encoding, value encoding, dictionary encoding

• Layers stream compression over the top

• Written by Owen O’Malley

• http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
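
To make two of the encodings named on this slide concrete, here is a minimal sketch of run-length encoding and dictionary encoding over an in-memory column. It is a toy illustration only: the class and method names are hypothetical, and ORC's real writers use more elaborate layouts than the plain (value, run length) pairs and integer ids shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy column encoders; illustrative only, not ORC's actual writer code. */
final class ColumnEncodingSketch {

    /** Run-length encodes a long column as {value, runLength} pairs. */
    static List<long[]> runLengthEncode(long[] column) {
        List<long[]> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int start = i;
            // Advance while the value repeats.
            while (i < column.length && column[i] == column[start]) {
                i++;
            }
            runs.add(new long[] {column[start], i - start});
        }
        return runs;
    }

    /**
     * Dictionary encodes a string column. Returns one integer id per row and
     * appends each distinct value to dictionaryOut in first-seen order.
     */
    static int[] dictionaryEncode(String[] column, List<String> dictionaryOut) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int[] encoded = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            Integer id = ids.get(column[i]);
            if (id == null) {
                id = ids.size();
                ids.put(column[i], id);
                dictionaryOut.add(column[i]);
            }
            encoded[i] = id;
        }
        return encoded;
    }
}
```

Both encodings pay off on columns with few distinct or long-repeating values, which is common in a columnstore because values of one column sit next to each other.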

Page 5: Overview of the Hive Stinger Initiative

Using ORC

• create table Tbl (col int) stored as orc;

• orc.compress table property sets the compression codec (default ZLIB)

• See http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit

Page 6: Overview of the Hive Stinger Initiative

TPC-DS File Sizes

*Courtesy of Hortonworks

Page 7: Overview of the Hive Stinger Initiative

Vectorization

Page 8: Overview of the Hive Stinger Initiative

How the code works (simplified)

class LongColumnAddLongScalarExpression {
  int inputColumn;
  int outputColumn;
  long scalar;

  void evaluate(VectorizedRowBatch batch) {
    long[] inVector = ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector = ((LongColumnVector) batch.columns[outputColumn]).vector;

    if (batch.selectedInUse) {
      // Only the rows listed in batch.selected are still live.
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // Every row in the batch is live.
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}

• No method calls
• Low instruction count
• Cache locality to 1024 values
• No pipeline stalls
• SIMD in Java 8
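
The role of the selection vector in the loop above can be tried out without Hive on the classpath. The sketch below uses a hypothetical stand-in, `MiniBatch`, whose field names follow the slide; it is an assumption for illustration and not Hive's `VectorizedRowBatch`. It shows that when `selectedInUse` is true, only the row positions listed in `selected` are touched.

```java
/** Minimal stand-in for a row batch; field names follow the slide, the class is hypothetical. */
final class MiniBatch {
    final long[][] columns;   // one long[] per column
    int size;                 // number of live rows in this batch
    boolean selectedInUse;    // true when only rows listed in 'selected' are live
    int[] selected;           // indices of the live rows; entries [0, size) are valid

    MiniBatch(long[][] columns, int size) {
        this.columns = columns;
        this.size = size;
    }
}

final class SelectionVectorSketch {

    /** Adds 'scalar' to every live row of inputColumn, writing into outputColumn. */
    static void addScalar(MiniBatch batch, int inputColumn, int outputColumn, long scalar) {
        long[] in = batch.columns[inputColumn];
        long[] out = batch.columns[outputColumn];
        if (batch.selectedInUse) {
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                out[i] = in[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                out[i] = in[i] + scalar;
            }
        }
    }

    /** Simulates a filter that kept only rows 1 and 3 of a four-row batch. */
    static long[] demo() {
        MiniBatch batch = new MiniBatch(new long[][] {{10, 20, 30, 40}, new long[4]}, 2);
        batch.selectedInUse = true;
        batch.selected = new int[] {1, 3};
        addScalar(batch, 0, 1, 5L);
        return batch.columns[1];
    }
}
```

In `demo()`, rows 0 and 2 of the output column keep their initial zero because an earlier filter is assumed to have dropped them; a filter only rewrites the selection vector and never copies the surviving rows.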

Page 9: Overview of the Hive Stinger Initiative

Vectorization and Compilation

• Vectorization “instructions” generated from templates

• Examples:
– Int add col-col
– Int add col-scalar
– Int add scalar-col
– Double add col-col
– Double add col-scalar
– Double add scalar-col
– And hundreds more!

• Pre-compilation of expressions

• Reduces # of function calls and instructions at runtime

• Expressions like (a + 2) / b are interpreted with these primitives
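
As a rough sketch of how an expression such as (a + 2) / b could be evaluated with these primitives, the example below chains a col-scalar add and a col-col divide through a scratch column. The class and method names are hypothetical stand-ins over plain arrays, not Hive's generated classes, and the sketch ignores nulls, selection vectors and division by zero.

```java
/** Toy vectorized primitives over plain arrays; stand-ins, not Hive's classes. */
final class VectorPrimitivesSketch {

    /** out[i] = in[i] + scalar for the first n rows. */
    static void longColAddLongScalar(long[] in, long scalar, long[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = in[i] + scalar;
        }
    }

    /** out[i] = left[i] / right[i], computed as a double, for the first n rows. */
    static void longColDivideLongCol(long[] left, long[] right, double[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = (double) left[i] / right[i];
        }
    }

    /** Evaluates (a + 2) / b by chaining two primitives through a scratch column. */
    static double[] evaluate(long[] a, long[] b) {
        int n = a.length;
        long[] scratch = new long[n];      // holds the intermediate result a + 2
        double[] result = new double[n];
        longColAddLongScalar(a, 2L, scratch, n);
        longColDivideLongCol(scratch, b, result, n);
        return result;
    }
}
```

Each primitive runs a tight loop over the whole batch, so the per-row cost of walking the expression tree is paid once per batch rather than once per row.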

Page 10: Overview of the Hive Stinger Initiative

Example of vectorized template code

} else {
  if (batch.selectedInUse) {
    for(int j = 0; j != n; j++) {
      int i = sel[j];
      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
    }
  } else {
    for(int i = 0; i != n; i++) {
      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
    }
  }
}
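
One way to picture the template step is plain text substitution: the `<OperatorSymbol>` placeholder is replaced with a concrete operator to produce the source for one primitive. The sketch below assumes that simple model; the class, the method names and the operator-name keys are hypothetical, and Hive's real generator handles more placeholders (types, class names) than shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy template expander; hypothetical, not Hive's actual code generator. */
final class TemplateExpansionSketch {

    // Inner loop from the slide's template, with the placeholder left in angle brackets.
    static final String TEMPLATE =
        "for (int i = 0; i != n; i++) {\n"
        + "  outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];\n"
        + "}\n";

    /** Substitutes one operator symbol into the template. */
    static String expand(String operatorSymbol) {
        return TEMPLATE.replace("<OperatorSymbol>", operatorSymbol);
    }

    /** Expands the template once per entry, keyed by an illustrative operator name. */
    static Map<String, String> expandAll(Map<String, String> symbolsByName) {
        Map<String, String> generated = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : symbolsByName.entrySet()) {
            generated.put(entry.getKey(), expand(entry.getValue()));
        }
        return generated;
    }
}
```

Calling `expand("+")` yields the loop with `vector1[i] + vector2[i]`; repeating this across operators and type combinations is how a small set of templates can yield hundreds of primitives.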

Page 11: Overview of the Hive Stinger Initiative

Using vectorization in Hive

• set hive.vectorized.execution.enabled = true;

• Run query over ORC

• Only works for scalar types

• https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution

• ~5X CPU reduction

Page 12: Overview of the Hive Stinger Initiative

Apache Tez (“Speed”)

• Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Lower latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft

• YARN ApplicationMaster to run DAG of Tez Tasks

• Task with pluggable Input, Processor and Output

Tez Task - <Input, Processor, Output>

[Diagram: a Task box containing Input → Processor → Output]

*Courtesy of Hortonworks

Page 13: Overview of the Hive Stinger Initiative

Tez: Building blocks for scalable data processing

• Classical ‘Map’: HDFS Input → Map Processor → Sorted Output

• Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output

• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output

*Courtesy of Hortonworks
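
The idea of a task as an <Input, Processor, Output> triple can be modelled in a few lines. The sketch below is a toy over in-memory lists, assuming generic functional interfaces from `java.util.function`; it does not use or imitate the Tez API, and the class and method names are hypothetical. It wires three tasks into a Map-Reduce-Reduce chain in which only the last stage produces a final result.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

/** Toy model of a task as an <Input, Processor, Output> triple; not the Tez API. */
final class TaskTripleSketch {

    /** Runs one task: read from the input, apply the processor, write to the output. */
    static <I, O> void runTask(Supplier<List<I>> input,
                               Function<List<I>, List<O>> processor,
                               Consumer<List<O>> output) {
        output.accept(processor.apply(input.get()));
    }

    /** Map-Reduce-Reduce over in-memory lists: square, keep even values, then sum. */
    static long mapReduceReduce(List<Integer> source) {
        List<Long> mapOut = new ArrayList<>();
        List<Long> intermediateOut = new ArrayList<>();
        List<Long> finalOut = new ArrayList<>();

        // 'Map': source input -> map processor -> sorted output.
        runTask(() -> source,
                (List<Integer> rows) -> {
                    List<Long> squared = new ArrayList<>();
                    for (int row : rows) {
                        squared.add((long) row * row);
                    }
                    return squared;
                },
                (List<Long> rows) -> {
                    Collections.sort(rows);
                    mapOut.addAll(rows);
                });

        // Intermediate 'Reduce': shuffle input -> reduce processor -> sorted output.
        runTask(() -> mapOut,
                (List<Long> rows) -> {
                    List<Long> evens = new ArrayList<>();
                    for (long row : rows) {
                        if (row % 2 == 0) {
                            evens.add(row);
                        }
                    }
                    return evens;
                },
                (List<Long> rows) -> {
                    Collections.sort(rows);
                    intermediateOut.addAll(rows);
                });

        // Final 'Reduce': shuffle input -> reduce processor -> final output.
        runTask(() -> intermediateOut,
                (List<Long> rows) -> {
                    long sum = 0;
                    for (long row : rows) {
                        sum += row;
                    }
                    List<Long> total = new ArrayList<>();
                    total.add(sum);
                    return total;
                },
                (List<Long> rows) -> finalOut.addAll(rows));

        return finalOut.get(0);
    }
}
```

Because the input and output of each task are pluggable, the same processor code can sit in the middle of a chain or at its end; only the wiring changes.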

Page 14: Overview of the Hive Stinger Initiative

Hive-on-MR vs. Hive-on-Tez

SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a
UNION SELECT x, AVERAGE(y) AS AVG
FROM c GROUP BY x
ORDER BY AVG;

[Diagram, two panels. Hive – MR: the plan runs as several separate MapReduce jobs (M and R stages), with HDFS between consecutive jobs; operator labels: SELECT a.state; JOIN (a, c); SELECT c.price; SELECT b.id; JOIN (a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price). Hive – Tez: the same work runs as a single graph of M and R stages with no HDFS steps shown between them; operator labels: SELECT a.state, c.itemId; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price); SELECT b.id.]

Tez avoids unneeded writes to HDFS

*Courtesy of Hortonworks

Page 15: Overview of the Hive Stinger Initiative

Tez Sessions

… because Map/Reduce query startup is expensive

• Tez Sessions
– Hot containers ready for immediate use
– Removes task and job launch overhead (~5s – 30s)

• Hive
– Session launch/shutdown in background (seamless, user not aware)
– Submits query plan directly to Tez Session

Native Hadoop service, not ad-hoc

*Courtesy of Hortonworks

Page 16: Overview of the Hive Stinger Initiative

Stinger Phase 3: Interactive Query In Hadoop

• TPC-DS Query 27 (Pricing Analytics using Star Schema Join): 190x improvement
– Hive 0.10: 1400s
– Hive 0.11 (Phase 1): 39s
– Trunk (Phase 3): 7.2s

• TPC-DS Query 82 (Inventory Analytics Joining 2 Large Fact Tables): 200x improvement
– Hive 0.10: 3200s
– Hive 0.11 (Phase 1): 65s
– Trunk (Phase 3): 14.9s

All Results at Scale Factor 200 (Approximately 200GB Data)

*Courtesy of Hortonworks
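
The improvement factors quoted on this slide can be checked against the slowest and fastest times listed for each query. Assuming each factor is simply the ratio of those two times, rounded down to a convenient figure:

```latex
\[
\text{Query 27: } \frac{1400\ \text{s}}{7.2\ \text{s}} \approx 194
\qquad
\text{Query 82: } \frac{3200\ \text{s}}{14.9\ \text{s}} \approx 215
\]
```

Both ratios are consistent with the stated 190x and 200x.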

Page 17: Overview of the Hive Stinger Initiative

How you can use Stinger enhancements

• Use Hive 0.13

• Use ORC: create table … stored as ORC

• Enable vectorization: set hive.vectorized.execution.enabled=true

• Enable Tez: set hive.execution.engine=tez

• See http://hortonworks.com/hadoop-tutorial/supercharging-interactive-queries-hive-tez/
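
From a Java client, the settings on this slide could be issued as ordinary statements over JDBC. The sketch below assumes a `java.sql.Connection` that already points at a Hive server accepting `set` commands as statements; the class and method names are hypothetical, and the table and column names passed to `createOrcTable` are placeholders.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

/** Collects the Stinger-related statements from the slide; helper names are hypothetical. */
final class StingerSessionSketch {

    /** Session settings named on the slide, in the order they would be issued. */
    static List<String> sessionSettings() {
        return Arrays.asList(
            "set hive.vectorized.execution.enabled=true",
            "set hive.execution.engine=tez");
    }

    /** Builds DDL for an ORC-backed table; table and column definitions are placeholders. */
    static String createOrcTable(String table, String columnDefinitions) {
        return "create table " + table + " (" + columnDefinitions + ") stored as orc";
    }

    /** Applies the session settings on an open connection (assumed to be a Hive JDBC connection). */
    static void apply(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            for (String setting : sessionSettings()) {
                statement.execute(setting);
            }
        }
    }
}
```

The settings are per session, so `apply` would need to run on each new connection before the queries that should benefit from them.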

Page 18: Overview of the Hive Stinger Initiative

Reference(s)

• Stinger overview, Strata, fall 2013: http://www.slideshare.net/alanfgates/strata-stingertalk-oct2013?qid=09d16028-bd7e-47d8-8438-34f3242c6f0e&v=qf1&b=&from_search=1

Slides marked “Courtesy of Hortonworks” are from Hortonworks talks