Managing multi tenant resource toward Hive 2.0

Kai Sasaki, Treasure Data Inc.

Apr 15, 2017

Page 1: Managing multi tenant resource toward Hive 2.0

Managing multi tenant resource toward Hive 2.0

Kai Sasaki Treasure Data Inc.

Page 2: Managing multi tenant resource toward Hive 2.0

About Me

• Kai Sasaki (佐々木 海)

• @Lewuathe (Twitter)

• Software Engineer at Treasure Data Inc.

• Maintains and develops the Hadoop/Presto infrastructure

Page 3: Managing multi tenant resource toward Hive 2.0

Topics

• Treasure Data infrastructure

• Changes in Hive 2.0

• Migration architecture

• Resource management for multi tenancy

• Performance comparison

Page 4: Managing multi tenant resource toward Hive 2.0

• Live Data Management Platform

• Original creator of Fluentd/Embulk/Digdag

• 70+ integrations with

• BI tools

• Mobile/IoT

• Cloud Storage

• and more

Page 5: Managing multi tenant resource toward Hive 2.0
Page 6: Managing multi tenant resource toward Hive 2.0

• Hive/Pig/Presto data processing interface

• 40000+ Hive queries / day

• 130000+ Presto queries / day

• Plazma Cloud Storage

• 450000+ records/sec imported

Page 7: Managing multi tenant resource toward Hive 2.0

Hive 1.x vs. Hive 2.x

What has changed?

Page 8: Managing multi tenant resource toward Hive 2.0

Hive 2.0

• Includes major new features

• 600+ bugs fixed

• 140+ improvements or new features

• Backward compatible as much as possible

• Hive 1.x continues as the stable line

• 2.1.0 has been available since June 20th, 2016

http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale

Page 9: Managing multi tenant resource toward Hive 2.0

Hive 2.0

• HPL/SQL

• LLAP

• HBase metastore

• Hive on Spark improvements

• CBO improvements

http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale

Page 10: Managing multi tenant resource toward Hive 2.0

HPL/SQL

• Procedural SQL like Oracle’s PL/SQL

• Cursors

• Loops (WHILE, FOR, LOOP)

• Branches (IF)

• External library that communicates through JDBC

• http://www.hplsql.org/doc

http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale

Page 11: Managing multi tenant resource toward Hive 2.0

LLAP

• Sub-second queries in Hive

• Saves JVM container launch time

• Data caching

• Suits ad hoc or interactive use cases

• Beta in 2.0

http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png

Page 12: Managing multi tenant resource toward Hive 2.0

LLAP

• Sub-second queries in Hive

http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png

Page 13: Managing multi tenant resource toward Hive 2.0

HBase metastore

• Uses HBase as the Hive metastore

• Fetching thousands of partitions

• Limitation of concurrent connections

• Will support transactions with Apache Omid

• Alpha in Hive 2.0

http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png

Page 14: Managing multi tenant resource toward Hive 2.0

Many fixes and cutting-edge features

Page 15: Managing multi tenant resource toward Hive 2.0

That’s all?

• Operation cost of migration

• Managing multiple clusters

• Testing and verifying multiple packages

• Differences in configuration and parameters

Page 16: Managing multi tenant resource toward Hive 2.0

That’s all?

• Operation cost of migration

• Managing multiple clusters

• Testing and verifying multiple packages

• Differences in configuration and parameters

• Need to reduce operation cost at the same time

Page 17: Managing multi tenant resource toward Hive 2.0

Now migration

Page 18: Managing multi tenant resource toward Hive 2.0

Challenges

• NO DOWNTIME

• NO HARMFUL OPERATION

• Change package easily

• Separate from other components (Micro service)

• NO DEGRADATION

• Automatic query test and validation

Page 19: Managing multi tenant resource toward Hive 2.0

NO DOWNTIME

• Hadoop cluster Blue-Green deployment

• Reliable queue system separated from Hadoop

→ PerfectQueue

• Reliable storage system separated from Hadoop

→ Plazma
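
A minimal sketch of the blue-green idea, assuming jobs wait in a queue that lives outside Hadoop and a worker submits each one to whichever cluster is currently active. The `Cluster` and `Router` names are hypothetical and are not taken from the Treasure Data code base.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for one Hadoop cluster (for example v1 or v2)."""
    name: str
    finished: list = field(default_factory=list)

    def run(self, job: str) -> None:
        # Record the job as executed on this cluster.
        self.finished.append(job)


class Router:
    """Submits queued jobs to the active cluster; the queue outlives any cluster."""

    def __init__(self, active: Cluster) -> None:
        self.active = active
        self.queue: deque = deque()

    def request(self, job: str) -> None:
        self.queue.append(job)

    def switch(self, new_active: Cluster) -> Cluster:
        """Point new submissions at another cluster and return the old one."""
        old, self.active = self.active, new_active
        return old

    def pull_and_submit(self) -> None:
        # Pull one job from the queue and run it on the active cluster.
        if self.queue:
            self.active.run(self.queue.popleft())


v1, v2 = Cluster("v1"), Cluster("v2")
router = Router(active=v1)
for job in ("q1", "q2", "q3"):
    router.request(job)

router.pull_and_submit()      # q1 runs on v1
retired = router.switch(v2)   # v1 can now be shut down; queued jobs are untouched
router.pull_and_submit()      # q2 runs on v2
router.pull_and_submit()      # q3 runs on v2
```

Because the pending jobs are held by the queue rather than by a cluster, retiring `v1` in this sketch loses nothing that was still waiting.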

Page 20: Managing multi tenant resource toward Hive 2.0

PerfectQueue

• Distributed queue built on top of an RDBMS

• At-least-once semantics

• Graceful and live restarting

• State consistency by transaction

• https://github.com/treasure-data/perfectqueue
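
At-least-once delivery on top of an RDBMS can be illustrated with a lease: a task is handed out with a timeout and deleted only once the work is done, so a task whose worker dies is handed out again. The sketch below uses an in-memory SQLite database with a hypothetical `tasks` table and hypothetical function names; it is not PerfectQueue's actual schema or API, and it uses a single connection, so it ignores the row locking a real multi-worker queue would need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id TEXT PRIMARY KEY, lease_until REAL NOT NULL DEFAULT 0)"
)


def submit(task_id: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO tasks (id) VALUES (?)", (task_id,))


def acquire(now: float, lease_sec: float = 60.0):
    """Lease one task whose previous lease (if any) has expired."""
    with conn:
        row = conn.execute(
            "SELECT id FROM tasks WHERE lease_until <= ? ORDER BY id LIMIT 1", (now,)
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET lease_until = ? WHERE id = ?", (now + lease_sec, row[0])
        )
        return row[0]


def finish(task_id: str) -> None:
    """Delete the task only after the work is done."""
    with conn:
        conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))


submit("job-1")
first = acquire(now=0.0)      # worker A leases job-1, then crashes
blocked = acquire(now=30.0)   # lease still valid, nothing to hand out
retry = acquire(now=61.0)     # lease expired, job-1 is delivered again
finish(retry)
remaining = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
```

The same task can therefore be delivered more than once, which is why consumers of an at-least-once queue generally need to be idempotent.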

Page 21: Managing multi tenant resource toward Hive 2.0

Plazma

• Distributed cloud-based storage

• PostgreSQL + S3/Riak CS

• Enables time-index pushdown for Hive/Pig/Presto

• Column-oriented IO (mpc1)

• Data consistency with a transactional API
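
Time-index pushdown can be pictured as pruning files by their time range before any data is read. The sketch below assumes a metadata store that records the time range covered by each stored file; the `PartitionMeta` type, its fields, and the `prune` function are hypothetical, since the slides do not describe Plazma's actual layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionMeta:
    """Hypothetical metadata row: one object-store file and the time range it covers."""
    path: str
    min_time: int  # inclusive, unix seconds
    max_time: int  # inclusive, unix seconds


def prune(partitions, start: int, end: int):
    """Return paths of files whose range overlaps the half-open interval [start, end)."""
    return [p.path for p in partitions if p.max_time >= start and p.min_time < end]


index = [
    PartitionMeta("part-0", 0, 3599),
    PartitionMeta("part-1", 3600, 7199),
    PartitionMeta("part-2", 7200, 10799),
]

# A query restricted to the second hour only needs to read one file.
to_read = prune(index, start=3600, end=7200)
```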

Page 22: Managing multi tenant resource toward Hive 2.0

[Diagram: Plazma, a node transcribed as "x", PQ, PQ and App; arrow labeled "request"]

Page 23: Managing multi tenant resource toward Hive 2.0

[Diagram: Plazma, a node transcribed as "x", PQ, PQ and App; arrows labeled "request", "pull" and "submit"]

Page 24: Managing multi tenant resource toward Hive 2.0

[Diagram: Plazma, a node transcribed as "x", PQ, PQ and App; arrows labeled "request", "pull", "submit" and "fetch"]

Page 25: Managing multi tenant resource toward Hive 2.0

[Diagram: Plazma, a node transcribed as "x", PQ, PQ and App; arrows labeled "request", "pull", "submit" and "fetch"; annotation "disposable components"]

Page 26: Managing multi tenant resource toward Hive 2.0

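[Diagram: Plazma, a node transcribed as "x", PQ, PQ and App; arrows labeled "request", "pull", "submit" and "fetch"; versions "v1" and "v2" shown side by side]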

Page 27: Managing multi tenant resource toward Hive 2.0

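[Diagram: Plazma, a node transcribed as "x", PQ, PQ and App; arrows labeled "request", "pull", "submit" and "fetch"; versions "v1" and "v2" shown side by side]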

Page 28: Managing multi tenant resource toward Hive 2.0

[Diagram: Plazma, PQ, PQ and App; arrows labeled "request", "pull", "submit" and "fetch"; only version "v2" remains]

Page 29: Managing multi tenant resource toward Hive 2.0

NO HARMFUL OPS

• Automatic package version upgrades

• Chef server specifies the version

• Hadoop package repository

• S3 remote package repository

• Hadoop as a REST service

• elephant-server

Page 30: Managing multi tenant resource toward Hive 2.0

elephant-server

• Hadoop as a REST service

• Pluggable executor

• Hive

• Pig

• Embulk MapReduce executor

• Distributed in-memory queue (Hazelcast)
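
One way to read "pluggable executor" is a registry that maps a job type to the code that runs it, so a new engine can be added without touching the request handler. The sketch below is an illustration under that assumption; `EXECUTORS`, `executor` and `handle_request` are hypothetical names, not the elephant-server API, and the executors only return placeholder strings.

```python
from typing import Callable, Dict

# Registry mapping a job type to the callable that runs it.
EXECUTORS: Dict[str, Callable[[str], str]] = {}


def executor(job_type: str):
    """Decorator that registers a function as the executor for one job type."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        EXECUTORS[job_type] = func
        return func
    return register


@executor("hive")
def run_hive(query: str) -> str:
    # Placeholder: a real executor would submit the query to Hive.
    return f"hive:{query}"


@executor("pig")
def run_pig(script: str) -> str:
    # Placeholder: a real executor would submit the script to Pig.
    return f"pig:{script}"


def handle_request(job_type: str, payload: str) -> str:
    """Dispatch a request body to the executor registered for its type."""
    try:
        run = EXECUTORS[job_type]
    except KeyError:
        raise ValueError(f"no executor registered for {job_type!r}") from None
    return run(payload)


result = handle_request("hive", "SELECT 1")
```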

Page 31: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App and one elephant-server; arrows labeled "request", "pull" and "REST"]

Page 32: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App and three elephant-server instances; arrows labeled "request", "pull" and "REST"]

Page 33: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App and three elephant-server instances connected through hazelcast; arrows labeled "request", "pull" and "REST"]

Page 34: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App and three elephant-server instances connected through hazelcast; arrows labeled "request", "pull" and "REST"; annotation "service discovery"]

Page 35: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App, three elephant-server instances connected through hazelcast, and two nodes transcribed as "x"; arrows labeled "request", "pull" and "REST"; annotation "service discovery"]

Page 36: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App, three elephant-server instances connected through hazelcast, S3, and two nodes transcribed as "x"; arrows labeled "request", "pull" and "REST"; annotations "service discovery" and "package distribution"]

Page 37: Managing multi tenant resource toward Hive 2.0

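[Diagram: PQ, PQ, App, three elephant-server instances connected through hazelcast, S3, and two nodes transcribed as "x"; arrows labeled "request", "pull", "REST", "request", "fetch" and "submit"; annotations "service discovery" and "package distribution"]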

Page 38: Managing multi tenant resource toward Hive 2.0

NO DEGRADATION

• Validation of

• Parameter difference

• Query result difference

• Performance deterioration

• Automatic testing and persistent result tables
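
The three checks listed above can be sketched as three small comparisons between a run on the old version and a run on the new one. The function names and the 1.2 slowdown tolerance are assumptions made for illustration, and the configuration values are examples only; the slides do not describe how Treasure Data implements these checks.

```python
from collections import Counter


def param_diff(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every parameter that differs."""
    keys = old.keys() | new.keys()
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


def same_rows(old_rows, new_rows) -> bool:
    """Compare two result sets as multisets, ignoring row order."""
    return Counter(map(tuple, old_rows)) == Counter(map(tuple, new_rows))


def is_slower(old_sec: float, new_sec: float, tolerance: float = 1.2) -> bool:
    """Flag a run that takes more than `tolerance` times the baseline (assumed threshold)."""
    return new_sec > old_sec * tolerance


# Example configurations for the two versions (illustrative values).
v1_conf = {"hive.execution.engine": "mr"}
v2_conf = {"hive.execution.engine": "tez", "hive.vectorized.execution.enabled": "true"}
changed = param_diff(v1_conf, v2_conf)

rows_match = same_rows([(1, "a"), (2, "b")], [(2, "b"), (1, "a")])
degraded = is_slower(old_sec=100.0, new_sec=90.0)
```

Comparing results as multisets rather than lists avoids false alarms when two engines return the same rows in a different order; queries with an explicit ORDER BY would need a stricter, order-aware comparison.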

Page 39: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App, elephant-server and S3; arrows labeled "request", "pull" and "REST"]

1. Upload parameters and configurations

Page 40: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App, elephant-server, S3 and a node transcribed as "x" marked "v1"; arrows labeled "request", "pull", "REST" and "submit"]

1. Upload parameters and configurations

Page 41: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App, elephant-server, S3, Plazma and a node transcribed as "x" marked "v1"; arrows labeled "request", "pull", "REST" and "submit"]

1. Upload parameters and configurations

2. Upload query results

3. Send metrics

Page 42: Managing multi tenant resource toward Hive 2.0

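[Diagram: PQ, PQ, App and elephant-server; two parallel paths, each with S3, Plazma and a node transcribed as "x", one marked "v1" and one marked "v2"; arrows labeled "request", "pull", "REST" and "submit"]

1. Upload parameters and configurations

2. Upload query results

3. Send metrics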

Page 43: Managing multi tenant resource toward Hive 2.0

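[Diagram: PQ, PQ, App and elephant-server; two parallel paths, each with S3, Plazma and a node transcribed as "x", one marked "v1" and one marked "v2"; arrows labeled "request", "pull", "REST" and "submit"]

1. Upload parameters and configurations

2. Upload query results

3. Send metrics

Verification between persistent result sets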

Page 44: Managing multi tenant resource toward Hive 2.0

Resource management

• Define one resource per account

• Workload type of an account varies

• Batch, ad hoc, BI tool…

• Requires high-level resource management across clusters

• An account can have multiple resource pools

• For service and internal purposes

Page 45: Managing multi tenant resource toward Hive 2.0

[Diagram: a tree starting at "request" that branches into queue1 and queue2; each queue branches into cluster1 and cluster2; each cluster branches into Hadoop queue A and Hadoop queue B]

Page 46: Managing multi tenant resource toward Hive 2.0

[Diagram: a tree starting at "request" that branches into queue1 and queue2; each queue branches into cluster1 and cluster2; each cluster branches into Hadoop queue A and Hadoop queue B]

Enables us to define which resource the request can use
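
Defining which resource a request can use can be sketched as a lookup from an account and pool name to a job queue, a cluster and a Hadoop queue. The `ResourcePool` type, the `POOLS` table and the account and pool names are hypothetical; the queue and cluster labels are borrowed from the diagram purely as sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourcePool:
    """Hypothetical resource definition: where requests using this pool may run."""
    job_queue: str
    cluster: str
    hadoop_queue: str


# One account may own several pools, for example one per workload type.
POOLS = {
    ("account-1", "batch"): ResourcePool("queue1", "cluster1", "Hadoop queue A"),
    ("account-1", "adhoc"): ResourcePool("queue2", "cluster2", "Hadoop queue B"),
}


def resolve(account: str, pool_name: str) -> ResourcePool:
    """Look up the pool a request is allowed to use."""
    try:
        return POOLS[(account, pool_name)]
    except KeyError:
        raise LookupError(f"{account} has no pool named {pool_name!r}") from None


target = resolve("account-1", "adhoc")
```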

Page 47: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App, elephant-server and a node transcribed as "x"; arrows labeled "request" and "REST"]

Page 48: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App, elephant-server, two further PQ nodes and a node transcribed as "x"; arrows labeled "request" and "REST"]

1. Multiple job queues

Page 49: Managing multi tenant resource toward Hive 2.0

[Diagram: PQ, PQ, App, elephant-server, two further PQ nodes and two nodes transcribed as "x"; arrows labeled "request" and "REST"]

1. Multiple job queues

2. Multiple Hadoop clusters

Page 50: Managing multi tenant resource toward Hive 2.0

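[Diagram: PQ, PQ, App, elephant-server, two further PQ nodes and two nodes transcribed as "x", each with queues q1, q2 and q3; arrows labeled "request" and "REST"]

1. Multiple job queues

2. Multiple Hadoop clusters

3. Multiple Hadoop queues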

Page 51: Managing multi tenant resource toward Hive 2.0

Brief performance comparison

Page 52: Managing multi tenant resource toward Hive 2.0

[Bar chart: COUNT on 130GB+ 70B+ records; y-axis: elapsed time (sec), 0 to 800; series: Hive 1.x + MapReduce, Hive 2.x + Tez + Vectorization; bar values not recoverable from the transcript]

Page 53: Managing multi tenant resource toward Hive 2.0

[Bar chart: GROUP BY on 130GB+ 70B+ records; y-axis: elapsed time (sec), 0 to 1000; series: Hive 1.x + MapReduce, Hive 2.x + Tez + Vectorization; bar values not recoverable from the transcript]

Page 54: Managing multi tenant resource toward Hive 2.0

[Bar chart: JOIN on 130GB+ 70B+ records; y-axis: elapsed time (sec), 0 to 1100; series: Hive 1.x + MapReduce, Hive 2.x + Tez + Vectorization; bar values not recoverable from the transcript]

Page 55: Managing multi tenant resource toward Hive 2.0

Recap

• Hadoop architecture in Treasure Data for Hive 2.0 and beyond

• Resource management for multi tenancy

Page 56: Managing multi tenant resource toward Hive 2.0

We’re hiring!