Managing multi tenant resource toward Hive 2.0

Kai Sasaki Treasure Data Inc.

About Me

• Kai Sasaki (佐々木海)

• @Lewuathe (Twitter)

• Software Engineer at Treasure Data Inc.

• Maintains and develops Hadoop/Presto infrastructure

Topic

• Treasure Data infrastructure

• Hive 2.0 change

• Migration architecture

• Resource management for multi tenancy

• Performance comparison

• Live Data Management Platform

• Original creator of Fluentd/Embulk/Digdag

• 70+ integrations with

• BI tools

• Mobile/IoT

• Cloud Storage

• and more

• Hive/Pig/Presto data processing interface

• 40000+ Hive queries / day

• 130000+ Presto queries / day

• Plazma Cloud Storage

• 450000+ records/sec imported

Hive 1.x → Hive 2.x

Any change?

Hive 2.0

• Includes major new features

• Fixed 600+ bugs

• 140+ improvements or new features

• Backward compatible as much as possible

• Hive 1.x stable line

• 2.1.0 is available from June 20th, 2016

http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale

Hive 2.0

• HPLSQL

• LLAP

• HBase metastore

• Improvements of Hive on Spark

• CBO improvements

HPLSQL

• Procedural SQL like Oracle’s PL/SQL

• Cursor

• loops (WHILE, FOR, LOOP)

• branches (IF)

• External library which communicates through JDBC

• http://www.hplsql.org/doc

LLAP

• Sub-second queries in Hive

• Save JVM container launch time

• Data caching

• Suited to ad hoc or interactive use cases

• Beta in 2.0

http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png

HBase metastore

• Use HBase as the Hive metastore

• Fetching thousands of partitions

• Limitation of concurrent connection

• Will support transaction with Apache Omid

• Alpha in Hive 2.0

Many fixes and cutting-edge features

That’s all?

• Operation cost of migration

• Managing multiple clusters

• Testing and verifying multiple packages

• Differences in configuration and parameters

• Need to reduce operation cost at the same time

Now migration

Challenge

• NO DOWNTIME

• NO HARMFUL OPERATION

• Change package easily

• Separate from other components (Micro service)

• NO DEGRADATION

• Automatic query test and validation

NO DOWNTIME

• Hadoop cluster Blue-Green deployment

• Reliable queue system separated from Hadoop

→ PerfectQueue

• Reliable storage system separated from Hadoop

→ Plazma

PerfectQueue

• Distributed queue built on top of RDBMS

• At-least-once semantics

• Graceful and live restarting

• State consistency by transaction

• https://github.com/treasure-data/perfectqueue
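The properties above can be sketched in a few lines. This is not PerfectQueue's actual schema or API: the `tasks` table, the `submit`/`acquire`/`finish` functions, and the 60-second lease timeout are all assumptions made for illustration, and SQLite stands in for the production RDBMS. The point is only to show how a lease with a timeout, updated inside a transaction, gives at-least-once delivery.

```python
import sqlite3

def open_queue():
    # In-memory SQLite stands in for the RDBMS (an assumption of this sketch).
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, payload TEXT, visible_at REAL)"
    )
    return db

def submit(db, task_id, payload, now):
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO tasks VALUES (?, ?, ?)", (task_id, payload, now))

def acquire(db, now, timeout=60.0):
    """Lease the oldest visible task by moving its visible_at into the future."""
    with db:
        # A multi-worker RDBMS setup would also need row locking here.
        row = db.execute(
            "SELECT id, payload FROM tasks WHERE visible_at <= ? "
            "ORDER BY visible_at, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        db.execute(
            "UPDATE tasks SET visible_at = ? WHERE id = ?", (now + timeout, row[0])
        )
        return row

def finish(db, task_id):
    # Only an explicit finish removes the task; a crashed worker never calls it.
    with db:
        db.execute("DELETE FROM tasks WHERE id = ?", (task_id,))

db = open_queue()
submit(db, "job-1", "SELECT 1", now=0.0)
first = acquire(db, now=0.0)    # a worker leases the task
hidden = acquire(db, now=10.0)  # lease still held: nothing is visible
retry = acquire(db, now=61.0)   # lease expired without finish: task is redelivered
finish(db, "job-1")
```

Because a task is deleted only on `finish`, a worker that dies mid-job causes a redelivery rather than a lost job, which is why consumers of such a queue have to tolerate duplicates.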

Plazma

• Distributed cloud-based storage

• PostgreSQL + S3/Riak CS

• Enable time-index push down for Hive/Pig/Presto

• Column-oriented IO (mpc1)

• Data consistency with transactional API
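As a rough illustration of what time-index push down buys, the sketch below prunes partitions by time range before any data is read. The partition layout (half-open `(start, end)` ranges in Unix seconds) and the `prune` function are assumptions for this example, not Plazma's actual metadata model.

```python
def prune(partitions, start, end):
    """Keep partitions whose [lo, hi) range overlaps the query range [start, end)."""
    return [(lo, hi) for (lo, hi) in partitions if lo < end and hi > start]

# Three hypothetical one-hour partitions.
partitions = [(0, 3600), (3600, 7200), (7200, 10800)]

one_hour = prune(partitions, 3600, 7200)   # exactly the middle partition
straddle = prune(partitions, 3000, 4000)   # overlaps the first two partitions
```

With the time predicate pushed down to the storage layer, the query engine only opens the files behind the surviving partitions.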

[Diagram, built up over several slides: request flow between App, PQ (PerfectQueue), and Plazma, with arrows labeled "request", "submit", and "fetch". The components between PQ and Plazma are marked "disposable components"; one step is labeled "v1".]

NO HARMFUL OPS

• Automatic package version upgrades

• Chef server specifies the version

• Hadoop package repository

• S3 remote package repository

• Hadoop as a REST service

• elephant-server

elephant-server

• Hadoop as a REST service

• Pluggable executor

• Hive

• Pig

• Embulk MapReduce executor

• Distributed on-memory queue (Hazelcast)
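A pluggable executor can be pictured as a registry keyed by job type. The sketch below is hypothetical: `Executor`, `register`, `EXECUTORS`, and `dispatch` are names invented for illustration and say nothing about how elephant-server is actually implemented. The commands are built as argument lists only and are never run.

```python
from abc import ABC, abstractmethod

class Executor(ABC):
    """Hypothetical plug-in interface: turn a job description into a command."""

    @abstractmethod
    def build_command(self, job):
        ...

EXECUTORS = {}

def register(job_type):
    # Class decorator that adds an executor instance to the registry.
    def wrap(cls):
        EXECUTORS[job_type] = cls()
        return cls
    return wrap

@register("hive")
class HiveExecutor(Executor):
    def build_command(self, job):
        return ["hive", "-e", job["query"]]

@register("pig")
class PigExecutor(Executor):
    def build_command(self, job):
        return ["pig", "-e", job["query"]]

def dispatch(job):
    """Look up the executor for job['type'] and build its command."""
    try:
        executor = EXECUTORS[job["type"]]
    except KeyError:
        raise ValueError(f"no executor registered for {job['type']!r}") from None
    return executor.build_command(job)

command = dispatch({"type": "hive", "query": "SELECT 1"})
```

Adding a new engine then means registering one more class, without touching the REST layer that receives the jobs.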

[Diagram, built up over several slides: App and PQ (PerfectQueue) on the request side; elephant-server instances connected by "pull" and "REST" arrows and linked to each other through Hazelcast. Later steps add "service discovery", "package distribution", and "request", "fetch", "submit" arrows.]

NO DEGRADATION

• Validation in

• Parameter difference

• Query result difference

• Performance deterioration

• Automatic testing and persistent result tables
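Two of the checks above, query result difference and performance deterioration, can be sketched as follows. The function names, the row representation, and the 1.2x tolerance are assumptions for illustration; the slides do not describe the actual comparison logic.

```python
from collections import Counter

def diff_results(rows_old, rows_new):
    """Compare two result sets as multisets, so row order does not matter."""
    old = Counter(map(tuple, rows_old))
    new = Counter(map(tuple, rows_new))
    return {
        "only_old": list((old - new).elements()),
        "only_new": list((new - old).elements()),
    }

def is_regression(elapsed_old, elapsed_new, tolerance=1.2):
    """Flag a query whose new runtime exceeds the old one by the tolerance factor."""
    return elapsed_new > elapsed_old * tolerance

same = diff_results([(1, "a"), (2, "b")], [(2, "b"), (1, "a")])
dup = diff_results([(1,), (1,)], [(1,)])
```

Comparing as multisets matters because queries without ORDER BY may legitimately return rows in a different order on a different execution engine, while a missing or duplicated row is still caught.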

[Diagram, built up over several slides: App, PQ (PerfectQueue), and elephant-server connected by "request", "pull", "REST", and "submit" arrows, with Plazma and S3 as storage. Numbered steps:]

1. upload param and configurations

2. upload query result

3. send metrics

[Final step, labeled "v2": verification between persistent result sets]

Resource management

• Define 1 resource per account

• Workload type of an account varies

• Batch, Adhoc, BI tool…

• Require high level resource management across clusters

• An account can have multiple resource pools

• For service and internal purpose

[Diagram: a request passes through job queues (queue1, queue2) to Hadoop clusters (cluster1, cluster2), each of which has Hadoop queue A and Hadoop queue B.]

Enables us to define which resource the request can use

[Diagram, built up over several slides: App, PQ (PerfectQueue), request, and elephant-server, annotated with three levels of resource selection:]

1. multiple job queues

2. multiple Hadoop clusters

3. multiple Hadoop queues
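The three levels can be read as a routing table from (account, workload type) to a job queue, a Hadoop cluster, and a Hadoop queue. The sketch below is a hypothetical model: `ResourcePool`, `POOLS`, `route`, and the account names are invented for illustration, while the queue and cluster labels are borrowed from the diagrams.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourcePool:
    job_queue: str      # level 1: which job queue the request enters
    cluster: str        # level 2: which Hadoop cluster runs it
    hadoop_queue: str   # level 3: which scheduler queue inside that cluster

# One account may own several pools, for example one per workload type.
POOLS = {
    ("account-1", "batch"): ResourcePool("queue1", "cluster1", "A"),
    ("account-1", "adhoc"): ResourcePool("queue2", "cluster2", "B"),
    ("account-2", "batch"): ResourcePool("queue1", "cluster2", "A"),
}

def route(account, workload):
    """Return the resource pool a request from this account and workload may use."""
    try:
        return POOLS[(account, workload)]
    except KeyError:
        raise LookupError(f"no resource pool for {account}/{workload}") from None

pool = route("account-1", "adhoc")
```

Keeping the mapping outside the clusters themselves is what lets a request be moved to another cluster or queue by changing one table entry.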

Brief performance comparison

[Chart: GROUP BY query over 130GB+ / 70B+ records, Hive 1.x + MapReduce vs. Hive 2.x + Tez + Vectorization]

• Hadoop architecture in Treasure Data for Hive 2.0 and beyond

• Resource management for multi tenancy

We’re hiring!
