Managing multi tenant resource toward Hive 2.0
Post on 15-Apr-2017
Managing multi tenant resource toward Hive 2.0
Kai Sasaki Treasure Data Inc.
About Me
• Kai Sasaki (佐々木 海)
• @Lewuathe (Twitter)
• Software Engineer at Treasure Data Inc.
• Maintaining and developing Hadoop/Presto infrastructure
Topic
• Treasure Data infrastructure
• Hive 2.0 change
• Migration architecture
• Resource management for multi tenancy
• Performance comparison
• Live Data Management Platform
• Original creator of Fluentd/Embulk/Digdag
• 70+ integrations with
• BI tools
• Mobile/IoT
• Cloud Storage
• and more
• Hive/Pig/Presto data processing interface
• 40000+ Hive queries / day
• 130000+ Presto queries / day
• Plazma Cloud Storage
• 450000+ records/sec imported
Hive 1.x Hive 2.x
Any change?
Hive 2.0
• Includes major new features
• Fixed 600+ bugs
• 140+ improvements or new features
• Backward compatible as much as possible
• Hive 1.x stable line
• 2.1.0 is available from June 20th, 2016
http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale
Hive 2.0
• HPLSQL
• LLAP
• HBase metastore
• Improvements of Hive on Spark
• CBO improvements
http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale
HPLSQL
• Procedural SQL like Oracle’s PL/SQL
• Cursor
• loops (WHILE, FOR, LOOP)
• branches (IF)
• External library which communicates through JDBC
• http://www.hplsql.org/doc
http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale
LLAP
• Sub-second Queries in Hive
• Save JVM container launch time
• Data caching
• Suited to ad hoc or interactive use cases
• Beta in 2.0
http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png
HBase metastore
• Use HBase as metastore of Hive
• Fetching thousands of partitions
• Limitation of concurrent connection
• Will support transactions with Apache Omid
• Alpha in Hive 2.0
http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png
Many fixes and cutting-edge features
That’s all?
• Operation cost of migration
• Managing multiple clusters
• Testing and verifying multiple packages
• Differences in configuration and parameters
• Need to reduce operation cost at the same time
Now migration
Challenge
• NO DOWNTIME
• NO HARMFUL OPERATION
• Change package easily
• Separate from other components (microservice)
• NO DEGRADATION
• Automatic query test and validation
NO DOWNTIME
• Hadoop cluster Blue-Green deployment
• Reliable queue system separated from Hadoop
→ PerfectQueue
• Reliable storage system separated from Hadoop
→ Plazma
PerfectQueue
• Distributed queue built on top of RDBMS
• At-least-once semantics
• Graceful and live restarting
• State consistency by transaction
• https://github.com/treasure-data/perfectqueue
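The at-least-once behavior of a queue built on an RDBMS can be illustrated with a small sketch. This is not PerfectQueue's schema or API; the `SqlQueue` class, the `tasks` table, and the lease timeout are all hypothetical, and SQLite stands in for the database. A task that is acquired but never finished becomes visible again once its lease expires, so it is delivered at least once.

```python
import sqlite3


class SqlQueue:
    """Toy RDBMS-backed job queue with at-least-once delivery (illustrative only)."""

    def __init__(self, conn, lease_seconds=60):
        self.conn = conn
        self.lease_seconds = lease_seconds
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            "id TEXT PRIMARY KEY, payload TEXT NOT NULL, "
            "visible_at REAL NOT NULL, finished INTEGER NOT NULL DEFAULT 0)"
        )

    def submit(self, task_id, payload, now):
        # The primary key makes resubmitting the same task id a no-op.
        with self.conn:
            self.conn.execute(
                "INSERT OR IGNORE INTO tasks (id, payload, visible_at) VALUES (?, ?, ?)",
                (task_id, payload, now),
            )

    def acquire(self, now):
        # Pick the oldest visible task and hide it until the lease expires.
        # A multi-worker setup would also need row locking; this sketch
        # assumes a single connection.
        with self.conn:
            row = self.conn.execute(
                "SELECT id, payload FROM tasks "
                "WHERE finished = 0 AND visible_at <= ? "
                "ORDER BY visible_at, id LIMIT 1",
                (now,),
            ).fetchone()
            if row is None:
                return None
            self.conn.execute(
                "UPDATE tasks SET visible_at = ? WHERE id = ?",
                (now + self.lease_seconds, row[0]),
            )
            return row

    def finish(self, task_id):
        # Only an explicit finish removes the task from future deliveries.
        with self.conn:
            self.conn.execute("UPDATE tasks SET finished = 1 WHERE id = ?", (task_id,))
```

Because a crashed worker never calls `finish`, the same task is handed out again after the lease, which is why consumers of such a queue need to tolerate duplicate execution.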
Plazma
• Distributed cloud-based storage
• PostgreSQL + S3/Riak CS
• Enable time-index push down for Hive/Pig/Presto
• Column-oriented IO (mpc1)
• Data consistency with transactional API
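The idea behind time-index push down can be sketched as follows: if each stored partition records the time range it covers, a query with a time filter only needs to read the partitions that overlap that range. The `Partition` type, its fields, and `prune_by_time` are assumptions made for illustration and do not describe Plazma's actual metadata layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    path: str   # hypothetical object-store key
    start: int  # inclusive, unix seconds
    end: int    # exclusive, unix seconds


def prune_by_time(partitions, query_start, query_end):
    """Keep only partitions whose [start, end) overlaps [query_start, query_end)."""
    return [
        p for p in partitions
        if p.start < query_end and query_start < p.end
    ]
```

For hourly partitions, a query over a single hour selects exactly one partition and skips the rest before any data is fetched.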
[Diagram, built up over several slides. Labels: App, request, PQ, pull, submit, fetch, Plazma. The Hadoop clusters are marked as disposable components; v2 is added next to v1, and in the last frame only v2 remains.]
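The blue-green switch between cluster versions can be sketched as a small router. This is a minimal illustration under assumed names (`BlueGreenRouter`, `switch_to`, `disposable`), not the mechanism used in production: new jobs go to the active cluster, jobs already running on the old cluster are left to finish, and the old cluster becomes disposable once it is idle.

```python
from collections import Counter


class BlueGreenRouter:
    """Tracks which cluster receives new jobs (names are illustrative)."""

    def __init__(self, active):
        self.active = active
        self.running = Counter()  # cluster name -> jobs still running

    def submit(self):
        # New jobs always go to the currently active cluster.
        self.running[self.active] += 1
        return self.active

    def finish(self, cluster):
        self.running[cluster] -= 1

    def switch_to(self, new_cluster):
        # Flip traffic; jobs already running on the old cluster keep going.
        self.active = new_cluster

    def disposable(self):
        # Inactive clusters with no running jobs can be torn down.
        return sorted(
            c for c, n in self.running.items()
            if c != self.active and n == 0
        )
```

Because the queue and the storage live outside Hadoop, tearing down the old cluster in this way loses neither pending jobs nor data.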
NO HARMFUL OPS
• Automatic package version upgrade
• Chef server specifies the version
• Hadoop package repository
• S3 remote package repository
• Hadoop as a REST service
• elephant-server
elephant-server
• Hadoop as REST service
• Pluggable executor
• Hive
• Pig
• Embulk MapReduce executor
• Distributed in-memory queue (Hazelcast)
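A pluggable executor can be sketched as a registry that maps a job type to a handler. Everything here (`EXECUTORS`, the `executor` decorator, `handle_request`) is a hypothetical illustration of the pattern, not elephant-server's actual interface, and the handlers return a string instead of launching a job.

```python
from typing import Callable, Dict

# Registry of job type -> executor function (hypothetical).
EXECUTORS: Dict[str, Callable[[str], str]] = {}


def executor(job_type):
    """Decorator that registers a function as the executor for job_type."""
    def register(fn):
        EXECUTORS[job_type] = fn
        return fn
    return register


@executor("hive")
def run_hive(query):
    # Placeholder: a real executor would submit the query to Hadoop.
    return f"hive: {query}"


@executor("pig")
def run_pig(script):
    # Placeholder: a real executor would submit the script to Hadoop.
    return f"pig: {script}"


def handle_request(job_type, body):
    """Dispatch a request body to the executor registered for its type."""
    try:
        run = EXECUTORS[job_type]
    except KeyError:
        raise ValueError(f"no executor registered for {job_type!r}") from None
    return run(body)
```

Adding another executor then only requires registering one more function; the request handling path stays unchanged.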
[Diagram, built up over several slides. Labels: App, request, PQ, pull, REST, elephant-server (multiple instances), hazelcast, service discovery, package distribution (S3), and fetch/submit.]
NO DEGRADATION
• Validation of
• Parameter differences
• Query result differences
• Performance deterioration
• Automatic testing and persistent result tables
[Diagram, built up over several slides. Labels: App, request, PQ, pull, REST, elephant-server, submit, v1, v2, S3, Plazma. Steps: 1. upload parameters and configurations, 2. upload query result, 3. send metrics. Final frame: verification between the persistent result sets of v1 and v2.]
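The three validations (parameter difference, query result difference, performance deterioration) can be sketched as simple comparisons between a v1 run and a v2 run. The function names, the order-insensitive row comparison, and the 10% tolerance are assumptions chosen for illustration, not the checks actually used.

```python
def diff_params(old, new):
    """Return {key: (old_value, new_value)} for every setting that differs."""
    keys = set(old) | set(new)
    return {
        k: (old.get(k), new.get(k))
        for k in sorted(keys)
        if old.get(k) != new.get(k)
    }


def same_result(rows_a, rows_b):
    """Compare two result sets ignoring row order (rows are sortable tuples)."""
    return sorted(rows_a) == sorted(rows_b)


def is_slower(old_sec, new_sec, tolerance=0.10):
    """Flag a regression when the new run exceeds the old one by more than tolerance."""
    return new_sec > old_sec * (1 + tolerance)
```

For example, `diff_params({"a": "1", "b": "2"}, {"a": "1", "b": "3"})` reports only `b`, and a setting present on one side only shows up with `None` for the other side.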
Resource management
• Define one resource per account
• Workload type of an account varies
• Batch, ad hoc, BI tools…
• Requires high-level resource management across clusters
• An account can have multiple resource pools
• For service and internal purposes
[Diagram: a request is routed through job queues (queue1, queue2) to clusters (cluster1, cluster2), and within each cluster to Hadoop queue A or Hadoop queue B.]
Enables us to define which resource the request can use
[Diagram, built up over several slides. Labels: App, request, PQ, REST, elephant-server, q1, q2, q3. Annotations: 1. multiple job queues, 2. multiple Hadoop clusters, 3. multiple Hadoop queues.]
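The three routing levels (job queue, Hadoop cluster, Hadoop queue) can be sketched as a lookup from an account and its workload type to a resource pool. The `ResourcePool` type, the account names, and the fallback pool are hypothetical; the sketch only shows how one account can own several pools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourcePool:
    job_queue: str     # which job queue receives the request
    cluster: str       # which Hadoop cluster runs it
    hadoop_queue: str  # which Hadoop queue inside that cluster


# Hypothetical assignments: one account, two pools for different workloads.
POOLS = {
    ("account-1", "batch"): ResourcePool("queue1", "cluster1", "A"),
    ("account-1", "adhoc"): ResourcePool("queue2", "cluster2", "B"),
}
DEFAULT_POOL = ResourcePool("queue1", "cluster1", "B")


def route(account, workload):
    """Return the pool for this account and workload, or the default pool."""
    return POOLS.get((account, workload), DEFAULT_POOL)
```

Keeping the mapping outside any single cluster is what allows a request to be moved between clusters without the account noticing.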
Brief performance comparison
[Charts: elapsed time (sec) for COUNT, GROUP BY, and JOIN queries on 130GB+, 70B+ records, comparing Hive 1.x + MapReduce with Hive 2.x + Tez + Vectorization.]
Recap
• Hadoop architecture in Treasure Data for Hive 2.0 and beyond
• Resource management for multi tenancy
We’re hiring!