© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
Apache Tez
Piotr Krewski, Adam Kawa
Apache Tez
■ Efficient execution engine
● Faster than MapReduce
■ Can be leveraged by existing frameworks, e.g. Hive, Pig, Scalding
● SET hive.execution.engine=[tez,mr,spark]
■ Built atop Hadoop YARN
Some Advantages Of Tez
■ Natural DAG
● No intermediate data written to HDFS (replication 3x)
● No need for “empty” map tasks to reshuffle data
● No time spent in a queue to start the next MapReduce job
Simple Comparison
■ Three real-world queries
■ Real production datasets
● Stored in Avro and ORC formats
■ +900-node cluster (thanks, Spotify!)
● Queries run in a queue with limited capacity
■ Hive 0.14 and Tez 0.5 (version from April 2014)
Top Three Users
■ Find top 3 users with largest number of streams
SELECT user_id, count(*) AS cnt
FROM stream
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 3
■ The pattern is GROUP BY and ORDER BY and LIMIT
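The same pattern can be sketched outside Hive to show what each clause contributes. This is only an illustration on made-up in-memory data (the `streams` list and the `top_users` helper are hypothetical, not part of the original benchmark): counting per key corresponds to GROUP BY, sorting by count to ORDER BY cnt DESC, and truncating to LIMIT.

```python
from collections import Counter

# Hypothetical sample of stream events; only user_id matters here.
streams = [
    {"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u1"},
    {"user_id": "u3"}, {"user_id": "u1"}, {"user_id": "u2"},
    {"user_id": "u4"},
]

def top_users(events, n=3):
    """Count events per user (GROUP BY user_id), order by count
    descending (ORDER BY cnt DESC) and keep n rows (LIMIT n)."""
    counts = Counter(event["user_id"] for event in events)
    return counts.most_common(n)

result = top_users(streams)
print(result)
```

On a cluster the counting step and the global sort are separate shuffles, which is why the query needs two reduce stages.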
Top Three Users
                      Hive on MapReduce on Avro   Hive on Tez on Avro
Plan                  2 MapReduce jobs            Map => Reduce => Reduce
Wallclock Time (sec)  353                         197
Improvement           -                           1.8x
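The improvement figure is the ratio of the two wallclock times. A minimal check, using only the numbers reported on this slide (the `speedup` helper is a name introduced here for illustration):

```python
def speedup(baseline_sec, candidate_sec):
    """How many times faster the candidate is than the baseline."""
    return baseline_sec / candidate_sec

mr_sec, tez_sec = 353, 197  # wallclock times from the slide
ratio = speedup(mr_sec, tez_sec)
print(f"{ratio:.1f}x")  # 1.8x
```

The improvement figures on the other comparison slides can be checked the same way.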
Top Three Users - On A Busier Cluster
                      Hive on MapReduce on Avro   Hive on Tez on Avro
Wallclock Time (sec)  576                         183
Improvement           -                           3.14x
Console Output
……
Query ID = kawaa_20141130185757_3e4bd581-23bb-4d7c-b755-044c4a5783b5
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1414118456795_314710)
Map 1: -/- Reducer 2: 0/5 Reducer 3: 0/1
Map 1: 0/36 Reducer 2: 0/5 Reducer 3: 0/1
Map 1: 0/36 Reducer 2: 0/5 Reducer 3: 0/1
Map 1: 0/36 Reducer 2: 0/5 Reducer 3: 0/1
……
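Each progress line lists the vertices of the single Tez job (Map 1, Reducer 2, Reducer 3) with what are presumably completed tasks over total tasks, where "-" means the number is not known yet. A small sketch for turning such a line into a dictionary, assuming the format is exactly as in the excerpt above (other Hive or Tez versions may print it differently; `parse_progress` is a hypothetical helper, not a Hive API):

```python
import re

# Matches fragments such as "Map 1: 0/36" or "Reducer 2: -/-".
PROGRESS = re.compile(r"(Map|Reducer) (\d+): (-|\d+)/(-|\d+)")

def parse_progress(line):
    """Return {vertex_name: (done, total)}; '-' becomes None."""
    def to_int(text):
        return None if text == "-" else int(text)
    return {
        f"{kind} {num}": (to_int(done), to_int(total))
        for kind, num, done, total in PROGRESS.findall(line)
    }

progress = parse_progress("Map 1: 0/36 Reducer 2: 0/5 Reducer 3: 0/1")
print(progress)
```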
Some Advantages Of Tez
■ Container reuse
● Less time spent negotiating with the Resource Manager
● Smaller tasks can be started, so fewer stragglers
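A toy model can make the first point concrete. It is a sketch under strong assumptions, not a description of how Tez schedules work: tasks run in equal-length waves, every container start costs a fixed overhead, and all numbers below are made up for illustration.

```python
import math

def wallclock(tasks, slots, task_sec, startup_sec, reuse):
    """Estimated wallclock time for running `tasks` on `slots` containers.

    Without reuse every wave pays the container startup overhead again;
    with reuse it is paid once and the containers are kept for later waves.
    """
    waves = math.ceil(tasks / slots)
    if reuse:
        return startup_sec + waves * task_sec
    return waves * (startup_sec + task_sec)

# Hypothetical numbers: 36 tasks, 12 slots, 20 s per task, 5 s startup.
no_reuse = wallclock(36, 12, task_sec=20, startup_sec=5, reuse=False)
with_reuse = wallclock(36, 12, task_sec=20, startup_sec=5, reuse=True)
print(no_reuse, with_reuse)  # 75 65
```

In this model the saving grows with the number of waves, so it matters most when tasks are small and numerous.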
Top Ten Countries
■ Find top 10 countries with largest number of streams
SELECT country, count(*) AS cnt
FROM stream
JOIN user ON stream.user_id = user.id
GROUP BY country
ORDER BY cnt DESC
LIMIT 10
■ The pattern is JOIN ON and GROUP BY and ORDER BY and LIMIT
Top Ten Countries
                      Hive on MapReduce on Avro   Hive on Tez on Avro                        Hive on Tez on ORC Snappy
Plan                  3 MapReduce jobs            Map => Map => Reduce => Reduce => Reduce   Map => Map => Reduce => Reduce => Reduce
Wallclock Time (sec)  636                         268                                        203
Improvement           -                           2.4x                                       3.1x
The Biggest Polish Fan of Timbuktu
■ Find the biggest Polish fan of Timbuktu (a popular Swedish rap/reggae artist)
SELECT user_id, count(*) AS cnt
FROM stream
JOIN user ON stream.user_id = user.id
JOIN track ON stream.track_id = track.id
WHERE ...
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 1
The Biggest Polish Fan of Timbuktu
                      Hive on MapReduce on ORC ZLIB   Hive on Tez on ORC ZLIB                    Hive on Tez on ORC Snappy
Plan                  6 MapReduce jobs                Map => Map => Map => Reduce => Reduce      Map => Map => Map => Reduce => Reduce
Wallclock Time (sec)  519                             259                                        209
Improvement           -                               2x                                         2.5x
The Biggest Polish Fan of Timbuktu
■ We also ran this query on a 1.5-year-long production dataset
● +25 TB of data
● 690 nodes
■ Benefits (after optimizations)
● 6+ hours with Hive on MapReduce and Avro Deflate
● 10min 11sec with Hive on Tez and ORC Zlib
■ Features used
● Container reuse
● Broadcast JOIN
● Warm containers
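For scale, the two runtimes can be turned into a ratio. Because the slide only says "6+ hours", the result is a lower bound, and since both the engine and the file format differ between the two runs, it cannot be attributed to Tez alone. A minimal sketch using just those two figures:

```python
from datetime import timedelta

mr_time = timedelta(hours=6)                 # lower bound: "6+ hours"
tez_time = timedelta(minutes=10, seconds=11)

# Dividing two timedeltas yields a plain float ratio.
min_speedup = mr_time / tez_time
print(f"at least {min_speedup:.0f}x faster")
```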
Summary
■ Very fast and smart
● Out of the box performance for small and large queries
■ Very good at scale
● Tested by Yahoo!
■ Not memory-hungry
● Great for large datasets and multi-tenancy
■ Well integrated with YARN
■ No pain deployment and maintenance
● No daemons - build Tez jars and upload them to HDFS
■ Gives you a powerful and effortless option
● Switch execution mode between MR, Tez or Spark using simple configuration settings
Q&A
Thanks!
About GetInData
■ Data-processing challenges addressed with passion and experience
■ +4 years with Apache Hadoop and Big Data technologies