3. Apache Tez Introduction - Apache Kylin Meetup @Shanghai
Post on 09-Jan-2017
© Hortonworks Inc. 2015 Page 1
Apache Tez: Next-Generation Execution Engine on Hadoop
Jeff Zhang (@zjffdu)
Who’s this guy
• Started using Pig in 2009; became a Pig committer in Nov 2009
• Joined Hortonworks in 2014
• Tez committer since Oct 2014
Agenda
• Tez Introduction
• Tez Feature Deep Dive
• Tez Status & Roadmap
Example of MR versus Tez
• Hive on MR: separate jobs with I/O synchronization barriers
– Job 1 (Join a & b)
– Job 2 (Group by of a Join b)
– Job 3 (Group by of c)
– Job 4 (Join of S & R)
• Hive on Tez: single job
– Join a & b
– Group by of a Join b
– Group by of c
– Join of S & R
Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph (DAG).
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache project and Apache licensed.
What is DAG & Why DAG
• Directed Acyclic Graph
• Any complicated DAG can be composed of the following 3 basic paradigms
– Sequential (Projection, Filter, GroupBy, …)
– Merge (Join, Union, Intersect, …)
– Divide (Split, …)
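The three paradigms can be sketched in a few lines of Python. This is an illustrative model only, not the Tez API: the vertex names and the `topological_order` helper are assumptions made for the example. It shows a small graph that contains a sequential step, a merge, and a divide, and derives one valid execution order from it.

```python
from collections import deque

# Hypothetical DAG: each key is a vertex, each value lists its downstream vertices.
dag = {
    "load_a": ["filter_a"],              # Sequential: load_a -> filter_a
    "filter_a": ["join"],
    "load_b": ["join"],                  # Merge: filter_a and load_b both feed join
    "join": ["group_by", "store_raw"],   # Divide: join feeds two consumers
    "group_by": [],
    "store_raw": [],
}


def topological_order(graph):
    """Return the vertices so that each one comes after all of its inputs (Kahn's algorithm)."""
    in_degree = {v: 0 for v in graph}
    for targets in graph.values():
        for t in targets:
            in_degree[t] += 1
    ready = deque(v for v, d in in_degree.items() if d == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for t in graph[v]:
            in_degree[t] -= 1
            if in_degree[t] == 0:
                ready.append(t)
    if len(order) != len(graph):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


order = topological_order(dag)
```

Because the graph is acyclic, an order always exists in which every vertex runs only after its inputs are available; a cyclic graph is rejected.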
Expressing DAG in Tez API
• DAG API (Logic View)
– Allows the user to build the DAG
– Topological structure of the data computation flow
• Runtime API (Runtime View)
– Application logic of each computation unit (vertex)
– How to move/read/write data between vertices
DAG API (Logic View)
• Vertex (Processor, Parallelism, Resource, etc.)
• Edge (EdgeProperty)
– DataMovement
    – Scatter Gather (Join, GroupBy, …)
    – Broadcast (Pig Replicated Join / Hive Broadcast Join)
    – One-to-One (Pig Order by)
    – Custom
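The three built-in data movement types differ only in which consumer task receives which producer output. The sketch below models that routing in plain Python; it is not the Tez API, and the function names and the use of `zlib.crc32` as the partitioning hash are assumptions made for the example.

```python
import zlib


def scatter_gather(producer_outputs, num_consumers):
    """Each producer partitions its records by key; consumer i gathers partition i from every producer."""
    consumers = [[] for _ in range(num_consumers)]
    for records in producer_outputs:
        for key, value in records:
            # crc32 is a stable hash, so the same key always lands on the same consumer.
            partition = zlib.crc32(key.encode("utf-8")) % num_consumers
            consumers[partition].append((key, value))
    return consumers


def broadcast(producer_outputs, num_consumers):
    """Every consumer receives the full output of every producer."""
    everything = [record for records in producer_outputs for record in records]
    return [list(everything) for _ in range(num_consumers)]


def one_to_one(producer_outputs):
    """Consumer i receives exactly the output of producer i."""
    return [list(records) for records in producer_outputs]


producers = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
gathered = scatter_gather(producers, num_consumers=2)
replicated = broadcast(producers, num_consumers=3)
forwarded = one_to_one(producers)
```

Scatter Gather is what a join or group-by needs, since all records with the same key must meet in one task. Broadcast suits a small table that every task joins against, and One-to-One keeps the producer's partitioning unchanged.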
Runtime API (Runtime View)
[Figure: Input → Processor → Output]
• Input
– Through which the processor receives data on an edge
– One vertex can have multiple inputs
• Processor
– Application logic (one processor per vertex)
– Consumes the inputs and produces the outputs
• Output
– Through which the processor writes data to an edge
– One vertex can have multiple outputs
• Example of Input/Output/Processor
– MRInput & MROutput (InputFormat/OutputFormat)
– OrderedGroupedKVInput & OrderedPartitionedKVOutput (Scatter Gather)
– UnorderedKVInput & UnorderedKVOutput (Broadcast & One-to-One)
– PigProcessor/HiveProcessor
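The division of labour between the three roles can be illustrated with a toy model. The classes below (`ListInput`, `ListOutput`, `WordCountProcessor`) are hypothetical and written for this example; they are not the Tez interfaces. They only show that the processor holds the application logic while inputs and outputs hide how data arrives on and leaves through an edge.

```python
class ListInput:
    """Hypothetical input: hands the processor the records that arrived on one edge."""

    def __init__(self, records):
        self._records = list(records)

    def read(self):
        return iter(self._records)


class ListOutput:
    """Hypothetical output: collects the records the processor writes to one edge."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))


class WordCountProcessor:
    """Application logic of one vertex: consumes every input, writes to every output."""

    def run(self, inputs, outputs):
        counts = {}
        for edge_input in inputs.values():
            for line in edge_input.read():
                for word in line.split():
                    counts[word] = counts.get(word, 0) + 1
        for word in sorted(counts):
            for edge_output in outputs.values():
                edge_output.write(word, counts[word])


# One vertex with two inputs and one output, each addressed by a name.
inputs = {"lines_a": ListInput(["tez on yarn"]), "lines_b": ListInput(["tez on hadoop"])}
outputs = {"counts": ListOutput()}
WordCountProcessor().run(inputs, outputs)
```

Swapping `ListInput` for an input that reads from a file or a shuffle would not change the processor, which is the point of the separation.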
Benefit of DAG
• Easier to express computation in DAG
• No intermediate data written to HDFS
• Less pressure on NameNode
• No resource queuing effort & less resource contention
• More optimization opportunity with more global context
Agenda
• Tez Introduction
• Tez Feature Deep Dive
• Tez Improvement & Debuggability
• Tez Status & Roadmap
Container-Reuse
• Reuse the same container across DAGs/Vertices/Tasks
• Benefit of Container-Reuse
– Less resources consumed
– Reduce overhead of launching JVM
– Reduce overhead of negotiating with the Resource Manager
– Reduce overhead of resource localization
– Reduce network IO
– Object Caching (Object Sharing)
Tez Session
• Multiple Jobs/DAGs in one AM
• Container-reuse across Jobs/DAGs
• Data sharing between Jobs/DAGs
Dynamic Parallelism Estimation
• VertexManager
– Listens to the status of the other vertices
– Coordinates and schedules its tasks
– Communication between vertices
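One way a vertex manager could turn the status of upstream vertices into a task count is sketched below. Everything here is an assumption for illustration: the `SimpleVertexManager` class, its parameters, and the extrapolation rule are hypothetical and do not reproduce Tez's actual VertexManager interface or policy. The sketch waits until a fraction of the source tasks has reported its output size, extrapolates the total, and sizes the vertex accordingly.

```python
import math


class SimpleVertexManager:
    """Hypothetical manager for one vertex; it listens to completion events of its source tasks."""

    def __init__(self, num_source_tasks, max_parallelism, bytes_per_task, min_fraction=0.25):
        self.num_source_tasks = num_source_tasks
        self.max_parallelism = max_parallelism
        self.bytes_per_task = bytes_per_task
        self.min_fraction = min_fraction
        self.completed_sizes = []

    def on_source_task_completed(self, output_bytes):
        """Record the output size reported by one finished source task."""
        self.completed_sizes.append(output_bytes)

    def estimate_parallelism(self):
        """Return a task count once enough source tasks have finished, else None."""
        done = len(self.completed_sizes)
        if done < self.min_fraction * self.num_source_tasks:
            return None  # too few observations to extrapolate
        # Extrapolate the total input size from the average of the finished tasks.
        expected_total = sum(self.completed_sizes) / done * self.num_source_tasks
        wanted = math.ceil(expected_total / self.bytes_per_task)
        return max(1, min(self.max_parallelism, wanted))


MB = 1024 * 1024
manager = SimpleVertexManager(num_source_tasks=8, max_parallelism=100, bytes_per_task=256 * MB)
manager.on_source_task_completed(64 * MB)
first_estimate = manager.estimate_parallelism()   # only 1 of 8 source tasks done
manager.on_source_task_completed(64 * MB)
second_estimate = manager.estimate_parallelism()  # 2 of 8 done, enough to extrapolate
```

Deciding at runtime avoids guessing the parallelism before the data size is known; the cap keeps the estimate within the statically configured maximum.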
ATS Integration
• Tez is fully integrated with YARN ATS (Application Timeline Service)
– DAG Status, DAG Metrics, Task Status, Task Metrics are captured
• Diagnostics & Performance analysis
– Data source for monitoring & diagnostics
– Data source for performance analysis
Recovery
• AM can crash in corner cases
– OOM
– Node failure
– …
• Continue from the last checkpoint
• Transparent to end users
[Figure: AM Crash]
Order By of Pig
f = Load 'foo' as (x, y);
o = Order f by x;
[Figure: Order By execution plan. Stages: Load, Sample (Calculate Histogram), Partition, Sort. Labels: HDFS, Broadcast, One-to-One, Scatter Gather.]
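The Sample, Partition, and Sort stages amount to a sample-based range partitioning, which can be sketched in Python. This is a simplified model and not Pig's implementation: the helper names, the sample size, and the number of partitions are assumptions, and records are reduced to their sort key `x`.

```python
import bisect
import random


def pick_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split keys from the sorted sample (a coarse histogram of the keys)."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // num_partitions] for i in range(1, num_partitions)]


def range_partition(keys, boundaries):
    """Route each key to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect.bisect_right(boundaries, key)].append(key)
    return partitions


rng = random.Random(0)                             # fixed seed so the run is repeatable
data = [rng.randint(0, 999) for _ in range(1000)]  # Load stage: the x values of 'foo'
sample = rng.sample(data, 100)                     # Sample stage
boundaries = pick_boundaries(sample, 4)            # histogram -> split keys for the partitioner
partitions = range_partition(data, boundaries)     # Partition stage
sorted_parts = [sorted(p) for p in partitions]     # Sort stage, one task per partition
result = [key for part in sorted_parts for key in part]
```

Since every key in partition i is no larger than any key in partition i + 1, concatenating the independently sorted partitions yields a total order, and the sample keeps the partitions roughly balanced.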
Tez UI
• Download data from ATS
RoadMap
• Shared output edges
– Same output to multiple vertices
• Local mode stabilization
• Optimizing (include/exclude) vertex at runtime
• Partial completion VertexManager
• Co-Scheduling
• Framework stats for better runtime decisions
Tez – Adoption
• Apache Hive
– Start from Hive 0.13
– set hive.execution.engine=tez;
• Apache Pig
– Start from Pig 0.14
– pig -x tez
• Cascading
• Flink
Tez Community
• Useful Links
– http://tez.apache.org/
– JIRA: https://issues.apache.org/jira/browse/TEZ
– Code Repository: https://git-wip-us.apache.org/repos/asf/tez.git
– Mailing Lists
    – Dev List: dev@tez.apache.org
    – User List: user@tez.apache.org
    – Issues List: issues@tez.apache.org
• Tez Meetup
– http://www.meetup.com/Apache-Tez-User-Group
Thank You!
Questions & Answers