
May 20, 2020


Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Apache Tez

Bikas Saha @bikassaha


Apache Hadoop YARN and HDFS

Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming

Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance and quality of service

Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads

The Data Operating System for Hadoop 2.x

Data Processing Engines Run Natively IN Hadoop

BATCH: MapReduce

LOG STORE: Kafka

STREAMING: Storm

IN-MEMORY: Spark

GRAPH: Giraph

SAS: LASR, HPA

ONLINE: HBase, Accumulo

OTHERS

HDFS: Redundant, Reliable Storage

YARN: Cluster Resource Management


Tez

•APIs and libraries to create data processing applications on YARN

•Customizable and adaptable DAG definition

•Orchestration framework to execute the DAG in a Hadoop cluster

•NOT a general purpose execution engine

Open Source Apache Project


Tez – Goals

• Tez solves the hard problems of running on a distributed Hadoop environment

• Apps can focus on solving their domain specific problems

• Tez instantiates the physical execution structure. App fills in logic and behavior

• API targets data processing specified as a data flow graph

App

• Custom application logic

• Custom data format

• Custom data transfer technology

Tez

• Distributed parallel execution

• Negotiating resources from the Hadoop framework

• Fault tolerance and recovery

• Shared library of ready-to-use components

• Built-in performance optimizations

• Hadoop Security


Tez – Adoption

• Apache Hive

– Most popular SQL-like interface for data in Hadoop

• Apache Pig

– Scripting language used in some of the largest Hadoop installations

• Apache Flink (Stratosphere project from TU Berlin)

– General purpose engine with language integrated data processing API

• Cascading + Scalding

– Language integrated data processing API in Java/Scala

• Commercial Products

– Datameer, Syncsort, and others in progress


Tez – Performance benefits

•Apache Hive

– Order of magnitude improvement in performance

– Speed up mainly from flexible DAG definition and runtime graph reconfiguration

– Performance oriented orchestration layer and shared library components

[Figure: logical DAG of Hive TPC-DS Query 64]


Tez – Scale and Reliability

•Apache Pig

– Predominant share of data processing jobs at Yahoo, on clusters of up to 5000 nodes

– Multi-Petabyte jobs

– On track for using Pig with Tez for all production Pig jobs

– Already use Hive with Tez for large scale analytics

•Hortonworks customers

– All new customers default to Hive with Tez

•Cascading + Scalding

– Cascading 3.0 released with Tez integration

– Very promising results with beta users

http://scalding.io/2015/05/scalding-cascading-tez-♥/


© Hortonworks Inc. 2013

Tez – DAG API

// Define DAG
DAG dag = DAG.create();

// Define Vertex
Vertex Scan1 = Vertex.create(Processor.class);

// Define Edge
Edge edge = Edge.create(Scan1, Partition1, SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    Output.class, Input.class);

// Connect them
dag.addVertex(Scan1).addEdge(edge)….

Defines the global logical processing flow

[Figure: logical DAG with vertices Scan1, Scan2, Partition1, Partition2, and Join, connected by scatter-gather edges]
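To make the shape of the DAG definition on this slide concrete, here is a minimal self-contained model in plain Java. It is not the Tez API: the class `DagSketch` and its methods `addVertex`, `addEdge`, and `topologicalOrder` are hypothetical names chosen for illustration, and the edges from each Partition vertex to Join are an assumption read off the slide's vertex names. The sketch builds the graph and computes one order in which the vertices could be started.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a DAG builder; this is NOT the Tez API.
final class DagSketch {
    // Vertex name -> names of downstream vertices, in insertion order.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    // Vertex name -> number of incoming edges.
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    DagSketch addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    DagSketch addEdge(String from, String to) {
        if (!downstream.containsKey(from) || !downstream.containsKey(to)) {
            throw new IllegalArgumentException("Add both vertices before the edge");
        }
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    // Kahn's algorithm: a vertex is emitted once all of its inputs are emitted.
    List<String> topologicalOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String next : downstream.get(v)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("Graph has a cycle");
        }
        return order;
    }

    // Builds the graph from the slide; the Partition -> Join edges are assumed.
    static DagSketch slideExample() {
        return new DagSketch()
            .addVertex("Scan1").addVertex("Scan2")
            .addVertex("Partition1").addVertex("Partition2")
            .addVertex("Join")
            .addEdge("Scan1", "Partition1")
            .addEdge("Scan2", "Partition2")
            .addEdge("Partition1", "Join")
            .addEdge("Partition2", "Join");
    }

    public static void main(String[] args) {
        System.out.println(slideExample().topologicalOrder());
    }
}
```

The chained `addVertex(...).addEdge(...)` calls mirror the fluent style shown on the slide; a real orchestrator would also attach processor, input, and output classes to each element.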


Tez – Logical DAG expansion at Runtime

[Figure: the logical DAG (Scan1, Scan2, Partition1, Partition2, Join) expanded into tasks at runtime]


Tez – Task Composition

Logical DAG: V-A feeds V-B (Edge AB) and V-C (Edge AC)

V-A = { Processor-A.class }

V-B = { Processor-B.class }

V-C = { Processor-C.class }

Edge AB = { V-A, V-B, Output-1.class, Input-2.class }

Edge AC = { V-A, V-C, Output-3.class, Input-4.class }

Resulting tasks:

Task A = Processor-A, Output-1, Output-3

Task B = Input-2, Processor-B

Task C = Input-4, Processor-C
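The composition rule on this slide can be sketched as follows: each task gets its vertex's processor, one input for every edge that ends at the vertex, and one output for every edge that starts there. The class `TaskCompositionSketch`, its `Edge` type, and the `compose` method are hypothetical names for illustration only and are not Tez classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of task composition; names are illustrative, not Tez classes.
final class TaskCompositionSketch {
    static final class Edge {
        final String from, to, output, input;
        Edge(String from, String to, String output, String input) {
            this.from = from;
            this.to = to;
            this.output = output;
            this.input = input;
        }
    }

    // Returns vertex name -> "inputs | processor | outputs" for that vertex's task.
    static Map<String, String> compose(Map<String, String> processors, List<Edge> edges) {
        Map<String, String> tasks = new LinkedHashMap<>();
        for (Map.Entry<String, String> v : processors.entrySet()) {
            List<String> inputs = new ArrayList<>();
            List<String> outputs = new ArrayList<>();
            for (Edge e : edges) {
                if (e.to.equals(v.getKey())) inputs.add(e.input);     // edge ends at this vertex
                if (e.from.equals(v.getKey())) outputs.add(e.output); // edge starts at this vertex
            }
            tasks.put(v.getKey(), inputs + " | " + v.getValue() + " | " + outputs);
        }
        return tasks;
    }

    // The vertices and edges exactly as defined on the slide.
    static Map<String, String> slideExample() {
        Map<String, String> processors = new LinkedHashMap<>();
        processors.put("V-A", "Processor-A");
        processors.put("V-B", "Processor-B");
        processors.put("V-C", "Processor-C");
        List<Edge> edges = List.of(
            new Edge("V-A", "V-B", "Output-1", "Input-2"),
            new Edge("V-A", "V-C", "Output-3", "Input-4"));
        return compose(processors, edges);
    }

    public static void main(String[] args) {
        slideExample().forEach((vertex, task) -> System.out.println(vertex + ": " + task));
    }
}
```

Running it shows that the task for V-A carries both Output-1 and Output-3, while the tasks for V-B and V-C each carry a single input, matching the Task A, Task B, and Task C boxes in the diagram.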


Tez – Composable Task Model

[Figure: three task compositions, labeled Adopt, Evolve, Optimize]

• Hive Processor – inputs: HDFS Input, Remote File Server Input; outputs: HDFS Output, Local Disk Output

• Custom Processor – inputs: HDFS Input, Remote File Server Input; outputs: HDFS Output, Local Disk Output

• Custom Processor – inputs: RDMA Input, Native DB Input; outputs: Kafka Pub-Sub Output, Amazon S3 Output


Tez – Customizable Core Engine

[Figure: Vertex-1 and Vertex-2 with their Vertex Manager, the DAGScheduler, and the TaskScheduler; arrows labeled Start vertex, Start tasks, Get Priority, Get container]

• Vertex Manager

– Determines task parallelism

– Determines when tasks in a vertex can start

• DAG Scheduler

– Determines priority of task

• Task Scheduler

– Allocates containers from YARN and assigns them to tasks


Customizable core engine: graph reconfiguration

[Figure: Map Vertex, Reduce Vertex, and the App Master (Vertex Manager, Vertex State Machine); arrows labeled Data Size Statistics, Reconfigure Vertex, Cancel Task, Re-Route]

Event Model: Map tasks send data statistics events to the Reduce Vertex Manager.

Vertex Manager: Pluggable application logic that understands the data statistics and can formulate the correct parallelism. Advises the vertex controller on parallelism.
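The decision described here can be sketched in a few lines: accumulate the output sizes that map tasks report, then pick a reducer count so that each reducer handles roughly a target amount of data, never exceeding the configured parallelism. This is a minimal sketch under stated assumptions, not the Tez implementation; the class `ReduceParallelismSketch`, the `desiredBytesPerReducer` parameter, and both method names are hypothetical.

```java
// Hypothetical sketch of a vertex manager's parallelism decision; NOT Tez code.
final class ReduceParallelismSketch {
    private final int configuredReducers;       // parallelism from the static plan
    private final long desiredBytesPerReducer;  // assumed tuning knob
    private long totalBytes = 0;

    ReduceParallelismSketch(int configuredReducers, long desiredBytesPerReducer) {
        if (configuredReducers < 1 || desiredBytesPerReducer < 1) {
            throw new IllegalArgumentException("Both arguments must be positive");
        }
        this.configuredReducers = configuredReducers;
        this.desiredBytesPerReducer = desiredBytesPerReducer;
    }

    // Called once per data statistics event from a finished map task.
    void onMapOutputSize(long bytes) {
        totalBytes += bytes;
    }

    // Ceiling of total / desired, clamped to the range [1, configuredReducers].
    int chooseParallelism() {
        long needed = (totalBytes + desiredBytesPerReducer - 1) / desiredBytesPerReducer;
        return (int) Math.max(1, Math.min(configuredReducers, needed));
    }

    public static void main(String[] args) {
        ReduceParallelismSketch manager = new ReduceParallelismSketch(10, 100);
        manager.onMapOutputSize(120);
        manager.onMapOutputSize(80);
        manager.onMapOutputSize(50);
        System.out.println(manager.chooseParallelism());
    }
}
```

With 250 bytes reported and a target of 100 bytes per reducer, the sketch settles on 3 reducers instead of the configured 10; the reconfigure, cancel, and re-route steps in the diagram would follow from such a decision.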


Tez – Customizable core engine: graph reconfiguration

Hive – Dynamic Partition Pruning

[Figure: Vertex 1 tasks, Vertex 2 Input Data, and the App Master (Input Initializer + Vertex Manager, Vertex State Machine); arrows labeled Filtering values, Reconfigure Vertex, Apply Filter to Prune Input Partitions]

Event Model: Map tasks send data statistics events to the Reduce Vertex Manager.

Vertex Manager: Pluggable application logic that understands the data statistics and can formulate the correct parallelism. Advises the vertex controller on parallelism.
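The pruning step can be illustrated with a small sketch: the filtering values produced by Vertex 1 are applied to the candidate input partitions of Vertex 2, and only matching partitions are kept for reading. Everything named here is an assumption for illustration: the class `PartitionPruningSketch`, its `Partition` type, the `prune` method, and the example paths and dates are hypothetical and do not come from Hive or Tez.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of dynamic partition pruning; NOT Hive or Tez code.
final class PartitionPruningSketch {
    static final class Partition {
        final String path;      // where the partition's data lives (made-up paths)
        final String keyValue;  // value of the partition key for this partition
        Partition(String path, String keyValue) {
            this.path = path;
            this.keyValue = keyValue;
        }
    }

    // Keeps only the partitions whose key value was produced by the upstream vertex.
    static List<String> prune(List<Partition> candidates, Set<String> filterValues) {
        List<String> kept = new ArrayList<>();
        for (Partition p : candidates) {
            if (filterValues.contains(p.keyValue)) kept.add(p.path);
        }
        return kept;
    }

    static List<String> example() {
        List<Partition> candidates = List.of(
            new Partition("/warehouse/sales/date=2015-01-01", "2015-01-01"),
            new Partition("/warehouse/sales/date=2015-01-02", "2015-01-02"),
            new Partition("/warehouse/sales/date=2015-01-03", "2015-01-03"));
        // Filtering values as they might arrive from Vertex 1 at runtime.
        return prune(candidates, Set.of("2015-01-02"));
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Because the filter values are only known once Vertex 1 has run, this pruning cannot be done at compile time, which is why the slide places it in the input initializer at runtime.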


Tez – Engineering optimizations

•Container re-use

•Support for user sessions

•Event-based control flow


Tez – Developer tools – Local Mode

• Fast prototyping – no Hadoop setup required

• Quick turnaround in unit testing – no overhead for allocating resources or launching JVMs

• Easy debugging – single JVM


Tez – Developer Tools - Tez UI

• View Status and progress of DAG/Vertex

• Diagnostics on failure

• View counters for DAG/Vertex

• View and compare counters across tasks/attempts

• View app specific information


Tez – Job Analysis tools - Swimlanes

• “$TEZ_HOME/tez-tools/swimlanes/yarn-swimlanes.sh <app_id>”


Tez – Job Analysis tools – Shuffle performance

• View shuffle performance between nodes


Tez – Hybrid Execution


• Run “compute where it’s most efficient”

• Building on the pluggable design of Tez, different vertices in the DAG can run in different execution environments

• Hive LLAP daemons can run initial scans, map joins etc. while large joins can run in YARN containers

• Best of both worlds and the pattern can be repeated for Apache Phoenix or your MPP database

[Figure: a three-vertex DAG (Vertex 1, Vertex 2, Vertex 3) with Scan/Filter and Join stages spread across MPP daemons and YARN containers]


Tez – How can you help?

•Improve core Tez infrastructure

– Apache open source project. Your use cases and code are welcome

•Port DB ideas to Hive+Tez world

– Evolve distributed query optimization and execution

•Use Tez hybrid execution

– Use the Hive-LLAP pattern to get the best of both worlds with your execution environment

•Integrate your project with Tez

– Get benefits similar to Hive, Pig, Cascading, and Flink. Takes 1 to 6 months, depending on the complexity of the target project


Tez – How to contribute

•Useful links

– Work tracking: https://issues.apache.org/jira/browse/TEZ

– Code: https://github.com/apache/tez

– Developer list: [email protected]

– User list: [email protected]

– Issues list: [email protected]


Tez

Thanks for your time and attention!

Video with Deep Dive on Tez

http://goo.gl/BL67o7

http://www.infoq.com/presentations/apache-tez

Questions?

@bikassaha
