
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Apache Tez

Bikas Saha @bikassaha


Apache Hadoop YARN and HDFS

Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming

Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service

Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads

The Data Operating System for Hadoop 2.x

Data Processing Engines Run Natively IN Hadoop

BATCH: MapReduce

LOG STORE: Kafka

STREAMING: Storm

IN-MEMORY: Spark

GRAPH: Giraph

SAS: LASR, HPA

ONLINE: HBase, Accumulo

OTHERS

HDFS: Redundant, Reliable Storage

YARN: Cluster Resource Management


Tez

•APIs and libraries to create data processing applications on YARN

•Customizable and adaptable DAG definition

•Orchestration framework to execute the DAG in a Hadoop cluster

•NOT a general purpose execution engine

Open Source Apache Project


Tez – Goals

• Tez solves the hard problems of running in a distributed Hadoop environment

• Apps can focus on solving their domain-specific problems

• Tez instantiates the physical execution structure. App fills in logic and behavior

• API targets data processing specified as a data flow graph

App

• Custom application logic

• Custom data format

• Custom data transfer technology

Tez

• Distributed parallel execution

• Negotiating resources from the Hadoop framework

• Fault tolerance and recovery

• Shared library of ready-to-use components

• Built-in performance optimizations

• Hadoop Security


Tez – Adoption

• Apache Hive

– Most popular SQL-like interface for data in Hadoop

• Apache Pig

– Scripting language used in some of the largest Hadoop installations

• Apache Flink (Stratosphere project from TU Berlin)

– General purpose engine with language integrated data processing API

• Cascading + Scalding

– Language integrated data processing API in Java/Scala

• Commercial Products

– Datameer, Syncsort and others in progress


Tez – Performance benefits

•Apache Hive

– Order of magnitude improvement in performance

– Speed up mainly from flexible DAG definition and runtime graph reconfiguration

– Performance oriented orchestration layer and shared library components

[Figure: logical DAG for Hive TPC-DS Query 64]


Tez – Scale and Reliability

•Apache Pig

– Pig accounts for the majority of data processing jobs at Yahoo, on clusters of up to 5000 nodes

– Multi-petabyte jobs

– Yahoo is on track to run all production Pig jobs on Pig with Tez

– Yahoo already uses Hive with Tez for large-scale analytics

•Hortonworks customers

– All new customers default to Hive with Tez

•Cascading + Scalding

– Cascading 3.0 released with Tez integration

– Very promising results with beta users

http://scalding.io/2015/05/scalding-cascading-tez-♥/


Tez – DAG API

// Define DAG
DAG dag = DAG.create();

// Define Vertices
Vertex scan1 = Vertex.create(Processor.class);
Vertex partition1 = Vertex.create(Processor.class);

// Define Edge: data movement, data source and scheduling types, then the Output/Input pair
Edge edge = Edge.create(scan1, partition1, SCATTER_GATHER, PERSISTED, SEQUENTIAL,
                        Output.class, Input.class);

// Connect them
dag.addVertex(scan1).addEdge(edge)…

Defines the global logical processing flow

[Figure: logical DAG with vertices Scan1, Scan2, Partition1, Partition2 and Join; edges labeled Scatter Gather]
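To make the idea of a logical DAG concrete, here is a minimal, self-contained sketch. It does not use the Tez library: `LogicalDagSketch` and its methods are hypothetical names for illustration, and the edge layout of the five-vertex example is an assumption based on the vertex names on the slide. It shows only that a DAG is a set of named vertices plus directed edges, and that an order exists in which every producer runs before its consumers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a logical DAG; illustrative only, not the Tez classes. */
final class LogicalDagSketch {
    // Vertex name -> names of its downstream vertices, in insertion order.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    LogicalDagSketch addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    LogicalDagSketch addEdge(String from, String to) {
        addVertex(from).addVertex(to);
        downstream.get(from).add(to);
        return this;
    }

    /** Orders vertices so every producer precedes its consumers (Kahn's algorithm). */
    List<String> topologicalOrder() {
        Map<String, Integer> pendingInputs = new LinkedHashMap<>();
        for (String v : downstream.keySet()) pendingInputs.put(v, 0);
        for (List<String> targets : downstream.values())
            for (String t : targets) pendingInputs.merge(t, 1, Integer::sum);

        List<String> ready = new ArrayList<>();
        for (Map.Entry<String, Integer> e : pendingInputs.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove(0);
            order.add(v);
            for (String t : downstream.get(v))
                if (pendingInputs.merge(t, -1, Integer::sum) == 0) ready.add(t);
        }
        if (order.size() != downstream.size())
            throw new IllegalStateException("graph contains a cycle");
        return order;
    }

    /** Builds the five-vertex example; the edge layout is an assumption. */
    static LogicalDagSketch scanPartitionJoin() {
        return new LogicalDagSketch()
            .addEdge("Scan1", "Partition1")
            .addEdge("Scan2", "Partition2")
            .addEdge("Partition1", "Join")
            .addEdge("Partition2", "Join");
    }
}
```

Calling `LogicalDagSketch.scanPartitionJoin().topologicalOrder()` returns the scans and partitions ahead of `Join`.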


Tez – Logical DAG expansion at Runtime

[Figure: the logical DAG (Scan1, Scan2, Partition1, Partition2, Join) expanded into parallel tasks at runtime]
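A rough sketch of what expansion means, assuming each logical vertex becomes one task per unit of parallelism and a scatter-gather edge connects every producer task to every consumer task. The class and method names are hypothetical and the parallelism values are arbitrary; this is a toy model, not Tez code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy expansion of a logical edge into task-level connections; not Tez code. */
final class DagExpansionSketch {
    /** Names the tasks of a vertex, one per unit of parallelism. */
    static List<String> expandVertex(String vertex, int parallelism) {
        List<String> tasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) tasks.add(vertex + "[" + i + "]");
        return tasks;
    }

    /** Scatter-gather: every producer task sends a partition to every consumer task. */
    static List<String> scatterGather(List<String> producers, List<String> consumers) {
        List<String> links = new ArrayList<>();
        for (String p : producers)
            for (String c : consumers) links.add(p + " -> " + c);
        return links;
    }
}
```

With 3 `Scan1` tasks and 2 `Partition1` tasks, `scatterGather` yields 6 task-level links from a single logical edge.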


Tez – Task Composition

[Figure: logical DAG with vertices V-A, V-B and V-C joined by Edge AB and Edge AC. Task A holds Processor-A with Output-1 and Output-3; Task B holds Input-2 and Processor-B; Task C holds Input-4 and Processor-C]

V-A = { Processor-A.class }

V-B = { Processor-B.class }

V-C = { Processor-C.class }

Edge AB = { V-A, V-B, Output-1.class, Input-2.class }

Edge AC = { V-A, V-C, Output-3.class, Input-4.class }
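The composition above can be sketched in plain Java: a task is an input, a processor and an output wired together. This is a simplified illustration with hypothetical names and standard functional interfaces. It models a single input and a single output for brevity, whereas Task A on the slide has two outputs.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

/** Toy model of a task as input + processor + output; not the Tez runtime API. */
final class TaskCompositionSketch {
    /** Wires an input, a processor and an output into one runnable task. */
    static <I, O> Runnable composeTask(Supplier<List<I>> input,
                                       Function<I, O> processor,
                                       Consumer<O> output) {
        return () -> {
            // Read every record from the input, process it, hand the result to the output.
            for (I record : input.get()) {
                output.accept(processor.apply(record));
            }
        };
    }
}
```

Because each piece is passed in separately, swapping the input or output for a different one leaves the processor untouched, which is the point of the composable model.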


Tez – Composable Task Model

Adopt: Hive Processor with HDFSInput, RemoteFileServerInput, HDFSOutput, LocalDiskOutput

Evolve: Custom Processor with HDFSInput, RemoteFileServerInput, HDFSOutput, LocalDiskOutput

Optimize: Custom Processor with RDMAInput, NativeDBInput, KafkaPub-SubOutput, AmazonS3Output


Tez – Customizable Core Engine

[Figure: Vertex-1 and Vertex-2 each receive Start vertex; the Vertex Manager issues Start tasks; tasks Get Priority from the DAGScheduler and Get container from the TaskScheduler]

• Vertex Manager

– Determines task parallelism

– Determines when tasks in a vertex can start

• DAG Scheduler

– Determines priority of task

• Task Scheduler

– Allocates containers from YARN and assigns them to tasks
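One way to picture these plug points is as small interfaces that the engine consults. The sketch below is illustrative only: the interface names mirror the slide, but the method names, the slow-start rule and the depth-based priority are assumptions chosen for the example, not the Tez plugin API. The task scheduler is left out because it depends on YARN.

```java
/** Toy versions of two plug points; names and rules are hypothetical. */
final class CoreEngineSketch {
    /** Decides how many tasks a vertex runs and when they may start. */
    interface VertexManager {
        int parallelism();
        boolean canStartTasks(int upstreamDone, int upstreamTotal);
    }

    /** Decides task priority; here a lower number is scheduled earlier. */
    interface DagScheduler {
        int priorityFor(int vertexDepth);
    }

    /** Starts tasks once a given fraction of upstream tasks has finished. */
    static VertexManager slowStart(final int parallelism, final double minFraction) {
        return new VertexManager() {
            public int parallelism() {
                return parallelism;
            }
            public boolean canStartTasks(int done, int total) {
                return total == 0 || (double) done / total >= minFraction;
            }
        };
    }

    /** Vertices closer to the DAG root are scheduled first. */
    static DagScheduler byDepth() {
        return depth -> depth;
    }
}
```

An application that knows its data better than the engine could supply a different `VertexManager` without touching the scheduler.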


Customizable core engine: graph reconfiguration

[Figure: Map Vertex and Reduce Vertex tasks, with an App Master holding the Vertex Manager, Vertex State Machine and Event Model]

Map tasks send data statistics events to the Reduce Vertex Manager.

Vertex Manager: Pluggable application logic that understands the data statistics and can formulate the correct parallelism. Advises the vertex controller on parallelism.



[Figure sequence (three steps): Data Size Statistics flow from the Map Vertex to the Vertex Manager through the Event Model; the following steps are labeled Reconfigure Vertex, Re-Route and Cancel Task]
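As an illustration of the kind of logic a vertex manager might apply to those statistics, the sketch below picks a reducer count from the reported map output sizes. The sizing rule (total bytes divided by a desired bytes-per-task, capped at the configured parallelism) and all names are assumptions for the example, not the algorithm Tez ships.

```java
/** Toy parallelism chooser driven by data size statistics; not Tez code. */
final class AutoParallelismSketch {
    /**
     * Returns enough tasks to give each roughly desiredBytesPerTask of input,
     * never more than the configured parallelism and never fewer than one.
     */
    static int chooseParallelism(long[] mapOutputBytes,
                                 long desiredBytesPerTask,
                                 int configuredParallelism) {
        if (desiredBytesPerTask <= 0 || configuredParallelism <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long total = 0;
        for (long bytes : mapOutputBytes) total += bytes;
        // Ceiling division: tasks needed to hold all the data.
        long needed = (total + desiredBytesPerTask - 1) / desiredBytesPerTask;
        return (int) Math.max(1, Math.min(configuredParallelism, needed));
    }
}
```

In this toy model, three map outputs of 100, 200 and 300 bytes with a 256-byte target need 3 reducers, even if 10 were configured.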


Tez – Customizable core engine: graph reconfiguration

[Figure: Vertex 1 tasks, Vertex 2 input data, and an App Master holding the Input Initializer + Vertex Manager, Vertex State Machine and Event Model]

Hive – Dynamic Partition Pruning

[Figure sequence: Vertex 1 tasks send Filtering values to the Input Initializer + Vertex Manager through the Event Model; the following steps are labeled Reconfigure Vertex and Apply Filter to Prune Input Partitions]
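The pruning step can be sketched as a plain filter: once the filtering values arrive at runtime, only partitions whose key matches one of them are kept as input. The names, the path-to-key map and the example paths are hypothetical; this only illustrates the idea and is not how Hive or Tez represent partitions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy dynamic partition pruning; illustrative only. */
final class PartitionPruningSketch {
    /** Keeps only partitions whose key appears among the values reported at runtime. */
    static List<String> prune(Map<String, String> partitionPathToKey,
                              Set<String> filteringValues) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : partitionPathToKey.entrySet()) {
            if (filteringValues.contains(e.getValue())) kept.add(e.getKey());
        }
        return kept;
    }
}
```

Partitions that are pruned are never read, so the downstream vertex starts with a smaller input and fewer tasks.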





Tez – Engineering optimizations

•Container re-use

•Support for user sessions

•Event-based control flow
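Container re-use is the easiest of these to sketch: instead of launching a fresh container for every task, a finished task's container goes back to an idle pool and the next task takes it. The class below is a toy model with hypothetical names and integer container ids; it says nothing about how Tez matches containers to tasks.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy container pool showing re-use; not the Tez task scheduler. */
final class ContainerReuseSketch {
    private final Deque<Integer> idle = new ArrayDeque<>();
    private int launched = 0;

    /** Hands out an idle container if one exists, otherwise launches a new one. */
    int acquire() {
        Integer reused = idle.poll();
        return reused != null ? reused : ++launched;
    }

    /** Returns a container to the idle pool once its task finishes. */
    void release(int containerId) {
        idle.push(containerId);
    }

    /** Number of containers launched so far. */
    int launchedCount() {
        return launched;
    }
}
```

Running two tasks back to back through `acquire`, `release`, `acquire` launches only one container, which is where the saving in allocation and JVM start-up time comes from.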



Tez – Developer tools – Local Mode

• Fast prototyping – no Hadoop setup required

• Quick turnaround in unit testing – no overhead for allocating resources or launching JVMs

• Easy debuggability – Single JVM
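As a hedged sketch of what enabling local mode might involve, the snippet below collects the configuration keys that, to the best of our knowledge, the Tez local mode documentation sets. It uses plain `java.util.Properties` to stay self-contained; in a real application these values would go into the Tez configuration object instead. Treat the key names as assumptions to check against your Tez version.

```java
import java.util.Properties;

/** Collects settings believed to enable Tez local mode; verify against your Tez version. */
final class LocalModeSettingsSketch {
    static Properties localModeSettings() {
        Properties settings = new Properties();
        // Run the DAG inside the client JVM instead of on a cluster.
        settings.setProperty("tez.local.mode", "true");
        // Use the local file system rather than HDFS.
        settings.setProperty("fs.defaultFS", "file:///");
        // Read intermediate data directly from local disk.
        settings.setProperty("tez.runtime.optimize.local.fetch", "true");
        return settings;
    }
}
```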



Tez – Developer Tools - Tez UI

• View Status and progress of DAG/Vertex

• Diagnostics on failure

• View counters for DAG/Vertex

• View and compare counters across tasks/attempts

• View app specific information



Tez – Job Analysis tools - Swimlanes

• “$TEZ_HOME/tez-tools/swimlanes/yarn-swimlanes.sh <app_id>”



Tez – Job Analysis tools – Shuffle performance

• View shuffle performance between nodes



Tez – Hybrid Execution


• Run compute where it’s most efficient

• Building on the pluggable design of Tez, different vertices in the DAG can run in different execution environments

• Hive LLAP daemons can run initial scans, map joins etc. while large joins can run in YARN containers

• Best of both worlds and the pattern can be repeated for Apache Phoenix or your MPP database

[Figure: a DAG with Vertex 1, Vertex 2 and Vertex 3 spread across MPP daemons and YARN containers, with the labels Scan/Filter and Join]


Tez – How can you help?

•Improve core Tez infrastructure

– Apache open source project. Your use cases and code are welcome

•Port DB ideas to Hive+Tez world

– Evolve distributed query optimization and execution

•Use Tez hybrid execution

– Use the Hive-LLAP pattern to get the best of both worlds with your execution environment

•Integrate your project with Tez

– Get benefits similar to Hive, Pig, Cascading and Flink. Integration takes 1 to 6 months, depending on the complexity of the target project


Tez – How to contribute

•Useful links

– Work tracking: https://issues.apache.org/jira/browse/TEZ

– Code: https://github.com/apache/tez

– Developer list: [email protected]

– User list: [email protected]

– Issues list: [email protected]


Tez

Thanks for your time and attention!

Video with Deep Dive on Tez

http://goo.gl/BL67o7

http://www.infoq.com/presentations/apache-tez

Questions?

@bikassaha
