Everything you wanted to know, but were afraid to ask about Oozie


DESCRIPTION

Boris Lublinsky and Alexey Yakubovich give us an overview of using Oozie. This presentation was given on December 13th, 2012 at the Nokia offices in Chicago, IL. View the HD video of this talk here: http://vimeo.com/chug/oozie-overview

Transcript

Everything that you ever wanted to know about Oozie, but were afraid to ask

B. Lublinsky, A. Yakubovich

Apache Oozie
• Oozie is a workflow/coordination system to manage Apache Hadoop jobs.
• A single Oozie server implements all four functional Oozie components:
  – Oozie workflow
  – Oozie coordinator
  – Oozie bundle
  – Oozie SLA

Main components

[Architecture diagram: the Oozie server hosts bundles, coordinators and workflows (a bundle groups coordinators; a coordinator, driven by time condition and data condition monitoring, triggers workflows made of actions). The Oozie command line interface and 3rd party applications use the WS API for job submission and monitoring; workflow definitions and states are kept in a data store; workflow actions and the Oozie shared libraries run against Hadoop (HDFS, MapReduce).]

Oozie workflow

Workflow Language

Flow-control nodes (XML element type – description):
• Decision (workflow:DECISION) – expresses "switch-case" logic
• Fork (workflow:FORK) – splits one path of execution into multiple concurrent paths
• Join (workflow:JOIN) – waits until every concurrent execution path of a previous fork node arrives to it
• Kill (workflow:KILL) – forces a workflow job to kill (abort) itself

Action nodes (XML element type – description):
• Java (workflow:JAVA) – invokes the main() method of the specified Java class
• Fs (workflow:FS) – manipulates files and directories in HDFS; supports the commands move, delete, mkdir
• MapReduce (workflow:MAP-REDUCE) – starts a Hadoop map/reduce job; that could be a Java MR job, a streaming job or a pipes job
• Pig (workflow:PIG) – runs a Pig job
• Sub-workflow (workflow:SUB-WORKFLOW) – runs a child workflow job
• Hive * (workflow:HIVE) – runs a Hive job
• Shell * (workflow:SHELL) – runs a shell command
• Ssh * (workflow:SSH) – starts a shell command on a remote machine as a remote secure shell
• Sqoop * (workflow:SQOOP) – runs a Sqoop job
• Email * (workflow:EMAIL) – sends emails from an Oozie workflow application
• Distcp – under development (Yahoo)

Workflow actions

[Sequence diagram: ActionStartCommand retrieves the workflow (1) and the action (2) from the WorkflowStore, initializes an ActionExecutorContext (3), obtains the ActionExecutor from Services (4) and calls start() on it (5); the executor submits the launcher (6) through a JobClient obtained from Services (7), gets back the running job (8) and records the start data on the context (9).]

• Oozie workflow supports two types of actions:
  – Synchronous, executed inside the Oozie runtime
  – Asynchronous, executed as a MapReduce job

Workflow lifecycle

[State diagram – workflow states: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED]

Oozie execution console

Extending Oozie workflow
• Oozie provides a "minimal" workflow language, which contains only a handful of control and action nodes.
• Oozie supports a very elegant extensibility mechanism – custom action nodes. Custom action nodes allow extending Oozie's language with additional actions (verbs).
• Creation of a custom action requires implementation of the following (a sketch follows this list):
  – a Java action implementation, which extends the ActionExecutor class
  – an implementation of the action's XML schema defining the action's configuration parameters
  – packaging of the Java implementation and configuration schema into an action jar, which has to be added to the Oozie war
  – extending oozie-site.xml to register information about the custom executor with the Oozie runtime
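A minimal sketch of such a custom executor (a synchronous action), assuming the standard ActionExecutor lifecycle methods start/end/check/kill; the class name, the "echo" action type and its behavior are purely illustrative:

    import java.util.Properties;

    import org.apache.oozie.action.ActionExecutor;
    import org.apache.oozie.action.ActionExecutorException;
    import org.apache.oozie.client.WorkflowAction;

    // Illustrative synchronous custom action that simply records its configuration.
    public class EchoActionExecutor extends ActionExecutor {

        public EchoActionExecutor() {
            super("echo");   // action type; must match the custom XML element name
        }

        @Override
        public void start(Context context, WorkflowAction action) throws ActionExecutorException {
            // Synchronous action: do the work here and record completion immediately
            Properties output = new Properties();
            output.setProperty("echoed", action.getConf());
            context.setExecutionData("OK", output);
        }

        @Override
        public void end(Context context, WorkflowAction action) throws ActionExecutorException {
            context.setEndData(WorkflowAction.Status.OK, "OK");
        }

        @Override
        public void check(Context context, WorkflowAction action) throws ActionExecutorException {
            // nothing to poll for a synchronous action
        }

        @Override
        public void kill(Context context, WorkflowAction action) throws ActionExecutorException {
            context.setEndData(WorkflowAction.Status.KILLED, "KILLED");
        }

        @Override
        public boolean isCompleted(String externalStatus) {
            return true;
        }
    }

The compiled class is then registered with the Oozie runtime through oozie-site.xml (typically via the oozie.service.ActionService.executor.ext.classes property), alongside the action's schema.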

Oozie Workflow Client
• Oozie provides an easy way to integrate with enterprise applications through the Oozie client APIs. It provides two types of APIs:
• REST HTTP API – a number of HTTP requests:
  – Info requests (job status, job configuration)
  – Job management (submit, start, suspend, resume, kill)
  Example: job definition info request
  GET /oozie/v0/job/job-ID?show=definition
• Java API – package org.apache.oozie.client (a submission sketch follows this list):
  – OozieClient – start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
  – WorkflowJob, WorkflowAction
  – CoordinatorJob, CoordinatorAction
  – SLAEvent
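A minimal sketch of submitting and monitoring a workflow through the Java API; the Oozie URL, the application path and the inputDir parameter are placeholders:

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            // Placeholder Oozie server URL
            OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

            // Job configuration: the application path plus any workflow parameters
            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/apps/probe-wf");
            conf.setProperty("inputDir", "/data/probes");   // example workflow parameter

            // run() submits and starts the workflow, returning its job id
            String jobId = client.run(conf);

            // The same client can be used for monitoring
            WorkflowJob job = client.getJobInfo(jobId);
            System.out.println("Workflow " + jobId + " is " + job.getStatus());
        }
    }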

Oozie workflow – good, bad and ugly
• Good
  – Nice integration with the Hadoop ecosystem, making it easy to build processes encompassing synchronized execution of multiple MapReduce, Hive, Pig, etc. jobs
  – Nice UI for tracking execution progress
  – Simple APIs for integration with other applications
  – Simple extensibility APIs
• Bad
  – A process has to be expressed directly in hPDL with no visual support
  – No support for uber jars (but we added our own)
• Ugly
  – Static forking (but you can regenerate the workflow and invoke it on the fly)
  – No support for loops

Oozie Coordinator

Coordinator language (element type – description – attributes and sub-elements):
• coordinator-app – top-level element in a coordinator instance – frequency, start, end
• controls – specify the execution policy for the coordinator and its elements (workflow actions) – timeout (actions), concurrency (actions), execution order (workflow instances)
• action – required singular element specifying the associated workflow; the jobs specified in the workflow consume and produce dataset instances – workflow name
• datasets – collection of data referred to by a logical name; datasets serve to specify data dependencies between workflow instances
• input event – specifies the input conditions (in the form of present datasets) that are required in order to execute a coordinator action
• output event – specifies the dataset that should be produced by a coordinator action

Coordinator lifecycle

Oozie Bundle

Bundle lifecycle

[State diagram – bundle states: PREP, PREPPAUSED, PREPSUSPENDED, RUNNING, PAUSED, SUSPENDED, SUCCEEDED, KILLED, FAILED]

Oozie SLA

SLA Navigation

[Diagram: SLA navigation across the Oozie database tables]
• SLA_EVENT – event_id, alert_contact, alert_frequency, …, sla_id, …
• COORD_JOBS – id, app_name, app_path, …
• COORD_ACTIONS – id, action_number, action_xml, …, external_id, …
• WF_JOBS – id, app_name, app_path, …
• WF_ACTIONS – id, conf, console_url, …

Using Probes to analyze/monitor Places

• Select probe data for a specified time/location
• Validate – Filter – Transform probe data
• Calculate statistics on the available probe data
• Distribute data per geo-tiles
• Calculate place statistics (e.g. attendance index)
If an exception condition happens, report failure; if all steps succeed, report success.

Workflow as acyclic graph

Workflow – fragment 1

Workflow – fragment 2

Oozie tips and tricks

Configuring workflow
• Oozie provides 3 overlapping mechanisms to configure a workflow – config-default.xml, the job properties file, and job arguments that can be passed to Oozie as part of the command line invocation.
• The way Oozie processes these three sets of parameters is as follows:
  – Use all of the parameters from the command line invocation
  – For remaining unresolved parameters, the job config is used
  – Use config-default.xml for everything else
• Although the documentation does not describe clearly when to use which, the overall recommendation is as follows (an example follows this list):
  – Use config-default.xml for defining parameters that never change for a given workflow
  – Use job properties for the parameters that are common for a given deployment of a workflow
  – Use command line arguments for the parameters that are specific to a given workflow invocation
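A minimal sketch of the deployment-level job.properties, with placeholder hosts and paths (oozie.wf.application.path is the standard property naming the workflow application):

    nameNode=hdfs://namenode:8020
    jobTracker=jobtracker:8032
    oozie.wf.application.path=${nameNode}/user/apps/probe-wf

An invocation-specific parameter can then be supplied on the command line with the -D flag, e.g.:

    oozie job -oozie http://oozie-host:11000/oozie -config job.properties -DrunDate=2012-12-13 -run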

Accessing and storing process variables

• Accessing
  – through the arguments in the Java main
• Storing

    String ooziePropFileName = System.getProperty("oozie.action.output.properties");
    OutputStream os = new FileOutputStream(new File(ooziePropFileName));
    Properties props = new Properties();
    props.setProperty(key, value);
    props.store(os, "");
    os.close();
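For a later node to read the stored values, the java action declares <capture-output/> in its definition; downstream nodes then reference them through the wf:actionData EL function, e.g. ${wf:actionData('java-node')['key']} – the action name and key here are placeholders.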

Validating data presence

• Oozie provides two possible approaches for validating resource file(s) presence:
  – using the Oozie coordinator's input events based on a dataset – technically the simplest implementation approach, but it does not provide the more complex decision support that might be required; it just either runs a corresponding workflow or not
  – a custom java node inside the Oozie workflow – allows extending the decision logic by sending notifications about data absence, running execution on partial data under certain timing conditions, etc. (a sketch follows this list)
• Additional configuration parameters for the Oozie coordinator, for example the ability to wait for file arrival, etc., can expand the usage of the Oozie coordinator.
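A minimal sketch of such a custom java node, assuming the data lives in HDFS; the class name and the dataPresent property are illustrative. The captured value can then drive a decision node in the workflow.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Checks whether the input data is present and reports the result back
    // to the workflow through the Oozie output-properties mechanism shown above.
    public class DataPresenceCheck {
        public static void main(String[] args) throws Exception {
            Path input = new Path(args[0]);            // path passed as an <arg> of the java action
            FileSystem fs = FileSystem.get(new Configuration());
            boolean present = fs.exists(input);

            String ooziePropFileName = System.getProperty("oozie.action.output.properties");
            OutputStream os = new FileOutputStream(new File(ooziePropFileName));
            Properties props = new Properties();
            props.setProperty("dataPresent", Boolean.toString(present));
            props.store(os, "");
            os.close();
        }
    }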

Invoking MapReduce jobs
• Oozie provides two different ways of invoking a MapReduce job – the MapReduce action and the java action.
• Invocation of a MapReduce job with the java action is somewhat similar to invoking the job with the Hadoop command line from the edge node. You specify a driver as the class for the java activity and Oozie invokes the driver. This approach has two main advantages:
  – The same driver class can be used both for running the MapReduce job from an edge node and as a java action in an Oozie process
  – A driver provides a convenient place for executing additional code, for example clean-up required for the MapReduce execution
• The driver requires a proper shutdown hook to ensure that there are no lingering MapReduce jobs (a sketch follows this list).
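A minimal sketch of such a driver, assuming the Hadoop 2.x Job API; the class and job names are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Driver usable both from the edge node and from an Oozie java action.
    public class ProbeJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            final Job job = Job.getInstance(conf, "probe-statistics");
            job.setJarByClass(ProbeJobDriver.class);
            // ... set mapper/reducer classes and input/output paths from args ...

            // Shutdown hook: if the launcher is killed, kill the MR job as well,
            // so that no lingering job is left behind.
            Runtime.getRuntime().addShutdownHook(new Thread() {
                @Override
                public void run() {
                    try {
                        if (!job.isComplete()) {
                            job.killJob();
                        }
                    } catch (Exception e) {
                        // best effort during shutdown
                    }
                }
            });

            if (!job.waitForCompletion(true)) {
                throw new RuntimeException("MapReduce job failed");
            }
        }
    }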

Implementing predefined looping and forking

• hPDL is an XML document with a well-defined schema.
• This means that the actual workflow can be easily manipulated using JAXB objects, which can be generated from the hPDL schema using the xjc compiler.
• This means that we can create the complete workflow programmatically, based on a calculated number of fork branches, or implement loops as repeated actions (a sketch follows this list).
• The other option is creation of a template process and modifying it based on calculated parameters.
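A rough sketch of the JAXB approach; WORKFLOWAPP and ObjectFactory below stand for whatever binding classes xjc generates from the hPDL schema and are shown for illustration only:

    import java.io.StringWriter;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Marshaller;

    // Illustrative only: the binding classes come from running xjc on the hPDL schema.
    public class WorkflowGenerator {
        public static String buildWorkflow(int branches) throws Exception {
            ObjectFactory factory = new ObjectFactory();
            WORKFLOWAPP app = factory.createWORKFLOWAPP();
            app.setName("generated-wf");

            // ... programmatically add start, a fork with 'branches' paths, the generated
            // actions, join and end nodes to 'app' here, using the generated binding classes ...

            // Marshal the object tree back into workflow.xml text
            JAXBContext ctx = JAXBContext.newInstance(WORKFLOWAPP.class);
            Marshaller m = ctx.createMarshaller();
            m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
            StringWriter out = new StringWriter();
            m.marshal(factory.createWorkflowApp(app), out);
            return out.toString();
        }
    }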

Oozie client security (or lack of it)
• By default the Oozie client reads the client's identity from the local machine OS and passes it to the Oozie server, which uses this identity for MR job invocation.
• Impersonation can be implemented by overriding the OozieClient class' createConfiguration() method, where client variables can be set through a new constructor:

    public Properties createConfiguration() {
        Properties conf = new Properties();
        if (user == null) {
            conf.setProperty(USER_NAME, System.getProperty("user.name"));
        } else {
            conf.setProperty(USER_NAME, user);
        }
        return conf;
    }

Uber jars with Oozie
• An uber jar contains resources: other jars, .so libraries, zip files.

    <java>
       ...
       <main-class>${wfUberLauncher}</main-class>
       <arg>-appStart=${wfAppMain}</arg>
       ...
    </java>

[Diagram: the Oozie server starts the launcher java action from the uber jar; the launcher unpacks the resources into the current uber jar directory, sets an inverse classloader, invokes the MR driver passing the arguments, and sets a 'wait for complete' shutdown hook. The uber jar packages the launcher classes together with the jars, .so libraries and zip files used by the mappers.]
