DataStage Enterprise Edition

Jun 02, 2018


Bhaskar Reddy
Transcript

DataStage Enterprise Edition


Proposed Course Agenda

Day 1 – Review of EE Concepts

 – Sequential Access

 – Best Practices

 – DBMS as Source

Day 2 – EE Architecture

 – Transforming Data

 – DBMS as Target

 – Sorting Data

Day 3 – Combining Data

 – Configuration Files

 – Extending EE

 – Meta Data in EE

Day 4 – Job Sequencing

 – Testing and Debugging


The Course Material

Course Manual

Online Help

Exercise Files and Exercise Guide



Intro Part 1

Introduction to DataStage EE


What is DataStage?

Design jobs for Extraction, Transformation, and Loading (ETL)

Ideal tool for data integration projects, such as data warehouses, data marts, and system migrations

Import, export, create, and manage metadata for use within jobs

Schedule, run, and monitor jobs, all within DataStage

Administer your DataStage development and execution environments


DataStage Server and Clients


DataStage Administrator



DataStage Manager


DataStage Designer


DataStage Director


Developing in DataStage

Define global and project properties in Administrator

Import meta data into Manager

Build job in Designer

Compile in Designer

Validate, run, and monitor in Director


DataStage Projects


Quiz – True or False

DataStage Designer is used to build and compile your ETL jobs

Manager is used to execute your jobs after you build them

Director is used to execute your jobs after you build them

Administrator is used to set global and project properties


Intro Part 2

Configuring Projects


Module Objectives

After this module you will be able to:

 – Explain how to create and delete projects

 – Set project properties in Administrator

 – Set EE global properties in Administrator


Project Properties

Projects can be created and deleted in Administrator

Project properties and defaults are set in Administrator


Setting Project Properties

To set project properties, log onto Administrator, select your project, and then click "Properties"


Licensing Tab


Projects General Tab


Environment Variables


Permissions Tab


Tracing Tab


Tunables Tab


Parallel Tab


Intro Part 3

Managing Meta Data


Module Objectives

 After this module you will be able to:

 – Describe the DataStage Manager components and functionality

 – Import and export DataStage objects

 – Import metadata for a sequential file


What Is Metadata?

[Diagram: data flows from Source through Transform to Target, with metadata for each exchanged with a Meta Data Repository]


DataStage Manager


Manager Contents

Metadata describing sources and targets: table definitions

DataStage objects: jobs, routines, table definitions, etc.


Import and Export

 Any object in Manager can be exported to a file

Can export whole projects

Use for backup

Sometimes used for version control

Can be used to move DataStage objects from one project to another

Use to share DataStage jobs and projects withother developers


Export Procedure

In Manager, click "Export > DataStage Components"

Select DataStage objects for export

Specify type of export: DSX or XML

Specify file path on client machine


Quiz: True or False?

You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file.


Quiz: True or False?

The directory to which you export is on the DataStage client machine, not on the DataStage server machine.


Exporting DataStage Objects




Import Procedure

In Manager, click "Import > DataStage Components"

Select DataStage objects for import


Importing DataStage Objects



Import Options



Exercise

Import DataStage Component (table definition)



Metadata Import

Import format and column definitions from sequential files

Import relational table column definitions

Imported as "Table Definitions"

Table definitions can be loaded into job stages



Sequential File Import Procedure

In Manager, click Import > Table Definitions > Sequential File Definitions

Select the directory containing the sequential file, and then the file

Select Manager category

Examine the format and column definitions and edit if necessary



Manager Table Definition



Importing Sequential Metadata


Intro Part 4

Designing and Documenting Jobs



Module Objectives

 After this module you will be able to:

 – Describe what a DataStage job is

 – List the steps involved in creating a job

 – Describe links and stages

 – Identify the different types of stages

 – Design a simple extraction and load job

 – Compile your job

 – Create parameters to make your job flexible

 – Document your job



What Is a Job?

Executable DataStage program

Created in DataStage Designer, but can use components from Manager

Built using a graphical user interface

Compiles into Orchestrate shell language (OSH)



Job Development Overview

In Manager, import metadata defining sources and targets

In Designer, add stages defining data extractions and loads

Add Transformers and other stages to define data transformations

Add links defining the flow of data from sources to targets

Compile the job

In Director, validate, run, and monitor your job



Designer Work Area



Designer Toolbar

Provides quick access to the main functions of Designer

Job properties, Compile, Show/hide metadata markers



Tools Palette



Adding Stages and Links

Stages can be dragged from the tools palette or from the stage type branch of the repository view

Links can be drawn from the tools palette or by right-clicking and dragging from one stage to another



Sequential File Stage

Used to extract data from, or load data to, a sequential file

Specify full path to the file

Specify a file format: fixed width or delimited

Specify column definitions

Specify write action



Job Creation Example Sequence

Brief walkthrough of procedure

Presumes meta data already loaded in repository


Drag Stages and Links Using Palette



Assign Meta Data



Editing a Sequential Source Stage




Transformer Stage

Used to define constraints, derivations, and column mappings

A column mapping maps an input column to an output column

In this module we will just define column mappings (no derivations)




Create Column Mappings



Creating Stage Variables




Adding Job Parameters

Makes the job more flexible

Parameters can be:

 – Used in constraints and derivations

 – Used in directory and file names

Parameter values are determined at run time
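As a hedged illustration (the parameter name SourceDir and the file name are hypothetical), a job parameter is referenced inside a stage property so its value is resolved at run time, for example in a Sequential File path:

   File = #SourceDir#/customers.txt

When the job is run from Director, the value supplied for SourceDir is substituted into the path.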



Adding Job Documentation

Job Properties

 – Short and long descriptions

 – Shows in Manager

 Annotation stage

 – Is a stage on the tool palette

 – Shows on the job GUI (work area)



Job Properties Documentation

Annotation Stage on the Palette



Annotation Stage Properties



Final Job Work Area with Documentation



Compiling a Job



Errors or Successful Message





Prerequisite to Job Execution



Result from Designer compile


Running Your Job



Director Log View


Message Details are Available


Other Director Functions


Schedule job to run on a particular date/time

Clear job log

Set Director options

 – Row limits

 – Abort after x warnings


Module 1

DSEE – DataStage EE

Review

Ascential's Enterprise Data Integration Platform


Data Integration Platform

ANY SOURCE (CRM, ERP, SCM, RDBMS, Legacy, Real-time, Client-server, Web services, Data Warehouse, Other apps.) to ANY TARGET (CRM, ERP, SCM, BI/Analytics, RDBMS, Real-time, Client-server, Web services, Data Warehouse, Other apps.)

DISCOVER – Data Profiling: gather relevant information for target enterprise applications

PREPARE – Data Quality: cleanse, correct and match input data

TRANSFORM – Extract, Transform, Load: standardize and enrich data and load to targets

Underpinned by Command & Control, Meta Data Management, and Parallel Execution

Course Objectives


You will learn to:

 – Build DataStage EE jobs using complex logic

 – Utilize parallel processing techniques to increase job performance

 – Build custom stages based on application needs

Course emphasis is:

 –  Advanced usage of DataStage EE

 –  Application job development

 – Best practices techniques

Course Agenda


Day 1

 – Review of EE Concepts

 – Sequential Access

 – Standards

 – DBMS Access

Day 2

 – EE Architecture

 – Transforming Data

 – Sorting Data

Day 3

 – Combining Data

 – Configuration Files

Day 4

 – Extending EE

 – Meta Data Usage

 – Job Control

 – Testing





Administrator – Licensing and Timeout



Administrator – Project Creation/Removal



Functions specific to a project.

Administrator – Project Properties


RCP for parallel jobs should be enabled

Variables for parallel processing



OSH is what is run by the EE Framework

DataStage Manager



Designer Workspace


Can execute the job from Designer

DataStage Generated OSH


The EE Framework runs OSH

Director – Executing Jobs


Messages from previous run in a different color

Stages


Can now customize the Designer’s palette 

Select desired stages and drag to favorites


Row Generator


Can build test data

Repeatable property

Edit row in Columns tab


Why EE is so Effective


Parallel processing paradigm

 – More hardware, faster processing

 – Level of parallelization is determined by a configuration file read at runtime

Emphasis on memory

 – Data read into memory and lookups performed like a hash table


Scalable Systems: Examples


Three main types of scalable systems

Symmetric Multiprocessors (SMP): shared memory and disk

Clusters: UNIX systems connected via networks

MPP: Massively Parallel Processing

SMP: Shared Everything


• Multiple CPUs with a single operating system

• Programs communicate using shared memory

• All CPUs share system resources (OS, memory with single linear address space, disks, I/O)

When used with Enterprise Edition:

• Data transport uses shared memory

• Simplified startup


Enterprise Edition treats NUMA (NonUniform Memory Access) as plain SMP 

Traditional Batch Processing


[Diagram: Operational Data and Archived Data flow through Transform, Clean, and Load steps into the Data Warehouse, landing to disk between each step]

Traditional approach to batch processing:

• Write to disk and read from disk before each processing operation

• Sub-optimal utilization of resources

 – a 10 GB stream leads to 70 GB of I/O

 – processing resources can sit idle during I/O

• Very complex to manage (lots and lots of small jobs)

• Becomes impractical with big data volumes

 – disk I/O consumes the processing

 – terabytes of disk required for temporary staging

Pipeline Multiprocessing


Data Pipelining

[Diagram: Operational Data and Archived Data flow through Transform, Clean, and Load directly into the Data Warehouse, with no intermediate disk]

• Transform, clean, and load processes execute simultaneously on the same processor; rows move forward through the flow

• Start a downstream process while an upstream process is still running

• This eliminates intermediate storing to disk, which is critical for big data

• This also keeps the processors busy

• Still has limits on scalability

Think of a conveyor belt moving the rows from process to process!

Partition Parallelism


Data Partitioning

[Diagram: source data split alphabetically (A-F, G-M, N-T, U-Z) across Node 1 through Node 4, each node running its own Transform]

• Break up big data into partitions

• Run one partition on each processor

• 4X faster on 4 processors; with data big enough, 100X faster on 100 processors

• This is exactly how parallel databases work!

• Data partitioning requires the same transform on all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform

Combining Parallelism Types


Putting It All Together: Parallel Dataflow

[Diagram: partitioned source data is pipelined through Transform, Clean, and Load into the Data Warehouse, with repartitioning between stages]


Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly

Without Landing To Disk!

[Diagram: partitioned source data is pipelined through Transform, Clean, and Load into the Data Warehouse, repartitioning on-the-fly between stages, e.g. by customer last name, then customer zip code, then credit card number]

EE Program Elements


 

• Dataset: uniform set of rows in the Framework's internal representation

 – Three flavors:

   1. file sets (*.fs): stored on multiple Unix files as flat files

   2. persistent (*.ds): stored on multiple Unix files in Framework format; read and written using the DataSet stage

   3. virtual (*.v): links, in Framework format, NOT stored on disk

 – The Framework processes only datasets; hence the possible need for Import

 – Different datasets typically have different schemas

 – Convention: "dataset" = Framework data set

• Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file)

 – All the partitions of a dataset follow the same schema: that of the dataset

DataStage EE Architecture


[Diagram: an Orchestrate program (sequential dataflow: Import, Clean 1, Clean 2, Merge, Analyze) runs on the Orchestrate Application Framework and Runtime System, driven by a Configuration File, reading flat files and relational data in parallel]

The Framework provides:

 – Centralized error handling and event logging

 – Parallel access to data in files and in RDBMS

 – Inter-node communications

 – Parallel pipelining

 – Parallelization of operations

 – Performance visualization

Orchestrate Framework: provides application scalability

DataStage: provides the data integration platform

DataStage Enterprise Edition: best-of-breed scalable data integration platform, with no limitations on data volumes or throughput

Introduction to DataStage EE


DSEE:

 – Automatically scales to fit the machine

 – Handles data flow among multiple CPUs and disks

With DSEE you can:

 – Create applications for SMPs, clusters and MPPs… Enterprise Edition is architecture-neutral

 – Access relational databases in parallel

 – Execute external applications in parallel

 – Store data across multiple disks and nodes

Job Design VS. Execution


Developer assembles data flow using the Designer  

…and gets: parallel access, propagation, transformation, and load.

The design is good for 1 node, 4 nodes, or N nodes. To change the number of nodes, just swap the configuration file.

No need to modify or recompile the design

Partitioners and Collectors


Partitioners distribute rows into partitions 

 – implement data-partition parallelism

Collectors = inverse partitioners

 Live on input links of stages running

 – in parallel (partitioners)

 – sequentially (collectors)

Use a choice of methods


Exercise


Complete exercises 1-1, 1-2, and 1-3


Module 2

DSEE Sequential Access

Module Objectives


You will learn to:

 – Import sequential files into the EE Framework

 – Utilize parallel processing techniques to increase sequential file access

 – Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages

 – Manage partitioned data stored by the Framework

Types of Sequential Data Stages


Sequential

 – Fixed or variable length

File Set

Lookup File Set

Data Set


How the Sequential Stage Works


Generates Import/Export operators, depending on whether the stage is source or target

Performs direct C++ file I/O streams

Using the Sequential File Stage


Importing/Exporting Data

Both import and export of general files (text, binary) are performed by the Sequential File stage.

 – Data import: external file to EE internal format

 – Data export: EE internal format to external file

Working With Flat Files


Sequential File Stage

 – Normally will execute in sequential mode

 – Can be parallel if reading multiple files (file pattern option)

 – Can use multiple readers within a node

 – DSEE needs to know:

   How the file is divided into rows

   How a row is divided into columns

Processes Needed to Import Data


Recordization

 – Divides the input stream into records

 – Set on the Format tab

Columnization

 – Divides the record into columns

 – Default set on the Format tab but can be overridden on the Columns tab

 – Can be "incomplete" if using a schema, or not even specified in the stage if using RCP

File Format Example


[Diagram: a record laid out as a series of fields separated by a comma field delimiter, with 'nl' (newline) as the record delimiter and a final delimiter after the last field of either comma or end]
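A minimal sketch of how such a layout could be expressed as an Orchestrate-style import schema; the column names are hypothetical and the exact format property names may vary by release:

   record {delim=',', final_delim=end, record_delim='\n'} (
     custid: int32;
     custname: string;
   )

Each incoming line is recordized on the newline and columnized on the commas.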

Sequential File Stage


To set the properties, use the stage editor

 – Pages (General, Input/Output)

 – Tabs (Format, Columns)

Sequential stage link rules

 – One input link

 – One output link (except for reject link definition)

 – One reject link: will reject any records not matching the meta data in the column definitions

Job Design Using Sequential Stages


Stage categories


Properties – Multiple Files


Click to add more files having the same meta data.

Properties - Multiple Readers


Multiple readers option allows you to set the number of readers

Format Tab


File into records

 

Record into columns

 

Read Methods


Reject Link


Reject mode = output

Source

 – All records not matching the meta data (the column definitions)

Target

 – All records that are rejected for any reason

Meta data – one column, data type = raw

File Set Stage


Can read or write file sets

Files suffixed by .fs

File set consists of:

1. Descriptor file – contains location of raw data files + meta data

2. Individual raw data files

Can be processed in parallel

File Set Stage Example


Descriptor file

File Set Usage


Why use a file set?

 – 2 GB limit on some file systems

 – Need to distribute data among nodes to prevent overruns

 – If used in parallel, runs faster than a sequential file

Lookup File Set Stage


Can create file sets

Usually used in conjunction with Lookup stages

Lookup File Set > Properties


Key column specified

Key column dropped in descriptor file

Data Set


Operating system (Framework) file

Suffixed by .ds

Referred to by a control file

Managed by the Data Set Management utility from the GUI (Manager, Designer, Director)

Represents persistent data

Key to good performance in set of linked jobs

Persistent Datasets


Accessed from/to disk with the DataSet stage. Two parts:

 – Descriptor file: contains metadata and data location, but NOT the data itself

 – Data file(s): contain the data; multiple Unix files (one per node), accessible in parallel

Example: input.ds has data files on node1:/local/disk1/… and node2:/local/disk2/…, with schema:

record (
  partno: int32;
  description: string;
)

Quiz!


• True or False?

Everything that has been data-partitioned must be collected in the same job

Data Set Stage


Is the data partitioned?

Engine Data Translation


Occurs on import

 – From sequential files or file sets

 – From RDBMS

Occurs on export

 – From datasets to file sets or sequential files

 – From datasets to RDBMS

Engine is most efficient when processing internally formatted records (i.e. data contained in datasets)


Data Set Management


Display data

Schema

Data Set Management From Unix



Alternative method of managing file sets and data sets

 – dsrecords

   Gives record count

   Unix command-line utility

   Usage: $ dsrecords ds_name, e.g. $ dsrecords myDS.ds
   156999 records

 – orchadmin

   Manages EE persistent data sets

   Unix command-line utility

   e.g. $ orchadmin rm myDataSet.ds
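A short shell sketch combining the two utilities, using only the commands shown above (the data set name is hypothetical):

   $ dsrecords myDS.ds
   156999 records
   $ orchadmin rm myDS.ds

dsrecords confirms the record count before orchadmin removes the data set.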

Exercise



Complete exercises 2-1, 2-2, 2-3, and 2-4.


Module 3

Standards and Techniques

Objectives



Establish standard techniques for DSEE development

Will cover:

 – Job documentation

 – Naming conventions for jobs, links, and stages

 – Iterative job design

 – Useful stages for job development

 – Using configuration files for development

 – Using environmental variables

 – Job parameters

Job Presentation


Document using the Annotation stage

Job Properties Documentation


Description shows in DS Manager and MetaStage

Organize jobs into categories

Naming conventions



Stages named after the

 – Data they access

 – Function they perform

 – DO NOT leave defaulted stage names like Sequential_File_0

Links named for the data they carry

 – DO NOT leave defaulted link names like DSLink3

Stage and Link Names


Stages and links renamed to the data they handle

Create Reusable Job Components


Use Enterprise Edition shared containers when feasible

Container

Use Iterative Job Design


Use copy or peek stage as stub

Test job in phases – small first, then increasing in complexity

Use Peek stage to examine records

Copy or Peek Stage Stub


Copy stage

Transformer Stage Techniques


Suggestions:

 – Always include a reject link.

 – Always test for null values before using a column in a function.

 – Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later.

 – Be aware of column and stage variable data types. Often the user does not pay attention to the stage variable type.

 – Avoid type conversions. Try to maintain the data type as imported.

The Copy Stage


With 1 link in, 1 link out:

the Copy stage is the ultimate "no-op" (place-holder):

 – Partitioners

 – Sort / Remove Duplicates

 – Rename, Drop column

… can be inserted on:

 – input link (Partitioning): Partitioners, Sort, Remove Duplicates

 – output link (Mapping page): Rename, Drop

Sometimes replaces the transformer

Developing Jobs


1. Keep it simple

• Jobs with many stages are hard to debug and maintain.

2. Start small and Build to final Solution

• Use view data, copy, and peek.

• Start from source and work out.

• Develop with a 1 node configuration file.

3. Solve the business problem before the performance problem.

• Don't worry too much about partitioning until the sequential flow works as expected.

4. If you have to write to Disk use a Persistent Data set.

Final Result


Good Things to Have in each Job


Use job parameters

Some helpful environment variables to add to job parameters:

 – $APT_DUMP_SCORE: reports OSH to the message log

 – $APT_CONFIG_FILE: establishes runtime parameters to the EE engine, i.e. the degree of parallelization
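A hedged sketch of setting these from a Unix shell before launching a job outside the GUI (the configuration file path is hypothetical); within DataStage they would normally be supplied as job parameters instead:

   $ export APT_CONFIG_FILE=/opt/config/4node.apt
   $ export APT_DUMP_SCORE=True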

Setting Job Parameters


Click to add environment variables

DUMP SCORE Output

Setting APT_DUMP_SCORE yields:


Double-click

Mapping: node --> partition

Partitioner and Collector


Exercise


Complete exercise 3-1


Module 4

DBMS Access

Objectives


Understand how DSEE reads and writes records to an RDBMS

Understand how to handle nulls on DBMS lookup

Utilize this knowledge to:

 – Read and write database tables

 – Use database tables to lookup data

 – Use null handling options to clean data

Parallel Database Connectivity


[Diagram: Traditional client-server versus Enterprise Edition. Traditional: many clients, each with a single connection to a parallel RDBMS. Enterprise Edition: parallel application stages (e.g. Sort, Load) with parallel connections into the parallel RDBMS]

Traditional client-server:

 – Only the RDBMS is running in parallel

 – Each application has only one connection

 – Suitable only for small data volumes

Enterprise Edition:

 – Parallel server runs APPLICATIONS

 – Application has parallel connections to the RDBMS

 – Suitable for large data volumes

 – Higher levels of integration possible

RDBMS Access: Supported Databases


Enterprise Edition provides high-performance, scalable interfaces for:

  DB2

  Informix

  Oracle

  Teradata


RDBMS Access


Automatically convert RDBMS table layouts to/from Enterprise Edition Table Definitions

RDBMS nulls converted to/from nullable field values

Support for standard SQL syntax for specifying:

 – field list for SELECT statement

 – filter for WHERE clause

Can write an explicit SQL query to access RDBMS

EE supplies additional information in the SQL query

RDBMS Stages

DB2/UDB Enterprise


Informix Enterprise

Oracle Enterprise

Teradata Enterprise


RDBMS Source – Stream Link


Stream link

DBMS Source - User-defined SQL


Columns in the SQL statement must match the meta data in the Columns tab
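A hedged example (the table and column names are hypothetical): if the Columns tab defines CUSTID and CUSTNAME, the user-defined SQL must select exactly those columns, with compatible types and order:

   SELECT CUSTID, CUSTNAME
   FROM CUSTOMERS
   WHERE STATUS = 'A'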


DBMS Source – Reference Link


Reject link

Lookup Reject Link


"Output" option automatically creates the reject link

Null Handling

Must handle the null condition if the lookup record is not found and the "continue" option is chosen

Can be done in a Transformer stage

Lookup Stage Mapping


Link name

Lookup Stage Properties


Reference link

Must have the same column name in input and reference links. You will get the results of the lookup in the output column.

DBMS as a Target


Write Methods


 – Delete

 – Load

 – Upsert

 – Write (DB2)

Write mode for load method

 – Truncate

 – Create

 – Replace

 – Append

Target Properties

Upsert mode determines options

Generated code can be copied

Checking for Nulls

Use the Transformer stage to test for fields with null values (use IsNull functions)

In the Transformer, can reject or load a default value
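A minimal sketch of such a derivation (link and column names are hypothetical): the output column falls back to a default when the lookup did not supply a value:

   If IsNull(lkpCustomer.CustName) Then 'UNKNOWN' Else lkpCustomer.CustName

A constraint such as IsNull(lkpCustomer.CustName) could instead route the row down a reject link.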

Exercise

Complete exercise 4-2



Module 5

Platform Architecture

Objectives

Understand how the Enterprise Edition Framework processes data

You will be able to:

 – Read and understand OSH

 – Perform troubleshooting

Concepts

The Enterprise Edition Platform


 – Script language: OSH (generated by the DataStage Parallel Canvas, and run by DataStage Director)

 – Communication: conductor, section leaders, players

 – Configuration files (only one active at a time, describes H/W)

 – Meta data: schemas/tables

 – Schema propagation: RCP

 – EE extensibility: Buildop, Wrapper

 – Datasets (data in the Framework's internal representation)

EE Stages Involve A Series Of Processing Steps

DS-EE Stage Elements


[Diagram: an EE stage with an input data set and an output data set, both with schema prov_num:int16; member_num:int8; custid:int32; inside the stage, a partitioner feeds the business logic]

EE Stage

• Piece of application logic running against individual records

• Parallel or sequential

Dual Parallelism Eliminates Bottlenecks!

DSEE Stage Execution


• EE delivers parallelism in two ways

 – Pipeline

 – Partition

• Block buffering between components

 – Eliminates need for program load balancing

 – Maintains orderly data flow

[Diagram: producer and consumer stages connected by pipeline parallelism, each running across multiple partitions]

Stages Control Partition Parallelism

  Execution Mode (sequential/parallel) is controlled by Stage


 – default = parallel for most Ascential-supplied Stages

 – Developer can override the default mode

 – Parallel Stage inserts the default partitioner (Auto) on its input links

 – Sequential Stage inserts the default collector (Auto) on its input links

 – Developer can override the default:

   execution mode (parallel/sequential) of the Stage: Advanced tab

   choice of partitioner/collector: Input > Partitioning tab

How Parallel Is It?

Degree of parallelism is determined by the


configuration file

 – Total number of logical nodes in the default pool, or a subset if using "constraints"

 – Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage

OSH

DataStage EE GUI generates OSH scripts


 – Ability to view OSH turned on in Administrator

 – OSH can be viewed in Designer using job properties

The Framework executes OSH

What is OSH?

 – Orchestrate shell

 – Has a UNIX command-line interface

OSH Script

An osh script is a quoted string which


specifies:

 – The operators and connections of a single Orchestrate step

 – In its simplest form, it is:

   osh "op < in.ds > out.ds"

Where:

 – op is an Orchestrate operator

 – in.ds is the input data set

 – out.ds is the output data set
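A hedged sketch reusing only the simple form above (operator and data set names are hypothetical): two steps chained through a persistent data set:

   osh "operator_A < in.ds > temp.ds"
   osh "operator_B < temp.ds > out.ds"

Within a single step, the generated OSH instead connects operators through virtual data sets (the links), which are never landed to disk.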


Enable Visible OSH in Administrator


Will be enabled for all projects

View OSH in Designer


Schema

Operator

OSH Practice

Exercise 5-1 – Instructor demo (optional)


Elements of a Framework Program

• Operators

• Datasets: set of rows processed by the Framework

 – Orchestrate data sets:

 – persistent (terminal) *.ds, and

 – virtual (internal) *.v.

 – Also: flat "file sets" *.fs

• Schema: data description (metadata) for datasets and links.

Datasets

• Consist of partitioned data and schema

• Can be persistent (*.ds) or virtual (*.v, link)

• Overcome 2 GB File Limit

[Diagram: what you program in the GUI becomes OSH ($ osh "operator_A > x.ds"), which the Framework runs as Operator_A on Node 1 through Node 4; each partition writes its own data files of x.ds, multiple files per partition, each file up to 2 GB (or larger)]

Computing Architectures: Definition

[Diagram: three computing architectures]

Uniprocessor (dedicated disk): single CPU, memory, and disk; PC, workstation, single-processor server

SMP system (Symmetric Multiprocessor; shared memory and disk): 2 to 64 processors; IBM, Sun, HP, Compaq; majority of installations

Clusters and MPP systems (shared nothing): 2 to hundreds of processors, each node a uniprocessor or SMP with its own memory and disk; MPP: IBM and NCR Teradata


Working with Configuration Files

You can easily switch between config files:


'1-node' file: for sequential execution, lighter reports; handy for testing

'MedN-nodes' file: aims at a mix of pipeline and data-partitioned parallelism

'BigN-nodes' file: aims at full data-partitioned parallelism

Only one file is active while a step is running

The Framework queries (first) the environment variable $APT_CONFIG_FILE

The number of nodes declared in the config file need not match the number of CPUs

The same configuration file can be used in development and production
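A minimal sketch of a '1-node' configuration file, following the node/pool/resource syntax shown later in this module (the fastname and paths are hypothetical):

   {
     node "n1" {
       fastname "devserver"
       pool ""
       resource disk "/orch/n1/d1" {}
       resource scratchdisk "/temp" {}
     }
   }

Point $APT_CONFIG_FILE at this file while testing, then swap in a bigger file for parallel runs.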

Scheduling Nodes, Processes, and CPUs

DS/EE does not:

– know how many CPUs are available


 – schedule

Who knows what? Who does what?

 – DS/EE creates (Nodes * Ops) Unix processes

 – The O/S schedules these processes on the CPUs

Nodes = # logical nodes declared in the config file
Ops = # ops. (approx. # blue boxes in V.O.)
Processes = # Unix processes
CPUs = # available CPUs

              Nodes   Ops   Processes     CPUs
User          Y       -     -             N
Orchestrate   Y       Y     Nodes * Ops   N
O/S           -       -     "             Y
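Worked example: with a 4-node configuration file and a score containing 5 operators, DS/EE creates roughly 4 * 5 = 20 player processes (plus the conductor and section leaders), and the O/S then schedules those processes across however many CPUs the machine actually has.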

Configuring DSEE – Node Pools

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Configuring DSEE – Disk Pools

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Re-Partitioning

Parallel-to-parallel flow may incur reshuffling: records may jump between nodes

[Diagram: a partitioner redistributing records between node 1 and node 2]

Partitioning Methods

 Auto


Hash

Entire

Range

Range Map

Collectors

• Collectors combine partitions of a dataset into a single input stream to a sequential Stage


[Diagram: multiple data partitions feeding through a collector into a sequential Stage]

 – Collectors do NOT synchronize data

Partitioning and Repartitioning Are Visible On Job Design



Setting a Node Constraint in the GUI


Reading Messages in Director

Set APT_DUMP_SCORE to true


Can be specified as job parameter

Messages sent to Director log

If set, the parallel job will produce a report showing the operators, processes, and datasets in the running job

Messages With APT_DUMP_SCORE= True


Exercise

Complete exercise 5-2



Module 6

Transforming Data

Module Objectives

Understand ways DataStage allows you to transform data

Use this understanding to:

 – Create column derivations using user-defined code or system functions

 – Filter records based on business criteria

 – Control data flow based on data conditions

Transformed Data

Transformed data is:


 – Outgoing column is a derivation that may, or may not, include incoming fields or parts of incoming fields

 – May be comprised of system variables

Frequently uses functions performed on something (i.e. incoming columns), divided into categories, e.g.:

 Date and time

 Mathematical

 Logical

 Null handling

 More


Transformer Stage Functions

Control data flow


Create derivations

Flow Control

Separate records flow down links based on data


condition – specified in Transformer stage constraints

Transformer stage can filter records

Other stages can filter records but do not exhibit advanced flow control

 – Sequential can send bad records down reject link

 – Lookup can reject records based on lookup failure

 – Filter can select records based on data value
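As a rough sketch (the link names, column name, and threshold below are invented for illustration, not taken from the course material), two output links of a Transformer might carry constraints such as:

  Link "Valid"   constraint:  Not(IsNull(lnkIn.Balance)) And lnkIn.Balance >= 0
  Link "Suspect" constraint:  lnkIn.Balance < 0

Rows that satisfy neither boolean expression can be caught by checking the Other/Log option on a third link, as described under Rejecting Data below.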

Rejecting Data

Reject option on sequential stage – Data does not agree with meta data


 – Output consists of one column with binary data type

Reject links (from Lookup stage) result from the drop option of the property "If Not Found"

 – Lookup "failed" – All columns on reject link (no column mapping option)

Reject constraints are controlled from the constraint editor of the transformer – Can control column mapping

 – Use the "Other/Log" checkbox

Rejecting Data Example


"If Not Found" property

Constraint – Other/log option

Property Reject Mode = Output

Transformer Stage Properties


Transformer Stage Variables

First of transformer stage entities to execute


Execute in order from top to bottom – Can write a program by using one stage variable to point to the results of a previous stage variable

Multi-purpose – Counters

 – Hold values for previous rows to make comparison

 – Hold derivations to be used in multiple field derivations

 – Can be used to control execution of constraints
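As a minimal sketch (the link, column, and stage-variable names are invented), stage variables evaluated top to bottom on input sorted by CustID could detect a key change and keep a running count per customer:

  svIsNewKey : If lnkIn.CustID <> svPrevID Then 1 Else 0
  svCount    : If svIsNewKey = 1 Then 1 Else svCount + 1
  svPrevID   : lnkIn.CustID

Because svPrevID is evaluated last, it still holds the previous row's key when svIsNewKey is computed for the current row.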

Stage Variables


Show/Hide button

Transforming Data

Derivations


 – Using expressions – Using functions

Date/time

Transformer Stage Issues

 – Sometimes require sorting before the transformer stage – e.g. using a stage variable as an accumulator and need to break on change of column value

Checking for nulls

Checking for Nulls

Nulls can get introduced into the dataflow


because of failed lookups and the way in which you chose to handle this condition

Can be handled in constraints, derivations, stage variables, or a combination of these
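For example (a hedged sketch: IsNotNull and NullToValue are standard Transformer null-handling functions, while the link and column names are invented), a derivation can substitute a default when a lookup failed to populate a column, and a constraint can drop rows whose key is null:

  Output derivation:  NullToValue(lkp.Region, "UNKNOWN")
  Link constraint:    IsNotNull(lkp.CustID)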

Transformer - Handling Rejects


Constraint Rejects

 – All expressions are false and reject row is checked

Transformer: Execution Order


•  Derivations in stage variables are executed first

• Constraints are executed before derivations

• Column derivations in earlier links are executed before later links

• Derivations in higher columns are executed before lower columns

Parallel Palette - Two Transformers

 All > Processing > Parallel > Processing


Transformer – the non-Universe transformer

Has a specific set of functions

No DS routines available

Basic Transformer – makes server style transforms available on the parallel palette

Can use DS routines

• Program in Basic for both transformers

Transformer Functions From Derivation Editor

Date & Time


Logical

Null Handling

Number String

Type Conversion

Exercise

Complete exercises 6-1, 6-2, and 6-3


Module 7

Sorting Data

Objectives

Understand DataStage EE sorting options

Use this understanding to create a sorted list of data to enable functionality within a transformer stage

Sorting Data

Important because

 – Some stages require sorted input

 – Some stages may run faster – e.g. Aggregator

Can be performed

 – Option within stages (use input > partitioning tab and set partitioning to anything other than auto)

 –  As a separate stage (more complex sorts)

 Sorting Alternatives


• Alternative representation of same flow:

Sort Option on Stage Link


Sort Stage


Sort Utility

DataStage – the default

UNIX


Sort Stage - Outputs

Specifies how the output is derived


Removing Duplicates

Can be done by Sort stage

 – Use unique option


OR

Remove Duplicates stage

 – Has more sophisticated ways to remove duplicates

Exercise

Complete exercise 7-1


Module 8

Combining Data

Objectives

Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator


stages

Use this understanding to create jobs that will

 – Combine data from separate input streams

 –  Aggregate data to form summary totals

Combining Data

There are two ways to combine data:


 – Horizontally: Several input links; one output link (+ optional rejects) made of columns from different input links. E.g., Joins

Lookup

Merge

 – Vertically:

One input link; one output link with columns combining values from all input rows. E.g., Aggregator

Join, Lookup & Merge Stages

These "three Stages" combine two or more inputlinks according to values of user-designated "key"

l ( )

Page 252: data stage doc

8/10/2019 data stage doc

http://slidepdf.com/reader/full/data-stage-doc 252/374

column(s).

They differ mainly in:

 – Memory usage

 – Treatment of rows with unmatched key values

 – Input requirements (sorted, de-duplicated)


Join Stage Editor


One of four variants:

 –  Inner –  Left Outer –  Right Outer –  Full Outer

Several key columns allowed

Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)

1. The Join Stage

Four types:

• Inner

2 sorted input links, 1 output link – "left outer" on primary input, "right outer" on secondary input – Pre-sort makes joins "lightweight": few rows need to be in RAM

• Left Outer

• Right Outer

• Full Outer

2. The Lookup Stage

Combines:

 – one source link with

– one or more duplicate-free table links


no pre-sort necessary

allows multiple keys LUTs

flexible exception handling for source input rows with no match

[Diagram: Lookup stage – a source input (port 0) and one or more tables (LUTs, ports 1..N) feed the Lookup; outputs are an output link and a reject link]

The Lookup Stage

Lookup Tables should be small enough to fit into physical memory (otherwise, performance hit due to paging)

On an MPP you should partition the lookup tables using the entire partitioning method, or

partition them the same way you partition the source link

On an SMP, no physical duplication of a

Lookup Table occurs

The Lookup Stage

Lookup File Set – Like a persistent data set, only it contains metadata about the key.

 – Useful for staging lookup tables

RDBMS LOOKUP – NORMAL: Loads to an in-memory hash table first

 – SPARSE: Select for each row. Might become a performance bottleneck.

3. The Merge Stage

Combines

 – one sorted, duplicate-free master  (primary) link with  – one or more sorted update (secondary) links. 

– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup).

Follows the Master-Update model: – Master row and one or more updates row are merged if they have the same

value in user-specified key  column(s).

 – A non-key column occurs in several inputs? The lowest input port number prevails (e.g., master over update; update values are ignored)

 – Unmatched ("Bad") master rows can be either kept

dropped

 – Unmatched ("Bad") update rows in input link can be captured in a "reject"link 

 – Matched update rows are consumed.

The Merge Stage

Allows composite keys

Multiple update links


Matched update rows are consumed

Unmatched updates can be captured

Lightweight

Space/time tradeoff: presorts vs. in-RAM table

[Diagram: Merge stage – a master link (port 0) and one or more update links (ports 1..N) feed the Merge; outputs are one output link and optional reject links]

Synopsis:

Joins, Lookup, & Merge

Category | Joins | Lookup | Merge
Model | RDBMS-style relational | Source - in-RAM LU Table | Master - Update(s)
Memory usage | light | heavy | light
# and names of inputs | exactly 2: 1 left, 1 right | 1 Source, N LU Tables | 1 Master, N Update(s)
Mandatory input sort | both inputs | no | all inputs
Duplicates in primary input | OK (x-product) | OK | Warning!
Duplicates in secondary input(s) | OK (x-product) | Warning! | OK only when N = 1
Options on unmatched primary | NONE | [fail] / continue / drop / reject | [keep] / drop
Options on unmatched secondary | NONE | NONE | capture in reject set(s)
On match, secondary entries are | reusable | reusable | consumed
# Outputs | 1 | 1 out, (1 reject) | 1 out, (N rejects)
Captured in reject set(s) | Nothing (N/A) | unmatched primary entries | unmatched secondary entries

In this table, a comma within a cell separates values for the primary and secondary input links (and for out and reject links).

The Aggregator Stage

Purpose: Perform data aggregations

Specify:


Zero or more key columns that define the aggregation units (or groups)

Columns to be aggregated

Aggregation functions: count (nulls/non-nulls), sum, max/min/range

The grouping method (hash table or pre-sort) is a performance issue

Grouping Methods 

Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed – doesn't require sorted data

 – good when the number of unique groups is small. The running tally for each group's aggregate calculations needs to fit easily into memory. Requires about 1 KB/group of RAM.

 – Example: average family income by state requires about 0.05 MB of RAM (50 states × ~1 KB per group ≈ 50 KB)

Sort: results for only a single aggregation group are kept in memory; when a new group is seen (key value changes), the current group is written out.

 – requires input sorted by grouping keys

 – can handle unlimited numbers of groups

 – Example: average daily balance by credit card

Aggregator Functions

Sum


Min, max

Mean

Missing value count

Non-missing value count

Percent coefficient of variation

Aggregator Properties


Aggregation Types


Containers

Two varieties

 – Local


 – Shared

Local

 – Simplifies a large, complex diagram

Shared

 – Creates reusable object that many jobs can include

Creating a Container

Create a job

Select (loop) portions to containerize


Edit > Construct container > local or shared

Using a Container

Select as though it were a stage


Exercise

Complete exercise 8-1


Module 9

Configuration Files

Objectives

Understand how DataStage EE uses configuration files to determine parallel behavior


Use this understanding to

 – Build an EE configuration file for a computer system

 – Change node configurations to support adding resources to processes that need them

 – Create a job that will change resource allocations at the stage level

Configuration File Concepts

Determine the processing nodes and disk space connected to each node

When the system changes, need only change the configuration file – no need to recompile jobs

When a DataStage job runs, the platform reads the configuration file

 – Platform automatically scales the application to fit the system

Processing Nodes Are

Locations on which the framework runs applications


Logical rather than physical construct

Do not necessarily correspond to the number of

CPUs in your system – Typically one node for two CPUs

Can define one processing node for multiple physical nodes or multiple processing nodes for one physical node

Optimizing Parallelism

Degree of parallelism determined by number of nodes defined


Parallelism should be optimized, not maximized

 – Increasing parallelism distributes work load but also increases Framework overhead

Hardware influences degree of parallelism possible

System hardware partially determines

configuration

More Factors to Consider

Communication amongst operators – Should be optimized by your configuration

 – Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or high-speed link

SMP – leave some processors for the operating system

Desirable to equalize partitioning of data

Use an experimental approach – Start with small data sets

 – Try different parallelism while scaling up data set sizes

Factors Affecting Optimal Degree of

Parallelism

CPU intensive applications

 – Benefit from the greatest possible parallelism


 Applications that are disk intensive

 – Number of logical nodes equals the number of disk spindles being accessed

Configuration File

Text file containing string data that is passed to the Framework

 – Sits on server side


 – Can be displayed and edited

Name and location found in the environment variable APT_CONFIG_FILE

Components:

 – Node

 – Fast name

 – Pools – Resource

Node Options

Node name – name of a processing node used by EE – Typically the network name

 – Use command uname -n to obtain network name


Fastname – Name of node as referred to by fastest network in the system

 – Operators use physical node name to open connections

 – NOTE: for SMP, all CPUs share single connection to network

Pools – Names of pools to which this node is assigned

 – Used to logically group nodes

 – Can also be used to group resources

Resource – Disk

 – Scratchdisk

Sample Configuration File

{
node "Node1"
{
fastname "BlackHole"
pools "" "node1"
resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
}
}

Disk Pools

  Disk pools allocate storage

  By default, EE uses the default pool, specified by ""

  pool "bigdata"

Sorting Requirements

Resource pools can also be specified for sorting: 

The Sort stage looks first for scratch disk resources in a "sort" pool, and then in the default disk pool

{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "sort"
resource disk "/data/n1/d1" {}
resource disk "/data/n1/d2" {}
resource scratchdisk "/scratch" {"sort"}
}

Another Configuration File Example

node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/data/n2/d1" {}
resource scratchdisk "/scratch" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/data/n3/d1" {}
resource scratchdisk "/scratch" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/data/n4/d1" {}
resource scratchdisk "/scratch" {}
}
}


Resource Types

Disk

Scratchdisk


DB2

Oracle

Saswork

Sortwork

Can exist in a pool – Groups resources together
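For illustration, a resource is placed in a named pool simply by listing the pool name in braces after it; the paths below are made up, but the syntax follows the configuration file examples in this module:

  resource disk "/data/big/d1" {pools "bigdata"}
  resource scratchdisk "/scratch/s1" {pools "sort"}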

Using Different Configurations


Lookup stage where DBMS is using a sparse lookup type

Building a Configuration File

Scoping the hardware:

 – Is the hardware configuration SMP, Cluster, or MPP?

 – Define each node structure (an SMP would be a single node):

Number of CPUs

CPU speed

Available memory

Available page/swap space

Connectivity (network/back-panel speed)

 – Is the machine dedicated to EE? If not, what other applications are running on it?

 – Get a breakdown of the resource usage (vmstat, mpstat, iostat)

 – Are there other configuration restrictions? E.g. DB only runs on certain nodes and ETL cannot run on them?
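To make the scoping concrete, here is a minimal sketch of a configuration for a single dedicated 4-CPU SMP server; the hostname, directories, and the two-node split are illustrative assumptions, following the "typically one node for two CPUs" guideline above (on an SMP, both logical nodes share the same fastname):

  {
  node "node1" {
  fastname "smp_host"
  pools ""
  resource disk "/data/node1" {pools ""}
  resource scratchdisk "/scratch/node1" {pools ""}
  }
  node "node2" {
  fastname "smp_host"
  pools ""
  resource disk "/data/node2" {pools ""}
  resource scratchdisk "/scratch/node2" {pools ""}
  }
  }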

Exercise

Complete exercise 9-1 and 9-2


Module 10

Extending DataStage EE

Objectives

Understand the methods by which you can add functionality to EE

Use this understanding to:


Use this understanding to:

 – Build a DataStage EE stage that handles special processing needs not supplied with the vanilla stages

 – Build a DataStage EE job that uses the new stage


When To Leverage EE Extensibility

Types of situations:

Complex business logic, not easily accomplished using standard

EE stages


Reuse of existing C, C++, Java, COBOL, etc.

Wrappers vs. Buildop vs. Custom

Wrappers are good if you cannot or do not

want to modify the application and

performance is not critical.


Buildops: good if you need custom coding but

do not need dynamic (runtime-based) input

and output interfaces.

Custom (C++ coding using framework API): good

if you need custom coding and need dynamic

input and output interfaces.

Building "Wrapped" Stages

You can "wrapper" a legacy executable:

  Binary


  Unix command

  Shell script

… and turn it into an Enterprise Edition stage

capable, among other things, of parallel execution… As long as the legacy executable is:

  amenable to data-partition parallelism (no dependencies between rows)

  pipe-safe (can read rows sequentially, no random access to data)

Wrappers (Cont'd)

Wrappers are treated as a black box


EE has no knowledge of contents

EE has no means of managing anything that occurs

inside the wrapper

EE only knows how to export data to and import data from the wrapper

User must know at design time the intended behavior of

the wrapper and its schema interface

If the wrappered application needs to see all records prior to processing, it cannot run in parallel.

LS Example


Can this command be wrappered?

Creating a Wrapper


Used in this job –

To create the "ls" stage

Creating Wrapped Stages

From Manager: Right-Click on Stage Type

Wrapper Starting Point


> New Parallel Stage > Wrapped

We will "Wrapper‖ an existing

Unix executables – the lscommand

Wrapper - General Page


Unix command to be wrapped

Name of stage

The "Creator" Page


Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.

Wrapper – Properties Page

If your stage will have properties appear, complete the Properties page


This will be the name of the property as it appears in your stage

Wrapper - Wrapped Page


Interfaces – input and output columns – these should first be entered into the table definitions meta data (DS Manager); let's do that now.

• Layout interfaces describe what columns the stage:

 – Needs for its inputs (if any)

Interface schemas 


 – Creates for its outputs (if any)

 – Should be created as tables with columns in

Manager

Column Definition for Wrapper

Interface


How Does the Wrapping Work?

 – Define the schema for export and import

 – Schemas become interface schemas of the operator and allow for by-name column access

[Diagram: input schema → export → stdin or named pipe → UNIX executable → stdout or named pipe → import → output schema]

QUIZ: Why does export precede import?

Update the Wrapper Interfaces

This wrapper will have no input interface – i.e. no input link. The location will come as a job parameter that will

be passed to the appropriate stage property. Therefore, only the Output tab entry is needed.
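As a sketch of what that output interface might contain (the column name and length are assumptions, not prescribed by the course), a single string column is enough to hold each file name the ls command writes to stdout – the table definition would be equivalent to the schema:

  record (
    file_name: string[max=255];
  )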


Resulting Job


Wrapped stage


Wrapper Story: Cobol Application

Hardware Environment: – IBM SP2, 2 nodes with 4 CPUs per node.

Software: – DB2/EEE, COBOL, EE

Original COBOL Application: – Extracted source table, performed lookup against table in DB2,

and Loaded results to target table. – 4 hours 20 minutes sequential execution

Enterprise Edition Solution: – Used EE to perform Parallel DB2 Extracts and Loads

 – Used EE to execute COBOL application in Parallel

 – EE Framework handled data transfer between DB2/EEE and COBOL application

 – 30 minutes 8-way parallel  execution

Buildops

Buildop provides a simple means of extending beyond the functionality provided by EE, but does not use an existing

executable (like the wrapper).

Reasons to use Buildop include:

  Speed / Performance

  Complex business logic that cannot be easily represented using existing stages – Lookups across a range of values

 – Surrogate key generation

 – Rolling aggregates

  Build once and reusable everywhere within project, no shared container necessary

  Can combine functionality from different stages into one

BuildOps

 – The DataStage programmer encapsulates the business logic

 – The Enterprise Edition interface called "buildop" automatically performs the tedious, error-prone tasks: invoke needed header files, build the necessary "plumbing" for a correct and efficient parallel execution.

 – Exploits extensibility of EE Framework

 

From Manager (or Designer), Repository pane:

BuildOp Process Overview


Right-Click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}

• "Build" stages

from within Enterprise Edition

• "Wrapping‖ existing ―Unix‖

executables

General Page

Identical to Wrappers, except: under the Build Tab, your program!

Logic Tab for Business Logic

Enter business C/C++ logic and arithmetic in four pages under the Logic tab

Main code section goes in the Per-Record page – it will be applied to all rows

NOTE: Code will need to be ANSI C/C++ compliant. If code does not compile outside of EE, it won't compile within EE either!

Code Sections under Logic Tab

Temporary variables declared [and initialized] here

Logic here is executed once BEFORE processing the FIRST row

Logic here is executed once AFTER processing the LAST row

I/O and Transfer

Under Interface tab: Input, Output & Transfer pages


Optional renaming of output port from default "out0"

Write row

Input page: 'Auto Read'

Read next row

In-Repository Table Definition

'False' setting, not to interfere with Transfer page

First line: output 0

I/O and Transfer


• Transfer all columns from input to output.
• If page left blank or Auto Transfer = "False" (and RCP = "False"), only columns in the output Table Definition are written

First line: Transfer of index 0

BuildOp Simple Example

  Example - sumNoTransfer

 –   Add input columns "a" and "b"; ignore other columns

that might be present in input

 – Produce a new "sum" column

 –   Do not transfer input columns

sumNoTransfer

a:int32; b:int32

sum:int32
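With those interfaces in place, the Per-Record logic for this example can be a single statement. A minimal sketch, assuming Auto Read/Auto Write are on and the interface table definitions expose the columns by name, as described above:

  // Per-Record page: runs once for every input row
  sum = a + b;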

From Peek:

No Transfer


 NO TRANSFER

-  RCP set to "False" in stage definitionand

-  Transfer page left blank, or Auto Transfer = "False" 

• Effects:

-  input columns "a" and "b" are not transferred

-  only new column "sum" is transferred

Compare with transfer ON…

Transfer


TRANSFER

- RCP set to "True" in stage definitionor

-  Auto Transfer set to "True"

• Effects:

- new column "sum" is transferred, as well as - input columns "a" and "b" and

- input column "ignored" (present in input, butnot mentioned in stage)

Columns vs. Temporary C++ Variables

Columns:

 – DS-EE type

 – Defined in Table Definitions

 – Value refreshed from row to row

Temporary C++ variables:

 – C/C++ type

 – Need declaration (in Definitions or Pre-Loop page)

 – Value persistent throughout the "loop" over rows, unless modified in code

Exercise

Complete exercise 10-1 and 10-2


Exercise

Complete exercises 10-3 and 10-4


Custom Stage

Reasons for a custom stage:

 –  Add EE operator not already in DataStage EE

 – Build your own Operator and add to DataStage EE


Use EE API

Use Custom Stage to add new operator to EE canvas

Custom Stage

DataStage Manager > select Stage Types branch > right click


Custom Stage

Number of input and output links allowed


Name of Orchestrate operator to be used

Custom Stage – Properties Tab


The Result


Objectives

Understand how EE uses meta data, particularly schemas and runtime column propagation

Use this understanding to:

 – Build schema definition files to be invoked in DataStage jobs

 – Use RCP to manage meta data usage in EE jobs

Establishing Meta Data

Data definitions

 – Recordization and columnization

 – Fields have properties that can be set at individual field level


Data types in GUI are translated to types used by EE

 – Described as properties on the format/columns tab

(outputs or inputs pages) OR – Using a schema file (can be full or partial)

Schemas

 – Can be imported into Manager – Can be pointed to by some job stages (i.e. Sequential)

Data Formatting – Record Level

Format tab

Meta data described on a record basis


Record level properties

Data Formatting – Column Level

Defaults for all columns


Column Overrides

Edit row from within the columns tab

Set individual column properties


Extended Column Properties


Field and string settings

Extended Properties – String Type

Note the ability to convert ASCII to EBCDIC


Editing Columns


Properties depend on the data type

Schema

 Alternative way to specify column definitions for data used in EE jobs

Written in a plain text file


Can be written as a partial record definition

Can be imported into the DataStage repository
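For illustration, a schema file is just a plain-text record definition; a minimal sketch (the column names and types are invented, and record-level properties such as delimiters may be added in braces before the column list):

  record (
    CustID: int32;
    CustName: string[max=30];
    OrderDate: date;
    Amount: decimal[8,2];
  )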

Creating a Schema

Using a text editor

 – Follow correct syntax for definitions

 – OR

Import from an existing data set or file set


 – On DataStage Manager import > Table Definitions >

Orchestrate Schema Definitions – Select checkbox for a file with .fs or .ds

Importing a Schema


Schema location can be on the server or local workstation

Data Types

Date

Decimal

Floating point

I t

Vector

Subrecord

Raw

T d

Page 340: data stage doc

8/10/2019 data stage doc

http://slidepdf.com/reader/full/data-stage-doc 340/374

Integer

String

Time

Timestamp

Tagged

Runtime Column Propagation

DataStage EE is flexible about meta data. It can cope with the situation where meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP).

RCP is always on at runtime.

Design and compile time column mapping enforcement.

 – RCP is off by default.

 – Enable first at project level. (Administrator project properties)

 – Enable at job level. (job properties General tab)

 – Enable at Stage. (Link Output Column tab)

Enabling RCP at Project Level


Enabling RCP at Job Level


Enabling RCP at Stage Level

Go to output link's columns tab

For transformer you can find the output link's columns tab by first going to stage properties


Using RCP with Sequential Stages

To utilize runtime column propagation in the Sequential stage you must use the "use schema" option

Stages with this restriction:

 – Sequential

 – File Set – External Source

 – External Target

Runtime Column Propagation

When RCP is Disabled

 –  DataStage Designer will enforce Stage Input Column

to Output Column mappings. –  At job compile time modify operators are inserted on

output links in the generated osh.

Runtime Column Propagation

When RCP is Enabled

 – DataStage Designer will not enforce mapping rules.

 – No Modify operator inserted at compile time. – Danger of runtime error if column names incoming do


not match column names on the outgoing link – case sensitivity.

Exercise

Complete exercises 11-1 and 11-2


Module 12


Job Control Using the Job Sequencer

Objectives

Understand how the DataStage job sequencer works

Use this understanding to build a control job to run a sequence of DataStage jobs


Job Control Options

Manually write job control

 – Code generated in Basic

 – Use the job control tab on the job properties page – Generates basic code which you can modify


Job Sequencer

 – Build a controlling job much the same way you build other jobs

 – Comprised of stages and links

 – No basic coding

Job Sequencer

Build like a regular job

Type "Job Sequence"

Has stages and links

Job Activity stage represents a DataStage job

Links represent passing control

Stages

Example

Job Activity stage – contains conditional triggers


Job Activity Properties

Job to be executed – select from dropdown


Job parameters to be passed

select from dropdown

Job Activity Trigger


Trigger appears as a link in the diagram

Custom options let you define the code

Options

Use custom option for conditionals

 – Execute if job run successful or warnings only

Can add "wait for file" to execute

Add "execute command" stage to drop real tables and rename new tables to current tables

Job Activity With Multiple Links


Different links having different triggers

Sequencer Stage

Build job sequencer to control job for the collections application


Can be set to all or any

Notification Stage


Notification

Notification Activity


Sample DataStage log from Mail Notification


E-Mail Message

Notification Activity Message


Exercise

Complete exercise 12-1


Module 13


Testing and Debugging

Objectives

Understand spectrum of tools to perform testing and debugging

Use this understanding to troubleshoot a DataStage job


Environment Variables


Parallel Environment Variables


Environment Variables

Stage Specific


Environment Variables


Environment Variables

Compiler


Typical Job Log Messages: 

Environment variables

Configuration File information

The Director


Framework Info/Warning/Error messages 

Output from the Peek Stage

 Additional info with "Reporting" environments

Tracing/Debug output

 – Must compile job in trace mode – Adds overhead

• Job Properties, from Menu Bar of Designer  

• Director will prompt you before each run

Job Level Environmental Variables

Troubleshooting

If you get an error during compile, check the following:

Compilation problems

 – If Transformer used, check C++ compiler, LD_LIBRARY_PATH

 – If Buildop errors, try buildop from command line – Some stages may not support RCP – can cause column mismatch.

 – Use the Show Error  and More buttons

 – Examine Generated OSH 


 – Check environment variables settings

Very little integrity checking during compile, should run validate from Director.

Highlights source of error

Generating Test Data

Row Generator stage can be used

 – Column definitions

 – Data type dependent

Row Generator plus lookup stages provides a good way to create robust test data from pattern files