DataStage Enterprise Edition
8/10/2019 data stage doc
http://slidepdf.com/reader/full/data-stage-doc 1/374
Proposed Course Agenda
Day 1 – Review of EE Concepts
– Sequential Access
– Best Practices
– DBMS as Source
Day 2 – EE Architecture
– Transforming Data
– DBMS as Target
– Sorting Data
Day 3 – Combining Data
– Configuration Files
– Extending EE
– Meta Data in EE
Day 4 – Job Sequencing
– Testing and Debugging
The Course Material
Course Manual
Online Help
Exercise Files and Exercise Guide
Intro
Part 1
Introduction to DataStage EE
What is DataStage?
Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects – such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Schedule, run, and monitor jobs all within DataStage
Administer your DataStage development and execution environments
DataStage Server and Clients
DataStage Administrator
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
Define global and project properties in Administrator
Import meta data into Manager
Build the job in Designer
Compile the job in Designer
Validate, run, and monitor in Director
DataStage Projects
Quiz – True or False
DataStage Designer is used to build and compile your ETL jobs
Manager is used to execute your jobs after you build them
Director is used to execute your jobs after you build them
Administrator is used to set global and project properties
Intro Part 2
Configuring Projects
Module Objectives
After this module you will be able to:
– Explain how to create and delete projects
– Set project properties in Administrator
– Set EE global properties in Administrator
Project Properties
Projects can be created and deleted in Administrator
Project properties and defaults are set in Administrator
Setting Project Properties
To set project properties, log onto Administrator, select your project, and then click "Properties"
Licensing Tab
Projects General Tab
Environment Variables
Permissions Tab
Tracing Tab
Tunables Tab
Parallel Tab
Intro Part 3
Managing Meta Data
Module Objectives
After this module you will be able to:
– Describe the DataStage Manager components and functionality
– Import and export DataStage objects
– Import metadata for a sequential file
What Is Metadata?
[Diagram: data flows from Source through Transform to Target; meta data about each is stored in the Meta Data Repository]
DataStage Manager
Manager Contents
Metadata describing sources and targets: table definitions
DataStage objects: jobs, routines, table definitions, etc.
Import and Export
Any object in Manager can be exported to a file
Can export whole projects
Use for backup
Sometimes used for version control
Can be used to move DataStage objects from one project to another
Use to share DataStage jobs and projects with other developers
Export Procedure
In Manager, click "Export > DataStage Components"
Select DataStage objects for export
Specify the type of export: DSX or XML
Specify the file path on the client machine
Quiz: True or False?
You can export DataStage objects such as jobs, but you can’t export metadata, such as field definitions of a sequential file.
Quiz: True or False?
The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
Exporting DataStage Objects
Import Procedure
In Manager, click "Import > DataStage Components"
Select DataStage objects for import
Importing DataStage Objects
Import Options
Exercise
Import DataStage Component (table definition)
Metadata Import
Import format and column definitions from sequential files
Import relational table column definitions
Imported as "Table Definitions"
Table definitions can be loaded into job stages
Sequential File Import Procedure
In Manager, click Import > Table Definitions > Sequential File Definitions
Select the directory containing the sequential file, and then the file
Select the Manager category
Examine format and column definitions and edit as necessary
Manager Table Definition
Importing Sequential Metadata
Intro Part 4
Designing and Documenting Jobs
Module Objectives
After this module you will be able to:
– Describe what a DataStage job is
– List the steps involved in creating a job
– Describe links and stages
– Identify the different types of stages
– Design a simple extraction and load job
– Compile your job
– Create parameters to make your job flexible
– Document your job
What Is a Job?
Executable DataStage program
Created in DataStage Designer, but can use components from Manager
Built using a graphical user interface
Compiles into Orchestrate shell language (OSH)
Job Development Overview
In Manager, import metadata defining sources and targets
In Designer, add stages defining data extractions and loads
Add Transformers and other stages to define data transformations
Add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job
Designer Work Area
Designer Toolbar
Provides quick access to the main functions of Designer
Job properties
Compile
Show/hide metadata markers
Tools Palette
Adding Stages and Links
Stages can be dragged from the tools palette or from the stage type branch of the repository view
Links can be drawn from the tools palette or by right-clicking and dragging from one stage to another
Sequential File Stage
Used to extract data from, or load data to, a sequential file
Specify the full path to the file
Specify a file format: fixed width or delimited
Specify column definitions
Specify the write action
Job Creation Example Sequence
Brief walkthrough of procedure
Presumes meta data already loaded in repository
Drag Stages and Links Using Palette
Assign Meta Data
Editing a Sequential Source Stage
Transformer Stage
Used to define constraints, derivations, and column mappings
A column mapping maps an input column to an output column
In this module we will just define column mappings (no derivations)
Create Column Mappings
Creating Stage Variables
Adding Job Parameters
Makes the job more flexible
Parameters can be:
– Used in constraints and derivations
– Used in directory and file names
Parameter values are determined at run time
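The run-time resolution described above can be sketched conceptually. DataStage jobs reference parameters as #ParamName# inside file names and stage properties; the Python below is only an illustrative stand-in for that substitution (DataStage performs it internally), and `resolve_params` and the sample path are hypothetical names:

```python
import re

def resolve_params(text, params):
    """Replace #Name# references with values supplied at 'run time'."""
    def sub(match):
        name = match.group(1)
        if name not in params:
            raise KeyError("undefined job parameter: " + name)
        return params[name]
    return re.sub(r"#(\w+)#", sub, text)

# The same job design works against any environment or date:
path = "/data/#Env#/customers_#RunDate#.txt"
resolved = resolve_params(path, {"Env": "dev", "RunDate": "20030101"})
print(resolved)  # /data/dev/customers_20030101.txt
```

Because the values are bound only at run time, one compiled job can serve many environments and schedules.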
Adding Job Documentation
Job Properties
– Short and long descriptions
– Shows in Manager
Annotation stage
– Is a stage on the tool palette
– Shows on the job GUI (work area)
Job Properties Documentation
Annotation Stage on the Palette
Annotation Stage Properties
Final Job Work Area with Documentation
Compiling a Job
Errors or Successful Message
Prerequisite to Job Execution
Result from Designer compile
Running Your Job
Director Log View
Message Details are Available
Other Director Functions
Schedule job to run on a particular date/time
Clear job log
Set Director options
– Row limits
– Abort after x warnings
Module 1
DSEE – DataStage EE
Review
Ascential’s Enterprise Data Integration Platform
Data Integration Platform
[Diagram: ANY SOURCE (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, Web services, data warehouse, other apps.) to ANY TARGET (CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, Web services, data warehouse, other apps.), under Command & Control:
– DISCOVER (Data Profiling): gather relevant information for target enterprise applications
– PREPARE (Data Quality): cleanse, correct, and match input data
– TRANSFORM (Extract, Transform, Load): standardize and enrich data and load to targets
All underpinned by Meta Data Management and Parallel Execution]
Course Objectives
You will learn to:
– Build DataStage EE jobs using complex logic
– Utilize parallel processing techniques to increase job performance
– Build custom stages based on application needs
Course emphasis is:
– Advanced usage of DataStage EE
– Application job development
– Best practices techniques
Course Agenda
Day 1
– Review of EE Concepts
– Sequential Access
– Standards
– DBMS Access
Day 2
– EE Architecture
– Transforming Data
– Sorting Data
Day 3
– Combining Data
– Configuration Files
Day 4
– Extending EE
– Meta Data Usage
– Job Control
– Testing
Administrator – Licensing and Timeout
Administrator – Project Creation/Removal
Functions specific to a project.
Administrator – Project Properties
RCP for parallel jobs should be enabled
Variables for parallel processing
OSH is what is run by the EE Framework
DataStage Manager
Designer Workspace
Can execute the job from Designer
DataStage Generated OSH
The EE Framework runs OSH
Director – Executing Jobs
Messages from the previous run shown in a different color
Stages
Can now customize the Designer’s palette
Select desired stages and drag to favorites
Row Generator
Can build test data
Repeatable property
Edit row in column tab
Why EE is so Effective
Parallel processing paradigm
– More hardware, faster processing
– Level of parallelization is determined by a configuration file read at runtime
Emphasis on memory
– Data is read into memory and lookups are performed like a hash table
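The in-memory lookup idea above can be sketched in a few lines. This is an illustrative Python analogy, not DataStage code: the reference data is loaded once into a hash table (a dict here), so each probe is a constant-time memory access instead of a file or table scan.

```python
# Hypothetical reference data: (key, value) pairs loaded once into memory.
reference = [(101, "Ford"), (102, "GM"), (103, "Toyota")]

# "Read into memory": build the hash table a single time.
lookup = {key: value for key, value in reference}

# Then each input row probes the table in O(1).
rows = [{"make_id": 103}, {"make_id": 101}]
for row in rows:
    row["make_name"] = lookup.get(row["make_id"], "UNKNOWN")

print(rows)
```

The same trade-off applies in EE: memory is spent up front so that per-row processing stays fast.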
Scaleable Systems: Examples
Three main types of scalable systems
Symmetric Multiprocessors (SMP): shared memory and disk
Clusters: UNIX systems connected via networks
MPP: Massively Parallel Processing
SMP: Shared Everything
• Multiple CPUs with a single operating system
• Programs communicate using shared memory
• All CPUs share system resources (OS, memory with single linear address space, disks, I/O)
When used with Enterprise Edition:
• Data transport uses shared memory
• Simplified startup
Enterprise Edition treats NUMA (Non-Uniform Memory Access) as plain SMP
Traditional Batch Processing
[Diagram: Operational Data and Archived Data flow from Source through Transform, Clean, and Load into the Data Warehouse, landing to disk between each step]
Traditional approach to batch processing:
• Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
  • a 10 GB stream leads to 70 GB of I/O
  • processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
  • disk I/O consumes the processing
  • terabytes of disk required for temporary staging
Pipeline Multiprocessing
Data Pipelining
[Diagram: Operational Data and Archived Data flow from Source through Transform, Clean, and Load directly into the Data Warehouse, with no intermediate disk]
• Transform, clean, and load processes execute simultaneously on the same processor
  • rows are moving forward through the flow
• Start a downstream process while an upstream process is still running.
• This eliminates intermediate storing to disk, which is critical for big data.
• This also keeps the processors busy.
• Still has limits on scalability
Think of a conveyor belt moving the rows from process to process!
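The conveyor-belt behavior can be mimicked with Python generators, purely as a teaching analogy (EE pipelines run as operating-system processes, not Python). Each stage pulls rows from the previous one, so a downstream stage starts before the upstream one finishes and nothing lands to disk; the stage names and sample data are invented for the sketch:

```python
# Each stage is lazy: rows flow forward one at a time, like a conveyor belt.
def source():
    for i in range(5):
        yield {"id": i, "name": "  name%d  " % i}

def transform(rows):
    for row in rows:
        yield {"id": row["id"], "name": row["name"].strip()}

def clean(rows):
    for row in rows:
        if row["id"] % 2 == 0:   # drop odd ids as a stand-in for cleansing
            yield row

# The "load" step drains the pipeline; no intermediate dataset is stored.
loaded = list(clean(transform(source())))
print(loaded)
```

The first row reaches `clean` before `source` has produced the last one, which is exactly the property that lets pipelining avoid intermediate staging.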
Partition Parallelism
Data Partitioning
[Diagram: source data is split by key range (A-F, G-M, N-T, U-Z) across four Transform instances, one per node (Node 1 through Node 4)]
• Break up big data into partitions
• Run one partition on each processor
• 4X faster on 4 processors; with data big enough, 100X faster on 100 processors
• This is exactly how parallel databases work!
• Data partitioning requires the same transform on all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform
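The range-partitioning scheme from the slide can be sketched as follows. This is an illustrative Python model, not EE code: in EE the partitions would run on separate nodes named in the configuration file, while here they are just lists processed with the same transform.

```python
# Four key ranges on the first letter of the name, as in the slide.
RANGES = [("A", "F"), ("G", "M"), ("N", "T"), ("U", "Z")]

def partition_of(name):
    first = name[0].upper()
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= first <= hi:
            return i
    raise ValueError("no partition for " + name)

names = ["Abbott", "Zorn", "Miller", "Novak"]
partitions = [[] for _ in RANGES]
for n in names:
    partitions[partition_of(n)].append(n)

# The SAME transform is applied to every partition:
# Aaron Abbott and Zygmund Zorn undergo the same (uppercase) transform.
result = [[n.upper() for n in part] for part in partitions]
print(partitions)  # [['Abbott'], ['Miller'], ['Novak'], ['Zorn']]
print(result)
```

With each partition assigned to its own processor, total work is unchanged but elapsed time shrinks roughly in proportion to the partition count.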
Combining Parallelism Types
Putting It All Together: Parallel Dataflow
[Diagram: source data flows through pipelined Transform, Clean, and Load stages from Source to Target, into the Data Warehouse]
Repartitioning
Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly
Without landing to disk!
[Diagram: partitioned data (A-F, G-M, N-T, U-Z) flows from source to Data Warehouse through pipelined Transform, Clean, and Load stages, repartitioned between stages by different keys: customer last name, customer zip code, credit card number]
EE Program Elements
• Dataset: uniform set of rows in the Framework's internal representation
– Three flavors:
  1. file sets (*.fs): stored on multiple Unix files as flat files
  2. persistent (*.ds): stored on multiple Unix files in Framework format; read and written using the DataSet Stage
  3. virtual (*.v): links, in Framework format, NOT stored on disk
– The Framework processes only datasets; hence the possible need for Import
– Different datasets typically have different schemas
– Convention: "dataset" = Framework data set.
• Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).
– All the partitions of a dataset follow the same schema: that of the dataset
DataStage EE Architecture
[Diagram: an Orchestrate program (a sequential dataflow: Import, Clean1, Clean2, Merge, Analyze) runs on the Orchestrate Application Framework and Runtime System, driven by a Configuration File. The Framework provides centralized error handling and event logging, parallel access to data in flat files and in RDBMS, inter-node communications, parallel pipelining, parallelization of operations, and performance visualization.]
Orchestrate Framework: provides application scalability
DataStage: provides data integration platform
DataStage Enterprise Edition: best-of-breed scalable data integration platform. No limitations on data volumes or throughput
Introduction to DataStage EE
DSEE:
– Automatically scales to fit the machine
– Handles data flow among multiple CPUs and disks
With DSEE you can:
– Create applications for SMPs, clusters, and MPPs… Enterprise Edition is architecture-neutral
– Access relational databases in parallel
– Execute external applications in parallel
– Store data across multiple disks and nodes
Job Design vs. Execution
Developer assembles the data flow using the Designer…
…and gets: parallel access, propagation, transformation, and load.
The design is good for 1 node, 4 nodes, or N nodes. To change the number of nodes, just swap the configuration file.
No need to modify or recompile the design
Partitioners and Collectors
Partitioners distribute rows into partitions
– implement data-partition parallelism
Collectors = inverse partitioners
Live on input links of stages running
– in parallel (partitioners)
– sequentially (collectors)
Use a choice of methods
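One of the simplest methods, round robin, can be modeled in a few lines. This Python sketch is only an analogy for the partitioner/collector pair (the function names are invented): the partitioner deals rows out to partitions in turn, and the collector is its inverse, merging the partitions back into one sequential stream.

```python
from itertools import zip_longest

def partition_round_robin(rows, n):
    """Deal rows across n partitions, one at a time."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def collect_round_robin(parts):
    """Inverse of the partitioner: interleave partitions back into one stream."""
    out = []
    for group in zip_longest(*parts):        # one row from each partition per pass
        out.extend(r for r in group if r is not None)
    return out

parts = partition_round_robin(list(range(7)), 3)
print(parts)                      # [[0, 3, 6], [1, 4], [2, 5]]
print(collect_round_robin(parts))  # [0, 1, 2, 3, 4, 5, 6]
```

Other methods (hash, range, same, entire) trade row order and key locality differently; the partitioner/collector inverse relationship shown here is the common thread.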
Exercise
Complete exercises 1-1, 1-2, and 1-3
Module 2
DSEE Sequential Access
Module Objectives
You will learn to:
– Import sequential files into the EE Framework
– Utilize parallel processing techniques to increase sequential file access
– Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages
– Manage partitioned data stored by the Framework
Types of Sequential Data Stages
Sequential
– Fixed or variable length
File Set
Lookup File Set
Data Set
How the Sequential Stage Works
Generates Import/Export operators, depending on whether the stage is a source or a target
Performs direct C++ file I/O streams
Using the Sequential File Stage
Importing/Exporting Data
Both import and export of general files (text, binary) are performed by the Sequential File Stage.
– Data import: converts the external file into the EE internal format
– Data export: converts from the EE internal format back to the external file
Working With Flat Files
Sequential File Stage
– Normally will execute in sequential mode
– Can be parallel if reading multiple files (file pattern option)
– Can use multiple readers within a node
– DSEE needs to know:
  How the file is divided into rows
  How a row is divided into columns
Processes Needed to Import Data
Recordization
– Divides the input stream into records
– Set on the format tab
Columnization
– Divides the record into columns
– Default set on the format tab but can be overridden on the columns tab
– Can be "incomplete" if using a schema, or not even specified in the stage if using RCP
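The two import steps above can be sketched as plain parsing. This Python model is only illustrative (the real work is done by the generated Import operator), and the sample data and schema are invented:

```python
stream = "101,Ford,1997\n102,GM,2001\n"

# Recordization: the record delimiter (newline, set on the format tab)
# divides the input stream into records.
records = [r for r in stream.split("\n") if r]

# Columnization: the field delimiter (comma) divides each record into
# columns; names and types come from the column definitions.
schema = [("id", int), ("make", str), ("year", int)]
rows = []
for rec in records:
    values = rec.split(",")
    rows.append({name: typ(v) for (name, typ), v in zip(schema, values)})

print(rows)
# [{'id': 101, 'make': 'Ford', 'year': 1997}, {'id': 102, 'make': 'GM', 'year': 2001}]
```

Keeping the two steps separate is what lets the format tab define records once while the columns tab (or a schema, or RCP) refines columnization independently.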
File Format Example
[Diagram: records are a series of fields separated by the field delimiter (comma) and terminated by the record delimiter (nl). With Final Delimiter = comma, the last field is followed by a comma before the record delimiter; with Final Delimiter = end, the last field is followed directly by the record delimiter.]
Sequential File Stage
To set the properties, use the stage editor
– Pages (general, input/output)
– Tabs (format, columns)
Sequential stage link rules:
– One input link
– One output link (except for reject link definition)
– One reject link
  Will reject any records not matching the meta data in the column definitions
Job Design Using Sequential Stages
Stage categories
Properties – Multiple Files
Click to add more files having the same meta data.
Properties - Multiple Readers
Multiple readers option allows you to set the number of readers
Format Tab
File into records
Record into columns
Read Methods
Reject Link
Reject mode = output
Source
– All records not matching the meta data (the column definitions)
Target
– All records that are rejected for any reason
Meta data – one column, data type = raw
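The source-side reject behavior can be sketched as a filter. This is an illustrative Python model, not stage code: rows that fail to columnize against the column definitions are routed down the reject link as a single raw column, while conforming rows continue on the output link. The schema and sample records are invented.

```python
schema = [("id", int), ("amount", float)]

def split_rows(records):
    good, rejects = [], []
    for rec in records:
        values = rec.split(",")
        try:
            if len(values) != len(schema):
                raise ValueError("wrong column count")
            good.append({n: t(v) for (n, t), v in zip(schema, values)})
        except ValueError:
            # Reject link meta data: one raw column holding the record bytes.
            rejects.append(rec.encode())
    return good, rejects

good, rejects = split_rows(["1,9.95", "2,abc", "3"])
print(good)     # [{'id': 1, 'amount': 9.95}]
print(rejects)  # [b'2,abc', b'3']
```

Keeping rejects raw is deliberate: since the record could not be parsed, the only safe representation is its original bytes.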
File Set Stage
Can read or write file sets
Files suffixed by .fs
File set consists of:
1. Descriptor file – contains location of raw data files + meta data
2. Individual raw data files
Can be processed in parallel
File Set Stage Example
Descriptor file
File Set Usage
Why use a file set?
– 2 GB limit on some file systems
– Need to distribute data among nodes to prevent overruns
– If used in parallel, runs faster than a sequential file
Lookup File Set Stage
Can create file sets
Usually used in conjunction with Lookup stages
Lookup File Set > Properties
Key column specified
Key column dropped in descriptor file
Data Set
Operating system (Framework) file
Suffixed by .ds
Referred to by a control file
Managed by Data Set Management utility fromGUI (Manager, Designer, Director)
Represents persistent data
Key to good performance in set of linked jobs
Persistent Datasets
Accessed from/to disk with the DataSet Stage. Two parts:
– Descriptor file: contains metadata and data location, but NOT the data itself
– Data file(s): contain the data as multiple Unix files (one per node), accessible in parallel
[Diagram: descriptor input.ds points to data files at node1:/local/disk1/… and node2:/local/disk2/…, with schema: record ( partno: int32; description: string; )]
Quiz!
• True or False?
Everything that has been data-partitioned must be collected in the same job
Data Set Stage
Is the data partitioned?
Engine Data Translation
Occurs on import
– From sequential files or file sets
– From RDBMS
Occurs on export
– From datasets to file sets or sequential files
– From datasets to RDBMS
Engine is most efficient when processing internally formatted records (i.e. data contained in datasets)
Data Set Management
Display data
Schema
Data Set Management From Unix
Alternative method of managing file sets and data sets:
– dsrecords
  Gives record count
  Unix command-line utility: $ dsrecords ds_name
  e.g. $ dsrecords myDS.ds
  156999 records
– orchadmin
  Manages EE persistent data sets
  Unix command-line utility
  e.g. $ orchadmin rm myDataSet.ds
Exercise
Complete exercises 2-1, 2-2, 2-3, and 2-4.
Module 3
Standards and Techniques
Objectives
Establish standard techniques for DSEE development
Will cover:
– Job documentation
– Naming conventions for jobs, links, and stages
– Iterative job design
– Useful stages for job development
– Using configuration files for development
– Using environmental variables
– Job parameters
Job Presentation
Document using the annotation stage
Job Properties Documentation
Description shows in DS Manager and MetaStage
Organize jobs into categories
Naming conventions
Stages named after the
– Data they access
– Function they perform
– DO NOT leave defaulted stage names like Sequential_File_0
Links named for the data they carry
– DO NOT leave defaulted link names like DSLink3
Stage and Link Names
Stages and links renamed to the data they handle
Create Reusable Job Components
Use Enterprise Edition shared containers when feasible
Container
Use Iterative Job Design
Use Copy or Peek stage as stub
Test job in phases – small first, then increasing in complexity
Use Peek stage to examine records
Copy or Peek Stage Stub
Copy stage
Transformer Stage Techniques
Suggestions:
– Always include a reject link.
– Always test for null values before using a column in a function.
– Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later.
– Be aware of column and stage variable data types. Often the user does not pay attention to the stage variable type.
– Avoid type conversions. Try to maintain the data type as imported.
The Copy Stage
With 1 link in, 1 link out, the Copy stage is the ultimate "no-op" (place-holder):
– Partitioners
– Sort / Remove Duplicates
– Rename, Drop column
… can be inserted on:
– input link (Partitioning): Partitioners, Sort, Remove Duplicates
– output link (Mapping page): Rename, Drop
Sometimes it can replace the transformer.
Developing Jobs
1. Keep it simple
• Jobs with many stages are hard to debug and maintain.
2. Start small and Build to final Solution
• Use view data, copy, and peek.
• Start from source and work out.
• Develop with a 1 node configuration file.
3. Solve the business problem before the performance problem.
• Don’t worry too much about partitioning until the sequential flow works as expected.
4. If you have to write to disk, use a persistent data set.
Final Result
Good Things to Have in each Job
Use job parameters
Some helpful environmental variables to add to job parameters
– $APT_DUMP_SCORE: report OSH to message log
– $APT_CONFIG_FILE: establishes runtime parameters to the EE engine, e.g., degree of parallelization
Setting Job Parameters
Click to add environment variables
DUMP SCORE Output
Setting APT_DUMP_SCORE yields:
Double-click to see mapping: node --> partition
Partitioner and collector
Exercise
Complete exercise 3-1
Module 4
DBMS Access
Objectives
Understand how DSEE reads and writes records
to an RDBMS
Understand how to handle nulls on DBMS lookup
Utilize this knowledge to:
– Read and write database tables
– Use database tables to lookup data
– Use null handling options to clean data
Parallel Database Connectivity
Traditional client-server:
– Only the RDBMS is running in parallel
– Each application has only one connection
– Suitable only for small data volumes
Enterprise Edition:
– Parallel server runs APPLICATIONS
– Application has parallel connections to RDBMS
– Suitable for large data volumes
– Higher levels of integration possible
(Figure: many clients each holding a single connection to a parallel RDBMS, versus EE stages such as Sort and Load holding parallel connections to the parallel RDBMS)
RDBMS Access: Supported Databases
Enterprise Edition provides high-performance, scalable interfaces for:
DB2
Informix
Oracle
Teradata
RDBMS Access
Automatically convert RDBMS table layouts to/from
Enterprise Edition Table Definitions
RDBMS nulls converted to/from nullable field values
Support for standard SQL syntax for specifying:
– field list for SELECT statement
– filter for WHERE clause
Can write an explicit SQL query to access RDBMS
EE supplies additional information in the SQL query
RDBMS Stages
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
RDBMS Source – Stream Link
Stream link
DBMS Source - User-defined SQL
Columns in SQL statement must match the meta data in the Columns tab
DBMS Source – Reference Link
Reject link
Lookup Reject Link
"Output" option automatically creates the reject link
Null Handling
Must handle null condition if lookup record is not
found and "continue" option is chosen
Can be done in a Transformer stage
Lookup Stage Mapping
Link name
Lookup Stage Properties
Reference link
Must have same column name in input and reference links. You will get the results of the lookup in the output column.
DBMS as a Target
DBMS As Target
Write Methods
– Delete
– Load
– Upsert
– Write (DB2)
Write mode for load method
– Truncate
– Create
– Replace
– Append
Target Properties
Upsert mode determines options
Generated code can be copied
Checking for Nulls
Use Transformer stage to test for fields with null
values (use IsNull functions)
In Transformer, can reject or load default value
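The test-then-reject-or-default pattern above can be sketched outside DataStage. This is illustrative Python, not DataStage code; the column names (cust_id, region) and the default value are hypothetical:

```python
# Illustrative sketch (not DataStage code): after a lookup run with the
# "continue" option, unmatched rows carry None (null) in the looked-up
# columns. A transformer-style step applies an IsNull-like test and either
# loads a default value or sends the row down a reject link.

DEFAULT_REGION = "UNKNOWN"   # hypothetical default loaded for null lookups

def handle_nulls(rows):
    clean, rejects = [], []
    for row in rows:
        if row["region"] is None:            # IsNull-style test
            if row["cust_id"] is None:       # key itself is null: reject
                rejects.append(row)
                continue
            row = {**row, "region": DEFAULT_REGION}   # load default value
        clean.append(row)
    return clean, rejects

rows = [{"cust_id": 1, "region": "EMEA"},
        {"cust_id": 2, "region": None},       # lookup failed for this row
        {"cust_id": None, "region": None}]    # bad key: rejected
clean, rejects = handle_nulls(rows)
```

The same decision (substitute a default vs. reject) maps onto a Transformer constraint plus a derivation in the job itself.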
Exercise
Complete exercise 4-2
Module 5
Platform Architecture
Objectives
Understand how Enterprise Edition Framework
processes data
You will be able to:
– Read and understand OSH
– Perform troubleshooting
Concepts
The Enterprise Edition Platform
– Script language - OSH (generated by DataStage Parallel Canvas, and run by DataStage Director)
– Communication - conductor, section leaders, players
– Configuration files (only one active at a time; describes H/W)
– Meta data - schemas/tables
– Schema propagation - RCP
– EE extensibility - Buildop, Wrapper
– Datasets (data in the Framework's internal representation)
EE Stages Involve A Series Of Processing Steps
DS-EE Stage Elements
Input Data Set schema: prov_num:int16;member_num:int8;custid:int32;
Output Data Set schema: prov_num:int16;member_num:int8;custid:int32;
(Figure: an EE stage consists of a partitioner feeding the stage's business logic)
EE Stage
• Piece of application logic running against individual records
• Parallel or sequential
Dual Parallelism Eliminates Bottlenecks!
DSEE Stage Execution
• EE delivers parallelism in two ways
– Pipeline
– Partition
• Block buffering between components
– Eliminates need for program load balancing
– Maintains orderly data flow
(Figure: producer and consumer stages connected by pipeline parallelism and partition parallelism)
Stages Control Partition Parallelism
Execution Mode (sequential/parallel) is controlled by Stage
– default = parallel for most Ascential-supplied Stages
– Developer can override default mode
– Parallel Stage inserts the default partitioner (Auto) on its input links
– Sequential Stage inserts the default collector (Auto) on its input links
– Developer can override defaults:
execution mode (parallel/sequential) of Stage > Advanced tab
choice of partitioner/collector on Input > Partitioning tab
How Parallel Is It?
Degree of parallelism is determined by the
configuration file
– Total number of logical nodes in the default pool, or a subset if using "constraints"
– Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage
OSH
DataStage EE GUI generates OSH scripts
– Ability to view OSH is turned on in Administrator
– OSH can be viewed in Designer using job properties
The Framework executes OSH
What is OSH? – Orchestrate shell
– Has a UNIX command-line interface
OSH Script
An osh script is a quoted string which
specifies:
Orchestrate step
– In its simplest form, it is:
osh "op < in.ds > out.ds"
Where:
– op is an Orchestrate operator
– in.ds is the input data set
– out.ds is the output data set
Enable Visible OSH in Administrator
Will be enabled for all projects
View OSH in Designer
Schema
Operator
OSH Practice
Exercise 5-1 – Instructor demo (optional)
Elements of a Framework Program
• Operators
• Datasets: set of rows processed by the Framework
– Orchestrate data sets:
– persistent (terminal) *.ds, and
– virtual (internal) *.v
– Also: flat "file sets" *.fs
• Schema: data description (metadata) for datasets and links.
Datasets
• Consist of partitioned data and schema
• Can be persistent (*.ds) or virtual (*.v, link)
• Overcome 2 GB File Limit
(Figure: what you program in the GUI generates OSH such as $ osh "operator_A > x.ds"; what gets processed is Operator A running on Nodes 1 through 4, each writing data files of x.ds: multiple files per partition, each up to 2 GB or larger)
Computing Architectures: Definition
Dedicated Disk – Uniprocessor:
• PC
• Workstation
• Single-processor server
Shared Memory/Disk – SMP System (Symmetric Multiprocessor):
• IBM, Sun, HP, Compaq
• 2 to 64 processors
• Majority of installations
Shared Nothing – Clusters and MPP Systems:
• 2 to hundreds of processors
• MPP: IBM and NCR Teradata
• Each node is a uniprocessor or SMP
(Figure: each architecture shown with its CPUs, memory, and disks)
Working with Configuration Files
You can easily switch between config files:
'1-node' file - for sequential execution, lighter reports; handy for testing
'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism
'BigN-nodes' file - aims at full data-partitioned parallelism
Only one file is active while a step is running
The Framework queries (first) the environment variable $APT_CONFIG_FILE
The number of nodes declared in the config file need not match the number of CPUs
The same configuration file can be used in development and production
Scheduling: Nodes, Processes, and CPUs
DS/EE does not:
– know how many CPUs are available
– schedule
Who does what?
– DS/EE creates (Nodes * Ops) Unix processes
– The O/S schedules these processes on the CPUs
Where:
– Nodes = # logical nodes declared in config. file
– Ops = # ops. (approx. # blue boxes in V.O.)
– Processes = # Unix processes
– CPUs = # available CPUs
Who knows what?
                Nodes   Ops   Processes     CPUs
User            Y       -     -             N
Orchestrate     Y       Y     Nodes * Ops   N
O/S             -       -     Nodes * Ops   Y
Configuring DSEE – Node Pools
{
    node "n1" {
        fastname "s1"
        pools "" "n1" "s1" "app2" "sort"
        resource disk "/orch/n1/d1" {}
        resource disk "/orch/n1/d2" {}
        resource scratchdisk "/temp" {"sort"}
    }
    node "n2" {
        fastname "s2"
        pools "" "n2" "s2" "app1"
        resource disk "/orch/n2/d1" {}
        resource disk "/orch/n2/d2" {}
        resource scratchdisk "/temp" {}
    }
    node "n3" {
        fastname "s3"
        pools "" "n3" "s3" "app1"
        resource disk "/orch/n3/d1" {}
        resource scratchdisk "/temp" {}
    }
    node "n4" {
        fastname "s4"
        pools "" "n4" "s4" "app1"
        resource disk "/orch/n4/d1" {}
        resource scratchdisk "/temp" {}
    }
}
Configuring DSEE – Disk Pools
{
    node "n1" {
        fastname "s1"
        pools "" "n1" "s1" "app2" "sort"
        resource disk "/orch/n1/d1" {}
        resource disk "/orch/n1/d2" {"bigdata"}
        resource scratchdisk "/temp" {"sort"}
    }
    node "n2" {
        fastname "s2"
        pools "" "n2" "s2" "app1"
        resource disk "/orch/n2/d1" {}
        resource disk "/orch/n2/d2" {"bigdata"}
        resource scratchdisk "/temp" {}
    }
    node "n3" {
        fastname "s3"
        pools "" "n3" "s3" "app1"
        resource disk "/orch/n3/d1" {}
        resource scratchdisk "/temp" {}
    }
    node "n4" {
        fastname "s4"
        pools "" "n4" "s4" "app1"
        resource disk "/orch/n4/d1" {}
        resource scratchdisk "/temp" {}
    }
}
Re-Partitioning
Parallel-to-parallel flow may incur reshuffling:
Records may jump between nodes
(Figure: a partitioner redistributing records between node 1 and node 2)
Partitioning Methods
Auto
Hash
Entire
Range
Range Map
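Among the methods listed, hash partitioning can be sketched outside DataStage. This is illustrative Python, not the EE engine's implementation; the key column name (acct) is hypothetical:

```python
# Illustrative sketch (not EE internals): hash partitioning routes each
# record to a partition computed from a hash of the key column, so records
# with equal key values always land in the same partition -- the property
# that keyed operations (join, merge, sort-based aggregation) rely on.

def hash_partition(rows, key, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        p = hash(row[key]) % n_partitions   # same key -> same partition
        partitions[p].append(row)
    return partitions

rows = [{"acct": a} for a in ("A", "B", "A", "C", "B")]
parts = hash_partition(rows, "acct", 2)
# all rows survive partitioning, and each key value lives in one partition
```

Entire partitioning, by contrast, would copy the full dataset to every partition, which is why it suits small lookup tables rather than large streams.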
Collectors
• Collectors combine partitions of a dataset into a single input stream to a sequential Stage
(Figure: data partitions flowing through a collector into a sequential Stage)
– Collectors do NOT synchronize data
Partitioning and Repartitioning Are Visible On Job Design
Setting a Node Constraint in the GUI
Reading Messages in Director
Set APT_DUMP_SCORE to true
Can be specified as a job parameter
Messages sent to Director log
If set, parallel job will produce a report showing the operators, processes, and datasets in the running job
Messages With APT_DUMP_SCORE= True
Exercise
Complete exercise 5-2
Module 6
Transforming Data
Module Objectives
Understand ways DataStage allows you to
transform data
Use this understanding to:
– Create column derivations using user-defined code or system functions
– Filter records based on business criteria
– Control data flow based on data conditions
Transformed Data
Transformed data is:
– Outgoing column is a derivation that may, or may not, include incoming fields or parts of incoming fields
– May be comprised of system variables
Frequently uses functions performed on something (i.e., incoming columns)
– Divided into categories, e.g.:
Date and time
Mathematical
Logical
Null handling
More
Transformer Stage Functions
Control data flow
Create derivations
Flow Control
Separate records flow down links based on data
condition, specified in Transformer stage constraints
Transformer stage can filter records
Other stages can filter records but do not exhibit advanced flow control
– Sequential can send bad records down reject link
– Lookup can reject records based on lookup failure
– Filter can select records based on data value
Rejecting Data
Reject option on Sequential stage
– Data does not agree with meta data
– Output consists of one column with binary data type
Reject links (from Lookup stage) result from the drop option of the "If Not Found" property
– Lookup "failed"
– All columns on reject link (no column mapping option)
Reject constraints are controlled from the constraint editor of the Transformer
– Can control column mapping
– Use the "Other/Log" checkbox
Rejecting Data Example
"If Not Found" property
Constraint – Other/Log option
Property Reject Mode = Output
Transformer Stage Properties
Transformer Stage Variables
First of transformer stage entities to execute
Execute in order from top to bottom
– Can write a program by using one stage variable to point to the results of a previous stage variable
Multi-purpose:
– Counters
– Hold values from previous rows to make comparisons
– Hold derivations to be used in multiple field derivations
– Can be used to control execution of constraints
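The accumulator use of stage variables can be sketched outside DataStage. This is illustrative Python, not Transformer code; the column names (key, amount, subtotal) are hypothetical, and the sketch assumes the input is already sorted on the key:

```python
# Sketch of the stage-variable pattern: variables evaluate top to bottom
# once per row and can hold values carried over from the previous row.
# Here prev_key and running_total play the role of two "stage variables"
# keeping a per-key subtotal, breaking on change of the key column
# (which is why pre-sorted input is assumed).

def subtotals(sorted_rows):
    prev_key, running_total = None, 0        # the "stage variables"
    out = []
    for row in sorted_rows:
        if row["key"] != prev_key:           # break on change of column value
            running_total = 0
        running_total += row["amount"]
        prev_key = row["key"]
        out.append({**row, "subtotal": running_total})
    return out

rows = [{"key": "A", "amount": 10}, {"key": "A", "amount": 5},
        {"key": "B", "amount": 7}]
result = subtotals(rows)
```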
Stage Variables
Show/Hide button
Transforming Data
Derivations
– Using expressions
– Using functions (date/time, etc.)
Transformer Stage Issues
– Sometimes requires sorting before the Transformer stage, e.g., when using a stage variable as an accumulator and needing to break on change of a column value
Checking for nulls
Checking for Nulls
Nulls can get introduced into the dataflow
because of failed lookups and the way in which you chose to handle this condition
Can be handled in constraints, derivations, stage variables, or a combination of these
Transformer - Handling Rejects
Constraint Rejects
– All expressions are false and reject row is checked
Transformer: Execution Order
• Derivations in stage variables are executed first
• Constraints are executed before derivations
• Column derivations in earlier links are executed before later links
• Derivations in higher columns are executed before lower columns
Parallel Palette - Two Transformers
All > Processing > Parallel > Processing
Transformer
– Is the non-Universe transformer
– Has a specific set of functions
– No DS routines available
Basic Transformer
– Makes server-style transforms available on the parallel palette
– Can use DS routines
• Program in Basic for both transformers
Transformer Functions From Derivation Editor
Date & Time
Logical
Null Handling
Number
String
Type Conversion
Exercise
Complete exercises 6-1, 6-2, and 6-3
Module 7
Sorting Data
Objectives
Understand DataStage EE sorting options
Use this understanding to create sorted lists of data to enable functionality within a Transformer stage
Sorting Data
Important because
– Some stages require sorted input
– Some stages may run faster, e.g., Aggregator
Can be performed:
– As an option within stages (use Input > Partitioning tab and set partitioning to anything other than Auto)
– As a separate stage (more complex sorts)
Sorting Alternatives
• Alternative representation of same flow:
Sort Option on Stage Link
Sort Stage
Sort Utility
DataStage – the default
UNIX
Sort Stage - Outputs
Specifies how the output is derived
Removing Duplicates
Can be done by Sort stage
– Use unique option
OR
Remove Duplicates stage
– Has more sophisticated ways to remove duplicates
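The sort-then-unique idea can be sketched outside DataStage. This is illustrative Python, not the Sort or Remove Duplicates stage itself; the column names (id, v) and the keep option are hypothetical stand-ins for the stage's retain-first/retain-last choice:

```python
# Sketch of sort-unique: after sorting on the key, duplicates are adjacent,
# so one pass that keeps only the first row of each key group removes them.
# The keep="last" option mimics retaining the most recent duplicate instead.

def sort_unique(rows, key, keep="first"):
    rows = sorted(rows, key=lambda r: r[key])
    out = []
    for row in rows:
        if out and out[-1][key] == row[key]:
            if keep == "last":
                out[-1] = row       # replace with the most recent duplicate
            continue                # keep == "first": drop the duplicate
        out.append(row)
    return out

rows = [{"id": 2, "v": "x"}, {"id": 1, "v": "y"},
        {"id": 2, "v": "z"}]
deduped = sort_unique(rows, "id")
```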
Exercise
Complete exercise 7-1
Module 8
Combining Data
Objectives
Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages
Use this understanding to create jobs that will
– Combine data from separate input streams
– Aggregate data to form summary totals
Combining Data
There are two ways to combine data:
– Horizontally: several input links; one output link (+ optional rejects) made of columns from different input links. E.g., Join, Lookup, Merge
– Vertically: one input link; one output link with columns combining values from all input rows. E.g., Aggregator
Join, Lookup & Merge Stages
These three stages combine two or more input links according to values of user-designated "key" column(s).
They differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)
Join Stage Editor
One of four variants:
– Inner
– Left Outer
– Right Outer
– Full Outer
Several key columns allowed
Link order is immaterial for Inner and Full Outer joins (but VERY important for Left/Right Outer, and for Lookup and Merge)
1. The Join Stage
Four types:
• Inner
• Left Outer
• Right Outer
• Full Outer
2 sorted input links, 1 output link
– "left outer" on primary input, "right outer" on secondary input
– Pre-sort makes joins "lightweight": few rows need to be in RAM
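The left outer variant can be sketched outside DataStage. This is illustrative Python, not the Join stage's streaming implementation (which relies on the pre-sort rather than scanning the right input per row); the column names (cust, amt, name) are hypothetical:

```python
# Sketch of a left outer join: every row from the primary (left) input is
# kept. Matched rows pick up the secondary (right) columns; unmatched left
# rows are padded with None in the right-hand columns. Duplicate keys on
# both sides would produce a cross-product, as the synopsis table notes.

def left_outer_join(left, right, key):
    right_cols = {c for r in right for c in r if c != key}
    out = []
    for l in sorted(left, key=lambda r: r[key]):
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **m} for m in matches)
        else:
            out.append({**l, **{c: None for c in right_cols}})
    return out

orders = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}]
names  = [{"cust": 1, "name": "Ann"}]
joined = left_outer_join(orders, names, "cust")
```

Swapping which input is primary is what makes link order matter for the outer variants.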
2. The Lookup Stage
Combines:
– one source link with
– one or more duplicate-free table links
no pre-sort necessary
allows multiple-key LUTs
flexible exception handling for source input rows with no match
(Figure: the Lookup stage combines a source input and one or more tables (LUTs) into an output link, with an optional reject link)
The Lookup Stage
Lookup tables should be small enough to fit into physical memory (otherwise, performance hit due to paging)
On an MPP you should partition the lookup tables using the Entire partitioning method, or partition them the same way you partition the source link
On an SMP, no physical duplication of a lookup table occurs
The Lookup Stage
Lookup File Set
– Like a persistent data set, except it contains metadata about the key
– Useful for staging lookup tables
RDBMS LOOKUP
– NORMAL: loads into an in-memory hash table first
– SPARSE: select for each row; might become a performance bottleneck
3. The Merge Stage
Combines
– one sorted, duplicate-free master (primary) link with
– one or more sorted update (secondary) links
– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup)
Follows the Master-Update model:
– Master row and one or more update rows are merged if they have the same value in user-specified key column(s)
– If a non-key column occurs in several inputs, the lowest input port number prevails (e.g., master over update; update values are ignored)
– Unmatched ("bad") master rows can be either kept or dropped
– Unmatched ("bad") update rows in an input link can be captured in a "reject" link
– Matched update rows are consumed
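The Master-Update rules above can be sketched outside DataStage. This is illustrative Python, not the Merge stage's sorted-stream implementation; the column names (id, status, limit) are hypothetical:

```python
# Sketch of the master-update model: a duplicate-free master is merged with
# update rows on the key. On a non-key column clash the master (lower input
# port) prevails; unmatched update rows are captured as rejects; matched
# update rows are consumed into the merged output.

def merge(master, updates, key):
    master_by_key = {m[key]: dict(m) for m in master}  # assumed duplicate-free
    rejects = []
    for u in updates:
        m = master_by_key.get(u[key])
        if m is None:
            rejects.append(u)              # "bad" update row captured
        else:
            for col, val in u.items():
                m.setdefault(col, val)     # master value wins on clashes
    return list(master_by_key.values()), rejects

master  = [{"id": 1, "status": "gold"}, {"id": 2, "status": "new"}]
updates = [{"id": 1, "status": "old", "limit": 500}, {"id": 3, "limit": 100}]
merged, rejects = merge(master, updates, "id")
```

The real stage avoids the in-memory dictionary by requiring both inputs pre-sorted, which is exactly the "lightweight" property noted above.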
The Merge Stage
Allows composite keys
Multiple update links
Matched update rows are consumed
Unmatched updates can be captured
Lightweight
Space/time tradeoff: pre-sorts vs. in-RAM table
(Figure: the Merge stage combines a master link and one or more update links into an output link, with reject links for unmatched update rows)
Synopsis:
Joins, Lookup, & Merge
                                   Joins                        Lookup                              Merge
Model                              RDBMS-style relational       Source - in-RAM LU Table            Master - Update(s)
Memory usage                       light                        heavy                               light
# and names of inputs              exactly 2: 1 left, 1 right   1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort               both inputs                  no                                  all inputs
Duplicates in primary input        OK (x-product)               OK                                  Warning!
Duplicates in secondary input(s)   OK (x-product)               Warning!                            OK only when N = 1
Options on unmatched primary       NONE                         [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary     NONE                         NONE                                capture in reject set(s)
On match, secondary entries are    reusable                     reusable                            consumed
# Outputs                          1                            1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)          nothing (N/A)                unmatched primary entries           unmatched secondary entries

In this table:
• , <comma> = separator between primary and secondary input links (out and reject links)
The Aggregator Stage
Purpose: Perform data aggregations
Specify:
Zero or more key columns that define the aggregation units (or groups)
Columns to be aggregated
Aggregation functions: count (nulls/non-nulls), sum, max/min/range
The grouping method (hash table or pre-sort) is a performance issue
Grouping Methods
Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
– doesn't require sorted data
– good when the number of unique groups is small. The running tally for each group's aggregate calculations needs to fit easily into memory; requires about 1 KB of RAM per group.
– Example: average family income by state requires about 0.05 MB of RAM
Sort: results for only a single aggregation group are kept in memory; when a new group is seen (key value changes), the current group is written out
– requires input sorted by grouping keys
– can handle unlimited numbers of groups
– Example: average daily balance by credit card
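The two grouping methods can be sketched outside DataStage. This is illustrative Python, not the Aggregator's implementation; the column names (state, inc) are hypothetical, and sum stands in for any of the aggregation functions:

```python
# Sketch of the two grouping methods. Hash grouping keeps one running tally
# per distinct key and emits everything at the end (no sort needed, memory
# grows with the number of groups). Sort grouping assumes input sorted on
# the key and emits each group as soon as the key changes, so only one
# tally is ever in memory (unlimited groups).

def hash_group_sum(rows, key, col):
    tallies = {}                              # one entry per distinct group
    for r in rows:
        tallies[r[key]] = tallies.get(r[key], 0) + r[col]
    return tallies

def sort_group_sum(sorted_rows, key, col):
    out, cur_key, total = {}, None, 0
    for r in sorted_rows:
        if r[key] != cur_key:
            if cur_key is not None:
                out[cur_key] = total          # key changed: emit the group
            cur_key, total = r[key], 0
        total += r[col]
    if cur_key is not None:
        out[cur_key] = total                  # flush the final group
    return out

rows = [{"state": "CA", "inc": 1}, {"state": "CA", "inc": 2},
        {"state": "NY", "inc": 3}]
```

Both methods produce the same totals on sorted input; the choice is purely the memory-vs-presort tradeoff described above.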
Aggregator Functions
Sum
Min, max
Mean
Missing value count
Non-missing value count
Percent coefficient of variation
Aggregator Properties
Aggregation Types
Containers
Two varieties
– Local
– Shared
Local
– Simplifies a large, complex diagram
Shared
– Creates a reusable object that many jobs can include
Creating a Container
Create a job
Select (loop) portions to containerize
Edit > Construct container > local or shared
Using a Container
Select as though it were a stage
Exercise
Complete exercise 8-1
Module 9
Configuration Files
Objectives
Understand how DataStage EE uses configuration files to determine parallel behavior
Use this understanding to:
– Build an EE configuration file for a computer system
– Change node configurations to support adding resources to processes that need them
– Create a job that will change resource allocations at the stage level
Configuration File Concepts
Determine the processing nodes and disk space connected to each node
When the system changes, need only change the configuration file – no need to recompile jobs
When a DataStage job runs, the platform reads the configuration file
– Platform automatically scales the application to fit the system
Processing Nodes Are
Locations on which the framework runs applications
Logical rather than physical constructs
Do not necessarily correspond to the number of CPUs in your system
– Typically one node for two CPUs
Can define one processing node for multiple physical nodes, or multiple processing nodes for one physical node
Optimizing Parallelism
Degree of parallelism determined by the number of nodes defined
Parallelism should be optimized, not maximized
– Increasing parallelism distributes the work load but also increases Framework overhead
Hardware influences the degree of parallelism possible
System hardware partially determines configuration
More Factors to Consider
Communication amongst operators
– Should be optimized by your configuration
– Operators exchanging large amounts of data should be assigned to nodes that communicate by shared memory or a high-speed link
SMP – leave some processors for the operating system
Desirable to equalize partitioning of data
Use an experimental approach
– Start with small data sets
– Try different degrees of parallelism while scaling up data set sizes
Factors Affecting Optimal Degree of Parallelism
CPU-intensive applications
– Benefit from the greatest possible parallelism
Disk-intensive applications
– Number of logical nodes equals the number of disk spindles being accessed
Configuration File
Text file containing string data that is passed to the Framework
– Sits on the server side
– Can be displayed and edited
Name and location found in the environment variable APT_CONFIG_FILE
Components
– Node
– Fast name
– Pools
– Resource
Node Options
Node name – name of a processing node used by EE
– Typically the network name
– Use the command uname -n to obtain the network name
Fastname
– Name of the node as referred to by the fastest network in the system
– Operators use the physical node name to open connections
– NOTE: on an SMP, all CPUs share a single connection to the network
Pools
– Names of pools to which this node is assigned
– Used to logically group nodes
– Can also be used to group resources
Resource
– Disk
– Scratchdisk
Sample Configuration File
{
  node "Node1"
  {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}
Disk Pools
Disk pools allocate storage
By default, EE uses the default pool, specified by ""
A named pool (e.g., pool "bigdata") can be used instead
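A hedged sketch of how a named pool appears in a configuration file (the node name and paths here are hypothetical); a disk resource listed under pools "bigdata" is used only by stages constrained to that pool:

```
node "node1" {
  fastname "server1"
  pools ""
  resource disk "/data/big/d1" {pools "bigdata"}
  resource scratchdisk "/scratch/s1" {pools ""}
}
```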
Sorting Requirements
Resource pools can also be specified for sorting:
The Sort stage looks first for scratch disk resources in a "sort" pool, and then in the default disk pool. In the following example, node "n1" provides a "sort" scratch disk:
Another Configuration File Example

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
  ...
}
Resource Types
Disk
Scratchdisk
DB2
Oracle
Saswork
Sortwork
Can exist in a pool
– Groups resources together
Using Different Configurations
Lookup stage where the DBMS uses a sparse lookup type
Building a Configuration File
Scoping the hardware:
– Is the hardware configuration SMP, Cluster, or MPP?
– Define each node structure (an SMP would be a single node):
  Number of CPUs
  CPU speed
  Available memory
  Available page/swap space
  Connectivity (network/back-panel speed)
– Is the machine dedicated to EE? If not, what other applications are running on it?
– Get a breakdown of the resource usage (vmstat, mpstat, iostat)
– Are there other configuration restrictions? E.g., the DB only runs on certain nodes and ETL cannot run on them?
Exercise
Complete exercise 9-1 and 9-2
Module 10
Extending DataStage EE
Objectives
Understand the methods by which you can add functionality to EE
Use this understanding to:
– Build a DataStage EE stage that handles special processing needs not supplied by the vanilla stages
– Build a DataStage EE job that uses the new stage
When To Leverage EE Extensibility
Types of situations:
Complex business logic, not easily accomplished using standard EE stages
Reuse of existing C, C++, Java, COBOL, etc. code
Wrappers vs. Buildop vs. Custom
Wrappers are good if you cannot or do not want to modify the application and performance is not critical.
Buildops are good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces.
Custom (C++ coding using the framework API) is good if you need custom coding and need dynamic input and output interfaces.
Building “Wrapped” Stages
You can "wrapper" a legacy executable:
– Binary
– Unix command
– Shell script
… and turn it into an Enterprise Edition stage capable, among other things, of parallel execution, as long as the legacy executable is:
– Amenable to data-partition parallelism (no dependencies between rows)
– Pipe-safe (can read rows sequentially; no random access to data)
Wrappers (Cont’d)
Wrappers are treated as a black box
– EE has no knowledge of the contents
– EE has no means of managing anything that occurs inside the wrapper
– EE only knows how to export data to and import data from the wrapper
– The user must know at design time the intended behavior of the wrapper and its schema interface
If the wrapped application needs to see all records prior to processing, it cannot run in parallel.
LS Example
Can this command be wrappered?
Creating a Wrapper
Used in this job ---
To create the "ls" stage
Creating Wrapped Stages
Wrapper Starting Point
From Manager: right-click on Stage Type > New Parallel Stage > Wrapped
We will "wrapper" an existing Unix executable – the ls command
Wrapper - General Page
Unix command to be wrapped
Name of stage
The "Creator" Page
Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.
Wrapper – Properties Page
If your stage will have properties, complete the Properties page
This will be the name of the property as it appears in your stage
Wrapper - Wrapped Page
Interfaces – input and output columns. These should first be entered into the table definitions meta data (DS Manager); let's do that now.
Interface schemas
• Layout interfaces describe what columns the stage:
– Needs for its inputs (if any)
– Creates for its outputs (if any)
• Should be created as tables with columns in Manager
Column Definition for Wrapper Interface
How Does the Wrapping Work?
– Define the schema for export and import
– Schemas become interface schemas of the operator and allow for by-name column access
Data flow: input schema → export → stdin or named pipe → UNIX executable → stdout or named pipe → import → output schema
QUIZ : Why does export precede import?
Update the Wrapper Interfaces
This wrapper will have no input interface – i.e., no input link. The location will come as a job parameter that will be passed to the appropriate stage property. Therefore, only the Output tab entry is needed.
Resulting Job
Wrapped stage
Wrapper Story: Cobol Application
Hardware Environment:
– IBM SP2, 2 nodes with 4 CPUs per node
Software:
– DB2/EEE, COBOL, EE
Original COBOL Application:
– Extracted the source table, performed a lookup against a table in DB2, and loaded the results to a target table
– 4 hours 20 minutes sequential execution
Enterprise Edition Solution:
– Used EE to perform parallel DB2 extracts and loads
– Used EE to execute the COBOL application in parallel
– The EE Framework handled data transfer between DB2/EEE and the COBOL application
– 30 minutes 8-way parallel execution
Buildops
Buildop provides a simple means of extending beyond the functionality provided by EE, but does not use an existing executable (unlike the wrapper).
Reasons to use Buildop include:
Speed / performance
Complex business logic that cannot be easily represented using existing stages
– Lookups across a range of values
– Surrogate key generation
– Rolling aggregates
Build once and reuse everywhere within the project; no shared container necessary
Can combine functionality from different stages into one
BuildOps
– The DataStage programmer encapsulates the business logic
– The Enterprise Edition interface called "buildop" automatically performs the tedious, error-prone tasks: invoking needed header files and building the necessary "plumbing" for correct and efficient parallel execution
– Exploits the extensibility of the EE Framework
BuildOp Process Overview
From Manager (or Designer), Repository pane:
Right-click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}
• "Build" stages from within Enterprise Edition
• "Wrapping" existing Unix executables
General Page
Identical to Wrappers, except: under the Build tab, your program!

Logic Tab for Business Logic
Enter business C/C++ logic and arithmetic in four pages under the Logic tab
Main code section goes in the Per-Record page – it will be applied to all rows
NOTE: Code will need to be ANSI C/C++ compliant. If code does not compile outside of EE, it won't compile within EE either!
Code Sections under Logic Tab
Temporary variables are declared [and initialized] here
Logic here is executed once BEFORE processing the FIRST row
Logic here is executed once AFTER processing the LAST row
I/O and Transfer
Under the Interface tab: Input, Output & Transfer pages
Screenshot callouts:
– Input page: 'Auto Read' – read next row
– In-Repository Table Definition
– Optional renaming of the output port from the default "out0"
– Write row
– 'False' setting, so as not to interfere with the Transfer page
– First line: output 0
I/O and Transfer
• Transfer all columns from input to output.
• If the page is left blank or Auto Transfer = "False" (and RCP = "False"), only the columns in the output Table Definition are written.
First line: transfer of index 0
BuildOp Simple Example
Example - sumNoTransfer
– Add input columns "a" and "b"; ignore other columns that might be present in the input
– Produce a new "sum" column
– Do not transfer input columns

sumNoTransfer
a:int32; b:int32
sum:int32
From Peek:
No Transfer
NO TRANSFER
- RCP set to "False" in the stage definition, and
- Transfer page left blank, or Auto Transfer = "False"
• Effects:
- Input columns "a" and "b" are not transferred
- Only the new column "sum" is transferred
Compare with transfer ON…
Transfer
TRANSFER
- RCP set to "True" in the stage definition, or
- Auto Transfer set to "True"
• Effects:
- The new column "sum" is transferred, as well as
- Input columns "a" and "b", and
- Input column "ignored" (present in the input, but not mentioned in the stage)
Columns vs. Temporary C++ Variables

Columns:
– DS-EE type
– Defined in Table Definitions
– Value refreshed from row to row

Temporary C++ variables:
– C/C++ type
– Need declaration (in the Definitions or Pre-Loop page)
– Value persistent throughout the "loop" over rows, unless modified in code
Exercise
Complete exercise 10-1 and 10-2
Exercise
Complete exercises 10-3 and 10-4
Custom Stage
Reasons for a custom stage:
– Add EE operator not already in DataStage EE
– Build your own Operator and add to DataStage EE
– Use the EE API
Use a Custom Stage to add the new operator to the EE canvas
Custom Stage
DataStage Manager > select the Stage Types branch > right-click
Custom Stage
Number of input and output links allowed
Name of the Orchestrate operator to be used
Custom Stage – Properties Tab
The Result
Module 11
Meta Data in EE

Objectives
Understand how EE uses meta data, particularly schemas and runtime column propagation
Use this understanding to:
– Build schema definition files to be invoked in DataStage jobs
– Use RCP to manage meta data usage in EE jobs
Establishing Meta Data
Data definitions
– Recordization and columnization
– Fields have properties that can be set at the individual field level
– Data types in the GUI are translated to types used by EE
– Described as properties on the Format/Columns tab (Outputs or Inputs pages), OR
– Using a schema file (can be full or partial)
Schemas
– Can be imported into Manager
– Can be pointed to by some job stages (e.g., Sequential)
Data Formatting – Record Level
Format tab
Meta data described on a record basis
Record-level properties
Data Formatting – Column Level
Defaults for all columns
Column Overrides
Edit row from within the columns tab
Set individual column properties
Extended Column Properties
Field and string settings
Extended Properties – String Type
Note the ability to convert ASCII to EBCDIC
Editing Columns
Properties depend on the data type
Schema
An alternative way to specify column definitions for data used in EE jobs
Written in a plain text file
Can be written as a partial record definition
Can be imported into the DataStage repository
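As an illustrative sketch (the field names here are hypothetical), a schema file uses the Orchestrate record syntax: record-level format properties in braces, followed by the field list with EE data types:

```
record {final_delim=end, delim=',', quote=double}
(
  CustId:    int32;
  Name:      string[max=30];
  Amount:    decimal[8,2];
  OrderDate: date;
)
```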
Creating a Schema
Using a text editor
– Follow the correct syntax for definitions
OR
Import from an existing data set or file set
– In DataStage Manager: Import > Table Definitions > Orchestrate Schema Definitions
– Select the checkbox for a file with a .fs or .ds extension
Importing a Schema
Schema location can be on the server or a local workstation
Data Types
Date
Decimal
Floating point
Integer
String
Time
Timestamp
Vector
Subrecord
Raw
Tagged
Runtime Column Propagation
DataStage EE is flexible about meta data. It can cope with the situation where meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP).
RCP is always on at runtime.
Design-time and compile-time column mapping enforcement:
– RCP is off by default.
– Enable it first at the project level (Administrator, project properties).
– Enable it at the job level (Job Properties, General tab).
– Enable it at the stage level (link Output Column tab).
Enabling RCP at Project Level
Enabling RCP at Job Level
Enabling RCP at Stage Level
Go to the output link's Columns tab
For the Transformer, you can find the output link's Columns tab by first going to stage properties
Using RCP with Sequential Stages
To utilize runtime column propagation in the Sequential stage, you must use the "use schema" option
Stages with this restriction:
– Sequential
– File Set
– External Source
– External Target
Runtime Column Propagation
When RCP is Disabled:
– DataStage Designer will enforce Stage Input Column to Output Column mappings.
– At job compile time, Modify operators are inserted on output links in the generated osh.
Runtime Column Propagation
When RCP is Enabled:
– DataStage Designer will not enforce mapping rules.
– No Modify operator is inserted at compile time.
– Danger of a runtime error if incoming column names do not match the outgoing link's column names – case sensitivity.
Exercise
Complete exercises 11-1 and 11-2
Module 12
Job Control Using the Job Sequencer
Objectives
Understand how the DataStage job sequencer works
Use this understanding to build a control job to run a sequence of DataStage jobs
Job Control Options
Manually write job control
– Code is generated in BASIC
– Use the Job Control tab on the Job Properties page
– Generates BASIC code which you can modify
Job Sequencer
– Build a controlling job much the same way you build other jobs
– Comprised of stages and links
– No BASIC coding
Job Sequencer
Build like a regular job
Type "Job Sequence"
Has stages and links
Job Activity stage represents a DataStage job
Links represent passing control
Stages
Example
Job Activity stage – contains conditional triggers
Job Activity Properties
Job to be executed – select from the dropdown
Job parameters to be passed
Job Activity Trigger
Trigger appears as a link in the diagram
Custom options let you define the code
Options
Use the custom option for conditionals
– Execute if the job run was successful, or had warnings only
Can add a "wait for file" trigger to execution
Add an "execute command" stage to drop real tables and rename new tables to current tables
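A hedged sketch of a custom trigger expression (the activity name JobLoad is hypothetical); the sequencer exposes an upstream activity's status through $JobStatus and the DSJS constants, so a link can fire when the job finished cleanly or with warnings only:

```
JobLoad.$JobStatus = DSJS.RUNOK Or JobLoad.$JobStatus = DSJS.RUNWARN
```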
Job Activity With Multiple Links
Different links having different triggers
Sequencer Stage
Build a job sequencer control job for the collections application
Can be set to All or Any
Notification Stage
Notification
Notification Activity
Sample DataStage log from Mail Notification
E-Mail Message
Notification Activity Message
Exercise
Complete exercise 12-1
Module 13
Testing and Debugging
Objectives
Understand the spectrum of tools used to perform testing and debugging
Use this understanding to troubleshoot a DataStage job
Environment Variables
Parallel Environment Variables
Environment Variables
Stage Specific
Environment Variables
Environment Variables
Compiler
Typical Job Log Messages:
Environment variables
Configuration File information
The Director
Framework Info/Warning/Error messages
Output from the Peek Stage
Additional info with "Reporting" environments
Tracing/Debug output
– Must compile the job in trace mode – adds overhead

Job Level Environment Variables
• Job Properties, from the menu bar of Designer
• Director will prompt you before each run
Troubleshooting
If you get an error during compile, check the following:
Compilation problems
– If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
– If Buildop errors occur, try buildop from the command line
– Some stages may not support RCP – this can cause a column mismatch
– Use the Show Error and More buttons
– Examine the generated OSH
– Check environment variable settings
Very little integrity checking is done during compile; you should run Validate from the Director.
Highlights the source of the error
Generating Test Data
The Row Generator stage can be used
– Column definitions
– Data type dependent
Row Generator plus Lookup stages provides a good way to create robust test data from pattern files