8/11/2019 DataStage-PX ClassBook V0.1ds
http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 1/299
IBM WebSphere DataStage
Introduction To Enterprise Edition
Course Contents
Module 01: Introduction
Module 02: Setting Up Your DataStage Environment
Module 03: Creating Parallel Jobs
Module 04: Accessing Sequential Data
Module 05: Platform Architecture
Module 06: Combining Data
Module 07: Sorting and Aggregating Data
Module 08: Transforming Data
Module 09: Standards and Techniques
Module 10: Accessing Relational Data
Module 11: Compilation and Execution
Module 12: Testing and Debugging
Module 13: Metadata in Enterprise Edition
Module 14: Job Control
Course Objectives
DataStage Clients and Server
Setting up the parallel environment
Importing metadata
Building DataStage jobs
Loading metadata into job stages
Accessing Sequential data
Accessing Relational data
Introducing the Parallel framework architecture
Transforming data
Sorting and aggregating data
Merging data
Configuration files
IBM WebSphere DataStage
Module 01: Introduction
What is IBM WebSphere DataStage?
Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects, such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Schedule, run, and monitor jobs all within DataStage
Administer your DataStage development and execution environments
Create batch (controlling) jobs
DataStage Server and Clients
Windows or Unix Server
Microsoft Windows
Client Logon
DataStage Administrator
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
Define global and project properties in Administrator
Import metadata into the Repository
Manager
Designer Repository View
Build job in Designer
Compile job in Designer
Run and monitor job in Director
DataStage Projects
DataStage Jobs
Parallel jobs
Executed under control of DataStage Server runtime environment
Built-in functionality for Pipeline and Partitioning Parallelism
Compiled into OSH (Orchestrate Scripting Language)
OSH executes Operators
– Executable C++ class instances
Runtime monitoring in DataStage Director
Job Sequences (Batch jobs, Controlling jobs)
Master Server jobs that kick off jobs and other activities
Can kick off Server or Parallel jobs
Runtime monitoring in DataStage Director
Server jobs (Requires Server Edition license)
Executed by the DataStage Server Edition
Compiled into Basic (interpreted pseudo-code)
Runtime monitoring in DataStage Director
Mainframe jobs (Requires Mainframe Edition license)
Compiled into COBOL
Executed on the Mainframe, outside of DataStage
Design Elements of Parallel Jobs
Stages
Implemented as OSH operators (pre-built components)
Passive stages (E and L of ETL)
Read data
Write data
E.g., Sequential File, Oracle, Peek stages
Processor (active) stages (T of ETL)
Transform data
Filter data
Aggregate data
Generate data
Split / Merge data
E.g., Transformer, Aggregator, Join, Sort stages
Links
“Pipes” through which the data moves from stage to stage
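The stage-and-link picture above can be sketched in Python, with generator functions standing in for stages and the hand-off between generators standing in for links. This is an analogy only, not how OSH operators are implemented; every function and column name below is hypothetical.

```python
from typing import Dict, Iterator, List

Row = Dict[str, int]


def row_source(num_rows: int) -> Iterator[Row]:
    """Source stage stand-in: produces rows for the job to process."""
    for i in range(num_rows):
        yield {"id": i, "amount": i * 10}


def keep_nonzero(rows: Iterator[Row]) -> Iterator[Row]:
    """Processing stage stand-in: filters out rows whose amount is zero."""
    for row in rows:
        if row["amount"] != 0:
            yield row


def add_doubled(rows: Iterator[Row]) -> Iterator[Row]:
    """Processing stage stand-in: derives a new column from an existing one."""
    for row in rows:
        yield {**row, "doubled": row["amount"] * 2}


def run_job(num_rows: int) -> List[Row]:
    # Each variable below plays the role of a link: rows flow through it
    # one at a time from the upstream stage to the downstream stage.
    link_source_to_filter = row_source(num_rows)
    link_filter_to_transform = keep_nonzero(link_source_to_filter)
    link_transform_to_target = add_doubled(link_filter_to_transform)
    # Target stage stand-in: collect whatever arrives on the last link.
    return list(link_transform_to_target)
```

Because generators are lazy, a row can reach the last stage before the source has produced the next one, which loosely mirrors data moving through the "pipes" between stages.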
Quiz – True or False?
DataStage Designer is used to build and compile your ETL jobs
Manager is used to execute your jobs after you build them
Director is used to execute your jobs after you build them
Administrator is used to set global and project properties
Introduction to the Lab Exercises
Two types of exercises in this course:
Conceptual exercises
Designed to reinforce a specific module's topics
Provide hands-on experiences with DataStage
Introduced by the word “Concept”
E.g., “Conceptual Lab 01A”
Solution Development exercises
Based on production applications
Provide development examples
Introduced by the word “Solution”
E.g., “Solution Lab 05A”
The Solution Development exercises are introduced and discussed in a later module
Lab Exercises
Conceptual Lab 01A
Install DataStage clients
Test connection to the DataStage Server
Install lab files
IBM WebSphere DataStage
Module 02: Setting Up Your DataStage Environment
Module Objectives
Setting project properties in Administrator
Defining Environment Variables
Importing / Exporting DataStage objects in Manager
Importing Table Definitions defining sources and targets in Manager
Setting Project Properties
Project Properties
Projects can be created and deleted in Administrator
Each project is associated with a directory on the DataStage Server
Project properties, defaults, and environment variables are specified in Administrator
Can be overridden at the job level
01/15/06
Setting Project Properties
To set project properties, log onto Administrator, select your project, and then click “Properties”
Project Properties General Tab
Environment Variables
Permissions Tab
Tracing Tab
Parallel Tab
Sequence Tab
Importing and Exporting DataStage Objects
What Is Metadata?
[Figure: data flows from Source through Transform to Target; Source metadata and Target metadata are held in the Metadata Repository]
Manager Contents
Metadata
Describing sources and targets: Table definitions
Describing inputs / outputs from external routines
Describing inputs and outputs to BuildOp and CustomOp stages
DataStage objects
Jobs
Routines
Compiled jobs / objects
Stages
Import and Export
Any object in Manager can be exported to a file
Can export whole projects
Use for backup
Sometimes used for version control
Can be used to move DataStage objects from one project to another
Use to share DataStage jobs and projects with other developers
Export Procedure
In Manager, click “Export>DataStage Components”
Select DataStage objects for export
Specify type of export:
DSX: Default format
XML: Enables processing of export file by XML applications, e.g., for generating reports
Specify file path on client machine
Quiz - True or False?
You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file.
Quiz - True or False?
The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
Exporting DataStage Objects
Select Objects for Export
Import Procedure
In Manager, click “Import>DataStage Components”
Or “Import>DataStage Components (XML)” if you are importing an XML-format export file
Select DataStage objects for import
Importing DataStage Objects
Import Options
Importing Metadata
Metadata Import
Import format and column definitions from sequential files
Import relational table column definitions
Imported as “Table Definitions”
Table definitions can be loaded into job stages
Table definitions can be used to define Routine and Stage interfaces
Sequential File Import Procedure
In Manager, click Import>Table Definitions>Sequential File Definitions
Select directory containing sequential file and then the file
Select Manager category
Examine format and column definitions and edit if necessary
Importing Sequential Metadata
Sequential Import Window
Specify Format
Specify Column Names and Types
Double-click to define extended properties
Extended Properties window
Property categories
Available properties
Table Definition General Tab
Second-level category
Top-level category
Table Definition Columns Tab
Table Definition Parallel Tab
Table Definition Format Tab
Lab Exercises
Conceptual Lab 02A
Set up your DataStage environment
Conceptual Lab 02B
Import a sequential file Table Definition
IBM WebSphere DataStage
Module 03: Creating Parallel Jobs
Module Objectives
Design a simple Parallel job in Designer
Compile your job
Run your job in Director
View the job log
Creating Parallel Jobs
What Is a Parallel Job?
Executable DataStage program
Created in DataStage Designer
Can use components from Manager Repository
Built using a graphical user interface
Compiles into Orchestrate shell language (OSH) and object code (from generated C++)
Job Development Overview
Import metadata defining sources and targets
Can be done within Designer or Manager
In Designer, add stages defining data extractions and loads
Add processing stages to define data transformations
Add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job
Can also run the job in Designer
Can only view the job log in Director
Designer Work Area
Canvas
Repository
Tools Palette
Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers
Run
Job properties
Compile
Tools Palette
Adding Stages and Links
Drag stages from the Tools Palette to the diagram
Can also be dragged from Stage Type branch to the diagram
Draw links from source to target stage
Press and hold the right mouse button over the source stage
Release mouse button over target stage
Job Creation Example Sequence
Brief walkthrough of procedure
Assumes table definition of source already exists in the repository
Create New Job
Drag Stages and Links From Palette
Peek
Row Generator
Annotation
Renaming Links and Stages
Click on a stage or link to rename it
Meaningful names have many benefits
Documentation
Clarity
Fewer development errors
Row Generator Stage
Produces mock data for specified columns
No input links; single output link
On Properties tab, specify number of rows
On Columns tab, load or specify column definitions
Click Edit Row over a column to specify the values to be generated for that column
A number of algorithms for generating values are available depending on the data type
Algorithms for Integer type
Random: seed, limit
Cycle: Initial value, increment
Algorithms for string type: Cycle, alphabet
Algorithms for date type: Random, cycle
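The cycle and random algorithms listed above can be pictured with a small Python sketch. It is not the Row Generator's implementation: the function names are hypothetical, and the meaning given to "limit" here (an inclusive upper bound starting from 0) is an assumption.

```python
import itertools
import random
from typing import Dict, Iterator, List, Sequence


def cycle_int(initial: int, increment: int) -> Iterator[int]:
    """Cycle algorithm for integers: start at `initial`, step by `increment`."""
    value = initial
    while True:
        yield value
        value += increment


def random_int(seed: int, limit: int) -> Iterator[int]:
    """Random algorithm for integers: seeded, values assumed in 0..limit."""
    rng = random.Random(seed)  # same seed gives the same sequence
    while True:
        yield rng.randint(0, limit)


def cycle_str(values: Sequence[str]) -> Iterator[str]:
    """Cycle algorithm for strings: repeat the given values in order."""
    return itertools.cycle(values)


def generate_rows(num_rows: int, columns: Dict[str, Iterator]) -> List[dict]:
    """Build `num_rows` mock rows, drawing one value per column per row."""
    return [{name: next(gen) for name, gen in columns.items()}
            for _ in range(num_rows)]
```

For example, `generate_rows(3, {"id": cycle_int(100, 5), "code": cycle_str(["A", "B"])})` yields ids 100, 105, 110 paired with codes A, B, A.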
Inside the Row Generator Stage
Properties tab
Set property value
Property
Columns Tab
View data
Load a Table Definition
Select Table Definition
Extended Properties
Specified properties and their values
Additional properties to add
Peek Stage
Displays field values
Displayed in job log or sent to a file
Skip records option
Can control number of records to be displayed
Shows data in each partition, labeled 0, 1, 2, …
Useful stub stage for iterative job development
Develop job to a stopping point and check the data
Peek Stage Properties
Output to job log
Job Parameters
Defined in Job Properties window
Makes the job more flexible
Parameters can be:
Used in directory and file names
Used to specify property values
Used in constraints and derivations
Parameter values are determined at run time
When used for directory and file names and for property values, surround with pound signs (#)
E.g., #NumRows#
Job parameters can reference DataStage and system environment variables
$PROJDEF
$ENV
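The pound-sign convention can be pictured as run-time text substitution. The sketch below is a hypothetical Python illustration, not DataStage's own resolution logic; `resolve_parameters`, `TargetDir`, and the sample path are invented for the example, while `NumRows` comes from the slide.

```python
import re
from typing import Mapping

# A parameter reference is a name wrapped in pound signs, e.g. #NumRows#.
_PARAM_REF = re.compile(r"#([^#\s]+)#")


def resolve_parameters(template: str, params: Mapping[str, object]) -> str:
    """Replace every #Name# in `template` with the value supplied at run time."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in params:
            # A reference with no run-time value is treated as an error here.
            raise KeyError(f"no value supplied for job parameter {name!r}")
        return str(params[name])

    return _PARAM_REF.sub(substitute, template)
```

With `{"TargetDir": "/data", "NumRows": 100}`, the template `#TargetDir#/out_#NumRows#.txt` resolves to `/data/out_100.txt`.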
Defining a Job Parameter
Parameters tab
Parameter
Using a Job Parameter in a Stage
Job parameter surrounded with pound signs
Adding Job Documentation
Job Properties
Short and long descriptions
Shows in Manager
Annotation stage
Added from the Tools Palette
Display formatted text descriptions on diagram
Job Properties Documentation
Documentation
Annotation Stage Properties
Compiling a Job
Compile
Errors or Successful Message
Highlight stage with error
Click for more info
Running Jobs and Viewing the Job Log in Designer
Prerequisite to Job Execution
DataStage Director
Use to run and schedule jobs
View runtime messages
Can invoke from DataStage Manager or Designer
Tools > Run Director
Run Options
Stop after number of warnings
Stop after number of rows
Director Log View
Click the open book icon to view log messages
Peek messages
Message Details
Other Director Functions
Schedule job to run on a particular date/time
Clear job log of messages
Set job log purging conditions
Set Director options
Row limits
Abort after x warnings
Running Jobs from Command Line
Use dsjob -run
Use dsjob -logsum to display messages in the log
Documented in “Parallel Job Advanced Developer's Guide”, ch. 7
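As a rough illustration of scripting these commands, the Python sketch below only assembles the argument lists; it does not invoke dsjob. The helper names and the project and job names in the usage note are hypothetical, and the `-param`, `-warn`, and `-rows` options should be checked against the dsjob documentation for your release before relying on them.

```python
from typing import List, Mapping, Optional


def build_dsjob_run(
    project: str,
    job: str,
    params: Optional[Mapping[str, object]] = None,
    warn_limit: Optional[int] = None,
    row_limit: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for `dsjob -run` (nothing is executed)."""
    cmd = ["dsjob", "-run"]
    for name, value in (params or {}).items():
        cmd += ["-param", f"{name}={value}"]  # one -param per job parameter
    if warn_limit is not None:
        cmd += ["-warn", str(warn_limit)]  # stop after this many warnings
    if row_limit is not None:
        cmd += ["-rows", str(row_limit)]  # stop after this many rows
    cmd += [project, job]
    return cmd


def build_dsjob_logsum(project: str, job: str) -> List[str]:
    """Assemble an argument list for `dsjob -logsum` (nothing is executed)."""
    return ["dsjob", "-logsum", project, job]
```

A call such as `build_dsjob_run("MyProject", "MyJob", {"NumRows": 100}, warn_limit=50)` returns a list that could be handed to `subprocess.run` on a machine where dsjob is installed.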
Lab Exercises
Conceptual Lab 03A
Design a simple job in Designer
Define a job parameter
Document the job
Compile
Run
Monitor the job in Director
IBM WebSphere DataStage
Module 04: Accessing Sequential Data
Module Objectives
Understand the stages for accessing different kinds of sequential data
Sequential File stage
Data Set stage
Complex Flat File stage
Create jobs that read from and write to sequential files
Read from multiple files using file patterns
Use multiple readers
Types of Sequential Data Stages
Sequential
Fixed or variable length
Data Set
Complex Flat File
The Framework and Sequential Data
The EE Framework processes only datasets
For files other than datasets, such as sequential flat files, import and export operations are done
Import and export OSH operators are generated by Sequential and Complex Flat File stages
During import or export DataStage performs format translations into, or out of, the EE internal format
Internally, the format of data is described by schemas
Like Table Definitions
Using the Sequential File Stage
Both import and export of general files (text, binary) are performed by the Sequential File stage.
– Data import: external file to EE internal format
– Data export: EE internal format to external file
Features of Sequential File Stage
Normally executes in sequential mode
Executes in parallel when reading multiple files
Can use multiple readers within a node
Reads chunks of a single file in parallel
The stage needs to be told:
How file is divided into rows (record format)
How row is divided into columns (column format)
File Format Example
Final Delimiter = comma: Field 1 , Field 2 , Field 3 , Last field , nl
Final Delimiter = end: Field 1 , Field 2 , Field 3 , Last field nl
Field delimiter: comma (,)
Record delimiter: newline (nl)
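The two layouts in the example differ only in whether a comma follows the last field. A minimal Python sketch of that difference follows; it is not the stage's import operator, and `parse_records` and the `final_delimiter` values `"end"` and `"comma"` are names chosen here to echo the slide.

```python
from typing import List


def parse_records(text: str, final_delimiter: str = "end") -> List[List[str]]:
    """Split newline-delimited records into comma-delimited fields.

    final_delimiter="end":   the last field is followed directly by the newline.
    final_delimiter="comma": the last field is followed by a comma, then newline.
    """
    if final_delimiter not in ("end", "comma"):
        raise ValueError("final_delimiter must be 'end' or 'comma'")
    records = []
    for raw in text.split("\n"):  # record delimiter
        if raw == "":
            continue  # skip empty pieces, e.g. after the last newline
        if final_delimiter == "comma":
            if not raw.endswith(","):
                raise ValueError(f"record lacks the final comma: {raw!r}")
            raw = raw[:-1]  # drop the comma that follows the last field
        records.append(raw.split(","))  # field delimiter
    return records
```

Parsing a file that has trailing commas with `final_delimiter="end"` would instead produce an extra empty field per record, which is the kind of mismatch the format settings are meant to prevent.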
Sequential File Stage Rules
One input link
One stream output link
Optionally, one reject link
Will reject any records not matching metadata in the column definitions
Example: You specify three columns separated by commas, but the row that's read has no commas in it
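The reject rule in the example can be sketched in Python as a check on column count. This is a simplified stand-in (a real stage also validates types and formats); `read_with_reject` is a hypothetical name, and rejected rows are kept as raw bytes to echo the single raw column that rejected records carry.

```python
from typing import List, Sequence, Tuple


def read_with_reject(
    lines: Sequence[str], num_columns: int = 3, delimiter: str = ","
) -> Tuple[List[List[str]], List[bytes]]:
    """Route each row to the stream output or to the reject output."""
    stream: List[List[str]] = []
    rejects: List[bytes] = []
    for line in lines:
        fields = line.split(delimiter)
        if len(fields) == num_columns:
            stream.append(fields)  # row matches the column definitions
        else:
            rejects.append(line.encode("utf-8"))  # whole row, one raw column
    return stream, rejects
```

A row such as `"bad row"`, which contains no commas, splits into one field instead of three and therefore lands in the reject list.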
Job Design Using Sequential Stages
Reject link
Sequential Source Columns Tab
View data
Load Table Definition
Save as a new Table Definition
Input Sequential Stage Properties
Output tab
File to access
Column names in first row
Click to add more files having the same format
Format Tab
Record format
Column format
Reading Using a File Pattern
Use wildcards
Select File Pattern
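Wildcard matching of this kind can be tried out in Python with the standard `fnmatch` module. The sketch assumes shell-style wildcards and uses made-up file names; the exact pattern syntax the stage accepts may differ.

```python
import fnmatch
from typing import Iterable, List


def match_pattern(names: Iterable[str], pattern: str) -> List[str]:
    """Return the names that match a shell-style wildcard pattern, sorted."""
    return sorted(fnmatch.filter(list(names), pattern))
```

For instance, the pattern `sales_*.txt` selects `sales_01.txt` and `sales_02.txt` from a directory listing that also contains `orders_01.txt`.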
Properties - Multiple Readers
Multiple readers option allows you to set number of readers per node
Sequential Stage As a Target
Input Tab
Append / Overwrite
Reject Link
Reject mode property
Reject mode =
Continue: Continue reading records
Fail: Abort job
Output: Send rejected records down the reject link
In a source stage
All records not matching the metadata (column definitions) are rejected
In a target stage
All records that fail to be written for any reason
Rejected records consist of one column, datatype = raw
Inside the Copy Stage
Column mappings
DataSet Stage
Data Set
Operating system (Framework) file
Preserves partitioning
Component dataset files are written on each partition
Suffixed by .ds
Referred to by a header file
Managed by Data Set Management utility from GUI (Manager, Designer, Director)
Represents persistent data
Key to good performance in set of linked jobs
No import / export conversions are needed
No repartitioning needed
Persistent Datasets
Accessed using DataSet Stage.
Two parts:
Descriptor file (e.g., input.ds):
contains metadata, data location, but NOT the data itself
record ( partno: int32; description: string; )
Data file(s):
contain the data
multiple Unix files (one per node), accessible in parallel
node1:/local/disk1/…
node2:/local/disk2/…
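The split between descriptor and data files can be modelled with a small Python data structure. This is a conceptual sketch, not the on-disk format of a .ds file; the class name, its methods, and the data file names used in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataSetDescriptor:
    """Stand-in for a descriptor file: metadata and locations, but no rows."""

    path: str               # the descriptor itself, e.g. "input.ds"
    schema: Dict[str, str]  # column name -> type
    data_files: List[str]   # "node:/path" entries where the rows live

    def files_on_node(self, node: str) -> List[str]:
        """Data files that a reader running on `node` would open."""
        return [f for f in self.data_files if f.startswith(node + ":")]

    def all_files(self) -> List[str]:
        """Every file that makes up the dataset: data files plus descriptor."""
        return self.data_files + [self.path]
```

Built with the slide's schema (`partno: int32`, `description: string`) and two made-up data file names, one per node, `all_files()` lists three entries, which shows why deleting only the descriptor would leave the data files behind.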
Data Translation
Occurs on import
From sequential files or file sets
From RDBMS
Occurs on export
From datasets to file sets or sequential files
From datasets to RDBMS
DataStage engine is most efficient when processing internally formatted records (i.e. datasets)
File Set Stage
Can read or write file sets
Files suffixed by .fs
File set consists of:
Descriptor file – contains location of raw data files + metadata
Individual raw data files
Can be processed in parallel
Similar to a dataset
Main difference is that file sets are not in the internal format and therefore more accessible to external applications
File Set Stage Example
Descriptor file
Lab Exercises
Conceptual Lab 04A
Read and write to a sequential file
Create reject links
Create a data set
Conceptual Lab 04B
Read multiple files using a file pattern
Conceptual Lab 04C
Read a file using multiple readers
DataStage Data Types
Standard types
Char
VarChar
Integer
Decimal (Numeric)
Floating point
Date
Time
Timestamp
VarBinary (raw)
Complex types
Vector (array, occurs)
Subrecord (group)
Standard Types
Char
Fixed length string
VarChar
Variable length string
Specify maximum length
Integer
Decimal (Numeric)
Precision (length including numbers after the decimal point)
Scale (number of digits after the decimal point)
Floating point
Date
Default string format: %yyyy-%mm-%dd
Time
Default string format: %hh:%nn:%ss
Timestamp
Default string format: %yyyy-%mm-%dd %hh:%nn:%ss
VarBinary (raw)
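For readers more used to strftime-style codes, the default string formats above correspond roughly to the Python formats below. The mapping is an assumption made for illustration (in particular that %hh is a 24-hour value and %nn is minutes); the constant and function names are hypothetical.

```python
from datetime import datetime
from typing import Tuple

# Assumed strftime counterparts of the default string formats.
DATE_FORMAT = "%Y-%m-%d"                # for %yyyy-%mm-%dd
TIME_FORMAT = "%H:%M:%S"                # for %hh:%nn:%ss
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # for %yyyy-%mm-%dd %hh:%nn:%ss


def format_defaults(moment: datetime) -> Tuple[str, str, str]:
    """Render one moment as date, time, and timestamp strings."""
    return (
        moment.strftime(DATE_FORMAT),
        moment.strftime(TIME_FORMAT),
        moment.strftime(TIMESTAMP_FORMAT),
    )
```

Each component is zero-padded, so nine o'clock is written `09` and January is written `01`.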
Complex Data Types
Vector
A one-dimensional array
Elements are numbered 0 to n
Elements can be of any single type
All elements must have the same type
Can have fixed or variable number of elements
Subrecord
A group or structure of elements
Elements of the subrecord can be of any type
Subrecords can be embedded
Schema With Complex Types
subrecord
vector
Table Definition with complex types
Authors is a subrecord
Books is a vector of 3 strings of length 5
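The shape of such a record can be mirrored in Python, with a nested dataclass for the subrecord and a list for the vector. The slide does not show the fields inside Authors, so `first_name` and `last_name` are placeholders, and "length 5" is read here as exactly five characters; both are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Authors:
    """Subrecord: a group of elements (field names are placeholders)."""

    first_name: str
    last_name: str


@dataclass
class BookRecord:
    authors: Authors   # subrecord column
    books: List[str]   # vector column: 3 strings of length 5

    def __post_init__(self) -> None:
        # A fixed-size vector always has the same number of elements.
        if len(self.books) != 3:
            raise ValueError("books must have exactly 3 elements")
        # Every element of a vector shares one type; here, 5-character strings.
        if any(len(book) != 5 for book in self.books):
            raise ValueError("each books element must have length 5")
```

Vector elements are addressed by position starting at 0, so `record.books[0]` is the first book.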
Complex Types Column Definitions
subrecord
Elements of subrecord
Vector
Reading and Writing Complex Data
Complex Flat File target stage
Complex Flat File source stage
Importing Cobol Copybooks
Click Import>Table Definitions>COBOL File Definitions to begin the import
Each level 01 item begins a Table Definition
Specify position of level 01 items
Level 01 start position
Path to copybook file
Where to store the Table Definition
Reading and Writing NULL Values
Working with NULLs
Internally, NULL is represented by a special value outside the range of any existing, legitimate values
If NULL is written to a non-nullable column, the job will abort
Columns can be specified as nullable
NULLs can be written to nullable columns
You must “handle” NULLs written to non-nullable columns in a Sequential File stage
You need to tell DataStage what value to write to the file
Unhandled rows are rejected
In a Sequential source stage, you can specify values you want
DataStage to convert to NULLs
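The NULL-handling rules above can be sketched in Python. This is an illustration only, not DataStage code: the marker value `NULL_MARKER` and the helpers `write_field` and `read_field` are hypothetical names chosen for this example.

```python
# Illustrative sketch only; NULL_MARKER and both helpers are hypothetical.
NULL_MARKER = "NULL"  # value written to the file in place of an in-memory NULL

def write_field(value, null_marker=None):
    """Return the text to write for one field, or raise if the NULL is unhandled."""
    if value is None:
        if null_marker is None:
            # No value was configured for NULL, so the row cannot be written.
            raise ValueError("unhandled NULL: row would be rejected")
        return null_marker
    return str(value)

def read_field(text, null_marker=None):
    """Convert the configured marker back to an in-memory NULL (None)."""
    if null_marker is not None and text == null_marker:
        return None
    return text

written = [write_field(v, NULL_MARKER) for v in ["CA", None, "MA"]]
restored = [read_field(t, NULL_MARKER) for t in written]
```

Writing with a marker and reading with the same marker round-trips the NULL; calling `write_field(None)` without a marker models the rejected row.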
Specifying a Value for NULL
Nullable column
Added property
Managing DataSets
GUI (Manager, Designer, Director): Tools > Data Set Management
Dataset management from the system command line
orchadmin
Unix command line utility
List records
Remove datasets
– Removes all component files, not just the header file
dsrecords
Lists number of records in a dataset
Displaying Data and Schema
Display data
Schema
Lab Exercises
Conceptual Lab 04D
Use the dsrecords utility
Use Data Set Management tool
Conceptual Lab 04E
Reading and Writing NULLs
®
IBM WebSphere DataStage
Module 05: Platform Architecture
© 2005 IBM Corporation
Module Objectives
Parallel processing architecture
Pipeline parallelism
Partition parallelism
Partitioning and collecting
Configuration files
Key EE Concepts
Parallel processing:
Executing the job on multiple CPUs
Scalable processing:
Add more resources (CPUs and disks) to increase system performance
Example system: 6 CPUs (processing nodes) and disks
Scale up by adding more CPUs
Add CPUs as individual nodes or to an SMP system
Scalable Hardware Environments
Single CPU
Dedicated memory & disk
SMP
Multi-CPU (2-64+)
Shared memory & disk
GRID / Clusters
Multiple, multi-CPU systems
Dedicated memory per node
Typically SAN-based shared storage
MPP
Multiple nodes with dedicated memory, storage
2 to 1000s of CPUs
Pipeline Parallelism
Transform, clean, load processes execute simultaneously
Like a conveyor belt moving rows from process to process
Start downstream process while upstream process is running
Advantages:
Reduces disk usage for staging areas
Keeps processors busy
Still has limits on scalability
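The conveyor-belt idea can be sketched with Python generators. This is an analogy under stated assumptions, not how the parallel framework is implemented: generators interleave the stages in a single thread rather than running them on separate processors, and the stage names `extract`, `transform`, and `load` are hypothetical. What it does show is that each row moves downstream before the next row is read, so no intermediate staging area is needed.

```python
def extract(rows, log):
    """Upstream stage: hand rows downstream one at a time."""
    for row in rows:
        log.append(("extract", row))
        yield row

def transform(rows, log):
    """Middle stage: starts working as soon as the first row arrives."""
    for row in rows:
        log.append(("transform", row))
        yield row.upper()

def load(rows, log):
    """Downstream stage: consumes rows while upstream is still producing."""
    target = []
    for row in rows:
        log.append(("load", row))
        target.append(row)
    return target

log = []
target = load(transform(extract(["a", "b"], log), log), log)
```

The `log` list records the order of events: row "a" is extracted, transformed, and loaded before row "b" is extracted.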
Three-Node Partitioning
[Diagram: the data is split into subset1, subset2, and subset3; the same Operation runs on Node 1, Node 2, and Node 3, one subset per node]
Here the data is partitioned into three partitions
The operation is performed on each partition of data separately and in parallel
If the data is evenly distributed, the data will be processed roughly three times faster
EE Combines Partitioning and Pipelining
Within EE, pipelining, partitioning, and repartitioning are automatic
Job developer only identifies:
Sequential vs. Parallel operations (by stage)
Method of data partitioning
Configuration file (which identifies resources)
Advanced stage options (buffer tuning, operator combining, etc.)
Job Design v. Execution
User assembles the flow using DataStage Designer
… at runtime, this job runs in parallel for any configuration
(1 node, 4 nodes, N nodes)
No need to modify or recompile the job design!
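A minimal Python sketch of the same idea: the job logic is written once, and the degree of parallelism is supplied only at run time. The function `run_job` and its doubling operation are hypothetical stand-ins for a real job; the partitions are processed one after another here, whereas the framework would process them in parallel.

```python
def run_job(rows, node_count):
    """Round-robin rows over node_count partitions, apply the same
    operation to every partition, then collect the results."""
    partitions = [rows[i::node_count] for i in range(node_count)]
    processed = [[amount * 2 for amount in part] for part in partitions]
    return sorted(x for part in processed for x in part)

rows = list(range(10))
# Same job definition, three different "configurations".
results = {n: run_job(rows, n) for n in (1, 4, 7)}
```

All three runs return the same rows; only the number of partitions changed.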
Configuration File
Configuration file separates configuration (hardware / software) from job design
Specified per job at runtime by $APT_CONFIG_FILE
Change hardware and resources without changing job design
Defines number of nodes (logical processing units) with their resources (need not match physical CPUs)
Dataset, Scratch, Buffer disk (file systems)
Optional resources (Database, SAS, etc.)
Advanced resource optimizations
“Pools” (named subsets of nodes)
Multiple configuration files can be used at runtime
Optimizes overall throughput and matches job characteristics to overall hardware resources
Allows runtime constraints on resource usage on a per job basis
Partitioning and Collecting
Partitioning / Collecting Algorithms
Partitioning algorithms include:
Round robin
Hash: Determine partition based on key value
Requires key specification
Entire: Send all rows down all partitions
Same: Preserve the same partitioning
Auto: Let DataStage choose the algorithm
Collecting algorithms include:
Round robin
Sort Merge
Read in by key
Presumes data is sorted by the key in each partition
Builds a single sorted stream based on the key
Ordered
Read all records from first partition, then second, …
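The three collecting algorithms can be compared on a small example. This is a sketch in Python, not framework code: the helper names are hypothetical, and each inner list stands for one partition that is already sorted by key.

```python
import heapq
from itertools import chain, zip_longest

# Three partitions, each already sorted by key.
partitions = [[2, 5, 9], [1, 8], [3, 4, 6, 7]]

def collect_round_robin(parts):
    """Take one record from each partition in turn, skipping exhausted ones."""
    missing = object()
    return [r for group in zip_longest(*parts, fillvalue=missing)
            for r in group if r is not missing]

def collect_sort_merge(parts):
    """Merge pre-sorted partitions into a single sorted stream."""
    return list(heapq.merge(*parts))

def collect_ordered(parts):
    """Read all records from the first partition, then the second, and so on."""
    return list(chain.from_iterable(parts))

round_robin = collect_round_robin(partitions)
sort_merge = collect_sort_merge(partitions)
ordered = collect_ordered(partitions)
```

Only the sort merge result is globally sorted, and only because each partition was sorted beforehand.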
Keyless V. Keyed Partitioning Algorithms
Keyless: Rows are distributed independently of data values
Round Robin
Entire
Same
Keyed: Rows are distributed based on values in the specified key
Hash: Partition based on key
Example: Key is State. All "CA" rows go into the same partition; all "MA" rows go in the same partition. Two rows of the same state never go into different partitions
Modulus: Partition based on modulus of key divided by the number of partitions. Key is a numeric type.
Example: Key is OrderNumber (numeric type). Rows with the same order number will all go into the same partition.
DB2: Matches DB2 EEE partitioning
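A small Python sketch of the two keyed algorithms. The function names are hypothetical, and CRC32 is used only as a convenient deterministic hash; the source does not say which hash function DataStage uses.

```python
import zlib

def hash_partition(key, partition_count):
    """Keyed partitioning: equal key values always map to the same partition.
    CRC32 is an assumed stand-in for the framework's hash function."""
    return zlib.crc32(str(key).encode("utf-8")) % partition_count

def modulus_partition(key, partition_count):
    """Modulus partitioning: numeric key modulo the number of partitions."""
    return key % partition_count

states = ["CA", "MA", "CA", "NY", "MA", "CA"]
by_state = {}
for state in states:
    by_state.setdefault(state, set()).add(hash_partition(state, 4))

order_partitions = [modulus_partition(n, 4) for n in (100, 101, 102, 100)]
```

Each state ends up with exactly one partition number in `by_state`, and the two rows with order number 100 land in the same partition.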
Partitioning Requirements for Related Records
Misplaced records
Using Aggregator stage to sum customer sales by customer number
If there are 25 customers, 25 records should be output
But suppose records with the same customer number are spread across partitions
This will produce more than 25 groups (records)
Solution: Use hash partitioning algorithm
Partition imbalances
Peek stage shows number of records going down each partition
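The misplaced-records problem can be reproduced in a few lines of Python. This is a sketch with made-up data (25 customers, 4 sales of 10 each) and hypothetical names; CRC32 stands in for whatever hash function the framework uses.

```python
import zlib
from collections import defaultdict

def aggregate_per_partition(partitions):
    """Sum sales by customer inside each partition independently,
    the way a parallel aggregation sees the data."""
    output = []
    for part in partitions:
        totals = defaultdict(int)
        for customer, amount in part:
            totals[customer] += amount
        output.extend(totals.items())
    return output

sales = [(cust, 10) for cust in range(25) for _ in range(4)]
node_count = 3

# Keyless: round robin spreads each customer's rows over all partitions.
round_robin = [sales[i::node_count] for i in range(node_count)]

# Keyed: hash on customer number keeps each customer in one partition.
hashed = [[] for _ in range(node_count)]
for cust, amount in sales:
    hashed[zlib.crc32(str(cust).encode("utf-8")) % node_count].append((cust, amount))

rr_groups = aggregate_per_partition(round_robin)
hash_groups = aggregate_per_partition(hashed)
```

With round robin the aggregation emits more than 25 partial groups; with hash partitioning it emits exactly 25, each with the full total of 40.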
Unequal Distribution Example
Same key values are assigned to the same partition
Hash on LName, with 2-node config file
Source Data
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore
Partition 0
ID  LName  FName    Address
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard
Partition 1
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore
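The cost of this imbalance can be estimated with a short Python sketch. The assumptions are labeled in the comments: CRC32 is a stand-in hash, so the partition numbers it produces need not match the slide, and elapsed time is assumed to be proportional to the size of the busiest partition.

```python
import zlib
from collections import Counter

# LName column of the 10 source rows.
last_names = ["Ford"] * 4 + ["Dodge"] * 2 + ["Ford"] * 4
partition_count = 2

# Assumed stand-in hash; all rows with the same LName share a partition.
sizes = Counter(zlib.crc32(name.encode("utf-8")) % partition_count
                for name in last_names)

largest = max(sizes.values())
# Assumption: the busiest partition bounds the elapsed time.
best_speedup = len(last_names) / largest
```

With the 8 / 2 split shown above, the speedup is at most 10 / 8 = 1.25 on two nodes, compared with 2 for an even split.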
Partitioning / Collecting Link Icons
Partitioning icon
Collecting icon
More Partitioning Icons
“fan-out” Sequential to Parallel
SAME partitioner
Re-partition: watch for this!
AUTO partitioner
Partitioning Tab
Key specification
Algorithms
Collecting Specification
Key specification
Algorithms
Quiz: True or False?
Everything that has been data-partitioned must be collected in the same job
Data Set Stage
Is the data partitioned?
Introduction to the Solution Development Exercises
Solution Development Jobs
Series of 4 jobs extracted from production jobs
Use a variety of stages in interesting, realistic configurations
Sort, Aggregator stages
Join, Lookup stages
Peek, Filter stages
Modify stage
Oracle stage
Contain useful techniques
Use of Peeks
Datasets used to “connect” jobs
Use of project environment variables in job parameters
Fork Joins
Lookups for auditing
Glimpse Into the Sort Stage
Algorithms
Sort key to add
Copy Stage With Multiple Output Links
Select output link
Filter Stage
Used with Peek stage to select a portion of data for checking
On Properties tab, specify a Where clause to filter the data
On Mapping tab, map input columns to output columns
Setting the Filtering Condition
Filtering condition
Warehouse Job 03
Warehouse Job 04
Warehouse Job 02 With Lookup
Lab Exercises
Conceptual Lab 05A
Experiment with partitioning / collecting
Solution Lab 05B (Build Warehouse_01 Job)
Add environment variables as job parameters
Read multiple sequential files
Use the Sort stage
Use Filter and Peek stages
Write to a DataSet stage
®
IBM WebSphere DataStage
Module 06: Combining Data
Module Objectives
Combine data using the Lookup stage
Combine data using Merge stage
Combine data using the Join stage
Combine data using the Funnel stage
Combining Data
Ways to combine data:
Horizontally:
Multiple input links
One output link made of columns from different input links
Joins
Lookup
Merge
Vertically:
One input link, one output link combining groups of related records into a single record
Aggregator
Remove Duplicates
Funneling: Multiple input streams funneled into a single output stream
Funnel stage
Lookup, Merge, Join Stages
These stages combine two or more input links
Data is combined by designated "key" column(s)
These stages differ mainly in:
Memory usage
Treatment of rows with unmatched key values
Input requirements (sorted, de-duplicated)
Not all Links are Created Equal
• DataStage distinguishes between:
- The Primary input (Framework port 0)
- Secondary inputs, in some cases "Reference" (other Framework ports)
• Conventions:
                                 Joins   Lookup            Merge
Primary Input: port 0            Left    Source            Master
Secondary Input(s): ports 1,…    Right   Lookup table(s)   Update(s)
• Tip: Check the "Link Ordering" tab to make sure the intended Primary is listed first
Lookup Features
One Stream Input link (Source)
Multiple Reference links (Lookup files)
One output link
Optional Reject link
Only one per Lookup stage, regardless of number of reference links
Lookup Failure options
Continue, Drop, Fail, Reject
Can return multiple matching rows
Hash tables are built in memory from the lookup files
Indexed by key
Should be small enough to fit into physical memory
The Lookup Stage
Uses one or more key columns as an index into a table
Usually contains other values associated with each key.
The lookup table is created in memory before any lookup source rows are processed
Lookup table
Index  Associated Value
[…]
SC     South Carolina
SD     South Dakota
TN     Tennessee
TX     Texas
UT     Utah
VT     Vermont
[…]
Key column of source: state_code "TN"
Lookup from Sequential File Example
Driver (Source) link
Reference link (lookup table)
Lookup Key Column in Sequential File
Lookup key
Handling Lookup Failures
Select action
Lookup Failure Actions
• If the lookup fails to find a matching key value, one of these actions can be taken:
– fail: the Lookup stage reports an error and the job fails immediately. This is the default.
– drop: the input row with the failed lookup(s) is dropped
– continue: the input row is transferred to the output, together with the successful table entries. The failed table entry(s) are not transferred, resulting in either default output values or null output values.
– reject: the input row with the failed lookup(s) is transferred to a second output link, the "reject" link.
• There is no option to capture unused table entries
– Compare with the Join and Merge stages
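The four failure actions can be sketched in Python with an in-memory dictionary as the lookup table. The function `lookup`, its `on_failure` parameter, and the sample rows are illustrative assumptions, not the stage's actual interface; a failed lookup is modeled as a missing dictionary key and a null output value as `None`.

```python
def lookup(source_rows, table, on_failure="fail"):
    """Return (output_rows, reject_rows) for a keyed lookup on 'Citizen'."""
    output, rejects = [], []
    for row in source_rows:
        match = table.get(row["Citizen"])
        if match is not None:
            output.append({**row, "Exchange": match})
        elif on_failure == "fail":
            raise LookupError(f"no match for {row['Citizen']!r}")
        elif on_failure == "continue":
            output.append({**row, "Exchange": None})  # null output value
        elif on_failure == "reject":
            rejects.append(row)                       # goes to the reject link
        elif on_failure != "drop":
            raise ValueError(f"unknown action {on_failure!r}")
    return output, rejects

source = [{"Revolution": 1789, "Citizen": "Lefty"},
          {"Revolution": 1776, "Citizen": "M_B_Dextrous"}]
table = {"M_B_Dextrous": "Nasdaq", "Righty": "NYSE"}

continued, _ = lookup(source, table, "continue")
dropped, _ = lookup(source, table, "drop")
kept, rejected = lookup(source, table, "reject")
```

Note that the unused table entry for "Righty" never appears in any output, which matches the last bullet above.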
Lookup Stage Behavior
We shall first use the simplest case, optimal input:
• Two input links: "Source" as primary, "Lookup" as secondary
• Sorted on key column (here "Citizen")
• Without duplicates on key
Source link (primary input)
Revolution  Citizen
1789        Lefty
1776        M_B_Dextrous
Lookup link (secondary input)
Citizen       Exchange
M_B_Dextrous  Nasdaq
Righty        NYSE
Lookup Stage
Output of Lookup with continue option on key Citizen
Revolution  Citizen       Exchange
1789        Lefty         (empty string or NULL)
1776        M_B_Dextrous  Nasdaq
Same output as left outer join and merge/keep
Output of Lookup with drop option on key Citizen
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Same output as inner join and merge/drop
The Lookup Stage
Lookup Tables should be small enough to fit into physical memory
On an MPP you should partition the lookup tables using the Entire partitioning method, or partition them by the same hash key as the source link
Entire results in multiple copies (one for each partition)
On an SMP, choose Entire or accept the default (which is Entire)
Entire does not result in multiple copies because memory is shared
Join Stage
The Join Stage
• Four types:
• Inner
• Left outer
• Right outer
• Full outer
2 or more sorted input links, 1 output link
"left" on primary input, "right" on secondary input
Pre-sort makes joins "lightweight": few rows need to be in RAM
Follow the RDBMS-style relational model
Cross-products in case of duplicates
Matching entries are reusable for multiple matches
Non-matching entries can be captured (Left, Right, Full)
No fail/reject option for missed matches
Join Stage Behavior
We shall first use the simplest case, optimal input:
• Two input links: "left" as primary, "right" as secondary
• Sorted on key column (here "Citizen")
• Without duplicates on key
Left link (primary input)
Revolution  Citizen
1789        Lefty
1776        M_B_Dextrous
Right link (secondary input)
Citizen       Exchange
M_B_Dextrous  Nasdaq
Righty        NYSE
Inner Join
Transfers rows from both data sets whose key columns contain equal values to the output link
Treats both inputs symmetrically
Output of inner join on key Citizen
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Same output as lookup/reject and merge/drop
Left Outer Join
Transfers all values from the left link and transfers values from the right link only where key columns match.
Revolution  Citizen       Exchange
1789        Lefty
1776        M_B_Dextrous  Nasdaq
Same output as lookup/continue and merge/keep
Left Outer Join
Check Link Ordering tab to make sure intended Primary is listed first
Right Outer Join
Transfers all values from the right link and transfers values from the left link only where key columns match.
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Null or 0   Righty        NYSE
Full Outer Join
Transfers rows from both data sets, whose key columns contain equal values, to the output link.
It also transfers rows, whose key columns contain unequal values, from both input links to the output link.
Treats both inputs symmetrically.
Creates new columns, with new column names!
Revolution  leftRec_Citizen  rightRec_Citizen  Exchange
1789        Lefty
1776        M_B_Dextrous     M_B_Dextrous      Nasdaq
0                            Righty            NYSE
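The four join types can be reproduced on the Citizen example with a short Python sketch. It makes several simplifying assumptions that differ from the stage itself: keys are unique on both sides (so no cross-products), missing values are represented as `None`, output is not re-sorted, and the full outer join keeps a single `Citizen` column instead of creating `leftRec_` and `rightRec_` columns. The function name `join` and its `how` parameter are hypothetical.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts that have unique key values."""
    right_by_key = {rrow[key]: rrow for rrow in right}
    left_keys = {lrow[key] for lrow in left}
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    rows = []
    for lrow in left:
        rrow = right_by_key.get(lrow[key])
        if rrow is not None:
            rows.append({**lrow, **rrow})
        elif how in ("left", "full"):
            # Unmatched left row: right-hand columns are null.
            rows.append({**lrow, **dict.fromkeys(right_cols)})
    if how in ("right", "full"):
        for rrow in right:
            if rrow[key] not in left_keys:
                # Unmatched right row: left-hand columns are null.
                rows.append({**dict.fromkeys(left_cols), **rrow})
    return rows

left = [{"Revolution": 1789, "Citizen": "Lefty"},
        {"Revolution": 1776, "Citizen": "M_B_Dextrous"}]
right = [{"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
         {"Citizen": "Righty", "Exchange": "NYSE"}]

inner = join(left, right, "Citizen", "inner")
left_outer = join(left, right, "Citizen", "left")
right_outer = join(left, right, "Citizen", "right")
full_outer = join(left, right, "Citizen", "full")
```

The inner join returns one row, the left and right outer joins two rows each, and the full outer join three rows, matching the row counts in the outputs shown above.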
Merge Stage
Merge Stage Job
The Merge Stage
Allows composite keys
Multiple update links
Matched update rows are consumed
Unmatched updates in input port n can be captured in output port n
Lightweight
[Diagram: Master (input port 0) and one or more updates (input ports 1, 2) feed the Merge stage; Output is port 0, Rejects are ports 1, 2]
Merge Stage Editor
Unmatched Master rows
One of two options:
– Keep [default]
– Drop
(Capture in reject link is NOT an option)
Unmatched Update rows option:
– Capture in reject link(s). Implemented by adding outgoing links
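A minimal Python sketch of these Merge options, limited to a single update link with unique keys. The function `merge` and its `unmatched_master` parameter are hypothetical names; the sketch only mirrors the behavior described on these slides (matched updates are consumed, unmatched masters are kept or dropped, unmatched updates go to a reject link).

```python
def merge(master, updates, key, unmatched_master="keep"):
    """Merge one update set into the master.
    Returns (output_rows, rejected_updates)."""
    pending = {u[key]: u for u in updates}      # assumes unique update keys
    output = []
    for m in master:
        u = pending.pop(m[key], None)           # a matched update row is consumed
        if u is not None:
            output.append({**m, **u})
        elif unmatched_master == "keep":
            output.append(dict(m))
        # "drop": the unmatched master row is simply not output
    return output, list(pending.values())       # leftovers go to the reject link

master = [{"Revolution": 1789, "Citizen": "Lefty"},
          {"Revolution": 1776, "Citizen": "M_B_Dextrous"}]
updates = [{"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
           {"Citizen": "Righty", "Exchange": "NYSE"}]

kept_output, kept_rejects = merge(master, updates, "Citizen", "keep")
drop_output, drop_rejects = merge(master, updates, "Citizen", "drop")
```

In both runs the unmatched "Righty" update is captured as a reject; only the treatment of the unmatched "Lefty" master row differs.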
Comparison: Joins, Lookup, Merge
                                  Joins                             Lookup                             Merge
Model                             RDBMS-style relational            Source - in RAM LU Table           Master - Update(s)
Memory usage                      light                             heavy                              light
# and names of Inputs             2 or more: left, right            1 Source, N LU Tables              1 Master, N Update(s)
Mandatory Input Sort              all inputs                        no                                 all inputs
Duplicates in primary input       OK (x-product)                    OK                                 Warning!
Duplicates in secondary input(s)  OK (x-product)                    Warning!                           OK only when N = 1
Options on unmatched primary      Keep (left outer), Drop (Inner)   [fail] | continue | drop | reject  [keep] | drop
Options on unmatched secondary    Keep (right outer), Drop (Inner)  NONE                               capture in reject set(s)
On match, secondary entries are   captured                          captured                           consumed
# Outputs                         1                                 1 out, (1 reject)                  1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                     unmatched primary entries          unmatched secondary entries
Funnel Stage
What is a Funnel Stage?
A processing stage that combines data from multiple input links to a single output link
Useful to combine data from several identical data sources into a single large dataset
Operates in three modes
Continuous
Sort Funnel
Sequence
Three Funnel modes
Continuous:
Combines the records of the input links in no guaranteed order.
It takes one record from each input link in turn. If data is not available on an input link, the stage skips to the next link rather than waiting.
Does not attempt to impose any order on the data it is processing.
Sort Funnel: Combines the input records in the order defined by the value(s) of one or more key columns; the order of the output records is determined by these sorting keys.
Sequence: Copies all records from the first input link to the output link, then all the records from the second input link, and so on.
Sort Funnel Method
Produces a sorted output (assuming input links are all sorted on key)
Data from all input links must be sorted on the same key column
Typically data from all input links are hash partitioned before they are sorted
Selecting "Auto" partition type under Input Partitioning tab defaults to this
Hash partitioning guarantees that all the records with same key column values are located in the same partition and are processed on the same node.
Allows for multiple key columns
1 primary key column, n secondary key columns
Funnel stage first examines the primary key in each input record.
For multiple records with the same primary key value, it will then examine secondary keys to determine the order of records it will output
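The Sort Funnel and Sequence modes can be sketched in Python on two small input links. This is an illustration with made-up rows: each row is a `(primary key, secondary key)` tuple, and tuple comparison examines the primary key first and the secondary key only on ties. Continuous mode is not modeled because its output order depends on when rows arrive.

```python
import heapq
from itertools import chain

# Each input link is already sorted on (primary key, secondary key).
link_a = [("CA", 1), ("CA", 3), ("NY", 2)]
link_b = [("CA", 2), ("MA", 1)]

# Sequence: all records from the first link, then all from the second.
sequence = list(chain(link_a, link_b))

# Sort Funnel: merge the pre-sorted links into one sorted stream.
sort_funnel = list(heapq.merge(link_a, link_b))
```

The three "CA" rows come from both links and are ordered by their secondary key in the Sort Funnel output.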
Funnel Stage Example
Funnel Stage Properties
Lab Exercises
Conceptual Lab 06A
Use a Lookup stage
Handle lookup failures
Use a Merge stage
Use a Join stage
Use a Funnel stage
Solution Lab 06B (Build Warehouse_02 Job)
Use a Join stage
®
IBM WebSphere DataStage
Module 07: Sorting and Aggregating Data
Module Objectives
Sort data using in-stage sorts and Sort stage
Combine data using Aggregator stage
Combine data using Remove Duplicates stage
Sort Stage
Sorting Data
Uses
Some stages require sorted input
Join, merge stages require sorted input
Some stages use less memory with sorted input
E.g., Aggregator
Sorts can be done:
Within stages
On input link Partitioning tab, set partitioning to anything other than Auto
In a separate Sort stage
Makes sort more visible on diagram
Has more options
Sorting Alternatives
Sort stage
Sort within stage
In-Stage Sorting
Partitioning tab
Do sort
Preserve non-key ordering
Remove dups
Can't be Auto when sorting
Sort key
Sort Stage
Sort key
Sort options
Sort Options
Sort Utility
DataStage – the default
Unix: Don't use. Slower than DataStage sort utility
Stable
Allow duplicates
Memory usage
Sorting takes advantage of the available memory for increased performance
Uses disk if necessary
Increasing amount of memory can improve performance
Create key change column
Add a column with a value of 1 / 0
1 indicates that the key value has changed
0 means that the key value hasn't changed
Useful for processing groups of rows in a Transformer
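A Python sketch of the key change column on a key-sorted input. The helper `add_key_change`, the column name `keyChange`, and the sample `HouseHoldID` values are assumptions for illustration. The first row of every group gets 1 and the remaining rows of the group get 0, which lets downstream logic detect where a group starts.

```python
def add_key_change(rows, key):
    """Yield each row of a key-sorted input with an added keyChange column."""
    previous = object()  # sentinel that never equals a real key value
    for row in rows:
        changed = 1 if row[key] != previous else 0
        previous = row[key]
        yield {**row, "keyChange": changed}

rows = [{"HouseHoldID": h} for h in (10, 10, 20, 30, 30, 30)]
flags = [r["keyChange"] for r in add_key_change(rows, "HouseHoldID")]
```

For the six sample rows, `flags` marks the start of the groups 10, 20, and 30.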
Sort Stage Mapping Tab
Partitioning V. Sorting Keys
Partitioning keys are often different from sorting keys
Keyed partitioning (e.g., Hash) is used to group related records into the same partition
Sort keys are used to establish order within each partition
For example, partition on HouseHoldID, sort on HouseHoldID, PaymentDate
Important when removing duplicates. Sorting within each partition is used to establish order for duplicate retention (first or last in the group)
Aggregator Stage
Purpose: Perform data aggregations
Specify:
Zero or more key columns that define the aggregation units (or groups)
Columns to be aggregated
Aggregation functions, which include, among many others:
Count (nulls/non-nulls)
Sum
Max / Min / Range
The grouping method (hash table or pre-sort) is a performance issue
Job with Aggregator Stage
Aggregator stage
Aggregator Stage Properties
Group columns
Group method
Aggregation functions
Aggregator Functions
Aggregation type = Count rows
Count rows in each group
Put result in a specified output column
Aggregation type = Calculation
Select column
Put result of calculation in a specified output column
Calculations include:
Sum
Count
Min, max
Mean
Missing value count
Non-missing value count
Percent coefficient of variation
Grouping Methods
Hash (default)
Intermediate results for each group are stored in a hash table
Final results are written out after all input has been processed
No sort required
Use when number of unique groups is small
Running tally for each group's aggregate calculations needs to fit into memory. Requires about 1K RAM / group
E.g., average family income by state requires about 0.05 MB of RAM (50 states × 1 KB)
Sort
Only a single aggregation group is kept in memory
When a new group is seen, the current group is written out
Requires input to be sorted by grouping keys
Can handle unlimited numbers of groups
Example: average daily balance by credit card
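The two grouping methods can be contrasted in Python. This is a sketch, not the Aggregator's implementation: `aggregate_hash` keeps one running total per group in a dictionary, while `aggregate_sort` uses `itertools.groupby`, which holds only the current group and therefore depends on the input being sorted by the grouping key. Function names and sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def aggregate_hash(rows):
    """Hash method: one running total per group, all kept in memory."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

def aggregate_sort(sorted_rows):
    """Sort method: emit a group's total as soon as the key changes."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)

rows = [("TX", 5), ("CA", 1), ("TX", 2), ("CA", 4)]
hash_result = aggregate_hash(rows)
sort_result = dict(aggregate_sort(sorted(rows, key=itemgetter(0))))
unsorted_groups = list(aggregate_sort(rows))  # input not sorted: groups are split
```

Both methods agree when the sort method receives sorted input. Feeding it unsorted rows produces four partial groups instead of two, which is why the sort method requires input sorted by the grouping keys.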
Aggregation Types
Calculation types
Remove Duplicates Stage
Removing Duplicates
Can be done by Sort stage
Use unique option
No choice on which to keep
Stable sort always retains the first row in the group
Non-stable sort is indeterminate
OR
Remove Duplicates stage
Has more sophisticated ways to remove duplicates
Can choose to retain first or last
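Retaining the first or last duplicate can be sketched in Python on input sorted by the duplicate key and then by a secondary column. The helper `remove_duplicates`, its `retain` parameter, and the sample payment rows are assumptions for illustration; the secondary sort column decides which row counts as first or last in each group.

```python
from itertools import groupby
from operator import itemgetter

def remove_duplicates(sorted_rows, key, retain="first"):
    """Keep one row per key value from input already sorted on that key."""
    for _, group in groupby(sorted_rows, key=itemgetter(key)):
        group = list(group)
        yield group[0] if retain == "first" else group[-1]

payments = [
    {"HouseHoldID": 1, "PaymentDate": "2005-03-01"},
    {"HouseHoldID": 2, "PaymentDate": "2005-01-15"},
    {"HouseHoldID": 1, "PaymentDate": "2005-01-10"},
]
# Sort on the duplicate key, then on the column that orders rows in a group.
ordered = sorted(payments, key=itemgetter("HouseHoldID", "PaymentDate"))
earliest = list(remove_duplicates(ordered, "HouseHoldID", "first"))
latest = list(remove_duplicates(ordered, "HouseHoldID", "last"))
```

With this sort order, "first" keeps the earliest payment per household and "last" keeps the most recent one.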
Remove Duplicates Stage Job
Remove Duplicates stage
Remove Duplicates Stage Properties
Key that defines duplicates
Retain first or last duplicate
Lab Exercises
Solution Development Lab 07A
Use Sort stage
Use Aggregator stage
Use Remove Duplicates stage
(Build Warehouse_03 job)
®
IBM WebSphere DataStage
Module 08: Transforming Data
Module Objectives
Understand ways DataStage allows you to transform data
Use this understanding to:
Create column derivations using user-defined code and system functions
Filter records based on business criteria
Control data flow based on data conditions
Transformed Data
Derivations may include incoming fields or parts of incoming fields
Derivations may reference system variables and constants
Frequently uses functions performed on incoming values
Date and time
Mathematical
Logical
Null handling
More
Stages Review
Stages that can transform data
Transformer
Modify
Aggregator
Stages that do not transform data
File stages: Sequential, Dataset, Peek, etc.
Sort
Remove Duplicates
Copy
Filter
Funnel
Transformer Stage
Column mappings
Derivations
Written in Basic
Final compiled code is C++ generated object code
Constraints
Filter data
Direct data down different output links
For different processing or storage
Expressions for constraints and derivations can reference
Input columns
Job parameters
Functions
System variables and constants
Stage variables
External routines
Transformer Stage Uses
Transformer with multiple outputs
Control data flow
Constrain data
Direct data
Derivations
Inside the Transformer Stage
Input columns
Stage variables
Output columns
Constraints
Output
Derivations / Mappings
Input / Output column defs
Defining a Constraint
Input column
Job parameter
IF THEN ELSE Derivation
Use IF THEN ELSE to conditionally derive a value
Format:
IF <condition> THEN <expression1> ELSE <expression2>
If the condition evaluates to true then the result of expression1 will be copied to the target column or stage variable
If the condition evaluates to false then the result of expression2 will be copied to the target column or stage variable
Example:
Suppose the source column is named In.OrderID and the target column is named Out.OrderID
Replace In.OrderID values of 3000 by 4000
IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID
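For readers more used to general-purpose languages, the replacement of 3000 by 4000 with all other values passed through unchanged can be sketched in Python as a conditional expression. The function name is hypothetical; in the Transformer the expression is entered directly as the derivation of the target column.

```python
def derive_order_id(in_order_id):
    """Sketch of a conditional derivation: 3000 becomes 4000, other values pass through."""
    return 4000 if in_order_id == 3000 else in_order_id


# Apply the derivation to a few sample input values.
derived = [derive_order_id(v) for v in [1000, 3000, 5000]]
```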
String Functions and Operators
Substring operator
Format: “String” [loc, length]
Example:
Suppose In.Description contains the string “Orange Juice”
In.Description[8,5] returns “Juice”
UpCase(<string>) / DownCase(<string>)
Example: UpCase(In.Description) returns “ORANGE JUICE”
Len(<string>)
Example: Len(In.Description) returns 12
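The substring operator counts positions from 1, which differs from the 0-based slicing of many programming languages. A short Python sketch of the same three operations, using a hypothetical `substring` helper to mimic the [loc, length] form:

```python
def substring(s, loc, length):
    """Mimic String[loc, length]: loc is a 1-based start position."""
    start = loc - 1  # convert to Python's 0-based index
    return s[start:start + length]


description = "Orange Juice"
juice = substring(description, 8, 5)   # characters 8 through 12
upper = description.upper()            # counterpart of UpCase
lower = description.lower()            # counterpart of DownCase
size = len(description)                # counterpart of Len
```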
Checking for NULLs
Nulls can be introduced into the data flow from lookups
Mismatches (lookup failures) can produce nulls
Can be handled in constraints, derivations, stage variables, or a combination of these
NULL functions
Testing for NULL
IsNull(<column>)
IsNotNull(<column>)
Replace NULL with a value
NullToValue(<column>, <value>)
Set to NULL: SetNull()
Example: IF In.Col = 5 THEN SetNull() ELSE In.Col
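The NULL functions can be pictured with Python's None standing in for NULL. This is a rough analogy under that assumption, not the Transformer's own semantics; the lookup table and function names below are invented for the example.

```python
def is_null(value):
    return value is None


def is_not_null(value):
    return value is not None


def null_to_value(value, default):
    """Replace a null with a default, leaving other values untouched."""
    return default if value is None else value


def set_null_if_five(col):
    """Counterpart of: IF In.Col = 5 THEN SetNull() ELSE In.Col"""
    return None if col == 5 else col


# A lookup failure produces a null for the looked-up column.
lookup = {"A1": "Widgets"}
codes = ["A1", "B2"]
names = [lookup.get(code) for code in codes]
handled = [null_to_value(n, "UNKNOWN") for n in names]
```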
Transformer Functions
Date & Time
Logical
Null Handling
Number
String
Type Conversion
Transformer Execution Order
Derivations in stage variables are executed first
Constraints are executed before derivations
Column derivations in earlier links are executed before later links
Derivations in higher columns are executed before lower columns
Transformer Stage Variables
Derivations execute in order from top to bottom
Later stage variables can reference earlier stage variables
Earlier stage variables can reference later stage variables
These variables will contain a value derived from the previous row that came into the Transformer
Multi-purpose
Counters
Store values from previous rows to make comparisons
Store derived values to be used in multiple target field derivations
Can be used to control execution of constraints
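The top-to-bottom evaluation order is what makes "previous row" comparisons work. The Python sketch below imitates two stage variables with ordinary local variables; the names `sv_count` and `sv_prev_customer` and the sample rows are assumptions made for illustration.

```python
def transform(rows):
    """Each input row is (customer, amount); output adds a row count and a new-customer flag."""
    sv_count = 0             # stage variable used as a counter
    sv_prev_customer = None  # stage variable holding a value from the previous row
    out = []
    for customer, amount in rows:
        sv_count += 1
        # Evaluated before sv_prev_customer is refreshed, so it still
        # holds the value derived from the previous row.
        is_new = customer != sv_prev_customer
        out.append((customer, amount, sv_count, is_new))
        # Evaluated last: the next row will see this row's customer.
        sv_prev_customer = customer
    return out


result = transform([("a", 1), ("a", 2), ("b", 3)])
```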
Stage Variables Toggle
Show/Hide button
Otherwise Link
Otherwise link
Defining an Otherwise Link
Check to create otherwise link
Can specify abort condition
Specifying Link Ordering
Link ordering toolbar icon
Last in order
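One way to picture why the otherwise link is placed last in the link order: each row is tested against the constrained links in order, and only a row that satisfied none of them falls through to the otherwise link. The Python sketch below is a simplified model under that assumption; the link names and predicates are hypothetical.

```python
def route(rows, constraints):
    """constraints is an ordered list of (link_name, predicate) pairs."""
    outputs = {name: [] for name, _ in constraints}
    outputs["otherwise"] = []
    for row in rows:
        matched = False
        for name, predicate in constraints:
            if predicate(row):
                outputs[name].append(row)
                matched = True
        if not matched:
            # No earlier link accepted the row, so it goes down the otherwise link.
            outputs["otherwise"].append(row)
    return outputs


links = route(
    [5, 50, 500, -1],
    [("small", lambda r: 0 <= r < 10), ("medium", lambda r: 10 <= r < 100)],
)
```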
Transformer Stage Tips
Suggestions -
Include reject links
Test for NULL values before using a column in a function
Use RCP (Runtime Column Propagation)
Map columns that have derivations (not just copies).
More on RCP later.
Be aware of column and stage variable data types.
Often developers do not pay attention to stage variable types
Avoid type conversions.
Try to maintain the data type as imported.
Modify Stage
Modify column types
Perform some types of derivations
Null handling
Date / time handling
String handling
Add or drop columns
Job With Modify Stage
Modify stage
Lab Exercises
Conceptual Lab 08A
Add a Transformer to a job
Define a constraint
Work with null values
Define a rejects link
Define a stage variable
Define a derivation
IBM WebSphere DataStage
Module 09: Standards and Techniques
Job Presentation
Document using the annotation stage
Job Properties Documentation
Organize jobs into categories
Description is displayed in Manager and MetaStage
Naming Conventions
Stages named after the
Data they access
Function they perform
DO NOT leave default stage names like Sequential_File_0
One possible convention:
Use 2-character prefixes to indicate stage type, e.g.,
– “SF_” for Sequential File stage
– “DS_” for Dataset stage
– “CP_” for Copy stage
Links named for the data they carry
DO NOT leave default link names like DSLink3
One possible convention:
Prefix all link names with “lnk_”
Name links after the data flowing through them
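A convention like this is easy to check mechanically. The Python sketch below is a rough heuristic, assuming default names follow the two patterns shown above (a stage type followed by a number, and DSLink followed by a number). The regular expressions and function names are illustrative only and would also flag a deliberate name such as DS_Orders_2019.

```python
import re

# Assumed shapes of default names, based on the two examples given.
DEFAULT_STAGE = re.compile(r"^[A-Za-z]+(_[A-Za-z]+)*_\d+$")  # e.g. Sequential_File_0
DEFAULT_LINK = re.compile(r"^DSLink\d+$")                    # e.g. DSLink3


def stage_name_ok(name):
    """Reject names that still look like a generated default."""
    return not DEFAULT_STAGE.match(name)


def link_name_ok(name):
    """Require the lnk_ prefix and reject generated defaults."""
    return name.startswith("lnk_") and not DEFAULT_LINK.match(name)


checks = {
    "Sequential_File_0": stage_name_ok("Sequential_File_0"),
    "SF_Customers": stage_name_ok("SF_Customers"),
    "DSLink3": link_name_ok("DSLink3"),
    "lnk_customers": link_name_ok("lnk_customers"),
}
```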
Stage and Link Names
Name stages and links for the data they handle
Iterative Job Design
Use Copy and Peek stages as stubs
Test job in phases
Small sections first, then increasing in complexity
Use Peek stage to examine records
Check data at various locations
Check before and after processing stages
Copy Stage Stub Example
Copy stage
Developing Jobs
1. Keep it simple
a) Jobs with many stages are hard to debug and maintain
2. Start small and build to final solution
a) Use view data, copy, and peek
b) Start from source and work out
c) Develop with a 1 node configuration file
3. Solve the business problem before the performance problem
a) Don't worry too much about partitioning until the sequential flow works as expected
4. If you land data in order to break complex jobs into smaller sets of jobs for purposes of restartability or maintainability, use persistent datasets
a) Retains partitioning and internal data types
b) This is true only as long as you don't need to read the data outside of DataStage
Final Result
Good Things to Have in each Job
Job parameters
Useful environment variables to add to job parameters
$APT_DUMP_SCORE
Report OSH to message log
$APT_CONFIG_FILE
Establishes runtime parameters to EE engine
Establishes degree of parallelization
DUMP SCORE Output
Setting APT_DUMP_SCORE yields:
Double-click
Partitioner and Collector
Mapping Node --> partition
Use Multiple Configuration Files
Make a set for 1X, 2X,….
Use different ones for test versus production
Include as a parameter in each job
Containers
Two varieties
Local
Shared
Local
Simplifies a large, complex diagram
Shared
Creates reusable object that many jobs within the project can include
Reusable Job Components
Use Shared Containers for repeatedly used components
Container
Lab Exercises
Conceptual Lab 07A
Apply best practices when naming links and stages
IBM WebSphere DataStage
Module 10: Accessing Relational Data
Module Objectives
Understand how DataStage jobs read and write records to RDBMS tables
Import relational table definitions
Read from and write to database tables
Use database tables to look up data
Supported Database Access
Enterprise Edition provides high performance / scalable interfaces for:
DB2 / UDB
Informix
Oracle
Teradata
SQL Server
ODBC
Importing Table Definitions
Can import using ODBC or using Orchestrate schema definitions
Orchestrate schema imports are better because the data types are more accurate
Import>Table Definitions>Orchestrate Schema Definitions
Import>Table Definitions>ODBC Table Definitions
Orchestrate Schema Import
ODBC Import
Select ODBC datasource name
RDBMS Access
Automatically convert RDBMS table layouts to/from DataStage Table Definitions
RDBMS NULLs converted to/from DataStage NULLs
Support for standard SQL syntax for specifying:
SELECT clause list
WHERE clause filter condition
INSERT / UPDATE
Supports user-defined queries
Native Parallel RDBMS Stages
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
ODBC Enterprise
SQL Server Enterprise
RDBMS Usage
As a source
Extract data from table (stream link)
– Read methods include: Table, Generated SQL SELECT, or User-defined SQL
– User-defined can perform joins, access views
Lookup (reference link)
– Normal lookup is memory-based (all table data read into memory)
– Can perform one lookup at a time in DBMS (sparse option)
– Continue/drop/fail options
As a target
Inserts
Upserts (Inserts and updates)
Loader
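The trade-off between a normal (memory-based) lookup and a sparse lookup can be sketched with Python's built-in sqlite3 module standing in for the RDBMS. The table, column names, and sample keys are invented for the example. A failed lookup returns None here, which roughly corresponds to a null on the reference columns when the row is allowed to continue.

```python
import sqlite3

# In-memory database standing in for the reference table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)", [(1, "Orange Juice"), (2, "Milk")]
)


def normal_lookup(keys):
    """Read the whole reference table into memory once, then probe the copy."""
    table = dict(conn.execute("SELECT id, name FROM product"))
    return [table.get(k) for k in keys]


def sparse_lookup(keys):
    """Issue one query against the database for each input row."""
    result = []
    for k in keys:
        row = conn.execute(
            "SELECT name FROM product WHERE id = ?", (k,)
        ).fetchone()
        result.append(row[0] if row else None)
    return result


stream_keys = [2, 3, 1]
normal = normal_lookup(stream_keys)
sparse = sparse_lookup(stream_keys)
```

The results are identical. The normal form pays once to load the table, while the sparse form pays one round trip per row, which tends to suit a small input stream probing a very large table.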
DB2 Enterprise Stage Source
Auto-generated SELECT
Connection information
Job example
Sourcing with User-Defined SQL
User-defined read method
Columns in SQL must match definitions on Columns tab
DBMS Source – Lookup
Reference link
DB2 Stage Target Properties
SQL INSERT
Drop table and create
Database specified by job parameter
Optional CLOSE command
DB2 Target Stage Upsert
SQL INSERT
SQL UPDATE
Upsert method
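An upsert pairs an INSERT statement with an UPDATE statement so that new keys are inserted and existing keys are updated. A minimal sketch using Python's built-in sqlite3 module follows. The table, the column names, and the update-then-insert order are assumptions for illustration and are not taken from the DB2 stage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Old Name')")


def upsert(row_id, name):
    """Try the UPDATE first; if no row matched, fall back to the INSERT."""
    cur = conn.execute(
        "UPDATE customer SET name = ? WHERE id = ?", (name, row_id)
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO customer (id, name) VALUES (?, ?)", (row_id, name)
        )


upsert(1, "New Name")  # existing key: row is updated
upsert(2, "Second")    # new key: row is inserted
final_rows = conn.execute(
    "SELECT id, name FROM customer ORDER BY id"
).fetchall()
```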
Generated OSH Primer
Comment blocks introduce each operator
Operator order is determined by the order stages were added to the canvas
OSH uses the familiar syntax of the UNIX shell
Operator name
Schema
Operator options (“-name value” format)
Input (indicated by n< where n is the input #)
Output (indicated by n> where n is the output #)
may include modify
For every operator, input and/or output datasets are numbered sequentially starting from 0. E.g.:
op1 0> dst
op1 1< src
Virtual datasets are generated to connect operators
Virtual dataset is used to connect output of one operator to input of another
Generated OSH for first 2 stages
####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record ( a:int32; b:string[max=12]; c:nullable decimal[10,2] {nulls=10}; )
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify ( keep a,b,c; )] 'SortSt:lnk_sorted.v'
;
Framework v. DataStage Terminology
Framework                  DataStage
schema                     table definition
property                   format
type                       SQL type and length
virtual dataset            link
record / field             row / column
operator                   stage
step, flow, OSH command    job
Framework                  DS Parallel Engine
• GUI uses both terminologies
• Log messages (info, warnings, errors) use Framework terminology
Elements of a Framework Program
• Operators
• Virtual datasets: set of rows processed by Framework
• Schema: data description (metadata) for datasets and links
Enterprise Edition Runtime Architecture
Enterprise Edition Job Startup
Generated OSH and configuration file are used to “compose” a job “Score”
Think of “Score” as in musical score, not game score
Similar to the way an RDBMS builds a query optimization plan
Identifies degree of parallelism and node assignments for each operator
Inserts sorts and partitioners as needed to ensure correct results
Defines connection topology (virtual datasets) between adjacent operators
Inserts buffer operators to prevent deadlocks
E.g., in fork-joins
Defines number of actual OS processes
Where possible, multiple operators are combined within a single OS process to improve performance and optimize resource requirements
Job Score is used to fork processes with communication interconnects for data, message, and control
Set $APT_STARTUP_STATUS to show each step of job startup
Set $APT_PM_SHOW_PIDS to show process IDs in DataStage log
Enterprise Edition Runtime
It is only after the job Score and processes are created that processing begins
“Startup overhead” of an EE job
Job processing ends when either:
Last row of data is processed by final operator
A fatal error is encountered by any operator
Job is halted (SIGINT) by DataStage Job Control or human intervention (e.g. DataStage Director STOP)
Viewing the Job Score
• Set $APT_DUMP_SCORE to output the Score to the job log
• For each job run, 2 separate Score dumps are written
• First score is for the license operator
• Second score entry is the real job score
To identify the Score dump, look for “main program: This step …”
The word “Score” does not appear anywhere in the dump
License operator job score
Job score
Thank You