DataStage Enterprise Edition
8/10/2019 data stage doc
http://slidepdf.com/reader/full/data-stage-doc 1/374
Proposed Course Agenda
Day 1 – Review of EE Concepts
– Sequential Access
– Best Practices
– DBMS as Source
Day 2 – EE Architecture
– Transforming Data
– DBMS as Target
– Sorting Data
Day 3 – Combining Data
– Configuration Files
– Extending EE
– Meta Data in EE
Day 4 – Job Sequencing
– Testing and Debugging
The Course Material
Course Manual
Online Help
Exercise Files and Exercise Guide
Intro
Part 1
Introduction to DataStage EE
What is DataStage?
Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects – such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Schedule, run, and monitor jobs all within DataStage
Administer your DataStage development and execution environments
DataStage Server and Clients
DataStage Administrator
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
Define global and project properties in Administrator
Import meta data into Manager
Build the job in Designer
Compile the job in Designer
Validate, run, and monitor in Director
DataStage Projects
Quiz – True or False
DataStage Designer is used to build and compile your ETL jobs
Manager is used to execute your jobs after you build them
Director is used to execute your jobs after you build them
Administrator is used to set global and project properties
Intro Part 2
Configuring Projects
Module Objectives
After this module you will be able to:
– Explain how to create and delete projects
– Set project properties in Administrator
– Set EE global properties in Administrator
Project Properties
Projects can be created and deleted in Administrator
Project properties and defaults are set in Administrator
Setting Project Properties
To set project properties, log onto Administrator, select your project, and then click "Properties"
Licensing Tab
Projects General Tab
Environment Variables
Permissions Tab
Tracing Tab
Tunables Tab
Parallel Tab
Intro Part 3
Managing Meta Data
Module Objectives
After this module you will be able to:
– Describe the DataStage Manager components and functionality
– Import and export DataStage objects
– Import metadata for a sequential file
What Is Metadata?
[Diagram: data flows from Source through Transform to Target; meta data about each is stored in the Meta Data Repository]
DataStage Manager
Manager Contents
Metadata describing sources and targets: table definitions
DataStage objects: jobs, routines, table definitions, etc.
Import and Export
Any object in Manager can be exported to a file
Can export whole projects
Use for backup
Sometimes used for version control
Can be used to move DataStage objects from one project to another
Use to share DataStage jobs and projects with other developers
Export Procedure
In Manager, click "Export > DataStage Components"
Select DataStage objects for export
Specify the type of export: DSX or XML
Specify the file path on the client machine
Quiz: True or False?
You can export DataStage objects such as jobs, but you can’t export metadata, such as field definitions of a sequential file.
Quiz: True or False?
The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
Exporting DataStage Objects
Import Procedure
In Manager, click "Import > DataStage Components"
Select DataStage objects for import
Importing DataStage Objects
Import Options
Exercise
Import DataStage Component (table definition)
Metadata Import
Import format and column definitions from sequential files
Import relational table column definitions
Imported as "Table Definitions"
Table definitions can be loaded into job stages
Sequential File Import Procedure
In Manager, click Import > Table Definitions > Sequential File Definitions
Select the directory containing the sequential file, and then the file
Select the Manager category
Examine format and column definitions and edit as necessary
Manager Table Definition
Importing Sequential Metadata
Intro Part 4
Designing and Documenting Jobs
Module Objectives
After this module you will be able to:
– Describe what a DataStage job is
– List the steps involved in creating a job
– Describe links and stages
– Identify the different types of stages
– Design a simple extraction and load job
– Compile your job
– Create parameters to make your job flexible
– Document your job
What Is a Job?
Executable DataStage program
Created in DataStage Designer, but can use components from Manager
Built using a graphical user interface
Compiles into Orchestrate shell language (OSH)
Job Development Overview
In Manager, import metadata defining sources and targets
In Designer, add stages defining data extractions and loads
Add Transformers and other stages to define data transformations
Add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job
Designer Work Area
Designer Toolbar
Provides quick access to the main functions of Designer
Job properties
Compile
Show/hide metadata markers
Tools Palette
Adding Stages and Links
Stages can be dragged from the tools palette or from the stage type branch of the repository view
Links can be drawn from the tools palette or by right-clicking and dragging from one stage to another
Sequential File Stage
Used to extract data from, or load data to, a sequential file
Specify the full path to the file
Specify a file format: fixed width or delimited
Specify column definitions
Specify the write action
Job Creation Example Sequence
Brief walkthrough of procedure
Presumes meta data already loaded in repository
Drag Stages and Links Using Palette
Assign Meta Data
Editing a Sequential Source Stage
Transformer Stage
Used to define constraints, derivations, and column mappings
A column mapping maps an input column to an output column
In this module we will just define column mappings (no derivations)
Create Column Mappings
Creating Stage Variables
Adding Job Parameters
Makes the job more flexible
Parameters can be:
– Used in constraints and derivations
– Used in directory and file names
Parameter values are determined at run time
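The run-time resolution described above can be sketched conceptually. DataStage jobs reference parameters as #ParamName# inside file names and stage properties; the Python below is only an illustrative stand-in for that substitution (DataStage performs it internally), and `resolve_params` and the sample path are hypothetical names:

```python
import re

def resolve_params(text, params):
    """Replace #Name# references with values supplied at 'run time'."""
    def sub(match):
        name = match.group(1)
        if name not in params:
            raise KeyError("undefined job parameter: " + name)
        return params[name]
    return re.sub(r"#(\w+)#", sub, text)

# The same job design works against any environment or date:
path = "/data/#Env#/customers_#RunDate#.txt"
resolved = resolve_params(path, {"Env": "dev", "RunDate": "20030101"})
print(resolved)  # /data/dev/customers_20030101.txt
```

Because the values are bound only at run time, one compiled job can serve many environments and schedules.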
Adding Job Documentation
Job Properties
– Short and long descriptions
– Shows in Manager
Annotation stage
– Is a stage on the tool palette
– Shows on the job GUI (work area)
Job Properties Documentation
Annotation Stage on the Palette
Annotation Stage Properties
Final Job Work Area with Documentation
Compiling a Job
Errors or Successful Message
Prerequisite to Job Execution
Result from Designer compile
Running Your Job
Director Log View
Message Details are Available
Other Director Functions
Schedule job to run on a particular date/time
Clear job log
Set Director options
– Row limits
– Abort after x warnings
Module 1
DSEE – DataStage EE
Review
Ascential’s Enterprise Data Integration Platform
Data Integration Platform
[Diagram: ANY SOURCE (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, Web services, data warehouse, other apps.) to ANY TARGET (CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, Web services, data warehouse, other apps.), under Command & Control:
– DISCOVER (Data Profiling): gather relevant information for target enterprise applications
– PREPARE (Data Quality): cleanse, correct, and match input data
– TRANSFORM (Extract, Transform, Load): standardize and enrich data and load to targets
All underpinned by Meta Data Management and Parallel Execution]
Course Objectives
You will learn to:
– Build DataStage EE jobs using complex logic
– Utilize parallel processing techniques to increase job performance
– Build custom stages based on application needs
Course emphasis is:
– Advanced usage of DataStage EE
– Application job development
– Best practices techniques
Course Agenda
Day 1
– Review of EE Concepts
– Sequential Access
– Standards
– DBMS Access
Day 2
– EE Architecture
– Transforming Data
– Sorting Data
Day 3
– Combining Data
– Configuration Files
Day 4
– Extending EE
– Meta Data Usage
– Job Control
– Testing
Administrator – Licensing and Timeout
Administrator – Project Creation/Removal
Functions specific to a project.
Administrator – Project Properties
RCP for parallel jobs should be enabled
Variables for parallel processing
OSH is what is run by the EE Framework
DataStage Manager
Designer Workspace
Can execute the job from Designer
DataStage Generated OSH
The EE Framework runs OSH
Director – Executing Jobs
Messages from the previous run shown in a different color
Stages
Can now customize the Designer’s palette
Select desired stages and drag to favorites
Row Generator
Can build test data
Repeatable property
Edit row in column tab
Why EE is so Effective
Parallel processing paradigm
– More hardware, faster processing
– Level of parallelization is determined by a configuration file read at runtime
Emphasis on memory
– Data is read into memory and lookups are performed like a hash table
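The in-memory lookup idea above can be sketched in a few lines. This is an illustrative Python analogy, not DataStage code: the reference data is loaded once into a hash table (a dict here), so each probe is a constant-time memory access instead of a file or table scan.

```python
# Hypothetical reference data: (key, value) pairs loaded once into memory.
reference = [(101, "Ford"), (102, "GM"), (103, "Toyota")]

# "Read into memory": build the hash table a single time.
lookup = {key: value for key, value in reference}

# Then each input row probes the table in O(1).
rows = [{"make_id": 103}, {"make_id": 101}]
for row in rows:
    row["make_name"] = lookup.get(row["make_id"], "UNKNOWN")

print(rows)
```

The same trade-off applies in EE: memory is spent up front so that per-row processing stays fast.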
Scaleable Systems: Examples
Three main types of scalable systems
Symmetric Multiprocessors (SMP): shared memory and disk
Clusters: UNIX systems connected via networks
MPP: Massively Parallel Processing
SMP: Shared Everything
• Multiple CPUs with a single operating system
• Programs communicate using shared memory
• All CPUs share system resources (OS, memory with single linear address space, disks, I/O)
When used with Enterprise Edition:
• Data transport uses shared memory
• Simplified startup
Enterprise Edition treats NUMA (Non-Uniform Memory Access) as plain SMP
Traditional Batch Processing
[Diagram: Operational Data and Archived Data flow from Source through Transform, Clean, and Load into the Data Warehouse, landing to disk between each step]
Traditional approach to batch processing:
• Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
  • a 10 GB stream leads to 70 GB of I/O
  • processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
  • disk I/O consumes the processing
  • terabytes of disk required for temporary staging
Pipeline Multiprocessing
Data Pipelining
[Diagram: Operational Data and Archived Data flow from Source through Transform, Clean, and Load directly into the Data Warehouse, with no intermediate disk]
• Transform, clean, and load processes execute simultaneously on the same processor
  • rows are moving forward through the flow
• Start a downstream process while an upstream process is still running.
• This eliminates intermediate storing to disk, which is critical for big data.
• This also keeps the processors busy.
• Still has limits on scalability
Think of a conveyor belt moving the rows from process to process!
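The conveyor-belt behavior can be mimicked with Python generators, purely as a teaching analogy (EE pipelines run as operating-system processes, not Python). Each stage pulls rows from the previous one, so a downstream stage starts before the upstream one finishes and nothing lands to disk; the stage names and sample data are invented for the sketch:

```python
# Each stage is lazy: rows flow forward one at a time, like a conveyor belt.
def source():
    for i in range(5):
        yield {"id": i, "name": "  name%d  " % i}

def transform(rows):
    for row in rows:
        yield {"id": row["id"], "name": row["name"].strip()}

def clean(rows):
    for row in rows:
        if row["id"] % 2 == 0:   # drop odd ids as a stand-in for cleansing
            yield row

# The "load" step drains the pipeline; no intermediate dataset is stored.
loaded = list(clean(transform(source())))
print(loaded)
```

The first row reaches `clean` before `source` has produced the last one, which is exactly the property that lets pipelining avoid intermediate staging.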
Partition Parallelism
Data Partitioning
[Diagram: source data is split by key range (A-F, G-M, N-T, U-Z) across four Transform instances, one per node (Node 1 through Node 4)]
• Break up big data into partitions
• Run one partition on each processor
• 4X faster on 4 processors; with data big enough, 100X faster on 100 processors
• This is exactly how parallel databases work!
• Data partitioning requires the same transform on all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform
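The range-partitioning scheme from the slide can be sketched as follows. This is an illustrative Python model, not EE code: in EE the partitions would run on separate nodes named in the configuration file, while here they are just lists processed with the same transform.

```python
# Four key ranges on the first letter of the name, as in the slide.
RANGES = [("A", "F"), ("G", "M"), ("N", "T"), ("U", "Z")]

def partition_of(name):
    first = name[0].upper()
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= first <= hi:
            return i
    raise ValueError("no partition for " + name)

names = ["Abbott", "Zorn", "Miller", "Novak"]
partitions = [[] for _ in RANGES]
for n in names:
    partitions[partition_of(n)].append(n)

# The SAME transform is applied to every partition:
# Aaron Abbott and Zygmund Zorn undergo the same (uppercase) transform.
result = [[n.upper() for n in part] for part in partitions]
print(partitions)  # [['Abbott'], ['Miller'], ['Novak'], ['Zorn']]
print(result)
```

With each partition assigned to its own processor, total work is unchanged but elapsed time shrinks roughly in proportion to the partition count.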
Combining Parallelism Types
Putting It All Together: Parallel Dataflow
[Diagram: source data flows through pipelined Transform, Clean, and Load stages from Source to Target, into the Data Warehouse]
Repartitioning
Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly
Without landing to disk!
[Diagram: partitioned data (A-F, G-M, N-T, U-Z) flows from source to Data Warehouse through pipelined Transform, Clean, and Load stages, repartitioned between stages by different keys: customer last name, customer zip code, credit card number]
EE Program Elements
• Dataset: uniform set of rows in the Framework's internal representation
– Three flavors:
  1. file sets (*.fs): stored on multiple Unix files as flat files
  2. persistent (*.ds): stored on multiple Unix files in Framework format; read and written using the DataSet Stage
  3. virtual (*.v): links, in Framework format, NOT stored on disk
– The Framework processes only datasets; hence the possible need for Import
– Different datasets typically have different schemas
– Convention: "dataset" = Framework data set.
• Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).
– All the partitions of a dataset follow the same schema: that of the dataset
DataStage EE Architecture
[Diagram: an Orchestrate program (a sequential dataflow: Import, Clean1, Clean2, Merge, Analyze) runs on the Orchestrate Application Framework and Runtime System, driven by a Configuration File. The Framework provides centralized error handling and event logging, parallel access to data in flat files and in RDBMS, inter-node communications, parallel pipelining, parallelization of operations, and performance visualization.]
Orchestrate Framework: provides application scalability
DataStage: provides data integration platform
DataStage Enterprise Edition: best-of-breed scalable data integration platform. No limitations on data volumes or throughput
Introduction to DataStage EE
DSEE:
– Automatically scales to fit the machine
– Handles data flow among multiple CPUs and disks
With DSEE you can:
– Create applications for SMPs, clusters, and MPPs… Enterprise Edition is architecture-neutral
– Access relational databases in parallel
– Execute external applications in parallel
– Store data across multiple disks and nodes
Job Design vs. Execution
Developer assembles the data flow using the Designer…
…and gets: parallel access, propagation, transformation, and load.
The design is good for 1 node, 4 nodes, or N nodes. To change the number of nodes, just swap the configuration file.
No need to modify or recompile the design
Partitioners and Collectors
Partitioners distribute rows into partitions
– implement data-partition parallelism
Collectors = inverse partitioners
Live on input links of stages running
– in parallel (partitioners)
– sequentially (collectors)
Use a choice of methods
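One of the simplest methods, round robin, can be modeled in a few lines. This Python sketch is only an analogy for the partitioner/collector pair (the function names are invented): the partitioner deals rows out to partitions in turn, and the collector is its inverse, merging the partitions back into one sequential stream.

```python
from itertools import zip_longest

def partition_round_robin(rows, n):
    """Deal rows across n partitions, one at a time."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def collect_round_robin(parts):
    """Inverse of the partitioner: interleave partitions back into one stream."""
    out = []
    for group in zip_longest(*parts):        # one row from each partition per pass
        out.extend(r for r in group if r is not None)
    return out

parts = partition_round_robin(list(range(7)), 3)
print(parts)                      # [[0, 3, 6], [1, 4], [2, 5]]
print(collect_round_robin(parts))  # [0, 1, 2, 3, 4, 5, 6]
```

Other methods (hash, range, same, entire) trade row order and key locality differently; the partitioner/collector inverse relationship shown here is the common thread.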
Exercise
Complete exercises 1-1, 1-2, and 1-3
Module 2
DSEE Sequential Access
Module Objectives
You will learn to:
– Import sequential files into the EE Framework
– Utilize parallel processing techniques to increase sequential file access
– Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages
– Manage partitioned data stored by the Framework
Types of Sequential Data Stages
Sequential
– Fixed or variable length
File Set
Lookup File Set
Data Set
How the Sequential Stage Works
Generates Import/Export operators, depending on whether the stage is a source or a target
Performs direct C++ file I/O streams
Using the Sequential File Stage
Importing/Exporting Data
Both import and export of general files (text, binary) are performed by the Sequential File Stage.
– Data import: converts the external file into the EE internal format
– Data export: converts from the EE internal format back to the external file
Working With Flat Files
Sequential File Stage
– Normally will execute in sequential mode
– Can be parallel if reading multiple files (file pattern option)
– Can use multiple readers within a node
– DSEE needs to know:
  How the file is divided into rows
  How a row is divided into columns
Processes Needed to Import Data
Recordization
– Divides the input stream into records
– Set on the format tab
Columnization
– Divides the record into columns
– Default set on the format tab but can be overridden on the columns tab
– Can be "incomplete" if using a schema, or not even specified in the stage if using RCP
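The two import steps above can be sketched as plain parsing. This Python model is only illustrative (the real work is done by the generated Import operator), and the sample data and schema are invented:

```python
stream = "101,Ford,1997\n102,GM,2001\n"

# Recordization: the record delimiter (newline, set on the format tab)
# divides the input stream into records.
records = [r for r in stream.split("\n") if r]

# Columnization: the field delimiter (comma) divides each record into
# columns; names and types come from the column definitions.
schema = [("id", int), ("make", str), ("year", int)]
rows = []
for rec in records:
    values = rec.split(",")
    rows.append({name: typ(v) for (name, typ), v in zip(schema, values)})

print(rows)
# [{'id': 101, 'make': 'Ford', 'year': 1997}, {'id': 102, 'make': 'GM', 'year': 2001}]
```

Keeping the two steps separate is what lets the format tab define records once while the columns tab (or a schema, or RCP) refines columnization independently.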
File Format Example
[Diagram: records are a series of fields separated by the field delimiter (comma) and terminated by the record delimiter (nl). With Final Delimiter = comma, the last field is followed by a comma before the record delimiter; with Final Delimiter = end, the last field is followed directly by the record delimiter.]
Sequential File Stage
To set the properties, use the stage editor
– Pages (general, input/output)
– Tabs (format, columns)
Sequential stage link rules:
– One input link
– One output link (except for reject link definition)
– One reject link
  Will reject any records not matching the meta data in the column definitions
Job Design Using Sequential Stages
Stage categories
Properties – Multiple Files
Click to add more files having the same meta data.
Properties - Multiple Readers
Multiple readers option allows you to set the number of readers
Format Tab
File into records
Record into columns
Read Methods
Reject Link
Reject mode = output
Source
– All records not matching the meta data (the column definitions)
Target
– All records that are rejected for any reason
Meta data – one column, data type = raw
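The source-side reject behavior can be sketched as a filter. This is an illustrative Python model, not stage code: rows that fail to columnize against the column definitions are routed down the reject link as a single raw column, while conforming rows continue on the output link. The schema and sample records are invented.

```python
schema = [("id", int), ("amount", float)]

def split_rows(records):
    good, rejects = [], []
    for rec in records:
        values = rec.split(",")
        try:
            if len(values) != len(schema):
                raise ValueError("wrong column count")
            good.append({n: t(v) for (n, t), v in zip(schema, values)})
        except ValueError:
            # Reject link meta data: one raw column holding the record bytes.
            rejects.append(rec.encode())
    return good, rejects

good, rejects = split_rows(["1,9.95", "2,abc", "3"])
print(good)     # [{'id': 1, 'amount': 9.95}]
print(rejects)  # [b'2,abc', b'3']
```

Keeping rejects raw is deliberate: since the record could not be parsed, the only safe representation is its original bytes.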
File Set Stage
Can read or write file sets
Files suffixed by .fs
File set consists of:
1. Descriptor file – contains location of raw data files + meta data
2. Individual raw data files
Can be processed in parallel
File Set Stage Example
Descriptor file
File Set Usage
Why use a file set?
– 2 GB limit on some file systems
– Need to distribute data among nodes to prevent overruns
– If used in parallel, runs faster than a sequential file
Lookup File Set Stage
Can create file sets
Usually used in conjunction with Lookup stages
Lookup File Set > Properties
Key column specified
Key column dropped in descriptor file
Data Set
Operating system (Framework) file
Suffixed by .ds
Referred to by a control file
Managed by Data Set Management utility fromGUI (Manager, Designer, Director)
Represents persistent data
Key to good performance in set of linked jobs
Persistent Datasets
Accessed from/to disk with the DataSet Stage. Two parts:
– Descriptor file: contains metadata and data location, but NOT the data itself
– Data file(s): contain the data as multiple Unix files (one per node), accessible in parallel
[Diagram: descriptor input.ds points to data files at node1:/local/disk1/… and node2:/local/disk2/…, with schema: record ( partno: int32; description: string; )]
Quiz!
• True or False?
Everything that has been data-partitioned must be collected in the same job
Data Set Stage
Is the data partitioned?
Engine Data Translation
Occurs on import
– From sequential files or file sets
– From RDBMS
Occurs on export
– From datasets to file sets or sequential files
– From datasets to RDBMS
Engine is most efficient when processing internally formatted records (i.e. data contained in datasets)
Data Set Management
Display data
Schema
Data Set Management From Unix
Alternative method of managing file sets and data sets:
– dsrecords
  Gives record count
  Unix command-line utility: $ dsrecords ds_name
  e.g. $ dsrecords myDS.ds
  156999 records
– orchadmin
  Manages EE persistent data sets
  Unix command-line utility
  e.g. $ orchadmin rm myDataSet.ds
Exercise
Complete exercises 2-1, 2-2, 2-3, and 2-4.
Module 3
Standards and Techniques
Objectives
Establish standard techniques for DSEE development
Will cover:
– Job documentation
– Naming conventions for jobs, links, and stages
– Iterative job design
– Useful stages for job development
– Using configuration files for development
– Using environmental variables
– Job parameters
Job Presentation
Document using the annotation stage
Job Properties Documentation
Description shows in DS Manager and MetaStage
Organize jobs into categories
Naming conventions
Stages named after the
– Data they access
– Function they perform
– DO NOT leave defaulted stage names like Sequential_File_0
Links named for the data they carry
– DO NOT leave defaulted link names like DSLink3
Stage and Link Names
Stages and links renamed to the data they handle
Create Reusable Job Components
Use Enterprise Edition shared containers when feasible
Container
Use Iterative Job Design
Use Copy or Peek stage as stub
Test job in phases – small first, then increasing in complexity
Use Peek stage to examine records
Copy or Peek Stage Stub
Copy stage
Transformer Stage Techniques
Suggestions:
– Always include a reject link.
– Always test for null values before using a column in a function.
– Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later.
– Be aware of column and stage variable data types. Often the user does not pay attention to the stage variable type.
– Avoid type conversions. Try to maintain the data type as imported.
The Copy Stage
With 1 link in, 1 link out, the Copy stage is the ultimate "no-op" (place-holder):
– Partitioners
– Sort / Remove Duplicates
– Rename, Drop column
… can be inserted on:
– input link (Partitioning): Partitioners, Sort, Remove Duplicates
– output link (Mapping page): Rename, Drop
Sometimes it can replace the transformer.
Developing Jobs
1. Keep it simple
• Jobs with many stages are hard to debug and maintain.
2. Start small and Build to final Solution
• Use view data, copy, and peek.
• Start from source and work out.
• Develop with a 1 node configuration file.
3. Solve the business problem before the performance problem.
• Don’t worry too much about partitioning until the sequential flow works as expected.
4. If you have to write to disk, use a persistent data set.
Final Result
Good Things to Have in each Job
Use job parameters
Some helpful environmental variables to add to job parameters
– $APT_DUMP_SCORE: report OSH to message log
– $APT_CONFIG_FILE: establishes runtime parameters to the EE engine, e.g., degree of parallelization
Setting Job Parameters
Click to add environment variables
DUMP SCORE Output
Setting APT_DUMP_SCORE yields:
Double-click to see mapping: node --> partition
Partitioner and collector
Exercise
Complete exercise 3-1
Module 4
DBMS Access
Objectives
Understand how DSEE reads and writes records
to an RDBMS
Understand how to handle nulls on DBMS lookup
Utilize this knowledge to:
– Read and write database tables
– Use database tables to lookup data
– Use null handling options to clean data
Parallel Database Connectivity
Traditional client-server:
– Only the RDBMS is running in parallel
– Each application has only one connection
– Suitable only for small data volumes
Enterprise Edition:
– Parallel server runs APPLICATIONS
– Application has parallel connections to RDBMS
– Suitable for large data volumes
– Higher levels of integration possible
(Figure: many clients each holding a single connection to a parallel RDBMS, versus EE stages such as Sort and Load holding parallel connections to the parallel RDBMS)
RDBMS Access: Supported Databases
Enterprise Edition provides high-performance, scalable interfaces for:
DB2
Informix
Oracle
Teradata
RDBMS Access
Automatically convert RDBMS table layouts to/from
Enterprise Edition Table Definitions
RDBMS nulls converted to/from nullable field values
Support for standard SQL syntax for specifying:
– field list for SELECT statement
– filter for WHERE clause
Can write an explicit SQL query to access RDBMS
EE supplies additional information in the SQL query
RDBMS Stages
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
RDBMS Source – Stream Link
Stream link
DBMS Source - User-defined SQL
Columns in SQL statement must match the meta data in the Columns tab
DBMS Source – Reference Link
Reject link
Lookup Reject Link
"Output" option automatically creates the reject link
Null Handling
Must handle null condition if lookup record is not
found and "continue" option is chosen
Can be done in a Transformer stage
Lookup Stage Mapping
Link name
Lookup Stage Properties
Reference link
Must have same column name in input and reference links. You will get the results of the lookup in the output column.
DBMS as a Target
DBMS As Target
Write Methods
– Delete
– Load
– Upsert
– Write (DB2)
Write mode for load method
– Truncate
– Create
– Replace
– Append
Target Properties
Upsert mode determines options
Generated code can be copied
Checking for Nulls
Use Transformer stage to test for fields with null
values (use IsNull functions)
In Transformer, can reject or load default value
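The test-then-reject-or-default pattern above can be sketched outside DataStage. This is illustrative Python, not DataStage code; the column names (cust_id, region) and the default value are hypothetical:

```python
# Illustrative sketch (not DataStage code): after a lookup run with the
# "continue" option, unmatched rows carry None (null) in the looked-up
# columns. A transformer-style step applies an IsNull-like test and either
# loads a default value or sends the row down a reject link.

DEFAULT_REGION = "UNKNOWN"   # hypothetical default loaded for null lookups

def handle_nulls(rows):
    clean, rejects = [], []
    for row in rows:
        if row["region"] is None:            # IsNull-style test
            if row["cust_id"] is None:       # key itself is null: reject
                rejects.append(row)
                continue
            row = {**row, "region": DEFAULT_REGION}   # load default value
        clean.append(row)
    return clean, rejects

rows = [{"cust_id": 1, "region": "EMEA"},
        {"cust_id": 2, "region": None},       # lookup failed for this row
        {"cust_id": None, "region": None}]    # bad key: rejected
clean, rejects = handle_nulls(rows)
```

The same decision (substitute a default vs. reject) maps onto a Transformer constraint plus a derivation in the job itself.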
Exercise
Complete exercise 4-2
Module 5
Platform Architecture
Objectives
Understand how Enterprise Edition Framework
processes data
You will be able to:
– Read and understand OSH
– Perform troubleshooting
Concepts
The Enterprise Edition Platform
– Script language - OSH (generated by DataStage Parallel Canvas, and run by DataStage Director)
– Communication - conductor, section leaders, players
– Configuration files (only one active at a time; describes H/W)
– Meta data - schemas/tables
– Schema propagation - RCP
– EE extensibility - Buildop, Wrapper
– Datasets (data in the Framework's internal representation)
EE Stages Involve A Series Of Processing Steps
DS-EE Stage Elements
Input Data Set schema: prov_num:int16;member_num:int8;custid:int32;
Output Data Set schema: prov_num:int16;member_num:int8;custid:int32;
(Figure: an EE stage consists of a partitioner feeding the stage's business logic)
EE Stage
• Piece of application logic running against individual records
• Parallel or sequential
Dual Parallelism Eliminates Bottlenecks!
DSEE Stage Execution
• EE delivers parallelism in two ways
– Pipeline
– Partition
• Block buffering between components
– Eliminates need for program load balancing
– Maintains orderly data flow
(Figure: producer and consumer stages connected by pipeline parallelism and partition parallelism)
Stages Control Partition Parallelism
Execution Mode (sequential/parallel) is controlled by Stage
– default = parallel for most Ascential-supplied Stages
– Developer can override default mode
– Parallel Stage inserts the default partitioner (Auto) on its input links
– Sequential Stage inserts the default collector (Auto) on its input links
– Developer can override defaults:
execution mode (parallel/sequential) of Stage > Advanced tab
choice of partitioner/collector on Input > Partitioning tab
How Parallel Is It?
Degree of parallelism is determined by the
configuration file
– Total number of logical nodes in the default pool, or a subset if using "constraints"
– Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage
OSH
DataStage EE GUI generates OSH scripts
– Ability to view OSH is turned on in Administrator
– OSH can be viewed in Designer using job properties
The Framework executes OSH
What is OSH? – Orchestrate shell
– Has a UNIX command-line interface
OSH Script
An osh script is a quoted string which
specifies:
Orchestrate step
– In its simplest form, it is:
osh "op < in.ds > out.ds"
Where:
– op is an Orchestrate operator
– in.ds is the input data set
– out.ds is the output data set
Enable Visible OSH in Administrator
Will be enabled for all projects
View OSH in Designer
Schema
Operator
OSH Practice
Exercise 5-1 – Instructor demo (optional)
Elements of a Framework Program
• Operators
• Datasets: set of rows processed by the Framework
– Orchestrate data sets:
– persistent (terminal) *.ds, and
– virtual (internal) *.v
– Also: flat "file sets" *.fs
• Schema: data description (metadata) for datasets and links.
Datasets
• Consist of partitioned data and schema
• Can be persistent (*.ds) or virtual (*.v, link)
• Overcome 2 GB File Limit
(Figure: what you program in the GUI generates OSH such as $ osh "operator_A > x.ds"; what gets processed is Operator A running on Nodes 1 through 4, each writing data files of x.ds: multiple files per partition, each up to 2 GB or larger)
Computing Architectures: Definition
Dedicated Disk – Uniprocessor:
• PC
• Workstation
• Single-processor server
Shared Memory/Disk – SMP System (Symmetric Multiprocessor):
• IBM, Sun, HP, Compaq
• 2 to 64 processors
• Majority of installations
Shared Nothing – Clusters and MPP Systems:
• 2 to hundreds of processors
• MPP: IBM and NCR Teradata
• Each node is a uniprocessor or SMP
(Figure: each architecture shown with its CPUs, memory, and disks)
Working with Configuration Files
You can easily switch between config files:
'1-node' file - for sequential execution, lighter reports; handy for testing
'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism
'BigN-nodes' file - aims at full data-partitioned parallelism
Only one file is active while a step is running
The Framework queries (first) the environment variable $APT_CONFIG_FILE
The number of nodes declared in the config file need not match the number of CPUs
The same configuration file can be used in development and production
Scheduling: Nodes, Processes, and CPUs
DS/EE does not:
– know how many CPUs are available
– schedule
Who does what?
– DS/EE creates (Nodes * Ops) Unix processes
– The O/S schedules these processes on the CPUs
Where:
– Nodes = # logical nodes declared in config. file
– Ops = # ops. (approx. # blue boxes in V.O.)
– Processes = # Unix processes
– CPUs = # available CPUs
Who knows what?
                Nodes   Ops   Processes     CPUs
User            Y       -     -             N
Orchestrate     Y       Y     Nodes * Ops   N
O/S             -       -     Nodes * Ops   Y
Configuring DSEE – Node Pools
{
    node "n1" {
        fastname "s1"
        pools "" "n1" "s1" "app2" "sort"
        resource disk "/orch/n1/d1" {}
        resource disk "/orch/n1/d2" {}
        resource scratchdisk "/temp" {"sort"}
    }
    node "n2" {
        fastname "s2"
        pools "" "n2" "s2" "app1"
        resource disk "/orch/n2/d1" {}
        resource disk "/orch/n2/d2" {}
        resource scratchdisk "/temp" {}
    }
    node "n3" {
        fastname "s3"
        pools "" "n3" "s3" "app1"
        resource disk "/orch/n3/d1" {}
        resource scratchdisk "/temp" {}
    }
    node "n4" {
        fastname "s4"
        pools "" "n4" "s4" "app1"
        resource disk "/orch/n4/d1" {}
        resource scratchdisk "/temp" {}
    }
}
Configuring DSEE – Disk Pools
{
    node "n1" {
        fastname "s1"
        pools "" "n1" "s1" "app2" "sort"
        resource disk "/orch/n1/d1" {}
        resource disk "/orch/n1/d2" {"bigdata"}
        resource scratchdisk "/temp" {"sort"}
    }
    node "n2" {
        fastname "s2"
        pools "" "n2" "s2" "app1"
        resource disk "/orch/n2/d1" {}
        resource disk "/orch/n2/d2" {"bigdata"}
        resource scratchdisk "/temp" {}
    }
    node "n3" {
        fastname "s3"
        pools "" "n3" "s3" "app1"
        resource disk "/orch/n3/d1" {}
        resource scratchdisk "/temp" {}
    }
    node "n4" {
        fastname "s4"
        pools "" "n4" "s4" "app1"
        resource disk "/orch/n4/d1" {}
        resource scratchdisk "/temp" {}
    }
}
Re-Partitioning
Parallel-to-parallel flow may incur reshuffling:
Records may jump between nodes
(Figure: a partitioner redistributing records between node 1 and node 2)
Partitioning Methods
Auto
Hash
Entire
Range
Range Map
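Among the methods listed, hash partitioning can be sketched outside DataStage. This is illustrative Python, not the EE engine's implementation; the key column name (acct) is hypothetical:

```python
# Illustrative sketch (not EE internals): hash partitioning routes each
# record to a partition computed from a hash of the key column, so records
# with equal key values always land in the same partition -- the property
# that keyed operations (join, merge, sort-based aggregation) rely on.

def hash_partition(rows, key, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        p = hash(row[key]) % n_partitions   # same key -> same partition
        partitions[p].append(row)
    return partitions

rows = [{"acct": a} for a in ("A", "B", "A", "C", "B")]
parts = hash_partition(rows, "acct", 2)
# all rows survive partitioning, and each key value lives in one partition
```

Entire partitioning, by contrast, would copy the full dataset to every partition, which is why it suits small lookup tables rather than large streams.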
Collectors
• Collectors combine partitions of a dataset into a single input stream to a sequential Stage
(Figure: data partitions flowing through a collector into a sequential Stage)
– Collectors do NOT synchronize data
Partitioning and Repartitioning Are Visible On Job Design
Setting a Node Constraint in the GUI
Reading Messages in Director
Set APT_DUMP_SCORE to true
Can be specified as a job parameter
Messages sent to Director log
If set, parallel job will produce a report showing the operators, processes, and datasets in the running job
Messages With APT_DUMP_SCORE= True
Exercise
Complete exercise 5-2
Module 6
Transforming Data
Module Objectives
Understand ways DataStage allows you to
transform data
Use this understanding to:
– Create column derivations using user-defined code or system functions
– Filter records based on business criteria
– Control data flow based on data conditions
Transformed Data
Transformed data is:
– Outgoing column is a derivation that may, or may not, include incoming fields or parts of incoming fields
– May be comprised of system variables
Frequently uses functions performed on something (i.e., incoming columns)
– Divided into categories, e.g.:
Date and time
Mathematical
Logical
Null handling
More
Transformer Stage Functions
Control data flow
Create derivations
Flow Control
Separate records flow down links based on data
condition, specified in Transformer stage constraints
Transformer stage can filter records
Other stages can filter records but do not exhibit advanced flow control
– Sequential can send bad records down reject link
– Lookup can reject records based on lookup failure
– Filter can select records based on data value
Rejecting Data
Reject option on Sequential stage
– Data does not agree with meta data
– Output consists of one column with binary data type
Reject links (from Lookup stage) result from the drop option of the "If Not Found" property
– Lookup "failed"
– All columns on reject link (no column mapping option)
Reject constraints are controlled from the constraint editor of the Transformer
– Can control column mapping
– Use the "Other/Log" checkbox
Rejecting Data Example
"If Not Found" property
Constraint – Other/Log option
Property Reject Mode = Output
Transformer Stage Properties
Transformer Stage Variables
First of transformer stage entities to execute
Execute in order from top to bottom
– Can write a program by using one stage variable to point to the results of a previous stage variable
Multi-purpose:
– Counters
– Hold values from previous rows to make comparisons
– Hold derivations to be used in multiple field derivations
– Can be used to control execution of constraints
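The accumulator use of stage variables can be sketched outside DataStage. This is illustrative Python, not Transformer code; the column names (key, amount, subtotal) are hypothetical, and the sketch assumes the input is already sorted on the key:

```python
# Sketch of the stage-variable pattern: variables evaluate top to bottom
# once per row and can hold values carried over from the previous row.
# Here prev_key and running_total play the role of two "stage variables"
# keeping a per-key subtotal, breaking on change of the key column
# (which is why pre-sorted input is assumed).

def subtotals(sorted_rows):
    prev_key, running_total = None, 0        # the "stage variables"
    out = []
    for row in sorted_rows:
        if row["key"] != prev_key:           # break on change of column value
            running_total = 0
        running_total += row["amount"]
        prev_key = row["key"]
        out.append({**row, "subtotal": running_total})
    return out

rows = [{"key": "A", "amount": 10}, {"key": "A", "amount": 5},
        {"key": "B", "amount": 7}]
result = subtotals(rows)
```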
Stage Variables
Show/Hide button
Transforming Data
Derivations
– Using expressions
– Using functions (date/time, etc.)
Transformer Stage Issues
– Sometimes requires sorting before the Transformer stage, e.g., when using a stage variable as an accumulator and needing to break on change of a column value
Checking for nulls
Checking for Nulls
Nulls can get introduced into the dataflow
because of failed lookups and the way in which you chose to handle this condition
Can be handled in constraints, derivations, stage variables, or a combination of these
Transformer - Handling Rejects
Constraint Rejects
– All expressions are false and reject row is checked
Transformer: Execution Order
• Derivations in stage variables are executed first
• Constraints are executed before derivations
• Column derivations in earlier links are executed before later links
• Derivations in higher columns are executed before lower columns
Parallel Palette - Two Transformers
All > Processing > Parallel > Processing
Transformer
– Is the non-Universe transformer
– Has a specific set of functions
– No DS routines available
Basic Transformer
– Makes server-style transforms available on the parallel palette
– Can use DS routines
• Program in Basic for both transformers
Transformer Functions From Derivation Editor
Date & Time
Logical
Null Handling
Number
String
Type Conversion
Exercise
Complete exercises 6-1, 6-2, and 6-3
Module 7
Sorting Data
Objectives
Understand DataStage EE sorting options
Use this understanding to create sorted lists of data to enable functionality within a Transformer stage
Sorting Data
Important because
– Some stages require sorted input
– Some stages may run faster, e.g., Aggregator
Can be performed:
– As an option within stages (use Input > Partitioning tab and set partitioning to anything other than Auto)
– As a separate stage (more complex sorts)
Sorting Alternatives
• Alternative representation of same flow:
Sort Option on Stage Link
Sort Stage
Sort Utility
DataStage – the default
UNIX
Sort Stage - Outputs
Specifies how the output is derived
Removing Duplicates
Can be done by Sort stage
– Use unique option
OR
Remove Duplicates stage
– Has more sophisticated ways to remove duplicates
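The sort-then-unique idea can be sketched outside DataStage. This is illustrative Python, not the Sort or Remove Duplicates stage itself; the column names (id, v) and the keep option are hypothetical stand-ins for the stage's retain-first/retain-last choice:

```python
# Sketch of sort-unique: after sorting on the key, duplicates are adjacent,
# so one pass that keeps only the first row of each key group removes them.
# The keep="last" option mimics retaining the most recent duplicate instead.

def sort_unique(rows, key, keep="first"):
    rows = sorted(rows, key=lambda r: r[key])
    out = []
    for row in rows:
        if out and out[-1][key] == row[key]:
            if keep == "last":
                out[-1] = row       # replace with the most recent duplicate
            continue                # keep == "first": drop the duplicate
        out.append(row)
    return out

rows = [{"id": 2, "v": "x"}, {"id": 1, "v": "y"},
        {"id": 2, "v": "z"}]
deduped = sort_unique(rows, "id")
```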
Exercise
Complete exercise 7-1
Module 8
Combining Data
Objectives
Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages
Use this understanding to create jobs that will
– Combine data from separate input streams
– Aggregate data to form summary totals
Combining Data
There are two ways to combine data:
– Horizontally: several input links; one output link (+ optional rejects) made of columns from different input links. E.g., Join, Lookup, Merge
– Vertically: one input link; one output link with columns combining values from all input rows. E.g., Aggregator
Join, Lookup & Merge Stages
These three stages combine two or more input links according to values of user-designated "key" column(s).
They differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)
Join Stage Editor
One of four variants:
– Inner
– Left Outer
– Right Outer
– Full Outer
Several key columns allowed
Link order is immaterial for Inner and Full Outer joins (but VERY important for Left/Right Outer, and for Lookup and Merge)
1. The Join Stage
Four types:
• Inner
• Left Outer
• Right Outer
• Full Outer
2 sorted input links, 1 output link
– "left outer" on primary input, "right outer" on secondary input
– Pre-sort makes joins "lightweight": few rows need to be in RAM
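The left outer variant can be sketched outside DataStage. This is illustrative Python, not the Join stage's streaming implementation (which relies on the pre-sort rather than scanning the right input per row); the column names (cust, amt, name) are hypothetical:

```python
# Sketch of a left outer join: every row from the primary (left) input is
# kept. Matched rows pick up the secondary (right) columns; unmatched left
# rows are padded with None in the right-hand columns. Duplicate keys on
# both sides would produce a cross-product, as the synopsis table notes.

def left_outer_join(left, right, key):
    right_cols = {c for r in right for c in r if c != key}
    out = []
    for l in sorted(left, key=lambda r: r[key]):
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **m} for m in matches)
        else:
            out.append({**l, **{c: None for c in right_cols}})
    return out

orders = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}]
names  = [{"cust": 1, "name": "Ann"}]
joined = left_outer_join(orders, names, "cust")
```

Swapping which input is primary is what makes link order matter for the outer variants.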
2. The Lookup Stage
Combines:
– one source link with
– one or more duplicate-free table links
no pre-sort necessary
allows multiple-key LUTs
flexible exception handling for source input rows with no match
(Figure: the Lookup stage combines a source input and one or more tables (LUTs) into an output link, with an optional reject link)
The Lookup Stage
Lookup tables should be small enough to fit into physical memory (otherwise, performance hit due to paging)
On an MPP you should partition the lookup tables using the Entire partitioning method, or partition them the same way you partition the source link
On an SMP, no physical duplication of a lookup table occurs
The Lookup Stage
Lookup File Set
– Like a persistent data set, except it contains metadata about the key
– Useful for staging lookup tables
RDBMS LOOKUP
– NORMAL: loads into an in-memory hash table first
– SPARSE: select for each row; might become a performance bottleneck
3. The Merge Stage
Combines
– one sorted, duplicate-free master (primary) link with
– one or more sorted update (secondary) links
– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup)
Follows the Master-Update model:
– Master row and one or more update rows are merged if they have the same value in user-specified key column(s)
– If a non-key column occurs in several inputs, the lowest input port number prevails (e.g., master over update; update values are ignored)
– Unmatched ("bad") master rows can be either kept or dropped
– Unmatched ("bad") update rows in an input link can be captured in a "reject" link
– Matched update rows are consumed
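The Master-Update rules above can be sketched outside DataStage. This is illustrative Python, not the Merge stage's sorted-stream implementation; the column names (id, status, limit) are hypothetical:

```python
# Sketch of the master-update model: a duplicate-free master is merged with
# update rows on the key. On a non-key column clash the master (lower input
# port) prevails; unmatched update rows are captured as rejects; matched
# update rows are consumed into the merged output.

def merge(master, updates, key):
    master_by_key = {m[key]: dict(m) for m in master}  # assumed duplicate-free
    rejects = []
    for u in updates:
        m = master_by_key.get(u[key])
        if m is None:
            rejects.append(u)              # "bad" update row captured
        else:
            for col, val in u.items():
                m.setdefault(col, val)     # master value wins on clashes
    return list(master_by_key.values()), rejects

master  = [{"id": 1, "status": "gold"}, {"id": 2, "status": "new"}]
updates = [{"id": 1, "status": "old", "limit": 500}, {"id": 3, "limit": 100}]
merged, rejects = merge(master, updates, "id")
```

The real stage avoids the in-memory dictionary by requiring both inputs pre-sorted, which is exactly the "lightweight" property noted above.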
The Merge Stage
Allows composite keys
Multiple update links
Matched update rows are consumed
Unmatched updates can be captured
Lightweight
Space/time tradeoff: pre-sorts vs. in-RAM table
(Figure: the Merge stage combines a master link and one or more update links into an output link, with reject links for unmatched update rows)
Synopsis:
Joins, Lookup, & Merge
                                   Joins                        Lookup                              Merge
Model                              RDBMS-style relational       Source - in-RAM LU Table            Master - Update(s)
Memory usage                       light                        heavy                               light
# and names of inputs              exactly 2: 1 left, 1 right   1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort               both inputs                  no                                  all inputs
Duplicates in primary input        OK (x-product)               OK                                  Warning!
Duplicates in secondary input(s)   OK (x-product)               Warning!                            OK only when N = 1
Options on unmatched primary       NONE                         [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary     NONE                         NONE                                capture in reject set(s)
On match, secondary entries are    reusable                     reusable                            consumed
# Outputs                          1                            1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)          nothing (N/A)                unmatched primary entries           unmatched secondary entries

In this table:
• , <comma> = separator between primary and secondary input links (out and reject links)
The Aggregator Stage
Purpose: Perform data aggregations
Specify:
Zero or more key columns that define the aggregation units (or groups)
Columns to be aggregated
Aggregation functions: count (nulls/non-nulls), sum, max/min/range
The grouping method (hash table or pre-sort) is a performance issue
Grouping Methods
Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
– doesn't require sorted data
– good when the number of unique groups is small. The running tally for each group's aggregate calculations needs to fit easily into memory; requires about 1 KB of RAM per group.
– Example: average family income by state requires about 0.05 MB of RAM
Sort: results for only a single aggregation group are kept in memory; when a new group is seen (key value changes), the current group is written out
– requires input sorted by grouping keys
– can handle unlimited numbers of groups
– Example: average daily balance by credit card
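The two grouping methods can be sketched outside DataStage. This is illustrative Python, not the Aggregator's implementation; the column names (state, inc) are hypothetical, and sum stands in for any of the aggregation functions:

```python
# Sketch of the two grouping methods. Hash grouping keeps one running tally
# per distinct key and emits everything at the end (no sort needed, memory
# grows with the number of groups). Sort grouping assumes input sorted on
# the key and emits each group as soon as the key changes, so only one
# tally is ever in memory (unlimited groups).

def hash_group_sum(rows, key, col):
    tallies = {}                              # one entry per distinct group
    for r in rows:
        tallies[r[key]] = tallies.get(r[key], 0) + r[col]
    return tallies

def sort_group_sum(sorted_rows, key, col):
    out, cur_key, total = {}, None, 0
    for r in sorted_rows:
        if r[key] != cur_key:
            if cur_key is not None:
                out[cur_key] = total          # key changed: emit the group
            cur_key, total = r[key], 0
        total += r[col]
    if cur_key is not None:
        out[cur_key] = total                  # flush the final group
    return out

rows = [{"state": "CA", "inc": 1}, {"state": "CA", "inc": 2},
        {"state": "NY", "inc": 3}]
```

Both methods produce the same totals on sorted input; the choice is purely the memory-vs-presort tradeoff described above.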
Aggregator Functions
Sum
Min, max
Mean
Missing value count
Non-missing value count
Percent coefficient of variation
Aggregator Properties
Aggregation Types
Containers
Two varieties
– Local
– Shared
Local
– Simplifies a large, complex diagram
Shared
– Creates a reusable object that many jobs can include
Creating a Container
Create a job
Select (loop) portions to containerize
Edit > Construct container > local or shared
Using a Container
Select as though it were a stage
Exercise
Complete exercise 8-1
Module 9
Configuration Files
Objectives
Understand how DataStage EE uses configuration files to determine parallel behavior
Use this understanding to:
– Build an EE configuration file for a computer system
– Change node configurations to support adding resources to processes that need them
– Create a job that will change resource allocations at the stage level
Configuration File Concepts
Determine the processing nodes and disk space connected to each node
When the system changes, need only change the configuration file – no need to recompile jobs
When a DataStage job runs, the platform reads the configuration file
– Platform automatically scales the application to fit the system
Processing Nodes Are
Locations on which the framework runs applications
Logical rather than physical constructs
Do not necessarily correspond to the number of CPUs in your system
– Typically one node for two CPUs
Can define one processing node for multiple physical nodes, or multiple processing nodes for one physical node
Optimizing Parallelism
Degree of parallelism determined by the number of nodes defined
Parallelism should be optimized, not maximized
– Increasing parallelism distributes the work load but also increases Framework overhead
Hardware influences the degree of parallelism possible
System hardware partially determines configuration
More Factors to Consider
Communication amongst operators
– Should be optimized by your configuration
– Operators exchanging large amounts of data should be assigned to nodes that communicate by shared memory or a high-speed link
SMP – leave some processors for the operating system
Desirable to equalize partitioning of data
Use an experimental approach
– Start with small data sets
– Try different degrees of parallelism while scaling up data set sizes
Factors Affecting Optimal Degree of Parallelism
CPU-intensive applications
– Benefit from the greatest possible parallelism
Disk-intensive applications
– Number of logical nodes equals the number of disk spindles being accessed
Configuration File
Text file containing string data that is passed to the Framework
– Sits on the server side
– Can be displayed and edited
Name and location found in the environment variable APT_CONFIG_FILE
Components
– Node
– Fast name
– Pools
– Resource
Node Options
Node name – name of a processing node used by EE
– Typically the network name
– Use the command uname -n to obtain the network name
Fastname
– Name of the node as referred to by the fastest network in the system
– Operators use the physical node name to open connections
– NOTE: on an SMP, all CPUs share a single connection to the network
Pools
– Names of pools to which this node is assigned
– Used to logically group nodes
– Can also be used to group resources
Resource
– Disk
– Scratchdisk
Sample Configuration File
{
  node "Node1"
  {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}
Disk Pools
Disk pools allocate storage
By default, EE uses the default pool, specified by ""
A named pool (e.g., pool "bigdata") can be used instead
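A hedged sketch of how a named pool appears in a configuration file (the node name and paths here are hypothetical); a disk resource listed under pools "bigdata" is used only by stages constrained to that pool:

```
node "node1" {
  fastname "server1"
  pools ""
  resource disk "/data/big/d1" {pools "bigdata"}
  resource scratchdisk "/scratch/s1" {pools ""}
}
```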
Sorting Requirements
Resource pools can also be specified for sorting:
The Sort stage looks first for scratch disk resources in a "sort" pool, and then in the default disk pool. In the following example, node "n1" provides a "sort" scratch disk:
Another Configuration File Example

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
  ...
}
Resource Types
Disk
Scratchdisk
DB2
Oracle
Saswork
Sortwork
Can exist in a pool
– Groups resources together
Using Different Configurations
Lookup stage where the DBMS uses a sparse lookup type
Building a Configuration File
Scoping the hardware:
– Is the hardware configuration SMP, Cluster, or MPP?
– Define each node structure (an SMP would be a single node):
  Number of CPUs
  CPU speed
  Available memory
  Available page/swap space
  Connectivity (network/back-panel speed)
– Is the machine dedicated to EE? If not, what other applications are running on it?
– Get a breakdown of the resource usage (vmstat, mpstat, iostat)
– Are there other configuration restrictions? E.g., the DB only runs on certain nodes and ETL cannot run on them?
Exercise
Complete exercise 9-1 and 9-2
Module 10
Extending DataStage EE
Objectives
Understand the methods by which you can add functionality to EE
Use this understanding to:
– Build a DataStage EE stage that handles special processing needs not supplied by the vanilla stages
– Build a DataStage EE job that uses the new stage
When To Leverage EE Extensibility
Types of situations:
Complex business logic, not easily accomplished using standard EE stages
Reuse of existing C, C++, Java, COBOL, etc. code
Wrappers vs. Buildop vs. Custom
Wrappers are good if you cannot or do not want to modify the application and performance is not critical.
Buildops are good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces.
Custom (C++ coding using the framework API) is good if you need custom coding and need dynamic input and output interfaces.
Building “Wrapped” Stages
You can "wrapper" a legacy executable:
– Binary
– Unix command
– Shell script
… and turn it into an Enterprise Edition stage capable, among other things, of parallel execution, as long as the legacy executable is:
– Amenable to data-partition parallelism (no dependencies between rows)
– Pipe-safe (can read rows sequentially; no random access to data)
Wrappers (Cont’d)
Wrappers are treated as a black box
– EE has no knowledge of the contents
– EE has no means of managing anything that occurs inside the wrapper
– EE only knows how to export data to and import data from the wrapper
– The user must know at design time the intended behavior of the wrapper and its schema interface
If the wrapped application needs to see all records prior to processing, it cannot run in parallel.
LS Example
Can this command be wrappered?
Creating a Wrapper
Used in this job ---
To create the "ls" stage
Creating Wrapped Stages
Wrapper Starting Point
From Manager: right-click on Stage Type > New Parallel Stage > Wrapped
We will "wrapper" an existing Unix executable – the ls command
Wrapper - General Page
Unix command to be wrapped
Name of stage
The "Creator" Page
Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.
Wrapper – Properties Page
If your stage will have properties, complete the Properties page
This will be the name of the property as it appears in your stage
Wrapper - Wrapped Page
Interfaces – input and output columns. These should first be entered into the table definitions meta data (DS Manager); let's do that now.
Interface schemas
• Layout interfaces describe what columns the stage:
– Needs for its inputs (if any)
– Creates for its outputs (if any)
• Should be created as tables with columns in Manager
Column Definition for Wrapper Interface
How Does the Wrapping Work?
– Define the schema for export and import
– Schemas become interface schemas of the operator and allow for by-name column access
Data flow: input schema → export → stdin or named pipe → UNIX executable → stdout or named pipe → import → output schema
QUIZ : Why does export precede import?
Update the Wrapper Interfaces
This wrapper will have no input interface – i.e., no input link. The location will come as a job parameter that will be passed to the appropriate stage property. Therefore, only the Output tab entry is needed.
Resulting Job
Wrapped stage
Wrapper Story: Cobol Application
Hardware Environment:
– IBM SP2, 2 nodes with 4 CPUs per node
Software:
– DB2/EEE, COBOL, EE
Original COBOL Application:
– Extracted the source table, performed a lookup against a table in DB2, and loaded the results to a target table
– 4 hours 20 minutes sequential execution
Enterprise Edition Solution:
– Used EE to perform parallel DB2 extracts and loads
– Used EE to execute the COBOL application in parallel
– The EE Framework handled data transfer between DB2/EEE and the COBOL application
– 30 minutes 8-way parallel execution
Buildops
Buildop provides a simple means of extending beyond the functionality provided by EE, but does not use an existing executable (unlike the wrapper).
Reasons to use Buildop include:
Speed / performance
Complex business logic that cannot be easily represented using existing stages
– Lookups across a range of values
– Surrogate key generation
– Rolling aggregates
Build once and reuse everywhere within the project; no shared container necessary
Can combine functionality from different stages into one
BuildOps
– The DataStage programmer encapsulates the business logic
– The Enterprise Edition interface called "buildop" automatically performs the tedious, error-prone tasks: invoking needed header files and building the necessary "plumbing" for correct and efficient parallel execution
– Exploits the extensibility of the EE Framework
BuildOp Process Overview
From Manager (or Designer), Repository pane:
Right-click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}
• "Build" stages from within Enterprise Edition
• "Wrapping" existing Unix executables
General Page
Identical to Wrappers, except: under the Build tab, your program!

Logic Tab for Business Logic
Enter business C/C++ logic and arithmetic in four pages under the Logic tab
Main code section goes in the Per-Record page – it will be applied to all rows
NOTE: Code will need to be ANSI C/C++ compliant. If code does not compile outside of EE, it won't compile within EE either!
Code Sections under Logic Tab
Temporary variables are declared [and initialized] here
Logic here is executed once BEFORE processing the FIRST row
Logic here is executed once AFTER processing the LAST row
I/O and Transfer
Under the Interface tab: Input, Output & Transfer pages
Screenshot callouts:
– Input page: 'Auto Read' – read next row
– In-Repository Table Definition
– Optional renaming of the output port from the default "out0"
– Write row
– 'False' setting, so as not to interfere with the Transfer page
– First line: output 0
I/O and Transfer
• Transfer all columns from input to output.
• If the page is left blank or Auto Transfer = "False" (and RCP = "False"), only the columns in the output Table Definition are written.
First line: transfer of index 0
BuildOp Simple Example
Example - sumNoTransfer
– Add input columns "a" and "b"; ignore other columns that might be present in the input
– Produce a new "sum" column
– Do not transfer input columns

sumNoTransfer
a:int32; b:int32
sum:int32
From Peek:
No Transfer
NO TRANSFER
- RCP set to "False" in the stage definition, and
- Transfer page left blank, or Auto Transfer = "False"
• Effects:
- Input columns "a" and "b" are not transferred
- Only the new column "sum" is transferred
Compare with transfer ON…
Transfer
TRANSFER
- RCP set to "True" in the stage definition, or
- Auto Transfer set to "True"
• Effects:
- The new column "sum" is transferred, as well as
- Input columns "a" and "b", and
- Input column "ignored" (present in the input, but not mentioned in the stage)
Columns vs. Temporary C++ Variables

Columns:
– DS-EE type
– Defined in Table Definitions
– Value refreshed from row to row

Temporary C++ variables:
– C/C++ type
– Need declaration (in the Definitions or Pre-Loop page)
– Value persistent throughout the "loop" over rows, unless modified in code
Exercise
Complete exercise 10-1 and 10-2
Exercise
Complete exercises 10-3 and 10-4
Custom Stage
Reasons for a custom stage:
– Add EE operator not already in DataStage EE
– Build your own Operator and add to DataStage EE
– Use the EE API
Use a Custom Stage to add the new operator to the EE canvas
Custom Stage
DataStage Manager > select the Stage Types branch > right-click
Custom Stage
Number of input and output links allowed
Name of the Orchestrate operator to be used
Custom Stage – Properties Tab
The Result
Module 11
Meta Data in EE

Objectives
Understand how EE uses meta data, particularly schemas and runtime column propagation
Use this understanding to:
– Build schema definition files to be invoked in DataStage jobs
– Use RCP to manage meta data usage in EE jobs
Establishing Meta Data
Data definitions
– Recordization and columnization
– Fields have properties that can be set at the individual field level
– Data types in the GUI are translated to types used by EE
– Described as properties on the Format/Columns tab (Outputs or Inputs pages), OR
– Using a schema file (can be full or partial)
Schemas
– Can be imported into Manager
– Can be pointed to by some job stages (e.g., Sequential)
Data Formatting – Record Level
Format tab
Meta data described on a record basis
Record-level properties
Data Formatting – Column Level
Defaults for all columns
Column Overrides
Edit row from within the columns tab
Set individual column properties
Extended Column Properties
Field and string settings
Extended Properties – String Type
Note the ability to convert ASCII to EBCDIC
Editing Columns
Properties depend on the data type
Schema
An alternative way to specify column definitions for data used in EE jobs
Written in a plain text file
Can be written as a partial record definition
Can be imported into the DataStage repository
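As an illustrative sketch (the field names here are hypothetical), a schema file uses the Orchestrate record syntax: record-level format properties in braces, followed by the field list with EE data types:

```
record {final_delim=end, delim=',', quote=double}
(
  CustId:    int32;
  Name:      string[max=30];
  Amount:    decimal[8,2];
  OrderDate: date;
)
```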
Creating a Schema
Using a text editor
– Follow the correct syntax for definitions
OR
Import from an existing data set or file set
– In DataStage Manager: Import > Table Definitions > Orchestrate Schema Definitions
– Select the checkbox for a file with a .fs or .ds extension
Importing a Schema
Schema location can be on the server or a local workstation
Data Types
Date
Decimal
Floating point
Integer
String
Time
Timestamp
Vector
Subrecord
Raw
Tagged
Runtime Column Propagation
DataStage EE is flexible about meta data. It can cope with the situation where meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP).
RCP is always on at runtime.
Design-time and compile-time column mapping enforcement:
– RCP is off by default.
– Enable it first at the project level (Administrator, project properties).
– Enable it at the job level (Job Properties, General tab).
– Enable it at the stage level (link Output Column tab).
Enabling RCP at Project Level
Enabling RCP at Job Level
Enabling RCP at Stage Level
Go to the output link's Columns tab
For the Transformer, you can find the output link's Columns tab by first going to stage properties
Using RCP with Sequential Stages
To utilize runtime column propagation in the Sequential stage, you must use the "use schema" option
Stages with this restriction:
– Sequential
– File Set
– External Source
– External Target
Runtime Column Propagation
When RCP is Disabled:
– DataStage Designer will enforce Stage Input Column to Output Column mappings.
– At job compile time, Modify operators are inserted on output links in the generated osh.
Runtime Column Propagation
When RCP is Enabled:
– DataStage Designer will not enforce mapping rules.
– No Modify operator is inserted at compile time.
– Danger of a runtime error if incoming column names do not match the outgoing link's column names – case sensitivity.
Exercise
Complete exercises 11-1 and 11-2
Module 12
Job Control Using the Job Sequencer
Objectives
Understand how the DataStage job sequencer works
Use this understanding to build a control job to run a sequence of DataStage jobs
Job Control Options
Manually write job control
– Code is generated in BASIC
– Use the Job Control tab on the Job Properties page
– Generates BASIC code which you can modify
Job Sequencer
– Build a controlling job much the same way you build other jobs
– Comprised of stages and links
– No BASIC coding
Job Sequencer
Build like a regular job
Type "Job Sequence"
Has stages and links
Job Activity stage represents a DataStage job
Links represent passing control
Stages
Example
Job Activity stage – contains conditional triggers
Job Activity Properties
Job to be executed – select from the dropdown
Job parameters to be passed
Job Activity Trigger
Trigger appears as a link in the diagram
Custom options let you define the code
Options
Use the custom option for conditionals
– Execute if the job run was successful, or had warnings only
Can add a "wait for file" trigger to execution
Add an "execute command" stage to drop real tables and rename new tables to current tables
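A hedged sketch of a custom trigger expression (the activity name JobLoad is hypothetical); the sequencer exposes an upstream activity's status through $JobStatus and the DSJS constants, so a link can fire when the job finished cleanly or with warnings only:

```
JobLoad.$JobStatus = DSJS.RUNOK Or JobLoad.$JobStatus = DSJS.RUNWARN
```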
Job Activity With Multiple Links
Different links having different triggers
Sequencer Stage
Build a job sequencer control job for the collections application
Can be set to All or Any
Notification Stage
Notification
Notification Activity
Sample DataStage log from Mail Notification
E-Mail Message
Notification Activity Message
Exercise
Complete exercise 12-1
Module 13
Testing and Debugging
Objectives
Understand the spectrum of tools used to perform testing and debugging
Use this understanding to troubleshoot a DataStage job
Environment Variables
Parallel Environment Variables
Environment Variables
Stage Specific
Environment Variables
Environment Variables
Compiler
Typical Job Log Messages:
Environment variables
Configuration File information
The Director
Framework Info/Warning/Error messages
Output from the Peek Stage
Additional info with "Reporting" environments
Tracing/Debug output
– Must compile the job in trace mode – adds overhead

Job Level Environment Variables
• Job Properties, from the menu bar of Designer
• Director will prompt you before each run
Troubleshooting
If you get an error during compile, check the following:
Compilation problems
– If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
– If Buildop errors occur, try buildop from the command line
– Some stages may not support RCP – this can cause a column mismatch
– Use the Show Error and More buttons
– Examine the generated OSH
– Check environment variable settings
Very little integrity checking is done during compile; you should run Validate from the Director.
Highlights the source of the error
Generating Test Data
The Row Generator stage can be used
– Column definitions
– Data type dependent
Row Generator plus Lookup stages provides a good way to create robust test data from pattern files