CapGemini_Datastage

8/2/2019 CapGemini_Datastage


Training course Datastage (part 1) V. BEYET

03/07/2006



Summary

General presentation (DataStage : what is it ?)

DataStage : how to use it ?

The other components (part 2)



General presentation

Datastage : What is it ?

An ETL tool: Extract-Transform-Load

A graphic environment

A tool integrated in a suite of BI tools

Developed by Ascential (IBM)


DataStage : why use it ?

- Large data volumes
- Multi-source and multi-target : files, databases (Oracle, SQL Server, Access, ...)
- Data transformation : select, format, combine, aggregate, sort



DataStage : how does it work ?

Development is done :
- in client-server mode,
- with a graphical design of flows,
- with simple, basic elements,
- with a simple language (BASIC).

Treatments are :
- compiled and run by an engine,
- written in a Universe database.



The different tools

- Server
- Designer
- Manager
- Administrator
- Director



Server

The server contains programs and data.

Programs :
- Called Jobs : stored first as source code and then as executable programs, both written in the Universe database.
- The stored source code cannot be read directly.

Data :
- May be written in the Universe database, but is better kept in server directories.



Server

What is a Project for DataStage ?

- A server is organized into different environments called Projects.
- A Project is a separate environment for jobs, table definitions and routines.
- A Project can be created at any time.
- The number of projects is unlimited.
- The number of jobs is unlimited for each project.
- But the number of simultaneous client connections is limited.



Server

Universe Database :

- The Universe database is a relational database built on files.
- Tables are called "Hash Files".
- A Hash File is an indexed file; it is the central element for using all the possibilities of the DataStage engine.
- A Hash File with incorrectly defined keys may cause disastrous problems.
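Why a wrongly defined key is so dangerous can be illustrated with a small conceptual model. This is only a sketch in Python, not DataStage code: a dictionary stands in for a keyed Hash File, and the column names (id_cli, id_cas, days) are illustrative assumptions.

```python
# Conceptual model only: a Python dict stands in for a keyed Hash File.
def write_hash_file(rows, key_columns):
    """Store rows by key; a later row with the same key replaces the earlier one."""
    hash_file = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        hash_file[key] = row
    return hash_file


rows = [
    {"id_cli": 1, "id_cas": 10, "days": 3},
    {"id_cli": 1, "id_cas": 11, "days": 5},
    {"id_cli": 2, "id_cas": 10, "days": 2},
]

complete_key = write_hash_file(rows, ["id_cli", "id_cas"])  # key identifies each row
partial_key = write_hash_file(rows, ["id_cli"])             # key is too short

print(len(complete_key))  # 3 records kept
print(len(partial_key))   # 2 records kept: one row was silently overwritten
```

With the incomplete key, no error is raised: a record simply disappears, which is the kind of problem that is hard to notice afterwards.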



DataStage : how to use it ?


The designer

The Designer is the tool used to design jobs.

Jobs are composed of :
- Stages :
  - active stages : action
  - passive stages : data storage
- Links : between the stages


The designer

Passive stages : a place for data storage (the data flows from the stage or to the stage).

- Text File : sequential file.
- Hash File : it can be processed only by DataStage (and not by WordPad, ...), but simultaneous access to a Hash File is possible.
- UV Stage : the file is in the Universe core (DataStage engine).
- ODBC Stage, OLEDB, ORAOCI : representation of a database; it gives direct access to a database through an ODBC link.


The designer

Active stages : an active stage represents a transformation applied to the data flow.

- Sort : sorts a file
- Aggregator : calculations
- Transformer : selection, transformation, propagation of properties


The designer

Links can connect :
- an active stage and a passive stage,
- two passive stages,
- two active stages.


The designer

A job in the Designer (screenshot : a job made of passive stages and active stages).


The designer

DataStage Designer : each job has :
- one or more sources of data
- one or more transformations
- one or more destinations for the data

The toolbar contains the stage icons used to design the jobs.
The jobs have to be compiled to create executable programs.


(Screenshot of the Designer window : the repository, the toolbar with stage icons (palette), the button to compile the job and the button to run the job.)


Let's now study the different Stages :
- Sequential Files (text files)
- Transformer
- Hash Files
- Sort
- Aggregator
- Routines
- UV Stages


Sequential File Stage :

A sequential file :
- can be read,
- can be written,
- can be read and written in the same job,
- can be written with or without a cache,
- can be a DOS file or a Unix file,
- can be read by two jobs at the same time,
- cannot be written by two jobs at the same time.



Sequential File :

(Screenshots of the stage dialog.)
- Stage tab : stage name, file type, stage description.
- Output link : stage name (to be written).
- Data format (output file) : always use the values shown on the screenshot.
- Columns of the file (output) : type, length.
- View Data : tests the connection and displays the data in the file; the size to display can be set.


Sequential File :

To describe a file easily, use or create a table definition :
- Create or modify the table definitions (for files, databases, transformers, ...).
- Group your table definitions by application.
- A table definition can then be used in different jobs (click on Load to find the right definition).
- Use View Data to check the result.


Transformer Stage :

- Multi-source and multi-target,
- Waits for the source data to be available,
- Performs a lookup between 2 flows (reference),
- Transforms or propagates the data of each flow,
- Allows selecting and filtering rows and creating a refusals file.
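The selection and refusals behaviour can be pictured as routing each row according to a constraint. The following is a minimal conceptual sketch in Python, not DataStage code; the function name and the sample rows are illustrative assumptions (the SEXE column is the one used later in the exercises).

```python
# Conceptual model only: a constraint decides which output link receives a row.
def transform(rows, constraint):
    """Send rows that satisfy the constraint to one output, the others to a refusals output."""
    accepted, refused = [], []
    for row in rows:
        (accepted if constraint(row) else refused).append(row)
    return accepted, refused


clients = [
    {"name": "Dupont", "SEXE": "F"},
    {"name": "Martin", "SEXE": "M"},
    {"name": "Durand", "SEXE": "F"},
]

women, others = transform(clients, lambda row: row["SEXE"] == "F")
print([row["name"] for row in women])   # ['Dupont', 'Durand']
print([row["name"] for row in others])  # ['Martin']
```

Every input row ends up in exactly one of the two outputs, so nothing is lost silently.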



Transformer Stage :

Treatments can be done with :
- a native BASIC function or a function created in the Manager,
- a DataStage function or a DataStage macro,
- routines (before/after type).

Or the stage can simply propagate columns.



Transformer Stage :

(Screenshots of the Transformer editor : input data and output data. Right click to propagate all the columns.)


Exercise n°1 :

Objective : read a sequential file and create a new one (save the file).

The catalogue.in file has to be read and the catalogue_save.tmp file has to be written.

Source file : catalogue.in (in \in directory)
Target file : catalogue_save.tmp (in \tmp directory)

Steps :
1- Create a table definition (structure of the Catalogue table)
2- Design the job with 2 Sequential Files and 1 Transformer
3- Create the links (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)



Transformer Stage :

Look at the performance of your job : right click on the grid and then select Show performance statistics.


Create the parameters of the job : menu Edit > Job Properties, tab Parameters.



Exercise n°2 :

Objective : Use environment variables

- create a job parameter : directory
- use it in all the paths of the job from the first exercise (example : #directory#\tmp),
- compile
- modify your input file (add your best film)
- run with a different path (other groups).
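The #directory# notation means that the parameter value is substituted into the path when the job runs. A minimal sketch of that substitution in Python follows; it only models the idea, the function name resolve_parameters and the sample directory are illustrative assumptions, not DataStage internals.

```python
import re


def resolve_parameters(text, parameters):
    """Replace every #name# marker with the value of the job parameter 'name'."""
    def substitute(match):
        name = match.group(1)
        if name not in parameters:
            raise KeyError(f"undefined job parameter: {name}")
        return parameters[name]

    return re.sub(r"#(\w+)#", substitute, text)


# Hypothetical value entered at run time for the 'directory' parameter.
path = resolve_parameters(r"#directory#\tmp", {"directory": r"C:\training\group1"})
print(path)  # C:\training\group1\tmp
```

Running the same job with another value for directory changes every path at once, which is the point of the exercise.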



Hash File Stage :

- Necessary for a lookup.
- A Hash File is entirely written before it can be read (the FromTrans link must be finished before FromFilmTypeHF can start).
- Allows grouping multiple records with the same key (suppresses duplicate keys).
- Can be read in different jobs simultaneously.
- Can be written by different links simultaneously (in the same job or in different jobs).


Hash File :

(Screenshots of the stage dialog.)
- Stage tab : stage name, account name (DataStage project), file path.
- File name.
- For files to write : select the cache check box to specify that all records should be cached, rather than written to the hashed file immediately. This is not recommended where your job writes and reads to the same hashed file in the same stream of execution.
- A key must be defined (it can be a single or multiple key).


Transformer Stage : Lookup

- The main flow can be of any type.
- The secondary flow must come from a Hash File to design a lookup (so very often, you will have to design a temporary Hash File).
- The lookup is done with the key of the secondary flow.
- The number of records in the main flow cannot be higher after the lookup than before the lookup.
- The lookup is shown with a dotted line.
- When a lookup is exclusive, the number of records after the lookup can be smaller than the number of records before the lookup.
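The rules above can be summed up in a small conceptual model. This Python sketch is not DataStage code: a dictionary stands in for the reference Hash File, and the film titles and type codes are made-up sample data.

```python
# Conceptual model only: the reference flow is a keyed store, read once per main row.
def lookup(main_rows, reference, key, exclusive=False):
    """Add the looked-up value to each main row; drop unmatched rows if exclusive."""
    result = []
    for row in main_rows:
        match = reference.get(row[key])  # at most one match per key
        if match is None and exclusive:
            continue                     # exclusive lookup: the row is filtered out
        out = dict(row)
        out["type_name"] = match if match is not None else "unknown type"
        result.append(out)
    return result


film_types = {"C": "Comedy", "W": "Western"}
catalogue = [
    {"film": "Film A", "type": "C"},
    {"film": "Film B", "type": "X"},  # type code missing from the reference
    {"film": "Film C", "type": "W"},
]

plain = lookup(catalogue, film_types, "type")
strict = lookup(catalogue, film_types, "type", exclusive=True)
print(len(plain), len(strict))  # 3 2
```

Because the reference is keyed, a main row can never be duplicated by the lookup; it is either kept (possibly with a default value) or, in the exclusive case, removed.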



Transformer Stage : Lookup

(Screenshot : the principal flow is drawn horizontally and the reference flow vertically.)


Exercise n°3 :

Objective : make a lookup between the Catalogue file and Film Type to put the film type in the output file.

Source file : catalogue.in (in \in directory)
Target file : catalogue.out (in \out directory)

Steps :
1- Create a table definition (structure of the FilmType table)
2- Modify your job to create a Hash File from the FilmType.in file
3- Create the link to show the lookup (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)



Exercise n°4 :

Objective : put the director name and the film name together, separated by a ">". If the film type is not found, put "unknown type" in the output file. What happens when the director name is empty ? Find a solution.



Exercise n°5 :

Objective : if the film type is not found (use a constraint), put the film in a refusals file (first a Sequential File and then a Hash File).



Transformer Stage : Lookup with selection (exclusive lookup)

Don't forget : a lookup can be designed with an ORAOCI stage or a UV stage, but it is better with Hash Files.



Exercise n°6 :

Objective : select only the films for which the type is known (that means that the lookup is OK).


Exercise n°7 :

Objective : select all the clients who are female and put them in an output file. The SEXE column contains "M" (male) or "F" (female).

Then create an annotation for this job (all the jobs must have annotations).



The Director

The Director is the job controller. It allows you to :
- Run jobs : immediately or later, with more options than in the Designer.
- Control job status : Compiled, Running, Aborted, Validated, Failed validation, ...
- Monitor jobs : check the number of lines treated by each active stage of a job.


Run jobs with the Director

(Screenshots.)
- To run a job now : select the job, click on the button shown on the screenshot, and then enter the parameters.
- To run a job later : click on the button shown on the screenshot, and then choose the date and time.


To modify the running parameters of a job : Limits tab

- Warnings limit : the job stops after x warnings
- Rows limit : the job stops after x rows (on each flow)


Verify the status of jobs with the Director

The status can be : "Not compiled", "Compiled", "Failed validation", "Validated ok", "Aborted", "Finished", "Running".



The Director

Example : list of jobs (screenshot). The toolbar buttons allow you :
- to run jobs,
- to stop jobs,
- to run jobs later,
- to view the log,
- to reset the job status.


Example of a Monitor :

(Screenshot.) For each step, the monitor shows :
- the number of treated lines (input and output),
- the beginning time,
- the execution duration (Elapsed time),
- the status,
- the performance (rows/sec).

Link type :
- Pri : principal flow
- Ref : reference flow (lookup)
- Out : output flow

The monitor allows you to follow the different stages of a job. Hence the importance of good names for the stages and the links !


Example of a log :

(Screenshot.) To look at error messages, choose the job and click on the log button.

- Green : OK, no problem
- Yellow : warning
- Red : blocking problem

Don't forget : clear the log from time to time (Job>Clear log).


The Manager

The Manager is the tool used to export/import elements from one DataStage project to another DataStage project.

All the elements (jobs, routines, table definitions) are classified in Categories, but each name must be unique within a project.

- To import or export elements, click on the appropriate button.
- File>Open Project to change project.
- Drag and drop an element to change its category.


The Manager : EXPORT

Choose what you want to export (this creates a .dsx file) :
- Jobs
- Routines (always check the Source Code box)
- Table definitions

Options :
- To append to an existing file.
- To change the selection options : by category or by individual components.


The Manager : IMPORT

Choose what you want to import and make your choice among the options. This will create or modify elements in the DataStage project.


The Manager

With the Manager, you can compile many jobs at the same time (multiple job compile) :
1- Tools > Run multiple job compile
2- Select the type of jobs you want to compile, select Show manual selection page and click on the Next button
3- Select the jobs and click on the Next button
4- Click on the Start compile button


Sort Stage :

(Screenshot.)
- The sort criteria are filled in on the Stage tab / Properties tab.
- Modify the parameters shown there if the file to sort has a lot of lines.


Exercise n°8 :

Objective : once you have selected all the women, sort the file in alphabetical order.



Aggregator Stage :

- Allows data to be aggregated into a smaller number of records,
- Intermediate treatments are executed in memory,
- Allows executing a before/after routine (before or after the stage treatment, when all the lines have been treated),
- Performance is better if the data is sorted (Input tab),
- The Aggregator does not sort the records.
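To picture why sorted input helps, here is a conceptual Python sketch of a "group by and count" over rows that are already sorted on the grouping column. It is not DataStage code and only models the sorted case; the column name id_cas and the sample data are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows, key):
    """Count rows per key, assuming the rows arrive already sorted on that key.

    Each group can be emitted as soon as the key changes, so only the
    current group has to be held in memory.
    """
    return [(k, sum(1 for _ in group)) for k, group in groupby(rows, key=itemgetter(key))]


hires = [{"id_cas": 10}, {"id_cas": 12}, {"id_cas": 10}, {"id_cas": 10}]

# The aggregation itself does not sort: the sort is a separate, earlier step.
sorted_hires = sorted(hires, key=itemgetter("id_cas"))
counts = aggregate_sorted(sorted_hires, "id_cas")
print(counts)  # [(10, 3), (12, 1)]

# Ordering the aggregated result (for example for a hit-parade) is again a separate sort.
hit_parade = sorted(counts, key=lambda pair: pair[1], reverse=True)
print(hit_parade[0])  # (10, 3)
```

The sketch keeps sorting and aggregating as distinct steps, mirroring the separate Sort and Aggregator stages in a job.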



Aggregator Stage : Input tab

(Screenshot : the fields to fill in when the input data is sorted.)

Aggregator Stage : Output tab

(Screenshot : the Group by columns and the different aggregate functions.)


Exercise n°9 :

Objective : create a job which reads location.in and calculates the hit-parade of the most hired cassettes (ordered by number of hires, descending). Also output the name of the film and not only the number of the cassette (lookup with catalogue.in).



Exercise n°9 (job to design) : (screenshot of the job.)


Hash File Stage :

- We have seen that the Hash File is necessary for a lookup.
- We have also seen that the Hash File allows suppressing duplicate keys.
- Let's now see how it is useful for grouping different flows.


Exercise n°11 :

Objective : with the job from exercise 10 (use the 2 methods in the same job), create a Hash File to put the different results in the same Hash File.
- Column 1 : "AVERAGE METHOD 1" or "AVERAGE METHOD 2"
- Column 2 : the result of each method
In the Hash File, you must have 2 lines.

Exercise n°11 (job to design) : (screenshot of the job.)


Stage Variables :

Simple treatments can be made easily with stage variables.

- A stage variable is a value which remains active during the whole execution of the stage. So you can find a max (if the data is sorted), calculate a sum or count something.
- In the Transformer, right click and then select Show Stage Variables.

(Screenshots : two examples of stage variables.)
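The idea of a value that stays alive from one row to the next can be sketched as follows. This is a conceptual Python model, not Transformer syntax; the class name and the variable names sv_count and sv_sum are illustrative assumptions.

```python
# Conceptual model only: stage variables keep their value from one row to the next.
class AverageStage:
    def __init__(self):
        # Initial values of the stage variables.
        self.sv_count = 0
        self.sv_sum = 0.0

    def process(self, value):
        """Called once per row; the variables are updated in a fixed order."""
        self.sv_count += 1       # 1st variable: number of rows seen so far
        self.sv_sum += value     # 2nd variable: running total
        return self.sv_sum / self.sv_count  # derivation using both variables


stage = AverageStage()
running_average = [stage.process(v) for v in [2, 4, 9]]
print(running_average)  # [2.0, 3.0, 5.0]
```

The last value is the average over all rows. If the average were computed before the counter was incremented, the first row would divide by zero, which illustrates why the evaluation order and the initial values matter.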


Exercise n°12 :

Objective : try to calculate the average with stage variables.

Exercise n°13 :

Objective : create a job that creates a file with all the clients (key) and, in a second column, the list of the films (separated by a dot).

Exercise n°13 (job to design) : (screenshot of the job.)

- The order of the different variables is important : the instructions are executed in the order of the stage variables ! (To change the order : right click > Stage Properties > Link Ordering tab.)
- The variables must be initialized (right click > Stage Properties > Variables).
- There must be a Hash File after the stage.


DataStage Variables :

Different variables are defined by DataStage :
- @NULL
- @INROWNUM, @OUTROWNUM
- @DATE
- @TRUE, @FALSE
- @PATH

Link Variables :

The most useful is : NOTFOUND


Routines :

- Source code (written in the BASIC language).
- A routine is external to the jobs and can be used many times at many levels.
- It can be a Transform Function or a Before/After Subroutine :
  - a transform function is called for each line,
  - a before subroutine is called before the first line (example : empty a file),
  - an after subroutine is called when all the lines have been treated.


Routines (1/3)

(Screenshot.) Type of routine, name of the routine, short description (always fill this in).

Routines (2/3)

(Screenshot.) Arguments, to be filled in : they are used in the code.

Routines (3/3)

(Screenshot.) Code : use the argument names. Buttons : Save, Compile, Test of the routine.



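Routines : access to a sequential file

In the statements below, FilePath is the path of the file, FileVar the file variable and Line a line of text.

Open the file :
    OpenSeq FilePath To FileVar Then
        ...
    End Else
        ...
    End

Read the next line :
    ReadSeq Line From FileVar Then
        ...
    End Else
        ...
    End

Write a line (for example the file header) :
    WriteSeq Line To FileVar Then
        ...
    End Else
        ...
    End

Empty the file :
    WeofSeq FileVar

Close the file :
    CloseSeq FileVar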



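Routines :

Control structures :
- If ... Then ... End Else ... End
- GoTo
- For i = ... To ... Next i
- Loop While ... Do ... Repeat
- Loop Until ... Do ... Repeat

Log messages :
- Call DSLogInfo("Information", "RoutineName")
- Call DSLogWarn("Warning", "RoutineName")
- Call DSLogFatal("Abort", "RoutineName")

String and date functions :
- A = "Hello" ; B = "World" ; C = A:" ":B gives C = "Hello World" (":" is the concatenation operator)
- Field(MyString, ",", 3, 1) : returns the third comma-delimited field of the string
- Trim(MyString, " ", "T") : suppresses the trailing spaces
- Upcase(MyString) : converts to upper case
- Iconv("05/27/97", "D2/") : converts a date to its internal format (10740)
- Oconv(10740, "D2/") : converts an internal date back to "05/27/97"
- A = "Hello" ; A[1,3] gives "Hel"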
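For readers more familiar with general-purpose languages, the functions above can be mimicked in Python. This is a rough sketch, not the engine's implementation: the helper names are invented for the illustration, only the MM/DD/YY form is handled, and the century window used for two-digit years is an assumption. The internal date counts days from 31 December 1967 (day 0).

```python
from datetime import date, timedelta

DAY_ZERO = date(1967, 12, 31)  # internal date 0


def iconv_date(text):
    """Rough equivalent of Iconv(text, "D2/") for MM/DD/YY input only."""
    month, day, year = (int(part) for part in text.split("/"))
    year += 1900 if year >= 30 else 2000  # assumption: simple century window
    return (date(year, month, day) - DAY_ZERO).days


def oconv_date(internal):
    """Rough equivalent of Oconv(internal, "D2/")."""
    return (DAY_ZERO + timedelta(days=internal)).strftime("%m/%d/%y")


def field(text, delimiter, occurrence):
    """Rough equivalent of Field(text, delimiter, occurrence, 1)."""
    parts = text.split(delimiter)
    return parts[occurrence - 1] if 0 < occurrence <= len(parts) else ""


a, b = "Hello", "World"
c = a + " " + b                    # A:" ":B
third = field("a,b,c,d", ",", 3)   # 'c'
trimmed = "text   ".rstrip(" ")    # Trim(..., " ", "T")
internal = iconv_date("05/27/97")  # 10740
external = oconv_date(10740)       # '05/27/97'
first_three = a[0:3]               # A[1,3] : BASIC positions start at 1

print(c, third, internal, external, first_three)
```

Because dates are stored as day numbers, the number of days between two dates is a plain subtraction of two internal dates.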



Routines : Test

(Screenshot.) Double-click on the Result column to see the result of a test.


Exercise n°14 :

Step 1 :

Objective : write a routine which calculates the number of days between two dates.
- If the begin date is null, then return 0.
- If the end date is null, then initialize it with today's date.

Save, compile and test the routine.


Exercise n°14 :

Step 2 :

Objective : read location.in and generate a file with the hire duration (returned cassettes only).
Cassettes not returned after 10 days (end date null) will be written to a refusals file with the name and address of the client (in order to send a mail afterwards).

Exercise n°14 (job to be designed) : (screenshot of the job.)


Exercise n°15 :

Objective : with a routine (use CASE), calculate the amount for the cassette hire (number of days * hire price * coefficient).

The coefficient is calculated with that rule :=5 and =10 and = 30 days = days * hire price * 3

UV Stage :

- works with internal hash files (in the DataStage project),
- can make a Cartesian product,
- uses SQL requests (select ... from ... where ... order by ...).
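A Cartesian product pairs every record of one file with every record of the other. The following Python sketch only illustrates the idea (it is neither UV Stage SQL nor DataStage code); the client and cassette numbers are made-up sample data.

```python
from itertools import product

clients = [1, 2]
cassettes = [10, 11, 12]
hires = {(1, 10), (2, 11), (2, 12)}  # (client, cassette) pairs already hired

# Cartesian product: every client paired with every cassette (2 * 3 = 6 pairs).
all_pairs = list(product(clients, cassettes))

# Keep only the pairs that were never hired: these are the proposals.
proposals = [pair for pair in all_pairs if pair not in hires]

print(len(all_pairs))  # 6
print(proposals)       # [(1, 11), (1, 12), (2, 10)]
```

The number of pairs is the product of the two record counts, so it grows very quickly with real files; this is why job performance deserves attention with this technique.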


Exercise n°16 : execute the Cartesian product on the Clients file and the Cassettes file

Objective : propose to each client the cassettes he has never hired.
- Step 1 : create the job parameter account.
- Step 2 : create a job to write the clients hash file and the cassettes hash file in the DS project, using the account parameter.
- Step 3 : in a new job, use those hash files to make the Cartesian product.

Look at your job performance !



Exercise n°16 : Step 1 and Step 2 (screenshot of the job.)

Exercise n°16 : Step 3 (screenshots of the job and of the number of records.)



Normalization :

Multi-valued file :
    12 A|B|C|D|E

Normalized file :
    12 A
    12 B
    12 C
    12 D
    12 E

Normalization turns the multi-valued file into the normalized file; un-normalization does the opposite.



Normalization :

The multi-valued file must have :
1- a key
2- char(253) or @VM as separator
3- the Normalize On field of the Hash File checked
4- the column(s) to normalize
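The two directions can be sketched in Python as follows. This is a conceptual illustration, not DataStage code: the function names are invented for the example, and the value mark is written as chr(253) as stated above.

```python
VM = chr(253)  # value mark used as separator inside a multi-valued column


def normalize(records):
    """One output row per value of the multi-valued column."""
    return [(key, value) for key, multi in records for value in multi.split(VM)]


def unnormalize(rows):
    """Group the values of each key back into one multi-valued record."""
    grouped = {}
    for key, value in rows:
        grouped.setdefault(key, []).append(value)
    return [(key, VM.join(values)) for key, values in grouped.items()]


multi_valued = [(12, VM.join(["A", "B", "C", "D", "E"]))]
normalized = normalize(multi_valued)

print(normalized)  # [(12, 'A'), (12, 'B'), (12, 'C'), (12, 'D'), (12, 'E')]
print(unnormalize(normalized) == multi_valued)  # True
```

The round trip gives back the original record, which is what the comparison step of a normalization exercise is meant to check.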



Exercise n°17 : normalization/un-normalization

- Step 1 : create a job which reads the location.in file and writes a hash file (Id_Cli as the key and the list of all Id_Cas separated by @VM) : use the Sort stage and Stage Variables !
  => View Data on the Input Link of the Hash File
- Step 2 : modify the job to add the normalization of this file
  => View Data on the Output Link of the Hash File
- Step 3 : compare the sequential file with the location.in file

Exercise n°17 : job to design and View Data (screenshots.)



The ORAOCI Stages :

The version of Oracle used is 9i, so use the ORAOCI9 stage.

You can :
- either use a query generated by DataStage,
- or use a user-defined query,
- or use a combination of both.

Also :
- The access parameters have to be defined by job parameters.
- The stage can access one table or several.
- Different actions can be programmed : read, insert, update.
- You can also use stored procedures.



The ORAOCI Stages : (screenshots of the stage dialog)

- Stage tab : the access parameters have to be defined by job parameters.
- Output link : query generated by DataStage or user-defined query.
- Query generated by DataStage : selection of the table(s), selection of the columns, Group by clause, sort parameters.
- Generate SELECT clause from column list; enter other clauses.
- Enter custom SQL statement : when you want to add something specific (to format a date, for example).
- Input link : choose the table and choose the action. Important parameters include the number of lines between 2 commits.
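The "number of lines between 2 commits" setting can be pictured with a small sketch. It is written in Python with the standard sqlite3 module purely as a stand-in for Oracle, so it is not what the ORAOCI stage does internally; the table name tst, the function name load_rows and the sample rows are illustrative assumptions.

```python
import sqlite3


def load_rows(connection, rows, rows_per_commit):
    """Insert rows and commit every rows_per_commit rows; return the number of commits."""
    commits = 0
    cursor = connection.cursor()
    for count, row in enumerate(rows, start=1):
        cursor.execute("INSERT INTO tst (id_cli, id_cas) VALUES (?, ?)", row)
        if count % rows_per_commit == 0:
            connection.commit()
            commits += 1
    connection.commit()  # commit the last, possibly partial, batch
    return commits + 1


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE tst (id_cli TEXT, id_cas TEXT)")

rows = [(f"A{i}", str(100 + i)) for i in range(10)]
total_commits = load_rows(connection, rows, rows_per_commit=4)
loaded = connection.execute("SELECT COUNT(*) FROM tst").fetchone()[0]

print(total_commits, loaded)  # 3 10
```

A small interval means more commits and usually a slower load; a large interval means fewer commits but more uncommitted work at risk if the job aborts.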



The ORAOCI Stages : verify error code

(Screenshots.)
- Choose whether the job must abort when there is a SQL error.
- Treat the lines 1 by 1.
- To receive the SQL error code.
- To select the errors.



The ORA Bulk Stages :

- Used to insert into a table (like SQL*Loader),
- Very fast (deactivates the index before the load and reactivates it after the load),
- But there is no warning if the index is in Unusable state after the load (when there are duplicate keys, for example),
- Few date and time formats are available (DD.MM.YYYY, YYYY-MM-DD, DD-MON-YYYY, MM/DD/YYYY - hh24:mi:ss, hh:mi:ss am).



The ORA Bulk Stages

(Screenshot of the stage properties : DSN, user, password, table name (with oracle.tableName), date and time format, number of lines between 2 commits.)



How to create a table definition from a table in the database ?

1- In the repository, right click on Table Definitions,
2- choose Import, and then Plug-in Meta Data Definitions,
3- choose the table(s) and click on Import.

The table definitions will be created in the category ODBC.



Exercise n°18 : Read a Database

Objective : create a job which reads the table REF_CPTE in the BIODS database.

- Step 1 : create the table definition from the database
- Step 2 : create the job that reads the table



Exercise n°19 : Write in a Database

Objective : create a job which writes into the table TST_ALADIN_JGV in the BIODS database (only the first 2 columns : keys).

Mapping from location.in to TST_ALADIN_JGV :
- Id_Cli ====>> CHAR1
- Id_Cas ====>> CHAR2

In CHAR1, put a letter (different for each group) before the client number (Id_Cli).

- Step 1 : use the ORAOCI stage
- Step 2 : same exercise with the ORABULK stage



Exercise n°20 : Update a Database

Objective : create a job to update the columns BEGIN_DATE and END_DATE in the table TST_ALADIN_JGV in the BIODS database from the location.in file.

BEGIN_DATE and END_DATE are defined as timestamp !


The Administrator :

- Create a DataStage project
- Unlock a job

Unlock a job

Sometimes, due to server problems, the Designer (or Manager) crashes and some elements may stay locked (jobs, table definitions, routines, ...). In that case, in the Administrator (with administrator security rights) :

1- Choose your project and click on the Command button. (The same screen has the button to create a project.)
2- Run the following commands, then search for the device number or the user number in the output :
    CHDIR C:\Ascential\DataStage\Engine
    LIST.READU
3- Unlock your job with the device number, or with the user number (UNLOCK USER UserNumber READULOCK), or unlock everything (UNLOCK ALL).



Create a project

(Screenshot.)
- Project name.
- Location for the Project (jobs, routines, UV hash files, table definitions, ...) on the server. It must be different from the location of the data directories !