DataStage :: FUNDAMENTAL CONCEPTS :: 2010
DAY 1
Introduction to the Phases of DataStage
There are four different phases in DataStage; they are:
Phase I: Data Profiling
It is for source system analysis, and the analyses are:
1. Column analysis,
2. Primary key analysis,
3. Foreign key analysis (through these analyses we can find whether the data is “dirty” or not),
4. Base line analysis, and
5. Cross domain analysis.
Phase II: Data Quality (also called cleansing)
In this phase the processes are interdependent, i.e., they must be performed one after another as shown below:
1. Parsing
2. Correcting
3. Standardizing
4. Matching
5. Consolidating
Phase III: Data Transformation
The ETL process is done here; the data is transformed as it moves from one stage to another.
ETL means
E - Extract
T - Transform
L - Load.
Phase IV: Meta Data Management - “Metadata means the data about data”.
DAY 2
How does the ETL programming tool work?
Pictorial view:
The sources (data base, flat files, MS Excel) feed the ETL process, which loads the data warehouse (DWH) and data marts (DM); the business interface (BI) sits on top.
Figure: ETL programming process (db -> ETL -> DWH -> DM -> BI)
DAY 3
Continue…
Extracting reads from .txt files (ASCII code) at the source; the source data is then converted into the format DataStage understands (the Native Format); loading writes the data into .txt (ASCII code), into a data base, or it resides in the local repository.
ETL is a process that is performed in stages:
OLTP -> staging area (sa) -> sa -> sa -> DWH
Each hop is a source (S) to target (T) movement; here, S - source and T - target. The extract window and the load window bound the process; staging holds the permanent data, and further staging holds the data after transformation.
Home Work (HW): one record for each kindle (multiple records for multiple addresses and dummy records for joint accounts).
DAY 4
ETL Developer Requirements
• Q: One record for each kindle (multiple records for multiple addresses and dummy records for joint accounts)?
Kindle means the information of a customer.
A kindle groups all of a customer's accounts around the customer: loan, bank, credit card, savings.
• Maintaining one record per customer while handling different addresses is called the ‘single view customer’ or ‘single version of truth’.
HW explanation: Here we must read the query very carefully and understand the terminology of the words from a business perspective. Multiple records means multiple records per customer, and multiple addresses means one customer (one account) maintaining multiple addresses across savings/credit cards/current account/loan.
ETL Developer Requirements:
The inputs to the developer are the HLD and LLD documents, where
HLD - high level document
LLD - low level document
ETL Developer Requirements are:
1. Understanding
2. Prepare Questions: after reading the given document, ask friends/forums/team leads/project leads.
3. Logical design: means paper work.
4. Physical model: using the tool.
5. UNIT Test
6. Performance Tuning
7. Peer Reviews: it is nothing but releasing versions (version control *.**), where * means a digit in the range 1-9.
8. Design Turn Over Document (DTD) / Detailed Design Document (DDD) / Technical Design Document (TDD)
9. Backups: importing and exporting the data as required.
10. Job Sequencing
DAY 5
How is a DWH project undertaken?
Process:
Requirements -> Warehouse (WH) - HLD -> TD -> jobs.
The developer is not involved (x) in the requirements, TEST, and Production flows; the developer is involved, as a system engineer, in development, and implements all TEN requirements shown above.
Share of jobs: development 70% - 80%, production 10%, migration 30%.
• Production based companies are like IBM and so on.
• Migration means support based companies like TCS, Cognizant, Mahindra Satyam, and so on.
In migration, one works on both server and parallel jobs (server jobs - parallel jobs).
Up to 2002 the server-job environment was used; after 2002 and up till now the parallel environment is used. IBM launched the X-Migrator, which converts server jobs to parallel jobs: it converts up to 70% automatically, and the remaining 30% is done manually.
A project is divided into categories with respect to its period (the time the project takes), as shown below:
Category - Period (in months and years)
Simple: 6 m
Medium: 6 m - 1 y
Complex: 1 - 1.5 y
Too complex: 1.5 y - 5 y and so on (it may take many years depending upon the project)
5.1. Project Process:
HLD (high level documents): the SRS (from the business analyzer / subject matter expert), the BRD for the requirements, and the HLD architecture.
Warehouse: the schema (structure), the dimensions and tables (target tables), and the facts.
LLD (low level documents): the mapping documents (specifications, or spec's), the TD test spec's, and the naming documents.
5.2. Mapping Document:
For example, suppose the query requirements are: 1 - experience of the employee, 2 - dname, and 3 - first name, middle name, last name.
For this mapping, the mapping document is laid out with the following common fields:
S.no | Load order | Target Entity | Target Attributes | Source Tables | Source Fields | Transformation | Constant | Error Handling
In this example the source tables are Emp (Eno, Ename, Hire date, Dno; Eno is the Pk) and Dept (Dno, Dname; Dno the Fk); the target entities are CExp_tbl and CExp_emp, with target attributes FName, MName, LName, DName and a surrogate key (Sk); the transformation for experience is Current Date - Hire date (CD-HD).
Funneling: the sources S1 and S2 are combined into the target by horizontal combining or vertical combining; as per this example, horizontal combining is used (see the SQL sketch below).
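A rough SQL sketch of this mapping (a minimal sketch, assuming the usual Oracle EMP and DEPT tables plus hypothetical name columns fname, mname, lname; MONTHS_BETWEEN and SYSDATE are Oracle functions):
SQL> SELECT e.eno, e.fname, e.mname, e.lname, d.dname,
            MONTHS_BETWEEN(SYSDATE, e.hiredate) / 12 AS experience
     FROM emp e, dept d
     WHERE e.dno = d.dno;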
Emp and Dept are combined through HC into Trg. Here, HC means Horizontal Combination, which is used to combine primary rows with secondary rows; getting data from multiple tables is done by combining.
As a developer you will get at most around 30 target fields.
As a developer you will get at most around 100 source fields.
“Look Up!” means cross verification against the primary table.
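A horizontal combination corresponds to a join in SQL; a minimal sketch, assuming the usual emp/dept tables with deptno as the common field:
SQL> SELECT e.ename, d.dname
     FROM emp e, dept d
     WHERE e.deptno = d.deptno;
The primary rows (emp) are combined with the secondary rows (dept) on the common field.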
After the document:
the sources S1 and S2 may be flat files or databases (F / dB, of various dB types); flat files are .txt with formats like fwf, cv, vl, sc, s & t, h & t.
This is the format of the mapping document.
DAY 6
Architecture of DWH
For example: every branch has its own dB and its own manager.
Reliance Group has branches such as Reliance Comm., Reliance Power, and Reliance Fresh, each with its own manager, and one Top Level Manager (TLM) over all of them.
Explanation of the above example: Reliance Group has some branches and every branch has one manager. Above all these managers there is one Top Level Manager (TLM). The TLM needs the details of the branches below (sales, customers, employees, period, orders) as input for analysis.
For the above example, the ETL process is done as shown below: each branch database (the RC manager's dB, the Reliance Fresh ERP, and so on) feeds an ETL process, which loads either the DWH or a mini WH / data mart at the bottom level.
Independent Data Mart versus Dependent Data Mart: Reliance Fresh (taking one branch from the group directly) illustrates the independent case.
Dependent Data Mart: means the ETL process takes all the managers' information (or dBs) and keeps it in the warehouse. The data transmission between the warehouse and a data mart thereby depends on each other. Here the data mart is also called the ‘bottom level’ / ‘mini WH’ as
described above (shown in blue in the original figure), i.e., the data of an individual manager (like RF, RC, RP, and so on). Hence the data mart that depends upon the WH is called a dependent data mart.
Independent Data Mart: only one or an individual manager's data mart directly accesses the ETL process without any help from the warehouse. That is why it is called an independent data mart.
6.1. Two level approaches:
For both approaches a two-layer architecture applies.
1. Top - Bottom level approach, and
2. Bottom - Top level approach.
6.1.1. Top – Bottom level approach:
The level starts from the top: as per the example, from Reliance Group and its individual managers the ETL process loads the Data Warehouse (top level), and from there all the separate data marts (bottom level).
Reliance Group -> ETL process -> Warehouse (Layer I, top level) -> R Comm. / R Power / R Fresh data marts (Layer II, bottom level).
In the above, the top – bottom level approach is defined; this approach was invented by W. H. Inmon. Here the warehouse is the top level and all the data marts are the bottom level, as shown in the figure above.
6.1.2. Bottom – top level approach:
Here the ETL process loads the data marts (DM) directly, and the data is then put in the warehouse for reference purposes, i.e., the DMs are stored in the Data WareHouse (DWH).
Reliance Group -> ETL process -> R Comm. / R Power / R Fresh data marts (Layer I, bottom level) -> DWH (Layer II, top level).
The bottom – top level approach was invented by Ralph Kimball.
Here, one data mart (DM) contains information like customers, products, employees, locations, and so on.
Both the top – bottom and bottom – top level approaches come under the two-layer architecture.
Programming (coding):
• ETL tools are GUI (graphical user interface) based.
These tools “extract the data from heterogeneous sources”.
• ETL programming tools are “Teradata / Oracle / DB2 & so on…”
6.2. Four layers of DWH Architecture:
6.2.1. Layer I:
Source -> DWH, or Source -> group of Data Marts (Layer I in both cases).
In this layer the data is sent directly: in the first case from the source to the Data WareHouse (DWH), and in the second case from the source to a group of Data Marts (DM).
6.2.2. Layer II:
TOP – BOTTOM APPROACH: SRC -> DWH -> DM (Layer I -> Layer II).
BOTTOM – TOP APPROACH: SRC -> DM -> DWH (Layer I -> Layer II).
In this layer the data flows from source to data warehouse to data marts; this type of flow is called the “top – bottom approach”. In the other case the data flows from source to data marts to data warehouse, and this type of flow is called the “bottom – top approach”. This Layer II architecture is what the example above (the Reliance group) explained.
* (99.99% of projects use layer 3 and layer 4)
6.2.3. Layer III:
Source -> ODS -> DWH -> DMs (Layer I -> Layer II -> Layer III).
In this layer the data flows from source – ODS (Operational Data Store) – DWH – Data Marts.
The new concept added here is the ODS: it stores operational data for a period like 6 months or one year, and that data is used to solve instant problems; the ETL developer is not involved here.
The team that solves these instant/temporary problems is called the interface team, and it is involved here. After the period, the ODS data is stored into the DWH, and from there it goes to the DMs; the ETL developers are involved there, in layer 3.
The layer 3 architecture is explained clearly in the example below; it is the best example for a clear explanation.
Example #1:
An aeroplane (the source) is waiting for landing but cannot land because of some technical problem at the airport base station (it takes at least, or at most, 2 hrs to solve the problem). At the airport base station the interface team is involved: the problem information is captured and stored in a data base for 1 year, the OPERATIONS DATA STORE (Layer I). The stored problem info then goes to the DWH (Layer II) for future reference, and from there to the DMs (Layer III), where the ETL developers are involved.
Example explanation:
In this example, the source is an aeroplane that is waiting to land at the airport terminal. But it is not able to land because of some technical problem at the airport base station. To solve this type of operational problem a special team is involved, i.e., the interface team. At the airport base station the technical problems are captured into the Operational Data Store (ODS) in a dB, i.e., simply said, the problem information is captured.
But the ODS stores the data for one year only, and the older data is stored in the data warehouse so that the technical problems are not repeated, or for future reference. From the DWH the data goes to the Data Marts, where the ETL developers are involved to solve technical problems; this is also called the layer 3 architecture of the data warehouse.
DAY 7
Continued…
Project Architecture:
7.1. Layer IV: Layer 4 is also called the “Project Architecture”.
It is for the data backup of the DWH & SVC.
Figure: Project Architecture / Layer IV. In Layer I the sources feed the interface files (flat files); in Layer II (L2) the ETL reads the flat files through DS into the ODS; in Layer III (L3) the data moves to the DW and the SVC; in Layer IV (L4) the BI DM serves reporting.
Here,
ODS - Operational Data Store,
DW - Data Warehouse,
DM - Data Mart,
SVC - Single View Customer,
BI - Business Intelligence,
L2 & L3 & L4 - layers 2, 3, 4.
A solid line ------------- marks reference data; a dashed arrow - - - - - -> marks reject data.
About the project architecture:
In the project architecture, there are 4 layers.
In the first layer, the sources feed the interface files (flat files).
Coming to the second layer, the ETL reads the flat files through DataStage (DS) and sends them to the ODS. When the ETL sends the flat files to the ODS, if there is any mismatched data, it drops that data. There are two types of mismatched data: 1. condition mismatch, 2. format mismatch.
In the third layer, the ETL transfers the data to the warehouse.
In the last layer, the data warehouse checks whether there is a single view of each customer or not, and the data is loaded or transmitted between the DWH and the DM (business intelligence).
Note: (the information about dropped data applies when the transmission is done while the ETL reads the flat files (.txt, .csv, .xml, and so on) into the ODS.)
Two types of mismatch data:
• Condition mismatch (CM): this verifies whether the data from the flat files meets the conditions or is mismatched; if it is mismatched, the record is dropped automatically. To see the dropped data the reference link is used, and it shows which records are condition mismatched.
• Format mismatch (FM): this is like condition mismatch, but it checks the format, i.e., whether the format of the data or records being sent is correct or mismatched. Here also the reference link is used to see the dropped data.
Example for condition mismatch: an employee table contains dno values 10, 20, 30, 10, but the target (TRG) requires only dno = 10; the records with dno 20 and 30 are dropped, and the reference link shows the dropped records.
SQL> select * from emp;
EID ENAME  DNO
08  Naveen 10
19  Munna  20
99  Suman  30
15  Sravan 10
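In plain SQL terms, the same condition-mismatch drop could be sketched as follows (a minimal sketch; the trg table is an assumption):
SQL> INSERT INTO trg
     SELECT * FROM emp WHERE dno = 10;
-- rows with dno 20 and 30 fail the condition and are dropped;
-- in DataStage the reference link captures those dropped records.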
Example for Format Mismatch:
2010
DataStage
Here the table format is tab/space separated.
A record whose format does not match (the cross-marked record in the figure) is simply rejected.
7.2. Single View Customer (SVC):
It is also called the “single version of truth”.
For example: *how to make a unique customer?
Five multiple records of the same customer (the same records) are identified and transformed into one record, the SVC / single version of truth. Phase II identifies them field by field; in Phase III they cannot be identified. The DataStage people are involved in this process.
This type of transforming is also called Reverse Pivoting.
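As a rough SQL sketch of reverse pivoting (all names here are hypothetical: a cust_addr table with one row per customer per account type):
SQL> SELECT cid,
            MAX(CASE WHEN acc_type = 'SAVINGS' THEN addr END) AS savings_addr,
            MAX(CASE WHEN acc_type = 'CREDIT'  THEN addr END) AS credit_addr,
            MAX(CASE WHEN acc_type = 'LOAN'    THEN addr END) AS loan_addr
     FROM cust_addr
     GROUP BY cid;
The multiple records of one customer collapse into a single row, the single version of truth.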
NOTE: the Business Intelligence data mart (BI DM) is for the data backup of the DWH & SVC (single version of truth).
The palette contains shortcuts to all the stages, i.e., 7 stage categories in 7.5.2 & 8 stage categories in 8.0.
These stages are categorized into two groups: 1 -> Active stages (whatever stage transforms the data is called an active stage), 2 -> Passive stages (whatever stage extracts or loads the data is called a passive stage).
In the 8 categories we use sequential stage and parallel stage jobs.
Save, compile, and run the job.
Use the DS Director (to see views), i.e., to view the status of your job.
DAY 17
My first job creating process
Process:
On the computer desktop, the currently running processes are shown at the left corner; there, a round symbol in green color is used to start the DataStage server when it does not start automatically, i.e., check whether the server for DataStage has started or not, and if not, start it manually.
When the 8th version of DataStage is installed, five client component shortcuts are visible on the desktop:
Web Console
Information Analyzer
DS Administrator
DS Designer
DS Director
Web Console: when you click it, “the login page appears”.
o If the server is not started, a “the page cannot open” error will appear.
o If such an error occurs, the server must be restarted before doing or creating jobs.
DS Administrator: it is for creating/deleting/organizing the projects.
DS Director: it is for viewing the status of the executed jobs, and to view the log, status, and warnings.
DS Designer: when you click on the Designer icon, it will ask you to attach to a project for creating a new job, as shown below:
o User id: admin
o Password: ****
o If authentication fails at login, it is because of a repository interface error.
The figure below shows how to authenticate, and shows the Designer canvas for creating jobs.
After authentication, it displays the Designer canvas,
o and it asks which type of job you want to do; the types are:
The Attach Project dialog shows: Domain = Localhost:8080, User Name = admin, Password = ****, Project = Teleco, with OK and Cancel buttons.
Main frames
Parallel
Sequential
Server jobs
After clicking on parallel jobs, go to the tool bar – View – Palette.
In the palette the 8 types of stage categories are displayed for designing a job; they are:
General
Data Quality
Data Base
File
Development & Debug
Processing
Real Time
Re – Structure
17.1. File Stage:
Q: How can data be read from files?
The file stage can read only flat files, and the formats of flat files are .txt, .csv, .xml.
In .txt there are different types of formats like fwf, sc, csv, s & t, H & T.
.csv means comma separated values.
.xml means eXtensible Markup Language.
- In the File Stage category there are sub-stages like sequential file, data set, file set, and so on.
o Example of how a job can execute:
one sequential file (SF) to another SF (source -> target).
o The source file requires target/output properties, and
o the target file requires input/source properties.
- In the source file, how do we read a file?
o On double clicking the source file, we must set the properties as below:
File name \\ browse the file name.
Location \\ for example in c:\
Format \\ .txt, .csv, .xml
Structure \\ metadata
General properties of the sequential file:
1. Setting / importing the source file from the local server.
2. Format selection:
- As per the input file taken, the data must be in the given format,
- like “tab / space / comma”; one of them must be selected.
Select a file name: File = c:\data\se_source_file.txt (via the Browse button); the File option can also be given a ? pattern for multiple purposes.
3. Column structure defining:
To get the structure of the file:
- steps to load a structure:
- Import
o Sequential file
Browse the file and import
• Select the import file
o Define the structure.
These three are the general properties when we design a simple job.
DAY 18
Sequential File Stage
The sequential file stage is also described by its “output properties”:
- we use the sequential file stage only for a single structure format
(output properties at the source side, input properties at the target side).
- About the Sequential File Stage and how it works:
Step 1: the sequential file is a file stage that reads flat files with different extensions (.txt, .csv, .xml).
Step 2: the SF reads/writes sequentially by default, when it reads/writes from a single file.
o It also reads/writes in parallel when it reads/writes to or from multiple files.
Step 3: the sequential stage supports one input (or) one output, and one reject link.
Link:
A link is also a stage; it transfers data from one stage to another stage.
o Links are divided into categories:
Stream link: SF -> SF
Reject link: SF -> SF
Reference link: SF -> SF
Link Marker:
It shows how the link behaves during the transmission from source to target.
1. Ready BOX: it indicates that “a stage is ready with metadata”, with data transferring from sequential stage to sequential stage.
2. FAN IN: it indicates that “data transfers from a parallel stage to a sequential stage”; it happens when collecting is done.
3. FAN OUT: it indicates that “data transfers from a sequential stage to a parallel stage”; it is also called auto partition.
4. BOX: it indicates that “data transfers from a parallel stage to a parallel stage”; it is also known as partitioning.
5. BOW – TIE: it indicates that “data transfers from a parallel stage to a parallel stage” with the partitioning changed; it is also known as re-partitioning.
Link Color:
The link color indicates the process during the execution of a job.
RED:
o A link in RED color means,
case 1: a stage is not connected properly, or
case 2: the job aborted.
BLACK:
o A link in BLACK color means “a stage is ready”.
BLUE:
o A link in BLUE color means “a job execution is in process”.
GREEN:
o A link in GREEN color means “the execution of the job finished”.
NOTE: “A stage is an operator; an operator is a pre-built component.”
A stage imports an import operator for the purpose of creating data in the Native Format; the Native Format is the format DataStage understands. So, a stage is an operator.
Compile:
The compiler is a translator from source code to target code.
Compiling a .C function: HLL (.c) -> ALL (.obj) -> BC (.exe)
*HLL – High Level Language
*ALL – Assembly Level Language
*BC – Binary Code
Compiling process in DataStage: GUI -> OSH code (& C++) -> MC
*MC – Machine Code
*OSH – Orchestrate Shell Script
Note: the Orchestrate Shell Script is generated for every stage except one, the Transformer stage, which is done by C++.
In the compile process, it checks for:
Link requirements (checks the links)
Mandatory stage properties
Syntax rules
DAY 19
Sequential File Stage Properties
Properties:
Read Method: two options:
o Specific File: the user or client gives each file name specifically.
o File Pattern: we can use wild card characters and search for a pattern, i.e., * & ?
For example: C:\eid*.txt
C:\eid??.txt
Reject Mode: to handle “format/data type/condition” mismatched records.
Three options:
o Continue: drops the mismatches and continues with the other records.
o Fail: the job aborts.
o Output: captures the dropped data through a link to another sequential file.
First line is column names: true/false.
o If it is false, it displays the first line as a dropped record as well.
o Else if it is true, it doesn't drop the first record.
Missing File Mode: this option is used if any file name is missing.
Two options:
o OK: drops the file name when it is missing.
o Error: if a file name is missing, it aborts the job.
File Name Column: “source information at the target”; it gives information about which record comes from which address on the local server.
It directly adds a new column to the existing table and displays the information in that column.
Row Number Column: “source record number at the target”; it gives information about which source record number produced each record in the target table.
It also directly adds a new column to the existing table and displays the value in that column.
Read First Rows: “gets you the top n records (rows)”.
o The Read First Rows option asks you to give an n value to display the first n records.
Filter: “blocking unwanted data based on UNIX filter commands”,
like grep, egrep, …….. and so on.
Example:
o grep "moon" \\ it is case sensitive; it displays only the records containing moon.
o grep -i "moon" \\ it ignores case; it displays all moon records.
o grep -w "moon" \\ it displays exact match records.
Read From Multiple Nodes: we can read the data in parallel using the sequential stage.
Reading in parallel is possible,
loading in parallel is not possible.
LIMITATIONS of SF:
o It does sequential processing (processes the data sequentially).
o Memory limit 2 GB (.txt format).
o Another problem with sequential files is the conversions,
like ASCII – NF – ASCII – NF.
o It lands or resides the data “outside the boundary” of DataStage.
DAY 20
General settings of DataStage and about the Data Set
Default setting for starting up with a parallel job:
- Tools
o Options
Select a default
• And on “create new”: it asks which type of job you want.
- The types of jobs are main frames/parallel/sequential/server.
- After the above setting, when we restart the DS Designer it goes directly to the Designer canvas.
According to naming standards, every stage has to be named.
o Naming a stage is simple: just right click on a stage, the rename option is visible, and name the stage as per the naming standards.
General Stage:
In this category some stages are used for commenting on a stage, i.e., what it does or what the stage performs; simply put, giving comments for a stage.
Let us discuss Annotation & Description Annotation:
- Annotation: it is for a stage comment.
- Description Annotation: it is used for the job title (only one title can be kept).
Parallel capable of 3 jobs: SRC -> TRG, i.e., extracting and then landing the data into LS/RR/db.
Q: In which format does the data travel between the source file and the target file?
A: If we send a .txt file from the source, it is in ASCII format, because a .txt file supports only ASCII and DataStage supports only the Native Format; here the ASCII code is converted into the Native Format that is understandable to DataStage. At the target, the Native Format is converted back into the .txt ASCII format so it is visible to the user/client.
The “Native Format” is also called a Virtual Dataset.
src_f.txt (ASCII) -> NF -> trg_f.txt (ASCII)
Data Set (DS):
“It is a file stage, and it is used for staging the data when we design dependent jobs.”
The Data Set overcomes the limitations of the sequential file stage, for better performance.
By default the Data Set sends the data in parallel.
In a Data Set the data lands in the “Native Format”.
Q: How does the Data Set overcome the sequential file limitations?
- By default the data is processed in parallel.
- More than 2 GB is allowed.
When we convert ASCII code into NF, the source needs to import an import operator.
When we convert NF code into ASCII, the target needs to import an export operator.
- No need for conversion, because the Dataset's data directly resides in the Native Format.
- The data lands in the DataStage repository.
- The Data Set extension is *.ds.
Example: src_f.txt -> trg_f.ds, saving the structure as “st_trg”.
Q: How is the conversion easy with a Data Set?
- We can copy the “trg_f.ds” file name, and we must also save the structure of trg_f.ds, for example as st_trg.
- We can use the saved file name and the structure of the target in another job:
trg_f.ds -> trg_f.txt, copying the structure st_trg & trg_f.ds for reuse there.
- A Data Set can read only Native Format files, just as DataStage reads only the orchestrate format.
DAY 21
Types of Data Set (DS)
There are two types of Data Set:
Virtual (temporary)
Persistent (permanent)
- Virtual: the Data Set data that moves in the link from one stage to another stage, i.e., the link holds the data temporarily.
- Persistent: means the data sent from the link directly lands into the repository; that data is permanent.
Aliases of the Data Set:
o ORCHESTRATE FILE
o OS FILE
Q: How many files are created internally when we create a data set?
A: A Data Set is not a single file; it creates multiple files internally:
o Descriptor file
o Data file
o Control file
o Header file
Descriptor file: it contains the schema details and the address of the data.
Data file: consists of the data in the Native Format, and resides in the DataStage repository.
Control file:
It resides in the operating system, and together with the header file acts as an interface between the descriptor file and the data file.
Header file:
A physical file, meaning it is stored on the local drive / local server.
It is permanently stored under the install program files, c:\ibm\inser..\server\dataset{“pools”}.
Q: How can we organize a Data Set to view/copy/delete it in real time, etc.?
A: Case 1: we can't directly delete the Data Set.
Case 2: we can't directly see it or view it.
A Data Set is organized using utilities:
o using the GUI, i.e., the utility in the Tools menu (Data Set Management), or
o using the command line: we have to start with the orchadmin utility, e.g., $ orchadmin grep "moon";
Navigation to organize a Data Set in the GUI:
o Tools
Dataset Management
- File_name.ds (e.g.: dataset.ds)
o Then we will see the general information of the dataset:
Schema window
Data window
Copy window
Delete window
At the command line:
o $ orchadmin rm dataset.ds (this is the correct process) \\ this command removes a file
o $ rm dataset.ds (this is the wrong process) \\ we cannot write it like this
o $ ds records \\ to view the files in a folder
Q: What is the operator associated with the Dataset?
A: The Dataset doesn't have its own operator, but it uses the copy operator as its operator.
Dataset Version:
- Datasets have version control.
- A Dataset has a version for each DataStage version.
- The default in version 8 is that it saves in version 4.1, i.e., v41.
Q: How do we control the version at run time?
A: We have to set an environment variable for this.
Navigation for how to set the environment variable:
Job properties
o Parameters
Add environment variable
- Compile
o Dataset version ($APT_WRITE_DS_VERSION)
Click on that.
After doing this, when we save the job it will ask which version you want.
DAY 22
File Set & Sequential File (SF) input properties
File Set (FS): “it is also for staging the data”.
- The file stage is likewise used to design dependent jobs.
- Data Set & File Set are the same, but have minor differences.
- The differences between DS & FS are shown below; the Data Set has more performance than the File Set.
Data Set: having parallel extendable capabilities; more than 2 GB limit; NO REJECT link with the Dataset; DS is exclusively for internal use in the DataStage environment; copy (file name) operator; Native Format; saves .ds files.
File Set: having parallel extendable capabilities; more than 2 GB limit; REJECT LINK within the File Set; external applications can create an FS, so we can use it with any other application; import / export operator; binary format; .fs extension.
Sequential File Stage: input properties
- Setting the input properties at the target file; at the target there are four properties:
1. File update mode
2. Cleanup on failure
3. First line is column names
4. Reject mode
File Update Mode: has three options – append / create (error if exists) / overwrite.
o Append: when multiple files or a single file are sent to the sequential target, it appends one file after another into a single file.
o Create (error if exists): just creates a file; it errors if the file already exists or is given wrongly.
o Overwrite: it overwrites one file with another file.
Setting a passing value at run time (for file update mode):
o Job properties
Parameters
- Add environment variables
o Parallel
Automatically overwrite
($APT_CLOBBER_OUTPUT)
Cleanup on Failure: has two options – true/false.
True – when cleanup on failure is true, it removes partially written records when a job fails.
It works only when “file update mode” is equal to append.
False – it simply appends or overwrites the records.
First Line is Column Names: has two options – true/false.
True – it treats the first row or record as the column field names.
False – it simply reads every row, including the first row, as a record.
Reject mode: here the reject mode is the same as in the output properties we discussed before.
In this we have three options – continue/fail/output.
Continue – it just drops the data when the format/condition/data type mismatches, and continues processing the remaining records.
Fail – it just aborts the job when a format/condition/data type mismatch is found.
Output – it captures the dropped record data.
DAY 23
Development & Debug Stages
The development and debug category has three sub-categories:
1. Stages that generate data:
a. Row Generator
b. Column Generator
2. Stages that are used to pick sample data:
a. Head
b. Tail
c. Sample
3. The stage that helps in debugging:
a. Peek
Simply said, in development and debug we have 6 types of stages, and the 6 stages are divided into the three categories shown above.
23.1. Stages that generate data:
Row Generator: “it has only one output”.
- The Row Generator is for generating sample data; it is used in some cases.
- Some cases are:
o when the client is unable to give the data,
o for testing purposes,
o to make the job design simple, as a stand-in for jobs.
- The Row Generator can generate junk data automatically by considering the data type, or we can manually set some related, understandable data by giving user defined values.
- It has only one property; select a structure for creating the junk data.
The Row Generator design is as below:
Row Generator -> DS_TRG
Navigation for the Row Generator:
- Open the RG properties
- Properties
o Number of records = XXX (user defined value)
- Column
o Load the structure or metadata if it exists, or we can type it there.
For example n = 30:
- data is generated for the 30 records, and the junk data is also generated considering the data type.
Q: How to generate user defined values instead of junk data?
A: First we must go to the RG properties:
- Column
o Double click the serial number or press Ctrl+E
Generator
• Type = cycle/random (for an integer data type)
• For the integer data type we have three options each.
Under the cycle type:
There are three options for cycle generated data: increment, initial value, and limit.
Q: What happens when we select initial value = 30?
A: It starts from 30.
Q: When we select increment = 45?
A: It generates a cycle of values starting from 45, then adds 45 to every subsequent number.
Q: When we select limit = 20?
A: It generates values up to the limit number, in a cycle.
Under the random type:
There are three options for random generated data – limit, seed, and signed.
Q: When we select limit = 20?
A: It generates random values up to limit = 20, and continues if there are more than 20 rows.
Q: When we select seed = XX?
A: It generates the junk data for the random values from that seed.
Q: When we select signed?
A: It generates signed values for the field (values between -limit and +limit); otherwise it generates values between 0 and +limit.
Column Generator: “it has one input and one output”.
- The main purpose of the Column Generator is to add a column by which the whole table can be grouped as one.
- Using this we add an extra column, and for the added column junk data is generated in the output.
- Here mapping should be done in the Column Generator properties; it means just dragging and dropping the created column into the existing table.
Sequential File -> Column Generator -> DataSet
- Coming to the Column Generator properties:
- to open the properties, just double click on the stage.
Navigation:
- Stage
o Options
Column to generate = ?
and so on; we can give as many as required.
- Output
o Mapping
After adding the extra column it will be visible here, and for mapping we simply drag it into the table on the right side.
- Column
o We can change the data type as required.
In the output,
- the junk data is generated automatically for the extra added columns.
- Manually, we can generate some meaningful data for the extra columns.
- Navigation for manual generation:
o Column
Ctrl+E
• Generator
o Algorithm = two options, “cycle / alphabet”.
o Cycle – it has only one option, i.e., value.
o Alphabet – it also has only one option, i.e., string.
- Cycle works the same as shown above in the Row Generator.
Q: When we select alphabet with string = naveen?
A: It generates different rows alphabetically from the given string.
DAY 24
Pick Sample Data & Peek
24.1. Pick sample data: “it is a debug stage; there are three types of pick sample data”:
- Head
- Tail
- Sample
Head: “it reads the top n records of every partition”.
o It has one input and one output.
o In the Head stage, mapping must be done.
SF_SRC -> HEAD -> DS_TRG
Properties of Head:
o Rows
All Rows (after skip) = false
- It copies all rows to the output, following any requested skip positioning.
Number of Rows (per partition) = XX
- It copies this number of rows from input to output, per partition.
o Partitions
All Partitions = true
- True: copies rows from all partitions.
- False: copies from specific partition numbers, which must be specified.
Tail: “it is a debug stage that reads the bottom n rows from every partition”.
o The Tail stage has one input and one output.
o In this stage mapping must be done; the mapping is done in the Tail output properties.
SF_SRC -> TAIL_F -> DS_TRG
Properties of Tail:
o The properties of Head and Tail are similar, as shown above.
o Mainly we must give the value for “number of rows to display”.
Sample: “it is also a debug stage; it consists of period and percentage modes”.
o Period: when operating in this mode it supports one input and one output.
o Percentage: when operating in this mode it supports one input and multiple outputs.
SF_SRC -> SAMPLE -> DS_TRG
Period: if I have some records in the source table and we give a period value of n, it displays or retrieves every nth record from the source table.
Skip: this skips records first and then likewise retrieves every nth record from the given source table (see the SQL sketch below).
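Picking every nth record can be sketched in SQL (a minimal sketch; ROWNUM and MOD are Oracle-specific, and n = 3 is just an illustration):
SQL> SELECT eid, ename, dno
     FROM (SELECT e.*, ROWNUM rn FROM emp e)
     WHERE MOD(rn, 3) = 0;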
Percentage: it reads from one input to multiple outputs.
o Coming to the properties:
Options
- Percentage = 25, and we must set target = 1
- Percentage = 50, target = 0
- Percentage = 15, target = 2
o The target number we set here is called the link order.
o Link Order: it specifies to which output the specific data has to be sent.
o Mapping: it should be done for each of the multiple outputs.
SAMPLE: SF_SRC -> Target1 / Target2 / Target3
NOTE: the sum of the percentages of all outputs must be less than or equal (<=) to 100% of the input records.
o The percentage mode distributes the data in percentage form. When Sample receives 90% of the data from the source, it considers that 90% as 100% and distributes it as we specify.
24.2. PEEK: “it is a debug stage, and it helps in debugging”.
SF_SRC -> PEEK
It is used in three ways:
1. It can be used to copy the data from the source to multiple outputs.
2. It can send the data into the logs.
3. And it can be used as a stub stage.
Q: How to send the data into the logs?
Open the properties of the Peek stage; we must assign:
o Number of rows = value?
o Peek record output mode = job log, and so on, as per the options.
o If we put column name = false, it doesn't show the column names in the log.
To see the log records that we stored:
o in the DS Director,
from Peek – log – peek, we see the n values of records and fields there.
Q: When does the Peek act as a copy stage?
A: When the sequential file cannot send the data to multiple outputs; at that time the Peek acts as a copy stage.
Q: What is a stub stage?
A: A stub stage is a place holder: in some situations a client requires only the dropped data. At that time the stub stage acts as a place holder which holds the output data temporarily, and it sends the rejected data to another file.
DAY 25
Database Stages
In this category we generally use Oracle Enterprise, ODBC Enterprise, Teradata with ODBC, dynamic RDBMS, and so on.
25.1. Oracle Enterprise:
“Oracle Enterprise is a database stage; it reads tables from the Oracle database, from source to target.”
o Oracle Enterprise reads from multiple tables, but it loads into one output.
Oracle Enterprise -> Data Set
o Properties of Oracle Enterprise (OE):
Read Method: it has four options:
• Auto-generated SQL \\ it generates an automatic query
• SQL Builder \\ a new concept in v8 compared with v7
• Table \\ give the table name here
• User-defined SQL \\ here we give a user defined SQL query
If we select the table option:
• Table = “<table name>”
Connection:
• Password = *****
• User = Scott
• Remote server = oracle
o Navigation for how the data is loaded into the columns:
If the definition is already present in the plug-in:
• select the Load option in Column,
• go to the table definitions,
• then to Plug-in,
• and load the EMP table from there.
If the table is not there in the plug-in:
• select the Load option in Column,
• then go to Import,
• import the “meta data definition”,
o select the related plug-in:
Oracle
User id: Scott
Password: tiger
After loading, select the specific table and import it.
• After importing into Column, in the definitions we must change the hire date data type to “Time Stamp”.
Q: A table contains 300 records; I need only 100 records from it?
A: In the read method we use a user-defined SQL query to solve this problem, by writing a query that reads 100 records (see the sketch below).
Alternatively, with the first read method option we can auto generate the query, and then use it by copying the query statement into the user-defined SQL.
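A minimal sketch of such a user-defined query (ROWNUM is Oracle-specific):
SQL> SELECT * FROM emp WHERE ROWNUM <= 100;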
Q: What can we do when we don't know how to write a select command?
A: Select read method = SQL Builder.
After selecting the SQL Builder option from the read method:
o Oracle 10g
o from there, drag whichever table you want,
o and select columns, or double click in the dragged table;
there we can select whatever condition we need.
It is totally automated.
NOTE: in version 7.5.x2 we don't have saving and reusing of properties.
Data Connection: its main purpose is reusing the saved properties.
Q: How to reuse the saved properties?
A: Navigation for how to save and reuse the properties:
Open the OE properties
o Select Stage
Data connection
• There, load the saved DC
o Naveen_dbc \\ it is a saved data connection
o Save it in the table definitions.
DAY 26
ODBC Enterprise
ODBC Enterprise is a database stage.
About ODBC Enterprise:
Oracle needs some plug-ins to connect to DataStage. When DataStage version 7 was released, Oracle 9i provided some drivers to use.
Coming to connections: Oracle Enterprise connects directly to the Oracle database (directly hitting it). But ODBC Enterprise needs OS drivers to hit Oracle, i.e., it goes through the OS to connect to the Oracle database.
Difference between Oracle Enterprise (OE) and ODBC Enterprise:
OE: version dependent; good performance; specific to Oracle; uses plug-ins; no rejects at the source.
ODBCE: version independent; poorer performance; for multiple dBs; uses OS drivers; rejects at SRC & TRG.
Q: How to connect to the database using ODBC?
ODBCE -> Data Set
First step: open the properties of ODBCE:
Read method = table
o Table = EMP
Connection
o Data Source = WHR \\ WHR means the name of the ODBC driver (DSN)
o Password = ******
o User = Scott
Creating the WHR ODBC driver at the OS level:
o Administrative tools
ODBC
• Add
o MS ODBC for Oracle
Give the name as WHR,
provide the user name = Scott,
and the server = tiger.
The ODBCE driver at the OS level has a lengthy process to connect; to overcome this, the ODBC Connector was introduced.
Using the ODBC Connector is a quick process compared with ODBCE.
The best feature of the ODBC Connector is “schema reconciliation”: it automatically handles data type mismatches between the source data types and the DataStage data types.
Differences between ODBCE and the ODBC Connector:
ODBCE: it cannot make the list of Data Source Names (DSN); in the ODBCE there is “no testing the connection”; it reads sequentially and loads sequentially.
ODBC Connector: it provides the list of available ODBC DSNs; in this we can test the connection with the Test button; it reads in parallel and loads in parallel (good performance).
Properties of the ODBC Connector:
o Select the Data Source Name DSN = WHR
o User name = Scott
o Password = *****
o SQL query
26.1. MS Excel with ODBCE:
The first step is to create the MS Excel file, which is called a “work book”; it has n number of sheets in it.
For example, the CUST work book is created.
Q: How to read an Excel work book with ODBCE?
A: Open the properties of ODBCE:
Read method = table
o Table = "empl$" \\ when reading from Excel, the name must be in double quotes and end with a $ symbol
Connection
o DSN = EXE
o Password = *****
o User = xxxxx
Column
o Load
Import ODBC table definitions
• DSN \\ here select the work book
• User id & password
o Filter \\ enabled by clicking on “include system tables”
o and select which you need & OK.
In the operating system:
o add in ODBC
MS EXCEL drivers
• Name = EXE \\ it is the DSN
Q: How do you read the Excel format with a Sequential File?
A: By changing the CUST Excel format into CUST.csv.
26.2. Teradata with ODBCE:
Teradata is a database (like Oracle Corporation's database), which is used as the database here.
Q: How to read Teradata with ODBC?
A: We must start the Teradata connection (by clicking the shortcut).
Column Export:
“It is used to combine multiple columns into a single column”; it is like concatenation in the Transformer functions.
Properties:
o Input
Column method = explicit
Column To Export = EID
Column To Export = ENAME
Column To Export = STATE
o Output
Export column type = “varchar”
Export output column = REC
Column Import:
“It is used to explode a single column into multiple columns”; it is like a field separator in the Transformer functions.
Properties:
o Input
Column method =
Column To Import = REC
o Output
Import column type = “varchar”
Import output column = EID
Import output column = ENAME
Import output column = STATE
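In SQL terms, column export is a concatenation and column import is a split; a minimal sketch (REGEXP_SUBSTR is Oracle-specific, and the staged_recs table holding the REC column is an assumption):
SQL> SELECT eid || ',' || ename || ',' || state AS rec FROM emp;
SQL> SELECT REGEXP_SUBSTR(rec, '[^,]+', 1, 1) AS eid,
            REGEXP_SUBSTR(rec, '[^,]+', 1, 2) AS ename,
            REGEXP_SUBSTR(rec, '[^,]+', 1, 3) AS state
     FROM staged_recs;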
DAY 30
JOB Parameters (Dynamic Binding)
Dynamic Binding:
“Compiling the job and then passing the values during run time is known as dynamic binding.”
Assume one scenario: when we take an Oracle Enterprise stage, we must provide the table and load its meta data; here the table name must be statically bound.
But there is no need for the authentication to Oracle to be statically bound, because of some security reasons. For this we can use job parameters, which can provide values at run time to authenticate.
Job parameters:
“Job parameters are a technique for passing values at run time; this is also called dynamic binding.”
Job parameters are divided into two types:
o Local variables
o Global variables
Local variables (params): “they are created in the DS Designer only, and can be used within that job only”.
Global variables: “they are also called environment variables”; they are divided into two types:
o Existing: these come with DataStage, in two groups, one General and one Parallel; under Parallel, the compiler, operator specific, and reporting variables are available.
o User Defined: these are created in the DataStage Administrator only.
NOTE: “The local parameters created in one job cannot be reused in another job; this holds up to version 7. But coming to version 8 we can reuse them by a technique called the parameter set.” In version 7 we can also reuse parameters via user defined values in the DataStage Administrator.
Q: How to give run time values using parameters for the following list?
a. To give run time values for the user ID, password, and remote server?
b. Department number (DNO) to keep as a constraint, and at run time to select any number from a list to display?
c. Add BONUS to SAL + COMM at run time?
d. Provide the target file name at run time?
e. Re-use the global variables and the parameter set?
Design:
ORACLE -> Tx -> Data Set
Step 1:
“Creating job parameters for the given question as local variables.”
Job parameters
o Parameters (Name / DNAME / Type / Default value):
UID / USER / String / SCOTT (a)
PWD / PASSWORD / Encrypted / ****** (a)
RS / SERVER / String / ORACLE (a)
DNO / DEPT / List / 10 (b)
BONUS / BONUS / Integer / 1000 (c)
IP / DRIVE / String / C:\ (d)
FOLDER / FOLDER / String / Repository\ (d)
TRGFILE / TARGET / String / dataset.ds (d)
Here, the letters a, b, c, d mark which parameter answers which part of the question.
Step 2: “Creating global job parameters and the parameter set.”
DS Administrator
o Select a project
Properties
• General
o Environment variables
User defined (there we can write the parameters):
UID / USER / String / SCOTT
PWD / PASSWORD / Encrypted / ******
RS / SERVER / String / ORACLE
Here, global parameters are preceded by the $ symbol.
For re-use, we must:
o Add environment variables
User defined
• UID $UID
• PWD $PWD
• RS $RS
Step 3:
“Creating a parameter set for multiple values & providing UID, PWD, and the other values for DEV, PRD, and TEST.”
In the local variables job parameters:
o select multiple values by clicking on them
and create the parameter set,
• providing a name to the set,
o SUN_ORA,
saving it in the table definitions.
• In the table definitions,
o edit the SUN_ORA values to add (Name / UID / PWD / SERVER):
DEV / SYSTEM / ****** / MOON
PRD / PRD / ****** / SUN
TEST / TEST / ****** / ORACLE
For re-using this in another job:
o add the parameter set (in the job parameters)
Table definitions
• Navs
o SUN_ORA (select it here to use it)
NOTE: “A parameter set can be used in jobs within the project only.”
Step 4:
“In the Oracle Enterprise properties, select the table name and later assign the created job parameters as shown below.”
Properties:
Read method = table
o Table = EMP
Connection
o Password = #PWD#
o User = #UID#
o Remote Server = #RS#
Column:
Load
o the meta data for the EMP table.
When inserting job parameters: $UID / $PWD / $RS are the global environment variables; SUN_ORA.UID / SUN_ORA.PWD / SUN_ORA.RS come from the parameter set; UID / PWD are the local variable parameters.
Step 5:
“In the Tx properties, use the dept no as a constraint and assign the bonus to the bonus column.”
Here, DNO and BONUS are the job parameters we created above for use here.
For that, simply right click -> job parameters -> DNO/BONUS (choose what you want).
Step 6:
“The target file is set at run time; the steps below are to be followed to keep it at run time.”
Data Set properties
o Target file = #IP##FOLDER##TRGFILE#
Here, when we run the job it asks which drive, then which folder, and at last it asks what target file name you want.
The Tx mapping for this job: input columns EID, ENAME, STATE, SAL, COMM, DEPTNO; stage variable NS with the derivation IN.SAL + NullToZero(IN.COMM); output derivations (derivation -> column): IN.EID -> EID, IN.ENAME -> ENAME, NS -> NETSAL, NS + BONUS -> BONUS; constraint: IN.DEPTNO = DNO.
DAY 31
Sort Stage (Processing Stage)
Q: What is sorting?
“Here sorting means more than what we usually know.”
Q: Why sort the data?
“To provide sorted data to the sort stages like join/aggregator/merge/remove duplicates, for good performance.”
Two types of sorting:
1. Traditional sorting: “simple sorting, arranging the data in ascending order or descending order”.
2. Complex sorting: “it is only for the sort stages: creating group ids, blocking unwanted sorting, and group wise sorting”.
In DataStage we can perform sorting at three levels:
Source level: “it is only possible in a database”.
Link level: “it can be used for traditional sorting”.
Stage level: “it can be used for traditional sorting as well as complex sorting”.
Q: Which level is best for sorting when we consider the performance?
“Link level sort is the best we can perform.”
Source level sort:
o It can be done only in a database stage, like Oracle Enterprise and so on.
o How is it done in Oracle Enterprise (OE)?
Go to the OE properties
• Select user defined SQL
o Query: select * from EMP order by DEPTNO;
Link level sort:
o Here the sorting is done on the link, as shown in the pictorial way.
o It can be used for traditional sorting only.
o Link sort is the best sort in terms of performance.
OE -> JOIN -> DS
Q: How to perform a link sort?
“As per the above design, open the JOIN properties.”
Go to partitioning
o Select the partition technique (the default here is ‘auto’)
Mark “perform sort”
• When we select unique, it removes duplicates.
• When we select stable, it preserves the stable (original) order of the data.
Q: Get all unique records to target1 and the remaining records to target2?
“For this we must create a group id; it indicates the group identification.”
This is done in the sort stage: in the properties of the sort stage, under the options, keep create key change column (CKCC) = “true” (the default is false).
Here we must select the column on which the group id is wanted.
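The key change column can be sketched with an SQL analytic function (a minimal sketch; LAG is Oracle-specific): the first row of each group gets 1 and the repeated rows get 0.
SQL> SELECT e.*,
            CASE WHEN deptno = LAG(deptno) OVER (ORDER BY deptno)
                 THEN 0 ELSE 1 END AS key_change
     FROM emp e;
Rows with key_change = 1 would go to target1 (the unique/first records of each group) and the rest to target2.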
Sort Stage:
“It is a processing stage that can sort the data with a traditional sort or a complex sort.”
Complex sort means creating group ids, blocking unwanted sorting, and group wise sorting for the sort stages like join, merge, aggregator, and remove duplicates.
Traditional sort means sorting in ascending order or descending order.
Sort properties:
Input properties
o Sorting key = EID (select the column from the source table)
NOTE: for every n-input and n-output stage, mapping must be done.
Aggregator:
“It is a processing stage that performs counts of rows and different calculations between columns, i.e., the same as a group by operation in Oracle.”
SF -> Aggregator -> DS
Properties of the Aggregator:
Grouping keys:
o Group = Deptno
Aggregator
o Aggregation type = count rows (count rows / calculation / re-calculation)
o Count output column = count <column name>
1Q: Count the number of all records, and dept no wise, in the EMP table?
1 Design:
OE_EMP -> copy of EMP -> counting rows of deptno -> TRG1
generating a column -> counting rows of the created column -> TRG2
For doing some group calculations between columns:
Example:
Select the group key:
Group = DEPTNO
- Aggregation type = calculation
- Column for calculation = SAL <column name>
The operations are:
Maximum value output column = max <new column name>
Minimum value output column = min <new column name>
Sum of column = sum <new column name>, and so on.
Here, we are doing calculations on SAL based on DEPTNO (see the SQL sketch below).
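The aggregator's count rows and calculation types map directly onto SQL group functions; a minimal sketch:
SQL> SELECT deptno,
            COUNT(*) AS cnt,
            MAX(sal) AS max_sal,
            MIN(sal) AS min_sal,
            SUM(sal) AS sum_sal
     FROM emp
     GROUP BY deptno;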
2Q: In target one, find the dept no wise maximum, minimum, and sum of rows; and in target two, the company wise maximum?
2 Design:
OE_emp -> copy of emp -> max, min, sum per deptno -> trg1
-> company (IBM): max of IBM -> trg2
3Q: Find the max salary of the company from the emp table, and find all the details of that employee?
&
4Q: Find the max, min, and sum of salary dept no wise in the emp table?
3 & 4 Design:
emp -> copy -> max(deptno) and min(deptno) aggregations with dummy keys -> UNION ALL -> compare, dividing by dno = 10 / dno = 20 / dno = 30;
for the company (IBM): compare against max(IBM) to get the maximum SAL with the employee's details.
DAY 43
Slowly Changing Dimensions (SCD) Stage
Before the SCD stage we must understand the types of loading:
1. Initial load
2. Incremental load
Initial load: the complete dump into the dimensions or data warehouse, i.e., the data already in the target, is called the initial load (the before data).
The subsequent alterations are called the incremental load, i.e., what comes from OLTP as the source (the after data).
Example: #1
Before data (the data already in the table):
CID | CNAME | ADD | GEN | BALANCE | Phone No    | AGE
11  | A     | HYD | M   | 30000   | 9885310688  | 24
After data (update 'n' insert at the source level):
CID | CNAME | ADD | GEN | BALANCE | Phone No    | AGE
11  | A     | SEC | M   | 60000   | 9885865422  | 25
The column fields have different types of change rates:
Address – slowly changing
Balance – rapidly changing
Phone No – often changing
Age – frequently changing
Example: #2
Before data:
CID | CNAME | ADD
11  | A     | HYD
22  | B     | SEC
33  | C     | DEL
After data (loading the table with the update 'n' insert option):
CID | CNAME | ADD
11  | A     | HYD
22  | B     | CUL
33  | D     | PUN
Extract the after and before data from the DW (or) database to compare and upsert.
We have SIX types of SCDs; they are:
SCD – I
SCD – II
SCD – III
SCD – IV or V
SCD – VI
Explanation:
SCD – I: “it only maintains the current update; no historical data is kept.”
As per SCD – I, it updates the before data with the after data, and no history is present after the execution.
SCD – II: “it maintains both the currently updated data and the historical data”, with some special operational columns: the surrogate key, active flag, effective start date, and effective end date.
In SCD – II there is no primary key from the source, so a system generated primary key is needed, i.e., the surrogate key; here the surrogate key acts as the primary key.
And when SCD – II is performed we get a practical problem: identifying the old and current records. We solve that with the active flag: “Y” or “N”.
In SCD – II new concepts are also introduced, i.e., the effective start date (ESDATE) and the effective end date (EEDATE).
Record version: a concept used when ESDATE and EEDATE cannot be used under some conditions.
Unique key: the unique key is maintained by comparing.
SCD – III: SCD – I (+) SCD – II, “maintains the history but no duplicates”.
SCD – IV or V: SCD – II + record version:
“when we do not maintain the date version, the record version is useful”.
SCD – VI: SCD – I + unique identification.
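A minimal SQL sketch of an SCD – II load for one changed customer (the dim_cust table, the addr column name, and the seq_sid sequence are assumptions; the values follow the example table below):
-- expire the old version of customer 22
SQL> UPDATE dim_cust
     SET af = 'N', eedate = SYSDATE
     WHERE cid = 22 AND af = 'Y';
-- insert the new version with a fresh surrogate key and record version
SQL> INSERT INTO dim_cust (sid, cid, cname, addr, af, esdate, eedate, rv)
     VALUES (seq_sid.NEXTVAL, 22, 'B', 'RAJ', 'Y', SYSDATE, DATE '9999-12-31', 3);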
Example table of SCD data:
SID | CID | CNAME | ADD | AF | ESDATE   | EEDATE     | RV | UID
1   | 11  | A     | HYD | N  | 03-06-06 | 29-11-10   | 1  | 1
2   | 22  | B     | SEC | N  | 03-06-06 | 07-09-07   | 1  | 2
3   | 33  | C     | DEL | Y  | 03-06-06 | 9999-12-31 | 1  | 3
4   | 22  | B     | DEL | N  | 08-09-07 | 29-11-10   | 2  | 2
5   | 44  | D     | MCI | Y  | 08-09-07 | 9999-12-31 | 1  | 5
6   | 11  | A     | GDK | Y  | 30-11-10 | 9999-12-31 | 2  | 1
7   | 22  | B     | RAJ | Y  | 30-11-10 | 9999-12-31 | 3  | 2
8   | 55  | E     | CUL | Y  | 30-11-10 | 9999-12-31 | 1  | 8
Table: this table illustrates the six SCD types described above.