Manage dimension tables in InfoSphere Information Server DataStage
How to use the Slowly Changing Dimension stage

Skill Level: Intermediate

Brian Caufield ([email protected])
Software Architect
IBM

12 Mar 2009

Information Server DataStage® Version 8.0 introduced the Slowly Changing Dimension (SCD) stage. This tutorial provides step-by-step instructions on how to use the SCD stage for processing dimension table changes. It also shows you how to use the output of the stage to update an associated fact table. The tutorial includes a fully operational download.

Section 1. Before you start

The Slowly Changing Dimension stage was added in the 8.0 release of InfoSphere Information Server DataStage. It is designed specifically to support the types of activities required to populate and maintain records in star schema data models, specifically dimension table data. The Slowly Changing Dimension stage encapsulates all of the dimension maintenance logic: finding existing records, generating surrogate keys, checking for changes, and determining what action to take when changes occur. In addition, you can associate dimension record surrogate key values with source records, which eliminates the need for additional lookups in later processing.

About this tutorial

Manage dimension tables in InfoSphere Information Server DataStage Trademarks© Copyright IBM Corporation 2009. All rights reserved. Page 1 of 32

This tutorial is designed to introduce you to using the Slowly Changing Dimension stage on the Information Server DataStage parallel canvas. The tutorial uses a simplified example scenario that focuses on Slowly Changing Dimension functionality. Actual business scenarios may require different approaches to the job design used in this tutorial's example. The volume of data processed in the tutorial is intentionally small to make it easier to understand the processing that is taking place.

The material in the SCD_Tutorial.zip file in the Download section is built to run on a Windows platform with a DB2 database. You can modify the material to run on a different platform or to use a different database.

Objectives

In this tutorial, you will learn how to design a job that uses the Slowly Changing Dimension stage to perform updating and loading of dimension and fact tables. After completion, you will be able to configure the SCD stage for history-tracking changes and in-place changes, and use the output of the stage to update an associated fact table.

Prerequisites

This tutorial is written for DataStage developers who are familiar with the DataStage Parallel Edition design canvas. You will also benefit if you already have a knowledge of star schema design concepts (including fact and dimension tables), the use of surrogate keys, and the usual methodology for updating dimension tables.

System requirements

To create the job in this tutorial, you need an Information Server DataStage 8.x installation that is licensed to use the parallel engine. You also need a DataStage Designer client and access to a DataStage project where you can create, import, compile, and run DataStage jobs.

To use the sample scripts in the SCD_Tutorial.zip download, your Information Server must be installed on a Windows® OS with access to a DB2 database. However, you can also modify the scripts to work on other operating systems and with a different database.

developerWorks® ibm.com/developerWorks

Section 2. Star schemas and Slowly Changing Dimensions

Star schemas are a method of data modeling in which the data that is being measured, called the facts, are stored in one table, called the Fact table. Business Objects are the entities that are involved in the events being measured. Business Objects consist of identifying information and attributes that describe the object. These objects are stored in tables called dimension tables. The facts in the fact table are linked to the business objects in the associated dimension tables using foreign keys.

Figure 1. Example Star Schema

Because fact tables record the measurements generated from business events, they tend to grow rapidly. Dimension tables, on the other hand, tend to grow or change less frequently. In the example used in this tutorial, the fact table records information about sales transactions. Every transaction results in a new row in the fact table. The product dimension in the example only grows when a new product is introduced, or if information about an existing product is changed.

You typically handle changes to attribute information in one of two ways:

• Overwrite: The existing row in the dimension table is updated to contain the new attribute values; the old values are no longer available. This is commonly referred to as a Type1 change.

• Tracking History: The existing row in the dimension table is modified to indicate that it is no longer current (that is, it has been expired), and a new row is inserted with the current attribute values. This is commonly referred to as a Type2 change.
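As a rough illustration of the two approaches (this is not DataStage code), the following Python sketch applies an overwrite and a history-tracking change to a small in-memory dimension table. The column names mirror the product dimension used later in this tutorial; the function names, the use of the change date as the expiration date, and the way the new surrogate key is supplied are assumptions made only for this example.

```python
from datetime import date

FAR_FUTURE = date(2099, 12, 31)  # open-ended expiration date, as in the tutorial data


def apply_type1(dim_rows, business_key, new_attrs):
    """Overwrite (Type1): update the current row in place; old values are lost."""
    for row in dim_rows:
        if row["SKU"] == business_key and row["Curr"] == "Y":
            row.update(new_attrs)
    return dim_rows


def apply_type2(dim_rows, business_key, new_attrs, new_sk, today):
    """Tracking history (Type2): expire the current row, then add a new current row."""
    for row in dim_rows:
        if row["SKU"] == business_key and row["Curr"] == "Y":
            # Assumption: expiring a row means clearing the flag and closing the date range.
            row["Curr"] = "N"
            row["ExpDate"] = today
            new_row = dict(row, **new_attrs)
            new_row.update(ProdSK=new_sk, Curr="Y", EffDate=today, ExpDate=FAR_FUTURE)
            dim_rows.append(new_row)
            break  # stop iterating before visiting the row just appended
    return dim_rows
```

With a Type1 change the table keeps the same number of rows; with a Type2 change it gains one row per change, which is why the business key can no longer serve as the primary key.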

Surrogate Keys

Surrogate Keys are values that are generated specifically for the purpose of uniquely identifying dimension table rows. The primary reasons you would use a surrogate key rather than the usual business key of the object in the dimension table are:

• When tracking history in the dimension table, there will be multiple rows in the dimension table for the same business key. Therefore, it is not possible to use the business key as the primary key.

• Typical fields that are used as business keys generally don't change, but situations can arise where they do change. For example, US citizens can be assigned a new social security number, or account numbers may be reassigned after a merger.

Surrogate keys provide a way for the dimension table to have a reliable, unique, and never-changing primary key.

Section 3. Tutorial scenario

The scenario used for this tutorial has one fact table and two dimension tables that will be updated. The source file contains sales transaction records. The information in the source file is used to update the fact and dimension tables.

Figure 2. Scenario schemas

Source data

The source data file is named SaleDetail.dat and is contained in the SCD_Tutorial.zip download. It contains five records that, when processed, apply changes to the fact and dimension tables. Table 1 shows the contents of the file.

Table 1. Source data

StoreId  StoreName   StoreMgr    ProdSKU     ProdBrand  ProdDescr      SaleAmt   SaleUnits
A1111    Stuff       Washington  1111111111  Bob's      Red box        00436.14  13
A1112    More Stuff  Adams       2222222222  Squeaky    Blue Chair     00456.56  14
A1113    Stuffy's    Jefferson   3333333333  Sunshine   Yellow Duckie  00203.38  7
A1114    McStuff     Madison     4444444444  AAAAA      fork           00308.87  2
A1115    Stuff       Monroe      5555555555  Best Jr.   lawn mower     00024.40  11

Product dimension

The product dimension is a table in the target database. Initially this table contains records for three products. When the source data is processed, the table is updated to contain new product records, and to track the history of changed product information. The Setup.bat file in the SCD_Tutorial.zip download contains a script that creates and populates this table with the data shown in Table 2.

Table 2. Initial product dimension data

ProdSK  SKU         Brand     Descr          Curr  EffDate     ExpDate
1       3333333333  Sunshine  Yellow Duckie  Y     2004-01-01  2099-12-31
2       4444444444  AAAAA     spoon          Y     2004-01-01  2099-12-31
10      5555555555  AAAAA     grass cutter   Y     2004-01-01  2099-12-31

Store dimension

The store dimension is a table in the target database. Initially this table contains records for three stores. When the source data is processed, the table is updated to contain new store records, and to overwrite changed store information. The Setup.bat file in the SCD_Tutorial.zip download contains a script that creates and populates this table with the data shown in Table 3.

Table 3. Initial store dimension data

StoreSK  ID     Name       Mgr
1        A1113  Stuffy's   Jefferson
2        A1114  McStuff    Adams
5        A1115  Lil Stuff  Monroe

Fact table

The fact table is a table in the target database. Initially this table contains no records. When the source data is processed, the table is updated with the sales facts and references to the corresponding dimension records. The Setup.bat file in the SCD_Tutorial.zip download contains a script that creates the table as shown in Table 4.

Table 4. Initial Fact table data

ProdSK  StoreSK  SaleAmt  SaleUnits
(initially empty)

Section 4. Setting up the tutorial

To set up the tutorial, save the SCD_Tutorial.zip file from the Download section to your local file system and follow these steps:

1. Check if you already have the following directory structure: C:\IBM\Demo\DataStage. If not, create it.

2. Extract the contents of SCD_Tutorial.zip into C:\IBM\Demo\DataStage. Be sure to select the option in your extraction program that indicates you want to use the folder or directory names when extracting. You should end up with the directory C:\IBM\Demo\DataStage\SCD, which contains several files and an empty sub-directory named SKG.

3. Run C:\IBM\Demo\DataStage\SCD\setup.bat.

4. In the DataStage Administrator client, set the environment variable APT_DB2INSTANCE_HOME to the location where the db2nodes.cfg file exists. Typically this is C:\IBM\SQLLIB\DB2. This configures the project to access DB2 as a source or target for the DB2 Enterprise Stage.

5. Using the DataStage Designer client, import C:\IBM\Demo\DataStage\SCD\SCD_Tutorial.dsx into your DataStage project.
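If you prefer to script steps 1 and 2 rather than use an extraction program, a minimal Python sketch along the following lines could work. It is not part of the tutorial download; the function name is hypothetical, and steps 3 through 5 still have to be done as described above.

```python
import zipfile
from pathlib import Path


def extract_tutorial(zip_path, target_dir):
    """Create target_dir if it is missing and extract the archive into it.

    extractall() recreates the folder names stored in the archive, which is
    the behavior step 2 asks you to enable in your extraction program.
    Returns the extracted paths relative to target_dir.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # step 1
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)             # step 2
    return sorted(p.relative_to(target).as_posix() for p in target.rglob("*"))
```

For example, extract_tutorial(r"C:\Downloads\SCD_Tutorial.zip", r"C:\IBM\Demo\DataStage") would be one way to call it, where the download location shown is only a placeholder.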

Verify the state of the database

Run the Results executable shortcut in the C:\IBM\Demo\DataStage\SCD directory. This displays the contents of the product and store dimensions as well as the fact table. Review the output to verify that the tables have been initialized properly.

Resetting the tutorial

Once the tutorial has been run the first time, the contents of the database will have changed. Therefore, subsequent runs would see different behavior. If you want to reset the database tables back to their initial state, run the zReset executable shortcut in the C:\IBM\Demo\DataStage\SCD directory.

Initializing the surrogate keys

The tutorial uses surrogate key generators that use state files to record the key values that have been used. This ensures that unique values are always generated. Because the dimension tables are created with data in them, you need to make the surrogate key generators aware of what values have already been used.

Compile and run the Demo\DataStage\Slowly Changing Dimensions\Surrogate Key Generation\CreateAndUpdate_File job to initialize the state files. The job reads the product dimension table and the store dimension table, then creates and updates the respective surrogate key generator state files.
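To see why the generators must be told about existing keys, consider the following conceptual Python sketch. It does not reproduce the DataStage state file format, which is not described in this tutorial; the class and method names are hypothetical, and returning the smallest unused value is an assumption chosen only to keep the example short.

```python
class KeyState:
    """Toy stand-in for a surrogate key state file: it remembers used key values."""

    def __init__(self):
        self.used = set()

    def record_existing(self, keys):
        """Mark key values that already exist in a dimension table as used."""
        self.used.update(keys)

    def next_key(self):
        """Return a key value that has not been used yet and mark it as used.

        Assumption: the smallest unused positive integer is returned.
        """
        candidate = 1
        while candidate in self.used:
            candidate += 1
        self.used.add(candidate)
        return candidate
```

If the state is seeded with the product dimension keys 1, 2, and 10 from Table 2, none of those values is ever handed out again. Without the seeding step, the first generated key would collide with an existing row.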

Building the Slowly Changing Dimensions job

In this step you build a job that reads the SaleDetail.dat source file, updates the product and store dimensions, and inserts records into the fact table. For reference, a completed version of the job named Demo\DataStage\Slowly Changing Dimensions\SCD_All is included in the download.

Draw the job design as illustrated below in Figure 3.

Figure 3. Job design

The primary flow of records is from left to right in the job design. The source records are read from SaleDetail, passed to the first SCD stage to process the Product dimension, then passed to the next SCD stage to process the store dimension, and finally to the fact table. No records are added or removed on this flow of data. Every record read from the source is inserted into the fact table. As part of the processing in the SCD stages, the surrogate key values that are associated with the source records are obtained from the dimension table and added to the data being passed to the fact table.

Looking at the job design from top to bottom, the product and store dimension tables are reference sources to the SCD stages. These tables are used to initialize the lookup cache. Only records that are considered current are stored in the lookup cache. Any historical records in the dimension tables are automatically filtered out during initial processing. The SCD stage uses the data values from the primary input link to perform a lookup in the cache and check for changes. If any changes are required to the dimension table, they are written to the secondary output link of the SCD stage, which is called the dimension update link. Target database stages are connected to the dimension update link to apply the changes to the actual dimension table in the database.
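The filtering of historical records can be pictured with a short Python sketch. This is only an approximation of the idea, not the stage's implementation: the function name and parameters are hypothetical, and the "Y" flag value is taken from the Curr column in Table 2.

```python
def build_lookup_cache(reference_rows, key_column, current_column=None, current_value="Y"):
    """Index dimension rows by business key, keeping only current rows.

    Pass current_column=None for a dimension that does not track history,
    in which case every row is treated as current.
    """
    cache = {}
    for row in reference_rows:
        if current_column is not None and row[current_column] != current_value:
            continue  # historical row: left out of the cache
        cache[row[key_column]] = row
    return cache
```

Because only one row per business key is current at any time, a plain dictionary keyed on the business key is enough for this sketch.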

Each record on the primary input link of the SCD stage will go out on the primary output link, and may produce zero, one, or two records on the dimension update link. The number of records produced depends on what, if any, action needs to be taken on the dimension table.

• Zero records
Unchanged records require no action to the dimension table, so no records are written on the dimension update link.

• One record
New records and overwriting updates (Type1) require a one row change to the dimension table. The change is either an insert or an update. One record is written on the dimension update link to reflect these types of changes.

• Two records
Changed records that are tracking history (Type2) require a two row change to the dimension table. The existing record must be updated to reflect that it is no longer current, and a new record must be inserted for the new set of values. Two records are written to the dimension update link to reflect these changes.
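The three cases can be sketched in Python as a single function that returns the records for the dimension update link. This is an illustration under stated assumptions, not the stage's actual logic: the function name is hypothetical, surrogate key generation and date handling are left out, and expiring a row is reduced to setting an assumed Curr column to "N".

```python
def dimension_updates(source, match, type1_cols, type2_cols):
    """Return the dimension update records produced by one source row.

    source: dimension column values derived from the source row
    match:  the current dimension row found by the lookup, or None
    Columns listed in neither type1_cols nor type2_cols are not compared.
    """
    if match is None:
        # New record: one insert.
        return [("insert", dict(source))]
    if any(source[col] != match[col] for col in type2_cols):
        # History-tracking change: expire the matched row, then insert the new values.
        return [("update", dict(match, Curr="N")), ("insert", dict(source))]
    if any(source[col] != match[col] for col in type1_cols):
        # Overwriting change: one update carrying the new Type1 values.
        changed = {col: source[col] for col in type1_cols}
        return [("update", dict(match, **changed))]
    # Unchanged: nothing is written.
    return []
```

Assuming, for illustration, that Descr is a Type2 column of the product dimension and Mgr is a Type1 column of the store dimension, the source row for SKU 4444444444 (fork instead of spoon) would yield two records, and the row for store A1114 (manager Madison instead of Adams) would yield one.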

Configuring the stages

Now that you have built the high level job design, you are ready to perform the next set of steps in which you:

• Configure the individual stages to access the source data.

• Process the dimension tables.

• Update the fact table.

Configure the primary source stage

The source stage must be configured to read the SaleDetail.dat file. Complete the following steps to configure the SaleDetail sequential file stage:

1. On the Output|Properties tab, set the File property to C:\IBM\Demo\DataStage\SCD\SaleDetail.dat.

2. On the Output|Format tab, add the Record delimiter string property and set it to DOS Format.

3. On the Output|Format tab, remove the Final delimiter property.

4. Load the Demo\DataStage\Slowly Changing Dimensions\TableDefs\SaleDetail table definition onto the output link.

Figure 4. Source stage

The source stage should now be configured to read the SaleDetail.dat file. Use View Data to confirm that the data is being read from the file properly.

Configure the stages to process the Product dimension

Three stages are used to process the Product dimension. Reading the job design from top to bottom:

• The first stage specifies how to read the data from the dimension table.

• The SCD stage determines what changes need to be made to the dimension table and those changes are written to the dimension update link.

• The dimension update link is connected to the dimension update target stage, which specifies how to update the actual database table with the data produced by the SCD stage.

Configure the Product dimension source stage

Complete the following steps to configure the Product dimension DB2 Enterprise stage:

1. On the Output|Properties tab, set the Read method property to Table.

2. On the Output|Properties tab, set the Table property to SCD.ProdDim.

3. On the Output|Properties tab, set the Use Default Database and Use Default Server properties to False.

4. On the Output|Properties tab, set the Database property to SCDDemo.

5. On the Output|Properties tab, set the Server property to DB2.

6. Load the Demo\DataStage\Slowly Changing Dimensions\TableDefs\SCD.ProdDim table definition onto the output link.

Figure 5. Product dimension source

The stage should now be configured to read the SCD.ProdDim table. Use View Data to confirm that the data is being read from the database properly.

Configure the Product dimension SCD stage

The Fast Path control of the SCD stage editor lets you navigate directly to the tabs that require input in order to complete the stage configuration. The control is in the lower left corner of the editor. Use the arrow buttons to move forward or backward through the tabs.

Open the product dimension SCD stage editor and use the Fast Path control to set the properties as shown:

Fast Path control

The SCD stage has two input links and two output links. This results in a high number of property link-tab combinations. Use the Fast Path control to move directly to the tabs that are required to configure the stage.

• Fast Path page 1: Setting the output link
By default, the first output link connected to the stage is used as the primary output link. Look at the link name that is displayed in the Select output link property. Use the drop down list to select the output link that is leading to the next SCD stage. This is the primary output of the stage. The other link automatically becomes the dimension update link.

Figure 6. Product dimension SCD stage, Fast Path page 1

• Fast Path page 2: Define the lookup condition and purpose codes
The first task on this page is to define what the various columns of the dimension table are used for. This information is used in a number of ways in the SCD processing. The choices for purpose codes are:

• Surrogate Key: This column is the primary key of the dimension table and is populated with a surrogate key value.

• Business Key: This column is the identifier of the business objects that the dimension table is representing, but is not the primary key of the dimension table. This column is typically used as a lookup column and corresponds to a key or some other field of the source data that identifies the associated business object. The lookup is used to find the dimension table row that corresponds to a source data row.

• Type 2: Check this column for a change in value. If the value has changed, perform a history tracking change to the dimension table.

• Type 1: Check this column for a change in value. If the value has changed, perform an overwriting change to the dimension table.

• Current Indicator: This column is used as a flag to indicate whether it is the most current record for a particular business key.

• Effective Date: This column is used to specify when a record first became the most current record, that is, when it became the active record.

• Expiration Date: This column is used to specify the ending date of when a record was the active record. For currently active records, this value is typically a future date or NULL.

• SK Chain: This column is used to store the surrogate key of the previous or next record in the history for a particular business key.

• (blank): This column is not used for anything with respect to SCD processing. Data for this field is inserted into the table when a new row is inserted, but this column will not be checked for changes against the source data.

Set purpose codes for the columns as shown below in Figure 7. Because this dimension table is tracking history, it contains columns to track whether a row is current and the date range for when it was current.

Click on the ProdSKU source field and drag it to the SKU dimension column to create the lookup condition.

Figure 7. Product dimension SCD stage, Fast Path page 2

Although this tab looks similar to a mapping tab, it is actually defining the lookup keys from the source record to the dimension record. Any source column can be associated with any one dimension column. This creates an equality lookup condition between those columns. If more than one source column is associated with a dimension column, then those equality conditions are AND'ed together. In this manner, multi-column lookup keys can be used.

• Fast Path page 3: Configuring the surrogate key generator
Surrogate key generation capabilities are integrated into the SCD stage. This tab specifies how surrogate keys are generated for this stage. Surrogate key generation can use DataStage's file based surrogate generation, or use DB2 or Oracle database sequence object based generation. This tutorial uses the file based method.

Set the Source name property to C:\IBM\Demo\DataStage\SCD\SKG\ProdDim as shown in Figure 8. This is the surrogate key state file you created by running the Demo\DataStage\Slowly Changing Dimensions\Surrogate Key Generation\CreateAndUpdate_File job. Leave the defaults for the other properties unchanged.

Figure 8. Product dimension SCD stage, Fast Path page 3

• Fast Path page 4: Defining the slowly changing dimension behavior and derivations
The DimUpdate tab is used to define several critical elements of SCD processing. The Derivation column is used to specify how to map elements of a source row to elements of the dimension table. The Expire column is used to specify what values need to change if an existing record needs to be expired. Expire expressions are only enabled when there are Type2 columns specified, and are only available for Current Indicator and Expiration Date columns.

If no matching record is found when the lookup is performed, the derivation expressions are applied and a record is written on the dimension update link to indicate a new record needs to be added to the dimension table. If a matching record is found, the derivation expressions are applied to the source columns, and then the results are compared to the corresponding columns of the dimension table. Columns specified as Type2 are compared first. If there is a change, two records are written on the dimension update link. The first record is an update record, to expire the matched row. The Expire expressions are used to calculate the values for the update row. The second record is a new record that contains all of the new values for all columns. If no Type2 columns have changed, the Type1 columns are compared. If there are any changes, one record is written on the dimension update link that indicates an update to the dimension table. The derivation expressions are used to calculate the values for the update record.

Set the Derivation expressions and the Expire expressions as shown below in Figure 9.

Figure 9. Product dimension SCD stage, Fast Path page 4

Note that you are specifying these properties on the dimension update link. The output columns for this link were automatically propagated with their purpose codes from the dimension input link. The SCD stage only does this when the set of columns on the dimension update link is empty. It is possible to load a set of columns directly on the dimension update link; however, they must exactly match those specified on the dimension input link.

• Fast Path page 5: Selecting the columns for Output Link
The Output Map tab is used to define what columns will leave this stage on the primary output link. This tab operates much like the Mapping tab of other stages. The only difference is that you can select columns from the primary input link and columns from the reference link to output. The columns coming from the primary source have the same values they entered the stage with. The columns coming from the reference link represent the values from the dimension table that correspond to the source row. Note that because the SCD processing has been done by the stage, every record from the primary source data will have a corresponding record in the dimension.

Select the columns for output as shown below in Figure 10. The output link is initially empty. Create and map the output columns by dragging and dropping from the source to the target. Because the product dimension has now been processed, the source columns that contain those attributes are no longer needed. Instead, the primary key associated with the source row is appended because that is the value that is required to be inserted into the fact table.

Figure 10. Product dimension SCD stage, Fast Path page 5

The stage is now configured to perform the dimension maintenance on the Product dimension table.
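The lookup condition and the output mapping configured on Fast Path pages 2 and 5 can be summarized with a small Python sketch. It is a simplification for illustration only: the function name is hypothetical, the set of columns kept on the output link is an assumption (the actual selection is shown in Figure 10), and the lookup is a linear scan rather than a cache.

```python
def scd_primary_output(source_row, dim_rows, lookup_pairs, keep_cols, sk_col):
    """Build the primary output record for one source row.

    lookup_pairs: (source column, dimension column) pairs; the equality
                  conditions are AND'ed together, so several pairs form a
                  multi-column lookup key.
    keep_cols:    source columns passed through unchanged.
    sk_col:       dimension surrogate key column appended to the output.
    """
    for dim_row in dim_rows:
        if all(source_row[src] == dim_row[dim] for src, dim in lookup_pairs):
            out = {col: source_row[col] for col in keep_cols}
            out[sk_col] = dim_row[sk_col]
            return out
    # After dimension maintenance every source row should find a match.
    raise LookupError("no matching dimension row")
```

For the product dimension, the single pair ("ProdSKU", "SKU") plays the role of the lookup condition, and the product attribute columns are dropped from the output in favor of ProdSK.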

Configure the Product dimension target stage

This stage processes the dimension update link records produced by the product dimension SCD stage to update the actual dimension table in the database. Because incoming records represent both inserts and updates to the table, an Upsert write method must be used. Auto-generated update and insert statements take the purpose codes specified in the SCD stage into account to generate the correct update statement for this usage.

Complete the following steps to configure the Product dimension update DB2 Enterprise stage:

1. On the Input|Properties tab, set the Write Method property to Upsert.

2. On the Input|Properties tab, set the Upsert Mode property to Auto-generated Update and Insert.

3. On the Input|Properties tab, set the Table property to SCD.ProdDim.

4. On the Input|Properties tab, set the Use Default Database and Use Default Server properties to False.

5. On the Input|Properties tab, set the Database property to SCDDemo.

6. On the Input|Properties tab, set the Server property to DB2.

Figure 11. Product dimension target

The stage is now configured to write to the SCD.ProdDim dimension table.
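The idea behind an upsert can be sketched in Python with the standard sqlite3 module standing in for DB2. This is not the SQL that DataStage generates, which is not shown in this tutorial; the function name is hypothetical, and trying the update before the insert is an assumption made for the example.

```python
import sqlite3


def upsert(conn, table, key_col, record):
    """Apply one dimension update record: UPDATE by key, INSERT if no row matched.

    table and key_col are trusted identifiers here; values are bound as parameters.
    """
    cols = [col for col in record if col != key_col]
    set_clause = ", ".join(f"{col} = ?" for col in cols)
    cursor = conn.execute(
        f"UPDATE {table} SET {set_clause} WHERE {key_col} = ?",
        [record[col] for col in cols] + [record[key_col]],
    )
    if cursor.rowcount == 0:
        names = ", ".join(record)
        marks = ", ".join("?" for _ in record)
        conn.execute(
            f"INSERT INTO {table} ({names}) VALUES ({marks})",
            list(record.values()),
        )
```

With the surrogate key as the key column, the record that expires an existing row updates it in place, while the record carrying a newly generated surrogate key finds no row to update and is inserted.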

Configure the stages to process the Store dimension

Configure the Store dimension source stage

Complete the following steps to configure the Store dimension DB2 Enterprise stage:

1. On the Output|Properties tab, set the Read Method property to Table.

2. On the Output|Properties tab, set the Table property to SCD.StoreDim.

3. On the Output|Properties tab, set the Use Default Database and Use Default Server properties to False.

4. On the Output|Properties tab, set the Database property to SCDDemo.

5. On the Output|Properties tab, set the Server property to DB2.

6. Load the Demo\DataStage\Slowly Changing Dimensions\TableDefs\SCD.StoreDim table definition onto the output link.

Figure 12. Store dimension source stage

The stage should now be configured to read the SCD.StoreDim table. Use View Data to confirm that the data is being read from the database properly.

Configure the Store dimension SCD stage

Open the store dimension SCD stage editor and use the Fast Path control to set the properties as shown:

• Fast Path page 1: Setting the Output Link
Use the Select output link drop down list to select the link leading to the fact table. This is the primary output of the stage. The other link automatically becomes the dimension update link.

Figure 13. Store dimension SCD stage, Fast Path page 1

• Fast Path page 2: Define the lookup condition and purpose codes
  Set purpose codes for the columns as shown below in Figure 14. Because this dimension table is not tracking history, it does not contain columns to track whether a row is current or not. The Name column has a blank purpose code, which indicates that this column will not be checked for changes.

  Click on the StoreId source field and drag it to the dimension column Id to create the lookup condition.

Figure 14. Store dimension SCD stage, Fast Path page 2


• Fast Path page 3: Configuring the surrogate key generator
  Set the file path property to C:\IBM\Demo\DataStage\SCD\SKG\StoreDim as shown in Figure 15. This is the surrogate key state file you created by running the Demo\DataStage\Slowly Changing Dimensions\Surrogate Key Generation\CreateAndUpdate_File job. Leave the defaults for the other properties.

Figure 15. Store dimension SCD stage, Fast Path page 3

• Fast Path page 4: Defining the slowly changing dimension behavior and derivations
  Set the Derivation expressions as shown below in Figure 16. Because the Name column has no purpose code, the SCD stage does not check this column for changes when a matching dimension record is found on the lookup. Because there are no Type2 columns in this dimension table, the Expire expression is not enabled for any column.

Figure 16. Store dimension SCD stage, Fast Path page 4

• Fast Path page 5: Selecting the columns for Output Link
  Select the columns for output as shown below in Figure 17. Because the store dimension has now been processed, the source columns that contain those attributes are no longer needed. Instead, the surrogate key associated with the source row is appended, because that is the value that must be inserted into the fact table.

Figure 17. Store dimension SCD stage, Fast Path page 5


The stage is now configured to perform the dimension maintenance on the store dimension table.
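The behavior configured on the Fast Path pages can be pictured with a small Python sketch. It is only an illustration of the logic described above, not the stage's implementation. It assumes that Mgr is the Type1 column (Figure 14 defines the actual purpose codes), and process_store, StoreRow, and the next_sk counter are hypothetical names standing in for the lookup, the dimension row, and the surrogate key generator.

```python
from dataclasses import dataclass


@dataclass
class StoreRow:
    store_sk: int  # surrogate key
    id: str        # business key used in the lookup condition
    name: str      # blank purpose code: not checked for changes
    mgr: str       # assumed Type1 column: overwritten in place


def process_store(dim, next_sk, store_id, name, mgr):
    """Return (surrogate_key, dimension_update_or_None, next_sk).

    dim maps the business key to its dimension row. The surrogate key goes
    to the primary output link; the update (if any) goes to the dimension
    update link.
    """
    row = dim.get(store_id)
    if row is None:
        # No match on the lookup: create a new dimension row with a new key.
        row = StoreRow(next_sk, store_id, name, mgr)
        dim[store_id] = row
        return row.store_sk, row, next_sk + 1
    if row.mgr != mgr:
        # Type1 change: overwrite in place and keep the same surrogate key.
        row.mgr = mgr
        return row.store_sk, row, next_sk
    # Only unchecked columns differ (or nothing does): no dimension update.
    return row.store_sk, None, next_sk
```

In every case the caller receives a surrogate key for the source row, which is why no additional lookup is needed before writing to the fact table.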

Configure the Store dimension target stage

This stage processes the dimension update records produced by the store dimension SCD stage to update the actual dimension table in the database.

Complete the following steps to configure the Store dimension target DB2 Enterprise stage:

1. On the Input|Properties tab, set the Write method property to Upsert.

2. On the Input|Properties tab, set the Upsert Mode property to Auto-generated Update and Insert.

3. On the Input|Properties tab, set the Table property to SCD.StoreDim.

4. On the Input|Properties tab, set the Use Default Database and Use Default Server properties to False.

5. On the Input|Properties tab, set the Database property to SCDDemo.


6. On the Input|Properties tab, set the Server property to DB2.

Figure 18. Store dimension target stage

The stage is now configured to write to the SCD.StoreDim dimension table.
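As a rough picture of what an upsert means for the dimension update records, the following Python sketch runs against an in-memory SQLite table rather than DB2. It is not the SQL that the DB2 Enterprise stage generates. The key column (StoreSK), the unqualified table name, the helper name upsert_store, and the choice to try the update before the insert are all assumptions made for illustration.

```python
import sqlite3


def upsert_store(conn, store_sk, store_id, name, mgr):
    """Update the row with this surrogate key, or insert it if none exists."""
    cur = conn.execute(
        "UPDATE StoreDim SET Id = ?, Name = ?, Mgr = ? WHERE StoreSK = ?",
        (store_id, name, mgr, store_sk),
    )
    if cur.rowcount == 0:
        # Nothing was updated, so this is a new dimension record.
        conn.execute(
            "INSERT INTO StoreDim (StoreSK, Id, Name, Mgr) VALUES (?, ?, ?, ?)",
            (store_sk, store_id, name, mgr),
        )
        return "insert"
    return "update"


# SQLite stands in for the SCDDemo database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StoreDim "
    "(StoreSK INTEGER PRIMARY KEY, Id TEXT, Name TEXT, Mgr TEXT)"
)
```

A single link can therefore carry both new dimension records and Type1 updates, and the target stage does not need to be told which is which.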

Configure the Fact table target stage

This stage processes the source records that have been passed through the primary output links to update the actual fact table in the database. At this point, the original input source records have been processed so that the only columns on this link are the measurements (SaleAmt and SaleUnits) and the surrogate key values for the associated Product and Store.

Complete the following steps to configure the Fact table target DB2 Enterprise stage:

1. On the Input|Properties tab, set the Write Method property to Write.

2. On the Input|Properties tab, set the Write Mode property to Append.

3. On the Input|Properties tab, set the Table property to SCD.Facttbl.


4. On the Input|Properties tab, set the Use Default Database and Use Default Server properties to False.

5. On the Input|Properties tab, set the Database property to SCDDemo.

6. On the Input|Properties tab, set the Server property to DB2.

Figure 19. Fact table target stage

The stage is now configured to write to the SCD.Facttbl fact table.

Final steps

You have now completed the job design and are ready to compile. Click the Compile button to start the compile.

Note that the SCD stage processing makes use of the transform operator, so for the job to compile successfully, the C++ compiler settings for the project must be correct. The Resources section contains a link to an article in the information center for IBM Information Server with details on configuring your environment correctly for your C++ compiler; the Information Server Configuration Guide covers the same topic. If any compile errors occur, check your job and stages against the settings specified in the tutorial and make any necessary changes.


Section 5. Running the tutorial

You are now ready to compile and run the job.

Run the Results executable shortcut in the C:\IBM\Demo\DataStage\SCD directory to see the initial contents of the database tables. The Results shortcut displays the contents of the product dimension, the store dimension, and the fact table.

Run the job by clicking the Run button in the DataStage Designer.

After the job finishes successfully, run the Results shortcut again to see the changes that were made to the database tables.

Summary of changes to database tables

The contents of the database tables should now appear as follows:

• The product dimension has two update records and four new records. Two of the new records are new objects to the dimension table, and two existing records had Type2 changes, resulting in the two updates and two of the new records.

  Change              ProdSK  SKU         Brand     Descr          Curr  EffDate         ExpDate
  No Change           1       3333333333  Sunshine  Yellow Duckie  Y     2004-01-01      2099-12-31
  Expired (Type2)     2       4444444444  AAAAA     spoon          N     2004-01-01      {Today's Date}
  Expired (Type2)     10      5555555555  AAAAA     grass cutter   N     2004-01-01      {Today's Date}
  New Record          3       1111111111  Bob's     Red Box        Y     {Today's Date}  2099-12-31
  New Record          4       2222222222  Squeaky   Blue Chair     Y     {Today's Date}  2099-12-31
  New Record (Type2)  5       4444444444  AAAAA     fork           Y     {Today's Date}  2099-12-31
  New Record (Type2)  6       5555555555  Best      lawn mower     Y     {Today's Date}  2099-12-31


• The store dimension has one updated record and two new records. The updated record had a Type1 change, and the two new records are new objects to the dimension table.

  Change      StoreSK  ID     Name       Mgr
  No Change   1        A1113  Stuffy's   Jefferson
  Update      2        A1114  McStuff    Madison
  No Change   5        A1115  LilStuff   Monroe
  New Record  3        A1111  Stuff      Washington
  New Record  4        A1112  MoreStuff  Adams

• The fact table has five new records, one for each source record processed. The surrogate key values in this table correspond to the current records in the dimension tables.

  ProdSK  StoreSK  SaleAmt  SaleUnits
  3       3        436.14   13
  4       4        456.56   14
  1       1        203.38   7
  5       2        308.87   2
  6       5        24.40    11
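The Type2 rows in the product dimension summary follow an "expire the old row, add a new current row" pattern. The Python sketch below reproduces that pattern for SKU 4444444444 under simplifying assumptions: rows are plain dictionaries, the new surrogate key is passed in rather than generated, and apply_type2_change and FAR_FUTURE are hypothetical names, not part of DataStage.

```python
from datetime import date

# Open-ended expiry date, matching the 2099-12-31 value in the summary.
FAR_FUTURE = date(2099, 12, 31)


def apply_type2_change(rows, sku, brand, descr, new_sk, today):
    """Expire the current row for sku and append a new current row."""
    for row in rows:
        if row["SKU"] == sku and row["Curr"] == "Y":
            # The existing record is kept as history, not overwritten.
            row["Curr"] = "N"
            row["ExpDate"] = today
    rows.append({
        "ProdSK": new_sk, "SKU": sku, "Brand": brand, "Descr": descr,
        "Curr": "Y", "EffDate": today, "ExpDate": FAR_FUTURE,
    })
    return rows
```

Calling it with the existing "spoon" row, the new description "fork", and surrogate key 5 yields one expired row and one current row, which is the pair of Expired (Type2) and New Record (Type2) entries shown for that SKU. The fact record for that product then carries ProdSK 5, the key of the current row.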

The contents of the dimension tables have now changed. If you were to run the job again, what results would you expect to see? Hint: The dimension tables and the source file are now in sync.

This completes the Slowly Changing Dimensions tutorial. To reset the database tables to their original state, run the zReset executable shortcut.

Conclusion

You can use the Slowly Changing Dimension stage to greatly reduce the time you spend creating jobs for processing star schemas. In this tutorial you have learned how to configure the Slowly Changing Dimension stage to process history-tracking changes and in-place changes to dimension tables. You have also seen how you can reduce fact table processing by augmenting the source data with associated dimension table surrogate keys that eliminate the need for an additional lookup.


Downloads

Description                                       Name              Size  Download method
Supporting scripts and DS jobs for this tutorial  SCD_Tutorial.zip  16KB  HTTP

Information about download methods


Resources

Learn

• In the InfoSphere area on developerWorks, get the resources you need to advance your InfoSphere product skills.

• C++ compiler for job development topic in the information center for IBM Information Server.

• Browse the technology bookstore for books on these and other technical topics.

Get products and technologies

• Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

• Participate in the discussion forum for this content.

• Check out developerWorks blogs and get involved in the developerWorks community.

About the author

Brian Caufield is a software architect in IBM Silicon Valley Lab. Brian has been working in the DataStage development organization for 10 years and was involved in the design of the Slowly Changing Dimension stage.
