8/10/2019 BODS20_EN_COL91_A4
http://slidepdf.com/reader/full/bods20encol91a4
Material Number: 50102235
SAP Data Services:
Data Integrator Transforms
Learner’s Guide
BODS20
Copyright

© 2009 SAP® BusinessObjects™. All rights reserved. SAP BusinessObjects owns the following United States patents, which may cover products that are offered and licensed by SAP BusinessObjects and/or affiliated companies: 5,295,243; 5,339,390; 5,555,403; 5,590,250; 5,619,632; 5,632,009; 5,857,205; 5,880,742; 5,883,635; 6,085,202; 6,108,698; 6,247,008; 6,289,352; 6,300,957; 6,377,259; 6,490,593; 6,578,027; 6,581,068; 6,628,312; 6,654,761; 6,768,986; 6,772,409; 6,831,668; 6,882,998; 6,892,189; 6,901,555; 7,089,238; 7,107,266; 7,139,766; 7,178,099; 7,181,435; 7,181,440; 7,194,465; 7,222,130; 7,299,419; 7,320,122 and 7,356,779. SAP BusinessObjects and its logos, BusinessObjects, Crystal Reports®, Rapid Mart™, Data Insight™, Desktop Intelligence™, RapidMarts®, Watchlist Security™, Web Intelligence®, and Xcelsius® are trademarks or registered trademarks of Business Objects, an SAP company, and/or affiliated companies in the United States and/or other countries. SAP® is a registered trademark of SAP AG in Germany and/or other countries. All other names mentioned herein may be trademarks of their respective owners.
CONTENTS

About this Course
• Course introduction
• Course description
• Course audience
• Prerequisites
• Additional education
• Level, delivery, and duration
• Course success factors
• Course setup
• Course materials
• Learning process

Lesson 1: Capturing Changes in Data
• Lesson introduction
• Updating data over time
○ Explaining Slowly Changing Dimensions (SCD)
○ Updating changes to data
○ Explaining history preservation and surrogate keys
○ Comparing source-based and target-based CDC
• Using source-based CDC
○ Using source tables to identify changed data
○ Using CDC with timestamps
○ Managing overlaps
○ Activity: Using source-based CDC
• Using target-based CDC
○ Using target tables to identify changed data
○ Identifying history preserving transforms
○ Explaining the Table Comparison transform
○ Explaining the History Preserving transform
○ Explaining the Key Generation transform
○ Activity: Using target-based CDC
• Quiz: Capturing changes in data
• Lesson summary

Lesson 2: Using Data Integrator Transforms
• Lesson introduction
• Describing Data Integrator transforms
○ Defining Data Integrator transforms
• Using the Pivot transform
○ Explaining the Pivot transform
○ Activity: Using the Pivot transform
• Using the Hierarchy Flattening transform
○ Explaining the Hierarchy Flattening transform
○ Activity: Using the Hierarchy Flattening transform
• Describing performance optimization
○ Describing push-down operations
○ Viewing SQL generated by a data flow
○ Caching data
○ Slicing processes
• Using the Data Transfer transform
○ Explaining the Data Transfer transform
○ Activity: Using the Data Transfer transform
• Using the XML Pipeline transform
○ Explaining the XML Pipeline transform
○ Activity: Using the XML Pipeline transform
• Quiz: Using Data Integrator transforms
• Lesson summary

Answer Key
• Quiz: Capturing changes in data
• Quiz: Using Data Integrator transforms
AGENDA
SAP Data Services: Data Integrator Transforms

Introductions, Course Overview (30 minutes)

Lesson 1: Capturing Changes in Data (3 hours)
❒ Updating data over time
❒ Using source-based CDC
❒ Using target-based CDC

Lesson 2: Using Data Integrator Transforms (3 hours)
❒ Describing Data Integrator transforms
❒ Using the Pivot transform
❒ Using the Hierarchy Flattening transform
❒ Describing performance optimization
❒ Using the Data Transfer transform
❒ Using the XML Pipeline transform
About this Course
Course introduction
This section explains the conventions used in the course and in this training guide.
Course description
BusinessObjects™ Data Integrator XI 3.0/3.1 enables you to integrate disparate data sources to
deliver more timely and accurate data that end users in an organization can trust. In this
three-day course, you will learn about creating, executing, and troubleshooting batch jobs,
using functions, scripts and transforms to change the structure and formatting of data, handling
errors, and capturing changes in data.
As a business benefit, by being able to create efficient data integration projects, you can use
the transformed data to help improve operational and supply chain efficiencies, enhance
customer relationships, create new revenue opportunities, and optimize return on investment
from enterprise applications.
Course audience
The target audience for this course is individuals responsible for implementing, administering,
and managing data integration projects.
Prerequisites
To be successful, learners who attend this course should have experience with the following:

• Knowledge of data warehousing and ETL concepts
• Experience with MySQL and SQL language
• Experience using functions, elementary procedural programming, and flow-of-control statements such as If-Then-Else and While Loop statements

It is also recommended that you review the following articles, which can be found at http://www.rkimball.com/html/articles.html:

• Data Warehouse Fundamentals: TCO Starts with the End User and Fact Tables and Dimension Tables
• Data Warehouse Architecture and Modeling: There Are No Guarantees
• Advanced Dimension Table Topics: Surrogate Keys, It's Time for Time, and Slowly Changing Dimensions
• Industry- and Application-Specific Issues: Think Globally, Act Locally
• Data Staging and Data Quality: Dealing with Dirty Data
Additional education
To increase your skill level and knowledge of Data Services, the following courses are
recommended:
• BusinessObjects Data Quality XI 3.0/3.1: Core Concepts
• BusinessObjects Data Integrator XI R2 Accelerated: Advanced Workshop
Level, delivery, and duration
This instructor-led core offering is a three-day course.
Course success factors
Your learning experience will be enhanced by:
• Activities that build on the life experiences of the learner
• Discussion that connects the training to real working environments
• Learners and instructor working as a team
• Active participation by all learners
Course setup

Refer to the setup guide for details on hardware, software, and course-specific requirements.
Course materials
The materials included with the course are:
• Name card
• Learner’s Guide
The Learner’s Guide contains an agenda, learner materials, and practice activities.
The Learner’s Guide is designed to assist students who attend the classroom-based course and outlines what learners can expect to achieve by participating in this course.
• Evaluation form
At the conclusion of this course, you will receive an electronic feedback form as part of our
evaluation process. Provide feedback on the course content, instructor, and facility. Your
comments will assist us to improve future courses.
Additional resources include:
• Sample files
The sample files can include required files for the course activities and/or supplemental
content to the training guide.
• Online Help
Retrieve information and find answers to questions using the online Help and/or user’s
guide that are included with the product.
Learning process
Learning is an interactive process between the learners and the instructor. By facilitating a
cooperative environment, the instructor guides the learners through the learning framework.
Introduction
Why am I here? What’s in it for me?
The learners will be clear about what they are getting out of each lesson.
Objectives
How do I achieve the outcome?
The learners will assimilate new concepts and how to apply the ideas presented in the lesson.
This step sets the groundwork for practice.
Practice
How do I do it?
The learners will demonstrate their knowledge as well as their hands-on skills through theactivities.
Review
How did I do?
The learners will have an opportunity to review what they have learned during the lesson.
Review reinforces why it is important to learn particular concepts or skills.
Summary
Where have I been and where am I going?
The summary acts as a recap of the learning objectives and as a transition to the next section.
Lesson 1
Capturing Changes in Data
Lesson introduction
The design of your data warehouse must take into account how you are going to handle changes
in your target system when the respective data in your source system changes. Data Integrator
transforms provide you with a mechanism to do this.
After completing this lesson, you will be able to:
• Update data over time
• Use source-based CDC
• Use target-based CDC
Updating data over time
Introduction
Data Integrator transforms provide support for updating changing data in your data warehouse.
After completing this unit, you will be able to:
• Describe the options for updating changes to data
• Explain the purpose of Changed Data Capture (CDC)
• Explain the role of surrogate keys in managing changes to data
• Define the differences between source-based and target-based CDC
Explaining Slowly Changing Dimensions (SCD)
SCDs are dimensions that have data that changes over time. The following methods of handling SCDs are available:

Type 1: No history preservation
• Natural consequence of normalization.

Type 2: Unlimited history preservation and new rows
• New rows generated for significant changes.
• Requires use of a unique key. The key relates to facts/time.
• Optional Effective_Date field.

Type 3: Limited history preservation
• Two states of data are preserved: current and old.
• New fields are generated to store history data.
• Requires an Effective_Date field.
Because SCD Type 2 resolves most of the issues related to slowly changing dimensions, it is
explored last.
SCD Type 1
For an SCD Type 1 change, you find and update the appropriate attributes on a specific
dimensional record. For example, to update a record in the SALES_PERSON_DIMENSION
table to show a change to an individual’s SALES_PERSON_NAME field, you simply update
one record in the SALES_PERSON_DIMENSION table. This action would update or correct
that record for all fact records across time. In a dimensional model, facts have no meaning until
you link them with their dimensions. If you change a dimensional attribute without
appropriately accounting for the time dimension, the change becomes global across all fact
records.
This is the data before the change:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
This is the same table after the salesperson’s name has been changed:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Smith, John B Northwest
However, suppose a salesperson transfers to a new sales team. Updating the salesperson’s
dimensional record would update all previous facts so that the salesperson would appear to
have always belonged to the new sales team. This may cause issues in terms of reporting sales
numbers for both teams. If you want to preserve an accurate history of who was on which sales
team, Type 1 is not appropriate.
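The Type 1 overwrite described above amounts to a plain in-place UPDATE. As a minimal sketch (not the Data Services implementation), the snippet below uses Python's sqlite3 module with an in-memory stand-in for the SALES_PERSON_DIMENSION table; the table and column names simply mirror the example:

```python
import sqlite3

# In-memory stand-in for the SALES_PERSON_DIMENSION table from the example.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales_person_dimension (
    sales_person_key INTEGER PRIMARY KEY,
    sales_person_id  TEXT,
    name             TEXT,
    sales_team       TEXT)""")
con.execute("INSERT INTO sales_person_dimension "
            "VALUES (15, '000120', 'Doe, John B', 'Northwest')")

# SCD Type 1: overwrite the attribute in place -- no history is kept,
# so the change applies to all fact records across time.
con.execute("""UPDATE sales_person_dimension
               SET name = 'Smith, John B'
               WHERE sales_person_id = '000120'""")

print(con.execute("SELECT * FROM sales_person_dimension").fetchall())
# -> [(15, '000120', 'Smith, John B', 'Northwest')]
```

Note that the update leaves a single row behind: there is no way to tell the name ever differed, which is exactly why Type 1 cannot preserve history.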
SCD Type 3
To implement a Type 3 change, you change the dimension structure so that it renames the existing attribute and adds two attributes, one to record the new value and one to record the date of the change.
A Type 3 implementation has three disadvantages:
• You can preserve only one change per attribute, such as old and new or first and last.
• Each Type 3 change requires a minimum of one additional field per attribute and another
additional field if you want to record the date of the change.
• Although the dimension’s structure contains all the data needed, the SQL code required to
extract the information can be complex. Extracting a specific value is not difficult, but if you
want to obtain a value for a specific point in time or multiple attributes with separate old
and new values, the SQL statements become long and have multiple conditions.
In summary, SCD Type 3 can store a change in data, but it can neither accommodate multiple changes nor adequately serve the need for summary reporting.
This is the data before the change:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
This is the same table after the new dimensions have been added and the salesperson’s sales
team has been changed:
SALES_PERSON_ID NAME OLD_TEAM NEW_TEAM EFF_TO_DATE

00120 Doe, John B Northwest Northeast Oct_31_2004

SCD Type 2
With a Type 2 change, you do not need to make structural changes to the
SALES_PERSON_DIMENSION table. Instead, you add a record.
This is the data before the change:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
After you implement the Type 2 change, two records appear, as in the following table:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
133 000120 Doe, John B Southeast
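The Type 2 pattern above can be sketched in the same sqlite3 style (table names again mirror the example and are not part of Data Services): instead of updating in place, the job keeps the old row and adds a new one under a fresh surrogate key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales_person_dimension (
    sales_person_key INTEGER PRIMARY KEY,
    sales_person_id  TEXT,
    name             TEXT,
    sales_team       TEXT)""")
con.execute("INSERT INTO sales_person_dimension "
            "VALUES (15, '000120', 'Doe, John B', 'Northwest')")

# SCD Type 2: keep the old row and INSERT a new one with a fresh surrogate
# key. The value 133 mirrors the example table above.
con.execute("INSERT INTO sales_person_dimension "
            "VALUES (133, '000120', 'Doe, John B', 'Southeast')")

rows = con.execute("""SELECT sales_person_key, sales_team
                      FROM sales_person_dimension
                      WHERE sales_person_id = '000120'
                      ORDER BY sales_person_key""").fetchall()
print(rows)  # -> [(15, 'Northwest'), (133, 'Southeast')]
```

Both rows survive, so facts recorded before the move still join to the Northwest row while new facts join to the Southeast row.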
Updating changes to data
When you have a large amount of data to update regularly and a small amount of system down
time for scheduled maintenance on a data warehouse, you must choose the most appropriate
method for updating your data over time, also known as “delta load”. You can choose to do a
full refresh of your data or you can choose to extract only new or modified data and update
the target system:
• Full refresh: Full refresh is easy to implement and easy to manage. This method ensures that no data is overlooked or left out due to technical or programming errors. For an environment with a manageable amount of source data, full refresh is an easy method you can use to perform a delta load to a target system.
• Capturing only changes: After an initial load is complete, you can choose to extract only new or modified data and update the target system. Identifying and loading only changed data is called Changed Data Capture (CDC). CDC is recommended for large tables. If the tables that you are working with are small, you may want to consider reloading the entire table instead. Using CDC instead of doing a full refresh has two benefits:
○ It improves performance, because the job takes less time to process with less data to extract, transform, and load.
○ It allows the target system to track change history, so that data can be correctly analyzed over time. For example, if a sales person is assigned a new sales region, simply updating
the customer record to reflect the new region negatively affects any analysis by region
over time because the purchases made by that customer before the move are attributed
to the new region.
Explaining history preservation and surrogate keys
History preservation allows the data warehouse or data mart to maintain the history of data
in dimension tables so you can analyze it over time.
For example, if a customer moves from one sales region to another, simply updating the
customer record to reflect the new region would give you misleading results in an analysis by
region over time, because all purchases made by the customer before the move would incorrectly
be attributed to the new region.
The solution to this involves introducing a new record for the same customer that reflects the new sales region so that you can preserve the previous record. In this way, accurate reporting
is available for both sales regions. To support this, Data Services is set up to treat all changes
to records as INSERT rows by default.
However, you also need to manage the primary key constraint issues in your target tables that
arise when you have more than one record in your dimension tables for a single entity, such
as a customer or an employee.
For example, with your sales records, the Sales Rep ID is usually the primary key and is used
to link that record to all of the rep's sales orders. If you try to add a new record with the same
primary key, it will throw an exception. On the other hand, if you assign a new Sales Rep ID
to the new record for that rep, you will compromise your ability to report accurately on the rep's total sales.
To address this issue, you will create a surrogate key, which is a new column in the target table
that becomes the new primary key for the records. At the same time, you will change the
properties of the former primary key so that it is simply a data column.
When a new record is inserted for the same rep, a unique surrogate key is assigned allowing
you to continue to use the Sales Rep ID to maintain the link to the rep’s orders.
You can create surrogate keys either by using the gen_row_num or key_generation functions
in the Query transform to create a new output column that automatically increments whenever
a new record is inserted, or by using the Key Generation transform, which serves the same
purpose.
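As an illustration of the key-generation idea (not the actual key_generation implementation), a helper can read the current maximum surrogate key from the target table and hand out the next value, incrementing by 1 for each new record. The function and column names below are hypothetical:

```python
import sqlite3

def next_surrogate_key(con, table, key_column, increment=1):
    """Hypothetical stand-in for key_generation: return max(key) + increment,
    starting from 1 when the target table is empty."""
    row = con.execute(f"SELECT MAX({key_column}) FROM {table}").fetchone()
    last = row[0] if row[0] is not None else 0
    return last + increment

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee_dim "
            "(emp_surr_key INTEGER PRIMARY KEY, sales_rep_id TEXT)")
con.execute("INSERT INTO employee_dim VALUES (1, '000120')")

# A second record for the same Sales Rep ID gets a new surrogate key,
# so the former primary key can remain a plain data column.
key = next_surrogate_key(con, "employee_dim", "emp_surr_key")
con.execute("INSERT INTO employee_dim VALUES (?, '000120')", (key,))
print(key)  # -> 2
```

Because the surrogate key is unique while the Sales Rep ID repeats, the link from the rep's orders to the dimension is preserved without violating the primary key constraint.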
Comparing source-based and target-based CDC
Setting up a full CDC solution within Data Services may not be required. Many databases now
have CDC support built into them, such as Oracle, SQL Server, and DB2. Alternatively, you
could combine surrogate keys with the Map Operation transform to change all UPDATE row
types to INSERT row types to capture changes.
However, if you do want to set up a full CDC solution, there are two general incremental CDC
methods to choose from: source-based and target-based CDC.
Source-based CDC evaluates the source tables to determine what has changed and only extracts
changed rows to load into the target tables.
Target-based CDC extracts all the data from the source, compares the source and target rows
using table comparison, and then loads only the changed rows into the target.
Source-based CDC is almost always preferable to target-based CDC for performance reasons. However, some source systems do not provide enough information to make use of the
source-based CDC techniques. You will usually use a combination of the two techniques.
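The target-based comparison just described can be sketched in a few lines of Python. This is a minimal illustration of the diffing idea, not the Table Comparison transform itself; the field names are illustrative:

```python
def table_comparison(source_rows, target_rows, key="id"):
    """Diff source rows against target rows by primary key, keeping only
    the rows that would be inserted or updated in the target."""
    target_by_key = {row[key]: row for row in target_rows}
    changes = []
    for row in source_rows:
        existing = target_by_key.get(row[key])
        if existing is None:
            changes.append(("INSERT", row))
        elif existing != row:
            changes.append(("UPDATE", row))
        # Unchanged rows are discarded, so only the delta reaches the target.
    return changes

source = [{"id": 1, "region": "Northwest"}, {"id": 2, "region": "Southeast"}]
target = [{"id": 1, "region": "Northeast"}]
print(table_comparison(source, target))
# -> [('UPDATE', {'id': 1, 'region': 'Northwest'}),
#     ('INSERT', {'id': 2, 'region': 'Southeast'})]
```

Note that every source row must be read to compute the delta, which is why target-based CDC is usually slower than source-based CDC.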
Using source-based CDC
Introduction
Source-based CDC is the preferred method because it improves performance by extracting the
fewest rows.
After completing this unit, you will be able to:
• Define the methods of performing source-based CDC
• Explain how to use timestamps in source-based CDC
• Manage issues related to using timestamps for source-based CDC
Using source tables to identify changed data
Source-based CDC, sometimes also referred to as incremental extraction, extracts only the
changed rows from the source. To use source-based CDC, your source data must have some
indication of the change. There are two methods:
• Timestamps: You can use the timestamps in your source data to determine what rows have
been added or changed since the last time data was extracted from the source. To support
this type of source-based CDC, your database tables must have at least an update timestamp;
it is preferable to have a create timestamp as well.
• Change logs: You can also use the information captured by the RDBMS in the log files for the audit trail to determine what data has been changed.
Log-based data is more complex and is outside the scope of this course. For more information
on using logs for CDC, see “Techniques for Capturing Data”, in the Data Services Designer Guide.
Using CDC with timestamps
Timestamp-based CDC is an ideal solution to track changes if:
• There are date and time fields in the tables being updated.
• You are updating a large table that has a small percentage of changes between extracts and
an index on the date and time fields.
• You are not concerned about capturing intermediate results of each transaction between extracts (for example, if a customer changes regions twice in the same day).
It is not recommended that you use timestamp-based CDC if:
• You have a large table, a large percentage of which changes between extracts, and there is no index on the timestamps.
• You need to capture physical row deletes.
• You need to capture multiple events occurring on the same row between extracts.
Some systems have timestamps with dates and times, some with just the dates, and some with
monotonically-generated increasing numbers. You can treat dates and generated numbers in
the same manner. It is important to note that for timestamps based on real time, time zones
can become important. If you keep track of timestamps using the nomenclature of the source
system (that is, using the source time or source-generated number), you can treat both temporal
(specific time) and logical (time relative to another time or event) timestamps in the same way.
The basic technique for using timestamps is to add a column to your source and target tables
that tracks the timestamps of rows loaded in a job. When the job executes, this column is updated
along with the rest of the data. The next job then reads the latest timestamp from the target
table and selects only the rows in the source table for which the timestamp is later.
This example illustrates the technique. Assume that the last load occurred at 2:00 PM on January
1, 2008. At that time, the source table had only one row (key=1) with a timestamp earlier than
the previous load. Data Services loads this row into the target table with the original timestamp
of 1:10 PM on January 1, 2008. After 2:00 PM, Data Services adds more rows to the source table.
At 3:00 PM on January 1, 2008, the job runs again. The job:
1. Reads the Last_Update field from the target table (01/01/2008 01:10 PM).
2. Selects rows from the source table that have timestamps that are later than the value of
Last_Update. The SQL command to select these rows is:
SELECT * FROM Source WHERE Last_Update > '01/01/2008 01:10 pm'
This operation returns the second and third rows (key=2 and key=3).
3. Loads these new rows into the target table.
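The three steps above can be sketched end to end with Python's sqlite3 module. This is a simplified stand-in for the Data Services job, with timestamps stored as ISO-formatted strings so they compare correctly; the table layout simply mirrors the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (key INTEGER, last_update TEXT)")
con.execute("CREATE TABLE target (key INTEGER, last_update TEXT)")

# State after the 2:00 PM load: one row in the target, three in the source.
con.execute("INSERT INTO source VALUES "
            "(1, '2008-01-01 13:10'), "
            "(2, '2008-01-01 14:12'), "
            "(3, '2008-01-01 14:35')")
con.execute("INSERT INTO target VALUES (1, '2008-01-01 13:10')")

# Step 1: read the latest timestamp already loaded into the target.
last_update = con.execute("SELECT MAX(last_update) FROM target").fetchone()[0]

# Step 2: select only the source rows with a later timestamp.
new_rows = con.execute("SELECT * FROM source WHERE last_update > ? "
                       "ORDER BY last_update", (last_update,)).fetchall()

# Step 3: load the delta into the target.
con.executemany("INSERT INTO target VALUES (?, ?)", new_rows)
print([r[0] for r in new_rows])  # -> [2, 3]
```

Only the two rows added after the previous load (key=2 and key=3) are extracted and loaded, which is the whole point of the timestamp technique.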
For timestamped CDC, you must create a work flow that contains the following:
• A script that reads the target table and sets the value of a global variable to the latest
timestamp.
• A data flow that uses the global variable in a WHERE clause to filter the data.
The data flow contains a source table, a query, and a target table. The query extracts only those
rows that have timestamps later than the last update.
To set up a timestamp-based CDC delta job
1. In the Variables and Parameters dialog box, add a global variable called $G_Last_Update with a datatype of datetime to your job.
The purpose of this global variable is to store a string conversion of the timestamp for the
last time the job executed.
2. In the job workspace, add a script called GetTimestamp using the tool palette.
3. In the script workspace, construct an expression to do the following:
• Select the last time the job was executed from the last update column in the table.
• Assign the actual timestamp value to the $G_Last_Update global variable.
The script content depends on the RDBMS on which the status table resides. The following
is an example of the expression:
$G_Last_Update = sql('DEMO_Target', 'select max(last_update) from employee_dim');
4. Return to the job workspace.
5. Add a data flow to the right of the script using the tool palette.
6. In the data flow workspace, add the source, Query transform, and target objects and connect
them.
The target table for CDC cannot be a template table.
7. In the Query transform, add the columns from the input schema to the output schema as
required.
8. If required, in the output schema, right-click the primary key (if it is not already set to the
surrogate key) and clear the Primary Key option in the menu.
9. Right-click the surrogate key column and select the Primary Key option in the menu.
10. On the Mapping tab for the surrogate key column, construct an expression to use the
key_generation function to generate new keys based on that column in the target table,
incrementing by 1.
The script content depends on the RDBMS on which the status table resides. The following
is an example of the expression:
key_generation('DEMO_Target.demo_target.employee_dim', 'Emp_Surr_Key', 1)
11. On the WHERE tab, construct an expression to select only those records with a timestamp
that is later than the $G_Last_Update global variable.
The following is an example of the expression:
employee_dim.last_update > $G_Last_Update
12. Connect the GetTimestamp script to the data flow.
13. Validate and save all objects.
14. Execute the job.
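The logic of this delta job can be sketched outside Data Services. The following Python stand-in mirrors the script that reads the last-update timestamp and the WHERE clause that extracts only newer rows; the table contents, column names, and helper names here are all invented for illustration, not Data Services objects:

```python
from datetime import datetime

# Hypothetical in-memory stand-ins for the source and target tables.
source_rows = [
    {"employee_id": 1, "last_update": datetime(2008, 1, 1, 9, 0)},
    {"employee_id": 2, "last_update": datetime(2008, 1, 2, 14, 30)},
    {"employee_id": 3, "last_update": datetime(2008, 1, 3, 8, 15)},
]
target_rows = [{"employee_id": 1, "last_update": datetime(2008, 1, 1, 9, 0)}]

def get_last_update(rows):
    """Mimics the GetTimestamp script: select max(last_update) from the target."""
    return max(r["last_update"] for r in rows) if rows else datetime(1901, 1, 1)

def delta_extract(rows, last_update):
    """Mimics the WHERE clause: last_update > $G_Last_Update."""
    return [r for r in rows if r["last_update"] > last_update]

g_last_update = get_last_update(target_rows)
delta = delta_extract(source_rows, g_last_update)
# Only rows stamped after the previous load are extracted.
```

In the real job, the sql() function fetches the timestamp into the global variable and the Query transform's WHERE clause applies the filter on the database side.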
Managing overlaps
Unless source data is rigorously isolated during the extraction process (which typically is not
practical), there is a window of time when changes can be lost between two extraction runs.
This overlap period affects source-based CDC because this kind of data capture relies on a
static timestamp to determine changed data.
For example, suppose a table has 10,000 rows. If a change is made to one of the rows after it
was loaded but before the job ends, the second update can be lost.
There are three techniques for handling this situation:
• Overlap avoidance
• Overlap reconciliation
• Presampling
For more information see “Source-based and target-based CDC” in “Techniques for Capturing
Changed Data” in the Data Services Designer Guide.
Overlap avoidance
In some cases, it is possible to set up a system where there is no possibility of an overlap. You
can avoid overlaps if there is a processing interval where no updates are occurring on the target
system.

For example, if you can guarantee the data extraction from the source system does not last
more than one hour, you can run a job at 1:00 AM every night that selects only the data updated
the previous day until midnight. While this regular job does not give you up-to-the-minute
updates, it guarantees that you never have an overlap and greatly simplifies timestamp
management.
Overlap reconciliation
Overlap reconciliation requires a special extraction process that re-applies changes that could
have occurred during the overlap period. This extraction can be executed separately from the
regular extraction. For example, if the highest timestamp loaded from the previous job was
01/01/2008 10:30 PM and the overlap period is one hour, overlap reconciliation re-applies the
data updated between 9:30 PM and 10:30 PM on January 1, 2008.
The overlap period is usually equal to the maximum possible extraction time. If it can take up
to N hours to extract the data from the source system, an overlap period of N (or N plus a small
increment) hours is recommended. For example, if it takes at most two hours to run the job,
an overlap period of at least two hours is recommended.
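As a rough sketch, the reconciliation window can be computed from the highest timestamp loaded and the assumed maximum extraction time. This hypothetical Python helper reproduces the 9:30 PM to 10:30 PM example above:

```python
from datetime import datetime, timedelta

def reconciliation_window(highest_loaded, max_extraction_hours):
    """Overlap reconciliation re-applies changes stamped inside the overlap
    period: from (highest timestamp loaded - overlap period) up to the
    highest timestamp loaded. The overlap period is usually the maximum
    possible extraction time."""
    overlap = timedelta(hours=max_extraction_hours)
    return highest_loaded - overlap, highest_loaded

# Matches the example: highest loaded timestamp 01/01/2008 10:30 PM,
# overlap period of one hour.
start, end = reconciliation_window(datetime(2008, 1, 1, 22, 30), 1)
```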
Presampling
Presampling is an extension of the basic timestamp processing technique. The main difference
is that the status table contains both a start and an end timestamp, instead of the last update
timestamp. The start timestamp for presampling is the same as the end timestamp of the
previous job. The end timestamp for presampling is established at the beginning of the job. It
is the most recent timestamp from the source table, commonly set as the system date.
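A minimal sketch of presampling's status handling, with an invented status record, might look like this in Python:

```python
from datetime import datetime

# Hypothetical status record: in presampling it holds a start and an end
# timestamp rather than a single last-update value.
status = {"start": datetime(2008, 1, 1, 22, 0), "end": datetime(2008, 1, 2, 22, 0)}

def begin_presampling_run(status, source_max_timestamp):
    """At the start of a new run, the new start timestamp is the previous
    run's end timestamp; the new end timestamp is the most recent timestamp
    available from the source (commonly the system date)."""
    return {"start": status["end"], "end": source_max_timestamp}

next_run = begin_presampling_run(status, datetime(2008, 1, 3, 22, 0))
# The job then extracts rows with timestamps in (start, end].
```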
Activity: Using source-based CDC
You need to set up a job to update employee records in the Omega data warehouse whenever
they change. The employee records include timestamps to indicate when they were last updated,
so you can use source-based CDC.
Objective
• Use timestamps to enable changed data capture for employee records.
Instructions
1. In the Omega project, create a new batch job called Alpha_Employees_Dim_Job.
2. Add a global variable called $G_LastUpdate with a datatype of datetime to your job.
3. In the job workspace, add a script called GetTimestamp and construct an expression to do
the following:
• Select the last time the job was executed from the last update column in the employee
dimension table.
• If the last update column is NULL, assign a value of January 1, 1901 to the $G_LastUpdate
global variable. When the job executes for the first time for the initial load, this ensures
that all records are processed.
• If the last update column is not NULL, assign the actual timestamp value to the
$G_LastUpdate global variable.
The expression should be:
$G_LastUpdate = sql('omega', 'select max(LAST_UPDATE) from emp_dim');
if ($G_LastUpdate is null) $G_LastUpdate = to_date('1901.01.01', 'YYYY.MM.DD');
else print('Last update was ' || $G_LastUpdate);
4. In the job workspace, add a data flow called Alpha_Employees_Dim_DF and connect it to the
script.
5. Add the Employee table from the Alpha datastore as the source object and the Emp_Dim
table from the Omega datastore as the target object.
6. Add the Query transform and connect the objects.
7. In the transform editor for the Query transform, map the columns as follows:
Schema In Schema Out
EMPLOYEEID EMPLOYEEID
LASTNAME LASTNAME
FIRSTNAME FIRSTNAME
BIRTHDATE BIRTHDATE
HIREDATE HIREDATE
ADDRESS ADDRESS
PHONE PHONE
EMAIL EMAIL
REPORTSTO REPORTSTO
LastUpdate LAST_UPDATE
discharge_date DISCHARGE_DATE
8. Create a mapping expression for the SURR_KEY column that generates new keys based on
the Emp_Dim target table, incrementing by 1.
The expression should be:
key_generation('Omega.dbo.emp_dim', 'SURR_KEY', 1)
9. Create a mapping expression for the CITY column to look up the city name from the City
table in the Alpha datastore based on the city ID.
The expression should be:
lookup_ext([Alpha.source.city, 'PRE_LOAD_CACHE', 'MAX'],
[CITYNAME], [NULL], [CITYID, '=', employee.CITYID]) SET
("run_as_separate_process"='no')
10. Create a mapping expression for the REGION column to look up the region name from the
Region table in the Alpha datastore based on the region ID.
The expression should be:
lookup_ext([Alpha.source.region, 'PRE_LOAD_CACHE', 'MAX'],
[REGIONNAME], [NULL], [REGIONID, '=', employee.REGIONID]) SET
("run_as_separate_process"='no')
11. Create a mapping expression for the COUNTRY column to look up the country name from
the Country table in the Alpha datastore based on the country ID.
The expression should be:
lookup_ext([Alpha.source.country, 'PRE_LOAD_CACHE', 'MAX'],
[COUNTRYNAME], [NULL], [COUNTRYID, '=', employee.COUNTRYID]) SET
("run_as_separate_process"='no')
12. Create a mapping expression for the DEPARTMENT column to look up the department
name from the Department table in the Alpha datastore based on the department ID.
The expression should be:
lookup_ext([Alpha.source.department, 'PRE_LOAD_CACHE', 'MAX'],
[DEPARTMENTNAME], [NULL], [DEPARTMENTID, '=', employee.DEPARTMENTID]) SET
("run_as_separate_process"='no')
13. On the WHERE tab, construct an expression to select only those records with a timestamp
that is later than the $G_LASTUPDATE global variable.
The expression should be:
employee.LastUpdate > $G_LASTUPDATE
14. Execute Alpha_Employees_Dim_Job with the default execution properties and save all
objects you have created.

According to the log, the last update for the table was on 2007.12.27.
15. Return to the data flow workspace and view data for the target table. Sort the records by
the LAST_UPDATE column.
A solution file called SOLUTION_SourceCDC.atl is included in your Course Resources. To check
the solution, import the file and open it to view the data flow design and mapping logic. Do
not execute the solution job, as this may override the results in your target table.
Using target-based CDC
Introduction
Target-based CDC compares the source to the target to determine which records have changed.
After completing this unit, you will be able to:
• Define the Data Integrator transforms involved in target-based CDC
Using target tables to identi fy changed data
Source-based CDC evaluates the source tables to determine what has changed and only extracts
changed rows to load into the target tables. Target-based CDC, by contrast, extracts all the data
from the source, compares the source and target rows, and then loads only the changed rows
into the target with new surrogate keys.
Source-based changed-data capture is almost always preferable to target-based capture for
performance reasons; however, some source systems do not provide enough information to
make use of the source-based CDC techniques. Target-based CDC allows you to use the
technique when source-based change information is limited.
You can preserve history by creating a data flow that contains the following:
• A source table contains the rows to be evaluated.
• A Query transform maps columns from the source.
• A Table Comparison transform compares the data in the source table with the data in the
target table to determine what has changed. It generates a list of INSERT and UPDATE rows
based on those changes. This circumvents the default behavior in Data Services of treating
all changes as INSERT rows.
• A History Preserving transform converts certain UPDATE rows to INSERT rows based on
the columns in which values have changed. This produces a second row in the target instead
of overwriting the first row.
• A Key Generation transform generates new keys for the updated rows that are now flagged
as INSERT.
• A target table receives the rows. The target table cannot be a template table.
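The flow described above can be sketched in a few lines of Python. Everything here (table contents and column names such as surr_key) is invented for illustration; in Data Services these steps happen inside the Table Comparison, History Preserving, and Key Generation transforms:

```python
# Minimal end-to-end sketch of a target-based CDC flow with history preservation.
source = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
target = [{"surr_key": 1, "id": 1, "name": "Anne"}]

# Table Comparison: flag new keys as INSERT, changed rows as UPDATE,
# and drop unchanged rows.
flagged = []
tgt = {r["id"]: r for r in target}
for row in source:
    old = tgt.get(row["id"])
    if old is None:
        flagged.append(("INSERT", dict(row)))
    elif old["name"] != row["name"]:
        flagged.append(("UPDATE", dict(row)))

# History Preserving: here every UPDATE was flagged because a compared
# column changed, so each becomes an INSERT (a new history row).
flagged = [("INSERT", r) if op == "UPDATE" else (op, r) for op, r in flagged]

# Key Generation: new surrogate keys continue from the target's maximum key.
next_key = max(r["surr_key"] for r in target)
for op, row in flagged:
    if op == "INSERT":
        next_key += 1
        row["surr_key"] = next_key
```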
Identifying history preserving transforms
Data Services supports history preservation with three Data Integrator transforms:
Transform Description

History Preserving
Converts rows flagged as UPDATE to UPDATE plus INSERT, so that the original values are
preserved in the target. You specify the column in which to look for updated data.

Key Generation
Generates new keys for source data, starting from a value based on existing keys in the table
you specify.

Table Comparison
Compares two data sets and produces the difference between them as a data set with rows
flagged as INSERT and UPDATE.
Explaining the Table Comparison transform
The Table Comparison transform allows you to detect and forward changes that have occurred
since the last time a target was updated. This transform compares two data sets and produces
the difference between them as a data set with rows flagged as INSERT or UPDATE.
For example, the transform compares the input and comparison tables and determines that
row 10 has a new address, row 40 has a name change, and row 50 is a new record. The output
includes all three records, flagged as appropriate:
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Table Comparison transform. For more information on the Table
Comparison transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Input/output
The transform compares two data sets, one from the input to the transform (input data set),
and one from a database table specified in the transform (the comparison table). The transform
selects rows from the comparison table based on the primary key values from the input data
set. The transform compares columns that exist in the schemas for both inputs.
The input data set must be flagged as NORMAL.
The output data set contains only the rows that make up the difference between the tables. The
schema of the output data set is the same as the schema of the comparison table. No DELETE
operations are produced.
If a column has a date datatype in one table and a datetime datatype in the other, the transform
compares only the date section of the data. The columns can also be time and datetime datatypes,
in which case Data Integrator only compares the time section of the data.
For each row in the input data set, there are three possible outcomes from the transform:
• An INSERT row is added: The primary key value from the input data set does not match
a value in the comparison table. The transform produces an INSERT row with the values
from the input data set row.
If there are columns in the comparison table that are not present in the input data set, the
transform adds these columns to the output schema and fills them with NULL values.
• An UPDATE row is added: The primary key value from the input data set matches a value
in the comparison table, and values in the non-key compare columns differ in the
corresponding rows from the input data set and the comparison table.
The transform produces an UPDATE row with the values from the input data set row.
If there are columns in the comparison table that are not present in the input data set, the
transform adds these columns to the output schema and fills them with values from the
comparison table.
• The row is ignored: The primary key value from the input data set matches a value in the
comparison table, but the comparison does not indicate any changes to the row values.
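These three outcomes can be sketched as a small Python function. The row layout and names are invented; the real transform also handles generated keys, duplicate keys, and comparison methods:

```python
def table_comparison(input_rows, comparison, key):
    """Sketch of the three outcomes: INSERT when the primary key is new,
    UPDATE when the key exists and any compared value differs, and no
    output row when the data is unchanged. Rows are plain dicts keyed on
    the primary-key column `key`."""
    out = []
    target = {r[key]: r for r in comparison}
    for row in input_rows:
        existing = target.get(row[key])
        if existing is None:
            out.append(("INSERT", row))
        elif any(existing.get(c) != v for c, v in row.items() if c != key):
            out.append(("UPDATE", row))
        # identical rows are ignored: no output
    return out

comparison = [{"id": 10, "addr": "old"}, {"id": 40, "name": "Smith"}]
rows = [{"id": 10, "addr": "new"},   # changed -> UPDATE
        {"id": 40, "name": "Smith"}, # unchanged -> ignored
        {"id": 50, "name": "Jones"}] # new key -> INSERT
flagged = table_comparison(rows, comparison, "id")
```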
Options
The Table Comparison transform offers several options:
Option Description
Table name
Specifies the fully qualified name of the comparison table. This table must already be imported
into the repository. Table name is represented as datastore.owner.table, where datastore is the
name of the datastore Data Services uses to access the comparison table and owner depends
on the database type associated with the table.

Generated key column
Specifies a column in the comparison table. When there is more than one row in the
comparison table with a given primary key value, this transform compares the row with
the largest generated key value of these rows and ignores the other rows. This is optional.
Input contains duplicate keys
Provides support for input rows with duplicate primary key values.

Detect deleted row(s) from comparison table
Flags the transform to identify rows that have been deleted from the source.

Comparison method
Allows you to select the method for accessing the comparison table. You can select from
Row-by-row select, Cached comparison table, and Sorted input.

Input primary key column(s)
Specifies the columns in the input data set that uniquely identify each row. These columns
must be present in the comparison table with the same column names and datatypes.

Compare columns
Improves performance by comparing only the subset of columns you drag into this box from
the input schema. If no columns are listed, all columns in the input data set that are also in
the comparison table are used as compare columns. This is optional.
Explaining the History Preserving transform
The History Preserving transform ignores everything but rows flagged as UPDATE. For these
rows, it compares the values of specified columns and, if the values have changed, flags the
row as INSERT. This produces a second row in the target instead of overwriting the first row.
For example, a target table that contains employee information is updated periodically from a
source table. In this case, the Table Comparison transform has flagged the name change for
row 40 as an update. However, the History Preserving transform is set up to preserve history
on the LastName column, so the output changes the operation code for that record from
UPDATE to INSERT.
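A simplified Python sketch of this behavior follows, with invented row data and an assumed id key; the real transform also manages valid-date and flag columns:

```python
def history_preserving(flagged_rows, compare_columns, before_rows):
    """Sketch: rows flagged UPDATE whose compare columns changed relative
    to the before-image become INSERT (a second, history-preserving row in
    the target); other rows pass through unchanged."""
    out = []
    before = {r["id"]: r for r in before_rows}
    for op, row in flagged_rows:
        if op == "UPDATE":
            old = before.get(row["id"], {})
            if any(old.get(c) != row.get(c) for c in compare_columns):
                op = "INSERT"  # add a new row instead of overwriting
        out.append((op, row))
    return out

# Like the example above: the last name changed, so the UPDATE for row 40
# is converted to an INSERT.
before = [{"id": 40, "last_name": "Smith", "phone": "555-0100"}]
flagged = [("UPDATE", {"id": 40, "last_name": "Jones", "phone": "555-0100"})]
result = history_preserving(flagged, ["last_name"], before)
```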
The next section gives a brief description of the function, data input requirements, options, and
data output results for the History Preserving transform. For more information on the History
Preserving transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Input/output
The input data set is the result of a comparison between two versions of the same data in which
rows with changed data from the newer version are flagged as UPDATE rows and new data
from the newer version are flagged as INSERT rows.
The output data set contains rows flagged as INSERT or UPDATE.
Options
The History Preserving transform offers these options:
Option Description
Valid from
Specifies a date or datetime column from the source schema. Specify a Valid from date
column if the target uses an effective date to track changes in data.

Valid to
Specifies a date value in the following format: YYYY.MM.DD. The Valid to date cannot be
the same as the Valid from date.
Column
Specifies a column from the source schema that identifies the current valid row from a set
of rows with the same primary key. The flag column indicates whether a row is the most
current data in the target for a given primary key.

Set value
Defines an expression that outputs a value with the same datatype as the value in the Set
flag column. This value is used to update the current flag column in the new row added to
the target to preserve history of an existing row.

Reset value
Defines an expression that outputs a value with the same datatype as the value in the Reset
flag column. This value is used to update the current flag column in an existing row in the
target that included changes in one or more of the compare columns.
Preserve delete row(s) as update row(s)
Converts DELETE rows to UPDATE rows in the target. If you previously set effective date
values (Valid from and Valid to), sets the Valid to value to the execution date. This option
is used to maintain slowly changing dimensions by feeding a complete data set first through
the Table Comparison transform with its Detect deleted row(s) from comparison table option
selected.

Compare columns
Lists the column or columns in the input data set that are to be compared for changes.
• If the values in the specified compare columns in each version match, the transform flags
the row as UPDATE. The row from the before version is updated. The date and flag
information is also updated.
• If the values in each version do not match, the row from the latest version is flagged
as INSERT when output from the transform. This adds a new row to the warehouse with
the values from the new row.

Updates to non-history preserving columns update all versions of the row if the update is
performed on the natural key (for example, Customer), but only update the latest version
if the update is on the generated key (for example, GKey).
Explaining the Key Generation transform
The Key Generation transform generates new keys before inserting the data set into the target
in the same way as the key_generation function does. When it is necessary to generate artificial
keys in a table, this transform looks up the maximum existing key value from a table and uses
it as the starting value to generate new keys. The transform expects the generated key column
to be part of the input schema.
For example, suppose the History Preserving transform produces rows to add to a warehouse,
and these rows have the same primary key as rows that already exist in the warehouse. In this
case, you can add a generated key to the warehouse table to distinguish these two rows that
have the same primary key.
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Key Generation transform. For more information on the Key
Generation transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Input/output
The input data set is the result of a comparison between two versions of the same data in which
changed data from the newer version are flagged as UPDATE rows and new data from the
newer version are flagged as INSERT rows.
The output data set is a duplicate of the input data set, with the addition of key values in the
generated key column for input rows flagged as INSERT.
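A rough Python sketch of this behavior, with invented table data and column names:

```python
def key_generation(target_rows, flagged_rows, key_column, increment=1):
    """Sketch: look up the maximum existing key in the key source table and
    assign new keys, starting past it, to rows flagged INSERT. Rows flagged
    UPDATE keep their existing key column untouched."""
    next_key = max((r.get(key_column, 0) or 0) for r in target_rows) if target_rows else 0
    out = []
    for op, row in flagged_rows:
        if op == "INSERT":
            next_key += increment
            row = {**row, key_column: next_key}
        out.append((op, row))
    return out

target = [{"surr_key": 7, "id": 1}, {"surr_key": 8, "id": 2}]
flagged = [("INSERT", {"id": 3}), ("UPDATE", {"surr_key": 7, "id": 1})]
keyed = key_generation(target, flagged, "surr_key")
# The INSERT row gets the next key after the target's maximum (8).
```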
Options
The Key Generation transform offers these options:
Option Description
Table name
Specifies the fully qualified name of the source table from which the maximum existing
key is determined (key source table).
This table must already be imported into the repository. Table name is represented as
datastore.owner.table, where datastore is the name of the datastore Data Services uses to
access the key source table and owner depends on the database type associated with the
table.

Generated key column
Specifies the column in the key source table containing the existing key values. A column
with the same name must exist in the input data set; the new keys are inserted in this
column.
Increment values
Indicates the interval between generated key values.

Activity: Using target-based CDC

You need to set up a job to update product records in the Omega data warehouse whenever
they change. The product records do not include timestamps to indicate when they were last
updated, so you must use target-based CDC to extract all records from the source and compare
them to the target.
Objective
• Use target-based CDC to preserve history for the Product dimension.
Instructions
1. In the Omega project, create a new batch job called Alpha_Product_Dim_Job with a data
flow called Alpha_Product_Dim_DF.
2. Add the Product table from the Alpha datastore as the source object and the Prod_Dim table
from the Omega datastore as the target object.
3. Add the Query, Table Comparison, History Preserving, and Key Generation transforms.
4. Connect the source table to the Query transform and the Query transform to the target table
to set up the schema prior to configuring the rest of the transforms.
5. In the transform editor for the Query transform, map the columns as follows:
Schema In Schema Out
PRODUCTID PRODUCTID
PRODUCTNAME PRODUCTNAME
CATEGORYID CATEGORYID
COST COST
6. Until the key can be generated, specify a mapping expression for the SURR_KEY column
to populate it with NULL.
7. Specify a mapping expression for the EFFECTIVE_DATE column to indicate the current
date as sysdate().
8. Delete the link from the Query transform to the target table.
9. Connect the transforms in the following order: Query, Table Comparison, History Preserving,
and Key Generation.
10. Connect the Key Generation transform to the target table.
11. In the transform editor for the Table Comparison transform, use the Prod_Dim table in the
Omega datastore as the comparison table and set Surr_Key as the generated key column.
12. Set the input primary key column to PRODUCTID, and compare the PRODUCTNAME,
CATEGORYID, and COST columns.
13. Do not configure the History Preserving transform.
14. In the transform editor for the Key Generation transform, set up key generation based on
the Surr_Key column of the Prod_Dim table in the Omega datastore, incrementing by 1.
15. In the workspace, before executing the job, display the data in both the source and target
tables.
Note that the OmegaSoft product has been added in the source, but has not yet been updated
in the target.
16. Execute Alpha_Product_Dim_Job with the default execution properties and save all objects
you have created.
17. Return to the data flow workspace and view data for the target table.

Note that the new records were added for product IDs 2, 3, 6, 8, and 13, and that OmegaSoft
has been added to the target.
A solution file called SOLUTION_TargetCDC.atl is included in your Course Resources. To check
the solution, import the file and open it to view the data flow design and mapping logic. Do
not execute the solution job, as this may override the results in your target table.
Quiz: Capturing changes in data
1. What are the two most important reasons for using CDC?
2. Which method of CDC is preferred for the performance gain of extracting the fewest rows?
3. What is the difference between an initial load and a delta load?
4. What transforms do you typically use for target-based CDC?
Lesson summary
After completing this lesson, you are now able to:
• Update data over time
• Use source-based CDC
• Use target-based CDC
Lesson 2
Using Data Integrator Transforms
Lesson introduction
Data Integrator transforms are used to enhance your data integration projects beyond the core
functionality of the platform transforms.
After completing this lesson, you will be able to:
• Describe the Data Integrator transforms
• Use the Pivot transform
• Use the Hierarchy Flattening transform
• Describe performance optimization
• Use the Data Transfer transform
• Use the XML Pipeline transform
Describing Data Integrator transforms
Introduction
Data Integrator transforms perform key operations on data sets to manipulate their structure
as they are passed from source to target.
After completing this unit, you will be able to:
• Describe Data Integrator transforms available in Data Services
Defining Data Integrator transforms
The following transforms are available in the Data Integrator branch of the Transforms tab in
the Local Object Library:
Transform Description

Data Transfer
Allows a data flow to split its processing into two sub-data flows and push down
resource-consuming operations to the database server.

Date Generation
Generates a column filled with date values based on the start and end dates and increment
you specify.

Effective Date
Generates an additional effective-to column based on the primary key's effective date.

Hierarchy Flattening
Flattens hierarchical data into relational tables so that it can participate in a star schema.
Hierarchy flattening can be both vertical and horizontal.

Map CDC Operation
Sorts input data, maps output data, and resolves before and after versions for UPDATE rows.
While commonly used to support Oracle or mainframe changed data capture, this transform
supports any data stream if its input requirements are met.

Pivot
Rotates the values in specified columns to rows.

Reverse Pivot
Rotates the values in specified rows to columns.
XML Pipeline
Processes large XML inputs in small batches.
Using the Pivot transform
Introduction
The Pivot and Reverse Pivot transforms let you convert columns to rows and rows back into
columns.
After completing this unit, you will be able to:
• Use the Pivot transform
Explaining the Pivot transform
The Pivot transform creates a new row for each value in a column that you identify as a pivot
column.
It allows you to change how the relationship between rows is displayed. For each value in each
pivot column, Data Services produces a row in the output data set. You can create pivot sets
to specify more than one pivot column.
For example, you could produce a list of discounts by quantity for certain payment terms so
that each type of discount is listed as a separate record, rather than each being displayed in a
unique column.
The Reverse Pivot transform reverses the process, converting rows into columns.
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Pivot transform. For more information on the Pivot transform see
“Transforms” Chapter 5 in the Data Services Reference Guide.
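Conceptually, the rotation works as in this Python sketch. The column names Comp and Comp_Type are borrowed from the activity later in this unit, and the sequence numbering is illustrative:

```python
def pivot(rows, non_pivot_columns, pivot_columns,
          data_field="Comp", header_column="Comp_Type",
          sequence_column="Pivot_Seq"):
    """Sketch of the Pivot transform: one output row per pivot column per
    input row. The header column records which source column the value in
    the data field came from."""
    out = []
    for row in rows:
        for seq, col in enumerate(pivot_columns):
            new_row = {c: row[c] for c in non_pivot_columns}
            new_row[sequence_column] = seq
            new_row[header_column] = col
            new_row[data_field] = row[col]
            out.append(new_row)
    return out

rows = [{"EmployeeID": 1, "Emp_Salary": 50000, "Emp_Bonus": 2000}]
pivoted = pivot(rows, ["EmployeeID"], ["Emp_Salary", "Emp_Bonus"])
# One input row becomes two output rows, one per pivoted column.
```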
Inputs/Outputs
Data inputs include a data set with rows flagged as NORMAL.
Data outputs include a data set with rows flagged as NORMAL. This target includes the
non-pivoted columns, a column for the sequence number, the data field column, and the pivot
header column.
Options
The Pivot transform offers several options:
Option Description
Pivot sequence column
Assign a name to the sequence number column. For each row created from a pivot column,
Data Services increments and stores a sequence number.

Non-pivot columns
Select the columns in the source that are to appear in the target without modification.

Pivot set
Identify a number for the pivot set. For each pivot set, you define a group of pivot columns,
a pivot data field, and a pivot header name.

Data field column
Specify the column that contains the pivoted data. This column contains all of the pivot
columns' values.

Header column
Specify the name of the column that contains the pivoted column names. This column lists
the names of the columns where the corresponding data originated.

Pivot columns
Select the columns to be rotated into rows. Describe these columns in the Header column.
Describe the data in these columns in the Data field column.
To pivot a table
1. Open the data flow workspace.
2. Add your source object to the workspace.
3. On the Transforms tab of the Local Object Library, click and drag the Pivot or Reverse Pivot
transform to the workspace to the right of your source object.
4. Add your target object to the workspace.
5. Connect the source object to the transform.
6. Connect the transform to the target object.
7. Double-click the Pivot transform to open the transform editor.
8. Click and drag any columns that will not be changed by the transform from the input schema
area to the Non-Pivot Columns area.
9. Click and drag any columns that will be pivoted from the input schema area to the Pivot
Columns area.
If required, you can create more than one pivot set by clicking Add.
10. If desired, change the values in the Pivot sequence column, Data field column, and Header
column fields.
These are the new columns that will be added to the target object by the transform.
11. Click Back to return to the data flow workspace.
Activity: Using the Pivot transform
Currently, employee compensation information is loaded into a table with a separate column
each for salary, bonus, and vacation days. For reporting purposes, you need each of these
items to be a separate record in the HR datamart.
Objective
• Use the Pivot transform to create a separate row for each entry in a new employee
compensation table.
Instructions
1. In the Omega project, create a new batch job called Alpha_HR_Comp_Job with a data flow called Alpha_HR_Comp_DF.
2. Add the HR_Comp_Update table from the Alpha datastore to the workspace as the source
object.
3. Add the Pivot transform and connect it to the source object.
4. Add the Query transform and connect it to the Pivot transform.
5. Create a new template table called Employee_Comp in the Delta datastore as the target object.
6. Connect the Query transform to the new template table.
7. In the transform editor for the Pivot transform, specify that the EmployeeID and date_updated fields are non-pivot columns.
8. Specify that the Emp_Salary, Emp_Bonus, and Emp_VacationDays fields are pivot columns.
9. Specify that the data field column is called Comp, and the header column is called Comp_Type.
10. In the transform editor for the Query transform, map all fields from input schema to output
schema.
11. On the WHERE tab, filter out NULL values for the Comp column.
The expression should be as follows:
Pivot.Comp is not null
12. Execute Alpha_HR_Comp_Job with the default execution properties and save all objects
you have created.
13. Return to the data flow workspace and view data for the target table.
A solution file called SOLUTION_Pivot.atl is included in your Course Resources. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may override the results in your target table.
Using the Hierarchy Flattening transform
Introduction
The Hierarchy Flattening transform enables you to break down hierarchical table structures
into a single table to speed up data access.
After completing this unit, you will be able to:
• Use the Hierarchy Flattening transform
Explaining the Hierarchy Flattening transform
The Hierarchy Flattening transform constructs a complete hierarchy from parent/child relationships, and then produces a description of the hierarchy in horizontally- or vertically-flattened format.
For horizontally-flattened hierarchies, each row of the output describes a single node in the
hierarchy and the path to that node from the root.
For vertically-flattened hierarchies, each row of the output describes a single relationship between ancestor and descendant and the number of nodes the relationship includes. There is a row in the output for each node and each of the descendants of that node. Each node is considered its own descendant and, therefore, is listed once as both ancestor and descendant.
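A minimal sketch of vertical flattening in Python (not Data Services code; the node names are made up) shows how the ancestor/descendant rows, including each node's self-relationship at depth 0, are produced:

```python
# Vertically flatten parent-child rows into (ancestor, descendant, depth) rows.
# Each node appears as its own descendant at depth 0, matching the text above.
def flatten(edges):
    children = {}
    nodes = set()
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        nodes.update((parent, child))
    out = []
    for root in nodes:
        stack = [(root, 0)]          # walk the subtree under each node
        while stack:
            node, depth = stack.pop()
            out.append((root, node, depth))
            for c in children.get(node, []):
                stack.append((c, depth + 1))
    return out

rows = flatten([("CEO", "Manager"), ("Manager", "Analyst")])
```

With the flattened rows in place, "all reports to a given manager, direct or indirect" becomes a simple filter on the ancestor column instead of a recursive query.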
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Hierarchy Flattening transform. For more information on the
Hierarchy Flattening transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Inputs/Outputs
Data input includes rows describing individual parent-child relationships. Each row must contain two columns that function as the keys of the parent and child in the relationship. The input can also include columns containing attributes describing the parent and/or child.
The input data set cannot include rows with operations other than NORMAL, but can contain
hierarchical data.
For a listing of the target columns, consult the Data Services Reference Guide.
Options
The Hierarchy Flattening transform offers several options:
Option Description
Parent column: Identifies the column of the source data that contains the parent identifier in each parent-child relationship.
Child column: Identifies the column in the source data that contains the child identifier in each parent-child relationship.
Flattening type: Indicates how the hierarchical relationships are described in the output.
Use maximum length paths: Indicates whether longest or shortest paths are used to describe relationships between descendants and ancestors when the descendant has more than one parent.
Maximum depth: Indicates the maximum depth of the hierarchy.
Parent attribute list: Identifies a column or columns that are associated with the parent column.
Child attribute list: Identifies a column or columns that are associated with the child column.
Run as a separate process: Creates a separate sub-data flow process for the Hierarchy Flattening transform when Data Services executes the data flow.
Activity: Using the Hierarchy Flattening transform
The Employee table in the Alpha datastore contains employee data in a recursive hierarchy.
To determine all reports, direct or indirect, to a given executive or manager would require
complex SQL statements to traverse the hierarchy.
Objective
• Flatten the hierarchy to allow more efficient reporting on data.
Instructions
1. In the Omega project, create a new batch job called Alpha_Employees_Report_Job with a data flow called Alpha_Employees_Report_DF.
2. In the data flow workspace, add the Employee table from the Alpha datastore as the source
object.
3. Create a template table called Manager_Emps in the HR_datamart datastore as the target
object.
4. Add a Hierarchy Flattening transform to the right of the source table and connect the source
table to the transform.
5. In the transform editor for the Hierarchy Flattening transform, select the following options:
Option Value
Flattening Type Vertical
Parent Column REPORTSTO
Child Column EMPLOYEEID
Child Attribute List
LASTNAME
FIRSTNAME
BIRTHDATE
HIREDATE
ADDRESS
CITYID
REGIONID
COUNTRYID
PHONE
EMAIL
DEPARTMENTID
LastUpdate
discharge_date
6. Add a Query transform to the right of the Hierarchy Flattening transform and connect the transforms.
7. In the transform editor of the Query transform, create the following output columns:
Column Datatype
MANAGERID varchar(10)
MANAGER_NAME varchar(50)
EMPLOYEEID varchar(10)
Column Datatype
EMPLOYEE_NAME varchar(102)
DEPARTMENT varchar(50)
HIREDATE datetime
LASTUPDATE datetime
PHONE varchar(20)
EMAIL varchar(50)
ADDRESS varchar(200)
CITY varchar(50)
REGION varchar(50)
COUNTRY varchar(50)
DISCHARGE_DATE datetime
DEPTH int
ROOT_FLAG int
LEAF_FLAG int
8. Map the output columns as follows:
Schema In Schema Out
ANCESTOR MANAGERID
DESCENDENT EMPLOYEEID
Schema In Schema Out
DEPTH DEPTH
ROOT_FLAG ROOT_FLAG
LEAF_FLAG LEAF_FLAG
C_ADDRESS ADDRESS
C_discharge_date DISCHARGE_DATE
C_EMAIL EMAIL
C_HIREDATE HIREDATE
C_LastUpdate LASTUPDATE
C_PHONE PHONE
9. Create a mapping expression for the MANAGER_NAME column to look up the manager's
last name from the Employee table in the Alpha datastore based on the employee ID in the
ANCESTOR column of the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.source.employee,'PRE_LOAD_CACHE','MAX'], [LASTNAME], [NULL], [EMPLOYEEID,'=',Hierarchy_Flattening.ANCESTOR]) SET ("run_as_separate_process"='no')
10. Create a mapping expression for the EMPLOYEE_NAME column to concatenate the
employee's last name and first name, separated by a comma.
The expression should be:
Hierarchy_Flattening.C_LASTNAME || ', ' || Hierarchy_Flattening.C_FIRSTNAME
11. Create a mapping expression for the DEPARTMENT column to look up the name of the
employee's department from the Department table in the Alpha datastore based on the
C_DEPARTMENTID column of the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.source.department,'PRE_LOAD_CACHE','MAX'], [DEPARTMENTNAME], [NULL], [DEPARTMENTID,'=',Hierarchy_Flattening.C_DEPARTMENTID]) SET ("run_as_separate_process"='no')
12. Create a mapping expression for the CITY column to look up the name of the employee's
city from the City table in the Alpha datastore based on the C_CITYID column of the
Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.source.city,'PRE_LOAD_CACHE','MAX'], [CITYNAME], [NULL], [CITYID,'=',Hierarchy_Flattening.C_CITYID]) SET ("run_as_separate_process"='no')
13. Create a mapping expression for the REGION column to look up the name of the employee's
region from the Region table in the Alpha datastore based on the C_REGIONID column of
the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.source.region,'PRE_LOAD_CACHE','MAX'], [REGIONNAME], [NULL], [REGIONID,'=',Hierarchy_Flattening.C_REGIONID]) SET ("run_as_separate_process"='no')
14. Create a mapping expression for the COUNTRY column to look up the name of the
employee's country from the Country table in the Alpha datastore based on the
C_COUNTRYID column of the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.source.country,'PRE_LOAD_CACHE','MAX'], [COUNTRYNAME], [NULL], [COUNTRYID,'=',Hierarchy_Flattening.C_COUNTRYID]) SET ("run_as_separate_process"='no')
15. Add a WHERE clause to the Query transform to return only rows where the depth is greater
than zero.
The expression should be as follows:
Hierarchy_Flattening.DEPTH > 0
16. Execute Alpha_Employees_Report_Job with the default execution properties and save all objects you have created.
17. Return to the data flow workspace and view data for the target table.
Note that 179 rows were written to the target table.
A solution file called SOLUTION_HierarchyFlattening.atl is included in your Course Resources.
To check the solution, import the file and open it to view the data flow design and mapping
logic. Do not execute the solution job, as this may override the results in your target table.
Describing performance optimization
Introduction
You can improve the performance of your jobs by pushing down operations to the source or
target database to reduce the number of rows and operations that the engine must retrieve and
process.
After completing this unit, you will be able to:
• List operations that Data Services pushes down to the database
• View SQL generated by a data flow
• Explore data caching options
• Explain process slicing
Describing push-down operations
Data Services examines the database and its environment when determining which operations
to push down to the database:
• Full push-down operations
The Data Services optimizer always tries to perform a full push-down operation. Full push-down operations are operations that can be pushed down to the databases so that the data streams directly from the source database to the target database. For example, Data Services sends an INSERT INTO ... SELECT statement to the target database, and it sends a SELECT statement to retrieve data from the source.
Data Services can only perform full push-down operations to the source and target databases when the following conditions are met:
○ All of the operations between the source table and target table can be pushed down.
○ The source and target tables are from the same datastore, or they are in datastores that have a database link defined between them.
• Partial push-down operations
When a full push-down operation is not possible, Data Services tries to push down the SELECT statement to the source database. Operations within the SELECT statement that can be pushed down to the database include:
Operation Description
Aggregations: Aggregate functions, typically used with a GROUP BY statement, always produce a data set smaller than or the same size as the original data set.
Distinct rows: Data Services outputs only unique rows when you use distinct rows.
Filtering: Filtering can produce a data set smaller than or equal to the original data set.
Joins: Joins typically produce a data set smaller than or similar in size to the original tables.
Ordering: Ordering does not affect data set size. Data Services can efficiently sort data sets that fit in memory. Since Data Services does not perform paging (writing out intermediate results to disk), it is recommended that you use a dedicated disk-sorting program such as SyncSort or the DBMS itself to order very large data sets.
Projections: A projection normally produces a smaller data set because it only returns columns referenced by a data flow.
Functions: Most Data Services functions that have equivalents in the underlying database are appropriately translated.
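The idea of a full push-down can be sketched with SQLite standing in for a real source/target database (the table and column names are illustrative). The single INSERT INTO ... SELECT statement performs the aggregation, filtering, and load entirely inside the database, so no rows travel through the engine:

```python
import sqlite3

# Source and target tables share one database, satisfying the
# "same datastore" condition for a full push-down.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE tgt (id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, 10), (1, 20), (2, 5)])

# Full push-down: aggregation, filtering, and the load happen in a
# single statement executed entirely by the database server.
conn.execute("""
    INSERT INTO tgt (id, total)
    SELECT id, SUM(amount)
    FROM src
    GROUP BY id
    HAVING SUM(amount) > 6
""")
rows = conn.execute("SELECT id, total FROM tgt ORDER BY id").fetchall()
```

Compare this with a partial push-down, where only the SELECT goes to the source database and the engine still fetches the result rows before loading the target.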
Operations that cannot be pushed down
Data Services cannot push some transform operations to the database. For example:
• Expressions that include Data Services functions that do not have database correspondents.
• Load operations that contain triggers.
• Transforms other than Query.
• Joins between sources that are on different database servers that do not have database links
defined between them.
Similarly, not all operations can be combined into single requests. For example, when a stored
procedure contains a COMMIT statement or does not return a value, you cannot combine the
stored procedure SQL with the SQL for other operations in a query. You can only push
operations supported by the RDBMS down to that RDBMS.
Note: You cannot push built-in functions or transforms to the source database. For best
performance, do not intersperse built-in transforms among operations that can be pushed down
to the database. Database-specific functions can only be used in situations where they will be
pushed down to the database for execution.
Viewing SQL generated by a data flow
Before running a job, you can view the SQL generated by the data flow and adjust your design to maximize the SQL that is pushed down and improve performance. Alter your design to improve the data flow when necessary.
Keep in mind that Data Services only shows the SQL generated for table sources. Data Services
does not show the SQL generated for SQL sources that are not table sources, such as the lookup
function, the Key Generation transform, the key_generation function, the Table Comparison
transform, and target tables.
To view SQL
1. In the Data Flows tab of the Local Object Library, right-click the data flow and select Display
Optimized SQL from the menu.
The Optimized SQL dialog box displays.
2. In the left pane, select the datastore for the data flow.
The optimized SQL for the datastore displays in the right pane.
Caching data
You can improve the performance of data transformations that occur in memory by caching
as much data as possible. By caching data, you limit the number of times the system must
access the database. Cached data must fit into available memory.
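The principle can be sketched in a few lines of Python: a lookup cache turns repeated database round trips into local reads. The lookup function here is a stand-in for a database query, not a Data Services API:

```python
# Count how many times the "database" is actually hit.
calls = {"n": 0}

def db_lookup(key):
    calls["n"] += 1          # each call simulates one round trip to the database
    return key.upper()

cache = {}

def cached_lookup(key):
    if key not in cache:     # go to the database only on a cache miss
        cache[key] = db_lookup(key)
    return cache[key]

# Five lookups, but only two distinct keys, so only two database hits.
results = [cached_lookup(k) for k in ["a", "b", "a", "a", "b"]]
```

The trade-off stated above applies directly: the dictionary must fit in available memory, which is what the pageable and persistent cache options below address for larger data sets.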
Pageable caching
Data Services allows administrators to select a pageable cache location to save content over the 2 GB RAM limit. The pageable cache location is set up in the Server Manager, and the option to use pageable cache is selected on the Dataflow Properties dialog box.
Persistent caching
Persistent cache datastores can be created through the Create New Datastore dialog box by
selecting Persistent Cache as the database type. The newly-created persistent cache datastore
will appear in the list of datastores, and can be used as a source in jobs.
For more information about advanced caching features, see the Data Services Performance
Optimization Guide.
Slicing processes
You can also optimize your jobs through process slicing, which involves splitting data flows
into sub-data flows.
Sub-data flows work on smaller data sets and/or fewer transforms so there is less virtual
memory to consume per process. This way, you can leverage more physical memory per data
flow as each sub-data flow can access 2 GB of memory.
This functionality is available through the Advanced tab for the Query transform. You can run
each memory-intensive operation as a separate process.
For more information on process slicing, see the Data Services Performance Optimization Guide.
Using the Data Transfer transform
Introduction
The Data Transfer transform allows a data flow to split its processing into two sub-data flows
and push down resource-consuming operations to the database server.
After completing this unit, you will be able to:
• Use the Data Transfer transform
Explaining the Data Transfer transform
The Data Transfer transform moves data from a source, or the output from another transform, into a transfer object and subsequently reads data from the transfer object. You can use the
Data Transfer transform to push down resource-intensive database operations that occur
anywhere within the data flow. The transfer type can be a relational database table, persistent
cache table, file, or pipeline.
Use the Data Transfer transform to:
• Push down operations to the database server when the transfer type is a database table. You
can push down resource-consuming operations such as joins, GROUP BY, and sorts.
• Define points in your data flow where you want to split processing into multiple sub-data
flows that each process part of the data. Data Services does not need to process the entire
input data set in memory at one time. Instead, the Data Transfer transform splits the processing among multiple sub-data flows that each use a portion of memory.
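The splitting idea can be sketched in Python, with each chunk standing in for a sub-data flow that only ever holds its own portion of the data (the chunk size and data are illustrative):

```python
# Process a large input in fixed-size portions so that no single step
# needs the whole data set in memory at once, mimicking the split of a
# data flow into sub-data flows.
def process_in_chunks(rows, chunk_size, fn):
    out = []
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]   # each chunk = one "sub-data flow"
        out.extend(fn(chunk))            # only this portion is held in memory
    return out

doubled = process_in_chunks(list(range(10)), 4, lambda chunk: [x * 2 for x in chunk])
```

In Data Services the split point is the Data Transfer transform itself; this sketch only shows why bounding the working set per process keeps memory use flat.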
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Data Transfer transform. For more information on the Data Transfer
transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Inputs/Outputs
When the input data set for the Data Transfer transform is a table or file transfer type, the rows must be flagged with the NORMAL operation code. When the input data set is a pipeline transfer type, the rows can be flagged with any operation code.
The input data set must not contain hierarchical (nested) data.
Output data sets have the same schema and the same operation code as the input data sets. In the push-down scenario, the output rows are in the sort or GROUP BY order.
The sub-data flow names use the following format, where n is the number of the data flow:
dataflowname_n
The execution of the output depends on the temporary transfer type:
For Table or File temporary transfer types, Data Services automatically splits the data flow into
sub-data flows and executes them serially.
For Pipeline transfer types, Data Services splits the data flow into sub-data flows if you specify
the Run as a separate process option in another operation in the data flow. Data Services
executes these sub-data flows that use pipeline in parallel.
Activity: Using the Data Transfer transform
The Data Transfer transform can be used to push data down to a database table so that it can
be processed by the database server rather than the Data Services Job Server. In this activity,
you will join data from two database schemas. When the Data Transfer transform is not used,
the join will occur on the Data Services Job Server. When the Data Transfer transform is added to the data flow, the join can be seen in the SQL query by displaying the optimized SQL for the data flow.
Objective
• Use the Data Transfer transform to optimize performance.
Instructions
1. In the Omega project, create a new batch job called No_Data_Transfer_Job with a data flow called No_Data_Transfer_DF.
2. In the Delta datastore, import the Employee_Comp table and add it to the
No_Data_Transfer_DF workspace as a source table.
3. Add the Employee table from the Alpha datastore as a source table.
4. Add a Query transform to the data flow workspace and attach both source tables to the
transform.
5. In the transform editor for the Query transform, add the LastName and BirthDate columns
from the Employee table and the Comp_Type and Comp columns from the Employee_Comp
table to the output schema.
6. Add a WHERE clause to join the tables on the EmployeeID columns.
7. Create a template table called Employee_Temp in the Delta datastore as the target object
and connect it to the Query transform.
8. Save the job.
9. In the Local Object Library, use the right-click menu for the No_Data_Transfer_DF data
flow to display the optimized SQL.
Note that the WHERE clause does not appear in either SQL statement.
10. In the Local Object Library, replicate the No_Data_Transfer_DF data flow and rename the
copy Data_Transfer_DF.
11. In the Local Object Library, replicate the No_Data_Transfer_Job job and rename the copy
Data_Transfer_Job.
12. Add the Data_Transfer_Job job to the Omega project.
13. Delete the No_Data_Transfer_DF data flow from the Data_Transfer_Job and add the
Data_Transfer_DF data flow to the job by dragging it from the Local Object Library to the
job's workspace.
14. Delete the connection between the Employee_Comp table and the Query transform.
15. Add a Data Transfer transform between the Employee_Comp table and the Query transform
and connect the three objects.
16. In the transform editor for the Data Transfer transform, select the Table option for the Transfer Type field.
17. In the Table Options section, click the ellipsis (...) button and select Table Name. Select the Alpha datastore. In the Table Name field, enter PUSHDOWN_DATA. In the Owner field, enter SOURCE.
18. In the transform editor for the Query transform, update the WHERE clause to join the Data_Transfer.employeeid and employee.employeeid fields. Verify that the Comp_Type and Comp columns are mapped to the Data Transfer transform.
19. Save the job.
20. In the Local Object Library, use the right-click menu for the Data_Transfer_DF data flow to
display the optimized SQL.
Note that the WHERE clause appears in the SQL statements.
A solution file called SOLUTION_DataTransfer.atl is included in your Course Resources. To
check the solution, import the file and open it to view the data flow design and mapping logic.
Do not execute the solution job, as this may override the results in your target table.
Using the XML Pipeline transform
Introduction
The XML Pipeline transform is used to process large XML files more efficiently by separating
them into small batches.
After completing this unit, you will be able to:
• Use the XML Pipeline transform
Explaining the XML Pipeline transform
The XML Pipeline transform is used to process large XML files, one instance of a specifiedrepeatable structure at a time.
With this transform, Data Services does not need to read the entire XML input into memory
and build an internal data structure before performing the transformation.
This means that an NRDM structure is not required to represent the entire XML data input.
Instead, this transform uses a portion of memory to process each instance of a repeatable
structure, then continually releases and re-uses the memory to continuously flow XML data
through the transform.
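The streaming behavior described above can be approximated in plain Python with `xml.etree.ElementTree.iterparse`, which likewise handles one repeatable element at a time and releases its memory before reading the next. The element names are illustrative, not part of the transform:

```python
import io
import xml.etree.ElementTree as ET

# A small stand-in for a large XML file with a repeatable <order> structure.
xml = io.BytesIO(b"""<PurchaseOrders>
  <order><customerName>Acme</customerName></order>
  <order><customerName>Bolt</customerName></order>
</PurchaseOrders>""")

names = []
# Stream one repeatable <order> at a time instead of building the whole
# document tree in memory first.
for event, elem in ET.iterparse(xml, events=("end",)):
    if elem.tag == "order":
        names.append(elem.findtext("customerName"))
        elem.clear()   # release this instance's memory before the next one
```

This is the same trade the transform makes: memory use is bounded by one instance of the repeatable structure, not by the size of the file.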
During execution, Data Services pushes operations of the streaming transform to the XML
source. Therefore, you cannot use a breakpoint between your XML source and an XML Pipeline
transform.
Note:
You can use the XML Pipeline transform to load into a relational or nested schema target. This
course focuses on loading XML data into a relational target.
For more information on constructing nested schemas for your target, refer to the Data Services
Designer Guide.
Inputs/Outputs
You can use an XML file or XML message as input. You can also connect more than one XML Pipeline transform to an XML source.
When connected to an XML source, the transform editor shows the input and output schema structures as a root schema containing repeating and non-repeating sub-schemas, represented by these icons:
Icon Schema structure
Root schema and repeating sub-schema
Icon Schema structure
Non-repeating sub-schema
Keep in mind these rules when using the XML Pipeline transform:
• You cannot drag and drop the root level schema.
• You can drag and drop the same child object to the output schema multiple times, but only if you give each instance of that object a unique name. Rename the mapped instance before attempting to drag and drop the same object to the output again.
• When you drag and drop a column or sub-schema to the output schema, you cannot then map the parent schema for that column or sub-schema. Similarly, when you drag and drop a parent schema, you cannot then map an individual column or sub-schema from under that parent.
• You cannot map items from two sibling repeating sub-schemas, because the XML Pipeline transform does not support Cartesian products (combining every row from one table with every row in another table) of two repeatable schemas.
To take advantage of the XML Pipeline transform’s performance, always select a repeatable
column to be mapped. For example, if you map a repeatable schema column, the XML source
produces one row after parsing one item.
Avoid selecting non-repeatable columns that occur structurally after the repeatable schema, because the XML source must then assemble the entire structure of items in memory before processing, which increases the memory consumed to produce the output for your target.
To map both the repeatable schema and a non-repeatable column that occurs after the repeatable
one, use two XML Pipeline transforms, and use the Query transform to combine the outputs
of the two XML Pipeline transforms and map the columns into one single target.
Options
The XML Pipeline transform is streamlined to support massive throughput of XML data; therefore, it contains no options other than the input and output schemas and the Mapping tab.
Activity: Using the XML Pipeline transform
Purchase order information is stored in XML files that have repeatable purchase orders and items, and a non-repeated Total Purchase Orders column. You must combine the customer name, order date, order items, and the totals into a single relational target table, with one row per customer per item.
Objectives
• Use the XML Pipeline transform to extract XML data.
• Combine the rows required from both XML sources into a single target table, joined using a Query transform.
Instructions
1. On the Formats tab of the Local Object Library, create a new file format for an XML schema
called purchaseOrders_Format, based on the purchaseOrders.xsd file in the Activity_Source
folder. Use a root element of PurchaseOrders.
2. In the Omega project, create a new job called Alpha_Purchase_Orders_Job, with a data flow called Alpha_Purchase_Orders_DF.
3. In the data flow workspace, add the PurchaseOrders_Format file format as the XML file
source object.
4. In the format editor for the file format, point the file format to the pos.xml file in the
Activity_Source folder.
Note that when working in a distributed environment, where Designer and the Job Server
are on different machines, it may be necessary to edit the path to the XML file if it is different
on the Job Server than the Designer client. Your instructor will tell you if you need to edit
the path to the file for this activity.
5. Add two instances of the XML Pipeline transform to the data flow workspace and connect
the source object to each.
6. In the transform editor for the first XML Pipeline transform, map the following columns:
Schema In Schema Out
customerName customerName
orderDate orderDate
7. Map the entire item repeatable schema from the input schema to the output schema.
8. In the transform editor for the second XML Pipeline transform, map the following columns:
Schema In Schema Out
customerName customerName
orderDate orderDate
totalPOs totalPOs
9. Add a Query transform to the data flow workspace and connect both XML Pipeline transforms to it.
10. In the transform editor for the Query transform, map both columns and the repeatable schema from the first XML Pipeline transform from the input schema to the output schema. Also map the totalPOs column from the second XML Pipeline transform.
11. Unnest the item repeatable schema.
12. Create a WHERE clause to join the inputs from the two XML Pipeline transforms on the
customerName column.
The expression should be as follows:
XML_Pipeline.customerName = XML_Pipeline_1.customerName
13. Add a new template table called Item_POs to the Delta datastore and connect the Query
transform to it.
14. Execute Alpha_Purchase_Orders_Job with the default execution properties and save all
objects you have created.
15. Return to the data flow workspace and view data for the target table.
A solution file called SOLUTION_XMLPipeline.atl is included in your Course Resources. To
check the solution, import the file and open it to view the data flow design and mapping logic.
Do not execute the solution job, as this may override the results in your target table.
Quiz: Using Data Integrator transforms
1. What is the Pivot transform used for?
2. What is the purpose of the Hierarchy Flattening transform?
3. What is the difference between the horizontal and vertical flattening hierarchies?
4. List three things you can do to improve job performance.
5. Name three options that can be pushed down to the database.
Lesson summary
After completing this lesson, you are now able to:
• Describe the Data Integrator transforms
• Use the Pivot transform
• Use the Hierarchy Flattening transform
• Describe performance optimization
• Use the Data Transfer transform
• Use the XML Pipeline transform
Answer Key
This section contains the answers to the reviews and/or activities for the applicable lessons.
Quiz: Capturing changes in data
1. What are the two most important reasons for using CDC?
Answer: Improving performance and preserving history.
2. Which method of CDC is preferred for the performance gain of extracting the fewest rows?
Answer: Source-based CDC.
3. What is the difference between an initial load and a delta load?
Answer:
An initial load is the first population of a database using data acquisition modules for extraction, transformation, and load. The first time you execute a batch job, Designer performs
an initial load to create the data tables and populate them.
A delta load incrementally loads data that has been changed or added since the last load
iteration. When you execute your job, the delta load may run several times, loading the
specified number of rows each time until all new data has been written to the target
database.
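As a sketch of the idea (not Data Services syntax), a timestamp-based delta load filters the source on the time recorded after the previous run; the table and field names below are hypothetical:

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp
source = [
    {"id": 1, "modified": datetime(2009, 1, 1)},
    {"id": 2, "modified": datetime(2009, 3, 1)},
]

# Timestamp recorded at the end of the previous load iteration
last_load = datetime(2009, 2, 1)

# Delta load: pick up only rows changed or added since the last load
delta = [row for row in source if row["modified"] > last_load]
print(delta)
```

An initial load, by contrast, would take every row in `source` with no filter.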
4. What transforms do you typically use for target-based CDC?
Answer: Table Comparison, History Preserving, and Key Generation.
Quiz: Using Data Integrator transforms
1. What is the Pivot transform used for?
Answer: Use the Pivot transform when you want to rotate data from multiple columns into
rows in a single column while preserving the information associated with each original column.
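In relational terms this is the same operation as SQL UNPIVOT. A minimal plain-Python sketch, with invented quarterly columns, of what the Pivot transform produces:

```python
# Hypothetical source with one column per quarter
rows = [
    {"region": "East", "Q1": 100, "Q2": 120},
    {"region": "West", "Q1": 90, "Q2": 95},
]

# Pivot the quarter columns into rows: each column name becomes data
# (the quarter) and each value moves into a single amount column,
# while the non-pivot column (region) is repeated on every row
pivoted = [
    {"region": row["region"], "quarter": q, "amount": row[q]}
    for row in rows
    for q in ("Q1", "Q2")
]
print(pivoted)
```

Two source rows with two pivot columns each become four output rows, one per original column value.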
2. What is the purpose of the Hierarchy Flattening transform?
Answer: The Hierarchy Flattening transform enables you to break down hierarchical table
structures into a single table to speed data access.
3. What is the difference between the horizontal and vertical flattening hierarchies?
Answer:
With horizontally flattened hierarchies, each row of the output describes a single node in
the hierarchy and the path to that node from the root.
With vertically flattened hierarchies, each row of the output describes a single relationship
between an ancestor and a descendant and the number of nodes the relationship spans. There
is a row in the output for each node paired with each of its descendants. Each node is
considered its own descendant and is therefore listed once as both ancestor and
descendant.
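The two modes can be illustrated with a small parent/child adjacency table. The following sketch is not Data Services code; the node names and structures are invented to show the shape of each output:

```python
# Hypothetical parent -> children adjacency for a three-level hierarchy
children = {"root": ["a", "b"], "a": ["a1"], "b": [], "a1": []}

def horizontal_flatten(children, node, path=()):
    """One output row per node: the full path from the root to that node."""
    path = path + (node,)
    rows = [path]
    for child in children.get(node, []):
        rows.extend(horizontal_flatten(children, child, path))
    return rows

def vertical_flatten(children, root):
    """(ancestor, descendant, depth) rows: each node paired once with
    itself (depth 0) and once with every node below it."""
    rows = [(root, root, 0)]
    for child in children.get(root, []):
        sub = vertical_flatten(children, child)
        rows.extend(sub)
        # root is an ancestor of everything in the child's subtree
        rows.extend((root, desc, depth + 1)
                    for anc, desc, depth in sub if anc == child)
    return rows

h = horizontal_flatten(children, "root")
v = vertical_flatten(children, "root")
```

For this four-node hierarchy, horizontal flattening yields four rows (one path per node), while vertical flattening yields eight (four self-pairs plus four ancestor/descendant pairs).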
4. List three things you can do to improve job performance.
Answer:
Choose from the following:
○ Use push-down operations.
○ View SQL generated by a data flow and adjust your design to maximize the SQL that is
pushed down to improve performance.
○ Use data caching.
○ Use process slicing.
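Push-down means the engine folds operations such as filters and aggregations into the SQL sent to the source database, rather than extracting all rows and processing them afterward. A schematic illustration using an in-memory SQLite database with an invented schema:

```python
import sqlite3

# In-memory database standing in for the source (hypothetical schema)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount INT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("East", 100), ("East", 50), ("West", 70)])

# Pushed-down version: the WHERE clause and GROUP BY run inside the
# database, so only the aggregated rows are returned to the engine
rows = con.execute(
    "SELECT region, SUM(amount) FROM orders "
    "WHERE amount > 40 GROUP BY region").fetchall()
print(rows)
```

Without push-down, the equivalent job would pull all three rows across the network and filter and aggregate them in the engine; here only two aggregated rows come back.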
5. Name three options that can be pushed down to the database.
Answer: Choose from the following: