8/10/2019 BODS20_EN_COL91_A4
http://slidepdf.com/reader/full/bods20encol91a4
Material Number: 50102235
SAP Data Services:
Data Integrator Transforms
Learner’s Guide
BODS20
Copyright

© 2009 SAP® BusinessObjects™. All rights reserved. SAP BusinessObjects owns the following United States patents, which may cover products that are offered and licensed by SAP BusinessObjects and/or affiliated companies: 5,295,243; 5,339,390; 5,555,403; 5,590,250; 5,619,632; 5,632,009; 5,857,205; 5,880,742; 5,883,635; 6,085,202; 6,108,698; 6,247,008; 6,289,352; 6,300,957; 6,377,259; 6,490,593; 6,578,027; 6,581,068; 6,628,312; 6,654,761; 6,768,986; 6,772,409; 6,831,668; 6,882,998; 6,892,189; 6,901,555; 7,089,238; 7,107,266; 7,139,766; 7,178,099; 7,181,435; 7,181,440; 7,194,465; 7,222,130; 7,299,419; 7,320,122 and 7,356,779. SAP BusinessObjects and its logos, BusinessObjects, Crystal Reports®, Rapid Mart™, Data Insight™, Desktop Intelligence™, RapidMarts®, Watchlist Security™, Web Intelligence®, and Xcelsius® are trademarks or registered trademarks of Business Objects, an SAP company, and/or affiliated companies in the United States and/or other countries. SAP® is a registered trademark of SAP AG in Germany and/or other countries. All other names mentioned herein may be trademarks of their respective owners.
CONTENTS

About this Course
• Course introduction
• Course description
• Course audience
• Prerequisites
• Additional education
• Level, delivery, and duration
• Course success factors
• Course setup
• Course materials
• Learning process

Lesson 1: Capturing Changes in Data
• Lesson introduction
• Updating data over time
○ Explaining Slowly Changing Dimensions (SCD)
○ Updating changes to data
○ Explaining history preservation and surrogate keys
○ Comparing source-based and target-based CDC
• Using source-based CDC
○ Using source tables to identify changed data
○ Using CDC with timestamps
○ Managing overlaps
○ Activity: Using source-based CDC
• Using target-based CDC
○ Using target tables to identify changed data
○ Identifying history preserving transforms
○ Explaining the Table Comparison transform
○ Explaining the History Preserving transform
○ Explaining the Key Generation transform
○ Activity: Using target-based CDC
• Quiz: Capturing changes in data
• Lesson summary

Lesson 2: Using Data Integrator Transforms
• Lesson introduction
• Describing Data Integrator transforms
○ Defining Data Integrator transforms
• Using the Pivot transform
○ Explaining the Pivot transform
○ Activity: Using the Pivot transform
• Using the Hierarchy Flattening transform
○ Explaining the Hierarchy Flattening transform
○ Activity: Using the Hierarchy Flattening transform
• Describing performance optimization
○ Describing push-down operations
○ Viewing SQL generated by a data flow
○ Caching data
○ Slicing processes
• Using the Data Transfer transform
○ Explaining the Data Transfer transform
○ Activity: Using the Data Transfer transform
• Using the XML Pipeline transform
○ Explaining the XML Pipeline transform
○ Activity: Using the XML Pipeline transform
• Quiz: Using Data Integrator transforms
• Lesson summary

Answer Key
• Quiz: Capturing changes in data
• Quiz: Using Data Integrator transforms
AGENDA
SAP Data Services: Data Integrator Transforms

Introductions, Course Overview (30 minutes)

Lesson 1: Capturing Changes in Data (3 hours)
❒ Updating data over time
❒ Using source-based CDC
❒ Using target-based CDC

Lesson 2: Using Data Integrator Transforms (3 hours)
❒ Describing Data Integrator transforms
❒ Using the Pivot transform
❒ Using the Hierarchy Flattening transform
❒ Describing performance optimization
❒ Using the Data Transfer transform
❒ Using the XML Pipeline transform
About this Course
Course introduction
This section explains the conventions used in the course and in this training guide.
Course description
BusinessObjects™ Data Integrator XI 3.0/3.1 enables you to integrate disparate data sources to
deliver more timely and accurate data that end users in an organization can trust. In this
three-day course, you will learn about creating, executing, and troubleshooting batch jobs,
using functions, scripts and transforms to change the structure and formatting of data, handling
errors, and capturing changes in data.
As a business benefit, by being able to create efficient data integration projects, you can use
the transformed data to help improve operational and supply chain efficiencies, enhance
customer relationships, create new revenue opportunities, and optimize return on investment
from enterprise applications.
Course audience
The target audience for this course is individuals responsible for implementing, administering,
and managing data integration projects.
Prerequisites
To be successful, learners who attend this course should have experience with the following:

• Knowledge of data warehousing and ETL concepts
• Experience with MySQL and SQL language
• Experience using functions, elementary procedural programming, and flow-of-control statements such as If-Then-Else and While Loop statements

It is also recommended that you review the following articles, which can be found at http://www.rkimball.com/html/articles.html:

• Data Warehouse Fundamentals: TCO Starts with the End User and Fact Tables and Dimension Tables
• Data Warehouse Architecture and Modeling: There Are No Guarantees
• Advanced Dimension Table Topics: Surrogate Keys, It's Time for Time, and Slowly Changing Dimensions
• Industry- and Application-Specific Issues: Think Globally, Act Locally
• Data Staging and Data Quality: Dealing with Dirty Data
Additional education
To increase your skill level and knowledge of Data Services, the following courses are
recommended:
• BusinessObjects Data Quality XI 3.0/3.1: Core Concepts
• BusinessObjects Data Integrator XI R2 Accelerated: Advanced Workshop
Level, delivery, and duration
This instructor-led core offering is a three-day course.
Course success factors
Your learning experience will be enhanced by:
• Activities that build on the life experiences of the learner
• Discussion that connects the training to real working environments
• Learners and instructor working as a team
• Active participation by all learners
Course setup

Refer to the setup guide for details on hardware, software, and course-specific requirements.
Course materials
The materials included with the course are:
• Name card
• Learner’s Guide
The Learner’s Guide contains an agenda, learner materials, and practice activities.
The Learner’s Guide is designed to assist students who attend the classroom-based course and outlines what learners can expect to achieve by participating in this course.
• Evaluation form
At the conclusion of this course, you will receive an electronic feedback form as part of our
evaluation process. Provide feedback on the course content, instructor, and facility. Your
comments will assist us to improve future courses.
Additional resources include:
• Sample files
The sample files can include required files for the course activities and/or supplemental
content to the training guide.
• Online Help
Retrieve information and find answers to questions using the online Help and/or user’s
guide that are included with the product.
Learning process
Learning is an interactive process between the learners and the instructor. By facilitating a
cooperative environment, the instructor guides the learners through the learning framework.
Introduction
Why am I here? What’s in it for me?
The learners will be clear about what they are getting out of each lesson.
Objectives
How do I achieve the outcome?
The learners will assimilate new concepts and how to apply the ideas presented in the lesson.
This step sets the groundwork for practice.
Practice
How do I do it?
The learners will demonstrate their knowledge as well as their hands-on skills through theactivities.
Review
How did I do?
The learners will have an opportunity to review what they have learned during the lesson.
Review reinforces why it is important to learn particular concepts or skills.
Summary
Where have I been and where am I going?
The summary acts as a recap of the learning objectives and as a transition to the next section.
Lesson 1
Capturing Changes in Data
Lesson introduction
The design of your data warehouse must take into account how you are going to handle changes
in your target system when the respective data in your source system changes. Data Integrator
transforms provide you with a mechanism to do this.
After completing this lesson, you will be able to:
• Update data over time
• Use source-based CDC
• Use target-based CDC
Updating data over time
Introduction
Data Integrator transforms provide support for updating changing data in your data warehouse.
After completing this unit, you will be able to:
• Describe the options for updating changes to data
• Explain the purpose of Changed Data Capture (CDC)
• Explain the role of surrogate keys in managing changes to data
• Define the differences between source-based and target-based CDC
Explaining Slowly Changing Dimensions (SCD)
SCDs are dimensions that have data that changes over time. The following methods of handling SCDs are available:

Type 1: No history preservation
• Natural consequence of normalization.

Type 2: Unlimited history preservation and new rows
• New rows generated for significant changes.
• Requires use of a unique key. The key relates to facts/time.
• Optional Effective_Date field.

Type 3: Limited history preservation
• Two states of data are preserved: current and old.
• New fields are generated to store history data.
• Requires an Effective_Date field.
Because SCD Type 2 resolves most of the issues related to slowly changing dimensions, it is
explored last.
SCD Type 1
For an SCD Type 1 change, you find and update the appropriate attributes on a specific
dimensional record. For example, to update a record in the SALES_PERSON_DIMENSION
table to show a change to an individual’s SALES_PERSON_NAME field, you simply update
one record in the SALES_PERSON_DIMENSION table. This action would update or correct
that record for all fact records across time. In a dimensional model, facts have no meaning until
you link them with their dimensions. If you change a dimensional attribute without
appropriately accounting for the time dimension, the change becomes global across all fact
records.
This is the data before the change:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
This is the same table after the salesperson’s name has been changed:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Smith, John B Northwest
However, suppose a salesperson transfers to a new sales team. Updating the salesperson’s
dimensional record would update all previous facts so that the salesperson would appear to
have always belonged to the new sales team. This may cause issues in terms of reporting sales
numbers for both teams. If you want to preserve an accurate history of who was on which sales
team, Type 1 is not appropriate.
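The Type 1 overwrite described above amounts to a plain in-place UPDATE. As a minimal sketch (not the Data Services implementation), the snippet below uses Python's sqlite3 module with an in-memory stand-in for the SALES_PERSON_DIMENSION table; the table and column names simply mirror the example:

```python
import sqlite3

# In-memory stand-in for the SALES_PERSON_DIMENSION table from the example.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales_person_dimension (
    sales_person_key INTEGER PRIMARY KEY,
    sales_person_id  TEXT,
    name             TEXT,
    sales_team       TEXT)""")
con.execute("INSERT INTO sales_person_dimension "
            "VALUES (15, '000120', 'Doe, John B', 'Northwest')")

# SCD Type 1: overwrite the attribute in place -- no history is kept,
# so the change applies to all fact records across time.
con.execute("""UPDATE sales_person_dimension
               SET name = 'Smith, John B'
               WHERE sales_person_id = '000120'""")

print(con.execute("SELECT * FROM sales_person_dimension").fetchall())
# -> [(15, '000120', 'Smith, John B', 'Northwest')]
```

Note that the update leaves a single row behind: there is no way to tell the name ever differed, which is exactly why Type 1 cannot preserve history.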
SCD Type 3
To implement a Type 3 change, you change the dimension structure so that it renames the existing attribute and adds two attributes, one to record the new value and one to record the date of the change.
A Type 3 implementation has three disadvantages:
• You can preserve only one change per attribute, such as old and new or first and last.
• Each Type 3 change requires a minimum of one additional field per attribute and another
additional field if you want to record the date of the change.
• Although the dimension’s structure contains all the data needed, the SQL code required to
extract the information can be complex. Extracting a specific value is not difficult, but if you
want to obtain a value for a specific point in time or multiple attributes with separate old
and new values, the SQL statements become long and have multiple conditions.
In summary, SCD Type 3 can store a change in data, but it can neither accommodate multiple changes nor adequately serve the need for summary reporting.
This is the data before the change:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
This is the same table after the new dimensions have been added and the salesperson’s sales
team has been changed:
SALES_PERSON_ID NAME OLD_TEAM NEW_TEAM EFF_TO_DATE

00120 Doe, John B Northwest Northeast Oct_31_2004

SCD Type 2
With a Type 2 change, you do not need to make structural changes to the
SALES_PERSON_DIMENSION table. Instead, you add a record.
This is the data before the change:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
After you implement the Type 2 change, two records appear, as in the following table:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
133 000120 Doe, John B Southeast
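The Type 2 pattern above can be sketched in the same sqlite3 style (table names again mirror the example and are not part of Data Services): instead of updating in place, the job keeps the old row and adds a new one under a fresh surrogate key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales_person_dimension (
    sales_person_key INTEGER PRIMARY KEY,
    sales_person_id  TEXT,
    name             TEXT,
    sales_team       TEXT)""")
con.execute("INSERT INTO sales_person_dimension "
            "VALUES (15, '000120', 'Doe, John B', 'Northwest')")

# SCD Type 2: keep the old row and INSERT a new one with a fresh surrogate
# key. The value 133 mirrors the example table above.
con.execute("INSERT INTO sales_person_dimension "
            "VALUES (133, '000120', 'Doe, John B', 'Southeast')")

rows = con.execute("""SELECT sales_person_key, sales_team
                      FROM sales_person_dimension
                      WHERE sales_person_id = '000120'
                      ORDER BY sales_person_key""").fetchall()
print(rows)  # -> [(15, 'Northwest'), (133, 'Southeast')]
```

Both rows survive, so facts recorded before the move still join to the Northwest row while new facts join to the Southeast row.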
Updating changes to data
When you have a large amount of data to update regularly and a small amount of system down
time for scheduled maintenance on a data warehouse, you must choose the most appropriate
method for updating your data over time, also known as “delta load”. You can choose to do a
full refresh of your data or you can choose to extract only new or modified data and update
the target system:
• Full refresh: Full refresh is easy to implement and easy to manage. This method ensures that no data is overlooked or left out due to technical or programming errors. For an environment with a manageable amount of source data, full refresh is an easy method you can use to perform a delta load to a target system.
• Capturing only changes: After an initial load is complete, you can choose to extract only new or modified data and update the target system. Identifying and loading only changed data is called Changed Data Capture (CDC). CDC is recommended for large tables. If the tables that you are working with are small, you may want to consider reloading the entire table instead. Using CDC instead of doing a full refresh has two benefits:
○ It improves performance, because the job takes less time to process with less data to extract, transform, and load.
○ It allows the target system to track change history, so that data can be correctly analyzed over time. For example, if a sales person is assigned a new sales region, simply updating
the customer record to reflect the new region negatively affects any analysis by region
over time because the purchases made by that customer before the move are attributed
to the new region.
Explaining history preservation and surrogate keys
History preservation allows the data warehouse or data mart to maintain the history of data
in dimension tables so you can analyze it over time.
For example, if a customer moves from one sales region to another, simply updating the
customer record to reflect the new region would give you misleading results in an analysis by
region over time, because all purchases made by the customer before the move would incorrectly
be attributed to the new region.
The solution to this involves introducing a new record for the same customer that reflects the new sales region so that you can preserve the previous record. In this way, accurate reporting
is available for both sales regions. To support this, Data Services is set up to treat all changes
to records as INSERT rows by default.
However, you also need to manage the primary key constraint issues in your target tables that
arise when you have more than one record in your dimension tables for a single entity, such
as a customer or an employee.
For example, with your sales records, the Sales Rep ID is usually the primary key and is used
to link that record to all of the rep's sales orders. If you try to add a new record with the same
primary key, it will throw an exception. On the other hand, if you assign a new Sales Rep ID
to the new record for that rep, you will compromise your ability to report accurately on the rep's total sales.
To address this issue, you will create a surrogate key, which is a new column in the target table
that becomes the new primary key for the records. At the same time, you will change the
properties of the former primary key so that it is simply a data column.
When a new record is inserted for the same rep, a unique surrogate key is assigned allowing
you to continue to use the Sales Rep ID to maintain the link to the rep’s orders.
You can create surrogate keys either by using the gen_row_num or key_generation functions
in the Query transform to create a new output column that automatically increments whenever
a new record is inserted, or by using the Key Generation transform, which serves the same
purpose.
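As an illustration of the key-generation idea (not the actual key_generation implementation), a helper can read the current maximum surrogate key from the target table and hand out the next value, incrementing by 1 for each new record. The function and column names below are hypothetical:

```python
import sqlite3

def next_surrogate_key(con, table, key_column, increment=1):
    """Hypothetical stand-in for key_generation: return max(key) + increment,
    starting from 1 when the target table is empty."""
    row = con.execute(f"SELECT MAX({key_column}) FROM {table}").fetchone()
    last = row[0] if row[0] is not None else 0
    return last + increment

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee_dim "
            "(emp_surr_key INTEGER PRIMARY KEY, sales_rep_id TEXT)")
con.execute("INSERT INTO employee_dim VALUES (1, '000120')")

# A second record for the same Sales Rep ID gets a new surrogate key,
# so the former primary key can remain a plain data column.
key = next_surrogate_key(con, "employee_dim", "emp_surr_key")
con.execute("INSERT INTO employee_dim VALUES (?, '000120')", (key,))
print(key)  # -> 2
```

Because the surrogate key is unique while the Sales Rep ID repeats, the link from the rep's orders to the dimension is preserved without violating the primary key constraint.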
Comparing source-based and target-based CDC
Setting up a full CDC solution within Data Services may not be required. Many databases now
have CDC support built into them, such as Oracle, SQL Server, and DB2. Alternatively, you
could combine surrogate keys with the Map Operation transform to change all UPDATE row
types to INSERT row types to capture changes.
However, if you do want to set up a full CDC solution, there are two general incremental CDC
methods to choose from: source-based and target-based CDC.
Source-based CDC evaluates the source tables to determine what has changed and only extracts
changed rows to load into the target tables.
Target-based CDC extracts all the data from the source, compares the source and target rows
using table comparison, and then loads only the changed rows into the target.
Source-based CDC is almost always preferable to target-based CDC for performance reasons. However, some source systems do not provide enough information to make use of the
source-based CDC techniques. You will usually use a combination of the two techniques.
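The target-based comparison just described can be sketched in a few lines of Python. This is a minimal illustration of the diffing idea, not the Table Comparison transform itself; the field names are illustrative:

```python
def table_comparison(source_rows, target_rows, key="id"):
    """Diff source rows against target rows by primary key, keeping only
    the rows that would be inserted or updated in the target."""
    target_by_key = {row[key]: row for row in target_rows}
    changes = []
    for row in source_rows:
        existing = target_by_key.get(row[key])
        if existing is None:
            changes.append(("INSERT", row))
        elif existing != row:
            changes.append(("UPDATE", row))
        # Unchanged rows are discarded, so only the delta reaches the target.
    return changes

source = [{"id": 1, "region": "Northwest"}, {"id": 2, "region": "Southeast"}]
target = [{"id": 1, "region": "Northeast"}]
print(table_comparison(source, target))
# -> [('UPDATE', {'id': 1, 'region': 'Northwest'}),
#     ('INSERT', {'id': 2, 'region': 'Southeast'})]
```

Note that every source row must be read to compute the delta, which is why target-based CDC is usually slower than source-based CDC.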
Using source-based CDC
Introduction
Source-based CDC is the preferred method because it improves performance by extracting the
fewest rows.
After completing this unit, you will be able to:
• Define the methods of performing source-based CDC
• Explain how to use timestamps in source-based CDC
• Manage issues related to using timestamps for source-based CDC
Using source tables to identify changed data
Source-based CDC, sometimes also referred to as incremental extraction, extracts only the
changed rows from the source. To use source-based CDC, your source data must have some
indication of the change. There are two methods:
• Timestamps: You can use the timestamps in your source data to determine what rows have
been added or changed since the last time data was extracted from the source. To support
this type of source-based CDC, your database tables must have at least an update timestamp;
it is preferable to have a create timestamp as well.
• Change logs: You can also use the information captured by the RDBMS in the log files for the audit trail to determine what data has been changed.
Log-based data is more complex and is outside the scope of this course. For more information
on using logs for CDC, see “Techniques for Capturing Data”, in the Data Services Designer Guide.
Using CDC with timestamps
Timestamp-based CDC is an ideal solution to track changes if:
• There are date and time fields in the tables being updated.
• You are updating a large table that has a small percentage of changes between extracts and
an index on the date and time fields.
• You are not concerned about capturing intermediate results of each transaction between extracts (for example, if a customer changes regions twice in the same day).
It is not recommended that you use timestamp-based CDC if:
• You have a large table, a large percentage of which changes between extracts, and there is no index on the timestamps.
• You need to capture physical row deletes.
• You need to capture multiple events occurring on the same row between extracts.
Some systems have timestamps with dates and times, some with just the dates, and some with
monotonically-generated increasing numbers. You can treat dates and generated numbers in
the same manner. It is important to note that for timestamps based on real time, time zones
can become important. If you keep track of timestamps using the nomenclature of the source
system (that is, using the source time or source-generated number), you can treat both temporal
(specific time) and logical (time relative to another time or event) timestamps in the same way.
The basic technique for using timestamps is to add a column to your source and target tables
that tracks the timestamps of rows loaded in a job. When the job executes, this column is updated
along with the rest of the data. The next job then reads the latest timestamp from the target
table and selects only the rows in the source table for which the timestamp is later.
This example illustrates the technique. Assume that the last load occurred at 2:00 PM on January
1, 2008. At that time, the source table had only one row (key=1) with a timestamp earlier than
the previous load. Data Services loads this row into the target table with the original timestamp
of 1:10 PM on January 1, 2008. After 2:00 PM, Data Services adds more rows to the source table.
At 3:00 PM on January 1, 2008, the job runs again. The job:
1. Reads the Last_Update field from the target table (01/01/2008 01:10 PM).
2. Selects rows from the source table that have timestamps that are later than the value of
Last_Update. The SQL command to select these rows is:
SELECT * FROM Source WHERE Last_Update > '01/01/2008 01:10 pm'
This operation returns the second and third rows (key=2 and key=3).
3. Loads these new rows into the target table.
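The three steps above can be sketched end to end with Python's sqlite3 module. This is a simplified stand-in for the Data Services job, with timestamps stored as ISO-formatted strings so they compare correctly; the table layout simply mirrors the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (key INTEGER, last_update TEXT)")
con.execute("CREATE TABLE target (key INTEGER, last_update TEXT)")

# State after the 2:00 PM load: one row in the target, three in the source.
con.execute("INSERT INTO source VALUES "
            "(1, '2008-01-01 13:10'), "
            "(2, '2008-01-01 14:12'), "
            "(3, '2008-01-01 14:35')")
con.execute("INSERT INTO target VALUES (1, '2008-01-01 13:10')")

# Step 1: read the latest timestamp already loaded into the target.
last_update = con.execute("SELECT MAX(last_update) FROM target").fetchone()[0]

# Step 2: select only the source rows with a later timestamp.
new_rows = con.execute("SELECT * FROM source WHERE last_update > ? "
                       "ORDER BY last_update", (last_update,)).fetchall()

# Step 3: load the delta into the target.
con.executemany("INSERT INTO target VALUES (?, ?)", new_rows)
print([r[0] for r in new_rows])  # -> [2, 3]
```

Only the two rows added after the previous load (key=2 and key=3) are extracted and loaded, which is the whole point of the timestamp technique.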
For timestamped CDC, you must create a work flow that contains the following:
• A script that reads the target table and sets the value of a global variable to the latest
timestamp.
• A data flow that uses the global variable in a WHERE clause to filter the data.
The data flow contains a source table, a query, and a target table. The query extracts only those
rows that have timestamps later than the last update.
To set up a timestamp-based CDC delta job
1. In the Variables and Parameters dialog box, add a global variable called $G_Last_Update with a datatype of datetime to your job.
The purpose of this global variable is to store a string conversion of the timestamp for the
last time the job executed.
2. In the job workspace, add a script called GetTimestamp using the tool palette.
3. In the script workspace, construct an expression to do the following:
• Select the last time the job was executed from the last update column in the table.
• Assign the actual timestamp value to the $G_Last_Update global variable.
The script content depends on the RDBMS on which the status table resides. The following
is an example of the expression:
$G_Last_Update = sql('DEMO_Target', 'select max(last_update) from employee_dim');
4. Return to the job workspace.
5. Add a data flow to the right of the script using the tool palette.
6. In the data flow workspace, add the source, Query transform, and target objects and connect
them.
The target table for CDC cannot be a template table.
7. In the Query transform, add the columns from the input schema to the output schema as
required.
8. If required, in the output schema, right-click the primary key (if it is not already set to the
surrogate key) and clear the Primary Key option in the menu.
9. Right-click the surrogate key column and select the Primary Key option in the menu.
10. On the Mapping tab for the surrogate key column, construct an expression to use the
key_generation function to generate new keys based on that column in the target table,
incrementing by 1.
The script content depends on the RDBMS on which the status table resides. The following
is an example of the expression:
key_generation('DEMO_Target.demo_target.employee_dim', 'Emp_Surr_Key', 1)
11. On the WHERE tab, construct an expression to select only those records with a timestamp
that is later than the $G_Last_Update global variable.
The following is an example of the expression:
employee_dim.last_update > $G_Last_Update
12. Connect the GetTimestamp script to the data flow.
13. Validate and save all objects.
14. Execute the job.
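The logic of this delta job can be sketched outside Data Services. The following Python stand-in mirrors the script that reads the last-update timestamp and the WHERE clause that extracts only newer rows; the table contents, column names, and helper names here are all invented for illustration, not Data Services objects:

```python
from datetime import datetime

# Hypothetical in-memory stand-ins for the source and target tables.
source_rows = [
    {"employee_id": 1, "last_update": datetime(2008, 1, 1, 9, 0)},
    {"employee_id": 2, "last_update": datetime(2008, 1, 2, 14, 30)},
    {"employee_id": 3, "last_update": datetime(2008, 1, 3, 8, 15)},
]
target_rows = [{"employee_id": 1, "last_update": datetime(2008, 1, 1, 9, 0)}]

def get_last_update(rows):
    """Mimics the GetTimestamp script: select max(last_update) from the target."""
    return max(r["last_update"] for r in rows) if rows else datetime(1901, 1, 1)

def delta_extract(rows, last_update):
    """Mimics the WHERE clause: last_update > $G_Last_Update."""
    return [r for r in rows if r["last_update"] > last_update]

g_last_update = get_last_update(target_rows)
delta = delta_extract(source_rows, g_last_update)
# Only rows stamped after the previous load are extracted.
```

In the real job, the sql() function fetches the timestamp into the global variable and the Query transform's WHERE clause applies the filter on the database side.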
Managing overlaps
Unless source data is rigorously isolated during the extraction process (which typically is not
practical), there is a window of time when changes can be lost between two extraction runs.
This overlap period affects source-based CDC because this kind of data capture relies on a
static timestamp to determine changed data.
For example, suppose a table has 10,000 rows. If a change is made to one of the rows after it
was loaded but before the job ends, the second update can be lost.
There are three techniques for handling this situation:
• Overlap avoidance
• Overlap reconciliation
• Presampling
For more information see “Source-based and target-based CDC” in “Techniques for Capturing
Changed Data” in the Data Services Designer Guide.
Overlap avoidance
In some cases, it is possible to set up a system where there is no possibility of an overlap. You
can avoid overlaps if there is a processing interval where no updates are occurring on the target
system.

For example, if you can guarantee the data extraction from the source system does not last
more than one hour, you can run a job at 1:00 AM every night that selects only the data updated
the previous day until midnight. While this regular job does not give you up-to-the-minute
updates, it guarantees that you never have an overlap and greatly simplifies timestamp
management.
Overlap reconciliation
Overlap reconciliation requires a special extraction process that re-applies changes that could
have occurred during the overlap period. This extraction can be executed separately from the
regular extraction. For example, if the highest timestamp loaded from the previous job was
01/01/2008 10:30 PM and the overlap period is one hour, overlap reconciliation re-applies the
data updated between 9:30 PM and 10:30 PM on January 1, 2008.
The overlap period is usually equal to the maximum possible extraction time. If it can take up
to N hours to extract the data from the source system, an overlap period of N (or N plus a small
increment) hours is recommended. For example, if it takes at most two hours to run the job,
an overlap period of at least two hours is recommended.
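As a rough sketch, the reconciliation window can be computed from the highest timestamp loaded and the assumed maximum extraction time. This hypothetical Python helper reproduces the 9:30 PM to 10:30 PM example above:

```python
from datetime import datetime, timedelta

def reconciliation_window(highest_loaded, max_extraction_hours):
    """Overlap reconciliation re-applies changes stamped inside the overlap
    period: from (highest timestamp loaded - overlap period) up to the
    highest timestamp loaded. The overlap period is usually the maximum
    possible extraction time."""
    overlap = timedelta(hours=max_extraction_hours)
    return highest_loaded - overlap, highest_loaded

# Matches the example: highest loaded timestamp 01/01/2008 10:30 PM,
# overlap period of one hour.
start, end = reconciliation_window(datetime(2008, 1, 1, 22, 30), 1)
```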
Presampling
Presampling is an extension of the basic timestamp processing technique. The main difference
is that the status table contains both a start and an end timestamp, instead of the last update
timestamp. The start timestamp for presampling is the same as the end timestamp of the
previous job. The end timestamp for presampling is established at the beginning of the job. It
is the most recent timestamp from the source table, commonly set as the system date.
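A minimal sketch of presampling's status handling, with an invented status record, might look like this in Python:

```python
from datetime import datetime

# Hypothetical status record: in presampling it holds a start and an end
# timestamp rather than a single last-update value.
status = {"start": datetime(2008, 1, 1, 22, 0), "end": datetime(2008, 1, 2, 22, 0)}

def begin_presampling_run(status, source_max_timestamp):
    """At the start of a new run, the new start timestamp is the previous
    run's end timestamp; the new end timestamp is the most recent timestamp
    available from the source (commonly the system date)."""
    return {"start": status["end"], "end": source_max_timestamp}

next_run = begin_presampling_run(status, datetime(2008, 1, 3, 22, 0))
# The job then extracts rows with timestamps in (start, end].
```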
Activity: Using source-based CDC
You need to set up a job to update employee records in the Omega data warehouse whenever
they change. The employee records include timestamps to indicate when they were last updated,
so you can use source-based CDC.
Objective
• Use timestamps to enable changed data capture for employee records.
Instructions
1. In the Omega project, create a new batch job called Alpha_Employees_Dim_Job.
2. Add a global variable called $G_LastUpdate with a datatype of datetime to your job.
3. In the job workspace, add a script called GetTimestamp and construct an expression to do
the following:
• Select the last time the job was executed from the last update column in the employee
dimension table.
• If the last update column is NULL, assign a value of January 1, 1901 to the $G_LastUpdate
global variable. When the job executes for the first time for the initial load, this ensures
that all records are processed.
• If the last update column is not NULL, assign the actual timestamp value to the
$G_LastUpdate global variable.
The expression should be:
$G_LastUpdate = sql('omega', 'select max(LAST_UPDATE) from emp_dim');
if ($G_LastUpdate is null) $G_LastUpdate = to_date('1901.01.01', 'YYYY.MM.DD');
else print('Last update was ' || $G_LastUpdate);
4. In the job workspace, add a data flow called Alpha_Employees_Dim_DF and connect it to the
script.
5. Add the Employee table from the Alpha datastore as the source object and the Emp_Dim
table from the Omega datastore as the target object.
6. Add the Query transform and connect the objects.
7. In the transform editor for the Query transform, map the columns as follows:
Schema In Schema Out
EMPLOYEEID EMPLOYEEID
LASTNAME LASTNAME
FIRSTNAME FIRSTNAME
BIRTHDATE BIRTHDATE
HIREDATE HIREDATE
ADDRESS ADDRESS
PHONE PHONE
EMAIL EMAIL
REPORTSTO REPORTSTO
LastUpdate LAST_UPDATE
discharge_date DISCHARGE_DATE
8. Create a mapping expression for the SURR_KEY column that generates new keys based on
the Emp_Dim target table, incrementing by 1.
The expression should be:
key_generation('Omega.dbo.emp_dim', 'SURR_KEY', 1)
9. Create a mapping expression for the CITY column to look up the city name from the City
table in the Alpha datastore based on the city ID.
The expression should be:
lookup_ext([Alpha.source.city, 'PRE_LOAD_CACHE', 'MAX'],
[CITYNAME], [NULL], [CITYID, '=', employee.CITYID]) SET
("run_as_separate_process"='no')
10. Create a mapping expression for the REGION column to look up the region name from the
Region table in the Alpha datastore based on the region ID.
The expression should be:
lookup_ext([Alpha.source.region, 'PRE_LOAD_CACHE', 'MAX'],
[REGIONNAME], [NULL], [REGIONID, '=', employee.REGIONID]) SET
("run_as_separate_process"='no')
11. Create a mapping expression for the COUNTRY column to look up the country name from
the Country table in the Alpha datastore based on the country ID.
The expression should be:
lookup_ext([Alpha.source.country, 'PRE_LOAD_CACHE', 'MAX'],
[COUNTRYNAME], [NULL], [COUNTRYID, '=', employee.COUNTRYID]) SET
("run_as_separate_process"='no')
12. Create a mapping expression for the DEPARTMENT column to look up the department
name from the Department table in the Alpha datastore based on the department ID.
The expression should be:
lookup_ext([Alpha.source.department, 'PRE_LOAD_CACHE', 'MAX'],
[DEPARTMENTNAME], [NULL], [DEPARTMENTID, '=', employee.DEPARTMENTID]) SET
("run_as_separate_process"='no')
13. On the WHERE tab, construct an expression to select only those records with a timestamp
that is later than the $G_LASTUPDATE global variable.
The expression should be:
employee.LastUpdate > $G_LASTUPDATE
14. Execute Alpha_Employees_Dim_Job with the default execution properties and save all
objects you have created.

According to the log, the last update for the table was on 2007.12.27.
15. Return to the data flow workspace and view data for the target table. Sort the records by
the LAST_UPDATE column.
A solution file called SOLUTION_SourceCDC.atl is included in your Course Resources. To check
the solution, import the file and open it to view the data flow design and mapping logic. Do
not execute the solution job, as this may override the results in your target table.
Using target-based CDC
Introduction
Target-based CDC compares the source to the target to determine which records have changed.
After completing this unit, you will be able to:
• Define the Data Integrator transforms involved in target-based CDC
Using target tables to identi fy changed data
Source-based CDC evaluates the source tables to determine what has changed and only extracts
changed rows to load into the target tables. Target-based CDC, by contrast, extracts all the data
from the source, compares the source and target rows, and then loads only the changed rows
into the target with new surrogate keys.
Source-based changed-data capture is almost always preferable to target-based capture for
performance reasons; however, some source systems do not provide enough information to
make use of the source-based CDC techniques. Target-based CDC allows you to use the
technique when source-based change information is limited.
You can preserve history by creating a data flow that contains the following:
• A source table contains the rows to be evaluated.
• A Query transform maps columns from the source.
• A Table Comparison transform compares the data in the source table with the data in the
target table to determine what has changed. It generates a list of INSERT and UPDATE rows
based on those changes. This circumvents the default behavior in Data Services of treating
all changes as INSERT rows.
• A History Preserving transform converts certain UPDATE rows to INSERT rows based on
the columns in which values have changed. This produces a second row in the target instead
of overwriting the first row.
• A Key Generation transform generates new keys for the updated rows that are now flagged
as INSERT.
• A target table receives the rows. The target table cannot be a template table.
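The flow described above can be sketched in a few lines of Python. Everything here (table contents and column names such as surr_key) is invented for illustration; in Data Services these steps happen inside the Table Comparison, History Preserving, and Key Generation transforms:

```python
# Minimal end-to-end sketch of a target-based CDC flow with history preservation.
source = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
target = [{"surr_key": 1, "id": 1, "name": "Anne"}]

# Table Comparison: flag new keys as INSERT, changed rows as UPDATE,
# and drop unchanged rows.
flagged = []
tgt = {r["id"]: r for r in target}
for row in source:
    old = tgt.get(row["id"])
    if old is None:
        flagged.append(("INSERT", dict(row)))
    elif old["name"] != row["name"]:
        flagged.append(("UPDATE", dict(row)))

# History Preserving: here every UPDATE was flagged because a compared
# column changed, so each becomes an INSERT (a new history row).
flagged = [("INSERT", r) if op == "UPDATE" else (op, r) for op, r in flagged]

# Key Generation: new surrogate keys continue from the target's maximum key.
next_key = max(r["surr_key"] for r in target)
for op, row in flagged:
    if op == "INSERT":
        next_key += 1
        row["surr_key"] = next_key
```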
Identifying history preserving transforms
Data Services supports history preservation with three Data Integrator transforms:
Transform Description

History Preserving
Converts rows flagged as UPDATE to UPDATE plus INSERT, so that the original values are
preserved in the target. You specify the column in which to look for updated data.

Key Generation
Generates new keys for source data, starting from a value based on existing keys in the table
you specify.

Table Comparison
Compares two data sets and produces the difference between them as a data set with rows
flagged as INSERT and UPDATE.
Explaining the Table Comparison transform
The Table Comparison transform allows you to detect and forward changes that have occurred
since the last time a target was updated. This transform compares two data sets and produces
the difference between them as a data set with rows flagged as INSERT or UPDATE.
For example, the transform compares the input and comparison tables and determines that
row 10 has a new address, row 40 has a name change, and row 50 is a new record. The output
includes all three records, flagged as appropriate:
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Table Comparison transform. For more information on the Table
Comparison transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Input/output
The transform compares two data sets, one from the input to the transform (input data set),
and one from a database table specified in the transform (the comparison table). The transform
selects rows from the comparison table based on the primary key values from the input data
set. The transform compares columns that exist in the schemas for both inputs.
The input data set must be flagged as NORMAL.
The output data set contains only the rows that make up the difference between the tables. The
schema of the output data set is the same as the schema of the comparison table. No DELETE
operations are produced.
If a column has a date datatype in one table and a datetime datatype in the other, the transform
compares only the date section of the data. The columns can also be time and datetime datatypes,
in which case Data Integrator only compares the time section of the data.
For each row in the input data set, there are three possible outcomes from the transform:
• An INSERT row is added: The primary key value from the input data set does not match
a value in the comparison table. The transform produces an INSERT row with the values
from the input data set row.
If there are columns in the comparison table that are not present in the input data set, the
transform adds these columns to the output schema and fills them with NULL values.
• An UPDATE row is added: The primary key value from the input data set matches a value
in the comparison table, and values in the non-key compare columns differ in the
corresponding rows from the input data set and the comparison table.
The transform produces an UPDATE row with the values from the input data set row.
If there are columns in the comparison table that are not present in the input data set, the
transform adds these columns to the output schema and fills them with values from the
comparison table.
• The row is ignored: The primary key value from the input data set matches a value in the
comparison table, but the comparison does not indicate any changes to the row values.
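These three outcomes can be sketched as a small Python function. The row layout and names are invented; the real transform also handles generated keys, duplicate keys, and comparison methods:

```python
def table_comparison(input_rows, comparison, key):
    """Sketch of the three outcomes: INSERT when the primary key is new,
    UPDATE when the key exists and any compared value differs, and no
    output row when the data is unchanged. Rows are plain dicts keyed on
    the primary-key column `key`."""
    out = []
    target = {r[key]: r for r in comparison}
    for row in input_rows:
        existing = target.get(row[key])
        if existing is None:
            out.append(("INSERT", row))
        elif any(existing.get(c) != v for c, v in row.items() if c != key):
            out.append(("UPDATE", row))
        # identical rows are ignored: no output
    return out

comparison = [{"id": 10, "addr": "old"}, {"id": 40, "name": "Smith"}]
rows = [{"id": 10, "addr": "new"},   # changed -> UPDATE
        {"id": 40, "name": "Smith"}, # unchanged -> ignored
        {"id": 50, "name": "Jones"}] # new key -> INSERT
flagged = table_comparison(rows, comparison, "id")
```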
Options
The Table Comparison transform offers several options:
Option Description
Table name
Specifies the fully qualified name of the comparison table. This table must already be imported
into the repository. Table name is represented as datastore.owner.table, where datastore is the
name of the datastore Data Services uses to access the comparison table and owner depends
on the database type associated with the table.

Generated key column
Specifies a column in the comparison table. When there is more than one row in the
comparison table with a given primary key value, this transform compares the row with
the largest generated key value of these rows and ignores the other rows. This is optional.
Input contains duplicate keys
Provides support for input rows with duplicate primary key values.

Detect deleted row(s) from comparison table
Flags the transform to identify rows that have been deleted from the source.

Comparison method
Allows you to select the method for accessing the comparison table. You can select from
Row-by-row select, Cached comparison table, and Sorted input.

Input primary key column(s)
Specifies the columns in the input data set that uniquely identify each row. These columns
must be present in the comparison table with the same column names and datatypes.

Compare columns
Improves performance by comparing only the subset of columns you drag into this box from
the input schema. If no columns are listed, all columns in the input data set that are also in
the comparison table are used as compare columns. This is optional.
Explaining the History Preserving transform
The History Preserving transform ignores everything but rows flagged as UPDATE. For these
rows, it compares the values of specified columns and, if the values have changed, flags the
row as INSERT. This produces a second row in the target instead of overwriting the first row.
For example, a target table that contains employee information is updated periodically from a
source table. In this case, the Table Comparison transform has flagged the name change for
row 40 as an update. However, the History Preserving transform is set up to preserve history
on the LastName column, so the output changes the operation code for that record from
UPDATE to INSERT.
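A simplified Python sketch of this behavior follows, with invented row data and an assumed id key; the real transform also manages valid-date and flag columns:

```python
def history_preserving(flagged_rows, compare_columns, before_rows):
    """Sketch: rows flagged UPDATE whose compare columns changed relative
    to the before-image become INSERT (a second, history-preserving row in
    the target); other rows pass through unchanged."""
    out = []
    before = {r["id"]: r for r in before_rows}
    for op, row in flagged_rows:
        if op == "UPDATE":
            old = before.get(row["id"], {})
            if any(old.get(c) != row.get(c) for c in compare_columns):
                op = "INSERT"  # add a new row instead of overwriting
        out.append((op, row))
    return out

# Like the example above: the last name changed, so the UPDATE for row 40
# is converted to an INSERT.
before = [{"id": 40, "last_name": "Smith", "phone": "555-0100"}]
flagged = [("UPDATE", {"id": 40, "last_name": "Jones", "phone": "555-0100"})]
result = history_preserving(flagged, ["last_name"], before)
```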
The next section gives a brief description of the function, data input requirements, options, and
data output results for the History Preserving transform. For more information on the History
Preserving transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Input/output
The input data set is the result of a comparison between two versions of the same data in which
rows with changed data from the newer version are flagged as UPDATE rows and new data
from the newer version are flagged as INSERT rows.
The output data set contains rows flagged as INSERT or UPDATE.
Options
The History Preserving transform offers these options:
Option Description
Valid from
Specifies a date or datetime column from the source schema. Specify a Valid from date
column if the target uses an effective date to track changes in data.

Valid to
Specifies a date value in the following format: YYYY.MM.DD. The Valid to date cannot be
the same as the Valid from date.
Column
Specifies a column from the source schema that identifies the current valid row from a set
of rows with the same primary key. The flag column indicates whether a row is the most
current data in the target for a given primary key.

Set value
Defines an expression that outputs a value with the same datatype as the value in the Set
flag column. This value is used to update the current flag column in the new row added to
the target to preserve history of an existing row.

Reset value
Defines an expression that outputs a value with the same datatype as the value in the Reset
flag column. This value is used to update the current flag column in an existing row in the
target that included changes in one or more of the compare columns.
Preserve delete row(s) as update row(s)
Converts DELETE rows to UPDATE rows in the target. If you previously set effective date
values (Valid from and Valid to), sets the Valid to value to the execution date. This option
is used to maintain slowly changing dimensions by feeding a complete data set first through
the Table Comparison transform with its Detect deleted row(s) from comparison table option
selected.

Compare columns
Lists the column or columns in the input data set that are to be compared for changes.
• If the values in the specified compare columns in each version match, the transform flags
the row as UPDATE. The row from the before version is updated. The date and flag
information is also updated.
• If the values in each version do not match, the row from the latest version is flagged
as INSERT when output from the transform. This adds a new row to the warehouse with
the values from the new row.

Updates to non-history preserving columns update all versions of the row if the update is
performed on the natural key (for example, Customer), but only update the latest version
if the update is on the generated key (for example, GKey).
Explaining the Key Generation transform
The Key Generation transform generates new keys before inserting the data set into the target
in the same way as the key_generation function does. When it is necessary to generate artificial
keys in a table, this transform looks up the maximum existing key value from a table and uses
it as the starting value to generate new keys. The transform expects the generated key column
to be part of the input schema.
For example, suppose the History Preserving transform produces rows to add to a warehouse,
and these rows have the same primary key as rows that already exist in the warehouse. In this
case, you can add a generated key to the warehouse table to distinguish these two rows that
have the same primary key.
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Key Generation transform. For more information on the Key
Generation transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Input/output
The input data set is the result of a comparison between two versions of the same data in which
changed data from the newer version are flagged as UPDATE rows and new data from the
newer version are flagged as INSERT rows.
The output data set is a duplicate of the input data set, with the addition of key values in the
generated key column for input rows flagged as INSERT.
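A rough Python sketch of this behavior, with invented table data and column names:

```python
def key_generation(target_rows, flagged_rows, key_column, increment=1):
    """Sketch: look up the maximum existing key in the key source table and
    assign new keys, starting past it, to rows flagged INSERT. Rows flagged
    UPDATE keep their existing key column untouched."""
    next_key = max((r.get(key_column, 0) or 0) for r in target_rows) if target_rows else 0
    out = []
    for op, row in flagged_rows:
        if op == "INSERT":
            next_key += increment
            row = {**row, key_column: next_key}
        out.append((op, row))
    return out

target = [{"surr_key": 7, "id": 1}, {"surr_key": 8, "id": 2}]
flagged = [("INSERT", {"id": 3}), ("UPDATE", {"surr_key": 7, "id": 1})]
keyed = key_generation(target, flagged, "surr_key")
# The INSERT row gets the next key after the target's maximum (8).
```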
Options
The Key Generation transform offers these options:
Option Description
Table name
Specifies the fully qualified name of the source table from which the maximum existing
key is determined (key source table).
This table must already be imported into the repository. Table name is represented as
datastore.owner.table, where datastore is the name of the datastore Data Services uses to
access the key source table and owner depends on the database type associated with the
table.

Generated key column
Specifies the column in the key source table containing the existing key values. A column
with the same name must exist in the input data set; the new keys are inserted in this
column.
Increment values
Indicates the interval between generated key values.

Activity: Using target-based CDC

You need to set up a job to update product records in the Omega data warehouse whenever
they change. The product records do not include timestamps to indicate when they were last
updated, so you must use target-based CDC to extract all records from the source and compare
them to the target.
Objective
• Use target-based CDC to preserve history for the Product dimension.
Instructions
1. In the Omega project, create a new batch job called Alpha_Product_Dim_Job with a data
flow called Alpha_Product_Dim_DF.
2. Add the Product table from the Alpha datastore as the source object and the Prod_Dim table
from the Omega datastore as the target object.
3. Add the Query, Table Comparison, History Preserving, and Key Generation transforms.
4. Connect the source table to the Query transform and the Query transform to the target table
to set up the schema prior to configuring the rest of the transforms.
5. In the transform editor for the Query transform, map the columns as follows:
Schema In Schema Out
PRODUCTID PRODUCTID
PRODUCTNAME PRODUCTNAME
CATEGORYID CATEGORYID
COST COST
6. Until the key can be generated, specify a mapping expression for the SURR_KEY column
to populate it with NULL.
7. Specify a mapping expression for the EFFECTIVE_DATE column to indicate the current
date as sysdate().
8. Delete the link from the Query transform to the target table.
9. Connect the transforms in the following order: Query, Table Comparison, History Preserving,
and Key Generation.
10. Connect the Key Generation transform to the target table.
11. In the transform editor for the Table Comparison transform, use the Prod_Dim table in the
Omega datastore as the comparison table and set Surr_Key as the generated key column.
12. Set the input primary key column to PRODUCTID, and compare the PRODUCTNAME,
CATEGORYID, and COST columns.
13. Do not configure the History Preserving transform.
14. In the transform editor for the Key Generation transform, set up key generation based on
the Surr_Key column of the Prod_Dim table in the Omega datastore, incrementing by 1.
15. In the workspace, before executing the job, display the data in both the source and target
tables.
Note that the OmegaSoft product has been added in the source, but has not yet been updated
in the target.
16. Execute Alpha_Product_Dim_Job with the default execution properties and save all objects
you have created.
17. Return to the data flow workspace and view data for the target table.

Note that the new records were added for product IDs 2, 3, 6, 8, and 13, and that OmegaSoft
has been added to the target.
A solution file called SOLUTION_TargetCDC.atl is included in your Course Resources. To check
the solution, import the file and open it to view the data flow design and mapping logic. Do
not execute the solution job, as this may override the results in your target table.
Quiz: Capturing changes in data
1. What are the two most important reasons for using CDC?
2. Which method of CDC is preferred for the performance gain of extracting the fewest rows?
3. What is the difference between an initial load and a delta load?
4. What transforms do you typically use for target-based CDC?
Lesson summary
After completing this lesson, you are now able to:
• Update data over time
• Use source-based CDC
• Use target-based CDC
Lesson 2
Using Data Integrator Transforms
Lesson introduction
Data Integrator transforms are used to enhance your data integration projects beyond the core
functionality of the platform transforms.
After completing this lesson, you will be able to:
• Describe the Data Integrator transforms
• Use the Pivot transform
• Use the Hierarchy Flattening transform
• Describe performance optimization
• Use the Data Transfer transform
• Use the XML Pipeline transform
Describing Data Integrator transforms
Introduction
Data Integrator transforms perform key operations on data sets to manipulate their structure
as they are passed from source to target.
After completing this unit, you will be able to:
• Describe Data Integrator transforms available in Data Services
Defining Data Integrator transforms
The following transforms are available in the Data Integrator branch of the Transforms tab in
the Local Object Library:
Transform Description

Data Transfer
Allows a data flow to split its processing into two sub-data flows and push down
resource-consuming operations to the database server.

Date Generation
Generates a column filled with date values based on the start and end dates and increment
you specify.

Effective Date
Generates an additional effective-to column based on the primary key's effective date.

Hierarchy Flattening
Flattens hierarchical data into relational tables so that it can participate in a star schema.
Hierarchy flattening can be both vertical and horizontal.

Map CDC Operation
Sorts input data, maps output data, and resolves before and after versions for UPDATE rows.
While commonly used to support Oracle or mainframe changed data capture, this transform
supports any data stream if its input requirements are met.

Pivot
Rotates the values in specified columns to rows.

Reverse Pivot
Rotates the values in specified rows to columns.
XML Pipeline
Processes large XML inputs in small batches.
Using the Pivot transform
Introduction
The Pivot and Reverse Pivot transforms let you convert columns to rows and rows back into
columns.
After completing this unit, you will be able to:
• Use the Pivot transform
Explaining the Pivot transform
The Pivot transform creates a new row for each value in a column that you identify as a pivot
column.
It allows you to change how the relationship between rows is displayed. For each value in each
pivot column, Data Services produces a row in the output data set. You can create pivot sets
to specify more than one pivot column.
For example, you could produce a list of discounts by quantity for certain payment terms so
that each type of discount is listed as a separate record, rather than each being displayed in a
unique column.
The Reverse Pivot transform reverses the process, converting rows into columns.
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Pivot transform. For more information on the Pivot transform see
“Transforms” Chapter 5 in the Data Services Reference Guide.
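Conceptually, the rotation works as in this Python sketch. The column names Comp and Comp_Type are borrowed from the activity later in this unit, and the sequence numbering is illustrative:

```python
def pivot(rows, non_pivot_columns, pivot_columns,
          data_field="Comp", header_column="Comp_Type",
          sequence_column="Pivot_Seq"):
    """Sketch of the Pivot transform: one output row per pivot column per
    input row. The header column records which source column the value in
    the data field came from."""
    out = []
    for row in rows:
        for seq, col in enumerate(pivot_columns):
            new_row = {c: row[c] for c in non_pivot_columns}
            new_row[sequence_column] = seq
            new_row[header_column] = col
            new_row[data_field] = row[col]
            out.append(new_row)
    return out

rows = [{"EmployeeID": 1, "Emp_Salary": 50000, "Emp_Bonus": 2000}]
pivoted = pivot(rows, ["EmployeeID"], ["Emp_Salary", "Emp_Bonus"])
# One input row becomes two output rows, one per pivoted column.
```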
Inputs/Outputs
Data inputs include a data set with rows flagged as NORMAL.
Data outputs include a data set with rows flagged as NORMAL. This target includes the
non-pivoted columns, a column for the sequence number, the data field column, and the pivot
header column.
Options
The Pivot transform offers several options:
Option Description
Pivot sequence column
Assign a name to the sequence number column. For each row created from a pivot column,
Data Services increments and stores a sequence number.

Non-pivot columns
Select the columns in the source that are to appear in the target without modification.

Pivot set
Identify a number for the pivot set. For each pivot set, you define a group of pivot columns,
a pivot data field, and a pivot header name.

Data field column
Specify the column that contains the pivoted data. This column contains all of the pivot
columns' values.

Header column
Specify the name of the column that contains the pivoted column names. This column lists
the names of the columns where the corresponding data originated.

Pivot columns
Select the columns to be rotated into rows. Describe these columns in the Header column.
Describe the data in these columns in the Data field column.
To pivot a table
1. Open the data flow workspace.
2. Add your source object to the workspace.
3. On the Transforms tab of the Local Object Library, click and drag the Pivot or Reverse Pivot
transform to the workspace to the right of your source object.
4. Add your target object to the workspace.
5. Connect the source object to the transform.
6. Connect the transform to the target object.
7. Double-click the Pivot transform to open the transform editor.
8. Click and drag any columns that will not be changed by the transform from the input schema
area to the Non-Pivot Columns area.
9. Click and drag any columns that will be pivoted from the input schema area to the Pivot
Columns area.
If required, you can create more than one pivot set by clicking Add.
10. If desired, change the values in the Pivot sequence column, Data field column, and Header
column fields.
These are the new columns that will be added to the target object by the transform.
11. Click Back to return to the data flow workspace.
Activity: Using the Pivot transform
Currently, employee compensation information is loaded into a table with a separate column
each for salary, bonus, and vacation days. For reporting purposes, you need each of these
items to be a separate record in the HR datamart.
Objective
• Use the Pivot transform to create a separate row for each entry in a new employee
compensation table.
Instructions
1. In the Omega project, create a new batch job called Alpha_HR_Comp_Job with a data flow called Alpha_HR_Comp_DF.
2. Add the HR_Comp_Update table from the Alpha datastore to the workspace as the source
object.
3. Add the Pivot transform and connect it to the source object.
4. Add the Query transform and connect it to the Pivot transform.
5. Create a new template table called Employee_Comp in the Delta datastore as the target object.
6. Connect the Query transform to the new template table.
7. In the transform editor for the Pivot transform, specify that the EmployeeID and date_updated fields are non-pivot columns.
8. Specify that the Emp_Salary, Emp_Bonus, and Emp_VacationDays fields are pivot columns.
9. Specify that the data field column is called Comp, and the header column is called Comp_Type.
10. In the transform editor for the Query transform, map all fields from input schema to output
schema.
11. On the WHERE tab, filter out NULL values for the Comp column.
The expression should be as follows:
Pivot.Comp is not null
12. Execute Alpha_HR_Comp_Job with the default execution properties and save all objects
you have created.
13. Return to the data flow workspace and view data for the target table.
A solution file called SOLUTION_Pivot.atl is included in your Course Resources. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may override the results in your target table.
Using the Hierarchy Flattening transform
Introduction
The Hierarchy Flattening transform enables you to break down hierarchical table structures
into a single table to speed up data access.
After completing this unit, you will be able to:
• Use the Hierarchy Flattening transform
Explaining the Hierarchy Flattening transform
The Hierarchy Flattening transform constructs a complete hierarchy from parent/child relationships, and then produces a description of the hierarchy in horizontally- or vertically-flattened format.
For horizontally-flattened hierarchies, each row of the output describes a single node in the
hierarchy and the path to that node from the root.
For vertically-flattened hierarchies, each row of the output describes a single relationship between ancestor and descendant and the number of nodes the relationship includes. There is a row in the output for each node and each of the descendants of that node. Each node is considered its own descendant and, therefore, is listed once as both ancestor and descendant.
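A minimal sketch of vertical flattening in Python (not Data Services code; the node names are made up) shows how the ancestor/descendant rows, including each node's self-relationship at depth 0, are produced:

```python
# Vertically flatten parent-child rows into (ancestor, descendant, depth) rows.
# Each node appears as its own descendant at depth 0, matching the text above.
def flatten(edges):
    children = {}
    nodes = set()
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        nodes.update((parent, child))
    out = []
    for root in nodes:
        stack = [(root, 0)]          # walk the subtree under each node
        while stack:
            node, depth = stack.pop()
            out.append((root, node, depth))
            for c in children.get(node, []):
                stack.append((c, depth + 1))
    return out

rows = flatten([("CEO", "Manager"), ("Manager", "Analyst")])
```

With the flattened rows in place, "all reports to a given manager, direct or indirect" becomes a simple filter on the ancestor column instead of a recursive query.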
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Hierarchy Flattening transform. For more information on the
Hierarchy Flattening transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Inputs/Outputs
Data input includes rows describing individual parent-child relationships. Each row must contain two columns that function as the keys of the parent and child in the relationship. The input can also include columns containing attributes describing the parent and/or child.
The input data set cannot include rows with operations other than NORMAL, but can contain
hierarchical data.
For a listing of the target columns, consult the Data Services Reference Guide.
Options
The Hierarchy Flattening transform offers several options:
Option Description
Parent column: Identifies the column of the source data that contains the parent identifier in each parent-child relationship.
Child column: Identifies the column in the source data that contains the child identifier in each parent-child relationship.
Flattening type: Indicates how the hierarchical relationships are described in the output.
Use maximum length paths: Indicates whether longest or shortest paths are used to describe relationships between descendants and ancestors when the descendant has more than one parent.
Maximum depth: Indicates the maximum depth of the hierarchy.
Parent attribute list: Identifies a column or columns that are associated with the parent column.
Child attribute list: Identifies a column or columns that are associated with the child column.
Run as a separate process: Creates a separate sub-data flow process for the Hierarchy Flattening transform when Data Services executes the data flow.
Activity: Using the Hierarchy Flattening transform
The Employee table in the Alpha datastore contains employee data in a recursive hierarchy.
To determine all reports, direct or indirect, to a given executive or manager would require
complex SQL statements to traverse the hierarchy.
Objective
• Flatten the hierarchy to allow more efficient reporting on data.
Instructions
1. In the Omega project, create a new batch job called Alpha_Employees_Report_Job with a data flow called Alpha_Employees_Report_DF.
2. In the data flow workspace, add the Employee table from the Alpha datastore as the source
object.
3. Create a template table called Manager_Emps in the HR_datamart datastore as the target
object.
4. Add a Hierarchy Flattening transform to the right of the source table and connect the source
table to the transform.
5. In the transform editor for the Hierarchy Flattening transform, select the following options:
Option Value
Flattening Type Vertical
Parent Column REPORTSTO
Child Column EMPLOYEEID
Child Attribute List
LASTNAME
FIRSTNAME
BIRTHDATE
HIREDATE
ADDRESS
CITYID
REGIONID
COUNTRYID
PHONE
EMAIL
DEPARTMENTID
LastUpdate
discharge_date
6. Add a Query transform to the right of the Hierarchy Flattening transform and connect the transforms.
7. In the transform editor of the Query transform, create the following output columns:
Column Datatype
MANAGERID varchar(10)
MANAGER_NAME varchar(50)
EMPLOYEEID varchar(10)
Column Datatype
EMPLOYEE_NAME varchar(102)
DEPARTMENT varchar(50)
HIREDATE datetime
LASTUPDATE datetime
PHONE varchar(20)
EMAIL varchar(50)
ADDRESS varchar(200)
CITY varchar(50)
REGION varchar(50)
COUNTRY varchar(50)
DISCHARGE_DATE datetime
DEPTH int
ROOT_FLAG int
LEAF_FLAG int
8. Map the output columns as follows:
Schema In Schema Out
ANCESTOR MANAGERID
DESCENDENT EMPLOYEEID
Schema In Schema Out
DEPTH DEPTH
ROOT_FLAG ROOT_FLAG
LEAF_FLAG LEAF_FLAG
C_ADDRESS ADDRESS
C_discharge_date DISCHARGE_DATE
C_EMAIL EMAIL
C_HIREDATE HIREDATE
C_LastUpdate LASTUPDATE
C_PHONE PHONE
9. Create a mapping expression for the MANAGER_NAME column to look up the manager's
last name from the Employee table in the Alpha datastore based on the employee ID in the
ANCESTOR column of the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.source.employee,'PRE_LOAD_CACHE','MAX'], [LASTNAME], [NULL], [EMPLOYEEID,'=',Hierarchy_Flattening.ANCESTOR]) SET ("run_as_separate_process"='no')
10. Create a mapping expression for the EMPLOYEE_NAME column to concatenate the
employee's last name and first name, separated by a comma.
The expression should be:
Hierarchy_Flattening.C_LASTNAME || ', ' || Hierarchy_Flattening.C_FIRSTNAME
11. Create a mapping expression for the DEPARTMENT column to look up the name of the
employee's department from the Department table in the Alpha datastore based on the
C_DEPARTMENTID column of the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.source.department,'PRE_LOAD_CACHE','MAX'], [DEPARTMENTNAME], [NULL], [DEPARTMENTID,'=',Hierarchy_Flattening.C_DEPARTMENTID]) SET ("run_as_separate_process"='no')
12. Create a mapping expression for the CITY column to look up the name of the employee's
city from the City table in the Alpha datastore based on the C_CITYID column of the
Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.source.city,'PRE_LOAD_CACHE','MAX'], [CITYNAME], [NULL], [CITYID,'=',Hierarchy_Flattening.C_CITYID]) SET ("run_as_separate_process"='no')
13. Create a mapping expression for the REGION column to look up the name of the employee's
region from the Region table in the Alpha datastore based on the C_REGIONID column of
the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.source.region,'PRE_LOAD_CACHE','MAX'], [REGIONNAME], [NULL], [REGIONID,'=',Hierarchy_Flattening.C_REGIONID]) SET ("run_as_separate_process"='no')
14. Create a mapping expression for the COUNTRY column to look up the name of the
employee's country from the Country table in the Alpha datastore based on the
C_COUNTRYID column of the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.source.country,'PRE_LOAD_CACHE','MAX'], [COUNTRYNAME], [NULL], [COUNTRYID,'=',Hierarchy_Flattening.C_COUNTRYID]) SET ("run_as_separate_process"='no')
15. Add a WHERE clause to the Query transform to return only rows where the depth is greater
than zero.
The expression should be as follows:
Hierarchy_Flattening.DEPTH > 0
16. Execute Alpha_Employees_Report_Job with the default execution properties and save all objects you have created.
17. Return to the data flow workspace and view data for the target table.
Note that 179 rows were written to the target table.
A solution file called SOLUTION_HierarchyFlattening.atl is included in your Course Resources.
To check the solution, import the file and open it to view the data flow design and mapping
logic. Do not execute the solution job, as this may override the results in your target table.
Describing performance optimization
Introduction
You can improve the performance of your jobs by pushing down operations to the source or
target database to reduce the number of rows and operations that the engine must retrieve and
process.
After completing this unit, you will be able to:
• List operations that Data Services pushes down to the database
• View SQL generated by a data flow
• Explore data caching options
• Explain process slicing
Describing push-down operations
Data Services examines the database and its environment when determining which operations
to push down to the database:
• Full push-down operations
The Data Services optimizer always tries to perform a full push-down operation. Full push-down operations are operations that can be pushed down to the databases so that the data streams directly from the source database to the target database. For example, Data Services sends an INSERT INTO ... SELECT statement to the target database, and it sends a SELECT statement to retrieve data from the source.
Data Services can only perform full push-down operations to the source and target databases when the following conditions are met:
○ All of the operations between the source table and target table can be pushed down.
○ The source and target tables are from the same datastore, or they are in datastores that have a database link defined between them.
• Partial push-down operations
When a full push-down operation is not possible, Data Services tries to push down the SELECT statement to the source database. Operations within the SELECT statement that can be pushed down to the database include:
Operation Description
Aggregations: Aggregate functions, typically used with a GROUP BY statement, always produce a data set smaller than or the same size as the original data set.
Distinct rows: Data Services outputs only unique rows when you use distinct rows.
Filtering: Filtering can produce a data set smaller than or equal to the original data set.
Joins: Joins typically produce a data set smaller than or similar in size to the original tables.
Ordering: Ordering does not affect data set size. Data Services can efficiently sort data sets that fit in memory. Since Data Services does not perform paging (writing out intermediate results to disk), it is recommended that you use a dedicated disk-sorting program such as SyncSort or the DBMS itself to order very large data sets.
Projections: A projection normally produces a smaller data set because it only returns columns referenced by a data flow.
Functions: Most Data Services functions that have equivalents in the underlying database are appropriately translated.
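The idea of a full push-down can be sketched with SQLite standing in for a real source/target database (the table and column names are illustrative). The single INSERT INTO ... SELECT statement performs the aggregation, filtering, and load entirely inside the database, so no rows travel through the engine:

```python
import sqlite3

# Source and target tables share one database, satisfying the
# "same datastore" condition for a full push-down.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE tgt (id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, 10), (1, 20), (2, 5)])

# Full push-down: aggregation, filtering, and the load happen in a
# single statement executed entirely by the database server.
conn.execute("""
    INSERT INTO tgt (id, total)
    SELECT id, SUM(amount)
    FROM src
    GROUP BY id
    HAVING SUM(amount) > 6
""")
rows = conn.execute("SELECT id, total FROM tgt ORDER BY id").fetchall()
```

Compare this with a partial push-down, where only the SELECT goes to the source database and the engine still fetches the result rows before loading the target.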
Operations that cannot be pushed down
Data Services cannot push some transform operations to the database. For example:
• Expressions that include Data Services functions that do not have database correspondents.
• Load operations that contain triggers.
• Transforms other than Query.
• Joins between sources that are on different database servers that do not have database links
defined between them.
Similarly, not all operations can be combined into single requests. For example, when a stored
procedure contains a COMMIT statement or does not return a value, you cannot combine the
stored procedure SQL with the SQL for other operations in a query. You can only push
operations supported by the RDBMS down to that RDBMS.
Note: You cannot push built-in functions or transforms to the source database. For best
performance, do not intersperse built-in transforms among operations that can be pushed down
to the database. Database-specific functions can only be used in situations where they will be
pushed down to the database for execution.
Viewing SQL generated by a data flow
Before running a job, you can view the SQL generated by the data flow and adjust your design to maximize the SQL that is pushed down and improve performance. Alter your design to improve the data flow when necessary.
Keep in mind that Data Services only shows the SQL generated for table sources. Data Services
does not show the SQL generated for SQL sources that are not table sources, such as the lookup
function, the Key Generation transform, the key_generation function, the Table Comparison
transform, and target tables.
To view SQL
1. In the Data Flows tab of the Local Object Library, right-click the data flow and select Display
Optimized SQL from the menu.
The Optimized SQL dialog box displays.
2. In the left pane, select the datastore for the data flow.
The optimized SQL for the datastore displays in the right pane.
Caching data
You can improve the performance of data transformations that occur in memory by caching
as much data as possible. By caching data, you limit the number of times the system must
access the database. Cached data must fit into available memory.
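The principle can be sketched in a few lines of Python: a lookup cache turns repeated database round trips into local reads. The lookup function here is a stand-in for a database query, not a Data Services API:

```python
# Count how many times the "database" is actually hit.
calls = {"n": 0}

def db_lookup(key):
    calls["n"] += 1          # each call simulates one round trip to the database
    return key.upper()

cache = {}

def cached_lookup(key):
    if key not in cache:     # go to the database only on a cache miss
        cache[key] = db_lookup(key)
    return cache[key]

# Five lookups, but only two distinct keys, so only two database hits.
results = [cached_lookup(k) for k in ["a", "b", "a", "a", "b"]]
```

The trade-off stated above applies directly: the dictionary must fit in available memory, which is what the pageable and persistent cache options below address for larger data sets.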
Pageable caching
Data Services allows administrators to select a pageable cache location to save content over the 2 GB RAM limit. The pageable cache location is set up in the Server Manager, and the option to use pageable cache is selected on the Dataflow Properties dialog box.
Persistent caching
Persistent cache datastores can be created through the Create New Datastore dialog box by
selecting Persistent Cache as the database type. The newly-created persistent cache datastore
will appear in the list of datastores, and can be used as a source in jobs.
For more information about advanced caching features, see the Data Services Performance
Optimization Guide.
Slicing processes
You can also optimize your jobs through process slicing, which involves splitting data flows
into sub-data flows.
Sub-data flows work on smaller data sets and/or fewer transforms so there is less virtual
memory to consume per process. This way, you can leverage more physical memory per data
flow as each sub-data flow can access 2 GB of memory.
This functionality is available through the Advanced tab for the Query transform. You can run
each memory-intensive operation as a separate process.
For more information on process slicing, see the Data Services Performance Optimization Guide.
Using the Data Transfer transform
Introduction
The Data Transfer transform allows a data flow to split its processing into two sub-data flows
and push down resource-consuming operations to the database server.
After completing this unit, you will be able to:
• Use the Data Transfer transform
Explaining the Data Transfer transform
The Data Transfer transform moves data from a source, or the output from another transform, into a transfer object and subsequently reads data from the transfer object. You can use the
Data Transfer transform to push down resource-intensive database operations that occur
anywhere within the data flow. The transfer type can be a relational database table, persistent
cache table, file, or pipeline.
Use the Data Transfer transform to:
• Push down operations to the database server when the transfer type is a database table. You
can push down resource-consuming operations such as joins, GROUP BY, and sorts.
• Define points in your data flow where you want to split processing into multiple sub-data
flows that each process part of the data. Data Services does not need to process the entire
input data set in memory at one time. Instead, the Data Transfer transform splits the processing among multiple sub-data flows that each use a portion of memory.
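The splitting idea can be sketched in Python, with each chunk standing in for a sub-data flow that only ever holds its own portion of the data (the chunk size and data are illustrative):

```python
# Process a large input in fixed-size portions so that no single step
# needs the whole data set in memory at once, mimicking the split of a
# data flow into sub-data flows.
def process_in_chunks(rows, chunk_size, fn):
    out = []
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]   # each chunk = one "sub-data flow"
        out.extend(fn(chunk))            # only this portion is held in memory
    return out

doubled = process_in_chunks(list(range(10)), 4, lambda chunk: [x * 2 for x in chunk])
```

In Data Services the split point is the Data Transfer transform itself; this sketch only shows why bounding the working set per process keeps memory use flat.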
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Data Transfer transform. For more information on the Data Transfer
transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Inputs/Outputs
When the input data set for the Data Transfer transform is a table or file transfer type, the rows must be flagged with the NORMAL operation code. When the input data set is a pipeline transfer type, the rows can be flagged with any operation code.
The input data set must not contain hierarchical (nested) data.
Output data sets have the same schema and the same operation code as the input data sets. In the push-down scenario, the output rows are in the sort or GROUP BY order.
The sub-data flow names use the following format, where n is the number of the data flow:
dataflowname_n
The execution of the output depends on the temporary transfer type:
For Table or File temporary transfer types, Data Services automatically splits the data flow into
sub-data flows and executes them serially.
For Pipeline transfer types, Data Services splits the data flow into sub-data flows if you specify
the Run as a separate process option in another operation in the data flow. Data Services
executes these sub-data flows that use pipeline in parallel.
Activity: Using the Data Transfer transform
The Data Transfer transform can be used to push data down to a database table so that it can
be processed by the database server rather than the Data Services Job Server. In this activity,
you will join data from two database schemas. When the Data Transfer transform is not used,
the join will occur on the Data Services Job Server. When the Data Transfer transform is added to the data flow, the join can be seen in the SQL query by displaying the optimized SQL for the data flow.
Objective
• Use the Data Transfer transform to optimize performance.
Instructions
1. In the Omega project, create a new batch job called No_Data_Transfer_Job with a data flow called No_Data_Transfer_DF.
2. In the Delta datastore, import the Employee_Comp table and add it to the
No_Data_Transfer_DF workspace as a source table.
3. Add the Employee table from the Alpha datastore as a source table.
4. Add a Query transform to the data flow workspace and attach both source tables to the
transform.
5. In the transform editor for the Query transform, add the LastName and BirthDate columns
from the Employee table and the Comp_Type and Comp columns from the Employee_Comp
table to the output schema.
6. Add a WHERE clause to join the tables on the EmployeeID columns.
7. Create a template table called Employee_Temp in the Delta datastore as the target object
and connect it to the Query transform.
8. Save the job.
9. In the Local Object Library, use the right-click menu for the No_Data_Transfer_DF data
flow to display the optimized SQL.
Note that the WHERE clause does not appear in either SQL statement.
10. In the Local Object Library, replicate the No_Data_Transfer_DF data flow and rename the
copy Data_Transfer_DF.
11. In the Local Object Library, replicate the No_Data_Transfer_Job job and rename the copy
Data_Transfer_Job.
12. Add the Data_Transfer_Job job to the Omega project.
13. Delete the No_Data_Transfer_DF data flow from the Data_Transfer_Job and add the
Data_Transfer_DF data flow to the job by dragging it from the Local Object Library to the
job's workspace.
14. Delete the connection between the Employee_Comp table and the Query transform.
15. Add a Data Transfer transform between the Employee_Comp table and the Query transform
and connect the three objects.
16. In the transform editor for the Data Transfer transform, select the Table option for the Transfer Type field.
17. In the Table Options section, click the ellipsis (...) button and select Table Name. Select the Alpha datastore. In the Table Name field, enter PUSHDOWN_DATA. In the Owner field, enter SOURCE.
18. In the transform editor for the Query transform, update the WHERE clause to join the Data_Transfer.employeeid and employee.employeeid fields. Verify that the Comp_Type and Comp columns are mapped to the Data Transfer transform.
19. Save the job.
20. In the Local Object Library, use the right-click menu for the Data_Transfer_DF data flow to
display the optimized SQL.
Note that the WHERE clause appears in the SQL statements.
A solution file called SOLUTION_DataTransfer.atl is included in your Course Resources. To
check the solution, import the file and open it to view the data flow design and mapping logic.
Do not execute the solution job, as this may override the results in your target table.
Using the XML Pipeline transform
Introduction
The XML Pipeline transform is used to process large XML files more efficiently by separating
them into small batches.
After completing this unit, you will be able to:
• Use the XML Pipeline transform
Explaining the XML Pipeline transform
The XML Pipeline transform is used to process large XML files, one instance of a specifiedrepeatable structure at a time.
With this transform, Data Services does not need to read the entire XML input into memory
and build an internal data structure before performing the transformation.
This means that an NRDM structure is not required to represent the entire XML data input.
Instead, this transform uses a portion of memory to process each instance of a repeatable
structure, then continually releases and re-uses the memory to continuously flow XML data
through the transform.
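The streaming behavior described above can be approximated in plain Python with `xml.etree.ElementTree.iterparse`, which likewise handles one repeatable element at a time and releases its memory before reading the next. The element names are illustrative, not part of the transform:

```python
import io
import xml.etree.ElementTree as ET

# A small stand-in for a large XML file with a repeatable <order> structure.
xml = io.BytesIO(b"""<PurchaseOrders>
  <order><customerName>Acme</customerName></order>
  <order><customerName>Bolt</customerName></order>
</PurchaseOrders>""")

names = []
# Stream one repeatable <order> at a time instead of building the whole
# document tree in memory first.
for event, elem in ET.iterparse(xml, events=("end",)):
    if elem.tag == "order":
        names.append(elem.findtext("customerName"))
        elem.clear()   # release this instance's memory before the next one
```

This is the same trade the transform makes: memory use is bounded by one instance of the repeatable structure, not by the size of the file.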
During execution, Data Services pushes operations of the streaming transform to the XML
source. Therefore, you cannot use a breakpoint between your XML source and an XML Pipeline
transform.
Note:
You can use the XML Pipeline transform to load into a relational or nested schema target. This
course focuses on loading XML data into a relational target.
For more information on constructing nested schemas for your target, refer to the Data Services
Designer Guide.
Inputs/Outputs
You can use an XML file or XML message as input. You can also connect more than one XML Pipeline transform to an XML source.
When connected to an XML source, the transform editor shows the input and output schema structures as a root schema containing repeating and non-repeating sub-schemas, represented by these icons:
Icon Schema structure
Root schema and repeating sub-schema
Icon Schema structure
Non-repeating sub-schema
Keep in mind these rules when using the XML Pipeline transform:
• You cannot drag and drop the root level schema.
• You can drag and drop the same child object to the output schema multiple times, but only if you give each instance of that object a unique name. Rename the mapped instance before attempting to drag and drop the same object to the output again.
• When you drag and drop a column or sub-schema to the output schema, you cannot then map the parent schema for that column or sub-schema. Similarly, when you drag and drop a parent schema, you cannot then map an individual column or sub-schema from under that parent.
• You cannot map items from two sibling repeating sub-schemas, because the XML Pipeline transform does not support Cartesian products (combining every row from one table with every row in another table) of two repeatable schemas.
To take advantage of the XML Pipeline transform’s performance, always select a repeatable
column to be mapped. For example, if you map a repeatable schema column, the XML source
produces one row after parsing one item.
Avoid selecting non-repeatable columns that occur structurally after the repeatable schema, because the XML source must then assemble the entire structure of items in memory before processing, which increases the memory consumed to produce the output for your target.
To map both the repeatable schema and a non-repeatable column that occurs after the repeatable
one, use two XML Pipeline transforms, and use the Query transform to combine the outputs
of the two XML Pipeline transforms and map the columns into one single target.
Options
The XML Pipeline transform is streamlined to support massive throughput of XML data; therefore, it contains no options other than the input and output schemas and the Mapping tab.
Activity: Using the XML Pipeline transform
Purchase order information is stored in XML files that have repeatable purchase orders and items, and a non-repeated Total Purchase Orders column. You must combine the customer name, order date, order items, and the totals into a single relational target table, with one row per customer per item.
Objectives
• Use the XML Pipeline transform to extract XML data.
• Combine the rows required from both XML sources into a single target table, joined using a Query transform.
Instructions
1. On the Formats tab of the Local Object Library, create a new file format for an XML schema
called purchaseOrders_Format, based on the purchaseOrders.xsd file in the Activity_Source
folder. Use a root element of PurchaseOrders.
2. In the Omega project, create a new job called Alpha_Purchase_Orders_Job, with a data flow called Alpha_Purchase_Orders_DF.
3. In the data flow workspace, add the PurchaseOrders_Format file format as the XML file
source object.
4. In the format editor for the file format, point the file format to the pos.xml file in the
Activity_Source folder.
Note that when working in a distributed environment, where Designer and the Job Server
are on different machines, it may be necessary to edit the path to the XML file if it is different
on the Job Server than the Designer client. Your instructor will tell you if you need to edit
the path to the file for this activity.
5. Add two instances of the XML Pipeline transform to the data flow workspace and connect
the source object to each.
6. In the transform editor for the first XML Pipeline transform, map the following columns:
Schema In Schema Out
customerName customerName
orderDate orderDate
7. Map the entire item repeatable schema from the input schema to the output schema.
8. In the transform editor for the second XML Pipeline transform, map the following columns:
Schema In Schema Out
customerName customerName
orderDate orderDate
totalPOs totalPOs
9. Add a Query transform to the data flow workspace and connect both XML Pipeline transforms to it.
10. In the transform editor for the Query transform, map both columns and the repeatable schema from the first XML Pipeline transform from the input schema to the output schema. Also map the totalPOs column from the second XML Pipeline transform.
11. Unnest the item repeatable schema.
12. Create a WHERE clause to join the inputs from the two XML Pipeline transforms on the
customerName column.
The expression should be as follows:
XML_Pipeline.customerName = XML_Pipeline_1.customerName
13. Add a new template table called Item_POs to the Delta datastore and connect the Query
transform to it.
14. Execute Alpha_Purchase_Orders_Job with the default execution properties and save all
objects you have created.
15. Return to the data flow workspace and view data for the target table.
A solution file called SOLUTION_XMLPipeline.atl is included in your Course Resources. To
check the solution, import the file and open it to view the data flow design and mapping logic.
Do not execute the solution job, as this may override the results in your target table.
Quiz: Using Data Integrator transforms
1. What is the Pivot transform used for?
2. What is the purpose of the Hierarchy Flattening transform?
3. What is the difference between the horizontal and vertical flattening hierarchies?
4. List three things you can do to improve job performance.
5. Name three options that can be pushed down to the database.
Lesson summary
After completing this lesson, you are now able to:
• Describe the Data Integrator transforms
• Use the Pivot transform
• Use the Hierarchy Flattening transform
• Describe performance optimization
• Use the Data Transfer transform
• Use the XML Pipeline transform
Answer Key
This section contains the answers to the reviews and/or activities for the applicable lessons.
Quiz: Capturing changes in data
1. What are the two most important reasons for using CDC?
Answer: Improving performance and preserving history.
2. Which method of CDC is preferred for the performance gain of extracting the fewest rows?
Answer: Source-based CDC.
3. What is the difference between an initial load and a delta load?
Answer:
An initial load is the first population of a database using data acquisition modules for extraction, transformation, and load. The first time you execute a batch job, Designer performs
an initial load to create the data tables and populate them.
A delta load incrementally loads data that has been changed or added since the last load
iteration. When you execute your job, the delta load may run several times, loading the
specified number of rows each time until all new data has been written to the target
database.
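As a sketch of the idea (not Data Services syntax), a timestamp-based delta load filters the source on the time recorded after the previous run; the table and field names below are hypothetical:

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp
source = [
    {"id": 1, "modified": datetime(2009, 1, 1)},
    {"id": 2, "modified": datetime(2009, 3, 1)},
]

# Timestamp recorded at the end of the previous load iteration
last_load = datetime(2009, 2, 1)

# Delta load: pick up only rows changed or added since the last load
delta = [row for row in source if row["modified"] > last_load]
print(delta)
```

An initial load, by contrast, would take every row in `source` with no filter.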
4. What transforms do you typically use for target-based CDC?
Answer: Table Comparison, History Preserving, and Key Generation.
Quiz: Using Data Integrator transforms
1. What is the Pivot transform used for?
Answer: Use the Pivot transform when you want to rotate data from multiple columns into
rows in a single column while preserving the information associated with each original column.
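In relational terms this is the same operation as SQL UNPIVOT. A minimal plain-Python sketch, with invented quarterly columns, of what the Pivot transform produces:

```python
# Hypothetical source with one column per quarter
rows = [
    {"region": "East", "Q1": 100, "Q2": 120},
    {"region": "West", "Q1": 90, "Q2": 95},
]

# Pivot the quarter columns into rows: each column name becomes data
# (the quarter) and each value moves into a single amount column,
# while the non-pivot column (region) is repeated on every row
pivoted = [
    {"region": row["region"], "quarter": q, "amount": row[q]}
    for row in rows
    for q in ("Q1", "Q2")
]
print(pivoted)
```

Two source rows with two pivot columns each become four output rows, one per original column value.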
2. What is the purpose of the Hierarchy Flattening transform?
Answer: The Hierarchy Flattening transform enables you to break down hierarchical table
structures into a single table to speed data access.
3. What is the difference between the horizontal and vertical flattening hierarchies?
Answer:
With horizontally flattened hierarchies, each row of the output describes a single node in
the hierarchy and the path to that node from the root.
With vertically flattened hierarchies, each row of the output describes a single relationship
between an ancestor and a descendant and the number of nodes the relationship spans. There
is a row in the output for each node paired with each of its descendants. Each node is
considered its own descendant and is therefore listed once as both ancestor and
descendant.
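The two modes can be illustrated with a small parent/child adjacency table. The following sketch is not Data Services code; the node names and structures are invented to show the shape of each output:

```python
# Hypothetical parent -> children adjacency for a three-level hierarchy
children = {"root": ["a", "b"], "a": ["a1"], "b": [], "a1": []}

def horizontal_flatten(children, node, path=()):
    """One output row per node: the full path from the root to that node."""
    path = path + (node,)
    rows = [path]
    for child in children.get(node, []):
        rows.extend(horizontal_flatten(children, child, path))
    return rows

def vertical_flatten(children, root):
    """(ancestor, descendant, depth) rows: each node paired once with
    itself (depth 0) and once with every node below it."""
    rows = [(root, root, 0)]
    for child in children.get(root, []):
        sub = vertical_flatten(children, child)
        rows.extend(sub)
        # root is an ancestor of everything in the child's subtree
        rows.extend((root, desc, depth + 1)
                    for anc, desc, depth in sub if anc == child)
    return rows

h = horizontal_flatten(children, "root")
v = vertical_flatten(children, "root")
```

For this four-node hierarchy, horizontal flattening yields four rows (one path per node), while vertical flattening yields eight (four self-pairs plus four ancestor/descendant pairs).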
4. List three things you can do to improve job performance.
Answer:
Choose from the following:
○ Use push-down operations.
○ View SQL generated by a data flow and adjust your design to maximize the SQL that is
pushed down to improve performance.
○ Use data caching.
○ Use process slicing.
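Push-down means the engine folds operations such as filters and aggregations into the SQL sent to the source database, rather than extracting all rows and processing them afterward. A schematic illustration using an in-memory SQLite database with an invented schema:

```python
import sqlite3

# In-memory database standing in for the source (hypothetical schema)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount INT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("East", 100), ("East", 50), ("West", 70)])

# Pushed-down version: the WHERE clause and GROUP BY run inside the
# database, so only the aggregated rows are returned to the engine
rows = con.execute(
    "SELECT region, SUM(amount) FROM orders "
    "WHERE amount > 40 GROUP BY region").fetchall()
print(rows)
```

Without push-down, the equivalent job would pull all three rows across the network and filter and aggregate them in the engine; here only two aggregated rows come back.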
5. Name three options that can be pushed down to the database.
Answer: Choose from the following: