http://www.dwbiconcepts.com/tutorial/10-interview/44-important-practical-interview-questions.html

Best Informatica Interview Questions & Answers

Learn the answers to some critical questions commonly asked during Informatica interviews.

Deleting duplicate rows using Informatica

Q1. Suppose we have duplicate records in the source system and we want to load only the unique records into the target system, eliminating the duplicate rows. What will be the approach?

Ans. Let us assume that the source system is a Relational Database and that the source table has duplicate rows. Now, to eliminate the duplicate records, we can check the Distinct option of the Source Qualifier of the source table and load the target accordingly.
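
As a minimal sketch of what this option does (using a hypothetical relational source SRC_TABLE with columns COL1, COL2 and COL3), checking Distinct makes the Integration Service add a DISTINCT clause to the default query it generates, roughly:

SELECT DISTINCT SRC_TABLE.COL1, SRC_TABLE.COL2, SRC_TABLE.COL3 FROM SRC_TABLE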

Source Qualifier Transformation DISTINCT clause

But what if the source is a flat file? How can we remove the duplicates from a flat file source? Read on...

Deleting duplicate rows for FLAT FILE sources

Now suppose the source system is a Flat File. Here in the Source Qualifier you will not be able to select the Distinct option, as it is disabled for a flat file source. Hence the next approach may be to use a Sorter Transformation and check its Distinct option. When we select the Distinct option, all the columns will be selected as keys, in ascending order by default.

Sorter Transformation DISTINCT clause

Deleting Duplicate Records Using Informatica Aggregator

Another way to handle duplicate records in a source batch run is to use an Aggregator Transformation, using the Group By checkbox on the ports having the duplicated data. Here we have the flexibility to select either the last or the first of the duplicate column value records. Apart from that, using the Dynamic Lookup Cache of the target table, associating the input ports with the lookup ports and checking the Insert Else Update option will help to eliminate the duplicate records in the source and hence load unique records into the target.

For more details on the Dynamic Lookup Cache, see the Dynamic Lookup Cache section later in this document.

Loading Multiple Target Tables Based on Conditions

Q2. Suppose we have some serial numbers in a flat file source. We want to load the serial numbers into two target files, one containing the EVEN serial numbers and the other containing the ODD ones.

Ans. After the Source Qualifier, place a Router Transformation. Create two groups, namely EVEN and ODD, with the filter conditions MOD(SERIAL_NO,2)=0 and MOD(SERIAL_NO,2)=1 respectively. Then connect the two groups to two flat file targets.

Router Transformation Groups Tab

Normalizer Related Questions

Q3. Suppose in our Source Table we have data as given below:

Student Name Maths Life Science Physical Science

Sam 100 70 80

John 75 100 85

Tom 80 100 85

We want to load our Target Table as:

Student Name Subject Name Marks

Sam Maths 100

Sam Life Science 70

Sam Physical Science 80

John Maths 75

John Life Science 100

John Physical Science 85

Tom Maths 80

Tom Life Science 100

Tom Physical Science 85

Describe your approach.

Ans. Here, to convert the columns to rows, we have to use the Normalizer Transformation, followed by an Expression Transformation to decode the generated column ID into the corresponding subject name. For more details on how the mapping is performed, please visit Working with Normalizer.
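
As a minimal sketch of that decode step (assuming the Normalizer exposes a generated column ID port, here called GCID_MARKS, numbered in the order Maths, Life Science, Physical Science), the Expression transformation output port for the subject name could be:

SUBJECT_NAME: DECODE(GCID_MARKS, 1, 'Maths', 2, 'Life Science', 3, 'Physical Science')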

Q4. Name the transformations which convert one row to many rows, i.e. increase the output row count relative to the input. Also, what is the name of the transformation that performs the reverse action?

Ans. The Normalizer as well as the Router transformations are Active transformations which can increase the number of output rows relative to the input rows.

The Aggregator Transformation is an active transformation that performs the reverse action.

Q5. Suppose we have a source table and we want to load three target tables based on source rows such that the first row moves to the first target table, the second row to the second target table, the third row to the third target table, the fourth row again to the first target table, and so on and so forth. Describe your approach.

Ans.

We can clearly understand that we need a Router transformation to route the source data to the three target tables. Now the question is what the filter conditions will be. First of all we need an Expression Transformation in which we have all the source table columns, and along with them another i/o port, say seq_num, which gets a sequence number for each source row from the NextVal port of a Sequence Generator (start value 0, increment by 1). Now the filter conditions for the three router groups will be:

MOD(SEQ_NUM,3)=1 connected to 1st target table, MOD(SEQ_NUM,3)=2 connected to 2nd target table, MOD(SEQ_NUM,3)=0 connected to 3rd target table.
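
For instance, with the Sequence Generator above, the first six source rows get seq_num values 1 to 6 (the first NEXTVAL being 1); MOD(seq_num,3) then evaluates to 1, 2, 0, 1, 2, 0, so the rows land in the 1st, 2nd, 3rd, 1st, 2nd and 3rd target tables in a round-robin fashion.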

Router Transformation Groups Tab

Loading Multiple Flat Files using one mapping

Q6. Suppose we have ten source flat files of the same structure. How can we load all the files into the target database in a single batch run using a single mapping?

Ans. After we create a mapping to load data into the target database from flat files, we move on to the session properties of the Source Qualifier.

To load a set of source files we need to create a file, say final.txt, containing the source flat file names (ten files in our case) and set the Source filetype option to Indirect. Next, point to this flat file final.txt, fully qualified, through the Source file directory and Source filename properties. Image: Session Property Flat File

Q7. How can we implement an aggregation operation without using an Aggregator Transformation in Informatica?

Ans. We will use the very basic property of the Expression Transformation: that we can access the previous row's data as well as the currently processed row within an expression. What we need is simply a Sorter, an Expression and a Filter transformation to achieve aggregation at the Informatica level. For a detailed understanding, see the Aggregation without Informatica Aggregator section later in this document.

Q8. Suppose in our Source Table we have data as given below:

Student Name Subject Name Marks

Sam Maths 100

Tom Maths 80

Sam Physical Science 80

John Maths 75

Sam Life Science 70

John Life Science 100

John Physical Science 85

Tom Life Science 100

Tom Physical Science 85

We want to load our Target Table as:

Student Name Maths Life Science Physical Science

Sam 100 70 80

John 75 100 85

Tom 80 100 85

Describe your approach.

Ans. Here our scenario is to convert many rows to one row, and the transformation which will help us achieve this is the Aggregator. Our mapping will look like this:

Mapping using sorter and Aggregator

We will sort the source data based on STUDENT_NAME ascending followed by SUBJECT ascending.

Sorter Transformation

Now, based on STUDENT_NAME in the GROUP BY clause, the following output subject columns are populated as:

MATHS: MAX(MARKS, SUBJECT='Maths')
LIFE_SC: MAX(MARKS, SUBJECT='Life Science')
PHY_SC: MAX(MARKS, SUBJECT='Physical Science')
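
For reference, a SQL sketch of this sort-and-aggregate pivot (assuming a hypothetical source table STUDENT_MARKS with columns STUDENT_NAME, SUBJECT and MARKS) would be roughly:

SELECT STUDENT_NAME,
       MAX(CASE WHEN SUBJECT = 'Maths' THEN MARKS END) AS MATHS,
       MAX(CASE WHEN SUBJECT = 'Life Science' THEN MARKS END) AS LIFE_SC,
       MAX(CASE WHEN SUBJECT = 'Physical Science' THEN MARKS END) AS PHY_SC
FROM STUDENT_MARKS
GROUP BY STUDENT_NAME;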

Aggregator Transformation

Revisiting Source Qualifier Transformation

Q9. What is a Source Qualifier? What are the tasks we can perform using a SQ, and why is it an ACTIVE transformation?

Ans. A Source Qualifier is an Active and Connected Informatica transformation that reads the rows from a relational database or flat file source.

We can configure the SQ to join [both INNER as well as OUTER JOIN] data originating from the same source database. We can use a source filter to reduce the number of rows the Integration Service queries. We can specify a number for sorted ports, and the Integration Service adds an ORDER BY clause to the default SQL query. We can choose the Select Distinct option for relational databases, and the Integration Service adds a SELECT DISTINCT clause to the default SQL query. Also, we can write a Custom/User Defined SQL query which will override the default query in the SQ by changing the default settings of the transformation properties. Also, we have the option to write Pre as well as Post SQL statements, to be executed before and after the SQ query in the source database.

Since the transformation provides us with the Select Distinct property, the Integration Service adds a SELECT DISTINCT clause to the default SQL query, which in turn affects the number of rows returned by the database to the Integration Service; hence it is an Active transformation.

Q10. What happens to a mapping if we alter the datatypes between Source and its corresponding Source Qualifier?

Ans. The Source Qualifier transformation displays the transformation datatypes. The transformation datatypes determine how the source database binds data when the Integration Service reads it. Now if we alter the datatypes in the Source Qualifier transformation, or the datatypes in the source definition and Source Qualifier transformation do not match, the Designer marks the mapping as invalid when we save it.

Q11. Suppose we have used the Select Distinct and the Number Of Sorted Ports properties in the SQ and then we add a Custom SQL Query. Explain what will happen.

Ans. Whenever we add a Custom SQL or SQL override query, it overrides the User-Defined Join, Source Filter, Number of Sorted Ports, and Select Distinct settings in the Source Qualifier transformation. Hence only the user-defined SQL query will be fired against the database and all the other options will be ignored.

Q12. Describe the situations where we will use the Source Filter, Select Distinct and Number Of Sorted Ports properties of the Source Qualifier transformation.

Ans. The Source Filter option is used basically to reduce the number of rows the Integration Service queries, so as to improve performance. The Select Distinct option is used when we want the Integration Service to select unique values from a source, filtering out unnecessary data earlier in the data flow, which might improve performance. The Number Of Sorted Ports option is used when we want the source data to be in a sorted fashion, so as to use the same in following transformations like Aggregator or Joiner, which, when configured for sorted input, will improve performance.

Q13. What will happen if the SELECT list COLUMNS in the Custom override SQL Query and the OUTPUT PORTS order in SQ transformation do not match?

Ans. A mismatch in, or a change to, the order of the list of selected columns relative to the connected transformation output ports may result in session failure.

Q14. What happens if, in the Source Filter property of the SQ transformation, we include the keyword WHERE, say WHERE CUSTOMERS.CUSTOMER_ID > 1000?

Ans. We use the source filter to reduce the number of source records. If we include the string WHERE in the source filter, the Integration Service fails the session.
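
For illustration (using the table and column from the question itself), the Source Filter should contain only the condition, e.g. CUSTOMERS.CUSTOMER_ID > 1000; the Integration Service adds this text to the WHERE clause of the default query it generates, so including the WHERE keyword ourselves produces invalid SQL and the session fails.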

Q15. Describe the scenarios where we go for a Joiner transformation instead of a Source Qualifier transformation.

Ans. We use the Joiner transformation to join source data from heterogeneous sources, as well as to join flat files. Use the Joiner transformation when we need to join the following types of sources: data from different relational databases, data from different flat files, or relational sources and flat files.

Q16. What is the maximum number we can use in Number Of Sorted Ports for a Sybase source system?

Ans. Sybase supports a maximum of 16 columns in an ORDER BY clause. So if the source is Sybase, do not sort more than 16 columns.

Q17. Suppose we have two Source Qualifier transformations SQ1 and SQ2 connected to Target tables TGT1 and TGT2 respectively. How do you ensure TGT2 is loaded after TGT1?

Ans. If we have multiple Source Qualifier transformations connected to multiple targets, we can designate the order in which the Integration Service loads data into the targets. In the Mapping Designer, we need to configure the Target Load Plan based on the Source Qualifier transformations in the mapping to specify the required loading order.

Image: Target Load Plan

Target Load Plan Ordering

Q18. Suppose we have a Source Qualifier transformation that populates two target tables. How do you ensure TGT2 is loaded after TGT1?

Ans. In the Workflow Manager, we can configure constraint-based load ordering for a session. The Integration Service orders the target load on a row-by-row basis. For every row generated by an active source, the Integration Service loads the corresponding transformed row first into the primary key table, then into the foreign key table. Hence if we have one Source Qualifier transformation that provides data for multiple target tables having primary and foreign key relationships, we will go for constraint-based load ordering.

Image: Constraint based loading

Revisiting Filter Transformation

Q19. What is a Filter Transformation and why is it an Active one?

Ans. A Filter transformation is an Active and Connected transformation that can filter rows in a mapping. Only the rows that meet the Filter Condition pass through the Filter transformation to the next transformation in the pipeline. TRUE and FALSE are the implicit return values from any filter condition we set. If the filter condition evaluates to NULL, the row is assumed to be FALSE. The numeric equivalent of FALSE is zero (0) and any non-zero value is the equivalent of TRUE.

As an ACTIVE transformation, the Filter transformation may change the number of rows passed through it. A filter condition returns TRUE or FALSE for each row that passes through the transformation, depending on whether a row meets the specified condition. Only rows that return TRUE pass through this transformation. Discarded rows do not appear in the session log or reject files.

Q20. What is the difference between the Source Qualifier transformation's Source Filter and the Filter transformation?

Ans.

SQ Source Filter | Filter Transformation

Source Qualifier transformation filters rows when read from a source. | Filter transformation filters rows from within a mapping.

Source Qualifier transformation can only filter rows from relational sources. | Filter transformation filters rows coming from any type of source system at the mapping level.

Source Qualifier limits the row set extracted from a source. | Filter transformation limits the row set sent to a target.

Source Qualifier reduces the number of rows used throughout the mapping and hence provides better performance. | To maximize session performance, include the Filter transformation as close to the sources in the mapping as possible to filter out unwanted data early in the flow of data from sources to targets.

The filter condition in the Source Qualifier transformation only uses standard SQL as it runs in the database. | Filter Transformation can define a condition using any statement or transformation function that returns either a TRUE or FALSE value.

Revisiting Joiner Transformation

Q21. What is a Joiner Transformation and why is it an Active one?

Ans. A Joiner is an Active and Connected transformation used to join source data from the same source system or from two related heterogeneous sources residing in different locations or file systems. The Joiner transformation joins sources with at least one matching column. The Joiner transformation uses a condition that matches one or more pairs of columns between the two sources. The two input pipelines include a master pipeline and a detail pipeline, or a master and a detail branch. The master pipeline ends at the Joiner transformation, while the detail pipeline continues to the target.

In the Joiner transformation, we must configure the transformation properties, namely the Join Condition, the Join Type and the Sorted Input option, to improve Integration Service performance. The join condition contains ports from both input sources that must match for the Integration Service to join two rows. Depending on the type of join selected, the Integration Service either adds the row to the result set or discards the row. The Joiner transformation produces result sets based on the join type, condition, and input data sources. Hence it is an Active transformation.

Q22. State the limitations where we cannot use a Joiner in the mapping pipeline.

Ans. The Joiner transformation accepts input from most transformations. However, the following are the limitations:

A Joiner transformation cannot be used when either of the input pipelines contains an Update Strategy transformation. A Joiner transformation cannot be used if we connect a Sequence Generator transformation directly before the Joiner transformation.

Q23. Out of the two input pipelines of a joiner, which one will you set as the master pipeline?

Ans. During a session run, the Integration Service compares each row of the master source against the detail source. The master and detail sources need to be configured for optimal performance.

To improve performance for an Unsorted Joiner transformation, use the source with fewer rows as the master source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process. When the Integration Service processes an unsorted Joiner transformation, it reads all master rows before it reads the detail rows. The Integration Service blocks the detail source while it caches rows from the master source . Once the Integration Service reads and caches all master rows, it unblocks the detail source and reads the detail rows.

To improve performance for a Sorted Joiner transformation, use the source with fewer duplicate key values as the master source. When the Integration Service processes a sorted Joiner transformation, it blocks data based on the mapping configuration and it stores fewer rows in the cache, increasing performance. Blocking logic is possible if the master and detail input to the Joiner transformation originate from different sources. Otherwise, it does not use blocking logic. Instead, it stores more rows in the cache.

Q24. What are the different types of Joins available in Joiner Transformation?

Ans. In SQL, a join is a relational operator that combines data from multiple tables into a single result set. The Joiner transformation is similar to an SQL join except that data can originate from different types of sources.

The Joiner transformation supports the following types of joins: Normal, Master Outer, Detail Outer and Full Outer.

Join Type property of Joiner Transformation

Note: A normal or master outer join performs faster than a full outer or detail outer join.

Q25. Define the various Join Types of Joiner Transformation.

Ans. In a normal join, the Integration Service discards all rows of data from the master and detail source that do not match, based on the join condition.

A master outer join keeps all rows of data from the detail source and the matching rows from the master source. It discards the unmatched rows from the master source.

A detail outer join keeps all rows of data from the master source and the matching rows from the detail source. It discards the unmatched rows from the detail source.

A full outer join keeps all rows of data from both the master and detail sources.

Q26. Describe the impact of the number of join conditions and the join order in a Joiner Transformation.

Ans. We can define one or more conditions based on equality between the specified master and detail sources. Both ports in a condition must have the same datatype. If we need to use two ports in the join condition with non-matching datatypes, we must convert the datatypes so that they match. The Designer validates datatypes in a join condition. Additional ports in the join condition increase the time necessary to join two sources.

The order of the ports in the join condition can impact the performance of the Joiner transformation. If we use multiple ports in the join condition, the Integration Service compares the ports in the order we specified.

NOTE: Only the equality operator is available in the Joiner join condition.

Q27. How does the Joiner transformation treat NULL value matching?

Ans. The Joiner transformation does not match null values. For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the Integration Service does not consider them a match and does not join the two rows. To join rows with null values, replace the null input with default values in the Ports tab of the Joiner, and then join on the default values.

Note: If a result set includes fields that do not contain data in either of the sources, the Joiner transformation populates the empty fields with null values. If we know that a field will return a NULL and we do not want to insert NULLs in the target, set a default value on the Ports tab for the corresponding port.

Q28. Suppose we configure Sorter transformations in the master and detail pipelines with the following sorted ports in order: ITEM_NO, ITEM_NAME, PRICE. When we configure the join condition, what are the guidelines we need to follow to maintain the sort order?

Ans. If we have sorted both the master and detail pipelines in the order of the ports, say ITEM_NO, ITEM_NAME and PRICE, we must ensure that: We use ITEM_NO in the first join condition. If we add a second join condition, we must use ITEM_NAME. If we want to use PRICE as a join condition apart from ITEM_NO, we must also use ITEM_NAME in the second join condition. If we skip ITEM_NAME and join on ITEM_NO and PRICE, we will lose the input sort order and the Integration Service fails the session.

Q29. What are the transformations that cannot be placed between the sort origin and the Joiner transformation, so that we do not lose the input sort order?

Ans. The best option is to place the Joiner transformation directly after the sort origin to maintain sorted data. However, do not place any of the following transformations between the sort origin and the Joiner transformation:

Custom
Unsorted Aggregator
Normalizer
Rank
Union transformation
XML Parser transformation
XML Generator transformation
Mapplet [if it contains any one of the above mentioned transformations]

Q30. Suppose we have the EMP table as our source. In the target we want to view those employees whose salary is greater than or equal to the average salary for their departments.

Describe your mapping approach.

Ans. Our mapping will look like this: Image: Mapping using Joiner

To start with, the mapping needs the following transformations: After the Source Qualifier of the EMP table, place a Sorter Transformation and sort based on the DEPTNO port.

Sorter Ports Tab

Next we place a Sorted Aggregator Transformation. Here we will find out the AVERAGE SALARY for each (GROUP BY) DEPTNO. When we perform this aggregation, we lose the data for individual employees. To maintain the employee data, we must pass one branch of the pipeline to the Aggregator Transformation and pass another branch with the same sorted source data to the Joiner transformation to maintain the original data. When we join both branches of the pipeline, we join the aggregated data with the original data.

Aggregator Ports Tab

Aggregator Properties Tab

So next we need a Sorted Joiner Transformation to join the sorted aggregated data with the original data, based on DEPTNO. Here we will be taking the aggregated pipeline as the Master and the original dataflow as the Detail pipeline.

Joiner Condition Tab

Joiner Properties Tab

After that we need a Filter Transformation to filter out the employees having a salary less than the average salary for their department. Filter Condition: SAL >= AVG_SAL

Filter Properties Tab

Lastly we have the Target table instance.
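
For reference, a SQL sketch of the same requirement (using the EMP table from the question; the column names SAL and DEPTNO follow the filter and sort ports shown above) would be roughly:

SELECT e.*
FROM EMP e
JOIN (SELECT DEPTNO, AVG(SAL) AS AVG_SAL
      FROM EMP
      GROUP BY DEPTNO) a
  ON e.DEPTNO = a.DEPTNO
WHERE e.SAL >= a.AVG_SAL;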

Revisiting Sequence Generator Transformation

Q31. What is a Sequence Generator Transformation?

Ans. A Sequence Generator transformation is a Passive and Connected transformation that generates numeric values. It is used to create unique primary key values, replace missing primary keys, or cycle through a sequential range of numbers. This transformation by default contains ONLY two OUTPUT ports, namely CURRVAL and NEXTVAL. We cannot edit or delete these ports, nor can we add ports to this unique transformation. We can create approximately two billion unique numeric values, with the widest range being from 1 to 2147483647.

Q32. Define the properties available in the Sequence Generator transformation in brief.

Ans.

Sequence Generator Properties | Description

Start Value | Start value of the generated sequence that we want the Integration Service to use if we use the Cycle option. If we select Cycle, the Integration Service cycles back to this value when it reaches the end value. Default is 0.

Increment By | Difference between two consecutive values from the NEXTVAL port. Default is 1.

End Value | Maximum value generated by the Sequence Generator. After reaching this value the session will fail if the Sequence Generator is not configured to cycle. Default is 2147483647.

Current Value | Current value of the sequence. Enter the value we want the Integration Service to use as the first value in the sequence. Default is 1.

Cycle | If selected, when the Integration Service reaches the configured end value for the sequence, it wraps around and starts the cycle again, beginning with the configured Start Value.

Number of Cached Values | Number of sequential values the Integration Service caches at a time. Default value for a standard Sequence Generator is 0. Default value for a reusable Sequence Generator is 1,000.

Reset | Restarts the sequence at the current value each time a session runs. This option is disabled for reusable Sequence Generator transformations.

Q33. Suppose we have a source table populating two target tables. We connect the NEXTVAL port of the Sequence Generator to the surrogate keys of both the target tables. Will the surrogate keys in both the target tables be the same? If not, how can we flow the same sequence values into both of them?

Ans. When we connect the NEXTVAL output port of the Sequence Generator directly to the surrogate key columns of the target tables, the sequence numbers will not be the same. A block of sequence numbers is sent to one target table's surrogate key column. The second target receives a block of sequence numbers from the Sequence Generator transformation only after the first target table receives its block of sequence numbers. Suppose we have 5 rows coming from the source; the targets will then have the sequence values TGT1 (1,2,3,4,5) and TGT2 (6,7,8,9,10). [Taking into consideration Start Value 0, Current Value 1 and Increment By 1.]

Now suppose the requirement is that we need to have the same surrogate keys in both the targets. The easiest way to handle this situation is to put an Expression Transformation between the Sequence Generator and the target tables. The Sequence Generator will pass unique values to the Expression transformation, and then the rows are routed from the Expression transformation to the targets.

Sequence Generator

Q34. Suppose we have 100 records coming from the source. Now for a target column population we used a Sequence Generator. Suppose the Current Value is 0 and the End Value of the Sequence Generator is set to 80. What will happen?

Ans. End Value is the maximum value the Sequence Generator will generate. After it reaches the End Value, the session fails with the following error message: TT_11009 Sequence Generator Transformation: Overflow error.

The session failure can be avoided if the Sequence Generator is configured to Cycle through the sequence, i.e. whenever the Integration Service reaches the configured end value for the sequence, it wraps around and starts the cycle again, beginning with the configured Start Value.
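
As a rough illustration of the Cycle behaviour with the values from this question (Start Value 0, Current Value 0, Increment By 1, End Value 80): the sequence would run 0, 1, ..., 80 and then wrap back to the Start Value 0 for the remaining source rows, instead of failing the session on overflow.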

Q35. What are the changes we observe when we promote a non-reusable Sequence Generator to a reusable one? And what happens if we set the Number of Cached Values to 0 for a reusable transformation?

Ans. When we convert a non-reusable Sequence Generator to a reusable one, we observe that the Number of Cached Values is set to 1000 by default, and the Reset property is disabled.

When we try to set the Number of Cached Values property of a Reusable Sequence Generator to 0 in the Transformation Developer we encounter the following error message: The number of cached values must be greater than zero for reusable sequence transformation.

Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements. Learn how to effectively use the Normalizer here.

Using Normalizer Transformation

A Normalizer is an Active transformation that returns multiple rows from a source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. The Normalizer can be used to transpose the data in columns to rows. The Normalizer effectively does the opposite of the Aggregator!

Example of Data Transpose using Normalizer

Think of a relational table that stores four quarters of sales by store, and we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, like below. The following source rows contain four quarters of sales by store:

Source Table

Store Quarter1 Quarter2 Quarter3 Quarter4

Store1 100 300 500 700

Store2 250 450 650 850

The Normalizer returns a row for each store and sales combination. It also returns an index(GCID) that identifies the quarter number:

Target Table

Store Sales Quarter

Store 1 100 1

Store 1 300 2

Store 1 500 3

Store 1 700 4

Store 2 250 1

Store 2 450 2

Store 2 650 3

Store 2 850 4
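
For reference, a SQL sketch of the same transpose (assuming a hypothetical source table STORE_SALES with columns STORE, QUARTER1, QUARTER2, QUARTER3 and QUARTER4) would be roughly:

SELECT STORE, QUARTER1 AS SALES, 1 AS QUARTER FROM STORE_SALES
UNION ALL
SELECT STORE, QUARTER2, 2 FROM STORE_SALES
UNION ALL
SELECT STORE, QUARTER3, 3 FROM STORE_SALES
UNION ALL
SELECT STORE, QUARTER4, 4 FROM STORE_SALES;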

How Informatica Normalizer Works

Suppose we have the following data in source:

Name Month Transportation House Rent Food

Sam Jan 200 1500 500

John Jan 300 1200 300

Tom Jan 300 1350 350

Sam Feb 300 1550 450

John Feb 350 1200 290

Tom Feb 350 1400 350

and we need to transform the source data and populate this as below in the target table:

Name Month Expense Type Expense

Sam Jan Transport 200

Sam Jan House rent 1500

Sam Jan Food 500

John Jan Transport 300

John Jan House rent 1200

John Jan Food 300

Tom Jan Transport 300

Tom Jan House rent 1350

Tom Jan Food 350

... and so on. Now below is the screenshot of a complete mapping which shows how to achieve this result using Informatica PowerCenter Designer. Image: Normalization Mapping Example 1

I will explain the mapping further below.

Setting Up Normalizer Transformation Property

First we need to set the number of occurrences property of the Expense head to 3 in the Normalizer tab of the Normalizer transformation, since we have Food, House Rent and Transportation. This in turn will create the corresponding 3 input ports in the Ports tab, along with the fields Individual and Month.

In the Ports tab of the Normalizer the ports will be created automatically, as configured in the Normalizer tab. Interestingly, we will observe two new columns, namely GK_EXPENSEHEAD and GCID_EXPENSEHEAD. The GK field generates a sequence number starting from the value defined in the Sequence field, while GCID holds the value of the occurrence field, i.e. the column number of the input Expense head. Here 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION.

Now the GCID tells us which expense corresponds to which field while converting columns to rows. Below is the screenshot of the expression to handle this GCID efficiently:

Image: Expression to handle GCID

This is how we will accomplish our task!

A Lookup cache does not change once built. But what if the underlying lookup table changes its data after the lookup cache is created? Is there a way so that the cache always remains up-to-date even if the underlying table changes?

Dynamic Lookup Cache

Let's think about this scenario. You are loading your target table through a mapping. Inside the mapping you have a Lookup and in the Lookup, you are actually looking up the same target table you are loading. You may ask me, "So? What's the big deal? We all do it quite often...". And yes you are right.

There is no "big deal" because Informatica (generally) caches the lookup table in the very beginning of the mapping, so whatever record getting inserted to the target table through the mapping, will have no effect on the Lookup cache. The lookup will still hold the previously cached data, even if the underlying target table is changing.But what if you want your Lookup cache to get updated as and when the target table is changing? What if you want your lookup cache to always show the exact snapshot of the data in your target table at that point in time? Clearly this requirement will not be fullfilled in case you use a static cache. You will need a dynamic cache to handle this.

But why on earth someone will need a dynamic cache?

To understand this, let's next understand a static cache scenario.

Static Cache Scenario

Let's suppose you run a retail business and maintain all your customer information in a customer master table (an RDBMS table). Every night, all the customers from your customer master table are loaded into a Customer Dimension table in your data warehouse. Your source customer table is a transaction system table, probably in 3rd normal form, and does not store history. Meaning, if a customer changes his address, the old address is updated with the new address. But your data warehouse table stores the history (maybe in the form of SCD Type-II). There is a mapping that loads your data warehouse table from the source table. Typically you do a Lookup on the target (static cache) and check every incoming customer record to determine if the customer already exists in the target or not. If the customer does not already exist in the target, you conclude the customer is new and INSERT the record, whereas if the customer already exists, you may want to update the target record with this new record (if the record has changed). This is illustrated below; you don't need a dynamic Lookup cache for this.

Image: A static Lookup Cache to determine if a source record is new or updatable

Dynamic Lookup Cache Scenario

Notice in the previous example I mentioned that your source table is an RDBMS table. This ensures that your source table does not have any duplicate records. What if you had a flat file as source with many duplicate records? Would the scenario be the same? No, see the illustration below.

Image: A scenario illustrating the use of dynamic lookup cache

Here are some more examples of when you may consider using a dynamic lookup:

Updating a master customer table with both new and updated customer information as shown above

Loading data into a slowly changing dimension table and a fact table at the same time. Remember, you typically lookup the dimension while loading to fact. So you load dimension table before loading fact table. But using dynamic lookup, you can load both simultaneously.

Loading data from a file with many duplicate records and eliminating the duplicate records in the target by updating the duplicate row, i.e. keeping the most recent row or the initial row

Loading the same data from multiple sources using a single mapping. Just consider the previous retail business example. If you have more than one shop and Linda has visited two of your shops for the first time, the customer record for Linda will come twice during the same load.

So, How does dynamic lookup work?

When the Integration Service reads a row from the source, it updates the lookup cache by performing one of the following actions:

Inserts the row into the cache: If the incoming row is not in the cache, the Integration Service inserts the row in the cache based on input ports or a generated Sequence-ID. The Integration Service flags the row as insert.

Updates the row in the cache: If the row exists in the cache, the Integration Service updates the row in the cache based on the input ports. The Integration Service flags the row as update.

Makes no change to the cache: This happens when the row exists in the cache and the lookup is configured to insert new rows only, or the row is not in the cache and the lookup is configured to update existing rows only, or the row is in the cache but, based on the lookup condition, nothing changes. The Integration Service flags the row as unchanged.

Notice that the Integration Service actually flags the rows based on the above three conditions. This is a great thing, because, if you know the flag, you can actually reroute the row to achieve different logic. This flag port is called "NewLookupRow" and using this the rows can be routed for insert, update or to do nothing. You just need to use a Router or Filter transformation followed by an Update Strategy. Oh, forgot to tell you the actual values that you can expect in the NewLookupRow port:

0 - Integration Service does not update or insert the row in the cache.
1 - Integration Service inserts the row into the cache.
2 - Integration Service updates the row in the cache.

When the Integration Service reads a row, it changes the lookup cache depending on the results of the lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the NewLookupRow port to indicate if it inserts or updates the row in the cache, or makes no change.
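
As a minimal sketch of that rerouting step (group names here are illustrative, not taken from the mapping screenshots), the Router can define two groups on the NewLookupRow port, each followed by an Update Strategy transformation:

INSERT group filter condition: NewLookupRow = 1, with Update Strategy expression DD_INSERT
UPDATE group filter condition: NewLookupRow = 2, with Update Strategy expression DD_UPDATE

Rows with NewLookupRow = 0 fall into the default group and can simply be left unconnected.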

Example of Dynamic Lookup Implementation

OK, I have designed a mapping for you to show the dynamic lookup implementation. I have given a full screenshot of the mapping. Since the screenshot is slightly bigger, I link it below. Image: Dynamic Lookup Mapping

And here I provide the screenshots of the lookup below. Lookup ports screenshot first. Image: Dynamic Lookup Ports Tab

And here is Dynamic Lookup Properties Tab

If you check the mapping screenshot, there I have used a router to reroute the INSERT group and UPDATE group. The router screenshot is also given below. New records are routed to the INSERT group and existing records are routed to the UPDATE group.

Router Transformation Groups Tab

About the Sequence-ID

While using a dynamic lookup cache, we must associate each lookup/output port with an input/output port or a sequence ID. The Integration Service uses the data in the associated port to insert or update rows in the lookup cache. The Designer associates the input/output ports with the lookup/output ports used in the lookup condition. When we select Sequence-ID in the Associated Port column, the Integration Service generates a sequence ID for each row it inserts into the lookup cache. When the Integration Service creates the dynamic lookup cache, it tracks the range of values in the cache associated with any port using a sequence ID, and it generates a key for the port by incrementing the greatest existing sequence ID value by one when it inserts a new row of data into the cache. When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one and increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.

About the Dynamic Lookup Output Port

The lookup/output port output value depends on whether we choose to output old or new values when the Integration Service updates a row:

Output old values on update: The Integration Service outputs the value that existed in the cache before it updated the row.

Output new values on update: The Integration Service outputs the updated value that it writes in the cache. The lookup/output port value matches the input/output port value.

Note: We can configure to output old or new values using the Output Old Value On Update transformation property.

Handling NULL in dynamic LookUp

If the input value is NULL and we select the Ignore Null inputs for Update property for the associated input port, the input value does not equal the lookup value or the value out of the input/output port. When you select the Ignore Null property, the lookup cache and the target table might become unsynchronized if you pass null values to the target. You must verify that you do not pass null values to the target.

When you update a dynamic lookup cache and target table, the source data might contain some null values. The Integration Service can handle the null values in the following ways:

Insert null values: The Integration Service uses null values from the source and updates the lookup cache and target table using all values from the source.

Ignore Null inputs for Update property: The Integration Service ignores the null values in the source and updates the lookup cache and target table using only the not-null values from the source.

If we know the source data contains null values, and we do not want the Integration Service to update the lookup cache or target with null values, then we need to check the Ignore Null property for the corresponding lookup/output port.

When we choose to ignore NULLs, we must verify that we output the same values to the target that the Integration Service writes to the lookup cache. We can configure the mapping based on the value we want the Integration Service to output from the lookup/output ports when it updates a row in the cache, so that the lookup cache and the target table do not become unsynchronized:

New values: Connect only lookup/output ports from the Lookup transformation to the target.

Old values: Add an Expression transformation after the Lookup transformation and before the Filter or Router transformation. Add output ports in the Expression transformation for each port in the target table and create expressions to ensure that we do not output null input values to the target.

When we run a session that uses a dynamic lookup cache, the Integration Service compares the values in all lookup ports with the values in their associated input ports by default. It compares the values to determine whether or not to update the row in the lookup cache. When a value in an input port differs from the value in the lookup port, the Integration Service updates the row in the cache.

But what if we don't want to compare all ports? We can choose the ports we want the Integration Service to ignore when it compares ports. The Designer only enables this property for lookup/output ports when the port is not used in the lookup condition. We can improve performance by ignoring some ports during comparison.

We might want to do this when the source data includes a column that indicates whether or not the row contains data we need to update. Select the Ignore in Comparison property for all lookup ports except the port that indicates whether or not to update the row in the cache and target table.

Note: We must configure the Lookup transformation to compare at least one port else the Integration Service fails the session when we ignore all ports.

Here is an easy to understand primer on Oracle architecture. Read this first to give yourself a head-start before you read more advanced articles on Oracle Server Architecture.

We need to touch upon two major things here: first the server architecture, where we will learn the memory and process structures, and then the Oracle storage structure.

Database and Instance

Let's first understand the difference between an Oracle database and an Oracle instance. An Oracle database is a group of files that reside on disk and store the data, whereas an Oracle instance is a piece of shared memory and a number of processes that allow the information in the database to be accessed quickly and by multiple concurrent users.

The following picture shows the parts of database and instance.

[Figure: Database (Control File, Online Redo Log, Data File, Temp File) and Instance (Shared Memory (SGA), Processes)]

Now let's learn some details of both Database and Oracle Instance.

The Database

The database is comprised of different files as follows:

Control File

The Control File contains information that defines the rest of the database, like the names, locations and types of the other files.

Redo Log File

The Redo Log files keep track of the changes made to the database.

Data File

All user data and metadata are stored in data files.

Temp File

Temp files store the temporary information that is often generated when sorts are performed.

Each file has a header block that contains metadata about the file like SCN or system change number that says when data stored in buffer cache was flushed down to disk. This SCN information is important for Oracle to determine if the database is consistent.

The Instance

The instance is comprised of a shared memory segment (SGA) and a few processes. The following picture shows the Oracle structure.

Shared Memory Segment (SGA):

Shared Pool / Shared SQL Area | Contains various structures for running SQL and dependency tracking.

Database Buffer Cache | Contains the data blocks that are read from the database for transactions.

Redo Log Buffer | Stores the redo information until the information is flushed out to disk.

Details of the processes are shown below:

PMON (Process Monitor)
- Cleans up abnormally terminated connections
- Rolls back uncommitted transactions
- Releases locks held by a terminated process
- Frees SGA resources allocated to the failed processes
- Database maintenance

SMON (System Monitor)
- Performs automatic instance recovery
- Reclaims space used by temporary segments no longer in use
- Merges contiguous areas of free space in the datafile

DBWR (Database Writer)
- Writes all dirty buffers to datafiles
- Uses an LRU algorithm to keep most recently used blocks in memory
- Defers writes for I/O optimization

LGWR (Log Writer)
- Writes redo log entries to disk

CKPT (Checkpoint)
- If enabled (by setting the parameter CHECKPOINT_PROCESS=TRUE), takes over LGWR's task of updating files at a checkpoint
- Updates headers of datafiles and control files at the end of a checkpoint
- More frequent checkpoints reduce recovery time from instance failure

Other Processes
LCKn (Lock), Dnnn (Dispatcher), Snnn (Server), RECO (Recover), Pnnn (Parallel), SNPn (Job Queue), QMNn (Queue Monitor) etc.

Storage Structure

Here we will learn about both the physical and the logical storage structure. Physical storage is how Oracle stores the data physically in the system, whereas logical storage describes how an end user actually accesses that data. Physically, Oracle stores everything in files, called data files, whereas an end user accesses that data in terms of RDBMS tables, which is the logical part. Let's see the details of these structures. Physical storage space is comprised of different data files which contain data segments. Each segment can contain multiple extents, and each extent contains blocks, which are the most granular storage structure. The relationship among segments, extents and blocks is shown below.

Data Files > Segments (size: 96k) > Extents (size: 24k) > Blocks (size: 2k)
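
As a quick way to look at this hierarchy yourself (a sketch assuming access to the DBA data dictionary views and a table segment named EMP), you can query the extent map of a segment:

SELECT segment_name, extent_id, blocks, bytes
FROM dba_extents
WHERE segment_name = 'EMP'
ORDER BY extent_id;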

All about Informatica Lookup

A Lookup is a Passive , Connected or Unconnected Transformation used to look up data in a relational table, view, synonym or flat file. The integration service queries the lookup table to retrieve a value based on the input source value and the lookup condition.

All about Informatica LookUp Transformation

A connected lookup receives source data, performs a lookup and returns data to the pipeline, while an unconnected lookup is not connected to a source or target and is called by a transformation in the pipeline through a :LKP expression, which in turn returns only one column value to the calling transformation.

A Lookup can be Cached or Uncached. If we cache the lookup then we can further go for a static or dynamic or persistent cache, and a named or unnamed cache. By default, lookup transformations are cached and static.

Lookup Ports Tab

The Ports tab of the Lookup Transformation contains:

Input Ports: Create an input port for each lookup port we want to use in the lookup condition. We must have at least one input or input/output port in a Lookup transformation.

Output Ports: Create an output port for each lookup port we want to link to another transformation. For connected lookups, we must have at least one output port. For unconnected lookups, we must select a lookup port as a return port (R) to pass a return value.

Lookup Port: The Designer designates each column of the lookup source as a lookup port.

Return Port: An unconnected Lookup transformation has one return port that returns one column of data to the calling transformation through this port.

Notes: We can delete lookup ports from a relational lookup if the mapping does not use those lookup ports, which will give us a performance gain. But if the lookup source is a flat file, then deleting lookup ports fails the session.

Now let us have a look on the Properties Tab of the Lookup Transformation:

Lookup Sql Override: Override the default SQL statement to add a WHERE clause or to join multiple tables.

Lookup table name: The base table on which the lookup is performed.

Lookup Source Filter: We can apply filter conditions on the lookup table so as to reduce the number of records. For example, we may want to select the active records of the lookup table hence we may use the condition CUSTOMER_DIM.ACTIVE_FLAG = 'Y'.

Lookup caching enabled: If this option is checked, it caches the lookup table during the session run. Otherwise it goes for an uncached relational database hit. Remember to implement a database index on the columns used in the lookup condition to provide better performance when the lookup is uncached.

Lookup policy on multiple match: If the integration service finds multiple matches during the lookup, we can configure the lookup to return the First Value, Last Value, Any Value or to Report Error.

Lookup condition: The condition to lookup values from the lookup table based on source input data. For example, IN_EmpNo=EmpNo.

Connection Information: Query the lookup table from the source or target connection. In case of a flat file lookup we can give the file path and name, whether direct or indirect.

Source Type: Determines whether the source is relational database or flat file.

Tracing Level: It provides the amount of detail in the session log for the transformation. Options available are Normal, Terse, Verbose Initialization and Verbose Data.

Lookup cache directory name: Determines the directory name where the lookup cache files will reside.

Lookup cache persistent: Indicates whether we are going for persistent cache or non-persistent cache.

Dynamic Lookup Cache: When checked, we are going for a dynamic lookup cache; else a static lookup cache is used.

Output Old Value On Update: Defines whether the old value for output ports will be used to update an existing row in dynamic cache.

Cache File Name Prefix: Lookup will use this named persistent cache file based on the base lookup table.

Re-cache from lookup source: When checked, integration service rebuilds lookup cache from lookup source when the lookup instance is called in the session.

Insert Else Update: Insert the record if not found in cache, else update it. Option is available when using dynamic lookup cache.

Update Else Insert: Update the record if found in cache, else insert it. Option is available when using dynamic lookup cache.

Datetime Format: Used when source type is file to determine the date and time format of lookup columns.

Thousand Separator: By default it is None, used when source type is file to determine the thousand separator.

Decimal Separator: By default it is "."; else we can use ",". Used when the source type is file to determine the decimal separator.

Case Sensitive String Comparison: To be checked when we want to go for Case sensitive String values in lookup comparison. Used when source type is file.

Null ordering: Determines whether NULL is the highest or lowest value. Used when source type is file.

Sorted Input: Checked whenever we expect the input data to be sorted and is used when the source type is flat file.

Lookup source is static: When checked it assumes that the lookup source is not going to change during the session run.

Pre-build lookup cache: The default option is Auto. If we want the integration service to start building the cache as soon as the session begins, we can choose the option Always allowed.

Aggregation without Informatica Aggregator

Since Informatica processes data row by row, it is generally possible to handle data aggregation operations even without an Aggregator Transformation. In certain cases, you may get a huge performance gain using this technique!

General Idea of Aggregation without Aggregator Transformation

Let us take an example: Suppose we want to find the SUM of SALARY for each department of the employee table. The SQL query for this would be:

SELECT DEPTNO, SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;

If we need to implement this in Informatica, it would be very easy, as we would obviously go for an Aggregator Transformation. By taking the DEPTNO port as GROUP BY and one output port as SUM(SALARY), the problem can be solved easily. Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator. We will use the basic property of the Expression transformation of holding the value of an attribute of the previous row over here.

But wait... why would we do this? Aren't we complicating the thing here?

Yes, we are. But as it appears, in many cases it might have a performance benefit (especially if the input is already sorted or when you know the input data will not violate the order, like when you are loading daily data and want to sort it by day). Remember Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and this also voids the normal row by row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space requirement and ease out the row by row processing. The mapping below will show how to do this.

Sorter (SRT_SAL) Ports Tab

Now I am showing a sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use this, thereby increasing the performance benefit.

Expression (EXP_SAL) Ports Tab Image: Expression Ports Tab Properties

Sorter (SRT_SAL1) Ports Tab

Expression (EXP_SAL2) Ports Tab

Filter (FIL_SAL) Properties Tab

This is how we can implement aggregation without using Informatica aggregator transformation. Hope you liked it!
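
As a minimal sketch of the expression logic behind this pattern (the port names are illustrative and the exact ports used in the screenshots above may differ), the Expression transformation can hold the previous row's DEPTNO and a running total in variable ports, evaluated on data already sorted by DEPTNO:

v_SUM_SAL (variable port) = IIF(DEPTNO = v_PREV_DEPTNO, v_SUM_SAL + SALARY, SALARY)
v_PREV_DEPTNO (variable port) = DEPTNO
O_SUM_SAL (output port) = v_SUM_SAL

The running total is complete only on the last row of each DEPTNO group, and a further sorter/expression stage plus a Filter transformation can then be used downstream to keep just that one row per department, which matches the shape of the mapping above.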

Informatica Reject File - How to Identify rejection reason

When we run a session, the integration service may create a reject file for each target instance in the mapping to store the target reject record. With the help of the Session Log and Reject File we can identify the cause of data rejection in the session. Eliminating the cause of rejection will lead to rejection free loads in the subsequent session runs. If the Informatica Writer or the Target Database rejects data due to any valid reason the integration service logs the rejected records into the reject file. Every time we run the session the integration service appends the rejected records to the reject file.

Working with Informatica Bad Files or Reject Files

By default the Integration Service creates the reject files or bad files in the $PMBadFileDir process variable directory. It writes the entire reject record row in the bad file, although the problem may be in any one of the columns. The reject files have a default naming convention like [target_instance_name].bad. If we open the reject file in an editor we will see comma separated values having some tags/indicators and some data values. We will see two types of indicators in the reject file: one is the Row Indicator and the other is the Column Indicator. For reading the bad file the best method is to copy the contents of the bad file and save it as a CSV (Comma Separated Values) file. Opening the CSV file will give an Excel-sheet type look and feel. The first column in the reject file is the Row Indicator, which determines whether the row was destined for insert, update, delete or reject. It is basically a flag that determines the Update Strategy for the data row. When the Commit Type of the session is configured as User-defined, the row indicator indicates whether the transaction was rolled back due to a non-fatal error, or if the committed transaction was in a failed target connection group.

List of Values of Row Indicators:

Row Indicator    Significance    Rejected By

0 Insert Writer or target

1 Update Writer or target

2 Delete Writer or target

3 Reject Writer

4 Rolled-back insert Writer

5 Rolled-back update Writer

6 Rolled-back delete Writer

7 Committed insert Writer

8 Committed update Writer

9 Committed delete Writer


After the Row Indicator come the column data values, each followed by its Column Indicator, which describes the data quality of the corresponding column.

List of Values of Column Indicators:

Column Indicator    Type of data                Writer Treats As

D                   Valid data or good data.    Writer passes it to the target database. The target accepts it unless a database error occurs, such as finding a duplicate key while inserting.

O                   Overflowed numeric data.    Numeric data exceeded the specified precision or scale for the column. Bad data, if you configured the mapping target to reject overflow or truncated data.

N                   Null value.                 The column contains a null value. Good data. Writer passes it to the target, which rejects it if the target database does not accept null values.

T                   Truncated string data.      String data exceeded the specified precision for the column, so the Integration Service truncated it. Bad data, if you configured the mapping target to reject overflow or truncated data.

Note also that the second column contains the column indicator flag value 'D', which signifies that the Row Indicator itself is valid. Now let us see what data in a bad file looks like:

0,D,7,D,John,D,5000.375,O,,N,BrickLand Road Singapore,T


Database Performance Tuning

This article tries to list, as comprehensively as possible, the things one needs to know for Oracle database performance tuning. The ultimate goal is to provide a generic and comprehensive guideline for tuning Oracle databases from both the programmer's and the administrator's standpoint.

Oracle terms and ideas you need to know before beginning

Just to refresh your Oracle skills, here is a short run-through as a starter.

Oracle Parser

It performs syntax analysis as well as semantic analysis of SQL statements, expands views referenced in the query into separate query blocks, optimizes the statement and builds (or locates) an executable form of it.

Hard Parse

A hard parse occurs when a SQL statement is executed and the statement is either not in the shared pool, or is in the shared pool but cannot be shared. A SQL statement cannot be shared if the metadata of the two statements differs, i.e. the statement is textually identical to a pre-existing statement but the tables referenced in the two statements resolve to different objects, or the optimizer environment is different.

Soft Parse

A soft parse occurs when a session attempts to execute a SQL statement, and the statement is already in the shared pool, and it can be used (that is, shared). For a statement to be shared, all data, (including metadata, such as the optimizer execution plan) of the existing SQL statement must be equal to the current statement being issued.
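As a quick illustration (assuming the classic EMP demo table and a SQL*Plus-style client), compare the two approaches below:

-- Literal values: each distinct literal produces a textually different statement,
-- so every execution is likely to be hard parsed
SELECT ENAME, SAL FROM EMP WHERE EMPNO = 7369;
SELECT ENAME, SAL FROM EMP WHERE EMPNO = 7499;

-- Bind variable: the statement text stays identical, so after the first hard parse
-- subsequent executions can be soft parsed (shared from the shared pool)
VARIABLE v_empno NUMBER
EXEC :v_empno := 7369
SELECT ENAME, SAL FROM EMP WHERE EMPNO = :v_empno;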

Cost Based Optimizer

It generates a set of potential execution plans for a SQL statement, estimates the cost of each plan, compares the costs, and then chooses the plan with the lowest cost. This approach is used when the data dictionary has statistics for at least one of the tables accessed by the SQL statement. The CBO is made up of the query transformer, the estimator and the plan generator.

EXPLAIN PLAN

A SQL statement that enables examination of the execution plan chosen by the optimizer for DML statements. EXPLAIN PLAN makes the optimizer choose an execution plan and then put data describing that plan into a database table. The combination of steps Oracle uses to execute a DML statement is called an execution plan. An execution plan includes an access path for each table that the statement accesses and an ordering of the tables, i.e. the join order, with the appropriate join method.
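A minimal example (DBMS_XPLAN.DISPLAY reads the default PLAN_TABLE; the EMP_SRC table is simply reused from the aggregation example earlier):

-- Record the optimizer's chosen plan without executing the statement
EXPLAIN PLAN FOR
SELECT DEPTNO, SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;

-- Display the most recently explained plan
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);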

Oracle Trace

An Oracle utility used by the Oracle Server to collect performance and resource utilization data, such as SQL parse, execute and fetch statistics, and wait statistics. Oracle Trace collects server event data and stores it in memory, allows the data to be formatted while a collection is occurring, and provides several SQL scripts that can be used to access the server event tables.

SQL Trace

It is a basic performance diagnostic tool to monitor and tune applications running against the Oracle server. SQL Trace helps to understand the efficiency of the SQL statements an application runs and generates statistics for each statement. The trace files produced by this tool are used as input for TKPROF.
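For example, to trace the current session (a simple sketch; the exact parameters and trace-file locations vary by Oracle version and privileges):

-- Collect timing information along with the trace
ALTER SESSION SET TIMED_STATISTICS = TRUE;
-- Switch SQL tracing on for this session
ALTER SESSION SET SQL_TRACE = TRUE;

-- ... run the application SQL to be analysed ...
SELECT DEPTNO, SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;

-- Switch tracing off again
ALTER SESSION SET SQL_TRACE = FALSE;

The resulting trace file, written to the database server's trace directory, can then be formatted with TKPROF, described next.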

TKPROF

It is also a diagnostic tool to monitor and tune applications running against the Oracle Server. TKPROF primarily processes SQL trace output files and translates them into readable output files, providing a summary of user-level statements and recursive SQL calls for the trace files. It can also show the efficiency of SQL statements, generate execution plans, and create SQL scripts to store statistics in the database.


The following are generally accepted best practices for Informatica PowerCenter ETL development which, if implemented, can significantly improve overall performance.

Source Extracts

- Loading data from fixed-width files takes less time than from delimited files, since delimited files require extra parsing. In the case of fixed-width files, the Integration Service knows the start and end position of each column upfront, which reduces the processing time. (Benefit: Performance Improvement)

- Using flat files located on the server machine loads faster than a database located on the server machine. (Benefit: Performance Improvement)

Mapping Designer

- There should be a placeholder transformation (Expression) immediately after the source and one before the target. Data type and data width changes are bound to happen during the development phase, and these placeholder transformations are used to preserve the port links between transformations. (Benefit: Best Practices)

- Connect only the ports that are required in targets to subsequent transformations. Also, active transformations that reduce the number of records should be used as early in the mapping as possible. (Benefit: Code Optimization)

- If a join must be used in the mapping, select the appropriate driving/master table; the table with the lesser number of rows should be the driving/master table. (Benefit: Performance Improvement)

Transformations

- If there are multiple lookup conditions, put the condition with the "=" sign first in order to optimize lookup performance. Also, indexes on the database table should include every column used in the lookup condition (see the SQL sketch after this list). (Benefit: Code Optimization)

- Persistent caches should be used if the lookup data is not expected to change often. These cache files are saved and can be reused for subsequent runs, eliminating the need to query the database again. (Benefit: Performance Improvement)

- The Integration Service processes numeric operations faster than string operations. For example, if a lookup is done on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup condition around EMPLOYEE_ID improves performance. (Benefit: Code Optimization)

- Replace a complex filter expression with a flag (Y/N). The complex logic should be moved to an Expression transformation and the result stored in a port; the Filter then only needs to evaluate this port rather than executing the entire logic in the filter expression. (Benefit: Best Practices)

- The PowerCenter Server automatically makes conversions between compatible data types, which can slow down performance considerably. For example, if a mapping moves data from an Integer port to a Decimal port and then back to an Integer port, the conversion may be unnecessary. (Benefit: Performance Improvement)

- Assigning default values to a port, and transformation errors written to the session log, will always slow down session performance. Try removing default values and eliminating transformation errors. (Benefit: Performance Improvement)

- Complex joins in Source Qualifiers should be replaced with database views. There won't be any performance gain, but it improves readability a lot. Also, any new conditions can be evaluated easily by just changing the database view's WHERE clause (see the SQL sketch below).
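Two of the techniques above translate directly into SQL on the source/lookup database. The object names below (EMP_LKP, ORDERS, CUSTOMERS, V_ORDER_DETAILS and their columns) are purely hypothetical and only sketch the idea:

-- Index covering every column used in a lookup condition on a hypothetical lookup table
CREATE INDEX IDX_EMP_LKP ON EMP_LKP (EMPLOYEE_ID, EMPLOYEE_NAME);

-- Replace a complex multi-table join in the Source Qualifier with a database view;
-- new conditions can later be added by changing only the view's WHERE clause
CREATE OR REPLACE VIEW V_ORDER_DETAILS AS
SELECT O.ORDER_ID,
       O.ORDER_DATE,
       C.CUSTOMER_NAME
FROM   ORDERS O
       JOIN CUSTOMERS C ON C.CUSTOMER_ID = O.CUSTOMER_ID
WHERE  O.ORDER_STATUS = 'OPEN';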

Informatica Development Best Practice – Workflow

Workflow Manager default properties can be modified to improve overall performance; a few of them are listed below. These properties can impact the ETL runtime directly and need to be configured based on:

i) Source Database
ii) Target Database
iii) Data Volume

Session Properties

While loading staging tables for FULL loads, the 'Truncate target table' option should be checked. Based on the target database and whether a primary key is defined, the Integration Service fires a TRUNCATE or DELETE statement:

Database      Primary Key Defined           No Primary Key
DB2           TRUNCATE                      TRUNCATE
INFORMIX      DELETE                        DELETE
ODBC          DELETE                        DELETE
ORACLE        DELETE UNRECOVERABLE          TRUNCATE
MSSQL         DELETE                        TRUNCATE
SYBASE        TRUNCATE                      TRUNCATE

The workflow property "Commit Interval" (default value: 10,000) should be increased for volumes of more than 1 million records. The database rollback segment size should also be updated when increasing the "Commit Interval".

Insert/Update/Delete options should be set as determined by the target population method.

Target Option          Integration Service Behaviour
Insert                 Uses the target update option (Update as Update / Update as Insert / Update else Insert)
Update as Update       Updates all rows as Update
Update as Insert       Inserts all rows
Update else Insert     Updates existing rows, else inserts

Partition

The maximum number of partitions for a session should be about 1.5 times the number of processors on the Informatica server, e.g. 1.5 x 4 processors = 6 partitions.

Key Value partitions should be used only when an even distribution of data can be obtained; in other cases, Pass Through partitions should be used. A source filter should be added to distribute the data evenly between Pass Through partitions. The Key Value should contain ONLY numeric values, for example: MOD(NVL(<Numeric Key Value>,0), # of partitions defined), e.g. MOD(NVL(product_sys_no,0),6).

If a session contains "N" partitions, increase the DTM Buffer Size to at least "N" times the value used for the session with one partition.

If the source or target database is an MPP (Massively Parallel Processing) database, enable Pushdown Optimization. When this is enabled, the Integration Service pushes as much transformation logic as possible to the source database, the target database, or both (FULL), based on the settings. This property can be ignored for conventional databases.

Informatica and Oracle hints in SQL overrides

Hints used in a SQL statement send instructions to the Oracle optimizer, which can reduce the query processing time. Can we make use of these hints in SQL overrides within our Informatica mappings to improve query performance?

On a general note, any Informatica help material will tell you that you can enter any valid SQL statement supported by the source database in the SQL override of a Source Qualifier or a Lookup transformation, or at the session properties level. While using hints in a Source Qualifier has no complications, using them in a Lookup SQL override gets a bit tricky. Use of a forward slash followed by an asterisk ("/*"), generally used for comments in SQL and at times for Oracle hints, in a Lookup SQL override results in session failure with an error like:

TE_7017 : Failed to Initialize Server Transformation lkp_transaction
2009-02-19 12:00:56 : DEBUG : (18785 | MAPPING) : (IS | Integration_Service_xxxx) : node01_UAT-xxxx : DBG_21263 : Invalid lookup override SELECT SALES.SALESSEQ as SalesId, SALES.OrderID as ORDERID, SALES.OrderDATE as ORDERDATE FROM SALES, AC_SALES WHERE AC_SALES.OrderSeq >= (Select /*+ FULL(AC_Sales) PARALLEL(AC_Sales,12) */ min(OrderSeq) From AC_Sales)

This is because Informatica's parser fails to recognize this special character sequence when used in a Lookup override. A parameter has been available starting with the PowerCenter 7.1.3 release which enables the use of the forward slash, and therefore of hints.

  Infa 7.x

1. Using a text editor, open the PowerCenter server configuration file (pmserver.cfg).
2. Add the following entry at the end of the file: LookupOverrideParsingSetting=1
3. Re-start the PowerCenter server (pmserver).

   Infa 8.x

1. Connect to the Administration Console.
2. Stop the Integration Service.
3. Select the Integration Service.
4. Under the Properties tab, click Edit in the Custom Properties section.
5. Under Name, enter LookupOverrideParsingSetting.
6. Under Value, enter 1.
7. Click OK.
8. Start the Integration Service.

Starting with PowerCenter 8.5, this change can be done at the session task itself, as follows:


1. Edit the session.
2. Select the Config Object tab.
3. Under Custom Properties, add the attribute LookupOverrideParsingSetting and set the value to 1.
4. Save the session.

Informatica PowerCenter 8x Key Concepts – 1

We shall look at the fundamental components of the Informatica PowerCenter 8.x suite; the key components are:

1. PowerCenter Domain
2. PowerCenter Repository
3. Administration Console
4. PowerCenter Client
5. Repository Service
6. Integration Service

PowerCenter Domain

A domain is the primary unit for management and administration of services in PowerCenter. Node, Service Manager and Application Services are components of a domain.

Node

A node is the logical representation of a machine in a domain. The machine on which PowerCenter is installed acts as the domain and also as the primary node. We can add other machines as nodes in the domain and configure them to run application services such as the Integration Service or the Repository Service. All service requests from other nodes in the domain go through the primary node, also called the 'master gateway'.

The Service Manager

The Service Manager runs on each node within a domain and is responsible for starting and running the application services. The Service Manager performs the following functions, 

Alerts. Provides notifications of events like shutdowns, restart


Authentication. Authenticates user requests from the Administration Console, PowerCenter Client, Metadata Manager, and Data Analyzer

Domain configuration. Manages configuration details of the domain like machine name, port

Node configuration. Manages configuration details of a node, such as machine name and port

Licensing. When an application service connects to the domain for the first time the licensing registration is performed and for subsequent connections the licensing information is verified

Logging. Manages the event logs from each service, the messages could be ‘Fatal’, ‘Error’, ‘Warning’, ‘Info’

User management. Manages users, groups, roles, and privileges

Application Services

The services that actually perform data movement, connect to different data sources and manage data are called application services, namely the Repository Service, Integration Service, Web Services Hub, SAP BW Service, Reporting Service and Metadata Manager Service. The application services run on each node based on the way we configure the node and the application service.

Domain Configuration

Configuring a domain involves assigning host names and port numbers to the nodes, setting up Resilience Timeout values, providing connection information for the metadata database, SMTP details, etc. All the configuration information for a domain is stored in a set of relational database tables within the repository. Some of the global properties that are applicable to application services, like 'Maximum Restart Attempts' and 'Dispatch Mode' ('Round Robin'/'Metric Based'/'Adaptive'), are configured under Domain Configuration.

2. PowerCenter Repository

The PowerCenter Repository has one of the best metadata stores among all ETL products. The repository is sufficiently normalized to store metadata at a very detailed level, which in turn means that updates to the repository are very quick and overall team-based development is smooth. The repository data structure is also useful for users who want to do analysis and reporting.

Accessibility to the repository through MX views and SDK kit extends the repositories capability from a simple storage of technical data to a database for analysis of the ETL metadata.


PowerCenter Repository is a collection of 355 tables which can be created on any major relational database. The kinds of information that are stored in the repository are,

1. Repository configuration details
2. Mappings
3. Workflows
4. User security
5. Process data of session runs

For a quick understanding:

When a user creates a folder, corresponding entries are made into the table OPB_SUBJECT; attributes like folder name, owner id and the type of the folder (shared or not) are all stored.
When we create or import sources and define field names, datatypes etc. in the Source Analyzer, entries are made into OPB_SRC and OPB_SRC_FLD.
When a target and its related fields are created or imported from any database, entries are made into tables like OPB_TARG and OPB_TARG_FLD.
Table OPB_MAPPING stores mapping attributes like mapping name, folder id, valid status and mapping comments.
Table OPB_WIDGET stores attributes like widget type, widget name, comments etc. Widgets are nothing but the transformations, which Informatica internally calls widgets.
Table OPB_SESSION stores configurations related to a session task, and table OPB_CNX_ATTR stores information related to connection objects.
Table OPB_WFLOW_RUN stores process details like workflow name, workflow start time, workflow completion time, the server node it ran on, etc.
REP_ALL_SOURCES, REP_ALL_TARGETS and REP_ALL_MAPPINGS are a few of the many views created over these tables.

PowerCenter applications access the PowerCenter repository through the Repository Service. The Repository Service protects metadata in the repository by managing repository connections and using object locking to ensure object consistency.

We can create a repository as global or local. We can go for a global repository to store common objects that multiple developers can use through shortcuts, and go for a local repository to perform development of mappings and workflows. From a local repository, we can create shortcuts to objects in shared folders in the global repository. PowerCenter supports versioning; a versioned repository can store multiple versions of an object.
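As an illustration only (the OPB_ tables are undocumented and their column names can differ between PowerCenter versions, so treat the column names below as assumptions and prefer the documented MX/REP_ views where possible), a query like this could list recent workflow runs from the repository database:

-- Hypothetical query against the repository database
SELECT WORKFLOW_NAME,
       START_TIME,
       END_TIME,
       RUN_STATUS_CODE
FROM   OPB_WFLOW_RUN
WHERE  START_TIME > SYSDATE - 7
ORDER  BY START_TIME DESC;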

3. Administration Console


The Administration Console is a web application that we use to administer the PowerCenter domain and PowerCenter security. There are two pages in the console: the Domain page and the Security page.

We can do the following in the Domain page:

o Create and manage application services like the Integration Service and Repository Service
o Create and manage nodes, licenses and folders
o Restart and shut down nodes
o View log events
o Other domain management tasks like applying licenses and managing grids and resources

We can do the following in the Security page:

o Create, edit and delete native users and groups
o Configure a connection to an LDAP directory service, and import users and groups from the LDAP directory service
o Create, edit and delete roles (roles are collections of privileges)
o Assign roles and privileges to users and groups
o Create, edit and delete operating system profiles. An operating system profile is a level of security that the Integration Service uses to run workflows

4. PowerCenter Client

Designer, Workflow Manager, Workflow Monitor, Repository Manager and Data Stencil are the five client tools used to design mappings and mapplets, create sessions to load data, and manage the repository.

A mapping is ETL code pictorially depicting the logical data flow from source to target, involving transformations of the data. Designer is the tool used to create mappings. Designer has five window panes: Source Analyzer, Warehouse Designer, Transformation Developer, Mapping Designer and Mapplet Designer.

Source Analyzer: Allows us to import source table metadata from relational databases, flat files, XML and COBOL files. Note that we only import the source definition in the Source Analyzer, not the source data itself. The Source Analyzer also allows us to define our own source data definitions.

Warehouse Designer: Allows us to import target table definitions, which could come from relational databases, flat files, XML and COBOL files. We can also create target definitions manually and group them into folders. There is an option to create the tables physically in the database, which we do not have in the Source Analyzer. The Warehouse Designer does not allow creating two tables with the same name, even if the column names under them vary or they are from different databases/schemas.

Transformation Developer: Transformations like Filters, Lookups, Expressions etc. that are meant to be re-used are developed in this pane. Alternatively, transformations developed in the Mapping Designer can also be reused by checking the 're-use' option, after which they are displayed under the Transformation Developer folders.

Mapping Designer: This is the place where we actually depict our ETL process; we bring in source definitions, target definitions and transformations like Filter, Lookup and Aggregator, and develop a logical ETL program. It is only a logical program at this stage, because the actual data load can be done only by creating a session and a workflow.

Mapplet Designer: Here we create a set of transformations to be used and re-used across mappings.

4. PowerCenter Client (contd.)

Workflow Manager: In the Workflow Manager, we define a set of instructions called a workflow to execute the mappings we build in the Designer. Generally, a workflow contains a session and any other tasks we may want to perform when we run a session. Tasks can include a session, e-mail notification, or scheduling information.

A set of tasks grouped together becomes a worklet. After we create a workflow, we run it in the Workflow Manager and monitor it in the Workflow Monitor. The Workflow Manager has the following three window panes:

Task Developer – Create the tasks we want to accomplish in the workflow.
Worklet Designer – Create a worklet in the Worklet Designer. A worklet is an object that groups a set of tasks. A worklet is similar to a workflow, but without scheduling information. You can nest worklets inside a workflow.
Workflow Designer – Create a workflow by connecting tasks with links in the Workflow Designer. We can also create tasks in the Workflow Designer as we develop the workflow.

The ODBC connection details are defined in the Workflow Manager "Connections" menu.

Workflow Monitor : We can monitor workflows and tasks in the Workflow Monitor. We can view details about a workflow or task in Gantt Chart view or Task view. We can run, stop, abort, and resume workflows from the Workflow Monitor. We can view sessions and workflow log events in the Workflow Monitor Log Viewer.

The Workflow Monitor displays workflows that have run at least once. The Workflow Monitor continuously receives information from the Integration Service and Repository Service. It also fetches information from the repository to display historic information.

The Workflow Monitor consists of the following windows:

Navigator window – Displays monitored repositories, servers, and repository objects.

Output window – Displays messages from the Integration Service and Repository Service.

Time window – Displays progress of workflow runs.

Gantt chart view – Displays details about workflow runs in chronological format.

Task view – Displays details about workflow runs in a report format.

Repository Manager

We can navigate through multiple folders and repositories and perform basic repository tasks with the Repository Manager. We use the Repository Manager to complete the following tasks:

2. Add and connect to a repository. We can add repositories to the Navigator window and client registry and then connect to the repositories.

3. Work with PowerCenter domain and repository connections, we can edit or remove domain connection information. We can connect to one repository or multiple repositories. We can export repository connection information from the client registry to a file. We can import the file on a different machine and add the repository connection information to the client registry.

4. Change your password. We can change the password for our user account.

5. Search for repository objects or keywords. We can search for repository objects containing specified text. If we add keywords to target definitions, we can use a keyword to search for a target definition.


6. View objects dependencies. Before we remove or change an object, we can view dependencies to see the impact on other objects.

7. Compare repository objects. In the Repository Manager, we can compare two repository objects of the same type to identify differences between the objects.

8. Truncate session and workflow log entries. We can truncate the list of session and workflow logs that the Integration Service writes to the repository. We can truncate all logs, or truncate all logs older than a specified date.

5. Repository Service

Having already discussed the metadata repository, we now discuss the Repository Service: a separate, multi-threaded process that retrieves, inserts and updates metadata in the repository database tables. The Repository Service manages connections to the PowerCenter repository from PowerCenter client applications like the Designer, Workflow Manager, Workflow Monitor, Repository Manager, Administration Console and Integration Service. The Repository Service is responsible for ensuring the consistency of metadata in the repository.

Creation & Properties:

Use the PowerCenter Administration Console Navigator window to create a Repository Service. The properties needed to create one are:

Service Name – name of the service, like rep_SalesPerformanceDev
Location – domain and folder where the service is created
License – license service name
Node, Primary Node & Backup Nodes – node on which the service process runs
CodePage – the Repository Service uses the character set encoded in the repository code page when writing data to the repository
Database type & details – type of database, username, password, connect string and tablespace name

The above properties are sufficient to create a repository service; however, the following features are also important for better performance and maintenance.

General Properties 

> OperatingMode: Values are Normal and Exclusive. Use Exclusive mode to perform administrative tasks like enabling version control or promoting local to global repository

> EnableVersionControl: Creates a versioned repository

Node Assignments: the "High availability option" is a licensed feature which allows us to choose Primary and Backup nodes for continuous running of the Repository Service. Under a normal license we would see only one node to select from.

Database Properties 

> DatabaseArrayOperationSize: Number of rows to fetch each time an array database operation is issued, such as insert or fetch. Default is 100

> DatabasePoolSize: Maximum number of connections to the repository database that the Repository Service can establish. If the Repository Service tries to establish more connections than specified for DatabasePoolSize, it times out the connection attempt after the number of seconds specified for DatabaseConnectionTimeout.

Advanced Properties


> CommentsRequiredForCheckin: Requires users to add comments when checking in repository objects.

> Error Severity Level: Level of error messages written to the Repository Service log. Specify one of the following message levels: Fatal, Error, Warning, Info, Trace & Debug

> EnableRepAgentCaching: Enables repository agent caching. Repository agent caching provides optimal performance of the repository when you run workflows. When you enable repository agent caching, the Repository Service process caches metadata requested by the Integration Service. Default is Yes.

> RACacheCapacity: Number of objects that the cache can contain when repository agent caching is enabled. You can increase the number of objects if there is available memory on the machine running the Repository Service process. The value must be between 100 and 10,000,000,000. Default is 10,000.

> AllowWritesWithRACaching: Allows you to modify metadata in the repository when repository agent caching is enabled. When you allow writes, the Repository Service process flushes the cache each time you save metadata through the PowerCenter Client tools. You might want to disable writes to improve performance in a production environment where the Integration Service makes all changes to repository metadata. Default is Yes.

Environment Variables

The database client code page on a node is usually controlled by an environment variable. For example, Oracle uses NLS_LANG, and IBM DB2 uses DB2CODEPAGE. All Integration Services and Repository Services that run on this node use the same environment variable. You can configure a Repository Service process to use a different value for the database client code page environment variable than the value set for the node.

You might want to configure the code page environment variable for a Repository Service process when the Repository Service process requires a different database client code page than the Integration Service process running on the same node.

For example, the Integration Service reads from and writes to databases using the UTF-8 code page. The Integration Service requires that the code page environment variable be set to UTF-8. However, you have a Shift-JIS repository that requires that the code page environment variable be set to Shift-JIS. Set the environment variable on the node to UTF-8. Then add the environment variable to the Repository Service process properties and set the value to Shift-JIS.

6.  Integration Service (IS)

The key functions of the IS are:

- Interpretation of the workflow and mapping metadata from the repository
- Execution of the instructions in the metadata
- Management of the data from source system to target system within memory and on disk

The three main components of the Integration Service which enable data movement are:

1. Integration Service Process
2. Load Balancer
3. Data Transformation Manager


6.1 Integration Service Process (ISP)

The Integration Service starts one or more Integration Service processes to run and monitor workflows. When we run a workflow, the ISP starts and locks the workflow, runs the workflow tasks, and starts the process to run sessions. The functions of the Integration Service Process are,

- Locks and reads the workflow
- Manages workflow scheduling, i.e. maintains session dependency
- Reads the workflow parameter file
- Creates the workflow log
- Runs workflow tasks and evaluates the conditional links
- Starts the DTM process to run the session
- Writes historical run information to the repository
- Sends post-session emails

6.2    Load Balancer

The Load Balancer dispatches tasks to achieve optimal performance. It dispatches tasks to a single node or across the nodes in a grid after performing a sequence of steps. Before understanding these steps we have to know about Resources, Resource Provision Thresholds, Dispatch mode and Service levels

Resources – we can configure the Integration Service to check the resources available on each node and match them with the resources required to run the task. For example, if a session uses an SAP source, the Load Balancer dispatches the session only to nodes where the SAP client is installed

There are three Resource Provision Thresholds:

- Maximum CPU Run Queue Length: the maximum number of runnable threads waiting for CPU resources on the node
- Maximum Memory %: the maximum percentage of virtual memory allocated on the node relative to the total physical memory size
- Maximum Processes: the maximum number of running Session and Command tasks allowed for each Integration Service process running on the node

There are three dispatch modes:

- Round-Robin: the Load Balancer dispatches tasks to available nodes in a round-robin fashion after checking the Maximum Processes threshold
- Metric-based: checks all three resource provision thresholds and dispatches tasks in round-robin fashion
- Adaptive: checks all three resource provision thresholds and also ranks nodes according to current CPU availability

Service Levels establish priority among tasks that are waiting to be dispatched; the three components of a service level are Name, Dispatch Priority and Maximum Dispatch Wait Time. "Maximum dispatch wait time" is the amount of time a task can wait in the queue, which ensures no task waits forever.

A. Dispatching tasks on a node

1. The Load Balancer checks the resource provision thresholds on the node, depending on the dispatch mode set. If dispatching the task would cause any threshold to be exceeded, the Load Balancer places the task in the dispatch queue and dispatches it later.
2. The Load Balancer dispatches all tasks to the node that runs the master Integration Service process.

B. Dispatching tasks on a grid

1. The Load Balancer verifies which nodes are currently running and enabled.
2. The Load Balancer identifies nodes that have the PowerCenter resources required by the tasks in the workflow.
3. The Load Balancer verifies that the resource provision thresholds on each candidate node are not exceeded. If dispatching the task would cause a threshold to be exceeded, the Load Balancer places the task in the dispatch queue and dispatches it later.
4. The Load Balancer selects a node based on the dispatch mode.

6.3 Data Transformation Manager (DTM) Process

When the workflow reaches a session, the Integration Service Process starts the DTM process. The DTM is the process associated with the session task. The DTM process performs the following tasks:

- Retrieves and validates session information from the repository
- Validates source and target code pages
- Verifies connection object permissions
- Performs pushdown optimization when the session is configured for pushdown optimization
- Adds partitions to the session when the session is configured for dynamic partitioning
- Expands the service process variables, session parameters, and mapping variables and parameters
- Creates the session log
- Runs pre-session shell commands, stored procedures, and SQL
- Sends a request to start worker DTM processes on other nodes when the session is configured to run on a grid
- Creates and runs mapping, reader, writer, and transformation threads to extract, transform, and load data
- Runs post-session stored procedures, SQL, and shell commands and sends post-session email
- After the session is complete, reports the execution result to the ISP


Pictorial Representation of Workflow execution:

1. A PowerCenter Client requests the IS to start a workflow
2. The IS starts an ISP
3. The ISP consults the Load Balancer to select a node
4. The ISP starts the DTM on the node selected by the Load Balancer

Change Data Capture in Informatica

Change data capture (CDC) is an approach or technique to identify changes, and only changes, in the source. I have seen applications that were built without CDC and later mandated to implement CDC at a higher cost. Building an ETL application without CDC is a costly miss and usually a backtracking step. In this article we discuss different methods of implementing CDC.

Scenario #01: Change detection using timestamp on source rows

In this typical scenario the source rows have two extra columns, say row_created_time and last_modified_time. Row_created_time is the time at which the record was first created; last_modified_time is the time at which the record was last modified.

1. In the mapping create mapping variable $$LAST_ETL_RUN_TIME of datetime data type

2. Evaluate the condition SetMaxVariable($$LAST_ETL_RUN_TIME, SessionStartTime); this step stores the time at which the session was started into $$LAST_ETL_RUN_TIME

3. Use $$LAST_ETL_RUN_TIME in the ‘where’ clause of the source SQL. During the first run or initial seed the mapping variable would have a default value and pull all the records from the source, like: select * from employee where last_modified_date > ’01/01/1900 00:00:000’


4. Now let us assume the session is run on '01/01/2010 00:00:000' for the initial seed
5. When the session is executed on '02/01/2010 00:00:000', the SQL would be: select * from employee where last_modified_date > '01/01/2010 00:00:000', thereby pulling only the records that changed between successive runs
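Putting it together, the Source Qualifier SQL override for this scenario would look something like the sketch below. The table and column names are the ones used in the example above; the Integration Service expands the mapping variable before the query is sent to the database, and the TO_DATE format mask is only an assumption that must match the format in which the variable value is actually stored:

SELECT *
FROM   EMPLOYEE
-- $$LAST_ETL_RUN_TIME is replaced with its persisted value at run time;
-- the format mask below is an assumption for this sketch
WHERE  LAST_MODIFIED_DATE > TO_DATE('$$LAST_ETL_RUN_TIME', 'MM/DD/YYYY HH24:MI:SS')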

Scenario #02: Change detection using load_id or run_id

Under this scenario the source rows have a column, say load_id, containing a positive running number. The load_id is updated as and when the record is updated.

1. In the mapping create mapping variable $$LAST_READ_LOAD_ID of integer data type

2. Evaluate the condition SetMaxVariable($$LAST_READ_LOAD_ID, load_id); the maximum load_id is stored into the mapping variable

3. Use $$LAST_READ_LOAD_ID in the ‘where’ clause of the source SQL. During the first run or initial seed the mapping variable would have a default value and pull all the records from the source, like: select * from employee where load_id > 0; Assuming all records during initial seed have load_id =1, the mapping variable would store ‘1’ into the repository.

4. Now let us assume the session is run after five loads into the source; the SQL would be: select * from employee where load_id > 1, whereby we limit the source read to only the records that have changed after the initial seed

5. Consecutive runs would take care of updating the load_id & pulling the delta in sequence

In the next blog we can see how to implement CDC when reading from Salesforce.com