Best Informatica Interview Questions & Answers
Deleting duplicate row using Informatica
Q1. Suppose we have Duplicate records in Source System and we want to load only the unique records in
the Target System eliminating the duplicate rows. What will be the approach?
Ans.
Let us assume that the source system is a relational database and the source table contains duplicate rows. To eliminate the duplicate records, we can check the Distinct option in the Source Qualifier of the source table and load the target accordingly.
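Since the Distinct option simply makes the generated source query a SELECT DISTINCT, the effect can be sketched with a small in-memory database (the table and column names here are illustrative, not from any real system):

```python
import sqlite3

# In-memory database standing in for the relational source (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_emp (emp_id INTEGER, emp_name TEXT)")
conn.executemany(
    "INSERT INTO src_emp VALUES (?, ?)",
    [(1, "Amit"), (2, "Rita"), (1, "Amit"), (3, "John"), (2, "Rita")],
)

# Checking the Distinct option makes the Source Qualifier generate a
# SELECT DISTINCT instead of a plain SELECT, so duplicate rows never
# reach the target.
unique_rows = conn.execute(
    "SELECT DISTINCT emp_id, emp_name FROM src_emp ORDER BY emp_id"
).fetchall()
print(unique_rows)  # [(1, 'Amit'), (2, 'Rita'), (3, 'John')]
```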
Informatica Join Vs Database Join
Which is the fastest? Informatica or Oracle?
In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle and found that, under our test conditions, Oracle performs sorting 14% faster than Informatica. This time we will look into the JOIN operation, not only because JOIN is the single most important data set operation, but also because the performance of JOIN gives a developer crucial data for implementing proper pushdown optimization manually.
Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. On the other hand, Oracle database is arguably the most successful and powerful RDBMS, trusted since the 1980s across all sorts of business domains and all major platforms. Both systems are best at the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between the two. This article will help them make an informed decision.
Which JOINs data faster? Oracle or Informatica?
As an application developer, you have the choice of either using join syntax at the database level to join your data or using a Joiner transformation in Informatica to achieve the same outcome. The question is: which system performs this faster?
Test Preparation
We will perform the same test with 4 different data points (data volumes) and log the results. We will start with 1 million rows in the detail table and 0.1 million rows in the master table. Subsequently we will test with 2 million, 4 million and 6 million detail table rows against 0.2 million, 0.4 million and 0.6 million master table rows respectively. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as the ETL tool
3. Database and Informatica set up on different physical servers running HP-UX
4. Source database table has no constraints, no indexes, no database statistics and no partitions
5. Source database table is not available in the Oracle shared pool before it is read
6. No session-level partitioning in Informatica PowerCenter
7. No parallel hint provided in the extraction SQL query
8. The Informatica Joiner has enough cache size
We have used two sets of Informatica PowerCenter mappings created in the PowerCenter Designer. The first mapping, m_db_side_join, uses an INNER JOIN clause in the Source Qualifier to join the data at the database level. The second mapping, m_Infa_side_join, uses an Informatica Joiner to join the data at the Informatica level. We executed these mappings with the different data points and logged the results.
Further to the above test, we executed the m_db_side_join mapping once again, this time with proper database-side indexes and statistics, and logged the results.
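The two mapping strategies can be sketched side by side in a small, self-contained example (table names and data are illustrative): a database-side INNER JOIN versus a Joiner-style hash join that caches the master table and streams the detail rows past it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master_t (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE detail_t (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO master_t VALUES (?, ?)", [(1, "EAST"), (2, "WEST")])
conn.executemany("INSERT INTO detail_t VALUES (?, ?)",
                 [(1, 100), (2, 200), (1, 50), (3, 75)])

# Approach 1 (m_db_side_join): push the INNER JOIN into the source qualifier
# query, so the database does the matching.
db_side = conn.execute(
    "SELECT d.id, d.amount, m.region FROM detail_t d "
    "JOIN master_t m ON m.id = d.id ORDER BY d.rowid"
).fetchall()

# Approach 2 (m_Infa_side_join): read both tables separately and join in the
# tool. The Joiner caches the smaller (master) table and streams the detail
# rows - sketched here as a hash join.
master_cache = dict(conn.execute("SELECT id, region FROM master_t"))
infa_side = [
    (id_, amount, master_cache[id_])
    for id_, amount in conn.execute("SELECT id, amount FROM detail_t")
    if id_ in master_cache
]

assert db_side == infa_side  # both strategies produce the same joined rows
```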
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join the data. The average time is plotted along the vertical axis and the data points along the horizontal axis.
Data Points Master Table Record Count Detail Table Record Count
1 0.1 M 1 M
2 0.2 M 2 M
3 0.4 M 4 M
4 0.6 M 6 M
Verdict
In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner transformation without a database index, and 42% faster with database indexes in place.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
1. This data can only be used for performance comparison but cannot be used for performance
benchmarking.
2. This data is only indicative and may vary in different testing conditions.
Comparing Performance of SORT operation (Order By) in Informatica and Oracle
In this DWBI Concepts original article, we put Oracle database and Informatica PowerCenter head to head to determine which of them handles the data SORT operation faster. That article gives application developers crucial insight for making informed performance-tuning decisions.
Informatica Reject File - How to Identify rejection reason
When we run a session, the Integration Service may create a reject file for each target instance in the mapping to store the target reject records. With the help of the session log and the reject file we can identify the cause of data rejection in the session. Eliminating the cause of rejection will lead to rejection-free loads in subsequent session runs. If the Informatica Writer or the target database rejects data for any valid reason, the Integration Service logs the rejected records into the reject file. Every time we run the session, the Integration Service appends the rejected records to the reject file.
Working with Informatica Bad Files or Reject Files
By default the Integration Service creates the reject files, or bad files, in the $PMBadFileDir process variable directory. It writes the entire reject record row in the bad file, although the problem may be in any one of the columns. The reject files have a default naming convention of [target_instance_name].bad. If we open the reject file in an editor, we will see comma-separated values with some tags/indicators and some data values. We will see two types of indicators in the reject file: the Row Indicator and the Column Indicator.
For reading the bad file, the best method is to copy its contents and save them as a CSV (Comma Separated Values) file. Opening the CSV file gives an Excel-sheet-like look and feel. The first column in the reject file is the Row Indicator, which determines whether the row was destined for insert, update, delete or reject. It is basically a flag that determines the Update Strategy for the data row. When the Commit Type of the session is configured as User-defined, the row indicator indicates whether the transaction was rolled back due to a non-fatal error, or whether the committed transaction was in a failed target connection group.
Then come the Column Data values, each paired with a Column Indicator that determines the data quality of the corresponding column.
List of Values of Column Indicators:

D - Valid data or good data. The Writer passes it to the target database. The target accepts it unless a database error occurs, such as finding a duplicate key while inserting.

O - Overflowed numeric data. Numeric data exceeded the specified precision or scale for the column. Bad data, if you configured the mapping target to reject overflow or truncated data.

N - Null value. The column contains a null value. Good data. The Writer passes it to the target, which rejects it if the target database does not accept null values.

T - Truncated string data. String data exceeded a specified precision for the column, so the Integration Service truncated it. Bad data, if you configured the mapping target to reject overflow or truncated data.
Note also that the second column contains the column indicator flag 'D', which signifies that the Row Indicator is valid.
Now let us see what data in a bad file looks like:
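Since the bad file itself is not reproduced here, the following sketch shows a hypothetical reject file fragment and how its indicators can be read. The sample rows and values are invented for illustration; consult your own .bad file for the exact content.

```python
import csv
import io

# A hypothetical .bad file fragment: each row starts with the Row Indicator,
# then alternates a column indicator ('D', 'O', 'N', 'T') with the column value.
bad_file = io.StringIO(
    "0,D,1234,D,Amit,N,\n"       # destined for insert; last column was NULL
    "0,D,5678,T,Ramakrish,O,\n"  # truncated string and overflowed number
)

ROW_INDICATORS = {"0": "Insert", "1": "Update", "2": "Delete", "3": "Reject"}

parsed = []
for row in csv.reader(bad_file):
    row_flag, rest = row[0], row[1:]
    # Pair up (column_indicator, column_value) for each target column.
    cols = list(zip(rest[0::2], rest[1::2]))
    parsed.append((ROW_INDICATORS[row_flag], cols))

print(parsed[0])
```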
Implementing Informatica Incremental Aggregation
Using incremental aggregation, we apply captured changes in the source data (the CDC part) to aggregate calculations in a session. If the source changes incrementally and we can capture those changes, we can configure the session to process only the changes. This allows the Integration Service to update the target incrementally, rather than deleting previously loaded data and reprocessing the entire source to recalculate the same aggregates on every run.
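A minimal sketch of the idea, assuming a simple sum aggregate keyed by some group column (the data is invented for illustration):

```python
# Incremental aggregation: instead of recomputing totals from the whole
# history, apply only the changed (CDC) rows to the previously saved
# aggregate - the way the Integration Service updates its aggregate cache.

def full_recompute(all_rows):
    totals = {}
    for key, amount in all_rows:
        totals[key] = totals.get(key, 0) + amount
    return totals

def incremental(cache, changed_rows):
    # Only the new/changed source rows are processed against the cache.
    for key, amount in changed_rows:
        cache[key] = cache.get(key, 0) + amount
    return cache

day1 = [("A", 10), ("B", 5)]
day2 = [("A", 3), ("C", 7)]      # incremental load captured by CDC

cache = full_recompute(day1)     # first run builds the aggregate cache
cache = incremental(cache, day2) # later runs touch only the changed rows

assert cache == full_recompute(day1 + day2)  # same result, far less work
```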
Using Informatica Normalizer Transformation
The Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements. Learn how to use the Normalizer effectively here.
Using Normalizer Transformation
A Normalizer is an active transformation that returns multiple rows from a single source row. It also returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. It can be used to transpose data in columns into rows.
Normalizer effectively does the opposite of what Aggregator does!
Example of Data Transpose using Normalizer
Think of a relational table that stores four quarters of sales by store, where we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, as below.
The following source rows contain four quarters of sales by store:
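The source table itself is not reproduced here, but the transpose the Normalizer performs can be sketched as follows, with invented store names and sales figures:

```python
# One source row with four quarter columns becomes four target rows, each
# tagged with an occurrence index (akin to the Normalizer's generated
# GCID column).

source_rows = [
    {"store": "Store1", "q1": 100, "q2": 300, "q3": 500, "q4": 700},
    {"store": "Store2", "q1": 250, "q2": 450, "q3": 650, "q4": 850},
]

normalized = [
    {"store": row["store"], "quarter": i, "sales": row[f"q{i}"]}
    for row in source_rows
    for i in (1, 2, 3, 4)
]

print(normalized[0])  # {'store': 'Store1', 'quarter': 1, 'sales': 100}
```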
Pushdown Optimization In Informatica - Pushdown Optimization Viewer
Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database. Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the corresponding SQL statement that is generated for the specified selections. When we select a pushdown option or pushdown group, we do not change the pushdown configuration. To change the configuration, we must update the pushdown option in the session properties.
Databases that support Informatica Pushdown Optimization
We can configure sessions for pushdown optimization against any of the following databases: Oracle, IBM DB2, Teradata, Microsoft SQL Server, Sybase ASE, or databases that use ODBC drivers.
When we use native drivers, the Integration Service generates SQL statements using native database SQL. When we use ODBC drivers, the Integration Service generates SQL statements using ANSI SQL. The Integration Service can push more functions to the database when it generates SQL statements in the native language instead of ANSI SQL.
Pushdown Optimization In Informatica - Pushdown Optimization Error Handling
Handling Errors when Pushdown Optimization is Enabled
When the Integration Service pushes transformation logic to the database, it cannot track errors that occur in
the database.
When the Integration Service runs a session configured for full pushdown optimization and an error occurs,
the database handles the errors. When the database handles errors, the Integration Service does not write
reject rows to the reject file.
If we configure a session for full pushdown optimization and the session fails, the Integration Service cannot
perform incremental recovery because the database processes the transformations. Instead, the database
rolls back the transactions. If the database server fails, it rolls back transactions when it restarts. If the
Integration Service fails, the database server rolls back the transaction.
Informatica Tuning - Step by Step Approach
This is the first of a number of articles in the series on Data Warehouse application performance tuning, scheduled to come every week. This one is on Informatica performance tuning.
Please note that this article is intended to be a quick guide. A more detailed Informatica performance tuning guide can be found here: Informatica Performance Tuning Complete Guide
Tuning Informatica Source Query
1.1 Calculate the original query cost
1.2 Can the query be re-written to reduce cost?
- Can an IN clause be changed to EXISTS?
- Can a UNION be replaced with UNION ALL if we are not using any DISTINCT clause in the query?
- Is there a redundant table join that can be avoided?
- Can we include an additional WHERE clause to further limit data volume?
- Is there a redundant column used in GROUP BY that can be removed?
- Is there a redundant column selected in the query but not used anywhere in the mapping?
1.3 Check if all the major joining columns are indexed
1.4 Check if all the major filter conditions (WHERE clause) are indexed
- Can a function-based index improve performance further?
1.5 Check if any exclusive query hint reduces query cost
- Check if a parallel hint improves performance and reduces cost
1.6 Recalculate query cost
- If query cost is reduced, use the changed query
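The IN-to-EXISTS rewrite from step 1.2 can be sanity-checked for equivalence with a small sqlite3 session (table names are invented; the actual cost difference depends on your optimizer and indexes, which is why step 1.6 says to recompare the query cost after rewriting):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER)")
conn.execute("CREATE TABLE vip_custs (cust_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 10), (4, 30)])
conn.executemany("INSERT INTO vip_custs VALUES (?)", [(10,), (30,)])

# Original form with IN:
with_in = conn.execute(
    "SELECT order_id FROM orders "
    "WHERE cust_id IN (SELECT cust_id FROM vip_custs) ORDER BY order_id"
).fetchall()

# Rewritten with a correlated EXISTS - same rows come back; whether it is
# cheaper depends on the optimizer and the available indexes.
with_exists = conn.execute(
    "SELECT order_id FROM orders o "
    "WHERE EXISTS (SELECT 1 FROM vip_custs v WHERE v.cust_id = o.cust_id) "
    "ORDER BY order_id"
).fetchall()

assert with_in == with_exists == [(1,), (3,), (4,)]
```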
Tuning Informatica LookUp
2.1 Redundant lookup transformations
- Is there a lookup which is no longer used in the mapping?
- If there are consecutive lookups, can those be replaced with a single lookup override?
2.2 Lookup conditions
- Are all the lookup conditions indexed in the database? (uncached lookups only)
- An unequal condition should always be mentioned after an equal condition
2.3 Lookup override query
- Should follow all guidelines from the Source Query part above
2.4 No unnecessary columns selected in the lookup (to reduce cache size)
2.5 Cached/uncached
- Carefully consider whether the lookup should be cached or uncached. General guidelines:
- Generally don't use a cached lookup if the lookup table size is > 300 MB
- Generally don't use a cached lookup if the lookup table row count is > 2,000,000
- Generally don't use a cached lookup if the driving (source) table row count is < 1,000
2.6 Persistent cache
- If the same lookup is cached and used in different mappings, consider a persistent cache
2.7 Lookup cache building
- Consider "Additional Concurrent Pipelines" in the session properties to build caches concurrently
- "Prebuild Lookup Cache" should be enabled only if the lookup is surely called in the mapping
Tuning Informatica Joiner
3.1 Unless unavoidable, join database tables in the database itself (homogeneous join) and don't use a Joiner
3.2 If an Informatica Joiner is used, always use sorted input and try to sort the data in the SQ query itself using ORDER BY (if a Sorter transformation is used, make sure the Sorter has enough cache to perform a 1-pass sort)
3.3 The smaller of the two joining tables should be the master
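Point 3.3 follows from how a Joiner works: it caches the master input fully, then streams the detail input past that cache. A minimal hash join sketch (with illustrative data) shows why the smaller input belongs on the master side:

```python
# The Joiner caches every master row, then streams detail rows past the
# cache. Building the cache on the smaller input keeps the cache (memory)
# footprint down, which is why the smaller table should be the master.

def joiner(master_rows, detail_rows):
    # Cache phase: all master rows go into memory, keyed by the join column.
    cache = {}
    for key, payload in master_rows:
        cache.setdefault(key, []).append(payload)
    # Stream phase: each detail row probes the cache once.
    return [
        (key, d_val, m_val)
        for key, d_val in detail_rows
        for m_val in cache.get(key, [])
    ]

small = [(1, "EAST"), (2, "WEST")]              # few rows  -> master
large = [(1, 100), (2, 200), (1, 50), (9, 75)]  # many rows -> detail

rows = joiner(small, large)
assert rows == [(1, 100, "EAST"), (2, 200, "WEST"), (1, 50, "EAST")]
```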
Tuning Informatica Aggregator
4.1 When possible, sort the input for the Aggregator at the database end (ORDER BY clause)
4.2 If the input is not already sorted, use a Sorter. If possible, use the SQ query to sort the records.
Tuning Informatica Filter
5.1 Unless unavoidable, filter at the source query in the Source Qualifier
5.2 Use the Filter as near to the source as possible
Tuning Informatica Session
7.1 Disable "High Precision" if not required (High Precision allows decimals up to 28 digits of precision)
7.2 Use "Terse" mode for the tracing level
7.3 Enable pipeline partitioning (thumb rule: maximum number of partitions = number of CPUs / 1.2; also remember that increasing partitions multiplies the cache memory requirement accordingly)
Tuning Informatica Expression
8.1 Use variable ports to reduce redundant calculations
8.2 Remove the default value "ERROR('transformation error')" for output columns
8.3 Try to reduce code complexity, e.g. nested IFs
8.4 Try to reduce unnecessary type conversions in calculations
Implementing Informatica Partitions
Why use Informatica Pipeline Partition?
Identification and elimination of performance bottlenecks will obviously optimize session performance. After
tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number
of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the
system hardware while processing the session.
PowerCenter Informatica Pipeline Partition
Different Types of Informatica Partitions
We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key
range, Pass-through, Round-robin.
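A rough sketch of how two of these partition types distribute rows (this illustrates the distribution idea only, not the actual Integration Service threading):

```python
# Round-robin spreads rows evenly across partitions; hash user-keys sends all
# rows with the same key value to the same partition (needed, for example,
# before an Aggregator so a group is never split across partitions).

def round_robin(rows, n):
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_user_keys(rows, n, key):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

rows = [{"dept": "HR", "sal": 10}, {"dept": "IT", "sal": 20},
        {"dept": "HR", "sal": 30}, {"dept": "IT", "sal": 40}]

rr = round_robin(rows, 2)
assert [len(p) for p in rr] == [2, 2]  # rows spread evenly

# Every row of a given dept lands in exactly one partition.
hp = hash_user_keys(rows, 2, "dept")
placement = {}
for idx, part in enumerate(hp):
    for row in part:
        placement.setdefault(row["dept"], set()).add(idx)
assert all(len(ixs) == 1 for ixs in placement.values())
```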
Informatica Pipeline Partitioning Explained
Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the
transformations and the target. When the Integration Service runs the session, it can achieve higher
performance by partitioning the pipeline and performing the extract, transformation, and load for each
partition in parallel.
A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage.
Implementing Aggregation without Informatica Aggregator
Now I am showing a Sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use it, thereby increasing the performance benefit.
Expression (EXP_SAL) Ports Tab
Sorter (SRT_SAL1) Ports Tab
Expression (EXP_SAL2) Ports Tab
Filter (FIL_SAL) Properties Tab
This is how we can implement aggregation without using Informatica aggregator transformation. Hope you
liked it!
What are the differences between Connected and Unconnected Lookup?
- Connected Lookup participates in the data flow and receives input directly from the pipeline; Unconnected Lookup receives input values from the result of a :LKP expression in another transformation.
- Connected Lookup can use both dynamic and static cache; Unconnected Lookup cache cannot be dynamic.
- Connected Lookup can return more than one column value (output ports); Unconnected Lookup can return only one column value, i.e. the return port.
- Connected Lookup caches all lookup columns; Unconnected Lookup caches only the lookup condition ports and the return port.
- Connected Lookup supports user-defined default values (i.e. the value to return when the lookup condition is not satisfied); Unconnected Lookup does not support user-defined default values.
What is the difference between Router and Filter?
- Router transformation divides the incoming records into multiple groups based on conditions; such groups can be mutually inclusive (different groups may contain the same record). Filter transformation restricts or blocks the incoming record set based on one given condition.
- Router itself does not block any record; if a record does not match any of the routing conditions, it is routed to the default group. Filter has no default group; if a record does not match the filter condition, it is blocked.
- Router acts like a CASE .. WHEN statement in SQL (or a switch() .. case statement in C); Filter acts like a WHERE condition in SQL.
What can we do to improve the performance of Informatica Aggregator Transformation?
Aggregator performance improves dramatically if records are sorted before being passed to the Aggregator and the "Sorted Input" option under the Aggregator properties is checked. The record set should be sorted on the columns used in the GROUP BY operation.
It is often a good idea to sort the record set at the database level, e.g. inside a Source Qualifier transformation, unless there is a chance that the already-sorted records from the Source Qualifier can become unsorted again before reaching the Aggregator.
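The benefit of sorted input can be sketched with itertools.groupby, which, like the Aggregator with "Sorted Input" checked, emits each group as soon as its key changes instead of caching every group until the end (data invented for illustration):

```python
from itertools import groupby

# With unsorted rows the Aggregator must hold every group in its cache until
# the last input row arrives; with sorted rows each group is complete as soon
# as the key changes, so it can be emitted immediately.

rows = [("HR", 10), ("HR", 30), ("IT", 20), ("IT", 40)]  # sorted on dept

sorted_totals = [
    (dept, sum(sal for _, sal in grp))
    for dept, grp in groupby(rows, key=lambda r: r[0])   # one group at a time
]

# Unsorted-style aggregation for comparison: whole result held in memory.
unsorted_totals = {}
for dept, sal in rows:
    unsorted_totals[dept] = unsorted_totals.get(dept, 0) + sal

assert dict(sorted_totals) == unsorted_totals == {"HR": 40, "IT": 60}
```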
What are the different lookup cache?
Lookups can be cached or uncached (no cache). A cached lookup can be either static or dynamic. A static cache is one which is not modified once it is built; it remains the same throughout the session run. A dynamic cache, on the other hand, is refreshed during the session run by inserting or updating records in the cache based on the incoming source data.
A lookup cache can also be classified as persistent or non-persistent, based on whether Informatica retains the cache after the session run is complete (persistent) or not (non-persistent).
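The static versus dynamic distinction can be sketched as follows (the keys and cities are invented; a real dynamic cache also flags each row via the NewLookupRow port, which this sketch omits):

```python
# A static cache is built once and never changes during the run; a dynamic
# cache inserts rows it has not seen before, so later source rows can look
# up earlier ones (typical for slowly changing dimension loads).

lookup_table = {101: "Delhi"}        # rows already present in the target

static_cache = dict(lookup_table)    # frozen snapshot for the whole run
dynamic_cache = dict(lookup_table)

incoming = [(101, "Delhi"), (102, "Mumbai"), (102, "Mumbai")]

hits_static, hits_dynamic = [], []
for key, city in incoming:
    hits_static.append(key in static_cache)    # never learns about 102
    hits_dynamic.append(key in dynamic_cache)
    dynamic_cache.setdefault(key, city)        # dynamic cache inserts new keys

assert hits_static == [True, False, False]
assert hits_dynamic == [True, False, True]     # second 102 row hits the cache
```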