
TCS

Project flow is nothing but the data flow in your project. For example, in many projects data is moved from source to staging, ETL is then performed, and the data moves from staging to the warehouse. So here you have two levels of data flow, i.e. source to staging and staging to warehouse.

1. What is the configuration file?

A. A configuration file contains information about the nodes, the resource disk and the resource scratch disk. DataStage parallel jobs run on the basis of the configuration specified in the configuration file.

Nodes : It identifies the number of nodes on which a parallel job can run.

Resource disk : Here a disk path is defined. The data files of the dataset are stored in the resource disk.

Resource scratch disk : Here also a path to a folder is defined. This path is used by the parallel job stages for buffering data when the parallel job runs.

The path to a configuration file is defined in the Datastage Administrator.

The environment variable "APT_CONFIG_FILE" contains the path of the configuration file.

The configuration files have the extension ".apt".

A. It is a normal text file. It contains information about the processing and storage resources that are available for use during parallel job execution. The default configuration file contains entries such as:

a) Node: a logical processing unit which performs all ETL operations.

b) Pools: a collection of nodes.

c) Fastname: the server name; the ETL jobs are executed using this name.

d) Resource disk: a permanent storage area where the data files of datasets are stored.

e) Resource scratch disk: a temporary storage area where staging operations (buffering, sorting) are performed.
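For illustration, a minimal single-node configuration file in the standard .apt format looks like the sketch below; the fastname and the two paths are placeholders and will differ on every installation.

{
    node "node1"
    {
        fastname "etl_server"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}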

2. What is the difference between version 7.5 and version 8.1 of DataStage?

A. DataStage 8.1:

1. It is integrated with a default DB2 database and WebSphere Application Server.

2. It has the SCD stage.

DataStage 7.5:

1. It has the default UniVerse database.

2. It has no SCD stage.

A. Main differences between DataStage 7.5.2 and 8.0.1:

1. In DS 7.5.2 we have Manager as a separate client. In 8.0.1 there is no separate Manager client; the Manager functionality is embedded in the Designer client.

2. In 7.5.2 QualityStage has a separate designer. In 8.0.1 QualityStage is integrated into the Designer.

3. In 7.5.2 both code and metadata are stored in a file-based system. In 8.0.1 code remains file based, whereas metadata is stored in a database.

4. In 7.5.2 we required only operating system authentication. In 8.0.1 we require both operating system authentication and DataStage authentication.

5. In 7.5.2 we do not have range lookup. In 8.0.1 we have range lookup.

6. In 7.5.2 a single Join stage can't support multiple references. In 8.0.1 a single Join stage can support multiple references.

7. In 7.5.2, when a developer has a particular job open and another developer wants to open the same job, that job cannot be opened. In 8.0.1, when a developer has a job open and another developer wants to open the same job, it can be opened as a read-only job.

8. In 8.0.1 a compare utility is available to compare two jobs, one in development and another in production. In 7.5.2 this is not possible.

9. In 8.0.1 quick find and advanced find features are available; in 7.5.2 they are not available.

10. In 7.5.2, the first time a job is run the surrogate key is generated from the initial value to n. The next time the same job is compiled and run, the surrogate key is again generated from the initial value to n; automatic increment of the surrogate key is not available in 7.5.2. In 8.0.1 the surrogate key is incremented automatically, and a state file is used to store the maximum value of the surrogate key.

3. What is the Lookup stage? Explain sparse lookup.

A. The Lookup stage is a processing stage that performs lookup operations. It operates on data read into memory from any other parallel job stage that can output data. The main use of the Lookup stage is to map short codes in the input dataset onto expanded information from a lookup table, which is then joined to the incoming data. For example, sometimes we get data with customer names and addresses where the state is identified by a two- or three-letter code, such as 'mel' for Melbourne or 'syd' for Sydney, but we want the data to carry the full name of the state. By defining the code as the key column, the Lookup stage reads each input row, uses the key to look up the matching row in the lookup table, and adds the state name to the new column defined for the output link, so the full state name is added to each row based on the code. If the code is not found in the lookup table, the record can be rejected. The Lookup stage can also be used to validate rows.

Look Up stage is a processing stage which performs horizontal combining.

Lookup stage Supports

N-Inputs ( For Normal Lookup )

2 Inputs ( For Sparse Lookup)

1 output

And 1 Reject link

Up to DataStage version 7 we have only two types of lookups:

a) Normal Lookup and b) Sparse Lookup

In DataStage version 8, two more were added:

c) Range Lookup and d) Caseless Lookup

Normal Lookup: In a normal lookup, all the reference records are copied into memory and the primary records are cross-verified against them.

Sparse Lookup: In a sparse lookup, each primary record is sent to the reference source (the database) and cross-verified against the reference records there.

We go for a sparse lookup when the reference data is too large to hold comfortably in memory and the number of primary records is relatively small compared to the reference data.

Range Lookup: A range lookup performs range checking on selected columns.

For example, if we want to check the range of an employee's salary in order to find the employee's grade, we can use a range lookup.
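Expressed outside DataStage as plain SQL, the same range check against a hypothetical salary_grades table would look roughly like this (the table and column names are only illustrative):

SELECT e.emp_id,
       e.salary,
       g.grade
FROM   employees e
JOIN   salary_grades g
  ON   e.salary BETWEEN g.low_salary AND g.high_salary;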

A. (Additional notes) In a sparse lookup the SQL query is fired directly against the database for each input record, so when the reference data is huge it can execute faster than a normal lookup. A normal lookup takes the entire reference table into memory and performs the lookup there; a sparse lookup performs the lookup at the database level. When there is a huge amount of data on the reference link compared to the master link, a sparse lookup is used.

A. Sparse lookup: when the reference data is very large and the source data is very small (a source-to-reference ratio of roughly 1:100), we use a sparse lookup. Normal lookup: when the reference data is small and the source data is large, we use a normal lookup.
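As a rough sketch of what a sparse lookup issues per input row (continuing the state-code example; the table and column names are assumptions), the generated reference query is of the form:

SELECT state_name
FROM   state_lookup
WHERE  state_code = ?;   -- the parameter is bound from the key column of the current input row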

4. Can you have more than one input in the Transformer stage?

A. The Transformer stage has one input link and n output links. It can have two types of reject links:

• Constraint reject: a link defined inside the Transformer stage which takes any rows that have failed the constraints on all other output links.

• Failure reject: a link defined outside the Transformer stage which takes any rows that have not been written to any of the output links because of a write failure.

5. Tell me about SCD implementation.

A. Datastage – Slowly Changing Dimensions

by Shradha Kelkar, Talentain Technologies

Basics of SCD

Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.

Type 1

The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all.

Here is an example of a database table that keeps supplier information:

Supplier_Key Supplier_Code Supplier_Name Supplier_State

123 ABC Acme Supply Co CA

In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, the joins will perform better on an integer than on a character string.

Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:

Supplier_Key Supplier_Code Supplier_Name Supplier_State

123 ABC Acme Supply Co IL
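In SQL terms, applying this Type 1 change is a simple in-place update of the dimension row (assuming the table is called supplier_dim):

UPDATE supplier_dim
SET    Supplier_State = 'IL'          -- overwrite the old value; no history is kept
WHERE  Supplier_Code  = 'ABC';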

Type 2

The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.

In the same example, if the supplier moves to Illinois, the table could look like this, with incremented version numbers to indicate the sequence of changes:

Supplier_Key Supplier_Code Supplier_Name Supplier_State Version

123 ABC Acme Supply Co CA 0

124 ABC Acme Supply Co IL 1

Another popular method for tuple versioning is to add effective date columns.

Supplier_Key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date

123 ABC Acme Supply Co CA 01-Jan-2000 21-Dec-2004

124 ABC Acme Supply Co IL 22-Dec-2004

The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.
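Using the effective-date variant above, a Type 2 change is applied as an expire-and-insert pair. A sketch in SQL, assuming the table is named supplier_dim and 124 is the next surrogate key value:

-- close (expire) the current version of the row
UPDATE supplier_dim
SET    End_Date = DATE '2004-12-21'
WHERE  Supplier_Code = 'ABC'
  AND  End_Date IS NULL;

-- insert the new current version with a fresh surrogate key
INSERT INTO supplier_dim
       (Supplier_Key, Supplier_Code, Supplier_Name, Supplier_State, Start_Date, End_Date)
VALUES (124, 'ABC', 'Acme Supply Co', 'IL', DATE '2004-12-22', NULL);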

How to implement SCD using the DataStage 8.1 SCD stage?

Step 1: Create a DataStage job with the structure below:

1. Source file that comes from the OLTP sources

2. Old dimension reference table link

3. The SCD stage

4. Target fact table

5. Dimension update/insert link

Figure 1

Step 2: To set up the SCD properties, open the SCD stage and access the Fast Path.

Figure 2

Step 3: Tab 2 of the SCD stage is used to specify the purpose of each of the keys pulled from the referenced dimension tables.

Figure 3

Step 4: Tab 3 is used to provide the sequence generator file/table name, which is used to generate the new surrogate keys for the new or latest dimension records. These keys also get passed to the fact table for direct load.

Figure 4

Step 5: Tab 4 is used to set the properties that configure the data population logic for the new and old dimension rows. The activities we can configure as part of this tab are:

1. Generating the new surrogate key values to be passed to the dimension and fact tables

2. Mapping the source columns to the dimension columns

3. Setting up the expiry values for the old rows

4. Defining the values that mark the current active rows among multiple versions of a row

Figure 5

Step 6: Set the derivation logic for the fact table as part of the last tab.

Figure 6

Step 7: Complete the remaining setup and run the job.

Figure 7

Shradha Kelkar is an ETL Lead Developer at Talentain and has extensive experience in design, development and deployment of large scale ETL projects in Banking and Telecom. 

A. If you are using the Server version, you will have to use a combination of a Transformer, hashed files for lookup, and stage variables. If you are using the parallel version, you can use the Change Capture and Change Apply stages.

A. SCD2: Slowly Changing Dimension type 2.

SCD2 is used to maintain both current data and historical data. For example, if we have the following data:

cid cname city country

101 raju hyderabad India

If we want to implement SCD2 for the above data, we need to create a target table with the columns

scid cname city country flag version date

(scid is the surrogate key column)

If the customer moves from Hyderabad to Delhi, then the output will be:

scid cname city country flag version

1 raju hyderabad India 0 1

2 raju delhi India 1 2

Flag = 0 defines historical data; Flag = 1 defines current data.

Similarly, if raju then moves from Delhi to Mumbai, the output will be:

scid cname city country flag version

1 raju hyderabad India 0 1

2 raju delhi India 0 2

3 raju mumbai India 1 3

and so on.
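The same flag/version logic can be sketched in SQL (the table name cust_dim is assumed); when raju moves to Mumbai, the current row is flagged as historical and a new current row is inserted with the next surrogate key and version:

-- mark the existing current row as history
UPDATE cust_dim
SET    flag = 0
WHERE  cname = 'raju'
  AND  flag  = 1;

-- insert the new current row
INSERT INTO cust_dim (scid, cname, city, country, flag, version)
VALUES (3, 'raju', 'mumbai', 'India', 1, 3);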

6. Difference between Lookup, Join and Merge?

A. Lookup: when the reference data is quite small we use Lookup, because the data is held in memory. If the reference data is very large then it will take time to load and to perform the lookup.

Join: if the reference data is very large then we go for Join, because it accesses the data directly from disk, so the processing time is less compared to Lookup. But in Join we cannot capture the rejected data, so we go for Merge.

Merge: if we want to capture rejected data (when the join key is not matched) we use the Merge stage. For every update (detail) link there is a reject link to capture rejected data.

A. All three stages differ from each other mainly in three categories:

1) Input column requirements: a) sorting, b) de-duplication

2) Treatment of unmatched data

3) Memory usage

1. Sorting: in Join and Merge, sorting of both the primary and secondary tables is mandatory, whereas in Lookup it is optional. De-duplication: in Join, duplicates are allowed with no warnings and no job aborts; in Lookup, duplicates in the primary table are accepted but duplicates in the secondary table raise warnings; in Merge, duplicates in the primary table raise warnings while duplicates in the secondary table are accepted. The treatment of unmatched data and the memory usage are discussed in the other answers.

A. The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and their input requirements. We can have a reject link in Lookup and Merge; this is not possible in Join. In detail:

Lookup: used for smaller reference volumes, because it loads the reference data into physical memory and processes every row from there. Using Lookup we can do two types of join: 1) left outer join, 2) inner join.

Join: used for huge amounts of data, because it reads the data directly from disk, so it processes faster than Lookup for large volumes. Join supports different join conditions: 1) left outer join, 2) right outer join, 3) inner join, 4) full outer join.

Merge: also used for huge amounts of data. If we want to capture rejected data we use the Merge stage; for every update (detail) link there is a reject link to capture rejected data.

A. Join stage:

1) It has n input links (one primary and the remaining secondary), one output link and no reject link.

2) It supports 4 join operations: inner join, left outer join, right outer join and full outer join.

3) Join occupies less memory, hence performance is high.

4) The default partitioning technique is hash partitioning.

5) A prerequisite for Join is that the input data must be sorted before the join operation.

Lookup stage:

1) It has n input links, one output link and one reject link.

2) It can perform only 2 join operations: inner join and left outer join.

3) Lookup occupies more memory, hence performance reduces for large reference data.

4) The default partitioning technique is Entire.

Merge stage:

1) It has n input links (one master link and the rest update links) and n-1 reject links.

2) It can perform 2 join operations: inner join and left outer join.

3) The hash partitioning technique is used by default.

4) Memory used is very low, hence performance is high.

5) Sorted data on the master and update links is mandatory.

7. Tell me about the phantom process. What is a phantom error in DataStage?

A. A phantom process is an orphaned process. Sometimes processes keep running on the server even though you have killed the actual (parent) process.

Threads that keep running without any parent process are called phantom processes.

If you look in the directory called &PH& under the project directory, this folder captures the logs of phantom processes.

Whatever the case in which you are getting this error, check the active processes on the DataStage server and kill any that have been running for a very long time.

A. In the Transformer stage the order of the output links should be such that dependent links come after independent links; otherwise the job may abort with a phantom error message.

8. Difference between Sequential File vs Dataset vs Fileset?

Sequential File:

---> Extract/load from/to a sequential file is limited to about 2 GB.

---> When used as a source, the data is converted from ASCII into DataStage's native format.

---> Does not support null values directly.

---> A sequential file can only be accessed on one node.

Dataset:

-----> It preserves partitioning. It stores data on the nodes, so when you read from a dataset you don't have to repartition the data.

-----> It stores data in binary in the internal format of DataStage, so it takes less time to read from or write to a dataset than to any other source/target.

Fileset:

-----> It stores data in a format similar to that of a sequential file. The main advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.

-----> You can view the data, but in the order defined by the partitioning scheme.

A. Difference between Dataset, Fileset and Sequential File:

Dataset:

0) The data set is the internal data format of the Orchestrate framework, so any other data processed as a source in a parallel job is first converted into the data set format (handled by the "import" operator), and data written to any other target is converted from the data set format at the end (handled by the "export" operator). Hence, datasets usually give the highest performance.

1) It stores data in binary in the internal format of DataStage, so it takes less time to read from or write to a dataset than any other source/target.

2) It preserves the partitioning scheme, so you don't have to partition the data again.

3) You cannot view the data without DataStage.

Fileset:

0) The .ds file and the .fs file are the descriptor files of a data set and a file set respectively. A .fs file is stored in ASCII format, so you can open it directly to see the paths of the data files and the schema. A .ds file cannot be opened directly; instead you can use the Data Set Management utility in the client tools (Designer, Manager) or the command-line utility orchadmin.

1) It stores data in a format similar to a sequential file.

2) The main advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.

3) You can view the data, but in the order defined by the partitioning scheme.
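For reference, two commonly cited orchadmin invocations are sketched below; the dataset path is illustrative, and the exact subcommands and options should be verified against your installation's documentation:

orchadmin describe /data/ds/customers.ds    # show the schema and data file locations of the dataset
orchadmin rm /data/ds/customers.ds          # remove the data files together with the descriptor file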

9. What is a materialized view?

A. A materialized view has physical storage where it stores the result of a query, which may be a combination or join of one or more tables and views, so it retrieves results faster. When the data coming from different applications needs to be refreshed periodically, the materialized view still returns results quickly; that is the benefit of a materialized view.

A. Materialized views are schema objects that can be used to summarize, precompute, replicate, and distribute data, e.g. to construct a data warehouse.
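A minimal Oracle-style example that precomputes a join and aggregation and is refreshed on demand (the table and column names are assumptions):

CREATE MATERIALIZED VIEW sales_by_region_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT r.region_name,
       SUM(s.amount) AS total_sales
FROM   sales   s
JOIN   regions r ON r.region_id = s.region_id
GROUP BY r.region_name;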

How many places can you call routines from? Routines can be called at the following places:

a) In the job properties there is an option to call the before-job and after-job subroutines.

b) In a job sequence there is an activity called "Routine Activity"; routines can be called from there as well.

c) In the derivation part of the Transformer of a parallel job, parallel routines can be called.

d) In the derivation part of the Transformer of a server job, server routines can be called.

e) In the server job stages, before and after stage subroutines can also be called.

f) Routines can be called from the User Variables activity of a sequence as well (here basically transforms are called which return a value to the variable of the User Variables activity after their execution).

Capgemini

1. How to eliminate Product Joins in a Teradata SQL query?

1. Ensure statistics are collected on the join columns; this is especially important if the columns you are joining on are not unique (see the example after this list).

2. Make sure you are referencing the correct alias.

3. Also, if you have an alias, you must always reference it instead of the fully qualified table name.

4. Sometimes product joins happen for a good reason: when joining a small table (100 rows) to a large table (1 million rows), a product join does make sense.
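For point 1, statistics on the join columns are collected in Teradata with statements of this form (the table and column names are illustrative):

COLLECT STATISTICS ON orders    COLUMN (customer_id);
COLLECT STATISTICS ON customers COLUMN (customer_id);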

2. List types of HASH functions used in Teradata?

Answer:

SELECT HASHAMP (HASHBUCKET (HASHROW (<column list>))) AS "AMP#", COUNT (*) FROM <table name> GROUP BY 1 ORDER BY 2 DESC;

There are HASHROW, HASHBUCKET, HASHAMP and HASHBAKAMP.

The SQL hash functions are:

    * HASHROW (column(s))

    * HASHBUCKET (hashrow)

    * HASHAMP (hashbucket)

    * HASHBAKAMP (hashbucket)

Example:

SELECT

            HASHROW ('Teradata')   AS "Hash Value"

            , HASHBUCKET (HASHROW ('Teradata')) AS "Bucket Num"

            , HASHAMP (HASHBUCKET (HASHROW ('Teradata'))) AS "AMP Num"

            , HASHBAKAMP (HASHBUCKET (HASHROW ('Teradata')))  AS "AMP Fallback Num" ;

By looking at the result set of the above query you can easily find out the data distribution across all AMPs in your system, and you can further identify uneven data distribution.

How do you find out the number of AMPs in a given system?

Answer

1. Run the following query in Queryman:

Select HASHAMP () +1;

2. We can find the complete configuration details of nodes and AMPs in the configuration screen of the Performance Monitor.

DataStage Performance Tuning

Performance Tuning - Basics

Parallelism: Parallelism in DataStage jobs should be optimized rather than maximized. The degree of parallelism of a DataStage job is determined by the number of nodes defined in the configuration file, for example four-node, eight-node, etc. A configuration file with a larger number of nodes generates a larger number of processes and in turn adds processing overhead compared with a configuration file with a smaller number of nodes. Therefore, while choosing the configuration file one must weigh the benefits of increased parallelism against the losses in processing efficiency (increased processing overheads and slow start-up time). Ideally, if the amount of data to be processed is small, a configuration file with fewer nodes should be used, while if the data volume is large, a configuration file with a larger number of nodes should be used.

Partitioning: Proper partitioning of data is another aspect of DataStage job design which significantly improves overall job performance. Partitioning should be set in such a way as to have a balanced data flow, i.e. the data should be partitioned nearly equally and data skew should be minimized.

Memory: In DataStage jobs where a high volume of data is processed, the virtual memory settings for the job should be optimized. Jobs often abort in cases where a single lookup has multiple reference links; this happens due to low temporary memory space. In such jobs $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values.
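As an illustration only, these environment variables can be overridden at project or job level with larger values; the numbers below are placeholders, not recommendations:

$APT_BUFFER_MAXIMUM_MEMORY = 100663296   # bytes of buffer memory per buffer operator (default is around 3 MB)
$APT_MONITOR_SIZE = 100000               # rows processed between job monitor updates
$APT_MONITOR_TIME = 10                   # seconds between job monitor updates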

Performance Analysis of Various Stages in DataStage

Sequential File Stage - The Sequential File stage is a file stage. It is the most common I/O stage used in a DataStage job. It is used to read data from or write data to one or more flat files. It can have only one input link or one output link, and it can also have one reject link. While handling huge volumes of data, this stage can itself become one of the major bottlenecks, as reading from and writing to it is slow. Sequential files should be used in the following conditions: when reading a flat file (fixed width or delimited) from a UNIX environment which is FTPed from some external system, or when some UNIX operation has to be done on the file. Don't use sequential files for intermediate storage between jobs; it causes a performance overhead, as data conversion is needed before writing to and reading from a UNIX file. In order to read faster from this stage, the number of readers per node can be increased (the default value is one).

Data Set Stage: The Data Set stage is a file stage which allows reading data from or writing data to a dataset. This stage can have a single input link or a single output link. It does not support a reject link. It can be configured to operate in sequential mode or parallel mode. DataStage parallel extender jobs use datasets to store the data being operated on in a persistent form. Datasets are operating system files which by convention have the suffix .ds. Datasets are much faster compared to sequential files. Data is spread across multiple nodes and is referenced by a control file. Datasets are not UNIX files, and no UNIX operation can be performed on them. Usage of datasets results in good performance in a set of linked jobs. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order.

Lookup Stage - The Lookup stage is an active stage. It is used to perform a lookup against any parallel job stage that can output data. The Lookup stage can have reference links, a single input link, a single output link and a single reject link. The Lookup stage is faster when the reference data volume is small. It can have multiple reference links (if it is a sparse lookup it can have only one reference link). The optional reject link carries source records that do not have a corresponding entry in the lookup tables. The Lookup stage and the type of lookup should be chosen depending on the functionality and the volume of data. The sparse lookup type should be chosen only if the primary input data volume is small. If the reference data volume is large, usage of the Lookup stage should be avoided, as all the reference data is pulled into local memory.

Join Stage: The Join stage performs a join operation on two or more datasets input to the stage and produces one output dataset. It can have multiple input links and one output link. There can be three types of join operations: inner join, left/right outer join, and full outer join. Join should be used when the data volume is high; it is a good alternative to the Lookup stage when handling huge volumes of data. Join uses a paging method for the data matching.

Merge Stage: The Merge stage is an active stage. It can have multiple input links, a single output link, and it supports as many reject links as there are update input links. The Merge stage takes sorted input. It combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that you specify. Merge key columns are one or more columns that exist in both the master and update records; there can be more than one merge key column. For a Merge stage to work properly, the master dataset and update datasets should contain unique records. The Merge stage is generally used to combine datasets or files.

Sort Stage: The Sort stage is an active stage. It is used to sort the input dataset in either ascending or descending order. The Sort stage offers a variety of options, such as retaining the first or last record when removing duplicates, stable sorting, and specifying the algorithm used for sorting to improve performance. Even though data can be sorted on a link, the Sort stage is used when the data to be sorted is huge. When we sort data on a link (sort/unique option), once the data size is beyond the fixed memory limit, I/O to disk takes place, which incurs an overhead. Therefore, if the volume of data is large, an explicit Sort stage should be used instead of a sort on a link. The Sort stage gives an option to increase the buffer memory used for sorting; this means lower I/O and better performance.

Transformer Stage: The Transformer stage is an active stage which can have a single input link and multiple output links. It is a very robust stage with a lot of built-in functionality. The Transformer stage always generates C++ code, which is then compiled into a parallel component, so the overheads of using a Transformer stage are high. Therefore, in any job it is important that the use of Transformers is kept to a minimum and other stages are used instead, such as: the Copy stage for mapping input links to multiple output links without any transformations; the Filter stage for filtering out data based on certain criteria; and the Switch stage for mapping a single input link to multiple output links based on the value of a selector field. It is also advisable to reduce the number of Transformers in a job by combining the logic into a single Transformer rather than having multiple Transformers.

Funnel Stage - The Funnel stage is used to combine multiple inputs into a single output stream. However, the presence of a Funnel stage reduces the performance of a job; it can increase the time taken by the job by around 30% (based on observations). When a Funnel stage is to be used in a large job, it is better to isolate it in its own job: write the outputs to datasets and funnel them in a new job. The Funnel stage should run in "continuous" mode, without hindrance.

Overall Job Design: While designing DataStage jobs, care should be taken that a single job is not overloaded with stages. Each extra stage in a job means fewer resources available for every stage, which directly affects the job's performance. If possible, big jobs with a large number of stages should be logically split into smaller units. Also, if a particular stage has been identified as taking a lot of time in a job, such as a Transformer stage with complex functionality and a lot of stage variables and transformations, the job could be designed so that this stage is put in a separate job altogether (giving more resources to the Transformer stage).

While designing jobs, care must also be taken that unnecessary column propagation is not done. Columns which are not needed in the job flow should not be propagated from one stage to another and from one job to the next. As far as possible, RCP (Runtime Column Propagation) should be disabled in the jobs.

Sorting in a job should be handled carefully; try to minimise the number of sorts in a job. Design a job in such a way as to combine operations around the same sort keys and, if possible, maintain the same hash keys. A most often neglected option is "don't sort if previously sorted" in the Sort stage; set this option to "true" where applicable, as it improves Sort stage performance a great deal. In the Transformer stage, "Preserve Sort Order" can be used to maintain the sort order of the data and reduce sorting in the job.

In a Transformer, a minimum number of stage variables should be used; the more stage variables there are, the lower the performance. An overloaded Transformer can choke the data flow and lead to bad performance or even failure of the job at some point. In order to minimise the load on the Transformer we can: avoid unnecessary function calls (for example, a varchar field with a date value can be type cast into a Date type by simply formatting the input value, so we need not use the StringToDate function), rely on implicit conversion of data types, and reduce the number of stage variables used. It was observed in a previous project that by removing 5 stage variables and 6 function calls, the runtime for a job was reduced from 2 hours to approximately 1 hour 10 minutes with 100 million input records. Try to balance the load on Transformers by sharing the transformations across existing Transformers; this ensures a smooth flow of data.

If you require type casting, renaming of columns or addition of new columns, use the Copy or Modify stages to achieve this. Whenever you have to use lookups against large tables, look at options such as unloading the lookup tables to datasets, or using a user-defined join SQL to reduce the lookup volume with the help of temporary tables. The Copy stage should be used instead of a Transformer for simple operations including:

o Job design placeholder between stages

o Renaming columns

o Dropping columns

o Implicit (default) type conversions

An "upsert" works well if the data is sorted on the primary key column of the table being loaded; alternatively, determine whether the record already exists or not and perform the "insert" and "update" separately. It is sometimes possible to re-arrange the order of business logic within a job flow to leverage the same sort order, partitioning, and groupings. Don't read from a sequential file using SAME partitioning: unless more than one source file is specified, this will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is repartitioned).