Data Stage Docs

1. What is the flow of loading data into fact & dimensional tables? 

Ans1 : Fact table - a table with a collection of foreign keys corresponding to the primary keys in the dimension tables. It consists of fields with numeric (measure) values. Dimension table - a table with a unique primary key.

Load - Data should first be loaded into the dimension tables. Based on the primary key values in the dimension tables, the data is then loaded into the fact table.

Ans2 : Here is the sequence of loading a data warehouse.

1. The source data is first loaded into the staging area, where data cleansing takes place.

2. The data from the staging area is then loaded into dimensions/lookups.

3. Finally, the fact tables are loaded from the corresponding source tables in the staging area.

2. Orchestrate Vs Datastage Parallel Extender? 

Ans : Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (the beta version of 6.0) to incorporate parallel processing capabilities. Ascential then purchased Orchestrate, integrated it with DataStage XE, and released a new version, DataStage 6.0, i.e. Parallel Extender.

3. Differentiate Primary Key and Partition Key? 

Ans : A primary key is a combination of unique and not null. It can be a collection of key columns, called a composite primary key. A partition key is just a part of the primary key. There are several partitioning methods such as Hash, DB2, Random, etc. When using Hash partitioning we specify the partition key.

4. How do you execute datastage job from command line prompt? 

Ans : Using "dsjob" command as follows. dsjob -run -jobstatus projectname jobname

6.  What are Stage Variables, Derivations and Constants? 

Ans: A stage variable is a variable that is evaluated locally within the stage.

A constraint is like a filter condition, which limits the number of records passed from the input according to a business rule.

A derivation is an expression used to modify or derive values from input columns.

The execution order is stage variables ---- constraints ---- derivations.

5. What is the default cache size? How do you change the cache size if needed? 

Ans : The default cache size is 128 MB. It is primarily used for the hashed file data cache on the server. This setting can only be changed in the Administrator, not at job level. At job level, tuning is available only for the buffer size.

6. Containers : Usage and Types? 

Ans: A container is a collection of stages grouped together for reusability. There are 2 types of containers. a) Local container: specific to a single job. b) Shared container: can be used in any job within a project. There are two types of shared container: 1. Server shared container - used in server jobs (can also be used in parallel jobs). 2. Parallel shared container - used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).

7. Compare and Contrast ODBC and Plug-In stages? 

Ans : ODBC: a) Poor performance. b) Can be used for a variety of databases. c) Can handle stored procedures.

Plug-In: a) Good performance. b) Database-specific (only one database). c) Cannot handle stored procedures.

8. How to run a Shell Script within the scope of a Data stage job? 

Ans : By using the "ExecSH" routine in the Before/After job subroutine properties of the job.
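For example (a hedged illustration - the script path is hypothetical), in the job properties you might set:

    Before-job subroutine: ExecSH
    Input value:           /home/dsadm/scripts/prepare_input.sh

ExecSH passes the input value to the UNIX shell, so arguments and redirection that are valid in the shell can be included.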

9. Types of Parallel Processing? 

Ans 1: Parallel Processing is broadly classified into 2 types. a) SMP - Symmetrical Multi Processing. b) MPP - Massive Parallel Processing.

Ans 2: Hardware-wise there are 3 types of parallel processing systems available: 1. SMP (symmetric multiprocessing: multiple CPUs, shared memory, single OS). 2. MPP (massively parallel processing: multiple CPUs, each with its own set of resources - memory, OS, etc. - but physically housed in the same machine). 3. Clusters: same as MPP, but physically dispersed (not in the same box, connected via high-speed networks).

DS offers 2 types of parallelism to take advantage of the above hardware: 1. Pipeline parallelism. 2. Partition parallelism.

10. What does a Config File in parallel extender consist of? 

Ans : The configuration file is read by the DataStage engine before running a job in PX (Parallel Extender).

It describes the system on which the job will run, for example the processing nodes and their resources (resource disk, scratch disk, node pools, and so on).

11. Functionality of Link Partitioner and Link Collector? 

Ans : Link Partitioner : It actually splits data into various partitions or data flows using various partition methods .

Link Collector : It collects the data coming from partitions, merges it into a single data flow and loads to target. 

12. What is Modulus and Splitting in Dynamic Hashed File? 

Ans : In a dynamic hashed file the size of the file changes as data is added or removed. The modulus is the current number of groups in the file; splitting is the process by which the file grows - when a group becomes over-full the file splits it into two, increasing the modulus.

13. Types of views in Datastage Director? 


Ans : There are 3 types of views in DataStage Director: a) Job view - the jobs in the project, with their compilation dates. b) Status view - the status and timings of the last run of each job. c) Log view - warning messages, event messages, and program-generated messages for a job run.

16.  Differentiate Database data and Data warehouse data? 

Ans : Data in a database is a) detailed or transactional, b) both readable and writable, c) current. Data in a data warehouse is a) summarized as well as detailed historical data, b) primarily read-only, c) historical (time-variant).

17. What are the difficulties faced in using DataStage? Or, what are the constraints in using DataStage?

Ans 1: * The issue I faced with DataStage is that it was very difficult to find the cause of errors from the error code, since the error table did not specify the reason for the error. And as a fresher I did not know what the error codes stand for. :)

* Another issue is that the help in DataStage was not of much use, since it was more general than specific.

* I do not know about other tools, since this is the only tool I have used so far. But it was simple to use, so I liked using it in spite of the above issues.

Ans 2: 1. I feel the most difficult part is understanding the DataStage Director job log error messages; they are not presented in a properly readable form. 2. We don't have as many date functions available as in Informatica or traditional relational databases.

3. DataStage is a unique product in terms of functions. For example, most databases or ETL tools use UPPER for converting from lower case to upper case; DataStage uses "UCASE". DataStage is peculiar compared to other ETL tools.

Other than that, I don't see any issues with DataStage.

18. What are XML files, how do you read data from XML files, and what stage is to be used?

Ans : This is how it can be done: define the XML file path in the Administrator under environment parameters. Import the XML file metadata into the Designer repository. Use a Transformer stage (without an input link) to get this path in the server job. Use the XML Input stage; in its input tab, under the XML source, place this value from the transformer. On the output tab you can import the metadata (columns) of the XML file and then use them as input columns in the rest of the job.

19. How do you catch bad rows from an OCI stage?

Ans : The question itself is a little ambiguous to me. I think the answer might be that we place conditions like 'where' inside the OCI stage, and the rejected rows can then be obtained as shown in the example below: 1) Say there are four departments in an office, 501 through 504. We place a where condition, where deptno <= 503; only these rows are output through the output link. 2) Now we take another output link to a sequential file or another stage where you want to capture the rejected rows, and in that link we define: where deptno > 503. 3) Once the rows are output from the OCI stage, you can also send them into a transformer, place some constraint on it, and use the reject-row mechanism to collect the rows.

I am a little tentative because I am not sure whether I have answered the question or not. Please do verify and let us know if this answer is wrong.

   

21. Suppose there are a million records - did you use OCI? If not, which stage do you prefer?

Ans : Use Orabulk (the Oracle bulk loader stage) for such high-volume loads.

22. How do you populate source files?

23. How do you pass the parameter to the job sequence if the job is running at night?

Ans : Two ways:

1. Set the default values of the parameters in the job sequencer and map these parameters to the job.

2. Run the job in the sequencer using the dsjob utility, where we can specify the values to be taken for each parameter.

Ans 2: You can insert the parameter values in a table and read them when the package runs using an ODBC stage or plug-in stage, and use DS variables to assign them in the data pipeline; or pass the parameters using DSSetParam from the controlling job (batch job or job sequence) or a job control routine from within DS; or use dsjob -param from within a shell script or a DOS batch file when running from the CLI.
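A hedged sketch of the dsjob -param option mentioned in Ans 2; the parameter, project and job names are hypothetical:

    # supply run-time parameter values when starting the job from a shell script
    dsjob -run \
          -param LoadDate=2005-10-28 \
          -param SrcFile=/data/incoming/orders.dat \
          -jobstatus MyProject NightlyLoadJob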

24.  What happens if the job fails at night?

Ans : You can define a job sequence that sends an email using the SMTP (notification) activity if the job fails, or log the failure to a log file using DSLogFatal/DSLogEvent from the controlling job or an after-job routine, or use dsjob -log from the CLI.
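A hedged sketch of wrapping the night run in a script that emails on failure (an alternative to the SMTP activity described above); the addresses, project and job names are hypothetical, and the exit-code convention should be verified for your dsjob version:

    #!/bin/sh
    # run the job, then notify the on-call address if it did not finish cleanly
    dsjob -run -jobstatus MyProject NightlyLoadJob
    rc=$?
    if [ "$rc" -ne 1 ] && [ "$rc" -ne 2 ]; then    # typically 1 = ran OK, 2 = ran with warnings
        echo "NightlyLoadJob failed with dsjob status $rc" | mailx -s "DataStage job failure" oncall@example.com
    fi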

26. What is project life cycle and how do you implement it?

27. How do you track performance statistics and enhance it?

Ans : You can right click on the server job and select the "view performance statistics" option. This will show the output in the number of rows per second format when the job runs.

Ans2 : Through Monitor we can view the performance statistics.

28.  How do you do oracle 4 way inner join if there are 4 oracle input files?

28. What is the order of execution done internally in the transformer, with the stage editor having input links on the left hand side and output links on the right?

Ans : Stage variables, constraints and column derivation or expressions.

29. Explain your last project and your role in it.?

30. What are the often used Stages or stages you worked with in your last project?


Ans : A) Transformer, ORAOCI8/9, ODBC, Link-Partitioner, Link-Collector, Hash, ODBC, Aggregator, Sort.

31. How many jobs have you created in your last project?

Ans : Roughly 100+ jobs every 6 months if you are in development, or around 40 jobs every 6 months if you are in testing - although it need not be the same number for everybody.

32. Tell me the environment in your last projects.

Ans : Give the OS of the server and the OS of the client of your most recent project.

33. Did you parameterize the job or hard-code the values in the jobs?

Ans : Always parameterize the job. Either the values come from Job Properties or from a 'Parameter Manager' - a third-party tool. There is no way you should hard-code parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates with respect to which the data is to be looked at.

34. Have you ever involved in updating the DS versions like DS 5.X, if so tell us some the steps you have taken in doing so?

Ans : Yes. The following are some of the steps I have taken in doing so: 1) Definitely take a backup of the whole project(s) by exporting each project as a .dsx file. 2) See that you use the same parent folder for the new version, so that your old jobs using hard-coded file paths still work. 3) After installing the new version, import the old project(s); you then have to compile them all again. You can use the 'Compile All' tool for this. 4) Make sure that all your DB DSNs are created with the same names as the old ones. This step applies when moving DS from one machine to another. 5) In case you are just upgrading your DB from Oracle 8i to Oracle 9i, there is a tool on the DS CD that can do this for you. 6) Do not stop the 6.0 server before the upgrade; the version 7.0 install process collects project information during the upgrade. There is no rework (recompilation of existing jobs/routines) needed after the upgrade.

35. What is Hash file stage and what is it used for? 

Ans : Used for lookups. It is like a reference table. It is also used in place of ODBC/OCI tables for better performance.

Or - We can also use the Hashed File stage to avoid/remove duplicate rows by specifying the hash key on a particular field.

36. What are Static Hash files and Dynamic Hash files?

Ans : As the names themselves suggest what they mean. In general we use Type 30 dynamic hashed files. The data file has a default size of 2 GB, and the overflow file is used if the data exceeds the 2 GB size.

Or - Dynamic hashed files can automatically adjust their size - modulus (number of groups) and separation (group size) - based on the incoming data. Type 30 files are dynamic.

Static files do not adjust their modulus automatically and are best when the data is static.

Overflow groups are used when the data row size is equal to or greater than the specified large record size in dynamic hashed files. Since static hashed files do not create hashing groups automatically, when a group cannot accommodate a row it goes to overflow.

Overflow should be minimized as much as possible for optimal performance.


37- What versions of DS you worked with?

Ans : DS 7.0.2/6.0/5.2

39. What other ETL's you have worked with? 40. Did you work in UNIX environment? 

Ans : Yes. You sometimes need to write UNIX programs (for example shell scripts for batch processing) to run in the background, because DataStage batch jobs are commonly invoked on a schedule (e.g. every 24 hours). So UNIX knowledge is a must, so that these programs can be run in the background at the required minute/hour intervals.

41. How good are you with your PL/SQL?

Ans : On the scale of 1-10 say 8.5-9   

42. Explain the differences between Oracle8i/9i?

Ans : Oracle 8i does not support the pseudo column SYSDATE but 9i supports it. In Oracle 8i we can create 256 columns in a table, but in 9i we can have up to 1000 columns (fields).

43. Do you know about INTEGRITY/QUALITY stage?

Ans : Integrity/Quality Stage is a data integration tool from Ascential which is used to standardize/integrate data from different sources.

Or - Could you please explain in detail what exactly QualityStage, AuditStage and ProfileStage are and where they are used in DataStage?

Or - QualityStage can be integrated with DataStage. In QualityStage we have many stages, like Investigate, Match and Survivorship, so that we can do quality-related work; to integrate it with DataStage we need the QualityStage plug-in to achieve the task.

44. Do you know about MetaStage?

Ans : MetaStage is a persistent metadata directory that uniquely synchronizes metadata across multiple separate silos, eliminating rekeying and the manual establishment of cross-tool relationships. Based on patented technology, it provides seamless cross-tool integration throughout the entire business intelligence and data integration lifecycle and toolsets.

Or - MetaStage is a metadata repository in which you can store metadata (DDLs etc.) and perform analysis on dependencies, change impact, etc.

45. How did you connect to DB2 in your last project?

Ans : The following stages can connect to a DB2 database:

ODBC stage, DB2 Plug-in stage, Dynamic Relational stage.

46/47. What are the Oconv() and Iconv() functions and where are they used?

Ans : Iconv() converts a string to an internal storage format; Oconv() converts an expression to an output format.

48. What are routines, where/how are they written, and have you written any routines before?

Ans : Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are different types of routines: 1) Transform functions 2) Before/after job subroutines 3) Job control routines

Or - Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them using the Routine dialog box. The following program components are classified as routines: • Transform functions. These are functions that you can use when defining custom transforms. DataStage has a number of built-in transform functions, which are located in the Routines > Examples > Functions branch of the Repository. You can also define your own transform functions in the Routine dialog box. • Before/after subroutines. When designing a job, you can specify a subroutine to run before or after the job, or before or after an active stage. DataStage has a number of built-in before/after subroutines, which are located in the Routines > Built-in > Before/After branch of the Repository. You can also define your own before/after subroutines using the Routine dialog box. • Custom UniVerse functions. These are specialized BASIC functions that have been defined outside DataStage. Using the Routine dialog box, you can get DataStage to create a wrapper that enables you to call these functions from within DataStage. These functions are stored under the Routines branch in the Repository; you specify the category when you create the routine. If NLS is enabled,

49. If you worked with DS 6.0 and later versions, what are Link Partitioner and Link Collector used for?

Ans : Link Partitioner - used for partitioning the data. Link Collector - used for collecting the partitioned data.

50. How did you handle reject data?

Ans : Typically a reject link is defined and the rejected data is loaded back into the data warehouse, so a reject link has to be defined on every output link from which you wish to collect rejected data. Rejected data is typically bad data, like duplicate primary keys or null rows where data is expected.

What other performance tunings have you done in your last project to increase the performance of slowly running jobs?

Ans : 1. Staged the data coming from ODBC/OCI/DB2 UDB stages or any database on the server, using hashed/sequential files, for optimum performance and also for data recovery in case the job aborts. 2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects. 3. Tuned the 'Project Tunables' in the Administrator for better performance. 4. Used sorted data for the Aggregator. 5. Sorted the data as much as possible in the DB and reduced the use of DS sorts for better performance of jobs. 6. Removed the data not used from the source as early as possible in the job. 7. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries. 8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs. 9. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel. 10. Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the sdk or ds utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.


11. Try to have the constraints in the 'Selection' criteria of the jobs themselves. This will eliminate unnecessary records before joins are made. 12. Tuning should occur on a job-by-job basis. 13. Use the power of the DBMS. 14. Try not to use a Sort stage when you can use an ORDER BY clause in the database. 15. Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE. 16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.

Or

1. Minimise the usage of Transformers (instead use Copy, Modify, Filter, Row Generator where possible). 2. Use SQL code while extracting the data. 3. Handle the nulls. 4. Minimise the warnings. 5. Reduce the number of lookups in a job design. 6. Use not more than 20 stages in a job. 7. Use an IPC stage between two passive stages - it reduces processing time. 8. Drop indexes before loading data and recreate them after loading data into the tables. 9. Generally we cannot avoid lookups if our requirements make lookups compulsory. 10. There is no hard limit on the number of stages, like 20 or 30, but we can break the job into small jobs and then use Dataset stages to store the data. 11. The IPC stage is provided in server jobs, not in parallel jobs. 12. Check the write cache of the hash file. If the same hash file is used for lookup as well as target, disable this option. 13. If the hash file is used only for lookup, then enable "Preload to memory". This will improve the performance. Also, check the order of execution of the routines. 14. Don't use more than 7 lookups in the same transformer; introduce new transformers if it exceeds 7 lookups. 15. Use the "Preload to memory" option on the hash file output. 16. Use "Write to cache" on the hash file input. 17. Write into the error tables only after all the transformer stages. 18. Reduce the width of the input record - remove the columns that you would not use. 19. Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files. 20. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files. This would also minimize overflow on the hash file. 21. If possible, break the input into multiple threads and run multiple instances of the job. 22. Stage the data coming from ODBC/OCI/DB2 UDB stages or any database on the server using hashed/sequential files for optimum performance, and also for data recovery in case the job aborts. 23. Tune the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects. 24. Tune the 'Project Tunables' in the Administrator for better performance. 25. Use sorted data for the Aggregator. 26. Sort the data as much as possible in the DB and reduce the use of DS sorts for better performance of jobs. 27. Remove the data not used from the source as early as possible in the job. 28. Work with the DB admin to create appropriate indexes on tables for better performance of DS queries. 29. Convert some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs. 30. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel. 31. Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the sdk or ds utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal. 32. Try to have the constraints in the 'Selection' criteria of the jobs themselves. This will eliminate unnecessary records before joins are made. 33. Tuning should occur on a job-by-job basis. 34. Use the power of the DBMS. 35. Try not to use a Sort stage when you can use an ORDER BY clause in the database. 36. Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE. 37. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.


51. How did you handle an 'aborted' sequencer?

Ans : In almost all cases we have to delete the data inserted by it from the DB manually, fix the job, and then run the job again.

52. What are sequencers?

Ans : A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes: ALL mode - all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire; ANY mode - output triggers can be fired if any of the sequencer inputs are TRUE.

53. How did you connect with DB2 in your last project?

Ans : Most of the time the data was sent to us in the form of flat files; the data was dumped and sent to us. In some cases where we needed to connect to DB2 for lookups, we used ODBC drivers to connect to DB2 (or DB2 UDB), depending on the situation and availability. Certainly DB2 UDB is better in terms of performance, as the native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - ODBC driver to connect to AS/400 DB2.

54. Read the String functions in DS

Ans : Functions like [] -> sub-string function and ':' -> concatenation operator

Syntax: string[ [ start, ] length ]  or  string[ delimiter, instance, repeats ]

55. What will you do in a situation where somebody wants to send you a file and use that file as an input?

Ans : A. Under Windows: use the 'WaitForFile' activity in a sequencer and then run the job. Maybe you can schedule the sequencer around the time the file is expected to arrive. B. Under UNIX: poll for the file; once the file has arrived, start the job or sequencer depending on the file.

56. How would you call an external Java function that is not supported by DataStage?

Ans : Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in a DataStage job.

57. What is the utility you use to schedule the jobs on a UNIX server, other than using Ascential Director?

Ans : "AUTOSYS": through AutoSys you can automate the jobs by invoking the shell script written to schedule the DataStage jobs. Or "Control-M scheduling tool": through Control-M you can automate the jobs by invoking the shell script written to schedule the DataStage jobs.

58. What are the command line functions that import and export the DS jobs?

Ans : A. dsimport.exe - imports the DataStage components. B. dsexport.exe - exports the DataStage components.

Or - What are the parameters of this command? Parameters: username, password, hostname, project name, current directory (e.g. C:/Ascential/DataStage7.5.1/dsexport.exe), file name (job name).

59. How will you determine the sequence of jobs to load into the data warehouse?

Ans : First we execute the jobs that load the data into the dimension tables, then the fact tables, then the aggregate tables (if any).

The above might raise another question: why do we have to load the dimension tables first, then the fact tables?

Ans : As we load the dimension tables the (primary) keys are generated, and these keys are foreign keys in the fact tables.

61. Tell me one situation from your last project where you faced a problem and how you solved it.

Ans : A. The jobs in which data is read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster. B. A job aborted in the middle of loading some 500,000 rows. We had the option of either cleaning/deleting the loaded data and then running the fixed job, or running the job again from the row at which it had aborted. To make sure the load was proper we opted for the former.

62. Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the database?

Ans : There is no TRUNCATE on ODBC stages; 'Clear the table' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (TRUNCATE requires you to have ALTER TABLE permission, whereas DELETE doesn't).

63. How do you rename all of the jobs to support your new file-naming conventions?

Ans : Create an Excel spreadsheet with the new and old names. Export the whole project as a .dsx file. Write a Perl program which does a simple rename of the strings, looking up the Excel file. Then import the new .dsx file, preferably into a new project for testing. Recompile all jobs. Be cautious that the names of the jobs may also have been changed in your job control jobs or sequencer jobs, so you have to make the necessary changes to those sequencers as well.
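A simplified sketch of the rename step described above, using a plain-text mapping file and GNU sed on the exported .dsx instead of Excel and Perl; all file and job names are hypothetical, and the substitution is a crude global rename, so review the result before importing:

    # rename_map.txt holds lines of the form: OldJobName NewJobName
    cp project_export.dsx project_renamed.dsx
    while read old new; do
        sed -i "s/${old}/${new}/g" project_renamed.dsx   # GNU sed assumed; renames every occurrence, including in job control code
    done < rename_map.txt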

64. Difference between a hashed file and a sequential file?

Ans : A hashed file stores data based on a hashing algorithm and a key value; a sequential file is just a file with no key column. A hashed file can be used as a reference for a lookup; a sequential file cannot. Or - A hashed file can be cached in DS memory (buffer) but a sequential file cannot; also, duplicates on the key are removed in a hashed file, i.e. there are no duplicate keys in a hashed file.

65. How can we join one Oracle source and a sequential file?

Ans : A Join or Lookup stage can be used to join an Oracle source and a sequential file.

66. How can we implement slowly changing dimensions in DataStage?

Ans :

We can implement SCDs in DataStage as follows. 1. Type 1 SCD: insert-else-update in the ODBC stage. 2. Type 2 SCD: insert a new row when the primary key already exists, setting the effective-from date to the job run date and the effective-to date of the old row to some maximum date. 3. Type 3 SCD: move the old value into a dedicated column and update the existing column with the new value.

67. How can we implement Lookup in DataStage Server jobs?

Ans : By using hashed files you can implement lookups in DataStage;

hashed files store data based on a hashing algorithm and key values.

68. What are the third-party tools used with DataStage?

Ans : Control-M job scheduler. Or - Maestro scheduler is another third-party tool.

Or - AutoSys, TNG and Event Coordinator are some of the others that I know of and have worked with.

69. What is the difference between a routine, a transform and a function?

Ans : Routines and transforms sound similar, but a routine encapsulates business logic (a function or subroutine), whereas a transform moves data from one place to another by applying transformation rules to it.

70. What are job parameters?

Ans : These parameters are used to provide administrative control and to change run-time values of the job.


They are defined under Edit > Job Properties, on the Parameters tab,

where we can define the name, prompt, type and value.

71. How can we improve the performance of DataStage jobs?

Ans : Performance tuning of DS jobs:

1. Establish baselines.

2. Avoid the use of only one flow for tuning/performance testing.

3. Work in increments.

4. Evaluate data skew.

5. Isolate and solve.

6. Distribute file systems to eliminate bottlenecks.

7. Do not involve the RDBMS in initial testing.

8. Understand and evaluate the tuning knobs available.

72. How can we create Containers?

73. When should we use ODS?

Ans : DWHs are typically read-only and batch-updated on a schedule;

ODSs are maintained in more real time, trickle-fed constantly.

74. What is the difference between an operational data store (ODS) and a data warehouse?

Ans : An operational data store (or "ODS") is a database designed to integrate data from multiple sources to facilitate operations, analysis and reporting. Because the data originates from multiple sources, the integration often involves cleaning, redundancy resolution and business rule enforcement. An ODS is usually designed to contain low level or atomic (indivisible) data such as transactions and prices as opposed to aggregated or summarized data such as net contributions. Aggregated data is usually stored in the Data warehouse.

Or - A data warehouse is a decision-support database for organisational needs. It is a subject-oriented, non-volatile, integrated, time-variant collection of data.

An ODS (operational data store) is an integrated collection of related information; it typically contains at most around 90 days of information.

75. How do you handle date conversions in DataStage? Convert an mm/dd/yyyy format to yyyy-dd-mm.

Ans : We use a) the "Iconv" function - internal conversion, and b) the "Oconv" function - external conversion.

The expression to convert mm/dd/yyyy format to yyyy-dd-mm is Oconv(Iconv(FieldName,"D/MDY[2,2,4]"),"D-MDY[2,2,4]")

76. How do you pass filename as the parameter for a job?


Ans : 1. Define the job parameter at the job level or project level. 2. Use the file name parameter in the stage (source, target or lookup). 3. Supply the file name at run time.

77. How will you call an external function or subroutine from DataStage?

Ans : You can call external functions or subroutines by using the before/after stage/job subroutines:

ExecSH, ExecDOS.

Or - By using the Command Stage plug-in, or by calling the routine from an Execute Command activity in a job sequence.

78. Dimensional modelling is again subdivided into 2 types.

Ans : a) Star schema - simple and much faster; denormalized form. b) Snowflake schema - complex, with more granularity; more normalized form.

79. How do you create batches in DataStage from the command prompt?

80. How do you eliminate duplicate rows?

Ans Use Remove Duplicate Stage: It takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set.

81. What is the DS Administrator used for - did you use it?

Ans : The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales.

82. What is the DS Designer used for - did you use it?

Ans : You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer's graphical interface lets you select stage icons, drop them onto the Designer work area, and add links.

83. What about system variables?

Ans : DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only. @DATE - the internal date when the program started (see the Date function). @DAY - the day of the month extracted from the value in @DATE. @FALSE - the compiler replaces the value with 0.

85. What are types of Hashed File? 

Ans : Hashed files are classified broadly into 2 types: a) Static - subdivided into 17 types based on the primary key pattern. b) Dynamic - subdivided into 2 types: i) Generic ii) Specific. The default hashed file is dynamic Type 30.

86. What is the DS Manager used for - did you use it?

Ans : The Manager is a graphical tool that enables you to view and manage the contents of the DataStage Repository.

87. What is the DS Director used for - did you use it?

Ans : The DataStage Director is the GUI used to monitor, run, validate and schedule DataStage server jobs.

88. How do we do the automation of dsjobs?


Ans : We can call a DataStage batch job from the command prompt using 'dsjob', and we can also pass all the parameters from the command prompt. Wrap this in a shell script and then call that shell script from any of the schedulers available on the market. The second option is to schedule these jobs using the DataStage Director.
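A hedged sketch of such a wrapper script that a scheduler (cron, AutoSys, Control-M, ...) could call; the dsenv path is an assumption, and the project and job are passed in as arguments:

    #!/bin/sh
    # run_ds_job.sh <project> <job> - thin wrapper so an external scheduler can run a DataStage job
    . /opt/Ascential/DataStage/DSEngine/dsenv      # sets up the DataStage environment; adjust the path for your install
    dsjob -run -jobstatus "$1" "$2"
    exit $?                                        # hand the job status code back to the scheduler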

89. How do you merge two files in DS?

Ans : Either use a copy/concatenate command as a before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different. Or use the Funnel stage to merge the two files. Also, if you want to add a file to an existing dataset, you can write to the same dataset with the dataset set to APPEND mode.
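A minimal sketch of the concatenation approach for two files with identical metadata; the file names are hypothetical, and in a server job this command would typically be supplied as the input value of an ExecSH before-job subroutine:

    # combine two identically-formatted extracts into the single file the job reads
    cat /data/in/orders_part1.dat /data/in/orders_part2.dat > /data/in/orders_all.dat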

90. What is the difference between DataStage developers and DataStage designers? What are the skills required for each?

Ans : A DataStage developer is the one who codes the jobs. A DataStage designer is the one who designs the jobs, i.e. he deals with the blueprints and designs the jobs and the stages that are required for developing the code.

91. Importance of the surrogate key in data warehousing?

Ans : A surrogate key is a system-generated numeric key. It is the primary key in the dimension table and a foreign key in the fact table, and it is used to handle missing data and other complex situations in DataStage.

92. How do you fix the error "OCI has fetched truncated data" in DataStage

Ans : This kind of error occurs when you have a CLOB in the back end and a Varchar in DataStage. Check the back end and accordingly use LongVarchar in DataStage, with the maximum length used in the database.

93. what is difference between data stage and informatica

94. Could you please help me with a set of questions on Parallel Extender?

95. In how many places can you call routines?

Ans : There are four places you can call routines:

(i) Transform of a routine: (a) date transformation, (b) upstring transformation.

(ii) Transform of the before & after subroutines. (iii) XML transformation. (iv) Web-based transformation.

96. What is a batch program and how is it generated?

Ans : A batch program is a program generated at run time and maintained by DataStage itself, but you can easily change it on the basis of your requirements (extraction, transformation, loading). Batch programs are generated depending on the nature of your job, either a simple job or a sequencer job; you can see this program under the job control option.

Scenario-based question: suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000 rows and, after the run, only 5,000 rows have been loaded into the target table, the rest are not loaded and the job aborts - how can you sort out the problem?

Ans : Suppose the job sequencer synchronizes or controls 4 jobs but job 1 has a problem. In this situation you should go to the Director and check what type of problem is being reported: a data type problem, a warning message, a job failure or a job abort. If the job fails it usually means a data type problem or a missing column action. Then go to the Run window -> click -> Tracing -> Performance, or in your target table -> General -> Action -> select the option; here there are two options:

(i) On Fail - Commit, Continue. (ii) On Skip - Commit, Continue. First check how much data has already been loaded, then select the On Skip option and Continue; for the remaining data that was not loaded, select On Fail and Continue. Run the job again and you should definitely get a success message.

99. I want to process 3 files sequentially, one by one - how can I do that while processing the files?

Ans : If the metadata for all the files is the same, then create a job having the file name as a parameter, then use the same job in a routine and call the job with a different file name each time; or you can create a sequencer to run the job.
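A hedged sketch of driving that single parameterized job once per file from a shell script; the project, job, parameter name and file names are hypothetical:

    # run the same job three times, once per input file; -jobstatus waits for each run to finish
    for f in cust_east.dat cust_west.dat cust_north.dat; do
        dsjob -run -jobstatus -param InputFile=/data/in/$f MyProject LoadCustomers
    done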

100. What happens if RCP is disabled?

Ans : Runtime column propagation (RCP): if RCP is enabled for a job, and specifically for those stages whose output connects to a shared container input, then metadata will be propagated at run time, so there is no need to map it at design time. If RCP is disabled for the job, OSH has to perform an import and export every time the job runs, and the processing time of the job is also increased.

110. How can I extract data from DB2 (on IBM iSeries) to the data warehouse via DataStage as the ETL tool? I mean, do I first need to use ODBC to create connectivity and use an adapter for the extraction and transformation of data? Thanks so much if anybody can provide an answer.

Ans : You would need to install ODBC drivers to connect to the DB2 instance (they do not come with the regular drivers we normally install; use the CD provided for the DB2 installation, which has the ODBC drivers to connect to DB2) and then try it out.

If your system is mainframe-based, you can use the load and unload utilities:

load will load the records into the mainframe system; from there you have to export them into your system (Windows).

109. What is merge and how can it be done? Please explain with a simple example taking 2 tables.

Ans : Merge is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Consider two tables, Emp and Dept: if we want to join these two tables we have DeptNo as a common key, so we can give that column name as the key, sort DeptNo in ascending order, and join the two tables.

108. Please list out the versions of the DataStage Parallel and Server editions and the years in which they were released.

107. What happens if the output of a hash file is connected to a transformer? What error does it throw?

If you connect the output of a hash file to a transformer it will act as a reference; there are no errors at all. It can be used in implementing SCDs.

If the hash file output is connected to a transformer stage, the hash file will be treated as a lookup file when there is a primary link to the same transformer stage; if there is no primary link, it will be treated as the primary link itself. You can do SCDs in server jobs by using lookup functionality. This will not return any error code.

106. What is version Control?

Version Control

stores different versions of DS jobs

runs different versions of same job

reverts to previous versions of a job


view version histories

105. Hi, What are the Repository Tables in DataStage and What are they?

Dear User, a data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad hoc, analytical, historical or complex query. Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems. In DataStage, under I/O and Transfer on the interface tab (input, output and transfer pages) you will have 4 tabs, and the last one is Build; under that you can find the TABLE NAME. The DataStage client components are: Administrator - administers DataStage projects and conducts housekeeping on the server; Designer - creates DataStage jobs that are compiled into executable programs; Director - used to run and monitor the DataStage jobs; Manager - allows you to view and edit the contents of the repository.

104. What is ' insert for update ' in datastage

There is a lock for update option in Hashed File Stage, which locks the hashed file for updating when the search key in the lookup is not found.

103. How can we pass parameters to a job by using a file?

Ans : You can create a UNIX shell script which passes the parameters to the job, and you can also create logs for the whole run process of the job (see the sketch after question 102 below).

102. Where does a DataStage UNIX script execute - on the client machine or on the server?

Ans : DataStage jobs are executed on the server machines only. There is nothing that is stored on the client machine.
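Illustrating the approach from question 103 above (reading parameter values from a file and passing them to the job) - a hedged sketch in which the file format, paths, project and job names are all hypothetical:

    #!/bin/sh
    # params.txt holds lines of the form NAME=VALUE (values without spaces assumed)
    PARAM_ARGS=""
    while read line; do
        PARAM_ARGS="$PARAM_ARGS -param $line"
    done < /etc/datastage/params.txt
    # run the job with those parameters and keep a log of the whole run
    dsjob -run -jobstatus $PARAM_ARGS MyProject DailyLoadJob >> /var/log/ds/DailyLoadJob.log 2>&1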


110. How can I connect my DB2 database on AS/400 to DataStage? Do I need to use ODBC first to open the database connectivity and then use an adapter just for connecting between the two? Thanks a lot for any replies.

111. what is the OCI? and how to use the ETL Tools?

Ans : OCI doesn't mean Orabulk data. It actually uses the "Oracle Call Interface" of Oracle to load the data; it is more or less the lowest level of Oracle being used for loading the data.

112. What is NLS in DataStage? How do we use NLS in DataStage, and what are the advantages? At the time of installation I did not choose the NLS option; now I want to use it - do I have to reinstall DataStage, or first uninstall and install it again?

Ans : NLS is basically the local language setting (character set). Once you install DS with NLS you will have it present.


Just log into the Administrator and you can set the NLS of your project based on your project requirements; you just need to map the NLS to your project.

For example, suppose you know you have a file with some Greek characters: if you set the NLS map for Greek, then while running the job DS will recognise those special characters.

I hope you got an idea of NLS and how it is mapped.

114. What is APT_CONFIG in DataStage?

Ans : APT_CONFIG is just an environment variable used to identify the *.apt file. Don't confuse the variable with the *.apt file itself, which has the node information and the configuration of the SMP/MPP server.

115. How do we use the NLS function in DataStage? What are the advantages of the NLS function, and where can we use it?

Ans : As per the manuals and documents, we have different levels of interfaces - can you be more specific? For example Teradata interface operators, DB2 interface operators, Oracle interface operators and SAS interface operators. Orchestrate National Language Support (NLS) makes it possible for you to process data in international languages using Unicode character sets. International Components for Unicode (ICU) libraries support NLS functionality in Orchestrate. Operators with NLS functionality: the Teradata interface operators, the switch operator, the filter operator, the DB2 interface operators, the Oracle interface operators, the SAS interface operators, the transform operator, the modify operator, the import and export operators, and the generator operator.

116. What is merge and how do you use it?

Merge is a stage that is available in both parallel and server jobs.

The merge stage is used to join two tables(server/parallel) or two tables/datasets(parallel). Merge requires that the master table/dataset and the update table/dataset to be sorted. Merge is performed on a key field, and the key field is mandatory in the master and update dataset/table.

117. What is the difference between server jobs and parallel jobs?

Server jobs. These are available if you have installed DataStage Server. They run on the DataStage Server, connecting to other data sources as necessary.

Parallel jobs. These are only available if you have installed Enterprise Edition. These run on DataStage servers that are SMP, MPP, or cluster systems. They can also run on a separate z/OS (USS) machine if required.

118. What are DataStage multi-byte and single-byte file conversions in mainframe jobs? What is UTF-8?

119. What are DataStage multi-byte and single-byte file conversions, and how do we use those conversions in DataStage?

120. How can we ETL an Excel file to a data mart?

Ans : Take the source (Excel) file in .csv format and apply the conditions that satisfy the data mart.

121. What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs itself"?

Ans : It probably means that you can put the selection criteria in the WHERE clause, i.e. whatever data you need to filter, filter it out in the SQL rather than carrying it forward and then filtering it out.

Constraints are nothing but restrictions on the data; here it is a restriction on the data at entry itself, which, as stated, avoids bringing in unnecessary data.

122. What is the meaning of the following: 1) If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel. 2) Tuning should occur on a job-by-job basis. 3) Use the power of the DBMS.

Ans : The third point is about tuning the performance of the job: "use the power of the DBMS" means one can improve the performance of the job by using the power of the database - analyzing, creating indexes, creating partitions - to improve the performance of the SQL used in the jobs.


123. How do you implement routines in DataStage? Does anyone have any material on this - please send it to me.

Ans : There are 3 kinds of routines in DataStage:

1. Server routines, which are used in server jobs;

    these routines are written in the BASIC language.

2. Parallel routines, which are used in parallel jobs;

    these routines are written in C/C++.

3. Mainframe routines, which are used in mainframe jobs.

124. What is troubleshooting in server jobs? What are the different kinds of errors encountered while running any job?

Ans :

125. How can you implement slowly changing dimensions in DataStage? Explain. 2) Can you join a flat file and a database in DataStage? How?

Ans 1 : Yes, we can join a flat file and a database in an indirect way. First create a job which populates the data from the database into a sequential file, and name it, say, Seq_First. Take the flat file which you have and use a Merge stage to join these two files. You have various join types in the Merge stage, like pure inner join, left outer join, right outer join, etc.; use whichever one suits your requirements.

Ans 2 : Yes, you can implement Type 1, Type 2 or Type 3. Let me try to explain Type 2 with a timestamp.

Step 1: The timestamp is created via a shared container; it returns the system time and one key. For satisfying the lookup condition we create a key column by using the Column Generator.

Step 2: Our source is a dataset and the lookup table is an Oracle OCI stage. By using the Change Capture stage we find the differences; the Change Capture stage returns a value in change_code. Based on the return value we determine whether the row is an insert, edit or update. If it is an insert we stamp it with the current timestamp, and the old row with its timestamp is kept as history.

126. How can you implement complex jobs in DataStage?

Ans 1 : A complex design means having more joins and more lookups; such a job design is called a complex job. We can easily implement any complex design in DataStage by following simple tips, also in terms of increasing performance. There is no hard limitation on the number of stages in a job, but for better performance use at most around 20 stages in each job; if it exceeds 20 stages then go for another job. Use not more than 7 lookups per transformer, otherwise include one more transformer. I hope that answers your rather abstract question.

Ans 2 : If the job involves complicated logic, it is called a complex job -

simply put, scenarios. So if you have faced any complexity while creating a job, please share it.

127. Does Enterprise Edition only add parallel processing for better performance? Are any stages/transformations available in the Enterprise Edition only?

Ans 1 : • DataStage Standard Edition was previously called DataStage and DataStage Server Edition. • DataStage Enterprise Edition was originally called Orchestrate, then renamed to Parallel Extender when purchased by Ascential. • DataStage Enterprise: server jobs, sequence jobs, parallel jobs. The Enterprise Edition offers parallel processing features for scalable high-volume solutions. Designed originally for Unix, it now supports Windows, Linux and Unix System Services on mainframes. • DataStage Enterprise MVS: server jobs, sequence jobs, parallel jobs, MVS jobs.


MVS jobs are designed using an alternative set of stages that are generated into COBOL/JCL code and transferred to a mainframe to be compiled and run. Jobs are developed on a Unix or Windows server and transferred to the mainframe to be compiled and run. The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Parallel jobs have parallel stages but also accept some server stages via a container. Server jobs only accept server stages; MVS jobs only accept MVS stages. There are some stages that are common to all types (such as Aggregator), but they tend to have different fields and options within the stage.

Ans 2 : Row Merger and Row Splitter are only present as parallel stages.

128. How can you do an incremental load in DataStage?

Ans : You can create a table where you store the last successful refresh time for each table/dimension.

Then in the source query take the delta between the last successful refresh time and sysdate; that should give you the incremental load.

Ans 2 : Incremental load means the daily load.

Whenever you are selecting data from the source, select the records which were loaded or updated between the timestamp of the last successful load and today's load start date and time.

For this you have to pass parameters for those two dates.

Store the last run date and time in a file, read it through job parameters, and set the second argument to the current date and time.
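A hedged sketch of that file-based watermark approach; the paths, date format, project, job and parameter names are all hypothetical, and the dsjob exit-code convention should be checked for your version:

    #!/bin/sh
    # incremental_load.sh - pass the last-run and current timestamps to the job, then advance the watermark
    LAST_RUN=$(cat /var/ds/last_run_orders.txt)        # e.g. "2005-10-27 22:00:00"
    NOW=$(date '+%Y-%m-%d %H:%M:%S')
    dsjob -run -jobstatus \
          -param "LastRunTS=$LAST_RUN" \
          -param "CurrentTS=$NOW" \
          MyProject IncrementalOrdersLoad
    rc=$?
    if [ "$rc" -eq 1 ] || [ "$rc" -eq 2 ]; then        # typically 1 = ran OK, 2 = ran with warnings
        echo "$NOW" > /var/ds/last_run_orders.txt      # advance the watermark only after a successful run
    fi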

129. If you're running 4-way parallel and you have 10 stages on the canvas, how many processes does DataStage create?

Ans 1 : The answer is 40.

You have 10 stages and each stage can be partitioned and run on 4 nodes, which makes the total number of processes generated 40.

Ans2

It depends on the number of active stages on the canvas and how they are linked, as only active stages create processes. For example, if there are 6 active stages (like transformers) linked by some passive stages, the total number of processes is 6 x 4 = 24.

130. If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?

Ans 1 : Rows with the same value of key 2 can end up in different partitions, so the aggregation can produce partial or incorrect results unless the data is repartitioned (and sorted) on key 2 first, which adds to the execution time.

131. Is the BibhudataStage Oracle plug-in better than the OCI plug-in that comes with DataStage? What are the BibhudataStage extra functions?

Ans 1 :

132.  What is the difference between Datastage and Datastage TX?


Ans 1 : It's a tricky question to answer, but one thing I can tell you is that DataStage TX is not an ETL tool in the same sense, and it is not a new version of DataStage 7.5.

TX is used for ODS sources - this much I know.

133.  What validations do you perform after creating jobs in Designer? What are the different types of errors you faced during loading, and how did you solve them?

Ans1 Check for parameters.

Also check whether the input files exist, whether the input tables exist, and that the usernames, data source names and passwords are correct.

134.  How can I specify a filter command for processing data while defining sequential file output data?

Ans1 We have something called an after-job subroutine and a before-job subroutine, with which we can execute Unix commands.

Here we can use the sort command or a filter command.
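
As an illustration only (the exact place where the command is specified depends on your DataStage version and stage options), typical Unix filter commands that could be plugged in this way are:

grep -v '^#'          # drop comment lines from the data
sort -t '|' -k1,1     # sort pipe-delimited records on the first column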

135.  Can we use a shared container as a lookup in DataStage server jobs?

Ans1 I am using DataStage 7.5 on Unix. We can use a shared container more than once in a job, but is there any limit on its use? In my job I used the shared container in 6 flows, yet at any time only 2 flows are working. Can you please share information on this?

Ans2 Yes, we can use a shared container as a lookup in server jobs.

Wherever the same lookup is needed in multiple places, we develop the lookup in a shared container and then reuse that shared container as the lookup.

136.  Can anyone tell me how to extract data from more than one heterogeneous source, for example a sequential file, Sybase and Oracle, in a single job?

Ans1 Yes, you can extract data from two heterogeneous sources in DataStage using the Transformer stage; you just need to form a link from each source to the Transformer stage.

Ans2 You can convert all heterogeneous sources into sequential files and join them using a Merge stage.

Or you can write a user-defined query in the source itself to join them.

139.  What is the difference between buildops and subroutines?

Ans1 A buildop generates C++ code (an OOP concept).

A subroutine is a normal program, and you can call it anywhere in your project.

140.  What is the User Variables activity? When and where is it used? Explain with a real example.

Ans1 By using the User Variables activity we can create variables in a job sequence; these variables are then available to all the activities in that sequence.

Most probably this activity is placed at the start of the job sequence.

141.  DataStage from staging to MDW is only running at 1 row per second! What do we do to remedy this?

Ans I am assuming that there are too many stages, which is causing the problem, and providing the solution accordingly.

In general, if you have too many stages (especially Transformers and hashed-file lookups), there is a lot of overhead and performance degrades drastically. I would suggest writing a query instead of doing several lookups. It may seem embarrassing to have a tool and still write a query, but at times that is best.

If too many lookups are being done, ensure that you have appropriate indexes for the queries. If you do not want to write the query and prefer to use intermediate stages, ensure that you eliminate data properly between stages so that data volumes do not cause overhead. A re-ordering of stages may be needed for good performance.

Other things that could generally be looked into:

1) For massive transactions, set the hashed-file size and buffer size to appropriate values so that as much as possible is done in memory and there is no I/O overhead to disk.

2) Enable row buffering and set an appropriate size for the row buffer.

3) It is important to use appropriate objects between stages for performance.

142.  What is the difference between DataStage and Informatica?

Ans1 I have used both DataStage and Informatica. In my opinion, DataStage is far more powerful and scalable than Informatica. Informatica has more developer-friendly features, but when it comes to scalability and performance it is much inferior to DataStage.

Here are a few areas where Informatica is inferior -

1. Partitioning - Datastage PX provides many more robust partitioning options than informatica. You can also re-partition the data whichever way you want.

2. Parallelism - Informatica does not support full pipeline parallelism (although it claims to).

3. File lookup - Informatica supports flat-file lookups, but the caching is horrible. DataStage supports hashed files, lookup file sets and data sets for much more efficient lookups.

4. Merge/Funnel - Datastage has a very rich functionality of merging or funnelling the streams. In Informatica the only way is to do a Union, which by the way is always a Union-all.

Ans2 Informatica and DataStage are both ETL tools used for the data acquisition process. The main difference is the repository (the container of metadata): for Informatica the repository is a database (metadata is stored in a database), while for DataStage the repository is file-based (metadata is stored in files). Before running ETL, both Informatica and DataStage check the repository for metadata. Accessing a file is faster than accessing a database, because a file is static, but data is more secure in a database than in a file (file data may get corrupted). So we can conclude that DataStage tends to perform faster than Informatica, but when it comes to security Informatica is better than DataStage.

Ans3 The main difference lies in parallelism: DataStage implements parallelism through node configuration, whereas Informatica does not.

143.  How is DataStage 4.0 functionally different from the Enterprise Edition now? What are the exact changes?

Ans1 There are a lot of changes in DS EE, e.g. the CDC stage, the Stored Procedure stage, etc.

144.  What are the ORABULK and BCP stages?

Ans1 ORABULK is used to bulk-load data into a single table of a target Oracle database.

BCP is used to bulk-load data into a single table for Microsoft SQL Server and Sybase.

145.  How to handle rejected rows in DataStage?

Ans1 We can handle rejected rows in two ways with the help of constraints in a Transformer: 1) by ticking the Reject cell in the constraint properties of the Transformer, or 2) by using REJECTED in the expression editor of the constraint. Create a hashed file as temporary storage for rejected rows, create a link and use it as one of the outputs of the Transformer, and apply either of the two options above on that link. All the rows rejected by the other constraints will go to the hashed file.

146.  Is it possible to run parallel jobs in server jobs?

Ans1 No, it is not possible to run parallel jobs in server jobs, but server jobs can be executed from parallel jobs.

147.  What are the differences between DataStage 7.0 and 7.5 in server jobs?

Ans1

148.  How does the hashed file do a lookup in server jobs? How does it compare the key values?

Ans1 A hashed file is used for two purposes: 1. removing duplicate records, and 2. reference lookups. Each record in the hashed file has a hashed key, a key header and a data portion. By using the hashing algorithm and the key value, the lookup is faster.

149.  What is a data set and what is a file set?

Ans1 I assume you are referring to the lookup file set only; it is used only by Lookup stages. Dataset: DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. FileSet: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.

Ans2

File set: it allows you to read data from or write data to a file set. The stage can have a single input link, a single output link and a single rejects link, and it executes only in parallel mode. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.

Data sets are used to bring data into parallel jobs in much the same way that ODBC is used in server jobs.

150.  What are the enhancements made in DataStage 7.5 compared with 7.0?

Ans Many new stages were introduced compared with DataStage 7.0. In server jobs we got the Stored Procedure stage and the Command stage, and a generate-report option appeared in the File tab. In job sequences many activities such as Start Loop, End Loop, Terminate Loop and User Variables were introduced. In parallel jobs the Surrogate Key stage and Stored Procedure stage were introduced. For all other specifications, please refer to the manual.

Ans2 The Complex Flat File and Surrogate Key Generator stages were added in version 7.5.

151.  If I add a new environment variable in Windows, how can I access it in DataStage? Thanks in advance.

Ans1 You can view all the environment variables in Designer. You can check them in Job Properties, and you can add and access environment variables from Job Properties.

152.  1. What about system variables? 2. How can we create containers? 3. How can we improve the performance of DataStage? 4. What are job parameters? 5. What is the difference between a routine, a transform and a function? 6. What are the third-party tools used in DataStage? 7. How can we implement a lookup in DataStage server jobs? 8. How can we implement slowly changing dimensions in DataStage? 9. How can we join an Oracle source and a sequential file? 10. What are the iconv and oconv functions? 11. What is the difference between a sequential file and a hashed file?

Ans1 1. System variables are built-in variables that can be called in a Transformer stage. 2. A container is a group of stages and links; there are 2 types, local containers and shared containers. 3. Use IPC and manage the array and transaction sizes; project tunables can be set through the Administrator.

4. Values that are required during the job run. 5. Routines call jobs or perform other actions using DS; transforms manipulate data during the load. 6. Not answered. 7. Using a hashed file. 8. Using the target Oracle stage, depending on the update action. 9. Using a row id or sequence-generated numbers. 10. Date functions. 11. A sequential file reads data sequentially; using a hashed file the read process is faster. 12. It can be any length.

Ans 2 System variables comprise a set of variables which are used to get system information; they can be accessed from a transformer or a routine. They are read-only and start with an @.

153.  Is it possible to call one job from another job in server jobs?

Ans1

I think we can call a job from another job. In fact, 'calling' does not sound quite right, because you attach/add the other job through the job properties; you can attach zero or more jobs.

Steps will be Edit --> Job Properties --> Job Control

Click on Add Job and select the desired job.

154.  What is a hashing algorithm? Explain briefly how it works.

Ans1 Hashing is key-to-address translation. This means the value of a key is transformed into a disk address by means of an algorithm, usually a relative block and an anchor point within the block. How well the algorithms work is closely related to statistical probability.

It sounds fancy but these algorithms are usually quite simple and use division and remainder techniques. Any good book on database systems will have information on these techniques.

Interesting to note that these approaches are called "Monte Carlo Techniques" because the behavior of the hashing or randomizing algorithms can be simulated by a roulette wheel where the slots represent the blocks and the balls represent the records (on this roulette wheel there are many balls not just one).

Ans2 A hashing algorithm takes a variable-length data message and creates a fixed-size message digest. When a one-way hashing algorithm is used to generate the message digest, the input cannot be determined from the output. It is a mathematical function coded into an algorithm that takes a variable-length string and changes it into a fixed-length string, or hash value.

155.  what is OCI?

Ans

Oracle offers a proprietary call interface for C and C++ programmers that allows manipulation of data in an Oracle database. Version 9.n of the Oracle Call Interface (OCI) can connect and process SQL statements in the native Oracle environment without needing an external driver or driver manager. To use the Oracle OCI 9i stage, you need only to install the Oracle Version 9.n client, which uses SQL*Net to access the Oracle server. Oracle OCI 9i works with both Oracle Version 7.0 and 8.0 servers, provided you install the appropriate Oracle 9i software. With Oracle OCI 9i, you can:

• Generate your SQL statement. (Fully generated SQL query/Column-generated SQL query)
• Use a file name to contain your SQL statement. (User-defined SQL file)
• Clear a table before loading using a TRUNCATE statement. (Clear table)
• Choose how often to commit rows to the database. (Transaction size)

• Input multiple rows of data in one call to the database. (Array size)
• Read multiple rows of data in one call from the database. (Array size)
• Specify transaction isolation levels for concurrency control and transaction performance tuning. (Transaction Isolation)
• Specify criteria that data must meet before being selected. (WHERE clause)
• Specify criteria to sort, summarize, and aggregate data. (Other clauses)
• Specify the behavior of parameter marks in SQL statements.

      

156.  What is the NLS equivalent to NLS oracle code American_America.US7ASCII on Datastage NLS?

157.  If a DataStage job aborts after, say, 1000 records, how do we continue the job from the 1000th record after fixing the error?

Ans 1 If a checkpoint run is selected on the sequence, it keeps track of the failed job; when you start the sequence again it skips the jobs that ran without errors and restarts the failed job (not from the record where it stopped).

158.  How to implement Type 2 slowly changing dimensions in DataStage? Explain with an example.

We can handle SCDs in the following ways. Type 1: just use "Insert rows Else Update rows"

or "Update rows Else Insert rows" in the update action of the target.

Type 2: use the following steps: a) use one hashed file to look up the target, b) take 3 instances of the target, c) give different conditions depending on the process, d) give different update actions in the target, e) use system variables like Sysdate and Null.

159.  Is it possible to move data from an Oracle warehouse to a SAP warehouse using the DataStage tool?

Ans1 We can use the DataStage Extract Pack for SAP R/3 and the DataStage Load Pack for SAP BW to transfer the data from Oracle to the SAP warehouse. These plug-in packs are available with DataStage version 7.5.

160.  Can you convert a snowflake schema into a star schema?

Ans1 Yes, we can convert it by attaching one hierarchy to the lowest level of another hierarchy.

161.  How much would be the size of the database in DataStage? What is the difference between in-process and inter-process?

ans1 In-process

You can improve the performance of most DataStage jobs by turning in-process row buffering on and recompiling the job. This allows connected active stages to pass data via buffers rather than row by row.

Note: You cannot use in-process row-buffering if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.

Inter-process

Use this if you are running server jobs on an SMP parallel system. This enables the job to run using a separate process for each active stage, which will run simultaneously on a separate processor.

Note: You cannot use inter-process row buffering if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.

162.  How can I convert server jobs into parallel jobs?

Ans1 I have never tried doing this; however, I have some information which will help you save a lot of time. You can convert your server job into a server shared container, and the server shared container can then be used in parallel jobs as a shared container.

Ans2 Couldn't we just copy the whole job design and stages using the mouse and paste them into a new parallel job? From there you may want to change some properties, such as partitioning.

163.  What is the maximum capacity of a hashed file in DataStage?

# 64BIT_FILES - This sets the default mode used to
#   create static hashed and dynamic files.
#   A value of 0 results in the creation of 32-bit
#   files. 32-bit files have a maximum file size of
#   2 gigabytes. A value of 1 results in the creation
#   of 64-bit files (ONLY valid on 64-bit capable platforms).
#   The maximum file size for 64-bit
#   files is system dependent. The default behavior
#   may be overridden by keywords on certain commands.
64BIT_FILES 0

164.  How to use rank & update strategy in DataStage?

You can do it with the ODBC stage by writing proper SQL queries.

165.  What is the difference between the DRS and ODBC stages?

To answer your question, the DRS stage should be faster than the ODBC stage as it uses native database connectivity. You will need to install and configure the required database clients on your DataStage server for it to work.

The Dynamic Relational Stage was leveraged for PeopleSoft so that a job could run on any of the supported databases. It supports ODBC connections too; read more about it in the plug-in documentation.

ODBC uses the ODBC driver for a particular database; DRS is a stage that tries to make it seamless to switch from one database to another. It uses the native connectivity for the chosen target ...

DRS and the ODBC stage are similar, as both use Open Database Connectivity to connect to a database. Performance-wise there is not much of a difference. We use the DRS stage in parallel jobs.

166.  What is the meaning of file extender in DataStage server jobs? Can we run a DataStage job from one job to another job? Where is that file data stored, and what is the file extender in DS jobs?

Ans1 File extender means adding columns or records to an already existing file in DataStage.

We can run a DataStage job from another job in DataStage.

167.  How does DataStage handle user security?

Ans1 We have to create users in the Administrator and give the necessary privileges to those users.

168.  What are the Steps involved in development of a job in DataStage?

The steps required are:

Select the data source stage depending upon the source, e.g. flat file, database, XML, etc.

Select the required stages for the transformation logic, such as Transformer, Link Collector, Link Partitioner, Aggregator, Merge, etc.

Select the final target stage where you want to load the data, whether it is a data warehouse, data mart, ODS, staging area, etc.

169.  Briefly describe the various client components?

There are four client components

DataStage Designer. A design interface used to create DataStage applications (known as jobs). Each job specifies the data sources, the transforms required, and the destination of the data. Jobs are compiled to create executables that are scheduled by the Director and run by the Server.

DataStage Director. A user interface used to validate, schedule, run, and monitor DataStage jobs.

DataStage Manager. A user interface used to view and edit the contents of the Repository.

DataStage Administrator. A user interface used to configure DataStage projects and users.

170.  What is a project? Specify its various components?

You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to connect to a project. Each project contains:

DataStage jobs. Built-in components. These are predefined components used in a job.

User-defined components. These are customized components created using the DataStage Manager or DataStage Designer

171.  * What are constraints and derivations? * Explain the process of taking a backup in DataStage. * What are the different types of lookups available in DataStage?

Ans1 Constraints are used to check a condition and filter the data. Example: Cust_Id<>0 is set as a constraint, and only those records meeting it will be processed further.

A derivation is a method of deriving fields, for example if you need to get a SUM, AVG, etc.

Ans2 Constraints are conditions; only records meeting them are processed further. Example: process all records where cust_id<>0.

Derivations are derived expressions, for example a SUM of salary or an interest rate calculation.

172.  Will DataStage consider the second constraint in the Transformer once the first condition is satisfied (if link ordering is given)?

Ans1

173.  How to remove duplicates in a server job?

Ans1 1) Use a Hashed File stage, or 2) if you use the sort command in UNIX (in a before-job subroutine), you can reject duplicate records using the -u parameter, or 3) use a Sort stage.

174.  How do you do usage analysis in DataStage?

1. If you want to know whether a job is part of a sequence, then in the Manager right-click the job and select Usage Analysis. It will show all the job's dependents.

2. To find how many jobs are using a particular table.

3. To find how many jobs are using a particular routine.

Like this, you can find all the dependents of a particular object.

It is nested: you can move forward and backward and see all the dependents.

175.  What is the purpose of using keys, and what is the difference between a surrogate key and a natural key? We use keys to provide relationships between the entities (tables). By using primary and foreign key relationships, we can maintain the integrity of the data.

The natural key is the one coming from the OLTP system.

The surrogate key is an artificial key which we create in the target DW. We can use these surrogate keys instead of the natural key. In SCD2 scenarios surrogate keys play a major role.

176.  What are the environment variables in DataStage? Give some examples.

These are variables used at the project or job level. We can use them to configure the job, i.e. we can associate the configuration file (without this you cannot run your job) or increase the sequential or data set read/write buffers.

  Example: $APT_CONFIG_FILE

Like the above, we have many environment variables. Please go to the job properties and click on "add environment variable" to see most of the environment variables.

177.  What is the difference between 'validated OK' and 'compiled' in DataStage?

When you compile a job, it ensures that basic things like all the important stage parameters have been set and the mappings are correct, etc., and then it creates an executable job.

You validate a compiled job to make sure that all the connections are valid, all the job parameters are set, and a valid output can be expected after running the job. It is like a dry run where you don't actually play with the live data, but you are confident that things will work.

Ans2 When we say "Validating a Job", we are talking about running the Job in the "check only" mode. The following checks are made :

- Connections are made to the data sources or data warehouse.
- SQL SELECT statements are prepared.

- Files are opened. Intermediate files in Hashed File, UniVerse, or ODBC stages that use the local data source are created, if they do not already exist.

178.  How can we create a rank in DataStage like in Informatica?

If ranking means the following:

prop_id  rank
1        1
1        2
1        3
2        1
2        1

You can do this as follows: first use a Sort stage with the option that creates the KeyChange column set to true; it makes the data look like below:

prop_id  rank  KeyChange()
1        1     1
1        2     0
1        3     0
2        1     1
2        1     0

If the key value changes, the KeyChange column is set to 1, else it is set to 0. After the Sort stage, use a Transformer stage variable to build the rank from the KeyChange column.

179.  What is difference between Merge stage and Join stage?

Someone was saying that Join does not support more than two inputs, while Merge supports two or more inputs (one master and one or more update links). I will say that is highly incomplete information. The fact is that Join does support two or more input links (left, right and possibly intermediate links). But yes, if you are talking about a full outer join, then more than two links are not supported.

Coming back to the main question of the difference between the Join and Merge stages, the other significant differences that I have noticed are:

1) Number of reject links

(Join) does not support a reject link.

(Merge) has as many reject links as update links (if there are n input links, then 1 will be the master link and n-1 will be update links).

2) Data Selection

(Join) There are various ways in which data is being selected. e.g. we have different types of joins, inner, outer( left, right, full), cross join, etc. So, you have different selection criteria for dropping/selecting a row.

(Merge) Data in Master record and update records are merged only when both have same value for the merge key columns.

-----Please share if someone is aware of more differences -----

180.  Hi, can anyone explain what the DB2 UDB utilities are?

181.  What are the Profile stage, Quality stage and Audit stage in DataStage? Please explain in detail. Thanks in advance.

Ans1

182.  Hi all, what are the Audit stage, Profile stage and Quality stage in DataStage? Please explain in detail.

183.  How to implement Type 2 slowly changing dimensions in DataStage? Give an example.

Ans1 Slowly changing dimensions are a common problem in data warehousing. For example: there is a customer called Lisa in a company ABC, and she lives in New York. Later she moves to Florida, and the company must now modify her address. In general there are 3 ways to solve this problem.

 

Type 1: the new record replaces the original record, with no trace of the old record at all. Type 2: a new record is added to the customer dimension table; the customer is treated essentially as two different people. Type 3: the original record is modified to reflect the change.

 

In Type 1 the new value overwrites the existing one, which means no history is maintained; the history of where the person stayed before is lost. It is simple to use.

 

In Type 2 a new record is added, so both the original and the new record are present, and the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the size of the dimension table grows, so storage and performance can become a concern.

Type 2 should only be used if it is necessary for the data warehouse to track historical changes.

 

In Type 3 there are 2 columns, one to indicate the original value and the other to indicate the current value. For example, a new column is added which shows the original address as New York and the current address as Florida. This helps keep part of the history and the table size is not increased, but one problem is that when the customer moves from Florida to Texas the New York information is lost. So Type 3 should only be used if the changes will occur only a finite number of times.

184.  How to find the number of rows in a sequential file?

Use the row count system variable.

185.  Where are the flat files actually stored? What is the path?

Normally flat files are stored on FTP servers or in local folders; .CSV, .XLS and .TXT file formats are available for flat files.

186.  What are the different types of lookups in DataStage?

There are two types of lookups: the Lookup stage and the lookup file set. Lookup: a lookup references another stage or database to get data from it and transform it for the other database. Lookup file set: it allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link; the output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating lookup file sets, one file is created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs.

187.  What are the most important aspects that a beginner must consider doing his first DS project?

Apart from DWH concepts and knowledge of the different stages, try to use the Director to find errors and learn how to tune performance. Knowledge of Unix shell scripting will be very helpful.

188.  How can we call a routine in a DataStage job? Explain with steps.

Routines are used for implementing business logic. They are of two types: 1) before-subroutines and 2) after-subroutines. Steps: double-click on the Transformer stage, right-click on any one of the mapping fields, select the [DS Routines] option, give the business logic within the edit window and select either of the options (before/after subroutine).

189.  What is job control? How is it developed? Explain with steps.

Controlling DataStage jobs through other DataStage jobs. Example: consider two jobs XXX and YYY. Job YYY can be executed from job XXX by using DataStage macros in routines.

To execute one job from another job, the following steps need to be followed in routines:

1. Attach the job using the DSAttachJob function.

2. Run the other job using the DSRunJob function.

3. Stop the job using the DSStopJob function.

190.  What is job control? How can it be used? Explain with steps.

JCL stands for Job Control Language; it is used to run a number of jobs at a time, with or without loops. Steps: click on Edit in the menu bar, select 'Job Properties' and enter the parameters as:

Parameter   Prompt    Type
STEP_ID     STEP_ID   string
Source      SRC       string
DSN         DSN       string
Username    unm       string
Password    pwd       string

After editing the above, set the JCL button, select the jobs from the list box and run the job.

191.  How to find errors in a job sequence?

Ans1 Using the DataStage Director we can find the errors in a job sequence.

192.  Is it possible for two users to access the same job at a time in DataStage?

No, it is not possible for two users to access the same job at the same time. DS will produce the following error: "Job is accessed by other user".

193.  What is the meaning of an instance in DataStage? Explain with examples.

194.  If the size of the hash file exceeds 2 GB, what happens? Does it overwrite the current rows?

Ans1

195.  How to drop the index before loading data into the target, and how to rebuild it, in DataStage?

This can be achieved by the "Direct Load" option of the SQL*Loader utility.

196.  How to parameterise a field in a sequential file? I am using DataStage as the ETL tool and a sequential file as the source.

Ans1 You can parameterize it using #parameter-name# and define the parameter in the job properties.

197.  How to kill a job in DataStage?

You should use kill -14 so the job ends nicely; sometimes using kill -9 leaves things in a bad state.

198.  Where do we use the Link Partitioner in a DataStage job? Explain with an example.

We use the Link Partitioner in DataStage server jobs. The Link Partitioner stage is an active stage which takes one input and allows you to distribute partitioned rows to up to 64 output links.

199.  What is the difference between a sequential file and a data set? When to use the Copy stage?

A Sequential File stage stores a small amount of data, with any extension, in order to access the file, whereas a data set is used to store a huge amount of data and opens only with the .ds extension. The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns.

The main difference between a sequential file and a data set is: a sequential file stores a small amount of data in plain form, but a data set stores the data in an internal format.

A sequential file stores a small amount of data with any extension such as .txt, whereas a data set stores a huge amount of data and opens the file only with the .ds extension.

200.  What is the purpose of the Exception activity in DataStage 7.5?

It is used to catch exceptions raised while running the job.

The stages following the Exception activity are executed whenever an unknown error occurs while running the job sequence.

201. How do I create a DataStage engine stop/start script? Actually my idea is as below (an outline, to be run as dsadm):

#!/bin/bash
# dsadm user; su - root; password (encrypted)
DSHOMEBIN=/Ascential/DataStage/home/dsadm/Ascential/DataStage/DSEngine/bin
# if ps -ef | grep DataStage shows a client connection, kill -9 that PID
uv -admin -stop > /dev/null
uv -admin -start > /dev/null
# verify the process, check the connection
echo "Started properly"

OR

202. How do I create a DataStage engine stop/start script?

Go to the path /DATASTAGE/PROJECTS/DSENGINE/BIN/ and run:

uv -admin -stop
uv -admin -start
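
A slightly more complete, hedged sketch of the same idea (the DSHOME path and the dsapi_slave process name are assumptions that should be checked against your installation), to be run as dsadm:

#!/bin/bash
# Restart the DataStage engine, killing leftover client connections first.
DSHOME=/Ascential/DataStage/DSEngine          # adjust to your install path
PATH=$DSHOME/bin:$PATH

# Kill any remaining client-connection processes before stopping the engine
for pid in `ps -ef | grep dsapi_slave | grep -v grep | awk '{print $2}'`
do
    kill -9 "$pid"
done

uv -admin -stop  > /dev/null
sleep 10
uv -admin -start > /dev/null
echo "DataStage engine restarted"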

203. What does separation option in static hash-file mean?

The different hashing algorithms are designed to distribute records evenly among the groups of the file based on characters and their position in the record ids.

When a hashed file is created, separation and modulo respectively specify the group buffer size and the number of groups allocated for the file. When a static hash file is created, DataStage creates a file that contains the number of groups specified by the modulo.

Size of hash file = modulus (number of groups) * separation (group buffer size)

204. Give one real-time situation where the Link Partitioner stage is used.

If we want to send more data from the source to the targets quickly, we use the Link Partitioner stage in server jobs; we can make a maximum of 64 partitions, and it is an active stage. We can't normally connect two active stages, but it is accepted for this stage to connect to a Transformer or Aggregator stage. The data sent from the Link Partitioner is collected by the Link Collector, with a maximum of 64 partitions. This is also an active stage, so in order to avoid connecting an active Transformer directly to the Link Collector we use the Inter-process Communication (IPC) stage. As this is a passive stage, the data can be collected by the Link Collector through it. But we can use inter-process communication only when the target is a passive stage.

205- What is the difference between symmetric multiprocessing and massively parallel processing?

Symmetric Multiprocessing (SMP) - some hardware resources may be shared by the processors. The processors communicate via shared memory and have a single operating system.

Cluster or Massively Parallel Processing (MPP) - known as shared-nothing, in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed. The processors have their own operating systems and communicate via a high-speed network.

OR

Symmetric Multiprocessing (SMP) is the processing of programs by multiple processors that share a common operating system and memory. SMP is also called "tightly coupled multiprocessing". A single copy of the operating system is in charge of all the processors running in an SMP system, and SMP typically does not exceed 16 processors. SMP is better than MPP when online transaction processing is done, in which many users access the same database to do a search with a relatively simple set of common transactions. One main advantage of SMP is its ability to dynamically balance the workload among computers (and as a result serve more users at a faster rate).

Massively Parallel Processing (MPP) is the processing of programs by multiple processors that work on different parts of the program and have separate operating systems and memories. The different processors communicate with each other through message interfaces. There are cases in which up to 200 processors run a single application. An interconnect arrangement of data paths allows messages to be sent between the different processors running a single application or product. The setup for MPP is more complicated than for SMP; an experienced thought process should be applied when you set up MPP, and one should have good in-depth knowledge of how to partition the database among the processors and how to assign the work to them. An MPP system can also be called a loosely coupled system. An MPP is considered better than an SMP for applications that allow a number of databases to be searched in parallel.

206- How to implement slowly changing dimensions in Data stage?

Slowly changing dimensions are a DWH concept.

DataStage is a tool for ETL purposes, not for slowly changing dimensions as such.

In Informatica PowerCenter there is a way to implement slowly changing dimensions through a wizard. DataStage does not have that type of wizard to implement SCDs; they have to be implemented with manual logic.

207- What is the DataStage engine? What is its purpose?

The DataStage server contains the DataStage engine; the DS server interacts with the client components and the repository. The DS engine is used to develop and run jobs; only when the engine is running can we develop jobs.

208- What is the size of the flat file?

The flat file size depends on the amount of data contained in that flat file.

How to improve the performance of hash file?

You can improve performance of hashed file by

1. Preloading the hash file into memory --> this can be done by enabling the preloading options in the Hashed File output stage.

2. Write-caching options --> data is written into a cache before being flushed to disk. You can enable this to ensure that hash files are written onto the cache in order before being flushed to disk, instead of in the order in which individual rows are written.

3. Preallocating --> estimating the approximate size of the hash file so that the file does not need to be split too often after write operations.

209- Other than round robin, what algorithm is used in the Link Collector? Also explain how it works.

Other than round robin, the other algorithm is Sort/Merge.

Using the sort/merge method the stage reads multiple sorted inputs and writes one sorted output.

Have you ever been involved in updating DS versions, like DS 5.X? If so, tell us some of the steps you took.

A) Yes. The following are some of the steps I have taken in doing so:

1) Definitely take a backup of the whole project(s) by exporting each project as a .dsx file.

2) See that you use the same parent folder for the new version so that your old jobs using hard-coded file paths still work.

3) After installing the new version, import the old project(s); you have to compile them all again. You can use the 'Compile All' tool for this.

4) Make sure that all your DB DSNs are created with the same names as the old ones; this step is for moving DS from one machine to another.

5) In case you are just upgrading your DB from Oracle 8i to Oracle 9i, there is a tool on the DS CD that can do this for you.

6) Do not stop the 6.0 server before the upgrade; the version 7.0 install process collects project information during the upgrade. There is NO rework (recompilation of existing jobs/routines) needed after the upgrade.

My requirement is like this:

Here is the codification suggested:

SALE_HEADER_XXXXX_YYYYMMDD.PSV
SALE_LINE_XXXXX_YYYYMMDD.PSV

XXXXX = LVM sequence to ensure uniqueness and continuity of file exchanges

Caution: there will be an increment to implement. YYYYMMDD = LVM date of file creation.

COMPRESSION AND DELIVERY TO: SALE_HEADER_XXXXX_YYYYMMDD.ZIP AND SALE_LINE_XXXXX_YYYYMMDD.ZIP

If we run that job, the target file names are like sale_header_1_20060206 & sale_line_1_20060206.

If we run it the next time, the target files should be like sale_header_2_20060206 & sale_line_2_20060206.

If we run the same job the next day, the target files should be like sale_header_3_20060306 & sale_line_3_20060306.

i.e., whenever we run the same job, the target file name automatically changes to filename_(previous number + 1)_current date.

This can be done by using a UNIX script:

1. Keep the target filename as a constant name, xxx.psv.

2. Once the job has completed, invoke the Unix script through the after-job routine ExecSH.

3. The script should get the number used in the previous file and increment it by 1. After that, move the file from xxx.psv to filename_(previous number + 1)_currentdate.psv and then delete the xxx.psv file. This is the easiest way to implement it; a sketch of such a script is given below.
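
For illustration only, a minimal after-job script along these lines (the output directory, file names and sequence file are assumptions):

#!/bin/sh
# Rename the constant output file to <name>_<seq>_<date>.psv, incrementing <seq> on every run.
OUTDIR=/data/out                      # assumed output directory
SEQFILE=$OUTDIR/sale_header.seq       # holds the last sequence number used

LAST=`cat $SEQFILE 2>/dev/null`
[ -z "$LAST" ] && LAST=0
NEXT=`expr $LAST + 1`
TODAY=`date +%Y%m%d`

# mv both renames the file and removes the original xxx.psv
mv $OUTDIR/xxx.psv $OUTDIR/sale_header_${NEXT}_${TODAY}.psv
echo $NEXT > $SEQFILE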

210- How to know the number of records in a sequential file before running a server job?

If your environment is UNIX, you can check with the wc -l filename command.
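
For example (the file name is illustrative), redirecting the file into wc strips the file name from the output so only the count is printed:

wc -l /data/in/customers.txt       # prints "<count> /data/in/customers.txt"
wc -l < /data/in/customers.txt     # prints just the count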

What are the transaction size and array size in the OCI stage? How can they be used?

Transaction Size - This field exists for backward compatibility, but it is ignored for release 3.0 and later of the Plug-in. The transaction size for new jobs is now handled by Rows per transaction on the Transaction Handling tab on the Input page.

Rows per transaction - The number of rows written before a commit is executed for the transaction. The default value is 0, that is, all the rows are written before being committed to the data table. 

Array Size - The number of rows written to or read from the database at a time. The default value is 1, that is, each row is written in a separate statement.

211- How do you clean the DataStage repository?

Remove log files periodically, or use CLEAR.FILE &PH&.
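
As a hedged illustration, old phantom log files can also be purged from the operating-system side; the project path below is an assumption, and the &PH& directory name must be quoted in the shell:

# Remove phantom log files older than 7 days from a project's &PH& directory
find /Ascential/DataStage/Projects/MyProject/'&PH&' -type f -mtime +7 -exec rm -f {} \;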

212- What is the difference between Transform and Routine in DataStage?

A transform transforms the data from one form to another form, whereas a routine describes the business logic.

213- How to run a job from the command prompt in UNIX?

Using the dsjob command with the appropriate options:

dsjob -run -jobstatus projectname jobname
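
When -jobstatus is supplied, dsjob waits for the job to finish and its exit status reflects the job's finishing state, so the result can be checked in a script. The status-code mapping shown in the comment is the commonly documented one; verify it against your release:

dsjob -run -jobstatus MyProject MyJob
rc=$?
# 1 = finished OK, 2 = finished with warnings, 3 = aborted (typical DSJS_* values)
if [ $rc -ne 1 -a $rc -ne 2 ]; then
    echo "MyJob failed with status $rc"
fi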

214- How do you call procedures in datastage?

Use the Stored Procedure Stage

215- How do you remove duplicates without using remove duplicate stage?

In the target make the column as the key column and run the job.

Or

Using a sort stage, set property: ALLOW DUPLICATES: false

Or

You can do it at any stage.

Just do a hash partition of the input data and check the Sort and Unique options.

This will do.

216- What are environment variables? What is their use?

Basically an environment variable is a predefined variable that we can use while creating a DS job. We can set it either at project level or at job level. Once we set a specific variable, that variable is available in the project/job.

We can also define new environment variables; for that we go to the DS Administrator.

For further details refer to the DS Administrator guide.

217- Can you tell me for what purpose .dsx files are used in DataStage?

.dsx is the standard file extension of the various DataStage jobs. Whenever we export a job or a sequence, the file is exported in the .dsx format. A standard usage is that we develop the job in our test environment and, after testing, we export the file and save it as x.dsx. This can be done using the DataStage Manager.

218- What is the difference between ETL and ELT?

ETL usually scrubs the data and then loads it into the data mart or data warehouse, whereas ELT loads the data and then uses the RDBMS to scrub and reload it into the data mart or data warehouse.

ETL = Extract >>> Transform >>> Load

ELT = Extract >>> Load >>> Transform

Or

In ETL, transformation takes place in the staging area,

and in ELT, transformation takes place at either the source side or the target side.

If we are using two sources having the same metadata, how do we check whether the data in the two sources is the same or not? And if the data is not the same, I want to abort the job. How can we do this?

Use a Change Capture stage and output it into a Transformer.

Write a routine to abort the job, which is initiated from a function call when

@INROWNUM = 1.

So if the data does not match, the row is passed to the Transformer and the job is aborted.

219- How can I schedule the cleaning of the file &PH& by DSJob?

Create a job with a dummy Transformer and a Sequential File stage. In the before-job subroutine, use ExecTCL to execute the command CLEAR.FILE &PH& (see question 211).

220- What are the Quality stage and the Profile stage?

Quality Stage is used for data cleansing; Profile Stage is used for data profiling.

Profile Stage is used for analyzing data and their relationships.

221- How to find the process id? Explain with steps.

You can find it in UNIX by using the ps -ef command; it displays all the processes currently running on the system along with their process ids.

or

From the DS Director. Follow the path:

Job > Cleanup Resources.

There also you can see the PID. It also displays all the current running processes.

Or

Depending on your environment, you may have lots of process ids. From one of the DataStage docs, you can try this on any given node: $ ps -ef | grep dsuser, where dsuser is the account for DataStage. If the above ps command doesn't make sense, you'll need some background theory about how processes work in UNIX (or the equivalent environment when running on Windows). Also from the DataStage docs (I haven't tried this one yet, but it looks interesting): APT_PM_SHOW_PIDS - if this variable is set, players will output an informational message upon startup, displaying their process id.

Or

You can also use the DataStage Administrator: just click on the project and execute a command, follow the menu choices to get the job name and PID, then kill the process in UNIX; for this you will require the DataStage user name under which the process is locked.

222- How to distinguish the surrogate keys in different dimension tables? How can we generate them for different dimension tables?

Use a database sequence to make it easier to generate the surrogate key.

223- What is the difference between OCI stage and ODBC stage?

Oracle OCI:

         We can write the source query in this stage, but we can't write a lookup query in it; instead we use the Hashed File stage for the lookup.

         We have the facility to write multiple queries before (Oracle OCI/Output/SQL/Before) or after (Oracle OCI/Output/SQL/After) executing the actual query.

         We don't have a multi-row lookup facility in this stage.

 ODBC:

         We can write both the source query and the lookup query in this stage itself.

         We do not have the facility to write multiple queries in this stage.

         We have the multi-row lookup facility in this stage.

224- What is Runtime Column Propagation and how to use it?

If your job has extra columns which are not defined in the metadata, and runtime column propagation is enabled, it will propagate those extra columns through the rest of the job.

Can both the source system (Oracle, SQL Server, etc.) and the target data warehouse (Oracle, SQL Server, etc.) be on a Windows environment, or should one of the systems be on a UNIX/Linux environment?

Your source system can be Oracle, SQL Server, DB2, flat files, etc., but your target system for the complete data warehouse should be a single one (Oracle, SQL Server, DB2, etc.).

Or

In the server edition you can have both on Windows, but in PX the target should be on UNIX.

Is there any difference between Ascential DataStage and DataStage?

There is no difference between Ascential DataStage and DataStage. It is now IBM WebSphere DataStage; earlier it was Ascential DataStage, then IBM bought it and renamed it as above.

What is the difference between reference link and straight link ?

225- The difference between a reference link and a straight link is:

The straight link is the one where data is passed directly to the next stage, and the reference link is the one which shows that it has a reference (a reference key) to the main table.

For example in oracle EMP table has reference with DEPT table.

In DATASTAGE

Two table stages as sources (one on a straight link and the other on a reference link) go to one Transformer stage as the process.

If the 2 sources are file stages, one is a straight link and the other is a reference link through a hashed file used as the reference, plus 1 Transformer stage.

226.  What are the various processes which start when the DataStage engine starts? There are three processes that start when the DataStage engine starts:

1. DSRPC

2. Datastage Engine Resources

3. Datastage telnet Services

227- What is the difference between a hashed file and a sequential file? What is modulus?

The records in a sequential file are organized serially, one after another, but the records in the file may be ordered or unordered. The hashed file access method scatters the records randomly throughout the RMS data file. When creating a hashed RMS file, the maximum number of records the file will contain must be declared. When a record is added to a hashed RMS file, the primary key value is transformed into a number between one and the number of records in the file. RMS attempts to place the record at that location. If a record already exists at that location, a collision has occurred and the record must be placed elsewhere.

228- What is the Pivot stage? Why is it used? For what purpose will that stage be used?

The Pivot stage is used to turn horizontal data into vertical data and vice versa.

Or

The Pivot stage supports only horizontal pivoting - columns into rows.

The Pivot stage doesn't support vertical pivoting - rows into columns.

Example: in the source table below there are two columns for the quarterly sales of a product, but the business requirement is that the target should contain a single column to represent quarterly sales; we can solve this using the Pivot stage, i.e. horizontal pivoting.

Source Table

ProdID  Q1_Sales  Q2_Sales
1010    123450    234550

Target Table

ProdID  Sales   Quarter
1010    123450  Q1
1010    234550  Q2

229- How to eliminate duplicate rows in DataStage?

You can remove duplicate rows in more than one way:

1. In DS there is a stage called "Remove Duplicates" where you can specify the key.

2. Another way is to specify the key while using a stage; the stage itself then removes the duplicate rows based on the key at processing time.

Or

By using the Hashed File stage in DS server jobs we can eliminate the duplicates.

Or

Using a sort stage, set property: ALLOW DUPLICATES: false

OR

You can use any stage: in its Input tab choose hash partitioning, specify the key and check the Unique checkbox.

Or

There are two methods for eliminating duplicate rows in DataStage:

1. Using the Hashed File stage (specify the keys; a unique key does not allow duplicate values).

2. Using the Sort stage followed by the Remove Duplicates stage.

230- How can we load the source into the ODS? OR

231- What is our source? Depending on the type of source, you have to use the respective stage.

For example Oracle Enterprise: you can use this for an Oracle source and target.

Similarly for other sources.

232- How can we create environment variables in DataStage?

We can create environment variables by using the DataStage Administrator.

Or

This mostly comes under the Administrator, but as a designer we can also add one directly via Designer > View > Job Properties > Parameters > Add Environment Variable > User Defined > Add.

233- What is the difference between static hash files and dynamic hash files?

Static hash files don't change their number of groups (modulus) except through manual resizing.

Dynamic hash files automatically change their number of groups (modulus) in response to the amount of data stored in the file.

234- What is a routine?

Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are the different types of routines: 1) transform functions, 2) before/after job subroutines, 3) job control routines.

Or

A routine is a user-defined function that can be reused within the project.

235- How to find duplicate records using the Transformer stage in the server edition?

This question has several answers, as the elimination of duplicates is situation specific; depending upon the situation we can use the best choice to remove duplicates.

1. You can write a SQL query depending upon the fields. 2. You can use a hashed file, which by nature doesn't allow duplicates; attach a reject link to see the duplicates for verification.

Or

Use a Transformer stage to identify and remove duplicates on one output, and direct all input rows to another output (the "rejects"). This approach requires sorted input.

236- Is a Type 30(D) hash file GENERIC or SPECIFIC?

Type 30 files are dynamic files.

237- How can we run a batch using the command line?

The dsjob command is used to run DataStage jobs from the command line. In older architectures people used to create a batch job to control the remaining DataStage jobs in the process, as in the KEN BLEND architecture.

With the dsjob command you can run any DataStage job in the DataStage environment.
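
As an illustrative sketch only (the project and job names are made up), a simple shell "batch" that runs several jobs one after another with dsjob could look like this:

#!/bin/sh
# Run a fixed list of jobs in order; stop at the first failure.
for job in LoadDimCustomer LoadDimProduct LoadFactSales
do
    dsjob -run -jobstatus MyProject "$job"
    rc=$?
    # with -jobstatus, 1/2 usually mean finished OK / finished with warnings
    if [ $rc -ne 1 -a $rc -ne 2 ]; then
        echo "Job $job failed (status $rc)"
        exit 1
    fi
done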

238- What is fact load?

You load the facts table in the data mart with the combined input of ODBC (OR DSE engine) data sources. You also create transformation logic to redirect output to an alternate target, the REJECTS table, using a row constraint.

In a star schema there are fact and dimension tables to load in any data warehouse environment. You generally load the dimension tables first and then the facts.

The fact table holds the related information for the dimensions.

239- Does type of partitioning change for SMP and MPP systems?

240- Explain a specific scenario where we would use range partitioning?

It is used when the data volume is high; it partitions by ranges of column values.

Or

If the data is large and you cannot process all of it in one pass, you will generally use range partitioning.

241- What is a phantom error in DataStage?

If a process is running and you kill it, sometimes the process keeps running in the background; such a process is called a phantom process.

You can use the resource manager (Cleanup Resources) to clean up that kind of process.

242- What is job commit in DataStage?

Job commit means the changes made are saved.

Or

In general a DataStage job commits each record, but you can force DataStage to take a set of records and then commit them together. In the case of the Oracle stage, in the Transaction Handling tab you can set the number of rows per transaction.

243- What is the difference between Job Control and Job Sequence?

What is the difference between Datastage Server jobs and Datastage Parallel jobs?

The basic difference is that a server job usually runs on a Windows platform and a parallel job runs on a UNIX platform.

A server job runs on a single node, whereas a parallel job runs on more than one node.

244- How will you pass the parameter to the job schedule if the job is running at night? What happens if one job fails in the night?

Is it possible to calculate a hash total for an EBCDIC file and have the hash total stored as EBCDIC using Datastage?

Currently, the total is converted to ASCII, even though the individual records are stored as EBCDIC.

How to attach a .mtr file (MapTrace) via email? The MapTrace file is used to record all the execute map errors.

245- What is a phantom error in DataStage? How to overcome this error?

A phantom process is an orphaned process; sometimes some processes are still running on the server even though you killed the actual process.

Some threads keep running without any parent process; they are called phantom processes.

If you look at the directory called &PH&, this folder captures the logs of the phantom processes.

I don't know in which case you are getting this error, but please check the active processes on the DataStage server and kill them if they have been running for a very long time.

246- How do you load partial data after job failed? i.e.

Source has 10000 records, Job failed after 5000 records are loaded. This status of the job is abort, Instead of removing 5000 records from target, how can I resume the load

There are lots of ways of doing this.

But we keep the Extract, Transform and Load process separately.

Generally the load job never fails unless there is a data issue, and data issues should all be cleared earlier, in the transform step.

There are some DB tools that do this automatically.

If you want to do it manually, keep track of the number of records loaded in a hash file or text file and update it as you insert each record. If the job fails in the middle, read the count from the file and process only the records after that point, ignoring the earlier record numbers; the @INROWNUM variable is useful here, as in the sketch below.
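A minimal sketch of that manual restart, assuming the committed row count has been read into a (hypothetical) job parameter named RESTART_ROW: put a constraint such as

   @INROWNUM > RESTART_ROW

on the Transformer's output link so that the first RESTART_ROW source rows are skipped on the rerun, and keep writing the latest loaded row number back to the tracking hash/text file so the next restart begins from the right place.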

247- What are the important considerations when using a Join stage instead of Lookups?

If the volume of reference data is high (too large to hold comfortably in memory), use the Join stage instead of a Lookup.

Or

1. If you need to capture mismatches between the two sources, a Lookup provides an easy option (its reject link).

248- How can we Test jobs in Datastage??

Your system configuration is fine and I don't think there is a problem with the installation either. There are a few possibilities for this to happen:

1. Make sure the login and password you give are the same ones you use when you log on to the system.

2. Go into the Control Panel and check whether all the services are up and running.

3. If all of these are fine, try connecting a network cable and then try working. Some of these should work.

2 - 1) Create the user account and password in User Accounts, 2) install the DataStage server, 3) install the client.

It works.

Important: your installation drive should be formatted as NTFS, not FAT32.

249- What is the use of a hash file? Instead of a hash file, why can't we use a sequential file?


A hash file is used to eliminate duplicate rows based on the hash key, and is also used for lookups. DataStage does not allow a sequential file to be used as a lookup.

OR - Actually the primary use of a hash file is to do a lookup. You can use a sequential file for a lookup, but you have to write your own routine to match the columns, so both coding time and execution time are more expensive. When you generate a hash file, the keys are indexed by a built-in hashing algorithm, so when a lookup is made it is much, much faster. It also eliminates duplicate rows.

250- how can we test the jobs?

Testing of jobs can be performed at many different levels: Unit testing, SIT and UAT phases.

Testing basically involves functionality and performance tests.

Firstly data for the job needs to be created to test the functionality. By changing the data we will see whether the requirements are met by the existing code. Every iteration of code change should be accompanied by a testing iteration.

Performance tests basically involve load tests to see how well the existing code performs within a finite period of time. Performance tuning can be carried out on the SQL, the job design or the BASIC/OSH code for faster processing times.

In addition, all job designs should include error correction and failover support so that the code is robust.

251- What is an environment variable?

Basically, an environment variable is a predefined variable that we can use while creating a DataStage job. It can be set either at project level or at job level; once a variable is set, it is available to that project/job.

We can also define new environment variables; for that we go to the DataStage Administrator.

252- how can we generate a surrogate key in server/parallel jobs?

In parallel jobs we can use the Surrogate Key Generator stage.

OR - In server jobs we can use a built-in SDK routine called KeyMgtGetNextValue, as in the sketch below.
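A minimal sketch for a server job (the sequence name CUST_DIM is hypothetical): in the Transformer, set the derivation of the surrogate key column to

   KeyMgtGetNextValue("CUST_DIM")

The SDK routine remembers the last value handed out for that named sequence, so each call returns the next number; in multi-instance or concurrent designs the KeyMgtGetNextValueConcurrent variant is normally used instead.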

253- How do you read data from Excel (XL) files? Explain the steps.

To read data from an Excel file:

* Save the file in .csv (comma-separated) format.

* Use a flat file (Sequential File) stage in the DataStage job canvas.

* Double-click the flat file stage and assign the .csv file you saved as its input file.

* Import the metadata for the file (once you have imported or typed the metadata, click View Data to check the data values).

2 - Create a new DSN for the Excel driver and choose the workbook from which you want the data. Select the ODBC stage and access the Excel sheet through it, i.e. import the Excel sheet using the new DSN created for Excel.

254- How do you distinguish the surrogate key in different dimension tables?

The surrogate key is the key field in each dimension table.


255- What is meant by performance tuning techniques? Give an example.

Performance tuning means taking action to increase the performance of a slowly running job, for example by:

1) using the Link Partitioner and Link Collector stages to speed up processing

2) using sorted data for aggregation

3) using a sorter at the source side and aggregation at the target side

4) tuning the OCI stage's 'Array Size' and 'Rows per Transaction' values for faster inserts, updates and selects

5) not using an IPC stage at the target side

Note: this list relates mainly to server jobs, because in Parallel Extender these things are taken care of by the stages themselves.

256- What are the disadvantages of a staging area?

I think the main disadvantage of a staging area is disk space, as we have to dump data into a local area. To my knowledge there is no other disadvantage of a staging area.

OR - Yes, the disadvantage of a staging area is that it takes more space in the database, which may not be cost effective for the client.

257- How do you read data from Excel (XL) files when the data itself contains commas but the delimiter we are using is a pipe (|)? Explain the steps.

1. Create a DSN for your Excel file using the Microsoft Excel driver.

2. Use an ODBC stage as the source.

3. Configure the ODBC stage with the DSN details.

4. While importing the metadata for the Excel sheet, make sure you select the 'system tables' check box.

Note: in the Excel sheet the first row should contain the column names.

OR - If the problem is only the commas in the Excel data, we can open it in Access and save the file with a pipe (|) separator. It can then be read as a simple sequential file, with the delimiter changed to | on the Format tab.

258- What are the main differences between server jobs and parallel jobs in DataStage?

In server jobs we have only a few stages; they are mainly logic-intensive, we use the Transformer for most things, and they do not use MPP systems.

In parallel jobs we have lots of stages; the jobs are stage-intensive, there are built-in stages for particular tasks, and they can use MPP systems.

OR - In server jobs we do not have an option to process the data on multiple nodes as in parallel jobs. In parallel jobs we have the advantage of processing the data in pipelines and by partitioning, whereas there is no such concept in server jobs.


There are also many differences in using the same stages in server and parallel jobs. For example, in parallel jobs a Sequential File (or any other file stage) can have either an input link or an output link, but in server jobs it can have both (and more than one of each).

OR - Server jobs compile and run within the DataStage server engine, but parallel jobs compile and run on the DataStage UNIX (parallel) engine.

A server job extracts all the rows from the source into the next stage; only then does that stage become active and pass the rows on to the target/data warehouse, which is time-consuming.

But parallel jobs offer two kinds of parallelism:

1. pipeline parallelism

2. partition parallelism

1. With pipeline parallelism, as rows are extracted from the source into the next stage, that stage is already active and passing rows on to the target/data warehouse; a single pipeline is maintained between source and target.

2. With partition parallelism, more than one node is maintained between source and target.

259- What is the Complex Flat File stage? In which situations do we use it?

A Complex Flat File (CFF) stage can be used to read the data at the initial level. Using CFF we can read ASCII or EBCDIC data, select the required columns and omit the rest. We can collect the rejects (badly formatted records) by setting the reject property to "save" (other options: continue, fail), and we can flatten arrays (COBOL files).

260- Why is a hash file faster than a sequential file and the ODBC stage?

A hash file is indexed and works on a hashing algorithm; that is why searching a hash file is faster.

261- How can we improve performance in the Aggregator stage?

To improve performance when you use the Aggregator stage, sort the data before you pass it to the stage.

OR - Select the most appropriate partitioning method based on data analysis. Hash partitioning performs well in most cases.

262- What is integration & unit testing in DataStage?

Unit Testing:

In the DataStage scenario, unit testing is the technique of testing an individual DataStage job for its functionality.

Integration Testing:

When two or more jobs are tested collectively for their functionality, that is called integration testing.

263- What is a job sequence used for? What are batches? What is the difference between a job sequence and a batch?

2 - A batch is a collection of jobs grouped together to perform a specific task, i.e. it is a special type of job, created using DataStage Director, which can be scheduled to run at a specific time.


Difference between Sequencers and Batches:

Unlike in sequencers, in batches we cannot provide the control information.

264- how can we improve the job performance?

It can be improved in many ways; one simple method is to insert an IPC stage between two active stages or two passive stages.

There are lots of techniques for performance tuning; as asked, the IPC stage should be inserted between two active stages.

2 - Some of the tips to follow to improve performance in DataStage parallel jobs:

1. Do the right partitioning at the right parts of the job; avoid re-partitioning of data as much as possible.

2. Sort the data before aggregating or removing duplicates.

3. Use Transformer and Pivot stages sparingly.

4. Try to develop small, simple jobs rather than huge, complex ones.

5. Study and decide in which circumstances a Join or Merge should be used and in which a Lookup should be used.

265- what are two types of hash files??

The two types of hash file are 1) static and 2) dynamic; dynamic hash files are further subdivided into generic and specific.

266- Where can you output data using the Peek Stage?

In the DataStage Director: look at the Director log.

OR - The output of the Peek stage can be viewed in the Director log, and it can also be saved as a separate text file.

267- For what purpose is the Stage Variable mainly used?

A stage variable is a temporary in-memory variable; if we are doing a calculation repeatedly, we can store its result in a stage variable.

OR - A stage variable can be used when you want to store a previous record's value, compare it with the current record's value, and use if-then-else conditional statements.

For example, if you want to build up a product list, separated per manufacturer, across consecutive rows, you can use stage variables, as in the sketch below.
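A minimal sketch, assuming sorted input and hypothetical names (input link in, columns Manufacturer and Product, stage variables ProdList and PrevMfr, evaluated top to bottom for each row):

   ProdList derivation:  If in.Manufacturer = PrevMfr Then ProdList : ", " : in.Product Else in.Product
   PrevMfr derivation:   in.Manufacturer

The ":" inside the derivations is the BASIC concatenation operator. Because PrevMfr is evaluated after ProdList, it still holds the previous row's manufacturer when the comparison runs, which is exactly the previous-versus-current pattern described above.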

268- what are different types of file formats??

Some are

comma delimited csv files

tab delimited text files...


OR - .csv files, .dsx files (the standard DataStage export extension).

269- Why does a sequential file have only a single input link?

A sequential file stage has a single link because it cannot accept multiple links or threads.

Data in a sequential file is always processed sequentially.

270- What is the use of Tunables?

Tunables is a tab in the DataStage Administrator through which one can increase or decrease the cache size.

It is a project property in DataStage Administrator; there we can change the value of the cache size, i.e. between 0 and 999 MB.

271- What is the use of job control?

Job control is used for scripting. With the help of scripting (job control code), we can set parameters for a job, execute it, do error handling and similar tasks, as in the sketch below.
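A minimal sketch of job control code (DataStage BASIC); the job name LoadFact and parameter RUN_DATE are hypothetical:

   * Attach the job, set a parameter, run it and wait for it to finish
   hJob = DSAttachJob("LoadFact", DSJ.ERRFATAL)
   ErrCode = DSSetParam(hJob, "RUN_DATE", "20050131")
   ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
   ErrCode = DSWaitForJob(hJob)

   * Simple error handling on the finishing status
   Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
   If Status <> DSJ.RUNOK Then
      Call DSLogWarn("LoadFact did not finish OK", "JobControl")
   End
   ErrCode = DSDetachJob(hJob)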

272- What are the alternative ways to do job control?

Job control is possible through scripting; how you control jobs depends on the requirements and needs of the job.

OR - Job control can be done using: DataStage job sequencers, DataStage custom routines, scripting, and scheduling tools like Autosys.

273- What is user activity in datastage?

The User Variables Activity stage defines variables that are used later in the sequence.

274- Is it possible to query a hash file? Justify your answer

No, it is not possible to query a hash file directly; the reason is that it is a back-end file and not a database that can be queried.

2 - You can query a hashed file using the UniVerse stage; the hash file structure is UniVerse-like.

3 - You can also use a UniVerse stage if the data is needed in your job. I don't recall all the details from when I used this, but I think there is an option when creating the hash file to make it UniVerse compatible, or something like that. Then you can use it in a UniVerse stage and run any query you like against it.

275- How to run a job using command line?

dsjob -run -jobstatus projectname jobname

276- What is the difference between IBM WebSphere DataStage 7.5 (Enterprise Edition) and the standard Ascential DataStage 7.5 version?


IBM acquired DataStage from Ascential.

DataStage 7.5 was released under the Ascential DataStage name.

I believe only version 8.0 was released as IBM WebSphere DataStage.

277- What is the difference betwen Merge Stage and Look up stage?

Merge stage: the parallel job stage that combines data sets.

Lookup stage: the mainframe processing jobs and parallel active stages that perform table lookups.

OR - Lookup stage:

1. Used to perform lookups.

2. Multiple reference links, a single primary input link, a single output link and a single reject link.

3. Large memory usage, because paging may be required.

4. Data on the input or reference links need NOT be sorted.

Merge stage:

1. Combines the sorted master data set with the update data sets.

2. Several reject links (one per update link).

3. Less memory usage.

4. Data needs to be sorted.

278- What is Fact loading, how to do it?

First you run the hash-file jobs, then the dimension jobs and lastly the fact jobs.


OR - Once we have loaded our dimensions, then, as per the business requirements, we identify the facts (the columns or measures on which the business is measured) and load them into the fact tables.

279- Aggregators – What does the warning “Hash table has grown to ‘xyz’ ….” mean?

The Aggregator cannot land data to disk the way the Sort stage does, so the data going into the Aggregator occupies system memory. If system memory fills up, you get that kind of warning message. I dealt with this error once; my solution was to process the data in multiple chunks using multiple Aggregators.

280- Which partitioning should we use for the Aggregator stage in parallel jobs?

By default this stage uses the Auto partitioning mode. The best partitioning depends on the operating mode of this stage and of the preceding stage. If the Aggregator is operating in sequential mode, it first collects the data using the default Auto collection method before writing it out. If the Aggregator is in parallel mode, any partitioning type from the drop-down list on the Partitioning tab can be chosen; generally Auto or Hash is used.


281- Where can we use the Link Partitioner, Link Collector and Inter Process (IPC) stages: in server jobs or in parallel jobs? And is SMP parallel or server?

You can use the Link Partitioner and Link Collector stages in server jobs to speed up processing. Suppose you have a source, a target and a Transformer in between that does some processing, applies functions, etc. You can speed this up by using a Link Partitioner to split the data from the source into different links, applying the business logic on each link, and then collecting the data back with a Link Collector and pumping it into the output.

The IPC (Inter Process Communication) stage is also intended to speed up processing.

282- What is a hashing algorithm?


Hashing is a technique for deciding where data is stored in dynamic (hash) files; there are a few algorithms for doing this.

Read data structures books for the algorithm models.

Hash files are created as dynamic files using a hashing algorithm.
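To make the idea concrete, a deliberately simplified sketch (not the engine's actual algorithm): a number is derived from the characters of the key and reduced modulo the number of groups in the file, for example

   group = MOD(sum of the key's character codes, number.of.groups) + 1

so that the same key always maps to the same group, and a lookup only has to read that one group instead of scanning the whole file.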

283- what is the difference between RELEASE THE JOB and KILL THE JOB?

Releasing a job releases it from any dependencies and runs it. Killing a job kills the job that is currently running or scheduled to run.

284- How can we remove duplicates using sort stage?

Set the Sort stage's "Allow Duplicates" option to False.

285- What is a repository?

The repository resides in a specified database; it holds all the metadata, raw data and mapping information. OR - The repository is the store that contains all the metadata (information).

286- What is the difference between DataStage and DataStage TX?

WebSphere Transformation Extender (WTX) is the universal transformation engine for WebSphere that addresses these complex data challenges to integration. Through its unique ability to speak to and process any data type in its native format, it tackles the "hard, ugly challenges" in integrating systems and information across the enterprise through a codeless, graphical approach to development.