a - Handling Variable Length Files Using XML

Informatica – Handling Variable Length Files

Using XML

WHITE PAPER

Author:

Arvind Kumar

Senior ETL Architect

Oracle, Informatica Certified

[email protected]

V Keshav

Senior Informatica Developer

[email protected]


Curosys Solutions Inc.

Abstract

Informatica is the leading provider of data integration software. Informatica PowerCenter, based on the Universal Data Services (UDS) architecture, is the foremost adaptive software for integrating immediate, accurate, and understandable enterprise data. PowerCenter provides improved data integrity and greater visibility of enterprise data and processes. This paper attempts to provide a solution to the limitations of Informatica PowerCenter in dealing with variable length delimited flat files.

Intended Audience

This paper is intended for readers who have a business need to process variable length flat files within Informatica PowerCenter. The reader is expected to have a fair knowledge of XML technology and an understanding of Informatica’s Midstream XML Transformations introduced in 7.1. Though not mandatory, a good understanding of any one programming language that supports data structures is recommended before going through this paper.



Table of Contents

INTRODUCTION

GETTING STARTED
    The Approach
    The Business Case

THE XML GENERATION
    The Vertical File
    Assumptions
    The XML Hierarchy
    Code – Tag Resolver
    Groups & Levels
    All that’s needed…
    Immediate Parents
    Handling the Hierarchies
    Already done…?
    Generating the XML Output – Bringing them all together

APPENDIX A
    Splitter Transformation

APPENDIX B
    Implementation

APPENDIX C
    Generating the XML output – Code Sample



INTRODUCTION

Informatica PowerCenter can handle two kinds of flat files: fixed width and delimited. In a fixed-width file, the data of each column starts at a predefined position, so the width of every column is known beforehand. When a column has no data, it is padded with spaces to fill the column width, whether the data type is number or string. As a result, every record in a fixed-width file has the same length. In a delimited file, on the other hand, the data of each column is separated by a delimiter (separator), most often a comma. When a column has no data, the next delimiter follows immediately. Informatica can handle delimited files where the number of columns in the input file is fixed. Sometimes, however, the business demands the use of variable length delimited files, which Informatica does not support. This paper offers an alternative: convert the file into XML before applying any business rules. Please note that this paper presents pseudocode rather than actual code. The reader is expected to adapt it appropriately to Informatica.



GETTING STARTED

The Approach

This section walks you through the suggested approach for handling variable length delimited flat files. To start with, convert the input delimited file into a vertical file; that is, replace every delimiter with a new line so that each element (the data between two delimiters) sits on its own line. Now read this vertical file as the input. Informatica treats each line as a row, so each element becomes a row. Convert each incoming element to an XML tag until you reach the start of the next actual record. Concatenate all this XML content and pass it to the XML Parser, which then populates the corresponding data in each port. You can now apply the business rules to the data coming out of the parser.

The following flowchart visualizes the process described above.

Figure 1: The Solution

1. Read the Source File

2. Convert the Source File to a Vertical File

3. For each element in the vertical file:
   - Convert the element into an XML Tag
   - Generate the appropriate XML hierarchy

4. Frame up the complete XML content & pass it over to the Parser

5. Apply Business Rules



The Business Case

From here on, the white paper explains the process of handling VLDF files (Variable Length Delimited Flat Files) using a business case as an example. This section walks you through the business case. Each line consists of a record. The end of a record is denoted by [EOR], the End of Record indicator. Each tilde-delimited element contains two parts: the element name and the corresponding data. The first 6 characters of every element constitute the element name (the sample record below uses ten-character names such as ELEMENT01_). Some elements, based on the value they contain, may repeat more than once in the input record. In such a case, a counter element precedes the one or more repeating elements. The counter element’s data represents the number of times the following element(s) repeat. Each of these repeating sets of elements corresponds to an aggregate in the XML that is to be created.

In the example below, the first line is an indicator marking the Start-Of-File and the third line is the End-Of-File indicator. Actual records always lie between these indicators. Hence, in the example below, the second line is a sample input record.

The first 5 characters in the second line are the length of the line (and hence the record). Then starts the first element (column), whose name is ELEMENT01_ (the first ten characters); whatever follows is the element data, 00897000006 in this case. As explained above, some elements may repeat more than once in the record, which makes the record variable in length. The element ELEMENT07 has a value of 2, which means that ELEMENT08 and ELEMENT09 together repeat twice, as can be seen in the data below.

Example – Input

1. [SOF]

2. ELEMENT01_00897000006~ELEMENT02_089~ELEMENT03_0001~ELEMENT04_20040809~ELEMENT05_0001~ELEMENT06_0001~ELEMENT07_002~ELEMENT08_SP~ELEMENT09_John Anderton~ELEMENT08_AB~ELEMENT09_Mike~[EOR]

3. [EOF]
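The record layout can be sketched in Python; the paper itself targets Informatica expressions, so Python is used here only for illustration. The function name `split_record` and the constant `NAME_LEN` are hypothetical, and the name width of ten characters is an assumption taken from the sample record.

```python
DELIMITER = "~"
END_OF_RECORD = "[EOR]"
NAME_LEN = 10  # assumption: width of "ELEMENT01_" in the sample record


def split_record(record: str) -> list[tuple[str, str]]:
    """Split one tilde-delimited record into (element name, data) pairs."""
    pairs = []
    for element in record.split(DELIMITER):
        if element == END_OF_RECORD:
            break  # the End of Record indicator closes the record
        if not element:
            continue  # ignore empty pieces between delimiters
        name = element[:NAME_LEN].rstrip("_")
        pairs.append((name, element[NAME_LEN:]))
    return pairs


sample = (
    "ELEMENT01_00897000006~ELEMENT02_089~ELEMENT03_0001~ELEMENT04_20040809~"
    "ELEMENT05_0001~ELEMENT06_0001~ELEMENT07_002~ELEMENT08_SP~"
    "ELEMENT09_John Anderton~ELEMENT08_AB~ELEMENT09_Mike~[EOR]"
)
pairs = split_record(sample)
```

The counter element ELEMENT07 arrives as an ordinary pair here; deciding how many times ELEMENT08 and ELEMENT09 repeat is left to the later steps.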



This data is then converted to XML. The output for the given sample record is shown below. The first section of the example shows the XML hierarchy that needs to be generated, whereas the second shows how the content has to be passed to the XML Parser. The whole XML goes to the XML Parser as a single line.

Example – Output

<ROOT>
  <DATA_TOPIC1>
    <ELEMENT01>00897000006</ELEMENT01>
    <ELEMENT02>089</ELEMENT02>
    <ELEMENT03>0001</ELEMENT03>
    <ELEMENT04>20040809</ELEMENT04>
  </DATA_TOPIC1>
  <DATA_TOPIC2>
    <AGGREGATE1>
      <AGGREGATE2>
        <ELEMENT08>SP</ELEMENT08>
        <ELEMENT09>John Anderton</ELEMENT09>
      </AGGREGATE2>
      <AGGREGATE2>
        <ELEMENT08>AB</ELEMENT08>
        <ELEMENT09>Mike</ELEMENT09>
      </AGGREGATE2>
    </AGGREGATE1>
  </DATA_TOPIC2>
</ROOT>

__________________________________________________________

<ROOT><DATA_TOPIC1><ELEMENT01>00897000006</ELEMENT01><ELEMENT02>089</ELEMENT02><ELEMENT03>0001</ELEMENT03><ELEMENT04>20040809</ELEMENT04></DATA_TOPIC1><DATA_TOPIC2><AGGREGATE1><AGGREGATE2><ELEMENT08>SP</ELEMENT08><ELEMENT09>John Anderton</ELEMENT09></AGGREGATE2><AGGREGATE2><ELEMENT08>AB</ELEMENT08><ELEMENT09>Mike</ELEMENT09></AGGREGATE2></AGGREGATE1></DATA_TOPIC2></ROOT>
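A single-line string like this can be checked for well-formedness with any XML library before it is handed to the parser. The sketch below uses Python's standard `xml.etree.ElementTree` purely as a stand-in for Informatica's XML Parser; it is not part of the mapping described in this paper.

```python
import xml.etree.ElementTree as ET

xml_line = (
    "<ROOT><DATA_TOPIC1><ELEMENT01>00897000006</ELEMENT01>"
    "<ELEMENT02>089</ELEMENT02><ELEMENT03>0001</ELEMENT03>"
    "<ELEMENT04>20040809</ELEMENT04></DATA_TOPIC1>"
    "<DATA_TOPIC2><AGGREGATE1>"
    "<AGGREGATE2><ELEMENT08>SP</ELEMENT08>"
    "<ELEMENT09>John Anderton</ELEMENT09></AGGREGATE2>"
    "<AGGREGATE2><ELEMENT08>AB</ELEMENT08>"
    "<ELEMENT09>Mike</ELEMENT09></AGGREGATE2>"
    "</AGGREGATE1></DATA_TOPIC2></ROOT>"
)

# fromstring raises xml.etree.ElementTree.ParseError on malformed content.
root = ET.fromstring(xml_line)

# Each occurrence of the repeating aggregate yields one (ELEMENT08, ELEMENT09) row.
repeats = [
    (agg.findtext("ELEMENT08"), agg.findtext("ELEMENT09"))
    for agg in root.iter("AGGREGATE2")
]
```

Here `repeats` holds one tuple per AGGREGATE2 occurrence, which mirrors how a variable number of repeating elements turns into a variable number of rows after parsing.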



THE XML GENERATION

The Vertical File

Now that we have seen what needs to be done, let’s take a closer look at how to do it. This section explores the ways to convert the incoming delimited file to a vertical file. There are two ways of doing so:

1. Replace the delimiter with a new line character, land the file and read it again in another mapping.

2. Use the Splitter Transformation *

Option 1 – Replace String

This is the simplest way of converting a delimited file into a vertical file. Develop a mapping that reads the actual input file as a flat file with one and only one column, so that the entire input row is read as a single column. Pass the input to an Expression transformation, where a ReplaceString function (or an equivalent) replaces the delimiter with a new line character, and write the result to a Flat File target. The following picture depicts the same.

______________________

* The Splitter Transformation is NOT available with the Informatica 7.1.1 installable. It needs to be downloaded from the Informatica Developer Network. For further information, browse http://devnet.informatica.com/

Figure 2: Vertical File – Option 1

Expression: Search in Incoming Data, Find '~' and Replace with Chr(13)

- Delimiter is a tilde (~)
- 13 is the ASCII code for the carriage return character; where the platform's line terminator is line feed, the code is 10
- Chr(13) returns the character for the ASCII code 13



Note that the above mapping writes the same number of rows to the output file as it read from the input. However, each row that Informatica writes contains several lines (because the expression introduces new line characters). When the next mapping reads this file, it reads each line as a new row, because Informatica by default treats each line of a flat file as a row.
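The effect of Option 1 can be sketched in Python; this is only an illustration of the replace-and-reread idea, not Informatica code, and the function name `to_vertical` is hypothetical.

```python
def to_vertical(text: str, delimiter: str = "~", newline: str = "\n") -> str:
    """Replace every delimiter with a line terminator."""
    return text.replace(delimiter, newline)


# One horizontal row is written as one string containing several lines ...
horizontal = "ELEMENT01_00897000006~ELEMENT02_089~[EOR]"
vertical = to_vertical(horizontal)

# ... and the next reader sees each line as a separate row.
rows = vertical.splitlines()
```

The single input row becomes three rows on the second read, one per element.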

Option 2 – Splitter Transformation

The second option is more elegant, as it doesn’t need to land a file but does the same on the fly. The core of this option is the Splitter Transformation, which splits an incoming row on a delimiter and produces multiple output rows. Please refer to Appendix A for more information on the Splitter Transformation.

Figure 3: Vertical File – Option 2

Splitter: Splits the incoming row into multiple rows based on the delimiter specified

- This is an Advanced External Procedure – Custom Transformation
- Not available with the default Informatica installable



Assumptions

Once the vertical file is ready, the next step is to generate the XML file. Before moving forward with the XML generation, the reader needs to get acquainted with the logic of the XML. This section lists the assumptions made:

- The XML to be generated doesn’t contain more than 7 levels of hierarchy.
- The hierarchy and the XML Schema on which the XML is based are known beforehand.
- Every record has a static Record Start Indicator and Record End Indicator that help in generating the XML.



The XML Hierarchy

To make the job simpler, the 7 levels of hierarchy are given 7 different names and are referred to by those names in the code. This makes the code more readable. The following picture demonstrates the same:

Hierarchical Rules:

- There will be one and only one Root per row.
- All the aggregates except Root are optional and may or may not appear in a row.
- Ancestor is the Data Topic level and hence cannot repeat. However, there can be any number of distinct data topics in a given row.
- Any number of Predecessors, Forerunners, Forebears, Antecedents and Precursors can exist in a given row, subject to the hierarchy shown in Figure 4.
- A data element can appear at any level other than Root and Ancestor; that is, data elements can appear at any level from Predecessor downwards.

______________________

Please note that all these rules apply to the case and scenario explained in this article. Actual rules may vary based on the user’s implementation. These rules are NOT generic and might not apply to all XML data in general.

Figure 4: XML Hierarchy

Root

Ancestor

Predecessor

Forerunner

Forebear

Antecedent

Precursor



Code – Tag Resolver

A Lookup transformation is used to resolve every incoming element to an XML tag. This lookup provides the hierarchical information of the XML tag being generated. This hierarchy is compared against the hierarchy of the previous element to determine whether any hierarchy has to be closed. For example, if the previous element belongs to one data topic and the current element to another, then, based on the differences between the hierarchies, we Close Previous Hierarchy (generate the XML tags that close the hierarchy of the previous element up to whatever level is applicable) and Open Current Hierarchy (generate the XML tags that open the hierarchy of the current element up to whatever level is applicable). The closing and opening of hierarchies are dealt with in detail in the following sections.

The following table describes the structure of the lookup file used for this purpose:

Lookup Column   Description
Element         Element name
Root            The Root Element of the XML Hierarchy for the given Element
Ancestor        The Data Topic Element of the XML Hierarchy for the given Element
Predecessor     Third level parent, if any
Forerunner      Fourth level parent, if any
Forebear        Fifth level parent, if any
Antecedent      Sixth level parent, if any
Precursor       Seventh level parent, if any

Sample lookup file structure:

Element    Root  Ancestor     Predecessor  Forerunner  Forebear    Antecedent  Precursor
ELEMENT01  Root  Data_Topic1  Aggregate1   Aggregate2  Aggregate3  Aggregate4  Aggregate5
ELEMENT02  Root  Data_Topic1  Aggregate1   Aggregate2  Aggregate3
ELEMENT03  Root  Data_Topic1  Aggregate1   Aggregate2

Sample lookup file content:



This lookup helps in fetching the XML hierarchy and makes the process more dynamic.

Groups & Levels

One more lookup is required to properly generate the hierarchies and levels of the XML content. It doesn’t give us any information on the parents or any higher levels of the element. However, it helps us frame the XML using the parents that we already obtained from the Code – Tag Resolver. This lookup answers several questions that the Code – Tag Resolver cannot. For example: what if I want to re-open 3 levels of parents whenever a new occurrence of the XML tag is encountered? What if I don’t want to generate any XML content for some elements? And so on.

The following table describes the structure of this lookup:

Lookup Column    Description
Element          Element name
Restart Group    Is this the Restart Group – “Y” if yes, “N” if no
Restart Levels   How many levels of parents need to be closed and restarted? – a number
Skip             Does this element need to be processed? – “Y” if yes, “N” if no

Example – Lookup (Groups & Levels)

Element|Restart Group|Restart Levels|Skip

ELEMENT01|N|0|N

ELEMENT02|Y|1|N

ELEMENT03|N|0|Y

Example – Lookup (Code – Tag Resolver)

Element|Root|Ancestor|Predecessor|Forerunner|Forebear|Antecedent|Precursor

ELEMENT01|Root|Data_Topic1|Agg1|Agg2|Agg3|Agg4|Agg5

ELEMENT02|Root|Data_Topic1|Agg1|Agg2|Agg3||

ELEMENT03|Root|Data_Topic1|Agg1|Agg2|||
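Both lookups are plain pipe-delimited files keyed by element name, so they can be sketched as dictionaries. The Python below is an illustration only; `load_lookup`, `LEVELS` and the variable names are hypothetical and simply reuse the sample content shown in this section.

```python
import csv
import io

# The seven hierarchy levels, top-down, as named in this paper.
LEVELS = ("Root", "Ancestor", "Predecessor", "Forerunner",
          "Forebear", "Antecedent", "Precursor")

TAG_RESOLVER_TEXT = """\
Element|Root|Ancestor|Predecessor|Forerunner|Forebear|Antecedent|Precursor
ELEMENT01|Root|Data_Topic1|Agg1|Agg2|Agg3|Agg4|Agg5
ELEMENT02|Root|Data_Topic1|Agg1|Agg2|Agg3||
ELEMENT03|Root|Data_Topic1|Agg1|Agg2|||
"""

GROUPS_LEVELS_TEXT = """\
Element|Restart Group|Restart Levels|Skip
ELEMENT01|N|0|N
ELEMENT02|Y|1|N
ELEMENT03|N|0|Y
"""


def load_lookup(text: str) -> dict[str, dict[str, str]]:
    """Index a pipe-delimited lookup by its Element column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return {row["Element"]: row for row in reader}


tag_resolver = load_lookup(TAG_RESOLVER_TEXT)
groups_levels = load_lookup(GROUPS_LEVELS_TEXT)

# Hierarchy path of ELEMENT02, top-down, with the unused levels dropped.
path = [tag_resolver["ELEMENT02"][lvl] for lvl in LEVELS
        if tag_resolver["ELEMENT02"][lvl]]
```

Empty trailing columns come back as empty strings, which the later steps can treat as NULL.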



All that’s needed…

Now that all the preliminaries are done, we need to put them together to convert the data to XML. An Expression transformation is used to do this. The following table lists the ports required during the conversion.

Port Name                 Description
in_Element                Input element
in_Element_Data           Data of the incoming element
in_Root                   Root of the current element
in_Ancestor               Ancestor of the current element
in_Predecessor            Predecessor of the current element
in_Forerunner             Forerunner of the current element
in_Forebear               Forebear of the current element
in_Antecedent             Antecedent of the current element
in_Precursor              Precursor of the current element
in_Restart_Group          Is this a restart group?
in_Restart_Levels         How many levels have to be restarted?
in_Skip                   Can this element be processed?
v_Immediate_Parent        What’s the immediate parent for the current element?
v_Close_Prev_Hierarchy    The hierarchy of the previous element that needs to be closed
v_Open_Hierarchy          The hierarchy of the current element that needs to be opened
v_Is_Tag_Already_Parsed   A variable that is used to identify if this element is already processed (useful in identifying the elements that can occur more than once)
v_Output_Data             Holds the XML content generated
v_Parsed_Tags             Contains the list of parsed tags (separated by a pipe in this case)
v_Prev_Root               Root of the previous element
v_Prev_Ancestor           Ancestor of the previous element
v_Prev_Predecessor        Predecessor of the previous element
v_Prev_Forerunner         Forerunner of the previous element
v_Prev_Forebear           Forebear of the previous element
v_Prev_Antecedent         Antecedent of the previous element
v_Prev_Precursor          Precursor of the previous element
v_Complete_Row            Indicates if the row is complete – will be true when the EOR (End Of Record) indicator is encountered
o_Complete_Row            Passes out the value of v_Complete_Row
o_Output_Data             Passes out the value of v_Output_Data



The following picture demonstrates the complete mapping.

Figure 5: The Mapping

Mapping: Converts the delimited file to XML with the help of Code – Tag Resolver and Groups & Levels

- Read the input as a Flat File with only one column
- Use the Splitter Transformation to convert it to a vertical file
- Lookup lkp_Code_Tag_Resolver provides the hierarchical information
- Lookup Lkp_Groups_n_Levels supplements the information provided by the other lookup
- The Convert 2 XML expression puts them all together and converts the incoming elements to XML
- A Filter is used to filter out the data until EOR is reached
- Then, apply business rules and write to the target



Immediate Parents

This section explains how to deal with the immediate parents of the XML elements. We need the immediate parent information to determine whether to open, close or re-open a specific XML element. Typical code to identify the immediate parent may be as follows:

IF Precursor is NOT NULL
    Select Precursor
Else IF Antecedent is NOT NULL
    Select Antecedent
Else IF Forebear is NOT NULL
    Select Forebear
Else IF Forerunner is NOT NULL
    Select Forerunner
Else IF Predecessor is NOT NULL
    Select Predecessor
Else IF Ancestor is NOT NULL
    Select Ancestor
Else IF Root is NOT NULL
    Select Root
Else
    This is the top level element and doesn’t have any parent.
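The chain of IF statements amounts to picking the deepest level that is not NULL. A minimal Python sketch of that idea follows; the function name `immediate_parent` and the dictionary representation of an element's hierarchy are assumptions made for illustration.

```python
from typing import Optional

# Hierarchy levels ordered from the deepest to the topmost.
LEVELS_BOTTOM_UP = ("Precursor", "Antecedent", "Forebear", "Forerunner",
                    "Predecessor", "Ancestor", "Root")


def immediate_parent(hierarchy: dict[str, Optional[str]]) -> Optional[str]:
    """Return the deepest non-empty level, or None for a top level element."""
    for level in LEVELS_BOTTOM_UP:
        value = hierarchy.get(level)
        if value:  # None and "" both count as NULL
            return value
    return None


element03 = {"Root": "Root", "Ancestor": "Data_Topic1",
             "Predecessor": "Agg1", "Forerunner": "Agg2"}
parent = immediate_parent(element03)
```

For the ELEMENT03 row of the sample lookup, the immediate parent resolves to its Forerunner, `Agg2`.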

Handling the Hierarchies

This section explains how we generate the closing and opening tags that close the hierarchy of the previous element and open the hierarchy of the current element. The logic for closing the hierarchies is simple; it is just a combination of IF statements, as follows:

Generate Concatenation of '</' and Previous Precursor and '>' if any of the following conditions match:

- Current Precursor and Previous Precursor are NOT same
- Restart Group of current element is Yes

Similar code is applied to all the levels of the hierarchy (Root, Ancestor, …) in a Bottom-Up approach to find whether the corresponding closing tag has to be generated.



Opening the hierarchies for the XML elements is equally simple, except that a Top-Down approach is followed here. The following code fragment demonstrates the same:

. . . . . .

. . . . . .

Generate Concatenation of '<' and Antecedent and '>' when any of the following conditions is satisfied:

- Current Antecedent and Previous Antecedent are NOT same
- Restart Group is 'Y'

Concatenate the above Output to

Generate Concatenation of '<' and Precursor and '>' when the following condition is satisfied:

- Current Precursor and Previous Precursor are NOT same
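The bottom-up closing and top-down opening can be sketched together in Python. This is a simplified illustration, not the paper's expression code: instead of one IF per level, it compares the two hierarchy paths from the top and treats everything below the first difference as changed, and it leaves out the Restart Group condition entirely. The names `close_and_open` and `_path` are hypothetical.

```python
LEVELS = ("Root", "Ancestor", "Predecessor", "Forerunner",
          "Forebear", "Antecedent", "Precursor")


def _path(hierarchy: dict) -> list[str]:
    """Top-down list of the non-empty levels of one element."""
    return [hierarchy[lvl] for lvl in LEVELS if hierarchy.get(lvl)]


def close_and_open(prev: dict, cur: dict) -> tuple[str, str]:
    """Return (closing tags for prev, opening tags for cur)."""
    prev_path, cur_path = _path(prev), _path(cur)
    common = 0  # number of leading levels shared by both elements
    while (common < min(len(prev_path), len(cur_path))
           and prev_path[common] == cur_path[common]):
        common += 1
    # Close the previous hierarchy bottom-up, open the current one top-down.
    closing = "".join(f"</{tag}>" for tag in reversed(prev_path[common:]))
    opening = "".join(f"<{tag}>" for tag in cur_path[common:])
    return closing, opening


prev = {"Root": "ROOT", "Ancestor": "DATA_TOPIC1"}
cur = {"Root": "ROOT", "Ancestor": "DATA_TOPIC2",
       "Predecessor": "AGGREGATE1", "Forerunner": "AGGREGATE2"}
closing, opening = close_and_open(prev, cur)
```

With the sample record, moving from an element under DATA_TOPIC1 to the first ELEMENT08 closes one data topic and opens the other together with its two aggregates.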

Already done…?

During the XML generation, there are scenarios where an aggregate repeats. In these scenarios, it is not enough to generate the opening tag of the aggregate; we also need to generate the closing tag of its previous occurrence. To identify whether an XML aggregate has already been processed, we use a combination of two ports: one to store the names of all the aggregates that have been processed and another to flag whether the current aggregate has already been processed. The following code fragment does this:

Port: v_Is_Tag_Already_Parsed – Used to flag if we already processed an aggregate

IF Current element has Restart Group as 'Y'
    Add Current element to Parsed Tags List
Else
    Do Nothing

Port: v_Parsed_Tags – Stores the list of all the aggregates processed at least once

IF Current Element is '[EOR]' OR 'ELEMENT01' i.e. Starting Element,
    Empty Current List
Else
    Concatenate Current Element to Parsed Tags List

It is very important that the port v_Parsed_Tags is placed below the port containing the output data. This ensures that an aggregate is marked as processed only after it has actually been processed, not before.
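The check-first, mark-afterwards ordering can be sketched in Python as a small helper that keeps a pipe-separated list per record. The class `ParsedTags`, its method name and the choice of ELEMENT01 as the start element are assumptions made for illustration.

```python
START_ELEMENT = "ELEMENT01"  # assumption: first element of every record
END_OF_RECORD = "[EOR]"


class ParsedTags:
    """Pipe-separated list of restart elements seen in the current record."""

    def __init__(self) -> None:
        self.tags = "|"

    def seen_before(self, element: str, restart_group: str) -> bool:
        """Report whether element was already parsed, then record it."""
        if element in (START_ELEMENT, END_OF_RECORD):
            self.tags = "|"  # record boundary: empty the list
            return False
        already = f"|{element}|" in self.tags  # check first ...
        if restart_group == "Y" and not already:
            self.tags += f"{element}|"         # ... mark afterwards
        return already


tracker = ParsedTags()
stream = [("ELEMENT01", "N"), ("ELEMENT08", "Y"), ("ELEMENT09", "N"),
          ("ELEMENT08", "Y"), ("ELEMENT09", "N"), ("[EOR]", "N")]
flags = [tracker.seen_before(name, restart) for name, restart in stream]
```

Only the second ELEMENT08 is flagged as already parsed, which is the point at which the previous occurrence of its aggregate has to be closed.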

Generating the XML Output – Bringing them all together

All the sections so far have contributed bits and pieces of the final XML content. This section puts them together to achieve the end result. Because all the data we have framed so far is conditional, this port of the expression brings it together by validating several conditions. The port concatenates the results of several IF conditions with a NULL. The NULL ensures that the data is reset whenever appropriate. If none of the IF conditions matches, the whole XML output is set to NULL.

The first IF we process is that of the start element (ELEMENT01 in this case). If the input is NOT ELEMENT01, we concatenate the existing output data to NULL. This ensures that at the start of a record (i.e. when ELEMENT01 appears) the output is reset to NULL. In all other cases, the partial XML generated so far is preserved.
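The reset-and-accumulate behaviour of the output port, together with the filter on the complete-row flag, can be sketched as a Python generator. The function `assemble_rows` and the shape of its input (one name and one ready-made XML fragment per vertical row) are assumptions for illustration; producing the fragments themselves is covered by the cases that follow.

```python
from typing import Iterable, Iterator

START_ELEMENT = "ELEMENT01"  # start element, as in this paper's example
END_OF_RECORD = "[EOR]"


def assemble_rows(fragments: Iterable[tuple[str, str]]) -> Iterator[str]:
    """Yield one complete XML string per record.

    fragments is a stream of (element name, XML fragment) pairs,
    one per vertical row, in input order.
    """
    output = ""
    for element, fragment in fragments:
        if element == START_ELEMENT:
            output = ""     # start of record: reset the accumulated XML
        output += fragment  # otherwise the partial XML is preserved
        if element == END_OF_RECORD:
            yield output    # only complete rows pass, like the Filter


stream = [
    ("ELEMENT01", "<ROOT><A>1</A>"),
    ("ELEMENT02", "<B>2</B>"),
    ("[EOR]", "</ROOT>"),
    ("ELEMENT01", "<ROOT><A>3</A>"),
    ("[EOR]", "</ROOT>"),
]
rows = list(assemble_rows(stream))
```

Five vertical rows collapse into two output rows, one per input record.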

Five different cases of XML generation are identified for elements at the data topic level or below. The corresponding code is executed to ensure that all these cases are handled. The cases represent the different scenarios that determine the XML generation.

Case A:

Highlights:

- Closing Hierarchy is NULL

- Opening Hierarchy is NULL

- Current and Previous parents are exactly Same

- Restart Element is “Yes” for current element

Element Structure encountered:

Level_1
    Level_2 ← First Occurrence
        Element_1 ← Restart Element
        Element_2 ← Previous Element
    Level_2 ← Second Occurrence
        Element_1 ← Current Element – Restart Element
        Element_2 ← Element not yet encountered

Action:



We need to close the Immediate Parent and re-open it again.

Case B:

Highlights:

- Closing Hierarchy is NOT NULL


- Current and Previous parents are different

- Restart Element is “Yes” for current element


Level_1

Level_2 ← First Occurrence

Element_1 ← First Occur. First Element (Restart Element)

Level_3

Element_2


Level_2 ← Second Occurrence


Level_3

Element_2


Action:

We need to close the hierarchy (up till level applicable), Close the Immediate Parent and

re-open it again.

Case B.1:

Highlights:




- Restart Element is “No” for current element


Level_1

Level_2

Element_1 ← First Occur. First Element (Restart Element)

Level_3

Element_2




Element_4 ← Current Element – Not a Restart Element

Element_5

Action:

We need to close the hierarchy (up till level applicable).

Case C:

Highlights:

- Closing Hierarchy is NULL

- Opening Hierarchy is NOT NULL



Level_1

Level_2


Level_3

Element_2 ← Current Element

Element_3

Element_4

Action:

Open the hierarchy required

Case D:

Highlights:


- Opening Hierarchy is NOT NULL



Level_1

Level_2_1

Element_1

Level_3

Element_2


Level_2_2




Element_2

Action:

We need to close the hierarchy (up till level applicable) and open the hierarchy required

The code fragment in Appendix C details the implementation of all the above scenarios.



APPENDIX A

Splitter Transformation

With the Splitter EP/AEP in your mapping, you can read data from a source that contains a

variable number of input fields in each row. This reader capability is not available in the flat file

reader. You can use the Splitter to read a variable number of delimited fields in each input row in

one of the following modes:

- External Procedure (EP) mode

- Advanced External Procedure (AEP) mode

You can use the AEP mode to split data when you do not know the number of fields in an input

stream. For example, you have a dataset where the number of fields is not known. In HL7 Data

files, data can contain delimited fields. The following example shows the HL7 input data with “^” as the split character:

Input:

PID^JOHN^DOE^5101112222|6506506500

PID^DONALD^DUCK^5101112222|6506506500|4084084080

As illustrated in the input, the input stream for a field that you want to split may contain a variable

number of fields. Since neither the actual nor maximum number of output fields is known, the

AEP mode is used. When you use the AEP mode to split data, each field in the input stream is

sent as a row.



Output:

Row 1 → PID

Row 2 → JOHN

Row 3 → DOE

Row 4 → 5101112222|6506506500

Row 5 → PID

Row 6 → DONALD

Row 7 → DUCK

Row 8 → 5101112222|6506506500|4084084080
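The row-per-field behaviour shown in the Output listing can be imitated with a short Python generator. This is only an illustration of the idea, not the Splitter's actual implementation or interface; `split_rows` is a hypothetical name, and "^" is used as the split character because that is where the listing breaks the fields.

```python
from typing import Iterable, Iterator


def split_rows(rows: Iterable[str], split_char: str = "^") -> Iterator[str]:
    """Emit every delimited field of every input row as its own output row."""
    for row in rows:
        yield from row.split(split_char)


hl7_rows = [
    "PID^JOHN^DOE^5101112222|6506506500",
    "PID^DONALD^DUCK^5101112222|6506506500|4084084080",
]
output_rows = list(split_rows(hl7_rows))
```

Because the generator never needs to know how many fields a row holds, the same code handles both input rows even though their last fields differ in length.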

NOTE: Information in this section is an extract, provided as is, from Informatica’s Splitter AEP Transformation article. Please visit http://devnet.informatica.com for complete details.

Figure 6: Splitter Properties



APPENDIX B

Implementation

This was implemented for a customer who is a leading provider of analytical business information in the United States. The process currently handles daily data coming from approximately 100 countries and runs on an Informatica PowerCenter 7.1.1 server hosted on a 6-CPU, 24 GB HP-UX server. It currently delivers a throughput of 250K records per hour and processes an average of 50K – 100K incremental records daily.



APPENDIX C

Generating the XML output – Code Sample

The following code fragment details the scenarios explained in the section Generating the XML Output – Bringing them all together.



About Curosys Technologies

Curosys provides comprehensive IT solutions and services (including systems integration, IS outsourcing, package implementation, software application development and maintenance) and Research & Development services (hardware and software design, development and implementation) to corporations.
