    Standardize your data using InfoSphere QualityStage

    Skill Level: Intermediate

    Dhanunjaya Lokireddy ([email protected]), Senior QA Engineer, IBM

    11 Aug 2011

    Data standardization is a process that ensures that data conforms to quality rules. This tutorial introduces data standardization concepts and demonstrates how you can achieve standardized data using IBM InfoSphere QualityStage. A reader who is new to QualityStage standardization will get a basic understanding of the process. Readers should have basic knowledge of InfoSphere DataStage job development. This tutorial covers standardization using country identifier, domain pre-processor, domain-specific, and validation types of rule sets.

    Section 1. Before you start

    Editor's note: All personal data appearing in this tutorial is fictitious and was created for sample purposes only.

    InfoSphere QualityStage overview

    Enterprises often face issues with data arising out of a lack of standards. Data may be entered in inconsistent ways across different systems, causing records to appear different even though they are actually the same. For example, the following two records describe the same person at the same address, even though the name and address appear to be quite different:

    Bob Christiansan 614 Columbus Ave #3, Boston, Massachusetts 02116

    Standardize your data using InfoSphere QualityStage Trademarks Copyright IBM Corporation 2011. All rights reserved. Page 1 of 33

    R.J. Christensen 614 Columbus Suite #3, Suffolk County 02116

    Another common error leading to "data surprises" is misplaced data. Here is an example where several of the fields contain the wrong type of information. The name field contains address information, the tax ID field contains telephone numbers, and the telephone field contains city name information. This misplacement of data often leads to application errors.

    Name                     Tax ID         Telephone
    Becker & Co. C/O Bill    025-37-1998    415-392-2770
    B Smith DBA Lime Cons.   228-02-1695    6173380220
    1st Natl Provident       34-2671854     3309321
    HP 15 State St.          508-466-1550   Orlando

    A third kind of common data standardization problem involves the lack of consistent identifiers. The following example has three records containing a product description. They look different, but they actually describe the same product, because no consistent identifier is used.

    91-84-301 RS232 Cable 5' M-F CandS

    CS-89641 5 ft. Cable Male-F, RS232 #87951

    C&SUCH6 Male/Female 25 PIN 5 Foot Cable

    InfoSphere QualityStage (hereafter called QualityStage), a component product of InfoSphere Information Server, helps identify and resolve the issues described above and provides a way to maintain an accurate view of master data entities. QualityStage has the following capabilities:

    Investigation: Helps you understand the nature and scope of data anomalies

    Standardization: Parses individual fields and makes them uniform according to business standards

    Matching: Identifies duplicate records within and across data sources

    Survivorship: Helps eliminate duplicate records and create the best-of-breed record of data

    Understanding the standardization process

    Standardization parses or separates free-form fields into single component fields or assigns data to its appropriate metadata fields in a standard format.

    Data is frequently captured with variations resulting from:

    Data entry errors

    Different conventions for representing the same data value

    Semantic differences across systems

    Multiple sources for the same data element

    Lack of data quality standards

    But the target systems require cleansed data for reporting and decision-making. Standardization helps improve the addressability of data stored in free-form columns and ensures that each data element has relevant content and format. It normalizes data values to standard forms and prepares data elements for more effective matching. It also helps in identifying and removing invalid data values. Standardization is important because it prepares the data for further processing.

    Standardization works based on special instructions called rule sets. Some rule sets are:

    Country identifier, such as COUNTRY

    Domain pre-processor, such as USPREP

    Domain-specific, such as USNAME

    Validation, such as VDATE

    Most of the packaged rule sets are country-specific. For example, there are different name standardization rule sets for the United States and Japan. As of InfoSphere Information Server V8.5, these rule sets are packaged with QualityStage. Advanced users can create rule sets based on their business requirements.

    Rule sets have three required components:

    Classification Table: Contains the keywords, standard value, and user-defined class

    Dictionary File: Defines the layout of the output columns

    Pattern-Action File: Contains the logic to populate output columns and parsing parameters

    Figure 1. Standardization process overview

    Figure 1 shows an overview of the standardization process:

    1. Parses input data using the pattern-action file parameters (SEPLIST/STRIPLIST)

    2. Assigns user-defined classes from the classification table and applies default classes to the remaining tokens

    3. Forms output fields using the dictionary file

    4. Populates data into the output fields using the pattern-action file
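
    The four steps above can be sketched in a few lines of Python. This is only a toy model of the idea, not QualityStage's implementation: the SEPLIST and STRIPLIST values, the classification entries, the class symbols, the dictionary columns, and the single pattern rule are all hypothetical and chosen for illustration.

```python
# Hypothetical parsing parameters, loosely modeled on SEPLIST/STRIPLIST.
SEPLIST = " ,#"    # characters that separate tokens
STRIPLIST = " ,"   # separators that are dropped instead of kept as tokens

# Hypothetical classification table: keyword -> (standard value, class).
CLASSIFICATION = {
    "ST": ("STREET", "T"),
    "STREET": ("STREET", "T"),
    "AVE": ("AVENUE", "T"),
}

# Hypothetical dictionary: the layout of the output columns.
DICTIONARY = ["HouseNumber", "StreetName", "StreetSuffixType"]


def parse(text):
    """Step 1: split on SEPLIST characters, dropping those in STRIPLIST."""
    tokens, current = [], ""
    for ch in text.upper():
        if ch in SEPLIST:
            if current:
                tokens.append(current)
                current = ""
            if ch not in STRIPLIST:
                tokens.append(ch)  # a separator that is kept as its own token
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def classify(token):
    """Step 2: user-defined class from the table, else a default class."""
    if token in CLASSIFICATION:
        return CLASSIFICATION[token]
    if token.isdigit():
        return token, "^"  # illustrative default class for numbers
    if token.isalpha():
        return token, "+"  # illustrative default class for plain words
    return token, "@"      # illustrative default class for anything else


def standardize(text):
    """Steps 3 and 4: build the output columns and fill them from a pattern."""
    out = dict.fromkeys(DICTIONARY, "")
    classified = [classify(t) for t in parse(text)]
    pattern = " ".join(cls for _, cls in classified)
    values = [value for value, _ in classified]
    # One hypothetical pattern-action rule: number, word, street type.
    if pattern.startswith("^ + T"):
        out["HouseNumber"], out["StreetName"], out["StreetSuffixType"] = values[:3]
    return pattern, out


print(standardize("614 Columbus Ave #3"))
```

    A real rule set carries many more classes and pattern rules; the point here is only the order of the steps: parse, classify, lay out the output columns, then populate them.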

    The remaining sections of the tutorial contain detailed steps for creating standardize jobs using different types of rule sets, with examples.

    Section 2. Implementing the country identifier rule set

    The country identifier rule set helps to identify the country using the given data. For example, take the following data:

    Listing 1. Data records for country identifier example

    Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000
    Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ
    Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1
    Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42

    The data contains records that belong to various countries. The steps below show how to use QualityStage to identify the country for each record.

    Step 1: Create a parallel job

    Create a parallel job as shown in Figure 2. Configure the input sequential file stage to read the input file, which contains the example records listed above.

    Figure 2. Parallel job with sequential and standardize stages

    Figure 3 shows the designer palette where the standardize stage is selected.

    Figure 3. Designer palette showing standardize stage

    Figure 4 shows the input sequential file with the data from the listing above.

    Figure 4. Input sequential file view data

    Step 2: Configure the standardize stage

    1. Create a new process. Use the New Process button in the toolbar.

    Figure 5. Standardize stage properties

    The next screen is the standardize new rule process window, with the available columns listed.

    Figure 6. Standardize new rule process window

    2. For the listed Data column, which comes from the input sequential file metadata, select Rule Sets > Other > COUNTRY.

    Figure 7. Rule set selection

    3. Click the > button to move the Data column to the Selected column area.

    Figure 8. Standardize rule process window with selected rule set and columns

    4. Add a metadata delimiter. The metadata delimiter plays an important role in this type of rule set. The delimiter is used to set the default country code. If the country rule set can't determine the country based on the information provided, it defaults to the delimiter value. The format of the metadata delimiter is ZQ, followed by the country code, followed by ZQ. In this example, we are setting US as the default country. Enter ZQUSZQ in the Literal field.

    Figure 9. Standardize rule process window with metadata delimiter entered

    5. Click the > button beside the Literal field.

    Figure 10. Using literal to set the country code

    6. Use the Move Up and Move Down buttons to arrange the metadata delimiter in the following way:

    ZQUSZQ
    Data

    Click OK to add the process.

    Figure 11. Standardize rule process window with the metadata delimiter arranged in order

    Figure 12. Standardize stage properties window with created rule process

    7. Map the output columns (Stage Properties > Output > Mapping). The standardize stage produces columns based on the rule set selected. The following columns were selected in this example: ISOCountryCode_COUNTRY and IdentifierFlag_COUNTRY, along with the "Data" input field. Drag and drop the columns listed above to the output.

    Figure 13. Standardize stage output column mapping

    Step 3: Configure the output file and run the job

    Configure the output sequential file stage by supplying the required fields, such as the file name, and other settings, such as the format, as required. Run the job and verify the output. Here is the output produced:

    Figure 14. Output sequential file view data

    Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000
    The country code for this record is identified as AU (ISOCountryCode_COUNTRY).
    The country code is identified based on the data only (IdentifierFlag_COUNTRY).

    Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ
    The country code for this record is identified as GB (ISOCountryCode_COUNTRY).
    The country code is identified based on the data only (IdentifierFlag_COUNTRY).

    Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1
    The country code for this record is identified as CA (ISOCountryCode_COUNTRY).
    The country code is identified based on the data only (IdentifierFlag_COUNTRY).

    Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42
    The country code for this record is identified as US (ISOCountryCode_COUNTRY).
    The country could not be identified from the data, so the default country code from the metadata delimiter (US) was used (IdentifierFlag_COUNTRY).
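
    The fallback behavior of the metadata delimiter can be illustrated with a small Python sketch. The postal-code patterns below are rough, hypothetical heuristics and not the logic of the COUNTRY rule set, and the "Y"/"N" flag values are an assumption made for this example. Only the ZQUSZQ delimiter and the four sample records come from the tutorial.

```python
import re

# Toy heuristics (assumptions): postal-code shapes that hint at a country.
COUNTRY_PATTERNS = [
    ("CA", re.compile(r"\b[A-Z]\d[A-Z]\s?\d[A-Z]\d\b")),
    ("GB", re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b")),
    ("AU", re.compile(r"\b(?:NSW|VIC|QLD|SA|WA|TAS|NT|ACT)\s+\d{4}\b")),
]

# Metadata delimiter: ZQ + two-letter country code + ZQ at the start.
DELIMITER = re.compile(r"^ZQ([A-Z]{2})ZQ\s*")


def identify_country(text):
    """Return (country code, flag); flag is "Y" when the data decided it."""
    m = DELIMITER.match(text)
    default = m.group(1) if m else None
    data = text[m.end():] if m else text
    for code, pattern in COUNTRY_PATTERNS:
        if pattern.search(data):
            return code, "Y"
    # Nothing in the data identified the country: fall back to the delimiter.
    return default, "N"


records = [
    "Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000",
    "Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ",
    "Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1",
    "Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42",
]
for record in records:
    # The delimiter is placed ahead of the data, as in step 6.
    print(identify_country("ZQUSZQ " + record), record)
```

    Only the last record falls through to the default, which mirrors the output described above.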

    Section 3. Implementing the domain pre-processor

    The domain pre-processor will identify different domains (like name, address, and area) from the given data and populate them to the correct fields. Let's take the following data:

    "52280A NC 42 72 HWY # 42","KNOXVILLE TN 37920","Dr Jeffery David Thomson Jnr PHD"
    "International Business Machines Corp","1480 CARRIAGE LN APT 301","AUBURN IN 467069555"
    "Peter heines","ASHVILLE NEW YORK 147109762","930 SOUTH BROAD ST EAST APT H"

    It has three fields: Field1, Field2, and Field3 (see Figure 16). But the data is scattered across all three fields. For example, the name is in Field3 in the first record, but in Field1 in the second and third records. We will create a standardize job using the pre-processor rule set to identify the different domains.

    Step 1: Create a parallel job

    Create a parallel job as shown in Figure 15. Configure the input sequential file stage to read the input file, which contains the example records listed above.

    Figure 15. Parallel job with sequential and standardize stages

    Figure 16. Input sequential file view data

    Step 2: Configure the standardize stage

    1. Create a new process.

    Figure 17. Standardize stage properties

    Figure 18. Standardize new rule process window

    2. Select the USPREP rule set (Standardization Rules > USA > USPREP > USPREP) for the available columns Field1, Field2, and Field3, which come from the input sequential file metadata.

    Figure 19. Rule set selection

    3. Click the > button for the three fields to move them to the selected column area.

    Figure 20. Standardize rule process window with selected rule set and columns

    4. Add metadata delimiters. Metadata delimiters are used to convey what kind of information we expect in each of the input fields. If the pre-processor cannot determine the domain of a token, the token defaults to the domain specified through the metadata delimiter. The format of the metadata delimiter is ZQ, followed by the domain name, followed by ZQ. In this example, we anticipate that Field1 contains Name data, Field2 contains Address data, and Field3 contains Area data. Add three delimiters: ZQNAMEZQ, ZQADDRZQ, and ZQAREAZQ. Enter ZQNAMEZQ in the Literal field.

    Figure 21. Standardize rule process window with metadata delimiter entered

    5. Click the > button.

    Figure 22. Standardize rule process window with metadata delimiter selected

    6. Repeat steps 4 and 5 to add delimiters ZQADDRZQ and ZQAREAZQ.

    Figure 23. Standardize rule process window with all metadata delimiters selected

    7. Use the Move Up and Move Down buttons to arrange the metadata delimiters in the following way:

    ZQNAMEZQ
    Field1
    ZQADDRZQ
    Field2
    ZQAREAZQ
    Field3

    Click OK to add the process.

    Figure 24. Standardize rule process window with all metadata delimiters arranged in order

    Figure 25. Standardize stage properties window with created rule process

    8. Map the output columns (Stage Properties > Output > Mapping). The standardize stage produces columns based on the rule set selected. The following columns were selected in this example: NameDomain_USPREP, AddressDomain_USPREP, and AreaDomain_USPREP. Drag and drop the columns listed above to the output.

    Figure 26. Standardize stage output column mapping

    Step 3: Configure the output file and run the job

    Configure the output sequential file stage by supplying the required fields, such as the file name, and other settings, such as the format, as required. Run the job and verify the output. Figure 27 shows the output produced.

    Figure 27. Output sequential file view data

    "International Business Machines Corp","1480 CARRIAGE LN APT 301","AUBURN IN 467069555"
    "International Business Machines Corp" is identified as name domain (NameDomain)
    "1480 CARRIAGE LN APT 301" is address domain (AddressDomain)
    "AUBURN IN 467069555" is area domain (AreaDomain)

    "52280A NC 42 72 HWY # 42","KNOXVILLE TN 37920","Dr Jeffery David Thomson Jnr PHD"
    "Dr Jeffery David Thomson Jnr PHD" is identified as name domain (NameDomain)
    "52280A NC 42 72 HWY # 42" is address domain (AddressDomain)
    "KNOXVILLE TN 37920" is area domain (AreaDomain)

    "Peter heines","ASHVILLE NEW YORK 147109762","930 SOUTH BROAD ST EAST APT H"
    "Peter heines" is identified as name domain (NameDomain)
    "930 SOUTH BROAD ST EAST APT H" is address domain (AddressDomain)
    "ASHVILLE NEW YORK 147109762" is area domain (AreaDomain)

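    To make the role of the metadata delimiters concrete, here is a simplified Python sketch. It is an assumption-heavy toy: the real pre-processor works token by token with its own classification table, whereas this version assigns a whole field to one domain using three hypothetical heuristics and falls back to the domain named by the preceding delimiter. The keyword list and the patterns are invented for illustration.

```python
import re

# Hypothetical keywords that suggest a name (titles, organization suffixes).
NAME_KEYWORDS = {"DR", "MR", "MRS", "MS", "CORP", "INC", "JNR", "PHD"}

# Hypothetical area heuristic: starts with a non-digit, ends with a ZIP code.
AREA_PATTERN = re.compile(r"^\D.*\b\d{5}(?:\d{4})?$")


def guess_domain(value, default):
    """Pick NAME, ADDR, or AREA for one field; fall back to the default."""
    tokens = value.upper().split()
    if NAME_KEYWORDS.intersection(tokens):
        return "NAME"
    if tokens and tokens[0][0].isdigit():
        return "ADDR"  # a leading house number suggests an address
    if AREA_PATTERN.match(value):
        return "AREA"
    return default     # undetermined: use the metadata delimiter's domain


def preprocess(arranged):
    """arranged alternates delimiters and fields, in the order set in step 7."""
    out = {"NAME": [], "ADDR": [], "AREA": []}
    default = None
    for item in arranged:
        m = re.fullmatch(r"ZQ([A-Z]+)ZQ", item)
        if m:
            default = m.group(1)  # the delimiter sets the default domain
        else:
            out[guess_domain(item, default)].append(item)
    return {domain: " ".join(values) for domain, values in out.items()}


record = [
    "ZQNAMEZQ", "52280A NC 42 72 HWY # 42",
    "ZQADDRZQ", "KNOXVILLE TN 37920",
    "ZQAREAZQ", "Dr Jeffery David Thomson Jnr PHD",
]
print(preprocess(record))
```

    In the third sample record, "Peter heines" matches none of the heuristics, so it lands in the name domain only because Field1 is preceded by ZQNAMEZQ. That is the defaulting behavior step 4 describes.
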
    Section 4. Implementing name standardization

    This is the domain-specific type of standardization. Let's take the following name examples.

    Dr Jeffery David Thomson Jnr PHD
    International Business Machines Corp
    Peter heines

    These examples contain individual and organization names, and we assume that they belong to the US. Our intention here is to identify the different parts of the name, such as the primary name, first name, and last name.

    Step 1: Create a parallel job

    Create a parallel job as shown in Figure 28. Configure the input sequential file stage to read the input file that contains the above example records.

    Figure 28. Parallel job with sequential and standardize stages

    Figure 29. Input sequential file view data

    Step 2: Configure the standardize stage

    1. Create a new process.

    Figure 30. Standardize stage properties

    Figure 31. Standardize new rule process window

    2. Select the USNAME rule set (Standardization Rules > USA > USNAME > USNAME) for the column "name," which comes from the input sequential file metadata.

    Figure 32. Rule set selection

    3. Click the > button.

    Figure 33. Standardize rule process window with rule set selected

    4. Do not add the "Optional NAMES Handling" option. The Optional NAMES Handling field has the following options:

    Process All as Individual: All columns are standardized as individual names.

    Process All as Organization: All columns are standardized as organization names.

    Process Undefined as Individual: All unhandled columns are standardized as individual names.

    Process Undefined as Organization: All unhandled columns are standardized as organization names.

    This option is useful if we know the types of names in the input file. For example, if the file mainly contains organization names, specifying Process All as Organization enhances performance by eliminating the processing steps of determining the name's type.

    5. Click OK.

    Figure 34. Standardize rule process window with selected rule set and columns

    Figure 35. Standardize stage properties window with created rule process

    6. Map the output columns (Stage Properties > Output > Mapping). The standardize stage produces columns based on the rule set selected. In this example, the following columns were selected: NameType_USNAME, GenderCode_USNAME, NamePrefix_USNAME, FirstName_USNAME, MiddleName_USNAME, PrimaryName_USNAME, NameGeneration_USNAME, and NameSuffix_USNAME. Drag and drop the above columns to the output.

    Figure 36. Standardize stage output column mapping

    Step 3: Configure the output file and run the job

    Configure the output sequential file stage by supplying the required fields, such as the file name, and other settings, such as the format, as required. Run the job and verify the output. Figure 37 shows the output produced.

    Figure 37. Output sequential file view data

    Dr Jeffery David Thomson Jnr PHD
    The data is identified as an individual name (NameType).
    Gender is male (GenderCode).
    Dr is the name prefix (NamePrefix).
    Jeffery is the first name (FirstName).
    David is the middle name (MiddleName).
    Thomson is the primary name (PrimaryName).
    Jr is identified as the generation (NameGeneration). The actual input contains Jnr, but the standardize stage gave the commonly used standard format.
    PHD is the name suffix (NameSuffix).

    International Business Machines Corp
    The data is identified as an organization name (NameType).
    International Business Machines is the primary name (PrimaryName).
    Corp is the name suffix (NameSuffix).

    Peter heines
    The data is identified as an individual name (NameType).
    Gender is male (GenderCode).
    Peter is the first name (FirstName).
    Heines is the primary name (PrimaryName).
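
    The way a name is split into these output columns can be mimicked with a short Python sketch. This is not the USNAME rule set: the lookup tables, the "I"/"O" name type codes, the upper-casing, and the positional logic are all simplifying assumptions made for illustration. Only the output column names are taken from the mapping step above.

```python
# Hypothetical lookup tables standing in for a classification table.
PREFIXES = {"DR", "MR", "MRS", "MS"}
GENERATIONS = {"JNR": "JR", "JR": "JR", "SNR": "SR", "SR": "SR"}
SUFFIXES = {"PHD", "MD"}
ORG_SUFFIXES = {"CORP", "INC", "LTD"}
GENDER = {"JEFFERY": "M", "PETER": "M"}

COLUMNS = ["NameType", "GenderCode", "NamePrefix", "FirstName",
           "MiddleName", "PrimaryName", "NameGeneration", "NameSuffix"]


def standardize_name(name):
    """Split a free-form name into columns (toy positional logic)."""
    tokens = name.upper().split()
    out = dict.fromkeys(COLUMNS, "")
    out["NameType"] = "I"  # assumed code for an individual
    if tokens and tokens[-1] in ORG_SUFFIXES:
        out["NameType"] = "O"  # assumed code for an organization
        out["NameSuffix"] = tokens[-1]
        out["PrimaryName"] = " ".join(tokens[:-1])
        return out
    if tokens and tokens[0] in PREFIXES:
        out["NamePrefix"] = tokens.pop(0)
    if tokens and tokens[-1] in SUFFIXES:
        out["NameSuffix"] = tokens.pop()
    if tokens and tokens[-1] in GENERATIONS:
        # Map variants such as JNR to one standard form.
        out["NameGeneration"] = GENERATIONS[tokens.pop()]
    if tokens:
        out["PrimaryName"] = tokens.pop()
    if tokens:
        out["FirstName"] = tokens.pop(0)
        out["GenderCode"] = GENDER.get(out["FirstName"], "")
    out["MiddleName"] = " ".join(tokens)
    return out


for sample in ["Dr Jeffery David Thomson Jnr PHD",
               "International Business Machines Corp",
               "Peter heines"]:
    print(standardize_name(sample))
```

    Mapping JNR to JR through a lookup table is the same kind of normalization the stage applied to the generation value in the first record.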

    Section 5. Implementing validation

    This type of rule set is mainly used to validate the data (VDATE and VEMAIL, for example). Let's take the following date examples:

    OCT021983
    09211991
    02/29/2011

    These are some of the acceptable input formats. The standardization job verifies whether each date is valid. If it is, the job sets the valid flag and produces the output in the standard format CCYYMMDD; otherwise, it sets an invalid reason code.

    Step 1: Create the parallel job

    Create a parallel job as shown in Figure 38. Configure the input sequential file stage to read the input file, which contains the above example records.

    Figure 38. Parallel job with sequential and standardize stages

    Figure 39. Input sequential file view data

    Step 2: Configure the standardize stage

    1. Create a new process.

    Figure 40. Standardize stage properties

    Figure 41. Standardize new rule process window

    2. Select the VDATE rule set (Standardization Rules > Other > VDATE) for the column "Date," which comes from the input sequential file metadata.

    Figure 42. Rule set selection

    3. Click the > button.

    Figure 43. Standardize rule process window with rule set selected

    4. Click OK.

    Figure 44. Standardize rule process window with selected rule set and columns

    Figure 45. Standardize stage properties window with created rule process

    5. Map the output columns (Stage Properties > Output > Mapping). The standardize stage produces columns based on the rule set selected. In this example, the following columns were selected: ValidFlag_VDATE, DateCCYYMMDD_VDATE, and InvalidReason_VDATE, along with the input column "Date." Drag and drop the above columns to the output.

    Figure 46. Standardize stage output column mapping

    Step 3: Configure the output file and run the job

    Configure the output sequential file stage by supplying the required fields, such as the file name, and other settings, such as the format, as required. Run the job and verify the output. Here is the output produced:

    Figure 47. Output sequential file view data

    OCT021983
    Valid date (ValidFlag_VDATE)
    19831002 is the standard format (DateCCYYMMDD_VDATE)

    09211991
    Valid date (ValidFlag_VDATE)
    19910921 is the standard format (DateCCYYMMDD_VDATE)

    02/29/2011
    Invalid date (ValidFlag_VDATE)
    The reason is that February 29 does not exist in 2011, which is not a leap year (InvalidReason_VDATE)
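
    The validation logic can be approximated in Python as follows. This is a sketch under stated assumptions, not the VDATE rule set: it accepts only the three input shapes shown in this section, and the reason strings (INVALID_FORMAT, INVALID_MONTH, INVALID_DAY) are hypothetical names rather than the codes the rule set actually emits.

```python
import datetime
import re

MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

# The three input shapes used in this section: MONDDCCYY, MMDDCCYY, MM/DD/CCYY.
FORMATS = [
    re.compile(r"^(?P<mon>[A-Za-z]{3})(?P<day>\d{2})(?P<year>\d{4})$"),
    re.compile(r"^(?P<mon>\d{2})(?P<day>\d{2})(?P<year>\d{4})$"),
    re.compile(r"^(?P<mon>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})$"),
]


def validate_date(text):
    """Return (valid flag, CCYYMMDD string, hypothetical reason string)."""
    for fmt in FORMATS:
        m = fmt.match(text.strip())
        if m:
            break
    else:
        return False, "", "INVALID_FORMAT"
    mon = m.group("mon").upper()
    month = MONTHS.get(mon) if mon.isalpha() else int(mon)
    if month is None or not 1 <= month <= 12:
        return False, "", "INVALID_MONTH"
    try:
        # datetime.date rejects impossible days, such as February 29
        # in a year that is not a leap year.
        date = datetime.date(int(m.group("year")), month, int(m.group("day")))
    except ValueError:
        return False, "", "INVALID_DAY"
    return True, f"{date.year:04d}{date.month:02d}{date.day:02d}", ""


for sample in ["OCT021983", "09211991", "02/29/2011"]:
    print(sample, validate_date(sample))
```

    Delegating the calendar check to datetime.date keeps the leap-year rule out of the sketch itself.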

    Section 6. Conclusion

    In this tutorial, you have learned what the standardization process is and how it can be achieved by using InfoSphere QualityStage. You have also learned about standardization using different types of rule sets, such as country identifier, domain pre-processor, domain-specific, and validation.

    About the author

    Dhanunjaya Lokireddy is a Senior QA Engineer working for the InfoSphere QualityStage team at IBM India Software Lab, Hyderabad. He has six years of experience at IBM, working for different QA teams in the Information Server product area.