Standardize your data using InfoSphere QualityStage
Data standardization is a process that ensures that data
conforms to quality rules. This tutorial introduces data
standardization concepts and demonstrates how you can achieve
standardized data using IBM InfoSphere
QualityStage. A reader who is new to QualityStage
standardization will get a basic understanding of the process.
Readers
should have basic knowledge of InfoSphere DataStage job
development. This tutorial covers standardization using country
identifier, domain pre-processor, domain-specific and validation
types of rule sets.
Dhanunjaya Lokireddy is a Senior QA Engineer working for the
InfoSphere QualityStage team at IBM India Software Lab, Hyderabad.
He has six years of experience in IBM working for different QA teams
in the Information Server product area.
11 August 2011
Before you start
Editor's note: All personal data appearing in this tutorial is fictitious and was
created for sample purposes only.
InfoSphere QualityStage overview
Enterprises often face issues with data arising out of lack of
standards. Data
may be entered in inconsistent ways across different systems,
causing
records to appear different even though they are actually the
same. For
example, the following two records describe the same person at
the same address, even though the name
and address appear to be quite different:
Bob Christiansan 614 Columbus Ave #3, Boston, Massachusetts 02116
R.J. Christensen 614 Columbus Suite #3, Suffolk County 02116
Another common error leading to "data surprises" is that data
can be misplaced. Here is an example where
several of the fields contain the wrong type of information. The
name field contains address information, the
tax ID field contains telephone numbers, and the telephone field
contains city name information. This
misplacement of data often leads to application errors.
Name                      Tax ID         Telephone
Becker & Co. C/O Bill     025-37-1998    415-392-2770
B Smith DBA Lime Cons.    228-02-1695    6173380220
1st Natl Provident        34-2671854     3309321
HP 15 State St.           508-466-1550   Orlando
A third kind of common data standardization problem involves the
lack of consistent identifiers. The
following example has three records containing a product
description. They look different because they lack consistent
identifiers, but they actually describe the same product.
91-84-301 RS232 Cable 5' M-F CandS
CS-89641 5 ft. Cable Male-F, RS232 #87951
C&SUCH6 Male/Female 25 PIN 5 Foot Cable
InfoSphere QualityStage (hereafter called QualityStage), a
component product of InfoSphere Information
Server, helps identify and resolve the issues described above
and provides a way to maintain an accurate
view of master data entities. QualityStage has the following
capabilities:
Investigation: Helps you understand the nature and scope of data anomalies
Standardization: Parses individual fields and makes them uniform according to business standards
Matching: Identifies duplicate records within and across data sources
Survivorship: Helps eliminate duplicate records and create the best-of-breed record of data
Understanding the standardization process
Standardization parses or separates free-form fields into single
component fields or assigns data to its
appropriate metadata fields in a standard format.
Data is frequently captured with variations resulting from:
Data entry errors
Different conventions for representing the same data value
Semantic differences across systems
Multiple sources for the same data element
Lack of data quality standards
But the target systems require cleansed data for reporting and
decision-making. Standardization helps
improve the addressability of data stored in free-form columns
and ensures that each data element has
relevant content and format. It normalizes data values to
standard forms and prepares data elements for
more effective matching. It also helps in identifying and
removing invalid data values. Standardization is
important because it prepares the data for further
processing.
Standardization works based on special instructions called rule
sets. Some rule sets are:
Country identifier, such as COUNTRY
Domain pre-processor, such as USPREP
Domain-specific, such as USNAME
Validation, such as VDATE
Most of the packaged rule sets are country-specific. For
example, there are different name standardization
rule sets for the United States and Japan. As of InfoSphere
Information Server V8.5, these rule sets are
packaged with QualityStage. Advanced users can create rule sets
based on their business requirements.
Rule sets have three required components:
Classification Table: Contains the keywords, standard value, and user-defined class
Dictionary File: Defines the layout of the output columns
Pattern-Action File: Contains the logic to populate output columns and parsing parameters
Figure 1. Standardization process overview
Figure 1 shows an overview of the standardization process:
1. Parses input data using the pattern-action file (SEPLIST/STRIPLIST) parameters
2. Assigns user-defined classes from the classification table and applies default classes to the remaining tokens
3. Forms output fields using the dictionary file
4. Populates the output fields using the pattern-action file
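The four steps above can be pictured with a small Python sketch. It is only an illustration of the flow, not QualityStage code: the SEPLIST and STRIPLIST values, the classification entries, the dictionary columns, the class labels (^, +, @, T), and the single pattern are all hypothetical stand-ins chosen for this example.

```python
import re

# Hypothetical stand-ins for the three rule set components.
SEPLIST = " ,"            # characters that split the input into tokens
STRIPLIST = {" ", ","}    # separator tokens that are dropped after splitting

# Classification table: keyword -> (standard value, user-defined class)
CLASSIFICATION = {
    "ST": ("ST", "T"),
    "STREET": ("ST", "T"),
    "AVE": ("AVE", "T"),
    "AVENUE": ("AVE", "T"),
}

# Dictionary file: the layout of the output columns
DICTIONARY = ["HouseNumber", "StreetName", "StreetType"]


def parse(text):
    """Step 1: split on SEPLIST characters, then drop STRIPLIST tokens."""
    pattern = "([" + re.escape(SEPLIST) + "])"
    tokens = [t for t in re.split(pattern, text.upper()) if t]
    return [t for t in tokens if t not in STRIPLIST]


def classify(tokens):
    """Step 2: use the classification table, else an illustrative default class."""
    values, classes = [], []
    for token in tokens:
        if token in CLASSIFICATION:
            value, cls = CLASSIFICATION[token]
        elif token.isdigit():
            value, cls = token, "^"   # numeric token
        elif token.isalpha():
            value, cls = token, "+"   # unclassified word
        else:
            value, cls = token, "@"   # mixed token
        values.append(value)
        classes.append(cls)
    return values, classes


def apply_pattern_action(values, classes):
    """Steps 3 and 4: build the output layout and fill it when a pattern matches."""
    record = dict.fromkeys(DICTIONARY, "")
    if classes == ["^", "+", "T"]:    # the only pattern in this sketch
        record["HouseNumber"], record["StreetName"], record["StreetType"] = values
    return record


def standardize(text):
    return apply_pattern_action(*classify(parse(text)))


print(standardize("614 Columbus Avenue"))
```

Input that matches no pattern simply leaves the output columns empty in this sketch; a real rule set contains many patterns and actions.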
The remaining sections of the tutorial contain detailed steps to
create standardize jobs using different types
of rule sets, with examples.
Implementing the country identifier rule set
The country identifier rule set helps to identify the country
using the given data. For example, take the
following data:
Listing 1. Data records for country identifier example
Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000
Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ
Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1
Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42
The data contains records that belong to various countries. The
steps below show how to use QualityStage to
identify the country for each record.
Step 1: Create a parallel job
Create a parallel job as shown in Figure 2. Configure the input
sequential file stage to read the input file,
which contains the example records listed above.
Figure 2. Parallel job with sequential and standardize stages
Figure 3 shows the designer palette where the standardize stage
is selected.
Figure 3. Designer palette showing standardize stage
Figure 4 shows the input sequential file with the data from the
listing above.
Figure 4. Input sequential file view data
Step 2: Configure the standardize stage
1. Create a new process. Use the New Process button in the
toolbar.
Figure 5. Standardize stage properties
The next screen is the standardize new rule process window, with
the available columns listed.
Figure 6. Standardize new rule process window
2. For the listed data column, which is the input sequential
file metadata, select Rule Sets > Other >
COUNTRY.
Figure 7. Rule set selection
3. Click the > button to move the Data column to the Selected
column area.
Figure 8. Standardize rule process window with selected rule set
and columns
4. Add a metadata delimiter. The metadata delimiter plays an important
role in this type of rule set: it sets the default country code.
If the country rule set
can't determine the country based on the
information provided, it defaults to the delimiter value. The
format of the metadata delimiter is ZQ, followed by the default
value, followed by ZQ. In this example, we are setting US as the
default country. Enter ZQUSZQ in the
Literal field.
Figure 9. Standardize rule process window with metadata
delimiter entered
5. Click the > button beside the Literal field.
Figure 10. Using literal to set the country code
6. Use the Move Up and Move Down buttons to arrange the metadata
delimiter in the following way:
ZQUSZQ
Data
Click OK to add the process.
Figure 11. Standardize rule process window with the metadata
delimiter arranged in order
Figure 12. Standardize stage properties window with created rule
process
7. Map the output columns (Stage Properties > Output >
Mapping)
The standardize stage produces columns based on the rule set
selected. The following columns were
selected in this example: ISOCountryCode_COUNTRY and
IdentifierFlag_COUNTRY, along with the "Data" input
field.
Drag and drop the columns listed above to the output.
Figure 13. Standardize stage output column mapping
Step 3: Configure the output file and run the job
Configure the output sequential file stage: supply the required
fields, such as the file name, and set other options, such as the
format, as needed. Run the job and verify the output. Here is
the output produced:
Figure 14. Output sequential file view data
Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000
Country code for this record is identified as AU (ISOCountryCode_COUNTRY)
Country code is identified based on the data only (IdentifierFlag_COUNTRY)
Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ
Country code for this record is identified as GB (ISOCountryCode_COUNTRY)
Country code is identified based on the data only (IdentifierFlag_COUNTRY)
Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1
Country code for this record is identified as CA (ISOCountryCode_COUNTRY)
Country code is identified based on the data only (IdentifierFlag_COUNTRY)
Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42
Country code for this record is identified as US (ISOCountryCode_COUNTRY)
The country could not be identified from the data, so the default country code from the metadata delimiter (US) was used (IdentifierFlag_COUNTRY)
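The fallback to the default country can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the logic of the COUNTRY rule set: the postal-code patterns and the identify_country function are hypothetical, and the boolean returned here only mimics the idea behind IdentifierFlag_COUNTRY rather than its actual values.

```python
import re

# Hypothetical patterns; the real COUNTRY rule set uses its own
# classification table and pattern-action file.
POSTAL_PATTERNS = [
    ("CA", re.compile(r"\b[A-Z]\d[A-Z] ?\d[A-Z]\d\b")),                 # M5K 1B1
    ("GB", re.compile(r"\b[A-Z]{1,2}\d{1,2}[A-Z]? ?\d[A-Z]{2}\b")),     # M33 6RJ
    ("AU", re.compile(r"\b(?:NSW|VIC|QLD|SA|WA|TAS|NT|ACT) \d{4}\b")),  # VIC 3000
]

# Metadata delimiter: ZQ, a two-letter default country code, ZQ.
DELIMITER = re.compile(r"^ZQ([A-Z]{2})ZQ\s*")


def identify_country(record):
    """Return (country_code, identified_from_data) for one input record."""
    match = DELIMITER.match(record)
    default = match.group(1) if match else None
    data = record[match.end():] if match else record
    for code, pattern in POSTAL_PATTERNS:
        if pattern.search(data.upper()):
            return code, True
    # Nothing in the data identified a country: use the delimiter default.
    return default, False


records = [
    "ZQUSZQ Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000",
    "ZQUSZQ Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ",
    "ZQUSZQ Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1",
    "ZQUSZQ Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42",
]
for record in records:
    print(identify_country(record))
```

Only the last record falls through to the default, which mirrors the output described above.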
Implementing the domain pre-processor
The domain pre-processor will identify different domains (like
name, address and area) from the given data
and populate them to the correct fields. Let's take the
following data:
"52280A NC 42 72 HWY # 42","KNOXVILLE TN 37920","Dr Jeffery David Thomson Jnr PHD"
"International Business Machines Corp","1480 CARRIAGE LN APT 301","AUBURN IN 467069555"
"Peter heines","ASHVILLE NEW YORK 147109762","930 SOUTH BROAD ST EAST APT H"
It has three fields: Field1, Field2, and Field3 (see Figure 16).
But the data is scattered across all three fields.
For example, the name is in Field3 in the first record, but in
Field1 in the second and third records.
We will create a standardize job using
the pre-processor rule set to identify the different domains.
Step 1: Create a parallel job
Create a parallel job as shown in Figure 15. Configure the input
sequential file stage to read the input file,
which contains the example records listed above.
Figure 15. Parallel job with sequential and standardize
stages
Figure 16. Input sequential file view data
Step 2: Configure the standardize stage
1. Create a new process.
Figure 17. Standardize stage properties
Figure 18. Standardize new rule process window
2. Select the USPREP rule set (Standardization Rules > USA
> USPREP > USPREP) for the available
columns Field1, Field2, and Field3, which are the input
sequential file metadata.
Figure 19. Rule set selection
3. Click the > button for the three fields to move them to
the selected column area.
Figure 20. Standardize rule process window with selected rule
set and columns
4. Add metadata delimiters. Metadata delimiters are used to
convey what kind of information we are
expecting in each input field. If the pre-processor
cannot determine the domain of a token, the token
defaults to the domain specified through the metadata
delimiter. The format of the metadata delimiter
is ZQ, followed by the domain name, followed by ZQ. In this
example, we are anticipating that Field1
contains Name data, Field2 contains
Address data, and Field3 contains Area data. Add three
delimiters: ZQNAMEZQ, ZQADDRZQ, and
ZQAREAZQ. Enter ZQNAMEZQ in the Literal field.
Figure 21. Standardize rule process window with metadata
delimiter entered
5. Click the > button.
Figure 22. Standardize rule process window with metadata
delimiter selected
6. Repeat steps 4 and 5 to add delimiters ZQADDRZQ and
ZQAREAZQ.
Figure 23. Standardize rule process window with all metadata
delimiters selected
7. Use the Move Up and Move Down buttons to arrange the metadata
delimiters in the following way:
ZQNAMEZQ
Field1
ZQADDRZQ
Field2
ZQAREAZQ
Field3
Click OK to add the process.
Figure 24. Standardize rule process window with all metadata
delimiters arranged in order
Figure 25. Standardize stage properties window with created rule
process
8. Map the output columns (Stage Properties > Output >
Mapping)
The standardize stage produces columns based on the rule set
selected. The following columns were
selected in this example: NameDomain_USPREP,
AddressDomain_USPREP, and AreaDomain_USPREP.
Drag and drop the columns listed above to the output.
Figure 26. Standardize stage output column mapping
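The role of the metadata delimiters in step 4 can be illustrated with a short Python sketch. It is a simplification under several assumptions: it routes whole fields instead of individual tokens, and the keyword list, the ZIP-code pattern, and the detect_domain and preprocess functions are hypothetical rather than part of USPREP.

```python
import re

# Default domain for each input field, as set by the delimiters
# ZQNAMEZQ Field1, ZQADDRZQ Field2, ZQAREAZQ Field3.
DEFAULTS = ["NameDomain", "AddressDomain", "AreaDomain"]

NAME_KEYWORDS = {"DR", "MR", "MRS", "CORP", "INC"}   # hypothetical keyword list
ZIP_AT_END = re.compile(r"\b\d{5}(?:\d{4})?$")       # 5- or 9-digit ZIP code


def detect_domain(value):
    """Guess the domain of a field, or return None if it is undetermined."""
    upper = value.upper()
    if ZIP_AT_END.search(upper):
        return "AreaDomain"
    if upper[:1].isdigit():
        return "AddressDomain"
    if NAME_KEYWORDS & set(upper.split()):
        return "NameDomain"
    return None


def preprocess(fields):
    """Route each field to a domain, using the delimiter default as fallback."""
    output = {domain: [] for domain in DEFAULTS}
    for value, default in zip(fields, DEFAULTS):
        output[detect_domain(value) or default].append(value)
    return {domain: " ".join(values) for domain, values in output.items()}


print(preprocess(["Peter heines",
                  "ASHVILLE NEW YORK 147109762",
                  "930 SOUTH BROAD ST EAST APT H"]))
```

In this sketch "Peter heines" matches none of the heuristics, so it lands in the name domain only because Field1 is preceded by ZQNAMEZQ.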
Step 3: Configure the output file and run the job
Configure the output sequential file stage: supply the required
fields, such as the file name, and set other options, such as the
format, as needed. Run the job and verify the output. Figure 27
shows the output produced.
Figure 27. Output sequential file view data
"International Business Machines Corp","1480 CARRIAGE LN APT 301","AUBURN IN 467069555"
"International Business Machines Corp" is identified as name domain (NameDomain)
"1480 CARRIAGE LN APT 301" is address domain (AddressDomain)
"AUBURN IN 467069555" is area domain (AreaDomain)
"52280A NC 42 72 HWY # 42","KNOXVILLE TN 37920","Dr Jeffery David Thomson Jnr PHD"
"Dr Jeffery David Thomson Jnr PHD" is identified as name domain (NameDomain)
"52280A NC 42 72 HWY # 42" is address domain (AddressDomain)
"KNOXVILLE TN 37920" is area domain (AreaDomain)
"Peter heines","ASHVILLE NEW YORK 147109762","930 SOUTH BROAD ST EAST APT H"
"Peter heines" is identified as name domain (NameDomain)
"930 SOUTH BROAD ST EAST APT H" is address domain (AddressDomain)
"ASHVILLE NEW YORK 147109762" is area domain (AreaDomain)
Implementing name standardization
This is the domain-specific type of standardization. Let's take
the following name examples.
Dr Jeffery David Thomson Jnr PHD
International Business Machines Corp
Peter heines
These examples contain individual and organization names; assume
that they belong to the US. Our
intention here is to identify different parts of the name like
the primary name, first name, and last name.
Step 1: Create a parallel job
Create a parallel job as shown in Figure 28. Configure the input
sequential file stage to read the input file that
contains the example records above.
Figure 28. Parallel job with sequential and standardize
stages
Figure 29. Input sequential file view data
Step 2: Configure the standardize stage
1. Create a new process.
Figure 30. Standardize stage properties
Figure 31. Standardize new rule process window
2. Select the USNAME rule set (Standardization Rules > USA
> USNAME > USNAME) for the column
"name," which is the input sequential file metadata.
Figure 32. Rule set selection
3. Click the > button.
Figure 33. Standardize rule process window with rule set
selected
4. Do not add the "Optional NAMES Handling" option. The Optional
NAMES Handling field has the
following options:
Process All as Individual: All columns are standardized as individual names.
Process All as Organization: All columns are standardized as organization names.
Process Undefined as Individual: All unhandled columns are standardized as individual names.
Process Undefined as Organization: All unhandled columns are standardized as organization names.
This option is useful if we know the types of names in the input
file. For example, if the file mainly
contains organization names, specifying Process All as
Organization enhances performance by
eliminating the processing steps that determine the name's
type.
5. Click OK.
Figure 34. Standardize rule process window with selected rule
set and columns
Figure 35. Standardize stage properties window with created rule
process
6. Map the output columns (Stage Properties > Output >
Mapping)
The standardize stage produces columns based on the rule set
selected. In this example, the following
columns were selected: NameType_USNAME, GenderCode_USNAME,
NamePrefix_USNAME,
FirstName_USNAME, MiddleName_USNAME, PrimaryName_USNAME,
NameGeneration_USNAME, and
NameSuffix_USNAME
Drag and drop the above columns to the output.
Figure 36. Standardize stage output column mapping
Step 3: Configure the output file and run the job
Configure the output sequential file stage: supply the required
fields, such as the file name, and set other options, such as the
format, as needed. Run the job and verify the output. Figure 37
shows the output produced.
Figure 37. Output sequential file view data
Dr Jeffery David Thomson Jnr PHD
The data is identified as an individual name (NameType).
Gender is male (GenderCode).
Dr is the name prefix (NamePrefix).
Jeffery is the first name (FirstName).
David is the middle name (MiddleName).
Thomson is the primary name (PrimaryName).
Jr is the generation (NameGeneration). The actual input contains Jnr, but the standardize stage returned the commonly used standard form.
PHD is the name suffix (NameSuffix).
International Business Machines Corp
The data is identified as an organization name (NameType).
International Business Machines is the primary name (PrimaryName).
Corp is the name suffix (NameSuffix).
Peter heines
The data is identified as an individual name (NameType).
Gender is male (GenderCode).
Peter is the first name (FirstName).
Heines is the primary name (PrimaryName).
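To make the breakdown above more concrete, here is a minimal Python sketch of the same idea. It is not how USNAME works internally: the lookup tables, the standardize_name function, the "Individual" and "Organization" labels, and the choice to uppercase all output are assumptions made for this example, and gender detection is left out because it would need a first-name table.

```python
# Hypothetical lookup tables standing in for a name classification table.
PREFIXES = {"DR", "MR", "MRS", "MS"}
GENERATIONS = {"JNR": "JR", "JR": "JR", "SNR": "SR", "SR": "SR"}
SUFFIXES = {"PHD", "MD"}
ORG_SUFFIXES = {"CORP", "INC", "LTD"}

COLUMNS = ["NameType", "NamePrefix", "FirstName", "MiddleName",
           "PrimaryName", "NameGeneration", "NameSuffix"]


def standardize_name(raw):
    """Split a free-form name into the output columns listed in COLUMNS."""
    tokens = raw.upper().replace(".", "").split()
    out = dict.fromkeys(COLUMNS, "")

    # An organization keyword at the end marks an organization name.
    if tokens and tokens[-1] in ORG_SUFFIXES:
        out["NameType"] = "Organization"
        out["NameSuffix"] = tokens.pop()
        out["PrimaryName"] = " ".join(tokens)
        return out

    out["NameType"] = "Individual"
    if tokens and tokens[0] in PREFIXES:
        out["NamePrefix"] = tokens.pop(0)
    if tokens and tokens[-1] in SUFFIXES:
        out["NameSuffix"] = tokens.pop()
    if tokens and tokens[-1] in GENERATIONS:
        # Map variants such as JNR to one standard form.
        out["NameGeneration"] = GENERATIONS[tokens.pop()]
    if tokens:
        out["PrimaryName"] = tokens.pop()
    if tokens:
        out["FirstName"] = tokens.pop(0)
    out["MiddleName"] = " ".join(tokens)
    return out


for name in ["Dr Jeffery David Thomson Jnr PHD",
             "International Business Machines Corp",
             "Peter heines"]:
    print(standardize_name(name))
```

The generation lookup is where the Jnr to Jr normalization described above would happen in this sketch.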
Implementing validation
This type of rule set is mainly used to
validate data (VDATE and VEMAIL, for example). Let's take the
following date examples:
OCT021983
09211991
02/29/2011
These are some of the acceptable input formats. The
standardization job verifies whether each date is valid.
If it is, the job sets the valid flag and produces the output in
the standard format CCYYMMDD; otherwise, it sets
an invalid reason code.
Step 1: Create the parallel job
Create a parallel job as shown in Figure 38. Configure the input
sequential file stage to read the input file,
which contains the above example records.
Figure 38. Parallel job with sequential and standardize
stages
Figure 39. Input sequential file view data
Step 2: Configure the standardize stage
1. Create a new process.
Figure 40. Standardize stage properties
Figure 41. Standardize new rule process window
2. Select the VDATE rule set (Standardization Rules > Other
> VDATE) for the column "Date," which is
the input sequential file metadata.
Figure 42. Rule set selection
3. Click the > button.
Figure 43. Standardize rule process window with rule set
selected
4. Click OK.
Figure 44. Standardize rule process window with selected rule
set and columns
Figure 45. Standardize stage properties window with created rule
process
5. Map the output columns (Stage Properties > Output >
Mapping)
The standardize stage produces columns based on the rule set
selected. In this example, the following
columns were selected: ValidFlag_VDATE, DateCCYYMMDD_VDATE,
and InvalidReason_VDATE, along
with the input column "Date."
Drag and drop the above columns to the output.
Figure 46. Standardize stage output column mapping
Step 3: Configure the output file and run the job
Configure the output sequential file stage: supply the required
fields, such as the file name, and set other options, such as the
format, as needed. Run the job and verify the output. Here is
the output produced:
Figure 47. Output sequential file view data
OCT021983
Valid date (ValidFlag_VDATE)
19831002 is the standard format (DateCCYYMMDD_VDATE)
09211991
Valid date (ValidFlag_VDATE)
19910921 is the standard format (DateCCYYMMDD_VDATE)
02/29/2011
Invalid date (ValidFlag_VDATE)
The reason is that it is an invalid leap-year date (InvalidReason_VDATE)
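The same checks can be approximated in plain Python. This is a sketch under assumptions, not the VDATE rule set: the three patterns cover only the example formats above, the validate_date function is hypothetical, and the flag and reason values are placeholders rather than the codes VDATE actually emits.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

# Only the three input formats shown in this section.
PATTERNS = [
    re.compile(r"(?P<mon>[A-Z]{3})(?P<day>\d{2})(?P<year>\d{4})"),   # OCT021983
    re.compile(r"(?P<mon>\d{2})(?P<day>\d{2})(?P<year>\d{4})"),      # 09211991
    re.compile(r"(?P<mon>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})"),    # 02/29/2011
]


def invalid(reason):
    return {"ValidFlag": False, "DateCCYYMMDD": "", "InvalidReason": reason}


def validate_date(raw):
    """Validate one date string and return it in CCYYMMDD form if valid."""
    text = raw.strip().upper()
    for pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if not match:
            continue
        mon = match.group("mon")
        month = MONTHS.get(mon) if mon.isalpha() else int(mon)
        if month is None:
            return invalid("unknown month name")
        try:
            # date() rejects impossible dates such as February 29 in 2011.
            parsed = date(int(match.group("year")), month, int(match.group("day")))
        except ValueError as error:
            return invalid(str(error))
        return {"ValidFlag": True,
                "DateCCYYMMDD": parsed.strftime("%Y%m%d"),
                "InvalidReason": ""}
    return invalid("unrecognized format")


for value in ["OCT021983", "09211991", "02/29/2011"]:
    print(value, validate_date(value))
```

Building a date object is what catches the leap-year case: 2011 is not a leap year, so February 29 is rejected.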
Conclusion
In this tutorial, you have learned what the
standardization process is and how it can be achieved by using
InfoSphere QualityStage. You have also learned about
standardization using different types of rule sets like
country identifier, domain pre-processor, domain-specific, and
validation.
Download
Description            Name                   Size
Sample jobs and data   SampleJobDesigns.zip   10KB