
Informatica Data Quality Integration Guide

Informatica PowerCenter (Version 8.1.1)

Informatica Data Quality Integration Guide
Version 8.1.1
August 2007

Copyright (c) 1998-2007 Informatica Corporation. All rights reserved. This software and documentation contain proprietary information of Informatica Corporation, and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without the prior written consent of Informatica Corporation. Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7202-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

Informatica, PowerCenter, PowerCenterRT, PowerExchange, PowerCenter Connect, PowerCenter Data Analyzer, PowerMart, Metadata Manager, Informatica Data Quality and Informatica Data Explorer are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners. U.S. Patent Pending.

Portions of this software are copyrighted by DataDirect Technologies, 1999-2002. Informatica PowerCenter products contain ACE (TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University and University of California, Irvine, Copyright (c) 1993-2002, all rights reserved. Portions of this software contain copyrighted material from The JBoss Group, LLC. Your right to use such materials is set forth in the GNU Lesser General Public License Agreement, which may be found at http://www.opensource.org/licenses/lgpl-license.php. The JBoss materials are provided free of charge by Informatica, as-is, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Portions of this software contain copyrighted material from Meta Integration Technology, Inc. Meta Integration is a registered trademark of Meta Integration Technology, Inc. This product includes software developed by the Apache Software Foundation (http://www.apache.org/). The Apache Software is Copyright (c) 1999-2005 The Apache Software Foundation. All rights reserved. This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit, and redistribution of this software is subject to terms available at http://www.openssl.org. Copyright 1998-2003 The OpenSSL Project. All Rights Reserved. The zlib library included with this software is Copyright (c) 1995-2003 Jean-loup Gailly and Mark Adler. The Curl license provided with this Software is Copyright 1998-2004, Daniel Stenberg. All Rights Reserved. The PCRE library included with this software is Copyright (c) 1997-2001 University of Cambridge. Regular expression support is provided by the PCRE library package, which is open source software, written by Philip Hazel. The source for this library may be found at ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre. InstallAnywhere is Copyright 2005 Zero G Software, Inc. All Rights Reserved. Portions of the Software are Copyright (c) 1998-2005 The OpenLDAP Foundation. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted only as authorized by the OpenLDAP Public License, available at http://www.openldap.org/software/release/license.html. This Software is protected by U.S. Patent Numbers 6,208,990; 6,044,374; 6,014,670; 6,032,158; 5,794,246; 6,339,775 and other U.S. Patents Pending. DISCLAIMER: Informatica Corporation provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of non-infringement, merchantability, or use for a particular purpose. The information provided in this documentation may include technical inaccuracies or typographical errors. Informatica could make improvements and/or changes in the products described in this documentation at any time without notice.

Table of Contents

List of Figures, 1
List of Tables, 1
Preface, 3
    About This Book, 4
    Other Informatica Resources, 5

Chapter 1: Understanding the Data Quality Integration Transformation, 1
    Overview, 2

Chapter 2: Configuring a Data Quality Integration Transformation, 3
    Overview, 4
    Process Architecture for the Integration Transformation, 6
    Data Quality Integration Transformation Properties, 8
    Configuring the Integration to Read Data Quality Plans, 10

Chapter 3: Defining Mappings for Data Quality Plans, 15
    Overview, 16
    Defining a Mapping to Cleanse, Parse, or Validate Data, 17
    Defining a Mapping for Data Matching in a Single Data Source, 18
    Defining a Mapping for Data Matching Across Two Data Sources, 20

Appendix A: Working with Plans in Workbench, 23
    Understanding Plan Complexity, 24
    Copying Plans, 26
    Editing Plans, 27
    Designing Plans for PowerCenter Use, 29

Index, 35


List of Figures

Figure 2-1. Data Quality Integration Icon, 5
Figure 2-2. Edit Transformation Dialog Box, Configurations Tab, 9
Figure 2-3. Data Quality Repository Connection Dialog Box, 10
Figure 2-4. Grouping Example, Data Extract Sorted by Zip or Postcode, 13
Figure 3-1. Mapping Defined for Standardizing, Parsing, and Validation, 17
Figure 3-2. Mapping Defined for Single-Source Matching, 18
Figure 3-3. Mapping Designed for Matching on Two Data Sources, 20
Figure A-1. Detailed Plan Configuration, Workbench User Interface, 24
Figure A-2. Match Threshold Scores, CSV Match Sink Component, 27
Figure A-3. Realtime Source Component Configuration, 29
Figure A-4. Realtime Sink Component Configuration, 30
Figure A-5. CSV Match Sink Component Configuration, 31


List of Tables

Table 2-1. Process for Implementing Data Quality-PowerCenter Integration Architecture, 6
Table 2-2. Edit Transformations Dialog Box Tabs, 8
Table 2-3. Configurations Tab Options List, 9
Table 2-4. Repository Connection Field Settings, 11
Table A-1. Source and Sink Components with Real-time Capability, 29


Preface

Welcome to PowerCenter, the Informatica software product that delivers an open, scalable data integration solution addressing the complete life cycle for all data integration projects including data warehouses, data migration, data synchronization, and information hubs. PowerCenter combines the latest technology enhancements for reliably managing data repositories and delivering information resources in a timely, usable, and efficient manner. The PowerCenter repository coordinates and drives a variety of core functions, including extracting, transforming, loading, and managing data. The Integration Service can extract large volumes of data from multiple platforms, handle complex transformations on the data, and support high-speed loads. PowerCenter can simplify and accelerate the process of building a comprehensive data warehouse from disparate data sources.


About This Book

This document has been written for the following audience:

- PowerCenter systems administrators who will install the Data Quality Integration plug-ins to their PowerCenter systems.
- PowerCenter users who will connect to the Informatica Data Quality repository and add data quality plans to the Data Quality Integration transformation.

Document Conventions

This guide uses the following formatting conventions:

If you see                       It means
italicized text                  The word or set of words are especially emphasized.
boldfaced text                   Emphasized subjects.
italicized monospaced text       This is the variable name for a value you enter as part of an operating system command. This is generic text that should be replaced with user-supplied values.
Note:                            The following paragraph provides additional facts.
Tip:                             The following paragraph provides suggested uses.
Warning:                         The following paragraph notes situations where you can overwrite or corrupt data, unless you follow the specified procedure.
monospaced text                  This is a code example.
bold monospaced text             This is an operating system command you enter from a prompt to run a task.


Other Informatica Resources

In addition to the product manuals, Informatica provides these other resources:

- Informatica Customer Portal
- Informatica web site
- Informatica Knowledge Base
- Informatica Global Customer Support

Visiting Informatica Customer Portal

As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica Knowledge Base, Informatica Documentation Center, and access to the Informatica user community.

Visiting the Informatica Web Site

You can access the Informatica corporate web site at http://www.informatica.com. The site contains information about Informatica, its background, upcoming events, and sales offices. You will also find product and partner information. The services area of the site includes important information about technical support, training and education, and implementation services.

Visiting the Informatica Knowledge Base

As an Informatica customer, you can access the Informatica Knowledge Base at http://my.informatica.com. Use the Knowledge Base to search for documented solutions to known technical issues about Informatica products. You can also find answers to frequently asked questions, technical white papers, and technical tips.

Obtaining Customer Support

There are many ways to access Informatica Global Customer Support. You can contact a Customer Support Center by telephone, email, or the WebSupport Service. Use the following email addresses to contact Informatica Global Customer Support:

- [email protected] for technical inquiries
- [email protected] for general customer service requests

WebSupport requires a user name and password. You can request a user name and password at http://my.informatica.com.


Use the following telephone numbers to contact Informatica Global Customer Support:

North America / South America
Informatica Corporation Headquarters
100 Cardinal Way
Redwood City, California 94063
United States
Toll Free: 877 463 2435
Standard Rate: United States: 650 385 5800

Europe / Middle East / Africa
Informatica Software Ltd.
6 Waltham Park Waltham Road, White Waltham
Maidenhead, Berkshire SL6 3TN
United Kingdom
Toll Free: 00 800 4632 4357
Standard Rate: Belgium: +32 15 281 702; France: +33 1 41 38 92 26; Germany: +49 1805 702 702; Netherlands: +31 306 022 797; United Kingdom: +44 1628 511 445

Asia / Australia
Informatica Business Solutions Pvt. Ltd.
Diamond District Tower B, 3rd Floor
150 Airport Road
Bangalore 560 008 India
Toll Free: Australia: 1 800 151 830; Singapore: 001 800 4632 4357
Standard Rate: India: +91 80 4112 5738


Chapter 1

Understanding the Data Quality Integration Transformation

This chapter includes the following topics:

- Overview, 2


Overview

The Data Quality Integration is a plug-in component that integrates Informatica PowerCenter and Informatica Data Quality applications. The Integration adds a transformation, called Data Quality Integration, to PowerCenter. You can use this transformation in a mapping to connect to the Data Quality repository and retrieve data quality plan information. The data input and output settings in the plan you select define the input and output ports on the Data Quality Integration transformation. The Integration enables the following types of interaction:

- It enables you to browse the Data Quality repository and add a data quality plan to the Data Quality Integration transformation. The functional details of the plan are saved as XML in the PowerCenter repository.
- It enables the PowerCenter Integration Service to send data quality plan XML to the Data Quality engine when a session containing a Data Quality Integration transformation is run.

The Integration installs as a plug-in for PowerCenter Designer and the PowerCenter Integration Service. There are client-side and server-side versions of the plug-ins.

- Install the client version locally to the PowerCenter Designer that reads plans from the Data Quality repository.
- Install the server version locally to the PowerCenter Integration Service that runs a workflow containing either transformation.

You must also install Informatica Data Quality in these two locations. This makes the Data Quality repository available to PowerCenter Designer and a Data Quality engine available to the PowerCenter Integration Service.

Note: For instructions on installing and registering the Data Quality Integration, see the Informatica Data Quality Installation Guide.

The Data Quality Integration is a core component of PowerCenter Data Cleanse and Match.

PowerCenter Data Cleanse and Match

The Data Quality Integration transformation is a component of PowerCenter Data Cleanse and Match, a cross-application solution designed to validate and enhance the quality of name and address data. PowerCenter Data Cleanse and Match is composed of Data Quality Workbench, the Data Quality Integration, any address reference datasets you subscribe to, and a set of pre-built data quality plans. For more information about PowerCenter Data Cleanse and Match and Informatica Data Quality, see the Informatica Data Quality User Guide.


Chapter 2

Configuring a Data Quality Integration Transformation

This chapter includes the following topics:

- Overview, 4
- Process Architecture for the Integration Transformation, 6
- Data Quality Integration Transformation Properties, 8
- Configuring the Integration to Read Data Quality Plans, 10


Overview

The Data Quality Integration transformation lets you select plans from the Data Quality repository and load them to the PowerCenter repository. PowerCenter can then send the plan to the Data Quality engine as part of a workflow. You can add one plan to each Data Quality Integration transformation, and you can add one or more Data Quality Integration transformations to a mapping.

Active and Passive Transformations

When you first add a Data Quality Integration to the Designer workspace, you must specify whether it is a passive or active transformation. The type of plan you can use depends on the type of transformation you create.

- In passive transformations, the number and order of output data records must match the number and order of input data records.
- In active transformations, the number and order of output data records can differ from the number and order of input data records. Define an active transformation for use with data matching plans.

Note: Once set, the transformation type cannot be changed.

How PowerCenter Implements a Data Quality Plan

Data quality plan instructions are stored as metadata extensions with the transformation in the PowerCenter repository. Therefore, PowerCenter need not re-connect to the Data Quality repository to run a data quality plan. Correspondingly, once plan details are saved with a mapping, those details can only be changed by re-connecting to the Data Quality repository and refreshing the data quality plan in the transformation. When you run a workflow containing the mapping, the data quality plan instructions are passed by the PowerCenter Integration Service to the Informatica Data Quality engine, and the results are returned to PowerCenter for further processing in the session. For more information, see Process Architecture for the Integration Transformation on page 6.

Refreshing Plan Instructions in the Data Quality Integration

To update a Data Quality Integration with the latest details of a selected plan, open the Edit Transformations dialog box, select the Configurations tab, and click the Refresh button. When you click Refresh, the transformation connects to the Data Quality repository and retrieves the latest plan information. You do not need to re-select the plan. The transformation handles the addition and deletion of ports seamlessly, so long as no existing connections to ports on other transformations are broken.


Selecting a Data Quality Integration Transformation

You can create and configure a Data Quality Integration transformation in the Transformation Developer or the Mapping Designer. Add a Data Quality Integration transformation to the PowerCenter Designer workspace by clicking the Data Quality Integration button on the PowerCenter toolbar. This button is visible as a Q on the toolbar and in the Mapping Designer workspace. Figure 2-1 shows a Data Quality Integration transformation in iconic form in a completed mapping:

Figure 2-1. Data Quality Integration Icon


Process Architecture for the Integration Transformation

When you install and register a Data Quality Integration transformation, you are ready to add data quality plans to a PowerCenter workflow. You can work from the plan level upward to the workflow level, adding a plan to a transformation, adding the transformation to a mapping, and adding a mapping to a workflow. The architecture allows you to follow the steps in Table 2-1 to add a data quality plan to your project.

Table 2-1. Process for Implementing Data Quality-PowerCenter Integration Architecture

Procedure: Design one or more data quality plans in Data Quality Workbench. Data Quality saves the plans to the Data Quality repository. Your plan instructions may include address validation steps that make use of third-party reference data software and data files. If you are using pre-built plans from Informatica, you can skip this step.
Component: Data Quality Workbench, Data Quality repository, Reference data (optional)

Procedure: Create a Data Quality Integration transformation and connect to the Data Quality repository.
Component: PowerCenter Designer

Procedure: Select a data quality plan and add its data quality information to the transformation. The type of plan you add to the Data Quality Integration will influence your choice of transformation. For more information, see Defining Mappings for Data Quality Plans on page 15. The plan information is saved with the transformation to the PowerCenter repository.
Component: PowerCenter Designer

Procedure: Define a mapping and add the Integration transformation to it.
Component: PowerCenter Designer

Procedure: Add the mapping to a workflow.
Component: PowerCenter Workflow Manager

Procedure: Run the workflow. As the workflow runs, PowerCenter sends the data quality plan and the relevant input data to the Data Quality engine. The Data Quality engine sends the data outputs from the plan back to PowerCenter for further processing in the workflow.
Component: Data Quality engine, Reference data (optional)

You do not need to interact with Data Quality Workbench to use the pre-built plans. You can explore the plans in Data Quality Workbench, and you can use Workbench to build and add plans to the Data Quality repository. Data quality plans can be highly complex in design, and you should only edit plans if you are properly trained in Data Quality Workbench.

Running Plans in Sequence

You can add multiple Data Quality Integration transformations to a mapping, and you can add a single data quality plan to each transformation. The number of Integration transformations you add to the mapping depends on the type of data quality task you are undertaking.


Note: When you add a series of Data Quality Integration transformations to a single mapping, the session containing the mapping will run faster than a session containing a series of mappings with a single Data Quality Integration transformation in each one.


Data Quality Integration Transformation Properties

To open and view the configuration of the Data Quality Integration transformation, double-click its icon on the PowerCenter workspace or right-click its title bar and select Edit from the shortcut menu. This opens the Edit Transformations dialog box. Table 2-2 describes the tabs available:

Table 2-2. Edit Transformations Dialog Box Tabs

Tab                          Description
Transformations              Lists the name and type of the transformation. Includes a Description text field.
Ports                        Lists the input and output ports configured on the transformation. This tab also lists Datatype, Precision, and Scale values for each port.
Properties                   Lists the properties of the transformation. This tab provides names and values for several transformation attributes.
Initialization Properties    Allows you to enter external procedure initialization properties for the transformation.
Metadata Extensions          Allows you to extend the metadata stored in the repository by associating information with individual repository objects.
Port Attribute Definitions   Allows you to create port attributes for the transformation.
Configurations               Allows you to connect to the Data Quality repository and read plan information into the transformation.

Use the Configurations tab to configure the Data Quality Integration transformation. For more information on the other tabs on this dialog box, consult the PowerCenter online help.


Configurations Tab

Figure 2-2 shows the Configurations tab options:

Figure 2-2. Edit Transformation Dialog Box, Configurations Tab

Table 2-3 describes the options on this tab:

Table 2-3. Configurations Tab Options List

Option                        Description
Plan Name                     Identifies the plan to be added to the transformation. Includes a browse option that connects PowerCenter to the Data Quality repository, allowing PowerCenter to read plan information into the transformation.
Plan Location                 Lists the location of the Data Quality repository in which the plan is stored and the path to the plan within the Data Quality repository.
Status                        Describes the connection.
I/O Ports                     Lists any pass-through ports added to the transformation. These ports enable data to pass through the transformation unchanged. They are not included in the input and output ports created by the data quality plan and must be added in PowerCenter.
Include Pass Through Ports    Check this option to add a pass-through port to the transformation.


Configuring the Integration to Read Data Quality Plans

When you create a Data Quality Integration, you configure it to read plan information from the Data Quality repository. The principal steps to configure the Data Quality Integration transformation with a data quality plan are as follows:

1. Connect to the Data Quality repository.
2. Select a plan.
3. Optionally, add pass-through ports to the Data Quality Integration transformation.
4. Optionally, select a grouping port for the Data Quality Integration transformation.

With these steps completed, you can connect the Data Quality Integration to other transformations in a PowerCenter mapping. For information about the steps to build a mapping that facilitates a Data Quality Integration, see Defining Mappings for Data Quality Plans on page 15. You perform these tasks through the Configurations tab on the Edit Transformations dialog box.

Step 1. Connect to the Data Quality Repository

To add a plan to a Data Quality Integration transformation, your PowerCenter Client must be able to read from the Data Quality repository.

To test your connection to the Data Quality repository:

1. Open the Edit Transformations dialog box for the transformation.
2. Select the Configurations tab.
3. Click the Connect button. The Data Quality Repository Connection dialog box opens.

Figure 2-3. Data Quality Repository Connection Dialog Box


4. Ensure that the dialog box fields are populated as described in Table 2-4.

Table 2-4. Repository Connection Field Settings

Setting          Required/Optional    Description
Host Name        Required             The machine name or IP address of the Data Quality repository host.
Database Name    Optional             This field is disabled, as the database is the Data Quality repository.
Port Number      Required             The IP port on which the repository is listening.
User Name        Required             A valid logon name for the database. You should not need to type in this field.
Password         Required             A valid password for the logon name. You should not need to type in this field.

5. Click Test Connection to test the connection details provided. A message box states whether the connection is valid. Close the message box.

6. Click OK to save the configuration information.

Step 2. Select a Plan from the Data Quality Repository

Use the following procedure to add a data quality plan to the Data Quality Integration transformation.

To add a data quality plan to a Data Quality Integration transformation:

1. Open the Edit Transformations dialog box for the transformation.
2. Select the Configurations tab.
3. Click the Browse button beside the Plan Name field. The Select Plan dialog box opens.
4. Browse the Data Quality repository in the upper pane, select the required plan, and click OK. Plan design details, including the plan creation date and last saved date, are shown in the lower pane of this dialog box.
5. Click OK to return to the Configurations tab.

Step 3. Add Pass-Through Ports (Optional)

The default input and output ports on a data quality plan are set when the plan is designed in Data Quality Workbench. However, you can add pass-through ports to the plan on the Configurations tab of the Data Quality Integration transformation. Pass-through ports allow their data to pass through the transformation without interacting with the plan, so that data that enters the transformation on that port leaves it in an identical state.

Add pass-through ports for data fields that you do not want the data quality plan to work on.

To add pass-through ports to a Data Quality Integration transformation (optional):

1. Open the Edit Transformations dialog box for the transformation.
2. Click the Configurations tab.
3. Check the Include Pass Through Ports check box.
4. Click the Add button.
5. Click the Port Name field and enter a name for the new port.
6. Set the precision value for the new port. For String datatypes, the precision value is the number of bytes the PowerCenter Integration Service reads from or writes to the file. The default value is 512. This is also the maximum value allowed.
   Note: All pass-through ports are of String type. You cannot change the datatype or scale.
7. Repeat steps 3 to 6 for all the pass-through ports you require.
8. Click OK.

Step 4. Select a Grouping Port (Optional)

Use Grouping Ports when working with a data quality plan that performs matching operations on its input data. To understand the purpose and importance of groups in matching operations, see Grouping and Matching Considerations on page 31. When you assign an input data field to the Grouping Port, the port acts as a buffer and sends the queued data records to the Data Quality engine whenever the value in the selected input field changes. A matching plan run on grouped data will complete far faster than a plan run on ungrouped data, with a minimal increase in the likelihood of missed matches in the dataset. A data quality plan that matches on grouped data performs matching operations on data within groups only and not between or across groups.

Effective grouping, and thus effective matching operations, requires data preparation before the data reaches the Data Quality Integration transformation. The input data must first be sorted on the field to be assigned to the grouping port, so that all data records with a common group field value are sent together to the Data Quality engine. For this reason, a Data Quality Integration containing a matching plan is usually preceded in a mapping by a Sorter transformation.
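The buffer-and-flush behavior described above can be pictured with a short sketch. The following Python fragment is illustrative only: it assumes the rows are already sorted on the group field (a hypothetical zip column) and stands in for, rather than reproduces, the engine's own logic.

    # Illustrative sketch: release sorted rows in batches, one batch per distinct
    # value of the group field, the way a Grouping Port buffers queued records
    # until the value in the selected input field changes.
    from itertools import groupby

    rows = [
        {"name": "A. Smith", "zip": "10001"},
        {"name": "B. Smith", "zip": "10001"},
        {"name": "C. Jones", "zip": "94063"},
    ]  # hypothetical records, pre-sorted on the zip field

    def release_groups(sorted_rows, group_field):
        # Yield one batch of records per distinct group field value.
        for key, batch in groupby(sorted_rows, key=lambda r: r[group_field]):
            yield key, list(batch)

    for zip_code, batch in release_groups(rows, "zip"):
        # Each batch would be matched in isolation; records in different
        # batches are never compared with each other.
        print(zip_code, len(batch), "records")

If the input were not sorted first, records with the same zip value could arrive in separate batches and potential matches between them would be missed, which is why the Sorter transformation precedes the matching plan.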


Figure 2-4 shows a data extract that has been sorted by zip/postcode. In this example, the Data Quality Integration will roll over to a new group when the value in this column changes:

Figure 2-4. Grouping Example, Data Extract Sorted by Zip or Postcode

To set a grouping port for matching operations in a Data Quality Integration transformation:

1. Open the Edit Transformations dialog box for the transformation.
2. Click the Configurations tab.
3. Select an input port from the Grouping Port drop-down menu.
4. Click OK.

Next Steps

With these steps completed, you can save the transformation or mapping to the repository. The next steps are to connect the Data Quality Integration to other elements in a PowerCenter mapping and to complete the mapping in a manner that suits your data quality objectives.


Chapter 3

Defining Mappings for Data Quality Plans

This chapter includes the following topics:

- Overview, 16
- Defining a Mapping to Cleanse, Parse, or Validate Data, 17
- Defining a Mapping for Data Matching in a Single Data Source, 18
- Defining a Mapping for Data Matching Across Two Data Sources, 20


Overview

This chapter describes how to define mappings to support Data Quality Integration transformations that contain data quality plans. How you define a mapping depends on the type of data quality plan you add to the Data Quality Integration.

Planning for Data Standardizing, Parsing, and Validation

Add plans that standardize, parse, and validate data accuracy to passive Data Quality Integration transformations. You can define relatively simple mappings to process them. For these types of operation, define a mapping with the Data Quality Integration, a qualified data source, and a data target.

Planning for Data Matching

Add a matching plan to an active Data Quality Integration and define the mapping with a Sorter transformation to organize the data prior to matching. You need additional transformations if you are matching across two datasets or need to add unique IDs to the data rows.

Planning for Grouping Operations

Grouping plans are typically added to a Data Quality Integration transformation in the same mapping as a matching plan. Add a grouping plan to your mapping to create specialized group key fields prior to matching.

Mapping Considerations

The mapping descriptions in this chapter provide models for creating a mapping for the plan types above. They are intended to demonstrate how the Data Quality Integration interacts with other PowerCenter transformations and what dependencies may apply. However, they do not represent the only ways that mappings can be configured for each type. For more information about adding plans to a Data Quality Integration transformation, see Configuring a Data Quality Integration Transformation on page 3.


Defining a Mapping to Cleanse, Parse, or Validate Data

The mapping shown in Figure 3-1 is the simplest mapping model for the Data Quality Integration transformation. It contains a passive Integration transformation and is suitable for cleansing, standardization, parsing, validation, or grouping plans. You can configure a mapping like this one with multiple Integration transformations to conduct cleansing, standardization, parsing, validation, or grouping operations in sequence in a single mapping.

Figure 3-1. Mapping Defined for Standardizing, Parsing, and Validation

The following steps define a mapping that includes a single Data Quality Integration transformation, as illustrated in Figure 3-1:

1. Add a source definition configured with your source data to the Mapping Designer workspace.
2. Add a Source Qualifier transformation. This reads the data from the source file and enables the data to be read by other transformations such as the Data Quality Integration.
3. Add a Data Quality Integration transformation to the Mapping Designer workspace. You should configure and test this Data Quality Integration transformation before running a workflow with this mapping.
4. Connect the outputs from the Source Qualifier to the input ports of the Data Quality Integration. Connect like fields. For example, connect an output port carrying name data to an input port that anticipates name data.
5. Add a Target Definition and connect the output ports from the Data Quality Integration to it.


Defining a Mapping for Data Matching in a Single Data Source

This model is suitable for a Data Quality Integration that contains a matching plan set up for a single data source. The matching plan compares every row in the dataset with every other row to identify duplicates. A matching plan is often preceded by a grouping plan, and both types of plan are included in this mapping. However, grouping plans are not essential in PowerCenter, as the Sorter transformation can act as a grouping agent. In PowerCenter, the principal advantage of a grouping plan is its ability to create custom group keys.

Note: Unlike other plan types, matching plans require an active transformation.

Figure 3-2 shows a mapping set up for single-source matching. In this example, the mapping also includes a passive Integration transformation containing a grouping plan.

Figure 3-2. Mapping Defined for Single-Source Matching

The following steps describe how to define a mapping with two Data Quality Integration transformations that respectively create group keys for a single data source and perform matching operations on the grouped data:

1. Add a source definition configured with your source data to the Mapping Designer workspace.
2. Add a Source Qualifier transformation. This reads the data from the source file and enables the data to be read by other transformations in the mapping.
   Note: If the input records lack unique identifiers, add a Sequence Generator transformation. This transformation will generate a series of incremented values for the records passed into it, creating a column of unique IDs. If the input records have unique IDs, you can omit this step.
3. Add a Data Quality Integration transformation containing a grouping plan. This plan creates candidate group key columns. You should configure and test this Data Quality Integration transformation before running a workflow with this mapping.
4. Add a Sorter transformation. Use it to sort the data from the Data Quality Integration you have just added. Set the Sorter transformation to sort the input records according to values in a suitable field. To do so, open the transformation on the Ports tab and check the Key column box for the required port name. The Data Quality Integration containing the matching plan will read its input data in groups of records with common values in this field, and perform matching operations within each group. This enhances matching operation speed without significantly impacting match results.
5. Add a Data Quality Integration transformation containing the matching plan. Select as the Grouping Port the field you set as the Key column in the Sorter transformation. You should configure and test this Data Quality Integration transformation before running a workflow with this mapping.
6. Add a Target Definition and connect the output ports from this Data Quality Integration to it.


Defining a Mapping for Data Matching Across Two Data Sources

This model is suitable for a Data Quality Integration that contains a matching plan set up for two data sources. The mapping combines the two data sources into a single dataset in which the source records are flagged A and B to indicate their dataset of origin. It then applies the aggregated data to the matching plan. Figure 3-3 illustrates a dual-source mapping that includes a grouping plan.

Figure 3-3. Mapping Designed for Matching on Two Data Sources

The following steps describe how to define a mapping with two Data Quality Integration transformations that respectively create group keys for the combined dataset and perform matching operations on the grouped data. In the Mapping Designer workspace, add two source definitions configured with the two data sources to be matched.

1. Add a Source Qualifier transformation for each source definition. These read the data from the source files and enable the data to be read by other transformations in the mapping.
2. Add two Expression transformations. Use each Expression transformation to flag the data from each source as Source A and Source B. This facilitates matching across the sources.


3. Add a Union transformation. Use this transformation to combine the Source A and Source B data into a single dataset, as required by the matching plan.
   Note: If the input records lack unique identifiers, add a Sequence Generator transformation. This transformation will generate a series of incremented values for the records passed into it, creating a column of unique IDs. If the input records have unique IDs, you can omit this step.
4. Add a Data Quality Integration transformation containing a grouping plan. This plan creates candidate group key columns. You should configure and test this Data Quality Integration transformation before running a workflow with this mapping.
5. Add a Sorter transformation. Use it to sort the data from the Data Quality Integration you have just added. Set the Sorter transformation to sort the input records according to values in a suitable field. To do so, open the transformation on the Ports tab and check the Key column box for the required port name. You can select one of the candidate group key outputs from the grouping plan or another suitable field. The Data Quality Integration containing the matching plan will read its input data in groups of records with common values in this field, and identify matches within each group. This enhances matching operation speed without significantly impacting match results.
6. Add a Data Quality Integration transformation containing the matching plan. Select as the Grouping Port the field you set as the Key column in the Sorter transformation. You should configure and test this Data Quality Integration transformation before running a workflow with this mapping.
7. Add a Target Definition and connect the output ports from this Data Quality Integration to it.


Appendix A

Working with Plans in Workbench

This appendix includes the following topics:

- Understanding Plan Complexity, 24
- Copying Plans, 26
- Editing Plans, 27
- Designing Plans for PowerCenter Use, 29


Understanding Plan Complexity

This appendix discusses strategies and methods for making changes to data quality plans in Data Quality Workbench.

Note: Do not edit data quality plans that will be used in a commercial or production environment unless you are trained in Data Quality Workbench.

Even simple plans can contain many interdependent elements. A plan is composed of a data source, a data sink (or target), and multiple operational components that can carry multiple instances of data. Figure A-1 shows the component icons for a standardization plan in the Workbench user interface:

Figure A-1. Detailed Plan Configuration, Workbench User Interface

This plan contains thirty-five components. However, these components define more than two hundred data instances within the plan. Here, an instance is a column of data that a plan component reads as input or creates and makes available to other components later in the plan. Instances can be added to or omitted from the plan data sinks. Unless an instance is selected for the plan output, it is not written to the data target and remains a potential column within the plan design. Before you change the configuration of any component, you must understand how your changes will affect the instances defined by the component and the components that use those instances downstream in the plan. If you edit one component, you may have to edit multiple components. Otherwise, you jeopardize the plan's ability to generate meaningful results and may cause the plan to fail.

What Not To Change

High-level actions that can damage a plan include the following:

- Deleting a component. Like deleting a transformation in a PowerCenter mapping, deleting a component can invalidate the plan.
- Renaming component outputs or component instances. This can make the output or instance unreadable to other components in the plan.
- Changing data source or data sink file details. This can break the connection between the plan and its data source or target.
- Toggling the Enable Realtime setting in a data source or sink. Clearing this option disables the plan for PowerCenter.

Please note the following before making any changes to plans:

- The pre-built plans that ship with Informatica Data Quality are designed for use in a black-box manner. That is, you can use these plans in PowerCenter transformations without making any changes to them or interacting with Workbench in any way.
- Do not use Workbench to modify any data quality plans that are assigned to project data unless you have completed an Informatica Data Quality training course. The risks of error when working with such plans are too high.

For more information about plan design and operation, see the Informatica Data Quality documentation that accompanies Data Quality Workbench.


Copying Plans

Before you attempt any operations in Workbench, ensure that you are not working on any plans that will be used in a live project or for commercial purposes. You should work with copies of such plans only.

To make a copy of a plan in the Data Quality repository:

1. Create a new project in Data Quality Workbench by right-clicking My Repository under the Projects tab and selecting New > Project from the context menu.
2. Copy one or more plans to this project by highlighting the plan name in the Project Manager and typing Ctrl+C.
3. Paste the plan to the project you have just created by right-clicking the project name and selecting Paste from the context menu.


Editing Plans

Any change you make to a plan, however minor, affects the output of the plan. This section looks at a number of plan configuration changes that are easy to make and that do not jeopardize the running of the plan.

Standardization Settings

You can set up a number of components in Workbench to standardize the appearance of the input data:

- You can alter the case of the data with a To Upper component.
- You can edit the actions taken by a Search Replace component. This component removes extraneous data characters such as double spaces, commas, and periods in address fields. For information about working with the Search Replace action, see the Informatica Data Quality User Guide.

Matching Output Settings

You can edit the match threshold settings in a match sink (output) component. In the CSV Match Sink, for example, you can edit the lower and upper thresholds that determine the quality of matches recognized by the Data Quality engine. Figure A-2 shows the CSV Match Sink configuration dialog box.

Figure A-2. Match Threshold Scores, CSV Match Sink Component

When Data Quality compares two data values for matching purposes, it assigns the pair of values a numerical score between 0 and 1.0, based on the degree of similarity between them according to the matching criteria applied in the plan. The higher the score, the more similar the two data values. The default lower and upper thresholds are 0.85 and 1.0. You can edit these figures to allow more or fewer matches into the plan output. For more information about CSV Match Sink settings, see Match Output Types and Cluster Information on page 33.

Note: Bear in mind that your matching plan need not identify perfect matches between data records or data values. It is often useful to look for non-exact matches. For example, a data entry error may have caused the name Barbara Jacobsen to appear as Barbara Jacobson in the data, where the two names refer to the same individual. This is why matching components have a default minimum threshold of 0.85. Moreover, a matching plan may focus on some data fields and ignore others, and apply different thresholds to each field. If your plan seeks duplicate records in a customer database, your matching plan may focus on names and telephone numbers. If the records for Barbara Jacobsen and Barbara Jacobson share the same telephone number, they are likely to be the same individual.
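As a rough illustration of how the lower threshold governs which pairs count as matches, the sketch below scores two name pairs with a stand-in similarity function from Python's standard library; the 0.85 and 1.0 figures are the defaults quoted above, and the scoring method is not the Data Quality engine's own algorithm.

    # Illustrative sketch: classify candidate pairs against the default match
    # thresholds described above. SequenceMatcher is only a stand-in scorer.
    from difflib import SequenceMatcher

    LOWER, UPPER = 0.85, 1.0  # default lower and upper match thresholds

    def score(a, b):
        # Return a similarity score between 0 and 1.0 for two values.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    pairs = [
        ("Barbara Jacobsen", "Barbara Jacobson"),  # near-duplicate from the example above
        ("Barbara Jacobsen", "Robert Delgado"),    # hypothetical non-match
    ]

    for a, b in pairs:
        s = score(a, b)
        verdict = "match" if LOWER <= s <= UPPER else "no match"
        print(f"{a} / {b}: {s:.2f} -> {verdict}")

Lowering the 0.85 figure would allow more loosely similar pairs into the plan output; raising it would keep only near-identical pairs.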

Dictionary Values

You can add or edit the values in standard dictionary files. Dictionaries are useful sources of reference for data analysis and enhancement. You can apply dictionaries to plan data to verify data accuracy or to correct or standardize variant data values. Standard dictionary files are installed into the Dictionaries folder of your Informatica Data Quality installation. These Informatica-proprietary files have the suffix .DIC and contain comma-separated text. You can view Data Quality dictionaries in the Dictionary Manager in Workbench, and you can also open and edit a dictionary file in any text editing application.

Note: You cannot edit third-party reference data.

A dictionary is organized as a table, with a column of definitive spellings for the terms in the dictionary and one or more columns for matching or variant spellings. Each dictionary term has entries in at least two fields:

- Label field. Represents the spelling that can be written back to the plan.
- Item field(s). Represent the forms of spelling that are recognized as a match for the Label in the input data.

In a text editor, label and item fields are represented on a single line as comma-separated values. You can add a new value to a dictionary by opening the file in a text editor and typing a pair of Label and Item values on the first empty row. When the dictionary is used in a data quality plan, the Data Quality engine applies the dictionary entries, including your new entry, to the data passing through all component instances to which it is assigned.
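To make the Label and Item layout concrete, the sketch below parses a few comma-separated rows and uses them to standardize variant values. The rows and field values are hypothetical examples, not entries from a shipped Informatica dictionary.

    # Illustrative sketch: parse comma-separated dictionary rows (Label first,
    # then one or more Item variants) and standardize values against them.
    dic_rows = [
        "Street,St,Str,Strt",      # hypothetical Label and Item values
        "Avenue,Ave,Av,Avnue",
    ]

    lookup = {}
    for row in dic_rows:
        label, *items = [field.strip() for field in row.split(",")]
        for item in items:
            lookup[item.lower()] = label  # variant spelling -> definitive spelling

    print(lookup.get("str", "str"))      # prints: Street
    print(lookup.get("avnue", "avnue"))  # prints: Avenue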


Designing Plans for PowerCenter Use

The main characteristic of a data quality plan that has been configured for use in a PowerCenter transformation is the real-time processing capability of its source and sink components. To work in PowerCenter, a data quality plan must be able to receive data inputs and write its outputs in real time. Of the twenty-one source and sink components in Data Quality, six can be enabled for real-time data processing:

Table A-1. Source and Sink Components with Real-time Capability

Sources             Sinks
Realtime Source     Realtime Sink
CSV Source          CSV Sink
CSV Match Source    CSV Match Sink

If your plan must work with PowerCenter, it must contain a source and sink from Table A-1. As their names suggest, the Realtime Source and Sink are designed to provide real-time data processing. You can toggle on and off the real-time processing capabilities for each CSV component in Table A-1.

Note: The CSV Dual Match Source and CSV Dual Match Sink are not real-time enabled.

Realtime Source and Sink Configuration

To configure a Realtime Source in Workbench, open its configuration dialog box and use the right-click context menu to add new input fields. When the plan containing this source is run, it anticipates a flow of data rows with the number and type of fields set in this dialog box. Figure A-3 shows the Realtime Source dialog box from a data quality plan.

Figure A-3. Realtime Source Component Configuration


The Realtime Sink works in a similar manner. Its configuration dialog box lists all the data outputs available within the plan. When you select an output in the dialog box, the plan makes available the data for that output in real time to other components or applications. Figure A-4 shows a Realtime Sink dialog box:

Figure A-4. Realtime Sink Component Configuration

Note: To work in PowerCenter, the source and sink outputs must comply with PowerCenter naming conventions. The source and sink output names can contain alphanumeric and underscore characters. They cannot contain other characters or spaces.
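A quick way to picture the rule is as a character check: only letters, digits, and underscores pass. The sketch below applies that check to a few hypothetical output names; it is an illustration of the convention, not an Informatica validation routine.

    # Illustrative sketch: check that output names use only alphanumeric and
    # underscore characters, per the naming rule above.
    import re

    VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")

    for name in ["Customer_Name_Out", "Customer Name Out", "Zip+4"]:
        verdict = "valid" if VALID_NAME.fullmatch(name) else "invalid"
        print(f"{name}: {verdict}")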


CSV Source and Sink Configuration

In a CSV component from Table A-1 on page 29, you can activate or deactivate real-time processing by checking or unchecking the Enable Realtime Processing check box in the configuration dialog box.

Figure A-5. CSV Match Sink Component Configuration

Figure A-5 shows the Workbench components from a single-source matching plan alongside the configuration dialog box for the CSV Match Source component from the plan. With the Enable Realtime Processing check box checked, this plan will accept inputs in real time only and is suitable for use in a PowerCenter transformation. With this check box unchecked, the plan will accept inputs from the file specified in the Source File field.

Note: You must specify a source file for this component, whether or not your plan will operate in real time. In a real-time setting, the plan reads the input heading information from the file specified in the Source File field and anticipates data rows matching those headings. It provides data columns under those headings to the rest of the plan. If you are defining the plan for real-time use, you must create a blank or placeholder comma-delimited file that will provide column header definitions for the plan inputs.

Grouping and Matching Considerations

Types of Matching

Informatica Data Quality performs matching operations on database and file sources. It can match the rows in a single dataset or between two datasets. In single-source matching, Data Quality compares every row in the dataset with every other row. In dual-source matching, Data Quality compares every row in dataset A with every row in dataset B. Data Quality uses different data source components in each case, using for example a CSV Match Source component for single-source matching and a CSV Dual Source component for dual-source matching.

In PowerCenter, you can use a single-source matching plan whether you are matching within one source or between two sources. To match between two sources, add the sources to a mapping and combine them in a single data flow, flagging the data rows as source A and source B according to their origin. For a detailed description of this mapping, see Defining a Mapping for Data Matching Across Two Data Sources on page 20.

Matching plans are rarely designed in isolation from other plans. Unless your input dataset is small in size, you will need to provide a means of grouping your data prior to sending it to the matching plan for processing.

Grouping in Data Quality and PowerCenter

Grouping means sorting input records based on identical values in one or more user-selected fields. When a matching plan is run on grouped data, serial matching operations are performed on a group-by-group basis, so that data records within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing time while minimizing the likelihood of missed matches in the dataset. Grouping plans behave differently in Data Quality and PowerCenter.

In Informatica Data Quality, a grouping plan creates a set of temporary files or database entries that indicate the records that belong to each group. These files or database entries can be discarded when the matching plan has run and can be recreated by re-running the grouping plan. Every time you run the grouping plan in Workbench, you overwrite the group data. As well as creating group files, a grouping plan typically creates one or more custom group key fields to facilitate accurate group definition.

In PowerCenter, pre-match grouping is not necessary, as a Sorter transformation can arrange the data records according to the values in a user-selected field.

If you are using pre-built plans from Informatica, your matching plans may be designed for use in a Data Quality Integration transformation and therefore may not create groups when run directly in Data Quality Workbench. Instead, the plan may create columns of potential group keys. You can select one of these columns as the group key in the Sorter transformation downstream in the mapping. You must select the same column as the Grouping Port in the Data Quality Integration transformation containing the matching plan. Any column containing a statistically meaningful range of values can be used as a group key column, so long as the range of values has a meaningful association with the main focus of the matching exercise. For example, if your data quality plan focuses on matching person names, you could select date of birth information as a group key, on the basis that two records with common values for name and date of birth are likely to be the same person. In such a case, a City or Town name column would be a poor choice of group key, as there may be many people with similar or identical names in a city whose records are not duplicates of one another.

In Workbench you can create composite group keys composed of data from two or more existing fields. For example, you could create a composite group key that included both date of birth and city or town of residence. For more information about defining a mapping for matching plans, see Defining a Mapping for Data Matching in a Single Data Source on page 18.
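To make the composite key idea concrete, the sketch below concatenates two fields into one group key column. The field names, separator, and records are hypothetical; in practice Workbench builds such keys with its own components rather than with code.

    # Illustrative sketch: derive a composite group key from two existing fields
    # (date of birth and city), as described above.
    records = [
        {"name": "Barbara Jacobsen", "dob": "1961-04-12", "city": "Redwood City"},
        {"name": "Barbara Jacobson", "dob": "1961-04-12", "city": "Redwood City"},
    ]

    for rec in records:
        # Both records receive the same key, so they land in the same group
        # and are compared with each other during matching.
        rec["group_key"] = f"{rec['dob']}|{rec['city'].upper()}"

    for rec in records:
        print(rec["name"], "->", rec["group_key"])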

Match Output Types and Cluster Information

When you add a matching plan to a Data Quality Integration transformation in PowerCenter, you may see output ports named ClusterID and RecordsPerCluster. The Cluster ID provides a unique identifier for the set of matching records identified by the plan. The Records per Cluster provides the number of records in the cluster. The ClusterID and RecordsPerCluster output options are set in the CSV Match Sink component in Data Quality Workbench. These outputs appear in the Data Quality Integration transformation in PowerCenter, although they are not visible in the Outputs pane of the CSV Match Sink in Workbench. They are appended to the plan's data output. To create these two fields, select the Identified Matches option in the CSV Match Sink configuration dialog box. For an illustration of this dialog box, see Figure A-2 on page 27.
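The relationship between the two ports can be sketched as follows: every record in a set of matching records shares one cluster identifier, and each record also carries the size of its set. The match sets below are hard-coded, hypothetical stand-ins for the engine's real matching output.

    # Illustrative sketch: attach ClusterID- and RecordsPerCluster-style values
    # to records that have already been grouped into matching sets.
    match_sets = [
        ["Barbara Jacobsen", "Barbara Jacobson"],  # judged to be duplicates
        ["Robert Delgado"],                        # no duplicate found
    ]

    output_rows = []
    for cluster_id, members in enumerate(match_sets, start=1):
        for name in members:
            output_rows.append(
                {"name": name, "ClusterID": cluster_id, "RecordsPerCluster": len(members)}
            )

    for row in output_rows:
        print(row)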

ClusterID Format

Data Quality Workbench and PowerCenter create ClusterID values in different ways. In Data Quality Workbench, the ClusterID values created by a matching plan are numbers that increment for each new cluster. In PowerCenter, the ClusterID value contains additional information that ensures it is unique within the system, and the value written to the ClusterID port is composed of several colon-separated fields.


Index

A
active transformations
    Data Quality Integration 4
adding
    Data Quality Integration transformations 5
    pass-through ports to Data Quality Integration transformations 11
architecture
    Data Quality Integration transformations 6

C
configuring
    Data Quality Integration transformations 10
copying
    data quality plans 26
creating
    Data Quality Integration transformations 5

D
Data Cleanse and Match
    overview 2
Data Quality Integration transformations
    adding pass-through ports 11
    adding to workspace 5
    architecture 6
    configuring 10
    designing mappings for data standardizing, parsing, and validation 17
    designing mappings for dual-source matching 20
    designing mappings for single-source matching 18
    installing 2
    overview 4
    selecting grouping ports 12
    selecting plans 11
    using in mappings 16
Data Quality plans
    copying plans 26
    editing plans 27
    using pre-built with the Data Quality Workbench 6
Data Quality repository
    verifying the connection 10
designing mappings
    for data standardizing, parsing, and validation 17
    for dual-source matching 20
    for single-source matching 18

E
editing
    Data Quality plans 27
    standardization settings 27

G
grouping ports
    selecting for Data Quality Integration transformations 12

M
mappings
    for data standardizing, parsing, and validation 17

N
North America Content Pack
    running plans in sequence 6

P
passive transformations
    Data Quality Integration 4
pass-through ports
    adding to Data Quality Integration transformations 11
plans
    running in sequence 6

R
running plans in sequence 6

S
selecting
    grouping ports in Data Quality Integration transformations 12
    plans for Data Quality Integration transformations 11
standardization settings
    editing 27

T
transformations
    active and passive 4
    Data Quality Integration transformation 4

V
verifying
    connections to the Data Quality repository 10

W
Workbench
    using with pre-built plans 6