Data Profiling
A Quick Primer on the What and the Why of Data Integration
AUTHORS
Shankar Ganesh R, Senior Technical Architect, Architecture and Technology Services, HCL Technologies, Chennai
Sathish Kumar Srinivasan, Enterprise Data Architect, Architecture and Technology Services, HCL Technologies, Chennai
Subramanyam B S, Lead Researcher, ATS-Technical Research, HCL Technologies, Chennai
This paper examines the reasons for and the process of data profiling. It also takes a look at data profiling opportunities.
The Need for Data Profiling
(SCM), Stock Control, Logistics, and Business Intelligence (BI), to name
just a few. However, for any solution to deliver value, the data it
depends on needs to be accurate, complete, and consistent. In the Global
Data Management Survey conducted by Price Waterhouse Coopers, data is
considered the most important asset and fundamental to an organization’s
success.[1]
Despite its importance, most companies do not have detailed information
about their data. As a result, the decision to proceed with a solution like
ERP, CRM, BI or SCM is fraught with the risk of implementation delays,
cost overruns or a lower-than-expected return on investment.
- The Data Warehousing Institute (TDWI) reports that 83 per cent of organizations suffer from problems caused by poor data quality.[2]
- A Standish Group report indicates that 88 per cent of data integration projects will fail, or overrun their target budgets by 66 per cent.[3]
Companies are also not completely sure of their data quality. In another
TDWI survey, half of the respondents said the quality of their data is
“excellent” or “good,” while 44 per cent said that, in reality, the quality
of their data is “worse than everyone thinks.”[4] Rather than rely on the
perception of the individuals managing the data, companies need to
undertake a data profiling exercise.
[1] Price Waterhouse Coopers, P.18, Global Data Management Survey, 2001, http://sirnet.metamatrix.se/material/SIRNET_10/survey_01.pdf [June 2008] -> Date on which the site was accessed.
[2] TDWI: BI/DW Education Survey Finds 83 Percent of Organizations Suffering from Poor Master Data http://vendors.ittoolbox.com/profiles/tdwi-dw-professional-
Metadata validation analyzes the data and indicates, for example, whether
the field length is appropriate and whether there are fields with missing
values. Validation also helps determine whether the data collected conforms
to the original plan or deviates from it.
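As a minimal sketch of metadata validation, the Python snippet below checks each field against a declared maximum length and counts missing values. The field names, the `DECLARED_LENGTHS` metadata and the sample rows are hypothetical and invented for illustration; a profiling tool would typically read the declared metadata from the data source itself.

```python
# Hypothetical declared metadata: maximum length per field.
DECLARED_LENGTHS = {"customer_id": 6, "name": 20, "city": 15}

def validate_metadata(rows, declared_lengths):
    """Count missing values and over-length values for each declared field."""
    report = {field: {"missing": 0, "too_long": 0} for field in declared_lengths}
    for row in rows:
        for field, max_len in declared_lengths.items():
            value = row.get(field)
            if value is None or str(value).strip() == "":
                report[field]["missing"] += 1
            elif len(str(value)) > max_len:
                report[field]["too_long"] += 1
    return report

# Illustrative sample rows (not real data).
rows = [
    {"customer_id": "C00001", "name": "Asha Rao", "city": "Chennai"},
    {"customer_id": "C0000002", "name": "", "city": "Bengaluru"},
    {"customer_id": "C00003", "name": "Ravi Kumar", "city": None},
]

report = validate_metadata(rows, DECLARED_LENGTHS)
for field, counts in report.items():
    print(field, counts)
```

A non-zero count in either column points to a field whose actual content deviates from what was planned.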
Pattern Matching
Pattern matching determines if the data values in a field are consistent
across the data source and whether or not the information is in the
expected format.[7] Pattern matching also checks for other format-specific
information about the data, such as type and length.[8]
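The telephone number example in the footnote can be sketched with a regular expression. The pattern below accepts the three formats named there, (+NN) nnnnnnnnnn, (0) nnnnnnnnnn and nnnnnnnnnn; the single space after the prefix and the sample values are assumptions made for this illustration.

```python
import re

# Accepted formats from the footnote example:
# (+NN) nnnnnnnnnn, (0) nnnnnnnnnn, nnnnnnnnnn
PHONE_PATTERN = re.compile(r"(?:\(\+\d{2}\) |\(0\) )?\d{10}")

def pattern_report(values):
    """Split values into those that match the expected pattern and those that do not."""
    matched, unmatched = [], []
    for value in values:
        if PHONE_PATTERN.fullmatch(value):
            matched.append(value)
        else:
            unmatched.append(value)
    return matched, unmatched

# Illustrative values (made up for this sketch).
phones = [
    "(+91) 9876543210",
    "(0) 9876543210",
    "9876543210",
    "98765-43210",
    "+91 9876543210",
]
matched, unmatched = pattern_report(phones)
print("matched:", matched)
print("unmatched:", unmatched)
```

The unmatched list plays the role of the pattern report: it holds the values that did not fit any valid pattern.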
Basic Statistics
Basic statistics provide a snapshot of an entire data field by presenting
statistical information such as minimum and maximum values, mean,
median, mode, and standard deviation, to highlight aberrations from
normal values.[9]
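The statistics listed above can be computed with Python's standard `statistics` module. This is a rough sketch only; the order quantities are invented, with one deliberately abnormal value in the spirit of the footnote example.

```python
import statistics

def basic_statistics(values):
    """Return the summary statistics named in the text for one numeric field."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Illustrative order quantities; the last value mimics an abnormal order.
order_units = [500, 650, 700, 700, 820, 950, 1000, 10000]
summary = basic_statistics(order_units)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A maximum far above the median, together with a large standard deviation, is the kind of aberration the snapshot is meant to expose.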
Data Discovery
The second step in the data profiling process is data discovery. Data
discovery examines individual data elements to investigate the problem
areas indicated by structure discovery. Data discovery techniques use:
- Matching technology to uncover non-standard data
- Frequency counts and outlier detection to find data elements that don’t make sense
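The two techniques above might look as follows in Python. The text does not say which outlier method profiling tools use, so the interquartile-range rule here is an assumption chosen for simplicity, and the sample values are invented.

```python
from collections import Counter
import statistics

def frequency_counts(values):
    """Count how often each distinct value occurs, most common first."""
    return Counter(values).most_common()

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Illustrative data: one state recorded in several non-standard ways.
states = ["TN", "TN", "KA", "Tamil Nadu", "TN", "KA", "T.N."]
order_units = [500, 650, 700, 700, 820, 950, 1000, 10000]

freq = frequency_counts(states)
outliers = iqr_outliers(order_units)
print(freq)
print(outliers)
```

Rare values in the frequency count, such as a spelled-out or punctuated variant of a common code, are candidates for non-standard data, while the outlier list holds values that need checking before they are trusted.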
Standardization
Data in an organization comes from different sources: consumers,
different departments and partners. Standardization helps discover
inconsistencies in the data and then provides a solution to address and fix
[7] For example, a valid mobile telephone number in India could be entered in the database in the format (+NN) nnnnnnnnnn, (0) nnnnnnnnnn or nnnnnnnnnn, where NN is the numeric code for the country and n is a digit between 0 and 9. If a phone number is entered in a different format, the pattern report will indicate that the telephone number did not match a valid telephone number pattern.
[8] Dorr, Brett and Herbert, Pat, P.4, Data Profiling: Designing the Blueprint for Improved Data Quality, http://www2.sas.com/proceedings/sugi30/102-30.pdf [June 2008]
[9] For example, if customer orders range between 500 and 1000 units, an order of 10000 units would be considered abnormal and validated prior to its being entered into the system.
Many companies still do not have a single view of the customer. Having a
single view enables a company to:
- Obtain a precise understanding of all the business that the company is conducting with customers, across multiple units and product lines
- Identify cross-selling opportunities
- Create targeted marketing campaigns
Some examples of data profiling are given below.
Example 1
Supply chains depend on effective procurement processes and accurate
procurement information. A single database that contains details about
suppliers and the items that they sell increases efficiency.
Data profiling helps integrate supplier details with information about the
items that they sell. This improves immediate efficiencies and facilitates
the consolidation and integration of different processes and systems.
Example 2
Data repetition in data sources is common. For example, in banking,
insurance or retail, an account holder’s name can be recorded
as FirstName MiddleName LastName; FirstName M LastName; F M
LastName and so on.
Data profiling traces and removes such repetitions to improve data quality
and enhance business intelligence, thereby enabling better customer
experience and profitability.
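One way to trace such repetitions is to reduce each name to a matching key and group the records that share it. The sketch below uses a deliberately crude key, first initial plus last name, which is an assumption made for illustration and not a description of how any particular tool matches names. The names are invented and are assumed to be non-empty.

```python
from collections import defaultdict

def match_key(full_name):
    """Build a crude matching key: first initial plus last name, lower-cased."""
    parts = full_name.replace(".", " ").split()
    return (parts[0][0] + " " + parts[-1]).lower()

def group_candidate_duplicates(names):
    """Group names that share a matching key; keep only groups of two or more."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}

# Illustrative account-holder names (made up for this sketch).
names = ["Ravi Shankar Kumar", "Ravi S Kumar", "R S Kumar", "Anita Desai"]
duplicates = group_candidate_duplicates(names)
print(duplicates)
```

Because a key this loose would also group different people who share an initial and a last name, the groups are best treated as candidates for review rather than records to delete automatically.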
Example 3
Databases provide assorted customer-related information, such as the
types of products sold, product profitability and customer profitability.
Critical business decisions depend on the accuracy of information in these
databases.
Data Profiling provides a single view of the customer. It helps understand the gamut of company-customer transactions, identify cross-selling opportunities and then aids in creating targeted marketing campaigns.
For example, a credit card company or a telecom company can use data
profiling to create customer profiles. These customer profiles could help
the company customize products for specific individuals or groups.
Information in the customer profile about the individual’s payment
behavior enables the company to monitor its overall risk portfolio and
enhance an individual’s credit limit.
Data Profiling Tools
Data profiling is generally done by using specific software tools designed
for the purpose rather than using statistical tools. Table 1 compares
statistical tools with data profiling tools and illustrates the advantages of
using data profiling tools.
Statistical Tools | Data Profiling Tools
Must formulate a large number of queries and/or reports in order to test rules against the data | Addresses all the stages of data profiling
Execution is slow since rules are executed serially | Processes a large amount of data in a short period of time
Cannot discover rules and users do not understand the actual structure or content of the data without discovery | Includes discovery processes
Use of validation processes alone will result in not discovering issues | Includes automatic discovery and validation processes
Table 1: Statistical Tools vs. Data Profiling Tools
An effective data profiling tool addresses the following three phases[15]:
- Initial profiling and data assessment
- Integration of profiling into automated processes
- Passing profiling results to data quality and data integration processes
The data profiling solution[16] should also aid in constructing data correction,
validation, and verification routines directly from the profiling reports.
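As a rough illustration of building a validation routine from a profiling report, the sketch below turns a profiled minimum and maximum into a range check. The `order_profile` dictionary is a hypothetical stand-in for a tool's report, using the 500 to 1000 unit range from the earlier order example; commercial tools will expose their results differently.

```python
def build_range_validator(profile):
    """Return a function that accepts values inside the profiled min/max range."""
    low, high = profile["min"], profile["max"]

    def is_valid(value):
        return low <= value <= high

    return is_valid

# Hypothetical profiling result for an order-quantity field.
order_profile = {"min": 500, "max": 1000}
is_valid_order = build_range_validator(order_profile)

print(is_valid_order(750))    # inside the profiled range
print(is_valid_order(10000))  # outside the profiled range
```

A check like this can then run inside an automated load process, so that values outside the profiled range are held for verification before they enter the system.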
[15] P.9, Ibid 8
[16] Some well-known data profiling tools are Trillium Software from Harte-Hanks, DataFlux from DataFlux Corporation, Data Insight XI from Business Objects, Information Analyzer from IBM and Data Explorer from Informatica. For more information you can visit http://mediaproducts.gartner.com/reprints/businessobjects/149359.html [August 2008]