8/10/2019 2002 11 12DeMaio DataQualityIssues
Understanding Data Quality Issues:
Finding Data Inaccuracies
Art DeMaio
Evoke Software
VP Technical Sales Support
Agenda
Why is Understanding Data Important
Methodology for Assessing Data
Defining
Weighting
Profiling
Revisiting
Finding
Addressing
Maintaining
What is Profiling
Benefits of the Assessment
What the Experts say
Information quality is not an esoteric notion; it directly affects the effectiveness
and efficiency of business processes.
Information quality also plays a major role in customer satisfaction.
- Larry P. English
What's in Your DATA
three-quarters (of participating companies) reported significant problems as
a result of defective data, with a third
failing to bill or collect receivables as a result.
- In a PricewaterhouseCoopers survey of 600 CIOs, IT directors or similar executives
What is Data Quality?
Accuracy of Content
Structure
Completeness
Timeliness
Presentation
Assessing Your Data
[Diagram: seven-step assessment cycle applied to the Source Data]
1-Define Issues
2-Weight/Impact
3-Profile Data
4-Revisit Definitions, Weights
5-Findings
6-Address
7-Maintain
Defining Issues
Standard list
Key requirements
Content
Structure
Completeness
Update list by project or source
Defining Issues - sample
Constants
Definition Mismatches
Filler Containing Data
Inconsistent Cases
Inconsistent Data Types
Inconsistent Null Rules
Invalid Keys
Invalid Values
Miscellaneous
Missing Values
Orphans
Out of Range
Pattern Exceptions
Potential Constants
Potential Defaults
Potential Duplicates
Potential Invalids
Potential Redundant Values
Potential Unused Fields
Rule Exceptions
Unused Fields
Weight Impact
After the issues are initially identified:
Some issues are more critical than others
Weights are not priorities
Assign a weighting factor (1-5)
Weighting factors SHOULD change by project
Profile Data
What does Data Profiling mean?
What is Data Profiling?
The use of analytical techniques on data for the
purpose of developing a thorough knowledge of its
content, structure and quality.
A process of developing information about data
instead of information from data.
What is Data Profiling?
Information About Data: (Data Profiling)
30% of entries in SUPPLIER_ID are blank
The range of values in UNIT_PRICE is 5.99 to 4599.99
There are 14 ORDER_HEADER rows with no ORDER_DETAIL rows
Information FROM Data: (not Data Profiling)
Texas auto buyers buy more Cadillacs per capita than any other state
The average mortgage amount increased last year by 6%
10% of last year's customers did not buy anything this year
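The three "information about data" statements on this slide can each be produced by a simple check. The following is a minimal sketch, assuming rows are held as Python dictionaries. The sample rows and the helper names (blank_rate, value_range, parents_without_children) are hypothetical and exist only for illustration; only the table and column names come from the slide.

```python
# Hypothetical sample rows; only the table and column names come from the slide.
order_headers = [
    {"ORDER_ID": 1, "SUPPLIER_ID": "S10"},
    {"ORDER_ID": 2, "SUPPLIER_ID": ""},
    {"ORDER_ID": 3, "SUPPLIER_ID": None},
    {"ORDER_ID": 4, "SUPPLIER_ID": "S11"},
]
order_details = [
    {"ORDER_ID": 1, "UNIT_PRICE": 5.99},
    {"ORDER_ID": 1, "UNIT_PRICE": 4599.99},
    {"ORDER_ID": 4, "UNIT_PRICE": 120.00},
]


def blank_rate(rows, column):
    """Fraction of rows whose value in `column` is missing or blank."""
    if not rows:
        return 0.0
    blanks = sum(
        1 for r in rows
        if r.get(column) is None or str(r[column]).strip() == ""
    )
    return blanks / len(rows)


def value_range(rows, column):
    """(min, max) of the non-missing values in `column`."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return (min(values), max(values)) if values else (None, None)


def parents_without_children(parents, children, key):
    """Parent rows whose key never appears in any child row."""
    child_keys = {c[key] for c in children}
    return [p for p in parents if p[key] not in child_keys]


print(f"{blank_rate(order_headers, 'SUPPLIER_ID'):.0%} of entries in SUPPLIER_ID are blank")
print("range of UNIT_PRICE:", value_range(order_details, "UNIT_PRICE"))
print(len(parents_without_children(order_headers, order_details, "ORDER_ID")),
      "ORDER_HEADER rows with no ORDER_DETAIL rows")
```

Each function describes the data itself (blanks, ranges, missing child rows) rather than drawing a business conclusion from it, which is the distinction the slide makes.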
Profile Data
This is a multi-step process
Collect documentation
Review the DATA itself
Compare data to documentation
Identify and detail specific issues
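The "compare data to documentation" step could be sketched as follows. This is an illustration only, not the presenter's tool: the documented column rules, the sample rows and the function compare_to_documentation are all hypothetical, and the issue labels are borrowed from the sample issue list earlier in the deck.

```python
# Hypothetical documentation: what each column is supposed to contain.
documented = {
    "SUPPLIER_ID": {"type": str, "nullable": False},
    "UNIT_PRICE": {"type": float, "nullable": False, "min": 0.01, "max": 5000.00},
}

# Hypothetical rows as actually found in the source data.
rows = [
    {"SUPPLIER_ID": "S10", "UNIT_PRICE": 5.99},
    {"SUPPLIER_ID": None, "UNIT_PRICE": 4599.99},   # missing a required value
    {"SUPPLIER_ID": "S11", "UNIT_PRICE": "12.50"},  # text where a number is documented
    {"SUPPLIER_ID": "S12", "UNIT_PRICE": 9000.00},  # above the documented maximum
]


def compare_to_documentation(rows, documented):
    """Return (row index, column, issue type) for every rule the data breaks."""
    issues = []
    for index, row in enumerate(rows):
        for column, spec in documented.items():
            value = row.get(column)
            if value is None:
                if not spec.get("nullable", True):
                    issues.append((index, column, "Missing Values"))
                continue
            if not isinstance(value, spec["type"]):
                issues.append((index, column, "Inconsistent Data Types"))
                continue
            low, high = spec.get("min"), spec.get("max")
            if (low is not None and value < low) or (high is not None and value > high):
                issues.append((index, column, "Out of Range"))
    return issues


for issue in compare_to_documentation(rows, documented):
    print(issue)
```

Recording the row and column with each issue keeps the result detailed enough to feed the later findings and cleansing steps.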
Revisit
Review the issues and weights
Should there be more or fewer issues?
What are they?
Is the relative importance of each issue different?
Findings
Your findings tell others about the data
Documented reports and/or charts
Results database
Quality Assessment Score
Findings-Chart
[Bar chart: Sample Company Issue Findings. X-axis: Issue Category; y-axis: Count of Issues (0 to 25); one bar per issue category.]
Findings-Chart
Issue Type                   Issues Discovered   Possible Issues
Constants                    1                   59
Definition Mismatches        4                   59
Filler Containing Data       1                   59
Inconsistent Cases           3                   59
Inconsistent Data Types      15                  59
Inconsistent Null Rules      6                   59
Invalid Keys                 1                   3
Invalid Values               1                   59
Miscellaneous                10                  59
Missing Values               18                  59
Orphans                      2                   2
Out of Range                 3                   59
Pattern Exceptions           10                  59
Potential Constants          1                   59
Potential Defaults           1                   59
Potential Duplicates         3                   59
Potential Invalids           4                   59
Potential Redundant Values   21                  59
Potential Unused Fields      1                   59
Rule Exceptions              3                   3
Unused Fields                1                   59
Total                        110                 1070
Raw Score 89.7%
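The slide does not state how the raw score is derived. The totals are consistent with one minus the ratio of issues discovered to possible issues, so the sketch below assumes that formula. The function name raw_score is hypothetical.

```python
def raw_score(issues_discovered, possible_issues):
    """Share of possible issues that were NOT found (assumed formula)."""
    if possible_issues <= 0:
        raise ValueError("possible_issues must be positive")
    return 1.0 - issues_discovered / possible_issues


# Totals taken from the findings table: 110 issues discovered out of 1070 possible.
print(f"Raw Score {raw_score(110, 1070):.1%}")
```

Because every issue counts equally here, a single orphaned key weighs the same as a single inconsistent case, which is the limitation the weighting factors address.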
Findings-Chart
Weight Factor   Issue Type                   Issues Discovered   Possible Issues
4               Constants                    1                   59
2               Definition Mismatches        4                   59
3               Filler Containing Data       1                   59
1               Inconsistent Cases           3                   59
2               Inconsistent Data Types      15                  59
3               Inconsistent Null Rules      6                   59
5               Invalid Keys                 1                   3
5               Invalid Values               1                   59
1               Miscellaneous                10                  59
3               Missing Values               18                  59
4               Orphans                      2                   2
5               Out of Range                 3                   59
4               Pattern Exceptions           10                  59
2               Potential Constants          1                   59
2               Potential Defaults           1                   59
1               Potential Duplicates         3                   59
3               Potential Invalids           4                   59
4               Potential Redundant Values   21                  59
3               Potential Unused Fields      1                   59
5               Rule Exceptions              3                   3
4               Unused Fields                1                   59
                Total                        110                 1070
Weighted Score 76.2%
Findings-Chart
Weight Factor                        5        4        3        2        1
Issues identified in weight factor   8        35       30       21       16
Average rate per factor              35.03%   31.19%   10.17%   8.90%    9.04%
Total Average by weight              175.1%   124.7%   30.5%    17.8%    9.0%
Weighted Issue Rate - 23.8%
Weighted Assessment Score - 76.2%
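The deck gives the weighted figures but not the formula behind them. The sketch below reproduces them under an inferred assumption: for each weight factor, average the per-issue-type rates (issues discovered divided by possible issues), multiply that average by the weight, sum the results, and divide by the sum of the weight factors (15). The names FINDINGS and weighted_issue_rate are hypothetical; the numbers are copied from the weighted findings table.

```python
from collections import defaultdict

# (weight factor, issues discovered, possible issues), one tuple per issue type,
# in the same order as the weighted findings table.
FINDINGS = [
    (4, 1, 59), (2, 4, 59), (3, 1, 59), (1, 3, 59), (2, 15, 59), (3, 6, 59),
    (5, 1, 3), (5, 1, 59), (1, 10, 59), (3, 18, 59), (4, 2, 2), (5, 3, 59),
    (4, 10, 59), (2, 1, 59), (2, 1, 59), (1, 3, 59), (3, 4, 59), (4, 21, 59),
    (3, 1, 59), (5, 3, 3), (4, 1, 59),
]


def weighted_issue_rate(findings):
    """Weight-averaged issue rate (assumed formula, see the lead-in)."""
    rates_by_weight = defaultdict(list)
    for weight, discovered, possible in findings:
        rates_by_weight[weight].append(discovered / possible)
    # Average rate per weight factor, scaled by the weight itself.
    weighted_sum = sum(
        weight * sum(rates) / len(rates)
        for weight, rates in rates_by_weight.items()
    )
    return weighted_sum / sum(rates_by_weight.keys())


rate = weighted_issue_rate(FINDINGS)
print(f"Weighted Issue Rate {rate:.1%}")
print(f"Weighted Assessment Score {1 - rate:.1%}")
```

Under this reading, the weight-5 issue types pull the score down the most: Rule Exceptions (3 of 3) and Invalid Keys (1 of 3) have high rates even though their raw counts are small.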
Address the Issues
Addressing your findings
Actual vs. Potential
Subject Matter Expertise
Cleansing Requirements
Maintain Vigilance
Maintain
Complete the cycle
Periodic review
Document score changes
Why Do The Assessment?
Quantify the quality issues
Isolate true problems
Proactive review
reduces the cost of resolving issues
reduces the risk of customer dissatisfaction
Define the scope of issues
Determine the resources required to address issues
Why Do The Assessment?
[Chart: Cost to Address an Issue (y-axis) against Project Timeline, When You Find an Issue (x-axis); curve labeled Project Costs.]
Why Should It Be Done?
TIME
Pay me now or Pay me later
When Should It Be Done?
Every IT data project
Warehousing
CRM
ERP
EAI
M&A
Ongoing based on
Criticality of the system
Current status (score)
Need to re-purpose data
Bibliography
Larry P. English, Improving Data Warehouse and Business Information Quality, John Wiley & Sons Inc., 1999
Jack Olson, Data Profiling: The Accuracy Dimension, Morgan Kaufmann, 2002
Thomas C. Redman, Data Quality for the Information Age, Artech House, 1996
PricewaterhouseCoopers, Global Data Management Survey, 2001