Data Quality Concepts
Agenda
• Data Quality Concepts (3 hrs 15 mins)
• Introduction to Data Quality XI R2 (45 mins)
• Using the Project Architect (2 hrs)
• Using Transforms (10 hrs)
• Matching and Consolidating Records (10 hrs)
What is Data Quality?
Data quality is the degree to which data are fit for use.
• Data are of high quality if they are fit for their intended uses in
operations, decision making and planning
• It is the state of
– completeness,
– validity,
– consistency,
– timeliness and
– accuracy
that makes data appropriate for a specific use
Why Data Quality?
• Companies often cannot rely on the information that serves as the very
foundation of their primary business applications
• Inaccurate or inconsistent data can hinder a company's ability to
understand its current – and future – business problems
• This leads to poor decisions that can cause a host of negative results,
including lost profits, operational delays, customer dissatisfaction and
much more
• In short, the effectiveness and quality of decision making are limited by
the quality of the data on which it is based
Poor Data Quality Leads to
• Inability to compare data from different sources
• Data entered into the wrong fields
• Lack of consistent data definitions
• Inability to consolidate data from multiple sources
• Inability to track data across time
• Inability to comply with government regulations
• Delayed or rejected reimbursement from third-party providers
• Inability to determine important relationships
Examples
T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000
• Can we interpret the data?
– What do the fields mean?
– What is the key? The measures?
• Data glitches
– Typos, multiple formats, missing / default values
• Metadata and domain expertise
– Field 3 is Revenue. In dollars or cents?
– Field 7 is Usage. Is it censored?
– Field 4 is a censored flag. How do we handle censored data?
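To make the interpretation problem concrete, here is a minimal Python sketch that splits the two sample records into named fields. The field names are assumptions inferred from the notes above (field 3 = revenue, field 4 = censored flag, field 7 = usage), not a documented layout.

```python
# A minimal sketch (assumed field layout): split each sample record on '|'
# and flag obviously suspect values.
FIELDS = ["name", "phone", "revenue", "censored", "gender", "state", "usage"]

records = [
    "T.Das|97336o8327|24.95|Y|-|0.0|1000",
    "Ted J.|973-360-8779|2000|N|M|NY|1000",
]

for rec in records:
    row = dict(zip(FIELDS, rec.split("|")))
    # One glitch surfaces immediately: '97336o8327' contains a letter 'o'.
    if not row["phone"].replace("-", "").isdigit():
        print(f"suspect phone for {row['name']}: {row['phone']}")
    print(row)
```

Even this trivial pass leaves the metadata questions open: the code can flag the typo, but only domain expertise can say whether revenue is in dollars or cents.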
Data Glitches
• Systemic changes to data that are external to the recorded process
– Changes in data layout / data types
• Integer becomes string, fields swap positions, etc.
– Changes in scale / format
• Dollars vs. euros
– Temporary reversion to defaults
• Failure of a processing step
– Missing and default values
• Application programs do not handle NULL values well …
– Gaps in time series
• Especially when records represent incremental changes
Meaning of Data Quality
• There are many types of data, which have different uses and typical
quality problems
– Federated data
– High-dimensional data
– Descriptive data
– Longitudinal data
– Streaming data
– Web (scraped) data
– Numeric vs. categorical vs. text data
Meaning of Data Quality
• There are many uses of data
– Operations
– Aggregate analysis
– Customer relations …
• Data interpretation: the data is useless if we don't know all of the rules
behind the data
• Data suitability: can you get the answer from the available data?
– Use of proxy data
– Relevant data is missing
Data Quality Constraints
• Many data quality problems can be captured by static constraints based
on the schema
– Nulls not allowed, field domains, foreign key constraints, etc.
• Many others are due to problems in workflow, and can be captured by
dynamic constraints
– E.g., orders above $200 are processed by Biller 2
• The constraints follow an 80-20 rule
– A few constraints capture most cases; thousands of constraints are
needed to capture the last few cases
• Constraints are measurable. Can they serve as data quality metrics?
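As an illustration of the two constraint families, here is a minimal Python sketch over a hypothetical orders table; the $200/Biller 2 rule is the dynamic constraint from the slide, and the NOT NULL rule stands in for the static, schema-based kind.

```python
# A minimal sketch of static vs. dynamic constraint checks, run against a
# hypothetical orders table held as a list of dicts.
orders = [
    {"order_id": 1, "amount": 250.0, "biller": "Biller 1", "customer_id": 7},
    {"order_id": 2, "amount": 99.0,  "biller": "Biller 1", "customer_id": None},
]

# Static (schema-based) constraint: customer_id must not be NULL.
null_violations = [o for o in orders if o["customer_id"] is None]

# Dynamic (workflow) constraint from the slide: orders above $200 are
# processed by Biller 2.
routing_violations = [
    o for o in orders if o["amount"] > 200 and o["biller"] != "Biller 2"
]

print(f"{len(null_violations)} NULL violations, "
      f"{len(routing_violations)} routing violations")
```

Because each check simply counts violating rows, the counts themselves are measurable and can feed the metrics discussed next.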
Data Quality Metrics
• We want a measurable quantity that
– indicates what is wrong and how to improve it
– acknowledges that DQ is a messy problem; no set of numbers will be perfect
• Types of metrics
– Static vs. dynamic constraints
– Operational vs. diagnostic
• Metrics should be directionally correct: they should improve as the data
becomes more usable
• A very large number of metrics is possible
– Choose the most important ones
Examples of Data Quality Metrics
• Conformance to schema
– Evaluate constraints on a snapshot
• Conformance to business rules
– Evaluate constraints on changes in the database
• Accuracy
– Perform an inventory (expensive), or use a proxy (track complaints).
Audit samples?
• Accessibility
• Interpretability
• Glitches in analysis
• Successful completion of the end-to-end process
Typical Data Quality Business Drivers
• Inability to make solid business decisions due to lack of trust in the data
driving business intelligence efforts
• Failed business and marketing programs that were based on poor data
• Inability to target the best customers and suppliers
• Increased costs due to undeliverable products, returned direct-marketing
pieces, and bills/invoices
• Compliance concerns and the legal and financial risks of reporting and
acting on bad data
• Rework or process delays due to duplicate or incorrect data within
enterprise systems
• Decline in customer satisfaction and perceptions
Data Quality Issues
• Data content errors
• Missing data
• Invalid data
• Data that is significantly different from all other data
• Multiple formats for the same data elements
• Different meanings for the same code value
• Multiple code values with the same meaning
• Field overuse: a field used for an unintended purpose
• Data in filler
• Errors introduced during the ETL migration process
• Normalization inconsistencies
• Duplicate or lost data
• Data structure problems
Categories of Data Quality Problems
• Accuracy
• Objectivity
• Believability
• Reputation
• Relevancy
• Value-added
• Timeliness (currency)
• Completeness
• Amount of information
• Interpretability
• Ease of understanding
• Consistent representation
• Concise representation
• Access
• Security
Where Does Data Quality Fit in the EDW?

[Diagram: source systems (Network, RDBMS, CRM, ERP, Mainframe DBs, PC DBs)
feed an Extraction step into the Staging Area (cleansing, transformation,
validation, massaging), then the ODS/DW, then Data Marts (aggregation,
summarization, data mart population, dimension and fact loading), and
finally client browsers (reports, cubes, analysis, data mining, dashboards,
MIS reports, company quarterly reports, etc.). Data quality can be applied
at three points: Option 1 at the sources, Option 2 in the staging area,
Option 3 in the data warehouse.]
Option 1
• Data quality is performed at the data source itself, and the result is
overwritten on the source
• This is a good option, as the data stays in sync throughout the EDW
• Reports generated from the source will be in sync with reports out of the
data warehouse
• The only drawback is that data quality needs to be performed at every
source system separately, and standardization must still be done during
ETL
Option 2
• Data quality is performed during ETL, and the result is stored in the
staging area
• This is the most appropriate place to perform data quality
• Data from all possible sources of the EDW can be cleansed,
standardized and consolidated at one time
• No separate standardization needs to be done
• Clean data reduces the ETL effort, as the data is already cleansed and
fewer records fail during ETL
Option 3
• Data quality is performed at the data warehouse, and the result is
overwritten on the data warehouse itself
• This is not a recommended option, as the data is stored in a highly
de-normalized format
• The DW also stores historic data, so the amount of data on which to
perform data quality is very high
• Incorrect data will enter the DW and be cleansed only at a later stage
• Erroneous/duplicate records must be deleted from the DW after the
data quality operation is performed
Data Quality Process
Data Profiling
• Before improving the quality of data, it is imperative to assess its
current quality
• Data profiling includes:
– Setting data quality goals
– Creating a data quality strategy
– Measuring data defects
– Analyzing the cause and impact of those defects
– Reporting the results to key stakeholders
Assessing Data
The assessment of source data follows a seven-step cycle:
1. Define issues
2. Weight / impact
3. Profile data
4. Revisit definitions and weights
5. Findings
6. Address
7. Maintain
Pre-requisites for Data Profiling - Defining Issues
• Standard list
• Key requirements
– Content
– Structure
– Completeness
• Update the list by project or source
Pre-requisites for Data Profiling - Defining Issues Sample
Constants
Definition Mismatches
Filler Containing Data
Inconsistent Cases
Inconsistent Data Types
Inconsistent Null Rules
Invalid Keys
Invalid Values
Miscellaneous
Missing Values
Orphans
Out of Range
Pattern Exceptions
Potential Constants
Potential Defaults
Potential Duplicates
Potential Invalids
Potential Redundant Values
Potential Unused Fields
Rule Exceptions
Unused Fields
Pre-requisites for Data Profiling - Weight / Impact

• After the issues are initially identified:
– Some issues are more critical than others
– Weights are not priorities
– Assign a weighting factor (1-5)
– Weighting factors SHOULD change by project
Profile Data
• What does Data Profiling mean?
What is Data Profiling?
• The use of analytical techniques on data for the purpose of developing a
thorough knowledge of its content, structure and quality
• A process of developing information ABOUT data instead of information
FROM data
• This is a multi-step process:
– Collect documentation
– Review the DATA itself
– Compare data to documentation
– Identify and detail specific issues
Data Profiling Sample
• Information ABOUT data (data profiling):
– 30% of entries in SUPPLIER_ID are blank
– The range of values in UNIT_PRICE is 5.99 to 4599.99
– There are 14 ORDER_HEADER rows with no ORDER_DETAIL rows
• Information FROM data (not data profiling):
– Texas auto buyers buy more Cadillacs per capita than buyers in any
other state
– The average mortgage amount increased last year by 6%
– 10% of last year's customers did not buy anything this year
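A minimal Python sketch of profiling checks in the same spirit, run against a tiny hypothetical SUPPLIER/ORDER data set; the sample rows and column names are illustrative only, not output of any particular tool.

```python
# A minimal profiling sketch producing information ABOUT the data:
# blank rates, value ranges, and orphan parent rows.
from collections import Counter

suppliers = [{"supplier_id": None, "unit_price": 5.99},
             {"supplier_id": "S1", "unit_price": 4599.99},
             {"supplier_id": None, "unit_price": 19.50}]

blank = sum(1 for r in suppliers if r["supplier_id"] in (None, ""))
print(f"{blank / len(suppliers):.0%} of SUPPLIER_ID entries are blank")

prices = [r["unit_price"] for r in suppliers]
print(f"UNIT_PRICE ranges from {min(prices)} to {max(prices)}")

# Orphan check: ORDER_HEADER rows with no ORDER_DETAIL rows.
headers = [101, 102, 103]
detail_keys = Counter([101, 101, 103])
orphans = [h for h in headers if h not in detail_keys]
print(f"{len(orphans)} ORDER_HEADER rows have no ORDER_DETAIL rows")
```

Note that every statement is about the data itself; nothing here draws a business conclusion from it.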
Data Profiling Process
• Inspecting the data for compliance with business rules
• Comparing heterogeneous data sources
• Discovering any defects and measuring their impact on your business
• Reporting findings to stakeholders
• Communicating business rules to be used in cleansing
• Automating all of the above to provide continuous monitoring

Data profiling covers four areas:
– Data Profile: performs summary, frequency, completeness, uniqueness,
and redundancy profiling
– Structural Integrity: tests for unique primary keys, foreign keys, and
foreign key parents
– Validity: using your business rules, indicates which fields contain
invalid values
– Business Rule Compliance: tests unique and inferred primary keys,
foreign keys, and inferred rules/relationships
Data Profiling
• Data profiling tools scan every single record in every single column and
table in a source system
• They generate the following:
– Lists of data values
– Statistics
– Charts
– New structures
– Ranges and distributions of values in each column
– Relationships between columns
– Drill-downs from summary views
– Other operations
Benefits of Data Profiling
• Evaluate more data in less time
• Generate more information, such as charts
• Some tools create appropriate data cleansing rules as well
• Evaluate 100 percent of the data for accuracy and completeness, rather
than a sample
• Used to "audit" the cleanliness of existing databases [e.g. to find missing
or duplicate values]
• Exposes inconsistent business processes [e.g. each unit uses different
product codes]
• Drill down from summary views
• Mitigates the risk posed by poor data quality
• Enables effective decision making by delivering trustworthy data
Post Data Profiling - Revisit

• Review the issues and weights
– Should there be more or fewer issues?
– What are they?
– Is the relative importance of each issue different?
Post Data Profiling - Findings

• Your findings tell others about the data
– Documented reports and/or charts
– Results database
– Quality assessment score
Findings - Chart

[Bar chart: "Sample Company Issue Findings", showing the count of issues
found in each issue category, for the categories listed in the sample above.]
Findings - Chart

Issue Type                   Issues Discovered   Possible Issues
Constants                            1                 59
Definition Mismatches                4                 59
Filler Containing Data               1                 59
Inconsistent Cases                   3                 59
Inconsistent Data Types             15                 59
Inconsistent Null Rules              6                 59
Invalid Keys                         1                  3
Invalid Values                       1                 59
Miscellaneous                       10                 59
Missing Values                      18                 59
Orphans                              2                  2
Out of Range                         3                 59
Pattern Exceptions                  10                 59
Potential Constants                  1                 59
Potential Defaults                   1                 59
Potential Duplicates                 3                 59
Potential Invalids                   4                 59
Potential Redundant Values          21                 59
Potential Unused Fields              1                 59
Rule Exceptions                      3                  3
Unused Fields                        1                 59
Total                              110               1070

Raw Score: 89.7%
Findings - Chart

Weight Factor   Issue Type                   Issues Discovered   Possible Issues
4               Constants                            1                 59
2               Definition Mismatches                4                 59
3               Filler Containing Data               1                 59
1               Inconsistent Cases                   3                 59
2               Inconsistent Data Types             15                 59
3               Inconsistent Null Rules              6                 59
5               Invalid Keys                         1                  3
5               Invalid Values                       1                 59
1               Miscellaneous                       10                 59
3               Missing Values                      18                 59
4               Orphans                              2                  2
5               Out of Range                         3                 59
4               Pattern Exceptions                  10                 59
2               Potential Constants                  1                 59
2               Potential Defaults                   1                 59
1               Potential Duplicates                 3                 59
3               Potential Invalids                   4                 59
4               Potential Redundant Values          21                 59
3               Potential Unused Fields              1                 59
5               Rule Exceptions                      3                  3
4               Unused Fields                        1                 59
                Total                              110               1070

Weighted Score: 76.2%
Findings - Chart

Weight Factor                          5        4        3        2       1
Issues identified in weight factor     8       35       30       21      16
Average rate per factor            35.03%   31.19%   10.17%    8.90%   9.04%
Total average by weight            175.1%   124.7%    30.5%    17.8%    9.0%

Weighted Issue Rate: 23.8%
Weighted Assessment Score: 76.2%
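The scoring arithmetic above can be reproduced directly. The sketch below uses the figures from the weighted table; the formula is inferred from the numbers shown (it reproduces both 89.7% and 76.2%): the raw score is one minus total discovered over total possible, and the weighted score averages the per-type discovery rates within each weight factor, scales by the factor, and normalizes by the sum of the factors (15).

```python
# A minimal sketch reproducing the assessment scores from the tables.
# Each entry: (issue type, weight, issues discovered, possible issues).
ISSUES = [
    ("Constants", 4, 1, 59), ("Definition Mismatches", 2, 4, 59),
    ("Filler Containing Data", 3, 1, 59), ("Inconsistent Cases", 1, 3, 59),
    ("Inconsistent Data Types", 2, 15, 59), ("Inconsistent Null Rules", 3, 6, 59),
    ("Invalid Keys", 5, 1, 3), ("Invalid Values", 5, 1, 59),
    ("Miscellaneous", 1, 10, 59), ("Missing Values", 3, 18, 59),
    ("Orphans", 4, 2, 2), ("Out of Range", 5, 3, 59),
    ("Pattern Exceptions", 4, 10, 59), ("Potential Constants", 2, 1, 59),
    ("Potential Defaults", 2, 1, 59), ("Potential Duplicates", 1, 3, 59),
    ("Potential Invalids", 3, 4, 59), ("Potential Redundant Values", 4, 21, 59),
    ("Potential Unused Fields", 3, 1, 59), ("Rule Exceptions", 5, 3, 3),
    ("Unused Fields", 4, 1, 59),
]

# Raw score: 1 - total discovered / total possible  ->  89.7%
raw = 1 - sum(d for _, _, d, _ in ISSUES) / sum(p for _, _, _, p in ISSUES)

# Weighted score: average the per-type discovery rates within each weight
# factor, scale by the factor, and divide by the sum of the factors (15).
weighted_rate = 0.0
for w in range(1, 6):
    rates = [d / p for _, wt, d, p in ISSUES if wt == w]
    weighted_rate += w * (sum(rates) / len(rates))
weighted = 1 - weighted_rate / sum(range(1, 6))

print(f"raw score {raw:.1%}, weighted score {weighted:.1%}")  # 89.7%, 76.2%
```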
7 December 2012 41
Post Data Profiling - Address the Issues

• Addressing your findings
– Actual vs. potential issues
– Subject matter expertise
– Cleansing requirements
Post Data Profiling - Maintain Vigilance

• Maintain
– Complete the cycle
– Periodic review
– Document score changes
Why Do The Assessment?
• Quantify the quality issues
• Isolate true problems
• Proactive review
– reduces the cost of resolving issues
– reduces the risk of customer dissatisfaction
• Define the scope of issues
• Determine the resources required to address issues
Why Do The Assessment?

[Chart: the cost to address an issue rises steeply over the project
timeline; the later in the project an issue is found, the more it costs
to address.]
Data Assessment Drives Cleansing

[Diagram: data assessment analysis (analysis functions, sharing results)
drives data quality cleansing, e.g. address validation.]
Data Cleansing
• Data cleansing is also called data scrubbing
• It is the process of amending or removing data in a database that is
incorrect, incomplete, improperly formatted, or duplicated
• An organization in a data-intensive field such as banking or insurance
might use a data scrubbing tool to systematically examine data for flaws
using rules, algorithms, and look-up tables
• Typically, a database scrubbing tool includes programs capable of
correcting a number of specific types of mistakes, such as adding
missing zip codes or finding duplicate records
Data Cleansing (Customer Data)
• Cleanses and standardizes customer data such as names, addresses,
emails, phone numbers, SSNs, and dates
• Manages international data for over 190 countries and reads and writes
Unicode data
• Removes errors to uncover the true content of the database
• Improves the integrity of data to identify matches and ultimately create
a single customer view
Data Cleansing (Customer Data)

Input record:
Maggie.kline@future_electronics.com
Margaret Smith-Kline phd
FUTURE Electronics
5/23/03
101 6th ave
manhattan
ny
10012
001124367

Output record:
Salutation: Ms.
First name: Margaret
Last name: Smith-Kline
Postname: Ph. D.
Match standards: Maggie, Peg, Peggy
Gender: Strong Female
Company name: Future Electronics
Address 1: 101 Avenue of the Americas
City: New York
State: NY
ZIP+4: 10013-1933
Email: maggie.kline@future_electronics.com
SSN: 001-12-4367
Date: May 23, 2003
Data Cleansing (Operational Data)
• Parses and standardizes business data
– Such as account numbers, product codes, product descriptions,
purchase dates, part numbers, SKUs, etc.
• Utilizes a rule-based parsing and rule-editing architecture for even
greater customized results
• Provides a GUI that allows users to determine how their data is parsed
and to evaluate the impact of their customized changes
Data Cleansing (Operational Data)
Description                                      Product      Dimension          Type       Form
Kallkyle screw                                   screw                                      Kallkyle
test steel plate 20 x 35 mm                      plate        20 x 35 mm         steel      test
wire 23.33 x 40.50 cm plain                      wire         23.33 x 40.50 cm              plain
diagonal wireless transmitter, frequency 23.49   transmitter                     wireless
34 x 60 mm steel plate                           plate        34 x 60 mm         steel
steel plate 34,0 60 mm                           plate        34 x 60 mm         steel
34.0 x 60,0 mm steel plate                       plate        34 x 60 mm         steel
34 x 60 mm steel plate ?                         plate        34 x 60 mm         steel
plate                                            plate
plate                                            plate
steel plate                                      plate                           steel
steel plate                                      plate                           steel
Data Cleansing – Parsing Data
• Parsing is the placement of various data elements into appropriate fields
• It is a vital step for the cleansing and matching stages
• It may also include rearranging data elements in a single field or moving
elements to multiple, more discrete fields
• It may also include removing unwanted characters, words, or spaces
from your data
• Breaking data into more manageable components increases the
reliability of correction techniques
Data Cleansing – Parsing Data
• Parsing rules can be based on
– the type of data,
– clues found within the data itself, or
– a library of common data patterns
• Typically, DQ technology includes pre-built vocabularies, grammars and a
host of modifiable expression files that help parse data efficiently and
correctly
Data Cleansing – Parsing Data

The example below shows how parsing identifies and isolates individual
elements from an input record.

Input field:
Mr. Tom J. Jones, Jr., CPA, Account Mgr.

Parsed output fields:
Prename: Mr.
First Name: Tom
Middle Name: J.
Last Name: Jones
Maturity Postname: Jr.
Other Postname: CPA
Title: Account Mgr.
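A minimal rule-based parsing sketch in Python in the spirit of this example; the prename/postname token lists are illustrative assumptions, not the product's actual vocabularies or grammar files.

```python
# A minimal name-parsing sketch using small assumed vocabularies.
PRENAMES = {"Mr.", "Ms.", "Mrs.", "Dr."}
MATURITY = {"Jr.", "Sr.", "II", "III"}
POSTNAMES = {"CPA", "Ph.D.", "MD"}

def parse_name(raw: str) -> dict:
    parts = [p.strip() for p in raw.split(",")]
    tokens = parts[0].split()
    parsed = {}
    if tokens and tokens[0] in PRENAMES:
        parsed["prename"] = tokens.pop(0)
    parsed["first"], parsed["last"] = tokens[0], tokens[-1]
    if len(tokens) > 2:
        parsed["middle"] = " ".join(tokens[1:-1])
    for extra in parts[1:]:           # classify the comma-separated suffixes
        if extra in MATURITY:
            parsed["maturity_postname"] = extra
        elif extra in POSTNAMES:
            parsed["other_postname"] = extra
        else:
            parsed["title"] = extra
    return parsed

print(parse_name("Mr. Tom J. Jones, Jr., CPA, Account Mgr."))
# {'prename': 'Mr.', 'first': 'Tom', 'last': 'Jones', 'middle': 'J.',
#  'maturity_postname': 'Jr.', 'other_postname': 'CPA', 'title': 'Account Mgr.'}
```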
7 December 2012 55
Data Cleansing – Standardizing Data
• Data that have not been assessed for quality show multiple permutations
and other anomalies
• Standardization creates a uniform nomenclature for common records
• Example:
ACME Manufacturing Corporation
Acme Mftg Corp
ACME
ACME Manufacturing
• In a standardization scheme, the complete data is changed to a
standardized format
• Once done, you get the complete picture of the relationship with the
organization (here ACME Manufacturing Corporation)
• This is because all permutations have now been standardized to one
naming convention
Data Cleansing – Standardizing Data
• Another example: if the following are all representations of the top
officer in a company, they can all be standardized to a single value,
such as "CEO":
– President
– Owner
– Chief Executive Officer
– CEO
– C.E.O.
– President/Owner
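A minimal standardization sketch: a lookup table that maps every observed permutation to one standard value. The mapping itself is an illustrative assumption.

```python
# A minimal sketch: normalize case/spacing, then map known permutations
# of the top-officer title to a single standard value.
TITLE_STANDARDS = {
    "president": "CEO",
    "owner": "CEO",
    "chief executive officer": "CEO",
    "ceo": "CEO",
    "c.e.o.": "CEO",
    "president/owner": "CEO",
}

def standardize_title(raw: str) -> str:
    # Unknown values pass through unchanged for later review.
    return TITLE_STANDARDS.get(raw.strip().lower(), raw)

print(standardize_title("C.E.O."))           # CEO
print(standardize_title("President/Owner"))  # CEO
```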
7 December 2012 57
Data Cleansing – Standardizing Data
• To make records more consistent, you can standardize date formats,
greetings, case and punctuation
• E.g.

Input record:
Purchase order: PO123456
Date of purchase: 030106
Description: wire rope, 3'' diameter, 1

Output record:
Purchase order: 12-3456
Date of purchase: 03-01-06
Description: Wire Rope
Diameter: 3
Quantity: 1
Data Cleansing – Standardizing Data
Real life
Example of
customer data
analysis going
incorrect due to
lack of
standardization
7 December 2012 59
Data Cleansing – Cleansing Data
• Takes incorrect or erroneous data as input
• Applies a series of transformations to obtain correct and complete data
as the output
• Depending on the data type, it may also be possible to compare the
value of a data element to a known list of possible values and resolve
incomplete data to one of the known values
• It is also possible to append additional data or insert incomplete or
missing data
Data Cleansing – Cleansing Data

• Example: here the address is corrected, the city is appended, and the
state name is corrected by comparing the input record to directories/
dictionaries to obtain the correct values.

Input record:
Tom J. Jones
101 6th Avenue
ny

Output record:
Salutation: Mr.
First Name: Tom
Last Name: Jones
Address: 101 Avenue of the Americas
City: New York
State: NY
Data Enhancement
• Data enhancement is the process of appending additional data
• For example,
– credit ratings,
– demographics,
– geocoding information,
– email addresses, etc.
are appended to existing data in order to increase the overall utility of
the input record
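A minimal enhancement sketch: merge appended fields from a reference source into the record. The geocoding directory and its contents here are hypothetical sample data.

```python
# A minimal sketch: enhancement appends reference fields to a record.
# GEO_DIRECTORY is a hypothetical lookup keyed by ZIP+4.
GEO_DIRECTORY = {
    "10013-1933": {"latitude": 40.722970, "longitude": -74.005035,
                   "fips_county": "061"},
}

def enhance(record: dict) -> dict:
    extra = GEO_DIRECTORY.get(record.get("zip4", ""), {})
    return {**record, **extra}  # original fields plus the appended ones

customer = {"name": "Margaret Smith-Kline", "zip4": "10013-1933"}
print(enhance(customer))
```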
Data Enhancement
• Completes records with directory information by appending name,
address, phone number, or email address
• Provides geocoding information append capabilities for geographic and
demographic marketing initiatives
• Provides geospatial assignment (FIPS codes) of customer addresses for
tax jurisdictions, insurance rating territories, insurance hazards, etc.
Data Enhancement

Example of directory, geocoding and geospatial information appended to a
record containing an address:

Input record:
Margaret Smith-Kline, Ph.D.
Future Electronics
101 Avenue of the Americas
New York, NY 10013-1933

Appended information:
Phone: (222) 922-9922
Latitude: 40.722970
Longitude: -74.005035
Match quality: Highest quality address
FIPS Code: State: 36 New York
FIPS Code: County: 061 New York
FIPS Code: Place: 51000 New York
Special District: No
City Type: City
Class Code: C1
Incorporation Flag: 1
Taxing Authority Name: New York
Taxing Authority FIPS Code: 3606151000
Taxing Authority Remittance: 3600000000
Census Tract ID: 360610051001.01
Block Group ID: 360610051001012
Date Annexed: 122003
Date Updated: 042004
Date Verified: 042004
Matching and Consolidation
Matching
• Matching identifies duplicate records within the same or even different
databases
• This is the 'heart' of data warehousing
• One of the greatest challenges in matching is creating a system that
incorporates your "business rules": the criteria for determining what
constitutes a match
• These business rules will vary from one organization to another, and
from one application to another
• Example 1: you may require that name and address information match
exactly
• Example 2: you may accept wider address variations, as long as the
name and phone number match closely
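A minimal matching sketch illustrating both example rules, using Python's difflib for fuzzy name comparison; the 0.9 similarity threshold is an illustrative assumption, not a recommended setting.

```python
# A minimal sketch: two match rules, one strict and one loose, in the
# spirit of Example 1 and Example 2 above.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(r1: dict, r2: dict, rule: str) -> bool:
    if rule == "exact_name_address":   # Example 1: exact name and address
        return r1["name"] == r2["name"] and r1["address"] == r2["address"]
    if rule == "loose_address":        # Example 2: close name, same phone
        return (similarity(r1["name"], r2["name"]) > 0.9
                and r1["phone"] == r2["phone"])
    raise ValueError(rule)

a = {"name": "Margaret Smith-Kline", "address": "101 6th Ave",
     "phone": "2229229922"}
b = {"name": "Margaret Smith Kline", "address": "101 Avenue of the Americas",
     "phone": "2229229922"}
print(is_match(a, b, "exact_name_address"))  # False
print(is_match(a, b, "loose_address"))       # True
```

The same pair of records matches under one rule and not the other, which is exactly why the rules must come from the business, not the tool.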
Consolidation
• Once you've located the matching records in your data, you can identify
relationships between customers and build a consolidated view of each
• This critical component of successful one-to-one marketing allows you
to gain a clearer understanding of your customers
• Methods for consolidation:
– combining all of the data on any given customer using all of the
available data sources
– customer relationship identification, which reveals links between your
customers
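A minimal consolidation sketch using a naive survivorship rule (the first non-empty value per field wins); real tools apply much richer precedence rules, so treat this only as an illustration of the idea.

```python
# A minimal sketch: merge matched records into one consolidated view,
# keeping the first non-empty value seen for each field.
def consolidate(records: list[dict]) -> dict:
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if field not in merged and value not in (None, ""):
                merged[field] = value
    return merged

matched = [
    {"name": "Maggie Smith", "ssn": "001-12-4367", "phone": ""},
    {"name": "Ms. Peg Kline", "ssn": "001-12-4367",
     "phone": "(222) 922-9922"},
]
print(consolidate(matched))
# {'name': 'Maggie Smith', 'ssn': '001-12-4367', 'phone': '(222) 922-9922'}
```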
Matching and Consolidation

Input records:

Maggie Smith
Future Electronics Co. LLC
101 6th Ave.
Manhattan, NY 10012
maggie.kline@future_electronics.com
001-12-4367

Ms. Peg Kline
Future Elect. Co.
101 6th Ave.
New York NY 10013
001-12-4367
(222) 922-9922
5/23/03

Ms Margaret Smith-Kline Ph.D.
Future Electronics
101 Avenue of the Americas
New York NY 10013-1933
maggie.kline@future_electronics.com
May 23, 2003

Consolidated record:
Name: Ms. Margaret Smith-Kline Ph.D.
Company name: Future Electronics Co. LLC
SSN: 001-12-4367
Purchase date: 5/23/2003
Address: 101 Avenue of the Americas
New York, NY 10013-1933
Latitude: 40.722970
Longitude: -74.005035
Fed code: 36061
Phone: (222) 922-9922
Email: maggie.kline@future_electronics.com
Matching and Consolidation
Unlocking the relationships between distinctly different sets of data:
• Householding data to identify members of the same household,
corporation or any other hierarchy
• Identifying "snowbirds"
– i.e. individuals or households with multiple residences
• Creating a panoramic single best record
• De-duplication of records in the database
• Preventing firms from doing business with entities on government watch
lists
• Providing identity resolution to uncover non-obvious relationships for
fraud detection
Snow Removal
• Example:
Owen Marketing Corp Trustee IRA DTD 9/01/98
John Owen
• Only 4 characters in the second line are contained in the first line;
applying any matching algorithm to these two examples would surely fail
• To successfully match John to his company, the "snow" must first be
removed, leaving the clean company name Owen Marketing Corp.
• Even in the cleaned name, "Owen" comprises only 4 of 17 characters,
or 23.5 percent, of the line
• Only after determining an appropriate weighting factor for each word
can these lines be accurately matched, so that Owen, the only important
word in the first example, can be cross-referenced to John's last name
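A minimal "snow removal" sketch: strip the boilerplate tokens first, then compare what remains. The stop list is an illustrative assumption; a real implementation would weight tokens rather than simply dropping them.

```python
# A minimal sketch: remove the "snow" (assumed stop tokens), then see
# which meaningful tokens the two lines share.
SNOW = {"trustee", "ira", "dtd", "9/01/98"}

def remove_snow(line: str) -> set[str]:
    return {t for t in line.lower().split() if t not in SNOW}

company = remove_snow("Owen Marketing Corp Trustee IRA DTD 9/01/98")
person = remove_snow("John Owen")
print(company & person)  # {'owen'}: the cross-reference now survives
```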
Householding (Hierarchal Matching)
• Householding links consumer records that contain the same address and
last name
• Use this strategy when your matching business rules consist of multiple
levels of consumer relationships
• By identifying the characteristics and buying habits of a group or
household, you can create special offers and better target direct
marketing efforts

[Consumer householding hierarchy: Address > Family Name > Individual]
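A minimal householding sketch grouping records on standardized address plus family name, mirroring the hierarchy above; the sample records are hypothetical.

```python
# A minimal sketch: group consumer records into households keyed on
# (address, last name), both lower-cased as a stand-in for real
# standardization.
from collections import defaultdict

records = [
    {"name": "Margaret Smith-Kline", "last": "Smith-Kline",
     "address": "101 Avenue of the Americas, New York NY"},
    {"name": "David Smith-Kline", "last": "Smith-Kline",
     "address": "101 Avenue of the Americas, New York NY"},
]

households = defaultdict(list)
for rec in records:
    key = (rec["address"].lower(), rec["last"].lower())
    households[key].append(rec["name"])

for key, members in households.items():
    print(f"household {key}: {members}")
```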
Business Grouping (Hierarchal Matching)
• Business grouping combines business records that share such
information as company name, address, department, or title
• Use this strategy when your matching business rules consist of multiple
levels of corporate relationships

[Corporate householding hierarchy: Firm Name > Family Name > Dept]
Importance of Eliminating Duplicate Data
• Truly "see" each customer, and generate accurate data about them
• Enhance response rates of marketing promotions
• Reduce the risk of offending customers with repeat offers
• Identify trends and patterns to accurately target new prospects
• The costs of duplicate faxes, mailings, and other forms of
communication can add up quickly if duplication exists within the
database
Importance of Eliminating Duplicate Data
• Any analysis of data, such as reporting, data mining, determining the
success of marketing campaigns, and forecasting, can be heavily
skewed as a result of redundant data
• Customer service efforts are diminished when customer information is
spread across multiple records, giving customer service reps only a
partial view of the account and a limited ability to professionally service
the customer and make a good impression
• Potential clients will lose respect for an organization that has multiple
salespeople call on them, and sales rep motivation will suffer as well
Continuous Monitoring
• Set up existing or inferred business rules/tasks
• Automatically discovers business rules and relationships that might
otherwise go unnoticed
• Set thresholds and schedule assessments
• Automatically notifies you when your continuously monitored tasks
exceed a threshold
• The notification includes details about the threshold
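A minimal monitoring sketch: re-run a check on a schedule and raise an alert when a threshold is exceeded. The 5% rule, the field name, and the print-based notification are illustrative assumptions.

```python
# A minimal sketch of continuous monitoring: one scheduled check plus a
# threshold-based alert.
def pct_missing(rows: list[dict], field: str) -> float:
    return sum(1 for r in rows if not r.get(field)) / len(rows)

THRESHOLD = 0.05  # alert if more than 5% of supplier_id values are missing

def monitor(rows: list[dict]) -> None:
    rate = pct_missing(rows, "supplier_id")
    if rate > THRESHOLD:
        # In practice this would send an email or dashboard notification.
        print(f"ALERT: {rate:.1%} missing supplier_id exceeds {THRESHOLD:.0%}")

monitor([{"supplier_id": ""}, {"supplier_id": "S1"}, {"supplier_id": "S2"}])
```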
Dashboard Reports
Offers a robust set of graphical and dashboard reports to aid in the quick
identification of data problems.
Questions???