Page 1: data-quality-concepts.pdf

7 December 2012 TCS Public

Data Quality Concepts

Page 2: data-quality-concepts.pdf

Agenda

• Data Quality Concepts (3 hrs 15 mins)
• Introduction to Data Quality XI R2 (45 mins)
• Using the Project Architect (2 hrs)
• Using Transforms (10 hrs)
• Matching and Consolidating Records (10 hrs)

Page 3: data-quality-concepts.pdf

What is Data Quality?

Data Quality describes how well data serves its intended purpose.

• Data are of high quality if they are fit for their intended uses in operations, decision making and planning

• It is the state of
  – completeness,
  – validity,
  – consistency,
  – timeliness and
  – accuracy
  that makes data appropriate for a specific use

Page 4: data-quality-concepts.pdf

Why Data Quality?

• Companies often cannot rely on the information that serves as the very foundation of their primary business applications

• Inaccurate/inconsistent data can hinder a company's ability to understand its current – and future – business problems

• This leads to poor decisions that can cause a host of negative results, including lost profits, operational delays, customer dissatisfaction and much more

• In short, the effectiveness and quality of decision-making is limited by the quality of the data on which it is based

Page 5: data-quality-concepts.pdf

Poor Data Quality Leads to

• Inability to compare data from different sources
• Data entered into the wrong fields
• Lack of consistent data definitions
• Inability to consolidate data from multiple sources
• Inability to track data across time
• Inability to comply with government regulations
• Delayed or rejected reimbursement from third-party providers
• Inability to determine important relationships

Page 6: data-quality-concepts.pdf

Examples

T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000

• Can we interpret the data?
  – What do the fields mean?
  – What is the key? The measures?
• Data glitches
  – Typos, multiple formats, missing / default values
• Metadata and domain expertise
  – Field three is Revenue. In dollars or cents?
  – Field seven is Usage. Is it censored?
  – Field 4 is a censored flag. How to handle censored data?
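The two sample records above can be pulled apart programmatically. A minimal Python sketch, assuming a hypothetical field layout (name, phone, revenue, censored flag, gender, state, usage) that the slide itself leaves open:

```python
import re

# Hypothetical field layout inferred from the slide's two sample records;
# the real meaning of each field is exactly what profiling must uncover.
FIELDS = ["name", "phone", "revenue", "censored_flag", "gender", "state", "usage"]

def parse_record(line):
    """Split one pipe-delimited record into named fields."""
    return dict(zip(FIELDS, line.split("|")))

def find_glitches(rec):
    """Flag two simple glitches: malformed phone numbers and defaulted gender."""
    problems = []
    if not re.fullmatch(r"\d{10}|\d{3}-\d{3}-\d{4}", rec["phone"]):
        problems.append("phone has typos or an unknown format")
    if rec["gender"] in ("-", ""):
        problems.append("gender is missing or defaulted")
    return problems

# The letter 'o' standing in for a zero in the first phone number is caught:
print(find_glitches(parse_record("T.Das|97336o8327|24.95|Y|-|0.0|1000")))
```

Even two hand-written rules like these surface the glitch classes the slide lists: typos, mixed formats, and default values.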

Page 7: data-quality-concepts.pdf

Data Glitches

• Systemic changes to data which are external to the recorded process
  – Changes in data layout / data types
    • Integer becomes string, fields swap positions, etc.
  – Changes in scale / format
    • Dollars vs. euros
  – Temporary reversion to defaults
    • Failure of a processing step
  – Missing and default values
    • Application programs do not handle NULL values well…
  – Gaps in time series
    • Especially when records represent incremental changes

Page 8: data-quality-concepts.pdf

Meaning of Data Quality

• There are many types of data, which have different uses and typical quality problems
  – Federated data
  – High dimensional data
  – Descriptive data
  – Longitudinal data
  – Streaming data
  – Web (scraped) data
  – Numeric vs. categorical vs. text data

Page 9: data-quality-concepts.pdf

Meaning of Data Quality

• There are many uses of data
  – Operations
  – Aggregate analysis
  – Customer relations…
• Data Interpretation: the data is useless if we don't know all of the rules behind the data
• Data Suitability: can you get the answer from the available data?
  – Use of proxy data
  – Relevant data is missing

Page 10: data-quality-concepts.pdf

Data Quality Constraints

• Many data quality problems can be captured by static constraints based on the schema
  – Nulls not allowed, field domains, foreign key constraints, etc.
• Many others are due to problems in workflow, and can be captured by dynamic constraints
  – E.g., orders above $200 are processed by Biller 2
• The constraints follow an 80-20 rule
  – A few constraints capture most cases; thousands of constraints are needed to capture the last few
• Constraints are measurable. Data Quality Metrics?
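The static/dynamic distinction can be sketched in a few lines of Python. The order records below and the "$200 goes to Biller 2" rule follow the slide's example; everything else is invented for illustration:

```python
# Static constraint: schema-level (NOT NULL, non-negative amount).
# Dynamic constraint: workflow-level (orders above $200 go to Biller 2).
orders = [
    {"order_id": 1, "amount": 150.0, "biller": "Biller 1"},
    {"order_id": 2, "amount": 450.0, "biller": "Biller 1"},  # breaks workflow rule
    {"order_id": 3, "amount": None,  "biller": "Biller 2"},  # breaks NOT NULL
]

def check_static(rec):
    """Schema-level: amount must be present and non-negative."""
    return rec["amount"] is not None and rec["amount"] >= 0

def check_dynamic(rec):
    """Workflow-level: orders above $200 are processed by Biller 2."""
    if rec["amount"] is not None and rec["amount"] > 200:
        return rec["biller"] == "Biller 2"
    return True

violations = [r["order_id"] for r in orders
              if not (check_static(r) and check_dynamic(r))]
print(violations)  # → [2, 3]
```

Because each check is a plain predicate, the fraction of records passing it is directly measurable — which is the bridge to data quality metrics on the next slide.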

Page 11: data-quality-concepts.pdf

Data Quality Metrics

• We want a measurable quantity
  – Indicates what is wrong and how to improve
  – Realize that DQ is a messy problem; no set of numbers will be perfect
• Types of metrics
  – Static vs. dynamic constraints
  – Operational vs. diagnostic
• Metrics should be directionally correct with an improvement in use of the data
• A very large number of metrics are possible
  – Choose the most important ones

Page 12: data-quality-concepts.pdf

Examples of Data Quality Metrics

• Conformance to schema
  – Evaluate constraints on a snapshot
• Conformance to business rules
  – Evaluate constraints on changes in the database
• Accuracy
  – Perform inventory (expensive), or use proxy (track complaints). Audit samples?
• Accessibility
• Interpretability
• Glitches in analysis
• Successful completion of end-to-end process

Page 13: data-quality-concepts.pdf

Typical Data Quality Business Drivers

• Inability to make solid business decisions due to lack of trust in the data driving business intelligence efforts
• Failed business and marketing programs that were based on poor data
• Inability to target best customers and suppliers
• Increased cost due to inability to deliver product, and returned direct marketing pieces and bills/invoices
• Compliance concerns and the legal and financial risks of reporting and acting on bad data
• Rework or process delays due to duplicate or incorrect data within enterprise systems
• Decline in customer satisfaction and perceptions

Page 14: data-quality-concepts.pdf

Data Quality Issues

• Data content errors
• Missing data
• Invalid data
• Data that is significantly different from all other data
• Multiple formats for the same data elements
• Different meanings for the same code value
• Multiple code values with the same meaning
• Field overuse: used for unintended purpose
• Data in filler
• Errors related to migration during the ETL process
• Normalization inconsistencies
• Duplicate or lost data
• Data structure problems

Page 15: data-quality-concepts.pdf

Categories of Data Quality Problems

• Accuracy
• Objectivity
• Believability
• Reputation
• Relevancy
• Value-added
• Timeliness (currency)
• Completeness
• Amount of Information
• Interpretability
• Ease of Understanding
• Consistent Representation
• Concise Representation
• Access
• Security

Page 16: data-quality-concepts.pdf

Where does Data Quality fit in EDW?

[EDW architecture diagram: source systems (Network, RDBMS, CRM, ERP, Mainframe DBs, PC DBs) feed an Extraction step; Cleansing, Transformation, Validation and Massaging populate the Staging Area, ODS and DW; Aggregation, summarization, Data Mart population, dimension loading and fact loading build the Data Marts; client browsers consume Reports, Cubes, Analysis, Data mining, Dashboards, MIS reports, company quarterly reports, etc. Options 1, 2 and 3 mark the three places where Data Quality can be applied.]

Page 17: data-quality-concepts.pdf

Option 1

• Data Quality is performed at the data source itself and the result is over-written on the source
• This is a good option as the data is in sync throughout the EDW
• Reports generated from the source will be in sync with reports out of the data warehouse
• The only drawback is that Data Quality needs to be performed at every source system separately, and standardization needs to be done during ETL

Page 18: data-quality-concepts.pdf

Where does Data Quality fit in EDW?

[EDW architecture diagram repeated from Page 16.]

Page 19: data-quality-concepts.pdf

Option 2

• Data Quality is performed during ETL and the result is stored in the staging area
• This is the most appropriate place to perform Data Quality
• Data from all the possible sources of the EDW can be cleansed, standardized and consolidated at one time
• No separate standardization needs to be done
• Clean data reduces the ETL effort, as the number of records failing during ETL reduces

Page 20: data-quality-concepts.pdf

Where does Data Quality fit in EDW?

[EDW architecture diagram repeated from Page 16.]

Page 21: data-quality-concepts.pdf

Option 3

• Data Quality is performed at the data warehouse and the result is over-written on the data warehouse itself
• This is not a recommended option as the data is stored in a highly de-normalized format
• Also, the DW stores historic data, so the amount of data on which to perform Data Quality is very high
• Here the incorrect data will enter the DW and will be cleansed at a later stage
• The erroneous/duplicate records need to be deleted from the DW after the data quality operation is performed

Page 22: data-quality-concepts.pdf


Data Quality Process

Page 23: data-quality-concepts.pdf


Data Profiling

Page 24: data-quality-concepts.pdf

Data Profiling

• Before improving the quality of data, it is imperative to assess the current quality of data
• Data profiling includes:
  – Setting data quality goals
  – Creating a data quality strategy
  – Measuring data defects
  – Analyzing cause and impact of those defects
  – Reporting the results to key stakeholders

Page 25: data-quality-concepts.pdf

Assessing Data

[Cyclic assessment diagram operating on the Source Data: 1-Define Issues → 2-Weight/Impact → 3-Profile Data → 4-Revisit Definitions, Weights → 5-Findings → 6-Address → 7-Maintain.]

Page 26: data-quality-concepts.pdf

Pre-requisites for Data Profiling - Defining Issues

• Standard list
• Key requirements
  – Content
  – Structure
  – Completeness
• Update list by project or source

[Diagram: Source Data → 1-Define Issues]

Page 27: data-quality-concepts.pdf

Pre-requisites for Data Profiling - Defining Issues Sample

Constants
Definition Mismatches
Filler Containing Data
Inconsistent Cases
Inconsistent Data Types
Inconsistent Null Rules
Invalid Keys
Invalid Values
Miscellaneous
Missing Values
Orphans
Out of Range
Pattern Exceptions
Potential Constants
Potential Defaults
Potential Duplicates
Potential Invalids
Potential Redundant Values
Potential Unused Fields
Rule Exceptions
Unused Fields

Page 28: data-quality-concepts.pdf

Pre-requisites for Data Profiling - Weight Impact

• After the issues are initially identified:
  – Some issues are more critical than others
  – Weights are not priorities
  – Assign a weighting factor (1-5)
  – Weighting factors SHOULD change by project

Page 29: data-quality-concepts.pdf

Profile Data

• What does Data Profiling mean?

Page 30: data-quality-concepts.pdf

What is Data Profiling?

• Use of analytical techniques on data for the purpose of developing a thorough knowledge of its content, structure & quality
• A process of developing information ABOUT data instead of information FROM data
• This is a multi-step process
  – Collect documentation
  – Review the DATA itself
  – Compare data to documentation
  – Identify and detail specific issues

Page 31: data-quality-concepts.pdf

Data Profiling Sample

• Information ABOUT Data (Data Profiling):
  – 30% of entries in SUPPLIER_ID are blank
  – The range of values in UNIT_PRICE is 5.99 to 4599.99
  – There are 14 ORDER_HEADER rows with no ORDER_DETAIL rows
• Information FROM Data (not Data Profiling):
  – Texas auto buyers buy more Cadillacs per capita than any other state
  – The average mortgage amount increased last year by 6%
  – 10% of last year's customers did not buy anything this year

Page 32: data-quality-concepts.pdf

Data Profiling Process

• Inspecting the data for compliance to business rules
• Comparing heterogeneous data sources
• Discovering any defects and measuring their impact on your business
• Reporting findings to stakeholders
• Communicating business rules to be used in cleansing
• Automating all of the above to provide continuous monitoring

[Data Profiling quadrant figure:]
– Data Profile: performs summary, frequency, completeness, uniqueness, and redundancy profiling
– Structural Integrity: tests for unique primary keys, foreign keys, and foreign key parents
– Validity: using your business rules, indicates which fields contain invalid values
– Business Rule Compliance: tests unique and inferred primary keys, foreign keys, and inferred rules/relationships

Page 33: data-quality-concepts.pdf

Data Profiling

• Data profiling tools scan every single record in every single column and table in a source system
• They generate the following:
  – List of data values
  – Statistics
  – Charts
  – New structures
  – Range and distribution of values in each column
  – Relationships between columns
  – Drill down from summary views
  – Other operations
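A toy version of the per-column outputs listed above (completeness, cardinality, range, value frequencies) can be sketched with the standard library alone; the sample rows are invented:

```python
from collections import Counter

# Invented sample rows; a real profiler would scan every table and column.
rows = [
    {"supplier_id": "S1", "unit_price": 5.99},
    {"supplier_id": None, "unit_price": 120.00},
    {"supplier_id": "S2", "unit_price": 4599.99},
]

def profile(rows, column):
    """Report a few basic profile statistics for one column."""
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    return {
        "completeness": len(present) / len(values),  # share of non-null entries
        "distinct": len(set(present)),               # cardinality
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "frequencies": Counter(present),             # value distribution
    }

p = profile(rows, "unit_price")
print(p["min"], p["max"])  # → 5.99 4599.99
```

Statements like the Page 31 sample findings ("30% of entries in SUPPLIER_ID are blank") fall straight out of the `completeness` figure here.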

Page 34: data-quality-concepts.pdf

Benefits of Data Profiling

• Evaluate more data in less time
• Generates more information, such as charts, etc.
• Some tools create appropriate data cleansing rules as well
• 100 percent accuracy and completeness
• Used to "audit" cleanliness of existing databases [e.g.: to find missing or duplicate values]
• Exposes inconsistent business processes [e.g.: each unit uses different product codes]
• Drill down from summary views
• Mitigates the risk posed by poor data quality
• Enables effective decision making by delivering trustworthy data

Page 35: data-quality-concepts.pdf

Post Data Profiling - Revisit

• Review the issues and weights
  – Should there be more or fewer issues? What are they?
  – Is the relative importance of each issue different?

Page 36: data-quality-concepts.pdf

Post Data Profiling - Findings

• Your findings tell others about the data
  – Documented reports and/or charts
  – Results database
  – Quality Assessment Score

Page 37: data-quality-concepts.pdf

Findings-Chart

[Bar chart "Sample Company Issue Findings": count of issues (0-25) per issue category — Constant, Definition Mismatch, Filler Containing Data, Inconsistent Case, Inconsistent Data Type, Inconsistent Null Rule, Invalid Keys, Invalid Values, Miscellaneous, Missing Values, Orphans, Out of Range, Pattern Exception, Potential Constant, Potential Default, Potential Duplicates, Potential Invalid, Potential Redundant, Potential Unused, Rule Exceptions, Unused.]

Page 38: data-quality-concepts.pdf

Findings-Chart

Issue Type                  Issues Discovered   Possible Issues
Constants                   1                   59
Definition Mismatches       4                   59
Filler Containing Data      1                   59
Inconsistent Cases          3                   59
Inconsistent Data Types     15                  59
Inconsistent Null Rules     6                   59
Invalid Keys                1                   3
Invalid Values              1                   59
Miscellaneous               10                  59
Missing Values              18                  59
Orphans                     2                   2
Out of Range                3                   59
Pattern Exceptions          10                  59
Potential Constants         1                   59
Potential Defaults          1                   59
Potential Duplicates        3                   59
Potential Invalids          4                   59
Potential Redundant Values  21                  59
Potential Unused Fields     1                   59
Rule Exceptions             3                   3
Unused Fields               1                   59
Total                       110                 1070

Raw Score 89.7%

Page 39: data-quality-concepts.pdf

Findings-Chart

Weight Factor  Issue Type                  Issues Discovered   Possible Issues
4              Constants                   1                   59
2              Definition Mismatches       4                   59
3              Filler Containing Data      1                   59
1              Inconsistent Cases          3                   59
2              Inconsistent Data Types     15                  59
3              Inconsistent Null Rules     6                   59
5              Invalid Keys                1                   3
5              Invalid Values              1                   59
1              Miscellaneous               10                  59
3              Missing Values              18                  59
4              Orphans                     2                   2
5              Out of Range                3                   59
4              Pattern Exceptions          10                  59
2              Potential Constants         1                   59
2              Potential Defaults          1                   59
1              Potential Duplicates        3                   59
3              Potential Invalids          4                   59
4              Potential Redundant Values  21                  59
3              Potential Unused Fields     1                   59
5              Rule Exceptions             3                   3
4              Unused Fields               1                   59
               Total                       110                 1070

Weighted Score 76.2%

Page 40: data-quality-concepts.pdf

Findings-Chart

Weight Factor                        5        4        3       2      1
Issues identified in weight factor   8        35       30      21     16
Average rate per factor              35.03%   31.19%   10.17%  8.90%  9.04%
Total average by weight              175.1%   124.7%   30.5%   17.8%  9.0%

Weighted Issue Rate: 23.8%
Weighted Assessment Score: 76.2%
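Both scores can be reproduced from the chart's numbers: the weighted issue rate scales each weight class's average rate by its weight and divides by the sum of the weights, while the raw score simply divides total discovered issues by total possible issues. A short Python sketch using the figures read off the tables:

```python
# Per-factor average issue rates and weights from the Findings-Chart.
weights = [5, 4, 3, 2, 1]
avg_rate = [0.3503, 0.3119, 0.1017, 0.0890, 0.0904]

# Weighted issue rate: weight-scaled average of the per-factor rates.
weighted_rate = sum(w * r for w, r in zip(weights, avg_rate)) / sum(weights)
score = 1 - weighted_rate
print(f"Weighted issue rate: {weighted_rate:.1%}")       # → 23.8%
print(f"Weighted assessment score: {score:.1%}")         # → 76.2%

# Unweighted raw score: total discovered issues over total possible issues.
raw = 1 - 110 / 1070
print(f"Raw score: {raw:.1%}")                           # → 89.7%
```

The gap between 89.7% raw and 76.2% weighted shows why weighting matters: this sample's issues cluster in the high-weight (4 and 5) categories.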

Page 41: data-quality-concepts.pdf

Post Data Profiling - Address the Issues

• Addressing your findings
  – Actual vs. Potential
  – Subject Matter Expertise
  – Cleansing Requirements

Page 42: data-quality-concepts.pdf

Post Data Profiling - Maintain Vigilance

• Maintain
  – Complete the cycle
  – Periodic review
  – Document score changes

Page 43: data-quality-concepts.pdf

Why Do The Assessment?

• Quantify the quality issues
• Isolate true problems
• Proactive review
  – reduces the cost of resolving issues
  – reduces the risk of customer dissatisfaction
• Define the scope of issues
• Determine the resources required to address issues

Page 44: data-quality-concepts.pdf

Why Do The Assessment?

[Chart: the cost to address an issue rises steeply the later in the project timeline it is found.]

Page 45: data-quality-concepts.pdf

Data Assessment Drives Cleansing

[Diagram: Data Assessment Analysis (analysis functions, address validation, sharing results) feeds Data Quality Cleansing.]

Page 46: data-quality-concepts.pdf


Data Cleansing

Page 47: data-quality-concepts.pdf

Data Cleansing

• Data cleansing is also called data scrubbing
• It is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated
• An organization in a data-intensive field like banking or insurance might use a data scrubbing tool to systematically examine data for flaws by using rules, algorithms, and look-up tables
• Typically, a database scrubbing tool includes programs that are capable of correcting a number of specific types of mistakes, such as adding missing zip codes or finding duplicate records
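As a toy illustration of the look-up-table approach mentioned above, the sketch below fills missing ZIP codes from a city/state directory; the directory contents are invented:

```python
# Invented city/state → ZIP look-up table; a real tool would use full
# postal directories.
ZIP_DIRECTORY = {("New York", "NY"): "10013", ("Boston", "MA"): "02108"}

def add_missing_zip(record):
    """Fill in a missing ZIP code from the directory, if the city is known."""
    if not record.get("zip"):
        record["zip"] = ZIP_DIRECTORY.get((record["city"], record["state"]), "")
    return record

rec = add_missing_zip({"city": "New York", "state": "NY", "zip": ""})
print(rec["zip"])  # → 10013
```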

Page 48: data-quality-concepts.pdf

Data Cleansing (Customer Data)

• Cleanses and standardizes customer data such as names/addresses, emails, phone numbers, SSNs, and dates
• Manages international data for over 190 countries and reads and writes Unicode data
• Removes errors to uncover true content of database
• Improves integrity of data to identify matches and ultimately create a single customer view

Page 49: data-quality-concepts.pdf

Data Cleansing (Customer Data)

Input record:
  Maggie.kline@future_electronics.com
  Margaret Smith-Kline phd
  FUTURE Electronics
  5/23/03
  101 6th ave
  manhattan
  ny
  10012
  001124367

Output record:
  Salutation: Ms.
  First name: Margaret
  Last name: Smith-Kline
  Postname: Ph. D.
  Match standards: Maggie, Peg, Peggy
  Gender: Strong Female
  Company name: Future Electronics
  Address 1: 101 Avenue of the Americas
  City: New York
  State: NY
  ZIP+4: 10013-1933
  Email: maggie.kline@future_electronics.com
  SSN: 001-12-4367
  Date: May 23, 2003

Page 50: data-quality-concepts.pdf

Data Cleansing (Operational Data)

• Parses and standardizes business data
  – Such as account numbers, product codes, product descriptions, purchase dates, part numbers, SKUs, etc.
• Utilizes a rule-based parsing and rule editing architecture for even greater customized results
• Provides a GUI that allows users to determine how their data is parsed, and evaluate the impact of their customized changes

Page 51: data-quality-concepts.pdf

Data Cleansing (Operational Data)

Description                                     Product      Dimension          Type    Form
Kallkyle screw                                  screw                                   Kallkyle
test steel plate 20 x 35 mm                     plate        20x35 mm           steel   test
wire 23.33 x 40.50 cm                           wire         23.33 x 40.50 cm
plain wire 23.33 x 40.50 cm                     wire         23.33 x 40.50 cm           plain
diagonal wireless transmitter, frequency 23.49  transmitter                             wireless
34 x 60 mm steel plate                          plate        34 x 60 mm         steel
steel plate 34,0 60 mm                          plate        34 x 60 mm         steel
34.0 x 60,0 mm steel plate                      plate        34 x 60 mm         steel
34 x 60 mm steel plate ?                        plate        34 X 60 mm         steel
plate                                           plate
steel plate                                     plate                           steel

Page 52: data-quality-concepts.pdf

Data Cleansing – Parsing Data

• It is the placement of various data elements into appropriate fields
• Parsing is a vital step for the cleansing and matching stages
• It may also include rearranging data elements in a single field or moving elements to multiple, more discrete fields
• It may also include removing unwanted characters, words, or spaces in your data
• Breaking data into more manageable components increases the reliability of correction techniques

Page 53: data-quality-concepts.pdf

Data Cleansing – Parsing Data

• Parsing rules can be based on
  – Type of data,
  – Clues found within the data itself, or
  – A library of common data patterns
• Typically, DQ technology includes pre-built vocabularies, grammars & a host of modifiable expression files which help in efficiently & correctly parsing data

Page 54: data-quality-concepts.pdf

Data Cleansing – Parsing Data

The example below shows how parsing identifies & isolates individual elements from an input record

Input field: Mr. Tom J. Jones, Jr., CPA, Account Mgr.

Parsed output fields:
  Prename: Mr.
  First Name: Tom
  Middle Name: J.
  Last Name: Jones
  Maturity Postname: Jr.
  Other Postname: CPA
  Title: Account Mgr.
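A greatly simplified parser for this kind of record can be sketched as below; the prename/postname vocabularies are tiny stand-ins for the pre-built vocabularies a real DQ tool ships with:

```python
# Toy vocabularies — real tools ship far larger, modifiable ones.
PRENAMES = {"Mr.", "Ms.", "Mrs.", "Dr."}
MATURITY = {"Jr.", "Sr.", "II", "III"}
POSTNAMES = {"CPA", "Ph.D.", "MD"}

def parse_name(text):
    """Split 'Mr. Tom J. Jones, Jr., CPA, Account Mgr.' into named fields."""
    parts = [p.strip() for p in text.split(",")]
    tokens = parts[0].split()            # the core name, e.g. "Mr. Tom J. Jones"
    out = {}
    if tokens and tokens[0] in PRENAMES:
        out["prename"] = tokens.pop(0)
    out["first"] = tokens[0]
    out["last"] = tokens[-1]
    if len(tokens) > 2:
        out["middle"] = " ".join(tokens[1:-1])
    for extra in parts[1:]:              # classify the comma-separated suffixes
        if extra in MATURITY:
            out["maturity_postname"] = extra
        elif extra in POSTNAMES:
            out["other_postname"] = extra
        else:
            out["title"] = extra
    return out

print(parse_name("Mr. Tom J. Jones, Jr., CPA, Account Mgr."))
```

The vocabulary look-ups are what carry the weight here: without them "Jr." and "CPA" are indistinguishable tokens, which is why the slides stress pre-built vocabularies and grammars.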

Page 55: data-quality-concepts.pdf

Data Cleansing – Standardizing Data

• Data not assessed for DQ shows multiple permutations of the same value and other anomalies
• Standardization creates a uniform nomenclature for common records
• Example:
  – ACME Manufacturing Corporation
  – Acme Mftg Corp
  – ACME
  – ACME Manufacturing
• In a standardization scheme, all of the data is changed to a standardized format
• Once done, you get the complete picture of the relationship with the organization (here ACME Manufacturing Corporation)
• This is so because all permutations have now been standardized to one naming convention
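A minimal standardization sketch along these lines: variant company names are normalized and mapped through an abbreviation table (invented for illustration) so that ACME's records roll up under one key:

```python
# Invented abbreviation → canonical word table.
SYNONYMS = {"mftg": "manufacturing", "mfg": "manufacturing",
            "corp": "corporation", "co": "company"}

def standardize(name):
    """Lower-case, strip periods, and expand known abbreviations."""
    words = name.lower().replace(".", "").split()
    return " ".join(SYNONYMS.get(w, w) for w in words)

variants = ["ACME Manufacturing Corporation", "Acme Mftg Corp"]
print({standardize(v) for v in variants})  # both collapse to one key
```

Bare variants like "ACME" alone still need matching logic beyond word expansion, which is exactly where the later matching slides pick up.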

Page 56: data-quality-concepts.pdf

Data Cleansing – Standardizing Data

• Another example: the following may all be representations of the top officer in a company, and would be standardized to a single value:
  – President
  – Owner
  – Chief Executive Officer
  – CEO
  – C.E.O.
  – President/Owner

Page 57: data-quality-concepts.pdf

Data Cleansing – Standardizing Data

• To make the records more consistent you can standardize date formats, greetings, case and punctuation
• E.g.

Input record:
  Purchase order: PO123456
  Date of purchase: 030106
  Description: wire rope, 3'' diameter, 1

Output record:
  Purchase order: 12-3456
  Date of purchase: 03-01-06
  Description: Wire Rope
  Diameter: 3
  Quantity: 1

Page 58: data-quality-concepts.pdf

Data Cleansing – Standardizing Data

[Real-life example figure: customer data analysis going wrong due to lack of standardization.]

Page 59: data-quality-concepts.pdf

Data Cleansing – Cleansing Data

• Takes incorrect or erroneous data as input
• Applies a series of transformations to obtain correct and complete data as the output
• Depending on the data type, it may also be possible to compare the value of a data element to a known list of possible values and resolve incomplete data to one of the known values
• It is also possible to append additional data or insert incomplete or missing data

Page 60: data-quality-concepts.pdf


Data Cleansing – Cleansing Data

Page 61: data-quality-concepts.pdf

Data Cleansing – Cleansing Data

• Example: the address is corrected, the city is appended, and the state name is corrected by comparing the input record to directories/dictionaries to obtain the correct values

Input record:
  [email protected]
  Tom J. Jones
  101 6th Avenue
  ny

Output record:
  Salutation: Mr.
  First Name: Tom
  Last Name: Jones
  Address: 101 Avenue of the Americas
  City: New York
  State: NY

Page 62: data-quality-concepts.pdf


Data Enhancement

Page 63: data-quality-concepts.pdf

Data Enhancement

• Data enhancement is appending additional data
• Example: credit ratings, demographics, geocoding information, email addresses, etc. are appended to existing data in order to increase the overall utility of the input record

Page 64: data-quality-concepts.pdf

Data Enhancement

• Completes records with directory information by appending name, address, phone number, or email address
• Provides geocoding information append capabilities for geographic and demographic marketing initiatives
• Provides geospatial assignment (FIPS codes) of customer addresses for tax jurisdictions, insurance rating territories, insurance hazards, etc.

Page 65: data-quality-concepts.pdf

Data Enhancement

Example of directory, geocoding and geospatial information appended to a record containing an address:

Input address:
  Margaret Smith-Kline, Ph.D.
  Future Electronics
  101 Avenue of the Americas
  New York, NY 10013-1933

Appended information:
  Phone: (222) 922-9922
  Latitude: 40.722970  Longitude: -74.005035
  Match quality: Highest quality address
  FIPS Code: State: 36 New York
  FIPS Code: County: 061 New York
  FIPS Code: Place: 51000 New York
  Special District: No
  City Type: City
  Class Code: C1
  Incorporation Flag: 1
  Taxing Authority Name: New York
  Taxing Authority FIPS Code: 3606151000
  Taxing Authority Remittance: 3600000000
  Census Tract ID: 360610051001.01
  Block Group ID: 360610051001012
  Date Annexed: 122003
  Date Updated: 042004
  Date Verified: 042004

Page 66: data-quality-concepts.pdf


Matching and Consolidation

Page 67: data-quality-concepts.pdf

Matching

• Identifying duplicate records within the same or even differing databases
• This is the 'heart' of data warehousing
• One of the greatest challenges in matching is creating a system that incorporates your "business rules" – criteria for determining what constitutes a match
• These business rules will vary from one organization to another, and from one application to another
• Example 1
  – you may require that name & address information match exactly
• Example 2
  – you may accept wider address variations, as long as the name & phone number match closely
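A rule like Example 2 (loose name similarity plus agreeing phone digits) can be sketched with the standard library; the 0.8 similarity threshold and the sample records are illustrative only:

```python
from difflib import SequenceMatcher

def digits(s):
    """Keep only the digits, so phone formats don't affect comparison."""
    return "".join(ch for ch in s if ch.isdigit())

def is_match(a, b, name_threshold=0.8):
    """Business rule sketch: names similar above threshold AND phones agree."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return name_sim >= name_threshold and digits(a["phone"]) == digits(b["phone"])

r1 = {"name": "Margaret Smith-Kline", "phone": "(222) 922-9922"}
r2 = {"name": "Margaret Smith Kline", "phone": "222-922-9922"}
print(is_match(r1, r2))  # → True
```

Changing the threshold or swapping the phone test for an exact-address test gives the Example 1 rule instead, which is the point the slide makes: the rules, not the mechanics, vary by organization and application.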

Page 68: data-quality-concepts.pdf

Data Matching

[Pages 68-70 contain data-matching example figures only.]

Page 71: data-quality-concepts.pdf

Consolidation

• Once you've located the matching records in your data, you can identify relationships between customers and build a consolidated view of each
• This critical component of successful one-to-one marketing allows you to gain a clearer understanding of your customers
• Methods for consolidation:
  – single best record – combines all of the data on any given customer using all of the available data sources
  – customer relationship identification – reveals links between your customers
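A sketch of the first method, building one best record from matched duplicates; the merge preference (first non-empty value wins, with records ordered most-trusted first) is an assumption for illustration, as is the sample email:

```python
def consolidate(records):
    """Build a single best record; records are ordered most-trusted first."""
    best = {}
    for rec in records:
        for field, value in rec.items():
            if value and not best.get(field):
                best[field] = value  # first non-empty value for the field wins
    return best

# Two already-matched duplicates of one customer (sample data).
matched = [
    {"name": "Ms. Peg Kline", "phone": "(222) 922-9922", "email": ""},
    {"name": "Maggie Smith",  "phone": "",               "email": "maggie@example.com"},
]
print(consolidate(matched))
```

Real survivorship rules are richer (recency, source reliability, field-level scoring), but the shape is the same: every field of the consolidated view is sourced from the best available duplicate.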

Page 72: data-quality-concepts.pdf


Consolidation

Page 73: data-quality-concepts.pdf

Matching and Consolidation

Input records:

  Ms Margaret Smith-Kline Ph.D.
  Future Electronics
  101 Avenue of the Americas
  New York NY 10013-1933
  maggie.kline@future_electronics.com
  May 23, 2003

  Maggie Smith
  Future Electronics Co. LLC
  101 6th Ave.
  Manhattan, NY 10012
  maggie.kline@future_electronics.com
  001-12-4367

  Ms. Peg Kline
  Future Elect. Co.
  101 6th Ave.
  New York NY 10013
  001-12-4367
  (222) 922-9922
  5/23/03

Consolidated record:

  Name: Ms. Margaret Smith-Kline Ph.D.
  Company name: Future Electronics Co. LLC
  SSN: 001-12-4367
  Purchase date: 5/23/2003
  Address: 101 Avenue of the Americas
  New York, NY 10013-1933
  Latitude: 40.722970
  Longitude: -74.005035
  Fed code: 36061
  Phone: (222) 922-9922
  Email: maggie.kline@future_electronics.com

Page 74: data-quality-concepts.pdf

Matching and Consolidation

Unlocking the relationships between distinctly different sets of data:

• Householding data to identify members of the same household, corporation or any other hierarchy
• Identifying "snowbirds"
  – i.e. individuals or households with multiple residences
• Creating a panoramic single best record
• De-duplication of records in a database
• Preventing firms from doing business with entities on government watch lists
• Providing identity resolution to uncover non-obvious relationships for fraud detection

Page 75: data-quality-concepts.pdf

Snow Removal

• Example:
  Owen Marketing Corp Trustee IRA DTD 9/01/98
  John Owen
• Only 4 characters in the second line are contained in the first line; applying any matching algorithm to these two examples would surely fail
• To successfully match John to his company, the "snow" must first be removed, leaving the clean company name Owen Marketing Corp.
• "Owen" comprises only 4/17ths, or 23.5 percent, of the line
• Only after determining an appropriate weighting factor for each word can these lines be accurately matched, so that Owen, the only important word in the first example, can be cross-referenced to John's last name

Page 76: data-quality-concepts.pdf

Householding (Hierarchical Matching)

• Householding links consumer records that contain the same address and last name
• Use this strategy when matching business rules consist of multiple levels of consumer relationships
• By identifying the characteristics and buying habits of a group or household, you can create special offers and better target direct marketing efforts

[Consumer householding hierarchy: Address → Family Name → Individual]
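The address + last-name linking rule can be sketched directly; the normalization here (lower-casing, trimming a trailing period) is deliberately naive, and the sample records are invented:

```python
from collections import defaultdict

records = [
    {"name": "Tom Jones",  "address": "101 6th Ave."},
    {"name": "Mary Jones", "address": "101 6th ave"},
    {"name": "John Owen",  "address": "12 Elm St."},
]

def household_key(rec):
    """Household = normalized (address, last name), per the slide's rule."""
    last_name = rec["name"].split()[-1].lower()
    address = rec["address"].lower().rstrip(".")
    return (address, last_name)

households = defaultdict(list)
for rec in records:
    households[household_key(rec)].append(rec["name"])

print(len(households))  # → 2 households
```

Note that the grouping only works because the addresses were normalized first: householding sits downstream of the parsing and standardization steps covered earlier.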

Page 77: data-quality-concepts.pdf

Business Grouping (Hierarchical Matching)

• Business grouping combines business records that share such information as company name, address, department, or title
• Use this strategy when matching business rules consist of multiple levels of corporate relationships

[Corporate householding hierarchy: Firm Name → Family Name → Dept]

Page 78: data-quality-concepts.pdf

Importance of Eliminating Duplicate Data

• Truly "see" each customer, and generate accurate data about them
• Enhance response rates of marketing promotions
• Reduce the risk of offending customers with repeat offers
• Identify trends and patterns to accurately target new prospects
• The costs of duplicate faxes, mailings, and other forms of communication can add up quickly if duplication exists within the database

Page 79: data-quality-concepts.pdf

Importance of Eliminating Duplicate Data

• Any analysis of data, such as reporting, data mining, determining the success of marketing campaigns, forecasting, etc., can be heavily skewed as a result of redundant data
• Customer service efforts are diminished when customer information is spread across multiple records, giving customer service reps only a partial view of the account & a limited ability to professionally service & make a good impression with the customer
• Potential clients will lose respect for an organization that has multiple salespeople call on them, & sales rep motivation will suffer as well

Page 80: data-quality-concepts.pdf


Continuous Monitoring

Page 81: data-quality-concepts.pdf

Continuous Monitoring

• Set up existing or inferred business rules/tasks
• Automatically discovers business rules and relationships that might otherwise go unnoticed
• Set thresholds and schedule assessments
• Automatically notifies you when your continuously monitored tasks exceed a threshold
• Notification includes the details about the threshold
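A minimal sketch of one monitored task: a metric is computed on schedule and an alert is raised when it crosses its threshold. The 10% null-rate threshold and the sample values are invented:

```python
def null_rate(values):
    """The monitored metric: fraction of null entries in a column."""
    return sum(v is None for v in values) / len(values)

def assess(values, threshold=0.10):
    """One scheduled assessment run against a configured threshold."""
    rate = null_rate(values)
    if rate > threshold:
        # A real tool would email or page the data steward with these details.
        return f"ALERT: null rate {rate:.0%} exceeds threshold {threshold:.0%}"
    return "OK"

print(assess(["S1", None, "S2", None]))  # → ALERT: null rate 50% exceeds threshold 10%
```

Running assessments like this on a schedule, and alerting with the threshold details, is the whole of the continuous-monitoring loop the slide describes.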

Page 82: data-quality-concepts.pdf

Dashboard Reports

Offers a robust set of graphical and dashboard reports to aid in quick identification of data problems

Page 83: data-quality-concepts.pdf


Questions???