Data Cleanup: Unlock the potential at a corporate scale.

Post on 14-Dec-2015

224 Views

Category:

Documents

7 Downloads

Preview:

Click to see full reader

Transcript

Data Cleanup: Unlock the potential at a corporate

scale

Thibault Dambrine

• IT professional for 25 years– Network Designer– ETL Data Warehouse Analyst– Interface Specialist– ERP Developer

• Data Quality Experience: – Tasked to work on the pre-conversion data cleanup

project during the Shell JDE to SAP transition

Introduction

• Premise:“What would a Company-wide Data Quality

Initiative look like?”

• Base:– Experience setting up a data cleanup team, prior

to a JDE to SAP data conversion – Realizing the potential for increasing the data

value within the corporation

Defining Data Quality

• Intrinsic Data Quality: – Accuracy, Completeness, Uniqueness– Reliability, Security and Accessibility

• Contextual Data Quality: – Timeliness, Relevance – Inter-operability, consistency of identifiers

• Accessibility and Representational Data Quality– Ease of understanding – Consistency of identifiers– Consistency in structures

Quantifying the cost of Quality: The 1-10-100 Rule - Additional cost = Less Competitive Business

1

10

100

Prevention Cost

Correction Cost

Failure Cost

The 1-10-100 Quality Cost Rule

$$$

The 1-10-100 Rule A Data Quality Example:

The 1-10-100 Quality Rule Applied to mailing Data: It costs: • $1.00 to verify data at data entry time • $10.00 to clean the data after the fact • $100.00 to mop up errors caused by bad data – Packages mailed in the wrong address – Lost revenue – Lost customers – Bad (sloppy) reputation– Additional carrying cost for bad data

Data Quality: The Up Side!Trust• Consistent data inquiry results build confidence in information

systems• Tractability across Business and IT domains• Consistent data identifiers

– promotes internal cross-department reporting– Consistency– Confidence in results

Productivity • Removing redundant or near-redundant data • Maximizes re-use of data• Reduces the amount of data being processed • Reduces errors

ReliabilityConsistently good quality data is data you can count on!

A Data Cleanup Initiative

• Where to start? • Who will enforce such quality initiatives? • How will the data quality be maintained on

on-going basis?

DQB Task 1: Identify Sponsor and Data Quality Boss

To Identify sponsor: • Communicate clear understanding of the cost

of bad data• Use the 1-10-100 rule • Initiative has to be backed with – Money– Authority– Responsibility

Identify Data Quality Boss – DQB

• Must be knowledgeable on data quality • Must be knowledgeable on the Business• Will be responsible for data quality• Will have authority to make changes

Note: Responsibility without authority will not work

DQB Task 2: Identify Data Sets• Identify/inventory high-level data sets e.g.– CMDB– ERP • Master Data e.g.

– Customer Master – Item Master

• Transaction Data e.g. – PO’s & Invoices– Inventory movements

• Assign data sets to departments, potential lists of Data owners

• Note: The final data owners may not be the one initially penciled in at this stage

DQB Task 3: Identify Business Side Data Owners• Data Owners will effectively be the local, more granular, Data

Quality Bosses. Again, they will need to – Be responsible for the data at their level– Have authority to request changes at their level– Have bottom up knowledge of the data, understand what “should be

there”

• Setup meetings with every department, in line with the Data Sets identified, with aim of coming up with a set of Data Owners– Have a presentation ready – Look for individuals who have been in the Business for a long time,

who are well respected, who understand the data, the dependencies, and know who to talk to, to get answers, from the bottom up

DQB Task 4: Request from Data Owners the “Data Quality Specification” or DQS

• DQS is a document that spells out the data quality rules e.g. – No duplicates or near-duplicates– Data older than x years should be purged or archived– Data Dependencies such as no detail without a header or no invoice

without a PO– No duplicates– Consistency e.g. data format – Quality audit e.g. postal code matches address– More…

• Note: Some rules will apply in all DQS Documents• There is value in sharing, reviewing and updating the DQS over

time. • Data quality issues are not always apparent until a first cut of data

is cleaned up

DQS (part of Task 4 ) Also look for:

• Data Islands– Lack of consistent identifiers inhibit a single view

of the big picture

• Data Opportunity– Could correlated data sets be more useful to the

Business?

• Data Surprises– Misplaced Data– Information buried in free-form fields

• Data Quality – Cannot be a “side job” or a part-time task– Must be staffed with individuals who understand

data. Best candidates • Proficient in SQL, data extract techniques• Understand ETL tools and techniques • Are detailed-oriented • Experience: Data Warehouse staff is good fishing grounds

for such individuals

DQB Task 5: Build IT Data Quality Team

Mid-Presentation Recap: All the Ingredients are now in place

The real work can start! 1. Name Data Quality Sponsor & Data Quality Boss (DQB)

2. Identify Data Sets

3. Data Quality Owners

4. Data Quality Specifications (the DQ Roadmaps)

5. IT Data Cleansing Team

Introducing: the Data Quality Cycle

• We now have – a sponsor– Identified data sets and data owners• They have produced Data Quality Specifications

– An IT Team ready to work on the first Data Quality measurements, based on the DQS

• Next step: Initiate the cleanup– Not a single iteration but one that will be repeated

in a cyclic fashion

The Data Quality Cycle: Simple View

Improve

Monitor

Analyze

DATA

Data Quality Cycle - Corporate Version

Step 1: Identify Bad Data

Step 2: Data Cleansing

Step 3: Measure Progress

Step 4: Data Hygiene

– Schedule Cleanup/Reviews

– Ensure progress visible

Step 5: Sharpen the Saw

Analyze Data

Improve Data

Monitor Progress

Formalize ScheduleMake Progress VISIBLE

Continuous Improvement

Step 1: Identify Bad DataBad Data Definition: Does not adhere to DQS

• Coordinate meetings to translate DQS documents into a suite of repeatable data cleansing procedures

• Very important that these procedures should be repeatable, schedulable on regular basis

• Initial Focus: Identify Bad Data– Bad data(does not adhere to DQS), – Inconsistent data– Old Data – Note: DQS will spell out rules for “old” and “inconsistent”

• Ensure results are reported in a format readable by Management at executive level. This initiative has to be VISIBLE

Step 2: Data Cleansing• Data Cleansing can be done in two ways:

1) Automated, IT based cleanup 2) Business-based, manual cleanup

• Once the bad data is identified, determine who must do what– Business-based,

• manual cleanup appropriate for more subtle tasks, e.g. to determine which of two duplicates identified should be kept. These tasks may require additional research, phone calls etc.

– Automated, • IT based cleanup good for simpler tasks e.g. making telephone

number formats consistent• Can be also sub-contracted to specialized data quality

companies

IT-based Data Cleansing and Outsource Considerations

• Data cleansing may take valuable time from the Business, which is not available – Data Cleanup effort may suffer as a result

• Not all data cleansing is a simple SQL• Not all data is most confidential

When considering data cleansing tasks, look at all possible options

• Outsourcing some data cleansing tasks may be more economical than doing it all in-house

Step 3: Measure Progress• All programs, procedures written with the aim of

identifying data quality issues should– Be stored, like any other programming assets– Be repeatable and be schedulable – Provide aggregate measures to describe the data cleanup

status e.g. • X duplicates • Y old records • Z invoices without PO

– Progress • Has to be measured in a published dashboard• Has to be visible by the entire organization to provide a

sense of value

Step 4: Data Hygiene: * Schedule the Cleanup/Review Tasks * Ensure results are visible

• Bad data is created EVERY DAY• Data quality is an on-going effort • Establish, publish a schedule, part of the dashboard• Ensure there is visibility and accountability to ensure

the levels of bad data – are going down with time – Or are kept at a minimal level

Step 5: Sharpen the Saw

Once the data cleanup cycle is established• Review Results • Review DQS documents periodically (setup schedule) • Get Business input– Improve process– Give input on improvements to be made

• Ask the Business to come up with performance improvement measures born from the Data Quality initiative

ConclusionTwo sets of Five activities best define the Data Quality The Foundation Setup 1. Identify Data Quality Sponsor & Data Quality Boss (DQB)

2. Identify Data Sets

3. Identify Business Side Data Owners

4. Define Data Quality Specifications (DQS) - the DQ Roadmaps

5. Appoint IT Data Cleansing Team

The Data Quality Cycle – Ongoing 6. Identify Bad Data

7. Initiate Data Cleansing

8. Measure Progress

9. Initiate Data Hygiene, Data Cleanup Cycle– Schedule Cleanup/Reviews

– Ensure progress visible

10.Sharpen the Saw

Links• How to improve Data Quality

http://www.informit.com/articles/article.aspx?p=399325&seqNum=3

• Predefined data quality rule definitionshttp://pic.dhe.ibm.com/infocenter/iisinfsv/v9r1/index.jsp?topic=%2Fcom.ibm.swg.im.iis.ia.quality.doc%2Ftopics%2Fpdr_predef.html

• Creating Effective Business Rules: Interview with Graham Witthttp://dataqualitypro.com/data-quality-pro-blog/how-to-create-effective-business-rules-graham-witt

• Gartner Magic Quadrant on Data Quality Tools – “Demand for data quality tools remains strong”http://www.citia.co.uk/content/files/50_161-377.pdf

top related