Data Cleanup: Unlock the potential at a corporate scale
Dec 14, 2015
Data Cleanup: Unlock the potential at a corporate
scale
Thibault Dambrine
• IT professional for 25 years– Network Designer– ETL Data Warehouse Analyst– Interface Specialist– ERP Developer
• Data Quality Experience: – Tasked to work on the pre-conversion data cleanup
project during the Shell JDE to SAP transition
Introduction
• Premise:“What would a Company-wide Data Quality
Initiative look like?”
• Base:– Experience setting up a data cleanup team, prior
to a JDE to SAP data conversion – Realizing the potential for increasing the data
value within the corporation
Defining Data Quality
• Intrinsic Data Quality: – Accuracy, Completeness, Uniqueness– Reliability, Security and Accessibility
• Contextual Data Quality: – Timeliness, Relevance – Inter-operability, consistency of identifiers
• Accessibility and Representational Data Quality– Ease of understanding – Consistency of identifiers– Consistency in structures
Quantifying the cost of Quality: The 1-10-100 Rule - Additional cost = Less Competitive Business
1
10
100
Prevention Cost
Correction Cost
Failure Cost
The 1-10-100 Quality Cost Rule
$$$
The 1-10-100 Rule A Data Quality Example:
The 1-10-100 Quality Rule Applied to mailing Data: It costs: • $1.00 to verify data at data entry time • $10.00 to clean the data after the fact • $100.00 to mop up errors caused by bad data – Packages mailed in the wrong address – Lost revenue – Lost customers – Bad (sloppy) reputation– Additional carrying cost for bad data
Data Quality: The Up Side!Trust• Consistent data inquiry results build confidence in information
systems• Tractability across Business and IT domains• Consistent data identifiers
– promotes internal cross-department reporting– Consistency– Confidence in results
Productivity • Removing redundant or near-redundant data • Maximizes re-use of data• Reduces the amount of data being processed • Reduces errors
ReliabilityConsistently good quality data is data you can count on!
A Data Cleanup Initiative
• Where to start? • Who will enforce such quality initiatives? • How will the data quality be maintained on
on-going basis?
DQB Task 1: Identify Sponsor and Data Quality Boss
To Identify sponsor: • Communicate clear understanding of the cost
of bad data• Use the 1-10-100 rule • Initiative has to be backed with – Money– Authority– Responsibility
Identify Data Quality Boss – DQB
• Must be knowledgeable on data quality • Must be knowledgeable on the Business• Will be responsible for data quality• Will have authority to make changes
Note: Responsibility without authority will not work
DQB Task 2: Identify Data Sets• Identify/inventory high-level data sets e.g.– CMDB– ERP • Master Data e.g.
– Customer Master – Item Master
• Transaction Data e.g. – PO’s & Invoices– Inventory movements
• Assign data sets to departments, potential lists of Data owners
• Note: The final data owners may not be the one initially penciled in at this stage
DQB Task 3: Identify Business Side Data Owners• Data Owners will effectively be the local, more granular, Data
Quality Bosses. Again, they will need to – Be responsible for the data at their level– Have authority to request changes at their level– Have bottom up knowledge of the data, understand what “should be
there”
• Setup meetings with every department, in line with the Data Sets identified, with aim of coming up with a set of Data Owners– Have a presentation ready – Look for individuals who have been in the Business for a long time,
who are well respected, who understand the data, the dependencies, and know who to talk to, to get answers, from the bottom up
DQB Task 4: Request from Data Owners the “Data Quality Specification” or DQS
• DQS is a document that spells out the data quality rules e.g. – No duplicates or near-duplicates– Data older than x years should be purged or archived– Data Dependencies such as no detail without a header or no invoice
without a PO– No duplicates– Consistency e.g. data format – Quality audit e.g. postal code matches address– More…
• Note: Some rules will apply in all DQS Documents• There is value in sharing, reviewing and updating the DQS over
time. • Data quality issues are not always apparent until a first cut of data
is cleaned up
DQS (part of Task 4 ) Also look for:
• Data Islands– Lack of consistent identifiers inhibit a single view
of the big picture
• Data Opportunity– Could correlated data sets be more useful to the
Business?
• Data Surprises– Misplaced Data– Information buried in free-form fields
• Data Quality – Cannot be a “side job” or a part-time task– Must be staffed with individuals who understand
data. Best candidates • Proficient in SQL, data extract techniques• Understand ETL tools and techniques • Are detailed-oriented • Experience: Data Warehouse staff is good fishing grounds
for such individuals
DQB Task 5: Build IT Data Quality Team
Mid-Presentation Recap: All the Ingredients are now in place
The real work can start! 1. Name Data Quality Sponsor & Data Quality Boss (DQB)
2. Identify Data Sets
3. Data Quality Owners
4. Data Quality Specifications (the DQ Roadmaps)
5. IT Data Cleansing Team
Introducing: the Data Quality Cycle
• We now have – a sponsor– Identified data sets and data owners• They have produced Data Quality Specifications
– An IT Team ready to work on the first Data Quality measurements, based on the DQS
• Next step: Initiate the cleanup– Not a single iteration but one that will be repeated
in a cyclic fashion
The Data Quality Cycle: Simple View
Improve
Monitor
Analyze
DATA
Data Quality Cycle - Corporate Version
Step 1: Identify Bad Data
Step 2: Data Cleansing
Step 3: Measure Progress
Step 4: Data Hygiene
– Schedule Cleanup/Reviews
– Ensure progress visible
Step 5: Sharpen the Saw
Analyze Data
Improve Data
Monitor Progress
Formalize ScheduleMake Progress VISIBLE
Continuous Improvement
Step 1: Identify Bad DataBad Data Definition: Does not adhere to DQS
• Coordinate meetings to translate DQS documents into a suite of repeatable data cleansing procedures
• Very important that these procedures should be repeatable, schedulable on regular basis
• Initial Focus: Identify Bad Data– Bad data(does not adhere to DQS), – Inconsistent data– Old Data – Note: DQS will spell out rules for “old” and “inconsistent”
• Ensure results are reported in a format readable by Management at executive level. This initiative has to be VISIBLE
Step 2: Data Cleansing• Data Cleansing can be done in two ways:
1) Automated, IT based cleanup 2) Business-based, manual cleanup
• Once the bad data is identified, determine who must do what– Business-based,
• manual cleanup appropriate for more subtle tasks, e.g. to determine which of two duplicates identified should be kept. These tasks may require additional research, phone calls etc.
– Automated, • IT based cleanup good for simpler tasks e.g. making telephone
number formats consistent• Can be also sub-contracted to specialized data quality
companies
IT-based Data Cleansing and Outsource Considerations
• Data cleansing may take valuable time from the Business, which is not available – Data Cleanup effort may suffer as a result
• Not all data cleansing is a simple SQL• Not all data is most confidential
When considering data cleansing tasks, look at all possible options
• Outsourcing some data cleansing tasks may be more economical than doing it all in-house
Step 3: Measure Progress• All programs, procedures written with the aim of
identifying data quality issues should– Be stored, like any other programming assets– Be repeatable and be schedulable – Provide aggregate measures to describe the data cleanup
status e.g. • X duplicates • Y old records • Z invoices without PO
– Progress • Has to be measured in a published dashboard• Has to be visible by the entire organization to provide a
sense of value
Step 4: Data Hygiene: * Schedule the Cleanup/Review Tasks * Ensure results are visible
• Bad data is created EVERY DAY• Data quality is an on-going effort • Establish, publish a schedule, part of the dashboard• Ensure there is visibility and accountability to ensure
the levels of bad data – are going down with time – Or are kept at a minimal level
Step 5: Sharpen the Saw
Once the data cleanup cycle is established• Review Results • Review DQS documents periodically (setup schedule) • Get Business input– Improve process– Give input on improvements to be made
• Ask the Business to come up with performance improvement measures born from the Data Quality initiative
ConclusionTwo sets of Five activities best define the Data Quality The Foundation Setup 1. Identify Data Quality Sponsor & Data Quality Boss (DQB)
2. Identify Data Sets
3. Identify Business Side Data Owners
4. Define Data Quality Specifications (DQS) - the DQ Roadmaps
5. Appoint IT Data Cleansing Team
The Data Quality Cycle – Ongoing 6. Identify Bad Data
7. Initiate Data Cleansing
8. Measure Progress
9. Initiate Data Hygiene, Data Cleanup Cycle– Schedule Cleanup/Reviews
– Ensure progress visible
10.Sharpen the Saw
Links• How to improve Data Quality
http://www.informit.com/articles/article.aspx?p=399325&seqNum=3
• Predefined data quality rule definitionshttp://pic.dhe.ibm.com/infocenter/iisinfsv/v9r1/index.jsp?topic=%2Fcom.ibm.swg.im.iis.ia.quality.doc%2Ftopics%2Fpdr_predef.html
• Creating Effective Business Rules: Interview with Graham Witthttp://dataqualitypro.com/data-quality-pro-blog/how-to-create-effective-business-rules-graham-witt
• Gartner Magic Quadrant on Data Quality Tools – “Demand for data quality tools remains strong”http://www.citia.co.uk/content/files/50_161-377.pdf