Cleansing/Matching Data Using the SAS Data Quality Toolset
Post on 11-Sep-2021
© Ecclesiastical Insurance Office plc 2011
Nigel Light, Data Governance and Quality Analyst, Ecclesiastical Insurance
Ecclesiastical Insurance – niches where we conduct business
• Charitable Insurer with over 125 years of experience (aim of giving £50M to charity in next 3 years)
• Insure a diverse mix of organisations and risks aligned to our core values
Data Quality tools – why buy one?
Data Quality doesn’t need a toolset.
Without one, using standard database/spreadsheet software, you can :
• Match and deduplicate data
• Write code to
  • Profile data
  • Reformat data
  • Identify where data does not match expected patterns or values
But, with a data quality toolset, this is all generally available ‘out of the box’ (and there are often additional features too)
Cleansing data using SAS Dataflux – the basics
Things that a Data Quality tool, such as SAS, can do….
• Parsing
  • ie Breaking out a string of data into its standard elements
Eg Address string “123 Brookside Close, Henleze, Bristol, BS2 7BJ”
Can be ‘parsed’ into individual data elements
House Number : 123
Street : Brookside Close
Address line 2 : Henleze
City : Bristol
Postcode : BS2 7BJ
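The parsing step can be illustrated with a minimal Python sketch. It is not how the SAS tool parses addresses. It assumes the address always arrives as four comma-separated parts in the order shown above, and the function name parse_address is hypothetical.

```python
import re


def parse_address(address: str) -> dict:
    """Split a comma-separated UK address into named elements.

    Assumption: exactly four parts, in the order
    'number street, address line 2, city, postcode'.
    """
    parts = [part.strip() for part in address.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated parts, got {len(parts)}")
    first_line, line2, city, postcode = parts

    # Leading digits (optionally followed by one letter, eg '12a')
    # are treated as the house number; the rest is the street.
    match = re.match(r"(\d+[A-Za-z]?)\s+(.+)", first_line)
    house_number, street = match.groups() if match else ("", first_line)

    return {
        "House Number": house_number,
        "Street": street,
        "Address line 2": line2,
        "City": city,
        "Postcode": postcode.upper(),
    }


print(parse_address("123 Brookside Close, Henleze, Bristol, BS2 7BJ"))
```

Real address data rarely arrives this tidily, which is why a toolset with pre-built parsing definitions is attractive.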
Cleansing data using SAS – the basics
Also….
• Standardisation
  • Putting data into a defined standard form (for the defined data element)
Eg1 Phone number (standardised to the defined standard form)
07891425687 standardised to (07891) 425687
01452 678923 standardised to (01452) 678923
078 123 98756 standardised to (07812) 398756
Eg2 Suffix Ltd, Lmtd, Ltd., Limited standardised to Ltd.
Eg3 Name Bob, Bobby, Rob, Robert etc standardised to Robert
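The three standardisation examples can be sketched in Python as follows. This is an illustration only, not the SAS implementation. The lookup tables and function names are hypothetical, and the phone format simply follows the 5 + 6 digit layout used in the examples above (real UK area codes vary in length).

```python
import re

# Hypothetical lookup tables built from the examples above.
SUFFIXES = {"ltd": "Ltd.", "lmtd": "Ltd.", "ltd.": "Ltd.", "limited": "Ltd."}
NAMES = {"bob": "Robert", "bobby": "Robert", "rob": "Robert", "robert": "Robert"}


def standardise_phone(raw):
    """Return '(XXXXX) XXXXXX', or None if the value cannot be standardised."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) != 11 or not digits.startswith("0"):
        return None
    return f"({digits[:5]}) {digits[5:]}"


def standardise_from_table(value, table):
    """Look the value up case-insensitively; return it unchanged if unknown."""
    return table.get(value.strip().lower(), value)


print(standardise_phone("078 123 98756"))
print(standardise_from_table("Lmtd", SUFFIXES))
print(standardise_from_table("Bobby", NAMES))
```

Values that do not fit the expected shape are returned as None or left unchanged rather than guessed at.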
Cleansing data using SAS – the basics
As well it can….
• Pattern identification of invalid data items
  • Identification of inappropriate or incomplete field contents based on defined element values
Eg Postcode must be one of 6 formats XX99 9XX, XX9 9XX etc
• A postcode of eg G19L 9P2 would be recognised as invalid.
• Ditto GL21 3 (incomplete)
NB It cannot identify invalid entries eg postcodes, of valid format, which do not exist (need to match to a reference dataset eg Royal Mail postcode file)
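A pattern check of this kind can be sketched in Python by converting each value to the same X/9 notation and comparing it with a list of allowed patterns. The slide names only two of the six formats. The full set below is the commonly documented list of UK postcode layouts and is an assumption here, as are the function names.

```python
import re

# Assumed set of the six UK postcode layouts (X = letter, 9 = digit).
VALID_PATTERNS = {
    "X9 9XX", "X99 9XX", "X9X 9XX",
    "XX9 9XX", "XX99 9XX", "XX9X 9XX",
}


def to_pattern(value):
    """Convert a value to X/9 notation, eg 'BS2 7BJ' becomes 'XX9 9XX'."""
    value = " ".join(value.upper().split())  # normalise case and spacing
    pattern = re.sub(r"[A-Z]", "X", value)
    return re.sub(r"[0-9]", "9", pattern)


def has_valid_format(postcode):
    """True if the postcode fits one of the allowed patterns.

    This checks the format only; it cannot tell whether the postcode exists.
    """
    return to_pattern(postcode) in VALID_PATTERNS


for example in ["BS2 7BJ", "G19L 9P2", "GL21 3"]:
    print(example, to_pattern(example), has_valid_format(example))
```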
Cleansing data using SAS – the basics
And….
• Profile data
  • Gain an understanding and insight of data from a specified source
Eg For a particular field
• What are the 5 most common values? The 5 least common?
• What are the maximum and minimum values?
• What is the type of data in the field (eg alphabetic? numeric? date?)
• What is the longest/shortest value? (alphabetic field)
• What is the average value? (numeric field)
etc
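Several of these profiling questions can be answered with a few lines of Python. The sketch below is a simplified stand-in for a profiling tool, assuming the field is supplied as a list of strings. The function and key names are hypothetical.

```python
from collections import Counter
from statistics import mean


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile_field(values, n=5):
    """Summarise one field: frequencies, lengths and a numeric average."""
    present = [v.strip() for v in values if v and v.strip()]
    ranked = Counter(present).most_common()  # most frequent first
    all_numeric = bool(present) and all(_is_number(v) for v in present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "most_common": ranked[:n],
        "least_common": ranked[-n:][::-1],
        "shortest": min(present, key=len) if present else None,
        "longest": max(present, key=len) if present else None,
        "all_numeric": all_numeric,
        "average": mean(float(v) for v in present) if all_numeric else None,
    }


print(profile_field(["Bristol", "Bristol", "Bath", ""]))
print(profile_field(["10", "20", "30"]))
```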
Cleansing data using SAS – the basics
How?
• SAS - Quality Knowledge Base (QKB)
  • Set of pre-defined ‘out-of-the-box’ templates
  • Target specific
  • Location specific
• Can also define your own values eg to accommodate clergy salutations
The Most Reverend and Right Honourable the Lord Archbishop of Canterbury (www.crockford.org.uk)
Cleansing data using SAS – the basics
BUT
• SAS Data Quality cannot do ‘magic’ and fix all data problems
• Data needs to be of a certain ‘standard’ for the QKB to work satisfactorily
• Eg A phone number entry of ‘Ext 2378’ cannot be standardised
• The UK QKB may also struggle to standardise certain Eastern European and Asian names
Cleansing data using SAS Data Quality – grouping and deduplication
Grouping
De-duplication
Cleansing data using SAS Data Quality – be aware
Look the same? – yes
Same address? – yes (possibly)
Same birthday? - yes
Same surname? - yes
Same initial? – no (possible – but not this pair)
Same people? – NO
To be 100% sure need a unique identifier eg NI Number
Otherwise a human decision is required to identify whether they are the same
SAS Data Quality – ‘fuzzy’ matching
Matching can be simple eg using conventional tools
• Does A = B (exact match – including the number of spaces) eg ‘Smith’ = ‘Smith’
• Does A = B (using ‘wild card’ % characters) eg ‘Smith’ = ‘Smi%%’
But what about … if A looks ‘so similar’ to B they can be considered a ‘likely’ match?
eg ‘Smith’ = ‘Smith’?
eg2 = ‘Smythe’?
eg3 = ‘Smtih’?
eg4 = ‘Smith-Jones’?
= a ‘fuzzy’ match (the degree of ‘fuzziness’ can be varied – ie akin to matching non-identical twins)
This is achieved in SAS by using Match Codes
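The idea of a variable degree of ‘fuzziness’ can be shown with Python's standard difflib module. This is a stand-in for illustration, not the SAS match code mechanism: it scores similarity between 0 and 1, and the threshold (an assumed parameter) decides how loose a match is accepted.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_fuzzy_match(a, b, threshold=0.8):
    """Lowering the threshold makes the match 'fuzzier'."""
    return similarity(a, b) >= threshold


for candidate in ["Smith", "Smythe", "Smtih", "Smith-Jones"]:
    print(candidate, round(similarity("Smith", candidate), 2))
```

With a strict threshold only the exact spelling passes. Relaxing it lets in the transposed and variant spellings, at the cost of more false matches.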
Probabilistic record linkage – Brain overload?
http://en.wikipedia.org/wiki/Record_linkage
Luckily, the tool does all of this for you….
Jaro-Winkler Distance
Low Levenshtein Distance
Phonetic Algorithm
Etc…
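For readers who want to see what one of these measures involves, here is a minimal Python sketch of the Levenshtein distance: the number of single-character insertions, deletions or substitutions needed to turn one string into another. It is a textbook dynamic-programming version, not code taken from the SAS tool.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Smith", "Smtih"))
print(levenshtein("Smith", "Smythe"))
```

A low distance relative to the length of the strings suggests a likely match.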
SAS Data Quality – match codes
Matching – when to use match codes?
• Postcodes, email addresses and phone numbers
• Addresses
• Names
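The general idea behind a match code is that similar-looking values are reduced to the same key, so that records can be grouped on that key. The Python sketch below is a deliberately crude, hypothetical key (upper-case, punctuation stripped, words sorted, vowels after the first letter dropped). It is not the algorithm SAS uses to generate match codes.

```python
import re
from collections import defaultdict


def match_key(text):
    """Reduce a name to a simplified key so near-identical spellings collide."""
    words = re.sub(r"[^A-Za-z ]", " ", text).upper().split()
    key = ""
    for word in sorted(words):  # sorting ignores word order
        reduced = word[0] + re.sub(r"[AEIOUY]", "", word[1:])  # drop vowels
        key += re.sub(r"(.)\1+", r"\1", reduced)  # collapse repeated letters
    return key


def group_by_match_key(values):
    """Group values that share the same key."""
    groups = defaultdict(list)
    for value in values:
        groups[match_key(value)].append(value)
    return dict(groups)


print(group_by_match_key(["Smith", "Smythe", "Smtih", "Smith-Jones", "Jones"]))
```

In this sketch ‘Smith’, ‘Smythe’ and ‘Smtih’ share one key, while ‘Smith-Jones’ and ‘Jones’ each get their own.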
Matching - Final tips
Matching is rarely a straightforward, exact process
It requires perseverance. Success rates can be improved by :
Understanding the data (ie identifying the data nuances)
Experimenting with different matching techniques
Removing any ‘noise’ from the match strings
Using a ‘cascading degree of confidence’ – retaining the strongest match
However, it is often a balance between the number of false matches and missing the occasional ‘true’ match
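A ‘cascading degree of confidence’ can be sketched in Python as a series of progressively looser tests, stopping at the first (strongest) one that succeeds. The noise-word list, the confidence labels and the fuzzy threshold below are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

NOISE_WORDS = {"the", "of", "and"}  # hypothetical noise list


def clean(text):
    """Lower-case, strip punctuation and remove noise words."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(w for w in words if w not in NOISE_WORDS)


def best_match(target, candidates, fuzzy_threshold=0.85):
    """Return (candidate, confidence level), trying the strongest test first."""
    # Level 1: exact match on the raw strings.
    for candidate in candidates:
        if candidate == target:
            return candidate, "exact"

    # Level 2: exact match once noise has been removed.
    cleaned = clean(target)
    for candidate in candidates:
        if clean(candidate) == cleaned:
            return candidate, "cleaned"

    # Level 3: best fuzzy match, accepted only above the threshold.
    scored = [
        (SequenceMatcher(None, cleaned, clean(c)).ratio(), c) for c in candidates
    ]
    if scored:
        score, candidate = max(scored)
        if score >= fuzzy_threshold:
            return candidate, "fuzzy"
    return None, "none"


churches = ["St Mary the Virgin", "St Marys Church", "All Saints"]
print(best_match("st. mary virgin", churches))
print(best_match("St Mary Virgn", churches))
```

Raising the fuzzy threshold reduces false matches but misses more ‘true’ ones, which is the balance described above.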
Example
Business Issue – Theft of metals from church roofs, linked to the high demand for metals
Example
Mitigation steps
Include application of Smartwater
Uniquely links metal to a location and is now a pre-requisite for obtaining insurance for a church with Ecclesiastical
Example
Business Process :
Example
Data Issues :
• ‘House of God…’
• ‘Many to many’
• Transposition of key data elements
• Standardisation
Matching and Cleansing Methodology
• Remove non-Anglican church entries from the policy file
• Validate the Smartwater supplied policy number and attempt to match it
• ‘Break out’ church name
• Standardise and match on church name, postcode/short-postcode and town
• Score and de-duplicate, retaining the highest scoring match
• Output confident matches …and where not confident, suggest alternatives, to permit data correction
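The scoring and de-duplication steps can be sketched in Python as below. Everything specific here is an assumption for illustration: the field names, the weights, the confidence cut-off and the sample records are invented, and the actual work was done in the SAS toolset rather than in code like this.

```python
from difflib import SequenceMatcher


def outward(postcode):
    """Short postcode: the part before the space, eg 'GL1' from 'GL1 1AA'."""
    parts = postcode.upper().split()
    return parts[0] if parts else ""


def score(location, policy):
    """Score a Smartwater location against a policy (hypothetical weights)."""
    name_ratio = SequenceMatcher(
        None, location["church"].lower(), policy["church"].lower()
    ).ratio()
    points = 50 * name_ratio

    loc_pc = " ".join(location["postcode"].upper().split())
    pol_pc = " ".join(policy["postcode"].upper().split())
    if loc_pc and loc_pc == pol_pc:
        points += 30  # full postcode agrees
    elif outward(loc_pc) and outward(loc_pc) == outward(pol_pc):
        points += 15  # only the short postcode agrees

    if location["town"].strip().lower() == policy["town"].strip().lower():
        points += 20
    return points


def match_location(location, policies, confident=80, top_n=3):
    """Keep the highest scoring policy, or suggest alternatives for review."""
    ranked = sorted(policies, key=lambda p: score(location, p), reverse=True)
    if ranked and score(location, ranked[0]) >= confident:
        return {"status": "confident", "policy": ranked[0]["policy_no"],
                "alternatives": []}
    return {"status": "review", "policy": None,
            "alternatives": [p["policy_no"] for p in ranked[:top_n]]}


# Invented sample records.
policies = [
    {"policy_no": "P001", "church": "St Mary", "postcode": "GL1 1AA", "town": "Gloucester"},
    {"policy_no": "P002", "church": "St Mark", "postcode": "GL1 2BB", "town": "Gloucester"},
    {"policy_no": "P003", "church": "All Saints", "postcode": "BS2 7BJ", "town": "Bristol"},
]
print(match_location(
    {"church": "St. Mary", "postcode": "gl1 1aa", "town": "Gloucester"}, policies))
```

Locations that fall below the cut-off are returned with a short list of alternatives so that a person can correct the data.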
Example
Final Results :
• 85% of Smartwater locations were confidently matched to a policy
• 4% of the remaining policies had a single, more confidently matched alternative to the policy specified by Smartwater
• The remainder of the Smartwater entries had multiple possibilities
  • these required a manual decision to be made
• Currently working to maintain the level of Data Quality
• … and all users of the system can be more confident of the data used in the process
Example
Questions?
Thank you. Any questions?
Nigel.Light@Ecclesiastical.com
www.Ecclesiastical.com