Delivering Data Quality in the real world: A case study using SAS Dataflux
Data quality - Using sas dataflux in the real world - Shane Gibson - OptimalBI

Oct 19, 2014

Overview of using SAS Dataflux to resolve a number of data quality problems on a real-world project.
Page 1: Data quality - Using sas dataflux in the real world - Shane Gibson - OptimalBI

Delivering Data Quality in the real world

A case study using SAS Dataflux

What I Will Cover

1. What is Data Quality?
2. What is SAS Dataflux?
3. The approach we took and why
4. The things we did and how
5. Monitoring the results

1. What is Data Quality

• Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, the data are deemed of high quality if they correctly represent the real-world construct to which they refer.
• Source: Wikipedia, http://en.wikipedia.org/wiki/Data_quality
• Joseph Moses Juran (December 24, 1904 – February 28, 2008) was a 20th century management consultant, principally remembered as an evangelist for quality and quality management.

2. What is SAS Dataflux

DataFlux provides organisations with the ability to plan and complete data integration, data quality and master data management (MDM) projects, all from a single interface. It makes it easier to do:

• Profiling
• Standardization
• Matching
• Augmentation
• Business Rules Monitoring

It is delivered as:

• Standalone Desktop Client
• Component of SAS Enterprise Data Integration Server
• Full Dataflux solution

3. The approach we took and why

Information Governance Hierarchy

• Board
• Executive Team
• Data Governance Committee
• Data Council
• Business Data Stewards and Technical Data Stewards

Data Governance: From theory to practice
Zeeman van der Merwe, Manager: Information Integrity and Analysis, ACC
2010 SUNZ Conference, 16 February 2010

Data Quality Maturity Model

Figure axes: Trust in Information, plotted against Maturity of Data Governance processes.

• Unaware: Data management is characterised as ad hoc or chaotic. The organisation depends solely on individuals with no awareness of data management practices, resulting in variable results and no repeatability.
• Repeatable: Data management expertise exists internally and there is some ability to duplicate good practices. Key data management individuals are assigned to critical projects to reduce risks and improve results.
• Defined: The organisation uses a set of defined data management processes, which are published for recommended use.
• Managed: The use of the data management processes is required and monitored. All projects and initiatives include data management as a core part of their objectives and deliverables.
• Effective: Data Quality is automatically monitored and reported. Reliability and predictability of results are monitored via Six Sigma or an equivalent measurement methodology.
• Optimised: The organisation regularly analyses existing data management processes to determine where changes can deliver improved efficiencies, and implements them.

Data Cleansing Business Process

Diagram steps:

• Data Quality Issue
• Monitoring Scorecards
• Update Source System
• Profile Issues
• Initiate Information Governance Group
• Prioritise Data Quality Issues
• Manually or Programmatically update data

4. The things we did and how

We used Dataflux to:

• Profile the data
  • Profiled
    • Phone numbers
    • Customer Attributes
      • Gender
      • Date of Birth
      • Missing Values
    • Addresses
      • Suppliers
      • Customers
      • Locations

4. The things we did and how

Example

4. The things we did and how

Profile Data

Alpha String   Count
(NIGHTS            1
-ROOM              1
-X                 3
ACT                1
COURSE             1
EX                 7
EXT                4
FAX                1
N/A                2
SCHOOL             1
WK                 1
X                 48
XT                 7
XTN                3

Category                                    Count   Percentage
AREA CODE MISSING                           13476   51%
INVALID MOBILE NUMBER                         158   1%
INVALID NUMBER                                212   1%
INVALID LANDLINE NUMBER, TOO FEW DIGITS       723   3%
INVALID LANDLINE NUMBER, TOO MANY DIGITS      366   1%
MOBILE NUMBER                                1744   7%
MOBILE NUMBER OBSOLETE                        942   4%
NUMBER OK                                    8324   30%
ZERO                                          511   2%

Pattern         Count   Percentage
9999999         12760   51%
9999999999       2979   12%
99 9999999       2634   10%
999999999        2210   9%
999 9999         1453   6%
99 999 9999       998   4%
999 999 9999      605   2%
99*9999999        493   2%
999 9999999       297   1%
99999999999       292   1%
99999999          101   0%
9999999 9999       84   0%
999*9999           71   0%
999 999999         53   0%
9999999 999        49   0%
999 999 999        47   0%
999999             41   0%
999 9999 999       36   0%
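A pattern profile like the one above can be approximated outside Dataflux. The sketch below is an illustration only, not the Dataflux implementation: it assumes that "9" stands for any digit, uses "A" for any letter, and leaves every other character unchanged. The function names and sample values are made up.

```python
from collections import Counter

def to_pattern(value):
    """Map each digit to '9' and each letter to 'A'; keep other characters."""
    out = []
    for ch in value.strip():
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def profile(values):
    """Return (pattern, count, percentage) rows, most frequent first."""
    counts = Counter(to_pattern(v) for v in values)
    total = sum(counts.values())
    return [(p, n, round(100 * n / total)) for p, n in counts.most_common()]

# Made-up sample values, not the project's data.
sample = ["4721234", "04 4721234", "4729876", "021 123 4567"]
rows = profile(sample)
for pattern, count, pct in rows:
    print(f"{pattern:<14}{count:>6}{pct:>5}%")
```

Grouping by pattern rather than by value is what makes problems such as a missing area code visible at a glance: thousands of distinct numbers collapse into a handful of shapes.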

4. The things we did and how

We used Dataflux to:

• Standardise Data
  • Use Dataflux Quality Knowledge Base to:
    • Standardise Person Names
      • Robert, Rob, Bob
    • Standardise Location Names
      • Wellington, WLG, Wgtn
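The idea behind this kind of standardisation can be sketched as a simple lookup from known variants to one standard value. This is not the Quality Knowledge Base API. The scheme contents below come from the slide's examples, and treating "Robert" and "Wellington" as the standard forms is an assumption.

```python
# Hypothetical schemes: variant (lower-case) -> standard value.
PERSON_NAMES = {"robert": "Robert", "rob": "Robert", "bob": "Robert"}
LOCATIONS = {"wellington": "Wellington", "wlg": "Wellington", "wgtn": "Wellington"}

def standardise(value, scheme):
    """Return the standard form, or the trimmed input when no entry exists."""
    cleaned = value.strip()
    return scheme.get(cleaned.lower(), cleaned)

print(standardise("Bob", PERSON_NAMES))
print(standardise(" WGTN ", LOCATIONS))
print(standardise("Auckland", LOCATIONS))
```

Values without a scheme entry pass through unchanged, so unknown variants stay visible for later profiling instead of being silently altered.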

4. The things we did and how

We used Dataflux to:

• Consolidate Data
  • Merge multiple people records
  • Multiple matching rules
  • Needed to be reusable
  • Needed to have logic layers
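One way to picture multiple, reusable matching rules is as a list of small functions, each producing a match key, applied one after another so that any rule can link two records into the same cluster. The sketch below rests on that assumption and is not how Dataflux builds match codes. The record fields, rule names, and sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical records; the field names are assumptions for illustration.
RECORDS = [
    {"id": 1, "name": "Robert Smith", "dob": "1970-01-01", "phone": "044721234"},
    {"id": 2, "name": "robert smith", "dob": "1970-01-01", "phone": ""},
    {"id": 3, "name": "R Smith", "dob": "", "phone": "044721234"},
    {"id": 4, "name": "Jane Doe", "dob": "1982-05-17", "phone": "0211234567"},
]

# Each rule returns a match key, or None when the record lacks the
# fields the rule needs. Rules can be reused across data sets.
def name_and_dob(rec):
    if rec["name"] and rec["dob"]:
        return ("name_dob", rec["name"].lower(), rec["dob"])
    return None

def phone_only(rec):
    return ("phone", rec["phone"]) if rec["phone"] else None

RULES = [name_and_dob, phone_only]

def cluster(records, rules):
    """Group record ids that share a key under at least one rule."""
    parent = {rec["id"]: rec["id"] for rec in records}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for rule in rules:
        first_seen = {}
        for rec in records:
            key = rule(rec)
            if key is None:
                continue
            if key in first_seen:
                parent[find(rec["id"])] = find(first_seen[key])
            else:
                first_seen[key] = rec["id"]

    groups = defaultdict(list)
    for rec in records:
        groups[find(rec["id"])].append(rec["id"])
    return sorted(groups.values())

print(cluster(RECORDS, RULES))
```

Matching on phone number alone is a deliberately loose rule here; in practice each rule's strictness would be tuned against known duplicates.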

 

4. The things we did and how

Logic Layers

4. The things we did and how

We used Dataflux to:

• Programmatically Validate and Augment the Data
  • Validate against external datasets
    • NZ Post PAF
    • LINZ Data
    • Potentially
      • Births, Deaths and Marriages data
      • External Customer Lists
    • Can't find a valid phone number dataset
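Validation against an external dataset amounts to normalising a value, looking it up in the reference data, and copying back any extra attributes found there. The sketch below is a stand-in under that assumption: the reference dictionary is a made-up single entry and does not reflect the actual NZ Post PAF or LINZ file formats.

```python
def normalise(address):
    """Upper-case, drop commas, and collapse repeated whitespace."""
    return " ".join(address.upper().replace(",", " ").split())

# Hypothetical reference data: normalised address -> postcode.
REFERENCE = {normalise("1 Example Street, Wellington"): "6011"}

def validate_and_augment(record, reference):
    """Flag whether the address is in the reference set and add its postcode."""
    postcode = reference.get(normalise(record["address"]))
    return {**record, "address_valid": postcode is not None, "postcode": postcode}

print(validate_and_augment({"address": "1 example street  wellington"}, REFERENCE))
print(validate_and_augment({"address": "99 Nowhere Road"}, REFERENCE))
```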

 

5. Monitoring the Results

• Typical attributes to measure data quality
  • Accuracy
    Are targets defined to measure against?
  • Correctness
    Requires something to look up
  • Data Age
    Data Quality degrades over time, is that acceptable?
  • Completeness
    What are the business rules that define what is acceptable?
  • Relevance
    Have you documented how it is used?
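Of these attributes, completeness is the simplest to turn into a number. The sketch below assumes a business rule expressed as a minimum percentage of non-empty values per field. The field names, thresholds, and sample records are hypothetical.

```python
def is_filled(value):
    """Treat None and blank strings as missing."""
    return value is not None and str(value).strip() != ""

def completeness(records, field):
    """Percentage of records with a non-missing value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for rec in records if is_filled(rec.get(field)))
    return 100 * filled / len(records)

def check(records, thresholds):
    """Return {field: (score, meets_rule)} for each business rule."""
    result = {}
    for field, minimum in thresholds.items():
        score = completeness(records, field)
        result[field] = (score, score >= minimum)
    return result

# Made-up sample and thresholds.
customers = [
    {"gender": "F", "dob": "1982-05-17"},
    {"gender": "", "dob": "1970-01-01"},
    {"gender": "M", "dob": None},
    {"gender": "M", "dob": "1990-09-09"},
]
print(check(customers, {"gender": 70, "dob": 90}))
```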

5. Monitoring the Results

• Give the business owners feedback that tells them:
  • If their Data Quality is getting better or worse
  • Who is the business owner who can impact the data quality
  • What do they need to change
• Encourage the business owners to improve the quality of the data
  • Ideally programmatically update the data for them
  • Or use centers of excellence to update data (e.g. Call Centers for phone numbers)
  • Or provide the business a recommended process to update it
• Make people accountable for bad data quality!

 

5. Monitoring the Results

Customer Records

Record Type   Count       Percentage
Duplicates    1,037,964   56.85%
Master          787,673   43.15%
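The percentages in the Customer Records table follow directly from the two counts, which sum to 1,825,637 records. A minimal check, using only the figures from the slide:

```python
counts = {"Duplicates": 1_037_964, "Master": 787_673}
total = sum(counts.values())

# Share of each record type, rounded to two decimal places.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:<12}{n:>10,}{shares[name]:>8.2f}%")
```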

Data Quality is not a project, it is a never-ending process

The shameless plug!

• www.optimalBI.com: Delivering Actionable Insight
• www.saasInct.com: PreBuilt SAS Portlets
• blog.saasInct.com: Ramblings about SAS