Digitization - Data - Intelligence

Automated Data Quality Assurance with Machine Learning and Autoencoders

SDS2019

Bern, 14.06.2019

Martin Müller-Lennert

Senior Data Scientist

[email protected]

Milica Petrović

Senior Data Scientist

[email protected]

Talk Outline

1 What’s Wrong with Data Quality?
2 Error Detection using ML
3 Demo
4 Findings and Outlook

Data Quality Today: What’s wrong with it?

[Diagram: Data Sources, Data Quality, End User]

Data Quality Today: Our Take at a Solution

Data Quality Today
▪ Manually coded SQL rules
▪ Uni-/bi-variate checks

Challenges
▪ Too much data: set of rules
▪ Too few rules: undetected errors
▪ Too narrow focus: one-dimensional
▪ Too late: new error types detected after occurrence

Autoencoders
▪ Unsupervised
▪ Model of input data → anomalies easily detected
▪ Capture multivariate relationships

Solutions with Machine Learning
▪ Automate: simultaneous error detection & faster process
▪ Reusability: tailored ML algorithms reused for fields of similar type
▪ Deep dive: discovery of new types of errors based on multivariate relationships

Autoencoders for Data Quality: Architecture and Training

Target: Reconstruct input

Bottleneck: Enforced by architecture or regularization

Ensures network learns structure of input data

For good data only

[Figure: autoencoder network mapping INPUT to OUTPUT]
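
The idea of a bottleneck enforced by architecture can be illustrated with a very small sketch. It is not the model from the talk: it assumes a linear autoencoder with a single hidden unit, two made-up numeric fields whose "good" records follow a simple relationship, and plain gradient descent. All names and data are hypothetical.

```python
import random

def train_autoencoder(records, lr=0.05, epochs=1000, seed=0):
    """Fit a linear autoencoder with one bottleneck unit by gradient descent."""
    rng = random.Random(seed)
    n_features = len(records[0])
    enc = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]  # encoder weights
    dec = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]  # decoder weights
    n = len(records)
    for _ in range(epochs):
        grad_enc = [0.0] * n_features
        grad_dec = [0.0] * n_features
        for x in records:
            z = sum(w * xi for w, xi in zip(enc, x))      # bottleneck value
            err = [d * z - xi for d, xi in zip(dec, x)]   # reconstruction minus input
            back = sum(e * d for e, d in zip(err, dec))   # error seen at the bottleneck
            for j in range(n_features):
                grad_dec[j] += err[j] * z
                grad_enc[j] += back * x[j]
        enc = [w - lr * g / n for w, g in zip(enc, grad_enc)]
        dec = [d - lr * g / n for d, g in zip(dec, grad_dec)]
    return enc, dec

def reconstruction_error(x, enc, dec):
    """Mean squared error between a record and its reconstruction."""
    z = sum(w * xi for w, xi in zip(enc, x))
    return sum((d * z - xi) ** 2 for d, xi in zip(dec, x)) / len(x)

# Hypothetical good records: the second field is roughly twice the first.
data_rng = random.Random(1)
good_records = []
for _ in range(100):
    a = data_rng.uniform(-1.0, 1.0)
    good_records.append((a, 2.0 * a + data_rng.gauss(0.0, 0.05)))

enc, dec = train_autoencoder(good_records)
good_errors = [reconstruction_error(x, enc, dec) for x in good_records]
bad_error = reconstruction_error((1.0, -2.0), enc, dec)  # breaks the relationship
```

Because only one hidden unit is available, the network can only keep the dominant relationship between the two fields. A record that violates it, such as `(1.0, -2.0)`, should therefore reconstruct much worse than the good records.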

Training on imperfect data: Requires large share of good data

Limits capacity of network: More layers are not always better

From simple one-layer NN up to VAE with LSTM cells

Discriminating Good and Bad Data Records: Clustering the Reconstruction Errors

[Figure: mean squared error per individual data record, with a kernel density estimate of the errors]

Challenge: Many data points and potentially extreme class imbalance
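
One plausible way to turn a kernel density estimate of the reconstruction errors into a decision rule is sketched below. The slides do not spell out the exact procedure, so the rule used here (cut at the lowest-density point to the right of the main mode) is an assumption, as are the bandwidth, the grid size and the example error values.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

def kde_threshold(errors, bandwidth, grid_size=200):
    """Lowest-density grid point to the right of the main mode of the errors."""
    density = gaussian_kde(errors, bandwidth)
    lo, hi = min(errors), max(errors)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    values = [density(x) for x in grid]
    mode_index = values.index(max(values))            # peak formed by the good records
    tail = range(mode_index, grid_size)
    return grid[min(tail, key=lambda i: values[i])]   # valley between the two clusters

# Hypothetical reconstruction errors: many small ones, a few large ones.
good_errors = [0.01 + 0.001 * i for i in range(50)]
bad_errors = [1.4, 1.5, 1.6]
errors = good_errors + bad_errors

threshold = kde_threshold(errors, bandwidth=0.05)
flagged = [e for e in errors if e > threshold]
```

With strong class imbalance the anomaly peak is tiny compared to the main mode, which is why the sketch searches for the valley rather than for a second peak of comparable height.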

Discriminating Good and Bad Data Records: Sequence of Autoencoders

Challenge: Magnitude of reconstruction error varies across data error types

[Figure: 1st iteration. Remove detected anomalies, keep rest of data]

Across iterations: Increase model complexity

Stopping: When threshold separates large chunk of data
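
The iteration logic described above (score, split, remove, repeat, stop once the split cuts off a large chunk) can be sketched as follows. This is a hedged illustration only: the scorer is a stand-in (squared deviation from the median) instead of a trained autoencoder and ignores the `complexity` argument, the two-cluster split is a simple variance-based rule chosen for brevity, and `max_share` is a hypothetical stopping parameter.

```python
import statistics

def median_scorer(values, complexity):
    """Stand-in for an autoencoder of the given complexity (complexity is unused here)."""
    center = statistics.median(values)
    return [(v - center) ** 2 for v in values]

def two_cluster_threshold(errors):
    """Split sorted errors into two groups with maximal between-group variance."""
    ordered = sorted(errors)
    n, total = len(ordered), sum(ordered)
    best_score, best_threshold = -1.0, ordered[-1]
    low_sum = 0.0
    for i in range(1, n):
        low_sum += ordered[i - 1]
        if ordered[i] == ordered[i - 1]:
            continue  # no boundary between equal errors
        low_mean = low_sum / i
        high_mean = (total - low_sum) / (n - i)
        score = i * (n - i) * (low_mean - high_mean) ** 2
        if score > best_score:
            best_score = score
            best_threshold = (ordered[i - 1] + ordered[i]) / 2
    return best_threshold

def iterative_detection(records, scorer, max_share=0.2, max_rounds=10):
    """Repeatedly score the data, remove detected anomalies, keep the rest."""
    remaining, detected = list(records), []
    for complexity in range(1, max_rounds + 1):   # complexity grows per iteration
        errors = scorer(remaining, complexity)
        threshold = two_cluster_threshold(errors)
        flagged = [r for r, e in zip(remaining, errors) if e > threshold]
        # Stop when nothing is flagged or the threshold separates a large chunk.
        if not flagged or len(flagged) / len(remaining) > max_share:
            break
        detected.append(flagged)
        remaining = [r for r, e in zip(remaining, errors) if e <= threshold]
    return detected, remaining

# Hypothetical field: values near 100, one extreme and one moderate error.
good_values = [100 + d for d in range(-5, 6)] * 5
detected, remaining = iterative_detection(good_values + [160, 10000], median_scorer)
```

In this toy data the extreme value dominates the first iteration and hides the moderate one, which only becomes separable after the first anomaly has been removed. That is the effect the challenge on the slide refers to.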

Demo: Birth date

Demo: First name

Demo: Revenue

Reusability of Pre-Processing and Model Setup

Type of variable   Pre-processing                                 Model
Character          One-hot encoding of characters                 Variational autoencoders with LSTM cells
Categorical        One-hot encoding                               Complete autoencoder with regularization
Date               Numerical features from digits, normalization  Complete autoencoder with regularization
Numerical          Normalization                                  Undercomplete autoencoder with custom loss

Generic pipeline per field type → can be reused for other fields of same type
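
A minimal sketch of what a reusable pre-processing step per field type could look like is given below. Only the pre-processing column of the table is covered. The function names, the lowercase Latin alphabet, the `YYYY-MM-DD` date format, the fixed text length of 12 and the min-max scaling are all assumptions made for illustration.

```python
import string

ALPHABET = string.ascii_lowercase  # assumed character set for text fields

def one_hot_characters(text, max_len):
    """One 26-wide row per position; padding and unknown characters stay all-zero."""
    padded = text.lower()[:max_len].ljust(max_len)
    return [[1 if ch == letter else 0 for letter in ALPHABET] for ch in padded]

def one_hot_category(value, categories):
    """One indicator per known category."""
    return [1 if value == c else 0 for c in categories]

def date_digit_features(iso_date):
    """Digits of a 'YYYY-MM-DD' string, scaled to [0, 1]."""
    return [int(ch) / 9 for ch in iso_date if ch.isdigit()]

def min_max_normalize(values):
    """Scale a numerical column to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# One pre-processing function per field type, mirroring the table above.
PREPROCESSORS = {
    "character": lambda column: [one_hot_characters(v, max_len=12) for v in column],
    "categorical": lambda column: [one_hot_category(v, sorted(set(column))) for v in column],
    "date": lambda column: [date_digit_features(v) for v in column],
    "numerical": min_max_normalize,
}

def preprocess(column, field_type):
    """Look up the pipeline by field type so it can be reused for similar fields."""
    return PREPROCESSORS[field_type](column)

first_names = preprocess(["Anna", "Bo"], "character")
birth_dates = preprocess(["2019-06-14"], "date")
revenues = preprocess([10.0, 20.0, 30.0], "numerical")
countries = preprocess(["CH", "DE", "CH"], "categorical")
```

Keying the lookup on the field type rather than on the field name is what makes the setup reusable: a new date column would go through `preprocess(column, "date")` without further customization.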

Key Findings from Application to Production Data

High reusability: One-time customization effort

Replication: ML can automatically replicate rule-based data quality checks

Extension: Autoencoders can find additional errors

SME feedback necessary: Sanity checks during model building

Multivariate relationships: Detection of interdependencies

Unsupervised learning! Training data quality matters

[Group labels on slide: Training, Performance]

Outlook: Future Endeavors

Model lifecycle: improve models over time from feedback

Detected anomalies are errors: correct or leave out

Automated error correction

Data remediation using RPA

Batch processing: extend error detection to whole batches of data

Detect faulty data sourcing process

Detected anomalies are false positives: increase weight during training
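
Increasing the weight of confirmed false positives during training could, for example, be done through a weighted reconstruction loss. The sketch below is one possible reading of that outlook item, not a method described in the talk. The helper names and the up-weighting factor are hypothetical.

```python
def weighted_mse(records, reconstructions, weights):
    """Weighted average of the per-record mean squared reconstruction errors."""
    loss = 0.0
    for x, x_hat, w in zip(records, reconstructions, weights):
        record_error = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        loss += w * record_error
    return loss / sum(weights)

def upweight_false_positives(weights, false_positive_ids, factor):
    """Multiply the weight of records that reviewers marked as false positives."""
    return [w * factor if i in false_positive_ids else w for i, w in enumerate(weights)]

# Hypothetical example: record 1 was flagged, but a reviewer confirmed it is valid.
records = [[1.0, 1.0], [3.0, 3.0]]
reconstructions = [[1.0, 1.0], [1.0, 1.0]]
weights = [1.0, 1.0]

loss_before = weighted_mse(records, reconstructions, weights)
new_weights = upweight_false_positives(weights, {1}, factor=3.0)
loss_after = weighted_mse(records, reconstructions, new_weights)
```

The poorly reconstructed but valid record contributes more to the loss after up-weighting, so retraining would push the model to reconstruct it better and stop flagging similar records.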