Improving Data Validation using Machine Learning
Dr. Christian Ruiz | Swiss Federal Statistical Office (FSO) | Plausi++ | 14.3.2019

Improving Data Validation using Machine Learning - coms.events · Microsoft PowerPoint - Presenation_Plausi++_NTTS_20190314 Author: U80843671 Created Date: 3/8/2019 2:07:05 PM ...

Jul 15, 2020

Transcript
Page 1

Improving Data Validation using Machine Learning

Source: CC0 Public Domain

Source: Packard Bell Computer, 1964

Team ‘Plausi++’:

Christian Ruiz

Christine Ammann Tschopp

Elisabeth Kuhn

Laurent Inversin

Mehmet Aksözen

Stefan Rüber

Page 2

Delimitation

The pilot project has been running since April 2018 and is part of the «data innovation strategy» of the Swiss Federal Statistical Office (FSO). It will end in June 2019. This presentation does not present the final results of the pilot project.

Page 3

Overview

Part I: Introduction

Part II: Basic idea of Plausi++

Part III: Feedback mechanism

Page 4

Page 5

Data validation / «Plausi»

- Manual (different solutions)

- Based on rules (different solutions)

- Idea: Automatic recognition (enhancing other types of data validation)

Source: CC0 Public Domain

Page 6

Aims

- Higher resource efficiency

- Higher velocity

- Less administrative burden for our data suppliers

- Higher data quality

Page 7

PHASE: TRAINING (supervised learning)

Diagram: Old data, Pre-Processing, Feature Engineering, Recognize patterns

(Source: CC0 Public Domain with modifications)

Learn?

• Feed a large amount of data

• Recognize patterns

• Prediction of values

• Prediction is not equivalent to explanation or interpretation

Page 8

PHASE: TESTING (supervised learning)

Diagram: Pre-Processing, Feature Engineering, Recognize patterns, Predicted data

(Source: CC0 Public Domain with modifications)
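
The training and testing phases can be sketched in a few lines of Python. This is only an illustration under assumptions: the records, the age and FTE bands, and the lookup-table "model" are hypothetical stand-ins, not the FSO data or the algorithms used in Plausi++.

```python
from collections import Counter, defaultdict

def train(records):
    """TRAINING: learn the most frequent label for each feature combination."""
    counts = defaultdict(Counter)
    for features, label in records:
        counts[features][label] += 1
    # Overall majority label, used as a fallback for unseen combinations.
    overall = Counter(label for _, label in records).most_common(1)[0][0]
    table = {f: c.most_common(1)[0][0] for f, c in counts.items()}
    return table, overall

def predict(model, features):
    """TESTING: look up the learned pattern, else fall back to the majority."""
    table, overall = model
    return table.get(features, overall)

# Hypothetical old data: ((age_band, fte_band), staff_category)
old_data = [
    (("50+", "full"), "P"),
    (("50+", "full"), "P"),
    (("50+", "full"), "P"),
    (("50+", "full"), "U"),
    (("<30", "part"), "A"),
    (("<30", "part"), "A"),
    (("30-49", "full"), "D"),
]
model = train(old_data)

# Hypothetical new data; the last combination never occurred in training.
new_data = [("50+", "full"), ("<30", "part"), ("30-49", "part")]
predicted = [predict(model, f) for f in new_data]
print(predicted)
```

The lookup table only memorises patterns; it predicts values without explaining them, which is the distinction the training slide draws.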

Page 9

More human than machine

Human: Comprehension of the data

Human: Feature Engineering and pre-processing

Human: Choice of appropriate algorithms

Machine: Calculation

Machine: Preparation for final user

Human: Calibration and decisions

Human: Integration into the production environment

Page 10

Part II: Basic idea of Plausi++

1) Selection of variables (from a FSO data set)

2) Prediction by ML algorithms of a ‘dependent variable’

3) Comparison between predicted and received data

If the two deviate: possible mistake or outlier («something seems odd», e.g. a 21-year-old professor)
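
The three steps above can be sketched as follows. This is a minimal sketch under assumptions: `toy_predict` is a hypothetical hand-written rule standing in for a trained ML model, and the record layout (`id`, `age`, `reported`) is invented for illustration.

```python
def flag_deviations(records, predict):
    """Return the records whose reported category differs from the prediction."""
    flagged = []
    for record in records:
        expected = predict(record)
        if expected != record["reported"]:
            flagged.append({
                "id": record["id"],
                "reported": record["reported"],
                "predicted": expected,
            })
    return flagged

# Hypothetical stand-in for a trained model: a single rule on age.
def toy_predict(record):
    return "A" if record["age"] < 30 else "P"

# Hypothetical received data.
received = [
    {"id": 1, "age": 21, "reported": "P"},  # a 21-year-old professor seems odd
    {"id": 2, "age": 55, "reported": "P"},
    {"id": 3, "age": 27, "reported": "A"},
]

flagged = flag_deviations(received, toy_predict)
print(flagged)
```

A flagged record is only a candidate: it may be a mistake or a genuine outlier, so it still needs to be checked.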

Page 11

Example: Staff working at higher education institutions (HEI)

Staff category explained by sex, FTE, field, age, nationality, university

Dependent variable has 4 classes

P: Professors

U: Lecturers

A: Research assistants

D: Administrative employees

Note: some slides use 5 classes (+W)
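
One possible way to represent such a record in Python is shown below. The type names, field names, and example values are assumptions made for illustration. The fifth class W is left out because the slides do not say what it stands for.

```python
from dataclasses import dataclass
from enum import Enum

class StaffCategory(Enum):
    """The four classes of the dependent variable named on the slide."""
    P = "Professors"
    U = "Lecturers"
    A = "Research assistants"
    D = "Administrative employees"

@dataclass(frozen=True)
class StaffRecord:
    # Explanatory variables listed on the slide (names are hypothetical).
    sex: str
    fte: float            # full-time equivalent, e.g. 0.8
    field: str
    age: int
    nationality: str
    university: str
    # Dependent variable to be predicted and compared.
    category: StaffCategory

# Hypothetical record, not real data.
record = StaffRecord("F", 0.8, "Physics", 21, "CH", "University X", StaffCategory.P)
print(record.category.name, record.category.value)
```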

Page 12

Examples

Only hypothetical data is shown

Page 13

Only hypothetical data is shown

Page 14

Results

Accuracy currently over 94%

About 1000 variables instead of 6

5 classes instead of 4

Required

A sufficiently large number of cases

Relationships between the variables

Meaningful relationships between the variables

Sufficiently ‘good’ data quality in the training set
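
The slides do not say how the accuracy figure is computed. Assuming it is the usual share of cases where the predicted class equals the class in the data, it could be calculated as in this sketch. The class lists are hypothetical and do not reproduce the reported result.

```python
from collections import Counter

def accuracy(reported, predicted):
    """Share of cases where the prediction matches the reported class."""
    if len(reported) != len(predicted) or not reported:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(r == p for r, p in zip(reported, predicted))
    return hits / len(reported)

def confusion_counts(reported, predicted):
    """Count (reported, predicted) pairs to see which classes get mixed up."""
    return Counter(zip(reported, predicted))

# Hypothetical classes for eight cases.
reported = ["P", "P", "U", "A", "A", "D", "D", "A"]
predicted = ["P", "U", "U", "A", "A", "D", "D", "P"]

acc = accuracy(reported, predicted)
confusions = confusion_counts(reported, predicted)
print(acc)
print(confusions[("A", "P")])
```

Because accuracy is measured against the classes in the data, errors in the training set directly limit what the figure can tell us, hence the data-quality requirement above.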

Page 15

Part III: Feedback mechanism

Necessity of explanation and interpretability

Data suppliers are central

-> Higher data quality and less administrative burden

Page 16

The person is probably not a professor but an assistant.

But why?

Page 17

Hall, Gill and Meng, June 26 2018, O’Reilly

So why isn’t everyone just trying interpretable machine learning?

Simple answer:

it’s fundamentally difficult,

and in some ways, a very new field of research.

Page 18

Global explanation: Variable Importance (GBM model)

Did the person work as a prof. last year?

Does/did the person study in that year?

What percentage of the work is spent on R&D?

Is the person employed on a mandate/contract?

How many FTE did this person work overall?

When did the person finish A-levels?

Does the person work in the central administration?

When was the person born?

Is the person temporarily employed?
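
A global ranking of variables can also be approximated without looking inside the model. The sketch below uses permutation importance, a model-agnostic technique swapped in here for illustration. It is not necessarily the measure behind the GBM plot on this slide. The toy model, variable names, and data are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows for which the model output equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, repeats=20, seed=0):
    """Average accuracy drop when one feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        values = [r[feature] for r in rows]
        rng.shuffle(values)
        shuffled = [{**r, feature: v} for r, v in zip(rows, values)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / repeats

# Hypothetical stand-in model: relies only on last year's category.
def toy_model(row):
    return "P" if row["prof_last_year"] else "A"

# Hypothetical data: labels follow the toy model exactly.
rows = [{"prof_last_year": i % 2 == 0, "birth_year": 1960 + i} for i in range(20)]
labels = [toy_model(r) for r in rows]

importance = {
    feature: permutation_importance(toy_model, rows, labels, feature)
    for feature in ("prof_last_year", "birth_year")
}
print(importance)
```

Shuffling a variable the model ignores leaves accuracy unchanged, so its importance is zero. Shuffling a variable the model depends on lowers accuracy.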

Page 19

DALEX Partial Residual Plot: Outcome = P, Case = 101426

Note: shown without U for a better overview

Page 20

DALEX Partial Residual Plot: Outcome = P, Case = 100269

Note: shown without U for a better overview

Page 21

LIME: Local Interpretable Model-Agnostic Explanations
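
LIME explains a single prediction by fitting a simple surrogate model to perturbed samples around that case, weighted by their proximity to it. The sketch below is much simpler and is not LIME: it changes one variable at a time for one case and records how a score for class P moves. The scoring function, variable names, and values are hypothetical.

```python
def local_effects(score, instance, alternatives):
    """For one case, change one variable at a time and record the score change."""
    base = score(instance)
    effects = {}
    for feature, alt_value in alternatives.items():
        changed = {**instance, feature: alt_value}
        effects[feature] = score(changed) - base
    return effects

# Hypothetical score for class P (higher = more professor-like); not a real model.
def toy_score_p(case):
    s = 0
    s += 5 if case["prof_last_year"] else 0
    s += 3 if case["age"] >= 40 else 0
    s -= 2 if case["temporary"] else 0
    return s

# Hypothetical case reported as professor.
case = {"prof_last_year": False, "age": 21, "temporary": True}
effects = local_effects(
    toy_score_p,
    case,
    {"prof_last_year": True, "age": 50, "temporary": False},
)
print(effects)
```

Output of this kind gives a data supplier a case-specific hint about which delivered values make the reported category look implausible.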

Page 22

Onion model

The deeper we get, the more interpretable the result is.

The variables in the innermost layer correspond to the delivered variables.

The larger the distance from the innermost layer, the more complex and less interpretable the result.

However, the prediction becomes better.

Page 23

Layer: No human in the loop

Layer: Stat. Inst. human in the loop

Layer: human in the loop

Page 24

Conclusion

Prediction works well. Accuracy is currently over 94%.

No longer 6 but around 1000 variables

Explanation part is ongoing, pioneering work!

Pilot project until June 2019

Difficult challenges ahead

Feedback highly appreciated

Page 25

Thank you very much for your attention!

Thanks to «Team-DALEX»: Mehmet Aksözen and Stefan Rüber

Thanks to «Team-LIME»: Elisabeth Kuhn and Laurent Inversin

Thanks to «Team-IT»: Christine Ammann Tschopp

Thanks to our advisor: Prof. Dr. Diego Kuonen

Source: CC0 Public Domain with modifications