Page 1
Using Machine Learning Algorithms …
… to reduce clerical effort and improve the quality of
administrative data sources.
Workshop on Access to Administrative Data Sources
Brussels, 13-14 September 2016
© Federal Statistical Office of Germany | Department E 105
Page 2
Crafts Trades – What's that?
Source: Deutsche Fotothek; Source: Marketing Handwerk
Page 3
Crafts Trades in Official Statistics?
Official statistics compile short-term and structural results for
Turnover
Employees
Crafts statistics are compiled entirely from the Business Register (BR).
Page 4
Business Register
Page 5
Administrative Data Sources
Page 6
Linking the Crafts Property to BR-Units
Page 7
The Problem – Irrelevant Cases
Page 8
The Problem – Irrelevant Cases
Every year, about 40 000 new or significantly changed matches need to be checked for relevance.
This means a lot of clerical review.
Page 9
The Problem – Irrelevant Cases
[Figure: Crafts in the BR, share of irrelevant cases. Bars for cases and turnover, each split into relevant and irrelevant; scale 0% to 100%.]
Page 10
The Problem – Irrelevant Cases
Page 11
The Problem – Irrelevant Cases
[Figure: Density of irrelevant cases in crafts, by crafts trade and NACE Rev. 2. Vertical axis: crafts trades (A 01 to B 50); horizontal axis: NACE Rev. 2 (01 to 99); density scale 0.0 to 1.0.]
Page 12
The Problem – Irrelevant Cases
[Figure: Density of irrelevant turnover in crafts, by crafts trade and NACE Rev. 2. Vertical axis: crafts trades (A 01 to B 50); horizontal axis: NACE Rev. 2 (01 to 99); density scale 0.0 to 1.0.]
Page 13
Solution – Machine Learning?
Since results of the clerical checking from prior periods exist, is it possible to train a Support Vector Machine to recognise the editing patterns and apply them to a new data set?
[Figure: Density of irrelevant turnover in crafts, by crafts trade and NACE Rev. 2. Vertical axis: crafts trades (A 01 to B 50); horizontal axis: NACE Rev. 2 (01 to 99); density scale 0.0 to 1.0.]
Page 14
Support Vector Machines – What's that?
Support Vector Machines (SVM) find the widest separating hyperplane between two groups in a dataset with known classification …
Page 15
Support Vector Machines – What's that?
… and apply the patterns found to a dataset with unknown classification.
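The idea of the widest separating hyperplane can be illustrated with a minimal sketch. It is not the method used in this project and not a real SVM solver: it simply tries 360 directions in two dimensions and keeps the one that leaves the largest gap between the two groups. The toy points, the labels +1 (relevant) and -1 (irrelevant) and all function names are assumptions made for illustration.

```python
import math

def widest_separator(points, labels, steps=360):
    """Try `steps` directions and keep the one with the widest margin."""
    best = None
    for k in range(steps):
        angle = 2 * math.pi * k / steps
        w = (math.cos(angle), math.sin(angle))  # unit normal of the candidate line
        proj = [w[0] * x + w[1] * y for x, y in points]
        pos_min = min(p for p, lab in zip(proj, labels) if lab == 1)
        neg_max = max(p for p, lab in zip(proj, labels) if lab == -1)
        margin = (pos_min - neg_max) / 2  # negative if this direction does not separate
        if best is None or margin > best["margin"]:
            best = {"w": w, "offset": (pos_min + neg_max) / 2, "margin": margin}
    return best

def predict(model, point):
    """Apply the line found on known data to a point with unknown classification."""
    score = model["w"][0] * point[0] + model["w"][1] * point[1]
    return 1 if score > model["offset"] else -1

# Hypothetical toy data: two features per unit, classification known.
known_points = [(2, 2), (3, 3), (3, 2), (0, 0), (1, 0), (0, 1)]
known_labels = [1, 1, 1, -1, -1, -1]

model = widest_separator(known_points, known_labels)
new_labels = [predict(model, p) for p in [(2.5, 2.5), (0.2, 0.3)]]
```

A real SVM solves this as an optimisation problem instead of searching a grid, and can use kernels when no straight line separates the groups.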
Page 16
Support Vector Machines – What's that?
PROs: SVM make very few assumptions.
CONs: SVM tend to require large amounts of computing power.
SVM (as well as other machine learning algorithms) are prone to overfitting.
Keep in mind: SVM have to be tuned to give useful results.
Avoid overfitting by cross-validating models.
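Cross-validation can be sketched as follows: the data with known classification is split into k folds, each fold is held out once, and the model is scored only on data it has not seen. This is a minimal sketch under stated assumptions. The threshold classifier stands in for an SVM, and the data and all names are hypothetical. In practice the folds would usually be shuffled or stratified, and the procedure would be repeated for each candidate tuning setting.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(fit, predict, xs, ys, k=5):
    """Average accuracy on the held-out fold over k train/test splits."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        hits = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

# Stand-in classifier (an assumption, not an SVM): midpoint of the class means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else -1

# Hypothetical one-dimensional data, classes interleaved so every fold has both.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
ys = [-1, 1, -1, 1, -1, 1, -1, 1, -1, 1]

accuracy = cross_val_accuracy(fit_threshold, predict_threshold, xs, ys, k=5)
```

A model that scores much better on its training data than on the held-out folds is overfitting.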
Page 17
SVM Applied – Training Data
We train the SVM on a Business Register subset containing about 650 000 units with crafts property and known relevance.
Roughly 13 000 (~2%) of them are marked irrelevant for the crafts statistics.
Page 18
SVM Applied – Training Data
We train on the variables:
Crafts Trade
NACE rev. 2
Turnover
Employees
…
Page 19
SVM Applied – Training
[Flow diagram]
cases with known relevance -> filter trivial cases -> trivial cases, nontrivial cases
nontrivial cases -> reduce to most important variables -> nontrivial cases with reduced variables -> train SVM -> model for nontrivial cases
trivial cases -> model for trivial cases
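The split into trivial and nontrivial cases can be sketched as follows. The slides do not define "trivial", so this sketch assumes that a case is trivial when every known case with the same combination of crafts trade and NACE code has the same relevance. The record layout, the field names and the example values are hypothetical.

```python
from collections import defaultdict

def split_trivial(cases):
    """Return (lookup for trivial cells, list of nontrivial cases).

    Assumption: a (trade, nace) cell is trivial if all its known cases agree.
    """
    by_cell = defaultdict(list)
    for case in cases:
        by_cell[(case["trade"], case["nace"])].append(case)
    lookup, nontrivial = {}, []
    for cell, members in by_cell.items():
        outcomes = {m["relevant"] for m in members}
        if len(outcomes) == 1:
            lookup[cell] = outcomes.pop()  # acts as the model for trivial cases
        else:
            nontrivial.extend(members)     # goes on to SVM training
    return lookup, nontrivial

def reduce_variables(cases, keep):
    """Keep only the selected variables of each case."""
    return [{name: case[name] for name in keep} for case in cases]

# Hypothetical cases with known relevance.
known = [
    {"trade": "A 01", "nace": "43", "turnover": 120.0, "relevant": True},
    {"trade": "A 01", "nace": "43", "turnover": 80.0, "relevant": True},
    {"trade": "B 05", "nace": "47", "turnover": 300.0, "relevant": True},
    {"trade": "B 05", "nace": "47", "turnover": 900.0, "relevant": False},
    {"trade": "A 10", "nace": "64", "turnover": 50.0, "relevant": False},
]

trivial_lookup, nontrivial = split_trivial(known)
training_set = reduce_variables(nontrivial, keep=("turnover", "relevant"))
```

Only `training_set` would be passed on to the SVM, which keeps the expensive training step to the cases that actually need it.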
Page 20
SVM Applied – Prediction
[Flow diagram]
cases with unknown relevance -> predict trivial cases (model for trivial cases) -> trivial cases with predicted relevance
nontrivial cases with unknown relevance -> predict nontrivial cases (model for nontrivial cases) -> nontrivial cases with predicted relevance
trivial and nontrivial cases with predicted relevance -> cases with predicted relevance
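The prediction step can be sketched as a simple routing function: each new case goes to the model for trivial cases if one applies, otherwise to the model for nontrivial cases, and the two result sets are combined. This is a sketch under assumptions, not the production code. The lookup keyed by crafts trade and NACE code, the stand-in model and all field names are hypothetical.

```python
def predict_relevance(cases, trivial_lookup, nontrivial_model):
    """Route each case to the matching model and collect the predictions."""
    results = []
    for case in cases:
        cell = (case["trade"], case["nace"])
        if cell in trivial_lookup:
            relevant = trivial_lookup[cell]     # model for trivial cases
        else:
            relevant = nontrivial_model(case)   # model for nontrivial cases
        results.append({**case, "predicted_relevant": relevant})
    return results

# Hypothetical model for trivial cases, keyed by (crafts trade, NACE code).
trivial_lookup = {("A 01", "43"): True, ("A 10", "64"): False}

def stand_in_model(case):
    """Placeholder for the trained SVM; the turnover rule is invented."""
    return case["turnover"] < 500.0

new_cases = [
    {"trade": "A 10", "nace": "64", "turnover": 70.0},
    {"trade": "B 05", "nace": "47", "turnover": 900.0},
    {"trade": "B 05", "nace": "47", "turnover": 200.0},
]

predicted = predict_relevance(new_cases, trivial_lookup, stand_in_model)
flags = [case["predicted_relevant"] for case in predicted]
```

In this sketch, combinations never seen in the training data also fall through to the model for nontrivial cases.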
Page 21
Results
Classification algorithm     Predicted      Original       Cases     Turnover
SVM                          relevant       relevant       93.1 %    65.2 %
                             not relevant   not relevant    3.2 %    33.1 %
  classified correctly:                                    96.3 %    98.2 %
                             not relevant   relevant        0.3 %     0.3 %
                             relevant       not relevant    3.4 %     1.5 %
  not correctly classified:                                 3.7 %     1.8 %
Random Forest                relevant       relevant       93.0 %    62.8 %
                             not relevant   not relevant    1.4 %    20.5 %
  classified correctly:                                    94.4 %    83.3 %
                             not relevant   relevant        0.4 %     2.7 %
                             relevant       not relevant    5.2 %    14.1 %
  not correctly classified:                                 5.6 %    16.7 %
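The totals in the results can be recomputed from the four shares of each block. The sketch below does this for the first and second block in the order they appear, and also derives the share of originally irrelevant cases that each model detects. The figures are copied from the results; the variable and function names are assumptions. Where a recomputed total differs from the slide by 0.1 percentage points, the slide value was presumably calculated from unrounded shares.

```python
# Shares in percent, keyed by (predicted, original).
first_block = {
    "cases": {("rel", "rel"): 93.1, ("not", "not"): 3.2,
              ("not", "rel"): 0.3, ("rel", "not"): 3.4},
    "turnover": {("rel", "rel"): 65.2, ("not", "not"): 33.1,
                 ("not", "rel"): 0.3, ("rel", "not"): 1.5},
}
second_block = {
    "cases": {("rel", "rel"): 93.0, ("not", "not"): 1.4,
              ("not", "rel"): 0.4, ("rel", "not"): 5.2},
    "turnover": {("rel", "rel"): 62.8, ("not", "not"): 20.5,
                 ("not", "rel"): 2.7, ("rel", "not"): 14.1},
}

def summarise(shares):
    """Return (correct %, incorrect %, detected share of irrelevant)."""
    correct = shares[("rel", "rel")] + shares[("not", "not")]
    incorrect = shares[("not", "rel")] + shares[("rel", "not")]
    originally_irrelevant = shares[("not", "not")] + shares[("rel", "not")]
    detected = shares[("not", "not")] / originally_irrelevant
    return round(correct, 1), round(incorrect, 1), round(detected, 3)

summary = {
    (name, measure): summarise(block[measure])
    for name, block in (("first", first_block), ("second", second_block))
    for measure in ("cases", "turnover")
}
```

For the first block this gives a detected share of about 0.485 by number of cases and about 0.957 by turnover, so the irrelevant units that are missed carry comparatively little turnover.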
Page 22
Discussion
We consider the results sufficiently precise to classify the bulk of the cases automatically and without clerical editing.
Cases with a large impact on the results will still be passed on to clerical review.
Furthermore, the SVM models can be used to check the results of the clerical review process.
Page 23
Jörg Feuerhake
Phone: +49/(0) 611 / 75 41 16
[email protected]
www.destatis.de
Page 24
Further Reading
Applied Multivariate Statistical Analysis; Härdle, Simar; 2007
http://books.google.de/books/about/Applied_Multivariate_Statistical_Analysi.html?id=6nSF2PTi9bkC&redir_esc=y
The Elements of Statistical Learning; Hastie, Tibshirani, Friedman; 2009
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
A Practical Guide to Support Vector Classification; Hsu, Chang, and Lin; 2003-2010;
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
Support Vector Machines - The Interface to libsvm in Package e1071; Meyer; 2014
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
Page 25
Presentations
A Simple Introduction to Support Vector Machines; Martin Law
http://www.cise.ufl.edu/class/cis4930sp11dtm/notes/intro_svm_new.pdf
Klassifikation mit Support Vector Machines; Florian Markowetz (German)
http://lectures.molgen.mpg.de/statistik03/docs/Kapitel_16.pdf
Support Vector Machines; Andrew W. Moore
http://www.autonlab.org/tutorials/svm15.pdf
Page 26
Videos
Artificial Intelligence; Lecture 16: Learning: Support Vector Machines (49’34”)
https://www.youtube.com/watch?v=_PwhiWxHK8o
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/index.htm
Introduction to Support Vector Machines (51’54”, 36’11”)
http://videolectures.net/epsrcws08_campbell_isvm/
Artificial Intelligence | Machine Learning; Lecture 6 (73’09”) and 7 (75’45”)
www.youtube.com/watch?v=qyyJKd-zXRE
www.youtube.com/watch?v=s8B4A5ubw6c
http://see.stanford.edu/see/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
Pattern Recognition Class (40’31”)
https://www.youtube.com/watch?v=rjIac3NxAYA