Machine Learning in R and its use in the statistical offices (stat.unido.org, v.todorov@unido.org)
Dec 19, 2015

Oliver Owen

Machine Learning in R and its use in the statistical offices

v.todorov@unido.org

Outline

1. Machine learning and R
2. R packages
3. Machine learning in official statistics
4. Top 10 algorithms
5. References

What I talk about when I talk about Machine Learning

Machine Learning (ML)

Data Mining (DM)

Statistics

Artificial Intelligence (AI)

R and R packages

• What makes R so useful?
  – Users can extend and improve the software or write variations for specific tasks.

• The R package mechanism allows packages written for R to add advanced algorithms, graphs, machine learning and data mining techniques.

• Each R package provides structured, standard documentation, including example code.

R and R packages

## Naive Bayes example
> install.packages('e1071', dependencies = TRUE)
> library(class)
> library(e1071)
> data(iris)
> pairs(iris[1:4], main = "Iris Data (red=setosa,green=versicolor,blue=virginica)", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

R and R packages

> classifier <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
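
The table above is a resubstitution confusion matrix: the classifier is scored on the same 150 observations it was trained on. As a small sketch in base R, the counts from the slide can be turned into an overall accuracy and a per-class recall. The object names `cm`, `accuracy` and `recall` are illustrative and not part of the original slides; rows are assumed to be predictions and columns true classes, as in the `table()` call.

```r
# Confusion matrix copied from the slide output
# (rows = predicted class, columns = true class)
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(50,  0,  0,
                0, 47,  3,
                0,  3, 47),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = species, true = species))

# Overall accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: diagonal divided by the column totals (true class sizes)
recall <- diag(cm) / colSums(cm)

print(accuracy)
print(recall)
```

Because these figures come from the training data, they are likely to be optimistic; a hold-out sample or cross-validation would give a fairer estimate.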

Machine Learning for Official Statistics

I. Automatic Coding
II. Editing and Imputation
III. Record Linkage
IV. Other Methods

Automatic coding

A. Automatic coding via Bayesian classifier: caret, klaR

B. Automatic occupation coding via CASCOT: algorithm not described

C. Automatic coding via open-source indexing utility: ?

D. Automatic coding of census variables via SVM: e1071 (interface to libsvm)

Editing and Imputation

A. Categorical data imputation via neural networks and Bayesian networks: neuralnet, gRain, bnlearn, deal

B. Identification of error-containing records via classification trees: rpart, tree, caret

C. Imputation donor pool screening via cluster analysis: class, klaR, cluster, kmeans(), hclust()

D. Imputation via Classification and Regression Trees (CART): rpart, caret, RWeka

E. Determination of imputation matching variables via Random Forests: randomForest

F. Creation of homogeneous imputation classes via CART: rpart

G. Derivation of edit rules via association analysis: arules
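
Item D, imputation via CART, might look roughly like the following with rpart. This is only a sketch under stated assumptions: the iris data stands in for survey records, 20 values of `Petal.Width` are blanked out artificially, and the names `dat`, `miss`, `fit` and `mae` are hypothetical.

```r
library(rpart)

set.seed(1)
dat <- iris

# Assumption for illustration: blank out 20 values to mimic item non-response
miss <- sample(nrow(dat), 20)
truth <- dat$Petal.Width[miss]
dat$Petal.Width[miss] <- NA

# Fit a regression tree on the complete records only
observed <- !is.na(dat$Petal.Width)
fit <- rpart(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length + Species,
             data = dat[observed, ], method = "anova")

# Impute each missing value with the mean of the leaf its record falls into
dat$Petal.Width[miss] <- predict(fit, newdata = dat[miss, ])

# Mean absolute imputation error against the held-back true values
mae <- mean(abs(dat$Petal.Width[miss] - truth))
print(mae)
```

Imputing the leaf mean is the simplest choice; drawing a donor at random from the same leaf would be closer to item F (homogeneous imputation classes) and preserves more of the variance.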

Record Linkage

• Weighting vector classification:
  – The last major step in record linkage or record de-duplication
  – could be understood as a classification problem

• In R: rpart, bagging() in package ipred, ada, svm() in package e1071 and nnet() in package nnet
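
To illustrate treating the final linkage step as a classification problem, here is a minimal sketch with rpart. Everything about the data is an assumption: the candidate pairs are simulated, and the agreement indicators `name_agree`, `dob_agree` and `zip_agree` are hypothetical field comparisons, not taken from the slides.

```r
library(rpart)

set.seed(42)
n <- 400

# Simulated truth: roughly 30% of candidate pairs are real matches
status <- factor(rbinom(n, 1, 0.3), levels = c(0, 1),
                 labels = c("non-match", "match"))

# Helper: a 0/1 agreement indicator that is 1 more often for true matches
agree <- function(p_match, p_non) {
  rbinom(n, 1, ifelse(status == "match", p_match, p_non))
}

# One row per candidate record pair: the comparison vector plus the label
pairs_df <- data.frame(
  name_agree = agree(0.95, 0.05),
  dob_agree  = agree(0.90, 0.10),
  zip_agree  = agree(0.85, 0.20),
  status     = status
)

# Classification tree that labels each comparison vector as match / non-match
fit <- rpart(status ~ ., data = pairs_df, method = "class")
pred <- predict(fit, newdata = pairs_df, type = "class")
print(table(predicted = pred, true = pairs_df$status))
```

In practice the labelled pairs would come from a clerically reviewed sample, and the tree would be evaluated on pairs it was not trained on.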

Other Methods

A. Questionnaire consolidation via cluster analysis: class, klaR, cluster...

B. Forming non-response weighting groups via classification trees: rpart, tree, caret

C. Non-respondent prediction via classification trees: rpart, tree, caret

D. Analysis of reporting errors via classification trees: rpart, tree, caret

E. Substitutes for surveys via internet scraping: scrapeR, rvest

F. Tax evader detection via k-nearest neighbours: class, kknn

G. Crop yield estimation via image processing on satellite imaging data: is this ML?
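
As a rough sketch of item F, k-nearest neighbours from package class can be applied to simulated taxpayer data. The features `declared_ratio` and `cash_share`, the class proportions and the choice of k = 5 are all assumptions made for illustration only.

```r
library(class)

set.seed(7)
n <- 200

# Simulated labels: about a quarter of the cases are evaders
status <- factor(rbinom(n, 1, 0.25), levels = c(0, 1),
                 labels = c("compliant", "evader"))
is_evader <- status == "evader"

# Two hypothetical features whose means differ between the groups
x <- data.frame(
  declared_ratio = rnorm(n, mean = ifelse(is_evader, 0.6, 1.0), sd = 0.1),
  cash_share     = rnorm(n, mean = ifelse(is_evader, 0.7, 0.3), sd = 0.1)
)

train <- 1:150
test  <- 151:200

# Standardise with training-set means and SDs so both features carry equal weight
x_train <- scale(x[train, ])
x_test  <- scale(x[test, ],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))

# Classify each test case by majority vote among its 5 nearest training cases
pred <- knn(train = x_train, test = x_test, cl = status[train], k = 5)
print(table(predicted = pred, true = status[test]))
```

Scaling matters here because kNN relies on Euclidean distance; without it, the feature with the larger range would dominate the neighbour search.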

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

• Fernandez-Delgado, Cernadas, Barro (2014)

• Evaluate 179 classifiers arising from 17 families on 121 data sets

• By far the best are random forests and SVM with Gaussian kernel

• Most of the best classifiers are implemented in R and tuned using caret, which seems the best alternative to select a classifier implementation

Top 10 ML/DM Algorithms

Xindong Wu and Vipin Kumar (2009)

1. C4.5 – generates classifiers expressed as decision trees or in ruleset form
2. K-Means – simple iterative method to partition a given dataset into a user-specified number of clusters, k
3. SVM – support vector machines
4. Apriori – derives association rules
5. EM – Expectation–Maximization algorithm
6. PageRank – produces a static ranking of Web pages
7. AdaBoost – ensemble learning
8. kNN – k-nearest neighbor classification
9. Naive Bayes – simple classifier, applying Bayes' theorem with independence assumptions between the features
10. CART – Classification and Regression Trees
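
As one example from this list, K-Means is available in base R as kmeans(). The sketch below partitions the iris measurements used earlier into k = 3 clusters; standardising the columns and using `nstart = 25` random restarts are choices made here for illustration, not recommendations from the slides.

```r
set.seed(123)

# Standardise the four measurements so no single variable dominates the distances
x <- scale(iris[, 1:4])

# Partition into a user-specified number of clusters, k = 3;
# nstart = 25 reruns the algorithm from 25 random starts and keeps the best fit
km <- kmeans(x, centers = 3, nstart = 25)

# Compare the clusters found with the known species (not used in the fitting)
print(table(cluster = km$cluster, species = iris$Species))

# Share of the total sum of squares explained by the partition
print(km$betweenss / km$totss)
```

The cluster labels 1 to 3 are arbitrary, so the cross-table should be read by looking for the dominant species in each row rather than by matching numbers to names.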

The best R packages for ML

1. e1071: Naive Bayes, SVM, latent class analysis
2. rpart: classification and regression trees
3. randomForest: random forests (RF)
4. gbm: generalized boosted regression models
5. kernlab: SVM
6. caret: Classification and Regression Training
7. neuralnet: neural networks

CRAN Task View: Machine Learning & Statistical Learning

Machine learning books
