Dec 19, 2015
Outline
1. Machine learning and R
2. R packages
3. Machine learning in official statistics
4. Top 10 algorithms
5. References
What I talk about when I talk about Machine Learning
Machine Learning (ML)
Data Mining (DM)
Statistics
Artificial Intelligence (AI)
R and R packages
• What makes R so useful? – The users can extend and improve the software or write variations for specific tasks.
• The R package mechanism allows packages written for R to add advanced algorithms, graphs, machine learning and data mining techniques
• Each R package provides structured, standard documentation, including example code
## Naive Bayes example
> install.packages('e1071', dependencies = TRUE)
> library(class)
> library(e1071)
> data(iris)
> pairs(iris[1:4], main = "Iris Data (red=setosa,green=versicolor,blue=virginica)", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
> classifier <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
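The table above is computed on the same rows that were used to fit the classifier, so it probably overstates the accuracy. A minimal sketch of a hold-out evaluation follows, assuming e1071 is installed. The 70/30 split and the seed are arbitrary choices for illustration and are not taken from the slides.

```r
library(e1071)

data(iris)
set.seed(42)                                   # arbitrary seed, for reproducibility only

# Assumption: a 70/30 random split; the slides evaluate on the training data instead
n         <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
train     <- iris[train_idx, ]
test      <- iris[-train_idx, ]

# Fit on the training rows only
classifier <- naiveBayes(train[, 1:4], train[, 5])

# Predict on the held-out rows and cross-tabulate against the true species
pred      <- predict(classifier, test[, 1:4])
confusion <- table(predicted = pred, actual = test[, 5])
accuracy  <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

The diagonal of the table counts correct predictions on rows the classifier has not seen.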
Machine Learning for Official Statistics
I. Automatic Coding
II. Editing and Imputation
III. Record Linkage
IV. Other Methods
Automatic coding
A. Automatic coding via Bayesian classifier: caret, klaR
B. Automatic occupation coding via CASCOT: algorithm not described
C. Automatic coding via open-source indexing utility: ?
D. Automatic coding of census variables via SVM: e1071 (interface to libsvm)
Editing and Imputation
A. Categorical data imputation via neural networks and Bayesian networks: neuralnet, gRain, bnlearn, deal
B. Identification of error-containing records via classification trees: rpart, tree, caret
C. Imputation donor pool screening via cluster analysis: class, klaR, cluster, kmeans(), hclust()
D. Imputation via Classification and Regression Trees (CART): rpart, caret, RWeka
E. Determination of imputation matching variables via Random Forests: randomForest
F. Creation of homogeneous imputation classes via CART: rpart
G. Derivation of edit rules via association analysis: arules
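Item D (imputation via CART) could be sketched roughly as follows with rpart. This is an illustration under stated assumptions, not the method of any particular statistical office: the iris data stands in for survey data, and 20 Species values are blanked out at random to mimic item non-response.

```r
library(rpart)

data(iris)
set.seed(1)                                    # arbitrary seed

# Assumption: blank out 20 Species values at random to mimic item non-response
dat          <- iris
missing_rows <- sample(nrow(dat), 20)
dat$Species[missing_rows] <- NA

# Grow a classification tree on the complete records only
complete <- dat[!is.na(dat$Species), ]
tree     <- rpart(Species ~ ., data = complete, method = "class")

# Impute each missing value with the class predicted by the tree
imputed <- dat
imputed$Species[missing_rows] <- predict(tree, newdata = dat[missing_rows, ],
                                         type = "class")

# The true values are known here, so the share of correct imputations can be checked
hit_rate <- mean(imputed$Species[missing_rows] == iris$Species[missing_rows])
print(hit_rate)
```

In real data the true values are unknown, so the check in the last step would need a simulation study of this kind on complete records.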
Record Linkage
• Weighting vector classification:
– The last major step in record linkage or record de-duplication
– could be understood as a classification problem
• In R: rpart, bagging() in package ipred, ada, svm() in package e1071, nnet() in package nnet
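Treating the classification of comparison vectors as a supervised problem might look like the sketch below, which uses rpart. Everything about the data is hypothetical: the field names, the agreement probabilities and the numbers of matching and non-matching pairs are invented so that the example is self-contained.

```r
library(rpart)

set.seed(7)                                    # arbitrary seed

# Hypothetical comparison vectors: one row per candidate record pair,
# one 0/1 column per compared field (1 = the two records agree on that field).
# Field names and agreement probabilities are invented for illustration.
simulate_pairs <- function(n, p_agree, is_match) {
  data.frame(
    name      = rbinom(n, 1, p_agree[1]),
    birthyear = rbinom(n, 1, p_agree[2]),
    postcode  = rbinom(n, 1, p_agree[3]),
    is_match  = factor(rep(is_match, n), levels = c("non-match", "match"))
  )
}
pairs_df <- rbind(
  simulate_pairs(200, c(0.95, 0.90, 0.85), "match"),
  simulate_pairs(800, c(0.10, 0.15, 0.05), "non-match")
)

# Split the labelled pairs into training and evaluation sets
train_idx <- sample(nrow(pairs_df), 700)
train     <- pairs_df[train_idx, ]
test      <- pairs_df[-train_idx, ]

# Classify each comparison vector as match / non-match with a classification tree
fit  <- rpart(is_match ~ name + birthyear + postcode, data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

confusion <- table(predicted = pred, actual = test$is_match)
print(confusion)
```

The same training and evaluation frames could be passed to the other classifiers named above; only the fitting call would change.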
Other Methods
A. Questionnaire consolidation via cluster analysis: class, klaR, cluster...
B. Forming non-response weighting groups via classification trees: rpart, tree, caret
C. Non-respondent prediction via classification trees: rpart, tree, caret
D. Analysis of reporting errors via classification trees: rpart, tree, caret
E. Substitutes for surveys via internet scraping: scrapeR, rvest
F. Tax evader detection via k-nearest neighbours: class, kknn
G. Crop yield estimation via image processing on satellite imaging data: is this ML?
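For item F, a k-nearest-neighbour classifier from package class could be applied along these lines. The data are simulated: the two numeric indicators, the group means and the labels "compliant" and "flagged" are all hypothetical and stand in for whatever variables a real application would use.

```r
library(class)

set.seed(3)                                    # arbitrary seed

# Hypothetical data: two numeric indicators per unit and a known label.
# Both the features and the group means are invented for illustration.
n_each <- 100
x <- rbind(
  matrix(rnorm(2 * n_each, mean = 0), ncol = 2),   # "compliant" units
  matrix(rnorm(2 * n_each, mean = 2), ncol = 2)    # "flagged" units
)
label <- factor(rep(c("compliant", "flagged"), each = n_each))

# Hold out 25% of the units for evaluation
test_idx <- sample(nrow(x), 50)

# Scale with training-set statistics so distances are comparable across features
centre  <- colMeans(x[-test_idx, ])
spread  <- apply(x[-test_idx, ], 2, sd)
train_x <- scale(x[-test_idx, ], center = centre, scale = spread)
test_x  <- scale(x[test_idx, ],  center = centre, scale = spread)

# Each held-out unit gets the majority label among its k = 5 nearest training units
pred <- knn(train = train_x, test = test_x, cl = label[-test_idx], k = 5)

confusion <- table(predicted = pred, actual = label[test_idx])
print(confusion)
```

The choice k = 5 is arbitrary here; in practice it would be tuned, for example by cross-validation.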
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
• Fernandez-Delgado, Cernadas, Barro (2014)
• Evaluate 179 classifiers arising from 17 families on 121 data sets
• By far the best are random forests and SVM with a Gaussian kernel
• Most of the best classifiers are implemented in R and tuned using caret
• caret seems the best alternative to select a classifier implementation
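Tuning a classifier with caret might look like the following sketch. It assumes caret and rpart are installed, and it uses 5-fold cross-validation on iris as a stand-in; the evaluation protocol of the paper itself is not described on the slide and is not reproduced here.

```r
library(caret)

data(iris)
set.seed(123)                                  # arbitrary seed

# Assumption: 5-fold cross-validation, chosen for illustration only
ctrl <- trainControl(method = "cv", number = 5)

# method = "rpart" tunes the complexity parameter cp; other model codes
# (e.g. "rf", "svmRadial") plug into the same call if their packages are installed
fit <- train(Species ~ ., data = iris,
             method     = "rpart",
             trControl  = ctrl,
             tuneLength = 5)

print(fit$bestTune)      # cp value selected by cross-validated accuracy
print(fit$results)       # resampling results for every cp value that was tried
```

Because the resampling setup lives in `ctrl`, different classifier families can be compared under the same folds by changing only the `method` argument.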
Top 10 ML/DM Algorithms
Xindong Wu and Vipin Kumar (2009)
1. C4.5 – generates classifiers expressed as decision trees or in ruleset form
2. K-Means – simple iterative method to partition a given dataset into a user-specified number of clusters, k
3. SVM – support vector machines
4. Apriori – derive association rules
5. EM – Expectation–Maximization algorithm
6. PageRank – produces a static ranking of Web pages
7. AdaBoost – ensemble learning
8. kNN – k-nearest neighbor classification
9. Naive Bayes – simple classifier, applying Bayes' theorem with independence assumptions between the features
10. CART – Classification and Regression Trees
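As one concrete instance from the list, K-Means (item 2) is available in base R as kmeans(). A minimal sketch on iris follows; the choice k = 3, the standardisation step and the seed are assumptions made for illustration.

```r
data(iris)
set.seed(2015)                                 # arbitrary seed; k-means starts from random centres

# k is user-specified; 3 is chosen here because iris has three species
features <- scale(iris[, 1:4])                 # standardise so no single variable dominates
km <- kmeans(features, centers = 3, nstart = 20)

# Cluster sizes and a comparison against the known species labels
print(km$size)
print(table(cluster = km$cluster, species = iris$Species))
```

The `nstart = 20` argument reruns the algorithm from 20 random starts and keeps the best solution, which reduces the dependence on the initial centres.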
The best R packages for ML
1. e1071: Naive Bayes, SVM, latent class analysis
2. rpart: regression trees
3. randomForest: RF
4. gbm: generalized boosted regression models
5. kernlab: SVM
6. caret: Classification and Regression Training
7. neuralnet: neural networks
CRAN Task View: Machine Learning & Statistical Learning