Machine Learning Homework 8, 598 and 494
Rob McCulloch
3/26/2019

Contents
Trees on the Kaggle Data
  A Simple Tree
  Random Forests
  Boosting
  Comparing the Methods
Homework Problem

Trees on the Kaggle Data

Let's try trees on the kaggle data.
We are trying to predict whether an account will go delinquent.
ktr=read.csv("http://www.rob-mcculloch.org/data/kaggle-del-train.csv") #read in the train data
kte=read.csv("http://www.rob-mcculloch.org/data/kaggle-del-test.csv") #read in the test data
ktr$DelIn2Yr = as.factor(ktr$DelIn2Yr)
kte$DelIn2Yr = as.factor(kte$DelIn2Yr)
names(ktr)
## [1] "RevolvingUtilizationOfUnsecuredLines"
## [2] "age"
## [3] "NumberOfTime30.59DaysPastDueNotWorse"
## [4] "DebtRatio"
## [5] "NumberOfOpenCreditLinesAndLoans"
## [6] "NumberOfTimes90DaysLate"
## [7] "NumberRealEstateLoansOrLines"
## [8] "NumberOfTime60.89DaysPastDueNotWorse"
## [9] "DelIn2Yr"
dim(ktr)
## [1] 75000     9
dim(kte)
## [1] 75000     9
table(ktr$DelIn2Yr)/length(ktr$DelIn2Yr)
##
##          0          1
## 0.93294667 0.06705333
table(kte$DelIn2Yr)/length(kte$DelIn2Yr)
##
##          0          1
## 0.93337333 0.06662667
So ktr is our training data and kte is our test data.
A Simple Tree
Let’s fit a single tree to the data using the R package rpart.
First we fit a big tree by using a small cp (the .0001 below).
library(rpart)
set.seed(99)
big.tree = rpart(DelIn2Yr~.,data=ktr, control=rpart.control(cp=.0001))
nbig = length(unique(big.tree$where))
cat("size of big tree: ",nbig,"\n")
## size of big tree:  376
head(big.tree$where)
## 1 2 3 4 5 6
## 5 5 5 5 5 5
Remember, cp is the key cost complexity parameter, which is α in the notes. A smaller cp gives you a bigger tree.
The where component of the list returned by rpart indicates the partitioning of the data into disjoint subsets. So there are 376 bottom nodes in the tree big.tree, and the first observation is in the 5th bottom node. The numbering of the bottom nodes is not itself meaningful; each bottom node is just assigned a unique integer.
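As a small self-contained illustration (a sketch using the kyphosis data that ships with rpart, not the Kaggle data), tabulating the where vector shows how many observations land in each bottom node:

```r
library(rpart)
# fit a small classification tree on rpart's built-in kyphosis data
fit = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
# fit$where gives, for each row of the data, the internal id of its leaf
table(fit$where)               # observations per bottom node
length(unique(fit$where))      # number of bottom nodes
```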
Let's see what cross-validation tells us about a good size for the tree. The rpart package does this for us with the plotcp function.
plotcp(big.tree)
[Figure: plotcp(big.tree) — cross-validated relative error (X-val Relative Error) plotted against cp and size of tree.]
Let’s pull off the cp = α for the best size and then prune the big tree back using that cp.
iibest = which.min(big.tree$cptable[,"xerror"]) #which has the lowest error
bestcp=big.tree$cptable[iibest,"CP"]
bestsize = big.tree$cptable[iibest,"nsplit"]+1
cat("the best tree has size ",bestsize,"\n")
## the best tree has size  33
best.tree = prune(big.tree,cp=bestcp) #prune back big.tree using the best cp
#let's check the size
nbest = length(unique(best.tree$where))
cat("size of best tree: ",nbest,"\n")
## size of best tree:  33
Now let’s look at our out-of-sample predictions and the ROC curve.
##out of sample
yhattest = predict(best.tree,kte)[,2] #first col is prob(y=0|x), second col is prob(y=1|x)
summary(yhattest)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.02245 0.02245 0.02245 0.06790 0.08155 0.81250
##Roc, Auc
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
##     cov, smooth, var
delRoc = roc(response=kte$DelIn2Yr,predictor=yhattest)
delAuc = auc(delRoc)
cat("AUC for tree fit to Kaggle data is ",delAuc,"\n")
## AUC for tree fit to Kaggle data is  0.7945745
plot(delRoc)
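AUC has a useful interpretation: it is the probability that a randomly chosen delinquent account gets a higher predicted probability than a randomly chosen non-delinquent one. A minimal sketch of computing it directly from that definition (the function name auc.hand is made up for illustration):

```r
# Mann-Whitney version of AUC: P(score for a 1 > score for a 0), ties count 1/2
auc.hand = function(score, y) {
  pos = score[y == 1]
  neg = score[y == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
auc.hand(c(.9, .8, .3, .1), c(1, 1, 0, 0))   # perfectly separated scores: AUC = 1
```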
[Figure: ROC curve for the tree fit — Sensitivity vs. Specificity.]
Now let’s plot the tree. To get a tree that plots nicely, I’ll prune it back to be smaller than the best tree by using a bigger cp value.
library(rpart.plot)
best.tree10 = prune(big.tree,cp=.0015) #prune back big.tree using a bigger cp
rpart.plot(best.tree10,split.cex=0.5,cex=0.75,type=3,extra=4)
[Figure: rpart.plot of the pruned tree. The splits involve NumberOfTimes90DaysLate, RevolvingUtilizationOfUnsecuredLines, NumberOfTime60.89DaysPastDueNotWorse, DebtRatio, NumberOfTime30.59DaysPastDueNotWorse, NumberOfOpenCreditLinesAndLoans, and age; each bottom node shows the two class probabilities.]
How did I know that cp=.0015 would give me a nice size tree? You get this kind of info from the cptable:
From the cptable I can see that a cp of about .0015 corresponds to a tree with about 10 decision rules.
This is also where we got the best tree from. We find the row of the cptable with the smallest xerror and then pull off the cp value from that row (see above).
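A self-contained sketch of reading the cptable (again on the kyphosis data rather than the Kaggle data): each row pairs a cp value with the number of splits and the cross-validated error you get at that cp:

```r
library(rpart)
set.seed(99)
fit = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
            control = rpart.control(cp = .0001))
printcp(fit)                                # columns: CP, nsplit, rel error, xerror, xstd
ii = which.min(fit$cptable[, "xerror"])     # row with the lowest cross-validated error
fit$cptable[ii, "CP"]                       # the cp value to prune with
```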
Random Forests

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(99)
rffit = randomForest(DelIn2Yr~.,data=ktr,mtry=3,ntree=500)
plot(rffit)
[Figure: plot(rffit) — error rates versus the number of trees (0 to 500).]
The plot suggests that it does not take many trees to get rid of the high variance, but the uncertainty is huge.
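The same behavior is easy to see on a small built-in data set; a sketch (using iris, not the Kaggle data) that inspects the stored out-of-bag error after each additional tree:

```r
library(randomForest)
set.seed(99)
rf = randomForest(Species ~ ., data = iris, ntree = 500)
# err.rate has one row per tree: the OOB error using the first k trees
oob = rf$err.rate[, "OOB"]
round(head(oob, 3), 3)   # noisy with only a few trees
round(tail(oob, 3), 3)   # essentially flat by 500 trees
```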
Let’s get the predictions and look at the lift.
rfyhattest = predict(rffit,newdata=kte,type="prob")[,2] #again, second column is p(y=1|x)
summary(rfyhattest)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.00000 0.00200 0.01200 0.06103 0.05000 0.99400
lift.plot(rfyhattest,kte$DelIn2Yr,cex.lab=1.2)
[Figure: lift plot for the random forest — % of successes vs. % tried.]
Now let’s look at the variable importance.
varImpPlot(rffit)
[Figure: varImpPlot(rffit) — variable importance (MeanDecreaseGini); RevolvingUtilizationOfUnsecuredLines, DebtRatio, and age rank highest, NumberRealEstateLoansOrLines lowest.]
This does not agree with rpart !!
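The two methods measure importance differently, so some disagreement is expected: rpart credits a variable for the improvement from its splits (and surrogate splits) in one tree, while randomForest averages the Gini decrease over 500 trees. A self-contained sketch comparing the two measures on the kyphosis data:

```r
library(rpart)
library(randomForest)
set.seed(99)
tfit = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
rfit = randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis, ntree = 500)
tfit$variable.importance   # rpart: improvement from (surrogate) splits
importance(rfit)           # randomForest: mean decrease in Gini
```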
Boosting
library(gbm) #also xgboost is supposed to be good
## Loaded gbm 2.1.4
# first gbm needs a numeric y, weird
trDB = ktr; trDB$DelIn2Yr = as.numeric(trDB$DelIn2Yr)-1
teDB = kte; teDB$DelIn2Yr = as.numeric(teDB$DelIn2Yr)-1
# check the new y's make sense
table(trDB$DelIn2Yr,ktr$DelIn2Yr)
##
##         0     1
##   0 69971     0
##   1     0  5029
table(teDB$DelIn2Yr,kte$DelIn2Yr)
##
##         0     1
##   0 70003     0
##   1     0  4997
#fit boosting
bfit = gbm(DelIn2Yr~.,trDB, distribution="bernoulli",n.trees=500,interaction.depth=3,shrinkage=.05)
byhattest = predict(bfit,newdata=teDB,n.trees=500,type="response")
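One thing to watch with boosting is the number of trees: unlike a random forest, gbm will eventually overfit as n.trees grows. gbm.perf can estimate a good stopping point; a sketch on simulated data (the Kaggle fit above simply uses all 500 trees):

```r
library(gbm)
set.seed(99)
n = 2000
x1 = rnorm(n); x2 = rnorm(n)
y = rbinom(n, 1, plogis(x1 - x2))            # simulated binary response
df = data.frame(y, x1, x2)
bfit = gbm(y ~ ., data = df, distribution = "bernoulli",
           n.trees = 300, interaction.depth = 2, shrinkage = .05)
# out-of-bag estimate of the best number of trees (bag.fraction = .5 by default)
best.iter = gbm.perf(bfit, method = "OOB", plot.it = FALSE)
phat = predict(bfit, newdata = df, n.trees = best.iter, type = "response")
```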