Resampling Methods
Cross-validation, Bootstrapping

Marek Petrik

2/21/2017

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
So Far in ML
- Regression vs. classification
- Linear regression
- Logistic regression
- Linear discriminant analysis, QDA
- Maximum likelihood
Discriminative vs Generative Models
- Discriminative models
  - Estimate conditional models Pr[Y | X]
  - Linear regression
  - Logistic regression
- Generative models
  - Estimate the joint probability Pr[Y, X] = Pr[Y | X] Pr[X]
  - Estimate not only the probability of labels but also of the features
  - Once the model is fit, it can be used to generate data
  - LDA, QDA, Naive Bayes
Today
- Successfully using basic machine learning methods
- Problems:
  1. How well is the machine learning method doing?
  2. Which method is best for my problem?
  3. How many features (and which ones) to use?
  4. What is the uncertainty in the learned parameters?
- Methods:
  1. Validation set
  2. Leave-one-out cross-validation
  3. k-fold cross-validation
  4. Bootstrapping
Problem: How to design features?
[Figure: miles per gallon vs. horsepower with linear, degree-2, and degree-5 polynomial fits]
Benefit of Good Features
[Figure: left: simulated data (Y vs. X); right: mean squared error vs. flexibility (gray: training error, red: test error)]
Just Use Training Data?
- Using more features will always reduce the training MSE
- The error on the test set will be greater
[Figure: left: data (Y vs. X); right: mean squared error vs. flexibility (gray: training error, red: test error)]
Solution 1: Validation Set
- Just evaluate how well the method works on a held-out set
- Randomly split the data into:
  1. Training set: about half of all data
  2. Validation set (a.k.a. hold-out set): the remaining half
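The random split above can be sketched in a few lines of Python (the 50/50 ratio follows the slide; the function name and seed are illustrative, not from the lecture):

```python
import random

def validation_split(n, seed=0):
    """Shuffle indices 0..n-1 and split them into a training half
    and a held-out validation half."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:]

# Each index lands in exactly one of the two halves.
train_idx, val_idx = validation_split(10)
```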
k-Fold Cross-validation
- As k increases we have:
  1. Increasing computational complexity
  2. Decreasing bias (more training data)
  3. Increasing variance (bigger overlap between training sets)
- Empirically good values: k = 5-10
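Constructing the k folds can be sketched in Python; the interleaved assignment of points to folds below is one simple deterministic choice (in practice the folds are usually drawn at random):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation:
    each of the k folds serves once as the held-out test set while
    the remaining points form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = set(folds[i])
        train = [j for j in range(n) if j not in test]
        yield train, sorted(test)

# Five folds over ten points: every point is tested exactly once.
splits = list(kfold_indices(10, 5))
```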
Cross-validation in Classification
Logistic Regression
- Predict the probability of a class: p(X)
- Example: p(balance), the probability of default for a person with a given balance
- Linear regression:

  p(X) = β0 + β1 X

- Logistic regression:

  p(X) = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))

- The same as:

  log( p(X) / (1 − p(X)) ) = β0 + β1 X

- Linear decision boundary (derive from the log odds: p(x1) ≥ p(x2))
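The two formulations are equivalent: applying the log-odds transform to the logistic p(X) recovers the linear predictor. A quick numeric check in Python (the coefficients β0 = −2, β1 = 0.5 are made up for illustration):

```python
import math

def p_logistic(x, b0=-2.0, b1=0.5):
    """Logistic model: p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def log_odds(p):
    """log(p / (1 - p)); for the logistic model this recovers b0 + b1*x."""
    return math.log(p / (1.0 - p))

# At x = 3 the linear predictor is -2.0 + 0.5 * 3 = -0.5,
# and the log odds of the logistic probability equal exactly that.
p = p_logistic(3.0)
```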
Features in Logistic Regression
The logistic regression decision boundary is also linear . . . how can we get non-linear decisions?
[Figure: two-class data with logistic regression decision boundaries for polynomial features of degree 1, 2, 3, and 4]
Logistic Regression with Nonlinear Features
- Linear:

  log( p(X) / (1 − p(X)) ) = β0 + β1 X

- Nonlinear odds:

  log( p(X) / (1 − p(X)) ) = β0 + β1 X + β2 X^2 + β3 X^3

- Nonlinear probability:

  p(X) = e^(β0 + β1 X + β2 X^2 + β3 X^3) / (1 + e^(β0 + β1 X + β2 X^2 + β3 X^3))
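Since the decision boundary p(X) = 0.5 sits where the log odds cross zero, a cubic predictor can place the boundary at several values of X. A sketch with hypothetical coefficients:

```python
import math

def p_cubic(x, beta):
    """Logistic probability with cubic log odds:
    z = b0 + b1*x + b2*x^2 + b3*x^3,  p(x) = e^z / (1 + e^z)."""
    b0, b1, b2, b3 = beta
    z = b0 + b1 * x + b2 * x ** 2 + b3 * x ** 3
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to e^z / (1 + e^z)

# Made-up coefficients: z = x^3 - x vanishes at x = -1, 0, 1,
# so p(x) crosses 0.5 three times -- a non-linear boundary in x.
beta = (0.0, -1.0, 0.0, 1.0)
```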
Cross-validation in Classification
- Works the same as for regression
- Do not use MSE but:

  CV(n) = (1/n) Σ_{i=1}^{n} Err_i

- The error is an indicator function:

  Err_i = I(ŷ_i ≠ y_i)
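The quantity being averaged is just the misclassification indicator; a minimal Python sketch:

```python
def error_rate(y_true, y_pred):
    """Average of the indicators I(y_hat_i != y_i)."""
    n = len(y_true)
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt != yp) / n

# One mistake out of four predictions gives an error rate of 0.25.
err = error_rate([0, 1, 1, 0], [0, 1, 0, 0])
```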
K in KNN
- How to decide on the right k to use in KNN?
- Cross-validation!
[Figure: left: logistic regression error rate vs. order of polynomial used; right: KNN error rate vs. 1/K (brown: test error, blue: training error, black: CV error)]
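Selecting k by cross-validation can be sketched with a hand-rolled 1-D nearest-neighbor classifier and leave-one-out CV (the data set and the candidate k values below are made up):

```python
def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (1-D distance)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)  # labels are 0/1
    return 1 if 2 * votes >= k else 0

def loocv_error(xs, ys, k):
    """Leave-one-out CV error rate of k-NN on the data set."""
    n = len(xs)
    errors = 0
    for i in range(n):
        tx, ty = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        if knn_predict(tx, ty, xs[i], k) != ys[i]:
            errors += 1
    return errors / n

# Two well-separated clusters: CV picks the first k with minimal error.
xs = [0.0, 0.2, 0.4, 0.6, 3.0, 3.2, 3.4, 3.6]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
best_k = min([1, 3, 5], key=lambda k: loocv_error(xs, ys, k))
```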
Overfitting and CV
- Is it possible to overfit when using cross-validation?
- Yes!
- Inferring k in KNN using cross-validation is itself learning
- Insightful theoretical analysis: Probably Approximately Correct (PAC) learning
- Cross-validation will not overfit when learning simple concepts
Overfitting with Cross-validation
- Task: predict mpg ∼ power
- Define a new feature for some βs:

  f = β0 + β1 power + β2 power^2 + β3 power^3 + β4 power^4 + . . .

- Linear regression: find α such that:

  mpg = α f

- Cross-validation: find the values of the βs
- This will overfit
- Same solution as using linear regression on the entire data set (no cross-validation)
Preventing Overfitting
- Gold standard: have a test set that is used only once
- Rarely possible
- The $1M Netflix prize design:
  1. Publicly available training set
  2. Leaderboard results computed on a test set
  3. Private data set used to determine the final winner
Bootstrap
- Goal: understand the confidence in the learned parameters
- Most useful in inference
- How confident are we in the learned values of β:

  mpg = β0 + β1 power

- Approach: run the learning algorithm multiple times with different data sets
- Create each new data set by sampling with replacement from the original one
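The resampling loop can be sketched in Python with an ordinary least-squares slope (the data set, noise, and number of resamples are made up for illustration):

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope b1 of y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def bootstrap_slopes(xs, ys, reps=200, seed=0):
    """Refit the slope on `reps` data sets drawn with replacement from
    the original one; the spread of the results indicates the
    confidence in the learned coefficient."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(fit_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return slopes

# Noisy line with true slope 2: the bootstrap slopes scatter around 2.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + (0.3 if i % 2 == 0 else -0.3) for i, x in enumerate(xs)]
slopes = bootstrap_slopes(xs, ys)
```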