Page 1: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Lecture 12 – Model Assessment and Selection

Rice ECE697

Farinaz Koushanfar

Fall 2006

Page 2: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Summary

• Bias, variance, and model complexity
• Optimism of the training error rate
• Estimates of in-sample prediction error, AIC
• Effective number of parameters
• The Bayesian approach and BIC
• Vapnik-Chervonenkis dimension
• Cross-validation
• Bootstrap methods

Page 3: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Model Selection Criteria

• Training Error

• Loss Function

• Generalization Error

Page 4: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Training Error vs. Test Error

Page 5: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Model Selection and Assessment

• Model selection:
– Estimating the performance of different models in order to choose the best one

• Model assessment:
– Having chosen a final model, estimating its prediction error (generalization error) on new data

• If we were rich in data, we would split it into three parts: Train | Validation | Test

Page 6: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Bias-Variance Decomposition

• As we have seen before, the expected prediction error at a point $x_0$ under squared-error loss decomposes into irreducible error, squared bias, and variance:
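The equation did not survive extraction; the standard form of the decomposition (as in ESL Ch. 7) is:

$$\mathrm{Err}(x_0) = E\big[(Y-\hat f(x_0))^2 \mid X=x_0\big] = \sigma_\varepsilon^2 + \big[E\hat f(x_0) - f(x_0)\big]^2 + E\big[\hat f(x_0) - E\hat f(x_0)\big]^2$$
$$= \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat f(x_0)\big) + \mathrm{Var}\big(\hat f(x_0)\big)$$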

• The first term is the variance of the target around its true mean $f(x_0)$ (irreducible error); the second term is the squared amount by which the average of our estimate differs from the true mean (squared bias); the last term is the variance of $\hat f(x_0)$

• The more complex the fitted model $\hat f$, the lower the bias, but the higher the variance

Page 7: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Bias-Variance Decomposition (cont’d)

• For K-nearest neighbor

• For linear regression
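The two formulas referenced here were lost in extraction; the versions given in ESL (7.10–7.11), for $k$-nearest neighbors and for a linear fit $\hat f_p(x_0) = x_0^T\hat\beta$ with $p$ parameters, are:

$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Big[f(x_0) - \tfrac{1}{k}\textstyle\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}$$
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \big[f(x_0) - E\hat f_p(x_0)\big]^2 + \|h(x_0)\|^2\,\sigma_\varepsilon^2$$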

Page 8: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Bias-Variance Decomposition (cont’d)

• For linear regression, $h(x_0) = X(X^TX)^{-1}x_0$ is the $N$-vector of linear weights that produces the fit $\hat f_p(x_0) = x_0^T(X^TX)^{-1}X^Ty = h(x_0)^Ty$, and hence $\mathrm{Var}[\hat f_p(x_0)] = \|h(x_0)\|^2\,\sigma_\varepsilon^2$

• This variance changes with $x_0$, but its average over the sample values $x_i$ is $(p/N)\,\sigma_\varepsilon^2$
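A small numeric sketch (illustrative, not part of the lecture) verifying that the average of $x_i^T(X^TX)^{-1}x_i$ over the training points is exactly $p/N$:

```python
# Check: ||h(x_i)||^2 = x_i^T (X^T X)^{-1} x_i, and its average over the
# training points is trace(X (X^T X)^{-1} X^T) / N = p / N exactly.
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 10
X = rng.normal(size=(N, p))                      # assumed design matrix for illustration

XtX_inv = np.linalg.inv(X.T @ X)
h_sq = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # ||h(x_i)||^2 for each training point
print(h_sq.mean(), p / N)                        # both equal 0.05 up to rounding
```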

Page 9: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Example

• 50 observations and 20 predictors, uniformly distributed in the hypercube $[0,1]^{20}$

• Left: $Y$ is 0 if $X_1 \le 1/2$ and 1 otherwise, and $k$-NN is applied
• Right: $Y$ is 1 if $\sum_{j=1}^{10} X_j$ is greater than 5 and 0 otherwise

[Figure: prediction error, squared bias, and variance as a function of model complexity for the two scenarios]

Page 10: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Example – loss function

[Figure: prediction error, squared bias, and variance for the same examples under 0-1 loss]

Page 11: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Optimism of Training Error

• The training error: $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i))$
• It is typically less than the true (generalization) error, because the fit is evaluated on the same data it was trained on
• In-sample error $\mathrm{Err_{in}}$: the expected error when new response values are drawn at the original training inputs $x_i$

• Optimism: $\mathrm{op} = \mathrm{Err_{in}} - \overline{\mathrm{err}}$, with average optimism $\omega = E_y(\mathrm{op})$

• For squared error, 0-1, and other loss functions, one can show in general that $\omega = \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i)$

Page 12: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Optimism (cont’d)

• Thus, the amount by which the training error underestimates the true error depends on how much each $y_i$ affects its own prediction

• For a linear fit and an additive error model $Y = f(X) + \varepsilon$, the relevant covariances can be evaluated explicitly (the missing relations are reconstructed below)

Conclusion: optimism increases linearly with the number of inputs or basis functions $d$, and decreases as the training size $N$ increases
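The equations on this slide were lost in extraction; in the notation of ESL Ch. 7, for a linear fit with $d$ inputs or basis functions under the additive error model they read:

$$\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2, \qquad E_y(\mathrm{Err_{in}}) = E_y(\overline{\mathrm{err}}) + 2\,\frac{d}{N}\,\sigma_\varepsilon^2$$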

Page 13: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

How to account for optimism?

• One approach: estimate the optimism and add it to the training error; AIC, BIC, and related criteria work this way

• Bootstrap and cross-validation, in contrast, are direct estimates of the prediction error

Page 14: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Estimates of In-Sample Prediction Error

• The general form of the in-sample estimate is $\widehat{\mathrm{Err}}_{\mathrm{in}} = \overline{\mathrm{err}} + \hat\omega$, where $\hat\omega$ is an estimate of the average optimism

• $C_p$ statistic: for an additive error model, when $d$ parameters are fit under squared-error loss, $C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2$

• Using this criterion, the training error is adjusted by a factor proportional to the number of basis functions used

• The Akaike Information Criterion (AIC) is a similar but more generally applicable estimate of $\mathrm{Err_{in}}$, used when a log-likelihood loss function is employed

Page 15: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Akaike Information Criterion (AIC)

• AIC relies on a relationship that holds asymptotically as $N \to \infty$: $-2\,E\big[\log \mathrm{Pr}_{\hat\theta}(Y)\big] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N}$

• Here $\mathrm{Pr}_\theta(Y)$ is a family of densities for $Y$ (containing the "true" density), $\hat\theta$ is the maximum likelihood estimate of $\theta$, and "loglik" is the maximized log-likelihood:

$$\mathrm{loglik} = \sum_{i=1}^{N} \log \mathrm{Pr}_{\hat\theta}(y_i)$$

Page 16: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

AIC (cont’d)

• For the Gaussian model (with the variance assumed known), the AIC statistic is equivalent to $C_p$

• For logistic regression, using the binomial log-likelihood, we have

• $\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}$

• Choose the model that produces the smallest possible AIC

• What if we don’t know d?

• How about having tuning parameters?

Page 17: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

AIC (cont’d)

• Given a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$, denote by $\overline{\mathrm{err}}(\alpha)$ and $d(\alpha)$ the training error and the number of parameters of each model

• The function $\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\hat\sigma_\varepsilon^2$ provides an estimate of the test error curve, and we find the tuning parameter $\hat\alpha$ that minimizes it

• By choosing the best-fitting model with $d$ inputs, the effective number of parameters fit is more than $d$
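A minimal sketch (assumed setup, not the lecture's code) of using $\mathrm{AIC}(\alpha)$ to choose a ridge penalty, with $d(\alpha) = \mathrm{trace}(S_\alpha)$ as the effective number of parameters (this anticipates the next two slides):

```python
# Choose a ridge penalty alpha by minimizing
# AIC(alpha) = err(alpha) + 2 * d(alpha)/N * sigma2_hat.
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 20
X = rng.uniform(size=(N, p))
beta = np.zeros(p); beta[:5] = 2.0            # assumed sparse true coefficients
y = X @ beta + rng.normal(size=N)

# Estimate sigma^2 from a low-bias (here: unregularized) fit
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
sigma2_hat = resid @ resid / (N - p)

best = None
for alpha in np.logspace(-3, 3, 25):
    S = X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)   # smoother matrix
    y_hat = S @ y
    err = np.mean((y - y_hat) ** 2)           # training error
    d_eff = np.trace(S)                       # effective number of parameters
    aic = err + 2.0 * d_eff / N * sigma2_hat
    if best is None or aic < best[0]:
        best = (aic, alpha, d_eff)

print("chosen alpha = %.4g, effective df = %.2f, AIC = %.4f" % (best[1], best[2], best[0]))
```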

Page 18: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

AIC – Example: Phoneme recognition

Page 19: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

The effective number of parameters

• Generalize the number of parameters to linear fitting methods with regularization, i.e., fits of the form $\hat y = Sy$

• The effective number of parameters is: $d(S) = \mathrm{trace}(S)$

• The in-sample error estimate then uses $\mathrm{trace}(S)$ in place of $d$: $\widehat{\mathrm{Err}}_{\mathrm{in}} = \overline{\mathrm{err}} + 2\,\frac{\mathrm{trace}(S)}{N}\,\hat\sigma_\varepsilon^2$

Page 20: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

The effective number of parameters

• Thus, for a regularized linear fit $\hat y = Sy$ with additive error $Y = f(X) + \varepsilon$, one can show $\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i) = \mathrm{trace}(S)\,\sigma_\varepsilon^2$

• Hence the average optimism is $\omega = 2\,\frac{\mathrm{trace}(S)}{N}\,\sigma_\varepsilon^2$

• and $d(S) = \mathrm{trace}(S)$ plays the role of the number of parameters in the $C_p$/AIC formulas
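As a concrete illustration (standard, though not written out on this slide; cf. ESL Ch. 3 and 7): for ridge regression the smoother matrix and its effective number of parameters are

$$S_\lambda = X\,(X^TX + \lambda I)^{-1}X^T, \qquad d(S_\lambda) = \mathrm{trace}(S_\lambda) = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda},$$

where the $d_j$ are the singular values of $X$; $d(S_\lambda)$ decreases from $p$ toward 0 as $\lambda$ grows.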

Page 21: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

The Bayesian Approach and BIC

• The Bayesian information criterion (BIC): $\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,d$

• BIC/2 is also known as the Schwarz criterion

BIC is proportional to AIC ($C_p$), with the factor 2 replaced by $\log N$. Since $\log N > 2$ for $N > e^2 \approx 7.4$, BIC penalizes complex models more heavily, preferring simpler models.

Page 22: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

BIC (cont’d)

• BIC is asymptotically consistent as a selection criterion: given a family of models that includes the true one, the probability of selecting the true model approaches 1 as $N \to \infty$

• Suppose we have a set of candidate models $M_m$, $m = 1, \dots, M$, with corresponding model parameters $\theta_m$, and we wish to choose the best model

• Assuming a prior distribution $\mathrm{Pr}(\theta_m \mid M_m)$ for the parameters of each model $M_m$, compute the posterior probability of each model

Page 23: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

BIC (cont’d)

• The posterior probability of a model is proportional to its prior probability times the marginal likelihood of the data under that model (the formulas are reconstructed below)

• Here $Z$ represents the training data. To compare two models $M_m$ and $M_\ell$, form the posterior odds

• If the posterior odds are greater than one, choose model $m$; otherwise choose model $\ell$
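The formulas did not survive extraction; in the standard notation (ESL Ch. 7) the posterior and the posterior odds are

$$\mathrm{Pr}(M_m \mid Z) \;\propto\; \mathrm{Pr}(M_m)\,\mathrm{Pr}(Z \mid M_m), \qquad \frac{\mathrm{Pr}(M_m \mid Z)}{\mathrm{Pr}(M_\ell \mid Z)} = \frac{\mathrm{Pr}(M_m)}{\mathrm{Pr}(M_\ell)}\cdot\frac{\mathrm{Pr}(Z \mid M_m)}{\mathrm{Pr}(Z \mid M_\ell)},$$

the last factor being the Bayes factor discussed on the next slide.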

Page 24: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

BIC (cont’d)

• Bayes factor $\mathrm{BF}(Z)$: the rightmost term in the posterior odds, $\mathrm{Pr}(Z \mid M_m)/\mathrm{Pr}(Z \mid M_\ell)$; it measures the contribution of the data toward the posterior odds

• We need to approximate $\mathrm{Pr}(Z \mid M_m) = \int \mathrm{Pr}(Z \mid \theta_m, M_m)\,\mathrm{Pr}(\theta_m \mid M_m)\,d\theta_m$

• A Laplace approximation to the integral gives $\log \mathrm{Pr}(Z \mid M_m) \approx \log \mathrm{Pr}(Z \mid \hat\theta_m, M_m) - \frac{d_m}{2}\log N + O(1)$, where $\hat\theta_m$ is the maximum likelihood estimate and $d_m$ is the number of free parameters of model $M_m$

• If the loss function is taken to be $-2\log \mathrm{Pr}(Z \mid M_m, \hat\theta_m)$, this is equivalent to the BIC criterion

Page 25: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

BIC (cont’d)

• Thus, choosing the model with minimum BIC is equivalent to choosing the model with largest (approximate) posterior probability

• If we compute the BIC criterion for a set of $M$ models, $\mathrm{BIC}_m$, $m = 1, \dots, M$, then the posterior probability of each model is estimated as

$$\mathrm{Pr}(M_m \mid Z) \approx \frac{e^{-\frac{1}{2}\mathrm{BIC}_m}}{\sum_{\ell=1}^{M} e^{-\frac{1}{2}\mathrm{BIC}_\ell}}$$

• Thus, we can estimate not only the best model, but also assess the relative merits of the models considered
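A minimal numeric sketch of this formula (the BIC values are hypothetical, chosen only for illustration):

```python
# Convert hypothetical BIC values for three candidate models into approximate
# posterior model probabilities via exp(-BIC/2), normalized.
import numpy as np

bic = np.array([1020.3, 1017.8, 1025.1])        # assumed values for illustration
w = np.exp(-0.5 * (bic - bic.min()))            # shift by the min for numerical stability
posterior = w / w.sum()
print(posterior)   # the model with the smallest BIC gets the largest approximate posterior
```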

Page 26: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Vapnik-Chervonenkis Dimension

• In general it is difficult to specify the number of parameters of a fitted model

• The Vapnik-Chervonenkis (VC) theory provides a general measure of complexity and associated bounds on the optimism

• Consider a class of functions $\{f(x,\alpha)\}$ indexed by a parameter vector $\alpha$, with $x \in \mathbb{R}^p$

• Assume $f$ is an indicator function, taking values 0 or 1
• If $\alpha = (\alpha_0, \alpha_1)$ and $f$ is the linear indicator $I(\alpha_0 + \alpha_1^T x > 0)$, then it seems reasonable to say its complexity is $p+1$ parameters

• What about $f(x,\alpha) = I(\sin(\alpha x) > 0)$?

Page 27: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

VC Dimension (cont’d)

Page 28: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

VC Dimension (cont’d)

• The Vapnik-Chervonenkis dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be

• The VC dimension of the class $\{f(x,\alpha)\}$ is defined to be the largest number of points (in some configuration) that can be shattered by members of $\{f(x,\alpha)\}$

Page 29: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

VC Dimension (cont’d)

• A set of points is shattered by a class of functions if, no matter how we assign a binary label to each point, some member of the class can perfectly separate them

• Example: the VC dimension of linear indicator functions in the plane is 3 (three points in general position can be shattered, but no configuration of four points can be)

Page 30: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

VC Dimension (cont’d)

• Using the concept of VC dimension, one can prove results about the optimism of the training error when fitting a class of functions, e.g.:

• If we fit $N$ data points using a class of functions $\{f(x,\alpha)\}$ having VC dimension $h$, then with probability at least $1-\eta$ over training sets the bounds below hold

(Cherkassky and Mulier, 1998; for regression, $a_1 = a_2 = 1$)
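The bound itself was lost in extraction; one commonly quoted form (ESL Ch. 7.9, following Cherkassky and Mulier) is, with $\epsilon = a_1\,\big[h\big(\log(a_2 N/h)+1\big) - \log(\eta/4)\big]/N$:

$$\text{classification:}\quad \mathrm{Err}_{\mathcal{T}} \le \overline{\mathrm{err}} + \frac{\epsilon}{2}\Big(1 + \sqrt{1 + \tfrac{4\,\overline{\mathrm{err}}}{\epsilon}}\Big), \qquad \text{regression:}\quad \mathrm{Err}_{\mathcal{T}} \le \frac{\overline{\mathrm{err}}}{\big(1 - c\sqrt{\epsilon}\big)_{+}}$$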

Page 31: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

VC Dimension (cont’d)

• The bounds suggest that the optimism increases with $h$ and decreases with $N$, in qualitative agreement with the AIC correction $d/N$

• The VC-dimension bounds are stronger, however: they give probabilistic upper bounds that hold for all functions $f(x,\alpha)$, and hence allow searching over the class

Page 32: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

VC Dimension (cont’d)

• Vapnik's Structural Risk Minimization (SRM) is built around these bounds

• SRM fits a nested sequence of models of increasing VC dimension $h_1 < h_2 < \dots$, and then chooses the model with the smallest value of the upper bound

• A drawback is the difficulty of computing the VC dimension of a class of functions; a crude upper bound on it may not be adequate

Page 33: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Example – AIC, BIC, SRM

Page 34: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Cross Validation (CV)

• The most widely used method for estimating prediction error
• Directly estimates the generalization error by applying the model to held-out test samples
• K-fold cross-validation:

– Use one part of the data to build the model and a different part to test it

• Do this for $k = 1, 2, \dots, K$ and accumulate the prediction error obtained when predicting the $k$th part (a sketch of the procedure follows)
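A minimal K-fold CV sketch in Python (illustrative; least squares is assumed as the learner, which the slide does not specify):

```python
# K-fold cross-validation: estimate the prediction error of a least-squares fit.
import numpy as np

def kfold_cv_error(X, y, K=5, seed=None):
    rng = np.random.default_rng(seed)
    N = len(y)
    folds = rng.permutation(N) % K          # kappa: {1..N} -> {1..K}, random assignment
    errors = np.empty(N)
    for k in range(K):
        test = folds == k
        train = ~test
        # fit on the training part only (least squares as the example learner)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors[test] = (y[test] - X[test] @ beta) ** 2   # squared-error loss
    return errors.mean()                    # CV = (1/N) * sum_i L(y_i, f^{-kappa(i)}(x_i))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 20))
y = X[:, :5].sum(axis=1) + rng.normal(size=50)
print("5-fold CV estimate of prediction error:", kfold_cv_error(X, y, K=5, seed=1))
```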

Page 35: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

CV (cont’d)

• $\kappa: \{1,\dots,N\} \to \{1,\dots,K\}$ is an indexing function that divides the data into $K$ groups

• $\hat f^{-k}(x)$ is the fitted function computed with the $k$th part of the data removed

• The CV estimate of prediction error is $\mathrm{CV} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{-\kappa(i)}(x_i)\big)$

• If $K = N$, this is called leave-one-out CV

• Given a set of models $f(x,\alpha)$, let $\hat f^{-k}(x,\alpha)$ denote the $\alpha$th model fit with the $k$th part removed. For this set of models we have $\mathrm{CV}(\alpha) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{-\kappa(i)}(x_i,\alpha)\big)$

Page 36: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

CV (cont’d)

• $\mathrm{CV}(\alpha)$ should be minimized over $\alpha$
• What should we choose for $K$?

• With $K = N$, CV is approximately unbiased, but it can have high variance since the $N$ training sets are almost identical to one another

• Computational cost is also a concern: leave-one-out requires $N$ model fits

• Recall: $\mathrm{CV}(\alpha) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{-\kappa(i)}(x_i,\alpha)\big)$

Page 37: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

CV (cont’d)

Page 38: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

CV (cont’d)

• With lower K, CV has a lower variance, but bias could be a problem!

• The most common are 5-fold and 10-fold!

Page 39: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

CV (cont’d)

• Generalized cross-validation (GCV) provides a convenient approximation to leave-one-out CV for linear fitting under squared-error loss, i.e., fits of the form $\hat y = Sy$

• For many linear fits ($S_{ii}$ is the $i$th diagonal element of $S$):

$$\frac{1}{N}\sum_{i=1}^{N}\big[y_i - \hat f^{-i}(x_i)\big]^2 = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i - \hat f(x_i)}{1 - S_{ii}}\right]^2$$

• The GCV approximation replaces each $S_{ii}$ by its average, $\mathrm{trace}(S)/N$:

$$\mathrm{GCV} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(S)/N}\right]^2$$

GCV may sometimes be advantageous in settings where the trace can be computed more easily than the individual $S_{ii}$ (a numeric sketch follows)
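An illustrative numeric check (assumed ridge-regression setup, not from the lecture) that the shortcut and GCV formulas behave as described:

```python
# For a linear smoother y_hat = S y (ridge regression here), compare brute-force
# leave-one-out CV with the S_ii shortcut and with the GCV approximation.
import numpy as np

rng = np.random.default_rng(2)
N, p, lam = 50, 10, 1.0
X = rng.normal(size=(N, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=N)

S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # smoother matrix S
y_hat = S @ y

loocv_shortcut = np.mean(((y - y_hat) / (1 - np.diag(S))) ** 2)
gcv = np.mean(((y - y_hat) / (1 - np.trace(S) / N)) ** 2)

loo = np.empty(N)                                          # brute-force leave-one-out
for i in range(N):
    keep = np.arange(N) != i
    A = X[keep].T @ X[keep] + lam * np.eye(p)
    beta_i = np.linalg.solve(A, X[keep].T @ y[keep])
    loo[i] = (y[i] - X[i] @ beta_i) ** 2

print(loocv_shortcut, np.mean(loo), gcv)   # the first two agree; GCV is close
```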

Page 40: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Bootstrap

• Denote the training set by $Z = (z_1, \dots, z_N)$, where $z_i = (x_i, y_i)$

• Randomly draw a dataset of size $N$ with replacement from the training data

• This is done $B$ times (e.g., $B = 100$)
• Refit the model to each of the $B$ bootstrap datasets and examine its behavior over the $B$ replications
• From the bootstrap samples we can estimate any aspect of the distribution of $S(Z)$, where $S(Z)$ is any quantity computed from the data

Page 41: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Bootstrap - Schematic

For example, the bootstrap estimate of the variance of $S(Z)$ is

$$\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B-1}\sum_{b=1}^{B}\big(S(Z^{*b}) - \bar S^{*}\big)^2, \qquad \bar S^{*} = \frac{1}{B}\sum_{b=1}^{B} S(Z^{*b})$$
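An illustrative bootstrap sketch (the data and the statistic, here the sample median, are assumed for the example):

```python
# Estimate the variance of the sample median by resampling with replacement B times.
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(loc=0.0, scale=1.0, size=100)    # the training data Z

B = 200
stats = np.empty(B)
for b in range(B):
    z_star = rng.choice(z, size=len(z), replace=True)   # bootstrap sample Z*b
    stats[b] = np.median(z_star)                        # S(Z*b)

var_hat = stats.var(ddof=1)     # (1/(B-1)) * sum_b (S(Z*b) - S_bar)^2
print("bootstrap variance estimate of the median:", var_hat)
```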

Page 42: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Bootstrap (Cont’d)

• The bootstrap can be used to estimate the prediction error by predicting each training point from each bootstrap fit:

$$\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\,\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L\big(y_i, \hat f^{*b}(x_i)\big)$$

• $\widehat{\mathrm{Err}}_{\mathrm{boot}}$ does not provide a good estimate:
– each bootstrap dataset acts as both training and test set, and the two share common observations
– the overfit predictions will therefore look unrealistically good

• Better bootstrap estimates can be obtained by mimicking CV
• Only keep track of predictions from bootstrap samples that do not contain the observation being predicted

Page 43: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Bootstrap (Cont’d)

• The leave-one-out bootstrap estimate of prediction error is $\widehat{\mathrm{Err}}^{(1)} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\big(y_i, \hat f^{*b}(x_i)\big)$

• $C^{-i}$ is the set of indices of the bootstrap samples $b$ that do not contain observation $i$

• We either have to choose $B$ large enough to ensure that every $|C^{-i}|$ is greater than zero, or simply drop the terms for which $|C^{-i}| = 0$ (a sketch follows)
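An illustrative sketch (assumed least-squares learner and simulated data) of the leave-one-out bootstrap estimate:

```python
# Leave-one-out bootstrap: average the loss at point i only over bootstrap fits
# whose samples do not contain observation i.
import numpy as np

rng = np.random.default_rng(4)
N, p, B = 50, 5, 200
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=N)

loss = [[] for _ in range(N)]               # per-observation losses from fits excluding it
for b in range(B):
    idx = rng.integers(0, N, size=N)        # bootstrap sample Z*b (indices, with replacement)
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    out = np.setdiff1d(np.arange(N), idx)   # observations NOT in this bootstrap sample
    for i in out:
        loss[i].append((y[i] - X[i] @ beta) ** 2)

# Drop observations with |C^{-i}| = 0 (rare for B this large)
err1 = np.mean([np.mean(li) for li in loss if len(li) > 0])
print("leave-one-out bootstrap error estimate:", err1)
```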

Page 44: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Bootstrap (Cont’d)

• The leave-one-out bootstrap solves the overfitting problem, but it has a training-set-size bias

• The average number of distinct observations in each bootstrap sample is about $0.632\,N$, since each observation appears in a given bootstrap sample with probability $1 - (1 - 1/N)^N \approx 1 - e^{-1} \approx 0.632$

• Thus, if the learning curve has a considerable slope at sample size $N/2$, the leave-one-out bootstrap will be biased upward as an estimate of the true error

• A number of methods have been proposed to alleviate this problem, e.g., the .632 estimator and refinements based on the no-information error rate (a measure of the amount of overfitting)
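For reference (the formula is not written out on the slide), the .632 estimator mentioned here is a weighted average of the training error and the leave-one-out bootstrap error (ESL 7.57):

$$\widehat{\mathrm{Err}}^{(.632)} = 0.368\,\overline{\mathrm{err}} + 0.632\,\widehat{\mathrm{Err}}^{(1)}$$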

Page 45: Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006.

Bootstrap (Example)

• Five-fold CV and the .632 estimator for the same problems as before

• Any of the measures could be biased, but this does not matter as long as the bias does not affect the relative performance of the models being compared