COMP 551 – Applied Machine Learning
Lecture 3: Linear regression (cont'd)
Instructor: Herke van Hoof ([email protected])
Slides mostly by: Joelle Pineau
Class web page: www.cs.mcgill.ca/~hvanho2/comp551
Unless otherwise noted, all material posted for this course is copyright of the instructors, and cannot be reused or reposted without the instructors' written permission.
What we saw last time
• Definition and characteristics of a supervised learning problem.
• Linear regression (hypothesis class, cost function, algorithm).
• k-fold cross-validation:
– Consider k partitions of the data (usually of equal size).
– Train with k−1 subsets, validate on the kth subset. Repeat k times.
– Average the prediction error over the k rounds/folds.
• Leave-one-out cross-validation:
– Repeat n times:
• Set aside one instance <xi, yi> from the training set.
• Use all other data points to find w (optimization).
• Measure prediction error on the held-out <xi, yi>.
– Average the prediction error over all n held-out instances.
• Choose the d with the lowest estimated true prediction error (see the sketch below).
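As a minimal sketch (not from the original slides), leave-one-out cross-validation for choosing the polynomial degree d can be written in a few lines of NumPy; k-fold CV is the same loop over k partitions instead of n single instances. The helper name loocv_error is hypothetical; the data reuses the ten (x, y) pairs from the next slide.

```python
# A minimal sketch of leave-one-out cross-validation for choosing the
# polynomial degree d (k-fold CV with k = n). Uses plain NumPy.
import numpy as np

def loocv_error(x, y, d):
    """Average squared prediction error of a degree-d polynomial fit,
    estimated by leave-one-out cross-validation."""
    n = len(x)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i               # set aside instance i
        w = np.polyfit(x[mask], y[mask], d)    # fit on the other n-1 points
        pred = np.polyval(w, x[i])             # predict the held-out point
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)

# The ten (x, y) pairs from the "Estimating true error" slide below.
x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.1, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

# Choose the d with the lowest estimated true prediction error.
best_d = min(range(1, 6), key=lambda d: loocv_error(x, y, d))
print(best_d)
```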
Estimating true error for d=1
x      y
0.86   2.49
0.09   0.83
-0.85  -0.25
0.87   3.10
-0.44  0.87
-0.43  0.02
-1.1   -0.12
0.40   1.81
-0.96  -0.83
0.17   0.43
[Figure: left panel, the data; right panel, cross-validation results for each degree d.]
Cross-validation results
• Optimal choice: d=2. Overfitting for d > 2.
Evaluation
• We use cross-validation for model selection.
• Available labeled data is split into two parts:
– Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree).
– Validation set is used to compare the best f from each hypothesis class across different classes (e.g. different degree regression).
• Must be untouched during the process of looking for f within a class F.
Evaluation
• After adapting the weights to minimize the error on the training set, the weights could be exploiting particularities of the training set:
– We have to use the validation set as a proxy for the true error.
• After choosing the hypothesis class to minimize error on the validation set, the hypothesis class could be adapted to some particularities of the validation set:
– The validation set is no longer a good proxy for the true error!
Evaluation
• Test set: ideally, a separate set of (labeled) data is withheld to get a true estimate of the generalization error.
– Cannot be touched during the process of selecting F.
– (Often the "validation set" is called the "test set", without distinction.)
A sketch of such a three-way split follows.
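Here is a minimal sketch of a train/validation/test split in NumPy. The toy data and the 60/20/20 fractions are illustrative assumptions, not from the slides.

```python
# A minimal sketch of a train/validation/test split. The toy data and
# the 60/20/20 fractions are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 100)               # toy inputs
y = 2 * X + rng.normal(0, 0.1, 100)       # toy targets

n = len(X)
idx = rng.permutation(n)                  # shuffle before splitting
n_train, n_val = int(0.6 * n), int(0.2 * n)

train = idx[:n_train]                     # fit w within a class F here
val = idx[n_train:n_train + n_val]        # compare hypothesis classes here
test = idx[n_train + n_val:]              # untouched until the final estimate

X_train, y_train = X[train], y[train]
X_val, y_val = X[val], y[val]
X_test, y_test = X[test], y[test]
```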
Validation vs Train error
FIGURE 2.11. Test and training error as a function of model complexity. [Figure: prediction error vs. model complexity (low to high); the training-sample curve decreases with complexity while the test-sample curve is U-shaped; low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]
…be close to f(x0). As k grows, the neighbors are further away, and then anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff. More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error (1/N) ∑i (yi − ŷi)2. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However, with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions f̂(x0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.
[From Hastie et al. textbook]
Understanding the error
Given a set of examples <X, Y>, assume that y = f(x) + ∊, where ∊ is Gaussian noise with zero mean and standard deviation σ.
Polynomial regression wrap-up
• We can add to the training data all cross-product terms up to some degree d (see the sketch below).
• Fitting parameters is no different from linear regression.
• We can use cross-validation to choose the best order of polynomial to fit our data.
• Because the number of parameters explodes with the degree of the polynomial (see homework 1), we will often use only select cross-terms or higher-order powers (based on domain knowledge).
• Always use cross-validation, and report results on testing data (completely untouched in the training process).
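To make the "cross-product terms" concrete, here is a hypothetical NumPy helper (polynomial_features is my name, not from the slides) that builds all monomials up to degree d; its output width illustrates why the parameter count explodes.

```python
# A minimal sketch of polynomial feature expansion with all cross-product
# terms up to degree d. polynomial_features is a hypothetical helper.
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(X, d):
    """Expand an (n, p) matrix with all monomials of its columns up to degree d."""
    n, p = X.shape
    cols = [np.ones(n)]                            # bias term
    for degree in range(1, d + 1):
        for terms in combinations_with_replacement(range(p), degree):
            col = np.ones(n)
            for j in terms:                        # product of chosen features,
                col = col * X[:, j]                # e.g. (0, 1) -> x1 * x2
            cols.append(col)
    return np.column_stack(cols)

# Fitting is then ordinary least squares on the expanded matrix:
# w, *_ = np.linalg.lstsq(polynomial_features(X, d), y, rcond=None)
```

For p = 2 and d = 2 this yields the columns 1, x1, x2, x1^2, x1*x2, x2^2, and the column count grows combinatorially in p and d.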
The anatomy of the error
• Suppose we have examples <x, y> where y = f(x) + ∊ and ∊ is Gaussian noise with zero mean and standard deviation σ.
• Reminder: the normal (Gaussian) distribution N(x | µ, σ2) has density (1/√(2πσ2)) exp(−(x − µ)2 / (2σ2)).
[Figure: bell-shaped Gaussian density centered at µ, with 2σ indicating its width.]
(see Bishop Ch. 2 for review)
Understanding the error
• Consider standard linear regression solution:
Err(w) = ∑i=1:n ( yi - wTxi)2
• If we consider only the class of linear hypotheses, we have a systematic prediction error, called bias, whenever the data is generated by a non-linear function.
• Depending on what dataset we observed, we may get different solutions. Thus we can also have error due to this variance.
– This occurs even if the data is generated from the class of linear functions.
An example (from Tom Dietterich)
• The circles are data points. X is drawn uniformly randomly. Y is generated by the function y = 2 sin(0.5x) + ∊.
The anatomy of the error: Linear regression
• In linear regression, given a set of examples <xi, yi>, i = 1…m, we fit a linear hypothesis h(x) = wTx, so as to minimize the sum-squared error over the training data: ∑i=1:m (yi − h(xi))2
• Because of the hypothesis class that we chose (hypotheses linear in the parameters), for some target functions f we will have a systematic prediction error.
• Even if f were truly from the hypothesis class we picked, depending on the data set we have, the parameters w that we find may be different; this variability due to the specific data set on hand is a different source of error.
An example (Tom Dietterich)
• The sine is the true function.
• The circles are data points (x drawn uniformly randomly, y given by the formula).
• The straight line is the linear regression fit (see lecture 1)
An example (from Tom Dietterich)
• With different sets of 20 points, we get different lines.
Bias-variance analysis
• Given a new data point x, what is the expected prediction error?
• Assume that the data points are drawn independently and identically distributed (i.i.d.) from a unique underlying probability distribution P(<x, y>) = P(x)P(y|x).
• The goal of the analysis is to compute, for an arbitrary given point x,
EP[ (y − h(x))2 | x ]
where y is the target value paired with x in a data set, and the expectation is over all training sets of a given size, drawn according to P.
• For a given hypothesis class, we can also compute the true error, which is the expected error over the input distribution:
∑x EP[ (y − h(x))2 | x ] P(x)
(if x is continuous, the sum becomes an integral with appropriate conditions).
• We will decompose this expectation into three components.
Bias-variance decomposition
• Compare f(x) (true value at x) to prediction h(x).
• Error: E[(h(x) − f(x))2]
• Bias: f(x) − h̄(x)
– How far is the average estimate from the true function?
• Variance: E[(h(x) − h̄(x))2]
– How far are estimates (on average) from the average estimate?
• Notes:
– Expectations are over all possible datasets (noise realizations).
– h̄(x) = E[h(x)] is the average estimate.
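These quantities can be estimated empirically. The following small simulation is an illustrative sketch (not from the slides) using the Dietterich example above: y = 2 sin(0.5x) + ∊, fit with a line; the sample size, noise level, and grid are my assumptions.

```python
# A minimal sketch estimating bias^2 and variance empirically for the
# example above: y = 2 sin(0.5 x) + Gaussian noise, fit by a degree-1
# polynomial. Sample size, noise level and grid are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
def f(x):
    return 2 * np.sin(0.5 * x)

x_grid = np.linspace(-5, 5, 50)            # points where we evaluate h(x)
preds = []
for _ in range(1000):                      # many datasets approximate E[...]
    x = rng.uniform(-5, 5, size=20)
    y = f(x) + rng.normal(0, 1, size=20)
    w = np.polyfit(x, y, 1)                # linear fit to this dataset
    preds.append(np.polyval(w, x_grid))

preds = np.array(preds)                    # shape (1000, 50)
h_bar = preds.mean(axis=0)                 # average estimate h̄(x)
bias2 = (f(x_grid) - h_bar) ** 2           # squared bias at each x
variance = preds.var(axis=0)               # spread of h(x) across datasets
print(bias2.mean(), variance.mean())
```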
Bias-variance decomposition
• Can show that: Error = bias2 + variance
(e.g. https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf)
• What happens if we apply a 1st order model to a quadratic dataset?
– No matter how much data we have, we can't fit the dataset.
– This means we have bias (underfitting).
• What happens if we apply a 2nd order model to a linear dataset?
– The model can definitely fit the dataset! (no bias)
– Spurious parameters make the model sensitive to noise.
– Higher variance than necessary (overfitting if the dataset is small).
(A quick empirical check of both scenarios follows.)
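This sketch (illustrative settings, not from the slides) shows the residual bias of a line fit to quadratic data, and the extra variance of a quadratic fit to linear data.

```python
# A quick empirical check of the two scenarios above (illustrative settings).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)

# 1st order model on a quadratic dataset: systematic residual error (bias).
y_quad = x ** 2 + rng.normal(0, 0.05, x.shape)
w1 = np.polyfit(x, y_quad, 1)
print("mean sq. residual:", np.mean((y_quad - np.polyval(w1, x)) ** 2))

# 2nd order model on a linear dataset: the spurious quadratic coefficient
# chases noise, varying from dataset to dataset (variance).
quad_coeffs = [np.polyfit(x, 2 * x + rng.normal(0, 0.5, x.shape), 2)[0]
               for _ in range(200)]
print("variance of spurious coefficient:", np.var(quad_coeffs))
```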
Validation vs Train error

[Figure 2.11 repeated: test and training error as a function of model complexity. From Hastie et al. textbook.]
Bias-variance decomposition
• What happens if we apply a 1st order model to a linear dataset?
Gauss-Markov Theorem
• Main result:
The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.
• Understanding the statement:
– Real parameters are denoted: w
– Estimate of the parameters is denoted: ŵ
– Error of the estimator: Err(ŵ) = E[(ŵ − w)2] = Var(ŵ) + (E[ŵ − w])2
– Unbiased estimator means: E[ŵ − w] = 0
– There may exist an estimator that has lower error, but some bias.
Bias vs Variance
• Gauss-Markov Theorem says:
The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.
• Insight: Find lower variance solution, at the expense of some bias.
• E.g. fix low-relevance weights to 0
Recall our prostate cancer example
• The Z-score measures the effect of dropping that feature from the linear regression: zj = ŵj / sqrt(σ2 vj), where ŵj is the estimated weight of the jth feature, and vj is the jth diagonal element of (XTX)-1 (see the sketch below).
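A minimal sketch of this Z-score computation in NumPy follows; the toy design matrix and coefficients are illustrative assumptions, not the prostate data.

```python
# A minimal sketch of the Z-score z_j = w_j / sqrt(sigma^2 v_j). The toy
# data below is illustrative, not the prostate cancer dataset.
import numpy as np

rng = np.random.default_rng(0)
n, p = 67, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 3 features
y = X @ np.array([2.5, 0.7, 0.3, 0.0]) + rng.normal(0, 0.7, n)

XtX_inv = np.linalg.inv(X.T @ X)
w_hat = XtX_inv @ X.T @ y                 # least-squares estimates
resid = y - X @ w_hat
sigma2 = resid @ resid / (n - p)          # unbiased noise variance estimate
v = np.diag(XtX_inv)                      # v_j: jth diagonal of (X^T X)^-1
z = w_hat / np.sqrt(sigma2 * v)           # |z_j| > ~2: significant at ~5%
print(z)
```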
TABLE 3.1. Correlations of predictors in the prostate cancer data.
TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly, a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level.

Term        Coefficient   Std. Error   Z Score
Intercept   2.46          0.09         27.60
…
…example, that both lcavol and lcp show a strong relationship with the response lpsa, and with each other. We need to fit the effects jointly to untangle the relationships between the predictors and the response.

We fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to have unit variance. We randomly split the dataset into a training set of size 67 and a test set of size 30. We applied least squares estimation to the training set, producing the estimates, standard errors and Z-scores shown in Table 3.2. The Z-scores are defined in (3.12), and measure the effect of dropping that variable from the model. A Z-score greater than 2 in absolute value is approximately significant at the 5% level. (For our example, we have nine parameters, and the 0.025 tail quantiles of the t67−9 distribution are ±2.002!) The predictor lcavol shows the strongest effect, with lweight and svi also strong. Notice that lcp is not significant, once lcavol is in the model (when used in a model without lcavol, lcp is strongly significant). We can also test for the exclusion of a number of terms at once, using the F-statistic (3.13). For example, we consider dropping all the non-significant terms in Table 3.2, namely age, …
[From Hastie et al. textbook]
Subset selection
• Idea: Keep only a small set of features with non-zero weights.
• Goal: Find lower variance solution, at the expense of some bias.
• There are many different methods for choosing subsets.
(More on this later… )
• Least-squares regression can be used to estimate the weights of the selected features (see the sketch below).
• Bias arises because the true model might rely on the discarded features!
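As one concrete instance of the many possible methods, here is a hypothetical sketch of greedy forward subset selection (forward_select and lstsq_error are my names, not from the slides):

```python
# A minimal sketch of greedy forward subset selection, one of many possible
# methods (the slides defer the details). Assumes arrays X (n x p) and y.
import numpy as np

def lstsq_error(Xs, y):
    """Training error of a least-squares fit on the columns Xs."""
    w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ w
    return r @ r

def forward_select(X, y, k):
    """Greedily add the feature that most reduces training error, up to k."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda j: lstsq_error(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen   # indices of the selected features; the rest get weight 0
```

In practice k itself would be chosen by cross-validation, and the final weights re-estimated by least squares on the chosen subset.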
Bias vs Variance
• Find lower variance solution, at the expense of some bias.
• Force some weights to 0
• E.g. include a penalty for model complexity in the error to reduce overfitting.
Err(w) = ∑i=1:n ( yi - wTxi)2 + λ |model_size|
λ is a hyper-parameter that controls penalty size.
Ridge regression (aka L2-regularization)
• Constrains the weights by imposing a penalty on their size:
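For reference, a minimal sketch of ridge regression's standard closed-form solution, assuming the usual L2 penalty λ ∑j wj2 on the weight sizes (ridge_fit is a hypothetical helper name):

```python
# A minimal sketch of ridge regression via its standard closed-form solution
# w = (X^T X + lambda * I)^{-1} X^T y, assuming the usual L2 penalty.
import numpy as np

def ridge_fit(X, y, lam):
    """Least squares with penalty lam * ||w||^2 on the weight sizes."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam = 0 recovers ordinary least squares; larger lam shrinks the weights
# toward zero, trading a little bias for lower variance.
```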