COMP 551 – Applied Machine Learning
Lecture 3: Linear regression (cont'd)
Instructor: Herke van Hoof ([email protected])
Slides mostly by: Joelle Pineau
Class web page: www.cs.mcgill.ca/~hvanho2/comp551
Unless otherwise noted, all material posted for this course is copyright of the instructors, and cannot be reused or reposted without the instructors' written permission.
What we saw last time
• Definition and characteristics of a supervised learning problem.
• Linear regression (hypothesis class, cost function, algorithm).
• k-fold cross-validation:
– Consider k partitions of the data (usually of equal size).
– Train with k−1 subsets, validate on the kth subset. Repeat k times.
– Average the prediction error over the k rounds/folds.
• Leave-one-out cross-validation:
– Repeat n times:
• Set aside one instance <xi, yi> from the training set.
• Use all other data points to find w (optimization).
• Measure prediction error on the held-out <xi, yi>.
– Average the prediction error over all n held-out instances.
• Choose the d with the lowest estimated true prediction error (see the sketch below).
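As a minimal sketch (not from the original slides), leave-one-out cross-validation for choosing the polynomial degree d can be written in a few lines of NumPy; k-fold CV is the same loop over k partitions instead of n single instances. The helper name loocv_error is hypothetical; the data reuses the ten (x, y) pairs from the next slide.

```python
# A minimal sketch of leave-one-out cross-validation for choosing the
# polynomial degree d (k-fold CV with k = n). Uses plain NumPy.
import numpy as np

def loocv_error(x, y, d):
    """Average squared prediction error of a degree-d polynomial fit,
    estimated by leave-one-out cross-validation."""
    n = len(x)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i               # set aside instance i
        w = np.polyfit(x[mask], y[mask], d)    # fit on the other n-1 points
        pred = np.polyval(w, x[i])             # predict the held-out point
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)

# The ten (x, y) pairs from the "Estimating true error" slide below.
x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.1, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

# Choose the d with the lowest estimated true prediction error.
best_d = min(range(1, 6), key=lambda d: loocv_error(x, y, d))
print(best_d)
```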
Estimating true error for d=1
x      y
0.86   2.49
0.09   0.83
-0.85  -0.25
0.87   3.10
-0.44  0.87
-0.43  0.02
-1.1   -0.12
0.40   1.81
-0.96  -0.83
0.17   0.43
[Figure: left panel, the data; right panel, cross-validation results for each degree d.]
Cross-validation results
• Optimal choice: d=2. Overfitting for d > 2.
Evaluation
• We use cross-validation for model selection.
• Available labeled data is split into two parts:
– Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree).
– Validation set is used to compare the best f from each hypothesis class across different classes (e.g. different degree regression).
• Must be untouched during the process of looking for f within a class F.
Evaluation
• After adapting the weights to minimize the error on the training set, the weights could be exploiting particularities of the training set:
– We have to use the validation set as a proxy for the true error.
• After choosing the hypothesis class to minimize error on the validation set, the hypothesis class could be adapted to some particularities of the validation set:
– The validation set is no longer a good proxy for the true error!
Evaluation
• Test set: ideally, a separate set of (labeled) data is withheld to get a true estimate of the generalization error.
– Cannot be touched during the process of selecting F.
– (Often the "validation set" is called the "test set", without distinction.)
A sketch of such a three-way split follows.
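Here is a minimal sketch of a train/validation/test split in NumPy. The toy data and the 60/20/20 fractions are illustrative assumptions, not from the slides.

```python
# A minimal sketch of a train/validation/test split. The toy data and
# the 60/20/20 fractions are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 100)               # toy inputs
y = 2 * X + rng.normal(0, 0.1, 100)       # toy targets

n = len(X)
idx = rng.permutation(n)                  # shuffle before splitting
n_train, n_val = int(0.6 * n), int(0.2 * n)

train = idx[:n_train]                     # fit w within a class F here
val = idx[n_train:n_train + n_val]        # compare hypothesis classes here
test = idx[n_train + n_val:]              # untouched until the final estimate

X_train, y_train = X[train], y[train]
X_val, y_val = X[val], y[val]
X_test, y_test = X[test], y[test]
```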
Validation vs Train error
FIGURE 2.11. Test and training error as a function of model complexity. [Figure: prediction error vs. model complexity (low to high); the training-sample curve decreases with complexity while the test-sample curve is U-shaped; low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]
…be close to f(x0). As k grows, the neighbors are further away, and then anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff. More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error (1/N) ∑i (yi − ŷi)2. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However, with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions f̂(x0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.
[From Hastie et al. textbook]
Understanding the error
Given a set of examples <X, Y>, assume that y = f(x) + ∊, where ∊ is Gaussian noise with zero mean and standard deviation σ.
Polynomial regression wrap-up
• We can add to the training data all cross-product terms up to some degree d (see the sketch below).
• Fitting parameters is no different from linear regression.
• We can use cross-validation to choose the best order of polynomial to fit our data.
• Because the number of parameters explodes with the degree of the polynomial (see homework 1), we will often use only select cross-terms or higher-order powers (based on domain knowledge).
• Always use cross-validation, and report results on testing data (completely untouched in the training process).
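To make the "cross-product terms" concrete, here is a hypothetical NumPy helper (polynomial_features is my name, not from the slides) that builds all monomials up to degree d; its output width illustrates why the parameter count explodes.

```python
# A minimal sketch of polynomial feature expansion with all cross-product
# terms up to degree d. polynomial_features is a hypothetical helper.
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(X, d):
    """Expand an (n, p) matrix with all monomials of its columns up to degree d."""
    n, p = X.shape
    cols = [np.ones(n)]                            # bias term
    for degree in range(1, d + 1):
        for terms in combinations_with_replacement(range(p), degree):
            col = np.ones(n)
            for j in terms:                        # product of chosen features,
                col = col * X[:, j]                # e.g. (0, 1) -> x1 * x2
            cols.append(col)
    return np.column_stack(cols)

# Fitting is then ordinary least squares on the expanded matrix:
# w, *_ = np.linalg.lstsq(polynomial_features(X, d), y, rcond=None)
```

For p = 2 and d = 2 this yields the columns 1, x1, x2, x1^2, x1*x2, x2^2, and the column count grows combinatorially in p and d.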
The anatomy of the error
• Suppose we have examples <x, y> where y = f(x) + ∊ and ∊ is Gaussian noise with zero mean and standard deviation σ.
• Reminder: the normal (Gaussian) distribution N(x | µ, σ2) has density (1/√(2πσ2)) exp(−(x − µ)2 / (2σ2)).
[Figure: bell-shaped Gaussian density centered at µ, with 2σ indicating its width.]
(see Bishop Ch. 2 for review)
Understanding the error
• Consider standard linear regression solution:
Err(w) = ∑i=1:n ( yi - wTxi)2
• If we consider only the class of linear hypotheses, we have a systematic prediction error, called bias, whenever the data is generated by a non-linear function.
• Depending on what dataset we observed, we may get different solutions. Thus we can also have error due to this variance.
– This occurs even if the data is generated from the class of linear functions.
An example (from Tom Dietterich)
• The circles are data points. X is drawn uniformly randomly. Y is generated by the function y = 2 sin(0.5x) + ∊.
The anatomy of the error: Linear regression
• In linear regression, given a set of examples <xi, yi>, i = 1…m, we fit a linear hypothesis h(x) = wTx, so as to minimize the sum-squared error over the training data: ∑i=1:m (yi − h(xi))2
• Because of the hypothesis class that we chose (hypotheses linear in the parameters), for some target functions f we will have a systematic prediction error.
• Even if f were truly from the hypothesis class we picked, depending on the data set we have, the parameters w that we find may be different; this variability due to the specific data set on hand is a different source of error.
An example (Tom Dietterich)
• The sine is the true function.
• The circles are data points (x drawn uniformly randomly, y given by the formula).
• The straight line is the linear regression fit (see lecture 1)
An example (from Tom Dietterich)
• With different sets of 20 points, we get different lines.
Bias-variance analysis
• Given a new data point x, what is the expected prediction error?
• Assume that the data points are drawn independently and identically distributed (i.i.d.) from a unique underlying probability distribution P(<x, y>) = P(x)P(y|x).
• The goal of the analysis is to compute, for an arbitrary given point x,
EP[ (y − h(x))2 | x ]
where y is the target value paired with x in a data set, and the expectation is over all training sets of a given size, drawn according to P.
• For a given hypothesis class, we can also compute the true error, which is the expected error over the input distribution:
∑x EP[ (y − h(x))2 | x ] P(x)
(if x is continuous, the sum becomes an integral with appropriate conditions).
• We will decompose this expectation into three components.
Bias-variance decomposition
• Compare f(x) (true value at x) to prediction h(x).
• Error: E[(h(x) − f(x))2]
• Bias: f(x) − h̄(x)
– How far is the average estimate from the true function?
• Variance: E[(h(x) − h̄(x))2]
– How far are estimates (on average) from the average estimate?
• Notes:
– Expectations are over all possible datasets (noise realizations).
– h̄(x) = E[h(x)] is the average estimate.
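These quantities can be estimated empirically. The following small simulation is an illustrative sketch (not from the slides) using the Dietterich example above: y = 2 sin(0.5x) + ∊, fit with a line; the sample size, noise level, and grid are my assumptions.

```python
# A minimal sketch estimating bias^2 and variance empirically for the
# example above: y = 2 sin(0.5 x) + Gaussian noise, fit by a degree-1
# polynomial. Sample size, noise level and grid are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
def f(x):
    return 2 * np.sin(0.5 * x)

x_grid = np.linspace(-5, 5, 50)            # points where we evaluate h(x)
preds = []
for _ in range(1000):                      # many datasets approximate E[...]
    x = rng.uniform(-5, 5, size=20)
    y = f(x) + rng.normal(0, 1, size=20)
    w = np.polyfit(x, y, 1)                # linear fit to this dataset
    preds.append(np.polyval(w, x_grid))

preds = np.array(preds)                    # shape (1000, 50)
h_bar = preds.mean(axis=0)                 # average estimate h̄(x)
bias2 = (f(x_grid) - h_bar) ** 2           # squared bias at each x
variance = preds.var(axis=0)               # spread of h(x) across datasets
print(bias2.mean(), variance.mean())
```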
Bias-variance decomposition
• Can show that: Error = bias2 + variance
(e.g. https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf)
• What happens if we apply a 1st order model to a quadratic dataset?
– No matter how much data we have, we can't fit the dataset.
– This means we have bias (underfitting).
• What happens if we apply a 2nd order model to a linear dataset?
– The model can definitely fit the dataset! (no bias)
– Spurious parameters make the model sensitive to noise.
– Higher variance than necessary (overfitting if the dataset is small).
(A quick empirical check of both scenarios follows.)
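This sketch (illustrative settings, not from the slides) shows the residual bias of a line fit to quadratic data, and the extra variance of a quadratic fit to linear data.

```python
# A quick empirical check of the two scenarios above (illustrative settings).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)

# 1st order model on a quadratic dataset: systematic residual error (bias).
y_quad = x ** 2 + rng.normal(0, 0.05, x.shape)
w1 = np.polyfit(x, y_quad, 1)
print("mean sq. residual:", np.mean((y_quad - np.polyval(w1, x)) ** 2))

# 2nd order model on a linear dataset: the spurious quadratic coefficient
# chases noise, varying from dataset to dataset (variance).
quad_coeffs = [np.polyfit(x, 2 * x + rng.normal(0, 0.5, x.shape), 2)[0]
               for _ in range(200)]
print("variance of spurious coefficient:", np.var(quad_coeffs))
```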
Validation vs Train error

[Figure 2.11 repeated: test and training error as a function of model complexity. From Hastie et al. textbook.]
Bias-variance decomposition
• What happens if we apply a 1st order model to a linear dataset?
Gauss-Markov Theorem
• Main result:
The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.
• Understanding the statement:
– Real parameters are denoted: w
– Estimate of the parameters is denoted: ŵ
– Error of the estimator: Err(ŵ) = E[(ŵ − w)2] = Var(ŵ) + (E[ŵ − w])2
– Unbiased estimator means: E[ŵ − w] = 0
– There may exist an estimator that has lower error, but some bias.
Bias vs Variance
• Gauss-Markov Theorem says:
The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.
• Insight: Find lower variance solution, at the expense of some bias.
• E.g. fix low-relevance weights to 0
Recall our prostate cancer example
• The Z-score measures the effect of dropping that feature from the linear regression: zj = ŵj / sqrt(σ2 vj), where ŵj is the estimated weight of the jth feature, and vj is the jth diagonal element of (XTX)-1 (see the sketch below).
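A minimal sketch of this Z-score computation in NumPy follows; the toy design matrix and coefficients are illustrative assumptions, not the prostate data.

```python
# A minimal sketch of the Z-score z_j = w_j / sqrt(sigma^2 v_j). The toy
# data below is illustrative, not the prostate cancer dataset.
import numpy as np

rng = np.random.default_rng(0)
n, p = 67, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 3 features
y = X @ np.array([2.5, 0.7, 0.3, 0.0]) + rng.normal(0, 0.7, n)

XtX_inv = np.linalg.inv(X.T @ X)
w_hat = XtX_inv @ X.T @ y                 # least-squares estimates
resid = y - X @ w_hat
sigma2 = resid @ resid / (n - p)          # unbiased noise variance estimate
v = np.diag(XtX_inv)                      # v_j: jth diagonal of (X^T X)^-1
z = w_hat / np.sqrt(sigma2 * v)           # |z_j| > ~2: significant at ~5%
print(z)
```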
TABLE 3.1. Correlations of predictors in the prostate cancer data.
TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly, a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level.

Term        Coefficient   Std. Error   Z Score
Intercept   2.46          0.09         27.60
…
…example, that both lcavol and lcp show a strong relationship with the response lpsa, and with each other. We need to fit the effects jointly to untangle the relationships between the predictors and the response.

We fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to have unit variance. We randomly split the dataset into a training set of size 67 and a test set of size 30. We applied least squares estimation to the training set, producing the estimates, standard errors and Z-scores shown in Table 3.2. The Z-scores are defined in (3.12), and measure the effect of dropping that variable from the model. A Z-score greater than 2 in absolute value is approximately significant at the 5% level. (For our example, we have nine parameters, and the 0.025 tail quantiles of the t67−9 distribution are ±2.002!) The predictor lcavol shows the strongest effect, with lweight and svi also strong. Notice that lcp is not significant, once lcavol is in the model (when used in a model without lcavol, lcp is strongly significant). We can also test for the exclusion of a number of terms at once, using the F-statistic (3.13). For example, we consider dropping all the non-significant terms in Table 3.2, namely age, …
[From Hastie et al. textbook]
Subset selection
• Idea: Keep only a small set of features with non-zero weights.
• Goal: Find lower variance solution, at the expense of some bias.
• There are many different methods for choosing subsets.
(More on this later… )
• Least-squares regression can be used to estimate the weights of the selected features (see the sketch below).
• Bias arises because the true model might rely on the discarded features!
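As one concrete instance of the many possible methods, here is a hypothetical sketch of greedy forward subset selection (forward_select and lstsq_error are my names, not from the slides):

```python
# A minimal sketch of greedy forward subset selection, one of many possible
# methods (the slides defer the details). Assumes arrays X (n x p) and y.
import numpy as np

def lstsq_error(Xs, y):
    """Training error of a least-squares fit on the columns Xs."""
    w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ w
    return r @ r

def forward_select(X, y, k):
    """Greedily add the feature that most reduces training error, up to k."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda j: lstsq_error(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen   # indices of the selected features; the rest get weight 0
```

In practice k itself would be chosen by cross-validation, and the final weights re-estimated by least squares on the chosen subset.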
Bias vs Variance
• Find lower variance solution, at the expense of some bias.
• Force some weights to 0
• E.g. include a penalty for model complexity in the error to reduce overfitting.
Err(w) = ∑i=1:n ( yi - wTxi)2 + λ |model_size|
λ is a hyper-parameter that controls penalty size.
Ridge regression (aka L2-regularization)
• Constrains the weights by imposing a penalty on their size:
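For reference, a minimal sketch of ridge regression's standard closed-form solution, assuming the usual L2 penalty λ ∑j wj2 on the weight sizes (ridge_fit is a hypothetical helper name):

```python
# A minimal sketch of ridge regression via its standard closed-form solution
# w = (X^T X + lambda * I)^{-1} X^T y, assuming the usual L2 penalty.
import numpy as np

def ridge_fit(X, y, lam):
    """Least squares with penalty lam * ||w||^2 on the weight sizes."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam = 0 recovers ordinary least squares; larger lam shrinks the weights
# toward zero, trading a little bias for lower variance.
```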