
COMP 551 - Applied Machine Learning
Lecture 4 --- Linear Classification
William L. Hamilton
(with slides and content from Joelle Pineau)

* Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor's written permission.

William L. Hamilton, McGill University and Mila 1

MiniProject 1 is out!

William L. Hamilton, McGill University and Mila 2

§ Due September 28th at 11:59pm. The details are at: https://www.cs.mcgill.ca/~wlh/comp551/files/miniproject1_spec.pdf

§ Basic idea – “Machine Learning 101”: § Implement two linear classification algorithms (from this lecture and next lecture)

§ Run linear classification on two different datasets.

§ Compare different models, settings, and features.

§ (Semi-)open-ended write-up.

§ Completed in groups of 3! You can now register your group on MyCourses.

§ If you don’t have a group yet, find one quick! You can use the discussion board on MyCourses to search for potential group members.

Self-assessment / practice quizzes

William L. Hamilton, McGill University and Mila 3

§ Quiz 0 – Attempt 1:

§ Around 250 students completed it.

§ Roughly 70% average. A couple questions were tricky, but 80%+ is where you ideally should be.

§ Quiz 0 – Attempt 2:

§ Around 180 students completed it.

§ Average went up to 80%.

§ Probability questions seemed to be the hardest.

Quiz 0, Attempt 2, Question 5

William L. Hamilton, McGill University and Mila 4

§ The correct answer was 1024, since there are 10 binary choices of what features to include (i.e., 2^10 = 1024).

§ But what about the subset where no features are included? Should we include this as an option? Yes!

§ Training a model with no features means we only learn the bias term (i.e., fw(x) = w0), which is equivalent to just predicting the average value for the target (and sometimes this is the best model we can find)!
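A tiny check of this point, with made-up numbers (not from the quiz): fitting a bias-only model by least squares recovers the mean of the targets, and the subset count is 2^10.

import numpy as np

y = np.array([2.0, 3.0, 5.0, 6.0])   # made-up targets

# With no features, the design matrix is just a column of ones (bias only),
# and the least-squares solution w0 is the mean of the targets.
X = np.ones((len(y), 1))
w0 = np.linalg.lstsq(X, y, rcond=None)[0]
print(w0, y.mean())                   # both 4.0

# Number of feature subsets for 10 candidate features, including the empty subset:
print(2 ** 10)                        # 1024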

Recap: Evaluating on held out data

William L. Hamilton, McGill University and Mila 5

§ Partition your data into a training set, validation set, and test set.

§ The proportions in each set can vary.

§ Training set is used to fit a model (find the best hypothesis in the class).

§ Validation set is used for model selection, i.e., to estimate true error and compare hypothesis classes (e.g., compare different order polynomials).

§ Test set is what you report the final accuracy on.
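As a rough illustration of such a partition (the 70/15/15 proportions and the NumPy shuffling below are just one common choice, not something prescribed by the slides):

import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # number of examples (made up)
idx = rng.permutation(n)                   # shuffle before splitting

n_train, n_val = int(0.7 * n), int(0.15 * n)
train_idx = idx[:n_train]                  # used to fit the model
val_idx   = idx[n_train:n_train + n_val]   # used for model selection
test_idx  = idx[n_train + n_val:]          # used only to report final accuracy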

k-fold cross validation

§ Instead of just one validation set, we can evaluate on many splits!

§ Consider k partitions of the training/non-test data (usually of equal size).

§ Train with k-1 subsets, validate on kth subset. Repeat k times.

§ Average the prediction error over the k rounds/folds.

William L. Hamilton, McGill University and Mila 6

Source: http://stackoverflow.com/questions/31947183/how-to-implement-walk-forward-testing-in-sklearn

(increases computation time by a factor of k)
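A minimal sketch of this procedure in NumPy; the function name and the fit/evaluate callbacks are placeholders of my own, standing in for whatever model and error metric you are using:

import numpy as np

def k_fold_cv(X, y, fit, evaluate, k=5, seed=0):
    """Average validation error over k folds.
    fit(X_tr, y_tr) returns a model; evaluate(model, X_va, y_va) returns an error."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)          # k (roughly) equal partitions
    errors = []
    for i in range(k):
        val_idx = folds[i]                  # validate on the i-th fold
        tr_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[tr_idx], y[tr_idx])   # train on the remaining k-1 folds
        errors.append(evaluate(model, X[val_idx], y[val_idx]))
    return np.mean(errors)                  # average error over the k rounds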

Generalization: test vs. train error

William L. Hamilton, McGill University and Mila 7

[Embedded excerpt from Hastie et al., Figure 2.11: "Test and training error as a function of model complexity." Prediction error is plotted against model complexity for a training sample and a test sample: training error keeps decreasing as complexity grows, while test error is U-shaped, reflecting the bias-variance tradeoff (high bias / low variance at low complexity, low bias / high variance at high complexity).]

[From Hastie et al. textbook]

§ Overly simple model: high training error and high test error.

§ Overly complex model: low training error but high test error.

Prediction problems

William L. Hamilton, McGill University and Mila 8

§ Classification

§ E.g., predicting whether a treatment is successful vs. unsuccessful

§ Y is a finite discrete set (e.g., successful vs. unsuccessful treatment)

§ Regression

§ E.g., predicting the future size of a tumor

§ Y = ℝ (i.e., we are predicting a real number)

tumor size | texture | perimeter | shade          | outcome | size change
18.02      | rough   | 117.5     | 0 (very light) | Y       | -0.14
16.05      | smooth  | 112.2     | 4 (dark)       | Y       | -0.10
18.9       | smooth  | 102.3     | 1 (light)      | N       | +0.21

Classification problems

William L. Hamilton, McGill University and Mila

Given a data set D = <xi, yi>, i = 1:n, with discrete yi, find a hypothesis which "best fits" the data.

§ If yi ∈ {0,1} this is binary classification.

§ If yi can take more than two values, the problem is called multi-class classification.

tumor size | texture | perimeter | shade          | outcome | size change
18.02      | rough   | 117.5     | 0 (very light) | Y       | -0.14
16.05      | smooth  | 112.2     | 4 (dark)       | Y       | -0.10
18.9       | smooth  | 102.3     | 1 (light)      | N       | +0.21

9

Applications of classification

William L. Hamilton, McGill University and Mila

§ Text classification (spam filtering, sentiment analysis, etc.).

§ Image classification (face detection, object recognition, etc.).

§ Prediction of cancer recurrence.

§ Recommendation systems.

§ Many, many more!

10

A simple example

William L. Hamilton, McGill University and Mila

§ Given “nucleus size”, predict cancer recurrence.

§ Univariate input: X= nucleus size.

§ Binary output: Y = {NoRecurrence = 0; Recurrence = 1}

§ Try: Minimize the least-square error.

[Figure: two histograms of counts vs. nucleus size, one for the NoRecurrence class and one for the Recurrence class, together with the least-squares regression line fit to the 0/1 outcome (embedded from Joelle Pineau's COMP-598 / COMP-652 slides).]

11

Classification via linear regression?

William L. Hamilton, McGill University and Mila

§ Here the red line is: Y' = X (XTX)^-1 XT Y

§ How to get a binary output?

1. Threshold: {y <= t for NoRecurrence, y > t for Recurrence}

2. Interpret output as probability: y = Probability(Recurrence)
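A sketch of this recipe under option 1 (least-squares fit of the 0/1 targets, then a threshold; the helper name and the default threshold t = 0.5 are my own choices):

import numpy as np

def linreg_classify(X, y, X_new, t=0.5):
    # Add a bias column and solve the normal equations w = (X^T X)^-1 X^T y.
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    # Predict real-valued scores for new points, then threshold at t.
    scores = np.column_stack([np.ones(len(X_new)), X_new]) @ w
    return (scores > t).astype(int)   # 1 = Recurrence, 0 = NoRecurrence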

[Figure: least-squares regression line (red) fit to the 0/1 recurrence outcomes as a function of nucleus size (embedded from Joelle Pineau's COMP-598 / COMP-652 slides).]

12

Classification via linear regression?

William L. Hamilton, McGill University and Mila

§ Here the red line is: Y' = X (XTX)^-1 XT Y

§ How to get a binary output?

1. Threshold: {y <= t for NoRecurrence, y > t for Recurrence}

2. Interpret output as probability: y = Probability(Recurrence)

[Same figure as the previous slide: the linear regression fit to the 0/1 recurrence data.]

Not a great fit!

Can we find a better model?

13

High-level views of binary classification

William L. Hamilton, McGill University and Mila

§ Probabilistic

§ Goal: Estimate P(y|x), i.e. the conditional probability of the target variable given the feature data.

§ Focus of the next few lectures.

§ Decision boundaries

§ Goal: Partition the feature space into different regions, and classify points based on the region where they lie.

§ Focus of later lectures on decision trees and SVMs.

14

Approaches to binary classification

William L. Hamilton, McGill University and Mila

§ Two probabilistic approaches:

1. Discriminative learning: Directly estimate P(y|x).

2. Generative learning: Separately model P(x|y) and P(y). Use Bayes' rule to estimate P(y|x):

P(y=1|x) = P(x|y=1) P(y=1) / P(x)

15

Probabilistic view of discriminative learning

William L. Hamilton, McGill University and Mila

§ Suppose we have 2 classes: y ∈ {0, 1}

§ What is the probability of a given input x having class y = 1?

§ Consider Bayes' rule:

P(y=1|x) = P(x, y=1) / P(x)
         = P(x|y=1) P(y=1) / [ P(x|y=1) P(y=1) + P(x|y=0) P(y=0) ]
         = 1 / (1 + exp(-a))
         = σ(a)

where

a = ln [ P(x|y=1) P(y=1) / ( P(x|y=0) P(y=0) ) ] = ln [ P(y=1|x) / P(y=0|x) ]

(By Bayes' rule; P(x) on top and bottom cancels out.) Here σ is the sigmoid function, and a is the log-odds of the data being class 1 vs. class 0.


16

Probabilistic view of discriminative learning

William L. Hamilton, McGill University and Mila

§ Log-odds ratio: a = ln [ P(y=1|x) / P(y=0|x) ]    (How much more likely is y=1 compared to y=0?)

§ Logistic function: σ(a) = 1 / (1 + exp(-a))    (What is our predicted probability for y=1?)

17

Discriminative learning: Logistic regression

William L. Hamilton, McGill University and Mila

§ Idea: Directly model the log-odds with a linear function:

a = ln [ P(y=1|x) / P(y=0|x) ] = w0 + w1x1 + … + wmxm

(The log-odds, i.e. how much more likely y=1 is compared to y=0, is approximated by a linear function of the input features x.)

18

Discriminative learning: Logistic regression

William L. Hamilton, McGill University and Mila

§ Idea: Directly model the log-odds with a linear function:

a = ln [ P(y=1|x) / P(y=0|x) ] = w0 + w1x1 + … + wmxm

§ The decision boundary is the set of points for which a = 0.

§ The linear logistic function:

P(y=1|x) = σ(wTx) = 1 / (1 + e^(-wTx))
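These two formulas translate directly into code; the weight values below are made up for illustration, with w[0] playing the role of the bias w0:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, x):
    """P(y=1|x) = sigmoid(w^T x); x is assumed to include a leading 1 for the bias."""
    return sigmoid(w @ x)

w = np.array([-3.0, 0.2])            # made-up weights: bias w0, slope w1
x = np.array([1.0, 20.0])            # x = [1, nucleus size]
print(predict_proba(w, x))           # predicted probability of recurrence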

19

Learning the weights in logistic regression

§ Recall: σ(wTxi) is the probability that yi = 1 (given xi), and 1 - σ(wTxi) is the probability that yi = 0.

§ For y ∈ {0,1}, the likelihood function is:

William L. Hamilton, McGill University and Mila

P(yi|xi, w) = σ(wTxi)       if yi = 1
            = 1 - σ(wTxi)   if yi = 0

The likelihood of the data (the probability of the target data given the model parameters):

L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)
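A literal implementation of this product (assuming X already contains a leading column of ones for the bias; in practice the log of this quantity is optimized instead, as the next slides explain):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def likelihood(w, X, y):
    """L(D) = prod_i sigmoid(w^T x_i)^y_i * (1 - sigmoid(w^T x_i))^(1 - y_i)."""
    p = sigmoid(X @ w)                      # predicted P(y=1|x) for every row
    return np.prod(p ** y * (1 - p) ** (1 - y))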

20

Maximizing likelihood

§ Our goal is to maximize the likelihood!

§ In other words: we want to find the parameters that give the highest likelihood.

William L. Hamilton, McGill University and Mila

P(yi|xi, w) = σ(wTxi)       if yi = 1
            = 1 - σ(wTxi)   if yi = 0

The likelihood of the data (the probability of the target data given the model parameters):

L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)

21

Maximizing log-likelihood

William L. Hamilton, McGill University and Mila

Likelihood:

L(D) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)

Problem: Taking products of lots of small numbers is numerically unstable, making this function hard to optimize…

Log-likelihood (easier to optimize!):

l(D) = ln(L(D)) = ∑i=1:n yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi))

22

Maximizing likelihood vs. minimizing loss

William L. Hamilton, McGill University and Mila

§ Another view: The negative log-likelihood of the logistic function is known as the cross-entropy loss.

§ So maximizing the likelihood is the same as minimizing the cross-entropy loss.

cross-entropy(D) = - ∑i=1:n yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi))
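The same quantity in code; the small epsilon inside the logarithms is my own guard against log(0), not part of the slide's formula:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, y, eps=1e-12):
    """Negative log-likelihood of logistic regression (cross-entropy loss)."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))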

23

Maximizing likelihood vs. minimizing loss

William L. Hamilton, McGill University and Mila

§ Formal interpretation of the cross-entropy loss comes from information theory.

§ Basic idea: it measures how many bits of information we would need to correct the errors made by our model.

cross-entropy(D) = - ∑i=1:n yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi))

24

Maximizing likelihood vs. minimizing loss

William L. Hamilton, McGill University and Mila

§ There are probabilistic interpretations of various loss functions, and we can often view minimizing a loss as equivalent to maximizing likelihood.

§ E.g., we can even interpret the mean-squared loss in linear regression through a probabilistic lens.

25

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Assume that y = w0 + ∑j=1:m wj xj + ε, where ε ~ N(0, σ²) and ε is i.i.d. and independent of x.

§ Then we can compute the likelihood of a particular target value according to a Gaussian distribution:

P(yi|xi, w) = (1/√(2πσ²)) exp( -(yi - wTxi)² / (2σ²) )

26

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Assume that y = w0 + ∑j=1:m wj xj + ε, where ε ~ N(0, σ²) and ε is i.i.d. and independent of x.

§ Then we can compute the likelihood of a particular target value according to a Gaussian distribution:

P(yi|xi, w) = (1/√(2πσ²)) exp( -(yi - wTxi)² / (2σ²) )

Looks just like the squared error!

27

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Given the likelihood of an individual point:

§ We can then compute the log-likelihood of the whole dataset:

P(yi|xi, w) = (1/√(2πσ²)) exp( -(yi - wTxi)² / (2σ²) )

l(D) = ∑i=1:n [ -ln(√(2πσ²)) - (yi - wTxi)² / (2σ²) ]

28

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Given the likelihood of an individual point:

§ We can then compute the log-likelihood of the whole dataset:

P(yi|xi, w) = (1/√(2πσ²)) exp( -(yi - wTxi)² / (2σ²) )

l(D) = ∑i=1:n [ -ln(√(2πσ²)) - (yi - wTxi)² / (2σ²) ]

These terms are constants, so maximizing this likelihood is equivalent to minimizing the squared loss!
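A quick numerical check of this claim on synthetic data: for any two weight vectors, the difference in Gaussian log-likelihood equals minus the difference in squared loss scaled by 1/(2σ²), so both objectives rank models identically (the data-generating choices below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])   # bias column plus m features
w_true = rng.normal(size=m + 1)
sigma = 0.5
y = X @ w_true + rng.normal(scale=sigma, size=n)             # y = w^T x + Gaussian noise

def gaussian_log_lik(w):
    resid = y - X @ w
    return np.sum(-np.log(np.sqrt(2 * np.pi * sigma**2)) - resid**2 / (2 * sigma**2))

def squared_loss(w):
    return np.sum((y - X @ w) ** 2)

# The log-likelihood and the (negative, scaled) squared loss differ only by a constant,
# so comparing any two candidate weight vectors gives the same answer either way.
w1, w2 = rng.normal(size=m + 1), rng.normal(size=m + 1)
print(gaussian_log_lik(w1) - gaussian_log_lik(w2))
print(-(squared_loss(w1) - squared_loss(w2)) / (2 * sigma**2))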

29

Recap: likelihoods and losses

William L. Hamilton, McGill University and Mila

§ Under certain assumptions, many loss functions have probabilistic interpretations.

§ The cross-entropy loss = maximum likelihood for logistic regression.

§ The squared loss = maximum likelihood for linear regression.

§ Assuming i.i.d. normally distributed errors!

30

Not all losses are created equal

William L. Hamilton, McGill University and Mila

§ We can come up with all kinds of losses:

§ Absolute error loss (for regression): L(y, fw(X)) = ∑i=1:n |yi - wTxi|

§ 0-1 loss (for classification): L(y, fw(X)) = ∑i=1:n I(yi ≠ fw(xi))

§ … but these losses are not always easy to optimize (e.g., not differentiable).

§ … and these losses are often not theoretically grounded.
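Both losses are one-liners (assuming, as in the slides, a linear predictor wTx for the regression case and already-computed class predictions for the 0-1 case):

import numpy as np

def absolute_error(w, X, y):
    """Sum of |y_i - w^T x_i| (regression)."""
    return np.sum(np.abs(y - X @ w))

def zero_one_loss(y_true, y_pred):
    """Number of misclassified points: sum of I(y_i != f_w(x_i))."""
    return np.sum(y_true != y_pred)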

31

Losses are different from error metrics

William L. Hamilton, McGill University and Mila

§ Problem: The cross-entropy loss may be theoretically grounded, but it is not very interpretable…

§ Solution: Train models using theoretically grounded loss functions but evaluate using interpretable measures.

§ E.g., for linear classification:

§ Train using cross-entropy.

§ Evaluate using accuracy (i.e., % correct predictions).

§ More evaluation functions to come in lecture 6!

32
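For instance, after fitting logistic regression with the cross-entropy loss, the reported number would be accuracy on held-out data; a sketch, where w is whatever weights training produced and the 0.5 threshold is the usual default:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def accuracy(w, X, y, threshold=0.5):
    """Fraction of correct predictions: predict 1 when P(y=1|x) > threshold."""
    y_pred = (sigmoid(X @ w) > threshold).astype(int)
    return np.mean(y_pred == y)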

Back to logistic regression

§ Recall: σ(wTxi) is the probability that yi = 1 (given xi), and 1 - σ(wTxi) is the probability that yi = 0.

§ For y ∈ {0,1}, the likelihood function is:

William L. Hamilton, McGill University and Mila

P(yi|xi, w) = σ(wTxi)       if yi = 1
            = 1 - σ(wTxi)   if yi = 0

The likelihood of the data (the probability of the target data given the model parameters):

L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)

33

Back to logistic regression: likelihood

§ Our goal is to maximize the likelihood!

§ In other words: we want to find the parameters that give the highest likelihood.

William L. Hamilton, McGill University and Mila

P(yi|xi, w) = σ(wTxi)       if yi = 1
            = 1 - σ(wTxi)   if yi = 0

The likelihood of the data (the probability of the target data given the model parameters):

L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)

34

Back to logistic regression: log-likelihood

William L. Hamilton, McGill University and Mila

Likelihood:

L(D) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)

Problem: Taking products of lots of small numbers is numerically unstable, making this function hard to optimize…

Log-likelihood (easier to optimize!):

l(D) = ln(L(D)) = ∑i=1:n yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi))

35

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = - ∑i=1:n [ yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi)) ]

§ Take the derivative:

∂Err(w)/∂w = - [ ∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + … ]

(Chain rule step: ∂log(σ)/∂σ = 1/σ)

36

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = - ∑i=1:n [ yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi)) ]

§ Take the derivative:

∂Err(w)/∂w = - [ ∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + … ]

(Chain rule step: ∂σ/∂(wTx) = σ(1-σ))

37

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = - ∑i=1:n [ yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi)) ]

§ Take the derivative:

∂Err(w)/∂w = - [ ∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + … ]

(Chain rule step: ∂(wTx)/∂w = x)

38

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = - ∑i=1:n [ yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi)) ]

§ Take the derivative:

∂Err(w)/∂w = - [ ∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + (1-yi) (1/(1-σ(wTxi))) (1-σ(wTxi)) σ(wTxi) (-1) xi ]

(Chain rule step: ∂(1-σ)/∂(wTx) = (1-σ) σ (-1))

39

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = - ∑i=1:n [ yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi)) ]

§ Take the derivative:

∂Err(w)/∂w = - [ ∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + (1-yi) (1/(1-σ(wTxi))) (1-σ(wTxi)) σ(wTxi) (-1) xi ]
           = - ∑i=1:n xi (yi (1-σ(wTxi)) - (1-yi) σ(wTxi))
           = - ∑i=1:n xi (yi - σ(wTxi))

§ Update rule: wk+1 = wk + αk ∑i=1:n xi (yi - σ(wkTxi))
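A minimal implementation of this update rule (batch gradient ascent on the log-likelihood; the fixed step size and iteration count below are arbitrary choices of mine, and X is assumed to include a column of ones for the bias):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.01, n_steps=1000):
    """Repeatedly apply w <- w + lr * sum_i x_i (y_i - sigmoid(w^T x_i))."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w + lr * X.T @ (y - sigmoid(X @ w))   # gradient of the log-likelihood
    return w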

40

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Update rule: wk+1 =wk +αk∑i=1:n xi(yi – σ(wkTxi))

§ Intuition:

§ If we give a low probability to a positive point (i.e., yi = 1), then we should increase the parameter weights for the features strongly associated with that point.

§ If we give a high probability to a negative point (i.e., yi = 0), then we should decrease the parameter weights for the features strongly associated with that point.

41

Approaches to binary classification

William L. Hamilton, McGill University and Mila

§ Two probabilistic approaches:

1. Discriminative learning: Directly estimate P(y|x).

2. Generative learning: Separately model P(x|y) and P(y). Use Bayes' rule to estimate P(y|x):

P(y=1|x) = P(x|y=1) P(y=1) / P(x)

42

Approaches to binary classification

William L. Hamilton, McGill University and Mila

§ Two probabilistic approaches:

1. Discriminative learning: Directly estimate P(y|x). (Today)

2. Generative learning: Separately model P(x|y) and P(y). Use Bayes' rule to estimate P(y|x): (Next lecture)

P(y=1|x) = P(x|y=1) P(y=1) / P(x)

43

What you should know

William L. Hamilton, McGill University and Mila 44

§ Basic definition of linear classification problem.

§ Derivation of logistic regression.

§ The relationship between maximum likelihood and loss functions.

§ The difference between loss functions and error metrics.

Final notes

William L. Hamilton, McGill University and Mila 45

§ Get started on MiniProject 1!

§ The midterm is November 18th from 6-8pm. Contact the course staff ASAP if you know you cannot make this day!
