Page 1:

COMP 551 - Applied Machine Learning
Lecture 4 --- Linear Classification
William L. Hamilton (with slides and content from Joelle Pineau)

* Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor’s written permission.

William L. Hamilton, McGill University and Mila 1

Page 2:

MiniProject 1 is out!

William L. Hamilton, McGill University and Mila 2

§ Due September 28th at 11:59pm. The details are at: https://www.cs.mcgill.ca/~wlh/comp551/files/miniproject1_spec.pdf

§ Basic idea – “Machine Learning 101”:
  § Implement two linear classification algorithms (from this lecture and next lecture).

§ Run linear classification on two different datasets.

§ Compare different models, settings, and features.

§ (Semi-)open-ended write-up.

§ Completed in groups of 3! You can now register your group on MyCourses.

§ If you don’t have a group yet, find one quick! You can use the discussion board on MyCourses to search for potential group members.

Page 3:

Self-assessment / practice quizzes

William L. Hamilton, McGill University and Mila 3

§ Quiz 0 – Attempt 1:

§ Around 250 students completed it.

§ Roughly 70% average. A couple questions were tricky, but 80%+ is where you ideally should be.

§ Quiz 0 – Attempt 2:

§ Around 180 students completed it.

§ Average went up to 80%.

§ Probability questions seemed to be the hardest.

Page 4:

Quiz 0, Attempt 2, Question 5

William L. Hamilton, McGill University and Mila 4

§ The correct answer was 1024, since there are 10 binary choices of what features to include (i.e., 2^10 = 1024).

§ But what about the subset where no features are included? Should we include this as an option? Yes!

§ Training a model with no features means we only learn the bias term (i.e., f_w(x) = w0), which is equivalent to just predicting the average value for the target (and sometimes this is the best model we can find)!
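As a quick sanity check on this count, here is a small Python sketch (the feature names are hypothetical placeholders, not from the course materials) that enumerates every subset of 10 features, including the empty, bias-only subset, and confirms there are 2^10 = 1024 of them:

```python
from itertools import combinations

features = [f"x{j}" for j in range(10)]  # 10 hypothetical feature names

# Enumerate every subset of the features, from the empty set up to all 10.
subsets = [subset
           for k in range(len(features) + 1)
           for subset in combinations(features, k)]

print(len(subsets))  # 1024 = 2**10, including the empty (bias-only) subset
```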

Page 5:

Recap: Evaluating on held out data

William L. Hamilton, McGill University and Mila 5

§ Partition your data into a training set, validation set, and test set.

§ The proportions in each set can vary.

§ Training set is used to fit a model (find the best hypothesis in the class).

§ Validation set is used for model selection, i.e., to estimate true error and compare hypothesis classes. (E.g., compare different order polynomials.)

§ Test set is what you report the final accuracy on.
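A minimal NumPy sketch of such a split (the 70/15/15 proportions and the helper name are illustrative choices, not values fixed by the course):

```python
import numpy as np

def train_valid_test_split(X, y, valid_frac=0.15, test_frac=0.15, seed=0):
    """Randomly partition (X, y) into training, validation, and test sets."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                      # shuffle the indices once
    n_test = int(n * test_frac)
    n_valid = int(n * valid_frac)
    test_idx = idx[:n_test]
    valid_idx = idx[n_test:n_test + n_valid]
    train_idx = idx[n_test + n_valid:]            # everything else is training data
    return ((X[train_idx], y[train_idx]),
            (X[valid_idx], y[valid_idx]),
            (X[test_idx], y[test_idx]))
```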

Page 6:

k-fold cross validation

§ Instead of just one validation set, we can evaluate on many splits!

§ Consider k partitions of the training/non-test data (usually of equal size).

§ Train with k-1 subsets, validate on kth subset. Repeat k times.

§ Average the prediction error over the k rounds/folds.

William L. Hamilton, McGill University and Mila 6

Source: http://stackoverflow.com/questions/31947183/how-to-implement-walk-forward-testing-in-sklearn

(increases computation time by a factor of k)
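A minimal NumPy sketch of this procedure, assuming the caller supplies generic `fit(X, y)` and `error(model, X, y)` functions (these names are placeholders, not part of any course code):

```python
import numpy as np

def k_fold_cv_error(X, y, fit, error, k=5, seed=0):
    """Average validation error over k folds of the training/non-test data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), k)      # k roughly equal index sets
    errors = []
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])                  # train on the other k-1 folds
        errors.append(error(model, X[valid_idx], y[valid_idx]))  # validate on the held-out fold
    return np.mean(errors)                                       # average over the k rounds
```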

Page 7:

Generalization: test vs. train error

William L. Hamilton, McGill University and Mila 7

[Figure 2.11 from Hastie et al.: test and training error as a function of model complexity. Prediction error is on the y-axis and model complexity on the x-axis (low to high); the training-sample curve keeps decreasing with complexity, while the test-sample curve is U-shaped. Low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]

Accompanying textbook text: as k grows, the neighbors are further away, and then anything can happen. The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias-variance tradeoff. More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of the test error is the training error (1/N) ∑i (yi − ŷi)². Unfortunately, training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However, with too much fitting, the model adapts itself too closely to the training data and will not generalize well (i.e., it will have large test error); in that case the predictions will have large variance. In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization.

[From Hastie et al. textbook]

§ Overly simple model: high training error and high test error.
§ Overly complex model: low training error but high test error.

Page 8:

Prediction problems

William L. Hamilton, McGill University and Mila 8

§ Classification

§ E.g., predicting whether a treatment is successful vs. unsuccessful

§ Y is a finite discrete set (e.g., successful vs. unsuccessful treatment)

§ Regression

§ E.g., predicting the future size of a tumor

§ Y = ℝ (i.e., we are predicting a real number)

tumor size | texture | perimeter | shade          | outcome | size change
18.02      | rough   | 117.5     | 0 (very light) | Y       | -0.14
16.05      | smooth  | 112.2     | 4 (dark)       | Y       | -0.10
18.9       | smooth  | 102.3     | 1 (light)      | N       | +0.21

Page 9:

Classification problems

William L. Hamilton, McGill University and Mila

Given a data set D = <xi, yi>, i = 1:n, with discrete yi, find a hypothesis which “best fits” the data.

§ If yi ∈ {0,1}, this is binary classification.

§ If yi can take more than two values, the problem is called multi-class classification.

tumor size | texture | perimeter | shade          | outcome | size change
18.02      | rough   | 117.5     | 0 (very light) | Y       | -0.14
16.05      | smooth  | 112.2     | 4 (dark)       | Y       | -0.10
18.9       | smooth  | 102.3     | 1 (light)      | N       | +0.21

9

Page 10:

Applications of classification

William L. Hamilton, McGill University and Mila

§ Text classification (spam filtering, sentiment analysis, etc.).

§ Image classification (face detection, object recognition, etc.).

§ Prediction of cancer recurrence.

§ Recommendation systems.

§ Many, many more!

10

Page 11:

A simple example

William L. Hamilton, McGill University and Mila

§ Given “nucleus size”, predict cancer recurrence.

§ Univariate input: X= nucleus size.

§ Binary output: Y = {NoRecurrence = 0; Recurrence = 1}

§ Try: Minimize the least-square error.

[Figure: histograms of nucleus size for the NoRecurrence and Recurrence classes (nucleus size on the x-axis, count on the y-axis), with the least-squares regression line fit to the outputs coded as non-recurrence = 0, recurrence = 1. The sum of squared errors is minimized by the red line.]

11

Page 12:

Classification via linear regression?

William L. Hamilton, McGill University and Mila

§ Here the red line is: Y' = X (XᵀX)⁻¹ Xᵀ Y
§ How to get a binary output?

1. Threshold: {y <= t for NoRecurrence, y > t for Recurrence}

2. Interpret output as probability: y = Probability(Recurrence)

[Figure: the cancer recurrence data (non-recurrence and recurrence counts vs. nucleus size) with the least-squares regression line, where non-recurrence is coded as 0 and recurrence as 1.]

12

Page 13:

Classification via linear regression?

William L. Hamilton, McGill University and Mila

§ Here the red line is: Y' = X (XᵀX)⁻¹ Xᵀ Y
§ How to get a binary output?

1. Threshold: {y <= t for NoRecurrence, y > t for Recurrence}

2. Interpret output as probability: y = Probability(Recurrence)

[Figure: the cancer recurrence data (non-recurrence and recurrence counts vs. nucleus size) with the least-squares regression line, where non-recurrence is coded as 0 and recurrence as 1.]

Not a great fit!

Can we find a better model?

13
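A minimal NumPy sketch of this baseline, i.e., a least-squares fit to the 0/1-coded targets followed by a threshold (the threshold value t = 0.5 is an illustrative choice; the slides leave t unspecified):

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least-squares fit w = (X^T X)^{-1} X^T y, with a bias column prepended."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # add a column of 1s for w0
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)     # solve the normal equations

def predict_class(w, X, t=0.5):
    """Threshold the real-valued regression output to get a 0/1 label."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    y_hat = Xb @ w                                  # real-valued prediction y' = x^T w
    return (y_hat > t).astype(int)                  # NoRecurrence = 0, Recurrence = 1
```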

Page 14:

High-level views of binary classification

William L. Hamilton, McGill University and Mila

§ Probabilistic
  § Goal: Estimate P(y | x), i.e., the conditional probability of the target variable given the feature data.
  § Focus of the next few lectures.

§ Decision boundaries
  § Goal: Partition the feature space into different regions, and classify points based on the region where they lie.
  § Focus of later lectures on decision trees and SVMs.

14

Page 15:

Approaches to binary classification

William L. Hamilton, McGill University and Mila

§ Two probabilistic approaches:

1. Discriminative learning: Directly estimate P(y|x).

2. Generative learning: Separately model P(x|y) and P(y). Use Bayes’ rule to estimate P(y|x):

   P(y=1|x) = P(x|y=1) P(y=1) / P(x)

15

Page 16:

Probabilistic view of discriminative learning

William L. Hamilton, McGill University and Mila

§ Suppose we have 2 classes: y ∈ {0,1}.
§ What is the probability of a given input x having class y = 1?
§ Consider Bayes’ rule:

   P(y=1|x) = P(x, y=1) / P(x)
            = P(x|y=1) P(y=1) / [ P(x|y=1) P(y=1) + P(x|y=0) P(y=0) ]
            = 1 / ( 1 + P(x|y=0) P(y=0) / ( P(x|y=1) P(y=1) ) )
            = 1 / ( 1 + exp( ln( P(x|y=0) P(y=0) / ( P(x|y=1) P(y=1) ) ) ) )
            = 1 / ( 1 + exp(−a) ) = σ(a)

   where a = ln [ P(x|y=1) P(y=1) / ( P(x|y=0) P(y=0) ) ] = ln [ P(y=1|x) / P(y=0|x) ]

   (By Bayes’ rule; P(x) on top and bottom cancels out.)

§ Here σ has a special form, called the sigmoid function, and a is the log-odds of the data being class 1 vs. class 0.

16

Page 17:

Probabilistic view of discriminative learning

William L. Hamilton, McGill University and Mila

§ Log-odds ratio (how much more likely is y=1 compared to y=0?):

   a = ln [ P(y=1|x) / P(y=0|x) ]

§ Logistic function (what is our predicted probability for y=1?):

   σ(a) = 1 / (1 + exp(−a))

17

Page 18:

Discriminative learning: Logistic regression

William L. Hamilton, McGill University and Mila

§ Idea: Directly model the log-odds with a linear function:

   a = ln [ P(y=1|x) / P(y=0|x) ] = w0 + w1 x1 + … + wm xm

   (The log-odds, "how much more likely is y=1 compared to y=0?", is approximated by a linear function of the input features x.)

18

Page 19:

Discriminative learning: Logistic regression

William L. Hamilton, McGill University and Mila

§ Idea: Directly model the log-odds with a linear function:

   a = ln [ P(y=1|x) / P(y=0|x) ] = w0 + w1 x1 + … + wm xm

§ The decision boundary is the set of points for which a = 0.
§ The linear logistic function:

   P(y=1|x) = σ(wᵀx) = 1 / (1 + exp(−wᵀx))

19
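A minimal NumPy sketch of these two pieces, illustrating that the decision boundary wᵀx = 0 corresponds to a predicted probability of exactly 0.5 (the example log-odds values are arbitrary):

```python
import numpy as np

def sigmoid(a):
    """Logistic (sigmoid) function: maps a log-odds value a to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, X):
    """P(y=1|x) = sigma(w^T x); assumes X already contains a bias column of 1s."""
    return sigmoid(X @ w)

print(sigmoid(0.0))   # 0.5   -> points exactly on the decision boundary (w^T x = 0)
print(sigmoid(3.0))   # ~0.95 -> log-odds of +3 strongly favours class 1
print(sigmoid(-3.0))  # ~0.05 -> log-odds of -3 strongly favours class 0
```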

Page 20:

Learning the weights in logistic regression

§ Recall: σ(wᵀxi) is the probability that yi = 1 (given xi), and 1 − σ(wᵀxi) is the probability that yi = 0.
§ For y ∈ {0,1}, the likelihood function is

   P(yi|xi, w) = σ(wᵀxi)        if yi = 1
               = 1 − σ(wᵀxi)    if yi = 0

   (the probability of the target data given the model parameters)

   The likelihood of the data:

   L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wᵀxi)^yi (1 − σ(wᵀxi))^(1−yi)

William L. Hamilton, McGill University and Mila

20

Page 21:

Maximizing likelihood

§ Our goal is to maximize the likelihood!

§ In other words: we want to find the parameters that give the highest likelihood.

William L. Hamilton, McGill University and Mila

P(yi|xi, w) = σ(wᵀxi)        if yi = 1
            = 1 − σ(wᵀxi)    if yi = 0

(the probability of the target data given the model parameters)

The likelihood of the data:

L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wᵀxi)^yi (1 − σ(wᵀxi))^(1−yi)

21

Page 22:

Maximizing log-likelihood

William L. Hamilton, McGill University and Mila

Likelihood:

   L(D) = ∏i=1:n σ(wᵀxi)^yi (1 − σ(wᵀxi))^(1−yi)

   Problem: Taking products of lots of small numbers is numerically unstable, making this function hard to optimize…

Log-likelihood (easier to optimize!):

   l(D) = ln(L(D)) = ∑i=1:n yi ln(σ(wᵀxi)) + (1 − yi) ln(1 − σ(wᵀxi))

22
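A small NumPy illustration of why the log matters numerically (the probabilities and labels below are synthetic, generated only to show the underflow):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
p = rng.uniform(0.05, 0.95, size=n)   # stand-ins for sigma(w^T x_i) at n points
y = rng.integers(0, 2, size=n)        # stand-ins for the 0/1 labels

# Likelihood: a product of n numbers in (0, 1) underflows to 0 for large n.
likelihood = np.prod(np.where(y == 1, p, 1 - p))
print(likelihood)                     # 0.0 (floating-point underflow)

# Log-likelihood: a sum of n moderate terms, numerically stable.
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood)                 # a finite negative number
```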

Page 23:

Maximizing likelihood vs. minimizing loss

William L. Hamilton, McGill University and Mila

§ Another view: The negative log-likelihood of the logistic function is known as the cross-entropy loss.

§ So maximizing the likelihood is the same as minimizing the cross-entropy loss.

   cross-entropy(D) = − ∑i=1:n yi ln(σ(wᵀxi)) + (1 − yi) ln(1 − σ(wᵀxi))

23

Page 24:

Maximizing likelihood vs. minimizing loss

William L. Hamilton, McGill University and Mila

§ Formal interpretation of the cross-entropy loss comes from information theory.

§ Basic idea: it measures how many bits of information we would need to correct the errors made by our model.

   cross-entropy(D) = − ∑i=1:n yi ln(σ(wᵀxi)) + (1 − yi) ln(1 − σ(wᵀxi))

24

Page 25:

Maximizing likelihood vs. minimizing loss

William L. Hamilton, McGill University and Mila

§ There are probabilistic interpretations of various loss functions, and we can often view minimizing a loss as equivalent to maximizing likelihood.

§ E.g., we can even interpret the mean-squared loss in linear regression through a probabilistic lens.

25

Page 26:

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Assume that y = w0 + ∑j=1:m wj xj + ε, where ε ~ N(0, σ²), and ε is i.i.d. and independent of x.

§ Then we can compute the likelihood of a particular target value according to a Gaussian distribution:

   P(yi|xi, w) = (1/√(2πσ²)) exp( −(yi − wᵀxi)² / (2σ²) )

26

Page 27:

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Assume that y = w0 + ∑j=1:m wj xj + ε, where ε ~ N(0, σ²), and ε is i.i.d. and independent of x.

§ Then we can compute the likelihood of a particular target value according to a Gaussian distribution:

   P(yi|xi, w) = (1/√(2πσ²)) exp( −(yi − wᵀxi)² / (2σ²) )

   Looks just like the squared error!

27

Page 28:

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Given the likelihood of an individual point:

   P(yi|xi, w) = (1/√(2πσ²)) exp( −(yi − wᵀxi)² / (2σ²) )

§ We can then compute the log-likelihood of the whole dataset:

   l(D) = ∑i=1:n [ −ln(√(2πσ²)) − (yi − wᵀxi)² / (2σ²) ]

28

Page 29:

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Given the likelihood of an individual point:

   P(yi|xi, w) = (1/√(2πσ²)) exp( −(yi − wᵀxi)² / (2σ²) )

§ We can then compute the log-likelihood of the whole dataset:

   l(D) = ∑i=1:n [ −ln(√(2πσ²)) − (yi − wᵀxi)² / (2σ²) ]

These terms are constants, so maximizing this likelihood is equivalent to minimizing the squared loss!

29
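A small NumPy check of this equivalence under the slide's assumptions (the synthetic data, the fixed σ² = 1, and the specific weights are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 3))])  # bias column + 3 features
w_true = np.array([0.5, 1.0, -2.0, 0.3])
y = X @ w_true + rng.normal(scale=1.0, size=n)             # y = w^T x + Gaussian noise

sigma2 = 1.0

def gaussian_log_likelihood(w):
    resid = y - X @ w
    return np.sum(-np.log(np.sqrt(2 * np.pi * sigma2)) - resid**2 / (2 * sigma2))

def squared_loss(w):
    return np.sum((y - X @ w) ** 2)

# The log-likelihood equals a constant minus the squared loss scaled by 1/(2*sigma2),
# so the w that maximizes one also minimizes the other.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)                   # least-squares solution
print(gaussian_log_likelihood(w_ls) >= gaussian_log_likelihood(w_true))  # True
print(squared_loss(w_ls) <= squared_loss(w_true))                        # True
```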

Page 30:

Recap: likelihoods and losses

William L. Hamilton, McGill University and Mila

§ Under certain assumptions, many loss functions have probabilistic interpretations.

§ Minimizing the cross-entropy loss = maximizing likelihood for logistic regression.

§ Minimizing the squared loss = maximizing likelihood for linear regression.
  § Assuming i.i.d. normally distributed errors!

30

Page 31:

Not all losses are created equal

William L. Hamilton, McGill University and Mila

§ We can come up with all kinds of losses:

  § Absolute error loss (for regression): L(y, fw(X)) = ∑i=1:n |yi − wᵀxi|

  § 0-1 loss (for classification): L(y, fw(X)) = ∑i=1:n I(yi ≠ fw(xi))

§ … but these losses are not always easy to optimize (e.g., not differentiable).

§ … and these losses are often not theoretically grounded.

31
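A minimal NumPy sketch of these two losses (here fw(x) for the 0-1 loss is taken to be a thresholded linear score, an illustrative choice rather than anything prescribed by the slides):

```python
import numpy as np

def absolute_error_loss(w, X, y):
    """Sum of absolute residuals |y_i - w^T x_i| (for regression); not differentiable at 0."""
    return np.sum(np.abs(y - X @ w))

def zero_one_loss(w, X, y, t=0.5):
    """Count of misclassified points, sum of I(y_i != f_w(x_i)); piecewise constant in w."""
    y_pred = (X @ w > t).astype(int)
    return np.sum(y_pred != y)
```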

Page 32:

Losses are different from error metrics

William L. Hamilton, McGill University and Mila

§ Problem: The cross-entropy loss may be theoretically grounded, but it is not very interpretable…

§ Solution: Train models using theoretically grounded loss functions but evaluate using interpretable measures.

§ E.g., for linear classification:
  § Train using cross-entropy.
  § Evaluate using accuracy (i.e., % correct predictions).

§ More evaluation functions to come in lecture 6!

32
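A minimal NumPy sketch of this split between the training objective and the reported metric (the function names are placeholders):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_loss(w, X, y):
    """Training objective: cross-entropy, with clipping to avoid log(0)."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(w, X, y):
    """Evaluation metric: fraction of correct 0/1 predictions (interpretable, but not what we optimize)."""
    y_pred = (sigmoid(X @ w) > 0.5).astype(int)
    return np.mean(y_pred == y)
```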

Page 33:

Back to logistic regression

§ Recall: σ(wᵀxi) is the probability that yi = 1 (given xi), and 1 − σ(wᵀxi) is the probability that yi = 0.
§ For y ∈ {0,1}, the likelihood function is

   P(yi|xi, w) = σ(wᵀxi)        if yi = 1
               = 1 − σ(wᵀxi)    if yi = 0

   (the probability of the target data given the model parameters)

   The likelihood of the data:

   L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wᵀxi)^yi (1 − σ(wᵀxi))^(1−yi)

William L. Hamilton, McGill University and Mila

33

Page 34:

Back to logistic regression: likelihood

§ Our goal is to maximize the likelihood!

§ In other words: we want to find the parameters that give the highest likelihood.

William L. Hamilton, McGill University and Mila

P(yi|xi, w) = σ(wᵀxi)        if yi = 1
            = 1 − σ(wᵀxi)    if yi = 0

(the probability of the target data given the model parameters)

The likelihood of the data:

L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wᵀxi)^yi (1 − σ(wᵀxi))^(1−yi)

34

Page 35:

Back to logistic regression: log-likelihood

William L. Hamilton, McGill University and Mila

Likelihood:

   L(D) = ∏i=1:n σ(wᵀxi)^yi (1 − σ(wᵀxi))^(1−yi)

   Problem: Taking products of lots of small numbers is numerically unstable, making this function hard to optimize…

Log-likelihood (easier to optimize!):

   l(D) = ln(L(D)) = ∑i=1:n yi ln(σ(wᵀxi)) + (1 − yi) ln(1 − σ(wᵀxi))

35

Page 36:

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = − [∑i=1:n yi log(σ(wᵀxi)) + (1 − yi) log(1 − σ(wᵀxi))]

§ Take the derivative:

   ∂Err(w)/∂w = − [∑i=1:n yi (1/σ(wᵀxi)) (1 − σ(wᵀxi)) σ(wᵀxi) xi + …

   (using d ln(σ)/dσ = 1/σ)

36

Page 37:

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = − [∑i=1:n yi log(σ(wᵀxi)) + (1 − yi) log(1 − σ(wᵀxi))]

§ Take the derivative:

   ∂Err(w)/∂w = − [∑i=1:n yi (1/σ(wᵀxi)) (1 − σ(wᵀxi)) σ(wᵀxi) xi + …

   (using dσ(z)/dz = σ(z)(1 − σ(z)) for z = wᵀxi)

37

Page 38:

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = − [∑i=1:n yi log(σ(wᵀxi)) + (1 − yi) log(1 − σ(wᵀxi))]

§ Take the derivative:

   ∂Err(w)/∂w = − [∑i=1:n yi (1/σ(wᵀxi)) (1 − σ(wᵀxi)) σ(wᵀxi) xi + …

   (using ∂(wᵀxi)/∂w = xi)

38

Page 39:

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = − [∑i=1:n yi log(σ(wᵀxi)) + (1 − yi) log(1 − σ(wᵀxi))]

§ Take the derivative:

   ∂Err(w)/∂w = − [∑i=1:n yi (1/σ(wᵀxi)) (1 − σ(wᵀxi)) σ(wᵀxi) xi
                         + (1 − yi) (1/(1 − σ(wᵀxi))) (1 − σ(wᵀxi)) σ(wᵀxi) (−1) xi]

   (using d(1 − σ(z))/dz = −σ(z)(1 − σ(z)))

39

Page 40:

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = − [∑i=1:n yi log(σ(wᵀxi)) + (1 − yi) log(1 − σ(wᵀxi))]

§ Take the derivative:

   ∂Err(w)/∂w = − [∑i=1:n yi (1/σ(wᵀxi)) (1 − σ(wᵀxi)) σ(wᵀxi) xi
                         + (1 − yi) (1/(1 − σ(wᵀxi))) (1 − σ(wᵀxi)) σ(wᵀxi) (−1) xi]
               = − ∑i=1:n xi (yi (1 − σ(wᵀxi)) − (1 − yi) σ(wᵀxi))
               = − ∑i=1:n xi (yi − σ(wᵀxi))

§ Update rule: wk+1 = wk + αk ∑i=1:n xi (yi − σ(wkᵀxi))

40
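Putting the update rule into a small batch gradient descent loop in NumPy (the step size, iteration count, and stopping tolerance are illustrative choices, not values given in the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(X, y, alpha=0.01, n_iters=1000, tol=1e-6):
    """Batch gradient descent on the cross-entropy loss.

    X is n x (m+1) with a leading bias column of 1s; y is a 0/1 vector.
    Implements w_{k+1} = w_k + alpha * sum_i x_i (y_i - sigma(w_k^T x_i)).
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)               # predicted P(y=1|x) for every point
        grad = -X.T @ (y - p)            # gradient of the cross-entropy loss
        w = w - alpha * grad             # i.e., w += alpha * X.T @ (y - p)
        if np.linalg.norm(grad) < tol:   # stop once the gradient is (numerically) zero
            break
    return w
```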

Page 41:

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Update rule: wk+1 = wk + αk ∑i=1:n xi (yi − σ(wkᵀxi))

§ Intuition:

§ If we give a low probability to a positive point (i.e., yi = 1), then we should increase the parameter weights for the features strongly associated with that point.

§ If we give a high probability to a negative point (i.e., yi = 0), then we should decrease the parameter weights for the features strongly associated with that point.

41

Page 42:

Approaches to binary classification

William L. Hamilton, McGill University and Mila

§ Two probabilistic approaches:

1. Discriminative learning: Directly estimate P(y|x).

2. Generative learning: Separately model P(x|y) and P(y). Use Bayes’ rule to estimate P(y|x):

   P(y=1|x) = P(x|y=1) P(y=1) / P(x)

42

Page 43:

Approaches to binary classification

William L. Hamilton, McGill University and Mila

§ Two probabilistic approaches:

1. Discriminative learning: Directly estimate P(y|x).

2. Generative learning: Separately model P(x|y) and P(y). Use Bayes’ rule to estimate P(y|x):

   P(y=1|x) = P(x|y=1) P(y=1) / P(x)

(Discriminative learning: today. Generative learning: next lecture.)

43

Page 44:

What you should know

William L. Hamilton, McGill University and Mila 44

§ Basic definition of linear classification problem.

§ Derivation of logistic regression.

§ The relationship between maximum likelihood and loss functions.

§ The difference between loss functions and error metrics.

Page 45:

Final notes

William L. Hamilton, McGill University and Mila 45

§ Get started on MiniProject 1!

§ The midterm is November 18th from 6-8pm. Contact the course staff ASAP if you know you cannot make this day!