
COMP 551 - Applied Machine Learning
Lecture 4 --- Linear Classification
William L. Hamilton
(with slides and content from Joelle Pineau)

* Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor's written permission.

William L. Hamilton, McGill University and Mila 1

MiniProject 1 is out!

William L. Hamilton, McGill University and Mila 2

§ Due September 28th at 11:59pm. The details are at: https://www.cs.mcgill.ca/~wlh/comp551/files/miniproject1_spec.pdf

§ Basic idea – “Machine Learning 101”: § Implement two linear classification algorithms (from this lecture and next lecture)

§ Run linear classification on two different datasets.

§ Compare different models, settings, and features.

§ (Semi-)open-ended write-up.

§ Completed in groups of 3! You can now register your group on MyCourses.

§ If you don’t have a group yet, find one quick! You can use the discussion board on MyCourses to search for potential group members.

Self-assessment / practice quizzes

William L. Hamilton, McGill University and Mila 3

§ Quiz 0 – Attempt 1:

§ Around 250 students completed it.

§ Roughly 70% average. A couple questions were tricky, but 80%+ is where you ideally should be.

§ Quiz 0 – Attempt 2:

§ Around 180 students completed it.

§ Average went up to 80%.

§ Probability questions seemed to be the hardest.

Quiz 0, Attempt 2, Question 5

William L. Hamilton, McGill University and Mila 4

§ The correct answer was 1024, since there are 10 binary choices of what features to include (i.e., 2^10 = 1024).

§ But what about the subset where no features are included? Should we include this as an option? Yes!

§ Training a model with no features means we only learn the bias term (i.e., fw(x) = w0), which is equivalent to just predicting the average value for the target (and sometimes this is the best model we can find)!
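A tiny check of this point, with made-up numbers (not from the quiz): fitting a bias-only model by least squares recovers the mean of the targets, and the subset count is 2^10.

import numpy as np

y = np.array([2.0, 3.0, 5.0, 6.0])   # made-up targets

# With no features, the design matrix is just a column of ones (bias only),
# and the least-squares solution w0 is the mean of the targets.
X = np.ones((len(y), 1))
w0 = np.linalg.lstsq(X, y, rcond=None)[0]
print(w0, y.mean())                   # both 4.0

# Number of feature subsets for 10 candidate features, including the empty subset:
print(2 ** 10)                        # 1024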

Recap: Evaluating on held out data

William L. Hamilton, McGill University and Mila 5

§ Partition your data into a training set, validation set, and test set.

§ The proportions in each set can vary.

§ Training set is used to fit a model (find the best hypothesis in the class).

§ Validation set is used for model selection, i.e., to estimate true error and compare hypothesis classes (e.g., compare different order polynomials).

§ Test set is what you report the final accuracy on.
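As a rough illustration of such a partition (the 70/15/15 proportions and the NumPy shuffling below are just one common choice, not something prescribed by the slides):

import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # number of examples (made up)
idx = rng.permutation(n)                   # shuffle before splitting

n_train, n_val = int(0.7 * n), int(0.15 * n)
train_idx = idx[:n_train]                  # used to fit the model
val_idx   = idx[n_train:n_train + n_val]   # used for model selection
test_idx  = idx[n_train + n_val:]          # used only to report final accuracy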

k-fold cross validation

§ Instead of just one validation set, we can evaluate on many splits!

§ Consider k partitions of the training/non-test data (usually of equal size).

§ Train with k-1 subsets, validate on kth subset. Repeat k times.

§ Average the prediction error over the k rounds/folds.

William L. Hamilton, McGill University and Mila 6

Source: http://stackoverflow.com/questions/31947183/how-to-implement-walk-forward-testing-in-sklearn

(increases computation time by a factor of k)
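A minimal sketch of this procedure in NumPy; the function name and the fit/evaluate callbacks are placeholders of my own, standing in for whatever model and error metric you are using:

import numpy as np

def k_fold_cv(X, y, fit, evaluate, k=5, seed=0):
    """Average validation error over k folds.
    fit(X_tr, y_tr) returns a model; evaluate(model, X_va, y_va) returns an error."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)          # k (roughly) equal partitions
    errors = []
    for i in range(k):
        val_idx = folds[i]                  # validate on the i-th fold
        tr_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[tr_idx], y[tr_idx])   # train on the remaining k-1 folds
        errors.append(evaluate(model, X[val_idx], y[val_idx]))
    return np.mean(errors)                  # average error over the k rounds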

Generalization: test vs. train error

William L. Hamilton, McGill University and Mila 7

[Embedded excerpt from Hastie et al., Figure 2.11: "Test and training error as a function of model complexity." Prediction error is plotted against model complexity for a training sample and a test sample: training error keeps decreasing as complexity grows, while test error is U-shaped, reflecting the bias-variance tradeoff (high bias / low variance at low complexity, low bias / high variance at high complexity).]

[From Hastie et al. textbook]

§ Overly simple model: high training error and high test error.

§ Overly complex model: low training error but high test error.

Prediction problems

William L. Hamilton, McGill University and Mila 8

§ Classification

§ E.g., predicting whether a treatment is successful vs. unsuccessful

§ Y is a finite discrete set (e.g., successful vs. unsuccessful treatment)

§ Regression

§ E.g., predicting the future size of a tumor

§ Y = ℝ (i.e., we are predicting a real number)

tumor size | texture | perimeter | shade          | outcome | size change
18.02      | rough   | 117.5     | 0 (very light) | Y       | -0.14
16.05      | smooth  | 112.2     | 4 (dark)       | Y       | -0.10
18.9       | smooth  | 102.3     | 1 (light)      | N       | +0.21

Classification problems

William L. Hamilton, McGill University and Mila

Given a data set D = <xi, yi>, i = 1:n, with discrete yi, find a hypothesis which "best fits" the data.

§ If yi ∈ {0,1} this is binary classification.

§ If yi can take more than two values, the problem is called multi-class classification.

tumor size | texture | perimeter | shade          | outcome | size change
18.02      | rough   | 117.5     | 0 (very light) | Y       | -0.14
16.05      | smooth  | 112.2     | 4 (dark)       | Y       | -0.10
18.9       | smooth  | 102.3     | 1 (light)      | N       | +0.21

9

Applications of classification

William L. Hamilton, McGill University and Mila

§ Text classification (spam filtering, sentiment analysis, etc.).

§ Image classification (face detection, object recognition, etc.).

§ Prediction of cancer recurrence.

§ Recommendation systems.

§ Many, many more!

10

A simple example

William L. Hamilton, McGill University and Mila

§ Given “nucleus size”, predict cancer recurrence.

§ Univariate input: X= nucleus size.

§ Binary output: Y = {NoRecurrence = 0; Recurrence = 1}

§ Try: Minimize the least-square error.

[Figure: two histograms of counts vs. nucleus size, one for the NoRecurrence class and one for the Recurrence class, together with the least-squares regression line fit to the 0/1 outcome (embedded from Joelle Pineau's COMP-598 / COMP-652 slides).]

11

Classification via linear regression?

William L. Hamilton, McGill University and Mila

§ Here the red line is: Y' = X (XTX)^-1 XT Y

§ How to get a binary output?

1. Threshold: {y <= t for NoRecurrence, y > t for Recurrence}

2. Interpret output as probability: y = Probability(Recurrence)
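A sketch of this recipe under option 1 (least-squares fit of the 0/1 targets, then a threshold; the helper name and the default threshold t = 0.5 are my own choices):

import numpy as np

def linreg_classify(X, y, X_new, t=0.5):
    # Add a bias column and solve the normal equations w = (X^T X)^-1 X^T y.
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    # Predict real-valued scores for new points, then threshold at t.
    scores = np.column_stack([np.ones(len(X_new)), X_new]) @ w
    return (scores > t).astype(int)   # 1 = Recurrence, 0 = NoRecurrence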

[Figure: least-squares regression line (red) fit to the 0/1 recurrence outcomes as a function of nucleus size (embedded from Joelle Pineau's COMP-598 / COMP-652 slides).]

12

Classification via linear regression?

William L. Hamilton, McGill University and Mila

§ Here the red line is: Y' = X (XTX)^-1 XT Y

§ How to get a binary output?

1. Threshold: {y <= t for NoRecurrence, y > t for Recurrence}

2. Interpret output as probability: y = Probability(Recurrence)

[Same figure as the previous slide: the linear regression fit to the 0/1 recurrence data.]

Not a great fit!

Can we find a better model?

13

High-level views of binary classification

William L. Hamilton, McGill University and Mila

§ Probabilistic

§ Goal: Estimate P(y|x), i.e. the conditional probability of the target variable given the feature data.

§ Focus of the next few lectures.

§ Decision boundaries

§ Goal: Partition the feature space into different regions, and classify points based on the region where they lie.

§ Focus of later lectures on decision trees and SVMs.

14

Approaches to binary classification

William L. Hamilton, McGill University and Mila

§ Two probabilistic approaches:

1. Discriminative learning: Directly estimate P(y|x).

2. Generative learning: Separately model P(x|y) and P(y). Use Bayes' rule to estimate P(y|x):

P(y=1|x) = P(x|y=1) P(y=1) / P(x)

15

Probabilistic view of discriminative learning

William L. Hamilton, McGill University and Mila

§ Suppose we have 2 classes: y ∈ {0, 1}

§ What is the probability of a given input x having class y = 1?

§ Consider Bayes' rule:

P(y=1|x) = P(x, y=1) / P(x)
         = P(x|y=1) P(y=1) / [ P(x|y=1) P(y=1) + P(x|y=0) P(y=0) ]
         = 1 / (1 + exp(-a))
         = σ(a)

where

a = ln [ P(x|y=1) P(y=1) / ( P(x|y=0) P(y=0) ) ] = ln [ P(y=1|x) / P(y=0|x) ]

(By Bayes' rule; P(x) on top and bottom cancels out.) Here σ is the sigmoid function, and a is the log-odds of the data being class 1 vs. class 0.


16

Probabilistic view of discriminative learning

William L. Hamilton, McGill University and Mila

§ Log-odds ratio: a = ln [ P(y=1|x) / P(y=0|x) ]    (How much more likely is y=1 compared to y=0?)

§ Logistic function: σ(a) = 1 / (1 + exp(-a))    (What is our predicted probability for y=1?)

17

Discriminative learning: Logistic regression

William L. Hamilton, McGill University and Mila

§ Idea: Directly model the log-odds with a linear function:

a = ln [ P(y=1|x) / P(y=0|x) ] = w0 + w1x1 + … + wmxm

(The log-odds, i.e. how much more likely y=1 is compared to y=0, is approximated by a linear function of the input features x.)

18

Discriminative learning: Logistic regression

William L. Hamilton, McGill University and Mila

§ Idea: Directly model the log-odds with a linear function:

a = ln [ P(y=1|x) / P(y=0|x) ] = w0 + w1x1 + … + wmxm

§ The decision boundary is the set of points for which a = 0.

§ The linear logistic function:

P(y=1|x) = σ(wTx) = 1 / (1 + e^(-wTx))
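These two formulas translate directly into code; the weight values below are made up for illustration, with w[0] playing the role of the bias w0:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, x):
    """P(y=1|x) = sigmoid(w^T x); x is assumed to include a leading 1 for the bias."""
    return sigmoid(w @ x)

w = np.array([-3.0, 0.2])            # made-up weights: bias w0, slope w1
x = np.array([1.0, 20.0])            # x = [1, nucleus size]
print(predict_proba(w, x))           # predicted probability of recurrence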

19

Learning the weights in logistic regression

§ Recall: σ(wTxi) is the probability that yi = 1 (given xi), and 1 - σ(wTxi) is the probability that yi = 0.

§ For y ∈ {0,1}, the likelihood function is:

William L. Hamilton, McGill University and Mila

P(yi|xi, w) = σ(wTxi)       if yi = 1
            = 1 - σ(wTxi)   if yi = 0

The likelihood of the data (the probability of the target data given the model parameters):

L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)
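A literal implementation of this product (assuming X already contains a leading column of ones for the bias; in practice the log of this quantity is optimized instead, as the next slides explain):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def likelihood(w, X, y):
    """L(D) = prod_i sigmoid(w^T x_i)^y_i * (1 - sigmoid(w^T x_i))^(1 - y_i)."""
    p = sigmoid(X @ w)                      # predicted P(y=1|x) for every row
    return np.prod(p ** y * (1 - p) ** (1 - y))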

20

Maximizing likelihood

§ Our goal is to maximize the likelihood!

§ In other words: we want to find the parameters that give the highest likelihood.

William L. Hamilton, McGill University and Mila

P(yi|xi, w) = σ(wTxi)       if yi = 1
            = 1 - σ(wTxi)   if yi = 0

The likelihood of the data (the probability of the target data given the model parameters):

L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)

21

Maximizing log-likelihood

William L. Hamilton, McGill University and Mila

Likelihood:

L(D) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)

Problem: Taking products of lots of small numbers is numerically unstable, making this function hard to optimize…

Log-likelihood (easier to optimize!):

l(D) = ln(L(D)) = ∑i=1:n yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi))

22

Maximizing likelihood vs. minimizing loss

William L. Hamilton, McGill University and Mila

§ Another view: The negative log-likelihood of the logistic function is known as the cross-entropy loss.

§ So maximizing the likelihood is the same as minimizing the cross-entropy loss.

cross-entropy(D) = - ∑i=1:n yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi))
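The same quantity in code; the small epsilon inside the logarithms is my own guard against log(0), not part of the slide's formula:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, y, eps=1e-12):
    """Negative log-likelihood of logistic regression (cross-entropy loss)."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))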

23

Maximizing likelihood vs. minimizing loss

William L. Hamilton, McGill University and Mila

§ Formal interpretation of the cross-entropy loss comes from information theory.

§ Basic idea: it measures how many bits of information we would need to correct the errors made by our model.

cross-entropy(D) = - ∑i=1:n yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi))

24

Maximizing likelihood vs. minimizing loss

William L. Hamilton, McGill University and Mila

§ There are probabilistic interpretations of various loss functions, and we can often view minimizing a loss as equivalent to maximizing likelihood.

§ E.g., we can even interpret the mean-squared loss in linear regression through a probabilistic lens.

25

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Assume that y = w0 + ∑j=1:m wj xj + ε, where ε ~ N(0, σ²) and ε is i.i.d. and independent of x.

§ Then we can compute the likelihood of a particular target value according to a Gaussian distribution:

P(yi|xi, w) = (1/√(2πσ²)) exp( -(yi - wTxi)² / (2σ²) )

26

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Assume that y = w0 + ∑j=1:m wj xj + ε, where ε ~ N(0, σ²) and ε is i.i.d. and independent of x.

§ Then we can compute the likelihood of a particular target value according to a Gaussian distribution:

P(yi|xi, w) = (1/√(2πσ²)) exp( -(yi - wTxi)² / (2σ²) )

Looks just like the squared error!

27

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Given the likelihood of an individual point:

§ We can then compute the log-likelihood of the whole dataset:

P(yi|xi, w) = (1/√(2πσ²)) exp( -(yi - wTxi)² / (2σ²) )

l(D) = ∑i=1:n [ -ln(√(2πσ²)) - (yi - wTxi)² / (2σ²) ]

28

Aside: Probabilistic view of linear regression

William L. Hamilton, McGill University and Mila

§ Given the likelihood of an individual point:

§ We can then compute the log-likelihood of the whole dataset:

P(yi|xi, w) = (1/√(2πσ²)) exp( -(yi - wTxi)² / (2σ²) )

l(D) = ∑i=1:n [ -ln(√(2πσ²)) - (yi - wTxi)² / (2σ²) ]

These terms are constants, so maximizing this likelihood is equivalent to minimizing the squared loss!
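A quick numerical check of this claim on synthetic data: for any two weight vectors, the difference in Gaussian log-likelihood equals minus the difference in squared loss scaled by 1/(2σ²), so both objectives rank models identically (the data-generating choices below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])   # bias column plus m features
w_true = rng.normal(size=m + 1)
sigma = 0.5
y = X @ w_true + rng.normal(scale=sigma, size=n)             # y = w^T x + Gaussian noise

def gaussian_log_lik(w):
    resid = y - X @ w
    return np.sum(-np.log(np.sqrt(2 * np.pi * sigma**2)) - resid**2 / (2 * sigma**2))

def squared_loss(w):
    return np.sum((y - X @ w) ** 2)

# The log-likelihood and the (negative, scaled) squared loss differ only by a constant,
# so comparing any two candidate weight vectors gives the same answer either way.
w1, w2 = rng.normal(size=m + 1), rng.normal(size=m + 1)
print(gaussian_log_lik(w1) - gaussian_log_lik(w2))
print(-(squared_loss(w1) - squared_loss(w2)) / (2 * sigma**2))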

29

Recap: likelihoods and losses

William L. Hamilton, McGill University and Mila

§ Under certain assumptions, many loss functions have probabilistic interpretations.

§ The cross-entropy loss = maximum likelihood for logistic regression.

§ The squared loss = maximum likelihood for linear regression.

§ Assuming i.i.d. normally distributed errors!

30

Not all losses are created equal

William L. Hamilton, McGill University and Mila

§ We can come up with all kinds of losses:

§ Absolute error loss (for regression): L(y, fw(X)) = ∑i=1:n |yi - wTxi|

§ 0-1 loss (for classification): L(y, fw(X)) = ∑i=1:n I(yi ≠ fw(xi))

§ … but these losses are not always easy to optimize (e.g., not differentiable).

§ … and these losses are often not theoretically grounded.
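Both losses are one-liners (assuming, as in the slides, a linear predictor wTx for the regression case and already-computed class predictions for the 0-1 case):

import numpy as np

def absolute_error(w, X, y):
    """Sum of |y_i - w^T x_i| (regression)."""
    return np.sum(np.abs(y - X @ w))

def zero_one_loss(y_true, y_pred):
    """Number of misclassified points: sum of I(y_i != f_w(x_i))."""
    return np.sum(y_true != y_pred)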

31

Losses are different from error metrics

William L. Hamilton, McGill University and Mila

§ Problem: The cross-entropy loss may be theoretically grounded, but it is not very interpretable…

§ Solution: Train models using theoretically grounded loss functions but evaluate using interpretable measures.

§ E.g., for linear classification:

§ Train using cross-entropy.

§ Evaluate using accuracy (i.e., % correct predictions).

§ More evaluation functions to come in lecture 6!

32
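For instance, after fitting logistic regression with the cross-entropy loss, the reported number would be accuracy on held-out data; a sketch, where w is whatever weights training produced and the 0.5 threshold is the usual default:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def accuracy(w, X, y, threshold=0.5):
    """Fraction of correct predictions: predict 1 when P(y=1|x) > threshold."""
    y_pred = (sigmoid(X @ w) > threshold).astype(int)
    return np.mean(y_pred == y)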

Back to logistic regression

§ Recall: σ(wTxi) is the probability that yi = 1 (given xi), and 1 - σ(wTxi) is the probability that yi = 0.

§ For y ∈ {0,1}, the likelihood function is:

William L. Hamilton, McGill University and Mila

P(yi|xi, w) = σ(wTxi)       if yi = 1
            = 1 - σ(wTxi)   if yi = 0

The likelihood of the data (the probability of the target data given the model parameters):

L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)

33

Back to logistic regression: likelihood

§ Our goal is to maximize the likelihood!

§ In other words: we want to find the parameters that give the highest likelihood.

William L. Hamilton, McGill University and Mila

P(yi|xi, w) = σ(wTxi)       if yi = 1
            = 1 - σ(wTxi)   if yi = 0

The likelihood of the data (the probability of the target data given the model parameters):

L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)

34

Back to logistic regression: log-likelihood

William L. Hamilton, McGill University and Mila

Likelihood:

L(D) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)

Problem: Taking products of lots of small numbers is numerically unstable, making this function hard to optimize…

Log-likelihood (easier to optimize!):

l(D) = ln(L(D)) = ∑i=1:n yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi))

35

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = - ∑i=1:n [ yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi)) ]

§ Take the derivative:

∂Err(w)/∂w = - [ ∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + … ]

(Chain rule step: ∂log(σ)/∂σ = 1/σ)

36

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = - ∑i=1:n [ yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi)) ]

§ Take the derivative:

∂Err(w)/∂w = - [ ∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + … ]

(Chain rule step: ∂σ/∂(wTx) = σ(1-σ))

37

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = - ∑i=1:n [ yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi)) ]

§ Take the derivative:

∂Err(w)/∂w = - [ ∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + … ]

(Chain rule step: ∂(wTx)/∂w = x)

38

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = - ∑i=1:n [ yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi)) ]

§ Take the derivative:

∂Err(w)/∂w = - [ ∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + (1-yi) (1/(1-σ(wTxi))) (1-σ(wTxi)) σ(wTxi) (-1) xi ]

(Chain rule step: ∂(1-σ)/∂(wTx) = (1-σ) σ (-1))

39

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Cross-entropy loss: CE(D) = - ∑i=1:n [ yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi)) ]

§ Take the derivative:

∂Err(w)/∂w = - [ ∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + (1-yi) (1/(1-σ(wTxi))) (1-σ(wTxi)) σ(wTxi) (-1) xi ]
           = - ∑i=1:n xi (yi (1-σ(wTxi)) - (1-yi) σ(wTxi))
           = - ∑i=1:n xi (yi - σ(wTxi))

§ Update rule: wk+1 = wk + αk ∑i=1:n xi (yi - σ(wkTxi))
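A minimal implementation of this update rule (batch gradient ascent on the log-likelihood; the fixed step size and iteration count below are arbitrary choices of mine, and X is assumed to include a column of ones for the bias):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.01, n_steps=1000):
    """Repeatedly apply w <- w + lr * sum_i x_i (y_i - sigmoid(w^T x_i))."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w + lr * X.T @ (y - sigmoid(X @ w))   # gradient of the log-likelihood
    return w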

40

Gradient descent for logistic regression

William L. Hamilton, McGill University and Mila

§ Update rule: wk+1 =wk +αk∑i=1:n xi(yi – σ(wkTxi))

§ Intuition:

§ If we give a low probability to a positive point (i.e., yi = 1), then we should increase the parameter weights for the features strongly associated with that point.

§ If we give a high probability to a negative point (i.e., yi = 0), then we should decrease the parameter weights for the features strongly associated with that point.

41

Approaches to binary classification

William L. Hamilton, McGill University and Mila

§ Two probabilistic approaches:

1. Discriminative learning: Directly estimate P(y|x).

2. Generative learning: Separately model P(x|y) and P(y). Use Bayes' rule to estimate P(y|x):

P(y=1|x) = P(x|y=1) P(y=1) / P(x)

42

Approaches to binary classification

William L. Hamilton, McGill University and Mila

§ Two probabilistic approaches:

1. Discriminative learning: Directly estimate P(y|x). (Today)

2. Generative learning: Separately model P(x|y) and P(y). Use Bayes' rule to estimate P(y|x): (Next lecture)

P(y=1|x) = P(x|y=1) P(y=1) / P(x)

43

What you should know

William L. Hamilton, McGill University and Mila 44

§ Basic definition of linear classification problem.

§ Derivation of logistic regression.

§ The relationship between maximum likelihood and loss functions.

§ The difference between loss functions and error metrics.

Final notes

William L. Hamilton, McGill University and Mila 45

§ Get started on MiniProject 1!

§ The midterm is November 18th from 6-8pm. Contact the course staff ASAP if you know you cannot make this day!
