Page 1
COMP 551 - Applied Machine Learning
Lecture 4 --- Linear Classification
William L. Hamilton
(with slides and content from Joelle Pineau)
* Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.
William L. Hamilton, McGill University and Mila 1
Page 2
MiniProject 1 is out!
William L. Hamilton, McGill University and Mila 2
§ Due September 28th at 11:59pm. The details are at: https://www.cs.mcgill.ca/~wlh/comp551/files/miniproject1_spec.pdf
§ Basic idea – "Machine Learning 101":
§ Implement two linear classification algorithms (from this lecture and next lecture).
§ Run linear classification on two different datasets.
§ Compare different models, settings, and features.
§ (Semi-)open-ended write-up.
§ Completed in groups of 3! You can now register your group on MyCourses.
§ If you don’t have a group yet, find one quick! You can use the discussion board on MyCourses to search for potential group members.
Page 3
Self-assessment / practice quizzes
William L. Hamilton, McGill University and Mila 3
§ Quiz 0 – Attempt 1:
§ Around 250 students completed it.
§ Roughly 70% average. A couple questions were tricky, but 80%+ is where you ideally should be.
§ Quiz 0 – Attempt 2:
§ Around 180 students completed it.
§ Average went up to 80%.
§ Probability questions seemed to be the hardest.
Page 4
Quiz 0, Attempt 2, Question 5
William L. Hamilton, McGill University and Mila 4
§ The correct answer was 1024, since there are 10 binary choices of what features to include (i.e., 2^10 = 1024).
§ But what about the subset where no features are included? Should we include this as an option? Yes!
§ Training a model with no features means we only learn the bias term (i.e., fw(x)=w0), which is equivalent to just predicting the average value for the target (and sometimes this is the best model we can find)!
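A quick illustrative check of this point (a minimal NumPy sketch with made-up targets, not part of the quiz): a least-squares model whose only "feature" is the constant 1 recovers exactly the mean of the targets.

```python
import numpy as np

y = np.array([2.0, 3.0, 5.0, 10.0])       # made-up target values
X = np.ones((len(y), 1))                   # "no features": only the bias column

# Least-squares solution w = (X^T X)^{-1} X^T y reduces to the mean of y
w0 = np.linalg.lstsq(X, y, rcond=None)[0][0]
print(w0, y.mean())                        # both print 5.0
```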
Page 5
Recap: Evaluating on held out data
William L. Hamilton, McGill University and Mila 5
§ Partition your data into a training set, validation set, and test set.
§ The proportions in each set can vary.
§ Training set is used to fit a model (find the best hypothesis in the class).
§ Validation set is used for model selection, i.e., to estimate true error and compare hypothesis classes. (E.g., compare different order polynomials).
§ Test set is what you report the final accuracy on.
Page 6
k-fold cross validation
§ Instead of just one validation set, we can evaluate on many splits!
§ Consider k partitions of the training/non-test data (usually of equal size).
§ Train with k-1 subsets, validate on kth subset. Repeat k times.
§ Average the prediction error over the k rounds/folds.
William L. Hamilton, McGill University and Mila 6
Source: http://stackoverflow.com/questions/31947183/how-to-implement-walk-forward-testing-in-sklearn
(increases computation time by a factor of k)
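A minimal sketch of the k-fold loop (illustrative, assuming NumPy; the `fit` and `error` callables are placeholders for whatever model and metric you are using, and scikit-learn's KFold does the same bookkeeping):

```python
import numpy as np

def k_fold_cv(X, y, fit, error, k=5, seed=0):
    """Average validation error over k folds.
    `fit(X_tr, y_tr)` returns a model; `error(model, X_va, y_va)` returns a scalar.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # shuffle before splitting
    folds = np.array_split(idx, k)           # k (roughly) equal partitions
    errors = []
    for i in range(k):
        val = folds[i]                                   # i-th fold for validation
        train = np.concatenate(folds[:i] + folds[i+1:])  # remaining k-1 folds for training
        model = fit(X[train], y[train])
        errors.append(error(model, X[val], y[val]))
    return np.mean(errors)                   # average over the k rounds/folds
```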
Page 7
Generalization: test vs. train error
William L. Hamilton, McGill University and Mila 7
[Figure 2.11 from Hastie et al.: Test and training error as a function of model complexity. Prediction error is plotted against model complexity for the training sample and the test sample; the simple end of the axis corresponds to high bias / low variance, the complex end to low bias / high variance.]
be close to f(x0). As k grows, the neighbors are further away, and then anything can happen.
The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias-variance tradeoff.
More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.
Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error (1/N) ∑i (yi − ŷi)². Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.
Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions f(x0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.
[From Hastie et al. textbook]
§ Overly simple model: High training error and high test error.
§ Overly complex model: Low training error but high test error.
Page 8
Prediction problems
William L. Hamilton, McGill University and Mila 8
§ Classification
§ E.g., predicting whether a treatment is successful vs. unsuccessful
§ Y is a finite discrete set (e.g., successful vs. unsuccessful treatment)
§ Regression
§ E.g., predicting the future size of a tumor
§ Y = ℝ (i.e., we are predicting a real number)
tumor size | texture | perimeter | shade          | outcome | size change
18.02      | rough   | 117.5     | 0 (very light) | Y       | -0.14
16.05      | smooth  | 112.2     | 4 (dark)       | Y       | -0.10
18.9       | smooth  | 102.3     | 1 (light)      | N       | +0.21
Page 9
Classification problems
William L. Hamilton, McGill University and Mila
§ Given a data set D = <xi, yi>, i = 1:n, with discrete yi, find a hypothesis which "best fits" the data.
§ If yi ∈ {0, 1} this is binary classification.
§ If yi can take more than two values, the problem is called multi-class classification.
tumor size | texture | perimeter | shade          | outcome | size change
18.02      | rough   | 117.5     | 0 (very light) | Y       | -0.14
16.05      | smooth  | 112.2     | 4 (dark)       | Y       | -0.10
18.9       | smooth  | 102.3     | 1 (light)      | N       | +0.21
9
Page 10
Applications of classification
William L. Hamilton, McGill University and Mila
§ Text classification (spam filtering, sentiment analysis, etc.).
§ Image classification (face detection, object recognition, etc.).
§ Prediction of cancer recurrence.
§ Recommendation systems.
§ Many, many more!
10
Page 11
A simple example
William L. Hamilton, McGill University and Mila
§ Given “nucleus size”, predict cancer recurrence.
§ Univariate input: X= nucleus size.
§ Binary output: Y = {NoRecurrence = 0; Recurrence = 1}
§ Try: Minimize the least-square error.
[Figure: histograms of nucleus size counts for the NoRecurrence and Recurrence classes (nucleus size roughly 10-30 on the x-axis), and a linear regression fit to the 0/1-coded outputs, where the sum of squared errors is minimized by the red line.]
11
Page 12
Classification via linear regression?
William L. Hamilton, McGill University and Mila
§ Here the red line is: Y' = X (XTX)-1 XT Y
§ How to get a binary output?
1. Threshold: {y <= t for NoRecurrence, y > t for Recurrence}
2. Interpret output as probability: y = Probability(Recurrence)
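For concreteness, a small sketch of this baseline (illustrative only, using NumPy and made-up nucleus sizes with 0/1-coded labels): fit ordinary least squares to the labels and threshold the real-valued output.

```python
import numpy as np

# Hypothetical data: nucleus sizes and 0/1 recurrence labels
x = np.array([12.0, 14.0, 15.0, 18.0, 22.0, 25.0])
y = np.array([0, 0, 0, 1, 1, 1])

X = np.column_stack([np.ones_like(x), x])           # add a bias column
w = np.linalg.solve(X.T @ X, X.T @ y)               # w = (X^T X)^{-1} X^T y

y_hat = X @ w                                       # real-valued predictions
pred = (y_hat > 0.5).astype(int)                    # option 1: threshold at t = 0.5
print(pred)
```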
[Figure: linear regression fit to the 0/1-coded recurrence outputs as a function of nucleus size; the red line minimizes the sum of squared errors.]
12
Page 13
Classification via linear regression?
William L. Hamilton, McGill University and Mila
§ Here the red line is: Y' = X (XTX)-1 XT Y
§ How to get a binary output?
1. Threshold: {y <= t for NoRecurrence, y > t for Recurrence}
2. Interpret output as probability: y = Probability(Recurrence)
[Figure: linear regression fit to the 0/1-coded recurrence outputs as a function of nucleus size; the red line minimizes the sum of squared errors.]
Not a great fit!
Can we find a better model?
13
Page 14
High-level views of binary classification
William L. Hamilton, McGill University and Mila
§ Probabilistic
§ Goal: Estimate P(y | x), i.e., the conditional probability of the target variable given the feature data.
§ Focus of the next few lectures.
§ Decision boundaries
§ Goal: Partition the feature space into different regions, and classify points based on the region where they lie.
§ Focus of later lectures on decision trees and SVMs.
14
Page 15
Approaches to binary classification
William L. Hamilton, McGill University and Mila
§ Two probabilistic approaches:
1. Discriminative learning: Directly estimate P(y|x).
2. Generative learning: Separately model P(x|y) and P(y). Use Bayes' rule to estimate P(y|x):
P(y=1|x) = P(x|y=1)P(y=1) / P(x)
15
Page 16
Probabilistic view of discriminative learning
William L. Hamilton, McGill University and Mila
§ Suppose we have 2 classes: y ∈ {0, 1}.
§ What is the probability of a given input x having class y = 1?
§ Consider Bayes' rule:
P(y=1|x) = P(x, y=1) / P(x)
         = P(x|y=1)P(y=1) / [P(x|y=1)P(y=1) + P(x|y=0)P(y=0)]
         = 1 / (1 + P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)))
         = 1 / (1 + exp(-a)) = σ(a)
where
a = ln [P(x|y=1)P(y=1) / (P(x|y=0)P(y=0))] = ln [P(y=1|x) / P(y=0|x)]
(By Bayes' rule; P(x) on top and bottom cancels out.)
§ Here σ has a special form, called a sigmoid function, and a is the log-odds of the data being class 1 vs. class 0.
16
Page 17
Probabilistic view of discriminative learning
William L. Hamilton, McGill University and Mila
§ Log-odds ratio: a = ln [P(y=1|x) / P(y=0|x)]
  How much more likely is y=1 compared to y=0?
§ Logistic function: σ(a) = 1 / (1 + exp(-a))
  What is our predicted probability for y=1?
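A tiny sketch of this mapping (illustrative, assuming NumPy): the logistic function turns any log-odds value into a probability in (0, 1).

```python
import numpy as np

def sigmoid(a):
    """Logistic function: maps log-odds a to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

print(sigmoid(0.0))    # log-odds 0 -> probability 0.5 (the decision boundary)
print(sigmoid(2.0))    # positive log-odds -> probability > 0.5
print(sigmoid(-2.0))   # negative log-odds -> probability < 0.5
```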
17
Page 18
§ Idea: Directly model the log-odds with a linear function:
a = ln [P(y=1|x) / P(y=0|x)] = w0 + w1x1 + … + wmxm
How much more likely is y=1 compared to y=0?
Approximated by a linear function of the input features x.
18
Page 19
Discriminative learning: Logistic regression
William L. Hamilton, McGill University and Mila
§ Idea: Directly model the log-odds with a linear function:
a = ln [P(y=1|x) / P(y=0|x)] = w0 + w1x1 + … + wmxm
§ The decision boundary is the set of points for which a = 0.
§ The linear logistic function:
P(y=1|x) = σ(wTx) = 1 / (1 + e^(-wTx))
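A minimal prediction sketch (illustrative, assuming NumPy; the weights and input are made up): compute σ(wTx) and classify by whether the log-odds are above zero, i.e., whether the predicted probability is above 0.5.

```python
import numpy as np

w = np.array([-3.0, 0.2])                  # made-up weights: [bias w0, weight w1]
x = np.array([1.0, 18.0])                  # input with a leading 1 for the bias term

a = w @ x                                  # log-odds: w0 + w1*x1
p = 1.0 / (1.0 + np.exp(-a))               # P(y=1|x) = sigmoid(w^T x)
label = int(a > 0)                         # decision boundary at a = 0 (p = 0.5)
print(a, p, label)
```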
19
Page 20
Learning the weights in logistic regression
§ Recall: σ(wTxi) is the probability that yi = 1 (given xi), and 1 - σ(wTxi) is the probability that yi = 0.
§ For y ∈ {0, 1}, the likelihood function is
William L. Hamilton, McGill University and Mila
P(yi|xi, w) = σ(wTxi) if yi = 1, and 1 - σ(wTxi) if yi = 0
(the probability of the target data given the model parameters)
The likelihood of the data:
L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)
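A short sketch of this likelihood (illustrative, assuming NumPy; the candidate weights and data are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def likelihood(w, X, y):
    """L(D): product over examples of sigma(w^T x_i)^y_i * (1 - sigma(w^T x_i))^(1 - y_i)."""
    p = sigmoid(X @ w)                     # P(y_i = 1 | x_i) for each example
    return np.prod(p**y * (1.0 - p)**(1.0 - y))

# Tiny made-up example: 3 points, 2 features (first column is the bias feature)
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(likelihood(np.array([0.0, 1.0]), X, y))   # likelihood of one candidate w
```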
20
Page 21
Maximizing likelihood
§ Our goal is to maximize the likelihood!
§ In other words: we want to find the parameters that give the highest likelihood.
William L. Hamilton, McGill University and Mila
P(yi|xi, w) = σ(wTxi) if yi = 1, and 1 - σ(wTxi) if yi = 0
(the probability of the target data given the model parameters)
The likelihood of the data:
L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)
21
Page 22
Maximizing log-likelihood
William L. Hamilton, McGill University and Mila
Likelihood
Problem: Taking products of lots of small numbers is numerically unstable, making this function hard to optimize…
Log-likelihood
L(D) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)
Easier to optimize!
l(D) = ln(L(D)) = ∑i=1:n yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi))
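To see the numerical issue concretely, a small sketch (illustrative, assuming NumPy and made-up data): the raw product underflows to 0.0 once there are enough examples, while the sum of logs stays finite.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))            # made-up data: 10000 examples, 3 features
w = np.array([1.0, -2.0, 0.5])
y = (rng.random(10000) < sigmoid(X @ w)).astype(float)

p = sigmoid(X @ w)
product = np.prod(p**y * (1 - p)**(1 - y))                     # underflows to 0.0
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))      # stays finite
print(product, log_lik)
```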
22
Page 23
Maximizing likelihood vs. minimizing loss
William L. Hamilton, McGill University and Mila
§ Another view: The negative log-likelihood of the logistic
function is known as the cross-entropy loss.
§ So maximizing the likelihood is the same as minimizing the
cross-entropy loss.
cross-entropy(D) = - ∑i=1:n [ yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi)) ]
23
Page 24
Maximizing likelihood vs. minimizing loss
William L. Hamilton, McGill University and Mila
§ Formal interpretation of cross entropy loss comes from
information theory.
§ Basic idea: it measures how many bits of information we
would need to correct the errors made by our model.
cross-entropy(D) = - ∑i=1:n [ yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi)) ]
24
Page 25
Maximizing likelihood vs. minimizing loss
William L. Hamilton, McGill University and Mila
§ There are probabilistic interpretations of various loss
functions, and we can often view minimizing a loss as
equivalent to maximizing likelihood.
§ E.g., we can even interpret the mean-squared loss in linear
regression in a probabilistic lens.
25
Page 26
Aside: Probabilistic view of linear regression
William L. Hamilton, McGill University and Mila
§ Assume that y = w0 + ∑j=1:m wj xj + ε, where ε ~ N(0, σ2) and ε is i.i.d. and independent of x
§ Then we can compute the likelihood of a particular target
value according to a Gaussian distribution:
P(yi|xi, w) = (1/√(2πσ2)) exp(-(yi - wTxi)^2 / (2σ2))
26
Page 27
Aside: Probabilistic view of linear regression
William L. Hamilton, McGill University and Mila
§ Assume that y = w0 + ∑j=1:m wj xj + ε, where ε ~ N(0, σ2) and ε is i.i.d. and independent of x
§ Then we can compute the likelihood of a particular target
value according to a Gaussian distribution:
Looks just like the squared error!
P(yi|xi, w) = (1/√(2πσ2)) exp(-(yi - wTxi)^2 / (2σ2))
27
Page 28
Aside: Probabilistic view of linear regression
William L. Hamilton, McGill University and Mila
§ Given the likelihood of an individual point:
P(yi|xi, w) = (1/√(2πσ2)) exp(-(yi - wTxi)^2 / (2σ2))
§ We can then compute the log-likelihood of the whole dataset:
l(D) = ∑i=1:n [ -ln(√(2πσ2)) - (yi - wTxi)^2 / (2σ2) ]
28
Page 29
Aside: Probabilistic view of linear regression
William L. Hamilton, McGill University and Mila
§ Given the likelihood of an individual point:
P(yi|xi, w) = (1/√(2πσ2)) exp(-(yi - wTxi)^2 / (2σ2))
§ We can then compute the log-likelihood of the whole dataset:
l(D) = ∑i=1:n [ -ln(√(2πσ2)) - (yi - wTxi)^2 / (2σ2) ]
These terms are constants, so maximizing this likelihood is equivalent to minimizing the squared loss!
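A small numerical sketch of this equivalence (illustrative, assuming NumPy and made-up data; σ is fixed at the noise level used to generate the data): the least-squares solution also maximizes the Gaussian log-likelihood, so nudging it in any direction can only lower the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])    # bias column + one feature
w_true = np.array([1.0, 2.0])
y = X @ w_true + 0.5 * rng.normal(size=100)                  # y = w0 + w1*x + Gaussian noise

# Least-squares fit (minimizes the squared loss)
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Gaussian log-likelihood as a function of w (sigma held fixed at 0.5)
def log_lik(w, sigma=0.5):
    r = y - X @ w
    return np.sum(-np.log(np.sqrt(2 * np.pi * sigma**2)) - r**2 / (2 * sigma**2))

# The least-squares solution also maximizes the log-likelihood:
print(log_lik(w_ls) >= log_lik(w_ls + np.array([0.01, 0.0])))   # True
print(log_lik(w_ls) >= log_lik(w_ls - np.array([0.0, 0.01])))   # True
```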
29
Page 30
Recap: likelihoods and losses
William L. Hamilton, McGill University and Mila
§ Under certain assumptions many loss functions have
probabilistic interpretations.
§ The cross-entropy loss = maximum likelihood for logistic
regression.
§ The squared loss = maximum likelihood for linear regression.
§ Assuming i.i.d. normally distributed errors!
30
Page 31
Not all losses are created equal
William L. Hamilton, McGill University and Mila
§ We can come up with all kinds of losses:
§ Absolute error loss (for regression): L(y, fw(X)) = ∑i=1:n |yi - wTxi|
§ 0-1 loss (for classification): L(y, fw(X)) = ∑i=1:n I(yi ≠ fw(xi))
§ … but these losses are not always easy to optimize (e.g., not differentiable).
§ … and these losses are often not theoretically grounded.
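For concreteness, minimal sketches of these two losses (illustrative, assuming NumPy arrays of targets and predictions):

```python
import numpy as np

def absolute_error(y, y_hat):
    """Sum of absolute residuals (regression); not differentiable where y_hat = y."""
    return np.sum(np.abs(y - y_hat))

def zero_one_loss(y, y_pred):
    """Number of misclassified points (classification); it only counts mistakes,
    so as a function of the model parameters it is piecewise constant."""
    return np.sum(y != y_pred)
```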
31
Page 32
Losses are different from error metrics
William L. Hamilton, McGill University and Mila
§ Problem: The cross-entropy loss may be theoretically
grounded, but it is not very interpretable…
§ Solution: Train models using theoretically grounded loss
functions but evaluate using interpretable measures.
§ E.g., for linear classification
§ Train using cross-entropy.
§ Evaluate using accuracy (i.e., % correct predictions).
§ More evaluation functions to come in lecture 6!
32
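A small sketch of this split between training loss and evaluation metric (illustrative, assuming NumPy and made-up labels and predicted probabilities):

```python
import numpy as np

def cross_entropy(y, p):
    """Training loss: theoretically grounded, but hard to interpret on its own."""
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(y, p, t=0.5):
    """Evaluation metric: fraction of correct predictions after thresholding."""
    return np.mean((p > t) == y)

# Made-up labels and predicted probabilities
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4, 0.1])
print(cross_entropy(y, p), accuracy(y, p))   # loss ~ 1.86, accuracy = 0.8
```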
Page 33
Back to logistic regression
§ Recall: σ(wTxi) is the probability that yi = 1 (given xi), and 1 - σ(wTxi) is the probability that yi = 0.
§ For y ∈ {0, 1}, the likelihood function is
William L. Hamilton, McGill University and Mila
P(yi|xi, w) = σ(wTxi) if yi = 1, and 1 - σ(wTxi) if yi = 0
(the probability of the target data given the model parameters)
The likelihood of the data:
L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)
33
Page 34
Back to logistic regression: likelihood
§ Our goal is to maximize the likelihood!
§ In other words: we want to find the parameters that give the highest likelihood.
William L. Hamilton, McGill University and Mila
P(yi|xi, w) = σ(wTxi) if yi = 1, and 1 - σ(wTxi) if yi = 0
(the probability of the target data given the model parameters)
The likelihood of the data:
L(D) = P(y1, y2, ..., yn | x1, x2, ..., xn, w) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)
34
Page 35
Back to logistic regression: log-likelihood
William L. Hamilton, McGill University and Mila
Likelihood
Problem: Taking products of lots of small numbers is numerically unstable, making this function hard to optimize…
Log-likelihood
L(D) = ∏i=1:n σ(wTxi)^yi (1 - σ(wTxi))^(1-yi)
Easier to optimize!
l(D) = ln(L(D)) = ∑i=1:n yi ln(σ(wTxi)) + (1 - yi) ln(1 - σ(wTxi))
35
Page 36
Gradient descent for logistic regression
William L. Hamilton, McGill University and Mila
§ Cross-entropy loss: CE(D) = - [∑i=1:n yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi))]
§ Take the derivative:
∂Err(w)/∂w = - [∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + …
(using ∂log(σ)/∂σ = 1/σ)
36
Page 37
Gradient descent for logistic regression
William L. Hamilton, McGill University and Mila
§ Cross-entropy loss: CE(D) = - [∑i=1:n yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi))]
§ Take the derivative:
∂Err(w)/∂w = - [∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + …
(using ∂σ(a)/∂a = σ(1-σ), where a = wTx)
37
Page 38
Gradient descent for logistic regression
William L. Hamilton, McGill University and Mila
§ Cross-entropy loss: CE(D) = - [∑i=1:n yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi))]
§ Take the derivative:
∂Err(w)/∂w = - [∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + …
(using ∂(wTx)/∂w = x)
38
Page 39
Gradient descent for logistic regression
William L. Hamilton, McGill University and Mila
§ Cross-entropy loss: CE(D) = - [∑i=1:n yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi))]
§ Take the derivative:
∂Err(w)/∂w = - [∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + (1-yi) (1/(1-σ(wTxi))) (1-σ(wTxi)) σ(wTxi) (-1) xi]
(using ∂(1-σ(a))/∂a = -σ(1-σ))
39
Page 40
Gradient descent for logistic regression
William L. Hamilton, McGill University and Mila
§ Cross-entropy loss: CE(D) = - [∑i=1:n yi log(σ(wTxi)) + (1-yi) log(1-σ(wTxi))]
§ Take the derivative:
∂Err(w)/∂w = - [∑i=1:n yi (1/σ(wTxi)) (1-σ(wTxi)) σ(wTxi) xi + (1-yi) (1/(1-σ(wTxi))) (1-σ(wTxi)) σ(wTxi) (-1) xi]
            = - ∑i=1:n xi (yi (1-σ(wTxi)) - (1-yi) σ(wTxi))
            = - ∑i=1:n xi (yi - σ(wTxi))
§ Update rule: wk+1 = wk + αk ∑i=1:n xi (yi - σ(wkTxi))
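Putting the update rule together, a compact sketch of batch gradient descent for logistic regression (illustrative, assuming NumPy; the learning rate, iteration count, and generated data are arbitrary choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent on the cross-entropy loss
    (equivalently, gradient ascent on the log-likelihood)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Update rule: w <- w + alpha * sum_i x_i (y_i - sigma(w^T x_i))
        w = w + lr * X.T @ (y - sigmoid(X @ w))
    return w

# Illustrative usage with made-up data (first column of X is the bias feature)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (rng.random(200) < sigmoid(X @ np.array([-0.5, 2.0, -1.0]))).astype(float)
w_hat = fit_logistic(X, y)
print(w_hat)   # roughly recovers the weights used to generate y
```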
40
Page 41
Gradient descent for logistic regression
William L. Hamilton, McGill University and Mila
§ Update rule: wk+1 = wk + αk ∑i=1:n xi (yi - σ(wkTxi))
§ Intuition:
§ If we give a low probability to a positive point (i.e., yi=1), then we should increase the parameter weights for the features strongly associated with that point.
§ If we give a high probability to a negative point (i.e., yi=0), then we should decrease the parameter weights for the features strongly associated with that point.
41
Page 42
Approaches to binary classification
William L. Hamilton, McGill University and Mila
§ Two probabilistic approaches:
1. Discriminative learning: Directly estimate P(y|x).
2. Generative learning: Separately model P(x|y) and P(y). Use Bayes' rule to estimate P(y|x):
P(y=1|x) = P(x|y=1)P(y=1) / P(x)
42
Page 43
Approaches to binary classification
William L. Hamilton, McGill University and Mila
§ Two probabilistic approaches:
1. Discriminative learning: Directly estimate P(y|x). [Today]
2. Generative learning: Separately model P(x|y) and P(y). Use Bayes' rule to estimate P(y|x): P(y=1|x) = P(x|y=1)P(y=1) / P(x) [Next lecture]
43
Page 44
What you should know
William L. Hamilton, McGill University and Mila 44
§ Basic definition of linear classification problem.
§ Derivation of logistic regression.
§ The relationship between maximum likelihood and loss functions.
§ The difference between loss functions and error metrics.
Page 45
Final notes
William L. Hamilton, McGill University and Mila 45
§ Get started on MiniProject 1!
§ The midterm is November 18th from 6-8pm. Contact the course staff ASAP if you know you cannot make this day!