Applied Machine Learning Lecture 5-2: Logistic regression ...richajo/dit866/lectures/l5/l5_2.pdf · Applied Machine Learning Lecture 5-2: Logistic regression and SVM Selpi ([email protected])

Applied Machine Learning, Lecture 5-2: Logistic regression and SVM

Selpi ([email protected])

These slides are a further development of Richard Johansson's slides

February 7, 2020


Overview

A bit on perceptron

logistic regression

training a logistic regression classifier

detour: multiclass linear classifiers

support vector classification

optimizing the LR and SVM objectives

The perceptron algorithm and the simpler version

Perceptron algorithm

    w = (0, ..., 0)
    repeat N times
        for (x_i, y_i) in the training set
            score = w · x_i
            if a positive is misclassified
                w = w + x_i
            else if a negative is misclassified
                w = w − x_i
    return w

The simpler version, if the y_i are coded as +1 or -1:

    w = (0, ..., 0)
    for (x_i, y_i) in the training set
        score = w · x_i
        if y_i · score ≤ 0
            w = w + y_i · x_i
    return w
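A minimal NumPy sketch of the simpler version, assuming labels coded as +1 / -1. The function name `train_perceptron`, the `n_epochs` argument (borrowed from the "repeat N times" loop of the first version) and the toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def train_perceptron(X, y, n_epochs=10):
    """Perceptron for labels coded as +1 / -1 (hypothetical helper)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):              # "repeat N times"
        for x_i, y_i in zip(X, y):
            score = w @ x_i
            if y_i * score <= 0:           # misclassified, or exactly on the boundary
                w = w + y_i * x_i
    return w

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = train_perceptron(X, y)
predictions = np.sign(X @ w)
```

On this toy set a single update is enough: after the first instance, every later instance already has `y_i * score > 0`.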


how can we get the "certainty" of a linear classifier?

score = w · x

- large positive score: quite certain that x belongs to the positive class
- large negative score: quite certain that x belongs to the negative class
- near zero: we are unsure






The logistic regression model

- logistic regression is a method to train a linear classifier that gives a probabilistic output
- how to get the probability? use a logistic or sigmoid function:

$$P(\text{positive output} \mid x) = \frac{1}{1 + e^{-\text{score}}}$$

where $e^{-\text{score}}$ = np.exp(-score)

$$P(\text{negative output} \mid x) = 1 - \frac{1}{1 + e^{-\text{score}}} = \frac{1}{1 + e^{\text{score}}}$$
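As a quick numerical check of the two formulas, here is a small sketch; the example scores are arbitrary.

```python
import numpy as np

def sigmoid(score):
    """Logistic function: maps a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))

scores = np.array([-4.0, 0.0, 4.0])   # arbitrary example scores
p_pos = sigmoid(scores)               # P(positive output | x)
p_neg = sigmoid(-scores)              # 1 / (1 + e^{score}) = P(negative output | x)
```

The two probabilities sum to 1 for every score, and a score of 0 gives exactly 0.5 for each class.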


the logistic / sigmoid function [Verhulst, 1845]

[Figure: the logistic / sigmoid function. x-axis: classifier score (−6 to 6); y-axis: P(y = positive | x) (0 to 1)]

interpreting the linear classifier's score

- the output score w · x can now be interpreted as the log odds in favor of the positive outcome
- odds: how much more likely is the positive outcome than the negative outcome?

$$\text{odds} = \frac{p}{1 - p}$$

Knew that we ventured on such dangerous seas
That if we wrought out life 'twas ten to one

Shakespeare, Henry IV, Part II, Act I, Scene 1, lines 183–4.
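The log-odds interpretation follows from the sigmoid: with p = 1 / (1 + e^{−score}), the odds p / (1 − p) equal e^{score}. A short sketch that checks this numerically (the score 1.2 is an arbitrary example value):

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

score = 1.2                       # arbitrary example score
p = sigmoid(score)
odds = p / (1 - p)
log_odds = np.log(odds)           # recovers the score

# Going the other way: odds of ten to one correspond to p = 10 / 11.
p_from_odds = 10 / (1 + 10)
```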

making it a bit more compact

- if we code the positive class as +1 and the negative class as -1, then we can write the probability a bit more neatly:

$$P(y \mid x) = \frac{1}{1 + e^{-y \cdot \text{score}}}$$

in scikit-learn

- LR is called sklearn.linear_model.LogisticRegression
- predict_proba gives the probability output

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

code example: using a logistic regression classifier
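A minimal sketch of what such an example might look like, on a made-up one-feature dataset. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, two classes coded as -1 / +1.
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

X_new = np.array([[-3.0], [0.1], [3.0]])
labels = clf.predict(X_new)              # predicted class labels
probs = clf.predict_proba(X_new)         # one column per class, ordered as clf.classes_
scores = clf.decision_function(X_new)    # the linear score (including the intercept)
```

For a binary problem, the second column of `probs` is the sigmoid of `scores`, which ties the library output back to the formula for P(positive output | x).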






recall: the maximum likelihood principle

- in a probabilistic model, we can train the model by selecting parameters that assign a high probability to the data
- in our case, the parameters are the weight vector w
- adjust w so that each output label gets a high probability

the likelihood function

- formally, the "probability of the data" is defined by the likelihood function
- this is the product of the probabilities of all m individual training instances:

$$L(w) = P(y_1 \mid x_1) \cdot \ldots \cdot P(y_m \mid x_m)$$

- in our case, this means

$$L(w) = \frac{1}{1 + e^{-y_1 \cdot (w \cdot x_1)}} \cdot \ldots \cdot \frac{1}{1 + e^{-y_m \cdot (w \cdot x_m)}}$$

rewriting a bit...

- we rewrite the previous formula

$$L(w) = \frac{1}{1 + e^{-y_1 \cdot (w \cdot x_1)}} \cdot \ldots \cdot \frac{1}{1 + e^{-y_m \cdot (w \cdot x_m)}}$$

as

$$-\log L(w) = \text{Loss}(w, x_1, y_1) + \ldots + \text{Loss}(w, x_m, y_m)$$

where

$$\text{Loss}(w, x, y) = \log(1 + \exp(-y \cdot (w \cdot x)))$$

is called the log loss function
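The rewriting step can be checked numerically: the negative log of the product equals the sum of the per-instance log losses. A small sketch with made-up weights and data (the helper names `log_loss` and `likelihood` are illustrative):

```python
import numpy as np

def log_loss(w, x, y):
    """Log loss for one instance; y is +1 or -1."""
    return np.log(1 + np.exp(-y * (w @ x)))

def likelihood(w, X, Y):
    """Product of P(y_i | x_i) over all training instances."""
    return np.prod(1 / (1 + np.exp(-Y * (X @ w))))

# Made-up weights and data for illustration.
w = np.array([0.5, -1.0])
X = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, -1.0]])
Y = np.array([-1, 1, 1])

neg_log_lik = -np.log(likelihood(w, X, Y))
total_loss = sum(log_loss(w, x, y) for x, y in zip(X, Y))
```

For very negative values of y · score, `np.exp` can overflow; `np.logaddexp(0, -y * score)` computes the same quantity more robustly.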

plot of the log loss

[Figure: the log loss as a function of y · classifier score. x-axis: y · classifier score (−6 to 6); y-axis: log loss (0 to 6)]

The fundamental tradeoff in machine learning

- goodness of fit: the learned classifier should be able to correctly classify the examples in the training data
- regularization: the classifier should be simple
- but so far in our LR description, we've just taken care of the first part!

$$-\log L(w) = \text{Loss}(w, x_1, y_1) + \ldots + \text{Loss}(w, x_m, y_m)$$

regularization in logistic regression models

- just like we saw for linear regression models (Ridge and Lasso), we can add a regularizer that keeps the weights small
- most commonly, the L2 regularizer:

$$\|w\|^2 = w_1 \cdot w_1 + \ldots + w_n \cdot w_n = w \cdot w$$

- ... or an L1 regularizer:

$$\|w\|_1 = |w_1| + \ldots + |w_n|$$

which will do some feature selection

combining the pieces

- we combine the loss and the regularizer:

$$\frac{1}{N} \cdot \sum_{i=1}^{N} \text{Loss}(w, x_i, y_i) + \frac{\lambda}{2} \cdot \|w\|^2$$

- in this formula, λ is a "tweaking" parameter that controls the tradeoff between loss and regularization
- note: in some formulations (including scikit-learn), there is a parameter C instead of the λ that is put before the loss

$$\frac{C}{N} \cdot \sum_{i=1}^{N} \text{Loss}(w, x_i, y_i) + \frac{1}{2} \|w\|^2$$

- how do we convert this into an algorithm?
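The two formulations can be compared directly. With C = 1/λ, the C form is the λ form divided by λ, so both have the same minimizer. A sketch with made-up data (the function names are illustrative):

```python
import numpy as np

def log_loss(w, x, y):
    return np.log(1 + np.exp(-y * (w @ x)))

def objective_lambda(w, X, Y, lam):
    """(1/N) * sum of losses + (lambda/2) * ||w||^2"""
    mean_loss = np.mean([log_loss(w, x, y) for x, y in zip(X, Y)])
    return mean_loss + lam / 2 * (w @ w)

def objective_C(w, X, Y, C):
    """(C/N) * sum of losses + (1/2) * ||w||^2"""
    mean_loss = np.mean([log_loss(w, x, y) for x, y in zip(X, Y)])
    return C * mean_loss + 0.5 * (w @ w)

# Made-up weights and data for illustration.
w = np.array([0.5, -1.0])
X = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, -1.0]])
Y = np.array([-1, 1, 1])
lam = 0.25

value_lambda = objective_lambda(w, X, Y, lam)
value_C = objective_C(w, X, Y, 1 / lam)
```

A large C (small λ) emphasizes fitting the training data; a small C (large λ) emphasizes keeping the weights small.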

probabilistic justification of the regularizers

- by adding the regularizer, we are carrying out maximum a posteriori estimation instead of maximum likelihood
- in MAP estimation, we use a prior to push the parameters in a desired direction

$$w_{\text{MAP}} = \arg\max_w \; p(\text{data} \mid w) \cdot p(w)$$

- L2 regularizer = Gaussian prior

$$-\log p(w) = \lambda \|w\|^2 + \text{constant}$$

- L1 regularizer = Laplace prior

$$-\log p(w) = \lambda |w| + \text{constant}$$

[Figures: two density plots over the range −3 to 3, one for each prior]






two-class (binary) linear classifiers

- a linear classifier is a classifier that is defined in terms of a scoring function like this

score = w · x

- this is a binary (2-class) classifier:
  - return the first class if the score > 0
  - ... otherwise the second class
- how can we deal with non-binary (multi-class) problems when using linear classifiers?

approaches to multiclass classification

- idea 1: break down the complex problem into simpler problems, train a classifier for each separately
- idea 2: modify the learning algorithm so that it can handle the multiclass case directly

idea 1: reduction from multiclass to binary

- one-versus-rest ("long jump"):
  - for each class c, make a binary classifier to distinguish c from all other classes
  - so if there are n classes, there are n classifiers
  - at test time, we select the class giving the highest score
- one-versus-one ("football league"):
  - for each pair of classes c1 and c2, make a classifier to distinguish c1 from c2
  - if there are n classes, there are $\frac{n \cdot (n-1)}{2}$ classifiers
  - at test time, we select the class that has most "wins"

example

- assume we're training a classifier of fruits and we have the classes apple, orange, mango
- in one-vs-rest, we train the following three classifiers:
  - apple vs orange+mango
  - orange vs apple+mango
  - mango vs apple+orange
- in one-vs-one, we train the following three:
  - apple vs orange
  - apple vs mango
  - orange vs mango

example (continued)

- we train classifiers to distinguish between apple, orange, and mango, using one-vs-rest
  - so we get w_apple, w_orange, w_mango
- for some instance x, the respective scores are

[-1, 2.2, 1.5]

so our guess is orange
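The one-vs-rest decision rule is an argmax over the per-class scores. In the sketch below, the weight vectors and the instance are made up so that the scores come out as in the example:

```python
import numpy as np

classes = ["apple", "orange", "mango"]

# Hypothetical weight vectors, one row per class (values chosen for illustration).
W = np.array([[ 1.0, -1.0],    # w_apple
              [ 0.2,  1.0],    # w_orange
              [-0.5,  1.0]])   # w_mango
x = np.array([1.0, 2.0])       # hypothetical instance

scores = W @ x                             # one score per class
guess = classes[int(np.argmax(scores))]    # class with the highest score
```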

in scikit-learn

- scikit-learn includes implementations of both of the methods we have discussed:
  - OneVsRestClassifier
  - OneVsOneClassifier
- however, the built-in algorithms (e.g. Perceptron, LogisticRegression) will do this automatically for you
  - they use one-versus-rest (for LogisticRegression this was the default in older releases; since scikit-learn 0.22 it defaults to the multinomial model for most solvers)

http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html

http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html
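A small sketch of the two wrappers on synthetic data. The dataset parameters and the choice of LogisticRegression as the base classifier are assumptions for illustration. With 4 classes, one-versus-rest trains 4 binary classifiers and one-versus-one trains 4·3/2 = 6.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class problem (parameters chosen for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

n_ovr = len(ovr.estimators_)   # one binary classifier per class
n_ovo = len(ovo.estimators_)   # one binary classifier per pair of classes
predictions = ovr.predict(X)
```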

idea 2: multiclass learning algorithms

- is it good to separate the multiclass task into smaller tasks that are trained independently?
  - maybe training should be similar to testing?
- let's make a model where one-vs-rest is used while training
  - we'll see how this can be done for logistic regression

binary LR: reminders

- the logistic or sigmoid function:

$$P(\text{positive output} \mid x) = \frac{1}{1 + e^{-\text{score}}}$$

    import numpy as np

    def sigmoid(score):
        return 1 / (1 + np.exp(-score))

- when training, we minimize the log-loss

$$\text{Loss}(w, x, y) = \log(1 + \exp(-y \cdot (w \cdot x)))$$

multiclass LR using the softmax

- the softmax function is used in multiclass LR instead of the logistic:

$$P(y_i \mid x) = \frac{e^{\text{score}_i}}{\sum_k e^{\text{score}_k}}$$

    def softmax(scores):
        expscores = np.exp(scores)
        return expscores / sum(expscores)

[exercise: make softmax numerically stable]

softmax example

    import numpy as np

    def softmax(scores):
        expscores = np.exp(scores)
        return expscores / sum(expscores)

    scores = [-1, 2.2, 1.5, -0.3]
    print(softmax(scores))

    [0.02517067 0.61750026 0.30664156 0.05068751]

[Figure: bar chart of the four softmax probabilities for classes 0 to 3]
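One common way to approach the stability exercise, offered as a sketch rather than the intended solution: subtract the largest score before exponentiating. The result is unchanged because the common factor cancels in the ratio, but `np.exp` no longer overflows for large scores. The name `softmax_stable` is illustrative.

```python
import numpy as np

def softmax_stable(scores):
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)    # the largest exponent becomes 0
    expscores = np.exp(shifted)
    return expscores / expscores.sum()

probs = softmax_stable([-1, 2.2, 1.5, -0.3])   # same values as the plain softmax
big = softmax_stable([1000.0, 1001.0])         # the plain version would overflow here
```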

cross-entropy loss

- when training, the softmax probabilities lead to the cross-entropy loss instead of the log loss

$$\text{Loss}_{CE}(w, x_i, y_i) = -\log P(y_i \mid x_i) = -\log \frac{e^{\text{score}_i}}{\sum_k e^{\text{score}_k}}$$

- just like the log-loss:
  - high probability for the correct label y_i ⇒ low loss
  - low probability for y_i ⇒ high loss
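A small sketch of the cross-entropy loss for a single instance, reusing the scores from the softmax example. The function names and the choice of "correct" class indices are illustrative.

```python
import numpy as np

def softmax(scores):
    expscores = np.exp(scores - np.max(scores))
    return expscores / expscores.sum()

def cross_entropy_loss(scores, correct_index):
    """Negative log of the softmax probability assigned to the correct class."""
    return -np.log(softmax(scores)[correct_index])

scores = np.array([-1, 2.2, 1.5, -0.3])
loss_likely = cross_entropy_loss(scores, 1)     # correct class has the highest score
loss_unlikely = cross_entropy_loss(scores, 0)   # correct class has the lowest score
```

The loss is small when the correct class gets a high probability and large when it gets a low one.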

multiclass LR in scikit-learn

- LogisticRegression(multi_class='multinomial')
- (otherwise, separate classifiers are trained independently)






geometric view

- using lines to separate 2-dimensional data
- using planes to separate 3-dimensional data

margin of separation

- the margin γ denotes how well w separates the classes
- γ is the shortest distance from the separator to the nearest training instance
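For a separator through the origin (score = w · x, no bias term), the distance from a point x to the separator is |w · x| / ‖w‖, so the margin is the smallest such distance over the training set. A sketch with made-up data and a hand-picked w:

```python
import numpy as np

def margin(w, X):
    """Smallest distance from any row of X to the hyperplane w · x = 0."""
    return np.min(np.abs(X @ w)) / np.linalg.norm(w)

# Made-up data and weight vector for illustration.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
w = np.array([3.0, 4.0])

gamma = margin(w, X)
```

Rescaling w does not change the margin, since numerator and denominator scale together.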

large margins are good

- a result from statistical learning theory:

$$\text{true error} \le \text{training error} + \text{BigUglyFormula}\left(\frac{1}{\gamma^2}\right)$$

- larger margin → better generalization

[source] http://www.cs.cmu.edu/afs/cs/Web/People/guestrin/Class/10701-S06/Slides/learning-theory-mid-review.pdf

support vector machines

- support vector machines (SVM) or support vector classifiers (SVC) are linear classifiers constructed by selecting the w that maximizes the margin
- note: the solution depends only on the borderline examples: the support vectors

soft-margin SVMs

- in some cases the dataset is inseparable, or nearly inseparable
- soft-margin SVM: allow some examples to be disregarded when maximizing the margin

[Figure: A) hard-margin SVM; B) soft-margin SVM, with a slack ξ_i marked]

stating the SVM as an objective function

- the hard-margin and soft-margin SVM can be stated mathematically in a number of ways
- we'll skip the details, but it can be shown (see Daumé's book) that the soft-margin SVM can be stated as minimizing

$$\frac{C}{N} \cdot \sum_{i=1}^{N} \text{Loss}(w, x_i, y_i) + \frac{1}{2} \|w\|^2$$

where

$$\text{Loss}(w, x, y) = \max(0, 1 - y \cdot (w \cdot x))$$

is called the hinge loss
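A small sketch of the hinge loss and the soft-margin objective, using made-up data. Instances scored beyond the margin (y · score ≥ 1) contribute nothing, while instances inside the margin or on the wrong side are penalized linearly.

```python
import numpy as np

def hinge_loss(w, x, y):
    return max(0.0, 1.0 - y * (w @ x))

def svm_objective(w, X, Y, C):
    """(C/N) * sum of hinge losses + (1/2) * ||w||^2"""
    mean_loss = np.mean([hinge_loss(w, x, y) for x, y in zip(X, Y)])
    return C * mean_loss + 0.5 * (w @ w)

# Made-up example: all three instances are positive, with scores 2, 0.5 and -1.
w = np.array([1.0, 0.0])
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
Y = np.array([1, 1, 1])

losses = [hinge_loss(w, x, y) for x, y in zip(X, Y)]   # 0, 0.5 and 2
objective = svm_objective(w, X, Y, C=1.0)
```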

plot of the hinge loss

[Figure: the hinge loss as a function of y · classifier score. x-axis: y · classifier score (−6 to 6); y-axis: hinge loss (0 to 7)]

in scikit-learn

- linear SVM is called sklearn.svm.LinearSVC

http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
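A minimal usage sketch on made-up, well-separated data. Note that LinearSVC uses the squared hinge loss by default; passing `loss="hinge"` selects the hinge loss, and `dual=True` is set explicitly here because that loss is handled by the dual solver.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up, clearly separable two-class data.
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5],
              [1.0, 2.0], [2.0, 1.0], [1.5, 1.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = LinearSVC(C=1.0, loss="hinge", dual=True)
clf.fit(X, y)

predictions = clf.predict(X)
scores = clf.decision_function(X)   # the linear score for each instance
```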






SVM and LR have convex objective functions

[Figures: two rows of plots, each showing two curves joined by a "+" sign]

optimizing SVM and LR

- since the objective functions of SVM and LR are convex, we can find w by stochastic gradient descent
- pseudocode:
  - set w to some initial value, e.g. all zero
  - iterate a fixed number of times:
    - select a single training instance x
    - select a "suitable" step length η
    - compute the gradient of the hinge loss or log loss
    - subtract step length · gradient from w
- note the similarity to the perceptron!
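The pseudocode can be sketched for L2-regularized logistic regression as follows. The gradient of the log loss with respect to w is −y · x / (1 + exp(y · (w · x))). The fixed step length `eta`, the regularization strength `lam`, the random instance selection and the toy data are assumptions made for illustration.

```python
import numpy as np

def sgd_logistic_regression(X, Y, lam=0.01, eta=0.1, n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # initial value: all zero
    for _ in range(n_steps):                 # iterate a fixed number of times
        i = rng.integers(len(X))             # select a single training instance
        x, y = X[i], Y[i]
        # gradient of log(1 + exp(-y * (w · x))) with respect to w
        grad_loss = -y * x / (1 + np.exp(y * (w @ x)))
        grad = grad_loss + lam * w           # add the gradient of (lam / 2) * ||w||^2
        w = w - eta * grad                   # subtract step length · gradient
    return w

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
Y = np.array([1, 1, -1, -1])

w = sgd_logistic_regression(X, Y)
predictions = np.sign(X @ w)
```

Unlike the perceptron, every instance moves w a little, by an amount that shrinks as the instance is classified more confidently.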

missing pieces

- setting the learning rate η
  - in principle, one can try to select a "small enough" value of η
  - in practice, it's better to decrease η gradually
  - we can use the Pegasos algorithm to set η as follows:

$$\eta = \frac{C}{t} = \frac{1}{\lambda \cdot t}$$

  where
  - t is the current step (1, 2, ...)
  - C or λ is the loss/regularization tradeoff
- gradients for SVM and LR loss functions (hinge and log loss)
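A sketch of stochastic (sub)gradient descent for the linear SVM with the Pegasos step length η = 1 / (λ · t). The hinge loss has subgradient −y · x when y · (w · x) < 1 and 0 otherwise; the regularizer (λ/2) · ‖w‖² contributes λ · w. The default `lam`, the number of steps and the toy data are assumptions for illustration.

```python
import numpy as np

def pegasos_svm(X, Y, lam=0.1, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):
        eta = 1.0 / (lam * t)            # Pegasos step length, decreasing with t
        i = rng.integers(len(X))         # select a single training instance
        x, y = X[i], Y[i]
        if y * (w @ x) < 1:              # inside the margin or misclassified
            grad = lam * w - y * x
        else:                            # hinge loss is zero: only the regularizer
            grad = lam * w
        w = w - eta * grad
    return w

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
Y = np.array([1, 1, -1, -1])

w = pegasos_svm(X, Y)
predictions = np.sign(X @ w)
```

For the log loss, only the `grad` computation would change, using the gradient −y · x / (1 + exp(y · (w · x))) in place of the hinge subgradient.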

Next lecture

- Gradient Boosting
- Evaluation methods
