Jun 19, 2020
Applied Machine Learning
Lecture 5-2: Logistic regression and SVM
Selpi ([email protected])
The slides are further development of Richard Johansson's slides
February 7, 2020
Overview
A bit on perceptron
logistic regression
training a logistic regression classifier
detour: multiclass linear classifiers
support vector classification
optimizing the LR and SVM objectives
The perceptron algorithm and the simpler version
Perceptron algorithm
w = (0, ..., 0)
repeat N times
    for (x_i, y_i) in the training set
        score = w · x_i
        if a pos. is misclassified
            w = w + x_i
        else if a neg. is misclassified
            w = w − x_i
return w
The simpler version, if the y_i are coded as +1 or -1:
w = (0, ..., 0)
for (x_i, y_i) in the training set
    score = w · x_i
    if y_i · score ≤ 0
        w = w + y_i · x_i
return w
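A minimal Python sketch of the simpler version, assuming NumPy and labels coded as +1 / -1. The outer loop over `n_epochs` mirrors the "repeat N times" of the first version; the function name, the parameter name and the toy data are made up for illustration.

```python
import numpy as np

def train_perceptron(X, y, n_epochs=10):
    """Perceptron for labels coded as +1 / -1 (illustrative sketch)."""
    w = np.zeros(X.shape[1])           # w = (0, ..., 0)
    for _ in range(n_epochs):          # repeat N times
        for x_i, y_i in zip(X, y):
            score = w @ x_i
            if y_i * score <= 0:       # misclassified, or exactly on the boundary
                w = w + y_i * x_i
    return w

# toy, linearly separable data (made up); the last column is a constant bias feature
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
predictions = np.sign(X @ w)
```

On this toy set the weights stop changing after the first pass, since every instance then has `y_i * score > 0`.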
how can we get the "certainty" of a linear classifier?
score = w · x
- large positive score: quite certain that x belongs to the positive class
- large negative score: quite certain that x belongs to the negative class
- near zero: we are unsure
The logistic regression model
- logistic regression is a method to train a linear classifier that gives a probabilistic output
- how to get the probability? use a logistic or sigmoid function:
  $P(\text{positive output} \mid x) = \frac{1}{1 + e^{-\text{score}}}$
  where $e^{-\text{score}}$ = np.exp(-score)
  $P(\text{negative output} \mid x) = 1 - \frac{1}{1 + e^{-\text{score}}} = \frac{1}{1 + e^{\text{score}}}$
the logistic / sigmoid function [Verhulst, 1845]
[figure: the sigmoid curve; x-axis: classifier score (-6 to 6), y-axis: P(y = positive | x) (0.0 to 1.0)]
interpreting the linear classifier's score
- the output score w · x can now be interpreted as the log odds in favor of the positive outcome
- odds: how much more likely is the positive outcome than the negative outcome?
  $\text{odds} = \frac{p}{1 - p}$
Knew that we ventured on such dangerous seas
That if we wrought out life 'twas ten to one
Shakespeare, Henry IV, Part II, Act I, Scene 1, lines 183–4.
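The log-odds reading can be checked numerically: if p comes from the sigmoid, then log(p / (1 − p)) gives back the score. A small sketch using only the standard library; the function names and the example score are arbitrary choices for illustration.

```python
import math

def sigmoid(score):
    return 1 / (1 + math.exp(-score))

def log_odds(p):
    return math.log(p / (1 - p))

score = 2.2                  # arbitrary example score
p = sigmoid(score)           # P(positive output | x)
recovered = log_odds(p)      # equals the score, up to rounding

# odds of ten to one correspond to p = 10/11 for the more likely outcome
p_ten_to_one = 10 / (10 + 1)
odds = p_ten_to_one / (1 - p_ten_to_one)
```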
making it a bit more compact
- if we code the positive class as +1 and the negative class as -1, then we can write the probability a bit more neatly:
  $P(y \mid x) = \frac{1}{1 + e^{-y \cdot \text{score}}}$
in scikit-learn
- LR is called sklearn.linear_model.LogisticRegression
- predict_proba gives the probability output
code example: using a logistic regression classifier
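The code shown in the lecture is not part of these notes, so what follows is only a stand-in sketch, assuming scikit-learn and NumPy are installed. The one-dimensional toy data is made up, and default hyperparameters are used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy 1-D data (made up): small values are negative, large values are positive
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

X_new = np.array([[-2.5], [0.1], [2.5]])
# one row per instance, one column per class, in the order of clf.classes_
proba = clf.predict_proba(X_new)
# decision_function returns the raw linear score (w · x plus an intercept)
scores = clf.decision_function(X_new)
```

For a binary problem, the second column of `proba` is the sigmoid of `scores`, which connects the scikit-learn output back to the formula above.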
recall: the maximum likelihood principle
- in a probabilistic model, we can train the model by selecting parameters that assign a high probability to the data
- in our case, the parameters are the weight vector w
- adjust w so that each output label gets a high probability
the likelihood function
- formally, the "probability of the data" is defined by the likelihood function
- this is the product of the probabilities of all m individual training instances:
  $L(w) = P(y_1 \mid x_1) \cdot \ldots \cdot P(y_m \mid x_m)$
- in our case, this means
  $L(w) = \frac{1}{1 + e^{-y_1 \cdot (w \cdot x_1)}} \cdot \ldots \cdot \frac{1}{1 + e^{-y_m \cdot (w \cdot x_m)}}$
rewriting a bit...
- we rewrite the previous formula
  $L(w) = \frac{1}{1 + e^{-y_1 \cdot (w \cdot x_1)}} \cdot \ldots \cdot \frac{1}{1 + e^{-y_m \cdot (w \cdot x_m)}}$
  as
  $-\log L(w) = \text{Loss}(w, x_1, y_1) + \ldots + \text{Loss}(w, x_m, y_m)$
  where
  $\text{Loss}(w, x, y) = \log(1 + \exp(-y \cdot (w \cdot x)))$
  is called the log loss function
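The rewriting step can be verified numerically: the negative log of the product of probabilities equals the sum of the per-instance log losses. A sketch assuming NumPy and labels coded as +1 / -1; the weights and data are made up.

```python
import numpy as np

def log_loss(w, x, y):
    """Log loss of one instance; y must be +1 or -1."""
    return np.log(1 + np.exp(-y * (w @ x)))

def likelihood(w, X, Y):
    """Product of P(y_i | x_i) over the training set."""
    return np.prod(1 / (1 + np.exp(-Y * (X @ w))))

# made-up weights and data
w = np.array([0.5, -1.0])
X = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, -1.0]])
Y = np.array([-1, 1, 1])

neg_log_lik = -np.log(likelihood(w, X, Y))
total_loss = sum(log_loss(w, x, y) for x, y in zip(X, Y))
```

In practice one works with the sum of losses directly, since the product of many probabilities quickly underflows.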
plot of the log loss
[figure: log loss plotted against y * classifier score (-6 to 6); loss axis from 0 to 6]
The fundamental tradeoff in machine learning
- goodness of fit: the learned classifier should be able to correctly classify the examples in the training data
- regularization: the classifier should be simple
- but so far in our LR description, we've just taken care of the first part!
  $-\log L(w) = \text{Loss}(w, x_1, y_1) + \ldots + \text{Loss}(w, x_m, y_m)$
regularization in logistic regression models
- just like we saw for linear regression models (Ridge and Lasso), we can add a regularizer that keeps the weights small
- most commonly, the L2 regularizer:
  $\|w\|^2 = w_1 \cdot w_1 + \ldots + w_n \cdot w_n = w \cdot w$
- ... or an L1 regularizer:
  $\|w\|_1 = |w_1| + \ldots + |w_n|$
  which will do some feature selection
combining the pieces
- we combine the loss and the regularizer:
  $\frac{1}{N} \cdot \sum_{i=1}^{N} \text{Loss}(w, x_i, y_i) + \frac{\lambda}{2} \cdot \|w\|^2$
- in this formula, λ is a "tweaking" parameter that controls the tradeoff between loss and regularization
- note: in some formulations (including scikit-learn), there is a parameter C instead of the λ that is put before the loss
  $\frac{C}{N} \cdot \sum_{i=1}^{N} \text{Loss}(w, x_i, y_i) + \frac{1}{2} \|w\|^2$
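The two formulations differ only by a constant factor when C = 1/λ, so they have the same minimizer. A sketch of both objectives as written on the slide, assuming NumPy and the log loss; the function names and data are made up.

```python
import numpy as np

def log_loss(w, x, y):
    return np.log(1 + np.exp(-y * (w @ x)))

def objective_lambda(w, X, Y, lam):
    """(1/N) * sum of losses + (lambda/2) * ||w||^2"""
    mean_loss = np.mean([log_loss(w, x, y) for x, y in zip(X, Y)])
    return mean_loss + lam / 2 * (w @ w)

def objective_C(w, X, Y, C):
    """(C/N) * sum of losses + (1/2) * ||w||^2"""
    mean_loss = np.mean([log_loss(w, x, y) for x, y in zip(X, Y)])
    return C * mean_loss + 0.5 * (w @ w)

# made-up weights and data
w = np.array([0.5, -1.0])
X = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, -1.0]])
Y = np.array([-1, 1, 1])

lam = 0.1
value_lambda = objective_lambda(w, X, Y, lam)
value_C = objective_C(w, X, Y, C=1 / lam)   # equals value_lambda / lam
```

A large C (small λ) therefore means weak regularization, and a small C means strong regularization.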
check
  $\frac{C}{N} \cdot \sum_{i=1}^{N} \text{Loss}(w, x_i, y_i) + \frac{1}{2} \|w\|^2$
- how do we convert this into an algorithm?
probabilistic justification of the regularizers
- by adding the regularizer, we are carrying out maximum a posteriori estimation instead of maximum likelihood
- in MAP estimation, we use a prior to push the parameters in a desired direction
  $w_{\text{MAP}} = \arg\max_w \; p(\text{data} \mid w) \cdot p(w)$
- L2 regularizer = Gaussian prior
  $-\log p(w) = \lambda \|w\|^2 + \text{constant}$
- L1 regularizer = Laplace prior
  $-\log p(w) = \lambda |w| + \text{constant}$
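As a sketch of where the Gaussian case comes from, assume (this is not stated on the slide) that each of the n weights independently has a zero-mean Gaussian prior with variance σ². Taking the negative log of the MAP objective turns the product into a sum of the loss and a penalty:

```latex
w_{\text{MAP}} = \arg\min_w \; \bigl[ -\log p(\text{data} \mid w) - \log p(w) \bigr]

p(w) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{w_j^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
-\log p(w) = \frac{1}{2\sigma^2}\,\|w\|^2 + \frac{n}{2}\log(2\pi\sigma^2)
```

Under that assumption λ = 1/(2σ²), and the last term does not depend on w, so it plays the role of the constant. A narrow prior (small σ) corresponds to strong regularization.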
[figures: densities of the Gaussian and Laplace priors, plotted over -3 to 3]
two-class (binary) linear classifiers
- a linear classifier is a classifier that is defined in terms of a scoring function like this
  score = w · x
- this is a binary (2-class) classifier:
  - return the first class if the score > 0
  - ... otherwise the second class
- how can we deal with non-binary (multi-class) problems when using linear classifiers?
approaches to multiclass classification
- idea 1: break down the complex problem into simpler problems, train a classifier for each separately
- idea 2: modify the learning algorithm so that it can handle the multiclass case directly
idea 1: reduction from multiclass to binary
- one-versus-rest ("long jump"):
  - for each class c, make a binary classifier to distinguish c from all other classes
  - so if there are n classes, there are n classifiers
  - at test time, we select the class giving the highest score
- one-versus-one ("football league"):
  - for each pair of classes c1 and c2, make a classifier to distinguish c1 from c2
  - if there are n classes, there are $\frac{n \cdot (n-1)}{2}$ classifiers
  - at test time, we select the class that has most "wins"
example
- assume we're training a classifier of fruits and we have the classes apple, orange, mango
- in one-vs-rest, we train the following three classifiers:
  - apple vs orange+mango
  - orange vs apple+mango
  - mango vs apple+orange
- in one-vs-one, we train the following three:
  - apple vs orange
  - apple vs mango
  - orange vs mango
example (continued)
- we train classifiers to distinguish between apple, orange, and mango, using one-vs-rest
  - so we get w_apple, w_orange, w_mango
- for some instance x, the respective scores are
  [-1, 2.2, 1.5]
  so our guess is orange
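The fruit example can be written out in a few lines of plain Python: one-vs-rest picks the class with the highest score, and one-vs-one needs one classifier per unordered pair of classes. This is only an illustration of the bookkeeping, not of the training itself.

```python
from itertools import combinations

classes = ["apple", "orange", "mango"]

# one-vs-rest: one score per class, pick the class with the highest score
scores = [-1, 2.2, 1.5]
guess = classes[scores.index(max(scores))]

# one-vs-one: one classifier per unordered pair of classes
pairs = list(combinations(classes, 2))
n = len(classes)
n_pairwise_classifiers = n * (n - 1) // 2
```

With three classes both reductions happen to need three classifiers; with ten classes, one-vs-rest needs 10 and one-vs-one needs 45.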
in scikit-learn
- scikit-learn includes implementations of both of the methods we have discussed:
  - OneVsRestClassifier
  - OneVsOneClassifier
- however, the built-in algorithms (e.g. Perceptron, LogisticRegression) will do this automatically for you
  - they use one-versus-rest
idea 2: multiclass learning algorithms
- is it good to separate the multiclass task into smaller tasks that are trained independently?
  - maybe training should be similar to testing?
- let's make a model where one-vs-rest is used while training
  - we'll see how this can be done for logistic regression
binary LR: reminders
- the logistic or sigmoid function:
  $P(\text{positive output} \mid x) = \frac{1}{1 + e^{-\text{score}}}$
import numpy as np

def sigmoid(score):
    return 1 / (1 + np.exp(-score))
- when training, we minimize the log-loss
  $\text{Loss}(w, x, y) = \log(1 + \exp(-y \cdot (w \cdot x)))$
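multiclass LR using the softmax
- the softmax function is used in multiclass LR instead of the logistic:
  $P(y_i \mid x) = \frac{e^{\text{score}_i}}{\sum_k e^{\text{score}_k}}$
import numpy as np

def softmax(scores):
    # exponentiate each score, then normalize so the outputs sum to 1
    expscores = np.exp(scores)
    return expscores / expscores.sum()
[exercise: make softmax numerically stable]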
softmax example
import numpy as np

def softmax(scores):
    expscores = np.exp(scores)
    return expscores / expscores.sum()

scores = [-1, 2.2, 1.5, -0.3]
print(softmax(scores))
array([ 0.02517067, 0.61750026, 0.30664156, 0.05068751])
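One common way to approach the stability exercise is to subtract the largest score before exponentiating; this leaves the result unchanged because the common factor cancels in the ratio. A sketch assuming NumPy; the name `softmax_stable` is made up here, and other solutions are possible.

```python
import numpy as np

def softmax_stable(scores):
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()      # largest entry becomes 0, so exp cannot overflow
    expscores = np.exp(shifted)
    return expscores / expscores.sum()

probs = softmax_stable([-1, 2.2, 1.5, -0.3])   # same values as the example above
big = softmax_stable([1000.0, 1001.0])         # np.exp(1000.0) alone would overflow
```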
[figure: the four softmax probabilities plotted by class index (0 to 3)]
cross-entropy loss
- when training, the softmax probabilities lead to the cross-entropy loss instead of the log loss
  $\text{Loss}_{CE}(w, x_i, y_i) = -\log P(y_i \mid x_i) = -\log \frac{e^{\text{score}_i}}{\sum_k e^{\text{score}_k}}$
- just like the log-loss:
  - high probability for the correct label y_i ⇒ low loss
  - low probability for y_i ⇒ high loss
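A small sketch of the cross-entropy loss for a single instance, assuming NumPy and that the correct label is given as a class index. It reuses the scores from the softmax example and computes the log of the normalizer with the usual max-shift so that large scores do not overflow.

```python
import numpy as np

def cross_entropy(scores, correct):
    """-log softmax(scores)[correct], computed via a shifted log-sum-exp."""
    scores = np.asarray(scores, dtype=float)
    m = scores.max()
    log_norm = m + np.log(np.exp(scores - m).sum())
    return log_norm - scores[correct]

scores = [-1, 2.2, 1.5, -0.3]
loss_likely = cross_entropy(scores, 1)     # correct label has the highest score
loss_unlikely = cross_entropy(scores, 0)   # correct label has the lowest score
```

The first loss is the negative log of the 0.6175 probability from the softmax example; the second is much larger, matching the two bullet points above.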
multiclass LR in scikit-learn
- LogisticRegression(multi_class='multinomial')
- (otherwise, separate classifiers are trained independently)
geometric view
- using lines to separate 2-dimensional data
- using planes to separate 3-dimensional data
margin of separation
- the margin γ denotes how well w separates the classes
- γ is the distance from the separator to the nearest training instance
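For a separator through the origin (w · x = 0) that classifies every training instance correctly, the margin can be computed as the smallest value of y_i · (w · x_i) / ‖w‖. A sketch assuming NumPy and labels coded as +1 / -1; the data and the function name are made up.

```python
import numpy as np

def margin(w, X, Y):
    """Smallest signed distance from the separator w · x = 0 to a training instance."""
    return np.min(Y * (X @ w)) / np.linalg.norm(w)

X = np.array([[2.0, 0.0], [3.0, 1.0], [-1.0, 0.0], [-4.0, 2.0]])
Y = np.array([1, 1, -1, -1])

w = np.array([1.0, 0.0])              # separator is the vertical axis
gamma = margin(w, X, Y)               # nearest instance is (-1, 0), at distance 1
gamma_rescaled = margin(5 * w, X, Y)  # rescaling w does not change the margin
```

A negative value would mean that at least one instance is on the wrong side of the separator.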
large margins are good
- a result from statistical learning theory:
  $\text{true error} \le \text{training error} + \text{BigUglyFormula}\left(\frac{1}{\gamma^2}\right)$
- larger margin → better generalization
support vector machines
- support vector machines (SVM) or support vector classifiers (SVC) are linear classifiers constructed by selecting the w that maximizes the margin
- note: the solution depends only on the borderline examples: the support vectors
soft-margin SVMs
- in some cases the dataset is inseparable, or nearly inseparable
- soft-margin SVM: allow some examples to be disregarded when maximizing the margin
[figure: A) Hard Margin SVM, B) Soft Margin SVM; the slack ξ_i is marked for an example x_i]
stating the SVM as an objective function
- the hard-margin and soft-margin SVM can be stated mathematically in a number of ways
- we'll skip the details, but it can be shown (see Daumé's book) that the soft-margin SVM can be stated as minimizing
  $\frac{C}{N} \cdot \sum_{i=1}^{N} \text{Loss}(w, x_i, y_i) + \frac{1}{2} \|w\|^2$
  where
  $\text{Loss}(w, x, y) = \max(0, 1 - y \cdot (w \cdot x))$
  is called the hinge loss
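To make the hinge loss concrete, here is a sketch that evaluates it on three made-up instances with positive label: one well outside the margin, one correctly classified but inside the margin, and one misclassified. The log loss is included for comparison. NumPy is assumed.

```python
import numpy as np

def hinge_loss(w, x, y):
    return max(0.0, 1 - y * (w @ x))

def log_loss(w, x, y):
    return np.log(1 + np.exp(-y * (w @ x)))

w = np.array([1.0, 0.0])
x_far = np.array([3.0, 0.0])      # y * score = 3: correct, outside the margin
x_close = np.array([0.5, 0.0])    # y * score = 0.5: correct, but inside the margin
x_wrong = np.array([-2.0, 0.0])   # y * score = -2: misclassified

hinge_losses = [hinge_loss(w, x, +1) for x in (x_far, x_close, x_wrong)]
log_losses = [log_loss(w, x, +1) for x in (x_far, x_close, x_wrong)]
```

The hinge loss is exactly zero once y · score reaches 1, whereas the log loss stays positive for every instance.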
plot of the hinge loss
[figure: hinge loss plotted against y * classifier score (-6 to 6); loss axis from 0 to 7]
in scikit-learn
- linear SVM is called sklearn.svm.LinearSVC
SVM and LR have convex objective functions
[figures: plots of convex functions, joined by "+"]
optimizing SVM and LR
- since the objective functions of SVM and LR are convex, we can find w by stochastic gradient descent
- pseudocode:
  - set w to some initial value, e.g. all zero
  - iterate a fixed number of times:
    - select a single training instance x
    - select a "suitable" step length η
    - compute the gradient of the hinge loss or log loss
    - subtract step length · gradient from w
- note the similarity to the perceptron!
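The pseudocode can be sketched in Python for the log loss. This version makes several simplifying assumptions that are not on the slide: a fixed step length `eta`, uniform random selection of the training instance, and no regularization term. NumPy is assumed, and the toy data is made up.

```python
import numpy as np

def log_loss_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * (w · x))) with respect to w."""
    return -y * x / (1 + np.exp(y * (w @ x)))

def sgd(X, Y, n_steps=1000, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # initial value: all zero
    for _ in range(n_steps):                 # iterate a fixed number of times
        i = rng.integers(len(X))             # select a single training instance
        w = w - eta * log_loss_gradient(w, X[i], Y[i])
    return w

# toy, linearly separable data (made up)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
Y = np.array([1, 1, -1, -1])
w = sgd(X, Y)
```

Each update adds a positive multiple of y · x to w, just as the perceptron does; the difference is that the multiple shrinks smoothly as the instance becomes more confidently correct, rather than switching between 0 and 1.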
missing pieces
- setting the learning rate η
  - in principle, one can try to select a "small enough" value of η
  - in practice, it's better to decrease η gradually
  - we can use the Pegasos algorithm to set η as follows:
    $\eta = \frac{C}{t} = \frac{1}{\lambda \cdot t}$
    where
    - t is the current step (1, 2, ...)
    - C or λ is the loss/regularization tradeoff
- gradients for SVM and LR loss functions (hinge and log loss)
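Putting the pieces together for the SVM case, here is a sketch of SGD on the hinge loss plus the (λ/2)·‖w‖² regularizer, with the step length η = 1/(λ·t) from above. The hinge loss is not differentiable at y · score = 1, so a subgradient is used there. The function names, the default values and the toy data are assumptions for illustration, and NumPy is assumed.

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of max(0, 1 - y * (w · x)) with respect to w."""
    if y * (w @ x) < 1:
        return -y * x
    return np.zeros_like(w)

def pegasos(X, Y, lam=0.1, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):          # t is the current step (1, 2, ...)
        eta = 1 / (lam * t)                  # decreasing step length
        i = rng.integers(len(X))             # select a single training instance
        # subgradient of hinge loss + (lam / 2) * ||w||^2 on that instance
        grad = hinge_subgradient(w, X[i], Y[i]) + lam * w
        w = w - eta * grad
    return w

# toy, linearly separable data (made up)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
Y = np.array([1, 1, -1, -1])
w = pegasos(X, Y)
```

For logistic regression, the same loop can be used with the log loss gradient, −y · x / (1 + exp(y · (w · x))), in place of the hinge subgradient.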
Next lecture
- Gradient Boosting
- Evaluation methods