
Support Vector Machines

Nov 04, 2014


A presentation about support vector machines.
Transcript
Page 1: Support Vector Machines

SUPPORT VECTOR MACHINES

Muhammad Adil Raja

Roaming Researchers, Inc.

September 1, 2014

Page 2: Support Vector Machines

OUTLINE

1 INTRODUCTION

2 INTUITION BEHIND MARGINS

3 NOTATION

4 FUNCTIONAL AND GEOMETRIC MARGINS

5 THE OPTIMAL MARGIN CLASSIFIERS

6 OPTIMAL MARGIN CLASSIFIERS

7 KERNELS

8 REGULARIZATION AND THE NON-SEPARABLE CASE

9 THE SMO ALGORITHM

10 REFERENCES

Page 12: Support Vector Machines

INTRODUCTION I

• Support Vector Machines (SVMs) are among the best “off-the-shelf” supervised learning algorithms.

• Margins.
• Large margin classifiers.
• Lagrange duality.
• Kernels.
• High dimensional feature spaces.
• Infinite dimensional feature spaces.

Page 13: Support Vector Machines

INTUITION BEHIND MARGINS I

• Consider logistic regression.
• The probability p(y = 1|x; θ) is modeled by hθ(x) = g(θT x).
• We would then predict “1” on an input x if and only if hθ(x) ≥ 0.5, or equivalently, if and only if θT x ≥ 0.
• Consider a positive training example (y = 1).
• The larger θT x is, the larger hθ(x) = p(y = 1|x; θ) is, and thus the higher our degree of “confidence” that the label is 1. Thus, informally, we can think of our prediction as being a very confident one that y = 1 if θT x ≫ 0.

• Similarly, we think of logistic regression as making a very confident prediction of y = 0 if θT x ≪ 0 (see the sketch below).

Page 14: Support Vector Machines

INTUITION BEHIND MARGINS II

• Given a training set, again informally it seems that we’d have found a good fit to the training data if we can find θ so that θT x(i) ≫ 0 whenever y(i) = 1, and θT x(i) ≪ 0 whenever y(i) = 0, since this would reflect a very confident (and correct) set of classifications for all the training examples.

• This seems to be a nice goal to aim for.
• We shall formalize this idea using the notion of functional margins.

• For a different type of intuition, consider the following figure, in which x’s represent positive training examples, o’s denote negative training examples, a decision boundary (the line given by the equation θT x = 0, also called the separating hyperplane) is shown, and three points have been labeled A, B and C.

Page 15: Support Vector Machines

INTUITION BEHIND MARGINS III


FIGURE: A hyperplane that separates data.

Page 16: Support Vector Machines

INTUITION BEHIND MARGINS IV

• Notice that the point A is very far from the decision boundary.

• If we are asked to make a prediction for the value of y at A, it seems we should be quite confident that y = 1 there.

• Conversely, the point C is very close to the decision boundary, and while it is on the side of the decision boundary on which we would predict y = 1, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be y = 0.

• Hence, we are much more confident about our prediction at A than at C.

• The point B lies in between these two cases, and more broadly, we see that if a point is far from the separating hyperplane, then we may be significantly more confident in our predictions.

Page 17: Support Vector Machines

INTUITION BEHIND MARGINS V

• Again, informally, it would be nice if, given a training set, we manage to find a decision boundary that allows us to make all correct and confident (meaning far from the decision boundary) predictions on the training examples.

• We’ll formalize this later using the notion of geometric margins.

Page 18: Support Vector Machines

NOTATION I

• Some notation is in order to make the discussion of SVMs easier.

• Consider a linear classifier for a binary classification problem with labels y and features x.

• From now on, we’ll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels.

• Also, rather than parameterizing our linear classifier with the vector θ, we will use parameters w, b, and write our classifier as hw,b(x) = g(wT x + b).

• Here, g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise.
• This “w, b” notation allows us to explicitly treat the intercept term b separately from the other parameters.
• We also drop the convention we had previously of letting x0 = 1 be an extra coordinate in the input feature vector.

Page 19: Support Vector Machines

NOTATION II

• Thus, b takes the role of what was previously θ0, and w takes the role of [θ1, · · · , θn]T.

• Note also that, from our definition of g above, our classifier will directly predict either 1 or −1 (cf. the perceptron algorithm), without first going through the intermediate step of estimating the probability of y being 1 (which was what logistic regression did). A minimal sketch of this classifier appears below.
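To make the w, b notation concrete, here is a minimal numpy sketch (not from the slides; the parameters and inputs are made up) of the classifier hw,b(x) = g(wT x + b), predicting labels in {−1, 1}:

```python
import numpy as np

def g(z):
    """Threshold function: 1 if z >= 0, -1 otherwise."""
    return np.where(z >= 0, 1, -1)

def predict(w, b, X):
    """h_{w,b}(x) = g(w^T x + b), applied row-wise to X."""
    return g(X @ w + b)

# Hypothetical parameters and inputs, for illustration only.
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
print(predict(w, b, X))   # [ 1 -1 -1]
```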

Page 20: Support Vector Machines

FUNCTIONAL AND GEOMETRIC MARGINS I

• Let’s formalize the notions of the functional and geometric margins.

• Given a training example (x(i), y(i)), we define the functional margin of (w, b) with respect to the training example as γ̂(i) = y(i)(wT x(i) + b).

• Note that if y(i) = 1, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need wT x(i) + b to be a large positive number.

• Conversely, if y(i) = −1, then for the functional margin to be large, we need wT x(i) + b to be a large negative number.

• Moreover, if y(i)(wT x(i) + b) > 0, then our prediction on this example is correct.

Page 21: Support Vector Machines

FUNCTIONAL AND GEOMETRIC MARGINS II

• Hence, a large functional margin represents a confident and correct prediction.

• For a linear classifier with the choice of g given above (taking values in {−1, 1}), however, there is one property of the functional margin that makes it not a very good measure of confidence.

• Given our choice of g, we note that if we replace w with 2w and b with 2b, then since g(wT x + b) = g(2wT x + 2b), this would not change hw,b(x) at all.

• g, and hence also hw,b(x), depends only on the sign, but not on the magnitude, of wT x + b.

• However, replacing (w, b) with (2w, 2b) also multiplies our functional margin by a factor of 2.

Page 22: Support Vector Machines

FUNCTIONAL AND GEOMETRIC MARGINS III

• Thus, it seems that by exploiting our freedom to scale w and b, we can make the functional margin arbitrarily large without really changing anything meaningful.

• Intuitively, it might therefore make sense to impose some sort of normalization condition such as ||w||2 = 1; i.e., we might replace (w, b) with (w/||w||2, b/||w||2), and instead consider the functional margin of (w/||w||2, b/||w||2).

• Given a training set S = {(x(i), y(i)); i = 1, ..., m}, we also define the functional margin of (w, b) with respect to S as the smallest of the functional margins of the individual training examples.

• Denoted by γ̂, this can therefore be written: γ̂ = min_{i=1,··· ,m} γ̂(i).

• Consider the picture below to understand geometric margins.

Page 23: Support Vector Machines

FUNCTIONAL AND GEOMETRIC MARGINS IV

• The decision boundary corresponding to (w, b) is shown, along with the vector w.

• Note that w is orthogonal (at 90°) to the separating hyperplane.

• Consider the point at A, which represents the input x(i) of some training example with label y(i) = 1.
• Its distance to the decision boundary, γ(i), is given by the line segment AB.
• How can we find the value of γ(i)?
• Well, w/||w|| is a unit-length vector pointing in the same direction as w.
• Since A represents x(i), we therefore find that the point B is given by x(i) − γ(i) w/||w||.

Page 24: Support Vector Machines

FUNCTIONAL AND GEOMETRIC MARGINS V

• But this point lies on the decision boundary, and all points x on the decision boundary satisfy the equation wT x + b = 0.

• Hence, wT (x(i) − γ(i) w/||w||) + b = 0.

• Solving for γ(i) yields γ(i) = (wT x(i) + b)/||w|| = (w/||w||)T x(i) + b/||w||.

• This was worked out for the case of a positive training example at A in the figure, where being on the “positive” side of the decision boundary is good.

• More generally, we define the geometric margin of (w, b) with respect to a training example (x(i), y(i)) to be γ(i) = y(i)((w/||w||)T x(i) + b/||w||).

• Note that if ||w|| = 1, then the functional margin equals the geometric margin; this gives us a way of relating these two different notions of margin.

Page 25: Support Vector Machines

FUNCTIONAL AND GEOMETRIC MARGINS VI

• Also, the geometric margin is invariant to rescaling of the parameters; i.e., if we replace w with 2w and b with 2b, then the geometric margin does not change.

• Specifically, because of this invariance to the scaling of the parameters, when trying to fit w and b to training data, we can impose an arbitrary scaling constraint on w without changing anything important; for instance, we can demand that ||w|| = 1, or |w1| = 5, or |w1 + b| + |w2| = 2, and any of these can be satisfied simply by rescaling w and b.

• Finally, given a training set S = {(x(i), y(i)); i = 1, ..., m}, we also define the geometric margin of (w, b) with respect to S to be the smallest of the geometric margins on the individual training examples: γ = min_{i=1,··· ,m} γ(i) (see the sketch below).
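A minimal numpy sketch (not from the slides; the toy data and the choice of (w, b) are assumed) that computes the functional margins γ̂(i) = y(i)(wT x(i) + b), the geometric margins γ(i) = γ̂(i)/||w||, and their minima over the training set:

```python
import numpy as np

# Toy linearly separable data, for illustration only.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A hypothetical (w, b); any classifier that separates the data works here.
w = np.array([1.0, 1.0])
b = 0.0

functional_margins = y * (X @ w + b)                         # gamma_hat^(i)
geometric_margins = functional_margins / np.linalg.norm(w)   # gamma^(i)

print("functional margins:", functional_margins)   # [4. 4. 3. 4.]
print("geometric margins: ", geometric_margins)
print("gamma_hat =", functional_margins.min())      # margin w.r.t. the whole set
print("gamma     =", geometric_margins.min())
```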

Page 26: Support Vector Machines

FUNCTIONAL AND GEOMETRIC MARGINS VII


FIGURE: Geometric margins.

Page 27: Support Vector Machines

THE OPTIMAL MARGIN CLASSIFIER I

• Given a training set, a natural goal is to try to find a decision boundary that maximizes the (geometric) margin, since this would reflect a very confident set of predictions on the training set and a good “fit” to the training data.

• Specifically, this will result in a classifier that separates the positive and the negative training examples with a “gap” (geometric margin).

• For now, we will assume that we are given a training set that is linearly separable; i.e., that it is possible to separate the positive and negative examples using some separating hyperplane.

• How do we find the one that achieves the maximum geometric margin?

Page 28: Support Vector Machines

THE OPTIMAL MARGIN CLASSIFIER II

• We can pose the following optimization problem:
  max_{γ,w,b} γ
  s.t. y(i)(wT x(i) + b) ≥ γ, i = 1, · · · , m
       ||w|| = 1.

• I.e., we want to maximize γ, subject to each training example having functional margin at least γ.

• The ||w|| = 1 constraint moreover ensures that the functional margin equals the geometric margin, so we are also guaranteed that all the geometric margins are at least γ.

• Thus, solving this problem will result in (w, b) with the largest possible geometric margin with respect to the training set.

• If we could solve the optimization problem above, we’d be done.

Page 29: Support Vector Machines

THE OPTIMAL MARGIN CLASSIFIER III

• But the “||w|| = 1” constraint is a nasty (non-convex) one, and this problem certainly isn’t in any format that we can plug into standard optimization software to solve.

• So, let’s try transforming the problem into a nicer one. Consider:
  max_{γ̂,w,b} γ̂/||w||
  s.t. y(i)(wT x(i) + b) ≥ γ̂, i = 1, · · · , m.

• Here, we are going to maximize γ̂/||w||, subject to the functional margins all being at least γ̂.

• Since the geometric and functional margins are related by γ = γ̂/||w||, this will give us the answer we want.

• Moreover, we’ve gotten rid of the constraint ||w|| = 1 that we didn’t like.

Page 30: Support Vector Machines

THE OPTIMAL MARGIN CLASSIFIER IV

• The downside is that we now have a nasty (again, non-convex) objective function γ̂/||w||; and we still don’t have any off-the-shelf software that can solve this form of an optimization problem.

• Recall that we can add an arbitrary scaling constraint on w and b without changing anything.

• This is the key idea we’ll use now.
• We will introduce the scaling constraint that the functional margin of (w, b) with respect to the training set must be 1: γ̂ = 1.

• Since multiplying w and b by some constant results in the functional margin being multiplied by that same constant, this is indeed a scaling constraint, and can be satisfied by rescaling w, b.

Page 31: Support Vector Machines

THE OPTIMAL MARGIN CLASSIFIER V

• Plugging this into our problem above, and noting that maximizing γ̂/||w|| = 1/||w|| is the same thing as minimizing ||w||2, we now have the following optimization problem:
  min_{γ,w,b} (1/2)||w||2
  s.t. y(i)(wT x(i) + b) ≥ 1, i = 1, · · · , m.

• We’ve now transformed the problem into a form that can be efficiently solved.
• The above is an optimization problem with a convex quadratic objective and only linear constraints.
• Its solution gives us the optimal margin classifier (see the sketch below).
• This optimization problem can be solved using commercial quadratic programming (QP) code.
• While we could call the problem solved here, what we will instead do is make a digression to talk about Lagrange duality.

Page 32: Support Vector Machines

THE OPTIMAL MARGIN CLASSIFIER VI

• This will lead us to our optimization problem’s dual form, which will play a key role in allowing us to use kernels to get optimal margin classifiers to work efficiently in very high dimensional spaces.

• The dual form will also allow us to derive an efficient algorithm for solving the above optimization problem that will typically do much better than generic QP software.

Page 33: Support Vector Machines

LAGRANGE DUALITY I

• Consider a problem of the following form:
  min_w f(w)
  s.t. hi(w) = 0, i = 1, · · · , l.

• In this method, we define the Lagrangian to be ζ(w, β) = f(w) + Σ_{i=1}^{l} βi hi(w).

• Here, the βi’s are called the Lagrange multipliers.
• We would then find and set ζ’s partial derivatives to zero, ∂ζ/∂wi = 0 and ∂ζ/∂βi = 0, and solve for w and β.

• We will generalize this to constrained optimization problems in which we may have inequality as well as equality constraints.

• Consider the following, which we’ll call the primal optimization problem:
  min_w f(w)
  s.t. gi(w) ≤ 0, i = 1, · · · , k
       hi(w) = 0, i = 1, · · · , l.

Page 34: Support Vector Machines

LAGRANGE DUALITY II

• To solve it, we start by defining the generalized Lagrangian: ζ(w, α, β) = f(w) + Σ_{i=1}^{k} αi gi(w) + Σ_{i=1}^{l} βi hi(w).

• Here, the αi’s and βi’s are the Lagrange multipliers.
• Consider the quantity θP(w) = max_{α,β: αi≥0} ζ(w, α, β).

• Here, the “P” subscript stands for “primal”.
• Let some w be given.
• If w violates any of the primal constraints (i.e., if either gi(w) > 0 or hi(w) ≠ 0 for some i), then you should be able to verify that:

  θP(w) = max_{α,β: αi≥0} [f(w) + Σ_{i=1}^{k} αi gi(w) + Σ_{i=1}^{l} βi hi(w)]  (1)
        = ∞.  (2)

Page 35: Support Vector Machines

LAGRANGE DUALITY III

• Conversely, if the constraints are indeed satisfied for a particular value of w, then θP(w) = f(w).

• Hence,

  θP(w) = { f(w)  if w satisfies the primal constraints,
            ∞     otherwise.  (3)

• Thus, θP takes the same value as the objective in our problem for all values of w that satisfy the primal constraints, and is positive infinity if the constraints are violated.

• Hence, consider the minimization problem min_w θP(w) = min_w max_{α,β: αi≥0} ζ(w, α, β).

• We see that it is the same problem as (i.e., it has the same solutions as) our original primal problem.

Page 36: Support Vector Machines

LAGRANGE DUALITY IV

• For later use, we also define the optimal value of the objective to be p∗ = min_w θP(w);

• We call this the value of the primal problem.
• Now, let’s look at a slightly different problem.
• We define: θD(α, β) = min_w ζ(w, α, β).
• Here, the “D” subscript stands for “dual”.
• Note also that whereas in the definition of θP we were optimizing (maximizing) with respect to α, β, here we are minimizing with respect to w.

• We can now pose the dual optimization problem: max_{α,β: αi≥0} θD(α, β) = max_{α,β: αi≥0} min_w ζ(w, α, β).

• This is exactly the same as the primal problem shown above, except that the order of the “max” and the “min” are now exchanged.

Page 37: Support Vector Machines

LAGRANGE DUALITY V

• We also define the optimal value of the dual problem’s objective to be d∗ = max_{α,β: αi≥0} θD(α, β).

• How are the primal and the dual problems related?
• It can easily be shown that d∗ = max_{α,β: αi≥0} min_w ζ(w, α, β) ≤ min_w max_{α,β: αi≥0} ζ(w, α, β) = p∗.

• This follows from the “max min” of a function always being less than or equal to the “min max”.

• However, under certain conditions, we will have d∗ = p∗, so that we can solve the dual problem in lieu of the primal problem.

• Let’s see what these conditions are.
• Suppose f and the gi’s are convex, and the hi’s are affine.

Page 38: Support Vector Machines

LAGRANGE DUALITY VI

• Suppose further that the constraints gi are (strictly) feasible; this means that there exists some w so that gi(w) < 0 for all i.

• Under our above assumptions, there must exist w∗, α∗, β∗ so that w∗ is the solution to the primal problem, α∗, β∗ are the solution to the dual problem, and moreover p∗ = d∗ = ζ(w∗, α∗, β∗).

Page 39: Support Vector Machines

LAGRANGE DUALITY VII

• Moreover, w∗, α∗ and β∗ satisfy the Karush-Kuhn-Tucker (KKT) conditions, which are as follows:

  ∂ζ/∂wi (w∗, α∗, β∗) = 0, i = 1, · · · , n  (4)
  ∂ζ/∂βi (w∗, α∗, β∗) = 0, i = 1, · · · , l  (5)
  α∗i gi(w∗) = 0, i = 1, · · · , k  (6)
  gi(w∗) ≤ 0, i = 1, · · · , k  (7)
  α∗i ≥ 0, i = 1, · · · , k  (8)

• Moreover, if some w∗, α∗, β∗ satisfy the KKT conditions, then it is also a solution to the primal and dual problems.

Page 40: Support Vector Machines

LAGRANGE DUALITY VIII

• We draw attention to Equation 6, which is called the KKT dual complementarity condition.

• Specifically, it implies that if α∗i > 0, then gi(w∗) = 0. (I.e., the “gi(w) ≤ 0” constraint is active, meaning it holds with equality rather than with inequality.)

• Later on, this will be key for showing that the SVM has only a small number of “support vectors”; the KKT dual complementarity condition will also give us our convergence test when we talk about the SMO algorithm.

Page 41: Support Vector Machines

OPTIMAL MARGIN CLASSIFIERS I

• Previously, we posed the following (primal) optimization problem for finding the optimal margin classifier:
  min_{γ,w,b} (1/2)||w||2
  s.t. y(i)(wT x(i) + b) ≥ 1, i = 1, · · · , m.

• We can write the constraints as: gi(w) = −y(i)(wT x(i) + b) + 1 ≤ 0.

• We have one such constraint for each training example.
• Note that from the KKT dual complementarity condition, we will have αi > 0 only for the training examples that have functional margin exactly equal to one (i.e., the ones corresponding to constraints that hold with equality, gi(w) = 0).

• Consider the figure below, in which a maximum margin separating hyperplane is shown by the solid line.

Page 42: Support Vector Machines

OPTIMAL MARGIN CLASSIFIERS II


FIGURE: Optimum margin classifiers.

Page 43: Support Vector Machines

OPTIMAL MARGIN CLASSIFIERS III

• The points with the smallest margins are exactly the ones closest to the decision boundary.

• Here, these are the three points (one negative and two positive examples) that lie on the dashed lines parallel to the decision boundary.

• Thus, only three of the αi’s, namely the ones corresponding to these three training examples, will be non-zero at the optimal solution to our optimization problem.

• These three points are called the support vectors in this problem.
• The fact that the number of support vectors can be much smaller than the size of the training set will be useful later.

Page 44: Support Vector Machines

OPTIMAL MARGIN CLASSIFIERS IV

• Looking ahead, as we develop the dual form of the problem, one key idea to watch out for is that we’ll try to write our algorithm in terms of only the inner product 〈x(i), x(j)〉 (think of this as (x(i))T x(j)) between points in the input feature space.

• The fact that we can express our algorithm in terms of these inner products will be key when we apply the kernel trick.

• When we construct the Lagrangian for our optimization problem we have:
  ζ(w, b, α) = (1/2)||w||2 − Σ_{i=1}^{m} αi [y(i)(wT x(i) + b) − 1].

• Note that there are only “αi” but no “βi” Lagrange multipliers, since the problem has only inequality constraints.

• Let’s find the dual form of the problem.

Page 45: Support Vector Machines

OPTIMAL MARGIN CLASSIFIERS V

• To do so, we need to first minimize ζ(w, b, α) with respect to w and b (for fixed α), to get θD, which can be done by setting the derivatives of ζ with respect to w and b to zero.

• We have: ∇w ζ(w, b, α) = w − Σ_{i=1}^{m} αi y(i) x(i) = 0.

• This implies that: w = Σ_{i=1}^{m} αi y(i) x(i).
• As for the derivative with respect to b, we obtain: ∂ζ(w, b, α)/∂b = Σ_{i=1}^{m} αi y(i) = 0.

• Plugging the value of w back into the Lagrangian, we get:
  ζ(w, b, α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} y(i) y(j) αi αj (x(i))T x(j) − b Σ_{i=1}^{m} αi y(i).

• Since the last term must be zero, we obtain:
  ζ(w, b, α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} y(i) y(j) αi αj (x(i))T x(j).

Page 46: Support Vector Machines

OPTIMAL MARGIN CLASSIFIERS VI

• Recall that we got to the equation above by minimizing ζ with respect to w and b. Putting this together with the constraints αi ≥ 0 (that we always had) and the constraint Σ_{i=1}^{m} αi y(i) = 0, we obtain the following dual optimization problem (see the sketch below):
  max_α W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} y(i) y(j) αi αj 〈x(i), x(j)〉
  s.t. αi ≥ 0, i = 1, · · · , m
       Σ_{i=1}^{m} αi y(i) = 0.

• It should also be verified that the conditions required for p∗ = d∗ and the KKT conditions to hold are indeed satisfied in our optimization problem.

• Hence, we can solve the dual in lieu of solving the primal problem.
• Specifically, in the dual problem above, we have a maximization problem in which the parameters are the αi’s.
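For illustration only, here is one way to solve this dual numerically with the cvxopt QP solver (assuming cvxopt is available; the toy data is made up). The dual is rewritten in the standard form min (1/2)αT Pα + qT α with P_ij = y(i)y(j)〈x(i), x(j)〉, q = −1, the inequality α ≥ 0, and the equality yT α = 0:

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data, for illustration only.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

K = X @ X.T                          # Gram matrix of inner products <x(i), x(j)>
P = matrix(np.outer(y, y) * K)       # P_ij = y(i) y(j) <x(i), x(j)>
q = matrix(-np.ones(m))              # maximizing sum(alpha) == minimizing -sum(alpha)
G = matrix(-np.eye(m))               # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(m))
A = matrix(y.reshape(1, -1))         # sum_i alpha_i y(i) = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

w = (alpha * y) @ X                  # w = sum_i alpha_i y(i) x(i)
sv = alpha > 1e-6                    # support vectors have alpha_i > 0
b_opt = np.mean(y[sv] - X[sv] @ w)
print("alpha =", np.round(alpha, 4))
print("w =", w, "b =", b_opt)
```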

Page 47: Support Vector Machines

OPTIMAL MARGIN CLASSIFIERS VII

• Having found w∗, by considering the primal problem, it is also straightforward to find the optimal value for the intercept term b as:

  b∗ = −( max_{i: y(i)=−1} w∗T x(i) + min_{i: y(i)=1} w∗T x(i) ) / 2.  (9)

• Suppose we’ve fit our model’s parameters to a training set, and now wish to make a prediction at a new input point x.

• We would then calculate wT x + b, and predict y = 1 if and only if this quantity is bigger than zero.

• But this quantity can also be written:

  wT x + b = (Σ_{i=1}^{m} αi y(i) x(i))T x + b  (10)
           = Σ_{i=1}^{m} αi y(i) 〈x(i), x〉 + b.  (11)

Page 48: Support Vector Machines

OPTIMAL MARGIN CLASSIFIERS VIII

• Hence, if we’ve found the αi’s, in order to make a prediction, we have to calculate a quantity that depends only on the inner products between x and the points in the training set.

• Moreover, the αi’s will all be zero except for the support vectors.

• Thus, many of the terms in the sum above will be zero, and we really need to find only the inner products between x and the support vectors (of which there is often only a small number) in order to calculate (11) and make our prediction (see the sketch below).

• By examining the dual form of the optimization problem, we gained significant insight into the structure of the problem, and were also able to write the entire algorithm in terms of only inner products between input feature vectors.
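A minimal numpy sketch of Equation (11), assuming the αi's, the training data, and b have already been obtained (for instance from a dual solve as sketched earlier); only the support vectors contribute to the sum:

```python
import numpy as np

def predict(alpha, X_train, y_train, b, x_new, tol=1e-6):
    """Evaluate w^T x + b = sum_i alpha_i y(i) <x(i), x> + b using only support vectors."""
    sv = alpha > tol                           # indices with alpha_i > 0
    score = np.sum(alpha[sv] * y_train[sv] * (X_train[sv] @ x_new)) + b
    return 1 if score > 0 else -1

# Hypothetical values, for illustration only.
alpha = np.array([0.25, 0.0, 0.25, 0.0])
X_train = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y_train = np.array([1, 1, -1, -1])
b = 0.0

print(predict(alpha, X_train, y_train, b, np.array([1.0, 1.0])))    # 1
print(predict(alpha, X_train, y_train, b, np.array([-1.0, -1.0])))  # -1
```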

Page 49: Support Vector Machines

OPTIMAL MARGIN CLASSIFIERS IX

• In the next section, we will exploit this property to apply kernels to our classification problem.

• The resulting algorithm, support vector machines, will be able to learn efficiently in very high dimensional spaces.

Page 50: Support Vector Machines

KERNELS I

THEOREM (MERCER)

Let K : Rn × Rn → R be given. Then for K to be a valid (Mercer) kernel, it is necessary and sufficient that for any x(1), · · · , x(m) (m < ∞), the corresponding kernel matrix is symmetric positive semi-definite.
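As a small illustration of the theorem (not from the slides), the following numpy sketch builds the kernel matrix Kij = K(x(i), x(j)) for a Gaussian (RBF) kernel on some made-up points and checks that it is symmetric with non-negative eigenvalues:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

# Made-up sample points, for illustration only.
X = np.random.default_rng(0).normal(size=(5, 3))
m = len(X)

# Kernel matrix K_ij = K(x(i), x(j)).
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

print("symmetric:", np.allclose(K, K.T))
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())   # >= 0 (up to numerical error)
```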

Page 51: Support Vector Machines

REGULARIZATION AND THE NON-SEPARABLE CASE I

• The derivation of the SVM as presented so far assumed that the data is linearly separable.

• While mapping data to a high dimensional feature space via φ does generally increase the likelihood that the data is separable, we can’t guarantee that it always will be so.

• Also, in some cases it is not clear that finding a separating hyperplane is exactly what we’d want to do, since that might be susceptible to outliers.

• For instance, the left figure below shows an optimal margin classifier, and when a single outlier is added in the upper-left region (right figure), it causes the decision boundary to make a dramatic swing, and the resulting classifier has a much smaller margin.

Page 52: Support Vector Machines

REGULARIZATION AND THE NON-SEPARABLE CASE II

• To make the algorithm work for non-linearly separable datasets as well as be less sensitive to outliers, we reformulate our optimization (using ℓ1 regularization) as follows:
  min_{γ,w,b} (1/2)||w||2 + C Σ_{i=1}^{m} ξi
  s.t. y(i)(wT x(i) + b) ≥ 1 − ξi, i = 1, · · · , m
       ξi ≥ 0, i = 1, · · · , m.

FIGURE: Regularization and non-separable case.

• Examples are now permitted to have (functional) margin less than 1, and if an example has functional margin 1 − ξi (with ξi > 0), we pay a cost of the objective function being increased by Cξi.

Page 53: Support Vector Machines

REGULARIZATION AND THE NON-SEPARABLE CASE III

• The parameter C controls the relative weighting between the twin goals of making ||w||2 small (which we saw earlier makes the margin large) and of ensuring that most examples have functional margin at least 1 (see the sketch below).
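To illustrate the role of C (this sketch is not from the slides; the data, the outlier, and (w, b) are made up), the ℓ1-regularized objective can be evaluated directly by using the fact that at the optimum ξi = max(0, 1 − y(i)(wT x(i) + b)):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i xi_i, with xi_i = max(0, 1 - y(i)(w^T x(i) + b))."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slacks.sum(), slacks

# Made-up data with one outlier, and a made-up (w, b), for illustration only.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0], [2.0, 1.5]])
y = np.array([1, 1, -1, -1, -1])          # the last example is an outlier
w, b = np.array([1.0, 1.0]), 0.0

for C in (0.1, 1.0, 100.0):
    obj, slacks = soft_margin_objective(w, b, X, y, C)
    print(f"C = {C:6.1f}  objective = {obj:8.2f}  slacks = {np.round(slacks, 2)}")
```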

Page 54: Support Vector Machines

THE SMO ALGORITHM I

• Coordinate Ascent.

• SMO.

Page 55: Support Vector Machines

THE SMO ALGORITHM II

• The ellipses in the figure below are the contours of a quadratic function that we want to optimize. Coordinate ascent was initialized at (2, −2), and also plotted is the path that it took on its way to the global maximum. Notice that on each step, coordinate ascent takes a step that is parallel to one of the axes, since only one variable is being optimized at a time (see the sketch after the figure).

• We close off the discussion of SVMs by sketching the derivation of the SMO algorithm. Here is the (dual) optimization problem that we want to solve:
  max_α W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} y(i) y(j) αi αj 〈x(i), x(j)〉  (17)
  s.t. 0 ≤ αi ≤ C, i = 1, · · · , m  (18)
       Σ_{i=1}^{m} αi y(i) = 0.  (19)

• Let’s say we have a set of αi’s that satisfy the constraints (18) and (19). Now, suppose we want to hold α2, · · · , αm fixed, and take a coordinate ascent step and reoptimize the objective with respect to α1. Can we make any progress? The answer is no, because the constraint (19) ensures that α1 y(1) = −Σ_{i=2}^{m} αi y(i).

FIGURE: Coordinate ascent.
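For illustration (not from the slides), here is a minimal coordinate ascent loop that maximizes a made-up concave quadratic by exactly optimizing one coordinate at a time while holding the others fixed, as described above:

```python
import numpy as np

# Made-up concave quadratic W(a) = c^T a - 0.5 * a^T Q a, with Q positive definite.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
c = np.array([1.0, 2.0])

def W(a):
    return c @ a - 0.5 * a @ Q @ a

a = np.array([2.0, -2.0])   # initialized at (2, -2), as in the figure
for sweep in range(20):
    for i in range(len(a)):
        # Exactly maximize W over coordinate i, holding the other coordinates fixed:
        # dW/da_i = c_i - sum_j Q_ij a_j = 0, solved for a_i.
        a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]

print("coordinate ascent solution:", a)
print("closed-form maximizer:     ", np.linalg.solve(Q, c))
print("W at the solution:", W(a))
```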

Page 56: Support Vector Machines

THE SMO ALGORITHM III

• From the constraints (18), we know that α1 and α2 must lie within the box [0, C] × [0, C] shown in the figure below. Also plotted is the line α1 y(1) + α2 y(2) = ζ, on which we know α1 and α2 must lie.

• Note also that, from these constraints, we know L ≤ α2 ≤ H; otherwise, (α1, α2) cannot simultaneously satisfy both the box and the straight-line constraint. In this example, L = 0. But depending on what the line α1 y(1) + α2 y(2) = ζ looks like, this won’t always necessarily be the case; more generally, there will be some lower bound L and some upper bound H on the permissible values for α2 that will ensure that α1, α2 lie within the box [0, C] × [0, C].

• Using the line constraint, we can also write α1 as a function of α2: α1 = (ζ − α2 y(2)) y(1). (We again used the fact that y(1) ∈ {−1, 1}, so that (y(1))2 = 1.)

• Hence, the objective W(α) can be written W(α1, α2, . . . , αm) = W((ζ − α2 y(2)) y(1), α2, . . . , αm). Treating α3, . . . , αm as constants, this is just some quadratic function in α2; i.e., it can be expressed in the form a α2^2 + b α2 + c for some appropriate a, b, and c.

• If we ignore the “box” constraints (18) (or, equivalently, that L ≤ α2 ≤ H), then we can easily maximize this quadratic by setting its derivative to zero and solving. Let α2^{new,unclipped} denote the resulting value of α2. If we instead want to maximize W with respect to α2 subject to the box constraint, then we can find the optimal value simply by taking α2^{new,unclipped} and “clipping” it to lie in the [L, H] interval (see the sketch after the figure).

FIGURE: SMO.
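A minimal sketch of the clipping step (not from the slides): the bound formulas for L and H are the standard ones from the SMO derivation rather than something stated on these slides, and the unclipped value of α2 is passed in as an assumption since the update rule itself is not derived here:

```python
import numpy as np

def compute_bounds(alpha1, alpha2, y1, y2, C):
    """Bounds L, H on alpha2 so that (alpha1, alpha2) stays in the box [0, C] x [0, C]
    while remaining on the line alpha1*y1 + alpha2*y2 = zeta (standard SMO bounds)."""
    if y1 != y2:
        L, H = max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    else:
        L, H = max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)
    return L, H

def clip(alpha2_new_unclipped, L, H):
    """Clip the unconstrained maximizer of the quadratic in alpha2 to [L, H]."""
    return float(np.clip(alpha2_new_unclipped, L, H))

# Hypothetical values, for illustration only.
alpha1, alpha2, y1, y2, C = 0.3, 0.6, 1, -1, 1.0
L, H = compute_bounds(alpha1, alpha2, y1, y2, C)
print("L =", L, "H =", H)                  # L = 0.3, H = 1.0
print("clipped alpha2:", clip(1.7, L, H))  # 1.0
```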

Page 57: Support Vector Machines

REFERENCES

• The pictures and inspiration for preparing these slides were taken from Andrew Ng’s course on machine learning.

• The course can be found online on Stanford Engineering Everywhere and Coursera.