Support Vector Machines


Support Vector Machines

Lecturer: Yishay Mansour

Itay Kirshenbaum

Lecture Overview

In this lecture we present in detail one of the most theoretically well-motivated and practically effective classification algorithms in modern machine learning: Support Vector Machines (SVMs).

Lecture Overview – Cont.

We begin by building the intuition behind SVMs, continue by defining the SVM as an optimization problem and discussing how to solve it efficiently, and conclude with an analysis of the error rate of SVMs using two techniques: Leave-One-Out and VC-dimension.

Introduction

The Support Vector Machine is a supervised learning algorithm used to learn a hyperplane that solves the binary classification problem, one of the most extensively studied problems in machine learning.

Binary Classification Problem

Input space: X ⊆ R^n
Output space: Y = {-1, +1}
Training data: S = {(x_1, y_1), ..., (x_m, y_m)}, drawn i.i.d. from a distribution D
Goal: select a hypothesis h ∈ H that best predicts the labels of other points drawn i.i.d. from D

Binary Classification – Cont.

Consider the problem of predicting the success of a new drug based on a patient's height and weight. m ill people are selected and treated, which generates m 2-dimensional vectors (height and weight). Each point is assigned +1 to indicate a successful treatment or -1 otherwise. This can be used as training data.
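As a purely illustrative sketch, such a training set could be represented with numpy arrays like the following; the numbers are made up, not from the lecture:

    import numpy as np

    # Hypothetical training set for the drug example: each row of X is (height in cm,
    # weight in kg); each label in y is +1 for a successful treatment, -1 otherwise.
    X = np.array([[172.0, 70.0],
                  [181.0, 95.0],
                  [160.0, 55.0],
                  [175.0, 88.0]])
    y = np.array([+1, -1, +1, -1])

    m, n = X.shape  # m samples, each a vector in R^n (here n = 2)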

Binary Classification – Cont.

There are infinitely many ways to classify. By Occam's razor, simple classification rules provide better results, so we use a linear classifier, or hyperplane.

Our class of linear classifiers:
H = {x ↦ sign(w*x + b) | w ∈ R^n, b ∈ R}
h ∈ H maps x ∈ X to 1 if w*x + b ≥ 0
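A minimal sketch of such a classifier in Python (the weight vector and bias below are arbitrary placeholders, not a trained model):

    import numpy as np

    def linear_classifier(w: np.ndarray, b: float):
        """Return the hypothesis x -> sign(w*x + b), mapping w*x + b = 0 to +1."""
        def h(x: np.ndarray) -> int:
            return 1 if np.dot(w, x) + b >= 0 else -1
        return h

    # Arbitrary example parameters.
    h = linear_classifier(np.array([0.5, -1.0]), b=2.0)
    print(h(np.array([3.0, 1.0])))   # w*x + b = 2.5 >= 0, so +1
    print(h(np.array([-4.0, 3.0])))  # w*x + b = -3.0 < 0, so -1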


Choosing a Good Hyperplane – Intuition

Consider two cases of positive classification:
w*x + b = 0.1
w*x + b = 100

We are more confident in the decision made by the latter than in the former, so we choose a hyperplane with maximal margin.

Good Hyperplane – Cont.

Definition: the functional margin of a linear classifier (w, b) with respect to S is
γ̂ = min_{i=1,...,m} γ̂_i, with γ̂_i = y_i(w*x_i + b)
where sign(w*x_i + b) is the classification of x_i according to (w, b).

Maximal Margin

(w, b) can be scaled to increase the functional margin: sign(w*x + b) = sign(5w*x + 5b) for all x, yet the functional margin of (5w, 5b) is 5 times greater than that of (w, b). We cope with this by adding an additional constraint: ||w|| = 1.

Maximal Margin – Cont.

Geometric margin: consider the geometric distance between the hyperplane and the closest points.

Geometric Margin

Definition: the geometric margin of (w, b) with respect to S is
γ = min_{i=1,...,m} γ_i, with γ_i = y_i((w/||w||)*x_i + b/||w||)

Relation to the functional margin: γ_i = γ̂_i/||w||, so both are equal when ||w|| = 1.
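A short sketch computing both margins over a training set (X, y, w, b are assumed to be numpy arrays and scalars as in the earlier snippets):

    import numpy as np

    def functional_margin(w, b, X, y):
        # gamma_hat = min_i y_i * (w*x_i + b)
        return np.min(y * (X @ w + b))

    def geometric_margin(w, b, X, y):
        # gamma = gamma_hat / ||w||; equal to the functional margin when ||w|| = 1
        return functional_margin(w, b, X, y) / np.linalg.norm(w)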


The Algorithm

We saw two definitions of the margin and the intuition behind seeking a margin-maximizing hyperplane. Goal: write an optimization program that finds such a hyperplane. We always look for (w, b) maximizing the margin.

The Algorithm – Take 1

First try:
max_{γ̂, w, b} γ̂
s.t. y_i(w*x_i + b) ≥ γ̂, i = 1,...,m
     ||w|| = 1

Idea: maximize γ̂, requiring that for each sample the functional margin is at least γ̂. Since ||w|| = 1, the functional and geometric margins are the same, so γ̂ is the largest possible geometric margin with respect to the training set.

The Algorithm – Take 2

The first try can't be solved by any off-the-shelf optimization software: the constraint ||w|| = 1 is non-linear, and in fact it's even non-convex. How can we discard the constraint? Use the geometric margin!

max_{γ̂, w, b} γ̂/||w||
s.t. y_i(w*x_i + b) ≥ γ̂, i = 1,...,m

The Algorithm – Take 3

We now have a non-convex objective function, so the problem remains. Remember that we can scale (w, b) as we wish, so force the functional margin to be 1. The objective function becomes
max 1/||w||, which is the same as min (1/2)||w||^2
The factor of 1/2 and the power of 2 do not change the program; they just make things easier.

The Algorithm – Final Version

The final program:
min_{w,b} (1/2)||w||^2
s.t. y_i(w*x_i + b) ≥ 1, i = 1,...,m

The objective is convex (quadratic) and all constraints are linear, so we can solve it efficiently using standard quadratic programming (QP) software.
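As an illustration only, here is a sketch that feeds this program to a generic constrained solver (scipy's SLSQP) rather than a dedicated QP package; it assumes X, y form a linearly separable training set as in the earlier snippets:

    import numpy as np
    from scipy.optimize import minimize

    def train_linear_svm_primal(X, y):
        """Solve min (1/2)||w||^2 s.t. y_i(w*x_i + b) >= 1 on separable data."""
        m, n = X.shape

        def objective(z):        # z = [w_1, ..., w_n, b]
            w = z[:n]
            return 0.5 * np.dot(w, w)

        # SLSQP inequality constraints are of the form fun(z) >= 0.
        constraints = [{'type': 'ineq',
                        'fun': lambda z, i=i: y[i] * (np.dot(z[:n], X[i]) + z[n]) - 1}
                       for i in range(m)]

        res = minimize(objective, x0=np.zeros(n + 1), constraints=constraints, method='SLSQP')
        return res.x[:n], res.x[n]   # (w, b)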

Convex Optimization

We want to solve the optimization problem more efficiently than generic QP. Solution – use convex optimization techniques.

Convex Optimization – Cont.

Definition: a function f is convex if for all x, y ∈ X and λ ∈ [0, 1]:
f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y)

Theorem: let f be a differentiable convex function. Then for all x, y ∈ X:
f(y) ≥ f(x) + ∇f(x)*(y - x)

Convex Optimization Problem

Let f, g_i: X → R, i = 1,...,m, be convex functions. The convex optimization problem is:
Find min_{x ∈ X} f(x) s.t. g_i(x) ≤ 0, i = 1,...,m

That is, we look for a value of x ∈ X that minimizes f(x) under the constraints g_i(x) ≤ 0, i = 1,...,m.

Lagrange Multipliers

Used to find a maximum or a minimum of a function subject to constraints; we use them to solve our optimization problem.

Definition: the Lagrangian of a function f subject to the constraints g_i, i = 1,...,m, is
L(x, α) = f(x) + sum_{i=1}^m α_i g_i(x), x ∈ X, α_i ≥ 0
The α_i are called the Lagrange multipliers.

Primal Program

Plan: use the Lagrangian to write a program called the Primal Program. It is equal to f(x) if all the constraints are met, and ∞ otherwise.

Definition – Primal Program:
θ_P(x) = max_{α ≥ 0} L(x, α)

Primal Program – Cont.

The constraints are of the form g_i(x) ≤ 0. If they are met, sum_{i=1}^m α_i g_i(x) is maximized when all α_i are 0 and the summation is 0, so θ_P(x) = f(x). Otherwise, θ_P(x) is maximized by letting α_i → ∞, so θ_P(x) = ∞.

Primal Program – Cont.

Our convex optimization problem is now:
min_{x ∈ X} θ_P(x) = min_{x ∈ X} max_{α ≥ 0} L(x, α)

Define p* = min_{x ∈ X} θ_P(x) as the value of the primal program.

Dual Program

We define the Dual Program as:
θ_D(α) = min_{x ∈ X} L(x, α)

We'll look at max_{α ≥ 0} θ_D(α) = max_{α ≥ 0} min_{x ∈ X} L(x, α). This is the same as our primal program, except that the order of min/max is different.

Define d* = max_{α ≥ 0} min_{x ∈ X} L(x, α) as the value of our Dual Program.

Dual Program – Cont.

We want to show d* = p*: if we find a solution to one problem, we find the solution to the second problem.

Start with d* ≤ p*: a "max min" is always at most the corresponding "min max":
d* = max_{α ≥ 0} min_{x ∈ X} L(x, α) ≤ min_{x ∈ X} max_{α ≥ 0} L(x, α) = p*

Now on to p* ≤ d*.

Dual Program – Cont.

Claim: if there exist x* and α* ≥ 0 which are a saddle point, i.e. α* is feasible and
L(x*, α) ≤ L(x*, α*) ≤ L(x, α*) for all x ∈ X and all α ≥ 0,
then p* = d* and x* is a solution to the primal problem.

Proof:
p* = inf_{x ∈ X} sup_{α ≥ 0} L(x, α) ≤ sup_{α ≥ 0} L(x*, α) = L(x*, α*) = inf_{x ∈ X} L(x, α*) ≤ sup_{α ≥ 0} inf_{x ∈ X} L(x, α) = d*

Conclude: p* ≤ d*, and together with d* ≤ p* we get p* = d*.

Karush-Kuhn-Tucker (KKT) Conditions

The KKT conditions give a characterization of an optimal solution to a convex problem.

Theorem: assume that f and g_i, i = 1,...,m, are differentiable and convex. Then x* is a solution to the optimization problem iff there exists α* ≥ 0 s.t.:
1. ∇_x L(x*, α*) = ∇f(x*) + sum_{i=1}^m α*_i ∇g_i(x*) = 0
2. g_i(x*) ≤ 0, i = 1,...,m
3. α*_i g_i(x*) = 0, i = 1,...,m

KKT Conditions – Cont.

Proof: for every feasible x:
f(x) ≥ f(x*) + ∇f(x*)*(x - x*)
     = f(x*) - sum_{i=1}^m α*_i ∇g_i(x*)*(x - x*)      (condition 1)
     ≥ f(x*) - sum_{i=1}^m α*_i [g_i(x) - g_i(x*)]      (convexity of g_i, α*_i ≥ 0)
     = f(x*) - sum_{i=1}^m α*_i g_i(x)                  (condition 3)
     ≥ f(x*)                                            (α*_i ≥ 0, g_i(x) ≤ 0)

The other direction holds as well.

KKT Conditions – Cont.

Example: consider the following optimization problem:
min (1/2)x^2 s.t. x ≥ 2

We have f(x) = (1/2)x^2 and g_1(x) = 2 - x. The Lagrangian will be
L(x, α) = (1/2)x^2 + α(2 - x)

Setting ∇_x L(x, α) = x - α = 0 gives x* = α*. Condition 3 requires α*(2 - x*) = 0, so with α* > 0 we get x* = 2 and α* = 2 ≥ 0, and the optimal value is (1/2)(2)^2 = 2.
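A quick numerical sanity check of this example (a sketch using scipy, not part of the original lecture):

    import numpy as np
    from scipy.optimize import minimize

    # min (1/2) x^2 subject to x >= 2; the KKT analysis above gives x* = 2, alpha* = 2.
    res = minimize(lambda z: 0.5 * z[0] ** 2,
                   x0=np.array([10.0]),
                   constraints=[{'type': 'ineq', 'fun': lambda z: z[0] - 2.0}],
                   method='SLSQP')
    print(res.x, res.fun)   # approximately [2.0] and 2.0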

Optimal Margin Classifier

Back to SVM. Rewrite our optimization program:
min_{w,b} (1/2)||w||^2 s.t. y_i(w*x_i + b) ≥ 1, i = 1,...,m
g_i(w, b) = 1 - y_i(w*x_i + b) ≤ 0

Following the KKT conditions, α_i > 0 only for points in the training set with a margin of exactly 1. These are the support vectors of the training set.

Optimal Margin – Cont.

[Figure: the optimal margin classifier and its support vectors]

Optimal Margin – Cont.

Construct the Lagrangian:
L(w, b, α) = (1/2)||w||^2 - sum_{i=1}^m α_i [y_i(w*x_i + b) - 1]

Find the dual form: first minimize L(w, b, α) over w and b to get θ_D(α), by setting the derivatives to zero:
∇_w L(w, b, α) = w - sum_{i=1}^m α_i y_i x_i = 0  →  w* = sum_{i=1}^m α_i y_i x_i

Optimal Margin – Cont.

Take the derivative with respect to b:
∂L(w, b, α)/∂b = sum_{i=1}^m α_i y_i = 0

Use w* in the Lagrangian:
L(w*, b, α) = sum_{i=1}^m α_i - (1/2) sum_{i,j=1}^m α_i α_j y_i y_j (x_i*x_j) - b sum_{i=1}^m α_i y_i

We saw the last term is zero, so
L(w*, b, α) = sum_{i=1}^m α_i - (1/2) sum_{i,j=1}^m α_i α_j y_i y_j (x_i*x_j) = W(α)

Optimal Margin – Cont.

The dual optimization problem:
max_α W(α) s.t. α_i ≥ 0, i = 1,...,m, and sum_{i=1}^m α_i y_i = 0

The KKT conditions hold, so we can solve by finding the α* that maximizes W(α). Assuming we have α*, define
w* = sum_{i=1}^m α*_i y_i x_i
as the solution to the primal problem.

Optimal Margin – Cont.

Still need to find b*. Assume x_i is a support vector; then y_i(w* · x_i + b*) = 1, and since y_i ∈ {-1, +1} we get
b* = y_i - w* · x_i
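For intuition, here is a sketch of how w* and b* relate to the dual solution in practice, using scikit-learn's SVC with a linear kernel; note that it solves the soft-margin dual, which with a large C on separable data approximates the hard-margin program above:

    import numpy as np
    from sklearn.svm import SVC

    # X, y: a linearly separable training set as in the earlier snippets (assumption).
    clf = SVC(kernel='linear', C=1e6).fit(X, y)

    # dual_coef_ stores alpha_i * y_i for the support vectors, so
    # w* = sum_i alpha_i y_i x_i can be reconstructed directly from them:
    w_star = (clf.dual_coef_ @ clf.support_vectors_).ravel()  # should match clf.coef_
    b_star = clf.intercept_[0]

    # For every support vector x_i: y_i * (w* . x_i + b*) should be approximately 1.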

Error Analysis Using Leave-One-Out

The Leave-One-Out (LOO) method: remove one point at a time from the training set, calculate an SVM for the remaining points, and test the result using the removed point.

Definition: the indicator function I(exp) is 1 if exp is true, otherwise 0.
R̂_LOO = (1/m) sum_{i=1}^m I(h_{S \ {x_i}}(x_i) ≠ y_i)
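A minimal sketch of computing this quantity with scikit-learn (again approximating the hard-margin SVM by a linear SVC with large C):

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import SVC

    def loo_error(X, y):
        """Estimate R_LOO: the fraction of points misclassified when left out of training."""
        errors = 0
        for train_idx, test_idx in LeaveOneOut().split(X):
            clf = SVC(kernel='linear', C=1e6).fit(X[train_idx], y[train_idx])
            errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])
        return errors / len(y)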

LOO Error Analysis – Cont.

Expected error:
E_{S~D^m}[R̂_LOO] = (1/m) sum_{i=1}^m E_{S~D^m}[I(h_{S \ {x_i}}(x_i) ≠ y_i)] = E_{S'~D^{m-1}, (x,y)~D}[I(h_{S'}(x) ≠ y)] = E_{S'~D^{m-1}}[error(h_{S'})]

It follows that the expected LOO error for a training set of size m equals the expected generalization error of a classifier trained on a set of size m-1.

LOO Error Analysis – Cont.

Theorem: E_{S~D^m}[error(h_S)] ≤ E_{S~D^{m+1}}[N_SV(S)/(m+1)], where N_SV(S) is the number of support vectors in S.

Proof: if the classifier trained on S \ {x_i} misclassifies x_i, then x_i must be a support vector of h_S (removing a non-support vector does not change the solution, which classifies every training point correctly). Hence R̂_LOO ≤ N_SV(S)/m, and taking expectations over S ~ D^{m+1} together with the previous slide gives the theorem.
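As a sketch of how this bound is used as a cheap error estimate in practice (X, y and loo_error as in the previous snippets; the linear SVC with large C is again an approximation of the hard-margin SVM):

    import numpy as np
    from sklearn.svm import SVC

    clf = SVC(kernel='linear', C=1e6).fit(X, y)
    n_sv = clf.n_support_.sum()          # total number of support vectors, N_SV(S)
    print("support-vector bound:", n_sv / len(y))
    print("LOO error estimate:  ", loo_error(X, y))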

Generalization Bounds Using VC-dimension

Theorem: let S = {x_1, ..., x_m} with ||x_i|| ≤ R, and let d be the VC-dimension of the hyperplanes x ↦ sign(w*x) with ||w|| = 1 that attain margin γ on S (i.e. min_i |w*x_i| ≥ γ). Then d ≤ R^2/γ^2.

Proof: assume that the set x_1, ..., x_d is shattered. Then for every y ∈ {-1, +1}^d there is a w with ||w|| = 1 such that y_i(w*x_i) ≥ γ for i = 1,...,d. Summing over i = 1,...,d:
dγ ≤ sum_{i=1}^d y_i(w*x_i) = w*(sum_{i=1}^d y_i x_i) ≤ ||w|| ||sum_{i=1}^d y_i x_i|| = ||sum_{i=1}^d y_i x_i||

Generalization Bounds Using VC-dimension – Cont.

Proof – Cont.: averaging over the y's with the uniform distribution:
dγ ≤ E_y[ ||sum_{i=1}^d y_i x_i|| ] ≤ sqrt( E_y[ ||sum_{i=1}^d y_i x_i||^2 ] )

Since E[y_i y_j] = 0 when i ≠ j and E[y_i y_j] = 1 when i = j, we can conclude that:
E_y[ ||sum_{i=1}^d y_i x_i||^2 ] = sum_{i,j} E[y_i y_j](x_i*x_j) = sum_{i=1}^d ||x_i||^2 ≤ dR^2

Therefore dγ ≤ sqrt(d) R, and so d ≤ R^2/γ^2.
