Support Vector Machines


Support Vector Machines

Lecturer: Yishay Mansour

Itay Kirshenbaum

Lecture Overview

In this lecture we present in detail one of the most theoretically well-motivated and practically effective classification algorithms in modern machine learning: Support Vector Machines (SVMs).

Lecture Overview – Cont.

We begin by building the intuition behind SVMs, continue by defining the SVM as an optimization problem and discussing how to solve it efficiently, and conclude with an analysis of the error rate of SVMs using two techniques: Leave-One-Out and VC-dimension.

Introduction

The Support Vector Machine is a supervised learning algorithm used to learn a hyperplane that solves the binary classification problem, one of the most extensively studied problems in machine learning.

Binary Classification Problem

Input space: X ⊆ R^n
Output space: Y = {-1, +1}
Training data: S = {(x_1, y_1), ..., (x_m, y_m)}, drawn i.i.d. from a distribution D
Goal: select a hypothesis h ∈ H that best predicts the labels of other points drawn i.i.d. from D

Binary Classification – Cont.

Consider the problem of predicting the success of a new drug based on a patient's height and weight. m ill people are selected and treated, which generates m 2-dimensional vectors (height and weight). Each point is assigned +1 to indicate a successful treatment or -1 otherwise. This can be used as training data.
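As a purely illustrative sketch, such a training set could be represented with numpy arrays like the following; the numbers are made up, not from the lecture:

    import numpy as np

    # Hypothetical training set for the drug example: each row of X is (height in cm,
    # weight in kg); each label in y is +1 for a successful treatment, -1 otherwise.
    X = np.array([[172.0, 70.0],
                  [181.0, 95.0],
                  [160.0, 55.0],
                  [175.0, 88.0]])
    y = np.array([+1, -1, +1, -1])

    m, n = X.shape  # m samples, each a vector in R^n (here n = 2)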

Binary Classification – Cont.

There are infinitely many ways to classify. By Occam's razor, simple classification rules provide better results, so we use a linear classifier, or hyperplane.

Our class of linear classifiers:
H = {x ↦ sign(w*x + b) | w ∈ R^n, b ∈ R}
h ∈ H maps x ∈ X to 1 if w*x + b ≥ 0
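A minimal sketch of such a classifier in Python (the weight vector and bias below are arbitrary placeholders, not a trained model):

    import numpy as np

    def linear_classifier(w: np.ndarray, b: float):
        """Return the hypothesis x -> sign(w*x + b), mapping w*x + b = 0 to +1."""
        def h(x: np.ndarray) -> int:
            return 1 if np.dot(w, x) + b >= 0 else -1
        return h

    # Arbitrary example parameters.
    h = linear_classifier(np.array([0.5, -1.0]), b=2.0)
    print(h(np.array([3.0, 1.0])))   # w*x + b = 2.5 >= 0, so +1
    print(h(np.array([-4.0, 3.0])))  # w*x + b = -3.0 < 0, so -1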


Choosing a Good Hyperplane – Intuition

Consider two cases of positive classification:
w*x + b = 0.1
w*x + b = 100

We are more confident in the decision made by the latter than in the former, so we choose a hyperplane with maximal margin.

Good Hyperplane – Cont.

Definition: the functional margin of a linear classifier (w, b) with respect to S is
γ̂ = min_{i=1,...,m} γ̂_i, with γ̂_i = y_i(w*x_i + b)
where sign(w*x_i + b) is the classification of x_i according to (w, b).

Maximal Margin

(w, b) can be scaled to increase the functional margin: sign(w*x + b) = sign(5w*x + 5b) for all x, yet the functional margin of (5w, 5b) is 5 times greater than that of (w, b). We cope with this by adding an additional constraint: ||w|| = 1.

Maximal Margin – Cont.

Geometric margin: consider the geometric distance between the hyperplane and the closest points.

Geometric Margin

Definition: the geometric margin of (w, b) with respect to S is
γ = min_{i=1,...,m} γ_i, with γ_i = y_i((w/||w||)*x_i + b/||w||)

Relation to the functional margin: γ_i = γ̂_i/||w||, so both are equal when ||w|| = 1.
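A short sketch computing both margins over a training set (X, y, w, b are assumed to be numpy arrays and scalars as in the earlier snippets):

    import numpy as np

    def functional_margin(w, b, X, y):
        # gamma_hat = min_i y_i * (w*x_i + b)
        return np.min(y * (X @ w + b))

    def geometric_margin(w, b, X, y):
        # gamma = gamma_hat / ||w||; equal to the functional margin when ||w|| = 1
        return functional_margin(w, b, X, y) / np.linalg.norm(w)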


The Algorithm

We saw two definitions of the margin and the intuition behind seeking a margin-maximizing hyperplane. Goal: write an optimization program that finds such a hyperplane. We always look for (w, b) maximizing the margin.

The Algorithm – Take 1

First try:
max_{γ̂, w, b} γ̂
s.t. y_i(w*x_i + b) ≥ γ̂, i = 1,...,m
     ||w|| = 1

Idea: maximize γ̂, requiring that for each sample the functional margin is at least γ̂. Since ||w|| = 1, the functional and geometric margins are the same, so γ̂ is the largest possible geometric margin with respect to the training set.

The Algorithm – Take 2

The first try can't be solved by any off-the-shelf optimization software: the constraint ||w|| = 1 is non-linear, and in fact it's even non-convex. How can we discard the constraint? Use the geometric margin!

max_{γ̂, w, b} γ̂/||w||
s.t. y_i(w*x_i + b) ≥ γ̂, i = 1,...,m

The Algorithm – Take 3

We now have a non-convex objective function, so the problem remains. Remember that we can scale (w, b) as we wish, so force the functional margin to be 1. The objective function becomes
max 1/||w||, which is the same as min (1/2)||w||^2
The factor of 1/2 and the power of 2 do not change the program; they just make things easier.

The Algorithm – Final Version

The final program:
min_{w,b} (1/2)||w||^2
s.t. y_i(w*x_i + b) ≥ 1, i = 1,...,m

The objective is convex (quadratic) and all constraints are linear, so we can solve it efficiently using standard quadratic programming (QP) software.
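As an illustration only, here is a sketch that feeds this program to a generic constrained solver (scipy's SLSQP) rather than a dedicated QP package; it assumes X, y form a linearly separable training set as in the earlier snippets:

    import numpy as np
    from scipy.optimize import minimize

    def train_linear_svm_primal(X, y):
        """Solve min (1/2)||w||^2 s.t. y_i(w*x_i + b) >= 1 on separable data."""
        m, n = X.shape

        def objective(z):        # z = [w_1, ..., w_n, b]
            w = z[:n]
            return 0.5 * np.dot(w, w)

        # SLSQP inequality constraints are of the form fun(z) >= 0.
        constraints = [{'type': 'ineq',
                        'fun': lambda z, i=i: y[i] * (np.dot(z[:n], X[i]) + z[n]) - 1}
                       for i in range(m)]

        res = minimize(objective, x0=np.zeros(n + 1), constraints=constraints, method='SLSQP')
        return res.x[:n], res.x[n]   # (w, b)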

Convex Optimization

We want to solve the optimization problem more efficiently than generic QP. Solution – use convex optimization techniques.

Convex Optimization – Cont.

Definition: a function f is convex if for all x, y ∈ X and λ ∈ [0, 1]:
f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y)

Theorem: let f be a differentiable convex function. Then for all x, y ∈ X:
f(y) ≥ f(x) + ∇f(x)*(y - x)

Convex Optimization Problem

Let f, g_i: X → R, i = 1,...,m, be convex functions. The convex optimization problem is:
Find min_{x ∈ X} f(x) s.t. g_i(x) ≤ 0, i = 1,...,m

That is, we look for a value of x ∈ X that minimizes f(x) under the constraints g_i(x) ≤ 0, i = 1,...,m.

Lagrange Multipliers

Used to find a maximum or a minimum of a function subject to constraints; we use them to solve our optimization problem.

Definition: the Lagrangian of a function f subject to the constraints g_i, i = 1,...,m, is
L(x, α) = f(x) + sum_{i=1}^m α_i g_i(x), x ∈ X, α_i ≥ 0
The α_i are called the Lagrange multipliers.

Primal Program

Plan: use the Lagrangian to write a program called the Primal Program. It is equal to f(x) if all the constraints are met, and ∞ otherwise.

Definition – Primal Program:
θ_P(x) = max_{α ≥ 0} L(x, α)

Primal Program – Cont.

The constraints are of the form g_i(x) ≤ 0. If they are met, sum_{i=1}^m α_i g_i(x) is maximized when all α_i are 0 and the summation is 0, so θ_P(x) = f(x). Otherwise, θ_P(x) is maximized by letting α_i → ∞, so θ_P(x) = ∞.

Primal Program – Cont.

Our convex optimization problem is now:
min_{x ∈ X} θ_P(x) = min_{x ∈ X} max_{α ≥ 0} L(x, α)

Define p* = min_{x ∈ X} θ_P(x) as the value of the primal program.

Dual Program

We define the Dual Program as:
θ_D(α) = min_{x ∈ X} L(x, α)

We'll look at max_{α ≥ 0} θ_D(α) = max_{α ≥ 0} min_{x ∈ X} L(x, α). This is the same as our primal program, except that the order of min/max is different.

Define d* = max_{α ≥ 0} min_{x ∈ X} L(x, α) as the value of our Dual Program.

Dual Program – Cont.

We want to show d* = p*: if we find a solution to one problem, we find the solution to the second problem.

Start with d* ≤ p*: a "max min" is always at most the corresponding "min max":
d* = max_{α ≥ 0} min_{x ∈ X} L(x, α) ≤ min_{x ∈ X} max_{α ≥ 0} L(x, α) = p*

Now on to p* ≤ d*.

Dual Program – Cont.

Claim: if there exist x* and α* ≥ 0 which are a saddle point, i.e. α* is feasible and
L(x*, α) ≤ L(x*, α*) ≤ L(x, α*) for all x ∈ X and all α ≥ 0,
then p* = d* and x* is a solution to the primal problem.

Proof:
p* = inf_{x ∈ X} sup_{α ≥ 0} L(x, α) ≤ sup_{α ≥ 0} L(x*, α) = L(x*, α*) = inf_{x ∈ X} L(x, α*) ≤ sup_{α ≥ 0} inf_{x ∈ X} L(x, α) = d*

Conclude: p* ≤ d*, and together with d* ≤ p* we get p* = d*.

Karush-Kuhn-Tucker (KKT) Conditions

The KKT conditions give a characterization of an optimal solution to a convex problem.

Theorem: assume that f and g_i, i = 1,...,m, are differentiable and convex. Then x* is a solution to the optimization problem iff there exists α* ≥ 0 s.t.:
1. ∇_x L(x*, α*) = ∇f(x*) + sum_{i=1}^m α*_i ∇g_i(x*) = 0
2. g_i(x*) ≤ 0, i = 1,...,m
3. α*_i g_i(x*) = 0, i = 1,...,m

KKT Conditions – Cont.

Proof: for every feasible x:
f(x) ≥ f(x*) + ∇f(x*)*(x - x*)
     = f(x*) - sum_{i=1}^m α*_i ∇g_i(x*)*(x - x*)      (condition 1)
     ≥ f(x*) - sum_{i=1}^m α*_i [g_i(x) - g_i(x*)]      (convexity of g_i, α*_i ≥ 0)
     = f(x*) - sum_{i=1}^m α*_i g_i(x)                  (condition 3)
     ≥ f(x*)                                            (α*_i ≥ 0, g_i(x) ≤ 0)

The other direction holds as well.

KKT Conditions – Cont.

Example: consider the following optimization problem:
min (1/2)x^2 s.t. x ≥ 2

We have f(x) = (1/2)x^2 and g_1(x) = 2 - x. The Lagrangian will be
L(x, α) = (1/2)x^2 + α(2 - x)

Setting ∇_x L(x, α) = x - α = 0 gives x* = α*. Condition 3 requires α*(2 - x*) = 0, so with α* > 0 we get x* = 2 and α* = 2 ≥ 0, and the optimal value is (1/2)(2)^2 = 2.
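A quick numerical sanity check of this example (a sketch using scipy, not part of the original lecture):

    import numpy as np
    from scipy.optimize import minimize

    # min (1/2) x^2 subject to x >= 2; the KKT analysis above gives x* = 2, alpha* = 2.
    res = minimize(lambda z: 0.5 * z[0] ** 2,
                   x0=np.array([10.0]),
                   constraints=[{'type': 'ineq', 'fun': lambda z: z[0] - 2.0}],
                   method='SLSQP')
    print(res.x, res.fun)   # approximately [2.0] and 2.0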

Optimal Margin Classifier

Back to SVM. Rewrite our optimization program:
min_{w,b} (1/2)||w||^2 s.t. y_i(w*x_i + b) ≥ 1, i = 1,...,m
g_i(w, b) = 1 - y_i(w*x_i + b) ≤ 0

Following the KKT conditions, α_i > 0 only for points in the training set with a margin of exactly 1. These are the support vectors of the training set.

Optimal Margin – Cont.

[Figure: the optimal margin classifier and its support vectors]

Optimal Margin – Cont.

Construct the Lagrangian:
L(w, b, α) = (1/2)||w||^2 - sum_{i=1}^m α_i [y_i(w*x_i + b) - 1]

Find the dual form: first minimize L(w, b, α) over w and b to get θ_D(α), by setting the derivatives to zero:
∇_w L(w, b, α) = w - sum_{i=1}^m α_i y_i x_i = 0  →  w* = sum_{i=1}^m α_i y_i x_i

Optimal Margin – Cont.

Take the derivative with respect to b:
∂L(w, b, α)/∂b = sum_{i=1}^m α_i y_i = 0

Use w* in the Lagrangian:
L(w*, b, α) = sum_{i=1}^m α_i - (1/2) sum_{i,j=1}^m α_i α_j y_i y_j (x_i*x_j) - b sum_{i=1}^m α_i y_i

We saw the last term is zero, so
L(w*, b, α) = sum_{i=1}^m α_i - (1/2) sum_{i,j=1}^m α_i α_j y_i y_j (x_i*x_j) = W(α)

Optimal Margin – Cont.

The dual optimization problem:
max_α W(α) s.t. α_i ≥ 0, i = 1,...,m, and sum_{i=1}^m α_i y_i = 0

The KKT conditions hold, so we can solve by finding the α* that maximizes W(α). Assuming we have α*, define
w* = sum_{i=1}^m α*_i y_i x_i
as the solution to the primal problem.

Optimal Margin – Cont.

Still need to find b*. Assume x_i is a support vector; then y_i(w* · x_i + b*) = 1, and since y_i ∈ {-1, +1} we get
b* = y_i - w* · x_i
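For intuition, here is a sketch of how w* and b* relate to the dual solution in practice, using scikit-learn's SVC with a linear kernel; note that it solves the soft-margin dual, which with a large C on separable data approximates the hard-margin program above:

    import numpy as np
    from sklearn.svm import SVC

    # X, y: a linearly separable training set as in the earlier snippets (assumption).
    clf = SVC(kernel='linear', C=1e6).fit(X, y)

    # dual_coef_ stores alpha_i * y_i for the support vectors, so
    # w* = sum_i alpha_i y_i x_i can be reconstructed directly from them:
    w_star = (clf.dual_coef_ @ clf.support_vectors_).ravel()  # should match clf.coef_
    b_star = clf.intercept_[0]

    # For every support vector x_i: y_i * (w* . x_i + b*) should be approximately 1.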

Error Analysis Using Leave-One-Out

The Leave-One-Out (LOO) method: remove one point at a time from the training set, calculate an SVM for the remaining points, and test the result using the removed point.

Definition: the indicator function I(exp) is 1 if exp is true, otherwise 0.
R̂_LOO = (1/m) sum_{i=1}^m I(h_{S \ {x_i}}(x_i) ≠ y_i)
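A minimal sketch of computing this quantity with scikit-learn (again approximating the hard-margin SVM by a linear SVC with large C):

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import SVC

    def loo_error(X, y):
        """Estimate R_LOO: the fraction of points misclassified when left out of training."""
        errors = 0
        for train_idx, test_idx in LeaveOneOut().split(X):
            clf = SVC(kernel='linear', C=1e6).fit(X[train_idx], y[train_idx])
            errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])
        return errors / len(y)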

LOO Error Analysis – Cont.

Expected error:
E_{S~D^m}[R̂_LOO] = (1/m) sum_{i=1}^m E_{S~D^m}[I(h_{S \ {x_i}}(x_i) ≠ y_i)] = E_{S'~D^{m-1}, (x,y)~D}[I(h_{S'}(x) ≠ y)] = E_{S'~D^{m-1}}[error(h_{S'})]

It follows that the expected LOO error for a training set of size m equals the expected generalization error of a classifier trained on a set of size m-1.

LOO Error Analysis – Cont.

Theorem: E_{S~D^m}[error(h_S)] ≤ E_{S~D^{m+1}}[N_SV(S)/(m+1)], where N_SV(S) is the number of support vectors in S.

Proof: if the classifier trained on S \ {x_i} misclassifies x_i, then x_i must be a support vector of h_S (removing a non-support vector does not change the solution, which classifies every training point correctly). Hence R̂_LOO ≤ N_SV(S)/m, and taking expectations over S ~ D^{m+1} together with the previous slide gives the theorem.
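As a sketch of how this bound is used as a cheap error estimate in practice (X, y and loo_error as in the previous snippets; the linear SVC with large C is again an approximation of the hard-margin SVM):

    import numpy as np
    from sklearn.svm import SVC

    clf = SVC(kernel='linear', C=1e6).fit(X, y)
    n_sv = clf.n_support_.sum()          # total number of support vectors, N_SV(S)
    print("support-vector bound:", n_sv / len(y))
    print("LOO error estimate:  ", loo_error(X, y))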

Generalization Bounds Using VC-dimension

Theorem: let S = {x_1, ..., x_m} with ||x_i|| ≤ R, and let d be the VC-dimension of the hyperplanes x ↦ sign(w*x) with ||w|| = 1 that attain margin γ on S (i.e. min_i |w*x_i| ≥ γ). Then d ≤ R^2/γ^2.

Proof: assume that the set x_1, ..., x_d is shattered. Then for every y ∈ {-1, +1}^d there is a w with ||w|| = 1 such that y_i(w*x_i) ≥ γ for i = 1,...,d. Summing over i = 1,...,d:
dγ ≤ sum_{i=1}^d y_i(w*x_i) = w*(sum_{i=1}^d y_i x_i) ≤ ||w|| ||sum_{i=1}^d y_i x_i|| = ||sum_{i=1}^d y_i x_i||

Generalization Bounds Using VC-dimension – Cont.

Proof – Cont.: averaging over the y's with the uniform distribution:
dγ ≤ E_y[ ||sum_{i=1}^d y_i x_i|| ] ≤ sqrt( E_y[ ||sum_{i=1}^d y_i x_i||^2 ] )

Since E[y_i y_j] = 0 when i ≠ j and E[y_i y_j] = 1 when i = j, we can conclude that:
E_y[ ||sum_{i=1}^d y_i x_i||^2 ] = sum_{i,j} E[y_i y_j](x_i*x_j) = sum_{i=1}^d ||x_i||^2 ≤ dR^2

Therefore dγ ≤ sqrt(d) R, and so d ≤ R^2/γ^2.
