Support Vector Machines
Page 1: Support Vector Machines

Page 2: Supervised Learning

Labeled Data: e.g., images labeled quail, apple, apple, corn, corn.

Model Class: consider classifiers of the form $y = f(x; w)$.

Features: each example is mapped to a feature vector, e.g., quail $\to (1.1, -0.5, 0, 0, 0.3, \ldots)$, apple $\to (-1, 0, 1.2, -0.4, 0.1, \ldots)$.

Learning: find $w$ that works well on the training data $(x, y)$. Optimization!

Page 3: Linear Classifiers

• A simple and effective family of classifiers: $y = \operatorname{sign}[w \cdot x + b]$
• The training problem: given a set of $n$ training points $(x_i, y_i)$, find the "best fitting" classifier $(w, b)$.

Page 4: Training Linear Classifiers

$y = \operatorname{sign}[w \cdot x + b]$

• How do we find it?
• If there exists a classifier with zero training error, we can find one with the perceptron algorithm.
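As an illustration (not part of the original slides), here is a minimal perceptron sketch in Python/NumPy; the toy data is made up:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Minimal perceptron: X is (n, d), y in {-1, +1}.
    Returns (w, b) with zero training error if the data are
    linearly separable and max_epochs is large enough."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified point
                w += yi * xi                   # update toward the correct side
                b += yi
                mistakes += 1
        if mistakes == 0:                      # converged: zero training error
            break
    return w, b

# Toy separable data (made up for illustration)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # should match y
```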

Page 5: Many Possible Solutions

• If there exists one solution, there exist many.
• Which one should we choose?
• Intuitively: one that is farther away from the points.

Page 6: Maximum Margin Classifier

• For every point $x_i$, denote by $d(x_i, w, b)$ its distance from the hyperplane.
• Margin of a classifier: the shortest distance to the hyperplane, $\min_i d(x_i, w, b)$.
• Goal: find the classifier that maximizes $\min_i d(x_i, w, b)$.

Page 7: ML and Optimization

• We have an optimization problem.
• Namely, we want to find a set of parameters that maximizes some objective function (the margin) subject to some constraints (classifying correctly).
• We need a toolbox for solving such problems.
• In what follows we provide an overview.

Page 8: Unconstrained Optimization

• Use $w$ to denote the optimization variables.
• For example: $\min_{w_1, w_2} (w_1 - 2w_2)^2$
• Generally: $\min_w f(w)$
• Solve by finding all $w$ such that $\frac{\partial f(w)}{\partial w} = 0$.
• These stationary points are the candidates for the global minimum (along with asymptotes).

Page 9: Constrained Minimization

• Suppose we are only interested in variables that satisfy $h(w) = 0$.
• The optimization problem is: $\min f(w)$ s.t. $h(w) = 0$
• The zero-gradient point may not satisfy the constraint.

[Figure: contours of $f$ and the constraint curve $h(w) = 0$ in the $(w_1, w_2)$ plane.]

Page 10: Directional Derivative

• Given a function $f(w)$ and a direction $v$ with $\|v\|_2 = 1$.
• What happens to $f(w + \alpha v)$ if we make a small change $\alpha$ in direction $v$? The rate of change is the directional derivative $\nabla f(w) \cdot v$.
• A direction $v$ along the curve $h(w) = 0$ has zero directional derivative of $h$.
• Thus the gradient $\nabla h(w)$ is orthogonal to the curve.
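A quick numeric sanity check of the directional derivative (my own illustration, with a made-up function):

```python
import numpy as np

f = lambda w: w[0]**2 + 3 * w[1]**2               # example function (made up)
grad_f = lambda w: np.array([2 * w[0], 6 * w[1]])  # its exact gradient

w = np.array([1.0, -2.0])
v = np.array([3.0, 4.0]); v /= np.linalg.norm(v)   # unit-norm direction
eps = 1e-6

fd = (f(w + eps * v) - f(w)) / eps                 # finite-difference slope along v
print(fd, grad_f(w) @ v)                           # both are approximately -8.4
```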

Page 11: Constrained Minimization

• The optimization problem is: $\min f(w)$ s.t. $h(w) = 0$
• Consider $f$ along the curve $\{w : h(w) = 0\}$.
• A vector of movement $v$ along the curve is orthogonal to $\nabla h(w)$.
• The gradient of $f$ along the curve is $\nabla f(w) \cdot v$.
• It is zero for all such $v$ iff $\nabla f(w) = \lambda \nabla h(w)$.

Page 12: Lagrange Multiplier

• The optimum points should satisfy:
  1. $\nabla f(w) = \lambda \nabla h(w)$ for some $\lambda$
  2. $h(w) = 0$ (constraint satisfied)
• Alternative formulation. Define the Lagrangian: $L(w, \lambda) = f(w) + \lambda h(w)$
• The optimum should satisfy:
  1. $\nabla_w L(w, \lambda) = 0$
  2. $\nabla_\lambda L(w, \lambda) = 0$
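To make the recipe concrete, a small worked example of my own (not from the slides): minimize $w_1^2 + w_2^2$ subject to $w_1 + w_2 = 1$ by solving the two stationarity conditions with SymPy.

```python
import sympy as sp

w1, w2, lam = sp.symbols('w1 w2 lam')
f = w1**2 + w2**2          # objective (made-up example)
h = w1 + w2 - 1            # constraint h(w) = 0
L = f + lam * h            # Lagrangian

# Stationarity in w and in lambda recovers the constrained optimum.
sols = sp.solve([sp.diff(L, w1), sp.diff(L, w2), sp.diff(L, lam)],
                [w1, w2, lam], dict=True)
print(sols)  # -> w1 = 1/2, w2 = 1/2, lam = -1
```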

Page 13: Example

• What is the distance between the hyperplane $w \cdot x + b = 0$ and a point $\bar{x}$?
• Solve: $\min_{x : w \cdot x + b = 0} \; 0.5\|x - \bar{x}\|_2^2$
• Lagrangian: $L(x, \lambda) = 0.5\|x - \bar{x}\|_2^2 + \lambda (w \cdot x + b)$
• Stationarity: $\nabla_x L(x, \lambda) = (x - \bar{x}) + \lambda w = 0$, so $x = \bar{x} - \lambda w$.
• Use primal feasibility to solve for $\lambda$: $(\bar{x} - \lambda w) \cdot w + b = 0$, giving $\lambda = \frac{\bar{x} \cdot w + b}{\|w\|^2}$.
• Hence the distance is $\|x - \bar{x}\|_2 = \frac{|\bar{x} \cdot w + b|}{\|w\|}$.
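A numeric check of the distance formula (illustration only; hyperplane and point are made up):

```python
import numpy as np

w = np.array([3.0, 4.0]); b = -5.0      # hyperplane w·x + b = 0 (made up)
x_bar = np.array([4.0, 6.0])            # query point

lam = (x_bar @ w + b) / (w @ w)         # multiplier from primal feasibility
x_star = x_bar - lam * w                # closest point on the hyperplane

print(np.linalg.norm(x_star - x_bar))          # projected distance: 6.2
print(abs(x_bar @ w + b) / np.linalg.norm(w))  # closed-form distance: identical
```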

Page 14: Multiple Constraints

• Solve: $\min f(w)$ s.t. $h_i(w) = 0 \;\; \forall i = 1, \ldots, p$
• Introduce a multiplier per constraint: $\lambda_1, \ldots, \lambda_p$.
• Lagrangian: $L(w, \lambda) = f(w) + \sum_i \lambda_i h_i(w)$
• Optimality conditions:
  1. $\nabla_w L(w, \lambda) = 0$
  2. $\nabla_{\lambda_i} L(w, \lambda) = 0$
• There may be several such points; we need to check which one is the global optimum.

Page 15: Inequality Constraints

• Solve: $\min f(w)$ s.t. $h(w) \le 0$
• We are "stuck" when the only directions that decrease $f$ take us outside the constraint set. Namely:
• Optimality conditions:
  1. $h(w) \le 0$ (constraint satisfied)
  2a. When $h(w) = 0$ (stuck on the boundary): $\nabla f(w) = -\alpha \nabla h(w)$ for some $\alpha \ge 0$
  2b. When $h(w) < 0$ (interior): $\nabla f(w) = 0$

[Figure: in the region $h(w) \le 0$, progress is possible while $-\nabla f(w)$ points into the feasible set; we are stuck once $\nabla f(w) = -\alpha \nabla h(w)$.]

Page 16: Complementary Slackness

• The two cases ($h(w) = 0$ with $\nabla f(w) = -\alpha \nabla h(w)$, and $h(w) < 0$ with $\nabla f(w) = 0$) can be summarized as:
  $\nabla f(w) = -\alpha \nabla h(w)$
  $\alpha \, h(w) = 0$
  $\alpha \ge 0, \;\; h(w) \le 0$
• These are called the Karush-Kuhn-Tucker (KKT) conditions. Always necessary.
• Sufficient for convex optimization.

Page 17: Lagrange Multipliers

• Consider the general problem:
  $\min f(w)$ s.t. $h_i(w) = 0 \;\forall i = 1, \ldots, p$ and $g_i(w) \le 0 \;\forall i = 1, \ldots, m$
• Define the Lagrangian:
  $L(w, \lambda, \alpha) = f(w) + \sum_i \lambda_i h_i(w) + \sum_i \alpha_i g_i(w)$
• The optimum must satisfy:
  $\nabla_w L(w, \lambda, \alpha) = 0$
  $\alpha_i g_i(w) = 0 \;\forall i, \quad \alpha_i \ge 0, \;\; g_i(w) \le 0, \;\; h_i(w) = 0$
• Typically easy if someone hands us $\alpha, \lambda$!

Page 18: Convex Optimization

• A general optimization problem may have many local minima/maxima and saddle points.
• This makes minimization hard (e.g., exponential in the dimension).
• Convex optimization problems are a "nice" subclass. They require:
  • Convex $f(w)$ and $g_i(w)$
  • Linear $h_i(w)$

Page 19: Convex Optimization

• A function is convex if:
  • Its value along any line segment lies below the linear interpolation of its endpoints.
  • Equivalently, it has a non-negative second derivative (or positive semidefinite Hessian).
• Examples: $f(w) = w \cdot x$, $\;\; f(w) = \max[w \cdot x, 0]$, $\;\; f(w) = w^T A w$ with $A \succeq 0$.

Page 20: Convex Optimization

• Nice things:
  • No local optima.
  • KKT conditions are sufficient for global optimality.
  • The multipliers can be found by solving the dual.

Page 21: Convex Duality

• For every convex problem, we can define a dual problem that has the same value.

• Optimization is over the Lagrange multipliers.

• Solution to dual implies solution to primal via KKT

• Dual might be easier to solve.

Page 22: Convex Duality

• Recall the Lagrangian:
  $L(w, \lambda, \alpha) = f(w) + \sum_i \lambda_i h_i(w) + \sum_i \alpha_i g_i(w)$
• Then:
  $\min f(w)$ s.t. $h_i(w) = 0, \; g_i(w) \le 0 \quad = \quad \min_w \max_{\lambda, \alpha \ge 0} L(w, \lambda, \alpha)$
  Why? If any constraint is violated, the inner max can drive $L$ to $+\infty$; if all constraints are satisfied, the max over the multipliers is just $f(w)$.
• Swapping min and max gives:
  $\min_w \max_{\lambda, \alpha \ge 0} L(w, \lambda, \alpha) \;\ge\; \max_{\lambda, \alpha \ge 0} \min_w L(w, \lambda, \alpha)$
• In the convex case it is an equality.

Page 23: Convex Duality

• Define: $g(\lambda, \alpha) = \min_w L(w, \lambda, \alpha)$
• Dual problem: $\max_{\lambda, \alpha \ge 0} g(\lambda, \alpha)$
• It has the same value as the primal problem.
• The resulting $\lambda, \alpha$ are optimal. You can recover the "primal" variables $w$ via the KKT conditions, which is often easy:
  $\nabla_w L(w, \lambda, \alpha) = 0$
  $\alpha_i g_i(w) = 0 \;\forall i, \quad \alpha_i \ge 0, \;\; g_i(w) \le 0, \;\; h_i(w) = 0$

Page 24: Maximum Margin Classifier

• For every point $x_i$, denote by $d(x_i, w, b)$ its distance from the hyperplane.
• Margin of a classifier: the shortest distance to the hyperplane, $\min_i d(x_i, w, b)$.
• Goal: find the classifier that maximizes $\min_i d(x_i, w, b)$.

Page 25: Geometry of Linear Classifiers

$y = \operatorname{sign}[w \cdot x + b]$

• $w$ is the direction orthogonal to the hyperplane $w \cdot x + b = 0$.
• Proof: if $x_1, x_2$ are on the hyperplane, then $w \cdot (x_1 - x_2) = 0$.
• What is $b$? The distance from the origin to the hyperplane is $\frac{|b|}{\|w\|}$.
• More generally, the distance from a point $x$ to the hyperplane is $\frac{|x \cdot w + b|}{\|w\|}$.

Page 26: Max Margin Hyperplane

• Find a hyperplane that maximizes the minimum distance.
• Solve:
  $\max_w \frac{1}{\|w\|} \min_i |w \cdot x_i + b| \quad$ s.t. $\; y_i (w \cdot x_i + b) \ge 0$
• Any solution $(w, b)$ can be rescaled to $(cw, cb)$ without affecting the objective or the constraints.
• We can therefore rescale such that $\min_i |w \cdot x_i + b| = 1$:
  $\max_w \|w\|^{-1} \quad$ s.t. $\; y_i (w \cdot x_i + b) \ge 0, \;\; \min_i |w \cdot x_i + b| = 1$

Page 27: Max Margin Hyperplane

• We had:
  $\max_w \|w\|^{-1} \quad$ s.t. $\; y_i (w \cdot x_i + b) \ge 0, \;\; \min_i |w \cdot x_i + b| = 1$
• Equivalently:
  $\max_w \|w\|^{-1} \quad$ s.t. $\; \min_i y_i (w \cdot x_i + b) = 1$
• We can relax the equality to an inequality $\min_i y_i (w \cdot x_i + b) \ge 1$ (why? at any optimum the inequality is tight, since otherwise we could shrink $w, b$ and improve the objective):
  $\max_w \|w\|^{-1} \quad$ s.t. $\; y_i (w \cdot x_i + b) \ge 1$
• Equivalently:
  $\min_w \|w\|^2 \quad$ s.t. $\; y_i (w \cdot x_i + b) \ge 1$

Page 28: Support Vector Machines (SVM)

• The SVM classifier is the solution to:
  $\min_w 0.5\|w\|^2 \quad$ s.t. $\; y_i (w \cdot x_i + b) \ge 1$
  (The factor 0.5 doesn't affect the optimum.)
• The $x_i$ where this constraint holds with equality are the "support vectors".
• It is a convex optimization problem, specifically a convex quadratic program (quadratic objective and linear constraints).
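Since this is a standard convex QP, a generic solver handles it directly. A minimal sketch using CVXPY on made-up separable data (my own illustration; CVXPY is not referenced in the lecture):

```python
import cvxpy as cp
import numpy as np

# Toy separable data (made up)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))   # 0.5 * ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]     # margin constraints
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print(y * (X @ w.value + b.value))  # all >= 1; equality at the support vectors
```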

Page 29: SVM History

• Initial version by Vapnik and Chervonenkis (1963).
• Non-linear version by Boser, Guyon, and Vapnik (1992).
• Much work on generalization theory since (by Bartlett, Shawe-Taylor, Mendelson, Schoelkopf, Smola, and others).
• Many variants for regression, unsupervised learning, etc.

Page 30: Solving SVM

• The SVM classifier is the solution to:
  $\min_w 0.5\|w\|^2 \quad$ s.t. $\; y_i (w \cdot x_i + b) \ge 1$
• You can plug this into a solver and get $w, b$.
• Let's use the Lagrangian to understand the solution:
  $L(w, b, \alpha) = 0.5\|w\|^2 + \sum_i \alpha_i \left[1 - y_i (w \cdot x_i + b)\right]$
  $\nabla_w L(w, b, \alpha) = w - \sum_i \alpha_i y_i x_i = 0 \;\;\Rightarrow\;\; w = \sum_i \alpha_i y_i x_i$
  $\nabla_b L(w, b, \alpha) = -\sum_i \alpha_i y_i = 0 \;\;\Rightarrow\;\; \sum_i \alpha_i y_i = 0$

Page 31: The Representer Theorem

• The optimal weight vector is a weighted combination of the data points: $w = \sum_i \alpha_i y_i x_i$
• This will be very important!
• When is $\alpha_i = 0$? (Recall the KKT conditions.)
• Whenever $y_i (w \cdot x_i + b) > 1$, i.e., whenever the constraint is slack.
• $\alpha_i > 0$ only when $y_i (w \cdot x_i + b) = 1$.
• So the optimal weight is a combination only of support vectors!

Page 32: Deriving via the Dual

• How do we find the $\alpha_i$ in $w = \sum_i \alpha_i y_i x_i$, and then $b$?
• Use the dual! Recall
  $L(w, b, \alpha) = 0.5\|w\|^2 + \sum_i \alpha_i \left[1 - y_i (w \cdot x_i + b)\right]$
  $g(\alpha) = \min_{w, b} L(w, b, \alpha)$
• We know the minimizing $w$. Plug it into the Lagrangian:
  $g(\alpha) = 0.5 \Big\| \sum_i \alpha_i y_i x_i \Big\|_2^2 - \sum_i \alpha_i \Big[ y_i \Big( \Big( \sum_j \alpha_j y_j x_j \Big) \cdot x_i + b \Big) - 1 \Big] = \sum_i \alpha_i - 0.5 \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
• We constrain $\sum_i \alpha_i y_i = 0$, because otherwise minimizing over $b$ gives $g(\alpha) = -\infty$.

Page 33: The SVM Dual

• The dual problem is:
  $\max_\alpha \sum_i \alpha_i - 0.5 \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad$ s.t. $\; \alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0$
• The number of variables and constraints is the number of training points.
• It is also a convex quadratic program (why? the quadratic part is concave, since the Gram matrix $[y_i y_j \, x_i \cdot x_j]$ is positive semidefinite, and the constraints are linear).
• Obtaining the primal $w$: $\; w = \sum_i \alpha_i y_i x_i$
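In practice a library typically solves this dual for you. A sketch with scikit-learn's SVC (my own illustration, not from the lecture; a very large C approximates the hard-margin problem):

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (made up)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # large C ~ hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors,
# so w = sum_i alpha_i y_i x_i is recovered as:
w = clf.dual_coef_ @ clf.support_vectors_
print(w, clf.coef_)            # the two should agree
print(clf.support_vectors_)    # only support vectors get alpha_i > 0
```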

Page 34: Finding b

• Recall from the KKT conditions that support vectors ($\alpha_i > 0$) satisfy: $y_i (w \cdot x_i + b) = 1$
• Since we know $w$, we can solve for $b$: for any support vector, $b = y_i - w \cdot x_i$.
• This should give the same value for all support vectors.
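Continuing the scikit-learn sketch from the previous page (the variables `clf`, `w`, `X`, `y` come from that snippet), $b$ can be recovered from any support vector:

```python
# For each support vector x_s with label y_s: y_s (w·x_s + b) = 1,
# hence b = y_s - w·x_s. All support vectors should agree.
sv_labels = y[clf.support_]                       # labels of the support vectors
b_values = sv_labels - clf.support_vectors_ @ w.ravel()
print(b_values, clf.intercept_)                   # all approximately the same value
```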

Page 35: Non-Separable Case

• So far we assumed a separating hyperplane exists.
• If it doesn't, our optimization problem is infeasible.
• For real data, we don't want to make this assumption. Because:
  • Data may be noisy. A linear classifier may still do OK.
  • The data may come from a non-linear rule. Next class!

Page 36: Non-Separable Case

• Ideally, we would like to find the classifier that minimizes the training error. But:
  • It turns out this is NP-hard.
  • How do we incorporate the margin?
• Let's start from the separable case.

Page 37: Non-Separable Case

• Separable case: $\min_w 0.5\|w\|^2 \;$ s.t. $\; y_i (w \cdot x_i + b) \ge 1$
• We need to "relax" the constraints.
• Allow violation by $\xi_i \ge 0$, but "pay" for the violation:
  $\min_{w} 0.5\|w\|^2 + C \sum_i \xi_i \quad$ s.t. $\; y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0$
• $C$ is a constant that determines how much we care about classification errors as opposed to margin.

Page 38: Dual for the Non-Separable Case

• The dual is:
  $\max_\alpha \sum_i \alpha_i - 0.5 \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad$ s.t. $\; 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0$
• The mapping to the primal is as before.

Page 39: Alternative Interpretation

• The primal is:
  $\min_{w} 0.5\|w\|^2 + C \sum_i \xi_i \quad$ s.t. $\; y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0$
• We can solve for $\xi_i$ explicitly: $\; \xi_i = \max[0, 1 - y_i(w \cdot x_i + b)]$
• The problem becomes:
  $\min_w C \sum_i \max[0, 1 - y_i(w \cdot x_i + b)] + 0.5\|w\|_2^2$
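Since this last form is unconstrained, it can also be minimized directly by subgradient descent on the hinge loss. Below is a batch-subgradient sketch of my own (the step size, epoch count, and data are made-up choices; this is an alternative to the QP/dual route the lecture develops):

```python
import numpy as np

def hinge_svm_sgd(X, y, C=1.0, epochs=200, lr=0.01):
    """Subgradient descent on C * sum_i max(0, 1 - y_i(w·x_i + b)) + 0.5||w||^2."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                      # points with nonzero hinge loss
        # Subgradient: w from the regularizer, minus C * sum over active points
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hinge_svm_sgd(X, y)
print(np.sign(X @ w + b))  # should match y
```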

Page 40: Alternative Interpretation

• The primal is:
  $\min_w C \sum_i \max[0, 1 - y_i \, w \cdot x_i] + 0.5\|w\|_2^2$
• The function $\max[0, 1 - y_i \, w \cdot x_i]$ is called the hinge loss.
• SVM uses the hinge loss as an approximation to the 0-1 loss.

[Figure: the hinge loss and the 0-1 loss plotted against the margin $y_i \, w \cdot x_i$.]

• It upper-bounds the true classification error.
• A convex upper bound!

Page 41: Alternative Interpretation

• The primal is:
  $\min_w C \sum_i \max[0, 1 - y_i \, w \cdot x_i] + 0.5\|w\|_2^2$
  (a bound on the loss, plus regularization)
• This is a very common design pattern.
• Other losses and regularizers can be considered:
  • Logistic loss: $\frac{1}{\ln 2} \ln(1 + e^{-y_i \, w \cdot x_i})$
  • L1 regularization: $\|w\|_1 = \sum_i |w_i|$. Sparsity-inducing.
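For intuition, a quick numeric comparison of the two surrogate losses against the 0-1 loss (my own illustration; the margin values are arbitrary):

```python
import numpy as np

m = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0])   # margins y_i * (w · x_i)
hinge = np.maximum(0.0, 1.0 - m)
logistic = np.log1p(np.exp(-m)) / np.log(2.0)    # scaled so the loss at margin 0 is 1
zero_one = (m <= 0).astype(float)
print(np.c_[m, zero_one, hinge, logistic])       # both surrogates upper-bound 0-1
```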

Page 42: SVM and Generalization

• Intuitively, choosing a large margin should improve generalization.
• Assume the true distribution and classifier are such that the margin is $\gamma$.
• Expect generalization to behave like $\gamma^{-1}$.
• But $\gamma$ can always be increased by rescaling the data.
• Denote by $R$ the largest norm of $x$.
• Generalization scales with $R\gamma^{-1}$.

Page 43: SVM and Generalization

• Assume the training error is zero.
• It can be shown that the generalization error satisfies (up to some logarithmic factors):
  $\mathrm{error}(w) \le \frac{c_1}{m} \cdot \frac{R^2}{\gamma^2} + \frac{c_2}{m} \log m$
• The VC dimension is replaced by $\frac{R^2}{\gamma^2}$.
• Appeared in "Structural Risk Minimization over Data-Dependent Hierarchies" (1998).

Page 44: Leave-One-Out Bounds

• Another intuition: using few support vectors should lead to good generalization.
• We will show this via the leave-one-out (LOO) error.
• Denote by $S^{-i}$ the training sample without $(x_i, y_i)$.
• Denote by $h_S$ the hypothesis obtained from training on $S$.
• The LOO error is:
  $\hat{R}_{LOO}(S) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}\left[h_{S^{-i}}(x_i) \ne y_i\right]$
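Computing $\hat{R}_{LOO}$ is a direct loop: hold out point $i$, retrain, and test on the held-out point. A sketch of my own using scikit-learn (the dataset and settings are made up):

```python
import numpy as np
from sklearn.svm import SVC

def loo_error(X, y, make_clf):
    """Leave-one-out error: retrain with point i held out, test on point i."""
    m = len(y)
    mistakes = 0
    for i in range(m):
        keep = np.arange(m) != i
        h = make_clf().fit(X[keep], y[keep])
        mistakes += h.predict(X[i:i + 1])[0] != y[i]
    return mistakes / m

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
print(loo_error(X, y, lambda: SVC(kernel='linear', C=1e6)))
```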

Page 45: Leave-One-Out Bounds

• The LOO error is similar in spirit to the generalization error, but we only train on $m - 1$ points.
• Denote by $R(h)$ the generalization error of $h$: $\; R(h) = \mathbb{E}_{(x, y) \sim D} \, \mathbb{I}[h(x) \ne y]$
• Can show:
  $\mathbb{E}_{S_m}\!\left[\hat{R}_{LOO}(S_m)\right] = \mathbb{E}_{S_{m-1}}\!\left[R(h_{S_{m-1}})\right]$
• That is, the LOO error and the generalization error have the same expected value.

Page 46: Leave-One-Out Bounds for SVM

• What is the expected LOO error of SVM (separable case)?
• If a non-support vector is left out, the solution will not change, and the error on that point will be zero.
• Otherwise there might be an error, so:
  $\hat{R}_{LOO}(S_m) \le \frac{N_{SV}(S_m)}{m}$
• Therefore:
  $\mathbb{E}_{S_{m-1}}\!\left[R(h_{S_{m-1}})\right] \le \frac{1}{m} \mathbb{E}_{S_m}\!\left[N_{SV}(S_m)\right]$
• Generalization is related to the number of support vectors.
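An empirical sanity check of $\hat{R}_{LOO}(S_m) \le N_{SV}(S_m)/m$, reusing the `loo_error` helper and the made-up data `X, y` from the sketch two pages back (illustration only):

```python
clf = SVC(kernel='linear', C=1e6).fit(X, y)
n_sv = len(clf.support_)                      # number of support vectors
print(loo_error(X, y, lambda: SVC(kernel='linear', C=1e6)),
      "<=", n_sv / len(y))                    # LOO error <= N_SV / m
```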