Support Vector Machine and Convex Optimization
Ian En-Hsu Yen
Overview
• Support Vector Machine
– The Art of Modeling --- Large Margin and Kernel Trick
– Convex Analysis
– Optimality Conditions
– Duality
• Optimization for Machine Learning
– Dual Coordinate Descent (DCD) ( fast convergence, moderate cost )
• libLinear (Stochastic CD)
• libSVM (Greedy CD)
– Primal Methods
• Non-smooth Loss Stochastic Gradient Descent ( slow convergence, cheap iter. )
• Differentiable Loss Quasi-Newton Method ( very fast convergence, expensive iter. )
– Demo
A Learning/Prediction Game
• Your team members suggest a Hypothesis Space: H = { h1, h2, … }.
• You can request only one sample.
• If you find a hypothesis with accuracy > 50%, you earn $100,000; backing a wrong hypothesis ( acc ≤ 50% ) costs a $100,000 penalty.
• First game: H = { h1 },  h1: (A+B) mod 13 = C
• Second game: H = { h1, h2 },  h1: (A+B) mod 13 = C,  h2: (A−B) mod 13 = C
Large |H| with Small |Data| Guarantees Nothing
• First case: only one hypothesis h1
  Pr{ |Train_Error − Test_Error| ≥ 50% } ≤ 1/2.
• Second case: two hypotheses h1, h2 ( by the union bound )
  Pr{ |Train_Error − Test_Error| ≥ 50% for h1 or h2 } ≤ 1/2 + 1/2 = 1.
  The bound guarantees nothing.
Why Support Vector Machine (SVM) ?
• Flexible Hypothesis Space ( Non-linear Kernel )
• Not prone to Overfit ( Large Margin )
• Sparsity ( Support Vectors )
• Easy to find the Global Optimum ( Convex Problem )
[Figure: hypothesis-space spectrum from Small (Linear Model) to Large (KNN). If the ground truth is nonlinear but large-margin, SVM fits it directly; otherwise it calls for more feature/kernel engineering.]
SVM: Large-Margin Perceptron

The large-margin classifier maximizes the smallest distance from any training point to the separating hyperplane:

  w* = argmax_w min_n y_n (w^T x_n) / ||w||

Making the inner minimum an explicit margin variable γ:

  max_{w,γ} γ / ||w||   s.t.  y_n (w^T x_n) ≥ γ, ∀n

Scaling w does not change the classifier, so choose the scale with γ = 1:

  max_w 1 / ||w||   s.t.  y_n (w^T x_n) ≥ 1, ∀n
  ⇔  min_w ||w||    s.t.  y_n (w^T x_n) ≥ 1, ∀n
  ⇔  min_w ||w||²   s.t.  y_n (w^T x_n) ≥ 1, ∀n
SVM: Large-Margin Perceptron

Hard Margin ( separable case ):

  min_w (1/2)||w||²   s.t.  y_n (w^T x_n) ≥ 1, ∀n

Non-separable problems need slack variables ξ_n ( Soft Margin ):

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T x_n) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

A small C tolerates more margin violations; a large C approaches the hard margin. A drawback of SVM: the solution is sensitive to C.
From Linear to Non-Linear

Perceptron ( a line through the origin ):  a x1 + b x2 = 0
Ellipse ( centered at the origin ):  a x1² + b x2² + c x1x2 = 0
From Linear to Non-Linear

Linear SVM:

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T x_n) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

Non-linear SVM replaces each x_n with a feature expansion φ(x_n):

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T φ(x_n)) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

For example, the degree-2 expansion of x = (x1, x2, x3):

  φ(x) = ( x1², x2², x3², √2 x1x2, √2 x2x3, √2 x1x3 )
SVM: Kernel Trick

Explicit feature expansion x → φ(x) quickly becomes expensive:
• 3 features → 3 + C(3,2) = 6 degree-2 features
• 100 features → 100 + C(100,2) = 5050 degree-2 features
A degree-2 expansion has O(D²) features; a degree-K expansion has O(D^K).

The dot product, however, can be computed without ever forming φ(x). For the degree-2 expansion above:

  φ(x)^T φ(z) = x1²z1² + x2²z2² + x3²z3² + 2 x1x2 z1z2 + 2 x2x3 z2z3 + 2 x1x3 z1z3
              = (x^T z)²

so the kernel K(x, z) = (x^T z)² computes the O(D^K)-dimensional dot product in O(D) time.
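The identity above is easy to verify numerically; a small sketch ( numpy assumed, names illustrative ):

```python
import numpy as np

# Degree-2 feature expansion of a 3-dimensional vector:
# phi(x) = (x1^2, x2^2, x3^2, sqrt(2) x1 x2, sqrt(2) x2 x3, sqrt(2) x1 x3)
def phi(x):
    x1, x2, x3 = x
    return np.array([x1**2, x2**2, x3**2,
                     np.sqrt(2)*x1*x2, np.sqrt(2)*x2*x3, np.sqrt(2)*x1*x3])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

explicit = phi(x) @ phi(z)   # O(D^2) work: build the expansion, then dot
kernel = (x @ z) ** 2        # O(D) work: K(x, z) = (x^T z)^2
assert np.isclose(explicit, kernel)
```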
SVM: Kernel Trick

Can we formulate the problem using only dot products φ(x_i)^T φ(x_j)?

By the Representer Theorem, the solution w* of the problem can be expressed as a linear combination of the instances:

  w* = Σ_{n=1}^N α_n* y_n φ(x_n) = α_1* y_1 φ(x_1) + … + α_N* y_N φ(x_N)

Prediction then uses only dot products:

  w^T φ(x_t) = ( Σ_n α_n y_n φ(x_n) )^T φ(x_t) = Σ_n α_n y_n K(x_n, x_t)

at cost O(N·D), or O(|Support Vectors|·D) since most α_n turn out to be zero.

Substituting w = Σ_n α_n y_n φ(x_n) into the primal gives a problem in α:

  min_{α,ξ} (1/2) α^T Q α + C Σ_n ξ_n
  s.t.  y_i Σ_n α_n y_n K(x_n, x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0, ∀i

where Q_ij = ( y_i φ(x_i) )^T ( y_j φ(x_j) ) = y_i y_j K(x_i, x_j).
SVM: Kernel Trick

Some popular Kernels:
  Linear Kernel:      K(x, x') = x^T x'
  Polynomial Kernel:  K(x, x') = ( x^T x' + 1 )^d
  RBF Kernel:         K(x, x') = exp( −γ ||x − x'||² )

Kernels may be easier to define than features:
• String Kernel: gene classification / rewriting detection
• Tree Kernel: syntactic parse tree classification
• Graph Kernel: graph type classification
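For concreteness, the three closed-form kernels in a few lines of numpy ( γ and d as illustrative hyperparameter names ):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=2):
    return (x @ z + 1.0) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
print(linear_kernel(x, z))      # 0.0
print(polynomial_kernel(x, z))  # (0 + 1)^2 = 1.0
print(rbf_kernel(x, z))         # exp(-2)
```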
Convex Analysis

The general optimization problem min_x f(x) s.t. x ∈ C is very difficult to solve ( the trade-off is a very long running time vs. an approximate answer ). Optimization is much easier if the problem is convex, that is:

1. The objective function is convex:
   f( αx + (1−α)y ) ≤ α f(x) + (1−α) f(y)   for all 0 ≤ α ≤ 1
2. The feasible domain ( constrained space ) is convex:
   if x, y ∈ C, then αx + (1−α)y ∈ C   for all 0 ≤ α ≤ 1

[Figure: the chord from (x, f(x)) to (y, f(y)) lies above the graph of a convex f.]

For convex problems, every local minimum is a global minimum !!

Simple examples:
• Convex set { a ≤ x ≤ b }; non-convex function
• Convex set { a ≤ x ≤ b }; convex function
• Non-convex set { x ≤ a or x ≥ b }; convex function
Convex Analysis

Why is a local minimum x* also a global minimum for a convex problem?

If x* is a local minimum, there is a "ball" around x* in which every feasible x' has f(x') ≥ f(x*). Assume for contradiction that x* is not a global minimum: then there is a feasible x' with f(x') < f(x*). Since the feasible set is convex, αx' + (1−α)x* is feasible for 0 ≤ α ≤ 1, and for small enough α it lies inside the ball; by convexity of f,

  f( αx' + (1−α)x* ) ≤ α f(x') + (1−α) f(x*) < f(x*)

contradicting local minimality of x*.
Convex Analysis

Examples of convex sets:
• Linear equality constraint ( hyperplane ): { x : a^T x = b }, or a system Ax = b
• Linear inequality constraint ( halfspace ): { x : a^T x ≤ b }, and polyhedra such as { x : Ax ≤ b, c^T x ≤ d }

The intersection of convex sets is convex: if x, y ∈ A ∩ B with A, B convex, then αx + (1−α)y lies in A and in B, hence in A ∩ B.
Convex Analysis

Examples of convex functions:
• Linear function: f(x) = c^T x — convex for any c.
• Quadratic function: f(x) = (1/2) x^T Q x + c^T x — convex?

Obviously, it depends: ax² + bx + c is convex for a > 0 and concave for a < 0. A practical way to check convexity ( in one dimension ) is the second derivative:

  d²f/dx² (x) ≥ 0 at every x.
Convex Analysis

In R^D, we have convexity iff the Hessian matrix

  ∇²f(x) = [ ∂²f / ∂x_i ∂x_j ]_{i,j = 1..D}

is positive semidefinite at every x, i.e. z^T ∇²f(x) z ≥ 0 for all z ( equivalently, all eigenvalues ≥ 0 ).

Other cases:
• H positive definite: a strictly convex bowl.
• H negative definite: a concave cap.
• H neither positive nor negative (semi-)definite: a saddle.
• H = O is both positive and negative semidefinite: f is affine, both convex and concave.
Convex Analysis

Back to the examples:
• Linear function f(x) = c^T x: H(x) = ∇²f(x) = O, which is positive semidefinite, so f is convex.
• Quadratic function f(x) = (1/2) x^T Q x + c^T x: H(x) = Q, so f is convex iff Q is positive semidefinite.

For the hard-margin SVM

  min_w (1/2)||w||²   s.t.  y_n (w^T x_n) ≥ 1, ∀n

the objective (1/2)||w||² = (1/2) w^T I w, and the identity I is positive semidefinite.
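Positive semidefiniteness ( and hence convexity of a quadratic ) can be checked numerically through eigenvalues; a minimal sketch in numpy:

```python
import numpy as np

def is_psd(Q, tol=1e-10):
    # Q is PSD iff all eigenvalues of its symmetric part are >= 0.
    Qs = (Q + Q.T) / 2
    return bool(np.all(np.linalg.eigvalsh(Qs) >= -tol))

I = np.eye(3)
assert is_psd(I)                            # (1/2) w^T I w is convex
assert not is_psd(np.array([[1.0, 0.0],
                            [0.0, -1.0]]))  # indefinite: a saddle
```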
Convex Analysis

For the soft-margin SVM

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T x_n) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

the Hessian with respect to (w, ξ) is the block matrix

  H = [ I_{D×D}  O_{D×N} ]
      [ O_{N×D}  O_{N×N} ]

which is positive semidefinite, and every constraint is a halfspace. The SVM problem is therefore a convex problem ( a Quadratic Program ).
Convex Analysis

Examples of convex problems:
• Linear Programming: linear objective function s.t. linear constraints.
• Quadratic Programming: quadratic objective function (1/2) x^T P x + q^T x s.t. linear constraints, where P must be positive semidefinite.
Optimality Condition

There are many, many different solvers designed for different problems, but they share the same optimality conditions.

First, we consider the Unconstrained Problem:

  min_x f(x)    ( Example: Matrix Factorization, which is non-convex )

For twice-differentiable f(x), consider the Taylor expansion:

  f(x* + p) = f(x*) + ∇f(x*)^T p + (1/2) p^T ∇²f(x*) p + …

• x* is a local minimizer ( f(x*) ≤ f(x* + p) for all small p )  ⇒  ∇f(x*) = 0.
• ∇f(x*) = 0 alone does not make x* a local minimizer: it may be a maximum or a saddle. One also needs ∇²f(x*) positive semidefinite; no need to check this for a convex function ( why? its Hessian is positive semidefinite everywhere ).
• For convex f(x):  x* is a global minimizer  ⇔  ∇f(x*) = 0.

Assume convexity from now on……
Optimality Condition

Now, consider the Equality Constrained Problem:

  min_x f(x)   s.t.  Ax = b

( nonlinear equality is, in general, not convex ). Write

  Row(A): span{ a_1, …, a_n },   Null(A): { Zq },  where Z = [ z_1, …, z_{d−n} ]

so every feasible perturbation has the form p = Zq, and x* is a local minimizer iff f(x* + Zq) ≥ f(x*) for all small q. The Taylor expansion along feasible directions,

  f(x* + Zq) = f(x*) + ( Z^T ∇f(x*) )^T q + (1/2) q^T Z^T ∇²f(x*) Z q + …

gives, for convex f(x): x* is a local ( global ) minimizer iff

  Z^T ∇f(x*) = 0,  i.e.  −∇f(x*) = A^T λ for some λ  ( ∇f(x*) ∈ Row(A) )

The coefficients λ are the Lagrange multipliers. Intuition: any component of −∇f(x*) outside Row(A) would be a feasible direction along which f decreases, so x* could detach from the constraint and decrease f. ( Example: one constraint a_1^T x = b; at the optimum −∇f(x*) = λ_1 a_1. ) For equality constraints the sign of each λ_n is unrestricted ( λ_n ≥ 0? ≤ 0? = 0? ); the sign becomes meaningful for inequality constraints, next.
Optimality Condition

Now, consider the Inequality Constrained Problem:

  min_x f(x)   s.t.  Ax ≤ b

( assume linear inequality for simplicity ). Let A* ( some rows of A ) be the coefficients of the binding constraints, those with a_n^T x* = b_n. Then x* is a local ( global ) minimizer iff

  −∇f(x*) = A*^T λ*   with  λ* ≥ 0

( every feasible direction does not decrease f(x) ). For a non-binding constraint ( a_n^T x* < b_n ) we require λ_n = 0; compactly,

  λ_n ( a_n^T x* − b_n ) = 0, ∀n

Together, stationarity, feasibility, λ ≥ 0, and complementary slackness are the KKT conditions.
Optimality Condition for SVM

What are the KKT conditions for

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T φ(x_n)) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n ?

Introduce multipliers α_n ≥ 0 for the margin constraints and μ_n ≥ 0 for ξ_n ≥ 0. Stationarity:

  ∇_w:  w − Σ_n α_n y_n φ(x_n) = 0
  ∂/∂ξ_n:  C − α_n − μ_n = 0   ⇒   0 ≤ α_n ≤ C

Complementary slackness:

  α_n ( y_n (w^T φ(x_n)) − 1 + ξ_n ) = 0
  μ_n ξ_n = ( C − α_n ) ξ_n = 0

Reading these off:
1. w = Σ_n α_n y_n φ(x_n): the solution can be expressed as a linear combination of instances.
2. If the constraint is not binding ( y_n (w^T φ(x_n)) > 1 ):  α_n = 0.
3. If the constraint is binding ( y_n (w^T φ(x_n)) = 1 − ξ_n ):  α_n ≥ 0  ( Support Vectors ! ).
4. If the loss of the n-th instance ξ_n > 0:  α_n = C.
Primal SVM Problem

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T φ(x_n)) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

is a Quadratic Program (QP) with ( letting D = #features, N = #samples ):
• D + N variables
• N linear constraints
• N nonnegativity constraints

Solving it with a general-purpose QP solver is already intractable at medium scale ( e.g. N = 1000, D = 1000 ).
Primal SVM Problem: Constrained → Non-smooth Unconstrained

Given w, minimizing over each ξ_n in

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n f(x_n) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

has a closed form:

  ξ_n* = max{ 1 − y_n f(x_n), 0 } = { 0 if y_n f(x_n) ≥ 1;  1 − y_n f(x_n) otherwise }

Substituting back gives a non-smooth, unconstrained problem:

  min_w (1/2)||w||² + C Σ_n L( f(x_n), y_n )

where L is the hinge loss L( f(x), y ) = max{ 1 − y f(x), 0 }. Every constrained problem can be transformed into a non-smooth unconstrained problem in this way.
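The resulting unconstrained objective is a one-liner in numpy; a sketch ( X holds instances as rows, y ∈ {−1, +1}, names illustrative ):

```python
import numpy as np

def svm_objective(w, X, y, C=1.0):
    # (1/2)||w||^2 + C * sum_n max(1 - y_n w^T x_n, 0)
    margins = y * (X @ w)
    hinge = np.maximum(1.0 - margins, 0.0)
    return 0.5 * (w @ w) + C * np.sum(hinge)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, -1.0])
print(svm_objective(w, X, y))  # both margins equal 1, so loss 0; objective = 1.0
```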
L2-Regularized Loss Minimization

  min_w (λ/2)||w||² + Σ_n L( f(x_n), y_n )    ( equivalently (1/2)||w||² + C Σ_n L with C = 1/λ )

Common losses ( plotted against y_n f(x_n), with the 0/1 loss, i.e. accuracy, for reference ):
• Hinge loss max{ 1 − y f(x), 0 }  ( L1-SVM, Structural SVM )
• Squared hinge loss max{ 1 − y f(x), 0 }²  ( L2-SVM )
• ε-insensitive loss  ( SVR )
• Squared loss ( f(x) − y )²  ( (least-square) regression )
• Logistic loss log( 1 + exp( −y f(x) ) )  ( Logistic Regression, CRF )

Observations:
• All of the above are convex losses: we can solve for the global minimum.
• The convex smooth losses ( squared hinge, squared, logistic ) are applicable to second-order methods, primal coordinate descent, and gradient descent with an O(log(1/ε)) rate; non-smooth losses give only an O(1/ε) rate.
• Dual sparsity: the hinge-type losses are exactly zero for y f(x) ≥ 1, which yields sparse dual solutions ( support vectors ).
• Noise sensitivity: the faster-growing squared losses are noise-sensitive; the 0/1 loss is the most insensitive ( but non-convex ).
Stochastic (sub-)Gradient Descent ( S. Shalev-Shwartz et al., ICML 2007 )

  min_w (λ/2)||w||² + (1/N) Σ_n L( f(x_n), y_n ),  L = hinge loss ( L1-SVM, Structural SVM )

The hinge loss is not differentiable at y_n f(x_n) = 1, so we take a sub-gradient there.

Algorithm: Subgradient Descent
  For t = 1…T:
    w^(t+1) = w^(t) − η_t ( λ w^(t) + (1/N) Σ_n L'_n( w^(t) ) )
  ( iteration cost: O(N·D) )

A common choice of step size: η_t = 1/(λt).

Algorithm: Stochastic Subgradient Descent
  For t = 1…T:
    draw ñ uniformly from {1…N}
    w^(t+1) = w^(t) − η_t ( λ w^(t) + L'_ñ( w^(t) ) )
  ( iteration cost: O(D) )

Averaging over the iterates converges much faster ( Shamir, O. and Zhang, ICML 2013 ).
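A minimal runnable sketch of the stochastic variant above ( Pegasos-style, with η_t = 1/(λt) ); the toy data and all names are illustrative:

```python
import numpy as np

def pegasos(X, y, lam=0.1, T=2000, seed=0):
    """Stochastic subgradient descent for the hinge loss (Pegasos-style)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for t in range(1, T + 1):
        eta = 1.0 / (lam * t)              # common step-size choice
        n = rng.integers(N)                # draw one instance uniformly
        if y[n] * (X[n] @ w) < 1.0:        # hinge subgradient is -y_n x_n here
            w = (1.0 - eta * lam) * w + eta * y[n] * X[n]
        else:                              # loss part contributes no subgradient
            w = (1.0 - eta * lam) * w
    return w

# Linearly separable toy data with margin 0.5 along the first coordinate.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
X[:, 0] += 0.5 * y
w = pegasos(X, y)
acc = np.mean(np.sign(X @ w) == y)   # training accuracy on the toy data
```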
Stochastic (sub-)Gradient Descent ( S. Shalev-Shwartz et al., ICML 2007 )

SGD is applicable to all of the losses above ( hinge, squared hinge, ε-insensitive, squared, logistic ):
• Non-smooth losses: GD O(1/ε), SGD O(1/ε)
• Smooth losses: GD O(log(1/ε)), SGD O(1/ε)

SGD (Pegasos) vs. a batch method (LibLinear) ( Hsieh et al., ICML 2008 ):
• Do you care about a "ratio improvement" or an "absolute improvement" in testing?
• What is your evaluation measure? ( AUC, Prec/Recall, Accuracy, … )
• Beware ill-conditioned problems ( skewed pos/neg ratio, large C ).
• Pros: SGD ( an online method ) has the same convergence rate for testing as for training.
• Cons: SGD converges very slowly ( sometimes it seems not to converge at all… ).
Smooth Loss vs. Non-smooth Loss

  min_w (λ/2)||w||² + Σ_n L( f(x_n), y_n )

• L1-SVM: max( 1 − y_n f(x_n), 0 ), non-smooth at y_n f(x_n) = 1.
• L2-SVM: max( 1 − y_n f(x_n), 0 )², differentiable.
• Logistic Regression (CRF): log( 1 + exp( −y_n f(x_n) ) ), differentiable.

With a differentiable loss we get an Unconstrained Differentiable Problem.
Primal (Quasi-)Newton Method

  min_w f(w) = (1/2)||w||² + C Σ_n L( w^T x_n, y_n )

• Gradient Descent ( 1st order ) uses a linear approximation of f via the gradient g ≡ ∇f(w).
• Newton Method ( 2nd order ) uses a quadratic approximation via g and the Hessian H ≡ ∇²f(w).

For a differentiable loss such as the L2-SVM loss max( 1 − y_n w^T x_n, 0 )²:

  g = ∇f(w) = w + C Σ_n L'(n) x_n
  H = ∇²f(w) = I + C Σ_n L''(n) x_n x_n^T

where L'(n), L''(n) are the first and second derivatives of the loss with respect to w^T x_n at the n-th instance.

Quadratic approximation at w^(t):

  f( w^(t) + s ) ≈ f( w^(t) ) + g^T s + (1/2) s^T H s,   minimized at s* with H s* = −g

Algorithm: Newton Method
  For t = 1…T:
    solve H^(t) s = −g^(t)
    w^(t+1) = w^(t) + s*
  ( iteration cost: O(N·D² + D³) )
Primal Quasi-Newton Method

Solving H s = −g exactly is expensive. Instead, solve it approximately with the Conjugate Gradient method, which needs only Hessian-vector products:

Algorithm: Quasi-Newton Method
  For t = 1…T:
    solve H^(t) s = −g^(t) approximately ( Conjugate Gradient )
    w^(t+1) = w^(t) + s*
  ( iteration cost: O( N·D + |SV|·D·T_inner ) )

Algorithm: Conjugate Gradient for Ax = b
  For t = 1…T_inner:
    x^(t+1) = x^(t) + α_t d^(t)
    r^(t+1) = b − A x^(t+1)
    d^(t+1) = r^(t+1) + β_t d^(t)
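The conjugate gradient inner loop can be sketched with the standard α_t and β_t formulas filled in ( assumes A is symmetric positive definite ):

```python
import numpy as np

def conjugate_gradient(A, b, T_inner=50, tol=1e-10):
    """Solve Ax = b for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    d = r.copy()             # search direction
    for _ in range(T_inner):
        rr = r @ r
        if rr < tol:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r = r - alpha * Ad           # equals b - A x, updated in O(D)
        beta = (r @ r) / rr          # Fletcher-Reeves coefficient
        d = r + beta * d
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
assert np.allclose(A @ x, b)
```

For the Newton step one only needs products H·d = d + C Σ_n L''(n) (x_n^T d) x_n, so H never has to be formed explicitly.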
Lagrangian Duality

First, we consider the Equality Constrained Problem:

  min_x f(x)   s.t.  Ax = b

The optimal solution x* is found iff

  ∇f(x*) = −A^T λ*   and   Ax* = b

If we define the Lagrangian Function ( Lagrangian ) as

  L(x, λ) = f(x) + λ^T ( Ax − b )

then the optimality condition can be written as

  ∇_x L(x*, λ*) = ∇f(x*) + A^T λ* = 0   ( x cannot decrease L )
  ∇_λ L(x*, λ*) = Ax* − b = 0           ( λ cannot increase L )

so (x*, λ*) is a saddle point of L, and we can compare

  min_x max_λ L(x, λ)   vs.   max_λ min_x L(x, λ)

Primal problem: min_x max_λ L(x, λ). The inner maximization gives max_λ L(x, λ) = f(x) if Ax = b and +∞ otherwise, so every surviving point satisfies ∇_λ L = 0; this recovers the original problem.

Dual problem: max_λ g(λ) with g(λ) = min_x L(x, λ). Every point of the inner minimization satisfies ∇_x L = ∇f(x) + A^T λ = 0.

For the Inequality Constrained Problem

  min_x f(x)   s.t.  Ax ≤ b

the multipliers must be nonnegative:

  max_{λ≥0} L(x, λ) = max_{λ≥0} f(x) + λ^T ( Ax − b ) = { f(x) if Ax ≤ b;  +∞ otherwise }

Primal: min_x max_{λ≥0} L(x, λ).  Dual: max_{λ≥0} min_x L(x, λ), where the inner minimization enforces ∇f(x) + A^T λ = 0.
SVM Dual Problem

Primal Problem:

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T φ(x_n)) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

Lagrangian ( α_n ≥ 0 for the margin constraints, μ_n ≥ 0 for ξ_n ≥ 0 ):

  L(w, ξ, α, μ) = (1/2)||w||² + C Σ_n ξ_n + Σ_n α_n ( 1 − ξ_n − y_n (w^T φ(x_n)) ) − Σ_n μ_n ξ_n

Primal: min_{w,ξ} max_{α≥0, μ≥0} L.   Dual: max_{α≥0, μ≥0} min_{w,ξ} L.

The inner minimization of the dual is attained where

  ∇_w L = w − Σ_n α_n y_n φ(x_n) = 0   ⇒   w = α_1 y_1 φ(x_1) + … + α_N y_N φ(x_N)
  ∂L/∂ξ_n = C − α_n − μ_n = 0

Substituting both back into L, the ξ terms cancel and

  L(α) = Σ_n α_n − (1/2) Σ_{i,j} α_i α_j y_i y_j φ(x_i)^T φ(x_j)

with α_n ≥ 0 and μ_n = C − α_n ≥ 0, i.e. 0 ≤ α_n ≤ C. The Dual Problem is therefore

  max_α Σ_n α_n − (1/2) α^T Q α   s.t.  0 ≤ α_n ≤ C, ∀n
SVM Dual Problem

The dual involves the data only through dot products φ(x_i)^T φ(x_j):

  max_α Σ_n α_n − (1/2) α^T Q α   s.t.  0 ≤ α_n ≤ C, ∀n

where Q_ij = ( y_i φ(x_i) )^T ( y_j φ(x_j) ) = y_i y_j K(x_i, x_j).

1. Only a "Box Constraint": easy to solve.
2. dim(α) = N = |instances|, while dim(w) = D = |features|.
3. Weak Duality: Dual( α ) ≤ Primal( w ) for any feasible α, w.
4. Strong Duality: Dual( α* ) = Primal( w* )  ( holds here since the primal is convex ).
Dual Optimization of SVM

  min_α (1/2) α^T Q α − Σ_n α_n   s.t.  0 ≤ α_n ≤ C, ∀n

Constrained minimization with general constraints is very expensive:
1. Detecting binding constraints: O( |constraints| × dim(α) )
2. Computing the "Projected Gradient": O( |binding constraints| × dim(α) )

Constrained minimization with only the "Box Constraint" 0 ≤ α_n ≤ C is cheap:
1. Detecting binding constraints: O( |constraints| )
2. Computing the "Projected Gradient": O( |binding constraints| )
( the projection simply clips each coordinate α_n into [0, C] )
Dual Coordinate Descent

  min_α f(α) = (1/2) α^T Q α − Σ_n α_n   s.t.  0 ≤ α_n ≤ C, ∀n

Minimizing with respect to a single α_i ( holding the others fixed ) is a one-dimensional quadratic with a box constraint, solved in closed form:

  ∇_i f(α) = (Qα)_i − 1,   ∇²_{ii} f(α) = Q_ii
  α_i ← min{ max{ α_i − ∇_i f(α) / Q_ii, 0 }, C }

Even computing the full gradient ∇f(α) = Qα − 1 is expensive ( O(N²) ), but one coordinate

  ∇_i f(α) = Σ_k y_i y_k K(x_i, x_k) α_k − 1

costs only O(N). How many variables at a time? As few as possible:
• Coordinate Descent ( LibLinear ): 1 variable at a time.
• Sequential Minimal Optimization ( LibSVM ): 2 variables at a time.
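The one-variable update can be sketched for the linear kernel, where maintaining w = Σ_n α_n y_n x_n keeps each coordinate step cheap ( an illustrative sketch without bias term or shrinking; toy data and names are assumptions ):

```python
import numpy as np

def dual_coordinate_descent(X, y, C=1.0, epochs=20, seed=0):
    """Dual coordinate descent for a linear SVM (no bias, no shrinking)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    alpha = np.zeros(N)
    w = np.zeros(D)                   # maintained as sum_n alpha_n y_n x_n
    Qii = np.sum(X * X, axis=1)       # Q_ii = y_i^2 x_i^T x_i = ||x_i||^2
    for _ in range(epochs):
        for i in rng.permutation(N):
            grad = y[i] * (w @ X[i]) - 1.0            # grad_i f(alpha), O(D)
            new_ai = min(max(alpha[i] - grad / Qii[i], 0.0), C)
            w += (new_ai - alpha[i]) * y[i] * X[i]    # O(D) update of w
            alpha[i] = new_ai
    return w, alpha

# Linearly separable toy data with margin 0.5 along the first coordinate.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
X[:, 0] += 0.5 * y
w, alpha = dual_coordinate_descent(X, y)
acc = np.mean(np.sign(X @ w) == y)   # training accuracy
```

Note the box constraint is handled by the clipping in `min(max(…, 0), C)`, exactly the cheap projection discussed above.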
NonLinear (LibSVM) vs. Linear (LibLinear)

Non-Linear:

  ∇_i f(α) = Σ_k y_i y_k K(x_i, x_k) α_k − 1    costs O( |instances| × |non-zero features| ).

Linear: maintain w = Σ_n α_n y_n x_n; then

  ∇_i f(α) = y_i ( Σ_k α_k y_k x_k )^T x_i − 1 = y_i (w^T x_i) − 1    costs O( |non-zero features| ).

• Linear: cheap update → randomly select a coordinate.
• Non-Linear: expensive update → select the most promising coordinates.
LibLinear ( Linear ):
  Randomly select a coordinate i; update α_i ← proj_{[0,C]}( α_i − ∇_i f(α) / Q_ii ); update w.
  ( O( |features| ) per update )

LibSVM ( Non-Linear ):
  Choose the 2 most promising coordinates; update them; update the gradients ∇_i f(α), i = 1…N.
  ( O( |instances| × |features| ) per update with no cache; O( |instances| ) with cached kernel rows )
Demo: libSVM, libLinear
– Normalize Features:
• svm-scale -s [range_file] [train] > train.scale
• svm-scale -r [range_file] [test] > test.scale
– Training:
• LibSVM: svm-train [train.scale] ( produces train.scale.model )
• LibSVM: svm-predict [test.scale] [train.scale.model] [pred_output]
• LibLinear: train [train.scale] ( produces train.scale.model )
• LibLinear: predict [test.scale] [train.scale.model] [pred_output]