Support Vector Machine and Convex Optimization
Ian En-Hsu Yen
Overview
• Support Vector Machine
– The Art of Modeling --- Large Margin and Kernel Trick
– Convex Analysis
– Optimality Conditions
– Duality
• Optimization for Machine Learning
– Dual Coordinate Descent (DCD) ( fast convergence, moderate cost )
• libLinear (Stochastic CD)
• libSVM (Greedy CD)
– Primal Methods
• Non-smooth Loss Stochastic Gradient Descent ( slow convergence, cheap iter. )
• Differentiable Loss Quasi-Newton Method ( very fast convergence, expensive iter. )
– Demo
A Learning/Prediction Game
• Your team members suggest a Hypothesis Space: H = { h1, h2, … }.
• You can request only one sample.
• If you find a hypothesis with accuracy > 50%, you earn $100,000; backing a wrong hypothesis ( acc ≤ 50% ) costs a $100,000 penalty.
• First game: H = { h1 },  h1: (A+B) mod 13 = C
• Second game: H = { h1, h2 },  h1: (A+B) mod 13 = C,  h2: (A−B) mod 13 = C
Large |H| with Small |Data| Guarantees Nothing
• First case: only one hypothesis h1
  Pr{ |Train_Error − Test_Error| ≥ 50% } ≤ 1/2.
• Second case: two hypotheses h1, h2 ( by the union bound )
  Pr{ |Train_Error − Test_Error| ≥ 50% for h1 or h2 } ≤ 1/2 + 1/2 = 1.
  The bound guarantees nothing.
Why Support Vector Machine (SVM) ?
• Flexible Hypothesis Space ( Non-linear Kernel )
• Not prone to Overfit ( Large Margin )
• Sparsity ( Support Vectors )
• Easy to find the Global Optimum ( Convex Problem )
[Figure: hypothesis-space spectrum from Small (Linear Model) to Large (KNN). If the ground truth is nonlinear but large-margin, SVM fits it directly; otherwise it calls for more feature/kernel engineering.]
SVM: Large-Margin Perceptron

The large-margin classifier maximizes the smallest distance from any training point to the separating hyperplane:

  w* = argmax_w min_n y_n (w^T x_n) / ||w||

Making the inner minimum an explicit margin variable γ:

  max_{w,γ} γ / ||w||   s.t.  y_n (w^T x_n) ≥ γ, ∀n

Scaling w does not change the classifier, so choose the scale with γ = 1:

  max_w 1 / ||w||   s.t.  y_n (w^T x_n) ≥ 1, ∀n
  ⇔  min_w ||w||    s.t.  y_n (w^T x_n) ≥ 1, ∀n
  ⇔  min_w ||w||²   s.t.  y_n (w^T x_n) ≥ 1, ∀n
SVM: Large-Margin Perceptron

Hard Margin ( separable case ):

  min_w (1/2)||w||²   s.t.  y_n (w^T x_n) ≥ 1, ∀n

Non-separable problems need slack variables ξ_n ( Soft Margin ):

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T x_n) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

A small C tolerates more margin violations; a large C approaches the hard margin. A drawback of SVM: the solution is sensitive to C.
From Linear to Non-Linear

Perceptron ( a line through the origin ):  a x1 + b x2 = 0
Ellipse ( centered at the origin ):  a x1² + b x2² + c x1x2 = 0
From Linear to Non-Linear

Linear SVM:

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T x_n) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

Non-linear SVM replaces each x_n with a feature expansion φ(x_n):

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T φ(x_n)) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

For example, the degree-2 expansion of x = (x1, x2, x3):

  φ(x) = ( x1², x2², x3², √2 x1x2, √2 x2x3, √2 x1x3 )
SVM: Kernel Trick

Explicit feature expansion x → φ(x) quickly becomes expensive:
• 3 features → 3 + C(3,2) = 6 degree-2 features
• 100 features → 100 + C(100,2) = 5050 degree-2 features
A degree-2 expansion has O(D²) features; a degree-K expansion has O(D^K).

The dot product, however, can be computed without ever forming φ(x). For the degree-2 expansion above:

  φ(x)^T φ(z) = x1²z1² + x2²z2² + x3²z3² + 2 x1x2 z1z2 + 2 x2x3 z2z3 + 2 x1x3 z1z3
              = (x^T z)²

so the kernel K(x, z) = (x^T z)² computes the O(D^K)-dimensional dot product in O(D) time.
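The identity above is easy to verify numerically; a small sketch ( numpy assumed, names illustrative ):

```python
import numpy as np

# Degree-2 feature expansion of a 3-dimensional vector:
# phi(x) = (x1^2, x2^2, x3^2, sqrt(2) x1 x2, sqrt(2) x2 x3, sqrt(2) x1 x3)
def phi(x):
    x1, x2, x3 = x
    return np.array([x1**2, x2**2, x3**2,
                     np.sqrt(2)*x1*x2, np.sqrt(2)*x2*x3, np.sqrt(2)*x1*x3])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

explicit = phi(x) @ phi(z)   # O(D^2) work: build the expansion, then dot
kernel = (x @ z) ** 2        # O(D) work: K(x, z) = (x^T z)^2
assert np.isclose(explicit, kernel)
```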
SVM: Kernel Trick

Can we formulate the problem using only dot products φ(x_i)^T φ(x_j)?

By the Representer Theorem, the solution w* of the problem can be expressed as a linear combination of the instances:

  w* = Σ_{n=1}^N α_n* y_n φ(x_n) = α_1* y_1 φ(x_1) + … + α_N* y_N φ(x_N)

Prediction then uses only dot products:

  w^T φ(x_t) = ( Σ_n α_n y_n φ(x_n) )^T φ(x_t) = Σ_n α_n y_n K(x_n, x_t)

at cost O(N·D), or O(|Support Vectors|·D) since most α_n turn out to be zero.

Substituting w = Σ_n α_n y_n φ(x_n) into the primal gives a problem in α:

  min_{α,ξ} (1/2) α^T Q α + C Σ_n ξ_n
  s.t.  y_i Σ_n α_n y_n K(x_n, x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0, ∀i

where Q_ij = ( y_i φ(x_i) )^T ( y_j φ(x_j) ) = y_i y_j K(x_i, x_j).
SVM: Kernel Trick

Some popular Kernels:
  Linear Kernel:      K(x, x') = x^T x'
  Polynomial Kernel:  K(x, x') = ( x^T x' + 1 )^d
  RBF Kernel:         K(x, x') = exp( −γ ||x − x'||² )

Kernels may be easier to define than features:
• String Kernel: gene classification / rewriting detection
• Tree Kernel: syntactic parse tree classification
• Graph Kernel: graph type classification
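For concreteness, the three closed-form kernels in a few lines of numpy ( γ and d as illustrative hyperparameter names ):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=2):
    return (x @ z + 1.0) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
print(linear_kernel(x, z))      # 0.0
print(polynomial_kernel(x, z))  # (0 + 1)^2 = 1.0
print(rbf_kernel(x, z))         # exp(-2)
```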
Convex Analysis

The general optimization problem min_x f(x) s.t. x ∈ C is very difficult to solve ( the trade-off is a very long running time vs. an approximate answer ). Optimization is much easier if the problem is convex, that is:

1. The objective function is convex:
   f( αx + (1−α)y ) ≤ α f(x) + (1−α) f(y)   for all 0 ≤ α ≤ 1
2. The feasible domain ( constrained space ) is convex:
   if x, y ∈ C, then αx + (1−α)y ∈ C   for all 0 ≤ α ≤ 1

[Figure: the chord from (x, f(x)) to (y, f(y)) lies above the graph of a convex f.]

For convex problems, every local minimum is a global minimum !!

Simple examples:
• Convex set { a ≤ x ≤ b }; non-convex function
• Convex set { a ≤ x ≤ b }; convex function
• Non-convex set { x ≤ a or x ≥ b }; convex function
Convex Analysis

Why is a local minimum x* also a global minimum for a convex problem?

If x* is a local minimum, there is a "ball" around x* in which every feasible x' has f(x') ≥ f(x*). Assume for contradiction that x* is not a global minimum: then there is a feasible x' with f(x') < f(x*). Since the feasible set is convex, αx' + (1−α)x* is feasible for 0 ≤ α ≤ 1, and for small enough α it lies inside the ball; by convexity of f,

  f( αx' + (1−α)x* ) ≤ α f(x') + (1−α) f(x*) < f(x*)

contradicting local minimality of x*.
Convex Analysis

Examples of convex sets:
• Linear equality constraint ( hyperplane ): { x : a^T x = b }, or a system Ax = b
• Linear inequality constraint ( halfspace ): { x : a^T x ≤ b }, and polyhedra such as { x : Ax ≤ b, c^T x ≤ d }

The intersection of convex sets is convex: if x, y ∈ A ∩ B with A, B convex, then αx + (1−α)y lies in A and in B, hence in A ∩ B.
Convex Analysis

Examples of convex functions:
• Linear function: f(x) = c^T x — convex for any c.
• Quadratic function: f(x) = (1/2) x^T Q x + c^T x — convex?

Obviously, it depends: ax² + bx + c is convex for a > 0 and concave for a < 0. A practical way to check convexity ( in one dimension ) is the second derivative:

  d²f/dx² (x) ≥ 0 at every x.
Convex Analysis

In R^D, we have convexity iff the Hessian matrix

  ∇²f(x) = [ ∂²f / ∂x_i ∂x_j ]_{i,j = 1..D}

is positive semidefinite at every x, i.e. z^T ∇²f(x) z ≥ 0 for all z ( equivalently, all eigenvalues ≥ 0 ).

Other cases:
• H positive definite: a strictly convex bowl.
• H negative definite: a concave cap.
• H neither positive nor negative (semi-)definite: a saddle.
• H = O is both positive and negative semidefinite: f is affine, both convex and concave.
Convex Analysis

Back to the examples:
• Linear function f(x) = c^T x: H(x) = ∇²f(x) = O, which is positive semidefinite, so f is convex.
• Quadratic function f(x) = (1/2) x^T Q x + c^T x: H(x) = Q, so f is convex iff Q is positive semidefinite.

For the hard-margin SVM

  min_w (1/2)||w||²   s.t.  y_n (w^T x_n) ≥ 1, ∀n

the objective (1/2)||w||² = (1/2) w^T I w, and the identity I is positive semidefinite.
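Positive semidefiniteness ( and hence convexity of a quadratic ) can be checked numerically through eigenvalues; a minimal sketch in numpy:

```python
import numpy as np

def is_psd(Q, tol=1e-10):
    # Q is PSD iff all eigenvalues of its symmetric part are >= 0.
    Qs = (Q + Q.T) / 2
    return bool(np.all(np.linalg.eigvalsh(Qs) >= -tol))

I = np.eye(3)
assert is_psd(I)                            # (1/2) w^T I w is convex
assert not is_psd(np.array([[1.0, 0.0],
                            [0.0, -1.0]]))  # indefinite: a saddle
```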
Convex Analysis

For the soft-margin SVM

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T x_n) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

the Hessian with respect to (w, ξ) is the block matrix

  H = [ I_{D×D}  O_{D×N} ]
      [ O_{N×D}  O_{N×N} ]

which is positive semidefinite, and every constraint is a halfspace. The SVM problem is therefore a convex problem ( a Quadratic Program ).
Convex Analysis

Examples of convex problems:
• Linear Programming: linear objective function s.t. linear constraints.
• Quadratic Programming: quadratic objective function (1/2) x^T P x + q^T x s.t. linear constraints, where P must be positive semidefinite.
Optimality Condition

There are many, many different solvers designed for different problems, but they share the same optimality conditions.

First, we consider the Unconstrained Problem:

  min_x f(x)    ( Example: Matrix Factorization, which is non-convex )

For twice-differentiable f(x), consider the Taylor expansion:

  f(x* + p) = f(x*) + ∇f(x*)^T p + (1/2) p^T ∇²f(x*) p + …

• x* is a local minimizer ( f(x*) ≤ f(x* + p) for all small p )  ⇒  ∇f(x*) = 0.
• ∇f(x*) = 0 alone does not make x* a local minimizer: it may be a maximum or a saddle. One also needs ∇²f(x*) positive semidefinite; no need to check this for a convex function ( why? its Hessian is positive semidefinite everywhere ).
• For convex f(x):  x* is a global minimizer  ⇔  ∇f(x*) = 0.

Assume convexity from now on……
Optimality Condition

Now, consider the Equality Constrained Problem:

  min_x f(x)   s.t.  Ax = b

( nonlinear equality is, in general, not convex ). Write

  Row(A): span{ a_1, …, a_n },   Null(A): { Zq },  where Z = [ z_1, …, z_{d−n} ]

so every feasible perturbation has the form p = Zq, and x* is a local minimizer iff f(x* + Zq) ≥ f(x*) for all small q. The Taylor expansion along feasible directions,

  f(x* + Zq) = f(x*) + ( Z^T ∇f(x*) )^T q + (1/2) q^T Z^T ∇²f(x*) Z q + …

gives, for convex f(x): x* is a local ( global ) minimizer iff

  Z^T ∇f(x*) = 0,  i.e.  −∇f(x*) = A^T λ for some λ  ( ∇f(x*) ∈ Row(A) )

The coefficients λ are the Lagrange multipliers. Intuition: any component of −∇f(x*) outside Row(A) would be a feasible direction along which f decreases, so x* could detach from the constraint and decrease f. ( Example: one constraint a_1^T x = b; at the optimum −∇f(x*) = λ_1 a_1. ) For equality constraints the sign of each λ_n is unrestricted ( λ_n ≥ 0? ≤ 0? = 0? ); the sign becomes meaningful for inequality constraints, next.
Optimality Condition

Now, consider the Inequality Constrained Problem:

  min_x f(x)   s.t.  Ax ≤ b

( assume linear inequality for simplicity ). Let A* ( some rows of A ) be the coefficients of the binding constraints, those with a_n^T x* = b_n. Then x* is a local ( global ) minimizer iff

  −∇f(x*) = A*^T λ*   with  λ* ≥ 0

( every feasible direction does not decrease f(x) ). For a non-binding constraint ( a_n^T x* < b_n ) we require λ_n = 0; compactly,

  λ_n ( a_n^T x* − b_n ) = 0, ∀n

Together, stationarity, feasibility, λ ≥ 0, and complementary slackness are the KKT conditions.
Optimality Condition for SVM

What are the KKT conditions for

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T φ(x_n)) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n ?

Introduce multipliers α_n ≥ 0 for the margin constraints and μ_n ≥ 0 for ξ_n ≥ 0. Stationarity:

  ∇_w:  w − Σ_n α_n y_n φ(x_n) = 0
  ∂/∂ξ_n:  C − α_n − μ_n = 0   ⇒   0 ≤ α_n ≤ C

Complementary slackness:

  α_n ( y_n (w^T φ(x_n)) − 1 + ξ_n ) = 0
  μ_n ξ_n = ( C − α_n ) ξ_n = 0

Reading these off:
1. w = Σ_n α_n y_n φ(x_n): the solution can be expressed as a linear combination of instances.
2. If the constraint is not binding ( y_n (w^T φ(x_n)) > 1 ):  α_n = 0.
3. If the constraint is binding ( y_n (w^T φ(x_n)) = 1 − ξ_n ):  α_n ≥ 0  ( Support Vectors ! ).
4. If the loss of the n-th instance ξ_n > 0:  α_n = C.
Primal SVM Problem

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T φ(x_n)) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

is a Quadratic Program (QP) with ( letting D = #features, N = #samples ):
• D + N variables
• N linear constraints
• N nonnegativity constraints

Solving it with a general-purpose QP solver is already intractable at medium scale ( e.g. N = 1000, D = 1000 ).
Primal SVM Problem: Constrained → Non-smooth Unconstrained

Given w, minimizing over each ξ_n in

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n f(x_n) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

has a closed form:

  ξ_n* = max{ 1 − y_n f(x_n), 0 } = { 0 if y_n f(x_n) ≥ 1;  1 − y_n f(x_n) otherwise }

Substituting back gives a non-smooth, unconstrained problem:

  min_w (1/2)||w||² + C Σ_n L( f(x_n), y_n )

where L is the hinge loss L( f(x), y ) = max{ 1 − y f(x), 0 }. Every constrained problem can be transformed into a non-smooth unconstrained problem in this way.
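The resulting unconstrained objective is a one-liner in numpy; a sketch ( X holds instances as rows, y ∈ {−1, +1}, names illustrative ):

```python
import numpy as np

def svm_objective(w, X, y, C=1.0):
    # (1/2)||w||^2 + C * sum_n max(1 - y_n w^T x_n, 0)
    margins = y * (X @ w)
    hinge = np.maximum(1.0 - margins, 0.0)
    return 0.5 * (w @ w) + C * np.sum(hinge)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, -1.0])
print(svm_objective(w, X, y))  # both margins equal 1, so loss 0; objective = 1.0
```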
L2-Regularized Loss Minimization

  min_w (λ/2)||w||² + Σ_n L( f(x_n), y_n )    ( equivalently (1/2)||w||² + C Σ_n L with C = 1/λ )

Common losses ( plotted against y_n f(x_n), with the 0/1 loss, i.e. accuracy, for reference ):
• Hinge loss max{ 1 − y f(x), 0 }  ( L1-SVM, Structural SVM )
• Squared hinge loss max{ 1 − y f(x), 0 }²  ( L2-SVM )
• ε-insensitive loss  ( SVR )
• Squared loss ( f(x) − y )²  ( (least-square) regression )
• Logistic loss log( 1 + exp( −y f(x) ) )  ( Logistic Regression, CRF )

Observations:
• All of the above are convex losses: we can solve for the global minimum.
• The convex smooth losses ( squared hinge, squared, logistic ) are applicable to second-order methods, primal coordinate descent, and gradient descent with an O(log(1/ε)) rate; non-smooth losses give only an O(1/ε) rate.
• Dual sparsity: the hinge-type losses are exactly zero for y f(x) ≥ 1, which yields sparse dual solutions ( support vectors ).
• Noise sensitivity: the faster-growing squared losses are noise-sensitive; the 0/1 loss is the most insensitive ( but non-convex ).
Stochastic (sub-)Gradient Descent ( S. Shalev-Shwartz et al., ICML 2007 )

  min_w (λ/2)||w||² + (1/N) Σ_n L( f(x_n), y_n ),  L = hinge loss ( L1-SVM, Structural SVM )

The hinge loss is not differentiable at y_n f(x_n) = 1, so we take a sub-gradient there.

Algorithm: Subgradient Descent
  For t = 1…T:
    w^(t+1) = w^(t) − η_t ( λ w^(t) + (1/N) Σ_n L'_n( w^(t) ) )
  ( iteration cost: O(N·D) )

A common choice of step size: η_t = 1/(λt).

Algorithm: Stochastic Subgradient Descent
  For t = 1…T:
    draw ñ uniformly from {1…N}
    w^(t+1) = w^(t) − η_t ( λ w^(t) + L'_ñ( w^(t) ) )
  ( iteration cost: O(D) )

Averaging over the iterates converges much faster ( Shamir, O. and Zhang, ICML 2013 ).
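A minimal runnable sketch of the stochastic variant above ( Pegasos-style, with η_t = 1/(λt) ); the toy data and all names are illustrative:

```python
import numpy as np

def pegasos(X, y, lam=0.1, T=2000, seed=0):
    """Stochastic subgradient descent for the hinge loss (Pegasos-style)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for t in range(1, T + 1):
        eta = 1.0 / (lam * t)              # common step-size choice
        n = rng.integers(N)                # draw one instance uniformly
        if y[n] * (X[n] @ w) < 1.0:        # hinge subgradient is -y_n x_n here
            w = (1.0 - eta * lam) * w + eta * y[n] * X[n]
        else:                              # loss part contributes no subgradient
            w = (1.0 - eta * lam) * w
    return w

# Linearly separable toy data with margin 0.5 along the first coordinate.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
X[:, 0] += 0.5 * y
w = pegasos(X, y)
acc = np.mean(np.sign(X @ w) == y)   # training accuracy on the toy data
```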
Stochastic (sub-)Gradient Descent ( S. Shalev-Shwartz et al., ICML 2007 )

SGD is applicable to all of the losses above ( hinge, squared hinge, ε-insensitive, squared, logistic ):
• Non-smooth losses: GD O(1/ε), SGD O(1/ε)
• Smooth losses: GD O(log(1/ε)), SGD O(1/ε)

SGD (Pegasos) vs. a batch method (LibLinear) ( Hsieh et al., ICML 2008 ):
• Do you care about a "ratio improvement" or an "absolute improvement" in testing?
• What is your evaluation measure? ( AUC, Prec/Recall, Accuracy, … )
• Beware ill-conditioned problems ( skewed pos/neg ratio, large C ).
• Pros: SGD ( an online method ) has the same convergence rate for testing as for training.
• Cons: SGD converges very slowly ( sometimes it seems not to converge at all… ).
Smooth Loss vs. Non-smooth Loss

  min_w (λ/2)||w||² + Σ_n L( f(x_n), y_n )

• L1-SVM: max( 1 − y_n f(x_n), 0 ), non-smooth at y_n f(x_n) = 1.
• L2-SVM: max( 1 − y_n f(x_n), 0 )², differentiable.
• Logistic Regression (CRF): log( 1 + exp( −y_n f(x_n) ) ), differentiable.

With a differentiable loss we get an Unconstrained Differentiable Problem.
Primal (Quasi-)Newton Method

  min_w f(w) = (1/2)||w||² + C Σ_n L( w^T x_n, y_n )

• Gradient Descent ( 1st order ) uses a linear approximation of f via the gradient g ≡ ∇f(w).
• Newton Method ( 2nd order ) uses a quadratic approximation via g and the Hessian H ≡ ∇²f(w).

For a differentiable loss such as the L2-SVM loss max( 1 − y_n w^T x_n, 0 )²:

  g = ∇f(w) = w + C Σ_n L'(n) x_n
  H = ∇²f(w) = I + C Σ_n L''(n) x_n x_n^T

where L'(n), L''(n) are the first and second derivatives of the loss with respect to w^T x_n at the n-th instance.

Quadratic approximation at w^(t):

  f( w^(t) + s ) ≈ f( w^(t) ) + g^T s + (1/2) s^T H s,   minimized at s* with H s* = −g

Algorithm: Newton Method
  For t = 1…T:
    solve H^(t) s = −g^(t)
    w^(t+1) = w^(t) + s*
  ( iteration cost: O(N·D² + D³) )
Primal Quasi-Newton Method

Solving H s = −g exactly is expensive. Instead, solve it approximately with the Conjugate Gradient method, which needs only Hessian-vector products:

Algorithm: Quasi-Newton Method
  For t = 1…T:
    solve H^(t) s = −g^(t) approximately ( Conjugate Gradient )
    w^(t+1) = w^(t) + s*
  ( iteration cost: O( N·D + |SV|·D·T_inner ) )

Algorithm: Conjugate Gradient for Ax = b
  For t = 1…T_inner:
    x^(t+1) = x^(t) + α_t d^(t)
    r^(t+1) = b − A x^(t+1)
    d^(t+1) = r^(t+1) + β_t d^(t)
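The conjugate gradient inner loop can be sketched with the standard α_t and β_t formulas filled in ( assumes A is symmetric positive definite ):

```python
import numpy as np

def conjugate_gradient(A, b, T_inner=50, tol=1e-10):
    """Solve Ax = b for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    d = r.copy()             # search direction
    for _ in range(T_inner):
        rr = r @ r
        if rr < tol:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r = r - alpha * Ad           # equals b - A x, updated in O(D)
        beta = (r @ r) / rr          # Fletcher-Reeves coefficient
        d = r + beta * d
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
assert np.allclose(A @ x, b)
```

For the Newton step one only needs products H·d = d + C Σ_n L''(n) (x_n^T d) x_n, so H never has to be formed explicitly.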
Lagrangian Duality

First, we consider the Equality Constrained Problem:

  min_x f(x)   s.t.  Ax = b

The optimal solution x* is found iff

  ∇f(x*) = −A^T λ*   and   Ax* = b

If we define the Lagrangian Function ( Lagrangian ) as

  L(x, λ) = f(x) + λ^T ( Ax − b )

then the optimality condition can be written as

  ∇_x L(x*, λ*) = ∇f(x*) + A^T λ* = 0   ( x cannot decrease L )
  ∇_λ L(x*, λ*) = Ax* − b = 0           ( λ cannot increase L )

so (x*, λ*) is a saddle point of L, and we can compare

  min_x max_λ L(x, λ)   vs.   max_λ min_x L(x, λ)

Primal problem: min_x max_λ L(x, λ). The inner maximization gives max_λ L(x, λ) = f(x) if Ax = b and +∞ otherwise, so every surviving point satisfies ∇_λ L = 0; this recovers the original problem.

Dual problem: max_λ g(λ) with g(λ) = min_x L(x, λ). Every point of the inner minimization satisfies ∇_x L = ∇f(x) + A^T λ = 0.

For the Inequality Constrained Problem

  min_x f(x)   s.t.  Ax ≤ b

the multipliers must be nonnegative:

  max_{λ≥0} L(x, λ) = max_{λ≥0} f(x) + λ^T ( Ax − b ) = { f(x) if Ax ≤ b;  +∞ otherwise }

Primal: min_x max_{λ≥0} L(x, λ).  Dual: max_{λ≥0} min_x L(x, λ), where the inner minimization enforces ∇f(x) + A^T λ = 0.
SVM Dual Problem

Primal Problem:

  min_{w,ξ} (1/2)||w||² + C Σ_n ξ_n   s.t.  y_n (w^T φ(x_n)) ≥ 1 − ξ_n,  ξ_n ≥ 0, ∀n

Lagrangian ( α_n ≥ 0 for the margin constraints, μ_n ≥ 0 for ξ_n ≥ 0 ):

  L(w, ξ, α, μ) = (1/2)||w||² + C Σ_n ξ_n + Σ_n α_n ( 1 − ξ_n − y_n (w^T φ(x_n)) ) − Σ_n μ_n ξ_n

Primal: min_{w,ξ} max_{α≥0, μ≥0} L.   Dual: max_{α≥0, μ≥0} min_{w,ξ} L.

The inner minimization of the dual is attained where

  ∇_w L = w − Σ_n α_n y_n φ(x_n) = 0   ⇒   w = α_1 y_1 φ(x_1) + … + α_N y_N φ(x_N)
  ∂L/∂ξ_n = C − α_n − μ_n = 0

Substituting both back into L, the ξ terms cancel and

  L(α) = Σ_n α_n − (1/2) Σ_{i,j} α_i α_j y_i y_j φ(x_i)^T φ(x_j)

with α_n ≥ 0 and μ_n = C − α_n ≥ 0, i.e. 0 ≤ α_n ≤ C. The Dual Problem is therefore

  max_α Σ_n α_n − (1/2) α^T Q α   s.t.  0 ≤ α_n ≤ C, ∀n
SVM Dual Problem

The dual involves the data only through dot products φ(x_i)^T φ(x_j):

  max_α Σ_n α_n − (1/2) α^T Q α   s.t.  0 ≤ α_n ≤ C, ∀n

where Q_ij = ( y_i φ(x_i) )^T ( y_j φ(x_j) ) = y_i y_j K(x_i, x_j).

1. Only a "Box Constraint": easy to solve.
2. dim(α) = N = |instances|, while dim(w) = D = |features|.
3. Weak Duality: Dual( α ) ≤ Primal( w ) for any feasible α, w.
4. Strong Duality: Dual( α* ) = Primal( w* )  ( holds here since the primal is convex ).
Dual Optimization of SVM

  min_α (1/2) α^T Q α − Σ_n α_n   s.t.  0 ≤ α_n ≤ C, ∀n

Constrained minimization with general constraints is very expensive:
1. Detecting binding constraints: O( |constraints| × dim(α) )
2. Computing the "Projected Gradient": O( |binding constraints| × dim(α) )

Constrained minimization with only the "Box Constraint" 0 ≤ α_n ≤ C is cheap:
1. Detecting binding constraints: O( |constraints| )
2. Computing the "Projected Gradient": O( |binding constraints| )
( the projection simply clips each coordinate α_n into [0, C] )
Dual Coordinate Descent

  min_α f(α) = (1/2) α^T Q α − Σ_n α_n   s.t.  0 ≤ α_n ≤ C, ∀n

Minimizing with respect to a single α_i ( holding the others fixed ) is a one-dimensional quadratic with a box constraint, solved in closed form:

  ∇_i f(α) = (Qα)_i − 1,   ∇²_{ii} f(α) = Q_ii
  α_i ← min{ max{ α_i − ∇_i f(α) / Q_ii, 0 }, C }

Even computing the full gradient ∇f(α) = Qα − 1 is expensive ( O(N²) ), but one coordinate

  ∇_i f(α) = Σ_k y_i y_k K(x_i, x_k) α_k − 1

costs only O(N). How many variables at a time? As few as possible:
• Coordinate Descent ( LibLinear ): 1 variable at a time.
• Sequential Minimal Optimization ( LibSVM ): 2 variables at a time.
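The one-variable update can be sketched for the linear kernel, where maintaining w = Σ_n α_n y_n x_n keeps each coordinate step cheap ( an illustrative sketch without bias term or shrinking; toy data and names are assumptions ):

```python
import numpy as np

def dual_coordinate_descent(X, y, C=1.0, epochs=20, seed=0):
    """Dual coordinate descent for a linear SVM (no bias, no shrinking)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    alpha = np.zeros(N)
    w = np.zeros(D)                   # maintained as sum_n alpha_n y_n x_n
    Qii = np.sum(X * X, axis=1)       # Q_ii = y_i^2 x_i^T x_i = ||x_i||^2
    for _ in range(epochs):
        for i in rng.permutation(N):
            grad = y[i] * (w @ X[i]) - 1.0            # grad_i f(alpha), O(D)
            new_ai = min(max(alpha[i] - grad / Qii[i], 0.0), C)
            w += (new_ai - alpha[i]) * y[i] * X[i]    # O(D) update of w
            alpha[i] = new_ai
    return w, alpha

# Linearly separable toy data with margin 0.5 along the first coordinate.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
X[:, 0] += 0.5 * y
w, alpha = dual_coordinate_descent(X, y)
acc = np.mean(np.sign(X @ w) == y)   # training accuracy
```

Note the box constraint is handled by the clipping in `min(max(…, 0), C)`, exactly the cheap projection discussed above.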
NonLinear (LibSVM) vs. Linear (LibLinear)

Non-Linear:

  ∇_i f(α) = Σ_k y_i y_k K(x_i, x_k) α_k − 1    costs O( |instances| × |non-zero features| ).

Linear: maintain w = Σ_n α_n y_n x_n; then

  ∇_i f(α) = y_i ( Σ_k α_k y_k x_k )^T x_i − 1 = y_i (w^T x_i) − 1    costs O( |non-zero features| ).

• Linear: cheap update → randomly select a coordinate.
• Non-Linear: expensive update → select the most promising coordinates.
LibLinear ( Linear ):
  Randomly select a coordinate i; update α_i ← proj_{[0,C]}( α_i − ∇_i f(α) / Q_ii ); update w.
  ( O( |features| ) per update )

LibSVM ( Non-Linear ):
  Choose the 2 most promising coordinates; update them; update the gradients ∇_i f(α), i = 1…N.
  ( O( |instances| × |features| ) per update with no cache; O( |instances| ) with cached kernel rows )
Demo: libSVM, libLinear
– Normalize Features:
• svm-scale -s [range_file] [train] > train.scale
• svm-scale -r [range_file] [test] > test.scale
– Training:
• LibSVM: svm-train [train.scale] ( produces train.scale.model )
• LibSVM: svm-predict [test.scale] [train.scale.model] [pred_output]
• LibLinear: train [train.scale] ( produces train.scale.model )
• LibLinear: predict [test.scale] [train.scale.model] [pred_output]