8/16/2019 SSSVM2015
Kernel Method
and Support Vector Machines
Nguyen Duc Dung, Ph.D.
IOIT, VAST
Outline
Reference: books, papers, slides, software
Support vector machines (SVMs): the maximum-margin hyperplane
Kernel method
Implementation: approaches, sequential minimal optimization (SMO)
Open problems
Reference
Book
Cristianini, N., Shawe-Taylor, J., An Introduction to Support Vector Machines, Cambridge University Press, 2000. http://www.support-vector.net/index.html
Bernhard Schölkopf and Alex Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
Paper
C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Knowledge Discovery and Data Mining, 2(2), 1998.
Slide
N. Cristianini, ICML'01 tutorial, 2001.
Software
LibSVM (NTU), SVMlight (joachims.org)
Online resource
http://www.kernel-machines.org/
Classification Problem
How would we classify this data set?
Linear Classifiers
Many lines can act as linear classifiers for this data.
Which one is the best classifier?
SVM Solution
The SVM solution is the linear classifier with the maximum margin (the maximum-margin linear classifier).
Margin of a Linear Function f(x) = w·x + b
Functional margin of an example (x_i, y_i): γ_i = y_i f(x_i) = y_i (w·x_i + b)
Geometric margin: γ_i / ||w||
Margin of f: the minimum geometric margin over the training set
SVM solution: the linear function f with the largest margin
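These definitions can be checked numerically; the weights and points below are made-up illustrative values, not from the slides:

```python
import math

def f(w, b, x):
    # linear function f(x) = w.x + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def functional_margin(w, b, x, y):
    # gamma_i = y_i * f(x_i)
    return y * f(w, b, x)

def geometric_margin(w, b, x, y):
    # functional margin normalized by ||w||
    return functional_margin(w, b, x, y) / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], -1.0                      # ||w|| = 5
data = [([2.0, 1.0], +1), ([-1.0, 0.0], -1)]

for x, y in data:
    print(functional_margin(w, b, x, y), geometric_margin(w, b, x, y))

# The margin of f is the minimum geometric margin over the training set.
print(min(geometric_margin(w, b, x, y) for x, y in data))  # 0.8
```

Note that scaling (w, b) by a constant scales the functional margin but leaves the geometric margin unchanged, which is why the next slides fix the functional margin to 1.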
A Bound on Expected Risk of a Linear Classifier f = sign(w·x)
With probability at least 1 − δ, δ ∈ (0, 1):
R[f] ≤ R_emp[f] + sqrt( (c/l) ( (R²/γ_f²) ln² l + ln(1/δ) ) )
where R_emp is the training error, l is the training size, γ_f is the margin of f, ||w|| ≤ 1, ||x|| ≤ R, and c is a constant.
Larger margin, smaller bound.
Finding the Maximum-Margin Classifier
Constrain the functional margin: y_i (w·x_i + b) ≥ 1
Minimize the normal vector: min ||w||
Soft and Hard Margin
Hard (maximum) margin:
min_{w,b} (1/2) ||w||²
s.t. y_i (w·x_i + b) ≥ 1, i = 1, ..., l
Soft (maximum) margin:
min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1..l} ξ_i
s.t. y_i (w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., l
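The soft-margin objective is easy to evaluate by hand: each slack ξ_i equals the hinge loss max(0, 1 − y_i(w·x_i + b)). A small sketch with made-up weights and data:

```python
# Evaluate (1/2)||w||^2 + C * sum(xi_i) for a fixed (w, b) on toy 2-D data.
def soft_margin_objective(w, b, C, data):
    reg = 0.5 * sum(wi * wi for wi in w)
    slacks = [max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
              for x, y in data]
    return reg + C * sum(slacks), slacks

data = [([2.0, 0.0], +1),    # outside the margin band: slack 0
        ([0.5, 0.0], +1),    # inside the margin band: 0 < slack < 1
        ([-2.0, 0.0], -1)]   # outside the margin band: slack 0
w, b, C = [1.0, 0.0], 0.0, 10.0

obj, slacks = soft_margin_objective(w, b, C, data)
print(obj, slacks)  # 5.5 [0.0, 0.5, 0.0]
```

Larger C penalizes margin violations more heavily, pushing the soft-margin solution toward the hard-margin one.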
Lagrangian Optimization
Kuhn-Tucker Theorem
Optimization
Primal problem:
min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1..l} ξ_i
s.t. y_i (w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., l
Dual problem:
min_α (1/2) Σ_{i,j=1..l} y_i y_j α_i α_j (x_i·x_j) − Σ_{i=1..l} α_i
s.t. 0 ≤ α_i ≤ C, i = 1, ..., l, and Σ_{i=1..l} y_i α_i = 0
Solution: w = Σ_{α_i > 0} y_i α_i x_i
(Linear) Support Vector Machines
Training: quadratic optimization (the dual problem)
min_α (1/2) Σ_{i,j=1..l} y_i y_j α_i α_j (x_i·x_j) − Σ_{i=1..l} α_i
s.t. 0 ≤ α_i ≤ C, i = 1, ..., l, and Σ_{i=1..l} y_i α_i = 0
l variables, l² coefficients
Testing: f(x) = w·x + b
Normal vector of the hyperplane: w = Σ_{α_i > 0} y_i α_i x_i
(x_i, α_i) with α_i > 0 — support vector
Kernel Method
Problem: most datasets are linearly non-separable.
Solution:
Map the input data into a higher-dimensional feature space.
Find the optimal hyperplane in the feature space.
Hyperplane in Feature Space
VC-dimension of a class of functions: the maximum number of points that can be shattered.
The VC-dimension of linear functions in R^d is d + 1.
The dimension of the feature space is high.
Linear functions in feature space therefore have high VC-dimension, i.e. high capacity.
VC Dimension: Example
Gaussian RBF SVMs of sufficiently small width can classify an arbitrarily large number of training points correctly, and thus have infinite VC-dimension.
Linear SVMs
Training: quadratic optimization
min_α (1/2) Σ_{i,j=1..l} y_i y_j α_i α_j (x_i·x_j) − Σ_{i=1..l} α_i
s.t. 0 ≤ α_i ≤ C, i = 1, ..., l, and Σ_{i=1..l} y_i α_i = 0
l variables, l² coefficients
Testing: f(x) = sign( Σ_{α_i > 0} α_i y_i (x_i·x) + b )
Normal vector of the hyperplane: w = Σ_{α_i > 0} α_i y_i x_i
(x_i, α_i) with α_i > 0 — support vector
SVMs work with pairs of data (dot products), not with individual samples.
Non-linear SVMs
Kernel: calculates the dot product between two vectors in feature space, K(x, y) = ⟨Φ(x), Φ(y)⟩.
Training (dual problem):
min_α (1/2) Σ_{i,j=1..l} y_i y_j α_i α_j K(x_i, x_j) − Σ_{i=1..l} α_i
s.t. 0 ≤ α_i ≤ C, i = 1, ..., l, and Σ_{i=1..l} y_i α_i = 0
Testing: f(x) = sign( Σ_{α_i > 0} α_i y_i K(x_i, x) + b )
Normal vector of the hyperplane: w = Σ_{α_i > 0} α_i y_i Φ(x_i)
The maximal margin algorithm works indirectly in feature space via the kernel; the map Φ need not be known explicitly.
Kernel
Linear: K(x, y) = ⟨x, y⟩
Gaussian: K(x, y) = exp(−||x − y||²)
Dimension of feature space: infinite
Polynomial: K(x, y) = (⟨x, y⟩ + 1)^p
Dimension of feature space: C(d + p, p), where d is the input space dimension
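The three kernels can be written directly from their formulas. In this sketch the Gaussian width gamma and the polynomial degree p are exposed as parameters (the slide's Gaussian corresponds to gamma = 1):

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, gamma=1.0):
    # exp(-gamma * ||x - y||^2); the slide's form corresponds to gamma = 1
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def polynomial_kernel(x, y, p=2):
    return (linear_kernel(x, y) + 1.0) ** p

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

# Gram matrix of the Gaussian kernel: symmetric, with K(x, x) = 1.
K = [[gaussian_kernel(a, b) for b in X] for a in X]
assert all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(3) for j in range(3))
assert all(abs(K[i][i] - 1.0) < 1e-12 for i in range(3))

# Feature-space dimension of the degree-p polynomial kernel: C(d + p, p).
print(math.comb(2 + 3, 3))  # d = 2, p = 3 -> 10
```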
Support Vector Learning
Task: given a set of labeled data T = {(x_i, y_i)}_{i=1,...,l} ⊂ R^d × {−1, +1}, find the decision function.
Training (dual problem):
min_α (1/2) Σ_{i,j=1..l} y_i y_j α_i α_j K(x_i, x_j) − Σ_{i=1..l} α_i
s.t. 0 ≤ α_i ≤ C, i = 1, ..., l, and Σ_{i=1..l} y_i α_i = 0
Time: O(l³), memory: O(l²)
Testing: f(x) = sign( Σ_{α_i > 0} α_i y_i K(x_i, x) + b )
Time: O(N_S)
MNIST Data: SVM vs. Other Methods
Data: 60,000/10,000 training/testing handwritten digit images
Performance:
Method                                            Testing error (%)
linear classifier (1-layer NN)                    12.0
K-nearest-neighbors                                5.0
40 PCA + quadratic classifier                      3.3
SVM, Gaussian kernel                               1.4
2-layer NN, 300 hidden units, mean square error    4.7
Convolutional net LeNet-4                          1.1
(Source: http://yann.lecun.com/)
SVM: Probability Output
SVM solution: f(x) = Σ_{α_i > 0} α_i y_i K(x_i, x) + b
Probability estimation:
p(y = 1 | x) = 1 / (1 + e^{A·f(x) + B})
Maximum likelihood approach:
(A, B) = argmin_{(a,b)} − Σ_{i=1..l} [ t_i log(p_i) + (1 − t_i) log(1 − p_i) ]
where p_i = p(y = 1 | x_i) = 1 / (1 + e^{a·f(x_i) + b}), and
t_i = (N_+ + 1) / (N_+ + 2) if y_i = +1, t_i = 1 / (N_− + 2) if y_i = −1, i = 1, ..., l
(N_+: number of positive examples, N_−: number of negative examples)
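A minimal sketch of this maximum-likelihood fit, assuming plain gradient descent (Platt's original method uses a Newton-type solver) and made-up decision values f(x_i):

```python
import math

def platt_targets(ys):
    # t_i = (N+ + 1)/(N+ + 2) for positives, 1/(N- + 2) for negatives
    n_pos = sum(1 for y in ys if y == 1)
    n_neg = len(ys) - n_pos
    return [(n_pos + 1.0) / (n_pos + 2.0) if y == 1 else 1.0 / (n_neg + 2.0)
            for y in ys]

def fit_sigmoid(fs, ys, lr=0.05, steps=5000):
    # Minimize -sum[t log p + (1-t) log(1-p)] over (A, B),
    # where p = 1/(1 + exp(A*f + B)).
    ts = platt_targets(ys)
    A, B = 0.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for f, t in zip(fs, ts):
            p = 1.0 / (1.0 + math.exp(A * f + B))
            # dL/dA = (t - p) * f, dL/dB = (t - p)  (cross-entropy gradient)
            gA += (t - p) * f
            gB += (t - p)
        A -= lr * gA
        B -= lr * gB
    return A, B

# Toy decision values: positives score high, so the fitted A is negative
# (larger f(x) then yields larger p(y = 1 | x)).
fs = [2.0, 1.0, 0.5, -0.5, -1.0, -2.0]
ys = [1, 1, 1, -1, -1, -1]
A, B = fit_sigmoid(fs, ys)
print(A, B)
assert A < 0
```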
Outline
Reference: books, papers, slides, software
Support vector machines (SVMs): the maximum-margin hyperplane
Kernel method
Implementation: approaches, sequential minimal optimization (SMO)
Open problems
SVM Training
Problem: quadratic programming (QP)
min_α F(α) = (1/2) Σ_{i,j=1..l} y_i y_j α_i α_j K_ij − Σ_{i=1..l} α_i
s.t. 0 ≤ α_i ≤ C, i = 1, ..., l, and Σ_{i=1..l} y_i α_i = 0
Objective function: quadratic w.r.t. α
Number of variables: l
Number of parameters: l²
Constraints: box, linear
Complexity: time O(l³) or O(N_S³ + N_S²·l + N_S·d·l), memory O(l²)
Approaches:
Gradient methods: modified gradient projection (Bottou et al., 94)
Divide-and-conquer: decomposition algorithms (e.g. Osuna et al., 97; Joachims, 99); sequential minimal optimization (SMO) (Platt, 99)
Parallelization: Cascade SVM (Graf et al., 05); parallel mixture of SVMs (Collobert et al., 02)
Approximation: online and active learning (e.g. Bordes et al., 05); Core SVM (Tsang et al., 05, 07)
Combinations of the above methods
Optimality
The Karush-Kuhn-Tucker (KKT) conditions:
y_i f(x_i) ≥ 1 for α_i = 0,
y_i f(x_i) = 1 for 0 < α_i < C,
y_i f(x_i) ≤ 1 for α_i = C,
where f(x_i) = Σ_{j=1..l} y_j α_j K(x_j, x_i) + b.
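These conditions can be verified on a toy problem whose solution is known in closed form (a hypothetical two-point example, not from the slides):

```python
def f(x, alphas, ys, xs, b):
    # f(x) = sum_j y_j alpha_j K(x_j, x) + b, linear kernel K(u, v) = u*v in R^1
    return sum(a * y * (xj * x) for a, y, xj in zip(alphas, ys, xs)) + b

# Closed-form hard-margin solution of the two-point problem
# x1 = +1 (y = +1), x2 = -1 (y = -1):  alpha1 = alpha2 = 0.5, b = 0.
xs, ys = [1.0, -1.0], [1, -1]
alphas, b, C = [0.5, 0.5], 0.0, 10.0

for a, x, y in zip(alphas, xs, ys):
    m = y * f(x, alphas, ys, xs, b)
    if a == 0:
        assert m >= 1 - 1e-9      # non-support vector: outside the margin
    elif a < C:
        assert abs(m - 1) < 1e-9  # free support vector: exactly on the margin
    else:
        assert m <= 1 + 1e-9      # bounded support vector: inside the margin
print("KKT satisfied")
```

Both points are free support vectors here, so both sit exactly on the margin, as the middle condition requires.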
SMO Algorithm
Initialize the solution (all zeros)
While (!StoppingCondition)
  Select two variables {i, j}
  Optimize over {i, j}
EndWhile
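The loop above can be fleshed out into a runnable sketch. The following is a simplified SMO in plain Python with a linear kernel and a naive random choice of the second variable (Platt's paper and the next slides use smarter selection heuristics); the toy dataset is made up for illustration:

```python
import random

def smo_train(xs, ys, C=10.0, tol=1e-3, max_passes=10, max_sweeps=1000):
    def K(u, v):                      # linear kernel; swap in any kernel here
        return sum(a * b for a, b in zip(u, v))

    l = len(xs)
    alphas, b = [0.0] * l, 0.0

    def f(x):
        return sum(alphas[k] * ys[k] * K(xs[k], x) for k in range(l)) + b

    passes = sweeps = 0
    while passes < max_passes and sweeps < max_sweeps:
        sweeps += 1
        changed = 0
        for i in range(l):
            Ei = f(xs[i]) - ys[i]
            # Only touch i if it violates the KKT conditions (within tol).
            if not ((ys[i] * Ei < -tol and alphas[i] < C) or
                    (ys[i] * Ei > tol and alphas[i] > 0)):
                continue
            j = random.choice([k for k in range(l) if k != i])
            Ej = f(xs[j]) - ys[j]
            ai_old, aj_old = alphas[i], alphas[j]
            # Box bounds keeping 0 <= alpha <= C and sum_i y_i alpha_i = 0.
            if ys[i] != ys[j]:
                lo, hi = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
            else:
                lo, hi = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
            eta = K(xs[i], xs[i]) + K(xs[j], xs[j]) - 2.0 * K(xs[i], xs[j])
            if lo >= hi or eta <= 0:
                continue
            # Unconstrained optimum along the feasible direction, then clip.
            alphas[j] = min(hi, max(lo, aj_old + ys[j] * (Ei - Ej) / eta))
            if abs(alphas[j] - aj_old) < 1e-7:
                continue
            alphas[i] += ys[i] * ys[j] * (aj_old - alphas[j])
            # Update b so a free support vector sits exactly on the margin.
            bi = b - Ei - ys[i] * (alphas[i] - ai_old) * K(xs[i], xs[i]) \
                        - ys[j] * (alphas[j] - aj_old) * K(xs[i], xs[j])
            bj = b - Ej - ys[i] * (alphas[i] - ai_old) * K(xs[i], xs[j]) \
                        - ys[j] * (alphas[j] - aj_old) * K(xs[j], xs[j])
            if 0.0 < alphas[i] < C:
                b = bi
            elif 0.0 < alphas[j] < C:
                b = bj
            else:
                b = (bi + bj) / 2.0
            changed += 1
        passes = 0 if changed else passes + 1
    return alphas, b

random.seed(0)
xs = [[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
      [-2.0, -1.0], [-1.0, -2.0], [-3.0, -2.5]]
ys = [1, 1, 1, -1, -1, -1]
alphas, b = smo_train(xs, ys)

def predict(x):
    s = sum(a * y * sum(u * v for u, v in zip(xi, x))
            for a, y, xi in zip(alphas, ys, xs)) + b
    return 1 if s >= 0 else -1

print([predict(x) for x in xs])
```

On convergence every point satisfies the KKT conditions within tol, so all training points of this separable toy set are classified correctly, and the equality constraint Σ y_i α_i = 0 is preserved by every pairwise update.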
Selection Heuristic and Stopping Condition
Maximum violating pair:
i = argmax { E_k | k ∈ I_up }
j = argmin { E_k | k ∈ I_low }
Maximum gain:
i = argmax { E_k | k ∈ I_up }
j = argmax { (E_i − E_k)² / η_ik | k ∈ I_low, E_k < E_i }, where η_ik = K_ii + K_kk − 2K_ik
where
I_up = { t | α_t < C, y_t = 1 or α_t > 0, y_t = −1 }
I_low = { t | α_t < C, y_t = −1 or α_t > 0, y_t = 1 }
Stopping condition: E_i − E_j ≤ ε (e.g. ε = 10⁻³)
Sequential Minimal Optimization
Training problem:
min_α (1/2) Σ_{i,j=1..l} y_i y_j α_i α_j K(x_i, x_j) − Σ_{i=1..l} α_i
s.t. 0 ≤ α_i ≤ C, i = 1, ..., l, and Σ_{i=1..l} y_i α_i = 0
Functional margin (error): E_i = Σ_{k=1..l} y_k α_k K(x_k, x_i) − y_i
Selection heuristic:
i = argmax { E_k | k ∈ I_up }
j = argmax { (E_i − E_k)² / η_ik | k ∈ I_low }
Updating scheme:
α_i^new = α_i^old + y_i (E_j^old − E_i^old) / η_ij
α_j^new = α_j^old + y_j (E_i^old − E_j^old) / η_ij
where η_ij = K_ii + K_jj − 2K_ij
Stopping condition: E_i − E_j ≤ ε
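The updating scheme can be verified on a two-point toy problem (x1 = +1 with y1 = +1, x2 = −1 with y2 = −1, linear kernel), where a single unclipped step from α = (0, 0) already reaches the known optimum α = (0.5, 0.5):

```python
def K(u, v):
    # linear kernel in R^1
    return u * v

xs, ys = [1.0, -1.0], [1, -1]
a1, a2, b = 0.0, 0.0, 0.0

def f(x):
    return a1 * ys[0] * K(xs[0], x) + a2 * ys[1] * K(xs[1], x) + b

E1 = f(xs[0]) - ys[0]   # = -1
E2 = f(xs[1]) - ys[1]   # = +1
eta = K(xs[0], xs[0]) + K(xs[1], xs[1]) - 2 * K(xs[0], xs[1])  # = 4

# Updating scheme from the slide (unclipped; both results lie in (0, C)).
a1_new = a1 + ys[0] * (E2 - E1) / eta
a2_new = a2 + ys[1] * (E1 - E2) / eta
print(a1_new, a2_new)   # 0.5 0.5
```

Note the new pair still satisfies Σ y_i α_i = 0, which is exactly why SMO must update two variables at a time.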
Support Vector Regression (1)
Training data: S = {(x_i, y_i)}_{i=1,...,l} ⊂ R^N × R
Linear regressor: y = f(x) = w·x + b
ε-insensitive loss function
Support Vector Regression (2)
Optimization: minimizing
Dual problem
Open Problems
Model selection: kernel type, parameter setting
Speed and size: training time O(N_S²·l), space O(N_S·l); testing O(N_S)
Multi-class application: one-versus-rest, one-versus-one
Categorical data
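As a small illustration of the two multi-class schemes: one-versus-rest trains k binary SVMs, one-versus-one trains one per class pair, i.e. k(k−1)/2:

```python
from itertools import combinations

def num_ovr(k):
    # one binary SVM per class, each against all remaining classes
    return k

def num_ovo(k):
    # one binary SVM per unordered pair of classes: k*(k-1)/2
    return len(list(combinations(range(k), 2)))

print(num_ovr(10), num_ovo(10))  # 10 45
```

For the 10-class MNIST task above this means 10 versus 45 binary classifiers, though each one-versus-one classifier trains on a much smaller subset of the data.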