The support vector machine
Nuno Vasconcelos, ECE Department, UCSD
Outline
we have talked about classification and linear discriminants
then we did a detour to talk about kernels
• how do we implement a non-linear boundary in the linear discriminant framework?
this led us to RKHS
• RKHS, the reproducing theorem, regularization
• the idea that all these learning problems boil down to the solution of an optimization problem
we have seen how to solve optimization problems
• with and without constraints, equalities, inequalities
today we will put everything together by studying the SVM
Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
e.g.
• x ∈ X ⊂ R², x = (fever, blood pressure)
• y ∈ Y = {disease, no disease}
X and Y are related by an (unknown) function

  y = f(x)

goal: design a classifier h: X → Y such that h(x) = f(x), ∀x
Linear classifier
implements the decision rule

  h(x) = sgn[g(x)] = { 1,  if g(x) > 0
                     { -1, if g(x) < 0        with g(x) = w^T x + b

has the properties
• it divides X into two "half-spaces"
• the boundary is the plane with:
  - normal w
  - distance to the origin b/||w||
• g(x)/||w|| is the distance from point x to the boundary
• g(x) = 0 for points on the plane
• g(x) > 0 on the side w points to (the "positive side")
• g(x) < 0 on the "negative side"
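As a quick sketch of the rule above (with an arbitrary, hand-picked plane, not a learned one):

```python
# Minimal sketch of the linear decision rule h(x) = sgn(w^T x + b).
# The weight vector and bias are illustrative values, not trained ones.
import math

def g(w, b, x):
    """Linear discriminant g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def h(w, b, x):
    """Decision rule h(x) = sgn(g(x))."""
    return 1 if g(w, b, x) > 0 else -1

def distance_to_boundary(w, b, x):
    """Signed distance g(x)/||w|| from x to the plane g(x) = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return g(w, b, x) / norm_w

w, b = [3.0, 4.0], -5.0          # plane 3*x1 + 4*x2 - 5 = 0, with ||w|| = 5
print(h(w, b, [2.0, 1.0]))       # point on the positive side
print(distance_to_boundary(w, b, [2.0, 1.0]))
```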
Linear classifier
we have a classification error if
• y = 1 and g(x) < 0, or y = -1 and g(x) > 0
• i.e. y·g(x) < 0
and a correct classification if
• y = 1 and g(x) > 0, or y = -1 and g(x) < 0
• i.e. y·g(x) > 0
note that, for a linearly separable training set D = {(x1,y1), ..., (xn,yn)}, we can have zero empirical risk
the necessary and sufficient condition is that

  y_i (w^T x_i + b) > 0, ∀i
The margin
is the distance from the boundary to the closest point

  γ = min_i |w^T x_i + b| / ||w||

there will be no error if it is strictly greater than zero

  y_i (w^T x_i + b) > 0, ∀i  ⇔  γ > 0

note that this is ill-defined, in the sense that γ does not change if both w and b are scaled by λ
we need a normalization
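The scale ambiguity is easy to check numerically; a small sketch, with a made-up data set and an arbitrary plane:

```python
# Sketch: the margin gamma of a plane (w, b) on a toy training set, showing
# that gamma is unchanged when (w, b) are scaled by lambda. Data is made up.
import math

def margin(w, b, X):
    """gamma = min_i |w^T x_i + b| / ||w||"""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in X) / norm_w

X = [[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]]
w, b = [1.0, 1.0], 0.0
lam = 7.0
print(margin(w, b, X))                                # gamma
print(margin([lam * wi for wi in w], lam * b, X))     # same gamma after scaling
```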
Maximizing the margin
this is similar to what we have seen for Fisher discriminants
let's assume we have selected some normalization, e.g. ||w|| = 1
the next question is: what is the cost that we are going to optimize?
there are several planes that separate the classes; which one is best?
recall that, in the case of the Perceptron, we have seen that the margin determines the complexity of the learning problem
• the Perceptron converges in fewer than (k/γ)² iterations
it sounds like maximizing the margin is a good idea
Maximizing the margin
intuition 1:
• think of each point in the training set as a sample from a probability density centered on it
• if we draw another sample, we will not get the same points
• each point is really a pdf with a certain variance
• this is a kernel density estimate
• if we leave a margin of γ on the training set, we are safe against this uncertainty
• (as long as the radius of support of the pdf is smaller than γ)
• the larger γ, the more robust the classifier!
Maximizing the margin
intuition 2:
• think of the plane as an uncertain estimate, because it is learned from a sample drawn at random
• since the sample changes from draw to draw, the plane parameters are random variables of non-zero variance
• instead of a single plane, we have a probability distribution over planes
• the larger the margin, the larger the number of planes that will not originate errors
• the larger γ, the larger the variance allowed on the plane parameter estimates!
Normalization
we will go over a formal proof in a few classes; for now, let's look at the normalization
the natural choice is ||w|| = 1
the margin is maximized by solving

  max_{w,b} min_i |w^T x_i + b|  subject to  ||w||² = 1

this is somewhat complex
• we need to find the points that achieve the min without knowing what w and b are
• it is an optimization problem with quadratic constraints
Normalization
a more convenient normalization is to make |g(x)| = 1 for the closest point, i.e.

  min_i |w^T x_i + b| ≡ 1

under which

  γ = 1/||w||

the SVM is the classifier that maximizes the margin under these constraints:

  min_{w,b} (1/2)||w||²  subject to  y_i (w^T x_i + b) ≥ 1, ∀i
Support vector machine

  min_{w,b} (1/2)||w||²  subject to  y_i (w^T x_i + b) ≥ 1, ∀i

last time we proved the following theorem
Theorem: (strong duality) consider the problem

  x* = argmin_{x∈X} f(x)  subject to  e_j^T x - d_j ≤ 0

where X and f are convex, and the optimal value f* is finite. Then there is at least one Lagrange multiplier vector and there is no duality gap
this means the SVM problem can be solved by solving its dual
The dual problem
for the primal

  min_{w,b} (1/2)||w||²  subject to  y_i (w^T x_i + b) ≥ 1, ∀i

the dual problem is

  max_{α≥0} q(α) = max_{α≥0} min_{w,b} L(w,b,α)

where the Lagrangian is

  L(w,b,α) = (1/2)||w||² - Σ_i α_i [y_i (w^T x_i + b) - 1]

setting derivatives to zero

  ∇_w L(w,b,α) = 0  ⇔  w - Σ_i α_i y_i x_i = 0  ⇔  w* = Σ_i α_i y_i x_i
  ∂L/∂b = 0  ⇔  Σ_i α_i y_i = 0
The dual problem
plugging back

  w* = Σ_i α_i y_i x_i,   Σ_i α_i y_i = 0

we get the Lagrangian

  L(w*,b,α) = (1/2)||w*||² - Σ_i α_i [y_i (w*^T x_i + b) - 1]
            = (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j - Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j - b Σ_i α_i y_i + Σ_i α_i
            = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j

(the term b Σ_i α_i y_i vanishes, since Σ_i α_i y_i = 0)
The dual problem
and the dual problem is

  max_{α≥0} { Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j }  subject to  Σ_i α_i y_i = 0

once this is solved, the vector

  w* = Σ_i α_i y_i x_i

is the normal to the maximum margin plane
note:
• the dual solution does not determine the optimal b*, since b drops out when we take derivatives
The dual problem
determining b*
• various possibilities; for example, pick one point x+ on the margin on the y = 1 side and one point x- on the y = -1 side
• use the margin constraints

  w*^T x+ + b = 1
  w*^T x- + b = -1    ⇔   b* = -(1/2) (x+ + x-)^T w*

note:
• the maximum margin solution guarantees that there is always at least one point "on the margin" (at distance 1/||w*||) on each side
• if not, we could move the plane and get a larger margin
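In code, recovering b* from a pair of margin points looks like this; the plane and the two points below are assumed values, chosen so that the points lie exactly on the margin:

```python
# Sketch: recovering b* from two margin points, via the constraints
# w*^T x+ + b* = 1 and w*^T x- + b* = -1. The points are assumed to lie
# exactly on the margin of the (assumed) plane w* = (1, 0).
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

w_star = [1.0, 0.0]
x_pos = [2.0, 3.0]    # margin point on the y = +1 side: w*.x+ = 2
x_neg = [0.0, -1.0]   # margin point on the y = -1 side: w*.x- = 0

b_star = -0.5 * dot(w_star, [p + n for p, n in zip(x_pos, x_neg)])
print(b_star)                        # b* = -1
print(dot(w_star, x_pos) + b_star)   # +1, as the constraint requires
print(dot(w_star, x_neg) + b_star)   # -1, as the constraint requires
```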
Support vectors
from the KKT conditions, an inactive constraint has zero Lagrange multiplier α_i. That is,
• i) α_i > 0 and y_i (w*^T x_i + b*) = 1
or
• ii) α_i = 0 and y_i (w*^T x_i + b*) ≥ 1
hence α_i > 0 only for points such that

  |w*^T x_i + b*| = 1

which are those that lie at a distance equal to the margin
these are called support vectors
• note: an alternative for determining b* is to average the estimates over all points on the margin
Support vectors
points with α_i > 0 support the optimal plane (w*, b*); for this reason they are called "support vectors"
note that the decision rule is

  f(x) = sgn[w*^T x + b*]
       = sgn[ Σ_i α_i* y_i x_i^T x + b* ]
       = sgn[ Σ_{i∈SV} α_i* y_i x_i^T x + b* ]

where SV = {i | α_i* > 0} is the set of support vectors
Support vectors
since the decision rule is

  f(x) = sgn[ Σ_{i∈SV} α_i* y_i x_i^T x + b* ]

we only need the support vectors to completely define the classifier
we can literally throw away all other points!
the Lagrange multipliers can also be seen as a measure of importance of each point
• points with α_i = 0 have no influence; a small perturbation does not change the solution
Perceptron learning
note the similarities with the dual Perceptron:

  set α_i = 0, b = 0
  set R = max_i ||x_i||
  do {
    for i = 1:n {
      if y_i ( Σ_j α_j y_j x_j^T x_i + b ) ≤ 0 then {
        α_i = α_i + 1
        b = b + y_i R²
      }
    }
  } until no errors

in this case:
• α_i = 0 means that the point was never misclassified
• this means that we have an "easy" point, far from the boundary
• very unlikely to happen for a support vector
• but the Perceptron does not maximize the margin!
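The dual Perceptron above is short enough to run directly; here is a sketch on a tiny, made-up linearly separable 2D set:

```python
# Runnable sketch of the dual Perceptron, on a toy linearly separable set.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
n = len(X)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

alpha = [0.0] * n
b = 0.0
R2 = max(dot(x, x) for x in X)         # R^2 = max_i ||x_i||^2

errors = True
while errors:                          # do ... until no errors
    errors = False
    for i in range(n):
        g_i = sum(alpha[j] * y[j] * dot(X[j], X[i]) for j in range(n)) + b
        if y[i] * g_i <= 0:            # point i is misclassified
            alpha[i] += 1
            b += y[i] * R2
            errors = True

# after convergence, recover the primal w = sum_j alpha_j y_j x_j
w = [sum(alpha[j] * y[j] * X[j][k] for j in range(n)) for k in range(2)]
print(alpha, b, w)
```

Note that only the points that were ever misclassified end up with α_i > 0, mirroring the support-vector intuition, but, as the slide says, this plane is not the maximum margin one.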
The robustness of SVMs
earlier we talked a lot about the "curse of dimensionality"
• the number of examples required to achieve a certain precision is exponential in the number of dimensions
it turns out that SVMs are remarkably robust to dimensionality
• it is not uncommon to see successful applications on 1,000D+ spaces
there are two main reasons for this:
• 1) all that the SVM does is to learn a plane. Although the number of dimensions may be large, the number of parameters is relatively small and there is not much room for overfitting
  In fact, d+1 points are enough to specify the decision rule in R^d!
SVMs as feature selectors
the second reason is that the space is not really that large
• 2) the SVM is a feature selector
to see this, let's look at the decision function

  f(x) = sgn[ Σ_{i∈SV} α_i* y_i x_i^T x + b* ]

this is a thresholding of the quantity

  Σ_{i∈SV} α_i* y_i x_i^T x

note that each of the terms x_i^T x is the projection of the vector to classify (x) onto the training vector x_i
SVMs as feature selectors
defining z as the vector of the projections onto all support vectors

  z(x) = ( x_{i1}^T x, ..., x_{ik}^T x )^T,   SV = {i1, ..., ik}

the decision function is a plane in the z-space

  f(x) = sgn[ Σ_{i∈SV} α_i* y_i x_i^T x + b* ] = sgn[ Σ_k w_k* z_k(x) + b* ]

with

  w* = ( α_{i1}* y_{i1}, ..., α_{ik}* y_{ik} )^T

this means that
• the classifier operates on the span of the support vectors!
• the SVM performs feature selection automatically
SVMs as feature selectors
geometrically, we have:
• 1) projection onto the span of the support vectors
• 2) a classifier on this space
• the effective dimension is |SV| and, typically, |SV| << n
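The equivalence between the two forms of the decision function is easy to verify; in this sketch all multipliers, labels, support vectors, and the offset are made-up illustrative values:

```python
# Sketch: the SVM decision computed two ways -- directly, and as a plane in
# the z-space of projections onto the support vectors. All alpha*, y and
# support vectors below are assumed values, not the output of a solver.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

sv_x = [[1.0, 2.0], [-1.0, 0.5]]      # assumed support vectors
sv_y = [1, -1]
sv_alpha = [0.4, 0.4]                 # assumed optimal multipliers
b_star = -0.2                         # assumed offset

x = [0.5, 1.0]

# direct form: sgn[ sum_i alpha_i* y_i x_i^T x + b* ]
direct = sum(a * yi * dot(xi, x)
             for a, yi, xi in zip(sv_alpha, sv_y, sv_x)) + b_star

# z-space form: z(x) = projections onto the SVs, w* = (alpha_i* y_i)
z = [dot(xi, x) for xi in sv_x]
w_star = [a * yi for a, yi in zip(sv_alpha, sv_y)]
zspace = dot(w_star, z) + b_star

print(direct, zspace)   # identical by construction
```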
In summary
SVM training:
• 1) solve

  max_{α≥0} { Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j }  subject to  Σ_i α_i y_i = 0

• 2) then compute

  w* = Σ_{i∈SV} α_i* y_i x_i,   b* = -(1/2) Σ_{i∈SV} α_i* y_i x_i^T (x+ + x-)

decision function:

  f(x) = sgn[ Σ_{i∈SV} α_i* y_i x_i^T x + b* ]
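The whole pipeline can be sketched end to end on the smallest possible toy problem, two points x1 = (1, 0) with y1 = +1 and x2 = (-1, 0) with y2 = -1. The constraint Σ_i α_i y_i = 0 forces α_1 = α_2 = a, so the dual reduces to a one-variable problem, which a crude grid search can maximize (a real implementation would use a QP solver):

```python
# End-to-end sketch: dual -> alpha* -> (w*, b*) on a two-point toy problem.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]

def dual(a):
    """Dual objective with alpha = (a, a), which satisfies sum_i alpha_i y_i = 0."""
    alphas = [a, a]
    quad = sum(alphas[i] * alphas[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(2) for j in range(2))
    return sum(alphas) - 0.5 * quad   # here: 2a - 2a^2

# crude maximization over a grid (stand-in for a QP solver)
a_star = max((k * 0.001 for k in range(2001)), key=dual)

# w* = sum_i alpha_i* y_i x_i; both points are on the margin, so use them as x+/x-
w_star = [sum(a_star * y[i] * X[i][k] for i in range(2)) for k in range(2)]
b_star = -0.5 * dot(w_star, [p + n for p, n in zip(X[0], X[1])])

print(a_star, w_star, b_star)   # expect a* ~ 0.5, w* ~ (1, 0), b* = 0
```

The resulting plane is x1 = 0, the obvious maximum margin separator for this data, and both points are support vectors with margin 1/||w*|| = 1.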
Practical implementations
in practice we need an algorithm for solving the optimization problem of the training stage
• this is still a complex problem
• there has been a large amount of research in this area
• coming up with "your own" algorithm is not going to be competitive
• luckily, there are various packages available, e.g.:
  • libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • SVM light: http://www.cs.cornell.edu/People/tj/svm_light/
  • SVM fu: http://five-percent-nation.mit.edu/SvmFu/
  • various others (see http://www.support-vector.net/software.html)
• also many papers and books on algorithms (see e.g. B. Schölkopf and A. Smola, Learning with Kernels, MIT Press, 2002)
Kernelization
note that all equations depend only on x_i^T x_j
the kernel trick is trivial: replace x_i^T x_j by K(x_i, x_j)
• 1) training:

  max_{α≥0} { Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) }  subject to  Σ_i α_i y_i = 0

  b* = -(1/2) Σ_{i∈SV} α_i* y_i ( K(x_i, x+) + K(x_i, x-) )

• 2) decision function:

  f(x) = sgn[ Σ_{i∈SV} α_i* y_i K(x_i, x) + b* ]

[figure: the feature transformation Φ maps the input space, where the classes are not linearly separable, to a space where they are]
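A sketch of the kernelized decision function with a Gaussian kernel; the support vectors, labels, multipliers, and b* below are assumed values, not the output of an actual dual solver:

```python
# Sketch: kernelized SVM decision function with a Gaussian kernel.
import math

def gauss_kernel(u, v, sigma=1.0):
    """Gaussian kernel K(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))

sv_x = [[1.0, 1.0], [-1.0, -1.0]]    # assumed support vectors
sv_y = [1, -1]
sv_alpha = [1.0, 1.0]                 # assumed multipliers
b_star = 0.0                          # assumed offset

def f(x):
    """f(x) = sgn[ sum_{i in SV} alpha_i* y_i K(x_i, x) + b* ]"""
    s = sum(a * yi * gauss_kernel(xi, x)
            for a, yi, xi in zip(sv_alpha, sv_y, sv_x)) + b_star
    return 1 if s >= 0 else -1

print(f([0.9, 1.2]))    # near the positive SV
print(f([-1.1, -0.8]))  # near the negative SV
```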
Kernelization
notes:
• as usual, this follows from the fact that nothing we did really requires us to be in R^d
• we could have simply used the notation <x_i, x_j> for the dot product and all the equations would still hold
• the only difference is that we can no longer recover w* explicitly without determining the feature transformation Φ, since

  w* = Σ_{i∈SV} α_i* y_i Φ(x_i)

• this could have infinite dimension, e.g. we have seen that it is a sum of Gaussians when we use the Gaussian kernel
• but, luckily, we don't really need w*, only the decision function

  f(x) = sgn[ Σ_{i∈SV} α_i* y_i K(x_i, x) + b* ]
Input space interpretation
when we introduce a kernel, what is the SVM doing in the input space?
let's look again at the decision function

  f(x) = sgn[ Σ_{i∈SV} α_i* y_i K(x_i, x) + b* ]

with

  b* = -(1/2) Σ_{i∈SV} α_i* y_i ( K(x_i, x+) + K(x_i, x-) )

note that
• x+ and x- are support vectors
• assume that the kernel has reduced support when compared to the distance between support vectors
Input space interpretation
note that, assuming the kernel has reduced support when compared to the distance between support vectors,

  b* = -(1/2) Σ_{i∈SV} α_i* y_i ( K(x_i, x+) + K(x_i, x-) )
     ≈ -(1/2) [ α+* K(x+, x+) - α-* K(x-, x-) ]
     ≈ 0

• where we have also assumed that α+ ≈ α-
• these assumptions are not crucial, but simplify what follows
• namely, the decision function is

  f(x) = sgn[ Σ_{i∈SV} α_i* y_i K(x_i, x) ]
Input space interpretation
or

  f(x) = { 1,  if Σ_{i∈SV} α_i* y_i K(x_i, x) ≥ 0
         { -1, if Σ_{i∈SV} α_i* y_i K(x_i, x) < 0

rewriting

  Σ_{i∈SV} α_i* y_i K(x_i, x) = Σ_{i|y_i=1} α_i* K(x_i, x) - Σ_{i|y_i=-1} α_i* K(x_i, x)

this is

  f(x) = { 1,  if Σ_{i|y_i=1} α_i* K(x_i, x) ≥ Σ_{i|y_i=-1} α_i* K(x_i, x)
         { -1, otherwise
Input space interpretation
or, normalizing the weights,

  π_i = α_i* / Σ_{j|y_j=1} α_j*,    β_i = α_i* / Σ_{j|y_j=-1} α_j*

this is the same as

  f(x) = { 1,  if Σ_{i|y_i=1} π_i K(x_i, x) ≥ Σ_{i|y_i=-1} β_i K(x_i, x)
         { -1, otherwise

(the normalizing constants cancel, since the constraint Σ_i α_i y_i = 0 implies Σ_{i|y_i=1} α_i* = Σ_{i|y_i=-1} α_i*)
Input space interpretation
note that this is the Bayesian decision rule for
• 1) class 1 with likelihood Σ_{i|y_i=1} π_i K(x_i, x) and prior Σ_{i|y_i=1} α_i* / Σ_i α_i*
• 2) class 2 with likelihood Σ_{i|y_i=-1} β_i K(x_i, x) and prior Σ_{i|y_i=-1} α_i* / Σ_i α_i*
these likelihood functions
• are a kernel density estimate if K(·, x_i) is a valid pdf
• are a peculiar kernel estimate that only places kernels around the support vectors
• all other points are ignored
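This Bayes-rule view can be checked numerically; in the sketch below, all support vectors and α* values are assumed for illustration (chosen so that the two classes' multipliers sum to the same value, as Σ_i α_i y_i = 0 requires), and b* ≈ 0 as in the derivation above:

```python
# Sketch: the input-space view of the (b* ~ 0) SVM as a Bayes rule between
# two kernel density estimates placed on the support vectors.
import math

def gauss_kernel(u, v, sigma=1.0):
    d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))

sv = [([1.0, 1.0], 1, 0.6), ([2.0, 0.0], 1, 0.4),   # (x_i, y_i, alpha_i*)
      ([-1.0, -1.0], -1, 1.0)]

def f_svm(x):
    """sgn[ sum_i alpha_i* y_i K(x_i, x) ], with b* assumed ~ 0."""
    s = sum(a * yi * gauss_kernel(xi, x) for xi, yi, a in sv)
    return 1 if s >= 0 else -1

def f_bayes(x):
    """Pick the class with the larger normalized kernel density estimate."""
    pos = sum(a for _, yi, a in sv if yi == 1)
    neg = sum(a for _, yi, a in sv if yi == -1)
    p1 = sum((a / pos) * gauss_kernel(xi, x) for xi, yi, a in sv if yi == 1)
    p2 = sum((a / neg) * gauss_kernel(xi, x) for xi, yi, a in sv if yi == -1)
    return 1 if p1 >= p2 else -1

# the two rules agree (pos == neg here, so the normalization cancels)
for x in [[0.5, 0.5], [-0.5, -2.0], [3.0, 3.0], [0.0, 0.0]]:
    print(f_svm(x) == f_bayes(x))
```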
Input space interpretation
this is a discriminant form of density estimation
• it concentrates modeling power where it matters the most, i.e. near the classification boundary
• smart, since points away from the boundary are always well classified, even if density estimates in their region are poor
• the SVM can be seen as a highly efficient combination of the BDR with kernel density estimates
• recall that one major problem of kernel estimates is the complexity of the decision function, O(n)
• with the SVM, the complexity is only O(|SV|), but nothing is lost
Input space interpretation
note on the approximations made:
• this result was derived assuming b ≈ 0
• in practice, b is frequently left as a parameter, which is used to trade off false positives for misses
• here, that can be done by controlling the BDR threshold T:

  f(x) = { 1,  if Σ_{i|y_i=1} π_i K(x_i, x) ≥ T · Σ_{i|y_i=-1} β_i K(x_i, x)
         { -1, otherwise

• hence, there is really not much practical difference, even when the assumption b* = 0 does not hold!