Support Vector Machine Easy Tutorial(I wish)
y(x) = w^T φ(x) + b    (1)
To simplify the problem, a few assumptions:
1. After mapping into a feature space through some function φ, the data are linearly separable.
2. Labels take two values, 1 and -1.
Distance from a sample to the hyperplane (φ(x) is the feature vector; we assume linear separability, so t_n y(x_n) > 0 for all n):

t_n y(x_n) / ||w|| = t_n (w^T φ(x_n) + b) / ||w||    (2)
y(x) = w^T φ(x) + b    (1)

Perceptron algorithm:  w' = w + t_n x_n  if  t_n ≠ sign(y(x_n))
The normal vector of the hyperplane is adjusted using the misclassified samples (mistakes), which is why the perceptron is called a mistake-driven algorithm.
Perceptron convergence
• Claim: the perceptron algorithm indeed converges in a finite number of updates.
• Assume that
  - θ^(k) is the parameter vector after the k-th update,
  - ||x_t|| ≤ R for all t and some finite R,
  - there exist θ* and γ > 0 s.t. y_t (θ*^T x_t) ≥ γ for all t.

Each update is triggered by a mistake, i.e. a sample with y_t (θ^(k−1)T x_t) ≤ 0, and sets θ^(k) = θ^(k−1) + y_t x_t. Then:

θ^(k)T θ* = θ^(k−1)T θ* + y_t (x_t^T θ*) ≥ θ^(k−1)T θ* + γ ≥ … ≥ k γ

||θ^(k)||² = ||θ^(k−1)||² + 2 y_t (θ^(k−1)T x_t) + ||x_t||² ≤ ||θ^(k−1)||² + R² ≤ … ≤ k R²

Combining the two,

1 ≥ cos(θ*, θ^(k)) = θ*^T θ^(k) / (||θ*|| ||θ^(k)||) ≥ k γ / (||θ*|| √k R)

so k ≤ R² ||θ*||² / γ²: the number of updates is finite. (Convergence proof)
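To make the update rule concrete, here is a minimal NumPy sketch of the mistake-driven perceptron above (a sketch, not the tutorial's code; the bias is dropped for brevity, or equivalently absorbed into w by appending a constant-1 feature):

```python
import numpy as np

def perceptron(X, t, max_epochs=100):
    """X: (N, D) samples, t: (N,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:   # mistake: t_n != sign(y(x_n))
                w += t_n * x_n         # nudge the normal vector
                mistakes += 1
        if mistakes == 0:              # a full clean pass: converged
            return w
    return w

# Toy usage on two linearly separable points:
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
t = np.array([1.0, -1.0])
print(perceptron(X, t))
```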
Definition of Support Vector and Margin
The red circles (in the figure) are the support vectors; the perpendicular distance between a support vector and the hyperplane is called the margin.
Find the parameters that maximize the margin of the sample closest to the hyperplane:

arg max_{w,b} { (1/||w||) min_n [ t_n (w^T φ(x_n) + b) ] }    (3)
SVM's motivation? intuition?
There are countless separating hyperplanes. Which hyperplane has the smallest generalization error? The maximal margin classifier!!
Linear Classifiers (figure slides): points labeled +1 and -1, classifier f(x, w, b) = sign(w·x - b). How would you classify this data?
Any of these would be fine..
..but which is best?
Preprocessing for formalizing the SVM idea
- Distance from a sample to the hyperplane (assuming linear separability, t_n y(x_n) > 0):

  t_n y(x_n) / ||w|| = t_n (w^T φ(x_n) + b) / ||w||    (2)

- Rescaling w → κw and b → κb leaves this margin unchanged, so we may fix the scale of (w, b) freely.
- Adjust (w, b) so that the point closest to the surface satisfies

  t_n (w^T φ(x_n) + b) = 1.    (4)

- Every point then satisfies

  t_n (w^T φ(x_n) + b) ≥ 1.    (5)

- Maximizing the margin = maximizing 1/||w|| = minimizing ||w||², i.e. finding the parameters that minimize the inverse of the margin:

  arg min_{w,b} (1/2) ||w||²    (6)

  (the factor 1/2 is for convenience when differentiating later)
Formalizing the SVM idea

arg min_{w,b} (1/2) ||w||²  subject to  t_n (w^T φ(x_n) + b) ≥ 1
This completes the SVM formulation!!!!!! Now we only have to solve it. The optimization problem resembles the linear programming we saw in high school; it is called quadratic programming. To convert this quadratic programming problem into its dual form and solve it, we introduce the concept of Lagrange multipliers.
Why convert to the dual form?
-> For the kernel trick.
Note that the formulation above is a convex optimization problem.
Convex vs. non-convex, briefly:
-> The nice thing about convexity is that the "differentiate and set to zero" technique finds the global minimum.
Lagrange multiplier
- Goal: maximize f(x) subject to g(x) = 0 (assume all points of interest lie on the constraint surface).
- Taylor series: for a nearby point x + ε on the surface, g(x + ε) ≈ g(x) + ε^T ∇g(x). Since g(x + ε) = g(x) = 0, taking the limit ε → 0 gives ε^T ∇g(x) = 0, so ∇g(x) is perpendicular to the constraint surface.
- At a constrained optimum, ∇f must also be perpendicular to the surface (otherwise moving along the surface would increase f), so ∇f is parallel to ∇g: ∇f + λ∇g = 0 for some λ ≠ 0.
- Solving another way: define the Lagrangian L(x, λ) = f(x) + λ g(x). Setting ∂L/∂x = 0 and ∂L/∂λ = 0 satisfies exactly the conditions above (this generalizes to the KKT conditions for inequality constraints).
- Example: maximize f(x) = 1 − x1² − x2² subject to g(x) = x1 + x2 − 1 = 0. From ∇L = 0: −2x1 + λ = 0, −2x2 + λ = 0, x1 + x2 − 1 = 0, giving (x1, x2) = (1/2, 1/2) and λ = 1.
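The worked example can be checked symbolically; a minimal SymPy sketch (the variable names are ours):

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
f = 1 - x1**2 - x2**2          # objective
g = x1 + x2 - 1                # constraint g(x) = 0
L = f + lam * g                # Lagrangian L(x, lambda) = f + lambda*g

# Stationarity: dL/dx1 = dL/dx2 = dL/dlambda = 0
sol = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], [x1, x2, lam])
print(sol)                     # {x1: 1/2, x2: 1/2, lam: 1}
```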
Converting the SVM formulation to its dual form (for the SVM magic)

The Lagrange optimization formula:

L(w, b, a) = (1/2)||w||² − Σ_{n=1}^N a_n [ t_n (w^T φ(x_n) + b) − 1 ]    (7)

Differentiating and simplifying gives:

∂L/∂w = 0  ⟹  w = Σ_{n=1}^N a_n t_n φ(x_n)    (8)
∂L/∂b = 0  ⟹  0 = Σ_{n=1}^N a_n t_n    (9)

Substituting (8) and (9) into (7) and simplifying:

L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)    (10)

where k(x_n, x_m) = φ(x_n)^T φ(x_m), subject to the constraints

a_n ≥ 0    (11)
Σ_{n=1}^N a_n t_n = 0    (12)  (the constraint from (9))
Take this part slowly, for understanding

Start from the primal problem:

arg min_{w,b} (1/2) ||w||²  subject to  t_n (w^T φ(x_n) + b) ≥ 1

The Lagrange optimization formula (7):

L(w, b, a) = (1/2)||w||² − Σ_{n=1}^N a_n [ t_n (w^T φ(x_n) + b) − 1 ]

Differentiate and simplify to obtain:

∂L/∂w = 0  ⟹  w = Σ_{n=1}^N a_n t_n φ(x_n)    (8)
∂L/∂b = 0  ⟹  0 = Σ_{n=1}^N a_n t_n    (9)

Substitute (8) and (9) into (7) and simplify to obtain the dual:

L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m),  with k(x_n, x_m) = φ(x_n)^T φ(x_m)    (10)
The magic of SVM
- Solving the original optimization problem (6) as a QP requires computation on the order of O(M³), where M is the dimension of the feature space φ(x).
- Solving the dual (10), obtained through (7), requires computation on the order of O(N³), where N is the number of multipliers a_n (the number of samples).
- Since usually M < N, the conversion to the dual form looks like a bad deal.
- But if the kernel k(x, x') = φ(x)^T φ(x') can be evaluated directly, the dimension M disappears from the computation: even an infinite-dimensional mapping becomes usable. Getting that much work done with that little computation is called the kernel trick.

(Read this part slowly, once through, for understanding.)
SVM's predictor

y(x) = w^T φ(x) + b    (1)

Substituting (8), w = Σ_n a_n t_n φ(x_n), into (1):

y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b    (13)

In the end, learning in an SVM means learning the a_n and b.
The dual obtained from (7):

maximize  L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)
subject to  a_n ≥ 0,  Σ_{n=1}^N a_n t_n = 0

- L̃(a) is maximized as a QP under the constraints above; MATLAB provides the 'quadprog' function for this.
- Solving the QP is the computationally heaviest part of SVM training, so there is also research on speeding it up; one such method is SMO (sequential minimal optimization).
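For readers without MATLAB, here is a sketch of the same QP in Python using CVXOPT's solvers.qp as a stand-in for quadprog. The dual (10) maps onto qp's standard form min (1/2)a^T P a + q^T a with P_nm = t_n t_m k(x_n, x_m) and q = −1; the helper name and the linear kernel in the usage line are our own choices, not the tutorial's:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_svm_dual(X, t, kernel):
    """Maximize the dual (10) s.t. a_n >= 0, sum_n a_n t_n = 0."""
    N = len(t)
    K = np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])
    P = matrix(np.outer(t, t) * K)              # P_nm = t_n t_m k(x_n, x_m)
    q = matrix(-np.ones(N))                     # maximize sum a_n -> q = -1
    G = matrix(-np.eye(N))                      # -a_n <= 0, i.e. a_n >= 0 (11)
    h = matrix(np.zeros(N))
    A = matrix(t.reshape(1, -1).astype(float))  # sum_n a_n t_n = 0       (12)
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                   # the multipliers a_n

# Usage sketch with a linear kernel; support vectors are the samples
# whose a_n comes out noticeably above zero:
# a = solve_svm_dual(X, t, kernel=lambda u, v: u @ v)
```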
The Lagrange optimization formula (7) satisfies the following KKT conditions:

a_n ≥ 0    (14)
t_n y(x_n) − 1 ≥ 0    (15)
a_n (t_n y(x_n) − 1) = 0    (16)

The meaning of condition (16) is that for every sample either a_n = 0 or t_n y(x_n) = 1. A sample with a_n = 0 contributes nothing to the predictor (13); the samples with a_n > 0 satisfy t_n y(x_n) = 1 and are the support vectors (SV). In other words, solving the QP identifies the SVs, and classification can be done with the SV set alone.
Computing b

Every support vector satisfies t_n y(x_n) = 1; substituting (13) for y(x_n):

t_n ( Σ_{m∈S} a_m t_m k(x_n, x_m) + b ) = 1    (17)

Averaging over the N_S support vectors:

b = (1/N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )    (18)

The separating surface is built from the support vectors alone.
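A short sketch of (18) and the predictor (13) in code, reusing the multipliers a from the QP sketch above (the names and the tolerance are illustrative):

```python
import numpy as np

def svm_bias(a, t, X, kernel, tol=1e-8):
    """b via (18): average over the support-vector set S."""
    S = np.where(a > tol)[0]
    return np.mean([t[n] - sum(a[m] * t[m] * kernel(X[n], X[m]) for m in S)
                    for n in S])

def svm_predict(x, a, t, X, b, kernel, tol=1e-8):
    """Sign of y(x) from (13); only support vectors contribute."""
    S = np.where(a > tol)[0]
    return np.sign(sum(a[n] * t[n] * kernel(x, X[n]) for n in S) + b)
```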
What if the data are not linearly separable?
So far we have developed the theory assuming that the samples become linearly separable through the high-dimensional mapping. In real problems, however, this is rarely the case, so the optimization formulation has to change to handle data that are not linearly separable.
Define slack variables ξ_n ≥ 0, one per sample, measuring the difference between the actual target value and the prediction: ξ_n = |t_n − y(x_n)| for samples on the wrong side of the margin boundary, and ξ_n = 0 otherwise.    (19)

The constraint (5) changes to:

t_n y(x_n) ≥ 1 − ξ_n,  n = 1, …, N    (20)

- ξ_n = 0: correctly classified, on or outside the margin;
- 0 < ξ_n ≤ 1: inside the margin, but still correctly classified;
- ξ_n > 1: misclassified.

Penalizing misclassified samples while minimizing the inverse of the margin:

arg min_{w,b}  C Σ_{n=1}^N ξ_n + (1/2)||w||²    (21)

As C → ∞ this becomes identical to the separable case (6). The Lagrangian is:

L(w, b, ξ, a, μ) = (1/2)||w||² + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n (t_n y(x_n) − 1 + ξ_n) − Σ_{n=1}^N μ_n ξ_n    (22)

with KKT conditions:

a_n ≥ 0    (23)
t_n y(x_n) − 1 + ξ_n ≥ 0    (24)
a_n (t_n y(x_n) − 1 + ξ_n) = 0    (25)
μ_n ≥ 0    (26)
ξ_n ≥ 0    (27)
μ_n ξ_n = 0    (28)
What if the data are not linearly separable? (continued)

Differentiating (22) with respect to each parameter:

∂L/∂w = 0  ⟹  w = Σ_{n=1}^N a_n t_n φ(x_n)    (29)
∂L/∂b = 0  ⟹  Σ_{n=1}^N a_n t_n = 0    (30)
∂L/∂ξ_n = 0  ⟹  a_n = C − μ_n    (31)

Substituting (29)-(31) into (22):

L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)    (32)

the same dual as before, but from conditions (23) and (26) together with (31), the constraints become box constraints:

0 ≤ a_n ≤ C    (33)
Σ_{n=1}^N a_n t_n = 0    (34)  (from (30))
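In code, the only change from the separable case is the box constraint (33): the inequality block of the QP gains rows for a_n ≤ C. A sketch extending the earlier CVXOPT setup (C is the user-chosen penalty):

```python
import numpy as np
from cvxopt import matrix

def soft_margin_Gh(N, C):
    """G, h encoding 0 <= a_n <= C from (33)."""
    G = matrix(np.vstack([-np.eye(N),    # -a_n <= 0   (a_n >= 0)
                          np.eye(N)]))   #  a_n <= C
    h = matrix(np.hstack([np.zeros(N),
                          C * np.ones(N)]))
    return G, h
```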
Example.

Polynomial kernel k(x, z) = (1 + x^T z)² for x, z ∈ R²:

k(x, z) = (1 + x1 z1 + x2 z2)²
        = 1 + 2 x1 z1 + 2 x2 z2 + x1² z1² + 2 x1 x2 z1 z2 + x2² z2²
        = (1, √2 x1, √2 x2, x1², √2 x1 x2, x2²) · (1, √2 z1, √2 z2, z1², √2 z1 z2, z2²)^T
        = φ(x)^T φ(z)

Exercise: classify the two given classes of sample points using this kernel and an SVM, i.e. solve

maximize  L̃(a) = Σ_n a_n − (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m)
subject to  a_n ≥ 0,  Σ_n a_n t_n = 0.

Normally this would be solved as a QP, but since the problem is small it was solved analytically, just as in the Lagrange multiplier example. Recovering w from (8), w = Σ_n a_n t_n φ(x_n), and taking the hyperplane back to the original (pre-mapping) space gives a parabola-shaped decision boundary.
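A quick numeric sanity check of the expansion above: the kernel value and the explicit trip through the 6-dimensional feature space agree (the sample points are our own):

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, z) = (1 + x.z)^2 in 2-D."""
    x1, x2 = x
    r2 = np.sqrt(2)
    return np.array([1, r2 * x1, r2 * x2, x1**2, r2 * x1 * x2, x2**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
k_direct = (1 + x @ z) ** 2    # two multiplies, never leaves 2-D
k_mapped = phi(x) @ phi(z)     # explicit 6-D inner product
print(k_direct, k_mapped)      # both print 4.0
```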
References
1. Bishop's book (Pattern Recognition and Machine Learning)
2. MIT OpenCourseWare, 6.867 lecture notes:
   http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-867Fall-2006/LectureNotes/index.htm
3. Numerous internet resources?
Fast Training of Support Vector Machines using
Sequential Minimal Optimization,
Platt, John.
Jung-Kyu Lee @ MLLAB
SVM revisited • the (dual) optimization problem we covered in SVM class
• Solving this optimization problem for large-scale applications by using generic quadratic programming is computationally very expensive.
• John Platt, a researcher at Microsoft, proposed an efficient algorithm called SMO (Sequential Minimal Optimization) to actually solve this optimization problem.
max_α  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j ⟨x_i, x_j⟩
s.t.  0 ≤ α_i ≤ C,  i = 1, …, m
      Σ_{i=1}^m α_i y_i = 0
Digression: Coordinate ascent (1/3)
• Consider trying to solve the following unconstrained optimization problem:

max_α W(α_1, α_2, …, α_m)

• Coordinate ascent: loop until convergence { for i = 1, …, m: α_i := argmax_{α̂_i} W(α_1, …, α_{i−1}, α̂_i, α_{i+1}, …, α_m) }
Digression: Coordinate ascent (2/3)
• How does coordinate ascent work? We optimize a quadratic function whose minimum is at (0,0).
• First, minimize it w.r.t. α1; next, minimize it w.r.t. α2.
• With the same argument, iterate until convergence.
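A sketch of what this slide's picture computes: exact coordinate-wise minimization of a quadratic whose minimum is at (0, 0); the particular quadratic is an assumption for illustration:

```python
import numpy as np

# f(a) = 1/2 a^T Q a, Q positive definite, so the minimum is the origin.
Q = np.array([[2.0, 1.0],
              [1.0, 2.0]])
a = np.array([2.0, -2.0])                # start away from the optimum

for _ in range(20):
    for i in range(2):                   # fix the other coordinate...
        j = 1 - i
        a[i] = -Q[i, j] * a[j] / Q[i, i] # ...and solve df/da_i = 0 exactly
print(a)                                 # -> approaches (0, 0)
```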
Digression: Coordinate ascent (3/3)
• Coordinate ascent will usually take a lot more steps than other iterative optimization algorithms such as gradient descent, Newton's method, conjugate gradient descent, etc.
• However, there are many optimization problems for which it's particularly easy to fix all but one of the parameters and optimize with respect to just that one parameter.
• It turns out that this is true when we modify this algorithm to solve the SVM optimization problem.
SMO
• Coordinate ascent in its basic form cannot be applied to the SVM dual optimization problem:

max_α  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j ⟨x_i, x_j⟩
s.t.  0 ≤ α_i ≤ C, i = 1, …, m;  Σ_{i=1}^m α_i y_i = 0

• As coordinate ascent progresses, we can't change a single α_i without violating the constraint Σ_i α_i y_i = 0:

α_1 y_1 = − Σ_{i=2}^m α_i y_i
α_1 = − y_1 Σ_{i=2}^m α_i y_i    (since y_1 ∈ {−1, 1}, y_1² = 1)

so α_1 is exactly determined by the remaining α's.
Outline of SMO • The SMO (sequential minimal optimization) algorithm therefore, instead of trying to change one α at a time, changes two α's at a time so that the constraints keep being satisfied.
• The term 'minimal' refers to the fact that two is the smallest number of α's we can change together.
Convergence of SMO • To test for convergence of the SMO algorithm, we can check whether the KKT conditions are satisfied to within some tolerance:

α_i = 0  ⟹  y_i (w^T x_i + b) ≥ 1
α_i = C  ⟹  y_i (w^T x_i + b) ≤ 1
0 < α_i < C  ⟹  y_i (w^T x_i + b) = 1

• The key reason that SMO is an efficient algorithm is that the update of α_i, α_j can be computed very efficiently.
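A sketch of this convergence test in code: each implication above is checked to within a tolerance (f holds the decision values w^T x_i + b; the function is illustrative, not Platt's):

```python
import numpy as np

def kkt_satisfied(a, y, f, C, tol=1e-3):
    m = y * f                                     # margins y_i (w^T x_i + b)
    ok0   = (a >= tol)     | (m >= 1 - tol)       # a_i = 0     -> m_i >= 1
    okC   = (a <= C - tol) | (m <= 1 + tol)       # a_i = C     -> m_i <= 1
    okmid = (a <= tol) | (a >= C - tol) | (np.abs(m - 1) <= tol)
    return bool(np.all(ok0 & okC & okmid))        # 0 < a_i < C -> m_i  = 1
```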
Detail of SMO
• Suppose we've decided to hold α_3, …, α_m fixed and update α_1, α_2. The constraint Σ_{i=1}^m α_i y_i = 0 then gives

α_1 y_1 + α_2 y_2 = − Σ_{i=3}^m α_i y_i = ζ    (a constant)
Detail of SMO
• We can thus picture the constraints on α_1, α_2 as follows:
• α_1 and α_2 must lie within the box [0, C] × [0, C] shown.
• α_1 and α_2 must lie on the line α_1 y_1 + α_2 y_2 = ζ.
• From these constraints, we know: L ≤ α_2 ≤ H.
Detail of SMO
• From α_1 y_1 + α_2 y_2 = ζ (using y_1² = 1):

α_1 = (ζ − α_2 y_2) y_1

• W(α) becomes

W(α_1, α_2, …, α_m) = W((ζ − α_2 y_2) y_1, α_2, …, α_m)

• Viewed as a function of α_2 alone, this is a standard quadratic function: it can be written a·α_2² + b·α_2 + c.
Detail of SMO
• From a·α_2² + b·α_2 + c:
• This is really easy to optimize by setting its derivative to zero.
• If you end up with a value outside [L, H], you must clip your solution.
• That'll give you the optimal solution of this quadratic optimization problem subject to your solution satisfying the box constraint and lying on the straight line.
Detail of SMO • In conclusion,

α_2^new = H                     if α_2^{new,unclipped} > H
α_2^new = α_2^{new,unclipped}   if L ≤ α_2^{new,unclipped} ≤ H
α_2^new = L                     if α_2^{new,unclipped} < L

after which α_1^new = (ζ − α_2^new y_2) y_1.

• We can do this process very quickly, which makes the inner loop of the SMO algorithm very efficient.
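Putting the last few slides together, a sketch of one SMO pair update in the style of the simplified SMO handout [2] (E holds the prediction errors f(x_i) − y_i, K is the kernel matrix; the threshold update for b is omitted for brevity):

```python
import numpy as np

def smo_step(a, y, K, E, i, j, C):
    """Jointly optimize (a_i, a_j), holding the rest fixed."""
    if y[i] != y[j]:                       # bounds from the box-and-line picture
        L, H = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
    else:
        L, H = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
    if L == H:
        return False                       # no room to move along the line
    eta = 2 * K[i, j] - K[i, i] - K[j, j]  # curvature of W along the line
    if eta >= 0:
        return False                       # skip the degenerate case
    a_j_new = np.clip(a[j] - y[j] * (E[i] - E[j]) / eta, L, H)
    a[i] += y[i] * y[j] * (a[j] - a_j_new) # keeps sum_i a_i y_i fixed
    a[j] = a_j_new
    return True
```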
Conclusion • SMO
- scales to large data sets.
- makes SVMs computationally inexpensive.
- can be implemented easily because it has a simple training structure (coordinate ascent).
Reference
• [1] Platt, John. Fast Training of Support Vector Machines using Sequential Minimal Optimization, in Advances in Kernel Methods – Support Vector Learning, B. Scholkopf, C. Burges, A. Smola, eds., MIT Press (1998).
• [2] The Simplified SMO Algorithm, Machine Learning course @ Stanford:
  http://www.stanford.edu/class/cs229/materials/smo.pdf