Multiple Kernel Learning
Hossein Hajimirsadeghi
School of Computing Science, Simon Fraser University
November 5, 2013
Introduction - SVM

Separating hyperplane: $w \cdot \phi(x) + b = 0$, with margin boundaries $w \cdot \phi(x) + b = \pm 1$.

Maximize the margin $\frac{1}{\|w\|}$:

$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i \left( w \cdot \phi(x_i) + b \right) \ge 1, \ \forall i$$

Decision function: $f(x) = w \cdot \phi(x) + b$
Soft-margin SVM with slack variables $\xi_i$:

$$\min_{w,b,\boldsymbol{\xi}} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i \left( w \cdot \phi(x_i) + b \right) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall i$$
Equivalent unconstrained form:

$$\min_{w,b} \underbrace{\frac{1}{2}\|w\|^2}_{\text{Regularizer}} + C \sum_i \underbrace{\max\left( 0,\ 1 - y_i \left( w \cdot \phi(x_i) + b \right) \right)}_{\text{Loss function } l(f(x_i),\, y_i)}$$
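The regularizer-plus-loss form above can be evaluated directly. A minimal sketch in plain Python (the toy weights and data are illustrative, not from the slides):

```python
def hinge_objective(w, b, X, y, C):
    """Unconstrained SVM objective: 1/2 ||w||^2 + C * sum of hinge losses."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = sum(max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
               for xi, yi in zip(X, y))
    return reg + C * loss

# Toy example: first point has margin >= 1 (no loss), second lies inside the margin.
obj = hinge_objective([1.0, 0.0], 0.0, [[2.0, 0.0], [-0.5, 1.0]], [1, -1], 1.0)
# obj = 0.5 (regularizer) + 0.5 (hinge loss of the second point) = 1.0
```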
SVM: Optimization Problem

$$\min_{w,b,\boldsymbol{\xi}} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i \left( w \cdot \phi(x_i) + b \right) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall i$$

Lagrangian:

$$L(w, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left( y_i \left( w \cdot \phi(x_i) + b \right) - 1 + \xi_i \right) - \sum_i \beta_i \xi_i$$
SVM: Dual

Primal:
$$\min_{w,b,\boldsymbol{\xi}} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i \left( w \cdot \phi(x_i) + b \right) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall i$$

Dual:
$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \ \ 0 \le \alpha_i \le C$$
SVM-Dual

$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \ \ 0 \le \alpha_i \le C$$

Resulting classifier:
$$f(x) = w \cdot \phi(x) + b = \sum_i \alpha_i y_i \, \phi(x_i) \cdot \phi(x) + b, \qquad b = y_j - \sum_i \alpha_i y_i \, \phi(x_i) \cdot \phi(x_j)$$

The inner products $\phi(x_i) \cdot \phi(x_j)$ can be replaced by a kernel $K(x_i, x_j)$.
Kernel Methods

Define $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, called a kernel, such that
$$K(x, y) = \phi(x) \cdot \phi(y)$$

Ideas:
- K is often interpreted as a similarity measure
- Benefits: efficiency, flexibility

Example (degree-2 polynomial kernel on $\mathbb{R}^2$):
$$K(x, y) = \left( x_1 y_1 + x_2 y_2 + c \right)^2 = \phi(x) \cdot \phi(y), \qquad \phi(x) = \left( x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2,\ \sqrt{2c}\, x_1,\ \sqrt{2c}\, x_2,\ c \right)$$
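The polynomial-kernel expansion above is easy to verify numerically. A small check in plain Python (the point values are chosen arbitrarily):

```python
import math

def poly_kernel(x, y, c):
    """Degree-2 polynomial kernel on R^2: (x . y + c)^2."""
    return (x[0] * y[0] + x[1] * y[1] + c) ** 2

def feature_map(x, c):
    """Explicit phi(x) whose inner product reproduces the kernel."""
    return [x[0] ** 2, x[1] ** 2,
            math.sqrt(2) * x[0] * x[1],
            math.sqrt(2 * c) * x[0], math.sqrt(2 * c) * x[1],
            c]

x, y, c = [1.0, 2.0], [3.0, -1.0], 0.5
implicit = poly_kernel(x, y, c)   # (x . y + c)^2 = (1 + 0.5)^2 = 2.25
explicit = sum(a * b for a, b in zip(feature_map(x, c), feature_map(y, c)))
# implicit == explicit, without ever forming phi in the implicit case
```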
Kernelized SVM

$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \ \ 0 \le \alpha_i \le C$$

Classifier:
$$f(x) = w \cdot \phi(x) + b = \sum_i \alpha_i y_i \, \phi(x_i) \cdot \phi(x) + b, \qquad b = y_j - \sum_i \alpha_i y_i \, \phi(x_i) \cdot \phi(x_j)$$
Kernelized:

$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \ \ 0 \le \alpha_i \le C$$

Classifier:
$$f(x) = w \cdot \phi(x) + b = \sum_i \alpha_i y_i \, K(x_i, x) + b, \qquad b = y_j - \sum_i \alpha_i y_i \, K(x_i, x_j)$$
Kernelized SVM

Kernel (Gram) matrix:
$$\mathbf{K} = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_N) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_N) \\ \vdots & & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & \cdots & K(x_N, x_N) \end{pmatrix}$$

Matrix form of the dual, with $\mathbf{Y} = \mathrm{diag}(\mathbf{y})$:
$$\max_{\boldsymbol{\alpha}} \mathbf{1}^T \boldsymbol{\alpha} - \frac{1}{2} \boldsymbol{\alpha}^T \mathbf{Y} \mathbf{K} \mathbf{Y} \boldsymbol{\alpha} \quad \text{subject to} \quad \mathbf{1}^T \mathbf{Y} \boldsymbol{\alpha} = 0, \ \ \mathbf{0} \le \boldsymbol{\alpha} \le C$$
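The Gram matrix and the matrix-form dual objective can be sketched as follows (plain Python; the linear kernel and the fixed $\alpha$ are chosen just for illustration, not the optimizer's output):

```python
def gram_matrix(xs, kernel):
    """N x N matrix of pairwise kernel values K(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]

def dual_objective(alpha, y, K):
    """1^T alpha - 1/2 alpha^T Y K Y alpha, with Y = diag(y)."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

linear = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
K = gram_matrix([[1.0, 0.0], [0.0, 1.0]], linear)   # identity for these two points
obj = dual_objective([1.0, 1.0], [1, -1], K)        # 2 - 0.5 * 2 = 1.0
```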
Ideal Kernel Matrix

The ideal kernel matrix is $\mathbf{K} = \mathbf{y}\mathbf{y}^T$, i.e.
$$K(x_i, x_j) = y_i y_j = \begin{cases} +1 & y_i = y_j \\ -1 & y_i \ne y_j \end{cases}$$

Plugging it into the classifier $f(x_i) = \sum_j \alpha_j y_j K(x_j, x_i) + b$ gives
$$f(x_i) = \sum_j \alpha_j y_j \, y_j y_i + b = y_i \sum_j \alpha_j + b,$$
so every training point is classified correctly.
Motivation for MKL

- Success of SVM depends on the choice of a good kernel:
  - How to choose the kernel function and its parameters?
- Practical problems involve multiple heterogeneous data sources:
  - How can kernels help to fuse features, especially features from different modalities?
Multiple Kernel Learning

General MKL:
$$K(x_i, x_j) = f_{\boldsymbol{\eta}}\!\left( \left\{ K_m(x_i^m, x_j^m) \right\}_{m=1}^{P} \right)$$

Linear MKL:
$$K(x_i, x_j) = \sum_{m=1}^{P} \eta_m K_m(x_i^m, x_j^m)$$
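The linear combination above, applied to precomputed Gram matrices, can be sketched as (plain Python; the small hand-made matrices are illustrative):

```python
def combine_kernels(etas, grams):
    """Weighted sum of Gram matrices: sum_m eta_m * K_m."""
    n = len(grams[0])
    return [[sum(eta * K[i][j] for eta, K in zip(etas, grams))
             for j in range(n)] for i in range(n)]

K1 = [[1.0, 0.0], [0.0, 1.0]]
K2 = [[2.0, 1.0], [1.0, 2.0]]
K = combine_kernels([0.5, 0.5], [K1, K2])   # [[1.5, 0.5], [0.5, 1.5]]
```

With nonnegative weights, the combined matrix stays positive semidefinite, so it is a valid kernel.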
MKL Algorithms

- Fixed rules
- Heuristic approaches
- Similarity optimization
  - Maximizing the similarity to the ideal kernel matrix
- Structural risk optimization
  - Minimizing "regularization term" + "error term"
Similarity Optimization

Similarity measures:
- Kernel alignment
- Euclidean distance
- Kullback-Leibler (KL) divergence

Kernel alignment:
$$A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle}{\sqrt{\langle K_1, K_1 \rangle \, \langle K_2, K_2 \rangle}}, \qquad \langle K_1, K_2 \rangle = \sum_{i,j} K_1(x_i, x_j) \, K_2(x_i, x_j)$$

The combined kernel is tuned to maximize the alignment with the ideal kernel, $A(\mathbf{K}, \mathbf{y}\mathbf{y}^T)$.
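Kernel alignment as defined above, in plain Python (the Frobenius inner product runs over all Gram matrix entries; the example matrices are toy values):

```python
import math

def frob_inner(K1, K2):
    """<K1, K2> = sum_ij K1[i][j] * K2[i][j]."""
    return sum(a * b for r1, r2 in zip(K1, K2) for a, b in zip(r1, r2))

def alignment(K1, K2):
    """A(K1, K2) = <K1, K2> / sqrt(<K1, K1> <K2, K2>)."""
    return frob_inner(K1, K2) / math.sqrt(frob_inner(K1, K1) * frob_inner(K2, K2))

K = [[2.0, 1.0], [1.0, 2.0]]
y = [1, -1]
ideal = [[yi * yj for yj in y] for yi in y]   # yy^T
a = alignment(K, ideal)                       # 2 / sqrt(10 * 4)
```

Alignment of any kernel with itself is 1, the maximum possible value.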
Similarity Optimization

Lanckriet et al. (2004):
$$\max_{\mathbf{K}} A(\mathbf{K}, \mathbf{y}\mathbf{y}^T) \quad \text{s.t.} \quad \mathrm{tr}(\mathbf{K}) = 1, \ \ \mathbf{K} \succeq 0, \ \ \mathbf{K} = \sum_{m=1}^{P} \eta_m \mathbf{K}_m$$

This can be converted to a semidefinite programming (SDP) problem.

Better results: centered kernel alignment, Cortes et al. (2010).
Structural Risk Optimization

SVM dual with a parameterized kernel $\mathbf{K}(\boldsymbol{\eta})$:
$$\omega(\mathbf{K}(\boldsymbol{\eta})) = \max_{\boldsymbol{\alpha}} \mathbf{1}^T \boldsymbol{\alpha} - \frac{1}{2} \boldsymbol{\alpha}^T \mathbf{Y} \mathbf{K}(\boldsymbol{\eta}) \mathbf{Y} \boldsymbol{\alpha} \quad \text{subject to} \quad \mathbf{1}^T \mathbf{Y} \boldsymbol{\alpha} = 0, \ \ \mathbf{0} \le \boldsymbol{\alpha} \le C$$

Outer problem over the kernel parameters:
$$\min_{\boldsymbol{\eta}} r(\boldsymbol{\eta}) + \omega(\mathbf{K}(\boldsymbol{\eta})) \quad \text{subject to} \quad \mathbf{K}(\boldsymbol{\eta}) \succeq 0$$
Structural Risk Optimization

General MKL (Varma et al. 2009):
$$\min_{\boldsymbol{\eta}} r(\boldsymbol{\eta}) + \omega(\mathbf{K}(\boldsymbol{\eta})) \quad \text{subject to} \quad \mathbf{K}(\boldsymbol{\eta}) \succeq 0$$

where $\omega(\mathbf{K}(\boldsymbol{\eta})) = \max_{\boldsymbol{\alpha}} \mathbf{1}^T \boldsymbol{\alpha} - \frac{1}{2} \boldsymbol{\alpha}^T \mathbf{Y} \mathbf{K}(\boldsymbol{\eta}) \mathbf{Y} \boldsymbol{\alpha}$ is the SVM dual value. The gradient with respect to the kernel parameters, evaluated at the dual optimum $\boldsymbol{\alpha}^*$:
$$\frac{\partial}{\partial \eta_m} = \frac{\partial r}{\partial \eta_m} - \frac{1}{2} \boldsymbol{\alpha}^{*T} \mathbf{Y} \frac{\partial \mathbf{K}(\boldsymbol{\eta})}{\partial \eta_m} \mathbf{Y} \boldsymbol{\alpha}^{*}$$

Coordinate descent algorithm:
1. Fix the kernel parameters $\boldsymbol{\eta}$ and find $\boldsymbol{\alpha}$ (a standard SVM solve).
2. Fix $\boldsymbol{\alpha}$ and update $\boldsymbol{\eta}$ by gradient descent.
Structural Risk: Another View

Linear MKL:
$$K_{\boldsymbol{\eta}}(x_i, x_j) = \sum_{m=1}^{P} \eta_m K_m(x_i^m, x_j^m), \qquad \eta_m \ge 0$$

This corresponds to concatenating the scaled feature maps:
$$K_{\boldsymbol{\eta}}(x_i, x_j) = \phi_{\boldsymbol{\eta}}(x_i) \cdot \phi_{\boldsymbol{\eta}}(x_j), \qquad \phi_{\boldsymbol{\eta}}(x) = \begin{pmatrix} \sqrt{\eta_1}\, \phi_1(x^1) \\ \sqrt{\eta_2}\, \phi_2(x^2) \\ \vdots \\ \sqrt{\eta_P}\, \phi_P(x^P) \end{pmatrix}$$
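The equivalence between the weighted kernel sum and the concatenated $\sqrt{\eta}$-scaled feature maps can be checked numerically (plain Python; the toy feature vectors are illustrative):

```python
import math

def concat_features(phis, etas):
    """phi_eta(x): concatenation of sqrt(eta_m) * phi_m(x)."""
    out = []
    for eta, phi in zip(etas, phis):
        out.extend(math.sqrt(eta) * v for v in phi)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

phis_x = [[1.0, 2.0], [3.0]]      # phi_1(x), phi_2(x)
phis_y = [[0.0, 1.0], [2.0]]      # phi_1(y), phi_2(y)
etas = [0.5, 2.0]

lhs = dot(concat_features(phis_x, etas), concat_features(phis_y, etas))
rhs = sum(eta * dot(px, py) for eta, px, py in zip(etas, phis_x, phis_y))
# lhs == rhs == 0.5 * 2 + 2.0 * 6 = 13.0
```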
Structural Risk: Another View

$$f_{w,b,\boldsymbol{\eta}}(x) = w \cdot \phi_{\boldsymbol{\eta}}(x) + b = [w_1, w_2, \ldots, w_P] \cdot \begin{pmatrix} \sqrt{\eta_1}\, \phi_1(x^1) \\ \vdots \\ \sqrt{\eta_P}\, \phi_P(x^P) \end{pmatrix} + b = \sum_{m=1}^{P} \sqrt{\eta_m}\; w_m \cdot \phi_m(x^m) + b$$

Writing $d_m = \sqrt{\eta_m}$:
$$f_{w,b,\mathbf{d}}(x) = \sum_{m=1}^{P} d_m\, w_m \cdot \phi_m(x^m) + b$$
Structural Risk: Another View

$$\min_{w,b,\mathbf{d},\boldsymbol{\xi}} \frac{1}{2} \sum_{m=1}^{P} \|w_m\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i\!\left( \sum_{m=1}^{P} d_m\, w_m \cdot \phi_m(x_i^m) + b \right) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall i$$

Change of variables $v_m := d_m w_m$:
$$\min_{\mathbf{v},b,\mathbf{d},\boldsymbol{\xi}} \frac{1}{2} \sum_{m=1}^{P} \frac{\|v_m\|^2}{d_m^2} + C \sum_i \xi_i \quad \text{s.t.} \quad y_i\!\left( \sum_{m=1}^{P} v_m \cdot \phi_m(x_i^m) + b \right) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall i$$
Structural Risk Optimization

SimpleMKL (Rakotomamonjy et al. 2008). Reparameterizing with kernel weights $d_m$, so that $K = \sum_m d_m K_m$:

$$J(\mathbf{d}) = \min_{\mathbf{v},b,\boldsymbol{\xi}} \frac{1}{2} \sum_{m=1}^{P} \frac{\|v_m\|^2}{d_m} + C \sum_i \xi_i \quad \text{s.t.} \quad y_i\!\left( \sum_{m=1}^{P} v_m \cdot \phi_m(x_i^m) + b \right) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall i$$

$$\min_{\mathbf{d}} J(\mathbf{d}) \quad \text{such that} \quad \sum_{m=1}^{P} d_m = 1, \ \ d_m \ge 0$$
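The outer loop over the kernel weights $\mathbf{d}$ alternates an SVM solve with a gradient step projected back onto the simplex. A highly simplified sketch in plain Python: it uses the gradient $\partial J / \partial d_m = -\frac{1}{2} \boldsymbol{\alpha}^T \mathbf{Y} \mathbf{K}_m \mathbf{Y} \boldsymbol{\alpha}$ with $\boldsymbol{\alpha}$ held fixed, and a clip-and-renormalize step instead of the paper's reduced-gradient update with line search, so it is an illustration rather than the exact SimpleMKL algorithm:

```python
def weight_gradient(alpha, y, grams):
    """dJ/dd_m = -1/2 * sum_ij alpha_i alpha_j y_i y_j K_m[i][j] (alpha fixed)."""
    n = len(y)
    return [-0.5 * sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
                       for i in range(n) for j in range(n))
            for K in grams]

def simplex_step(d, grads, lr):
    """Gradient step, clip at zero, renormalize so the weights sum to one."""
    d = [max(0.0, dm - lr * g) for dm, g in zip(d, grads)]
    total = sum(d)
    return [dm / total for dm in d]

# One toy step: two kernels, dual variables held fixed.
grams = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 1.0], [1.0, 2.0]]]
g = weight_gradient([1.0, 1.0], [1, -1], grams)   # both gradients equal -1.0 here
d = simplex_step([0.5, 0.5], g, 0.1)              # equal gradients leave d at [0.5, 0.5]
```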
Multi-Class SVM

Per-class scoring function:
$$f_{w,b}(x, y) = w_y \cdot \phi(x) + b_y, \quad \text{or with a joint feature map} \quad f_w(x, y) = w \cdot \Psi(x, y)$$

$$\min_{w,\boldsymbol{\xi}} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad f_w(x_i, y_i) - f_w(x_i, y) \ge \Delta(y_i, y) - \xi_i, \ \ \xi_i \ge 0, \ \forall i, y$$

with the 0-1 margin scaling
$$\Delta(y_i, y) = \begin{cases} 0 & y = y_i \\ 1 & y \ne y_i \end{cases}$$

Equivalent hinge form:
$$\xi_i = \max_{y} \left( \Delta(y_i, y) - \left( f_w(x_i, y_i) - f_w(x_i, y) \right) \right)$$
Latent SVM

$$F_w(x, h) = w \cdot \Psi(x, h), \qquad f_w(x) = \max_{h} w \cdot \Psi(x, h)$$

$$\min_{w,\boldsymbol{\xi}} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i f_w(x_i) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall i$$

[Figure: each example $x_1, x_2, \ldots, x_m$ has a latent variable $h_1, h_2, \ldots, h_m$; the score of $x_i$ is $\max_h w \cdot \Psi(x_i, h)$.]
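Latent-SVM scoring, $f_w(x) = \max_h w \cdot \Psi(x, h)$, can be sketched as a max over a finite latent set (plain Python; the joint feature $\Psi$ and the latent set here are toy choices, not from the slides):

```python
def latent_score(w, psi, x, latent_set):
    """f_w(x) = max over latent h of w . Psi(x, h)."""
    return max(sum(wj * pj for wj, pj in zip(w, psi(x, h)))
               for h in latent_set)

# Toy joint feature: Psi(x, h) = (x * h, h), with h ranging over {-1, 0, 1}.
psi = lambda x, h: [x * h, h]
score = latent_score([1.0, 1.0], psi, 2.0, [-1, 0, 1])   # best h = 1 gives 2 + 1 = 3.0
```

The max over $h$ makes the objective non-convex in general, which is why latent SVMs are trained by alternating inference of $h$ with convex parameter updates.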
Multi-Class Latent SVM

$$F_w(x, h, y) = w \cdot \Psi(x, h, y), \qquad f_w(x, y) = \max_{h} w \cdot \Psi(x, h, y)$$

$$\min_{w,\boldsymbol{\xi}} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad f_w(x_i, y_i) - f_w(x_i, y) \ge \Delta(y_i, y) - \xi_i, \ \ \xi_i \ge 0, \ \forall i, y$$

[Figure: examples $x_1, \ldots, x_m$ with latent variables $h_1, \ldots, h_m$ and label $y$; the score of $x_i$ is $\max_h w \cdot \Psi(x_i, h, y)$.]
Latent Kernelized Structural SVM

Wu and Jia (2012):
$$F_w(x, h, y) = w \cdot \Psi(x, h, y), \qquad f_w(x, y) = \max_{h} w \cdot \Psi(x, h, y)$$

$$\min_{w,\boldsymbol{\xi}} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$$

$$\xi_i = \max_{y \ne y_i} \left( \Delta(y_i, y) - \left( f_w(x_i, y_i) - f_w(x_i, y) \right) \right) = \max\!\left( 0,\ 1 + \max_{y \ne y_i,\, h} F_w(x_i, h, y) - \max_{h} F_w(x_i, h, y_i) \right)$$
Latent Kernelized Structural SVM

$$\min_{w,\boldsymbol{\xi}} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i, \qquad \xi_i = \max\!\left( 0,\ 1 + \max_{y \ne y_i,\, h} F_w(x_i, h, y) - \max_{h} F_w(x_i, h, y_i) \right)$$

Find the dual. With dual variables $\alpha_{iu}$, $u \in S$ (the set of candidate outputs), the scoring function becomes a kernel expansion:
$$F_w(x, v) = \sum_i \sum_{u \in S} \alpha_{iu}\, K(x_i, u, x, v)$$
Latent Kernelized Structural SVM

Inference:
$$f_w(x) = \max_{v \in S} F_w(x, v) = \max_{v \in S} \sum_i \sum_{u \in S} \alpha_{iu}\, K(x_i, u, x, v)$$

NO EFFICIENT EXACT SOLUTION:
$$\max_{h_i, h_j} K(x_i, h_i, x_j, h_j) \ ?$$
Latent MKL

Vahdat et al. (2013): latent version of SimpleMKL.

$$f_w(x) = \max_{h} \sum_{m=1}^{P} d_m\, w_m \cdot \phi_m(x, h)$$

$$\min_{\mathbf{d},\mathbf{v},b,\boldsymbol{\xi}} \frac{1}{2} \sum_{m=1}^{P} \frac{\|v_m\|^2}{d_m} + C \sum_i \xi_i \quad \text{s.t.} \quad y_i\!\left( \sum_{m=1}^{P} v_m \cdot \phi_m(x_i, h_i) + b \right) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall i, \ \ d_m \ge 0$$

where $h_i = h_i^*$ (the inferred latent variable) for positive samples ($y_i = 1$), and the constraint must hold for all $h$ when $y_i = -1$.

Coordinate descent learning algorithm:
1. Perform inference for positive samples.
2. Find the dual and solve the optimization problem as in SimpleMKL.
Some other works

- Hierarchical MKL (Bach 2008)
- Latent Kernel SVM (Yang et al. 2012)
- Deep MKL (Strobl and Visweswaran 2013)
References

- Gönen, M., & Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12, 2211-2268.
- Rakotomamonjy, A., Bach, F., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9, 2491-2521.
- Varma, M., & Babu, B. R. (2009). More generality in efficient multiple kernel learning. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1065-1072).
- Cortes, C., Mohri, M., & Rostamizadeh, A. (2010). Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 239-246).
- Lanckriet, G. R., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27-72.
- Wu, X., & Jia, Y. (2012). View-invariant action recognition using latent kernelized structural SVM. In Computer Vision - ECCV 2012 (pp. 411-424). Springer Berlin Heidelberg.
- Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008) (pp. 1-8).
- Yang, W., Wang, Y., Vahdat, A., & Mori, G. (2012). Kernel latent SVM for visual recognition. In Advances in Neural Information Processing Systems (pp. 818-826).
- Vahdat, A., Cannons, K., Mori, G., Oh, S., & Kim, I. (2013). Compositional models for video event detection: A multiple kernel learning latent variable approach. In IEEE International Conference on Computer Vision (ICCV).
- Cortes, C., Mohri, M., & Rostamizadeh, A. (2011). ICML 2011 Tutorial: Learning Kernels.