Kernel Choice and Classifiability for RKHS
Embeddings of Probability Distributions
Bharath K. Sriperumbudur⋆, Kenji Fukumizu†, Arthur Gretton‡,×, Gert R. G. Lanckriet⋆ and Bernhard Schölkopf×
⋆UC San Diego †The Institute of Statistical Mathematics ‡CMU ×MPI for Biological Cybernetics
NIPS 2009
RKHS Embeddings of Probability Measures
◮ Input space : X
◮ Feature space : H
◮ Feature map : Φ
Φ : X → H, x ↦ Φ(x).
Extension to probability measures:
P ↦ Φ(P)
Distance between P and Q:
γ(P, Q) = ‖Φ(P) − Φ(Q)‖H.
Applications
Two-sample problem:
◮ Given random samples {X1, . . . , Xm} and {Y1, . . . , Yn} drawn i.i.d. from P and Q, respectively.
◮ Determine: are P and Q different?
◮ γ(P, Q) : distance metric between P and Q.
H0 : P = Q  ≡  H0 : γ(P, Q) = 0
H1 : P ≠ Q  ≡  H1 : γ(P, Q) > 0
◮ Test: Say H0 if γ(P, Q) < ε. Otherwise say H1.
Applications
◮ Hypothesis testing
◮ Testing for independence and conditional independence
◮ Goodness of fit test
◮ Density estimation : quality of the estimate, convergence results.
◮ Central limit theorems
◮ Information theory
Popular examples:
◮ Kullback-Leibler divergence
◮ Total-variation distance (metric)
◮ Hellinger distance
◮ χ2-distance
The above examples are special instances of Csiszár's φ-divergence.
Integral Probability Metrics
◮ The integral probability metric [Müller, 1997] between P and Q is defined as
γF(P, Q) = sup_{f∈F} |EPf − EQf|.
◮ Many popular probability metrics can be obtained by appropriatelychoosing F.
◮ Total variation distance : F = {f : ‖f‖∞ ≤ 1}.
◮ Wasserstein distance : F = {f : ‖f‖L ≤ 1}, where ‖f‖L is the Lipschitz seminorm.
◮ Dudley metric : F = {f : ‖f‖L + ‖f‖∞ ≤ 1}.
◮ Well-studied in statistics and probability theory.
F is a Reproducing Kernel Hilbert Space
◮ H : reproducing kernel Hilbert space (RKHS).
◮ k : measurable, bounded, real-valued reproducing kernel.
◮ F : the unit ball in H, i.e., F = {f : ‖f‖H ≤ 1}.
Maximum mean discrepancy (MMD): [Gretton et al., 2007]
γk(P, Q) := γF(P, Q) = ‖EPk − EQk‖H,
where ‖·‖H denotes the RKHS norm.
RKHS embedding of probability measures:
P ↦ EPk =: Φ(P).
Advantages
◮ γk is easy to compute, unlike γF for the other choices of F.
◮ k is measurable and bounded: γk(Pm, Qn) is a √(mn/(m+n))-consistent estimator of γk(P, Q) [Gretton et al., 2007].
◮ k is translation-invariant on Rd: the rate is independent of d.
◮ Easy to handle structured domains like graphs and strings.
Characteristic Kernels
When is γk a metric?
γk(P, Q) = 0 ⇔ EPk = EQk ⇔ P = Q.
Define: k is characteristic if
EPk = EQk ⇔ P = Q.
◮ Not all kernels are characteristic, e.g., the linear kernel k(x, y) = xᵀy, for which
γk(P, Q) = ‖µP − µQ‖2,
so any two distributions with the same mean are indistinguishable (see the sketch below).
◮ When is k characteristic? [Gretton et al., 2007, Sriperumbudur et al., 2008, Fukumizu et al., 2008, Fukumizu et al., 2009].
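As a concrete illustration (my own example, not from the slides), the sketch below contrasts a linear kernel with a Gaussian kernel on two distributions that share a mean but differ in variance: the linear-kernel MMD is essentially zero, while the Gaussian-kernel MMD is clearly positive.

```python
# Minimal sketch (illustrative, not from the slides): P = N(0, 1) and
# Q = N(0, 4) have the same mean, so the linear kernel cannot separate
# them, while the (characteristic) Gaussian kernel can.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 500)   # sample from P
Y = rng.normal(0.0, 2.0, 500)   # sample from Q: same mean, std 2

def mmd(X, Y, k):
    # biased empirical MMD: sqrt(mean k(X,X) + mean k(Y,Y) - 2 mean k(X,Y))
    v = k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
    return np.sqrt(max(v, 0.0))

linear = lambda a, b: np.outer(a, b)
gauss = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / 2.0)

print("linear  :", mmd(X, Y, linear))  # ~ |mean(X) - mean(Y)|, near 0
print("gaussian:", mmd(X, Y, gauss))   # clearly positive
```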
Outline
◮ Characterization of characteristic kernels (visit poster!)
◮ Choice of characteristic kernels
◮ Characteristic kernels and binary classification
Choice of Characteristic Kernels
Examples: Gaussian, Laplacian, B_{2l+1}-splines, Poisson kernel, etc.
Suppose k is a Gaussian kernel, kσ(x, y) = exp(−‖x − y‖₂² / (2σ²)).
◮ γk is a function of σ.
◮ So γk is a family of metrics. Which one do we use in practice?
◮ Note that γk → 0 as σ → 0 or σ → ∞.
◮ Define γ(P, Q) = sup_{σ∈R+} γkσ(P, Q).
Classes of Characteristic Kernels
Generalized MMD: γ(P, Q) := sup_{k∈K} γk(P, Q).
Examples for K:
◮ Kg := {e^{−σ‖x−y‖₂²}, x, y ∈ Rd : σ ∈ R+}.
◮ Krbf := {∫₀^∞ e^{−λ‖x−y‖₂²} dµσ(λ), x, y ∈ Rd, µσ ∈ M+ : σ ∈ Σ ⊂ Rd}, where M+ is the set of all finite nonnegative Borel measures µσ on R+ that are not concentrated at zero.
◮ Klin := {kλ = ∑_{i=1}^l λi ki | kλ is pd, ∑_{i=1}^l λi = 1}.
◮ Kcon := {kλ = ∑_{i=1}^l λi ki | λi ≥ 0, ∑_{i=1}^l λi = 1}.
Computation
◮ γ(P, Q) = sup_{k∈K} [∫∫ k(x, y) dP(x) dP(y) + ∫∫ k(x, y) dQ(x) dQ(y) − 2 ∫∫ k(x, y) dP(x) dQ(y)]^{1/2}.
◮ Suppose {Xi}_{i=1}^m ∼ P i.i.d. and {Yi}_{i=1}^n ∼ Q i.i.d.
◮ Let Pm := (1/m) ∑_{i=1}^m δ_{Xi} and Qn := (1/n) ∑_{i=1}^n δ_{Yi}, where δx represents the Dirac measure at x.
◮ The empirical estimate of γ(P, Q):
γ(Pm, Qn) = sup_{k∈K} [(1/m²) ∑_{i,j=1}^m k(Xi, Xj) + (1/n²) ∑_{i,j=1}^n k(Yi, Yj) − (2/mn) ∑_{i=1}^m ∑_{j=1}^n k(Xi, Yj)]^{1/2}.
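For K = Kg, a minimal sketch of this estimator is given below, approximating the supremum over σ ∈ R+ by a finite grid (the grid is my simplification; the slides take the exact supremum).

```python
# Sketch of the empirical generalized MMD for K = Kg, i.e. kernels
# e^{-sigma * ||x - y||^2}, with the sup over sigma approximated by a grid.
import numpy as np

def empirical_gamma(X, Y, sigmas):
    best = 0.0
    for s in sigmas:
        k = lambda a, b: np.exp(-s * (a[:, None] - b[None, :]) ** 2)
        v = k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
        best = max(best, np.sqrt(max(v, 0.0)))
    return best

rng = np.random.default_rng(0)
X, Y = rng.normal(0.0, 1.0, 200), rng.normal(0.5, 1.0, 200)
print(empirical_gamma(X, Y, np.logspace(-2, 2, 20)))
```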
Question
◮ When is γ a metric?
◮ Answer: if at least one k ∈ K is characteristic, then γ is a metric.
Question
◮ For a fixed k that is measurable and bounded, [Gretton et al., 2007] have shown that
|γk(Pm, Qn) − γk(P, Q)| = O(√((m + n)/(mn))).
◮ When does γ(Pm, Qn) → γ(P, Q) a.s.? What is the rate of convergence?
Statistical Consistency: Result
Theorem
For any K with ν := sup_{k∈K, x∈M} k(x, x) < ∞, the following holds with probability at least 1 − δ:

|γ(Pm, Qn) − γ(P, Q)| ≤ √(8Um(K)/m) + √(8Un(K)/n) + (√(8ν) + √(36ν log(4/δ))) √((m + n)/(mn)),

where

Um(K) := E[ sup_{k∈K} |(1/m) ∑_{i<j} ρi ρj k(Xi, Xj)|  |  X1, . . . , Xm ]

is the Rademacher chaos complexity and the ρi are Rademacher random variables.
Statistical Consistency: Result
Proposition
Suppose K is a VC-subgraph class. Then
|γ(Pm, Qn) − γ(P, Q)| = O(√((m + n)/(mn))).
In addition, γ(Pm, Qn) → γ(P, Q) a.s.
Examples: [Ying and Campbell, 2009, Srebro and Ben-David, 2006]
◮ Kg , Krbf , Klin, Kcon, etc.
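A quick simulation (my own check, not from the slides) is consistent with this rate: under P = Q we have γ(P, Q) = 0, so for m = n the estimate should shrink roughly like √((m + n)/(mn)) = √(2/m).

```python
# Simulation sketch: with P = Q = N(0, 1), gamma(P, Q) = 0, so the error
# |gamma(Pm, Qm) - 0| should decay roughly like sqrt(2/m) for m = n.
import numpy as np

rng = np.random.default_rng(1)
sigmas = np.logspace(-1, 1, 10)

def sup_mmd(X, Y):
    vals = []
    for s in sigmas:
        k = lambda a, b: np.exp(-s * (a[:, None] - b[None, :]) ** 2)
        vals.append(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())
    return np.sqrt(max(max(vals), 0.0))

for m in (50, 200, 800):
    est = np.mean([sup_mmd(rng.normal(size=m), rng.normal(size=m))
                   for _ in range(20)])
    print(m, round(est, 4), "predicted rate:", round(np.sqrt(2.0 / m), 4))
```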
The Two-Sample Problem
◮ Given : {X1, . . . , Xm} ∼ P i.i.d. and {Y1, . . . , Yn} ∼ Q i.i.d.
◮ Determine: are P and Q different?
◮ γ(P, Q) : distance metric between P and Q.
H0 : P = Q  ≡  H0 : γ(P, Q) = 0
H1 : P ≠ Q  ≡  H1 : γ(P, Q) > 0
◮ Test: Say H0 if γ(P, Q) < ε. Otherwise say H1.
◮ Good Test: Low Type-II error for user-defined Type-I error.
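The threshold ε has to be calibrated to the desired Type-I error. One standard recipe for doing so (my choice here; the slides do not specify one) is a permutation test, sketched below.

```python
# Permutation-test sketch (a standard calibration recipe; the slides do not
# say how epsilon is chosen): re-split the pooled sample at random to
# simulate H0, and set epsilon to the (1 - alpha) quantile of the null.
import numpy as np

def mmd(X, Y, k):
    v = k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
    return np.sqrt(max(v, 0.0))

def two_sample_test(X, Y, k, alpha=0.05, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    stat, m = mmd(X, Y, k), len(X)
    pooled = np.concatenate([X, Y])
    null = [mmd(p[:m], p[m:], k)
            for p in (rng.permutation(pooled) for _ in range(n_perm))]
    eps = np.quantile(null, 1.0 - alpha)
    return stat, eps, stat >= eps        # say H1 iff stat >= epsilon

rng = np.random.default_rng(0)
k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2)
print(two_sample_test(rng.normal(0, 1, 100), rng.normal(0.7, 1, 100), k))
```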
Experiments
◮ q = N(0, σq²).
◮ p(x) = q(x)(1 + sin νx).
[Figure: the density q(x) (ν = 0) and the perturbed densities p(x) for ν = 2 and ν = 7.5.]
◮ k(x, y) = exp(−(x − y)²/σ).
◮ Test statistics: γ(Pm, Qm) and γk(Pm, Qm) for various σ.
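To reproduce this setup one needs draws from p. Since 0 ≤ 1 + sin(νx) ≤ 2, we have p(x) ≤ 2q(x), so rejection sampling with proposal q works; the sketch below assumes σq = 1.

```python
# Rejection-sampling sketch for p(x) = q(x)(1 + sin(nu * x)) with proposal
# q = N(0, sigma_q^2): since p(x) <= 2 q(x), accept a draw x ~ q with
# probability (1 + sin(nu * x)) / 2. sigma_q = 1 is my assumption.
import numpy as np

def sample_p(n, nu, sigma_q=1.0, seed=0):
    rng = np.random.default_rng(seed)
    out = np.empty(0)
    while out.size < n:
        x = rng.normal(0.0, sigma_q, 2 * n)
        u = rng.uniform(size=2 * n)
        out = np.concatenate([out, x[u <= (1.0 + np.sin(nu * x)) / 2.0]])
    return out[:n]

X = sample_p(500, nu=7.5)                            # sample from p
Y = np.random.default_rng(1).normal(0.0, 1.0, 500)   # sample from q
```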
Experiments
[Figure: Type-I and Type-II errors (in %) of the test based on γ(Pm, Qm), plotted against ν ∈ {0.5, 0.75, 1, 1.25, 1.5}.]
Experiments
[Figure: Type-I error (left) and Type-II error (right), in %, of the test based on γk(Pm, Qm), plotted against log σ for ν ∈ {0.5, 0.75, 1.0, 1.25, 1.5}.]
Outline
◮ Characterization of characteristic kernels (visit poster!)
◮ Choice of characteristic kernels
◮ Characteristic kernels and binary classification
◮ The MMD between the class-conditional distributions P and Q is the negative of the Bayes risk associated with a Parzen window classifier (sketched below).
◮ Characteristic k is important.
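For intuition, a common form of the Parzen window rule (a sketch under my assumptions; the paper's exact setup, e.g. the class priors, may differ) labels a point by the sign of the empirical MMD witness: its mean similarity to the P-sample minus that to the Q-sample.

```python
# Parzen window classifier sketch (a common form; constants/priors in the
# paper's exact result may differ): predict the sign of the empirical MMD
# witness function, mean_i k(x, X_i) - mean_j k(x, Y_j).
import numpy as np

def parzen_classifier(Xp, Xq, k):
    def predict(x):
        return np.sign(k(x, Xp).mean(axis=1) - k(x, Xq).mean(axis=1))
    return predict

rng = np.random.default_rng(0)
k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2)
clf = parzen_classifier(rng.normal(-1, 1, 100), rng.normal(1, 1, 100), k)
print(clf(np.array([-2.0, 0.0, 2.0])))   # roughly [+1, ?, -1]
```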
γk and Support Vector Machine
◮ RKHS (H, k): k measurable and bounded.
◮ Let fsvm be the solution to the program
inf_{f∈H} ‖f‖H  s.t.  Yi f(Xi) ≥ 1, ∀ i.
If k is characteristic, then
1/‖fsvm‖H ≤ (1/2) γk(Pm, Qn).
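A rough numerical check of this inequality (my own sketch; scikit-learn's SVC has an intercept and a soft margin, so a large C only approximates the bias-free hard-margin program above):

```python
# Rough check (my sketch) that margin = 1/||f_svm||_H <= gamma_k(Pm,Qn)/2.
# Caveats: sklearn's SVC has an intercept and a soft margin; a large C
# only approximates the hard-margin, bias-free program on the slide.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
Xp = rng.normal(0.0, 1.0, (50, 2))
Xq = rng.normal(2.0, 1.0, (50, 2))
X = np.vstack([Xp, Xq])
y = np.r_[np.ones(50), -np.ones(50)]

g = 0.5                                   # kernel: exp(-g * ||x - y||^2)
k = lambda a, b: np.exp(-g * ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

svc = SVC(kernel="rbf", gamma=g, C=1e6).fit(X, y)
sv, coef = svc.support_vectors_, svc.dual_coef_.ravel()   # coef = alpha_i*y_i
f_norm = np.sqrt(coef @ k(sv, sv) @ coef)                 # ||f_svm||_H

mmd = np.sqrt(k(Xp, Xp).mean() + k(Xq, Xq).mean() - 2 * k(Xp, Xq).mean())
print("margin:", 1.0 / f_norm, " mmd/2:", mmd / 2.0)      # margin <= mmd/2
```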
Achievability of Bayes Risk
◮ G⋆ : set of all real-valued measurable functions on M.
◮ (H, k) : RKHS with measurable and bounded k.
◮ Achievability of Bayes risk :
inf_{g∈H} R(g) = inf_{g∈G⋆} R(g),  (⋆⋆)
where R(g) denotes the risk of the classifier g.
Under some technical conditions,
◮ (⋆⋆) ⇒ k is characteristic.
◮ Suppose 1 ∈ H (i.e., H contains the constant functions). Then k is characteristic ⇒ (⋆⋆).
Summary
◮ Characteristic kernel
◮ A class of kernels that characterize the probability measure associated with a random variable.
◮ MMD is a metric.
◮ How to choose characteristic kernels in practice?
◮ Generalized MMD.
◮ Performs better than MMD in a two-sample test.
◮ Characteristic kernels are important in binary classification.
◮ Parzen window classifier and hard-margin SVM.
◮ Achievability of Bayes risk.
Thank You
References
◮ Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA. MIT Press.
◮ Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 473–480.
◮ Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two sample problem. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press.
◮ Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443.
◮ Srebro, N. and Ben-David, S. (2006). Learning bounds for support vector machines with learned kernels. In Lugosi, G. and Simon, H. U., editors, Proc. of the 19th Annual Conference on Learning Theory, pages 169–183.
◮ Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122.
◮ Ying, Y. and Campbell, C. (2009). Generalization bounds for learning the kernel. In Proc. of the 22nd Annual Conference on Learning Theory.