Page 1

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Bharath K. Sriperumbudur⋆, Kenji Fukumizu†, Arthur Gretton‡,×, Gert R. G. Lanckriet⋆ and Bernhard Schölkopf×

⋆UC San Diego   †The Institute of Statistical Mathematics   ‡CMU   ×MPI for Biological Cybernetics

NIPS 2009

Page 2

RKHS Embeddings of Probability Measures

◮ Input space : X

◮ Feature space : H

◮ Feature map : Φ

Φ : X → H, x ↦ Φ(x).

Extension to probability measures:

P ↦ Φ(P)

Distance between P and Q:

γ(P, Q) = ‖Φ(P) − Φ(Q)‖H.

Page 3

Applications

Two-sample problem:

◮ Given random samples {X1, . . . , Xm} and {Y1, . . . , Yn} drawn i.i.d. from P and Q, respectively.

◮ Determine: are P and Q different?

◮ γ(P, Q) : distance metric between P and Q.

H0 : P = Q  ≡  H0 : γ(P, Q) = 0

H1 : P ≠ Q  ≡  H1 : γ(P, Q) > 0

◮ Test: Say H0 if γ(P, Q) < ε. Otherwise say H1.

Page 4

Applications

◮ Hypothesis testing

◮ Testing for independence and conditional independence

◮ Goodness of fit test

◮ Density estimation : quality of the estimate, convergence results.

◮ Central limit theorems

◮ Information theory

Popular examples:

◮ Kullback-Leibler divergence

◮ Total-variation distance (metric)

◮ Hellinger distance

◮ χ2-distance

The above examples are special instances of Csiszár's φ-divergence.
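As a concrete illustration, here is a minimal sketch computing these four divergences for two discrete distributions (the vectors p and q are made up for illustration, not from the talk):

```python
import numpy as np

# Two illustrative discrete distributions on a common support.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log(p / q))                                     # Kullback-Leibler
tv = 0.5 * np.sum(np.abs(p - q))                                   # total variation
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))  # Hellinger
chi2 = np.sum((p - q) ** 2 / q)                                    # chi^2-distance

print(kl, tv, hellinger, chi2)
```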

Page 5

Integral Probability Metrics

◮ The integral probability metric [Müller, 1997] between P and Q is defined as

γF(P, Q) = sup_{f∈F} |EP f − EQ f|.

◮ Many popular probability metrics can be obtained by appropriatelychoosing F.

◮ Total variation distance : F = {f : ‖f‖∞ ≤ 1}.

◮ Wasserstein distance : F = {f : ‖f‖L ≤ 1}.

◮ Dudley metric : F = {f : ‖f‖L + ‖f‖∞ ≤ 1}.

Here ‖f‖L denotes the Lipschitz seminorm of f.

◮ IPMs are well-studied in statistics and probability theory.

Page 6

F is a Reproducing Kernel Hilbert Space

◮ H : reproducing kernel Hilbert space (RKHS).

◮ k : measurable, bounded, real-valued reproducing kernel.

◮ F : the unit ball in H, i.e., F = {f : ‖f‖H ≤ 1}.

Maximum mean discrepancy (MMD): [Gretton et al., 2007]

γk(P, Q) := γF(P, Q) = ‖EP k − EQ k‖H,

where ‖·‖H represents the RKHS norm.

RKHS embedding of probability measures:

P ↦ EP k =: Φ(P).
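The slides leave implicit why γk is computable; expanding the squared RKHS norm with the reproducing property ⟨Φ(P), f⟩H = EP f gives the standard identity (the population version of the estimator on Page 12):

```latex
\gamma_k^2(P, Q) = \| \mathbb{E}_P k - \mathbb{E}_Q k \|_{\mathcal{H}}^2
  = \mathbb{E}_{X, X' \sim P}\, k(X, X')
  + \mathbb{E}_{Y, Y' \sim Q}\, k(Y, Y')
  - 2\, \mathbb{E}_{X \sim P,\, Y \sim Q}\, k(X, Y).
```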

Page 7

Advantages

◮ γk is easy to compute, unlike γF for most other choices of F.

◮ k is measurable and bounded: γk(Pm, Qn) is a √(mn/(m+n))-consistent estimator of γk(P, Q) [Gretton et al., 2007].

◮ k is translation-invariant on Rd : the rate is independent of d .

◮ Easy to handle structured domains like graphs and strings.

Page 8

Characteristic Kernels

When is γk a metric?

γk(P, Q) = 0 ⇔ EPk = EQk ⇔ P = Q.

Define: k is characteristic if

EPk = EQk ⇔ P = Q.

◮ Not all kernels are characteristic, e.g., the linear kernel k(x, y) = xᵀy, for which

γk(P, Q) = ‖μP − μQ‖2,

so any two distributions with the same mean are mapped to the same embedding (see the sketch below).

◮ When is k characteristic? [Gretton et al., 2007, Sriperumbudur et al., 2008, Fukumizu et al., 2008, Fukumizu et al., 2009].
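A small numeric check of the linear-kernel failure mode mentioned above (a sketch; sample sizes and kernels are illustrative): N(0, 1) and N(0, 4) have equal means, so their linear-kernel MMD, which equals ‖μP − μQ‖2, is zero up to sampling noise, while a Gaussian kernel, which is characteristic, clearly separates them.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000
X = rng.normal(0.0, 1.0, size=(m, 1))  # P = N(0, 1)
Y = rng.normal(0.0, 2.0, size=(m, 1))  # Q = N(0, 4): same mean, larger variance

def mmd2(X, Y, kernel):
    """Empirical squared MMD (V-statistic, diagonal terms included)."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

linear = lambda A, B: A @ B.T
gauss = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / 2)

print(mmd2(X, Y, linear))  # ~ 0: the linear kernel cannot distinguish P from Q
print(mmd2(X, Y, gauss))   # clearly > 0: the Gaussian kernel can
```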

Page 9

Outline

◮ Characterization of characteristic kernels (visit poster!)

◮ Choice of characteristic kernels

◮ Characteristic kernels and binary classification

Page 10

Choice of Characteristic Kernels

Examples: Gaussian, Laplacian, B_{2l+1}-splines, Poisson kernel, etc.

Suppose k is a Gaussian kernel, kσ(x, y) = exp(−‖x − y‖₂²/(2σ²)).

◮ γkσ(P, Q) is a function of σ.

◮ So {γkσ : σ ∈ R+} is a family of metrics. Which one do we use in practice?

◮ Note that γkσ(P, Q) → 0 as σ → 0 or σ → ∞, so an ill-chosen bandwidth can make any two distributions look close.

◮ Define

γ(P, Q) = sup_{σ∈R+} γkσ(P, Q).

Page 11

Classes of Characteristic Kernels

Generalized MMD:

γ(P, Q) := sup_{k∈K} γk(P, Q).

Examples for K:

◮ Kg := {e^(−σ‖x−y‖₂²) : x, y ∈ Rd, σ ∈ R+}.

◮ Krbf := {∫₀^∞ e^(−λ‖x−y‖₂²) dμσ(λ) : x, y ∈ Rd, μσ ∈ M+, σ ∈ Σ ⊂ Rd}, where M+ is the set of all finite nonnegative Borel measures μσ on R+ that are not concentrated at zero.

◮ Klin := {kλ = Σ_{i=1}^l λi ki : kλ is pd, Σ_{i=1}^l λi = 1}.

◮ Kcon := {kλ = Σ_{i=1}^l λi ki : λi ≥ 0, Σ_{i=1}^l λi = 1}.

Page 12

Computation

γ(P, Q) = sup_{k∈K} [ ∫∫ k(x, y) dP(x) dP(y) + ∫∫ k(x, y) dQ(x) dQ(y) − 2 ∫∫ k(x, y) dP(x) dQ(y) ]^{1/2}.

◮ Suppose {Xi}_{i=1}^m are drawn i.i.d. from P and {Yi}_{i=1}^n are drawn i.i.d. from Q.

◮ Let Pm := (1/m) Σ_{i=1}^m δ_{Xi} and Qn := (1/n) Σ_{i=1}^n δ_{Yi}, where δx represents the Dirac measure at x.

◮ The empirical estimate of γ(P, Q):

γ(Pm, Qn) = sup_{k∈K} [ (1/m²) Σ_{i,j=1}^m k(Xi, Xj) + (1/n²) Σ_{i,j=1}^n k(Yi, Yj) − (2/(mn)) Σ_{i=1}^m Σ_{j=1}^n k(Xi, Yj) ]^{1/2}.
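A minimal sketch of this estimator in Python, with K = Kg approximated by a finite grid of bandwidths (the grid, sample data, and function name are illustrative; the slides take the supremum over all σ ∈ R+):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gamma_hat(X, Y, sigmas):
    """Empirical generalized MMD: sup over the Gaussian family
    k_sigma(x, y) = exp(-sigma * ||x - y||_2^2), sigma on a finite grid."""
    m, n = len(X), len(Y)
    Dxx = cdist(X, X, "sqeuclidean")
    Dyy = cdist(Y, Y, "sqeuclidean")
    Dxy = cdist(X, Y, "sqeuclidean")
    best = 0.0
    for s in sigmas:
        # Squared MMD for k_sigma: exactly the bracketed expression above.
        val = (np.exp(-s * Dxx).sum() / m**2
               + np.exp(-s * Dyy).sum() / n**2
               - 2 * np.exp(-s * Dxy).sum() / (m * n))
        best = max(best, val)
    return np.sqrt(max(best, 0.0))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 1))  # samples from P
Y = rng.normal(0.5, 1.0, size=(500, 1))  # samples from Q
print(gamma_hat(X, Y, sigmas=np.logspace(-3, 3, 25)))
```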

Page 13

Question

◮ When is γ a metric?

◮ Answer: If at least one k ∈ K is characteristic, then γ is a metric.

Page 14

Question

◮ For a fixed k that is measurable and bounded, [Gretton et al., 2007] have shown that

|γk(Pm, Qn) − γk(P, Q)| = O(√((m + n)/(mn))).

◮ When does γ(Pm, Qn) → γ(P, Q) almost surely? What is the rate of convergence?

Page 15

Statistical Consistency: Result

Theorem
For any K with ν := sup_{k∈K, x∈M} k(x, x) < ∞, the following holds with probability at least 1 − δ:

|γ(Pm, Qn) − γ(P, Q)| ≤ √(8Um(K)/m) + √(8Un(K)/n) + (√(8ν) + √(36ν log(4/δ))) √((m + n)/(mn)),

where

Um(K) := E[ sup_{k∈K} |(1/m) Σ_{i<j} ρi ρj k(Xi, Xj)| | X1, . . . , Xm ]

is the Rademacher chaos complexity and the ρi are i.i.d. Rademacher random variables.

Page 16

Statistical Consistency: Result

Proposition

Suppose K is a VC-subgraph class. Then

|γ(Pm, Qn) − γ(P, Q)| = O(√((m + n)/(mn))).

In addition, γ(Pm, Qn) → γ(P, Q) almost surely.

Examples: [Ying and Campbell, 2009, Srebro and Ben-David, 2006]

◮ Kg, Krbf, Klin, Kcon, etc.

Page 17

The Two-Sample Problem

◮ Given: {X1, . . . , Xm} drawn i.i.d. from P and {Y1, . . . , Yn} drawn i.i.d. from Q.

◮ Determine: are P and Q different?

◮ γ(P, Q) : distance metric between P and Q.

H0 : P = Q  ≡  H0 : γ(P, Q) = 0

H1 : P ≠ Q  ≡  H1 : γ(P, Q) > 0

◮ Test: Say H0 if γ(P, Q) < ε. Otherwise say H1.

◮ Good Test: Low Type-II error for user-defined Type-I error.
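One standard way to choose ε for a user-specified Type-I error α is a permutation test: under H0 the pooled sample is exchangeable, so randomly relabeling it and recomputing the statistic simulates the null distribution. A sketch (the statistic argument could be, e.g., the gamma_hat sketched above; alpha and n_perm are illustrative defaults):

```python
import numpy as np

def two_sample_test(X, Y, statistic, alpha=0.05, n_perm=200, seed=0):
    """Reject H0: P = Q when the observed statistic exceeds the
    empirical (1 - alpha)-quantile of its permutation null distribution."""
    rng = np.random.default_rng(seed)
    observed = statistic(X, Y)
    pooled = np.vstack([X, Y])
    m = len(X)
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))  # relabel: valid under H0
        null_stats.append(statistic(pooled[idx[:m]], pooled[idx[m:]]))
    epsilon = np.quantile(null_stats, 1 - alpha)
    return observed > epsilon, observed, epsilon

# Usage: reject, obs, eps = two_sample_test(X, Y, lambda A, B: gamma_hat(A, B, np.logspace(-3, 3, 25)))
```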

Page 18

Experiments

◮ q = N(0, σq²).

◮ p(x) = q(x)(1 + sin νx).

[Figure: the density q(x) (ν = 0) and the perturbed densities p(x) for ν = 2 and ν = 7.5.]

◮ k(x, y) = exp(−(x − y)²/σ).

◮ Test statistics: γ(Pm, Qm) and γk(Pm, Qm) for various σ.
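To generate data from p, note p(x) = q(x)(1 + sin νx) ≤ 2q(x), so rejection sampling with proposal q and acceptance probability (1 + sin νx)/2 works. A sketch of how such samples could be drawn (the function name and defaults are assumptions, not from the talk):

```python
import numpy as np

def sample_p(nu, sigma_q, size, seed=0):
    """Rejection sampler for p(x) = q(x) * (1 + sin(nu * x)), q = N(0, sigma_q^2)."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < size:
        x = rng.normal(0.0, sigma_q, size=size)  # proposals from q
        u = rng.uniform(size=size)
        # Accept with probability p(x) / (2 q(x)) = (1 + sin(nu * x)) / 2.
        out.extend(x[u < (1 + np.sin(nu * x)) / 2])
    return np.array(out[:size])
```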

Page 19

Experiments

[Figure: test based on γ(P, Q): Type-I and Type-II error (in %) vs. ν.]

Page 20

Experiments

[Figure: tests based on γk(P, Q): Type-I error (left) and Type-II error (right), in %, vs. log σ, for ν ∈ {0.5, 0.75, 1.0, 1.25, 1.5}.]

Page 21

Outline

◮ Characterization of characteristic kernels (visit poster!)

◮ Choice of characteristic kernels

◮ Characteristic kernels and binary classification

Page 22

γk and Parzen Window Classifier

Let

◮ RKHS (H, k): k measurable and bounded.

◮ Fk = {f : ‖f‖H ≤ 1}.

◮ P, Q : class-conditional distributions.

◮ RFk : Bayes risk of a classifier in Fk.

Then,

γk(P, Q) = −RFk.

◮ The MMD between the class-conditionals P and Q is the negative of the Bayes risk associated with a Parzen window classifier.

◮ Characteristic k is important.
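Concretely, the Parzen window classifier behind this identity labels a point x by the sign of the witness f(x) = (1/m) Σ_i k(x, Xi) − (1/n) Σ_j k(x, Yj), i.e., by which empirical mean embedding x is more similar to. A minimal sketch (the Gaussian kernel and bandwidth are illustrative choices):

```python
import numpy as np
from scipy.spatial.distance import cdist

def parzen_classify(x_test, X_pos, X_neg, sigma=1.0):
    """Sign of <Phi(Pm) - Phi(Qn), k(., x)>_H: average kernel similarity to the
    positive-class sample minus average similarity to the negative-class sample."""
    k_pos = np.exp(-cdist(x_test, X_pos, "sqeuclidean") / (2 * sigma**2)).mean(axis=1)
    k_neg = np.exp(-cdist(x_test, X_neg, "sqeuclidean") / (2 * sigma**2)).mean(axis=1)
    return np.sign(k_pos - k_neg)  # +1: assigned to P's class; -1: to Q's class
```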

Page 23

γk and Support Vector Machine

◮ RKHS (H, k): k measurable and bounded.

◮ Let fsvm be the solution to the program

inf_{f∈H} ‖f‖H  subject to  Yi f(Xi) ≥ 1, ∀ i.

If k is characteristic, then

1/‖fsvm‖H ≤ (1/2) γk(Pm, Qn).
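A numeric check of this bound (a sketch, not the paper's experiment: it approximates the hard-margin SVM with scikit-learn's SVC at large C on a precomputed Gaussian kernel; the data and bandwidth are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(-1.0, 1.0, size=(100, 2))  # class +1, samples from P
Y = rng.normal(+1.0, 1.0, size=(100, 2))  # class -1, samples from Q
Z = np.vstack([X, Y])
labels = np.r_[np.ones(100), -np.ones(100)]

K = np.exp(-cdist(Z, Z, "sqeuclidean") / 2)  # Gaussian kernel, sigma = 1

svc = SVC(kernel="precomputed", C=1e6).fit(K, labels)  # large C ~ hard margin
sv = svc.support_
# ||f_svm||_H^2 = sum_{i,j} (alpha_i y_i)(alpha_j y_j) k(x_i, x_j)
fnorm = np.sqrt(svc.dual_coef_ @ K[np.ix_(sv, sv)] @ svc.dual_coef_.T).item()

m = n = 100
mmd2 = K[:m, :m].mean() + K[m:, m:].mean() - 2 * K[:m, m:].mean()
gamma_k = np.sqrt(max(mmd2, 0.0))

print(1 / fnorm, "<=", gamma_k / 2)  # margin bounded by half the empirical MMD
```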

Page 24

Achievability of Bayes Risk

◮ G⋆ : set of all real-valued measurable functions on M.

◮ (H, k) : RKHS with measurable and bounded k.

◮ Achievability of Bayes risk :

inf_{g∈H} R(g) = inf_{g∈G⋆} R(g).  (⋆⋆)

Under some technical conditions,

◮ (⋆⋆) ⇒ k is characteristic.

◮ Suppose 1 ∈ H. k is characteristic ⇒ (⋆⋆).

Page 25

Summary

◮ Characteristic kernels:
  – the class of kernels that characterize the probability measure associated with a random variable;
  – for such kernels, MMD is a metric.

◮ How to choose characteristic kernels in practice?
  – Generalized MMD.
  – Performs better than MMD in a two-sample test.

◮ Characteristic kernels are important in binary classification:
  – Parzen window classifier and hard-margin SVM;
  – achievability of Bayes risk.

Page 26

Thank You

Page 27

References

◮ Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA. MIT Press.

◮ Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 473–480.

◮ Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two-sample problem. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press.

◮ Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443.

◮ Srebro, N. and Ben-David, S. (2006). Learning bounds for support vector machines with learned kernels. In Lugosi, G. and Simon, H. U., editors, Proc. of the 19th Annual Conference on Learning Theory, pages 169–183.

◮ Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122.

◮ Ying, Y. and Campbell, C. (2009). Generalization bounds for learning the kernel. In Proc. of the 22nd Annual Conference on Learning Theory.