Lecture 2: Mappings of Probabilities to RKHS and Applications
MLSS Tübingen, 2015
Arthur Gretton
Gatsby Unit, CSML, UCL
Outline
• Kernel metric on the space of probability measures
– Function revealing differences in distributions
– Distance between means in space of features (RKHS)
– Independence measure: features of joint minus product of marginals
• Characteristic kernels: feature space mappings of probabilities unique
• Two-sample, independence tests for (almost!) any data type
– distributions on strings, images, graphs, groups (rotation matrices), semigroups, . . .
• Advanced topics
– testing on big data, kernel choice
– Energy distance/distance covariance: special case of kernel statistic
Feature mean difference
• Simple example: 2 Gaussians with different means
• Answer: t-test
[Figure: two Gaussian densities P_X and Q_X with different means]
Feature mean difference
• Two Gaussians with same means, different variance
• Idea: look at the difference in means of features of the RVs
• In the Gaussian case: second-order features of the form ϕ(x) = x²
[Figure: two Gaussian densities P_X and Q_X with different variances, and the densities of the feature X² under P and Q]
Feature mean difference
• Gaussian and Laplace distributions
• Same mean and same variance
• Difference in means using higher-order features . . . RKHS
[Figure: Gaussian and Laplace densities P_X and Q_X]
Probabilities in feature space: the mean trick

The kernel trick
• Given x ∈ X for some set X, define the feature map ϕ_x ∈ F,
  ϕ_x = [. . . ϕ_i(x) . . .] ∈ ℓ²
• For positive definite k(x, x′),
  k(x, x′) = ⟨ϕ_x, ϕ_{x′}⟩_F
• The kernel trick: ∀f ∈ F,
  f(x) = ⟨f, ϕ_x⟩_F

The mean trick
• Given a Borel probability measure P on X, define the feature map µ_P ∈ F,
  µ_P = [. . . E_P[ϕ_i(x)] . . .]
• For positive definite k(x, x′),
  E_{P,Q} k(x, y) = ⟨µ_P, µ_Q⟩_F   for x ∼ P and y ∼ Q
• The mean trick (we call µ_P a mean/distribution embedding):
  E_P f(x) =: ⟨µ_P, f⟩_F
What does µ_P look like?

We plot the function µ_P.
• Mean embedding µ_P ∈ F:
  ⟨µ_P(·), f(·)⟩_F = E_P f(x)
• What does the probability feature map look like?
  µ_P(x) = ⟨µ_P(·), ϕ(x)⟩_F = ⟨µ_P(·), k(·, x)⟩_F = E_{x′∼P} k(x′, x)
  Expectation of the kernel!
• Empirical estimate:
  µ̂_P(x) = (1/m) ∑_{i=1}^m k(x_i, x),   x_i ∼ P
[Figure: histogram of samples from P with the empirical mean embedding overlaid]
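The empirical estimate is easy to compute directly. A minimal sketch, assuming a Gaussian kernel with a hypothetical bandwidth σ = 1 (the slides leave the kernel generic):

```python
import numpy as np

def mean_embedding(sample, sigma=1.0):
    """Return x -> (1/m) * sum_i k(x_i, x), the empirical mean embedding
    of P, for the Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 sigma^2))."""
    sample = np.asarray(sample, dtype=float)
    m = len(sample)

    def mu_hat(x):
        return np.exp(-(sample - x) ** 2 / (2 * sigma**2)).sum() / m

    return mu_hat

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=1000)   # x_i ~ P = N(0, 1)
mu = mean_embedding(xs)
# For P = N(0, 1) and sigma = 1, the exact value E_P k(x', 0) = 1/sqrt(2),
# so mu(0.0) should be close to 0.707
print(mu(0.0))
```

Plotting `mu` over a grid reproduces the smooth "embedding" curve overlaid on the sample histogram in the figure.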
Does the feature space mean exist?

Does there exist an element µ_P ∈ F such that
  E_P f(x) = E_P ⟨f(·), ϕ(x)⟩_F = ⟨f(·), E_P ϕ(x)⟩_F = ⟨f(·), µ_P(·)⟩_F   ∀f ∈ F ?
Yes: you can exchange expectation and inner product (i.e. ϕ(x) is Bochner integrable [Steinwart and Christmann, 2008]) under the condition
  E_P ‖ϕ(x)‖_F = E_P √(k(x, x)) < ∞
Function Showing Difference in Distributions
• Are P and Q different?
[Figure: samples from P and Q]
Function Showing Difference in Distributions
• Maximum mean discrepancy: smooth function for P vs Q
  MMD(P,Q;F) := sup_{f∈F} [E_P f(x) − E_Q f(y)]
[Figure: samples from P and Q with a smooth witness function f(x)]
Function Showing Difference in Distributions
• What if the function is not smooth?
  MMD(P,Q;F) := sup_{f∈F} [E_P f(x) − E_Q f(y)]
[Figure: samples from P and Q with a bounded continuous (non-smooth) function f(x)]
Function Showing Difference in Distributions
• Maximum mean discrepancy: smooth function for P vs Q
  MMD(P,Q;F) := sup_{f∈F} [E_P f(x) − E_Q f(y)]
• Gauss P vs Laplace Q
[Figure: witness function f for the Gauss and Laplace densities, plotted alongside the two densities]
Function Showing Difference in Distributions
• Maximum mean discrepancy: smooth function for P vs Q
  MMD(P,Q;F) := sup_{f∈F} [E_P f(x) − E_Q f(y)]
• Classical results: MMD(P,Q;F) = 0 iff P = Q, when
  – F = bounded continuous [Dudley, 2002]
  – F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
  – F = bounded Lipschitz (earth mover's distances) [Dudley, 2002]
• MMD(P,Q;F) = 0 iff P = Q when F = the unit ball in a characteristic RKHS F (coming soon!) [ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]
How do smooth functions relate to feature maps?
Function view vs feature mean view
• The (kernel) MMD: [ISMB06, NIPS06a]
  MMD²(P,Q;F)
  = (sup_{f∈F} [E_P f(x) − E_Q f(y)])²      use E_P f(x) =: ⟨µ_P, f⟩_F
  = (sup_{f∈F} ⟨f, µ_P − µ_Q⟩_F)²          use ‖θ‖_F = sup_{f∈F} ⟨f, θ⟩_F
  = ‖µ_P − µ_Q‖²_F
Function view and feature view equivalent
Empirical estimate of MMD
• An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,
  MMD² = 1/(m(m−1)) ∑_{i=1}^m ∑_{j≠i} [k(x_i, x_j) + k(y_i, y_j)]
         − (1/m²) ∑_{i=1}^m ∑_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)]
• Proof:
  ‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
  = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
  = E_P[µ_P(x)] + . . .
  = E_P⟨µ_P(·), k(x, ·)⟩ + . . .
  = E_P k(x, x′) + E_Q k(y, y′) − 2E_{P,Q} k(x, y)
  Then estimate E k(x, x′) ≈ 1/(m(m−1)) ∑_{i=1}^m ∑_{j≠i} k(x_i, x_j).
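The unbiased estimate maps directly to code. A sketch with a Gaussian kernel (a hypothetical choice; any positive definite kernel works):

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    # k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for 1-D samples
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased MMD^2: off-diagonal means of the within-sample Gram
    matrices, minus twice the mean of the cross Gram matrix."""
    m = len(x)
    kxx = gaussian_gram(x, x, sigma)
    kyy = gaussian_gram(y, y, sigma)
    kxy = gaussian_gram(x, y, sigma)
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))   # j != i terms
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 500)
same = rng.normal(0.0, 1.0, 500)   # same distribution: estimate near 0
diff = rng.normal(2.0, 1.0, 500)   # shifted mean: estimate clearly > 0
print(mmd2_unbiased(x, same), mmd2_unbiased(x, diff))
```

Because the estimator is unbiased, its value on two samples from the same distribution fluctuates around zero (and can be slightly negative).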
MMD for independence: HSIC
• Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
  Related to [Feuerverger, 1993] and [Szekely and Rizzo, 2009, Szekely et al., 2007]
  HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²
HSIC using expectations of kernels:
Define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then
  HSIC(P_XY, P_X P_Y)
  = E_{XY} E_{X′Y′} k(x, x′) l(y, y′) + E_X E_{X′} k(x, x′) E_Y E_{Y′} l(y, y′)
    − 2 E_{X′Y′}[E_X k(x, x′) E_Y l(y, y′)]
HSIC: empirical estimate and intuition
[Figure: paired observations (x_i, y_i), text extracts describing dog breeds (from dogtime.com and petfinder.com) alongside matching images, illustrating samples from a joint distribution P_XY]
Empirical HSIC(P_XY, P_X P_Y):
  HSIC = (1/n²) (HKH ⊙ HLH)_{++}
where H = I − (1/n)11⊤ is the centring matrix, ⊙ is the entrywise product, and (·)_{++} denotes the sum of all matrix entries.
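The matrix form is one line of linear algebra. A sketch, assuming Gaussian kernels on both spaces with hypothetical bandwidths:

```python
import numpy as np

def gaussian_gram(a, sigma=1.0):
    a = np.asarray(a, dtype=float)
    return np.exp(-((a[:, None] - a[None, :]) ** 2) / (2 * sigma**2))

def hsic_b(x, y, sigma=1.0):
    """Biased empirical HSIC = (1/n^2) trace(K H L H), equivalently
    (1/n^2) times the sum of entries of (HKH) entrywise-times (HLH)."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    K = gaussian_gram(x, sigma)
    L = gaussian_gram(y, sigma)
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 300)
y_dep = x + 0.1 * rng.normal(0.0, 1.0, 300)   # strongly dependent pair
y_ind = rng.normal(0.0, 1.0, 300)             # independent pair
print(hsic_b(x, y_dep), hsic_b(x, y_ind))
```

The dependent pair gives a value far above the independent one; how much above "zero" is significant is the topic of the HSIC testing slides later in the deck.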
Characteristic kernels (Via Fourier, on the torus T)
Characteristic Kernels (via Fourier)
Reminder:
Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10]
In the next slides:
1. Characteristic property on [−π, π] with periodic boundary
2. Characteristic property on Rd
Characteristic Kernels (via Fourier)
Reminder: Fourier series
• Function on [−π, π] with periodic boundary:
  f(x) = ∑_{ℓ=−∞}^{∞} f̂_ℓ exp(iℓx) = ∑_{ℓ=−∞}^{∞} f̂_ℓ (cos(ℓx) + i sin(ℓx))
[Figure: top-hat function f(x) and its Fourier series coefficients f̂_ℓ]
Characteristic Kernels (via Fourier)
Reminder: Fourier series of the kernel
  k(x, y) = k(x − y) = k(z),   k(z) = ∑_{ℓ=−∞}^{∞} k̂_ℓ exp(iℓz)
E.g., k(x) = (1/2π) ϑ(x/2π, iσ²/2π),   k̂_ℓ = (1/2π) exp(−σ²ℓ²/2),
where ϑ is the Jacobi theta function, close to a Gaussian when σ is sufficiently small relative to the width of [−π, π].
[Figure: the kernel k(x) and its Fourier series coefficients]
Characteristic Kernels (via Fourier)
Maximum mean embedding via Fourier series:
• The Fourier series of P is its characteristic function φ_P
• The Fourier series of the mean embedding is a product of Fourier series (convolution theorem):
  µ_P(x) = E_P k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),   µ̂_{P,ℓ} = k̂_ℓ × φ_{P,ℓ}
• MMD can be written in terms of Fourier series:
  MMD(P,Q;F) := ‖ ∑_{ℓ=−∞}^{∞} [(φ_{P,ℓ} − φ_{Q,ℓ}) k̂_ℓ] exp(iℓx) ‖_F
A simpler Fourier expression for MMD
• From the previous slide,
  MMD(P,Q;F) := ‖ ∑_{ℓ=−∞}^{∞} [(φ_{P,ℓ} − φ_{Q,ℓ}) k̂_ℓ] exp(iℓx) ‖_F
• The squared norm of a function f in F is:
  ‖f‖²_F = ⟨f, f⟩_F = ∑_{ℓ=−∞}^{∞} |f̂_ℓ|² / k̂_ℓ
• Simple, interpretable expression for the squared MMD:
  MMD²(P,Q;F) = ∑_{ℓ=−∞}^{∞} |(φ_{P,ℓ} − φ_{Q,ℓ}) k̂_ℓ|² / k̂_ℓ = ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ
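This expression can be checked numerically. A sketch on a grid over [−π, π): two densities that differ only at frequency |ℓ| = 3, compared under a Gaussian-spectrum kernel (k̂_ℓ > 0 everywhere) and under a spectrum artificially zeroed at |ℓ| = 3; both spectra are illustrative choices:

```python
import numpy as np

# Grid on [-pi, pi) and two densities differing only at frequency |l| = 3
xs = np.linspace(-np.pi, np.pi, 2048, endpoint=False)
dx = xs[1] - xs[0]
p = np.full_like(xs, 1 / (2 * np.pi))             # uniform density
q = (1 + 0.5 * np.cos(3 * xs)) / (2 * np.pi)      # perturbed at l = 3

ls = np.arange(-10, 11)
# phi_{P,l} = E_P exp(i l x), approximated on the grid
phi_p = np.array([(p * np.exp(1j * l * xs)).sum() * dx for l in ls])
phi_q = np.array([(q * np.exp(1j * l * xs)).sum() * dx for l in ls])

def mmd2(k_hat):
    # MMD^2 = sum_l |phi_{P,l} - phi_{Q,l}|^2 * k_hat_l (truncated sum)
    return (np.abs(phi_p - phi_q) ** 2 * k_hat).sum()

sigma = 1.0
gauss_spec = np.exp(-sigma**2 * ls**2 / 2) / (2 * np.pi)  # positive everywhere
zeroed = gauss_spec * (np.abs(ls) != 3)                   # vanishes at |l| = 3

print(mmd2(gauss_spec) > 1e-6, mmd2(zeroed) < 1e-12)      # -> True True
```

With the everywhere-positive spectrum the difference between P and Q is detected; with the zeroed spectrum it is invisible, previewing why a characteristic kernel needs k̂_ℓ > 0 at all frequencies.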
Characteristic Kernels (2)
• Example: P differs from Q at (roughly) one frequency
[Figure: densities P(x) and Q(x); their Fourier (characteristic function) coefficients φ_{P,ℓ} and φ_{Q,ℓ}; and the characteristic function difference φ_{P,ℓ} − φ_{Q,ℓ}, which is non-zero only near one frequency]
Example
Is the Gaussian-spectrum kernel characteristic? YES
[Figure: the kernel k(x) and its Fourier series coefficients, which are strictly positive at every frequency]
  MMD²(P,Q;F) := ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ
Example
Is the triangle kernel characteristic? NO
[Figure: the triangle kernel and its Fourier series coefficients, which vanish at some frequencies; differences at those frequencies go undetected]
  MMD²(P,Q;F) := ∑_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ
Characteristic kernels (Via Fourier, on Rd)
Characteristic Kernels (via Fourier)
• Can we prove characteristic on R^d?
• Characteristic function of P via the Fourier transform:
  φ_P(ω) = ∫_{R^d} e^{i x⊤ω} dP(x)
• Translation-invariant kernels: k(x, y) = k(x − y) = k(z)
• Bochner's theorem:
  k(z) = ∫_{R^d} e^{−i z⊤ω} dΛ(ω)
  – Λ a finite non-negative Borel measure
Characteristic Kernels (via Fourier)
• Fourier representation of MMD:
  MMD²(P,Q;F) = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω)
  – φ_P the characteristic function of P
Proof: using Bochner's theorem (a) and Fubini's theorem (b),
  MMD²(P,Q) = ∫∫_{R^d} k(x − y) d(P−Q)(x) d(P−Q)(y)
  (a) = ∫∫∫_{R^d} e^{−i(x−y)⊤ω} dΛ(ω) d(P−Q)(x) d(P−Q)(y)
  (b) = ∫ [∫_{R^d} e^{−i x⊤ω} d(P−Q)(x)] [∫_{R^d} e^{i y⊤ω} d(P−Q)(y)] dΛ(ω)
  = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω)
Example
• Example: P differs from Q at (roughly) one frequency
[Figure: densities P(X) and Q(X); the magnitudes of their characteristic functions |φ_P| and |φ_Q|; and the difference |φ_P − φ_Q|, concentrated near one frequency ω]
Example
• Example: P differs from Q at (roughly) one frequency
• Gaussian kernel: characteristic
[Figure: the difference |φ_P − φ_Q| vs frequency ω, overlaid with the Gaussian kernel spectrum, which is positive everywhere, so the difference is detected]
Example
• Example: P differs from Q at (roughly) one frequency
• Sinc kernel: NOT characteristic
[Figure: the difference |φ_P − φ_Q| vs frequency ω, overlaid with the sinc kernel spectrum, which has bounded support, so a difference outside that support is missed]
Example
• Example: P differs from Q at (roughly) one frequency
• Triangle (B-spline) kernel: characteristic
[Figure: the difference |φ_P − φ_Q| vs frequency ω, overlaid with the triangle (B-spline) kernel spectrum, which vanishes only at isolated frequencies, so supp(Λ) = R]
Summary: Characteristic Kernels
Characteristic kernel: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]
Main theorem: a translation-invariant k is characteristic for probability measures on R^d if and only if supp(Λ) = R^d (i.e. the spectrum may vanish at most on a set with empty interior, e.g. countably many isolated points) [COLT08, JMLR10]
Corollary: a continuous, compactly supported k is characteristic (since its Fourier spectrum Λ(ω) cannot be zero on an interval). 1-D proof sketch from [Mallat, 1999, Theorem 2.6]; proof on R^d via distribution theory in [Sriperumbudur et al., 2010, Corollary 10, p. 1535].
k characteristic iff supp(Λ) = R^d
Proof: supp(Λ) = R^d ⇒ k characteristic.
Recall the Fourier definition of MMD:
  MMD²(P,Q) = ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω)
The characteristic functions φ_P(ω) and φ_Q(ω) are uniformly continuous, hence if they differ anywhere they differ on an open set, which has positive Λ-measure.
A map φ_P is uniformly continuous if ∀ε > 0, ∃δ > 0 such that for all (ω₁, ω₂) ∈ Ω with d(ω₁, ω₂) < δ, we have d(φ_P(ω₁), φ_P(ω₂)) < ε. Uniform: δ depends only on ε, not on ω₁, ω₂.
k characteristic iff supp(Λ) = R^d
Proof: k characteristic ⇒ supp(Λ) = R^d, by contrapositive.
Given supp(Λ) ⊊ R^d, there exists an open interval U on which Λ vanishes.
Construct densities p(x), q(x) such that φ_P, φ_Q differ only inside U; then MMD(P,Q) = 0 although P ≠ Q.
Further extensions
• Similar reasoning applies wherever extensions of Bochner's theorem exist: [Fukumizu et al., 2009]
  – Locally compact Abelian groups (periodic domains, as we saw)
  – Compact, non-Abelian groups (orthogonal matrices)
  – The semigroup R₊ⁿ (histograms)
• Related kernel statistics: the Fisher statistic [Harchaoui et al., 2008] (zero iff P = Q for characteristic kernels), other distances [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q), energy distances
Statistical hypothesis testing
Motivating question: differences in brain signals
The problem: do local field potential (LFP) signals change when measured near a spike burst?
[Figure: LFP amplitude vs time, near a spike burst and without a spike burst]
Statistical test using MMD (1)
• Two hypotheses:
  – H₀: null hypothesis (P = Q)
  – H₁: alternative hypothesis (P ≠ Q)
• Observe samples x := {x₁, . . . , xₙ} from P and y from Q
• If the empirical MMD(x, y;F) is
  – "far from zero": reject H₀
  – "close to zero": accept H₀
Statistical test using MMD (2)
• "Far from zero" vs "close to zero": where is the threshold?
• One answer: the asymptotic distribution of MMD²
• An unbiased empirical estimate (quadratic cost):
  MMD² = 1/(n(n−1)) ∑_{i≠j} [k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j)]
  with the summand denoted h((x_i, y_i), (x_j, y_j))
• When P ≠ Q, asymptotically normal:
  √n (MMD̂² − MMD²) ∼ N(0, σ_u²)   [Hoeffding, 1948, Serfling, 1980]
• Expression for the variance, with z_i := (x_i, y_i):
  σ_u² = 4 ( E_z[(E_{z′} h(z, z′))²] − [E_{z,z′} h(z, z′)]² )
Statistical test using MMD (3)
• Example: Laplace distributions with different variance
[Figure: two Laplace densities P_X and Q_X with different variances; the empirical distribution of the MMD statistic under H₁ with a Gaussian fit]
Statistical test using MMD (4)
• When P = Q, the U-statistic is degenerate: E_{z′}[h(z, z′)] = 0 [Anderson et al., 1994]
• The distribution is
  n MMD̂² ∼ ∑_{l=1}^{∞} λ_l [z_l² − 2]
• where
  – z_l ∼ N(0, 2) i.i.d.
  – ∫_X k̃(x, x′) ψ_i(x) dP(x) = λ_i ψ_i(x′), with k̃ the centred kernel
[Figure: MMD density under H₀: empirical PDF of n × MMD² against the χ² sum]
Statistical test using MMD (5)
• Given P = Q, want a threshold T such that P(MMD > T) ≤ 0.05
  MMD² = K̄_{P,P} + K̄_{Q,Q} − 2K̄_{P,Q}   (means of the blocks of the pooled Gram matrix)
[Figure: MMD densities under H₀ and H₁, showing the 1−α null quantile and the Type II error]
Statistical test using MMD (5)
• Given P = Q, want a threshold T such that P(MMD > T) ≤ 0.05
• Permutation for the empirical CDF [Arcones and Gine, 1992, Alba Fernandez et al., 2008]
• Pearson curves by matching the first four moments [Johnson et al., 1994]
• Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
• Consistent test using the kernel eigenspectrum [NIPS09b]
[Figure: CDF of the MMD with a Pearson-curve fit]
Approximate null distribution of MMD via permutation
Empirical MMD:
  w = (1, 1, . . . , 1, −1, . . . , −1)⊤   (n entries of +1 followed by n of −1)
  (1/n²) ∑_{i,j} ( [K_{P,P} K_{P,Q}; K_{Q,P} K_{Q,Q}] ⊙ ww⊤ )_{ij} ≈ MMD²
Approximate null distribution of MMD via permutation
Permuted case: [Alba Fernandez et al., 2008]
  w = (1, −1, 1, . . . , 1, −1, . . . , 1, −1, −1)⊤   (signs shuffled; still an equal number of +1 and −1)
  (1/n²) ∑_{i,j} ( [K_{P,P} K_{P,Q}; K_{Q,P} K_{Q,Q}] ⊙ ww⊤ )_{ij} = [?]
[Figure: the pooled Gram matrix multiplied entrywise by a permuted sign pattern. Figure thanks to Kacper Chwialkowski.]
Approximate null distribution of MMD² via permutation
  MMD²_p ≈ (1/n²) ∑_{i,j} ( [K_{P,P} K_{P,Q}; K_{Q,P} K_{Q,Q}] ⊙ ww⊤ )_{ij}
[Figure: MMD density under H₀: the null PDF against the null PDF obtained from permutation]
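The permuted-signs construction gives a direct recipe for the test. A sketch with a Gaussian kernel (hypothetical bandwidth) and samples of equal size n:

```python
import numpy as np

def gaussian_gram(a, sigma=1.0):
    return np.exp(-((a[:, None] - a[None, :]) ** 2) / (2 * sigma**2))

def mmd2_signed(K, w, n):
    # (1/n^2) * sum of K entrywise-times (w w^T), with w in {+1, -1}^{2n}
    return (K * np.outer(w, w)).sum() / n**2

def mmd_permutation_test(x, y, sigma=1.0, n_perm=500, alpha=0.05, seed=0):
    """Biased MMD^2 with a permutation threshold: shuffle the +/-1
    labels of the pooled sample to draw from the null distribution."""
    rng = np.random.default_rng(seed)
    n = len(x)
    K = gaussian_gram(np.concatenate([x, y]), sigma)
    w = np.concatenate([np.ones(n), -np.ones(n)])
    stat = mmd2_signed(K, w, n)
    null = np.empty(n_perm)
    for i in range(n_perm):
        rng.shuffle(w)                 # random signs, still n of each
        null[i] = mmd2_signed(K, w, n)
    threshold = np.quantile(null, 1 - alpha)
    return stat, threshold, stat > threshold

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 200)
y = rng.normal(1.0, 1.0, 200)          # mean shift: should reject H0
stat, thr, reject = mmd_permutation_test(x, y)
print(stat, thr, reject)
```

Since the pooled Gram matrix is computed once and only the signs are shuffled, each permutation costs one entrywise product rather than a fresh kernel evaluation.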
Detecting differences in brain signals
Do local field potential (LFP) signals change when measured near a spike burst?
[Figure: LFP amplitude vs time, near a spike burst and without a spike burst]
Neuro data: consistent test w/o permutation
• Maximum mean discrepancy (MMD): distance between P and Q
  MMD²(P,Q;F) := ‖µ_P − µ_Q‖²_F
• Is MMD significantly > 0?
• P = Q, null distribution of MMD:
  n MMD̂² →_D ∑_{l=1}^{∞} λ_l (z_l² − 2),
  – λ_l is the l-th eigenvalue of the kernel k(x_i, x_j)
[Figure: Type II error vs sample size m for the neuro data (P ≠ Q), spectral threshold vs permutation threshold]
Use the Gram matrix spectrum for the λ_l: a consistent test without permutation
Hypothesis testing with HSIC
Distribution of HSIC at independence
• (Biased) empirical HSIC, a V-statistic:
  HSIC_b = (1/n²) trace(KHLH)
• The associated U-statistic is degenerate when P = P_x P_y [Serfling, 1980]:
  n HSIC_b →_D ∑_{l=1}^{∞} λ_l z_l²,   z_l ∼ N(0, 1) i.i.d.
  λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r},
  h_{ijqr} = (1/4!) ∑_{(t,u,v,w)}^{(i,j,q,r)} [k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv}]
• First two moments [NIPS07b]:
  E(HSIC_b) = (1/n) Tr C_xx Tr C_yy
  var(HSIC_b) = [2(n−4)(n−5)/(n)₄] ‖C_xx‖²_HS ‖C_yy‖²_HS + O(n⁻³)
Statistical testing with HSIC
• Given P = P_x P_y, what is the threshold T such that P(HSIC > T) < α for small α?
• Null distribution via permutation [Feuerverger, 1993]
  – Compute HSIC for {x_i, y_{π(i)}}_{i=1}^n for a random permutation π of the indices {1, . . . , n}. This gives HSIC for independent variables.
  – Repeat for many different permutations to get an empirical CDF
  – The threshold T is the 1 − α quantile of the empirical CDF
• Approximate null distribution via moment matching [Kankainen, 1995]:
  n HSIC_b(Z) ∼ x^{α−1} e^{−x/β} / (β^α Γ(α))
  where
  α = (E(HSIC_b))² / var(HSIC_b),   β = n var(HSIC_b) / E(HSIC_b)
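A sketch of the moment-matched gamma threshold. The slides give closed-form expressions for the null mean and variance; as a simplification here they are estimated from a few random pairings of x and y (combining the permutation and gamma ideas), and `scipy` supplies the gamma quantile:

```python
import numpy as np
from scipy.stats import gamma

def gaussian_gram(a, sigma=1.0):
    return np.exp(-((a[:, None] - a[None, :]) ** 2) / (2 * sigma**2))

def hsic_gamma_test(x, y, sigma=1.0, level=0.05, n_null=200, seed=0):
    """Threshold for n*HSIC_b from a two-parameter gamma fit:
    shape alpha = E^2/var, scale beta = n*var/E (moment matching)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K = gaussian_gram(np.asarray(x, float), sigma)
    L = gaussian_gram(np.asarray(y, float), sigma)
    hsic = lambda Lmat: np.trace(K @ H @ Lmat @ H) / n**2
    stat = n * hsic(L)
    # estimate the null mean/variance of HSIC_b via random pairings
    null = np.empty(n_null)
    for i in range(n_null):
        idx = rng.permutation(n)
        null[i] = hsic(L[np.ix_(idx, idx)])
    e, v = null.mean(), null.var()
    shape, scale = e**2 / v, n * v / e
    threshold = gamma.ppf(1 - level, shape, scale=scale)
    return stat, threshold, stat > threshold

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 200)
y = x + 0.25 * rng.normal(0.0, 1.0, 200)   # dependent: should reject
stat, thr, reject = hsic_gamma_test(x, y)
```

The gamma fit avoids storing and sorting a large permutation sample once the two moments are in hand.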
Experiment: dependence testing for translation
Are the French text extracts translations of the English?
X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.
Y1: Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.
X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.
· · ·
?⟺
Y2: Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.
· · ·
Experiment: dependence testing for translation
• (Biased) empirical HSIC:
  HSIC_b = (1/n²) trace(KHLH)
• Translation example: [NIPS07b] Canadian Hansard (agriculture)
• 5-line extracts, k-spectrum kernel with k = 10, 300 repetitions, sample size 10
[Diagram: English extracts give the Gram matrix K, French extracts give the Gram matrix L, combined in HSIC]
• k-spectrum kernel: average Type II error 0 (α = 0.05)
• Bag-of-words kernel: average Type II error 0.18
Kernel two-sample tests for big data, optimal kernel choice
Quadratic time estimate of MMD
  MMD² = ‖µ_P − µ_Q‖²_F = E_P k(x, x′) + E_Q k(y, y′) − 2E_{P,Q} k(x, y)
Given i.i.d. samples X := {x₁, . . . , x_m} and Y := {y₁, . . . , y_m} from P and Q, respectively:
The earlier estimate (quadratic time):
  Ê_P k(x, x′) = 1/(m(m−1)) ∑_{i=1}^m ∑_{j≠i} k(x_i, x_j)
New, linear time estimate:
  Ê_P k(x, x′) = (2/m) [k(x₁, x₂) + k(x₃, x₄) + . . .] = (2/m) ∑_{i=1}^{m/2} k(x_{2i−1}, x_{2i})
Linear time MMD

Shorter expression with explicit k dependence:

MMD² =: η_k(p, q) = E_{x x′ y y′} h_k(x, x′, y, y′) =: E_v h_k(v),

where

h_k(x, x′, y, y′) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y),

and v := [x, x′, y, y′].

The linear time estimate again:

η̂_k = (2/m) Σ_{i=1}^{m/2} h_k(v_i),

where v_i := [x_{2i−1}, x_{2i}, y_{2i−1}, y_{2i}] and

h_k(v_i) := k(x_{2i−1}, x_{2i}) + k(y_{2i−1}, y_{2i}) − k(x_{2i−1}, y_{2i}) − k(x_{2i}, y_{2i−1})
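The full linear-time statistic is then a single vectorized pass over sample pairs. A sketch, again with a Gaussian kernel as an illustrative stand-in:

```python
import numpy as np

def k_gauss(a, b, sigma=1.0):
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

def linear_mmd2(x, y, sigma=1.0):
    """eta_hat_k = (2/m) sum_i h_k(v_i), with v_i = [x_{2i-1}, x_{2i}, y_{2i-1}, y_{2i}]."""
    x1, x2 = x[0::2], x[1::2]
    y1, y2 = y[0::2], y[1::2]
    h = (k_gauss(x1, x2, sigma) + k_gauss(y1, y2, sigma)
         - k_gauss(x1, y2, sigma) - k_gauss(x2, y1, sigma))
    return 2.0 / len(x) * h.sum()
```

The same h_k(v_i) values also yield an empirical estimate of the variance σ_k² needed for the test threshold.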
Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD
• Much higher variance for a given m, hence. . .
• . . . a much less powerful test for a given m

Advantages of the linear time MMD vs quadratic time MMD
• Very simple asymptotic null distribution (a Gaussian, vs an infinite weighted sum of χ²)
• Both test statistic and threshold computable in O(m), with storage O(1)
• Given unlimited data, a given Type II error can be attained with less computation
Asymptotics of linear time MMD

By the central limit theorem,

m^{1/2} (η̂_k − η_k(p, q)) →_D N(0, 2σ_k²)

• assuming 0 < E(h_k²) < ∞ (true for bounded k)
• σ_k² = E_v h_k²(v) − [E_v h_k(v)]²
Hypothesis test

Hypothesis test of asymptotic level α:

t_{k,α} = m^{−1/2} σ_k √2 Φ^{−1}(1 − α), where Φ^{−1} is the inverse CDF of N(0, 1).
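With σ_k in hand, the threshold is one line; a sketch using only the standard library (`statistics.NormalDist` for Φ^{−1}):

```python
from math import sqrt
from statistics import NormalDist

def mmd_threshold(sigma_k, m, alpha=0.05):
    # t_{k, alpha} = m^{-1/2} * sigma_k * sqrt(2) * Phi^{-1}(1 - alpha)
    return sigma_k * sqrt(2.0 / m) * NormalDist().inv_cdf(1 - alpha)
```

The threshold shrinks as m^{−1/2}, which is why, with enough data, even the high-variance linear statistic yields a consistent test.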
[Figure: null distribution of the linear time MMD, P(η̂_k); the threshold t_{k,α} is the (1 − α) quantile, and the mass above it is the Type I error]
[Figure: null vs alternative distribution of η̂_k; the alternative is centered at η_k(p, q), and the Type II error is its mass below t_{k,α}]
The best kernel: minimizes Type II error

Type II error: η̂_k falls below the threshold t_{k,α} while η_k(p, q) > 0.

Probability of a Type II error:

P(η̂_k < t_{k,α}) = Φ( Φ^{−1}(1 − α) − η_k(p, q) √m / (σ_k √2) ),

where Φ is the Normal CDF.

Since Φ is monotonic, the best kernel choice to minimize the Type II error probability is:

k* = arg max_{k∈K} η_k(p, q) σ_k^{−1},

where K is the family of kernels under consideration.
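Plugging numbers into Φ makes the trade-off concrete; a stdlib-only sketch of the Type II error probability as a function of the ratio η_k(p, q)/σ_k:

```python
from math import sqrt
from statistics import NormalDist

def type_two_error(eta, sigma_k, m, alpha=0.05):
    # P(eta_hat < t_{k,alpha}) = Phi( Phi^{-1}(1 - alpha) - eta * sqrt(m) / (sigma_k * sqrt(2)) )
    nd = NormalDist()
    return nd.cdf(nd.inv_cdf(1 - alpha) - eta * sqrt(m) / (sigma_k * sqrt(2)))
```

The error depends on the kernel only through η/σ, which is exactly why k* maximizes that ratio rather than η alone.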
Learning the best kernel in a family

Define the family of kernels as follows:

K := { k : k = Σ_{u=1}^d β_u k_u, ‖β‖_1 = D, β_u ≥ 0 ∀u ∈ {1, . . . , d} }

Properties (if at least one β_u > 0):
• all k ∈ K are valid kernels,
• if all k_u are characteristic, then k is characteristic.
Test statistic

The squared MMD becomes

η_k(p, q) = ‖µ_k(p) − µ_k(q)‖²_{F_k} = Σ_{u=1}^d β_u η_u(p, q),

where η_u(p, q) := E_v h_u(v).

Denote:
• β = (β_1, β_2, . . . , β_d)^⊤ ∈ R^d,
• h = (h_1, h_2, . . . , h_d)^⊤ ∈ R^d, where
  h_u(x, x′, y, y′) = k_u(x, x′) + k_u(y, y′) − k_u(x, y′) − k_u(x′, y)
• η = E_v(h) = (η_1, η_2, . . . , η_d)^⊤ ∈ R^d.

Quantities for the test:

η_k(p, q) = E(β^⊤h) = β^⊤η,  σ_k² := β^⊤ cov(h) β.
Optimization of ratio η_k(p, q) σ_k^{−1}

Empirical test parameters:

η̂_k = β^⊤η̂,  σ̂_{k,λ} = √( β^⊤ (Q̂ + λ_m I) β ),

where Q̂ is the empirical estimate of cov(h).

Note: η̂_k, σ̂_{k,λ} are computed on training data, vs η̂_k, σ̂_k on the data to be tested (why?)

Objective:

β* = arg max_{β⪰0} η̂_k(p, q) σ̂_{k,λ}^{−1} = arg max_{β⪰0} (β^⊤η̂)(β^⊤(Q̂ + λ_m I)β)^{−1/2} =: α(β; η̂, Q̂)
Optimization of ratio η_k(p, q) σ_k^{−1}

Assume: η̂ has at least one positive entry.
Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0.
Thus: α(β*; η̂, Q̂) > 0.

Solve the easier problem: β* = arg max_{β⪰0} α²(β; η̂, Q̂).

Quadratic program:

min { β^⊤(Q̂ + λ_m I)β : β^⊤η̂ = 1, β ⪰ 0 }

What if η̂ has no positive entries?
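Since the ratio α(β; η̂, Q̂) is invariant to rescaling β, it can be illustrated without a QP solver. A toy substitute (d = 2 only, coarse scan over the simplex rather than a real quadratic program; η̂ and Q̂ are assumed precomputed):

```python
import numpy as np

def select_beta_2d(eta_hat, Q_hat, lam=1e-3, n_grid=201):
    """Maximize beta^T eta_hat / sqrt(beta^T (Q_hat + lam I) beta) over beta >= 0.
    Scale invariance of the ratio lets us scan beta = (t, 1 - t) on the unit simplex."""
    R = Q_hat + lam * np.eye(2)
    best_b, best_val = None, -np.inf
    for t in np.linspace(0.0, 1.0, n_grid):
        b = np.array([t, 1.0 - t])
        val = b @ eta_hat / np.sqrt(b @ R @ b)
        if val > best_val:
            best_b, best_val = b, val
    return best_b
```

In practice one would solve the constrained QP above with a proper solver; this grid scan only illustrates that the objective prefers kernels with large η̂_u relative to their variance.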
Test procedure

1. Split the data into testing and training.
2. On the training data:
   (a) Compute η̂_u for all k_u ∈ K
   (b) If at least one η̂_u > 0, solve the QP to get β*; else choose a random kernel from K
3. On the test data:
   (a) Compute η̂_{k*} using k* = Σ_{u=1}^d β*_u k_u
   (b) Compute the test threshold t_{α,k*} using σ̂_{k*}
4. Reject the null if η̂_{k*} > t_{α,k*}
Convergence bounds

Assume a bounded kernel, with σ_k bounded away from 0.

If λ_m = Θ(m^{−1/3}), then

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} | = O_P(m^{−1/3}).

Idea:

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} |
  ≤ sup_{k∈K} | η̂_k σ̂_{k,λ}^{−1} − η_k σ_{k,λ}^{−1} | + sup_{k∈K} | η_k σ_{k,λ}^{−1} − η_k σ_k^{−1} |
  ≤ (√d D / √λ_m) ( C_1 sup_{k∈K} |η̂_k − η_k| + C_2 sup_{k∈K} |σ̂_{k,λ} − σ_{k,λ}| ) + C_3 D² λ_m
Experiments

Competing approaches:
• Median heuristic
• Max. MMD: choose the k_u ∈ K with the largest η̂_u
  – same as maximizing β^⊤η̂ subject to ‖β‖_1 ≤ 1
• ℓ2 statistic: maximize β^⊤η̂ subject to ‖β‖_2 ≤ 1
• Cross validation on the training set

Also compare with:
• Single kernel that maximizes the ratio η_k(p, q) σ_k^{−1}
Blobs: data
Difficult problems: lengthscale of the difference in distributions not the same
as that of the distributions.
We distinguish a field of Gaussian blobs with different covariances.
[Figure: blob data, samples from p (left) and q (right) in coordinates (x1, x2) / (y1, y2). Ratio ε = 3.2 of largest to smallest eigenvalues of the blobs in q.]
Blobs: results

[Figure: Type II error vs eigenvalue ratio ε for max ratio, opt, l2, maxmmd, xval, xvalc, med]

Parameters: m = 10,000 (for training and test). Ratio ε of largest to smallest eigenvalues of blobs in q. Results are averaged over 617 trials.
Blobs: results

[The same figure is repeated with individual curves highlighted in turn: optimizing the ratio η_k(p, q)σ_k^{−1}, maximizing η_k(p, q) with the β constraint, and the median heuristic.]
Feature selection: data

Idea: no single best kernel.

Each of the k_u is univariate (along a single coordinate).

[Figure: scatter plot of samples from p and q in coordinates (x1, x2)]
Feature selection: results

[Figure: Type II error vs dimension for max ratio, opt, l2, maxmmd, comparing the single best kernel against the linear combination]

m = 10,000, averaged over 5000 trials.
Amplitude modulated signals

Given an audio signal s(t), an amplitude modulated signal can be defined as

u(t) = sin(ω_c t) [a s(t) + l]

• ω_c: carrier frequency
• a = 0.2 is the signal scaling, l = 2 is the offset

Two amplitude modulated signals from the same artist (in this case, the Magnetic Fields).

• Music sampled at 8 kHz (very low)
• Carrier frequency is 24 kHz
• AM signal observed at 120 kHz
• Samples are extracts of length N = 1000, approx. 0.01 sec (very short)
• Total dataset size is 30,000 samples from each of p, q
Amplitude modulated signals

[Figure: example waveform samples from P and samples from Q]
Results: AM signals

[Figure: Type II error vs added noise for max ratio, opt, med, l2, maxmmd]

m = 10,000 (for training and test) and scaling a = 0.5. Averaged over 4124 trials. Gaussian noise added.
Observations on kernel choice
• It is possible to choose the best kernel for a kernel
two-sample test
• Kernel choice matters for “difficult” problems, where the
distributions differ on a lengthscale different to that of
the data.
• Ongoing work:
– quadratic time statistic
– avoid training/test split
Summary

• MMD: a distance between distributions [ISMB06, NIPS06a, JMLR10, JMLR12a]
  – high dimensionality
  – non-Euclidean data (strings, graphs)
  – nonparametric hypothesis tests
• Measure and test independence [ALT05, NIPS07a, NIPS07b, ALT08, JMLR10, JMLR12a]
• Characteristic RKHS: MMD a metric [NIPS07b, COLT08, NIPS08a]
  – Easy to check: does the kernel spectrum cover R^d?
Co-authors
• From UCL:
– Luca Baldassarre
– Steffen Grunewalder
– Guy Lever
– Sam Patterson
– Massimiliano Pontil
– Dino Sejdinovic
• External:
– Karsten Borgwardt, MPI
– Wicher Bergsma, LSE
– Kenji Fukumizu, ISM
– Zaid Harchaoui, INRIA
– Bernhard Schoelkopf, MPI
– Alex Smola, CMU/Google
– Le Song, Georgia Tech
– Bharath Sriperumbudur,
Cambridge
Selected references
Characteristic kernels and mean embeddings:
• Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
• Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space
embeddings and metrics on probability measures. JMLR.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.
Two-sample, independence, conditional independence tests:
• Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of
independence. NIPS
• Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence. NIPS.
• Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample
test. NIPS.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.
Energy distance, relation to kernel distances
• Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and
RKHS-based statistics in hypothesis testing. Annals of Statistics.
Three way interaction
• Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.
Selected references (continued)
Conditional mean embedding, RKHS-valued regression:
• Weston, J., Chapelle, O., Elisseeff, A., Scholkopf, B., and Vapnik, V., (2003). Kernel Dependency
Estimation, NIPS.
• Micchelli, C., and Pontil, M., (2005). On Learning Vector-Valued Functions. Neural Computation.
• Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm.
Foundations of Computational Mathematics.
• Song, L., and Huang, J., and Smola, A., Fukumizu, K., (2009). Hilbert Space Embeddings of Conditional
Distributions. ICML.
• Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean
embeddings as regressors. ICML.
• Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.
Kernel Bayes rule:
• Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified
kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
• Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite
kernels, JMLR
Local departures from the null
What is a hard testing problem?
• First version: for fixed m, “closer” P and Q have higher Type II error
[Figure: two panels of samples from P and Q, the second pair closer together than the first]
Local departures from the null

What is a hard testing problem?

• As m increases, distinguish “closer” P and Q with fixed Type II error
• Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density
  – If δ ∼ m^{−1/2}, the Type II error approaches a constant
More general local departures from the null

• Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density

[Figure: density P(X) compared with a sequence of perturbed densities Q(X)]
Local departures from the null

What is a hard testing problem?

• As we see more samples m, distinguish “closer” P and Q with the same Type II error
• Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density
  – If δ ∼ m^{−1/2}, the Type II error approaches a constant
• . . . but other choices are also possible – how to characterize them all?

General characterization of local departures from H_0:

• Write µ_Q = µ_P + g_m, where g_m ∈ F is chosen such that µ_P + g_m is a valid distribution embedding
• Minimum distinguishable distance [JMLR12]:

‖g_m‖_F = c m^{−1/2}
More general local departures from null
• More advanced example of a local departure from the null
• Recall: µQ = µP + gm, and ‖gm‖F = cm−1/2
[Figure: density P(X) compared with a sequence of perturbed densities Q(X)]
Kernels vs kernels

• How does MMD relate to the Parzen density estimate? [Anderson et al., 1994]

f̂_P(x) = (1/m) Σ_{i=1}^m κ(x_i − x), where κ satisfies ∫_X κ(x) dx = 1 and κ(x) ≥ 0.

• L2 distance between Parzen density estimates:

D_{L2}(f̂_P, f̂_Q)² = ∫ [ (1/m) Σ_{i=1}^m κ(x_i − z) − (1/m) Σ_{i=1}^m κ(y_i − z) ]² dz
  = (1/m²) Σ_{i,j=1}^m k(x_i − x_j) + (1/m²) Σ_{i,j=1}^m k(y_i − y_j) − (2/m²) Σ_{i,j=1}^m k(x_i − y_j),

where k(x − y) = ∫ κ(x − z) κ(y − z) dz.

• For f_Q = f_P + δg, the minimum distance to discriminate f_P from f_Q is δ = m^{−1/2} h_m^{−d/2}, where h_m is the width of κ.
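This identity can be checked numerically: for a Gaussian κ of width h, the convolution k = κ ⋆ κ is a Gaussian of width √2·h, and grid integration of the squared density difference matches the kernel-sum form. A sketch with illustrative values:

```python
import numpy as np

def gauss_pdf(u, s):
    return np.exp(-u ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

def parzen_l2_sq(x, y, h):
    # numerically integrate (fP_hat - fQ_hat)^2 over a fine grid
    grid = np.linspace(-10, 10, 4001)
    fp = gauss_pdf(grid[:, None] - x[None, :], h).mean(axis=1)
    fq = gauss_pdf(grid[:, None] - y[None, :], h).mean(axis=1)
    return float(np.sum((fp - fq) ** 2) * (grid[1] - grid[0]))

def kernel_form(x, y, h):
    # (1/m^2) sum k(xi - xj) + (1/m^2) sum k(yi - yj) - (2/m^2) sum k(xi - yj),
    # with k = kappa * kappa: a Gaussian of width sqrt(2) * h
    k = lambda a, b: gauss_pdf(a[:, None] - b[None, :], np.sqrt(2) * h)
    m = len(x)
    return float((k(x, x).sum() + k(y, y).sum() - 2 * k(x, y).sum()) / m ** 2)
```

The two quantities agree to the accuracy of the grid, illustrating that the L2 Parzen distance is an MMD with a fixed, smoothing kernel.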
Characteristic Kernels (via universality)

Characteristic: MMD is a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Classical result: P = Q if and only if E_P(f(x)) = E_Q(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X [Dudley, 2002]

Universal RKHS: k(x, x′) continuous, X compact, and F dense in C(X) with respect to L_∞ [Steinwart, 2001]

If F is universal, then MMD(P, Q; F) = 0 iff P = Q.
Characteristic Kernels (via universality)

Proof:

First, it is clear that P = Q implies MMD(P, Q; F) = 0.

Converse: by the universality of F, for any given ε > 0 and f ∈ C(X), there exists g ∈ F such that

‖f − g‖_∞ ≤ ε.

We next make the expansion

|E_P f(x) − E_Q f(y)| ≤ |E_P f(x) − E_P g(x)| + |E_P g(x) − E_Q g(y)| + |E_Q g(y) − E_Q f(y)|.

The first and third terms satisfy

|E_P f(x) − E_P g(x)| ≤ E_P |f(x) − g(x)| ≤ ε.
Characteristic Kernels (via universality)

Proof (continued):

Next, write

E_P g(x) − E_Q g(y) = 〈g(·), µ_P − µ_Q〉_F = 0,

since MMD(P, Q; F) = 0 implies µ_P = µ_Q. Hence

|E_P f(x) − E_Q f(y)| ≤ 2ε

for all f ∈ C(X) and ε > 0, which implies P = Q.
References

V. Alba Fernandez, M. Jimenez-Gamero, and J. Munoz Garcia. A test for the two-sample problem based on empirical characteristic functions. Comput. Stat. Data An., 52:3730–3748, 2008.

N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.

M. Arcones and E. Gine. On the bootstrap of u and v statistics. The Annals of Statistics, 20(2):655–674, 1992.

R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.

Andrey Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.

K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schoelkopf. Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21, pages 473–480, Red Hook, NY, 2009. Curran Associates Inc.

Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press, Cambridge, MA, 2008.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.

N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1. John Wiley and Sons, 2nd edition, 1994.

A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyvaskyla, 1995.

S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 1999.

C. McDiarmid. On the method of bounded differences. In Survey in Combinatorics, pages 148–188. Cambridge University Press, 1989.

A. Muller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.

B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Scholkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.

G. Szekely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.

G. Szekely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.

S. K. Zhou and R. Chellappa. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.