On the Estimation of α-Divergences
Barnabás Póczos
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA

Jeff Schneider
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
Abstract
We propose new nonparametric, consistent Rényi-α and Tsallis-α divergence estimators for continuous distributions. Given two independent and identically distributed samples, a “naïve” approach would be to simply estimate the underlying densities and plug the estimated densities into the corresponding formulas. Our proposed estimators, in contrast, avoid density estimation completely, estimating the divergences directly using only simple k-nearest-neighbor statistics. We are nonetheless able to prove that the estimators are consistent under certain conditions. We also describe how to apply these estimators to mutual information and demonstrate their efficiency via numerical experiments.
1 Introduction
Many statistical, artificial intelligence, and machine learning problems require efficient estimation of the divergence between two distributions. We assume that these distributions are not given explicitly; only two finite, independent and identically distributed (i.i.d.) samples are given from the two underlying distributions. The Rényi-α (Rényi, 1961, 1970) and Tsallis-α (Villmann and Haase, 2010) divergences are two widely applied and prominent examples of probability divergences. The popular Kullback–Leibler (KL) divergence is a special case of these families, and they can also be related to Csiszár's f-divergence (Csiszár, 1967). Under certain conditions, these divergences can be used to estimate entropy and can also be applied to estimate Rényi
and Tsallis mutual information. For more examples and other possible applications of these divergences, see our extended technical report (Póczos and Schneider, 2011). Despite their wide applicability, there is no known direct, consistent estimator for the Rényi-α or Tsallis-α divergence.
An indirect way to obtain the desired estimates would be to use a “plug-in” estimation scheme: first, apply a consistent density estimator for the underlying densities, and then plug them into the desired formula. The unknown densities, however, are nuisance parameters in the case of divergence estimation, and we would prefer to avoid estimating them. Furthermore, density estimators usually have tunable parameters, and we may need expensive cross-validation to achieve good performance.
This paper provides a direct, L2-consistent estimator for the Tsallis-α divergence and a weakly consistent estimator for the Rényi-α divergence. These estimators can also be applied to (Rényi and Tsallis) mutual information.
The existing work most relevant to this paper is that of Wang et al. (2009a), who provided an estimator for the α → 1 limit case only, i.e., for the KL divergence. However, we warn the reader that there is an apparent error in their work: they applied the reverse Fatou lemma under conditions in which it does not hold. It is not obvious how this portion of the proof can be remedied. This error originates in the work of Kozachenko and Leonenko (1987) and can also be found in other works. Hero et al. (2002a,b) also investigated the Rényi divergence estimation problem but assumed that one of the two density functions is known. Gupta and Srivastava (2010) developed algorithms for estimating the Shannon entropy and the KL divergence for certain parametric families. Recently, Nguyen et al. (2009, 2010) developed methods for estimating f-divergences using their variational characterization properties. They estimate the likelihood ratio of the two underlying densities and
plug that into the divergence formulas. This approach involves solving a convex minimization problem over an infinite-dimensional function space. For certain function classes defined by reproducing kernel Hilbert spaces (RKHS), however, they were able to reduce the computational load from solving infinite-dimensional problems to solving n-dimensional problems, where n denotes the sample size. When n is large, solving these convex problems can still be very demanding. Furthermore, choosing an appropriate RKHS also introduces questions regarding model selection. An appealing property of our estimator is that we do not need to solve minimization problems over function classes; we only need to calculate certain k-nearest-neighbor (k-NN) based statistics. Recently, Sricharan et al. (2010) proposed k-nearest-neighbor based methods for estimating non-linear functionals of a density, but in contrast to our approach, they were interested in the case where k increases with the sample size.

Our work borrows ideas from Leonenko et al. (2008a) and Goria et al. (2005), who considered Shannon and Rényi-α entropy estimation from a single sample.¹ In contrast, we propose divergence estimators using two independent samples. Recently, Póczos et al. (2010) and Pál et al. (2010) proposed methods for consistent Rényi information estimation, but these estimators also use only one sample and cannot be used for estimating divergences. Further information and useful reviews of several different divergences can be found, e.g., in Villmann and Haase (2010), Cichocki et al. (2009), and Wang et al. (2009b).

¹The original presentations of these works contained some errors; Leonenko and Pronzato (2010) provide corrections for some of these theorems.
The paper is organized as follows. In the next section we formally define our estimation problem, introduce the Rényi-α and Tsallis-α divergences, and explain their most important properties. Section 3 briefly introduces k-NN based density estimators. We propose estimators for the Rényi-α and Tsallis-α divergences in Section 4 and also present our most important theoretical results about the asymptotic unbiasedness and consistency of the estimators. For their analysis we will need a few general tools, which we collect in Section 5. We prove the asymptotic unbiasedness of our estimators in Section 6. Due to a lack of space, we provide many details of the proofs in Póczos and Schneider (2011). The analysis of the asymptotic variances of our estimators follows an approach similar to that of their biases but is more complex; therefore, we relegate this material to Póczos and Schneider (2011) as well. Section 7 contains the results of numerical experiments that demonstrate the effectiveness of our proposed algorithm. We also demonstrate in that section how our divergence estimators can be used for Rényi and Tsallis mutual information (MI) estimation. Finally, we conclude with a discussion of our work.
2 Divergences
For the remainder of this work we will assume that M₀ ⊂ Rᵈ is a measurable set with respect to the d-dimensional Lebesgue measure and that p and q are densities on this domain. The sets where they are strictly positive will be denoted by supp(p) and supp(q), respectively.
Let p, q : M₀ → R be density functions, and let α ∈ R \ {0, 1}. The α-divergence D̃α(p‖q) (Cichocki et al., 2008) is defined as

$$\tilde{D}_\alpha(p\|q) \doteq \frac{1}{\alpha(1-\alpha)}\left[1 - \int_{M_0} p^{\alpha}(x)\, q^{1-\alpha}(x)\, dx\right], \qquad (1)$$

assuming this integral exists. One can see that this is a special case of Csiszár's f-divergence (Csiszár, 1967), and hence it is always nonnegative.² Closely related to (1) (but not special cases of it) are the Rényi-α (Rényi, 1961) and the Tsallis-α (Villmann and Haase, 2010) divergences.
Definition 1. Let p, q : M₀ → R be density functions and let α ∈ R \ {1}. The Rényi-α divergence is defined as

$$R_\alpha(p\|q) \doteq \frac{1}{\alpha-1}\log \int_{M_0} p^{\alpha}(x)\, q^{1-\alpha}(x)\, dx. \qquad (2)$$

The Tsallis-α divergence is defined as

$$T_\alpha(p\|q) \doteq \frac{1}{\alpha-1}\left(\int_{M_0} p^{\alpha}(x)\, q^{1-\alpha}(x)\, dx - 1\right). \qquad (3)$$

Both definitions assume that the corresponding integral exists.
We can see that as α → 1 these divergences converge to the KL divergence. The following lemma summarizes the behavior of these divergences.

Lemma 2.

$$\begin{aligned}
\alpha < 0 &\Rightarrow R_\alpha(p\|q) \le 0,\ T_\alpha(p\|q) \le 0\\
\alpha = 0 &\Rightarrow R_\alpha(p\|q) = T_\alpha(p\|q) = 0\\
0 < \alpha < 1 &\Rightarrow R_\alpha(p\|q) \ge 0,\ T_\alpha(p\|q) \ge 0\\
\alpha = 1 &\Rightarrow R_\alpha(p\|q) = T_\alpha(p\|q) = \mathrm{KL}(p\|q) \ge 0\\
1 < \alpha &\Rightarrow R_\alpha(p\|q) \ge 0,\ T_\alpha(p\|q) \ge 0.
\end{aligned}$$
We are now prepared to formally define the goal of our paper. Given two independent i.i.d. samples from
distributions with densities p and q, respectively, we provide an L2-consistent estimator for

$$D_\alpha(p\|q) \doteq \int_{M_0} p^{\alpha}(x)\, q^{1-\alpha}(x)\, dx. \qquad (4)$$

By plugging our estimate of (4) into (3) and (2), we immediately get an L2-consistent estimator for Tα(p‖q), as well as a weakly consistent estimator for Rα(p‖q) for α ≠ 1.

²See the Appendix for more details.
3 k-NN Based Density Estimators
In the remainder of this paper we will heavily exploit some properties of k-NN based density estimators. In this section we define these estimators and briefly summarize their most important properties.

k-NN density estimators operate using only distances between the observations in a given sample and their kth nearest neighbors (breaking ties arbitrarily). Let X_{1:n} ≐ (X₁, …, Xₙ) be an i.i.d. sample from a distribution with density p, and similarly let Y_{1:m} ≐ (Y₁, …, Yₘ) be an i.i.d. sample from a distribution having density q. Let ρ_k(i) denote the Euclidean distance of the kth nearest neighbor of X_i in the sample X_{1:n}, and similarly let ν_k(i) denote the distance of the kth nearest neighbor of X_i in the sample Y_{1:m}. Let B(x, R) denote a closed ball around x ∈ Rᵈ with radius R, and let V(B(x, R)) = c̄Rᵈ be its volume, where c̄ stands for the volume of a d-dimensional unit ball.
Loftsgaarden and Quesenberry (1965) define the k-NN based density estimators of p and q at X_i as follows.

Definition 3 (k-NN based density estimators).

$$\hat{p}_k(X_i) = \frac{k/(n-1)}{V\big(B(X_i, \rho_k(i))\big)} = \frac{k}{(n-1)\,\bar{c}\,\rho_k^d(i)}, \qquad (5)$$

$$\hat{q}_k(X_i) = \frac{k/m}{V\big(B(X_i, \nu_k(i))\big)} = \frac{k}{m\,\bar{c}\,\nu_k^d(i)}. \qquad (6)$$
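To make Definition 3 concrete, here is a minimal NumPy/SciPy sketch of (5) and (6); the helper names and the `leave_one_out` switch are our own illustrative choices, not part of the paper.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma as gamma_fn

def unit_ball_volume(d):
    """Volume c-bar of the d-dimensional Euclidean unit ball."""
    return np.pi ** (d / 2) / gamma_fn(d / 2 + 1)

def knn_density(queries, sample, k, leave_one_out=False):
    """k-NN density estimate, eqs. (5)-(6), at each query point.

    With leave_one_out=True the queries are assumed to be the sample
    itself, so the zero self-distance is discarded and the normalization
    uses n - 1 as in (5); otherwise it uses the sample size as in (6).
    """
    n, d = sample.shape
    tree = cKDTree(sample)
    if leave_one_out:
        dist, _ = tree.query(queries, k=k + 1)  # first hit is the point itself
        rho, denom = dist[:, -1], n - 1
    else:
        dist, _ = tree.query(queries, k=k)
        rho = dist[:, -1] if k > 1 else np.ravel(dist)
        denom = n
    return k / (denom * unit_ball_volume(d) * rho ** d)
```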
The following theorems show the consistency of these density estimators.³

Theorem 4 (k-NN density estimators, convergence in probability). If k(n) denotes the number of neighbors applied at sample size n, lim_{n→∞} k(n) = ∞, and lim_{n→∞} n/k(n) = ∞, then p̂_{k(n)}(x) →_p p(x) for almost all x.

Theorem 5 (k-NN density estimators, almost sure convergence in sup norm). If lim_{n→∞} k(n)/log(n) = ∞ and lim_{n→∞} n/k(n) = ∞, then lim_{n→∞} sup_x |p̂_{k(n)}(x) − p(x)| = 0 almost surely.

³We use X_n →_p X and X_n →_d X to denote convergence of random variables in probability and in distribution, respectively. F_n →_w F will denote the weak convergence of distribution functions.
Note that these estimators are consistent only when k(n) → ∞. We will use these density estimators in our proposed divergence estimators; however, we will keep k fixed and will still be able to prove their consistency.
4 An Estimator for Dα(p‖q)
In this section we introduce our estimator for Dα(p‖q) and claim its L2 consistency in the form of several theorems. From now on we will assume that (4) can be rewritten as

$$D_\alpha(p\|q) = \int_{\mathcal{M}} \left(\frac{q(x)}{p(x)}\right)^{1-\alpha} p(x)\, dx, \qquad (7)$$

where M = supp(p). In other words, in the definition of Dα(p‖q) it is enough to integrate on the support of p. There are other possible ways to rewrite Dα(p‖q) (such as ∫(q/p)^{1−α} p, ∫(p/q)^{α} q, or ∫(q/p)^{−α} q), and we could start our analysis from these forms as well. If we simply plugged (5) and (6) into (7), then we could estimate Dα(p‖q) with

$$\frac{1}{n}\sum_{i=1}^{n} \left(\frac{(n-1)\,\rho_k^d(i)}{m\,\nu_k^d(i)}\right)^{1-\alpha};$$

however, this estimator is asymptotically biased. We will prove that, by introducing a multiplicative term, the following estimator is asymptotically unbiased under certain conditions:

$$\hat{D}_\alpha(X_{1:n}\|Y_{1:m}) \doteq \frac{1}{n}\sum_{i=1}^{n} \left(\frac{(n-1)\,\rho_k^d(i)}{m\,\nu_k^d(i)}\right)^{1-\alpha} B_{k,\alpha}, \qquad (8)$$

where

$$B_{k,\alpha} \doteq \frac{\Gamma(k)^2}{\Gamma(k-\alpha+1)\,\Gamma(k+\alpha-1)}.$$

Notably, this multiplicative bias correction does not depend on p or q. The following theorems of this section contain our main results: D̂α(X_{1:n}‖Y_{1:m}) is an L2-consistent estimator for Dα(p‖q); i.e., it is asymptotically unbiased, and the variance of the estimator is asymptotically zero.
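Since (8) only involves two k-d tree queries and a Gamma-function ratio, it is straightforward to implement. The sketch below is our own code (the function names are hypothetical); it computes D̂α and the resulting plug-in Rényi and Tsallis estimates from (2) and (3).

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln

def d_alpha_estimate(x, y, alpha, k=4):
    """Estimator (8) for D_alpha(p||q) from samples x ~ p and y ~ q.

    x: (n, d) array, y: (m, d) array; requires alpha != 1 and |1 - alpha| < k.
    """
    n, d = x.shape
    m = y.shape[0]
    # rho_k(i): distance from X_i to its k-th neighbor in X_{1:n} \ {X_i}
    rho = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # nu_k(i): distance from X_i to its k-th neighbor in Y_{1:m}
    nu = cKDTree(y).query(x, k=k)[0]
    nu = nu[:, -1] if k > 1 else np.ravel(nu)
    # bias correction B_{k,alpha} = Gamma(k)^2 / (Gamma(k-alpha+1) Gamma(k+alpha-1)),
    # computed in log-space for numerical stability
    log_b = 2 * gammaln(k) - gammaln(k - alpha + 1) - gammaln(k + alpha - 1)
    ratios = ((n - 1) * rho ** d) / (m * nu ** d)
    return np.mean(ratios ** (1 - alpha)) * np.exp(log_b)

def renyi_alpha(x, y, alpha, k=4):
    """Plug-in Rényi-alpha divergence estimate via (2)."""
    return np.log(d_alpha_estimate(x, y, alpha, k)) / (alpha - 1)

def tsallis_alpha(x, y, alpha, k=4):
    """Plug-in Tsallis-alpha divergence estimate via (3)."""
    return (d_alpha_estimate(x, y, alpha, k) - 1) / (alpha - 1)
```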
In our theorems we will assume that almost all points of M are in its interior and that M has the additional boundary-regularity property stated in (13) below.
Theorem 6 (Asymptotic unbiasedness). Assume that (a) 0 < γ ≐ 1 − α < k, (b) p is bounded away from zero, (c) p is uniformly Lebesgue approximable, (d) there exists δ₀ such that for all δ ∈ (0, δ₀), ∫_M H(x, p, δ, 1) p(x) dx < ∞, (e) ∫_M ‖x − y‖^γ p(y) dy < ∞ for almost all x ∈ M, (f) ∫∫_{M²} ‖x − y‖^γ p(y) p(x) dy dx < ∞, and (g) q is bounded from above. Then

$$\lim_{n,m\to\infty} \mathbb{E}\big[\hat{D}_\alpha(X_{1:n}\|Y_{1:m})\big] = D_\alpha(p\|q),$$

i.e., the estimator is asymptotically unbiased.
For the definition of a uniformly Lebesgue approximable function, see Definition 14. The following theorem states that the estimator is also asymptotically unbiased when −k < γ ≐ 1 − α < 0.

Theorem 7 (Asymptotic unbiasedness). Assume that (a) −k < γ ≐ 1 − α < 0, (b) q is bounded away from zero, (c) q is uniformly Lebesgue approximable, (d) there exists δ₀ such that for all δ ∈ (0, δ₀), ∫_M H(x, q, δ, 1) p(x) dx < ∞, (e) ∫_M ‖x − y‖^γ q(y) dy < ∞ for almost all x ∈ M, (f) ∫∫_{M²} ‖x − y‖^γ q(y) p(x) dy dx < ∞, (g) p is bounded from above, and (h) supp(p) ⊆ supp(q). In this case, the estimator is asymptotically unbiased.
The following theorems provide conditions under which D̂ is L2 consistent. In the previous theorems we stated conditions that lead to asymptotically unbiased divergence estimation. In all of the following theorems we will assume that the estimator is asymptotically unbiased for the parameter γ = 1 − α as well as for a new parameter γ̃ ≐ 2(1 − α) (corresponding to α̃ ≐ 2α − 1), and we will also assume that max(Dα(p‖q), Dα̃(p‖q)) < ∞.
5 General Tools

[Figure 1: A possible allowed domain (a) and a not-allowed domain (b) for M under the property in (13).]
Lemma 13 (Lebesgue (1910)). If g ∈ L¹(Rᵈ), then for any sequence of open balls B(x, Rₙ) with radius Rₙ → 0 and for almost all x ∈ Rᵈ,

$$\lim_{n\to\infty} \frac{\int_{B(x,R_n)} g(t)\, dt}{V\big(B(x,R_n)\big)} = g(x). \qquad (11)$$

This implies that if M ⊂ Rᵈ is a Lebesgue-measurable set and g ∈ L¹(M), then for any sequence Rₙ → 0, for any δ > 0, and for almost all x ∈ M, there exists an n₀(x, δ) ∈ Z⁺ such that if n > n₀(x, δ), then

$$g(x) - \delta < \frac{\int_{B(x,R_n)} g(t)\, dt}{V\big(B(x,R_n)\big)} < g(x) + \delta. \qquad (12)$$
We will later require a generalization of this property; namely, we will need it to hold uniformly over x ∈ M. However, for this generalization to hold we must put slight restrictions on the domain M to avoid effects around its boundary. We will consider only those domains M that possess the property that the intersection of M with an arbitrarily small ball centered in M has volume that cannot be arbitrarily small relative to the volume of the ball. More formally, we want the following inequality to be satisfied:

$$\inf_{0<\delta<\delta_0}\ \inf_{x\in\mathcal{M}} \frac{V\big(B(x,\delta)\cap\mathcal{M}\big)}{V\big(B(x,\delta)\big)} > 0. \qquad (13)$$

Definition 14 (Uniformly Lebesgue approximable). We say that g is uniformly Lebesgue approximable on M if for any sequence Rₙ → 0 and for any δ > 0, there exists an n₀ = n₀(δ) ∈ Z⁺ (independent of x) such that if n > n₀, then for almost all x ∈ M,

$$g(x) - \delta < \frac{\int_{B(x,R_n)\cap\mathcal{M}} g(t)\, dt}{V\big(B(x,R_n)\cap\mathcal{M}\big)} < g(x) + \delta. \qquad (14)$$
This property is a uniform variant of (12). The following lemma provides an example of uniformly Lebesgue approximable functions.

Lemma 15. If g is uniformly continuous on M, then it is uniformly Lebesgue approximable on M.
Finally, as we proceed we will frequently use the following lemma.

Lemma 16 (Moments of the Erlang distribution). Let

$$f_{x,k}(u) \doteq \frac{1}{\Gamma(k)}\, \lambda^{k}(x)\, u^{k-1} \exp\big(-\lambda(x)u\big)$$

be the density of the Erlang distribution with parameters λ(x) > 0 and k ∈ Z⁺, and let γ ∈ R be such that γ + k > 0. The γth moment of this Erlang distribution is

$$\int_0^{\infty} u^{\gamma} f_{x,k}(u)\, du = \lambda(x)^{-\gamma}\, \frac{\Gamma(k+\gamma)}{\Gamma(k)}.$$
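As a quick numerical sanity check of Lemma 16 (our own illustration, not part of the paper), one can compare the closed form against quadrature:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as G

lam = 2.5  # lambda(x)
for k, g in [(4, 0.3), (4, -0.7), (8, 1.9)]:  # requires g + k > 0
    # Erlang(k, lam) density
    pdf = lambda u, k=k: lam**k * u**(k - 1) * np.exp(-lam * u) / G(k)
    numeric = quad(lambda u, k=k, g=g: u**g * pdf(u, k), 0, np.inf)[0]
    closed = lam**(-g) * G(k + g) / G(k)
    print(k, g, round(numeric, 6), round(closed, 6))  # the two columns agree
```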
6 Proving Asymptotic Unbiasedness
The following subsection contains several specific lemmas and theorems that we will use to prove the consistency of the proposed estimator (8).
6.1 Preliminaries
Recall that ρ_k(j) is a random variable that measures the distance from X_j to its kth nearest neighbor in X_{1:n} \ {X_j}.

Lemma 17. Let ζ_{n,k,1} ≐ (n − 1) ρ_k^d(1) be a random variable, and let F_{n,k,x}(u) ≐ Pr(ζ_{n,k,1} < u | X₁ = x) denote its conditional distribution function. Then

$$F_{n,k,x}(u) = 1 - \sum_{j=0}^{k-1} \binom{n-1}{j} \big(P_{n,u,x}\big)^{j} \big(1 - P_{n,u,x}\big)^{n-1-j}, \qquad (15)$$

where P_{n,u,x} ≐ ∫_{M ∩ B(x, R_n(u))} p(t) dt and R_n(u) ≐ (u/(n−1))^{1/d}.
We also have the following result (Leonenko et al., 2008a).

Lemma 18. F_{n,k,x} →_w F_{k,x} for almost all x ∈ M, where

$$F_{k,x}(u) \doteq 1 - \exp(-\lambda u) \sum_{j=0}^{k-1} \frac{(\lambda u)^{j}}{j!}$$

is the Erlang distribution function with λ = c̄ p(x).
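Lemma 18 is easy to visualize empirically (again, our own illustration): fix x, repeatedly draw X₂, …, Xₙ from p, record ζ_{n,k,1} = (n−1)ρ_k^d(1), and compare the empirical quantiles with those of the Erlang limit, i.e., a Gamma distribution with shape k and rate λ = c̄ p(x).

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import gamma as gamma_dist

rng = np.random.default_rng(0)
n, k, d, x0, trials = 2000, 3, 1, 0.0, 5000

zeta = np.empty(trials)
for t in range(trials):
    others = rng.standard_normal((n - 1, d))              # X_2, ..., X_n ~ N(0, 1)
    rho_k = cKDTree(others).query([[x0]], k=k)[0][0, -1]  # k-th NN distance from x0
    zeta[t] = (n - 1) * rho_k ** d

lam = 2 * np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi)  # c-bar * p(x0); c-bar = 2 in 1d
print(np.quantile(zeta, [0.25, 0.5, 0.75]))                   # empirical quantiles
print(gamma_dist.ppf([0.25, 0.5, 0.75], a=k, scale=1 / lam))  # Erlang limit
```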
Lemma 19. Let ξ_{n,k,x} and ξ_{k,x} be random variables with distribution functions F_{n,k,x} and F_{k,x}, respectively, and let γ ∈ R be arbitrary. Then for almost all x ∈ M we have that ξ^γ_{n,k,x} →_d ξ^γ_{k,x}.
Theorem 20. For almost all x ∈ M the following statements hold. If (i) −k < γ < 0, or (ii) 0 ≤ γ and ∫_M ‖x − y‖^γ p(y) dy < ∞, then

$$\lim_{n\to\infty} \mathbb{E}\big[\xi^{\gamma}_{n,k,x}\big] = \mathbb{E}\big[\xi^{\gamma}_{k,x}\big] = \big(\bar{c}\,p(x)\big)^{-\gamma}\,\frac{\Gamma(k+\gamma)}{\Gamma(k)}.$$
Similarly, if (i) −k < γ < 0, or (ii) 0 ≤ γ and ∫_M ‖x − y‖^γ q(y) dy < ∞, then the corresponding statement holds with q in place of p and ν_k in place of ρ_k. The proofs of both statements use that γ + k > 0 and apply Lemma 16. All that remains is to prove that if ξ^γ_{n,k,x} →_d ξ^γ_{k,x}, then E[ξ^γ_{n,k,x}] → E[ξ^γ_{k,x}]. To see this, it is enough to show (according to Theorem 12) that for some ε > 0 and c(x) < ∞, the moments E[|ξ^γ_{n,k,x}|^{1+ε}] are bounded by c(x) uniformly in n.

Lemma 22. Let 0 < γ ≐ 1 − α < k, let p be uniformly Lebesgue approximable on M and bounded away from zero, and let q be bounded above by q̄. Let δ₁ > 0, and let δ > 0 be so small that p(x) − δ > 0 for all x ∈ M. Then there exists an N_{p,q} > 0 such that if m, n > N_{p,q}, then for almost all x ∈ M,

$$f_n(x)\, g_m(x) \le \gamma^2 L(x, 1, k, \gamma, p, \delta, \delta_1)\left[\frac{\hat{L}(\bar{q}, 1)}{k - \gamma} + \frac{1}{\gamma}\right],$$

where L̂(q̄, β) ≐ (q̄ c̄)^k exp(q̄ c̄ β).
Similarly, for the −k < γ ≐ 1 − α < 0 case we have the following lemma.

Lemma 23. Let −k < γ ≐ 1 − α < 0, and let supp(p) ⊆ supp(q). Furthermore, let q be uniformly Lebesgue approximable on M = supp(p) and bounded away from zero, and let p be bounded above by p̄. Let δ₁ > 0, and let δ > 0 be so small that q(x) − δ > 0 for all x ∈ M. Then there exists an N_{p,q} > 0 such that if m, n > N_{p,q}, then for almost all x ∈ supp(p),

$$f_n(x)\, g_m(x) \le \gamma^2 L(x, 1, k, -\gamma, q, \delta, \delta_1)\left[\frac{\hat{L}(\bar{p}, 1)}{k + \gamma} - \frac{1}{\gamma}\right].$$
Now, for the two cases 0 < γ = 1 − α < k and −k < γ = 1 − α < 0, we can see that under the conditions detailed in Theorems 6 and 7 there exists a function J and a threshold number N_{p,q} such that if n, m > N_{p,q}, then for almost all x ∈ M, f_n(x) g_m(x) ≤ J(x) and ∫_M J(x) p(x) dx < ∞. Applying the Lebesgue dominated convergence theorem finishes the proofs of these theorems.
7 Numerical Experiments
In this section we present a few numerical experiments to demonstrate the consistency of the proposed divergence estimators. We run experiments on beta distributions, whose domains are bounded, and we also study normal distributions, which have unbounded domains. We chose these distributions because in these cases the divergences have known closed-form expressions, and thus it is easy to evaluate our methods. We will also demonstrate that the proposed divergence estimators can be applied to estimate mutual information. We note that in our simulations the numerical results were very similar for the estimation of Rα and Tα; therefore, we present our results for the Rα case only.
7.1 Normal distributions
We begin our discussion by investigating the performance of our divergence estimators on normal distributions. Note that when α ∉ [0, 1], the divergences can easily become unbounded.⁵

⁵See the Appendix for the details.
In Figure 2 we display the performance of the proposed D̂α and R̂α divergence estimators when the underlying densities were zero-mean Gaussians with randomly chosen 5-dimensional covariance matrices. Our results demonstrate that as we increase the sample sizes n and m, the D̂α and R̂α values converge to their true values. For simplicity, in our experiments we always set n = m. The figures show five independent experiments; the number of instances was varied between 50 and 25 000. The number of nearest neighbors k was set to 8, and α to 0.8.
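For readers who want to reproduce a small-scale version of this experiment, the following sketch (our own, in 1d so that the ground truth is available by quadrature rather than the paper's closed form) reuses the hypothetical renyi_alpha helper from the Section 4 sketch:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

alpha, k = 0.8, 8
p, q = norm(0.0, 1.0), norm(0.5, 1.5)  # two 1d Gaussians

# ground truth D_alpha = int p^alpha q^(1-alpha) by quadrature, then (2)
true_d = quad(lambda t: p.pdf(t)**alpha * q.pdf(t)**(1 - alpha),
              -np.inf, np.inf)[0]
true_r = np.log(true_d) / (alpha - 1)

rng = np.random.default_rng(1)
for n in [100, 1000, 10000]:
    x = p.rvs(size=(n, 1), random_state=rng)
    y = q.rvs(size=(n, 1), random_state=rng)
    print(n, renyi_alpha(x, y, alpha, k), true_r)  # estimate vs. truth
```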
7.2 Beta distributions
We were also interested in examining the performance of our estimators on beta distributions. To be able to study multidimensional cases, we construct d-dimensional distributions with independent 1-dimensional beta distributions as marginals. For a closed-form expression of the true divergence in this case, see the Appendix.
[Figure 2: Estimated vs. true divergence for the normal distribution experiments as a function of the number of observations. The results of five independent experiments are shown for estimating the (a) Dα(f‖g) and (b) Rα(f‖g) divergences.]
Our first experiment, illustrated in Figures 3(a)–3(b), indicates that the estimators are consistent when d = 2; as we increase the number of instances, the estimates converge to the true Dα(f‖g) and Rα(f‖g) values. The figures show five independent experiments, varying the sample size between 100 and 10 000. α was set to 0.4, and we used k = 4 nearest neighbors in the density estimates. The parameters of the beta distributions were chosen independently and uniformly at random from [1, 2]. We repeated this experiment in 5d as well. The 5d results, shown in Figures 3(c)–3(d), show that the estimators were consistent in this case too.
7.3 Mutual information estimation
In this section we demonstrate that the proposed divergence estimators can also be used to estimate mutual information. Let f be the density of a d-dimensional distribution, and let f₁, …, f_d denote the densities of its marginals. The mutual information Iα(f) is the divergence between f and the product of the marginals; in particular, for the Rényi divergence we have Iα(f) = Rα(f‖∏_{i=1}^{d} f_i). Therefore, if we are given a sample X₁, …, X₂ₙ from f, we may estimate the mutual information as follows. We form one set of size n by setting aside the first n observations. We build another sample by randomly permuting the coordinates of the remaining n observations, independently for each coordinate, to form n instances sampled from ∏_{i=1}^{d} f_i. Using these two sets, we can estimate Iα(f). Figures 4(a)–4(b) show the results of applying this procedure to a 2d Gaussian distribution with a randomly chosen covariance matrix. The subfigures show the true Dα and Rα values, as well as their estimates using different sample sizes. k was set to 8, and α was 0.8.
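A minimal sketch of this permutation trick (our own code, again reusing the hypothetical renyi_alpha helper from the Section 4 sketch):

```python
import numpy as np

def renyi_mutual_information(data, alpha, k=8, rng=None):
    """Estimate I_alpha(f) = R_alpha(f || prod_i f_i) from a (2n, d) sample.

    The first n rows estimate the joint f; in the remaining n rows each
    coordinate is permuted independently to mimic a sample from the
    product of the marginals.
    """
    rng = np.random.default_rng(rng)
    n = data.shape[0] // 2
    joint = data[:n]
    product = data[n:2 * n].copy()
    for j in range(product.shape[1]):
        product[:, j] = rng.permutation(product[:, j])
    return renyi_alpha(joint, product, alpha, k)
```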
Figures 4(c)–4(d) show the results of repeating the previous experiment with two alterations.
[Figure 3: Estimated vs. true divergence for the beta distribution experiments as a function of the number of observations. The figures show the results of five independent experiments for estimating the Dα(f‖g) and Rα(f‖g) divergences. (a, b): f and g were the densities of two 2d beta distributions; the marginal distributions were independent 1d betas with randomly chosen parameters. (c, d): The same as (a, b), but here f and g were the densities of two 5d beta distributions with independent marginals.]
[Figure 4: Estimated vs. true Rényi information for the mutual information experiments as a function of the number of observations. (a) and (b) show five independent experiments for estimating Dα(f‖g) and Iα(f) = Rα(f‖g) for a 2d Gaussian distribution using sample sizes between 100 and 20 000. In (c)–(d), we estimated the mutual information between the marginals of a 2d uniform distribution rotated by π/4. The sample size was varied from 500 to 40 000.]
In this case we estimated the Shannon (rather than Rényi) information, and for this purpose we selected a 2d uniform distribution on [−1/2, 1/2]² rotated by π/4. Due to this rotation, the marginal distributions are no longer independent. Because our goal was to estimate the Shannon information, we set α to 0.9999. The number of nearest neighbors used was k = 8, and the sample size was varied between 500 and 40 000. The estimators gave quite good results for the Shannon mutual information as well as for D₁(f‖∏_{i=1}^{d} f_i) = 1.
8 Discussion and Conclusion
We have derived a new nonparametric estimator for the Rényi-α and Tsallis-α divergences, two important quantities with several applications in machine learning and statistics. Under certain conditions we showed the consistency of these estimators and demonstrated how they can be applied to estimate mutual information. We also demonstrated their efficiency using numerical experiments.

The main idea in the proofs of our new theorems was that the expected value of our estimator can be rewritten as in (16). We showed that asymptotically the terms inside this expectation converge to the Erlang distribution and applied the well-known formulas for its moments. The main difficulty was to show that we can indeed switch the limit and expectation operators; that is, the limit of the expectations equals the expectation of the limit of the random variables. For this purpose we bounded these random variables from above and applied the Lebesgue dominated convergence theorem. To derive a bound on these random variables, we made several assumptions on the densities p and q.

There remain some open issues: the conditions of the theorems could be weakened considerably, and the rates of the estimators are still unknown. It would also be desirable to investigate whether the proposed estimators are asymptotically normal.
References
Cichocki, A., Lee, H., Kim, Y.-D., and Choi, S. (2008). Non-negative matrix factorization with α-divergence. Pattern Recognition Letters.
Cichocki, A., Zdunek, R., Phan, A., and Amari, S.-I. (2009). Nonnegative Matrix and Tensor Factorizations. John Wiley and Sons.

Csiszár, I. (1967). Information-type measures of differences of probability distributions and indirect observations. Studia Sci. Math. Hungarica, 2:299–318.

Goria, M. N., Leonenko, N. N., Mergel, V. V., and Inverardi, P. L. N. (2005). A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics, 17:277–297.

Gupta, M. and Srivastava, S. (2010). Parametric Bayesian estimation of differential entropy and relative entropy. Entropy, 12:818–843.

Hero, A. O., Ma, B., Michel, O., and Gorman, J. (2002a). Alpha-divergence for classification, indexing and retrieval. Communications and Signal Processing Laboratory Technical Report CSPL-328.

Hero, A. O., Ma, B., Michel, O. J. J., and Gorman, J. (2002b). Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85–95.

Kozachenko, L. F. and Leonenko, N. N. (1987). A statistical estimate for the entropy of a random vector. Problems of Information Transmission, 23:9–16.

Leonenko, N. and Pronzato, L. (2010). Correction of ‘A class of Rényi information estimators for multidimensional densities’, Ann. Statist., 36 (2008) 2153–2182.

Leonenko, N., Pronzato, L., and Savani, V. (2008a). A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 36(5):2153–2182.

Leonenko, N., Pronzato, L., and Savani, V. (2008b). Estimation of entropies and divergences via nearest neighbours. Tatra Mt. Mathematical Publications, 39.

Loftsgaarden, D. O. and Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate density function. Ann. Math. Statist., 36:1049–1051.

Nguyen, X., Wainwright, M., and Jordan, M. (2009). On surrogate loss functions and f-divergences. Annals of Statistics, 37:876–904.

Nguyen, X., Wainwright, M., and Jordan, M. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, to appear.

Pál, D., Póczos, B., and Szepesvári, C. (2010). Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Proceedings of the Neural Information Processing Systems.

Póczos, B., Kirshner, S., and Szepesvári, C. (2010). REGO: Rank-based estimation of Rényi information using Euclidean graph optimization. In AISTATS 2010.

Póczos, B. and Schneider, J. (2011). On the estimation of alpha-divergences. CMU, Auton Lab Technical Report, http://www.cs.cmu.edu/~bapoczos/articles/poczos11alphaTR.pdf.

Rényi, A. (1961). On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press.

Rényi, A. (1970). Probability Theory. North-Holland Publishing Company, Amsterdam.

Sricharan, K., Raich, R., and Hero, A. (2010). Empirical estimation of entropy functionals with confidence. Technical Report, http://arxiv.org/abs/1012.4188.

van der Vaart, A. W. (2007). Asymptotic Statistics. Cambridge University Press.

Villmann, T. and Haase, S. (2010). Mathematical aspects of divergence based vector quantization using Fréchet-derivatives. University of Applied Sciences Mittweida.

Wang, Q., Kulkarni, S. R., and Verdú, S. (2009a). Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5).

Wang, Q., Kulkarni, S. R., and Verdú, S. (2009b). Universal estimation of information measures for analog sources. Foundations and Trends in Communications and Information Theory, 5(3):265–352.