Tailoring Density Estimation via Reproducing Kernel Moment Matching

Le Song¹  Xinhua Zhang¹  Alex Smola¹  Arthur Gretton²  Bernhard Schölkopf²

¹ Statistical Machine Learning Program, NICTA, Canberra, Australia
² Max Planck Institute for Biological Cybernetics, Tübingen, Germany

International Conference on Machine Learning, Helsinki, Finland, July 2008
Outline
Motivation of our algorithm and estimation bounds
Formulation of our algorithm: quadratic programming
Experimental results
Motivation: Tailoring density estimation
Density estimation is often NOT the ultimate goal
E.g., interested in expectations of a random variable (r.v.), or of functions of the r.v.; e.g., parameter estimation for graphical models (gradients)
Hence, not clear whether maximum likelihood is ideal
Full density estimation by MLE, for arbitrary functions
v.s.
Focus on approximating the expectations of a set of functions known a priori

Given a distribution p and function class F, find a distribution p̂ such that

  |E_{x∼p}[f(x)] − E_{x∼p̂}[f(x)]| < ε  ∀ f ∈ F
Idea of touchstone function classes
Similar spirit
Weak convergence of probability measures
A sequence of probability measures (µ_n)_{n≥1} converges weakly to µ if

  ∫ f(x) µ_n(dx) → ∫ f(x) µ(dx)  as n → ∞

for all f that are real-valued, continuous, and bounded on R^d.
Independence criteria [Renyi59]
For sufficiently rich function classes, the function correlation or cross-covariance serves as an independence test
Density estimation [Shawe-Taylor and Dolia 07]:
Loss measured on a set of randomly drawn touchstone functions
Choice of function class: RKHS embeddings of distributions

Pick: F := {f ∈ H : ‖f‖_H ≤ 1}. Given: distribution p(x) and kernel k(·, ·) ⇒ RKHS H.

  sup_{‖f‖_H ≤ 1} | E_{x∼p}[f(x)] − E_{x∼p̂}[f(x)] | = ‖ µ[p] − µ[p̂] ‖_H

Key idea: embed p and p̂ into the RKHS by the kernel mean map:

  µ[p] := E_{x∼p}[k(x, ·)] = ∫_X k(x, ·) p(x) dx

Naturally, one expects: p̂ ≈ p ⇔ ‖µ[p] − µ[p̂]‖_H is small.
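The RKHS distance above has a simple empirical counterpart: for two samples, ‖µ[X] − µ[Y]‖²_H expands into three kernel averages. A minimal sketch with a Gaussian RBF kernel (the function name and default bandwidth are illustrative, not from the talk):

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased empirical estimate of ||mu[X] - mu[Y]||_H^2,
    using the Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    # ||mu[X] - mu[Y]||^2 = <mu[X],mu[X]> + <mu[Y],mu[Y]> - 2 <mu[X],mu[Y]>
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()
```

For two samples from the same distribution the estimate is close to zero; shifting one sample makes it grow.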
Estimation bounds
true density p        ⇒ µ[p] = E_{x∼p}[k(x, ·)]
sample X = {x_1, …, x_n} ⇒ µ[X] := (1/n) Σ_{i=1}^n k(x_i, ·)
estimated density p̂   ⇒ µ[p̂] = E_{x∼p̂}[k(x, ·)]

[Diagram: ‖µ[p̂] − µ[X]‖ is the proxy distance to minimize in practice; ‖µ[p̂] − µ[p]‖ is the ultimate distance to minimize; the gap between them is controlled by the Rademacher average R_m(H, p).]

With probability 1 − exp(−ε² m R_m^{−2} / 2), we have:

  ‖µ[p̂] − µ[p]‖_H ≤ 2 R_m(H, p) + ‖µ[X] − µ[p̂]‖_H + ε
Formulation: Quadratic Programming
Suppose p̂ = Σ_{i=1}^m α_i p_i, where the component densities p_i are fixed and α lies in the m-dimensional probability simplex ∆_m.

  minimize_{p̂}  ‖µ[p̂] − µ[X]‖_H
      ⇕
  minimize_{α ∈ ∆_m}  (1/2) α⊤Qα − l⊤α

where

  Q_ij = ⟨µ[p_i], µ[p_j]⟩_H = E_{x∼p_i, x′∼p_j}[k(x, x′)]
  l_i  = ⟨µ[X], µ[p_i]⟩_H = (1/n) Σ_{s=1}^n E_{x∼p_i}[k(x_s, x)]

Closed-form formulae exist for Q_ij and l_i.
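For concreteness, here is a sketch of this QP for a Gaussian RBF kernel and fixed isotropic Gaussian components p_i = N(µ_i, τ²I), where Q_ij and l_i follow from the standard Gaussian-convolution identities. The helper name, default bandwidths, and the SLSQP solver choice are my assumptions, not the paper's:

```python
import numpy as np
from scipy.optimize import minimize

def kmm_weights(X, mu, sigma=1.0, tau=0.5):
    """Solve min_{a in simplex} 1/2 a^T Q a - l^T a for a mixture of fixed
    Gaussians N(mu_i, tau^2 I), kernel k(x,x') = exp(-||x-x'||^2/(2 sigma^2))."""
    n, d = X.shape
    m = mu.shape[0]
    # Closed form: Q_ij = E_{x~p_i, x'~p_j}[k(x, x')]
    s2 = sigma ** 2 + 2 * tau ** 2
    D2 = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    Q = (sigma ** 2 / s2) ** (d / 2) * np.exp(-D2 / (2 * s2))
    # Closed form: l_i = (1/n) sum_s E_{x~p_i}[k(x_s, x)]
    t2 = sigma ** 2 + tau ** 2
    E2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    l = ((sigma ** 2 / t2) ** (d / 2) * np.exp(-E2 / (2 * t2))).mean(axis=0)
    # Quadratic program over the probability simplex
    res = minimize(lambda a: 0.5 * a @ Q @ a - l @ a,
                   np.full(m, 1.0 / m),
                   jac=lambda a: Q @ a - l,
                   bounds=[(0.0, None)] * m,
                   constraints=[{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}],
                   method='SLSQP')
    return res.x
```

Components far from the data receive near-zero weight, since their l_i vanishes while their Q contribution does not.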
Experimental results
UCI dataset
Application 1: Message passing compression
Application 2: Image retrieval and categorization
UCI dataset
To show:
  KMM outperforms other methods at estimating the expectations of functions in the RKHS
  KMM does not outperform other algorithms in log likelihood

Algorithms under comparison:
  Kernel Moment Matching (KMM) ← ours
  Gaussian Mixture Model (GMM)
  Parzen window (PZ)
  Reduced Set Density Estimation (RSDE)

What to compare and how:
  1. Randomly generate a function f from the RKHS
  2. Use half of the data for density estimation p̂ and for computing E_{x∼p̂}[f(x)]
  3. Use the other half to compute the empirical average of f
  4. Report the relative discrepancy
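The four steps above can be sketched as follows; the density-estimation step is elided (a plain empirical average over the first half stands in for a fitted p̂), and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0

def k(A, B):
    """Gaussian RBF Gram matrix between row sets A and B."""
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))

data = rng.normal(size=(200, 2))
train, test = data[:100], data[100:]       # steps 2-3: split the data in half

# Step 1: a random RKHS function f(.) = sum_j beta_j k(z_j, .)
Z = rng.normal(size=(10, 2))
beta = rng.normal(size=10)
f = lambda A: k(A, Z) @ beta

# Step 2: estimate E_{x~p_hat}[f(x)] from the first half
# (stand-in: the empirical distribution of `train` plays the role of p_hat)
E_phat = f(train).mean()

# Step 3: empirical average of f on the held-out half
E_emp = f(test).mean()

# Step 4: relative discrepancy
rel = abs(E_phat - E_emp) / max(abs(E_emp), 1e-12)
```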
Result of function expectation estimation on UCI datasets
Application 2: Image retrieval
Task: Given an image database D and a query image p, retrieve from D a set of images similar to p.

Idea:
  On each image, perform density estimation over the feature distribution
  Retrieve by ranking Dissimilarity(p, q) over q ∈ D
  Dissimilarity measure: Earth Mover's Distance (EMD) [Rubner et al., 98], applied to mixtures of Gaussians

Side note: EMD is not in the RKHS in general.
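As a rough illustration of EMD (the talk uses Rubner et al.'s EMD on Gaussian-mixture signatures; the 1-D special case over raw samples below is only a stand-in), SciPy provides the 1-D Wasserstein distance:

```python
from scipy.stats import wasserstein_distance

# 1-D Earth Mover's Distance between two empirical feature distributions.
# In 1-D it equals the area between the two empirical CDFs.
d = wasserstein_distance([0.0, 1.0, 3.0], [5.0, 6.0, 8.0])
print(d)  # each unit of mass travels 5 units, so the distance is 5.0
```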
Experimental results
Horizontal axis: number of retrieved images.
Vertical axis: signed log p-value of a paired sign test.
  > 2: GMM retrieved more correct images with significance < 0.01.
  < −2: KMM retrieved more correct images with significance < 0.01.
[Figure: signed log p-value (range about −9 to 3) vs. number of retrieved images (0 to 10000); panel shown: 'beach' category.]
Conclusion and discussion
Proposed a density estimation algorithm tailored to a particular function class, solvable by a simple QP

Proved uniform convergence guarantees for approximating function expectations

Experimental results show that KMM better approximates the expectations of functions in the RKHS, though it no longer maximizes likelihood
Closely connected to expectation propagation
Future directions:
Apply KMM to nonparametric loopy belief propagation for graphical model inference
Online data stream compression by solving the QP online