Page 1: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Computational Information Geometry on Matrix Manifolds

Frank [email protected]

www.informationgeometry.org

Sony Computer Science Laboratories, Inc.

July 2013, ICTP, Trieste, IT

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

Page 2: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Geometry of matrix manifolds...

◮ Euclidean geometry, Frobenius norm → distance:

$$\|M\|_F^2 = \sum_{i,j} m_{ij}^2 = \sum_i \|M_{i*}\|_2^2 = \sum_j \|M_{*j}\|_2^2 = \operatorname{tr}(M^\top M)$$

◮ Riemannian geometry of symmetric positive definite (SPD) matrices [9, 2]

◮ Riemannian geometry of rank-deficient positive semi-definite (SPSD) matrices: Stiefel/Grassmann manifolds [3]

◮ Quantum geometry: SPD matrices with unit trace

“One geometry cannot be more true than another; it can only be more convenient.”

— Jules Henri Poincaré (1902)


Page 3: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Forthcoming conference (GSI)

28th–30th August 2013, Paris.


Page 4: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

What is Computational Information Geometry?

◮ What is Information? = Essence of data (datum = “thing”) (make it tangible → e.g., parameters of generative models)

◮ Can we do intrinsic computing? (unbiased by any particular “data representation” → same results after recoding data)

◮ Geometry?! → Science of invariance (mother of Science, compass & ruler, Descartes' analytic = coordinate/Cartesian geometry, imaginaries, ...) ... the open-ended poetic mathematics!


Page 5: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Rationale for Computational Information Geometry

◮ Information is ... never void! → lower bounds
  ◮ Fisher information and Cramér-Rao lower bound (estimation)
  ◮ Bayes error and Chernoff information (classification)
  ◮ Coding and Shannon entropy (communication)
  ◮ Program and Kolmogorov complexity (compression) (unfortunately not computable!)

◮ Geometry:
  ◮ Language (point, line, ball, dimension, orthogonal, projection, geodesic, immersion, etc.)
  ◮ Power of characterization (e.g., the intersection of two pseudo-segments not admitting a closed-form expression)

◮ Computing: Information computing. Seeking mathematical convenience and mathematical tricks (RKHS in ML). How to manipulate “spaces of functions”?!


Page 6: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Example I: Matrix manifold
Pattern = Gaussian mixture models (universal class).
Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
Invariance: $x_i \sim N(\mu_i, \Sigma_i)$, $y = A(x) = Lx + t$, $y_i \sim N(L\mu_i + t, L\Sigma_i L^\top)$, $D(X_1 : X_2) = D(Y_1 : Y_2)$
($L$: any invertible linear map, $t$: a translation, so $A$ is an invertible affine transformation)

Shape Retrieval using Hierarchical Total Bregman Soft Clustering [7], IEEE PAMI, 2012.

Page 7: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Example II: Matrix manifolds
DTI: diffusion ellipsoids, tensor interpolation.
Pattern = zero-centered “Gaussians”.
Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
Invariance: $D(A^\top P A : A^\top Q A) = D(P : Q)$ for $A \in \mathrm{SL}(d)$, a unimodular (volume/orientation-preserving) matrix.

(3D rat corpus callosum)

Total Bregman Divergence and its Applications to DTI Analysis [20], IEEE TMI, 2011.

Page 8: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Example III: Gaussian manifolds

Consider 5D Gaussian Mixture Models (GMMs) of color images (image = RGBxy point set).

A Gaussian mixture model $\sum_i w_i N(\mu_i, \Sigma_i)$ is interpreted as a weighted point set $\{\theta_i = (\mu_i, \Sigma_i)\}$.


Page 9: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Matrix center points & clustering
Aggregation (matrix quantization for codebooks):
Given a data set of matrices $\mathcal{M} = \{M_1, \ldots, M_n\} \subset \mathbb{M}$, compute a center matrix $C$. Centering as a variational minimization problem:

$$(\mathrm{OPT}): \quad C_p = \arg\min_{C \in \mathbb{M}} \sum_i w_i\, \mathrm{distance}^p(C, M_i)$$

Notion of centrality, robustness to outliers? For diagonal matrices with the “Euclidean” distance, the usual geometric center points:

◮ median ($p = 1$): robust to outliers (Fermat-Weber point, no closed form),
◮ centroid ($p = 2$): breakdown point of 1 (→ tBD),
◮ circumcenter ($p \to \infty$): minimizes the farthest-point distance (minimax [1]).


Page 10: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Diffusion Tensor Magnetic Resonance Imaging
DT-MRI: measures the anisotropic diffusion of water molecules as a 3×3 tensor assigned to each voxel position (~1990). Used to analyze in-vivo connectivity patterns of brain tissues: gray matter, white matter (corpus callosum) and cerebrospinal fluid (CSF).

© Image courtesy of Peter J. Basser (Magnetic Resonance Imaging of the Brain and Spine, Chapter 31)


Page 11: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Gradiometry tensor: 3×3 SPSD matrices

Beyond the “constant” $g \approx 9.81\,\mathrm{m/s^2}$: measuring the anisotropy of the gravity field.

→ Oil & gas industry. Courtesy of BellGeo. http://www.bellgeo.com/tech/technology_theory_of_FTG.html


Page 12: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Structure tensors in computer vision

→ Pioneered in image processing: a tensor descriptor of a region at a pixel (Harris-Stephens [6]). Consider a kernel $K$ and compute the tensor descriptor

$$T(p = (x,y)) = K * \begin{bmatrix} I'(x)^2 & I'(x)\,I'(y) \\ I'(y)\,I'(x) & I'(y)^2 \end{bmatrix} = \sum_{u,v} w(u,v)\, \nabla I(u,v)\, (\nabla I(u,v))^\top$$

$K$: uniform or Gaussian kernel (e.g., an $s \times s$ window $W$ centered at the pixel $p$).
$I'(x)$, $I'(y)$: gradient, derivatives of the image.
Versatile method: corner detection, optical flow estimation, segmentation, stereo matching, etc. → Tensor image processing


Page 13: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Harris-Stephens structure tensor (1988)
Deformation tensor field

Harris-Stephens combined corner-edge detector:

$$R = \det T - k\,(\operatorname{tr} T)^2$$

→ Measures of tensor anisotropy. The structure tensor represents local orientation (eigenvectors/eigenvalues).
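A minimal sketch of the structure tensor and the Harris-Stephens response using NumPy/SciPy; the Gaussian window sigma and the constant k = 0.04 are illustrative choices, not from the slides:

```python
# Structure tensor T smoothed by a Gaussian kernel, and R = det T - k (tr T)^2.
import numpy as np
from scipy import ndimage

def harris_response(I, sigma=1.0, k=0.04):
    # Image gradients I'(y), I'(x) (np.gradient returns axis-0 then axis-1).
    Iy, Ix = np.gradient(I.astype(float))
    # Smoothed structure-tensor entries (the kernel K of the slide).
    Txx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Txy = ndimage.gaussian_filter(Ix * Iy, sigma)
    Tyy = ndimage.gaussian_filter(Iy * Iy, sigma)
    # Combined corner-edge measure at every pixel.
    det_T = Txx * Tyy - Txy ** 2
    tr_T = Txx + Tyy
    return det_T - k * tr_T ** 2
```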


Page 14: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Matrix with Frobenius metric distance

Matrix space M with vectorial structure

$$d_E(P,Q) = \|P - Q\|_F \qquad (1)$$
$$= \sqrt{\operatorname{tr}\left((P-Q)^\top (P-Q)\right)} \qquad (2)$$

Centroid of tensors:

$$C_E = \frac{1}{n} \sum_{i=1}^n w_i T_i$$

→ scalar average of each element of the tensor.
Tensor Field Segmentation Using Region Based Active Contour Model [21], ECCV, 2004.


Page 15: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Matrix vectorization & computational geometry
Computational geometry on $w \times h$-dimensional matrix spaces w.r.t. the Frobenius distance amounts to computational geometry on a Euclidean vector space of dimension $D = w \times h$.
→ Voronoi diagrams, smallest enclosing ball, minimum spanning tree, etc.
For symmetric matrices, we have $D = \frac{d(d+1)}{2}$ degrees of freedom, and vectorize as follows:

$$\|M\|_F = \sqrt{\sum_{i=1}^d \sum_{j=1}^d m_{ij}^2} = \sqrt{\sum_{i=1}^d m_{ii}^2 + 2\sum_{i=1}^{d-1}\sum_{j=i+1}^d m_{ij}^2} = \|m\|_2$$

with $m = [m_{11}, \ldots, m_{dd}, \sqrt{2}m_{12}, \ldots, \sqrt{2}m_{1d}, \ldots, \sqrt{2}m_{d-1,d}]^\top = \vec{M}$.
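A minimal NumPy sketch of this vectorization; the helper name sym_vec is ours:

```python
# Stack the diagonal and the sqrt(2)-scaled upper off-diagonal entries so that
# the Euclidean norm of the vector equals the Frobenius norm of the matrix.
import numpy as np

def sym_vec(M):
    d = M.shape[0]
    iu = np.triu_indices(d, k=1)          # strictly upper-triangular indices
    return np.concatenate([np.diag(M), np.sqrt(2.0) * M[iu]])

# Sanity check on a random symmetric matrix.
A = np.random.randn(4, 4); M = (A + A.T) / 2
assert np.isclose(np.linalg.norm(sym_vec(M)), np.linalg.norm(M, 'fro'))
```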

Page 16: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Matrix functions

From the spectral decomposition

$$M = U D U^\top$$

with $D = \lambda(M) = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ the diagonal matrix of eigenvalues, a real-valued function $x \mapsto f(x)$ extends to matrices as

$$f(M) = U\, \mathrm{diag}(f(\lambda_1), \ldots, f(\lambda_d))\, U^\top$$

Examples: $\log x$, $\exp x$, $|x|$, $x^{\frac12}$, $x^2$, etc.

$O(d^3)$ factorization (SVD/eigendecomposition) complexity.
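A minimal NumPy sketch of such a matrix function via the eigendecomposition (helper name is ours):

```python
# f(M) = U diag(f(lambda_1), ..., f(lambda_d)) U^T for symmetric M.
import numpy as np

def matrix_fn(M, f):
    lam, U = np.linalg.eigh(M)            # spectral decomposition, O(d^3)
    return U @ np.diag(f(lam)) @ U.T

# e.g., matrix logarithm and square root of an SPD matrix P:
# logm_P = matrix_fn(P, np.log); sqrtm_P = matrix_fn(P, np.sqrt)
```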


Page 17: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Riemannian cone of SPD matrices
Exponential map from the tangent planes (symmetric matrices Sym) to the manifold cone $\mathcal{C}$:

$$\exp_P : T_P\mathcal{C} = \mathrm{Sym} \to \mathcal{C}$$

Logarithmic map from the manifold cone $\mathcal{C}$ to the tangent planes:

$$\log_P : \mathcal{C} \to T_P\mathcal{C} = \mathrm{Sym}, \qquad \log_P(Q) = P^{\frac12} \log\left(P^{-\frac12} Q P^{-\frac12}\right) P^{\frac12}$$

It maps any point $Q \in \mathrm{Sym}^{++}$ to the unique tangent vector at $P$ such that $\gamma_0 = P$ and $\gamma_1 = Q$.
Geodesic equation:

$$\gamma_t(P,Q) = P^{\frac12} \left(P^{-\frac12} Q P^{-\frac12}\right)^t P^{\frac12}$$

Geodesic (metric length) distance:

$$d_R(P,Q) = \left\|\log\left(P^{-\frac12} Q P^{-\frac12}\right)\right\|$$
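A minimal NumPy sketch of the geodesic, log map, and Riemannian distance above, built on the spectral matrix functions (helper names are ours):

```python
import numpy as np

def _fn(M, f):
    lam, U = np.linalg.eigh(M)
    return U @ np.diag(f(lam)) @ U.T

def spd_geodesic(P, Q, t):
    # gamma_t(P,Q) = P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}
    Ph, Pmh = _fn(P, np.sqrt), _fn(P, lambda x: 1 / np.sqrt(x))
    return Ph @ _fn(Pmh @ Q @ Pmh, lambda x: x ** t) @ Ph

def spd_log(P, Q):
    # log_P(Q) = P^{1/2} log(P^{-1/2} Q P^{-1/2}) P^{1/2}
    Ph, Pmh = _fn(P, np.sqrt), _fn(P, lambda x: 1 / np.sqrt(x))
    return Ph @ _fn(Pmh @ Q @ Pmh, np.log) @ Ph

def spd_dist(P, Q):
    # d_R(P,Q) = || log(P^{-1/2} Q P^{-1/2}) ||_F
    Pmh = _fn(P, lambda x: 1 / np.sqrt(x))
    return np.linalg.norm(_fn(Pmh @ Q @ Pmh, np.log), 'fro')
```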

Page 18: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Riemannian Karcher centroid

$$d_R(P,Q) = \sqrt{\operatorname{tr}\log^2(P^{-1}Q)} = \sqrt{\sum_{i=1}^d \log^2 \lambda_i} = \left\|\log\left(P^{-\frac12} Q P^{-\frac12}\right)\right\|,$$

where the $\lambda_i$'s are the eigenvalues of $P^{-1}Q$ ($P^{-1}Q$ is similar to the SPD matrix $Q^{\frac12} P^{-1} Q^{\frac12}$).

Unique mean characterized by $\sum_{i=1}^n \log(T_i^{-1} C_R) = 0$.

Closed-form solution only for $n = 2$:

$$C_R(P,Q) = P^{\frac12} \left(P^{-\frac12} Q P^{-\frac12}\right)^{\frac12} P^{\frac12};$$

otherwise, iterative approximation ($C_R = \lim_{t\to\infty} C_t$):

$$C_{t+1} = C_t \exp\left(\frac{1}{n}\sum_{i=1}^n \log\left(C_t^{-1} T_i\right)\right).$$
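A minimal sketch of this fixed-point iteration, assuming SciPy's expm/logm (needed since $C_t^{-1}T_i$ is not symmetric in general):

```python
import numpy as np
from scipy.linalg import expm, logm

def karcher_mean(Ts, iters=50):
    C = sum(Ts) / len(Ts)                      # initialize at arithmetic mean
    for _ in range(iters):
        # Average tangent vector (1/n) sum_i log(C^{-1} T_i).
        S = sum(logm(np.linalg.inv(C) @ T) for T in Ts) / len(Ts)
        C = C @ expm(S)
    return C
```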


Page 19: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Riemannian minimax SPD center (circumcenter [1])

Case of $p = \infty$: the center that minimizes the maximum distance.

GEO-ALG: start with $c_1 \in \mathcal{P}$ and iteratively update the current circumcenter as $c_{i+1} = \mathrm{Geodesic}(c_i, f_i, \frac{1}{i+1})$, where $f_i$ denotes the farthest point of $\mathcal{P}$ from $c_i$, and $\mathrm{Geodesic}(p, q, t)$ denotes the intermediate point $m$ on the geodesic passing through $p$ and $q$ such that $\rho(p, m) = t \times \rho(p, q)$.

Geodesic:

$$\gamma_t(P,Q) = P^{\frac12} \left(P^{-\frac12} Q P^{-\frac12}\right)^t P^{\frac12}$$

To cut the geodesic at relative distance $r$, find $t$ such that $\sum_{i=1}^d \log^2 \lambda_i^t = t^2 \sum_{i=1}^d \log^2 \lambda_i = r^2 \sum_{i=1}^d \log^2 \lambda_i$; that is, $t = r$.

Core-set arguments prove guaranteed convergence [1].
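A minimal sketch of GEO-ALG, reusing the spd_dist and spd_geodesic helpers sketched after the SPD-cone slide (names are ours):

```python
def riemannian_minimax_center(Ps, iters=1000):
    c = Ps[0]                                          # c_1: any input matrix
    for i in range(1, iters + 1):
        far = max(Ps, key=lambda P: spd_dist(c, P))    # farthest point f_i
        c = spd_geodesic(c, far, 1.0 / (i + 1))        # step 1/(i+1) toward it
    return c
```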


Page 20: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Matrices as parameters in probability distributions

Exponential families: Gaussian, Wishart, etc.:

$$p(x;\lambda) = p_F(x;\theta) = \exp\left(\langle t(x), \theta\rangle - F(\theta) + k(x)\right).$$

Example: Poisson distribution

$$p(x;\lambda) = \frac{\lambda^x}{x!} \exp(-\lambda),$$

◮ the sufficient statistic $t(x) = x$,

◮ $\theta = \log\lambda$, the natural parameter,

◮ $F(\theta) = \exp\theta$, the log-normalizer → CONVEX,

◮ and $k(x) = -\log x!$, the carrier measure (with respect to the counting measure).


Page 21: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Gaussians as an exponential family

$$p(x;\lambda) = p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\sqrt{\det\Sigma}} \exp\left(-\frac{(x-\mu)^\top \Sigma^{-1} (x-\mu)}{2}\right)$$

◮ $\theta = (\Sigma^{-1}\mu, \frac12\Sigma^{-1}) \in \Theta = \mathbb{R}^d \times K_{d\times d}$, with $K_{d\times d}$ the cone of positive definite matrices,

◮ $F(\theta) = \frac14 \operatorname{tr}\left(\theta_2^{-1}\theta_1\theta_1^\top\right) - \frac12\log\det\theta_2 + \frac{d}{2}\log\pi$ → CONVEX,

◮ $t(x) = (x, -xx^\top)$,

◮ $k(x) = 0$.

Inner product: composite, the sum of a dot product and a matrix trace:

$$\langle\theta, \theta'\rangle = \theta_1^\top \theta_1' + \operatorname{tr}\left(\theta_2^\top \theta_2'\right).$$

The coordinate transformation $\tau : \Lambda \to \Theta$ is given for $\lambda = (\lambda_1, \lambda_2) = (\mu, \Sigma)$ by

$$\tau(\lambda) = \left(\lambda_2^{-1}\lambda_1, \frac12\lambda_2^{-1}\right), \qquad \tau^{-1}(\theta) = \left(\frac12\theta_2^{-1}\theta_1, \frac12\theta_2^{-1}\right)$$
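A minimal NumPy sketch of the coordinate transformations $\tau$ and $\tau^{-1}$ above:

```python
import numpy as np

def to_natural(mu, Sigma):
    # tau(mu, Sigma) = (Sigma^{-1} mu, (1/2) Sigma^{-1})
    Sinv = np.linalg.inv(Sigma)
    return Sinv @ mu, 0.5 * Sinv

def from_natural(theta1, theta2):
    # tau^{-1}(theta) = ((1/2) theta2^{-1} theta1, (1/2) theta2^{-1})
    Sigma = 0.5 * np.linalg.inv(theta2)
    return Sigma @ theta1, Sigma
```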


Page 22: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Convex duality: Legendre transformation

◮ For a strictly convex and differentiable function $F : \mathcal{X} \to \mathbb{R}$:

$$F^*(y) = \sup_{x\in\mathcal{X}} \{\underbrace{\langle y, x\rangle - F(x)}_{l_F(y;x)}\}$$

◮ Maximum obtained for $y = \nabla F(x)$:

$$\nabla_x l_F(y;x) = y - \nabla F(x) = 0 \;\Rightarrow\; y = \nabla F(x)$$

◮ Maximum unique from the convexity of $F$ ($\nabla^2 F \succ 0$):

$$\nabla_x^2 l_F(y;x) = -\nabla^2 F(x) \prec 0$$

◮ Convex conjugates:

$$(F, \mathcal{X}) \Leftrightarrow (F^*, \mathcal{Y}), \qquad \mathcal{Y} = \{\nabla F(x) \mid x \in \mathcal{X}\}$$


Page 23: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Legendre duality: Geometric interpretation
Consider the epigraph of $F$ as a convex object:

◮ convex hull ($V$-representation), versus
◮ half-space ($H$-representation).

The Legendre transform is also called the “slope” transform.

Page 24: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Legendre duality & Canonical divergence

◮ Convex conjugates have functional inverse gradients: $\nabla F^{-1} = \nabla F^*$. $\nabla F^*$ may require numerical approximation (not always available in analytical closed form).

◮ Involution: $(F^*)^* = F$ with $\nabla F^* = (\nabla F)^{-1}$.

◮ Convex conjugate $F^*$ expressed using $(\nabla F)^{-1}$:

$$F^*(y) = \langle x, y\rangle - F(x), \quad x = \nabla_y F^*(y)$$
$$= \langle (\nabla F)^{-1}(y), y\rangle - F\left((\nabla F)^{-1}(y)\right)$$

◮ The Fenchel-Young inequality is at the heart of the canonical divergence:

$$F(x) + F^*(y) \ge \langle x, y\rangle$$

$$A_F(x : y) = A_{F^*}(y : x) = F(x) + F^*(y) - \langle x, y\rangle \ge 0$$


Page 25: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Dual Bregman divergences & canonical divergence [14]

$$\mathrm{KL}(P : Q) = E_P\left[\log\frac{p(x)}{q(x)}\right] \ge 0$$
$$= B_F(\theta_Q : \theta_P) = B_{F^*}(\eta_P : \eta_Q) = F(\theta_Q) + F^*(\eta_P) - \langle\theta_Q, \eta_P\rangle = A_F(\theta_Q : \eta_P) = A_{F^*}(\eta_P : \theta_Q)$$

with $\theta_Q$ (natural parameterization) and $\eta_P = E_P[t(X)] = \nabla F(\theta_P)$ (moment parameterization).

$$\mathrm{KL}(P : Q) = \underbrace{\int p(x)\log\frac{1}{q(x)}\,\mathrm{d}x}_{H^\times(P:Q)} - \underbrace{\int p(x)\log\frac{1}{p(x)}\,\mathrm{d}x}_{H(p) = H^\times(P:P)}$$

Shannon cross-entropy and entropy of an exponential family [14]:

$$H^\times(P : Q) = F(\theta_Q) - \langle\theta_Q, \nabla F(\theta_P)\rangle - E_P[k(x)]$$
$$H(P) = F(\theta_P) - \langle\theta_P, \nabla F(\theta_P)\rangle - E_P[k(x)]$$
$$H(P) = -F^*(\eta_P) - E_P[k(x)]$$
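As a sanity check, a minimal sketch verifying $\mathrm{KL}(P:Q) = B_F(\theta_Q : \theta_P)$ for two Poisson distributions, with $F(\theta) = \exp\theta$ and $\theta = \log\lambda$:

```python
import numpy as np

def bregman(F, gradF, x, y):
    # B_F(x : y) = F(x) - F(y) - <x - y, grad F(y)>
    return F(x) - F(y) - (x - y) * gradF(y)

lp, lq = 3.0, 5.0                                   # Poisson rates of P and Q
kl_breg = bregman(np.exp, np.exp, np.log(lq), np.log(lp))
kl_direct = lp * np.log(lp / lq) + lq - lp          # closed-form Poisson KL
assert np.isclose(kl_breg, kl_direct)
```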


Page 26: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bregman divergence: Geometric interpretation (I)

Potential function $F$, graph plot $\mathcal{F} : (x, F(x))$.

$$D_F(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle$$


Page 27: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bregman divergence: Geometric interpretation (II)

Potential function $f$, graph plot $\mathcal{F} : (x, f(x))$.

$$B_f(p\|q) = f(p) - f(q) - (p - q)\,f'(q)$$

$B_f(\cdot\|q)$: vertical distance between the hyperplane $H_q$ tangent to $\mathcal{F}$ at the lifted point $\hat{q}$, and the translated hyperplane at $\hat{p}$.


Page 28: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bregman divergence: Geometric interpretation (III)
Bregman divergences and path integrals:

$$B(\theta_1 : \theta_2) = F(\theta_1) - F(\theta_2) - \langle\theta_1 - \theta_2, \nabla F(\theta_2)\rangle \qquad (3)$$
$$= \int_{\theta_2}^{\theta_1} \langle\nabla F(t) - \nabla F(\theta_2), \mathrm{d}t\rangle \qquad (4)$$
$$= \int_{\eta_1}^{\eta_2} \langle\nabla F^*(t) - \nabla F^*(\eta_1), \mathrm{d}t\rangle \qquad (5)$$
$$= B^*(\eta_2 : \eta_1) \qquad (6)$$


Page 29: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Matrix Bregman divergences [4, 16]

Choose $F$ a real-valued functional generator and extend $F$ to matrices:

$$F(X) = \operatorname{tr}(\Psi(X)), \qquad \Psi(X) = \sum_{k\ge0} t_{F,k} X^k$$

($t_{F,k}$ from the Taylor expansion of the real-valued $F$)

$$B_F(P : Q) = F(P) - F(Q) - \operatorname{tr}\left((P - Q)^\top \nabla F(Q)\right),$$

$$\nabla F(X) = \sum_{k\ge0} t'_{F,k} X^k$$

($t'_{F,k}$ from the Taylor expansion of the real-valued $F'$)
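A minimal sketch for the Burg generator $F(X) = -\log\det X$, for which the matrix Bregman divergence works out to $B_F(P:Q) = \operatorname{tr}(Q^{-1}P) - \log\det(Q^{-1}P) - d$ (the log-det/Stein divergence):

```python
import numpy as np

def burg_matrix_bregman(P, Q):
    d = P.shape[0]
    M = np.linalg.solve(Q, P)            # Q^{-1} P
    sign, logdet = np.linalg.slogdet(M)  # stable log-determinant
    return np.trace(M) - logdet - d
```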


Page 30: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Matrix Bregman divergences [16]


Page 31: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Particular case: Bregman Schatten p-divergences [5, 16]

Schatten $p$-norm of a real symmetric matrix $X$ (unitarily invariant matrix norms):

$$\|X\|_p = \|\lambda(X)\|_p$$

Bregman generator:

$$F(X) = \frac12 \|X\|_p^2$$

Used in regularized convex optimization [5] and matrix data mining [16].
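A minimal NumPy sketch of the Schatten $p$-norm and this generator:

```python
import numpy as np

def schatten_norm(X, p):
    # ||X||_p = ||lambda(X)||_p for symmetric X (vector p-norm of eigenvalues).
    return np.linalg.norm(np.linalg.eigvalsh(X), ord=p)

def schatten_generator(X, p):
    # Bregman generator F(X) = (1/2) ||X||_p^2.
    return 0.5 * schatten_norm(X, p) ** 2
```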


Page 32: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Matrix Legendre transformation

Extends the classical Legendre-Fenchel transformation:

$$F^*(\eta) = \sup_{\mathrm{spec}(\theta)\subseteq\mathrm{dom}(F)} \operatorname{tr}(\theta\eta^\top) - F(\theta)$$

$$D_F(\theta_P : \theta_Q) = D_{F^*}(\eta_Q : \eta_P) = F(\theta) + F^*(\eta) - \operatorname{tr}(\theta\eta^\top)$$

$\theta$ and $\eta$ are dual matrix coordinate systems on the matrix manifold.

Non-metric differential structure with dual coordinate systems.


Page 33: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bregman matrix means

$$B_F(X, P) = F(X) - F(P) - \operatorname{tr}\left((X - P)^\top \nabla F(P)\right),$$

$F(\cdot)$: strictly convex and differentiable function on an open convex space.

$$C = \nabla F^{-1}\left(\sum_{i=1}^n w_i \nabla F(T_i)\right)$$

→ quasi-arithmetic mean for $\nabla F$.
Since $B_F(X, P) \ne B_F(P, X)$, define also a right-sided centroid $M'$: find the center of mass [13] (independent of the generator $F$).
Generators:
◮ $F(X) = \operatorname{tr}(X^\top X)$: the quadratic matrix entropy,
◮ $F(X) = -\log\det X$: the matrix Burg entropy, and
◮ $F(X) = \operatorname{tr}(X \log X - X)$: the von Neumann entropy [19, 18, 15] (Umegaki quantum relative entropy).
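A minimal sketch of the quasi-arithmetic matrix mean for the von Neumann generator, where $\nabla F(X) = \log X$ and $\nabla F^{-1} = \exp$ (computed spectrally), giving the log-Euclidean mean:

```python
import numpy as np

def _fn(M, f):
    lam, U = np.linalg.eigh(M)
    return U @ np.diag(f(lam)) @ U.T

def log_euclidean_mean(Ts, ws):
    # C = exp( sum_i w_i log(T_i) )
    S = sum(w * _fn(T, np.log) for w, T in zip(ws, Ts))
    return _fn(S, np.exp)
```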


Page 34: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Total Bregman divergences (tBD)

Instead of the “vertical” projection in the Bregman divergence, consider the perpendicular projection (analogy with least-squares vs. total least-squares regression):

$$\mathrm{tB}_F(P, Q) = \frac{B_F(P, Q)}{\sqrt{1 + \|\nabla F(Q)\|^2}}$$

→ proven statistically robust.
Applications to robust DT-MRI segmentation [8].
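A minimal sketch of the tBD normalization, assuming the standard $1/\sqrt{1 + \|\nabla F(Q)\|^2}$ conformal factor, instantiated for the matrix Burg generator ($\nabla F(X) = -X^{-1}$):

```python
import numpy as np

def total_burg_bregman(P, Q):
    d = P.shape[0]
    M = np.linalg.solve(Q, P)
    bf = np.trace(M) - np.linalg.slogdet(M)[1] - d     # B_F(P, Q)
    gradF_Q = -np.linalg.inv(Q)                        # grad F at Q
    return bf / np.sqrt(1.0 + np.linalg.norm(gradF_Q, 'fro') ** 2)
```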


Page 35: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Matrix Jensen/Burbea-Rao divergences [10]

The convexity gap defines a divergence:

$$\mathrm{BR}_F(P, Q) = \frac{F(P) + F(Q)}{2} - F\left(\frac{P + Q}{2}\right) \ge 0$$

◮ $F(X) = \operatorname{tr}(X^\top X)$: the quadratic matrix entropy,

◮ $F(X) = -\log\det X$: the matrix Burg entropy, and

◮ $F(X) = \operatorname{tr}(X \log X - X)$: the von Neumann entropy,

◮ etc.


Page 36: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Smooth family of convex generators [12, 17]

1-parameter family of generators:

$$F_\alpha(X) = \frac{1}{\alpha(1-\alpha)} \operatorname{tr}\left(\alpha X - X^\alpha + (1-\alpha)I\right), \quad \alpha \notin \{0, 1\}$$

$$B_\alpha(P : Q) = \frac{1}{\alpha(1-\alpha)} \operatorname{tr}\left(Q^\alpha - P^\alpha + \alpha Q^{\alpha-1}(P - Q)\right)$$

$$\nabla F_\alpha(X) = \frac{1}{\alpha - 1}\left(I - X^{\alpha-1}\right), \qquad \nabla F_\alpha^{-1}(X) = \left(I - (\alpha - 1)X\right)^{\frac{1}{\alpha-1}}$$

When $\alpha \to 1$, $\nabla F_\alpha(X) \to \nabla F_1(X) = \log X$. When $\alpha \to 0$, $\nabla F_\alpha(X) \to \nabla F_0(X) = X^{-1} - I$.

◮ $\alpha = 2$: quadratic matrix information

◮ $\alpha \to 1$: von Neumann information

◮ $\alpha \to 0$: Burg log-det information


Page 37: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Jensen (Burbea-Rao) divergences

Based on Jensen's inequality for a strictly convex function $F(\cdot)$:

$$\mathrm{BR}_F(X, P) = \frac{F(X) + F(P)}{2} - F\left(\frac{X + P}{2}\right) \ge 0.$$

Includes the special case of the Jensen-Shannon divergence:

$$\mathrm{JS}(p, q) = H\left(\frac{p + q}{2}\right) - \frac{H(p) + H(q)}{2}$$

with $F(x) = -H(x)$, the negative Shannon entropy $H(x) = -x\log x$.
→ generators are convex and entropies are concave (negative generators)
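A minimal NumPy sketch of the Jensen-Shannon divergence as this convexity gap, on discrete distributions:

```python
import numpy as np

def shannon_entropy(p):
    p = p[p > 0]                       # 0 log 0 = 0 convention
    return -np.sum(p * np.log(p))

def jensen_shannon(p, q):
    # JS(p, q) = H((p+q)/2) - (H(p) + H(q))/2
    return shannon_entropy((p + q) / 2) - (shannon_entropy(p) + shannon_entropy(q)) / 2
```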


Page 38: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Visualizing Burbea-Rao divergences

These include the squared Mahalanobis distance.

Page 39: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Burbea-Rao from Symmetrizing Bregman divergences [13]

◮ Jeffreys-Bregman divergences:

$$S_F(p; q) = \frac{B_F(p, q) + B_F(q, p)}{2} = \frac12\langle p - q, \nabla F(p) - \nabla F(q)\rangle,$$

◮ Jensen-Bregman divergences (diversity index):

$$J_F(p; q) = \frac{B_F\left(p, \frac{p+q}{2}\right) + B_F\left(q, \frac{p+q}{2}\right)}{2} = \frac{F(p) + F(q)}{2} - F\left(\frac{p+q}{2}\right) = \mathrm{BR}_F(p, q)$$


Page 40: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Skew Burbea-Rao divergences

$$\mathrm{BR}_F^{(\alpha)} : \mathcal{X}\times\mathcal{X} \to \mathbb{R}^+$$

$$\mathrm{BR}_F^{(\alpha)}(p, q) = \alpha F(p) + (1 - \alpha)F(q) - F(\alpha p + (1 - \alpha)q) = \mathrm{BR}_F^{(1-\alpha)}(q, p)$$

Skew symmetrization of Bregman divergences:

$$\alpha B_F(p, \alpha p + (1 - \alpha)q) + (1 - \alpha) B_F(q, \alpha p + (1 - \alpha)q) \stackrel{\mathrm{def}}{=} \mathrm{BR}_F^{(\alpha)}(p, q)$$

= skew Jensen-Bregman divergences.


Page 41: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bregman divergences = asymptotic skew Jensen divergences

$$B_F(p, q) = \lim_{\alpha\to1} \frac{1}{1-\alpha} \mathrm{BR}_F^{(\alpha)}(p, q)$$

$$B_F(q, p) = \lim_{\alpha\to0} \frac{1}{\alpha} \mathrm{BR}_F^{(\alpha)}(p, q)$$


Page 42: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Burbea-Rao/Jensen centroids

($p = 1$)

$$\mathrm{OPT} : \quad C_F = \arg\min_X \sum_{i=1}^n w_i\, \mathrm{BR}_F^{(\alpha_i)}(X, T_i) = \arg\min_X L(X)$$

W.l.o.g., equivalent to minimizing

$$E(C) = \left(\sum_{i=1}^n w_i\alpha_i\right) F(C) - \sum_{i=1}^n w_i F(\alpha_i C + (1 - \alpha_i)T_i)$$

Sum $E = F + G$ of a convex $F$ plus a concave $G$ ⇒ Convex-ConCave Procedure (CCCP, NIPS*01).
Start from an arbitrary $C_0$, and iteratively update as:

$$\nabla F(C_{t+1}) = -\nabla G(C_t)$$

⇒ guaranteed convergence to a (local) minimum.


Page 43: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

ConCave Convex Procedure (CCCP)

$$\min_x E(x) = F(x) + G(x), \qquad \nabla F(c_{t+1}) = -\nabla G(c_t)$$

The decomposition may not be unique...


Page 44: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Iterative algorithm for Burbea-Rao centroids

Apply the CCCP scheme:

$$\nabla F(C_{t+1}) = \frac{1}{\sum_{i=1}^n w_i\alpha_i} \sum_{i=1}^n w_i\alpha_i \nabla F(\alpha_i C_t + (1 - \alpha_i)T_i)$$

$$C_{t+1} = \nabla F^{-1}\left(\frac{1}{\sum_{i=1}^n w_i\alpha_i} \sum_{i=1}^n w_i\alpha_i \nabla F(\alpha_i C_t + (1 - \alpha_i)T_i)\right)$$

Get arbitrarily fine approximations of the (skew) Burbea-Rao matrix centroids and barycenters.
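A minimal sketch of this CCCP iteration for the von Neumann generator ($\nabla F = \log$, $\nabla F^{-1} = \exp$, computed spectrally):

```python
import numpy as np

def _fn(M, f):
    lam, U = np.linalg.eigh(M)
    return U @ np.diag(f(lam)) @ U.T

def burbea_rao_centroid(Ts, ws, alphas, iters=100):
    C = sum(w * T for w, T in zip(ws, Ts)) / sum(ws)   # initialization
    wa = sum(w * a for w, a in zip(ws, alphas))
    for _ in range(iters):
        # grad F(C_{t+1}) = (1/sum w_i a_i) sum_i w_i a_i grad F(a_i C + (1-a_i) T_i)
        S = sum(w * a * _fn(a * C + (1 - a) * T, np.log)
                for w, a, T in zip(ws, alphas, Ts)) / wa
        C = _fn(S, np.exp)
    return C
```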


Page 45: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Special case: α-log det divergence [15, 11]
Cone of Hermitian positive definite matrices (self-adjoint matrices, $M^H = M$).
$F(X) = -\log\det X$, $\nabla F(X) = \nabla F^{-1}(X) = -X^{-1}$.

Burbea-Rao α-log det divergences:

$$D_{ld}^{(\alpha)}(P, Q) = \begin{cases} \operatorname{tr}(Q^{-1}P - I) - \log\det(Q^{-1}P) & \alpha = 1 \\ \frac{4}{1-\alpha^2} \log\dfrac{\det\left(\frac{1-\alpha}{2}P + \frac{1+\alpha}{2}Q\right)}{(\det P)^{\frac{1-\alpha}{2}}(\det Q)^{\frac{1+\alpha}{2}}} & \alpha \in \mathbb{R}\setminus\{-1, 1\} \\ \operatorname{tr}(P^{-1}Q - I) - \log\det(P^{-1}Q) & \alpha = -1 \end{cases}$$

Start with $C_1 = \frac{1}{n}\sum_{i=1}^n T_i$, then iterate

$$C_{t+1} = n\left(\sum_{i=1}^n \left(\frac{1-\alpha}{2}T_i + \frac{1+\alpha}{2}C_t\right)^{-1}\right)^{-1}$$

→ unique global mean (obtained from CCCP).
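A minimal NumPy sketch of this fixed-point iteration:

```python
import numpy as np

def alpha_logdet_mean(Ts, alpha, iters=100):
    n = len(Ts)
    C = sum(Ts) / n                                   # C_1: arithmetic mean
    for _ in range(iters):
        S = sum(np.linalg.inv((1 - alpha) / 2 * T + (1 + alpha) / 2 * C)
                for T in Ts)
        C = n * np.linalg.inv(S)
    return C
```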


Page 46: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bhattacharyya coefficients/distances

Bhattacharyya coefficient and non-metric distance:

$$C(p, q) = \int \sqrt{p(x)\,q(x)}\,\mathrm{d}x, \quad 0 < C(p, q) \le 1, \qquad B(p, q) = -\ln C(p, q)$$

(the coefficient is always strictly positive). Hellinger metric:

$$H(p, q) = \sqrt{\frac12 \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 \mathrm{d}x},$$

such that $0 \le H(p, q) \le 1$.

$$H(p, q) = \sqrt{\frac12\left(\int p(x)\,\mathrm{d}x + \int q(x)\,\mathrm{d}x - 2\int\sqrt{p(x)}\sqrt{q(x)}\,\mathrm{d}x\right)} = \sqrt{1 - C(p, q)}.$$


Page 47: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Chernoff coefficients/α-divergences
Skew Bhattacharyya divergences based on Chernoff α-coefficients:

$$B_\alpha(p, q) = -\ln\int_x p^\alpha(x)\,q^{1-\alpha}(x)\,\mathrm{d}x = -\ln C_\alpha(p, q) = -\ln\int_x q(x)\left(\frac{p(x)}{q(x)}\right)^\alpha \mathrm{d}x = -\ln E_q[L^\alpha(x)]$$

Amari α-divergence:

$$D_\alpha(p\|q) = \begin{cases} \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{\frac{1-\alpha}{2}} q(x)^{\frac{1+\alpha}{2}}\,\mathrm{d}x\right), & \alpha \ne \pm1,\\ \int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}x = \mathrm{KL}(p, q), & \alpha = -1,\\ \int q(x)\log\frac{q(x)}{p(x)}\,\mathrm{d}x = \mathrm{KL}(q, p), & \alpha = 1, \end{cases}$$

$$D_\alpha(p\|q) = D_{-\alpha}(q\|p)$$

Remapping $\alpha' = \frac{1-\alpha}{2}$ ($\alpha = 1 - 2\alpha'$) gives the Chernoff α′-divergences.

Page 48: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bhattacharyya/Chernoff of exponential families [10]

Equivalence with skew Burbea-Rao distances:

$$B_\alpha(p_F(x;\theta_p), p_F(x;\theta_q)) = \mathrm{BR}_F^{(\alpha)}(\theta_p, \theta_q) = \alpha F(\theta_p) + (1-\alpha)F(\theta_q) - F(\alpha\theta_p + (1-\alpha)\theta_q) \qquad (7)$$

A Bhattacharyya divergence on probability distributions thus amounts to computing a Jensen divergence on their parameters.


Page 49: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Closed-form Bhattacharyya distances for exp. fam.

A generic formula that instantiates into the well-known formulas of statistical pattern recognition.

Exp. fam. | $F(\theta)$ (up to a constant) | Bhattacharyya/Burbea-Rao $\mathrm{BR}_F(\lambda_p, \lambda_q) = \mathrm{BR}_F(\tau(\lambda_p), \tau(\lambda_q))$
--- | --- | ---
Multinomial | $\log\left(1 + \sum_{i=1}^{d-1} e^{\theta_i}\right)$ | $-\ln\sum_{i=1}^d \sqrt{p_i q_i}$
Poisson | $\exp\theta$ | $\frac12\left(\sqrt{\mu_p} - \sqrt{\mu_q}\right)^2$
Gaussian (univariate) | $-\frac{\theta_1^2}{4\theta_2} + \frac12\log\left(-\frac{\pi}{\theta_2}\right)$ | $\frac14\frac{(\mu_p - \mu_q)^2}{\sigma_p^2 + \sigma_q^2} + \frac12\ln\frac{\sigma_p^2 + \sigma_q^2}{2\sigma_p\sigma_q}$
Gaussian (multivariate) | $\frac14\operatorname{tr}\left(\Theta^{-1}\theta\theta^\top\right) - \frac12\log\det\Theta$ | $\frac18(\mu_p - \mu_q)^\top\left(\frac{\Sigma_p + \Sigma_q}{2}\right)^{-1}(\mu_p - \mu_q) + \frac12\ln\frac{\det\frac{\Sigma_p + \Sigma_q}{2}}{\sqrt{\det\Sigma_p \det\Sigma_q}}$
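A minimal NumPy sketch of the closed-form Bhattacharyya distance between two multivariate Gaussians (last row of the table):

```python
import numpy as np

def bhattacharyya_gaussians(mu_p, Sigma_p, mu_q, Sigma_q):
    S = (Sigma_p + Sigma_q) / 2
    dmu = mu_p - mu_q
    # (1/8) (mu_p - mu_q)^T S^{-1} (mu_p - mu_q)
    maha = dmu @ np.linalg.solve(S, dmu) / 8
    # (1/2) ln( det S / sqrt(det Sigma_p det Sigma_q) )
    logdet = 0.5 * np.log(np.linalg.det(S) /
                          np.sqrt(np.linalg.det(Sigma_p) * np.linalg.det(Sigma_q)))
    return maha + logdet
```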


Page 50: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Wrapping up

◮ Besides Euclidean, log-Euclidean and Riemannian metric-based means, proposed divergence-based matrix centroids,

◮ Total Bregman divergences and robustness (conformalgeometry),

◮ Riemannian minimax center,

◮ skew Burbea-Rao/Jensen divergences extending Bregmandivergences,

◮ Bhattacharyya means of densities = Burbea-Rao means on (matrix) parameters

Which mean do you mean or need?


Page 51: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Non-metric matrix manifolds with dually affine connections

In a nutshell:

◮ asymmetric (Bregman) non-metric divergence,

◮ Legendre transform, convex conjugates & dual divergences

◮ Dual θ-, η-, or mixed coordinate systems

◮ dual closed-form affine geodesics (convenient computationally)

◮ Pythagorean theorem


Page 52: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Thank you.

www.informationgeometry.org

“One geometry cannot be more true than another; it can only be more convenient.”

— Jules Henri Poincaré (1902)


Page 53: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bibliographic references I

Marc Arnaudon and Frank Nielsen.

On approximating the Riemannian 1-center.

Comput. Geom., 46(1):93–104, 2013.

Rajendra Bhatia.

The Riemannian mean of positive matrices.

In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 35–51, 2012.

Silvere Bonnabel and Rodolphe Sepulchre.

Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank.

SIAM J. Matrix Analysis Applications, 31(3):1055–1070, 2009.

Inderjit S. Dhillon and Joel A. Tropp.

Matrix nearness problems with Bregman divergences.

SIAM J. Matrix Anal. Appl., 29(4):1120–1146, November 2007.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari.

Composite objective mirror descent.

In Adam Tauman Kalai and Mehryar Mohri, editors, COLT, pages 14–26. Omnipress, 2010.

C. Harris and M. Stephens.

A Combined Corner and Edge Detector.

In Proceedings of The Fourth Alvey Vision Conference, pages 147–151, 1988.


Page 54: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bibliographic references II

Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.

Shape retrieval using hierarchical total Bregman soft clustering.

Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.

Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.

Shape retrieval using hierarchical total Bregman soft clustering.

IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2407–2419, 2012.

Maher Moakher.

A differential geometric approach to the geometric mean of symmetric positive-definite matrices.

SIAM Journal on Matrix Analysis and Applications, 26(3):735–747, 2005.

Frank Nielsen and Sylvain Boltz.

The Burbea-Rao and Bhattacharyya centroids.

IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.

Frank Nielsen, Meizhu Liu, Xiaojing Ye, and Baba C. Vemuri.

Jensen divergence based SPD matrix means and applications.

In International Conference on Pattern Recognition (ICPR), 2012.

Frank Nielsen and Richard Nock.

Quantum Voronoi diagrams and Holevo channel capacity for 1-qubit quantum states.

In IEEE International Symposium on Information Theory (ISIT), pages 96–100, 2008.


Page 55: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bibliographic references III

Frank Nielsen and Richard Nock.

Sided and symmetrized Bregman centroids.

IEEE Trans. Inf. Theor., 55(6):2882–2904, June 2009.

Frank Nielsen and Richard Nock.

Entropies and cross-entropies of exponential families.

In International Conference on Image Processing (ICIP), pages 3621–3624, 2010.

R. Nock, B. Magdalou, E. Briys, and F. Nielsen.

On tracking portfolios with certainty equivalents on a generalization of Markowitz model: the fool, the wise and the adaptive.

In Thorsten Joachims, editor, International Conference on Machine Learning (ICML). Omnipress, 2011.

Richard Nock, Brice Magdalou, Eric Briys, and Frank Nielsen.

Mining matrix data with Bregman matrix divergences for portfolio selection.

In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 373–402, 2012.

Masanori Ohya and Dénes Petz.

Quantum Entropy and Its Use.

1st ed. 1993, corrected 2nd printing, 2004.

Koji Tsuda, Gunnar Rätsch, and Manfred K. Warmuth.

Matrix exponentiated gradient updates for on-line learning and Bregman projection.

J. Mach. Learn. Res., 6:995–1018, December 2005.


Page 56: Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Bibliographic references IV

Hisaharu Umegaki.

Conditional expectation in an operator algebra. IV. Entropy and information.

Kodai Mathematical Seminar Reports, 14(2):59–85, 1962.

Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen.

Total Bregman divergence and its applications to DTI analysis.

IEEE Transactions on Medical Imaging, 2011.

Zhizhou Wang and Baba C. Vemuri.

An affine invariant tensor dissimilarity measure and its applications to tensor-valued image segmentation.

In CVPR (1), pages 228–233, 2004.
