Computational Information Geometry on Matrix Manifolds (ICTP 2013)

Talk page: http://cdsagenda5.ictp.trieste.it/full_display.php?ida=a12193

Computational Information Geometry on Matrix Manifolds

Frank Nielsen (Frank.Nielsen@acm.org)

www.informationgeometry.org

Sony Computer Science Laboratories, Inc.

July 2013, ICTP, Trieste, Italy


Geometry of matrix manifolds...

◮ Euclidean geometry, Frobenius norm → distance:

$$\|M\|_F^2 = \sum_{i,j} m_{ij}^2 = \sum_i \|M_{i*}\|_2^2 = \sum_j \|M_{*j}\|_2^2 = \mathrm{tr}(M^\top M)$$

◮ Riemannian geometry of symmetric positive definite (SPD) matrices [9, 2]

◮ Riemannian geometry of rank-deficient positive semi-definite (SPSD) matrices: Stiefel/Grassmann manifolds [3]

◮ Quantum geometry: SPD matrices with unit trace

“One geometry cannot be more true than another; it can only be more convenient.”

— Jules Henri Poincaré (1902)


Forthcoming conference (GSI)

28th-30th August, Paris.


What is Computational Information Geometry?

◮ What is Information? = the essence of data (datum = “thing”) (make it tangible → e.g., parameters of generative models)

◮ Can we do intrinsic computing? (unbiased by any particular “data representation” → same results after recoding the data)

◮ Geometry?! → the science of invariance (mother of Science; compass & ruler; Descartes' analytic = coordinate/Cartesian geometry; imaginaries; ...) ... the open-ended poetic mathematics!


Rationale for Computational Information Geometry

◮ Information is ... never void! → lower bounds:
  ◮ Fisher information and the Cramér-Rao lower bound (estimation),
  ◮ Bayes' error and Chernoff information (classification),
  ◮ coding and Shannon entropy (communication),
  ◮ program and Kolmogorov complexity (compression; unfortunately not computable!).

◮ Geometry:
  ◮ language (point, line, ball, dimension, orthogonal, projection, geodesic, immersion, etc.),
  ◮ power of characterization (e.g., the intersection of two pseudo-segments need not admit a closed-form expression).

◮ Computing: information computing. Seeking mathematical convenience and mathematical tricks (RKHS in ML). How to manipulate “spaces of functions”?!


Example I: Matrix manifolds

Pattern: Gaussian mixture models (a universal class).
Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
Invariance: if $x_i \sim N(\mu_i, \Sigma_i)$ and $y = A(x) = Lx + t$, then $y_i \sim N(L\mu_i + t, L\Sigma_i L^\top)$ and $D(X_1 : X_2) = D(Y_1 : Y_2)$ ($A$: any invertible affine transformation with linear part $L$ and translation $t$).

Shape Retrieval using Hierarchical Total Bregman Soft Clustering [7], IEEE PAMI, 2012.

Example II: Matrix manifolds

DTI: diffusion ellipsoids, tensor interpolation.
Pattern: zero-centered Gaussians.
Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
Invariance: $D(A^\top P A : A^\top Q A) = D(P : Q)$ for $A \in \mathrm{SL}(d)$ (volume/orientation preserving).

(Figure: 3D rat corpus callosum.)

Total Bregman Divergence and its Applications to DTI Analysis [20], IEEE TMI, 2011.

Example III: Gaussian manifolds

Consider 5D Gaussian Mixture Models (GMMs) of color images (image = RGBxy point set).

A Gaussian mixture model $\sum_i w_i N(\mu_i, \Sigma_i)$ is interpreted as a weighted point set $\{\theta_i = (\mu_i, \Sigma_i)\}$.


Matrix center points & clustering

Aggregation (matrix quantization for codebooks): given a data set of matrices $\mathcal{M} = \{M_1, \ldots, M_n\} \subset \mathbb{M}$, compute a center matrix $C$. Centering as a variational minimization problem:

$$(\mathrm{OPT}): \quad C_p = \arg\min_{C \in \mathbb{M}} \sum_i w_i\, \mathrm{distance}^p(C, M_i)$$

Notion of centrality, robustness to outliers? For diagonal matrices with the “Euclidean” distance, the usual geometric center points:

◮ median ($p = 1$): robust to outliers (Fermat-Weber point, no closed form),
◮ centroid ($p = 2$): breaks down with a single outlier (→ tBD),
◮ circumcenter ($p \to \infty$): minimizes the farthest-point distance (minimax [1]).


Diffusion Tensor Magnetic Resonance Imaging

DT-MRI measures the anisotropic diffusion of water molecules as a $3 \times 3$ tensor assigned to each voxel position (~1990). Used to analyze in-vivo connectivity patterns of brain tissues: gray matter, white matter (corpus callosum) and cerebrospinal fluid (CSF).

(Image courtesy of Peter J. Basser, Magnetic resonance imaging of the brain and spine, Chapter 31.)


Gradiometry tensor: $3 \times 3$ SPSD matrices

Beyond the “constant” $g \simeq 9.81\,\mathrm{m/s^2}$: gravity field measuring anisotropy. → Oil & gas industry.

(Courtesy of BellGeo. http://www.bellgeo.com/tech/technology_theory_of_FTG.html)


Structure tensors in computer vision

→ Pioneered in image processing: a tensor descriptor of the region at a pixel (Harris-Stephens [6]). Consider a kernel $K$, and compute the tensor descriptor

$$T(p = (x, y)) = K * \begin{bmatrix} I_x^2 & I_x I_y \\ I_y I_x & I_y^2 \end{bmatrix} = \sum_{u,v} w(u, v)\, \nabla I(u, v)\, (\nabla I(u, v))^\top$$

$K$: uniform or Gaussian kernel (e.g., an $s \times s$ window $W$ centered at the pixel $p$); $I_x, I_y$: image derivatives (gradient).

A versatile method: corner detection, optical flow estimation, segmentation, stereo matching, etc. → Tensor image processing.


Harris-Stephens structure tensor (1988)

Deformation tensor field. The Harris-Stephens combined corner/edge detector:

$$R = \det T - k\, (\mathrm{tr}\, T)^2$$

→ A measure of tensor anisotropy. The structure tensor represents local orientation (eigenvectors/eigenvalues).
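As a concrete illustration, here is a minimal NumPy/SciPy sketch of the structure tensor and the Harris-Stephens response; the smoothing width `sigma` and the constant `k = 0.04` are common illustrative choices, not values prescribed by the slides.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(image, sigma=1.0, k=0.04):
    """Structure tensor T = K * (grad I)(grad I)^T smoothed by a Gaussian kernel K,
    then the Harris-Stephens score R = det T - k (tr T)^2 at every pixel."""
    Ix = sobel(image, axis=1, mode="reflect")   # derivative along x
    Iy = sobel(image, axis=0, mode="reflect")   # derivative along y
    Txx = gaussian_filter(Ix * Ix, sigma)       # smooth each tensor component
    Txy = gaussian_filter(Ix * Iy, sigma)
    Tyy = gaussian_filter(Iy * Iy, sigma)
    det_T = Txx * Tyy - Txy ** 2
    tr_T = Txx + Tyy
    return det_T - k * tr_T ** 2                # R > 0: corner; R < 0: edge; |R| small: flat

img = np.zeros((64, 64)); img[16:48, 16:48] = 1.0   # synthetic white square
R = harris_response(img)                             # corners of the square score highest
```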


Matrix with Frobenius metric distance

Matrix space $\mathbb{M}$ with vectorial structure:

$$d_E(P, Q) = \|P - Q\|_F \qquad (1)$$
$$\phantom{d_E(P, Q)} = \sqrt{\mathrm{tr}\big((P - Q)^\top (P - Q)\big)} \qquad (2)$$

Centroid of tensors:

$$C_E = \frac{1}{n} \sum_{i=1}^n w_i T_i$$

→ scalar average of each element of the tensor.

Tensor Field Segmentation Using Region Based Active Contour Model [21], ECCV, 2004.
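A minimal NumPy sketch of these two formulas (function names are mine):

```python
import numpy as np

def frobenius_distance(P, Q):
    # d_E(P, Q) = ||P - Q||_F = sqrt(tr((P - Q)^T (P - Q)))
    return np.linalg.norm(P - Q, ord="fro")

def euclidean_centroid(tensors, weights=None):
    # Element-wise weighted average, the minimizer of sum_i w_i ||C - T_i||_F^2
    T = np.asarray(tensors, dtype=float)
    w = np.full(len(T), 1.0 / len(T)) if weights is None else np.asarray(weights) / np.sum(weights)
    return np.einsum("i,ijk->jk", w, T)
```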


Matrix vectorization & computational geometry

Computational geometry on $w \times h$-dimensional matrix spaces with respect to the Frobenius distance amounts to computational geometry on a Euclidean vector space of dimension $D = w \times h$: → Voronoi diagrams, smallest enclosing ball, minimum spanning tree, etc.

For symmetric matrices, we have $D = \frac{d(d+1)}{2}$ degrees of freedom, and we vectorize as follows:

$$\|M\|_F = \sqrt{\sum_{i=1}^d \sum_{j=1}^d m_{ij}^2} = \sqrt{\sum_{i=1}^d m_{ii}^2 + 2 \sum_{i=1}^{d-1} \sum_{j=i+1}^d m_{ij}^2} = \|m\|_2$$

with $m = [m_{11} \ldots m_{dd}\ \sqrt{2}\, m_{12} \ldots \sqrt{2}\, m_{1d} \ldots \sqrt{2}\, m_{d-1,d}]^\top = \vec{M}$.
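A short NumPy sketch of this $\sqrt{2}$-weighted vectorization, checking that it preserves the Frobenius norm (the helper name `sym_vectorize` is mine):

```python
import numpy as np

def sym_vectorize(M):
    """Map a symmetric d x d matrix to a d(d+1)/2-vector with sqrt(2)-scaled
    off-diagonal entries, so that ||M||_F == ||sym_vectorize(M)||_2."""
    d = M.shape[0]
    iu = np.triu_indices(d, k=1)                 # strictly upper-triangular indices
    return np.concatenate([np.diag(M), np.sqrt(2.0) * M[iu]])

M = np.array([[2.0, 1.0], [1.0, 3.0]])
assert np.isclose(np.linalg.norm(sym_vectorize(M)), np.linalg.norm(M, "fro"))
```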

Matrix functions

From the spectral decomposition $M = U D U^\top$ with $D = \lambda(M) = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ the diagonal matrix of eigenvalues, a real-valued function $x \mapsto f(x)$ extends to matrices as

$$f(M) = U\, \mathrm{diag}(f(\lambda_1), \ldots, f(\lambda_d))\, U^\top$$

Examples: $\log x$, $\exp x$, $|x|$, $x^{\frac{1}{2}}$, $x^2$, etc. $O(d^3)$ factorization cost (eigendecomposition/SVD).
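A minimal NumPy sketch of this spectral extension (for symmetric matrices, `np.linalg.eigh` plays the role of the $O(d^3)$ factorization):

```python
import numpy as np

def matrix_function(M, f):
    """f(M) = U diag(f(lambda_1), ..., f(lambda_d)) U^T from M = U D U^T."""
    eigvals, U = np.linalg.eigh(M)               # eigendecomposition of a symmetric matrix
    return (U * f(eigvals)) @ U.T

P = np.array([[4.0, 1.0], [1.0, 3.0]])
sqrtP = matrix_function(P, np.sqrt)
assert np.allclose(sqrtP @ sqrtP, P)             # x^{1/2} extended to matrices
logP = matrix_function(P, np.log)                # likewise log, exp, |x|, x^2, ...
```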


Riemannian cone of SPD matrices

Exponential maps from tangent planes (symmetric matrices $\mathrm{Sym}$) to the manifold cone $\mathcal{C}$:

$$\exp_P : T_P\mathcal{C} = \mathrm{Sym} \to \mathcal{C}$$

Logarithmic maps from the manifold cone $\mathcal{C}$ to tangent planes:

$$\log_P : \mathcal{C} \to T_P\mathcal{C} = \mathrm{Sym}, \qquad \log_P(Q) = P^{\frac{1}{2}} \log\big(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\big) P^{\frac{1}{2}}$$

This maps any point $Q \in \mathrm{Sym}^{++}$ to the unique tangent vector at $P$ such that $\gamma_0 = P$ and $\gamma_1 = Q$. Geodesic equation:

$$\gamma_t(P, Q) = P^{\frac{1}{2}} \big(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\big)^t P^{\frac{1}{2}}$$

Geodesic (metric length) distance:

$$d_R(P, Q) = \big\| \log P^{-\frac{1}{2}} Q P^{-\frac{1}{2}} \big\|$$
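A hedged NumPy sketch of the geodesic and the geodesic distance (helper names are mine; SPD inputs are assumed so that `eigh` applies):

```python
import numpy as np

def _mat_fun(M, f):
    vals, U = np.linalg.eigh(M)
    return (U * f(vals)) @ U.T

def spd_geodesic(P, Q, t):
    # gamma_t(P, Q) = P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}
    Pis = _mat_fun(P, lambda l: 1.0 / np.sqrt(l))    # P^{-1/2}
    Ps = _mat_fun(P, np.sqrt)                        # P^{1/2}
    return Ps @ _mat_fun(Pis @ Q @ Pis, lambda l: l ** t) @ Ps

def spd_distance(P, Q):
    # d_R(P, Q) = || log(P^{-1/2} Q P^{-1/2}) ||_F
    Pis = _mat_fun(P, lambda l: 1.0 / np.sqrt(l))
    return np.linalg.norm(_mat_fun(Pis @ Q @ Pis, np.log), "fro")
```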

Riemannian Karcher centroid

$$d_R(P, Q) = \sqrt{\mathrm{tr}\, \log^2(P^{-1} Q)} = \sqrt{\sum_{i=1}^d \log^2 \lambda_i} = \big\| \log P^{-\frac{1}{2}} Q P^{-\frac{1}{2}} \big\|$$

where the $\lambda_i$'s are the eigenvalues of $P^{-1}Q$ (note that $P^{-1}Q \sim Q^{\frac{1}{2}} P^{-1} Q^{\frac{1}{2}}$, similar matrices).

The unique mean is characterized by $\sum_{i=1}^n \log(T_i^{-1} C_R) = 0$. A closed-form solution exists only for $n = 2$:

$$C_R(P, Q) = P^{\frac{1}{2}} \big(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\big)^{\frac{1}{2}} P^{\frac{1}{2}}$$

otherwise an iterative approximation ($C_R = \lim_{t \to \infty} C_t$), sketched below:

$$C_{t+1} = C_t \exp\left( \frac{1}{n} \sum_{i=1}^n \log C_t^{-1} T_i \right)$$
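A sketch of this fixed-point iteration with SciPy's matrix exponential/logarithm; the initialization at the arithmetic mean, the iteration cap, and the re-symmetrization against round-off are my choices, not part of the slide:

```python
import numpy as np
from scipy.linalg import expm, logm, inv

def karcher_mean(mats, iters=100, tol=1e-12):
    """C_{t+1} = C_t exp((1/n) sum_i log(C_t^{-1} T_i)) until the tangent mean vanishes."""
    C = np.mean(mats, axis=0)                    # start at the arithmetic mean
    for _ in range(iters):
        S = sum(logm(inv(C) @ T) for T in mats).real / len(mats)
        C = C @ expm(S)
        C = 0.5 * (C + C.T)                      # re-symmetrize against round-off
        if np.linalg.norm(S, "fro") < tol:
            break
    return C
```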

Riemannian minimax SPD center (circumcenter [1])

Case $p = \infty$: the center that minimizes the maximum distance.

GEO-ALG: start with $c_1 \in \mathcal{P}$ and iteratively update the current circumcenter as $c_{i+1} = \mathrm{Geodesic}(c_i, f_i, \frac{1}{i+1})$, where $f_i$ denotes the farthest point of $\mathcal{P}$ from $c_i$, and $\mathrm{Geodesic}(p, q, t)$ denotes the intermediate point $m$ on the geodesic passing through $p$ and $q$ such that $\rho(p, m) = t \times \rho(p, q)$.

Geodesic:

$$\gamma_t(P, Q) = P^{\frac{1}{2}} \big(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\big)^t P^{\frac{1}{2}}$$

Find $t$ such that $\sum_{i=1}^d \log^2 \lambda_i^t = t^2 \sum_{i=1}^d \log^2 \lambda_i = r^2 \sum_{i=1}^d \log^2 \lambda_i$; that is, $t = r$. A core-set argument proves guaranteed convergence [1].
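A sketch of GEO-ALG on the SPD manifold, reusing `spd_distance` and `spd_geodesic` from the earlier geodesic sketch; the iteration count is an arbitrary illustrative budget:

```python
def riemannian_minimax_center(points, iters=1000):
    """c_{i+1} = Geodesic(c_i, farthest point of the set from c_i, 1/(i+1))."""
    c = points[0]                                # c_1: any point of the input set
    for i in range(1, iters + 1):
        far = max(points, key=lambda T: spd_distance(c, T))
        c = spd_geodesic(c, far, 1.0 / (i + 1))  # step 1/(i+1) of the way toward it
    return c
```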

Matrices as parameters in probability distributions

Exponential families (Gaussian, Wishart, etc.):

$$p(x; \lambda) = p_F(x; \theta) = \exp\big( \langle t(x), \theta \rangle - F(\theta) + k(x) \big)$$

Example: the Poisson distribution

$$p(x; \lambda) = \frac{\lambda^x}{x!} \exp(-\lambda),$$

◮ the sufficient statistic $t(x) = x$,
◮ $\theta = \log \lambda$, the natural parameter,
◮ $F(\theta) = \exp \theta$, the log-normalizer → CONVEX,
◮ and $k(x) = -\log x!$ the carrier measure (with respect to the counting measure).
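A quick pure-Python check of this canonical decomposition against the textbook Poisson pmf (names are mine):

```python
import math

def poisson_pmf_exp_family(x, lam):
    """p(x; lambda) = exp(<t(x), theta> - F(theta) + k(x)) with t(x) = x,
    theta = log(lambda), F(theta) = exp(theta), k(x) = -log(x!)."""
    theta = math.log(lam)
    return math.exp(x * theta - math.exp(theta) - math.lgamma(x + 1))

x, lam = 3, 2.5
direct = (lam ** x / math.factorial(x)) * math.exp(-lam)
assert abs(poisson_pmf_exp_family(x, lam) - direct) < 1e-12
```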

Gaussians as an exponential family

$$p(x; \lambda) = p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \sqrt{\det \Sigma}} \exp\left( -\frac{(x - \mu)^\top \Sigma^{-1} (x - \mu)}{2} \right)$$

◮ $\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}) \in \Theta = \mathbb{R}^d \times K_{d \times d}$, with $K_{d \times d}$ the cone of positive definite matrices,
◮ $F(\theta) = \frac{1}{4} \mathrm{tr}(\theta_2^{-1} \theta_1 \theta_1^\top) - \frac{1}{2} \log \det \theta_2 + \frac{d}{2} \log \pi$ → CONVEX,
◮ $t(x) = (x, -x x^\top)$,
◮ $k(x) = 0$.

Inner product: composite, the sum of a dot product and a matrix trace:

$$\langle \theta, \theta' \rangle = \theta_1^\top \theta'_1 + \mathrm{tr}(\theta_2^\top \theta'_2).$$

The coordinate transformation $\tau : \Lambda \to \Theta$ is given for $\lambda = (\mu, \Sigma) = (\lambda_1, \lambda_2)$ by

$$\tau(\lambda) = \left( \lambda_2^{-1} \lambda_1,\ \frac{1}{2} \lambda_2^{-1} \right), \qquad \tau^{-1}(\theta) = \left( \frac{1}{2} \theta_2^{-1} \theta_1,\ \frac{1}{2} \theta_2^{-1} \right)$$
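The same kind of numerical sanity check for the Gaussian canonical form, under the decomposition above (a sketch; names are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf_exp_family(x, mu, Sigma):
    """p(x; mu, Sigma) = exp(<t(x), theta> - F(theta)) with t(x) = (x, -x x^T),
    theta = (Sigma^{-1} mu, Sigma^{-1}/2) and the log-normalizer F of the slide."""
    d = len(mu)
    Sinv = np.linalg.inv(Sigma)
    theta1, theta2 = Sinv @ mu, 0.5 * Sinv
    F = (0.25 * np.trace(np.linalg.inv(theta2) @ np.outer(theta1, theta1))
         - 0.5 * np.log(np.linalg.det(theta2)) + 0.5 * d * np.log(np.pi))
    inner = theta1 @ x + np.trace(theta2.T @ (-np.outer(x, x)))  # composite inner product
    return np.exp(inner - F)

mu, Sigma, x = np.array([1.0, -2.0]), np.array([[2.0, 0.3], [0.3, 1.0]]), np.array([0.5, -1.0])
assert np.isclose(gaussian_pdf_exp_family(x, mu, Sigma), multivariate_normal(mu, Sigma).pdf(x))
```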

Convex duality: Legendre transformation

◮ For a strictly convex and differentiable function $F : \mathcal{X} \to \mathbb{R}$:

$$F^*(y) = \sup_{x \in \mathcal{X}} \underbrace{\{ \langle y, x \rangle - F(x) \}}_{l_F(y;\, x)}$$

◮ Maximum obtained for $y = \nabla F(x)$:

$$\nabla_x l_F(y; x) = y - \nabla F(x) = 0 \;\Rightarrow\; y = \nabla F(x)$$

◮ Maximum unique from the convexity of $F$ ($\nabla^2 F \succ 0$):

$$\nabla_x^2 l_F(y; x) = -\nabla^2 F(x) \prec 0$$

◮ Convex conjugates:

$$(F, \mathcal{X}) \;\Leftrightarrow\; (F^*, \mathcal{Y}), \qquad \mathcal{Y} = \{ \nabla F(x) \mid x \in \mathcal{X} \}$$

Legendre duality: Geometric interpretation

Consider the epigraph of $F$ as a convex object:

◮ convex hull ($V$-representation), versus
◮ half-space ($H$-representation).

The Legendre transform is also called the “slope transform”.

Legendre duality & Canonical divergence

◮ Convex conjugates have functional inverse gradients∇F−1 = ∇F ∗

∇F ∗ may require numerical approximation(not always available in analytical closed-form)

◮ Involution: (F ∗)∗ = F with ∇F ∗ = (∇F )−1.

◮ Convex conjugate F ∗ expressed using (∇F )−1:

F ∗(y) = 〈x , y〉 − F (x), x = ∇yF∗(y)

= 〈(∇F )−1(y), y〉 − F ((∇F )−1(y))

◮ Fenchel-Young inequality at the heart of canonical divergence:

F (x) + F ∗(y) ≥ 〈x , y〉

AF (x : y) = AF∗(y : x) = F (x) + F ∗(y)− 〈x , y〉 ≥ 0

c© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc. 24/56
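A one-dimensional numerical illustration: for the Poisson log-normalizer $F(\theta) = \exp\theta$ seen earlier, the conjugate is $F^*(y) = y \log y - y$, and the Fenchel-Young inequality can be checked directly (a sketch, names mine):

```python
import numpy as np

F, gradF, gradF_inv = np.exp, np.exp, np.log     # (grad F)^{-1} = grad F*

def F_star(y):
    x = gradF_inv(y)                             # the x achieving the supremum
    return y * x - F(x)                          # here: y log y - y

x, y = 1.3, 0.7
assert F(x) + F_star(y) >= x * y                          # Fenchel-Young inequality
assert np.isclose(F_star(gradF(x)), x * gradF(x) - F(x))  # equality at y = grad F(x)
```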

Dual Bregman divergences & canonical divergence [14]

$$\mathrm{KL}(P : Q) = E_P\left[ \log \frac{p(x)}{q(x)} \right] \geq 0$$
$$= B_F(\theta_Q : \theta_P) = B_{F^*}(\eta_P : \eta_Q)$$
$$= F(\theta_Q) + F^*(\eta_P) - \langle \theta_Q, \eta_P \rangle = A_F(\theta_Q : \eta_P) = A_{F^*}(\eta_P : \theta_Q)$$

with $\theta_Q$ the natural parameterization and $\eta_P = E_P[t(X)] = \nabla F(\theta_P)$ the moment parameterization.

$$\mathrm{KL}(P : Q) = \underbrace{\int p(x) \log \frac{1}{q(x)}\, \mathrm{d}x}_{H^\times(P\,:\,Q)} - \underbrace{\int p(x) \log \frac{1}{p(x)}\, \mathrm{d}x}_{H(P)\, =\, H^\times(P\,:\,P)}$$

Shannon cross-entropy and entropy of an exponential family [14]:

$$H^\times(P : Q) = F(\theta_Q) - \langle \theta_Q, \nabla F(\theta_P) \rangle - E_P[k(x)]$$
$$H(P) = F(\theta_P) - \langle \theta_P, \nabla F(\theta_P) \rangle - E_P[k(x)]$$
$$H(P) = -F^*(\eta_P) - E_P[k(x)]$$
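A concrete check for the Poisson family: the KL divergence between the densities equals the Bregman divergence $B_F(\theta_Q : \theta_P)$ for $F(\theta) = \exp\theta$ (a sketch; the closed-form Poisson KL is standard):

```python
import math

def kl_poisson(lp, lq):
    # KL(P : Q) = lp log(lp/lq) + lq - lp for Poisson rates lp, lq
    return lp * math.log(lp / lq) + lq - lp

def bregman_exp(tq, tp):
    # B_F(theta_q : theta_p) with F = exp, so grad F = exp
    return math.exp(tq) - math.exp(tp) - (tq - tp) * math.exp(tp)

lp, lq = 3.0, 1.5
assert abs(kl_poisson(lp, lq) - bregman_exp(math.log(lq), math.log(lp))) < 1e-12
```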

Bregman divergence: Geometric interpretation (I)

Potential function $F$, graph plot $\mathcal{F} : (x, F(x))$.

$$D_F(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle$$
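This formula is directly implementable once $F$ and $\nabla F$ are supplied; a minimal sketch with two classical generators (squared Euclidean norm, negative Shannon entropy):

```python
import numpy as np

def bregman(F, gradF, p, q):
    # D_F(p : q) = F(p) - F(q) - <p - q, grad F(q)>
    return F(p) - F(q) - np.dot(p - q, gradF(q))

p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
sqeuclid = bregman(lambda x: x @ x, lambda x: 2 * x, p, q)                    # ||p - q||^2
kl = bregman(lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1, p, q)  # KL on the simplex
```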


Bregman divergence: Geometric interpretation (II)

Potential function $f$, graph plot $\mathcal{F} : (x, f(x))$.

$$B_f(p \| q) = f(p) - f(q) - (p - q) f'(q)$$

$B_f(\cdot \| q)$: the vertical distance between the hyperplane $H_q$ tangent to $\mathcal{F}$ at the lifted point $\hat{q}$, and the translated hyperplane at $\hat{p}$.


Bregman divergence: Geometric interpretation (III)

Bregman divergence and path integrals:

$$B(\theta_1 : \theta_2) = F(\theta_1) - F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2) \rangle \qquad (3)$$
$$= \int_{\theta_2}^{\theta_1} \langle \nabla F(t) - \nabla F(\theta_2), \mathrm{d}t \rangle \qquad (4)$$
$$= \int_{\eta_1}^{\eta_2} \langle \nabla F^*(t) - \nabla F^*(\eta_1), \mathrm{d}t \rangle \qquad (5)$$
$$= B^*(\eta_2 : \eta_1) \qquad (6)$$


Matrix Bregman divergences [4, 16]

Choose a real-valued functional generator $F$ and extend it to matrices:

$$F(X) = \mathrm{tr}(\Psi(X)), \qquad \Psi(X) = \sum_{k \geq 0} t_{F,k} X^k$$

($t_{F,k}$ from the Taylor expansion of the real-valued $F$)

$$B_F(P : Q) = F(P) - F(Q) - \mathrm{tr}\big((P - Q)^\top \nabla F(Q)\big), \qquad \nabla F(X) = \sum_{k \geq 0} t'_{F,k} X^k$$

($t'_{F,k}$ from the Taylor expansion of the real-valued $F'$)
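A sketch of the matrix Bregman divergence for two generators from this talk, using the spectral extension from the Matrix functions slide rather than an explicit Taylor series (helper names are mine):

```python
import numpy as np

def _spectral(M, f):
    vals, U = np.linalg.eigh(M)                  # symmetric input assumed
    return (U * f(vals)) @ U.T

def matrix_bregman(F, gradF, P, Q):
    # B_F(P : Q) = F(P) - F(Q) - tr((P - Q)^T grad F(Q))
    return F(P) - F(Q) - np.trace((P - Q).T @ gradF(Q))

# von Neumann generator: F(X) = tr(X log X - X), grad F(X) = log X
F_vn, grad_vn = lambda X: np.trace(X @ _spectral(X, np.log) - X), lambda X: _spectral(X, np.log)
# Burg generator: F(X) = -log det X, grad F(X) = -X^{-1} (gives the log-det divergence)
F_burg, grad_burg = lambda X: -np.log(np.linalg.det(X)), lambda X: -np.linalg.inv(X)
```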


Particular case: Bregman Schatten p-divergences [5, 16]

Schatten $p$-norm of a real symmetric matrix $X$ (unitarily invariant matrix norms):

$$\|X\|_p = \|\lambda(X)\|_p$$

Bregman generator:

$$F(X) = \frac{1}{2} \|X\|_p^2$$

Used in regularized convex optimization [5] and matrix data mining [16].

Matrix Legendre transformation

Extends the classical Legendre-Fenchel transformation:

$$F^*(\eta) = \sup_{\mathrm{spec}(\theta) \subseteq \mathrm{dom}(F)} \mathrm{tr}(\theta \eta^\top) - F(\theta)$$

$$D_F(\theta_P : \theta_Q) = D_{F^*}(\eta_Q : \eta_P) = F(\theta_P) + F^*(\eta_Q) - \mathrm{tr}(\theta_P \eta_Q^\top)$$

$\theta$ and $\eta$ are dual matrix coordinate systems on the matrix manifold: a non-metric differential structure with dual coordinate systems.

Bregman matrix means

$$B_F(X, P) = F(X) - F(P) - \mathrm{tr}\big((X - P)^\top \nabla F(P)\big),$$

with $F(\cdot)$ a strictly convex and differentiable function on an open convex space.

$$C = \nabla F^{-1}\left( \sum_{i=1}^n w_i \nabla F(T_i) \right)$$

a quasi-arithmetic mean for $\nabla F$ (see the sketch below). Since $B_F(X, P) \neq B_F(P, X)$, one also defines a right-sided centroid $M'$: the center of mass [13] (independent of the generator $F$). Typical generators:

◮ $F(X) = \mathrm{tr}(X^\top X)$: the quadratic matrix entropy,
◮ $F(X) = -\log \det X$: the matrix Burg entropy, and
◮ $F(X) = \mathrm{tr}(X \log X - X)$: the von Neumann entropy [19, 18, 15] (Umegaki quantum relative entropy).
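A sketch of this quasi-arithmetic matrix mean for two of the generators above (assumptions: SPD inputs; SciPy's `logm`/`expm`; the derived means follow from the generators, not from the slide):

```python
import numpy as np
from scipy.linalg import expm, logm, inv

def bregman_matrix_mean(mats, gradF, gradF_inv, weights=None):
    """C = (grad F)^{-1}( sum_i w_i grad F(T_i) ): a quasi-arithmetic mean for grad F."""
    w = np.full(len(mats), 1.0 / len(mats)) if weights is None else weights
    return gradF_inv(sum(wi * gradF(T) for wi, T in zip(w, mats)))

# von Neumann generator (grad F = log): the log-Euclidean mean
log_euclidean_mean = lambda mats: bregman_matrix_mean(mats, lambda X: logm(X).real, expm)
# Burg generator (grad F = -X^{-1}): the harmonic matrix mean
harmonic_mean = lambda mats: bregman_matrix_mean(mats, lambda X: -inv(X), lambda Y: -inv(Y))
```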


Total Bregman divergences (tBD)

Instead of the “vertical” projection in the Bregman divergence, consider the perpendicular projection (analogy with least squares versus total least squares regression):

$$\mathrm{tB}_F(P, Q) = \frac{B_F(P, Q)}{\sqrt{1 + \|\nabla F(Q)\|^2}}$$

→ proven statistically robust. Applications to robust DT-MRI segmentation [8].
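A sketch of tBD on top of a plain Bregman divergence; the elementwise (Frobenius) pairing makes it work for vectors and matrices alike (names mine):

```python
import numpy as np

def total_bregman(F, gradF, p, q):
    # tB_F(P, Q) = B_F(P, Q) / sqrt(1 + ||grad F(Q)||^2)
    g = np.asarray(gradF(q))
    b = F(p) - F(q) - np.sum((p - q) * g)        # <P - Q, grad F(Q)>
    return b / np.sqrt(1.0 + np.sum(g * g))      # conformal normalization
```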


Matrix Jensen/Burbea-Rao divergences [10]

The convexity gap defines a divergence:

$$\mathrm{BR}_F(P, Q) = \frac{F(P) + F(Q)}{2} - F\left( \frac{P + Q}{2} \right) \geq 0$$

◮ $F(X) = \mathrm{tr}(X^\top X)$: the quadratic matrix entropy,
◮ $F(X) = -\log \det X$: the matrix Burg entropy, and
◮ $F(X) = \mathrm{tr}(X \log X - X)$: the von Neumann entropy,
◮ etc.


Smooth family of convex generators [12, 17]

A 1-parameter family of generators:

$$F_\alpha(X) = \frac{1}{\alpha(1 - \alpha)} \mathrm{tr}\big( \alpha X - X^\alpha + (1 - \alpha) I \big), \qquad \alpha \notin \{0, 1\}$$

$$B_\alpha(P : Q) = \frac{1}{\alpha(1 - \alpha)} \mathrm{tr}\big( Q^\alpha - P^\alpha + \alpha Q^{\alpha - 1}(P - Q) \big)$$

$$\nabla F_\alpha(X) = \frac{1}{1 - \alpha} \big(I - X^{\alpha - 1}\big), \qquad \nabla F_\alpha^{-1}(Y) = \big( I - (1 - \alpha) Y \big)^{\frac{1}{\alpha - 1}}$$

When $\alpha \to 1$, $\nabla F_\alpha(X) \to \nabla F_1(X) = \log X$. When $\alpha \to 0$, $\nabla F_\alpha(X) \to \nabla F_0(X) = I - X^{-1}$.

◮ $\alpha = 2$: quadratic matrix information
◮ $\alpha \to 1$: von Neumann information
◮ $\alpha \to 0$: Burg log-det information


Jensen (Burbea-Rao) divergences

Based on Jensen's inequality for a strictly convex function $F(\cdot)$:

$$\mathrm{BR}_F(X, P) \stackrel{\mathrm{def}}{=} \frac{F(X) + F(P)}{2} - F\left( \frac{X + P}{2} \right) \geq 0.$$

Includes the special case of the Jensen-Shannon divergence:

$$\mathrm{JS}(p, q) = H\left( \frac{p + q}{2} \right) - \frac{H(p) + H(q)}{2}$$

with $F(x) = -H(x)$, the negative Shannon entropy $H(x) = -x \log x$. → Generators are convex and entropies are concave (negative generators).


Visualizing Burbea-Rao divergences

(Figure: Burbea-Rao divergence profiles; includes the squared Mahalanobis distance.)

Burbea-Rao from symmetrizing Bregman divergences [13]

◮ Jeffreys-Bregman divergences:

$$S_F(p; q) = \frac{B_F(p, q) + B_F(q, p)}{2} = \frac{1}{2} \langle p - q, \nabla F(p) - \nabla F(q) \rangle$$

◮ Jensen-Bregman divergences (diversity index):

$$J_F(p; q) = \frac{B_F\big(p, \frac{p+q}{2}\big) + B_F\big(q, \frac{p+q}{2}\big)}{2} = \frac{F(p) + F(q)}{2} - F\left( \frac{p + q}{2} \right) = \mathrm{BR}_F(p, q)$$


Skew Burbea-Rao divergences

$$\mathrm{BR}_F^{(\alpha)} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$$

$$\mathrm{BR}_F^{(\alpha)}(p, q) = \alpha F(p) + (1 - \alpha) F(q) - F(\alpha p + (1 - \alpha) q) = \mathrm{BR}_F^{(1-\alpha)}(q, p)$$

Skew symmetrization of Bregman divergences:

$$\alpha B_F\big(p, \alpha p + (1 - \alpha) q\big) + (1 - \alpha) B_F\big(q, \alpha p + (1 - \alpha) q\big) \stackrel{\mathrm{def}}{=} \mathrm{BR}_F^{(\alpha)}(p, q)$$

= skew Jensen-Bregman divergences.


Bregman divergences = asymptotic skew Jensen divergences

$$B_F(p, q) = \lim_{\alpha \to 1} \frac{1}{1 - \alpha} \mathrm{BR}_F^{(\alpha)}(p, q)$$

$$B_F(q, p) = \lim_{\alpha \to 0} \frac{1}{\alpha} \mathrm{BR}_F^{(\alpha)}(p, q)$$


Burbea-Rao/Jensen centroids

(For $p = 1$.)

$$\mathrm{OPT}: \quad C_F = \arg\min_X \sum_{i=1}^n w_i\, \mathrm{BR}_F^{(\alpha_i)}(X, T_i) = \arg\min_X L(X)$$

Without loss of generality, equivalent to minimizing

$$E(C) = \left( \sum_{i=1}^n w_i \alpha_i \right) F(C) - \sum_{i=1}^n w_i F\big( \alpha_i C + (1 - \alpha_i) T_i \big)$$

A sum $E = F + G$ of a convex $F$ and a concave $G$ function ⇒ Convex-ConCave Procedure (CCCP, NIPS*01). Start from an arbitrary $C_0$, and iteratively update as

$$\nabla F(C_{t+1}) = -\nabla G(C_t)$$

⇒ guaranteed convergence to a (local) minimum.


ConCave Convex Procedure (CCCP)

$$\min_x E(x) = F(x) + G(x), \qquad \nabla F(c_{t+1}) = -\nabla G(c_t)$$

The decomposition may not be unique...


Iterative algorithm for Burbea-Rao centroids

Apply the CCCP scheme (sketched below):

$$\nabla F(C_{t+1}) = \frac{1}{\sum_{i=1}^n w_i \alpha_i} \sum_{i=1}^n w_i \alpha_i \nabla F\big( \alpha_i C_t + (1 - \alpha_i) T_i \big)$$

$$C_{t+1} = \nabla F^{-1}\left( \frac{1}{\sum_{i=1}^n w_i \alpha_i} \sum_{i=1}^n w_i \alpha_i \nabla F\big( \alpha_i C_t + (1 - \alpha_i) T_i \big) \right)$$

This yields arbitrarily fine approximations of the (skew) Burbea-Rao matrix centroids and barycenters.
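A generic sketch of this CCCP update (it works elementwise for vectors or matrices, given $\nabla F$ and its inverse; the defaults and names are mine):

```python
import numpy as np

def burbea_rao_centroid(points, gradF, gradF_inv, alphas=None, weights=None, iters=100):
    """C_{t+1} = (grad F)^{-1}( sum_i w_i a_i grad F(a_i C_t + (1-a_i) T_i) / sum_i w_i a_i )."""
    n = len(points)
    a = np.full(n, 0.5) if alphas is None else np.asarray(alphas)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights)
    C = sum(wi * T for wi, T in zip(w, points)) / w.sum()   # start at the centroid
    for _ in range(iters):
        s = sum(wi * ai * gradF(ai * C + (1 - ai) * T) for wi, ai, T in zip(w, a, points))
        C = gradF_inv(s / np.dot(w, a))
    return C

# Example: symmetric Jensen centroid of positive vectors with F(x) = sum x log x
pts = [np.array([0.2, 0.8]), np.array([0.6, 0.4])]
c = burbea_rao_centroid(pts, lambda x: np.log(x) + 1, lambda y: np.exp(y - 1))
```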


Special case: α-log det divergence [15, 11]

Cone of Hermitian positive definite matrices (self-adjoint matrices $M^H = \bar{M}^\top = M$). $F(X) = -\log \det X$, $\nabla F(X) = \nabla F^{-1}(X) = -X^{-1}$.

Burbea-Rao α-log det divergences:

$$D^{(\alpha)}_{\mathrm{ld}}(P, Q) = \begin{cases} \mathrm{tr}(Q^{-1}P - I) - \log \det(Q^{-1}P) & \alpha = 1 \\[1ex] \dfrac{4}{1 - \alpha^2} \log \dfrac{\det\left( \frac{1-\alpha}{2} P + \frac{1+\alpha}{2} Q \right)}{(\det P)^{\frac{1-\alpha}{2}} (\det Q)^{\frac{1+\alpha}{2}}} & \alpha \in \mathbb{R} \setminus \{-1, 1\} \\[1ex] \mathrm{tr}(P^{-1}Q - I) - \log \det(P^{-1}Q) & \alpha = -1 \end{cases}$$

Start with $C_1 = \frac{1}{n} \sum_{i=1}^n T_i$, then iterate

$$C_{t+1} = n \left( \sum_{i=1}^n \left( \frac{1 - \alpha}{2} T_i + \frac{1 + \alpha}{2} C_t \right)^{-1} \right)^{-1}$$

→ unique global mean (obtained from CCCP).
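A direct sketch of this fixed point (SPD inputs assumed; the iteration cap is illustrative):

```python
import numpy as np

def alpha_logdet_mean(mats, alpha=0.0, iters=100):
    """C_{t+1} = n ( sum_i ((1-alpha)/2 T_i + (1+alpha)/2 C_t)^{-1} )^{-1}, from C_1 = mean."""
    C = np.mean(mats, axis=0)
    for _ in range(iters):
        S = sum(np.linalg.inv(0.5 * (1 - alpha) * T + 0.5 * (1 + alpha) * C) for T in mats)
        C = len(mats) * np.linalg.inv(S)
    return C
```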


Bhattacharyya coefficients/distances

Bhattacharyya coefficient and non-metric distance:

$$C(p, q) = \int \sqrt{p(x) q(x)}\, \mathrm{d}x, \quad 0 < C(p, q) \leq 1, \qquad B(p, q) = -\ln C(p, q)$$

(the coefficient is always strictly positive). The Hellinger metric

$$H(p, q) = \sqrt{ \frac{1}{2} \int \big( \sqrt{p(x)} - \sqrt{q(x)} \big)^2 \mathrm{d}x }$$

is such that $0 \leq H(p, q) \leq 1$:

$$H(p, q) = \sqrt{ \frac{1}{2} \left( \int p(x)\, \mathrm{d}x + \int q(x)\, \mathrm{d}x - 2 \int \sqrt{p(x)} \sqrt{q(x)}\, \mathrm{d}x \right) } = \sqrt{1 - C(p, q)}.$$


Chernoff coefficients/α-divergences

Skew Bhattacharyya divergences based on Chernoff α-coefficients:

$$B_\alpha(p, q) = -\ln \int_x p^\alpha(x)\, q^{1-\alpha}(x)\, \mathrm{d}x = -\ln C_\alpha(p, q) = -\ln \int_x q(x) \left( \frac{p(x)}{q(x)} \right)^\alpha \mathrm{d}x = -\ln E_q[L^\alpha(x)]$$

Amari α-divergence:

$$D_\alpha(p \| q) = \begin{cases} \dfrac{4}{1 - \alpha^2} \left( 1 - \displaystyle\int p(x)^{\frac{1-\alpha}{2}} q(x)^{\frac{1+\alpha}{2}}\, \mathrm{d}x \right) & \alpha \neq \pm 1 \\[1ex] \displaystyle\int p(x) \log \frac{p(x)}{q(x)}\, \mathrm{d}x = \mathrm{KL}(p, q) & \alpha = -1 \\[1ex] \displaystyle\int q(x) \log \frac{q(x)}{p(x)}\, \mathrm{d}x = \mathrm{KL}(q, p) & \alpha = 1 \end{cases}$$

$D_\alpha(p \| q) = D_{-\alpha}(q \| p)$. Remapping $\alpha' = \frac{1 - \alpha}{2}$ ($\alpha = 1 - 2\alpha'$) yields the Chernoff α′-divergences.

Bhattacharyya/Chernoff of exponential families [10]

Equivalence with skew Burbea-Rao distances:

$$B_\alpha\big(p_F(x; \theta_p), p_F(x; \theta_q)\big) = \mathrm{BR}_F^{(\alpha)}(\theta_p, \theta_q) = \alpha F(\theta_p) + (1 - \alpha) F(\theta_q) - F(\alpha \theta_p + (1 - \alpha) \theta_q) \qquad (7)$$

A Bhattacharyya divergence on probability distributions amounts to computing a Jensen divergence on their parameters.


Closed-form Bhattacharyya distances for exponential families

A generic formula that instantiates into well-known formulas in statistical pattern recognition:

| Exp. fam. | $F(\theta)$ (up to a constant) | Bhattacharyya/Burbea-Rao $B(\lambda_p, \lambda_q) = \mathrm{BR}_F(\tau(\lambda_p), \tau(\lambda_q))$ |
|---|---|---|
| Multinomial | $\log(1 + \sum_{i=1}^{d-1} e^{\theta_i})$ | $-\ln \sum_{i=1}^d \sqrt{p_i q_i}$ |
| Poisson | $\exp \theta$ | $\frac{1}{2} (\sqrt{\mu_p} - \sqrt{\mu_q})^2$ |
| Gaussian (1D) | $-\frac{\theta_1^2}{4\theta_2} + \frac{1}{2} \log\left(-\frac{\pi}{\theta_2}\right)$ | $\frac{1}{4} \frac{(\mu_p - \mu_q)^2}{\sigma_p^2 + \sigma_q^2} + \frac{1}{2} \ln \frac{\sigma_p^2 + \sigma_q^2}{2 \sigma_p \sigma_q}$ |
| Gaussian | $\frac{1}{4} \mathrm{tr}(\Theta^{-1} \theta \theta^\top) - \frac{1}{2} \log \det \Theta$ | $\frac{1}{8} (\mu_p - \mu_q)^\top \left( \frac{\Sigma_p + \Sigma_q}{2} \right)^{-1} (\mu_p - \mu_q) + \frac{1}{2} \ln \frac{\det \frac{\Sigma_p + \Sigma_q}{2}}{\sqrt{\det \Sigma_p \det \Sigma_q}}$ |
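For instance, the last table row translates directly into code (a sketch of the standard closed form):

```python
import numpy as np

def bhattacharyya_gaussians(mu_p, Sigma_p, mu_q, Sigma_q):
    """B = (1/8)(mu_p - mu_q)^T Sigma^{-1} (mu_p - mu_q)
         + (1/2) ln( det Sigma / sqrt(det Sigma_p det Sigma_q) ), Sigma = (Sigma_p + Sigma_q)/2."""
    Sigma = 0.5 * (Sigma_p + Sigma_q)
    dmu = mu_p - mu_q
    quad = dmu @ np.linalg.solve(Sigma, dmu)
    logdet = np.log(np.linalg.det(Sigma) /
                    np.sqrt(np.linalg.det(Sigma_p) * np.linalg.det(Sigma_q)))
    return 0.125 * quad + 0.5 * logdet
```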


Wrapping up

◮ Besides the Euclidean, log-Euclidean and Riemannian metric-based means, proposed divergence-based matrix centroids,
◮ total Bregman divergences and robustness (conformal geometry),
◮ Riemannian minimax center,
◮ skew Burbea-Rao/Jensen divergences extending Bregman divergences,
◮ Bhattacharyya means of densities = Burbea-Rao means on (matrix) parameters.

Which mean do you mean or need?


Non-metric matrix manifolds with dually affine connections

In a nutshell:

◮ asymmetric (Bregman) non-metric divergence,

◮ Legendre transform, convex conjugates & dual divergences

◮ dual θ-, η-, or mixed coordinate systems

◮ dual closed-form affine geodesics (computationally convenient)

◮ Pythagorean theorem


Thank you.

www.informationgeometry.org

“One geometry cannot be more true than another; it can only be more convenient.”

— Jules Henri Poincaré (1902)


Bibliographic references I

[1] Marc Arnaudon and Frank Nielsen. On approximating the Riemannian 1-center. Computational Geometry, 46(1):93–104, 2013.

[2] Rajendra Bhatia. The Riemannian mean of positive matrices. In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 35–51, 2012.

[3] Silvère Bonnabel and Rodolphe Sepulchre. Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank. SIAM Journal on Matrix Analysis and Applications, 31(3):1055–1070, 2009.

[4] Inderjit S. Dhillon and Joel A. Tropp. Matrix nearness problems with Bregman divergences. SIAM Journal on Matrix Analysis and Applications, 29(4):1120–1146, November 2007.

[5] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In Adam Tauman Kalai and Mehryar Mohri, editors, COLT, pages 14–26. Omnipress, 2010.

[6] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the Fourth Alvey Vision Conference, pages 147–151, 1988.


Bibliographic references II

[7] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.

[8] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.

[9] Maher Moakher. A differential geometric approach to the geometric mean of symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 26(3):735–747, 2005.

[10] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.

[11] Frank Nielsen, Meizhu Liu, Xiaojing Ye, and Baba C. Vemuri. Jensen divergence based SPD matrix means and applications. In International Conference on Pattern Recognition (ICPR), 2012.

[12] Frank Nielsen and Richard Nock. Quantum Voronoi diagrams and Holevo channel capacity for 1-qubit quantum states. In IEEE International Symposium on Information Theory (ISIT), pages 96–100, 2008.


Bibliographic references III

[13] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, June 2009.

[14] Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In International Conference on Image Processing (ICIP), pages 3621–3624, 2010.

[15] Richard Nock, Brice Magdalou, Eric Briys, and Frank Nielsen. On tracking portfolios with certainty equivalents on a generalization of Markowitz model: the fool, the wise and the adaptive. In Thorsten Joachims, editor, International Conference on Machine Learning (ICML). Omnipress, 2011.

[16] Richard Nock, Brice Magdalou, Eric Briys, and Frank Nielsen. Mining matrix data with Bregman matrix divergences for portfolio selection. In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 373–402, 2012.

[17] Masanori Ohya and Dénes Petz. Quantum Entropy and Its Use. 1st ed. 1993; corrected 2nd printing, 2004.

[18] Koji Tsuda, Gunnar Rätsch, and Manfred K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, December 2005.


Bibliographic references IV

[19] Hisaharu Umegaki. Conditional expectation in an operator algebra. IV. Entropy and information. Kodai Mathematical Seminar Reports, 14(2):59, 1962.

[20] Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, 2011.

[21] Zhizhou Wang and Baba C. Vemuri. An affine invariant tensor dissimilarity measure and its applications to tensor-valued image segmentation. In CVPR (1), pages 228–233, 2004.
