My life as a mixture. Christian P. Robert, Université Paris-Dauphine, Paris & University of Warwick, Coventry. September 17, 2014. [email protected]
talk at BAYSM'14 on mixtures and some results and considerations on the topic, hopefully not too soporific for the neXt generation!
Transcript
Page 1: BAYSM'14, Wien, Austria

My life as a mixture

Christian P. Robert, Université Paris-Dauphine, Paris & University of Warwick, Coventry

September 17, 2014, [email protected]

Page 2: BAYSM'14, Wien, Austria

Your next Valencia meeting:

- Objective Bayes section of ISBA major meeting:
- O-Bayes 2015 in Valencia, Spain, June 1-4(+1), 2015
- in memory of our friend Susie Bayarri
- objective Bayes, limited information, partly defined and approximate models, &tc
- all flavours of Bayesian analysis welcomed!
- “Spain in June, what else...?!”

Page 3: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 4: BAYSM'14, Wien, Austria

birthdate: May 1989, Ottawa Civic Hospital

Repartition of grey levels in an unprocessed chest radiograph

[X, 1994]

Page 5: BAYSM'14, Wien, Austria

Mixture models

Structure of mixtures of distributions:

x ∼ fj with probability pj ,

for j = 1, 2, . . . , k, with overall density

p1f1(x) + · · ·+ pk fk(x) .

Usual case: parameterised components

∑_{i=1}^k p_i f(x|θ_i),   with   ∑_{i=1}^k p_i = 1,

where the weights p_i are distinguished from the other parameters
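As a minimal illustration of this two-level structure (pick a component with probability p_j, then draw from f_j), here is a short simulation sketch in Python/NumPy, reusing the three-component normal mixture that appears later in the talk (page 57); the function name, seed and sample size are placeholders of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, weights, means, sds):
    """Draw n observations from sum_j p_j N(mu_j, sd_j^2):
    first pick a component for each point, then draw from it."""
    z = rng.choice(len(weights), size=n, p=weights)   # latent allocations
    return rng.normal(means[z], sds[z]), z

x, z = sample_mixture(500, np.array([0.23, 0.62, 0.15]),
                      np.array([2.2, 1.4, 0.6]),
                      np.array([1.2, 0.7, 0.8]))
```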

Page 7: BAYSM'14, Wien, Austria

Motivations

- Dataset made of several latent/missing/unobserved groups/strata/subpopulations. Mixture structure due to the missing origin/allocation of each observation to a specific subpopulation/stratum. Inference on either the allocations (clustering), or on the parameters (θ_i, p_i), or on the number of groups

- Semiparametric perspective where mixtures are functional basis approximations of unknown distributions

Page 9: BAYSM'14, Wien, Austria

License

Dataset derived from [my] license plate image. Grey levels concentrated on 256 values [later jittered]

[Marin & X, 2007]

Page 10: BAYSM'14, Wien, Austria

Likelihood

For a sample of independent random variables (x_1, …, x_n), likelihood

∏_{i=1}^n {p_1 f_1(x_i) + ··· + p_k f_k(x_i)}.

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples. But likelihood still computable [pointwise] in O(kn) time.
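The O(kn) pointwise evaluation reads as follows in a short sketch; the Gaussian components and the toy data are assumptions of mine, and a log-sum-exp reduction is used for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_loglik(x, weights, means, sds):
    """Pointwise mixture log-likelihood in O(kn): build the n x k matrix of
    log(p_j f_j(x_i)) and collapse it with a stable log-sum-exp over j."""
    log_terms = np.log(weights) + norm.logpdf(np.asarray(x)[:, None], means, sds)
    return logsumexp(log_terms, axis=1).sum()

x = np.random.default_rng(0).normal(size=200)
print(mixture_loglik(x, [0.3, 0.7], [0.0, 2.5], [1.0, 1.0]))
```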

Page 12: BAYSM'14, Wien, Austria

Normal mean benchmark

Normal mixture

p N(µ_1, 1) + (1 − p) N(µ_2, 1)

with only means unknown (2-D representation possible)

Identifiability

Parameters µ_1 and µ_2 identifiable: µ_1 cannot be confused with µ_2 when p is different from 0.5.

Presence of a spurious mode, understood by letting p go to 0.5

Page 14: BAYSM'14, Wien, Austria

Bayesian inference on mixtures

For any prior π(θ, p), the posterior distribution of (θ, p) is available up to a multiplicative constant

π(θ, p|x) ∝ [ ∏_{i=1}^n ∑_{j=1}^k p_j f(x_i|θ_j) ] π(θ, p)

at a cost of order O(kn)

Difficulty

Despite this, derivation of posterior characteristics like posterior expectations is only possible in an exponential time of order O(k^n)!

Page 16: BAYSM'14, Wien, Austria

Missing variable representation

Associate to each x_i a missing/latent variable z_i that indicates its component:

z_i | p ∼ M_k(p_1, …, p_k)

and

x_i | z_i, θ ∼ f(·|θ_{z_i}).

Completed likelihood

ℓ(θ, p | x, z) = ∏_{i=1}^n p_{z_i} f(x_i|θ_{z_i}),

and

π(θ, p | x, z) ∝ [ ∏_{i=1}^n p_{z_i} f(x_i|θ_{z_i}) ] π(θ, p)

where z = (z_1, …, z_n)

Page 18: BAYSM'14, Wien, Austria

Gibbs sampling for mixture models

Take advantage of the missing data structure:

Algorithm

- Initialization: choose p^(0) and θ^(0) arbitrarily
- Step t. For t = 1, …
  1. Generate z_i^(t) (i = 1, …, n) from (j = 1, …, k)
     P(z_i^(t) = j | p_j^(t−1), θ_j^(t−1), x_i) ∝ p_j^(t−1) f(x_i | θ_j^(t−1))
  2. Generate p^(t) from π(p|z^(t)),
  3. Generate θ^(t) from π(θ|z^(t), x).

[Brooks & Gelman, 1990; Diebolt & X, 1990, 1994; Escobar & West, 1991]

Page 19: BAYSM'14, Wien, Austria

Normal mean example (cont’d)

Algorithm

- Initialization. Choose µ_1^(0) and µ_2^(0),
- Step t. For t = 1, …
  1. Generate z_i^(t) (i = 1, …, n) from
     P(z_i^(t) = 1) = 1 − P(z_i^(t) = 2) ∝ p exp{ −(x_i − µ_1^(t−1))^2 / 2 }
  2. Compute n_j^(t) = ∑_{i=1}^n I_{z_i^(t)=j} and (s_j^x)^(t) = ∑_{i=1}^n I_{z_i^(t)=j} x_i
  3. Generate µ_j^(t) (j = 1, 2) from N( (λδ + (s_j^x)^(t)) / (λ + n_j^(t)), 1/(λ + n_j^(t)) ).
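A direct transcription of this two-component sampler as a hedged sketch: the weight p is treated as known (as on the slide), the means get independent N(δ, 1/λ) priors matching step 3, and the simulated data, p, λ, δ and the number of iterations are placeholders of mine.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_normal_means(x, p=0.3, lam=1.0, delta=0.0, n_iter=5000):
    """Gibbs sampler for p N(mu1,1) + (1-p) N(mu2,1) with known weight p
    and independent N(delta, 1/lam) priors on the two means."""
    n = len(x)
    w = np.array([p, 1.0 - p])
    mu = rng.normal(delta, 1.0, size=2)               # arbitrary initialisation
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        # 1. allocations: P(z_i = j) prop. to p_j exp(-(x_i - mu_j)^2 / 2)
        logq = np.log(w) - 0.5 * (x[:, None] - mu) ** 2
        prob1 = 1.0 / (1.0 + np.exp(logq[:, 1] - logq[:, 0]))
        z = (rng.random(n) > prob1).astype(int)       # 0 -> component 1, 1 -> component 2
        # 2. sufficient statistics n_j and s_j
        nj = np.array([(z == 0).sum(), (z == 1).sum()])
        sj = np.array([x[z == 0].sum(), x[z == 1].sum()])
        # 3. conjugate updates of the two means
        mu = rng.normal((lam * delta + sj) / (lam + nj), 1.0 / np.sqrt(lam + nj))
        chain[t] = mu
    return chain

x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(2.5, 1.0, 350)])
chain = gibbs_normal_means(x, p=0.3)
```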

Page 20: BAYSM'14, Wien, Austria

Normal mean example (cont’d)

[Figure: Gibbs sample of (µ_1, µ_2), (a) initialised at random]

[X & Casella, 2009]

Page 21: BAYSM'14, Wien, Austria

Normal mean example (cont’d)

[Figure: Gibbs samples of (µ_1, µ_2): (a) initialised at random; (b) initialised close to the lower mode]

[X & Casella, 2009]

Page 22: BAYSM'14, Wien, Austria

License

Consider k = 3 components, a D_3(1/2, 1/2, 1/2) prior for the weights, a N(x̄, σ̂^2/3) prior on the means µ_i, and a Ga(10, σ̂^2) prior on the precisions σ_i^{-2}, where x̄ and σ̂^2 are the empirical mean and variance of License

[Empirical Bayes]

[Marin & X, 2007]

Page 23: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 24: BAYSM'14, Wien, Austria

weakly informative priors

- possible symmetric empirical Bayes priors

  p ∼ D(γ, …, γ),  θ_i ∼ N(µ, ωσ_i^2),  σ_i^{-2} ∼ Ga(ν, εν)

  which can be replaced with hierarchical priors
  [Diebolt & X, 1990; Richardson & Green, 1997]

- independent improper priors on the θ_j's prohibited, thus standard “flat” and Jeffreys priors impossible to use (except with the exclude-empty-component trick)

  [Diebolt & X, 1990; Wasserman, 1999]

Page 25: BAYSM'14, Wien, Austria

weakly informative priors

- Reparameterization to compact set for use of uniform priors

  µ_i → e^{µ_i}/(1 + e^{µ_i}),  σ_i → σ_i/(1 + σ_i)

  [Chopin, 2000]

- dependent weakly informative priors

  p ∼ D(k, …, 1),  θ_i ∼ N(θ_{i−1}, ζσ_{i−1}^2),  σ_i ∼ U([0, σ_{i−1}])

  [Mengersen & X, 1996; X & Titterington, 1998]

- reference priors

  p ∼ D(1, …, 1),  θ_i ∼ N(µ_0, (σ_i^2 + τ_0^2)/2),  σ_i^2 ∼ C^+(0, τ_0^2)

  [Moreno & Liseo, 1999]

Page 26: BAYSM'14, Wien, Austria

Re-ban on improper priors

Difficult to use improper priors in the setting of mixtures because independent improper priors,

π(θ) = ∏_{i=1}^k π_i(θ_i),  with  ∫ π_i(θ_i) dθ_i = ∞,

end up, for all n's, with the property

∫ π(θ, p|x) dθ dp = ∞

Reason

There are (k − 1)^n terms among the k^n terms in the expansion that allocate no observation at all to the i-th component

Page 28: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 29: BAYSM'14, Wien, Austria

Connected difficulties

1. Number of modes of the likelihood of order O(k!): maximization and even [MCMC] exploration of the posterior surface harder

2. Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: posterior expectation of θ_1 equal to posterior expectation of θ_2

Page 31: BAYSM'14, Wien, Austria

License

When the Gibbs output does not (re)produce exchangeability, the Gibbs sampler has failed to explore the whole parameter space: not enough energy to switch enough component allocations simultaneously

[Marin & X, 2007]

Page 32: BAYSM'14, Wien, Austria

Label switching paradox

- We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler.

- If we observe it, then we do not know how to estimate the parameters.

- If we do not, then we are uncertain about the convergence!!!

[Celeux, Hurn & X, 2000]

[Fruhwirth-Schnatter, 2001, 2004]

[Holmes, Jasra & Stephens, 2005]

Page 35: BAYSM'14, Wien, Austria

Constraints

Usual reply to lack of identifiability: impose constraints like

µ_1 ≤ … ≤ µ_k

in the prior

Mostly incompatible with the topology of the posterior surface: posterior expectations then depend on the choice of the constraints.

Computational “detail”

The constraint need not be imposed during the simulation but can instead be imposed after simulation, by reordering the MCMC output according to the constraints. [This avoids possible negative effects on convergence]

Page 38: BAYSM'14, Wien, Austria

Relabeling towards the mode

Selection of one of the k! modal regions of the posterior, post-simulation, by computing the approximate MAP

(θ, p)^(i*)  with  i* = arg max_{i=1,…,M} π{ (θ, p)^(i) | x }

Pivotal Reordering

At iteration i ∈ {1, …, M},

1. Compute the optimal permutation

   τ_i = arg min_{τ∈S_k} d( τ{(θ^(i), p^(i))}, (θ^(i*), p^(i*)) )

   where d(·, ·) is a distance in the parameter space.

2. Set (θ^(i), p^(i)) = τ_i((θ^(i), p^(i))).

[Celeux, 1998; Stephens, 2000; Celeux, Hurn & X, 2000]
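A minimal sketch of this pivotal reordering, assuming a single parameter per component (so a draw is the concatenation of θ^(i) and p^(i)) and a squared Euclidean distance for d; it brute-forces all k! permutations, so it is only sensible for small k.

```python
import numpy as np
from itertools import permutations

def pivotal_reorder(chain_theta, chain_p, log_post):
    """Relabel an MCMC sample of a k-component mixture towards the approximate
    MAP: each draw is permuted so as to minimise its distance to the
    highest-posterior draw (the pivot)."""
    i_star = int(np.argmax(log_post))                  # approximate MAP draw
    pivot = np.concatenate([chain_theta[i_star], chain_p[i_star]])
    k = chain_p.shape[1]
    perms = list(permutations(range(k)))
    out_theta, out_p = chain_theta.copy(), chain_p.copy()
    for i in range(len(log_post)):
        dists = [np.sum((np.concatenate([chain_theta[i][list(s)],
                                         chain_p[i][list(s)]]) - pivot) ** 2)
                 for s in perms]
        best = list(perms[int(np.argmin(dists))])
        out_theta[i], out_p[i] = chain_theta[i][best], chain_p[i][best]
    return out_theta, out_p
```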

Page 40: BAYSM'14, Wien, Austria

Loss functions for mixture estimation

Global loss function that considers the distance between predictives

L(ξ, ξ̂) = ∫_X f_ξ(x) log{ f_ξ(x) / f_ξ̂(x) } dx

eliminates the labelling effect

Similar solution for estimating clusters through the allocation variables

L(z, ẑ) = ∑_{i<j} ( I_{[z_i=z_j]}(1 − I_{[ẑ_i=ẑ_j]}) + I_{[ẑ_i=ẑ_j]}(1 − I_{[z_i=z_j]}) ).

[Celeux, Hurn & X, 2000]
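A short sketch of this allocation-based loss: it only counts pairwise agreements and disagreements between the two clusterings, and is therefore invariant to relabelling of the components. Integer-coded allocation vectors are assumed.

```python
import numpy as np

def allocation_loss(z, z_hat):
    """Pairwise disagreement loss between two clusterings: counts the pairs
    (i, j) put together by one allocation and apart by the other."""
    z, z_hat = np.asarray(z), np.asarray(z_hat)
    same = z[:, None] == z[None, :]
    same_hat = z_hat[:, None] == z_hat[None, :]
    disagree = same ^ same_hat                       # XOR over all pairs
    return int(np.triu(disagree, k=1).sum())         # keep i < j only
```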

Page 41: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 42: BAYSM'14, Wien, Austria

MAP estimation

For high-dimensional parameter space, difficulty with marginal MAP (MMAP) estimates because nuisance parameters must be integrated out:

θ_1^MMAP = arg max_{Θ_1} p(θ_1|y)

where

p(θ_1|y) = ∫_{Θ_2} p(θ_1, θ_2|y) dθ_2

Page 43: BAYSM'14, Wien, Austria

MAP estimation

SAME stands for State Augmentation for Marginal Estimation
[Doucet, Godsill & X, 2001]

Artificially augmented probability model whose marginal distribution is

p_γ(θ_1|y) ∝ p(θ_1|y)^γ

via replications of the nuisance parameters:

- Replace θ_2 with γ artificial replications, θ_2(1), …, θ_2(γ)

- Treat the θ_2(j)'s as distinct random variables:

  q_γ(θ_1, θ_2(1), …, θ_2(γ)|y) ∝ ∏_{k=1}^γ p(θ_1, θ_2(k)|y)

- Use the corresponding marginal for θ_1:

  q_γ(θ_1|y) = ∫ q_γ(θ_1, θ_2(1), …, θ_2(γ)|y) dθ_2(1) … dθ_2(γ)
             ∝ ∫ ∏_{k=1}^γ p(θ_1, θ_2(k)|y) dθ_2(1) … dθ_2(γ)
             = p_γ(θ_1|y)

- Build an MCMC algorithm in the augmented space, with invariant distribution q_γ(θ_1, θ_2(1), …, θ_2(γ)|y)

- Use the simulated subsequence {θ_1^(i); i ∈ N} as drawn from the marginal posterior p_γ(θ_1|y)
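A hedged sketch of SAME for the two-component normal-means mixture used earlier in the talk, where the nuisance θ_2 is the allocation vector z: γ independent copies of z are refreshed at each sweep and the conditional of the means pools their sufficient statistics (with the prior replicated as well). My understanding is that the original algorithm typically lets γ increase along iterations; a fixed γ is kept here for brevity, and the prior values, data and weight p are placeholders of mine.

```python
import numpy as np

rng = np.random.default_rng(2)

def same_normal_means(x, p=0.3, lam=1.0, delta=0.0, gamma=10, n_iter=5000):
    """SAME-type sampler for p N(mu1,1) + (1-p) N(mu2,1): replicating the
    latent allocations gamma times makes the chain on (mu1, mu2) target a
    distribution proportional to the marginal posterior to the power gamma."""
    n = len(x)
    w = np.array([p, 1.0 - p])
    mu = rng.normal(delta, 1.0, size=2)
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        nj, sj = np.zeros(2), np.zeros(2)
        for _ in range(gamma):                        # gamma independent copies of z
            logq = np.log(w) - 0.5 * (x[:, None] - mu) ** 2
            prob1 = 1.0 / (1.0 + np.exp(logq[:, 1] - logq[:, 0]))
            z = (rng.random(n) > prob1).astype(int)
            nj += np.array([(z == 0).sum(), (z == 1).sum()])
            sj += np.array([x[z == 0].sum(), x[z == 1].sum()])
        # conditional of mu_j pools the gamma completed posteriors
        # (the N(delta, 1/lam) prior precision is also replicated gamma times)
        mu = rng.normal((gamma * lam * delta + sj) / (gamma * lam + nj),
                        1.0 / np.sqrt(gamma * lam + nj))
        chain[t] = mu
    return chain

x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(2.5, 1.0, 350)])
chain = same_normal_means(x)
```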

Page 48: BAYSM'14, Wien, Austria

example: Galaxy dataset benchmark

82 observations of galaxy velocities from 3 (?) groups

Algorithm                   EM      MCEM    SAME
Mean log-posterior          65.47   60.73   66.22
Std dev of log-posterior     2.31    4.48    0.02

[Doucet & X, 2002]

Page 49: BAYSM'14, Wien, Austria

Really the SAME?!

SAME algorithm re-invented in many guises:

- Gaetan & Yao, 2003, Biometrika
- Jacquier, Johannes & Polson, 2007, J. Econometrics
- Lele, Dennis & Lutscher, 2007, Ecology Letters [data cloning]
- ...

Page 50: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 51: BAYSM'14, Wien, Austria

Propp and Wilson’s perfect sampler

Difficulty devising MCMC stopping rules: when should one stop an MCMC algorithm?!

Principle: Coupling from the past

Rather than start at t = 0 and wait till t = +∞, start at t = −∞ and wait till t = 0

[Propp & Wilson, 1996]

Outcome at time t = 0 is stationary

Page 53: BAYSM'14, Wien, Austria

CFTP Algorithm

Algorithm (Coupling from the past)

1. Start from the m possible values at time −t
2. Run the m chains till time 0 (coupling allowed)
3. Check if the chains are equal at time 0
4. If not, start further back: t ← 2 ∗ t, using the same random numbers for the times already simulated

- requires a finite state space
- probability of merging chains must be high enough
- hard to implement w/o a monotonicity in both state space and transition
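A toy illustration of coupling from the past on a small finite chain (a reflected random walk, monotone in its driving uniform): random numbers are stored so the same ones are reused when restarting further back, and t is doubled until all m starting values have coalesced by time 0. The chain and update rule are mine, chosen only to keep the sketch short.

```python
import numpy as np

def update(state, u, m=6):
    """One random-walk step on {0, ..., m-1}, driven by a shared uniform u
    so that all chains are coupled (and the update is monotone in the state)."""
    if u < 0.5:
        return max(state - 1, 0)
    return min(state + 1, m - 1)

def cftp(m=6, seed=0):
    """Coupling from the past: u[s] drives the transition from time -(s+1)
    to time -s and is generated once; extending further back only appends
    uniforms for earlier times. The coalesced value at time 0 is an exact
    draw from the stationary distribution of the chain."""
    rng = np.random.default_rng(seed)
    u, t = [], 1
    while True:
        u.extend(rng.random(t - len(u)))     # fresh randomness for the added, earlier times
        states = list(range(m))              # all possible starting values at time -t
        for s in range(t - 1, -1, -1):       # run from time -t up to time 0
            states = [update(x, u[s], m) for x in states]
        if len(set(states)) == 1:
            return states[0]
        t *= 2

print(cftp())
```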

Page 55: BAYSM'14, Wien, Austria

Mixture models

Simplest possible mixture structure

p f_0(x) + (1 − p) f_1(x),

with uniform prior on p.

Algorithm (Data Augmentation Gibbs sampler)

At iteration t:

1. Generate n iid U(0, 1) rv's u_1^(t), …, u_n^(t).
2. Derive the indicator variables z_i^(t) as z_i^(t) = 0 iff

   u_i^(t) ≤ q_i^(t−1) = p^(t−1) f_0(x_i) / { p^(t−1) f_0(x_i) + (1 − p^(t−1)) f_1(x_i) }

   and compute m^(t) = ∑_{i=1}^n z_i^(t).
3. Simulate p^(t) ∼ Be(n + 1 − m^(t), 1 + m^(t)).
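A direct transcription of this data-augmentation sampler as a sketch; the two known component densities f0, f1 and the simulated data are placeholders of mine.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# known component densities (placeholders)
f0 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
f1 = lambda x: norm.pdf(x, loc=2.5, scale=1.0)

def da_gibbs(x, n_iter=5000):
    """Data-augmentation Gibbs sampler for p f0(x) + (1-p) f1(x) with a
    uniform prior on p: threshold uniforms against q_i to allocate each
    point, then draw p from its Beta conditional."""
    n = len(x)
    p = 0.5
    trace = np.empty(n_iter)
    for t in range(n_iter):
        q = p * f0(x) / (p * f0(x) + (1 - p) * f1(x))
        z = (rng.random(n) > q).astype(int)        # z_i = 0 iff u_i <= q_i
        m = z.sum()
        p = rng.beta(n + 1 - m, 1 + m)
        trace[t] = p
    return trace

x = np.concatenate([rng.normal(0.0, 1.0, 120), rng.normal(2.5, 1.0, 80)])
trace = da_gibbs(x)
```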

Page 56: BAYSM'14, Wien, Austria

Mixture models

Algorithm (CFTP Gibbs sampler)

At iteration −t:

1. Generate n iid uniform rv's u_1^(−t), …, u_n^(−t).
2. Partition [0, 1) into intervals [q_[j], q_[j+1]).
3. For each [q_[j]^(−t), q_[j+1]^(−t)), generate p_j^(−t) ∼ Be(n − j + 1, j + 1).
4. For each j = 0, 1, …, n, set r_j^(−t) ← p_j^(−t).
5. For (ℓ = 1; ℓ < T; ℓ++), set r_j^(−t+ℓ) ← p_k^(−t+ℓ) with k such that r_j^(−t+ℓ−1) ∈ [q_[k]^(−t+ℓ), q_[k+1]^(−t+ℓ)].
6. Stop if the r_j^(0)'s (0 ≤ j ≤ n) are all equal. Otherwise, t ← 2 ∗ t.

[Hobert et al., 1999]

Page 57: BAYSM'14, Wien, Austria

Mixture models

Extension to the case k = 3:

Sample of n = 35 observations from

.23N(2.2, 1.44) + .62N(1.4, 0.49) + .15N(0.6, 0.64)

[Hobert et al., 1999]

Page 58: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 59: BAYSM'14, Wien, Austria

Bayesian model choice

Comparison of models M_i by Bayesian means:

probabilise the entire model/parameter space

- allocate probabilities p_i to all models M_i
- define priors π_i(θ_i) for each parameter space Θ_i
- compute

  π(M_i|x) = p_i ∫_{Θ_i} f_i(x|θ_i) π_i(θ_i) dθ_i / ∑_j p_j ∫_{Θ_j} f_j(x|θ_j) π_j(θ_j) dθ_j
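The last step of this computation, turning (log-)evidences and prior model weights into posterior model probabilities on the log scale; the uniform prior weights and the reuse of the galaxy-data log-evidences reported later in the talk (page 66) are my choices for illustration.

```python
import numpy as np
from scipy.special import logsumexp

def posterior_model_probs(log_evidence, prior_probs):
    """Combine per-model log-evidences log Z_i with prior weights p_i into
    posterior model probabilities, using log-sum-exp for stability."""
    log_post = np.log(prior_probs) + np.asarray(log_evidence)
    return np.exp(log_post - logsumexp(log_post))

# log marginal likelihoods for k = 2,...,8 (galaxy data, quoted later in the talk)
log_Z = np.array([-115.68, -103.35, -102.66, -101.93, -102.88, -105.48, -108.44])
print(posterior_model_probs(log_Z, np.full(7, 1 / 7)))
```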

Page 60: BAYSM'14, Wien, Austria

Bayesian model choice

Comparison of models Mi by Bayesian means:

Relies on a central notion: the evidence

Z_k = ∫_{Θ_k} π_k(θ_k) L_k(θ_k) dθ_k,

aka the marginal likelihood.

Page 61: BAYSM'14, Wien, Austria

Chib’s representation

Direct application of Bayes' theorem: given x ∼ f_k(x|θ_k) and θ_k ∼ π_k(θ_k),

Z_k = m_k(x) = f_k(x|θ_k) π_k(θ_k) / π_k(θ_k|x)

Replace with an approximation to the posterior

Ẑ_k = m̂_k(x) = f_k(x|θ_k^*) π_k(θ_k^*) / π̂_k(θ_k^*|x).

[Chib, 1995]

Page 63: BAYSM'14, Wien, Austria

Case of latent variables

For missing variable z as in mixture models, natural Rao-Blackwell estimate

π̂_k(θ_k^*|x) = 1/T ∑_{t=1}^T π_k(θ_k^*|x, z_k^(t)),

where the z_k^(t)'s are Gibbs-sampled latent variables

[Diebolt & Robert, 1990; Chib, 1995]

Page 64: BAYSM'14, Wien, Austria

Compensation for label switching

For mixture models, z_k^(t) usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory

Consequences on the numerical approximation, biased by an order k!

Recover the theoretical symmetry by using

π̂_k(θ_k^*|x) = 1/(T k!) ∑_{σ∈S_k} ∑_{t=1}^T π_k(σ(θ_k^*)|x, z_k^(t))

for all σ's in S_k, the set of all permutations of {1, …, k}

[Berkhof, Mechelen, & Gelman, 2003]
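A simplified sketch of this symmetrised Rao-Blackwell density estimate, for the normal-means case of the earlier slides (known unit variances, only the conditional of the means given z is used, with an N(δ, 1/λ) prior of mine): it is meant to show the double averaging over Gibbs allocations and over the k! relabellings, not the full Chib computation.

```python
import numpy as np
from itertools import permutations
from scipy.stats import norm

def chib_density_symmetrised(mu_star, x, z_draws, lam=1.0, delta=0.0):
    """Average the full conditional density of the means at mu_star over the
    stored allocations z^(t) and over all k! permutations of mu_star."""
    k = mu_star.shape[0]
    vals = []
    for z in z_draws:                                  # one Gibbs allocation vector per row
        nj = np.array([(z == j).sum() for j in range(k)])
        sj = np.array([x[z == j].sum() for j in range(k)])
        post_mean = (lam * delta + sj) / (lam + nj)
        post_sd = 1.0 / np.sqrt(lam + nj)
        for sigma in permutations(range(k)):           # label permutations
            vals.append(np.prod(norm.pdf(mu_star[list(sigma)], post_mean, post_sd)))
    return np.mean(vals)
```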

Page 66: BAYSM'14, Wien, Austria

Galaxy dataset (k)

Using Chib's estimate, with θ_k^* as MAP estimator,

log(Z_k(x)) = −105.1396

for k = 3, while introducing permutations leads to

log(Z_k(x)) = −103.3479

Note that −105.1396 + log(3!) = −103.3479

k            2        3        4        5        6        7        8
log Z_k(x)   -115.68  -103.35  -102.66  -101.93  -102.88  -105.48  -108.44

Estimates of the marginal likelihoods by the symmetrised Chib's approximation (based on 10^5 Gibbs iterations and, for k > 5, 100 permutations selected at random in S_k).

[Lee et al., 2008]

Page 69: BAYSM'14, Wien, Austria

More efficient sampling

Difficulty with the explosive number of terms in

π̂_k(θ_k^*|x) = 1/(T k!) ∑_{σ∈S_k} ∑_{t=1}^T π_k(σ(θ_k^*)|x, z_k^(t))

when most terms are equal to zero...

Iterative bridge sampling:

Ê^(t)(k) = Ê^(t−1)(k) [ M_1^{-1} ∑_{l=1}^{M_1} π(θ_l|x) / {M_1 q(θ_l) + M_2 π(θ_l|x)} ] / [ M_2^{-1} ∑_{m=1}^{M_2} q(θ_m) / {M_1 q(θ_m) + M_2 π(θ_m|x)} ]

[Fruhwirth-Schnatter, 2004]

where

q(θ) = 1/J_1 ∑_{j=1}^{J_1} p(θ|z^(j)) ∏_{i=1}^k p(ξ_i | ξ^(j)_{i<j}, ξ^(j−1)_{i>j}, z^(j), x)

or where

q(θ) = 1/k! ∑_{σ∈S(k)} p(θ|σ(z°)) ∏_{i=1}^k p(ξ_i | σ(ξ°_{i<j}), σ(ξ°_{i>j}), σ(z°), x)

Page 72: BAYSM'14, Wien, Austria

Further efficiency

After de-switching (un-switching?), representation of the importance function as

q(θ) = 1/(J k!) ∑_{j=1}^J ∑_{σ∈S_k} π(θ|σ(ϕ^(j)), x) = 1/k! ∑_{σ∈S_k} h_σ(θ)

where h_σ is associated with a particular mode of q

Assuming generations

(θ^(1), …, θ^(T)) ∼ h_{σ_c}(θ)

how many of the h_σ(θ^(t)) are non-zero?

Page 73: BAYSM'14, Wien, Austria

Sparsity for the sum

Contribution of each term relative to q(θ):

η_σ(θ) = h_σ(θ) / {k! q(θ)} = h_{σ_i}(θ) / ∑_{σ∈S_k} h_σ(θ)

and importance of permutation σ evaluated by

Ê_{h_{σ_c}}[η_{σ_i}(θ)] = 1/M ∑_{l=1}^M η_{σ_i}(θ^(l)),  θ^(l) ∼ h_{σ_c}(θ)

Approximate set A(k) ⊆ S(k) consists of [σ_1, …, σ_n] for the smallest n that satisfies the condition

φ_n = 1/M ∑_{l=1}^M | q_n(θ^(l)) − q(θ^(l)) | < τ

Page 74: BAYSM'14, Wien, Austria

dual importance sampling with approximation

DIS2A

1. Randomly select {z^(j), θ^(j)}_{j=1}^J from the Gibbs sample and un-switch; construct q(θ)
2. Choose h_{σ_c}(θ) and generate particles {θ^(t)}_{t=1}^T ∼ h_{σ_c}(θ)
3. Construct the approximation q_n(θ) using a first M-sample:
   3.1 Compute Ê_{h_{σ_c}}[η_{σ_1}(θ)], …, Ê_{h_{σ_c}}[η_{σ_{k!}}(θ)]
   3.2 Reorder the σ's such that Ê_{h_{σ_c}}[η_{σ_1}(θ)] ≥ … ≥ Ê_{h_{σ_c}}[η_{σ_{k!}}(θ)]
   3.3 Initially set n = 1 and compute the q_n(θ^(t))'s and φ_n. If φ_n < τ, go to Step 4. Otherwise increase n to n + 1
4. Replace q(θ^(1)), …, q(θ^(T)) with the approximations q_n(θ^(1)), …, q_n(θ^(T)) to estimate Ê

[Lee & X, 2014]

Page 75: BAYSM'14, Wien, Austria

illustrations

k   k!   |A(k)|   ∆(A)
3   6    1.0000   0.1675
4   24   2.7333   0.1148

Fishery data

k   k!    |A(k)|     ∆(A)
3   6     1.000      0.1675
4   24    15.7000    0.6545
6   720   298.1200   0.4146

Galaxy data

Table: Mean estimates of approximate set sizes, |A(k)|, and the reduction rate of the number of evaluated h-terms, ∆(A), for (a) the fishery and (b) the galaxy datasets

Page 76: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 77: BAYSM'14, Wien, Austria

Jeffreys priors for mixtures [teaser]

True Jeffreys prior for mixtures of distributions defined from

| E_θ[ ∇^T ∇ log f(X|θ) ] |

- matrix of dimension O(k)
- unavailable in closed form except in special cases
- unidimensional integrals approximated by Monte Carlo tools

[Grazian [talk tomorrow] et al., 2014+]
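One possible Monte Carlo approximation of the Fisher information of a two-component normal mixture, with the score obtained by central finite differences on simulated data and the Jeffreys density taken as the usual square-root determinant; this is an illustration of the idea only, not the scheme of Grazian et al., and all numerical settings are mine.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(4)

def mixture_logpdf(x, w, mu):
    # two-component normal mixture with unit variances; theta = (w1, mu1, mu2)
    return logsumexp(np.log(w) + norm.logpdf(x[:, None], mu, 1.0), axis=1)

def fisher_info_mc(w1, mu1, mu2, n_mc=20000, eps=1e-5):
    """Monte Carlo estimate of E_theta[score score^T] from simulated data,
    with the score approximated by central finite differences."""
    theta = np.array([w1, mu1, mu2])
    z = rng.random(n_mc) < w1
    x = np.where(z, rng.normal(mu1, 1.0, n_mc), rng.normal(mu2, 1.0, n_mc))
    scores = np.empty((n_mc, 3))
    for d in range(3):
        tp, tm = theta.copy(), theta.copy()
        tp[d] += eps
        tm[d] -= eps
        lp = mixture_logpdf(x, np.array([tp[0], 1 - tp[0]]), tp[1:])
        lm = mixture_logpdf(x, np.array([tm[0], 1 - tm[0]]), tm[1:])
        scores[:, d] = (lp - lm) / (2 * eps)
    return scores.T @ scores / n_mc            # 3 x 3 information matrix

I = fisher_info_mc(0.3, 0.0, 2.5)
jeffreys_unnormalised = np.sqrt(np.linalg.det(I))   # Jeffreys density up to a constant
```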

Page 78: BAYSM'14, Wien, Austria

Difficulties

- complexity grows in O(k^2)
- significant computing requirement (reduced by delayed acceptance)
  [Banterle et al., 2014]
- differs from the component-wise Jeffreys priors
  [Diebolt & X, 1990; Stoneking, 2014]
- when is the posterior proper?
- how to check properness via MCMC outputs?

Page 79: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 80: BAYSM'14, Wien, Austria

Difficulties with Bayes factors

- delicate calibration towards supporting a given hypothesis or model
- long-lasting impact of prior modelling, despite overall consistency
- discontinuity in the use of improper priors in most settings
- binary outcome more suited for immediate decision than for model evaluation
- related impossibility to ascertain misfit or outliers
- missing assessment of the uncertainty associated with the decision
- difficult computation of marginal likelihoods in most settings

Page 81: BAYSM'14, Wien, Austria

Reformulation

- Representation of the test problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1
- Mixture model thus contains both models under comparison as extreme cases
- Inspired by the consistency result of Rousseau and Mengersen (2011) on overfitting mixtures
- Use of the posterior distribution of the weight of a model instead of a single-digit posterior probability

[Kamari [see poster] et al., 2014+]

Page 82: BAYSM'14, Wien, Austria

Construction of Bayes tests

Given two statistical models,

M_1: x ∼ f_1(x|θ_1), θ_1 ∈ Θ_1   and   M_2: x ∼ f_2(x|θ_2), θ_2 ∈ Θ_2,

embed both models within an encompassing mixture model

M_α: x ∼ α f_1(x|θ_1) + (1 − α) f_2(x|θ_2),  0 ≤ α ≤ 1.  (1)

Both models as special cases of the mixture model, one for α = 1 and the other for α = 0

Test as inference on α

Page 83: BAYSM'14, Wien, Austria

Arguments

- substituting an estimate of the weight α for the posterior probability of model M_1 produces an equally convergent indicator of which model is “true”, while removing the need for often artificial prior probabilities on model indices
- interpretation at least as natural as for the posterior probability, while avoiding the zero-one loss setting
- highly problematic computation of marginal likelihoods bypassed by standard algorithms for mixture estimation
- straightforward extension to a collection of models allows to consider all models at once
- posterior on α evaluates thoroughly the strength of support for a given model, compared with a single-digit Bayes factor
- mixture model acknowledges the possibility that both models [or none] could be acceptable

Page 84: BAYSM'14, Wien, Austria

Arguments

- standard prior modelling can be reproduced here, but improper priors are now acceptable when both models are reparameterised towards common-meaning parameters, e.g. location and scale
- using the same parameters on both components is essential: opposition between components is not an issue with different parameter values
- parameters of the components, θ_1 and θ_2, integrated out by Monte Carlo
- contrary to common testing settings, the data signal a lack of agreement with either model when the posterior on α stays away from both 0 and 1
- in most settings, the approach is easily calibrated by parametric bootstrap, providing the posterior of α under each model and the prior predictive error

Page 85: BAYSM'14, Wien, Austria

Toy examples (1)

Test of a Poisson P(λ) versus a geometric Geo(p) [as a number of failures, starting at zero]

Same parameter used in the Poisson P(λ) and the geometric Geo(p), with p = 1/(1+λ)

Improper noninformative prior π(λ) = 1/λ is valid

Posterior on λ conditional on the allocation vector ζ:

π(λ | x, ζ) ∝ exp(−n_1(ζ)λ) λ^{∑_{i=1}^n x_i − 1} (λ + 1)^{−(n_2+s_2(ζ))}

and α ∼ Be(n_1 + a_0, n_2 + a_0)
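A hedged Metropolis-within-Gibbs sketch of this toy test: the allocations and α follow the conditionals on the slide, while λ, whose conditional is non-standard, is updated here by a random-walk step on log λ (one possible implementation, not necessarily the authors' choice); step size, initial values and simulated data are mine.

```python
import numpy as np
from scipy.stats import poisson, geom

rng = np.random.default_rng(5)

def mixture_test(x, a0=0.5, n_iter=10000, step=0.3):
    """Gibbs sampler for alpha P(lambda) + (1-alpha) Geo(1/(1+lambda)),
    prior pi(lambda) = 1/lambda and alpha ~ Be(a0, a0)."""
    n = len(x)
    lam, alpha = x.mean() + 0.1, 0.5
    trace = np.empty(n_iter)

    def log_target(l, z):
        # prior 1/lambda plus Poisson terms (z=0) and geometric terms (z=1)
        lp = -np.log(l)
        lp += poisson.logpmf(x[z == 0], l).sum()
        lp += geom.logpmf(x[z == 1] + 1, 1.0 / (1.0 + l)).sum()  # scipy's geom starts at 1
        return lp

    for t in range(n_iter):
        # 1. allocations
        lp0 = np.log(alpha) + poisson.logpmf(x, lam)
        lp1 = np.log(1 - alpha) + geom.logpmf(x + 1, 1.0 / (1.0 + lam))
        z = (rng.random(n) > 1.0 / (1.0 + np.exp(lp1 - lp0))).astype(int)
        # 2. weight
        n0 = int((z == 0).sum())
        alpha = rng.beta(n0 + a0, n - n0 + a0)
        # 3. random-walk Metropolis step on log(lambda), with Jacobian correction
        prop = lam * np.exp(step * rng.normal())
        log_acc = log_target(prop, z) - log_target(lam, z) + np.log(prop) - np.log(lam)
        if np.log(rng.random()) < log_acc:
            lam = prop
        trace[t] = alpha
    return trace

x = rng.geometric(0.5, size=1000) - 1          # geometric data as numbers of failures
alpha_trace = mixture_test(x)
```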

Page 86: BAYSM'14, Wien, Austria

Toy examples (1)

[Figure] Posterior of the Poisson weight α when a_0 = .1, .2, .3, .4, .5, 1, for a sample of 10^5 geometric G(0.5) observations [lBF = −509684]

Page 87: BAYSM'14, Wien, Austria

Toy examples (2)

Normal N(µ, 1) model versus double-exponential L(µ, √2) [the scale √2 is intentionally chosen to make both distributions share the same variance]

The location parameter µ can be shared by both models, with a single flat prior π(µ). Beta distributions B(a_0, a_0) are compared wrt their hyperparameter a_0

Page 88: BAYSM'14, Wien, Austria

Toy examples (2)

[Figure] Posterior of the double-exponential weight α for L(0, √2) data, with 5, …, 10^3 observations and 10^5 Gibbs iterations

Page 89: BAYSM'14, Wien, Austria

Toy examples (2)

[Figure] Posterior of the Normal weight α for N(0, .7^2) data, with 10^3 observations and 10^4 Gibbs iterations

Page 90: BAYSM'14, Wien, Austria

Toy examples (2)

[Figure] Posterior of the Normal weight α for N(0, 1) data, with 10^3 observations and 10^4 Gibbs iterations

Page 91: BAYSM'14, Wien, Austria

Toy examples (2)

[Figure] Posterior of the normal weight α for double-exponential data, with 10^3 observations and 10^4 Gibbs iterations

Page 92: BAYSM'14, Wien, Austria

Danke schön! Enjoy BAYSM 2014!