Transcript
Page 1: BAYSM'14, Wien, Austria

My life as a mixture

Christian P. Robert, Université Paris-Dauphine, Paris & University of Warwick, Coventry

September 17, 2014
[email protected]

Page 2: BAYSM'14, Wien, Austria

Your next Valencia meeting:

• Objective Bayes section of ISBA major meeting:

• O-Bayes 2015 in Valencia, Spain, June 1-4(+1), 2015

• in memory of our friend Susie Bayarri

• objective Bayes, limited information, partly defined and approximate models, &tc

• all flavours of Bayesian analysis welcomed!

• “Spain in June, what else...?!”

Page 3: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 4: BAYSM'14, Wien, Austria

birthdate: May 1989, Ottawa Civic Hospital

Distribution of grey levels in an unprocessed chest radiograph

[X, 1994]

Page 5: BAYSM'14, Wien, Austria

Mixture models

Structure of mixtures of distributions:

x ∼ f_j with probability p_j,

for j = 1, 2, . . . , k, with overall density

p_1 f_1(x) + · · · + p_k f_k(x).

Usual case: parameterised components

∑_{i=1}^k p_i f(x|θ_i),   with   ∑_{i=1}^k p_i = 1,

where the weights p_i are distinguished from the other parameters

Page 7: BAYSM'14, Wien, Austria

Motivations

• Dataset made of several latent/missing/unobserved groups/strata/subpopulations. Mixture structure due to the missing origin/allocation of each observation to a specific subpopulation/stratum. Inference on either the allocations (clustering), or on the parameters (θ_i, p_i), or on the number of groups

• Semiparametric perspective where mixtures are functional basis approximations of unknown distributions

Page 9: BAYSM'14, Wien, Austria

License

Dataset derived from [my] license plate image. Grey levels concentrated on 256 values [later jittered]

[Marin & X, 2007]

Page 10: BAYSM'14, Wien, Austria

Likelihood

For a sample of independent random variables (x_1, · · · , x_n), the likelihood is

∏_{i=1}^n {p_1 f_1(x_i) + · · · + p_k f_k(x_i)}.

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples. But the likelihood is still computable [pointwise] in O(kn) time.
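As a concrete illustration of the O(kn) pointwise evaluation (a sketch in Python, not part of the slides; the Gaussian components and function names are my own choices), the log-likelihood only needs one k × n array of log component terms and a log-sum-exp over components:

```python
import numpy as np

def mixture_loglik(x, weights, means, sds):
    """Pointwise O(k n) log-likelihood of a univariate Gaussian mixture.

    x: (n,) data; weights, means, sds: (k,) component parameters.
    """
    x = np.asarray(x)[None, :]                      # shape (1, n)
    w = np.asarray(weights)[:, None]                # shape (k, 1)
    mu = np.asarray(means)[:, None]
    sd = np.asarray(sds)[:, None]
    # log of w_j * N(x_i | mu_j, sd_j^2), shape (k, n)
    log_terms = (np.log(w)
                 - 0.5 * np.log(2 * np.pi) - np.log(sd)
                 - 0.5 * ((x - mu) / sd) ** 2)
    # log-sum-exp over components, then sum over observations
    m = log_terms.max(axis=0)
    return np.sum(m + np.log(np.exp(log_terms - m).sum(axis=0)))

# example: two-component mixture evaluated at a simulated sample
rng = np.random.default_rng(0)
x = np.where(rng.random(500) < 0.3, rng.normal(0, 1, 500), rng.normal(4, 1, 500))
print(mixture_loglik(x, [0.3, 0.7], [0.0, 4.0], [1.0, 1.0]))
```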

Page 12: BAYSM'14, Wien, Austria

Normal mean benchmark

Normal mixture

pN(µ1, 1) + (1 − p)N(µ2, 1)

with only means unknown (2-D representation possible)

Identifiability

Parameters µ_1 and µ_2 identifiable: µ_1 cannot be confused with µ_2 when p is different from 0.5.

Presence of a spurious mode, understood by letting p go to 0.5

Page 14: BAYSM'14, Wien, Austria

Bayesian inference on mixtures

For any prior π(θ, p), the posterior distribution of (θ, p) is available up to a multiplicative constant,

π(θ, p|x) ∝ [∏_{i=1}^n ∑_{j=1}^k p_j f(x_i|θ_j)] π(θ, p)

at a cost of order O(kn)

Difficulty

Despite this, the derivation of posterior characteristics like posterior expectations is only possible in an exponential time of order O(k^n)!

Page 16: BAYSM'14, Wien, Austria

Missing variable representation

Associate to each x_i a missing/latent variable z_i that indicates its component:

z_i|p ∼ M_k(p_1, . . . , p_k)

and

x_i|z_i, θ ∼ f(·|θ_{z_i}).

Completed likelihood

ℓ(θ, p|x, z) = ∏_{i=1}^n p_{z_i} f(x_i|θ_{z_i}),

and

π(θ, p|x, z) ∝ [∏_{i=1}^n p_{z_i} f(x_i|θ_{z_i})] π(θ, p)

where z = (z_1, . . . , z_n)
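The completed-likelihood representation is straightforward to mimic by simulation (an illustrative sketch with unit-variance Gaussian components; all names are my own assumptions, not the slides' notation):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.25, 0.60, 0.15])        # mixture weights
mu = np.array([-2.0, 1.0, 4.0])         # component means (unit variances)
n = 200

z = rng.choice(len(p), size=n, p=p)     # z_i | p ~ M_k(p_1, ..., p_k)
x = rng.normal(mu[z], 1.0)              # x_i | z_i ~ f(. | theta_{z_i})

# completed log-likelihood: sum_i log( p_{z_i} f(x_i | theta_{z_i}) )
comp_loglik = np.sum(np.log(p[z])
                     - 0.5 * np.log(2 * np.pi) - 0.5 * (x - mu[z]) ** 2)
print(comp_loglik)
```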

Page 18: BAYSM'14, Wien, Austria

Gibbs sampling for mixture models

Take advantage of the missing data structure:

Algorithm

• Initialization: choose p^(0) and θ^(0) arbitrarily

• Step t. For t = 1, . . .

1. Generate z_i^(t) (i = 1, . . . , n) from (j = 1, . . . , k)

   P(z_i^(t) = j | p_j^(t−1), θ_j^(t−1), x_i) ∝ p_j^(t−1) f(x_i | θ_j^(t−1))

2. Generate p^(t) from π(p|z^(t)),

3. Generate θ^(t) from π(θ|z^(t), x).

[Brooks & Gelman, 1990; Diebolt & X, 1990, 1994; Escobar & West, 1991]

Page 19: BAYSM'14, Wien, Austria

Normal mean example (cont’d)

Algorithm

• Initialization. Choose µ_1^(0) and µ_2^(0),

• Step t. For t = 1, . . .

1. Generate z_i^(t) (i = 1, . . . , n) from

   P(z_i^(t) = 1) = 1 − P(z_i^(t) = 2) ∝ p exp(−(x_i − µ_1^(t−1))²/2)

2. Compute n_j^(t) = ∑_{i=1}^n I_{z_i^(t) = j} and (s_j^x)^(t) = ∑_{i=1}^n I_{z_i^(t) = j} x_i

3. Generate µ_j^(t) (j = 1, 2) from N((λδ + (s_j^x)^(t)) / (λ + n_j^(t)), 1 / (λ + n_j^(t))).
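A minimal sketch of this Gibbs sampler, assuming a known weight p and a common N(δ, 1/λ) prior on both means as the conditional update above suggests (variable names and tuning values are my own):

```python
import numpy as np

def gibbs_two_means(x, p=0.5, delta=0.0, lam=0.1, n_iter=5000, seed=0):
    """Gibbs sampler for p*N(mu1,1) + (1-p)*N(mu2,1), means unknown."""
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = np.array([x.min(), x.max()])            # arbitrary initialisation
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # 1. allocations: P(z_i = 1) proportional to p * exp(-(x_i - mu1)^2 / 2)
        logw1 = np.log(p) - 0.5 * (x - mu[0]) ** 2
        logw2 = np.log(1 - p) - 0.5 * (x - mu[1]) ** 2
        prob1 = 1.0 / (1.0 + np.exp(logw2 - logw1))
        z = (rng.random(n) > prob1).astype(int)  # 0 -> component 1, 1 -> component 2
        # 2. sufficient statistics per component
        nj = np.array([(z == 0).sum(), (z == 1).sum()])
        sx = np.array([x[z == 0].sum(), x[z == 1].sum()])
        # 3. conjugate update of the two means
        mu = rng.normal((lam * delta + sx) / (lam + nj),
                        1.0 / np.sqrt(lam + nj))
        draws[t] = mu
    return draws

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(0, 1, 150), rng.normal(3, 1, 150)])
sample = gibbs_two_means(data)
print(sample[-5:])
```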

Page 20: BAYSM'14, Wien, Austria

Normal mean example (cont’d)

[Figure: Gibbs sample of (µ_1, µ_2), (a) initialised at random]

[X & Casella, 2009]

Page 21: BAYSM'14, Wien, Austria

Normal mean example (cont’d)

[Figure: Gibbs samples of (µ_1, µ_2), (a) initialised at random, (b) initialised close to the lower mode]

[X & Casella, 2009]

Page 22: BAYSM'14, Wien, Austria

License

Consider k = 3 components, a D_3(1/2, 1/2, 1/2) prior for the weights, a N(x̄, σ̂²/3) prior on the means µ_i, and a Ga(10, σ̂²) prior on the precisions σ_i^{−2}, where x̄ and σ̂² are the empirical mean and variance of License

[Empirical Bayes]

[Marin & X, 2007]

Page 23: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 24: BAYSM'14, Wien, Austria

weakly informative priors

• possible symmetric empirical Bayes priors

p ∼ D(γ, . . . , γ),   θ_i ∼ N(µ, ωσ_i²),   σ_i^{−2} ∼ Ga(ν, εν)

which can be replaced with hierarchical priors [Diebolt & X, 1990; Richardson & Green, 1997]

• independent improper priors on the θ_j's prohibited, thus standard “flat” and Jeffreys priors impossible to use (except with the exclude-empty-component trick)

[Diebolt & X, 1990; Wasserman, 1999]

Page 25: BAYSM'14, Wien, Austria

weakly informative priors

• Reparameterization to a compact set for use of uniform priors

µ_i −→ e^{µ_i}/(1 + e^{µ_i}),   σ_i −→ σ_i/(1 + σ_i)

[Chopin, 2000]

• dependent weakly informative priors

p ∼ D(k, . . . , 1),   θ_i ∼ N(θ_{i−1}, ζσ_{i−1}²),   σ_i ∼ U([0, σ_{i−1}])

[Mengersen & X, 1996; X & Titterington, 1998]

• reference priors

p ∼ D(1, . . . , 1),   θ_i ∼ N(µ_0, (σ_i² + τ_0²)/2),   σ_i² ∼ C⁺(0, τ_0²)

[Moreno & Liseo, 1999]

Page 26: BAYSM'14, Wien, Austria

Re-ban on improper priors

Difficult to use improper priors in the setting of mixtures because independent improper priors,

π(θ) = ∏_{i=1}^k π_i(θ_i),   with   ∫ π_i(θ_i) dθ_i = ∞,

end up, for all n's, with the property

∫ π(θ, p|x) dθ dp = ∞

Reason

There are (k − 1)^n terms among the k^n terms in the expansion that allocate no observation at all to the i-th component

Page 28: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 29: BAYSM'14, Wien, Austria

Connected difficulties

1. Number of modes of the likelihood of order O(k!):
⇒ Maximization and even [MCMC] exploration of the posterior surface harder

2. Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical:
⇒ Posterior expectation of θ_1 equal to posterior expectation of θ_2

Page 31: BAYSM'14, Wien, Austria

License

When the Gibbs output does not (re)produce exchangeability, the Gibbs sampler has failed to explore the whole parameter space: not enough energy to switch enough component allocations at once

[Marin & X, 2007]

Page 32: BAYSM'14, Wien, Austria

Label switching paradox

• We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler.

• If we observe it, then we do not know how to estimate the parameters.

• If we do not, then we are uncertain about the convergence!!!

[Celeux, Hurn & X, 2000]

[Frühwirth-Schnatter, 2001, 2004]

[Holmes, Jasra & Stephens, 2005]

Page 35: BAYSM'14, Wien, Austria

Constraints

Usual reply to lack of identifiability: impose constraints like

µ_1 ≤ . . . ≤ µ_k

in the prior

Mostly incompatible with the topology of the posterior surface: posterior expectations then depend on the choice of the constraints.

Computational “detail”

The constraint need not be imposed during the simulation but can instead be imposed after simulation, by reordering the MCMC output according to the constraints. [This avoids possible negative effects on convergence]
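Post-simulation reordering takes a few lines (an illustrative sketch; the array names are my own): sort each draw of the means and carry the weights along with the same permutation.

```python
import numpy as np

def reorder_by_means(mu_draws, p_draws):
    """Impose mu_1 <= ... <= mu_k on each MCMC draw, after simulation.

    mu_draws, p_draws: arrays of shape (T, k) holding the Gibbs output.
    """
    order = np.argsort(mu_draws, axis=1)            # per-draw permutation
    rows = np.arange(mu_draws.shape[0])[:, None]
    return mu_draws[rows, order], p_draws[rows, order]

# example on fake output for k = 3
rng = np.random.default_rng(0)
mu = rng.normal(size=(1000, 3))
p = rng.dirichlet(np.ones(3), size=1000)
mu_sorted, p_sorted = reorder_by_means(mu, p)
```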

Page 38: BAYSM'14, Wien, Austria

Relabeling towards the mode

Selection of one of the k! modal regions of the posterior, post-simulation, by computing the approximate MAP

(θ, p)^(i*)   with   i* = arg max_{i=1,...,M} π{(θ, p)^(i)|x}

Pivotal Reordering

At iteration i ∈ {1, . . . , M},

1. Compute the optimal permutation

   τ_i = arg min_{τ∈S_k} d(τ{(θ^(i), p^(i))}, (θ^(i*), p^(i*)))

   where d(·, ·) is a distance in the parameter space.

2. Set (θ^(i), p^(i)) = τ_i((θ^(i), p^(i))).

[Celeux, 1998; Stephens, 2000; Celeux, Hurn & X, 2000]
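A sketch of pivotal reordering for small k, taking the squared Euclidean distance on the stacked (θ, p) vector as d(·, ·) (a choice of mine, not imposed by the slides):

```python
import numpy as np
from itertools import permutations

def pivotal_reorder(mu_draws, p_draws, pivot_mu, pivot_p):
    """Relabel each draw by the permutation closest to the (approximate) MAP pivot."""
    T, k = mu_draws.shape
    perms = list(permutations(range(k)))
    out_mu, out_p = mu_draws.copy(), p_draws.copy()
    for t in range(T):
        # Euclidean distance between the permuted draw and the pivot
        dists = [np.sum((mu_draws[t, s] - pivot_mu) ** 2)
                 + np.sum((p_draws[t, s] - pivot_p) ** 2)
                 for s in map(list, perms)]
        best = list(perms[int(np.argmin(dists))])
        out_mu[t], out_p[t] = mu_draws[t, best], p_draws[t, best]
    return out_mu, out_p

# the pivot would be the draw with highest (unnormalised) posterior value,
# e.g. pivot_mu, pivot_p = mu_draws[i_star], p_draws[i_star]
```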

Page 40: BAYSM'14, Wien, Austria

Loss functions for mixture estimation

Global loss function that considers distance between predictives

L(ξ, ξ̂) = ∫_X f_ξ(x) log{f_ξ(x)/f_ξ̂(x)} dx

eliminates the labelling effect

Similar solution for estimating clusters through allocation variables

L(z, ẑ) = ∑_{i<j} (I_{[z_i=z_j]}(1 − I_{[ẑ_i=ẑ_j]}) + I_{[ẑ_i=ẑ_j]}(1 − I_{[z_i=z_j]})).

[Celeux, Hurn & X, 2000]
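The allocation-based loss is a pairwise disagreement count, which can be checked directly (an illustrative sketch):

```python
import numpy as np

def allocation_loss(z, z_hat):
    """Pairwise loss: count pairs clustered together under one allocation but not the other."""
    z, z_hat = np.asarray(z), np.asarray(z_hat)
    same = (z[:, None] == z[None, :])
    same_hat = (z_hat[:, None] == z_hat[None, :])
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)   # pairs i < j
    return int(np.sum((same != same_hat) & upper))

print(allocation_loss([0, 0, 1, 1], [1, 1, 0, 0]))   # 0: labels differ, clustering identical
print(allocation_loss([0, 0, 1, 1], [0, 1, 0, 1]))   # 4: most pairs disagree
```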

Page 41: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 42: BAYSM'14, Wien, Austria

MAP estimation

For a high-dimensional parameter space, difficulty with marginal MAP (MMAP) estimates because nuisance parameters must be integrated out:

θ_1^MMAP = arg max_{Θ_1} p(θ_1|y)

where

p(θ_1|y) = ∫_{Θ_2} p(θ_1, θ_2|y) dθ_2

Page 43: BAYSM'14, Wien, Austria

MAP estimation

SAME stands for State Augmentation for Marginal Estimation [Doucet, Godsill & X, 2001]

Artificially augmented probability model whose marginal distribution is

p_γ(θ_1|y) ∝ p(θ_1|y)^γ

via replications of the nuisance parameters:

• Replace θ_2 with γ artificial replications, θ_2(1), . . . , θ_2(γ)

Page 44: BAYSM'14, Wien, Austria

MAP estimation

SAME stands for State Augmentation for Marginal Estimation [Doucet, Godsill & X, 2001]

Artificially augmented probability model whose marginal distribution is

p_γ(θ_1|y) ∝ p(θ_1|y)^γ

via replications of the nuisance parameters:

• Treat the θ_2(j)'s as distinct random variables:

q_γ(θ_1, θ_2(1), . . . , θ_2(γ)|y) ∝ ∏_{k=1}^γ p(θ_1, θ_2(k)|y)

Page 45: BAYSM'14, Wien, Austria

MAP estimation

SAME stands for State Augmentation for Marginal Estimation [Doucet, Godsill & X, 2001]

Artificially augmented probability model whose marginal distribution is

p_γ(θ_1|y) ∝ p(θ_1|y)^γ

via replications of the nuisance parameters:

• Use the corresponding marginal for θ_1:

q_γ(θ_1|y) = ∫ q_γ(θ_1, θ_2(1), . . . , θ_2(γ)|y) dθ_2(1) · · · dθ_2(γ)
           ∝ ∫ ∏_{k=1}^γ p(θ_1, θ_2(k)|y) dθ_2(1) · · · dθ_2(γ)
           = p_γ(θ_1|y)

Page 46: BAYSM'14, Wien, Austria

MAP estimation

SAME stands for State Augmentation for Marginal Estimation [Doucet, Godsill & X, 2001]

Artificially augmented probability model whose marginal distribution is

p_γ(θ_1|y) ∝ p(θ_1|y)^γ

via replications of the nuisance parameters:

• Build an MCMC algorithm in the augmented space, with invariant distribution

q_γ(θ_1, θ_2(1), . . . , θ_2(γ)|y)

Page 47: BAYSM'14, Wien, Austria

MAP estimation

SAME stands for State Augmentation for Marginal Estimation [Doucet, Godsill & X, 2001]

Artificially augmented probability model whose marginal distribution is

p_γ(θ_1|y) ∝ p(θ_1|y)^γ

via replications of the nuisance parameters:

• Use the simulated subsequence {θ_1^(i); i ∈ N} as drawn from the marginal posterior p_γ(θ_1|y)
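A sketch of SAME on the simplest mixture appearing later in the talk, p f_0 + (1 − p) f_1 with known components and unknown weight p, taking the allocations z as the nuisance parameters; the increasing schedule for γ and all names are my own illustrative choices:

```python
import numpy as np

def same_map_weight(x, f0, f1, n_iter=200, gamma_max=50, seed=0):
    """SAME sketch: MAP of the weight p in p*f0 + (1-p)*f1, nuisance = allocations z.

    With a uniform prior on p, replicating the allocations gamma times makes
    the conditional update of p target p(p | x)^gamma.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    d0, d1 = f0(x), f1(x)
    p = 0.5
    for t in range(1, n_iter + 1):
        gamma = min(gamma_max, t)                # slowly increasing schedule
        # gamma independent replications of the allocation vector
        q0 = p * d0 / (p * d0 + (1 - p) * d1)    # P(z_i = 0 | p, x_i)
        z = rng.random((gamma, n)) > q0          # True -> component 1
        n1 = z.sum()                             # total component-1 allocations
        n0 = gamma * n - n1
        # conditional on all replications (uniform prior on p)
        p = rng.beta(n0 + 1, n1 + 1)
    return p

def norm_pdf(m):
    return lambda v: np.exp(-0.5 * (v - m) ** 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(3)
x = np.where(rng.random(300) < 0.7, rng.normal(0, 1, 300), rng.normal(3, 1, 300))
print(same_map_weight(x, norm_pdf(0.0), norm_pdf(3.0)))
```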

Page 48: BAYSM'14, Wien, Austria

example: Galaxy dataset benchmark

82 observations of galaxy velocities from 3 (?) groups

Algorithm                  EM      MCEM    SAME
Mean log-posterior         65.47   60.73   66.22
Std dev of log-posterior   2.31    4.48    0.02

[Doucet & X, 2002]

Page 49: BAYSM'14, Wien, Austria

Really the SAME?!

SAME algorithm re-invented in many guises:

• Gaetan & Yao, 2003, Biometrika

• Jacquier, Johannes & Polson, 2007, J. Econometrics

• Lele, Dennis & Lutscher, 2007, Ecology Letters [data cloning]

• ...

Page 50: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 51: BAYSM'14, Wien, Austria

Propp and Wilson’s perfect sampler

Difficulty devising MCMC stopping rules: when should one stop an MCMC algorithm?!

Principle: Coupling from the past

rather than start at t = 0 and wait till t = +∞, start at t = −∞ and wait till t = 0

[Propp & Wilson, 1996]

⇒ Outcome at time t = 0 is stationary

Page 53: BAYSM'14, Wien, Austria

CFTP Algorithm

Algorithm (Coupling from the past)

1. Start from the m possible values at time −t

2. Run the m chains till time 0 (coupling allowed)

3. Check if the chains are equal at time 0

4. If not, start further back: t ← 2t, using the same random numbers at times already simulated

• requires a finite state space

• probability of merging chains must be high enough

• hard to implement w/o monotonicity in both the state space and the transition
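For intuition, here is a minimal coupling-from-the-past sketch on a toy finite chain, a reflecting ±1 random walk on {0, . . . , m−1} whose stationary distribution is uniform (my own example; the chain and names are assumptions, not from the slides):

```python
import numpy as np

def cftp_lazy_walk(m=8, seed=0):
    """Coupling-from-the-past for a reflecting random walk on {0, ..., m-1}.

    Deterministic update phi(state, u) shared by all chains (common random numbers);
    the stationary distribution of this walk is uniform on {0, ..., m-1}.
    """
    rng = np.random.default_rng(seed)

    def phi(state, u):
        step = 1 if u < 0.5 else -1
        return min(m - 1, max(0, state + step))

    T = 1
    u = {}                                   # random numbers indexed by time -1, -2, ...
    while True:
        for s in range(-T, 0):
            if s not in u:
                u[s] = rng.random()          # reuse randomness when restarting further back
        states = list(range(m))              # one chain per possible value at time -T
        for s in range(-T, 0):
            states = [phi(x, u[s]) for x in states]
        if len(set(states)) == 1:            # all chains have coalesced by time 0
            return states[0]                 # exact draw from the stationary distribution
        T *= 2                               # otherwise start further back in the past

print([cftp_lazy_walk(seed=i) for i in range(10)])
```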

Page 55: BAYSM'14, Wien, Austria

Mixture models

Simplest possible mixture structure

p f_0(x) + (1 − p) f_1(x),

with uniform prior on p.

Algorithm (Data Augmentation Gibbs sampler)

At iteration t:

1. Generate n iid U(0, 1) rv's u_1^(t), . . . , u_n^(t).

2. Derive the indicator variables z_i^(t) as z_i^(t) = 0 iff

   u_i^(t) ≤ q_i^(t−1) = p^(t−1) f_0(x_i) / {p^(t−1) f_0(x_i) + (1 − p^(t−1)) f_1(x_i)}

   and compute m^(t) = ∑_{i=1}^n z_i^(t).

3. Simulate p^(t) ∼ Be(n + 1 − m^(t), 1 + m^(t)).

Page 56: BAYSM'14, Wien, Austria

Mixture models

Algorithm (CFTP Gibbs sampler)

At iteration −t:

1. Generate n iid uniform rv's u_1^(−t), . . . , u_n^(−t).

2. Partition [0, 1) into intervals [q_[j], q_[j+1]).

3. For each [q_[j]^(−t), q_[j+1]^(−t)), generate p_j^(−t) ∼ Be(n − j + 1, j + 1).

4. For each j = 0, 1, . . . , n, r_j^(−t) ← p_j^(−t)

5. For (ℓ = 1, ℓ < t, ℓ++), r_j^(−t+ℓ) ← p_k^(−t+ℓ) with k such that r_j^(−t+ℓ−1) ∈ [q_[k]^(−t+ℓ), q_[k+1]^(−t+ℓ)]

6. Stop if the r_j^(0)'s (0 ≤ j ≤ n) are all equal. Otherwise, t ← 2t.

[Hobert et al., 1999]

Page 57: BAYSM'14, Wien, Austria

Mixture models

Extension to the case k = 3:

Sample of n = 35 observations from

.23N(2.2, 1.44) + .62N(1.4, 0.49) + .15N(0.6, 0.64)

[Hobert et al., 1999]

Page 58: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 59: BAYSM'14, Wien, Austria

Bayesian model choice

Comparison of models Mi by Bayesian means:

probabilise the entire model/parameter space

• allocate probabilities p_i to all models M_i

• define priors π_i(θ_i) for each parameter space Θ_i

• compute

π(M_i|x) = p_i ∫_{Θ_i} f_i(x|θ_i) π_i(θ_i) dθ_i / ∑_j p_j ∫_{Θ_j} f_j(x|θ_j) π_j(θ_j) dθ_j

Page 60: BAYSM'14, Wien, Austria

Bayesian model choice

Comparison of models Mi by Bayesian means:

Relies on a central notion: the evidence

Z_k = ∫_{Θ_k} π_k(θ_k) L_k(θ_k) dθ_k,

aka the marginal likelihood.

Page 61: BAYSM'14, Wien, Austria

Chib’s representation

Direct application of Bayes' theorem: given x ∼ f_k(x|θ_k) and θ_k ∼ π_k(θ_k),

Z_k = m_k(x) = f_k(x|θ_k) π_k(θ_k) / π_k(θ_k|x)

Replace with an approximation to the posterior

Ẑ_k = m̂_k(x) = f_k(x|θ_k*) π_k(θ_k*) / π̂_k(θ_k*|x).

[Chib, 1995]

Page 63: BAYSM'14, Wien, Austria

Case of latent variables

For missing variable z as in mixture models, natural Rao-Blackwell estimate

π̂_k(θ_k*|x) = (1/T) ∑_{t=1}^T π_k(θ_k*|x, z_k^(t)),

where the z_k^(t)'s are Gibbs-sampled latent variables

[Diebolt & Robert, 1990; Chib, 1995]

Page 64: BAYSM'14, Wien, Austria

Compensation for label switching

For mixture models, z_k^(t) usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory. Consequences on the numerical approximation, biased by an order k!. Recover the theoretical symmetry by using

π̂_k(θ_k*|x) = (1/(T k!)) ∑_{σ∈S_k} ∑_{t=1}^T π_k(σ(θ_k*)|x, z_k^(t)),

for all σ's in S_k, the set of all permutations of {1, . . . , k}

[Berkhof, Mechelen, & Gelman, 2003]
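A sketch of the symmetrised Rao-Blackwell estimate for small k (illustrative only: it assumes a user-supplied function cond_post_dens(theta_star, z) returning π_k(θ*|x, z) in closed form, and it permutes the allocation labels, which is equivalent to permuting θ* under an exchangeable prior):

```python
import numpy as np
from itertools import permutations

def chib_log_evidence(log_lik_star, log_prior_star, theta_star, z_draws,
                      cond_post_dens, k):
    """Chib's log-evidence with the permutation-averaged Rao-Blackwell density.

    log_lik_star, log_prior_star: log f_k(x|theta*) and log pi_k(theta*).
    z_draws: iterable of Gibbs-sampled allocation vectors (values in 0..k-1).
    cond_post_dens(theta_star, z): assumed user-supplied closed-form value of
        pi_k(theta*|x, z), e.g. a product of conjugate full conditionals.
    """
    perms = list(permutations(range(k)))
    vals = []
    for z in z_draws:
        z = np.asarray(z)
        for sigma in perms:
            # permuting the allocation labels stands in (under an exchangeable
            # prior) for evaluating the conditional at sigma(theta*)
            vals.append(cond_post_dens(theta_star, np.asarray(sigma)[z]))
    post_dens_at_star = np.mean(vals)       # (1 / (T k!)) * double sum
    return log_lik_star + log_prior_star - np.log(post_dens_at_star)
```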

Page 66: BAYSM'14, Wien, Austria

Galaxy dataset (k)

Using Chib's estimate, with θ_k* as MAP estimator,

log(Z_k(x)) = −105.1396

for k = 3, while introducing the permutations leads to

log(Z_k(x)) = −103.3479

Note that −105.1396 + log(3!) = −103.3479

k             2        3        4        5        6        7        8
log Z_k(x)  −115.68  −103.35  −102.66  −101.93  −102.88  −105.48  −108.44

Estimations of the marginal likelihoods by the symmetrised Chib's approximation (based on 10^5 Gibbs iterations and, for k > 5, 100 permutations selected at random in S_k).

[Lee et al., 2008]

Page 69: BAYSM'14, Wien, Austria

More efficient sampling

Difficulty with the explosive number of terms in

π̂_k(θ_k*|x) = (1/(T k!)) ∑_{σ∈S_k} ∑_{t=1}^T π_k(σ(θ_k*)|x, z_k^(t))

when most terms are equal to zero...

Iterative bridge sampling:

E^(t)(k) = E^(t−1)(k) [M_1^{−1} ∑_{l=1}^{M_1} π(θ_l|x) / {M_1 q(θ_l) + M_2 π(θ_l|x)}] / [M_2^{−1} ∑_{m=1}^{M_2} q(θ_m) / {M_1 q(θ_m) + M_2 π(θ_m|x)}]

[Frühwirth-Schnatter, 2004]

Page 70: BAYSM'14, Wien, Austria

More efficient sampling

Iterative bridge sampling:

E^(t)(k) = E^(t−1)(k) [M_1^{−1} ∑_{l=1}^{M_1} π(θ_l|x) / {M_1 q(θ_l) + M_2 π(θ_l|x)}] / [M_2^{−1} ∑_{m=1}^{M_2} q(θ_m) / {M_1 q(θ_m) + M_2 π(θ_m|x)}]

[Frühwirth-Schnatter, 2004]

where

q(θ) = (1/J_1) ∑_{j=1}^{J_1} p(θ|z^(j)) ∏_{i=1}^k p(ξ_i | ξ^(j)_{<i}, ξ^(j−1)_{>i}, z^(j), x)

Page 71: BAYSM'14, Wien, Austria

More efficient sampling

Iterative bridge sampling:

E^(t)(k) = E^(t−1)(k) [M_1^{−1} ∑_{l=1}^{M_1} π(θ_l|x) / {M_1 q(θ_l) + M_2 π(θ_l|x)}] / [M_2^{−1} ∑_{m=1}^{M_2} q(θ_m) / {M_1 q(θ_m) + M_2 π(θ_m|x)}]

[Frühwirth-Schnatter, 2004]

or where

q(θ) = (1/k!) ∑_{σ∈S(k)} p(θ|σ(z^o)) ∏_{i=1}^k p(ξ_i | σ(ξ^o_{<i}), σ(ξ^o_{>i}), σ(z^o), x)

Page 72: BAYSM'14, Wien, Austria

Further efficiency

After de-switching (un-switching?), representation of the importance function as

q(θ) = (1/(J k!)) ∑_{j=1}^J ∑_{σ∈S_k} π(θ|σ(ϕ^(j)), x) = (1/k!) ∑_{σ∈S_k} h_σ(θ)

where h_σ is associated with a particular mode of q. Assuming generations

(θ^(1), . . . , θ^(T)) ∼ h_{σ_c}(θ),

how many of the h_σ(θ^(t)) are non-zero?

Page 73: BAYSM'14, Wien, Austria

Sparsity for the sum

Contribution of each term relative to q(θ)

η_{σ_i}(θ) = h_{σ_i}(θ) / (k! q(θ)) = h_{σ_i}(θ) / ∑_{σ∈S_k} h_σ(θ)

and the importance of permutation σ_i evaluated by

E_{h_{σ_c}}[η_{σ_i}(θ)] = (1/M) ∑_{l=1}^M η_{σ_i}(θ^(l)),   θ^(l) ∼ h_{σ_c}(θ)

The approximate set A(k) ⊆ S(k) consists of [σ_1, · · · , σ_n] for the smallest n that satisfies the condition

φ_n = (1/M) ∑_{l=1}^M |q_n(θ^(l)) − q(θ^(l))| < τ

Page 74: BAYSM'14, Wien, Austria

dual importance sampling with approximation

DIS2A

1. Randomly select {z^(j), θ^(j)}_{j=1}^J from the Gibbs sample and un-switch. Construct q(θ)

2. Choose h_{σ_c}(θ) and generate particles {θ^(t)}_{t=1}^T ∼ h_{σ_c}(θ)

3. Construction of the approximation q_n(θ) using the first M-sample

3.1 Compute E_{h_{σ_c}}[η_{σ_1}(θ)], · · · , E_{h_{σ_c}}[η_{σ_{k!}}(θ)]

3.2 Reorder the σ's such that E_{h_{σ_c}}[η_{σ_1}(θ)] ≥ · · · ≥ E_{h_{σ_c}}[η_{σ_{k!}}(θ)].

3.3 Initially set n = 1 and compute the q_n(θ^(t))'s and φ_n. If φ_n < τ, go to Step 4. Otherwise increase n to n + 1

4. Replace q(θ^(1)), . . . , q(θ^(T)) with q_n(θ^(1)), . . . , q_n(θ^(T)) to estimate E

[Lee & X, 2014]

Page 75: BAYSM'14, Wien, Austria

illustrations

Fishery data

k    k!    |A(k)|    ∆(A)
3    6     1.0000    0.1675
4    24    2.7333    0.1148

Galaxy data

k    k!    |A(k)|      ∆(A)
3    6     1.000       0.1675
4    24    15.7000     0.6545
6    720   298.1200    0.4146

Table: Mean estimates of approximate set sizes, |A(k)|, and the reduction rate of the number of evaluated h-terms, ∆(A), for (a) the fishery and (b) the galaxy datasets

Page 76: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 77: BAYSM'14, Wien, Austria

Jeffreys priors for mixtures [teaser]

True Jeffreys prior for mixtures of distributions defined as

∣E_θ[∇^T∇ log f(X|θ)]∣

• O(k) matrix

• unavailable in closed form except special cases

• unidimensional integrals approximated by Monte Carlo tools

[Grazian [talk tomorrow] et al., 2014+]
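A toy Monte Carlo approximation of the (unnormalised) Jeffreys prior for a two-component Gaussian mixture with unknown means only, using the score outer-product form of the information matrix (equivalent to the expected-Hessian form above) and numerical gradients; the sample size, step size, and fixed weight are illustrative assumptions:

```python
import numpy as np

def log_mix(x, theta, p=0.3):
    """log density of p*N(theta_0,1) + (1-p)*N(theta_1,1) at points x."""
    comp = np.stack([np.log(p) - 0.5 * (x - theta[0]) ** 2,
                     np.log(1 - p) - 0.5 * (x - theta[1]) ** 2])
    m = comp.max(axis=0)
    return m + np.log(np.exp(comp - m).sum(axis=0)) - 0.5 * np.log(2 * np.pi)

def jeffreys_unnormalised(theta, n_mc=20000, eps=1e-4, seed=0):
    """sqrt(det I(theta)) with I(theta) = E_theta[score score^T], by Monte Carlo."""
    rng = np.random.default_rng(seed)
    p = 0.3
    z = rng.random(n_mc) < p
    x = np.where(z, rng.normal(theta[0], 1, n_mc), rng.normal(theta[1], 1, n_mc))
    # numerical score via central differences in each coordinate
    score = np.empty((2, n_mc))
    for j in range(2):
        tp, tm = np.array(theta, float), np.array(theta, float)
        tp[j] += eps
        tm[j] -= eps
        score[j] = (log_mix(x, tp) - log_mix(x, tm)) / (2 * eps)
    fisher = score @ score.T / n_mc            # 2x2 information matrix estimate
    return np.sqrt(np.linalg.det(fisher))

print(jeffreys_unnormalised([0.0, 2.0]))
```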

Page 78: BAYSM'14, Wien, Austria

Difficulties

• complexity grows in O(k²)

• significant computing requirement (reduced by delayed acceptance)

[Banterle et al., 2014]

• differs from component-wise Jeffreys [Diebolt & X, 1990; Stoneking, 2014]

• when is the posterior proper?

• how to check properness via MCMC outputs?

Page 79: BAYSM'14, Wien, Austria

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

Page 80: BAYSM'14, Wien, Austria

Difficulties with Bayes factors

• delicate calibration towards supporting a given hypothesis or model

• long-lasting impact of prior modelling, despite overall consistency

• discontinuity in the use of improper priors in most settings

• binary outcome more suited for immediate decision than for model evaluation

• related impossibility to ascertain misfit or outliers

• missing assessment of uncertainty associated with the decision

• difficult computation of marginal likelihoods in most settings

Page 81: BAYSM'14, Wien, Austria

Reformulation

• Representation of the test problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1

• Mixture model thus contains both models under comparison as extreme cases

• Inspired by the consistency result of Rousseau and Mengersen (2011) on overfitting mixtures

• Use of the posterior distribution of the weight of a model instead of a single-digit posterior probability

[Kamari [see poster] et al., 2014+]

Page 82: BAYSM'14, Wien, Austria

Construction of Bayes tests

Given two statistical models,

M_1: x ∼ f_1(x|θ_1), θ_1 ∈ Θ_1   and   M_2: x ∼ f_2(x|θ_2), θ_2 ∈ Θ_2,

embed both models within an encompassing mixture model

M_α: x ∼ α f_1(x|θ_1) + (1 − α) f_2(x|θ_2),   0 ≤ α ≤ 1.   (1)

Both models as special cases of the mixture model, one for α = 1 and the other for α = 0

⇒ Test as inference on α

Page 83: BAYSM'14, Wien, Austria

Arguments

• substituting an estimate of the weight α for the posterior probability of model M_1 produces an equally convergent indicator of which model is “true”, while removing the need for often artificial prior probabilities on model indices

• interpretation at least as natural as for the posterior probability, while avoiding the zero-one loss setting

• highly problematic computation of marginal likelihoods bypassed by standard algorithms for mixture estimation

• straightforward extension to a collection of models allows all models to be considered at once

• posterior on α thoroughly evaluates the strength of support for a given model, compared with a single-digit Bayes factor

• mixture model acknowledges the possibility that both models [or none] could be acceptable

Page 84: BAYSM'14, Wien, Austria

Arguments

• standard prior modelling can be reproduced here, but improper priors are now acceptable when both models are reparameterised towards common-meaning parameters, e.g. location and scale

• using the same parameters on both components is essential: opposition between components is not an issue with different parameter values

• parameters of the components, θ_1 and θ_2, integrated out by Monte Carlo

• contrary to common testing settings, the data signal a lack of agreement with either model when the posterior on α is away from both 0 and 1

• in most settings, the approach is easily calibrated by parametric bootstrap, providing the posterior of α under each model and the prior predictive error

Page 85: BAYSM'14, Wien, Austria

Toy examples (1)

Test of a Poisson P(λ) versus a geometric Geo(p) [as a number of failures, starting at zero]

Same parameter used in Poisson P(λ) and geometric Geo(p), with

p = 1/(1 + λ)

Improper noninformative prior π(λ) = 1/λ is valid

Posterior on λ conditional on the allocation vector ζ:

π(λ | x, ζ) ∝ exp(−n_1(ζ)λ) λ^{∑_{i=1}^n x_i − 1} (λ + 1)^{−(n_2 + s_2(ζ))}

and α ∼ Be(n_1 + a_0, n_2 + a_0)
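A sketch of the corresponding Gibbs sampler (with a random-walk Metropolis step on log λ, since the conditional above is non-standard); the proposal scale, initialisation, and names are my own illustrative choices:

```python
import numpy as np
from scipy.special import gammaln

def mixture_test_poisson_geometric(x, a0=0.5, n_iter=10000, seed=0):
    """Gibbs sampler (with an MH step for lambda) for the mixture test
    alpha * P(lambda) + (1 - alpha) * Geo(1/(1+lambda)), prior pi(lambda) = 1/lambda,
    alpha ~ Be(a0, a0).  Returns the draws of alpha."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n, sx = len(x), x.sum()
    lam, alpha = max(x.mean(), 0.1), 0.5
    alphas = np.empty(n_iter)
    log_fact = gammaln(x + 1)
    for t in range(n_iter):
        # 1. allocations zeta_i (True -> Poisson component)
        lp_pois = np.log(alpha) - lam + x * np.log(lam) - log_fact
        lp_geom = np.log(1 - alpha) + x * np.log(lam) - (x + 1) * np.log(1 + lam)
        prob_pois = 1.0 / (1.0 + np.exp(lp_geom - lp_pois))
        zeta = rng.random(n) < prob_pois
        n1, n2 = zeta.sum(), n - zeta.sum()
        s2 = x[~zeta].sum()
        # 2. weight alpha | zeta
        alpha = rng.beta(n1 + a0, n2 + a0)
        # 3. lambda | x, zeta: random-walk MH on u = log(lambda), Jacobian included
        def log_target(u):
            return sx * u - n1 * np.exp(u) - (n2 + s2) * np.log1p(np.exp(u))
        u = np.log(lam)
        u_prop = u + 0.1 * rng.normal()
        if np.log(rng.random()) < log_target(u_prop) - log_target(u):
            lam = np.exp(u_prop)
        alphas[t] = alpha
    return alphas

rng = np.random.default_rng(1)
data = rng.geometric(0.5, size=500) - 1          # geometric counting failures from zero
post_alpha = mixture_test_poisson_geometric(data)
print(post_alpha[1000:].mean())                  # Poisson weight should be near zero
```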

Page 86: BAYSM'14, Wien, Austria

Toy examples (1)

[Figure: posteriors of the Poisson weight α; log Bayes factor lBF = −509684]

Posterior of the Poisson weight α when a_0 = .1, .2, .3, .4, .5, 1, for a sample of 10^5 geometric G(0.5) observations

Page 87: BAYSM'14, Wien, Austria

Toy examples (2)

Normal N(µ, 1) model versus double-exponential L(µ, √2) [the scale √2 is intentionally chosen to make both distributions share the same variance]

Location parameter µ can be shared by both models with a single flat prior π(µ). Beta distributions B(a_0, a_0) are compared wrt their hyperparameter a_0

Page 88: BAYSM'14, Wien, Austria

Toy examples (2)

Posterior of the double-exponential weight α for L(0, √2) data, with 5, . . . , 10^3 observations and 10^5 Gibbs iterations

Page 89: BAYSM'14, Wien, Austria

Toy examples (2)

Posterior of the Normal weight α for N(0, .7²) data, with 10^3 observations and 10^4 Gibbs iterations

Page 90: BAYSM'14, Wien, Austria

Toy examples (2)

Posterior of the Normal weight α for N(0, 1) data, with 10^3 observations and 10^4 Gibbs iterations

Page 91: BAYSM'14, Wien, Austria

Toy examples (2)

Posterior of the normal weight α for double-exponential data, with 10^3 observations and 10^4 Gibbs iterations

Page 92: BAYSM'14, Wien, Austria

Danke schön! Enjoy BAYSM 2014!

