My life as a mixture
Christian P. Robert
Université Paris-Dauphine, Paris & University of Warwick, Coventry
September 17, 2014
[email protected]
Your next Valencia meeting:
I Objective Bayes section of ISBA major meeting:
I O-Bayes 2015 in Valencia, Spain, June 1-4(+1), 2015
I in memory of our friend Susie Bayarri
I objective Bayes, limited information, partly defined and approximate models, &tc
I all flavours of Bayesian analysis welcomed!
I “Spain in June, what else...?!”
Outline
Gibbs sampling
weakly informative priors
imperfect sampling
SAME algorithm
perfect sampling
Bayes factor
less informative prior
no Bayes factor
birthdate: May 1989, Ottawa Civic Hospital
Distribution of grey levels in an unprocessed chest radiograph
[X, 1994]
Mixture models
Structure of mixtures of distributions:
x ∼ f_j with probability p_j,
for j = 1, 2, . . . , k, with overall density
p_1 f_1(x) + · · · + p_k f_k(x).
Usual case: parameterised components
Σ_{i=1}^k p_i f(x|θ_i),  Σ_{i=1}^k p_i = 1
where the weights p_i are distinguished from the other parameters
Motivations
I Dataset made of several latent/missing/unobserved groups/strata/subpopulations. Mixture structure due to the missing origin/allocation of each observation to a specific subpopulation/stratum. Inference on either the allocations (clustering) or on the parameters (θ_i, p_i) or on the number of groups
I Semiparametric perspective where mixtures are functional basis approximations of unknown distributions
License
Dataset derived from [my] license plate image
Grey levels concentrated on 256 values [later jittered]
[Marin & X, 2007]
Likelihood
For a sample of independent random variables (x_1, . . . , x_n), the likelihood is
∏_{i=1}^n {p_1 f_1(x_i) + · · · + p_k f_k(x_i)}.
Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
But the likelihood is still computable [pointwise] in O(kn) time.
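As a sanity check on the O(kn) claim, here is a minimal sketch of pointwise log-likelihood evaluation for a Gaussian mixture (the Gaussian components and all names are illustrative choices, not from the talk):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_loglik(x, weights, means, sds):
    """log prod_i {p_1 f_1(x_i) + ... + p_k f_k(x_i)} in O(kn) operations."""
    # (n, k) array of log p_j + log f_j(x_i), reduced stably over components
    logterms = np.log(weights) + norm.logpdf(x[:, None], means, sds)
    return float(logsumexp(logterms, axis=1).sum())

x = np.array([0.1, 2.3, -1.0])
ll = mixture_loglik(x, np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 1.0]))
```

The log-sum-exp reduction avoids underflow when a point is far from every component.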
Normal mean benchmark
Normal mixture
p N(µ_1, 1) + (1 − p) N(µ_2, 1)
with only the means unknown (2-D representation possible)
Identifiability
Parameters µ_1 and µ_2 identifiable: µ_1 cannot be confused with µ_2 when p is different from 0.5.
Presence of a spurious mode, understood by letting p go to 0.5
Bayesian inference on mixtures
For any prior π(θ, p), the posterior distribution of (θ, p) is available up to a multiplicative constant
π(θ, p|x) ∝ [∏_{i=1}^n Σ_{j=1}^k p_j f(x_i|θ_j)] π(θ, p)
at a cost of order O(kn)
Difficulty
Despite this, derivation of posterior characteristics like posterior expectations is only possible in an exponential time of order O(k^n)!
Missing variable representation
Associate to each x_i a missing/latent variable z_i that indicates its component:
z_i|p ∼ M_k(p_1, . . . , p_k)
and
x_i|z_i, θ ∼ f(·|θ_{z_i}).
Completed likelihood
ℓ(θ, p|x, z) = ∏_{i=1}^n p_{z_i} f(x_i|θ_{z_i}),
and
π(θ, p|x, z) ∝ [∏_{i=1}^n p_{z_i} f(x_i|θ_{z_i})] π(θ, p)
where z = (z_1, . . . , z_n)
Gibbs sampling for mixture models
Take advantage of the missing data structure:
Algorithm
I Initialization: choose p^{(0)} and θ^{(0)} arbitrarily
I Step t. For t = 1, . . .
1. Generate z_i^{(t)} (i = 1, . . . , n) from (j = 1, . . . , k)
P(z_i^{(t)} = j | p_j^{(t−1)}, θ_j^{(t−1)}, x_i) ∝ p_j^{(t−1)} f(x_i|θ_j^{(t−1)})
2. Generate p^{(t)} from π(p|z^{(t)}),
3. Generate θ^{(t)} from π(θ|z^{(t)}, x).
[Brooks & Gelman, 1990; Diebolt & X, 1990, 1994; Escobar & West, 1991]
Normal mean example (cont’d)
Algorithm
I Initialization. Choose µ_1^{(0)} and µ_2^{(0)},
I Step t. For t = 1, . . .
1. Generate z_i^{(t)} (i = 1, . . . , n) from
P(z_i^{(t)} = 1) = 1 − P(z_i^{(t)} = 2) ∝ p exp(−(1/2)(x_i − µ_1^{(t−1)})²)
2. Compute n_j^{(t)} = Σ_{i=1}^n I_{z_i^{(t)}=j} and (sx_j)^{(t)} = Σ_{i=1}^n I_{z_i^{(t)}=j} x_i
3. Generate µ_j^{(t)} (j = 1, 2) from N((λδ + (sx_j)^{(t)})/(λ + n_j^{(t)}), 1/(λ + n_j^{(t)})).
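The steps above can be sketched directly; the N(δ, 1/λ) prior on each mean matches the update in step 3, while the test data and hyperparameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def gibbs_normal_means(x, p=0.5, delta=0.0, lam=1.0, n_iter=1000):
    """Gibbs sampler for p N(mu1,1) + (1-p) N(mu2,1) with known weight p
    and conjugate N(delta, 1/lam) priors on the means (slide notation)."""
    n = len(x)
    mu = rng.normal(size=2)                  # arbitrary initialisation
    draws = np.empty((n_iter, 2))
    w = np.array([p, 1.0 - p])
    for t in range(n_iter):
        # 1. allocation probabilities, proportional to p_j exp(-(x_i-mu_j)^2/2)
        logw = np.log(w) - 0.5 * (x[:, None] - mu) ** 2
        prob1 = 1.0 / (1.0 + np.exp(logw[:, 1] - logw[:, 0]))
        z = (rng.random(n) > prob1).astype(int)   # 0 -> comp 1, 1 -> comp 2
        for j in (0, 1):
            # 2. sufficient statistics n_j and sx_j
            nj = np.sum(z == j)
            sj = x[z == j].sum()
            # 3. conjugate normal update of mu_j
            mu[j] = rng.normal((lam * delta + sj) / (lam + nj),
                               np.sqrt(1.0 / (lam + nj)))
        draws[t] = mu
    return draws

x = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
draws = gibbs_normal_means(x, n_iter=500)
```

With well-separated data the chain settles in one of the two symmetric modes, which is exactly the trapping behaviour illustrated in the next figure.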
Normal mean example (cont’d)
[Figure: Gibbs sequences of (µ_1, µ_2), (a) initialised at random, (b) initialised close to the lower mode]
[X & Casella, 2009]
License
Consider k = 3 components, a D_3(1/2, 1/2, 1/2) prior for the weights, a N(x̄, σ̂²/3) prior on the means µ_i and a Ga(10, σ̂²) prior on the precisions σ_i^{−2}, where x̄ and σ̂² are the empirical mean and variance of License
[Empirical Bayes]
[Marin & X, 2007]
weakly informative priors
I possible symmetric empirical Bayes priors
p ∼ D(γ, . . . , γ), θ_i ∼ N(µ, ωσ_i²), σ_i^{−2} ∼ Ga(ν, εν)
which can be replaced with hierarchical priors
[Diebolt & X, 1990; Richardson & Green, 1997]
I independent improper priors on the θ_j’s prohibited, thus standard “flat” and Jeffreys priors impossible to use (except with the exclude-empty-component trick)
[Diebolt & X, 1990; Wasserman, 1999]
weakly informative priors
I Reparameterization to compact set for use of uniform priors
µ_i → e^{µ_i}/(1 + e^{µ_i}), σ_i → σ_i/(1 + σ_i)
[Chopin, 2000]
I dependent weakly informative priors
p ∼ D(k, . . . , 1), θ_i ∼ N(θ_{i−1}, ζσ²_{i−1}), σ_i ∼ U([0, σ_{i−1}])
[Mengersen & X, 1996; X & Titterington, 1998]
I reference priors
p ∼ D(1, . . . , 1), θ_i ∼ N(µ_0, (σ_i² + τ_0²)/2), σ_i² ∼ C⁺(0, τ_0²)
[Moreno & Liseo, 1999]
Re-ban on improper priors
Difficult to use improper priors in the setting of mixtures because independent improper priors,
π(θ) = ∏_{i=1}^k π_i(θ_i), with ∫ π_i(θ_i) dθ_i = ∞,
end up, for all n’s, with the property
∫ π(θ, p|x) dθ dp = ∞
Reason
There are (k − 1)^n terms among the k^n terms in the expansion that allocate no observation at all to the i-th component
Connected difficulties
1. Number of modes of the likelihood of order O(k!): maximization and even [MCMC] exploration of the posterior surface harder
2. Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: posterior expectation of θ_1 equal to posterior expectation of θ_2
License
When Gibbs output does not (re)produce exchangeability, the Gibbs sampler has failed to explore the whole parameter space: not enough energy to switch enough component allocations at once
[Marin & X, 2007]
Label switching paradox
I We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler.
I If we observe it, then we do not know how to estimate the parameters.
I If we do not, then we are uncertain about the convergence!!!
[Celeux, Hurn & X, 2000]
[Fruhwirth-Schnatter, 2001, 2004]
[Holmes, Jasra & Stephens, 2005]
Constraints
Usual reply to lack of identifiability: impose constraints like
µ_1 ≤ . . . ≤ µ_k
in the prior
Mostly incompatible with the topology of the posterior surface: posterior expectations then depend on the choice of the constraints.
Computational “detail”
The constraint need not be imposed during the simulation but can instead be imposed after simulation, reordering the MCMC output according to the constraints. [This avoids possible negative effects on convergence]
Relabeling towards the mode
Selection of one of the k! modal regions of the posterior, post-simulation, by computing the approximate MAP
(θ, p)^{(i*)} with i* = arg max_{i=1,...,M} π{(θ, p)^{(i)}|x}
Pivotal Reordering
At iteration i ∈ {1, . . . , M},
1. Compute the optimal permutation
τ_i = arg min_{τ∈S_k} d(τ{(θ^{(i)}, p^{(i)})}, (θ^{(i*)}, p^{(i*)}))
where d(·, ·) is a distance in the parameter space.
2. Set (θ^{(i)}, p^{(i)}) = τ_i((θ^{(i)}, p^{(i)})).
[Celeux, 1998; Stephens, 2000; Celeux, Hurn & X, 2000]
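A sketch of pivotal reordering by exhaustive search over S_k; the Euclidean distance and the toy draws are illustrative choices:

```python
import numpy as np
from itertools import permutations

def pivotal_reorder(theta_draws, pivot):
    """Relabel each MCMC draw (rows: iterations, cols: components) by the
    permutation minimising Euclidean distance to the pivot, e.g. the
    approximate MAP draw (Celeux, 1998; Stephens, 2000)."""
    k = theta_draws.shape[1]
    perms = list(permutations(range(k)))
    out = np.empty_like(theta_draws)
    for i, th in enumerate(theta_draws):
        # exhaustive search over the k! permutations (fine for small k)
        best = min(perms, key=lambda s: np.sum((th[list(s)] - pivot) ** 2))
        out[i] = th[list(best)]
    return out

draws = np.array([[0.1, 3.0], [2.9, -0.2], [0.0, 3.1]])  # label-switched draws
relabelled = pivotal_reorder(draws, pivot=np.array([0.0, 3.0]))
```

After reordering, componentwise posterior means become meaningful again.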
Loss functions for mixture estimation
Global loss function that considers distance between predictives
L(ξ, ξ̂) = ∫_X f_ξ(x) log{f_ξ(x)/f_ξ̂(x)} dx
eliminates the labelling effect
Similar solution for estimating clusters through allocation variables
L(z, ẑ) = Σ_{i<j} (I_{[z_i=z_j]}(1 − I_{[ẑ_i=ẑ_j]}) + I_{[ẑ_i=ẑ_j]}(1 − I_{[z_i=z_j]})).
[Celeux, Hurn & X, 2000]
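The allocation loss has a direct implementation; this sketch counts the pairwise disagreements exactly as in the displayed sum:

```python
import numpy as np

def allocation_loss(z, zhat):
    """Label-invariant clustering loss of Celeux, Hurn & Robert (2000):
    number of pairs (i,j) that one allocation groups together and the
    other separates."""
    z, zhat = np.asarray(z), np.asarray(zhat)
    same = z[:, None] == z[None, :]          # pairwise "same cluster" in z
    same_hat = zhat[:, None] == zhat[None, :]
    iu = np.triu_indices(len(z), k=1)        # pairs with i < j
    # the two indicator products in the sum amount to an exclusive-or
    return int(np.sum(same[iu] != same_hat[iu]))

# relabelling leaves the loss unchanged: the 1 <-> 2 swap costs nothing
loss = allocation_loss([1, 1, 2, 2], [2, 2, 1, 1])
```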
MAP estimation
For high-dimensional parameter spaces, difficulty with marginal MAP (MMAP) estimates because nuisance parameters must be integrated out:
θ_1^{MMAP} = arg max_{Θ_1} p(θ_1|y)
where
p(θ_1|y) = ∫_{Θ_2} p(θ_1, θ_2|y) dθ_2
MAP estimation
SAME stands for State Augmentation for Marginal Estimation
[Doucet, Godsill & X, 2001]
Artificially augmented probability model whose marginal distribution is
p_γ(θ_1|y) ∝ p(θ_1|y)^γ
via replications of the nuisance parameters:
I Replace θ_2 with γ artificial replications, θ_2(1), . . . , θ_2(γ)
I Treat the θ_2(j)’s as distinct random variables:
q_γ(θ_1, θ_2(1), . . . , θ_2(γ)|y) ∝ ∏_{k=1}^γ p(θ_1, θ_2(k)|y)
I Use the corresponding marginal for θ_1:
q_γ(θ_1|y) = ∫ q_γ(θ_1, θ_2(1), . . . , θ_2(γ)|y) dθ_2(1) . . . dθ_2(γ)
∝ ∫ ∏_{k=1}^γ p(θ_1, θ_2(k)|y) dθ_2(1) . . . dθ_2(γ) = p_γ(θ_1|y)
I Build an MCMC algorithm in the augmented space, with invariant distribution q_γ(θ_1, θ_2(1), . . . , θ_2(γ)|y)
I Use the simulated subsequence {θ_1^{(i)}; i ∈ N} as drawn from the marginal posterior p_γ(θ_1|y)
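A toy sketch of the SAME mechanism on a deliberately simple model, assumed purely for illustration (normal mean with an inverse-gamma nuisance variance and flat prior on µ, not the mixture case): the γ replicated nuisance draws sharpen the conditional of θ_1 so the chain concentrates on the marginal MAP.

```python
import numpy as np

rng = np.random.default_rng(1)

def same_mmap_mu(y, gamma=20, a=2.0, b=1.0, n_iter=2000):
    """SAME sketch for the marginal MAP of mu in y_i ~ N(mu, sigma^2),
    flat prior on mu, sigma^2 ~ InvGamma(a, b) as nuisance: gamma
    independent replications of sigma^2 are Gibbs-updated, so the
    invariant marginal of mu is proportional to p(mu|y)^gamma."""
    n, ybar = len(y), y.mean()
    mu = ybar + rng.normal()
    for _ in range(n_iter):
        # update each replicated nuisance from its inverse-gamma conditional
        ss = 0.5 * np.sum((y - mu) ** 2)
        sig2 = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + ss), size=gamma)
        # mu conditional: product of gamma normal likelihoods, flat prior
        prec = n * np.sum(1.0 / sig2)
        mu = rng.normal(ybar, np.sqrt(1.0 / prec))
    return mu

y = rng.normal(2.0, 1.0, size=100)
mu_hat = same_mmap_mu(y)   # concentrates near the MMAP, here the sample mean
```

In this conjugate toy example the marginal posterior of µ is a t density centred at the sample mean, so the chain visibly tightens around it as γ grows.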
example: Galaxy dataset benchmark
82 observations of galaxy velocities from 3 (?) groups

Algorithm                  EM      MCEM    SAME
Mean log-posterior         65.47   60.73   66.22
Std dev of log-posterior   2.31    4.48    0.02

[Doucet & X, 2002]
Really the SAME?!
SAME algorithm re-invented in many guises:
I Gaetan & Yao, 2003, Biometrika
I Jacquier, Johannes & Polson, 2007, J. Econometrics
I Lele, Dennis & Lutscher, 2007, Ecology Letters [data cloning]
I ...
Propp and Wilson’s perfect sampler
Difficulty devising MCMC stopping rules: when should one stop an MCMC algorithm?!
Principle: Coupling from the past
rather than start at t = 0 and wait till t = +∞, start at t = −∞ and wait till t = 0
[Propp & Wilson, 1996]
Outcome at time t = 0 is stationary
CFTP Algorithm
Algorithm (Coupling from the past)
1. Start from the m possible values at time −t
2. Run the m chains till time 0 (coupling allowed)
3. Check if the chains are equal at time 0
4. If not, start further back: t ← 2t, using the same random numbers at times already simulated
I requires a finite state space
I probability of merging chains must be high enough
I hard to implement w/o a monotonicity in both state space and transition
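A minimal coupling-from-the-past sketch on a toy 3-state chain (the transition matrix is an arbitrary illustration): a single uniform per time step drives all chains, and the uniforms are reused when t doubles.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 3-state transition matrix (rows sum to 1)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
CDF = np.cumsum(P, axis=1)

def update(state, u):
    """Deterministic update: one uniform drives every chain (grand coupling)."""
    return int(np.searchsorted(CDF[state], u))

def cftp():
    """Propp-Wilson: start all m states at time -t, reuse the SAME uniforms
    for already-simulated times, double t until all chains coalesce at 0."""
    t, us = 1, []
    while True:
        while len(us) < t:
            us.append(rng.random())       # us[j] is the noise for time -(j+1)
        states = list(range(len(P)))      # all starting values at time -t
        for j in reversed(range(t)):      # times -t, ..., -1
            states = [update(s, us[j]) for s in states]
        if len(set(states)) == 1:
            return states[0]              # exact draw from the stationary law
        t *= 2

sample = [cftp() for _ in range(2000)]
```

Here any u < 0.2 sends all three states to 0, so coalescence is guaranteed quickly; the returned draws follow the stationary distribution exactly, with no burn-in diagnostics.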
Mixture models
Simplest possible mixture structure
p f_0(x) + (1 − p) f_1(x),
with uniform prior on p.
Algorithm (Data Augmentation Gibbs sampler)
At iteration t:
1. Generate n iid U(0, 1) rv’s u_1^{(t)}, . . . , u_n^{(t)}.
2. Derive the indicator variables z_i^{(t)} as z_i^{(t)} = 0 iff
u_i^{(t)} ≤ q_i^{(t−1)} = p^{(t−1)} f_0(x_i) / (p^{(t−1)} f_0(x_i) + (1 − p^{(t−1)}) f_1(x_i))
and compute
m^{(t)} = Σ_{i=1}^n z_i^{(t)}.
3. Simulate p^{(t)} ∼ Be(n + 1 − m^{(t)}, 1 + m^{(t)}).
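This sampler is easy to reproduce; a sketch with two fixed normal components standing in for f_0 and f_1 (the components and true weight are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def da_gibbs(x, f0, f1, n_iter=1000):
    """Data-augmentation Gibbs sampler from the slide: uniform prior on p,
    z_i = 0 iff u_i <= q_i = p f0(x_i) / (p f0(x_i) + (1-p) f1(x_i)),
    then p ~ Be(n + 1 - m, 1 + m) with m = sum z_i."""
    n = len(x)
    d0, d1 = f0(x), f1(x)          # fixed component densities at the data
    p, draws = 0.5, np.empty(n_iter)
    for t in range(n_iter):
        q = p * d0 / (p * d0 + (1 - p) * d1)
        z = (rng.random(n) > q).astype(int)     # z_i = 0 with prob q_i
        m = z.sum()
        p = rng.beta(n + 1 - m, 1 + m)
        draws[t] = p
    return draws

# example: f0 = N(0,1), f1 = N(3,1), true weight 0.7
x = np.where(rng.random(200) < 0.7, rng.normal(0, 1, 200), rng.normal(3, 1, 200))
draws = da_gibbs(x, lambda v: norm.pdf(v, 0, 1), lambda v: norm.pdf(v, 3, 1))
```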
Mixture models
Algorithm (CFTP Gibbs sampler)
At iteration −t:
1. Generate n iid uniform rv’s u_1^{(−t)}, . . . , u_n^{(−t)}.
2. Partition [0, 1) into intervals [q_{[j]}, q_{[j+1]}).
3. For each [q_{[j]}^{(−t)}, q_{[j+1]}^{(−t)}), generate p_j^{(−t)} ∼ Be(n − j + 1, j + 1).
4. For each j = 0, 1, . . . , n, r_j^{(−t)} ← p_j^{(−t)}
5. For (ℓ = 1; ℓ < T; ℓ++), r_j^{(−t+ℓ)} ← p_k^{(−t+ℓ)} with k such that r_j^{(−t+ℓ−1)} ∈ [q_{[k]}^{(−t+ℓ)}, q_{[k+1]}^{(−t+ℓ)}]
6. Stop if the r_j^{(0)}’s (0 ≤ j ≤ n) are all equal. Otherwise, t ← 2t.
[Hobert et al., 1999]
Mixture models
Extension to the case k = 3:
Sample of n = 35 observations from
.23N(2.2, 1.44) + .62N(1.4, 0.49) + .15N(0.6, 0.64)
[Hobert et al., 1999]
Bayesian model choice
Comparison of models M_i by Bayesian means:
probabilise the entire model/parameter space
I allocate probabilities p_i to all models M_i
I define priors π_i(θ_i) for each parameter space Θ_i
I compute
π(M_i|x) = p_i ∫_{Θ_i} f_i(x|θ_i) π_i(θ_i) dθ_i / Σ_j p_j ∫_{Θ_j} f_j(x|θ_j) π_j(θ_j) dθ_j
Bayesian model choice
Relies on a central notion: the evidence
Z_k = ∫_{Θ_k} π_k(θ_k) L_k(θ_k) dθ_k,
aka the marginal likelihood.
Chib’s representation
Direct application of Bayes’ theorem: given x ∼ f_k(x|θ_k) and θ_k ∼ π_k(θ_k),
Z_k = m_k(x) = f_k(x|θ_k) π_k(θ_k) / π_k(θ_k|x)
Replace with an approximation to the posterior
Ẑ_k = m̂_k(x) = f_k(x|θ_k*) π_k(θ_k*) / π̂_k(θ_k*|x).
[Chib, 1995]
Case of latent variables
For missing variable z as in mixture models, natural Rao-Blackwell estimate
π̂_k(θ_k*|x) = (1/T) Σ_{t=1}^T π_k(θ_k*|x, z_k^{(t)}),
where the z_k^{(t)}’s are Gibbs-sampled latent variables
[Diebolt & Robert, 1990; Chib, 1995]
Compensation for label switching
For mixture models, z_k^{(t)} usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory
Consequences on the numerical approximation, biased by an order k!
Recover the theoretical symmetry by using
π̃_k(θ_k*|x) = (1/(T k!)) Σ_{σ∈S_k} Σ_{t=1}^T π_k(σ(θ_k*)|x, z_k^{(t)})
for all σ’s in S_k, the set of all permutations of {1, . . . , k}
[Berkhof, Mechelen, & Gelman, 2003]
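A sketch of the symmetrised Rao-Blackwell estimate for a normal-mean mixture with unit variances and N(0, τ²) priors on the means, a conjugate setup assumed purely for illustration:

```python
import numpy as np
from itertools import permutations
from scipy.stats import norm

def chib_posterior_density(mu_star, x, z_draws, tau2=1.0):
    """Rao-Blackwell estimate of pi(mu*|x) for the normal-mean mixture with
    unit variances and N(0, tau2) priors, symmetrised over the k!
    relabellings as in Berkhof, Mechelen & Gelman (2003)."""
    k = len(mu_star)
    perms = list(permutations(range(k)))
    total = 0.0
    for z in z_draws:                      # one Gibbs-sampled allocation per row
        for s in perms:
            mu_s = np.asarray(mu_star)[list(s)]
            val = 1.0
            for j in range(k):             # conjugate conditional of mu_j given z
                nj, sj = np.sum(z == j), x[z == j].sum()
                post_var = 1.0 / (1.0 / tau2 + nj)
                val *= norm.pdf(mu_s[j], post_var * sj, np.sqrt(post_var))
            total += val
    return total / (len(z_draws) * len(perms))
```

The returned value is the denominator of Chib's identity; by construction it is invariant under relabelling of the components of µ*, which is exactly the symmetry the plain average misses.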
Galaxy dataset (k)
Using Chib’s estimate, with θ_k* as MAP estimator,
log(Ẑ_k(x)) = −105.1396
for k = 3, while introducing permutations leads to
log(Ẑ_k(x)) = −103.3479
Note that −105.1396 + log(3!) = −103.3479

k        2        3        4        5        6        7        8
Z_k(x)  −115.68  −103.35  −102.66  −101.93  −102.88  −105.48  −108.44

Estimates of the marginal likelihoods by the symmetrised Chib’s approximation (based on 10^5 Gibbs iterations and, for k > 5, 100 permutations selected at random in S_k).
[Lee et al., 2008]
More efficient sampling
Difficulty with the explosive number of terms in
π̃_k(θ_k*|x) = (1/(T k!)) Σ_{σ∈S_k} Σ_{t=1}^T π_k(σ(θ_k*)|x, z_k^{(t)})
when most terms are equal to zero...
Iterative bridge sampling:
Ê^{(t)}(k) = Ê^{(t−1)}(k) · [M_1^{−1} Σ_{l=1}^{M_1} π̃(θ_l|x)/(M_1 q(θ_l) + M_2 π̃(θ_l|x))] / [M_2^{−1} Σ_{m=1}^{M_2} q(θ_m)/(M_1 q(θ_m) + M_2 π̃(θ_m|x))]
[Fruhwirth-Schnatter, 2004]
More efficient sampling
where
q(θ) = (1/J_1) Σ_{j=1}^{J_1} p(θ|z^{(j)}) ∏_{i=1}^k p(ξ_i|ξ_{i<j}^{(j)}, ξ_{i>j}^{(j−1)}, z^{(j)}, x)
More efficient sampling
or
q(θ) = (1/k!) Σ_{σ∈S(k)} p(θ|σ(z^o)) ∏_{i=1}^k p(ξ_i|σ(ξ_{i<j}^o), σ(ξ_{i>j}^o), σ(z^o), x)
Further efficiency
After de-switching (un-switching?), representation of the importance function as
q(θ) = (1/(J k!)) Σ_{j=1}^J Σ_{σ∈S_k} π(θ|σ(ϕ^{(j)}), x) = (1/k!) Σ_{σ∈S_k} h_σ(θ)
where h_σ is associated with a particular mode of q
Assuming generations
(θ^{(1)}, . . . , θ^{(T)}) ∼ h_{σ_c}(θ)
how many of the h_σ(θ^{(t)}) are non-zero?
Sparsity for the sum
Contribution of each term relative to q(θ)
η_σ(θ) = h_σ(θ)/(k! q(θ)) = h_{σ_i}(θ) / Σ_{σ∈S_k} h_σ(θ)
and importance of permutation σ evaluated by
Ê_{h_{σ_c}}[η_{σ_i}(θ)] = (1/M) Σ_{l=1}^M η_{σ_i}(θ^{(l)}), θ^{(l)} ∼ h_{σ_c}(θ)
Approximate set A(k) ⊆ S(k) consists of [σ_1, . . . , σ_n] for the smallest n that satisfies the condition
φ̂_n = (1/M) Σ_{l=1}^M |q̂_n(θ^{(l)}) − q(θ^{(l)})| < τ
dual importance sampling with approximation
DIS2A
1. Randomly select {z^{(j)}, θ^{(j)}}_{j=1}^J from the Gibbs sample and un-switch. Construct q(θ)
2. Choose h_{σ_c}(θ) and generate particles {θ^{(t)}}_{t=1}^T ∼ h_{σ_c}(θ)
3. Construct the approximation q̂(θ) using the first M-sample
3.1 Compute Ê_{h_{σ_c}}[η_{σ_1}(θ)], . . . , Ê_{h_{σ_c}}[η_{σ_{k!}}(θ)]
3.2 Reorder the σ’s such that Ê_{h_{σ_c}}[η_{σ_1}(θ)] ≥ · · · ≥ Ê_{h_{σ_c}}[η_{σ_{k!}}(θ)].
3.3 Initially set n = 1 and compute the q̂_n(θ^{(t)})’s and φ̂_n. If φ̂_n < τ, go to Step 4. Otherwise increase n = n + 1
4. Replace q(θ^{(1)}), . . . , q(θ^{(T)}) with q̂(θ^{(1)}), . . . , q̂(θ^{(T)}) to estimate Ê
[Lee & X, 2014]
illustrations
(a) Fishery data

k   k!   |A(k)|    ∆(A)
3   6    1.0000    0.1675
4   24   2.7333    0.1148

(b) Galaxy data

k   k!    |A(k)|     ∆(A)
3   6     1.000      0.1675
4   24    15.7000    0.6545
6   720   298.1200   0.4146

Table: Mean estimates of approximate set sizes, |A(k)|, and the reduction rate of the number of evaluated h-terms, ∆(A), for (a) fishery and (b) galaxy datasets
Jeffreys priors for mixtures [teaser]
True Jeffreys prior for mixtures of distributions defined as
|E_θ[∇^T ∇ log f(X|θ)]|
I O(k) matrix
I unavailable in closed form except in special cases
I unidimensional integrals approximated by Monte Carlo tools
[Grazian [talk tomorrow] et al., 2014+]
Difficulties
I complexity grows in O(k²)
I significant computing requirement (reduced by delayed acceptance)
[Banterle et al., 2014]
I differs from component-wise Jeffreys
[Diebolt & X, 1990; Stoneking, 2014]
I when is the posterior proper?
I how to check properness via MCMC outputs?
Difficulties with Bayes factors
I delicate calibration towards supporting a given hypothesis or model
I long-lasting impact of prior modelling, despite overall consistency
I discontinuity in the use of improper priors in most settings
I binary outcome more suited for immediate decision than for model evaluation
I related impossibility to ascertain misfit or outliers
I missing assessment of uncertainty associated with the decision
I difficult computation of marginal likelihoods in most settings
Reformulation
I Representation of the test problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1
I Mixture model thus contains both models under comparison as extreme cases
I Inspired by the consistency result of Rousseau and Mengersen (2011) on overfitting mixtures
I Use of the posterior distribution of the weight of a model instead of a single-digit posterior probability
[Kamari [see poster] et al., 2014+]
Construction of Bayes tests
Given two statistical models,
M_1: x ∼ f_1(x|θ_1), θ_1 ∈ Θ_1 and M_2: x ∼ f_2(x|θ_2), θ_2 ∈ Θ_2,
embed both models within an encompassing mixture model
M_α: x ∼ α f_1(x|θ_1) + (1 − α) f_2(x|θ_2), 0 ≤ α ≤ 1. (1)
Both models as special cases of the mixture model, one for α = 1 and the other for α = 0
Test as inference on α
Arguments
I substituting an estimate of the weight α for the posterior probability of model M_1 produces an equally convergent indicator of which model is “true” while removing the need for often artificial prior probabilities on model indices
I interpretation at least as natural as for the posterior probability, while avoiding the zero-one loss setting
I highly problematic computation of marginal likelihoods bypassed by standard algorithms for mixture estimation
I straightforward extension to a collection of models allows one to consider all models at once
I posterior on α evaluates thoroughly the strength of support for a given model, compared with a single-digit Bayes factor
I mixture model acknowledges the possibility that both models [or none] could be acceptable
Arguments
I standard prior modelling can be reproduced here but improper priors now acceptable, when both models are reparameterised towards common-meaning parameters, e.g. location and scale
I using the same parameters on both components is essential: opposition between components is not an issue with different parameter values
I parameters of the components, θ_1 and θ_2, integrated out by Monte Carlo
I contrary to common testing settings, data signal lack of agreement with either model when the posterior on α is away from both 0 and 1
I in most settings, approach easily calibrated by parametric bootstrap providing the posterior of α under each model and the prior predictive error
Toy examples (1)
Test of a Poisson P(λ) versus a geometric Geo(p) [as a number of failures, starting at zero]
Same parameter used in Poisson P(λ) and geometric Geo(p) with
p = 1/(1 + λ)
Improper noninformative prior π(λ) = 1/λ is valid
Posterior on λ conditional on the allocation vector ζ:
π(λ|x, ζ) ∝ exp(−n_1(ζ)λ) λ^{Σ_{i=1}^n x_i − 1} (λ + 1)^{−(n_2 + s_2(ζ))}
and α ∼ Be(n_1 + a_0, n_2 + a_0)
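A sketch of the resulting Gibbs sampler for this toy test; the random-walk Metropolis step for λ (and its 0.3 scale) is an implementation choice, not from the talk:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(7)

def log_post_lambda(lam, n1, n2, s_all, s2):
    # pi(lambda | x, zeta) from the slide, up to a constant
    return -n1 * lam + (s_all - 1) * np.log(lam) - (n2 + s2) * np.log1p(lam)

def mixture_test(x, a0=0.5, n_iter=2000):
    """Gibbs sampler for the Poisson-vs-geometric encompassing mixture:
    alpha ~ Be(a0, a0), improper prior 1/lambda, lambda updated by a
    random-walk Metropolis step on log(lambda)."""
    x = np.asarray(x)
    n, s_all = len(x), x.sum()
    lam, alpha = max(x.mean(), 0.1), 0.5
    alphas = np.empty(n_iter)
    for t in range(n_iter):
        # allocation: Poisson mass vs geometric mass at each x_i
        lp = np.log(alpha) - lam + x * np.log(lam) - gammaln(x + 1)
        lg = np.log(1 - alpha) + x * np.log(lam) - (x + 1) * np.log1p(lam)
        pr = 1 / (1 + np.exp(lg - lp))
        zeta = rng.random(n) < pr            # True -> Poisson component
        n1 = zeta.sum()
        n2, s2 = n - n1, x[~zeta].sum()
        alpha = rng.beta(n1 + a0, n2 + a0)
        # Metropolis step for lambda
        prop = lam * np.exp(0.3 * rng.normal())
        logr = (log_post_lambda(prop, n1, n2, s_all, s2)
                - log_post_lambda(lam, n1, n2, s_all, s2)
                + np.log(prop) - np.log(lam))   # Jacobian of the log move
        if np.log(rng.random()) < logr:
            lam = prop
        alphas[t] = alpha
    return alphas

x = rng.geometric(0.5, size=500) - 1        # geometric data, failures from 0
alphas = mixture_test(x)                    # posterior mass of alpha shifts to 0
```

For geometric data the posterior of α drifts towards 0, the behaviour shown in the next figure for much larger samples.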
Toy examples (1)
[Figure: posterior of the Poisson weight α when a_0 = .1, .2, .3, .4, .5, 1, for a sample of 10^5 observations from a geometric G(0.5); lBF = −509684]
Toy examples (2)
Normal N(µ, 1) model versus double-exponential L(µ, √2) [Scale √2 is intentionally chosen to make both distributions share the same variance]
Location parameter µ can be shared by both models with a single flat prior π(µ). Beta distributions B(a_0, a_0) are compared wrt their hyperparameter a_0
Toy examples (2)
Posterior of the double-exponential weight α for L(0, √2) data, with 5, . . . , 10^3 observations and 10^5 Gibbs iterations
Toy examples (2)
Posterior of the Normal weight α for N(0, .72) data with 10^3 observations and 10^4 Gibbs iterations
Toy examples (2)
Posterior of the Normal weight α for N(0, 1) data with 10^3 observations and 10^4 Gibbs iterations
Toy examples (2)
Posterior of the normal weight α for double-exponential data, with 10^3 observations and 10^4 Gibbs iterations
Danke schön! Enjoy BAYSM 2014!