Rimini discussion
Post on 10-May-2015
The 21st Bayesian Century
“The 21st Century belongs to Bayes”, as argued by a discussion on Bayesian testing and Bayesian model choice
Christian P. Robert
Université Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
http://xianblog.wordpress.com
July 1, 2009
A consequence of Bayesian statistics being given a proper name is that it encourages too much historical deference
from people who think that the bibles of Jeffreys, de Finetti, Jaynes, and others have all the answers.
—Gelman, Bayesian Analysis 3(3), 2008
Outline
Anyone not shocked by the Bayesian theory of inference has not understood it
Senn, BA, 2008
Introduction
Tests and model choice
Bayesian Calculations
A Defense of the Bayesian Choice
Introduction
Vocabulary and concepts
Bayesian inference is a coherent mathematical theory but I don’t trust it in scientific applications.
Gelman, BA, 2008
Introduction
  Models
  The Bayesian framework
  Improper prior distributions
  Noninformative prior distributions
Tests and model choice
Bayesian Calculations
A Defense of the Bayesian Choice
Models
Parametric model
Bayesians promote the idea that a multiplicity of parameters can be handled via hierarchical, typically exchangeable, models, but it seems
implausible that this could really work automatically [instead of] giving reasonable answers using minimal assumptions.
Gelman, BA, 2008
Observations x1, . . . , xn generated from a probability distribution
fi(xi|θi, x1, . . . , xi−1) = fi(xi|θi, x1:i−1)
x = (x1, . . . , xn) ∼ f(x|θ), θ = (θ1, . . . , θn)
Associated likelihood
ℓ(θ|x) = f(x|θ)
[inverted density & starting point]
And what about [Bayesian] nonparametrics?! Equally very active and definitely very 21st, thank you, but not mentioned in this talk!
7th Workshop on Bayesian Nonparametrics - Collegio... http://bnpworkshop.carloalberto.org/
21 - 25 June 2009, Moncalieri
The 7th Workshop on Bayesian Nonparametrics will be held at the Collegio Carlo Alberto from June 21 to 25, 2009. The Collegio is a Research Institution housed in a historical building located in Moncalieri on the outskirts of Turin, Italy.
The meeting will feature the latest developments in the area and will cover a wide variety of both theoretical and applied topics such as: foundations of the Bayesian nonparametric approach, construction and properties of prior distributions, asymptotics, interplay with probability theory and stochastic processes, statistical modelling, computational algorithms and applications in machine learning, biostatistics, bioinformatics, economics and econometrics.
The Workshop will be structured in 4 tutorials on special topics, a series of invited talks and contributed poster sessions.
The Bayesian framework
Bayes theorem 101
Bayes theorem = Inversion of probabilities
If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A) are related by
P(A|E) = P(E|A)P(A) / [P(E|A)P(A) + P(E|Ac)P(Ac)] = P(E|A)P(A) / P(E)
[Thomas Bayes (?)]
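This inversion is immediate to code; as a sanity check, here is a minimal sketch (the function name and the illustrative numbers, a 1% prevalence event with a 5% false-positive rate, are ours, not the talk's):

```python
def bayes_posterior(p_E_given_A, p_A, p_E_given_not_A):
    """Invert P(E|A) into P(A|E) via Bayes' theorem."""
    # law of total probability for the denominator P(E)
    p_E = p_E_given_A * p_A + p_E_given_not_A * (1 - p_A)
    return p_E_given_A * p_A / p_E

# illustrative numbers: P(A) = 0.01, P(E|A) = 0.95, P(E|A^c) = 0.05
posterior = bayes_posterior(0.95, 0.01, 0.05)  # ≈ 0.161
```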
Bayesian approach
The impact of treating x as a fixed constant is to increase statistical power as an artefact
Templeton, Molec. Ecol., 2009
New perspective
◮ Uncertainty on the parameters θ of a model modeled through a probability distribution π on Θ, called prior distribution
◮ Inference based on the distribution of θ conditional on x, π(θ|x), called posterior distribution
π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ
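A minimal numerical sketch of this update (our own illustration, not from the slides): a N(0, 1) prior combined with one observation x from N(θ, 1), where the grid posterior mean can be checked against the conjugate answer x/2.

```python
import numpy as np

theta = np.linspace(-8.0, 8.0, 4001)      # grid over Θ
dt = theta[1] - theta[0]
prior = np.exp(-theta**2 / 2)             # π(θ) ∝ N(0,1), unnormalised
x = 1.5
lik = np.exp(-(x - theta)**2 / 2)         # f(x|θ) ∝ N(θ,1)

post = prior * lik                        # numerator of Bayes' formula
post /= post.sum() * dt                   # divide by ∫ f(x|θ)π(θ) dθ
post_mean = (theta * post).sum() * dt     # conjugate value is x/2 = 0.75
```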
[Nonphilosophical] justifications
Ignoring the sampling error of x undermines the statistical validity of all inferences made by the method
Templeton, Molec. Ecol., 2009
◮ Semantic drift from unknown to random
◮ Actualization of the information on θ by extracting theinformation on θ contained in the observation x
◮ Allows incorporation of imperfect information in the decisionprocess
◮ Unique mathematical way to condition upon the observations(conditional perspective)
◮ Unique way to give meaning to statements like P(θ > 0)
Posterior distribution
Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience
Gelman, BA, 2008
π(θ|x) central to Bayesian inference
◮ Operates conditional upon the observations
◮ Incorporates the requirement of the Likelihood Principle
◮ Avoids averaging over the unobserved values of x
◮ Coherent updating of the information available on θ
◮ Provides a complete inferential machinery
Improper prior distributions
Improper distributions
If we take P (dσ) ∝ dσ as a statement that σ may have any valuebetween 0 and ∞ (...), we must use ∞ instead of 1 to denote certainty.
Jeffreys, ToP, 1939
Necessary extension from a prior distribution to a prior σ-finite measure π such that
∫Θ π(θ) dθ = +∞
Improper prior distribution
[Weird? Inappropriate?? report!!]
Justifications
If the parameter may have any value from −∞ to +∞,its prior probability should be taken as uniformly distributed
Jeffreys, ToP, 1939
Automated prior determination often leads to improper priors
1. Similar performances of estimators derived from these generalized distributions
2. Improper priors as limits of proper distributions in many [mathematical] senses
More justifications
There is no good objective principle for choosing a noninformative prior(even if that concept were mathematically defined, which it is not)
Gelman, BA, 2008
4. Robust answer against possible misspecifications of the prior
5. Frequentist justifications, such as:
   (i) minimaxity
   (ii) admissibility
   (iii) invariance (Haar measure)
6. Improper priors [much] preferred to vague proper priors like N(0, 10^6)
Validation
The mistake is to think of them as representing ignorance
Lindley, JASA, 1990
Extension of the posterior distribution π(θ|x) associated with an improper prior π as given by Bayes’s formula
π(θ|x) = f(x|θ)π(θ) / ∫Θ f(x|θ)π(θ) dθ,
when
∫Θ f(x|θ)π(θ) dθ < ∞
Delete all emotional names
Noninformative prior distributions
Noninformative priors
...cannot be expected to represent exactly total ignorance about the problem, but should rather be taken as reference priors, upon which everyone could fall back when the prior information is missing.
Kass and Wasserman, JASA, 1996
What if all we know is that we know “nothing”?!
In the absence of prior information, prior distributions are solely derived from the sample distribution f(x|θ).
Difficulty with uniform priors, which lack invariance properties.
Jeffreys’ prior
If we took the prior density for the parameters to be proportional to |I(θ)|1/2, it could be stated for any law that is differentiable with respect to all parameters that the total probability in any region of the θi would be equal to the total probability in the corresponding region of the θ′i
Jeffreys, ToP, 1939
Based on Fisher information
I(θ) = Eθ[ (∂ℓ/∂θT)(∂ℓ/∂θ) ]
Jeffreys’ prior distribution is
π∗(θ) ∝ |I(θ)|1/2
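For a concrete instance (our illustration): for a Bernoulli model f(x|θ) = θ^x (1−θ)^(1−x), the information is I(θ) = 1/θ(1−θ), so Jeffreys’ prior is ∝ θ^(−1/2)(1−θ)^(−1/2), i.e. a Beta(1/2, 1/2). The function names below are hypothetical.

```python
def fisher_info_bernoulli(theta):
    """I(θ) = E_θ[(∂ log f/∂θ)²] for a single Bernoulli(θ) draw."""
    score = lambda x: x / theta - (1 - x) / (1 - theta)   # ∂ log f/∂θ
    # expectation over x ∈ {0, 1}
    return theta * score(1)**2 + (1 - theta) * score(0)**2

def jeffreys_unnormalised(theta):
    """π*(θ) ∝ |I(θ)|^{1/2}, here 1/sqrt(θ(1-θ))."""
    return fisher_info_bernoulli(theta) ** 0.5
```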
Tests and model choice
The Jeffreys-subjective synthesis betrays a much more dangerous confusion than the Neyman-Pearson-Fisher synthesis as regards hypothesis tests
Senn, BA, 2008
Introduction
Tests and model choice
  Bayesian tests
  Bayes factors
  Opposition to classical tests
  Model choice
  Compatible priors
  Variable selection
Bayesian Calculations
Bayesian tests
Construction of Bayes tests
What is almost never used, however, is the Jeffreys significance test.
Senn, BA, 2008
Definition (Test)
Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a statistical model, a test is a statistical procedure that takes its values in {0, 1}.
Example (Normal mean)
For x ∼ N (θ, 1), decide whether or not θ ≤ 0.
Decision-theoretic perspective
Loss functions [are] not relevant to statistical inference
Gelman, BA, 2008
Theorem (Optimal Bayes decision)
Under the 0−1 loss function
L(θ, d) = 0 if d = IΘ0(θ), a0 if d = 1 and θ ∉ Θ0, a1 if d = 0 and θ ∈ Θ0,
the Bayes procedure is
δπ(x) = 1 if Prπ(θ ∈ Θ0|x) ≥ a0/(a0 + a1), 0 otherwise
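A sketch of this rule for the normal-mean example (the N(0, 1) prior is our choice, under which θ|x ∼ N(x/2, 1/2); function names are hypothetical):

```python
from math import erf, sqrt

def prob_H0(x):
    """P(θ ≤ 0 | x) for x ~ N(θ,1) with prior θ ~ N(0,1): posterior is N(x/2, 1/2)."""
    m, v = x / 2, 0.5
    return 0.5 * (1 + erf((0 - m) / sqrt(2 * v)))   # normal cdf at 0

def bayes_test(x, a0=1.0, a1=1.0):
    """Return 1 (accept H0: θ ≤ 0) iff Pr(θ ∈ Θ0 | x) ≥ a0/(a0 + a1)."""
    return 1 if prob_H0(x) >= a0 / (a0 + a1) else 0
```

With symmetric losses a0 = a1 the rule reduces to accepting H0 when its posterior probability exceeds 1/2.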
Bayes factors
A function of posterior probabilities
The method posits two or more alternative hypotheses and tests theirrelative fits to some observed statistics
Templeton, Mol. Ecol., 2009
Definition (Bayes factors)
For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0,
B01 = [π(Θ0|x) / π(Θc0|x)] / [π(Θ0) / π(Θc0)] = ∫Θ0 f(x|θ)π0(θ) dθ / ∫Θc0 f(x|θ)π1(θ) dθ
[Good, 1958 & Jeffreys, 1961]
Self-contained concept
Having a high relative probability does not mean that a hypothesis is true or supported by the data
Templeton, Mol. Ecol., 2009
Non-decision-theoretic:
◮ eliminates choice of π(Θ0)
◮ Bayesian/marginal equivalent to the likelihood ratio
◮ Jeffreys’ scale of evidence:
  ◮ if log10(Bπ10) between 0 and 0.5, evidence against H0 weak,
  ◮ if log10(Bπ10) between 0.5 and 1, evidence substantial,
  ◮ if log10(Bπ10) between 1 and 2, evidence strong, and
  ◮ if log10(Bπ10) above 2, evidence decisive
A major modification
Considering whether a location parameter α is 0. The prior is uniformand we should have to take f(α) = 0 and B10 would always be infinite
Jeffreys, ToP, 1939
When the null hypothesis is supported by a set of measure 0, π(Θ0) = 0 and thus π(Θ0|x) = 0.
[End of the story?!]
Changing the prior to fit the hypotheses
Requirement
Defined prior distributions under both assumptions,
π0(θ) ∝ π(θ)IΘ0(θ), π1(θ) ∝ π(θ)IΘ1(θ),
(under the standard dominating measures on Θ0 and Θ1)
Using the prior probabilities π(Θ0) = ρ0 and π(Θ1) = 1 − ρ0,
π(θ) = ρ0 π0(θ) + (1 − ρ0) π1(θ).
Point null hypotheses
I have no patience for statistical methods that assign positive probabilityto point hypotheses of the θ = 0 type that can never actually be true
Gelman, BA, 2008
Take ρ0 = Prπ(θ = θ0) and g1 prior density under Ha. Then
π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)]
and Bayes factor
Bπ01(x) = [f(x|θ0)ρ0 / m1(x)(1 − ρ0)] / [ρ0 / (1 − ρ0)] = f(x|θ0) / m1(x)
Point null hypotheses (cont’d)
Example (Normal mean)
Test of H0 : θ = 0 when x ∼ N(θ, σ2): we take π1 as N(0, τ2), so that
m1(x)/f(x|0) = √(σ2/(σ2 + τ2)) exp{τ2x2 / 2σ2(σ2 + τ2)}
The posterior probabilities of H0 below are for τ = σ and samples of size n (z denoting the standardized statistic √n x̄/σ):

n\z   0      0.68   1.28   1.96
1     0.586  0.557  0.484  0.351
10    0.768  0.729  0.612  0.366
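The entries of this table can be reproduced directly; a sketch, taking σ = 1 and ρ0 = 1/2 (the n = 10 row corresponds to replacing σ² by σ²/n and x by x̄ = z/√n):

```python
from math import exp, sqrt

def prob_null(x, tau, sigma=1.0, rho0=0.5):
    """π(Θ0 | x) for H0: θ = 0 against θ ~ N(0, τ²), via the ratio m1(x)/f(x|0)."""
    ratio = sqrt(sigma**2 / (sigma**2 + tau**2)) * \
        exp(tau**2 * x**2 / (2 * sigma**2 * (sigma**2 + tau**2)))
    return 1.0 / (1.0 + (1 - rho0) / rho0 * ratio)
```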
Opposition to classical tests
Comparison with classical tests
The 95 percent frequentist intervals will live up to their advertised coverage claims
Wasserman, BA, 2008
Standard answer
Definition (p-value)
The p-value p(x) associated with a test is the largest significance level for which H0 is rejected
Problems with p-values
The use of P implies that a hypothesis that may be true may be rejected because it had not predicted observable results that have not occurred
Jeffreys, ToP, 1939
◮ Evaluation of the wrong quantity, namely the probability to exceed the observed quantity (wrong conditioning)
◮ Evaluation only under the null hypothesis
◮ Huge numerical difference with the Bayesian range of answers
Bayesian lower bounds
If the Bayes estimator has good frequency behavior then we might as well use the frequentist method. If it has bad frequency behavior then we shouldn’t use it.
Wasserman, BA, 2008
Least favourable Bayesian answer is
B(x, GA) = inf g∈GA f(x|θ0) / ∫Θ f(x|θ)g(θ) dθ,
i.e., if there exists a mle for θ, θ̂(x),
B(x, GA) = f(x|θ0) / f(x|θ̂(x))
Illustration
Example (Normal case)
When x ∼ N(θ, 1) and H0 : θ = 0, the lower bounds are
B(x, GA) = e^{−x²/2} and P(x, GA) = (1 + e^{x²/2})^{−1},
i.e.

p-value  0.10   0.05   0.01   0.001
P        0.205  0.128  0.035  0.004
B        0.256  0.146  0.036  0.004
[Quite different!]
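These numbers can be reproduced in a few lines (a sketch; x is recovered as the two-sided normal quantile attaining the given p-value):

```python
from math import exp
from statistics import NormalDist

def lower_bounds(p):
    """Bayes-factor and posterior-probability lower bounds at two-sided p-value p."""
    x = NormalDist().inv_cdf(1 - p / 2)   # |x| attaining the p-value
    B = exp(-x**2 / 2)                    # B(x, G_A) = e^{-x²/2}
    P = 1 / (1 + exp(x**2 / 2))           # P(x, G_A) = (1 + e^{x²/2})^{-1}
    return B, P
```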
Model choice
Model choice and model comparison
There is no null hypothesis, which complicates the computation of sampling error
Templeton, Mol. Ecol., 2009
Choice among models
Several models available for the same observation(s)
Mi : x ∼ fi(x|θi), i ∈ I
where I can be finite or infinite
Bayesian resolution
The posterior probabilities are constructed by using a numerator that is a function of the observation for a particular model, then divided by a denominator that ensures that the “probabilities” sum to one
Templeton, Mol. Ecol., 2009
Probabilise the entire model/parameter space
◮ allocate probabilities pi to all models Mi
◮ define priors πi(θi) for each parameter space Θi
◮ compute
π(Mi|x) = pi ∫Θi fi(x|θi)πi(θi) dθi / Σj pj ∫Θj fj(x|θj)πj(θj) dθj
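A small sketch of this computation (our toy setting, not from the slides: two normal-mean models differing only in prior scale, with marginal likelihoods computed by a Riemann sum):

```python
import numpy as np

def marginal(x, tau, half_width=60.0, n=120001):
    """∫ f(x|θ) π(θ) dθ for x ~ N(θ,1) and θ ~ N(0, τ²), by Riemann sum."""
    t = np.linspace(-half_width, half_width, n)
    dt = t[1] - t[0]
    f = np.exp(-(x - t)**2 / 2) / np.sqrt(2 * np.pi)          # likelihood
    pi = np.exp(-t**2 / (2 * tau**2)) / (tau * np.sqrt(2 * np.pi))  # prior
    return float((f * pi).sum() * dt)

def model_posterior(x, taus, p):
    """π(M_i|x) ∝ p_i ∫ f_i(x|θ_i) π_i(θ_i) dθ_i."""
    w = np.array([pi * marginal(x, tau) for pi, tau in zip(p, taus)])
    return w / w.sum()
```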
Bayesian resolution (2)
The numerators are not co-measurable across hypotheses, and the denominators are sums of non-co-measurable entities. This means that it is mathematically impossible for them to be probabilities.
Templeton, Mol. Ecol., 2009
◮ take largest π(Mi|x) to determine “best” model, or use averaged predictive
Σj π(Mj|x) ∫Θj fj(x′|θj)πj(θj|x) dθj
Natural Ockham’s razor
Pluralitas non est ponenda sine necessitate [plurality should not be posited without necessity]
Variation is random until the contrary is shown; and new parameters in laws, when they are suggested, must be tested one at a time, unless there is specific reason to the contrary.
Jeffreys, ToP, 1939
The Bayesian approach naturally weights differently models withdifferent parameter dimensions (BIC).
Compatible priors
Compatibility principle
Further complicating dimensionality of test statistics is the fact that the models are often not nested, and one model may contain parameters that do not have analogues in the other models and vice versa
Templeton, Mol. Ecol., 2009
Difficulty of finding priors simultaneously on a collection of models
Easier to start from a single prior on a “big” [encompassing] model and to derive the others from a coherence principle
[Dawid & Lauritzen, 2000]
An illustration for linear regression
In the case M1 and M2 are two nested Gaussian linear regression models with Zellner’s g-priors and the same variance σ2 ∼ π(σ2):
◮ M1 : y|β1, σ2 ∼ N(X1β1, σ2In) with
β1|σ2 ∼ N(s1, σ2 n1 (XT1 X1)−1)
where X1 is a (n × k1) matrix of rank k1 ≤ n
◮ M2 : y|β2, σ2 ∼ N(X2β2, σ2In) with
β2|σ2 ∼ N(s2, σ2 n2 (XT2 X2)−1),
where X2 is a (n × k2) matrix with span(X2) ⊆ span(X1)
[© Marin & Robert, Bayesian Core]
Compatible g-priors
I don’t see any role for squared error loss, minimax, or the rest of what is sometimes called statistical decision theory
Gelman, BA, 2008
Since σ2 is a nuisance parameter, minimize the Kullback-Leibler divergence between both marginal distributions conditional on σ2, m1(y|σ2; s1, n1) and m2(y|σ2; s2, n2), with solution
β2|X2, σ2 ∼ N(s∗2, σ2 n∗2 (XT2 X2)−1)
with
s∗2 = (XT2 X2)−1 XT2 X1 s1,  n∗2 = n1
Variable selection
Variable selection
Regression setup where y is regressed on a set {x1, . . . , xp} of p potential explanatory regressors (plus intercept)
Corresponding 2p submodels Mγ, where γ ∈ Γ = {0, 1}p indicates inclusion/exclusion of variables by a binary representation, e.g. γ = 101001011 means that x1, x3, x6, x8 and x9 are included.
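The binary coding is straightforward to manipulate; a small helper (ours, not from Bayesian Core):

```python
def included_variables(gamma):
    """1-based indices of the regressors retained by the binary vector γ."""
    return [i + 1 for i, g in enumerate(gamma) if g == "1"]
```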
Notations
For model Mγ ,
◮ qγ variables included
◮ t1(γ) = {t1,1(γ), . . . , t1,qγ(γ)} indices of those variables and t0(γ) indices of the variables not included
◮ For β ∈ Rp+1,
βt1(γ) = [β0, βt1,1(γ), . . . , βt1,qγ(γ)]
Xt1(γ) = [1n | xt1,1(γ) | . . . | xt1,qγ(γ)].
Submodel Mγ is thus
y|β, γ, σ2 ∼ N(Xt1(γ) βt1(γ), σ2 In)
Global and compatible priors
Use Zellner’s g-prior, i.e. a normal prior for β conditional on σ2,
β|σ2 ∼ N(β, cσ2(XTX)−1)
and a Jeffreys prior for σ2,
π(σ2) ∝ σ−2
Resulting compatible prior
βt1(γ) ∼ N((XTt1(γ) Xt1(γ))−1 XTt1(γ) X β, cσ2 (XTt1(γ) Xt1(γ))−1)
Posterior model probability
Can be obtained in closed form:
π(γ|y) ∝ (c + 1)−(qγ+1)/2 [ yTy − c yTP1y/(c + 1) + βTXTP1Xβ/(c + 1) − 2 yTP1Xβ/(c + 1) ]−n/2.
Conditionally on γ, the posterior distributions of β and σ2 are
βt1(γ)|σ2, y, γ ∼ N[ c/(c + 1) (U1y + U1Xβ/c), σ2c/(c + 1) (XTt1(γ) Xt1(γ))−1 ],
σ2|y, γ ∼ IG[ n/2, yTy/2 − c yTP1y/2(c + 1) + βTXTP1Xβ/2(c + 1) − yTP1Xβ/(c + 1) ].
Noninformative case
Use the same compatible informative g-prior distribution with β = 0p+1 and a hierarchical diffuse prior distribution on c,
π(c) ∝ c−1 IN∗(c)  or  π(c) ∝ c−1 Ic>0
The choice of this hierarchical diffuse prior distribution on c is due to the posterior sensitivity of the model to large values of c: taking β = 0p+1 and c large does not work
Processionary caterpillar
Influence of some forest settlement characteristics on the development of caterpillar colonies
Response y: log-transform of the average number of nests of caterpillars per tree on an area of 500 square meters (n = 33 areas)
[© Marin & Robert, Bayesian Core]
Processionary caterpillar (cont’d)
Potential explanatory variables
x1 altitude (in meters), x2 slope (in degrees),
x3 number of pines in the square,
x4 height (in meters) of the tree at the center of the square,
x5 diameter of the tree at the center of the square,
x6 index of the settlement density,
x7 orientation of the square (from 1 if southbound to 2 otherwise),
x8 height (in meters) of the dominant tree,
x9 number of vegetation strata,
x10 mix settlement index (from 1 if not mixed to 2 if mixed).
Bayesian regression output

             Estimate   BF       log10(BF)
(Intercept)   9.2714   26.334     1.4205 (***)
X1           -0.0037    7.0839    0.8502 (**)
X2           -0.0454    3.6850    0.5664 (**)
X3            0.0573    0.4356   -0.3609
X4           -1.0905    2.8314    0.4520 (*)
X5            0.1953    2.5157    0.4007 (*)
X6           -0.3008    0.3621   -0.4412
X7           -0.2002    0.3627   -0.4404
X8            0.1526    0.4589   -0.3383
X9           -1.0835    0.9069   -0.0424
X10          -0.3651    0.4132   -0.3838

evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor
Bayesian variable selection
t1(γ)            π(γ|y, X)
0,1,2,4,5        0.0929
0,1,2,4,5,9      0.0325
0,1,2,4,5,10     0.0295
0,1,2,4,5,7      0.0231
0,1,2,4,5,8      0.0228
0,1,2,4,5,6      0.0228
0,1,2,3,4,5      0.0224
0,1,2,3,4,5,9    0.0167
0,1,2,4,5,6,9    0.0167
0,1,2,4,5,8,9    0.0137
Noninformative G-prior model choice
Bayesian Calculations
Bayesian methods seem to quickly move to elaborate computation
Gelman, BA, 2008
Introduction
Tests and model choice
Bayesian Calculations
  Implementation difficulties
  Bayes factor approximation
  ABC model choice
A Defense of the Bayesian Choice
Implementation difficulties
◮ Computing the posterior distribution
π(θ|x) ∝ π(θ)f(x|θ)
◮ Resolution of
arg minδ ∫Θ L(θ, δ)π(θ)f(x|θ) dθ
◮ Maximisation of the marginal posterior
arg max ∫Θ−1 π(θ|x) dθ−1
Further implementation difficulties
A statistical test returns a probability value, but rarely is the probability value per se the reason for an investigator performing the test
Templeton, Mol. Ecol., 2009
◮ Computing posterior quantities
δπ(x) = ∫Θ h(θ) π(θ|x) dθ = ∫Θ h(θ) π(θ)f(x|θ) dθ / ∫Θ π(θ)f(x|θ) dθ
◮ Resolution (in k) of
P(π(θ|x) ≥ k | x) = α
Monte Carlo methods
Bayesian simulation seems stuck in an infinite regress of inferential uncertainty
Gelman, BA, 2008
Approximation of
I = ∫Θ g(θ)f(x|θ)π(θ) dθ
takes advantage of the fact that f(x|θ)π(θ) is proportional to a density: if the θi’s are from π(θ),
(1/m) Σi=1..m g(θi)f(x|θi)
converges (almost surely) to I
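A quick check of this convergence (our toy choice: g ≡ 1, x ∼ N(θ, 1), π = N(0, 1), so that I is the marginal density m(x) = N(x; 0, 2)):

```python
import math
import random

random.seed(1)
x, m = 0.0, 200_000
acc = 0.0
for _ in range(m):
    th = random.gauss(0.0, 1.0)                                # θ_i ~ π
    acc += math.exp(-(x - th)**2 / 2) / math.sqrt(2 * math.pi) # g(θ_i) f(x|θ_i), g ≡ 1
mc_estimate = acc / m   # → ∫ f(x|θ)π(θ) dθ = 1/sqrt(4π) ≈ 0.2821 here
```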
Importance function
A simulation method of inference hides unrealistic assumptions
Templeton, Mol. Ecol., 2009
No need to simulate from π(·|x) or from π: if h is a probability density,
∫Θ g(θ)f(x|θ)π(θ) dθ = ∫ [g(θ)f(x|θ)π(θ)/h(θ)] h(θ) dθ
and, for θi ∼ h,
Σi=1..m g(θi)ω(θi) / Σi=1..m ω(θi)  with  ω(θi) = f(x|θi)π(θi)/h(θi)
approximates Eπ[g(θ)|x]
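A self-normalized sketch of this estimator (our toy: posterior mean of θ for x ∼ N(θ, 1) with π = N(0, 1), instrumental h = N(0, 2²); the exact answer is x/2):

```python
import math
import random

random.seed(2)
x, n = 1.0, 200_000
num = den = 0.0
for _ in range(n):
    th = random.gauss(0.0, 2.0)                                # θ_i ~ h
    h = math.exp(-th**2 / 8) / (2 * math.sqrt(2 * math.pi))    # h(θ_i)
    w = math.exp(-(x - th)**2 / 2) * math.exp(-th**2 / 2) / h  # ω = f·π/h (unnormalised)
    num += th * w
    den += w
is_estimate = num / den   # self-normalized IS estimate of E[θ | x] = x/2
```

Note that h has heavier tails than the integrand, so the weights ω have finite variance.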
Bayes factor approximation
When approximating the Bayes factor
B12 = ∫Θ1 f1(x|θ1)π1(θ1) dθ1 / ∫Θ2 f2(x|θ2)π2(θ2) dθ2 = Z1/Z2,
use importance functions ϖ1 and ϖ2 and
B̂12 = [n1−1 Σi=1..n1 f1(x|θi1)π1(θi1)/ϖ1(θi1)] / [n2−1 Σi=1..n2 f2(x|θi2)π2(θi2)/ϖ2(θi2)],  θij ∼ ϖj(θ)
[Chopin & Robert, 2007]
Bridge sampling
Special case: if
π1(θ|x) ∝ π̃1(θ|x), π2(θ|x) ∝ π̃2(θ|x)
live on the same space (Θ1 = Θ2), then
B12 ≈ (1/n) Σi=1..n π̃1(θi|x)/π̃2(θi|x),  θi ∼ π2(θ|x)
[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
(Further) bridge sampling
In addition, for any α(·),
B12 = ∫ π̃1(θ|x)α(θ)π2(θ|x) dθ / ∫ π̃2(θ|x)α(θ)π1(θ|x) dθ
≈ [ (1/n2) Σi=1..n2 π̃1(θ2i|x)α(θ2i) ] / [ (1/n1) Σi=1..n1 π̃2(θ1i|x)α(θ1i) ],  θji ∼ πj(θ|x)
Optimal bridge sampling
The optimal choice of auxiliary function is
α⋆(θ) = (n1 + n2) / (n1 π1(θ|x) + n2 π2(θ|x))
leading to
B12 ≈ [ (1/n2) Σi=1..n2 π̃1(θ2i|x) / (n1π1(θ2i|x) + n2π2(θ2i|x)) ] / [ (1/n1) Σi=1..n1 π̃2(θ1i|x) / (n1π1(θ1i|x) + n2π2(θ1i|x)) ]
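A sketch of the bridge identity on a toy pair where the answer is known (everything here, including sample sizes, is our own illustration: two unit-variance normal “posteriors” whose unnormalised densities π̃1, π̃2 share the same normalising constant, so B12 = 1):

```python
import math
import random

random.seed(3)
n1 = n2 = 50_000
pdf = lambda t, mu: math.exp(-(t - mu)**2 / 2) / math.sqrt(2 * math.pi)  # normalised π_j
ut = lambda t, mu: math.exp(-(t - mu)**2 / 2)                            # unnormalised π̃_j

th1 = [random.gauss(0.0, 1.0) for _ in range(n1)]   # θ_{1i} ~ π_1
th2 = [random.gauss(1.0, 1.0) for _ in range(n2)]   # θ_{2i} ~ π_2
alpha = lambda t: (n1 + n2) / (n1 * pdf(t, 0.0) + n2 * pdf(t, 1.0))      # α*(θ)

num = sum(ut(t, 0.0) * alpha(t) for t in th2) / n2   # estimates ∫ π̃1 α π2 dθ
den = sum(ut(t, 1.0) * alpha(t) for t in th1) / n1   # estimates ∫ π̃2 α π1 dθ
bridge_B12 = num / den                               # → Z1/Z2 = 1 here
```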
Approximating Zk from a posterior sample
Use of the [harmonic mean] identity
Eπk[ ϕ(θk) / πk(θk)Lk(θk) | x ] = ∫ [ϕ(θk) / πk(θk)Lk(θk)] [πk(θk)Lk(θk) / Zk] dθk = 1/Zk,
no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]
Direct exploitation of the MCMC output
Comparison with regular importance sampling
Harmonic mean: constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation
Ẑ1k = 1 / [ (1/T) Σt=1..T ϕ(θk(t)) / πk(θk(t))Lk(θk(t)) ]
to have a finite variance.
E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
Approximating Z using a mixture representation
Bridge sampling redux
Design a specific mixture for simulation [importance sampling] purposes, with density
ϕk(θk) ∝ ω1 πk(θk)Lk(θk) + ϕ(θk),
where ϕ(·) is arbitrary (but normalised)
Note: ω1 is not a probability weight
Approximating Z using a mixture representation (cont’d)
Corresponding MCMC (=Gibbs) sampler
At iteration t
1. Take δ(t) = 1 with probability
ω1 πk(θk(t−1))Lk(θk(t−1)) / ( ω1 πk(θk(t−1))Lk(θk(t−1)) + ϕ(θk(t−1)) )
and δ(t) = 2 otherwise;
2. If δ(t) = 1, generate θk(t) ∼ MCMC(θk(t−1), θk) where MCMC(θk, θ′k) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk);
3. If δ(t) = 2, generate θk(t) ∼ ϕ(θk) independently
Evidence approximation by mixtures
Rao-Blackwellised estimate
ξ̂ = (1/T) Σt=1..T ω1 πk(θk(t))Lk(θk(t)) / ( ω1 πk(θk(t))Lk(θk(t)) + ϕ(θk(t)) )
converges to ω1Zk / (ω1Zk + 1).
Deduce Ẑk from ω1Ẑk/(ω1Ẑk + 1) = ξ̂, i.e.
Ẑk = (1/ω1) [ Σt=1..T ω1 πk(θk(t))Lk(θk(t)) / (ω1 πk(θk(t))Lk(θk(t)) + ϕ(θk(t))) ] / [ Σt=1..T ϕ(θk(t)) / (ω1 πk(θk(t))Lk(θk(t)) + ϕ(θk(t))) ]
[Bridge sampler]
Chib’s representation
Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),
Zk = mk(x) = fk(x|θk)πk(θk) / πk(θk|x)
Use of an approximation π̂k to the posterior:
Ẑk = m̂k(x) = fk(x|θ∗k)πk(θ∗k) / π̂k(θ∗k|x).
Case of latent variables
For missing variable z as in mixture models, natural Rao-Blackwell estimate
π̂k(θ∗k|x) = (1/T) Σt=1..T πk(θ∗k|x, zk(t)),
where the zk(t)’s are Gibbs sampled latent variables
ABC model choice
Approximate Bayesian Computation
Simulation target is π(θ)f(x|θ) with likelihood f(x|θ) not in closed form.
Likelihood-free rejection technique:
ABC algorithm
For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating
θ′ ∼ π(θ), x ∼ f(x|θ′),
until the auxiliary variable x is equal to the observed value, x = y.
[Pritchard et al., 1999]
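For discrete data the exact-matching version is immediately implementable; a sketch (our toy: x a Binomial(10, θ) count with uniform prior, so the accepted θ′ are draws from the Beta(y+1, n−y+1) posterior):

```python
import random

random.seed(4)
n_trials, y_obs = 10, 7
accepted = []
for _ in range(200_000):
    th = random.random()                                    # θ' ~ π = U(0,1)
    x = sum(random.random() < th for _ in range(n_trials))  # x ~ f(·|θ')
    if x == y_obs:                                          # keep only exact matches
        accepted.append(th)
abc_mean = sum(accepted) / len(accepted)   # ≈ Beta(8, 4) mean = 2/3
```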
A as approximative
When y is a continuous random variable, equality x = y is replaced with a tolerance condition,
ρ(x, y) ≤ ε
where ρ is a distance between summary statistics.
Output distributed from
π(θ)Pθ{ρ(x, y) < ε} ∝ π(θ | ρ(x, y) < ε)
Gibbs random fields
Gibbs distribution
The rv y = (y1, . . . , yn) is a Gibbs random field associated with the graph G if
f(y) = (1/Z) exp{ −Σc∈C Vc(yc) },
where Z is the normalising constant, C is the set of cliques of G and Vc is any function, also called potential;
U(y) = Σc∈C Vc(yc) is the energy function
Z is usually unavailable in closed form
Potts model
Vc(y) is of the form
Vc(y) = θS(y) = θ Σl∼i δyl=yi
where l∼i denotes a neighbourhood structure
In most realistic settings, the summation
Zθ = Σx∈X exp{θTS(x)}
involves too many terms to be manageable and numerical approximations cannot always be trusted
[Cucala, Marin, CPR & Titterington, JASA, 2009]
Neighbourhood relations
Choice to be made between M neighbourhood relations
i ∼m i′ (0 ≤ m ≤ M − 1)
with
Sm(x) = Σi∼m i′ I{xi=xi′}
driven by the posterior probabilities of the models.
Model index
Formalisation via a model index M, a new parameter with prior distribution π(M = m) and π(θ|M = m) = πm(θm)
Computational target:
P(M = m|x) ∝ ∫Θm fm(x|θm)πm(θm) dθm π(M = m)
Sufficient statistics
If S(x) is a sufficient statistic for the joint parameters (M, θ0, . . . , θM−1), then
P(M = m|x) = P(M = m|S(x)).
For each model m, a sufficient statistic Sm(·) makes S(·) = (S0(·), . . . , SM−1(·)) also sufficient.
For Gibbs random fields,
x|M = m ∼ fm(x|θm) = f1m(x|S(x)) f2m(S(x)|θm) = (1/n(S(x))) f2m(S(x)|θm)
where
n(S(x)) = ♯{x̃ ∈ X : S(x̃) = S(x)}
S(x) is thus also sufficient for the joint parameters [specific to Gibbs random fields!]
ABC model choice algorithm (ABC-MC)
◮ Generate m∗ from the prior π(M = m).
◮ Generate θ∗m∗ from the prior πm∗(·).
◮ Generate x∗ from the model fm∗(·|θ∗m∗).
◮ Compute the distance ρ(S(x0), S(x∗)).
◮ Accept (θ∗m∗, m∗) if ρ(S(x0), S(x∗)) < ε.
[Cornuet, Grelaud, Marin & Robert, BA, 2008]
Note: when ε = 0 the algorithm is exact
Toy example
iid Bernoulli model versus two-state first-order Markov chain, i.e.
f0(x|θ0) = exp(θ0 Σi=1..n I{xi=1}) / {1 + exp(θ0)}n,
versus
f1(x|θ1) = (1/2) exp(θ1 Σi=2..n I{xi=xi−1}) / {1 + exp(θ1)}n−1,
with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by “phase transition” boundaries).
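The ABC-MC scheme for this toy example can be sketched as follows (the tolerance, the number of proposals and the L1 distance on the summary pair (S0, S1) are our own choices):

```python
import math
import random

random.seed(5)
n = 100

def sim_iid(theta0):
    """Sample from f0: iid Bernoulli with logit parameter θ0."""
    p = math.exp(theta0) / (1 + math.exp(theta0))
    return [1 if random.random() < p else 0 for _ in range(n)]

def sim_markov(theta1):
    """Sample from f1: two-state chain, P(x_i = x_{i-1}) parametrised by θ1."""
    q = math.exp(theta1) / (1 + math.exp(theta1))
    x = [random.randint(0, 1)]
    for _ in range(n - 1):
        x.append(x[-1] if random.random() < q else 1 - x[-1])
    return x

def stats(x):
    """Joint sufficient summaries (S0, S1) of the two models."""
    return sum(x), sum(x[i] == x[i - 1] for i in range(1, n))

x0 = sim_markov(3.0)          # pseudo-observed data from the Markov model
s0 = stats(x0)

kept = []
for _ in range(100_000):
    m = random.randint(0, 1)                                   # m* ~ π(M), uniform
    x = sim_iid(random.uniform(-5, 5)) if m == 0 else sim_markov(random.uniform(0, 6))
    s = stats(x)
    if abs(s[0] - s0[0]) + abs(s[1] - s0[1]) <= 5:             # ρ(S(x0), S(x*)) ≤ ε
        kept.append(m)
p_markov = sum(kept) / len(kept)   # ABC estimate of P(M = 1 | x0)
```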
Toy example (2)
(left) Comparison of the true BFm0/m1(x0) with its ABC approximation B̂Fm0/m1(x0) (in logs) over 2,000 simulations and 4·10^6 proposals from the prior. (right) Same when using a tolerance ε corresponding to the 1% quantile of the distances.
A Defense of the Bayesian Choice
Given the advances in practical Bayesian methods in the past two decades, anti-Bayesianism is no longer a serious option
Gelman, BA, 2009
Bayesians are of course their own worst enemies. They make non-Bayesians accuse them of religious fervour, and an unwillingness to see another point of view.
Davidson, 2009
1. Choosing a probabilistic representation
Bayesian statistics is about making probability statements
Gelman, BA, 2009
Bayesian Statistics appears as the calculus of uncertainty
Reminder: a probabilistic model is nothing but an interpretation of a given phenomenon
What is the meaning of RD’s t test example?!
1. Choosing a probabilistic representation (2)
Inference is impossible.
Davidson, 2009
The Bahadur–Savage problem stems from the inability to make choices about the shape of a statistical model, not from an impossibility to draw [Bayesian] inference.
Further, a probability distribution is more than the sum of its moments. Ill-posed problems thus highlight issues with the model, not with the inference.
2. Conditioning on the data
Bayesian data analysis is a method for summarizing uncertainty and making estimates and predictions using probability statements conditional on observed data and an assumed model
Gelman, BA, 2009
At the basis of statistical inference lies an inversion process between cause and effect. Using a prior distribution brings a necessary balance between observations and parameters and enables one to operate conditional upon x
What is the data in RD’s t test example?! U’s? Y’s?
3. Exhibiting the true likelihood
Frequentist statistics is an approach for evaluating statistical proceduresconditional on some family of posited probability models
Gelman, BA, 2009
Provides a complete quantitative inference on the parameters and predictives that points out inadequacies of frequentist statistics, while implementing the Likelihood Principle.
There needs to be a true likelihood, including in non-parametric settings
[Rousseau, Van der Vaart]
4. Using priors as tools and summaries
Bayesian techniques allow prior beliefs to be tested and discarded as appropriate
Gelman, BA, 2009
The choice of a prior distribution π does not require any kind of belief in this distribution: rather, consider it as a tool that summarizes the available prior information and the uncertainty surrounding this information
Non-identifiability is an issue in that the prior may strongly impact inference about the identifiable parts
4. Using priors as tools and summaries (2)
No uninformative prior exists for such models.Davidson, 2009
Reference priors can be deduced from the sampling distribution by an automated procedure, based on a minimal information principle that maximises the information brought by the data.
Important literature on prior modelling for non-parametric problems, incl. smoothness constraints.
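One concrete instance of such an automated construction is Jeffreys' prior, proportional to the square root of the Fisher information. The sketch below is an illustration added here (not from the slides): for a Bernoulli(θ) model, I(θ) = 1/(θ(1−θ)) and the Jeffreys prior reduces to the Beta(1/2, 1/2) density.

```python
import math

def jeffreys_bernoulli(theta):
    """Jeffreys prior for Bernoulli(theta): pi(theta) proportional to
    sqrt(I(theta)), with Fisher information I(theta) = 1/(theta*(1-theta)).
    Once normalised, this is exactly the Beta(1/2, 1/2) density."""
    return 1.0 / (math.pi * math.sqrt(theta * (1.0 - theta)))
```

The construction uses only the sampling distribution, with no input of subjective beliefs, which is the sense in which it answers the "no uninformative prior exists" objection.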
5. Accepting the subjective basis of knowledge
Knowledge is a critical confrontation between a prioris and experiments. Ignoring these a prioris impoverishes the analysis.
We have, for one thing, to use a language, and our language is entirely made of preconceived ideas and has to be so. However, these are unconscious preconceived ideas, which are a million times more dangerous than the other ones. It may be asserted that if we include other preconceived ideas, consciously stated, we would only aggravate the evil! I do not believe so: I rather maintain that they would balance one another.
Henri Poincaré, 1902
6. Choosing a coherent system of inference
Bayesian data analysis has three stages: formulating a model, fitting the model to data, and checking the model fit.
The second step—inference—gets most of the attention,but the procedure as a whole is not automatic
Gelman, BA, 2009
To force inference into a decision-theoretic mold allows for a clarification of the way inferential tools should be evaluated, and therefore implies a conscious (although subjective) choice of the retained optimality.
Logical inference process: start with the requested properties, i.e., a loss function and a prior distribution, then derive the best solution satisfying these properties.
6. Choosing a coherent system of inference (2)
Asymptopia annoys Bayesians.Davidson, 2009
Asymptotics [for inference] sounds like a proxy for not completely specifying the model, and thus for using another model, while asymptotics [for simulation] is quite acceptable. Bayesian inference does not escape asymptotic difficulties, see e.g. mixtures.
NP bootstrap aims at inference with no[t enough] modelling, while P Bayesian bootstrap is essentially using the Bayesian predictive
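To make the contrast concrete, here is a minimal sketch (added here, not from the slides) of Rubin's Bayesian bootstrap: posterior draws of the mean are obtained by reweighting the observed data with Dirichlet(1, ..., 1) weights, generated as normalised Exponential(1) variables, i.e., by drawing from a predictive supported on the observed points.

```python
import random

def bayesian_bootstrap_mean(data, n_draws=2000, seed=1):
    """Rubin's Bayesian bootstrap: posterior draws of the mean, obtained
    by reweighting the observed data with Dirichlet(1, ..., 1) weights
    (generated here as normalised Exponential(1) variables)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        w = [rng.expovariate(1.0) for _ in data]
        total = sum(w)
        draws.append(sum(wi * xi for wi, xi in zip(w, data)) / total)
    return draws

post = bayesian_bootstrap_mean([1.0, 2.0, 3.0, 4.0])
```

The smooth Dirichlet weights replace the multinomial resampling of the frequentist bootstrap, which is exactly the "using the Bayesian predictive" reading above.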
7. Looking for optimal frequentist procedures
At intermediate levels of a Bayesian model, frequency properties typically take care of themselves. It is typically only at the top level of unreplicated parameters that we have to worry
Gelman, BA, 2009
Bayesian inference widely intersects with the three notions of minimaxity, admissibility and equivariance (Haar). Looking for an optimal estimator most often ends up finding a Bayes estimator.
Optimality is easier to attain through the Bayes “filter”
8. Solving the actual problem
Frequentist methods have coverage guarantees; Bayesian methods don’t. In science, coverage matters
Wasserman, BA, 2009
Frequentist methods are justified on a long-term basis, i.e., from the statistician’s viewpoint. From a decision-maker’s point of view, only the problem at hand matters! That is, he/she calls for an inference conditional on x.
9. Providing a universal system of inference
Bayesian methods are presented as an automatic inference engine
Gelman, BA, 2009
Given the three factors
(X, f(x|θ)), (Θ, π(θ)), (D, L(θ, d)),
the Bayesian approach validates one and only one inferential procedure
10. Computing procedures as a minimization problem
The discussion of computational issues should not be allowed to obscure the need for further analysis of inferential questions
Bernardo, BA, 2009
Bayesian procedures are easier to compute than procedures of alternative theories, in the sense that there exists a universal method for the computation of Bayes estimators
Convergence assessment is an issue, but recent developments in adaptive MCMC allow for more confidence in the output
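The "universal method" alluded to is Monte Carlo, and in particular MCMC. The sketch below (illustrative, with arbitrary tuning values) shows the random-walk Metropolis algorithm, which needs only the un-normalised log-posterior and hence applies to essentially any model.

```python
import math
import random

def metropolis(log_post, x0, n_iter=5000, scale=1.0, seed=0):
    """Random-walk Metropolis: the 'universal' recipe only needs the
    un-normalised log-posterior, so it applies to essentially any model."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    chain = []
    for _ in range(n_iter):
        prop = x + rng.gauss(0.0, scale)
        lp_prop = log_post(prop)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = prop, lp_prop
        chain.append(x)
    return chain

# Toy target: a standard Normal posterior; the Bayes estimate under
# squared loss is then approximated by the chain average after burn-in.
chain = metropolis(lambda t: -0.5 * t * t, x0=0.0)
estimate = sum(chain[1000:]) / len(chain[1000:])
```

Adaptive MCMC refines this by tuning `scale` on the fly, which is where the convergence-assessment caveat above comes in.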