Rimini discussion
Post on 10-May-2015
The 21st Bayesian Century
“The 21st Century belongs to Bayes”, as argued by a discussion on Bayesian testing and Bayesian model choice
Christian P. Robert
Université Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
http://xianblog.wordpress.com
July 1, 2009
A consequence of Bayesian statistics being given a proper name is that it encourages too much historical deference
from people who think that the bibles of Jeffreys, de Finetti, Jaynes, and others have all the answers.
—Gelman, Bayesian Analysis 3(3), 2008
Outline
Anyone not shocked by the Bayesian theory of inference has not understood it
Senn, BA, 2008
Introduction
Tests and model choice
Bayesian Calculations
A Defense of the Bayesian Choice
Introduction
Vocabulary and concepts
Bayesian inference is a coherent mathematical theory but I don’t trust it in scientific applications.
Gelman, BA, 2008
Introduction
  Models
  The Bayesian framework
  Improper prior distributions
  Noninformative prior distributions
Tests and model choice
Bayesian Calculations
A Defense of the Bayesian Choice
Models
Parametric model
Bayesians promote the idea that a multiplicity of parameters can be handled via hierarchical, typically exchangeable, models, but it seems
implausible that this could really work automatically [instead of] giving reasonable answers using minimal assumptions.
Gelman, BA, 2008
Observations x1, . . . , xn generated from a probability distribution
fi(xi|θi, x1, . . . , xi−1) = fi(xi|θi, x1:i−1)
x = (x1, . . . , xn) ∼ f(x|θ), θ = (θ1, . . . , θn)
Associated likelihood
ℓ(θ|x) = f(x|θ)
[inverted density & starting point]
And what about [Bayesian] nonparametrics?! Equally very active and definitely very 21st, thank you, but not mentioned in this talk!
7th Workshop on Bayesian Nonparametrics - Collegio... http://bnpworkshop.carloalberto.org/
21 - 25 June 2009, Moncalieri
The 7th Workshop on Bayesian Nonparametrics will be held at the Collegio Carlo Alberto from June 21 to 25, 2009. The Collegio is a Research Institution housed in a historical building located in Moncalieri on the outskirts of Turin, Italy.
The meeting will feature the latest developments in the area and will cover a wide variety of both theoretical and applied topics such as: foundations of the Bayesian nonparametric approach, construction and properties of prior distributions, asymptotics, interplay with probability theory and stochastic processes, statistical modelling, computational algorithms and applications in machine learning, biostatistics, bioinformatics, economics and econometrics.
The Workshop will be structured in 4 tutorials on special topics, a series of invited talks and contributed poster sessions.
The Bayesian framework
Bayes theorem 101
Bayes theorem = Inversion of probabilities
If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A) are related by
P(A|E) = P(E|A)P(A) / [P(E|A)P(A) + P(E|Ac)P(Ac)] = P(E|A)P(A) / P(E)
[Thomas Bayes (?)]
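This inversion is immediate to code; as a sanity check, here is a minimal sketch (the function name and the illustrative numbers, a 1% prevalence event with a 5% false-positive rate, are ours, not the talk's):

```python
def bayes_posterior(p_E_given_A, p_A, p_E_given_not_A):
    """Invert P(E|A) into P(A|E) via Bayes' theorem."""
    # law of total probability for the denominator P(E)
    p_E = p_E_given_A * p_A + p_E_given_not_A * (1 - p_A)
    return p_E_given_A * p_A / p_E

# illustrative numbers: P(A) = 0.01, P(E|A) = 0.95, P(E|A^c) = 0.05
posterior = bayes_posterior(0.95, 0.01, 0.05)  # ≈ 0.161
```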
Bayesian approach
The impact of treating x as a fixed constant is to increase statistical power as an artefact
Templeton, Molec. Ecol., 2009
New perspective
◮ Uncertainty on the parameters θ of a model modeled through a probability distribution π on Θ, called prior distribution
◮ Inference based on the distribution of θ conditional on x, π(θ|x), called posterior distribution
π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ
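A minimal numerical sketch of this update (our own illustration, not from the slides): a N(0, 1) prior combined with one observation x from N(θ, 1), where the grid posterior mean can be checked against the conjugate answer x/2.

```python
import numpy as np

theta = np.linspace(-8.0, 8.0, 4001)      # grid over Θ
dt = theta[1] - theta[0]
prior = np.exp(-theta**2 / 2)             # π(θ) ∝ N(0,1), unnormalised
x = 1.5
lik = np.exp(-(x - theta)**2 / 2)         # f(x|θ) ∝ N(θ,1)

post = prior * lik                        # numerator of Bayes' formula
post /= post.sum() * dt                   # divide by ∫ f(x|θ)π(θ) dθ
post_mean = (theta * post).sum() * dt     # conjugate value is x/2 = 0.75
```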
[Nonphilosophical] justifications
Ignoring the sampling error of x undermines the statistical validity of all inferences made by the method
Templeton, Molec. Ecol., 2009
◮ Semantic drift from unknown to random
◮ Actualization of the information on θ by extracting theinformation on θ contained in the observation x
◮ Allows incorporation of imperfect information in the decisionprocess
◮ Unique mathematical way to condition upon the observations(conditional perspective)
◮ Unique way to give meaning to statements like P(θ > 0)
Posterior distribution
Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience
Gelman, BA, 2008
π(θ|x) central to Bayesian inference
◮ Operates conditional upon the observations
◮ Incorporates the requirement of the Likelihood Principle
◮ Avoids averaging over the unobserved values of x
◮ Coherent updating of the information available on θ
◮ Provides a complete inferential machinery
Improper prior distributions
Improper distributions
If we take P (dσ) ∝ dσ as a statement that σ may have any valuebetween 0 and ∞ (...), we must use ∞ instead of 1 to denote certainty.
Jeffreys, ToP, 1939
Necessary extension from a prior distribution to a prior σ-finite measure π such that
∫Θ π(θ) dθ = +∞
Improper prior distribution
[Weird? Inappropriate?? report!!]
Justifications
If the parameter may have any value from −∞ to +∞,its prior probability should be taken as uniformly distributed
Jeffreys, ToP, 1939
Automated prior determination often leads to improper priors
1. Similar performances of estimators derived from these generalized distributions
2. Improper priors as limits of proper distributions in many [mathematical] senses
More justifications
There is no good objective principle for choosing a noninformative prior(even if that concept were mathematically defined, which it is not)
Gelman, BA, 2008
4. Robust answer against possible misspecifications of the prior
5. Frequentist justifications, such as:
   (i) minimaxity
   (ii) admissibility
   (iii) invariance (Haar measure)
6. Improper priors [much] preferred to vague proper priors like N(0, 10^6)
Validation
The mistake is to think of them as representing ignorance
Lindley, JASA, 1990
Extension of the posterior distribution π(θ|x) associated with an improper prior π as given by Bayes’s formula
π(θ|x) = f(x|θ)π(θ) / ∫Θ f(x|θ)π(θ) dθ,
when
∫Θ f(x|θ)π(θ) dθ < ∞
Delete all emotional names
Noninformative prior distributions
Noninformative priors
...cannot be expected to represent exactly total ignorance about the problem, but should rather be taken as reference priors, upon which everyone could fall back when the prior information is missing.
Kass and Wasserman, JASA, 1996
What if all we know is that we know “nothing”?!
In the absence of prior information, prior distributions are solely derived from the sample distribution f(x|θ).
Difficulty with uniform priors, which lack invariance properties.
Jeffreys’ prior
If we took the prior density for the parameters to be proportional to |I(θ)|1/2, it could be stated for any law that is differentiable with respect to all parameters that the total probability in any region of the θi would be equal to the total probability in the corresponding region of the θ′i
Jeffreys, ToP, 1939
Based on Fisher information
I(θ) = Eθ[ (∂ℓ/∂θT)(∂ℓ/∂θ) ]
Jeffreys’ prior distribution is
π∗(θ) ∝ |I(θ)|1/2
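For a concrete instance (our illustration): for a Bernoulli model f(x|θ) = θ^x (1−θ)^(1−x), the information is I(θ) = 1/θ(1−θ), so Jeffreys’ prior is ∝ θ^(−1/2)(1−θ)^(−1/2), i.e. a Beta(1/2, 1/2). The function names below are hypothetical.

```python
def fisher_info_bernoulli(theta):
    """I(θ) = E_θ[(∂ log f/∂θ)²] for a single Bernoulli(θ) draw."""
    score = lambda x: x / theta - (1 - x) / (1 - theta)   # ∂ log f/∂θ
    # expectation over x ∈ {0, 1}
    return theta * score(1)**2 + (1 - theta) * score(0)**2

def jeffreys_unnormalised(theta):
    """π*(θ) ∝ |I(θ)|^{1/2}, here 1/sqrt(θ(1-θ))."""
    return fisher_info_bernoulli(theta) ** 0.5
```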
Tests and model choice
The Jeffreys-subjective synthesis betrays a much more dangerous confusion than the Neyman-Pearson-Fisher synthesis as regards hypothesis tests
Senn, BA, 2008
Introduction
Tests and model choice
  Bayesian tests
  Bayes factors
  Opposition to classical tests
  Model choice
  Compatible priors
  Variable selection
Bayesian Calculations
Bayesian tests
Construction of Bayes tests
What is almost never used, however, is the Jeffreys significance test.
Senn, BA, 2008
Definition (Test)
Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a statistical model, a test is a statistical procedure that takes its values in {0, 1}.
Example (Normal mean)
For x ∼ N (θ, 1), decide whether or not θ ≤ 0.
Decision-theoretic perspective
Loss functions [are] not relevant to statistical inference
Gelman, BA, 2008
Theorem (Optimal Bayes decision)
Under the 0−1 loss function
L(θ, d) = 0 if d = IΘ0(θ), a0 if d = 1 and θ ∉ Θ0, a1 if d = 0 and θ ∈ Θ0,
the Bayes procedure is
δπ(x) = 1 if Prπ(θ ∈ Θ0|x) ≥ a0/(a0 + a1), 0 otherwise
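A sketch of this rule for the normal-mean example (the N(0, 1) prior is our choice, under which θ|x ∼ N(x/2, 1/2); function names are hypothetical):

```python
from math import erf, sqrt

def prob_H0(x):
    """P(θ ≤ 0 | x) for x ~ N(θ,1) with prior θ ~ N(0,1): posterior is N(x/2, 1/2)."""
    m, v = x / 2, 0.5
    return 0.5 * (1 + erf((0 - m) / sqrt(2 * v)))   # normal cdf at 0

def bayes_test(x, a0=1.0, a1=1.0):
    """Return 1 (accept H0: θ ≤ 0) iff Pr(θ ∈ Θ0 | x) ≥ a0/(a0 + a1)."""
    return 1 if prob_H0(x) >= a0 / (a0 + a1) else 0
```

With symmetric losses a0 = a1 the rule reduces to accepting H0 when its posterior probability exceeds 1/2.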
Bayes factors
A function of posterior probabilities
The method posits two or more alternative hypotheses and tests theirrelative fits to some observed statistics
Templeton, Mol. Ecol., 2009
Definition (Bayes factors)
For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0,
B01 = [π(Θ0|x) / π(Θc0|x)] / [π(Θ0) / π(Θc0)] = ∫Θ0 f(x|θ)π0(θ) dθ / ∫Θc0 f(x|θ)π1(θ) dθ
[Good, 1958 & Jeffreys, 1961]
Self-contained concept
Having a high relative probability does not mean that a hypothesis is true or supported by the data
Templeton, Mol. Ecol., 2009
Non-decision-theoretic:
◮ eliminates choice of π(Θ0)
◮ Bayesian/marginal equivalent to the likelihood ratio
◮ Jeffreys’ scale of evidence:
  ◮ if log10(Bπ10) between 0 and 0.5, evidence against H0 weak,
  ◮ if log10(Bπ10) between 0.5 and 1, evidence substantial,
  ◮ if log10(Bπ10) between 1 and 2, evidence strong, and
  ◮ if log10(Bπ10) above 2, evidence decisive
A major modification
Considering whether a location parameter α is 0. The prior is uniformand we should have to take f(α) = 0 and B10 would always be infinite
Jeffreys, ToP, 1939
When the null hypothesis is supported by a set of measure 0, π(Θ0) = 0 and thus π(Θ0|x) = 0.
[End of the story?!]
Changing the prior to fit the hypotheses
Requirement
Defined prior distributions under both assumptions,
π0(θ) ∝ π(θ)IΘ0(θ), π1(θ) ∝ π(θ)IΘ1(θ),
(under the standard dominating measures on Θ0 and Θ1)
Using the prior probabilities π(Θ0) = ρ0 and π(Θ1) = 1 − ρ0,
π(θ) = ρ0 π0(θ) + (1 − ρ0) π1(θ).
Point null hypotheses
I have no patience for statistical methods that assign positive probabilityto point hypotheses of the θ = 0 type that can never actually be true
Gelman, BA, 2008
Take ρ0 = Prπ(θ = θ0) and g1 prior density under Ha. Then
π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)]
and Bayes factor
Bπ01(x) = [f(x|θ0)ρ0 / m1(x)(1 − ρ0)] / [ρ0 / (1 − ρ0)] = f(x|θ0) / m1(x)
Point null hypotheses (cont’d)
Example (Normal mean)
Test of H0 : θ = 0 when x ∼ N(θ, σ2): we take π1 as N(0, τ2), so that
m1(x)/f(x|0) = √(σ2/(σ2 + τ2)) exp{τ2x2 / 2σ2(σ2 + τ2)}
The posterior probabilities of H0 below are for τ = σ and samples of size n (z denoting the standardized statistic √n x̄/σ):

n\z   0      0.68   1.28   1.96
1     0.586  0.557  0.484  0.351
10    0.768  0.729  0.612  0.366
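The entries of this table can be reproduced directly; a sketch, taking σ = 1 and ρ0 = 1/2 (the n = 10 row corresponds to replacing σ² by σ²/n and x by x̄ = z/√n):

```python
from math import exp, sqrt

def prob_null(x, tau, sigma=1.0, rho0=0.5):
    """π(Θ0 | x) for H0: θ = 0 against θ ~ N(0, τ²), via the ratio m1(x)/f(x|0)."""
    ratio = sqrt(sigma**2 / (sigma**2 + tau**2)) * \
        exp(tau**2 * x**2 / (2 * sigma**2 * (sigma**2 + tau**2)))
    return 1.0 / (1.0 + (1 - rho0) / rho0 * ratio)
```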
Opposition to classical tests
Comparison with classical tests
The 95 percent frequentist intervals will live up to their advertised coverage claims
Wasserman, BA, 2008
Standard answer
Definition (p-value)
The p-value p(x) associated with a test is the largest significance level for which H0 is rejected
Problems with p-values
The use of P implies that a hypothesis that may be true may be rejected because it had not predicted observable results that have not occurred
Jeffreys, ToP, 1939
◮ Evaluation of the wrong quantity, namely the probability to exceed the observed quantity (wrong conditioning)
◮ Evaluation only under the null hypothesis
◮ Huge numerical difference with the Bayesian range of answers
Bayesian lower bounds
If the Bayes estimator has good frequency behavior then we might as well use the frequentist method. If it has bad frequency behavior then we shouldn’t use it.
Wasserman, BA, 2008
Least favourable Bayesian answer is
B(x, GA) = inf g∈GA f(x|θ0) / ∫Θ f(x|θ)g(θ) dθ,
i.e., if there exists a mle for θ, θ̂(x),
B(x, GA) = f(x|θ0) / f(x|θ̂(x))
Illustration
Example (Normal case)
When x ∼ N(θ, 1) and H0 : θ = 0, the lower bounds are
B(x, GA) = e^{−x²/2} and P(x, GA) = (1 + e^{x²/2})^{−1},
i.e.

p-value  0.10   0.05   0.01   0.001
P        0.205  0.128  0.035  0.004
B        0.256  0.146  0.036  0.004
[Quite different!]
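These numbers can be reproduced in a few lines (a sketch; x is recovered as the two-sided normal quantile attaining the given p-value):

```python
from math import exp
from statistics import NormalDist

def lower_bounds(p):
    """Bayes-factor and posterior-probability lower bounds at two-sided p-value p."""
    x = NormalDist().inv_cdf(1 - p / 2)   # |x| attaining the p-value
    B = exp(-x**2 / 2)                    # B(x, G_A) = e^{-x²/2}
    P = 1 / (1 + exp(x**2 / 2))           # P(x, G_A) = (1 + e^{x²/2})^{-1}
    return B, P
```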
Model choice
Model choice and model comparison
There is no null hypothesis, which complicates the computation of sampling error
Templeton, Mol. Ecol., 2009
Choice among models
Several models available for the same observation(s)
Mi : x ∼ fi(x|θi), i ∈ I
where I can be finite or infinite
Bayesian resolution
The posterior probabilities are constructed by using a numerator that is a function of the observation for a particular model, then divided by a denominator that ensures that the “probabilities” sum to one
Templeton, Mol. Ecol., 2009
Probabilise the entire model/parameter space
◮ allocate probabilities pi to all models Mi
◮ define priors πi(θi) for each parameter space Θi
◮ compute
π(Mi|x) = pi ∫Θi fi(x|θi)πi(θi) dθi / Σj pj ∫Θj fj(x|θj)πj(θj) dθj
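A small sketch of this computation (our toy setting, not from the slides: two normal-mean models differing only in prior scale, with marginal likelihoods computed by a Riemann sum):

```python
import numpy as np

def marginal(x, tau, half_width=60.0, n=120001):
    """∫ f(x|θ) π(θ) dθ for x ~ N(θ,1) and θ ~ N(0, τ²), by Riemann sum."""
    t = np.linspace(-half_width, half_width, n)
    dt = t[1] - t[0]
    f = np.exp(-(x - t)**2 / 2) / np.sqrt(2 * np.pi)          # likelihood
    pi = np.exp(-t**2 / (2 * tau**2)) / (tau * np.sqrt(2 * np.pi))  # prior
    return float((f * pi).sum() * dt)

def model_posterior(x, taus, p):
    """π(M_i|x) ∝ p_i ∫ f_i(x|θ_i) π_i(θ_i) dθ_i."""
    w = np.array([pi * marginal(x, tau) for pi, tau in zip(p, taus)])
    return w / w.sum()
```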
Bayesian resolution (2)
The numerators are not co-measurable across hypotheses, and the denominators are sums of non-co-measurable entities. This means that it is mathematically impossible for them to be probabilities.
Templeton, Mol. Ecol., 2009
◮ take largest π(Mi|x) to determine “best” model, or use averaged predictive
Σj π(Mj|x) ∫Θj fj(x′|θj)πj(θj|x) dθj
Natural Ockham’s razor
Pluralitas non est ponenda sine necessitate [plurality should not be posited without necessity]
Variation is random until the contrary is shown; and new parameters in laws, when they are suggested, must be tested one at a time, unless there is specific reason to the contrary.
Jeffreys, ToP, 1939
The Bayesian approach naturally weights differently models withdifferent parameter dimensions (BIC).
Compatible priors
Compatibility principle
Further complicating dimensionality of test statistics is the fact that the models are often not nested, and one model may contain parameters that do not have analogues in the other models and vice versa
Templeton, Mol. Ecol., 2009
Difficulty of finding priors simultaneously on a collection of models
Easier to start from a single prior on a “big” [encompassing] model and to derive the others from a coherence principle
[Dawid & Lauritzen, 2000]
An illustration for linear regression
In the case M1 and M2 are two nested Gaussian linear regression models with Zellner’s g-priors and the same variance σ2 ∼ π(σ2):
◮ M1 : y|β1, σ2 ∼ N(X1β1, σ2In) with
β1|σ2 ∼ N(s1, σ2 n1 (XT1 X1)−1)
where X1 is a (n × k1) matrix of rank k1 ≤ n
◮ M2 : y|β2, σ2 ∼ N(X2β2, σ2In) with
β2|σ2 ∼ N(s2, σ2 n2 (XT2 X2)−1),
where X2 is a (n × k2) matrix with span(X2) ⊆ span(X1)
[© Marin & Robert, Bayesian Core]
Compatible g-priors
I don’t see any role for squared error loss, minimax, or the rest of what is sometimes called statistical decision theory
Gelman, BA, 2008
Since σ2 is a nuisance parameter, minimize the Kullback-Leibler divergence between both marginal distributions conditional on σ2, m1(y|σ2; s1, n1) and m2(y|σ2; s2, n2), with solution
β2|X2, σ2 ∼ N(s∗2, σ2 n∗2 (XT2 X2)−1)
with
s∗2 = (XT2 X2)−1 XT2 X1 s1,  n∗2 = n1
Variable selection
Variable selection
Regression setup where y is regressed on a set {x1, . . . , xp} of p potential explanatory regressors (plus intercept)
Corresponding 2p submodels Mγ, where γ ∈ Γ = {0, 1}p indicates inclusion/exclusion of variables by a binary representation, e.g. γ = 101001011 means that x1, x3, x6, x8 and x9 are included.
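The binary coding is straightforward to manipulate; a small helper (ours, not from Bayesian Core):

```python
def included_variables(gamma):
    """1-based indices of the regressors retained by the binary vector γ."""
    return [i + 1 for i, g in enumerate(gamma) if g == "1"]
```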
Notations
For model Mγ ,
◮ qγ variables included
◮ t1(γ) = {t1,1(γ), . . . , t1,qγ(γ)} indices of those variables and t0(γ) indices of the variables not included
◮ For β ∈ Rp+1,
βt1(γ) = [β0, βt1,1(γ), . . . , βt1,qγ(γ)]
Xt1(γ) = [1n | xt1,1(γ) | . . . | xt1,qγ(γ)].
Submodel Mγ is thus
y|β, γ, σ2 ∼ N(Xt1(γ) βt1(γ), σ2 In)
Global and compatible priors
Use Zellner’s g-prior, i.e. a normal prior for β conditional on σ2,
β|σ2 ∼ N(β, cσ2(XTX)−1)
and a Jeffreys prior for σ2,
π(σ2) ∝ σ−2
Resulting compatible prior
βt1(γ) ∼ N((XTt1(γ) Xt1(γ))−1 XTt1(γ) X β, cσ2 (XTt1(γ) Xt1(γ))−1)
Posterior model probability
Can be obtained in closed form:
π(γ|y) ∝ (c + 1)−(qγ+1)/2 [ yTy − c yTP1y/(c + 1) + βTXTP1Xβ/(c + 1) − 2 yTP1Xβ/(c + 1) ]−n/2.
Conditionally on γ, the posterior distributions of β and σ2 are
βt1(γ)|σ2, y, γ ∼ N[ c/(c + 1) (U1y + U1Xβ/c), σ2c/(c + 1) (XTt1(γ) Xt1(γ))−1 ],
σ2|y, γ ∼ IG[ n/2, yTy/2 − c yTP1y/2(c + 1) + βTXTP1Xβ/2(c + 1) − yTP1Xβ/(c + 1) ].
Noninformative case
Use the same compatible informative g-prior distribution with β = 0p+1 and a hierarchical diffuse prior distribution on c,
π(c) ∝ c−1 IN∗(c)  or  π(c) ∝ c−1 Ic>0
The choice of this hierarchical diffuse prior distribution on c is due to the posterior sensitivity of the model to large values of c: taking β = 0p+1 and c large does not work
Processionary caterpillar
Influence of some forest settlement characteristics on the development of caterpillar colonies
Response y: log-transform of the average number of nests of caterpillars per tree on an area of 500 square meters (n = 33 areas)
[© Marin & Robert, Bayesian Core]
Processionary caterpillar (cont’d)
Potential explanatory variables
x1 altitude (in meters), x2 slope (in degrees),
x3 number of pines in the square,
x4 height (in meters) of the tree at the center of the square,
x5 diameter of the tree at the center of the square,
x6 index of the settlement density,
x7 orientation of the square (from 1 if southbound to 2 otherwise),
x8 height (in meters) of the dominant tree,
x9 number of vegetation strata,
x10 mix settlement index (from 1 if not mixed to 2 if mixed).
Bayesian regression output

             Estimate   BF       log10(BF)
(Intercept)   9.2714   26.334     1.4205 (***)
X1           -0.0037    7.0839    0.8502 (**)
X2           -0.0454    3.6850    0.5664 (**)
X3            0.0573    0.4356   -0.3609
X4           -1.0905    2.8314    0.4520 (*)
X5            0.1953    2.5157    0.4007 (*)
X6           -0.3008    0.3621   -0.4412
X7           -0.2002    0.3627   -0.4404
X8            0.1526    0.4589   -0.3383
X9           -1.0835    0.9069   -0.0424
X10          -0.3651    0.4132   -0.3838

evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor
Bayesian variable selection
t1(γ)            π(γ|y, X)
0,1,2,4,5        0.0929
0,1,2,4,5,9      0.0325
0,1,2,4,5,10     0.0295
0,1,2,4,5,7      0.0231
0,1,2,4,5,8      0.0228
0,1,2,4,5,6      0.0228
0,1,2,3,4,5      0.0224
0,1,2,3,4,5,9    0.0167
0,1,2,4,5,6,9    0.0167
0,1,2,4,5,8,9    0.0137
Noninformative G-prior model choice
Bayesian Calculations
Bayesian methods seem to quickly move to elaborate computation
Gelman, BA, 2008
Introduction
Tests and model choice
Bayesian Calculations
  Implementation difficulties
  Bayes factor approximation
  ABC model choice
A Defense of the Bayesian Choice
Implementation difficulties
◮ Computing the posterior distribution
π(θ|x) ∝ π(θ)f(x|θ)
◮ Resolution of
arg minδ ∫Θ L(θ, δ)π(θ)f(x|θ) dθ
◮ Maximisation of the marginal posterior
arg max ∫Θ−1 π(θ|x) dθ−1
Further implementation difficulties
A statistical test returns a probability value, but rarely is the probability value per se the reason for an investigator performing the test
Templeton, Mol. Ecol., 2009
◮ Computing posterior quantities
δπ(x) = ∫Θ h(θ) π(θ|x) dθ = ∫Θ h(θ) π(θ)f(x|θ) dθ / ∫Θ π(θ)f(x|θ) dθ
◮ Resolution (in k) of
P(π(θ|x) ≥ k | x) = α
Monte Carlo methods
Bayesian simulation seems stuck in an infinite regress of inferential uncertainty
Gelman, BA, 2008
Approximation of
I = ∫Θ g(θ)f(x|θ)π(θ) dθ
takes advantage of the fact that f(x|θ)π(θ) is proportional to a density: if the θi’s are from π(θ),
(1/m) Σi=1..m g(θi)f(x|θi)
converges (almost surely) to I
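A quick check of this convergence (our toy choice: g ≡ 1, x ∼ N(θ, 1), π = N(0, 1), so that I is the marginal density m(x) = N(x; 0, 2)):

```python
import math
import random

random.seed(1)
x, m = 0.0, 200_000
acc = 0.0
for _ in range(m):
    th = random.gauss(0.0, 1.0)                                # θ_i ~ π
    acc += math.exp(-(x - th)**2 / 2) / math.sqrt(2 * math.pi) # g(θ_i) f(x|θ_i), g ≡ 1
mc_estimate = acc / m   # → ∫ f(x|θ)π(θ) dθ = 1/sqrt(4π) ≈ 0.2821 here
```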
Importance function
A simulation method of inference hides unrealistic assumptions
Templeton, Mol. Ecol., 2009
No need to simulate from π(·|x) or from π: if h is a probability density,
∫Θ g(θ)f(x|θ)π(θ) dθ = ∫ [g(θ)f(x|θ)π(θ)/h(θ)] h(θ) dθ
and, for θi ∼ h,
Σi=1..m g(θi)ω(θi) / Σi=1..m ω(θi)  with  ω(θi) = f(x|θi)π(θi)/h(θi)
approximates Eπ[g(θ)|x]
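A self-normalized sketch of this estimator (our toy: posterior mean of θ for x ∼ N(θ, 1) with π = N(0, 1), instrumental h = N(0, 2²); the exact answer is x/2):

```python
import math
import random

random.seed(2)
x, n = 1.0, 200_000
num = den = 0.0
for _ in range(n):
    th = random.gauss(0.0, 2.0)                                # θ_i ~ h
    h = math.exp(-th**2 / 8) / (2 * math.sqrt(2 * math.pi))    # h(θ_i)
    w = math.exp(-(x - th)**2 / 2) * math.exp(-th**2 / 2) / h  # ω = f·π/h (unnormalised)
    num += th * w
    den += w
is_estimate = num / den   # self-normalized IS estimate of E[θ | x] = x/2
```

Note that h has heavier tails than the integrand, so the weights ω have finite variance.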
Bayes factor approximation
When approximating the Bayes factor
B12 = ∫Θ1 f1(x|θ1)π1(θ1) dθ1 / ∫Θ2 f2(x|θ2)π2(θ2) dθ2 = Z1/Z2,
use importance functions ϖ1 and ϖ2 and
B̂12 = [n1−1 Σi=1..n1 f1(x|θi1)π1(θi1)/ϖ1(θi1)] / [n2−1 Σi=1..n2 f2(x|θi2)π2(θi2)/ϖ2(θi2)],  θij ∼ ϖj(θ)
[Chopin & Robert, 2007]
Bridge sampling
Special case: if
π1(θ|x) ∝ π̃1(θ|x), π2(θ|x) ∝ π̃2(θ|x)
live on the same space (Θ1 = Θ2), then
B12 ≈ (1/n) Σi=1..n π̃1(θi|x)/π̃2(θi|x),  θi ∼ π2(θ|x)
[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
(Further) bridge sampling
In addition, for any α(·),
B12 = ∫ π̃1(θ|x)α(θ)π2(θ|x) dθ / ∫ π̃2(θ|x)α(θ)π1(θ|x) dθ
≈ [ (1/n2) Σi=1..n2 π̃1(θ2i|x)α(θ2i) ] / [ (1/n1) Σi=1..n1 π̃2(θ1i|x)α(θ1i) ],  θji ∼ πj(θ|x)
Optimal bridge sampling
The optimal choice of auxiliary function is
α⋆(θ) = (n1 + n2) / (n1 π1(θ|x) + n2 π2(θ|x))
leading to
B12 ≈ [ (1/n2) Σi=1..n2 π̃1(θ2i|x) / (n1π1(θ2i|x) + n2π2(θ2i|x)) ] / [ (1/n1) Σi=1..n1 π̃2(θ1i|x) / (n1π1(θ1i|x) + n2π2(θ1i|x)) ]
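A sketch of the bridge identity on a toy pair where the answer is known (everything here, including sample sizes, is our own illustration: two unit-variance normal “posteriors” whose unnormalised densities π̃1, π̃2 share the same normalising constant, so B12 = 1):

```python
import math
import random

random.seed(3)
n1 = n2 = 50_000
pdf = lambda t, mu: math.exp(-(t - mu)**2 / 2) / math.sqrt(2 * math.pi)  # normalised π_j
ut = lambda t, mu: math.exp(-(t - mu)**2 / 2)                            # unnormalised π̃_j

th1 = [random.gauss(0.0, 1.0) for _ in range(n1)]   # θ_{1i} ~ π_1
th2 = [random.gauss(1.0, 1.0) for _ in range(n2)]   # θ_{2i} ~ π_2
alpha = lambda t: (n1 + n2) / (n1 * pdf(t, 0.0) + n2 * pdf(t, 1.0))      # α*(θ)

num = sum(ut(t, 0.0) * alpha(t) for t in th2) / n2   # estimates ∫ π̃1 α π2 dθ
den = sum(ut(t, 1.0) * alpha(t) for t in th1) / n1   # estimates ∫ π̃2 α π1 dθ
bridge_B12 = num / den                               # → Z1/Z2 = 1 here
```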
Approximating Zk from a posterior sample
Use of the [harmonic mean] identity
Eπk[ ϕ(θk) / πk(θk)Lk(θk) | x ] = ∫ [ϕ(θk) / πk(θk)Lk(θk)] [πk(θk)Lk(θk) / Zk] dθk = 1/Zk,
no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]
Direct exploitation of the MCMC output
Comparison with regular importance sampling
Harmonic mean: constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation
Ẑ1k = 1 / [ (1/T) Σt=1..T ϕ(θk(t)) / πk(θk(t))Lk(θk(t)) ]
to have a finite variance.
E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
Approximating Z using a mixture representation
Bridge sampling redux
Design a specific mixture for simulation [importance sampling] purposes, with density
ϕk(θk) ∝ ω1 πk(θk)Lk(θk) + ϕ(θk),
where ϕ(·) is arbitrary (but normalised)
Note: ω1 is not a probability weight
Approximating Z using a mixture representation (cont’d)
Corresponding MCMC (=Gibbs) sampler
At iteration t
1. Take δ(t) = 1 with probability
ω1 πk(θk(t−1))Lk(θk(t−1)) / ( ω1 πk(θk(t−1))Lk(θk(t−1)) + ϕ(θk(t−1)) )
and δ(t) = 2 otherwise;
2. If δ(t) = 1, generate θk(t) ∼ MCMC(θk(t−1), θk) where MCMC(θk, θ′k) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk);
3. If δ(t) = 2, generate θk(t) ∼ ϕ(θk) independently
Evidence approximation by mixtures
Rao-Blackwellised estimate
ξ̂ = (1/T) Σt=1..T ω1 πk(θk(t))Lk(θk(t)) / ( ω1 πk(θk(t))Lk(θk(t)) + ϕ(θk(t)) )
converges to ω1Zk / (ω1Zk + 1).
Deduce Ẑk from ω1Ẑk/(ω1Ẑk + 1) = ξ̂, i.e.
Ẑk = (1/ω1) [ Σt=1..T ω1 πk(θk(t))Lk(θk(t)) / (ω1 πk(θk(t))Lk(θk(t)) + ϕ(θk(t))) ] / [ Σt=1..T ϕ(θk(t)) / (ω1 πk(θk(t))Lk(θk(t)) + ϕ(θk(t))) ]
[Bridge sampler]
Chib’s representation
Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),
Zk = mk(x) = fk(x|θk)πk(θk) / πk(θk|x)
Use of an approximation π̂k to the posterior:
Ẑk = m̂k(x) = fk(x|θ∗k)πk(θ∗k) / π̂k(θ∗k|x).
Case of latent variables
For missing variable z as in mixture models, natural Rao-Blackwell estimate
π̂k(θ∗k|x) = (1/T) Σt=1..T πk(θ∗k|x, zk(t)),
where the zk(t)’s are Gibbs sampled latent variables
ABC model choice
Approximate Bayesian Computation
Simulation target is π(θ)f(x|θ) with likelihood f(x|θ) not in closed form.
Likelihood-free rejection technique:
ABC algorithm
For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating
θ′ ∼ π(θ), x ∼ f(x|θ′),
until the auxiliary variable x is equal to the observed value, x = y.
[Pritchard et al., 1999]
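For discrete data the exact-matching version is immediately implementable; a sketch (our toy: x a Binomial(10, θ) count with uniform prior, so the accepted θ′ are draws from the Beta(y+1, n−y+1) posterior):

```python
import random

random.seed(4)
n_trials, y_obs = 10, 7
accepted = []
for _ in range(200_000):
    th = random.random()                                    # θ' ~ π = U(0,1)
    x = sum(random.random() < th for _ in range(n_trials))  # x ~ f(·|θ')
    if x == y_obs:                                          # keep only exact matches
        accepted.append(th)
abc_mean = sum(accepted) / len(accepted)   # ≈ Beta(8, 4) mean = 2/3
```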
A as approximative
When y is a continuous random variable, equality x = y is replaced with a tolerance condition,
ρ(x, y) ≤ ε
where ρ is a distance between summary statistics.
Output distributed from
π(θ)Pθ{ρ(x, y) < ε} ∝ π(θ | ρ(x, y) < ε)
Gibbs random fields
Gibbs distribution
The rv y = (y1, . . . , yn) is a Gibbs random field associated with the graph G if
f(y) = (1/Z) exp{ −Σc∈C Vc(yc) },
where Z is the normalising constant, C is the set of cliques of G and Vc is any function, also called potential;
U(y) = Σc∈C Vc(yc) is the energy function
Z is usually unavailable in closed form
Potts model
Vc(y) is of the form
Vc(y) = θS(y) = θ Σl∼i δyl=yi
where l∼i denotes a neighbourhood structure
In most realistic settings, the summation
Zθ = Σx∈X exp{θTS(x)}
involves too many terms to be manageable and numerical approximations cannot always be trusted
[Cucala, Marin, CPR & Titterington, JASA, 2009]
Neighbourhood relations
Choice to be made between M neighbourhood relations
i ∼m i′ (0 ≤ m ≤ M − 1)
with
Sm(x) = Σi∼m i′ I{xi=xi′}
driven by the posterior probabilities of the models.
Model index
Formalisation via a model index M, a new parameter with prior distribution π(M = m) and π(θ|M = m) = πm(θm)
Computational target:
P(M = m|x) ∝ ∫Θm fm(x|θm)πm(θm) dθm π(M = m)
Sufficient statistics
If S(x) is a sufficient statistic for the joint parameters (M, θ0, . . . , θM−1), then
P(M = m|x) = P(M = m|S(x)).
For each model m, a sufficient statistic Sm(·) makes S(·) = (S0(·), . . . , SM−1(·)) also sufficient.
For Gibbs random fields,
x|M = m ∼ fm(x|θm) = f1m(x|S(x)) f2m(S(x)|θm) = (1/n(S(x))) f2m(S(x)|θm)
where
n(S(x)) = ♯{x̃ ∈ X : S(x̃) = S(x)}
S(x) is thus also sufficient for the joint parameters [specific to Gibbs random fields!]
ABC model choice algorithm (ABC-MC)
◮ Generate m∗ from the prior π(M = m).
◮ Generate θ∗m∗ from the prior πm∗(·).
◮ Generate x∗ from the model fm∗(·|θ∗m∗).
◮ Compute the distance ρ(S(x0), S(x∗)).
◮ Accept (θ∗m∗, m∗) if ρ(S(x0), S(x∗)) < ε.
[Cornuet, Grelaud, Marin & Robert, BA, 2008]
Note: when ε = 0 the algorithm is exact
Toy example
iid Bernoulli model versus two-state first-order Markov chain, i.e.
f0(x|θ0) = exp(θ0 Σi=1..n I{xi=1}) / {1 + exp(θ0)}n,
versus
f1(x|θ1) = (1/2) exp(θ1 Σi=2..n I{xi=xi−1}) / {1 + exp(θ1)}n−1,
with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by “phase transition” boundaries).
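The ABC-MC scheme for this toy example can be sketched as follows (the tolerance, the number of proposals and the L1 distance on the summary pair (S0, S1) are our own choices):

```python
import math
import random

random.seed(5)
n = 100

def sim_iid(theta0):
    """Sample from f0: iid Bernoulli with logit parameter θ0."""
    p = math.exp(theta0) / (1 + math.exp(theta0))
    return [1 if random.random() < p else 0 for _ in range(n)]

def sim_markov(theta1):
    """Sample from f1: two-state chain, P(x_i = x_{i-1}) parametrised by θ1."""
    q = math.exp(theta1) / (1 + math.exp(theta1))
    x = [random.randint(0, 1)]
    for _ in range(n - 1):
        x.append(x[-1] if random.random() < q else 1 - x[-1])
    return x

def stats(x):
    """Joint sufficient summaries (S0, S1) of the two models."""
    return sum(x), sum(x[i] == x[i - 1] for i in range(1, n))

x0 = sim_markov(3.0)          # pseudo-observed data from the Markov model
s0 = stats(x0)

kept = []
for _ in range(100_000):
    m = random.randint(0, 1)                                   # m* ~ π(M), uniform
    x = sim_iid(random.uniform(-5, 5)) if m == 0 else sim_markov(random.uniform(0, 6))
    s = stats(x)
    if abs(s[0] - s0[0]) + abs(s[1] - s0[1]) <= 5:             # ρ(S(x0), S(x*)) ≤ ε
        kept.append(m)
p_markov = sum(kept) / len(kept)   # ABC estimate of P(M = 1 | x0)
```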
Toy example (2)
(left) Comparison of the true BFm0/m1(x0) with its ABC approximation B̂Fm0/m1(x0) (in logs) over 2,000 simulations and 4·10^6 proposals from the prior. (right) Same when using a tolerance ε corresponding to the 1% quantile of the distances.
A Defense of the Bayesian Choice
Given the advances in practical Bayesian methods in the past two decades, anti-Bayesianism is no longer a serious option
Gelman, BA, 2009
Bayesians are of course their own worst enemies. They make non-Bayesians accuse them of religious fervour, and an unwillingness to see another point of view.
Davidson, 2009
1. Choosing a probabilistic representation
Bayesian statistics is about making probability statements
Gelman, BA, 2009
Bayesian Statistics appears as the calculus of uncertainty
Reminder: a probabilistic model is nothing but an interpretation of a given phenomenon
What is the meaning of RD’s t test example?!
1. Choosing a probabilistic representation (2)
Inference is impossible.
Davidson, 2009
The Bahadur–Savage problem stems from the inability to make choices about the shape of a statistical model, not from an impossibility to draw [Bayesian] inference.
Further, a probability distribution is more than the sum of its moments. Ill-posed problems thus highlight issues with the model, not with the inference.
2. Conditioning on the data
Bayesian data analysis is a method for summarizing uncertainty and making estimates and predictions using probability statements conditional on observed data and an assumed model
Gelman, BA, 2009
At the basis of statistical inference lies an inversion process between cause and effect. Using a prior distribution brings a necessary balance between observations and parameters and enables one to operate conditional upon x
What is the data in RD’s t test example?! U’s? Y’s?
3. Exhibiting the true likelihood
Frequentist statistics is an approach for evaluating statistical proceduresconditional on some family of posited probability models
Gelman, BA, 2009
Provides a complete quantitative inference on the parameters and predictives that points out inadequacies of frequentist statistics, while implementing the Likelihood Principle.
There needs to be a true likelihood, including in non-parametric settings
[Rousseau, Van der Vaart]
4. Using priors as tools and summaries
Bayesian techniques allow prior beliefs to be tested and discarded as appropriate
Gelman, BA, 2009
The choice of a prior distribution π does not require any kind of belief in this distribution: rather, consider it as a tool that summarizes the available prior information and the uncertainty surrounding this information
Non-identifiability is an issue in that the prior may strongly impact inference about the identifiable parts
4. Using priors as tools and summaries (2)
No uninformative prior exists for such models.Davidson, 2009
Reference priors can be deduced from the sampling distribution by an automated procedure, based on a minimal information principle that maximises the information brought by the data.
Important literature on prior modelling for non-parametric problems, incl. smoothness constraints.
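One concrete instance of such an automated construction is Jeffreys' prior, proportional to the square root of the Fisher information. The sketch below is an illustration added here (not from the slides): for a Bernoulli(θ) model, I(θ) = 1/(θ(1−θ)) and the Jeffreys prior reduces to the Beta(1/2, 1/2) density.

```python
import math

def jeffreys_bernoulli(theta):
    """Jeffreys prior for Bernoulli(theta): pi(theta) proportional to
    sqrt(I(theta)), with Fisher information I(theta) = 1/(theta*(1-theta)).
    Once normalised, this is exactly the Beta(1/2, 1/2) density."""
    return 1.0 / (math.pi * math.sqrt(theta * (1.0 - theta)))
```

The construction uses only the sampling distribution, with no input of subjective beliefs, which is the sense in which it answers the "no uninformative prior exists" objection.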
5. Accepting the subjective basis of knowledge
Knowledge is a critical confrontation between a prioris and experiments. Ignoring these a prioris impoverishes the analysis.
We have, for one thing, to use a language, and our language is entirely made of preconceived ideas and has to be so. However, these are unconscious preconceived ideas, which are a million times more dangerous than the other ones. It may be asserted that if we include other preconceived ideas, consciously stated, we would only aggravate the evil! I do not believe so: I rather maintain that they would balance one another.
Henri Poincaré, 1902
6. Choosing a coherent system of inference
Bayesian data analysis has three stages: formulating a model, fitting the model to data, and checking the model fit.
The second step—inference—gets most of the attention,but the procedure as a whole is not automatic
Gelman, BA, 2009
To force inference into a decision-theoretic mold allows for a clarification of the way inferential tools should be evaluated, and therefore implies a conscious (although subjective) choice of the retained optimality.
Logical inference process: start with the requested properties, i.e., a loss function and a prior distribution, then derive the best solution satisfying these properties.
6. Choosing a coherent system of inference (2)
Asymptopia annoys Bayesians.Davidson, 2009
Asymptotics [for inference] sounds like a proxy for not completely specifying the model, and thus for using another model, while asymptotics [for simulation] is quite acceptable. Bayesian inference does not escape asymptotic difficulties, see e.g. mixtures.
NP bootstrap aims at inference with no[t enough] modelling, while P Bayesian bootstrap is essentially using the Bayesian predictive
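To make the contrast concrete, here is a minimal sketch (added here, not from the slides) of Rubin's Bayesian bootstrap: posterior draws of the mean are obtained by reweighting the observed data with Dirichlet(1, ..., 1) weights, generated as normalised Exponential(1) variables, i.e., by drawing from a predictive supported on the observed points.

```python
import random

def bayesian_bootstrap_mean(data, n_draws=2000, seed=1):
    """Rubin's Bayesian bootstrap: posterior draws of the mean, obtained
    by reweighting the observed data with Dirichlet(1, ..., 1) weights
    (generated here as normalised Exponential(1) variables)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        w = [rng.expovariate(1.0) for _ in data]
        total = sum(w)
        draws.append(sum(wi * xi for wi, xi in zip(w, data)) / total)
    return draws

post = bayesian_bootstrap_mean([1.0, 2.0, 3.0, 4.0])
```

The smooth Dirichlet weights replace the multinomial resampling of the frequentist bootstrap, which is exactly the "using the Bayesian predictive" reading above.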
7. Looking for optimal frequentist procedures
At intermediate levels of a Bayesian model, frequency properties typically take care of themselves. It is typically only at the top level of unreplicated parameters that we have to worry
Gelman, BA, 2009
Bayesian inference widely intersects with the three notions of minimaxity, admissibility and equivariance (Haar). Looking for an optimal estimator most often ends up finding a Bayes estimator.
Optimality is easier to attain through the Bayes “filter”
8. Solving the actual problem
Frequentist methods have coverage guarantees; Bayesian methods don’t. In science, coverage matters
Wasserman, BA, 2009
Frequentist methods are justified on a long-term basis, i.e., from the statistician’s viewpoint. From a decision-maker’s point of view, only the problem at hand matters! That is, he/she calls for an inference conditional on x.
9. Providing a universal system of inference
Bayesian methods are presented as an automatic inference engine
Gelman, BA, 2009
Given the three factors
(X, f(x|θ)), (Θ, π(θ)), (D, L(θ, d)),
the Bayesian approach validates one and only one inferential procedure
10. Computing procedures as a minimization problem
The discussion of computational issues should not be allowed to obscure the need for further analysis of inferential questions
Bernardo, BA, 2009
Bayesian procedures are easier to compute than procedures of alternative theories, in the sense that there exists a universal method for the computation of Bayes estimators
Convergence assessment is an issue, but recent developments in adaptive MCMC allow for more confidence in the output
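The "universal method" alluded to is Monte Carlo, and in particular MCMC. The sketch below (illustrative, with arbitrary tuning values) shows the random-walk Metropolis algorithm, which needs only the un-normalised log-posterior and hence applies to essentially any model.

```python
import math
import random

def metropolis(log_post, x0, n_iter=5000, scale=1.0, seed=0):
    """Random-walk Metropolis: the 'universal' recipe only needs the
    un-normalised log-posterior, so it applies to essentially any model."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    chain = []
    for _ in range(n_iter):
        prop = x + rng.gauss(0.0, scale)
        lp_prop = log_post(prop)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = prop, lp_prop
        chain.append(x)
    return chain

# Toy target: a standard Normal posterior; the Bayes estimate under
# squared loss is then approximated by the chain average after burn-in.
chain = metropolis(lambda t: -0.5 * t * t, x0=0.0)
estimate = sum(chain[1000:]) / len(chain[1000:])
```

Adaptive MCMC refines this by tuning `scale` on the fly, which is where the convergence-assessment caveat above comes in.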