
Introduction to Bayesian Inference

Abel Rodríguez - UC, Santa Cruz

Congreso Latinoamericano de Estadística Bayesiana
July 1-4, 2015


Probability vs. Statistics
What this course is …

Probability: What is the number of heads in 6 tosses of a fair coin?

Statistics: If we observe 3 heads out of 6 tosses, what is the probability of heads?

In both cases the model is p(x | θ): in probability, θ is known and x is unknown, while in statistics, x is known and θ is unknown.


Approaches to statistical problems

Geometric, e.g., least squares, principal components.

Moment/kernel based (generalized method of moments, kernel density estimation).

Model-based ⇒ Full uncertainty quantification.

Frequentist.

Bayesian.

Fiducial.

Not an exhaustive or universally agreed list!


A toy example

Assume that we have measurements of the body temperature of n = 100 individuals and we want to characterize the variability in those measurements.

A simple, moment-based approach to characterizing that variability would be to use moment estimators, i.e., to assume that the mean of the underlying distribution is equal to the sample mean and the variance of the underlying distribution is equal to the sample variance.

Quantification of uncertainty comes through the CLT or similar results! Hard to generalize.


An alternative is a model-based approach. Assume a parametric model for the data-generation process:

yi | θ, σ2 ∼ N(θ, σ2), i = 1, . . . , n.

The log-likelihood function is constructed as

ℓ(θ, σ2) = Σ_{i=1}^n log p(yi | θ, σ2) = −(n/2) log 2π − (n/2) log σ2 − (1/(2σ2)) Σ_{i=1}^n (yi − θ)2.

In frequentist methods:

Point estimators can be obtained by maximizing the likelihood, (θ̂, σ̂2) = arg max_{θ,σ2} ℓ(θ, σ2).

Uncertainty estimates/hypothesis tests are obtained from ℓ(θ, σ2).

Bayesian methods also start with the likelihood function, but proceed in a different way.
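A minimal R sketch, using simulated (hypothetical) body temperatures, to evaluate this log-likelihood at its maximizers:

# simulate n = 100 body temperatures and evaluate the Gaussian log-likelihood
set.seed(1)
y = rnorm(100, mean = 37, sd = 0.5)
n = length(y)
loglik = function(theta, sigma2){
  -n/2*log(2*pi) - n/2*log(sigma2) - sum((y - theta)^2)/(2*sigma2)
}
theta.mle = mean(y)                   # closed-form maximizer for theta
sigma2.mle = mean((y - theta.mle)^2)  # closed-form maximizer for sigma2 (divides by n)
loglik(theta.mle, sigma2.mle)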


Outline

What does it mean to be a Bayesian?

Why should you be Bayesian?

What are the challenges of being Bayesian?


What does it mean to be a Bayesian?

1 Prior, posterior and predictive distributions.

2 Conjugate models.

3 Bayesian asymptotics.

4 Utility functions.

5 Hierarchical models.


How does Bayesian statistics work?

Bayesian statistics is a model-based approach ⇒ All information in the data is contained in the likelihood!

The vector of observations y = (y1, . . . , yn) is assumed to be generated by an unknown probability distribution P with density/probability mass function p.

We are going to assume that P is indexed by a finite-dimensional vector of parameters θ, e.g., P might correspond to a normal distribution with mean µ and variance σ2, so that θ = (µ, σ2). Extensions to nonparametric methods, although possible, are outside the scope of this course!

All models are wrong, but some models are useful!!


In Bayesian statistics we treat θ as a random variable, which is assigned a prior distribution Π with density/probability mass function π.

The prior π captures the uncertainty of the researcher about the value of θ before observing y.

Another way to think about the prior is as summarizing all information about θ that is external to the sample y.


The Bayesian approach to statistical inference

We can update our knowledge about θ using Bayes theorem to obtain a posterior distribution,

p(θ | y) = p(y | θ)π(θ) / ∫ p(y | θ)π(θ)dθ.

The posterior combines the information in the data with any other information external to it that has been encoded in the prior.

Note that we slightly abuse notation by using p to denote both p(y | θ) and p(θ | y)! This is very common in the literature, and I will do it extensively ...


Two other quantities that play a key role in Bayesian statistics are the prior and posterior predictive distributions:

The prior predictive distribution is just the marginal distribution of the observed data once the parameter is integrated out,

p(y) = ∫ p(y | θ)π(θ)dθ.

This is just the denominator of the posterior.

The posterior predictive distribution is the distribution of a new sample given the previous sample,

p(y∗ | y) = ∫ p(y∗ | θ, y)π(θ | y)dθ,

which often (but not always!) equals ∫ p(y∗ | θ)π(θ | y)dθ.


Example (estimating the mean of a normal): Assume y1, . . . , yn is an independent and identically distributed sample with yi | θ ∼ N(θ, 1), and assign θ a Gaussian prior, i.e., θ ∼ N(µ, τ2).

Our goal is to provide (Bayesian) point and interval estimates for the unknown parameter θ, as well as a Bayesian hypothesis test for H0 : θ = µ vs. Ha : θ ≠ µ and a prediction for a new observation y∗.


Example (estimating the mean of a normal, cont): Using Bayes theorem,

p(θ | y) = [ (1/(2π))^{n/2} exp{−(1/2) Σ_{i=1}^n (yi − θ)2} × (1/(2πτ2))^{1/2} exp{−(θ − µ)2/(2τ2)} ] / [ ∫_{−∞}^{∞} (1/(2π))^{n/2} exp{−(1/2) Σ_{i=1}^n (yi − θ)2} × (1/(2πτ2))^{1/2} exp{−(θ − µ)2/(2τ2)} dθ ].

Canceling terms in the numerator and denominator, this simplifies to

p(θ | y) = exp{−(1/2)(nθ2 − 2θnȳ + θ2/τ2 − 2θµ/τ2)} / ∫_{−∞}^{∞} exp{−(1/2)(nθ2 − 2θnȳ + θ2/τ2 − 2θµ/τ2)} dθ.

Note that the expression in the exponent of the numerator is a quadratic form in θ!!!


Example (estimating the mean of a normal, cont): Completing the square we get

p(θ | y) = [ exp{−(1/2)(n + 1/τ2)(θ − (nȳ + µ/τ2)/(n + 1/τ2))2} × exp{(1/2)(nȳ + µ/τ2)2/(n + 1/τ2)} ] / [ ∫_{−∞}^{∞} exp{−(1/2)(n + 1/τ2)(θ − (nȳ + µ/τ2)/(n + 1/τ2))2} × exp{(1/2)(nȳ + µ/τ2)2/(n + 1/τ2)} dθ ].

Note that the last term cancels out in the numerator and denominator, and that the integral in the denominator equals √(2π)(n + 1/τ2)^{−1/2}. Hence p(θ | y) corresponds to a Gaussian distribution with mean (nȳ + µ/τ2)/(n + 1/τ2) and variance (n + 1/τ2)^{−1}!


Example (estimating the mean of a normal, cont): Now that we have found the posterior distribution, we can use the posterior mean as a point estimator, i.e.,

θ̂ = E{θ | y} = (nȳ + µ/τ2)/(n + 1/τ2).

It is not difficult to show that this is a biased but consistent estimator of θ! Furthermore, if you let τ2 → ∞ then θ̂ → ȳ (the MLE).
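A minimal R sketch of these formulas, with simulated data and hypothetical prior values µ = 0 and τ2 = 10:

set.seed(1)
n = 50
y = rnorm(n, mean = 2, sd = 1)            # data with known sampling variance 1
mu = 0; tau2 = 10                         # hypothetical prior N(mu, tau2)
post.var = 1/(n + 1/tau2)
post.mean = (n*mean(y) + mu/tau2)*post.var
c(post.mean, mean(y))                     # the posterior mean shrinks slightly toward mu
rnorm(1, post.mean, sqrt(1 + post.var))   # a draw from the posterior predictive p(y* | y)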


Example (estimating the mean of a normal, cont): Similarly, interval estimates can be obtained by computing highest posterior density intervals (the shortest intervals that contain a given amount of posterior probability).

In our running example, and for an interval with 1 − α coverage, this reduces to

( (nȳ + µ/τ2)/(n + 1/τ2) − z_{α/2}(n + 1/τ2)^{−1/2} , (nȳ + µ/τ2)/(n + 1/τ2) + z_{α/2}(n + 1/τ2)^{−1/2} ).

Note that if τ2 → ∞, then this is identical to the standard confidence interval (but has a very different interpretation!).


Example (estimating the mean of a normal, cont): Now, to contrast the hypotheses H0 : θ = µ and H1 : θ ≠ µ we can compute the probability associated with each hypothesis:

The hypothesis H0 : θ = µ implicitly implies a prior that is a point mass at µ, π0(θ) = δµ(θ), so that

p(y1, . . . , yn | H0) = Π_{i=1}^n p(yi | θ = µ).

On the other hand, under H1 : θ ≠ µ we can use the same prior as before, π1(θ) = N(µ, τ2), in which case

p(y1, . . . , yn | H1) = ∫_{−∞}^{∞} Π_{i=1}^n p(yi | θ) π1(θ)dθ.


Example (estimating the mean of a normal, cont): This leads to

p(y1, . . . , yn | H0) = (1/(2π))^{n/2} exp{−(1/2) Σ_{i=1}^n (yi − µ)2}

and

p(y1, . . . , yn | H1) = (1/(2π))^{n/2} (1 + nτ2)^{−1/2} exp{−(1/2)[ Σ_{i=1}^n yi2 + µ2/τ2 − (nȳ + µ/τ2)2/(n + 1/τ2) ]}.


Example (estimating the mean of a normal, cont): Assuming that, a priori, P(H0) = P(H1) = 1/2, then using Bayes theorem again,

P(H0 | y1, . . . , yn) = 1/(1 + B10),

where B10 = p(y1, . . . , yn | H1)/p(y1, . . . , yn | H0), which reduces to

B10 = (1 + nτ2)^{−1/2} exp{−(1/2)[ Σ_{i=1}^n yi2 + µ2/τ2 − (nȳ + µ/τ2)2/(n + 1/τ2) − Σ_{i=1}^n (yi − µ)2 ]}.

B10 is called the Bayes factor of model 1 to model 0.


Example (estimating the mean of a normal, cont): Finally, suppose that we want to predict the value of a new observation y∗ arising from this distribution. It is natural to consider the posterior predictive distribution p(y∗ | y1, . . . , yn).

In this case, because convolutions of normals are again normal,

y∗ | y1, . . . , yn ∼ N( (nȳ + µ/τ2)/(n + 1/τ2) , 1 + (n + 1/τ2)^{−1} ).

There is a clear separation between uncertainty due to unknown parameters and uncertainty due to natural randomness in the data! ⇒ For large n this is just N(ȳ, 1)!


In the previous example we used reasonable but somewhat ad-hoc approaches to go from the posterior distribution to the answer to each of our questions.

These choices can be justified and generalized using decision theory ⇒ Start with a utility function U(θ, d) and get rid of the unknown θ by averaging over the posterior distribution.

The form of U(θ, d) depends on the statistical problem of interest:

Point estimation/Prediction.
Interval estimation/Interval prediction.
Hypothesis testing/Model comparison.
Experimental design.

More on this later.


In summary, Bayesian approaches to solving statistical inference problems involve two broad steps:

Compute a posterior distribution based on a likelihood function and a prior distribution that the modeler needs to elicit.

Compute an optimal action (e.g., point estimate, interval estimate, rejection rule, etc.) using the posterior distribution together with an appropriate utility function.

In the next few slides we discuss each of these two steps in detail.


Conjugate families

A family of priors C = {π(θ | η) : η ∈ H} is said to be conjugate to the likelihood p(y | θ) if, for any prior π(θ) that is a member of C, the associated posterior

p(θ | y) = p(y | θ)π(θ) / ∫ p(y | θ)π(θ)dθ

is also a member of C.

In other words, the family is closed under posterior updating for the likelihood p.


Example (Beta-Binomial): Consider an observation y | θ ∼ Bin(n, θ), where n is known and θ is unknown,

p(y | θ) = (n choose y) θ^y (1 − θ)^{n−y},  θ ∈ [0, 1],  y ∈ {0, 1, . . . , n},

and assign as the prior a member of the Beta family,

C = { π(θ | η1, η2) = [Γ(η1 + η2)/(Γ(η1)Γ(η2))] θ^{η1−1}(1 − θ)^{η2−1} : η1 > 0, η2 > 0 }.


Example (Beta-Binomial): Note that the corresponding posterior is equal to

p(θ | y) = [ (n choose y) θ^y(1 − θ)^{n−y} [Γ(η1 + η2)/(Γ(η1)Γ(η2))] θ^{η1−1}(1 − θ)^{η2−1} ] / [ ∫_0^1 (n choose y) θ^y(1 − θ)^{n−y} [Γ(η1 + η2)/(Γ(η1)Γ(η2))] θ^{η1−1}(1 − θ)^{η2−1} dθ ].

After a bit of algebra,

p(θ | y) = θ^{y+η1−1}(1 − θ)^{n−y+η2−1} / ∫_0^1 θ^{y+η1−1}(1 − θ)^{n−y+η2−1} dθ = [Γ(n + η1 + η2)/(Γ(y + η1)Γ(n − y + η2))] θ^{y+η1−1}(1 − θ)^{n−y+η2−1}.

So beta(η1, η2) ⇒ beta(y + η1, n − y + η2)!
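A minimal R sketch of this update, with hypothetical values y = 7, n = 20 and η1 = η2 = 2:

y = 7; n = 20
eta1 = 2; eta2 = 2
a = eta1 + y; b = eta2 + n - y     # posterior is Beta(a, b)
a/(a + b)                          # posterior mean
qbeta(c(0.025, 0.975), a, b)       # 95% equal-tail credible interval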


Example (Poisson-Gamma): Consider now a set of observations y1, . . . , yn such that yi | θ ∼ Poi(θ), where θ is unknown,

p(yi | θ) = θ^{yi} exp{−θ}/yi!,  θ > 0,  yi ∈ {0, 1, 2, . . .},

and assign as the prior a member of the Gamma family,

C = { π(θ | η1, η2) = [η2^{η1}/Γ(η1)] θ^{η1−1} exp{−η2θ} : η1 > 0, η2 > 0 }.


Example (Poisson-Gamma): The posterior is just

p(θ | y1, . . . , yn) ∝ θ^{Σ_{i=1}^n yi + η1 − 1} exp{−(n + η2)θ}

(note that I have just preserved the terms that include θ, and have not explicitly written the normalizing constant of the distribution).

Since this is just the kernel of another Gamma distribution,

θ ∼ Gam(η1, η2) ⇒ θ | y1, . . . , yn ∼ Gam(Σ_{i=1}^n yi + η1, n + η2).
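A minimal R sketch of the update, again with hypothetical counts and prior values:

y = c(3, 1, 4, 2, 5)
eta1 = 1; eta2 = 1
a = eta1 + sum(y); b = eta2 + length(y)  # posterior is Gam(a, b)
a/b                                      # posterior mean
qgamma(c(0.025, 0.975), a, b)            # 95% equal-tail credible interval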


Note that in the first example you have

E{θ | y} = [n/(n + η1 + η2)] (y/n) + [(η1 + η2)/(n + η1 + η2)] [η1/(η1 + η2)],

where y/n is the MLE and η1/(η1 + η2) is the prior mean.

Similarly, in the second example,

E{θ | y1, . . . , yn} = [n/(n + η2)] ȳ + [η2/(n + η2)] (η1/η2),

where ȳ is the MLE and η1/η2 is the prior mean.

A similar result holds in the Gaussian case (our motivating example).

Weighted average of MLE and prior mean ⇒ “Linear” Bayes.


For likelihoods in the natural exponential family,

p(y | θ) = exp{θT t(y) + h(y) + g(θ)},

priors of the form

π(θ) = exp{θT µ + νg(θ) + c(µ, ν)}

are conjugate, as they lead to posteriors of the form

p(θ | y) = exp{θT (µ + t(y)) + (ν + 1)g(θ) + c(µ + t(y), ν + 1)}.

The conjugate prior is also in the exponential family!

This is a very powerful result!


Let ξ(θ) = E{t(y) | θ}. The posterior mean of ξ(θ) is simply

E{ξ(θ) | y} = (µ + t(y))/(ν + 1).

Again, we get a weighted average!

This suggests interpreting µ as the prior location and ν as an equivalent sample size for the prior.

The prior predictive takes the form

p(y) = exp{h(y) + c(µ, ν) − c(µ + t(y), ν + 1)}.

The posterior predictive takes a similar form (assuming that observations are iid).


Furthermore, discrete mixtures of conjugate priors are also conjugate! In general,

π(θ) = Σ_{k=1}^K ωk π(θ | ηk) ⇒ p(θ | y) = Σ_{k=1}^K ω∗k p(θ | η∗k),

where

ω∗k = ωk ∫ p(y | θ)π(θ | ηk)dθ / Σ_{l=1}^K ωl ∫ p(y | θ)π(θ | ηl)dθ

and

p(θ | η∗k) = p(y | θ)π(θ | ηk) / ∫ p(y | θ)π(θ | ηk)dθ.
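A minimal R sketch for a Binomial likelihood with a two-component Beta mixture prior (all values hypothetical); here each ∫ p(y | θ)π(θ | ηk)dθ has the closed form (n choose y) B(y + η1k, n − y + η2k)/B(η1k, η2k):

y = 7; n = 20
w = c(0.5, 0.5)                    # prior mixture weights
e1 = c(2, 10); e2 = c(10, 2)       # Beta(e1[k], e2[k]) components
lm = lchoose(n, y) + lbeta(y + e1, n - y + e2) - lbeta(e1, e2)  # log marginals
w.star = w*exp(lm)/sum(w*exp(lm))  # updated mixture weights
w.star                             # posterior is a mixture of Beta(y + e1, n - y + e2)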


Conjugate families beyond the exponential family

Although most well-known examples of conjugate families arise as examples of the previous result, that is not always so!

For example, consider the case where y1, . . . , yn are iid such that yi ∼ Uni[0, θ] and θ ∼ Pareto(η1, η2).

The posterior is

p(θ | y1, . . . , yn) ∝ (1/θ)^n I(θ ≥ max_i{yi}) × (1/θ)^{η1+1} I(θ > η2) = (1/θ)^{n+η1+1} I(θ > max{η2, max_i{yi}}),

so that

θ ∼ Pareto(η1, η2) ⇒ θ | y1, . . . , yn ∼ Pareto(η1 + n, max{η2, max_i{yi}}).
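A minimal R sketch of this update (hypothetical data and prior values):

y = c(2.3, 4.1, 1.7, 3.9)
eta1 = 1; eta2 = 1
shape = eta1 + length(y)           # posterior Pareto parameters
scale = max(eta2, max(y))
shape*scale/(shape - 1)            # posterior mean, valid when shape > 1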


Beyond conjugate models

When you do not have conjugacy the process is the same, but you often cannot make much progress.

There are a few exceptions. A well-known one is the use of double exponential priors for the location parameter of a Gaussian likelihood,

p(y1, . . . , yn | θ) = (1/(2π))^{n/2} exp{−(1/2) Σ_{i=1}^n (yi − θ)2},

π(θ) = (1/(2τ)) exp{−|θ − µ|/τ}.


The posterior is in this case a two-component mixture of truncated normals,

p(θ | y1, . . . , yn) = [a(n, ȳ)/(a(n, ȳ) + b(n, ȳ))] TN(λ1(n, ȳ), 1/n, −∞, µ) I(θ < µ) + [b(n, ȳ)/(a(n, ȳ) + b(n, ȳ))] TN(λ2(n, ȳ), 1/n, µ, ∞) I(θ ≥ µ),

where

λ1(n, ȳ) = ȳ + 1/(nτ),  λ2(n, ȳ) = ȳ − 1/(nτ),

and

a(n, ȳ) = exp{−µ/τ + (n/2)λ1(n, ȳ)2} Φ({µ − λ1(n, ȳ)}√n),
b(n, ȳ) = exp{µ/τ + (n/2)λ2(n, ȳ)2} [1 − Φ({µ − λ2(n, ȳ)}√n)].


The posterior mean is

E{θ | y1, . . . , yn} = [a(n, ȳ)/(a(n, ȳ) + b(n, ȳ))] [λ1(n, ȳ) − (1/√n) φ(√n{µ − λ1(n, ȳ)})/Φ(√n{µ − λ1(n, ȳ)})] + [b(n, ȳ)/(a(n, ȳ) + b(n, ȳ))] [λ2(n, ȳ) + (1/√n) φ(√n{µ − λ2(n, ȳ)})/(1 − Φ(√n{µ − λ2(n, ȳ)}))],

where φ and Φ are the density and cdf of the standard normal distribution. Note that this is highly non-linear!

One key feature is that the prior has bounded influence! ⇒ The posterior distribution is robust to misspecification of the prior:

lim_{µ→∞} E{θ | y1, . . . , yn} = ȳ + 1/(nτ),  lim_{µ→−∞} E{θ | y1, . . . , yn} = ȳ − 1/(nτ).

(Note the difference with a Gaussian prior.)
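A minimal R sketch that checks the bounded-influence property numerically on a grid (ȳ, n, µ and τ are hypothetical values):

ybar = 1; n = 10; mu = 3; tau = 0.5
theta = seq(-10, 10, by = 1e-4)
logpost = -n/2*(theta - ybar)^2 - abs(theta - mu)/tau  # log-posterior up to a constant
w = exp(logpost - max(logpost)); w = w/sum(w)
sum(w*theta)   # posterior mean; stays within ybar +/- 1/(n*tau) no matter how extreme mu is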


Homework: Find a closed-form formula for the variance of the posterior distribution for the non-conjugate model discussed in the previous slides, and plot both the mean and the variance as a function of µ for various values of n and τ. In your graphs, assume that ȳ = 1.


Bayesian Asymptotics

If θ is continuous and the observations are iid, then under mild regularity conditions the posterior distribution is approximately normal,

θ | y ≈ N(θ̂(y), H^{−1}(θ̂(y))),

where θ̂(y) represents the maximum likelihood estimator of θ and H(θ̂(y)) is the observed information matrix.

This is sometimes called the “Bayesian CLT”.

Other variants are possible; all of them are based on Laplace approximations to the posterior.


Bayesian Asymptotics - Laplace expansions

Consider an iid univariate sequence y1, . . . , yn where yi | θ ∼ p(y | θ), and do a Taylor expansion of the log-likelihood around the MLE θ̂ (the first-order term vanishes because θ̂ maximizes the likelihood):

log ℓ(θ) ≈ log ℓ(θ̂) + (1/2) [∂2/∂θ2 log ℓ(θ)|_{θ=θ̂}] (θ − θ̂)2,

where −∂2/∂θ2 log ℓ(θ)|_{θ=θ̂} is the observed information.

For the prior, you have two options:

You can do a similar Taylor expansion of the prior.

Under regularity conditions, the likelihood concentrates as n grows, so the prior can be discarded.


This result has a few important consequences:

For large sample sizes, Bayesian and frequentist point and interval estimation procedures yield essentially the same results! (This is not true for hypothesis testing!)

For large sample sizes, the form of the prior does not matter much! (Again, this is not true for hypothesis testing!)

Most “reasonable” Bayesian point estimators are consistent (in a classical, frequentist sense).

For large sample sizes, the approximation can be used to easily construct point and interval estimates.

The quality of the approximation might depend on the parameterization.


Bayesian Asymptotics - The Discrete Case

Let y = (y1, . . . , yn) be a sample of size n from a model p(y | θ).

Assume now that θ ∈ Θ = {θ1, . . . , θK}, i.e., the parameter space is discrete, and that the true value of the parameter is θ∗ (which might or might not be a member of Θ).

Let k̂ be the index of the model that minimizes the Kullback-Leibler divergence between p(y | θ∗) and p(y | θk), i.e.,

k̂ = arg min_k ∫ p(y | θ∗) log{p(y | θ∗)/p(y | θk)} dy.

Then lim_{n→∞} p(θ_k̂ | y) = 1 and lim_{n→∞} p(θk | y) = 0 for k ≠ k̂.
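A minimal R simulation illustrating this concentration (hypothetical values throughout; the prior over Θ is uniform):

set.seed(1)
Theta = c(0.3, 0.5, 0.7)           # discrete parameter space
theta.star = 0.55                  # true value, not a member of Theta
y = rbinom(2000, 1, theta.star)    # iid Bernoulli data
logpost = sapply(Theta, function(th) sum(dbinom(y, 1, th, log = TRUE)))
w = exp(logpost - max(logpost))
w/sum(w)                           # essentially all mass on theta = 0.5, the KL-closest value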


Bayesian point estimation

Eliciting utility functions for estimation in specific problems can be hard.

Here we focus mostly on finding utility functions that justify the use of standard centrality measures:

Posterior mean.

Posterior median/posterior quantiles.

Posterior mode.

There is a connection with the literature on M-estimators (which goes back to Huber).


Squared error loss functions for estimation

A natural loss function widely used in practice is the squared error loss,

L(θ, d) = ‖d − θ‖2^2 = Σ_{k=1}^K (dk − θk)2,

where d is the value you use to estimate the unknown θ.

In the frequentist literature this loss leads to the MSE.

It is symmetric around θ, with losses that grow very quickly.

The optimal estimator is given by

θ̂ = arg min_d ∫ ‖d − θ‖2^2 p(θ | y)dθ.

The solution to this problem is simply θ̂ = E{θ | y}.


For the univariate case, note that

∫ (d − θ)2 p(θ | y)dθ = ∫ (d − E{θ | y} + E{θ | y} − θ)2 p(θ | y)dθ,

so that, since the cross term integrates to zero,

∫ (d − θ)2 p(θ | y)dθ = (d − E{θ | y})2 + ∫ (E{θ | y} − θ)2 p(θ | y)dθ.

The second term is constant with respect to d, and the first is minimized when d = E{θ | y}.


Under proper priors, the posterior mean is always a biased estimator. This is a simple proof by contradiction (for simplicity, take a single observation y with E{y | θ} = θ). On one hand, conditioning on θ,

E_{y,θ}{(y − θ)2} = Eθ{E_{y|θ}[y2 − 2θy + θ2]} = E(y2) − E(θ2).

On the other hand, if the posterior mean is an unbiased estimator, so that E{θ | y} = y, then conditioning instead on y,

E_{y,θ}{(y − θ)2} = Ey{E_{θ|y}[y2 − 2θy + θ2]} = −E(y2) + E(θ2).

Since both quantities have to be the same by definition, we have E{y2} = E{θ2}, and hence E_{y,θ}{(y − θ)2} = 0, but this can only happen in the uninteresting case when Pr(y = θ) = 1.


Also, unlike the MLE, the posterior mean is not invariant under transformations! Indeed, in general,

h(E{θ | y}) ≠ E{h(θ) | y}.

Even if the posterior distribution is in a well-known family, computing the posterior mean for complex functionals of the parameters in closed form can be quite difficult!

However, the posterior mean is the same no matter whether we are integrating over the marginal or over the joint distribution (invariance to marginalization)!


Absolute error loss

A natural alternative to the squared error loss is the absolute error loss,

L(θ, d) = ‖d − θ‖1 = Σ_{k=1}^K |dk − θk|,

and the optimal estimator is

θ̂ = arg min_d ∫ ‖d − θ‖1 p(θ | y)dθ.

Again, the loss is symmetric, but the penalty grows linearly rather than quadratically.


The optimal estimator under the absolute error loss is the posterior median of each component,

∫_{−∞}^{θ̂k} p(θk | y)dθk = 1/2.

Like the MLE, the posterior median is invariant under monotone transformations.

It is also invariant to marginalization.

You can easily generalize to asymmetric versions,

L(θ, d) = a(d − θ) if d ≥ θ,  (θ − d) if d < θ,

which leads to a quantile (here, the 1/(1 + a) posterior quantile) being the optimal estimator.
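A minimal R sketch of this fact, using an arbitrary Beta(3, 7) posterior and a = 3, so that the optimal estimator should be the 1/(1 + a) = 0.25 posterior quantile:

a = 3
risk = function(d){
  # expected asymmetric absolute loss under the Beta(3, 7) posterior
  integrate(function(t) (a*(d - t)*(d >= t) + (t - d)*(d < t))*dbeta(t, 3, 7), 0, 1)$value
}
optimize(risk, c(0, 1))$minimum    # numerical minimizer of the expected loss
qbeta(1/(1 + a), 3, 7)             # the 0.25 posterior quantile; the two agree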


0-1 Error loss

For discrete parameters, it is also natural to define a 0-1 utility function,

U(θ, d) = 1 if d = θ,  0 if d ≠ θ.

We get 1 if we choose the right value of the parameter, and 0 otherwise.

The optimal estimator in this case is the posterior mode,

θ̂ = arg max_θ p(θ | y).


The concept can be extended to continuous parameters, but we require a limiting argument. Start with

U∆(θ, d) = 1 if d ∈ B_{θ,∆},  0 if d ∉ B_{θ,∆},

where B_{θ,∆} is a ball of diameter ∆ centered around θ.

Then, define

θ̂∆ = arg max_d ∫ U∆(θ, d)p(θ | y)dθ,  θ̂ = lim_{∆→0} θ̂∆.

The posterior mode is invariant to monotone transformations but not invariant under marginalization.

This is the opposite of what happens with the posterior mean!


Bayesian interval estimation

In Bayesian statistics, credible sets are used for interval estimation.

For a continuous (possibly vector-valued) parameter, a 100(1 − α)% credible set Cα is a subset of the parameter space such that

∫_{Cα} p(θ | y)dθ = 1 − α.

There are many different possible sets for each α. How do we select among them?

What are appropriate utility functions over credible sets?


Highest posterior density intervals

One intuitively nice property of a credible set is that it be small!

Consider a loss function for Cα of the form

Lk(θ, Cα) = k I(θ ∉ Cα) + |Cα|,

where |Cα| is the size of Cα, subject to ∫_{Cα} p(θ | y)dθ = 1 − α.

The first term favors sets that contain the true parameter θ, while the second term favors smaller credible sets!

The optimal credible set is indeed the smallest subset of the parameter space that has posterior probability equal to 1 − α.

This is called the highest posterior density (HPD) set.


One consequence of the form of the utility function is that the density at the boundaries of the HPD set must be constant!

This suggests a graphical procedure for finding the HPD set:

Consider a plot of the posterior density.

Draw a “horizontal” line/plane that is going to slide up or down on the density axis.

Start sliding it down from +∞ until the area under the high-density parts of the curve is equal to or greater than 1 − α.

This is most easily illustrated in the univariate case.


[Figure: graphical construction of the HPD interval for a univariate posterior density.]


Note that the HPD set might involve disjoint sets when the posterior is multimodal!


HPD intervals can also contain the boundary of the support of the posterior, e.g., y | θ ∼ Bin(n, θ), θ ∼ Uni[0, 1], and y = 0 or y = n.

HPD intervals are easy to compute for univariate symmetric distributions, but they can be cumbersome in other cases.

This is particularly true for posterior distributions for which integrals cannot be easily computed in closed form!

What are alternatives that deal with the fact that the HPD interval might not be simply connected and might include points in the boundary of the support?


Bayesian interval estimation

In the continuous univariate context, let Cα = (aα, bα) and consider instead the utility function

Lα(θ, (aα, bα)) = (α/2)(aα − θ)I(θ < aα) + (1 − α/2)(θ − aα)I(θ ≥ aα) + (1 − α/2)(bα − θ)I(θ < bα) + (α/2)(θ − bα)I(θ ≥ bα),

subject to ∫_{Cα} p(θ | y)dθ = 1 − α.

This utility function leads to an equal-probability (equal-tail) credible interval where

∫_{−∞}^{aα} π(θ | y)dθ = ∫_{bα}^{∞} π(θ | y)dθ = α/2.

This is much easier to compute, and if the posterior is symmetric and unimodal it is equivalent to the HPD interval.


[Figure: equal-tail credible interval for a univariate posterior density.]


Hartigan (1966) showed that for standard posterior credible intervals, an interval [L(y), U(y)] with 100(1 − α)% Bayesian coverage will have

Pr(L(y) < θ < U(y) | θ) = (1 − α) + εn,

where |εn| < a/n for some constant a.

This is correct asymptotic frequentist coverage, just like most large sample frequentist methods!

Some people (e.g., Jim Berger) argue that following Bayesian procedures to derive credible intervals might be a better option than using asymptotic arguments to get “good” confidence intervals.


Example: Let Y | θ ∼ Bin(n, θ) and θ ∼ Uni[0, 1]. If the observed data correspond to y = 0, then the posterior distribution is θ | y ∼ beta(1, n + 1).

It is easy to see then that the 100 × (1 − α)% HPD interval is of the form (0, 1 − α^{1/(n+1)}) (to see this more clearly, construct a graph of the density and apply the graphical procedure we described before!).

On the other hand, the 100 × (1 − α)% symmetric interval is of the form (1 − {1 − α/2}^{1/(n+1)}, 1 − {α/2}^{1/(n+1)}).

For example, if α = 0.05 and n = 20 then the intervals are (0, 0.132946) and (0.001205, 0.161098), while if n = 200 they are (0, 0.014794) and (0.000126, 0.018185). Note that this is a situation where the vanilla confidence interval cannot be computed!
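These numbers can be checked with a minimal R sketch; since the Beta(1, n + 1) cdf is 1 − (1 − θ)^{n+1}, qbeta(1 − alpha, 1, n + 1) returns the same HPD upper endpoint:

alpha = 0.05; n = 20
c(0, 1 - alpha^(1/(n + 1)))                                  # HPD interval
c(1 - (1 - alpha/2)^(1/(n + 1)), 1 - (alpha/2)^(1/(n + 1)))  # symmetric interval
qbeta(1 - alpha, 1, n + 1)                                   # same upper HPD endpoint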


Example: Consider now the same Beta/Binomial model, but let n = 1000 and y = 345. The symmetric interval can be obtained in R by typing

alpha = 0.05
c(qbeta(alpha/2, 345+1, 1000-345+1), qbeta(1-alpha/2, 345+1, 1000-345+1))

leading to (0.316184, 0.375019). On the other hand, the following R code can be used to obtain the HPD interval

library(rootSolve)
# The HPD endpoints x[1] < x[2] satisfy two conditions: equal density at both
# endpoints, and 1 - alpha posterior probability between them.
ff = function(x, prm){
  ff1 = dbeta(x[1], prm[1], prm[2]) - dbeta(x[2], prm[1], prm[2])
  ff2 = pbeta(x[2], prm[1], prm[2]) - pbeta(x[1], prm[1], prm[2]) - (1 - prm[3])
  return(c(ff1, ff2))
}
a = 345 + 1
b = 1000 - 345 + 1
alpha = 0.05
# solve the two-equation system, starting from the equal-tail interval
ss <- multiroot(f = ff, start = c(qbeta(alpha/2, a, b), qbeta(1-alpha/2, a, b)),
                prm = c(a, b, alpha))
print(ss$root)

leading to (0.315981, 0.374811).


Example: Finally, using the Laplace approximation the credible interval is simply given by

alpha = 0.05
aa = 345
bb = 1000 - 345
# normal approximation centered at the MLE aa/(aa+bb)
c(qnorm(alpha/2, aa/(aa+bb), sqrt(aa*bb/((aa+bb+1)*(aa+bb)^2))),
  qnorm(1-alpha/2, aa/(aa+bb), sqrt(aa*bb/((aa+bb+1)*(aa+bb)^2))))

leading to (0.315552, 0.374448). A better approximation can usually be obtained by using the posterior mean and variance rather than the MLE and the information matrix

alpha = 0.05
aa = 345 + 1
bb = 1000 - 345 + 1
# normal approximation matching the Beta(aa, bb) posterior mean and variance
c(qnorm(alpha/2, aa/(aa+bb), sqrt(aa*bb/((aa+bb+1)*(aa+bb)^2))),
  qnorm(1-alpha/2, aa/(aa+bb), sqrt(aa*bb/((aa+bb+1)*(aa+bb)^2))))

leading to (0.315884, 0.374735).


Bayesian hypothesis testing

You can think about Bayesian hypothesis testing as an estimation problem on a discrete parameter space.

Consider a hierarchical model in which you need to specify p(y | θk, Mk), π(θk | Mk) and π(Mk) for k = 1, . . . , K. The key summary of interest is then p(Mk | y), which integrates over θk and therefore involves the prior predictive distribution.

In Bayesian hypothesis testing you are not limited to testing hypotheses in a pairwise fashion.

Because it is a discrete parameter space, a natural utility function is the 0-1 utility function we discussed before, which in this case leads to selecting the model with the highest posterior probability:

k̂ = arg max_k Pr(Mk | y).


Note that the posterior probability of model Mk can be written as

Pr(Mk | y) = p(y | Mk)π(Mk) / Σ_{l=1}^K p(y | Ml)π(Ml) = 1/(1 + Σ_{l≠k} O_{l,k}),

where O_{l,k} = p(Ml | y)/p(Mk | y) = [p(Ml)/p(Mk)] × [p(y | Ml)/p(y | Mk)] are the posterior odds, p(Ml)/p(Mk) are called the prior odds, and p(y | Ml)/p(y | Mk) is the Bayes factor.

Some people prefer to phrase hypothesis testing/model selection problems in terms of log odds rather than posterior probabilities, but they are equivalent.


Interpretation of the posterior probabilities/Bayes factors is not as straightforward as you would think.

The following table was proposed by Kass & Raftery (1995):

Odds        Posterior Probability   Interpretation
1 to 3      0.5 to 0.75             Not worth more than a bare mention
3 to 20     0.75 to 0.95            Positive
20 to 150   0.95 to 0.99            Strong
> 150       > 0.99                  Very strong

Unlike estimation problems, Bayesian and frequentist hypothesis testing methods can yield very different results.


Lindley’s paradox

Example: Imagine a certain city where 49,581 boys and 48,870 girls have been born over a certain time period. Assume that births are independent and identically distributed Bernoulli random variables, so that the total number of males Y is a Binomial random variable, Y | θ ∼ Bin(98451, θ).

We are interested in testing the hypotheses H0 : θ = 0.5 vs. H1 : θ ≠ 0.5. Using a normal approximation to the sampling distribution of the empirical proportion, we have

p-value = Pr( |Y − nθ|/√(nθ(1 − θ)) ≥ |49581 − 49225.5|/√24612.75 | θ = 0.5 ) ≈ 0.0235,

so we would clearly reject the null at the 5% significance level.


Example: On the other hand, a Bayesian analysis using a uniform prior for θ under H1 would lead to

p(y | H0) = (98451 choose 49581) (1/2)^{49581} (1 − 1/2)^{48870} ≈ 1.95 × 10^{−4}

and

p(y | H1) = ∫ (98451 choose 49581) θ^{49581}(1 − θ)^{48870} dθ ≈ 1.02 × 10^{−5}.

Assuming p(H0) = p(H1) = 1/2 a priori, this leads to p(H0 | y) ≈ 0.95, and so we would clearly favor H0.
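A minimal R sketch that reproduces these marginal likelihoods using lchoose and lbeta from base R:

y = 49581; n = 98451
lp0 = lchoose(n, y) + n*log(1/2)               # log p(y | H0)
lp1 = lchoose(n, y) + lbeta(y + 1, n - y + 1)  # log p(y | H1) under the uniform prior
c(exp(lp0), exp(lp1))                          # approx 1.95e-4 and 1.02e-5
exp(lp0)/(exp(lp0) + exp(lp1))                 # P(H0 | y) approx 0.95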


To understand Lindley’s paradox, note that the power of the frequentist test (at the 5% significance level) to detect the observed difference is

1 − Pr( −1.95996 < (Y − nθ)/√(nθ(1 − θ)) < 1.95996 | θ = 0.5036 ) ≈ 0.6248752.

On the other hand, the type I and type II error rates of the Bayesian test are in this case

Type I error = Pr(B01 < 1 | θ = 0.5) ≈ 0.004,
Power = Pr(B01 < 1 | θ = 0.5036) ≈ 0.268.


In this case we approximated the frequentist properties of the Bayes factor using the following code:

# log Bayes factor B01 of H0 : theta = theta0 against H1 : theta ~ Uni[0, 1]
# (the binomial coefficient cancels between the two marginal likelihoods)
log.bayes.factor.01 = function(y, n, theta0){
  lm0 = y*log(theta0) + (n-y)*log(1-theta0)  # log p(y | H0), up to choose(n, y)
  lm1 = lbeta(y+1, n-y+1)                    # log p(y | H1), up to choose(n, y)
  return(lm0 - lm1)
}
rejection.rate = function(rrr=10000, n=10000, theta=0.5, theta0=0.5){
  outcome = rep(0, rrr)
  for(i in 1:rrr){
    y = rbinom(1, n, theta)
    lbf = log.bayes.factor.01(y, n, theta0)
    outcome[i] = as.numeric(lbf < 0)         # reject H0 when B01 < 1
  }
  return(mean(outcome))
}
rejection.rate(rrr=10000, n=98451, theta=0.5, theta0=0.5)
rejection.rate(rrr=10000, n=98451, theta=0.5036, theta0=0.5)


Note that the type I error and the power of the Bayesian test are much smaller!

Indeed, recall that a frequentist test with fixed type I error will always tend to reject the null hypothesis for large n (the power goes to 1 as the sample size increases). The same is not true for a Bayesian test.

Homework: Construct curves for the type I error and the power of the frequentist and Bayesian tests as a function of the sample size n.


Bayesian hypothesis testing

When all hypotheses involved in the test are composite, the Bayes factors can be naturally reformulated in terms of posterior probabilities. This is better illustrated with an example.


Example: Let yi | θ ∼ N(θ, 1) for i = 1, . . . , n and consider testing H0 : θ ≤ θ0 vs. H1 : θ > θ0. To create a prior for θ under each of the hypotheses, it would be natural to start with a Gaussian prior for θ and restrict it to the appropriate interval, so that

π(θ | H0) = [1/Φ((θ0 − µ)/τ)] × (1/(√(2π)τ)) exp{−(θ − µ)2/(2τ2)} I(θ ≤ θ0),

π(θ | H1) = [1/(1 − Φ((θ0 − µ)/τ))] × (1/(√(2π)τ)) exp{−(θ − µ)2/(2τ2)} I(θ > θ0),

where Φ denotes the cumulative distribution function of the standard normal.


Example: Under such priors, the posterior odds O0,1 reduce to

O0,1 = [ε/(1 − ε)] × [(1 − Φ((θ0 − µ)/τ))/Φ((θ0 − µ)/τ)] × [ ∫_{−∞}^{θ0} exp{−(1/2)(n + 1/τ2)(θ − (nȳ + µ/τ2)/(n + 1/τ2))2} dθ / ∫_{θ0}^{∞} exp{−(1/2)(n + 1/τ2)(θ − (nȳ + µ/τ2)/(n + 1/τ2))2} dθ ],

where ε = π(H0) = 1 − π(H1). Note that if we take ε = Φ((θ0 − µ)/τ) then O0,1 = Pr(θ ≤ θ0 | y)/Pr(θ > θ0 | y), which would be a natural metric that can be computed directly from the posterior distribution based on a prior θ ∼ N(µ, τ2), without referring to marginal likelihoods.
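A minimal R sketch of this posterior-odds computation (data and prior values hypothetical):

set.seed(1)
n = 30
y = rnorm(n, 0.2, 1)                    # data with known sampling variance 1
theta0 = 0; mu = 0; tau2 = 1            # hypothetical threshold and prior values
m = (n*mean(y) + mu/tau2)/(n + 1/tau2)  # posterior mean
s = sqrt(1/(n + 1/tau2))                # posterior standard deviation
p0 = pnorm(theta0, m, s)                # Pr(theta <= theta0 | y)
p0/(1 - p0)                             # posterior odds O_{0,1}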


Example: Similarly, if instead we take µ = θ0 (a very natural choice because of symmetry), then Φ((θ0 − µ)/τ) = 1/2 and

O0,1 = [ε/(1 − ε)] × [ ∫_{−∞}^{θ0} exp{−(1/2)(n + 1/τ2)(θ − (nȳ + µ/τ2)/(n + 1/τ2))2} dθ / ∫_{θ0}^{∞} exp{−(1/2)(n + 1/τ2)(θ − (nȳ + µ/τ2)/(n + 1/τ2))2} dθ ],

so that O0,1 = [ε/(1 − ε)] × Pr(θ ≤ θ0 | y)/Pr(θ > θ0 | y). This example also suggests that prior elicitation can be a pretty tricky business: the prior assigned to each model is not only affected by ε, but also by µ and τ2.


Prediction

Prediction (both point and interval prediction) can be treated again as an estimation problem.

In particular, we can use utility functions similar to the ones discussed before to obtain optimal prediction rules.

The main difference is that computations are carried out with respect to the predictive distribution.


For example, for point prediction you could use

ŷ∗ = arg min_d ∫ (d − y∗)2 p(y∗ | y)dy∗,

where p(y∗ | y) = ∫ p(y∗ | θ, y)p(θ | y)dθ.

This leads to ŷ∗ = E{y∗ | y}.

In the case where multiple models are available, this utility function leads to the so-called Bayesian model averaging rule,

ŷ∗ = Σ_{k=1}^K E{y∗ | y, Mk} p(Mk | y).

A similar approach works for interval prediction.


Hierarchical Models

Nothing prevents us from treating the parameters of the prior distribution as unknown and assigning them further hyperpriors.

An example that will come up later has to do with estimating multiple means:

yi | θi ∼ N(θi, 1), i = 1, . . . , n,
θi | µ, τ2 ∼ N(µ, τ2), i = 1, . . . , n,

where µ and τ2 are unknown.

This is related to the James-Stein paradox.

Motivation: the famous data on batting averages from Efron & Morris.


Consider the batting averages of 18 baseball players during their first 45 turns at bat during the 1970 MLB season.

We are interested in predicting their true batting averages, which could in turn be used to predict how they will do during the rest of the season.

In this example we have:

yi is the observed batting average of the i-th batter.
θi is the true batting average of the i-th batter.

Hierarchical Models

A natural Bayesian solution is to add hyperpriors to get a model with three levels

yi | θi ∼ N(θi , 1), i = 1, . . . , n,

θi | µ, τ2 ∼ N(µ, τ2), i = 1, . . . , n,

µ, τ2 ∼ p(µ, τ2).

where n = 18.

Computation for this hierarchical model is hard because integrating over the posterior p(θ1, . . . , θn, µ, τ² | y1, . . . , yn) analytically is impossible ...

Hierarchical Models

A possible trick is to rewrite

p(θ1, . . . , θn, µ, τ² | y1, . . . , yn) = p(θ1, . . . , θn | µ, τ², y1, . . . , yn) p(µ, τ² | y1, . . . , yn)

Now, using iterated expectations,

$$\hat{\theta}_i = E\{\theta_i \mid y_1,\ldots,y_n\} = E_{\mu,\tau^2\mid y_1,\ldots,y_n}\left\{\frac{\tau^2}{1+\tau^2}\,y_i + \frac{1}{1+\tau^2}\,\mu\right\}$$

where p(µ, τ² | y1, . . . , yn) ∝ N(y | µ1, (1 + τ²)I) p(µ) p(τ²).

One possibility is to compute the two-dimensional integrals needed for computing the expectation using numerical integration!
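A minimal sketch of that grid-based computation, assuming flat hyperpriors p(µ, τ²) proportional to 1 and made-up data (both are illustrative assumptions, not choices from the slides):

# Grid approximation of E{theta_1 | y} in the two-level normal model.
# Assumes flat hyperpriors p(mu, tau2) proportional to 1 and made-up data.
set.seed(2)
y <- rnorm(18, mean = 0.27, sd = sqrt(2))   # hypothetical data, n = 18
mu_grid   <- seq(-1, 1.5, length.out = 100)
tau2_grid <- seq(0.01, 10, length.out = 100)
grid <- expand.grid(mu = mu_grid, tau2 = tau2_grid)
# Marginally, y_i | mu, tau2 ~ N(mu, 1 + tau2), independently.
loglik <- sapply(seq_len(nrow(grid)), function(k)
  sum(dnorm(y, grid$mu[k], sqrt(1 + grid$tau2[k]), log = TRUE)))
w <- exp(loglik - max(loglik)); w <- w / sum(w)   # posterior weights on the grid
# The posterior mean of theta_1 averages the conditional shrinkage estimate.
shrink1 <- grid$tau2 / (1 + grid$tau2) * y[1] + 1 / (1 + grid$tau2) * grid$mu
sum(w * shrink1)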

Hierarchical Models

A second alternative is to use empirical Bayes:

Still exploit the representation

p(θ1, . . . , θn, µ, τ² | y1, . . . , yn) = p(θ1, . . . , θn | µ, τ², y1, . . . , yn) p(µ, τ² | y1, . . . , yn)

However, now we first obtain "good" values for µ and τ² from p(µ, τ² | y1, . . . , yn) by using some simple estimates (MLE, moment-based, unbiased).

If p(µ, τ²) ∝ 1, then p(µ, τ² | y1, . . . , yn) ∝ p(y1, . . . , yn | µ, τ²), and

µ̂ = ȳ, τ̂² = s² − 1,

where s² is the sample variance.

Now use the posterior p(θ1, . . . , θn | µ̂, τ̂², y1, . . . , yn):

$$\hat{\theta}_i = E\left\{\theta_i \mid y_1,\ldots,y_n, \hat{\mu}, \hat{\tau}^2\right\} = \frac{\hat{\tau}^2}{1+\hat{\tau}^2}\,y_i + \frac{1}{1+\hat{\tau}^2}\,\hat{\mu}$$
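A minimal sketch of these empirical Bayes calculations, with made-up data standing in for the Efron & Morris averages (which are not reproduced here):

# Empirical Bayes shrinkage estimates for the multiple-means model.
# The data below are made up; they are not the Efron & Morris averages.
set.seed(3)
y <- rnorm(18, mean = 0.265, sd = sqrt(2))   # hypothetical y_i, Var(y_i) = 1 + tau2
mu_hat    <- mean(y)                         # moment estimate of mu
tau2_hat  <- max(var(y) - 1, 0)              # moment estimate, truncated at zero
theta_hat <- tau2_hat / (1 + tau2_hat) * y +
             1 / (1 + tau2_hat) * mu_hat     # shrinkage toward mu_hat
round(cbind(y, theta_hat), 3)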

Hierarchical Models

In this case µ̂ = ȳ ≈ 0.265 and 1/(1 + τ̂²) ≈ 0.212.

Hierarchical Models

We can use the average over the whole season to measure how well the James-Stein estimator does when compared against simple averages.

A natural metric is the mean squared error of the predictions.

In this case, the MSE of the predictions from the James-Stein estimator is 3.5 times smaller than that from simple averages!

Hierarchical Models

Some classical models can be seen (from a Bayesian perspective) as hierarchical models.

A classical example is the random effects model. For example, in a classical setting

yi,j | θi, σ² ∼ N(θi, σ²),
θi | µ, τ² ∼ N(µ, τ²).

A Bayesian would add priors

µ ∼ N(η, κ2),

(σ2, τ2) ∼ π(σ2, τ2).

turning it into a hierarchical model.

Why should you be a Bayesian?

1. Admissibility.
2. Exchangeability.
3. Likelihood principle.
4. Other reasons.

Evaluating estimators: Frequentist Risk

Start with a loss function

L(θ, h(y))

How much do you lose if you estimate the parameter θ using the statistic θ̂ = h(y)? E.g.,

L(θ, h(y)) = {θ − h(y)}²

What is the average of this loss over different samples for a given value of the parameter?

Ey|θ {L(θ, h(y))} = Rh(θ)

Note that, in contrast, the expectation in the expected loss ρ(·) (introduced below) is taken with respect to the posterior of θ, so that all knowledge of the problem is included in ρ(·). Moreover, ρ is a function of the value of the estimator h(x) and of the prior π. This means that we can compare the expected loss for varying estimators, whereas in the frequentist world we cannot rank R(θ, h) uniformly over θ.

Say you computed R(θ, h). However, we don't know θ! One possibility is to marginalize over θ with a prior π(θ). This defines the Bayesian risk r(h, π):

$$r(h, \pi) = E_\theta[R(\theta, h)] = \int R(\theta, h)\, \pi(\theta)\, d\theta.$$

For example, in the following figure, we can see that neither h1(x) nor h2(x) is optimal for all θ. However, if the prior is centered in the area where h2(x) is optimal, the Bayes risk can lend insight into the choice of estimator, as seen in Figure 1.

[Figure 1: Frequentist risk R(θ, h) for two estimators h1 and h2. Neither estimator is optimal for all θ, but a prior π(θ), drawn as a dotted line, suggests that h2 is a better choice in the context of Bayesian risk r(h, π).]

On the other hand, say you computed the optimal estimator from the expected loss function ρ. How well would this estimator do with more and more data? For that, we could examine Ex[ρ(h, π, x)], which is the expected value of the expected loss function with respect to the prior predictive distribution p(x). It turns out that this is the same thing as the Bayesian risk r(h, π). To see this, write

$$E_x[\rho(h, \pi, x)] = \int\!\!\int L(\theta, h(x))\, p(\theta \mid x)\, p(x)\, d\theta\, dx = \int\!\!\int L(\theta, h(x))\, p(\theta, x)\, d\theta\, dx = \int\!\!\int L(\theta, h(x))\, p(x \mid \theta)\, p(\theta)\, d\theta\, dx,$$

which, swapping the order of integration (by Fubini's theorem), equals

$$\int\!\!\int L(\theta, h(x))\, p(x \mid \theta)\, p(\theta)\, dx\, d\theta = E_\theta[R(\theta, h)].$$

In other words, the Bayes risk is the average loss with respect to the joint distribution p(x, θ). Consequently, if you find an estimator h(x) that minimizes ρ(·), it is also the optimal estimator for Bayes risk.

Evaluating estimators: Dominance and Bayesian Risk

Clearly, if Rh1(θ) ≤ Rh2(θ) for every value of θ then we should never use h2!

In that case we say that h1 dominates h2, or that h2 is inadmissible.

On the other hand, if neither h1 nor h2 dominates the other, how can this help in deciding which estimator is better?

Evaluating estimators: The Expected Loss

In principle Bayesians do not care about different samples: a procedure should do well for the sample at hand; future samples that you might or might not see are irrelevant!

As we discussed before, Bayesians prefer to focus on the average loss of the estimator over different values of the parameter for the observed sample:

Eθ|y {L(θ, h(y))} = ρh,π(y)

Is it possible to reconcile the two views?

From the point of view of the frequentist risk, if neither h1 nor h2 dominates the other we could try to make a decision if we knew which values of θ are more "important" ⇒ The prior distribution π(θ) gives exactly that information!

$$E_\theta\{R_h(\theta)\} = \int\!\!\int L(\theta, h(y))\, p(y \mid \theta)\, p(\theta)\, dy\, d\theta$$

From the point of view of the expected loss, we might still ask what estimator is on average better if we cared about repeated sampling:

$$E_y\{\rho_{h,\pi}(y)\} = \int\!\!\int L(\theta, h(y))\, p(\theta \mid y)\, p(y)\, d\theta\, dy$$

Note that this integral is with respect to p(y) and not p(y | θ)!!

Linking it all together!

Admissibility

Recall that h2 is inadmissible if Rh1(θ) ≤ Rh2(θ) for some h1 and every value of θ.

You would think most standard procedures are all admissible ⇒ You would be very, very wrong!

The most widely known counterexample is the so-called James-Stein paradox (which is not really a paradox!!!)

Admissibility

Let y be a single draw from a d-variate normal distribution with identity covariance matrix, i.e., y ∼ N(θ, I) with d ≥ 2.

Under squared error loss, the maximum likelihood estimator for θ, θ̂ = y, is inadmissible for d > 2!!!!

In particular, one estimator that dominates it is:

$$\hat{\theta}(y) = \left(1 - \frac{d-2}{y^T y}\right) y.$$
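A quick Monte Carlo check of this dominance result; the dimension and true mean below are arbitrary choices for illustration:

# Monte Carlo comparison of the MLE and the James-Stein estimator
# under squared error loss; d and theta are arbitrary choices.
set.seed(4)
d <- 10; B <- 10000
theta <- rep(1, d)                          # arbitrary true mean vector
mle_loss <- js_loss <- numeric(B)
for (b in 1:B) {
  y  <- rnorm(d, theta, 1)                  # one draw y ~ N(theta, I)
  js <- (1 - (d - 2) / sum(y^2)) * y        # James-Stein estimator
  mle_loss[b] <- sum((y - theta)^2)
  js_loss[b]  <- sum((js - theta)^2)
}
c(MLE = mean(mle_loss), JS = mean(js_loss)) # the James-Stein risk is below d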

Admissibility

Since (y − θ)ᵀ(y − θ) ∼ χ²_d, the risk of the MLE is

$$E_{y\mid\theta}\{L(\theta, y)\} = E_{y\mid\theta}\left\{(\theta - y)^T(\theta - y)\right\} = d.$$

On the other hand, the risk of the James-Stein estimator is

$$E_{y\mid\theta}\left\{L\left(\theta, \left[1 - \frac{d-2}{y^Ty}\right]y\right)\right\} = E_{y\mid\theta}\left\{(\theta - y)^T(\theta - y)\right\} - E_{y\mid\theta}\left\{\frac{(d-2)^2}{y^Ty}\right\} < d$$

because E_{y|θ}{(d − 2)²/(yᵀy)} > 0 for d > 2.

Admissibility

The James-Stein estimator can be motivated using a hierarchical model of the form

yi | θi ∼ N(θi , 1), i = 1, . . . , n,

θi | τ2 ∼ N(0, τ2), i = 1, . . . , n,

where the hyperparameter τ² is replaced by an unbiased estimator obtained from p(y1, . . . , yn | τ²) (recall our example on batting averages).

This result illustrates an important insight from Bayesian methods: Borrowing information across groups can significantly improve the estimate associated with each individual observation.

Admissibility

More generally:

Under proper priors, all Bayes rules that have finite Bayes risk are admissible.

Under regularity conditions, all admissible rules are Bayes rules with respect to an appropriately chosen (and possibly improper) prior.

The proofs are beyond the scope of this course, but can be seen in Robert (2007).

Exchangeability

An infinite sequence of random variables Y1, Y2, . . . is said to be exchangeable if for any integer n and any permutation σn of {1, . . . , n} their joint distribution satisfies

p(y1, . . . , yn) = p(yσn(1), . . . , yσn(n))

Clearly observations that are conditionally independent given a common parameter are exchangeable, but the concept is more general.

For example, think of random variables that follow a joint Gaussian distribution with E(Yi) = 0, Var(Yi) = 1 and Cov(Yi, Yj) = ρ with ρ ≠ 1.

De Finetti’s theorem

For binary random variables: Let Y1, Y2, . . . be an infinitely exchangeable sequence of binary random variables. Then there exists a distribution function Π such that

$$p(y_1, \ldots, y_n) = \int_0^1 \prod_{i=1}^{n} \theta^{y_i}(1-\theta)^{1-y_i}\, \Pi(d\theta)$$

where Π is the limiting distribution function that satisfies

$$\Pi(\theta) = \lim_{n\to\infty} P\left(\frac{1}{n}\sum_{i=1}^{n} Y_i \le \theta\right).$$

De Finetti’s theorem

Note that $\int_0^1 \prod_{i=1}^{n} \theta^{y_i}(1-\theta)^{1-y_i}\, \Pi(d\theta)$ is just the prior predictive distribution from the Bayesian model

yi | θ ∼ Ber(θ),
θ ∼ π.

In other words, if a binary sequence is exchangeable then you should use a hierarchical model for it in which the likelihood is Bernoulli and the prior is a carefully chosen distribution π.

In particular, if you pick a prior π that is general enough then your model can be dense on the space of all data distributions.

De Finetti’s theorem

Proof is longer than I have time for here.

The result is more general: If Yi ∈ ℝᵈ, then Y1, Y2, . . . is infinitely exchangeable if and only if

$$p(y_1, \ldots, y_n) = \int \left(\prod_{i=1}^{n} G(y_i)\right) d\eta(G)$$

where η(G) is a prior on the space of distributions!

In the correlated multivariate Gaussian example mentioned above, we could have written yi | θ ∼ N(θ, σ²) and θ ∼ N(0, τ²) where τ² = ρ and σ² = 1 − ρ.
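A quick simulation check of this hierarchical representation (ρ = 0.4 is an arbitrary illustration):

# Simulating the exchangeable Gaussian sequence through its hierarchical
# (de Finetti) representation; rho = 0.4 is an arbitrary choice.
set.seed(5)
rho <- 0.4; n <- 5; B <- 50000
theta <- rnorm(B, 0, sqrt(rho))             # theta ~ N(0, tau2) with tau2 = rho
y <- matrix(rnorm(B * n, mean = theta, sd = sqrt(1 - rho)), B, n)
round(cov(y), 2)  # approximately 1 on the diagonal and rho off the diagonal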

The likelihood principle

The likelihood principle states the following:

In making inferences or decisions about θ after some data y is observed, all relevant experimental information is contained in the likelihood function. Hence, if y and y′ are two samples that have proportional likelihoods then the conclusions drawn from y and y′ should be identical.

This is just a principle (not a theorem or axiom), but it seems to be a reasonable one (more on this later).

The likelihood principle

Example (binomial/negative binomial sampling): Consider two independent experiments. In the first experiment the researchers decide to carry out 12 Bernoulli trials, and get exactly Y1 = 3 successes. The data therefore follow a binomial distribution,

$$\binom{12}{3} \theta^3 (1-\theta)^9.$$

In the second experiment the researchers decide to carry out trials until they get 3 successes and end up running a total of Y2 = 12 trials. The likelihood is therefore negative binomial,

$$\binom{11}{2} \theta^3 (1-\theta)^9.$$

The likelihood principle

Note that these two likelihoods are proportional to each other, so if we believe the likelihood principle then any inference we draw from these experiments should be the same.

Note that maximum likelihood estimation satisfies the likelihood principle: In both cases θ̂ = 1/4.

Bayesian inference under proper subjective priors also satisfies the likelihood principle.

However, frequentist hypothesis testing does not!

The likelihood principle

Consider the hypotheses H0 : θ ≥ 1/2 versus H1 : θ < 1/2.

For the Binomial model we have

$$\text{p-value} = \Pr\left(Y_1 \le 3 \,\Big|\, \theta = \frac{1}{2}\right) \approx 0.0729$$

On the other hand, for the Negative Binomial model we have

$$\text{p-value} = \Pr\left(Y_2 \ge 12 \,\Big|\, \theta = \frac{1}{2}\right) \approx 0.0327$$

Clearly the stopping rule matters here! The difference is due to the fact that the p-value uses the probability of values other than the observed data.
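Both tail probabilities are easy to verify in R (note that pnbinom counts failures before the r-th success, so Y2 ≥ 12 trials corresponds to at least 9 failures):

# Verifying the two p-values.
pbinom(3, size = 12, prob = 0.5)       # binomial: Pr(Y1 <= 3) = 0.0729...
1 - pnbinom(8, size = 3, prob = 0.5)   # negative binomial: Pr(Y2 >= 12) = 0.0327...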

The likelihood principle

The likelihood principle can be derived from two other principles:

Sufficiency: Consider an experiment Ex(Y) = {Y, θ, f(y | θ)} and suppose that T(Y) is a sufficient statistic for θ. If Y and Y′ are sample points satisfying T(Y) = T(Y′) then the evidence from Ex(Y) and Ex(Y′) is identical.

Conditionality principle: Consider two experiments Ex1(Y1) = {Y1, θ, f(y | θ)} and Ex2(Y2) = {Y2, θ, g(y | θ)} where θ is common to both. Consider a mixed experiment where you flip a coin with probability 1/2, and you run experiment 1 if the coin comes up heads and experiment 2 if the coin comes up tails. Then the evidence on θ from this mixed experiment is the same as the evidence from the experiment that was actually run.

The likelihood principle

As the previous example illustrates, one implication of the likelihood principle is that, as long as the stopping rule for the experiment does not depend on the unknown parameters of the likelihood, inferences should be independent of the stopping rule.

Should we adhere to the likelihood principle? Some philosophers argue not (e.g., because "evidence" is not a well-defined concept). Probably a middle ground is reasonable: try to follow it, but do not make it into a dogma. Indeed, there are Bayesian procedures that have good justifications but violate the likelihood principle (e.g., Jeffreys priors).

Other reasons to be Bayesian

In a Bayesian setting it is natural to think about hierarchies and borrowing of information. Hierarchical models can (and are) used in non-Bayesian settings, but they are not as natural.

With the advent of MCMC and other simulation-based approaches to inference, computation in Bayesian settings has become in some ways easier than computation for frequentist approaches (particularly for complex models).

Since both observations and parameters are random variables, dealing with missing and censored data tends to be simpler. ⇒ Data augmentation.

What are the challenges of being a Bayesian?

1. Prior elicitation.
2. Computation of posterior distributions and appropriate summaries.

Some challenges in applying Bayesian statistical methods

As the previous discussion suggests, the most challenging aspects of Bayesian statistical inference correspond to:

Eliciting prior distributions.
Performing the prior-posterior updates in non-conjugate models.
Obtaining optimal actions when non-standard utility functions are in play.

In the next few slides we focus on the first two challenges.

Prior elicitation

In a Bayesian setting you need to think about your model as composed of both a likelihood and a set of priors!

In other words, the prior is an integral part of your model and needs as much attention as the likelihood function.

Some people see this as a negative:

Researchers with different prior information might draw different conclusions. ⇒ For estimation and composite vs. composite testing this is alleviated by large sample sizes.
If your prior information is "bad" it could negatively affect your inference. ⇒ Again, for some inferential problems this is alleviated by large sample sizes.
Prior information can be hard to elicit. ⇒ True, but there are some ways around it.

Prior elicitation

On the flip side, good prior information can substantially improve the quality of your results, particularly in small sample sizes.

Furthermore, all modeling choices introduce prior information in the problem, so frequentists also use it (they just do not talk about it in those terms). This has become even clearer with the popularity of "penalized likelihood" approaches (penalties are just priors by a different name).

Separation between "modeling" (in the sense of selecting the likelihood) and "prior elicitation" (selecting the priors) is artificial.

Prior elicitation

People often distinguish between "subjective" and "objective" Bayes:

In subjective Bayes the prior needs to be carefully elicited from experts (potentially different from the modeler) and all priors are well-defined probability distributions.

In objective Bayes the prior is treated as a computational device that enables us to use Bayes theorem and could potentially be "improper" (as long as the posterior is proper). Focus is often on priors that have some desired, ad-hoc properties (minimum information, invariance, frequentist coverage, etc.).

In practice, a middle ground is often appropriate: We should include prior information if we have it, but when it is not available (or when there are so many parameters that elicitation is not practical) you want to have a "default" set of priors you can rely on.

Prior elicitation: Subjective Bayes

Real prior elicitation is complex because we need to generate a whole distribution.

Experts can often answer questions about a few quantiles of the distribution and maybe a couple of moments, but they rarely can really tell you all that you would need. This means that you often have to restrict yourself to a parametric family.

Also, even after the introduction of more powerful computational tools based on simulations, the choice of prior distributions is often driven by a desire to facilitate computation.

Prior elicitation: Subjective Bayes

We often restrict attention to a relatively small set of well-known families that:

Are computationally tractable ⇒ Conditionally conjugate priors.
Have a small number of interpretable parameters.

Once we restrict ourselves to these families, it is just a matter of eliciting enough quantiles from the experts to completely determine the prior parameters ⇒ Solve a system of equations (method of moments/method of quantiles).

It is often easier to elicit information about the observables themselves, and then see what that would require from the hyperparameters.

Prior elicitation: Empirical Bayes

Another approach to prior elicitation is the empirical Bayes procedure we described when we discussed hierarchical models.

In practice the "type II maximum likelihood" approach (i.e., computing the marginal over the parameters of interest and then computing the MLE of the hyperparameters) is not often used.

Instead, it is common to use rough estimates obtained from the data to get at the right location and scale for the hyperparameters. Again, it is common to elicit information about the observables and then see what the implication is for the hyperparameters.

The main drawback of empirical Bayes is that, technically, we are using the data twice. This means that we lose some of the optimality properties associated with Bayesian methods ...

Objective Bayes: Laplace’s indifference principle

Laplace's indifference principle (also called the principle of insufficient reason) states that if the K possibilities are indistinguishable except for their names, then each possibility should be assigned a probability equal to 1/K.

Note that such a formulation is perfectly valid for a finite parameter space.

For countable or continuous parameter spaces the indifference principle might lead to improper prior distributions, which may or may not lead to proper posteriors!

The posterior is proper if and only if ∫ p(y | θ) π(θ) dθ < ∞ (always true if π(θ) is itself proper!).

Objective Bayes: Laplace’s indifference principle

Example (normal likelihood with a flat prior): Assume that y1, . . . , yn is an independent and identically distributed sample with yi ∼ N(θ, 1) for i = 1, . . . , n. Using Laplace's indifference principle in this case leads to π(θ) ∝ 1, which is improper. Nonetheless, the posterior is proper:

$$p(y_1, \ldots, y_n) = \int_{-\infty}^{\infty} \left(\frac{1}{2\pi}\right)^{\frac{n}{2}} \exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(y_i - \theta)^2\right\} d\theta = \left(\frac{1}{2\pi}\right)^{\frac{n-1}{2}} \frac{1}{\sqrt{n}} \exp\left\{-\frac{1}{2}\left[\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right]\right\},$$

so that θ | y1, . . . , yn ∼ N(ȳ, 1/n).
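The finiteness of this integral is also easy to check numerically; the five data points below are made up:

# Numeric check that the flat-prior marginal is finite (made-up data).
y <- c(-0.2, 0.5, 1.1, 0.3, -0.7)
marg <- integrate(function(theta)
  sapply(theta, function(t) prod(dnorm(y, t, 1))), -Inf, Inf)
marg$value   # finite, so the posterior is proper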

Objective Bayes: Laplace’s indifference principle

Example (Bernoulli likelihood): Consider a single Bernoulli observation with success probability θ = exp{η}/(1 + exp{η}) and assign η a uniform prior (which is improper since η ∈ ℝ). Note that

$$\int_{-\infty}^{\infty} \frac{\exp\{y_i\eta\}}{1 + \exp\{\eta\}}\, d\eta = \infty$$

both if yi = 0 and if yi = 1. Hence this posterior is improper! On the other hand, if we had placed a uniform prior on θ then the posterior would clearly be proper, since

$$\int_0^1 \theta^{y_i}(1 - \theta)^{1 - y_i}\, d\theta = \frac{\Gamma(1 + y_i)\,\Gamma(2 - y_i)}{\Gamma(3)}.$$

Objective Bayes: Laplace’s indifference principle

The previous examples highlight one of the most important shortcomings of Laplace's indifference principle for continuous parameters: what is the right parameterization in which the prior is uniform?

The answer to this is not always clear or obvious! This motivates the use of Jeffreys priors!

Furthermore, the need to decide on the appropriate parameterization somehow diminishes the idea that this is a default prior.

Objective Bayes: Jeffreys prior

The motivation for the Jeffreys prior is to have a prior that is invariant to the parameterization of the model.

The Jeffreys prior is defined in terms of the likelihood:

$$\pi_J(\theta) \propto |\mathcal{I}(\theta)|^{1/2}$$

where I(θ) is the information matrix with entries

$$[\mathcal{I}(\theta)]_{i,j} = -E_{y\mid\theta}\left\{\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log p(y \mid \theta)\right\}$$

Since reparameterizations involve the Jacobian of the transformation, invariance to transformation should not be too surprising.

Objective Bayes: Jeffreys prior

Example (Binomial distribution): Let y ∼ Bin(n, θ), so that

$$-\frac{d^2}{d\theta^2} \log p(y \mid \theta) = \frac{y}{\theta^2} + \frac{n - y}{(1-\theta)^2}$$

and I(θ) = n/(θ(1 − θ)). Therefore, in this case

$$\pi_J(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2},$$

which can be recognized as the kernel of a beta(1/2, 1/2) distribution. Hence, the Jeffreys prior is a proper prior in this case.
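A quick numeric sanity check (n = 10 is an arbitrary choice): the square root of the expected information matches the beta(1/2, 1/2) kernel up to a constant.

# Comparing sqrt(I(theta)) with the beta(1/2, 1/2) kernel; n is arbitrary.
n <- 10
theta <- seq(0.01, 0.99, by = 0.01)
info <- n / (theta * (1 - theta))                 # Fisher information
plot(theta, sqrt(info), type = "l")               # unnormalized Jeffreys prior
lines(theta, sqrt(n) * pi * dbeta(theta, 0.5, 0.5), lty = 2)  # same curve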

Objective Bayes: Jeffreys prior

Example (Binomial distribution, cont.): Consider now the transformation θ = exp{η}/(1 + exp{η}), so that η corresponds to the log-odds. Transforming the beta(1/2, 1/2) prior we just derived we get

$$\pi_J(\eta) = \frac{1}{\pi}\left(\frac{\exp\{\eta\}}{1 + \exp\{\eta\}}\right)^{-1/2}\left(\frac{1}{1 + \exp\{\eta\}}\right)^{-1/2}\frac{\exp\{\eta\}}{(1 + \exp\{\eta\})^2},$$

so that

$$\pi_J(\eta) = \frac{1}{\pi}\,\frac{\exp\{\eta/2\}}{1 + \exp\{\eta\}},$$

which is the same result that you would get if you were to apply Jeffreys' rule directly to

$$\log p(y \mid \eta) = y\eta - \log(1 + \exp\{\eta\}).$$

Objective Bayes: Jeffreys prior

Example (Negative binomial): Let Y | θ ∼ NBin(r, θ), where Y = r, r + 1, . . . is the total number of trials needed to obtain r successes. In this case

$$-\frac{d^2}{d\theta^2} \log p(y \mid \theta) = \frac{r}{\theta^2} + \frac{y - r}{(1-\theta)^2},$$

and since E{y} = r/θ we have

$$\pi_J(\theta) \propto \theta^{-1}(1-\theta)^{-1/2}.$$

Note that $\int_0^1 \theta^{-1}(1-\theta)^{-1/2}\, d\theta = \infty$, so in this case the Jeffreys prior is improper. Furthermore, this expression is different from the one we derived for the binomial case ⇒ Bayesian analyses that use the Jeffreys prior do not necessarily satisfy the likelihood principle.

Objective Bayes: Jeffreys prior

For location families where p(y | θ) = f(y − θ) for some density f, the Jeffreys prior is simply

$$\pi_{J,L}(\theta) \propto 1.$$

For scale families where p(y | θ) = (1/θ) f(y/θ) for some density f, the Jeffreys prior is

$$\pi_{J,S}(\theta) \propto \frac{1}{\theta}.$$

This is equivalent to placing a uniform prior on the log scale.

In both cases the Jeffreys prior is improper, and we need to carefully check that the corresponding posterior is proper!

Objective Bayes: Jeffreys prior

The Jeffreys priors for location and scale families can be obtained as the limit of proper priors:

For the location family, consider a Gaussian prior

$$\pi_L(\theta) = \frac{1}{\sqrt{2\pi}\,\tau} \exp\left\{-\frac{1}{2\tau^2}(\theta - \mu)^2\right\}$$

Note that lim_{τ→∞} πL(θ) = πJ,L(θ).

For the scale family, consider a (slightly reparameterized) inverse gamma family

$$\pi_S(\theta) = \frac{(\alpha\beta)^{\alpha}}{\Gamma(\alpha)}\, \theta^{-(\alpha+1)} \exp\left\{-\frac{\alpha\beta}{\theta}\right\}$$

where E{θ} = β. Here lim_{α→0} πS(θ) = πJ,S(θ).

Objective Bayes: Jeffreys prior

Example (Normal distribution with unknown mean and variance): Let yi | µ, φ ∼ N(µ, φ) for i = 1, . . . , n, so that φ = σ² is the variance. The Hessian in this case is

$$\begin{vmatrix} -\frac{n}{\phi} & -\frac{1}{\phi^2}\sum_{i=1}^{n}(y_i - \mu) \\ -\frac{1}{\phi^2}\sum_{i=1}^{n}(y_i - \mu) & \frac{n}{2\phi^2} - \frac{1}{\phi^3}\sum_{i=1}^{n}(y_i - \mu)^2 \end{vmatrix}.$$

Since E{Σ(yi − µ)} = 0 and E{Σ(yi − µ)²} = nφ,

$$|\mathcal{I}(\mu, \phi)| \propto \frac{1}{\phi^3}$$

and

$$\pi_J(\mu, \phi) \propto \phi^{-3/2}.$$

Objective Bayes: Jeffreys prior

Multivariate Jeffreys priors sometimes have undesirable behaviors.

A very common alternative is the so-called independence Jeffreys prior (originally proposed by Jeffreys himself).

The independence Jeffreys prior πIJ(µ, φ) assumes prior independence and uses the corresponding univariate Jeffreys priors, πIJ(µ, φ) = πJ(µ) πJ(φ).

In the case of the normal model with unknown mean and variance this reduces to

$$\pi_{IJ}(\mu, \phi) \propto \frac{1}{\phi}.$$

Independence Jeffreys priors are not invariant to transformations that involve more than one parameter.

Objective Bayes: Reference priors

Another (similar) solution is to construct the prior sequentially.

Let y ∼ p(y | θ1, θ2), where θ1 is the parameter of interest and θ2 is the nuisance parameter.

Use p(y | θ1, θ2) to construct the Jeffreys prior for θ2 assuming a fixed value of θ1, to obtain πJ(θ2 | θ1).
Compute p(y | θ1) = ∫ p(y | θ1, θ2) πJ(θ2 | θ1) dθ2.
Construct the Jeffreys prior for θ1 using p(y | θ1).

The order of the parameters matters, and you might get a different prior depending on what the main parameter of interest is.

Can be generalized to more than two groups of parameters.

Can be justified as maximizing the posterior information.

Bayesian hypothesis testing

Improper priors such as those that often arise from non-informative approaches are often appropriate for both point and interval estimation.

However, their use in problems that involve models of different dimensions (as is the case in hypothesis testing) is extremely problematic.

To illustrate this, let's consider Bartlett's paradox.

Bartlett's paradox

Example: Let y1, . . . , yn be such that yi | θ ∼ N(θ, 1) and consider testing the hypotheses H0 : θ = 0 against H1 : θ ≠ 0 under the prior θ | H1 ∼ N(0, τ²).

The corresponding Bayes factor (which we already calculated in our introductory example) is

$$B_{10}(\tau^2, y) = (1 + n\tau^2)^{-1/2} \exp\left\{\frac{1}{2}\,\frac{n^2\bar{y}^2}{n + \frac{1}{\tau^2}}\right\},$$

and it is easy to verify that lim_{τ²→∞} B10(τ², y) = 0, and therefore lim_{τ²→∞} p(H0 | y1, . . . , yn) = 1, for any values of n and ȳ.
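A quick numeric illustration (n and ȳ below are made up, chosen so that the data clearly favor H1):

# Bartlett's paradox: B10 -> 0 as tau2 grows, even for data favoring H1.
B10 <- function(tau2, n, ybar)
  (1 + n * tau2)^(-1/2) * exp(0.5 * n^2 * ybar^2 / (n + 1 / tau2))
n <- 25; ybar <- 0.5                        # hypothetical data summary
sapply(10^(0:8), B10, n = n, ybar = ybar)   # decays toward 0 as tau2 grows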

Bartlett's paradox

This appears to be paradoxical because we would expect the Bayes factor to prefer H1 when ȳ is large (and indeed lim_{ȳ→∞} B10(τ², y) = ∞). Also, we had shown that Bayesian model selection is consistent as n → ∞.

In conclusion, no matter how strong the information we have in our sample, the Bayes factor constructed using a flat prior will always favor the null model.

The same phenomenon appears in most comparisons that involve testing point vs. composite hypotheses using improper priors on the parameter being tested.

More generally, this result highlights that priors can have a big impact on model selection, even for large sample sizes.

Bartlett's paradox

The key to understanding Bartlett's paradox is to notice that the order in which you take the limits matters! If you take τ² → ∞ first, the Bayes factor is ill behaved.

Another way to think about Bartlett's paradox is that it arises because improper priors are defined only up to an arbitrary normalizing constant; the Bayes factor depends on that constant, and its value is therefore undefined.

This does not happen in estimation problems because the arbitrary constant appears both in the numerator and the denominator of the posterior distribution, canceling out.

Another way to think about it is that the improper prior essentially puts probability 0 on H1 a priori.

Bartlett's paradox does not arise when testing composite vs. composite hypotheses where the same improper prior is used under both hypotheses.

Bartlett's paradox

Example: Again, let y1, . . . , yn be such that yi | θ ∼ N(θ, 1) and consider testing the hypotheses H0 : θ ≤ 0 against H1 : θ > 0 under the prior θ ∼ N(0, τ²). The posterior odds in this case reduce to

$$\frac{p(H_0 \mid y_1, \ldots, y_n)}{p(H_1 \mid y_1, \ldots, y_n)} = \frac{\int_{-\infty}^{0} \exp\left\{-\frac{1}{2}\left(n + \frac{1}{\tau^2}\right)\left(\theta - \frac{n\bar{y}}{n + 1/\tau^2}\right)^2\right\} d\theta}{\int_{0}^{\infty} \exp\left\{-\frac{1}{2}\left(n + \frac{1}{\tau^2}\right)\left(\theta - \frac{n\bar{y}}{n + 1/\tau^2}\right)^2\right\} d\theta},$$

so that

$$0 < \lim_{\tau^2 \to \infty} \frac{p(H_0 \mid y_1, \ldots, y_n)}{p(H_1 \mid y_1, \ldots, y_n)} < \infty$$

for any finite values of ȳ and n. Taking the limit is not a problem here!

Bartlett's paradox

Example (cont.): Intuitively, the improper prior is not a problem here because the (undefined!) normalizing constant of the prior appears in both the numerator and the denominator, so it cancels out!

More generally, traditional improper priors used in estimation will work well for testing composite vs. composite hypotheses, but NOT for testing point vs. composite hypotheses. Different ("well calibrated") default priors are required in that setting.

Computation for Bayesian models

Once prior distributions have been elicited, the next major challenge in using Bayesian methods is to compute (appropriate summaries of) the posterior distributions.

This is a much more challenging problem than that encountered in classical statistical methods.

Computation for frequentist methods requires that we compute maxima/minima and derivatives.
Computation for Bayesian methods usually requires that we compute integrals!

This was a major obstacle for the practical adoption of Bayesian methods before cheap and powerful computers were widely available.

Until the mid-80s, research in Bayesian statistics was, for the most part, limited to the study of theoretical properties of relatively simple models.

Computation for Bayesian models

Optimization techniques can be useful in Bayesian statistics (e.g., to construct MAP estimators), but they do not allow for a full exploration of the posterior distribution.

One way to proceed is to employ analytical approximations such as the Laplace approximation, but these might not work well for small sample sizes.

Another alternative is to use numerical integration techniques (such as Gaussian quadrature). However, this works well only for relatively low-dimensional problems (dimension ≤ 4 or so).

The introduction of simulation-based algorithms originally developed in the physics literature has enabled the construction of fancy, high-dimensional Bayesian models, leading to the rise of Bayesian methods in practice.

Why simulation based methods: A motivating example

To illustrate the power of simulation methods, consider the problem of estimating the diagnostic power of a medical test, defined as the probability that an individual who tests positive is actually diseased,

$$\eta = \Pr(D \mid T).$$

η is not directly observable. However, we can easily obtain data on the prevalence θ, the false positive rate α and the false negative rate β,

$$\theta = \Pr(D), \qquad \alpha = \Pr(T \mid \bar{D}), \qquad \beta = \Pr(\bar{T} \mid D),$$

by sampling individuals from the general population, the population of healthy individuals, and the population of diseased individuals, respectively.

A motivating example

Note that η is related to θ, α and β by the formula:

$$\eta(\theta, \alpha, \beta) = \frac{(1-\beta)\theta}{(1-\beta)\theta + \alpha(1-\theta)}.$$

(This is just Bayes theorem!)

Statistical model: Assuming that individuals are sampled at random from the corresponding populations,

x1 = # of diseased individuals in a sample of size n1 from the general population ∼ Bin(n1, θ),
x2 = # of positive tests in a sample of size n2 of healthy individuals ∼ Bin(n2, α),
x3 = # of positive tests in a sample of size n3 of diseased individuals ∼ Bin(n3, 1 − β).

A motivating example

A frequentist approach:

Point estimation: The MLEs for θ, α and β are

$$\hat{\theta} = \frac{x_1}{n_1}, \qquad \hat{\alpha} = \frac{x_2}{n_2}, \qquad \hat{\beta} = 1 - \frac{x_3}{n_3}.$$

Hence, the MLE of η is simply η̂ = (1 − β̂)θ̂ / [(1 − β̂)θ̂ + α̂(1 − θ̂)].

Interval estimation: Finding a pivot for η is hard. However, an asymptotic approximate interval can be constructed using the delta method,

$$\text{Var}(\hat{\eta}) \approx \left[\nabla\eta(\theta, \alpha, \beta)\big|_{(\hat{\theta},\hat{\alpha},\hat{\beta})}\right]^T \Sigma \left[\nabla\eta(\theta, \alpha, \beta)\big|_{(\hat{\theta},\hat{\alpha},\hat{\beta})}\right]$$

where Σ = diag{Var(θ̂), Var(α̂), Var(β̂)}.

For small samples, the parametric bootstrap could be used (which is a simulation-based computational tool common in frequentist statistics).

A motivating example

A frequentist approach (cont.):

Interval estimation (cont.): Parametric bootstrap: Once θ̂, α̂ and β̂ have been computed, repeat the following steps for b = 1, . . . , B, where B is "large":

1. Sample imaginary datasets x1(b) ∼ Bin(n1, θ̂), x2(b) ∼ Bin(n2, α̂) and x3(b) ∼ Bin(n3, 1 − β̂).
2. Compute θ̂(b) = x1(b)/n1, α̂(b) = x2(b)/n2 and β̂(b) = 1 − x3(b)/n3, and let η̂(b) = (1 − β̂(b))θ̂(b) / [(1 − β̂(b))θ̂(b) + α̂(b)(1 − θ̂(b))].

Then, a (1 − ε)-coverage confidence interval for η is obtained by computing the ε/2 and 1 − ε/2 quantiles of the sample η̂(1), . . . , η̂(B).
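A minimal sketch of this bootstrap; all counts and sample sizes below are hypothetical:

# Parametric bootstrap interval for eta (made-up counts and sample sizes).
eta_fun <- function(theta, alpha, beta)
  (1 - beta) * theta / ((1 - beta) * theta + alpha * (1 - theta))
n1 <- 500; n2 <- 200; n3 <- 200          # hypothetical sample sizes
x1 <- 40;  x2 <- 12;  x3 <- 180          # hypothetical counts
theta_hat <- x1 / n1; alpha_hat <- x2 / n2; beta_hat <- 1 - x3 / n3
set.seed(6)
B <- 5000
eta_boot <- replicate(B, {
  t_b <- rbinom(1, n1, theta_hat) / n1
  a_b <- rbinom(1, n2, alpha_hat) / n2
  b_b <- 1 - rbinom(1, n3, 1 - beta_hat) / n3
  eta_fun(t_b, a_b, b_b)
})
quantile(eta_boot, c(0.025, 0.975))      # approximate 95% confidence interval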

A motivating example

A Bayesian approach:

Priors: Natural non-informative priors for this problem are θ ∼ beta(1/2, 1/2), α ∼ beta(1/2, 1/2) and β ∼ beta(1/2, 1/2).

Posteriors: Because of independence and conjugacy we have

$$p(\theta, \alpha, \beta \mid x_1, x_2, x_3) = \text{beta}\left(\theta \,\middle|\, \tfrac{1}{2} + x_1, \tfrac{1}{2} + n_1 - x_1\right) \text{beta}\left(\alpha \,\middle|\, \tfrac{1}{2} + x_2, \tfrac{1}{2} + n_2 - x_2\right) \text{beta}\left(\beta \,\middle|\, \tfrac{1}{2} + n_3 - x_3, \tfrac{1}{2} + x_3\right)$$

Posteriors (cont.): The posterior distribution for (θ, α, β) is easy to obtain. Since η is a function of (θ, α, β), its posterior can be obtained through transformations. However, in this case, this is extremely messy ... Try it yourself.

A motivating example

A Bayesian approach:

Using a simulation-based approach: Instead of trying to find a closed-form expression for p(η | x1, x2, x3), sample!

1. For b = 1, . . . , B generate θ(b) ∼ beta(1/2 + x1, 1/2 + n1 − x1), α(b) ∼ beta(1/2 + x2, 1/2 + n2 − x2) and β(b) ∼ beta(1/2 + n3 − x3, 1/2 + x3).
2. For b = 1, . . . , B let η(b) = (1 − β(b))θ(b) / [(1 − β(b))θ(b) + α(b)(1 − θ(b))].

The density of the posterior distribution can be approximated using a histogram or a kernel density estimator.

Optimal estimates under quadratic or absolute difference loss can be obtained as η̂Q = (1/B) Σ_{b=1}^{B} η(b) and η̂A = Med{η(b)}.

A (1 − ε)-probability credible interval for η can be constructed by computing the ε/2 and 1 − ε/2 quantiles of η(1), . . . , η(B).
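A minimal sketch of this sampler; the counts reuse the hypothetical values from the bootstrap sketch above, and the actual simulationmotivation.R script mentioned below is not reproduced here:

# Posterior simulation for eta; counts are the same hypothetical values
# used in the bootstrap sketch above.
set.seed(7)
n1 <- 500; n2 <- 200; n3 <- 200
x1 <- 40;  x2 <- 12;  x3 <- 180
B <- 10000
theta_s <- rbeta(B, 0.5 + x1, 0.5 + n1 - x1)
alpha_s <- rbeta(B, 0.5 + x2, 0.5 + n2 - x2)
beta_s  <- rbeta(B, 0.5 + n3 - x3, 0.5 + x3)
eta_s   <- (1 - beta_s) * theta_s /
           ((1 - beta_s) * theta_s + alpha_s * (1 - theta_s))
c(mean = mean(eta_s), median = median(eta_s))  # optimal point estimates
quantile(eta_s, c(0.025, 0.975))               # 95% credible interval
# hist(eta_s) or plot(density(eta_s)) approximates p(eta | x1, x2, x3)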

A motivating example

Let’s run the scripts in the file simulationmotivation.R.

A motivating example

Even though a closed-form expression for the posterior distribution of η is extremely hard to get, inferences for a given dataset are very easy to obtain using simulation!

One way to think about simulation-based methods is as the Bayesian alternative to the bootstrap.

In this case we could use standard software to simulate random numbers from a beta distribution.

What is R doing to generate random numbers?
What do we do when R does not have a function to generate from the distribution we are interested in?
What do we do in more complicated settings where the posterior distribution does not have a form that we can recognize?
