Introduction to Bayesian Data Analysis and Markov Chain Monte Carlo

Jeffrey S. Morris
University of Texas M.D. Anderson Cancer Center
Department of Biostatistics
[email protected]

September 20, 2002

Source: odin.mdacc.tmc.edu/~jmorris/talks_files/mcmc2.pdf

Abstract: The purpose of this talk is to give a brief overview of Bayesian inference and Markov Chain Monte Carlo methods, including the Gibbs sampler and the Metropolis-Hastings algorithm.


MCMC OVERVIEW 1

Outline

• Bayesian vs. Frequentist paradigm

• Bayesian Inference and MCMC

  – Gibbs Sampler
  – Metropolis-Hastings Algorithm

• Assessing Convergence of MCMC

• Hierarchical Model Example

• MCMC: Benefits and Cautions

MCMC OVERVIEW 2

Frequentist vs. Bayesian paradigms

• Data: X   Parameters: Θ

• To a frequentist:

  – The data X are random, and the parameters Θ are fixed.
  – (ML) Inference is performed by finding the Θ that maximizes f(X|Θ).
  – We cannot make probability statements about parameters; we can only make statements about the performance of estimators over repeated sampling (e.g., confidence intervals).

• To a Bayesian:

  – The current data X are fixed, and the unknown parameters Θ are random.
  – Inference is performed via the posterior distribution f(Θ|X).
  – We can make probability statements about parameters, since they are random quantities (e.g., credible intervals).

MCMC OVERVIEW 3

Bayes’ Rule

• The posterior distribution is computed by applying Bayes’ Rule:

  f(Θ|X) = f(X|Θ) f(Θ) / f(X)

• f(X|Θ) = Likelihood

• f(Θ) = Prior Distribution

  – Reflects prior knowledge about Θ
  – Sometimes controversial
  – If little information is available, use diffuse priors (but avoid improper priors)

• f(X) = Marginal Distribution = ∫ f(X|Θ) f(Θ) dΘ

  – Difficult to compute (usually an intractable integral)
  – Often not necessary to compute
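The role of each term can be checked numerically. The sketch below (an assumed example, not from the talk) uses a Binomial likelihood with X = 7 successes in n = 10 trials and a Beta(2, 2) prior, approximates the marginal f(X) by a Riemann sum over a grid of θ values, and confirms that dividing by it yields a proper posterior:

```python
import math

# Assumed setup (not from the talk): X = 7 successes in n = 10 trials,
# with a Beta(2, 2) prior on the success probability theta.
n, X = 10, 7
alpha, beta = 2.0, 2.0

# Discretize theta on a fine grid to approximate the integrals.
G = 10_000
grid = [(i + 0.5) / G for i in range(G)]

def likelihood(theta):
    # f(X | theta): Binomial pmf at the observed X
    return math.comb(n, X) * theta**X * (1 - theta)**(n - X)

def prior(theta):
    # f(theta): Beta(alpha, beta) density
    const = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return const * theta**(alpha - 1) * (1 - theta)**(beta - 1)

# f(X) = ∫ f(X|theta) f(theta) dtheta, approximated by a midpoint Riemann sum
marginal = sum(likelihood(t) * prior(t) for t in grid) / G

# Posterior on the grid: f(theta|X) = f(X|theta) f(theta) / f(X)
posterior = [likelihood(t) * prior(t) / marginal for t in grid]

# Dividing by f(X) makes the posterior integrate to 1, and its mean matches
# the conjugate Beta(alpha + X, beta + n - X) answer, (alpha + X)/(alpha + beta + n).
total = sum(posterior) / G
post_mean = sum(t * p for t, p in zip(grid, posterior)) / G
print(round(total, 4), round(post_mean, 4))
```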

MCMC OVERVIEW 4

Conjugate priors

• Conjugate priors: f(Θ) and f(Θ|X) have the same distributional form.

• Examples: Normal-Normal, Beta-Binomial, Gamma-Poisson

• Ex: (X|θ) ∼ Binomial(n, θ); θ ∼ Beta(α, β)

  f(θ|X) ∝ f(X|θ) f(θ)
         ∝ θ^X (1−θ)^(n−X) · θ^(α−1) (1−θ)^(β−1)
         = θ^(α+X−1) (1−θ)^(β+n−X−1)
         = kernel of Beta(α+X, β+n−X)

• For a single-parameter problem, conjugate priors allow closed-form posterior distributions.

• What if we don’t want to use conjugate priors?
  What if we have multiple parameters?
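The conjugate update derived above amounts to simple parameter arithmetic. A minimal sketch with assumed numbers (7 successes in 10 trials, Beta(2, 2) prior):

```python
def beta_binomial_update(alpha, beta, X, n):
    """Conjugate update from the slide: a Beta(alpha, beta) prior combined with
    a Binomial(n, theta) likelihood gives a Beta(alpha + X, beta + n - X) posterior."""
    return alpha + X, beta + n - X

# Assumed numbers: 7 successes in 10 trials, Beta(2, 2) prior.
a_post, b_post = beta_binomial_update(2, 2, 7, 10)

# The mean of a Beta(a, b) is a / (a + b); the posterior mean is a compromise
# between the prior mean (2/4 = 0.5) and the MLE (7/10 = 0.7).
post_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(post_mean, 3))   # 9 5 0.643
```

The posterior mean sitting between the prior mean and the sample proportion is the usual shrinkage interpretation of a conjugate update.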

MCMC OVERVIEW 5

Non-conjugate Case

• Suppose we are interested in the posterior mean:

  E(Θ|X) = ∫ Θ f(Θ|X) dΘ
         = ∫ Θ f(X|Θ) f(Θ) dΘ / ∫ f(X|Θ) f(Θ) dΘ

• How do we compute this integral if it is intractable?

  – Numerical Integration (Quadrature): may not work if there are many parameters.
  – Monte Carlo integration
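For a single parameter, the quadrature option is straightforward. The sketch below (an assumed, non-conjugate setup: a Binomial likelihood with a truncated-normal-shaped prior on θ) computes the posterior mean as a ratio of Riemann sums; any normalizing constants cancel in the ratio:

```python
import math

# Assumed non-conjugate model (not from the talk): Binomial(n, theta)
# likelihood with a truncated-normal-shaped prior on theta in (0, 1).
n, X = 10, 7

def likelihood(theta):
    return theta**X * (1 - theta)**(n - X)      # f(X|theta) up to a constant

def prior(theta):
    return math.exp(-(theta - 0.5)**2 / 0.02)   # f(theta) up to a constant

# E(theta|X) = ∫ theta f(X|theta) f(theta) dtheta / ∫ f(X|theta) f(theta) dtheta;
# the constants cancel in the ratio, so unnormalized densities suffice.
G = 20_000
grid = [(i + 0.5) / G for i in range(G)]
num = sum(t * likelihood(t) * prior(t) for t in grid)
den = sum(likelihood(t) * prior(t) for t in grid)
post_mean = num / den
print(round(post_mean, 3))
```

The answer falls between the prior mean (0.5) and the MLE (0.7); in higher dimensions the grid grows exponentially, which is exactly why quadrature breaks down with many parameters.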

MCMC OVERVIEW 6

Markov Chain Monte Carlo: Monte Carlo Integration

• Monte Carlo integration: estimate integrals by randomly drawing samples from the required distribution.

  E(Θ|X) = ∫ Θ f(Θ|X) dΘ ≈ (1/n) ∑_{t=1}^{n} Θ_t,  where Θ_t ∼ f(Θ|X)

• We still need a method for drawing samples from the posterior distribution:

  – Rejection Sampling
  – Importance Sampling
  – Markov Chain
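Monte Carlo integration itself is just a sample average. In the sketch below the posterior is assumed to be a Beta(9, 5) distribution that we can sample from directly, so the estimate can be checked against the exact mean 9/14:

```python
import random

random.seed(1)

# Monte Carlo integration as on the slide: if we can draw Theta_t ~ f(Theta|X),
# the posterior mean is approximated by a simple sample average. Assumed
# posterior: Beta(9, 5), which we can sample directly via the standard library.
n_draws = 200_000
draws = [random.betavariate(9, 5) for _ in range(n_draws)]
mc_mean = sum(draws) / n_draws
print(round(mc_mean, 3))   # close to the exact mean 9/14 ≈ 0.643
```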

MCMC OVERVIEW 7

Markov Chain Monte Carlo: Markov Chains

• Markov Chain: a method to draw samples from a desired stationary distribution.

• Steps:

  1. Obtain starting values Θ_0.
  2. Sample Θ_1 from a suitably chosen transition kernel P(Θ_1|Θ_0).
  3. Repeat the second step n times to obtain the chain {Θ_0, Θ_1, …, Θ_n}.

• Theorems show that, under certain regularity conditions, the chain will converge to a particular stationary distribution after a suitable burn-in period.

• End result: a (correlated) sample from the stationary distribution.
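The convergence claim can be seen in a toy chain (an assumed example, not from the talk): an AR(1) process is a Markov chain whose stationary distribution is Normal with mean 0 and variance 1/(1 − ρ²), regardless of the starting value:

```python
import random

random.seed(2)

# Toy Markov chain (assumed example): X_{t+1} = rho * X_t + e_t with
# e_t ~ N(0, 1). Its stationary distribution is N(0, 1 / (1 - rho^2)).
rho = 0.5
x = 50.0                      # start deliberately far from stationarity
chain = [x]
for _ in range(100_000):
    x = rho * x + random.gauss(0.0, 1.0)
    chain.append(x)

# Discard a burn-in period; the remaining draws behave like a (correlated)
# sample from the stationary distribution, whatever the starting value was.
burn = 1_000
kept = chain[burn:]
m = sum(kept) / len(kept)
v = sum((z - m)**2 for z in kept) / len(kept)
print(round(m, 2), round(v, 2))   # near 0 and 1/(1 - 0.25) ≈ 1.33
```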

MCMC OVERVIEW 8

Markov Chain Monte Carlo

• Given a Markov Chain {Θ_0, Θ_1, …, Θ_n} with stationary distribution f(Θ|X) and burn-in m, we can estimate the posterior mean using Monte Carlo integration:

  E(Θ|X) ≈ 1/(n−m) ∑_{t=m+1}^{n} Θ_t.

• Other quantities can also be computed from the Markov Chain:

  – Standard errors
  – Quantiles
  – Density estimates

• The samples can be used to perform any Bayesian inference of interest.

• How do we generate the Markov Chain?

MCMC OVERVIEW 9

Gibbs Sampler

• Gibbs Sampler (Geman and Geman, 1984): the Markov transition kernel consists of drawing from the full conditional distributions.

• Suppose Θ = {θ_1, θ_2, …, θ_p}^T.

  The full conditional distribution for parameter i is f(θ_i | X, Θ_{−i}), which conditions on:

  – The data X
  – The values of all other parameters, Θ_{−i}

MCMC OVERVIEW 10

Gibbs Sampler

• Steps of Gibbs sampler:

1. Choose a set of starting values Θ(0).2. Generate (Θ(1)|Θ(0)) by sampling:

θ(1)1 from f(θ(1)

1 |X,Θ(0)−1)

θ(1)2 from f(θ(1)

2 |X,Θ(0)−2)

...θ

(1)p from f(θ(1)

p |X,Θ(0)−p)

3. Repeat step two to get chain of length n: {Θ(0),Θ(1), . . .Θ(n)}.4. Assuming convergence by iteration m, compute posterior mean,

quantiles, etc. using samples m through n.

• Many variations possible:

? Parameters to update each iteration, order of updating? ’Blocking’ parameters together, working with marginalized distributions

• If conjugate priors used for all parameters, full conditionals in closed form.

• What if we don’t have closed form distributions for full conditionals?
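As a toy illustration of the steps above (not from the talk), here is a minimal Gibbs sampler in Python for a bivariate normal target with correlation rho, where both full conditionals are known univariate normals:

```python
import random

def gibbs_bivariate_normal(rho=0.8, n=5000, burn_in=500, seed=1):
    """Gibbs sampler for (x, y) ~ bivariate normal with means 0, variances 1,
    correlation rho. Full conditionals: x|y ~ N(rho*y, 1-rho^2) and
    y|x ~ N(rho*x, 1-rho^2)."""
    random.seed(seed)
    x, y = 0.0, 0.0                      # step 1: starting values Theta^(0)
    sd = (1.0 - rho ** 2) ** 0.5
    samples = []
    for _ in range(n):                   # steps 2-3: cycle through components
        x = random.gauss(rho * y, sd)    # draw x from f(x | y)
        y = random.gauss(rho * x, sd)    # draw y from f(y | x)
        samples.append((x, y))
    return samples[burn_in:]             # step 4: discard burn-in
```

Posterior summaries are then just sample summaries of the retained draws, e.g. `sum(x for x, _ in draws) / len(draws)` for the posterior mean of x.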

MCMC OVERVIEW 11

Metropolis-Hastings Algorithm

• Metropolis-Hastings algorithm (Metropolis et al. 1953, Hastings 1970): a method for constructing a Markov chain for θ even when no closed-form expression for its distribution is available.

π(θ): kernel of the distribution of interest for θ; here, the full conditional f(θ_i^(t) | X, Θ^(t−1)_{−i}).

• Steps:

1. Get θ^(0) = starting value for θ.
2. Get θ* = proposed value for θ^(1), by sampling from a proposal density q(θ | X, θ^(0)).
3. Compute α(θ^(0), θ*) = min( 1, [π(θ*) q(θ^(0) | θ*)] / [π(θ^(0)) q(θ* | θ^(0))] ).
4. Generate u ∼ Uniform(0,1). If u < α, let θ^(1) = θ*; else let θ^(1) = θ^(0).

• Types of proposals: Random Walk, Independence, Symmetric
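A minimal random-walk Metropolis sketch in Python (an illustrative example with a standard normal target, not from the talk). With a symmetric random-walk proposal the q terms in step 3 cancel, so only the kernel ratio π(θ*)/π(θ^(0)) is needed; it is computed here on the log scale for numerical stability:

```python
import math
import random

def metropolis(log_kernel, theta0=0.0, n_iter=10000, step=1.0, seed=2):
    """Random-walk Metropolis: propose theta* = theta + Normal(0, step),
    accept with probability min(1, pi(theta*) / pi(theta))."""
    random.seed(seed)
    theta = theta0                                   # step 1: starting value
    chain, n_accept = [], 0
    for _ in range(n_iter):
        proposal = theta + random.gauss(0.0, step)   # step 2: propose theta*
        log_alpha = log_kernel(proposal) - log_kernel(theta)  # step 3
        if random.random() < math.exp(min(0.0, log_alpha)):  # step 4
            theta = proposal                         # accept the proposal
            n_accept += 1
        chain.append(theta)                          # else keep old value
    return chain, n_accept / n_iter

# Target: standard normal, with kernel pi(theta) proportional to exp(-theta^2 / 2)
chain, accept_rate = metropolis(lambda th: -0.5 * th * th)
```

Tuning the proposal scale `step` trades off acceptance rate against how far each accepted move travels; both extremes slow the chain's exploration.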

MCMC OVERVIEW 12

Assessing Convergence of Markov Chains

• The Markov chain is known to converge to the stationary distribution of interest, but how do I know when convergence has been achieved? That is, how do I decide how long the burn-in should be?

1. Look at time series plots for the parameters.
2. Run multiple chains with divergent starting values.
3. Run formal diagnostics (Gelman and Rubin 1992, Geweke 1992).

• Other issues:

? Length of chain
? Thinning to decrease autocorrelation

MCMC OVERVIEW 13

Example: Hierarchical Models

• Example: Growth curves for rats.
• Data Yij consist of weights for 30 rats over 5 weeks.

[Figure: "Rat Growth Model Data": weight (g), roughly 100 to 350, plotted against time (days), 10 to 35, one growth curve per rat.]

• Can estimate the mean growth curve by linear regression, but growth curve models are necessary to get the standard errors right.

MCMC OVERVIEW 14

Example: Hierarchical Models

• Model: Yij ∼ Normal(µij, τc)

µij = αi + βi(xj − x̄)

αi ∼ Normal(αc, τα)

βi ∼ Normal(βc, τβ)

• Model could be fit using a linear mixed model or a Bayesian hierarchical model.

• Priors (conjugate and vague):

αc, βc ∼ Normal(0, 10^−6)

τc, τα, τβ ∼ Gamma(0.001, 0.001)

(The second parameter of each Normal here is a precision, as in WinBUGS; 10^−6 is a very small precision, i.e. a very diffuse prior.)

• Gibbs sampler: since conjugate priors were used, the full conditionals are all available in closed form and can be derived with some algebra.

• WinBUGS: statistical software to perform MCMC in general problems.
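To make the conjugate-update idea concrete, here is a Gibbs sampler for a stripped-down one-level version of this setup (a hypothetical example, not the rat model itself): y_i ∼ Normal(µ, τ) with µ ∼ Normal(0, τ0) and τ ∼ Gamma(0.001, 0.001), using the precision parameterization as in WinBUGS. Both full conditionals are standard conjugate results:

```python
import random

def gibbs_normal_model(y, tau0=1e-6, a=0.001, b=0.001, n_iter=5000, seed=4):
    """Gibbs sampler for y_i ~ N(mu, precision tau), with priors
    mu ~ N(0, precision tau0) and tau ~ Gamma(a, b).
    Returns a list of (mu, tau) posterior draws."""
    random.seed(seed)
    n = len(y)
    mu, tau = 0.0, 1.0                        # starting values
    draws = []
    for _ in range(n_iter):
        # mu | y, tau ~ Normal with precision p = tau0 + n*tau
        p = tau0 + n * tau
        mu = random.gauss(tau * sum(y) / p, p ** -0.5)
        # tau | y, mu ~ Gamma(a + n/2, rate = b + sum((y_i - mu)^2)/2)
        ss = sum((yi - mu) ** 2 for yi in y)
        tau = random.gammavariate(a + n / 2, 1.0 / (b + ss / 2))
        draws.append((mu, tau))
    return draws
```

The rat model works the same way, just with more blocks: each (αi, βi) pair, the population means αc and βc, and the three precisions all get conjugate Normal or Gamma full-conditional updates in turn.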

MCMC OVERVIEW 15

Conclusions

• Why use MCMC?

? Flexible computing tool with the ability to fit complex models.
? No need to make simplified modeling assumptions out of convenience.
? Given posterior samples, can get all the benefits of Bayesian inference.

• Words of caution:

? Monitor convergence!
∗ Unfortunately, the most complex models tend to converge very slowly.
∗ Can try blocking and marginalization to decrease correlation of model parameters in the MCMC.
? Check whether your answers make sense: compare with plots and simple methods.
? Perform sensitivity analysis on the priors.

• Other book: Gelman, Carlin, Stern, and Rubin (1995), Bayesian Data Analysis.

MCMC OVERVIEW 16

References

Gelman A and Rubin DB (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7, 457–472.

Geman S and Geman D (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.

Geweke J (1992). Evaluation of accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics 4 (ed. JM Bernardo, J Berger, AP Dawid and AFM Smith), pp. 169–193. Oxford University Press.

Gilks WR, Richardson S, and Spiegelhalter DJ (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall.

Hastings WK (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.

Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH and Teller E (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics 21, 1087–1091.