
The Metropolis-Hastings Algorithm

Frank Schorfheide, University of Pennsylvania

EABCN Training School

May 10, 2016

Page 2: The Metropolis-Hastings Algorithm 2... · 2016. 5. 11. · The Metropolis-Hastings Algorithm Frank Schorfheide University of Pennsylvania EABCN Training School May 10, 2016

Markov Chain Monte Carlo (MCMC)

• Main idea: create a sequence of serially correlated draws such that the distribution of θ^i converges to the posterior distribution p(θ|Y).


Generic Metropolis-Hastings Algorithm

For i = 1 to N:

1. Draw ϑ from a density q(ϑ | θ^{i−1}).

2. Set θ^i = ϑ with probability

$$\alpha(\vartheta \mid \theta^{i-1}) = \min\left\{1,\; \frac{p(Y\mid\vartheta)\,p(\vartheta)\,/\,q(\vartheta\mid\theta^{i-1})}{p(Y\mid\theta^{i-1})\,p(\theta^{i-1})\,/\,q(\theta^{i-1}\mid\vartheta)}\right\}$$

and θ^i = θ^{i−1} otherwise.

Recall p(θ|Y) ∝ p(Y|θ) p(θ).

We draw θ^i conditional on the previous draw θ^{i−1}: this leads to a Markov transition kernel K(θ | θ̃).
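A minimal sketch of this algorithm in Python/NumPy (my illustration, not from the slides): log_post stands in for ln p(Y|θ) + ln p(θ) of an actual model, here a standard normal so the snippet runs, and the proposal q is a normal centered at the current draw, so the q terms in α cancel by symmetry.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta):
    # Stand-in for ln p(Y|theta) + ln p(theta); here a standard normal target
    return -0.5 * theta**2

def metropolis_hastings(log_post, theta0, n_draws, scale=1.0):
    draws = np.empty(n_draws)
    theta = theta0
    for i in range(n_draws):
        vartheta = rng.normal(theta, scale)      # draw from q(.|theta^{i-1})
        # ln alpha; the q terms cancel because this proposal is symmetric
        log_alpha = min(0.0, log_post(vartheta) - log_post(theta))
        if np.log(rng.uniform()) < log_alpha:    # accept with probability alpha
            theta = vartheta
        draws[i] = theta                         # otherwise theta^i = theta^{i-1}
    return draws

draws = metropolis_hastings(log_post, theta0=3.0, n_draws=10_000)
print(draws.mean(), draws.var())                 # should approach 0 and 1
```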


Invariance Property

• It can be shown that

$$p(\theta \mid Y) = \int K(\theta \mid \tilde\theta)\, p(\tilde\theta \mid Y)\, d\tilde\theta.$$

• Write

$$K(\theta \mid \tilde\theta) = u(\theta \mid \tilde\theta) + r(\tilde\theta)\,\delta_{\tilde\theta}(\theta).$$

• u(θ|θ̃) is the density kernel (note that u(θ|·) does not integrate to one) for accepted draws:

$$u(\theta \mid \tilde\theta) = \alpha(\theta \mid \tilde\theta)\, q(\theta \mid \tilde\theta).$$

• Rejection probability:

$$r(\tilde\theta) = \int \big[1 - \alpha(\theta \mid \tilde\theta)\big]\, q(\theta \mid \tilde\theta)\, d\theta = 1 - \int u(\theta \mid \tilde\theta)\, d\theta.$$


Invariance Property

• Reversibility: conditional on the sampler not rejecting the proposed draw, the density associated with a transition from θ̃ to θ is identical to the density associated with a transition from θ to θ̃:

$$
\begin{aligned}
p(\tilde\theta \mid Y)\, u(\theta \mid \tilde\theta)
&= p(\tilde\theta \mid Y)\, q(\theta \mid \tilde\theta)\, \min\left\{1,\; \frac{p(\theta \mid Y)/q(\theta \mid \tilde\theta)}{p(\tilde\theta \mid Y)/q(\tilde\theta \mid \theta)}\right\} \\
&= \min\left\{ p(\tilde\theta \mid Y)\, q(\theta \mid \tilde\theta),\; p(\theta \mid Y)\, q(\tilde\theta \mid \theta) \right\} \\
&= p(\theta \mid Y)\, q(\tilde\theta \mid \theta)\, \min\left\{ \frac{p(\tilde\theta \mid Y)/q(\tilde\theta \mid \theta)}{p(\theta \mid Y)/q(\theta \mid \tilde\theta)},\; 1 \right\} \\
&= p(\theta \mid Y)\, u(\tilde\theta \mid \theta).
\end{aligned}
$$

• Using the reversibility result, we can now verify the invariance property:

$$
\begin{aligned}
\int K(\theta \mid \tilde\theta)\, p(\tilde\theta \mid Y)\, d\tilde\theta
&= \int u(\theta \mid \tilde\theta)\, p(\tilde\theta \mid Y)\, d\tilde\theta + \int r(\tilde\theta)\, \delta_{\tilde\theta}(\theta)\, p(\tilde\theta \mid Y)\, d\tilde\theta \\
&= \int u(\tilde\theta \mid \theta)\, p(\theta \mid Y)\, d\tilde\theta + r(\theta)\, p(\theta \mid Y) \\
&= p(\theta \mid Y).
\end{aligned}
$$


A Discrete Example

• Suppose the parameter vector θ is scalar and takes only two values:

$$\Theta = \{\tau_1, \tau_2\}.$$

• The posterior distribution p(θ|Y) can be represented by a set of probabilities collected in the vector π, say π = [π1, π2] with π2 > π1.

• Suppose we obtain ϑ based on the transition matrix Q:

$$Q = \begin{bmatrix} q & (1-q) \\ (1-q) & q \end{bmatrix}.$$


Discrete MH Algorithm

• Iteration i: suppose that θ^{i−1} = τ_j. Based on the transition matrix

$$Q = \begin{bmatrix} q & (1-q) \\ (1-q) & q \end{bmatrix},$$

determine a proposed state ϑ = τ_s.

• With probability α(τ_s|τ_j) the proposed state is accepted. Set θ^i = ϑ = τ_s.

• With probability 1 − α(τ_s|τ_j) stay in the old state and set θ^i = θ^{i−1} = τ_j.

• Choose (the Q terms cancel because of symmetry)

$$\alpha(\tau_s \mid \tau_j) = \min\left\{1,\; \frac{\pi_s}{\pi_j}\right\}.$$


Discrete MH Algorithm: Transition Matrix

• The resulting chain's transition matrix is

$$K = \begin{bmatrix} q & (1-q) \\ (1-q)\dfrac{\pi_1}{\pi_2} & q + (1-q)\left(1 - \dfrac{\pi_1}{\pi_2}\right) \end{bmatrix}.$$

• Straightforward calculations reveal that the transition matrix K has eigenvalues

$$\lambda_1(K) = 1, \qquad \lambda_2(K) = q - (1-q)\,\frac{\pi_1}{1-\pi_1}.$$

• The equilibrium distribution is the eigenvector associated with the unit eigenvalue.

• For q ∈ [0, 1) the equilibrium distribution is unique.
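These claims are easy to check numerically; a small sketch in Python/NumPy, using the π1 = 0.2 setting from the illustrations below:

```python
import numpy as np

q, pi1 = 0.5, 0.2
pi2 = 1.0 - pi1

# Transition matrix K of the discrete MH chain on {tau_1, tau_2}
K = np.array([[q, 1.0 - q],
              [(1.0 - q) * pi1 / pi2, q + (1.0 - q) * (1.0 - pi1 / pi2)]])

# Eigenvalues: 1 and q - (1-q)*pi1/(1-pi1)
vals, vecs = np.linalg.eig(K.T)          # left eigenvectors of K
print(np.sort(vals.real))

# Equilibrium distribution: left eigenvector for the unit eigenvalue
stat = vecs[:, np.argmax(np.isclose(vals.real, 1.0))].real
print(stat / stat.sum())                 # recovers [pi1, pi2]
```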


Convergence

• The persistence of the Markov chain depends on the second eigenvalue, which in turn depends on the proposal distribution Q.

• Define the transformed parameter

$$\xi^i = \frac{\theta^i - \tau_1}{\tau_2 - \tau_1}.$$

• We can represent the Markov chain associated with ξ^i as a first-order autoregressive process

$$\xi^i = (1 - k_{22}) + \lambda_2(K)\,\xi^{i-1} + \nu^i.$$

• Conditional on ξ^{i−1} = j, j = 0, 1, the innovation ν^i has support on k_{jj} and (1 − k_{jj}), its conditional mean is equal to zero, and its conditional variance is equal to k_{jj}(1 − k_{jj}).


Convergence

• Autocovariance function of h(θ^i):

$$
COV\big(h(\theta^i), h(\theta^{i-l})\big)
= \big(h(\tau_2) - h(\tau_1)\big)^2\, \pi_1 (1-\pi_1) \left( q - (1-q)\,\frac{\pi_1}{1-\pi_1} \right)^{l}
= \mathbb{V}_\pi[h] \left( q - (1-q)\,\frac{\pi_1}{1-\pi_1} \right)^{l}.
$$

• If q = π1 then the autocovariances are equal to zero and the draws h(θ^i) are serially uncorrelated (in fact, in our simple discrete setting they are also independent).
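A quick simulation check of the serial-correlation claim (my sketch; h is the identity, so V_π[h] = π1(1−π1)):

```python
import numpy as np

rng = np.random.default_rng(0)
q, pi1 = 0.5, 0.2
lam2 = q - (1 - q) * pi1 / (1 - pi1)     # second eigenvalue of K

# Simulate the discrete MH chain, coding tau_1 as 0 and tau_2 as 1
n, state = 500_000, 1
xi = np.empty(n)
for i in range(n):
    prop = state if rng.uniform() < q else 1 - state   # proposal from Q
    # alpha = min{1, pi_s/pi_j}; only the tau_2 -> tau_1 move can be rejected
    ratio = pi1 / (1 - pi1) if (state == 1 and prop == 0) else 1.0
    if rng.uniform() < ratio:
        state = prop
    xi[i] = state

xc = xi - xi.mean()
acf1 = xc[1:] @ xc[:-1] / (xc @ xc)
print(acf1, "vs lambda_2 =", lam2)       # first-order autocorrelation ~ lambda_2
```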


Convergence

• Define the Monte Carlo estimate

$$\bar h_N = \frac{1}{N} \sum_{i=1}^{N} h(\theta^i).$$

• Deduce from the CLT that

$$\sqrt{N}\big(\bar h_N - \mathbb{E}_\pi[h]\big) \Longrightarrow N\big(0, \Omega(h)\big),$$

where Ω(h) is the long-run covariance matrix

$$\Omega(h) = \lim_{L\to\infty} \mathbb{V}_\pi[h] \left( 1 + 2 \sum_{l=1}^{L} \frac{L-l}{L} \left( q - (1-q)\,\frac{\pi_1}{1-\pi_1} \right)^{l} \right).$$

• In turn, the asymptotic inefficiency factor is given by

$$\text{InEff}_\infty = \frac{\Omega(h)}{\mathbb{V}_\pi[h]} = 1 + 2 \lim_{L\to\infty} \sum_{l=1}^{L} \frac{L-l}{L} \left( q - (1-q)\,\frac{\pi_1}{1-\pi_1} \right)^{l}.$$
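Since (L−l)/L → 1, the limit of the weighted sum is the geometric series λ2(K)/(1−λ2(K)), so InEff∞ = (1+λ2)/(1−λ2); that closed form is my simplification of the displayed limit, used in this small tabulation:

```python
pi1 = 0.2
for q in (0.0, 0.2, 0.5, 0.99):
    lam2 = q - (1 - q) * pi1 / (1 - pi1)   # second eigenvalue of K
    ineff = (1 + lam2) / (1 - lam2)        # = 1 + 2*lam2/(1 - lam2)
    print(f"q = {q:4.2f}: lambda_2 = {lam2:+.3f}, InEff_inf = {ineff:.3f}")
```

Note that q < π1 yields λ2 < 0 and InEff∞ < 1: the rejections then produce antithetic-like draws that are more accurate than iid sampling.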


Autocorrelation Function of θ^i, π1 = 0.2

[Figure: autocorrelation functions over lags 0 to 9 for q = 0.00, 0.20, 0.50, and 0.99.]


Asymptotic Inefficiency InEff∞, π1 = 0.2

[Figure: InEff∞ (log scale, 10^−1 to 10^2) as a function of q ∈ [0, 1].]


Small Sample Variance V[h̄_N] across Chains versus HAC Estimates of Ω(h)

[Figure: scatter plot on log-log scales of the small-sample variance across chains against HAC estimates. Solid line is the 45-degree line.]


Posterior Inference

• We discussed how to solve a DSGE model;

• and how to compute the likelihood function p(Y|θ) for a DSGE model.

• According to Bayes' Theorem,

$$p(\theta \mid Y) = \frac{p(Y \mid \theta)\, p(\theta)}{\int p(Y \mid \theta)\, p(\theta)\, d\theta}.$$

• We want to generate draws from the posterior...


Benchmark Random-Walk Metropolis-Hastings (RWMH) Algorithm for DSGE Models

• Initialization:

1. Use a numerical optimization routine to maximize the log posterior, which up to a constant is given by ln p(Y|θ) + ln p(θ). Denote the posterior mode by θ̂.

2. Let Σ̂ be the inverse of the (negative) Hessian computed at the posterior mode θ̂, which can be computed numerically.

3. Draw θ^0 from N(θ̂, c_0² Σ̂) or directly specify a starting value.

• Main Algorithm – For i = 1, ..., N:

1. Draw ϑ from the proposal distribution N(θ^{i−1}, c² Σ̂).

2. Set θ^i = ϑ with probability

$$\alpha(\vartheta \mid \theta^{i-1}) = \min\left\{1,\; \frac{p(Y \mid \vartheta)\, p(\vartheta)}{p(Y \mid \theta^{i-1})\, p(\theta^{i-1})}\right\}$$

and θ^i = θ^{i−1} otherwise.
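A sketch of the initialization and the main loop in Python/SciPy. The quadratic log_posterior is a stand-in for ln p(Y|θ) + ln p(θ) of an actual DSGE model, so the mode, Hessian, and chain are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.5, 1.0]])

def log_posterior(theta):
    # Stand-in for ln p(Y|theta) + ln p(theta)
    return -0.5 * theta @ A @ theta

# Initialization: maximize the log posterior; invert the negative Hessian
res = minimize(lambda th: -log_posterior(th), x0=np.ones(2), method="BFGS")
theta_hat = res.x                        # posterior mode
Sigma_hat = res.hess_inv                 # BFGS estimate of the inverse Hessian

# Main algorithm: random-walk proposal N(theta^{i-1}, c^2 Sigma_hat)
c, N = 0.4, 50_000
L = np.linalg.cholesky(Sigma_hat)
theta, draws, acc = theta_hat.copy(), np.empty((N, 2)), 0
for i in range(N):
    vartheta = theta + c * (L @ rng.standard_normal(2))
    if np.log(rng.uniform()) < log_posterior(vartheta) - log_posterior(theta):
        theta, acc = vartheta, acc + 1   # accept: theta^i = vartheta
    draws[i] = theta                     # reject: theta^i = theta^{i-1}
print("acceptance rate:", acc / N)
```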


Benchmark RWMH Algorithm for DSGE Models

• Initialization steps can be modified as needed for a particular application.

• If numerical optimization does not work well, one could let Σ̂ be a diagonal matrix with the prior variances on the diagonal.

• Or, Σ̂ could be based on a preliminary run of a posterior sampler.

• It is good practice to run multiple chains based on different starting values.

• For the subsequent illustrations we chose Σ̂ = V_π[θ], where the posterior variance matrix is obtained from a long MCMC run.


Observables for Small-Scale New Keynesian Model

[Figure: time series of Quarterly Output Growth, Quarterly Inflation, and the Federal Funds Rate over the estimation sample (mid-1980s to early 2000s).]

Notes: Output growth per capita is measured in quarter-on-quarter (Q-o-Q) percentages. Inflation is CPI inflation in annualized Q-o-Q percentages. The federal funds rate is the average annualized effective funds rate for each quarter.


Convergence of Monte Carlo Average τ̄_{N|N0}

[Figure: three panels showing the Monte Carlo average as a function of the number of draws N, up to 10^5.]

Notes: The x-axis indicates the number of draws N. N0 is set to 0, 25,000, and 50,000, respectively.


Posterior Estimates of DSGE Model Parameters

Parameter   Mean   [0.05, 0.95]      Parameter   Mean   [0.05, 0.95]
τ           2.83   [1.95, 3.82]      ρ_r         0.77   [0.71, 0.82]
κ           0.78   [0.51, 0.98]      ρ_g         0.98   [0.96, 1.00]
ψ_1         1.80   [1.43, 2.20]      ρ_z         0.88   [0.84, 0.92]
ψ_2         0.63   [0.23, 1.21]      σ_r         0.22   [0.18, 0.26]
r^(A)       0.42   [0.04, 0.95]      σ_g         0.71   [0.61, 0.84]
π^(A)       3.30   [2.78, 3.80]      σ_z         0.31   [0.26, 0.36]
γ^(Q)       0.52   [0.28, 0.74]

Notes: We generated N = 100,000 draws from the posterior and discarded the first 50,000 draws. Based on the remaining draws we approximated the posterior mean and the 5th and 95th percentiles.


Effect of Scaling Constant c

[Figure: three panels plotting the acceptance rate ᾱ, InEff∞ (log scale), and InEffN (log scale) against the scaling constant c ∈ [0, 2].]

Notes: Results are based on Nrun = 50 independent Markov chains. The acceptance rate (average across multiple chains), the HAC-based estimate of InEff∞[τ] (average across multiple chains), and InEffN[τ] are shown as a function of the scaling constant c.


Acceptance Rate α versus Inaccuracy InEffN

[Figure: scatter plot of InEffN[τ] (log scale, 10^1 to 10^5) against the acceptance rate ᾱ ∈ [0, 1].]

Notes: InEffN[τ] versus the acceptance rate ᾱ.


Impulse Responses of Exogenous Processes

[Figure: two panels over horizons t = 0 to 10: the response of g_t to ε_{g,t} and the response of z_t to ε_{z,t}.]

Notes: The figure depicts pointwise posterior means and 90% credible bands. The responses are in percent relative to the initial level.


Parameter Transformations: Impulse Responses

[Figure: 3-by-3 panels of impulse responses over horizons t = 0 to 10; rows: Output, Inflation, Federal Funds Rate; columns: ε_{g,t}, ε_{z,t}, ε_{r,t}.]

Notes: The figure depicts pointwise posterior means and 90% credible bands. The responses of output are in percent relative to the initial level, whereas the responses of inflation and interest rates are in annualized percentages.


Challenges Due to Irregular Posteriors

• A stylized state-space model:

$$y_t = [1 \;\; 1]\, s_t, \qquad s_t = \begin{bmatrix} \phi_1 & 0 \\ \phi_3 & \phi_2 \end{bmatrix} s_{t-1} + \begin{bmatrix} 1 \\ 0 \end{bmatrix} \epsilon_t, \qquad \epsilon_t \sim iid\, N(0, 1),$$

where

• structural parameters θ = [θ1, θ2]′, whose domain is the unit square;

• reduced-form parameters φ = [φ1, φ2, φ3]′ with

$$\phi_1 = \theta_1^2, \qquad \phi_2 = (1 - \theta_1^2), \qquad \phi_3 - \phi_2 = -\theta_1\theta_2.$$


Challenges Due to Irregular Posteriors

• s_{1,t} looks like an exogenous technology process.

• s_{2,t} evolves like an endogenous state variable, e.g., the capital stock.

• θ2 is not identifiable if θ1 = 0, because θ2 enters the model only multiplicatively.

• The law of motion of y_t is a restricted ARMA(2,1) process:

$$\big(1 - \theta_1^2 L\big)\big(1 - (1 - \theta_1^2) L\big)\, y_t = \big(1 - \theta_1\theta_2 L\big)\, \epsilon_t.$$

• Given θ1 and θ2, we obtain an observationally equivalent process by switching the values of the two roots of the autoregressive lag polynomial.

• Choose θ̃1 and θ̃2 such that

$$\tilde\theta_1 = \sqrt{1 - \theta_1^2}, \qquad \tilde\theta_2 = \theta_1\theta_2 / \tilde\theta_1.$$


Posteriors for Stylized State-Space Model

[Figure: two contour plots of the posterior in the (θ1, θ2) plane; left panel: local identification problem; right panel: global identification problem.]

Notes: Intersections of the solid lines indicate the parameter values that were used to generate the data from which the posteriors are constructed. Left panel: θ1 = 0.1 and θ2 = 0.5. Right panel: θ1 = 0.8, θ2 = 0.3.


Improvements to MCMC: Blocking

• In high-dimensional parameter spaces the RWMH algorithm generates highly persistent Markov chains.

• What's bad about persistence?

$$\sqrt{N}\big(\bar h_N - \mathbb{E}[\bar h_N]\big) \Longrightarrow N\left(0,\; \frac{1}{N}\sum_{i=1}^{N} \mathbb{V}[h(\theta^i)] + \frac{1}{N}\sum_{i=1}^{N}\sum_{j\neq i} COV\big[h(\theta^i), h(\theta^j)\big]\right).$$

• Potential remedy:

  • Partition θ = [θ1, ..., θK].

  • Iterate over the conditional posteriors p(θ_k | Y, θ_{−k}), where θ_{−k} collects the remaining blocks.

• To reduce the persistence of the chain, try to find partitions such that parameters are strongly correlated within blocks and weakly correlated across blocks, or use random blocking.


Block MH Algorithm

Draw θ^0 ∈ Θ and then for i = 1 to N:

1. Create a partition B^i of the parameter vector into N_blocks blocks θ_1, ..., θ_{N_blocks} via some rule (perhaps probabilistic), unrelated to the current state of the Markov chain.

2. For b = 1, ..., N_blocks:

  1. Draw ϑ_b ∼ q(· | [θ^i_{<b}, θ^{i−1}_b, θ^{i−1}_{>b}]).

  2. With probability

$$\alpha = \min\left\{ \frac{p\big([\theta^i_{<b}, \vartheta_b, \theta^{i-1}_{>b}] \mid Y\big)\; q\big(\theta^{i-1}_b \mid [\theta^i_{<b}, \vartheta_b, \theta^{i-1}_{>b}]\big)}{p\big([\theta^i_{<b}, \theta^{i-1}_b, \theta^{i-1}_{>b}] \mid Y\big)\; q\big(\vartheta_b \mid [\theta^i_{<b}, \theta^{i-1}_b, \theta^{i-1}_{>b}]\big)},\; 1 \right\},$$

set θ^i_b = ϑ_b, otherwise set θ^i_b = θ^{i−1}_b.
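A sketch of one sweep (iteration i) of this scheme, using a random-walk step within each block so the q terms cancel; block_mh_sweep, log_post, and the blocking are illustrative names and choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta):
    # Stand-in for ln p(theta|Y) up to a constant
    return -0.5 * theta @ theta

def block_mh_sweep(theta, blocks, scale=0.5):
    """One iteration i: update block b given updated blocks <b and old blocks >b."""
    theta = theta.copy()
    for idx in blocks:                             # idx: indices of block b
        vartheta = theta.copy()
        vartheta[idx] += scale * rng.standard_normal(idx.size)
        # Symmetric within-block proposal: alpha reduces to the posterior ratio
        if np.log(rng.uniform()) < log_post(vartheta) - log_post(theta):
            theta = vartheta                       # theta^i_b = vartheta_b
    return theta

theta = np.zeros(6)
blocks = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
for i in range(1_000):
    theta = block_mh_sweep(theta, blocks)
```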


Random-Block MH Algorithm

1. Generate a sequence of random partitions {B^i}_{i=1}^N of the parameter vector θ into N_blocks equally sized blocks, denoted by θ_b, b = 1, ..., N_blocks, as follows (see the sketch below):

  1. assign an iid U[0, 1] draw to each element of θ;

  2. sort the parameters according to the assigned random numbers;

  3. let the b-th block consist of parameters (b−1) d/N_blocks + 1, ..., b d/N_blocks, where d is the number of parameters.¹

2. Execute the Block MH Algorithm.

¹ If the number of parameters is not divisible by N_blocks, then the size of a subset of the blocks has to be adjusted.
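A sketch of the partitioning step (random_blocks is an illustrative helper name); np.array_split handles the adjustment described in the footnote automatically:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_blocks(n_params, n_blocks):
    u = rng.uniform(size=n_params)          # 1. iid U[0,1] draw per parameter
    order = np.argsort(u)                   # 2. sort parameters by the draws
    return np.array_split(order, n_blocks)  # 3. cut into n_blocks blocks

print(random_blocks(7, 3))                  # e.g. index blocks of sizes 3, 2, 2
```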


Metropolis-Adjusted Langevin Algorithm

• The proposal distribution of the Metropolis-Adjusted Langevin (MAL) algorithm is given by

$$\mu(\theta^{i-1}) = \theta^{i-1} + \frac{c_1}{2}\, M_1\, \frac{\partial}{\partial\theta} \ln p(\theta \mid Y)\Big|_{\theta=\theta^{i-1}}, \qquad \Sigma(\theta^{i-1}) = c_2^2\, M_2;$$

that is, θ^{i−1} is adjusted by a step in the direction of the gradient of the log posterior density function.

• One standard practice is to set M1 = M2 = M, with

$$M = -\left[ \frac{\partial^2}{\partial\theta\,\partial\theta'} \ln p(\theta \mid Y)\Big|_{\theta=\hat\theta} \right]^{-1},$$

where θ̂ is the mode of the posterior distribution obtained using a numerical optimization routine.
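A sketch of one MAL step under these choices (a toy Gaussian posterior so the gradient and M are available in closed form; all names are illustrative). Because the proposal mean depends on the current draw, the q terms no longer cancel in α:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.5, 1.0]])            # toy posterior precision

def log_post(theta):      return -0.5 * theta @ A @ theta
def grad_log_post(theta): return -A @ theta

M = np.linalg.inv(A)                               # M1 = M2 = M = -(Hessian)^{-1}
c1, c2 = 0.4, 0.75
cov = c2**2 * M
L = np.linalg.cholesky(cov)

def proposal_mean(theta):
    return theta + 0.5 * c1 * (M @ grad_log_post(theta))   # gradient step

def log_q(to, frm):                                # ln q(to|frm), up to a constant
    z = to - proposal_mean(frm)
    return -0.5 * z @ np.linalg.solve(cov, z)

def mal_step(theta):
    vartheta = proposal_mean(theta) + L @ rng.standard_normal(theta.size)
    log_alpha = (log_post(vartheta) + log_q(theta, vartheta)
                 - log_post(theta) - log_q(vartheta, theta))
    return vartheta if np.log(rng.uniform()) < log_alpha else theta
```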


Newton MH Algorithm

• The Newton MH Algorithm replaces the Hessian evaluated at the posterior mode θ̂ by the Hessian evaluated at θ^{i−1}.

• The proposal distribution is given by

$$\mu(\theta^{i-1}) = \theta^{i-1} - s \left[ \frac{\partial^2}{\partial\theta\,\partial\theta'} \ln p(\theta \mid Y)\Big|_{\theta=\theta^{i-1}} \right]^{-1} \frac{\partial}{\partial\theta} \ln p(\theta \mid Y)\Big|_{\theta=\theta^{i-1}},$$

$$\Sigma(\theta^{i-1}) = -c_2^2 \left[ \frac{\partial^2}{\partial\theta\,\partial\theta'} \ln p(\theta \mid Y)\Big|_{\theta=\theta^{i-1}} \right]^{-1}.$$

• It is useful to let s be drawn independently of θ^{i−1}:

$$c_1 = 2s, \qquad s \sim iid\, U[0, \bar s],$$

where s̄ is a tuning parameter.
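A sketch of the proposal construction (grad_log_post and hess_log_post are illustrative; for the toy Gaussian posterior of the previous sketch the Hessian is constant, which keeps the example runnable):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.5, 1.0]])            # toy posterior precision

def grad_log_post(theta): return -A @ theta
def hess_log_post(theta): return -A               # Hessian of ln p(theta|Y)

def newton_proposal(theta, s_bar=0.7, c2=0.6):
    H_inv = np.linalg.inv(hess_log_post(theta))   # evaluated at theta^{i-1}
    s = rng.uniform(0.0, s_bar)                   # s ~ U[0, s_bar], indep. of theta
    mu = theta - s * (H_inv @ grad_log_post(theta))   # Newton step
    Sigma = -c2**2 * H_inv                        # PD whenever the Hessian is ND
    return rng.multivariate_normal(mu, Sigma)
```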


Run Times and Tuning Constants for MH Algorithms

Algorithm            Run Time [hh:mm:ss]   Acceptance Rate   Tuning Constants
1-Block RWMH-I       00:01:13              0.28              c = 0.015
1-Block RWMH-V       00:01:13              0.37              c = 0.400
3-Block RWMH-I       00:03:38              0.40              c = 0.070
3-Block RWMH-V       00:03:36              0.43              c = 1.200
3-Block MAL          00:54:12              0.43              c1 = 0.400, c2 = 0.750
3-Block Newton MH    03:01:40              0.53              s̄ = 0.700, c2 = 0.600

Notes: In each run we generate N = 100,000 draws. We report the fastest run time and the average acceptance rate across Nrun = 50 independent Markov chains.


Autocorrelation Function of τ^i

[Figure: autocorrelation functions up to lag 40 for 1-Block RWMH-V, 1-Block RWMH-I, 3-Block RWMH-V, 3-Block RWMH-I, 3-Block MAL, and 3-Block Newton MH.]

Notes: The autocorrelation functions are computed based on a single run of each algorithm.


Inefficiency Factor InEffN [τ ]

[Figure: InEffN[τ] (log scale, 10^0 to 10^5) for 3-Block MAL, 3-Block Newton MH, 3-Block RWMH-V, 1-Block RWMH-V, 3-Block RWMH-I, and 1-Block RWMH-I.]

Notes: The small-sample inefficiency factors are computed based on Nrun = 50 independent runs of each algorithm.


IID Equivalent Draws Per Second

$$\text{iid-equivalent draws per second} = \frac{N}{\text{Run Time [seconds]}} \cdot \frac{1}{\text{InEff}_N}.$$

Algorithm            Draws Per Second
1-Block RWMH-V       7.76
3-Block RWMH-V       5.65
3-Block MAL          1.24
3-Block RWMH-I       0.14
3-Block Newton MH    0.13
1-Block RWMH-I       0.04


Performance of Different MH Algorithms

[Figure: four scatter plots on log-log scales (10^−10 to 10^0), one each for RWMH-V (1 Block), RWMH-V (3 Blocks), MAL, and Newton.]

Notes: Each panel contains scatter plots of the small-sample variance V[θ̄] computed across multiple chains (x-axis) versus the HAC[h] estimates of Ω(θ)/N (y-axis).


Recall: Posterior Odds and Marginal Data Densities

• Posterior model probabilities can be computed as follows:

$$\pi_{i,T} = \frac{\pi_{i,0}\, p(Y \mid M_i)}{\sum_j \pi_{j,0}\, p(Y \mid M_j)}, \qquad i = 1, \ldots, 2, \tag{1}$$

• where

$$p(Y \mid M) = \int p(Y \mid \theta, M)\, p(\theta \mid M)\, d\theta. \tag{2}$$

• Note:

$$\ln p(Y_{1:T} \mid M) = \sum_{t=1}^{T} \ln \int p(y_t \mid \theta, Y_{1:t-1}, M)\, p(\theta \mid Y_{1:t-1}, M)\, d\theta.$$

• Posterior odds and the Bayes factor:

$$\frac{\pi_{1,T}}{\pi_{2,T}} = \underbrace{\frac{\pi_{1,0}}{\pi_{2,0}}}_{\text{Prior Odds}} \times \underbrace{\frac{p(Y \mid M_1)}{p(Y \mid M_2)}}_{\text{Bayes Factor}}. \tag{3}$$


Computation of Marginal Data Densities

• Reciprocal importance sampling:

  • Geweke's modified harmonic mean estimator

  • Sims, Waggoner, and Zha's estimator

• Chib and Jeliazkov's estimator

• For a survey, see Ardia, Hoogerheide, and van Dijk (2009).


Modified Harmonic Mean

• Reciprocal importance samplers are based on the following identity:

$$\frac{1}{p(Y)} = \int \frac{f(\theta)}{p(Y \mid \theta)\, p(\theta)}\, p(\theta \mid Y)\, d\theta, \tag{4}$$

where ∫ f(θ) dθ = 1.

• Conditional on the choice of f(θ), an obvious estimator is

$$\hat p_G(Y) = \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{f(\theta^i)}{p(Y \mid \theta^i)\, p(\theta^i)} \right]^{-1}, \tag{5}$$

where θ^i is drawn from the posterior p(θ|Y).

• Geweke (1999):

$$f(\theta) = \tau^{-1} (2\pi)^{-d/2} |V_\theta|^{-1/2} \exp\left[ -0.5\, (\theta - \bar\theta)' V_\theta^{-1} (\theta - \bar\theta) \right] \times \mathbb{I}\left\{ (\theta - \bar\theta)' V_\theta^{-1} (\theta - \bar\theta) \le F_{\chi^2_d}^{-1}(\tau) \right\}, \tag{6}$$

i.e., a normal density with mean θ̄ and covariance V_θ, truncated to the ellipse that has probability τ under that normal.
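A sketch of the estimator (my implementation of the formula above; draws is an N-by-d matrix of posterior draws and log_kernel the corresponding values of ln p(Y|θ^i) + ln p(θ^i)):

```python
import numpy as np
from scipy.stats import chi2

def geweke_mhm(draws, log_kernel, tau=0.9):
    """Modified-harmonic-mean estimate of ln p(Y) from posterior draws."""
    N, d = draws.shape
    theta_bar = draws.mean(axis=0)                 # posterior mean
    V = np.cov(draws, rowvar=False)                # posterior covariance
    dev = draws - theta_bar
    quad = np.einsum("ij,jk,ik->i", dev, np.linalg.inv(V), dev)
    log_f = (-np.log(tau) - 0.5 * d * np.log(2.0 * np.pi)
             - 0.5 * np.linalg.slogdet(V)[1] - 0.5 * quad)
    # f(theta)/[p(Y|theta)p(theta)] on the truncation region, zero outside
    log_ratio = np.where(quad <= chi2.ppf(tau, df=d), log_f - log_kernel, -np.inf)
    m = log_ratio.max()                            # log-sum-exp for stability
    return -(m + np.log(np.mean(np.exp(log_ratio - m))))
```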


Chib and Jeliazkov

• Rewrite Bayes' Theorem:

$$p(Y) = \frac{p(Y \mid \theta)\, p(\theta)}{p(\theta \mid Y)}. \tag{7}$$

• Thus,

$$\hat p_{CS}(Y) = \frac{p(Y \mid \tilde\theta)\, p(\tilde\theta)}{\hat p(\tilde\theta \mid Y)}, \tag{8}$$

where we replaced the generic θ in (7) by the posterior mode θ̃.


Chib and Jeliazkov

• Use the output of the Metropolis-Hastings Algorithm.

• Proposal density for the transition θ ↦ θ̃: q(θ, θ̃|Y).

• Probability of accepting the proposed draw:

$$\alpha(\theta, \tilde\theta \mid Y) = \min\left\{ 1,\; \frac{p(\tilde\theta \mid Y)/q(\theta, \tilde\theta \mid Y)}{p(\theta \mid Y)/q(\tilde\theta, \theta \mid Y)} \right\}.$$

• Note that

$$
\begin{aligned}
\int \alpha(\theta, \tilde\theta \mid Y)\, q(\theta, \tilde\theta \mid Y)\, p(\theta \mid Y)\, d\theta
&= \int \min\left\{ 1,\; \frac{p(\tilde\theta \mid Y)/q(\theta, \tilde\theta \mid Y)}{p(\theta \mid Y)/q(\tilde\theta, \theta \mid Y)} \right\} q(\theta, \tilde\theta \mid Y)\, p(\theta \mid Y)\, d\theta \\
&= p(\tilde\theta \mid Y) \int \min\left\{ \frac{p(\theta \mid Y)/q(\tilde\theta, \theta \mid Y)}{p(\tilde\theta \mid Y)/q(\theta, \tilde\theta \mid Y)},\; 1 \right\} q(\tilde\theta, \theta \mid Y)\, d\theta \\
&= p(\tilde\theta \mid Y) \int \alpha(\tilde\theta, \theta \mid Y)\, q(\tilde\theta, \theta \mid Y)\, d\theta.
\end{aligned}
$$


Chib and Jeliazkov

• The posterior density at the mode can be approximated as follows:

$$\hat p(\tilde\theta \mid Y) = \frac{\frac{1}{N} \sum_{i=1}^{N} \alpha(\theta^i, \tilde\theta \mid Y)\, q(\theta^i, \tilde\theta \mid Y)}{\frac{1}{J} \sum_{j=1}^{J} \alpha(\tilde\theta, \theta^j \mid Y)}, \tag{9}$$

• where θ^i are posterior draws obtained with the M-H Algorithm;

• and θ^j are additional draws from q(θ̃, θ|Y) given the fixed value θ̃.
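A sketch of (9) for a random-walk chain with proposal q(θ, ·|Y) = N(θ, c²Σ̂), so that the q terms inside α cancel; log_post is ln p(θ|Y) up to the unknown constant ln p(Y), which drops out of α (all names are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def chib_jeliazkov(draws, log_post, theta_tilde, cov, J=10_000, seed=0):
    """Estimate ln p(theta_tilde|Y) from MH output via eq. (9)."""
    rng = np.random.default_rng(seed)
    q = multivariate_normal(mean=theta_tilde, cov=cov)
    lp_tilde = log_post(theta_tilde)

    # Numerator: mean over posterior draws of alpha(theta^i, theta_tilde) * q(...)
    lp_draws = np.array([log_post(t) for t in draws])
    alpha_num = np.exp(np.minimum(0.0, lp_tilde - lp_draws))   # min{1, ratio}
    num = np.mean(alpha_num * q.pdf(draws))

    # Denominator: mean over theta^j ~ q of alpha(theta_tilde, theta^j)
    theta_j = q.rvs(size=J, random_state=rng)
    lp_j = np.array([log_post(t) for t in theta_j])
    den = np.mean(np.exp(np.minimum(0.0, lp_j - lp_tilde)))
    return np.log(num) - np.log(den)
```

Combining this with (8) then gives ln p̂(Y) = ln p(Y|θ̃) + ln p(θ̃) − ln p̂(θ̃|Y).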


MH-Based Marginal Data Density Estimates

Model                 Mean(ln p̂(Y))   Std. Dev.(ln p̂(Y))
Geweke (τ = 0.5)      −346.17          0.03
Geweke (τ = 0.9)      −346.10          0.04
SWZ (q = 0.5)         −346.29          0.03
SWZ (q = 0.9)         −346.31          0.02
Chib and Jeliazkov    −346.20          0.40

Notes: The table shows the mean and standard deviation of the log marginal data density estimators, computed over Nrun = 50 runs of the RWMH-V sampler using N = 100,000 draws and discarding a burn-in sample of N0 = 50,000 draws. The SWZ estimator uses J = 100,000 draws to compute τ, while the CJ estimator uses J = 100,000 draws to compute the denominator of p̂(θ̃|Y).
