Probabilistic Graphical Models
Advanced MCMC Methods: Optimization + MCMC
Eric Xing, Lecture 10, February 12, 2020
Reading: https://arxiv.org/pdf/1206.1901.pdf
[Figure: a small graphical model over X1, X2, X3 with transition probabilities (0.25/0.75, 0.7/0.3, 0.5/0.5) on its edges.]
© Eric Xing @ CMU, 2005-2020 1
Random walk in MCMC
The Metropolis–Hastings acceptance probability for a proposed move x_old → x_new, with x_new drawn from the proposal Q(x_new | x_old):

  min{ 1, [P(x_new) Q(x_old | x_new)] / [P(x_old) Q(x_new | x_old)] }

Might reject a lot of samples.
Random walk in MCMC
The same acceptance probability applies:

  min{ 1, [P(x_new) Q(x_old | x_new)] / [P(x_old) Q(x_new | x_old)] }

If the variance of Q is small, then the next sample may be highly correlated with the previous one.
MCMC: Recap
- Random walk can have a poor acceptance rate
- The samples can be highly correlated with one another, reducing the effective sample size
- Can we have a better proposal?
  - Using gradient information
  - Using an approximation of the given probability distribution
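Both failure modes can be seen in a few lines of code. A minimal sketch (the standard-normal target and the two step sizes are illustrative choices, not from the slides): a Gaussian random-walk proposal, where a large step is rejected often while a small step is accepted almost always at the cost of correlated samples.

```python
import numpy as np

def random_walk_metropolis(log_p, x0, step, n, rng):
    """Metropolis with symmetric proposal Q(x'|x) = N(x, step^2)."""
    x, samples, accepted = x0, [], 0
    for _ in range(n):
        x_prop = x + step * rng.standard_normal()
        # Q is symmetric, so the MH ratio reduces to P(x_prop) / P(x).
        if np.log(rng.uniform()) < log_p(x_prop) - log_p(x):
            x, accepted = x_prop, accepted + 1
        samples.append(x)
    return np.array(samples), accepted / n

rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x**2                     # standard normal target
_, rate_small = random_walk_metropolis(log_p, 0.0, 0.1, 5000, rng)
_, rate_large = random_walk_metropolis(log_p, 0.0, 10.0, 5000, rng)
# Small steps: high acceptance but a slow, correlated walk;
# large steps: many rejected samples.
```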
Hamiltonian Monte Carlo
- Hamiltonian dynamics (1959)
  - A deterministic system
- Hybrid Monte Carlo (1987)
  - United MCMC and molecular dynamics
- Statistical application (1993)
  - Inference in neural networks
  - Improves acceptance rate
  - Uncorrelated samples
Target distribution:

  P(x) = exp(-E(x)) / Z

The Hamiltonian:

  H(x, p) = E(x) + K(p),  with  K(p) = pᵀp / 2

Hamiltonian dynamics:

  ẋ = p,  ṗ = -∂E(x)/∂x

Auxiliary distribution:

  P_H(x, p) = exp(-E(x) - K(p)) / Z_H
Hamiltonian Dynamics
- Position vector q, momentum vector p
- Kinetic energy K(p)
- Potential energy U(q)
- Define H(q, p) = K(p) + U(q)
Hamiltonian Dynamics
- Position vector q, momentum vector p
- Kinetic energy K(p)
- Potential energy U(q)
- Define H(q, p) = K(p) + U(q)
- Hamiltonian dynamics (in alternative notation):

    dq_i/dt = ∂H/∂p_i
    dp_i/dt = -∂H/∂q_i

- Can help: the gradient of U over q guides the draw of the next sample!
Hamiltonian Dynamics: Example
- Kinetic energy K(p) = |p|²/2
- Potential energy U(q) = q²/2
- So:  dq/dt = p,  dp/dt = -q
- And the dynamics trace circles in (q, p):  q(t) = r cos(a + t),  p(t) = -r sin(a + t)
How to compute updates: Euler’s Method
For this example, one Euler step is the linear map

  [ q(t+ε) ]   [  1   ε ] [ q(t) ]
  [ p(t+ε) ] = [ -ε   1 ] [ p(t) ]

Its determinant is 1 + ε² > 1, so the trajectory diverges!
How to compute updates: Leapfrog Method
- The update looks like

    p(t + ε/2) = p(t) - (ε/2) ∂U/∂q (q(t))
    q(t + ε)   = q(t) + ε p(t + ε/2)
    p(t + ε)   = p(t + ε/2) - (ε/2) ∂U/∂q (q(t + ε))

Each update is a shear transformation → volume preserving
Leapfrog Vs Euler
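The contrast can be checked numerically. A sketch (step size ε = 0.3 and 200 steps are arbitrary choices) integrating the harmonic-oscillator example U(q) = q²/2, K(p) = p²/2 with both methods: Euler's energy blows up, while leapfrog's stays close to the initial H.

```python
import numpy as np

def grad_U(q):                 # U(q) = q^2 / 2  =>  dU/dq = q
    return q

def euler_step(q, p, eps):
    # Naive Euler: both updates use the state at time t.
    return q + eps * p, p - eps * grad_U(q)

def leapfrog_step(q, p, eps):
    # Half-step momentum, full-step position, half-step momentum.
    p = p - 0.5 * eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

def H(q, p):                   # total energy
    return 0.5 * q**2 + 0.5 * p**2

q_e, p_e = 1.0, 0.0
q_l, p_l = 1.0, 0.0
for _ in range(200):
    q_e, p_e = euler_step(q_e, p_e, 0.3)
    q_l, p_l = leapfrog_step(q_l, p_l, 0.3)
# Euler scales |(q, p)| by sqrt(1 + eps^2) every step, so H grows without
# bound; leapfrog is volume preserving and its energy error stays bounded.
```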
MCMC from Hamiltonian Dynamics
- Let q be the variable of interest (e.g., latent parameters of a model)
- Define the potential energy as a negative log probability:

    U(q) = -log[ π(q) L(q | D) ]

- where π(q) denotes the prior and L(q | D) denotes the data likelihood
- Key idea: use Hamiltonian dynamics to propose the next step.
MCMC from Hamiltonian Dynamics
- Given q₀ (starting state)
- Draw momentum p ~ N(0, 1)
- Use L steps of leapfrog to propose the next state (q*, p*)
- Accept/reject based on the change in the Hamiltonian: accept with probability min{1, exp(H(q, p) - H(q*, p*))}

Each iteration of the HMC algorithm has two steps. The first changes only the momentum; the second may change both position and momentum. Both steps leave the canonical joint distribution of (q, p) invariant, and hence their combination also leaves this distribution invariant.
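The steps above can be sketched in a few lines. This is a minimal one-dimensional illustration, not the slides' implementation; the standard-normal target and the values of ε and L are arbitrary choices.

```python
import numpy as np

def hmc(U, grad_U, q0, eps, L, n_samples, rng):
    """HMC with K(p) = p^2/2: resample momentum, run L leapfrog steps,
    then accept with probability min{1, exp(H_old - H_new)}."""
    q, samples = q0, []
    for _ in range(n_samples):
        p = rng.standard_normal()                 # draw p ~ N(0, 1)
        q_new, p_new = q, p
        p_new -= 0.5 * eps * grad_U(q_new)        # half step for momentum
        for step in range(L):
            q_new += eps * p_new                  # full step for position
            if step < L - 1:
                p_new -= eps * grad_U(q_new)
        p_new -= 0.5 * eps * grad_U(q_new)        # final half step
        H_old = U(q) + 0.5 * p**2
        H_new = U(q_new) + 0.5 * p_new**2
        if np.log(rng.uniform()) < H_old - H_new:  # Metropolis correction
            q = q_new
        samples.append(q)
    return np.array(samples)

rng = np.random.default_rng(1)
# Target: standard normal, so U(q) = q^2/2 and grad_U(q) = q.
draws = hmc(lambda q: 0.5 * q**2, lambda q: q, 0.0, 0.2, 20, 2000, rng)
```

Because leapfrog nearly conserves H, the acceptance rate stays high even with long trajectories, which is exactly the advantage over the random walk.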
MCMC from Hamiltonian Dynamics
- Detailed balance is satisfied
- Ergodic
- The canonical distribution is invariant
2D Gaussian Example
100D Gaussian Example
Acceptance Rate
- 2D example:   HMC 91%,  Random Walk 63%
- 100D example: HMC 87%,  Random Walk 25%
Langevin Dynamics
Langevin dynamics is HMC with one leapfrog step only, written all at once: with fresh momentum p ~ N(0, I),

  q* = q - (ε²/2) ∂U/∂q (q) + ε p
  p* = p - (ε/2) [ ∂U/∂q (q) + ∂U/∂q (q*) ]
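With a Metropolis correction, this one-leapfrog-step sampler is the Metropolis-adjusted Langevin algorithm (MALA). A sketch (the standard-normal target and step size are illustrative assumptions): the proposal mean q + (ε²/2)∇log p(q) comes from the position update above, and the correction must use the now-asymmetric proposal density.

```python
import numpy as np

def mala_step(log_p, grad_log_p, x, eps, rng):
    """One Langevin step: drift by (eps^2/2) * grad log p, add N(0, eps^2)
    noise, then Metropolis-correct with the asymmetric proposal density."""
    def log_q(to, frm):        # log N(to; frm + (eps^2/2) grad log p(frm), eps^2)
        mu = frm + 0.5 * eps**2 * grad_log_p(frm)
        return -0.5 * ((to - mu) / eps) ** 2
    x_prop = x + 0.5 * eps**2 * grad_log_p(x) + eps * rng.standard_normal()
    log_alpha = (log_p(x_prop) + log_q(x, x_prop)) - (log_p(x) + log_q(x_prop, x))
    return x_prop if np.log(rng.uniform()) < log_alpha else x

rng = np.random.default_rng(2)
log_p = lambda x: -0.5 * x**2          # standard normal target
grad_log_p = lambda x: -x
x, chain = 0.0, []
for _ in range(5000):
    x = mala_step(log_p, grad_log_p, x, 1.0, rng)
    chain.append(x)
chain = np.array(chain)
```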
Stochastic Gradient Langevin Dynamics
- For large datasets it is hard to compute the whole gradient
- Calculate it using a subset of the data
Stochastic Gradient Langevin Dynamics: Bayesian Models
- Posterior:  p(θ | x₁, …, x_N) ∝ p(θ) ∏_{i=1}^{N} p(xᵢ | θ)
- SGLD update (with minibatch size n):

    Δθ_t = (ε_t/2) ( ∇log p(θ_t) + (N/n) Σ_{i=1}^{n} ∇log p(x_{ti} | θ_t) ) + η_t,   η_t ~ N(0, ε_t)
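The update can be sketched directly. A toy example (the conjugate-Gaussian model, step size, and batch size are illustrative assumptions, not from the slides): sampling the posterior mean of a Gaussian with a N(0, 1) prior, using minibatch gradients rescaled by N/n and injected noise η_t ~ N(0, ε_t), with no accept/reject step.

```python
import numpy as np

def sgld_step(theta, batch, N, eps, grad_log_prior, grad_log_lik, rng):
    """One SGLD update: stochastic gradient of the log posterior plus noise."""
    grad = grad_log_prior(theta) + (N / len(batch)) * grad_log_lik(theta, batch)
    return theta + 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal()

rng = np.random.default_rng(3)
data = 2.0 + rng.standard_normal(100)            # x_i ~ N(2, 1), N = 100
# Model: theta ~ N(0, 1), x_i | theta ~ N(theta, 1).
grad_log_prior = lambda th: -th
grad_log_lik = lambda th, xs: np.sum(xs - th)

theta, trace = 0.0, []
for _ in range(5000):
    batch = rng.choice(data, size=10, replace=False)   # minibatch, n = 10
    theta = sgld_step(theta, batch, len(data), 1e-3,
                      grad_log_prior, grad_log_lik, rng)
    trace.append(theta)
trace = np.array(trace[1000:])                   # drop burn-in
# Exact posterior here is N(sum(x) / (N + 1), 1 / (N + 1)).
```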
Stochastic Gradient Langevin Dynamics
- High variance in the stochastic gradient
- Take help from the optimization community
Conclusion
- HMC can improve the acceptance rate and give better mixing
- Stochastic variants can be used to improve performance in large-dataset scenarios
- HMC cannot be applied directly to discrete variables
Supplementary
Variational MCMC
Sequential Monte Carlo
Towards better proposal
- Q(x_new | x_old) determines how quickly the chain converges
- Idea: let a variational approximation of P(X) be the proposal distribution
Variational Inference: Recap
- Interested in the posterior of parameters P(θ | D)
- Using Jensen's inequality:

    log P(D) ≥ E_{q(θ)}[ log P(D, θ) ] - E_{q(θ)}[ log q(θ) ]

- Choose q(θ; λ), where λ is the variational parameter
- Replace P(D | θ) with P(D | θ, φ), where φ is another set of variational parameters
- Using this we can easily obtain an un-normalized bound for the posterior:

    P(θ | D) ≥ P_VB(θ | D, λ, φ)
Variational MCMC
- Idea: let a variational approximation of P(X) be the proposal distribution

    Q(x_new | x_old) = P_VB(θ | D, λ, φ)

- Issues:
  - Low acceptance in high dimensions
  - Works well only if P_VB is close to P
Variational MCMC
- Design the proposal in blocks to take care of correlated variables
- Use a mixture of random walk and variational approximation as the proposal distribution
- Stochastic variational methods can now be used in estimating P_VB(θ | D, λ, φ)
Conclusion
- Adapting the proposal distribution can be helpful in:
  - Increasing mixing
  - Decreasing time to convergence
  - Increasing the acceptance rate
  - Getting uncorrelated samples
Recall: weighted resampling
- Sampling importance resampling (SIR):
  1. Draw N samples from Q: X₁ … X_N
  2. Construct weights w₁ … w_N, where

       w_m = [ P(X_m) / Q(X_m) ] / Σ_ℓ [ P(X_ℓ) / Q(X_ℓ) ]

  3. Sub-sample x from {X₁ … X_N} w.p. (w₁ … w_N)
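The three steps above can be sketched as follows (the N(1, 1) target and the wider N(0, 4) proposal are illustrative choices), using self-normalized log weights so that unnormalized densities and large N stay numerically stable:

```python
import numpy as np

def sir(log_p, log_q, sample_q, n, rng):
    """Sampling importance resampling: draw from Q, weight by P/Q, resample."""
    xs = sample_q(n, rng)                       # 1. draw N samples from Q
    log_w = log_p(xs) - log_q(xs)               # 2. unnormalized log weights
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                #    self-normalize
    return xs[rng.choice(n, size=n, p=w)]       # 3. sub-sample w.p. w_m

rng = np.random.default_rng(4)
log_p = lambda x: -0.5 * (x - 1.0) ** 2         # target N(1, 1), up to a constant
log_q = lambda x: -x**2 / 8.0                   # proposal N(0, 4), up to a constant
sample_q = lambda n, rng: 2.0 * rng.standard_normal(n)
resampled = sir(log_p, log_q, sample_q, 20000, rng)
```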
Sequential MC: Sketch of Particle Filters
- The starting point:

    p(X_{t+1} | Y_{1:t}) = ∫ p(X_{t+1} | X_t) p(X_t | Y_{1:t}) dX_t

    p(X_t | Y_{1:t}) = p(Y_t | X_t) p(X_t | Y_{1:t-1}) / ∫ p(Y_t | X_t) p(X_t | Y_{1:t-1}) dX_t

- Thus p(X_t | Y_{1:t}) is represented by the weighted particle set

    { X_t^m ~ p(X_t | Y_{1:t-1}),  w_t^m = p(Y_t | X_t^m) / Σ_m p(Y_t | X_t^m) }

- A sequential weighted resampler:
  - Time update (sample from a mixture model):

      p(X_{t+1} | Y_{1:t}) = ∫ p(X_{t+1} | X_t) p(X_t | Y_{1:t}) dX_t ≈ Σ_m w_t^m p(X_{t+1} | X_t^m)

  - Measurement update (reweight):

      { X_{t+1}^m ~ p(X_{t+1} | Y_{1:t}),  w_{t+1}^m = p(Y_{t+1} | X_{t+1}^m) / Σ_m p(Y_{t+1} | X_{t+1}^m) }
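The time-update / measurement-update / resample loop above is the bootstrap particle filter. A sketch on a toy linear-Gaussian state-space model (the model x_t = 0.9 x_{t-1} + v_t, y_t = x_t + w_t with v_t, w_t ~ N(0, 0.25), and the particle count, are illustrative assumptions):

```python
import numpy as np

def bootstrap_pf(ys, n_particles, rng):
    """Bootstrap particle filter; returns the filtered means E[X_t | Y_{1:t}]."""
    x = rng.standard_normal(n_particles)            # initial particles
    means = []
    for y in ys:
        # Time update: sample each particle from p(X_t | X_{t-1}^m).
        x = 0.9 * x + 0.5 * rng.standard_normal(n_particles)
        # Measurement update: w^m proportional to p(y_t | X_t^m).
        log_w = -0.5 * ((y - x) / 0.5) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        means.append(np.sum(w * x))                 # weighted filtered mean
        # Resample particles with probability proportional to the weights.
        x = x[rng.choice(n_particles, n_particles, p=w)]
    return np.array(means)

rng = np.random.default_rng(5)
# Simulate the same model to get a ground-truth trajectory and observations.
T, xs_true, x_t = 50, [], 0.0
for _ in range(T):
    x_t = 0.9 * x_t + 0.5 * rng.standard_normal()
    xs_true.append(x_t)
xs_true = np.array(xs_true)
ys = xs_true + 0.5 * rng.standard_normal(T)
filtered = bootstrap_pf(ys, 2000, rng)
```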
PF for switching SSM
- Recall that the belief state has O(2^t) Gaussian modes
PF for switching SSM
- Key idea: if you knew the discrete states, you could apply the right Kalman filter at each time step.
- So for each old particle m, sample the new switch state from the prior, S_t^m ~ P(S_t | S_{t-1}^m), then apply the Kalman filter (using the parameters for S_t^m) to the old belief state (x̂^m_{t-1|t-1}, P^m_{t-1|t-1}) to get an approximation to P(X_t | y_{1:t}, s_{1:t}^m).
- Useful for online tracking, fault diagnosis, etc.