Probabilistic Graphical Models
Advanced MCMC Methods: Optimization + MCMC
Eric Xing, Lecture 10, February 12, 2020
Reading: https://arxiv.org/pdf/1206.1901.pdf
[Figure: a small graphical model over X1, X2, X3 with transition probabilities (0.25/0.75, 0.7/0.3, 0.5/0.5) on its edges.]
© Eric Xing @ CMU, 2005-2020 1
Random walk in MCMC
The Metropolis–Hastings acceptance probability for a proposed move x_old → x_new, with x_new drawn from the proposal Q(x_new | x_old):

  min{ 1, [P(x_new) Q(x_old | x_new)] / [P(x_old) Q(x_new | x_old)] }

Might reject a lot of samples.
Random walk in MCMC
The same acceptance probability applies:

  min{ 1, [P(x_new) Q(x_old | x_new)] / [P(x_old) Q(x_new | x_old)] }

If the variance of Q is small, then the next sample may be highly correlated with the previous one.
MCMC: Recap
- Random walk can have a poor acceptance rate
- The samples can be highly correlated with one another, reducing the effective sample size
- Can we have a better proposal?
  - Using gradient information
  - Using an approximation of the given probability distribution
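Both failure modes can be seen in a few lines of code. A minimal sketch (the standard-normal target and the two step sizes are illustrative choices, not from the slides): a Gaussian random-walk proposal, where a large step is rejected often while a small step is accepted almost always at the cost of correlated samples.

```python
import numpy as np

def random_walk_metropolis(log_p, x0, step, n, rng):
    """Metropolis with symmetric proposal Q(x'|x) = N(x, step^2)."""
    x, samples, accepted = x0, [], 0
    for _ in range(n):
        x_prop = x + step * rng.standard_normal()
        # Q is symmetric, so the MH ratio reduces to P(x_prop) / P(x).
        if np.log(rng.uniform()) < log_p(x_prop) - log_p(x):
            x, accepted = x_prop, accepted + 1
        samples.append(x)
    return np.array(samples), accepted / n

rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x**2                     # standard normal target
_, rate_small = random_walk_metropolis(log_p, 0.0, 0.1, 5000, rng)
_, rate_large = random_walk_metropolis(log_p, 0.0, 10.0, 5000, rng)
# Small steps: high acceptance but a slow, correlated walk;
# large steps: many rejected samples.
```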
Hamiltonian Monte Carlo
- Hamiltonian dynamics (1959)
  - A deterministic system
- Hybrid Monte Carlo (1987)
  - United MCMC and molecular dynamics
- Statistical application (1993)
  - Inference in neural networks
  - Improves acceptance rate
  - Uncorrelated samples
Target distribution:

  P(x) = exp(-E(x)) / Z

The Hamiltonian:

  H(x, p) = E(x) + K(p),  with  K(p) = pᵀp / 2

Hamiltonian dynamics:

  ẋ = p,  ṗ = -∂E(x)/∂x

Auxiliary distribution:

  P_H(x, p) = exp(-E(x) - K(p)) / Z_H
Hamiltonian Dynamics
- Position vector q, momentum vector p
- Kinetic energy K(p)
- Potential energy U(q)
- Define H(q, p) = K(p) + U(q)
Hamiltonian Dynamics
- Position vector q, momentum vector p
- Kinetic energy K(p)
- Potential energy U(q)
- Define H(q, p) = K(p) + U(q)
- Hamiltonian dynamics (in alternative notation):

    dq_i/dt = ∂H/∂p_i
    dp_i/dt = -∂H/∂q_i

- Can help: the gradient of U over q guides the draw of the next sample!
Hamiltonian Dynamics: Example
- Kinetic energy K(p) = |p|²/2
- Potential energy U(q) = q²/2
- So:  dq/dt = p,  dp/dt = -q
- And the dynamics trace circles in (q, p):  q(t) = r cos(a + t),  p(t) = -r sin(a + t)
How to compute updates: Euler’s Method
For this example, one Euler step is the linear map

  [ q(t+ε) ]   [  1   ε ] [ q(t) ]
  [ p(t+ε) ] = [ -ε   1 ] [ p(t) ]

Its determinant is 1 + ε² > 1, so the trajectory diverges!
How to compute updates: Leapfrog Method
- The update looks like

    p(t + ε/2) = p(t) - (ε/2) ∂U/∂q (q(t))
    q(t + ε)   = q(t) + ε p(t + ε/2)
    p(t + ε)   = p(t + ε/2) - (ε/2) ∂U/∂q (q(t + ε))

Each update is a shear transformation → volume preserving
Leapfrog Vs Euler
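The contrast can be checked numerically. A sketch (step size ε = 0.3 and 200 steps are arbitrary choices) integrating the harmonic-oscillator example U(q) = q²/2, K(p) = p²/2 with both methods: Euler's energy blows up, while leapfrog's stays close to the initial H.

```python
import numpy as np

def grad_U(q):                 # U(q) = q^2 / 2  =>  dU/dq = q
    return q

def euler_step(q, p, eps):
    # Naive Euler: both updates use the state at time t.
    return q + eps * p, p - eps * grad_U(q)

def leapfrog_step(q, p, eps):
    # Half-step momentum, full-step position, half-step momentum.
    p = p - 0.5 * eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

def H(q, p):                   # total energy
    return 0.5 * q**2 + 0.5 * p**2

q_e, p_e = 1.0, 0.0
q_l, p_l = 1.0, 0.0
for _ in range(200):
    q_e, p_e = euler_step(q_e, p_e, 0.3)
    q_l, p_l = leapfrog_step(q_l, p_l, 0.3)
# Euler scales |(q, p)| by sqrt(1 + eps^2) every step, so H grows without
# bound; leapfrog is volume preserving and its energy error stays bounded.
```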
MCMC from Hamiltonian Dynamics
- Let q be the variable of interest (e.g., latent parameters of a model)
- Define the potential energy as a negative log probability:

    U(q) = -log[ π(q) L(q | D) ]

- where π(q) denotes the prior and L(q | D) denotes the data likelihood
- Key idea: use Hamiltonian dynamics to propose the next step.
MCMC from Hamiltonian Dynamics
- Given q₀ (starting state)
- Draw momentum p ~ N(0, 1)
- Use L steps of leapfrog to propose the next state (q*, p*)
- Accept/reject based on the change in the Hamiltonian: accept with probability min{1, exp(H(q, p) - H(q*, p*))}

Each iteration of the HMC algorithm has two steps. The first changes only the momentum; the second may change both position and momentum. Both steps leave the canonical joint distribution of (q, p) invariant, and hence their combination also leaves this distribution invariant.
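The steps above can be sketched in a few lines. This is a minimal one-dimensional illustration, not the slides' implementation; the standard-normal target and the values of ε and L are arbitrary choices.

```python
import numpy as np

def hmc(U, grad_U, q0, eps, L, n_samples, rng):
    """HMC with K(p) = p^2/2: resample momentum, run L leapfrog steps,
    then accept with probability min{1, exp(H_old - H_new)}."""
    q, samples = q0, []
    for _ in range(n_samples):
        p = rng.standard_normal()                 # draw p ~ N(0, 1)
        q_new, p_new = q, p
        p_new -= 0.5 * eps * grad_U(q_new)        # half step for momentum
        for step in range(L):
            q_new += eps * p_new                  # full step for position
            if step < L - 1:
                p_new -= eps * grad_U(q_new)
        p_new -= 0.5 * eps * grad_U(q_new)        # final half step
        H_old = U(q) + 0.5 * p**2
        H_new = U(q_new) + 0.5 * p_new**2
        if np.log(rng.uniform()) < H_old - H_new:  # Metropolis correction
            q = q_new
        samples.append(q)
    return np.array(samples)

rng = np.random.default_rng(1)
# Target: standard normal, so U(q) = q^2/2 and grad_U(q) = q.
draws = hmc(lambda q: 0.5 * q**2, lambda q: q, 0.0, 0.2, 20, 2000, rng)
```

Because leapfrog nearly conserves H, the acceptance rate stays high even with long trajectories, which is exactly the advantage over the random walk.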
MCMC from Hamiltonian Dynamics
- Detailed balance is satisfied
- Ergodic
- The canonical distribution is invariant
2D Gaussian Example
100D Gaussian Example
Acceptance Rate
- 2D example:   HMC 91%,  Random Walk 63%
- 100D example: HMC 87%,  Random Walk 25%
Langevin Dynamics
Langevin dynamics is HMC with one leapfrog step only, written all at once: with fresh momentum p ~ N(0, I),

  q* = q - (ε²/2) ∂U/∂q (q) + ε p
  p* = p - (ε/2) [ ∂U/∂q (q) + ∂U/∂q (q*) ]
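With a Metropolis correction, this one-leapfrog-step sampler is the Metropolis-adjusted Langevin algorithm (MALA). A sketch (the standard-normal target and step size are illustrative assumptions): the proposal mean q + (ε²/2)∇log p(q) comes from the position update above, and the correction must use the now-asymmetric proposal density.

```python
import numpy as np

def mala_step(log_p, grad_log_p, x, eps, rng):
    """One Langevin step: drift by (eps^2/2) * grad log p, add N(0, eps^2)
    noise, then Metropolis-correct with the asymmetric proposal density."""
    def log_q(to, frm):        # log N(to; frm + (eps^2/2) grad log p(frm), eps^2)
        mu = frm + 0.5 * eps**2 * grad_log_p(frm)
        return -0.5 * ((to - mu) / eps) ** 2
    x_prop = x + 0.5 * eps**2 * grad_log_p(x) + eps * rng.standard_normal()
    log_alpha = (log_p(x_prop) + log_q(x, x_prop)) - (log_p(x) + log_q(x_prop, x))
    return x_prop if np.log(rng.uniform()) < log_alpha else x

rng = np.random.default_rng(2)
log_p = lambda x: -0.5 * x**2          # standard normal target
grad_log_p = lambda x: -x
x, chain = 0.0, []
for _ in range(5000):
    x = mala_step(log_p, grad_log_p, x, 1.0, rng)
    chain.append(x)
chain = np.array(chain)
```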
Stochastic Gradient Langevin Dynamics
- For large datasets it is hard to compute the whole gradient
- Calculate it using a subset of the data
Stochastic Gradient Langevin Dynamics: Bayesian Models
- Posterior:  p(θ | x₁, …, x_N) ∝ p(θ) ∏_{i=1}^{N} p(xᵢ | θ)
- SGLD update (with minibatch size n):

    Δθ_t = (ε_t/2) ( ∇log p(θ_t) + (N/n) Σ_{i=1}^{n} ∇log p(x_{ti} | θ_t) ) + η_t,   η_t ~ N(0, ε_t)
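The update can be sketched directly. A toy example (the conjugate-Gaussian model, step size, and batch size are illustrative assumptions, not from the slides): sampling the posterior mean of a Gaussian with a N(0, 1) prior, using minibatch gradients rescaled by N/n and injected noise η_t ~ N(0, ε_t), with no accept/reject step.

```python
import numpy as np

def sgld_step(theta, batch, N, eps, grad_log_prior, grad_log_lik, rng):
    """One SGLD update: stochastic gradient of the log posterior plus noise."""
    grad = grad_log_prior(theta) + (N / len(batch)) * grad_log_lik(theta, batch)
    return theta + 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal()

rng = np.random.default_rng(3)
data = 2.0 + rng.standard_normal(100)            # x_i ~ N(2, 1), N = 100
# Model: theta ~ N(0, 1), x_i | theta ~ N(theta, 1).
grad_log_prior = lambda th: -th
grad_log_lik = lambda th, xs: np.sum(xs - th)

theta, trace = 0.0, []
for _ in range(5000):
    batch = rng.choice(data, size=10, replace=False)   # minibatch, n = 10
    theta = sgld_step(theta, batch, len(data), 1e-3,
                      grad_log_prior, grad_log_lik, rng)
    trace.append(theta)
trace = np.array(trace[1000:])                   # drop burn-in
# Exact posterior here is N(sum(x) / (N + 1), 1 / (N + 1)).
```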
Stochastic Gradient Langevin Dynamics
- High variance in the stochastic gradient
- Take help from the optimization community
Conclusion
- HMC can improve the acceptance rate and give better mixing
- Stochastic variants can be used to improve performance in large-dataset scenarios
- HMC cannot be applied directly to discrete variables
Supplementary
Variational MCMC
Sequential Monte Carlo
Towards better proposal
- Q(x_new | x_old) determines how quickly the chain converges
- Idea: let a variational approximation of P(X) be the proposal distribution
Variational Inference: Recap
- Interested in the posterior of parameters P(θ | D)
- Using Jensen's inequality:

    log P(D) ≥ E_{q(θ)}[ log P(D, θ) ] - E_{q(θ)}[ log q(θ) ]

- Choose q(θ; λ), where λ is the variational parameter
- Replace P(D | θ) with P(D | θ, φ), where φ is another set of variational parameters
- Using this we can easily obtain an un-normalized bound for the posterior:

    P(θ | D) ≥ P_VB(θ | D, λ, φ)
Variational MCMC
- Idea: let a variational approximation of P(X) be the proposal distribution

    Q(x_new | x_old) = P_VB(θ | D, λ, φ)

- Issues:
  - Low acceptance in high dimensions
  - Works well only if P_VB is close to P
Variational MCMC
- Design the proposal in blocks to take care of correlated variables
- Use a mixture of random walk and variational approximation as the proposal distribution
- Stochastic variational methods can now be used in estimating P_VB(θ | D, λ, φ)
Conclusion
- Adapting the proposal distribution can be helpful in:
  - Increasing mixing
  - Decreasing time to convergence
  - Increasing the acceptance rate
  - Getting uncorrelated samples
Recall: weighted resampling
- Sampling importance resampling (SIR):
  1. Draw N samples from Q: X₁ … X_N
  2. Construct weights w₁ … w_N, where

       w_m = [ P(X_m) / Q(X_m) ] / Σ_ℓ [ P(X_ℓ) / Q(X_ℓ) ]

  3. Sub-sample x from {X₁ … X_N} w.p. (w₁ … w_N)
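The three steps above can be sketched as follows (the N(1, 1) target and the wider N(0, 4) proposal are illustrative choices), using self-normalized log weights so that unnormalized densities and large N stay numerically stable:

```python
import numpy as np

def sir(log_p, log_q, sample_q, n, rng):
    """Sampling importance resampling: draw from Q, weight by P/Q, resample."""
    xs = sample_q(n, rng)                       # 1. draw N samples from Q
    log_w = log_p(xs) - log_q(xs)               # 2. unnormalized log weights
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                #    self-normalize
    return xs[rng.choice(n, size=n, p=w)]       # 3. sub-sample w.p. w_m

rng = np.random.default_rng(4)
log_p = lambda x: -0.5 * (x - 1.0) ** 2         # target N(1, 1), up to a constant
log_q = lambda x: -x**2 / 8.0                   # proposal N(0, 4), up to a constant
sample_q = lambda n, rng: 2.0 * rng.standard_normal(n)
resampled = sir(log_p, log_q, sample_q, 20000, rng)
```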
Sequential MC: Sketch of Particle Filters
- The starting point:

    p(X_{t+1} | Y_{1:t}) = ∫ p(X_{t+1} | X_t) p(X_t | Y_{1:t}) dX_t

    p(X_t | Y_{1:t}) = p(Y_t | X_t) p(X_t | Y_{1:t-1}) / ∫ p(Y_t | X_t) p(X_t | Y_{1:t-1}) dX_t

- Thus p(X_t | Y_{1:t}) is represented by the weighted particle set

    { X_t^m ~ p(X_t | Y_{1:t-1}),  w_t^m = p(Y_t | X_t^m) / Σ_m p(Y_t | X_t^m) }

- A sequential weighted resampler:
  - Time update (sample from a mixture model):

      p(X_{t+1} | Y_{1:t}) = ∫ p(X_{t+1} | X_t) p(X_t | Y_{1:t}) dX_t ≈ Σ_m w_t^m p(X_{t+1} | X_t^m)

  - Measurement update (reweight):

      { X_{t+1}^m ~ p(X_{t+1} | Y_{1:t}),  w_{t+1}^m = p(Y_{t+1} | X_{t+1}^m) / Σ_m p(Y_{t+1} | X_{t+1}^m) }
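The time-update / measurement-update / resample loop above is the bootstrap particle filter. A sketch on a toy linear-Gaussian state-space model (the model x_t = 0.9 x_{t-1} + v_t, y_t = x_t + w_t with v_t, w_t ~ N(0, 0.25), and the particle count, are illustrative assumptions):

```python
import numpy as np

def bootstrap_pf(ys, n_particles, rng):
    """Bootstrap particle filter; returns the filtered means E[X_t | Y_{1:t}]."""
    x = rng.standard_normal(n_particles)            # initial particles
    means = []
    for y in ys:
        # Time update: sample each particle from p(X_t | X_{t-1}^m).
        x = 0.9 * x + 0.5 * rng.standard_normal(n_particles)
        # Measurement update: w^m proportional to p(y_t | X_t^m).
        log_w = -0.5 * ((y - x) / 0.5) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        means.append(np.sum(w * x))                 # weighted filtered mean
        # Resample particles with probability proportional to the weights.
        x = x[rng.choice(n_particles, n_particles, p=w)]
    return np.array(means)

rng = np.random.default_rng(5)
# Simulate the same model to get a ground-truth trajectory and observations.
T, xs_true, x_t = 50, [], 0.0
for _ in range(T):
    x_t = 0.9 * x_t + 0.5 * rng.standard_normal()
    xs_true.append(x_t)
xs_true = np.array(xs_true)
ys = xs_true + 0.5 * rng.standard_normal(T)
filtered = bootstrap_pf(ys, 2000, rng)
```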
PF for switching SSM
- Recall that the belief state has O(2^t) Gaussian modes
PF for switching SSM
- Key idea: if you knew the discrete states, you could apply the right Kalman filter at each time step.
- So for each old particle m, sample the new switch state from the prior, S_t^m ~ P(S_t | S_{t-1}^m), then apply the Kalman filter (using the parameters for S_t^m) to the old belief state (x̂^m_{t-1|t-1}, P^m_{t-1|t-1}) to get an approximation to P(X_t | y_{1:t}, s_{1:t}^m).
- Useful for online tracking, fault diagnosis, etc.