Lab 1
Gibbs Sampling and LDA
Lab Objective: Understand the basic principles of implementing a Gibbs sampler.
Apply this to Latent Dirichlet Allocation.
Gibbs Sampling
Gibbs sampling is an MCMC sampling method in which we construct a Markov chain to sample from a desired joint (conditional) distribution
$$P(x_1, \dots, x_n \mid y).$$
Often it is difficult to sample from this high-dimensional joint distribution directly, while it may be easy to sample from the one-dimensional conditional distributions
$$P(x_i \mid x_{-i}, y),$$
where $x_{-i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$.
Algorithm 1.1 Basic Gibbs Sampling Process.
1: procedure Gibbs Sampler
2:     Randomly initialize x_1, x_2, ..., x_n.
3:     for k = 1, 2, 3, ... do
4:         for i = 1, 2, ..., n do
5:             Draw x ∼ P(x_i | x_{-i}, y)
6:             Fix x_i = x
7:         x^(k) = (x_1, x_2, ..., x_n)
A Gibbs sampler proceeds according to Algorithm 1.1. Each iteration of the outer for loop is a sweep of the Gibbs sampler, and the value of x^(k) after a sweep is a sample. This creates an irreducible, non-null recurrent, aperiodic Markov chain over the state space consisting of all possible x. The unique invariant distribution for the chain is the desired joint distribution $P(x_1, \dots, x_n \mid y)$.
Thus, after a burn-in period, our samples x^(k) are effectively samples from the desired distribution.
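To make the sweep structure concrete, here is a minimal Python skeleton of Algorithm 1.1. This is a sketch of ours, not code from the lab: the entries of conditionals are hypothetical callables, one per coordinate, each drawing from the corresponding one-dimensional conditional distribution.

import numpy as np

def gibbs_sweeps(conditionals, x0, n_sweeps):
    # conditionals[i](x) is assumed to draw from P(x_i | x_{-i}, y).
    x = np.asarray(x0, dtype=float).copy()
    samples = []
    for k in range(n_sweeps):
        for i, draw in enumerate(conditionals):
            x[i] = draw(x)          # update coordinate i, holding the rest fixed
        samples.append(x.copy())    # the state after each sweep is one sample
    return np.array(samples)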
Consider the dataset of N scores from a calculus exam in the file examscores.csv. We believe that the spread of these exam scores can be modeled with a normal distribution of mean μ and variance σ². Because we are unsure of the true values of μ and σ², we take a Bayesian approach and place priors on each parameter to quantify this uncertainty:
$$\mu \sim N(\mu_0, \sigma_0^2) \quad \text{(a normal distribution)}$$
$$\sigma^2 \sim IG(\alpha, \beta) \quad \text{(an inverse gamma distribution)}$$
Letting y = (y_1, ..., y_N) be the set of exam scores, we would like to update our beliefs about μ and σ² by sampling from the posterior distribution
$$P(\mu, \sigma^2 \mid y, \mu_0, \sigma_0^2, \alpha, \beta).$$
Sampling directly can be difficult. However, we can easily sample from the following
conditional distributions:
$$P(\mu \mid \sigma^2, y, \mu_0, \sigma_0^2, \alpha, \beta) = P(\mu \mid \sigma^2, y, \mu_0, \sigma_0^2)$$
$$P(\sigma^2 \mid \mu, y, \mu_0, \sigma_0^2, \alpha, \beta) = P(\sigma^2 \mid \mu, y, \alpha, \beta)$$
The reason for this is that the priors are conjugate, so these conditional distributions belong to the same distributional families as the priors. In particular, we have
$$P(\mu \mid \sigma^2, y, \mu_0, \sigma_0^2) = N(\mu^*, (\sigma^*)^2)$$
$$P(\sigma^2 \mid \mu, y, \alpha, \beta) = IG(\alpha^*, \beta^*),$$
where
$$(\sigma^*)^2 = \left(\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}\right)^{-1}$$
$$\mu^* = (\sigma^*)^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{i=1}^{N} y_i\right)$$
$$\alpha^* = \alpha + \frac{N}{2}$$
$$\beta^* = \beta + \frac{1}{2}\sum_{i=1}^{N} (y_i - \mu)^2$$
We have thus set this up as a Gibbs sampling problem, in which we need only alternate between sampling μ and sampling σ². We can sample from a normal distribution and an inverse gamma distribution as follows:
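(The hyperparameter values below are illustrative.)

>>> import numpy as np
>>> from scipy.stats import norm, invgamma

>>> # draw from N(mu0, sigma0^2); norm's scale is the standard deviation
>>> mu0, sigma0_sq = 80., 16.
>>> mu_sample = norm.rvs(loc=mu0, scale=np.sqrt(sigma0_sq))

>>> # draw from IG(alpha, beta); scipy's invgamma takes beta as its scale
>>> alpha = 2.
>>> beta = 15.
>>> invgamma_sample = invgamma.rvs(alpha, scale=beta)

Note that when sampling from the normal distribution, the scale argument is the standard deviation, not the variance.

The plotting code below assumes an array draws, whose two columns hold the μ and σ² samples, and an array lprobs of unnormalized log posterior probabilities. A minimal sketch of a sampler that produces them, following the update formulas above (the function name, signature, and log-probability bookkeeping are our own):

def gibbs(y, mu0, sigma0_sq, alpha, beta, n_samples=1000):
    # Alternate between the two conditional draws derived above.
    N = len(y)
    mu, sigma_sq = y.mean(), y.var()        # arbitrary starting point
    draws = np.empty((n_samples, 2))
    lprobs = np.empty(n_samples)
    for k in range(n_samples):
        # mu | sigma^2, y  ~  N(mu_star, var_star)
        var_star = 1. / (1. / sigma0_sq + N / sigma_sq)
        mu_star = var_star * (mu0 / sigma0_sq + y.sum() / sigma_sq)
        mu = norm.rvs(loc=mu_star, scale=np.sqrt(var_star))
        # sigma^2 | mu, y  ~  IG(alpha_star, beta_star)
        alpha_star = alpha + N / 2.
        beta_star = beta + ((y - mu)**2).sum() / 2.
        sigma_sq = invgamma.rvs(alpha_star, scale=beta_star)
        draws[k] = mu, sigma_sq
        # unnormalized log posterior at the current state
        lprobs[k] = (norm.logpdf(y, loc=mu, scale=np.sqrt(sigma_sq)).sum()
                     + norm.logpdf(mu, loc=mu0, scale=np.sqrt(sigma0_sq))
                     + invgamma.logpdf(sigma_sq, alpha, scale=beta))
    return draws, lprobs

>>> y = np.loadtxt('examscores.csv')  # assuming one score per line
>>> draws, lprobs = gibbs(y, mu0=80., sigma0_sq=16., alpha=3., beta=50.)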
We can evaluate the quality of our results by plotting the log probabilities, the
µ samples, the σ2 samples, and kernel density estimators for the marginal posterior
distributions of µ and σ2. The code below will accomplish this task:
>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> from scipy.stats import gaussian_kde

>>> # plot the first 500 log probs
>>> plt.plot(lprobs[:500])
>>> plt.show()
>>> # plot the mu samples
>>> plt.plot(draws[:,0])
>>> plt.show()
>>> # plot the sigma2 samples
>>> plt.plot(draws[:,1])
>>> plt.show()
>>> # build and plot KDE for posterior mu
>>> mu_kernel = gaussian_kde(draws[50:,0])
>>> x_min = min(draws[50:,0]) - 1
>>> x_max = max(draws[50:,0]) + 1
>>> x = np.arange(x_min, x_max, step=0.1)
>>> plt.plot(x,mu_kernel(x))
>>> plt.show()
>>> # build and plot KDE for posterior sigma2
>>> sig_kernel = gaussian_kde(draws[50:,1])
>>> x_min = 20
>>> x_max = 200
>>> x = np.arange(x_min, x_max, step=0.1)
>>> plt.plot(x,sig_kernel(x))
>>> plt.show()
Your results should be close to those given in Figures 2.1 and 2.2.
The Ising Model
In statistical mechanics, the Ising model describes how atoms interact in ferromagnetic material. Assume we have some lattice Λ of sites. We say i ∼ j if i and j are adjacent sites. Each site i in our lattice is assigned an associated spin σ_i ∈ {±1}. A state in our Ising model is a particular spin configuration σ = (σ_k)_{k∈Λ}. If L = |Λ|, then there are 2^L possible states in our model. If L is large, the state space becomes huge, which is why MCMC sampling methods (in particular the Metropolis algorithm) are so useful for computing estimates in this model.
With any spin configuration σ, there is an associated energy
$$H(\sigma) = -J \sum_{i \sim j} \sigma_i \sigma_j,$$
where J > 0 for ferromagnetic materials, and J < 0 for antiferromagnetic materials. Throughout this lab, we will assume J = 1, so the energy equation becomes $H(\sigma) = -\sum_{i \sim j} \sigma_i \sigma_j$, where the interaction from each pair is added only once.

Figure 2.3: Spin configuration from random initialization.
We will consider a lattice that is a 100 × 100 square grid. The adjacent sites for a given site are those directly above, below, to the left, and to the right of it. For sites on the edge of the grid, we assume the lattice wraps around: a site on the far left side of the grid is adjacent to the corresponding site on the far right side, and likewise for the top and bottom. Thus, a single spin configuration can be represented as a 100 × 100 array with entries ±1.
Problem 1. Write a function that initializes a spin configuration for an
n× n lattice. It should return an n× n array, each entry of which is either
1 or −1, chosen randomly. Test this for the grid described above, and plot
the spin configuration using matplotlib.pyplot.imshow. It should look fairly
random, as in Figure 2.3.
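A minimal sketch of one way to do this (the function name is ours):

import numpy as np
from matplotlib import pyplot as plt

def random_lattice(n):
    # Each site independently receives spin +1 or -1 with probability 1/2.
    return np.random.choice([-1, 1], size=(n, n))

plt.imshow(random_lattice(100))
plt.show()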
Problem 2. Write a function that computes the energy of a wrap-around
n× n lattice with a given spin configuration, as described above. Make sure
that you do not double count site pair interactions!
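One possible sketch, using np.roll for the wrap-around boundary. Summing only the products with the neighbor below and the neighbor to the right counts each adjacent pair exactly once:

def lattice_energy(spins):
    # Periodic (wrap-around) neighbors via np.roll; each edge counted once.
    vertical = (spins * np.roll(spins, 1, axis=0)).sum()
    horizontal = (spins * np.roll(spins, 1, axis=1)).sum()
    return -float(vertical + horizontal)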
Different spin configurations occur with different probabilities, depending on the
energy of the spin configuration and β > 0, a quantity inversely proportional to the
temperature. More specifically, for a given β, we have
$$P_\beta(\sigma) = \frac{e^{-\beta H(\sigma)}}{Z_\beta},$$
where $Z_\beta = \sum_{\sigma} e^{-\beta H(\sigma)}$. Because there are $2^{100 \cdot 100} = 2^{10000}$ possible spin configurations for our particular lattice, computing this sum is infeasible. However, the numerator is quite simple, provided we can efficiently compute the energy H(σ) of a spin configuration. Thus the ratio of the probability densities of two spin configurations is simple:
$$\frac{P_\beta(\sigma^*)}{P_\beta(\sigma)} = \frac{e^{-\beta H(\sigma^*)}}{e^{-\beta H(\sigma)}} = e^{\beta(H(\sigma) - H(\sigma^*))}.$$
The simplicity of this ratio suggests that a Metropolis algorithm is an appropriate way to sample from the spin configuration probability distribution, in which case our acceptance probability would be
$$A(\sigma^*, \sigma) = \begin{cases} 1 & \text{if } H(\sigma^*) < H(\sigma) \\ e^{\beta(H(\sigma) - H(\sigma^*))} & \text{otherwise.} \end{cases}$$
By choosing our transition matrix Q cleverly, we can also make it easy to compute the energy of any proposed spin configuration. We restrict our proposals to those spin configurations obtained by flipping the spin at exactly one lattice site, i.e. we choose a lattice site i and flip its spin. Thus, there are only L possible proposal spin configurations σ* given σ, each proposed with probability 1/L, and such that σ*_j = σ_j for all j ≠ i, and σ*_i = −σ_i. Note that we would never actually write out this matrix (it would be 2^10000 × 2^10000!). Computing the proposed configuration's energy is simple: if the spin-flip site is i, then
$$H(\sigma^*) = H(\sigma) + 2\sum_{j : j \sim i} \sigma_i \sigma_j.$$
Problem 3. Write a function that proposes a new spin configuration given the current spin configuration on an n × n lattice, as described above. This function simply needs to return a pair of indices (i, j), each site chosen with probability 1/n².
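A sketch (the function name is ours):

def propose_site(n):
    # A uniform draw over the n^2 sites gives each pair (i, j) probability 1/n^2.
    i, j = np.random.randint(0, n, size=2)
    return i, j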
Problem 4. Write a function that computes the energy of a proposed spin
configuration, given the current spin configuration, its energy, and the pro-
posed spin flip site indices.
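A sketch implementing H(σ*) = H(σ) + 2 σ_i ∑_{j∼i} σ_j with wrap-around neighbors (the function name and signature are ours):

def proposed_energy(spins, energy, i, j):
    n = spins.shape[0]
    # Spins of the four wrap-around neighbors of site (i, j).
    neighbors = (spins[(i - 1) % n, j] + spins[(i + 1) % n, j]
                 + spins[i, (j - 1) % n] + spins[i, (j + 1) % n])
    return energy + 2 * spins[i, j] * neighbors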
Problem 5. Write a function that accepts or rejects a proposed spin config-
uration, given the current configuration. It should accept the current energy,
the proposed energy, and β, and should return a boolean.
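A sketch following the acceptance probability above (the function name is ours):

def accept(current_energy, new_energy, beta):
    # Always accept a lower-energy proposal; otherwise accept with
    # probability exp(beta * (H(sigma) - H(sigma*))).
    if new_energy < current_energy:
        return True
    return np.random.random() < np.exp(beta * (current_energy - new_energy))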
To track the convergence of the Markov chain, we would like to look at the probability of each sample at each step. However, this would require computing the denominator Z_β, which, as explained previously, is generally the reason we must resort to a Metropolis algorithm in the first place. Instead, we can get away with examining only −βH(σ). We should see this value increase as the algorithm proceeds, and it should converge once we are sampling from the correct distribution. Note that we don't expect these values to converge to a specific value, but rather to settle within a restricted range of values.
Problem 6. Write a function that initializes a spin configuration for an n × n lattice as done previously, and then performs the Metropolis algorithm, choosing new spin configurations and accepting or rejecting them. It should burn in first, and then iterate n_samples times, keeping every 100th sample (this is to prevent memory failure) and all of the values of −βH(σ) (keep these values even during the burn-in period). It should also accept β as an argument, allowing us to effectively adjust the temperature of the model.
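One possible driver tying the earlier sketches together; the burn-in length and the reading of n_samples as the number of post-burn-in iterations are our own interpretation:

def ising_metropolis(n=100, beta=1.0, n_samples=5000, burn_in=100000, keep_every=100):
    spins = random_lattice(n)               # Problem 1
    energy = lattice_energy(spins)          # Problem 2
    logprobs, samples = [], []
    for it in range(burn_in + n_samples):
        i, j = propose_site(n)                              # Problem 3
        new_energy = proposed_energy(spins, energy, i, j)   # Problem 4
        if accept(energy, new_energy, beta):                # Problem 5
            spins[i, j] *= -1
            energy = new_energy
        logprobs.append(-beta * energy)     # tracked even during burn-in
        if it >= burn_in and (it - burn_in) % keep_every == 0:
            samples.append(spins.copy())    # keep every 100th post-burn-in state
    return np.array(samples), np.array(logprobs)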
Problem 7. Test your Metropolis sampler on a 100 × 100 grid with 200000 iterations, with n_samples large enough that you keep 50 samples, testing with β = 1 and then with β = 0.2. Plot the proportional log probabilities, and also plot a late sample from each test using matplotlib.pyplot.imshow. How does the ferromagnetic material behave differently at the two temperatures? Recall that β is inversely proportional to the temperature. You should see more structure at the lower temperature, as illustrated in Figures 2.4b and 2.4d.
Figure 2.4: (a) Proportional log probs when β = 1. (b) Spin configuration sample when β = 1. (c) Proportional log probs when β = 0.2. (d) Spin configuration sample when β = 0.2.