Sampling and Markov Chain Monte Carlo Techniques

Part 1: 2016-01-20
Part 2: 2016-02-10

Tomasz Kuśmierczyk

Session 5: Sampling & MCMC

Approximate and Scalable Inference for Complex Probabilistic Models in Recommender Systems

Part 2: Inference Techniques

MCMC = Markov Chain Monte Carlo

MCMC ⊂ Sampling

Literature / Credits
● Szymon Jaroszewicz lectures on “Selected Advanced Topics in Machine Learning”
● Daphne Koller lectures on “Probabilistic Graphical Models” (https://class.coursera.org/pgm-003/lecture)
● Patrick Lam slides (http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf)
● Bishop’s book, ch. 11
● MacKay, David J. C. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003. (http://www.inference.phy.cam.ac.uk/itprnn/book.pdf)
● R & JAGS online tutorials…
● …


Basics & motivation

Motivation: Monte Carlo for integrating

http://mlg.eng.cam.ac.uk/zoubin/tut06/mcmc.pdf

Non-trivial posterior distribution (e.g., for BNs)

Sampling vs Variational Inference (previous seminar)

http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Sampling continued ...
● Accuracy of sampling-based estimates depends only on the variance of the quantity being estimated
● It does not depend directly on the dimensionality (having many variables is not a problem)
● In some cases we are able to break the curse of dimensionality

but

● Sampling gets much more difficult in higher dimensions
● Variance often increases as the dimension grows
● Accuracy of sampling-based methods grows only with the square root of the number of samples

Jaroszewicz
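A minimal sketch of the square-root behaviour, assuming a toy problem that is not in the slides: estimating E[x^2] = 1 for x ~ N(0, 1). The helper name monte_carlo_mean and the sample sizes are illustrative choices. The estimated standard error should shrink by roughly a factor of 10 when the number of samples grows 100 times.

```python
import math
import random


def monte_carlo_mean(f, sampler, n, rng):
    """Estimate E[f(x)] and its standard error from n independent samples."""
    values = [f(sampler(rng)) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    # The standard error of the mean falls as 1 / sqrt(n).
    return mean, math.sqrt(var / n)


rng = random.Random(0)
results = {
    n: monte_carlo_mean(lambda x: x * x, lambda r: r.gauss(0.0, 1.0), n, rng)
    for n in (1_000, 100_000)
}
for n, (mean, stderr) in results.items():
    print(f"n={n:>7}  estimate={mean:.4f}  standard error={stderr:.4f}")
```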

Sampling techniques - basic cases
● uniform -> pseudo-random number generator
● discrete distributions -> range matching with the help of uniform samples (in time logarithmic in the number of outcomes)
● continuous -> inverse CDF
● various ‘tricks’
● ...
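The two basic cases can be sketched as follows. This is an illustration under assumptions: the probabilities [0.2, 0.5, 0.3] and the exponential distribution are arbitrary examples, and the function names are hypothetical. Range matching uses binary search over the cumulative sums, which gives the logarithmic time per draw once the sums are precomputed.

```python
import bisect
import itertools
import math
import random


def make_discrete_sampler(probs):
    """Range matching: precompute cumulative sums, then binary-search a uniform draw."""
    cumulative = list(itertools.accumulate(probs))
    total = cumulative[-1]

    def draw(rng):
        u = rng.random() * total
        # min() guards against a floating-point edge case at the upper end.
        return min(bisect.bisect_right(cumulative, u), len(probs) - 1)

    return draw


def sample_exponential(rate, rng):
    """Inverse CDF: F(x) = 1 - exp(-rate * x), so x = -ln(1 - u) / rate."""
    u = rng.random()
    return -math.log(1.0 - u) / rate


rng = random.Random(1)
draw = make_discrete_sampler([0.2, 0.5, 0.3])
discrete_draws = [draw(rng) for _ in range(20_000)]
exp_draws = [sample_exponential(2.0, rng) for _ in range(20_000)]

freq_middle = discrete_draws.count(1) / len(discrete_draws)
exp_mean = sum(exp_draws) / len(exp_draws)
print(freq_middle, exp_mean)  # should be close to 0.5 and 1 / rate = 0.5
```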

Sampling techniques (e.g., for BNs posterior)
● Ancestral Sampling (no evidence)
● Probabilistic Logic Sampling (like AS, but samples not consistent with evidence are discarded -> low number of samples generated)
● Likelihood weighting (estimates may be inaccurate + other problems)
● Importance Sampling
● (Adaptive) Rejection Sampling
● Sampling-Importance-Resampling
● Metropolis
● Metropolis-Hastings
● Gibbs Sampling
● Hamiltonian (hybrid) Sampling
● Slice sampling
● and more...

Monte Carlo without Markov Chains

A few remarks
● there is no difference between sampling from normalized and non-normalized distributions
● non-normalized distributions are easy to evaluate for BNs
● in most cases (e.g. rejection sampling) we work with non-normalized distributions
● for simplicity p(x) is used in the notation, but nothing changes for complicated posterior distributions
● the 1D case is presented, but the methods also work in the multi-dimensional case

Rejection sampling

Jaroszewicz, Bishop

[Figure: target p(x) lying under the envelope c q(x)]

Rejection sampling - proof

Jaroszewicz

Selection of c?
● c should be as small as possible to keep the rejection rate low
● but p <= c q must hold
● Adaptive Rejection Sampling for log-concave distributions
  ○ log-concave = logarithm of the distribution is concave
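A minimal sketch of plain (non-adaptive) rejection sampling, under assumptions that are not in the slides: the non-normalized target is p(x) = x(1 - x) on [0, 1], the proposal q is Uniform(0, 1), and c = 0.25 is the smallest constant with p <= c q. The function name rejection_sample is hypothetical. With this choice about 2/3 of the proposals should be accepted; a larger c would lower that fraction.

```python
import random


def rejection_sample(p_tilde, q_sample, q_density, c, n, rng):
    """Draw n samples from the density proportional to p_tilde, using envelope c * q."""
    samples, proposals = [], 0
    while len(samples) < n:
        x = q_sample(rng)
        proposals += 1
        # Accept x with probability p_tilde(x) / (c * q(x)).
        if rng.random() * c * q_density(x) <= p_tilde(x):
            samples.append(x)
    return samples, n / proposals


rng = random.Random(2)
samples, acceptance_rate = rejection_sample(
    p_tilde=lambda x: x * (1.0 - x),   # assumed non-normalized target
    q_sample=lambda r: r.random(),     # proposal: Uniform(0, 1)
    q_density=lambda x: 1.0,
    c=0.25,                            # max of x * (1 - x) on [0, 1]
    n=20_000,
    rng=rng,
)
sample_mean = sum(samples) / len(samples)
print(sample_mean, acceptance_rate)  # should be close to 0.5 and 2/3
```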

Adaptive Rejection Sampling

Jaroszewicz

Rejection Sampling problems
● part of the samples are rejected
● a tight “envelope” helps a bit

but

● in many dimensions (when there are many variables) the curse of dimensionality must be taken into account
● see Bishop’s example (for rejection sampling):
  ○ p(x) ~ N(0, s1)
  ○ q(x) ~ N(0, 1.01*s1)
  ○ D=1000
  ○ -> acceptance ratio 1/20000
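The arithmetic behind this example can be checked in a few lines, assuming (as in Bishop's version) that s1 is a standard deviation and that both densities are isotropic, zero-mean Gaussians in D dimensions. Under those assumptions the smallest valid constant is c = (s_q / s_p)^D, attained at the origin, and for normalized densities the acceptance ratio is 1/c.

```python
D = 1000               # number of dimensions
sigma_ratio = 1.01     # s_q / s_p: the proposal is only 1% wider than the target

# Smallest c with p(x) <= c * q(x): the density ratio p/q is largest at x = 0.
c = sigma_ratio ** D
acceptance_ratio = 1.0 / c

print(f"c = {c:.0f}, acceptance ratio = {acceptance_ratio:.2e}")
```

The result is on the order of 1/20000, even though the proposal is almost identical to the target in every single dimension.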

Markov Chains

What is a Markov Chain?
● A triple <possibly infinite set S of possible states, initial distribution over states P0, transition matrix P (T)>
● transition matrix - a matrix of probabilities Pij (Tij) that, being in state si at time t, we will move to state sj at time t+1
● Markov property = the next state depends only on the current state

Jaroszewicz

Markov Chains - distribution over states

Jaroszewicz

Markov Chains - stationary distribution

Jaroszewicz
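A small sketch of how the distribution over states evolves and settles, assuming a hypothetical two-state chain that is not taken from the slides. One step maps the row vector p_t to p_{t+1} = p_t P; a stationary distribution is one that this step leaves unchanged. For the matrix below the stationary distribution is (5/6, 1/6), which can be verified by hand from 0.1 * pi_0 = 0.5 * pi_1.

```python
def step(dist, P):
    """One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical transition matrix; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

dist = [0.0, 1.0]          # initial distribution P0: start in the second state
for _ in range(200):
    dist = step(dist, P)

print(dist)                # approaches the stationary distribution [5/6, 1/6]
```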

Stationarity example

Daphne Koller

Stationarity from regularity
● If there exists k such that, for every two states <si, sj>, the probability of getting from si to sj in exactly k steps is > 0 (MC is regular) → MC converges to a unique stationary distribution
● Sufficient conditions for regularity:
  ○ there is a path between every pair of states
  ○ for every state, there is a self-transition

Stationarity of irreducible, aperiodic MC
● Irreducible, aperiodic Markov chains with a finite state space always converge to a unique stationary distribution

Reducibility

Jaroszewicz

Periodicity

Jaroszewicz

Why I talk about Markov Chains -> MCMC
the idea is that:

● a Markov Chain “jumps” over states
● states determine (BN) samples (that are later used for Monte Carlo)
  ○ for example: state ⇔ sample

but we need:

● the Markov Chain to converge to a stationary distribution (to be proved every time)
● the distribution of generated samples to be equal to the required distribution (BNs posterior)

Properties
● Very general purpose
● Often easy to implement
● Good theoretical guarantees as t -> ∞

but:

● Lots of tunable parameters / design choices
● Can be quite slow to converge
● Difficult to tell whether it’s working

Metropolis-Hastings derivation on the blackboard:

1. From detailed balance to stationarity
2. Proposed distribution and acceptance probability
3. From detailed balance to conditions on acceptance probability
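The outcome of that derivation is the usual acceptance probability A(x -> x') = min(1, p(x') q(x | x') / (p(x) q(x' | x))), which satisfies detailed balance for the target p. Below is a minimal sketch of the resulting sampler, not the blackboard derivation itself. The target (a non-normalized standard normal), the Gaussian random-walk proposal, the starting point 5.0 and all function names are assumptions chosen for illustration.

```python
import math
import random


def metropolis_hastings(log_p, propose, log_q, x0, n, rng):
    """log_p: non-normalized log target; propose(x, rng) draws x' from q(. | x);
    log_q(a, b) returns log q(a | b) up to an additive constant."""
    x, samples, accepted = x0, [], 0
    for _ in range(n):
        x_new = propose(x, rng)
        # log of p(x') q(x | x') / (p(x) q(x' | x))
        log_ratio = (log_p(x_new) + log_q(x, x_new)) - (log_p(x) + log_q(x_new, x))
        # Accept with probability min(1, ratio); 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < log_ratio:
            x, accepted = x_new, accepted + 1
        samples.append(x)          # a rejected proposal repeats the current state
    return samples, accepted / n


rng = random.Random(3)
samples, acceptance_rate = metropolis_hastings(
    log_p=lambda x: -0.5 * x * x,                 # assumed target: N(0, 1), non-normalized
    propose=lambda x, r: x + r.gauss(0.0, 1.0),   # symmetric random-walk proposal
    log_q=lambda a, b: -0.5 * (a - b) ** 2,
    x0=5.0,                                       # deliberately poor starting point
    n=20_000,
    rng=rng,
)
kept = samples[1_000:]                            # drop early iterations
mh_mean = sum(kept) / len(kept)
mh_var = sum((x - mh_mean) ** 2 for x in kept) / len(kept)
print(mh_mean, mh_var, acceptance_rate)           # mean near 0, variance near 1
```

Because the proposal here is symmetric, the q terms cancel and the sampler reduces to plain Metropolis; they are kept in the code so that an asymmetric proposal can be plugged in.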

Part 2

Dawn of Statistical Renaissance

Gibbs sampling

Gibbs sampling: Algorithm

Daphne Koller

Does it work? - often
Under certain conditions, the stationary distribution of this Markov chain is the joint distribution of the Bayesian network:

● A probability distribution P(X) is positive if P(X = x) > 0 for all x ∈ Dom(X).
● Theorem: If all conditional distributions in a Bayesian network are positive (all probabilities are > 0) then a Gibbs sampler converges to the joint distribution of the Bayesian network.
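A minimal sketch of a Gibbs sampler on a toy positive distribution, to make the theorem concrete. The joint table over two binary variables is an assumption made up for this example, as are the function names; in a real Bayesian network the conditional would be computed from the variable's Markov blanket rather than from a full joint table.

```python
import random
from collections import Counter

# Hypothetical positive joint distribution over two binary variables (A, B).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def conditional_one(joint, state, index):
    """P(X_index = 1 | all other variables fixed as in state)."""
    s0, s1 = list(state), list(state)
    s0[index], s1[index] = 0, 1
    p0, p1 = joint[tuple(s0)], joint[tuple(s1)]
    return p1 / (p0 + p1)


def gibbs(joint, state, n_sweeps, rng):
    """Resample each variable in turn from its conditional given the others."""
    state, samples = list(state), []
    for _ in range(n_sweeps):
        for index in range(len(state)):
            p_one = conditional_one(joint, state, index)
            state[index] = 1 if rng.random() < p_one else 0
        samples.append(tuple(state))
    return samples


rng = random.Random(4)
samples = gibbs(joint, (0, 1), n_sweeps=21_000, rng=rng)[1_000:]
counts = Counter(samples)
frequencies = {s: counts[s] / len(samples) for s in joint}
print(frequencies)   # should be close to the joint table above
```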

Gibbs properties
● Can handle evidence even with very low probability
● Works for all kinds of models, e.g. Markov networks, continuous variables
● Works very well in many practical cases
● Overall a very powerful and useful technique
● Very popular nowadays
● Has become another Swiss army knife for probabilistic inference

but

● Samples are not statistically independent (statistics gets difficult)
● Hard to give guarantees on results

Jaroszewicz

Gibbs problems - more exploratory chains needed

Jaroszewicz

Gibbs sampling: example

Bayesian PMF using MCMC

https://www.cs.toronto.edu/~amnih/papers/bpmf.pdf

Bayesian PMF using MCMC

Some useful formulas:

on the blackboard ...

Diagnostics

You never know with randomness...

Practical problems
● We only want to use samples that are drawn from a distribution close to p(x), i.e. when the chain is already ‘mixing’
● At early iterations (before the chain has converged) we may be far from p(x), so we need ‘burn-in’ iterations
● Samples are correlated, so we need thinning (take only every n-th sample)
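Both remedies amount to simple slicing of the stored chain. A tiny sketch, with a made-up chain of 100 "samples" (just their iteration indices) and arbitrary burn_in and thin values:

```python
def burn_in_and_thin(chain, burn_in, thin):
    """Drop the first burn_in samples, then keep every thin-th of the rest."""
    return chain[burn_in::thin]


chain = list(range(100))   # stand-in for 100 MCMC samples
kept = burn_in_and_thin(chain, burn_in=20, thin=10)
print(kept)                # [20, 30, 40, 50, 60, 70, 80, 90]
```

How many iterations to discard and how strongly to thin depend on the chain at hand; the values here are placeholders.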

Diagnostics
● Visual Inspection
● Geweke Diagnostic
  ○ tests whether the burn-in is sufficient
● Gelman and Rubin Diagnostic
  ○ may detect problems with disconnected sample spaces
● Raftery and Lewis Diagnostic
  ○ calculates the number of iterations and burn-in needed by first running ...
● Heidelberger and Welch Diagnostic
  ○ test statistic for stationarity of the distribution


Visual inspection


Multimodal distribution, hard to get from one mode to another. The chain is not mixing.

Autocorrelation (correlation between delayed samples)
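A minimal sketch of the sample autocorrelation at a given lag, using the common estimator that normalizes by the lag-0 sum of squares. The AR(1) chain with coefficient 0.9 stands in for a slowly mixing sampler and is an assumption for illustration, as is the function name.

```python
import random


def autocorrelation(chain, lag):
    """Sample autocorrelation of the chain at the given lag (equals 1.0 at lag 0)."""
    n = len(chain)
    mean = sum(chain) / n
    total = sum((x - mean) ** 2 for x in chain)
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return cov / total


rng = random.Random(5)

# Strongly correlated chain: x_t = 0.9 * x_{t-1} + noise.
correlated = [0.0]
for _ in range(19_999):
    correlated.append(0.9 * correlated[-1] + rng.gauss(0.0, 1.0))

# Independent draws for comparison.
independent = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

ac_correlated = autocorrelation(correlated, 1)
ac_independent = autocorrelation(independent, 1)
print(ac_correlated, ac_independent)   # roughly 0.9 versus roughly 0
```

High autocorrelation that decays slowly with the lag suggests heavier thinning or a longer run.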


Geweke Diagnostic
● takes two nonoverlapping parts of the Markov chain
● compares the means of both parts, using a difference of means test
● to see if the two parts of the chain are from the same distribution (null hypothesis)
● the test statistic is a standard Z-score with the standard errors adjusted for autocorrelation

Gelman and Rubin Diagnostic
1. Run m ≥ 2 chains of length 2n from overdispersed starting values.
2. Discard the first n draws in each chain.
3. Calculate the within-chain and between-chain variance.
4. Calculate the estimated variance of the parameter as a weighted sum of the within-chain and between-chain variance.
5. Calculate the potential scale reduction factor R.

When R is high (perhaps greater than 1.1 or 1.2), we should run our chains longer to improve convergence to the stationary distribution.
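Steps 3 to 5 can be sketched as below, assuming the standard formulas: W is the mean of the within-chain variances, B is n times the variance of the chain means, the pooled estimate is (1 - 1/n) W + B/n, and R is the square root of the pooled estimate divided by W. The input is assumed to hold the n retained draws of each chain (after step 2). The synthetic chains are made up: one set shares a distribution, the other has a chain stuck around a different mode.

```python
import math
import random


def gelman_rubin(chains):
    """Potential scale reduction factor for m chains of equal length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # Between-chain variance B and within-chain variance W.
    B = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    W = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1) for c, mu in zip(chains, means)
    ) / m
    var_hat = (1.0 - 1.0 / n) * W + B / n    # weighted sum of W and B
    return math.sqrt(var_hat / W)


rng = random.Random(6)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1_000)] for _ in range(4)]
stuck = [
    [rng.gauss(0.0, 1.0) for _ in range(1_000)],
    [rng.gauss(5.0, 1.0) for _ in range(1_000)],   # chain trapped elsewhere
]

r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
print(r_mixed, r_stuck)   # close to 1 versus clearly above 1.2
```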


Probabilistic programming

Probabilistic programming language
programming language designed to:

● describe probabilistic models
● perform inference automatically even on complicated models

for example:

● PyMC
● BUGS / JAGS
● BayesPy

https://en.wikipedia.org/wiki/Probabilistic_programming_language

What’s inside?
● BUGS - Adaptive Rejection (AR) sampling
● JAGS - Slice Sampler (one variable at once)

JAGS PMF-like example: model file

model{
#########START###########
  # Uniform priors on the standard deviations
  sv ~ dunif(0,100)
  su ~ dunif(0,100)
  s ~ dunif(0,100)
  # dnorm in JAGS takes a precision (1/variance)
  tau <- 1/(s*s)
  tauv <- 1/(sv*sv)
  tauu <- 1/(su*su)
  ...

  # Latent item factors
  for (j in 1:M) {
    for (d in 1:D) {
      v[j,d] ~ dnorm(0, tauv)
    }
  }
  # Latent user factors
  for (i in 1:N) {
    for (d in 1:D) {
      u[i,d] ~ dnorm(0, tauu)
    }
  }
  # Rating = logistic function of the inner product, plus Gaussian noise
  for (j in 1:M) {
    for (i in 1:N) {
      mu[i,j] <- inprod(u[i,], v[j,])
      r3[i,j] <- 1/(1+exp(-mu[i,j]))
      r[i,j] ~ dnorm(r3[i,j], tau)
    }
  }
}
#############END############

JAGS PMF-like example: Parameters preparation

n.chains = 1
n.iter = 5000
n.burnin = n.iter
n.thin = 1 #max(1, floor((n.iter - n.burnin)/1000))
D = 10
lu = 0.05
lv = 0.05
n.cluster=n.chains
model.file = "models/pmf_hypnorm3.bug"

N = dim(train)[1]
M = dim(train)[2]
start.s = sd(train[!is.na(train)])
start.su = sqrt(start.s^2/lu)
start.sv = sqrt(start.s^2/lv)

jags.data = list(N=N, M=M, D=D, r=train)
jags.params = c("u", "v", "s", "su", "sv")
jags.inits = list(s=start.s, su=start.su, sv=start.sv,
                  u=matrix( rnorm(N*D,mean=0,sd=start.su), N, D),
                  v=matrix( rnorm(M*D,mean=0,sd=start.sv), M, D))

JAGS PMF-like example: running (sampling)

library(rjags)
model = jags.model(model.file, jags.data, n.chains=n.chains, n.adapt=n.burnin)
#update(model)
samples = jags.samples(model, jags.params, n.iter=n.iter, thin=n.thin)

JAGS PMF-like example: retrieving samples

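# samples$u has dimensions [user, factor, iteration, chain]
per.chain = dim(samples$u)[3]
iterations = per.chain * dim(samples$u)[4]

# k-th sample (counted across all chains) of the factor vector of user i
user_sample = function(i, k) {samples$u[i, , (k-1)%%per.chain+1, ceiling(k/per.chain)]}

# k-th sample (counted across all chains) of the factor vector of item j
item_sample = function(j, k) {samples$v[j, , (k-1)%%per.chain+1, ceiling(k/per.chain)]}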

Why it’s good, why it’s bad?
● fast prototyping
● less control

Results on MovieLens 100k

RMSE = 0.943 (~SGD)

More on https://github.com/tkusmierczyk/pmf-jags

Thank you!
