
Bayesian Learning in Undirected Graphical Models: Approximate MCMC algorithms

Iain Murray and Zoubin Ghahramani

Gatsby Computational Neuroscience Unit
University College London, London WC1N 3AR, UK

http://www.gatsby.ucl.ac.uk/

{i.murray,zoubin}@gatsby.ucl.ac.uk

Abstract

Bayesian learning in undirected graphical models—computing posterior distributions over parameters and predictive quantities—is exceptionally difficult. We conjecture that for general undirected models, there are no tractable MCMC (Markov Chain Monte Carlo) schemes giving the correct equilibrium distribution over parameters. While this intractability, due to the partition function, is familiar to those performing parameter optimisation, Bayesian learning of posterior distributions over undirected model parameters has been unexplored and poses novel challenges. We propose several approximate MCMC schemes and test them on fully observed binary models (Boltzmann machines) for a small coronary heart disease data set and larger artificial systems. While approximations must perform well on the model, their interaction with the sampling scheme is also important. Samplers based on variational mean-field approximations generally performed poorly; more advanced methods using loopy propagation, brief sampling and stochastic dynamics led to acceptable parameter posteriors. Finally, we demonstrate these techniques on a Markov random field with hidden variables.

1 Introduction

Probabilistic graphical models are an elegant and powerful framework for representing distributions over many random variables. Undirected graphs provide a natural description of soft constraints between variables. Mutual compatibilities amongst variables, x = (x_1, ..., x_k), are described by a factorised joint probability distribution:

p(x|θ) = (1/Z(θ)) exp{ ∑_j φ_j(x_{C_j}, θ_j) },   (1)

where C_j ⊂ {1, ..., k} indexes a subset of the variables and φ_j is a potential function, parameterised by θ_j, expressing compatibilities amongst x_{C_j}. The partition function or normalisation constant

Z(θ) = ∑_x exp{ ∑_j φ_j(x_{C_j}, θ_j) }   (2)

is the (usually intractable) sum or integral over all configurations of the variables. The undirected model representing the conditional independencies implied by the factorization (1) has a node for each variable and an undirected edge connecting every pair of variables x_i–x_ℓ, if i, ℓ ∈ C_j for some j. The subsets C_j are therefore cliques (fully connected subgraphs) of the whole graph. An alternative and more general representation of undirected models is a factor graph. Factor graphs are bipartite graphs consisting of two types of nodes, one type representing the variables i ∈ {1, ..., k} and the other type the factors j appearing in the product (1). A variable node i is connected via an undirected edge to a factor node j if i ∈ C_j.

This work focuses on representing the parameter posterior p(θ|x) using samples, which can be used in approximating distributions over predictive quantities. Averaging over the parameter posterior can avoid the overfitting associated with optimisation. While sampling from parameters has attracted much attention, and is often tractable, in directed models, it is much more difficult for all but the most trivial¹ undirected graphical models. While directed models are a more natural tool for modelling causal relationships, the soft constraints provided by undirected models have proven

¹ i.e., low tree-width graphs, graphical Gaussian models and small contingency tables.

392 MURRAY & GHAHRAMANI UAI 2004


useful in a variety of problem domains; we briefly mention six applications.

(a) In computer vision [1] Markov random fields (MRFs), a form of undirected model, are used to model the soft constraint a pixel or image feature imposes on nearby pixels or features; this use of MRFs grew out of a long tradition in spatial statistics [2]. (b) In language modelling a common form of sentence model measures a large number of features of a sentence f_j(s), such as the presence of a word, subject–verb agreement, the output of a parser on the sentence, etc., and assigns each such feature a weight λ_j. A random field model of this is then p(s|λ) = (1/Z(λ)) exp{∑_j λ_j f_j(s)}, where the weights can be learned via maximum likelihood iterative scaling methods [3]. (c) These undirected models can be extended to coreference analysis, which deals with determining, for example, whether two items (e.g., strings, citations) refer to the same underlying object [4]. (d) Undirected models have been used to model protein folding [5] and the soft constraints on the configuration of protein side chains [6]. (e) Semi-supervised classification is the problem of classifying a large number of unlabelled points using a small number of labelled points and some prior knowledge that nearby points have the same label. This problem can be approached by defining an undirected graphical model over both labelled and unlabelled data [7]. (f) Given a set of directed models p(x|θ_j), the products of experts idea is a simple way of defining a more powerful (undirected) model by multiplying them: p(x|θ) = (1/Z(θ)) ∏_j p(x|θ_j) [8]. The product assigns high probability when there is consensus among component models.

Despite the long history and wide applicability of undirected models, surprisingly, Bayesian treatments of large undirected models are virtually non-existent! Indeed there is a related statistical literature on Bayesian inference in undirected models, log linear models, and contingency tables [9, 10, 11]. However, this literature assumes that the partition function Z(θ) can be computed exactly. But for all six machine learning applications of undirected models cited above, this assumption is unreasonable. This paper addresses Bayesian learning for models with intractable Z(θ).

We focus on a particularly simple and well-studied undirected model, the Boltzmann machine.

2 Bayesian Inference in Boltzmann Machines

A Boltzmann machine (BM) is a Markov random field which defines a probability distribution over a vector of binary variables s = [s_1, ..., s_k] where s_i ∈ {0, 1}:

p(s|W) = (1/Z(W)) exp{ ∑_{i<j} W_ij s_i s_j }   (3)

The symmetric weight matrix W parameterises this distribution. In a BM there are usually also linear bias terms ∑_i b_i s_i in the exponent; we omit these biases to simplify notation, although the models in the experiments assume them. The undirected model for a BM has edges for all non-zero elements of W. Since the Boltzmann machine has only pairwise terms in the exponent, factor graphs provide a better representation for the model.
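For a model small enough to enumerate, the distribution (3) can be checked directly. A minimal sketch in Python (the 3-node weight matrix here is an illustrative assumption, not taken from the paper):

```python
import itertools
import math

def log_unnorm(s, W):
    """Unnormalised log probability of state s: sum_{i<j} W_ij s_i s_j."""
    k = len(s)
    return sum(W[i][j] * s[i] * s[j]
               for i in range(k) for j in range(i + 1, k))

def partition_function(W):
    """Z(W) in (3) by brute-force enumeration of all 2^k binary states."""
    k = len(W)
    return sum(math.exp(log_unnorm(s, W))
               for s in itertools.product([0, 1], repeat=k))

def prob(s, W):
    """p(s|W) = exp{sum_{i<j} W_ij s_i s_j} / Z(W)."""
    return math.exp(log_unnorm(s, W)) / partition_function(W)

# Illustrative symmetric weight matrix for a 3-node BM (not from the paper).
W = [[0.0,  0.5, -0.3],
     [0.5,  0.0,  1.0],
     [-0.3, 1.0,  0.0]]
total = sum(prob(s, W) for s in itertools.product([0, 1], repeat=3))
```

Enumeration costs 2^k, which is exactly why the paper needs approximations beyond toy sizes.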

The usual algorithm for learning BMs is a maximum likelihood version of the EM algorithm (assuming some of the variables are hidden, s_H, and some observed, s_O) [12]. The gradient of the log probability is:

∂ log p(s|W) / ∂W_ij = ⟨s_i s_j⟩_c − ⟨s_i s_j⟩_u   (4)

where ⟨·⟩_c denotes expectation under the "clamped" data distribution p(s_H|s_O, W) and ⟨·⟩_u denotes expectation under the "unclamped" distribution p(s|W). For a data set S = [s^(1) ... s^(n) ... s^(N)] of i.i.d. data the gradient of the log likelihood is simply summed over n. For Boltzmann machines with large tree-width these expectations would take exponential time to compute, and the usual approach is to approximate them using Gibbs sampling or one of many more recent approximate inference algorithms.
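The unclamped expectations in (4) are commonly approximated by Gibbs sampling. A sketch, using the single-site conditional for a bias-free BM, with an exact enumeration included only to sanity-check the estimate (the weights are illustrative):

```python
import itertools
import math
import random

def exact_pair_expectations(W):
    """Ground-truth <s_i s_j>_u under p(s|W) by enumerating all states."""
    k = len(W)
    states = list(itertools.product([0, 1], repeat=k))
    ws = [math.exp(sum(W[i][j] * s[i] * s[j]
                       for i in range(k) for j in range(i + 1, k)))
          for s in states]
    Z = sum(ws)
    return {(i, j): sum(w * s[i] * s[j] for s, w in zip(states, ws)) / Z
            for i in range(k) for j in range(i + 1, k)}

def gibbs_pair_expectations(W, n_sweeps=3000, burn_in=500, seed=0):
    """Estimate the unclamped expectations in (4) with single-site Gibbs:
    p(s_i = 1 | s_{-i}) = sigmoid(sum_j W_ij s_j)."""
    rng = random.Random(seed)
    k = len(W)
    s = [rng.randint(0, 1) for _ in range(k)]
    counts = {(i, j): 0.0 for i in range(k) for j in range(i + 1, k)}
    kept = 0
    for sweep in range(n_sweeps):
        for i in range(k):
            field = sum(W[i][j] * s[j] for j in range(k) if j != i)
            s[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-field)) else 0
        if sweep >= burn_in:
            kept += 1
            for (i, j) in counts:
                counts[(i, j)] += s[i] * s[j]
    return {e: c / kept for e, c in counts.items()}

W = [[0.0, 0.8, 0.2], [0.8, 0.0, -0.5], [0.2, -0.5, 0.0]]
est = gibbs_pair_expectations(W)
exact = exact_pair_expectations(W)
```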

Consider doing Bayesian inference for the parameters of a Boltzmann machine, i.e., computing p(W|S). One can define a joint model:

p(W, S) = (1/Z) exp{ −(1/2σ²) ∑_{i<j} W_ij² + ∑_n ∑_{i<j} W_ij s_i^(n) s_j^(n) }   (5)

The first term acts like a prior, the normaliser Z does not depend on W, and it is easy to see that p(S|W) is exactly the likelihood term for a Boltzmann machine with i.i.d. data: p(S|W) = ∏_n p(s^(n)|W) = ∏_n (1/Z(W)) exp{∑_{i<j} W_ij s_i^(n) s_j^(n)}. Moreover, it is very easy to sample from p(W|S) since it is a multivariate Gaussian. Thus it appears that we have defined a joint distribution where the likelihood is exactly the BM model, and the posterior over parameters is trivial to sample from. Could Bayesian inference in Boltzmann machines be so simple?

Unfortunately, there is something deeply flawed with the above approach. By marginalisation of (5), the actual prior over the parameters must have been

p(W) = ∑_S p(W, S) ∝ N(0, σ²I) Z(W)^N.   (6)

However, this "prior" is dependent on the size of the data set! Moreover, the parametric form of the "prior" is very complicated, favouring weights with large partition functions—an effect that will overwhelm the Gaussian term. This is therefore not a valid hierarchical Bayesian model for a BM, and inferences from this model will be essentially meaningless.

The lesson from this simple example is the following: it is not possible to remove the partition function from the parameter posterior, as the "prior" that this would imply will be dependent on the number of data points. In order to have sensible parameter inferences, therefore, considering changes of the partition function with the parameters is unavoidable. Fortunately, there exist a large number of tools for approximating partition functions and their derivatives, given by expectations under (1). We now examine how approximations to partition functions and expectations can be used for approximate Bayesian inference in undirected models.

3 Monte Carlo Parameter Sampling

MCMC (Markov Chain Monte Carlo) methods allow us to draw correlated samples from a probability distribution with unknown normalisation. A rich set of methods is available [13], but as discussed above any scheme must compute a quantity related to the partition function before making any change to the parameters. We discuss two simple samplers that demonstrate the range of approximate methods available.

Consider the simplest Metropolis sampling scheme for the parameters of a Boltzmann machine given fully observed data. Starting from parameters W, assume that W′ is proposed from a symmetric proposal distribution t(W′|W) = t(W|W′). This proposal should be accepted with probability a = min(1, p(W′|S)/p(W|S)), where

p(W′|S) / p(W|S) = [p(W′) p(S|W′)] / [p(W) p(S|W)]   (7)
                 = [p(W′)/p(W)] · (Z(W)/Z(W′))^N · exp{ ∑_{n, i<j} (W′_ij − W_ij) s_i^(n) s_j^(n) }.

For general BMs even a single step of this simple scheme is intractable due to Z(W). One class of approach we will pursue is using deterministic tools to form an approximation Ẑ(W) ≈ Z(W) in the above expression. Clearly this results in an approximate sampler, which does not converge to the true equilibrium distribution over parameters. Moreover, it seems reckless to take an approximate quantity to the Nth power. Despite these caveats we explore empirically whether approaches based on this class of approximation are viable.
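To make the scheme concrete, here is a sketch of this Metropolis sampler for a tiny BM, where log Z(W) happens to be computable exactly; in a real application the call to log_Z would be replaced by an approximation Ẑ(W). The toy data, prior width and step size are illustrative assumptions:

```python
import itertools
import math
import random

def log_Z(W):
    """Exact by enumeration here; a larger model would substitute a
    deterministic approximation (mean field, Bethe, ...)."""
    k = len(W)
    return math.log(sum(
        math.exp(sum(W[i][j] * s[i] * s[j]
                     for i in range(k) for j in range(i + 1, k)))
        for s in itertools.product([0, 1], repeat=k)))

def log_posterior(W, S, sigma=1.0):
    """log p(W|S) up to a constant: N(0, sigma^2) prior on each weight
    plus the BM likelihood, including the -N log Z(W) term."""
    k = len(W)
    lp = -sum(W[i][j] ** 2 for i in range(k)
              for j in range(i + 1, k)) / (2 * sigma ** 2)
    for s in S:
        lp += sum(W[i][j] * s[i] * s[j]
                  for i in range(k) for j in range(i + 1, k))
    return lp - len(S) * log_Z(W)

def metropolis(S, k, n_steps=500, step=0.2, seed=1):
    """Single-parameter Gaussian proposals, accepted via the ratio (7)."""
    rng = random.Random(seed)
    W = [[0.0] * k for _ in range(k)]
    cur = log_posterior(W, S)
    samples, accepts = [], 0
    for _ in range(n_steps):
        i, j = sorted(rng.sample(range(k), 2))
        old = W[i][j]
        W[i][j] = W[j][i] = old + rng.gauss(0.0, step)
        new = log_posterior(W, S)
        if new >= cur or rng.random() < math.exp(new - cur):   # accept
            cur = new
            accepts += 1
        else:                                                  # reject
            W[i][j] = W[j][i] = old
        samples.append([row[:] for row in W])
    return samples, accepts

# Hypothetical fully observed data for a 3-node BM.
S = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 1, 0)]
samples, accepts = metropolis(S, k=3)
```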

Note that above we need only compute the ratio of the partition function at pairs of parameter settings, Z(W)/Z(W′). This ratio can be approximated directly by noting that:

Z(W)/Z(W′) = ∑_s exp{ ∑_{i<j} (W_ij − W′_ij) s_i s_j } · exp{ ∑_{i<j} W′_ij s_i s_j } / Z(W′)
           = ⟨ exp{ ∑_{i<j} (W_ij − W′_ij) s_i s_j } ⟩_{p(s|W′)}   (8)

where ⟨·⟩_p denotes expectation under p. Thus any method for sampling from p(s|W′), such as MCMC methods, exact sampling methods, or any deterministic approximation that can yield the above expectation can be nested into the Metropolis sampler for W.
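The estimator (8) can be sketched as follows. In this toy-sized example we draw exact samples from p(s|W′) by enumeration purely for illustration; at scale these samples would come from MCMC or exact sampling. The weight matrices are made up:

```python
import itertools
import math
import random

def exact_ratio(W, Wp):
    """Z(W)/Z(W') by enumeration, as ground truth for the estimator."""
    k = len(W)
    def Z(M):
        return sum(math.exp(sum(M[i][j] * s[i] * s[j]
                                for i in range(k) for j in range(i + 1, k)))
                   for s in itertools.product([0, 1], repeat=k))
    return Z(W) / Z(Wp)

def estimated_ratio(W, Wp, n_samples=4000, seed=0):
    """Monte Carlo version of (8): average exp{sum_{i<j}(W_ij - W'_ij) s_i s_j}
    over samples drawn from p(s|W')."""
    rng = random.Random(seed)
    k = len(W)
    states = list(itertools.product([0, 1], repeat=k))
    weights = [math.exp(sum(Wp[i][j] * s[i] * s[j]
                            for i in range(k) for j in range(i + 1, k)))
               for s in states]
    total = 0.0
    for _ in range(n_samples):
        s = rng.choices(states, weights=weights)[0]
        total += math.exp(sum((W[i][j] - Wp[i][j]) * s[i] * s[j]
                              for i in range(k) for j in range(i + 1, k)))
    return total / n_samples

# Illustrative current and proposed weights, differing slightly.
Wp = [[0.0, 0.6, -0.2], [0.6, 0.0, 0.4], [-0.2, 0.4, 0.0]]
W  = [[0.0, 0.7, -0.2], [0.7, 0.0, 0.3], [-0.2, 0.3, 0.0]]
```

The estimator has low variance only when W and W′ are close, which suits small Metropolis proposals.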

The Metropolis scheme is often not an efficient way of sampling from continuous spaces as it suffers from "random-walk" behaviour. That is, it typically takes at least order t steps to travel a distance of √t.

Schemes exist that use gradient information to reduce this behaviour by simulating a stochastic dynamical system [13]. The simplest of these is the "uncorrected Langevin method". Parameters are updated without any rejections according to the rule:

θ′_i = θ_i + (ε²/2) ∂/∂θ_i log p(x, θ) + ε n_i,   (9)

where the n_i are independent draws from a zero-mean, unit-variance Gaussian. Intuitively this rule performs gradient ascent on the log probability but explores away from the optimum through the noise term. Strictly this is only an approximation except in the limit of vanishing ε. A corrected version would require knowing Z(W) as well as the gradients. This effort may not be justified when the gradients and Z(W) are only available as approximations. However approximate correction would allow use of the more general hybrid Monte Carlo method.

Using the above or other dynamical methods, a third target for approximation for systems with continuous parameters is the gradient of the joint log probability. In the case of BMs, we have:

∂ log p(S, W) / ∂W_ij = ∑_n s_i^(n) s_j^(n) − N ⟨s_i s_j⟩_{p(s|W)} + ∂ log p(W) / ∂W_ij   (10)

Assuming an easy-to-differentiate prior, the main difficulty arises, as in (4), from computing the middle term: the unclamped expectations over the variables.
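One uncorrected Langevin update (9) built on the gradient (10) can be sketched as below. The unclamped expectations are computed exactly by enumeration; a large model would substitute an approximation. The data, step size and Gaussian prior are illustrative assumptions:

```python
import itertools
import math
import random

def pair_expectations(W):
    """Exact <s_i s_j>_{p(s|W)} by enumerating all 2^k states."""
    k = len(W)
    states = list(itertools.product([0, 1], repeat=k))
    ws = [math.exp(sum(W[i][j] * s[i] * s[j]
                       for i in range(k) for j in range(i + 1, k)))
          for s in states]
    Z = sum(ws)
    E = [[0.0] * k for _ in range(k)]
    for s, w in zip(states, ws):
        for i in range(k):
            for j in range(i + 1, k):
                E[i][j] += w * s[i] * s[j] / Z
    return E

def langevin_step(W, S, eps=0.05, sigma=1.0, rng=None):
    """One uncorrected Langevin update (9) using the gradient (10):
    sum_n s_i s_j  -  N <s_i s_j>_{p(s|W)}  +  d log p(W)/dW_ij."""
    rng = rng or random.Random(0)
    k, N = len(W), len(S)
    E = pair_expectations(W)
    Wnew = [row[:] for row in W]
    for i in range(k):
        for j in range(i + 1, k):
            grad = (sum(s[i] * s[j] for s in S)   # clamped statistics
                    - N * E[i][j]                 # unclamped expectations
                    - W[i][j] / sigma ** 2)       # Gaussian prior gradient
            Wnew[i][j] = Wnew[j][i] = (W[i][j] + 0.5 * eps ** 2 * grad
                                       + eps * rng.gauss(0.0, 1.0))
    return Wnew

# Hypothetical observed data; one update starting from W = 0.
S = [(1, 1, 0), (0, 1, 1), (1, 1, 1), (0, 0, 0)]
W1 = langevin_step([[0.0] * 3 for _ in range(3)], S)
```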


Interestingly, although many learning algorithms for undirected models (e.g. equation 4) are based on computing gradients of the form (10), and it would be simple to plug these into approximate stochastic dynamics MCMC methods to do Bayesian inference, this approach does not appear to have been investigated. We explore this approach in our experiments.

We have taken two existing sampling schemes (Metropolis and Langevin) and identified three targets for approximation to make these schemes tractable: Z(W), Z(W)/Z(W′) and ⟨s_i s_j⟩_{p(s|W)}. While our explicit derivations have focused on Boltzmann machines, these same expressions generalise in a straightforward way to Bayesian parameter inference in a general undirected model of the form (1). In particular, many undirected models of interest can be parameterised to have potentials in the exponential family, φ_j(x_{C_j}, θ_j) = u_j(x_{C_j})ᵀ θ_j. For such models, the key ingredients for an approximation are the expected sufficient statistics, ⟨u_j(x_{C_j})⟩.

4 Approximation Schemes

Using the above concepts and focusing on Boltzmann machines we now define a variety of approximate sampling methods, by deriving approximations to one of our three target quantities in equations (7), (8) and (10).

Naive mean field. Using Jensen's inequality we can lower bound the log partition function as follows:

log Z(W) = log ∑_s exp{ ∑_{i<j} W_ij s_i s_j }
         ≥ ∑_{i<j} W_ij ⟨s_i s_j⟩_{q(s)} + H(q) ≡ F(W, q)   (11)

where q(s) is any distribution over the variables, and H(q) is the entropy of this distribution. Defining the set of fully factorised distributions Q_mf = {q : q(s) = ∏_i q_i(s_i)}, we can find a local maximum of this lower bound, log Z_mf(W) = max_{q ∈ Q_mf} F(W, q), using an iterative and tractable mean-field algorithm. We define the mean-field Metropolis algorithm as using Z_mf(W) in place of Z(W) in the acceptance probability computation (7). The expectations from the naive mean-field algorithm could also be used to compute direct approximations to the gradient for use in a stochastic dynamics method (10).
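The mean-field iteration can be sketched as coordinate ascent on F(W, q): each update sets m_i = σ(∑_j W_ij m_j), which maximises (11) over q_i with the other factors held fixed. The weights below are illustrative; the exact log Z is included only so the bound can be verified:

```python
import itertools
import math

def mean_field_bound(W, n_iters=200):
    """Coordinate-ascent naive mean field for a bias-free BM.
    Returns F(W, q) of (11), a lower bound on log Z(W)."""
    k = len(W)
    m = [0.5] * k                      # m_i = q_i(s_i = 1)
    for _ in range(n_iters):
        for i in range(k):
            field = sum(W[i][j] * m[j] for j in range(k) if j != i)
            m[i] = 1.0 / (1.0 + math.exp(-field))
    # F = sum_{i<j} W_ij m_i m_j + H(q), since <s_i s_j>_q = m_i m_j
    F = sum(W[i][j] * m[i] * m[j] for i in range(k) for j in range(i + 1, k))
    for mi in m:
        for p in (mi, 1.0 - mi):
            if p > 0.0:
                F -= p * math.log(p)
    return F

def exact_log_Z(W):
    """Brute-force log Z(W) for verifying the bound on a toy model."""
    k = len(W)
    return math.log(sum(
        math.exp(sum(W[i][j] * s[i] * s[j]
                     for i in range(k) for j in range(i + 1, k)))
        for s in itertools.product([0, 1], repeat=k)))

W = [[0.0, 1.0, -0.5], [1.0, 0.0, 0.7], [-0.5, 0.7, 0.0]]
```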

Tree-structured variational approximation. Jensen's inequality can be used to obtain much tighter bounds than those given by the naive mean-field method. For example, constraining q to be in the set of all tree-structured distributions Q_tree we can still tractably optimise the lower bound on the partition function [14], obtaining Z_tree(W) ≤ Z(W). The tree Metropolis algorithm is defined to use this in (7). Alternatively, expectations under the tree could also be used to form the gradient estimate for a stochastic dynamics method (10).

Bethe approximation. A recent justification for applying belief propagation to graphs with cycles is the relationship between this algorithm's messages and the fixed points of the Bethe free energy [15]. While this breakthrough gave a new approximation for the partition function, we are unaware of any work using it for Bayesian model selection. In the loopy Metropolis algorithm belief propagation is run on each proposed system, and the Bethe free energy is used to approximate the acceptance probability (7). Traditionally belief propagation is used to compute marginals; pairwise marginals can be used to compute the expectations used in gradient methods (10) or in finding partition function ratios (8). These approaches lead to different algorithms, although their approximations are clearly closely related.

Langevin using brief sampling. The pairwise marginals required in (9, 10) can be approximated by MCMC sampling. The Gibbs sampler used in section 6.1 is a popular choice, whereas in section 6.2 a more sophisticated Swendsen–Wang sampler is employed. Unfortunately—as in maximum likelihood learning (4)—the parameter-dependent variance of these estimates can hinder convergence and introduce biases [8]. The brief Langevin algorithm, inspired by work on Contrastive Divergence, uses very brief sampling starting from the data, S, which gives biased but low-variance estimates of the required expectations. As the approximations in this section are run as an inner loop to the main sampler, the cheapness of brief sampling makes it an attractive option.
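The brief-sampling idea can be sketched as follows: start a Gibbs chain at each data vector, run only a few sweeps, and average the resulting pairwise statistics in place of ⟨s_i s_j⟩_{p(s|W)} in (10). The data, weights and sweep count below are illustrative assumptions:

```python
import math
import random

def brief_expectations(W, S, n_sweeps=2, seed=0):
    """Contrastive-divergence-style estimate of <s_i s_j>_{p(s|W)}:
    very short Gibbs chains started at the data give biased but
    low-variance estimates of the unclamped expectations."""
    rng = random.Random(seed)
    k = len(W)
    E = [[0.0] * k for _ in range(k)]
    for s0 in S:
        s = list(s0)
        for _ in range(n_sweeps):
            for i in range(k):
                field = sum(W[i][j] * s[j] for j in range(k) if j != i)
                s[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-field)) else 0
        for i in range(k):
            for j in range(i + 1, k):
                E[i][j] += s[i] * s[j] / len(S)
    return E

# Hypothetical data and weights for a 3-node BM.
S = [(1, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
W = [[0.0, 0.5, 0.1], [0.5, 0.0, -0.3], [0.1, -0.3, 0.0]]
E = brief_expectations(W, S)
```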

Langevin using exact sampling². Unbiased expectations can be obtained in some systems using an exact sampling algorithm based on coupling from the past, e.g. [16]. Again variance could be eliminated by reuse of random numbers. This seems a promising area for future research.

Pseudo-likelihood. Replacing the likelihood of the parameters with a tractable product of conditional probabilities is a common approximation in Markov random fields for image modelling. The only Bayesian approach to learning in large systems of which we are aware is in this context [17, 18]. The models used in our experiments (section 6.1) were not well approximated by the pseudo-likelihood, so we did not explore it further.

² Suggested by David MacKay.


5 Extension to Hidden Variables

So far we have only considered models of the form p(x|θ) where all variables, x, are observed. Often models need to cope with missing data, or have variables that are always hidden. These are often the models that would most benefit from a Bayesian approach to learning the parameters. In fully observed models in the exponential family the parameter posteriors are often relatively simple, as they are log concave if the prior used is also log concave (as seen later in figure 1). The parameter posterior with hidden variables will be a linear combination of log concave functions, which need not be log concave and can be multi-modal.

In theory the extension to hidden variables is simple. First consider a model p(x, h|θ), where h are unobserved variables. The parameter posterior is still proportional to p(x|θ)p(θ), and we observe

p(x|θ) = ∑_h p(x, h|θ) = (1/Z(θ)) ∑_h exp{ ∑_j φ_j((x, h)_{C_j}, θ_j) }

log p(x|θ) = − log Z(θ) + log Z_x(θ).   (12)

That is, the sum in the second term is a partition function, Z_x, for an undirected graph of the variables h. To see this compare to (2) and consider the fixed observations x as parameters of the potential functions. In a system with multiple i.i.d. observations Z_x must be computed for each setting of x. Note however that these additional partition function evaluations are for systems smaller than the original. Therefore, any method that approximates Z(W) or related quantities directly from the parameters can still be used for parameter learning in systems with hidden variables.

The brief sampling and pseudo-likelihood approximations rely on settings of every variable provided by the data. For systems with hidden variables these methods could use settings from samples conditioned on the observed data. In some systems this sampling can be performed easily [8]. In section 6.2 several steps of MCMC sampling over the hidden variables are performed in order to apply our brief Langevin method.

6 Experiments

6.1 Fully observed models

Our approximate samplers were tested on three systems. The first, taken from [19], lists six binary properties detailing risk factors for coronary heart disease in 1841 men. Modelling these variables as outputs of a fully-connected Boltzmann machine, we attempted to draw samples from the distribution over the unknown weights. We can compute Z(W) exactly in this system, which allows us to compare methods against a Metropolis sampler with an exact inner loop. A previous Bayesian treatment of these data also exists [10].

Many practical applications may only need a few tens of samples from the weights. We performed sampling for 100,000 iterations to obtain histograms for each of the weights (Figure 1). The mean-field, tree and loopy Metropolis methods each proposed changes to one parameter at a time using a zero-mean Gaussian with variance 0.01. The brief Langevin method used a step-size ε = 0.01. Qualitatively the results are the same as [10]: parameters deemed important have very little overlap with zero.

The mean-field Metropolis algorithm failed to converge, producing noisy and wide histograms over an ever increasing range of weights (figure 1). The sampler with the tree-based inner loop did not always converge either, and when it did, its samples did not match those of the exact Metropolis algorithm very well. The loopy Metropolis and brief Langevin methods closely match the marginal distributions predicted by the exact Metropolis algorithm for most of the weights. Results are not shown for algorithms using expectations from loopy belief propagation in (10) or (8) as these gave almost identical performance to loopy Metropolis based on (7).

Our other two test systems are 100-node Boltzmann machines and demonstrate learning where exact computation of Z(W) is intractable³. We considered two randomly generated systems, one with 204 edges and another with 500. Each of the parameters not set to zero, including the 100 biases, was drawn from a unit Gaussian. Experiments on an artificial system allow comparisons with the true weight matrix. We ensured our training data were drawn from the correct distribution with an exact sampling method [16]. This level of control would not be available on a natural data set.

The loopy Metropolis algorithm and the brief Langevin method were applied to 100 data points from each system. The model structure was provided, so that only non-zero parameters were learned. Figure 2 shows a typical histogram of parameter samples; the predictive ability over all parameters is also shown. Short runs on similar systems with stronger weights show that loopy Metropolis can be made to perform arbitrarily badly more quickly than the brief Langevin method on this class of system.

³ These test sets are available online: http://www.gatsby.ucl.ac.uk/~iam23/04blug/


[Figure 1: a grid of histogram panels, one per weight W_AA through W_FF; axis tick labels omitted.]

Figure 1: Histograms of samples for every parameter in the heart disease risk factor model. Results from exact Metropolis are shown in solid (blue); loopy Metropolis dashed (purple); brief Langevin dotted (red). These curves are often indistinguishable. The mean-field and tree Metropolis algorithms performed very badly; to reduce clutter these are only shown once each in the plots for W_AA and W_AB respectively, shown in dash-dotted (black).

[Figure 2: three panels; axis tick labels omitted.]

Figure 2: Loopy Metropolis is shown dashed (blue), brief Langevin solid (black). Left: an example histogram as in Figure 1 for the 204 edge BM; the vertical line shows the true weight. Also shown are the fractions of samples, f, within ±0.1 of the true value for every parameter in the 204 edge system (centre) and the 500 edge system (right). The parameters are sorted by f for clarity. Higher curves indicate better performance.

6.2 Hidden variables

Finally we consider an undirected model approach taken from work on semi-supervised learning [7]. Here a graph is defined using the 2D positions, X = {(x_i, y_i)}, of unlabelled and labelled data. The variables on the graph are the class labels, S = {s_i}, of the points. The joint model for the l labelled points and u unobserved or hidden variables is

p(S|X, σ) = (1/Z(σ)) exp{ ∑_{i=1}^{l+u} ∑_{j<i} δ(s_i, s_j) W_ij(σ) }   (13)

where

W_ij(σ) = exp( −(1/2) [ (x_i − x_j)²/σ_x² + (y_i − y_j)²/σ_y² ] ).   (14)

The edge weights of the model, W_ij, are functions of the Euclidean distance between points i and j measured with respect to scale parameters σ = (σ_x, σ_y). Nearby points wish to be classified in the same way, whereas far away points may be approximately uncorrelated, unless linked by a bridge of points in between.
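Equation (14) is straightforward to compute; a short sketch showing how the coupling decays with distance and grows with the length-scales (the coordinates are made up for illustration):

```python
import math

def edge_weight(p_i, p_j, sigma_x, sigma_y):
    """W_ij(sigma) from (14): squared-exponential in the per-axis
    distances between points i and j."""
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    return math.exp(-0.5 * (dx ** 2 / sigma_x ** 2 + dy ** 2 / sigma_y ** 2))

# Nearby points couple strongly; distant points barely couple...
w_near = edge_weight((0.0, 0.0), (0.5, 0.0), 1.0, 1.0)
w_far = edge_weight((0.0, 0.0), (4.0, 0.0), 1.0, 1.0)
# ...unless the relevant length-scale is large.
w_far_large_sigma = edge_weight((0.0, 0.0), (4.0, 0.0), 100.0, 1.0)
```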

The likelihoods in this model can be interesting functions of σ [7], leading to non-Gaussian and possibly multi-modal parameter posteriors with any simple prior. As the likelihood is often a very flat function over some parameter regions, the MAP parameters can change dramatically with small changes in the prior. There is also the possibility that no single setting of the parameters can capture our knowledge.

For binary classification (13) can be rewritten as a standard Boltzmann machine. The edge weights W_ij are now all coupled through σ, so our sampler will only explore a two-dimensional parameter space (σ_x, σ_y). However, little of the above theory is changed by this: we can still approximate the partition function and use this in a standard Metropolis scheme, or apply Langevin methods based on (10) where gradients include sums over edges.

Figure 3(a) shows an example data set for this problem. This toy data set is designed to have an interpretable posterior over σ and demonstrates the type of parameter uncertainty observed in real problems. We can see intuitively that we do not want σ_x or σ_y to be close to zero. This would disconnect all points in the graph, making the likelihood small (≈ 1/2^l). Parameters that correlate nearby points that are the same will be much more probable under a large range of sensible priors. Neither can both σ_x and σ_y be large: this would force the × and ◦ clusters to be close, which is also undesirable. However, one of σ_x and σ_y can be large as long as the other stays below around one. These intuitions are closely matched by the results shown in figure 3(b). This plot shows draws from the parameter posterior using the brief Langevin method based on a Swendsen–Wang sampling inner loop described in [7]. We also reparameterised the posterior to take gradients with respect to log(σ) rather than σ. This is important for any unconstrained gradient method like Langevin. Note that predictions from typical samples of σ will vary greatly. For example, large σ_x predicts the unlabelled cluster in the top left as mainly ×'s, whereas large σ_y predicts ◦'s. It would not be possible to obtain the same predictive distribution over labels with a single 'optimal' setting of the parameters as was pursued in [7]. This demonstrates how Bayesian inference over the parameters of an undirected model can have a significant impact on predictions.

Figure 3(c) shows that loopy Metropolis converges to a very poor posterior distribution, which does not capture the long arms in figure 3(b). This is due to poor approximate partition functions from the inner loop. The graph induced by W contains many tight cycles, which cause problems for loopy belief propagation. As expected, loopy propagation gave sensible posteriors on other problems where the observed points were less dense and formed linear chains.

7 Discussion

Although MCMC sampling in general undirected models is intractable, there are a variety of approximate methods that can be brought to bear on this problem. We have proposed and explored a range of such approximations, including two variational approximations, brief sampling and the Bethe approximation, combined with Metropolis and Langevin methods. Clearly there are many more approximations that could be explored.

Note that simply constructing a joint undirected graph including both parameters and variables, and running approximate inference in this joint graph, is not a good idea. Marginalising out the variables in this graph results in "priors" over parameters that depend on the number of observed data points (6), which is nonsensical.

The mean field and tree-based Metropolis algorithms performed disastrously even on simple problems. We believe these failures result from the use of a lower bound as an approximation. Where the lower bound is poor, the acceptance probability for leaving that parameter setting will be exceedingly low. Thus the sampler is often attracted towards extreme regions where the bound is loose, and does not return.
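The mechanism can be sketched as follows (a reconstruction from the argument above, not an equation from the paper). Writing $\tilde{p}(x|\theta) = \exp\sum_j \phi_j(x_{C_j};\theta)$ for the unnormalised likelihood, $B(\theta) \le Z(\theta)$ for the variational lower bound, and $n$ for the number of i.i.d. observations, the approximate acceptance probability for a proposal $\theta \to \theta'$ is

```latex
\hat{a}(\theta \to \theta') \;=\;
\min\!\left(1,\;
  \frac{p(\theta') \prod_{i=1}^{n} \tilde{p}(x_i|\theta')}
       {p(\theta)  \prod_{i=1}^{n} \tilde{p}(x_i|\theta)}
  \left(\frac{B(\theta)}{B(\theta')}\right)^{\!n}
\right),
\qquad B(\theta) \le Z(\theta).
```

If the bound is loose at the current setting, $B(\theta) \ll Z(\theta)$, while it is tight at the proposal, the factor $(B(\theta)/B(\theta'))^{n}$ is exponentially too small in $n$, so moves away from $\theta$ are almost never accepted and the chain sticks where the bound is loose.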

The Bethe free energy based Metropolis algorithm performs considerably better and gave the best results on one of our artificial systems. However, it also performed terribly on our final application. In general, if an approximation performs poorly in the inner loop then we cannot expect good parameter posteriors from the outer loop. In loopy propagation it is well known that poor approximations result for frustrated systems, and for systems with large weights or tight cycles.

Brief Langevin typically yields broader distributions and fails less rapidly with strong weights, so we expect it to be more robust than loopy Metropolis. It gives reasonable answers on large systems where the other methods failed. We have several ideas for how to further improve upon this method, for example by reusing random seeds, which we plan to explore.

To summarise, this paper addresses the largely neglected problem of Bayesian learning in undirected models. We have described and compared a wide range of approaches to this problem, highlighting some of the difficulties and solutions. While the problem is intractable, approximate Bayesian learning should be possible in many of the applications of undirected models (a–f, section 1). Examining the approximate parameter optimisation methods currently in use provides a valuable source of approximations for the quantities found in equations (7), (8) and (10). We have shown principles for using these equations to design good MCMC samplers, which should be widely applicable to Bayesian learning in these important uses of undirected models.

398 MURRAY & GHAHRAMANI UAI 2004


[Figure 3 appears here: three scatter plots. (a) Data set, axes x and y, range −1.5 to 1.5. (b) Brief Langevin, axes σx and σy, range 0.5 to 2.5. (c) Loopy Metropolis, axes σx and σy, range 0.4 to 0.8.]

Figure 3: (a) a data set for semi-supervised learning with 80 variables: two groups of classified points (× and ◦) and unlabelled data (·). (b) 10,000 approximate samples from the posterior of the parameters σx and σy (equation 13). An uncorrected Langevin sampler using gradients with respect to log(σ), approximated by a Swendsen-Wang sampler, was used. (c) 10,000 approximate samples using loopy Metropolis.

Acknowledgements

Thanks to Hyun-Chul Kim for conducting initial experiments and writing code for the mean-field and tree-based variational approximations. Thanks to David MacKay for useful comments and valuable discussions.

References

[1] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.

[2] J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society B, 36:192–236, 1974.

[3] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.

[4] Andrew McCallum and Ben Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. IJCAI Workshop on Information Integration on the Web, 2003.

[5] Ole Winther and Anders Krogh. Teaching computers to fold proteins. Preprint, arXiv:cond-mat/0309497. Submitted to Physical Review Letters, 2003.

[6] C. Yanover and Y. Weiss. Approximate inference and protein folding. In NIPS 15. MIT Press, 2002.

[7] Xiaojin Zhu and Zoubin Ghahramani. Towards semi-supervised classification with Markov random fields. Technical report, CMU CALD, 2002.

[8] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.

[9] J. Albert. Bayesian selection of log-linear models. Technical report, Duke University, Institute of Statistics and Decision Sciences, 1995.

[10] P. Dellaportas and J. Forster. Markov chain Monte Carlo model determination for hierarchical and graphical models. Technical report, Southampton University Faculty of Mathematics, 1996.

[11] A. Dobra, C. Tebaldi, and M. West. Bayesian inference in incomplete multi-way tables. Technical report, Institute of Statistics and Decision Sciences, Duke University, 2003.

[12] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147–169, 1985.

[13] Radford M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Department of Computer Science, University of Toronto, September 1993.

[14] Wim Wiegerinck. Variational approximations between mean field theory and the junction tree algorithm. In UAI 2000, pages 626–633. Morgan Kaufmann Publishers, 2000.

[15] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Generalized belief propagation. In NIPS 13, pages 689–695. MIT Press, 2000.

[16] Andrew M. Childs, Ryan B. Patterson, and David J. C. MacKay. Exact sampling from nonattractive distributions using summary states. Physical Review E, 63, 2001.

[17] L. Wang, J. Liu, and S. Z. Li. MRF parameter estimation by MCMC method. Pattern Recognition, 33:1919–1925, 2000.

[18] Yihua Yu and Qiansheng Cheng. MRF parameter estimation by an accelerated method. Pattern Recognition Letters, 24:1251–1259, 2003.

[19] David Edwards and Tomas Havranek. A fast procedure for model search in multidimensional contingency tables. Biometrika, 72(2):339–351, August 1985.
