
J Stat Phys (2011) 145:860–890
DOI 10.1007/s10955-011-0384-7

Message-Passing Algorithms for Inference and Optimization
“Belief Propagation” and “Divide and Concur”

Jonathan S. Yedidia

Received: 4 April 2011 / Accepted: 18 October 2011 / Published online: 29 October 2011
© Springer Science+Business Media, LLC 2011

Abstract Message-passing algorithms can solve a wide variety of optimization, inference, and constraint satisfaction problems. The algorithms operate on factor graphs that visually represent and specify the structure of the problems. After describing some of their applications, I survey the family of belief propagation (BP) algorithms, beginning with a detailed description of the min-sum algorithm and its exactness on tree factor graphs, and then turning to a variety of more sophisticated BP algorithms, including free-energy based BP algorithms, “splitting” BP algorithms that generalize “tree-reweighted” BP, and the various BP algorithms that have been proposed to deal with problems with continuous variables.

The Divide and Concur (DC) algorithm is a projection-based constraint satisfaction algorithm that deals naturally with continuous variables, and converges to exact answers for problems where the solution sets of the constraints are convex. I show how it exploits the “difference-map” dynamics to avoid traps that cause more naive alternating projection algorithms to fail for non-convex problems, and explain that it is a message-passing algorithm that can also be applied to optimization problems. The BP and DC algorithms are compared, both in terms of their fundamental justifications and their strengths and weaknesses.

Keywords Message-passing algorithms · Factor graphs · Belief propagation · Divide and concur · Difference-map · Optimization · Inference · Constraint satisfaction

1 Introduction

This paper is a tutorial introduction to the important “Belief Propagation” (BP) and “Divide and Concur” (DC) algorithms. The tutorial will be informal, and my main goal is to explain the fundamental ideas clearly.

Iterative message-passing algorithms like BP and DC have an amazing range of applications, and it turns out that their theory is deeply connected to concepts from statistical physics. BP algorithms are already very well known, and have been discussed in depth in some excellent reviews [2, 38, 74]. The DC algorithm [27] has important advantages compared with BP, but is so far much less known and used.

I have a couple of target audiences in mind. One target reader is learning about message-passing algorithms for the first time; this paper can serve as a gentle introduction to the field. Another target reader might already have some knowledge of at least some aspects of BP, but may not be familiar with recent advances in our understanding of BP, or with projection-based algorithms like DC, which are not usually described as message-passing algorithms. For such readers, I will present an overview of different BP algorithms, and show that DC can be interpreted as a message-passing algorithm that can be easily compared and contrasted with other BP algorithms.

J.S. Yedidia (✉) Disney Research Boston, 222 3rd Street, Suite 1101, Cambridge, MA 02142, USA
e-mail: [email protected]

2 Inference and Optimization Problems

No algorithm can be discussed in isolation from the problem it is solving. BP and DC message-passing algorithms are used to solve inference problems, optimization problems, and constraint satisfaction problems.

In an inference problem, one takes as input some noisy or ambiguous measurements, and tries to infer from those measurements the likely state of some hidden part of the world. In general, it is impossible to make those inferences with complete certainty, but one can at least try to obtain, within a model of the world and the measurements, the most probable state of the hidden part of the world [46, 63].

As an example, consider the “channel coding” problem which is fundamental to information theory [22, 60, 64]. We want to transmit a message consisting of some sequence of bits across a noisy channel that might corrupt those bits. To deal with that noise, additional bits that depend on the message are appended to it by an encoder before it is transmitted. The task of the decoder is one of probabilistic inference: it tries to compute, from all the possibly noisy received bits, the most probable message that might have been transmitted.

Computer vision is another example of a field where probabilistic inference problems are ubiquitous [17, 21, 66]. In a typical scenario, one obtains images or videos from one or more cameras, and wants to infer something about the scene being captured. For a computer, photographic images are simply two-dimensional matrices of color intensities, and the scene is a hidden three-dimensional world of objects that must somehow be inferred from those inherently ambiguous measurements. A computer vision probabilistic inference system uses a statistical model that connects the camera measurements to the scene quantities of interest (e.g. the depth or the identity of objects), and tries to find the most probable interpretation of the scene quantities given the measurements.

Speech recognition [28, 31, 58] and language understanding [33, 48] systems are similar. Here one obtains measurements of a sound signal, and tries to infer the most probable sequence of words, or meanings, consistent with those sounds.

In statistical physics, one deals with mathematically analogous problems [39, 51]. If one is given the energy function for some magnetic system or some macromolecule, the most probable configuration of the system is the one that has the lowest energy. Of course, as physicists are well aware, the lowest energy configuration may be the most probable, while not being a typical configuration. To obtain probabilities about typical configurations, one must also take into account the entropy of the system. In probabilistic inference, this distinction corresponds to the difference between two tasks the system might be asked to perform. First, it might be asked to obtain the one most probable state of the entire system, or second, one might ask for the marginal probabilities of particular hidden variables, after summing over the probabilities of all possible configurations.


Fig. 1 A toy factor graph with one observed variable node (variable 1), three hidden variable nodes, and three factor nodes

A probabilistic inference problem can be converted into a statistical physics problem by defining an energy function using Boltzmann’s Law:

p(X) ∝ exp(−E(X)). (1)

That is, if we are given a statistical model (say for coding, or vision, or speech recognition) that assigns probabilities p(X) to states X, we can completely equivalently think of it as assigning energies E(X), and proceed using any method of our choice from statistical physics. Or, if we are given a physical system with an energy function E(X), we can fruitfully apply the probabilistic inference algorithms that have been invented by computer scientists and electrical engineers. Mathematically, many of the problems studied by statistical physicists and computer scientists are completely equivalent [51].
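As a concrete illustration of this equivalence, here is a minimal Python sketch (with hypothetical energy values) that converts an energy function over a small discrete state space into Boltzmann probabilities and back:

```python
import math

# Hypothetical energies E(X) for a tiny two-spin system.
energies = {"up-up": 0.0, "up-down": 1.5, "down-up": 1.5, "down-down": 0.2}

Z = sum(math.exp(-E) for E in energies.values())            # partition function
probs = {X: math.exp(-E) / Z for X, E in energies.items()}  # p(X) proportional to exp(-E(X))

# The two descriptions are equivalent: given p(X) and Z, the energies are
# recovered by E(X) = -ln p(X) - ln Z.
recovered = {X: -math.log(p) - math.log(Z) for X, p in probs.items()}
```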

3 Factor Graphs

The task of inferring the most probable state of a system is actually the optimization problem of finding the minimum energy of that system. I will now describe a very convenient data structure called a factor graph [40, 43] which can be used to visualize and precisely define such optimization problems. A variety of other “graphical models” exist (e.g. Bayesian networks [56], Markov random fields [25], normal factor graphs [19, 43]), and have their advantages, but all these models can be easily converted into the standard factor graphs I will discuss [76].

Factor graphs are bipartite graphs containing two types of nodes: variable nodes and factor nodes. See Fig. 1 for our first example of a toy factor graph.

The variable nodes, usually denoted using circles, represent the variables in the optimization problem. A variable might be discrete—that is, it can only take on a finite number of states, like an Ising spin, which can only be in the two states that we might call “up” and “down”—or it might have a continuous range of possible states.

If a variable’s state is known (or “observed”), we fill in the corresponding circle in the factor graph; otherwise we use an open circle to denote a so-called “hidden” variable. The factor graph of Fig. 1 has three hidden variables and one observed variable.

The “factor” nodes in a factor graph show how the overall objective (“energy” or “cost”) function of our optimization problem breaks up—factorizes—into local terms. We draw an edge between each factor node representing a local cost function, and the variables that are involved in that local cost function. For example, for the factor graph of Fig. 1, the overall “cost” C(x1, x2, x3, x4) for a configuration (x1, x2, x3, x4) will be

C(x1, x2, x3, x4) = Ca(x1, x2, x3) + Cb(x2, x4) + Cc(x3, x4). (2)

More generally, if there are M local cost functions, we can write the overall cost function C(X) as

C(X) = ∑_{a=1}^{M} Ca(Xa) (3)


where Xa is the set of variables involved in the ath local cost function.

Fig. 2 A factor graph, along with the lookup tables associated with each local cost function. Note that for this factor graph the four variables are all discrete. Factor a is a “hard” constraint, that either allows or disallows different local configurations, while factors b and c are “soft” factors

By itself, a factor graph just lets one visualize the relationships between the variables and local functions in the problem; but it does not give the full information one needs to compute the cost associated with a configuration. To fill out that information, we would need to provide a table or explicit function for each factor node. For example, in Fig. 2, we supplement the factor graph from Fig. 1 with lookup tables that specify the local costs associated with each factor. Notice that in this example some of the local costs for factor a are infinite; this means that the corresponding configurations are forbidden and represents a “hard” constraint. For factors b and c, all the costs are finite, so these factors represent “soft” preferences.
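To make the bookkeeping concrete, here is a small Python sketch (with made-up table values standing in for the actual numbers in Fig. 2) that stores each factor as a lookup table, uses infinite costs for the forbidden configurations of the hard factor a, and evaluates the total cost of a configuration as in (2):

```python
INF = float("inf")

# Hypothetical lookup tables for a toy factor graph like that of Figs. 1 and 2.
# Factor a is a hard constraint (forbidden entries get infinite cost), while
# factors b and c encode soft preferences (all entries finite).
C_a = {(0, 0, 0): 0.0, (0, 1, 1): 0.0, (1, 0, 1): 0.0, (1, 1, 0): 0.0,
       (0, 0, 1): INF, (0, 1, 0): INF, (1, 0, 0): INF, (1, 1, 1): INF}
C_b = {(0, 0): 0.0, (0, 1): 1.2, (1, 0): 0.8, (1, 1): 0.3}
C_c = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.6}

def total_cost(x1, x2, x3, x4):
    """Overall cost C(x1,x2,x3,x4) = Ca(x1,x2,x3) + Cb(x2,x4) + Cc(x3,x4), as in (2)."""
    return C_a[(x1, x2, x3)] + C_b[(x2, x4)] + C_c[(x3, x4)]

# Brute-force minimization is feasible only because this example has 2^4
# configurations; message passing is what makes large graphs tractable.
configs = [(a, b, c, d) for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1)]
best_cost, best_config = min((total_cost(*cfg), cfg) for cfg in configs)
```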

The factor graph we have been looking at is just a toy example; we normally are interested in problems where there are a large number of hidden variables—at least hundreds, maybe thousands or millions. In that case, there will be an exponentially huge number of possible states of the system, so while it is always easy to compute the energy of one configuration, finding the lowest energy state, or summing over all the states, can be a very hard problem.

In fact, it should be obvious that our formulation is so general that many NP-hard optimization problems can be described in this way. Thus, we certainly are not going to be able to specify here an algorithm that is guaranteed to find the lowest cost state of an arbitrary factor graph, while using computational resources that only grow polynomially with the size of a factor graph. Nevertheless, it turns out that the BP and DC message-passing algorithms are often successfully used on problems that are NP-hard, and can in fact often efficiently find the global optimum for those problems. The point is that a problem might be NP-hard, and yet specific instances of it that we care about may be in an easier regime than the worst case, and often be solved by a clever algorithm. For example, the problem of finding the optimal decoding of a low-density parity-check (LDPC) code is NP-hard, and yet efficient state-of-the-art BP decoders succeed sufficiently often that they closely approach the Shannon limit of possible channel coding performance [60, 64].

4 Example Factor Graphs and Applications

Let’s take a look at just a few examples of how factor graphs can represent interesting problems. To simplify the factor graphs in the following examples, we take advantage of the fact that we can absorb any observed variable nodes as parameters in the factors they are attached to, leaving only hidden variables in the factor graph (see Fig. 3).


Fig. 3 Any observed variable nodes in a factor graph can be absorbed as parameters in the factor nodes that they are connected to, leaving only “hidden” variable nodes, and factor nodes that depend on the observations

Fig. 4 A factor graph for the (N = 7, k = 4) Hamming code, which has seven codeword bits, of which the left-most four are information bits, and the last three are parity bits

In the following examples, we will also describe some properties of BP message-passing algorithms, anticipating our more extensive discussion in future sections.

4.1 Error Correcting Codes

Our first example is a factor graph for an error correcting code—the simple (N = 7, k = 4) Hamming code shown in Fig. 4. In this code, there are seven “hidden” variable nodes that represent the seven unknown transmitted bits. The first four of those bits are information bits that encode the original message, the other three are additional parity bits that can be computed from the information bits using the parity check factor nodes. The three parity check factor nodes are hard constraints that force the sum of the bits connected to them to equal 0 modulo 2. There are also seven “soft” channel evidence factor nodes that give the a priori probability that each of the hidden codeword bits is equal to a one or zero, given the observed received bits.

The goal will be to find the most likely values of the seven hidden transmitted bits, given the channel evidence and the fact that they must be consistent with the parity check constraints.

Such factor graphs were introduced into coding theory in 1981 by Tanner [68], to describe and visualize the low-density parity-check (LDPC) codes and the BP decoder for LDPC codes that had been introduced earlier by Gallager in 1963 [23]. LDPC codes were given their name because each parity check is only connected to a small number of codeword bits. LDPC codes and their factor graphs are similar to the Hamming code in Fig. 4, except that the number of codeword bits is usually on the order of a few thousand in practical LDPC codes, and the codeword bits are not simply divided into information bits and extra parity check bits. BP decoders of LDPC codes are of great practical significance, because if they are properly designed, their performance can closely approach the Shannon limit, and they can be implemented in modern hardware [60, 64].


Fig. 5 A toy factor graph representing an “expert system” for a medical diagnosis problem. The hidden variables represent the unknown possible diseases (e.g. “L” represents lung cancer), or information about the patient such as whether he is a smoker (“S”) or test results. The factors represent known statistical relationships (e.g. a smoker is more likely to have lung cancer). The goal is to obtain the best estimate of the probabilities of possible diseases given the available information

4.2 Diagnosis

Our next example (see Fig. 5) is a toy factor graph representing a medical diagnosis problem. The hidden variable nodes in this factor graph labeled “T,” “L,” and “B” represent different possible diagnoses of some patient (e.g. Tuberculosis, Lung Cancer or Bronchitis). The other variables may represent some information about the patient (e.g. “S” represents how much the patient smokes). The factor nodes encode the known statistical relationships between diseases and other variables.

This factor graph is adapted from an example in [41] that originally used a Bayes network, a graphical model that uses directed edges and that has the advantage of explicitly encoding conditional probabilities. The name “Belief Propagation” was in fact introduced by Pearl [56] for a version of the BP algorithm working on Bayes networks. It is easy to convert between different graphical models [76], and one reason that I am focusing here on standard factor graphs is that BP algorithms are easier to describe on them because they are undirected.

This example highlights an important point about probabilistic inference algorithms—one is often interested in more detailed information than the overall most probable configuration. For example, one might want an accurate estimate of the marginal probability that the patient has a particular disease. As we shall see, different BP algorithms are designed to give answers to different types of questions—for the kind of marginal probabilities we want here, we should use the “sum-product” version of BP, which we will describe in more detail later in this paper.

4.3 Computer Vision

Our next example (see Fig. 6) is a cartoon factor graph depicting the way factor graphs are used in low-level computer vision. The colored factor nodes in this example represent image intensity pixels that would be captured by a camera. The hidden variable nodes are some variables about the scene that we would like to infer (for example, the depth from the camera associated with each pixel). These hidden variables have some local probabilities given the observations, but there are also correlations between them—for example, if the depth has a certain value at one pixel, it is likely (but not guaranteed) to have similar values at neighboring pixels.

Fig. 6 (Color online) An illustration of a factor graph equivalent to a pairwise Markov random field (left) as is often used to represent a computer vision problem, and a depth map returned by a BP algorithm [17] using stereo images (right). The variable nodes in the factor graph could represent unknown depth values, for which there exists some local evidence (the colored factor nodes), and which are statistically correlated (depths at particular pixels tend to be similar to nearby pixels)

Our goal, as usual, will be to find the most probable values of the hidden scene variables given what our cameras observe.

Graphical models called “pairwise Markov random fields,” which are equivalent to factor graphs like that shown in Fig. 6, were introduced into computer vision by Geman and Geman [25], and have become increasingly popular in the last 25 years [17, 21, 66], although the models used in practice are more complicated than the simplified cartoon shown in this figure. In a factor graph equivalent to a pairwise Markov random field, each factor node is connected to no more than two variable nodes.

One significant practical problem for BP algorithms is that the hidden variables of interest in computer vision are typically continuous, and are often severely quantized for the sake of efficiency when BP (or competing algorithms like “graph cuts” [8]) are run. This means that one often sees quantization artifacts in the results. The depth map on the right hand side of Fig. 6 is the result of running a BP algorithm on a pairwise Markov random field for stereo vision [17], and shows such artifacts.

Much recent research has focused on improving BP’s performance or efficiency for problems with continuous variables (see Sect. 12), but nevertheless the fact that the Divide and Concur algorithm works naturally with continuous quantities to any desired precision makes it an attractive potential alternative to BP for computer vision applications.

It should be emphasized that factor graphs only give a principled way of representing an optimization or probabilistic inference problem. You then need to separately choose an algorithm to solve it.

Many different variants of message-passing algorithms exist, and of course there are many other optimization algorithms (e.g. simulated annealing) that can be used once the problem is represented as a factor graph. I focus here on message-passing algorithms because they are often particularly powerful and efficient.

One should be careful to cleanly separate the model of an optimization or inference problem from the algorithm being used on that model. When a clean separation is made, one can determine whether it is the model or the optimization algorithm that needs improvement if one obtains an inadequate solution. For example, artifacts obtained using a particular stereo vision model (that was in principle NP-hard to optimize) were for a long time blamed on the inability of optimization algorithms to find the global optimum. Eventually though, it was found that even the provably global optima found by BP-based algorithms were not completely adequate, demonstrating that it was the model that needed to be improved [50].

Fig. 7 (Color online) The overall structure of message-passing algorithms. The algorithms iterate between steps, indicated by numbers in rectangles. In step 1, messages from variable nodes to factor nodes (in blue) are initialized to random or non-informative values. In step 2, the factor nodes compute from the incoming messages new outgoing messages (in red). In step 3, those messages are converted into beliefs, which in BP are generally represented as a cost for each possible state (the red numbers). In step 4, the beliefs are thresholded to their lowest cost state (represented by the number inside the variable node), and a termination condition is checked. In step 5, the beliefs and incoming messages are used to compute new outgoing messages from the variable nodes, and then one returns to step 2, and the cycle continues

5 Message-Passing Algorithms

Let’s turn now to the overall structure of message-passing algorithms, as they operate on factor graphs. This overall structure is shared by different classes of BP algorithms and by DC algorithms.

Figure 7 breaks the structure down into five steps. Message-passing algorithms get their name from the fact that in each step, messages are sent between nodes in the factor graph.

In the first step, the messages from variable nodes to factor nodes are initialized. The initial messages could be random, but more often, one uses non-informative messages from the hidden nodes, and messages that correspond to the observations from the observed nodes. Messages should be thought of as a variable telling its neighboring factors what it thinks its state is, or what it thinks would be the cost for it to be in each of its possible states. A non-informative message simply sets the costs of each possible state to be equal.

The factor nodes take in all the messages from their neighboring variable nodes, and in the second step, they compute messages that go back out to neighboring hidden variable nodes. These messages take into account what the variable nodes have told them, and the statistical relationships encoded by a factor node. The messages tell the neighboring variables what state they should be in, or what the costs will be for being in all their possible states.

In the third step, the hidden variable nodes inspect all the incoming messages, and compute a “belief” about what their state should be. In BP algorithms, this belief takes the form of a cost, or a probability, associated with each possible state of a variable. In the example shown in Fig. 7, the three hidden variable nodes have respectively two, two, and three possible states, so the beliefs are similarly vectors (represented by the columns of red numbers) of the same lengths giving the costs of each state. Thus in a BP algorithm, if a variable is continuous valued, the belief associated with it must be a function of that continuous variable. In DC algorithms, the belief (and the messages) are just a single number, representing the current best guess for the state of that variable node.

The fourth step of a message-passing algorithm is necessary for BP algorithms but not for DC, and involves thresholding the beliefs to obtain a single best guess for a variable node. In our example, the beliefs represent costs, and the lowest cost state is chosen for each variable node (for example, the bottom right variable node has costs 0.7 for state 0, 0.3 for state 1, and 1.4 for state 2, so state 1 has the lowest cost for that variable).

After the fourth step is completed, one has obtained a guess for the overall configuration of the factor graph, and one can use that guess to check for a termination condition. For example, in BP-based decoders of LDPC codes, one can check whether the guess is a legal codeword that satisfies all the parity-check constraints. Or one can check whether the guess or the beliefs have changed from previous iterations, or whether a maximum number of iterations has been reached. If the termination condition is satisfied, one outputs the current guess.

Otherwise, in the fifth step, the variable nodes will compute new messages to send back to the factor nodes, based on their beliefs and the messages that they received. Then one goes back to the second step, and the cycle repeats.

Notice that all the factor nodes can compute their outgoing messages in step 2 in parallel, and similarly all the variable nodes can compute their outgoing messages in parallel in step 5. That makes these algorithms attractive for parallel implementation, either in hardware or software, and has contributed significantly to their popularity.
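This five-step loop can be written down generically, independent of the particular update rules. The Python sketch below is only a skeleton: the node-local computations are passed in as functions (placeholders of my own, not part of the original description), and concrete min-sum versions of those rules are given in the next section.

```python
def message_passing(variables, factors, neighbors, init_msg, factor_update,
                    combine, best_state, terminated, variable_update,
                    max_iters=100):
    """Generic five-step message-passing loop (see Fig. 7).

    All problem-specific behavior is supplied by the caller:
      neighbors[node]  -> adjacent nodes of a variable or factor node
      init_msg, factor_update, combine, best_state, terminated,
      variable_update  -> the node-local rules (e.g. min-sum, sum-product, DC).
    """
    # Step 1: initialize variable-to-factor messages (random or non-informative).
    var_to_fac = {(i, a): init_msg(i) for i in variables for a in neighbors[i]}
    guess, beliefs = None, None
    for _ in range(max_iters):
        # Step 2: factor nodes compute new outgoing messages from incoming ones.
        fac_to_var = {(a, i): factor_update(a, i, var_to_fac)
                      for a in factors for i in neighbors[a]}
        # Step 3: variable nodes combine incoming messages into beliefs.
        beliefs = {i: combine(i, fac_to_var) for i in variables}
        # Step 4: threshold beliefs to a single guess and test termination.
        guess = {i: best_state(beliefs[i]) for i in variables}
        if terminated(guess, beliefs):
            break
        # Step 5: variable nodes send new messages back to the factor nodes.
        var_to_fac = {(i, a): variable_update(i, a, beliefs, fac_to_var)
                      for i in variables for a in neighbors[i]}
    return guess, beliefs
```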

6 Message and Belief Update Rules

Message-passing algorithms differ in the details of and justifications for their message-update rules. I will begin by presenting one particular BP algorithm, the “min-sum” algorithm. This algorithm uses messages and beliefs that represent the costs for each variable to be in its different possible states. One sometimes sees the “min-sum” algorithm in a different guise, and referred to as the “max-product” BP algorithm. “Max-product” BP is equivalent to “min-sum” BP; the only (completely superficial) difference is that messages and beliefs are represented as probabilities rather than costs.


Fig. 8 The belief update rule for the min-sum BP algorithm says that the belief at a variable node is simply the sum of incoming messages from neighboring factor nodes

Fig. 9 (Color online) The variable-to-factor message update rule in min-sum BP says that the outgoing (blue) message is the sum of all the incoming (red) messages on edges other than the edge of the outgoing message

I begin with the belief update rules for min-sum BP, which relate the beliefs bi(xi) at the variable node i to the messages ma→i(xi) coming into node i from neighboring factor nodes a (see Fig. 8). The rule is very simple: the belief is the sum of all the messages:

bi(xi) = ∑_{a∈N(i)} ma→i(xi). (4)

Here we use the notation xi to represent the possible states of variable node i, and N(i) to be the set of factor nodes neighboring node i.

This rule is easy to understand if we think of the factor nodes as giving independent information about different parts of the graph. For example, if node i is a binary variable neighboring two factor nodes, and the first factor thinks it will cost A more for node i to be in state 1 than state 0, and the second thinks it will cost B more to be in state 1, then it is natural to conclude that overall it will cost A + B more to be in state 1, so long as the two factor nodes are using information from independent parts of the graph.

Turning next to the message update rule for a message mi→a(xi) from a variable i to a factor node a (see Fig. 9), we see that the message depends on all messages coming into variable node i from neighboring factor nodes b except for the one coming in from the target factor node a:

mi→a(xi) = ∑_{b∈N(i)\a} mb→i(xi). (5)

Again, the message out about the costs of the possible states of node i is just a sum of incoming messages from other parts of the graph. Note that if the belief has already been computed, we can use it to more efficiently compute the outgoing message:

mi→a(xi) = bi(xi) − ma→i(xi). (6)

To complete our collection of update rules, we need the rule for updating messages from factor nodes to variable nodes. It’s a little more complicated than the other rules, but still easy to understand. Let’s look at the case when the factor node a is connected to three variable nodes i, j, and k (see Fig. 10). The message update rule is

ma→i(xi) = min_{xj,xk} [Ca(xi, xj, xk) + mj→a(xj) + mk→a(xk)]. (7)

Fig. 10 The factor-to-variable message update rule for a message from factor a to variable i depends on the local cost function Ca, and the incoming variable-to-factor messages on other edges

The three terms in this equation can be understood as follows. The message ma→i(xi) should have node a tell node i what its costs would be for being in each of its possible states. For each choice of state for node i, we will arrange the other nodes attached to a to be in their best states given that choice (that is the explanation for the minimization over xj and xk). One part of the cost is the cost associated with a itself (the Ca(xi, xj, xk) term). The other parts of the cost are what it costs to place the xj and xk variables in the states that xi would like them to be in, given by the incoming messages mj→a(xj) and mk→a(xk).

If Xa is the set of all variables attached to factor node a, and Xa\xi is all those variables except for xi, then the general update rule for messages from factor nodes to variable nodes, generalizing (7), is

ma→i(xi) = min_{Xa\xi} [Ca(Xa) + ∑_{j∈N(a)\i} mj→a(xj)]. (8)

The form of (8) gives the min-sum algorithm its name. If the beliefs and messages represent probabilities instead of costs, we would obtain an equivalent message update rule with the sums replaced by products, and the minimization replaced by a maximization, so that equivalent algorithm is called the “max-product” algorithm.

We have written these rules as equations, but in fact they are normally used as iterative update rules. Only at a fixed point will they become equalities.
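Putting (4), (6), and (8) together, the following self-contained Python sketch implements min-sum BP for a factor graph given as cost tables. It is a toy illustration for small discrete problems (the data-structure conventions are mine, not from the paper), trading efficiency for readability:

```python
import itertools

def min_sum_bp(domains, factors, iters=20):
    """Min-sum BP. domains: {var: list of states}. factors: {name: (vars, table)},
    where `table` maps a tuple of states (in `vars` order) to a local cost;
    assignments missing from a table are treated as forbidden."""
    var_neighbors = {i: [a for a, (vs, _) in factors.items() if i in vs]
                     for i in domains}
    # Variable-to-factor messages, initialized to non-informative (all zeros).
    m_vf = {(i, a): {x: 0.0 for x in domains[i]}
            for i in domains for a in var_neighbors[i]}
    m_fv, beliefs = {}, {}
    for _ in range(iters):
        # Factor-to-variable messages, rule (8).
        for a, (vs, table) in factors.items():
            for i in vs:
                pos = vs.index(i)
                msg = {x: float("inf") for x in domains[i]}
                for assignment, cost in table.items():
                    total = cost + sum(m_vf[(j, a)][assignment[vs.index(j)]]
                                       for j in vs if j != i)
                    msg[assignment[pos]] = min(msg[assignment[pos]], total)
                m_fv[(a, i)] = msg
        # Beliefs, rule (4), and variable-to-factor messages, rule (6).
        for i in domains:
            beliefs[i] = {x: sum(m_fv[(a, i)][x] for a in var_neighbors[i])
                          for x in domains[i]}
            for a in var_neighbors[i]:
                m_vf[(i, a)] = {x: beliefs[i][x] - m_fv[(a, i)][x]
                                for x in domains[i]}
    # Threshold each belief to its lowest-cost state.
    return {i: min(beliefs[i], key=beliefs[i].get) for i in domains}, beliefs

# Toy usage: one soft unary factor on x1 and a hard equality constraint.
INF = float("inf")
domains = {"x1": [0, 1], "x2": [0, 1]}
factors = {"e1": (["x1"], {(0,): 0.0, (1,): 2.0}),
           "eq": (["x1", "x2"], {(0, 0): 0.0, (1, 1): 0.0,
                                 (0, 1): INF, (1, 0): INF})}
guess, beliefs = min_sum_bp(domains, factors)   # guess == {"x1": 0, "x2": 0}
```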

7 Example of the Min-Sum BP Algorithm

Let’s take a look at a concrete example of the min-sum BP algorithm in action, using as an example a decoder of the binary Hamming code with the factor graph shown in Fig. 4. Each of the codeword bit variables xi, with the index i running from 1 to 7, will have two possible states: xi = 0 or xi = 1.

For this factor graph, there are ten factor nodes, of which seven are channel evidence factor nodes, and three are parity check factor nodes. The seven channel evidence factor nodes each have a soft cost function Ci(xi) associated with them, representing the relative cost that a bit is a 0 or 1, given what was received from the channel. As an example, suppose that

C1(x1 = 0) = 0.0; C1(x1 = 1) = 3.0
C2(x2 = 0) = 0.0; C2(x2 = 1) = 2.0
C3(x3 = 0) = 0.0; C3(x3 = 1) = 2.5
C4(x4 = 0) = 0.0; C4(x4 = 1) = 5.4
C5(x5 = 0) = 0.0; C5(x5 = 1) = 4.0
C6(x6 = 0) = 0.2; C6(x6 = 1) = 0.0
C7(x7 = 0) = 0.7; C7(x7 = 1) = 0.0.

Based just on the channel evidence, the first five bits would prefer to be 0, while the last two bits would prefer to be 1, although the preferences of the last two bits are less strong than those of the first five. The three parity check factor nodes, whose cost functions we denote by CA(x1, x2, x3, x5), CB(x1, x2, x4, x6) and CC(x1, x3, x4, x7), introduce additional constraints. We assign a zero cost to the legal configurations of the connected bits, where the modulo-two sum is equal to 0, and an infinite cost to the configurations where the modulo-two sum is equal to 1. For example, for the first parity check node, we have that CA(x1 = 0, x2 = 0, x3 = 0, x5 = 0), CA(x1 = 0, x2 = 0, x3 = 1, x5 = 1), and so on are all equal to zero, while CA(x1 = 0, x2 = 0, x3 = 0, x5 = 1), CA(x1 = 0, x2 = 0, x3 = 1, x5 = 0) and so on are all infinite.

Let us now see how the beliefs and messages are updated, following the overall structure of a message-passing algorithm shown in Fig. 7. In step 1, all the messages from variables to factors are initialized to zero. In step 2, we compute messages from factor nodes to variable nodes using (8). Because each channel evidence node is connected to only a single variable node, the message update rule from evidence factor nodes to variable nodes simplifies to mai→i(xi) = Ci(xi), and these messages stay constant through all iterations. On the other hand, because all messages into the parity check factor nodes are initially zero, it is easy to work out that all the initial messages out of the parity nodes will also be zero.

At step 3, we compute the beliefs at the variable nodes, using (4). Because the messages from the parity nodes are zero at this stage, the beliefs will initially be equal to the messages from the evidence nodes: bi(xi) = mai→i(xi) = Ci(xi).

At step 4, we threshold these beliefs; that is, we assign the first five bits the value 0, and the last two bits the value 1, because these are currently the lowest cost values according to the beliefs. Let us assume that our termination condition is that the thresholded beliefs correspond to a codeword (that is, whether they satisfy all the parity checks in the code). For our current bit values, the second and third parity checks are violated, so we need to continue.

At step 5, we compute messages from variable nodes to factor nodes. The messages from variable nodes to evidence nodes are in fact irrelevant because they are never used to compute anything else, so we focus on the relevant messages from the variable nodes to the parity check nodes. Initially, all the messages coming from the parity check nodes are zero, so the messages from the variable nodes to the factor nodes will equal the beliefs at the variable nodes, which at this point equal the evidence cost functions. For example, m1→A(x1) = m1→B(x1) = m1→C(x1) = b1(x1) = C1(x1).

Now we cycle back to step 2, and the computations become more interesting. We need to update the messages from the parity checks to the variable nodes, using (8). As an example, let’s focus on the message mA→1(x1) from check A to variable 1. This will depend on the incoming messages to check A from variables 2, 3, and 5. The hard parity check constraint assigns an infinite cost to choices of the three variables x2, x3, and x5 whose sum does not equal x1 modulo two. We find that mA→1(x1 = 0) equals

min [CA(0, x2, x3, x5) + m2→A(x2) + m3→A(x3) + m5→A(x5)]
= min [(0.0 + 0.0 + 0.0), (0.0 + 2.5 + 4.0), (2.0 + 0.0 + 4.0), (2.0 + 2.5 + 0.0)]
= 0.0,

while mA→1(x1 = 1) equals

min [CA(1, x2, x3, x5) + m2→A(x2) + m3→A(x3) + m5→A(x5)]
= min [(0.0 + 0.0 + 4.0), (0.0 + 2.5 + 0.0), (2.0 + 0.0 + 0.0), (2.0 + 2.5 + 4.0)]
= 2.0.

If one thinks about what the min-sum algorithm is doing here, it makes intuitive sense. For each of the two possible states of variable 1, it assigns the associated states of variables 2, 3, and 5 such that they are consistent with the state of 1, while costing as little as possible.
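A few lines of Python reproduce this hand computation (a sketch; the incoming message values are the ones given above):

```python
import itertools

INF = float("inf")

# Incoming variable-to-check messages for check A after the first iteration,
# indexed by bit value; they equal the channel evidence costs at this point.
m_in = {2: (0.0, 2.0), 3: (0.0, 2.5), 5: (0.0, 4.0)}

def check_to_var(x1):
    """Min-sum message mA->1(x1): minimize over (x2, x3, x5) consistent with
    the parity constraint, summing the incoming message costs; the check
    itself contributes 0 to legal configurations and infinity otherwise."""
    best = INF
    for x2, x3, x5 in itertools.product((0, 1), repeat=3):
        parity_cost = 0.0 if (x1 + x2 + x3 + x5) % 2 == 0 else INF
        best = min(best, parity_cost + m_in[2][x2] + m_in[3][x3] + m_in[5][x5])
    return best

print(check_to_var(0), check_to_var(1))  # 0.0 2.0, matching the text
```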

It is straightforward to similarly compute all the other messages from parity checks to variable nodes. We find

mA→1(x1) = (0.0,2.0)

mA→2(x2) = (0.0,2.5)

mA→3(x3) = (0.0,2.0)

mA→5(x5) = (0.0,2.0)

mB→1(x1) = (0.2,0.0)

mB→2(x2) = (0.2,0.0)

mB→4(x4) = (0.2,0.0)

mB→6(x6) = (0.0,2.0)

mC→1(x1) = (0.7,0.0)

mC→3(x3) = (0.7,0.0)

mC→4(x4) = (0.7,0.0)

mC→7(x7) = (0.0,2.5),

where we have represented the messages as vectors in an obvious way.

Now we return to step 3, and update the beliefs, summing together these new messages from the check nodes, as well as the constant messages from the channel evidence nodes. For example

b1(x1) = ma1→1(x1) + mA→1(x1) + mB→1(x1) + mC→1(x1)

= (0.0,3.0) + (0.0,2.0) + (0.2,0.0) + (0.7,0.0)

= (0.9,5.0).

The full set of beliefs at this stage will be

b1(x1) = (0.9,5.0)

b2(x2) = (0.2,4.5)

b3(x3) = (0.7,4.5)

b4(x4) = (0.9,5.4)

b5(x5) = (0.0,6.0)

b6(x6) = (0.2,2.0)

b7(x7) = (0.7,2.5).


Fig. 11 A factor graph with no cycles used to illustrate the fact that min-sum BP finds an optimal configuration for such factor graphs

When we return to step 4 and threshold the beliefs, all the bit values will be set to zero, and since this is a codeword consistent with the parity checks, the algorithm will terminate and output the all-zeros codeword.

Notice that the BP algorithm takes advantage of “soft” information. In our example, the algorithm flipped two bits from the values that the channel evidence would have preferred, because the preference at those two bits was weak. If instead we had first assigned each bit to the preferred value from the channel evidence, and then tried to flip the minimum number of bits to find a codeword (many inferior “classical” decoders work in such a fashion, because they cannot use soft cost information), we would have chosen to leave bits 6 and 7 alone, and flipped bit 4 to the value 1 to reach a codeword. This would have only flipped one bit from the preferences of the channel evidence, but it would have cost 5.4 overall instead of 0.9, so it would have been a worse choice.

8 Exactness of BP for Tree Factor Graphs

We have already given some intuitive explanations for the form of the min-sum update rules, but a more rigorous justification for the rules is that on a graph with no cycles, applying the min-sum rules will provably give the lowest-cost configuration, using an amount of memory and time that only scales linearly with the number of nodes in the factor graph. I will not prove that fact here, and instead just give an example showing how that works.

Consider the tree factor graph shown in Fig. 11. Suppose that we want to compute the best state for variable node 1 in the optimal configuration. We can get that by computing b1(x1), and thresholding it to the lowest cost state. By (4), that belief will be given by b1(x1) = mA→1(x1). Now, we can replace mA→1(x1) by using the min-sum update rule (8). If we continually replace messages with cost functions and other messages using the message update rules, we find:

b1(x1) = mA→1(x1)
= min_{x2} [CA(x1, x2) + m2→A(x2)]
= min_{x2} [CA(x1, x2) + mB→2(x2)]
= min_{x2,x3,x4} [CA(x1, x2) + CB(x2, x3, x4) + m3→B(x3) + m4→B(x4)]
= min_{x2,x3,x4} [CA(x1, x2) + CB(x2, x3, x4) + mC→4(x4)]
= min_{x2,x3,x4} [CA(x1, x2) + CB(x2, x3, x4) + CC(x4)].

In the end, we see that b1(x1) gives the exact minimal overall cost for each possible state of the first node, and a similar result would hold starting with any other node.

Intuitively, the reason that BP algorithms are exact on trees is that a message from a node summarizes everything that is happening on the branch of the tree beyond that node. Thus, in our example, the message from node 2 to node A compactly and exactly summarizes everything that we know about nodes B, C, 3, and 4.

In fact, it should be clear from this example that the min-sum algorithm is a “dynamic programming” algorithm [12] when run on trees, and only needs an amount of computation and memory that scales linearly with the number of nodes in the factor graph.

As we have seen, the min-sum algorithm finds, for a factor graph with no cycles, the lowest-cost or highest-probability configuration of the system. Recall that the max-product algorithm is equivalent to the min-sum version, with the only difference being that messages are represented as probabilities rather than costs. A small modification of the max-product algorithm can be used to obtain exact marginal probabilities pi(xi) for the variable nodes being in each of their possible states, averaged over all possible configurations of the system. That modification, which simply replaces the “max” in the factor-to-variable message-update rule with a sum, gives the “sum-product” version of BP. The beliefs in sum-product BP are precisely equal to the desired marginal probabilities so long as the factor graph has no cycles.
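In the probability domain, the only change to the factor-to-variable rule is that the maximization is replaced by a sum. A minimal sketch of that sum-product update, with probability-domain messages and factor entries psi proportional to exp(−Ca(Xa)) (my notation for this snippet, not the paper’s):

```python
import itertools

def sum_product_factor_to_var(target, fac_vars, domains, psi, msgs_in):
    """Sum-product factor-to-variable message: marginalize the factor entry
    psi[assignment] times the incoming messages over every variable attached
    to the factor except `target`.
    psi:     {assignment tuple (in fac_vars order): nonnegative factor value}
    msgs_in: {var: {state: incoming message value}}"""
    pos = fac_vars.index(target)
    out = {x: 0.0 for x in domains[target]}
    for assignment in itertools.product(*(domains[v] for v in fac_vars)):
        weight = psi[assignment]
        for v, x in zip(fac_vars, assignment):
            if v != target:
                weight *= msgs_in[v][x]
        out[assignment[pos]] += weight
    return out
```

Replacing the sum over assignments by a max (keeping the products) would give the max-product message instead.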

9 Early History of BP Algorithms

It might be worthwhile at this point to review the history, through the end of the 20th century, of BP algorithms. BP algorithms have in fact been independently invented many times, and applied to a wide range of problems, which is a natural consequence of the fact that they provide an exact solution to problems whose factor graph has no cycles, and such problems are very common. The fundamental commonality between the different versions of the algorithm introduced in different fields has only gradually and relatively recently been understood.

In fact, one could trace the genesis of BP ideas back to the original introduction and solution of the one-dimensional Ising model in 1925 [30], and the development of the transfer matrix approach [5] that was subsequently used to exactly solve a variety of models in statistical mechanics. Statistical physicists will recognize that transfer matrix computations have essentially the same form as BP message update computations, although the transfer matrix method is not usually described as an algorithm.

Because the exactness of the BP algorithm on chains still holds if the variables are continuous, and BP messages can be computed efficiently and exactly for Gaussian models, it turns out that Kalman’s 1960 solution of the Kalman filtering problem [35], for which the factor graph represents a temporal chain rather than a spatial one, is also an instance of a BP algorithm [44].

In 1967, Viterbi [70] introduced an algorithm, which would now be viewed as a min-sum or max-product BP algorithm, which gave an exact decoder of convolutional codes. In 1973, Forney [20] made clear the significance of the Viterbi algorithm as an efficient way of finding optimal state sequences for a wide range of problems that can be represented using one-dimensional factor graphs, and the Viterbi algorithm has since had enormous practical importance. Forney also introduced an important graphical visualization of the Viterbi algorithm, the “trellis” diagram (see Fig. 12). Trellis diagrams can be used to effectively track the values of messages as a BP algorithm proceeds forward in time, or equivalently along the variables in a chain factor graph [19]. Each edge in the trellis corresponds to a choice for the value of one or more variables in the factor graph. For sufficiently simple error-correcting codes, one can obtain optimal decoders from their trellis diagrams; and the important issue often becomes how to represent a code so as to obtain a minimal trellis diagram [69].


Fig. 12 An illustration from Forney’s paper [20] on the Viterbi algorithm, showing a state diagram for a four-state shift-register process (a), and the corresponding trellis diagram given evidence for the state of the process from time 0 to time k (b)

The BCJR algorithm (from a modern perspective, sum-product BP on a one-dimensional chain) was introduced in 1974 and shown to minimize symbol error rates in convolutional codes [1]. This algorithm was also later used for exactly computing marginal probabilities in many other one-dimensional problems, sometimes being called the “forward-backward” algorithm in the context of solving one-dimensional hidden Markov models [58], where it is widely used in bioinformatics [14] and speech recognition [31].

In fact, new problems that can be solved exactly using BP algorithms are continually being discovered. For instance, one can interpret binary decision diagrams (BDDs), introduced by Bryant in 1986 [10], and zero-suppressed binary decision diagrams (ZDDs), introduced by Minato in 1993 [54], as generalizations of trellis diagrams that allow many different types of exact computations to be made for general factor graphs over discrete variables, using as little memory and time as possible. Knuth’s remarkable recent tutorial [37] describes in detail the algorithms used to efficiently construct BDDs and ZDDs, and surveys the many different kinds of combinatorial problems that they can now efficiently solve, in conjunction with BP algorithms.

From the modern perspective, Gallager’s introduction in 1963 of the sum-product algorithm as a decoder of LDPC codes [23] was a particularly significant conceptual breakthrough because it demonstrated that BP algorithms could also serve as effective approximate algorithms even on factor graphs with cycles. In his seminal 1988 book introducing the term “belief propagation,” and introducing its application to Bayesian networks for artificial intelligence [56], Pearl focused mostly on BP on tree-like factor graphs, but also mentioned in an exercise that it might be applied to factor graphs with loops. As Pearl pointed out, BP algorithms are perfectly well-defined on factor graphs with cycles, even though they are not necessarily exact. When McEliece, MacKay, and Cheng pointed out in 1998 [49] that Turbo decoders, introduced in 1993 [6], were also an instance of a highly effective, albeit approximate BP algorithm, it became clear that BP provides a “very attractive general methodology for devising low-complexity iterative decoding algorithms,” although as they pointed out in their conclusion, it was still mysterious at that point why BP so often gave good approximations for factor graphs with cycles.


10 Approaches Based on Free Energies

Some insight into why BP gives good approximations on factor graphs with cycles was obtained when my colleagues Bill Freeman and Yair Weiss and I showed [76, 77] that the fixed points obtained by sum-product BP were identical to the stationary points of a variational free energy, the so-called “Bethe free energy.”

Our approach starts with basic definitions from statistical physics. If we have a factor graph with an overall cost function C(X) as defined in (3), then we can define a corresponding probability distribution over all the states by p(X) = exp(−C(X))/Z, where Z is the partition function Z = ∑_X exp(−C(X)). We can introduce a trial probability, or “belief,” function b(X) which is intended to approximate p(X), and a variational free energy F(b), with F(b) = U(b) − S(b), where U(b) is the variational average energy:

U(b) = ∑_X b(X) C(X) (9)

and S(b) is the variational entropy:

S(b) = −∑_X b(X) ln b(X). (10)

It follows directly from our definitions that

F(b) = −ln Z + D(b||p) (11)

where

D(b||p) = ∑_X b(X) ln [b(X)/p(X)] (12)

is the Kullback-Leibler divergence between b(X) and p(X). Since D(b||p) is always non-negative, and is zero precisely when b(X) equals p(X), we see that F(b) ≥ −ln Z, with equality when b(X) = p(X). Thus, we can use the procedure of trying to minimize F(b) to recover the true p(X).
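The identity (11) is easy to check numerically on a toy system; the following sketch uses made-up costs and an arbitrary normalized trial belief:

```python
import math

# Hypothetical costs C(X) for a three-state toy system, and an arbitrary
# normalized trial belief b(X).
C = {"A": 0.0, "B": 1.0, "C": 2.5}
b = {"A": 0.6, "B": 0.3, "C": 0.1}

Z = sum(math.exp(-c) for c in C.values())          # partition function
p = {X: math.exp(-C[X]) / Z for X in C}            # true distribution

U = sum(b[X] * C[X] for X in C)                    # variational average energy (9)
S = -sum(b[X] * math.log(b[X]) for X in C)         # variational entropy (10)
F = U - S                                          # variational free energy
D = sum(b[X] * math.log(b[X] / p[X]) for X in C)   # KL divergence (12)

# F(b) equals -ln Z + D(b||p), so these two numbers agree up to rounding,
# and F(b) >= -ln Z with equality only when b matches p.
print(F, -math.log(Z) + D)
```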

In fact, it is not normally tractable to compute the full Gibbs free energy using beliefs over all states of all the nodes in the system, but this variational argument inspires approximations that use beliefs over regions comprising only a limited number of variable nodes. The simplest such approximation, called the Bethe approximation, uses beliefs over “large” regions consisting of the variable nodes attached to a single factor node, and “small” regions consisting of single variable nodes [77].

If we use ba(Xa) to denote a multi-node belief over all the variable nodes attached to the factor node a (intended to approximate the multi-node marginal probability pa(Xa)), then the Bethe free energy FB is given by FB = UB − SB, where the Bethe average energy UB is

UB = ∑_a ∑_{Xa} ba(Xa) Ca(Xa) (13)

and the Bethe entropy SB turns out to be

SB = −∑_a ∑_{Xa} ba(Xa) ln ba(Xa) + ∑_i (di − 1) ∑_{xi} bi(xi) ln bi(xi). (14)

Here, di is the number of factor nodes neighboring variable node i. The Bethe free energy FB is a functional of the beliefs ba(Xa) and bi(xi), which must satisfy the normalization conditions (for all a and i):

∑_{Xa} ba(Xa) = ∑_{xi} bi(xi) = 1 (15)

and the marginalization conditions for variable nodes i neighboring factor nodes a:

bi(xi) = ∑_{Xa\xi} ba(Xa) (16)

where Xa\xi denotes all variables attached to factor node a except xi.

The marginalization condition turns out to be fundamental: the Lagrange multipliers that enforce it when minimizing the Bethe free energy turn out to equal linear combinations of the fixed-point messages in sum-product BP.
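Given factor cost tables and a consistent set of beliefs, evaluating (13) and (14) is straightforward; a minimal sketch (the data-structure conventions are mine):

```python
import math

def bethe_free_energy(factor_costs, factor_beliefs, var_beliefs, degree):
    """F_B = U_B - S_B from (13) and (14).
    factor_costs:   {a: {Xa assignment tuple: cost Ca(Xa)}}
    factor_beliefs: {a: {Xa assignment tuple: b_a(Xa)}}
    var_beliefs:    {i: {x_i: b_i(x_i)}}
    degree:         {i: d_i}, number of factor nodes touching variable i."""
    # Bethe average energy U_B; terms with zero belief are skipped so that a
    # forbidden (infinite-cost) configuration with zero belief contributes 0.
    U_B = sum(factor_beliefs[a][Xa] * cost
              for a, table in factor_costs.items()
              for Xa, cost in table.items() if factor_beliefs[a][Xa] > 0.0)
    # Bethe entropy S_B: factor-region term minus over-counted variable terms.
    S_B = -sum(p * math.log(p)
               for beliefs in factor_beliefs.values()
               for p in beliefs.values() if p > 0.0)
    S_B += sum((degree[i] - 1) * p * math.log(p)
               for i, beliefs in var_beliefs.items()
               for p in beliefs.values() if p > 0.0)
    return U_B - S_B
```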

The Bethe average energy UB is actually exactly equal to the true average energy of the system given correct beliefs, but the Bethe entropy SB is only an approximation. Improving that approximation, using larger regions of nodes to compute the entropy, gives better “Kikuchi” [36] or “region graph” free energies, and corresponding “generalized belief propagation” algorithms that give more accurate marginal probabilities than the sum-product algorithm [77].

Building on these ideas, Chertkov and Chernyak developed another significant approach to improving BP, that begins with sum-product BP or the equivalent Bethe approximation as a zeroth order approximation, and obtains a “loop” series that systematically improves on that approximation [11].

Given the fact that spin glasses and related models can be described using factor graphs, it was perhaps unsurprising that the well developed statistical physics of disordered systems, and in particular the “replica” and “cavity field” methods [52], would be related to BP message-passing algorithms [51]. From this perspective, it is clear that the sum-product BP equations only describe a single minimum of the free energy of the system, and ignore any fracturing of phase space such as often occurs in disordered systems. An important and surprising breakthrough exploiting this insight was the development of another improved message-passing algorithm called “survey propagation,” based on heuristic ideas that try to take phase-space fracturing into account [53]. Survey propagation has been shown to solve NP-hard random satisfiability problems very effectively, even very close to the satisfiability threshold where they are particularly difficult [9].

On the other hand, one can take the point of view that the Bethe free energy that underlies sum-product BP would be more convenient to work with if it was always convex as a function of the beliefs, as it is when the factor graph is a tree. Wainwright, Jaakkola, and Willsky [71] derived new message-passing algorithms by replacing the Bethe entropy approximation with concave entropies to obtain “convexified” free energies which are guaranteed to have a unique global minimum. They showed that their approach could be used to find upper bounds on the log partition function (or lower bounds on the Helmholtz free energy) of the system.

11 Min-sum Algorithms Based on “Splitting”

Wainwright, Jaakkola, and Willsky also introduced related "tree-reweighted" BP algorithms, in both sum-product [72] and max-product [73] form, and proved several powerful theorems about them (see also [65] for extensions of tree-reweighted BP algorithms to solve linear programming problems). These algorithms were originally introduced in the context of pairwise Markov random fields rather than standard factor graphs. I therefore prefer to consider instead the insightful recent formulation of Ruozzi and Tatikonda [61], which generalizes the tree-reweighted max-product algorithm in a way that directly connects to the min-sum BP algorithm on standard factor graphs.


Fig. 13 One can obtain a new factor graph modeling exactly the same cost function by splitting a factor node into two identical factor nodes, each taking half the cost of the original, and connected to the same variable nodes

Ruozzi and Tatikonda start from the idea that it is easy to create equivalent factor graphs for the same overall cost function by "splitting" either factor or variable nodes. For example, as illustrated in Fig. 13, we can split the factor node b into two factor nodes b1 and b2, each of which has the same neighboring nodes as b and gets exactly half of the cost associated with the original node b. Doing this gives us a factor graph that models exactly the same cost function as the original one, but notice that the min-sum algorithm associated with the new factor graph is different from the original algorithm.

Let's assume that we initialize all messages from the same variable node but to different split copies of a factor node to be equal to each other, and similarly for messages from different split copies of a factor node. In that case, the messages to and from split copies of a factor node will continue to maintain that symmetry, and can be identified with the messages to and from the factor node in the original factor graph. So in fact, we can translate the message-passing algorithm on the factor graph with split factor nodes into an induced message-passing algorithm on the original factor graph. If each variable node i is split ki times, and each factor node a is split ka times, it is easy to verify that the new induced message update rules are

mi→a(xi) = (ka − 1) ma→i(xi) + ∑_{b∈N(i)\a} kb mb→i(xi)    (17)

and

ma→i(xi) = min_{Xa\xi} [Ca(Xa)/ka + (ki − 1) mi→a(xi) + ∑_{j∈N(a)\i} kj mj→a(xj)].    (18)

Notice that the message mi→a(xi) now depends directly on the message ma→i(xi) coming in the opposite direction on the same edge. This results from splitting: a message to one of the copies of a depends on the messages from the other copies of a.

Even though the message update rules (17) and (18) were originally derived using splitting factors ka and ki that were positive integers, they are also perfectly well-defined for any real values of ka and ki, including real values that are less than 1. Ruozzi and Tatikonda show that if ka and ki are chosen appropriately (e.g. for a regular graph where each variable node has degree d, choose ki = 1, and ka to be a positive real less than 1/d), then one can prove some remarkable theorems about the resulting message-passing algorithm.
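
To make the induced update rules concrete, here is a minimal sketch of (17) and (18) in Python. The data structures (messages stored as numpy arrays indexed by variable state, and a cost given as a function of a joint configuration) are illustrative assumptions of mine, not part of the splitting formulation itself.

```python
import itertools
import numpy as np

def variable_to_factor(i, a, k, factor_msgs):
    """Rule (17): message from variable node i to factor node a.

    factor_msgs maps each factor b neighboring i to its current message
    m_{b->i} (an array over the states of x_i); k holds the splitting
    constant for every node."""
    msg = (k[a] - 1.0) * factor_msgs[a]
    for b, m_b in factor_msgs.items():
        if b != a:
            msg = msg + k[b] * m_b
    return msg

def factor_to_variable(a, i, k, neighbors, cost, num_states, var_msgs):
    """Rule (18): message from factor node a to variable node i.

    neighbors is the ordered list of variables attached to a, cost maps a
    joint configuration (a tuple ordered like neighbors) to C_a, and
    var_msgs maps each neighboring variable j to its message m_{j->a}."""
    others = [j for j in neighbors if j != i]
    msg = np.full(num_states[i], np.inf)
    for xi in range(num_states[i]):
        for rest in itertools.product(*(range(num_states[j]) for j in others)):
            values = dict(zip(others, rest))
            values[i] = xi
            total = cost(tuple(values[j] for j in neighbors)) / k[a]
            total += sum(k[j] * var_msgs[j][values[j]] for j in others)
            msg[xi] = min(msg[xi], total)
        msg[xi] += (k[i] - 1.0) * var_msgs[i][xi]
    return msg
```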

In particular, if a fixed point of the message-passing update rules is reached, and the resulting single-node beliefs each have a unique lowest cost state for that node, then the overall state obtained by combining those single-node states is provably a global optimum! This is a surprising theorem, because it holds for arbitrary factor graphs, even those with cycles. Moreover, with the same choice of splitting constants, one can devise simple schedules that provably converge. Of course, the condition that each single-node belief must have a unique lowest cost state at the fixed point (no ties are allowed) is an important loophole that will often prevent a "splitting" message-passing algorithm from giving the globally optimal solution for an NP-hard problem in a difficult regime.

12 BP for Factor Graphs with Continuous Variables

Because BP messages and beliefs track the cost of every possible state of a variable, it is much easier in practice to apply BP to problems where the variables all have a small number of possible states. Nevertheless, the BP update rules are well-defined for continuous variables; it is just that the messages and beliefs must be full functions of the variables. There are certain circumstances when these functions can still be computed efficiently.

First, suppose one tries to parameterize a message or belief function that represents a probability as a Gaussian (or equivalently as a quadratic if we use costs instead of probabilities); in this case one would only need to store its mean and variance. It turns out that there are a variety of important local cost functions such that, locally, the max-product and sum-product BP algorithms preserve a Gaussian form [43]. The famous Kalman filtering problem [35] is constructed from local cost functions that all preserve Gaussians, and moreover the factor graph is a chain, so that a Gaussian BP algorithm parameterizing messages using means and variances is exact, and is equivalent to the Kalman smoothing algorithm [43, 44].
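
The bookkeeping needed when messages stay Gaussian is very light. As a hedged sketch (mine, not taken from any particular Gaussian BP implementation), combining several Gaussian messages at a variable node amounts to adding precisions and precision-weighted means:

```python
def combine_gaussian_messages(messages):
    """Combine Gaussian messages arriving at a variable node.

    messages is a list of (mean, variance) pairs with variance > 0.
    The product of Gaussian densities is again Gaussian, with precision
    equal to the sum of the incoming precisions."""
    precision = sum(1.0 / v for _, v in messages)
    weighted_mean = sum(m / v for m, v in messages)
    variance = 1.0 / precision
    return weighted_mean * variance, variance

# e.g. two messages N(0, 1) and N(2, 1) combine to N(1, 0.5)
print(combine_gaussian_messages([(0.0, 1.0), (2.0, 1.0)]))
```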

For graphs with cycles, a BP algorithm may not converge, and in general its fixed points will no longer be exact, even if the local cost functions preserve the Gaussian form of messages. However, in the important special case of a pairwise Markov random field that represents a Gaussian distribution of many variables, Weiss and Freeman [75] and Rusmevichientong and Van Roy [62] proved that if the BP algorithm converges, the calculated means (but not the variances) are exact. The conditions under which Gaussian BP converges [47], and how to fix its convergence properties [32], have been the subject of much recent work.

For certain problems, one can sometimes show that min-sum BP will preserve other functional forms. For example, Gamarnik, Shah, and Wei recently showed that for the min-cost network flow problem, the messages preserve a piecewise linear form [24], and proved that min-sum BP converges to the optimal solution.

If one does not have local cost functions that preserve any nice form for the messages, dealing with continuous variables becomes very difficult. Nevertheless, one can begin with the assumption that messages have some nice form (for example Gaussians), then proceed by computing the outgoing messages using the sum-product or max-product BP update rules, and convert the results back into Gaussians by retaining only their means and variances. A systematic way to do this is provided by the popular "Expectation Propagation" algorithm [55].
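
The "convert back into a Gaussian" step just described is a moment-matching projection. A minimal sketch follows; evaluating the outgoing message on a grid is purely an illustrative assumption (practical Expectation Propagation implementations compute the required moments analytically or by quadrature), but it shows what information is retained.

```python
import numpy as np

def project_to_gaussian(x_grid, unnormalized_message):
    """Retain only the mean and variance of a (generally non-Gaussian)
    message evaluated on a grid of points x_grid."""
    weights = unnormalized_message / np.sum(unnormalized_message)
    mean = np.sum(weights * x_grid)
    variance = np.sum(weights * (x_grid - mean) ** 2)
    return mean, variance
```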

"Non-parametric BP" is a similar but more accurate method that uses mixtures of Gaussians, and efficient sampling techniques [67]. Finally, the "Particle Belief Propagation" approach [29] builds on the "Particle Filter" sampling technique [67] first developed for temporal processes. However, the drawback of relying on sampling is that it becomes increasingly computationally intensive to obtain more precise results.

As you can see, there are lots of options for dealing with continuous variables in the context of BP, which is not really surprising, because continuous variables are ubiquitous in inference and optimization problems that we care about. For an extensive review, with an emphasis on Gaussian BP for signal processing applications, see [45].

Fig. 14 (Color online) A constraint graph with three hidden variables, one observed variable, and three hard constraints. The little blue "beads" placed on the edges represent replicas of the neighboring variable node, which can temporarily take different values, but must be equal at a solution

Consider, however, the simplest and most naive approach imaginable: instead of keeping track of the costs of every possible value of a variable, why don't we just use our single best current guess for the value as a message? It might seem unlikely that such an approach can give good results, but in fact with sufficient ingenuity it can; such an approach underlies the Divide and Concur algorithm, which we turn to now.

13 Divide and Concur

The DC algorithm was introduced by Gravel and Elser [27], and builds upon considerable earlier work done by Elser with his students on the use of "difference-map" dynamics in iterative projection algorithms [16]. As I did with the min-sum BP algorithms, I will begin by explaining how to implement DC, and then discuss its justifications.

Like BP, DC can be used to solve optimization and inference problems, but it is easier to begin by considering its application to constraint satisfaction problems. In a constraint satisfaction problem, we are looking for a configuration of variables such that all constraints are satisfied. In the language of factor graphs, all "cost" functions are "hard": their only possible values are zero (the constraint is satisfied) or infinite (the constraint is not satisfied).

To explain the DC algorithm it is standard to introduce "replicas" or "copies" of each variable; we introduce one replica of each variable for each constraint it is involved in. (These "replicas" have nothing to do with the "replica method" used for averaging over disorder in statistical physics [51], or the copies used in Ruozzi and Tatikonda's "splitting" method previously described [61].) Of course, the different replicas of the same variable eventually have to equal each other, but temporarily we can allow them to be unequal while they satisfy different constraints. Essentially we are lifting our original problem to a higher dimensional space where it is easier to solve.

In Fig. 14, I show a small example of a "constraint graph," which is useful for visualizing the DC algorithm. A constraint graph is like a factor graph, except that the factor nodes have been replaced with constraint nodes that represent hard constraints. These constraint nodes can represent arbitrary and possibly non-linear constraints on the neighboring variables.

Note that the variables in a constraint graph should always be thought of as real (continuous) numbers. However, problems with discrete variables can easily be handled by simply adding the appropriate constraints to the constraint graph.

Our example constraint graph explicitly represents the replicas of each variable as little "beads" on the edges between constraint and variable nodes. Next to each replica bead there is a real number that is the current value of the replica. Notice that the replicas are associated with the edges of the constraint graph, just as BP messages are associated with the edges of factor graphs.

Fig. 15 State of the replicas of the constraint graph in Fig. 14 after applying a Divide projection

Fig. 16 A "normal" version of the constraint graph from Fig. 14, where the variable nodes are replaced with constraint nodes that impose equality on neighboring replicas

The DC algorithm is built from two "projections" on the replica values. The "Divide" projection does the most natural thing imaginable to satisfy the constraints: it moves the replica values from their current values to the nearest values that satisfy all the constraints. The "Concur" projection also is completely natural: it averages the replica values that belong to the same variable.

Let us use our example constraint graph, with its replica values, to illustrate how a "Divide projection" works. In the top left, we have a constraint x1 ≥ x2 + x3, and the current replica values are 5 for the replica of x1, 3 for the replica of x2, and 6 for the replica of x3. We know that x1 is an observed variable, so we leave it fixed, but otherwise we want to move the replicas of x2 and x3 as little as possible while making them satisfy the constraint x1 ≥ x2 + x3.

When I say "move as little as possible," of course I must specify a metric. The choice of metric is a crucial issue, but for the moment, let's just use a standard Euclidean metric.

Using this metric, the Divide projection would satisfy its constraint by moving the replicas of x2 and x3 to the values 1 and 4 respectively. At the same time, and in parallel, we can project the replicas around the other constraints (x3x4 = 0 and x2 = (x4)²) to their nearest values which satisfy those constraints. Note that each replica is connected to only one constraint, so that all these local Divide projections can be done fully in parallel. The overall Divide projection of all the replicas is simply the concatenation of all the local projections.

Figure 15 shows the results of applying the Divide projection to the constraint graph from Fig. 14. It is usually easy to write a subroutine to compute each local Divide projection at a particular constraint node, so the DC algorithm divides the overall problem of simultaneously satisfying all the constraints into a lot of small problems of projecting to the nearest solution of a single constraint.
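
For the constraint in the example above, the local Divide projection has a simple closed form: with x1 held fixed, the nearest point (in the Euclidean metric) satisfying x1 ≥ x2 + x3 is obtained by splitting the excess equally between x2 and x3. A small sketch (the function name is mine, not the paper's):

```python
def divide_projection_sum_constraint(x1, x2, x3):
    """Euclidean projection of (x2, x3) onto the half-plane x2 + x3 <= x1,
    with the observed variable x1 held fixed."""
    excess = x2 + x3 - x1
    if excess <= 0.0:
        return x2, x3          # constraint already satisfied
    return x2 - excess / 2.0, x3 - excess / 2.0

# With the replica values from the text (x1 = 5, x2 = 3, x3 = 6)
# this returns (1.0, 4.0), matching the values quoted above.
print(divide_projection_sum_constraint(5.0, 3.0, 6.0))
```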

The Concur projection can also be thought of as making the smallest move that satisfies a constraint; now the constraint is that different replicas of the same variable must be equal, and the smallest move is to move all the replicas to the average. Thus we could as well have drawn our constraint graph in the form shown in Fig. 16, where we have replaced the hidden variable nodes with constraints that impose equality on the neighboring replicas, and now the Divide and Concur projections would work in exactly the same way.


Fig. 17 A toy "normal" constraint graph used to illustrate replica dynamics

Forney introduced such so-called "normal" versions of standard factor graphs [19], which, while equivalent to the standard version, have several advantages in terms of the insight they give. For example, using normal factor graphs, one sees that BP in fact has only one message update rule, just applied to different types of factors. Normal factor graphs are also well-suited for hierarchical modeling (it is easy to create a super-factor by enclosing a group of factors in a box), and they are compatible with standard block diagrams [43].

Returning to the DC algorithm, the simplest way to combine the Divide and Concur projections would be just to alternate between them, the so-called "alternating projections" algorithm. However, that is often a bad idea, as explained in the next section.

14 Traps in Alternating Projections

Let us denote the vector of all the values of the replicas in a constraint graph at iteration t by rt, and the replica values obtained by applying the Divide projection to rt by PD(rt). Then the alternating-projections algorithm would iteratively apply the Divide projection PD and the Concur projection PC; that is, it would obtain rt+1 using the rule

rt+1 = PC(PD(rt)).    (19)

The problem with alternating projections is that the algorithm can be trapped in a simple cycle, where the replica vector first satisfies the Divide constraints but not the Concur constraints, and then satisfies the Concur constraints but not the Divide constraints, and then goes right back to where it was before satisfying the Divide constraints.

Consider the somewhat contrived "normal" constraint graph [78] shown in Fig. 17. This constraint graph has two replicas, denoted rx and ry, and two constraints: an equality constraint on the right corresponding to a variable node, and a constraint that the two replicas are either at point A, which is at (rx = 0, ry = 0), or at point B, which is at (rx = 3, ry = 1). We can consider the Divide projection to move the replica vector to the nearest of points A and B, and the Concur projection to set rx and ry to their mean value.

The only solution that satisfies all the constraints is the point A, where rx = ry = 0. But let's see what happens when we start at some point near B like point D in Fig. 18. The Divide projection takes us to the nearest point of A or B, which is B, then the Concur projection takes us to the nearest point on the diagonal line, which is C, then we go back to B, and so on. We never find the true solution at A.
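
This trap is easy to reproduce numerically. The sketch below is mine, not from the paper; the starting point near B is an arbitrary choice, since the coordinates of point D are not specified. It implements the two projections for the toy problem and iterates rule (19); the dynamics immediately falls into the B-to-C-to-B cycle.

```python
import numpy as np

A = np.array([0.0, 0.0])          # the true solution
B = np.array([3.0, 1.0])

def P_D(r):
    """Divide projection: nearest of the two allowed points A and B."""
    return A if np.linalg.norm(r - A) <= np.linalg.norm(r - B) else B

def P_C(r):
    """Concur projection: set both replicas to their mean (the diagonal)."""
    return np.full(2, r.mean())

r = np.array([2.5, 1.0])          # an arbitrary starting point near B
for t in range(5):
    r = P_C(P_D(r))               # alternating projections, rule (19)
    print(t, r)                   # stuck at C = (2, 2); Divide keeps choosing B
```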

15 Difference-Map Dynamics

Fig. 18 An example of a trap resulting from alternating projections. If one alternately projects to the nearest point that satisfies the constraint to be at A or B, and then the nearest point where the replica values are equal (the diagonal line), one may be trapped in a short cycle (B to C to B and so on) and never find the true solution at A

To make progress, we need a way to turn pairs of points in replica space where the Divide constraints and Concur constraints come close, but do not intersect (like points B and C in Fig. 18), into repellers in the dynamics rather than traps. The "Difference-Map" (DM) dynamics does that [16]. We first consider a particular version of DM, where the replica update rule is

rt+1 = PC(rt + 2[PD(rt) − rt]) − [PD(rt) − rt].    (20)

To parse this complicated-looking equation, it is useful to think of the difference-map dynamics as breaking up into a three-step process [78]. The expression [PD(rt) − rt] represents the change to the current values of the replicas resulting from the Divide projection. In the first step, the values of the replicas move twice the desired amount indicated by the Divide projection. We can refer to these new values of the replicas as the "overshot" values rt^over = rt + 2[PD(rt) − rt]. Next the Concur projection is applied to the overshot values to obtain the "concurred" values of the replicas rt^conc = PC(rt^over). Finally the overshoot (that is, the extra motion in the first step) is subtracted from the result of the Concur projection to obtain the replica value for the next iteration rt+1 = rt^conc − [PD(rt) − rt].
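
Written as code, one iteration of (20) is just the three steps named above. This is a generic sketch of mine; P_D and P_C stand for whatever problem-specific Divide and Concur projections one has implemented.

```python
def difference_map_step(r, P_D, P_C):
    """One iteration of the difference-map rule (20): overshoot, concur, correct."""
    change = P_D(r) - r            # change requested by the Divide projection
    overshot = r + 2.0 * change    # step 1: move twice the requested amount
    concurred = P_C(overshot)      # step 2: apply the Concur projection
    return concurred - change      # step 3: subtract the overshoot
```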

In Fig. 19 we return to our previous example to illustrate that the DM dynamics does not get stuck in a trap. Suppose that we now start initially at point r1 = (2,2). The Divide projection would take us to point B, but the overshoot takes us twice as far to r1^over = (4,0). The Concur projection takes us back to r1^conc = (2,2). Finally, the amount by which we overshot is subtracted so that r2 = (1,3). The next full iteration takes us to r3 = (0,4) (sub-steps are tabulated in Fig. 19). Now, however, we are closer to A than to B. Therefore, the next overshoot takes us to r3^over = (0,−4), from which we would move to r3^conc = (−2,−2), and r4 = (−2,2). Finally, at r4 we have reached a fixed point in the dynamics, because r5 = r4.

It can be proven that if a fixed point r∗ in the DM dynamics is reached such that rt+1 = rt = r∗, then that fixed point must correspond to a solution rsol that can be obtained using rsol = PD(r∗). Thus, applying one more Divide projection to our fixed point, we arrive at our solution at (0,0).
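
The whole worked example fits in a few lines of code. The sketch below (again mine, with the toy problem's projections written out explicitly) reproduces the trajectory r1 = (2,2), r2 = (1,3), r3 = (0,4), r4 = (−2,2) quoted above, and then recovers the solution by applying one more Divide projection to the fixed point.

```python
import numpy as np

A, B = np.array([0.0, 0.0]), np.array([3.0, 1.0])

def P_D(r):                        # Divide: nearest of the two allowed points
    return A if np.linalg.norm(r - A) <= np.linalg.norm(r - B) else B

def P_C(r):                        # Concur: replace both replicas by their mean
    return np.full(2, r.mean())

def dm_step(r):                    # the difference-map rule (20)
    change = P_D(r) - r
    return P_C(r + 2.0 * change) - change

r = np.array([2.0, 2.0])           # r1 in the worked example
for t in range(1, 5):
    print(f"r{t} =", r)            # prints (2,2), (1,3), (0,4), (-2,2)
    r = dm_step(r)

print("solution:", P_D(r))         # one more Divide projection gives (0,0) = A
```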

Fig. 19 An example showing how DM dynamics avoids traps. If we start at the point r1, an alternating projections dynamics would be trapped between point B and r1, and never find the solution at A. DM dynamics will instead be repelled from the trap and move to r2 (via the three sub-steps denoted with dashed lines r1^over, r1^conc = r1, and r2), then move to r3, and then end at the fixed point r4 = r∗, which corresponds to the solution at A

Of course, the DM dynamics is not a panacea. It is possible, for example, for the replica vector to fall into a more complicated cycle and fail to find a fixed point. More generically, it is believed that when DM fails to converge, it typically follows chaotic dynamics. Empirically, though, it is now clear that the DM dynamics can effectively solve a great variety of problems (e.g. graph coloring, solving Diophantine equations, Sudoku, spin glass ground states, phase retrieval, random satisfiability, sphere packing, heteropolymer folding) that would be insoluble with the more naive alternating projections approach [16, 26].

It is natural to wonder to what extent the DM rule (20) can be modified, perhaps to improve convergence; for example, can that funny-looking 2 that defines how much one overshoots be changed into a parameter? It turns out the value 2 is a good choice: with smaller values you don't always escape from a trap, while with larger ones you start shooting away at an exponentially growing rate, which can cause problems.

However, you might also notice that (20) does not treat the Divide and Concur projections symmetrically. What if we swap the roles of PC and PD? It turns out that works fine, typically about as well as the original version. So is there a parameterized version of DM that lets us move smoothly from the original version given by (20) to the version with PC and PD swapped? Such a parameterization has indeed been devised [16, 27]. For many problems, a parameter value that gives a version of DM part-way between the original and the swapped version, but still rather close to one of them, works best [26].

16 DC as a Message-passing Algorithm

Nevertheless, the standard DM dynamics given by (20) has an important conceptual advantage: it makes it easier to see that DC is a message-passing algorithm closely analogous to BP [78]. Recall that each iteration of the standard DM dynamics can be interpreted as a three-step process: first overshoot, then concur, then correct.

The overshoot computation is done using the Divide projection at the constraint nodes. One can interpret the replica values before the overshoot as "messages" from the variable nodes to the factor nodes, and the resulting overshot values as messages from the factor nodes to the variable nodes.

The second step is to concur the overshot replica values, and make them equal at each variable node. This exactly parallels the BP step where one computes a belief at each variable from the incoming messages.


Fig. 20 (Color online) A constraint graph derived from the factor graph shown in Fig. 2. The red circles represent new cost variable nodes corresponding to the soft cost functions in the original factor graph. The red square represents a global constraint on the maximum summed cost

Finally, the third step is to correct the concurred replica values at the variable nodes by subtracting the original overshoot. This parallels the BP step (see (6)) where one computes the messages from variable nodes to factor nodes by subtracting the incoming message from the belief.

To summarize, although the details of the update rules are different, the DC overshot replica values correspond to BP messages from factor nodes to variable nodes, the DC concurred replica values correspond to BP beliefs, and the DC corrected replica values correspond to BP messages from variable nodes to factor nodes [78].

17 DC for Optimization

We have seen how to use DC to solve constraint satisfaction problems; now I will show that optimization problems can be converted into constraint satisfaction problems. In some cases, this is relatively easy; if we know some conditions on the optimum configuration (e.g. the stationarity conditions), we can impose those as constraints.

Another possibility, which may be less elegant but which is always available, is to introduce "cost variables" corresponding to the cost functions of a factor graph. Consider for example the factor graph that we first introduced in Fig. 2, and the constraint graph derived from it shown in Fig. 20. For each soft local cost, we introduce a new variable corresponding to that cost, and modify the soft factor into a corresponding hard constraint. For example, for the constraint b in the constraint graph illustrated, we require that if x2 and x4 equal 0, then the cost variable Cb must equal 1.2.

All the cost variables can then be tied together in a new global hard constraint, which says that the sum of the cost variables must be less than some desired maximum cost. If we like, we can continually tighten the desired maximum cost in an outer loop of the algorithm.
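
The global constraint on the summed cost is itself convex, so its local Divide projection is simple. A hedged sketch (my own illustration of the construction described above, not code from the paper): the cost-variable replicas are projected onto the half-space where their sum does not exceed the current maximum cost.

```python
import numpy as np

def project_total_cost(cost_replicas, max_cost):
    """Euclidean projection of the cost-variable replicas (a numpy array)
    onto the set { c : sum(c) <= max_cost }, the global constraint above."""
    excess = cost_replicas.sum() - max_cost
    if excess <= 0.0:
        return cost_replicas
    return cost_replicas - excess / len(cost_replicas)
```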

This technique, which can generally convert an optimization problem into a constraint satisfaction problem to which DC can be applied, has been used to develop DC decoders for LDPC codes [26, 78], and DC algorithms that optimize heteropolymer energy functions [15]. Incidentally, similar ideas have long been used to convert optimization problems into decision problems in the theory of computational complexity [57].


18 Advantages and Disadvantages of DC Compared with BP

As an optimization or constraint satisfaction algorithm, DC presents several notable advantages compared with BP algorithms. First of all, as we have emphasized, DC has no difficulties dealing with continuous variables.

A second, less obvious, advantage is that DC, unlike BP, performs very well even when the hidden variables have no good "local evidence." BP algorithms often converge to a fixed point with non-informative beliefs in the absence of local evidence.

Consider, for example, the problem of packing hard spheres (or other geometrical objects) as densely as possible into a finite volume such as the interior of a cube, for which DC is the state-of-the-art algorithm [34]. Because there is no local evidence saying that any particular sphere should be in a particular part of the volume, if one applied BP, the messages sent from variables representing sphere centers to the constraints would start off generally non-informative, simply "saying" that the cost of being anywhere in the cube was equal. Those messages would be a fixed point of the dynamics: the BP algorithm would get no traction. Even starting with random messages, a BP algorithm for this problem would nearly inevitably converge to non-informative messages. DC, on the other hand, is forced to make a single guess for each message at all times. Even starting with some initial random guesses for the positions of the sphere centers, it gradually works its way to a solution that satisfies all the hard constraints.

A third advantage of DC is that it is much easier to introduce complicated hard constraints, as might naturally arise in computer vision or control problems.

A fourth advantage of DC is that it cannot be trapped at a fixed point that is not a solution of the problem. BP, on the other hand, can converge to non-solutions, such as the "trapping sets" that cause "error floors" in BP decoders of some LDPC codes [59]. DC LDPC decoders do not get stuck in traps, as we would expect, but unfortunately they often decode to codewords that are less likely than the transmitted codeword, a failure mode not seen in BP [78]. Working from the idea of combining the advantages of BP and DC, my colleagues Yige Wang, Stark Draper and I recently proposed a hybrid "difference-map BP" decoder that heuristically imports a difference-map dynamics into a min-sum BP decoder. Difference-map BP has proven to significantly improve error floor performance compared to standard BP decoders, and also turns out to be closely related to the "splitting" BP algorithm [78].

DC also has some disadvantages in comparison with BP. Most significantly, it fundamentally only tracks a single value for each variable, so it cannot compute marginal probabilities like the sum-product algorithm does, or properly account for a probabilistic weighted sum over states.

DC also is often somewhat slow in converging to a solution. A related issue is that the convergence rate of DC depends crucially on the metric chosen. Even if we restrict ourselves to the natural Euclidean metric, it is very important how one scales the different variables in a problem (e.g. for a particular problem, it matters significantly whether one measures distances for a particular variable in units of millimeters or kilometers, and costs in units of dollars or cents). Currently, the best way to scale variables in DC is somewhat of a mystery.

Finally, unlike many versions of BP, DC is not guaranteed to give correct answers for constraint graphs or factor graphs without cycles. It does, however, have its own set of guarantees, which we turn to next.

19 Justifications for and History of DC

Let's begin by considering when the alternating projections algorithm can be guaranteed to find a satisfying solution to a constraint satisfaction problem. Suppose that we have two constraints that we want to satisfy. In our case we consider the collection of Divide constraints to be one constraint, and the collection of Concur constraints to be the other constraint. Each of the two constraints can be considered to be a set of points in some space (in our case consider that to be the space of replica vectors).

A set of points is defined to be convex if the segment of points connecting any two points in the set is also contained in the set. It turns out (see the monograph by Bauschke and Combettes [3] for a full mathematical treatment of the topics discussed in this section) that if two sets of points are convex and intersect, alternately projecting from a point in one set to the nearest point in the other is guaranteed to converge to a point in the intersection of the two sets. In that case, the alternating projections algorithm is also known as the "Projections onto Convex Sets" algorithm. It follows that if the sets of replica vectors that satisfy the Divide and Concur constraints are each convex, then the alternating projections algorithm will converge to a solution.

Notice, however, that the alternating projections algorithm is only guaranteed to be weakly convergent: the algorithm sometimes bounces from one convex set to the other, getting ever closer to the intersection of the two sets, but never actually reaching it in a finite number of iterations. Think, for example, of the case when the two convex sets are two lines that intersect at a single point.

Just as for alternating projections, one can prove that the DM projection dynamics also always converges to the intersection of two convex sets, and in some cases (but not always) it accelerates convergence compared to alternating projections. In fact, for some cases where alternating projections converges only weakly, DM converges in a finite number of iterations. So for the problem of finding the intersection of two convex sets, DM is a handy alternative to alternating projections: it also always converges to a solution, and sometimes is much faster.

It is easy to show that the set of replica values satisfying the Concur constraints is automatically convex. It is also true that an intersection of convex sets is convex, which means that if all the sets of replica values satisfying the local Divide constraints are convex, the overall Divide constraint is also convex.

Historically, the "difference-map" dynamics given by (20) were first investigated in 1956 by Douglas and Rachford [13] as an improved iterative method for solving partial differential equations, and extended into a projection operator splitting method by Lions and Mercier in 1979 [42]. The surprisingly successful application of the DM algorithm to the non-convex phase retrieval problem [18] sparked the more recent investigations into why DM dynamics should also be useful for non-convex problems [4], their application to many other problems [16], and finally the formulation of the DC algorithm [27], which makes clear that the approach can be applied to general constraint satisfaction and optimization problems, albeit without the guarantees that obtain when it is applied to convex problems.

To summarize, just as BP algorithms converge to exact solutions on tree factor graphs but still can give very useful results for problems defined on general graphs, the DC algorithm converges to exact solutions when the constraint solution sets are convex, but can still give very useful results for problems defined using general constraints.

20 Conclusions and Future Research

In this paper, I have tried to present a tutorial on BP and DC algorithms that will help introduce the reader to a very active area of current research. These algorithms turn out to have a similar message-passing structure, which is ideally suited to parallelization.


Both sets of algorithms can be used to find good approximate solutions to a very wide set of problems, and exact solutions to a more limited, but still very important, subset. In a general sense, I believe that an important goal of future research is to expand these sets of applications. For example, the set of problems that can be exactly solved using BP can be fruitfully expanded by using BDDs and ZDDs [37]. Many interesting convex problems can be solved exactly using DC or other projection-based algorithms; an important issue for future research is how efficient these algorithms are compared to more conventional approaches [7].

The scope for future research concerning approximate solutions is even greater. In particular, the ease with which DC algorithms deal with continuous variables and complicated constraints opens up many new possibilities. To take one example, control problems often involve optimizing functions of many continuous variables over complicated constraints. So while these problems have not heretofore been a good fit for BP algorithms, it is certainly conceivable that DC algorithms could be profitably used on such problems. A crucial and fundamental issue that could ultimately determine the success of such efforts will be whether we can learn to optimize DC algorithms so that they reach a solution as quickly as possible.

Acknowledgements I thank Veit Elser for many helpful discussions about the Divide and Concur algorithm, and Stark Draper, Yige Wang, Bill Freeman, and Yair Weiss for enjoyable collaborations on the subjects discussed here.

References

1. Bahl, L., Cocke, J., Jelinek, F., Raviv, J.: Optimal decoding of linear codes for minimizing symbol error rate. IEEE Trans. Inf. Theory 20, 284–287 (1974)
2. Barber, D.: Bayesian Reasoning and Machine Learning. Cambridge University Press, Cambridge (2011)
3. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)
4. Bauschke, H.H., Combettes, P.L., Luke, D.R.: Phase retrieval, error reduction algorithm, and Fienup variants: a view from convex optimization. J. Opt. Soc. Am. A 19, 1334–1345 (2002)
5. Baxter, R.J.: Exactly Solved Models in Statistical Mechanics. Academic Press, San Diego (1982)
6. Berrou, C., Glavieux, A., Thitimajshima, P.: Near Shannon limit error-correcting coding and decoding: turbo-codes. In: Proc. 1993 IEEE Int. Conf. on Comm., pp. 1064–1070 (1993)
7. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
8. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimisation via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1222–1239 (2001)
9. Braunstein, A., Mézard, M., Parisi, G.: Survey propagation: an algorithm for satisfiability. Random Struct. Algorithms 27, 201–226 (2005)
10. Bryant, R.E.: Graph-based algorithms for Boolean function manipulation. IEEE Trans. Comput. 35, 677–691 (1986)
11. Chertkov, M., Chernyak, M.: Loop series for discrete statistical models on graphs. J. Stat. Mech. (2006). doi:10.1088/1742-5468/2006/06/P06009
12. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009). Chap. 15
13. Douglas, J., Rachford, H.H.: On the numerical solution of heat conduction problems in two or three space variables. Trans. Am. Math. Soc. 82, 421–439 (1956)
14. Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998)
15. Elser, V., Rankenburg, I.: Deconstructing the energy landscape: constraint-based algorithms for folding heteropolymers. Phys. Rev. E 73, 026702 (2006)
16. Elser, V., Rankenburg, I., Thibault, P.: Searching with iterated maps. Proc. Natl. Acad. Sci. USA 104, 418–423 (2007)
17. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient belief propagation for early vision. Int. J. Comput. Vis. 70, 41–54 (2006)
18. Fienup, J.R.: Phase retrieval algorithms: a comparison. Appl. Opt. 21, 2758–2769 (1982)
19. Forney, G.D.: Codes on graphs: normal realizations. IEEE Trans. Inf. Theory 47, 520–548 (2001)
20. Forney, G.D.: The Viterbi algorithm. Proc. IEEE 61, 268–278 (1973)
21. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning low-level vision. Int. J. Comput. Vis. 40, 25–47 (2000)
22. Gallager, R.G.: Information Theory and Reliable Communication. Wiley, New York (1968)
23. Gallager, R.G.: Low-Density Parity-Check Codes. MIT Press, Cambridge (1963)
24. Gamarnik, D., Shah, D., Wei, Y.: Belief propagation for min-cost network flow: convergence and correctness. In: Proc. of the 2010 ACM-SIAM Symp. on Discrete Algorithms (2010)
25. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
26. Gravel, S.: Using symmetries to solve asymmetric problems. Ph.D. dissertation, Cornell University, Ithaca, NY (2009)
27. Gravel, S., Elser, V.: Divide and concur: a general approach to constraint satisfaction. Phys. Rev. E 78, 036706 (2008)
28. Hershey, J.R., Rennie, S.J., Olsen, P.A., Kristjansson, T.T.: Super-human multi-talker speech recognition: a graphical modeling approach. Comput. Speech Lang. 24, 45–66 (2010)
29. Ihler, A., McAllester, D.: Particle belief propagation. In: Proc. of the 12th Int. Conf. on Artificial Intelligence and Statistics (2009)
30. Ising, E.: A contribution to the theory of ferromagnetism. Z. Phys. 31, 253 (1925)
31. Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press, Cambridge (1997)
32. Johnson, J.K., Bickson, D., Dolev, D.: Fixing convergence of Gaussian belief propagation. In: Proc. Int. Symposium Inform. Theory (2009)
33. Jurafsky, D., Martin, J.H.: Speech and Language Processing. Prentice Hall, Upper Saddle River (2000)
34. Kallus, Y., Elser, V., Gravel, S.: Method for dense packing discovery. Phys. Rev. E 82, 056707 (2010)
35. Kalman, R.E.: A new approach to linear filtering and prediction problems. J. Basic Eng. 82, 35–45 (1960)
36. Kikuchi, R.: A theory of cooperative phenomena. Phys. Rev. 81, 988–1003 (1951)
37. Knuth, D.E.: The Art of Computer Programming, vol. 4A: Combinatorial Algorithms. Addison-Wesley, New York (2011). Sect. 7.1.4
38. Koller, D., Friedman, N.: Probabilistic Graphical Models. MIT Press, Cambridge (2009)
39. Krauth, W.: Statistical Mechanics: Algorithms and Computations. Oxford University Press, Oxford (2006)
40. Kschischang, F.R., Frey, B.J., Loeliger, H.-A.: Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory 47, 498–519 (2001)
41. Lauritzen, S.L., Spiegelhalter, D.J.: Local computations with probabilities on graphical structures and their application to expert systems. J. R. Stat. Soc., Ser. B 50, 157–194 (1988)
42. Lions, P.-L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16, 964–979 (1979)
43. Loeliger, H.-A.: An introduction to factor graphs. IEEE Signal Process. Mag., 28–41 (2004)
44. Loeliger, H.-A.: Least squares and Kalman filtering on Forney graphs. In: Blahut, R.E., Koetter, R. (eds.) Codes, Graphs, and Systems, pp. 113–135. Kluwer Academic, Norwell (2002)
45. Loeliger, H.-A., Dauwels, J., Hu, J., Korl, S., Ping, L., Kschischang, F.: The factor graph approach to model-based signal processing. Proc. IEEE 95, 1295–1322 (2007)
46. MacKay, D.J.C.: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge (2003)
47. Malioutov, D.M., Johnson, J.K., Willsky, A.S.: Walk-sums and belief propagation in Gaussian graphical models. J. Mach. Learn. Res. 7, 2031–2064 (2006)
48. Manning, C.D., Schutze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
49. McEliece, R.J., MacKay, D.J.C., Cheng, J.F.: Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE J. Sel. Areas Commun. 16, 140–152 (1998)
50. Meltzer, T., Yanover, C., Weiss, Y.: Globally optimal solutions for energy minimization in stereo vision using reweighted belief propagation. In: Int. Conference on Computer Vision (2005)
51. Mézard, M., Montanari, A.: Information, Physics, and Computation. Oxford University Press, Oxford (2009)
52. Mézard, M., Parisi, G., Virasoro, M.A.: Spin Glass Theory and Beyond. World Scientific, Singapore (1987)
53. Mézard, M., Parisi, G., Zecchina, R.: Analytic and algorithmic solution of random satisfiability problems. Science 297, 812–815 (2002)
54. Minato, S.-E.: Zero-suppressed BDDs for set manipulation in combinatorial problems. In: Proc. 30th ACM/IEEE Design Automation Conference, pp. 272–277 (1993)
55. Minka, T.P.: Expectation propagation for approximate Bayesian inference. In: Proc. of the 17th Conf. on Uncertainty in Artificial Intelligence, pp. 362–369 (2001)
56. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco (1988)
57. Papadimitriou, C.H.: Computational Complexity. Addison-Wesley, Reading (1993)
58. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286 (1989)
59. Richardson, T.: Error floors of LDPC codes. In: Proc. 41st Allerton Conf. Commun. Contr., Comput., Monticello, IL (2003)
60. Richardson, T., Urbanke, R.: Modern Coding Theory. Cambridge University Press, Cambridge (2008)
61. Ruozzi, N., Tatikonda, S.: Convergent and correct message passing schemes for optimization problems over graphical models. Available at http://arxiv.org/abs/1002.3239 (2010)
62. Rusmevichientong, P., Van Roy, B.: An analysis of belief propagation on the turbo decoding graph with Gaussian densities. IEEE Trans. Inf. Theory 47, 745–765 (2001)
63. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Prentice Hall, Upper Saddle River (2009)
64. Ryan, W.E., Lin, S.: Channel Codes: Classical and Modern. Cambridge University Press, Cambridge (2009)
65. Sontag, D., Meltzer, T., Globerson, A., Jaakkola, T., Weiss, Y.: Tightening LP relaxations for MAP using message-passing. In: Uncertainty in Artificial Intelligence (2008)
66. Sudderth, E.B., Freeman, W.T.: Signal and image processing with belief propagation. IEEE Signal Process. Mag. 25, 114–141 (2008)
67. Sudderth, E.B., Ihler, A., Isard, M., Freeman, W.T., Willsky, A.S.: Non-parametric belief propagation. Commun. ACM 53, 95–103 (2010)
68. Tanner, R.M.: A recursive approach to low complexity codes. IEEE Trans. Inf. Theory 27, 533–547 (1981)
69. Vardy, A.: Trellis structure of codes. In: Pless, V.S., Huffman, W.C. (eds.) Handbook of Coding Theory, vol. 2, pp. 1989–2118. Elsevier, Amsterdam (1998)
70. Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13, 260–269 (1967)
71. Wainwright, M.J., Jaakkola, T., Willsky, A.S.: A new class of upper bounds on the log partition function. IEEE Trans. Inf. Theory 51, 2313–2335 (2005)
72. Wainwright, M.J., Jaakkola, T., Willsky, A.S.: Tree-based reparametrization framework for analysis of sum-product and related algorithms. IEEE Trans. Inf. Theory 45, 1120–1146 (2003)
73. Wainwright, M.J., Jaakkola, T., Willsky, A.S.: MAP estimation via agreement on (hyper)trees: message-passing and linear programming approaches. IEEE Trans. Inf. Theory 51, 3697–3717 (2005)
74. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1, 1–305 (2008)
75. Weiss, Y., Freeman, W.T.: On the optimality of the max-product belief propagation algorithm on arbitrary graphs. IEEE Trans. Inf. Theory 47, 736–744 (2001)
76. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Understanding belief propagation and its generalizations. In: Lakemeyer, G., Nebel, B. (eds.) Exploring Artificial Intelligence in the New Millennium, pp. 239–270. Morgan Kaufmann, San Francisco (2003)
77. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Constructing free energy approximations and generalized belief propagation algorithms. IEEE Trans. Inf. Theory 51, 2282–2312 (2005)
78. Yedidia, J.S., Wang, Y., Draper, S.C.: Divide and concur and difference-map BP decoders for LDPC codes. IEEE Trans. Inf. Theory 57, 786–802 (2011)