Global Approximate Inference
Eran Segal, Weizmann Institute
Jan 07, 2016
General Approximate Inference Strategy
- Define a class of simpler distributions Q
- Search for a particular instance Q in Q that is "close" to P
- Answer queries using inference in Q
Cluster Graph
A cluster graph K for factors F is an undirected graph:
- Nodes are associated with a subset of variables $C_i \subseteq \mathcal{U}$
- The graph is family preserving: each factor $\phi \in F$ is associated with one node $C_i$ such that $Scope[\phi] \subseteq C_i$
- Each edge $C_i$–$C_j$ is associated with a sepset $S_{i,j} = C_i \cap C_j$
A cluster tree over factors F that satisfies the running intersection property is called a clique tree.
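For concreteness, here is a minimal sketch (not from the lecture; the clusters, edges, and factor scopes are made-up examples) that checks the family-preservation and sepset conditions above:

```python
# Minimal sketch: checking the cluster-graph conditions from the definition
# above. Clusters and factor scopes are represented as Python sets; the
# example graph and factors are illustrative, not from the lecture.
clusters = {1: {"C", "D"}, 2: {"G", "I", "D"}, 3: {"G", "S", "I"}}
edges = {(1, 2): {"D"}, (2, 3): {"G", "I"}}       # edge -> sepset S_ij
factor_scopes = [{"C", "D"}, {"G", "I", "D"}, {"S", "I"}]

# Family preserving: every factor fits inside at least one cluster.
assert all(any(s <= c for c in clusters.values()) for s in factor_scopes)

# Clique-tree sepsets: S_ij must equal the intersection of C_i and C_j.
for (i, j), sepset in edges.items():
    assert sepset == clusters[i] & clusters[j]
```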
Clique Tree Inference
[Figure: clique tree for the student network over C, D, I, G, S, L, J, H, with cliques 1: {C,D}, 2: {G,I,D}, 3: {G,S,I}, 4: {G,J,S,L}, 5: {H,G,J}; sepsets {D}, {G,I}, {G,S}, {G,J}; and assigned factors P(C)P(D|C), P(G|I,D), P(I)P(S|I), P(L|G)P(J|L,S), P(H|G,J)]
Verify:
- Tree and family preserving
- Running intersection property
Message Passing: Belief Propagation
Initialize the clique tree:
- For each clique $C_i$ set $\beta_i \leftarrow \prod_{\phi:\,\alpha(\phi)=i} \phi$
- For each edge $C_i$–$C_j$ set $\mu_{i,j} \leftarrow 1$
Repeat until calibrated:
- Select an edge $C_i$–$C_j$
- Send a message from $C_i$ to $C_j$:
  - Marginalize the clique over the sepset: $\sigma_{i \to j} \leftarrow \sum_{C_i \setminus S_{i,j}} \beta_i$
  - Update the belief at $C_j$: $\beta_j \leftarrow \beta_j \cdot \dfrac{\sigma_{i \to j}}{\mu_{i,j}}$
  - Update the sepset at $C_i$–$C_j$: $\mu_{i,j} \leftarrow \sigma_{i \to j}$
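The following is a minimal sketch of this belief-update procedure on a two-clique tree $C_1 = \{A,B\}$, $C_2 = \{B,C\}$ with binary variables; the potentials and helper names are illustrative assumptions, not the lecture's code:

```python
# Sketch of belief-update message passing on a two-clique tree:
# C1 = {A, B}, C2 = {B, C}, sepset S12 = {B}. All variables are binary;
# factors are numpy arrays indexed by variable values. Names (beta1, mu12)
# mirror the slide notation; the numbers are arbitrary illustrative values.
import numpy as np

beta1 = np.array([[0.5, 0.8],    # initial potential over (A, B)
                  [0.1, 0.3]])
beta2 = np.array([[0.4, 0.9],    # initial potential over (B, C)
                  [0.6, 0.2]])
mu12 = np.ones(2)                # sepset potential over B, initialized to 1

def send_message(beta_src, mu, beta_dst, axis_src, axis_dst):
    """Pass a message over the sepset: marginalize source, update target."""
    sigma = beta_src.sum(axis=axis_src)          # sum out non-sepset variable
    beta_dst = beta_dst * np.expand_dims(sigma / mu, axis=axis_dst)
    return sigma, beta_dst                       # sigma becomes the new mu

# Upward pass C1 -> C2: sum out A (axis 0 of beta1); B is axis 0 of beta2.
mu12, beta2 = send_message(beta1, mu12, beta2, axis_src=0, axis_dst=1)
# Downward pass C2 -> C1: sum out C (axis 1 of beta2); B is axis 1 of beta1.
mu12, beta1 = send_message(beta2, mu12, beta1, axis_src=1, axis_dst=0)

print(beta1.sum(axis=0))  # unnormalized marginal over B, from C1
print(beta2.sum(axis=1))  # unnormalized marginal over B, from C2 (equal)
```

After the two passes both cliques agree on the sepset marginal over B, i.e. the tree is calibrated.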
Clique Tree Invariant
Belief propagation can be viewed as reparameterizing the joint distribution. Upon calibration we showed:
$$P_F(\mathcal{U}) = \frac{\prod_{C_i \in T} \beta_i[C_i]}{\prod_{(C_i - C_j) \in T} \mu_{i,j}[S_{i,j}]}$$
Initially this invariant holds (up to the normalization constant Z), since $\prod_{C_i \in T} \beta_i[C_i] = \prod_{\phi \in F} \phi = \tilde{P}_F(\mathcal{U})$ and every $\mu_{i,j} = 1$.
At each update step the invariant is also maintained: a message from $C_i$ to $C_j$ only changes $\beta_j$ and $\mu_{i,j}$, so most terms remain unchanged. We need to show
$$\frac{\beta_j'}{\mu_{i,j}'} = \frac{\beta_j}{\mu_{i,j}}, \qquad \text{i.e.} \qquad \beta_j' = \beta_j \cdot \frac{\sigma_{i\to j}}{\mu_{i,j}} \text{ with } \mu_{i,j}' = \sigma_{i\to j}$$
But this is exactly the message passing step. Belief propagation thus reparameterizes $P_F$ at each step.
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
The Energy Functional
Suppose we want to approximate $P_F$ with Q. Represent $P_F$ by its factors:
$$P_F(\mathcal{U}) = \frac{1}{Z}\,\tilde{P}_F(\mathcal{U}), \qquad \tilde{P}_F(\mathcal{U}) = \prod_{\phi \in F} \phi$$
Define the energy functional:
$$F[\tilde{P}_F, Q] = \mathbf{E}_Q[\ln \tilde{P}_F(\mathcal{U})] + H_Q(\mathcal{U})$$
Then:
$$D(Q\,\|\,P_F) = \mathbf{E}_Q[\ln Q(\mathcal{U})] - \mathbf{E}_Q[\ln P_F(\mathcal{U})] = -H_Q(\mathcal{U}) - \mathbf{E}_Q[\ln \tilde{P}_F(\mathcal{U})] + \ln Z = \ln Z - F[\tilde{P}_F, Q]$$
- Minimizing $D(Q\|P_F)$ is equivalent to maximizing $F[\tilde{P}_F, Q]$
- $\ln Z \geq F[\tilde{P}_F, Q]$ (since $D(Q\|P_F) \geq 0$)
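A numeric sanity check of the identity $\ln Z = F[\tilde{P}_F, Q] + D(Q\|P_F)$ on a made-up two-variable model (all numbers illustrative):

```python
# Verify ln Z = F[~P_F, Q] + D(Q || P_F) on a tiny model: two binary
# variables with a single factor phi(A, B). Factor values and the choice
# of Q are arbitrary illustrative numbers.
import numpy as np

phi = np.array([[1.0, 2.0],
                [3.0, 4.0]])          # ~P_F(a, b) = phi(a, b)
Z = phi.sum()
P = phi / Z                           # normalized P_F

# A fully factored Q(A, B) = Q(A) Q(B), also arbitrary.
qa = np.array([0.3, 0.7])
qb = np.array([0.6, 0.4])
Q = np.outer(qa, qb)

energy = (Q * np.log(phi)).sum()      # E_Q[ln ~P_F]
entropy = -(Q * np.log(Q)).sum()      # H_Q
F = energy + entropy                  # energy functional
D = (Q * np.log(Q / P)).sum()         # D(Q || P_F)

assert np.isclose(np.log(Z), F + D)   # ln Z = F + D(Q || P_F)
```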
Inference as Optimization
We show that inference can be viewed as maximizing the energy functional $F[\tilde{P}_F, Q]$:
- Define a distribution Q over clique potentials
- Transform $F[\tilde{P}_F, Q]$ into an equivalent factored form $F'[\tilde{P}_F, Q]$
- Show that if Q maximizes $F'[\tilde{P}_F, Q]$ subject to constraints under which Q represents calibrated potentials, then there exist factors that satisfy the inference message passing equations
Defining Q
Recall that throughout BP:
$$P_F(\mathcal{U}) = \frac{\prod_{C_i \in T} \beta_i[C_i]}{\prod_{(C_i - C_j) \in T} \mu_{i,j}[S_{i,j}]}$$
Define $Q = \{\beta_i : C_i \in T\} \cup \{\mu_{i,j} : (C_i - C_j) \in T\}$ as a reparameterization of $P_F$, such that
$$Q(\mathcal{U}) = \frac{\prod_{C_i \in T} \beta_i[C_i]}{\prod_{(C_i - C_j) \in T} \mu_{i,j}[S_{i,j}]}$$
Since such a Q has $D(Q\|P_F) = 0$, calibrating Q is equivalent to maximizing $F[\tilde{P}_F, Q]$.
Factored Energy Functional
Define the factored energy functional as
$$F'[\tilde{P}_F, Q] = \sum_{C_i \in T} \mathbf{E}_{\beta_i}[\ln \psi_i^0] + \sum_{C_i \in T} H_{\beta_i}(C_i) - \sum_{(C_i - C_j) \in T} H_{\mu_{i,j}}(S_{i,j})$$
where $\psi_i^0 = \prod_{\phi:\,\alpha(\phi)=i} \phi$ is the initial potential of clique $C_i$.
Theorem: if Q is a set of calibrated potentials for T, then $F[\tilde{P}_F, Q] = F'[\tilde{P}_F, Q]$.
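As a numeric check of the theorem (a hypothetical toy case, not from the lecture): for a calibrated two-clique tree with Q equal to the true marginals, $F'[\tilde{P}_F, Q]$ should equal $\ln Z = F[\tilde{P}_F, P_F]$:

```python
# Check that the factored energy functional F' equals ln Z for calibrated
# beliefs. Two cliques {A,B}, {B,C} with sepset {B}; binary variables;
# the initial potentials are arbitrary illustrative numbers.
import numpy as np

psi1 = np.array([[0.5, 0.8], [0.1, 0.3]])   # psi^0 of C1 = {A, B}
psi2 = np.array([[0.4, 0.9], [0.6, 0.2]])   # psi^0 of C2 = {B, C}

joint = psi1[:, :, None] * psi2[None, :, :]  # ~P_F(A, B, C)
Z = joint.sum()
P = joint / Z

beta1 = P.sum(axis=2)                        # calibrated belief over (A, B)
beta2 = P.sum(axis=0)                        # calibrated belief over (B, C)
mu12 = P.sum(axis=(0, 2))                    # sepset marginal over B

def H(p):                                    # entropy of a normalized factor
    return -(p * np.log(p)).sum()

Fprime = ((beta1 * np.log(psi1)).sum()       # E_beta1[ln psi1]
          + (beta2 * np.log(psi2)).sum()     # E_beta2[ln psi2]
          + H(beta1) + H(beta2) - H(mu12))
assert np.isclose(Fprime, np.log(Z))
```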
Inference as Optimization
Optimization task: find Q that maximizes $F'[\tilde{P}_F, Q]$ subject to
$$\mu_{i,j}[S_{i,j}] = \sum_{C_i \setminus S_{i,j}} \beta_i \;\; \text{for each } (C_i - C_j) \in T, \qquad \sum_{C_i} \beta_i = 1 \;\; \text{for each } C_i \in T$$
Theorem: the fixed points of this optimization satisfy
$$\delta_{i\to j} = \sum_{C_i \setminus S_{i,j}} \psi_i^0 \prod_{k \in N_i \setminus \{j\}} \delta_{k\to i}, \qquad \beta_i \propto \psi_i^0 \prod_{j \in N_i} \delta_{j\to i}, \qquad \mu_{i,j} = \delta_{i\to j} \cdot \delta_{j\to i}$$
This suggests an iterative optimization procedure, identical to belief propagation!
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Generalized Belief Propagation
Strategy: perform belief propagation in a cluster graph with loops.
[Figure: a Bayesian network over A, B, C, D; its cluster tree with cliques {A,B,D}, {B,C,D} and sepset {B,D}; and a loopy cluster graph with clusters {A,B}, {B,C}, {C,D}, {A,D} and sepsets B, C, D, A]
Unlike BP on trees:
- Inference may be incorrect: double counting of evidence
- Convergence is not guaranteed
- Potentials in the calibrated graph are not guaranteed to be marginals of P
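A sketch of this loopy scheme on the four-cluster graph above, with made-up binary potentials (cluster and helper names are illustrative); as stated above, convergence is not guaranteed in general:

```python
# Sketch of generalized (loopy) BP on the cluster graph above:
# clusters {A,B}, {B,C}, {C,D}, {A,D}, singleton sepsets B, C, D, A.
# Binary variables; potentials are arbitrary illustrative numbers.
import numpy as np

rng = np.random.default_rng(0)
clusters = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
psi = {c: rng.uniform(0.5, 1.5, size=(2, 2)) for c in clusters}
adj = {c: [d for d in clusters if d != c and set(c) & set(d)] for c in clusters}
msgs = {(s, t): np.ones(2) for s in clusters for t in adj[s]}

def absorb(c, exclude=None):
    """Cluster potential times all incoming messages (optionally minus one)."""
    b = psi[c].copy()
    for n in adj[c]:
        if n == exclude:
            continue
        v = (set(c) & set(n)).pop()              # sepset variable
        m = msgs[(n, c)]
        b *= m[:, None] if c.index(v) == 0 else m[None, :]
    return b

for _ in range(100):                             # loopy iteration: may not converge
    for s in clusters:
        for t in adj[s]:
            v = (set(s) & set(t)).pop()
            sigma = absorb(s, exclude=t).sum(axis=1 - s.index(v))
            msgs[(s, t)] = sigma / sigma.sum()   # normalize for stability

beta = {c: absorb(c) for c in clusters}          # (pseudo-)calibrated beliefs
print(beta[("A", "B")] / beta[("A", "B")].sum())
```

A fixed sequential update schedule is used here; smarter scheduling (see "GBP in Practice" below) often helps convergence.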
Generalized Cluster Graph
Recall: in a cluster graph K for factors F, each edge $C_i$–$C_j$ is associated with the full sepset $S_{i,j} = C_i \cap C_j$.
A generalized cluster graph K for factors F is an undirected graph:
- Nodes are associated with a subset of variables $C_i \subseteq \mathcal{U}$
- The graph is family preserving: each factor $\phi \in F$ is associated with one node $C_i$ such that $Scope[\phi] \subseteq C_i$
- Each edge $C_i$–$C_j$ is associated with a subset $S_{i,j} \subseteq C_i \cap C_j$
Generalized Cluster Graph
A generalized cluster graph obeys the running intersection property if for each $X \in C_i$ and $X \in C_j$, there is exactly one path between $C_i$ and $C_j$ for which $X \in S$ for every sepset S along the path.
Equivalently: all edges associated with X form a tree that spans all the clusters that contain X.
Note: some of these clusters may be connected by more than one path.
[Figure: the loopy cluster graph with clusters {A,B}, {B,C}, {C,D}, {A,D} and sepsets B, C, D, A]
Calibrated Cluster Graph
A generalized cluster graph is calibrated if for each edge $C_i$–$C_j$ we have:
$$\sum_{C_i \setminus S_{i,j}} \beta_i[C_i] = \sum_{C_j \setminus S_{i,j}} \beta_j[C_j]$$
This is weaker than in clique trees, since $S_{i,j}$ may be a strict subset of the intersection between $C_i$ and $C_j$.
If the cluster graph satisfies the running intersection property, then the marginal on any variable X is the same in every cluster that contains X.
GBP is Efficient
[Figure: 3×3 Markov grid network over X11,...,X33 and its cluster graph with one cluster {Xi,Xj} per grid edge (e.g. {X11,X12}, {X11,X21}, ...), connected through singleton sepsets]
- Note: a clique tree in an n×n grid has cliques exponential in n
- A round of GBP is O(n)
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Constructing Cluster Graphs
When constructing clique trees, all constructions give the same result but differ in computational complexity.
In GBP, different cluster graphs can vary in both computational complexity and approximation quality.
Transforming Pairwise MNs
A pairwise Markov network over a graph H has:
- A set of node potentials $\{\phi[X_i] : i = 1,\ldots,n\}$
- A set of edge potentials $\{\phi[X_i,X_j] : (X_i\text{–}X_j) \in H\}$
Example:
[Figure: the 3×3 Markov grid over X11,...,X33 transformed into a cluster graph with one cluster per edge potential: {X11,X12}, {X12,X13}, {X11,X21}, {X12,X22}, {X13,X23}, {X21,X22}, {X22,X23}, {X21,X31}, {X22,X32}, {X23,X33}, {X31,X32}, {X32,X33}]
Transforming Bayesian Networks
Example:
[Figure: a Bayesian network over A, B, C, D, F transformed into a two-layer cluster graph: large clusters {A,B,C}, {A,B,D}, {B,D,F} on top, singleton clusters A, B, C, D, F below]
- A "large" cluster per CPD
- A singleton cluster for each variable
- Connect a singleton cluster to a large cluster if the variable appears in the CPD
- The resulting graph obeys the running intersection property (see the sketch below)
This two-layer construction is the Bethe approximation.
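A sketch of this Bethe construction from a list of factor scopes, using the slide's A, B, C, D, F example (the representation choices are mine, not the lecture's):

```python
# Sketch of the Bethe construction described above: one large cluster per
# factor scope, one singleton cluster per variable, and an edge whenever
# the variable appears in the factor. Scopes follow the slide's example.
factor_scopes = [("A", "B", "C"), ("A", "B", "D"), ("B", "D", "F")]
variables = sorted({v for s in factor_scopes for v in s})

large = [frozenset(s) for s in factor_scopes]        # top-layer clusters
single = {v: frozenset([v]) for v in variables}      # bottom-layer clusters
edges = [(c, single[v]) for c in large for v in c]   # sepset of each edge = {v}

for c, s in edges:
    print(set(c), "--", set(s))
# Running intersection holds: the edges touching any variable X form a
# star centered at X's singleton cluster, which is a tree.
```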
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Generalized Belief Propagation
GBP maintains distribution invariance (since message passing maintains invariance):
$$P_F(\mathcal{U}) = \frac{\prod_{C_i \in K} \beta_i[C_i]}{\prod_{(C_i - C_j) \in K} \mu_{i,j}[S_{i,j}]}$$
Generalized Belief Propagation
If GBP converges (K is calibrated), each subtree T of K is calibrated, with edge potentials corresponding to marginals of $P_T(\mathcal{U})$ (since $P_T(\mathcal{U})$ is a calibrated tree), where
$$P_T(\mathcal{U}) = \frac{\prod_{C_i \in T} \beta_i[C_i]}{\prod_{(C_i - C_j) \in T} \mu_{i,j}[S_{i,j}]}$$
Generalized Belief Propagation
Calibrated graph potentials are not $P_F(\mathcal{U})$ marginals.
[Figure: the loopy cluster graph 1: {A,B}, 2: {B,C}, 3: {C,D}, 4: {A,D} with sepsets B, C, D, A, and the subtree T over clusters 1, 2, 3]
$$P_F(A,B,C,D) = \frac{\beta_1[A,B]\,\beta_2[B,C]\,\beta_3[C,D]\,\beta_4[A,D]}{\mu_{1,2}[B]\,\mu_{2,3}[C]\,\mu_{3,4}[D]\,\mu_{4,1}[A]}$$
$$P_T(A,B,C,D) = \frac{\beta_1[A,B]\,\beta_2[B,C]\,\beta_3[C,D]}{\mu_{1,2}[B]\,\mu_{2,3}[C]}$$
Since $P_T = P_F \cdot \dfrac{\mu_{3,4}[D]\,\mu_{4,1}[A]}{\beta_4[A,D]}$, in general $P_T \neq P_F$, and so $\beta_1[A,B] = P_T(A,B) \neq P_F(A,B)$.
Inference as Optimization
Recall the clique tree case. Optimization task: find Q that maximizes $F'[\tilde{P}_F, Q]$ subject to
$$\mu_{i,j}[S_{i,j}] = \sum_{C_i \setminus S_{i,j}} \beta_i \;\; \text{for each } (C_i - C_j) \in T, \qquad \sum_{C_i} \beta_i = 1 \;\; \text{for each } C_i \in T$$
Theorem: the fixed points satisfy
$$\delta_{i\to j} = \sum_{C_i \setminus S_{i,j}} \psi_i^0 \prod_{k \in N_i \setminus \{j\}} \delta_{k\to i}, \qquad \beta_i \propto \psi_i^0 \prod_{j \in N_i} \delta_{j\to i}, \qquad \mu_{i,j} = \delta_{i\to j} \cdot \delta_{j\to i}$$
The iterative optimization procedure is identical to belief propagation.
GBP as Optimization
Optimization task: find Q that maximizes $F'[\tilde{P}_F, Q]$ subject to
$$\mu_{i,j}[S_{i,j}] = \sum_{C_i \setminus S_{i,j}} \beta_i \;\; \text{for each } (C_i - C_j) \in K, \qquad \sum_{C_i} \beta_i = 1 \;\; \text{for each } C_i \in K$$
Theorem: the fixed points of this optimization satisfy
$$\delta_{i\to j} = \sum_{C_i \setminus S_{i,j}} \psi_i^0 \prod_{k \in N_i \setminus \{j\}} \delta_{k\to i}, \qquad \beta_i \propto \psi_i^0 \prod_{j \in N_i} \delta_{j\to i}, \qquad \mu_{i,j} = \delta_{i\to j} \cdot \delta_{j\to i}$$
Note: $S_{i,j}$ is only a subset of the intersection between $C_i$ and $C_j$.
The iterative optimization procedure is GBP.
GBP as Optimization
Clique trees:
- $F[\tilde{P}_F, Q] = F'[\tilde{P}_F, Q]$
- The iterative procedure (BP) is guaranteed to converge
- The convergence point represents the marginal distributions of $P_F$
Cluster graphs:
- $F[\tilde{P}_F, Q] = F'[\tilde{P}_F, Q]$ does not hold!
- The iterative procedure (GBP) is not guaranteed to converge
- The convergence point does not represent the marginal distributions of $P_F$
GBP in Practice
Dealing with non-convergence:
- Often small portions of the network do not converge: stop inference and use the current beliefs
- Use intelligent message passing scheduling
- Tree reparameterization (TRP): select entire trees and calibrate them while keeping all other beliefs fixed; focus attention on uncalibrated regions of the graph
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Propagation w. Approximate Msgs
General idea:
- Perform BP (or GBP) as before, but propagate messages that are only approximate
Modular approach:
- The general inference scheme remains the same
- Can plug in many different approximate message computations
Factorized Messages
[Figure: a 3×3 Markov grid network over X11,...,X33 and a corresponding clique tree with three cliques 1, 2, 3, whose sepsets each contain a full grid column (three variables)]
- Keep the internal structure of the clique tree cliques
- Calibration involves sending messages that are joint over three variables
- Idea: simplify the messages using a factored representation
Example:
$$\delta_{1\to 2}[X_{11},X_{21},X_{31}] \;\approx\; \tilde\delta_{1\to 2}[X_{11}] \cdot \tilde\delta_{1\to 2}[X_{21}] \cdot \tilde\delta_{1\to 2}[X_{31}]$$
Computational Savings
Answering queries in cluster 2:
- Exact inference: exponential in the joint space of cluster 2
- Approximate inference with factored messages: the subnetwork of cluster 2 together with the factored incoming messages is a tree, so we can perform efficient exact inference on this subtree to answer queries
$$\tilde\beta_2 = \psi_2^0 \cdot \tilde\delta_{1\to 2}[X_{11},X_{21},X_{31}] \cdot \tilde\delta_{3\to 2}[X_{12},X_{22},X_{32}]$$
[Figure: cluster 2 of the grid clique tree with the factored messages from clusters 1 and 3 attached as single-variable potentials]
Factor Sets
A factor set $\hat\phi = \{\phi_1,\ldots,\phi_k\}$ provides a compact representation for the high-dimensional factor $\phi_1 \cdot \ldots \cdot \phi_k$.
Belief propagation with factor sets:
- Multiplication of factor sets is easy: simply take the union of the factors in the two sets being multiplied
- Marginalization of a factor set requires inference in the simplified network
Example: compute $\tilde\delta_{2\to 3} = \sum_{C_2 \setminus S_{2,3}} \left(\psi_2^0 \cdot \tilde\delta_{1\to 2}\right)$ by running exact inference in the tree-structured network of cluster 2
[Figure: cluster 2 with the factored incoming message $\tilde\delta_{1\to 2}$ attached]
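A small sketch of the trade-off: factor-set "multiplication" is just list union, while a factored approximation of a joint message can be built from its single-variable marginals (all numbers are made up):

```python
# Contrast an exact joint message over (X1, X2, X3) with its factored
# approximation ~d[X1] ~d[X2] ~d[X3] built from single-variable marginals.
# Binary variables; the message entries are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(1)
delta = rng.uniform(0.1, 1.0, size=(2, 2, 2))    # exact message over 3 vars
delta /= delta.sum()

# Factor-set "multiplication": keep the factors as a list, no computation.
factor_set = [delta]                             # union in further factors here

# Factored approximation: product of the three single-variable marginals.
m1, m2, m3 = delta.sum((1, 2)), delta.sum((0, 2)), delta.sum((0, 1))
approx = m1[:, None, None] * m2[None, :, None] * m3[None, None, :]

print("max abs error:", np.abs(delta - approx).max())
```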
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Approximate Message Propagation
Input:
- A clique tree (or cluster graph)
- An assignment of the original factors to clusters/cliques
- The factorized form of each cluster/clique; this can be represented by a network for each edge $C_i$–$C_j$ that specifies the factorization (in the previous examples we assumed an empty network)
Two strategies for approximate message propagation:
- Sum-product message passing
- Belief update messages
Sum-Product Propagation
Same propagation scheme as in exact inference:
- Select a root
- Propagate messages towards the root: each cluster collects messages from its neighbors and sends its outgoing message once possible
- Propagate messages from the root back down
Each message passing step performs inference within a cluster.
The algorithm terminates in a fixed number of iterations.
Note: the final marginals at each variable are not exact.
Message Passing: Belief Propagation
Same as BP, but with approximate messages.
Initialize the clique tree:
- For each clique $C_i$ set $\tilde\beta_i \leftarrow \prod_{\phi:\,\alpha(\phi)=i} \phi$
- For each edge $C_i$–$C_j$ set $\tilde\mu_{i,j} \leftarrow 1$
Repeat until calibrated:
- Select an edge $C_i$–$C_j$
- Send a message from $C_i$ to $C_j$:
  - Marginalize the clique over the sepset (approximately): $\tilde\sigma_{i\to j} \leftarrow \sum_{C_i \setminus S_{i,j}} \tilde\beta_i$
  - Update the belief at $C_j$: $\tilde\beta_j \leftarrow \tilde\beta_j \cdot \dfrac{\tilde\sigma_{i\to j}}{\tilde\mu_{i,j}}$
  - Update the sepset at $C_i$–$C_j$: $\tilde\mu_{i,j} \leftarrow \tilde\sigma_{i\to j}$
Approximation
The two message passing schemes (sum-product and belief update) differ in the approximate inference they perform.
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Structured Variational Approx.
- Select a simple family of distributions Q
- Find $Q \in \mathbf{Q}$ that maximizes $F[\tilde{P}_F, Q]$
Mean Field Approximation: $Q(\mathcal{X}) = \prod_i Q(X_i)$
- Q loses much of the information of $P_F$
- The approximation is computationally attractive: every query in Q is simple to compute, and Q is easy to represent
[Figure: $P_F$ — a 3×3 Markov grid network over X11,...,X33; Q — the corresponding mean field network with no edges]
Mean Field Approximation
The energy functional is easy to compute, even for networks where inference is complex:
$$F[\tilde{P}_F, Q] = \mathbf{E}_Q[\ln \tilde{P}_F(\mathcal{U})] + H_Q(\mathcal{U})$$
Both terms decompose:
$$\mathbf{E}_Q[\ln \tilde{P}_F(\mathcal{U})] = \sum_{\phi \in F} \mathbf{E}_Q[\ln \phi] = \sum_{\phi \in F} \sum_{\mathbf{u} \in Val(Scope[\phi])} \ln \phi(\mathbf{u}) \prod_{X_i \in Scope[\phi]} Q(x_i)$$
where $x_i$ is the assignment to $X_i$ in $\mathbf{u}$, and
$$H_Q(\mathcal{U}) = \sum_i H_Q(X_i)$$
Each expectation is a low-dimensional sum over the scope of a single factor.
Mean Field Maximization
Maximizing the energy functional for mean field:
- Find $Q(\mathcal{X}) = \prod_i Q(X_i)$ that maximizes $F[\tilde{P}_F, Q]$
- Subject to, for all i: $\sum_{x_i} Q(x_i) = 1$
Mean Field Maximization
Theorem: $Q(x_i)$ is a stationary point of the mean field optimization given $Q(X_1),\ldots,Q(X_{i-1}),Q(X_{i+1}),\ldots,Q(X_n)$ if and only if
$$Q(x_i) = \frac{1}{Z_i} \exp\left\{\mathbf{E}_Q[\ln \tilde{P}_F \mid x_i]\right\}$$
Proof: to optimize $Q(X_i)$, define the Lagrangian
$$L_i = \mathbf{E}_Q[\ln \tilde{P}_F] + H_Q(\mathcal{U}) + \lambda_i\left(\sum_{x_i} Q(x_i) - 1\right)$$
where $\lambda_i$ corresponds to the constraint that $Q(X_i)$ is a distribution. We now compute the derivative of $L_i$ with respect to $Q(x_i)$.
Mean Field Maximization
Derivative of the expectation term: write $\mathcal{V} = \mathcal{U} - \{X_i\}$, so that $Q(\mathbf{u}) = Q(x_i)\,Q(\mathbf{v})$. Then
$$\mathbf{E}_Q[\ln \tilde{P}_F] = \sum_{\mathbf{u}} Q(\mathbf{u}) \ln \tilde{P}_F(\mathbf{u}) = \sum_{x_i} Q(x_i) \sum_{\mathbf{v}} Q(\mathbf{v}) \ln \tilde{P}_F(x_i, \mathbf{v}) = \sum_{x_i} Q(x_i)\,\mathbf{E}_Q[\ln \tilde{P}_F \mid x_i]$$
The conditional expectations do not depend on $Q(x_i)$, so
$$\frac{\partial}{\partial Q(x_i)}\,\mathbf{E}_Q[\ln \tilde{P}_F] = \mathbf{E}_Q[\ln \tilde{P}_F \mid x_i]$$
Mean Field Maximization
Derivative of the entropy term: $H_Q(\mathcal{U}) = \sum_j H_Q(X_j) = -\sum_j \sum_{x_j} Q(x_j) \ln Q(x_j)$, and only the $j = i$ term depends on $Q(x_i)$:
$$\frac{\partial}{\partial Q(x_i)} H_Q(\mathcal{U}) = -\ln Q(x_i) - 1$$
Combining the two derivatives with the Lagrange term:
$$\frac{\partial L_i}{\partial Q(x_i)} = \mathbf{E}_Q[\ln \tilde{P}_F \mid x_i] - \ln Q(x_i) - 1 + \lambda_i$$
Mean Field Maximization
Setting the derivative to zero and rearranging terms, we get:
$$\ln Q(x_i) = \mathbf{E}_Q[\ln \tilde{P}_F \mid x_i] - 1 + \lambda_i$$
Taking exponents of both sides, we get:
$$Q(x_i) = \frac{1}{Z_i} \exp\left\{\mathbf{E}_Q[\ln \tilde{P}_F \mid x_i]\right\}$$
where $Z_i = e^{1 - \lambda_i}$ is a normalizing constant.
Mean Field Maximization: Intuition
Since $\tilde{P}_F(x_i, \mathbf{v}) = Z\,P_F(\mathbf{v})\,P_F(x_i \mid \mathbf{v})$:
$$\mathbf{E}_Q[\ln \tilde{P}_F \mid x_i] = \sum_{\mathbf{v}} Q(\mathbf{v}) \ln \tilde{P}_F(x_i, \mathbf{v}) = \mathbf{E}_Q[\ln P_F(x_i \mid \mathcal{V})] + \mathbf{E}_Q[\ln Z P_F(\mathcal{V})]$$
We can thus rewrite $Q(x_i)$ as:
$$Q(x_i) = \frac{1}{Z_i} \exp\left\{\mathbf{E}_Q[\ln P_F(x_i \mid \mathcal{V})]\right\} \cdot \exp\left\{\mathbf{E}_Q[\ln Z P_F(\mathcal{V})]\right\}$$
Mean Field Maximization: Intuition
The second factor does not depend on $x_i$, so it can be folded into the normalizing constant:
$$Q(x_i) = \frac{1}{Z_i'} \exp\left\{\mathbf{E}_Q[\ln P_F(x_i \mid \mathcal{V})]\right\}$$
- $Q(x_i)$ is the geometric average of $P_F(x_i \mid \mathbf{v})$, relative to the probability distribution Q; in this sense the marginal is "consistent" with the other marginals
- In $P_F$ we can also represent the marginal, as an arithmetic average with respect to $P_F$:
$$P_F(x_i) = \sum_{\mathbf{v}} P_F(\mathbf{v})\,P_F(x_i \mid \mathbf{v}) = \mathbf{E}_{P_F}[P_F(x_i \mid \mathcal{V})]$$
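A tiny numeric contrast between the two averages (made-up conditionals and weights):

```python
# Contrast the geometric average (mean field form) with the arithmetic
# average (true marginal form) for a binary X_i and two contexts v, using
# arbitrary illustrative numbers. The two generally differ.
import numpy as np

p_cond = np.array([[0.9, 0.1],     # P(X_i | v1)
                   [0.2, 0.8]])    # P(X_i | v2)
w = np.array([0.5, 0.5])           # weights: Q(v) or P_F(v)

arith = w @ p_cond                              # sum_v w(v) P(x_i | v)
geo = np.exp(w @ np.log(p_cond))                # exp{ sum_v w(v) ln P(x_i | v) }
geo /= geo.sum()                                # renormalize

print(arith, geo)                               # [0.55 0.45] vs approx [0.60 0.40]
```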
Mean Field: Algorithm
Simplify
$$Q(x_i) = \frac{1}{Z_i} \exp\left\{\mathbf{E}_Q[\ln \tilde{P}_F \mid x_i]\right\}$$
to
$$Q(x_i) = \frac{1}{Z_i} \exp\left\{\sum_{\phi:\,X_i \in Scope[\phi]} \mathbf{E}_Q[\ln \phi \mid x_i]\right\}$$
since terms that do not involve $x_i$ can be absorbed into the normalizing constant.
- Note: $Q(x_i)$ does not appear on the right-hand side, so we can solve for the optimal $Q(x_i)$ in one step
- Note: the step is only optimal given all the other marginals $Q(X_j)$, $j \neq i$
- This suggests an iterative algorithm (see the sketch below); convergence to a local maximum is guaranteed, since each step improves the energy functional
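A sketch of the resulting coordinate-ascent algorithm on a small pairwise Markov network, a four-cycle over binary variables with made-up potentials (variable and loop names are illustrative):

```python
# Mean-field coordinate ascent on a pairwise Markov network: the cycle
# A - B - C - D - A with binary variables. The inner update implements
#   Q(x_i) proportional to exp{ sum_{phi: X_i in Scope[phi]} E_Q[ln phi | x_i] }.
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]         # the cycle A-B-C-D-A
phi = {e: rng.uniform(0.5, 2.0, size=(2, 2)) for e in edges}

Q = [np.full(2, 0.5) for _ in range(4)]          # init: uniform marginals

for _ in range(100):                             # coordinate ascent sweeps
    for i in range(4):
        log_q = np.zeros(2)
        for (a, b), f in phi.items():
            if i == a:                           # E_Q[ln phi(x_i, X_b)]
                log_q += np.log(f) @ Q[b]
            elif i == b:                         # E_Q[ln phi(X_a, x_i)]
                log_q += np.log(f).T @ Q[a]
        q = np.exp(log_q - log_q.max())          # stabilize before normalizing
        Q[i] = q / q.sum()                       # one-step optimal update

print([np.round(q, 3) for q in Q])               # approximate marginals
```

Each inner update is the one-step-optimal solution given the other marginals, so every sweep cannot decrease the energy functional.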
Markov Network Approximations
We can use approximating families Q that are increasingly complex: as long as Q is "easy" (inference in Q remains feasible), efficient update equations can be derived.
[Figure: $P_F$ — the 3×3 Markov grid network over X11,...,X33; Q — the mean field network over the same variables]