Global Approximate Inference
Eran Segal, Weizmann Institute
Jan 07, 2016
General Approximate Inference Strategy
- Define a class of simpler distributions Q
- Search for a particular instance Q in Q that is "close" to P
- Answer queries using inference in Q
Cluster Graph
A cluster graph K for factors F is an undirected graph:
- Nodes are associated with a subset of variables $C_i \subseteq \mathcal{U}$
- The graph is family preserving: each factor $\phi \in F$ is associated with one node $C_i$ such that $Scope[\phi] \subseteq C_i$
- Each edge $C_i$–$C_j$ is associated with a sepset $S_{i,j} = C_i \cap C_j$
A cluster tree over factors F that satisfies the running intersection property is called a clique tree.
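For concreteness, here is a minimal sketch (not from the lecture; the clusters, edges, and factor scopes are made-up examples) that checks the family-preservation and sepset conditions above:

```python
# Minimal sketch: checking the cluster-graph conditions from the definition
# above. Clusters and factor scopes are represented as Python sets; the
# example graph and factors are illustrative, not from the lecture.
clusters = {1: {"C", "D"}, 2: {"G", "I", "D"}, 3: {"G", "S", "I"}}
edges = {(1, 2): {"D"}, (2, 3): {"G", "I"}}       # edge -> sepset S_ij
factor_scopes = [{"C", "D"}, {"G", "I", "D"}, {"S", "I"}]

# Family preserving: every factor fits inside at least one cluster.
assert all(any(s <= c for c in clusters.values()) for s in factor_scopes)

# Clique-tree sepsets: S_ij must equal the intersection of C_i and C_j.
for (i, j), sepset in edges.items():
    assert sepset == clusters[i] & clusters[j]
```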
Clique Tree Inference
[Figure: clique tree for the student network over C, D, I, G, S, L, J, H, with cliques 1: {C,D}, 2: {G,I,D}, 3: {G,S,I}, 4: {G,J,S,L}, 5: {H,G,J}; sepsets {D}, {G,I}, {G,S}, {G,J}; and assigned factors P(C)P(D|C), P(G|I,D), P(I)P(S|I), P(L|G)P(J|L,S), P(H|G,J)]
Verify:
- Tree and family preserving
- Running intersection property
Message Passing: Belief Propagation
Initialize the clique tree:
- For each clique $C_i$ set $\beta_i \leftarrow \prod_{\phi:\,\alpha(\phi)=i} \phi$
- For each edge $C_i$–$C_j$ set $\mu_{i,j} \leftarrow 1$
Repeat until calibrated:
- Select an edge $C_i$–$C_j$
- Send a message from $C_i$ to $C_j$:
  - Marginalize the clique over the sepset: $\sigma_{i \to j} \leftarrow \sum_{C_i \setminus S_{i,j}} \beta_i$
  - Update the belief at $C_j$: $\beta_j \leftarrow \beta_j \cdot \dfrac{\sigma_{i \to j}}{\mu_{i,j}}$
  - Update the sepset at $C_i$–$C_j$: $\mu_{i,j} \leftarrow \sigma_{i \to j}$
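The following is a minimal sketch of this belief-update procedure on a two-clique tree $C_1 = \{A,B\}$, $C_2 = \{B,C\}$ with binary variables; the potentials and helper names are illustrative assumptions, not the lecture's code:

```python
# Sketch of belief-update message passing on a two-clique tree:
# C1 = {A, B}, C2 = {B, C}, sepset S12 = {B}. All variables are binary;
# factors are numpy arrays indexed by variable values. Names (beta1, mu12)
# mirror the slide notation; the numbers are arbitrary illustrative values.
import numpy as np

beta1 = np.array([[0.5, 0.8],    # initial potential over (A, B)
                  [0.1, 0.3]])
beta2 = np.array([[0.4, 0.9],    # initial potential over (B, C)
                  [0.6, 0.2]])
mu12 = np.ones(2)                # sepset potential over B, initialized to 1

def send_message(beta_src, mu, beta_dst, axis_src, axis_dst):
    """Pass a message over the sepset: marginalize source, update target."""
    sigma = beta_src.sum(axis=axis_src)          # sum out non-sepset variable
    beta_dst = beta_dst * np.expand_dims(sigma / mu, axis=axis_dst)
    return sigma, beta_dst                       # sigma becomes the new mu

# Upward pass C1 -> C2: sum out A (axis 0 of beta1); B is axis 0 of beta2.
mu12, beta2 = send_message(beta1, mu12, beta2, axis_src=0, axis_dst=1)
# Downward pass C2 -> C1: sum out C (axis 1 of beta2); B is axis 1 of beta1.
mu12, beta1 = send_message(beta2, mu12, beta1, axis_src=1, axis_dst=0)

print(beta1.sum(axis=0))  # unnormalized marginal over B, from C1
print(beta2.sum(axis=1))  # unnormalized marginal over B, from C2 (equal)
```

After the two passes both cliques agree on the sepset marginal over B, i.e. the tree is calibrated.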
Clique Tree Invariant
Belief propagation can be viewed as reparameterizing the joint distribution. Upon calibration we showed:
$$P_F(\mathcal{U}) = \frac{\prod_{C_i \in T} \beta_i[C_i]}{\prod_{(C_i - C_j) \in T} \mu_{i,j}[S_{i,j}]}$$
Initially this invariant holds (up to the normalization constant Z), since $\prod_{C_i \in T} \beta_i[C_i] = \prod_{\phi \in F} \phi = \tilde{P}_F(\mathcal{U})$ and every $\mu_{i,j} = 1$.
At each update step the invariant is also maintained: a message from $C_i$ to $C_j$ only changes $\beta_j$ and $\mu_{i,j}$, so most terms remain unchanged. We need to show
$$\frac{\beta_j'}{\mu_{i,j}'} = \frac{\beta_j}{\mu_{i,j}}, \qquad \text{i.e.} \qquad \beta_j' = \beta_j \cdot \frac{\sigma_{i\to j}}{\mu_{i,j}} \text{ with } \mu_{i,j}' = \sigma_{i\to j}$$
But this is exactly the message passing step. Belief propagation thus reparameterizes $P_F$ at each step.
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
The Energy Functional
Suppose we want to approximate $P_F$ with Q. Represent $P_F$ by its factors:
$$P_F(\mathcal{U}) = \frac{1}{Z}\,\tilde{P}_F(\mathcal{U}), \qquad \tilde{P}_F(\mathcal{U}) = \prod_{\phi \in F} \phi$$
Define the energy functional:
$$F[\tilde{P}_F, Q] = \mathbf{E}_Q[\ln \tilde{P}_F(\mathcal{U})] + H_Q(\mathcal{U})$$
Then:
$$D(Q\,\|\,P_F) = \mathbf{E}_Q[\ln Q(\mathcal{U})] - \mathbf{E}_Q[\ln P_F(\mathcal{U})] = -H_Q(\mathcal{U}) - \mathbf{E}_Q[\ln \tilde{P}_F(\mathcal{U})] + \ln Z = \ln Z - F[\tilde{P}_F, Q]$$
- Minimizing $D(Q\|P_F)$ is equivalent to maximizing $F[\tilde{P}_F, Q]$
- $\ln Z \geq F[\tilde{P}_F, Q]$ (since $D(Q\|P_F) \geq 0$)
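A numeric sanity check of the identity $\ln Z = F[\tilde{P}_F, Q] + D(Q\|P_F)$ on a made-up two-variable model (all numbers illustrative):

```python
# Verify ln Z = F[~P_F, Q] + D(Q || P_F) on a tiny model: two binary
# variables with a single factor phi(A, B). Factor values and the choice
# of Q are arbitrary illustrative numbers.
import numpy as np

phi = np.array([[1.0, 2.0],
                [3.0, 4.0]])          # ~P_F(a, b) = phi(a, b)
Z = phi.sum()
P = phi / Z                           # normalized P_F

# A fully factored Q(A, B) = Q(A) Q(B), also arbitrary.
qa = np.array([0.3, 0.7])
qb = np.array([0.6, 0.4])
Q = np.outer(qa, qb)

energy = (Q * np.log(phi)).sum()      # E_Q[ln ~P_F]
entropy = -(Q * np.log(Q)).sum()      # H_Q
F = energy + entropy                  # energy functional
D = (Q * np.log(Q / P)).sum()         # D(Q || P_F)

assert np.isclose(np.log(Z), F + D)   # ln Z = F + D(Q || P_F)
```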
Inference as Optimization
We show that inference can be viewed as maximizing the energy functional $F[\tilde{P}_F, Q]$:
- Define a distribution Q over clique potentials
- Transform $F[\tilde{P}_F, Q]$ into an equivalent factored form $F'[\tilde{P}_F, Q]$
- Show that if Q maximizes $F'[\tilde{P}_F, Q]$ subject to constraints under which Q represents calibrated potentials, then there exist factors that satisfy the inference message passing equations
Defining Q
Recall that throughout BP:
$$P_F(\mathcal{U}) = \frac{\prod_{C_i \in T} \beta_i[C_i]}{\prod_{(C_i - C_j) \in T} \mu_{i,j}[S_{i,j}]}$$
Define $Q = \{\beta_i : C_i \in T\} \cup \{\mu_{i,j} : (C_i - C_j) \in T\}$ as a reparameterization of $P_F$, such that
$$Q(\mathcal{U}) = \frac{\prod_{C_i \in T} \beta_i[C_i]}{\prod_{(C_i - C_j) \in T} \mu_{i,j}[S_{i,j}]}$$
Since such a Q has $D(Q\|P_F) = 0$, calibrating Q is equivalent to maximizing $F[\tilde{P}_F, Q]$.
Factored Energy Functional
Define the factored energy functional as
$$F'[\tilde{P}_F, Q] = \sum_{C_i \in T} \mathbf{E}_{\beta_i}[\ln \psi_i^0] + \sum_{C_i \in T} H_{\beta_i}(C_i) - \sum_{(C_i - C_j) \in T} H_{\mu_{i,j}}(S_{i,j})$$
where $\psi_i^0 = \prod_{\phi:\,\alpha(\phi)=i} \phi$ is the initial potential of clique $C_i$.
Theorem: if Q is a set of calibrated potentials for T, then $F[\tilde{P}_F, Q] = F'[\tilde{P}_F, Q]$.
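As a numeric check of the theorem (a hypothetical toy case, not from the lecture): for a calibrated two-clique tree with Q equal to the true marginals, $F'[\tilde{P}_F, Q]$ should equal $\ln Z = F[\tilde{P}_F, P_F]$:

```python
# Check that the factored energy functional F' equals ln Z for calibrated
# beliefs. Two cliques {A,B}, {B,C} with sepset {B}; binary variables;
# the initial potentials are arbitrary illustrative numbers.
import numpy as np

psi1 = np.array([[0.5, 0.8], [0.1, 0.3]])   # psi^0 of C1 = {A, B}
psi2 = np.array([[0.4, 0.9], [0.6, 0.2]])   # psi^0 of C2 = {B, C}

joint = psi1[:, :, None] * psi2[None, :, :]  # ~P_F(A, B, C)
Z = joint.sum()
P = joint / Z

beta1 = P.sum(axis=2)                        # calibrated belief over (A, B)
beta2 = P.sum(axis=0)                        # calibrated belief over (B, C)
mu12 = P.sum(axis=(0, 2))                    # sepset marginal over B

def H(p):                                    # entropy of a normalized factor
    return -(p * np.log(p)).sum()

Fprime = ((beta1 * np.log(psi1)).sum()       # E_beta1[ln psi1]
          + (beta2 * np.log(psi2)).sum()     # E_beta2[ln psi2]
          + H(beta1) + H(beta2) - H(mu12))
assert np.isclose(Fprime, np.log(Z))
```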
Inference as Optimization
Optimization task: find Q that maximizes $F'[\tilde{P}_F, Q]$ subject to
$$\mu_{i,j}[S_{i,j}] = \sum_{C_i \setminus S_{i,j}} \beta_i \;\; \text{for each } (C_i - C_j) \in T, \qquad \sum_{C_i} \beta_i = 1 \;\; \text{for each } C_i \in T$$
Theorem: the fixed points of this optimization satisfy
$$\delta_{i\to j} = \sum_{C_i \setminus S_{i,j}} \psi_i^0 \prod_{k \in N_i \setminus \{j\}} \delta_{k\to i}, \qquad \beta_i \propto \psi_i^0 \prod_{j \in N_i} \delta_{j\to i}, \qquad \mu_{i,j} = \delta_{i\to j} \cdot \delta_{j\to i}$$
This suggests an iterative optimization procedure, identical to belief propagation!
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Generalized Belief Propagation
Strategy: perform belief propagation in a cluster graph with loops.
[Figure: a Bayesian network over A, B, C, D; its cluster tree with cliques {A,B,D}, {B,C,D} and sepset {B,D}; and a loopy cluster graph with clusters {A,B}, {B,C}, {C,D}, {A,D} and sepsets B, C, D, A]
Unlike BP on trees:
- Inference may be incorrect: double counting of evidence
- Convergence is not guaranteed
- Potentials in the calibrated graph are not guaranteed to be marginals of P
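A sketch of this loopy scheme on the four-cluster graph above, with made-up binary potentials (cluster and helper names are illustrative); as stated above, convergence is not guaranteed in general:

```python
# Sketch of generalized (loopy) BP on the cluster graph above:
# clusters {A,B}, {B,C}, {C,D}, {A,D}, singleton sepsets B, C, D, A.
# Binary variables; potentials are arbitrary illustrative numbers.
import numpy as np

rng = np.random.default_rng(0)
clusters = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
psi = {c: rng.uniform(0.5, 1.5, size=(2, 2)) for c in clusters}
adj = {c: [d for d in clusters if d != c and set(c) & set(d)] for c in clusters}
msgs = {(s, t): np.ones(2) for s in clusters for t in adj[s]}

def absorb(c, exclude=None):
    """Cluster potential times all incoming messages (optionally minus one)."""
    b = psi[c].copy()
    for n in adj[c]:
        if n == exclude:
            continue
        v = (set(c) & set(n)).pop()              # sepset variable
        m = msgs[(n, c)]
        b *= m[:, None] if c.index(v) == 0 else m[None, :]
    return b

for _ in range(100):                             # loopy iteration: may not converge
    for s in clusters:
        for t in adj[s]:
            v = (set(s) & set(t)).pop()
            sigma = absorb(s, exclude=t).sum(axis=1 - s.index(v))
            msgs[(s, t)] = sigma / sigma.sum()   # normalize for stability

beta = {c: absorb(c) for c in clusters}          # (pseudo-)calibrated beliefs
print(beta[("A", "B")] / beta[("A", "B")].sum())
```

A fixed sequential update schedule is used here; smarter scheduling (see "GBP in Practice" below) often helps convergence.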
Generalized Cluster Graph
Recall: in a cluster graph K for factors F, each edge $C_i$–$C_j$ is associated with the full sepset $S_{i,j} = C_i \cap C_j$.
A generalized cluster graph K for factors F is an undirected graph:
- Nodes are associated with a subset of variables $C_i \subseteq \mathcal{U}$
- The graph is family preserving: each factor $\phi \in F$ is associated with one node $C_i$ such that $Scope[\phi] \subseteq C_i$
- Each edge $C_i$–$C_j$ is associated with a subset $S_{i,j} \subseteq C_i \cap C_j$
Generalized Cluster Graph
A generalized cluster graph obeys the running intersection property if for each $X \in C_i$ and $X \in C_j$, there is exactly one path between $C_i$ and $C_j$ for which $X \in S$ for every sepset S along the path.
Equivalently: all edges associated with X form a tree that spans all the clusters that contain X.
Note: some of these clusters may be connected by more than one path.
[Figure: the loopy cluster graph with clusters {A,B}, {B,C}, {C,D}, {A,D} and sepsets B, C, D, A]
Calibrated Cluster Graph
A generalized cluster graph is calibrated if for each edge $C_i$–$C_j$ we have:
$$\sum_{C_i \setminus S_{i,j}} \beta_i[C_i] = \sum_{C_j \setminus S_{i,j}} \beta_j[C_j]$$
This is weaker than in clique trees, since $S_{i,j}$ may be a strict subset of the intersection between $C_i$ and $C_j$.
If the cluster graph satisfies the running intersection property, then the marginal on any variable X is the same in every cluster that contains X.
GBP is Efficient
[Figure: 3×3 Markov grid network over X11,...,X33 and its cluster graph with one cluster {Xi,Xj} per grid edge (e.g. {X11,X12}, {X11,X21}, ...), connected through singleton sepsets]
- Note: a clique tree in an n×n grid has cliques exponential in n
- A round of GBP is O(n)
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Constructing Cluster Graphs
When constructing clique trees, all constructions give the same result but differ in computational complexity.
In GBP, different cluster graphs can vary in both computational complexity and approximation quality.
Transforming Pairwise MNs
A pairwise Markov network over a graph H has:
- A set of node potentials $\{\phi[X_i] : i = 1,\ldots,n\}$
- A set of edge potentials $\{\phi[X_i,X_j] : (X_i\text{–}X_j) \in H\}$
Example:
[Figure: the 3×3 Markov grid over X11,...,X33 transformed into a cluster graph with one cluster per edge potential: {X11,X12}, {X12,X13}, {X11,X21}, {X12,X22}, {X13,X23}, {X21,X22}, {X22,X23}, {X21,X31}, {X22,X32}, {X23,X33}, {X31,X32}, {X32,X33}]
Transforming Bayesian Networks
Example:
[Figure: a Bayesian network over A, B, C, D, F transformed into a two-layer cluster graph: large clusters {A,B,C}, {A,B,D}, {B,D,F} on top, singleton clusters A, B, C, D, F below]
- A "large" cluster per CPD
- A singleton cluster for each variable
- Connect a singleton cluster to a large cluster if the variable appears in the CPD
- The resulting graph obeys the running intersection property (see the sketch below)
This two-layer construction is the Bethe approximation.
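A sketch of this Bethe construction from a list of factor scopes, using the slide's A, B, C, D, F example (the representation choices are mine, not the lecture's):

```python
# Sketch of the Bethe construction described above: one large cluster per
# factor scope, one singleton cluster per variable, and an edge whenever
# the variable appears in the factor. Scopes follow the slide's example.
factor_scopes = [("A", "B", "C"), ("A", "B", "D"), ("B", "D", "F")]
variables = sorted({v for s in factor_scopes for v in s})

large = [frozenset(s) for s in factor_scopes]        # top-layer clusters
single = {v: frozenset([v]) for v in variables}      # bottom-layer clusters
edges = [(c, single[v]) for c in large for v in c]   # sepset of each edge = {v}

for c, s in edges:
    print(set(c), "--", set(s))
# Running intersection holds: the edges touching any variable X form a
# star centered at X's singleton cluster, which is a tree.
```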
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Generalized Belief Propagation
GBP maintains distribution invariance (since message passing maintains invariance):
$$P_F(\mathcal{U}) = \frac{\prod_{C_i \in K} \beta_i[C_i]}{\prod_{(C_i - C_j) \in K} \mu_{i,j}[S_{i,j}]}$$
Generalized Belief Propagation
If GBP converges (K is calibrated), each subtree T of K is calibrated, with edge potentials corresponding to marginals of $P_T(\mathcal{U})$ (since $P_T(\mathcal{U})$ is a calibrated tree), where
$$P_T(\mathcal{U}) = \frac{\prod_{C_i \in T} \beta_i[C_i]}{\prod_{(C_i - C_j) \in T} \mu_{i,j}[S_{i,j}]}$$
Generalized Belief Propagation
Calibrated graph potentials are not $P_F(\mathcal{U})$ marginals.
[Figure: the loopy cluster graph 1: {A,B}, 2: {B,C}, 3: {C,D}, 4: {A,D} with sepsets B, C, D, A, and the subtree T over clusters 1, 2, 3]
$$P_F(A,B,C,D) = \frac{\beta_1[A,B]\,\beta_2[B,C]\,\beta_3[C,D]\,\beta_4[A,D]}{\mu_{1,2}[B]\,\mu_{2,3}[C]\,\mu_{3,4}[D]\,\mu_{4,1}[A]}$$
$$P_T(A,B,C,D) = \frac{\beta_1[A,B]\,\beta_2[B,C]\,\beta_3[C,D]}{\mu_{1,2}[B]\,\mu_{2,3}[C]}$$
Since $P_T = P_F \cdot \dfrac{\mu_{3,4}[D]\,\mu_{4,1}[A]}{\beta_4[A,D]}$, in general $P_T \neq P_F$, and so $\beta_1[A,B] = P_T(A,B) \neq P_F(A,B)$.
Inference as Optimization
Recall the clique tree case. Optimization task: find Q that maximizes $F'[\tilde{P}_F, Q]$ subject to
$$\mu_{i,j}[S_{i,j}] = \sum_{C_i \setminus S_{i,j}} \beta_i \;\; \text{for each } (C_i - C_j) \in T, \qquad \sum_{C_i} \beta_i = 1 \;\; \text{for each } C_i \in T$$
Theorem: the fixed points satisfy
$$\delta_{i\to j} = \sum_{C_i \setminus S_{i,j}} \psi_i^0 \prod_{k \in N_i \setminus \{j\}} \delta_{k\to i}, \qquad \beta_i \propto \psi_i^0 \prod_{j \in N_i} \delta_{j\to i}, \qquad \mu_{i,j} = \delta_{i\to j} \cdot \delta_{j\to i}$$
The iterative optimization procedure is identical to belief propagation.
GBP as Optimization
Optimization task: find Q that maximizes $F'[\tilde{P}_F, Q]$ subject to
$$\mu_{i,j}[S_{i,j}] = \sum_{C_i \setminus S_{i,j}} \beta_i \;\; \text{for each } (C_i - C_j) \in K, \qquad \sum_{C_i} \beta_i = 1 \;\; \text{for each } C_i \in K$$
Theorem: the fixed points of this optimization satisfy
$$\delta_{i\to j} = \sum_{C_i \setminus S_{i,j}} \psi_i^0 \prod_{k \in N_i \setminus \{j\}} \delta_{k\to i}, \qquad \beta_i \propto \psi_i^0 \prod_{j \in N_i} \delta_{j\to i}, \qquad \mu_{i,j} = \delta_{i\to j} \cdot \delta_{j\to i}$$
Note: $S_{i,j}$ is only a subset of the intersection between $C_i$ and $C_j$.
The iterative optimization procedure is GBP.
GBP as Optimization
Clique trees:
- $F[\tilde{P}_F, Q] = F'[\tilde{P}_F, Q]$
- The iterative procedure (BP) is guaranteed to converge
- The convergence point represents the marginal distributions of $P_F$
Cluster graphs:
- $F[\tilde{P}_F, Q] = F'[\tilde{P}_F, Q]$ does not hold!
- The iterative procedure (GBP) is not guaranteed to converge
- The convergence point does not represent the marginal distributions of $P_F$
GBP in Practice
Dealing with non-convergence:
- Often small portions of the network do not converge: stop inference and use the current beliefs
- Use intelligent message passing scheduling
- Tree reparameterization (TRP): select entire trees and calibrate them while keeping all other beliefs fixed; focus attention on uncalibrated regions of the graph
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Propagation w. Approximate Msgs
General idea:
- Perform BP (or GBP) as before, but propagate messages that are only approximate
Modular approach:
- The general inference scheme remains the same
- Can plug in many different approximate message computations
Factorized Messages
[Figure: a 3×3 Markov grid network over X11,...,X33 and a corresponding clique tree with three cliques 1, 2, 3, whose sepsets each contain a full grid column (three variables)]
- Keep the internal structure of the clique tree cliques
- Calibration involves sending messages that are joint over three variables
- Idea: simplify the messages using a factored representation
Example:
$$\delta_{1\to 2}[X_{11},X_{21},X_{31}] \;\approx\; \tilde\delta_{1\to 2}[X_{11}] \cdot \tilde\delta_{1\to 2}[X_{21}] \cdot \tilde\delta_{1\to 2}[X_{31}]$$
Computational Savings
Answering queries in cluster 2:
- Exact inference: exponential in the joint space of cluster 2
- Approximate inference with factored messages: the subnetwork of cluster 2 together with the factored incoming messages is a tree, so we can perform efficient exact inference on this subtree to answer queries
$$\tilde\beta_2 = \psi_2^0 \cdot \tilde\delta_{1\to 2}[X_{11},X_{21},X_{31}] \cdot \tilde\delta_{3\to 2}[X_{12},X_{22},X_{32}]$$
[Figure: cluster 2 of the grid clique tree with the factored messages from clusters 1 and 3 attached as single-variable potentials]
Factor Sets
A factor set $\hat\phi = \{\phi_1,\ldots,\phi_k\}$ provides a compact representation for the high-dimensional factor $\phi_1 \cdot \ldots \cdot \phi_k$.
Belief propagation with factor sets:
- Multiplication of factor sets is easy: simply take the union of the factors in the two sets being multiplied
- Marginalization of a factor set requires inference in the simplified network
Example: compute $\tilde\delta_{2\to 3} = \sum_{C_2 \setminus S_{2,3}} \left(\psi_2^0 \cdot \tilde\delta_{1\to 2}\right)$ by running exact inference in the tree-structured network of cluster 2
[Figure: cluster 2 with the factored incoming message $\tilde\delta_{1\to 2}$ attached]
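A small sketch of the trade-off: factor-set "multiplication" is just list union, while a factored approximation of a joint message can be built from its single-variable marginals (all numbers are made up):

```python
# Contrast an exact joint message over (X1, X2, X3) with its factored
# approximation ~d[X1] ~d[X2] ~d[X3] built from single-variable marginals.
# Binary variables; the message entries are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(1)
delta = rng.uniform(0.1, 1.0, size=(2, 2, 2))    # exact message over 3 vars
delta /= delta.sum()

# Factor-set "multiplication": keep the factors as a list, no computation.
factor_set = [delta]                             # union in further factors here

# Factored approximation: product of the three single-variable marginals.
m1, m2, m3 = delta.sum((1, 2)), delta.sum((0, 2)), delta.sum((0, 1))
approx = m1[:, None, None] * m2[None, :, None] * m3[None, None, :]

print("max abs error:", np.abs(delta - approx).max())
```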
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Approximate Message Propagation
Input:
- A clique tree (or cluster graph)
- An assignment of the original factors to clusters/cliques
- The factorized form of each cluster/clique; this can be represented by a network for each edge $C_i$–$C_j$ that specifies the factorization (in the previous examples we assumed an empty network)
Two strategies for approximate message propagation:
- Sum-product message passing
- Belief update messages
Sum-Product Propagation
Same propagation scheme as in exact inference:
- Select a root
- Propagate messages towards the root: each cluster collects messages from its neighbors and sends its outgoing message once possible
- Propagate messages from the root back down
Each message passing step performs inference within a cluster.
The algorithm terminates in a fixed number of iterations.
Note: the final marginals at each variable are not exact.
Message Passing: Belief Propagation
Same as BP, but with approximate messages.
Initialize the clique tree:
- For each clique $C_i$ set $\tilde\beta_i \leftarrow \prod_{\phi:\,\alpha(\phi)=i} \phi$
- For each edge $C_i$–$C_j$ set $\tilde\mu_{i,j} \leftarrow 1$
Repeat until calibrated:
- Select an edge $C_i$–$C_j$
- Send a message from $C_i$ to $C_j$:
  - Marginalize the clique over the sepset (approximately): $\tilde\sigma_{i\to j} \leftarrow \sum_{C_i \setminus S_{i,j}} \tilde\beta_i$
  - Update the belief at $C_j$: $\tilde\beta_j \leftarrow \tilde\beta_j \cdot \dfrac{\tilde\sigma_{i\to j}}{\tilde\mu_{i,j}}$
  - Update the sepset at $C_i$–$C_j$: $\tilde\mu_{i,j} \leftarrow \tilde\sigma_{i\to j}$
Approximation
The two message passing schemes (sum-product and belief update) differ in the approximate inference they perform.
Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation: define algorithm, constructing cluster graphs, analyze approximation guarantees
- Propagation with approximate messages: factorized messages, approximate message propagation
- Structured variational approximations
Structured Variational Approx.
- Select a simple family of distributions Q
- Find $Q \in \mathbf{Q}$ that maximizes $F[\tilde{P}_F, Q]$
Mean Field Approximation: $Q(\mathcal{X}) = \prod_i Q(X_i)$
- Q loses much of the information of $P_F$
- The approximation is computationally attractive: every query in Q is simple to compute, and Q is easy to represent
[Figure: $P_F$ — a 3×3 Markov grid network over X11,...,X33; Q — the corresponding mean field network with no edges]
Mean Field Approximation
The energy functional is easy to compute, even for networks where inference is complex:
$$F[\tilde{P}_F, Q] = \mathbf{E}_Q[\ln \tilde{P}_F(\mathcal{U})] + H_Q(\mathcal{U})$$
Both terms decompose:
$$\mathbf{E}_Q[\ln \tilde{P}_F(\mathcal{U})] = \sum_{\phi \in F} \mathbf{E}_Q[\ln \phi] = \sum_{\phi \in F} \sum_{\mathbf{u} \in Val(Scope[\phi])} \ln \phi(\mathbf{u}) \prod_{X_i \in Scope[\phi]} Q(x_i)$$
where $x_i$ is the assignment to $X_i$ in $\mathbf{u}$, and
$$H_Q(\mathcal{U}) = \sum_i H_Q(X_i)$$
Each expectation is a low-dimensional sum over the scope of a single factor.
Mean Field Maximization
Maximizing the energy functional for mean field:
- Find $Q(\mathcal{X}) = \prod_i Q(X_i)$ that maximizes $F[\tilde{P}_F, Q]$
- Subject to, for all i: $\sum_{x_i} Q(x_i) = 1$
Mean Field Maximization
Theorem: $Q(x_i)$ is a stationary point of the mean field optimization given $Q(X_1),\ldots,Q(X_{i-1}),Q(X_{i+1}),\ldots,Q(X_n)$ if and only if
$$Q(x_i) = \frac{1}{Z_i} \exp\left\{\mathbf{E}_Q[\ln \tilde{P}_F \mid x_i]\right\}$$
Proof: to optimize $Q(X_i)$, define the Lagrangian
$$L_i = \mathbf{E}_Q[\ln \tilde{P}_F] + H_Q(\mathcal{U}) + \lambda_i\left(\sum_{x_i} Q(x_i) - 1\right)$$
where $\lambda_i$ corresponds to the constraint that $Q(X_i)$ is a distribution. We now compute the derivative of $L_i$ with respect to $Q(x_i)$.
Mean Field Maximization
Derivative of the expectation term: write $\mathcal{V} = \mathcal{U} - \{X_i\}$, so that $Q(\mathbf{u}) = Q(x_i)\,Q(\mathbf{v})$. Then
$$\mathbf{E}_Q[\ln \tilde{P}_F] = \sum_{\mathbf{u}} Q(\mathbf{u}) \ln \tilde{P}_F(\mathbf{u}) = \sum_{x_i} Q(x_i) \sum_{\mathbf{v}} Q(\mathbf{v}) \ln \tilde{P}_F(x_i, \mathbf{v}) = \sum_{x_i} Q(x_i)\,\mathbf{E}_Q[\ln \tilde{P}_F \mid x_i]$$
The conditional expectations do not depend on $Q(x_i)$, so
$$\frac{\partial}{\partial Q(x_i)}\,\mathbf{E}_Q[\ln \tilde{P}_F] = \mathbf{E}_Q[\ln \tilde{P}_F \mid x_i]$$
Mean Field Maximization
Derivative of the entropy term: $H_Q(\mathcal{U}) = \sum_j H_Q(X_j) = -\sum_j \sum_{x_j} Q(x_j) \ln Q(x_j)$, and only the $j = i$ term depends on $Q(x_i)$:
$$\frac{\partial}{\partial Q(x_i)} H_Q(\mathcal{U}) = -\ln Q(x_i) - 1$$
Combining the two derivatives with the Lagrange term:
$$\frac{\partial L_i}{\partial Q(x_i)} = \mathbf{E}_Q[\ln \tilde{P}_F \mid x_i] - \ln Q(x_i) - 1 + \lambda_i$$
Mean Field Maximization
Setting the derivative to zero and rearranging terms, we get:
$$\ln Q(x_i) = \mathbf{E}_Q[\ln \tilde{P}_F \mid x_i] - 1 + \lambda_i$$
Taking exponents of both sides, we get:
$$Q(x_i) = \frac{1}{Z_i} \exp\left\{\mathbf{E}_Q[\ln \tilde{P}_F \mid x_i]\right\}$$
where $Z_i = e^{1 - \lambda_i}$ is a normalizing constant.
Mean Field Maximization: Intuition
Since $\tilde{P}_F(x_i, \mathbf{v}) = Z\,P_F(\mathbf{v})\,P_F(x_i \mid \mathbf{v})$:
$$\mathbf{E}_Q[\ln \tilde{P}_F \mid x_i] = \sum_{\mathbf{v}} Q(\mathbf{v}) \ln \tilde{P}_F(x_i, \mathbf{v}) = \mathbf{E}_Q[\ln P_F(x_i \mid \mathcal{V})] + \mathbf{E}_Q[\ln Z P_F(\mathcal{V})]$$
We can thus rewrite $Q(x_i)$ as:
$$Q(x_i) = \frac{1}{Z_i} \exp\left\{\mathbf{E}_Q[\ln P_F(x_i \mid \mathcal{V})]\right\} \cdot \exp\left\{\mathbf{E}_Q[\ln Z P_F(\mathcal{V})]\right\}$$
Mean Field Maximization: Intuition
The second factor does not depend on $x_i$, so it can be folded into the normalizing constant:
$$Q(x_i) = \frac{1}{Z_i'} \exp\left\{\mathbf{E}_Q[\ln P_F(x_i \mid \mathcal{V})]\right\}$$
- $Q(x_i)$ is the geometric average of $P_F(x_i \mid \mathbf{v})$, relative to the probability distribution Q; in this sense the marginal is "consistent" with the other marginals
- In $P_F$ we can also represent the marginal, as an arithmetic average with respect to $P_F$:
$$P_F(x_i) = \sum_{\mathbf{v}} P_F(\mathbf{v})\,P_F(x_i \mid \mathbf{v}) = \mathbf{E}_{P_F}[P_F(x_i \mid \mathcal{V})]$$
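A tiny numeric contrast between the two averages (made-up conditionals and weights):

```python
# Contrast the geometric average (mean field form) with the arithmetic
# average (true marginal form) for a binary X_i and two contexts v, using
# arbitrary illustrative numbers. The two generally differ.
import numpy as np

p_cond = np.array([[0.9, 0.1],     # P(X_i | v1)
                   [0.2, 0.8]])    # P(X_i | v2)
w = np.array([0.5, 0.5])           # weights: Q(v) or P_F(v)

arith = w @ p_cond                              # sum_v w(v) P(x_i | v)
geo = np.exp(w @ np.log(p_cond))                # exp{ sum_v w(v) ln P(x_i | v) }
geo /= geo.sum()                                # renormalize

print(arith, geo)                               # [0.55 0.45] vs approx [0.60 0.40]
```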
Mean Field: Algorithm
Simplify
$$Q(x_i) = \frac{1}{Z_i} \exp\left\{\mathbf{E}_Q[\ln \tilde{P}_F \mid x_i]\right\}$$
to
$$Q(x_i) = \frac{1}{Z_i} \exp\left\{\sum_{\phi:\,X_i \in Scope[\phi]} \mathbf{E}_Q[\ln \phi \mid x_i]\right\}$$
since terms that do not involve $x_i$ can be absorbed into the normalizing constant.
- Note: $Q(x_i)$ does not appear on the right-hand side, so we can solve for the optimal $Q(x_i)$ in one step
- Note: the step is only optimal given all the other marginals $Q(X_j)$, $j \neq i$
- This suggests an iterative algorithm (see the sketch below); convergence to a local maximum is guaranteed, since each step improves the energy functional
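A sketch of the resulting coordinate-ascent algorithm on a small pairwise Markov network, a four-cycle over binary variables with made-up potentials (variable and loop names are illustrative):

```python
# Mean-field coordinate ascent on a pairwise Markov network: the cycle
# A - B - C - D - A with binary variables. The inner update implements
#   Q(x_i) proportional to exp{ sum_{phi: X_i in Scope[phi]} E_Q[ln phi | x_i] }.
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]         # the cycle A-B-C-D-A
phi = {e: rng.uniform(0.5, 2.0, size=(2, 2)) for e in edges}

Q = [np.full(2, 0.5) for _ in range(4)]          # init: uniform marginals

for _ in range(100):                             # coordinate ascent sweeps
    for i in range(4):
        log_q = np.zeros(2)
        for (a, b), f in phi.items():
            if i == a:                           # E_Q[ln phi(x_i, X_b)]
                log_q += np.log(f) @ Q[b]
            elif i == b:                         # E_Q[ln phi(X_a, x_i)]
                log_q += np.log(f).T @ Q[a]
        q = np.exp(log_q - log_q.max())          # stabilize before normalizing
        Q[i] = q / q.sum()                       # one-step optimal update

print([np.round(q, 3) for q in Q])               # approximate marginals
```

Each inner update is the one-step-optimal solution given the other marginals, so every sweep cannot decrease the energy functional.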
Markov Network Approximations
We can use approximating families Q that are increasingly complex: as long as Q is "easy" (inference in Q remains feasible), efficient update equations can be derived.
[Figure: $P_F$ — the 3×3 Markov grid network over X11,...,X33; Q — the mean field network over the same variables]