Variational Methods for Graphical Models


Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, Lawrence K. Saul

Presented by: Afsaneh Shirazi

Outline

• Motivation
• Inference in graphical models
• Exact inference is intractable
• Variational methodology
  – Sequential approach
  – Block approach
• Conclusions

Motivation (Example: Medical Diagnosis)

[Figure: bipartite graph with diseases in the top layer and symptoms in the bottom layer]

What is the most probable disease?

Motivation

• We want to answer queries about our data
• A graphical model is a way to model data
• Inference in some graphical models is intractable (NP-hard)
• Variational methods simplify inference in graphical models through approximation

Graphical Models

• Directed (Bayesian network)
• Undirected

[Figure: a Bayesian network over S1–S5 with factors P(S1), P(S2), P(S3|S1,S2), P(S4|S3), P(S5|S3,S4), and an undirected graph with cliques C1, C2, C3]


Inference in Graphical Models

Inference: Given a graphical model, the process of computing answers to queries

• How computationally hard is this decision problem?

• Theorem: Computing P(X = x) in a Bayesian network is NP-hard

Why Is Exact Inference Intractable?

[Figure: bipartite graph of diseases and symptoms]

Diagnose the most probable disease.

Why Is Exact Inference Intractable?

[Figure: bipartite graph; the shaded symptom nodes f are observed]

$$P(f, d) = P(f \mid d)\,P(d)$$

Why Is Exact Inference Intractable?

Noisy-OR model: $P(f_i \mid d)$

[Figure: bipartite graph; the parent diseases of finding $f_i$ are in states d = (1, 0, 1)]

Why Is Exact Inference Intractable?

Noisy-OR model: $P(f_i \mid d)$

[Figure: as above, with the parents of $f_i$ in states d = (1, 0, 1)]

$$P(f_i = 0 \mid d = (1, 0, 1))$$

Why Is Exact Inference Intractable?

Under the noisy-OR model, with $q_{ij}$ the probability that disease $j$ alone causes finding $i$ and $q_{i0}$ the leak probability:

$$P(f_i = 0 \mid d) = (1 - q_{i0}) \prod_{j \in \mathrm{pa}(i)} (1 - q_{ij})^{d_j} = e^{-\theta_{i0} - \sum_{j \in \mathrm{pa}(i)} \theta_{ij} d_j}, \qquad \theta_{ij} = -\ln(1 - q_{ij})$$

$$P(f_i = 1 \mid d) = 1 - e^{-\theta_{i0} - \sum_{j \in \mathrm{pa}(i)} \theta_{ij} d_j}$$
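A minimal sketch of these two noisy-OR formulas in Python (the array shapes and example parameters are my own assumptions, not from the slides):

```python
import numpy as np

# Noisy-OR finding likelihood: theta[i, j] = -ln(1 - q[i, j]),
# theta0[i] is the leak term for finding i.
def p_finding_given_diseases(theta0, theta, d):
    """P(f_i = 1 | d) for every finding i, with d a 0/1 disease vector."""
    activation = theta0 + theta @ d          # theta_{i0} + sum_j theta_{ij} d_j
    p_f0 = np.exp(-activation)               # P(f_i = 0 | d)
    return 1.0 - p_f0                        # P(f_i = 1 | d)

# Example: 2 findings, 3 diseases, with d = (1, 0, 1) as in the slides.
q = np.array([[0.8, 0.3, 0.5],
              [0.1, 0.6, 0.4]])              # q_{ij}: chance disease j causes finding i
theta = -np.log(1.0 - q)
theta0 = -np.log(1.0 - np.array([0.01, 0.02]))  # leak probabilities
d = np.array([1, 0, 1])
print(p_finding_given_diseases(theta0, theta, d))
```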

Why Is Exact Inference Intractable?

[Figure: bipartite graph; the shaded symptom nodes f are observed]

$$P(f, d) = P(f \mid d)\,P(d) = \Big[\prod_i P(f_i \mid d)\Big]\Big[\prod_j P(d_j)\Big]$$

Each positive finding contributes a factor $1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}$. Multiplying these factors out produces cross terms such as $e^{-(\theta_{i0} + \theta_{k0}) - \sum_j (\theta_{ij} + \theta_{kj}) d_j}$, which couple the diseases.

Why Is Exact Inference Intractable?

$$P(f, d) = P(f \mid d)\,P(d) = \Big[\prod_i P(f_i \mid d)\Big]\Big[\prod_j P(d_j)\Big]$$

For the positive findings this contains the product

$$\big(1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}\big) \cdots \big(1 - e^{-\theta_{k0} - \sum_j \theta_{kj} d_j}\big)$$

whose expansion has a number of terms exponential in the number of positive findings, so summing out the diseases exactly is intractable.

Reducing the Computational Complexity

Variational methods:

• Produce a simple graph where exact methods apply
• Approximate the probability distribution
• Exploit convexity

Express a Function Variationally

• $\ln(x)$ is a concave function

$$\ln(x) = \min_{\lambda}\{\lambda x - H(\lambda)\}, \qquad H(\lambda) = \min_{x}\{\lambda x - \ln(x)\}$$

Express a Function Variationally

• $\ln(x)$ is a concave function

$$\ln(x) = \min_{\lambda}\{\lambda x - \ln(\lambda) - 1\}$$

(The inner minimum defining $H(\lambda)$ is attained at $x = 1/\lambda$, giving $H(\lambda) = \ln(\lambda) + 1$; the outer minimum is attained at $\lambda = 1/x$.)
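A quick numerical check of this identity (a sketch; the grid over λ is arbitrary):

```python
import numpy as np

x = 2.0
lam = np.linspace(1e-3, 5.0, 200000)
bound = lam * x - np.log(lam) - 1.0   # lambda*x - ln(lambda) - 1, an upper bound for every lambda
print(bound.min(), np.log(x))          # both approx 0.6931 = ln(2), minimum at lambda = 1/2
```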

Express a Function Variationally

• If the function is not convex or concave: transform it to a desired form
• Example: logistic function

$$f(x) = \frac{1}{1 + e^{-x}}$$

Transformation: $g(x) = \ln(f(x))$ is concave.

Approximation: $g(x) = \min_{\lambda}\{\lambda x - H(\lambda)\}$, where $H(\lambda)$ is the binary entropy $-\lambda\ln\lambda - (1-\lambda)\ln(1-\lambda)$.

Transforming back: $f(x) = \min_{\lambda} e^{\lambda x - H(\lambda)}$
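A sketch verifying the logistic bound numerically, assuming the binary entropy above; the grid search over λ is purely illustrative:

```python
import numpy as np

def binary_entropy(lam):
    return -lam * np.log(lam) - (1 - lam) * np.log(1 - lam)

x = 1.5
lam = np.linspace(1e-6, 1 - 1e-6, 100000)
upper = np.exp(lam * x - binary_entropy(lam))   # e^{lam*x - H(lam)} >= f(x) for every lam
print(upper.min(), 1 / (1 + np.exp(-x)))         # both approx 0.8176
```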

Approaches to Variational Methods

• Sequential approach (on-line): nodes are transformed one at a time, in an order determined during the inference process
• Block approach (off-line): exploits obvious substructure in the model

Sequential Approach (Two Methods)

• Method 1: start from the untransformed graph and transform one node at a time, until the graph is simple enough for exact methods.
• Method 2: start from the completely transformed graph and reintroduce one node at a time, keeping the graph simple enough for exact methods.

Sequential Approach (Example)

$$P(f_i = 1 \mid d) = 1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}$$

[Figure: bipartite graph of diseases and symptoms]

This likelihood is log concave.

Sequential Approach (Example)

$$P(f_i = 1 \mid d) = 1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}$$

The function $f(x) = \ln(1 - e^{-x})$ is concave, so conjugate duality gives $1 - e^{-x} \le e^{\lambda x - f^*(\lambda)}$ for every $\lambda$, where $f^*$ is the conjugate of $f$. Therefore

$$P(f_i = 1 \mid d) \le e^{\lambda_i(\theta_{i0} + \sum_j \theta_{ij} d_j) - f^*(\lambda_i)} = e^{\lambda_i \theta_{i0} - f^*(\lambda_i)} \prod_j \big[e^{\lambda_i \theta_{ij}}\big]^{d_j}$$

The transformed finding factorizes over the diseases, delinking $f_i$ from its parents.
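A numeric check of this upper bound, assuming the conjugate $f^*(\lambda) = -\lambda\ln\lambda + (\lambda+1)\ln(\lambda+1)$ of $f(x) = \ln(1 - e^{-x})$ (derived, not quoted from the slides):

```python
import numpy as np

# Conjugate of f(x) = ln(1 - e^{-x})
def f_star(lam):
    return -lam * np.log(lam) + (lam + 1) * np.log(lam + 1)

x = 0.7                                   # stands in for theta_{i0} + sum_j theta_{ij} d_j
lam = np.linspace(1e-3, 20.0, 200000)
upper = np.exp(lam * x - f_star(lam))     # e^{lam*x - f*(lam)} >= 1 - e^{-x}
print(upper.min(), 1 - np.exp(-x))        # bound is tight: both approx 0.5034
```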

Sequential Approach (Example)

[Figure: bipartite graph after transforming finding $f_4$; $f_4$ is delinked from its parent diseases]

$$P(f_i = 1 \mid d) \le e^{\lambda_i \theta_{i0} - f^*(\lambda_i)} \prod_j \big[e^{\lambda_i \theta_{ij}}\big]^{d_j}$$

The factor $e^{\lambda_4 \theta_{43}}$ from the transformed finding is absorbed into the prior of disease $d_3$, updating the weights to $P(d_3 = 1)\,e^{\lambda_4 \theta_{43}}$ and $P(d_3 = 0)$.

Sequential Approach (Example)

[Figure: a second finding is transformed and delinked from its parent diseases]

$$P(f_i = 1 \mid d) \le e^{\lambda_i \theta_{i0} - f^*(\lambda_i)} \prod_j \big[e^{\lambda_i \theta_{ij}}\big]^{d_j}$$

Sequential Approach (Example)

[Figure: further findings are transformed in turn, until the remaining graph is simple enough for exact inference]

$$P(f_i = 1 \mid d) \le e^{\lambda_i \theta_{i0} - f^*(\lambda_i)} \prod_j \big[e^{\lambda_i \theta_{ij}}\big]^{d_j}$$

Sequential Approach (Upper Bound and Lower Bound)

• We need both a lower bound and an upper bound:

$$P(d_j \mid f) = \frac{P(f, d_j)}{\sum_{d_j} P(f, d_j)}$$

$$P(d_j \mid f) \le \frac{UB\big(P(f, d_j)\big)}{LB\big(\sum_{d_j} P(f, d_j)\big)}$$

To bound the posterior, the numerator needs an upper bound and the denominator a lower bound (and vice versa for a lower bound on the posterior).

How to Compute a Lower Bound for a Concave Function?

• Lower bound for concave functions (Jensen's inequality):

$$f\Big(\sum_j z_j\Big) = f\Big(\sum_j q_j\,\frac{z_j}{q_j}\Big) \ge \sum_j q_j\, f\Big(\frac{z_j}{q_j}\Big)$$

The variational parameter $q_j$ is a probability distribution.
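A small numeric sanity check of this bound with $f = \ln$ (the values of $z$ and $q$ are arbitrary):

```python
import numpy as np

z = np.array([0.5, 1.2, 2.0])
q = np.array([0.2, 0.3, 0.5])           # any probability distribution works
lhs = np.log(z.sum())                   # f(sum_j z_j) with f = ln
rhs = (q * np.log(z / q)).sum()         # sum_j q_j f(z_j / q_j)
print(lhs, ">=", rhs)                    # 1.308 >= 1.292; equality when q_j proportional to z_j
```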

Block Approach (Overview)

• Off-line application of the sequential approach:
  – Identify a substructure amenable to exact inference
  – Define a family of probability distributions via the introduction of variational parameters
  – Choose the best approximation based on the evidence

Block Approach (Details)

• KL divergence:

$$D(Q \,\|\, P) = \sum_{\{S\}} Q(S) \ln\frac{Q(S)}{P(S)}$$

Given the family of approximations $Q(H \mid E, \mu)$, choose $Q(H \mid E, \mu^*)$ by minimizing the KL divergence to the true posterior $P(H \mid E)$.
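A sketch of the discrete KL divergence above (the example distributions are made up):

```python
import numpy as np

def kl(q, p):
    """D(Q||P) = sum_S Q(S) ln(Q(S)/P(S)) for discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

p = np.array([0.7, 0.2, 0.1])   # target posterior P(H|E)
q = np.array([0.6, 0.3, 0.1])   # tractable approximation Q(H|E, mu)
print(kl(q, p))                  # >= 0, and 0 only when Q = P
```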

Block Approach (Example – Boltzmann Machine)

$$P(S) = \frac{e^{\sum_{i<j} \theta_{ij} S_i S_j + \sum_i \theta_{i0} S_i}}{Z}$$

[Figure: nodes $S_i$ and $S_j$ connected by weight $\theta_{ij}$]

Block Approach (Example – Boltzmann Machine)

[Figure: evidence node $S_j = 1$ adjacent to hidden node $S_i$]

Conditioning on the evidence nodes $j \in E$ absorbs their contribution into modified biases:

$$\theta_{i0}^{c} = \theta_{i0} + \sum_{j \in E} \theta_{ij} S_j$$

Block Approach (Example – Boltzmann Machine)

$$P(H \mid E) = \frac{e^{\sum_{i<j} \theta_{ij} S_i S_j + \sum_i \theta_{i0}^{c} S_i}}{Z_c}$$

Mean field approximation (fully factorized):

$$Q(H \mid E, \mu) = \prod_{i \in H} \mu_i^{S_i} (1 - \mu_i)^{1 - S_i}$$

Block Approach (Example – Boltzmann Machine)

[Figure: nodes $S_i$, $S_j$ replaced by variational parameters $\mu_i$, $\mu_j$]

Minimizing the KL divergence with respect to $\mu_i$ yields

$$\mu_i = \sigma\Big(\sum_j \theta_{ij}\mu_j + \theta_{i0}\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

These are the mean field equations: solve for a fixed point.
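A minimal mean-field sketch for a small Boltzmann machine (the weights, biases, and undamped update schedule are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(theta, theta0, n_iter=200, tol=1e-10):
    """Iterate mu_i = sigmoid(sum_j theta_ij mu_j + theta_i0) to a fixed point."""
    mu = np.full(len(theta0), 0.5)            # uninformative start
    for _ in range(n_iter):
        new_mu = sigmoid(theta @ mu + theta0)
        if np.max(np.abs(new_mu - mu)) < tol:
            break
        mu = new_mu
    return mu

# 3 hidden units; theta must be symmetric with zero diagonal.
theta = np.array([[ 0.0, 1.0, -0.5],
                  [ 1.0, 0.0,  0.3],
                  [-0.5, 0.3,  0.0]])
theta0 = np.array([0.1, -0.2, 0.05])          # biases (evidence already absorbed)
print(mean_field(theta, theta0))              # approximate marginals E[S_i]
```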


Conclusions

• Time or space complexity of exact calculation is unacceptable

• Complex graphs can be probabilistically simple

• Inference in simplified models provides bounds on probabilities in the original model


Extra Slides

Concerns

• Approximation accuracy
• Whether strong dependencies can be identified
• Not based on a convexity transformation
• No assurance that the framework will transfer to other examples
• Not straightforward to develop a variational approximation for new architectures

Justification for KL Divergence

• Minimizing the KL divergence gives the best lower bound on the probability of the evidence $P(E)$:

$$\ln P(E) = \ln \sum_{\{H\}} P(H, E) = \ln \sum_{\{H\}} Q(H \mid E)\,\frac{P(H, E)}{Q(H \mid E)} \ge \sum_{\{H\}} Q(H \mid E) \ln \frac{P(H, E)}{Q(H \mid E)}$$

The inequality is Jensen's, applied to the concave $\ln$.

EM

• Maximum likelihood parameter estimation: maximize $\ln P(E \mid \theta)$

• The following function is a lower bound on the log likelihood:

$$L(Q, \theta) = \sum_{\{H\}} Q(H \mid E) \ln P(H, E \mid \theta) - \sum_{\{H\}} Q(H \mid E) \ln Q(H \mid E)$$

$$\ln P(E \mid \theta) = L(Q, \theta) + D\big(Q(H \mid E)\,\big\|\,P(H \mid E, \theta)\big)$$

EM

1. Maximize the bound with respect to Q
2. Fix Q, maximize with respect to θ

$$\text{(E step):}\quad Q^{(k+1)} = \arg\max_{Q} L(Q, \theta^{(k)})$$

$$\text{(M step):}\quad \theta^{(k+1)} = \arg\max_{\theta} L(Q^{(k+1)}, \theta)$$

In traditional EM the E step is solved exactly by $Q^{(k+1)} = P(H \mid E, \theta^{(k)})$; restricting $Q$ to a tractable family gives an approximation to the EM algorithm.
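To make the alternation concrete, here is a toy EM loop for a two-component Gaussian mixture with unit variances; the model choice is mine, not from the slides, and the E step here is exact (traditional EM):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

# Parameters theta = (means mu, mixing weight pi); unit variances for brevity.
mu, pi = np.array([-1.0, 1.0]), 0.5
for _ in range(50):
    # E step: Q(H|E, theta) = posterior responsibility of component 1
    p1 = pi * np.exp(-0.5 * (x - mu[0]) ** 2)
    p2 = (1 - pi) * np.exp(-0.5 * (x - mu[1]) ** 2)
    r = p1 / (p1 + p2)
    # M step: maximize L(Q, theta) over theta with Q fixed
    mu = np.array([(r * x).sum() / r.sum(),
                   ((1 - r) * x).sum() / (1 - r).sum()])
    pi = r.mean()
print(mu, pi)   # approx (-2, 3) and 0.5
```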

Principle of Inference

DAG
  → Junction Tree (construction)
  → Inconsistent Junction Tree (initialization)
  → Consistent Junction Tree (propagation)
  → P(V = v | E = e) (marginalization)

Example: Create Join Tree

HMM with 2 time steps:

[Figure: chain X1 → X2 with emissions X1 → Y1 and X2 → Y2]

Junction tree:

(X1,Y1) —[X1]— (X1,X2) —[X2]— (X2,Y2)

Example: Initialization

Variable | Associated cluster | Potential function
X1       | (X1,Y1)            | ψ(X1,Y1) = P(X1)
Y1       | (X1,Y1)            | ψ(X1,Y1) = P(X1) P(Y1|X1)
X2       | (X1,X2)            | ψ(X1,X2) = P(X2|X1)
Y2       | (X2,Y2)            | ψ(X2,Y2) = P(Y2|X2)

(X1,Y1) —[X1]— (X1,X2) —[X2]— (X2,Y2)

Example: Collect Evidence

• Choose an arbitrary clique, e.g. (X1,X2), where all potential functions will be collected.
• Call recursively on neighboring cliques for messages:
• 1. Call (X1,Y1):
  – 1. Projection: $\phi_{X1} = \sum_{Y1} \psi_{X1,Y1} = \sum_{Y1} P(X1,Y1) = P(X1)$
  – 2. Absorption: $\psi_{X1,X2} \leftarrow \psi_{X1,X2}\,\frac{\phi_{X1}}{\phi_{X1}^{old}} = P(X2 \mid X1)\,P(X1) = P(X1,X2)$

Example: Collect Evidence (cont.)

• 2. Call (X2,Y2):
  – 1. Projection: $\phi_{X2} = \sum_{Y2} \psi_{X2,Y2} = \sum_{Y2} P(Y2 \mid X2) = 1$
  – 2. Absorption: $\psi_{X1,X2} \leftarrow \psi_{X1,X2}\,\frac{\phi_{X2}}{\phi_{X2}^{old}} = P(X1,X2)$

(X1,Y1) —[X1]— (X1,X2) —[X2]— (X2,Y2)

Example: Distribute Evidence

• Pass messages recursively to neighboring nodes
• Pass message from (X1,X2) to (X1,Y1):
  – 1. Projection: $\phi_{X1} = \sum_{X2} \psi_{X1,X2} = \sum_{X2} P(X1,X2) = P(X1)$
  – 2. Absorption: $\psi_{X1,Y1} \leftarrow \psi_{X1,Y1}\,\frac{\phi_{X1}}{\phi_{X1}^{old}} = P(X1,Y1)\,\frac{P(X1)}{P(X1)} = P(X1,Y1)$

Example: Distribute Evidence (cont.)

• Pass message from (X1,X2) to (X2,Y2):
  – 1. Projection: $\phi_{X2} = \sum_{X1} \psi_{X1,X2} = \sum_{X1} P(X1,X2) = P(X2)$
  – 2. Absorption: $\psi_{X2,Y2} \leftarrow \psi_{X2,Y2}\,\frac{\phi_{X2}}{\phi_{X2}^{old}} = P(Y2 \mid X2)\,\frac{P(X2)}{1} = P(Y2,X2)$

(X1,Y1) —[X1]— (X1,X2) —[X2]— (X2,Y2)
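A numeric sketch of the collect/distribute pass above for a binary two-step HMM (all CPT values are made up):

```python
import numpy as np

# Binary 2-step HMM with made-up CPTs.
pX1 = np.array([0.6, 0.4])                      # P(X1)
pX2_X1 = np.array([[0.7, 0.3], [0.2, 0.8]])     # P(X2|X1), rows index X1
pY_X = np.array([[0.9, 0.1], [0.3, 0.7]])       # P(Y|X), rows index X

# Initialization of the cluster potentials.
psi_x1y1 = pX1[:, None] * pY_X                  # P(X1) P(Y1|X1) = P(X1,Y1)
psi_x1x2 = pX2_X1.copy()                        # P(X2|X1)
psi_x2y2 = pY_X.copy()                          # P(Y2|X2)

# Collect to (X1,X2): project over Y, then absorb (old sepset potentials are 1).
phi_x1 = psi_x1y1.sum(axis=1)                   # = P(X1)
psi_x1x2 *= phi_x1[:, None]                     # = P(X1,X2)
phi_x2 = psi_x2y2.sum(axis=1)                   # = 1 for each X2
psi_x1x2 *= phi_x2[None, :]                     # unchanged

# Distribute from (X1,X2): both leaf potentials become joint marginals.
psi_x1y1 *= (psi_x1x2.sum(axis=1) / phi_x1)[:, None]   # = P(X1,Y1)
psi_x2y2 *= (psi_x1x2.sum(axis=0) / phi_x2)[:, None]   # = P(X2,Y2)

print(psi_x2y2.sum(axis=1))   # P(X2), matches direct computation below
print(pX1 @ pX2_X1)
```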

Example: Inference with Evidence

• Assume we want to compute P(X2 | Y1=0, Y2=1) (state estimation)
• Assign likelihoods to the potential functions during initialization:

$$\psi_{X1,Y1} = \begin{cases} 0 & \text{if } Y1 = 1 \\ P(X1, Y1{=}0) & \text{if } Y1 = 0 \end{cases} \qquad \psi_{X2,Y2} = \begin{cases} 0 & \text{if } Y2 = 0 \\ P(Y2{=}1 \mid X2) & \text{if } Y2 = 1 \end{cases}$$

Example: Inference with Evidence (cont.)

• Repeating the same steps as in the previous case, we obtain:

$$\psi_{X1,Y1} = \begin{cases} 0 & \text{if } Y1 = 1 \\ P(X1, Y1{=}0, Y2{=}1) & \text{if } Y1 = 0 \end{cases}$$

$$\phi_{X1} = P(X1, Y1{=}0, Y2{=}1), \quad \psi_{X1,X2} = P(X1, Y1{=}0, X2, Y2{=}1), \quad \phi_{X2} = P(Y1{=}0, X2, Y2{=}1)$$

$$\psi_{X2,Y2} = \begin{cases} 0 & \text{if } Y2 = 0 \\ P(Y1{=}0, X2, Y2{=}1) & \text{if } Y2 = 1 \end{cases}$$

Variable Elimination

General idea:
• Write the query in the form

$$P(X_n, e) = \sum_{x \in \{X_1, \ldots, X_n\} \setminus \{X_n\}} \prod_i P(x_i \mid pa_i)$$

• Iteratively:
  – Move all irrelevant terms outside of the innermost sum
  – Perform the innermost sum, getting a new term
  – Insert the new term into the product
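A small sketch of this idea on a chain A → B → C with binary variables (the CPTs and elimination order are chosen for illustration):

```python
import numpy as np

# Chain A -> B -> C with made-up CPTs.
pA = np.array([0.3, 0.7])
pB_A = np.array([[0.9, 0.1], [0.4, 0.6]])   # P(B|A), rows index A
pC_B = np.array([[0.8, 0.2], [0.1, 0.9]])   # P(C|B), rows index B

# Query P(C): eliminate A first, then B.
tau_B = pA @ pB_A            # sum_A P(A) P(B|A)  -> new term over B
pC = tau_B @ pC_B            # sum_B tau(B) P(C|B)
print(pC)                    # P(C) = [0.485, 0.515]
```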

Complexity of Variable Elimination

• Suppose in one elimination step we compute

$$f(y_1, \ldots, y_k) = \sum_x f'_x(y_1, \ldots, y_k), \qquad f'_x(y_1, \ldots, y_k) = \prod_{i=1}^{m} f_i(x, y_{i,1}, \ldots, y_{i,l_i})$$

This requires:
• $m \cdot |\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(Y_i)|$ multiplications
• $|\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(Y_i)|$ additions

Complexity is exponential in the number of variables in the intermediate factor.

Chordal Graphs

• Elimination ordering ↔ undirected chordal graph

Graph properties:
• Maximal cliques are factors in elimination
• Factors in elimination are cliques in the graph
• Complexity is exponential in the size of the largest clique in the graph

[Figure: a directed network over V, S, T, L, A, B, X, D and its undirected chordal counterpart]

Induced Width

• The size of the largest clique in the induced graph is thus an indicator of the complexity of variable elimination
• This quantity is called the induced width of the graph under the specified ordering
• Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph

Properties of Junction Trees

• In every consistent junction tree:
  – For each cluster (or sepset) $V$: $\phi_V = P(V)$
  – The probability distribution of any variable $X$ can be computed using any cluster (or sepset) $V$ that contains $X$: $P(X) = \sum_{V \setminus \{X\}} \phi_V$

Exact Inference Using Junction Trees

• Undirected tree
• Each node is a cluster
• Running intersection property:
  – Given two clusters $X$ and $Y$, all clusters on the path between $X$ and $Y$ contain $X \cap Y$
• Separator sets (sepsets): intersection of adjacent clusters

[Figure: junction tree (ABD) —[AD]— (ADE) —[DE]— (DEF); e.g. cluster ABD, sepset DE]

Constructing Junction Trees

Marrying parents:

[Figure: DAG over X1–X6; unmarried parents of a common child are joined by an added edge]

Moral Graph

[Figure: the moralized, undirected graph over X1–X6]

Triangulation

[Figure: the triangulated (chordal) graph over X1–X6]

Identify Cliques

[Figure: triangulated graph over X1–X6 with maximal cliques {X1,X2,X3}, {X2,X3,X5}, {X2,X5,X6}, {X2,X4}]

Junction Tree

• A junction tree is a subgraph of the clique graph satisfying the running intersection property

[Figure: junction tree over cliques (X1,X2,X3), (X2,X3,X5), (X2,X5,X6), (X2,X4) with sepsets (X2,X3), (X2,X5), (X2)]

Constructing Junction Trees

DAG → Moral Graph → Triangulated Graph → Identify Cliques → Junction Tree

Sequential Approach (Example)

• Lower bound for the medical diagnosis example, using $f\big(\sum_j z_j\big) \ge \sum_j q_j f(z_j/q_j)$ with $f(x) = \ln(1 - e^{-x})$ and the split $x_j = \theta_{ij} d_j / q_{j|i} + \theta_{i0}$:

$$P(f_i = 1 \mid d) = 1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j} \ge \exp\Big\{\sum_j q_{j|i}\Big[d_j \ln\big(1 - e^{-\theta_{ij}/q_{j|i} - \theta_{i0}}\big) + (1 - d_j)\ln\big(1 - e^{-\theta_{i0}}\big)\Big]\Big\}$$

where $q_{j|i}$ is a probability distribution over the parents of finding $i$. As with the upper bound, the right-hand side factorizes over the diseases $d_j$.
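A numeric check that the bound holds (parameters arbitrary; q uniform over the parents for simplicity):

```python
import numpy as np

theta0, theta = 0.05, np.array([0.9, 0.4, 0.7])
d = np.array([1, 0, 1])
q = np.ones(3) / 3                       # q_{j|i}: any distribution works

exact = 1 - np.exp(-theta0 - theta @ d)
log_lb = np.sum(q * (d * np.log(1 - np.exp(-theta / q - theta0))
                     + (1 - d) * np.log(1 - np.exp(-theta0))))
print(np.exp(log_lb), "<=", exact)       # approx 0.34 <= 0.81
```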
