Variational Inference

Tushar Tank, Feb 11, 2017
Page 1: Variational Inference

Variational Inference

Note: Much (meaning almost all) of this has been liberated from John Winn and Matthew Beal’s theses, and David MacKay’s book.

Page 2: Variational Inference

Overview

• Probabilistic models & Bayesian inference

• Variational Inference

• Univariate Gaussian Example

• GMM Example

• Variational Message Passing

Page 3: Variational Inference

Bayesian networks

• Directed graph
• Nodes represent variables
• Links show dependencies
• Conditional distribution at each node
• Defines a joint distribution:

P(C, L, S, I) = P(L) P(C) P(S|C) P(I|L,S)

[Figure: Bayesian network with nodes C (object class), L (lighting color), S (surface color), and I (image color), annotated with P(L), P(C), P(S|C), and P(I|L,S).]
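
As a quick aside (not from the slides), here is a minimal sketch that evaluates this joint factorisation for a toy discretisation in which every variable is binary; the probability tables are made up purely for illustration.

```python
import numpy as np

# Toy tables, invented for illustration: every variable has two states.
P_L = np.array([0.7, 0.3])                    # P(L)
P_C = np.array([0.6, 0.4])                    # P(C)
P_S_given_C = np.array([[0.9, 0.1],           # P(S|C): rows index C, columns index S
                        [0.2, 0.8]])
P_I_given_LS = np.array([[[0.8, 0.2],         # P(I|L,S): axes are (L, S, I)
                          [0.5, 0.5]],
                         [[0.3, 0.7],
                          [0.1, 0.9]]])

def joint(c, l, s, i):
    """P(C=c, L=l, S=s, I=i) = P(L=l) P(C=c) P(S=s|C=c) P(I=i|L=l, S=s)."""
    return P_L[l] * P_C[c] * P_S_given_C[c, s] * P_I_given_LS[l, s, i]

# The factors are locally normalised, so the joint sums to 1 over all configurations.
total = sum(joint(c, l, s, i)
            for c in range(2) for l in range(2) for s in range(2) for i in range(2))
print(total)  # 1.0 (up to floating point)
```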

Page 4: Variational Inference

Bayesian inference

• Observed variables D and hidden variables H.
• Hidden variables include parameters and latent variables.
• Learning/inference involves finding:
  • P(H1, H2, … | D), or
  • P(H | D, M) explicitly for a generative model.

[Figure: the same Bayesian network (C, L, S, I), with the nodes marked as hidden or observed.]

Page 5: Variational Inference

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ.
• How should we represent this posterior distribution, P(θ|D) ∝ P(D|θ) P(θ)?

Page 6: Variational Inference

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ.

[Figure: plot of P(D|θ) P(θ) against θ, with θ_MAP marked at the maximum of P(D|θ) P(θ).]

Page 7: Variational Inference

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ.

[Figure: plot of P(D|θ) P(θ) against θ, contrasting the high probability density at θ_MAP with the region of high probability mass.]

Page 8: Variational Inference

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ.

[Figure: plot of P(D|θ) P(θ) against θ, with θ_ML marked and samples drawn along the θ axis.]

Page 9: Variational Inference

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ.

[Figure: plot of P(D|θ) P(θ) against θ, with θ_ML marked and a variational approximation Q(θ) fitted to the posterior.]

Page 10: Variational Inference

Variational Inference (in three easy steps…)

1. Choose a family of variational distributions Q(H).

2. Use the Kullback-Leibler divergence KL(Q||P) as a measure of ‘distance’ between P(H|D) and Q(H).

3. Find the Q which minimizes the divergence.

Page 11: Variational Inference

Choose Variational Distribution

• Approximate P(H|D) by Q(H).
• If P is so complex, how do we choose Q?
• Any Q is better than an ML or MAP point estimate.
• Choose Q so that it can “get” close to P and is tractable: factorised, conjugate.

Page 12: Variational Inference

Kullback-Leibler Divergence

• Derived from the variational free energy of Feynman and Bogoliubov.
• Relative entropy between two probability distributions.
• KL(Q||P) ≥ 0 for any Q (Jensen’s inequality); KL(Q||P) = 0 iff P = Q.
• Not a true distance measure: it is not symmetric.

KL(Q||P) = Σ_x Q(x) ln( Q(x) / P(x) )
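
As a quick illustration (my own, not from the slides), here is a minimal sketch computing KL(Q||P) for two discrete distributions; running it shows the asymmetry directly.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(Q||P) = sum_x Q(x) ln( Q(x) / P(x) ) for discrete distributions.

    Terms with Q(x) = 0 contribute 0 by convention.
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = [0.5, 0.4, 0.1]
p = [0.6, 0.2, 0.2]
print(kl_divergence(q, p))  # KL(Q||P)
print(kl_divergence(p, q))  # KL(P||Q): a different value, so KL is not symmetric
print(kl_divergence(q, q))  # 0.0: the divergence vanishes iff the distributions match
```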

Page 13: Variational Inference

Kullback-Leibler Divergence

Minimising KL(Q||P) (exclusive):

  KL(Q||P) = Σ_H Q(H) ln( Q(H) / P(H|D) )

Minimising KL(P||Q) (inclusive):

  KL(P||Q) = Σ_H P(H|D) ln( P(H|D) / Q(H) )

[Figure: sketches of P and the fitted Q in each case; the exclusive fit concentrates Q on part of P, while the inclusive fit spreads Q to cover all of P.]
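
The following sketch (my own illustration, not from the deck) makes the exclusive/inclusive distinction concrete: it fits a single Gaussian Q to a bimodal target P on a grid, once by minimising the exclusive KL(Q||P) and once the inclusive KL(P||Q), using a simple grid search over the Gaussian's mean and standard deviation.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def normal(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target P: a mixture of two well-separated Gaussians.
p = 0.5 * normal(x, -4.0, 1.0) + 0.5 * normal(x, 4.0, 1.0)
p /= np.sum(p) * dx  # renormalise on the grid

def kl(a, b):
    """Discretised KL(a||b) on the grid; a and b are densities over x."""
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

best_excl = best_incl = None
for mu in np.linspace(-6, 6, 121):
    for sigma in np.linspace(0.5, 6, 56):
        q = normal(x, mu, sigma)
        excl = kl(q, p)   # exclusive: KL(Q||P), the form variational inference uses
        incl = kl(p, q)   # inclusive: KL(P||Q)
        if best_excl is None or excl < best_excl[0]:
            best_excl = (excl, mu, sigma)
        if best_incl is None or incl < best_incl[0]:
            best_incl = (incl, mu, sigma)

print("exclusive KL(Q||P) fit: mu=%.2f sigma=%.2f" % best_excl[1:])  # hugs one mode
print("inclusive KL(P||Q) fit: mu=%.2f sigma=%.2f" % best_incl[1:])  # spreads over both modes
```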

Page 14: Variational Inference

Kullback-Leibler Divergence

KL(Q||P) = Σ_H Q(H) ln( Q(H) / P(H|D) )

         = Σ_H Q(H) ln( Q(H) P(D) / P(H,D) )                  (Bayes rule: P(H|D) = P(H,D) / P(D))

         = Σ_H Q(H) ln( Q(H) / P(H,D) ) + Σ_H Q(H) ln P(D)    (Log property)

         = Σ_H Q(H) ln( Q(H) / P(H,D) ) + ln P(D)             (Sum over H: Σ_H Q(H) = 1)

Page 15: Variational Inference

Kullback-Leibler Divergence

DEFINE:

L(Q) = Σ_H Q(H) ln P(H,D) − Σ_H Q(H) ln Q(H)

• L(Q) is the expectation of the log joint ln P(H,D) with respect to Q, plus the entropy of Q.
• Maximising L(Q) is equivalent to minimising the KL divergence.
• We could not do the same trick for KL(P||Q); instead we approximate the posterior with a Q that puts its mass where the posterior is most probable (the exclusive form).

From the previous slide,

KL(Q||P) = Σ_H Q(H) ln( Q(H) / P(H,D) ) + ln P(D)

so

KL(Q||P) = ln P(D) − L(Q)
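
A quick numerical check (my own, not from the deck) that ln P(D) = L(Q) + KL(Q||P) holds for an arbitrary Q, using a small made-up discrete joint P(H, D):

```python
import numpy as np

# Made-up values of P(H, D=d) for 4 hidden states and one particular observed value d.
# They need not sum to 1: the remaining mass belongs to other values of D.
p_joint = np.array([0.10, 0.25, 0.05, 0.20])
p_evidence = p_joint.sum()                 # P(D=d) = sum_H P(H, D=d)
p_posterior = p_joint / p_evidence         # P(H | D=d)

# An arbitrary normalised Q(H).
q = np.array([0.4, 0.3, 0.2, 0.1])

L = np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))   # L(Q)
kl = np.sum(q * np.log(q / p_posterior))                  # KL(Q || P(H|D))

print(np.log(p_evidence))   # ln P(D)
print(L + kl)               # the same number: ln P(D) = L(Q) + KL(Q||P)
```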

Page 16: Variational Inference

Summarize

• For arbitrary Q(H):

  ln P(D) = L(Q) + KL(Q||P),  where  L(Q) = Σ_H Q(H) ln( P(H,D) / Q(H) )

• ln P(D) is fixed, so maximising L(Q) minimises KL(Q||P).
• L(Q) is still difficult in general to calculate, so we choose a family of Q distributions for which L(Q) is tractable to compute.

Pages 17-21: Variational Inference

Minimising the KL divergence

[Figure, animated over five slides: ln P(D) shown decomposed into L(Q) and KL(Q||P); ln P(D) stays fixed, so maximising L(Q) shrinks KL(Q||P).]

Page 22: Variational Inference

Factorised Approximation

• Assume Q factorises:

  Q(H) = Π_i Q_i(H_i)

• The optimal solution for one factor is given by

  Q_j*(H_j) = (1/Z) exp( Σ_{H_i: i≠j} [ Π_{i≠j} Q_i(H_i) ] ln P(H,D) )

  i.e. Q_j*(H_j) ∝ exp( ⟨ ln P(H,D) ⟩ over the other factors Q_i, i≠j )

• Given the form of Q, find the best Q(H) in the KL sense.
• Choose conjugate priors P(H) so that each factor Q_i has the same form as its prior.
• Update each Q_i(H_i) in turn, iteratively (a sketch of this coordinate-ascent loop follows below).
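
The iterative scheme is plain coordinate ascent on L(Q). The skeleton below is a minimal sketch of that loop (my own illustration; the factor names and update functions are placeholders, not from the slides): each factor is refreshed in turn from the current state of the others until the bound stops improving.

```python
def coordinate_ascent_vi(factors, update_fns, lower_bound, max_iters=100, tol=1e-6):
    """Generic mean-field loop.

    factors     : dict mapping factor name -> current parameters of Q_i(H_i)
    update_fns  : dict mapping factor name -> function(all_factors) returning new
                  parameters for that factor, i.e. the Q_j* update computed from
                  expectations under the other factors
    lower_bound : function(all_factors) returning L(Q), used to monitor convergence
    """
    prev = lower_bound(factors)
    for _ in range(max_iters):
        for name, update in update_fns.items():
            factors[name] = update(factors)   # Q_j <- Q_j*, holding the other factors fixed
        curr = lower_bound(factors)
        if abs(curr - prev) < tol:            # L(Q) increases monotonically; stop when flat
            break
        prev = curr
    return factors
```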

Page 23: Variational Inference

Derivation

Goal: show that the optimal factor is

Q_j*(H_j) = (1/Z) exp( Σ_{H_i: i≠j} [ Π_{i≠j} Q_i(H_i) ] ln P(H,D) )

Start from the bound:

L(Q) = Σ_H Q(H) ln P(H,D) − Σ_H Q(H) ln Q(H)

Substitution of Q(H) = Π_i Q_i(H_i):

L(Q) = Σ_H [ Π_i Q_i(H_i) ] ln P(H,D) − Σ_H [ Π_i Q_i(H_i) ] ln Π_j Q_j(H_j)

Log property (the log of the product becomes a sum of logs), then summing each Q_i out of the entropy terms it does not appear in:

L(Q) = Σ_H [ Π_i Q_i(H_i) ] ln P(H,D) − Σ_i Σ_{H_i} Q_i(H_i) ln Q_i(H_i)

Factor out one term Q_j; everything that is not a function of Q_j is a constant:

L(Q) = Σ_{H_j} Q_j(H_j) [ Σ_{H_i: i≠j} ( Π_{i≠j} Q_i(H_i) ) ln P(H,D) ] − Σ_{H_j} Q_j(H_j) ln Q_j(H_j) + const

Idea: use the factorisation of Q to isolate Q_j and maximise L with respect to Q_j. The bracketed term is ln( Z Q_j*(H_j) ), so

L(Q) = −KL( Q_j || Q_j* ) + log Z + const

and L is maximised with respect to Q_j by setting Q_j = Q_j*.

Page 24: Variational Inference

Example: Univariate Gaussian

• Normally distributed data.
• Find the posterior over the mean and precision given the data x.
• Conjugate priors: Normal on the mean, Gamma on the precision.
• Factorised variational distribution.
• Each Q factor has the same form as the corresponding prior.
• Inference involves updating the hidden parameters of these Q factors.

Page 25: Variational Inference

Example: Univariate Gaussian

• Use Q* to derive the update equations for the variational posterior parameters (see the sketch below).
• Here ⟨ ⟩ denotes the expectation with respect to the Q distribution.
• Iteratively solve the coupled updates.
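
The slide's update equations are not reproduced here, so the code below is a sketch of the standard mean-field updates for a Gaussian with unknown mean and precision, using an independent Normal prior on the mean and a Gamma prior on the precision (the priors quoted on the example slide); the variable names and default values are mine.

```python
import numpy as np

def vb_univariate_gaussian(x, m0=0.0, v0=1000.0, a0=1e-3, b0=1e-3, n_iters=50):
    """Mean-field VB for x_n ~ N(mean, 1/precision).

    Priors: mean ~ N(m0, v0), precision ~ Gamma(a0, b0) (shape/rate).
    Variational posterior: Q(mean) = N(m, v), Q(precision) = Gamma(a, b),
    the same forms as the priors. Returns the final (m, v, a, b).
    """
    x = np.asarray(x, dtype=float)
    N, sx, sxx = len(x), x.sum(), (x ** 2).sum()

    a = a0 + 0.5 * N    # the shape update does not depend on the other factor
    b = b0              # initial rate
    for _ in range(n_iters):
        e_prec = a / b                          # <precision> under Q(precision)
        # Update Q(mean): precision-weighted combination of prior and data.
        v = 1.0 / (1.0 / v0 + N * e_prec)
        m = v * (m0 / v0 + e_prec * sx)
        # Update Q(precision): uses <mean> = m and <mean^2> = m^2 + v.
        b = b0 + 0.5 * (sxx - 2.0 * m * sx + N * (m ** 2 + v))
    return m, v, a, b

# Four samples from a Gaussian, as in the example on a later slide.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=4)
m, v, a, b = vb_univariate_gaussian(data)
print("Q(mean)      = N(%.3f, %.3f)" % (m, v))
print("Q(precision) = Gamma(%.3f, %.3f), mean %.3f" % (a, b, a / b))
```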

Page 26: Variational Inference

Example: Univariate Gaussian

• An estimate of the log evidence can be found by calculating L(Q).
• Here ⟨ . ⟩ denotes expectations with respect to Q(.).

Page 27: Variational Inference

Example

Take four data samples from a Gaussian (thick line) and compute the posterior; the dashed lines show the variational approximation.

[Figure: variational and true posteriors for a Gaussian given four samples, with priors N(0, 1000) on the mean and Gamma(0.001, 0.001) on the precision.]

Page 28: Variational Inference

VB with Image Segmentation

[Figure: an image with two pixel locations marked, alongside the RGB histograms at those locations.]

• RGB histogram of two pixel locations.
• “VB at the pixel level will give better results.”
• A feature vector (x, y, Vx, Vy, r, g, b) will have issues with data association.
• VB with a GMM will be complex; doing this in real time will be execrable.

Page 29: Variational Inference

Lower Bound for GMM (ugly)

Page 30: Variational Inference

Variational Equations for GMM (ugly)
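
The GMM lower bound and update equations are not reproduced here. As a practical stand-in, the sketch below uses scikit-learn's BayesianGaussianMixture, which implements a variational GMM of this kind; the library reference is my addition, not part of the deck.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic data from two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, size=(200, 2)),
               rng.normal(+3.0, 1.0, size=(200, 2))])

# Fit a variational GMM; surplus components get their mixing weights driven toward zero.
vb_gmm = BayesianGaussianMixture(n_components=5, covariance_type="full",
                                 max_iter=200, random_state=0)
vb_gmm.fit(X)

print(np.round(vb_gmm.weights_, 3))   # mixing weights: most of the mass on ~2 components
print(np.round(vb_gmm.means_, 2))     # component means near (-3, -3) and (+3, +3)
```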

Page 31: Variational Inference

Brings up Variational Message Passing (VMP) for efficient computation

[Figure: the Bayesian network from the earlier slides, with nodes C (object class), L (lighting color), S (surface color), I (image color) and factors P(L), P(C), P(S|C), P(I|L,S).]