Improvements to Variational Bayesian Inference. Yee Whye Teh, Max Welling, Kenichi Kurihara, David Newman. March 26, 2008, Newton Institute, Cambridge.
Page 1

Improvements to Variational Bayesian Inference

Yee Whye Teh, Max Welling, Kenichi Kurihara, David Newman

March 26, 2008, Newton Institute, Cambridge

Page 2

Outline

Introduction

Standard Inference Algorithms for LDA

Collapsed Variational Bayes for LDA

Hierarchical Dirichlet Processes

Collapsed Variational Bayes for HDP

Discussion

Page 3

Outline

Introduction

Standard Inference Algorithms for LDA

Collapsed Variational Bayes for LDA

Hierarchical Dirichlet Processes

Collapsed Variational Bayes for HDP

Discussion

Page 4

Bayesian Networks

[Figure: example Bayesian network ("Asia") with nodes Asia, Tuberculosis, Smoker, Lung cancer, Bronchitis, X-ray, Dyspnoea]

[Pearl 1988, Heckerman 1995]

Page 5

Bayesian Networks
Assumptions

I Discrete networks.
I Parameter independence.
I Conjugate Dirichlet priors.

p(x | θ) = ∏_i p(x_i | x_pa(i)) = ∏_{i,j,k} θ_ijk^(δ(x_pa(i) = j) δ(x_i = k))

p(θ | α) = ∏_{i,j} [Γ(Σ_k α_ijk) / ∏_k Γ(α_ijk)] ∏_k θ_ijk^(α_ijk − 1)

i = index of variable, j = value of the parents x_pa(i) of x_i, k = value of x_i.

Page 6

Bayesian Networks
Example: Naïve Bayes

[Graphical model: class → features, repeated over data items]

I Each class is described by a product distribution over features.

Page 7

Bayesian Networks
Example: Document Clustering

[Graphical model: plates over documents and words, with cluster parameters θ]

I Each document belongs to a cluster.
I Words in each document are iid, drawn from a cluster-specific distribution over vocabulary.

Page 8

Bayesian Networks
Example: Latent Dirichlet Allocation

[Graphical model: plates over documents and words, with topic parameters θ]

I Each document is described by a distribution over topics.
I For each word: draw a topic, then draw the word itself from a topic-specific distribution over vocabulary.
I Mixed membership model; admixture.

[Blei, Ng and Jordan 2003]

Page 9

Bayesian Networks
Example: Biclustering

[Figure: data matrix with jointly clustered rows and columns]

Page 10

Bayesian Networks
Example: Stochastic Block Model

[Figure: data matrix with row and column blocks and block parameters θ]

[Airoldi et al 2007]

Page 11

Bayesian Networks
Inference

I Observed variables x, unobserved z.
I Parameters θ.
I We wish to compute (marginals of) the posterior:

p(z, θ | x) = p(x, z | θ) p(θ) / p(x)

I Computational techniques:
  I Markov chain Monte Carlo,
  I variational approximations.

Page 12

Variational Bayes

I Observed variables x, unobserved z, parameters θ.
I Posterior:

p(z, θ | x) = p(x, z | θ) p(θ) / p(x) = argmax_q(z,θ) E_q[log p(x, z, θ) − log q(z, θ)]

I Wish to optimize the variational free energy:

F(q(z, θ)) = E_q[log p(x, z, θ) − log q(z, θ)]

[Beal 2003]

Page 13

Variational Bayes

I Variational free energy:

F(q(z, θ)) = E_q[log p(x, z, θ) − log q(z, θ)]

I Factorization approximation:

q(z, θ) = q(z) q(θ)

I Variational EM algorithm:
  Variational E step: q(z) ∝ exp E_q(θ)[log p(x, z, θ)]
  Variational M step: q(θ) ∝ exp E_q(z)[log p(x, z, θ)]

Page 14

Variational Bayes

q(z) ∝ exp E_q(θ)[log p(x, z, θ)]
q(θ) ∝ exp E_q(z)[log p(x, z, θ)]

I If p(x, z | θ) is an exponential family with tractable conjugate prior p(θ), then
  I q(z) takes the same form as p(z | x, θ);
  I q(θ) takes the same form as p(θ).
I Computational cost of variational Bayes is equivalent to EM.
I But: biased.
I This talk: improve the approximation without incurring extra computational expense.

Page 15

Latent Dirichlet Allocation (LDA)

[Graphical model: θ_d → z_id → x_id ← φ_k; plates over topics k = 1...K, documents d = 1...D, words i = 1...n_d]

I The d-th document's distribution over topics:

θ_d | α, π ∼ Dirichlet(απ)

I Corpus-wide topics (distributions over words):

φ_k | β, τ ∼ Dirichlet(βτ)

I The i-th word in the d-th document:
  I Draw topic z_id | θ_d ∼ Discrete(θ_d)
  I Draw word x_id | z_id, φ ∼ Discrete(φ_z_id)

[Blei, Ng and Jordan 2003, Griffiths and Steyvers 2004]
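The generative process above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' code: the function names are invented, the base measures π and τ are taken uniform (so the Dirichlet parameters are symmetric), and only the standard library is used.

```python
import random

def dirichlet(alpha):
    """Dirichlet draw via normalized Gamma variates."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def generate_lda(D, n_words, K, W, alpha=0.1, beta=0.1, seed=0):
    """Draw a toy corpus from the LDA generative process on the slide."""
    random.seed(seed)
    # corpus-wide topics: phi_k ~ Dirichlet(beta * tau), tau taken uniform here
    phi = [dirichlet([beta] * W) for _ in range(K)]
    corpus = []
    for d in range(D):
        theta_d = dirichlet([alpha] * K)  # document's distribution over topics
        doc = []
        for i in range(n_words):
            z = random.choices(range(K), weights=theta_d)[0]          # draw topic z_id
            doc.append(random.choices(range(W), weights=phi[z])[0])   # draw word x_id
        corpus.append(doc)
    return corpus

corpus = generate_lda(D=5, n_words=20, K=3, W=10)
```

With small α each document concentrates on few topics, which is the admixture behaviour described on page 8.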

Page 16

Latent Dirichlet Allocation (LDA)
Inferred Topics on KOS

n=48576: november, poll, house, account, electoral, senate, governor, republicans, polls, vote
n=48190: iraq, war, military, iraqi, american, troops, bush, soldiers, people, forces
n=47944: bush, administration, years, tax, year, bushs, time, million, health, america
n=45580: bush, kerry, poll, percent, general, voters, polls, president, vote, election
n=42552: bush, administration, house, white, president, intelligence, report, officials, commission, defense
n=42190: senate, house, race, elections, republican, state, democrats, seat, district, democratic
n=40050: people, political, party, republicans, conservative, issue, marriage, rights, gay, vote
n=40030: party, campaign, republican, democratic, democrats, election, republicans, state, million, states
n=39086: bush, kerry, president, news, general, media, campaign, john, time, cheney
n=34138: dean, kerry, edwards, primary, democratic, clark, iowa, poll, gephardt, lieberman

Page 17

Latent Dirichlet Allocation (LDA)
Inferred Topics on NIPS

n=104314: network, weight, training, learning, error, set, unit, output, hidden, performance
n=81815: distribution, gaussian, data, mean, model, bayesian, probability, variables, prior, posterior
n=80199: training, classifier, classification, set, data, class, performance, pattern, test, error
n=73581: model, data, parameter, mixture, likelihood, estimation, hmm, probability, density, markov
n=72394: neuron, synaptic, model, firing, cell, spike, input, synapses, potential, network
n=67775: speech, recognition, word, system, network, character, training, speaker, neural, input
n=66250: equation, system, point, dynamic, function, parameter, learning, matrix, fixed, network
n=65897: unit, network, input, output, hidden, layer, neural, recurrent, weight, activation
n=65642: error, training, learning, generalization, prediction, weight, input, set, network, neural
n=64818: function, bound, theorem, approximation, case, result, number, loss, proof, error

Page 18

Outline

Introduction

Standard Inference Algorithms for LDA

Collapsed Variational Bayes for LDA

Hierarchical Dirichlet Processes

Collapsed Variational Bayes for HDP

Discussion

Page 19

Standard Gibbs Sampling for LDA

[Graphical model: θ_d → z_id → x_id ← φ_k; plates over topics, documents, words]

I Conditional distributions for Gibbs sampling:

θ_d | z ∼ Dirichlet(απ + (n_d1, ..., n_dK))
φ_k | x, z ∼ Dirichlet(βτ + (n_k1, ..., n_kW))
p(z_id = k | θ_d, φ, x_id) ∝ θ_dk φ_k,x_id

I n_dkw = Σ_i I[z_id = k] I[x_id = w]. Missing indices are summed out.
I Strong coupling between θ, φ and z.

Page 20

Standard Variational Bayes for LDA

[Graphical models: LDA, and the factorized variational approximation with the couplings between θ_d, z_id and φ_k removed]

I Basic factorization induces further simplifications:

q(z, θ, φ) = q(z) q(θ, φ) = ∏_id q(z_id) ∏_d q(θ_d) ∏_k q(φ_k)

Page 21

Standard Variational Bayes for LDA

[Graphical model: θ_d → z_id → x_id ← φ_k; plates over topics, documents, words]

I Variational posteriors:

q(θ_d) ← Dirichlet(θ_d; απ + E_q(z)[n_d:])
q(φ_k) ← Dirichlet(φ_k; βτ + E_q(z)[n_k:])
q(z_id = k) ∝ exp(E_q(θ)[log θ_dk] + E_q(φ)[log φ_k,x_id])

I Structurally very similar to the Gibbs conditionals.
I Strong coupling between θ, φ and z.

Page 22

Collapsed Gibbs Sampling for LDA

[Graphical model: θ_d → z_id → x_id ← φ_k; plates over topics, documents, words]

I Integrate out θ, φ (Rao-Blackwellize), and Gibbs sample z:

p(z_id = k | z^¬id, x_id) ∝ (απ_k + n^¬id_dk) (βτ_x_id + n^¬id_k,x_id) / (β + n^¬id_k)

I Each z_id interacts with the other latent variables z^¬id only via the n counts, so the effect of any one z_i'd' on z_id is weak.
I Faster convergence; the mean field approximation is expected to work well.
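As a concrete sketch of this sampler (assuming symmetric hyperparameters, i.e. απ_k = α and βτ_w = β for all k and w, so the denominator becomes Wβ + n_k; function name and defaults are illustrative):

```python
import random

def collapsed_gibbs_lda(docs, K, W, alpha=0.1, beta=0.1, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA: theta and phi are integrated out,
    and each z_id is resampled from its conditional given the count arrays."""
    random.seed(seed)
    z = [[random.randrange(K) for _ in doc] for doc in docs]
    ndk = [[0] * K for _ in docs]      # topic counts per document
    nkw = [[0] * W for _ in range(K)]  # word counts per topic
    nk = [0] * K                       # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1  # remove z_id from the counts
                # p(z_id = j | rest) ∝ (alpha + n_dj) (beta + n_jw) / (W beta + n_j)
                weights = [(alpha + ndk[d][j]) * (beta + nkw[j][w]) / (W * beta + nk[j])
                           for j in range(K)]
                k = random.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z
```

Note how the sampled z_id enters the conditionals of every other word only through the three count arrays, which is the weak-interaction point made above.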

Page 23

Variational Bayes vs Gibbs Sampling in LDA

I Variational Bayes
  + Easier to debug.
  + Easy to diagnose convergence.
  − Derivations more involved.
  − Approximate posterior potentially far from the truth.
  + Lower bound on the marginal probability of the data.
  + Easy to analyse the result of inference.
I Gibbs Sampling
  − Often hard to debug.
  − Hard to diagnose convergence (if ever).
  + Will converge to the true posterior if willing to wait.
  + Unconverged samples may still be "good enough" for prediction.
  − No good way to compute the marginal probability of the data.
  − Unclear how to combine multiple samples for analysis, due to non-identifiability.

Page 24

Outline

Introduction

Standard Inference Algorithms for LDA

Collapsed Variational Bayes for LDA

Hierarchical Dirichlet Processes

Collapsed Variational Bayes for HDP

Discussion

Page 25

Collapsed Variational Bayes for LDA

[Graphical models: LDA, and the collapsed model over the z_id alone after integrating out θ and φ]

I Integrate out θ, φ (Rao-Blackwellize), and factorize over z:

q(z) = ∏_id q(z_id)

[Teh, Newman and Welling 2007]

Page 26

Collapsed Variational Bayes for LDA

[Graphical model: θ_d → z_id → x_id ← φ_k; plates over topics, documents, words]

I Variational posterior updates:

q(z_id = k) ∝ exp( E_q[log(απ_k + n^¬id_dk)] + E_q[log(βτ_x_id + n^¬id_k,x_id)] − E_q[log(β + n^¬id_k)] )

I Structurally similar to collapsed Gibbs sampling.
I Weak interactions among the z_id.
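Dropping the expectation of the log in favour of the log of the expected counts gives a simplified zeroth-order scheme (later known as CVB0; this is a simplification, not the full second-order method of the slides). A sketch with symmetric hyperparameters, where only expected counts are maintained:

```python
import random

def cvb0_lda(docs, K, W, alpha=0.1, beta=0.1, iters=20, seed=0):
    """CVB0-style updates: the collapsed variational update with the
    variance (second-order) correction dropped, keeping expected counts only."""
    random.seed(seed)
    D = len(docs)
    q = [[None] * len(doc) for doc in docs]  # q[d][i][k] = q(z_id = k)
    ndk = [[0.0] * K for _ in range(D)]      # E[n_dk]
    nkw = [[0.0] * W for _ in range(K)]      # E[n_kw]
    nk = [0.0] * K                           # E[n_k]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            raw = [random.random() + 1e-9 for _ in range(K)]
            s = sum(raw)
            q[d][i] = [r / s for r in raw]
            for k in range(K):
                ndk[d][k] += q[d][i][k]; nkw[k][w] += q[d][i][k]; nk[k] += q[d][i][k]
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = q[d][i]
                # zeroth-order update using the ¬id expected counts
                new = [(alpha + ndk[d][k] - old[k])
                       * (beta + nkw[k][w] - old[k])
                       / (W * beta + nk[k] - old[k]) for k in range(K)]
                s = sum(new)
                new = [x / s for x in new]
                for k in range(K):
                    delta = new[k] - old[k]
                    ndk[d][k] += delta; nkw[k][w] += delta; nk[k] += delta
                q[d][i] = new
    return q
```

The loop is deterministic given the initialization, unlike the Gibbs sampler, and each sweep costs the same O(NK) as EM.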

Page 27

Collapsed Variational Bayes for LDA

[Graphical models: LDA, and the variant in which q(θ, φ | z) is retained]

I Dependence of θ, φ on z treated exactly. Another approach:

q(z, θ, φ) = q(θ, φ | z) ∏_id q(z_id)

Page 28

Collapsed Variational Bayes for LDA
Equivalence of approaches

I Variational free energy:

F(q(z, θ, φ)) = F(q(z) q(θ, φ | z))
             = E_q(z,θ,φ)[log p(x, z, θ, φ) − log q(z, θ, φ)]
             = E_q(z)[ E_q(θ,φ|z)[log p(x, z, θ, φ) − log q(θ, φ | z)] − log q(z) ]

I The optimum of q(θ, φ | z) is p(θ, φ | x, z); plugging this in, we get

max_q(θ,φ|z) F(q(z) q(θ, φ | z)) = E_q(z)[log p(x, z) − log q(z)]

I Both formulations are equivalent.

Page 29

Collapsed Variational Bayes for LDA
Efficient computations

I Collapsed variational updates:

q(z_id = k) ∝ exp( E_q[log(απ_k + n^¬id_dk)] + E_q[log(βτ_x_id + n^¬id_k,x_id)] − E_q[log(β + n^¬id_k)] )

I Need to compute terms of the form E[log(a + n)], where n = Σ_l b_l and the b_l are independent Bernoulli variables, b_l ∼ Bernoulli(ρ_l).
I Can compute these with fast Fourier transforms, but a second-order Taylor approximation works very well:

E[log(a + n)] ≈ log(a + E[n]) − V[n] / (2 (a + E[n])²)
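The approximation is easy to check numerically against a brute-force Monte Carlo estimate (the Bernoulli probabilities below are arbitrary test values, not taken from the experiments):

```python
import math
import random

def e_log_a_plus_n(a, rhos):
    """Second-order approximation to E[log(a + n)], n = sum of Bernoulli(rho_l):
    log(a + E[n]) - V[n] / (2 (a + E[n])^2)."""
    mean = sum(rhos)
    var = sum(r * (1 - r) for r in rhos)  # variance of a sum of independent Bernoullis
    return math.log(a + mean) - var / (2 * (a + mean) ** 2)

def e_log_a_plus_n_mc(a, rhos, samples=200000, seed=0):
    """Brute-force Monte Carlo estimate, for checking the approximation."""
    random.seed(seed)
    total = 0.0
    for _ in range(samples):
        n = sum(1 for r in rhos if random.random() < r)
        total += math.log(a + n)
    return total / samples

rhos = [0.3, 0.7, 0.5, 0.2]  # arbitrary test probabilities
approx = e_log_a_plus_n(0.5, rhos)
mc = e_log_a_plus_n_mc(0.5, rhos)
```

Even with only four Bernoulli terms and a small a, the two values agree to within a few hundredths of a nat; the agreement improves as the counts grow.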

Page 30

Collapsed Variational Bayes for LDA
Experimental Results

I Corpora:
  I KOS: D = 3430, W = 6909, N = 467714, K = 8.
  I NIPS: D = 1675, W = 12419, N = 2166029, K = 40.
I 10% of words in each document withheld as a test set.
I α = β = 0.1.
I Repeated 50 times.
I Report both bounds on marginal probabilities of the training set, and predictive probabilities on the test set.

Page 31

Collapsed Variational Bayes for LDA
Bounds on Marginal Probabilities on KOS and NIPS

[Figure: per-word bounds on marginal log-probability over iterations, and histograms over the 50 repeated runs, comparing Collapsed VB and Standard VB on KOS and NIPS]

Page 32

Collapsed Variational Bayes for LDA
Predictive Probabilities on KOS and NIPS

[Figure: per-word predictive log-probabilities over iterations, and histograms over runs, comparing Collapsed Gibbs, Collapsed VB and Standard VB on KOS and NIPS]

Page 33

Outline

Introduction

Standard Inference Algorithms for LDA

Collapsed Variational Bayes for LDA

Hierarchical Dirichlet Processes

Collapsed Variational Bayes for HDP

Discussion

Page 34

Latent Dirichlet Allocation
Hyperpriors and Model Selection/Averaging

[Graphical model: θ_d → z_id → x_id ← φ_k; plates over topics, documents, words]

I Sensitivity to hyperparameter values.
I Sensitivity to the number of topics K.
I Model selection/averaging is inefficient.
I Limitations of parametric models.

Page 35

Latent Dirichlet Allocation
Hyperpriors and Model Selection/Averaging

[Graphical models: LDA, and LDA augmented with hyperpriors on γ, α, β, π, τ]

γ ∼ Gamma(a_γ, b_γ)   α ∼ Gamma(a_α, b_α)   β ∼ Gamma(a_β, b_β)
π ∼ Dirichlet(γ/K, ..., γ/K)   τ ∼ Dirichlet(a_τ/W, ..., a_τ/W)

Page 36

Hierarchical Dirichlet Processes
Nonparametric Alternative to LDA

[Graphical models: LDA with topics k = 1...K, and the HDP variant with topics k = 1...∞]

γ ∼ Gamma(a_γ, b_γ)   α ∼ Gamma(a_α, b_α)   β ∼ Gamma(a_β, b_β)
π ∼ GEM(γ)   τ ∼ Dirichlet(a_τ/W, ..., a_τ/W)

Page 37

Hierarchical Dirichlet Process
Specification in terms of Random Measures

[Graphical model: HDP with topics k = 1...∞; plates over documents and words]

I Equivalent to:

G_0 ∼ DP(γ, Dirichlet(βτ))   G_d ∼ DP(α, G_0)
y_id ∼ G_d   x_id ∼ Discrete(y_id)

[Teh et al 2006]

Page 38

Outline

Introduction

Standard Inference Algorithms for LDA

Collapsed Variational Bayes for LDA

Hierarchical Dirichlet Processes

Collapsed Variational Bayes for HDP

Discussion

Page 39

Collapsed Variational Bayes for HDP

[Graphical model: HDP with topics k = 1...∞; plates over documents and words]

I Wish to deal with a full variational posterior.
I Need to consider a countably infinite number of topics.
  I Truncate the posterior.
I Parameter priors are no longer independent.
I Hyperpriors not conjugate to the priors.
I Collapsed variational Bayes with auxiliary variables.

[Teh, Kurihara and Welling 2008]

Page 40

Collapsed Variational Bayes for HDP

[Graphical model: full HDP]

1. Integrate out parameters θ, φ from the HDP.
2. Introduce auxiliary variables η, s, ξ, t.
3. Factorize the posterior.
4. Truncate the posterior.

Page 41

Collapsed Variational Bayes for HDP

[Graphical model: after step 1, with θ and φ integrated out]

1. Integrate out parameters θ, φ from the HDP.
2. Introduce auxiliary variables η, s, ξ, t.
3. Factorize the posterior.
4. Truncate the posterior.

Page 42

Collapsed Variational Bayes for HDP

[Graphical model: after step 2, with auxiliary variables η_d, s_d, ξ_k, t_k added]

1. Integrate out parameters θ, φ from the HDP.
2. Introduce auxiliary variables η, s, ξ, t.
3. Factorize the posterior.
4. Truncate the posterior.

Page 43

Collapsed Variational Bayes for HDP

[Graphical model: after step 3, with the posterior factorized]

1. Integrate out parameters θ, φ from the HDP.
2. Introduce auxiliary variables η, s, ξ, t.
3. Factorize the posterior.
4. Truncate the posterior.

Page 44

Collapsed Variational Bayes for HDP

[Graphical model: after step 4, truncated to topics k = 1...K]

1. Integrate out parameters θ, φ from the HDP.
2. Introduce auxiliary variables η, s, ξ, t.
3. Factorize the posterior.
4. Truncate the posterior.

Page 45

Collapsed Variational Bayes for HDP

1. Integrate out parameters θ and φ:

p(z:: | α, π) = ∏_{d=1...D} [Γ(α) / Γ(α + n_d)] ∏_{k=1...∞} [Γ(απ_k + n_dk) / Γ(απ_k)]

p(x:: | z::, β, τ) = ∏_{k=1...∞} [Γ(β) / Γ(β + n_k)] ∏_{w=1...W} [Γ(βτ_w + n_kw) / Γ(βτ_w)]

Page 46

Collapsed Variational Bayes for HDP

2. Introduce auxiliary variables η, s, ξ and t:

Γ(α) / Γ(α + n_d) = (1/Γ(n_d)) ∫₀¹ η_d^(α−1) (1 − η_d)^(n_d − 1) dη_d

Γ(απ_k + n_dk) / Γ(απ_k) = Σ_{s_dk = 0...n_dk} [n_dk s_dk] (απ_k)^s_dk

Γ(β) / Γ(β + n_k) = (1/Γ(n_k)) ∫₀¹ ξ_k^(β−1) (1 − ξ_k)^(n_k − 1) dξ_k

Γ(βτ_w + n_kw) / Γ(βτ_w) = Σ_{t_kw = 0...n_kw} [n_kw t_kw] (βτ_w)^t_kw

I The formulas for η_d, ξ_k are Beta identities.
I The formulas for s_dk, t_kw are generating functions for the numbers of tables in CRPs.
I [n m] are unsigned Stirling numbers of the first kind.
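The generating-function identities for s_dk and t_kw can be verified numerically: Γ(a + n)/Γ(a) is the rising factorial a(a + 1)···(a + n − 1), and it should equal Σ_s [n s] a^s. A small sketch (function names are illustrative):

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling1_unsigned(n, m):
    """Unsigned Stirling numbers of the first kind via the recurrence
    [n m] = [n-1 m-1] + (n-1) [n-1 m]."""
    if n == 0 and m == 0:
        return 1
    if n == 0 or m == 0:
        return 0
    return stirling1_unsigned(n - 1, m - 1) + (n - 1) * stirling1_unsigned(n - 1, m)

def gamma_ratio_via_stirling(a, n):
    """Gamma(a + n) / Gamma(a) = sum_{s=0}^{n} [n s] a^s (identity on the slide)."""
    return sum(stirling1_unsigned(n, s) * a ** s for s in range(n + 1))

lhs = math.gamma(0.7 + 5) / math.gamma(0.7)   # rising factorial 0.7 * 1.7 * ... * 4.7
rhs = gamma_ratio_via_stirling(0.7, 5)
```

The normalized coefficient of a^s in this expansion is exactly the probability that a CRP with concentration a seats n customers at s tables, which is why s_dk and t_kw behave as table counts.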

Page 47

Collapsed Variational Bayes for HDP

3. Assume the factorization:

q(α, β, γ, τ, π, η:, s::, ξ:, t::, z::)
  = q(γ) q(α, β, τ, π) q(η:, s::, ξ:, t:: | z::) ∏_id q(z_id)
  = q(γ) q(α) q(β) q(τ) q(π) ∏_d q(η_d | z::) ∏_dk q(s_dk | z::) ∏_k q(ξ_k | z::) ∏_kw q(t_kw | z::) ∏_id q(z_id)

Page 48

Collapsed Variational Bayes for HDP

4. Constrain all posterior mass to the first K topics; assume:

q(z_id = k) = 0 for all i, d, k > K

Stick-breaking: π_k = π̃_k ∏_{l=1...k−1} (1 − π̃_l),   π̃_k | γ ∼ Beta(1, γ)
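A sketch of this truncated stick-breaking construction (the inverse-CDF Beta(1, γ) draw, the truncation at K, and the function name are the only assumptions made here):

```python
import random

def gem_weights(gamma, K, seed=0):
    """Truncated stick-breaking draw of pi ~ GEM(gamma):
    pi_k = pi~_k * prod_{l<k} (1 - pi~_l), with pi~_k ~ Beta(1, gamma)."""
    random.seed(seed)
    remaining = 1.0  # length of the stick not yet broken off
    pis = []
    for _ in range(K - 1):
        # Beta(1, gamma) by inverse CDF: F(x) = 1 - (1 - x)^gamma
        frac = 1.0 - random.random() ** (1.0 / gamma)
        pis.append(remaining * frac)
        remaining *= (1.0 - frac)
    pis.append(remaining)  # truncation at K: the last stick takes the rest
    return pis
```

Assigning the leftover mass to the K-th weight makes the truncated vector sum to one, which matches constraining all posterior mass to the first K topics.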

Page 49

Collapsed Variational Bayes for HDP

5. Improved second-order approximation.

I Approximate E[log ξ_k], E[log η_d], E[s_dk] and E[t_kw] for efficiency.
I A second-order Taylor expansion of log(n) fails for Ψ(n) due to its faster divergence at n = 0.
I Treat n = 0 exactly, and approximate n > 0.

[Figure: per-topic error ratio to the exact value (%) for the plain 2nd-order approximation vs the 2nd-order approximation with 0-treatment, alongside E[n_k] per topic]

Page 50

Collapsed Variational Bayes for HDP
Experimental Results

I Corpora:
  I KOS: D = 3430, W = 6909, N = 467714.
  I Reuters: D = 8433, W = 4593, N = 566298.
  I NIPS: D = 1675, W = 12419, N = 2166029.
I 10% of words in each document withheld as a test set.
I Repeated 10 times.
I Report both bounds on marginal probabilities of the training set, and predictive probabilities on the test set.

Page 51

Collapsed Variational Bayes for HDP
Bounds on Marginal Probabilities

V: variational, CV: collapsed variational.
I Variational bound significantly better than VLDA or CVLDA.

Page 52

Collapsed Variational Bayes for HDP
Predictive Probabilities

V: variational, CV: collapsed variational, G: collapsed Gibbs.
I Predictive probabilities better than VLDA and CVLDA.
I Better than GLDA.
I Worse than GHDP.
I Note: GHDP1 and GHDP100 give different results, indicating a local optima issue.

Page 53

Collapsed Variational Bayes for HDP
Local Optima Issues?

V: variational, CV: collapsed variational, G: collapsed Gibbs.
I Initializing at the converged mode of GHDP100 gives better results (almost the same as GHDP100).
I Local optima problem: the Gibbs sampler is better at escaping bad local optima; if we can find a good local optimum for CVHDP, it can work very well.

Page 54

Outline

Introduction

Standard Inference Algorithms for LDA

Collapsed Variational Bayes for LDA

Hierarchical Dirichlet Processes

Collapsed Variational Bayes for HDP

Discussion

Page 55

Variational Inference vs MCMC

I Variational Inference
  − Applicable in a limited range of models.
  + Easier to debug.
  + Easy to diagnose convergence.
  − Derivations more involved.
  − Approximate posterior potentially far from the truth.
  + Lower bound on the marginal probability of the data.
  + Analysis of the posterior is easier.
I Markov Chain Monte Carlo (MCMC)
  + Applicable in a wide range of models.
  − Often hard to debug.
  − Hard to diagnose convergence (if ever).
  + Will converge to the true posterior if willing to wait.
  + Unconverged samples may still be "good enough".
  − No good way to compute the marginal probability of the data.
  − Analysis of the posterior is harder due to non-identifiability.

Page 56

Variational Inference in Nonparametric Models

I Inference in nonparametric models is currently dominated by MCMC methods.
I Only recently has variational inference been proposed, and only for DP mixtures at that.
I We explore variational inference for the HDP.
  I More choices for inference algorithms in nonparametric models.
  I Compare variational and MCMC in a specific circumstance.
  I The techniques developed are applicable to other models.
I Specific case of the HDP applied to topic modelling.
  I Here the HDP can be seen as a nonparametric generalization of latent Dirichlet allocation (LDA).
  I Lessons learned here can be applied to other settings.

Page 57

Discussion

I Variational approximation taken to the extreme.
I Need to resolve the local optima issue.
I The techniques developed here are applicable to many models composed of discrete and Dirichlet variables.
I Infinite State Bayesian Networks (ISBNs) are a nonparametric generalization of Bayesian networks.
  I Use hierarchical Dirichlet processes as priors over sets of parameters.
I Future work: variational approximations for HDPs with more than two layers.

[Welling, Porteous and Bart 2008]

Page 58

Variational Updates
Caution: Inspect Only with Magnifying Glass

Update counts from q_idk = q(z_id = k):

E[n_.k.] = Σ_{d,i} q_idk;   V[n_.k.] = Σ_{d,i} q_idk (1 − q_idk);
P₊[n_.k.] = 1 − ∏_{d,i} (1 − q_idk);   E₊[n_.k.] = E[n_.k.] / P₊[n_.k.];
V₊[n_.k.] = V[n_.k.] / P₊[n_.k.] − (1 − P₊[n_.k.]) E₊[n_.k.]²;

E[n_dk.] = Σ_i q_idk;   V[n_dk.] = Σ_i q_idk (1 − q_idk);
P₊[n_dk.] = 1 − ∏_i (1 − q_idk);   E₊[n_dk.] = E[n_dk.] / P₊[n_dk.];
V₊[n_dk.] = V[n_dk.] / P₊[n_dk.] − (1 − P₊[n_dk.]) E₊[n_dk.]²;

E[n_.kw] = Σ_{d,i: x_id = w} q_idk;   V[n_.kw] = Σ_{d,i: x_id = w} q_idk (1 − q_idk);
P₊[n_.kw] = 1 − ∏_{d,i: x_id = w} (1 − q_idk);   E₊[n_.kw] = E[n_.kw] / P₊[n_.kw];
V₊[n_.kw] = V[n_.kw] / P₊[n_.kw] − (1 − P₊[n_.kw]) E₊[n_.kw]²;

Update auxiliary variable posteriors:

E[log η_d] = Ψ(E[α]) − Ψ(E[α] + n_d..);
E[log ξ_k] ≈ P₊[n_.k.] ( Ψ(E[β]) − Ψ(E[β] + E₊[n_.k.]) − ½ V₊[n_.k.] Ψ″(E[β] + E₊[n_.k.]) );
E[s_dk] ≈ G[α] G[π_k] P₊[n_dk.] ( Ψ(G[α]G[π_k] + E₊[n_dk.]) − Ψ(G[α]G[π_k]) + ½ V₊[n_dk.] Ψ″(G[α]G[π_k] + E₊[n_dk.]) );
E[t_kw] ≈ G[β] G[τ_w] P₊[n_.kw] ( Ψ(G[β]G[τ_w] + E₊[n_.kw]) − Ψ(G[β]G[τ_w]) + ½ V₊[n_.kw] Ψ″(G[β]G[τ_w] + E₊[n_.kw]) );

Page 59

Variational Updates
Caution: Inspect Only with Magnifying Glass

Update hyperparameters:

E[α] = (a_α + Σ_dk E[s_dk]) / (b_α − Σ_d E[log η_d]);   G[α] = exp(Ψ(a_α + Σ_dk E[s_dk])) / (b_α − Σ_d E[log η_d]);
E[β] = (a_β + Σ_kw E[t_kw]) / (b_β − Σ_k E[log ξ_k]);   G[β] = exp(Ψ(a_β + Σ_kw E[t_kw])) / (b_β − Σ_k E[log ξ_k]);
G[π̃_k] = exp(Ψ(1 + Σ_d E[s_dk])) / exp(Ψ(γ + 1 + Σ_d Σ_{l≥k} E[s_dl]));
G[1 − π̃_k] = exp(Ψ(γ + Σ_d Σ_{l>k} E[s_dl])) / exp(Ψ(γ + 1 + Σ_d Σ_{l≥k} E[s_dl]));
G[τ_w] = exp(Ψ(κ/W + Σ_k E[t_kw])) / exp(Ψ(κ + Σ_{k,w} E[t_kw]));   G[π_k] = G[π̃_k] ∏_{l=1...k−1} G[1 − π̃_l];

Update q(z):

E[n^¬id_dk.] = E[n_dk.] − q_idk;   V[n^¬id_dk.] = V[n_dk.] − q_idk (1 − q_idk);
E[n^¬id_.kx_id] = E[n_.kx_id] − q_idk;   V[n^¬id_.kx_id] = V[n_.kx_id] − q_idk (1 − q_idk);
E[n^¬id_.k.] = E[n_.k.] − q_idk;   V[n^¬id_.k.] = V[n_.k.] − q_idk (1 − q_idk);

q(z_id = k) ∝ [G[α]G[π_k] + E[n^¬id_dk.]] [G[β]G[τ_x_id] + E[n^¬id_.kx_id]] [E[β] + E[n^¬id_.k.]]⁻¹
  × exp( − V[n^¬id_dk.] / (2 (G[α]G[π_k] + E[n^¬id_dk.])²) − V[n^¬id_.kx_id] / (2 (G[β]G[τ_x_id] + E[n^¬id_.kx_id])²) + V[n^¬id_.k.] / (2 (E[β] + E[n^¬id_.k.])²) ).