Latent variable models for discrete data (ml.cs.tsinghua.edu.cn/~jianfei/static/lvm.pdf)

Latent variable models for discrete data

Jianfei Chen

Department of Computer Science and Technology

Tsinghua University, Beijing 100084

January 13, 2014

Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. Chapter 27.

Introduction

We want to model three types of discrete data:

Sequence of tokens: $p(y_{i,1:L_i})$

Bag of words: $p(\mathbf{n}_i)$

Discrete features: $p(y_{i,1:R})$

Outline

Mixture Models

LSA / PLSI / LDA / GaP / NMF

LDA

Evaluation

Inference

Variants: CTM, DTM, LDA-HMM, SLDA, MedLDA, etc.

RBM


Mixture models

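$$p(y) = \sum_k p(y \mid q_i = k)\, p(q_i = k)$$

Sequence of tokens: $p(y_{i,1:L_i} \mid q_i = k) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathbf{b}_k)$

Discrete features: $p(y_{i,1:R} \mid q_i = k) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid \mathbf{b}_k^{(r)})$

Bag of words (known $L_i$): $p(\mathbf{n}_i \mid L_i, q_i = k) = \mathrm{Mu}(\mathbf{n}_i \mid L_i, \mathbf{b}_k)$

Bag of words (unknown $L_i$): $p(\mathbf{n}_i \mid q_i = k) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid \lambda_{vk})$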
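
As a minimal sketch of the bag-of-words case with known length, the marginal log-likelihood of one count vector under a mixture of multinomials can be computed in log space. The function names and the toy values of `B`, `prior`, and `n` are illustrative assumptions, not part of the slides.

```python
import math
import numpy as np

def multinomial_log_pmf(n, b):
    """log Mu(n | L, b) with L = sum(n); b is a probability vector over V words."""
    L = int(n.sum())
    log_coef = math.lgamma(L + 1) - sum(math.lgamma(c + 1) for c in n)
    return log_coef + float(np.sum(n * np.log(b)))

def mixture_log_likelihood(n, B, prior):
    """log p(n) = log sum_k p(q = k) Mu(n | L, b_k); column k of B is b_k."""
    log_terms = np.array([
        math.log(prior[k]) + multinomial_log_pmf(n, B[:, k])
        for k in range(B.shape[1])
    ])
    m = log_terms.max()  # log-sum-exp for numerical stability
    return float(m + np.log(np.exp(log_terms - m).sum()))

# Toy example (assumed values): V = 3 words, K = 2 mixture components.
B = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
prior = np.array([0.5, 0.5])
n = np.array([3, 1, 0])
log_p = mixture_log_likelihood(n, B, prior)
```

Working in log space matters in practice because the per-component likelihoods of long documents underflow quickly.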


Mixture models

Theorem

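If $X_i \sim \mathrm{Poi}(\lambda_i)$ independently for all $i = 1, \dots, k$, and $n = \sum_i X_i$, then

$$p(X_1, \dots, X_k \mid n) = \mathrm{Mu}(\mathbf{X} \mid n, \boldsymbol{\pi}), \quad \text{where } \pi_i = \frac{\lambda_i}{\sum_{j} \lambda_j}.$$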
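
The theorem can be checked numerically. The sketch below uses the fact that a sum of independent Poisson variables is Poisson with the summed rate, so the conditional is the joint Poisson probability divided by the probability of the total. The rates `lams` and counts `xs` are arbitrary assumed values.

```python
import math

def poisson_log_pmf(x, lam):
    """log Poi(x | lam)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def multinomial_log_pmf(xs, pi):
    """log Mu(xs | n, pi) with n = sum(xs)."""
    n = sum(xs)
    return (math.lgamma(n + 1)
            - sum(math.lgamma(x + 1) for x in xs)
            + sum(x * math.log(p) for x, p in zip(xs, pi)))

lams = [0.5, 2.0, 3.5]  # assumed Poisson rates
xs = [1, 2, 4]          # assumed observed counts
n = sum(xs)

# Left-hand side: p(X | n) = prod_i Poi(x_i | lam_i) / Poi(n | sum_i lam_i).
log_conditional = (sum(poisson_log_pmf(x, l) for x, l in zip(xs, lams))
                   - poisson_log_pmf(n, sum(lams)))

# Right-hand side: multinomial with pi_i = lam_i / sum_j lam_j.
pi = [l / sum(lams) for l in lams]
log_multinomial = multinomial_log_pmf(xs, pi)
```

The two log-probabilities agree up to floating-point error, which is why the Poisson model (unknown length) and the multinomial model (known length) are so closely related.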


Exponential Family PCA

latent semantic analysis (LSA) / latent semantic indexing (LSI)

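Sequence of tokens: $p(y_{i,1:L_i} \mid \mathbf{z}_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid S(\mathbf{W}\mathbf{z}_i))$

Discrete features: $p(y_{i,1:R} \mid \mathbf{z}_i) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid S(\mathbf{W}_r\mathbf{z}_i))$

Bag of words (known $L_i$): $p(\mathbf{n}_i \mid L_i, \mathbf{z}_i) = \mathrm{Mu}(\mathbf{n}_i \mid L_i, S(\mathbf{W}\mathbf{z}_i))$

Bag of words (unknown $L_i$): $p(\mathbf{n}_i \mid \mathbf{z}_i) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid \exp(\mathbf{w}_{v,:}\mathbf{z}_i))$

where $S(\cdot)$ is the softmax transformation, $\mathbf{z}_i \in \mathbb{R}^K$, and $\mathbf{W}, \mathbf{W}_r \in \mathbb{R}^{V \times K}$.

Inference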

coordinate ascent / degenerate EM (problem: overfitting?)

variational EM / MCMC


LSA / PLSI / LDA

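Unigram: $p(y_{i,1:L_i} \mid q_i = k) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathbf{b}_k)$

LSI: $p(y_{i,1:L_i} \mid \mathbf{z}_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid S(\mathbf{W}\mathbf{z}_i))$

PLSI: $p(y_{i,1:L_i} \mid \boldsymbol{\pi}_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathbf{B}\boldsymbol{\pi}_i)$

LDA: $p(y_{i,1:L_i} \mid \boldsymbol{\pi}_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathbf{B}\boldsymbol{\pi}_i)$, $\boldsymbol{\pi}_i \sim \mathrm{Dir}(\boldsymbol{\pi}_i \mid \boldsymbol{\alpha})$

LDA for other data types

Bag of words: $p(\mathbf{n}_i \mid L_i, \boldsymbol{\pi}_i) = \mathrm{Mu}(\mathbf{n}_i \mid L_i, \mathbf{B}\boldsymbol{\pi}_i)$

Discrete features: $p(y_{i,1:R} \mid \boldsymbol{\pi}_i) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid \mathbf{B}^{(r)}\boldsymbol{\pi}_i)$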

Question: What is the dual parameter? Why is it convenient?


Marlin, Benjamin M. "Modeling user rating profiles for collaborative filtering." Advances in Neural Information Processing Systems. 2003.

Gamma-Poisson Model

LDA

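Model: $p(\mathbf{n}_i \mid L_i, \boldsymbol{\pi}_i) = \mathrm{Mu}(\mathbf{n}_i \mid L_i, \mathbf{B}\boldsymbol{\pi}_i)$

Prior: $\boldsymbol{\pi}_i \sim \mathrm{Dir}(\boldsymbol{\alpha})$

Constraints: $0 \le \pi_{ik}$, $\sum_k \pi_{ik} = 1$, $0 \le B_{vk}$, $\sum_v B_{vk} = 1$

GaP

Model: $p(\mathbf{n}_i \mid \mathbf{z}_i^+) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid \mathbf{b}_{v,:}^\top \mathbf{z}_i^+)$

Prior: $p(\mathbf{z}_i^+) = \prod_k \mathrm{Ga}(z_{ik}^+ \mid \alpha_k, \beta_k)$

Constraints: $0 \le z_{ik}$, $0 \le B_{vk}$

Can use a sparsity-inducing prior (27.17)

GaP has only non-negativity constraints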


Non-negative matrix factorization

Given a non-negative matrix $V$, find non-negative matrix factors $W, H$ such that

$$V \approx WH$$

$$V_i \approx \sum_k W_{ik} H_k$$

Can be viewed as GaP with prior parameters $\alpha_k = \beta_k = 0$.
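
A minimal sketch of the multiplicative updates from the paper cited on this slide, for the generalized KL divergence objective. Minimizing that divergence is equivalent to maximizing a Poisson likelihood of $V$ with rate $WH$, which is the link to GaP. The function names, the small constant `eps`, the random initialization, and the toy count matrix are assumptions made for illustration.

```python
import numpy as np

def kl_divergence(V, WH):
    """Generalized KL divergence D(V || WH), treating 0 * log 0 as 0."""
    mask = V > 0
    return float(np.sum(V[mask] * np.log(V[mask] / WH[mask])) - V.sum() + WH.sum())

def nmf_kl(V, K, n_iter=200, seed=0, eps=1e-12):
    """Factor V (rows x columns) into non-negative W (rows x K) and H (K x columns)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], K)) + 0.1
    H = rng.random((K, V.shape[1])) + 0.1
    history = []
    for _ in range(n_iter):
        # Update H with W fixed; the update is multiplicative, so H stays non-negative.
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        # Update W with H fixed.
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
        history.append(kl_divergence(V, W @ H + eps))
    return W, H, history

# Toy count matrix (assumed data).
rng = np.random.default_rng(1)
V = rng.poisson(lam=5.0, size=(20, 30)).astype(float)
W, H, history = nmf_kl(V, K=4)
```

The recorded `history` should be non-increasing, which is the monotonicity property that makes these updates attractive despite their slow convergence.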


Lee, D. D., and H. S. Seung. "Algorithms for non-negative matrix factorization." Advances in Neural Information Processing Systems.

Latent Dirichlet Allocation (LDA)

Notation

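$$\boldsymbol{\pi}_i \mid \boldsymbol{\alpha} \sim \mathrm{Dir}(\boldsymbol{\alpha}) \quad (1)$$

$$q_{il} \mid \boldsymbol{\pi}_i \sim \mathrm{Cat}(\boldsymbol{\pi}_i) \quad (2)$$

$$\mathbf{b}_k \mid \boldsymbol{\gamma} \sim \mathrm{Dir}(\boldsymbol{\gamma}) \quad (3)$$

$$y_{il} \mid q_{il} = k, \mathbf{B} \sim \mathrm{Cat}(\mathbf{b}_k) \quad (4)$$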
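
The generative process in (1) to (4) can be sketched as a sampler. This is an illustration only: the function name, the use of symmetric Dirichlet priors with scalar `alpha` and `gamma`, and the fixed document length `L` are assumptions, not choices made in the slides.

```python
import numpy as np

def sample_lda_corpus(N, L, V, K, alpha, gamma, seed=0):
    """Draw N documents of L tokens each from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # (3): one topic-word distribution b_k per topic; B is V x K with b_k as column k.
    B = rng.dirichlet(np.full(V, gamma), size=K).T
    # (1): one topic-proportion vector pi_i per document; pi is N x K.
    pi = rng.dirichlet(np.full(K, alpha), size=N)
    docs, topics = [], []
    for i in range(N):
        # (2): a topic assignment q_il for every token position.
        q = rng.choice(K, size=L, p=pi[i])
        # (4): each word is drawn from the topic chosen for its position.
        y = np.array([rng.choice(V, p=B[:, k]) for k in q])
        docs.append(y)
        topics.append(q)
    return B, pi, docs, topics

B, pi, docs, topics = sample_lda_corpus(N=5, L=20, V=50, K=3, alpha=0.5, gamma=0.1)
```

Small values of `alpha` and `gamma` give sparse topic proportions and sparse topics, so each sampled document tends to use few topics and each topic few words.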

Geometric interpretation

Simplex: handle ambiguity (?)

Unidentifiable: Labeled LDA


D. Blei et al. "Latent Dirichlet allocation." JMLR

G. Heinrich. "Parameter estimation for text analysis."

D. Ramage, et al. "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora." EMNLP

http://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf

Evaluation: Perplexity

The perplexity of a language model $q$ given a language $p$ (both $p$ and $q$ are stochastic processes) is defined as

$$\mathrm{perplexity}(p, q) = 2^{H(p, q)}$$

where $H(p, q)$ is the cross-entropy

$$H(p, q) = \lim_{N \to \infty} -\frac{1}{N} \sum_{y_{1:N}} p(y_{1:N}) \log q(y_{1:N})$$

Approximations

$N$ is finite

$p(y_{1:N}) = \delta_{y^*_{1:N}}(y_{1:N})$


Evaluation: Perplexity

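$$H(p, q) = -\frac{1}{N} \log q(y^*_{1:N})$$

Intuition: weighted average branching factor

For unigram model

$$H = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{L_i} \sum_{l=1}^{L_i} \log q(y^*_{il})$$

For LDA

$$H = -\frac{1}{N} \sum_{i=1}^{N} \log p(y^*_{i,1:L_i})$$

The marginal likelihood $p(y^*_{i,1:L_i})$ is intractable, so it has to be approximated:

Use variational evidence lower bound (ELBO)

Use annealed importance sampling

Use validation set and plug in approximation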
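
For the unigram case the perplexity can be computed directly. The sketch below assumes base-2 logarithms (to match the $2^{H}$ definition), documents given as arrays of word indices, and a hypothetical function name `unigram_perplexity`.

```python
import numpy as np

def unigram_perplexity(docs, q):
    """docs: list of integer token arrays; q: probability vector over the vocabulary."""
    log2_q = np.log2(q)
    # H = -(1/N) sum_i (1/L_i) sum_l log2 q(y_il)
    H = -np.mean([log2_q[y].mean() for y in docs])
    return float(2.0 ** H)

# Toy example (assumed data): a uniform model over V = 8 words.
V = 8
uniform = np.full(V, 1.0 / V)
docs = [np.array([0, 3, 5]), np.array([7, 7, 1, 2])]
ppl = unigram_perplexity(docs, uniform)
```

Under a uniform model over 8 words the perplexity is 8 regardless of the data, which matches the branching-factor intuition: the model is as uncertain as a fair 8-way choice at every token.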


H. Wallach, et al. "Evaluation methods for topic models." ICML 2009

Evaluation: Coherence

TODO


D. Newman et al. "Automatic evaluation of topic coherence." NAACL HLT 2010.

Inference

Exponential number of inference algorithms

Variational inference vs sampling vs both

Collapsed vs non-collapsed

Online vs stochastic vs offline

Empirical Bayes vs fully Bayes

Other algorithms: expectation propagation, etc.
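
As one concrete point in this design space (collapsed, sampling-based, offline, with fixed hyperparameters), here is a minimal sketch of collapsed Gibbs sampling for LDA, where the topic proportions and topic-word distributions are integrated out and only the token assignments are sampled. The function name, the symmetric scalar priors `alpha` and `gamma`, and the toy corpus are assumptions for illustration.

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, gamma=0.01, n_sweeps=50, seed=0):
    """docs: list of integer word-index arrays. Returns estimates of B, pi and assignments q."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))  # tokens in document i assigned to topic k
    n_vk = np.zeros((V, K))          # occurrences of word v assigned to topic k
    n_k = np.zeros(K)                # total tokens assigned to topic k
    q = [rng.integers(K, size=len(y)) for y in docs]  # random initial assignments
    for i, y in enumerate(docs):
        for l, v in enumerate(y):
            k = q[i][l]
            n_dk[i, k] += 1
            n_vk[v, k] += 1
            n_k[k] += 1
    for _ in range(n_sweeps):
        for i, y in enumerate(docs):
            for l, v in enumerate(y):
                # Remove the current token from the counts.
                k = q[i][l]
                n_dk[i, k] -= 1
                n_vk[v, k] -= 1
                n_k[k] -= 1
                # p(q_il = k | everything else), up to a constant.
                p = (n_dk[i] + alpha) * (n_vk[v] + gamma) / (n_k + V * gamma)
                k = rng.choice(K, p=p / p.sum())
                # Add the token back under its new topic.
                q[i][l] = k
                n_dk[i, k] += 1
                n_vk[v, k] += 1
                n_k[k] += 1
    # Point estimates from the final counts.
    B = (n_vk + gamma) / (n_k + V * gamma)
    pi = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    return B, pi, q

# Toy corpus (assumed data): two documents over a vocabulary of 5 words.
docs = [np.array([0, 1, 2, 0]), np.array([3, 4, 3, 4, 4])]
B, pi, q = collapsed_gibbs_lda(docs, V=5, K=2)
```

The per-token loop is what makes this variant slow on large corpora, which motivates the sparse, online, and distributed alternatives.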


Inference: towards large scale

Algorithms

Online / stochastic

Sparsity

Spectral methods

Systems

Distributed: Yahoo-LDA, Petuum, Parameter-Server, etc.

GPU: BIDMach, etc.


Model Selection

Compute evidence with AIS / ELBO

Cross validation

Bayesian non-parametrics


Teh et al. "Hierarchical Dirichlet processes." Journal of the American Statistical Association (2006).

Extensions of LDA

Correlation: Correlated topic model

Time series: Dynamic topic model

Syntax: LDA-HMM

Supervision: many

1D categorical label: SLDA (generative), DLDA (discriminative), MedLDA (regularized)

nD label: MR-LDA, random effects mixture of experts, conditional topic random field, Dirichlet multinomial regression LDA

K labels per document: labeled LDA

labels per word: TagLDA

Structural: RTM


Restricted Boltzmann machines


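$$p(\mathbf{h}, \mathbf{v} \mid \theta) = \frac{1}{Z(\theta)} \prod_{r=1}^{R} \prod_{k=1}^{K} \psi_{rk}(v_r, h_k)$$

where $\mathbf{h}, \mathbf{v}$ are binary vectors.

Factorized posterior

$$p(\mathbf{h} \mid \mathbf{v}, \theta) = \prod_k p(h_k \mid \mathbf{v}, \theta)$$

Advantage: symmetric; both posterior inference (backward) and generation (forward) are easy.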

Exponential family harmonium (harmonium is 2-layer UGM)



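Binary latent and binary visible (other models exist, see Table 27.2)

$$p(\mathbf{v}, \mathbf{h} \mid \theta) = \frac{1}{Z(\theta)} \exp(-E(\mathbf{v}, \mathbf{h}; \theta)) \quad (5)$$

$$E(\mathbf{v}, \mathbf{h}; \theta) = -\mathbf{v}^\top \mathbf{W} \mathbf{h} \quad (6)$$

$$p(\mathbf{h} \mid \mathbf{v}, \theta) = \prod_k \mathrm{Ber}(h_k \mid \mathrm{sigm}(\mathbf{w}_{:,k}^\top \mathbf{v})) \quad (7)$$

$$p(\mathbf{v} \mid \mathbf{h}, \theta) = \prod_r \mathrm{Ber}(v_r \mid \mathrm{sigm}(\mathbf{w}_{r,:}^\top \mathbf{h})) \quad (8)$$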



Goal: maximize p(v|θ)

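$$\nabla_{\mathbf{W}} \ell = \mathbb{E}_{p_{\mathrm{emp}}(\cdot \mid \theta)}[\mathbf{v}\mathbf{h}^\top] - \mathbb{E}_{p(\cdot \mid \theta)}[\mathbf{v}\mathbf{h}^\top]$$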
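
The second expectation in the gradient is taken under the model and is intractable in general. The slides do not say how it is approximated; the sketch below assumes one-step contrastive divergence (CD-1), which replaces it with a single Gibbs step started at the data, using the conditionals (7) and (8). The function names, the bias-free parameterization, the learning rate, and the random toy batch are all assumptions.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradient(V_batch, W, rng):
    """V_batch: (M, R) binary rows; W: (R, K). Returns an (R, K) gradient estimate."""
    # Data term: p(h_k = 1 | v) = sigm(w_{:,k}^T v), as in (7).
    ph_data = sigm(V_batch @ W)
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    # One Gibbs step: p(v_r = 1 | h) = sigm(w_{r,:}^T h), as in (8).
    pv = sigm(h @ W.T)
    v_model = (rng.random(pv.shape) < pv).astype(float)
    ph_model = sigm(v_model @ W)
    # Hidden probabilities are used in place of samples to reduce variance.
    M = V_batch.shape[0]
    return (V_batch.T @ ph_data - v_model.T @ ph_model) / M

# Toy setup (assumed data): R = 6 visible units, K = 3 hidden units.
rng = np.random.default_rng(0)
R, K = 6, 3
W = 0.01 * rng.standard_normal((R, K))
V_batch = (rng.random((10, R)) < 0.5).astype(float)
for _ in range(100):
    W += 0.1 * cd1_gradient(V_batch, W, rng)  # gradient ascent on the log-likelihood
```

CD-1 gives a biased estimate of the gradient, so this should be read as a common practical approximation rather than exact maximum likelihood.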


Conclusions

Why there is still much to do

Exponential number of inference algorithms

Exponential number of models

Exponential × exponential number of solutions

Application, evaluation, theory (e.g. spectral), etc.

Need a way for information retrieval practitioners and data miners to find correct and fast solutions for their problems.

