Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood

Kohei Hayashi¹,²  Shin-ichi Maeda³  Ryohei Fujimaki⁴

¹National Institute of Informatics
²Kawarabayashi Large Graph Project, ERATO, JST
³Kyoto University
⁴NEC Knowledge Discovery Laboratories

July 10, 2015
Introduction

Factorized asymptotic Bayesian inference (FAB)
• Recently developed approximate Bayesian method
  ✓ Accurate and tractable
  ✗ Limited to binary latent variable models (LVMs)

Our contributions:
• Extend FAB to general LVMs (e.g., PCA)
• Analyze theoretical properties that are unclear in the previous studies
1. Revisiting FAB
2. Generalization of FAB
Bayesian Inference for Binary LVMs

Binary LVM:
\[
p(\underbrace{X}_{\text{data}},\ \underbrace{Z}_{\text{LVs}},\ \underbrace{\Pi}_{\text{params}} \mid \underbrace{K}_{\text{model}})
= \underbrace{p(\Pi)}_{\text{prior}}\ \underbrace{p(X, Z \mid \Pi, K)}_{\text{joint likelihood}}
\]

Assumptions:
• X and Z are jointly i.i.d.:
\[
p(X, Z \mid \Pi, K) = \prod_{n=1}^{N} p(x_n, z_n \mid \Pi, K)
\]
• The prior doesn't depend on N: \(\ln p(\Pi) = O(1)\) ("flat" prior)
Goal: To obtain
• the marginal likelihood:
\[
p(X \mid K) = \int p(X, Z, \Pi \mid K)\, dZ\, d\Pi
\]
• the marginal posteriors:
\[
p(Z \mid X, K) = \int p(X, Z, \Pi \mid K)\, d\Pi \Big/ p(X \mid K)
\]
\[
p(\Pi \mid X, K) = \int p(X, Z, \Pi \mid K)\, dZ \Big/ p(X \mid K)
\]

Problem: The marginalizations are intractable
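To get a feel for why the marginalization over Z is hopeless to compute directly, note that for a K-component mixture the sum over Z ranges over one of K states per data point, i.e. K^N configurations. A tiny illustrative count (the values of K and N here are arbitrary):

```python
# Number of latent assignment configurations the sum over Z ranges over
# in a K-component mixture with N data points: K states per point.
K, N = 3, 100
n_configs = K ** N
print(len(str(n_configs)))  # number of decimal digits: direct summation is hopeless
```

Even for this modest N, the count has 48 decimal digits, which is why FAB replaces the exact marginalization with a variational representation.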
Key idea: Use
• the variational representation for \(\int dZ\)
• Laplace's method for \(\int d\Pi\)

Factorized information criterion (FIC):
\[
\mathrm{FIC}(K) \equiv \max_q\, \mathbb{E}_q\!\Big[\max_\Pi \ln p(X, Z \mid \Pi, K)\Big]
- \underbrace{\mathbb{E}_q\!\Big[\frac{D_\Pi}{2} \sum_k \ln \sum_n z_{nk}\Big]}_{\text{FIC penalty term}}
+ H(q) + O(\ln N)
\]
• q(Z): trial distribution
• H(q): entropy
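A rough sketch of how the two ingredients combine into this expression (regularity conditions and constants omitted; this paraphrases the standard FIC derivation rather than reproducing the authors' exact steps): the variational representation handles the sum over Z exactly, and Laplace's method expands the \(\int d\Pi\) integral, whose Hessian block for component k scales with \(\sum_n z_{nk}\).

```latex
% Variational representation of the log marginal likelihood (exact;
% the optimum is attained at the true posterior q(Z)):
\ln p(X \mid K)
  = \max_q\; \mathbb{E}_q\!\left[ \ln \int p(X, Z \mid \Pi, K)\, p(\Pi)\, d\Pi \right] + H(q)

% Laplace's method on the inner integral; the log-determinant of the
% Hessian contributes (D_\Pi/2) \ln \sum_n z_{nk} per component k:
\ln \int p(X, Z \mid \Pi, K)\, p(\Pi)\, d\Pi
  = \max_\Pi \ln p(X, Z \mid \Pi, K)
    - \frac{D_\Pi}{2} \sum_k \ln \sum_n z_{nk} + O(\ln N)

% Substituting the second display into the first recovers FIC(K).
```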
Accuracy of FIC

✓ Asymptotically equivalent to the marginal likelihood

Theorem 3 of [Fujimaki+ 12a]: In mixture models, under mild conditions,
\[
\mathrm{FIC}(K) = \ln p(X \mid K) + O(1) \approx \ln p(X \mid K)
\]

Similar results are obtained for:
• HMMs [Fujimaki+ 12b]
• Latent feature models [KH+ 13]
• Mixture of experts [Eto+ 14]
• Factorial relational models [Liu+ yesterday]
Optimizing FIC

Computation of FIC is difficult:
\[
\max_q\, \mathbb{E}_q\!\Big[\max_\Pi \ln p(X, Z \mid \Pi, K)\Big] - \frac{D_\Pi}{2} \sum_k \mathbb{E}_q\!\Big[\ln \sum_n z_{nk}\Big] + H(q)
\]
\[
\geq \max_{q \in \mathcal{Q}}\, \mathbb{E}_q\!\Big[\max_\Pi \ln p(X, Z \mid \Pi, K)\Big] - \frac{D_\Pi}{2} \sum_k \mathbb{E}_q\!\Big[\ln \sum_n z_{nk}\Big] + H(q)
\]
Mean-field approximation: \(\mathcal{Q} \equiv \{q(Z) \mid q(Z) = \prod_n q(z_n)\}\)
\[
\geq \max_{q \in \mathcal{Q},\, \Pi}\, \mathbb{E}_q[\ln p(X, Z \mid \Pi, K)] - \frac{D_\Pi}{2} \sum_k \ln \sum_n \mathbb{E}_q[z_{nk}] + H(q)
\]
Jensen's inequality
\[
\equiv \underline{\mathrm{FIC}}(K)
\]
where \(\underline{\mathrm{FIC}}(K)\) denotes the tractable lower bound that is actually maximized.
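The Jensen step, which moves the expectation inside the logarithm, can be checked numerically. This small sketch draws binary \(z_{nk}\) from a product-Bernoulli q and verifies \(\mathbb{E}_q[\ln \sum_n z_{nk}] \le \ln \sum_n \mathbb{E}_q[z_{nk}]\) by Monte Carlo (the sizes and probabilities here are illustrative):

```python
import numpy as np

# Jensen's inequality for the concave ln: E_q[ln S] <= ln E_q[S],
# where S = sum_n z_nk for one component k.
rng = np.random.default_rng(0)
N, draws = 50, 100_000
p = rng.uniform(0.3, 0.9, N)       # q(z_nk = 1) for each n
Z = rng.random((draws, N)) < p     # Monte Carlo samples of (z_1k, ..., z_Nk)
S = Z.sum(axis=1)                  # sum_n z_nk per draw
lhs = np.log(S).mean()             # estimate of E_q[ln sum_n z_nk]
rhs = np.log(p.sum())              # ln sum_n E_q[z_nk]
print(lhs <= rhs)                  # Jensen's inequality holds
```

The gap between the two sides is what the mean-field FAB objective gives up in exchange for tractability.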
Algorithm

Optimization problem:
\[
\max_{q \in \mathcal{Q},\, \Pi}\, \mathbb{E}_q[\ln p(X, Z \mid \Pi, K)] - \frac{D_\Pi}{2} \sum_k \ln \sum_n \mathbb{E}_q[z_{nk}] + H(q)
\]

Can be solved by EM-like alternating updates:
1. Initialize q and Π
2. Update q (fix Π)
3. Update Π (fix q)
4. Repeat steps 2 and 3 until convergence
Model Pruning

The FAB algorithm eliminates irrelevant components automatically:
\[
\mathbb{E}_q[\ln p(X, Z \mid \Pi, K)] - \underbrace{\frac{D_\Pi}{2} \sum_k \ln \sum_n \mathbb{E}_q[z_{nk}]}_{\text{penalty term}} + H(q)
\]
[Figure: plot of \(-\ln x\), which grows steeply as \(x \to 0\)]
• The penalty term introduces group sparsity to Z
[Figure: the columns of Z shrink from K = 6 to K = 3 over successive updates]
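The alternating updates and this pruning behavior can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration, not the authors' implementation: the additive shrinkage term \(-D_k/(2\sum_n q_{nk})\) in the q-update, the per-component parameter count `D_k = 3`, and the pruning threshold `prune_tol` are all assumptions of this sketch.

```python
import numpy as np

def fab_gmm(X, K=6, n_iter=50, prune_tol=1e-3, seed=0):
    """Sketch of FAB-style alternating updates for a 1-D Gaussian mixture."""
    rng = np.random.default_rng(seed)
    N = len(X)
    D_k = 3.0                        # params per component: mean, variance, weight (assumed)
    mu = rng.choice(X, K)            # initialize means from the data
    var = np.full(K, X.var() + 1e-6)
    pi = np.full(K, 1.0 / K)
    q = np.full((N, K), 1.0 / K)     # responsibilities q(z_n = k)
    for _ in range(n_iter):
        # Step 2 -- update q (Pi fixed): Gaussian log-likelihood plus the
        # FAB shrinkage term -D_k / (2 * sum_n q_nk), which penalizes
        # components with little responsibility mass.
        log_lik = (-0.5 * np.log(2 * np.pi * var)
                   - 0.5 * (X[:, None] - mu) ** 2 / var)
        shrink = -D_k / (2.0 * np.maximum(q.sum(0), 1e-12))
        log_q = np.log(pi) + log_lik + shrink
        log_q -= log_q.max(1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(1, keepdims=True)
        # Prune components whose responsibility mass has (almost) vanished.
        keep = q.sum(0) / N > prune_tol
        if not keep.all():
            q, mu, var, pi = q[:, keep], mu[keep], var[keep], pi[keep]
            q /= q.sum(1, keepdims=True)
        # Step 3 -- update Pi (q fixed): weighted maximum likelihood, as in EM.
        Nk = q.sum(0)
        pi = Nk / N
        mu = q.T @ X / Nk
        var = (q * (X[:, None] - mu) ** 2).sum(0) / Nk + 1e-6
    return q, mu, var, pi

# Two well-separated clusters; the model starts with K = 6 components.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)])
q, mu, var, pi = fab_gmm(X, K=6)
```

Components whose responsibility mass falls below `prune_tol` are removed during the run, so model selection happens inside the same loop as parameter inference.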
Summary of FIC/FAB

✓ Asymptotically equivalent to the marginal likelihood
  • Fits "Big Data" situations
✓ Performs parameter inference and model selection simultaneously
  • EM-like updates of q and Π
  • ARD-like model pruning
✓ Doesn't depend on the choice of p(Π)
  • More frequentist than Bayesian
✓ Works in many binary LVMs
Limitations of FIC/FAB

✗ Limited to binary LVMs
  • In real-valued Z, \(\sum_n z_{nk}\) can be negative
  • \(-\ln \sum_n z_{nk}\) may diverge
✗ Missing relations to EM and VB
  • Similar approaches, but which is better?
✗ Unclear legitimacy of optimizing FIC
  • e.g., tightness