Page 1:

Methods for Matrix Factorization

Wray Buntine

Professor in Faculty of IT, Director of Master of Data Science

Monash University, http://topicmodels.org

Thanks to: Lan Du and Ethan Zhao of Monash University and Kar Wai Lim of ANU for content,

Lancelot James of HKUST for teaching me some of the material

2018-10-14

Page 2:

And Now for Something Completely Different

We work in Bayesian machine learning in Computer Science.

We work with problems on sparse matrices and graphs, discrete data, with gigabytes of data.

We use complex hierarchical models built on exponential family distributions with millions of parameters.

We use collapsing and augmentation with Gibbs sampling to make it work.

Realistically, with these dimensions, we're only doing approximate estimation for predictive inference.

Our methods are state of the art for a range of challenging problems.

Currently working on an automated/compiling system à la BUGS and Stan. (Unfunded!)

Page 5:

Applications of (Discrete) Matrix Factorisation

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples

Page 6:

Applications of (Discrete) Matrix Factorisation

Semi-structured Data

Page 7:

Applications of (Discrete) Matrix Factorisation

Bibliographic Data: The ACM Database

Page 8:

Applications of (Discrete) Matrix Factorisation

Bibliographic Data as a Graph

multiple node types: authors, institutions, papers, ...

multiple arc types: cited-by, co-author, employee-of, ...

rich side information: paper abstracts, author education, author employment, institution location and ranking, ...

NB. graphs can be represented as multiple sparse matrices with common dimensions
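
For instance, a minimal scipy.sparse sketch (node counts and indices are illustrative, not taken from the ACM data):

    import numpy as np
    from scipy.sparse import csr_matrix

    # One sparse matrix per arc type, with dimensions shared across node types:
    # papers x papers for cited-by, authors x papers for authorship.
    n_papers, n_authors = 4, 3
    cited_by = csr_matrix((np.ones(3), ([0, 2, 3], [1, 1, 0])),
                          shape=(n_papers, n_papers))
    writes = csr_matrix((np.ones(4), ([0, 0, 1, 2], [0, 1, 2, 3])),
                        shape=(n_authors, n_papers))

    # Matrices sharing a dimension compose: co-authorship via shared papers.
    co_author = writes @ writes.T   # authors x authors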

Page 9:

Applications of (Discrete) Matrix Factorisation

Fun with Bibliographies

[Figure: an author-topic network learned from the bibliographic data. Topics (top words): objects, face, recognition, motion, tracking / wavelet, segmentation, transform, motion, shape / network, networks, distributed, design, parallel / classifiers, classifier, accuracy, prediction, machine-learning / linear, function, functions, approximation, optimization / retrieval, web, text, image-retrieval, document / bayesian, networks, inference, estimation, probabilistic / robot, control, robots, environment, mobile-robot / language, word, recognition, text, training / search, optimization, genetic-algorithms, genetic-algorithm, evolutionary / network, neural-networks, networks, neural-network, neural / user, human, research, interaction, speech / satisfiability, logic, reasoning, boolean, sat / speech, speech-recognition, recognition, acoustic, audio / mining, data-mining, clustering, patterns, database / clustering, kernel, space, feature, distance / data-mining, network, software, detection, security / reinforcement, control, state, policy, planning / channel, coding, error, rate, estimation / agents, games, game, agent, reinforcement. Authors include: g hinton, m gales, s singh, y rui, b schölkopf, z ghahramani, t dietterich, s thrun, r kohavi, m kearns, n friedman, t joachims, d koller, y freund, t hofmann, j quinlan, c burges, d lowe, y yang, s chen, d aha, d heckerman, r schapire, j lafferty, a blum, r agrawal, r sutton, j friedman, t cootes, p viola, k murphy, p belhumeur, m swain, l kaelbling, w cohen, m isard.]

See Lim et al. ACML (2014)

Page 10:

Applications of (Discrete) Matrix Factorisation

Document Collections as Matrices

Original news article:

Women may only account for 11% of all Lok-Sabha MPs but they fared better when it came to representation in the Cabinet. Six women were sworn in as senior ministers on Monday, accounting for 25% of the Cabinet. They include Swaraj, Gandhi, Najma, Badal, Uma and Smriti.

Bag of words:

11% 25% Badal Cabinet(2) Gandhi Lok-Sabha MPs Monday Najma Six Smriti Swaraj They Uma Women account accounting all and as better but came fared for(2) in(2) include it may ministers of on only representation senior sworn the(2) they to were when women

NB. a matrix where rows are documents and columns are words, so w_{i,j} is the count of word j in document i
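
As a concrete sketch, the count matrix can be built in a few lines (toy documents; the vocabulary indexing is an illustrative choice):

    from collections import Counter

    docs = ["six women were sworn in as senior ministers",
            "women account for 11% of all mps"]
    tokens = [d.split() for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    col = {w: j for j, w in enumerate(vocab)}   # column index j per word

    # W[i][j] = count of word j in document i, as described above
    W = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(tokens):
        for word, n in Counter(doc).items():
            W[i][col[word]] = n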

Page 11:

Applications of (Discrete) Matrix Factorisation

Fun with News about Obesity

see https://topicmodels.org/2016/04/04/talk-at-data-science-meetup/

Page 12:

Applications of (Discrete) Matrix Factorisation

Document Segmentation

Passage: contiguous text with no boundary, e.g., a sentence

Segment: consecutive text passages that are semantically related.

Document Segmentation Task: (roughly) given a document as a monolithic block of text, where should we put the segment boundaries?

Rather like changepoint detection for paragraph/sentence streams, or sequential clustering of paragraphs/sentences.

Page 16:

Applications of (Discrete) Matrix Factorisation

How Many Species of Mosquitoes are There?

e.g. Given some measurement points about mosquitoes in Asia, how many species are there?

K=4? K=5? K=6? K=8?

Page 17:

Applications of (Discrete) Matrix Factorisation

How Many Words in the English Language are There?

... lastly, she pictured to herself how this same little sister of hers would, in the after-time, be herself a grown woman; and how she would keep, through all her riper years, the simple and loving heart of her childhood: and how she would gather about her other little children, and make their eyes bright and eager with many a strange tale, perhaps even with the dream of wonderland of long ago: ...

e.g. Given 10 gigabytes of English text, how many words are there in the English language?

K=1,235,791? K=1,719,765? K=2,983,548?

Page 18:

Applications of (Discrete) Matrix Factorisation

Which Music Genres do You Listen to?

(see http://everynoise.com) Music genres are constantly developing.

Which ones do you listen to? What is the chance that a new genre is seen?

Page 19:

Applications of (Discrete) Matrix Factorisation

What is an Unknown Dimension?

Some dimensions are fixed but unknown: this uses parametric statistics; this is not Bayesian non-parametrics.

Some dimensions keep on growing as we get more data: this is Bayesian non-parametrics.

Page 20:

Applications of (Discrete) Matrix Factorisation

Modelling – What We Do

Source data: semi-structured data

some with ever-expanding numbers of items/dimensions/nodes

Objects: matrices, tensors, graphs

annotated with side information

Predictions: preferences, ratings, connections

Domains: bioinformatics, social networks, bibliographicdata

Page 21:

Statistical Background

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples

Page 22:

Statistical Background

Matrix Approximation

W ≈ ΘΦ^T

Data W          Components Θ     Error           Models
real valued     unconstrained    least squares   PCA and LSA
non-negative    non-negative     least squares   NMF, learning codebooks
non-neg int.    rates            cross-entropy   Poisson & Neg. Bino. MF
non-neg int.*   probabilities    cross-entropy   topic models
real valued     independent      small           ICA

* but normalise rows of W

Page 23:

Statistical Background

NMF by Seung and Lee
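
A minimal numpy sketch of the Lee and Seung multiplicative updates for least-squares NMF, written in the slides' W ≈ ΘΦ^T notation (random initialisation, iteration count and the small eps guard are illustrative choices, not their published code):

    import numpy as np

    def nmf(W, K, iters=200, eps=1e-9, seed=0):
        rng = np.random.default_rng(seed)
        I, J = W.shape
        Theta, Phi = rng.random((I, K)), rng.random((J, K))
        for _ in range(iters):
            # Θ <- Θ * (W Φ) / (Θ Φ^T Φ)
            Theta *= (W @ Phi) / (Theta @ (Phi.T @ Phi) + eps)
            # Φ <- Φ * (W^T Θ) / (Φ Θ^T Θ)
            Phi *= (W.T @ Theta) / (Phi @ (Theta.T @ Theta) + eps)
        return Theta, Phi

Each update multiplies by a ratio of non-negative terms, which is why no explicit projection step is needed to keep the factors non-negative.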

Page 24:

Statistical Background

Component Models, Generally

[Figure: approximating a face image and a news text. The image components resemble eigenfaces; the text components are word lists such as "Prince, Queen, Elizabeth, title, son, ..."; "school, student, college, education, year, ..."; "John, David, Michael, Scott, Paul, ..."; "and, or, to, from, with, in, out, ...". The example bag of words: "13 1995 accompany and(2) andrew at boys(2) charles close college day despite diana dr eton first for gayley harry here housemaster looking old on on school separation sept stayed the their(2) they to william(2) with year".]

Approximate faces/bags-of-words (RHS) with a linear combination of components (LHS).

Page 25:

Statistical Background

Reading a Graphical Model

[Figure: example graphical-model fragments, including a plate over x1, ..., xN.]

arcs = "depends on"
double-headed arcs = "deterministically computed from"
shaded nodes = "supplied variable/data"
unshaded nodes = "unknown variable/data"
boxes = "replication"

Page 26:

Statistical Background

Models in Graphical Form

Supervised learning or prediction model / Clustering or mixture model

[Figure: two plate diagrams, each over variables z and ~x with plate I: one for the supervised/prediction model, one for the clustering/mixture model.]

Page 27:

Statistical Background

Mixture Models in Graphical Form

[Figure: plate diagrams for the mixture template, a GMM, and LDA, over variables z and ~x (or w), components ~φ with base distribution h(·) and concentration α; plates I, K, L.]

Page 28:

Statistical Background

The Topic Model: Approximate W ≈ ΘΦ^T

Latent Dirichlet Allocation (LDA): word counts per document can be generated with a multinomial on rows of ΘΦ^T, or individual words generated sequentially with a categorical on rows of ΘΦ^T.

~θ_i ∼ Dirichlet_K( (α/K) ~1 )

~φ_k ∼ h(·)    ∀ k = 1, ..., K

w_{i,l} ∼ Categorical( Σ_{k=1}^{K} θ_{i,k} ~φ_k )    OR

z_{i,l} ∼ Categorical(~θ_i) and w_{i,l} ∼ Categorical(~φ_{z_{i,l}})    ∀ l = 1, ..., L_i

[Figure: LDA plate diagram with z, w, α, h(·), ~φ; plates L, K, I.]
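
A minimal sketch of this generative story (toy sizes; taking h(·) to be a symmetric Dirichlet is my illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    I, K, V, L = 5, 3, 20, 10          # docs, topics, vocabulary, words per doc
    alpha = 0.5

    Phi = rng.dirichlet(np.ones(V), size=K)   # φ_k ~ h(·), here Dirichlet(1)
    docs = []
    for i in range(I):
        theta = rng.dirichlet(np.full(K, alpha / K))  # θ_i ~ Dirichlet_K((α/K) 1)
        words = []
        for _ in range(L):
            z = rng.choice(K, p=theta)                # z_{i,l} ~ Categorical(θ_i)
            words.append(rng.choice(V, p=Phi[z]))     # w_{i,l} ~ Categorical(φ_{z_{i,l}})
        docs.append(words)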

Page 29:

Statistical Background

Hierarchical Processes

[Figure: two two-level hierarchies H → G: on the left both levels are Dirichlet processes; on the right the lower level reduces to a Dirichlet distribution.]

See “The Hierarchical Dirichlet Process”, Teh, Jordan, Beal & Blei, JASA, 2006 (2640 citations).

Works because the Dirichlet process is a normalised gamma process, and the gamma distribution is infinitely divisible.

But stable distributions, etc., are also infinitely divisible, with corresponding processes.

Machine learning folks like to use hierarchical processes, but in many cases it simplifies as above.

Has generated an enormous literature on hierarchical Chinese restaurant processes.

Page 31:

Statistical Background

Using Processes in Matrix Factorisation

Processes generally correspond to building infinite vectors of the form µ(x) = Σ_{k=1}^{∞} w_k δ_{x_k}(x) (called Completely Random Measures).

The weights w_k are independent with an improper prior given by the Lévy measure. (Haven't yet managed to convince the theoreticians of this.)

Posterior theory and analysis done in a series of papers by Lancelot James et al.:

“Bayesian Poisson Calculus for Latent Feature Modeling via Generalized Indian Buffet Process Priors”, Annals of Stats. (2016); “Posterior analysis for normalized random measures with independent increments”, Scand. Jnl. of Stats. (2009)

The root instances of the processes can be collapsed in an MCMC sampler using James' methods.

The hierarchical instances of the processes correspond to the underlying additive distribution, so can be dealt with parametrically.

Page 34:

Statistical Methods

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples

Page 35:

Statistical Methods

Sharing/Inheritance with a Probability Hierarchy

[Figure: a tree of probability vectors: root ~φ, children ~θ_1, ~θ_2, grandchildren ~θ_{1,1}, ~θ_{1,2}, ~θ_{2,1}, ~θ_{2,2}.]

We might model a set of vocabularies/documents hierarchically:

~θ_1 ∼ Dirichlet(α_0 ~φ)

~θ_{1,2} ∼ Dirichlet(α_1 ~θ_1)

Statistical estimation with these is generally difficult:

... [ Γ(Σ_k α_1 θ_{1,k}) / Π_k Γ(α_1 θ_{1,k}) ] Π_k θ_{1,2,k}^{α_1 θ_{1,k} − 1} · Π_k θ_{1,k}^{α_0 φ_k − 1} ...

Resolve by doing a marginalisation and then an augmentation to reduce the problem to something more manageable.

Page 36:

Statistical Methods

A Simple Hierarchical Dirichlet Model

[Figure: a tree with root ~µ and children ~θ_1 and ~θ_2; ~θ_1 governs ~p_1, ~p_2, ~p_3 generating data x_1, x_2, x_3, and ~θ_2 governs ~p_4, ~p_5, ~p_6 generating x_4, x_5, x_6.]

p(~µ) · p(~θ_1 | ~µ) · p(~θ_2 | ~µ)
· p(~p_1 | ~θ_1) p(~p_2 | ~θ_1) p(~p_3 | ~θ_1) · p(~p_4 | ~θ_2) p(~p_5 | ~θ_2) p(~p_6 | ~θ_2)
· Π_l p_{1,l}^{n_{1,l}} Π_l p_{2,l}^{n_{2,l}} Π_l p_{3,l}^{n_{3,l}} Π_l p_{4,l}^{n_{4,l}} Π_l p_{5,l}^{n_{5,l}} Π_l p_{6,l}^{n_{6,l}}

Page 37:

Statistical Methods

Collapsing and Augmenting the Posterior

Collapse the Dirichlet vectors and concurrently augment:

[Figure: the collapsed tree, with counts (~n_1, ~t_1), ..., (~n_6, ~t_6) at the leaves, ~s_1 and ~s_2 at the middle level, and ~r at the root.]

~n1, ~n2,... are the data at the leaf nodes.

~t_1, ~t_2, ... are auxiliary counts representing the multinomial likelihood on ~p_1, ~p_2, ... passed to ~θ_1 and ~θ_2 (the ~t_1, ~t_2 counts are bounded above by ~n_1, ~n_2)

~s_1 and ~s_2 are auxiliary counts representing the multinomial likelihood on ~θ_1 and ~θ_2 passed to ~µ (~s_1 bounded above by ~t_1 + ~t_2 + ~t_3, etc.)

~r are auxiliary counts representing the multinomial likelihood on ~µ (bounded above by ~s_1 + ~s_2)
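
As a sketch of how one such auxiliary count is drawn in the Dirichlet case: the count t passed up for a leaf holding n data points with concentration a (here a would be the concentration times the parent's probability for that component; my illustration, not the slides' code) follows the Chinese-restaurant construction:

    import numpy as np

    def sample_table_count(n, a, rng):
        t = 0
        for c in range(n):
            # the c-th customer opens a new table with probability a / (a + c)
            if rng.random() < a / (a + c):
                t += 1
        return t   # bounded above by n, as noted above

    rng = np.random.default_rng(1)
    t = sample_table_count(n=25, a=0.5, rng=rng)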

Page 38:

Statistical Methods

MCMC Problem Specification

Converted posterior requiring a Gibbs/MCMC sampler:

[Figure: the same collapsed tree, with counts (~n_j, ~t_j) at the leaves, ~s_1, ~s_2 in the middle, and ~r at the root.]

(~µ): ... α^R / (α)_{S_1+S_2} · Π_{k=1}^{K} S^{s_{1,k}+s_{2,k}}_{r_k, 0} (1/K)^{r_k}

(~θ_1, ~θ_2): ... α^{S_1} / (α)_{T_1+T_2+T_3} · Π_{k=1}^{K} S^{t_{1,k}+t_{2,k}+t_{3,k}}_{s_{1,k}, 0} · α^{S_2} / (α)_{T_4+T_5+T_6} · Π_{k=1}^{K} S^{t_{4,k}+t_{5,k}+t_{6,k}}_{s_{2,k}, 0}

(∀k ~p_k): ... Π_{l=1}^{6} Π_{k=1}^{K} S^{n_{l,k}}_{t_{l,k}, 0}
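
The generalised Stirling numbers S^n_{t,β} appearing in these factors can be tabled with the standard recurrence S^{n+1}_{t,β} = S^n_{t−1,β} + (n − tβ) S^n_{t,β}, S^0_{0,β} = 1; a small sketch (practical samplers store these in log space):

    import numpy as np

    def stirling_table(N, beta):
        S = np.zeros((N + 1, N + 1))   # S[n, t] = S^n_{t,β}
        S[0, 0] = 1.0
        for n in range(N):
            for t in range(1, n + 2):
                S[n + 1, t] = S[n, t - 1] + (n - t * beta) * S[n, t]
        return S

    S = stirling_table(10, beta=0.0)   # β = 0 is the Dirichlet case used above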

Page 39:

Statistical Methods

General Approach

Gibbs sampling over the network.

Collapse and augment variables to achieve a simpler/faster sampler.

The Gamma/Dirichlet/Beta/Stable processes at the roots of the graphical model can be collapsed using James' methods.

The hierarchical instances of the processes correspond to an underlying additive distribution, so can be dealt with parametrically, e.g.: a hierarchical Gamma process becomes a gamma distribution; a hierarchical Dirichlet process becomes a Dirichlet distribution.

Page 40:

Statistical Methods

Auxiliary Variable Samplers

Given an intractable form in β, the table below provides a data augmentation strategy. Algorithms X1, ..., X5 are generally linear.

Form in β                          Auxiliary variable                        New form
(I + β)^{−α}                       q ∼ Gamma(α, I + β), α, I > 0             e^{−qβ}
β^{(n)}                            t ∼ Alg. X1 for (0, β, n), n ∈ Z+         β^t
Γ(β)/Γ(β + α)                      q ∼ Beta(β, α), α, β > 0                  q^β
(β|α)^{(n)}                        t ∼ Alg. X2 for (α, β, n), n ∈ Z+         β^t
(α|β)^{(n)}                        t ∼ Alg. X3 for (β, α, n), n ∈ Z+         β^{n−t}
S^n_{t,β}                          ~m ∼ Alg. X4 for (β, n, t)                Π_{k=1}^{t} Γ(m_k − β)/Γ(1 − β)
Π_{k=1}^{t} Γ(m_k − β)/Γ(1 − β)    s ∼ Alg. X5 for (t, ~m, β)                (1 − β)^s

NB. The Pochhammer symbols (α|β)^{(n)} and generalised Stirling numbers S^n_{t,β} appear in posteriors of the Pitman-Yor process and variants of the stable distribution.
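
As a sketch of the first row in use: a factor (I + β)^{−α} in β's conditional is replaced by sampling q ∼ Gamma(α, I + β), after which β only sees e^{−qβ}; with an (assumed) Gamma(a0, b0) prior on β both Gibbs steps stay conjugate:

    import numpy as np

    rng = np.random.default_rng(2)
    alpha_, I_ = 3.0, 2.0      # the α and I in (I + β)^{-α}, illustrative values
    a0, b0 = 1.0, 1.0          # Gamma prior on β, an assumption for this sketch
    beta = 1.0
    for _ in range(1000):
        # augment: q | β ~ Gamma(shape=α, rate=I + β); numpy takes scale = 1/rate
        q = rng.gamma(alpha_, 1.0 / (I_ + beta))
        # the (I + β)^{-α} factor has become e^{-qβ}, so β | q ~ Gamma(a0, b0 + q)
        beta = rng.gamma(a0, 1.0 / (b0 + q))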

Page 41:

Non-parametric Topic Models

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples

Page 42:

Non-parametric Topic Models

Design Notes for Non-parametric LDA

Prediction task: given part of a test document, predict additional words.

Hierarchical priors: whenever parts of the system seem similar, we give them a common prior and learn the similarity.

Estimating parameters: whenever parameters cannot be reasonably set, we estimate them instead.

Fast-ish implementation: for full non-parametrics: in all, doubles the memory and time of regular LDA Gibbs; the multi-core implementation is good for up to 8 cores; works on medium sizes: I=1,000,000 docs with J=10,000 words in dictionary and K=5,000 topics.

Burstiness: we make the word vectors ~w_i Dirichlet compound multinomial instead of multinomial; more realistic for documents because it models word burstiness.

See Buntine and Mishra, “Experiments with Non-parametric Topic Models”, ACM SIGKDD (2014).

Page 46:

Non-parametric Topic Models

Evolution of Models: Basic LDA

[Figure: LDA plate diagram with w_{d,n}, z_{d,n}, ~θ_d, ~φ_k and hyperparameters α, β; plates N, D, K.]

LDA: α and β are concentrations;

originally parameters supplied by the user; a symmetric Dirichlet prior on rows of Θ and Φ; later versions use an asymmetric Dirichlet prior.

Page 47:

Non-parametric Topic Models

Evolution of Models: NP-LDA (2013)

[Figure: NP-LDA plate diagram; the concentrations now have hyperpriors (a_φ, b_φ), (a_θ, b_θ), (a_β, b_β), (a_α, b_α).]

NP-LDA adds a power law on word distributions, like Sato and Nakagawa (2010), and estimation of the background word distribution.

Page 48:

Non-parametric Topic Models

Evolution of Models: Bursty NP-LDA (2014)

[Figure: Bursty NP-LDA plate diagram; per-document word vectors ~ψ_{d,k} with concentration b_{ψ,k} sit between ~φ_k and the words w_{d,n}.]

NP-LDA with Burstiness adds burstiness like Doyle and Elkan (2009).

Page 49:

Non-parametric Topic Models

Marginalised Bursty Non-parametric Topic Model

[Figure: the marginalised model over counts ~n^ψ_k, ~n^φ_k, ~n^µ_d, ~n^α, ~n^β and auxiliary counts ~t^ψ_k, ~t^φ_k, ~t^µ_d, with hyperparameters θ^ψ, (d^φ, θ^φ), (d^β, θ^β), θ^µ, θ^α; plates N, K, D.]

started with two hierarchies: ~µ_d → ~α and ~ψ_k → ~φ_k → ~β

counts (in blue) ~n^µ_d, ~n^α, ~n^ψ_k, ~n^φ_k and ~n^β introduced, along with their auxiliary counts ~t^µ_d, etc.

the root of each hierarchy is modelled with an improper Dirichlet, so there is no ~t^α or ~t^β

Page 50:

Non-parametric Topic Models

Example Topics

2691 abstracts from JMLR vols. 1-11, about 60 words in length (after removing stops). Built 1000 topics. Examples below contain "posterior".

#25, 0.42%: posteriori expectation likelihood-based analytically maximum likelihood maximization e-step posterior estimation map coming model (top=14)
#31, 0.40%: undirected graphical message-passing inference graphs directed posteriori junction graph clique intractable models cliques partition (top=13)
#38, 0.37%: dirichlet bayesian priors conjugate prior ill-posed posterior covariance gaussian infer serves distribution analytical distributions (top=9)
#58, 0.31%: latent variables discover posterior dependencies variable models modeling hidden parent unobserved part correlations constituent (top=11)
#95, 0.22%: particle kalman tracking filter filtering observer state appearance implement dynamics visual occlusion posterior multimodal (top=5)
#124, 0.19%: monte carlo chain markov jump mcmc chains reversible iterates mix proposal posterior problematic sampling (top=2)
#239, 0.11%: naive classifier bayes averaging logarithmic multiple-instance averaged posterior already counterpart considerably weakly classifications goal (top=3)
#678, 0.03%: nondeterministic posterior probabilities cancer hypotheses successively true deterministic distributions worst-case comprise discussed limiting (top=2)
#820, 0.02%: estimators constrains sparsely constraints cross-validated insufficient parameters made among posterior context brain correlated fast (top=1)

Page 51:

Non-parametric Topic Models

Understanding the Word Prior ~β

the word prior ~β corresponds to a background topic (with PYPs to make it Zipfian)

all topics ~φ_k are variants of it

"topical" words are reduced and "stop" words are increased compared to collection frequency

the importance of word w for topic k can be measured as φ_{k,w}/β_w

395 Reuters RCV1 news articles from 1996 containing "church".

~β background (high β_w): the of to a in and 's was on for by telephone said sick fighting with as at is republican land shortly he remains difficult voice shown mary inside done travelled

topical (high df_w/β_w): diana teresa missionaries russia parker elizabeth bowles camilla churchill winston harriman pamela her quoted princess pontiff prince navarro-valls dies kremlin averell parkinson

topic #1: 'm else something n't everyone someone my stand i me like truth always really going do you ran know similar lover things look sun think 've
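
A small sketch of the importance score in use (φ_k, β and the vocabulary list are assumed given):

    import numpy as np

    def topical_words(phi_k, beta, vocab, top=10):
        lift = phi_k / beta                    # φ_{k,w} / β_w per word w
        return [(vocab[w], lift[w]) for w in np.argsort(-lift)[:top]]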

Page 52:

Non-parametric Topic Models

Understanding the Topic Prior ~α

the topic prior ~α corresponds to the expected occurrence of each topic

all document topics ~θd are variants of it

uses Dirichlet processes

395 Reuters RCV1 news articles from 1996 containing "church".

#2, 5.05%: quoted declined saying reports position further talks sought
#4, 2.28%: pope vatican navarro-valls pontiff solidarity 76-year-old
#5, 1.90%: diana charles bowles parker prince camilla princess elizabeth
#47, 1.00%: nazi nazis germany german war jews recalled forces wartime
#48, 1.85%: estimated sold york company estate island percent sell money
#49, 0.98%: art works culture includes cultural st boris programme show

Page 53:

Non-parametric Topic Models

Understanding the Topic Confidence bψ,k

the topic confidence b_{ψ,k} corresponds to a concentration parameter (inverse variance) when generating ~ψ_{d,k} from ~φ_k

large values mean ~ψ_{d,k} is a copy of ~φ_k

low values mean ~ψ_{d,k} differs greatly from ~φ_k; if really low, it's a "rubbish" topic

8616 Reuters RCV1 news articles from 1996 containing "person".

#2, 2605: willingness guarantor caution definite absences disgruntled seriousness manoeuvring govern instability
#4, 2271: detractors predecessors illustrious front-runner outsider credentials flair courteous woo self-effacing married
#9, 2153: teresa missionaries woodlands birla pacemaker gutters nun calcutta
#199, 3.57: royalties michelle job lopez earns hungarian-born eating credits
#200, 2.92: penguin birthdays 1000 abc compiled wheel mausoleum 1800 provoke timetable budapest
#194, 1.41: spa verdicts korzhakov beginnings burmese ethics betrayed blair fujimori heroin

Page 54:

Other Examples

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples
   Document Segmentation
   Topic Models with Side Information
   Even More Models

Page 56:

Other Examples Document Segmentation

Bayesian Segmentation

Bayesian word segmentation models (Goldwater et al., 2009)

Learn to place boundaries after phonemes in an utterance.

A pointwise boundary sampling algorithm: compute the probability of placing a word boundary after each phoneme.

[Figure: a sequence of utterance units u_1, ..., u_10 with candidate boundaries.]

Prediction task: predict where to place segment boundaries.

Other interpretations:

sentence change point detection

sequential clustering of sentences

Page 58:

Other Examples Document Segmentation

Problem

Problem:

[Figure: an utterance sequence u_1, ..., u_10 modelled two ways: with segment-level topic vectors ν_1, ν_2, ν_3 under a document-level µ, versus a single flat µ over all utterances.]

Hypothesis: simultaneously learning topic segmentation and topic identification should allow better detection of topic boundaries.

Page 59:

Other Examples Document Segmentation

Segmentation Model–Generative process

A segmentation model

[Figure: plate diagram of the segmentation model over α, µ, ν, π, ρ, φ, γ, s, z, w; plates N, U, D.]

Generative process

~φ ∼ Dirichlet(~γ)

~µ ∼ Dirichlet(~α)

π ∼ Beta(~λ)

~ν ∼ PYP(a, b, ~µ)

ρ ∼ Bernoulli(π)

z ∼ Discrete(~νs)

w ∼ Discrete(~φz)

z: topic assignment of word w;

N: the number of words in a passage.
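
A minimal sketch of this generative process (toy sizes; the PYP draw is approximated by a Dirichlet centred on ~µ, a common finite stand-in, so a and b do not appear explicitly):

    import numpy as np

    rng = np.random.default_rng(3)
    K, V, U, N = 4, 30, 8, 12                  # topics, vocab, passages, words
    Phi = rng.dirichlet(np.ones(V), size=K)    # φ_k ~ Dirichlet(γ)
    mu = rng.dirichlet(np.ones(K))             # µ ~ Dirichlet(α)
    pi = rng.beta(1.0, 1.0)                    # π ~ Beta(λ)

    nu, doc = None, []
    for u in range(U):
        rho = rng.random() < pi                # ρ ~ Bernoulli(π): new segment?
        if rho or nu is None:
            nu = rng.dirichlet(10.0 * mu)      # stand-in for ν ~ PYP(a, b, µ)
        z = rng.choice(K, p=nu, size=N)        # z ~ Discrete(ν_s)
        w = [rng.choice(V, p=Phi[zi]) for zi in z]   # w ~ Discrete(φ_z)
        doc.append((rho, w))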

Page 60:

Other Examples Document Segmentation

Posterior Inference–General Picture

We need to sample the topic assignments z and the segment boundaries ρ.

Page 61:

Other Examples Document Segmentation

Experiments on two meeting transcripts

[Figure: p(l = 1), the probability of a topic boundary, plotted against utterance position in sequence (0-800) for TSM (top panel) and PLDA (bottom panel).]

Figure: Probability of a topic boundary, compared with gold-standard segmentation on one ICSI transcript.

Gold Standard: {77, 95, 189, 365, 508, 609, 860}
PLDA: {96, 136, 203, 226, 361, 508, 860}
TSM: {85, 96, 188, 363, 499, 508, 860}

See “Topic Segmentation with a Structured Topic Model”, Du, Buntine, Johnson, NAACL (2013).

Page 62:

Other Examples Topic Models with Side Information

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples
   Document Segmentation
   Topic Models with Side Information
   Even More Models

Page 63:

Other Examples Topic Models with Side Information

Evolution of Models: Add Side Information to LDA

[Figure: LDA plate diagram with w_{d,n}, z_{d,n}, ~θ_d, ~φ_k; plates N, D, K.]

LDA:

regress latent ~α from Boolean document features ~f_d; regress latent ~β from Boolean word features ~g_v; details in a forthcoming paper.

Page 64:

Other Examples Topic Models with Side Information

Topic Model with Side Information

Topic priors ~α_d and word priors ~β_k are constructed as

β_{k,v} = Π_{l′} δ_{l′,k}^{g_{v,l′}},        α_{d,k} = Π_l λ_{l,k}^{f_{d,l}}

[Figure: plate diagram with document features f_{d,l} and weights λ_{l,k} (prior µ_0) feeding ~α_d → ~θ_d → z_{d,i} → w_{d,i}, and word features g_{v,l′} and weights δ_{l′,k} (prior ν_0) feeding ~β_k → ~φ_k; plates over k, l, v, l′, i, d.]
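
Because the features are Boolean, log α_{d,k} is linear in ~f_d (and log β_{k,v} in ~g_v); a small sketch of the construction (sizes and the Gamma initialisation of the weights are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(4)
    D, V, K, L, Lp = 6, 50, 4, 7, 50           # illustrative sizes
    f = rng.integers(0, 2, size=(D, L))        # Boolean document features f_{d,l}
    g = rng.integers(0, 2, size=(V, Lp))       # Boolean word features g_{v,l'}
    lam = rng.gamma(1.0, 1.0, size=(L, K))     # positive weights λ_{l,k}
    delta = rng.gamma(1.0, 1.0, size=(Lp, K))  # positive weights δ_{l',k}

    alpha = np.exp(f @ np.log(lam))            # α_{d,k} = Π_l λ_{l,k}^{f_{d,l}}
    beta = np.exp(g @ np.log(delta)).T         # β_{k,v} = Π_{l'} δ_{l',k}^{g_{v,l'}}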

Page 65:

Other Examples Topic Models with Side Information

Details

Algorithms:

marginalise out Θ and Φ from posterior;

augment so each δl ′,k and λl ,k are conditionally gamma;

use a Gibbs sampler.

Experiments on short text:

Measure perplexity (negative of per-word log-likelihood on test documents; see the sketch after this list)

WS = 12k web snippets on 10k words; TMN = 33k Tag My News items on 13k words

doc features = 7 categories

word features = 50 Booleanised GloVe word embeddings

Results are also good with other collections, and in terms of the human comprehensibility of the topics discovered.
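
The perplexity sketch referenced in the list above (per-document test log-likelihoods and word counts are assumed given by the model being evaluated):

    import numpy as np

    def perplexity(doc_log_liks, doc_lengths):
        # exponentiated negative per-word log-likelihood over test documents
        return float(np.exp(-np.sum(doc_log_liks) / np.sum(doc_lengths)))

    # e.g. perplexity([-350.2, -512.9], [120, 180])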

Page 68:

Other Examples Topic Models with Side Information

Perplexity Results

ExtLDA = our algorithm
ExtLDA-no = our algorithm with no side information (a variant of the earlier NP-LDA)
DMR = previous state-of-the-art algorithm using logistic regression

                 WS                          TMN
K:               50    100   150   200       50    100   150   200
LDA              961   878   869   888       1969  1873  1881  1916
ExtLDA           774   627   572   534       1657  1415  1304  1235
ExtLDA-no        884   733   671   625       1800  1578  1469  1422
DMR              845   683   607   562       1750  1506  1391  1323
LF-LDA           1162  1076  1016  1012      2436  2404  2394  2396
WF-LDA           894   839   827   842       1853  1766  1830  1854
LLDA             1543                        2958
PLLDA, K:        5     10    20    50        5     10    20    50
                 1060  886   735   642       2181  1863  1647  1456

Page 69:

Other Examples Even More Models

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples
   Document Segmentation
   Topic Models with Side Information
   Even More Models

Page 70:

Other Examples Even More Models

Twitter-Network Topic Model

[Figure: the Twitter-Network topic model: Authors, Links, Hashtags and Words connected through Author Topic, Doc. Topic and Misc. Topic nodes, with Word Dist. and Tags Dist. components.]

See Lim et al. IJAR (2016)

Page 71:

Other Examples Even More Models

Adaptive Sequential Topic Model

A more complex (sequential) document model.

The PYPs exist in long chains (~ν_1, ~ν_2, ..., ~ν_J).

A single probability vector ~ν_j can have two parents, ~ν_{j−1} and ~µ.

More complex chains of table indicators and block sampling.

See Du et al. EMNLP (2012)

Page 72:

Other Examples Even More Models

Author-Citation Topic Model

(probability vector hierarchies circled in red)

See Lim and Buntine, ACML (2014)

Page 73:

Other Examples Even More Models

Conclusion

Models many key problems on discrete matrices, tensors and graphs.

A wide variety of complex probabilistic networks can be constructed using rich side information and priors.

The combination of collapsing and augmentation often leads to fast algorithms based on Gibbs sampling with simple distributions.

State-of-the-art performance on several problems.

Currently working on an automated/compiling system à la BUGS and Stan.
