Page 1:

Methods for Matrix Factorization

Wray Buntine

Professor in Faculty of IT, Director of Master of Data Science

Monash University, http://topicmodels.org

Thanks to: Lan Du and Ethan Zhao of Monash University and Kar Wai Lim of ANU for content,

Lancelot James of HKUST for teaching me some of the material

2018-10-14

Page 2:

And Now for Something Completely Different

We work in Bayesian machine learning in Computer Science.

We work with problems on sparse matrices and graphs, discrete data, with gigabytes of data.

We use complex hierarchical models built on exponential family distributions with millions of parameters.

We use collapsing and augmentation with Gibbs sampling to make it work.

Realistically, with these dimensions, we're only doing approximate estimation for predictive inference.

Our methods are state of the art for a range of challenging problems.

Currently working on an automated/compiling system à la BUGS and Stan. (Unfunded!)

Page 5:

Applications of (Discrete) Matrix Factorisation

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples

Page 6:

Applications of (Discrete) Matrix Factorisation

Semi-structured Data

Page 7:

Applications of (Discrete) Matrix Factorisation

Bibliographic Data: The ACM Database

Page 8:

Applications of (Discrete) Matrix Factorisation

Bibliographic Data as a Graph

multiple node types: authors, institutions, papers, ...

multiple arc types: cited-by, co-author, employee-of, ...

rich side information: paper abstracts, author education, author employment, institution location and ranking, ...

NB. graphs can be represented as multiple sparse matrices with common dimensions
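
For instance, a minimal scipy.sparse sketch (node counts and indices are illustrative, not taken from the ACM data):

    import numpy as np
    from scipy.sparse import csr_matrix

    # One sparse matrix per arc type, with dimensions shared across node types:
    # papers x papers for cited-by, authors x papers for authorship.
    n_papers, n_authors = 4, 3
    cited_by = csr_matrix((np.ones(3), ([0, 2, 3], [1, 1, 0])),
                          shape=(n_papers, n_papers))
    writes = csr_matrix((np.ones(4), ([0, 0, 1, 2], [0, 1, 2, 3])),
                        shape=(n_authors, n_papers))

    # Matrices sharing a dimension compose: co-authorship via shared papers.
    co_author = writes @ writes.T   # authors x authors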

Page 9:

Applications of (Discrete) Matrix Factorisation

Fun with Bibliographies

[Figure: an author-topic network learned from the bibliographic data. Topics (top words): objects, face, recognition, motion, tracking / wavelet, segmentation, transform, motion, shape / network, networks, distributed, design, parallel / classifiers, classifier, accuracy, prediction, machine-learning / linear, function, functions, approximation, optimization / retrieval, web, text, image-retrieval, document / bayesian, networks, inference, estimation, probabilistic / robot, control, robots, environment, mobile-robot / language, word, recognition, text, training / search, optimization, genetic-algorithms, genetic-algorithm, evolutionary / network, neural-networks, networks, neural-network, neural / user, human, research, interaction, speech / satisfiability, logic, reasoning, boolean, sat / speech, speech-recognition, recognition, acoustic, audio / mining, data-mining, clustering, patterns, database / clustering, kernel, space, feature, distance / data-mining, network, software, detection, security / reinforcement, control, state, policy, planning / channel, coding, error, rate, estimation / agents, games, game, agent, reinforcement. Authors include: g hinton, m gales, s singh, y rui, b schölkopf, z ghahramani, t dietterich, s thrun, r kohavi, m kearns, n friedman, t joachims, d koller, y freund, t hofmann, j quinlan, c burges, d lowe, y yang, s chen, d aha, d heckerman, r schapire, j lafferty, a blum, r agrawal, r sutton, j friedman, t cootes, p viola, k murphy, p belhumeur, m swain, l kaelbling, w cohen, m isard.]

See Lim et al. ACML (2014)

Page 10:

Applications of (Discrete) Matrix Factorisation

Document Collections as Matrices

Original news article:

Women may only account for 11% of all Lok-Sabha MPs but they fared better when it came to representation in the Cabinet. Six women were sworn in as senior ministers on Monday, accounting for 25% of the Cabinet. They include Swaraj, Gandhi, Najma, Badal, Uma and Smriti.

Bag of words:

11% 25% Badal Cabinet(2) Gandhi Lok-Sabha MPs Monday Najma Six Smriti Swaraj They Uma Women account accounting all and as better but came fared for(2) in(2) include it may ministers of on only representation senior sworn the(2) they to were when women

NB. a matrix where rows are documents and columns are words, so w_{i,j} is the count of word j in document i
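
As a concrete sketch, the count matrix can be built in a few lines (toy documents; the vocabulary indexing is an illustrative choice):

    from collections import Counter

    docs = ["six women were sworn in as senior ministers",
            "women account for 11% of all mps"]
    tokens = [d.split() for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    col = {w: j for j, w in enumerate(vocab)}   # column index j per word

    # W[i][j] = count of word j in document i, as described above
    W = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(tokens):
        for word, n in Counter(doc).items():
            W[i][col[word]] = n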

Page 11:

Applications of (Discrete) Matrix Factorisation

Fun with News about Obesity

see https://topicmodels.org/2016/04/04/talk-at-data-science-meetup/

Page 12:

Applications of (Discrete) Matrix Factorisation

Document Segmentation

Passage: contiguous text with no boundary, e.g., a sentence

Segment: consecutive text passages that are semantically related.

Document Segmentation Task: (roughly) given a document as a monolithic block of text, where should we put the segment boundaries?

Rather like changepoint detection for paragraph/sentence streams, or sequential clustering of paragraphs/sentences.

Page 16:

Applications of (Discrete) Matrix Factorisation

How Many Species of Mosquitoes are There?

e.g. Given some measurement points about mosquitoes in Asia, how many species are there?

K=4? K=5? K=6? K=8?

Page 17:

Applications of (Discrete) Matrix Factorisation

How Many Words in the English Language are There?

... lastly, she pictured to herself how this same little sister of hers would, in the after-time, be herself a grown woman; and how she would keep, through all her riper years, the simple and loving heart of her childhood: and how she would gather about her other little children, and make their eyes bright and eager with many a strange tale, perhaps even with the dream of wonderland of long ago: ...

e.g. Given 10 gigabytes of English text, how many words are there in the English language?

K=1,235,791? K=1,719,765? K=2,983,548?

Page 18:

Applications of (Discrete) Matrix Factorisation

Which Music Genres do You Listen to?

(see http://everynoise.com) Music genres are constantly developing.

Which ones do you listen to? What is the chance that a new genre is seen?

Page 19:

Applications of (Discrete) Matrix Factorisation

What is an Unknown Dimension?

Some dimensions are fixed but unknown: this uses parametric statistics; this is not Bayesian non-parametrics.

Some dimensions keep on growing as we get more data: this is Bayesian non-parametrics.

Page 20:

Applications of (Discrete) Matrix Factorisation

Modelling – What We Do

Source data: semi-structured data

some with ever-expanding numbers of items/dimensions/nodes

Objects: matrices, tensors, graphs

annotated with side information

Predictions: preferences, ratings, connections

Domains: bioinformatics, social networks, bibliographicdata

Page 21:

Statistical Background

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples

Page 22:

Statistical Background

Matrix Approximation

W ≈ ΘΦ^T

Data W          Components Θ     Error           Models
real valued     unconstrained    least squares   PCA and LSA
non-negative    non-negative     least squares   NMF, learning codebooks
non-neg int.    rates            cross-entropy   Poisson & Neg. Bino. MF
non-neg int.*   probabilities    cross-entropy   topic models
real valued     independent      small           ICA

* but normalise rows of W

Page 23:

Statistical Background

NMF by Seung and Lee
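
A minimal numpy sketch of the Lee and Seung multiplicative updates for least-squares NMF, written in the slides' W ≈ ΘΦ^T notation (random initialisation, iteration count and the small eps guard are illustrative choices, not their published code):

    import numpy as np

    def nmf(W, K, iters=200, eps=1e-9, seed=0):
        rng = np.random.default_rng(seed)
        I, J = W.shape
        Theta, Phi = rng.random((I, K)), rng.random((J, K))
        for _ in range(iters):
            # Θ <- Θ * (W Φ) / (Θ Φ^T Φ)
            Theta *= (W @ Phi) / (Theta @ (Phi.T @ Phi) + eps)
            # Φ <- Φ * (W^T Θ) / (Φ Θ^T Θ)
            Phi *= (W.T @ Theta) / (Phi @ (Theta.T @ Theta) + eps)
        return Theta, Phi

Each update multiplies by a ratio of non-negative terms, which is why no explicit projection step is needed to keep the factors non-negative.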

Page 24:

Statistical Background

Component Models, Generally

[Figure: approximating a face image and a news text. The image components resemble eigenfaces; the text components are word lists such as "Prince, Queen, Elizabeth, title, son, ..."; "school, student, college, education, year, ..."; "John, David, Michael, Scott, Paul, ..."; "and, or, to, from, with, in, out, ...". The example bag of words: "13 1995 accompany and(2) andrew at boys(2) charles close college day despite diana dr eton first for gayley harry here housemaster looking old on on school separation sept stayed the their(2) they to william(2) with year".]

Approximate faces/bags-of-words (RHS) with a linear combination of components (LHS).

Page 25:

Statistical Background

Reading a Graphical Model

[Figure: example graphical-model fragments, including a plate over x1, ..., xN.]

arcs = "depends on"
double-headed arcs = "deterministically computed from"
shaded nodes = "supplied variable/data"
unshaded nodes = "unknown variable/data"
boxes = "replication"

Page 26:

Statistical Background

Models in Graphical Form

Supervised learning or prediction model / Clustering or mixture model

[Figure: two plate diagrams, each over variables z and ~x with plate I: one for the supervised/prediction model, one for the clustering/mixture model.]

Page 27:

Statistical Background

Mixture Models in Graphical Form

[Figure: plate diagrams for the mixture template, a GMM, and LDA, over variables z and ~x (or w), components ~φ with base distribution h(·) and concentration α; plates I, K, L.]

Page 28:

Statistical Background

The Topic Model: Approximate W ≈ ΘΦ^T

Latent Dirichlet Allocation (LDA): word counts per document can be generated with a multinomial on rows of ΘΦ^T, or individual words generated sequentially with a categorical on rows of ΘΦ^T.

~θ_i ∼ Dirichlet_K( (α/K) ~1 )

~φ_k ∼ h(·)    ∀ k = 1, ..., K

w_{i,l} ∼ Categorical( Σ_{k=1}^{K} θ_{i,k} ~φ_k )    OR

z_{i,l} ∼ Categorical(~θ_i) and w_{i,l} ∼ Categorical(~φ_{z_{i,l}})    ∀ l = 1, ..., L_i

[Figure: LDA plate diagram with z, w, α, h(·), ~φ; plates L, K, I.]
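
A minimal sketch of this generative story (toy sizes; taking h(·) to be a symmetric Dirichlet is my illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    I, K, V, L = 5, 3, 20, 10          # docs, topics, vocabulary, words per doc
    alpha = 0.5

    Phi = rng.dirichlet(np.ones(V), size=K)   # φ_k ~ h(·), here Dirichlet(1)
    docs = []
    for i in range(I):
        theta = rng.dirichlet(np.full(K, alpha / K))  # θ_i ~ Dirichlet_K((α/K) 1)
        words = []
        for _ in range(L):
            z = rng.choice(K, p=theta)                # z_{i,l} ~ Categorical(θ_i)
            words.append(rng.choice(V, p=Phi[z]))     # w_{i,l} ~ Categorical(φ_{z_{i,l}})
        docs.append(words)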

Page 29:

Statistical Background

Hierarchical Processes

[Figure: two two-level hierarchies H → G: on the left both levels are Dirichlet processes; on the right the lower level reduces to a Dirichlet distribution.]

See “The Hierarchical Dirichlet Process”, Teh, Jordan, Beal & Blei, JASA, 2006 (2640 citations).

Works because the Dirichlet process is a normalised gamma process, and the gamma distribution is infinitely divisible.

But stable distributions, etc., are also infinitely divisible, with corresponding processes.

Machine learning folks like to use hierarchical processes, but in many cases it simplifies as above.

Has generated an enormous literature on hierarchical Chinese restaurant processes.

Page 31:

Statistical Background

Using Processes in Matrix Factorisation

Processes generally correspond to building infinite vectors of the form µ(x) = Σ_{k=1}^{∞} w_k δ_{x_k}(x) (called Completely Random Measures).

The weights w_k are independent with an improper prior given by the Lévy measure. (Haven't yet managed to convince the theoreticians of this.)

Posterior theory and analysis done in a series of papers by Lancelot James et al.:

“Bayesian Poisson Calculus for Latent Feature Modeling via Generalized Indian Buffet Process Priors”, Annals of Stats. (2016); “Posterior analysis for normalized random measures with independent increments”, Scand. Jnl. of Stats. (2009)

The root instances of the processes can be collapsed in an MCMC sampler using James' methods.

The hierarchical instances of the processes correspond to the underlying additive distribution, so can be dealt with parametrically.

Page 34:

Statistical Methods

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples

Page 35:

Statistical Methods

Sharing/Inheritance with a Probability Hierarchy

[Figure: a tree of probability vectors: root ~φ, children ~θ_1, ~θ_2, grandchildren ~θ_{1,1}, ~θ_{1,2}, ~θ_{2,1}, ~θ_{2,2}.]

We might model a set of vocabularies/documents hierarchically:

~θ_1 ∼ Dirichlet(α_0 ~φ)

~θ_{1,2} ∼ Dirichlet(α_1 ~θ_1)

Statistical estimation with these is generally difficult:

... [ Γ(Σ_k α_1 θ_{1,k}) / Π_k Γ(α_1 θ_{1,k}) ] Π_k θ_{1,2,k}^{α_1 θ_{1,k} − 1} · Π_k θ_{1,k}^{α_0 φ_k − 1} ...

Resolve by doing a marginalisation and then an augmentation to reduce the problem to something more manageable.

Page 36:

Statistical Methods

A Simple Hierarchical Dirichlet Model

[Figure: a tree with root ~µ and children ~θ_1 and ~θ_2; ~θ_1 governs ~p_1, ~p_2, ~p_3 generating data x_1, x_2, x_3, and ~θ_2 governs ~p_4, ~p_5, ~p_6 generating x_4, x_5, x_6.]

p(~µ) · p(~θ_1 | ~µ) · p(~θ_2 | ~µ)
· p(~p_1 | ~θ_1) p(~p_2 | ~θ_1) p(~p_3 | ~θ_1) · p(~p_4 | ~θ_2) p(~p_5 | ~θ_2) p(~p_6 | ~θ_2)
· Π_l p_{1,l}^{n_{1,l}} Π_l p_{2,l}^{n_{2,l}} Π_l p_{3,l}^{n_{3,l}} Π_l p_{4,l}^{n_{4,l}} Π_l p_{5,l}^{n_{5,l}} Π_l p_{6,l}^{n_{6,l}}

Page 37:

Statistical Methods

Collapsing and Augmenting the Posterior

Collapse the Dirichlet vectors and concurrently augment:

[Figure: the collapsed tree, with counts (~n_1, ~t_1), ..., (~n_6, ~t_6) at the leaves, ~s_1 and ~s_2 at the middle level, and ~r at the root.]

~n1, ~n2,... are the data at the leaf nodes.

~t_1, ~t_2, ... are auxiliary counts representing the multinomial likelihood on ~p_1, ~p_2, ... passed to ~θ_1 and ~θ_2 (the ~t_1, ~t_2 counts are bounded above by ~n_1, ~n_2)

~s_1 and ~s_2 are auxiliary counts representing the multinomial likelihood on ~θ_1 and ~θ_2 passed to ~µ (~s_1 bounded above by ~t_1 + ~t_2 + ~t_3, etc.)

~r are auxiliary counts representing the multinomial likelihood on ~µ (bounded above by ~s_1 + ~s_2)
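
As a sketch of how one such auxiliary count is drawn in the Dirichlet case: the count t passed up for a leaf holding n data points with concentration a (here a would be the concentration times the parent's probability for that component; my illustration, not the slides' code) follows the Chinese-restaurant construction:

    import numpy as np

    def sample_table_count(n, a, rng):
        t = 0
        for c in range(n):
            # the c-th customer opens a new table with probability a / (a + c)
            if rng.random() < a / (a + c):
                t += 1
        return t   # bounded above by n, as noted above

    rng = np.random.default_rng(1)
    t = sample_table_count(n=25, a=0.5, rng=rng)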

Page 38:

Statistical Methods

MCMC Problem Specification

Converted posterior requiring a Gibbs/MCMC sampler:

[Figure: the same collapsed tree, with counts (~n_j, ~t_j) at the leaves, ~s_1, ~s_2 in the middle, and ~r at the root.]

(~µ): ... α^R / (α)_{S_1+S_2} · Π_{k=1}^{K} S^{s_{1,k}+s_{2,k}}_{r_k, 0} (1/K)^{r_k}

(~θ_1, ~θ_2): ... α^{S_1} / (α)_{T_1+T_2+T_3} · Π_{k=1}^{K} S^{t_{1,k}+t_{2,k}+t_{3,k}}_{s_{1,k}, 0} · α^{S_2} / (α)_{T_4+T_5+T_6} · Π_{k=1}^{K} S^{t_{4,k}+t_{5,k}+t_{6,k}}_{s_{2,k}, 0}

(∀k ~p_k): ... Π_{l=1}^{6} Π_{k=1}^{K} S^{n_{l,k}}_{t_{l,k}, 0}
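
The generalised Stirling numbers S^n_{t,β} appearing in these factors can be tabled with the standard recurrence S^{n+1}_{t,β} = S^n_{t−1,β} + (n − tβ) S^n_{t,β}, S^0_{0,β} = 1; a small sketch (practical samplers store these in log space):

    import numpy as np

    def stirling_table(N, beta):
        S = np.zeros((N + 1, N + 1))   # S[n, t] = S^n_{t,β}
        S[0, 0] = 1.0
        for n in range(N):
            for t in range(1, n + 2):
                S[n + 1, t] = S[n, t - 1] + (n - t * beta) * S[n, t]
        return S

    S = stirling_table(10, beta=0.0)   # β = 0 is the Dirichlet case used above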

Page 39:

Statistical Methods

General Approach

Gibbs sampling over the network.

Collapse and augment variables to achieve a simpler/faster sampler.

The Gamma/Dirichlet/Beta/Stable processes at the roots of the graphical model can be collapsed using James' methods.

The hierarchical instances of the processes correspond to an underlying additive distribution, so can be dealt with parametrically, e.g.: a hierarchical Gamma process becomes a gamma distribution; a hierarchical Dirichlet process becomes a Dirichlet distribution.

Page 40:

Statistical Methods

Auxiliary Variable Samplers

Given an intractable form in β, the table below provides a data augmentation strategy. Algorithms X1, ..., X5 are generally linear.

Form in β                          Auxiliary variable                        New form
(I + β)^{−α}                       q ∼ Gamma(α, I + β), α, I > 0             e^{−qβ}
β^{(n)}                            t ∼ Alg. X1 for (0, β, n), n ∈ Z+         β^t
Γ(β)/Γ(β + α)                      q ∼ Beta(β, α), α, β > 0                  q^β
(β|α)^{(n)}                        t ∼ Alg. X2 for (α, β, n), n ∈ Z+         β^t
(α|β)^{(n)}                        t ∼ Alg. X3 for (β, α, n), n ∈ Z+         β^{n−t}
S^n_{t,β}                          ~m ∼ Alg. X4 for (β, n, t)                Π_{k=1}^{t} Γ(m_k − β)/Γ(1 − β)
Π_{k=1}^{t} Γ(m_k − β)/Γ(1 − β)    s ∼ Alg. X5 for (t, ~m, β)                (1 − β)^s

NB. The Pochhammer symbols (α|β)^{(n)} and generalised Stirling numbers S^n_{t,β} appear in posteriors of the Pitman-Yor process and variants of the stable distribution.
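
As a sketch of the first row in use: a factor (I + β)^{−α} in β's conditional is replaced by sampling q ∼ Gamma(α, I + β), after which β only sees e^{−qβ}; with an (assumed) Gamma(a0, b0) prior on β both Gibbs steps stay conjugate:

    import numpy as np

    rng = np.random.default_rng(2)
    alpha_, I_ = 3.0, 2.0      # the α and I in (I + β)^{-α}, illustrative values
    a0, b0 = 1.0, 1.0          # Gamma prior on β, an assumption for this sketch
    beta = 1.0
    for _ in range(1000):
        # augment: q | β ~ Gamma(shape=α, rate=I + β); numpy takes scale = 1/rate
        q = rng.gamma(alpha_, 1.0 / (I_ + beta))
        # the (I + β)^{-α} factor has become e^{-qβ}, so β | q ~ Gamma(a0, b0 + q)
        beta = rng.gamma(a0, 1.0 / (b0 + q))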

Page 41:

Non-parametric Topic Models

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples

Page 42:

Non-parametric Topic Models

Design Notes for Non-parametric LDA

Prediction task: given part of a test document, predict additional words.

Hierarchical priors: whenever parts of the system seem similar, we give them a common prior and learn the similarity.

Estimating parameters: whenever parameters cannot be reasonably set, we estimate them instead.

Fast-ish implementation: for full non-parametrics: in all, doubles the memory and time of regular LDA Gibbs; the multi-core implementation is good for up to 8 cores; works on medium sizes: I=1,000,000 docs with J=10,000 words in dictionary and K=5,000 topics.

Burstiness: we make the word vectors ~w_i Dirichlet compound multinomial instead of multinomial; more realistic for documents because it models word burstiness.

See Buntine and Mishra, “Experiments with Non-parametric Topic Models”, ACM SIGKDD (2014).

Page 46:

Non-parametric Topic Models

Evolution of Models: Basic LDA

[Figure: LDA plate diagram with w_{d,n}, z_{d,n}, ~θ_d, ~φ_k and hyperparameters α, β; plates N, D, K.]

LDA: α and β are concentrations;

originally parameters supplied by the user; a symmetric Dirichlet prior on rows of Θ and Φ; later versions use an asymmetric Dirichlet prior.

Page 47:

Non-parametric Topic Models

Evolution of Models: NP-LDA (2013)

[Figure: NP-LDA plate diagram; the concentrations now have hyperpriors (a_φ, b_φ), (a_θ, b_θ), (a_β, b_β), (a_α, b_α).]

NP-LDA adds a power law on word distributions, like Sato and Nakagawa (2010), and estimation of the background word distribution.

Page 48:

Non-parametric Topic Models

Evolution of Models: Bursty NP-LDA (2014)

[Figure: Bursty NP-LDA plate diagram; per-document word vectors ~ψ_{d,k} with concentration b_{ψ,k} sit between ~φ_k and the words w_{d,n}.]

NP-LDA with Burstiness adds burstiness like Doyle and Elkan (2009).

Page 49:

Non-parametric Topic Models

Marginalised Bursty Non-parametric Topic Model

[Figure: the marginalised model over counts ~n^ψ_k, ~n^φ_k, ~n^µ_d, ~n^α, ~n^β and auxiliary counts ~t^ψ_k, ~t^φ_k, ~t^µ_d, with hyperparameters θ^ψ, (d^φ, θ^φ), (d^β, θ^β), θ^µ, θ^α; plates N, K, D.]

started with two hierarchies: ~µ_d → ~α and ~ψ_k → ~φ_k → ~β

counts (in blue) ~n^µ_d, ~n^α, ~n^ψ_k, ~n^φ_k and ~n^β introduced, along with their auxiliary counts ~t^µ_d, etc.

the root of each hierarchy is modelled with an improper Dirichlet, so there is no ~t^α or ~t^β

Page 50:

Non-parametric Topic Models

Example Topics

2691 abstracts from JMLR vols. 1-11, about 60 words in length (after removing stops). Built 1000 topics. Examples below contain "posterior".

#25, 0.42%: posteriori expectation likelihood-based analytically maximum likelihood maximization e-step posterior estimation map coming model (top=14)
#31, 0.40%: undirected graphical message-passing inference graphs directed posteriori junction graph clique intractable models cliques partition (top=13)
#38, 0.37%: dirichlet bayesian priors conjugate prior ill-posed posterior covariance gaussian infer serves distribution analytical distributions (top=9)
#58, 0.31%: latent variables discover posterior dependencies variable models modeling hidden parent unobserved part correlations constituent (top=11)
#95, 0.22%: particle kalman tracking filter filtering observer state appearance implement dynamics visual occlusion posterior multimodal (top=5)
#124, 0.19%: monte carlo chain markov jump mcmc chains reversible iterates mix proposal posterior problematic sampling (top=2)
#239, 0.11%: naive classifier bayes averaging logarithmic multiple-instance averaged posterior already counterpart considerably weakly classifications goal (top=3)
#678, 0.03%: nondeterministic posterior probabilities cancer hypotheses successively true deterministic distributions worst-case comprise discussed limiting (top=2)
#820, 0.02%: estimators constrains sparsely constraints cross-validated insufficient parameters made among posterior context brain correlated fast (top=1)

Page 51:

Non-parametric Topic Models

Understanding the Word Prior ~β

the word prior ~β corresponds to a background topic (with PYPs to make it Zipfian)

all topics ~φ_k are variants of it

"topical" words are reduced and "stop" words are increased compared to collection frequency

the importance of word w for topic k can be measured as φ_{k,w}/β_w

395 Reuters RCV1 news articles from 1996 containing "church".

~β background (high β_w): the of to a in and 's was on for by telephone said sick fighting with as at is republican land shortly he remains difficult voice shown mary inside done travelled

topical (high df_w/β_w): diana teresa missionaries russia parker elizabeth bowles camilla churchill winston harriman pamela her quoted princess pontiff prince navarro-valls dies kremlin averell parkinson

topic #1: 'm else something n't everyone someone my stand i me like truth always really going do you ran know similar lover things look sun think 've
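
A small sketch of the importance score in use (φ_k, β and the vocabulary list are assumed given):

    import numpy as np

    def topical_words(phi_k, beta, vocab, top=10):
        lift = phi_k / beta                    # φ_{k,w} / β_w per word w
        return [(vocab[w], lift[w]) for w in np.argsort(-lift)[:top]]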

Page 52:

Non-parametric Topic Models

Understanding the Topic Prior ~α

the topic prior ~α corresponds to the expected occurrence of each topic

all document topics ~θd are variants of it

uses Dirichlet processes

395 Reuters RCV1 news articles from 1996 containing "church".

#2, 5.05%: quoted declined saying reports position further talks sought
#4, 2.28%: pope vatican navarro-valls pontiff solidarity 76-year-old
#5, 1.90%: diana charles bowles parker prince camilla princess elizabeth
#47, 1.00%: nazi nazis germany german war jews recalled forces wartime
#48, 1.85%: estimated sold york company estate island percent sell money
#49, 0.98%: art works culture includes cultural st boris programme show

Page 53:

Non-parametric Topic Models

Understanding the Topic Confidence bψ,k

the topic confidence b_{ψ,k} corresponds to a concentration parameter (inverse variance) when generating ~ψ_{d,k} from ~φ_k

large values mean ~ψ_{d,k} is a copy of ~φ_k

low values mean ~ψ_{d,k} differs greatly from ~φ_k; if really low, it's a "rubbish" topic

8616 Reuters RCV1 news articles from 1996 containing "person".

#2, 2605: willingness guarantor caution definite absences disgruntled seriousness manoeuvring govern instability
#4, 2271: detractors predecessors illustrious front-runner outsider credentials flair courteous woo self-effacing married
#9, 2153: teresa missionaries woodlands birla pacemaker gutters nun calcutta
#199, 3.57: royalties michelle job lopez earns hungarian-born eating credits
#200, 2.92: penguin birthdays 1000 abc compiled wheel mausoleum 1800 provoke timetable budapest
#194, 1.41: spa verdicts korzhakov beginnings burmese ethics betrayed blair fujimori heroin

Page 54:

Other Examples

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples
   Document Segmentation
   Topic Models with Side Information
   Even More Models

Page 56:

Other Examples Document Segmentation

Bayesian Segmentation

Bayesian word segmentation models (Goldwater et al., 2009)

Learn to place boundaries after phonemes in an utterance.

A pointwise boundary sampling algorithm: compute the probability of placing a word boundary after each phoneme.

[Figure: a sequence of utterance units u_1, ..., u_10 with candidate boundaries.]

Prediction task: predict where to place segment boundaries.

Other interpretations:

sentence change point detection

sequential clustering of sentences

Page 58:

Other Examples Document Segmentation

Problem

Problem:

[Figure: an utterance sequence u_1, ..., u_10 modelled two ways: with segment-level topic vectors ν_1, ν_2, ν_3 under a document-level µ, versus a single flat µ over all utterances.]

Hypothesis: simultaneously learning topic segmentation and topic identification should allow better detection of topic boundaries.

Page 59:

Other Examples Document Segmentation

Segmentation Model–Generative process

A segmentation model

[Figure: plate diagram of the segmentation model over α, µ, ν, π, ρ, φ, γ, s, z, w; plates N, U, D.]

Generative process

~φ ∼ Dirichlet(~γ)

~µ ∼ Dirichlet(~α)

π ∼ Beta(~λ)

~ν ∼ PYP(a, b, ~µ)

ρ ∼ Bernoulli(π)

z ∼ Discrete(~νs)

w ∼ Discrete(~φz)

z: topic assignment of word w;

N: the number of words in a passage.
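
A minimal sketch of this generative process (toy sizes; the PYP draw is approximated by a Dirichlet centred on ~µ, a common finite stand-in, so a and b do not appear explicitly):

    import numpy as np

    rng = np.random.default_rng(3)
    K, V, U, N = 4, 30, 8, 12                  # topics, vocab, passages, words
    Phi = rng.dirichlet(np.ones(V), size=K)    # φ_k ~ Dirichlet(γ)
    mu = rng.dirichlet(np.ones(K))             # µ ~ Dirichlet(α)
    pi = rng.beta(1.0, 1.0)                    # π ~ Beta(λ)

    nu, doc = None, []
    for u in range(U):
        rho = rng.random() < pi                # ρ ~ Bernoulli(π): new segment?
        if rho or nu is None:
            nu = rng.dirichlet(10.0 * mu)      # stand-in for ν ~ PYP(a, b, µ)
        z = rng.choice(K, p=nu, size=N)        # z ~ Discrete(ν_s)
        w = [rng.choice(V, p=Phi[zi]) for zi in z]   # w ~ Discrete(φ_z)
        doc.append((rho, w))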

Page 60:

Other Examples Document Segmentation

Posterior Inference–General Picture

We need to sample the topic assignments z and the segment boundaries ρ.

Page 61:

Other Examples Document Segmentation

Experiments on two meeting transcripts

[Figure: p(l = 1), the probability of a topic boundary, plotted against utterance position in sequence (0-800) for TSM (top panel) and PLDA (bottom panel).]

Figure: Probability of a topic boundary, compared with gold-standard segmentation on one ICSI transcript.

Gold Standard: {77, 95, 189, 365, 508, 609, 860}
PLDA: {96, 136, 203, 226, 361, 508, 860}
TSM: {85, 96, 188, 363, 499, 508, 860}

See “Topic Segmentation with a Structured Topic Model”, Du, Buntine, Johnson, NAACL (2013).

Page 62:

Other Examples Topic Models with Side Information

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples
   Document Segmentation
   Topic Models with Side Information
   Even More Models

Page 63:

Other Examples Topic Models with Side Information

Evolution of Models: Add Side Information to LDA

[Figure: LDA plate diagram with w_{d,n}, z_{d,n}, ~θ_d, ~φ_k; plates N, D, K.]

LDA:

regress latent ~α from Boolean document features ~f_d; regress latent ~β from Boolean word features ~g_v; details in a forthcoming paper.

Page 64:

Other Examples Topic Models with Side Information

Topic Model with Side Information

Topic priors ~α_d and word priors ~β_k are constructed as

β_{k,v} = Π_{l′} δ_{l′,k}^{g_{v,l′}},        α_{d,k} = Π_l λ_{l,k}^{f_{d,l}}

[Figure: plate diagram with document features f_{d,l} and weights λ_{l,k} (prior µ_0) feeding ~α_d → ~θ_d → z_{d,i} → w_{d,i}, and word features g_{v,l′} and weights δ_{l′,k} (prior ν_0) feeding ~β_k → ~φ_k; plates over k, l, v, l′, i, d.]
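
Because the features are Boolean, log α_{d,k} is linear in ~f_d (and log β_{k,v} in ~g_v); a small sketch of the construction (sizes and the Gamma initialisation of the weights are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(4)
    D, V, K, L, Lp = 6, 50, 4, 7, 50           # illustrative sizes
    f = rng.integers(0, 2, size=(D, L))        # Boolean document features f_{d,l}
    g = rng.integers(0, 2, size=(V, Lp))       # Boolean word features g_{v,l'}
    lam = rng.gamma(1.0, 1.0, size=(L, K))     # positive weights λ_{l,k}
    delta = rng.gamma(1.0, 1.0, size=(Lp, K))  # positive weights δ_{l',k}

    alpha = np.exp(f @ np.log(lam))            # α_{d,k} = Π_l λ_{l,k}^{f_{d,l}}
    beta = np.exp(g @ np.log(delta)).T         # β_{k,v} = Π_{l'} δ_{l',k}^{g_{v,l'}}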

Page 65:

Other Examples Topic Models with Side Information

Details

Algorithms:

marginalise out Θ and Φ from posterior;

augment so each δl ′,k and λl ,k are conditionally gamma;

use a Gibbs sampler.

Experiments on short text:

Measure perplexity (negative of per-word log-likelihood on test documents; see the sketch after this list)

WS = 12k web snippets on 10k words; TMN = 33k Tag My News items on 13k words

doc features = 7 categories

word features = 50 Booleanised GloVe word embeddings

Results are also good with other collections, and in terms of the human comprehensibility of the topics discovered.
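
The perplexity sketch referenced in the list above (per-document test log-likelihoods and word counts are assumed given by the model being evaluated):

    import numpy as np

    def perplexity(doc_log_liks, doc_lengths):
        # exponentiated negative per-word log-likelihood over test documents
        return float(np.exp(-np.sum(doc_log_liks) / np.sum(doc_lengths)))

    # e.g. perplexity([-350.2, -512.9], [120, 180])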

Page 68:

Other Examples Topic Models with Side Information

Perplexity Results

ExtLDA = our algorithm
ExtLDA-no = our algorithm with no side information (a variant of the earlier NP-LDA)
DMR = previous state-of-the-art algorithm using logistic regression

                 WS                          TMN
K:               50    100   150   200       50    100   150   200
LDA              961   878   869   888       1969  1873  1881  1916
ExtLDA           774   627   572   534       1657  1415  1304  1235
ExtLDA-no        884   733   671   625       1800  1578  1469  1422
DMR              845   683   607   562       1750  1506  1391  1323
LF-LDA           1162  1076  1016  1012      2436  2404  2394  2396
WF-LDA           894   839   827   842       1853  1766  1830  1854
LLDA             1543                        2958
PLLDA, K:        5     10    20    50        5     10    20    50
                 1060  886   735   642       2181  1863  1647  1456

Page 69:

Other Examples Even More Models

Outline

1 Applications of (Discrete) Matrix Factorisation

2 Statistical Background

3 Statistical Methods

4 Non-parametric Topic Models

5 Other Examples
   Document Segmentation
   Topic Models with Side Information
   Even More Models

Page 70:

Other Examples Even More Models

Twitter-Network Topic Model

[Figure: the Twitter-Network topic model: Authors, Links, Hashtags and Words connected through Author Topic, Doc. Topic and Misc. Topic nodes, with Word Dist. and Tags Dist. components.]

See Lim et al. IJAR (2016)

Page 71:

Other Examples Even More Models

Adaptive Sequential Topic Model

A more complex (sequential) document model.

The PYPs exist in long chains (~ν_1, ~ν_2, ..., ~ν_J).

A single probability vector ~ν_j can have two parents, ~ν_{j−1} and ~µ.

More complex chains of table indicators and block sampling.

See Du et al. EMNLP (2012)

Page 72:

Other Examples Even More Models

Author-Citation Topic Model

(probability vector hierarchies circled in red)

See Lim and Buntine, ACML (2014)

Page 73:

Other Examples Even More Models

Conclusion

Models many key problems on discrete matrices, tensors and graphs.

A wide variety of complex probabilistic networks can be constructed using rich side information and priors.

The combination of collapsing and augmentation often leads to fast algorithms based on Gibbs sampling with simple distributions.

State-of-the-art performance on several problems.

Currently working on an automated/compiling system à la BUGS and Stan.
