1
Statistical Approaches to Joint Modeling of Text and Network Data
Arthur Asuncion, Qiang Liu, Padhraic Smyth
UC Irvine
MURI Project Meeting, August 25, 2009
2
Outline
• Models:
  – The “topic model”: Latent Dirichlet Allocation (LDA)
  – Relational topic model (RTM)
• Inference techniques:
  – Collapsed Gibbs sampling
  – Fast collapsed variational inference
  – Parameter estimation, approximation of non-edges
• Performance on document networks:
  – Citation network of CS research papers
  – Wikipedia pages of Netflix movies
  – Enron emails
• Discussion:
  – RTM’s relationship to latent-space models
  – Extensions
3
Motivation
• In (online) social networks, nodes/edges often have associated text (e.g. blog posts, emails, tweets)
• Topic models are suitable for high-dimensional count data, such as text or images
• Jointly modeling text and network data can be useful:
  – Interpretability: which “topics” are associated with each node/edge?
  – Link prediction and clustering, based on topics
4
What is topic modeling?
• Learning “topics” from a set of documents in a statistical, unsupervised fashion
• Many useful applications:
  – Improved web searching
  – Automatic indexing of digital historical archives
  – Specialized search browsers (e.g. medical applications)
  – Legal applications (e.g. email forensics)

[Diagram: a topic model algorithm takes a “bag-of-words” corpus and the number of topics as input, and outputs a list of “topics” and a topical characterization of each document.]
5
Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]
• History:
  – 1988: Latent Semantic Analysis (LSA)
    • Singular Value Decomposition (SVD) of the word-document count matrix
  – 1999: Probabilistic Latent Semantic Analysis (PLSA)
    • Non-negative matrix factorization (NMF) -- the version which minimizes KL divergence
  – 2003: Latent Dirichlet Allocation (LDA)
    • Bayesian version of PLSA

P(word | doc) ≈ P(word | topic) * P(topic | doc)
(a W x D matrix is approximated by the product of a W x K matrix and a K x D matrix)
6
Graphical model for LDA

[Plate diagram: word tokens x_id with topic assignments z_id, document-topic distributions Θ_d, and topic-word distributions Φ_k; plates over the K topics, D documents, and N_d tokens per document.]

• Each document d has a distribution over topics: Θ_d ~ Dirichlet(α)
• Each topic k is a distribution over words: Φ_k ~ Dirichlet(β)
• The topic assignment for each word is drawn from the document’s mixture: z_id ~ Discrete(Θ_d)
• The specific word is drawn from the topic z_id: x_id ~ Discrete(Φ_{z_id})

Demo
• Hidden/observed variables are in unshaded/shaded circles.
• Parameters are in boxes.
• Plates denote replication across indices.
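To make the generative process concrete, here is a minimal sketch in Python/NumPy of sampling a toy corpus from the LDA model above. The corpus sizes and variable names are illustrative, not the settings used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

K, W, D = 5, 1000, 20            # number of topics, vocabulary size, documents
alpha, beta = 0.1, 0.01          # Dirichlet hyperparameters
doc_lengths = rng.poisson(100, size=D)

# Each topic k is a distribution over the W words: Phi_k ~ Dirichlet(beta)
Phi = rng.dirichlet(np.full(W, beta), size=K)

corpus = []
for d in range(D):
    # Each document d has a distribution over topics: Theta_d ~ Dirichlet(alpha)
    theta_d = rng.dirichlet(np.full(K, alpha))
    doc = []
    for _ in range(doc_lengths[d]):
        z = rng.choice(K, p=theta_d)   # topic assignment z_id ~ Discrete(Theta_d)
        x = rng.choice(W, p=Phi[z])    # word x_id ~ Discrete(Phi_{z_id})
        doc.append(x)
    corpus.append(doc)
```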
7
What if the corpus has network structure?
CORA citation network. Figure from [Chang, Blei, AISTATS 2009]
8
Relational Topic Model (RTM)
[Chang, Blei, 2009]
• Same setup as LDA, except now we also observe network information across documents (an adjacency matrix)

[Plate diagram: the LDA structure is replicated for a pair of documents d and d' (topic assignments z_id and z_id', words x_id and x_id', topic mixtures Θ_d and Θ_d'), and each document pair has a binary link variable y_{d,d'} drawn from a “link probability function”.]

• Documents with similar topics are more likely to be linked.
9
Link probability functions
• Exponential:  ψ(y_{d,d'} = 1) = exp( η^T (z̄_d ∘ z̄_d') + ν )
• Sigmoid:      ψ(y_{d,d'} = 1) = σ( η^T (z̄_d ∘ z̄_d') + ν )
• Normal CDF:   ψ(y_{d,d'} = 1) = Φ( η^T (z̄_d ∘ z̄_d') + ν )
• Normal:       ψ(y_{d,d'} = 1) = exp( -η^T ((z̄_d - z̄_d') ∘ (z̄_d - z̄_d')) - ν )
  – where ∘ is the element-wise (Hadamard) product and z̄_d is the average of the 0/1 topic-indicator vectors (of size K) of the words in document d

Note: The formulation above is similar to “cosine distance”, but since we don’t divide by the magnitude, this is not a true notion of “distance”.
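As a concrete illustration, here is a minimal sketch in Python/NumPy of computing the link probability for a pair of documents from their mean topic-assignment vectors. The function and variable names are hypothetical, and the formula assumed is the exponential form written above.

```python
import numpy as np

def exp_link_probability(zbar_d, zbar_dprime, eta, nu):
    """Exponential link probability: exp(eta . (zbar_d * zbar_d') + nu),
    where * is the element-wise (Hadamard) product."""
    score = np.dot(eta, zbar_d * zbar_dprime) + nu
    return np.exp(score)

# Toy usage: K = 3 topics, two documents with similar topic mixtures
zbar_d = np.array([0.7, 0.2, 0.1])
zbar_dprime = np.array([0.6, 0.3, 0.1])
eta = np.array([2.0, 1.0, 0.5])
nu = -3.0
print(exp_link_probability(zbar_d, zbar_dprime, eta, nu))
```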
10
Approximate inference techniques (because exact inference is intractable)
• Collapsed Gibbs sampling (CGS):
  – Integrate out Θ and Φ
  – Sample each z_id from its conditional distribution (a minimal sketch follows below)
  – CGS for LDA: [Griffiths, Steyvers, 2004]
• Fast collapsed variational Bayesian inference (“CVB0”):
  – Integrate out Θ and Φ
  – Update the variational distribution for each z_id using the conditional
  – CVB0 for LDA: [Asuncion, Welling, Smyth, Teh, 2009]
• Other options:
  – ML/MAP estimation, non-collapsed GS, non-collapsed VB, etc.
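For reference, here is a minimal sketch of one collapsed Gibbs sweep for plain LDA (the RTM sampler additionally multiplies in the edge and non-edge terms discussed on the next slide). The count-array names are hypothetical.

```python
import numpy as np

def gibbs_sweep(words, docs, z, n_wk, n_k, n_kd, alpha, beta, rng):
    """One collapsed Gibbs sweep for LDA.
    words[i], docs[i], z[i]: word id, document id, and current topic of token i.
    n_wk: word-topic counts (W x K); n_k: topic counts (K,); n_kd: topic-document counts (K x D)."""
    W = n_wk.shape[0]
    for i in range(len(words)):
        w, d, k_old = words[i], docs[i], z[i]
        # Remove token i from the count statistics
        n_wk[w, k_old] -= 1; n_k[k_old] -= 1; n_kd[k_old, d] -= 1
        # Conditional p(z_i = k | rest): this is the "LDA term"
        p = (n_wk[w] + beta) / (n_k + W * beta) * (n_kd[:, d] + alpha)
        p /= p.sum()
        k_new = rng.choice(len(p), p=p)
        # Add token i back with its newly sampled topic
        n_wk[w, k_new] += 1; n_k[k_new] += 1; n_kd[k_new, d] += 1
        z[i] = k_new
```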
11
Collapsed Gibbs sampling for RTM
• Conditional distribution of each z: the product of the usual LDA term, an “edge” term (over the document’s observed links), and a “non-edge” term (over the document pairs without links).
• Using the exponential link probability function, it is computationally efficient to calculate the “edge” term.
• It is very costly to compute the “non-edge” term exactly.
12
Approximating the non-edges
1. Assume non-edges are “missing” and ignore the term entirely (Chang/Blei).
2. Make a fast approximation to the non-edge term.
3. Subsample non-edges and exactly calculate the term over the subset.
4. Subsample non-edges, but instead of recalculating statistics for every z_id token, calculate statistics once per document and cache them over each Gibbs sweep.
13
Variational inference
• Minimize the Kullback-Leibler (KL) divergence between the true posterior and a “variational” posterior (equivalent to maximizing the “evidence lower bound”):

  log p(y)  >=  E_q[ log p(y, h) ] - E_q[ log q(h) ]

  (Jensen’s inequality; the gap is KL[ q(h) || p(h|y) ], so by maximizing this lower bound we are implicitly minimizing KL(q, p).)

• Typically we use a factorized variational posterior for computational reasons: q(h) = Π_i q(h_i)
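For completeness, the standard derivation of this bound via Jensen's inequality, in the same notation (y observed, h hidden):

```latex
% Evidence lower bound via Jensen's inequality
\log p(y) \;=\; \log \int q(h)\,\frac{p(y,h)}{q(h)}\,dh
          \;\ge\; \int q(h)\,\log\frac{p(y,h)}{q(h)}\,dh
          \;=\; \mathbb{E}_q[\log p(y,h)] - \mathbb{E}_q[\log q(h)]

% The gap between the evidence and the bound is exactly the KL divergence
\log p(y) - \mathrm{ELBO}(q) \;=\; \mathrm{KL}\!\left(q(h)\,\middle\|\,p(h \mid y)\right) \;\ge\; 0
```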
14
CVB0 inference for topic models [Asuncion, Welling, Smyth, Teh, 2009]
• Collapsed Gibbs sampling: each z_id is sampled from its conditional distribution given all other topic assignments.
• Collapsed variational inference (0th-order approximation): q(z_id) is set proportional to that same conditional, evaluated with expected counts.
• Statistics affected by q(z_id):
  – Counts in the LDA term
  – Counts in the Hadamard product
• CVB0 is a “soft” Gibbs update: deterministic, and very similar to ML/MAP estimation.
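A minimal sketch of the CVB0 update for the LDA part of the model (the RTM edge and non-edge terms are omitted). Here gamma holds one soft topic distribution per token, and the array names are hypothetical.

```python
import numpy as np

def cvb0_sweep(words, docs, gamma, N_wk, N_k, N_kd, alpha, beta):
    """One CVB0 sweep: like collapsed Gibbs, but each token i keeps a soft
    assignment gamma[i] (a distribution over the K topics), and the count
    arrays hold expected counts (sums of gammas) rather than hard counts."""
    W = N_wk.shape[0]
    for i in range(len(words)):
        w, d, g_old = words[i], docs[i], gamma[i]
        # Remove token i's expected counts
        N_wk[w] -= g_old; N_k -= g_old; N_kd[:, d] -= g_old
        # "Soft" Gibbs update: same conditional as CGS, but deterministic
        g_new = (N_wk[w] + beta) / (N_k + W * beta) * (N_kd[:, d] + alpha)
        g_new /= g_new.sum()
        # Add the updated expected counts back
        N_wk[w] += g_new; N_k += g_new; N_kd[:, d] += g_new
        gamma[i] = g_new
```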
15
Parameter estimation
• We learn the parameters of the link function (γ = [η, ν]) via gradient ascent, with a suitable step size.
• We learn the hyperparameters (α, β) via a fixed-point algorithm [Minka, 2000].
  – It is also possible to Gibbs sample α and β.
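As an illustration of the gradient-ascent step, here is a minimal sketch that assumes the sigmoid link function, in which case the update is essentially logistic regression on Hadamard-product features. The pair list, variable names, and step size are hypothetical, and in practice the non-edges would be subsampled or approximated as described earlier.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gradient_step(pairs, zbar, eta, nu, step_size=0.01):
    """One gradient-ascent step on the link-function parameters (eta, nu),
    assuming the sigmoid link. `pairs` is a list of (d, d_prime, y) with
    y = 1 for an observed edge and y = 0 for a (possibly subsampled) non-edge;
    zbar[d] is the mean topic-assignment vector of document d."""
    grad_eta = np.zeros_like(eta)
    grad_nu = 0.0
    for d, dp, y in pairs:
        feat = zbar[d] * zbar[dp]              # Hadamard-product feature
        resid = y - sigmoid(eta @ feat + nu)   # residual of the logistic likelihood
        grad_eta += resid * feat
        grad_nu += resid
    return eta + step_size * grad_eta, nu + step_size * grad_nu
```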
16
Document networks

Dataset              # Docs    # Links   Avg. Doc Length   Vocab Size   Link Semantics
CORA                  4,000     17,000             1,200       60,000   Paper citation (undirected)
Netflix Movies       10,000     43,000               640       38,000   Common actor/director
Enron (Undirected)    1,000     16,000             7,000       55,000   Communication between person i and person j
Enron (Directed)      2,000     21,000             3,500       55,000   Email from person i to person j
17
Link rank
• We use “link rank” on held-out data as our evaluation metric. Lower is better.
• How to compute link rank for RTM:
  1. Run the RTM Gibbs sampler on {d_train} and obtain {Φ, Θ_train, η, ν}
  2. Given Φ, fold in d_test to obtain Θ_test
  3. Given {Θ_train, Θ_test, η, ν}, calculate the probability that d_test would link to each d_train, and rank {d_train} according to these probabilities
  4. For each observed link between d_test and {d_train}, find its “rank”, and average all these ranks to obtain the “link rank” (a short sketch of this computation follows below)

[Diagram: a black-box predictor takes d_test, {d_train}, and the edges among {d_train}, and outputs a ranking over {d_train}; this ranking is compared against the held-out edges between d_test and {d_train} to produce the link ranks.]
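A minimal sketch of the link-rank computation for one held-out document, assuming its link probabilities to all training documents have already been computed (the names below are hypothetical):

```python
import numpy as np

def link_rank(scores, true_links):
    """Average rank of the held-out links of one test document.
    scores[j]: predicted link probability between d_test and training doc j.
    true_links: indices of training docs that d_test actually links to."""
    order = np.argsort(-scores)                       # most probable training doc first
    rank_of = np.empty_like(order)
    rank_of[order] = np.arange(1, len(scores) + 1)    # rank 1 = most probable
    return rank_of[true_links].mean()

# Toy usage: 5 training documents, d_test truly links to docs 0 and 3
scores = np.array([0.9, 0.1, 0.4, 0.7, 0.2])
print(link_rank(scores, [0, 3]))   # ranks 1 and 2 -> link rank 1.5
```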
18
Results on CORA data

[Bar chart: “Comparison on CORA, K=20” -- link rank for Baseline (TF-IDF/Cosine), LDA + Regression, Ignoring non-edges, Fast approximation of non-edges, and Subsampling non-edges (20%) + Caching.]

We performed 8-fold cross-validation. Random guessing gives link rank = 2000.
19
Results on CORA data

[Plot: link rank vs. number of topics, for the Baseline and RTM with the fast approximation.]
[Plot: link rank vs. percentage of words per document, for the Baseline, LDA + Regression (K=40), Ignoring Non-Edges (K=40), Fast Approximation (K=40), and Subsampling (5%) + Caching (K=40).]

• The model does better with more topics
• The model does better with more words in each document
20
Timing Results on CORA

[Plot: “CORA, K=20” -- time (in seconds) vs. number of documents, for LDA + Regression, Ignoring Non-Edges, Fast Approximation, Subsampling (5%) + Caching, and Subsampling (20%) + Caching.]

“Subsampling (20%) without caching” is not shown since it takes 62,000 seconds for D=1000 and 3,720,150 seconds for D=4000.
21
CGS vs. CVB0 inference

[Plot: “CORA, K=40, S=1, Fast Approximation” -- link rank vs. iteration for CGS and CVB0.]

Total time: CGS = 5285 seconds, CVB0 = 4191 seconds.
CVB0 converges more quickly. Also, each iteration is faster due to clumping of data points.
22
Results on Netflix

NETFLIX, K=20 (link rank; lower is better):
  Random Guessing               5000
  Baseline (TF-IDF / Cosine)     541
  LDA + Regression              2321
  Ignoring Non-Edges            1955
  Fast Approximation            2089   (Note: 1256 with K=50)
  Subsampling 5% + Caching      1739

The baseline does very well! Needs more investigation…
23
Some Netflix topics
POLICE:    [t2]  police agent kill gun action escape car film
DISNEY:    [t4]  disney film animated movie christmas cat animation story
AMERICAN:  [t5]  president war american political united states government against
CHINESE:   [t6]  film kong hong chinese chan wong china link
WESTERN:   [t7]  western town texas sheriff eastwood west clint genre
SCI-FI:    [t8]  earth science space fiction alien bond planet ship
AWARDS:    [t9]  award film academy nominated won actor actress picture
WAR:       [t20] war soldier army officer captain air military general
FRENCH:    [t21] french film jean france paris fran les link
HINDI:     [t24] film hindi award link india khan indian music
MUSIC:     [t28] album song band music rock live soundtrack record
JAPANESE:  [t30] anime japanese manga series english japan retrieved character
BRITISH:   [t31] british play london john shakespeare film production sir
FAMILY:    [t32] love girl mother family father friend school sister
SERIES:    [t35] series television show episode season character episodes original
SPIELBERG: [t36] spielberg steven park joe future marty gremlin jurassic
MEDIEVAL:  [t37] king island robin treasure princess lost adventure castle
GERMAN:    [t38] film german russian von germany language anna soviet
GIBSON:    [t41] max ben danny gibson johnny mad ice mel
MUSICAL:   [t42] musical phantom opera song music broadway stage judy
BATTLE:    [t43] power human world attack character battle earth game
MURDER:    [t46] death murder kill police killed wife later killer
SPORTS:    [t47] team game player rocky baseball play charlie ruth
KING:      [t48] king henry arthur queen knight anne prince elizabeth
HORROR:    [t49] horror film dracula scooby doo vampire blood ghost
24
Some movie examples
• 'Sholay'
  – Indian film, 45% of words belong to topic 24 (Hindi topic)
  – Top 5 most probable movie links in the training set:
    • 'Laawaris'
    • 'Hote Hote Pyaar Ho Gaya'
    • 'Trishul'
    • 'Mr. Natwarlal'
    • 'Rangeela'
• 'Cowboy'
  – Western film, 25% of words belong to topic 7 (western topic)
  – Top 5 most probable movie links in the training set:
    • 'Tall in the Saddle'
    • 'The Indian Fighter'
    • 'Dakota'
    • 'The Train Robbers'
    • 'A Lady Takes a Chance'
• 'Rocky II'
  – Boxing film, 40% of words belong to topic 47 (sports topic)
  – Top 5 most probable movie links in the training set:
    • 'Bull Durham'
    • '2003 World Series'
    • 'Bowfinger'
    • 'Rocky V'
    • 'Rocky IV'
25
Directed vs. Undirected RTM on ENRON emails

[Plot: “ENRON, S=2” -- link rank vs. number of topics K (10 to 40) for the undirected RTM and the directed RTM.]

• Undirected: aggregate incoming & outgoing emails into 1 document
• Directed: aggregate incoming emails into 1 “receiver” document and outgoing emails into 1 “sender” document
• The directed RTM performs better than the undirected RTM
• Random guessing: link rank = 500
26
Discussion
• RTM is similar to latent-space models:
  [Comparison of link functions: RTM; the projection model [Hoff, Raftery, Handcock, 2002]; the multiplicative latent factor model [Hoff, 2006].]
• Topic mixtures (the “topic space”) can be combined with the other dimensions (the “social space”) to create a combined latent position z.
• Other extensions:
  – Include other attributes in the link probability (e.g. timestamp of email, language of movie)
  – Use a non-parametric prior over the dimensionality of the latent space (e.g. use Dirichlet processes)
  – Place a hierarchy over {θ_d} to learn clusters of documents, similar to the latent position cluster model [Handcock, Raftery, Tantrum, 2007]
27
Conclusion
• Relational topic modeling provides a useful start for combining text and network data in a single statistical framework
• RTM can improve over simpler approaches for link prediction
• Opportunities for future work:
  – Faster algorithms for larger data sets
  – Better understanding of non-edge modeling
  – Extended models
28
Thank you!