1
Statistical Approaches to Joint Modeling of Text and Network Data
Arthur Asuncion, Qiang Liu, Padhraic Smyth
UC Irvine
MURI Project Meeting, August 25, 2009
2
Outline
• Models:
  – The “topic model”: Latent Dirichlet Allocation (LDA)
  – Relational topic model (RTM)
• Inference techniques:
  – Collapsed Gibbs sampling
  – Fast collapsed variational inference
  – Parameter estimation, approximation of non-edges
• Performance on document networks:
  – Citation network of CS research papers
  – Wikipedia pages of Netflix movies
  – Enron emails
• Discussion:
  – RTM’s relationship to latent-space models
  – Extensions
3
Motivation
• In (online) social networks, nodes/edges often have associated text (e.g. blog posts, emails, tweets)
• Topic models are suitable for high-dimensional count data, such as text or images
• Jointly modeling text and network data can be useful:
  – Interpretability: which “topics” are associated with each node/edge?
  – Link prediction and clustering, based on topics
4
What is topic modeling?
• Learning “topics” from a set of documents in a statistical, unsupervised fashion
• Many useful applications:
  – Improved web searching
  – Automatic indexing of digital historical archives
  – Specialized search browsers (e.g. medical applications)
  – Legal applications (e.g. email forensics)

[Diagram: a topic model algorithm takes a “bag-of-words” corpus and the number of topics as input, and outputs a list of “topics” and a topical characterization of each document.]
5
Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]
• History:
  – 1988: Latent Semantic Analysis (LSA)
    • Singular Value Decomposition (SVD) of the word-document count matrix
  – 1999: Probabilistic Latent Semantic Analysis (PLSA)
    • Non-negative matrix factorization (NMF) -- the version which minimizes KL divergence
  – 2003: Latent Dirichlet Allocation (LDA)
    • Bayesian version of PLSA

P(word | doc) ≈ P(word | topic) * P(topic | doc)
(a W x D matrix is approximated by the product of a W x K matrix and a K x D matrix)
6
Graphical model for LDA

[Plate diagram: word tokens x_id with topic assignments z_id, document-topic distributions Θ_d, and topic-word distributions Φ_k; plates over the K topics, D documents, and N_d tokens per document.]

• Each document d has a distribution over topics: Θ_d ~ Dirichlet(α)
• Each topic k is a distribution over words: Φ_k ~ Dirichlet(β)
• The topic assignment for each word is drawn from the document’s mixture: z_id ~ Discrete(Θ_d)
• The specific word is drawn from the topic z_id: x_id ~ Discrete(Φ_{z_id})

Demo
• Hidden/observed variables are in unshaded/shaded circles.
• Parameters are in boxes.
• Plates denote replication across indices.
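To make the generative process concrete, here is a minimal sketch in Python/NumPy of sampling a toy corpus from the LDA model above. The corpus sizes and variable names are illustrative, not the settings used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

K, W, D = 5, 1000, 20            # number of topics, vocabulary size, documents
alpha, beta = 0.1, 0.01          # Dirichlet hyperparameters
doc_lengths = rng.poisson(100, size=D)

# Each topic k is a distribution over the W words: Phi_k ~ Dirichlet(beta)
Phi = rng.dirichlet(np.full(W, beta), size=K)

corpus = []
for d in range(D):
    # Each document d has a distribution over topics: Theta_d ~ Dirichlet(alpha)
    theta_d = rng.dirichlet(np.full(K, alpha))
    doc = []
    for _ in range(doc_lengths[d]):
        z = rng.choice(K, p=theta_d)   # topic assignment z_id ~ Discrete(Theta_d)
        x = rng.choice(W, p=Phi[z])    # word x_id ~ Discrete(Phi_{z_id})
        doc.append(x)
    corpus.append(doc)
```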
7
What if the corpus has network structure?
CORA citation network. Figure from [Chang, Blei, AISTATS 2009]
8
Relational Topic Model (RTM)
[Chang, Blei, 2009]
• Same setup as LDA, except now we also observe network information across documents (an adjacency matrix)

[Plate diagram: the LDA structure is replicated for a pair of documents d and d' (topic assignments z_id and z_id', words x_id and x_id', topic mixtures Θ_d and Θ_d'), and each document pair has a binary link variable y_{d,d'} drawn from a “link probability function”.]

• Documents with similar topics are more likely to be linked.
9
Link probability functions
• Exponential:  ψ(y_{d,d'} = 1) = exp( η^T (z̄_d ∘ z̄_d') + ν )
• Sigmoid:      ψ(y_{d,d'} = 1) = σ( η^T (z̄_d ∘ z̄_d') + ν )
• Normal CDF:   ψ(y_{d,d'} = 1) = Φ( η^T (z̄_d ∘ z̄_d') + ν )
• Normal:       ψ(y_{d,d'} = 1) = exp( -η^T ((z̄_d - z̄_d') ∘ (z̄_d - z̄_d')) - ν )
  – where ∘ is the element-wise (Hadamard) product and z̄_d is the average of the 0/1 topic-indicator vectors (of size K) of the words in document d

Note: The formulation above is similar to “cosine distance”, but since we don’t divide by the magnitude, this is not a true notion of “distance”.
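As a concrete illustration, here is a minimal sketch in Python/NumPy of computing the link probability for a pair of documents from their mean topic-assignment vectors. The function and variable names are hypothetical, and the formula assumed is the exponential form written above.

```python
import numpy as np

def exp_link_probability(zbar_d, zbar_dprime, eta, nu):
    """Exponential link probability: exp(eta . (zbar_d * zbar_d') + nu),
    where * is the element-wise (Hadamard) product."""
    score = np.dot(eta, zbar_d * zbar_dprime) + nu
    return np.exp(score)

# Toy usage: K = 3 topics, two documents with similar topic mixtures
zbar_d = np.array([0.7, 0.2, 0.1])
zbar_dprime = np.array([0.6, 0.3, 0.1])
eta = np.array([2.0, 1.0, 0.5])
nu = -3.0
print(exp_link_probability(zbar_d, zbar_dprime, eta, nu))
```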
10
Approximate inference techniques (because exact inference is intractable)
• Collapsed Gibbs sampling (CGS):
  – Integrate out Θ and Φ
  – Sample each z_id from its conditional distribution (a minimal sketch follows below)
  – CGS for LDA: [Griffiths, Steyvers, 2004]
• Fast collapsed variational Bayesian inference (“CVB0”):
  – Integrate out Θ and Φ
  – Update the variational distribution for each z_id using the conditional
  – CVB0 for LDA: [Asuncion, Welling, Smyth, Teh, 2009]
• Other options:
  – ML/MAP estimation, non-collapsed GS, non-collapsed VB, etc.
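For reference, here is a minimal sketch of one collapsed Gibbs sweep for plain LDA (the RTM sampler additionally multiplies in the edge and non-edge terms discussed on the next slide). The count-array names are hypothetical.

```python
import numpy as np

def gibbs_sweep(words, docs, z, n_wk, n_k, n_kd, alpha, beta, rng):
    """One collapsed Gibbs sweep for LDA.
    words[i], docs[i], z[i]: word id, document id, and current topic of token i.
    n_wk: word-topic counts (W x K); n_k: topic counts (K,); n_kd: topic-document counts (K x D)."""
    W = n_wk.shape[0]
    for i in range(len(words)):
        w, d, k_old = words[i], docs[i], z[i]
        # Remove token i from the count statistics
        n_wk[w, k_old] -= 1; n_k[k_old] -= 1; n_kd[k_old, d] -= 1
        # Conditional p(z_i = k | rest): this is the "LDA term"
        p = (n_wk[w] + beta) / (n_k + W * beta) * (n_kd[:, d] + alpha)
        p /= p.sum()
        k_new = rng.choice(len(p), p=p)
        # Add token i back with its newly sampled topic
        n_wk[w, k_new] += 1; n_k[k_new] += 1; n_kd[k_new, d] += 1
        z[i] = k_new
```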
11
Collapsed Gibbs sampling for RTM
• Conditional distribution of each z: the product of the usual LDA term, an “edge” term (over the document’s observed links), and a “non-edge” term (over the document pairs without links).
• Using the exponential link probability function, it is computationally efficient to calculate the “edge” term.
• It is very costly to compute the “non-edge” term exactly.
12
Approximating the non-edges
1. Assume non-edges are “missing” and ignore the term entirely (Chang/Blei).
2. Make a fast approximation to the non-edge term.
3. Subsample non-edges and exactly calculate the term over the subset.
4. Subsample non-edges, but instead of recalculating statistics for every z_id token, calculate statistics once per document and cache them over each Gibbs sweep.
13
Variational inference
• Minimize the Kullback-Leibler (KL) divergence between the true posterior and a “variational” posterior (equivalent to maximizing the “evidence lower bound”):

  log p(y)  >=  E_q[ log p(y, h) ] - E_q[ log q(h) ]

  (Jensen’s inequality; the gap is KL[ q(h) || p(h|y) ], so by maximizing this lower bound we are implicitly minimizing KL(q, p).)

• Typically we use a factorized variational posterior for computational reasons: q(h) = Π_i q(h_i)
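For completeness, the standard derivation of this bound via Jensen's inequality, in the same notation (y observed, h hidden):

```latex
% Evidence lower bound via Jensen's inequality
\log p(y) \;=\; \log \int q(h)\,\frac{p(y,h)}{q(h)}\,dh
          \;\ge\; \int q(h)\,\log\frac{p(y,h)}{q(h)}\,dh
          \;=\; \mathbb{E}_q[\log p(y,h)] - \mathbb{E}_q[\log q(h)]

% The gap between the evidence and the bound is exactly the KL divergence
\log p(y) - \mathrm{ELBO}(q) \;=\; \mathrm{KL}\!\left(q(h)\,\middle\|\,p(h \mid y)\right) \;\ge\; 0
```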
14
CVB0 inference for topic models [Asuncion, Welling, Smyth, Teh, 2009]
• Collapsed Gibbs sampling: each z_id is sampled from its conditional distribution given all other topic assignments.
• Collapsed variational inference (0th-order approximation): q(z_id) is set proportional to that same conditional, evaluated with expected counts.
• Statistics affected by q(z_id):
  – Counts in the LDA term
  – Counts in the Hadamard product
• CVB0 is a “soft” Gibbs update: deterministic, and very similar to ML/MAP estimation.
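A minimal sketch of the CVB0 update for the LDA part of the model (the RTM edge and non-edge terms are omitted). Here gamma holds one soft topic distribution per token, and the array names are hypothetical.

```python
import numpy as np

def cvb0_sweep(words, docs, gamma, N_wk, N_k, N_kd, alpha, beta):
    """One CVB0 sweep: like collapsed Gibbs, but each token i keeps a soft
    assignment gamma[i] (a distribution over the K topics), and the count
    arrays hold expected counts (sums of gammas) rather than hard counts."""
    W = N_wk.shape[0]
    for i in range(len(words)):
        w, d, g_old = words[i], docs[i], gamma[i]
        # Remove token i's expected counts
        N_wk[w] -= g_old; N_k -= g_old; N_kd[:, d] -= g_old
        # "Soft" Gibbs update: same conditional as CGS, but deterministic
        g_new = (N_wk[w] + beta) / (N_k + W * beta) * (N_kd[:, d] + alpha)
        g_new /= g_new.sum()
        # Add the updated expected counts back
        N_wk[w] += g_new; N_k += g_new; N_kd[:, d] += g_new
        gamma[i] = g_new
```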
15
Parameter estimation
• We learn the parameters of the link function (γ = [η, ν]) via gradient ascent, with a suitable step size.
• We learn the hyperparameters (α, β) via a fixed-point algorithm [Minka, 2000].
  – It is also possible to Gibbs sample α and β.
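As an illustration of the gradient-ascent step, here is a minimal sketch that assumes the sigmoid link function, in which case the update is essentially logistic regression on Hadamard-product features. The pair list, variable names, and step size are hypothetical, and in practice the non-edges would be subsampled or approximated as described earlier.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gradient_step(pairs, zbar, eta, nu, step_size=0.01):
    """One gradient-ascent step on the link-function parameters (eta, nu),
    assuming the sigmoid link. `pairs` is a list of (d, d_prime, y) with
    y = 1 for an observed edge and y = 0 for a (possibly subsampled) non-edge;
    zbar[d] is the mean topic-assignment vector of document d."""
    grad_eta = np.zeros_like(eta)
    grad_nu = 0.0
    for d, dp, y in pairs:
        feat = zbar[d] * zbar[dp]              # Hadamard-product feature
        resid = y - sigmoid(eta @ feat + nu)   # residual of the logistic likelihood
        grad_eta += resid * feat
        grad_nu += resid
    return eta + step_size * grad_eta, nu + step_size * grad_nu
```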
16
Document networks

Dataset              # Docs    # Links   Avg. Doc Length   Vocab Size   Link Semantics
CORA                  4,000     17,000             1,200       60,000   Paper citation (undirected)
Netflix Movies       10,000     43,000               640       38,000   Common actor/director
Enron (Undirected)    1,000     16,000             7,000       55,000   Communication between person i and person j
Enron (Directed)      2,000     21,000             3,500       55,000   Email from person i to person j
17
Link rank
• We use “link rank” on held-out data as our evaluation metric. Lower is better.
• How to compute link rank for RTM:
  1. Run the RTM Gibbs sampler on {d_train} and obtain {Φ, Θ_train, η, ν}
  2. Given Φ, fold in d_test to obtain Θ_test
  3. Given {Θ_train, Θ_test, η, ν}, calculate the probability that d_test would link to each d_train, and rank {d_train} according to these probabilities
  4. For each observed link between d_test and {d_train}, find its “rank”, and average all these ranks to obtain the “link rank” (a short sketch of this computation follows below)

[Diagram: a black-box predictor takes d_test, {d_train}, and the edges among {d_train}, and outputs a ranking over {d_train}; this ranking is compared against the held-out edges between d_test and {d_train} to produce the link ranks.]
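A minimal sketch of the link-rank computation for one held-out document, assuming its link probabilities to all training documents have already been computed (the names below are hypothetical):

```python
import numpy as np

def link_rank(scores, true_links):
    """Average rank of the held-out links of one test document.
    scores[j]: predicted link probability between d_test and training doc j.
    true_links: indices of training docs that d_test actually links to."""
    order = np.argsort(-scores)                       # most probable training doc first
    rank_of = np.empty_like(order)
    rank_of[order] = np.arange(1, len(scores) + 1)    # rank 1 = most probable
    return rank_of[true_links].mean()

# Toy usage: 5 training documents, d_test truly links to docs 0 and 3
scores = np.array([0.9, 0.1, 0.4, 0.7, 0.2])
print(link_rank(scores, [0, 3]))   # ranks 1 and 2 -> link rank 1.5
```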
18
Results on CORA data

[Bar chart: “Comparison on CORA, K=20” -- link rank for Baseline (TF-IDF/Cosine), LDA + Regression, Ignoring non-edges, Fast approximation of non-edges, and Subsampling non-edges (20%) + Caching.]

We performed 8-fold cross-validation. Random guessing gives link rank = 2000.
19
Results on CORA data

[Plot: link rank vs. number of topics, for the Baseline and RTM with the fast approximation.]
[Plot: link rank vs. percentage of words per document, for the Baseline, LDA + Regression (K=40), Ignoring Non-Edges (K=40), Fast Approximation (K=40), and Subsampling (5%) + Caching (K=40).]

• The model does better with more topics
• The model does better with more words in each document
20
Timing Results on CORA

[Plot: “CORA, K=20” -- time (in seconds) vs. number of documents, for LDA + Regression, Ignoring Non-Edges, Fast Approximation, Subsampling (5%) + Caching, and Subsampling (20%) + Caching.]

“Subsampling (20%) without caching” is not shown since it takes 62,000 seconds for D=1000 and 3,720,150 seconds for D=4000.
21
CGS vs. CVB0 inference

[Plot: “CORA, K=40, S=1, Fast Approximation” -- link rank vs. iteration for CGS and CVB0.]

Total time: CGS = 5285 seconds, CVB0 = 4191 seconds.
CVB0 converges more quickly. Also, each iteration is faster due to clumping of data points.
22
Results on Netflix

NETFLIX, K=20 (link rank; lower is better):
  Random Guessing               5000
  Baseline (TF-IDF / Cosine)     541
  LDA + Regression              2321
  Ignoring Non-Edges            1955
  Fast Approximation            2089   (Note: 1256 with K=50)
  Subsampling 5% + Caching      1739

The baseline does very well! Needs more investigation…
23
Some Netflix topics
POLICE:    [t2]  police agent kill gun action escape car film
DISNEY:    [t4]  disney film animated movie christmas cat animation story
AMERICAN:  [t5]  president war american political united states government against
CHINESE:   [t6]  film kong hong chinese chan wong china link
WESTERN:   [t7]  western town texas sheriff eastwood west clint genre
SCI-FI:    [t8]  earth science space fiction alien bond planet ship
AWARDS:    [t9]  award film academy nominated won actor actress picture
WAR:       [t20] war soldier army officer captain air military general
FRENCH:    [t21] french film jean france paris fran les link
HINDI:     [t24] film hindi award link india khan indian music
MUSIC:     [t28] album song band music rock live soundtrack record
JAPANESE:  [t30] anime japanese manga series english japan retrieved character
BRITISH:   [t31] british play london john shakespeare film production sir
FAMILY:    [t32] love girl mother family father friend school sister
SERIES:    [t35] series television show episode season character episodes original
SPIELBERG: [t36] spielberg steven park joe future marty gremlin jurassic
MEDIEVAL:  [t37] king island robin treasure princess lost adventure castle
GERMAN:    [t38] film german russian von germany language anna soviet
GIBSON:    [t41] max ben danny gibson johnny mad ice mel
MUSICAL:   [t42] musical phantom opera song music broadway stage judy
BATTLE:    [t43] power human world attack character battle earth game
MURDER:    [t46] death murder kill police killed wife later killer
SPORTS:    [t47] team game player rocky baseball play charlie ruth
KING:      [t48] king henry arthur queen knight anne prince elizabeth
HORROR:    [t49] horror film dracula scooby doo vampire blood ghost
24
Some movie examples
• 'Sholay'
  – Indian film, 45% of words belong to topic 24 (Hindi topic)
  – Top 5 most probable movie links in the training set:
    • 'Laawaris'
    • 'Hote Hote Pyaar Ho Gaya'
    • 'Trishul'
    • 'Mr. Natwarlal'
    • 'Rangeela'
• 'Cowboy'
  – Western film, 25% of words belong to topic 7 (western topic)
  – Top 5 most probable movie links in the training set:
    • 'Tall in the Saddle'
    • 'The Indian Fighter'
    • 'Dakota'
    • 'The Train Robbers'
    • 'A Lady Takes a Chance'
• 'Rocky II'
  – Boxing film, 40% of words belong to topic 47 (sports topic)
  – Top 5 most probable movie links in the training set:
    • 'Bull Durham'
    • '2003 World Series'
    • 'Bowfinger'
    • 'Rocky V'
    • 'Rocky IV'
25
Directed vs. Undirected RTM on ENRON emails

[Plot: “ENRON, S=2” -- link rank vs. number of topics K (10 to 40) for the undirected RTM and the directed RTM.]

• Undirected: aggregate incoming & outgoing emails into 1 document
• Directed: aggregate incoming emails into 1 “receiver” document and outgoing emails into 1 “sender” document
• The directed RTM performs better than the undirected RTM
• Random guessing: link rank = 500
26
Discussion
• RTM is similar to latent-space models:
  [Comparison of link functions: RTM; the projection model [Hoff, Raftery, Handcock, 2002]; the multiplicative latent factor model [Hoff, 2006].]
• Topic mixtures (the “topic space”) can be combined with the other dimensions (the “social space”) to create a combined latent position z.
• Other extensions:
  – Include other attributes in the link probability (e.g. timestamp of email, language of movie)
  – Use a non-parametric prior over the dimensionality of the latent space (e.g. use Dirichlet processes)
  – Place a hierarchy over {θ_d} to learn clusters of documents, similar to the latent position cluster model [Handcock, Raftery, Tantrum, 2007]
27
Conclusion
• Relational topic modeling provides a useful start for combining text and network data in a single statistical framework
• RTM can improve over simpler approaches for link prediction
• Opportunities for future work:
  – Faster algorithms for larger data sets
  – Better understanding of non-edge modeling
  – Extended models
28
Thank you!