Transcript
DIGITAL Institute for Information and Communication Technologies
Topic Models
Claudia Wagner, Graz, 16.9.2010
Semantic Representation of Text
a) Network Model (nodes and edges)
b) Space Model (points and proximity)
c) Probabilistic Models (words belong to a set of probabilistic topics)
(Griffiths, 2007)
Topic Models
= probabilistic models for uncovering the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis of the original texts (Blei, 2003)
Aim: discover patterns of word-use and connect documents that exhibit similar patterns
Idea: documents are mixtures of topics and a topic is a probability distribution over words
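To make this idea concrete, here is a minimal sketch of the generative view in Python (numpy assumed; the topics, mixture weights, and vocabulary are made-up toy values, not from any cited model):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["money", "bank", "loan", "river", "stream"]
# Each topic is a probability distribution over words (toy values).
phi = np.array([
    [0.35, 0.35, 0.25, 0.03, 0.02],  # topic 1, roughly "finance"
    [0.02, 0.30, 0.03, 0.35, 0.30],  # topic 2, roughly "water"
])
# A document is a mixture of topics (toy values).
theta = np.array([0.8, 0.2])

# Generate a 10-word document: pick a topic per word, then a word from it.
doc = []
for _ in range(10):
    z = rng.choice(2, p=theta)               # topic for this word token
    doc.append(rng.choice(vocab, p=phi[z]))  # word drawn from that topic
print(doc)
```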
Topic Models
source: http://www.cs.umass.edu/~wallach/talks/priors.pdf
Topic Models
3 latent variables:
Word distribution per topic (word-topic matrix)
Topic distribution per document (topic-document matrix)
Topic-word assignment for each word
(Steyvers, 2006)
Summary
Observed variables: word distribution per document
3 latent variables:
Topic distribution per document: P(z) = θ(d)
Word distribution per topic: P(w | z) = φ(z)
Word-topic assignment: P(z | w)
Training: learn the latent variables on a training collection of documents
Test: predict the topic distribution θ(d) of an unseen document d
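As a usage illustration of this train/test view, a minimal sketch with the gensim library (assuming gensim is installed; the toy documents are made up):

```python
from gensim import corpora, models

# Toy training collection; a real corpus needs proper preprocessing.
train_texts = [["money", "bank", "loan"],
               ["river", "bank", "stream"],
               ["loan", "money", "bank"]]
dictionary = corpora.Dictionary(train_texts)
corpus = [dictionary.doc2bow(t) for t in train_texts]

# Training: learn the latent variables on the training collection.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Test: predict the topic distribution theta(d) of an unseen document d.
unseen = dictionary.doc2bow(["stream", "river", "bank"])
print(lda.get_document_topics(unseen))
```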
Topic Models
pLSA (Hofmann, 1999)
LDA (Blei, 2003)
Author Model (McCallum, 1999)
Author-Topic Model (Rosen-Zvi, 2004)
Author-Recipient-Topic Model (McCallum, 2004)
Group-Topic Model (Wang, 2005)
Community-Author-Recipient-Topic Model (Pathak, 2008)
Semi-Supervised Topic Models:
Labeled LDA (Ramage, 2009)
pLSA (Hofmann, 1999)
Problem: pLSA is not a proper generative model for new documents! Why? Because we do not learn any corpus-level parameters; we learn a separate topic distribution for each document of the training set.
P(d, w) = P(d) * Σz P(z | d) * P(w | z)
[Plate diagram: outer plate repeated over the number of documents, inner plate over the number of words; P(z | θ) is the topic distribution of a document, P(w | z) the word distribution of a topic.]
Latent Dirichlet Allocation (LDA) (Blei, 2003)
Advantage: we learn corpus-level parameters, so we can predict the topic distribution of an unseen document of this corpus by observing its words.
Hyper-parameters α and β are corpus-level parameters and are sampled only once.
P(d, w) = P(d) * P(θ(d) | α) * P(φ(z) | β) * Σz P(z | θ(d)) * P(w | z, φ(z))
[Plate diagram: outer plate repeated over the number of documents, inner plate over the number of words.]
Dirichlet Prior α
α is a prior on the topic distribution of documents (of a corpus)
α is a corpus-level parameter (chosen once)
α acts as a force on the topic combinations; the amount of smoothing is determined by α
High α: more smoothing, less "distinct" topics; each doc's topic distribution θ is a smooth mix of all topics, e.g. topic distribution of Doc1 = (1/3, 1/3, 1/3)
Low α: the pressure is to pick, for each document, a topic distribution favoring just a few topics, e.g. topic distribution of Doc2 = (1, 0, 0)
Recommended value: α = 50/T (or less if T is very small)
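The smoothing effect of α can be seen by sampling from the Dirichlet directly; a small numpy sketch (the α values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 3  # number of topics

# High alpha: samples cluster around the uniform mix (1/3, 1/3, 1/3).
print(rng.dirichlet([10.0] * T, size=3).round(2))

# Low alpha: samples tend to put almost all mass on one or two topics.
print(rng.dirichlet([0.1] * T, size=3).round(2))
```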
Dirichlet Prior β
β is a prior on the word distribution of topics
β is a corpus-level parameter (chosen once)
β acts as a force on the word combinations; the amount of smoothing is determined by β
High β: more smoothing, e.g. word distribution of Topic1 = (1/3, 1/3, 1/3)
Low β: the pressure is to pick, for each topic, a word distribution favoring just a few words, e.g. word distribution of Topic2 = (1, 0, 0)
Recommended value: β = 0.01
Matrix Representation of LDA
[Figure: the observed word-document matrix is factored into a word-topic matrix φ(z) and a topic-document matrix θ(d); both factor matrices are latent.]
Statistical Inference and Parameter Estimation
Key problem: compute the posterior distribution of the hidden variables given a document
The posterior distribution is intractable for exact inference (Blei, 2003)
P(θ, z | w, α, β) = P(θ, z, w | α, β) / P(w | α, β), i.e. latent variables conditioned on observed variables and priors
Statistical Inference and Parameter Estimation
How can we estimate the posterior distribution of the hidden variables given a corpus of training documents?
Directly (e.g., via expectation maximization, variational inference, or expectation propagation algorithms)
Indirectly, i.e., estimate the posterior distribution over z (i.e., P(z)); Gibbs sampling, a form of Markov chain Monte Carlo, is often used to estimate the posterior over a high-dimensional random variable z
Gibbs Sampling
Generates a sequence of samples from the joint probability distribution of two or more random variables.
Aim: compute the posterior distribution over the latent variable z
Prerequisite: we must know the conditional probability of z:
P(zi = j | z-i, wi, di, ·)
Why do we need to estimate P(z | w) via a random walk? Because z is a high-dimensional random variable: if the number of topics T = 50 and the number of word tokens is 1000, we would have to visit 50^1000 points and compute P(z) for all of them.
Gibbs Sampling for LDA
Random start, then iterate. For each word we compute:
How dominant is a topic z in the doc d? (How often was topic z already used in doc d?)
How likely is a word for a topic z? (How often was word w already assigned to topic z?)
Run Gibbs Sampling Example (1)
1. Random topic assignments: each word token in doc1-doc3 is randomly assigned to topic 1 or topic 2 [figure omitted]
2. Two count matrices:
CWT (words per topic):
topic1 topic2
money 3 2
bank 3 6
loan 2 1
river 2 2
stream 2 1
CDT (topics per document):
doc1 doc2 doc3
topic1 4 4 4
topic2 4 4 4
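A minimal sketch of this initialization step (hypothetical token arrays; W, T, D and the 24 tokens match the toy example above):

```python
import numpy as np

rng = np.random.default_rng(0)
W, T, D, N = 5, 2, 3, 24           # vocab size, topics, docs, word tokens

# Hypothetical corpus: word id and doc id for each of the N tokens.
word_of_token = rng.integers(0, W, size=N)
doc_of_token = rng.integers(0, D, size=N)

# 1. Random topic assignment for every word token.
z = rng.integers(0, T, size=N)

# 2. Build the two count matrices from the assignments.
CWT = np.zeros((W, T), dtype=int)  # words per topic
CDT = np.zeros((D, T), dtype=int)  # topics per document
for w, d, t in zip(word_of_token, doc_of_token, z):
    CWT[w, t] += 1
    CDT[d, t] += 1
```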
Gibbs Sampling for LDA
Probability that topic j is chosen for word wi, conditioned on all other topic assignments of words in this doc and all other observed variables (notation as in Steyvers, 2006):
P(zi = j | z-i, wi, di, ·) ∝ (CWT[wi, j] + β) / (Σw CWT[w, j] + Wβ) * (CDT[di, j] + α) / (Σt CDT[di, t] + Tα)
The first factor counts the number of times word token wi was assigned to topic j across all docs.
The second factor counts the number of times topic j was already assigned to some word token in doc di.
This quantity is unnormalized! => divide the probability of assigning topic j to word wi by the sum over all T topics.
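A sketch of one such update for a single token, following the formula above (it continues the hypothetical arrays from the initialization sketch; the document-length denominator is dropped since it is constant over j):

```python
import numpy as np

def gibbs_update(i, z, word_of_token, doc_of_token, CWT, CDT,
                 alpha, beta, rng):
    """Resample the topic of word token i from its full conditional."""
    w, d, t_old = word_of_token[i], doc_of_token[i], z[i]
    CWT[w, t_old] -= 1            # decrement counts for the old topic
    CDT[d, t_old] -= 1
    W = CWT.shape[0]
    # Unnormalized conditional for every topic j (word part * doc part).
    p = ((CWT[w, :] + beta) / (CWT.sum(axis=0) + W * beta)
         * (CDT[d, :] + alpha))
    p /= p.sum()                  # normalize over all T topics
    t_new = rng.choice(len(p), p=p)
    CWT[w, t_new] += 1            # increment counts for the new topic
    CDT[d, t_new] += 1
    z[i] = t_new
```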
Run Gibbs Sampling Example (2)
CWT (words per topic):
topic1 topic2
money 3 2
bank 3 6
loan 2 1
river 2 2
stream 2 1
CDT (topics per document):
doc1 doc2 doc3
topic1 4 4 4
topic2 4 4 4
First iteration: for the current word token, decrement CDT and CWT for its current topic j, then sample a new topic from the current topic distribution of the doc. [figure omitted]
Run Gibbs Sampling Example (2), continued
CWT (words per topic):
topic1 topic2
money 2 3
bank 3 6
loan 2 1
river 2 2
stream 2 1
CDT (topics per document):
doc1 doc2 doc3
topic1 3 4 4
topic2 5 4 4
First iteration: decrement CDT and CWT for the current topic j, then sample a new topic from the current topic distribution of the doc. Here a "money" token in doc1 was reassigned from topic 1 to topic 2, so its CWT and CDT counts moved accordingly. [figure omitted]
Run Gibbs Sampling Example (3)
α = 50/T = 25 and β = 0.01
P(zi = topic2 | z-i, wi = bank, di, ·) ∝ (5 + 0.01) / (7 + 5*0.01) * (4 + 25) / (3 + 2*25) ≈ 0.39
P(zi = topic1 | z-i, wi = bank, di, ·) ∝ (3 + 0.01) / (8 + 5*0.01) * (3 + 25) / (4 + 2*25) ≈ 0.19
"bank" is assigned to topic 2.
In the second factor, the numerator counts how often topic j was used in doc di; the denominator counts how often all other topics were used in doc di.
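The same arithmetic, reproduced as a quick check (counts taken from the example above):

```python
alpha, beta, W, T = 25.0, 0.01, 5, 2

p2 = (5 + beta) / (7 + W * beta) * (4 + alpha) / (3 + T * alpha)
p1 = (3 + beta) / (8 + W * beta) * (3 + alpha) / (4 + T * alpha)
print(round(p2, 2), round(p1, 2))                # 0.39 0.19

# Normalizing gives the actual probabilities used for sampling.
print(round(p2 / (p1 + p2), 2), round(p1 / (p1 + p2), 2))
```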
Summary: Run Gibbs Sampling
Gibbs sampling is used to estimate the topic assignment of each word of each doc
Factors affecting topic assignments:
How likely is a word w for a topic j? The probability of word w under topic j.
How dominant is a topic j in a doc d? The probability of topic j under the current topic distribution of document d.
Once many tokens of a word have been assigned to topic j (across documents), the probability of assigning any particular token of that word to topic j increases; all other topics become less likely for word w (explaining away).
Once a topic j has been used multiple times in one document, the probability that any word from that document will be assigned to topic j increases; all other topics become less likely for words in this document (explaining away).
Gibbs Sampling Convergence
Black = topic 1, white = topic 2 [figure omitted]
Random start; N iterations; each iteration updates the count matrices
Convergence: the count matrices stop changing and the Gibbs samples start to approximate the target distribution (i.e., the posterior distribution over z)
Gibbs Sampling Convergence
Ignore some number of samples at the beginning (burn-in period)
Consider only every n-th sample when averaging values to compute an expectation (thinning)
Why? Successive Gibbs samples are not independent; they form a Markov chain with some amount of correlation
The stationary distribution of the Markov chain is the desired joint distribution over the latent variables, but it may take a while for that stationary distribution to be reached
Techniques that may reduce autocorrelation are simulated annealing, collapsed Gibbs sampling, and blocked Gibbs sampling
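A sketch of burn-in and thinning around a Gibbs sweep (sample_step is a hypothetical callback that runs one full sweep and returns the current state as a numpy array):

```python
import numpy as np

def posterior_mean(sample_step, n_iter=2000, burn_in=500, thin=10):
    """Average Gibbs samples, discarding burn-in and thinning the chain."""
    kept = []
    for it in range(n_iter):
        state = sample_step()  # one full Gibbs sweep over all tokens
        if it >= burn_in and (it - burn_in) % thin == 0:
            kept.append(state.copy())
    return np.mean(kept, axis=0)
```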
Author-Topic (AT) Model (Rosen-Zvi, 2004)
Aim: discover patterns of word-use and connect authors that exhibit similar patterns
Idea/Intuition: words in a multi-author paper are assumed to be the result of a mixture of the authors' topic mixtures
Each author = distribution over topics
Each topic = distribution over words
Each document with multiple authors = distribution over topics that is a mixture of the distributions associated with the authors
AT-Model Algorithm
Sample author: for each doc d and each word w of that doc, an author x is sampled from the doc's author distribution/set ad.
Sample topic: for each doc d and each word w of that doc, a topic z is sampled from the topic distribution θ(x) of the author x that has been assigned to that word.
Sample word: from the word distribution φ(z) of each sampled topic z, a word w is sampled.
P(x, w) = P(x | ad) * P(θ(x) | α) * P(φ(z) | β) * Σz P(z | x, θ(x)) * P(w | z, φ(z))
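A minimal generative sketch of these three sampling steps (toy parameters; the θ and φ values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["money", "bank", "loan", "river", "stream"]
phi = np.array([[0.35, 0.35, 0.25, 0.03, 0.02],   # word distr. of topic 1
                [0.02, 0.30, 0.03, 0.35, 0.30]])  # word distr. of topic 2
theta = np.array([[0.9, 0.1],                     # topic distr. of author 1
                  [0.2, 0.8]])                    # topic distr. of author 2
a_d = [0, 1]  # the document's author set (authors 1 and 2)

doc = []
for _ in range(8):
    x = rng.choice(a_d)                      # sample an author from a_d
    z = rng.choice(2, p=theta[x])            # sample a topic from theta(x)
    doc.append(rng.choice(vocab, p=phi[z]))  # sample a word from phi(z)
print(doc)
```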
AT Model Latent Variables
1) Author-topic assignment for each word
2) Author distribution of each topic: determines which topics are used by which authors (count matrix CAT)
3) Word distribution of each topic (count matrix CWT)
Matrix Representation of Author-Topic-Model
source: http://www.ics.uci.edu/~smyth/kddpapers/UCI_KD-D_author_topic_preprint.pdf
[Figure: the observed word-document matrix is factored into a word-topic matrix φ(z) (latent), a topic-author matrix θ(x) (latent), and an author-document matrix ad (observed).]
Example (1)
1. Random topic and author assignments: each word token is randomly assigned to a topic and to one of the doc's authors [figure omitted]
2. Two count matrices:
CWT (words per topic):
topic1 topic2
money 3 2
bank 3 6
loan 2 1
river 2 2
stream 2 1
CAT (authors per topic):
author1 author2 author3
topic1 4 8 0
topic2 0 8 4
Gibbs Sampling for Author-Topic-Model
Estimate the posterior distribution of 2 random variables: z and x.
For each word, we draw an author xi and a topic zi (or the pair (zi, xi) as a block), conditioned on all other variables.
Blocked Gibbs sampling improves convergence of the Gibbs sampler when the variables are highly dependent.
The conditional uses two counts: the number of times author k was already assigned to topic j, and the number of times word token wi was assigned to topic j across all docs.
Problems of the AT Model
The AT model learns the authors' topic distributions for a document corpus
But we do not learn the topic distribution of documents; the AT model cannot model idiosyncratic aspects of a document
AT Model with Fictitious Authors
Add one fictitious author for each document: ad + 1; a uniform or non-uniform distribution over authors (including the fictitious author)
Each word is sampled either from a real author's or the fictitious author's topic distribution, i.e., we learn topic distributions for real authors and for fictitious "authors" (= documents).
Problem reported in (Hong, 2010): the topic distribution of each Twitter message learnt via the AT model was worse than LDA with the USER schema, because messages are sparse and not all words of a message are used to learn the document's topic distribution.
Predictive Power of different models (Rosen-Zvi, 2005)
Experiment: training data: 1,557 papers; test data: 183 papers (102 of them single-authored).
The test documents were chosen such that each author of a test-set document also appears in the training set as an author.
Author-Recipient-Topic (ART) Model (McCallum, 2004)
Observed variables: words per message, authors per message, recipients per message
Sample for each word a recipient-author pair AND a topic, conditioned on the recipient-author pair's topic distribution θ(A,R):
P(z | x, ad, θ(A,R)) and P(w | z, φ(z))
Learn 2 corpus-level variables: the author-recipient-pair distribution for each topic and the word distribution for each topic
2 count matrices: pair-topic and word-topic
Gibbs Sampling for the ART Model
Random start: sample an author-recipient pair for each word; sample a topic for each word
Compute for each word wi, using these counts:
Number of recipients of the message to which word wi belongs
Number of times topic t was assigned to an author-recipient pair
Number of times the current word token was assigned to topic t
Number of times all other topics were assigned to an author-recipient pair
Number of times all other words were assigned to topic t
Number of words * β
Labeled LDA (Ramage, 2009)
Word-topic assignments are drawn from a document's topic distribution θ, which is restricted to the topic distribution Λ of the labels observed in d. The topic distribution of a label l is the same as the topic distribution of all documents containing label l.
The document's labels Λ are first generated using a Bernoulli coin toss for each topic k with a labeling prior Φ.
Constraining the topic model to use only those topics that correspond to a document's (observed) label set:
Topic assignments are limited to the document's labels
One-to-one correspondence between LDA's latent topics and user tags/labels
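A sketch of just this restriction step, not the full Labeled LDA sampler (the topic-label mapping and probabilities are made up):

```python
import numpy as np

# Hypothetical one-to-one mapping between topics and user tags/labels.
label_of_topic = ["sports", "politics", "music", "tech"]
doc_labels = {"politics", "tech"}         # labels observed in document d

p_full = np.array([0.1, 0.4, 0.2, 0.3])   # unconstrained conditional over topics

# Zero out topics whose label is not in the document's label set,
# then renormalize: sampling is limited to the document's labels.
mask = np.array([l in doc_labels for l in label_of_topic])
p = np.where(mask, p_full, 0.0)
p /= p.sum()
print(p)   # only the "politics" and "tech" topics keep probability mass
```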
Group-Topic Model (Wang, 2005)
Discovery of groups is guided by the emerging topics; discovery of topics is guided by the emerging groups
The GT model is an extension of the blockstructure model: group membership is conditioned on a latent variable associated with the attributes of the relation (i.e., the words); this latent variable represents the topics which have generated the words.
The GT model discovers topics relevant to relationships between entities in the social network
Group-Topic Model (Wang, 2005)
Generative process: for each event (an interaction between entities), pick the topic t of the event and then generate all the words describing the event according to the topic's word distribution φ
For each entity s that interacts within this event, the group assignment g is chosen conditionally from a particular multinomial (discrete) distribution θ over groups for each topic t.
For each event we have a matrix V which stores whether each pair of entities behaved the same or not during the event.
[Plate diagram repeated over the number of events (= interactions between entities) and the number of entities.]
CART Model (Pathak, 2008)
Generative process:
To generate email ed, a community cd is chosen uniformly at random
Based on the community cd, the author ad and the set of recipients ρd are chosen
To generate every word w(d,i) in that email, a recipient r(d,i) is chosen uniformly at random from the set of recipients ρd
Based on the community cd, author ad and recipient r(d,i), a topic z(d,i) is chosen
The word w(d,i) itself is chosen based on the topic z(d,i)
Gibbs sampling alternates between updating the latent communities c conditioned on the other variables, and updating the recipient-topic tuples (r, z) for each word conditioned on the other variables.
Copycat Model (Dietz, 2007)
Topics of a citing document are a "weighted sum" of the topics of the documents it cites; the weights capture the notion of influence
Generative process: for each word of the citing publication d, a cited publication c' is picked from the set of all cited publications γ. For each word in the citing publication d, a topic is picked according to the current topic distribution, which is a mix of the topic distributions of the assigned cited documents c'.
Copycat Model (Dietz, 2007)
Example: a publication c is cited by two publications d1 and d2. [Figure: d1 and d2 cite c]
The topic mixture of c is not only about all words in the cited publication c, but also about some words in d1 and d2 which are associated with c. This way, the topic mixture of c is influenced by the citing publications d1 and d2!
The topic distribution of the cited document c in turn influences the association of words in d1 and d2 with c.
All tokens that are associated with a cited publication are called the topical atmosphere of that cited publication.
Copycat Model (Dietz, 2007)
Bipartite citation graph: 2 disjoint node sets D and C
D contains only nodes with outgoing citation links (the citing publications); C contains nodes with incoming links (the cited publications).
Documents in the original citation graph with both incoming and outgoing links are represented as two nodes.
Copycat Model (Dietz, 2007)
Problem: bidirectional interdependence of links and topics, caused by the topical atmosphere: publications that originated in one research area (such as Gibbs sampling, which originated in physics) will also be associated with topics they are often cited by (such as machine learning).
Problem: the model forces each word in a citing publication to be associated with a cited publication, which introduces noise
Citation Influence Model (Dietz, 2007)
The Copycat model forces each word in a citing publication to be associated with a cited publication; this introduces noise.
In the citation influence model, a citing publication may choose to draw a word's topic from the topic mixture θc of a cited publication (the topical atmosphere) or from its own innovation topic mixture ψd.
The choice is modeled by the flip of an unfair coin s. The parameter λ of the coin is learned by the model, given an asymmetric beta prior which prefers the topic mixture θ of a cited publication.
The parameter λ yields an estimate of how well a publication fits all its citations.
[Plate annotations: ψ is the innovation topic mixture of a citing publication; γ is the distribution of citation influences; λ is the parameter of the coin flip, choosing to draw topics from θ or ψ.]
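A toy sketch of the coin-flip choice for one word (λ, θc, and ψd are made-up values, and a single cited publication is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 0.7                            # coin parameter learned by the model
theta_c = np.array([0.6, 0.3, 0.1])  # topic mixture of the cited publication
psi_d = np.array([0.1, 0.1, 0.8])    # innovation topic mixture of d

for _ in range(5):
    s = rng.random() < lam           # flip the unfair coin s
    z = rng.choice(3, p=theta_c if s else psi_d)
    print("atmosphere" if s else "innovation", "-> topic", z)
```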
References
David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022 (2003).
Laura Dietz, Steffen Bickel, Tobias Scheffer: Unsupervised Prediction of Citation Influences. Proc. ICML (2007).
Thomas Hofmann: Probabilistic Latent Semantic Analysis. Proc. of Uncertainty in Artificial Intelligence, UAI'99 (1999).
Thomas L. Griffiths, Mark Steyvers, Joshua B. Tenenbaum: Topics in Semantic Representation. Psychological Review (2007).
Liangjie Hong, Brian D. Davison: Empirical Study of Topic Modeling in Twitter. Proc. of the First Workshop on Social Media Analytics, SOMA '10 (2010).
Michal Rosen-Zvi, Chaitanya Chemudugunta, Thomas Griffiths, Padhraic Smyth, Mark Steyvers: Learning Author-Topic Models from Text Corpora (2010).
Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, Padhraic Smyth: The Author-Topic Model for Authors and Documents. Proc. of the 20th Conference on Uncertainty in Artificial Intelligence (2004).
Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang: The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Tech report (2004).
Nishith Pathak, Colin DeLong, Arindam Banerjee, Kendrick Erickson: Social Topic Models for Community Extraction. The 2nd SNA-KDD Workshop '08 (2008).
Mark Steyvers, Tom Griffiths: Probabilistic Topic Models (2006).
Daniel Ramage, David Hall, Ramesh Nallapati, Christopher D. Manning: Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora. EMNLP '09: Proc. of the 2009 Conference on Empirical Methods in Natural Language Processing (2009).
Xuerui Wang, Natasha Mohanty, Andrew McCallum: Group and Topic Discovery from Relations and Text (2005).
Hanna M. Wallach, David Mimno, Andrew McCallum: Rethinking LDA: Why Priors Matter (2009).