Structured Topic Models: Jointly Modeling Words and Their Accompanying Modalities. Xuerui Wang, Computer Science Department, University of Massachusetts Amherst. Joint work with Andrew McCallum, Andres Corrada-Emmanuel, Chris Pal, Xing Wei and Natasha Mohanty.
Structured Topic Models: Jointly Modeling Words and Their Accompanying Modalities
Xuerui Wang
Computer Science Department
University of Massachusetts Amherst
Joint work with Andrew McCallum, Andres Corrada-Emmanuel, Chris Pal, Xing Wei and Natasha Mohanty.
2
Probabilistic topic models
• Main Assumption:
– Documents are mixtures of topics
– Topics are distributions over words that capture co-occurrence
• Objectives:
– Understand text using learned topics
– Represent documents in topic space
3
Clustering words into topics with Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Generative Process:
For each document:
– Sample a distribution over topics, θ
– For each word in doc:
  – Sample a topic, z
  – Sample a word from the topic, w
Example:
– Distribution over topics: 70% finance, 30% environment
– Sampled topic: finance
– Sampled word: “bank”
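The generative process above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the two toy topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are assumptions made for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document; returns (theta, [(topic, word), ...])."""
    names = list(topics)
    theta = sample_dirichlet(alpha, rng)            # per-document topic mixture
    tokens = []
    for _ in range(n_words):
        z = rng.choices(names, weights=theta)[0]    # sample a topic for this word
        vocab = list(topics[z])
        weights = [topics[z][v] for v in vocab]
        w = rng.choices(vocab, weights=weights)[0]  # sample a word from that topic
        tokens.append((z, w))
    return theta, tokens

# Hypothetical toy topics with made-up word probabilities.
TOPICS = {
    "finance": {"bank": 0.5, "loan": 0.3, "money": 0.2},
    "environment": {"river": 0.5, "bank": 0.2, "forest": 0.3},
}

rng = random.Random(0)
theta, tokens = generate_document(TOPICS, [1.0, 1.0], 10, rng)
print(theta)
print(tokens)
```

Note that “bank” has probability under both toy topics, which is the kind of ambiguity the topic assignment z resolves word by word.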
4
Example topics induced from a large collection of text [Tenenbaum et al.]
• STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL
• MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE
• WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER
• DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN
• FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
• SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
• BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
• JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
6
Documents are not just text!
• Multiple modalities:
– Research papers (author, venue, words, etc.)
– Email messages (sender, recipients, time, words, etc.)
– Legislative resolutions (voting record, words, etc.)
– And many more
• Most previous work: one modality at a time
– Learn topics from words
– Discover groups from relations
– Etc.
8
Outline
• Introduction
• Role and Topic Discovery in Social Networks
• Group and Topic Discovery from Voting Records
• Topics over Time
• Topical Phrase with Markov Assumption
• Conclusions
9
All possible “topic models” with one latent topic, two observed modalities and two conditional dependencies
10
Outline
• Introduction
• Role and Topic Discovery in Social Networks
• Group and Topic Discovery from Voting Records
• Topics over Time
• Topical Phrase with Markov Assumption
• Conclusions
11
From LDA to Author-Recipient-Topic
12
All possible “topic models” with two observed modalities
13
Inference and Estimation
Gibbs Sampling:
– Easy to implement
– Reasonably fast
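To give a sense of why Gibbs sampling is easy to implement, here is a minimal collapsed Gibbs sampler for plain LDA. It is a sketch for illustration only: it is not the ART sampler from the talk, and the function name `gibbs_lda`, the hyperparameter defaults, and the toy corpus are assumptions.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]               # document-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # tokens per topic
    z = []
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(n_topics)
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(assignments)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Conditional distribution over topics for this token.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Add the new assignment back.
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return z, ndk, nkw

# Hypothetical toy corpus: two documents over a five-word vocabulary.
toy_docs = [[0, 1, 2, 0], [3, 4, 3]]
z, ndk, nkw = gibbs_lda(toy_docs, n_topics=2, vocab_size=5, n_iters=5)
print(ndk)
```

The whole sampler is a count table plus one resampling step per token, which is also why each sweep is reasonably fast.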
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas [email protected]
15
Topics, and prominent senders / receivers discovered by ART
(Topic names assigned by hand)
16
Topics, and prominent senders / receivers discovered by ART
• Beck = “Chief Operations Officer”
• Dasovich = “Government Relations Executive”
• Shapiro = “Vice President of Regulatory Affairs”
• Steffes = “Vice President of Government Affairs”
17
Comparing role discovery
connection strength (A, B) is computed from:
• Traditional SNA: distribution over recipients
• Author-Topic: distribution over authored topics
• ART: distribution over authored topics
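The slide does not say how two such distributions are compared. As an assumption for illustration, the sketch below uses Jensen-Shannon divergence, where a lower value means two people have more similar distributions (over recipients or over topics). The function names and the example vectors are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded above by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic distributions for three people.
person_a = [0.7, 0.2, 0.1]
person_b = [0.6, 0.3, 0.1]
person_c = [0.1, 0.1, 0.8]

print(js_divergence(person_a, person_b))  # small: similar roles
print(js_divergence(person_a, person_c))  # larger: different roles
```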
18
Comparing role discovery: Tracy Geaconne and Dan McCarty
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:
Admiration relations among six high school students.
Student Roster
Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
30
Adjacency matrix representing relations
[Figure: 6 × 6 adjacency matrix of the academic admiration relation over students A to F, shown first in roster order (A, B, C, D, E, F), then annotated with group labels (A, C in G1; B, D in G2; E, F in G3), then with rows and columns reordered by group (A, C, B, D, E, F).]
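The reordering in the figure can be reproduced directly from the Acad(·, ·) facts listed on the earlier slide. A small sketch follows; the helper name `adjacency` is an assumption, while the relations and group labels are taken from the slides.

```python
STUDENTS = ["A", "B", "C", "D", "E", "F"]

# Academic admiration pairs (admirer, admired) from the slide.
ACAD = [
    ("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
    ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
    ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C"),
]

# Group labels shown on the slide.
GROUPS = {"A": "G1", "C": "G1", "B": "G2", "D": "G2", "E": "G3", "F": "G3"}

def adjacency(order, relations):
    """Build a 0/1 adjacency matrix with rows and columns in the given order."""
    index = {s: i for i, s in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for source, target in relations:
        matrix[index[source]][index[target]] = 1
    return matrix

original = adjacency(STUDENTS, ACAD)

# Sort students by group so that members of a group sit next to each other.
order = sorted(STUDENTS, key=lambda s: (GROUPS[s], s))
blocked = adjacency(order, ACAD)

for name, row in zip(order, blocked):
    print(name, row)
```

After reordering, every 2 × 2 block is either all ones or all zeros: G1 admires G2, G2 admires G3, and G3 admires G1.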
31
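Group Model: partitioning entities into groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]
S: number of entities
G: number of groups
Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]
[Plate diagram: group assignment g for each of the S entities (Multinomial, with Dirichlet prior α); relation probability γ for each of the G² group pairs (with Beta prior β); observed relation v for each of the S² entity pairs (Binomial).]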
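Read as a generative process, the plate diagram says: assign each entity to a group, draw a relation probability for every pair of groups, then draw each observed relation from the probability of its group pair. A minimal sketch under those assumptions follows; the function name, the symmetric Dirichlet, and the default hyperparameters are illustrative choices, not taken from the talk.

```python
import random

def sample_block_model(n_entities, n_groups, alpha=1.0, beta=(1.0, 1.0), seed=0):
    """Sample group assignments g, group-pair probabilities gamma, relations v."""
    rng = random.Random(seed)
    # Group proportions ~ symmetric Dirichlet(alpha), via normalized Gamma draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_groups)]
    total = sum(draws)
    theta = [d / total for d in draws]
    # One group per entity (S draws).
    g = [rng.choices(range(n_groups), weights=theta)[0] for _ in range(n_entities)]
    # One relation probability per ordered group pair (G^2 draws from a Beta).
    gamma = [[rng.betavariate(*beta) for _ in range(n_groups)] for _ in range(n_groups)]
    # One binary relation per ordered entity pair (S^2 draws).
    v = [
        [1 if rng.random() < gamma[g[i]][g[j]] else 0 for j in range(n_entities)]
        for i in range(n_entities)
    ]
    return g, gamma, v

g, gamma, v = sample_block_model(n_entities=6, n_groups=3)
print(g)
for row in v:
    print(row)
```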
32
Two relations with different attributes
[Figure: two adjacency matrices over the same six students. Academic admiration, ordered A, C, B, D, E, F, gives groups G1 = {A, C}, G2 = {B, D}, G3 = {E, F}. Social admiration, ordered A, C, E, B, D, F, gives groups G1 = {A, C, E}, G2 = {B, D, F}.]
Social Admiration
Soci(A, B) Soci(A, D) Soci(A, F)
Soci(B, A) Soci(B, C) Soci(B, E)
Soci(C, B) Soci(C, D) Soci(C, F)
Soci(D, A) Soci(D, C) Soci(D, E)
Soci(E, B) Soci(E, D) Soci(E, F)
Soci(F, A) Soci(F, C) Soci(F, E)
33
Goal: Model relations and their (textual) attributes simultaneously to obtain better groups and more meaningful topics.
Example textual attributes:
– budget, funding, annual, cash
– document, corrections, review, annual
34
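The Group-Topic model: discovering groups and topics simultaneously
[Plate diagram. Text side: a topic t for each of the B events (Uniform); N_b words w per event (Multinomial); a word distribution φ for each of the T topics (with Dirichlet prior η). Relation side, as in the group model but per topic: group assignment g for each of the S entities under each of the T topics (Multinomial, with Dirichlet prior α); relation probability γ for each of the G² group pairs (with Beta prior β); observed relation v for each of the S² entity pairs (Binomial).]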
35
All possible “topic models” with two observed modalities
37
U.S. Senate data set
• 16 years of voting records in the US Senate (1989 – 2005)
• a Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991); Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms
Senators who change coalition the most, dependent on topic
e.g. Senator Shelby (D-AL) votes
– with the Republicans on Economic
– with the Democrats on Education + Domestic
– with a small group of maverick Republicans on Social Security + Medicare
44
Do we get better groups with the GT model?
Baseline Model:
1. Cluster bills into topics using mixture of unigrams;
2. Apply group model on topic-specific subsets of bills.
GT Model:
1. Jointly cluster topics and groups at the same time using the GT model.
Agreement Index (AI) measures group cohesion. Higher is better.
Dataset | Avg. AI for Baseline | Avg. AI for GT | p-value
Senate  | 0.8198               | 0.8294         | <.01
UN      | 0.8548               | 0.8664         | <.01
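The slides do not spell out how AI is computed. As an assumption, the sketch below uses one definition common in the roll-call voting literature: with y Yea, n Nay and a abstaining votes in a group, AI = (max(y, n, a) − ½ · [(y + n + a) − max(y, n, a)]) / (y + n + a). It equals 1 when the group is unanimous and 0 when the group splits evenly three ways. The function name is hypothetical.

```python
def agreement_index(yea, nay, abstain=0):
    """Agreement Index of one group on one vote (assumed definition, see text)."""
    total = yea + nay + abstain
    if total == 0:
        raise ValueError("the group cast no votes")
    largest = max(yea, nay, abstain)
    return (largest - 0.5 * (total - largest)) / total

print(agreement_index(10, 0))    # unanimous group
print(agreement_index(5, 5))     # evenly split between Yea and Nay
print(agreement_index(3, 3, 3))  # evenly split three ways
```

Averaging this value over the groups and votes of a data set gives a single cohesion score per model, which is the kind of number the table compares.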
46
Outline
• Introduction
• Role and Topic Discovery in Social Networks
• Group and Topic Discovery from Voting Records
• Topics over Time
• Topical Phrase with Markov Assumption
• Conclusions
48
Want to model trends over time
• Is prevalence of topic growing or waning?
• Pattern appears only briefly
– Capture its statistics in a focused way
– Don’t confuse it with patterns elsewhere in time
• How do roles, groups, influence shift over time?
49
Topics Over Time (TOT)
[Plate diagrams, two views of the same model: a Dirichlet prior over each document's multinomial over topics; a topic index drawn for each word; the word drawn from that topic's multinomial over words (which has its own Dirichlet prior); the timestamp drawn from that topic's Beta distribution over time.]
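The diagram can be read as a generative story in which every token carries a topic, a word and a timestamp, with the timestamp drawn from a per-topic Beta distribution over normalized time. The sketch below assumes a fixed topic mixture for the document and uses made-up topics and Beta parameters; `generate_tot_tokens` is a hypothetical name.

```python
import random

def generate_tot_tokens(theta, topic_words, topic_time, n_words, rng):
    """Generate (topic, word, timestamp) triples; timestamps lie in [0, 1]."""
    tokens = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # topic index
        words = list(topic_words[z])
        weights = [topic_words[z][w] for w in words]
        w = rng.choices(words, weights=weights)[0]            # word from topic
        a, b = topic_time[z]
        t = rng.betavariate(a, b)                             # time from topic's Beta
        tokens.append((z, w, t))
    return tokens

# Hypothetical example: topic 0 is concentrated early, topic 1 late.
THETA = [0.5, 0.5]
TOPIC_WORDS = [
    {"tariff": 0.6, "treasury": 0.4},
    {"internet": 0.7, "technology": 0.3},
]
TOPIC_TIME = [(2.0, 8.0), (9.0, 1.5)]

rng = random.Random(1)
tot_tokens = generate_tot_tokens(THETA, TOPIC_WORDS, TOPIC_TIME, 8, rng)
for token in tot_tokens:
    print(token)
```

Because each topic has its own Beta distribution, a topic that is active only briefly gets a narrow peak in time instead of being smeared across the whole collection.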
50
All possible “topic models” with two observed modalities
51
State of the union address
208 Addresses delivered between January 8, 1790 and January 29, 2002.
To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stop words were removed.
• 17,156 ‘documents’
• 21,534 words
• 669,425 tokens
Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.
1910
52
Comparing TOT against LDA
55
Topic Distributions Conditioned on Time
[Figure: topic mass (in vertical height) over time in NIPS conference papers.]
57
TOT improves ability to predict time
Predicting the year of a State-of-the-Union address.
L1 = distance between predicted year and actual year.
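The L1 measure above is simply the absolute difference in years, averaged over the test addresses. A tiny sketch with made-up predictions (the function name and the numbers are hypothetical, not results from the talk):

```python
def mean_l1_error(predicted_years, actual_years):
    """Average absolute distance between predicted and actual years."""
    if not actual_years or len(predicted_years) != len(actual_years):
        raise ValueError("need two non-empty lists of equal length")
    distances = [abs(p - a) for p, a in zip(predicted_years, actual_years)]
    return sum(distances) / len(distances)

# Hypothetical predictions for three addresses.
print(mean_l1_error([1900, 1950, 1801], [1910, 1945, 1801]))
```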
58
Outline
• Introduction
• Role and Topic Discovery in Social Networks
• Group and Topic Discovery from Voting Records
• Topics over Time
• Topical Phrase with Markov Assumption
• Conclusions
59
Topic Interpretability
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
60
Topics modeling phrases
• Topics based only on unigrams are often difficult to interpret
• Topic discovery itself is confused because important meaning / distinctions are carried by phrases.
• Significant opportunity to provide improved language models to ASR, MT, IR, etc.
61
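Topical N-Gram model
[Plate diagram: within each of the D documents, a chain of topics z1 … z4, words w1 … w4 and bigram indicators y1 … y4; hyperparameters α, β, γ1, γ2; plates over the T topics and over the W words.]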
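One way to read the diagram is the following generative sketch: each position gets a topic z, an indicator y says whether the word continues a phrase with the previous word, and the word w is then drawn either from a topic-specific bigram distribution or from the topic's unigram distribution. The toy distributions and all names below are assumptions. The rule that y is forced to 0 when no bigram table exists for the current topic and previous word is a simplification made so that the toy example stays small.

```python
import random

def draw(dist, rng):
    """Draw a key from a {item: probability} dictionary."""
    items = list(dist)
    return rng.choices(items, weights=[dist[i] for i in items])[0]

def generate_tng_tokens(theta, unigram, bigram, p_bigram, n_words, rng):
    """Generate (topic, bigram_indicator, word) triples for one document."""
    tokens = []
    prev_topic, prev_word = None, None
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        y = 0
        if prev_word is not None and (z, prev_word) in bigram:
            # Indicator depends on the previous topic and previous word.
            prob = p_bigram.get((prev_topic, prev_word), 0.0)
            y = 1 if rng.random() < prob else 0
        if y == 1:
            w = draw(bigram[(z, prev_word)], rng)  # continue a phrase
        else:
            w = draw(unigram[z], rng)              # start fresh from the topic
        tokens.append((z, y, w))
        prev_topic, prev_word = z, w
    return tokens

# Hypothetical toy model: topic 0 = politics, topic 1 = real estate.
THETA = [0.5, 0.5]
UNIGRAM = [
    {"white": 0.4, "senate": 0.3, "vote": 0.3},
    {"white": 0.2, "house": 0.5, "price": 0.3},
]
BIGRAM = {(0, "white"): {"house": 1.0}}
P_BIGRAM = {(0, "white"): 0.9}

rng = random.Random(2)
tng_tokens = generate_tng_tokens(THETA, UNIGRAM, BIGRAM, P_BIGRAM, 12, rng)
print(tng_tokens)
```

In this toy setup “white house” can only be produced as a phrase under the politics topic, while the real estate topic emits “white” and “house” independently.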
62
All possible “topic models” with two observed modalities
63
Features of Topical N-Grams model
• Easily trained by Gibbs sampling
– Can run efficiently on millions of words
• Topic-specific phrase discovery
– “white house” has special meaning as a phrase in the politics topic,
– ... but not in the real estate topic.
64
NIPS research papers
• Full text of NIPS papers between 1987-1999.
• 1,740 research papers in total.
• 13,649 unique words and 2,301,375 word tokens.
• Stop words removed and no stemming.
65
“Reinforcement Learning”
state learning policy action reinforcement states time optimal actions function algorithm reward step dynamic control sutton rl decision algorithms agent
kernel linear vector support set nonlinear data algorithm space pca function problem margin vectors solution training svm kernels matrix machines
LDA
support vectors, test error, support vector machines, training error, feature space, training examples, decision function, cost functions, test inputs, kkt conditions, leave-one-out procedure, soft margin, bayesian transduction, training patterns, training points, maximum margin, strictly convex, regularization operators, base classifiers, convex optimization
kernel training support margin svm solution kernels regularization adaboost test data generalization examples cost convex algorithm working feature sv functions
• Short-distance dependency: phrases (usually discovered by separate modules) can boost IR performance [Fagan, 1989; Evans et al., 1991; Strzalkowski, 1995; Mitra et al., 1997].
• TNG captures both topics and phrases simultaneously.
68
San Jose Mercury News (TREC)
• Covers materials from San Jose Mercury News in 1991
• With TREC queries 51-150
• 90,257 documents in total, 255,686 unique words and 17,574,989 word tokens.
• Stop words removed and no stemming.
<DOC>
<DOCNO> SJMN91-06364022 </DOCNO>
<ACCESS> 06364022 </ACCESS>
<CAPTION> Photo; PHOTO: Associated Press; MONSTER MASH -- Kentucky's Jamal Mashburn shows his stuff in the Wildcats' 103-89 victory over state rival Louisville on Saturday. Mashburn had 25 points. </CAPTION>
<DESCRIPT> COLLEGE; BASKETBALL; GAME; RESULT; RANKING; SCHOOL </DESCRIPT>
<LEADPARA> Arizona had a 24-point night from Sean Rooks, a height advantage and strong defense, but still struggled to an 83-76 victory over Evansville in the Fiesta Bowl Classic in Tucson, Ariz., on Saturday.; The victory moved the No. 6 Wildcats into the championship of their tournament for the seventh straight time. </LEADPARA>
<SECTION> Sports </SECTION>
<HEADLINE> ARIZONA EDGES EVANSVILLE……
69
Ad-hoc retrieval on SJMN
[Table of per-query results, with annotations:]
– Some queries clearly contain phrases.
– Some queries contain no phrases because of stop-word and punctuation removal.
– Results are mixed on many other queries.
70
Ad-hoc retrieval on SJMN
* indicates statistically significant differences in performance with 95% confidence according to the Wilcoxon test
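For readers unfamiliar with the test, the sketch below computes a Wilcoxon signed-rank statistic on paired per-query scores from two systems. It is a simplified illustration: it uses the normal approximation for the two-sided p-value with no tie or continuity correction, and the scores are made up. The slide does not state which variant of the test was used.

```python
import math

def wilcoxon_signed_rank(xs, ys):
    """Return (W+, two-sided p-value) using a plain normal approximation."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences and give them the average rank.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, p_value

# Hypothetical per-query scores (scaled to integers) for two retrieval systems.
system_a = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
system_b = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
w_plus, p_value = wilcoxon_signed_rank(system_a, system_b)
print(w_plus, p_value)
```

A p-value below 0.05 corresponds to the 95% confidence level mentioned on the slide.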
71
Outline
• Introduction
• Role and Topic Discovery in Social Networks
• Group and Topic Discovery from Voting Records
• Topics over Time
• Topical Phrase with Markov Assumption
• Conclusions
72
All possible “topic models” with two observed modalities (revisit)
ART, GT, TOT, TNG
73
Conclusions
• With carefully designed model structures, we can utilize multi-modality information.
• Choices of configuration are task dependent.
• Better results are obtained from joint inference on various tasks.