Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives

Pradipto Das, Rohini Srihari and Yun Fu

SUNY Buffalo

CIKM 2011, Glasgow, Scotland

Ubiquitous Bi-Perspective Document Structure

• Wikipedia
– Words indicative of important Wiki concepts
– Actual human generated Wiki category tags: words that summarize/categorize the document

• StackOverflow
– Words indicative of questions
– Words indicative of answers
– Actual tags for the forum post; even frequencies are given!

• Yahoo! Flickr
– Words indicative of document title
– Words indicative of image description
– Actual tags given by users

Understanding the Two Perspectives

News Article

What if the documents are plain text files?

It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case.


Imagine browsing over reports in a topic cluster


The “document level” perspective

What words can we remember after a first browse?

German, US, investigations, GM, Dorothea Holland, Lopez, prosecute

[Callouts: Named Entities; Important Verbs and Dependents]

Understanding the Two Perspectives

What helped us generate the Document Level perspective?

[Figure: the news article annotated with the callouts ORGANIZATION, LOCATION, MISC, PERSON and WHAT HAPPENED?, labeled with the “word level” perspective and the “document level” perspective]

What if we turn the document off?

Summarization power of the perspectives

Hypothesis

• Documents are tagged from at least two different perspectives, either implicit or explicit, and one perspective affects the other
– Simplest example of implicit WL tagging: binned positions indicating sections
– Simplest example of implicit DL tagging: tag cloud

[Figure: the news article split at sentence boundaries into the position bins Begin (0), Middle (1) and End (2), next to a tag cloud of the article from tagcrowd.com]

The “word level” (WL) tags are usually some category descriptions
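The binned-position idea for implicit WL tags can be sketched as below. This is a minimal illustration, not the authors' preprocessing: the function names, the equal-width binning rule and the abridged example sentences are all assumptions made here.

```python
import re

# Bin names follow the Begin (0) / Middle (1) / End (2) labels on the slide.
BIN_NAMES = {0: "Begin", 1: "Middle", 2: "End"}

def position_bin(index, n_items, n_bins=3):
    """Map the index-th of n_items units to one of n_bins equal-width position bins."""
    if not 0 <= index < n_items:
        raise ValueError("index out of range")
    return index * n_bins // n_items

def tag_words_by_position(sentences, n_bins=3):
    """Pair every word with the position bin of the sentence it occurs in."""
    tagged = []
    for i, sentence in enumerate(sentences):
        wl_tag = position_bin(i, len(sentences), n_bins)
        for word in re.findall(r"[a-z]+", sentence.lower()):
            tagged.append((word, wl_tag))
    return tagged

# Abridged sentences from the news article (shortened for illustration).
sentences = [
    "It is believed US investigators have asked for evidence.",
    "This investigation was launched by US President Bill Clinton.",
    "Dorothea Holland was the only prosecuting lawyer on the German case.",
]
tagged = tag_words_by_position(sentences)
```

The same `position_bin` helper would cover the 5-bin section positions used later in the experiments by passing `n_bins=5`.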

How can the bi-level perspective be exploited?

• Can we generate category labels for Wikipedia documents by looking at image captions?
• Can we use images to label latent topics?
• Can we build a topic model that incorporates both perspectives simultaneously?
– Choice of document level tags, impact on performance
• Can supervised and unsupervised generative models work together?

Example – A Wikipedia Article on “fog”

Categories (labels by human editors): Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics


Categories (labels by model from title and image captions): fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air

Take the first category label, “weather hazards to aircraft”:

• “aircraft” doesn’t occur in the document body!
• “hazard” only appears in a section label read as “Visibility hazards”
• “Weather” appears only 6 out of 15 times in the main body

However, if we look at the images, it seems that the concept of fog is related to concepts like fog over the Golden Gate bridge, fog in streets, poor visibility and quality of air
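The kind of check behind these observations (does a label word occur in the body at all?) can be sketched as follows. The helper name and the toy body text are assumptions for illustration; the text is not the Wikipedia article, and matching is exact, with no stemming.

```python
import re
from collections import Counter

def label_word_counts(label, body):
    """Count how often each word of a category label occurs in the body text."""
    body_counts = Counter(re.findall(r"[a-z]+", body.lower()))
    return {w: body_counts[w] for w in re.findall(r"[a-z]+", label.lower())}

# Hypothetical stand-in for an article body.
body = "Fog reduces visibility. Fog forms when the air temperature drops near the dew point."
counts = label_word_counts("Weather hazards to aircraft", body)
missing = [w for w, c in counts.items() if c == 0]
```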

The Family of Tag-Topic Models

• TagLDA: an occurrence of a word depends on how much of it is explained by a topic k and a WL tag t

Intuitively (LDA vs. TagLDA):

• LDA’s learnt “purple” topic can generate all 4 large balls with high probability

• TagLDA learns the “purple” topic better based on a constraint: it will generate a mix of large and small balls with high probability

[Figure: training set and sample of large (L) and small (S) balls]
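One way to read "explained by a topic and a WL tag" is a log-linear word distribution in which a topic weight and a tag weight add in log space before normalisation. The sketch below assumes that form; the slides do not spell out the parameterisation, and the vocabulary, tag names and weights are hypothetical.

```python
import math

def tag_topic_word_probs(tau_k, pi_t):
    """Normalise exp(tau_k[w] + pi_t[w]) over the vocabulary (assumed log-linear form)."""
    scores = [a + b for a, b in zip(tau_k, pi_t)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

vocab = ["prosecutors", "clinton"]          # hypothetical two-word vocabulary
tau_topic = [0.5, 0.5]                      # hypothetical topic weights, no preference
pi_tags = {
    "ORGANIZATION": [1.0, -1.0],            # hypothetical WL tag weights
    "PERSON": [-1.0, 1.0],
}
p_given_org = tag_topic_word_probs(tau_topic, pi_tags["ORGANIZATION"])
p_given_person = tag_topic_word_probs(tau_topic, pi_tags["PERSON"])
```

With identical topic weights, the WL tag alone shifts probability between the two words, which is the constraint the ball analogy hints at.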

Faceted Bi-Perspective Document Organization

Topics conditioned on different section identifiers (WL tag categories)

Topic Marginals

Topics over image captions

Correspondence of DL tag words with content words

Topic Labeling

The Family of Tag-Topic Models

• TagLDA
• MMLDA
• CorrMMLDA
• METag2LDA: combines TagLDA and MMLDA
• CorrMETag2LDA: combines TagLDA and CorrMMLDA

MM = Multinomial + Multinomial; ME = Multinomial + Exponential

The Family of Tag-Topic Models

• METag2LDA: a topic generating all DL tags in a document doesn’t necessarily mean that the same topic generates all words in the document

• CorrMETag2LDA: a topic generating *all* DL tags in a document does mean that the same topic generates all words in the document, a considerable strong point

[Figure: plate diagrams of METag2LDA and CorrMETag2LDA. Callouts: topic concentration parameter; document specific topic proportions; document content words; Document Level (DL) tags; Word Level (WL) tags; indicator variables; topic parameters; tag parameters]
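The difference between the two models can be sketched at the level of topic assignments. This is a reading of the slide in the style of correspondence LDA, not the authors' specification: the function names, the fixed topic proportions and the uniform choice of indicator are assumptions.

```python
import random

def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_topic_assignments(theta, n_tags, n_words, correspond, rng):
    """Return (topics of DL tags, topics of content words) for one document."""
    tag_topics = [sample_index(theta, rng) for _ in range(n_tags)]
    if correspond:
        # Assumed correspondence step: an indicator picks one DL tag uniformly
        # and the content word reuses that tag's topic.
        indicators = [rng.randrange(n_tags) for _ in range(n_words)]
        word_topics = [tag_topics[y] for y in indicators]
    else:
        # Without correspondence, words draw their own topics from theta.
        word_topics = [sample_index(theta, rng) for _ in range(n_words)]
    return tag_topics, word_topics

rng = random.Random(0)
theta = [0.7, 0.2, 0.1]  # hypothetical document specific topic proportions
tags_corr, words_corr = generate_topic_assignments(theta, 4, 10, True, rng)
tags_free, words_free = generate_topic_assignments(theta, 4, 10, False, rng)
```

Under the correspondence variant every word topic is one of the DL tag topics, so if a single topic generated all DL tags it also generates all words; the non-correspondence variant carries no such guarantee.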

Experiments

• Wikipedia articles with images and captions, manually collected along the concepts {food, animal, countries, sport, war, transportation, nature, weapon, universe and ethnic groups}
• Tags used:
– DL tags: image caption words and the article titles
– WL tags: positions of sections binned into 5 bins
• Objective: to generate category labels for test documents
• Evaluation:
– Perplexity: to compare performance among the various TagLDA models
– WordNet based similarity evaluation between actual category labels and model output
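For reference, held-out perplexity in its standard per-word form is the exponential of the negative mean log-likelihood. The sketch below assumes that definition and natural logarithms; the slides do not state which normalisation was used for the plotted values.

```python
import math

def perplexity(log_probs):
    """exp of the negative mean per-word log-likelihood (natural log)."""
    if not log_probs:
        raise ValueError("need at least one held-out word")
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that spreads probability uniformly over 8 words has perplexity 8.
uniform_log_probs = [math.log(1 / 8)] * 20
uniform_perplexity = perplexity(uniform_log_probs)
```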

Evaluations – Held-out Perplexity

Selected Wikipedia Articles

• WL tag categories: section positions in the document
• DL tags: image caption words and article titles
• TagLDA perplexity is comparable to MM(METag2)LDA
– The (image caption words + article titles) and the content words are independently discriminative enough
• CorrMM(METag2)LDA performs best, since almost all image caption words and the article title for a Wikipedia document are about a specific topic, and the correspondence assumption is accepted by the model with much higher confidence

[Figure: held-out perplexity at K=20, K=50, K=100 and K=2000 for MMLDA, TagLDA, corrLDA, METag2LDA and corrMETag2LDA; y-axis ticks from 100000 to 800000]

Evaluations – Application End-Goals

Inverse Hop distance in WordNet ontology

• Top 5 words from the caption vocabulary are chosen
• Max Weighted Average = 5, Max Best = 1
• METag2LDA almost always wins by narrow margins
– METag2LDA reweights the vocabulary of caption words and article titles that are about a topic, and hence may miss specializations relevant to the document within the top (5) ones
– In the WordNet ontology, specializations lead to more hop distance
• Ontology based scoring helps explain connections of caption words to ground truths, e.g. Skateboard → skate → glide → snowboard
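Inverse hop distance scoring can be sketched on a toy graph. Everything here is an assumption for illustration: the undirected graph stands in for WordNet, the chain comes from the slide's example, "fog" is added as an unrelated word, and the score 1 / (1 + hops) is one choice that caps the best score at 1 and the sum over 5 words at 5.

```python
from collections import deque

def hop_distance(graph, source, target):
    """Breadth-first search hop count between two nodes, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def inverse_hop_scores(graph, predicted, truth):
    """Score each predicted word as 1 / (1 + hops) to the truth word (assumed form)."""
    scores = []
    for word in predicted:
        hops = hop_distance(graph, word, truth)
        scores.append(0.0 if hops is None else 1.0 / (1 + hops))
    return scores

# Toy stand-in for the ontology, built from the chain on the slide.
edges = [("skateboard", "skate"), ("skate", "glide"), ("glide", "snowboard")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

top5 = ["skateboard", "skate", "glide", "snowboard", "fog"]
scores = inverse_hop_scores(graph, top5, "snowboard")
best, total = max(scores), sum(scores)
```

A more specialized word sits more hops away from the ground truth, so it earns a lower score, which matches the remark that specializations lead to more hop distance.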

[Figure: inverse hop distance at K=20, K=50, K=100 and K=2000; y-axis ticks from 0.2 to 2; series: METag2LDA-AverageDistance, corrMETag2LDA-AverageDistance, METag2LDA-BestDistance, corrMETag2LDA-BestDistance]

Evaluations – Held-out Perplexity

DUC05 Newswire Dataset (Recent Experiments with TagLDA Included)

• WL tag categories: Named Entities
• DL tags: abstract coherence markers like (“subj” → “obj”), e.g. “Mary/Subj taught the class. Everybody liked Mary/Obj.” [Ignored coref resolution]
• Abstract markers like (“subj” → “obj”) acting as the DL perspective are not document discriminative markers
– Rather, they indicate a semantic perspective of coherence which is intricately linked to words
• Topics are influenced both by non-sparse document level coherence indicators like (“subj” → “obj”, “subj” → “--”, etc.) AND by document level co-occurrence
• Ignoring the DL perspective completely leads to a better fit by TagLDA, due to variations in word distributions only

[Figure: two held-out perplexity plots with x-axis ticks 40, 60, 80 and 100. First plot: MMLDA, METag2LDA, corrLDA and corrMETag2LDA, y-axis ticks from 1350000 to 1650000. Second plot: the same models plus TagLDA, y-axis ticks from 0 to 1800000]

Evaluations – Application End-Goals

Person Named Entity coverage (DUC05 data)

• Two PERSON NEs in the same docset, i.e. manual topic set, are related (G in total)
– A_B, A and B are treated as separate PERSON NEs
• For each docset in the DUC05 data:
– Create a set of best topics for the docset and pull out the top PER NE pairs from the PER NE facets
– Find how many matched over all documents in the docset (M in total)
• Win over baseline = M/G (averaged over all docsets)
• CorrMETag2LDA wins here because of the nature of the DL perspective (role transitions like “Subj → Obj” coherence markers)
– More topics are pulled out that group more PER NEs across documents (higher recall)

[Figure: PERSON NE coverage for METag2LDA and CorrMETag2LDA; x-axis ticks 40, 60, 80 and 100; y-axis ticks from 0 to 4; data labels 0.35, 0.63, 0.98, 0.91, 0.96, 1.88, 3.08, 3.66]

Model Usefulness and Applications

• Applications
– Document classification using reduced dimensions
– Find faceted topics automatically through word level tags
– Learn correspondences between perspectives
– Label topics through document level multimedia
– Create recommendations based on perspectives
– Video analysis: word prediction given video features
– Tying “multilingual comparable corpora” through topics
– Multi-document summarization using coherence
– E-Textbook aided discussion forum mining:
  • Explore topics through the lens of students and teachers
  • Label topics from posts through concepts in the e-textbook

Summary

• Flexible family of topic models that integrate a partitioned space of DL tags and words with WL tag categories
– Supervised models can collaborate with unsupervised generative models, i.e. supervised models can be improved independently
• Captioned multimedia objects like images, video and audio can provide intuitive latent space labeling: a picture is worth 1000 words
• Obtain “facets” in topics
• As always, held-out perplexity should not be the sole judge of end-task performance

Thanks!

Special thanks to Jordan Boyd-Graber for useful discussions on TagLDA parameter regularizations
