Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives Pradipto Das, Rohini Srihari and Yun Fu SUNY Buffalo CIKM 2011, Glasgow, Scotland

Dec 03, 2014

Pradipto Das

CIKM11 Full Paper Presentation
Transcript
Page 1: Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives

Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives

Pradipto Das, Rohini Srihari and Yun Fu

SUNY Buffalo

CIKM 2011, Glasgow, Scotland

Page 2

Ubiquitous Bi-Perspective Document Structure

Wikipedia

Words indicative of important Wiki concepts

Actual human-generated Wiki category tags: words that summarize/categorize the document

Page 3

Ubiquitous Bi-Perspective Document Structure

StackOverflow

Words indicative of questions

Words indicative of answers

Actual tags for the forum post (even frequencies are given!)

Page 4

Ubiquitous Bi-Perspective Document Structure

Yahoo! Flickr

Words indicative of document title

Words indicative of image description

Actual tags given by users

Page 5

Understanding the Two Perspectives

News Article

What if the documents are plain text files?

Page 6

Understanding the Two Perspectives

News Article

Imagine browsing over reports in a topic cluster

It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year. This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland. Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case.

Page 7

Understanding the Two Perspectives

News Article

[News article text as on Page 6]

The “document level” perspective

What words can we remember after a first browse?

German, US, investigations, GM, Dorothea Holland, Lopez, prosecute

Page 8

Understanding the Two Perspectives

What helped us generate the Document Level perspective?

News Article

[News article text as on Page 6]

The “word level” perspective

Named Entities: ORGANIZATION, LOCATION, MISC, PERSON

Important Verbs and Dependents

WHAT HAPPENED?

The “document level” perspective

German, US, investigations, GM, Dorothea Holland, Lopez, prosecute

Page 9

What if we turn the document off?

Summarization power of the perspectives

[News article text as on Page 6]

Sentence Boundaries

German, US, investigations, GM, Dorothea Holland, Lopez, prosecute

Page 10

Hypothesis

• Documents are tagged from at least two different perspectives, either implicit or explicit, and one perspective affects the other
  – Simplest example of implicit WL tagging: binned positions indicating sections
  – Simplest example of implicit DL tagging: tag cloud

[News article text as on Page 6; tag cloud from tagcrowd.com]

Binned positions: Begin (0), Middle (1), End (2)

The “word level” (WL) tags are usually some category descriptions
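The binned-position idea can be sketched in a few lines of Python. This is only an illustration, assuming equal-width bins over sentence (or section) positions; the function names `position_bin` and `wl_tags`, and the choice of equal-width binning, are assumptions and are not taken from the slides.

```python
def position_bin(index: int, total: int, n_bins: int = 3) -> int:
    """Map a 0-based position to one of n_bins equal-width bins (assumed scheme)."""
    if total <= 0 or not 0 <= index < total:
        raise ValueError("index must lie in [0, total)")
    # Integer arithmetic keeps the last position inside the last bin.
    return min(index * n_bins // total, n_bins - 1)


def wl_tags(units, n_bins: int = 3):
    """Pair every unit (sentence or section) with its implicit WL tag."""
    total = len(units)
    return [(unit, position_bin(i, total, n_bins)) for i, unit in enumerate(units)]


# Bin names as on the slide: Begin (0), Middle (1), End (2).
BIN_NAMES = {0: "Begin", 1: "Middle", 2: "End"}

if __name__ == "__main__":
    for unit, tag in wl_tags(["sentence 1", "sentence 2", "sentence 3"]):
        print(unit, "->", BIN_NAMES[tag])
```

Calling `wl_tags(sections, n_bins=5)` would give a five-bin variant of the same scheme.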

Page 11

How can bi-level perspective be exploited?

• Can we generate category labels for Wikipedia documents by looking at image captions?
• Can we use images to label latent topics?
• Can we build a topic model that incorporates both perspectives simultaneously?
  – Choice of document level tags, impact on performance
• Can supervised and unsupervised generative models work together?

Page 12

Example – A Wikipedia Article on “fog”

Labels by human editors

Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics

[Figure labels: 0, 1, 2]

Page 13

The Wikipedia Article on “fog”

Labels by human editors

Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics

Labels by model from title and image captions

Categories: fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air

Take the first category label, “weather hazards to aircraft”:

• “aircraft” doesn’t occur in the document body!
• “hazard” only appears in a section label read as “Visibility hazards”
• “Weather” appears only 6 out of 15 times in the main body

However, if we look at the images, it seems that the concept of fog is related to concepts like fog over the Golden Gate bridge, fog in streets, poor visibility and quality of air

Page 14

The Family of Tag-Topic Models

• TagLDA: An occurrence of a word depends on how much of it is explained by a topic K and a WL tag t

Intuitively (LDA vs. TagLDA)

LDA’s learnt “purple” topic can generate all 4 large balls with high probability

TagLDA learns the “purple” topic better based on a constraint: it will generate a mix of large and small balls with high probability

[Figure: “Train” and “Sample” rows of large (L) and small (S) balls for LDA and TagLDA]
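One common way to read “explained by a topic and a WL tag” is a log-linear word distribution in which a topic score and a tag score are added per word and then normalized. The sketch below assumes that form; the parameter names, the toy vocabulary and all numbers are made up for illustration and are not taken from the slides.

```python
import math


def word_distribution(topic_scores, tag_scores):
    """Return p(w | topic, tag) proportional to exp(topic_scores[w] + tag_scores[w]).

    Assumed log-linear form; both inputs are per-word score lists of equal length.
    """
    if len(topic_scores) != len(tag_scores):
        raise ValueError("score vectors must cover the same vocabulary")
    logits = [a + b for a, b in zip(topic_scores, tag_scores)]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    norm = sum(weights)
    return [w / norm for w in weights]


# Hypothetical three-word vocabulary and scores.
VOCAB = ["fog", "bridge", "aircraft"]
TOPIC = [1.0, 0.0, 0.0]           # this topic prefers "fog"
TAG_NEUTRAL = [0.0, 0.0, 0.0]     # a WL tag with no preference
TAG_BRIDGE = [0.0, 2.0, 0.0]      # a WL tag that favours "bridge"

if __name__ == "__main__":
    print(dict(zip(VOCAB, word_distribution(TOPIC, TAG_NEUTRAL))))
    print(dict(zip(VOCAB, word_distribution(TOPIC, TAG_BRIDGE))))
```

With a neutral tag the result reduces to a softmax over the topic scores alone; a non-neutral tag shifts probability mass within the same topic, which is the kind of constraint the ball example describes.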

Page 15

Faceted Bi-Perspective Document Organization

Topics conditioned on different section identifiers (WL tag categories)

Topic Marginals

Topics over image captions

Correspondence of DL tag words with content words

Topic Labeling

Page 16

The Family of Tag-Topic Models

TagLDA, MMLDA, CorrMMLDA, METag2LDA, CorrMETag2LDA

METag2LDA: combines TagLDA and MMLDA

CorrMETag2LDA: combines TagLDA and CorrMMLDA

MM = Multinomial + Multinomial; ME = Multinomial + Exponential

Page 17

The Family of Tag-Topic Models

• METag2LDA: A topic generating all DL tags in a document doesn’t necessarily mean that the same topic generates all words in the document
• CorrMETag2LDA: A topic generating *all* DL tags in a document does mean that the same topic generates all words in the document, a considerable strong point

[Figure: graphical models of METag2LDA and CorrMETag2LDA]

Legend:

• Topic concentration parameter
• Document specific topic proportions
• Document content words
• Document Level (DL) tags
• Word Level (WL) tags
• Indicator variables
• Topic Parameters
• Tag Parameters

Page 18

Experiments

• Wikipedia articles with images and captions, manually collected along {food, animal, countries, sport, war, transportation, nature, weapon, universe and ethnic groups} concepts
• Tags used:
  – DL tags: image caption words and the article titles
  – WL tags: positions of sections binned into 5 bins
• Objective: to generate category labels for test documents
• Evaluation:
  – Perplexity: to see performance among various TagLDA models
  – WordNet based similarity evaluation between actual category labels and model output

Page 19

Evaluations – Held-out Perplexity

Selected Wikipedia Articles

• WL tag categories: section positions in the document
• DL tags: image caption words and article titles
• TagLDA perplexity is comparable to MM(METag2)LDA
  – The (image caption words + article titles) and the content words are independently discriminative enough
• CorrMM(METag2)LDA performs best since almost all image caption words and the article title for a Wikipedia document are about a specific topic, and the correspondence assumption is accepted by the model with much higher confidence

[Figure: held-out perplexity (y-axis 0 to 800000) for K = 20, 50, 100, 200; series: MMLDA, TagLDA, corrLDA, METag2LDA, corrMETag2LDA]
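For readers unfamiliar with the metric, held-out perplexity is usually computed as the exponential of the negative average per-token log-likelihood on unseen documents. The sketch below shows that textbook definition only; the slides do not state which normalization underlies the plotted values, and the function name and inputs are assumptions.

```python
import math


def perplexity(doc_log_likelihoods, doc_token_counts):
    """exp(-(sum of held-out log-likelihoods) / (total held-out tokens)).

    doc_log_likelihoods: natural-log likelihood of each held-out document
    doc_token_counts:    number of tokens in each held-out document
    """
    total_tokens = sum(doc_token_counts)
    if total_tokens <= 0:
        raise ValueError("need at least one held-out token")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


if __name__ == "__main__":
    # A model that spreads probability uniformly over 4 words has perplexity 4.
    uniform_doc = 10 * math.log(1.0 / 4.0)
    print(perplexity([uniform_doc], [10]))
```

Lower values mean the model assigns higher probability to the held-out text.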

Page 20

Evaluations – Application End-Goals

Inverse Hop distance in WordNet ontology

• Top 5 words from the caption vocabulary are chosen
• Max Weighted Average = 5, Max Best = 1
• METag2LDA almost always wins by narrow margins
  – METag2LDA reweights the vocabulary of caption words and article titles that are about a topic and hence may miss specializations relevant to the document within the top (5) ones
  – In the WordNet ontology, specializations lead to more hop distance
• Ontology based scoring helps explain connections from caption words to ground truths, e.g. Skateboard → skate → glide → snowboard

[Figure: inverse hop distance (y-axis 0 to 2) for K = 20, 50, 100, 200; series: METag2LDA-AverageDistance, corrMETag2LDA-AverageDistance, METag2LDA-BestDistance, corrMETag2LDA-BestDistance]
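The hop-based scoring can be illustrated on a toy graph. The sketch below is not the authors' evaluation code: it replaces WordNet with a tiny hand-made graph built from the slide's example chain, and it assumes the inverse score is 1 / (1 + hops), so that an exact match scores 1 and five exact matches sum to 5. Both the formula and the aggregation (sum and maximum over predicted words) are assumptions.

```python
from collections import deque

# Hypothetical stand-in for the ontology: the example chain from the slide.
EDGES = [("skateboard", "skate"), ("skate", "glide"), ("glide", "snowboard")]


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hop_distance(graph, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph[node]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def inverse_hop(graph, a, b):
    """Assumed score: 1 / (1 + hops), and 0 when no path exists."""
    hops = hop_distance(graph, a, b)
    return 0.0 if hops is None else 1.0 / (1 + hops)


def score_labels(graph, predicted, truth):
    """Return (sum, best) of each predicted word's closest match in truth."""
    if not predicted or not truth:
        raise ValueError("predicted and truth must be non-empty")
    per_word = [max(inverse_hop(graph, p, t) for t in truth) for p in predicted]
    return sum(per_word), max(per_word)


GRAPH = build_graph(EDGES)

if __name__ == "__main__":
    print(score_labels(GRAPH, ["skateboard", "glide"], ["snowboard"]))
```

Under this assumed scoring, more specialized words sit further from the ground-truth label in the graph and therefore score lower, which matches the slide's remark about specializations.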

Page 21

Evaluations – Held-out Perplexity

DUC05 Newswire Dataset (Recent Experiments with TagLDA Included)

• WL tag categories: Named Entities
• DL tags: abstract coherence markers like (“subj” → “obj”), e.g. “Mary/Subj taught the class. Everybody liked Mary/Obj.” [Ignored coref resolution]
• Abstract markers like (“subj” → “obj”) acting as the DL perspective are not document discriminative markers
  – Rather, they indicate a semantic perspective of coherence which is intricately linked to words
• Topics are influenced both by non-sparse document level coherence indicators like (“subj” → “obj”, “subj” → “--”, etc.) AND also by document level co-occurrence
• Ignoring the DL perspective completely leads to a better fit by TagLDA, due to variations in word distributions only

[Figure: held-out perplexity (y-axis 1350000 to 1650000) at x = 40, 60, 80, 100; series: MMLDA, METag2LDA, corrLDA, corrMETag2LDA]

[Figure: held-out perplexity (y-axis 0 to 1800000) at x = 40, 60, 80, 100; series: MMLDA, METag2LDA, corrLDA, corrMETag2LDA, TagLDA]

Page 22

Evaluations – Application End-Goals

Person Named Entity coverage (DUC05 data)

• Two PERSON NEs in the same docset, i.e., manual topic set, are related (G in total)
• A_B, A, B are treated as separate PERSON NEs
• For each docset in DUC05 data:
  – Create a set of best topics for a docset and pull out top PER NE pairs from the PER NE facets
  – Find how many matched over all documents in a docset (M in total)
• Win over baseline = M/G (averaged over all docsets)
• CorrMETag2LDA wins here because of the nature of the DL perspective (role transitions like “Subj→Obj” coherence markers)
  – More topics are pulled out that group more PER NEs across documents (Recall ↑)

[Figure: Person NE coverage (y-axis 0 to 4) at x = 40, 60, 80, 100; series: METag2LDA, CorrMETag2LDA; data labels: 0.35, 0.63, 0.98, 0.91, 0.96, 1.88, 3.08, 3.66]

Page 23

Model Usefulness and Applications

• Applications
  – Document classification using reduced dimensions
  – Find faceted topics automatically through word level tags
  – Learn correspondences between perspectives
  – Label topics through document level multimedia
  – Create recommendations based on perspectives
  – Video analysis: word prediction given video features
  – Tying “multilingual comparable corpora” through topics
  – Multi-document summarization using coherence
  – E-Textbook aided discussion forum mining:
    • Explore topics through the lens of students and teachers
    • Label topics from posts through concepts in the e-textbook

Page 24

Summary

• Flexible family of topic models that integrate a partitioned space of DL tags and words with WL tag categories
  – Supervised models can collaborate with unsupervised generative models, i.e. supervised models can be bettered independently
• Captioned multimedia objects like images, video, audio can provide intuitive latent space labeling: a picture is worth a 1000 words
• Obtain “facets” in topics
• As always, held-out perplexity should not be the sole judge of end-task performance

Page 25

Thanks!

Special thanks to Jordan Boyd-Graber for useful discussions on TagLDA parameter regularizations