Hybrid Models for Text and Graphs 10/23/2012 Analysis of Social Media
Hybrid Models for Text and Graphs - Carnegie Mellon School ...wcohen/10-802/10-23-textgraph.pdf · Hybrid Models for Text and Graphs 10/23/2012 Analysis of Social Media . Newswire

May 21, 2020

Hybrid Models for Text and Graphs

10/23/2012 Analysis of Social Media

Newswire Text

•  Formal
•  Primary purpose:
  –  Inform “typical reader” about recent events
•  Broad audience:
  –  Explicitly establish shared context with reader
  –  Ambiguity often avoided

Social Media Text

•  Informal
•  Many purposes:
  –  Entertain, connect, persuade…
•  Narrow audience:
  –  Friends and colleagues
  –  Shared context already established
  –  Many statements are ambiguous out of social context

Newswire Text

•  Goals of analysis:
  –  Extract information about events from text
  –  “Understanding” text requires understanding “typical reader”
    •  Conventions for communicating with him/her
    •  Prior knowledge, background, …

Social Media Text

•  Goals of analysis:
  –  Very diverse
  –  Evaluation is difficult
    •  And requires revisiting often as goals evolve
  –  Often “understanding” social text requires understanding a community

Outline

•  Tools for analysis of text
  –  Probabilistic models for text, communities, and time
    •  Mixture models and LDA models for text
    •  LDA extensions to model hyperlink structure
    •  LDA extensions to model time

Introduction to Topic Models

•  Mixture model: unsupervised naïve Bayes model

[Plate diagram: class C and word W, with parameters π and β; plates N and M]

•  Joint probability of words and classes: [equation not preserved in the transcript]

•  But classes are not visible, so the class becomes a latent variable Z: [equation not preserved in the transcript]
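
The two expressions this slide points to did not survive extraction. A standard way to write them for the mixture-of-unigrams model, using the slide's symbols π (class prior) and β (per-class word distribution), is sketched below. The indexing is an assumption and is not taken from the slide.

```latex
% Joint probability of the words of one document and its class c
P(w_1,\dots,w_N, c) = \pi_c \prod_{n=1}^{N} \beta_{c,\,w_n}

% Class is hidden: sum it out as a latent variable z
P(w_1,\dots,w_N) = \sum_{z} \pi_z \prod_{n=1}^{N} \beta_{z,\,w_n}
```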

Introduction to Topic Models

JMLR, 2003

Introduction to Topic Models

•  Latent Dirichlet Allocation

[Plate diagram: α, θ, z, w, β; plates N and M]

•  For each document d = 1,…,M
  •  Generate θd ~ Dir(. | α)
  •  For each position n = 1,…,Nd
    •  generate zn ~ Mult(. | θd)
    •  generate wn ~ Mult(. | βzn)
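
The generative story above can be read as a sampling procedure. Here is a minimal sketch in Python with NumPy, assuming β is given as a K × V matrix of topic-word probabilities. The function name `generate_lda_corpus` and the toy values are illustrative, not from the slides.

```python
import numpy as np

def generate_lda_corpus(alpha, beta, doc_lengths, seed=0):
    """Sample documents from the LDA generative process.

    alpha: (K,) Dirichlet parameter; beta: (K, V) topic-word probabilities;
    doc_lengths: list of N_d values. Returns (thetas, docs), where docs[d]
    is a list of (z_n, w_n) pairs.
    """
    rng = np.random.default_rng(seed)
    K, V = beta.shape
    thetas, docs = [], []
    for n_d in doc_lengths:                      # for each document d
        theta_d = rng.dirichlet(alpha)           # theta_d ~ Dir(. | alpha)
        doc = []
        for _ in range(n_d):                     # for each position n
            z_n = rng.choice(K, p=theta_d)       # z_n ~ Mult(. | theta_d)
            w_n = rng.choice(V, p=beta[z_n])     # w_n ~ Mult(. | beta_{z_n})
            doc.append((int(z_n), int(w_n)))
        thetas.append(theta_d)
        docs.append(doc)
    return np.array(thetas), docs

# Toy setup (assumed values): topic 0 only emits words 0-1, topic 1 only words 2-3.
alpha = np.array([0.5, 0.5])
beta = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 0.5, 0.5]])
thetas, docs = generate_lda_corpus(alpha, beta, doc_lengths=[5, 8, 3])
```

Because the toy topics have disjoint vocabularies, every sampled word reveals its topic, which makes the output easy to inspect by eye.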

Introduction to Topic Models

•  Latent Dirichlet Allocation
  –  Overcomes some technical issues with PLSA
    •  PLSA only estimates mixing parameters for training docs
  –  Parameter learning is more complicated:
    •  Gibbs Sampling: easy to program, often slow
    •  Variational EM
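
To make "easy to program, often slow" concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA. It assumes symmetric Dirichlet priors; `eta` is a smoothing prior on the topic-word distributions that the slide's diagram does not show, and all names are illustrative. The triple loop over iterations, documents, and tokens is where the slowness comes from.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of lists of word ids in [0, V). Returns point estimates
    (theta, beta) computed from the final topic assignments.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # topic counts per document
    nkw = np.zeros((K, V))           # word counts per topic
    nk = np.zeros(K)                 # total tokens per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the token's current assignment from the counts.
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # p(z = k | all other assignments), up to a constant.
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    beta = (nkw + eta) / (nkw + eta).sum(axis=1, keepdims=True)
    return theta, beta

docs = [[0, 1, 0, 1, 1], [2, 3, 3, 2, 2], [0, 0, 1, 3, 2]]
theta, beta = lda_gibbs(docs, K=2, V=4, n_iter=50)
```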

Introduction to Topic Models

•  Perplexity comparison of various models

[Figure: perplexity curves for Unigram, Mixture model, PLSA, and LDA. Lower is better.]
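
For reference, perplexity is the exponentiated negative average log-likelihood per token. The sketch below computes it for a model with per-document mixtures `theta` and topic-word matrix `beta`. It assumes `theta` is already available for the evaluated documents; for held-out text under LDA it would have to be inferred first, a step this sketch leaves out.

```python
import numpy as np

def perplexity(docs, theta, beta):
    """exp(-sum_d log p(w_d) / sum_d N_d), with p(w | d) = sum_k theta[d, k] * beta[k, w]."""
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        word_probs = theta[d] @ beta              # (V,) word distribution for document d
        log_lik += np.log(word_probs[doc]).sum()
        n_tokens += len(doc)
    return float(np.exp(-log_lik / n_tokens))

# Sanity check (assumed toy values): a single uniform topic over V words.
V = 4
uniform_beta = np.full((1, V), 1.0 / V)
uniform_theta = np.ones((2, 1))
ppl = perplexity([[0, 1, 2], [3, 3]], uniform_theta, uniform_beta)
```

A model that spreads probability uniformly over V words should score a perplexity of about V, which gives a useful baseline when reading curves like the ones on this slide.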

Introduction to Topic Models

•  Prediction accuracy for classification using learning with topic-models as features

[Figure: classification accuracy. Higher is better.]

Before LDA….LSA and pLSA

Introduction to Topic Models

•  Probabilistic Latent Semantic Analysis Model

[Plate diagram: d, z, w with parameters π, θd (topic distribution), β; plates N and M]

•  Select document d ~ Mult(π)
•  For each position n = 1,…,Nd
  •  generate zn ~ Mult(. | θd)
  •  generate wn ~ Mult(. | βzn)

PLSA model:

•  each word is generated by a single unknown multinomial distribution over words; each document mixes these distributions according to θd
•  need to estimate θd for each d → overfitting is easy

LDA:

•  integrate out θd and only estimate β
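
The point about estimating a separate θd for every document shows up directly in an EM implementation of PLSA. Here is a minimal sketch, assuming a dense D × V count matrix in which every document and every word occurs at least once. The function and variable names are illustrative.

```python
import numpy as np

def plsa_em(counts, K, n_iter=50, seed=0):
    """EM for PLSA on a (D, V) matrix of word counts n(d, w).

    Returns theta (D, K), beta (K, V), and the log-likelihood after each iteration.
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    theta = rng.dirichlet(np.ones(K), size=D)    # one mixing vector per training document
    beta = rng.dirichlet(np.ones(V), size=K)     # one word distribution per topic
    history = []
    for _ in range(n_iter):
        # E-step: q[d, k, w] = p(z = k | d, w)
        joint = theta[:, :, None] * beta[None, :, :]
        q = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate both parameter sets from expected counts.
        expected = counts[:, None, :] * q
        theta = expected.sum(axis=2)
        theta /= theta.sum(axis=1, keepdims=True)
        beta = expected.sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True)
        history.append(float((counts * np.log(theta @ beta)).sum()))
    return theta, beta, history

counts = np.array([[4, 3, 0, 0],
                   [0, 1, 5, 2],
                   [2, 2, 2, 2]])
theta, beta, history = plsa_em(counts, K=2)
```

Note that `theta` has D × K free parameters, so its size grows with the training set, and a new document has no row in it. That is the overfitting and "training docs only" issue the slides raise.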

Introduction to Topic Models

•  PLSA topics (TDT-1 corpus)

Outline

•  Tools for analysis of text
  –  Probabilistic models for text, communities, and time
    •  Mixture models and LDA models for text
    •  LDA extensions to model hyperlink structure
    •  LDA extensions to model time
  –  Alternative framework based on graph analysis to model time & community
    •  Preliminary results & tradeoffs
•  Discussion of results & challenges

Hyperlink modeling using PLSA

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

[Plate diagram: d, z, w, c with parameters π, θd, β, γ; plates N, L, and M]

•  Select document d ~ Mult(π)
•  For each position n = 1,…,Nd
  •  generate zn ~ Mult(. | θd)
  •  generate wn ~ Mult(. | βzn)
•  For each citation j = 1,…,Ld
  •  generate zj ~ Mult(. | θd)
  •  generate cj ~ Mult(. | γzj)

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

[Plate diagram: d, z, w, c with parameters π, θd, β, γ; plates N, L, and M]

PLSA likelihood: [equation not preserved in the transcript]

New likelihood: [equation not preserved in the transcript]

Learning using EM
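
The two likelihoods on this slide were images and are not in the transcript. Based on the generative process given for this model, they plausibly take the following form. This is a reconstruction under that assumption, with the document-selection term from π dropped because it does not involve θ, β, or γ.

```latex
% PLSA: words only
\mathcal{L}_{\text{PLSA}} = \sum_{d=1}^{M} \sum_{n=1}^{N_d} \log \sum_{z} \theta_{d,z}\, \beta_{z,\,w_n}

% With citations: add one term per citation c_j of document d
\mathcal{L}_{\text{new}} = \sum_{d=1}^{M} \left[ \sum_{n=1}^{N_d} \log \sum_{z} \theta_{d,z}\, \beta_{z,\,w_n}
  + \sum_{j=1}^{L_d} \log \sum_{z} \theta_{d,z}\, \gamma_{z,\,c_j} \right]
```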

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

Heuristic:

[Equation: log-likelihood with the two terms weighted by α and (1-α)]

0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks
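
A hedged reading of the heuristic, since the equation itself is missing: the word term and the citation term of the log-likelihood are blended with weights α and (1-α). Which term receives α is an assumption here (content is taken to get α), and any per-document length normalization used in the original paper is omitted.

```latex
\mathcal{L}_{\alpha} = \sum_{d=1}^{M} \left[ \alpha \sum_{n=1}^{N_d} \log \sum_{z} \theta_{d,z}\, \beta_{z,\,w_n}
  + (1-\alpha) \sum_{j=1}^{L_d} \log \sum_{z} \theta_{d,z}\, \gamma_{z,\,c_j} \right],
\qquad 0 \le \alpha \le 1
```

Under this reading, α = 1 recovers content-only PLSA and α = 0 uses hyperlinks alone.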

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

•  Experiments: Text Classification
•  Datasets:
  –  Web KB
    •  6000 CS dept web pages with hyperlinks
    •  6 Classes: faculty, course, student, staff, etc.
  –  Cora
    •  2000 Machine learning abstracts with citations
    •  7 classes: sub-areas of machine learning
•  Methodology:
  –  Learn the model on complete data and obtain θd for each document
  –  Test documents classified into the label of the nearest neighbor in training set
  –  Distance measured as cosine similarity in the θ space
  –  Measure the performance as a function of α
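
The classification step in this methodology is a nearest-neighbor lookup in θ space. Here is a minimal sketch, assuming the θd vectors have already been learned. The toy vectors and the function name are made up for illustration.

```python
import numpy as np

def nearest_neighbor_labels(theta_train, labels_train, theta_test):
    """Give each test document the label of its most cosine-similar training document."""
    def unit_rows(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit_rows(theta_test) @ unit_rows(theta_train).T   # (n_test, n_train)
    return [labels_train[i] for i in sims.argmax(axis=1)]

theta_train = np.array([[0.9, 0.1], [0.2, 0.8]])
labels_train = ["faculty", "course"]
theta_test = np.array([[0.7, 0.3], [0.1, 0.9]])
predicted = nearest_neighbor_labels(theta_train, labels_train, theta_test)
```

Repeating this for θ vectors learned at different values of α would give the performance-versus-α curve the slide describes.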

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

•  Classification performance

[Figure: classification performance; recovered labels: Hyperlink, content, link, content]

Hyperlink modeling using LDA

Hyperlink modeling using LinkLDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

[Plate diagram: α, θ, z, w, β and z, c, γ; plates N, L, and M]

•  For each document d = 1,…,M
  •  Generate θd ~ Dir(. | α)
  •  For each position n = 1,…,Nd
    •  generate zn ~ Mult(. | θd)
    •  generate wn ~ Mult(. | βzn)
  •  For each citation j = 1,…,Ld
    •  generate zj ~ Mult(. | θd)
    •  generate cj ~ Mult(. | γzj)

Learning using variational EM
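
The key structural point of LinkLDA is that words and citations are drawn from the same per-document θd. Here is a minimal sampling sketch of one document, assuming β (K × V) and γ (K × number of citable documents) are given. All names and values are illustrative.

```python
import numpy as np

def generate_linklda_doc(alpha, beta, gamma, n_words, n_citations, rng):
    """Sample one document's words and citations from a shared theta_d."""
    K = len(alpha)
    theta_d = rng.dirichlet(alpha)                            # theta_d ~ Dir(. | alpha)
    word_topics = rng.choice(K, size=n_words, p=theta_d)      # z_n ~ Mult(. | theta_d)
    words = [int(rng.choice(beta.shape[1], p=beta[z])) for z in word_topics]
    cite_topics = rng.choice(K, size=n_citations, p=theta_d)  # z_j ~ Mult(. | theta_d)
    citations = [int(rng.choice(gamma.shape[1], p=gamma[z])) for z in cite_topics]
    return theta_d, words, citations

rng = np.random.default_rng(0)
alpha = np.array([1.0, 1.0])
beta = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])             # K=2 topics, V=3 words
gamma = np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]])  # 4 citable documents
theta_d, words, citations = generate_linklda_doc(alpha, beta, gamma, 6, 2, rng)
```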

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

Newswire Text

•  Goals of analysis:
  –  Extract information about events from text
  –  “Understanding” text requires understanding “typical reader”
    •  Conventions for communicating with him/her
    •  Prior knowledge, background, …

Social Media Text

•  Goals of analysis:
  –  Very diverse
  –  Evaluation is difficult
    •  And requires revisiting often as goals evolve
  –  Often “understanding” social text requires understanding a community

Science as a testbed for social text: an open community which we understand

Author-Topic Model for Scientific Literature

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

[Plate diagram: a, x, z, w with parameters θ, φ, β; plates N, M, A, and K]

•  For each author a = 1,…,A
  •  Generate θa ~ Dir(. | γ)
•  For each topic k = 1,…,K
  •  Generate φk ~ Dir(. | α)
•  For each document d = 1,…,M
  •  For each position n = 1,…,Nd
    •  Generate author x ~ Unif(. | ad)
    •  Generate zn ~ Mult(. | θx)
    •  Generate wn ~ Mult(. | φzn)
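
What distinguishes this model from LDA is the extra step that picks a responsible author for each word. Here is a minimal sketch of the per-document sampling, assuming the author-topic matrix `theta` (A × K) and topic-word matrix `phi` (K × V) are already drawn. The names and the toy sizes are illustrative.

```python
import numpy as np

def generate_author_topic_doc(authors_d, theta, phi, n_words, rng):
    """Sample (author, topic, word) triples for one document with author set authors_d."""
    tokens = []
    for _ in range(n_words):
        x = int(rng.choice(authors_d))                    # author x drawn uniformly from a_d
        z = int(rng.choice(theta.shape[1], p=theta[x]))   # topic from that author's mixture
        w = int(rng.choice(phi.shape[1], p=phi[z]))       # word from the topic
        tokens.append((x, z, w))
    return tokens

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(3), size=4)   # A=4 authors, K=3 topics
phi = rng.dirichlet(np.ones(5), size=3)     # K=3 topics, V=5 words
tokens = generate_author_topic_doc([1, 3], theta, phi, n_words=10, rng=rng)
```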

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

•  Perplexity results

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

•  Topic-Author visualization

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

•  Application 1: Author similarity

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

•  Application 2: Author entropy
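
One natural reading of "author entropy" is the Shannon entropy of an author's topic distribution θa: low for authors who stick to one topic, high for authors who spread across many. A minimal sketch under that assumption, measured in bits:

```python
import numpy as np

def topic_entropy(theta_a):
    """Shannon entropy (in bits) of one author's topic distribution."""
    p = np.asarray(theta_a, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return float(-(p * np.log2(p)).sum())

focused = topic_entropy([1.0, 0.0, 0.0, 0.0])      # a single-topic author
broad = topic_entropy([0.25, 0.25, 0.25, 0.25])    # an author spread over 4 topics
```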

Labeled LDA: [Ramage, Hall, Nallapati, Manning, EMNLP 2009]

Labeled LDA

Del.icio.us tags as labels for documents

Labeled LDA

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

Gibbs sampling

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

•  Datasets
  –  Enron email data
    •  23,488 messages between 147 users
  –  McCallum’s personal email
    •  23,488(?) messages with 128 authors

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

•  Topic Visualization: Enron set

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

•  Topic Visualization: McCallum’s data

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

Models of hypertext for blogs [ICWSM 2008]

Ramesh Nallapati

me

Amr Ahmed Eric Xing

Link-PLSA-LDA

•  LinkLDA model for citing documents
•  Variant of PLSA model for cited documents
•  Topics are shared between citing, cited
•  Links depend on topics in two documents

Experiments

•  8.4M blog postings in Nielsen/Buzzmetrics corpus
  –  Collected over three weeks in summer 2005
•  Selected all postings with >=2 inlinks or >=2 outlinks
  –  2248 citing (2+ outlinks), 1777 cited documents (2+ inlinks)
  –  Only 68 in both sets, which are duplicated
•  Fit model using variational EM

Topics in blogs

Model can answer questions like: which blogs are most likely to be cited when discussing topic z?

Topics in blogs

Model can be evaluated by predicting which links an author will include in an article

[Figure: link prediction results for Link-PLSA-LDA and Link-LDA. Lower is better.]

Another model: Pairwise Link-LDA

[Plate diagram: two LDA models (α, θ, z, w, β; plate N), one for each document of a pair, joined by topic variables z, z and a link indicator c with parameter γ]

•  LDA for both cited and citing documents
•  Generate an indicator for every pair of docs
  •  Vs. generating pairs of docs
•  Link depends on the mixing components (θ’s)
  •  stochastic block model
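
To illustrate how a link can depend on the two mixing components in a block-model fashion: if each document of a pair draws one topic from its own θ and the link indicator is Bernoulli with a topic-pair-specific probability, then the marginal link probability is a bilinear form in the two θ vectors. The sketch below assumes that reading, with `gamma` standing for a hypothetical K × K matrix of topic-pair link probabilities. It is not taken from the slides.

```python
import numpy as np

def link_probability(theta_citing, theta_cited, gamma):
    """P(c = 1) = sum over topic pairs (k, l) of theta_citing[k] * gamma[k, l] * theta_cited[l]."""
    return float(theta_citing @ gamma @ theta_cited)

# Assumed toy block matrix: links are likely within a topic, rare across topics.
gamma = np.array([[0.80, 0.05],
                  [0.05, 0.60]])
theta_a = np.array([0.9, 0.1])
theta_b = np.array([0.8, 0.2])
theta_c = np.array([0.1, 0.9])
p_ab = link_probability(theta_a, theta_b, gamma)   # two documents on similar topics
p_ac = link_probability(theta_a, theta_c, gamma)   # two documents on different topics
```

Scoring every pair of documents this way costs time quadratic in the corpus size, which is one practical difference from models that generate only the observed links.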

Pairwise Link-LDA supports new inferences…

…but doesn’t perform better on link prediction

Outline

•  Tools for analysis of text
  –  Probabilistic models for text, communities, and time
    •  Mixture models and LDA models for text
    •  LDA extensions to model hyperlink structure
      –  Observation: these models can be used for many purposes…
    •  LDA extensions to model time
  –  Alternative framework based on graph analysis to model time & community
•  Discussion of results & challenges

Authors are using a number of clever tricks for inference….

Predicting Response to Political Blog Posts with Topic Models [NAACL ’09]

Tae Yano Noah Smith

Political blogs and comments

Comment style is casual, creative, less carefully edited

Posts are often coupled with comment sections

Political blogs and comments

•  Most of the text associated with large “A-list” community blogs is comments
  –  5-20x as many words in comments as in text for the 5 sites considered in Yano et al.
•  A large part of socially-created commentary in the blogosphere is comments.
  –  Not blog → blog hyperlinks
•  Comments do not just echo the post

Modeling political blogs

Our political blog model:

CommentLDA

D = # of documents; N = # of words in post; M = # of words in comments

z, z` = topic
w = word (in post)
w` = word (in comments)
u = user

Modeling political blogs

Our proposed political blog model:

CommentLDA LHS is vanilla LDA

D = # of documents; N = # of words in post; M = # of words in comments

Modeling political blogs

Our proposed political blog model:

CommentLDA

RHS to capture the generation of reaction separately from the post body

Two separate sets of word distributions

D = # of documents; N = # of words in post; M = # of words in comments

Two chambers share the same topic-mixture

Modeling political blogs

Our proposed political blog model:

CommentLDA

User IDs of the commenters as a part of comment text

generate the words in the comment section

D = # of documents; N = # of words in post; M = # of words in comments

Modeling political blogs

Another model we tried:

CommentLDA

This is a model agnostic to the words in the comment section!

D = # of documents; N = # of words in post; M = # of words in comments

Took out the words from the comment section!

The model is structurally equivalent to the LinkLDA from (Erosheva et al., 2004)

Topic discovery - Matthew Yglesias (MY) site

Topic discovery - Matthew Yglesias (MY) site

Topic discovery - Matthew Yglesias (MY) site

Comment prediction

•  LinkLDA and CommentLDA consistently outperform baseline models
•  Neither consistently outperforms the other.

[Figure: user prediction, precision at top 10, one panel per site: (CB), (MY), (RS). Bars from left to right: Link LDA (-v, -r, -c), Cmnt LDA (-v, -r, -c), Baseline (Freq, NB). Highlighted values: 20.54 %, 16.92 %, 32.06 %. Highlighted models: Comment LDA (R), Link LDA (R), Link LDA (C).]
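
For readers unfamiliar with the metric: precision at top 10 asks what fraction of the 10 users the model ranks as most likely to comment actually did comment. A minimal sketch follows, with hypothetical user IDs and k reduced to 4 to keep the example small.

```python
def precision_at_k(ranked_users, actual_commenters, k=10):
    """Fraction of the top-k ranked users who actually commented on the post."""
    top = ranked_users[:k]
    return sum(user in actual_commenters for user in top) / k

ranked = ["ann", "bob", "cat", "dan", "eve"]    # model's ranking, best first
actual = {"ann", "cat", "eve"}                  # users who really commented
score = precision_at_k(ranked, actual, k=4)     # "ann" and "cat" are hits
```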

Document modeling with Latent Dirichlet Allocation (LDA)

[Plate diagram: α, θ, z, w, β; plates N and M]

•  For each document d = 1,…,M
  •  Generate θd ~ Dir(. | α)
  •  For each position n = 1,…,Nd
    •  generate zn ~ Mult(. | θd)
    •  generate wn ~ Mult(. | βzn)

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

“SNA” = Jensen-Shannon divergence for recipients of messages
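
Jensen-Shannon divergence is a symmetrized, bounded variant of KL divergence. Here is a minimal sketch of computing it between two discrete distributions, for example two users' distributions over recipients. Base-2 logarithms are an assumption here, chosen so the value lies in [0, 1].

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions, in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                  # the midpoint distribution

    def kl(a, b):
        mask = a > 0                   # terms with a == 0 contribute nothing
        return float((a[mask] * np.log2(a[mask] / b[mask])).sum())

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = js_divergence([0.5, 0.5], [0.5, 0.5])        # identical distributions
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])    # no overlap at all
```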

Modeling Citation Influences

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

•  Copycat model of citation influence

c is a cited document
s is a coin toss to mix γ and ψ

[Diagram labels: plagiarism, innovation]
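
The coin toss can be sketched as a two-branch sampler for the topic of each word in a citing document. This is an illustration, not the paper's exact specification: it assumes γ is a distribution over the cited documents, ψ is the citing document's own "innovation" topic mixture, and `lam` (a hypothetical name) is the probability of the innovation branch.

```python
import numpy as np

def sample_topic(lam, gamma, cited_topic_mixtures, psi, rng):
    """Sample the topic of one word in a citing document.

    lam: probability of the innovation branch; gamma: (C,) distribution over
    cited documents; cited_topic_mixtures: (C, K) topic mixture of each cited
    document; psi: (K,) innovation topic mixture.
    """
    s = rng.random() < lam                          # the coin toss s
    if s:                                           # innovation: use the document's own mixture
        return "innovation", int(rng.choice(len(psi), p=psi))
    c = int(rng.choice(len(gamma), p=gamma))        # otherwise pick a cited document c
    topic = rng.choice(cited_topic_mixtures.shape[1], p=cited_topic_mixtures[c])
    return "copied", int(topic)

rng = np.random.default_rng(0)
gamma = np.array([0.7, 0.3])                        # influence of two cited documents
cited = np.array([[0.9, 0.1], [0.2, 0.8]])
psi = np.array([0.5, 0.5])
draws = [sample_topic(0.4, gamma, cited, psi, rng) for _ in range(20)]
```

Under this reading, a large weight in γ on one cited document means that document accounts for many of the citing document's copied words, which is what an influence graph would visualize.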

s is a coin toss to mix γ and ψ

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

•  Citation influence graph for LDA paper

Modeling Citation Influences

Modeling Citation Influences

User study: self-reported citation influence on Likert scale
LDA-post is Prob(cited doc|paper)
LDA-js is Jensen-Shannon dist in topic space