Generative Topic Models for Community Analysis


Generative Topic Models for Community Analysis

Pilfered from: Ramesh Nallapati, http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.ppt

2 / 57

Objectives

• Cultural literacy for ML:

– Q: What are “topic models”?

– A1: popular indoor sport for machine learning researchers

– A2: a particular way of applying unsupervised learning of Bayes nets to text

• Quick historical survey of some sample papers in the area

3 / 57

Outline

• Part I: Introduction to Topic Models

– Naive Bayes model
– Mixture models

• Expectation Maximization

– PLSA
– LDA

• Variational EM
• Gibbs sampling

• Part II: Topic Models for Community Analysis

– Citation modeling with PLSA
– Citation modeling with LDA
– Author-Topic model
– Author-Topic-Recipient model
– Modeling influence of citations
– Mixed-membership Stochastic Block Model

4 / 57

Introduction to Topic Models

• Multinomial Naïve Bayes

[Plate diagram: class C → words W1 W2 W3 … WN, repeated for each of M documents]

• For each document d = 1, …, M

• Generate c_d ~ Mult(· | π)

• For each position n = 1, …, N_d

• Generate w_n ~ Mult(· | β_{c_d})
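As a concreteness check, here is a minimal NumPy sketch of this sampling process; the sizes, the document-length distribution, and the symbols π and β are illustrative assumptions rather than anything specified on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 3, 1000, 100                        # illustrative sizes

pi = rng.dirichlet(np.ones(K))                # class prior pi
beta = rng.dirichlet(np.ones(V), size=K)      # per-class word distributions beta_c

docs = []
for d in range(M):
    N_d = rng.poisson(50) + 1                 # document length (outside the model)
    c_d = rng.choice(K, p=pi)                 # generate c_d ~ Mult(. | pi)
    w = rng.choice(V, size=N_d, p=beta[c_d])  # each w_n ~ Mult(. | beta_{c_d})
    docs.append((c_d, w))
```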

5 / 57

Introduction to Topic Models

• Naïve Bayes Model: Compact representation

[Plate diagrams: the unrolled model (C → W1 W2 W3 … WN, repeated M times) and its compact plate form (C → W inside an N plate, inside an M plate)]

6 / 57

Introduction to Topic Models

• Mixture model: unsupervised naïve Bayes model

[Plate diagram: latent class C → word W, inside N and M plates]

• Joint probability of words and classes:

P(c_d, w_1, …, w_{N_d}) = π_{c_d} ∏_n β_{w_n | c_d}

• But classes are not visible: treat the class as a latent variable Z and maximize the marginal likelihood

P(w_1, …, w_{N_d}) = ∑_z π_z ∏_n β_{w_n | z}
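A small sketch of how this marginal likelihood, and the E-step responsibilities that EM uses, can be computed stably in log space; the array shapes and names are assumptions for illustration:

```python
import numpy as np
from scipy.special import logsumexp

def doc_log_likelihood(words, log_pi, log_beta):
    """Marginal log-likelihood of one document:
    log P(w) = logsumexp_z [ log pi_z + sum_n log beta_{z, w_n} ].
    words: array of word ids; log_pi: (K,); log_beta: (K, V)."""
    per_class = log_pi + log_beta[:, words].sum(axis=1)
    return logsumexp(per_class)

def responsibilities(words, log_pi, log_beta):
    """E-step of EM: posterior P(z | w) over the document's hidden class."""
    per_class = log_pi + log_beta[:, words].sum(axis=1)
    return np.exp(per_class - logsumexp(per_class))
```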

7 / 57

Introduction to Topic Models

8 / 57

Introduction to Topic Models

• Probabilistic Latent Semantic Analysis Model

[Plate diagram: document index d → topic z → word w, for N positions in each of M documents]

• Select document d with probability P(d)

• For each position n = 1, …, N_d

• Generate z_n ~ Mult(· | θ_d)

• Generate w_n ~ Mult(· | β_{z_n})

Topic distribution: θ_d, one per training document
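A minimal sampling sketch of the PLSA story above; the sizes and the randomly initialized parameters (including the distribution p_d over training documents) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 5, 1000, 100                         # illustrative sizes

p_d = rng.dirichlet(np.ones(M))                # P(d): a distribution over training docs
theta = rng.dirichlet(np.ones(K), size=M)      # theta_d: topic mixture of each doc
beta = rng.dirichlet(np.ones(V), size=K)       # beta_z: word distribution of each topic

d = rng.choice(M, p=p_d)                       # select document d
for n in range(rng.poisson(50) + 1):           # N_d positions
    z = rng.choice(K, p=theta[d])              # z_n ~ Mult(. | theta_d)
    w = rng.choice(V, p=beta[z])               # w_n ~ Mult(. | beta_{z_n})
```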

9 / 57

Introduction to Topic Models

• Probabilistic Latent Semantic Analysis Model

– Learning using EM

– Not a complete generative model

• Has a distribution only over the training set of documents: no new document can be generated!

– Nevertheless, more realistic than the mixture model

• Documents can discuss multiple topics!

10 / 57

Introduction to Topic Models

• PLSA topics (TDT-1 corpus)

11 / 57

Introduction to Topic Models

12 / 57

Introduction to Topic Models

• Latent Dirichlet Allocation

[Plate diagram: topic z → word w, for N positions in each of M documents, with a per-document θ_d]

• For each document d = 1, …, M

• Generate θ_d ~ Dir(· | α)

• For each position n = 1, …, N_d

• Generate z_n ~ Mult(· | θ_d)

• Generate w_n ~ Mult(· | β_{z_n})
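The same kind of sampling sketch for LDA; all sizes and the hyperparameters α and η are illustrative (drawing β from a Dirichlet corresponds to the smoothed variant of the model):

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, M, alpha, eta = 5, 2000, 50, 0.1, 0.01     # illustrative sizes/hyperparameters

beta = rng.dirichlet(np.full(V, eta), size=K)    # corpus-wide topics beta_k

corpus = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))   # theta_d ~ Dir(alpha)
    N_d = rng.poisson(80) + 1
    z = rng.choice(K, size=N_d, p=theta_d)       # z_n ~ Mult(. | theta_d)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_n ~ Mult(. | beta_{z_n})
    corpus.append(w)
```

Unlike the PLSA sketch, θ_d here is drawn fresh from the Dirichlet prior, so the loop generates documents that were never in any training set.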

13 / 57

Introduction to Topic Models

• Latent Dirichlet Allocation

– Overcomes the issues with PLSA

• Can generate any random document

– Parameter learning:

• Variational EM

– Numerical approximation using lower bounds

– Results in biased solutions

– Convergence has numerical guarantees

• Gibbs sampling

– Stochastic simulation

– Unbiased solutions

– Stochastic convergence

14 / 57

Introduction to Topic Models

• Variational EM for LDA

– Approximate the true posterior by a simpler, fully factorized distribution:

q(θ, z | γ, φ) = q(θ | γ) ∏_n q(z_n | φ_n)

– Maximize the resulting lower bound on the log-likelihood

• A convex function in each parameter!
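A sketch of the per-document mean-field updates in the style of Blei, Ng, and Jordan's variational inference for LDA, using the γ and φ above; treat it as a schematic, not a tuned implementation:

```python
import numpy as np
from scipy.special import digamma

def lda_e_step(words, alpha, beta, iters=50):
    """Mean-field updates for one document:
    q(theta, z) = q(theta | gamma) * prod_n q(z_n | phi_n).
    words: array of word ids; alpha: scalar Dirichlet parameter; beta: (K, V) topics."""
    K = beta.shape[0]
    N = len(words)
    gamma = alpha + np.full(K, N / K)            # standard initialization
    phi = np.full((N, K), 1.0 / K)
    for _ in range(iters):
        # phi_{n,k} proportional to beta_{k, w_n} * exp(digamma(gamma_k))
        log_phi = np.log(beta[:, words]).T + digamma(gamma)
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)          # gamma_k = alpha + sum_n phi_{n,k}
    return gamma, phi
```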

15 / 57

Introduction to Topic Models

• Gibbs sampling

– Applicable when the joint distribution is hard to evaluate but the conditional distributions are known

– The sequence of samples forms a Markov chain

– The stationary distribution of the chain is the joint distribution
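For LDA specifically, the collapsed Gibbs sampler of Griffiths and Steyvers resamples each token's topic from p(z_n = k | z_-n, w) ∝ (n_dk + α)(n_kw + η)/(n_k + Vη). A sketch of one sweep, with count-table initialization omitted:

```python
import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, eta, rng):
    """One sweep of collapsed Gibbs sampling for LDA.
    docs[d] is a list of word ids, z[d][n] the current topic of token n in doc d;
    n_dk, n_kw, n_k are doc-topic, topic-word, and topic counts and must be
    consistent with z on entry."""
    K, V = n_kw.shape
    for d, words in enumerate(docs):
        for n, w in enumerate(words):
            k = z[d][n]
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1   # remove this token
            # p(z_n = k | rest) ∝ (n_dk + alpha) * (n_kw + eta) / (n_k + V*eta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
            k = rng.choice(K, p=p / p.sum())
            z[d][n] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1   # add it back
```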

16 / 57

Introduction to Topic Models

• LDA topics

17 / 57

Introduction to Topic Models

• LDA’s view of a document

18 / 57

Introduction to Topic Models

• Perplexity comparison of various models

[Figure: perplexity curves for the Unigram, Mixture model, PLSA, and LDA models; lower is better]
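For reference, perplexity is computed from held-out log-likelihood as below; a uniform unigram model over V words scores exactly V, and better models score lower:

```python
import numpy as np

def perplexity(test_log_likelihood, test_token_count):
    """perplexity = exp(-(held-out log-likelihood) / (number of held-out tokens))."""
    return float(np.exp(-test_log_likelihood / test_token_count))
```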

19 / 57

Outline

• Part I: Introduction to Topic Models

– Naive Bayes model
– Mixture models

• Expectation Maximization

– PLSA
– LDA

• Variational EM
• Gibbs sampling

• Part II: Topic Models for Community Analysis

– Citation modeling with PLSA
– Citation modeling with LDA
– Author-Topic model
– Author-Topic-Recipient model
– Modeling influence of citations
– Mixed-membership Stochastic Block Model

20 / 57

Hyperlink modeling using PLSA

21 / 57

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

[Plate diagram: document d → topic z → word w (N positions) and d → topic z → citation c (L citations), in each of M documents]

• Select document d with probability P(d)

• For each position n = 1, …, N_d

• Generate z_n ~ Mult(· | θ_d)

• Generate w_n ~ Mult(· | β_{z_n})

• For each citation j = 1, …, L_d

• Generate z_j ~ Mult(· | θ_d)

• Generate c_j ~ Mult(· | γ_{z_j}), where γ_z is a per-topic distribution over cited documents
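A sketch of the citation block, extending the earlier PLSA sampler; the name gamma for the per-topic citation distribution is my labeling, since the slide's symbol was lost:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 5, 100                                # illustrative sizes

theta = rng.dirichlet(np.ones(K), size=M)    # theta_d as in the PLSA sketch
gamma = rng.dirichlet(np.ones(M), size=K)    # gamma_z: per-topic citation distribution

d = rng.choice(M)                            # a training document
L_d = rng.poisson(10) + 1                    # number of citations in d
for j in range(L_d):
    z = rng.choice(K, p=theta[d])            # z_j ~ Mult(. | theta_d)
    c = rng.choice(M, p=gamma[z])            # c_j ~ Mult(. | gamma_{z_j})
```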

22 / 57

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

[Plate diagram repeated: d → z → w (N positions) and d → z → c (L citations), in M documents]

PLSA likelihood:

ℒ = ∑_d ∑_w n(d, w) log ∑_z P(z | d) P(w | z)

New likelihood, with a second term for citations:

ℒ = ∑_d [ ∑_w n(d, w) log ∑_z P(z | d) P(w | z) + ∑_c n(d, c) log ∑_z P(z | d) P(c | z) ]

Learning using EM

23 / 57

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

Heuristic: maximize a convex combination of the two log-likelihoods,

ℒ = α · (content term) + (1 − α) · (citation term)

where 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks.

24 / 57

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

• Classification performance

[Figure: classification performance as the weight α shifts between hyperlink and content information]

25 / 57

Hyperlink modeling using LDA

26 / 57

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS 2004]

[Plate diagram: topic z → word w (N positions) and topic z → citation c (L citations), with a per-document θ_d, in each of M documents]

• For each document d = 1, …, M

• Generate θ_d ~ Dir(· | α)

• For each position n = 1, …, N_d

• Generate z_n ~ Mult(· | θ_d)

• Generate w_n ~ Mult(· | β_{z_n})

• For each citation j = 1, …, L_d

• Generate z_j ~ Mult(· | θ_d)

• Generate c_j ~ Mult(· | γ_{z_j})


Learning using variational EM
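A compact sketch of how this differs from the Cohn-Hofmann model: θ_d now has a Dirichlet prior, so unseen documents can be generated; gamma is again an assumed name for the per-topic citation distribution, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, alpha = 5, 1000, 100, 0.1            # illustrative sizes
beta = rng.dirichlet(np.ones(V), size=K)      # per-topic word distributions
gamma = rng.dirichlet(np.ones(M), size=K)     # per-topic citation distributions

theta_d = rng.dirichlet(np.full(K, alpha))    # theta_d ~ Dir(alpha): a real prior now
words = [rng.choice(V, p=beta[rng.choice(K, p=theta_d)]) for _ in range(80)]
cites = [rng.choice(M, p=gamma[rng.choice(K, p=theta_d)]) for _ in range(12)]
```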

27 / 57

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS 2004]

28 / 57

Author-Topic Model for Scientific Literature

29 / 57

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

[Plate diagram: author choice x → topic z → word w, for N positions in each of M documents]

• For each author a = 1, …, A

• Generate θ_a ~ Dir(· | α)

• For each topic k = 1, …, K

• Generate β_k ~ Dir(· | η)

• For each document d = 1, …, M

• For each position n = 1, …, N_d

• Generate author x ~ Unif(a_d), uniform over the authors a_d of document d

• Generate z_n ~ Mult(· | θ_x)

• Generate w_n ~ Mult(· | β_{z_n})

[Plate diagram: observed author set a_d → sampled author x → topic z → word w, with θ_a over A authors and β_k over K topics]
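A sampling sketch of the Author-Topic story; the author-set sizes, document lengths, and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
A, K, V, M, alpha, eta = 20, 5, 1000, 30, 0.1, 0.01   # illustrative sizes

theta = rng.dirichlet(np.full(K, alpha), size=A)  # theta_a ~ Dir(alpha), one per author
beta = rng.dirichlet(np.full(V, eta), size=K)     # beta_k ~ Dir(eta), one per topic

corpus = []
for d in range(M):
    a_d = rng.choice(A, size=rng.integers(1, 4), replace=False)  # observed author set
    doc = []
    for n in range(rng.poisson(60) + 1):
        x = rng.choice(a_d)                   # x ~ Unif(a_d)
        z = rng.choice(K, p=theta[x])         # z_n ~ Mult(. | theta_x)
        w = rng.choice(V, p=beta[z])          # w_n ~ Mult(. | beta_{z_n})
        doc.append((x, z, w))
    corpus.append((a_d, doc))
```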

30 / 57

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

Learning: Gibbs sampling

[Plate diagram repeated from the previous slide]

31 / 57

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

• Topic-Author visualization

32 / 57

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

33 / 57

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

Gibbs sampling

34 / 57

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

• Datasets

– Enron email data

• 23,488 messages between 147 users

– McCallum’s personal email

• 23,488(?) messages with 128 authors

35 / 57

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

• Topic Visualization: Enron set

36 / 57

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

• Topic Visualization: McCallum’s data

37 / 57

Modeling Citation Influences

38 / 57

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Citation influence model

39 / 57

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Citation influence graph for the LDA paper

40 / 57

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Words in the LDA paper assigned to citations

41 / 57

Link-PLSA-LDA: Topic Influence in Blogs (ICWSM 2008)

Ramesh Nallapati, Amr Ahmed, Eric Xing

42 / 57
