Page 1:

Shared Components Topic Models

Matthew R. Gormley, Mark Dredze, Benjamin Van Durme, Jason Eisner

NAACL 2012, June 6, 2012

Center for Language and Speech Processing
Human Language Technology Center of Excellence
Johns Hopkins University

Page 2:

Contrast of LDA Extensions

Topic Model = Distributions over topics (docs) + Distributions over words (topics)

Page 3:

Contrast of LDA Extensions

Topic Model = Distributions over topics (docs) + Distributions over words (topics)

Most extensions to LDA  |  Our Model

Page 4:

LDA for Topic Modeling

• Each topic is defined as a Multinomial distribution over the vocabulary, parameterized by ϕk

(Blei, Ng, & Jordan, 2003)

[Figure: six bar charts, one per topic, plotting probability (roughly 0.000-0.012) over words.]

Page 5:

LDA for Topic Modeling

• Each topic is defined as a Multinomial distribution over the vocabulary, parameterized by ϕk

(Blei, Ng, & Jordan, 2003)

ϕ1   ϕ2   ϕ3   ϕ4   ϕ5   ϕ6

[Figure: the same six bar charts, now labeled ϕ1-ϕ6.]

Page 6:

LDA for Topic Modeling

• A topic is visualized as its high probability words.

(Blei, Ng, & Jordan, 2003)

ϕ1   ϕ2   ϕ3   ϕ4   ϕ5   ϕ6

team, season, hockey, player, penguins, ice, canadiens, puck, montreal, stanley, cup

[Figure: the six bar charts, with one topic expanded into its highest-probability words.]

Page 7:

LDA for Topic Modeling

• A topic is visualized as its high probability words.
• A pedagogical label is used to identify the topic.

(Blei, Ng, & Jordan, 2003)

ϕ1   ϕ2   ϕ3 {hockey}   ϕ4   ϕ5   ϕ6

team, season, hockey, player, penguins, ice, canadiens, puck, montreal, stanley, cup

[Figure: the six bar charts, with ϕ3 labeled {hockey}.]

Page 8:

LDA for Topic Modeling

• A topic is visualized as its high probability words.
• A pedagogical label is used to identify the topic.

(Blei, Ng, & Jordan, 2003)

ϕ1 {Canadian gov.}   ϕ2 {government}   ϕ3 {hockey}   ϕ4 {U.S. gov.}   ϕ5 {baseball}   ϕ6 {Japan}

Page 9:

LDA for Topic Modeling

θ1 =  [topic proportions for document 1, drawn from Dirichlet(α)]

ϕ1 {Canadian gov.}   ϕ2 {government}   ϕ3 {hockey}   ϕ4 {U.S. gov.}   ϕ5 {baseball}   ϕ6 {Japan}

Page 10:

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US

θ1 =  [drawn from Dirichlet(α)]

ϕ1 {Canadian gov.}   ϕ2 {government}   ϕ3 {hockey}   ϕ4 {U.S. gov.}   ϕ5 {baseball}   ϕ6 {Japan}

Page 11:

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US

θ1 =  [drawn from Dirichlet(α)]

ϕ1 {Canadian gov.}   ϕ2 {government}   ϕ3 {hockey}   ϕ4 {U.S. gov.}   ϕ5 {baseball}   ϕ6 {Japan}

Page 12:

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard

θ1 =  [drawn from Dirichlet(α)]

ϕ1 {Canadian gov.}   ϕ2 {government}   ϕ3 {hockey}   ϕ4 {U.S. gov.}   ϕ5 {baseball}   ϕ6 {Japan}

Page 13:

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…

θ1 =  [drawn from Dirichlet(α)]

ϕ1 {Canadian gov.}   ϕ2 {government}   ϕ3 {hockey}   ϕ4 {U.S. gov.}   ϕ5 {baseball}   ϕ6 {Japan}

Page 14:

LDA for Topic Modeling

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any baseball…

θ1 =   θ2 =   θ3 =   [per-document topic proportions, each drawn from Dirichlet(α)]

ϕ1 {Canadian gov.}   ϕ2 {government}   ϕ3 {hockey}   ϕ4 {U.S. gov.}   ϕ5 {baseball}   ϕ6 {Japan}

Page 15:

LDA for Topic Modeling

Dirichlet(β)   [prior over each topic ϕk]

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any baseball…

θ1 =   θ2 =   θ3 =   Dirichlet(α)

ϕ1 {Canadian gov.}   ϕ2 {government}   ϕ3 {hockey}   ϕ4 {U.S. gov.}   ϕ5 {baseball}   ϕ6 {Japan}

Page 16:

LDA for Topic Modeling

Dirichlet(β)

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any baseball…

θ1 =   θ2 =   θ3 =   Dirichlet(α)

ϕ1 {Canadian gov.}   ϕ2 {government}   ϕ3 {hockey}   ϕ4 {U.S. gov.}   ϕ5 {baseball}   ϕ6 {Japan}

Distributions over topics (docs)
Distributions over words (topics)

Page 17:

LDA for Topic Modeling

(Same content as Page 16.)
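The generative story built up on these slides (and stated formally in the supplementary excerpt shown later in the deck) fits in a few lines of code. The snippet below is a minimal illustrative sketch using numpy, with made-up sizes and hyperparameter values; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N_m = 6, 1000, 3, 50      # topics, vocabulary, documents, words per doc (illustrative sizes)
alpha, beta = 0.1, 0.01            # symmetric Dirichlet hyperparameters (assumed values)

# Topics: each phi_k is a Multinomial distribution over the vocabulary, phi_k ~ Dir(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)               # shape (K, V)

docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))              # document's distribution over topics
    z_m = rng.choice(K, size=N_m, p=theta_m)                # topic assignment for each word slot
    x_m = np.array([rng.choice(V, p=phi[z]) for z in z_m])  # draw each word from its topic
    docs.append(x_m)
```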

Page 18:

LDA for Topic Modeling

Two problems with the LDA generative story for topics:
1. Independently generate each topic
2. For each topic, store a parameter per word in the vocabulary

Page 19:

LDA for Topic Modeling

Two problems with the LDA generative story for topics:
1. Independently generate each topic
2. For each topic, store a parameter per word in the vocabulary

We're not the first to notice this…

Page 20:

Our Model

Shared Components Topic Model (SCTM):
– Generate a pool of "components" (proto-topics)
– Assemble each topic from some of the components
  • Multiply and renormalize ("product of experts")
– Documents are mixtures of topics (just like LDA)

1. So the wordlists of two topics are not generated independently!
2. Fewer parameters

Page 21:

SCTM: Motivating Example

Components are distributions over words. How to combine components into topics?

ϕ1 {sports}: player, team, hockey, baseball, Orioles, Canucks, season
ϕ2 {Canada}: canada, Quebec, parliament, snow, Hansard's, Elizabeth II, hockey
ϕ3 {government}: democracy, socialism, voted, election, Obama, Putin, parliament

(Each list is in decreasing order of probability.)

Page 22:

SCTM: Motivating Example

We can imagine a component as a set of words (i.e. all the non-zero probabilities are identical):

ϕ1 {sports}: player, team, hockey, baseball, Orioles, Canucks, season
ϕ2 {Canada}: canada, Quebec, parliament, snow, Hansard's, Elizabeth II, hockey
ϕ3 {government}: democracy, socialism, voted, election, Obama, Putin, parliament

Page 23:

SCTM: Motivating Example

To create a {Canadian government} topic we could take the union of {government} and {Canada}.

ϕ1 {sports}: player, team, hockey, baseball, Orioles, Canucks, season
ϕ2 {Canada}: canada, Quebec, parliament, snow, Hansard's, Elizabeth II, hockey
ϕ3 {government}: democracy, socialism, voted, election, Obama, Putin, parliament

Page 24:

SCTM: Motivating Example

Better yet, to create a {Canadian government} topic we could take the intersection of {government} and {Canada}.

ϕ1 {sports}   ϕ2 {Canada}   ϕ3 {government}
{Canadian gov.}

Page 25:

SCTM: Motivating Example

Better yet, to create a {Canadian government} topic we could take the intersection of {government} and {Canada}.

ϕ1 {sports}   ϕ2 {Canada}   ϕ3 {government}
{Canadian gov.}   {hockey}?

Page 26:

SCTM: Motivating Example

More complex intersections might be more realistic:

Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}
Topics: b1 {Canadian gov.}

Page 27:

SCTM: Motivating Example

More complex intersections might be more realistic:

Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}
Topics: b1 {Canadian gov.}   b4 {U.S. gov.}

Page 28:

SCTM: Motivating Example

More complex intersections might be more realistic:

Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}
Topics: b1 {Canadian gov.}   b3 {hockey}   b4 {U.S. gov.}

Page 29:

SCTM: Motivating Example

More complex intersections might be more realistic:

Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}

Page 30:

SCTM: Motivating Example

More complex intersections might be more realistic:

Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}

Page 31: jason/papers/gormley+al.naacl12.slides.pdfLDA!for!Topic!Modeling! • A!topic!is!visualized!as!its!high!probability!words.!! • A pedigogical!label is!used!to!idenTfy!the!topic.!

Sos  IntersecTon  and  Union  

•  We  don’t  want  topics  to  be  sets  of  words,  we  want  probability  distribu5ons  over  words  

•  In  probability  space…  

31  

Union   Mixture  

IntersecTon   Normalized  Product  
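To make the analogy concrete, here is the probabilistic version of the two set operations for two components φ1 and φ2 over a vocabulary of size V. This is a small worked restatement of the slide's table; the product formula matches the PoE definition on the next page.

```latex
\text{Union} \;\rightarrow\; \text{Mixture:}\qquad
p(x) = \lambda\,\phi_{1x} + (1-\lambda)\,\phi_{2x}, \qquad 0 \le \lambda \le 1

\text{Intersection} \;\rightarrow\; \text{Normalized Product:}\qquad
p(x) = \frac{\phi_{1x}\,\phi_{2x}}{\sum_{v=1}^{V} \phi_{1v}\,\phi_{2v}}
```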

Page 32:

Product of Experts

Product of Experts (PoE) model (Hinton, 2002)
– Another name for a normalized product
– For a subset of components, define the model as:

Intersection → Normalized Product (PoE)

[Excerpt from the accompanying write-up shown on this slide:]

A Product of Experts (PoE) [1] model is
    p(x | φ1, …, φC) = ( ∏_{c=1}^{C} φ_cx ) / ( ∑_{v=1}^{V} ∏_{c=1}^{C} φ_cv ),
where there are C components, and the summation in the denominator is over all possible feature types. Restricted to a subset 𝒞 of the components, the model is
    p(x | φ1, …, φC) = ( ∏_{c∈𝒞} φ_cx ) / ( ∑_{v=1}^{V} ∏_{c∈𝒞} φ_cv ).

Latent Dirichlet allocation generative process
  For each topic k ∈ {1, …, K}:
    φ_k ∼ Dir(β)           [draw distribution over words]
  For each document m ∈ {1, …, M}:
    θ_m ∼ Dir(α)           [draw distribution over topics]
    For each word n ∈ {1, …, N_m}:
      z_mn ∼ Mult(1, θ_m)  [draw topic]
      x_mn ∼ φ_{z_mn}       [draw word]

The finite IBP (Beta-Bernoulli) model generative process
  For each component c ∈ {1, …, C}:   [columns]
    π_c ∼ Beta(γ/C, 1)     [draw probability of component c]
  For each topic k ∈ {1, …, K}:        [rows]
    b_kc ∼ Bernoulli(π_c)  [draw whether topic k includes the c-th component in its PoE]

Shared Components Topic Models: generative process
  We can now present the formal generative process for the SCTM. For each of the C shared components, we generate a distribution φ_c over the V words from a Dirichlet parametrized by β. Next, we generate a K × C binary matrix using the finite IBP prior: we select the probability π_c of each component c being on (b_kc = 1) from a Beta distribution parametrized by γ/C, and then sample the K topics (rows of the matrix), which combine component distributions, where each position b_kc is drawn from a Bernoulli parameterized by π_c. These components and the…
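As a concrete illustration of the normalized product, the sketch below builds one topic from a chosen subset of component distributions. It is an illustrative numpy snippet (the component values and the selection vector b_k are made up), not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
C, V = 5, 1000                                   # components and vocabulary size (illustrative)
phi = rng.dirichlet(np.ones(V), size=C)          # component distributions, each phi_c ~ Dir(beta)

def poe_topic(phi, b_k):
    """Normalized product (PoE) of the components selected by the binary vector b_k."""
    # Work in log space for numerical stability, then renormalize over the vocabulary.
    log_prod = (b_k[:, None] * np.log(phi)).sum(axis=0)   # sum_c b_kc * log phi_cv
    log_prod -= log_prod.max()
    topic = np.exp(log_prod)
    return topic / topic.sum()

b_k = np.array([1, 1, 0, 0, 0])                  # e.g. intersect components 1 and 2 into one topic
topic = poe_topic(phi, b_k)
assert np.isclose(topic.sum(), 1.0)
```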

Page 33:

Our Model

Shared Components Topic Model (SCTM):
– Generate a pool of "components" (proto-topics)
– Assemble each topic from some of the components
  • Multiply and renormalize ("product of experts")
– Documents are mixtures of topics (just like LDA)

1. So topics are not independent!
2. Fewer parameters

Page 34:

Our Model

(Same content as Page 33.)

Page 35:

Learning the Structure of Topics

How do we decide which subset of components combine to form a single topic?

b1
ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}

Page 36:

Learning the Structure of Topics

How do we decide which subset of components combine to form a single topic?

b1
ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
π1   π2   π3   π4   π5

Page 37:

Learning the Structure of Topics

How do we decide which subset of components combine to form a single topic?

b1c ~ Bernoulli(πc)

b1
ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
π1   π2   π3   π4   π5

Pages 38-41: identical to Page 37.

Page 42:

Learning the Structure of Topics

How do we decide which subset of components combine to form a single topic?

b1c ~ Bernoulli(πc)

b1 {Canadian gov.}
ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
π1   π2   π3   π4   π5

Page 43:

Learning the Structure of Topics

How do we decide which subset of components combine to form a single topic?

bkc ~ Bernoulli(πc)

b1 {Canadian gov.}   b2 {government}
ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
π1   π2   π3   π4   π5

Page 44:

Learning the Structure of Topics

How do we decide which subset of components combine to form a single topic?

bkc ~ Bernoulli(πc)

b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
π1   π2   π3   π4   π5

Page 45:

Learning the Structure of Topics

How do we decide which subset of components combine to form a single topic?

πc ~ Beta(γ/C, 1)
bkc ~ Bernoulli(πc)

b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
π1   π2   π3   π4   π5

Page 46:

Learning the Structure of Topics

How do we decide which subset of components combine to form a single topic?

Beta-Bernoulli model
– The finite version of the Indian Buffet Process (Griffiths & Ghahramani, 2006)
– Prior over K × C binary matrices

Page 47:

Learning the Structure of Topics

How do we decide which subset of components combine to form a single topic?

Beta-Bernoulli model
– The finite version of the Indian Buffet Process (Griffiths & Ghahramani, 2006)
– Prior over K × C binary matrices
– We can stack the binary vectors to form a matrix (a sampling sketch follows below):

[K × C binary matrix: rows b1 {Canadian gov.}, b2 {government}, b3 {hockey}, b4 {U.S. gov.}, b5 {baseball}, b6 {Japan}; columns ϕ1 … ϕ5]
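The following minimal numpy sketch samples a K × C binary matrix under this finite Beta-Bernoulli (finite IBP) prior. The sizes and the value of γ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
K, C, gamma = 6, 5, 2.0                    # topics, components, IBP concentration (assumed)

# pi_c ~ Beta(gamma/C, 1): prior probability that any topic switches component c on
pi = rng.beta(gamma / C, 1.0, size=C)

# b_kc ~ Bernoulli(pi_c): one row per topic, one column per component
B = (rng.random((K, C)) < pi).astype(int)
print(B)   # row k lists which components feed topic k's product of experts
```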

Page 48:

Our Model

Shared Components Topic Model (SCTM):
– Generate a pool of "components" (proto-topics)
– Assemble each topic from some of the components
  • Multiply and renormalize ("product of experts")
– Documents are mixtures of topics (just like LDA)

1. So topics are not independent!
2. Fewer parameters

Page 49:

Our Model (SCTM)

How do we generate the components?

Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}

Page 50:

Our Model (SCTM)

How do we generate the components?

Dirichlet(β)   [prior over each component ϕc]
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}

Page 51:

Our Model (SCTM)

Dirichlet(β)
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}

Page 52:

Our Model (SCTM)

Dirichlet(β)
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any…

θ1 =   θ2 =   θ3 =   Dirichlet(α)

Page 53:

SCTM

Dirichlet(β)
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
π1   π2   π3   π4   π5   Beta(γ/C, 1)

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any…

θ1 =   θ2 =   θ3 =   Dirichlet(α)

Page 54:

SCTM

Dirichlet(β)
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
π1   π2   π3   π4   π5   Beta(γ/C, 1)

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any…

θ1 =   θ2 =   θ3 =   Dirichlet(α)

Distributions over topics (docs)
Distributions over words (topics)
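Putting the pieces from the last few slides together, a minimal end-to-end sketch of the SCTM generative story (components, binary topic structure, PoE topics, then documents) might look as follows. It reuses the snippets above with illustrative sizes and hyperparameter values; it is a sketch of the generative process described in the paper, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
C, K, V, M, N_m = 5, 6, 1000, 3, 50        # components, topics, vocab, docs, words/doc (illustrative)
alpha, beta, gamma = 0.1, 1.0, 2.0         # hyperparameters (assumed values)

phi = rng.dirichlet(np.full(V, beta), size=C)     # components: phi_c ~ Dir(beta)
pi = rng.beta(gamma / C, 1.0, size=C)             # pi_c ~ Beta(gamma/C, 1)
B = (rng.random((K, C)) < pi).astype(int)         # b_kc ~ Bernoulli(pi_c)

# Topics: normalized product (PoE) of each topic's selected components.
# (An all-zero row of B simply yields a uniform topic.)
log_topics = B @ np.log(phi)                      # shape (K, V)
topics = np.exp(log_topics - log_topics.max(axis=1, keepdims=True))
topics /= topics.sum(axis=1, keepdims=True)

docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))    # document's mixture over topics (as in LDA)
    z_m = rng.choice(K, size=N_m, p=theta_m)
    x_m = np.array([rng.choice(V, p=topics[z]) for z in z_m])
    docs.append(x_m)
```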

Page 55:

Contrast of LDA Extensions

Topic Model = Distributions over topics (docs) + Distributions over words (topics)

• (Blei et al., 2004)
• (Rosen-Zvi et al., 2004)
• (Teh et al., 2004)
• (Blei & Lafferty, 2006)
• (Li & McCallum, 2006)
• (Mimno et al., 2007)
• (Boyd-Graber & Blei, 2009)
• (Williamson et al., 2010)
• (Paul & Girju, 2010)
• (Paisley et al., 2011)
• (Kim & Sudderth, 2011)

Page 56:

Contrast of LDA Extensions

Topic Model = Distributions over topics (docs) + Distributions over words (topics)

• Hierarchical LDA (hLDA)
• Author-Topic Model
• HDP mixture model
• Correlated Topic Models (CTM)
• Pachinko Allocation Model (PAM)
• Hierarchical PAM (hPAM)
• Syntactic Topic Models
• Focused Topic Models
• 2D Topic-Aspect Model
• DILN for mixed-membership modeling
• Doubly Correlated Nonparametric TM

Page 57:

Correlated Topics

• Correlated Topics
  – Correlated Topic Models (CTM)
  – Pachinko Allocation Model (PAM)
  – Hierarchical LDA (hLDA)
  – Hierarchical PAM (hPAM)
• Key difference from SCTM: correlation is limited to topics that appear together in the same document
  – Example: {hockey} and {baseball} topics share many words in common, but never appear in the same document
• The spirit of learning relationships between topics is very similar!

Page 58:

Our Model (SCTM)

Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any…

θ1 =   θ2 =   θ3 =

Page 59:

Correlated Topics

• Correlated Topics
  – Correlated Topic Models (CTM)
  – Pachinko Allocation Model (PAM)
  – Hierarchical LDA (hLDA)
  – Hierarchical PAM (hPAM)
• Key difference from SCTM: correlation is limited to topics that appear together in the same document
  – Example: {hockey} and {baseball} topics share many words in common, but never appear in the same document
• The spirit of learning relationships between topics is very similar!

b3 {hockey}   ϕ3 {sports}   b5 {baseball}

Page 60:

Contrast of LDA Extensions

Topic Model = Distributions over topics (docs) + Distributions over words (topics)

Distributions over topics (docs):
• Hierarchical LDA (hLDA)
• Author-Topic Model
• HDP mixture model
• Correlated Topic Models (CTM)
• Pachinko Allocation Model (PAM)
• Hierarchical PAM (hPAM)
• Syntactic Topic Models
• Focused Topic Models
• 2D Topic-Aspect Model
• DILN for mixed-membership modeling
• Doubly Correlated Nonparametric TM

Distributions over words (topics):
• (Wallach et al., 2009)
• (Reisinger et al., 2010)
• (Wang & Blei, 2009)
• (Eisenstein et al., 2011)

Page 61:

Contrast of LDA Extensions

Same two lists as Page 60; the word-side extensions are now listed by name:
• Asymmetric Dirichlet prior
• Spherical Topic Models
• Sparse Topic Models
• SAGE for topic modeling

Page 62:

Contrast of LDA Extensions

As on Page 61, plus:
• Shared Components Topic Models (this work)

Page 63:

Contrast of LDA Extensions

Topic Model = Distributions over topics (docs) + Distributions over words (topics)

Distributions over words (topics):
• Asymmetric Dirichlet prior
• Spherical Topic Models
• Sparse Topic Models
• SAGE for topic modeling
• Shared Components Topic Models (this work)

Page 64:

Comparison of a few Topic Models

Columns: Dependently Generated Topics | Fewer Parameters | Description

LDA (Blei et al., 2003)
Asymmetric Dirichlet Prior (Wallach et al., 2009): all topics drawn from a language-specific base distribution
Spherical Topic Model (Reisinger et al., 2010)
SparseTM (Wang & Blei, 2009): each topic is sparse
SAGE (Eisenstein et al., 2011)

Page 65:

Comparison of a few Topic Models

The same table, with one more row:

SCTM (this paper): topics are products of a shared pool of components
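The "Fewer Parameters" column can be made concrete with a back-of-the-envelope count. The accounting below is our own gloss, not a formula from the slides: LDA stores one multinomial over the vocabulary per topic, while SCTM stores one multinomial per component plus the K × C binary structure matrix.

```latex
\underbrace{K \cdot V}_{\text{LDA: one row of } \phi \text{ per topic}}
\qquad \text{vs.} \qquad
\underbrace{C \cdot V + K \cdot C}_{\text{SCTM: } C \text{ components plus the binary matrix } B}
```

For example, with a hypothetical V = 10,000, K = 100 topics, and C = 20 components, LDA needs roughly 1,000,000 values while SCTM needs roughly 202,000, which is the kind of gap the "number of model parameters" plots later in the deck illustrate.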

Page 66:

Parameter Estimation

• Goal: infer values for model parameters
• Monte Carlo EM (MCEM) algorithm, where the M-step minimizes a Contrastive Divergence (CD) objective

πc   θm =   ϕc {Canada}

Page 67:

Parameter Estimation

Dirichlet(β)
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
π1   π2   π3   π4   π5   Beta(γ)

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any…

θ1 =   θ2 =   θ3 =   Dirichlet(α)

Page 68:

Parameter Estimation

Dirichlet(β)
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}
Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
π1   π2   π3   π4   π5   Beta(γ)   [integrated out]

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any…

θ1 =   θ2 =   θ3 =   Dirichlet(α)   [integrated out]

Page 69:

Parameter Estimation

Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}

The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…
In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…
The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any baseball…

Page 70:

Parameter Estimation

Components: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
Topics: b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}

Per-word topic assignments: z11 … z16 (doc 1), z21 … z24 (doc 2), z31 … z34 (doc 3)

Page 71:

Parameter Estimation

Model parameters: ϕ1 {Canada}   ϕ2 {government}   ϕ3 {sports}   ϕ4 {U.S.}   ϕ5 {Japan}
Latent variables: topic assignments z11 … z34 and topic structure b1 {Canadian gov.}   b2 {government}   b3 {hockey}   b4 {U.S. gov.}   b5 {baseball}   b6 {Japan}

Page 72:

Parameter Estimation

(Same diagram as Page 71.)

Standard M-step: Maximize likelihood of ϕc conditioned on zmn and bkc
Standard E-step: Compute expectations of zmn and bkc conditioned on ϕc

Page 73:

Parameter Estimation

(Same diagram as Page 71.)

Standard M-step: Maximize likelihood of ϕc conditioned on zmn and bkc
Monte-Carlo E-step: Sample zmn and bkc conditioned on ϕc

Page 74:

Parameter Estimation

(Same diagram as Page 71.)

CD M-step: Minimize contrastive divergence of ϕc conditioned on zmn and bkc
Monte-Carlo E-step: Sample zmn and bkc conditioned on ϕc

Page 75:

Parameter Estimation

Monte-Carlo E-step / CD M-step (Algorithm 1 below). We follow Hinton (2002).

[Excerpt from the accompanying write-up shown on this slide:]

The Shared Components Topic Model generative process
  For each component c ∈ {1, …, C}:
    φ_c ∼ Dir(β)           [draw distribution over words]
    π_c ∼ Beta(γ/C, 1)     [draw probability of component c]
  For each topic k ∈ {1, …, K}:
    b_kc ∼ Bernoulli(π_c)  [draw whether topic k includes the c-th component in its PoE]
  For each document m ∈ {1, …, M}:
    θ_m ∼ Dir(α)           [draw distribution over topics]
    For each word n ∈ {1, …, N_m}:
      z_mn ∼ Mult(1, θ_m)  [draw topic]
      x_mn ∼ p(· | z_mn, b_{z_mn}, φ)   [draw word]
  where
    p(x | z, b_z, φ) = ( ∏_{c=1}^{C} φ_cx^{b_zc} ) / ( ∑_{v=1}^{V} ∏_{c=1}^{C} φ_cv^{b_zc} )

Algorithm 1: SCTM Training
  Initialize parameters: ξ_c, b_kc, z_i.
  while not converged do
    {E-step:}
    for j = 1 to J do          {draw the j-th sample {Z, B}^(j)}
      for i = 1 to N do: sample z_i
      for k = 1 to K do
        for c = 1 to C do: sample b_kc
    {M-step:}
    for c = 1 to C do
      for v = 1 to V do        {single gradient step over ξ}
        ξ_cv^(t+1) = ξ_cv^(t) − η · d CD({Z, B}) / dξ_cv

Contrastive Divergence
  Below, we provide the approximate derivative of the contrastive divergence objective, where Z and B are treated as fixed:¹
    d CD({Z, B}) / dξ_cv ≈ − ⟨ d log f(x | b_z, φ) / dξ_cv ⟩_{Q0} + ⟨ d log f(x | b_z, φ) / dξ_cv ⟩_{Q1_ξ}
  where f(x | b_z, φ) = ∏_{c=1}^{C} φ_cx^{b_zc} is the numerator of p(x | b_z, φ) and the derivative of its log is efficient to compute:
    d log f(x | b_z, φ) / dξ_cv = b_zc (1 − φ_cv)   for x = v
                                 = −b_zc φ_cv        for x ≠ v

  ¹ The derivative is approximate because we drop the term −(d Q1_ξ / dξ) · (d(Q1_ξ ‖ Q∞_ξ) / d Q1_ξ), which is "problematic to compute" [2]. This is the standard use of CD.

References
  [1] Geoffrey Hinton. Products of experts. In International Conference on Artificial Neural Networks (ICANN), 1999.
  [2] Geoffrey Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
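For intuition, here is a tiny numpy sketch of the closed-form gradient above, computing d log f(x | b_z, φ)/dξ_cv for every (c, v) at one observed word x. The write-up does not spell out the ξ–φ relationship; the comment assumes ξ_cv are per-component softmax logits for φ_c (that is the parameterization for which this closed form is the exact gradient), so treat this as a sketch rather than the authors' update rule.

```python
import numpy as np

def dlogf_dxi(x, b_z, phi):
    """d log f(x | b_z, phi) / d xi_cv for every component c and word v.

    Matches the closed form in the write-up:
      b_zc * (1 - phi_cv)  if v == x
      -b_zc * phi_cv       otherwise
    (This is the gradient if xi_cv are softmax logits for phi_c, which is an assumption here.)
    """
    grad = -b_z[:, None] * phi               # the v != x case, shape (C, V)
    grad[:, x] = b_z * (1.0 - phi[:, x])     # overwrite the v == x column
    return grad

# Tiny illustrative example
rng = np.random.default_rng(4)
phi = rng.dirichlet(np.ones(8), size=3)      # 3 components over an 8-word vocabulary
b_z = np.array([1, 0, 1])                    # this word's topic uses components 1 and 3
print(dlogf_dxi(x=2, b_z=b_z, phi=phi).shape)   # (3, 8)
```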

Page 76:

Parameter Estimation

• Goal: infer values for model parameters
• Monte Carlo EM (MCEM) algorithm, where the M-step minimizes a Contrastive Divergence (CD) objective

πc   θm =   ϕc {Canada}

Page 77:

Experiments: Topic Modeling

• Experiments:
  – Can SCTM combine a fixed number of components (multinomials) into topics to achieve lower perplexity?
  – Does SCTM achieve lower perplexity than LDA with a more compact model?
• Analysis:
  – What are the learned topics like?
  – What are the learned components like?
  – What topic structure is learned?

Page 78:

Experiments: Topic Modeling

Experimental Setup:
– Datasets:
  • 1,000 random articles from 20 Newsgroups
  • 1,617 NIPS abstracts
– Evaluation:
  • left-to-right average perplexity on held-out data (see the sketch after this list)
– Models:
  • LDA trained with a collapsed Gibbs sampler
    – In LDA, components and topics are in a one-to-one relationship (i.e. a special case of the SCTM where each topic is comprised of only its corresponding component)
  • SCTM with parameter estimation as described
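For reference, held-out perplexity is the exponentiated negative average log-likelihood of the held-out words; the left-to-right algorithm mentioned above is one way to estimate the per-document likelihoods. The definition below is our own gloss of what the plots report, not a formula from the slides.

```latex
\mathrm{perplexity}(\mathcal{D}_{\text{held-out}})
  = \exp\!\left( - \frac{\sum_{m} \log p(\mathbf{x}_m)}{\sum_{m} N_m} \right)
```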

Page 79:

Experiments: Topic Modeling

(Same outline of experiments and analysis questions as Page 77.)

Page 80:

Experiments: Topic Modeling

[20News: plot of held-out perplexity (roughly 800-1800) vs. number of components (0-100) for LDA.]

Page 81:

Experiments: Topic Modeling

[20News: the same plot with SCTM added, where the number of components equals the number of topics; point labels show the number of topics.]

Page 82:

Experiments: Topic Modeling

[20News: perplexity vs. number of components for LDA and SCTM; point labels show the number of topics.]

Page 83:

Experiments: Topic Modeling

[20News: perplexity vs. number of components for LDA and SCTM, with more SCTM settings added; point labels show the number of topics.]

Page 84:

Experiments: Topic Modeling

[20News: perplexity vs. number of components for LDA and SCTM, with more SCTM settings added; point labels show the number of topics.]

Page 85:

Experiments: Topic Modeling

[20News: perplexity vs. number of components for LDA and SCTM, with more SCTM settings added; point labels show the number of topics.]

Page 86:

Experiments: Topic Modeling

[NIPS: plot of held-out perplexity (roughly 300-700) vs. number of components (0-100) for LDA.]

Page 87:

Experiments: Topic Modeling

[NIPS: the same plot with SCTM added; point labels show the number of topics.]

Page 88:

Experiments: Topic Modeling

(Same outline of experiments and analysis questions as Page 77.)

Page 89: jason/papers/gormley+al.naacl12.slides.pdfLDA!for!Topic!Modeling! • A!topic!is!visualized!as!its!high!probability!words.!! • A pedigogical!label is!used!to!idenTfy!the!topic.!

Experiments:  Topic  Modeling  

89  

800

1000

1200

1400

●●

10

100120

140

20

40

6080

0 100 200 300 400 500 600# of Model Parameters (thousands)

Perp

lexi

ty

●● LDA

20News  

Labels for LDA show # topics. Labels for SCTM show # components, # topics

Page 90: jason/papers/gormley+al.naacl12.slides.pdfLDA!for!Topic!Modeling! • A!topic!is!visualized!as!its!high!probability!words.!! • A pedigogical!label is!used!to!idenTfy!the!topic.!

Experiments:  Topic  Modeling  

90  

[Plot: held-out perplexity vs. # of model parameters (thousands) on 20News for LDA and SCTM. Labels for LDA show # topics; labels for SCTM show # components, # topics.]

Page 91

Experiments:  Topic  Modeling  

91  

[Plot: held-out perplexity vs. # of model parameters (thousands) on NIPS for LDA.]

Page 92

Experiments:  Topic  Modeling  

92  

[Plot: held-out perplexity vs. # of model parameters (thousands) on NIPS for LDA and SCTM. Labels for LDA show # topics; labels for SCTM show # components, # topics.]

Page 93

Experiments:  Topic  Modeling  

•  Experiments:
  –  Can SCTM combine a fixed number of components (multinomials) into topics to achieve lower perplexity?
  –  Does SCTM achieve lower perplexity than LDA with a more compact model?
•  Analysis:
  –  What are the learned topics like?
  –  What are the learned components like?
  –  What topic structure is learned?

93  

Page 94

What  does  SCTM  learn?  

94  

Figure 2: SCTM binary matrix and topics from 3599 training documents of 20NEWS for C = 10, K = 20. Blue squares are "on" (equal to 1).

[Image: the 20 × 10 binary matrix (rows = topics k, columns = components c); blue squares mark which components each topic uses.]

k     αk      Top words for topic (rows k=1–5 also list the top words after ablating component c=1)
1     0.306   subject organization israel return define law org organization subject israel law peace define israeli
2     0.031   encryption chip clipper keys des escrow security law administration president year market money senior
3     0.025   turkish armenian armenians war turkey turks armenia years food center year air russian war army
4     0.102   drive card disk scsi hard controller mac drives opinions drive hard power support cost research price
5     0.071   image jpeg window display code gif color mit pitt file program year center programs image division
6     0.018   jews israeli jewish arab peace land war arabs
7     0.074   org money back question years thing things point
8     0.106   christian bible church question christ christians life
9     0.011   administration president year market money senior
10    0.055   health medical center research information april
11    0.063   gun law state guns control bill rights states
12    0.160   world organization system israel state usa cwru reply
13    0.042   space nasa gov launch power wire ground air
14    0.038   space nasa gov launch power wire ground air
15    0.079   team game year play games season players hockey
16    0.158   car lines dod bike good uiuc sun cars
17    0.136   windows file government key jesus system program
18    0.122   article writes center page harvard virginia research
19    0.017   max output access digex int entry col line
20    0.380   lines people don university posting host nntp time

[Plots: Figure 3, panels (a)-(d): held-out perplexity vs. # of components and vs. # of model parameters (thousands) for LDA and SCTM on 20NEWS and NIPS.]

Figure 3: Perplexity results on held-out data for 20NEWS (b) and NIPS (c) showing the results of LDA and the SCTM for the same number of components and varying K (SCTM). For the same number of components (multinomials), the SCTM achieves lower perplexity by combining them into more topics. Results for 20NEWS (a) and NIPS (d) showing non-square SCTM achieves lower perplexity than LDA with a more compact model.

Moreover, this suggests that our products (PoEs) provide additional and complementary expressivity over just mixtures of topics.

Model Compactness   Including an additional topic in the SCTM adds only C binary parameters, for an extra row in the matrix, whereas in LDA an additional topic requires V (the size of the vocabulary) additional parameters to represent the multinomial. In both cases, the number of document-specific parameters must increase as well. Figures 3a and 3d present held-out perplexity vs. number of model parameters on 20NEWS and NIPS, excluding the case of square (C = K) binary matrices for the SCTM. The regions show a confidence interval (p = 0.05) around the smoothed fit to the data; LDA labels show C, and SCTM labels show C, K. The SCTM achieves lower perplexity with fewer model parameters, even when the increase in non-component parameters is taken into account. We expect that because of its smaller size the SCTM exhibits lower sample complexity, allowing for better generalization to unseen data.
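The compactness argument is just arithmetic on parameter counts. A toy illustration (the vocabulary size and the C, K values below are made up for the example, and document-specific parameters are ignored, as the paragraph above notes):

# Topic-side parameter counts only (document-specific parameters are ignored here).
def lda_params(K, V):
    return K * V                  # K topic multinomials over a vocabulary of size V

def sctm_params(C, K, V):
    return C * V + K * C          # C component multinomials plus a K x C binary matrix

V = 10_000                        # assumed vocabulary size (illustrative)
print(lda_params(K=80, V=V))                             # 800000
print(sctm_params(C=20, K=80, V=V))                      # 201600 = 200000 reals + 1600 binary entries

# Adding one more topic: +V parameters for LDA, but only +C binary entries for SCTM.
print(lda_params(81, V) - lda_params(80, V))             # 10000
print(sctm_params(20, 81, V) - sctm_params(20, 80, V))   # 20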

5.2 Analysis

Figure 2 gives the binary matrix and topics learned on a larger section of 20NEWS training documents. These topics evidence that the SCTM is able to achieve a diversity of topics by combining various subsets of components, and we expect that the low perplexity achieved by the SCTM can be attributed …

20News  

Page 95

What  does  SCTM  learn?  

95  

[Figure 2, Figure 3, and the accompanying paper excerpt repeated from slide 94.]

Page 96

What  does  SCTM  learn?  

96  

[Figure 2, Figure 3, and the accompanying paper excerpt repeated from slide 94.]

Page 97

Experiments:  Topic  Modeling  

•  Experiments:
  –  Can SCTM combine a fixed number of components (multinomials) into topics to achieve lower perplexity?
  –  Does SCTM achieve lower perplexity than LDA with a more compact model?
•  Analysis:
  –  What are the learned topics like?
  –  What are the learned components like?
  –  What topic structure is learned?

97  

Page 98

SCTM:  Hasse  Diagram  over  Topics  

98  

[Figure 4 (Hasse diagram) node contents, flattened:]

Topic nodes (k, αk, top words):
k=1   αk=0.11  model learning system information parameters networks robust kalman rules estimation
k=2   αk=0.13  network input information time recurrent backpropagation units architecture forward layer
k=3   αk=0.06  object recognition system objects information visual matching problem based classification
k=4   αk=0.12  bayesian results show estimation method based parameters likelihood methods models
k=5   αk=0.04  object recognition system objects information visual matching problem based classification
k=6   αk=0.23  neural network paper recognition speech systems based results performance artificial
k=7   αk=0.08  data paper networks network output feature features patterns set train introduced unit functions
k=8   αk=0.23  algorithm training error function method performance input classification classifier
k=9   αk=0.02  vector feature classification support vectors kernel regression weight inputs dimensionality
k=10  αk=0.09  neural neurons analog synaptic neuron networks memory time capacity model associative noise dynamics
k=11  αk=0.08  learning networks system recognition time network describes hand context views classification
k=12  αk=0.13  problem state control reinforcement problems models time based decision markov systems function
k=13  αk=0.05  networks network learning distributed system weight vectors property binary point optimal real
k=14  αk=0.07  models images image problem structure analysis mixture clustering approach show computational
k=15  αk=0.12  cells neurons visual cortex motion response processing spatial cell properties patterns spike
k=16  αk=0.11  training units paper hidden number output problem rule set order unit show present method weights task
k=17  αk=0.10  number functions weights function layer generalization error results loss linear size
k=18  αk=0.07  information analysis component rules signal independent representations noise basis
k=19  αk=0.03  system networks set neurons visual phase feature processing features output associative
k=20  αk=0.02  time network weights activation delay current chaotic connected discrete connections

Unrepresented components (shaded box):
c=1   model information parameters kalman robust matrices likelihood experimentally
c=2   network networks data learning optimal linear vector independent binary natural algorithms pca
c=4   paper units output layer networks patterns unit pattern set rule network rules weights training
c=9   visual image images cells cortex scene support spatial feature vision cues stimulus statistics

Figure 4: Hasse diagram on NIPS for C = 10, K = 20 showing the top words for topics and unrepresented components (in shaded box). Notice that some topics only consist of a single component. The shaded box contains the components that didn't appear as a topic. For the sake of clarity, we only show arrows for the subsumption relationships between the topics, and we omit the implicit arrows between the components in the shaded box and the topics.

…to the high level of component re-use across topics. Topics are typically interpreted by looking at the top-N words, whereas the top-N words of a component often do not even appear in the topics to which it contributes. Instead, we find that a component's contribution to a topic is typically through vetoing words. For example, the top words of component c=1, corresponding to the first column of the binary matrix in figure 2, are [subject organization posting apple mit screen write window video port], yet only a few of these appear in topics k=1,2,3,4,5, which use it.

On the right of figure 2, we show what the topics become when we ablate component c=1 from the matrix by setting the column to all zeros. Topic k=2 changes from being about information security to general politics and is identical to k=9. Topic k=3 changes from the Turkish-Armenian War to a more general war topic. Topic k=4 changes to a less focused version of itself. In this way, we can gain further insight into the contribution of this component, and the way in which components tend to increase the specificity of a topic to which they are added.
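The ablation described above is easy to reproduce in a toy setting: build each topic as the renormalized product of the components its row of the binary matrix selects, then zero out a column and rebuild. The sketch below is illustrative only; the random toy components, the array shapes, and the ablated column index are assumptions, not the learned model from Figure 2:

import numpy as np

def topics_from_components(phi, b):
    """phi: (C, V) component multinomials; b: (K, C) binary matrix.
    Topic k is the renormalized product of the components with b[k, c] = 1."""
    log_topics = b @ np.log(phi)                       # sum of logs = log of product
    log_topics -= log_topics.max(axis=1, keepdims=True)
    topics = np.exp(log_topics)
    return topics / topics.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
C, K, V = 10, 20, 50                                   # toy sizes
phi = rng.dirichlet(np.ones(V), size=C)                # toy components
b = (rng.random((K, C)) < 0.3).astype(float)
b[:, 0] = 1.0                                          # let every topic use component 0

topics = topics_from_components(phi, b)

b_ablated = b.copy()
b_ablated[:, 0] = 0.0                                  # ablate component 0: zero its column
topics_ablated = topics_from_components(phi, b_ablated)

# Top "words" of topic 1 before and after the ablation.
print(np.argsort(-topics[1])[:8])
print(np.argsort(-topics_ablated[1])[:8])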

The SCTM learns each topic as a soft intersection of its components, as represented by the binary matrix. We can describe the overlap between topics based on the components that they have in common. One topic subsumes another topic when the parent consists of a subset of the child's components. In this way, the binary matrix defines a Hasse diagram, a directed acyclic graph describing all the subsumption relationships between topics. Figure 4 shows such a Hasse diagram on the NIPS data. Several topics consist of only a single component, such as k=12 on reinforcement learning and k=8 on optimization. These two topics combine with the component c=1 so that their overlap forms the topic k=4 on Bayesian methods. These subsumption relationships are different from and complementary to hLDA (see §4), which models topic co-occurrence, not component intersection. For example, topic k=10 on connectionism and k=2 on neural networks intersect to form k=20, which contains words that would only appear in both of its subsuming topics, thereby explicitly modeling topic overlap.
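Since subsumption here is just a subset test on rows of the binary matrix, the diagram's relationships can be read off directly. A small illustrative sketch (the tiny matrix is made up; a proper Hasse diagram would additionally drop edges implied by transitivity, keeping only the covering relations):

import numpy as np

def subsumption_pairs(b):
    """b: (K, C) binary matrix. Topic j subsumes topic k when j's component set
    is a proper subset of k's component set."""
    pairs = []
    for j in range(b.shape[0]):
        for k in range(b.shape[0]):
            if j != k and np.all(b[j] <= b[k]) and b[j].sum() < b[k].sum():
                pairs.append((j, k))
    return pairs

b = np.array([[1, 0, 0],      # topic 0 uses component {0}
              [1, 1, 0],      # topic 1 uses components {0, 1}
              [1, 1, 1]])     # topic 2 uses components {0, 1, 2}
print(subsumption_pairs(b))   # [(0, 1), (0, 2), (1, 2)]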

NIPS  

Page 99

SCTM:  Hasse  Diagram  over  Topics  

99  

[Figure 4 Hasse diagram, its caption, and the accompanying paper excerpt repeated from slide 98.]

Page 100

SCTM:  Hasse  Diagram  over  Topics  

100  

[Figure 4 Hasse diagram, its caption, and the accompanying paper excerpt repeated from slide 98.]

Page 101

Experiments:  Topic  Modeling  

•  Experiments:
  –  For the same number of components (multinomials), SCTM achieves lower perplexity than LDA
  –  Non-square SCTM achieves lower perplexity than LDA with a more compact model
•  Analysis:
  –  SCTM learns diverse LDA-like topics
  –  Components are usually only interpretable when they also appear as a topic
  –  SCTM learns an implicit Hasse diagram defining subsumption relationships between topics

101  

Page 102

Summary  

Shared Components Topic Model (SCTM):
1.  Generate a pool of "components" (proto-topics)
2.  Assemble each topic from some of the components
    •  Multiply and renormalize ("product of experts")
3.  Documents are mixtures of topics (just like LDA)
  –  So the word lists of two topics are not generated independently!
  –  Fewer parameters
(A toy sketch of this generative recipe follows below.)
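To make the three steps concrete, here is a toy end-to-end sketch of the generative recipe in NumPy. All sizes, priors, and variable names are illustrative assumptions rather than the paper's specification; it only shows how components combine into topics and how documents are then drawn as in LDA:

import numpy as np

rng = np.random.default_rng(0)
C, K, V, D, N = 6, 10, 40, 5, 30          # components, topics, vocab size, docs, words per doc

# 1. A pool of "components" (proto-topics): distributions over the vocabulary.
phi = rng.dirichlet(np.ones(V), size=C)                    # (C, V)

# 2. Assemble each topic from some components: multiply and renormalize (product of experts).
b = (rng.random((K, C)) < 0.5).astype(float)               # (K, C) binary "uses" matrix
b[b.sum(axis=1) == 0, 0] = 1.0                             # make sure no topic is empty
log_topics = b @ np.log(phi)
topics = np.exp(log_topics - log_topics.max(axis=1, keepdims=True))
topics /= topics.sum(axis=1, keepdims=True)                # (K, V) topic distributions

# 3. Documents are mixtures of topics, exactly as in LDA.
docs = []
for _ in range(D):
    theta = rng.dirichlet(np.ones(K))                      # per-document topic proportions
    z = rng.choice(K, size=N, p=theta)                     # a topic for each word slot
    words = [int(rng.choice(V, p=topics[k])) for k in z]   # a word from that topic
    docs.append(words)

print(docs[0][:10])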

102  

Page 103

Future  Work  

•  Improve inference for SCTM
•  Topics as products of components in other applications
  –  Selectional preference: components could correspond to semantic features that intersect to define semantic classes
  –  Vision: topics are classes of objects; the components could be features of those objects

103  

Page 104

Thank  you!  

Questions, comments?

104