Neural Models for Documents with Metadata
Dallas Card, Chenhao Tan, Noah A. Smith
July 18, 2018
Outline
Main points of this talk:
1. Introducing Scholar¹: a neural model for documents with metadata
   - Background (LDA, SAGE, SLDA, etc.)
   - Model and related work
   - Experiments and results
2. Power of neural variational inference for interactive modeling
¹ Sparse Contextual Hidden and Observed Language Autoencoder
Latent Dirichlet Allocation
Blei, Ng, and Jordan. Latent Dirichlet Allocation. JMLR, 2003.
David Blei. Probabilistic topic models. Comm. ACM, 2012.
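For reference, the standard (smoothed) LDA generative story that the extensions on the following slides build on; this is textbook notation added for context, not content from the slide:

\begin{align*}
\theta_i &\sim \mathrm{Dirichlet}(\alpha) && \text{(topic proportions for document } i) \\
\beta_k &\sim \mathrm{Dirichlet}(\eta) && \text{(word distribution for topic } k) \\
z_{ij} \mid \theta_i &\sim \mathrm{Categorical}(\theta_i) && \text{(topic assignment for token } j) \\
w_{ij} \mid z_{ij}, \beta &\sim \mathrm{Categorical}(\beta_{z_{ij}}) && \text{(observed word)}
\end{align*}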
Types of metadata
Date or time
Author(s)
Rating
Sentiment
Ideology
etc.
Variations and extensions
Author-topic model (Rosen-Zvi et al., 2004)
Supervised LDA (SLDA; McAuliffe and Blei, 2008)
Dirichlet-multinomial regression (Mimno and McCallum, 2008)
Sparse additive generative models (SAGE; Eisenstein et al., 2011)
Structural topic model (Roberts et al., 2014)
...
Desired features of model
Fast, scalable inference.
Easy modification by end-users.
Incorporation of metadata:
  - Covariates: features that influence the text (as in SAGE).
  - Labels: features to be predicted along with the text (as in SLDA).
Possibility of sparse topics.
Incorporation of additional prior knowledge.
→ Use the variational autoencoder (VAE) style of inference (Kingma and Welling, 2014)
Desired outcome
Coherent groupings of words (something like topics), with offsets for observed metadata
An encoder to map from documents to latent representations
A classifier to predict labels from the latent representation
Model

Generator network: p(wi | θi) = fg(θi)

Choose a prior p(θi) and an approximate posterior q(θi | words); training maximizes the evidence lower bound:

ELBO = Eq[log p(words | θi)] − DKL[q(θi | words) ‖ p(θi)]

Encoder network: q(θi | words) = fe(words)

Parameterize the topic proportions via an unnormalized vector ri with θi = softmax(ri), and estimate the expectation with S Monte Carlo samples:

ELBO ≈ (1/S) ∑s=1..S log p(words | ri(s)) − DKL[q(ri | words) ‖ p(ri)]

Reparameterization trick: sample ε(s) ∼ N(0, I) and set ri(s) = µq + ε(s) ⊙ σq, so gradients can flow through µq and σq

(Srivastava and Sutton, 2017; Miao et al., 2016)

Finally, incorporate metadata: labels yi are generated from θi, covariates ci help generate the words, and the encoder conditions on (words, ci, yi).
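To make this concrete, here is a minimal PyTorch sketch of the core VAE topic model (no metadata yet). The class and variable names are illustrative assumptions rather than the authors' code, and it uses a standard normal prior on ri for simplicity, where Srivastava and Sutton instead use a logistic-normal approximation to a Dirichlet prior:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAETopicModel(nn.Module):
    # A sketch of the model above, not the Scholar implementation.
    def __init__(self, vocab_size, n_topics, hidden_dim=200):
        super().__init__()
        # Encoder f_e: bag-of-words counts -> Gaussian parameters for r_i
        self.encoder = nn.Linear(vocab_size, hidden_dim)
        self.f_mu = nn.Linear(hidden_dim, n_topics)
        self.f_logsigma = nn.Linear(hidden_dim, n_topics)
        # Generator f_g: topic proportions -> logits over the vocabulary
        self.decoder = nn.Linear(n_topics, vocab_size)

    def forward(self, x):  # x: (batch, vocab_size) word counts
        h = F.softplus(self.encoder(x))
        mu, logsigma = self.f_mu(h), self.f_logsigma(h)
        eps = torch.randn_like(mu)          # eps ~ N(0, I)
        r = mu + eps * torch.exp(logsigma)  # reparameterization: r = mu + eps * sigma
        theta = F.softmax(r, dim=-1)        # theta_i = softmax(r_i)
        log_p = F.log_softmax(self.decoder(theta), dim=-1)
        recon = (x * log_p).sum(-1)         # E_q[log p(words | r)], with S = 1 sample
        # KL[q(r | words) || p(r)] in closed form, with p(r) = N(0, I)
        kl = 0.5 * (mu ** 2 + torch.exp(2 * logsigma) - 2 * logsigma - 1).sum(-1)
        return -(recon - kl).mean()         # negative ELBO, to be minimized

Training then reduces to ordinary stochastic gradient descent on mini-batches of documents.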
Scholar
Generator network:
p(word | θi, ci) = softmax(d + θiᵀB(topic) + ciᵀB(cov))
Optionally include interactions between topics and covariates
p(yi | θi, ci) = fy(θi, ci)
Encoder:
µi = fµ(words, ci, yi)
log σi = fσ(words, ci, yi)
Optional incorporation of word vectors to embed the input
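The generator equations above can be sketched in the same hypothetical PyTorch style; B_topic, B_cov, d, and f_y mirror the slide's notation, the topic-covariate interaction terms are omitted for brevity, and again this is an illustrative sketch rather than the released code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScholarGenerator(nn.Module):
    # p(word | theta_i, c_i) = softmax(d + theta_i^T B_topic + c_i^T B_cov)
    # p(y_i | theta_i, c_i)  = f_y(theta_i, c_i)
    def __init__(self, vocab_size, n_topics, n_covariates, n_classes):
        super().__init__()
        self.d = nn.Parameter(torch.zeros(vocab_size))  # background term d
        self.B_topic = nn.Parameter(0.01 * torch.randn(n_topics, vocab_size))
        self.B_cov = nn.Parameter(0.01 * torch.randn(n_covariates, vocab_size))
        self.f_y = nn.Linear(n_topics + n_covariates, n_classes)

    def forward(self, theta, c):
        eta = self.d + theta @ self.B_topic + c @ self.B_cov  # word logits
        log_p_words = F.log_softmax(eta, dim=-1)
        y_logits = self.f_y(torch.cat([theta, c], dim=-1))    # label prediction
        return log_p_words, y_logits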
Optimization
Stochastic optimization using mini-batches of documents
Tricks from Srivastava and Sutton, 2017:
  - Adam optimizer with a high learning rate to bypass mode collapse
  - Batch-norm layers to avoid divergence
Annealing away from the batch-norm output to keep results interpretable (see the sketch below)
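One plausible implementation of that annealing step, as a sketch; the linear schedule and the blending scheme here are assumptions, not a description of the released code:

def annealed_batchnorm(logits, bn_layer, step, total_steps):
    # Blend batch-normed logits (e.g., bn_layer = torch.nn.BatchNorm1d(vocab_size))
    # with the raw logits, shifting weight from the former to the latter so that
    # late in training the output no longer depends on batch statistics.
    w = max(0.0, 1.0 - step / total_steps)  # anneals from 1 down to 0
    return w * bn_layer(logits) + (1.0 - w) * logits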
Output of Scholar
B(topic), B(cov): coherent groupings of positive and negative deviations from the background (∼ topics)
fµ, fσ: encoder network mapping from words to topics: θi = softmax(fe(words, ci, yi, ε))
fy: classifier mapping from θi to labels: y = fy(θi, ci)
Evaluation
1. Performance as a topic model, without metadata (perplexity, coherence; see the coherence sketch below)
2. Performance as a classifier, compared to SLDA
3. Exploratory data analysis
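A common recipe for the coherence metric is average NPMI over pairs of each topic's top words; the talk does not specify the exact variant used, so this Python sketch is an assumption for illustration:

import numpy as np

def npmi_coherence(top_words, doc_word_sets):
    # Average normalized PMI over pairs of a topic's top words, with
    # probabilities estimated from document-level co-occurrence.
    n_docs = len(doc_word_sets)
    def p(*words):
        return sum(all(w in d for w in words) for d in doc_word_sets) / n_docs
    scores = []
    for i, w1 in enumerate(top_words):
        for w2 in top_words[i + 1:]:
            p12 = p(w1, w2)
            if 0 < p12 < 1:  # NPMI is undefined at the boundaries
                scores.append(np.log(p12 / (p(w1) * p(w2))) / -np.log(p12))
    return float(np.mean(scores)) if scores else 0.0

Here doc_word_sets is a list containing the set of word types in each reference document.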
Quantitative results: basic model

[Bar charts: perplexity, coherence, and topic sparsity for LDA, SAGE, NVDM, Scholar, Scholar+wv (word vectors), and Scholar+sparsity]

IMDB dataset (Maas et al., 2011)
Classification results

[Bar chart: accuracy (0.5 to 1.0) for logistic regression (LR), SLDA, Scholar (labels), and Scholar (covariates)]

IMDB dataset (Maas et al., 2011)
Exploratory Data Analysis
Data: Media Frames Corpus (Card et al., 2015)
A collection of thousands of news articles annotated in terms of tone and framing
Relevant metadata: year of publication, newspaper, etc.
Tone as a label
[Figure: inferred topics positioned by p(pro-immigration | topic), from 0 to 1]

arrested charged charges agents operation
state gov benefits arizona law bill bills
bush border president bill republicans
labor jobs workers percent study wages
asylum judge appeals deportation court
visas visa applications students citizenship
boat desert died men miles coast haitian
english language city spanish community
Tone as a covariate, with interactions
Base topics               | Anti-immigration           | Pro-immigration
ice customs agency        | criminal customs           | detainees detention
population born percent   | jobs million illegals      | english newcomers
judge case court guilty   | guilty charges man         | asylum court judge
patrol border miles       | patrol border died         | authorities desert
licenses drivers card     | foreign sept visas         | green citizenship card
island story chinese      | smuggling federal          | island school ellis
guest worker workers      | bill border house          | workers tech skilled
benefits bill welfare     | republican california law  | welfare students
Conclusions
Variational autoencoders (VAEs) provide a powerful framework for latent variable modeling
We use the VAE framework to create a customizable model for documents with metadata
We obtain comparable performance with enhanced flexibility and scalability
Code is available: www.github.com/dallascard/scholar