Neural Models for Documents with Metadata
Dallas Card, Chenhao Tan, Noah A. Smith
July 18, 2018
Outline
Main points of this talk:
1. Introducing Scholar¹: a neural model for documents with metadata
   - Background (LDA, SAGE, SLDA, etc.)
   - Model and related work
   - Experiments and results
2. Power of neural variational inference for interactive modeling
¹ Sparse Contextual Hidden and Observed Language Autoencoder
Latent Dirichlet Allocation
Blei, Ng, and Jordan. Latent Dirichlet Allocation. JMLR, 2003.
David Blei. Probabilistic topic models. Comm. ACM, 2012.
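For reference, the standard (smoothed) LDA generative story that the extensions on the following slides build on; this is textbook notation added for context, not content from the slide:

\begin{align*}
\theta_i &\sim \mathrm{Dirichlet}(\alpha) && \text{(topic proportions for document } i) \\
\beta_k &\sim \mathrm{Dirichlet}(\eta) && \text{(word distribution for topic } k) \\
z_{ij} \mid \theta_i &\sim \mathrm{Categorical}(\theta_i) && \text{(topic assignment for token } j) \\
w_{ij} \mid z_{ij}, \beta &\sim \mathrm{Categorical}(\beta_{z_{ij}}) && \text{(observed word)}
\end{align*}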
Types of metadata
Date or time
Author(s)
Rating
Sentiment
Ideology
etc.
Variations and extensions
Author-topic model (Rosen-Zvi et al., 2004)
Supervised LDA (SLDA; McAuliffe and Blei, 2008)
Dirichlet-multinomial regression (Mimno and McCallum, 2008)
Sparse additive generative models (SAGE; Eisenstein et al., 2011)
Structural topic model (Roberts et al., 2014)
...
Desired features of model
Fast, scalable inference.
Easy modification by end-users.
Incorporation of metadata:
  - Covariates: features that influence the text (as in SAGE).
  - Labels: features to be predicted along with the text (as in SLDA).
Possibility of sparse topics.
Incorporation of additional prior knowledge.
→ Use the variational autoencoder (VAE) style of inference (Kingma and Welling, 2014)
Desired outcome
Coherent groupings of words (something like topics), with offsets for observed metadata
An encoder to map from documents to latent representations
A classifier to predict labels from the latent representation
Model

Generator network: p(wi | θi) = fg(θi)

Choose a prior p(θi) and an approximate posterior q(θi | words); training maximizes the evidence lower bound:

ELBO = Eq[log p(words | θi)] − DKL[q(θi | words) ‖ p(θi)]

Encoder network: q(θi | words) = fe(words)

Parameterize the topic proportions via an unnormalized vector ri with θi = softmax(ri), and estimate the expectation with S Monte Carlo samples:

ELBO ≈ (1/S) ∑s=1..S log p(words | ri(s)) − DKL[q(ri | words) ‖ p(ri)]

Reparameterization trick: sample ε(s) ∼ N(0, I) and set ri(s) = µq + ε(s) ⊙ σq, so gradients can flow through µq and σq

(Srivastava and Sutton, 2017; Miao et al., 2016)

Finally, incorporate metadata: labels yi are generated from θi, covariates ci help generate the words, and the encoder conditions on (words, ci, yi).
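To make this concrete, here is a minimal PyTorch sketch of the core VAE topic model (no metadata yet). The class and variable names are illustrative assumptions rather than the authors' code, and it uses a standard normal prior on ri for simplicity, where Srivastava and Sutton instead use a logistic-normal approximation to a Dirichlet prior:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAETopicModel(nn.Module):
    # A sketch of the model above, not the Scholar implementation.
    def __init__(self, vocab_size, n_topics, hidden_dim=200):
        super().__init__()
        # Encoder f_e: bag-of-words counts -> Gaussian parameters for r_i
        self.encoder = nn.Linear(vocab_size, hidden_dim)
        self.f_mu = nn.Linear(hidden_dim, n_topics)
        self.f_logsigma = nn.Linear(hidden_dim, n_topics)
        # Generator f_g: topic proportions -> logits over the vocabulary
        self.decoder = nn.Linear(n_topics, vocab_size)

    def forward(self, x):  # x: (batch, vocab_size) word counts
        h = F.softplus(self.encoder(x))
        mu, logsigma = self.f_mu(h), self.f_logsigma(h)
        eps = torch.randn_like(mu)          # eps ~ N(0, I)
        r = mu + eps * torch.exp(logsigma)  # reparameterization: r = mu + eps * sigma
        theta = F.softmax(r, dim=-1)        # theta_i = softmax(r_i)
        log_p = F.log_softmax(self.decoder(theta), dim=-1)
        recon = (x * log_p).sum(-1)         # E_q[log p(words | r)], with S = 1 sample
        # KL[q(r | words) || p(r)] in closed form, with p(r) = N(0, I)
        kl = 0.5 * (mu ** 2 + torch.exp(2 * logsigma) - 2 * logsigma - 1).sum(-1)
        return -(recon - kl).mean()         # negative ELBO, to be minimized

Training then reduces to ordinary stochastic gradient descent on mini-batches of documents.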
Scholar
Generator network:
p(word | θi, ci) = softmax(d + θiᵀB(topic) + ciᵀB(cov))
Optionally include interactions between topics and covariates
p(yi | θi, ci) = fy(θi, ci)
Encoder:
µi = fµ(words, ci, yi)
log σi = fσ(words, ci, yi)
Optional incorporation of word vectors to embed the input
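The generator equations above can be sketched in the same hypothetical PyTorch style; B_topic, B_cov, d, and f_y mirror the slide's notation, the topic-covariate interaction terms are omitted for brevity, and again this is an illustrative sketch rather than the released code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScholarGenerator(nn.Module):
    # p(word | theta_i, c_i) = softmax(d + theta_i^T B_topic + c_i^T B_cov)
    # p(y_i | theta_i, c_i)  = f_y(theta_i, c_i)
    def __init__(self, vocab_size, n_topics, n_covariates, n_classes):
        super().__init__()
        self.d = nn.Parameter(torch.zeros(vocab_size))  # background term d
        self.B_topic = nn.Parameter(0.01 * torch.randn(n_topics, vocab_size))
        self.B_cov = nn.Parameter(0.01 * torch.randn(n_covariates, vocab_size))
        self.f_y = nn.Linear(n_topics + n_covariates, n_classes)

    def forward(self, theta, c):
        eta = self.d + theta @ self.B_topic + c @ self.B_cov  # word logits
        log_p_words = F.log_softmax(eta, dim=-1)
        y_logits = self.f_y(torch.cat([theta, c], dim=-1))    # label prediction
        return log_p_words, y_logits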
Optimization
Stochastic optimization using mini-batches of documents
Tricks from Srivastava and Sutton, 2017:
  - Adam optimizer with a high learning rate to bypass mode collapse
  - Batch-norm layers to avoid divergence
Annealing away from the batch-norm output to keep results interpretable (see the sketch below)
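One plausible implementation of that annealing step, as a sketch; the linear schedule and the blending scheme here are assumptions, not a description of the released code:

def annealed_batchnorm(logits, bn_layer, step, total_steps):
    # Blend batch-normed logits (e.g., bn_layer = torch.nn.BatchNorm1d(vocab_size))
    # with the raw logits, shifting weight from the former to the latter so that
    # late in training the output no longer depends on batch statistics.
    w = max(0.0, 1.0 - step / total_steps)  # anneals from 1 down to 0
    return w * bn_layer(logits) + (1.0 - w) * logits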
Output of Scholar
B(topic), B(cov): coherent groupings of positive and negative deviations from the background (∼ topics)
fµ, fσ: encoder network mapping from words to topics: θi = softmax(fe(words, ci, yi, ε))
fy: classifier mapping from θi to labels: y = fy(θi, ci)
Evaluation
1. Performance as a topic model, without metadata (perplexity, coherence; see the coherence sketch below)
2. Performance as a classifier, compared to SLDA
3. Exploratory data analysis
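A common recipe for the coherence metric is average NPMI over pairs of each topic's top words; the talk does not specify the exact variant used, so this Python sketch is an assumption for illustration:

import numpy as np

def npmi_coherence(top_words, doc_word_sets):
    # Average normalized PMI over pairs of a topic's top words, with
    # probabilities estimated from document-level co-occurrence.
    n_docs = len(doc_word_sets)
    def p(*words):
        return sum(all(w in d for w in words) for d in doc_word_sets) / n_docs
    scores = []
    for i, w1 in enumerate(top_words):
        for w2 in top_words[i + 1:]:
            p12 = p(w1, w2)
            if 0 < p12 < 1:  # NPMI is undefined at the boundaries
                scores.append(np.log(p12 / (p(w1) * p(w2))) / -np.log(p12))
    return float(np.mean(scores)) if scores else 0.0

Here doc_word_sets is a list containing the set of word types in each reference document.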
Quantitative results: basic model

[Bar charts: perplexity, coherence, and topic sparsity for LDA, SAGE, NVDM, Scholar, Scholar+wv (word vectors), and Scholar+sparsity]

IMDB dataset (Maas et al., 2011)
Classification results

[Bar chart: accuracy (0.5 to 1.0) for logistic regression (LR), SLDA, Scholar (labels), and Scholar (covariates)]

IMDB dataset (Maas et al., 2011)
Exploratory Data Analysis
Data: Media Frames Corpus (Card et al., 2015)
A collection of thousands of news articles annotated in terms of tone and framing
Relevant metadata: year of publication, newspaper, etc.
Tone as a label
[Figure: inferred topics positioned by p(pro-immigration | topic), from 0 to 1]

arrested charged charges agents operation
state gov benefits arizona law bill bills
bush border president bill republicans
labor jobs workers percent study wages
asylum judge appeals deportation court
visas visa applications students citizenship
boat desert died men miles coast haitian
english language city spanish community
Tone as a covariate, with interactions
Base topics               | Anti-immigration           | Pro-immigration
ice customs agency        | criminal customs           | detainees detention
population born percent   | jobs million illegals      | english newcomers
judge case court guilty   | guilty charges man         | asylum court judge
patrol border miles       | patrol border died         | authorities desert
licenses drivers card     | foreign sept visas         | green citizenship card
island story chinese      | smuggling federal          | island school ellis
guest worker workers      | bill border house          | workers tech skilled
benefits bill welfare     | republican california law  | welfare students
Conclusions
Variational autoencoders (VAEs) provide a powerful framework for latent variable modeling
We use the VAE framework to create a customizable model for documents with metadata
We obtain comparable performance with enhanced flexibility and scalability
Code is available: www.github.com/dallascard/scholar