-
THE UNIVERSITY OF CHICAGO
BROAD CONTEXT LANGUAGE MODELING WITH READING COMPREHENSION
A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES
IN CANDIDACY FOR THE DEGREE OF
MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
BY
ZEWEI CHU
CHICAGO, ILLINOIS
JUNE, 2017
-
Copyright © 2017 by Zewei Chu
All Rights Reserved
-
I dedicate my dissertation to my grandparents Chengrong Li and
Liuzhen Xu.
-
TABLE OF CONTENTS
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
  2.1 Machine Reading Comprehension . . . . . . . . . . . . . . . . . . . . . . . . 3
  2.2 Neural Readers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
  2.3 The LAMBADA dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 METHODS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
  3.1 LAMBADA as Reading Comprehension . . . . . . . . . . . . . . . . . . . . . 10
  3.2 Training Data Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 EXPERIMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5 RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
  5.1 Manual Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
  5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
7 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
v
-
LIST OF FIGURES
2.1 Attention Sum Reader [7] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Gated-attention Reader [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Stanford Reader [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
5.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
7.1 instance 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
7.2 instance 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.3 instance 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.4 instance 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.5 instance 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.6 instance 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.7 instance 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.8 instance 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.9 instance 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.10 instance 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
vi
-
LIST OF TABLES
2.1 Example instances from LAMBADA. . . . . . . . . . . . . . . . . . . . . . . . 9
5.1 Accuracies on test and control datasets, computed over all instances (“all”)
    and separately on those in which the answer is in the context (“context”). The
    first section is from [10]. ∗Estimated from 100 randomly-sampled dev instances.
    †Estimated from 100 randomly-sampled control instances. . . . . . . . . . . . 15
5.2 Labels derived from manual analysis of 100 LAMBADA dev instances. An
    instance can be tagged with multiple labels, hence the sum of instances across
    labels exceeds 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
vii
-
ACKNOWLEDGMENTS
I want to thank Hai Wang, my thesis advisor Kevin Gimpel, and Prof. David McAllester
for their collaboration on the project of applying reading comprehension models to the
LAMBADA dataset. Hai Wang contributed his code for various neural reader models as
well as GPUs for our experiments. My advisor Kevin Gimpel provided insightful guidance
on the project and performed the manual analysis of 100 LAMBADA instances. Prof. David
McAllester provided feedback and suggestions for the project. Without their generous
contributions, this project and this thesis could not have been accomplished.
We thank Denis Paperno for answering our questions about the
LAMBADA dataset and
we thank NVIDIA Corporation for donating GPUs used in this
research.
viii
-
ABSTRACT
Progress in text understanding has been driven by large datasets
that test particular capa-
bilities, like recent datasets for reading comprehension [4]. We
focus here on the LAMBADA
dataset [10], a word prediction task requiring broader context
than the immediate sentence.
We view LAMBADA as a reading comprehension problem and apply
comprehension models
based on neural networks. Though these models are constrained to
choose a word from the
context, they improve the state of the art on LAMBADA from 7.3%
to 49%. We analyze 100
instances, finding that neural network readers perform well in
cases that involve selecting a
name from the context based on dialogue or discourse cues but
struggle when coreference
resolution or external knowledge is needed.
ix
-
CHAPTER 1
INTRODUCTION
The recently created LAMBADA dataset [10] is a challenging word prediction task for all
state-of-the-art language models. The dataset was created to encourage the invention of new
language models that capture broader context.
Paperno et al. [10] provide baseline results with popular
language models and neural
network architectures; all achieve zero percent accuracy. The
best accuracy is 7.3% obtained
by randomly choosing a capitalized word from the passage.
Our approach is based on the observation that in 83% of
instances the answer appears
in the context. We exploit this in two ways. First, we
automatically construct a large
training set of 1.8 million instances by simply selecting
passages where the answer occurs in
the context. Second, we treat the problem as a reading
comprehension task similar to the
CNN/Daily Mail datasets introduced by [4], the Children’s Book
Test (CBT) of [5], and the
Who-did-What dataset of [9]. We show that standard models for
reading comprehension,
trained on our automatically generated training set, improve the
state of the art on the
LAMBADA test set from 7.3% to 49.0%. This is in spite of the
fact that these models fail
on the 17% of instances in which the answer is not in the
context.
We also perform a manual analysis of the LAMBADA task, provide
an estimate of human
performance, and categorize the instances in terms of the
phenomena they test. We find that
the comprehension models perform best on instances that require
selecting a name from the
context based on dialogue or discourse cues, but struggle when
required to do coreference
resolution or when external knowledge could help in choosing the
answer.
Our contributions are:
• We review the recent development of reading comprehension models, recently created
reading comprehension datasets, and the LAMBADA dataset.
• We create a new training set from the original LAMBADA training data.
1
-
• We provide stronger baseline results on the LAMBADA dataset.
• We show that language modeling tasks can be assisted by
reading comprehension models.
• Our manual analysis provides insights on the design of future
models to further improve
the results on LAMBADA.
2
-
CHAPTER 2
BACKGROUND
This chapter briefly reviews machine reading comprehension
tasks, neural network based
reading comprehension models and the LAMBADA dataset.
2.1 Machine Reading Comprehension
Researchers have created various reading comprehension datasets to test a machine's
capability of understanding text. We introduce some recently created, high-quality datasets.
MCTest [12]: each instance of MCTest comes with a story and four multiple-choice
questions about the story. The dataset is of high quality, but its limitation is that it only
provides 640 instances, which is not sufficient for training a good model for solving reading
comprehension tasks.
DeepMind's CNN/Daily Mail dataset [4] is built from CNN and Daily Mail news articles.
It is a much bigger dataset, consisting of approximately 187k documents and 1259k questions.
This dataset has boosted research on designing attention-based neural reading comprehension
models, as introduced in Section 2.2.
Other high-quality reading comprehension datasets such as SQuAD [11], WikiQA [15],
and RACE [8] focus on various aspects of language. We do not discuss them in detail for
brevity.
2.2 Neural Readers
Various kinds of recurrent neural network models have been
invented to solve cloze style
question answering tasks with state-of-the-art accuracies.
DeepMind [4] provides a large
scale supervised reading comprehension dataset (CNNDM: CNN/Daily
Mail dataset) and
3
-
Figure 2.1: Attention Sum Reader [7]
Figure 2.2: Gated-attention Reader [3]
boosted the research of designing attention based neural reading
comprehension models, such
as attention sum reader [7], the Stanford reader [1], and
gated-attention readers [3].
We refer to these models as “neural readers”. The architectures
of the aforementioned
three neural readers are shown in figure 2.1, figure 2.2 and
figure 2.3.
These neural readers use attention based on the question and
passage to choose an answer
from among the words in the passage. We use d for the context
word sequence, q for the
question (with a blank to be filled), A for the candidate answer
list, and V for the vocabulary.
We describe neural readers in terms of three components:
4
-
Figure 2.3: Stanford Reader [1]
1. Embedding and Encoding: Each word in d and q is mapped into a
v-dimensional
vector via the embedding function e(w) ∈ Rv, for all w ∈ d ∪ q.1
The same embedding
function is used for both d and q. The embeddings are learned
from random initializa-
tion; no pretrained word embeddings are used. The embedded
context is processed by a
bidirectional recurrent neural network (RNN) which computes
hidden vectors hi for each
position i in each sequence:
\overrightarrow{h} = f_{RNN}(\overrightarrow{\theta_d}, e(d))
\overleftarrow{h} = b_{RNN}(\overleftarrow{\theta_d}, e(d))
h = \langle \overrightarrow{h}, \overleftarrow{h} \rangle
where \overrightarrow{\theta_d} and \overleftarrow{\theta_d} are RNN parameters, and each of fRNN and bRNN returns a sequence
of hidden vectors, one for each position in the input e(d). The
question is encoded into a
1. We overload the e function to operate on sequences and denote the embeddings of d and q as matrices e(d) and e(q).
5
-
single vector g which is the concatenation of the final vectors
of two RNNs:
\overrightarrow{g} = f_{RNN}(\overrightarrow{\theta_q}, e(q))
\overleftarrow{g} = b_{RNN}(\overleftarrow{\theta_q}, e(q))
g = \langle \overrightarrow{g}_{|q|}, \overleftarrow{g}_0 \rangle
The RNNs use either gated recurrent units [2] or long short-term
memory [6].
The gated-attention reader uses a different bidirectional RNN encoder at each layer.
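To make the encoding step concrete, here is a minimal numpy sketch of component 1. It assumes a plain tanh RNN cell in place of the GRU/LSTM cells used by the actual readers, and all names (rnn, encode_context, encode_question, E) are illustrative rather than taken from any released implementation.

import numpy as np

rng = np.random.RandomState(0)
V, v, hdim = 100, 16, 32            # vocabulary size, embedding dim v, hidden dim
E = rng.randn(V, v) * 0.1           # embedding function e(w), learned from random initialization

def rnn(params, inputs):
    # run a simple tanh RNN over a sequence of embeddings; return all hidden states
    Wx, Wh, b = params
    h = np.zeros(Wh.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return np.stack(states)

def make_params():
    return (rng.randn(hdim, v) * 0.1, rng.randn(hdim, hdim) * 0.1, np.zeros(hdim))

fwd_d, bwd_d = make_params(), make_params()      # theta_d for fRNN and bRNN
fwd_q, bwd_q = make_params(), make_params()      # theta_q for fRNN and bRNN

def encode_context(d_ids):
    e_d = E[d_ids]                                   # e(d)
    h_fwd = rnn(fwd_d, e_d)                          # forward hidden states
    h_bwd = rnn(bwd_d, e_d[::-1])[::-1]              # backward states, re-aligned to positions
    return np.concatenate([h_fwd, h_bwd], axis=1)    # h_i = <forward_i, backward_i>

def encode_question(q_ids):
    e_q = E[q_ids]                                   # e(q)
    g_fwd = rnn(fwd_q, e_q)[-1]                      # final forward vector
    g_bwd = rnn(bwd_q, e_q[::-1])[-1]                # final backward vector
    return np.concatenate([g_fwd, g_bwd])            # single question vector g

h = encode_context([3, 17, 42, 7])                   # one hidden vector per context position
g = encode_question([5, 42, 9])                      # one vector for the whole question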
2. Attention: The readers then compute attention weights on
positions of h using g. In
general, we define αi = softmax(att(hi, g)), where i ranges over
positions in h. The att
function is an inner product in the Attention Sum Reader [7] and
a bilinear product in
the Stanford Reader [1].
The computed attentions are then passed through a softmax
function to form a probability
distribution.
The Gated-Attention Reader uses a richer attention architecture [3]. Each attention layer
has its own question embedding function, and the next layer's document embedding is
computed by the following GA function, where the superscripts indicate layers:
d^{(k)} = GA(h^{(k)}, q^{(k)})
The GA function at each layer is defined as follows, with the superscript omitted:
6
-
\alpha_i = \mathrm{softmax}(e(q)^\top h_i)   (2.1)
\tilde{q}_i = e(q)\,\alpha_i   (2.2)
d_i = h_i \odot \tilde{q}_i   (2.3)
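A small numpy sketch of the GA function of Eqs. 2.1-2.3 follows. Consistent with Eq. 2.2, it treats alpha_i as a distribution over question positions; the names H (document hidden vectors h) and Q (this layer's question embeddings) are ours.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_attention(H, Q):
    # H: (|d|, dim) document vectors h_i; Q: (|q|, dim) question embeddings for this layer
    out = []
    for h_i in H:
        alpha_i = softmax(Q @ h_i)   # Eq. 2.1: attention of h_i over the question words
        q_tilde = Q.T @ alpha_i      # Eq. 2.2: attended question vector for position i
        out.append(h_i * q_tilde)    # Eq. 2.3: element-wise gating of h_i
    return np.stack(out)             # document representation d^(k) fed to the next layer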
3. Output and Prediction: To output a prediction a∗, the
Stanford Reader [1] computes
the attention-weighted sum of the context vectors and then an
inner product with each
candidate answer:
c = \sum_{i=1}^{|d|} \alpha_i h_i
a^* = \mathrm{argmax}_{a \in A}\; o(a)^\top c
where o(a) is the “output” embedding function. As the Stanford
Reader was developed
for the anonymized CNN/Daily Mail tasks, only a few entries in
the output embedding
function needed to be well-trained in their experiments.
However, for LAMBADA, the
set of possible answers can range over the entirety of V ,
making the output embedding
function difficult to train. Therefore we also experiment with a
modified version of the
Stanford Reader that uses the same embedding function e for both
input and output
words:
a^* = \mathrm{argmax}_{a \in A}\; e(a)^\top W c   (2.4)
where W is an additional parameter matrix used to match
dimensions and model any
additional needed transformation.
7
-
For the Attention Sum and Gated-Attention Readers the answer is
computed by:
\forall a \in A,\; P(a \mid d, q) = \sum_{i \in I(a,d)} \alpha_i
a^* = \mathrm{argmax}_{a \in A}\; P(a \mid d, q)
where I(a,d) is the set of positions where a appears in context
d.
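The two output styles can be contrasted in a short numpy sketch under the same notation; W_att, W_out and the helper names are our own, and the Stanford variant shown is the modified one of Eq. 2.4 that reuses the input embeddings e(a).

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def stanford_predict(H, g, W_att, W_out, E, cand_ids):
    # attention-weighted context vector c, then bilinear scoring of candidates (Eq. 2.4)
    alpha = softmax(H @ (W_att @ g))               # bilinear attention att(h_i, g)
    c = alpha @ H                                  # c = sum_i alpha_i h_i
    scores = [E[a] @ (W_out @ c) for a in cand_ids]
    return cand_ids[int(np.argmax(scores))]        # a* = argmax_a e(a)^T W c

def attention_sum_predict(alpha, d_ids, cand_ids):
    # P(a|d,q) = sum of the attention mass over the positions where a occurs in the context
    probs = {a: sum(alpha[i] for i, w in enumerate(d_ids) if w == a) for a in cand_ids}
    return max(probs, key=probs.get)               # a* = argmax_a P(a|d,q)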
2.3 The LAMBADA dataset
The LAMBADA dataset [10] was designed by identifying word
prediction tasks that require
broad context. Each instance is drawn from the BookCorpus [16]
and consists of a passage
of several sentences where the task is to predict the last word
of the last sentence.
The creation procedure of the LAMBADA dataset was designed to guarantee that two
human subjects could correctly predict the target word when given the context, while at
least ten human subjects could not correctly guess the target word when given only the
target sentence without the context. This filtering procedure finds cases that are guessable
by humans when given the larger context but not when given only the last sentence.
The expense of this manual filtering has limited the dataset to
only about 10,000 instances
which are viewed as development and test data. The training data
is taken to be books in
the corpus other than those from which the evaluation passages
were extracted.
Table 2.1 shows two LAMBADA instances. Each instance consists of a context document,
a target sentence without its final word, and a target word that needs to be predicted.
None of several state-of-the-art language models reaches accuracy above 1% on the
LAMBADA dataset, as reported by Paperno et al. [10]. They thus present the LAMBADA
dataset as a challenging test set, meant to encourage models requiring
genuine understanding of
broad context in natural language text.
8
-
Context: “Why?” “I would have thought you'd find him rather dry,” she said. “I don't know
about that,” said Gabriel. “He was a great craftsman,” said Heather. “That he was,” said
Flannery.
Target Sentence: “And Polish, to boot,” said
Target Word: Gabriel

Context: Both its sun-speckled shade and the cool grass beneath were a welcome respite
after the stifling kitchen, and I was glad to relax against the tree's rough, brittle bark and
begin my breakfast of buttery, toasted bread and fresh fruit. Even the water was tasty, it
was so clean and cold.
Target Sentence: It almost made up for the lack of
Target Word: coffee

Table 2.1: Example instances from LAMBADA.
9
-
CHAPTER 3
METHODS
In this chapter we describe how we model the LAMBADA dataset as
a reading comprehension
task, and the procedures we take to create a new training set
from the original LAMBADA
training set.
3.1 LAMBADA as Reading Comprehension
As reported in [10], n-gram and LSTM language models achieve accuracies of less than 1%
on the LAMBADA dataset. [10] claims that solving LAMBADA requires broader context
information than traditional language models can capture. This inspired us to model
LAMBADA as a reading comprehension task and to apply neural reading comprehension
models to it.
For each instance, we feed the context sentences to the neural readers as the context
document, and the target sentence as the question. Some neural readers require a candidate
target-word list to choose from, and the tricky part is deciding which list of candidate words
to feed into the readers. In our experiments, we list all words in the context as candidate
answers, except for punctuation.1 This choice, though it sacrifices the instances whose target
word does not appear in the context, is reasonable, as over 80% of LAMBADA development
instances have the target word in the context.
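The following is a minimal sketch of this candidate-list construction; the punctuation set is only a stand-in for the short list referenced in the footnote, and the function name is ours.

PUNCTUATION = {".", ",", "!", "?", ";", ":", "\"", "'", "(", ")", "-", "--", "..."}  # placeholder list

def build_candidates(context_tokens):
    # every distinct non-punctuation word of the context becomes a candidate answer
    seen, candidates = set(), []
    for tok in context_tokens:
        if tok in PUNCTUATION or tok in seen:
            continue
        seen.add(tok)
        candidates.append(tok)
    return candidates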
3.2 Training Data Construction
Each LAMBADA instance is divided into a context (4.6 sentences
on average) and a target
sentence, and the last word of the target sentence is the target
word to be predicted. The
1. This list of punctuation symbols is at
https://raw.githubusercontent.com/ZeweiChu/lambada-dataset/master/stopwords/shortlist-stopwords.txt
10
-
LAMBADA dataset consists of development (dev) and test (test)
sets; [10] also provide a
control dataset (control), an unfiltered sample of instances
from the BookCorpus.
We construct a new training dataset from the BookCorpus. We
restrict it to instances
that contain the target word in the context. This decision is
natural given our use of neural
readers that assume the answer is contained in the passage. We
also ensure that the context
has at least 50 words and contains 4 or 5 sentences, and we require the target sentence to
have more than 10 words.
Our new dataset contains 1,827,123 instances in total. We divide
it into two parts, a
training set (train) of 1,618,782 instances and a validation set
(val) of 208,341 instances.
These datasets can be found at the authors’ websites.
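A sketch of the filtering behind this construction is given below, assuming the BookCorpus text has already been split into sentence windows; the helper name make_instance and the dictionary keys are ours, while the thresholds follow the description above.

def make_instance(sentences):
    # try to turn a window of sentences into a training instance; return None if it fails a filter
    *context_sents, target_sent = sentences
    target_words = target_sent.split()
    if len(target_words) <= 10:                    # target sentence must have more than 10 words
        return None
    if not (4 <= len(context_sents) <= 5):         # context of 4 or 5 sentences
        return None
    context_words = [w for s in context_sents for w in s.split()]
    if len(context_words) < 50:                    # context must have at least 50 words
        return None
    target_word = target_words[-1]
    if target_word not in context_words:           # the answer must occur in the context
        return None
    return {"context": context_words,
            "target_sentence": target_words[:-1],
            "target_word": target_word}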
11
-
CHAPTER 4
EXPERIMENTS
We use the Stanford Reader [1], our modified Stanford Reader
(Eq. 2.4), the Attention Sum
(AS) Reader [7], and the Gated-Attention (GA) Reader [3]. We
also add the simple features
from [14] to the AS and GA Readers. The features are
concatenated to the word embeddings
in the context. They include: whether the word appears in the
target sentence, the frequency
of the word in the context, the position of the word’s first
occurrence in the context as a
percentage of the context length, and whether the text
surrounding the word matches the
text surrounding the blank in the target sentence. For the last
feature, we only consider
matching the left word since the blank is always the last word
in the target sentence.
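A sketch of how these four token-level features could be computed is shown below; the function name and the exact feature encoding are our own reading of the description, not code from [14].

def token_features(context, target_sentence):
    # context: list of context words; target_sentence: the target sentence without the final (blank) word
    target_set = set(target_sentence)
    counts = {w: context.count(w) for w in set(context)}
    first_pos = {}
    for i, w in enumerate(context):
        first_pos.setdefault(w, i)
    left_of_blank = target_sentence[-1] if target_sentence else None   # word just left of the blank
    feats = []
    for i, w in enumerate(context):
        in_target = 1.0 if w in target_set else 0.0                    # appears in the target sentence
        freq = float(counts[w])                                        # frequency in the context
        rel_first = first_pos[w] / len(context)                        # first occurrence as % of context length
        left_match = 1.0 if i > 0 and context[i - 1] == left_of_blank else 0.0   # left-word match
        feats.append([in_target, freq, rel_first, left_match])
    return feats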
All models are trained end to end without any warm start and
without using pretrained
embeddings. We train each reader on train for a max of 10
epochs, stopping when accuracy
on dev decreases two epochs in a row. We take the model from the
epoch with max dev
accuracy and evaluate it on test and control. val is not
used.
We evaluate several other baseline systems inspired by those of
[10], but we focus on
versions that restrict the choice of answers to non-stopwords in
the context.1 We found this
strategy to consistently improve performance even though it
limits the maximum achievable
accuracy.
We consider two n-gram language model baselines. We use the
SRILM toolkit [13] to
estimate a 4-gram model with modified Kneser-Ney smoothing on
the combination of train
and val. One uses a cache size of 100 and the other does not use
a cache. We use each
model to score each non-stopword from the context. We also
evaluate an LSTM language
model. We train it on train, where the loss is cross entropy
summed over all positions in
each instance. The output vocabulary is the vocabulary of train,
approximately 130k word
types. At test time, we again limit the search to non-stopwords
in the context.
1. We use the stopword list from [12].
12
-
We also test simple baselines that choose particular
non-stopwords from the context,
including a random one, the first in the context, the last in
the context, and the most
frequent in the context.
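A compact sketch of these context-restricted baselines follows; STOPWORDS is a placeholder for the stopword list of [12], and the function name is ours.

import random
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "it", "was", "i"}   # placeholder list

def context_baselines(context_tokens, seed=0):
    # choose a random / first / last / most frequent non-stopword from the context
    words = [w for w in context_tokens if w.lower() not in STOPWORDS]
    if not words:
        return {}
    return {
        "random": random.Random(seed).choice(words),
        "first": words[0],
        "last": words[-1],
        "most_frequent": Counter(words).most_common(1)[0][0],
    }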
13
-
CHAPTER 5
RESULTS
Table 5.1 shows our results. We report accuracies on the
entirety of test and control
(“all”), as well as separately on the part of test and control
where the target word is in
the context (“context”). The first part of the table shows
results from [10]. We then show
our baselines that choose a word from the context. Choosing the
most frequent yields a
surprisingly high accuracy of 11.7%, which is better than all
results from Paperno et al. [10].
Our language models perform comparably, with the n-gram + cache
model doing best.
By forcing language models to select a word from the context,
the accuracy on test is
much higher than that of the analogous models from Paperno et al. [10],
though accuracy suffers on
control.
We then show results with the neural readers, showing that they
give much higher accura-
cies on test than all other methods. The GA Reader with the
simple additional features [14]
yields the highest accuracy, reaching 49.0%. We also measured
the “top k” accuracy of this
model, where we give the model credit if the correct answer is
among the top k ranked
answers. On test, we reach 65.4% top-2 accuracy and 72.8%
top-3.
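For clarity, top-k accuracy here simply credits the model whenever the gold target word appears among its k highest-ranked candidates; a minimal sketch with illustrative names:

def topk_accuracy(ranked_answers, gold_answers, k):
    # ranked_answers: one list of candidates per instance, sorted best-first by model score
    hits = sum(1 for ranked, gold in zip(ranked_answers, gold_answers) if gold in ranked[:k])
    return hits / len(gold_answers)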
Figure 5.1 shows the accuracies of various models on the
entirety of the LAMBADA test set.
The AS and GA Readers work much better than the Stanford Reader.
One cause appears
to be that the Stanford Reader learns distinct embeddings for
input and answer words,
as discussed above. Our modified Stanford Reader, which uses
only a single set of word
embeddings, improves by 10.4% absolute. Since the AS and GA
Readers merely score words
in the context, they do not learn separate answer word
embeddings and therefore do not
suffer from this effect.
We suspect the remaining accuracy difference between the
Stanford and the other readers
is due to the difference in the output function. The Stanford
Reader was developed for the
CNN and Daily Mail datasets, in which correct answers are
anonymized entity identifiers
14
-
                                           test               control
Method                                 all     context     all     context

Baselines [10]
  Random in context                     1.6     N/A         0       N/A
  Random cap. in context                7.3     N/A         0       N/A
  n-gram                                0.1     N/A        19.1     N/A
  n-gram + cache                        0.1     N/A        19.1     N/A
  LSTM                                  0       N/A        21.9     N/A
  Memory network                        0       N/A         8.5     N/A

Our context-restricted non-stopword baselines
  Random                                5.6     6.7         0.3     2.2
  First                                 3.8     4.6         0.1     1.1
  Last                                  6.2     7.5         0.9     6.5
  Most frequent                        11.7    14.4         0.4     8.1

Our context-restricted language model baselines
  n-gram                               10.7    13.1         2.2    15.6
  n-gram + cache                       11.8    14.5         2.2    15.6
  LSTM                                  9.2    11.1         2.4    16.9

Our neural reader results
  Stanford Reader                      21.7    26.2         7.0    49.3
  Modified Stanford Reader             32.1    38.8         7.4    52.3
  AS Reader                            41.4    50.1         8.5    60.2
  AS Reader + features                 47.4    57.4         8.6    60.6
  GA Reader                            44.5    53.9         8.8    62.5
  GA Reader + features                 49.0    59.4         9.3    65.6

  Human                                86.0∗    -          36.0†    -

Table 5.1: Accuracies on test and control datasets, computed over all instances (“all”)
and separately on those in which the answer is in the context (“context”). The first section
is from [10]. ∗Estimated from 100 randomly-sampled dev instances. †Estimated from 100
randomly-sampled control instances.
which are reused across instances. Since the identifier
embeddings are observed so frequently
in the training data, they are frequently updated. In our
setting, however, answers are
words from a large vocabulary, so many of the word embeddings of
correct answers may be
undertrained. This could potentially be addressed by augmenting
the word embeddings with
identifiers to obtain some of the modeling benefits of
anonymization [14].
All context-restricted models yield poor accuracies on the entirety of control. This is
because only 14.1% of control instances have the target word in the context, which sets an
upper bound on the accuracy these models can achieve.
15
-
Figure 5.1: Accuracy
5.1 Manual Analysis
One annotator, a native English speaker, sampled 100 instances
randomly from dev, hid
the final word, and attempted to guess it from the context and
target sentence. The an-
notator was correct in 86 cases. For the subset that contained
the answer in the context,
the annotator was correct in 79 of 87 cases. Even though two
annotators were able to cor-
rectly answer all LAMBADA instances during dataset construction
[10], our results give an
estimate of how often a third annotator would agree. The annotator did the
same on 100 instances
randomly sampled from control, guessing correctly in 36 cases.
These results are reported
in Table 5.1. The annotator was correct on 6 of the 12 control
instances in which the
answer was contained in the context.
We analyzed the 100 LAMBADA dev instances, tagging each with
labels indicating the
minimal kinds of understanding needed to answer it correctly.1
Each instance can have
1. The annotations are available from the authors’ websites.
16
-
label                        #     GA+     human
single name cue              9     89%     100%
simple speaker tracking     19     84%     100%
basic reference             18     56%      72%
discourse inference rule    16     50%      88%
semantic trigger            20     40%      80%
coreference                 21     38%      90%
external knowledge          24     21%      88%
all                        100     55%      86%

Table 5.2: Labels derived from manual analysis of 100 LAMBADA dev instances. An
instance can be tagged with multiple labels, hence the sum of instances across labels exceeds
100.
multiple labels. We briefly describe each label below:
• single name cue: the answer is clearly a name according to
contextual cues and only a
single name is mentioned in the context.
• simple speaker tracking: instance can be answered merely by
tracking who is speaking
without understanding what they are saying.
• basic reference: answer is a reference to something mentioned
in the context; simple
understanding/context matching suffices.
• discourse inference rule: answer can be found by applying a
single discourse inference rule,
such as the rule: “X left Y and went in search of Z” → Y ≠ Z.
• semantic trigger: amorphous semantic information is needed to
choose the answer, typ-
ically related to event sequences or dialogue turns, e.g., a
customer says “Where is the
X?” and a supplier responds “We got plenty of X”.
• coreference: instance requires non-trivial coreference
resolution to solve correctly, typically
the resolution of anaphoric pronouns.
• external knowledge: some particular external knowledge is
needed to choose the answer.
17
-
Table 5.2 shows the breakdown of these labels across instances,
as well as the accuracy on
each label of the GA Reader with features.
The GA Reader performs well on instances involving shallower,
more surface-level cues.
In 9 cases, the answer is clearly a name based on contextual
cues in the target sentence and
there is only one name in the context; the reader answers all
but one correctly. When only
simple speaker tracking is needed (19 cases), the reader gets
84% correct.
The hardest instances are those that involve deeper
understanding, like semantic links,
coreference resolution, and external knowledge. While external
knowledge is difficult to
define, we chose this label when we were able to explicitly
write down the knowledge that
one would use when answering the instances, e.g., one instance
requires knowing that “when
something explodes, noise emanates from it”. These instances
make up nearly a quarter of
those we analyzed, making LAMBADA a good task for work in
leveraging external knowledge
for language understanding.
5.2 Discussion
On control, while our readers outperform our other baselines,
they are outperformed by
the language modeling baselines from Paperno et al. This
suggests that though we have
improved the state of the art on LAMBADA by more than 40%
absolute, we have not solved
the general language modeling problem; there is no single model
that performs well on both
test and control. Our 36% estimate of human performance on
control shows the
difficulty of the general problem, and reveals a gap of 14%
between the best language model
and human accuracy.
A natural question to ask is whether applying neural readers is
a good direction for
this task, since they fail on the 17% of instances which do not
have the target word in the
context. Furthermore, this subset of LAMBADA in which the answer
is not in the context
may in fact display the most interesting and challenging
phenomena. Some neural readers,
18
-
like the Stanford Reader, can be easily used to predict target
words that do not appear
in the context, and the other readers can be modified to do so.
Doing this will require a
different selection of training data than that used above.
However, we do wish to note that,
in addition to the relative rarity of these instances in
LAMBADA, we found them to be
challenging for our annotator (who was correct on only 7 of the
13 in this subset).
We note that train has similar characteristics to the part of
control that contains the
answer in the context (the final column of Table 5.1). We find
that the ranking of systems
according to this column is similar to that in the test column.
This suggests that our
simple method of dataset creation could be used to create
additional training or evaluation
sets for challenging language modeling problems like LAMBADA,
perhaps by combining it
with baseline suppression [9]. The experiments on solving the LAMBADA word prediction
task with neural readers also suggest that language modeling tasks can be assisted by reading
comprehension models.
19
-
CHAPTER 6
CONCLUSION
We constructed a new training set for LAMBADA and used it to
train neural readers to
improve the state of the art from 7.3% to 49%. We also provided
results with several other
strong baselines and included a manual evaluation in an attempt
to better understand the
phenomena tested by the task. Our hope is that other researchers
will seek models and
training regimes that simultaneously perform well on both
LAMBADA and control, with
the goal of solving the general problem of language
modeling.
20
-
REFERENCES
[1] Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proc. of ACL, 2016.
[2] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP, 2014.
[3] Bhuwan Dhingra, Hanxiao Liu, William W. Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. arXiv preprint, 2016.
[4] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proc. of NIPS, 2015.
[5] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The Goldilocks principle: Reading children's books with explicit memory representations. In Proc. of ICLR, 2016.
[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[7] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. In Proc. of ACL, 2016.
[8] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint, 2017.
[9] Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. Who did What: A large-scale person-centered cloze dataset. In Proc. of EMNLP, 2016.
[10] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proc. of ACL, 2016.
[11] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP, 2016.
[12] Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proc. of EMNLP, 2013.
[13] Andreas Stolcke. SRILM: An extensible language modeling toolkit. In Proc. of Interspeech, 2002.
[14] Hai Wang, Takeshi Onishi, Kevin Gimpel, and David McAllester. Emergent logical structure in vector representations of neural readers. arXiv preprint, 2016.
21
-
[15] Yi Yang, Scott Wen-tau Yih, and Chris Meek. WikiQA: A challenge dataset for open-domain question answering. In Proc. of EMNLP, 2015.
[16] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proc. of ICCV, 2015.
22
-
CHAPTER 7
APPENDIX
We show heat maps of LAMBADA instances solved by the gated-attention (GA) reader.
Words in darker red are given higher probability by the model. The first candidate word of
each instance is the correct target word. We can see that in most cases, the reader attends
to only a few candidate words.
Figure 7.1: instance 1
23
-
Figure 7.2: instance 2
Figure 7.3: instance 3
Figure 7.4: instance 4
Figure 7.5: instance 5
Figure 7.6: instance 6
Figure 7.7: instance 7
24
-
Figure 7.8: instance 8
Figure 7.9: instance 9
Figure 7.10: instance 10
25