  • THE UNIVERSITY OF CHICAGO

    BROAD CONTEXT LANGUAGE MODELING WITH READING COMPREHENSION

    A DISSERTATION SUBMITTED TO

THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES

    IN CANDIDACY FOR THE DEGREE OF

    MASTER OF SCIENCE

    DEPARTMENT OF COMPUTER SCIENCE

    BY

    ZEWEI CHU

    CHICAGO, ILLINOIS

    JUNE, 2017

• Copyright © 2017 by Zewei Chu

    All Rights Reserved

  • I dedicate my dissertation to my grandparents Chengrong Li and Liuzhen Xu.

  • Epigraph Text

  • TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGMENTS

ABSTRACT

1 INTRODUCTION

2 BACKGROUND
  2.1 Machine Reading Comprehension
  2.2 Neural Readers
  2.3 The LAMBADA dataset

3 METHODS
  3.1 LAMBADA as Reading Comprehension
  3.2 Training Data Construction

4 EXPERIMENTS

5 RESULTS
  5.1 Manual Analysis
  5.2 Discussion

6 CONCLUSION

REFERENCES

7 APPENDIX

  • LIST OF FIGURES

2.1 Attention Sum Reader [7]
2.2 Gated-attention Reader [3]
2.3 Stanford Reader [1]

5.1 Accuracy

7.1 instance 1
7.2 instance 2
7.3 instance 3
7.4 instance 4
7.5 instance 5
7.6 instance 6
7.7 instance 7
7.8 instance 8
7.9 instance 9
7.10 instance 10

  • LIST OF TABLES

2.1 Example instances from LAMBADA.

5.1 Accuracies on test and control datasets, computed over all instances (“all”) and separately on those in which the answer is in the context (“context”). The first section is from [10]. *Estimated from 100 randomly-sampled dev instances. †Estimated from 100 randomly-sampled control instances.

5.2 Labels derived from manual analysis of 100 LAMBADA dev instances. An instance can be tagged with multiple labels, hence the sum of instances across labels exceeds 100.

  • ACKNOWLEDGMENTS

I want to thank Hai Wang, my thesis advisor Kevin Gimpel, and Prof. David McAllester for their collaboration on the project of applying reading comprehension models to the LAMBADA dataset. Hai Wang contributed his code for various neural reader models and GPUs for our experiments. My advisor Kevin Gimpel provided insightful guidance on the project and performed the manual analysis of 100 LAMBADA instances. Prof. David McAllester provided feedback and suggestions for the project. Without their generous contributions, this project and this thesis could not have been completed.

    We thank Denis Paperno for answering our questions about the LAMBADA dataset and

    we thank NVIDIA Corporation for donating GPUs used in this research.


  • ABSTRACT

    Progress in text understanding has been driven by large datasets that test particular capa-

    bilities, like recent datasets for reading comprehension [4]. We focus here on the LAMBADA

    dataset [10], a word prediction task requiring broader context than the immediate sentence.

    We view LAMBADA as a reading comprehension problem and apply comprehension models

    based on neural networks. Though these models are constrained to choose a word from the

    context, they improve the state of the art on LAMBADA from 7.3% to 49%. We analyze 100

    instances, finding that neural network readers perform well in cases that involve selecting a

    name from the context based on dialogue or discourse cues but struggle when coreference

    resolution or external knowledge is needed.


  • CHAPTER 1

    INTRODUCTION

The recently created LAMBADA dataset [10] is a word prediction task that is challenging for all state-of-the-art language models. The dataset was created to encourage the development of new language models that capture broader context.

    Paperno et al. [10] provide baseline results with popular language models and neural

    network architectures; all achieve zero percent accuracy. The best accuracy is 7.3% obtained

    by randomly choosing a capitalized word from the passage.

    Our approach is based on the observation that in 83% of instances the answer appears

    in the context. We exploit this in two ways. First, we automatically construct a large

    training set of 1.8 million instances by simply selecting passages where the answer occurs in

    the context. Second, we treat the problem as a reading comprehension task similar to the

    CNN/Daily Mail datasets introduced by [4], the Children’s Book Test (CBT) of [5], and the

    Who-did-What dataset of [9]. We show that standard models for reading comprehension,

    trained on our automatically generated training set, improve the state of the art on the

    LAMBADA test set from 7.3% to 49.0%. This is in spite of the fact that these models fail

    on the 17% of instances in which the answer is not in the context.

    We also perform a manual analysis of the LAMBADA task, provide an estimate of human

    performance, and categorize the instances in terms of the phenomena they test. We find that

    the comprehension models perform best on instances that require selecting a name from the

    context based on dialogue or discourse cues, but struggle when required to do coreference

    resolution or when external knowledge could help in choosing the answer.

    Our contributions are:

• We review recent developments in reading comprehension models, recently created reading comprehension datasets, and the LAMBADA dataset.

• We create a new training set from the original LAMBADA dataset.

• We provide stronger baseline results on the LAMBADA dataset.

    • We show that language modeling tasks can be assisted by reading comprehension models.

    • Our manual analysis provides insights on the design of future models to further improve

    the results on LAMBADA.


  • CHAPTER 2

    BACKGROUND

    This chapter briefly reviews machine reading comprehension tasks, neural network based

    reading comprehension models and the LAMBADA dataset.

    2.1 Machine Reading Comprehension

Researchers have created various reading comprehension datasets to test machines’ ability to understand text. We introduce several recently created datasets below.

MCTest [12]: each instance of MCTest consists of a story and four multiple-choice questions about the story. The dataset is of high quality, but it contains only 640 instances, which is not enough to train a good reading comprehension model.

DeepMind’s CNN/Daily Mail dataset [4] is built from CNN and Daily Mail news articles. It is much larger, consisting of approximately 187k documents and 1259k questions. This dataset has spurred research on attention-based neural reading comprehension models, as introduced in Section 2.2.

Other high-quality reading comprehension datasets, such as SQuAD [11], WikiQA [15], and RACE [8], focus on various aspects of language; we do not discuss them in detail for brevity.

    2.2 Neural Readers

Various recurrent neural network models have been invented to solve cloze-style question answering tasks with state-of-the-art accuracy. DeepMind [4] provided a large-scale supervised reading comprehension dataset (CNNDM: the CNN/Daily Mail dataset) and


  • Figure 2.1: Attention Sum Reader [7]

    Figure 2.2: Gated-attention Reader [3]

spurred research on designing attention-based neural reading comprehension models, such as the Attention Sum Reader [7], the Stanford Reader [1], and the Gated-Attention Reader [3].

We refer to these models as “neural readers”. The architectures of the three neural readers are shown in Figures 2.1, 2.2, and 2.3.

    These neural readers use attention based on the question and passage to choose an answer

    from among the words in the passage. We use d for the context word sequence, q for the

    question (with a blank to be filled), A for the candidate answer list, and V for the vocabulary.

    We describe neural readers in terms of three components:


  • Figure 2.3: Stanford Reader [1]

1. Embedding and Encoding: Each word in d and q is mapped into a v-dimensional vector via the embedding function e(w) ∈ R^v, for all w ∈ d ∪ q.1 The same embedding

    function is used for both d and q. The embeddings are learned from random initializa-

    tion; no pretrained word embeddings are used. The embedded context is processed by a

    bidirectional recurrent neural network (RNN) which computes hidden vectors hi for each

    position i in each sequence:

\overrightarrow{h} = f_{RNN}(\overrightarrow{\theta}_d, e(d))

\overleftarrow{h} = b_{RNN}(\overleftarrow{\theta}_d, e(d))

h = \langle \overrightarrow{h}, \overleftarrow{h} \rangle

where \overrightarrow{\theta}_d and \overleftarrow{\theta}_d are RNN parameters, and each of f_{RNN} and b_{RNN} returns a sequence

    of hidden vectors, one for each position in the input e(d). The question is encoded into a

1. We overload the e function to operate on sequences and denote the embeddings of d and q as matrices e(d) and e(q).


  • single vector g which is the concatenation of the final vectors of two RNNs:

\overrightarrow{g} = f_{RNN}(\overrightarrow{\theta}_q, e(q))

\overleftarrow{g} = b_{RNN}(\overleftarrow{\theta}_q, e(q))

g = \langle \overrightarrow{g}_{|q|}, \overleftarrow{g}_0 \rangle

    The RNNs use either gated recurrent units [2] or long short-term memory [6].

The Gated-Attention Reader uses a different bidirectional RNN encoder at each layer.
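To make the embedding-and-encoding step concrete, here is a minimal sketch in PyTorch. The framework choice, the class name ReaderEncoder, and the dimensions are illustrative assumptions; the thesis does not tie this description to a particular implementation.

```python
# Minimal PyTorch sketch of component 1 (embedding + bidirectional encoding).
# ReaderEncoder and all dimensions are illustrative, not the thesis's code.
import torch
import torch.nn as nn

class ReaderEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # shared e(.) for d and q
        self.ctx_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.q_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, d_ids, q_ids):
        # h: one hidden vector per context position (forward/backward halves concatenated)
        h, _ = self.ctx_rnn(self.embed(d_ids))           # (batch, |d|, 2 * hidden_dim)
        # g: single question vector = concatenation of the two final RNN states
        _, q_final = self.q_rnn(self.embed(q_ids))       # (2, batch, hidden_dim)
        g = torch.cat([q_final[0], q_final[1]], dim=-1)  # (batch, 2 * hidden_dim)
        return h, g

enc = ReaderEncoder(vocab_size=1000)
h, g = enc(torch.randint(0, 1000, (1, 7)), torch.randint(0, 1000, (1, 5)))
print(h.shape, g.shape)  # torch.Size([1, 7, 256]) torch.Size([1, 256])
```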

    2. Attention: The readers then compute attention weights on positions of h using g. In

    general, we define αi = softmax(att(hi, g)), where i ranges over positions in h. The att

    function is an inner product in the Attention Sum Reader [7] and a bilinear product in

    the Stanford Reader [1].

    The computed attentions are then passed through a softmax function to form a probability

    distribution.
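As a concrete illustration of the two scoring functions, here is a small NumPy sketch with made-up shapes; it is not the authors’ code.

```python
# NumPy sketch of the two attention scoring functions; shapes are made up.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_inner_product(h, g):
    # Attention Sum Reader: att(h_i, g) = h_i . g
    return softmax(h @ g)            # h: (|d|, dim), g: (dim,) -> weights over positions

def attention_bilinear(h, g, W):
    # Stanford Reader: att(h_i, g) = h_i^T W g
    return softmax(h @ (W @ g))      # W: (dim, dim)

rng = np.random.default_rng(0)
h, g, W = rng.normal(size=(6, 4)), rng.normal(size=4), rng.normal(size=(4, 4))
print(attention_inner_product(h, g).sum(), attention_bilinear(h, g, W).sum())  # both 1.0
```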

The Gated-Attention Reader uses a richer attention architecture [3]. Each attention layer has its own question embedding function, and the document representation for the next layer is computed by the following GA function, where superscripts indicate layers:

d^{(k)} = GA(h^{(k)}, q^{(k)})

The GA function at each layer is defined as follows (superscripts omitted):


\alpha_i = \mathrm{softmax}(e(q)^\top h_i)    (2.1)

\tilde{q}_i = e(q)\,\alpha_i    (2.2)

d_i = h_i \odot \tilde{q}_i    (2.3)
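A minimal NumPy sketch of a single GA layer following Eqs. 2.1–2.3 is shown below; shapes and names are illustrative, and this is not the released implementation.

```python
# NumPy sketch of one gated-attention layer, following Eqs. 2.1-2.3; illustrative only.
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(h, e_q):
    """h:   (|d|, dim) context hidden vectors at this layer
       e_q: (dim, |q|) question embeddings at this layer
       returns d with d_i = h_i * (e_q @ alpha_i)."""
    alpha = softmax(e_q.T @ h.T, axis=0)  # (|q|, |d|): attention over question words per position
    q_tilde = e_q @ alpha                 # (dim, |d|): attended question vector for each position
    return h * q_tilde.T                  # element-wise gating, shape (|d|, dim)

rng = np.random.default_rng(0)
print(gated_attention(rng.normal(size=(6, 4)), rng.normal(size=(4, 5))).shape)  # (6, 4)
```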

    3. Output and Prediction: To output a prediction a∗, the Stanford Reader [1] computes

    the attention-weighted sum of the context vectors and then an inner product with each

    candidate answer:

c = \sum_{i=1}^{|d|} \alpha_i h_i

a^* = \operatorname{argmax}_{a \in A} \, o(a)^\top c

    where o(a) is the “output” embedding function. As the Stanford Reader was developed

    for the anonymized CNN/Daily Mail tasks, only a few entries in the output embedding

    function needed to be well-trained in their experiments. However, for LAMBADA, the

    set of possible answers can range over the entirety of V , making the output embedding

    function difficult to train. Therefore we also experiment with a modified version of the

    Stanford Reader that uses the same embedding function e for both input and output

    words:

a^* = \operatorname{argmax}_{a \in A} \, e(a)^\top W c    (2.4)

    where W is an additional parameter matrix used to match dimensions and model any

    additional needed transformation.
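The following sketch contrasts the original output step (separate output embeddings o(a)) with the modified one of Eq. 2.4 (shared embeddings e(a) plus a matrix W). It uses NumPy with toy embeddings; all names and dimensions are assumptions for illustration only.

```python
# NumPy sketch of the Stanford Reader output step and the modified variant (Eq. 2.4).
import numpy as np

def stanford_answer(alpha, h, candidates, o_emb):
    c = alpha @ h                                        # attention-weighted context vector
    return max(candidates, key=lambda a: o_emb[a] @ c)   # argmax_a o(a)^T c

def modified_stanford_answer(alpha, h, candidates, e_emb, W):
    c = alpha @ h
    return max(candidates, key=lambda a: e_emb[a] @ (W @ c))  # argmax_a e(a)^T W c

rng = np.random.default_rng(1)
alpha, h, W = np.array([0.2, 0.5, 0.3]), rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
emb = {"gabriel": rng.normal(size=4), "coffee": rng.normal(size=4)}
print(stanford_answer(alpha, h, ["gabriel", "coffee"], emb),
      modified_stanford_answer(alpha, h, ["gabriel", "coffee"], emb, W))
```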


  • For the Attention Sum and Gated-Attention Readers the answer is computed by:

\forall a \in A, \quad P(a \mid d, q) = \sum_{i \in I(a,d)} \alpha_i

a^* = \operatorname{argmax}_{a \in A} \, P(a \mid d, q)

    where I(a,d) is the set of positions where a appears in context d.
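In code, this pointer-style aggregation is a short loop. The sketch below uses made-up tokens and attention weights purely for illustration.

```python
# Sketch of the Attention Sum / Gated-Attention answer rule: sum the attention
# mass over every position where a candidate occurs, then take the argmax.
from collections import defaultdict

def attention_sum_answer(context_tokens, candidates, alphas):
    scores = defaultdict(float)
    for token, weight in zip(context_tokens, alphas):
        if token in candidates:
            scores[token] += weight       # P(a | d, q) = sum over I(a, d) of alpha_i
    return max(scores, key=scores.get)

context = ["heather", "said", "the", "craftsman", "said", "gabriel"]
alphas = [0.30, 0.05, 0.05, 0.10, 0.10, 0.40]
print(attention_sum_answer(context, {"heather", "gabriel", "craftsman"}, alphas))  # gabriel
```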

    2.3 The LAMBADA dataset

    The LAMBADA dataset [10] was designed by identifying word prediction tasks that require

    broad context. Each instance is drawn from the BookCorpus [16] and consists of a passage

    of several sentences where the task is to predict the last word of the last sentence.

The creation procedure of the LAMBADA dataset was designed to guarantee that two human subjects could correctly predict the target word when given the context, while at least ten human subjects were unable to guess the target word when given only the target sentence without the context. This filtering procedure finds cases that are guessable by humans when given the larger context but not when given only the last sentence.

    The expense of this manual filtering has limited the dataset to only about 10,000 instances

    which are viewed as development and test data. The training data is taken to be books in

    the corpus other than those from which the evaluation passages were extracted.

Table 2.1 shows two LAMBADA instances. Each instance consists of a context document, a target sentence with its final word removed, and a target word that needs to be predicted.

None of several state-of-the-art language models reaches an accuracy above 1% on the LAMBADA dataset, as reported by Paperno et al. [10]. They thus present LAMBADA as a challenging test set, meant to encourage the development of models that achieve genuine understanding of broad context in natural language text.


Context: “Why?” “I would have thought you’d find him rather dry,” she said. “I don’t know about that,” said Gabriel. “He was a great craftsman,” said Heather. “That he was,” said Flannery.
Target Sentence: “And Polish, to boot,” said
Target Word: Gabriel

Context: Both its sun-speckled shade and the cool grass beneath were a welcome respite after the stifling kitchen, and I was glad to relax against the tree’s rough, brittle bark and begin my breakfast of buttery, toasted bread and fresh fruit. Even the water was tasty, it was so clean and cold.
Target Sentence: It almost made up for the lack of
Target Word: coffee

Table 2.1: Example instances from LAMBADA.


  • CHAPTER 3

    METHODS

    In this chapter we describe how we model the LAMBADA dataset as a reading comprehension

    task, and the procedures we take to create a new training set from the original LAMBADA

    training set.

    3.1 LAMBADA as Reading Comprehension

As reported in [10], n-gram and LSTM language models achieve accuracies below 1% on the LAMBADA dataset. [10] claims that solving LAMBADA requires broader context information than traditional language models can capture. This inspired us to model LAMBADA as a reading comprehension task and to apply neural reading comprehension models to it.

For each instance, we feed the context sentences to the neural readers as the context document, and we feed the target sentence as the question. Some neural readers require a list of candidate target words to choose from, so a key decision is which candidate list to feed into the neural readers. In our experiments, we list all words in the context as candidate answers, except for punctuation.1 Although this choice sacrifices the instances whose target word does not appear in the context, it is a reasonable decision, as over 80% of LAMBADA development instances have the target word in the context.
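A minimal sketch of this casting is shown below. The PUNCT set is an illustrative stand-in for the punctuation list linked in the footnote (not the actual file), and to_rc_instance is a hypothetical helper, not code from the thesis.

```python
# Sketch of casting a LAMBADA instance as a cloze-style reading comprehension input.
PUNCT = {".", ",", "?", "!", ";", ":", "\"", "'", "``", "''", "(", ")", "--"}  # placeholder list

def to_rc_instance(context_tokens, target_sentence_tokens, target_word):
    question = target_sentence_tokens + ["<BLANK>"]          # the blank is always the final word
    candidates = sorted({w for w in context_tokens if w not in PUNCT})
    return {
        "context": context_tokens,
        "question": question,
        "candidates": candidates,                            # all non-punctuation context words
        "answer": target_word,
        "answerable": target_word in candidates,             # true for over 80% of dev instances
    }

inst = to_rc_instance(
    "Even the water was tasty , it was so clean and cold .".split(),
    "It almost made up for the lack of".split(),
    "coffee")
print(inst["answerable"])  # False: "coffee" does not occur in this context
```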

    3.2 Training Data Construction

    Each LAMBADA instance is divided into a context (4.6 sentences on average) and a target

    sentence, and the last word of the target sentence is the target word to be predicted. The

    1. This list of punctuation symbols is at https://raw.githubusercontent.com/ZeweiChu/lambada-dataset/master/stopwords/shortlist-stopwords.txt


  • LAMBADA dataset consists of development (dev) and test (test) sets; [10] also provide a

    control dataset (control), an unfiltered sample of instances from the BookCorpus.

    We construct a new training dataset from the BookCorpus. We restrict it to instances

    that contain the target word in the context. This decision is natural given our use of neural

    readers that assume the answer is contained in the passage. We also ensure that the context

    has at least 50 words and contains 4 or 5 sentences and we require the target sentences to

    have more than 10 words.
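A minimal sketch of these filtering rules follows, assuming whitespace tokenization and a pre-sentence-split passage; the helper name and tokenization are illustrative, not the thesis’s tooling.

```python
# Sketch of the filtering rules used to build the new training set; illustrative only.
def keep_instance(context_sentences, target_sentence):
    context_words = [w for s in context_sentences for w in s.split()]
    target_words = target_sentence.split()
    target_word = target_words[-1]                   # the last word is the word to predict
    return (target_word in context_words             # answer must appear in the context
            and len(context_words) >= 50             # context has at least 50 words
            and 4 <= len(context_sentences) <= 5     # context contains 4 or 5 sentences
            and len(target_words) > 10)              # target sentence has more than 10 words
```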

    Our new dataset contains 1,827,123 instances in total. We divide it into two parts, a

    training set (train) of 1,618,782 instances and a validation set (val) of 208,341 instances.

    These datasets can be found at the authors’ websites.


  • CHAPTER 4

    EXPERIMENTS

    We use the Stanford Reader [1], our modified Stanford Reader (Eq. 2.4), the Attention Sum

    (AS) Reader [7], and the Gated-Attention (GA) Reader [3]. We also add the simple features

    from [14] to the AS and GA Readers. The features are concatenated to the word embeddings

    in the context. They include: whether the word appears in the target sentence, the frequency

    of the word in the context, the position of the word’s first occurrence in the context as a

    percentage of the context length, and whether the text surrounding the word matches the

    text surrounding the blank in the target sentence. For the last feature, we only consider

    matching the left word since the blank is always the last word in the target sentence.
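A minimal sketch of how these four features could be computed for each context word is given below. The target sentence is assumed to be given without its final (blank) word, so its last token is the word immediately to the left of the blank; names and tokenization are illustrative.

```python
# Sketch of the four word-level features from [14] described above; illustrative only.
def word_features(context_tokens, target_sentence_tokens):
    in_target = set(target_sentence_tokens)
    left_of_blank = target_sentence_tokens[-1]        # word just left of the sentence-final blank
    n = len(context_tokens)
    freq, first_pos = {}, {}
    for i, w in enumerate(context_tokens):
        freq[w] = freq.get(w, 0) + 1
        first_pos.setdefault(w, i)
    return [[
        1.0 if w in in_target else 0.0,                                    # in target sentence
        float(freq[w]),                                                    # frequency in context
        first_pos[w] / n,                                                  # first occurrence / length
        1.0 if i > 0 and context_tokens[i - 1] == left_of_blank else 0.0,  # left word matches
    ] for i, w in enumerate(context_tokens)]
```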

    All models are trained end to end without any warm start and without using pretrained

embeddings. We train each reader on train for a maximum of 10 epochs, stopping when accuracy on dev decreases for two epochs in a row. We take the model from the epoch with the highest dev accuracy and evaluate it on test and control. val is not used.

    We evaluate several other baseline systems inspired by those of [10], but we focus on

    versions that restrict the choice of answers to non-stopwords in the context.1 We found this

    strategy to consistently improve performance even though it limits the maximum achievable

    accuracy.

    We consider two n-gram language model baselines. We use the SRILM toolkit [13] to

    estimate a 4-gram model with modified Kneser-Ney smoothing on the combination of train

    and val. One uses a cache size of 100 and the other does not use a cache. We use each

    model to score each non-stopword from the context. We also evaluate an LSTM language

    model. We train it on train, where the loss is cross entropy summed over all positions in

    each instance. The output vocabulary is the vocabulary of train, approximately 130k word

    types. At test time, we again limit the search to non-stopwords in the context.

    1. We use the stopword list from [12].


  • We also test simple baselines that choose particular non-stopwords from the context,

    including a random one, the first in the context, the last in the context, and the most

    frequent in the context.
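A minimal sketch of these context-restricted baselines follows; STOPWORDS is a small placeholder standing in for the stopword list from [12].

```python
# Sketch of the context-restricted baselines: random, first, last, most frequent non-stopword.
import random
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "it", "i"}  # placeholder list

def context_baselines(context_tokens, seed=0):
    cands = [w for w in context_tokens if w.lower() not in STOPWORDS]
    random.seed(seed)
    return {
        "random": random.choice(cands),
        "first": cands[0],
        "last": cands[-1],
        "most frequent": Counter(cands).most_common(1)[0][0],
    }

print(context_baselines("heather said the craftsman said gabriel".split()))
```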


  • CHAPTER 5

    RESULTS

    Table 5.1 shows our results. We report accuracies on the entirety of test and control

    (“all”), as well as separately on the part of test and control where the target word is in

    the context (“context”). The first part of the table shows results from [10]. We then show

    our baselines that choose a word from the context. Choosing the most frequent yields a

surprisingly high accuracy of 11.7%, which is better than all results from Paperno et al. [10].

    Our language models perform comparably, with the n-gram + cache model doing best.

By forcing language models to select a word from the context, the accuracy on test becomes much higher than that of the analogous models from Paperno et al. [10], though accuracy suffers on control.

    We then show results with the neural readers, showing that they give much higher accura-

    cies on test than all other methods. The GA Reader with the simple additional features [14]

    yields the highest accuracy, reaching 49.0%. We also measured the “top k” accuracy of this

    model, where we give the model credit if the correct answer is among the top k ranked

    answers. On test, we reach 65.4% top-2 accuracy and 72.8% top-3.

Figure 5.1 shows the accuracies of various models on the entirety of the LAMBADA test set.

    The AS and GA Readers work much better than the Stanford Reader. One cause appears

    to be that the Stanford Reader learns distinct embeddings for input and answer words,

    as discussed above. Our modified Stanford Reader, which uses only a single set of word

    embeddings, improves by 10.4% absolute. Since the AS and GA Readers merely score words

    in the context, they do not learn separate answer word embeddings and therefore do not

    suffer from this effect.

    We suspect the remaining accuracy difference between the Stanford and the other readers

    is due to the difference in the output function. The Stanford Reader was developed for the

    CNN and Daily Mail datasets, in which correct answers are anonymized entity identifiers


Method                          test: all   test: context   control: all   control: context

Baselines [10]
Random in context                  1.6          N/A              0              N/A
Random cap. in context             7.3          N/A              0              N/A
n-gram                             0.1          N/A             19.1            N/A
n-gram + cache                     0.1          N/A             19.1            N/A
LSTM                               0            N/A             21.9            N/A
Memory network                     0            N/A              8.5            N/A

Our context-restricted non-stopword baselines
Random                             5.6          6.7              0.3            2.2
First                              3.8          4.6              0.1            1.1
Last                               6.2          7.5              0.9            6.5
Most frequent                     11.7         14.4              0.4            8.1

Our context-restricted language model baselines
n-gram                            10.7         13.1              2.2           15.6
n-gram + cache                    11.8         14.5              2.2           15.6
LSTM                               9.2         11.1              2.4           16.9

Our neural reader results
Stanford Reader                   21.7         26.2              7.0           49.3
Modified Stanford Reader          32.1         38.8              7.4           52.3
AS Reader                         41.4         50.1              8.5           60.2
AS Reader + features              47.4         57.4              8.6           60.6
GA Reader                         44.5         53.9              8.8           62.5
GA Reader + features              49.0         59.4              9.3           65.6

Human                             86.0*         -               36.0†           -

Table 5.1: Accuracies on test and control datasets, computed over all instances (“all”) and separately on those in which the answer is in the context (“context”). The first section is from [10]. *Estimated from 100 randomly-sampled dev instances. †Estimated from 100 randomly-sampled control instances.

    which are reused across instances. Since the identifier embeddings are observed so frequently

    in the training data, they are frequently updated. In our setting, however, answers are

    words from a large vocabulary, so many of the word embeddings of correct answers may be

    undertrained. This could potentially be addressed by augmenting the word embeddings with

    identifiers to obtain some of the modeling benefits of anonymization [14].

All context-restricted models yield poor accuracies on the entirety of control. This is because only 14.1% of control instances have the target word in the context, which sets an upper bound on the accuracy these models can achieve.


  • Figure 5.1: Accuracy

    5.1 Manual Analysis

    One annotator, a native English speaker, sampled 100 instances randomly from dev, hid

    the final word, and attempted to guess it from the context and target sentence. The an-

    notator was correct in 86 cases. For the subset that contained the answer in the context,

    the annotator was correct in 79 of 87 cases. Even though two annotators were able to cor-

    rectly answer all LAMBADA instances during dataset construction [10], our results give an

    estimate of how often a third would agree. The annotator did the same on 100 instances

    randomly sampled from control, guessing correctly in 36 cases. These results are reported

    in Table 5.1. The annotator was correct on 6 of the 12 control instances in which the

    answer was contained in the context.

    We analyzed the 100 LAMBADA dev instances, tagging each with labels indicating the

    minimal kinds of understanding needed to answer it correctly.1 Each instance can have

    1. The annotations are available from the authors’ websites.


label                        #     GA Reader + features   human
single name cue              9            89%              100%
simple speaker tracking     19            84%              100%
basic reference             18            56%               72%
discourse inference rule    16            50%               88%
semantic trigger            20            40%               80%
coreference                 21            38%               90%
external knowledge          24            21%               88%
all                        100            55%               86%

Table 5.2: Labels derived from manual analysis of 100 LAMBADA dev instances. An instance can be tagged with multiple labels, hence the sum of instances across labels exceeds 100.

    multiple labels. We briefly describe each label below:

    • single name cue: the answer is clearly a name according to contextual cues and only a

    single name is mentioned in the context.

    • simple speaker tracking: instance can be answered merely by tracking who is speaking

    without understanding what they are saying.

    • basic reference: answer is a reference to something mentioned in the context; simple

    understanding/context matching suffices.

    • discourse inference rule: answer can be found by applying a single discourse inference rule,

such as the rule: “X left Y and went in search of Z” → Y ≠ Z.

    • semantic trigger: amorphous semantic information is needed to choose the answer, typ-

    ically related to event sequences or dialogue turns, e.g., a customer says “Where is the

    X?” and a supplier responds “We got plenty of X”.

    • coreference: instance requires non-trivial coreference resolution to solve correctly, typically

    the resolution of anaphoric pronouns.

    • external knowledge: some particular external knowledge is needed to choose the answer.


  • Table 5.2 shows the breakdown of these labels across instances, as well as the accuracy on

    each label of the GA Reader with features.

    The GA Reader performs well on instances involving shallower, more surface-level cues.

    In 9 cases, the answer is clearly a name based on contextual cues in the target sentence and

    there is only one name in the context; the reader answers all but one correctly. When only

    simple speaker tracking is needed (19 cases), the reader gets 84% correct.

    The hardest instances are those that involve deeper understanding, like semantic links,

    coreference resolution, and external knowledge. While external knowledge is difficult to

    define, we chose this label when we were able to explicitly write down the knowledge that

    one would use when answering the instances, e.g., one instance requires knowing that “when

    something explodes, noise emanates from it”. These instances make up nearly a quarter of

    those we analyzed, making LAMBADA a good task for work in leveraging external knowledge

    for language understanding.

    5.2 Discussion

    On control, while our readers outperform our other baselines, they are outperformed by

    the language modeling baselines from Paperno et al. This suggests that though we have

    improved the state of the art on LAMBADA by more than 40% absolute, we have not solved

    the general language modeling problem; there is no single model that performs well on both

    test and control. Our 36% estimate of human performance on control shows the

    difficulty of the general problem, and reveals a gap of 14% between the best language model

    and human accuracy.

    A natural question to ask is whether applying neural readers is a good direction for

    this task, since they fail on the 17% of instances which do not have the target word in the

    context. Furthermore, this subset of LAMBADA in which the answer is not in the context

    may in fact display the most interesting and challenging phenomena. Some neural readers,


  • like the Stanford Reader, can be easily used to predict target words that do not appear

    in the context, and the other readers can be modified to do so. Doing this will require a

    different selection of training data than that used above. However, we do wish to note that,

    in addition to the relative rarity of these instances in LAMBADA, we found them to be

    challenging for our annotator (who was correct on only 7 of the 13 in this subset).

    We note that train has similar characteristics to the part of control that contains the

    answer in the context (the final column of Table 5.1). We find that the ranking of systems

    according to this column is similar to that in the test column. This suggests that our

    simple method of dataset creation could be used to create additional training or evaluation

    sets for challenging language modeling problems like LAMBADA, perhaps by combining it

with baseline suppression [9]. Our experiments in solving the LAMBADA word prediction task with neural readers also suggest that language modeling tasks can be aided by reading comprehension models.


  • CHAPTER 6

    CONCLUSION

    We constructed a new training set for LAMBADA and used it to train neural readers to

    improve the state of the art from 7.3% to 49%. We also provided results with several other

    strong baselines and included a manual evaluation in an attempt to better understand the

    phenomena tested by the task. Our hope is that other researchers will seek models and

    training regimes that simultaneously perform well on both LAMBADA and control, with

    the goal of solving the general problem of language modeling.


  • REFERENCES

[1] Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proc. of ACL, 2016.

[2] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP, 2014.

[3] Bhuwan Dhingra, Hanxiao Liu, William W. Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. arXiv preprint, 2016.

[4] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proc. of NIPS, 2015.

[5] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The Goldilocks principle: Reading children’s books with explicit memory representations. In Proc. of ICLR, 2016.

[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.

[7] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. In Proc. of ACL, 2016.

[8] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint, 2017.

[9] Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. Who did What: A large-scale person-centered cloze dataset. In Proc. of EMNLP, 2016.

[10] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proc. of ACL, 2016.

[11] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP, 2016.

[12] Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proc. of EMNLP, 2013.

[13] Andreas Stolcke. SRILM - an extensible language modeling toolkit. In Proc. of Interspeech, 2002.

[14] Hai Wang, Takeshi Onishi, Kevin Gimpel, and David McAllester. Emergent logical structure in vector representations of neural readers. arXiv preprint, 2016.

[15] Yi Yang, Scott Wen-tau Yih, and Chris Meek. WikiQA: A challenge dataset for open-domain question answering. In Proc. of EMNLP, 2015.

[16] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proc. of ICCV, 2015.

  • CHAPTER 7

    APPENDIX

We show heat maps of the LAMBADA instances solved by the gated-attention sum reader. Words shaded in darker red are given higher probability by the gated-attention sum reader model. The first candidate word of each instance is the correct target word. We can see that in most cases, the gated-attention sum reader attends to only a few candidate words.

    Figure 7.1: instance 1


  • Figure 7.2: instance 2

    Figure 7.3: instance 3

    Figure 7.4: instance 4

    Figure 7.5: instance 5

    Figure 7.6: instance 6

    Figure 7.7: instance 7


  • Figure 7.8: instance 8

    Figure 7.9: instance 9

    Figure 7.10: instance 10
