Memory Networks, Neural Turing Machines, and Question Answering
Akram El-Korashy, Max Planck Institute for Informatics
Deep Learning Seminar, November 30, 2015. Papers by Weston et al. (ICLR 2015), Graves et al. (2014), and Sukhbaatar et al. (2015)
Outline: Introduction; QA Experiments, End-to-End; QA Experiments, Strongly Supervised; NTM code induction experiments; Summary
Human working memory is a capacity for the short-term storage of information and its rule-based manipulation...
Therefore, an NTM¹ resembles a working memory system, as it is designed to solve tasks that require the application of approximate rules to “rapidly-created variables”.
¹Neural Turing Machine. I will use it interchangeably with Memory Networks, depending on which paper I am citing.
The memory in these models is the state of the network, which is latent (i.e., hidden; no explicit access) and inherently unstable over long timescales. [Sukhbaatar2015]
Unlike a standard network, an NTM interacts with a memory matrix using selective read and write operations that can focus on (almost) a single memory location. [Graves2014]
Why memory networks? How about attention models with RNN encoders/decoders?
The memory model is indeed analogous to the attention mechanisms introduced for machine translation.
Main differences
- In a memory network model, the query can be made over multiple sentences, unlike machine translation.
- The memory model makes several hops on the memory before producing an output.
- The network architecture of the memory scoring is a simple linear layer, as opposed to the sophisticated gated architecture in previous work.
Memory as non-compact storage
Explicitly update memory slots m_i at test time by making use of a “generalization” component that determines “what” is to be stored from input x, and “where” to store it (choosing among the memory slots).
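As a minimal illustration, the sketch below assumes the simplest form of such a generalization component: the feature representation of each new input is written into the next free slot (the slot index plays the role of “where”, the stored features the role of “what”). The class and method names are hypothetical.

```python
import numpy as np

class SlotMemory:
    """Hypothetical sketch of a 'generalization' (G) component that
    writes each new input's features into the next free memory slot."""

    def __init__(self, num_slots, dim):
        self.slots = np.zeros((num_slots, dim))  # m_1 ... m_N
        self.next_free = 0

    def write(self, x_features):
        # "what": the feature representation of input x
        # "where": the next available slot index
        slot = self.next_free
        self.slots[slot] = x_features
        self.next_free += 1
        return slot
```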
Storing stories for Question Answering
Given a story (i.e., a sequence of sentences), training of the output component of the memory network can learn scoring functions (i.e., similarity) between query sentences and existing memory slots from previous sentences.
Note that for each question, only some subset of the statements contain information needed for the answer, and the others are essentially irrelevant distractors (e.g., the first sentence in the first example).
Figure: A given QA task consists of a set of statements, followed by a question whose answer is typically a single word. [Sukhbaatar2015]
Two different sentence representations: bag-of-words (BoW) and Position Encoding (PE).
BoW embeds each word and sums the resulting vectors, e.g., m_i = Σ_j A x_ij.
PE encodes the position of the word using a column vector l_j, where l_kj = (1 − j/J) − (k/d)(1 − 2j/J), J is the number of words in the sentence, and d is the embedding dimension.
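A minimal numpy sketch of Position Encoding, assuming 1-based indices j = 1..J (word position) and k = 1..d (embedding dimension) as in the formula above; the function names are illustrative.

```python
import numpy as np

def position_encoding(J, d):
    """l[k-1, j-1] = (1 - j/J) - (k/d) * (1 - 2*j/J), for j = 1..J, k = 1..d."""
    l = np.zeros((d, J))
    for j in range(1, J + 1):
        for k in range(1, d + 1):
            l[k - 1, j - 1] = (1 - j / J) - (k / d) * (1 - 2 * j / J)
    return l

def pe_sentence_vector(embedded_words, l):
    """m_i = sum_j l_j * (A x_ij), elementwise product per word.
    embedded_words: (J, d) matrix of embedded words A x_ij of sentence i."""
    return (l.T * embedded_words).sum(axis=0)
```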
Temporal Encoding: Modify the memory vector with a special matrix that encodes temporal information.²
Now, m_i = Σ_j A x_ij + T_A(i), where T_A(i) is the i-th row of a special temporal matrix T_A. All the T matrices are learned during training; they are subject to the same sharing constraints as between A and C.
²There isn’t enough detail on what constraints this matrix should be subject to, if any.
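A short sketch of temporal encoding in the BoW case, assuming T_A is a learned matrix with one row per memory slot; names are illustrative.

```python
import numpy as np

def memories_with_temporal_encoding(embedded_stories, T_A):
    """embedded_stories: (N, J, d) embedded words A x_ij of the N story sentences.
    T_A: (max_memories, d) learned temporal matrix.
    Returns m_i = sum_j A x_ij + T_A(i) for each sentence i."""
    bow = embedded_stories.sum(axis=1)   # (N, d) bag-of-words memory vectors
    return bow + T_A[:bow.shape[0]]      # add the i-th row of T_A to m_i
```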
- Embedding matrices A, B and C, as well as W, are jointly learned.
- The loss function is a standard cross-entropy between the predicted answer â and the true label a.
- Stochastic gradient descent is used with a learning rate of η = 0.01, with annealing.
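For concreteness, here is a rough numpy sketch of a single-hop forward pass and the cross-entropy loss, using BoW sentence vectors; the shapes and the single-hop restriction are simplifying assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_one_hop(story_bow, query_bow, A, B, C, W):
    """story_bow: (N, V) bag-of-words story sentences; query_bow: (V,) question.
    A, B, C: (d, V) embedding matrices; W: (V, d) final prediction matrix."""
    m = story_bow @ A.T            # (N, d) input memory vectors
    c = story_bow @ C.T            # (N, d) output memory vectors
    u = B @ query_bow              # (d,)   embedded question
    p = softmax(m @ u)             # (N,)   linear scoring + softmax over memories
    o = p @ c                      # (d,)   output (response) vector
    return softmax(W @ (o + u))    # (V,)   predicted answer distribution a_hat

def cross_entropy(a_hat, answer_index):
    # Standard cross-entropy between the prediction a_hat and the true label a.
    return -np.log(a_hat[answer_index] + 1e-12)
```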
Parameters and Techniques
- RN: Learning time invariance by injecting random noise to regularize T_A.
- LS: Linear start: remove all softmax layers except for the answer prediction layer; apply them back when the validation loss stops decreasing. (LS uses a learning rate of η = 0.005 instead of 0.01 for normal training.)
- LW: Layer-wise, RNN-like weight tying; otherwise, adjacent weight tying.
- BoW or PE: sentence representation.
- joint: training on all 20 tasks jointly vs. independently.
Figure: All variants of the end-to-end trained memory model comfortably beat the weakly supervised baseline methods. [Sukhbaatar2015]
The core of inference lies in the O and R modules. The O module produces output features by finding k supporting memories given x.
For k = 1, the highest scoring supporting memory is retrieved: o_1 = O_1(x, m) = argmax_{i=1,...,N} s_O(x, m_i).
For k = 2, a second supporting memory is additionally computed: o_2 = O_2(x, m) = argmax_{i=1,...,N} s_O([x, m_{o_1}], m_i).
In the single-word response setting, where W is the set of all words in the dictionary, the response is r = argmax_{w ∈ W} s_R([x, m_{o_1}, m_{o_2}], w).
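A minimal sketch of O-module inference for k = 2 and of the single-word R module; s_O and s_R are passed in as callables standing for the learned scoring functions (an assumption made purely for illustration).

```python
def find_supporting_memories(x, memories, s_O):
    """O module, k = 2: greedily pick the two highest scoring supporting memories."""
    n = len(memories)
    o1 = max(range(n), key=lambda i: s_O(x, memories[i]))
    # The second memory is scored against the question together with m_o1.
    o2 = max(range(n), key=lambda i: s_O([x, memories[o1]], memories[i]))
    return o1, o2

def respond_single_word(x, memories, o1, o2, vocabulary, s_R):
    """R module, single-word setting: r = argmax_w s_R([x, m_o1, m_o2], w)."""
    return max(vocabulary, key=lambda w: s_R([x, memories[o1], memories[o2]], w))
```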
Supporting sentence annotations are available as part of the training data. Thus, the scoring functions are trained by minimizing a margin ranking loss over the model parameters U_O and U_R using SGD.
Figure: For a given question x with true response r and supporting sentences m_{O1}, m_{O2} (i.e., k = 2), this expression is minimized over parameters U_O and U_R, where f̄, f̄′ and r̄ range over all choices other than the correct labels, and γ is the margin.
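The expression in the figure is a sum of three hinge (margin ranking) terms, one each for o_1, o_2 and r; below is a sketch under that reading (the γ value is purely illustrative).

```python
def hinge_terms(score_correct, scores_incorrect, gamma):
    """Sum of max(0, gamma - s(correct) + s(incorrect)) over all incorrect candidates."""
    return sum(max(0.0, gamma - score_correct + s) for s in scores_incorrect)

def margin_ranking_loss(s_o1, s_o1_neg, s_o2, s_o2_neg, s_r, s_r_neg, gamma=0.1):
    """s_o1, s_o2, s_r: scores of the correct first/second supporting memory and response.
    *_neg: lists of scores of all other (incorrect) candidates."""
    return (hinge_terms(s_o1, s_o1_neg, gamma)
            + hinge_terms(s_o2, s_o2_neg, gamma)
            + hinge_terms(s_r, s_r_neg, gamma))
```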
Figure: Results on a QA dataset with 14M statements.
Hashing techniques for efficient memory scoring
Cluster hash: Run K-means to cluster the word vectors (U_O)_i, giving K buckets. Hash a sentence into all buckets in which its words belong.
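A rough sketch of the cluster hash, using a toy K-means over the rows of U_O (the word embedding vectors); the helper names are made up for illustration.

```python
import numpy as np

def toy_kmeans(word_vectors, K, iters=20, seed=0):
    """Assign each word vector (a row of U_O) to one of K buckets."""
    rng = np.random.default_rng(seed)
    centers = word_vectors[rng.choice(len(word_vectors), K, replace=False)]
    for _ in range(iters):
        dists = ((word_vectors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for k in range(K):
            if (assign == k).any():
                centers[k] = word_vectors[assign == k].mean(axis=0)
    return assign  # bucket index per word

def hash_sentences(sentences, word_bucket, K):
    """Place each sentence (a list of word indices) into every bucket that one of
    its words belongs to; only those buckets need to be scored at query time."""
    buckets = {k: [] for k in range(K)}
    for idx, words in enumerate(sentences):
        for b in {word_bucket[w] for w in words}:
            buckets[b].append(idx)
    return buckets
```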
Figure: The task is a simple simulation of 4 characters, 3 objects and 5 rooms, with characters moving around, picking up and dropping objects. (Similar to the 10k dataset of MemN2N.)
Figure: Sample test set predictions (in red) for the simulation in the setting of word-based input, where answers are sentences and an LSTM is used as the R component of the MemNN.
Figure: Content addressing is implemented by learning similarity measures, analogous to MemNN. Additionally, the controller offers simulation of location-based addressing by implementing a rotational shift of a weighting.
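A small numpy sketch of the two addressing modes mentioned above: content addressing as a softmax over a sharpened cosine similarity between an emitted key and each memory row, and a rotational shift of the resulting weighting via circular convolution; parameter names are illustrative.

```python
import numpy as np

def content_addressing(key, memory, beta):
    """w_i proportional to exp(beta * cosine(key, M_i)): focus by content similarity."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    e = np.exp(beta * sims)
    return e / e.sum()

def rotational_shift(w, shift_dist):
    """Location-based addressing: circularly convolve the weighting w with a
    small distribution over integer shifts, e.g. shift_dist over [-1, 0, +1]."""
    half = len(shift_dist) // 2
    out = np.zeros_like(w)
    for s, p in zip(range(-half, half + 1), shift_dist):
        out += p * np.roll(w, s)
    return out
```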
Figure: The networks were trained to copy sequences of eight-bit random vectors, where the sequence lengths were randomized between 1 and 20. An NTM with an LSTM controller was used.
Summary
- Intuition of memory networks vs. standard neural network models.
- MemNN is successful through strongly supervised learning in QA tasks.
- MemN2N is used with more realistic end-to-end training, and is competent enough on the same tasks.
- NTMs can learn simple memory copy and recall tasks from input-memory, output-memory training data.