A Survey of Current Neural Network Architectures for NLP
Márton Miháltz
Meltwater Group
Hungarian NLP Meetup
Jan 14, 2017
Outline
• Introduction
  • Short intro to NN concepts
• Recurrent neural networks
  • Long Short-Term Memory, Gated Recurrent Unit
• Recursive neural networks
  • Applications to sentiment analysis: Socher et al. 2013; Tai et al. 2015
• Convolutional neural networks
  • Applications to text classification: Kim 2014
• Some more recent architectures
  • Memory networks, attention models, hybrid architectures
• Tools
  • Theano, Torch, TensorFlow, Caffe, Keras
Very Short Intro to Modern Neural Networks
• Feed-forward neural network
  • Activation fn: tanh, ReLU, Leaky/Parametric ReLU, SoftPlus, …
  • Logistic regression or softmax function for the classification layer
  • Loss functions (objectives): categorical cross-entropy, neg. log likelihood, …
  • Training (optimizers): Gradient Descent, SGD, Mini-batch GD, RMSprop, Ada, Adagrad, Adam, Adamax, Nesterov Momentum, L-BFGS, …
• Input embeddings
  • 1-hot encoding
  • Random vectors
  • Pre-trained vectors, e.g. distributional similarity
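A minimal Keras sketch of such a feed-forward classifier (layer sizes, the toy data, and the Adam/cross-entropy choice are illustrative assumptions):

```python
# Feed-forward classifier: ReLU hidden layer, softmax output,
# categorical cross-entropy loss, mini-batch training with Adam.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=300))   # hidden layer with ReLU activation
model.add(Dense(2, activation='softmax'))                # softmax classification layer
model.compile(optimizer='adam',                          # could also be 'sgd', 'rmsprop', ...
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Toy data: 1000 pre-computed 300-dim input vectors, 2 classes (one-hot labels)
X = np.random.rand(1000, 300)
y = np.eye(2)[np.random.randint(0, 2, size=1000)]
model.fit(X, y, batch_size=32, epochs=5)                 # mini-batch gradient descent
```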
Further Reading (DL for NLP)
● Tutorials, Blogs
  ○ Denny Britz's blog (RNNs, CNNs for NLP, code etc.) -- code in Theano, TensorFlow
  ○ Christopher Olah's blog (architectures, DL for NLP etc.)
  ○ Andrej Karpathy's fun blog post about RNNs: generate Shakespeare, Paul Graham text, LaTeX source, C code etc. + nice LSTM activity visualizations
  ○ Deeplearning.net Tutorial -- code in Theano (Python)
● Courses
  ○ Richard Socher's Stanford course Deep Learning for Natural Language Processing -- code in TensorFlow
  ○ Stanford Unsupervised Feature Learning and Deep Learning Tutorial -- code in Matlab
  ○ Stanford course Convolutional Neural Networks for Visual Recognition (Andrej Karpathy)
● Other sources
  ○ Bengio's Deep Learning book
Why Deep Learning for NLP?
• Powerful apparatus for learning complex functions for ML
• Better at certain NLP tasks than previous methods
• Pre-trained distributed representation vectors
  • word2vec, GloVe, gensim, doc2vec, skip-thought vectors etc.
  • Vector space properties: similarity, analogies, compositionality etc. (see the gensim sketch below)
• Less feature engineering needed
  • The network learns abstract representations
• Transfer learning / domain adaptation
• Joint learning/execution of NLP steps possible
• Easy to go multimodal
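To illustrate the vector-space properties mentioned above, a small gensim sketch (the pre-trained GoogleNews vector file is an assumed, commonly used example):

```python
# Query pre-trained word2vec vectors with gensim: similarity and analogy.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                             binary=True)

print(vectors.similarity('movie', 'film'))               # cosine similarity of two words
print(vectors.most_similar(positive=['king', 'woman'],
                           negative=['man'], topn=3))     # analogy: king - man + woman ~ queen
```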
Recurrent Neural Networks
● About RNNs
  ○ Internal state depends on the state of the previous step (see the sketch below)
  ○ Good for sequential input
  ○ Trained with Backpropagation Through Time (BPTT)
● Applications
  ○ Language modeling (e.g. in machine translation)
  ○ Sequential labeling
  ○ Text generation (e.g. image description generation, together w/ a CNN)
● Problems with RNNs
  ○ Long sentences, long-term dependencies
  ○ Exponentially shrinking gradients ("vanishing gradients")
  ○ Solutions:
    ■ Initialization of weights; regularization; using ReLU activation fn.
    ■ RNN variations: bidirectional RNN, deep RNN etc.
    ■ Gated RNNs: LSTM, GRU
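The recurrence behind these properties, as a plain NumPy sketch (dimensions and weights are arbitrary assumptions):

```python
# Vanilla RNN cell: the internal state at step t depends on the state at step t-1.
import numpy as np

d_in, d_hid = 50, 100                       # assumed input / hidden sizes
W_xh = np.random.randn(d_hid, d_in) * 0.01
W_hh = np.random.randn(d_hid, d_hid) * 0.01
b_h = np.zeros(d_hid)

def rnn_step(x_t, h_prev):
    """One time step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(d_hid)
for x_t in np.random.randn(20, d_in):       # a toy sequence of 20 input vectors
    h = rnn_step(x_t, h)                    # repeated tanh + matrix products are what
                                            # make gradients shrink over long sequences
```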
LSTMs and GRUs
• Long Short-Term Memory networks
  • A special recurrent network
  • Has a memory cell (internal memory) (c)
  • 3 gates (input, forget, output): sigmoid layers with a pointwise multiplication operation (a vector of values in [0, 1])
  • The LSTM can remove or add information to the cell state, regulated by the gates, which optionally let information through
• Gated Recurrent Units
  • Another RNN variant
  • No internal memory separate from the internal state
  • 2 gates: reset, update (z)
  • Reset gate: how to combine the new input with the previous state; update gate: how much of the previous state to keep (see the sketch below)
[Figure: LSTM and GRU cell diagrams, adapted from Chung et al. 2014, with red labels added by the author]
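A NumPy sketch of one GRU step, matching the reset/update gate description above (sizes and random weights are illustrative only):

```python
# One GRU step: reset gate r controls how the new input is combined with the previous
# state; update gate z controls how much of the previous state is kept.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 100                                                  # assumed hidden size (= input size here)
W_z, U_z = np.random.randn(d, d), np.random.randn(d, d)
W_r, U_r = np.random.randn(d, d), np.random.randn(d, d)
W_h, U_h = np.random.randn(d, d), np.random.randn(d, d)

def gru_step(x_t, h_prev):
    z = sigmoid(W_z @ x_t + U_z @ h_prev)                # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev)                # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))    # candidate state
    return (1 - z) * h_prev + z * h_tilde                # interpolate old state and candidate

h = gru_step(np.random.randn(d), np.zeros(d))
```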
LSTMs and GRUs
• Overcome RNNs' long-term dependency limitations & the vanishing gradients problem
• Very hip in current NLP applications, e.g. SOTA in MT
• More complex architectures (sketched below):
  • Bi-directional LSTM
  • Stacked (deep) (B-)LSTM/GRU layers
  • Another extension: Grid-LSTM (Kalchbrenner et al. 2015)
  • Still evolving!
• LSTM vs. GRU: the jury is still out on which is better
  • GRU has fewer parameters, may be faster to train
  • LSTM may be better with more data
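A Keras sketch of the stacked bidirectional variants listed above (the tagging task, vocabulary size and dimensions are assumptions; swapping LSTM for GRU is a one-word change):

```python
# Stacked bidirectional LSTM for sequence labeling.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional, TimeDistributed, Dense

model = Sequential()
model.add(Embedding(input_dim=20000, output_dim=100, input_length=50))  # assumed vocab / dims
model.add(Bidirectional(LSTM(128, return_sequences=True)))              # B-LSTM layer 1
model.add(Bidirectional(LSTM(128, return_sequences=True)))              # stacked (deep) layer 2
model.add(TimeDistributed(Dense(10, activation='softmax')))             # per-token labels (10 assumed classes)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```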
Recursive Networks
• About recursive NNs
  • Hierarchical architecture
  • Shared weights
  • Plausible approach for modeling linguistic structures
• Sentiment Analysis with Recursive Networks (Socher et al. 2013)
  • Compositional processing of parsed input (e.g. able to handle negations)
  • Performs sentence-level sentiment classification on the Rotten Tomatoes dataset (Pang & Lee 2005): 11K movie review sentences, pos or neg
  • 85.5% accuracy on the binary-class subset, 45.7% on 5-class
  • Not a SOTA score any more, but was the first to go over 80% after 7 years
  • Sentiment Treebank for training
Recursive Neural Tensor Network
• Sentence words: embedding layer w/ random initial vectors (d = 25..35)
• Parse nodes: a compositionality function computes the representation, applied recursively
• Softmax classifier: pos-neg (or 5-class) label for each word & each parse node
● Weight tensor V
● Intuition: each slice of the tensor captures a specific type of composition (see the sketch below)
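A NumPy sketch of the RNTN composition for one parse node (d follows the small vector size above; the random weights are stand-ins):

```python
# RNTN composition: each slice V[k] of the tensor mixes the two child vectors
# in its own way (one "type of composition" per slice).
import numpy as np

d = 30                                          # word vector size (slides: d = 25..35)
W = np.random.randn(d, 2 * d) * 0.01            # standard recursive-NN weight matrix
V = np.random.randn(d, 2 * d, 2 * d) * 0.01     # tensor: one (2d x 2d) slice per output dimension

def compose(left, right):
    """Parent vector from two d-dimensional child vectors."""
    c = np.concatenate([left, right])           # [c1; c2], shape (2d,)
    tensor_part = np.array([c @ V[k] @ c for k in range(d)])  # c^T V[k] c for each slice
    return np.tanh(tensor_part + W @ c)

parent = compose(np.random.randn(d), np.random.randn(d))
```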
Sentiment Analysis with RNTN
Tree-LSTMs for Sentiment Analysis (Tai et al. 2015)
• Tree-LSTM
  • Uses constituency parsing
  • Uses GloVe word vectors, updated during training
  • Idea: sum the hidden states of the child vectors of tree nodes
  • Each child has its own forget gate (see the sketch below)
  • Polarity softmax classifiers on tree nodes
• Improves on Socher et al. 2013
  • Fine-grained sentence sentiment: 51.0% vs. 45.7%
  • Binary sentence sentiment: 88.0% vs. 85.4%
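A NumPy sketch of one Tree-LSTM node in the child-sum style described above (summed child hidden states, one forget gate per child); sizes and weights are illustrative, and the paper also defines a binary constituency variant:

```python
# One child-sum Tree-LSTM node: children's hidden states are summed,
# but every child gets its own forget gate over its memory cell.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d = 300, 150                                    # assumed GloVe input size / hidden size
W = {g: np.random.randn(d, d_in) * 0.01 for g in 'ifou'}
U = {g: np.random.randn(d, d) * 0.01 for g in 'ifou'}

def tree_lstm_node(x, child_h, child_c):
    """x: the node's input vector; child_h, child_c: lists of child hidden/cell states."""
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros(d)
    i = sigmoid(W['i'] @ x + U['i'] @ h_sum)                     # input gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_sum)                     # output gate
    u = np.tanh(W['u'] @ x + U['u'] @ h_sum)                     # candidate cell value
    f = [sigmoid(W['f'] @ x + U['f'] @ h_k) for h_k in child_h]  # one forget gate per child
    c = i * u + sum(f_k * c_k for f_k, c_k in zip(f, child_c))
    return o * np.tanh(c), c                                     # (h, c) for this node

h, c = tree_lstm_node(np.random.randn(d_in),
                      [np.random.randn(d), np.random.randn(d)],
                      [np.random.randn(d), np.random.randn(d)])
```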
Convolutional Neural Networks
• CNNs (ConvNets) widely used in image processing
  • Location invariance
  • Compositionality
  • Fast
• Convolution layers
  • "Sliding window" over the input representation: filter/kernel/feature generator
  • Local connectivity
  • Shared weights
• Hyperparameters
  • Wide vs. narrow convolution (padding)
  • Filter size (width, height, depth)
  • Number of filters per layer
  • Stride size
  • Channels (R, G, B)
CNNs for Text Classification
● Intuition: filter windows over sentence words <-> n-grams
● Advantage over Recursive NN/Tree-LSTM: does not require parsing
● Becoming a standard baseline for new text classification architectures
● Easy to parallelize on GPUs
CNN for Sentiment Analysis (Kim 2014)
• Sentence polarity classification (RT dataset / Sentiment Treebank)
  • 88.1% on binary sentiment classification
• Uses word2vec vectors
  • Sentences: concatenated word vectors
  • 2 channels: static word2vec vectors & vectors tuned via backprop
• Multiple window sizes (h = 3, 4, 5) and multiple filters (e.g. 100)
• Max-pooling applied to each feature map
  • Selects the most important feature from the feature map
• Penultimate layer: final feature vector
  • Concatenates all pooled features
• Final layer: softmax classifier (pos/neg sentiment)
• Regularization: dropout on the penultimate layer
  • Randomly sets some of the feature weights to 0
  • Prevents co-adaptation of hidden units during forward propagation (overfitting)
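A Keras sketch of a single-channel variant of this architecture (vocabulary size, sentence length and embedding initialization are assumptions; the static/tuned two-channel trick is omitted for brevity):

```python
# Kim (2014)-style CNN for sentence classification: parallel convolutions with window
# sizes 3, 4, 5 (100 filters each), max-over-time pooling, dropout, softmax.
from keras.models import Model
from keras.layers import (Input, Embedding, Conv1D, GlobalMaxPooling1D,
                          Dense, Dropout, concatenate)

max_len, vocab, emb_dim = 60, 20000, 300              # assumed sentence length / vocab / word2vec dim

words = Input(shape=(max_len,), dtype='int32')
emb = Embedding(vocab, emb_dim, input_length=max_len)(words)  # initialize with word2vec in practice

pooled = []
for h in (3, 4, 5):                                   # multiple window sizes
    conv = Conv1D(filters=100, kernel_size=h, activation='relu')(emb)
    pooled.append(GlobalMaxPooling1D()(conv))         # max-pooling keeps the strongest feature per filter

features = Dropout(0.5)(concatenate(pooled))          # penultimate layer: pooled features + dropout
output = Dense(2, activation='softmax')(features)     # pos/neg sentiment

model = Model(inputs=words, outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```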
Adaptation of Word Vectors
Summary
• Recursive NNs
  • Linguistically plausible, applicable to grammatical structures; needs parsing
• Recurrent NNs
  • Engineered for sequential input; current improvements with gated RNNs (LSTM, GRU etc.)
• Convolutional NNs
  • Exceptionally good for classification; unclear how to incorporate phrase-level structures; hard to interpret; needs zero padding; good for GPUs
Some Recent Work
• Memory Networks
  • MemN2N (Sukhbaatar et al. 2015): Facebook's bAbI Question Answering tasks 90-90%
  • Dynamic Memory Networks (Kumar, Irsoy et al. 2015): sentiment on the RT dataset 88.6%; episodic memory: input sequences, questions, reasoning about answers
• Attention models
  • Parsing (Vinyals & Hinton et al. 2015); Machine Translation (Bahdanau & Bengio et al. 2016)
  • Relation extraction with LSTM + attention (Zhou et al. 2016)
  • Sentence embeddings with attention model (Wang et al. 2016)
• Hybrid architectures
  • NER with BLSTM-CNN (Chiu & Nichols 2016): 91.62% CoNLL, 86.28% OntoNotes
  • Sequential labeling with BLSTM-CNN-CRF (Ma & Hovy 2016): 97.55% PoS, 91.21% NER
  • Sentiment Analysis using CNN-LSTM (Wang et al. 2016)
• Joint learning of NLP tasks
  • PoS-tagging, chunking and CC-tagging with one network (Søgaard & Goldberg 2016)
  • JEDI: joint learning of NER and RE (Kirschnick et al. 2016)
Tools for Hacking
● CUDA, cuDNN
  ○ You need these installed to utilize the GPU (Nvidia)
● Theano
  ○ Low level of abstraction; you define symbolic variables & functions; Python (see the sketch below)
● TensorFlow
  ○ Low level of abstraction; you define data flow graphs; C++, Python
● Torch
  ○ High abstraction level; very easy C interfacing; Lua
● Caffe
  ○ Very high level, simple JSON config, little versatility, most useful with convnets (C++/Python to extend)
● High-level wrappers
  ○ Keras: can bind to either TensorFlow or Theano; Python
  ○ SkFlow: wrapper around TensorFlow for those familiar with scikit-learn; Python
  ○ Pretty Tensor, TensorFlow-Slim: high-level wrapper functions for TensorFlow; Python
  ○ DIGITS: supports Caffe and Torch
● More
  ○ A nice overview here
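As a taste of Theano's "symbolic variables & functions" level of abstraction, a tiny sketch (the logistic-unit example is an assumption, not from the talk):

```python
# Theano in a nutshell: declare symbolic variables, build an expression graph,
# then compile it into a callable function (runs on the GPU if CUDA/cuDNN are set up).
import theano
import theano.tensor as T

x = T.dvector('x')                  # symbolic input vector
w = T.dvector('w')                  # symbolic weight vector
y = T.nnet.sigmoid(T.dot(w, x))     # symbolic expression: a logistic unit

f = theano.function([x, w], y)      # compile the graph into a function
print(f([1.0, 2.0], [0.5, -0.5]))   # evaluate with concrete numbers
```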
Thank you!