A Survey of Current Neural Network Architectures for NLP
Márton Miháltz
Meltwater Group
Hungarian NLP Meetup
Jan 14, 2017
Outline
• Introduction
  • Short intro to NN concepts
• Recurrent neural networks
  • Long Short-Term Memory, Gated Recurrent Unit
• Recursive neural networks
  • Applications to sentiment analysis: Socher et al. 2013; Tai et al. 2015
• Convolutional neural networks
  • Applications to text classification: Kim 2014
• Some more recent architectures
  • Memory networks, attention models, hybrid architectures
• Tools
  • Theano, Torch, TensorFlow, Caffe, Keras
Very Short Intro to Modern Neural Networks
• Feed-forward neural network
  • Activation fn: tanh, ReLU, Leaky/Parametric ReLU, SoftPlus, …
  • Logistic regression or softmax function for the classification layer
  • Loss functions (objectives): categorical cross-entropy, neg. log likelihood, …
  • Training (optimizers): Gradient Descent, SGD, Mini-batch GD, RMSprop, Ada, Adagrad, Adam, Adamax, Nesterov Momentum, L-BFGS, …
• Input embeddings
  • 1-hot encoding
  • Random vectors
  • Pre-trained vectors, e.g. distributional similarity
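A minimal Keras sketch of such a feed-forward classifier (layer sizes, the toy data, and the Adam/cross-entropy choice are illustrative assumptions):

```python
# Feed-forward classifier: ReLU hidden layer, softmax output,
# categorical cross-entropy loss, mini-batch training with Adam.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=300))   # hidden layer with ReLU activation
model.add(Dense(2, activation='softmax'))                # softmax classification layer
model.compile(optimizer='adam',                          # could also be 'sgd', 'rmsprop', ...
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Toy data: 1000 pre-computed 300-dim input vectors, 2 classes (one-hot labels)
X = np.random.rand(1000, 300)
y = np.eye(2)[np.random.randint(0, 2, size=1000)]
model.fit(X, y, batch_size=32, epochs=5)                 # mini-batch gradient descent
```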
Further Reading (DL for NLP)
● Tutorials, Blogs
  ○ Denny Britz's blog (RNNs, CNNs for NLP, code etc.) -- code in Theano, TensorFlow
  ○ Christopher Olah's blog (architectures, DL for NLP etc.)
  ○ Andrej Karpathy's fun blog post about RNNs: generate Shakespeare, Paul Graham text, LaTeX source, C code etc. + nice LSTM activity visualizations
  ○ Deeplearning.net Tutorial -- code in Theano (Python)
● Courses
  ○ Richard Socher's Stanford course Deep Learning for Natural Language Processing -- code in TensorFlow
  ○ Stanford Unsupervised Feature Learning and Deep Learning Tutorial -- code in Matlab
  ○ Stanford course Convolutional Neural Networks for Visual Recognition (Andrej Karpathy)
● Other sources
  ○ Bengio's Deep Learning book
Why Deep Learning for NLP?
• Powerful apparatus for learning complex functions for ML
• Better at certain NLP tasks than previous methods
• Pre-trained distributed representation vectors
  • word2vec, GloVe, gensim, doc2vec, skip-thought vectors etc.
  • Vector space properties: similarity, analogies, compositionality etc. (see the gensim sketch below)
• Less feature engineering needed
  • The network learns abstract representations
• Transfer learning / domain adaptation
• Joint learning/execution of NLP steps possible
• Easy to go multimodal
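To illustrate the vector-space properties mentioned above, a small gensim sketch (the pre-trained GoogleNews vector file is an assumed, commonly used example):

```python
# Query pre-trained word2vec vectors with gensim: similarity and analogy.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                             binary=True)

print(vectors.similarity('movie', 'film'))               # cosine similarity of two words
print(vectors.most_similar(positive=['king', 'woman'],
                           negative=['man'], topn=3))     # analogy: king - man + woman ~ queen
```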
Recurrent Neural Networks
● About RNNs
  ○ Internal state depends on the state of the previous step (see the sketch below)
  ○ Good for sequential input
  ○ Trained with Backpropagation Through Time (BPTT)
● Applications
  ○ Language modeling (e.g. in machine translation)
  ○ Sequential labeling
  ○ Text generation (e.g. image description generation, together w/ a CNN)
● Problems with RNNs
  ○ Long sentences, long-term dependencies
  ○ Exponentially shrinking gradients ("vanishing gradients")
  ○ Solutions:
    ■ Initialization of weights; regularization; using ReLU activation fn.
    ■ RNN variations: bidirectional RNN, deep RNN etc.
    ■ Gated RNNs: LSTM, GRU
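The recurrence behind these properties, as a plain NumPy sketch (dimensions and weights are arbitrary assumptions):

```python
# Vanilla RNN cell: the internal state at step t depends on the state at step t-1.
import numpy as np

d_in, d_hid = 50, 100                       # assumed input / hidden sizes
W_xh = np.random.randn(d_hid, d_in) * 0.01
W_hh = np.random.randn(d_hid, d_hid) * 0.01
b_h = np.zeros(d_hid)

def rnn_step(x_t, h_prev):
    """One time step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(d_hid)
for x_t in np.random.randn(20, d_in):       # a toy sequence of 20 input vectors
    h = rnn_step(x_t, h)                    # repeated tanh + matrix products are what
                                            # make gradients shrink over long sequences
```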
LSTMs and GRUs
• Long Short-Term Memory networks
  • A special recurrent network
  • Has a memory cell (internal memory) (c)
  • 3 gates (input, forget, output): sigmoid layers with a pointwise multiplication operation (a vector of values in [0, 1])
  • The LSTM can remove or add information to the cell state, regulated by the gates, which optionally let information through
• Gated Recurrent Units
  • Another RNN variant
  • No internal memory separate from the internal state
  • 2 gates: reset, update (z)
  • Reset gate: how to combine the new input with the previous state; update gate: how much of the previous state to keep (see the sketch below)
[Figure: LSTM and GRU cell diagrams, adapted from Chung et al. 2014, with red labels added by the author]
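A NumPy sketch of one GRU step, matching the reset/update gate description above (sizes and random weights are illustrative only):

```python
# One GRU step: reset gate r controls how the new input is combined with the previous
# state; update gate z controls how much of the previous state is kept.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 100                                                  # assumed hidden size (= input size here)
W_z, U_z = np.random.randn(d, d), np.random.randn(d, d)
W_r, U_r = np.random.randn(d, d), np.random.randn(d, d)
W_h, U_h = np.random.randn(d, d), np.random.randn(d, d)

def gru_step(x_t, h_prev):
    z = sigmoid(W_z @ x_t + U_z @ h_prev)                # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev)                # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))    # candidate state
    return (1 - z) * h_prev + z * h_tilde                # interpolate old state and candidate

h = gru_step(np.random.randn(d), np.zeros(d))
```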
LSTMs and GRUs
• Overcome RNNs' long-term dependency limitations & the vanishing gradients problem
• Very hip in current NLP applications, e.g. SOTA in MT
• More complex architectures (sketched below):
  • Bi-directional LSTM
  • Stacked (deep) (B-)LSTM/GRU layers
  • Another extension: Grid-LSTM (Kalchbrenner et al. 2015)
  • Still evolving!
• LSTM vs. GRU: the jury is still out on which is better
  • GRU has fewer parameters, may be faster to train
  • LSTM may be better with more data
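A Keras sketch of the stacked bidirectional variants listed above (the tagging task, vocabulary size and dimensions are assumptions; swapping LSTM for GRU is a one-word change):

```python
# Stacked bidirectional LSTM for sequence labeling.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional, TimeDistributed, Dense

model = Sequential()
model.add(Embedding(input_dim=20000, output_dim=100, input_length=50))  # assumed vocab / dims
model.add(Bidirectional(LSTM(128, return_sequences=True)))              # B-LSTM layer 1
model.add(Bidirectional(LSTM(128, return_sequences=True)))              # stacked (deep) layer 2
model.add(TimeDistributed(Dense(10, activation='softmax')))             # per-token labels (10 assumed classes)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```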
Recursive Networks
• About recursive NNs
  • Hierarchical architecture
  • Shared weights
  • Plausible approach for modeling linguistic structures
• Sentiment Analysis with Recursive Networks (Socher et al. 2013)
  • Compositional processing of parsed input (e.g. able to handle negations)
  • Performs sentence-level sentiment classification on the Rotten Tomatoes dataset (Pang & Lee 2005): 11K movie review sentences, pos or neg
  • 85.5% accuracy on the binary-class subset, 45.7% on 5-class
  • Not a SOTA score any more, but was the first to go over 80% after 7 years
  • Sentiment Treebank for training
Recursive Neural Tensor Network
• Sentence words: embedding layer w/ random initial vectors (d = 25..35)
• Parse nodes: a compositionality function computes the representation, applied recursively
• Softmax classifier: pos-neg (or 5-class) label for each word & each parse node
● Weight tensor V
● Intuition: each slice of the tensor captures a specific type of composition (see the sketch below)
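A NumPy sketch of the RNTN composition for one parse node (d follows the small vector size above; the random weights are stand-ins):

```python
# RNTN composition: each slice V[k] of the tensor mixes the two child vectors
# in its own way (one "type of composition" per slice).
import numpy as np

d = 30                                          # word vector size (slides: d = 25..35)
W = np.random.randn(d, 2 * d) * 0.01            # standard recursive-NN weight matrix
V = np.random.randn(d, 2 * d, 2 * d) * 0.01     # tensor: one (2d x 2d) slice per output dimension

def compose(left, right):
    """Parent vector from two d-dimensional child vectors."""
    c = np.concatenate([left, right])           # [c1; c2], shape (2d,)
    tensor_part = np.array([c @ V[k] @ c for k in range(d)])  # c^T V[k] c for each slice
    return np.tanh(tensor_part + W @ c)

parent = compose(np.random.randn(d), np.random.randn(d))
```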
Sentiment Analysis with RNTN
Tree-LSTMs for Sentiment Analysis (Tai et al. 2015)
• Tree-LSTM
  • Uses constituency parsing
  • Uses GloVe word vectors, updated during training
  • Idea: sum the hidden states of the child vectors of tree nodes
  • Each child has its own forget gate (see the sketch below)
  • Polarity softmax classifiers on tree nodes
• Improves on Socher et al. 2013
  • Fine-grained sentence sentiment: 51.0% vs. 45.7%
  • Binary sentence sentiment: 88.0% vs. 85.4%
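A NumPy sketch of one Tree-LSTM node in the child-sum style described above (summed child hidden states, one forget gate per child); sizes and weights are illustrative, and the paper also defines a binary constituency variant:

```python
# One child-sum Tree-LSTM node: children's hidden states are summed,
# but every child gets its own forget gate over its memory cell.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d = 300, 150                                    # assumed GloVe input size / hidden size
W = {g: np.random.randn(d, d_in) * 0.01 for g in 'ifou'}
U = {g: np.random.randn(d, d) * 0.01 for g in 'ifou'}

def tree_lstm_node(x, child_h, child_c):
    """x: the node's input vector; child_h, child_c: lists of child hidden/cell states."""
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros(d)
    i = sigmoid(W['i'] @ x + U['i'] @ h_sum)                     # input gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_sum)                     # output gate
    u = np.tanh(W['u'] @ x + U['u'] @ h_sum)                     # candidate cell value
    f = [sigmoid(W['f'] @ x + U['f'] @ h_k) for h_k in child_h]  # one forget gate per child
    c = i * u + sum(f_k * c_k for f_k, c_k in zip(f, child_c))
    return o * np.tanh(c), c                                     # (h, c) for this node

h, c = tree_lstm_node(np.random.randn(d_in),
                      [np.random.randn(d), np.random.randn(d)],
                      [np.random.randn(d), np.random.randn(d)])
```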
Convolutional Neural Networks
• CNNs (ConvNets) widely used in image processing
  • Location invariance
  • Compositionality
  • Fast
• Convolution layers
  • "Sliding window" over the input representation: filter/kernel/feature generator
  • Local connectivity
  • Shared weights
• Hyperparameters
  • Wide vs. narrow convolution (padding)
  • Filter size (width, height, depth)
  • Number of filters per layer
  • Stride size
  • Channels (R, G, B)
CNNs for Text Classification
● Intuition: filter windows over sentence words <-> n-grams
● Advantage over Recursive NN/Tree-LSTM: does not require parsing
● Becoming a standard baseline for new text classification architectures
● Easy to parallelize on GPUs
CNN for Sentiment Analysis (Kim 2014)
• Sentence polarity classification (RT dataset / Sentiment Treebank)
  • 88.1% on binary sentiment classification
• Uses word2vec vectors
  • Sentences: concatenated word vectors
  • 2 channels: static word2vec vectors & vectors tuned via backprop
• Multiple window sizes (h = 3, 4, 5) and multiple filters (e.g. 100)
• Max-pooling applied to each feature map
  • Selects the most important feature from the feature map
• Penultimate layer: final feature vector
  • Concatenates all pooled features
• Final layer: softmax classifier (pos/neg sentiment)
• Regularization: dropout on the penultimate layer
  • Randomly sets some of the feature weights to 0
  • Prevents co-adaptation of hidden units during forward propagation (overfitting)
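A Keras sketch of a single-channel variant of this architecture (vocabulary size, sentence length and embedding initialization are assumptions; the static/tuned two-channel trick is omitted for brevity):

```python
# Kim (2014)-style CNN for sentence classification: parallel convolutions with window
# sizes 3, 4, 5 (100 filters each), max-over-time pooling, dropout, softmax.
from keras.models import Model
from keras.layers import (Input, Embedding, Conv1D, GlobalMaxPooling1D,
                          Dense, Dropout, concatenate)

max_len, vocab, emb_dim = 60, 20000, 300              # assumed sentence length / vocab / word2vec dim

words = Input(shape=(max_len,), dtype='int32')
emb = Embedding(vocab, emb_dim, input_length=max_len)(words)  # initialize with word2vec in practice

pooled = []
for h in (3, 4, 5):                                   # multiple window sizes
    conv = Conv1D(filters=100, kernel_size=h, activation='relu')(emb)
    pooled.append(GlobalMaxPooling1D()(conv))         # max-pooling keeps the strongest feature per filter

features = Dropout(0.5)(concatenate(pooled))          # penultimate layer: pooled features + dropout
output = Dense(2, activation='softmax')(features)     # pos/neg sentiment

model = Model(inputs=words, outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```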
Adaptation of Word Vectors
Summary
• Recursive NNs
  • Linguistically plausible, applicable to grammatical structures; needs parsing
• Recurrent NNs
  • Engineered for sequential input; current improvements with gated RNNs (LSTM, GRU etc.)
• Convolutional NNs
  • Exceptionally good for classification; unclear how to incorporate phrase-level structures; hard to interpret; needs zero padding; good for GPUs
Some Recent Work
• Memory Networks
  • MemN2N (Sukhbaatar et al. 2015): Facebook's bAbI Question Answering tasks 90-90%
  • Dynamic Memory Networks (Kumar, Irsoy et al. 2015): sentiment on the RT dataset 88.6%; episodic memory: input sequences, questions, reasoning about answers
• Attention models
  • Parsing (Vinyals & Hinton et al. 2015); Machine Translation (Bahdanau & Bengio et al. 2016)
  • Relation extraction with LSTM + attention (Zhou et al. 2016)
  • Sentence embeddings with attention model (Wang et al. 2016)
• Hybrid architectures
  • NER with BLSTM-CNN (Chiu & Nichols 2016): 91.62% CoNLL, 86.28% OntoNotes
  • Sequential labeling with BLSTM-CNN-CRF (Ma & Hovy 2016): 97.55% PoS, 91.21% NER
  • Sentiment Analysis using CNN-LSTM (Wang et al. 2016)
• Joint learning of NLP tasks
  • PoS-tagging, chunking and CC-tagging with one network (Søgaard & Goldberg 2016)
  • JEDI: joint learning of NER and RE (Kirschnick et al. 2016)
Tools for Hacking
● CUDA, cuDNN
  ○ You need these installed to utilize the GPU (Nvidia)
● Theano
  ○ Low level of abstraction; you define symbolic variables & functions; Python (see the sketch below)
● TensorFlow
  ○ Low level of abstraction; you define data flow graphs; C++, Python
● Torch
  ○ High abstraction level; very easy C interfacing; Lua
● Caffe
  ○ Very high level, simple JSON config, little versatility, most useful with convnets (C++/Python to extend)
● High-level wrappers
  ○ Keras: can bind to either TensorFlow or Theano; Python
  ○ SkFlow: wrapper around TensorFlow for those familiar with scikit-learn; Python
  ○ Pretty Tensor, TensorFlow-Slim: high-level wrapper functions for TensorFlow; Python
  ○ DIGITS: supports Caffe and Torch
● More
  ○ A nice overview here
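As a taste of Theano's "symbolic variables & functions" level of abstraction, a tiny sketch (the logistic-unit example is an assumption, not from the talk):

```python
# Theano in a nutshell: declare symbolic variables, build an expression graph,
# then compile it into a callable function (runs on the GPU if CUDA/cuDNN are set up).
import theano
import theano.tensor as T

x = T.dvector('x')                  # symbolic input vector
w = T.dvector('w')                  # symbolic weight vector
y = T.nnet.sigmoid(T.dot(w, x))     # symbolic expression: a logistic unit

f = theano.function([x, w], y)      # compile the graph into a function
print(f([1.0, 2.0], [0.5, -0.5]))   # evaluate with concrete numbers
```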
Thank you!