Recurrent Neural Network Based Language Model
Authors: Tomáš Mikolov et al., Johns Hopkins University, USA
Presented by : Vicky Xuening Wang
ECS 289G, Nov 2015, UC Davis
Tomáš Mikolov1,2, Martin Karafiát1, Lukáš Burget1, Jan “Honza” Černocký1, Sanjeev Khudanpur2
Language Model Tasks
• Statistical/Probabilistic Language Models
• Goal: compute the probability of a sentence or sequence of words:
• P(W) = P(w1,w2,w3,w4,w5...wn)
• Related task: predict probability of an upcoming word:
• P(wn | w1, w2, w3, ..., wn-1)
Introduction - Language model
https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
• Chain rule of probability
• Markov assumption
• N-gram model
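As a toy illustration (my own example, not from the slides): the chain rule factors the sentence probability into conditional probabilities, and the Markov assumption truncates each history, which gives the N-gram model. A minimal Python sketch for the bigram (N = 2) case, with a made-up probability table:

# Toy bigram model: the probability table below is made up for illustration.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "saw"): 0.10,
    ("saw", "a"): 0.30,
    ("a", "van"): 0.05,
}

def sentence_prob(words):
    # Chain rule + first-order Markov assumption: P(w1..wn) ~= prod_i P(wi | wi-1)
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram_prob.get((prev, w), 1e-6)  # tiny floor for unseen bigrams
        prev = w
    return p

print(sentence_prob("i saw a van".split()))  # 0.25 * 0.10 * 0.30 * 0.05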
Introduction
https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
• Typical tasks:
• Machine Translation: P(high winds tonite) > P(large winds tonite)
• Spell Correction: P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition: P(I saw a van) >> P(eyes awe of an)
• Summarization, question answering, etc.
Introduction - LM tasks
https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
Introduction - Bigram model
https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
Maximum Likelihood Estimation: P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
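A minimal sketch of the MLE bigram estimate on a toy corpus (the corpus and names are illustrative only):

from collections import Counter

corpus = "i saw a van i saw a cat".split()          # toy training corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w, prev):
    # Maximum likelihood estimate: P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_mle("saw", "i"))   # 2/2 = 1.0
print(p_mle("cat", "a"))   # 1/2 = 0.5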
Introduction - Perplexity
https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
Lower is better!
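Perplexity is the exponentiated average negative log-probability the model assigns to the test words: PPL = exp(-(1/N) * sum_i log P(wi | history)). A minimal sketch (the probabilities below are made up):

import math

def perplexity(word_probs):
    # word_probs: probability the model assigned to each test word, in order
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

print(perplexity([0.1, 0.25, 0.05, 0.2]))  # ~7.95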
Introduction - WER
Lower is better!
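WER (word error rate) is the word-level edit distance between the recognizer output and the reference transcript, normalized by the reference length: WER = (substitutions + deletions + insertions) / N. A minimal sketch, not the scoring tool used in the paper:

def wer(reference, hypothesis):
    # Word error rate via word-level edit distance, divided by the reference length.
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("i saw a van", "eyes awe of van"))  # 3 errors / 4 reference words = 0.75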
• Recurrent Neural Network based language model (RNN-LM) outperforms standard backoff N-gram models
• Words are projected into a low-dimensional space; similar words are automatically clustered together.
• Smoothing is solved implicitly.
• Backpropagation is used for training.
Overview
(Figure: fixed-length context)
Model Description - RNN
• Input layer x
• Hidden/context layer s
• Output layer y
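The forward pass of this simple recurrent (Elman-style) architecture: x(t) concatenates the current word (1-of-N encoded) with the previous hidden state s(t-1); the hidden layer applies a sigmoid and the output layer a softmax over the vocabulary. A minimal numpy sketch; the sizes are arbitrary and the weight names U, V follow the paper's notation:

import numpy as np

VOCAB, HIDDEN = 10000, 100                           # illustrative sizes
U = 0.1 * np.random.randn(HIDDEN, VOCAB + HIDDEN)    # input -> hidden weights
V = 0.1 * np.random.randn(VOCAB, HIDDEN)             # hidden -> output weights

def step(word_onehot, s_prev):
    # x(t) = [w(t); s(t-1)],  s(t) = sigmoid(U x(t)),  y(t) = softmax(V s(t))
    x = np.concatenate([word_onehot, s_prev])
    s = 1.0 / (1.0 + np.exp(-(U @ x)))               # hidden/context layer
    z = V @ s
    y = np.exp(z - z.max())
    return y / y.sum(), s                            # next-word distribution, new context

s = np.zeros(HIDDEN)                                 # empty initial context
w = np.zeros(VOCAB); w[42] = 1.0                     # 1-of-N encoding of the current word
y, s = step(w, s)                                    # y sums to 1 over the vocabulary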
Model Description - RNN (cont'd)
• An RNN can be seen as a chain of copies of the same network.
• It is intimately related to sequences and lists.
• In the last few years, RNNs have been successfully applied to speech recognition, language modeling, translation, image captioning, ...
RNN vs. FF
• Parameters to tune or select:
• RNN:
• size of the hidden layer
• FF:
• size of the layer that projects words into the low-dimensional space
• size of the hidden layer
• context length (N)
RNN vs. FF
• In feedforward networks, history is represented by a context of N − 1 words, so it is limited in the same way as in N-gram backoff models.
• In recurrent networks, history is represented by neurons with recurrent connections, so the history length is unlimited.
• Recurrent networks can also learn to compress the whole history into a low-dimensional space, while feedforward networks compress (project) just a single word.
• Recurrent networks have the possibility to form a short-term memory, so they can better deal with position invariance; feedforward networks cannot do that.
Comparison of models
• Simple experiment on 4M words from the Switchboard corpus (perplexity, lower is better):
• KN 5-gram (baseline): 93.7
• FF: 85.1
• RNN: 80
• 4*RNN + KN5: 73.5
Model setting
• Standard backpropagation algorithm + SGD
• Training runs over several epochs:
• start with learning rate α = 0.1
• if the log-likelihood of the validation data increases, continue
• else set α = 0.5·α and continue
• terminate if there is no significant improvement
• Convergence is usually reached after 10-20 epochs
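A sketch of this schedule (validation-driven halving of the learning rate); the model interface (run_epoch, log_likelihood), the epoch cap, and the improvement threshold are my own placeholders:

def train(model, train_data, valid_data, alpha=0.1, max_epochs=30, min_gain=1e-3):
    # One SGD + backpropagation pass per epoch; halve alpha when validation stalls, then stop.
    best_ll, halved = float("-inf"), False
    for epoch in range(max_epochs):
        model.run_epoch(train_data, alpha)
        ll = model.log_likelihood(valid_data)
        if ll > best_ll + min_gain:      # significant improvement: keep the current alpha
            best_ll = ll
        elif not halved:                 # first stall: alpha = 0.5 * alpha and continue
            alpha *= 0.5
            halved = True
        else:                            # no significant improvement after halving: terminate
            break
    return model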
Model setting - Optimization
• Rare token: merge all words occurring less often than a threshold in the training data into a single rare token, whose probability is distributed uniformly among those words.
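A sketch of this merging step; the token name, threshold, and helper names are illustrative:

from collections import Counter

def build_mapping(tokens, threshold=5):
    # Words seen fewer than `threshold` times are mapped to a shared <rare> token.
    counts = Counter(tokens)
    mapping = {w: (w if c >= threshold else "<rare>") for w, c in counts.items()}
    n_rare = sum(1 for c in counts.values() if c < threshold)
    return mapping, n_rare

def word_prob(model_prob, word, mapping, n_rare):
    # The <rare> probability mass is spread uniformly over the merged words.
    token = mapping.get(word, "<rare>")
    return model_prob[token] / max(n_rare, 1) if token == "<rare>" else model_prob[token]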
Experiments
• WSJ (source: read text only)
• training corpus consists of 37M words
• baseline: KN5 - modified Kneser-Ney smoothed 5-gram
• RNN LM - trained on a 6.4M-word subset (300K sentences)
• combined model: 0.75·RNN + 0.25·backoff (linear interpolation; see the sketch after this list)
• NIST RT05 (115 hours of meeting speech + web data)
• more than 1.3G words
• RNN LM trained on a 5.4M-word subset
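The combined model mentioned above is a per-word linear interpolation with weight 0.75 on the RNN and 0.25 on the backoff model; a minimal sketch (the probability values are made up):

def interpolated_prob(p_rnn, p_backoff, lam=0.75):
    # P(w | h) = lam * P_rnn(w | h) + (1 - lam) * P_backoff(w | h)
    return lam * p_rnn + (1.0 - lam) * p_backoff

# Example: the RNN assigns 0.02 and the KN5 backoff model assigns 0.01 to the next word.
print(interpolated_prob(0.02, 0.01))  # 0.0175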
Best perplexity result is 112, for a mixture of static and dynamic RNN LMs with a larger learning rate (0.3).
• ~50% reduction in perplexity and 18% reduction in WER relative to the baseline
12% improvement
• RNNs are trained only on in-domain data (5.4M words)
• The baseline RT05 and RT09 LMs are trained on more than 1.3G words
Summary
• RNN LM is simple and intelligent.
• RNN LMs can be competitive with backoff LMs that are trained on much more data.
• Results show interesting improvements both for ASR and MT.
• A simple toolkit has been developed that can be used to train RNN LMs.
• This work provides a clear connection between machine learning, data compression, and language modeling.
Future work
• Clustering of the vocabulary to speed up training
• Parallel implementation of the neural network training algorithm
• Online learning / dynamic models
• BPTT (backpropagation through time) for large amounts of training data
• Going beyond BPTT? LSTM
• Extensions to OCR, data compression, cognitive sciences, ...
Thanks!
– Xuening