Recurrent Neural Network Based Language Model
Author: Tomáš Mikolov et al., Johns Hopkins University, USA
Presented by: Vicky Xuening Wang
ECS 289G, Nov 2015, UC Davis
Tomáš Mikolov¹,², Martin Karafiát¹, Lukáš Burget¹, Jan “Honza” Černocký¹, Sanjeev Khudanpur²
Language Model Tasks
• Statistical/Probabilistic Language Models
• Goal: compute the probability of a sentence or sequence of words:
• P(W) = P(w1, w2, w3, ..., wn)
• Related task: predict the probability of an upcoming word: P(wn | w1, w2, ..., wn−1) (a worked factorization follows after this list)
• Recurrent Neural Network based language model (RNN-LM) outperforms standard backoff N-gram models
• Words are projected into a low-dimensional space; similar words are automatically clustered together.
• Smoothing is solved implicitly.
• Backpropagation is used for training.
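As background (standard language-modeling identities, not results from the slides): the sentence probability factorizes by the chain rule, and a backoff N-gram model approximates each factor with a truncated history:

P(W) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})

P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})   (N-gram approximation)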
Overview
(Figure: model with fixed-length context)
Model Description - RNN
• Input layer x
• Hidden/context layer s
• Output layer y
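A minimal sketch of one time step of this architecture, as a plain Elman-style RNN language model; the vocabulary size, hidden size, and variable names are illustrative assumptions, not the authors' code. The recurrent term W @ s_prev is equivalent to the paper's formulation where the input x(t) is the concatenation [w(t); s(t-1)].

import numpy as np

V_SIZE, H_SIZE = 10000, 100                     # assumed vocabulary and hidden-layer sizes
U = np.random.randn(H_SIZE, V_SIZE) * 0.01      # input  -> hidden weights
W = np.random.randn(H_SIZE, H_SIZE) * 0.01      # recurrent hidden -> hidden weights
V_out = np.random.randn(V_SIZE, H_SIZE) * 0.01  # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(word_index, s_prev):
    # Input layer x(t): 1-of-V encoding of the current word w(t)
    x = np.zeros(V_SIZE)
    x[word_index] = 1.0
    # Hidden/context layer s(t), fed by the current word and the previous state s(t-1)
    s = sigmoid(U @ x + W @ s_prev)
    # Output layer y(t): probability distribution over the next word
    y = softmax(V_out @ s)
    return s, y

# Usage: run the network over a sentence one word at a time
s = np.zeros(H_SIZE)
for w in [42, 7, 123]:      # assumed word indices
    s, y = rnn_step(w, s)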
13
Model Description - RNN Cont'd
• An RNN can be seen as a chain of NNs
• Intimately related to sequences and lists
• In the last few years, RNNs have been successfully applied to: speech recognition, language modeling, translation, image captioning…
RNN vs. FF
• Parameters to tune or select:
• RNN
• Size of hidden layer
• FF
• size of layer that projects words to low dimensional space
• size of hidden layer
• length of the context (N − 1 previous words)
RNN vs. FF
• In feedforward networks, history is represented by a context of N − 1 words, so it is limited in the same way as in N-gram backoff models.
• In recurrent networks, history is represented by neurons with recurrent connections, so the history length is unlimited.
• Also, recurrent networks can learn to compress the whole history into a low-dimensional space, while feedforward networks compress (project) just a single word.
• Recurrent networks have the ability to form a short-term memory, so they can better deal with position invariance; feedforward networks cannot do that.
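A hypothetical illustration (not the authors' code) of the difference in what each model can condition on when predicting the word at position t; the function and parameter names are assumptions:

import numpy as np

def ff_history(sentence, t, embed, N):
    # Feedforward LM: the history is only the last N-1 word embeddings,
    # concatenated into one fixed-length input vector (assumes t >= N-1).
    window = sentence[t - (N - 1):t]
    return np.concatenate([embed[w] for w in window])

def rnn_history(sentence, t, rnn_step, hidden_size):
    # Recurrent LM: the history is whatever has been folded into the hidden
    # state, updated word by word with no fixed length limit.
    s = np.zeros(hidden_size)
    for w in sentence[:t]:
        s, _ = rnn_step(w, s)
    return s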
Comparison of models
Simple experiment on 4M words from the Switchboard corpus (perplexity, lower is better):
• KN 5gram (baseline): 93.7
• FF: 85.1
• RNN: 80
• 4*RNN+KN5: 73.5
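A sketch of how such a combination is usually formed: "4*RNN+KN5" is an ensemble of four RNN models linearly interpolated with the KN 5-gram model. The interpolation weight lam below is an illustrative assumption, not a value reported on the slide.

def interpolated_prob(p_rnn_models, p_kn5, lam=0.5):
    # Average the RNN ensemble's probabilities for a word, then linearly
    # interpolate with the backoff N-gram probability for the same word.
    p_rnn = sum(p_rnn_models) / len(p_rnn_models)
    return lam * p_rnn + (1.0 - lam) * p_kn5

# Example: interpolated_prob([0.012, 0.010, 0.011, 0.013], 0.008)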
Model setting
• Standard backpropagation algorithm + SGD
• Train in several epochs:
• start with learning rate α = 0.1
• after each epoch, if the log-likelihood of the validation data increases, keep α and continue
• else set α = 0.5·α and continue
• terminate when there is no significant improvement
• Convergence is usually reached within 10-20 epochs
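A compact sketch of this learning-rate schedule; train_one_epoch and validation_log_likelihood are hypothetical helpers standing in for the actual training and evaluation code, and the stopping rule is a plausible reading of "no significant improvement":

def train(model, train_data, valid_data, alpha=0.1, max_epochs=50):
    prev_ll = float("-inf")
    halving = False
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data, alpha)             # hypothetical helper
        ll = validation_log_likelihood(model, valid_data)     # hypothetical helper
        if ll <= prev_ll:
            if halving:
                break            # still no improvement after halving: terminate
            halving = True       # start halving the learning rate
        if halving:
            alpha *= 0.5
        prev_ll = ll
    return model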
Model setting - Optimization
• Rare token: merge all words occurring less often than a threshold in the training data into a single rare token, whose probability is distributed uniformly among the merged words
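A minimal sketch of this preprocessing step; the threshold value and the "<rare>" symbol are illustrative assumptions:

from collections import Counter

def merge_rare_words(tokens, threshold=5, rare_symbol="<rare>"):
    # Count word frequencies in the training data and replace every word that
    # occurs less often than the threshold with a single shared rare token.
    counts = Counter(tokens)
    return [w if counts[w] >= threshold else rare_symbol for w in tokens]

# At prediction time, the probability assigned to <rare> is spread uniformly
# over the C_rare distinct words it stands for: P(w) = P(<rare>) / C_rare.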