Deep Lip Reading: a comparison of models and an online application
January 20, 2021
Outline
1 Context & Motivation
2 LipNet
3 Deep Lip Reading
Context & Motivation

Lip reading: what is it and what role does it play?
- The ability to recognize what is being said based on visual information
- It plays a crucial role in human communication and speech understanding [McGurk and MacDonald, 1976]
  - Babies selectively attend to their interlocutor's mouth during social interactions [Lewkowicz and Hansen-Tift, 2012]
- It is a difficult task for humans, especially in the absence of context
- Multiple sounds (phonemes) have almost identical lip shapes (i.e., visemes)

Figure 1: "Bark" pronunciation
Figure 2: "Mark" pronunciation
Human lipreading performance is normally poor
- Hearing-impaired people's accuracy is only [Easton and Basala, 1982]
  - 17% ± 12% for 30 monosyllabic words
  - 21% ± 11% for 30 compound words
- Applications are enormous, including
  - improved hearing aids
  - silent dictation in public spaces
  - speech recognition in noisy environments
  - silent-movie processing
- Automating lipreading is therefore an important goal
- Machine lipreading requires extracting spatiotemporal features from the videos
- Deep learning approaches offer an end-to-end strategy to extract these features
Outline
1 Context & Motivation
2 LipNet
  Pre-deep learning and first deep learning attempts
  Results
  Takeaways
3 Deep Lip Reading
LipNet
Pre-deep learning and first deep learning attempts

Speaker generalization and motion feature extraction were the main issues

Task: Given a silent video of a talking face, predict the sentence being spoken

- Many works focused on video and image preprocessing to extract different features [Zhou et al., 2014]
  - Hidden Markov models (HMM) and Gaussian mixture models (GMM) combined with hand-engineered features
  - Speaker-dependent accuracy and/or limited utterances
- First deep learning attempts were limited to word or phoneme classification
  - Fixed sequence sizes
  - Speaker-dependent
  - Lacked sequence prediction
- Connectionist temporal classification (CTC) loss [Graves et al., 2006] made alignment-free sequence training possible (see the sketch below)
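To make the CTC idea concrete: the loss aligns frame-wise character scores to a much shorter transcription without any per-frame labels. Below is a minimal sketch, assuming PyTorch (not used in the original papers); all sizes are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 75 video frames, batch of 2, alphabet of 26 letters
# plus space, plus the CTC blank symbol at index 0.
T, B, C = 75, 2, 28

# Stand-in for per-frame character scores produced by a network.
log_probs = torch.randn(T, B, C).log_softmax(dim=2)

# Targets are shorter than the frame sequence and carry no information
# about which frames each character spans.
targets = torch.randint(1, C, (B, 10), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

# CTC marginalizes over all valid frame-to-character alignments.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```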
First to show an end-to-end strategy for lipreading
- Maps variable-length sequences of video frames to text sequences
- Trained on the GRID corpus: 33,000 sentences

Figure 3: LipNet architecture. Source: Assael et al., 2016
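As a rough companion to Figure 3, here is a minimal LipNet-style skeleton, assuming PyTorch; the layer sizes are illustrative and not the exact configuration of Assael et al., 2016.

```python
import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    """Spatiotemporal convolutions -> BLSTM -> per-frame character scores."""
    def __init__(self, num_chars: int = 28):
        super().__init__()
        # Spatiotemporal front-end: 3D convolutions over (time, height, width).
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only, keep time
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Two stacked bidirectional LSTMs over the frame axis.
        self.blstm = nn.LSTM(input_size=64, hidden_size=256, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 256, num_chars)

    def forward(self, video):                      # video: (B, 3, T, H, W)
        feats = self.stcnn(video)                  # (B, 64, T, H', W')
        feats = feats.mean(dim=(3, 4))             # pool space -> (B, 64, T)
        feats = feats.transpose(1, 2)              # (B, T, 64)
        out, _ = self.blstm(feats)                 # (B, T, 512)
        return self.classifier(out).log_softmax(-1)  # per-frame char log-probs

logits = LipNetSketch()(torch.randn(1, 3, 75, 64, 128))  # -> (1, 75, 28)
```

These per-frame log-probabilities are exactly what the CTC loss above consumes during training.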
GRID dataset has a fixed grammar structure

Table 1: GRID sentence and grammar structure

command                 | color*                    | preposition        | letter*     | digit | adverb*
{bin, lay, place, set}  | {blue, green, red, white} | {at, by, in, with} | [A–Z] \ {W} | [0–9] | {again, now, please, soon}

*keywords
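Because the grammar is fixed, every GRID sentence can be generated mechanically. The snippet below is a small illustration (not from the papers) that samples sentences following Table 1.

```python
import random
import string

# Word categories from Table 1; starred columns are the scored keywords.
command     = ["bin", "lay", "place", "set"]
color       = ["blue", "green", "red", "white"]
preposition = ["at", "by", "in", "with"]
letter      = [c for c in string.ascii_lowercase if c != "w"]  # [A-Z] \ {W}
digit       = [str(d) for d in range(10)]                      # [0-9]
adverb      = ["again", "now", "please", "soon"]

def grid_sentence() -> str:
    """Sample one sentence following the fixed GRID grammar."""
    return " ".join(random.choice(cat) for cat in
                    (command, color, preposition, letter, digit, adverb))

print(grid_sentence())  # e.g. "place red at g 9 now"
```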
Four different strategies to compare against LipNet's performance
- Hearing-impaired students: three members of the Oxford Students' Disability community
- Baseline-LSTM: replicates a state-of-the-art architecture
- Baseline-2D: spatial-only convolutions
- Baseline-NoLM: LipNet with the language model disabled
- Performance is measured with word error rate (WER) and character error rate (CER); a minimal implementation of both follows
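Both metrics are normalized edit distances: CER at the character level, WER at the word level. A minimal sketch using only the standard library (my illustration, not the papers' evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev, d[j] = d[j], min(d[j] + 1,         # delete
                                   d[j - 1] + 1,     # insert
                                   prev + (r != h))  # substitute
    return d[-1]

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("bin blue at f two now", "bin blue at m two now"))  # 1/6 ~ 0.167
```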
Results

LipNet outperforms humans and the previous state-of-the-art model

Table 2: Performance of LipNet on the GRID dataset

                   Unseen Speakers      Overlapped Speakers
Method             CER      WER         CER      WER
Hearing-Impaired   –        47.7%       –        –
Baseline-LSTM      38.4%    52.8%       15.2%    26.3%
Baseline-2D        16.2%    26.7%       4.3%     11.6%
Baseline-NoLM      6.7%     13.6%       2.0%     5.6%
LipNet             6.4%     11.4%       1.9%     4.8%
LipNet: takeaways
- It is an end-to-end sentence-level sequence prediction model
  - spatiotemporal front-end: 3D and 2D convolutions + 2 x bidirectional LSTMs (BLSTM)
- It relies on CTC to:
  1 predict frame-wise labels
  2 find the optimal alignment between the frame-wise predictions and the output sequence (see the decoding sketch below)
- Confirms the importance of combining STCNNs with RNNs
- Extracting spatiotemporal features using an STCNN is better than aggregating spatial-only features
- GRID dataset: fixed grammar structure
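The simplest way to read a transcription off CTC outputs (without the beam search LipNet actually uses) is best-path decoding: take the arg-max character per frame, collapse repeats, and drop blanks. A minimal sketch of that decoding rule:

```python
import itertools

BLANK = "_"  # stand-in symbol for the CTC blank

def ctc_best_path(frame_chars):
    """Collapse repeated characters, then remove blanks."""
    collapsed = (ch for ch, _ in itertools.groupby(frame_chars))
    return "".join(ch for ch in collapsed if ch != BLANK)

# Frame-wise arg-max predictions for the word "bin":
print(ctc_best_path(list("__bbb_iii__nn_")))  # -> "bin"
```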
Outline
1 Context & Motivation
2 LipNet
3 Deep Lip Reading
  Vision module
  Bidirectional LSTM
  Fully convolutional
  Transformer
  External language model
  Experiments & Results
  Takeaways
Deep Lip Reading

Focus on analyzing the performance of different DL architectures

Goal: Compare the performance and training time of three different deep learning architectures

Figure 4: Deep lipreading models. Source: Afouras, Chung, and Zisserman, 2018
Vision module

Spatiotemporal visual front-end
- Spatiotemporal 3D convolution on the input, with a filter width of five frames
- Followed by a 2D ResNet that decreases the spatial dimensions
- For an input sequence of T × H × W frames, outputs a T × H/32 × W/32 × 512 tensor
- Results in a 512-dimensional feature vector for each input video frame (see the shape check below)
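To make the shape bookkeeping concrete, the toy check below (PyTorch assumed; a stack of strided 2D convolutions stands in for the real ResNet) verifies that time is preserved while each frame is reduced to a 512-dimensional vector.

```python
import torch
import torch.nn as nn

# 3D convolution with a temporal width of five frames, followed by
# strided 2D convolutions playing the role of the ResNet that shrinks
# each spatial dimension by a factor of 32 overall.
frontend3d = nn.Conv3d(3, 64, kernel_size=(5, 7, 7),
                       stride=(1, 2, 2), padding=(2, 3, 3))
resnet_standin = nn.Sequential(
    *[nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
      for c_in, c_out in [(64, 128), (128, 256), (256, 512), (512, 512)]]
)

video = torch.randn(1, 3, 16, 160, 160)          # (B, 3, T, H, W)
x = frontend3d(video)                            # (1, 64, 16, 80, 80)
B, C, T, H, W = x.shape
x = x.transpose(1, 2).reshape(B * T, C, H, W)    # fold time into the batch
x = resnet_standin(x)                            # (B*T, 512, 5, 5)
feats = x.mean(dim=(2, 3)).reshape(B, T, 512)    # one 512-dim vector per frame
print(feats.shape)                               # torch.Size([1, 16, 512])
```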
Bidirectional LSTM (BLSTM)
- Comprises three stacked bidirectional LSTMs (a minimal sketch of this back-end follows)
- Ingests the video feature vectors
- Outputs a character probability for each input frame
- It is trained with connectionist temporal classification (CTC)
- The output alphabet is augmented with the CTC blank character
- Decoding is performed with a beam search
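A minimal sketch of this recurrent back-end, assuming PyTorch and the 512-dimensional front-end features; the hidden size and alphabet size are illustrative.

```python
import torch
import torch.nn as nn

class BLSTMBackend(nn.Module):
    """Three stacked BLSTMs mapping per-frame features to char log-probs."""
    def __init__(self, feat_dim=512, hidden=256, num_chars=40):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        # +1 output for the CTC blank that augments the alphabet.
        self.classifier = nn.Linear(2 * hidden, num_chars + 1)

    def forward(self, feats):              # feats: (B, T, 512)
        out, _ = self.blstm(feats)         # (B, T, 2*hidden)
        return self.classifier(out).log_softmax(dim=-1)

log_probs = BLSTMBackend()(torch.randn(2, 75, 512))  # -> (2, 75, 41)
```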
Fully convolutional (FC) model
- Relies on depth-wise separable convolution layers
- Each convolution adds a skip connection, followed by ReLU and batch normalization (sketched below)
- Also trained with the CTC loss
- Considers two variants: 10 and 15 convolutional layers
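A depth-wise separable convolution factors a standard convolution into a per-channel (depth-wise) step and a 1×1 point-wise mixing step. Here is a minimal sketch of one such block over the time axis, with the skip connection, ReLU, and batch normalization described above (PyTorch assumed, sizes illustrative):

```python
import torch
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """Depth-wise separable 1D conv over time, with a residual connection."""
    def __init__(self, channels=512, kernel_size=5):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):                  # x: (B, C, T)
        y = self.pointwise(self.depthwise(x))
        return self.bn(torch.relu(x + y))  # skip connection -> ReLU -> BN

x = torch.randn(2, 512, 75)
print(SeparableConvBlock()(x).shape)       # torch.Size([2, 512, 75])
```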
Transformer model (TM)
- In the encoder, the input serves as attention queries, keys, and values (self-attention)
- In the decoder, the encoder outputs are the attention keys and values
- The previous decoding layer's outputs are the queries
- The decoder produces character probabilities
- Based on the base model proposed by Vaswani et al., 2017 (a minimal sketch follows)
  - 6 encoder and 6 decoder layers
  - 8 attention heads, with dropout of p = 0.1
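A minimal sketch of this encoder-decoder, assuming PyTorch's built-in nn.Transformer; the 6+6 layers, 8 heads, and dropout of 0.1 follow the slide, everything else (alphabet size, sequence lengths) is illustrative.

```python
import torch
import torch.nn as nn

NUM_CHARS, FEAT_DIM = 40, 512

# Encoder self-attends over the video features; in the decoder, the
# encoder outputs act as keys/values and the previous decoder layer's
# outputs act as queries (cross-attention).
model = nn.Transformer(d_model=FEAT_DIM, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dropout=0.1, batch_first=True)
embed = nn.Embedding(NUM_CHARS, FEAT_DIM)     # character embeddings
to_chars = nn.Linear(FEAT_DIM, NUM_CHARS)     # per-step character scores

video_feats = torch.randn(2, 75, FEAT_DIM)    # (B, T, 512) from the front-end
prev_chars = torch.randint(0, NUM_CHARS, (2, 20))
causal = model.generate_square_subsequent_mask(20)  # no peeking ahead

out = model(video_feats, embed(prev_chars), tgt_mask=causal)
char_logits = to_chars(out)                   # (2, 20, NUM_CHARS)
```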
An external character-level language model
- A character-level language model is used during inference
- Recurrent neural network with 4 unidirectional layers of 1024 LSTM cells each
- Trained to predict one character at a time (a minimal sketch follows)
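A minimal sketch of such a character-level LM, assuming PyTorch; the 4 unidirectional layers of 1024 cells follow the slide, the embedding size and alphabet are illustrative.

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    """Unidirectional LSTM LM predicting one character at a time."""
    def __init__(self, num_chars=40, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_chars, embed_dim)
        self.lstm = nn.LSTM(embed_dim, 1024, num_layers=4, batch_first=True)
        self.out = nn.Linear(1024, num_chars)

    def forward(self, chars, state=None):        # chars: (B, T)
        h, state = self.lstm(self.embed(chars), state)
        return self.out(h), state                # logits for the next char

lm = CharLM()
text = torch.randint(0, 40, (8, 100))            # dummy character indices
logits, _ = lm(text[:, :-1])                     # predict chars 2..T from 1..T-1
loss = nn.functional.cross_entropy(logits.reshape(-1, 40),
                                   text[:, 1:].reshape(-1))
```

During beam-search decoding, these next-character probabilities can be mixed with the lipreading model's scores at each step.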
Experiments & Results

Two different datasets for performance evaluation

Table 3: Datasets used for training and testing

Dataset     # Words   Type         Vocabulary   # Utter.   Viewpoint
LRW         489k      single word  500          –          unique
LRS2        2M        sentences    41K          142K       multiple
MV-LRS(w)   1.9M      sentences    480          –          unique
MV-LRS      5M        sentences    30K          430K       unique

LRW: Lip Reading in the Wild; LRS2: Lip Reading Sentences 2

- Two different corpora are used to train the language model:
  1 transcriptions of the LRS2 pre-train and main train data (2M words)
  2 full subtitles of the LRS2 training set (26M words)
- Evaluation on LRS2 (1,243 utterances)
- Character error rates (CER) and word error rates (WER) are reported
Training process includes three stages
1 Train the visual front-end module
2 Use the vision module to generate visual features for all the training data
3 Train the sequence processing module on the cached features (sketched below)
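In code, the three stages amount to training the front-end, caching its outputs once, and then training only the sequence model on the cached features. A runnable toy sketch of that workflow (all modules and data are dummies standing in for the real ones):

```python
import torch
import torch.nn as nn

# Stage 1: assume the visual front-end has already been trained
# (stood in for here by a frozen dummy module).
frontend = nn.Linear(100, 512)            # stand-in for the vision module
for p in frontend.parameters():
    p.requires_grad = False               # freeze it

# Stage 2: run the frozen front-end once over the training data and cache
# the per-frame features, so the expensive module is not re-run every epoch.
videos = [torch.randn(75, 100) for _ in range(4)]   # dummy frame sequences
with torch.no_grad():
    cached = [frontend(v) for v in videos]          # (75, 512) each

# Stage 3: train only the sequence-processing back-end on cached features.
backend = nn.LSTM(512, 256)
optimizer = torch.optim.Adam(backend.parameters())
out, _ = backend(cached[0].unsqueeze(1))            # (75, 1, 256)
out.sum().backward()                                # dummy objective
optimizer.step()
```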
Transformer architecture seems to be a good choice

Table 4: Character error rates and word error rates on the LRS2 dataset (lower is better)

- The Transformer outperforms the other network models
- An improvement of 20% over the previous state-of-the-art model
- High computational cost (e.g., 13 days to train the model)
Takeaways
- Lipreading is a challenging problem
- Context information plays an important role
- Transformer architectures combined with convolutional neural networks enable machine lipreading
- Machine lipreading can outperform human performance
- Computational cost is still an issue
References
1 H. McGurk and J. MacDonald. "Hearing lips and seeing voices". In: Nature 264.5588 (1976), pp. 746–748.
2 D. J. Lewkowicz and A. M. Hansen-Tift. "Infants deploy selective attention to the mouth of a talking face when learning speech". In: Proceedings of the National Academy of Sciences 109.5 (2012), pp. 1431–1436.
3 R. D. Easton and M. Basala. "Perceptual dominance during lipreading". In: Perception & Psychophysics 32.6 (1982), pp. 562–570.
4 A. Vaswani et al. "Attention is all you need". In: NeurIPS. 2017, pp. 5998–6008.
5 T. Afouras, J. S. Chung, and A. Zisserman. "Deep Lip Reading: a comparison of models and an online application". In: INTERSPEECH. 2018.