Deep Lip Reading: a comparison of models and an online application
January 20, 2021
Outline
1 Context & Motivation
2 LipNet
3 Deep Lip Reading
Context & Motivation

Lip reading: what is it and what role does it play?
- The ability to recognize what is being said based on visual information
- It plays a crucial role in human communication and speech understanding [McGurk and MacDonald, 1976]
  - Babies selectively attend to their interlocutor's mouth during social interactions [Lewkowicz and Hansen-Tift, 2012]
- It is a difficult task for humans, especially in the absence of context
- Multiple sounds (phonemes) have almost identical lip shapes (i.e., visemes)

Figure 1: "Bark" pronunciation
Figure 2: "Mark" pronunciation
Human lipreading performance is normally poor
- Hearing-impaired people's accuracy is only [Easton and Basala, 1982]
  - 17% ± 12% for 30 monosyllabic words
  - 21% ± 11% for 30 compound words
- Applications are enormous, including
  - improved hearing aids
  - silent dictation in public spaces
  - speech recognition in noisy environments
  - silent-movie processing
- Automating lipreading is therefore an important goal
- Machine lipreading requires extracting spatiotemporal features from the videos
- Deep learning approaches offer an end-to-end strategy to extract these features
Outline
1 Context & Motivation
2 LipNet
  Pre-deep learning and first deep learning attempts
  Results
  Takeaways
3 Deep Lip Reading
LipNet
Pre-deep learning and first deep learning attempts

Speaker generalization and motion feature extraction were the main issues

Task: Given a silent video of a talking face, predict the sentence being spoken

- Many works focused on video and image preprocessing to extract different features [Zhou et al., 2014]
  - Hidden Markov models (HMM) and Gaussian mixture models (GMM) combined with hand-engineered features
  - Speaker-dependent accuracy and/or limited utterances
- First deep learning attempts were limited to word or phoneme classification
  - Fixed sequence sizes
  - Speaker-dependent
  - Lacked sequence prediction
- Connectionist temporal classification (CTC) loss [Graves et al., 2006] made alignment-free sequence training possible (see the sketch below)
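To make the CTC idea concrete: the loss aligns frame-wise character scores to a much shorter transcription without any per-frame labels. Below is a minimal sketch, assuming PyTorch (not used in the original papers); all sizes are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 75 video frames, batch of 2, alphabet of 26 letters
# plus space, plus the CTC blank symbol at index 0.
T, B, C = 75, 2, 28

# Stand-in for per-frame character scores produced by a network.
log_probs = torch.randn(T, B, C).log_softmax(dim=2)

# Targets are shorter than the frame sequence and carry no information
# about which frames each character spans.
targets = torch.randint(1, C, (B, 10), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

# CTC marginalizes over all valid frame-to-character alignments.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```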
First to show an end-to-end strategy for lipreading
- Maps variable-length sequences of video frames to text sequences
- Trained on the GRID corpus: 33,000 sentences

Figure 3: LipNet architecture. Source: Assael et al., 2016
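As a rough companion to Figure 3, here is a minimal LipNet-style skeleton, assuming PyTorch; the layer sizes are illustrative and not the exact configuration of Assael et al., 2016.

```python
import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    """Spatiotemporal convolutions -> BLSTM -> per-frame character scores."""
    def __init__(self, num_chars: int = 28):
        super().__init__()
        # Spatiotemporal front-end: 3D convolutions over (time, height, width).
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only, keep time
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Two stacked bidirectional LSTMs over the frame axis.
        self.blstm = nn.LSTM(input_size=64, hidden_size=256, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 256, num_chars)

    def forward(self, video):                      # video: (B, 3, T, H, W)
        feats = self.stcnn(video)                  # (B, 64, T, H', W')
        feats = feats.mean(dim=(3, 4))             # pool space -> (B, 64, T)
        feats = feats.transpose(1, 2)              # (B, T, 64)
        out, _ = self.blstm(feats)                 # (B, T, 512)
        return self.classifier(out).log_softmax(-1)  # per-frame char log-probs

logits = LipNetSketch()(torch.randn(1, 3, 75, 64, 128))  # -> (1, 75, 28)
```

These per-frame log-probabilities are exactly what the CTC loss above consumes during training.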
GRID dataset has a fixed grammar structure

Table 1: GRID sentence and grammar structure

command                 | color*                    | preposition        | letter*     | digit | adverb*
{bin, lay, place, set}  | {blue, green, red, white} | {at, by, in, with} | [A–Z] \ {W} | [0–9] | {again, now, please, soon}

*keywords
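Because the grammar is fixed, every GRID sentence can be generated mechanically. The snippet below is a small illustration (not from the papers) that samples sentences following Table 1.

```python
import random
import string

# Word categories from Table 1; starred columns are the scored keywords.
command     = ["bin", "lay", "place", "set"]
color       = ["blue", "green", "red", "white"]
preposition = ["at", "by", "in", "with"]
letter      = [c for c in string.ascii_lowercase if c != "w"]  # [A-Z] \ {W}
digit       = [str(d) for d in range(10)]                      # [0-9]
adverb      = ["again", "now", "please", "soon"]

def grid_sentence() -> str:
    """Sample one sentence following the fixed GRID grammar."""
    return " ".join(random.choice(cat) for cat in
                    (command, color, preposition, letter, digit, adverb))

print(grid_sentence())  # e.g. "place red at g 9 now"
```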
Four different strategies to compare against LipNet's performance
- Hearing-impaired students: three members of the Oxford Students' Disability community
- Baseline-LSTM: replicates a state-of-the-art architecture
- Baseline-2D: spatial-only convolutions
- Baseline-NoLM: LipNet with the language model disabled
- Performance is measured with word error rate (WER) and character error rate (CER); a minimal implementation of both follows
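Both metrics are normalized edit distances: CER at the character level, WER at the word level. A minimal sketch using only the standard library (my illustration, not the papers' evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev, d[j] = d[j], min(d[j] + 1,         # delete
                                   d[j - 1] + 1,     # insert
                                   prev + (r != h))  # substitute
    return d[-1]

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("bin blue at f two now", "bin blue at m two now"))  # 1/6 ~ 0.167
```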
Results

LipNet outperforms humans and the previous state-of-the-art model

Table 2: Performance of LipNet on the GRID dataset

                   Unseen Speakers      Overlapped Speakers
Method             CER      WER         CER      WER
Hearing-Impaired   –        47.7%       –        –
Baseline-LSTM      38.4%    52.8%       15.2%    26.3%
Baseline-2D        16.2%    26.7%       4.3%     11.6%
Baseline-NoLM      6.7%     13.6%       2.0%     5.6%
LipNet             6.4%     11.4%       1.9%     4.8%
LipNet: takeaways
- It is an end-to-end sentence-level sequence prediction model
  - spatiotemporal front-end: 3D and 2D convolutions + 2 x bidirectional LSTMs (BLSTM)
- It relies on CTC to:
  1 predict frame-wise labels
  2 find the optimal alignment between the frame-wise predictions and the output sequence (see the decoding sketch below)
- Confirms the importance of combining STCNNs with RNNs
- Extracting spatiotemporal features using an STCNN is better than aggregating spatial-only features
- GRID dataset: fixed grammar structure
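The simplest way to read a transcription off CTC outputs (without the beam search LipNet actually uses) is best-path decoding: take the arg-max character per frame, collapse repeats, and drop blanks. A minimal sketch of that decoding rule:

```python
import itertools

BLANK = "_"  # stand-in symbol for the CTC blank

def ctc_best_path(frame_chars):
    """Collapse repeated characters, then remove blanks."""
    collapsed = (ch for ch, _ in itertools.groupby(frame_chars))
    return "".join(ch for ch in collapsed if ch != BLANK)

# Frame-wise arg-max predictions for the word "bin":
print(ctc_best_path(list("__bbb_iii__nn_")))  # -> "bin"
```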
Outline
1 Context & Motivation
2 LipNet
3 Deep Lip Reading
  Vision module
  Bidirectional LSTM
  Fully convolutional
  Transformer
  External language model
  Experiments & Results
  Takeaways
Deep Lip Reading

Focus on analyzing the performance of different DL architectures

Goal: Compare the performance and training time of three different deep learning architectures

Figure 4: Deep lipreading models. Source: Afouras, Chung, and Zisserman, 2018
Vision module

Spatiotemporal visual front-end
- Spatiotemporal 3D convolution on the input, with a filter width of five frames
- Followed by a 2D ResNet that decreases the spatial dimensions
- For an input sequence of T × H × W frames, outputs a T × H/32 × W/32 × 512 tensor
- Results in a 512-dimensional feature vector for each input video frame (see the shape check below)
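To make the shape bookkeeping concrete, the toy check below (PyTorch assumed; a stack of strided 2D convolutions stands in for the real ResNet) verifies that time is preserved while each frame is reduced to a 512-dimensional vector.

```python
import torch
import torch.nn as nn

# 3D convolution with a temporal width of five frames, followed by
# strided 2D convolutions playing the role of the ResNet that shrinks
# each spatial dimension by a factor of 32 overall.
frontend3d = nn.Conv3d(3, 64, kernel_size=(5, 7, 7),
                       stride=(1, 2, 2), padding=(2, 3, 3))
resnet_standin = nn.Sequential(
    *[nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
      for c_in, c_out in [(64, 128), (128, 256), (256, 512), (512, 512)]]
)

video = torch.randn(1, 3, 16, 160, 160)          # (B, 3, T, H, W)
x = frontend3d(video)                            # (1, 64, 16, 80, 80)
B, C, T, H, W = x.shape
x = x.transpose(1, 2).reshape(B * T, C, H, W)    # fold time into the batch
x = resnet_standin(x)                            # (B*T, 512, 5, 5)
feats = x.mean(dim=(2, 3)).reshape(B, T, 512)    # one 512-dim vector per frame
print(feats.shape)                               # torch.Size([1, 16, 512])
```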
Bidirectional LSTM (BLSTM)
- Comprises three stacked bidirectional LSTMs (a minimal sketch of this back-end follows)
- Ingests the video feature vectors
- Outputs a character probability for each input frame
- It is trained with connectionist temporal classification (CTC)
- The output alphabet is augmented with the CTC blank character
- Decoding is performed with a beam search
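A minimal sketch of this recurrent back-end, assuming PyTorch and the 512-dimensional front-end features; the hidden size and alphabet size are illustrative.

```python
import torch
import torch.nn as nn

class BLSTMBackend(nn.Module):
    """Three stacked BLSTMs mapping per-frame features to char log-probs."""
    def __init__(self, feat_dim=512, hidden=256, num_chars=40):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        # +1 output for the CTC blank that augments the alphabet.
        self.classifier = nn.Linear(2 * hidden, num_chars + 1)

    def forward(self, feats):              # feats: (B, T, 512)
        out, _ = self.blstm(feats)         # (B, T, 2*hidden)
        return self.classifier(out).log_softmax(dim=-1)

log_probs = BLSTMBackend()(torch.randn(2, 75, 512))  # -> (2, 75, 41)
```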
Fully convolutional (FC) model
- Relies on depth-wise separable convolution layers
- Each convolution adds a skip connection, followed by ReLU and batch normalization (sketched below)
- Also trained with the CTC loss
- Considers two variants: 10 and 15 convolutional layers
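A depth-wise separable convolution factors a standard convolution into a per-channel (depth-wise) step and a 1×1 point-wise mixing step. Here is a minimal sketch of one such block over the time axis, with the skip connection, ReLU, and batch normalization described above (PyTorch assumed, sizes illustrative):

```python
import torch
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """Depth-wise separable 1D conv over time, with a residual connection."""
    def __init__(self, channels=512, kernel_size=5):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):                  # x: (B, C, T)
        y = self.pointwise(self.depthwise(x))
        return self.bn(torch.relu(x + y))  # skip connection -> ReLU -> BN

x = torch.randn(2, 512, 75)
print(SeparableConvBlock()(x).shape)       # torch.Size([2, 512, 75])
```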
Transformer model (TM)
- In the encoder, the input serves as attention queries, keys, and values (self-attention)
- In the decoder, the encoder outputs are the attention keys and values
- The previous decoding layer's outputs are the queries
- The decoder produces character probabilities
- Based on the base model proposed by Vaswani et al., 2017 (a minimal sketch follows)
  - 6 encoder and 6 decoder layers
  - 8 attention heads, with dropout of p = 0.1
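A minimal sketch of this encoder-decoder, assuming PyTorch's built-in nn.Transformer; the 6+6 layers, 8 heads, and dropout of 0.1 follow the slide, everything else (alphabet size, sequence lengths) is illustrative.

```python
import torch
import torch.nn as nn

NUM_CHARS, FEAT_DIM = 40, 512

# Encoder self-attends over the video features; in the decoder, the
# encoder outputs act as keys/values and the previous decoder layer's
# outputs act as queries (cross-attention).
model = nn.Transformer(d_model=FEAT_DIM, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dropout=0.1, batch_first=True)
embed = nn.Embedding(NUM_CHARS, FEAT_DIM)     # character embeddings
to_chars = nn.Linear(FEAT_DIM, NUM_CHARS)     # per-step character scores

video_feats = torch.randn(2, 75, FEAT_DIM)    # (B, T, 512) from the front-end
prev_chars = torch.randint(0, NUM_CHARS, (2, 20))
causal = model.generate_square_subsequent_mask(20)  # no peeking ahead

out = model(video_feats, embed(prev_chars), tgt_mask=causal)
char_logits = to_chars(out)                   # (2, 20, NUM_CHARS)
```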
An external character-level language model
- A character-level language model is used during inference
- Recurrent neural network with 4 unidirectional layers of 1024 LSTM cells each
- Trained to predict one character at a time (a minimal sketch follows)
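A minimal sketch of such a character-level LM, assuming PyTorch; the 4 unidirectional layers of 1024 cells follow the slide, the embedding size and alphabet are illustrative.

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    """Unidirectional LSTM LM predicting one character at a time."""
    def __init__(self, num_chars=40, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_chars, embed_dim)
        self.lstm = nn.LSTM(embed_dim, 1024, num_layers=4, batch_first=True)
        self.out = nn.Linear(1024, num_chars)

    def forward(self, chars, state=None):        # chars: (B, T)
        h, state = self.lstm(self.embed(chars), state)
        return self.out(h), state                # logits for the next char

lm = CharLM()
text = torch.randint(0, 40, (8, 100))            # dummy character indices
logits, _ = lm(text[:, :-1])                     # predict chars 2..T from 1..T-1
loss = nn.functional.cross_entropy(logits.reshape(-1, 40),
                                   text[:, 1:].reshape(-1))
```

During beam-search decoding, these next-character probabilities can be mixed with the lipreading model's scores at each step.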
Experiments & Results

Two different datasets for performance evaluation

Table 3: Datasets used for training and testing

Dataset     # Words   Type         Vocabulary   # Utter.   Viewpoint
LRW         489k      single word  500          –          unique
LRS2        2M        sentences    41K          142K       multiple
MV-LRS(w)   1.9M      sentences    480          –          unique
MV-LRS      5M        sentences    30K          430K       unique

LRW: Lip Reading in the Wild; LRS2: Lip Reading Sentences 2

- Two different corpora are used to train the language model:
  1 transcriptions of the LRS2 pre-train and main train data (2M words)
  2 full subtitles of the LRS2 training set (26M words)
- Evaluation on LRS2 (1,243 utterances)
- Character error rates (CER) and word error rates (WER) are reported
Training process includes three stages
1 Train the visual front-end module
2 Use the vision module to generate visual features for all the training data
3 Train the sequence processing module on the cached features (sketched below)
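In code, the three stages amount to training the front-end, caching its outputs once, and then training only the sequence model on the cached features. A runnable toy sketch of that workflow (all modules and data are dummies standing in for the real ones):

```python
import torch
import torch.nn as nn

# Stage 1: assume the visual front-end has already been trained
# (stood in for here by a frozen dummy module).
frontend = nn.Linear(100, 512)            # stand-in for the vision module
for p in frontend.parameters():
    p.requires_grad = False               # freeze it

# Stage 2: run the frozen front-end once over the training data and cache
# the per-frame features, so the expensive module is not re-run every epoch.
videos = [torch.randn(75, 100) for _ in range(4)]   # dummy frame sequences
with torch.no_grad():
    cached = [frontend(v) for v in videos]          # (75, 512) each

# Stage 3: train only the sequence-processing back-end on cached features.
backend = nn.LSTM(512, 256)
optimizer = torch.optim.Adam(backend.parameters())
out, _ = backend(cached[0].unsqueeze(1))            # (75, 1, 256)
out.sum().backward()                                # dummy objective
optimizer.step()
```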
Transformer architecture seems to be a good choice

Table 4: Character error rates and word error rates on the LRS2 dataset (lower is better)

- The Transformer outperforms the other network models
- An improvement of 20% over the previous state-of-the-art model
- High computational cost (e.g., 13 days to train the model)
Takeaways
- Lipreading is a challenging problem
- Context information plays an important role
- Transformer architectures combined with convolutional neural networks enable machine lipreading
- Machine lipreading can outperform human performance
- Computational cost is still an issue
References
1 H. McGurk and J. MacDonald. "Hearing lips and seeing voices". In: Nature 264.5588 (1976), pp. 746–748.
2 D. J. Lewkowicz and A. M. Hansen-Tift. "Infants deploy selective attention to the mouth of a talking face when learning speech". In: Proceedings of the National Academy of Sciences 109.5 (2012), pp. 1431–1436.
3 R. D. Easton and M. Basala. "Perceptual dominance during lipreading". In: Perception & Psychophysics 32.6 (1982), pp. 562–570.
4 A. Vaswani et al. "Attention is all you need". In: NeurIPS. 2017, pp. 5998–6008.
5 T. Afouras, J. S. Chung, and A. Zisserman. "Deep Lip Reading: a comparison of models and an online application". In: INTERSPEECH. 2018.