  • Deep Lip Reading: a comparison of models and an online application

    January 20, 2021

    1/23

  • Outline

    1 Context & Motivation

    2 LipNet

    3 Deep Lip Reading

    2/23

  • 3/23

    Context & Motivation

    Lip reading: what is it and what role does it play?

    - The ability to recognize what is being said based on visual information
    - It plays a crucial role in human communication and speech understanding [McGurk and MacDonald, 1976]
    - Infants selectively observe their interlocutor's mouth during social interactions [Lewkowicz and Hansen-Tift, 2012]
    - It is a difficult task for humans, especially in the absence of context
    - Multiple sounds (phonemes) have almost identical lip shapes (i.e., visemes)

    Figure 1: "Bark" pronunciation. Figure 2: "Mark" pronunciation

  • 4/23

    Context & Motivation

    Human lipreading performance is normally poor

    - Hearing-impaired people's accuracy is low [Easton and Basala, 1982]:
      - 17 % ± 12 % for 30 monosyllabic words
      - 21 % ± 11 % for 30 compound words
    - Enormous applications, including:
      - improved hearing aids
      - silent dictation in public spaces
      - speech recognition in noisy environments
      - silent movie processing
    - Automating lipreading is therefore an important goal
    - Machine lipreading requires extracting spatiotemporal features from the videos
    - Deep learning approaches offer an end-to-end strategy to extract these features

  • Outline

    1 Context & Motivation

    2 LipNet
       Pre-deep learning and first deep learning attempts
       Results
       Takeaways

    3 Deep Lip Reading

    5/23

  • 6/23

    LipNet
    Pre-deep learning and first deep learning attempts

    Speaker generalization and motion extraction were the main issues

    Task: given a silent video of a talking face, predict the sentences being spoken

    - Many works focused on video and image preprocessing to extract different features [Zhou et al., 2014]
      - Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) combined with hand-engineered features
      - Speaker-dependent accuracy and/or limited utterances
    - First deep learning attempts were limited to word or phoneme classification
      - Fixed sequence size
      - Speaker-dependent
      - Lacked sequence prediction
    - Connectionist temporal classification (CTC) loss [Graves et al., 2006]

  • 7/23

    LipNet
    Pre-deep learning and first deep learning attempts

    First to show an end-to-end strategy for lipreading

    - Maps variable-length sequences of video frames to text sequences
    - GRID corpus: 33k sentences

    Figure 3: LipNet architecture. Source: Assael et al., 2016

  • 8/23

    LipNet
    Pre-deep learning and first deep learning attempts

    GRID dataset has a fixed grammar structure

    Table 1: GRID sentence and grammar structure (∗ marks keywords)

    command       {bin, lay, place, set}
    color∗        {blue, green, red, white}
    preposition   {at, by, in, with}
    letter∗       [A–Z] \ {W}
    digit∗        [0–9]
    adverb        {again, now, please, soon}

    (A sentence sampler for this grammar is sketched below.)
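
    To make the constraint concrete, here is a minimal Python sketch that samples sentences from the grammar above. The word lists follow Table 1 (digits written as the spoken words used in the corpus); everything else is illustrative.

    import random

    # Word categories of the GRID grammar from Table 1 (the order is fixed).
    GRID_GRAMMAR = [
        ("command",     ["bin", "lay", "place", "set"]),
        ("color",       ["blue", "green", "red", "white"]),
        ("preposition", ["at", "by", "in", "with"]),
        ("letter",      list("abcdefghijklmnopqrstuvxyz")),   # A-Z excluding W
        ("digit",       ["zero", "one", "two", "three", "four",
                         "five", "six", "seven", "eight", "nine"]),
        ("adverb",      ["again", "now", "please", "soon"]),
    ]

    def sample_grid_sentence(rng=random):
        """Draw one word per category, in the fixed GRID order."""
        return " ".join(rng.choice(words) for _, words in GRID_GRAMMAR)

    if __name__ == "__main__":
        random.seed(0)
        for _ in range(3):
            print(sample_grid_sentence())   # three random GRID-style sentences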

  • 9/23

    LipNet
    Pre-deep learning and first deep learning attempts

    Four different strategies to compare with the LipNet performance

    - Hearing-impaired students: three members of the Oxford Students' Disability community
    - Baseline-LSTM: replicates a state-of-the-art architecture
    - Baseline-2D: spatial-only convolutions
    - Baseline-NoLM: language model disabled
    - Comparisons use word error rate (WER) and character error rate (CER); see the sketch below
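
    Both metrics are normalized edit distances: WER over words, CER over characters. Below is a minimal, dependency-free Python sketch of the standard definitions (not the authors' evaluation code).

    def edit_distance(ref, hyp):
        """Levenshtein distance between two sequences (lists or strings)."""
        dp = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, dp[0] = dp[0], i
            for j, h in enumerate(hyp, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                         dp[j - 1] + 1,    # insertion
                                         prev + (r != h))  # substitution (free if equal)
        return dp[-1]

    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        return edit_distance(ref, hyp) / max(len(ref), 1)

    def cer(reference, hypothesis):
        return edit_distance(reference, hypothesis) / max(len(reference), 1)

    print(wer("bin blue at f two now", "bin blue by f two now"))   # 1 wrong word out of 6 ≈ 0.17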

  • 10/23

    LipNet
    Results

    LipNet outperforms humans and the previous state-of-the-art model

    Table 2: Performance of LipNet on the GRID dataset

                          Unseen Speakers      Overlapped Speakers
    Method                CER       WER        CER       WER
    Hearing-Impaired      –         47.7%      –         –
    Baseline-LSTM         38.4%     52.8%      15.2%     26.3%
    Baseline-2D           16.2%     26.7%      4.3%      11.6%
    Baseline-NoLM         6.7%      13.6%      2.0%      5.6%
    LipNet                6.4%      11.4%      1.9%      4.8%

  • 11/23

    LipNet
    Takeaways

    LipNet: takeaways

    - It is an end-to-end sentence-level sequence prediction model
      - spatiotemporal front-end (3D and 2D convolutions) + 2 × bidirectional LSTMs (BLSTMs)
    - It relies on CTC to (see the sketch below):
      1. predict frame-wise labels
      2. look for the optimal alignment between the frame-wise predictions and the output sequence
    - Confirms the importance of combining STCNNs with RNNs
    - Extracting spatiotemporal features using an STCNN is better than aggregating spatial-only features
    - GRID dataset: fixed grammar structure
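
    A minimal PyTorch sketch of the CTC objective and the simplest (greedy, best-path) decoding. The alphabet, tensor shapes, and random inputs are illustrative stand-ins for LipNet's BLSTM outputs, not the paper's code.

    import torch
    import torch.nn as nn

    BLANK, VOCAB = 0, 28          # assumed alphabet: blank + 26 letters + space
    T, B, S = 75, 2, 20           # 75 video frames (a 3 s GRID clip at 25 fps), batch of 2, 20 target chars

    log_probs = torch.randn(T, B, VOCAB).log_softmax(-1)      # stand-in for frame-wise character log-probs
    targets = torch.randint(1, VOCAB, (B, S))                 # dummy character targets (never the blank)
    input_lengths = torch.full((B,), T, dtype=torch.long)
    target_lengths = torch.full((B,), S, dtype=torch.long)

    ctc = nn.CTCLoss(blank=BLANK, zero_infinity=True)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)   # sums over all valid alignments
    print("CTC loss:", loss.item())

    def greedy_decode(log_probs_t):
        """Best-path decoding: argmax per frame, collapse repeats, drop blanks."""
        out, prev = [], BLANK
        for idx in log_probs_t.argmax(-1).tolist():
            if idx != prev and idx != BLANK:
                out.append(idx)
            prev = idx
        return out

    print(greedy_decode(log_probs[:, 0]))   # character indices for the first clip

    LipNet itself decodes with a beam search plus a language model; the greedy collapse above is only the simplest way to turn frame-wise predictions into a character sequence.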

  • Outline

    1 Context & Motivation

    2 LipNet

    3 Deep Lip Reading
       Vision module
       Bidirectional LSTM
       Fully convolutional
       Transformer
       External language model
       Experiments & Results
       Takeaways

    12/23

  • 13/23

    Deep Lip Reading

    Focus on analyzing the performance of different DL architectures

    Goal

    - Compare the performance and training time of three different deep learning architectures

    Figure 4: Deep lipreading models. Source: Afouras, Chung, and Zisserman, 2018

  • 14/23

    Deep Lip Reading
    Vision module

    Spatiotemporal visual front-end

    - A spatiotemporal 3D convolution on the input with a filter width of five frames
    - Followed by a 2D ResNet, which decreases the spatial dimensions
    - For an input sequence of T × H × W frames, it outputs a T × H/32 × W/32 × 512 tensor
    - Results in a 512-dimensional feature vector for each input video frame (see the sketch below)

  • 15/23

    Deep Lip Reading
    Bidirectional LSTM

    Bidirectional LSTM (BLSTM)

    - Comprises three stacked bidirectional LSTMs (sketched below)
    - Ingests the video feature vectors
    - Outputs a character probability for each input frame
    - It is trained with connectionist temporal classification (CTC)
    - The output alphabet is augmented with the CTC blank character
    - Decoding is performed with a beam search
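
    A minimal PyTorch sketch of this back-end over the 512-d per-frame features from the front-end sketch above; the hidden size and alphabet size are assumptions.

    import torch
    import torch.nn as nn

    class BLSTMBackend(nn.Module):
        """Three stacked BLSTMs mapping frame features to per-frame character log-probs."""
        def __init__(self, feat_dim=512, hidden=256, vocab=40):   # vocab includes the CTC blank
            super().__init__()
            self.blstm = nn.LSTM(feat_dim, hidden, num_layers=3,
                                 bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(2 * hidden, vocab)

        def forward(self, feats):                # feats: (B, T, feat_dim)
            out, _ = self.blstm(feats)           # (B, T, 2*hidden)
            logits = self.classifier(out)        # (B, T, vocab)
            return logits.log_softmax(-1).transpose(0, 1)   # (T, B, vocab), as nn.CTCLoss expects

    backend = BLSTMBackend()
    feats = torch.randn(2, 75, 512)              # e.g. the front-end output above
    print(backend(feats).shape)                  # torch.Size([75, 2, 40])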

  • 16/23

    Deep Lip Reading
    Fully convolutional

    Fully convolutional (FC) model

    - Relies on depth-wise separable convolution layers (sketched below)
    - Each convolution adds a skip connection followed by ReLU and batch normalization
    - Also trained with the CTC loss
    - Considers two variants: 10 and 15 convolutional layers
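
    A minimal PyTorch sketch of one depth-wise separable block and a 10-layer stack over the frame features; kernel size and channel width are assumptions, and the skip/ReLU/batch-norm ordering follows the bullet above.

    import torch
    import torch.nn as nn

    class SeparableConvBlock(nn.Module):
        """Depth-wise + point-wise temporal convolution with a residual connection."""
        def __init__(self, channels=512, kernel=5):
            super().__init__()
            self.depthwise = nn.Conv1d(channels, channels, kernel,
                                       padding=kernel // 2, groups=channels)
            self.pointwise = nn.Conv1d(channels, channels, 1)
            self.bn = nn.BatchNorm1d(channels)

        def forward(self, x):                    # x: (B, C, T)
            y = self.pointwise(self.depthwise(x))
            return self.bn(torch.relu(x + y))    # skip connection, then ReLU and batch norm

    class FullyConvBackend(nn.Module):
        def __init__(self, channels=512, vocab=40, depth=10):   # the paper compares 10 and 15 layers
            super().__init__()
            self.blocks = nn.Sequential(*[SeparableConvBlock(channels) for _ in range(depth)])
            self.classifier = nn.Conv1d(channels, vocab, 1)

        def forward(self, feats):                # feats: (B, T, C) frame features
            x = self.blocks(feats.transpose(1, 2))
            return self.classifier(x).permute(2, 0, 1).log_softmax(-1)   # (T, B, vocab) for CTC

    print(FullyConvBackend()(torch.randn(2, 75, 512)).shape)   # torch.Size([75, 2, 40])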

  • 17/23

    Deep Lip Reading
    Transformer

    Transformer model (TC)

    - The input serves as attention queries, keys, and values
    - The encoder outputs are the attention keys and values
    - The previous decoding layer's outputs are the queries
    - The decoder produces character probabilities
    - Relies on the base model proposed by Vaswani et al., 2017 (sketched below)
      - 6 encoder and 6 decoder layers
      - 8 attention heads with dropout of p = 0.1

  • 18/23

    Deep Lip Reading
    External language model

    An external character-level language model

    - A character-level language model is used during inference
    - A recurrent neural network with 4 unidirectional layers of 1024 LSTM cells each (sketched below)
    - Trained to predict one character at a time
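
    A minimal PyTorch sketch of such a character-level LM (4 unidirectional LSTM layers of 1024 cells, next-character prediction); the alphabet size is an assumption, and wiring its scores into the beam search is left out.

    import torch
    import torch.nn as nn

    class CharLM(nn.Module):
        """4-layer unidirectional LSTM that predicts the next character."""
        def __init__(self, vocab=40, hidden=1024):
            super().__init__()
            self.emb = nn.Embedding(vocab, hidden)
            self.lstm = nn.LSTM(hidden, hidden, num_layers=4, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, chars, state=None):     # chars: (B, S) character indices
            x, state = self.lstm(self.emb(chars), state)
            return self.out(x), state             # (B, S, vocab) logits, plus the LSTM state

    lm = CharLM()
    logits, state = lm(torch.randint(0, 40, (1, 16)))        # score a 16-character prefix
    print(logits[:, -1].log_softmax(-1).shape)                # next-character log-probs: torch.Size([1, 40])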

  • 19/23

    Deep Lip Reading
    Experiments & Results

    Two different datasets for performance evaluation

    Table 3: Datasets used for training and testing

    Dataset      # Words   Type         Vocabulary   # Utter.   Viewpoint
    LRW          489k      single word  500          –          unique
    LRS2         2M        sentences    41K          142K       multiple
    MV-LRS(w)    1.9M      sentences    480          –          unique
    MV-LRS       5M        sentences    30K          430K       unique

    LRW: Lip Reading in the Wild; LRS2: Lip Reading Sentences 2

    - Two different corpora were used to train the language model:
      1. transcriptions of the LRS2 pre-train and main train data (2M words)
      2. full subtitles of the LRS2 training set (26M words)
    - Evaluated on LRS2 (1,243 utterances)
    - Character error rates (CER) and word error rates (WER) are reported

  • 20/23

    Deep Lip Reading
    Experiments & Results

    Training process includes three stages

    1. Train the visual front-end module
    2. Use the vision module to generate visual features for all the training data (see the sketch below)
    3. Train the sequence processing module
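
    A minimal sketch of stage 2, assuming the hypothetical VisualFrontEnd from the earlier sketch and a data loader yielding (clips, transcripts) pairs: the trained front-end is frozen and its per-frame features are cached once, so the sequence models in stage 3 can be trained on top quickly.

    import torch

    @torch.no_grad()
    def cache_visual_features(frontend, loader, device="cpu"):
        """Run the frozen front-end once and store features for stage-3 training."""
        frontend.eval().to(device)
        cached = []
        for clips, transcripts in loader:                 # clips: (B, 1, T, H, W)
            feats = frontend(clips.to(device)).cpu()      # (B, T, 512) per-frame features
            cached.extend(zip(feats, transcripts))
        return cached                                     # list of (features, transcript) pairs

    The BLSTM, fully convolutional, and transformer back-ends then train directly on these cached features.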

  • 21/23

    Deep Lip Reading
    Experiments & Results

    The transformer architecture seems to be a good choice

    Table 4: Character error rates and word error rates on the LRS2 dataset (lower is better)

    - The transformer outperforms the other network models
    - An improvement of 20% over the previous state-of-the-art model
    - High computational cost (e.g., 13 days to train the model)

  • 22/23

    Deep Lip Reading
    Takeaways

    Takeaways

    - Lipreading is a challenging problem
    - Context information plays an important role
    - Transformer architectures combined with convolutional neural networks enable machine lipreading
    - Machine lipreading can outperform human performance
    - Computational cost is still an issue

  • References

    1. H. McGurk and J. MacDonald. “Hearing lips and seeing voices”. In: Nature 264.5588 (1976), pp. 746–748.

    2. D. J. Lewkowicz and A. M. Hansen-Tift. “Infants deploy selective attention to the mouth of a talking face when learning speech”. In: Proceedings of the National Academy of Sciences 109.5 (2012), pp. 1431–1436.

    3. R. D. Easton and M. Basala. “Perceptual dominance during lipreading”. In: Perception & Psychophysics 32.6 (1982), pp. 562–570.

    4. A. Vaswani et al. “Attention is all you need”. In: NeurIPS. 2017, pp. 5998–6008.

    5. T. Afouras, J. S. Chung, and A. Zisserman. “Deep Lip Reading: a comparison of models and an online application”. In: INTERSPEECH. 2018.


  • 23/23

    Context & Motivation

    LipNet
      Pre-deep learning and first deep learning attempts
      Results
      Takeaways

    Deep Lip Reading
      Vision module
      Bidirectional LSTM
      Fully convolutional
      Transformer
      External language model
      Experiments & Results
      Takeaways
