Audio-Visual Speech Recognition: Audio Noise, Video Noise, and Pronunciation Variability
Mark Hasegawa-Johnson
Electrical and Computer Engineering
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition
AVICAR Database
● AVICAR = Audio-Visual In a CAR
● 100 Talkers
● 4 Cameras, 7 Microphones
● 5 noise conditions: Engine idling, 35mph, 35mph with windows open, 55mph, 55mph with windows open
● Three types of utterances:
– Digits & Phone numbers, for training and testing phone-number recognizers
– TIMIT sentences, for training and testing large-vocabulary speech recognition
– Isolated Letters, to test the use of video for an acoustically hard recognition problem
AVICAR Recording Hardware (Lee, Hasegawa-Johnson et al., ICSLP 2004)
● 4 cameras, glare shields, adjustable mounting; best placement: dashboard
● 8 mics, pre-amps, wooden baffle; best placement: sunvisor
● System is not permanently installed; mounting requires 10 minutes.
AVICAR Video Noise
● Lighting: many different angles, many types of weather
● Interlace: 30fps NTSC encoding used to transmit data from camera to digital video tape
Related Problem: Dimensionality
● Dimension of the raw grayscale lip rectangle: 30×200 = 6000 pixels
● Dimension of the DCT of the lip rectangle: 30×200 = 6000 dimensions
● Smallest truncated DCT that allows a human viewer to recognize lip shapes (Hasegawa-Johnson, informal experiments): 25×25 = 625 dimensions
● Truncated DCT typically used in AVSR: 4×4 = 16 dimensions
● Dimension of “geometric lip features” that allow high-accuracy AVSR (e.g., Chu and Huang, 2000):
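As a concrete illustration of the truncated-DCT features above, the sketch below computes a 2-D DCT of a 30×200 lip rectangle and keeps the low-frequency 4×4 corner as a 16-dimensional feature vector. The hand-built orthonormal DCT-II matrix and the `keep` parameter are this example's own assumptions, not the talk's exact extractor.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (n x n)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.cos(np.pi * (m + 0.5) * k / n)
    c[0] /= np.sqrt(2)                       # DC row gets the 1/sqrt(2) scale
    return c * np.sqrt(2.0 / n)

def truncated_dct_features(lip_rect, keep=4):
    """2-D DCT of a grayscale lip rectangle, keeping only the
    low-frequency keep x keep corner as a feature vector."""
    rows, cols = lip_rect.shape
    coeffs = dct_matrix(rows) @ lip_rect @ dct_matrix(cols).T
    return coeffs[:keep, :keep].ravel()

rect = np.random.rand(30, 200)               # synthetic 30x200 lip rectangle
feats = truncated_dct_features(rect, keep=4)
print(feats.shape)                           # (16,) -- the 16-dim AVSR feature
```

Raising `keep` to 25 gives the 625-dimensional version that still supports human lip-shape recognition.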
Manifold Estimation (e.g., Roweis and Saul, Science 2000)
Neighborhood Graph
● Node = data point
● Edge = connect each data point to its K nearest neighbors
Manifold Estimation
● The K nearest neighbors of each data point define the local (K-1)-dimensional tangent space of a manifold
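The neighborhood-graph construction above can be sketched directly: connect each point to its K nearest neighbors. This toy version uses brute-force pairwise distances (an assumption; any KNN routine would serve), with points sampled near a 1-D curve in 2-D as a stand-in manifold.

```python
import numpy as np

def knn_graph(X, k):
    """Adjacency matrix of the K-nearest-neighbor graph:
    A[i, j] = 1 iff x_j is one of the k nearest neighbors of x_i."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbor
    nbrs = np.argsort(d2, axis=1)[:, :k]      # indices of the k closest points
    A = np.zeros(d2.shape, dtype=int)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nbrs.ravel()] = 1
    return A

# points near a 1-D curve embedded in 2-D (a toy "manifold")
t = np.linspace(0, 3, 40)
X = np.c_[t, np.sin(t)] + 0.01 * np.random.default_rng(0).normal(size=(40, 2))
A = knn_graph(X, k=2)
print(A.sum(axis=1))                          # each node has exactly k edges out
```

The rows of `A` for each point index its local neighborhood, i.e., the samples spanning that point's tangent space.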
Local Discriminant Graph (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
● Maximize local inter-manifold interpolation errors, subject to a constant same-class interpolation error:
● Find P to maximize
  D = Σ_i ||P^T (x_i − Σ_k c_k y_k)||²,  y_k ∈ KNN(x_i), other classes
● Subject to S = constant, where
  S = Σ_i ||P^T (x_i − Σ_j c_j x_j)||²,  x_j ∈ KNN(x_i), same class
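A minimal sketch of the optimization above, under stated assumptions: the interpolation weights c are plain least squares, the constrained maximization is solved as the generalized eigenproblem D p = λ S p, and a small ridge term keeps S invertible. This is illustrative, not the ICIP 2007 implementation.

```python
import numpy as np

def recon_error_scatter(X, neighbors):
    """Sum over i of e_i e_i^T, where e_i = x_i - Y_i c_i and c_i are
    least-squares interpolation weights over the neighbor set Y_i."""
    S = np.zeros((X.shape[1], X.shape[1]))
    for i, idx in enumerate(neighbors):
        Y = X[idx]                                   # (k, dim) neighbor matrix
        c, *_ = np.linalg.lstsq(Y.T, X[i], rcond=None)
        e = X[i] - Y.T @ c                           # interpolation error
        S += np.outer(e, e)
    return S

def ldg_projection(X, labels, k=3, out_dim=2):
    """Toy LDG sketch: maximize inter-class interpolation error subject to
    constant same-class error, via the pencil D p = lambda S p."""
    n = len(X)
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(d2, np.inf)
    nn_same = [np.argsort(np.where(same[i], d2[i], np.inf))[:k] for i in range(n)]
    nn_diff = [np.argsort(np.where(~same[i], d2[i], np.inf))[:k] for i in range(n)]
    D = recon_error_scatter(X, nn_diff)              # inter-manifold errors
    S = recon_error_scatter(X, nn_same)              # same-class errors
    w, V = np.linalg.eig(np.linalg.solve(S + 1e-6 * np.eye(len(S)), D))
    order = np.argsort(-w.real)
    return V.real[:, order[:out_dim]]                # columns of P

rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))]
labels = np.r_[np.zeros(20, int), np.ones(20, int)]
P = ldg_projection(X, labels, k=3, out_dim=2)
print(P.shape)   # (5, 2): projection from 5-D down to 2-D
```

The same pencil structure underlies LDA; the difference is that both scatter matrices here are built from local interpolation errors rather than global class means.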
PCA, LDA, LDG: Experimental Test (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
1)1) Video NoiseVideo Noise1)1) Graphical Methods: Manifold EstimationGraphical Methods: Manifold Estimation2)2) Local Graph Discriminant FeaturesLocal Graph Discriminant Features
2)2) Audio NoiseAudio Noise1)1) Beam-Form, Post-Filter, and Low-SNR VADBeam-Form, Post-Filter, and Low-SNR VAD
3)3) Pronunciation VariabilityPronunciation Variability1)1) Graphical Methods: Dynamic Bayesian NetworkGraphical Methods: Dynamic Bayesian Network2)2) An Articulatory-Feature Model for Audio-An Articulatory-Feature Model for Audio-
Beamforming
– Filter-and-sum (MVDR) vs. Delay-and-sum
Post-Filter
– MMSE log spectral amplitude estimator (Ephraim and Malah, 1984) vs. Spectral Subtraction
Voice Activity Detection
– Likelihood ratio method (Sohn and Sung, ICASSP 1998)
VAD: Noise Estimators
● Fixed estimate: N_0 = average of first 10 frames
● Autoregressive estimator (Sohn and Sung):
  N_t = α_t X_t + (1 − α_t) N_{t−1},  where α_t is a function of X_t and N_0
● Backoff estimator (Lee and Hasegawa-Johnson, DSP for In-Vehicle and Mobile Systems, 2007):
  N_t = α_t X_t + (1 − α_t) N_0
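The three noise estimators above can be compared in a few lines. In this sketch α is a fixed constant (in Sohn and Sung it is a function of X_t and N_0, so the fixed value is an assumption of this example):

```python
import numpy as np

def noise_tracks(X, alpha=0.05):
    """Three noise-spectrum estimates from a sequence of power spectra X[t]:
    fixed N0, autoregressive (smooths toward the previous estimate), and
    backoff (smooths toward the initial estimate N0)."""
    N0 = X[:10].mean(axis=0)                    # fixed: mean of first 10 frames
    n_ar = N0.copy()
    ar, bo = [N0.copy()], [N0.copy()]
    for x in X[10:]:
        n_ar = alpha * x + (1 - alpha) * n_ar   # autoregressive (Sohn & Sung)
        n_bo = alpha * x + (1 - alpha) * N0     # backoff (Lee & Hasegawa-Johnson)
        ar.append(n_ar.copy())
        bo.append(n_bo)
    return N0, np.array(ar), np.array(bo)

# synthetic nonnegative power spectra: 100 frames, 8 frequency bins
X = np.abs(np.random.default_rng(1).normal(size=(100, 8))) ** 2
N0, ar, bo = noise_tracks(X)
```

The design difference is visible in the recursions: the autoregressive track can drift arbitrarily far from N_0 during long speech runs, while the backoff track never deviates from N_0 by more than α(X_t − N_0), which is why it is more robust when the VAD mislabels speech as noise.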
Word Error Rate: Digits
(Figure: word error rate (%) vs. noise condition — Idle, 35U, 35D, 55U, 55D — comparing Backoff Estimation, Autoregressive Estimation, and Fixed Noise.)
III. Pronunciation Variability
Bayesian Network = a graph in which
● Nodes are random variables (RVs)
● Edges represent dependence
Dynamic Bayesian Network = a BN in which RVs are repeated once per time step
Example: an HMM is a DBN
● Most important RV: the “phonestate” variable q_t
● Typically q_t ∈ {Phones} × {1, 2, 3}
● Acoustic features x_t and video features y_t depend on q_t
Example: HMM as a DBN
(Figure: two frames, t−1 and t, of the DBN; each frame contains the variables q, ν, x, y, w, winc, and qinc.)
● q_t is the phonestate, e.g., q_t ∈ { /w/1, /w/2, /w/3, /n/1, /n/2, … }
● w_t is the word label at time t, e.g., w_t ∈ {“one”, “two”, …}
● ν_t is the position of phone q_t within word w_t: ν_t ∈ {1st, 2nd, 3rd, …}
● qinc_t ∈ {0,1} specifies whether ν_{t+1} = ν_t or ν_{t+1} = ν_t + 1
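The counter mechanics can be made concrete with a tiny deterministic transition function (a simplified sketch of the qinc/winc behavior; the function and variable names are this example's own, not the talk's notation):

```python
def advance(nu, qinc, winc):
    """One deterministic DBN transition for the phone-position counter nu:
    winc = 1 means a word boundary, so the counter resets to the 1st phone
    of the next word; otherwise qinc = 1 advances to the next phone within
    the current word, and qinc = 0 stays in the same phone."""
    if winc:
        return 1                       # first phone of the next word
    return nu + 1 if qinc else nu

print(advance(2, qinc=1, winc=0))      # 3: move to the 3rd phone of the word
```

In the full DBN these increments are random variables whose distributions depend on q_t, which is how phone durations and word boundaries are learned rather than hard-coded.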
● Even when reading phone numbers, talkers “blend” articulations.
● For example, “seven eight:” /sɛvәnet/ → /sɛvne?/
● As speech gets less formal, pronunciation variability gets worse, e.g., worse in a car than in the lab; worse in conversation than in read speech.
A Related Problem: Asynchrony
● Audio and video information are not synchronous.
● For example, the “th” (/θ/) in “three” is visible, but not yet audible, because the audio is still silent.
● Should the HMM be in q_t = “silence,” or q_t = /θ/?
(Figure: the coupled-HMM DBN over frames t−1 and t: an audio chain q_t with observation x_t and a video chain v_t with observation y_t, each with its own position counter and increment variable, sharing the word variables w_t and winc_t.)
A Solution: Two State Variables (Chu and Huang, ICASSP 2000)
● Coupled HMM (CHMM): two parallel HMMs
● q_t: audio state (x_t: audio observation)
● v_t: video state (y_t: video observation)
● τ_t = ν_t − υ_t: asynchrony between the audio and video position counters, capped at |τ_t| < 3
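The asynchrony cap determines which joint (audio, video) states the coupled HMM may occupy: of all pairs of per-stream positions, only those within the cap survive. A sketch of that restriction (state counts are illustrative, not from the talk):

```python
def chmm_joint_states(n_audio, n_video, max_async=2):
    """Joint (audio, video) position pairs of a coupled HMM, keeping only
    pairs whose asynchrony |i - j| is within the cap. The talk's |tau| < 3
    corresponds to max_async = 2."""
    return [(i, j)
            for i in range(n_audio)
            for j in range(n_video)
            if abs(i - j) <= max_async]

pairs = chmm_joint_states(5, 5, max_async=2)
print(len(pairs))   # 19 of the 25 unconstrained pairs remain
```

Capping the asynchrony keeps the joint state space (and hence decoding cost) close to linear in the number of per-stream states, instead of quadratic.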
Asynchrony in Articulatory Phonology (Livescu and Glass, 2004)
● It’s not really the AUDIO and VIDEO that are asynchronous…
● It is the LIPS, TONGUE, and GLOTTIS that are asynchronous.
(Figure: articulatory DBN with a shared word variable, per-stream index variables ind1–ind3, state variables S1–S3, observations U1–U3, and pairwise synchrony constraints sync1,2 and sync2,3.)
Asynchrony in Articulatory Phonology
“three,” dictionary form (time →):
● Tongue: Dental /θ/ → Retroflex /r/ → Palatal /i/
● Glottis: Unvoiced → Voiced
Asynchrony in Articulatory Phonology
“three,” casual speech (time →):
● Tongue: Dental /θ/ → Retroflex /r/ → Palatal /i/
● Glottis: Silent → Unvoiced → Voiced
● Same mechanism represents pronunciation variability:
– “Seven:” /vәn/ → /vn/ if tongue closes before lips open
– “Eight:” /et/ → /e?/ if glottis closes before tongue tip closes
An Articulatory Feature Model (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
● There is no “phonestate” variable. Instead, we use a vector q_t → [l_t, t_t, g_t]
– Lipstate variable l_t
– Tonguestate variable t_t
– Glotstate variable g_t
Experimental Test (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
● Training and test data: CUAVE corpus
– Patterson, Gurbuz, Tufekci and Gowdy, ICASSP 2002
– 169 utterances used, 10 digits each, silence between words
– Recorded without audio or video noise (studio lighting; silent background)
● Audio prepared by Kate Saenko at MIT
– NOISEX speech babble added at various SNRs
● Video prepared by Amar Subramanya at UW
– Feature vector = DCT of lip rectangle
– Upsampled from 33ms frames to 10ms frames
● Experimental condition: train-test mismatch
– Training on clean data
– Audio/video weights tuned on noise-specific dev sets
– Language model: uniform (all words equally probable), constrained to have the right number of words per utterance
Experimental Questions (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
1) Does video reduce word error rate?
2) Does audio-video asynchrony reduce word error rate?
3) Should asynchrony be represented as
   1) audio-video asynchrony (CHMM), or
   2) asynchrony among the lips, tongue, and glottis (AFM)?
4) Is it better to use only CHMM, only AFM, or a combination of both methods?
Results, part 1: Should we use video?
Answer: YES. Audio-visual WER < single-stream WER.
(Figure: WER (%) vs. condition — clean, SNR 12dB, 10dB, 6dB, 4dB, −4dB — for audio-only, video-only, and audiovisual systems.)
Results, part 2: Are audio and video asynchronous?
Answer: YES. Async WER < sync WER.
(Figure: WER (%) vs. condition — clean through SNR −4dB — for no asynchrony, 1-state, 2-state, and unlimited asynchrony.)
Results, part 3: Should we use CHMM or AFM?
Answer: DOESN’T MATTER! WERs are equal.
(Figure: WER (%) vs. condition — clean through SNR −4dB — for phone-viseme (CHMM) and articulatory-feature systems.)
Results, part 4: Should we combine systems?
Answer: YES. Best is AFM+CH1+CH2 ROVER.
(Figure: WER in the 17–23% range for A+C1+C2 ROVER vs. CU+C1+C2 ROVER; A: AFM, C1 and C2: CHMMs.)
Video Feature Extraction:
– Manifold discriminant is better than a global discriminant
Audio Feature Extraction:
– Beamformer: delay-and-sum beats filter-and-sum
– Postfilter: spectral subtraction gives the best WER (though MMSE-logSA sounds best)
– VAD: backoff noise estimation works best in this corpus
Audio-Video Fusion:
– Video reduces WER in train-test mismatch conditions
– Audio and video are asynchronous (CHMM)
– Lips, tongue, and glottis are asynchronous (AFM)
– It doesn’t matter whether you use CHMM or AFM, but…
– Best result: combine both representations