
Speech Communication 50 (2008) 337–353

Stream weight estimation for multistream audio–visual speech recognition in a multispeaker environment

Xu Shao *, Jon Barker

The University of Sheffield, Department of Computer Science, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK

Received 11 December 2006; received in revised form 9 November 2007; accepted 12 November 2007

doi:10.1016/j.specom.2007.11.002

This project was supported by grants from the UK Engineering and Physical Sciences Research Council (GR/T04823/01).

* Corresponding author. E-mail addresses: [email protected], [email protected] (X. Shao), [email protected] (J. Barker).

Abstract

The paper considers the problem of audio–visual speech recognition in a simultaneous (target/masker) speaker environment. The paper follows a conventional multistream approach and examines the specific problem of estimating reliable time-varying audio and visual stream weights. The task is challenging because, in the two speaker condition, signal-to-noise ratio (SNR), and hence audio stream weight, cannot always be reliably inferred from the acoustics alone. Similarity between the target and masker sound sources can cause the foreground and background to be confused. The paper presents a novel solution that combines both audio and visual information to estimate acoustic SNR. The method employs artificial neural networks to estimate the SNR from hidden Markov model (HMM) state-likelihoods calculated using separate audio and visual streams. SNR estimates are then mapped to either constant utterance-level (global) stream weights or time-varying frame-based (local) stream weights.

The system has been evaluated using either gender dependent models that are specific to the target speaker, or gender independent models that discriminate poorly between target and masker. When using known SNR, the time-varying stream weight system outperforms the constant stream weight systems at all SNRs tested. It is thought that the time-varying weight allows the automatic speech recognition system to take advantage of regions where local SNRs are temporarily high despite the global SNR being low. When using estimated SNR the time-varying system outperformed the constant stream weight system at SNRs of 0 dB and above. Systems using stream weights estimated from both audio and video information performed better than those using stream weights estimated from the audio stream alone, particularly in the gender independent case. However, when mixtures are at a global SNR below 0 dB, stream weights are not sufficiently well estimated to produce good performance. Methods for improving the SNR estimation are discussed. The paper also relates the use of visual information in the current system to its role in recent simultaneous speaker intelligibility studies, where, as well as providing phonetic content, it triggers 'informational masking release', helping the listener to attend selectively to the target speech stream.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Audio–visual speech recognition; Multistream; Multispeaker; Likelihood; Artificial neural networks

1. Introduction

Speech communication in its most natural face-to-face form is an audio–visual experience. Participants in a conversation both hear what is being said and see the corresponding movements of the speaker's face. The human speech perception system exploits this multimodality, integrating both the visual and audio streams of information to form a robust coherent percept. The central role of visual information has been described in recent accounts of speech perception (Massaro, 1998; Rosenblum, 2005). The visual speech information is particularly valuable when there are competing sound sources in the environment. When this is the case, the audio signal becomes unreliable, and observation of the speaker's face greatly improves speech intelligibility. Sumby and Pollack (1954) demonstrated that the visual speech signal could confer an increase in intelligibility equivalent to that produced by reducing the noise level by about 16 dB. More recent studies have demonstrated similarly dramatic benefits, particularly in cases where speech is masked by a competing speaker (Summerfield, 1979; Rudmann et al., 2003; Helfer and Freyman, 2005).

Demonstrations of the robustness of audio–visual speech have inspired much recent research in audio–visual automatic speech recognition (AVASR). There are two major research questions in this field. First, how best to represent the visual speech information (Patterson et al., 2002; Matthews, 1998; Potamianos et al., 1998). Second, how to combine the audio and visual information so as to optimise recognition performance (Lucey et al., 2005; Chibelushi et al., 2002; Dupont and Luettin, 2000). Although these questions are no doubt subtly connected they are usually addressed in isolation. It is the latter question, that of audio–visual integration, that forms the focus of the work presented here.

Although visual speech is inherently ambiguous, with many words having an identical appearance (e.g. 'pop', 'bop', and 'bob'), its value can be clearly seen by noting its ability to disambiguate acoustically confusable word pairs. For example, consider the words 'met' and 'net'. Although they are acoustically similar, being distinguished by subtle differences in the nasal consonants /m/ and /n/, they are visually distinct, with the lips closing for /m/ but not for /n/. In situations where the acoustic differences are masked by additive noise, the visual features may be all that is available to distinguish the two words. Although it is clear that the visual stream carries valuable information, making optimal use of this information has proved very difficult in practice.

AVASR systems typically incorporate a parameter that controls the relative influence of the audio and visual streams; in the common multistream formalism that is used in the current study, this parameter is known as the stream weight. Much recent research has been devoted to developing schemes for estimating suitable values for the stream weight. A standard approach is to base the stream weight on an estimate of the signal-to-noise ratio (SNR) (Dupont and Luettin, 2000; Meier et al., 1996; Patterson et al., 2001). SNR-based stream weights have proven to be very successful in a number of scenarios, particularly in cases where the background noise is approximately stationary. These systems typically employ a fixed stream weight for the duration of an utterance. In situations where the background noise is highly non-stationary, a time-varying stream weight based on local SNR estimates, or local audio reliability measures, is more appropriate (e.g. Meier et al., 1996; Glotin et al., 2001). However, estimating local SNR is itself a challenging problem.

In the current work we search for solutions to the stream weighting problem in a particularly challenging condition: the case where the background noise source is itself a speech signal (which we will refer to as the 'masking speaker'). This situation is of particular interest because it is one that occurs frequently in everyday life, and it is a situation in which human performance has been closely studied. Listening tests have shown that speech intelligibility is remarkably robust to the effects of background speech. This is particularly true if the target speaker is sufficiently 'unlike' the masker speaker that it can be selectively attended to with minimal distraction (Brungart, 2001). However, background speech causes particular problems for ASR. The difficulty here is twofold. First, as already mentioned, the noise signal is highly non-stationary. This means that the SNR is changing from instant to instant, and the reliability of the audio signal is rapidly varying. Second, the noise is not easily distinguished from the speech signal. Acoustically, a region of the signal dominated by the masking speaker (low SNR) may be little different to a region dominated by the target speaker (high SNR). So the non-stationarity demands a time-varying stream weight, but the weight is hard to estimate reliably from local properties of the acoustic signal.

We attempt to solve the weight estimation problem by involving both the audio and video streams in the local SNR estimation. In outline, we employ a standard state-synchronous multistream system (Potamianos et al., 2003) in which the underlying generative model for the audio–visual speech is a hidden Markov model (HMM) in which each state generates both audio and visual observations drawn from different distributions (i.e. audio and video observations are modelled as being independent given the state). The system can be trained on noise-free audio–visual speech. To make it robust to acoustic noise, the audio and visual likelihoods are combined using exponential weights based on a measure of their relative reliability. In the current work the stream reliability is estimated from HMM state-likelihood information using artificial neural networks (ANN) (Bishop and Oxford, 1995). The performance of ANNs trained either on audio-likelihoods, or on combined audio- and visual-likelihoods, is compared. The hypothesis is that the pattern of audio and visual likelihoods should distinguish between: (i) local SNRs close to 0 where the acoustics match the models poorly; (ii) positive SNRs where the acoustics match the models well and are correlated with the visual information; and (iii) negative SNRs where the (masker) acoustics may match the models well, but will be poorly correlated with the target's visual features. The audio–visual SNR estimator is expected to outperform the audio-only SNR estimator because the acoustic likelihoods alone are not sufficient for distinguishing between cases (ii) and (iii), i.e. regions of high SNR where the target matches the models and regions of low SNR where the masker matches the models.

The paper compares the role of video information in stream weight estimation in two different scenarios. In the first, male target utterances are mixed with female masker utterances, and recognition is performed using models trained on male speech. In this task the acoustic models are specific to the target. In the second scenario, male or female target utterances are mixed with either male or female maskers (but not from the same speaker), and recognition is performed using models that have been trained on a mixture of male and female speech. In this case the acoustic models are not specific to the target, and could equally well match the masking utterance.

The remainder of this paper is arranged as follows. Section 2 reviews the multistream approach to AVASR that we use to integrate the audio and visual feature streams. Section 3 details the AVASR system employed in this work, and in particular, the proposed method for estimating a time-varying stream weight. Section 4 details experiments that evaluate the behaviour of the system when using both same-gender and mixed-gender utterance pairs. Finally, Section 5 presents conclusions and discusses possible directions for future work.

2. Background

2.1. Integration of audio and visual features

AVASR systems can be broadly classified according to the manner in which they combine the incoming audio and visual information streams; see Lucey et al. (2005) for a recent review. Most systems can be described as performing either feature fusion or decision fusion. In feature fusion systems, also termed 'early integration' (EI), the audio and visual feature vectors are combined, typically by simple concatenation, and the classifier learns the statistics of the joint audio–visual observation. By contrast, in decision fusion systems, also termed 'late integration' (LI), separate classifiers are constructed for the audio and visual features, and it is the classifier outputs, rather than the feature vectors themselves, that are combined. A third class of techniques that combine the audio and visual signal before the feature extraction stage has largely been abandoned.

Most recent systems, including the work presented here, use some form of LI (e.g. Luettin et al., 2001; Garg et al., 2003; Potamianos et al., 2003). Although EI allows the modelling of the complete audio–visual observation (and hence can model detailed correlation between features of the audio and video observations), it suffers to the extent that corruption of either the audio or the visual data stream can lead to incorrect decisions being made. In contrast, late integration systems can easily be made robust to known corruption of either stream by simply weighting the audio and visual classifier decisions during combination.

The design of LI systems is very flexible and many variations on the theme exist, but they can be roughly sub-classified according to the lexical level at which decision fusion occurs. At the lowest level, decisions are fused on a frame by frame basis. If the streams are modelled using HMMs then this equates to combining the likelihoods of corresponding audio and visual HMM model states, often termed 'state-synchronous decision fusion'. Combining decisions at higher lexical levels, such as the phoneme or word level, is usually achieved by using a parallel HMM design. Within each unit (phoneme or word) the model progresses independently through the states of the separate audio and visual HMMs under the constraint that the boundaries between units occur synchronously in the audio and visual domains. This technique can model some of the asynchrony within the modelling units that is observed between the audio and visual speech streams. For example, visual evidence for the onset of a phoneme often precedes the audio evidence of the same event (Luettin et al., 2001; Massaro and Stork, 1998). This extra modelling power can produce small performance improvements when recognising spontaneous speech. In the current work we are employing a small vocabulary 'read speech' task, and precise acoustic modelling of audio–visual asynchrony is considered of secondary importance to the design of techniques for reliably estimating stream weights. Accordingly we employ the simpler state-synchronous decision fusion technique.

In state-synchronous decision fusion AVASR systems, the HMM can be considered to be a generative model which produces observations for both the audio and the visual streams that are independent given the state. So the joint likelihood of the observed audio and visual features, o_{a,t} and o_{v,t} respectively, given state q is computed as

P(o_{a,t}, o_{v,t} \mid q) = P(o_{a,t} \mid q) \cdot P(o_{v,t} \mid q) \qquad (1)

In order to make the system robust to acoustic noise, at recognition time the state-likelihoods are replaced with a score based on a weighted combination of the audio and visual likelihood components,

S(o_{a,t}, o_{v,t}) = P(o_{a,t} \mid q)^{\lambda_{a,t}} \cdot P(o_{v,t} \mid q)^{\lambda_{v,t}} \qquad (2)

where the exponents \lambda_{a,t} and \lambda_{v,t} are the audio and visual stream weights. Typically, they are constrained such that \lambda_{a,t}, \lambda_{v,t} \ge 0 and \lambda_{a,t} + \lambda_{v,t} = 1. These scores are usually computed in the log domain,

\log S(o_{a,t}, o_{v,t}) = \lambda_t \log P(o_{a,t} \mid q) + (1 - \lambda_t) \log P(o_{v,t} \mid q) \qquad (3)

where \lambda_t lies in the range [0, 1]. Values of \lambda_t close to 0 give emphasis to the video stream and are used when the audio stream is believed to be unreliable; values close to 1 give emphasis to the audio stream.
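The weighted log-domain combination of Eq. (3) is straightforward to compute once per-state log-likelihoods are available for both streams. The following minimal sketch (not from the paper; array names and shapes are assumptions for illustration) shows the frame-wise combination for a time-varying weight:

```python
import numpy as np

def combine_stream_scores(log_lik_audio, log_lik_video, lam):
    """Weighted state-score combination of Eq. (3).

    log_lik_audio, log_lik_video : arrays of shape (T, N) holding
        log P(o_{a,t}|q) and log P(o_{v,t}|q) for T frames and N HMM states.
    lam : array of shape (T,) with the audio stream weight per frame,
        each value in [0, 1].
    Returns an array of shape (T, N) of combined log scores.
    """
    lam = lam[:, None]                      # broadcast the weight over states
    return lam * log_lik_audio + (1.0 - lam) * log_lik_video

# Example with random scores: 5 frames, 3 states, constant weight 0.7
T, N = 5, 3
rng = np.random.default_rng(0)
scores = combine_stream_scores(rng.normal(size=(T, N)),
                               rng.normal(size=(T, N)),
                               np.full(T, 0.7))
print(scores.shape)  # (5, 3)
```

The combined scores would then be passed to a standard Viterbi decoder in place of the usual state log-likelihoods.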

2.2. Stream weight estimation

The stream weighting parameter is related to the relative reliability of the audio and visual modalities, which in turn is dependent on the SNR. If the acoustic signal has high SNR at time t, the audio stream is reliable and a high value of \lambda_t can be used. In contrast, during regions of low SNR the stream weight parameter should be reduced so that the visual information is emphasised. In previous studies various audio-based stream weight estimation strategies have been employed. For example, Glotin et al. (2001) used voicing as a measurement of audio reliability. Garg et al. (2003) employed the N-best log-likelihood as an SNR indicator to measure the modality reliability. Tamura et al. (2005) estimated the stream weight from the normalised likelihoods. Gurbuz et al. (2002) chose the stream confidence value from a lookup table according to the noise type and SNR. These earlier techniques have all tried to estimate the stream weight from the acoustic signal. However, in the two speaker problem, the acoustic properties of regions dominated by the masking speech source may be similar to those of regions dominated by the target. This ambiguity makes it difficult to achieve accurate stream weight estimates using purely acoustic measurements. As a solution to this problem, the current study proposes an audio–visual stream weighting method (see Section 3.3.2).

Another design choice is the time interval over which to estimate the stream weight. Many previous systems have kept the stream weight fixed over the duration of an utterance (Gurbuz et al., 2002; Cox et al., 1997). In such systems the stream weight is being set according to some average value of the SNR. Even when the additive noise is stationary, the SNR computed over a short time window will vary widely. For example, during voiced regions the speech signal may dominate the noise, making a high value of \lambda appropriate, whereas during an unvoiced fricative the noise may mask the speech audio, meaning that a low value of \lambda should be used to give temporary emphasis to the video signal. In the speech-plus-speech case the non-stationarity of the masker makes these effects even larger. In the data we employ, the local (frame-based) SNR can vary by as much as 30 dB above and below the global SNR. In the current work we attempt to capture this variation by allowing the stream weight to vary from frame to frame.


Dynamic stream weighting techniques have previously been proposed by Meier et al. (1996) and evaluated with non-speech maskers at SNRs down to 8 dB, and subsequently evaluated by Glotin et al. on a large vocabulary clean-speech task. However, the stream weighting schemes proposed by Meier et al. are not appropriate for the two speaker problem. Their SNR-based weight estimation technique employs noise spectrum estimation techniques due to Hirsch (1993). These do not work reliably in speech plus speech situations. They also proposed a stream weight based on the entropy of posterior phoneme and viseme probabilities. This technique is closer in spirit to our system, but crucially such a technique would fail in the two speaker situation as it relies on the phoneme entropy being high when the SNR is low, whereas in fact the opposite happens: the more the SNR is reduced, the more the masker dominates the target, and the more the acoustics become like those of a clean speech source. In our system we attempt to overcome these problems by basing the stream weight estimate on the complete pattern of audio and visual speech-unit probabilities, using an ANN to learn the mapping. By so doing, information about the correlation between the audio and visual data can be learnt: if the audio appears 'clean' but does not 'fit' with the video then the target is being dominated by the masker and the audio stream should have a low weight.

3. System description

An overview of the AVASR system is shown in Fig. 1, illustrating the system's three components: feature extraction, audio–visual HMM training and audio–visual HMM testing.

Fig. 1. The block diagram for the audio–visual speech recognition system.

In brief, audio and visual feature vectors are extracted from the acoustic and visual data respectively and concatenated at a 100 Hz frame rate to form audio–visual feature vectors (Section 3.1). The multistream HMMs are then trained (Section 3.2). The audio and visual HMMs are first trained separately using clean audio and visual features. After independent training, the state alignments implied by the audio and video HMMs are not necessarily consistent. A joint AV training step is employed to make the models compatible. The optimised AV HMMs are then passed to the testing stage (Section 3.3). First, the unweighted audio and video state-likelihoods, P(o_a \mid q) and P(o_v \mid q), are computed. An ANN is used to learn how to map these likelihoods onto frame-based SNR estimates. Then, using a hand optimised mapping, the frame-based SNR estimates are either (i) mapped directly onto a time-varying stream weight, or (ii) first converted into an utterance-level SNR (global SNR) estimate and then mapped onto a stationary stream weight. The audio and visual state-likelihoods are then weighted and combined to produce scores for each HMM state. The recogniser output is obtained using a standard Viterbi decoder to find the best scoring state sequence. The details of these procedures are described in the following sections.

3.1. Feature extraction

The acoustics are represented using standard MFCC features (Davis and Mermelstein, 1980). The 25 kHz acoustic signal is passed through a high-pass preemphasis filter (1 − 0.97z^{-1}). Thirteen MFCC coefficients are computed using 25 ms Hamming windows at a 10 ms interval. The MFCC coefficients are supplemented by their temporal differences computed using linear regression over a three frame window, to produce a 26-dimensional feature vector.
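As a concrete illustration, an audio front end of this shape can be approximated with an off-the-shelf toolkit. The sketch below uses librosa, which is an assumption (the paper does not name its feature extraction software), to produce 13 MFCCs plus their temporal differences at a 100 Hz frame rate:

```python
import librosa
import numpy as np

def audio_features(wav_path):
    # Load the waveform at its native rate (25 kHz for the Grid corpus audio).
    y, sr = librosa.load(wav_path, sr=None)
    # High-pass preemphasis filter 1 - 0.97 z^-1.
    y = librosa.effects.preemphasis(y, coef=0.97)
    # 13 MFCCs from 25 ms Hamming windows with a 10 ms hop (100 Hz frame rate).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr),
                                window='hamming')
    # Temporal differences over a three-frame window.
    delta = librosa.feature.delta(mfcc, width=3)
    # 26-dimensional feature vector per frame, shape (T, 26).
    return np.vstack([mfcc, delta]).T
```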

The video features are based on a 2D-DCT representation of a rectangular region encompassing the speaker's lips, extracted using a technique similar to that of Patterson et al. (2002). For each speaker, six video frames are randomly selected. For each frame the outer lip contour is hand labelled. These hand segmented images are then used to train separate three-component Gaussian mixture models (GMM) for the distribution of pixel RGB values in (i) the lip region, and (ii) the region surrounding the lips. Then in each frame of a video sequence, a Bayes classification of the pixels in a search region around the estimated centre of the mouth is performed such that each pixel is labelled as either 'lip' or 'skin' (i.e. a binary image). Opening and closing morphological operations are then applied to remove 'salt and pepper' noise from the binary image. The largest connected region bearing the 'lip' label is identified. Then the centre of gravity of this region is taken as the new position of the mouth centre. In this way the centre of the mouth can be tracked from frame to frame (this simple tracking system is similar in principle to the CamShift algorithm; Bradski, 1998). The region of interest is then taken as a rectangular box positioned at the mouth centre with an area proportional to that of the area of the estimated lip region. This procedure was applied to all 34 speakers in the database. The results were checked to ensure that the box reliably tracked the lips throughout each utterance.

For each video frame the image in the box surrounding the lips is downsampled to 32 × 32 pixels, and then projected into feature space using a 2D-DCT from which the 36 (6 × 6) low-order coefficients are extracted as visual features. These are supplemented with their dynamic features to produce a 72-dimensional visual feature vector. Linear interpolation is then employed to upsample the visual stream from 25 frames per second to the rate of 100 frames per second employed by the audio stream.
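The DCT-based visual front end can be sketched in a few lines. The code below is an illustration only: it assumes the lip-region crops have already been produced by the tracking step, and the way the dynamic features and the 25 fps to 100 fps interpolation are computed here (first differences, per-coefficient linear interpolation) is an assumption rather than the paper's exact recipe.

```python
import numpy as np
from scipy.fft import dctn
from scipy.ndimage import zoom

def lip_dct_features(lip_images):
    """lip_images: sequence of 2-D grayscale lip-region crops, one per video frame."""
    feats = []
    for img in lip_images:
        # Downsample the crop to 32 x 32 pixels.
        small = zoom(img.astype(float), (32.0 / img.shape[0], 32.0 / img.shape[1]))
        # 2D-DCT; keep the 6 x 6 block of low-order coefficients (36 values).
        coeffs = dctn(small, type=2, norm='ortho')[:6, :6].ravel()
        feats.append(coeffs)
    feats = np.array(feats)                      # shape (T_video, 36)
    # Dynamic (delta) features by simple differencing -> 72 dimensions.
    delta = np.gradient(feats, axis=0)
    feats = np.hstack([feats, delta])            # shape (T_video, 72)
    # Upsample from 25 fps to 100 fps by linear interpolation in time.
    t_video = np.arange(len(feats)) / 25.0
    t_audio = np.arange(4 * len(feats)) / 100.0
    return np.array([np.interp(t_audio, t_video, feats[:, d])
                     for d in range(feats.shape[1])]).T
```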

Note, in this work we wish to examine the issue of audio–visual integration assuming the presence of high quality video features, without regard to how these features have been produced. The reliability of the visual features is being ensured by training speaker specific colour models that are also specific to the lighting conditions of the particular corpus recording session. Furthermore, results of the feature extraction are being monitored and poorly represented speakers are being rejected. By using artificial but high quality video features we hope to be able to focus the study on the problem of stream integration, without having to contend with the problems of varying reliability in the video features. Whether such reliable video features can be produced in a real application is a separate research question.

3.2. Audio and video HMM training

The construction of the multistream AV HMM commences by first training independent audio and video word-level HMMs using clean audio and visual features. These two models are then used to initialise the parameters of the state-synchronous multistream AVASR system described in Section 2.1. The Gaussian mixture models (GMMs) describing the observation distributions for the states of the audio and visual stream are taken directly from the corresponding states in the independently trained models. The transition matrix for the multistream HMM is initialised to be that of the transition matrix of the audio HMM. Unfortunately, the portions of the signal that are represented by corresponding HMM states of the independently trained audio and visual HMMs are not necessarily the same. Generally there is an audio–visual asynchrony, with the onset of lip movement often preceding the acoustic evidence for the phoneme (Luettin et al., 2001; Massaro and Stork, 1998). This lack of synchrony leads to poor performance, as the AV HMM has been made by combining corresponding audio and visual HMM states under the assumption of state synchrony. The parameters of the AV HMM have to be retrained so that the audio and visual components are compatible. As the clean speech recognition performance suggests that the audio parameters are far more informative than their video counterparts, the retraining stage is performed in such a way that the audio parameters and the transition matrices (which were adopted from the audio HMM) are held constant and only the GMMs of the visual stream are adapted.

The retraining step (shown in the block diagram, Fig. 1) proceeds as follows. The full set of AV HMM parameters is retrained using the Baum–Welch algorithm (Rabiner, 1989; Young et al., 1995) and the clean data training set. During training the stream weight is set to 0.5. After each Baum–Welch iteration, the emission distribution for the audio component, and the state transition probability matrices, are reset to the values they had before reestimation. This forces the AV model to adopt the segmentation that was inferred by the audio-only model. If the audio parameters are not held constant in this way, it was found that the training unacceptably reduces the performance of the system in clean conditions, where the video stream weight is close to 0. It is possible that the clamping of the audio parameters would not have been necessary if a higher audio stream weight was used during training. However, the current procedure produces a model that performs well in both the video-only and audio-only conditions, and seems to be appropriate for the small vocabulary task employed, in which the video parameters are essentially redundant in the clean speech condition.
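The clamped retraining loop can be summarised schematically. In the sketch below, `baum_welch_iteration` and `word_accuracy` are placeholders for whatever toolkit routines perform the reestimation and scoring (e.g. HTK-style routines; Young et al., 1995), and the `audio_gmms` and `trans_mat` attributes are assumed names; the only point being illustrated is that the audio emission parameters and transition matrix are restored after every iteration while the best-scoring iteration on the cross-validation set is retained.

```python
import copy

def retrain_av_hmm(av_hmm, train_data, cv_data,
                   baum_welch_iteration, word_accuracy, max_iters=20):
    """Joint AV retraining with the audio stream clamped (schematic sketch).

    baum_welch_iteration(model, data, stream_weight) and
    word_accuracy(model, data) must be supplied by the caller.
    """
    best_model, best_acc = copy.deepcopy(av_hmm), word_accuracy(av_hmm, cv_data)
    for _ in range(max_iters):
        # Snapshot the audio GMMs and transition matrix before reestimation.
        audio_gmms = copy.deepcopy(av_hmm.audio_gmms)
        trans_mat = copy.deepcopy(av_hmm.trans_mat)
        # One Baum-Welch pass over the clean training set, stream weight 0.5.
        av_hmm = baum_welch_iteration(av_hmm, train_data, stream_weight=0.5)
        # Clamp: restore audio emissions and transitions, keep the new visual GMMs.
        av_hmm.audio_gmms = audio_gmms
        av_hmm.trans_mat = trans_mat
        # Keep the model that does best on the clean cross-validation set.
        acc = word_accuracy(av_hmm, cv_data)
        if acc > best_acc:
            best_model, best_acc = copy.deepcopy(av_hmm), acc
    return best_model
```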

After each training iteration, performance is tested on a cross-validation data set and the procedure is repeated until the performance reaches a maximum. Fig. 2 shows the development of the recognition performance during the joint training of the audio–visual HMMs. Performance for the clean speech cross-validation set is plotted against the number of reestimation iterations. When the audio and visual models are initially combined, the performance is not as good as that obtained using the audio model alone, presumably because of the mismatch in the audio and visual models described above. As the retraining iterations are repeated the performance of the system improves. The state-alignment implied by the parameters of the visual HMMs becomes consistent with that of the audio HMMs. At a particular point (eight iterations, in this case) the performance reaches a peak, after which performance starts to reduce. This pattern is consistent with the effects of overfitting. The model trained with eight iterations is the one that is selected to be applied to the test set.

Fig. 2. The development of the audio–visual speech recognition performance measured on the clean cross-validation set during joint AV HMM reestimation. Accuracy is plotted after each iteration of parameter reestimation.

Interestingly, it was also noted that, in our experiments, the video-only performance gained using the retrained visual stream HMMs was slightly better than that obtained using the visual stream HMMs before joint AV training. This can possibly be understood by analogy to supervised training: the audio parameters, which are generally more powerful than the visual parameters, act like labels for the unknown state indicated by the visual parameters. The fact that there is scope for the audio parameters to help in this way may indicate that the initial video-only training was not robust and suffered from poor initialisation.

3.3. Audio visual HMM testing

The recognition process involves three stages: (i) calculation of state log-likelihoods for both the audio and visual components of the HMM, i.e. P(o_{a,t} \mid q) and P(o_{v,t} \mid q) in Eq. (2); (ii) estimation of the stream weights, i.e. \lambda_t and 1 − \lambda_t, by first estimating either local or global SNR as a function of the state log-likelihoods; (iii) Viterbi decoding of the HMM state scores, S(o_{a,t}, o_{v,t}), computed from the stream weights and the log-likelihoods according to Eq. (2). To analyse the performance of the system, two pairs of conditions (audio-likelihood versus audio–visual likelihood, and local SNR versus global SNR) are combined to give four possible sets of results to compare.

3.3.1. Audio and video state-likelihood computation

At recognition time, for each frame of observed audio and video features the log-likelihood of each HMM state is computed. Hence for an N-state HMM, N audio-based likelihoods and N video-based likelihoods are computed and stored as a pair of N-dimensional likelihood vectors. For the small vocabulary Grid task (Cooke et al., 2006) used for the current evaluations, word-level HMMs are employed with a total of 251 states (see Section 4.1 for details). Some of the states have very similar emission distributions. For example, considering the audio stream, the states corresponding to the phoneme /iy/ in the English letters E and D, and the digit three, will have similar distributions. Likewise, in the video stream, the states corresponding to the plosive visemes at the starts of the Grid corpus words bin and place will also be similar (see Section 4.1 for the full set of words used in the Grid task). These states will have very similar likelihoods regardless of the observation. Hence, it is possible to reduce the dimensionality of the likelihood vector without loss of information by representing the likelihood for such groups of states with a single value. This is implemented by first employing a state-clustering technique (Young et al., 1995) to identify similar states. The clustering is performed separately for the audio and video-based HMMs, and the degree of clustering is determined by that which gives the best recognition performance when testing on clean speech. The dimensionality of the likelihood vector is then reduced by replacing the elements corresponding to the members of a cluster with a single value computed by averaging the likelihoods of those states. The reduced likelihood vectors are then used as the basis for the stream weight estimation described in the next section.
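The dimensionality reduction step amounts to averaging likelihoods within each state cluster. A minimal sketch follows; the cluster assignments are assumed to come from a separate state-clustering step as in the paper, and whether the average is taken in the likelihood or log-likelihood domain is not stated there, so the sketch averages likelihoods and returns the result in the log domain.

```python
import numpy as np
from scipy.special import logsumexp

def reduce_likelihoods(log_liks, cluster_ids):
    """Average per-state likelihoods within clusters.

    log_liks    : array (T, N) of per-frame, per-state log-likelihoods.
    cluster_ids : array (N,) giving a cluster index for each state,
                  e.g. 0..53 for the 54 audio clusters reported in the paper.
    Returns an array (T, n_clusters) of cluster-averaged log-likelihoods.
    """
    n_clusters = int(cluster_ids.max()) + 1
    reduced = np.empty((log_liks.shape[0], n_clusters))
    for c in range(n_clusters):
        members = np.flatnonzero(cluster_ids == c)
        # Log of the mean likelihood of the states in this cluster.
        reduced[:, c] = (logsumexp(log_liks[:, members], axis=1)
                         - np.log(len(members)))
    return reduced
```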

3.3.2. ANN-based stream weight estimation

In this work we wish to see whether the use of video information may solve a problem specific to the speech-plus-speech case: in regions of the signal where the target talker dominates, the patterns of log-likelihood can be similar to the patterns seen in regions where the masker dominates; in general there is a symmetry around 0 dB local SNR. This symmetry will exist to the extent that the target and masker speaker fit equally well to the speech models. For example, this problem should be particularly apparent when using speaker independent models, but less apparent if using gender-dependent models and mixed gender utterance pairs. Potentially, this confusion can be disambiguated using the visual likelihoods. At positive local SNRs, the visual and audio likelihoods will be concentrated in corresponding HMM states (e.g. if the state representing an audio 'f' has a high likelihood then the state representing a visual 'f' should also have a high likelihood). At negative local SNRs, the audio and visual likelihoods will generally be concentrated in different HMM states because the masker's speech is not correlated with the target speaker's lip movements.

Based on these considerations, our system operates in two stages: first, SNR is estimated by considering the match between the data and the clean speech models, and second, SNR is mapped to a stream weight using a hand optimised mapping (see the dash-dot box in Fig. 1). The difficulty arises in producing reliable estimates of either local or global SNR. Following previous work, the SNR estimates are based on the HMM state-likelihoods. However, unlike previous approaches which have based the estimates on specific features of the pattern of likelihoods, such as entropy (Okawa et al., 1999; Potamianos and Neti, 2000) or dispersion (Potamianos and Neti, 2000; Adjoudani and Benoit, 1996), we attempt the more general approach of trying to learn the SNR directly from the complete likelihood data using artificial neural networks (ANN). Furthermore, in our experiments, as well as employing mappings based on the audio-only stream likelihoods, we also consider mappings that are based on the state-likelihoods of both the audio and visual streams.

Multi-layer perceptrons (MLPs) with three layers were employed, with an input unit for each element of the likelihood vector and a single output unit representing local SNR. The network employs a single hidden layer. Units in the hidden layer and output layer have sigmoid and linear activation functions respectively. The network is trained using conjugate gradient descent (Shewchuk, 1994). The number of hidden units is optimised by observing the errors between the MLP output and the target SNR over a validation data set. The MLP topology that produces the minimum error is selected. The network is trained using either the log-likelihoods of the audio stream, or the concatenated log-likelihoods of both the audio and visual streams.
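For illustration, an MLP regressor of this shape can be set up with scikit-learn as below. This is only a stand-in: scikit-learn's MLPRegressor trains with L-BFGS/SGD/Adam rather than the conjugate gradient method used in the paper, the training data here is random placeholder data, and the hidden-layer size shown is simply the value the paper reports for one configuration (46 units for the gender-dependent, audio-only network).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# X: (n_frames, n_inputs) cluster-averaged log-likelihood vectors
#    (54 audio dims, or 72 = 54 + 18 audio-visual dims in the paper).
# y: (n_frames,) target local SNR in dB, computed from a priori
#    knowledge of the unmixed target and masker signals.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 72))          # placeholder data for the example
y = rng.uniform(-30, 30, size=1000)

snr_mlp = MLPRegressor(hidden_layer_sizes=(46,),   # single sigmoid hidden layer
                       activation='logistic',      # output unit is linear by default
                       solver='lbfgs',
                       max_iter=500)
snr_mlp.fit(X, y)
local_snr_estimate = snr_mlp.predict(X[:10])
```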

The sequence of MLP outputs is median smoothed to remove outlying SNR estimates. The time-varying local SNR can either be used directly to produce a time-varying stream weight estimate, or the local SNR estimates are first converted into a single global SNR estimate and then used to produce a single utterance-level stream weight. The global SNR is calculated according to

SNR_g = 10 \log_{10}\left( \sum_{i=1}^{I} \frac{E_i B_i}{1 + B_i} \middle/ \sum_{i=1}^{I} \frac{E_i}{1 + B_i} \right) \qquad (4)

where B_i = 10^{SNR_i/10}, and E_i and SNR_i represent the ith frame energy and SNR, respectively. SNR_g denotes the global SNR and I is the total number of frames in the utterance.
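In code, Eq. (4) simply converts each frame's local SNR into a target/masker split of that frame's energy and then takes the ratio of the summed energies. A small sketch, under the assumption that E_i is the energy of the observed mixed frame:

```python
import numpy as np

def global_snr(local_snr_db, frame_energy):
    """Eq. (4): combine per-frame SNR estimates into an utterance-level SNR.

    local_snr_db : array (I,) of local SNR estimates in dB (median-smoothed
                   MLP outputs).
    frame_energy : array (I,) of frame energies E_i of the mixed signal.
    """
    B = 10.0 ** (np.asarray(local_snr_db) / 10.0)
    target_energy = np.sum(frame_energy * B / (1.0 + B))   # target's share of E_i
    masker_energy = np.sum(frame_energy / (1.0 + B))       # masker's share of E_i
    return 10.0 * np.log10(target_energy / masker_energy)

# Example: global_snr([5.0, -3.0, 10.0], [1.2, 0.8, 2.0])
```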

3.3.3. Optimising the lookup table and likelihood scaling factor

It has been documented in previous studies (e.g. Gravier et al., 2002) that normalising the log-likelihoods of both streams to have equal standard deviation can improve the performance of multistream-based systems. This normalisation can be expressed as

b'_{v,i} = \frac{Std_a}{Std_v} \cdot b_{v,i} \qquad (5)

where b_{v,i} and b'_{v,i} are log-likelihoods of the ith frame of the visual stream, \log P(o_{v,t} \mid q), before and after scaling respectively. Std_a and Std_v denote the original standard deviations of the log-likelihoods for the audio and visual stream, respectively. The normalisation is performed on a per-utterance basis, i.e. a separate scaling factor is computed for each utterance by pooling log-likelihoods across all states and across all frames of the utterance.

In the current work it was further noted that, for the time-varying stream weight system, a further increase in performance could be gained by optimising a global scaling constant, g, applied to the normalised video stream likelihood,

b''_{v,i} = g \cdot \frac{Std_a}{Std_v} \cdot b_{v,i} \qquad (6)

The inclusion of g allows the ratio of the standard deviation of the audio and visual streams to be set to an arbitrary value.
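Eqs. (5) and (6) together amount to a single rescaling of the visual log-likelihoods. A minimal sketch, with the standard deviations pooled over all states and frames of the utterance as described above:

```python
import numpy as np

def scale_visual_likelihoods(log_lik_audio, log_lik_video, g=1.0):
    """Apply Eqs. (5)-(6): match the visual log-likelihood spread to the
    audio stream, then apply the global scaling constant g.

    log_lik_audio, log_lik_video : arrays (T, N) of per-utterance
    log-likelihoods; statistics are pooled over frames and states.
    """
    std_a = np.std(log_lik_audio)
    std_v = np.std(log_lik_video)
    return g * (std_a / std_v) * log_lik_video
```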

Note that varying g will have no effect on the global SNR system. The effect of any multiplicative constant applied to one set of likelihoods can be equally well introduced as part of the SNR to stream weight mapping. As the mapping is optimised for the global SNR system after the application of Eq. (6), scaling the video likelihoods will change the mapping but will not change the recognition result. However, scaling the likelihoods does have an effect on the performance of the local SNR-based system.

As noted above, the SNR to stream weight lookup table is dependent on the value of g. In other words, the lookup table and scaling factor must be jointly optimised. This joint optimisation is performed in such a way as to maximise recognition performance for both the global and local SNR-based systems. First, the value of g is chosen from a number of predefined values ranging from 0.35 to 25. When g = 1 the standard deviations of the audio and visual likelihoods are the same. When g > 1 the standard deviation of the audio likelihood is greater than that of the visual likelihood, and vice versa. The selected value of g and the training dataset (the same dataset that was used for training the neural network) are both passed to the exhaustive search block. This block performs an exhaustive search to optimise the lookup table which maps between global SNR and stream weight, i.e. at each value of global SNR a series of different stream weights is tested and the stream weight that produces the best recognition performance is recorded. This table is then employed to look up dynamic stream weights for recognition tests using the same training dataset and the same likelihood scaling factor, g. The lookup table and likelihood scaling factor, g, which lead to the best performance for both the global SNR and the local SNR systems are chosen as those to be used for evaluation on the final test set.
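The joint optimisation is essentially a nested grid search. The sketch below shows the structure only: `recognition_accuracy` stands for a full weighted decoding pass over the development data, `dev_data.at_snr` is an assumed helper for selecting mixtures at one global SNR, and the candidate value lists in the comments are illustrative rather than the ones used in the paper.

```python
def optimise_table_and_scaling(dev_data, global_snrs, candidate_weights,
                               candidate_g, recognition_accuracy):
    """Jointly choose the likelihood scaling factor g and the
    global-SNR -> stream-weight lookup table (schematic sketch).

    recognition_accuracy(data, g, table) must run the weighted AV decoder
    and return word accuracy; it is not defined here.
    """
    best = (None, None, -1.0)
    for g in candidate_g:                          # e.g. [0.35, 0.5, 1, 2, ..., 25]
        # Build the lookup table one SNR point at a time: for each global
        # SNR keep the constant stream weight that scores best.
        table = {}
        for snr in global_snrs:                    # e.g. [-10, 0, 5, 10, 15, 20]
            table[snr] = max(candidate_weights,
                             key=lambda w: recognition_accuracy(
                                 dev_data.at_snr(snr), g, {snr: w}))
        # Score the complete (g, table) pair on the whole development set.
        acc = recognition_accuracy(dev_data, g, table)
        if acc > best[2]:
            best = (g, table, acc)
    return best
```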

Table 1
Structure of the sentences in the Grid corpus (Cooke et al., 2006)

Verb    Colour   Prep.   Letter     Digit       Adverb
bin     blue     at      a–z        1–9         again
lay     green    by      (no 'w')   and zero    now
place   red      on                             please
set     white    with                           soon

4. Experimental results and analysis

This section describes the evaluation of the system. It will commence by describing the audio–visual speech data employed and the specifics of the system set-up. Following this, three sets of ASR experiments are presented. First, experiments are performed using known SNR (Section 4.2). Results using either known local or known global SNR are compared against audio-only and video-only baselines. These results establish an upper limit for the performance of the system. Section 4.3 presents a direct evaluation of the MLP's ability to predict SNR. These estimated SNRs are then mapped onto the stream weights used in the second set of ASR experiments (Section 4.4). The use of SNR estimates based on audio-only versus audio–visual state-likelihoods is compared. The final set of ASR experiments (Section 4.5) illustrates the impact of errors in the estimation of the sign of the SNR. These errors result from foreground/background ambiguity. The structure of the training/testing data and details of the HMM models employed, which are common to all the experiments, are described in Section 4.1.

4.1. Database and feature extraction

All experiments have employed the audio–visual Grid corpus (Cooke et al., 2006), which consists of high quality audio and video recordings of small vocabulary 'read speech' utterances of the form indicated in Table 1, spoken by each of 34 speakers (16 female speakers and 18 male speakers). An example sentence is "bin red in c 3 again". Of the 34 speakers, 20 (10 male and 10 female) are employed in the experiments reported here.

Fig. 3 shows a representative selection of lip regions extracted from the corpus following the procedures described in Section 3.1. The upper half of the figure shows neutral positions for 8 of the 34 speakers, illustrating the very large inter-speaker variability in lip appearance. The lower half shows a selection of lip positions for one of the male speakers. In order to give an indication of the high quality of the original video data, the images are shown before the downsampling to 32 × 32 pixels that occurs during feature extraction. It can be observed that the images are evenly illuminated and have a high level of detail.

In all the following experiments the behaviour of two different model configurations has been separately considered: (i) a gender-dependent configuration in which models of male speech are used to recognise a male target masked by a female speaker; (ii) a more challenging gender-independent configuration where gender-independent models are used to recognise a target of unknown gender mixed with a masker that is also of arbitrary unknown gender.

Fig. 3. Example lip regions extracted from the Grid corpus. The images in the top panel show eight different speakers in a resting lip position, illustrating the inter-speaker variability. The images in the lower panel are examples of a single speaker with lips in different positions, to illustrate the intra-speaker variability.

In the gender-dependent condition, 3500 utterances chosen from the 10 different male speakers (350 utterances from each speaker) have been employed to train a set of male (gender-dependent) HMMs. A collection of simultaneous speech mixtures was generated to be used variously for training the SNR estimator and evaluating the recognition system. Three thousand one hundred utterances that have not been used during training were randomly chosen from the 10 male speakers to mix with 3100 utterances selected from 10 female speakers (310 utterances from each speaker). A Viterbi forced alignment was used to detect the initial and final silences, which were then removed. The shorter utterance of each pair was zero-padded to the length of the longer one. The two signals were then artificially mixed at global SNRs of -10, 0, 5, 10, 15 and 20 dB. These mixed utterances were then randomly divided into three sets: 1000 utterances were employed to train the ANNs; 100 utterances were used as cross-validation data to prevent the ANNs overfitting; the remaining 2000 utterances were used for the final recognition test set.
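Mixing a target and masker at a prescribed global SNR only requires rescaling the masker before addition. A small sketch of that step, assuming both signals have already been silence-trimmed and zero-padded to equal length as described above (this is a generic implementation, not necessarily the exact procedure used to build the corpus mixtures):

```python
import numpy as np

def mix_at_snr(target, masker, snr_db):
    """Scale the masker so the target/masker energy ratio equals snr_db,
    then sum the two signals. Both inputs are 1-D arrays of equal length."""
    target_energy = np.sum(target ** 2)
    masker_energy = np.sum(masker ** 2)
    # Gain satisfying 10*log10(target_energy / (gain^2 * masker_energy)) = snr_db.
    gain = np.sqrt(target_energy / (masker_energy * 10.0 ** (snr_db / 10.0)))
    return target + gain * masker
```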

Preparation of the gender-independent condition was similar, except 4000 utterances, taken from the 10 male and 10 female speakers, were used to train a set of gender-independent HMMs. Again 3100 simultaneous speech examples were constructed, but this time pairs of utterances were randomly chosen from the 20 speakers, with the only restriction being that the target and masker are never the same speaker.

For each utterance, the 26-dimensional MFCC-based audio feature vectors and the 72-dimensional DCT-based video feature vectors were extracted according to the procedures described in Section 3.1. The 98-dimensional audio–visual feature vectors were constructed by simple concatenation of the audio and video components.

Word-level HMMs were employed to model the 51 words in the recognition task's vocabulary. The models contained between four and ten states per word, determined using a rule of two states per phoneme. In each state both the audio and video emission distributions were modelled using a 5-component Gaussian mixture model, with each component having a diagonal covariance.

The audio and visual HMMs were trained independently using audio-only and visual-only features. The AV model was produced from the independent audio and visual HMMs according to the procedure detailed in Section 3.

4.2. ASR experiment 1: known SNR

In the first experiments the known SNR is mapped onto a stream weight using the mapping, which takes the form of a global-SNR-to-weight lookup table, that has been previously optimised so as to maximise recognition performance as described in Section 3.3.3. The mapping is either applied to the global SNR to produce a single stream weight to use for all frames in the utterance, or the mapping is applied to the local (frame-based) SNRs to produce a time-varying stream weight. In the latter case, the mapping is linearly interpolated to handle local SNRs that do not necessarily match the six global SNRs in the lookup table.
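The local-SNR mapping is just a linear interpolation of the optimised lookup table, clipped at its end points. A minimal sketch (the weight values in the table are placeholders, not the optimised weights from the paper):

```python
import numpy as np

# Global-SNR -> audio stream weight lookup table (placeholder weight values).
TABLE_SNRS    = np.array([-10.0, 0.0, 5.0, 10.0, 15.0, 20.0])
TABLE_WEIGHTS = np.array([0.05, 0.35, 0.55, 0.75, 0.90, 0.97])

def stream_weight_from_snr(snr_db):
    """Map (local or global) SNR in dB to an audio stream weight using
    linear interpolation between the table entries; SNRs outside the
    table range are clipped to the end-point weights."""
    return np.interp(snr_db, TABLE_SNRS, TABLE_WEIGHTS)

# A time-varying weight from a sequence of local SNR estimates:
lambda_t = stream_weight_from_snr(np.array([-14.0, 2.5, 7.0, 18.0]))
```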

At the same time, the likelihoods for the audio and visual streams are scaled appropriately for use in either the global SNR or the local SNR systems, using the likelihood scaling factor optimised using the techniques described in Section 3.3.3. It was found, in both our tasks, that the local SNR-based performance was the greatest if the likelihood scaling factor was set such that the standard deviation of the audio likelihood was 7.5 times that of the visual likelihood.

Fig. 4 shows the results for speech recognition performance using gender-dependent models, where the line marked with the circle and the line marked with the triangle are the traditional audio-only and visual-only speech recognition performance respectively. These results are in agreement with previous studies in that the visual stream produces significantly poorer results than the audio in low noise conditions (above an SNR of 8 dB). The video-only data is inherently ambiguous as many phonemes have a similar visual appearance (Fu et al., 2005; Goldschen, 1993). However, the video-only result is obviously not affected by the level of the acoustic noise and remains at 73.1% for all SNRs.

The line marked with the '+' symbol and the remaining marked line show the results of the multistream model when using either the mapping from global SNR to a fixed stream weight (AV-GSNR), or from local SNR to a time-varying stream weight (AV-LSNR), respectively. The audio–visual results are better than both the audio-only and visual-only baselines across all SNRs tested. The system using local SNRs outperforms that using global SNR in all SNR categories. The maximum gain is at an SNR of 5 dB, where the recognition accuracy of the local SNR system is 5.8% (absolute) higher than that of the global SNR system. This is presumably because the time-varying stream weight allows the system to achieve better performance by exploiting the audio information in regions where the local SNR is temporarily favourable.

Fig. 4. A comparison of performance on the gender-dependent task for audio-only, visual-only and audio–visual ASR. The AV system employs stream weights estimated using known SNR. The audio–visual recognition accuracy is plotted against global SNR and is shown for both a constant stream weight (estimated from the global SNR) and a time-varying stream weight (estimated from the local SNR).

Fig. 5 is the same as Fig. 4 except that it shows speech recognition performance for the gender-independent task. Note that the pattern of results is very similar, though the average performance using the gender-independent models is clearly worse than that using the gender-dependent models. The speech recognition performance of the visual-only system falls from 73.1% for the gender-dependent system to 58.4% for the gender-independent system. As before, the local SNR information provides a relative improvement over using the single global SNR, with a maximum gain in accuracy of 9.9% occurring at the SNR of 5 dB.

Fig. 5. A comparison of performance on the gender-independent task for audio-only, visual-only and audio–visual ASR. The AV system employs stream weights estimated from known SNR. The audio–visual recognition accuracy is plotted against global SNR and is shown for both a constant stream weight (estimated from the global SNR) and a time-varying stream weight (estimated from the local SNR).

4.3. Evaluation of SNR estimation

The previous section demonstrated the potential of the system when using a priori SNR values. The success of the real system will depend on the degree of reliability with which SNR can be estimated from the likelihood streams. This section presents a direct evaluation of this component of the system.

Separate MLPs of the structure described in Section 3.3.2 were trained for the audio-only and audio–visual SNR estimation. The MLPs had an input unit for each element of the likelihood feature vector. Using the state-clustering techniques described in Section 3.3.1, it was found that the 251-dimensional likelihood vectors could be reduced down to 54 and 18 dimensions for the audio and video streams respectively without reducing the audio-only and video-only recognition performance. Hence, it was decided that this would be an appropriate degree of clustering to apply to the likelihood vectors to be used in the SNR estimation. So, the audio-only MLP has 54 input units, whereas the audio–visual MLP has 72 (54 + 18) input units. The MLPs were trained using around 600,000 frames of data randomly drawn from the training data set. The target SNR at each frame is computed using a priori knowledge of the target and masker (i.e. knowledge of the clean target and masker utterances prior to mixing). The number of units in the hidden layer was optimised using a randomly drawn validation set consisting of 60,000 frames. In the gender-dependent task, the best performance for the audio-only MLP was achieved with 46 hidden units, while 30 hidden units gave the best performance for the audio–visual MLP. In the gender-independent task, 11 hidden units gave the best performance for both the audio-only and audio–visual MLPs.

The more challenging gender-independent system has been evaluated using a selection of 100 utterances taken from the test set, mixed at the range of global SNRs used for the ASR experiments. The average magnitude of the difference between the SNR estimate and the target was computed. Also, the global SNR for each utterance was estimated from the sequence of local SNR estimates, using Eq. (4), and compared to the known global SNR. Table 2 shows the size of the local and global SNR error for both the audio-only and audio–visual estimates. It can be seen that, other than for the local SNR at 20 dB, the audio–visual system consistently outperforms the audio-only system. The error reduction is around 5 dB for the global SNR estimate, with the largest gains at the lowest SNRs.


Table 2
Average magnitude of local and global SNR estimation error (LSNR and GSNR, respectively) for estimations based on either audio-only or audio–visual evidence

SNR (dB)    Audio-only            Audio–visual
            LSNR      GSNR        LSNR      GSNR
20          16.1      15.6        18.2      12.9
15          15.8      16.1        15.4      11.2
10          16.1      16.9        13.7      11.3
5           17.8      17.1        13.3      11.8
0           22.4      18.4        15.2      12.1
−10         34.7      28.9        21.5      17.1

SNR estimation errors are shown in dB and are reported separately for utterances mixed at each global SNR.


The gain is particularly large at −10 dB. This large increase is presumably because at −10 dB there are many frames which appear to be clean speech but are in fact frames that are dominated by the masker. It is precisely this condition that can be disambiguated by the inclusion of video evidence.

To get an impression of what the figures in Table 2 mean in practice, Fig. 6 compares the true and estimated local SNR over the course of a single pair of male utterances mixed at a global SNR of 5 dB. It can be seen that the audio-only system is prone to occasional regions of gross error (e.g. around frames 95 and 125). These gross errors are largely repaired by introducing visual information, and although the estimates are seldom very precise, the trajectory of the estimate is a reasonable match to that of the true SNR.

It may be considered that the errors in Table 2 seem disappointingly large. It would seem that state-likelihoods of a single frame do not contain sufficient evidence to consistently estimate SNR – possible reasons will be discussed in Section 4.6. However, large errors in the SNR estimation do not in themselves mean that the ASR results will be poor. First, it should be remembered that the SNR to stream weight mapping can be fairly flat over large SNR regions. For example, below 0 dB the optimum audio stream weight becomes very close to 0 and above 20 dB it is very close to 1. So a large SNR estimation error in these regions – for example, estimating SNR as −30 dB rather than −10 dB – will not necessarily have a large impact on the ASR result. Second, for some frames, and even some utterances, there may be larger tolerance to error in the stream weighting parameter. If the errors are not occurring in the sensitive regions then they may not lead to recognition errors. Finally, the impact of the errors will depend not just on their average size, but also on their distribution. Although the average error can be large, it can be seen from Fig. 6 that there can also be regions in which the SNR is well estimated. It is possible that a small number of poorly estimated stream weights can be tolerated by the HMM decoder. In short, although the direct evaluation of the MLP is informative and may be helpful in the development of system improvements, an evaluation via ASR results is the only fair test of the system as a whole.

4.4. ASR experiment 2: estimated SNR – audio versus audio–visual estimators

In the second set of experiments the recognition systems are retested, but this time using the SNRs that have been estimated from the likelihood data. Again, performance of both the global and local SNR-based systems is compared. A comparison is also made between the performance of the SNR estimates obtained using audio-only likelihoods and that of the SNR estimates obtained using the combined audio and visual streams (see Section 3.3.2).

Recognition accuracies for the gender-dependent task are shown in Fig. 7. Results using the neural networks trained from either audio-only likelihoods or both audio and visual likelihoods are labelled with (A-*) and (AV-*), respectively. Both neural networks were employed to estimate either global SNR (*-GSNR) or local SNR (*-LSNR). It can be seen that the proposed stream weighting method leads to better results than both the audio-only and visual-only baselines at all SNRs except −10 dB. However, comparison with Fig. 4 indicates that recognition performances using estimated global and local SNR are worse than the corresponding results using the known SNR. This figure also shows that the recognition results achieved using estimated local SNR are better than those for estimated global SNR at noise levels in the SNR range from 15 dB down to 0 dB, despite the fact that local SNR is harder to estimate robustly.

Comparing the pair of local SNR results, A-LSNR with AV-LSNR, or the pair of global SNR results, A-GSNR with AV-GSNR, shows that the audio-only and the audio–visual-based estimators provide broadly similar performance. At high noise levels – i.e. SNRs less than 10 dB – the audio–visual estimate provides a small but consistent benefit.

Fig. 8 shows the recognition accuracies for the gender-independent task. As with Fig. 7, Fig. 8 also indicates that the performance based on the local SNR (*-LSNR) is better than that based on the global SNR (*-GSNR) across a range of SNRs, i.e. 15 dB down to 0 dB. However, at very low SNRs the performance of the global SNR system is superior to that of the local SNR system. This is possibly because, although in these conditions local SNR estimates have a significant mean-square error, they are generally unbiased, so the per-frame errors tend to be averaged out during the global SNR estimation.

Again, comparing these results with those in Fig. 5 shows that the recognition performance using the proposed stream weighting method achieves better results than using either the audio or visual stream alone, but, as expected, the performance using estimated SNR is somewhat less than that achieved using the known SNR.

Comparison of either A-GSNR with AV-GSNR, or A-LSNR with AV-LSNR, illustrates that using visual information during the stream weight estimation leads to better recognition performance. This is as expected – the visual information is reducing the effects of ‘informational masking’ (which is large in the gender-independent condition) and helping to disambiguate regions of positive and negative SNR.

Fig. 6. The figure illustrates the quality of the SNR estimation for a typical utterance pair: the upper panels show the spectrogram of a target (top) and a masker utterance. Energy contours for the target and masker are shown for a global SNR of 5 dB. The bottom panel compares the true local SNR (dashed line) with the estimates produced by the MLP using audio evidence only (solid) and audio–visual evidence (dot–dash). Note that although the errors can be quite large, the audio–visual curve shows a similar trend to the true SNR, and makes fewer gross errors than the audio-only system.

It can also be noted from Fig. 8 that the performance of the audio stream (A-*) is slightly better than that of the audio–visual system (AV-*) in the high SNR region, particularly in the local SNR case. The audio stream weight appears to be too low, an indication that the high SNRs are being underestimated.

Fig. 7. Recognition accuracy obtained at each global SNR when using gender-dependent models and using SNR estimated from the log-likelihoods of either the audio-only or the audio–visual stream (A or AV) and when estimating either global or local SNR (GSNR or LSNR).

Fig. 8. Recognition accuracy obtained at each global SNR when using gender-independent models and using SNR estimated from the log-likelihoods of either the audio-only or the audio–visual stream (A or AV) and when estimating either global or local SNR (GSNR or LSNR).

Fig. 9. Comparison of speech recognition performance obtained at each global SNR when combining the estimated SNR magnitude and the a priori SNR sign when using either global or local SNR (*-GSNR or *-LSNR) and gender-dependent models. Whether the SNR magnitude is estimated with audio-only likelihoods or audio–visual likelihoods makes no significant difference, so for the sake of clarity only the results using the audio–visual estimate are shown.

Comparing the audio–visual (AV-*) results in Fig. 8 with those in Fig. 7, it is seen that the visual stream provides greater benefit in the gender-independent task than in the gender-dependent task. This is also as expected. In the gender-dependent task the acoustic models specifically match the gender of the target. The fact that the models match the acoustics of the target but not those of the masker enables target and masker to be distinguished using acoustics alone, i.e. there is less role for the visual information in the gender-dependent task as there is less acoustic informational masking.

Both the above experiments show that the visual stream can help to improve speech recognition accuracies across a wide range of SNRs. However, the performance is still limited by an inability to form reliable estimates of local SNR, particularly when the target and masker are mixed at a low global SNR.

4.5. ASR experiment 3: estimated SNR magnitude with known sign

In the simultaneous speech condition the foreground and background are acoustically similar. So, although the magnitude of the SNR may be well estimated, the sign of the SNR may be ambiguous and hard to estimate correctly. The final experiments investigated the impact of this specific problem on the performance of the recognition system. The audio-only and audio–visual MLPs were retrained using the same dataset, parameters and optimising method as in the last experiment, except that the signs of the local SNR targets were removed. During the testing stage, the estimated SNR magnitude for each frame is combined with the a priori SNR sign obtained using knowledge of the unmixed signals.
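The per-frame combination used at test time in this experiment can be sketched as follows; the function name and example values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def magnitude_with_known_sign(est_snr_magnitude, true_local_snr):
    """Combine the MLP's unsigned SNR estimate with the a priori sign
    taken from the true (oracle) local SNR."""
    return np.sign(true_local_snr) * np.abs(est_snr_magnitude)

# Illustrative frames: the estimator gets the size roughly right, while the
# oracle sign decides whether the frame is target- or masker-dominated.
true_snr = np.array([6.0, -4.0, 1.5, -9.0])
est_mag  = np.array([5.2,  3.1, 2.0,  7.5])    # magnitude-only MLP outputs
print(magnitude_with_known_sign(est_mag, true_snr))   # [ 5.2 -3.1  2.0 -7.5]
```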

Figs. 9 and 10 show the results for the gender-dependent task and the gender-independent task, respectively.

Considering first the gender-dependent system, it was noted that the results obtained using the audio-only estimator were not significantly different from those obtained using the audio–visual estimator (because they are not significantly different, only the AV estimator results are shown in Fig. 9). Furthermore, it can be noted that the results are close to those obtained by the known-SNR system (Fig. 4). This would suggest that the poor performance seen in the previous systems that estimated both the sign and magnitude of the SNR (Fig. 7) arises due to an inability to estimate the sign. Furthermore, if only the magnitude of the SNR needs to be estimated then the audio likelihoods are sufficient. It also suggests that the previous small advantage provided by the visual stream when estimating both sign and magnitude (Fig. 7) is due to an improvement in the estimation of the sign of the SNR rather than its magnitude. This is expected, as ambiguity in the sign of the SNR arises due to ambiguity in the determination of which source is the target and which the masker – it is precisely to reduce this ambiguity that the visual information is introduced.

Fig. 10. Comparison of speech recognition performance obtained at each global SNR when combining the estimated SNR magnitude and the a priori SNR sign when using either audio-only likelihoods or audio–visual likelihoods (A-* or AV-*) and when using either global or local SNR (*-GSNR or *-LSNR) and gender-independent models.

Fig. 10 shows results for the gender-independent task. As before, the structure of this figure is the same as that of Fig. 8. It can be seen that the behaviour of the results is similar to that of the gender-dependent task (Fig. 9) in that they are now very similar to the results obtained when the SNR sign and magnitude are both known (Fig. 5). The large advantage of using visual information for SNR estimation that is observed in the real system (Fig. 8) is again reduced when the true SNR sign is known a priori. Again, this suggests that the main contribution of visual information is in the estimation of this sign.

4.6. Discussion

4.6.1. The multiple roles of visual information

Considering the audio–visual speech recognition task, it is possible to identify three different levels at which the visual information may be employed. First, visual information may play an early role in mitigating the effects of energetic masking. For example, there may exist multimodal mechanisms that are similar to the auditory mechanisms that operate to provide co-modulation masking release (Hall et al., 1984). Second, audio–visual integration may operate at a later stage to reduce the impact of informational masking (IM), that is, to help the listener selectively attend to the target while ignoring the masker. This particular role of visual speech has been the focus of recent perceptual studies, such as the work of Helfer and Freyman (2005) and Wightman et al. (2006). Finally, visual information may play a role in the speech unit classification task that underlies speech recognition. This is the role that has been central in the great majority of audio–visual ASR research to date. At this level, the visual information is useful to the extent that it provides extra features to help discriminate between partially masked, and otherwise ambiguous, speech sound classes.

A major contribution of the current work lies in emphasising the role of visual information as a cue for reducing ‘informational masking’, i.e. the second of the three roles described above. In the current system, as in the study of Helfer and Freyman (2005), the visual speech signal is helping to discriminate between the acoustic foreground and background, providing greatest benefit in the situation where the target and masker are most greatly confusable. In many everyday listening situations the release from informational masking afforded by the visual signal may be very significant. Brungart has demonstrated large informational masking effects on small vocabulary recognition tasks when the masker and target have the same gender and have a target–masker ratio in the range −3 dB to +3 dB (Brungart, 2001). For conversational speech, where the perplexity of the recognition task is greater, there is perhaps even more scope for target/masker confusion. A mechanism that allows the listener to reliably extract the target from the background is invaluable.

From the earliest audio–visual speech perception studies it has been noted that the audio–visual error rate is usually significantly lower than both the audio-only and video-only error rates. This is usually explained in terms of the complementarity of the phonetic/visemic cues. Some speech units are highly discriminable acoustically, while others are highly discriminable visually. However, this result may also be in part due to the role visual features play in reducing IM. Even if visual features carried no visemic information, and video-only ASR performance was no more than chance, it would still be possible for them to help drive auditory attention toward the acoustic foreground, hence improving AV performance over that achieved using audio alone. Schwartz et al. (2004) have demonstrated exactly this point using carefully controlled intelligibility tests in which target words are visually identical and are masked by a speech background. This observation raises interesting possibilities. For example, consider a visual signal that is so low in quality (e.g. having a very poor resolution) that by itself it affords no more than chance recognition performance. It is possible that this signal may usefully reduce IM. In terms of the current system, this would be a case where the visual signal is sufficient to improve the stream weight estimation, but not to provide any visemic information at the classification stage.

4.6.2. Improving audio–visual stream weight estimation

Comparison of the results achieved using known SNR (Figs. 4 and 5) with those achieved using estimated SNR (Figs. 7 and 8) highlights the fact that the performance of the system is limited by the extent to which SNR can be estimated. Good performance can be achieved if the models are sufficiently specific to the target (e.g. mixed genders and gender-dependent models), but when the background and foreground are statistically similar, local SNR estimates are poor. In this condition visual information can help the SNR estimate (Fig. 8) but, even then, AV recognition performance falls below that achieved using video alone at SNRs below 0 dB. The poor estimates can be largely overcome by averaging over an utterance to produce a global SNR, but this sacrifices the responsiveness of the time-varying stream weight which Figs. 4 and 5 demonstrate to be important for the speech plus speech task.

The current SNR estimation technique bases its judgements on a single frame of data. To be effective the audio–visual system needs to judge whether the audio and visual states correspond. Despite the use of temporal difference features to capture local audio and visual dynamics, there may often be periods where there is insufficient context to reliably judge audio–visual correspondence. Many audio speech units have the same visual appearance, so a masking phoneme may differ from the target phoneme, but still be consistent with the target’s lip movements. The temporal granularity is too small. A potential solution would be to train the MLP using feature vectors formed by concatenating several frames of data. There is some precedent for such an approach in speech recognition. Feature vectors that include even up to half a second of context have been shown to be useful in improving phonetic discrimination (Hermansky and Sharma, 1998). However, a temporal window that is too long, if not employed with care, may lead to oversmoothing of the rapidly varying SNR, reintroducing the compromise that exists in the per-utterance SNR system employed in the current work. Therefore, experiments would be needed to carefully optimise the window design, and the compression applied to the larger likelihood vector that would be generated by a larger window.
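A sketch of the kind of context stacking such an approach would require is given below; the window size (±5 frames) and the 72-dimensional feature size are placeholders rather than values proposed in the paper.

```python
import numpy as np

def stack_context(features, left=5, right=5):
    """Concatenate several neighbouring frames onto each frame's likelihood
    feature vector, padding at the utterance edges by repetition.
    features: (frames, dims) -> (frames, dims * (left + 1 + right))."""
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    windows = [padded[i:i + len(features)] for i in range(left + 1 + right)]
    return np.hstack(windows)

# e.g. 72-dimensional audio-visual likelihood features over 200 frames.
feats = np.random.default_rng(0).normal(size=(200, 72))
stacked = stack_context(feats, left=5, right=5)     # shape (200, 792)
```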

A further shortcoming of the existing system is that the local SNR to stream weight mapping is performed using a table that has been optimised for global SNR. Training a mapping using global SNR is convenient as each SNR point in the mapping can be independently optimised using a separate training set. However, it is not clear that this mapping should produce the best performance for a local SNR-based system. Thus, the results achieved with known local SNR, which are presented as an upper limit of the performance given perfect SNR estimation, could potentially be improved upon if an SNR to weight mapping more suitable for local SNR were found. One approach would be to use a parameterised curve to estimate the mapping. For example, given that the stream weight is likely to be monotonically increasing with increasing SNR, and that it is constrained to lie in the range −1 to +1, a sigmoid might be a reasonable approximation. If the number of parameters is small enough it would be possible to locate the curve that maximises the ASR performance by a straightforward search of the parameter space.
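The sketch below illustrates what such a parameterised sigmoid mapping might look like. The midpoint, slope and bounds are placeholder values that would need to be tuned, and score_asr in the commented search stands in for a full recognition run; none of these names or numbers come from the paper.

```python
import numpy as np

def snr_to_weight(snr_db, midpoint=10.0, slope=0.2, lo=0.0, hi=1.0):
    """Parameterised sigmoid mapping from local SNR (dB) to an audio stream
    weight.  midpoint/slope/lo/hi are the free parameters; lo/hi stand in
    for whatever bounds the system imposes on the weight."""
    return lo + (hi - lo) / (1.0 + np.exp(-slope * (np.asarray(snr_db) - midpoint)))

print(snr_to_weight([-10.0, 0.0, 10.0, 20.0]))   # weights rise monotonically with SNR

# A coarse grid search over the two shape parameters could then pick the
# curve that maximises recognition accuracy (score_asr is not defined here):
# best = max(
#     ((m, s) for m in np.arange(-5, 20, 5) for s in (0.1, 0.2, 0.4)),
#     key=lambda p: score_asr(lambda x: snr_to_weight(x, *p)),
# )
```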

4.6.3. Relation to other robust ASR approaches

The standard multistream approach to AVASR – of which the current work is an example – treats the acoustic feature vector as a single monolithic stream that has a single reliability weight. For the combination of additive noise and cepstral features employed in the current system this may be appropriate. Additive noise, whether broadband or narrowband, will affect the entire cepstrum. However, if the acoustics are represented in the spectral domain, then additive noise, at any particular time, will generally have a local effect. In the same way that some time frames may be relatively free of the masker, even in time frames that are heavily corrupted, some frequency bands will contain more masker energy than others.

Many robust ASR techniques have been proposed to take advantage of local spectro-temporal regions of high SNR that can be found in noisy speech signals: multi-band systems have been developed which apply separate reliability weights to independent frequency bands (Bourlard and Dupont, 1996); soft-mask missing-data systems generalise this idea by attempting to judge the SNR at each spectro-temporal point (Barker et al., 2000); speech fragment techniques attempt to piece together a partial description of the clean speech spectrum from reliable spectro-temporal fragments (Barker et al., 2005). Such techniques might make a better starting point for developing robust audio–visual ASR systems. For example, using both acoustic and visual features to estimate multi-band reliability weights, or missing-data soft-mask values, would be a possibility.

Another general approach to robust ASR is to attempt to remove the noise from the mixture prior to recognition. For example, spectral subtraction techniques attempt to remove estimates of the noise spectrum in order to recover the clean speech spectrum (Boll, 1979; Lockwood et al., 1991). This class of techniques may be considered complementary to the multistream approach described in this paper. Any technique that removes noise from the mixture to leave a cleaner version of the speech representation could be added to the current system as a preprocessing stage.
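As a generic illustration of this class of preprocessing (and not a description of the present system or of Boll's exact algorithm), a basic power-domain spectral subtraction can be sketched as follows; the noise estimate here is simply taken from a few assumed speech-free frames.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power_est, floor=0.01):
    """Basic power-domain spectral subtraction: subtract a noise power
    estimate at each spectro-temporal point and clamp the result to a small
    spectral floor to avoid negative energies.
    noisy_power: (frames, bins); noise_power_est: (bins,) or (frames, bins)."""
    clean_est = noisy_power - noise_power_est
    return np.maximum(clean_est, floor * noisy_power)

# Illustrative use with a stationary noise estimate from the first 10 frames.
rng = np.random.default_rng(0)
noisy = np.abs(rng.normal(size=(100, 129))) ** 2
noise_est = noisy[:10].mean(axis=0)
cleaned = spectral_subtraction(noisy, noise_est)
```

For a non-stationary speech masker such as the one considered in this paper, a fixed noise estimate of this kind would of course be a poor fit, which is why such techniques are presented here only as a possible complementary preprocessing stage.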

A common approach to the simultaneous speaker task is to exploit continuities in the speech spectrum (primarily pitch) to track and separate the target and masker speakers. This was the strategy of several systems competing in the Pascal Speech Separation Challenge held at Interspeech 2007 (details of the challenge can be found at http://www.dcs.shef.ac.uk/martin/SpeechSeparationChallenge.htm). However, apart from in exceptional circumstances, such approaches are normally unable to offer a single unambiguous interpretation of the acoustic scene. For example, pitch tracks of competing sound sources can cross in ambiguous ways, or they may contain breaks leading to problems of sequentially grouping discontinuous pitch track segments. Pitch tracking is however sufficient to locate local regions of spectral-temporal dominance (e.g. vowel formants). So a potential strategy could be to first use such tracking techniques to form a number of spectro-temporal acoustic fragments of unknown origin, and then to use both audio and visual evidence to judge the source identity of each spectro-temporal fragment. The target versus masker ambiguities that arise in the current system when considering a single acoustic frame would be much reduced when considering an extended spectro-temporal speech fragment. Integration of audio and visual features at this level is likely to lead to systems with greater robustness.

5. Conclusion

This paper has examined the problem of applying multistream audio–visual speech recognition techniques in a challenging simultaneous speaker environment using both gender-dependent and gender-independent hidden Markov models. It has been shown that in this condition, either a static stream weight parameter based on a global SNR estimate, or a dynamic stream weight parameter based on a local SNR estimate, can be used to successfully integrate audio and visual information. The dynamic stream weight parameter leads to better overall performance. A technique for estimating either the local or global SNR from audio and visual HMM state-likelihoods has been presented. Despite a lack of precision in the SNR estimates, the estimates were sufficiently reliable to lead to recognition results that are better than both the audio-only and visual-only baseline performance across a wide range of SNRs in both a gender-dependent and a gender-independent task.

Experiments using a priori SNR estimates have shown that a time-varying stream weight based on local SNR has the potential to greatly outperform a per-utterance stream weight based on a per-utterance SNR. The performance difference was most marked for global SNRs of around 0 dB. However, in practice, the difficulty of making accurate local SNR estimates means that real systems cannot match this performance level. Particular problems occur due to the difficulty in distinguishing between the acoustic foreground and background in local regions. Basing SNR estimates on a combination of both audio and visual likelihoods went some way to reducing this problem.

The paper has demonstrated the potential for a time-varying stream-weighting approach for AV speech recognition in multispeaker environments. However, it has also highlighted the difficulty in achieving results that match up to those that can be achieved using a priori SNR information. Possibilities for improving SNR estimation have been discussed, as have possibilities for combining the current approach in a complementary fashion with existing robust ASR approaches.

Finally, and most importantly, the paper has highlighted the potential for using visual information at multiple stages of the recognition process. In particular, there appears to be great potential for developing the use of visual speech information in the separation of acoustic sources, and in the disambiguation of the foreground/background confusions that occur when speech targets are mixed with acoustically similar maskers. More complete, multi-level integration of visual information into recognition systems may lead to future AVASR technology that comes closer to exhibiting the robustness of human speech processing.

References

Adjoudani, A., Benoit, C., 1996. On the integration of auditory and visual parameters in an HMM-based ASR. In: Stork, D.G., Hennecke, M.E. (Eds.), Speechreading by Humans and Machines. Springer, Berlin, pp. 461–471.

Barker, J.P., Josifovski, L.B., Cooke, M.P., Green, P.D., 2000. Soft decisions in missing data techniques for robust automatic speech recognition. In: Proc. ICSLP’00, Beijing, China, pp. 373–376.

Barker, J.P., Cooke, M.P., Ellis, D.P.W., 2005. Decoding speech in the presence of other sources. Speech Comm. 45, 5–25.

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

Boll, S.F., 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27 (2), 113–120.

Bourlard, H., Dupont, S., 1996. A new ASR approach based on independent processing and recombination of partial frequency bands. In: Proc. ICSLP’96, Philadelphia, PA.

Bradski, G.R., 1998. Computer video face tracking for use in a perceptual user interface. Intel Technol. J. Q2, 43–98.

Brungart, D.S., 2001. Informational and energetic masking effects in the perception of two simultaneous talkers. J. Acoust. Soc. Amer. 109 (3), 1101–1109.

Chibelushi, C.C., Deravi, F., Mason, J.S.D., 2002. A review of speech-based bimodal recognition. IEEE Trans. Multimedia 4 (1), 23–37.

Cooke, M., Barker, J., Cunningham, S., Shao, X., 2006. An audio–visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Amer. 120 (5), 2421–2424.

Cox, S., Matthews, I., Bangham, J.A., 1997. Combining noise compensation with visual information in speech recognition. In: Proc. Audio–Visual Speech Processing (AVSP’97), Rhodes, Greece.

Davis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28, 357–366.

Dupont, S., Luettin, J., 2000. Audio–visual speech modeling for continuous speech recognition. IEEE Trans. Multimedia 2 (3), 141–151.

Fu, S., Gutierrez-Osuna, R., Esposito, A., Kakumanu, P.K., Garcia, O.N., 2005. Audio/visual mapping with cross-modal hidden Markov models. IEEE Trans. Multimedia 7 (2), 243–252.

Garg, A., Potamianos, G., Neti, C., Huang, T.S., 2003. Frame-dependent multi-stream reliability indicators for audio–visual speech recognition. In: Proc. ICASSP 2003, Hong Kong, pp. 24–27.

Glotin, H., Vergyri, D., Neti, C., Potamianos, G., Luettin, J., 2001. Weighting schemes for audio–visual fusion in speech recognition. In: Proc. ICASSP 2001, Salt Lake City, pp. 173–176.

Goldschen, A.J., 1993. Continuous automatic speech recognition by lipreading. Ph.D. thesis, Engineering and Applied Science, George Washington University.

Gravier, G., Axelrod, S., Potamianos, G., Neti, C., 2002. Maximum entropy and MCE based HMM stream weight estimation for audio–visual ASR. In: Proc. ICASSP 2002, Orlando, FL.


Gurbuz, S., Tufekci, Z., Patterson, E., Gowdy, J.N., 2002. Multi-stream product model audio–visual integration strategy for robust adaptive speech recognition. In: Proc. ICASSP 2002, Orlando, FL.

Hall, J.W., Haggard, M.P., Fernandes, M.A., 1984. Detection in noise by spectro-temporal pattern analysis. J. Acoust. Soc. Amer. 76, 50–56.

Helfer, K.S., Freyman, R.L., 2005. The role of visual speech cues in reducing energetic and informational masking. J. Acoust. Soc. Amer. 117 (2), 842–849.

Hermansky, H., Sharma, S., 1998. TRAPs – classifiers of temporal patterns. In: Proc. ICSLP 1998, Sydney, Australia.

Hirsch, H.G., 1993. Estimation of Noise Spectrum and its Application to SNR-estimation and Speech Enhancement. Tech. Rep. TR-93-012, International Computer Science Institute, Berkeley, CA.

Lockwood, P., Boudy, J., 1991. Experiments with a non-linear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars. In: Proc. Eurospeech’91, Vol. 1, pp. 79–82.

Lucey, S., Chen, T., Sridharan, S., Chandran, V., 2005. Integration strategies for audio–visual speech processing: applied to text-dependent speaker recognition. IEEE Trans. Multimedia 7 (3), 495–506.

Luettin, J., Potamianos, G., Neti, C., 2001. Asynchronous stream modeling for large vocabulary audio–visual speech recognition. In: Proc. ICASSP 2001, Salt Lake City.

Massaro, D.W., 1998. Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press, Cambridge.

Massaro, D.W., Stork, D.G., 1998. Speech recognition and sensory integration. Amer. Sci. 86 (3), 236–244.

Matthews, I., 1998. Features for audio–visual speech recognition. Ph.D. thesis, School of Information Systems, University of East Anglia, Norwich.

Meier, U., Hurst, W., Duchnowski, P., 1996. Adaptive bimodal sensor fusion for automatic speechreading. In: Proc. ICASSP 1996, Atlanta, GA.

Okawa, S., Nakajima, T., Shirai, K., 1999. A recombination strategy for multiband speech recognition based on mutual information criterion. In: Proc. European Conf. on Speech Communication and Technology, Budapest, pp. 603–606.

Patterson, E.K., Gurbuz, S., Tufekci, Z., Gowdy, J.N., 2001. Noise-based audio–visual fusion for robust speech recognition. In: Proc. Audio–Visual Speech Processing (AVSP’01), Scheelsminde, Denmark.

Patterson, E.K., Gurbuz, S., Tufekci, Z., Gowdy, J.N., 2002. Moving-talker, speaker-independent feature study and baseline results using the CUAVE multimodal speech corpus. EURASIP J. Appl. Signal Process. 11, 1189–1201.

Potamianos, G., Neti, C., 2000. Stream confidence estimation for audio–visual speech recognition. In: Proc. ICSLP 2000, Beijing, China.

Potamianos, G., Graf, H.P., Cosatto, E., 1998. An image transform approach for HMM based automatic lipreading. In: Proc. IEEE Internat. Conf. on Image Processing.

Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W., 2003. Recent advances in the automatic recognition of audiovisual speech. Proc. IEEE 91 (9), 1306–1326.

Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77 (2), 257–286.

Rosenblum, L.D., 2005. The primacy of multimodal speech perception. In: Pisoni, D., Remez, R. (Eds.), Handbook of Speech Perception. Blackwell, Malden, MA, pp. 51–78.

Rudmann, D.S., McCarley, J.S., Kramer, A.F., 2003. Bimodal displays improve speech comprehension in environments with multiple speakers. Human Factors 45, 329–336.

Schwartz, J.L., Berthommier, F., Savariaux, C., 2004. Seeing to hear better: evidence for early audio–visual interactions in speech identification. Cognition 93, B69–B78.

Shewchuk, J.R., 1994. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Tech. Rep. CMU-CS-94-125, School of Computer Science, Carnegie Mellon University.

Sumby, W.H., Pollack, I., 1954. Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Amer. 26, 212–215.

Summerfield, A.Q., 1979. Use of visual information in phonetic perception. Phonetica 36, 314–331.

Tamura, S., Iwano, K., Furui, S., 2005. A stream-weight optimization method for multi-stream HMMs based on likelihood value normalisation. In: Proc. ICASSP 2005, Philadelphia, PA.

Wightman, F., Kistler, D., Brungart, D., 2006. Informational masking of speech in children: auditory–visual integration. J. Acoust. Soc. Amer. 119 (6), 3940–3949.

Young, S., Odell, J., Ollason, D., Valtchev, V., Woodland, P., 1995. The HTK Book. http://htk.eng.cam.ac.uk/.