INVITED PAPER

Audio-Visual Biometrics

Low-cost technology that combines recognition of individuals' faces with voice recognition is now available for low-security applications, but overcoming false rejections remains an unsolved problem.

By Petar S. Aleksic and Aggelos K. Katsaggelos

Manuscript received September 2, 2005; revised August 16, 2006. The authors are with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208 USA (e-mail: [email protected]; http://ivpl.ece.northwestern.edu/Staff/Petar.html; [email protected]). Digital Object Identifier: 10.1109/JPROC.2006.886017

ABSTRACT | Biometric characteristics can be utilized in order to enable reliable person recognition that is robust to impostor attacks. Speaker recognition technology is commonly utilized in various systems enabling natural human-computer interaction. The majority of speaker recognition systems rely only on acoustic information, ignoring the visual modality. However, visual information conveys correlated and complementary information to the audio information, and its integration into a recognition system can potentially increase the system's performance, especially in the presence of adverse acoustic conditions. Acoustic and visual biometric signals, such as the person's voice and face, can be obtained using unobtrusive and user-friendly procedures and low-cost sensors. Developing unobtrusive biometric systems makes biometric technology more socially acceptable and accelerates its integration into everyday life. In this paper, we describe the main components of audio-visual biometric systems, review existing systems and their performance, and discuss future research and development directions in this area.

KEYWORDS | Audio-visual biometrics; audio-visual databases; audio-visual fusion; audio-visual person recognition; face tracking; hidden Markov models; multimodal recognition; visual feature extraction

I. INTRODUCTION

Biometrics, or biometric recognition, refers to the utilization of physiological and behavioral characteristics for automatic person recognition [1], [2]. Person recognition can be classified into two problems: person identification and person verification (authentication). Person identification is the problem of determining the identity of a person from a closed set of candidates, while person verification refers to the problem of determining whether a person is who s/he claims to be. In general, a person verification system should be capable of rejecting claims from impostors, i.e., persons not registered with the system or registered but attempting access under someone else's identity, and accepting claims from clients, i.e., persons registered with the system and claiming their own identity. Applications that can employ person recognition systems include automatic banking, computer network security, information retrieval, secure building access, e-commerce, teleworking, etc. Personal devices, such as cell phones, PDAs, laptops, and cars, could also have built-in person recognition systems, which would help prevent impostors from using them. Traditional person recognition methods, including knowledge-based (e.g., passwords, PINs) and token-based (e.g., ATM or credit cards, and keys) methods, are vulnerable to impostor attacks. Passwords can be compromised, while keys and cards can be stolen or duplicated. Identity theft is one of the fastest growing crimes worldwide. For example, in the U.S. over ten million people were victims of identity theft within a 12-month period in 2004 and 2005 (approximately 4.6% of the adult population) [3].

Unlike knowledge- and token-based information, biometric characteristics cannot be forgotten or easily stolen, and systems that utilize them exhibit improved robustness to impostor attacks [1], [2], [4]. There are many different biometric characteristics that can be used in person recognition systems, including fingerprints, palm prints, hand and finger geometry, hand veins, iris and retinal scans, infrared thermograms, DNA, ears, faces, gait, voice, and signature [1] (see Fig. 1). The choice of biometric characteristics depends on many factors, including the best achievable performance, robustness to noise, cost and size of biometric sensors, invariance of characteristics with time, robustness to attacks, uniqueness, population coverage, scalability, template (representation) size, etc. [1], [2]. Each biometric modality has its own advantages and disadvantages with respect to the above-mentioned factors.


All of these factors are usually considered when choosing the most appropriate biometric characteristics for a certain application. A comparison of the most commonly used biometric characteristics with respect to maturity, accuracy, scalability, cost, obtrusiveness, sensor size, and template size is shown in Table 1 [2]. The best recognition performance is achieved when iris, fingerprints, hand, or signature are used as biometric features. However, systems utilizing these biometric features require user cooperation, are considered intrusive, and/or utilize high-cost sensors. Although reliable, they are usually not widely acceptable, except in high-security applications. In addition, there are many biometric applications, such as sport venue (gym) entrance check, desktop access, and building access, which do not require high security and in which it is very important to use unobtrusive and low-cost methods for extracting biometric features, thus enabling natural person recognition and reducing inconvenience.

In this review paper, we address audio-visual (AV) biometrics, where speech is utilized together with static video frames of the face or certain parts of the face (face recognition) [5]–[9] and/or video sequences of the face or mouth area (visual speech) [10]–[16] in order to improve person recognition performance (see Fig. 2). With respect to the type of acoustic and visual information they use, person recognition systems can be classified into audio-only, visual-only-static (only visual features obtained from a single face image are used), visual-only-dynamic (visual features containing temporal information obtained from video sequences are used), audio-visual-static, and audio-visual-dynamic. Systems that utilize acoustic information only are extensively covered in the audio-only speaker recognition literature [17] and will not be discussed here. In addition, visual-only-static (face recognition) systems are also covered extensively in the literature [18]–[22] and will not be discussed separately, but only in the context of AV biometrics.

Speaker recognition systems that rely only on audio data are sensitive to microphone types (headset, desktop, telephone, etc.), acoustic environment (car, plane, factory, babble, etc.), channel noise (telephone lines, VoIP, etc.), and complexity of the scenario (speech under stress, Lombard speech, whispered speech). On the other hand, systems that rely only on visual data are sensitive to visual noise, such as extreme lighting changes, shadowing, changing background, speaker occlusion and nonfrontality, segmentation errors, low spatial and temporal resolution video, compressed video, and appearance changes (hair style, make-up, clothing).

Audio-only speaker recognizers can perform poorly even at typical acoustic background signal-to-noise ratio (SNR) levels (−10 to 15 dB), and the incorporation of additional biometric modalities can alleviate problems characteristic of a single modality and improve system performance. This has been well established in the literature [5]–[16], [23]–[51]. The use of visual information, in addition to audio, improves speaker recognition performance even in noise-free environments [14], [28], [34]. The potential for such improvements is greater in acoustically noisy environments, since visual speech information is typically much less affected by acoustic noise than the acoustic speech information.

Fig. 1. Biometric characteristics: (a) fingerprints, (b) palm print, (c) hand and finger geometry, (d) hand veins, (e) retinal scan, (f) iris, (g) infrared thermogram, (h) DNA, (i) ear, (j) face, (k) gait, (l) speech, and (m) signature.

Table 1. Comparison of biometric characteristics (adapted from [2]).

It is true, however, that there is an equivalent of the acoustic Lombard effect in the visual domain, although it has been shown that it does not affect visual speech recognition as much as the acoustic Lombard effect affects acoustic speech recognition [52].

Audio-only and static-image-based (face recognition) biometric systems are susceptible to impostor attacks (spoofing) if the impostor possesses a photograph and/or speech recordings of the client. It is considerably more difficult for an impostor to impersonate both acoustic and dynamic visual information simultaneously. Overall, AV person recognition is one of the most promising user-friendly, low-cost person recognition technologies and is rather resilient to spoofing. It holds promise for wider adoption due to the low cost of audio and video biometric sensors and the ease of acquiring audio and video signals (even without assistance from the client) [53].

The remainder of the paper is organized as follows. We first describe in Section II the justification for combining the audio and video modalities for person recognition. We then describe the structure of an AV biometric system in Section III, and in Section IV we review the various approaches for extracting and representing important visual features. Subsequently, in Section V we describe the speaker recognition process, and in Section VI we review the main approaches for integrating the audio and visual information. In Section VII, we provide a description of some of the AV biometric systems that have appeared in the literature. Finally, in Section VIII, we provide an assessment of the topic, describe some of the open problems, and conclude the paper.

II. IMPORTANCE OF AV BIOMETRICS

Although great progress has been achieved over the past decades in computer processing of speech, it still lags significantly behind human performance levels [54], especially in noisy environments. On the other hand, humans easily accomplish complex communication tasks by utilizing additional sources of information whenever required, especially visual information. Face visibility benefits speech perception because the visual signal is both correlated with the produced audio signal [55]–[60] and contains complementary information to it [55], [61]–[66]. There has been significant work on investigating the relationship between articulatory movements, vocal tract shape, and speech acoustics [67]–[70]. It has also been shown that there exists a strong correlation among face motion, vocal tract shape, and speech acoustics [55], [61]–[66].

For example, Yehia et al. [55] investigated the degree of this correlation. They measured the motion of markers placed on the face and in the vocal tract. Their results show that 91% of the total variance observed in the facial motion could be determined from the vocal tract motion using simple linear estimators. In addition, looking at the reverse problem, they determined that 80% of the total variance observed in the vocal tract motion can be estimated from face motion. Regarding speech acoustics, linear estimators were sufficient to determine between 72% and 85% (depending on subject and utterance) of the variance observed in the root-mean-squared amplitude and line-spectrum-pair parametric representation of the spectral envelope from face motion. They also showed that even the tongue motion can be reasonably well recovered from the face motion, since the tongue frequently displays motion similar to that of the jaw during speech articulation.

Jiang et al. [56] investigated the correlation among external face movements, tongue movements, and speech acoustics for consonant–vowel (CV) syllables and sentences. They showed that multilinear regression could successfully be used to predict face movements from speech acoustics for short speech segments, such as CV syllables. The prediction was best for chin movements, followed by lip and cheek movements. They also showed, like the authors of [55], that there is high correlation between tongue and face movements.
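As a rough illustration of the kind of linear estimation used in these studies, the sketch below fits an ordinary least-squares mapping from acoustic features to face-marker trajectories and reports the fraction of variance it explains. The data are synthetic stand-ins, and the feature dimensions are arbitrary; this is not the measurement setup or regression procedure of [55] or [56].

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 500 frames of 12-dim acoustic features and
# 6-dim face-marker coordinates that are partly driven by the acoustics.
T, d_audio, d_face = 500, 12, 6
A = rng.normal(size=(T, d_audio))                     # acoustic features (e.g., spectral parameters)
W_true = rng.normal(size=(d_audio, d_face))
F = A @ W_true + 0.5 * rng.normal(size=(T, d_face))   # face-marker motion with noise

# Ordinary least-squares linear estimator F ~= A @ W (a simple stand-in
# for the multilinear regression reported in the studies above).
W, *_ = np.linalg.lstsq(A, F, rcond=None)
F_hat = A @ W

# Fraction of the total variance in the face motion explained by the estimator.
explained = 1.0 - np.var(F - F_hat) / np.var(F - F.mean(axis=0))
print(f"variance explained: {explained:.2f}")
```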

Hearing-impaired individuals utilize lipreading and speechreading in order to improve their speech perception. In addition, normal-hearing persons also use lipreading and speechreading to a certain extent, especially in acoustically noisy environments [61]–[66]. Lipreading represents the

Fig. 2. AV biometric system.

The preprocessing should be coupled with the choice and extraction of acoustic and visual features, as depicted by the dashed lines in Fig. 3. Acoustic features are chosen based on their robustness to channel and background noise. A number of results have been reported in the literature on extracting appropriate acoustic features for both clean and noisy speech conditions [88]–[91]. The most commonly utilized acoustic features are mel-frequency cepstral coefficients (MFCCs) and linear prediction coefficients (LPCs). Features that are more robust to noise, obtained with the use of spectral subband centroids [90] or the zero-crossings with peak-amplitudes model [89] as an acoustic front end, have also been proposed. Acoustic features are usually augmented by their first- and second-order derivatives (delta and delta–delta coefficients) [88]. The appropriate selection and extraction of acoustic features is not addressed in this paper.

On the other hand, the establishment of visual features for speaker recognition is a relatively newer research topic. Various approaches have been implemented for face detection and tracking and facial feature extraction [22], [92]–[96]; they are discussed in more detail in Section IV. The dynamics of the visual speech are captured, similarly to acoustic features, by augmenting the "static" (frame-based) visual feature vector with its first- and second-order time derivatives, which are computed over a short temporal window centered at the current video frame [86]. Mean normalization of the visual feature vectors can also be utilized to reduce variability due to illumination [115].
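As a minimal sketch of this kind of dynamic feature augmentation, the code below mean-normalizes a sequence of frame-based feature vectors and appends first- and second-order differences computed over a short window. The window length and the simple-difference form of the delta are illustrative choices, not the exact regression formula used in any particular system.

```python
import numpy as np

def add_dynamic_features(feats: np.ndarray, delta_win: int = 2) -> np.ndarray:
    """feats: (T, D) array of frame-based features.
    Returns (T, 3*D): mean-normalized statics, deltas, delta-deltas."""
    statics = feats - feats.mean(axis=0)              # mean normalization

    def delta(x: np.ndarray) -> np.ndarray:
        # Symmetric difference over +/- delta_win frames, edges padded by repetition.
        padded = np.pad(x, ((delta_win, delta_win), (0, 0)), mode="edge")
        return (padded[2 * delta_win:] - padded[:-2 * delta_win]) / (2.0 * delta_win)

    d1 = delta(statics)        # first-order (delta) coefficients
    d2 = delta(d1)             # second-order (delta-delta) coefficients
    return np.concatenate([statics, d1, d2], axis=1)

# Example: 100 frames of 13-dimensional features -> (100, 39)
print(add_dynamic_features(np.random.randn(100, 13)).shape)
```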

AV fusion combines audio and visual information in order to achieve higher person recognition performance than either audio-only or visual-only person recognition systems. If no fusion of the acoustic and visual information takes place, then audio-only and visual-only person recognition systems result (see Fig. 3). The main advantage of AV biometric systems lies in their robustness, since each modality can provide independent and complementary information and therefore prevent performance degradation due to noise present in one or both of the modalities. There exist various adaptive fusion approaches, which weight the contributions of the different modalities based on their discrimination ability and reliability, as discussed in Section VI. The rates of the acoustic and visual features are typically different. The rate of acoustic features is usually 100 Hz [86], while video frame rates can be up to 25 frames per second (50 fields per second) for PAL or 30 frames per second (60 fields per second) for NTSC. For fusion methods that require the same rate for both modalities, the video is typically up-sampled using an interpolation technique in order to achieve AV feature synchrony at the audio rate. Finally, adaptation of the person's models is an important part of the AV system in Fig. 3. It is usually performed when the environment or the speaker's voice characteristics change, or when the person's appearance changes due, for example, to pose or illumination changes, facial hair, glasses, or aging [80].
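A simple way to obtain such synchrony is to interpolate the visual feature trajectories at the acoustic frame times. The sketch below does this with linear interpolation, assuming a 100 Hz acoustic rate and a 25 Hz visual rate purely for illustration.

```python
import numpy as np

def upsample_visual(visual_feats: np.ndarray, video_rate: float = 25.0,
                    audio_rate: float = 100.0) -> np.ndarray:
    """Linearly interpolate (T_v, D) visual features to the audio frame rate."""
    T_v, D = visual_feats.shape
    t_video = np.arange(T_v) / video_rate                     # video frame times (s)
    t_audio = np.arange(int(T_v * audio_rate / video_rate)) / audio_rate
    return np.stack(
        [np.interp(t_audio, t_video, visual_feats[:, d]) for d in range(D)],
        axis=1,
    )

# Example: 50 video frames (2 s at 25 Hz) -> 200 audio-rate frames
print(upsample_visual(np.random.randn(50, 9)).shape)   # (200, 9)
```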

IV. ANALYSIS OF VISUAL FEATURES UTILIZED FOR AV BIOMETRICS

The choice of acoustic features for speaker recognition has been thoroughly investigated in the literature [88] (see also other papers in this issue). Therefore, in this paper we focus on presenting various approaches for the extraction of visual features utilized for AV speaker recognition. Visual features are usually extracted from two-dimensional (2-D) or three-dimensional (3-D) images [97]–[99], in the visible or infrared part of the spectrum [100]. Facial visual features can be classified as global or local, depending on whether the face is represented by one or by multiple feature vectors. Each local feature vector represents information contained in small image patches of the face or in specific regions of the face (e.g., eyes, nose, mouth, etc.). Visual features can also be either static (a single face image is used) or dynamic (a video sequence of only the mouth region, the visual-labial features, or the whole face is used). Static visual features are commonly used for face recognition [18]–[22], [92]–[96], while dynamic visual features are used for speaker recognition, since they contain additional important temporal information that captures the dynamics of facial feature changes, especially the changes in the mouth region (visual speech). The various sets of visual facial features proposed in the literature are generally grouped into three categories [101]: 1) appearance-based features, such as transformed vectors of the face or mouth region pixel intensities using, for example, image compression techniques [28], [77], [79]–[81], [83], [102]; 2) shape-based features, such as geometric or model-based representations of the face or lip contours [77], [79]–[83]; and 3) features that are a combination of both the appearance and shape features in 1) and 2) [79]–[81], [103].

The algorithms utilized for detecting and tracking the face, mouth, or lips depend on the visual features that will be used for speaker recognition, along with the quality of the video data and the resource constraints. For example, only a rough detection of the face or mouth region is sufficient to obtain appearance-based visual features, requiring only the tracking of the face and the two mouth corners. On the other hand, a computationally more expensive lip extraction and tracking algorithm is additionally required for obtaining shape-based features, a challenging task especially in low-resolution videos.

A. Facial Feature Detection, Tracking, and Extraction

Face detection constitutes, in general, a difficult problem, especially in cases where the background, head pose, and lighting are varying. It has attracted significant interest in the literature [93]–[96], [104]–[106].

Some reported face detection systems use traditional image processing techniques, such as edge detection, image thresholding, template matching, color segmentation, or motion information in image sequences [106]. They take advantage of the fact that many local facial subfeatures contain strong edges and are approximately rigid. Nevertheless, the most widely used techniques follow a statistical modeling of the face appearance to obtain a classification of image regions into face and nonface classes. Such regions are typically represented as vectors of grayscale or color image pixel intensities over normalized rectangles of a predetermined size. They are often projected onto lower dimensional spaces and are defined over a "pyramid" of possible locations, scales, and orientations in the image [93]. These regions are usually classified using one or more techniques, such as neural networks, clustering algorithms along with distance metrics from the face or nonface spaces, simple linear discriminants, support vector machines (SVMs), and Gaussian mixture models (GMMs) [93], [94], [104]. An alternative popular approach instead uses a cascade of weak classifiers that are trained using the AdaBoost technique and operate on local appearance features within these regions [105]. If color information is available, image regions that do not contain a sufficient number of skin-tone-like pixels can be determined (for example, utilizing hue and saturation) [107], [108] and eliminated from the search. Typically, face detection goes hand-in-hand with tracking, in which the temporal correlation is taken into account (tracking can be performed at the face or facial feature level). The simplest possible approach to capitalize on the temporal correlation is to assume that the face (or facial feature) will be present in the same spatial location in the next frame.
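As an illustration of the hue/saturation pruning step mentioned above, the sketch below marks pixels whose hue and saturation fall inside a loose skin-tone range and rejects candidate regions with too few such pixels. The thresholds are rough illustrative values, not those used in [107] or [108].

```python
import numpy as np

def skin_mask(rgb: np.ndarray,
              hue_range=(0.0, 0.14), min_sat=0.15, max_sat=0.75) -> np.ndarray:
    """rgb: (H, W, 3) float image in [0, 1]. Returns a boolean skin-tone mask."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    maxc, minc = rgb.max(axis=-1), rgb.min(axis=-1)
    delta = maxc - minc + 1e-8
    # Hue in [0, 1): piecewise definition of the HSV hue channel.
    hue = np.where(maxc == r, (g - b) / delta,
          np.where(maxc == g, 2.0 + (b - r) / delta, 4.0 + (r - g) / delta))
    hue = (hue / 6.0) % 1.0
    sat = delta / (maxc + 1e-8)
    return (hue >= hue_range[0]) & (hue <= hue_range[1]) & \
           (sat >= min_sat) & (sat <= max_sat)

def region_is_candidate(rgb_region: np.ndarray, min_fraction: float = 0.3) -> bool:
    """Keep a candidate face region only if enough of its pixels look skin-toned."""
    return skin_mask(rgb_region).mean() >= min_fraction

print(region_is_candidate(np.random.rand(32, 32, 3)))
```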

After successful face detection, the face image can be processed to obtain an appearance-based representation of the face. Hierarchical techniques can also be used at this point to detect a number of interesting facial features, such as the mouth corners, eyes, nostrils, and chin, by utilizing prior knowledge of their relative positions on the face in order to simplify the search. Also, if color information is available, hue and saturation information can be utilized in order to directly detect and extract certain facial features (especially the lips) or to constrain the search area and enable more accurate feature extraction [107], [108]. These features can be used to extract and normalize the mouth region-of-interest (ROI), which contains useful visual speech information. The normalization is usually performed with respect to head-pose information and lighting [Fig. 4(a) and (b)]. The appearance-based features are extracted from the ROI using image transforms (see Section IV).

On the other hand, shape-based visual mouth features (divided into geometric, parametric, and statistical, as in Fig. 5) are extracted from the ROI utilizing techniques such as snakes [109], templates [110], and active shape and appearance models [111]. A snake is an elastic curve represented by a set of control points, and it is used to detect important visual features, such as lines, edges, or contours. The snake control point coordinates are iteratively updated, converging towards a minimum of an energy function defined on the basis of curve smoothness constraints and a matching criterion to desired features of the image [109]. Templates are parametric curves that are fitted to the desired shape by minimizing an energy function, defined similarly to snakes. Examples of lip contour estimation using a gradient vector field (GVF) snake and two parabolic templates are depicted in Fig. 4(c) [82]. Examples of statistical models are active shape models (ASMs) and active appearance models (AAMs) [111]. The former are obtained by applying principal component analysis (PCA) [112] to training vectors containing the coordinates of a set of points that lie on the shapes of interest, such as the inner and outer lip contours. These vectors are projected onto a lower dimensional space defined by the eigenvectors corresponding to the largest PCA eigenvalues, which represent the axes of shape variation. The latter are extensions of ASMs that, in addition, capture the appearance variation of the region around the desired shape. AAMs remove the redundancy due to shape and appearance correlation and create a single model that describes both shape and the corresponding appearance deformation.
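To make the template idea concrete, the sketch below fits two parabolas, one for the upper and one for the lower lip, to already-detected lip-edge points by least squares. This is a simplified stand-in for the energy-minimizing template fitting of [82]; the candidate points are assumed to come from an earlier edge-detection step, and the toy data are synthetic.

```python
import numpy as np

def fit_lip_parabolas(upper_pts: np.ndarray, lower_pts: np.ndarray):
    """upper_pts, lower_pts: (N, 2) arrays of (x, y) lip-edge candidates.
    Returns the quadratic coefficients (a, b, c) of y = a*x^2 + b*x + c
    for the upper and lower lip contours."""
    def fit(points: np.ndarray) -> np.ndarray:
        x, y = points[:, 0], points[:, 1]
        return np.polyfit(x, y, deg=2)   # least-squares parabola fit
    return fit(upper_pts), fit(lower_pts)

# Toy example: noisy points around two parabolas spanning the mouth width.
x = np.linspace(-1.0, 1.0, 40)
upper = np.stack([x, -0.3 * (1 - x**2) + 0.02 * np.random.randn(40)], axis=1)
lower = np.stack([x,  0.5 * (1 - x**2) + 0.02 * np.random.randn(40)], axis=1)
coef_up, coef_lo = fit_lip_parabolas(upper, lower)

# Simple geometric lip features (mouth width/height) from the fitted curves.
width = x.max() - x.min()
height = np.polyval(coef_lo, 0.0) - np.polyval(coef_up, 0.0)
print(f"width={width:.2f}, height={height:.2f}")
```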

B. Visual Features

With appearance-based approaches to visual feature representation, the pixel values of the face or mouth ROI, tracked and extracted according to the discussion of the previous section, are directly considered. The extracted ROI is typically a rectangle containing the mouth, possibly including larger parts of the lower face, such as the jaw and cheeks [80], or could even be the entire face [103] (see Figs. 4(b) and 5). It can also be extended into a three-dimensional rectangle containing the ROIs of adjacent frames, thus capturing dynamic visual speech information. Alternatively, the mouth ROI can be obtained from a number of image profiles perpendicular to the estimated lip contour, as in [79], or from a disc around the mouth center [113]. A feature vector x_t (see Fig. 5) is created by ordering the grayscale pixel values inside the ROI.

Fig. 4. Mouth appearance and shape tracking for visual feature extraction. (a) Commonly detected facial features. (b) Two corresponding mouth ROIs of different sizes. (c) Lip contour estimation using a gradient vector field snake (upper: the snake's external force field is depicted) and two parabolas (lower) [82].

The dimension d of this vector typically becomes prohibitively large for successful statistical modeling of the classes of interest, and thus a lower dimensional transformation of it is used instead. A D × d dimensional linear transform matrix R is generally sought, such that the transformed data vector $y_t = R\, x_t$ contains most of the speechreading information in its D ≪ d elements (see Fig. 5). The matrix R is often obtained based on a number of training ROI grayscale pixel value vectors, utilizing techniques borrowed from the image compression and pattern classification literature. Examples of such transforms are PCA, generating "eigenlips" (or "eigenfaces" if applied to face images for face recognition) [113], the discrete cosine transform (DCT) [114], [115], the discrete wavelet transform (DWT) [115], linear discriminant analysis (LDA) [11], [77], [80], the Fisher linear discriminant (FLD), and the maximum likelihood linear transform (MLLT) [28], [80]. PCA provides a low-dimensional representation that is optimal in the mean-squared error sense, while LDA and FLD provide the most discriminant features, that is, features that offer a clear separation between the pattern classes. Often, these transforms are applied in a cascade [11], [80] in order to cope with the "curse of dimensionality" problem. In addition, fast algorithmic implementations are available for some of these transformations. It is important to point out that appearance-based features allow for dynamic visual feature extraction in real time, due to the fact that a rough ROI extraction can be achieved by utilizing computationally inexpensive face detection algorithms. Clearly, the quality of the appearance-based visual features degrades under intense head-pose and lighting variations.
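The sketch below shows one way such a transform R can be estimated and applied: PCA is fitted on training ROI pixel vectors and each new ROI is projected onto the leading eigenvectors ("eigenlips"). The ROI size and the number of retained components are arbitrary illustrative choices.

```python
import numpy as np

def fit_pca(train_rois: np.ndarray, n_components: int = 32):
    """train_rois: (N, d) matrix, each row a flattened mouth-ROI image.
    Returns the mean vector and a (n_components, d) projection matrix R."""
    mean = train_rois.mean(axis=0)
    centered = train_rois - mean
    # Principal axes via SVD of the centered data; rows of Vt are eigenvectors.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(roi: np.ndarray, mean: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Map one flattened ROI (d,) to its low-dimensional feature y = R (x - mean)."""
    return R @ (roi - mean)

# Example: 200 training ROIs of size 32x64 -> 32-dimensional "eigenlip" features.
train = np.random.rand(200, 32 * 64)
mean, R = fit_pca(train, n_components=32)
y = project(np.random.rand(32 * 64), mean, R)
print(y.shape)   # (32,)
```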

With shape-based features, it is assumed that most of the information is contained in the face contours or the shape of the speaker's lips [77], [102], [103], [116]. Therefore, such features achieve a compact representation of facial images and visual speech using low-dimensional vectors and are invariant to head pose and lighting. However, their extraction requires robust algorithms, which is often difficult and computationally intensive in realistic scenarios. Geometric features, such as the height, width, and perimeter of the mouth, are meaningful to humans and can be readily extracted from the mouth images. Geometric mouth features have been used for visual speech recognition [87], [117] and speaker recognition [117]. Alternatively, model-based visual features are typically obtained in conjunction with a parametric or statistical facial feature extraction algorithm. With model-based approaches, the model parameters are directly used as visual speech features [79], [82], [111].

An example of model-based visual features is represented by the facial animation parameters (FAPs) of the outer- and inner-lip contours [14], [82], [102]. FAPs describe facial movement and are used in the MPEG-4 AV object-based video representation standard to control facial animation, together with the so-called facial definition parameters (FDPs) that describe the shape of the face. The FAP extraction system described in [82] is shown in Fig. 6. The system first employs a template matching algorithm to locate the person's nostrils by searching the central area of the face in the first frame of each sequence. Tracking is performed by centering the search area in the next frame at the location of the nose in the previous frame. The nose location is used to determine the approximate mouth location. Subsequently, the outer lip contour is determined by using a combination of a GVF snake and a parabolic template [see also Fig. 4(c)].

Fig. 5. Various visual speech feature representation approaches discussed in this section: appearance-based (upper) and shape-based (lower) features that may utilize lip geometry, parametric, or statistical lip models.

Following the outer lip contour detection and tracking, ten FAPs describing the outer-lip shape ("group 8" FAPs [82]) are extracted from the resulting lip contour (see also Figs. 4 and 5). These are placed into a feature vector, which is subsequently projected by means of PCA onto a three-dimensional space [82]. The resulting visual features (eigenFAPs) are augmented by their first- and second-order derivatives, providing a nine-dimensional dynamic visual speech vector.

Since appearance- and shape-based visual features contain, respectively, low- and high-level information about the person's face and lip movements, their combination has been utilized in the expectation of improving the performance of the recognition system. Features of each type are usually simply concatenated [79], or a single model of face shape and appearance is created [103], [111]. For example, PCA appearance features are combined with snake-based features or ASMs, or a single model of face shape and appearance is created using AAMs [111]. PCA can be applied further to this single model vector [103].

In summary, a number of approaches can be used for extracting and representing the visual information utilized for speaker recognition. Unfortunately, limited work exists in the literature comparing the relative performance of visual speech features. The advantage of appearance-based features is that, unlike shape-based features, they do not require sophisticated extraction methods. In addition, appearance-based visual features contain information that cannot be captured by shape-based visual features. Their disadvantage is that they are generally sensitive to lighting and rotation changes. The dimensionality of appearance-based visual features is also usually much higher than that of shape-based visual features, which affects reliable training of AV person recognition systems. Most comparisons of visual features are made for features within the same category (appearance- or shape-based) in the context of AV or V-only person recognition, or AV-ASR [103], [114]–[116]. Occasionally, features across categories are compared, but in most cases with inconclusive results [102], [103], [114]. Thus, the question of what are the most appropriate and robust visual speech features remains to a large extent unresolved. Clearly, the characteristics of the particular application and factors such as computational requirements, video quality, and the visual environment have to be considered in addressing this question.

V. SPEAKER RECOGNITION PROCESS

Although single-modality biometric systems can achieve high performance in some cases, they are usually not robust under nonideal conditions and do not meet the needs of many potential person recognition applications. In order to improve the robustness of biometric systems, multisample (multiple samples of the same biometric characteristic), multialgorithm (multiple algorithms applied to the same biometric sample), and multimodal (different biometric characteristics) biometric systems have been developed [5]–[16], [23]–[51]. Different modalities can provide independent and complementary information, thus alleviating problems characteristic of single modalities. The fusion of multiple modalities is a critical issue in the design of any recognition system, as is the case with the fusion of the audio and visual modalities in the design of AV person recognition systems. In order to justify the complexity and cost of incorporating the visual modality into a person recognition system, fusion strategies should ensure that the performance of the resulting AV system exceeds that of its single-modality counterparts, hopefully by a significant amount, especially in nonideal conditions (e.g., acoustically noisy environments). The choice of classifiers and algorithms for feature and classifier fusion is clearly central to the design of AV person recognition systems. In this section, we describe the speaker recognition process, and in the next section we review the main concepts and techniques for combining the acoustic and visual feature streams.

Audio-only speaker recognition has been extensively discussed in the literature [17], and the objective of this paper is to concentrate only on AV speaker recognition systems.

Fig. 6. Shape-based visual feature extraction system of [82], depicted schematically in parallel with the audio front end, as used for AV speaker recognition experiments [14].

The process of speaker recognition consists of two phases, training and testing. In the training phase, the speaker is asked to utter certain phrases in order to acquire the data to be used for training. In general, the larger the amount of training data, the better the performance of a speaker recognition system. In the testing phase, the speaker utters a certain phrase, and the system accepts or rejects his/her claim (speaker authentication) or makes a decision on the speaker's identity (speaker identification). The testing phase can be followed by an adaptation phase, during which the recognized speaker's data are used to update the models corresponding to him/her. This phase is used for improving the robustness of the system to changes in the speaker's acoustic and visual speech characteristics over time, due to channel and environmental noise, visual appearance changes, etc. Maximum a posteriori (MAP) [112] adaptation is commonly used in this phase.
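As a rough sketch of what such an adaptation step can look like, the code below performs relevance-style MAP adaptation of the mean vectors of a Gaussian mixture model toward newly observed feature frames. The mean-only form and the relevance factor are common simplifications and are used here purely for illustration, not as the exact procedure of [112].

```python
import numpy as np

def map_adapt_means(means: np.ndarray, covs: np.ndarray, weights: np.ndarray,
                    frames: np.ndarray, relevance: float = 16.0) -> np.ndarray:
    """Mean-only MAP adaptation of a diagonal-covariance GMM.
    means: (K, D), covs: (K, D) diagonal variances, weights: (K,), frames: (T, D)."""
    # Per-frame, per-component responsibilities (posterior occupation probabilities).
    diff = frames[:, None, :] - means[None, :, :]                # (T, K, D)
    log_gauss = -0.5 * (np.sum(diff**2 / covs, axis=2)
                        + np.sum(np.log(2 * np.pi * covs), axis=1))
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                      # (T, K)

    # Sufficient statistics and interpolation with the prior means.
    n_k = post.sum(axis=0)                                       # (K,)
    first = post.T @ frames                                      # (K, D)
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * (first / np.maximum(n_k, 1e-10)[:, None]) + (1 - alpha) * means

# Example: adapt a 4-component, 13-dimensional model with 50 new frames.
K, D = 4, 13
adapted = map_adapt_means(np.random.randn(K, D), np.ones((K, D)),
                          np.full(K, 1.0 / K), np.random.randn(50, D))
print(adapted.shape)   # (4, 13)
```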

Speaker recognition systems can be classified as text-dependent or text-independent, based on the text used in the testing phase. Text-dependent systems can be further divided into fixed-phrase and prompted-phrase systems. Fixed-phrase systems are trained on the phrase that is also used for testing. However, these systems are very vulnerable to attacks (an impostor, for example, could play a recorded phrase uttered by the user). Prompted-phrase systems ask the claimant to utter a word sequence (phoneme sequence) not used in the training phase or in previous tests. These systems are less vulnerable to attacks (an impostor would need to generate in real time an AV representation of the speaker) but require an interface for prompting phrases. In text-independent systems, the speech used for testing is unconstrained. These systems are convenient for applications in which we cannot control the claimant's input. However, they would be classified as more vulnerable to attacks than the prompted-phrase systems.

A. Classifiers in AV Speaker Recognition

The speaker recognition problem is a classification problem. A set of classes needs to be defined first, and then, based on the observations, one of these classes is chosen. Let C denote the set of all classes. In speaker identification systems, C typically consists of the enrolled subject population, possibly augmented by a class denoting the unknown subject. On the other hand, in speaker authentication systems, C reduces to a two-member set, consisting of the class corresponding to the user and the general population (impostor class).

The number of classes can be larger than mentioned above, as, for example, in text-dependent speaker recognition (and ASR). In this case, subphonetic units are considered, utilizing tree-based clustering of possible phonetic contexts (bi-phones or tri-phones, for example) to allow coarticulation modeling [86]. The set of classes C then becomes the product space between the set of speakers and the set of phonetic-based units. Similarly, one could consider visemic subphonetic units, obtained for example by decision-tree clustering based on visemic context (visemic units have been used in ASR [118]). The same set of classes is typically used if both speech modalities are considered, since the use of different classes complicates AV integration, especially in text-dependent speaker recognition.

The input observations to the recognizer are represented by the extracted feature vectors at time t, $o_{s,t}$, and their sequence over a time interval T, $O_s = \{o_{s,t},\ t \in T\}$, where $s \in S$ denotes the available modalities. For instance, for an AV speaker recognition system, $S = \{a, v, f\}$, where a stands for the audio, v for the visual-dynamic (visual-labial), and f for the face appearance input.

A number of approaches can be used to model our knowledge of how the observations are generated by each class. Such approaches are usually statistical in nature and express our prior knowledge in terms of the conditional probability $P(o_{s,t} \mid c)$, where $c \in C$, by utilizing artificial neural networks (ANNs), support vector machines (SVMs), Gaussian mixture models (GMMs), hidden Markov models (HMMs), etc. The parameters of the prior model are estimated during training. During testing, based on the trained model, which describes the way the observations are generated by each class, the posterior probability is maximized. In single-modality speaker identification systems, the objective is to determine the class c, corresponding to one of the enrolled persons or to the unknown person, that best matches the person's biometric data, that is,

$$\hat{c} = \arg\max_{c \in C} P(c \mid o_{s,t}), \quad s \in \{a, v, f\} \quad (1)$$

where $P(c \mid o_{s,t})$ represents the posterior conditional probability. In closed-set identification systems, the unknown person is not modeled and the classification is forced into one of the enrolled persons' classes.

In single-modality speaker verification systems there are only two classes: the class corresponding to the general population (impostor class), denoted by w, and c, the class corresponding to the true claimant. Then, the following similarity measure D can be defined:

$$D = \log P(c \mid o_{s,t}) - \log P(w \mid o_{s,t}), \quad s \in \{a, v, f\}. \quad (2)$$

If D is larger than an a priori defined verification threshold, the claim is accepted; otherwise, it is rejected. The world model is usually obtained using biometric data from the general speaker population.

A widely used prior model is the GMM, expressed by

$$P(o_{s,t} \mid c) = \sum_{k=1}^{K_{s,c}} w_{s,c,k}\, \mathcal{N}(o_{s,t};\, m_{s,c,k},\, \Sigma_{s,c,k}) \quad (3)$$

where $K_{s,c}$ denotes the number of mixture components, $w_{s,c,k}$ are the mixture weights, which are positive and add up to one, and $\mathcal{N}(o;\, m,\, \Sigma)$ represents a multivariate Gaussian distribution with mean m and covariance matrix $\Sigma$, typically taken to be diagonal. During training, the parameters of each GMM in (3), namely $w_{s,c,k}$, $m_{s,c,k}$, and $\Sigma_{s,c,k}$, are estimated. In certain applications, for example text-independent speaker recognition, a single GMM is used to model the entire observation sequence $O_s$ for each class c. Then, for a given modality s, the MAP estimate of the unknown class is obtained as

$$\hat{c} = \arg\max_{c \in C} P(c \mid O_s) = \arg\max_{c \in C} P(c) \prod_{t \in T} P(o_{s,t} \mid c) \quad (4)$$

where P(c) denotes the class prior.
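The sketch below implements this scoring rule for closed-set identification with diagonal-covariance GMMs: it accumulates per-frame log-likelihoods under each enrolled speaker's model and picks the class with the highest total. A uniform class prior is assumed for simplicity, and model training itself (e.g., by EM) is not shown.

```python
import numpy as np

def gmm_loglik(frames: np.ndarray, weights: np.ndarray,
               means: np.ndarray, variances: np.ndarray) -> float:
    """Total log-likelihood of (T, D) frames under a diagonal-covariance GMM
    with (K,) weights, (K, D) means, and (K, D) variances."""
    diff = frames[:, None, :] - means[None, :, :]                       # (T, K, D)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff**2 / variances, axis=2))            # (T, K)
    # log-sum-exp over mixture components, then sum over frames.
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.sum(m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1))))

def identify(frames: np.ndarray, models: dict) -> str:
    """Closed-set identification: argmax over enrolled classes of the sequence
    log-likelihood (uniform priors assumed)."""
    scores = {name: gmm_loglik(frames, *params) for name, params in models.items()}
    return max(scores, key=scores.get)

# Example with two toy single-Gaussian "GMMs" and a random test sequence.
D = 13
models = {
    "alice": (np.array([1.0]), np.zeros((1, D)), np.ones((1, D))),
    "bob":   (np.array([1.0]), np.ones((1, D)),  np.ones((1, D))),
}
print(identify(np.random.randn(40, D), models))
```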

In a number of other applications, such as text-dependent speaker recognition, the observations are generated by a temporal sequence of interacting states. In this case, HMMs are widely used. They consist of a number of states; at each time instance, one of them generates ("emits") the observed features with the class-conditional probability given by (3). The HMM parameters (mixture weights, means, variances, transition probabilities) are typically estimated iteratively during training, using the expectation-maximization (EM) algorithm [84], [86], or by discriminative training methods [84]. Once the model parameters are estimated, HMMs can be used to obtain the "optimal state sequence" $\lambda_c = \{\lambda^c_t,\ t \in T\}$ per class c, given an observation sequence $O_s$ over an interval T; namely,

$$\hat{\lambda}_c = \arg\max_{\lambda_c} P(\lambda_c \mid O_s) = \arg\max_{\lambda_c} P(c) \prod_{t \in T} P\!\left(\lambda^c_t \mid \lambda^c_{t-1}\right) P\!\left(o_{s,t} \mid \lambda^c_t\right) \quad (5)$$

where $P(\lambda^c_t \mid \lambda^c_{t-1})$ denotes the transition probability from state $\lambda^c_{t-1}$ to state $\lambda^c_t$. The Viterbi algorithm, based on dynamic programming, is used for solving (5) [84], [86]. Then, the MAP estimate of the unknown class is obtained as

$$\hat{c} = \arg\max_{c \in C} P(\hat{\lambda}_c \mid O_s). \quad (6)$$
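For concreteness, the sketch below shows a log-domain Viterbi decoder for one class's HMM: given per-frame, per-state emission log-likelihoods (e.g., computed with (3)), it returns the best state sequence and its score, which can then be compared across classes as in (6). The matrix-based interface is an illustrative simplification.

```python
import numpy as np

def viterbi(log_emit: np.ndarray, log_trans: np.ndarray, log_init: np.ndarray):
    """Log-domain Viterbi decoding.
    log_emit: (T, N) emission log-likelihoods, log_trans: (N, N) transition
    log-probabilities, log_init: (N,) initial state log-probabilities.
    Returns (best_path, best_score)."""
    T, N = log_emit.shape
    delta = log_init + log_emit[0]               # best score ending in each state
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans        # (N, N): previous state x next state
        backptr[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + log_emit[t]
    # Trace back the highest-scoring state sequence.
    path = np.zeros(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):
        path[t] = backptr[t + 1, path[t + 1]]
    return path, float(delta.max())

# Toy example: 3-state left-to-right model, 8 frames of random emission scores.
N, T = 3, 8
log_trans = np.log(np.array([[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]]) + 1e-12)
log_init = np.log(np.array([1.0, 1e-12, 1e-12]))
path, score = viterbi(np.random.randn(T, N), log_trans, log_init)
print(path, round(score, 2))
```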

B. Performance Evaluation Measures

The performance of identification systems is usually reported in terms of the identification error or the rank-N correct identification rate, defined as the probability that the correct match of the unknown person's biometric data is among the top N similarity scores (this scenario corresponds to an identification system which is not fully automated, needing human intervention or additional identification systems applied in cascade).

Two commonly used error measures for verification performance are the false acceptance rate (FAR), where an impostor is accepted, and the false rejection rate (FRR), where a client is rejected. They are defined by

$$\mathrm{FAR} = \frac{I_A}{I} \times 100\%, \qquad \mathrm{FRR} = \frac{C_R}{C} \times 100\% \quad (7)$$

where $I_A$ denotes the number of accepted impostors, I the number of impostor claims, $C_R$ the number of rejected clients, and C the number of client claims. There is an inherent tradeoff between FAR and FRR, which is controlled by an a priori chosen verification threshold. The receiver operating characteristic (ROC) or the detection error tradeoff (DET) curve can be used to graphically represent the tradeoff between FAR and FRR [119]. DET and ROC depict FRR as a function of FAR on a log and linear scale, respectively. The detection cost function (DCF) is a measure derived from FAR and FRR according to

$$\mathrm{DCF} = \mathrm{Cost(FR)} \cdot P(\mathrm{client}) \cdot \mathrm{FRR} + \mathrm{Cost(FA)} \cdot P(\mathrm{impostor}) \cdot \mathrm{FAR} \quad (8)$$

where P(client) and P(impostor) are the prior probabilities that a client or an impostor will use the system, respectively, while Cost(FA) and Cost(FR) represent, respectively, the costs of false acceptance and false rejection. The half total error rate (HTER) [119], [120] is a special case of the DCF when the prior probabilities are equal to 0.5 and the costs equal to 1, resulting in HTER = (1/2)(FRR + FAR). Verification system performance is often reported using a single measure, either by choosing the threshold for which FAR and FRR are equal, resulting in the equal error rate (EER), or by choosing the threshold that minimizes the DCF (or HTER). The appropriate threshold can be found either using the test set (providing biased results) or a separate validation set [121]. Expected performance curves (EPCs) are proposed as a verification measure in [121] and [122]. They provide unbiased expected system performance analysis, using a validation set to compute thresholds corresponding to various criteria related to real-life applications.
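The sketch below computes these quantities from arrays of client and impostor scores by sweeping the decision threshold; the EER is taken at the sweep point where FAR and FRR are closest, which is a simple approximation of the true crossing point.

```python
import numpy as np

def far_frr(client_scores: np.ndarray, impostor_scores: np.ndarray, threshold: float):
    """FAR and FRR (in %) for an accept-if-score>threshold rule."""
    far = 100.0 * np.mean(impostor_scores > threshold)   # accepted impostors / impostor claims
    frr = 100.0 * np.mean(client_scores <= threshold)    # rejected clients / client claims
    return far, frr

def sweep(client_scores, impostor_scores, cost_fr=1.0, cost_fa=1.0,
          p_client=0.5, p_impostor=0.5):
    """Approximate EER and minimum DCF over a grid of thresholds."""
    grid = np.linspace(min(impostor_scores.min(), client_scores.min()),
                       max(impostor_scores.max(), client_scores.max()), 1000)
    fars, frrs = zip(*(far_frr(client_scores, impostor_scores, t) for t in grid))
    fars, frrs = np.array(fars), np.array(frrs)
    eer = 0.5 * (fars + frrs)[np.argmin(np.abs(fars - frrs))]
    dcf = cost_fr * p_client * frrs / 100 + cost_fa * p_impostor * fars / 100
    return eer, dcf.min()

# Example with synthetic, partially separated score distributions.
rng = np.random.default_rng(1)
eer, min_dcf = sweep(rng.normal(1.5, 1.0, 2000), rng.normal(0.0, 1.0, 2000))
print(f"EER ~ {eer:.1f}%, min DCF ~ {min_dcf:.3f}")
```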

VI. AV FUSION METHODS

Information fusion is used for the integration of different sources of information, with the ultimate goal of achieving superior classification results. Fusion approaches are usually classified into three categories: premapping fusion, midst-mapping fusion, and postmapping fusion [8]

(see Fig. 7). These are also referred to in the literature as early integration, intermediate integration, and late integration, respectively [80], and the terms will be used interchangeably in this paper. In premapping fusion, audio and visual information are combined before the classification process. In midst-mapping fusion, audio and visual information are combined during the mapping from sensor data or feature space into opinion or decision space. Finally, in postmapping fusion, information is combined after the mapping from sensor data or feature space into opinion or decision space. In the remainder of this section, following [8], we review some of the most commonly used information fusion methods and comment on their use for the fusion of acoustic and visual information.

A. Premapping Fusion (Early Integration)

Premapping fusion can be divided into sensor-data-level and feature-level fusion. In sensor-data-level fusion [123], the sensor data obtained from different sensors of the same modality are combined. Weighted summation and mosaic construction are typically utilized in order to enable sensor-data-level fusion. In the weighted summation approach, the data are first normalized, usually by mapping them to a common interval, and then combined utilizing weights. For example, weighted summation can be utilized to combine multiple visible and infrared images or to combine acoustic data obtained from several microphones of different types and quality. Mosaic construction can be utilized to create one image from images of parts of the face obtained by several different cameras.

Feature-level fusion represents the combination of the features obtained from different sensors. Joint feature vectors are obtained either by weighted summation (after normalization) or by concatenation (e.g., by appending a visual to an acoustic feature vector, with normalization, in order to obtain a joint feature vector). The features obtained by the concatenation approach are usually of high dimensionality, which can affect reliable training of a classification system (the "curse of dimensionality") and ultimately recognition performance. In addition, concatenation does not allow for modeling the reliability of the individual feature streams. For example, in the case of audio and visual streams, it cannot take advantage of information that might be available about the acoustic or the visual noise in the environment. Furthermore, the audio and visual feature streams should be synchronized before the concatenation is performed.
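A minimal sketch of such feature-level fusion is shown below: the visual stream is assumed to have already been up-sampled to the audio rate (e.g., by interpolation as in Section III), after which each stream is variance-normalized and the frames are concatenated into joint feature vectors.

```python
import numpy as np

def concat_fusion(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Early (feature-level) fusion by concatenation.
    audio_feats: (T, Da), visual_feats: (T, Dv), already synchronized in time.
    Returns (T, Da + Dv) joint feature vectors."""
    def zscore(x: np.ndarray) -> np.ndarray:
        # Per-dimension normalization so neither stream dominates by scale.
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    return np.concatenate([zscore(audio_feats), zscore(visual_feats)], axis=1)

# Example: 200 synchronized frames of 39-dim audio and 9-dim visual features.
joint = concat_fusion(np.random.randn(200, 39), np.random.randn(200, 9))
print(joint.shape)   # (200, 48)
```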

B. Midst-Mapping Fusion (Intermediate Integration)

With midst-mapping fusion, the information streams are processed during the procedure of mapping the feature space into the opinion or decision space. It exploits the temporal dynamics contained in the different streams, thus avoiding problems resulting from vector concatenation, such as the "curse of dimensionality" and the requirement of matching rates. Furthermore, stream weights can be utilized in midst-mapping fusion to account for the reliability of the different streams of information. Multi-stream and extended HMMs [32], [79], [80], [82], [88] are commonly used for midst-mapping fusion in AV speaker recognition systems [13], [30], [32], [33]. For example, multi-stream HMMs can be used to combine acoustic and visual dynamic information [80], [82], [88], allowing for easy modeling of the reliability of the audio and visual streams and various levels of asynchronicity between them. In the state-synchronous case, the probability density function for each HMM state is defined as [77], [88]

$$P\!\left(o_{s,t} \mid \lambda^c_t\right) = \prod_{s \in \{a,v\}} \left[\, \sum_{k=1}^{K_{s,c}} w_{s,\lambda^c_t,k}\, \mathcal{N}(o_{s,t};\, m_{s,c,k},\, \Sigma_{s,c,k}) \right]^{\gamma_s} \quad (9)$$

Fig. 7. AV fusion methods (adapted from [8]).

where $\gamma_s$ denotes the stream weight corresponding to modality s. The stream weights add up to one, that is, $\gamma_a + \gamma_v = 1$. The combination of audio and visual information can also be performed at the phone, word, and utterance level [80], allowing for different levels of asynchronicity between the audio and visual streams.
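In the log domain, the state-synchronous combination in (9) reduces to a weighted sum of the per-stream log-likelihoods. The sketch below shows that step in isolation, assuming the per-stream GMM log-likelihoods for a given state have already been computed.

```python
import numpy as np

def combine_streams(log_lik_audio: float, log_lik_visual: float,
                    gamma_audio: float = 0.7) -> float:
    """State-synchronous multi-stream combination: the exponents of (9) become
    weights on the per-stream log-likelihoods, with gamma_a + gamma_v = 1."""
    gamma_visual = 1.0 - gamma_audio
    return gamma_audio * log_lik_audio + gamma_visual * log_lik_visual

# Example: the audio stream is weighted more heavily (e.g., high acoustic SNR).
print(combine_streams(-42.3, -57.1, gamma_audio=0.7))
```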

C. Postmapping Fusion (Late Integration)Postmapping fusion approaches are grouped into

decision and opinion fusion (also referred to as score-level fusion). With decision fusion [11], [14], classifier decisionsare combined in order to obtain the final decision by uti-

lizing majority voting, or combination of ranked lists, or and and or operators. In majority voting [48], the finaldecision is made when the majority of the classifiers reaches

the same decision. The number of classifiers should be

chosen carefully in order to prevent ties (e.g., for a two class

problem, such as speaker verification, the number of classifiers should be odd). In ranked list combination fusion[48], [124], ranked lists provided by each classifier are

combined in order to obtain the final ranked list. There exist

 various approaches for combination of ranked lists [124]. In and fusion [125], the final decision is made only if all

classifiers reach the same decision. This type of fusion istypically used for high-security applications where we want

to achieve very low FA rates, by allowing higher FR rates.On the other hand, when the or fusion method is utilized,

the final decision is made as soon as one of the classifiers

reaches a decision. This type of fusion is utilized for low-security applications where we want to achieve lower FR 

rates and prevent causing inconvenience to the registered

users by allowing higher FA rates.Unlike decision fusion, in opinion (score-level) fusion

Unlike decision fusion, in opinion (score-level) fusion methods the experts do not provide final decisions but only opinions (scores) on each possible decision. The opinions are usually first normalized by mapping them to a common interval and then combined utilizing weights (e.g., by weighted summation or weighted product fusion). The weights are determined based on the discriminating ability of each classifier and the quality of the utilized features (usually affected by the feature extraction method and/or the presence of different types of noise). For example, when audio and visual information are employed, the acoustic SNR and/or the quality of the visual feature extraction algorithms are considered in determining the weights. After the opinions are combined, the class that corresponds to the highest opinion is chosen. In the postclassifier opinion fusion approach [8], the likelihoods corresponding to each of the $N_C$ classes of interest, obtained utilizing each of the $N_L$ available experts, are considered as features in the $N_L \times N_C$-dimensional space, in which the classification of the resulting features is performed. This method is particularly useful in verification applications, since only two classes are available. In the case of AV speaker verification, the number of experts can also be small (two or more). However, utilizing this method in identification problems (with a large number of classes), or when the number of experts $N_L$ is large, can result in features of high dimensionality, which could cause inadequate performance.
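A minimal sketch of score-level fusion as described above: raw expert scores are first mapped to a common interval (min-max normalization is used here as one common choice) and then combined by weighted summation or weighted product. The weights, which in practice would reflect classifier reliability and, e.g., the acoustic SNR, as well as the score bounds and array names, are illustrative assumptions.

```python
import numpy as np


def minmax_normalize(scores, lo, hi):
    """Map raw expert scores onto [0, 1] using known score bounds."""
    return (np.asarray(scores, dtype=float) - lo) / (hi - lo)


def weighted_sum_fusion(norm_scores, weights):
    """Weighted summation of normalized per-expert scores, per class."""
    return sum(w * s for w, s in zip(weights, norm_scores))


def weighted_product_fusion(norm_scores, weights):
    """Weighted product: scores raised to their weights and multiplied."""
    return np.prod([s ** w for w, s in zip(weights, norm_scores)], axis=0)


# Two experts (audio, face) scoring three candidate identities.
audio = minmax_normalize([12.0, 7.5, 9.1], lo=0.0, hi=15.0)
face = minmax_normalize([0.62, 0.70, 0.35], lo=0.0, hi=1.0)
fused = weighted_sum_fusion([audio, face], weights=[0.6, 0.4])
best_identity = int(np.argmax(fused))   # class with the highest fused opinion
```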

It is important to point out that the problem of combining single-modality biometric results becomes more complicated in the case of speaker identification, which requires one-to-many comparisons to find the best match. In addition, when choosing the biometric modalities to be used in a multimodal biometric system, one should consider not only how much the modalities complement each other in terms of identification accuracy but also the identification speed. In some cases, a cascade of single-modality biometric classifiers can be used: a fast, but not necessarily highly accurate, first biometric system narrows down the number of candidates, and a second, higher accuracy biometric system is then applied to the remaining candidates to improve the overall identification performance.
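A sketch of this cascaded identification strategy is given below, where a fast but less accurate expert prunes the candidate list before a slower, more accurate expert scores the survivors; the two scoring callables and the pruning fraction are hypothetical placeholders.

```python
def cascade_identification(candidates, fast_score, accurate_score, keep=0.1):
    """Two-stage identification: the fast expert keeps the top fraction of
    candidates, and the accurate expert makes the final decision on them."""
    ranked = sorted(candidates, key=fast_score, reverse=True)
    shortlist = ranked[:max(1, int(len(ranked) * keep))]
    return max(shortlist, key=accurate_score)
```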

VII. AV BIOMETRIC SYSTEMS
In this section, we briefly review corpora commonly used for AV person recognition research and present examples of specific systems and their performance.

A. AV Databases
In contrast to the abundance of audio-only databases, there exist only a few databases suitable for AV biometric research. This is because the field is relatively young, but also because AV corpora pose additional challenges concerning database collection, storage, distribution, and privacy. The databases most commonly used in the literature were collected by a few university groups or individual researchers with limited resources, and as a result, they usually contain a small number of subjects and have relatively short duration. AV databases also vary greatly in the number of speakers, vocabulary size, number of sessions, nonideal acoustic and visual conditions, and evaluation measures. This makes the comparison of different visual features and fusion methods, with respect to the overall performance of an AV biometric system, difficult. They lack realistic variability and are usually limited to one area or focus on one aspect of biometric person recognition. Some of the currently publicly available AV databases which have been used in the published literature on AV biometrics are the M2VTS (multimodal verification for teleservices and security applications) [126] and XM2VTS (extended M2VTS) [127], BANCA (biometric access control for networked and e-commerce applications) [128], VidTIMIT [26] (video recordings of people reciting sentences from the TIMIT corpus), DAVID [129], VALID [130], and AVICAR (AV speech corpus in a car environment) [131] databases. We next provide a short description of each of them.

The M2VTS [126] database consists of audio recordings and video sequences of 37 subjects uttering the digits 0 through 9 in five sessions spaced apart by at least one week. The subjects were also asked to rotate their heads to the left and then to the right in each session in order to obtain a head rotation sequence that can provide 3-D face features to be used for face recognition purposes. The main drawbacks of this database are its small size and limited vocabulary. The extended M2VTS database [127] consists of audio recordings and video sequences of 295 subjects uttering three fixed phrases, two ten-digit sequences and one seven-word sentence, with two utterances of each phrase, in four sessions. The main drawback of this database is its limitation to the development of text-dependent systems. Both the M2VTS and XM2VTS databases have been frequently used in the literature for comparison of different AV biometric systems (see Table 2).

Table 2  Sample AV Person Recognition Systems

The BANCA database consists of audio recordings and video sequences of 208 subjects (104 male, 104 female) recorded in three different scenarios, controlled, degraded, and adverse, over 12 sessions spanning three months. The subjects were asked to say a random 12-digit number, their name, their address, and their date of birth during each of the recordings. The BANCA database was captured in four European languages. Both high- and low-quality microphones and cameras were used for recording. This database provides realistic and challenging conditions and allows for comparison of different systems with respect to their robustness.

The VidTIMIT database consists of audio recordings and video sequences of 43 subjects (19 female and 24 male) reciting short sentences from the test section of the NTIMIT corpus [26] in three sessions, with an average delay of a week between sessions, allowing for appearance and mood changes. Each person utters ten sentences. The first two sentences are the same for all subjects, while the remaining eight are generally different for each person. All sessions contain phonetically balanced sentences. In addition to the sentences, the subjects were asked to move their heads left, right, up, and then down, in order to obtain a head rotation sequence. AV biometric systems that utilize the VidTIMIT corpus are described in [8].

The DAVID database consists of audio and video recordings (frontal and profile views) of more than 100 speakers, including 30 subjects recorded in five sessions over a period of several months. The utterances include a digit set, an alphabet set, vowel–consonant–vowel syllables, and phrases. The challenging visual conditions include illumination changes and variable scene background complexity.

The VALID database consists of five recordings of 106 subjects (77 male, 29 female) over a period of one month. Four of the sessions were recorded in an office environment in the presence of visual noise (illumination changes) and acoustic noise (background noise). In addition, one session was recorded in a studio environment and contains a head rotation sequence, in which the subjects were asked to face four targets placed above, below, left, and right of the camera. The database consists of recordings of the same utterances as those recorded in the XM2VTS database, therefore enabling comparison of the performance of different systems and investigation of the effect of challenging visual environments on the performance of algorithms developed with the XM2VTS database.

The AVICAR database [131] consists of audio recordings and video sequences of 100 speakers (50 male and 50 female), with various language backgrounds (60% native American English speakers), uttering isolated digits, isolated letters, phone numbers, and TIMIT sentences inside a car. Audio recordings were obtained using a visor-mounted array of eight microphones under five different car noise conditions (car idle, and 35 and 55 mph with all windows rolled up or with just the front windows rolled down). Video sequences were obtained using a dashboard-mounted array of four video cameras. This database provides different challenges for the tracking and extraction of visual features and can be utilized for analysis of the effect of nonideal acoustic and visual conditions on AV speaker recognition performance.

Additional datasets for AV biometrics research are the Clemson University AV Experiments (CUAVE) corpus containing connected digit strings [132], the AMP/CMU database of 78 isolated words [133], the Tulips1 set of four isolated digits [134], the IBM AV database [15], and the AV-TIMIT corpus [9].

Probably none of the existing AV databases has all the desirable characteristics, such as an adequate number of subjects, size of vocabulary and utterances, realistic variability (representing, for example, speaker identification on a mobile handheld device, or taking into account other nonideal acoustic and visual conditions), recommended experiment protocols (specifying, for example, that certain subjects are to be used as clients and certain other subjects as impostors), and the ability to be utilized for text-independent as well as text-dependent verification systems. There is, therefore, a great need for new, standardized databases and evaluation measures (see Section V-B) that would enable fair comparison of different systems and represent realistic nonideal conditions. Experiment protocols should also be defined in a way that avoids biased results and allows for fair comparison of different person recognition systems.

B. Examples of AV Biometric Systems
The performance of AV person recognition systems strongly depends on the choice and accurate extraction of the acoustic and visual features and on the AV fusion approach utilized. Due to differences in visual features, fusion methods, AV databases, and evaluation procedures, it is usually very difficult to compare systems. As mentioned in Section I, audio-only speaker recognition and face recognition systems are extensively covered elsewhere and will not be discussed here. We present in this section various visual-only-dynamic, audio-visual-static, and audio-visual-dynamic systems and provide some comparisons. Table 2 shows an overview of the various AV person recognition systems found in the literature. Some of those systems are discussed in more detail in the remainder of this section.

Luettin et al. [135] developed a visual-only speaker identification system by utilizing only the dynamic visual information present in video recordings of the mouth area. They utilized the Tulips1 database [134], consisting of recordings of 12 speakers uttering the first four English digits, extracted shape- and appearance-based visual features, and performed both text-dependent and text-independent experiments. Their person identification system, based on HMMs, achieved 72.9%, 89.6%, and 91.7% recognition rates when shape-based, appearance-based, and joint (concatenation fusion) visual features were utilized, respectively, in text-dependent experiments. In text-independent experiments, their system achieved 83.3%, 95.8%, and 97.9% recognition rates when shape-based, appearance-based, and joint (concatenation) visual features were utilized, respectively. In summary, they achieved better results with appearance-based than with shape-based visual features, and the identification performance improved further when joint features were utilized.

1) Audio-Visual-Static Biometric Systems: Chibelushi et al. [5] developed an AV biometric system that utilizes acoustic information and static visual information contained in face profiles. They utilized an AV database that consists of audio recordings and face images of ten speakers [5]. The images were taken at different head orientations, image scales, and subject positions. They combined acoustic and visual information utilizing weighted summation fusion. Their system achieved an EER of 3.4%, 3.0%, and 1.5% when only speech information, only visual information, or both acoustic and visual information were used, respectively.

Brunelli and Falavigna [6] developed a text-independent speaker identification system that combines audio-only speaker identification and face recognition. They utilized an AV database that consists of audio recordings and face images of 89 speakers collected in three sessions [6]. The system employs five classifiers, two acoustic and three visual. The two acoustic classifiers correspond to two sets of acoustic features (static and dynamic) derived from short-time spectral analysis of the speech signal. Their audio-only speaker identification system is based on vector quantization (VQ). The three visual classifiers correspond to visual features extracted from three regions of the face, i.e., the eyes, nose, and mouth. The individually obtained classification scores are combined using the weighted product approach. The identification rate of the integrated system is 98%, compared to the 88% and 91% rates obtained by the audio-only speaker recognition and face recognition systems, respectively.

Ben-Yacoub et al. [7] developed both text-dependent and text-independent AV speaker verification systems by utilizing acoustic information and frontal-face visual information from the XM2VTS database. They utilized elastic graph matching in order to obtain face matching scores. They investigated several binary classifiers for postclassifier opinion fusion, namely, SVM, Bayesian classifier, Fisher's linear discriminant, decision tree, and multilayer perceptron (MLP). They obtained the best results utilizing the SVM and Bayesian classifiers, which also outperformed the single modalities.

Sanderson and Paliwal [8] utilized speech and face information to perform text-independent identity verification. They extracted appearance-based visual features by performing PCA on the face image window containing the eyes and the nose. The acoustic features consisted of MFCCs and their corresponding deltas and maximum autocorrelation values, which capture pitch and voicing information. A voice activity detector (VAD) was used to remove the feature vectors representing silence or background noise, while a GMM classifier was used as a modality (speech or face) expert to obtain opinions from the features. They performed an elaborate analysis and evaluation of several nonadaptive and adaptive approaches to information fusion and compared them, in noisy and clean audio conditions, with respect to overall verification performance on the VidTIMIT database. The fusion methods they analyzed include weighted summation, Bayesian classifier, SVM, concatenation, adaptive weighted summation, a proposed piecewise-linear postclassifier, and a modified Bayesian postclassifier. The utilized fusion methods take into account how the distributions of opinions are likely to change under noisy conditions, without making a direct assumption about the type of noise present in the testing features. The verification results obtained for various SNRs in the presence of operations-room noise are shown in Fig. 8. The operations-room noise contains background speech as well as machinery sounds. The results are reported in terms of total error (TE), defined as TE = FAR + FRR. They concluded that the performance of most of the nonadaptive fusion systems was similar and that it degraded in noisy conditions.

Hazen et al. [9] developed a text-dependent speaker authentication system that utilizes lower quality audio and visual signals obtained with a handheld device. They detected 14 face components and used ten of them, after normalization, as visual features for a face recognition algorithm that utilizes SVMs. They achieved a 90% reduction in speaker verification EER when fusing face and speaker identification information.

Fig. 8. AV person recognition performance obtained for various SNRs in the presence of operations-room noise, utilizing several nonadaptive and adaptive AV fusion methods described in [8], on the VidTIMIT database [26].

2) Audio-Visual-Dynamic Biometric Systems: Jourlin et al. [10] developed a text-dependent AV speaker verification system that utilizes both acoustic and visual dynamic information and tested it on the M2VTS database. Their 39-dimensional acoustic features consist of LPC coefficients and their first- and second-order derivatives. They use 14 lip shape parameters, ten intensity parameters, and the scale as visual features, resulting in a 25-dimensional visual feature vector. They utilize HMMs to perform audio-only, visual-only, and AV experiments. The AV score is computed as a weighted sum of the audio and visual scores. Their results demonstrate a reduction of the FAR from 2.3% when the audio-only system is used to 0.5% when the multimodal system is used.

Wark et al. [11]–[13] employed multi-stream HMMs to develop text-independent AV speaker verification and identification systems tested on the M2VTS database. They utilized MFCCs as acoustic features and lip contour information, obtained after applying PCA and LDA, as visual features. They trained the system in clean conditions and tested it in degraded acoustic conditions. At low SNRs, the AV system achieved a significant performance improvement over the audio-only system and also outperformed the visual-only system, while at high SNRs its performance was similar to that of the audio-only system.

Table 3  Speaker Recognition Performance Obtained for Various SNRs Utilizing Audio-Only and AV Systems in [14], Tested on the AMP/CMU Database [133]

We have developed an AV speaker recognition system with the AMP/CMU database [133], utilizing 13 MFCC coefficients and their first- and second-order derivatives as acoustic features [14]. A visual shape-based feature vector consisting of ten FAPs, which describe the movement of the outer-lip contour [82], extracted using the systems previously discussed in Section IV, was projected by means of PCA onto a three-dimensional space (see Fig. 6). The resulting visual features were augmented with first- and second-order derivatives, providing nine-dimensional dynamic visual feature vectors. We used a feature fusion integration approach and single-stream HMMs to integrate the dynamic acoustic and visual information. Speaker verification and identification experiments were performed using audio-only and AV information, under both clean and noisy audio conditions at SNRs ranging from 0 to 30 dB. The results obtained for both the speaker identification and verification experiments, expressed in terms of the identification error and the EER, are shown in Table 3. Significant improvement in performance over the audio-only (AU) speaker recognition system was achieved, especially under noisy acoustic conditions. For instance, the identification error was reduced from 53.1%, when audio-only information was utilized, to 12.82%, when AV information was employed, at 0-dB SNR.
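The feature-fusion step used in this kind of experiment (per-frame acoustic and visual vectors, each augmented with first- and second-order derivatives and then concatenated into a single observation stream) can be sketched as follows; the simple gradient-based derivative estimate, the array names, and the frame count are illustrative assumptions rather than the exact processing in [14].

```python
import numpy as np


def add_deltas(features):
    """Append first- and second-order time derivatives (numerical gradients
    along the frame axis) to a (frames x dims) feature matrix."""
    d1 = np.gradient(features, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([features, d1, d2])


# Assumed shapes: 13 MFCCs and 3 PCA-projected FAP coefficients per frame.
audio = np.random.randn(100, 13)        # -> 39-dim after deltas
visual = np.random.randn(100, 3)        # -> 9-dim after deltas
observations = np.hstack([add_deltas(audio), add_deltas(visual)])  # 48-dim
```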

Chaudhari et al. [15] developed an AV speaker identification and verification system which modeled the reliability of the audio and video information streams with time-varying and context-dependent parameters. The acoustic features consisted of 23 MFCC coefficients, while the visual features consisted of 24 DCT coefficients from the transformed ROI. They utilized GMMs to model speakers, and parameters that depended on time, modality, and speaker to model stream reliability. The system was tested on the IBM database [15], achieving an EER of 1.04%, compared to 1.71%, 1.51%, and 1.22% for the audio-only, video-only, and AV (feature fusion) systems, respectively.

Dieckmann et al. [16] developed a system which used features obtained from all three modalities: face, voice, and lip movement. Their fusion scheme utilized majority voting and opinion fusion: two of the three experts had to agree on the opinion, and the combined opinion had to exceed a predefined threshold. The identification error decreased to 7% when all three modalities were used, compared to 10.4%, 11%, and 18.7% when voice, lip movement, and face features were used individually.

VIII. DISCUSSION/CONCLUSION
In this paper, we addressed how the joint processing of audio and visual signals, both generated by a talking person, can provide valuable information to benefit AV speaker recognition applications. We first concentrated on the analysis of visual signals and described various ways of representing and extracting the information available in them that is unique to each speaker. We then discussed how the visual features can complement features extracted (by well-studied methods) from the acoustic signal, and how the two-modality representations can be fused together to allow joint AV processing. We reviewed several speaker identification and/or authentication systems that have appeared in the literature and presented some experimental results. These results demonstrated the importance of utilizing visual information for speaker recognition, especially in the presence of acoustic noise.

The field of joint AV person recognition is new and active, with many accomplishments (as described in this paper) and exciting opportunities for further research and development. Some of these open issues and opportunities are the following.

There is a need for additional resources for advancing and assessing the performance of AV speaker recognition systems. Publicly available multimodal corpora that better reflect realistic conditions, such as acoustic noise and lighting changes, would help in investigating the robustness of AV systems. They can serve as a reference point for development, as well as for evaluation and comparison of various systems. Baseline algorithms and systems could also be agreed upon and made available in order to facilitate separate investigation of the effects that various factors, such as the choice of acoustic and visual features, the information fusion approach, or the classification algorithms, have on system performance.

In comparing and evaluating speaker recognition systems, the statistical significance of the results needs to be determined [136]. It is not adequate to simply report that one system achieved a lower error rate than another (and is therefore better) using the same experimental setup. The mean and the variance of a particular error measure can assist in determining the relative performance of systems. In addition, standard experiment protocols and evaluation procedures should be defined in order to enable fair comparison of different systems. Experiment protocols could include a number of different configurations in which the available subjects are randomly divided into enrolled subjects and impostors, therefore providing performance measures for each of the configurations. These measures can be used to determine the statistical significance of the results.

In many cases, the performance of person identification systems is reported for a closed-set system. The underlying assumption is that the same performance will carry over to an open-set system. This, however, may well not hold true; therefore, a model for an unknown person should be used.

The design of a truly high-performing visual feature representation system with improved robustness to the visual environment, possibly employing 2.5-D or 3-D face information [97]–[99], needs to be further investigated. Finally, the development of improved AV integration algorithms that allow unconstrained modeling of AV asynchrony and robust, localized estimation of the reliability of the signal information content (due, for example, to occlusion, illumination change, or pose) is also needed.

Concerning the practical deployment of AV biometric technology, robustness represents the grand challenge. There are few applications where the environment and the data acquisition mechanism can be carefully controlled. Widespread use of the technology will require the ability to handle variability in the environment and in data acquisition devices, as well as degradations due to data and channel encoding and spoofing. Overall, the technology is quite robust to spoofing (when audio and dynamic visual features are used), although it may become possible in the future to synthesize a video of a person talking and replay it to the biometric system, for example on a laptop and even on the fly, in order to defeat prompted-phrase systems.

The technology is readily available, however, for most day-to-day applications that have the following characteristics [137]:
1) low security and highly user friendly, e.g., access to a desktop using biometric log-in;
2) high security, but the user can tolerate the inconvenience of being falsely rejected, e.g., access to military property;
3) low security, where convenience is the prime factor (more than any other factor, such as cost), e.g., time-stamping in a factory setting.
The technology is not yet available for highly secure and highly user-friendly applications (such as banking). Further research and development is therefore required for AV biometric systems to become widespread in practice.

REFERENCES

[1] A. K. Jain, A. Ross, and S. Prabhakar, "An introduction to biometric recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 4–20, Jan. 2004.
[2] N. K. Ratha, A. W. Senior, and R. M. Bolle, "Automated biometrics," in Proc. Int. Conf. Advances Pattern Recognition, Rio de Janeiro, Brazil, 2001, pp. 445–474.
[3] Financial Crimes Report to the Public, Fed. Bur. Investigation, Financial Crimes Section, Criminal Investigation Division. [Online]. Available: http://www.fbi.gov/publications/financial/fcs_report052005/fcs_report052005.htm
[4] A. K. Jain and U. Uludag, "Hiding biometric data," IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 11, pp. 1494–1498, Nov. 2003.
[5] C. C. Chibelushi, F. Deravi, and J. S. Mason, "Voice and facial image integration for speaker recognition," in Proc. IEEE Int. Symp. Multimedia Technologies Future Appl., Southampton, U.K., 1993.
[6] R. Brunelli and D. Falavigna, "Person identification using multiple cues," IEEE Trans. Pattern Anal. Machine Intell., vol. 17, no. 10, pp. 955–965, Oct. 1995.
[7] S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz, "Fusion of face and speech data for person identity verification," IEEE Trans. Neural Networks, vol. 10, pp. 1065–1074, 1999.
[8] C. Sanderson and K. K. Paliwal, "Identity verification using speech and face information," Digital Signal Processing, vol. 14, no. 5, pp. 449–480, 2004.
[9] T. J. Hazen, E. Weinstein, R. Kabir, A. Park, and B. Heisele, "Multi-modal face and speaker identification on a handheld device," in Proc. Works. Multimodal User Authentication, Santa Barbara, CA, 2003, pp. 113–120.
[10] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, "Integrating acoustic and labial information for speaker identification and verification," in Proc. 5th Eur. Conf. Speech Communication Technology, Rhodes, Greece, 1997, pp. 1603–1606.
[11] T. Wark, S. Sridharan, and V. Chandran, "Robust speaker verification via fusion of speech and lip modalities," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Phoenix, AZ, 1999, pp. 3061–3064.
[12] T. Wark, S. Sridharan, and V. Chandran, "Robust speaker verification via asynchronous fusion of speech and lip information," in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, Washington, DC, 1999, pp. 37–42.
[13] T. Wark, S. Sridharan, and V. Chandran, "The use of temporal speech and lip information for multi-modal speaker identification via multi-stream HMMs," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Istanbul, Turkey, 2000, pp. 2389–2392.
[14] P. S. Aleksic and A. K. Katsaggelos, "An audio-visual person identification and verification system using FAPs as visual features," in Proc. Works. Multimedia User Authentication, Santa Barbara, CA, 2003, pp. 80–84.
[15] U. V. Chaudhari, G. N. Ramaswamy, G. Potamianos, and C. Neti, "Information fusion and decision cascading for audio-visual speaker recognition based on time-varying stream reliability prediction," in Proc. Int. Conf. Multimedia Expo, Baltimore, MD, Jul. 6–9, 2003, pp. 9–12.
[16] U. Dieckmann, P. Plankensteiner, and T. Wagner, "SESAM: A biometric person identification system using sensor fusion," Pattern Recogn. Lett., vol. 18, pp. 827–833, 1997.
[17] J. P. Campbell, "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997.


[18] W.-Y. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys, pp. 399–458, Dec. 2003.
[19] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cognitive Neuroscience, vol. 3, no. 1, pp. 586–591, Sep. 1991.
[20] M. Kirby and L. Sirovich, "Application of the Karhunen-Loeve procedure for the characterization of human faces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 1, pp. 103–108, Jan. 1990.
[21] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces versus fisherfaces: Recognition using class specific linear projection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, pp. 711–720, 1997.
[22] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399–458, 2003.
[23] J. Luettin, "Visual speech and speaker recognition," Ph.D. dissertation, Dept. Computer Science, Univ. Sheffield, Sheffield, U.K., 1997.
[24] C. C. Chibelushi, F. Deravi, and J. S. Mason, "Audio-visual person recognition: An evaluation of data fusion strategies," in Proc. Eur. Conf. Security Detection, London, U.K., 1997, pp. 26–30.
[25] R. Brunelli, D. Falavigna, T. Poggio, and L. Stringa, "Automatic person recognition using acoustic and geometric features," Machine Vision Appl., vol. 8, pp. 317–325, 1995.
[26] C. Sanderson and K. K. Paliwal, "Noise compensation in a person verification system using face and multiple speech features," Pattern Recognition, vol. 36, no. 2, pp. 293–302, Feb. 2003.
[27] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, "Acoustic-labial speaker verification," Pattern Recogn. Lett., vol. 18, pp. 853–858, 1997.
[28] U. V. Chaudhari, G. N. Ramaswamy, G. Potamianos, and C. Neti, "Audio-visual speaker recognition using time-varying stream reliability prediction," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Hong Kong, China, 2003, pp. V-712–V-715.
[29] S. Bengio, "Multimodal authentication using asynchronous HMMs," in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 770–777.
[30] A. V. Nefian, L. H. Liang, T. Fu, and X. X. Liu, "A Bayesian approach to audio-visual speaker identification," in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 761–769.
[31] T. Fu, X. X. Liu, L. H. Liang, X. Pi, and A. V. Nefian, "Audio-visual speaker identification using coupled hidden Markov models," in Proc. Int. Conf. Image Processing, Barcelona, Spain, 2003, pp. 29–32.
[32] S. Bengio, "Multimodal authentication using asynchronous HMMs," in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 770–777.
[33] S. Bengio, "Multimodal speech processing using asynchronous hidden Markov models," Information Fusion, vol. 5, pp. 81–89, 2004.
[34] N. A. Fox, R. Gross, P. de Chazal, J. F. Cohn, and R. B. Reilly, "Person identification using automatic integration of speech, lip, and face experts," in Proc. ACM SIGMM 2003 Multimedia Biometrics Methods and Applications Workshop (WBMA'03), Berkeley, CA, 2003, pp. 25–32.
[35] N. A. Fox and R. B. Reilly, "Audio-visual speaker identification based on the use of dynamic audio and visual features," in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 743–751.
[36] Y. Abdeljaoued, "Fusion of person authentication probabilities by Bayesian statistics," in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, Washington, DC, 1999, pp. 172–175.
[37] Y. Yemez, A. Kanak, E. Erzin, and A. M. Tekalp, "Multimodal speaker identification with audio-video processing," in Proc. Int. Conf. Image Processing, Barcelona, Spain, 2003, pp. 5–8.
[38] A. Kanak, E. Erzin, Y. Yemez, and A. M. Tekalp, "Joint audio-video processing for biometric speaker identification," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Hong Kong, China, 2003, pp. 561–564.
[39] E. Erzin, Y. Yemez, and A. M. Tekalp, "Multimodal speaker identification using an adaptive classifier cascade based on modality reliability," IEEE Trans. Multimedia, vol. 7, no. 5, pp. 840–852, Oct. 2005.
[40] M. E. Sargin, E. Erzin, Y. Yemez, and A. M. Tekalp, "Multimodal speaker identification using canonical correlation analysis," in Proc. IEEE Int. Conf. Acoustics, Speech Signal Processing, Toulouse, France, May 2006, pp. 613–616.
[41] J. Kittler, J. Matas, K. Johnsson, and M. U. Ramos-Sanchez, "Combining evidence in personal identity verification systems," Pattern Recogn. Lett., vol. 18, pp. 845–852, 1997.
[42] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 226–239, 1998.
[43] J. Kittler and K. Messer, "Fusion of multiple experts in multimodal biometric personal identity verification systems," in Proc. 12th IEEE Workshop Neural Networks Signal Processing, Switzerland, 2002, pp. 3–12.
[44] E. S. Bigun, J. Bigun, B. Duc, and S. Fisher, "Expert conciliation for multi modal person authentication systems by Bayesian statistics," in Proc. 1st Int. Conf. Audio- and Video-Based Biometric Person Authentication, Crans-Montana, Switzerland, Mar. 1997, pp. 291–300.
[45] R. W. Frischholz and U. Dieckmann, "BioID: A multimodal biometric identification system," Computer, vol. 33, pp. 64–68, 2000.
[46] S. Basu, H. S. M. Beigi, S. H. Maes, M. Ghislain, E. Benoit, C. Neti, and A. W. Senior, "Methods and apparatus for audio-visual speaker recognition and utterance verification," U.S. Patent 6 219 640, 1999.
[47] L. Hong and A. Jain, "Integrating faces and fingerprints for personal identification," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 1295–1307, 1998.
[48] V. Radova and J. Psutka, "An approach to speaker identification using multiple classifiers," in Proc. IEEE Conf. Acoustics, Speech Signal Processing, Munich, Germany, 1997, vol. 2, pp. 1135–1138.
[49] A. Ross and A. Jain, "Information fusion in biometrics," Pattern Recogn. Lett., vol. 24, pp. 2115–2125, 2003.
[50] V. Chatzis, A. G. Bors, and I. Pitas, "Multimodal decision-level fusion for person authentication," IEEE Trans. Systems, Man, Cybernetics, Part A: Syst. Humans, vol. 29, no. 5, pp. 674–680, Nov. 1999.
[51] N. Fox, R. Gross, J. Cohn, and R. B. Reilly, "Robust automatic human identification using face, mouth, and acoustic information," in Proc. Int. Workshop Analysis Modeling of Faces and Gestures, Beijing, China, Oct. 2005, pp. 263–277.
[52] F. J. Huang and T. Chen, "Consideration of Lombard effect for speechreading," in Proc. Works. Multimedia Signal Processing, 2001, pp. 613–618.
[53] J. D. Woodward, "Biometrics: Privacy's foe or privacy's friend?" Proc. IEEE, vol. 85, pp. 1480–1492, 1997.
[54] R. P. Lippmann, "Speech recognition by machines and humans," Speech Commun., vol. 22, no. 1, pp. 1–15, 1997.
[55] H. Yehia, P. Rubin, and E. Vatikiotis-Bateson, "Quantitative association of vocal-tract and facial behavior," Speech Commun., vol. 26, no. 1–2, pp. 23–43, 1998.
[56] J. Jiang, A. Alwan, P. A. Keating, E. T. Auer, Jr., and L. E. Bernstein, "On the relationship between face movements, tongue movements, and speech acoustics," EURASIP J. Appl. Signal Processing, vol. 2002, no. 11, pp. 1174–1188, Nov. 2002.
[57] J. P. Barker and F. Berthommier, "Estimation of speech acoustics from visual speech features: A comparison of linear and non-linear models," in Proc. Int. Conf. Auditory Visual Speech Processing, Santa Cruz, CA, 1999, pp. 112–117.
[58] H. C. Yehia, T. Kuratate, and E. Vatikiotis-Bateson, "Using speech acoustics to drive facial motion," in Proc. 14th Int. Congr. Phonetic Sciences, San Francisco, CA, 1999, pp. 631–634.
[59] A. V. Barbosa and H. C. Yehia, "Measuring the relation between speech acoustics and 2-D facial motion," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Salt Lake City, UT, 2001, vol. 1, pp. 181–184.
[60] P. S. Aleksic and A. K. Katsaggelos, "Speech-to-video synthesis using MPEG-4 compliant visual features," IEEE Trans. CSVT, Special Issue Audio Video Analysis for Multimedia Interactive Services, pp. 682–692, May 2004.
[61] A. Q. Summerfield, "Some preliminaries to a comprehensive account of audio-visual speech perception," in Hearing by Eye: The Psychology of Lip-Reading, R. Campbell and B. Dodd, Eds. London, U.K.: Lawrence Erlbaum, 1987, pp. 3–51.
[62] D. W. Massaro and D. G. Stork, "Speech recognition and sensory integration," Amer. Scientist, vol. 86, no. 3, pp. 236–244, 1998.
[63] J. J. Williams and A. K. Katsaggelos, "An HMM-based speech-to-video synthesizer," IEEE Trans. Neural Networks, Special Issue Intelligent Multimedia, vol. 13, no. 4, pp. 900–915, Jul. 2002.
[64] Q. Summerfield, "Use of visual information in phonetic perception," Phonetica, vol. 36, pp. 314–331, 1979.
[65] Q. Summerfield, "Lipreading and audio-visual speech perception," Phil. Trans. R. Soc. Lond. B, vol. 335, pp. 71–78, 1992.


[66] K. W. Grant and L. D. Braida, "Evaluating the articulation index for auditory-visual input," J. Acoustical Soc. Amer., vol. 89, pp. 2950–2960, Jun. 1991.
[67] G. Fant, Acoustic Theory of Speech Production. 's-Gravenhage, The Netherlands: Mouton, 1960.
[68] J. L. Flanagan, Speech Analysis Synthesis and Perception. Berlin, Germany: Springer-Verlag, 1965.
[69] S. Narayanan and A. Alwan, "Articulatory-acoustic models for fricative consonants," IEEE Trans. Speech Audio Processing, vol. 8, no. 3, pp. 328–344, Jun. 2000.
[70] J. Schroeter and M. Sondhi, "Techniques for estimating vocal-tract shapes from the speech signal," IEEE Trans. Speech Audio Processing, vol. 2, no. 1, pp. 133–150, Feb. 1994.
[71] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, pp. 746–748, 1976.
[72] T. Chen and R. R. Rao, "Audio-visual integration in multimodal communication," Proc. IEEE, vol. 86, no. 5, pp. 837–852, May 1998.
[73] S. Oviatt, P. Cohen, L. Wu, J. Vergo, L. Duncan, B. Suhm, J. Bers, T. Holzman, T. Winograd, J. Landay, J. Larson, and D. Ferro, "Designing the user interface for multimodal speech and pen-based gesture applications: State-of-the-art systems and research directions," Human-Computer Interaction, vol. 15, no. 4, pp. 263–322, Aug. 2000.
[74] J. Schroeter, J. Ostermann, H. P. Graf, M. Beutnagel, E. Cosatto, A. Syrdal, A. Conkie, and Y. Stylianou, "Multimodal speech synthesis," in Proc. Int. Conf. Multimedia Expo, New York, 2000, pp. 571–574.
[75] C. C. Chibelushi, F. Deravi, and J. S. D. Mason, "A review of speech-based bimodal recognition," IEEE Trans. Multimedia, vol. 4, no. 1, pp. 23–37, Mar. 2002.
[76] D. G. Stork and M. E. Hennecke, Eds., Speechreading by Humans and Machines. Berlin, Germany: Springer, 1996.
[77] P. S. Aleksic, G. Potamianos, and A. K. Katsaggelos, "Exploiting visual information in automatic speech processing," in Handbook of Image and Video Processing, A. Bovik, Ed. New York: Academic, Jun. 2005, pp. 1263–1289.
[78] E. Petajan, "Automatic lipreading to enhance speech recognition," Ph.D. dissertation, Univ. Illinois at Urbana-Champaign, Urbana, IL, 1984.
[79] S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition," IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141–151, Sep. 2000.
[80] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, "Recent advances in the automatic recognition of audiovisual speech," Proc. IEEE, vol. 91, no. 9, pp. 1306–1326, Sep. 2003.
[81] G. Potamianos, C. Neti, J. Luettin, and I. Matthews, "Audio-visual automatic speech recognition: An overview," in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Eds. Cambridge, MA: MIT Press, 2004.
[82] P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos, "Audio-visual speech recognition using MPEG-4 compliant visual features," EURASIP J. Appl. Signal Processing, vol. 2002, no. 11, pp. 1213–1227, Nov. 2002.
[83] T. Chen, "Audiovisual speech processing. Lip reading and lip synchronization," IEEE Signal Processing Mag., vol. 18, no. 1, pp. 9–21, Jan. 2001.
[84] J. R. Deller, Jr., J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals. Englewood Cliffs, NJ: Macmillan, 1993.
[85] R. Campbell, B. Dodd, and D. Burnham, Eds., Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory Visual Speech. Hove, U.K.: Psychology Press, 1998.
[86] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book. London, U.K.: Entropic, 2005.
[87] A. J. Goldschen, O. N. Garcia, and E. D. Petajan, "Rationale for phoneme-viseme mapping and feature selection in visual speech recognition," in Speechreading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds. Berlin, Germany: Springer, 1996, pp. 505–515.
[88] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, 1993.
[89] D.-S. Kim, S.-Y. Lee, and R. M. Kil, "Auditory processing of speech signals for robust speech recognition in real-world noisy environments," IEEE Trans. Speech Audio Processing, vol. 7, no. 1, pp. 55–69, Jan. 1999.
[90] K. K. Paliwal, "Spectral subband centroids features for speech recognition," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Seattle, WA, 1998, vol. 2, pp. 617–620.
[91] M. Akbacak and J. H. L. Hansen, "Environmental sniffing: Noise knowledge estimation for robust speech systems," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Hong Kong, China, 2003, vol. 2, pp. 113–116.
[92] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 1, pp. 23–38, Jan. 1998.
[93] A. W. Senior, "Face and feature finding for a face recognition system," in Proc. Int. Conf. Audio Video-Based Biometric Person Authentication, Washington, DC, 1999, pp. 154–159.
[94] K. Sung and T. Poggio, "Example-based learning for view-based human face detection," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 1, pp. 39–51, 1998.
[95] E. Hjelmas and B. K. Low, "Face detection: A survey," Computer Vision and Image Understanding, vol. 83, no. 3, pp. 236–274, Sep. 2001.
[96] M.-H. Yang, D. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 1, pp. 34–58, Jan. 2002.
[97] V. Blanz, P. Grother, P. J. Phillips, and T. Vetter, "Face recognition based on frontal views generated from non-frontal images," in Proc. Computer Vision Pattern Recognition, 2005, pp. 454–461.
[98] C. Sanderson, S. Bengio, and Y. Gao, "On transforming statistical models for non-frontal face verification," Pattern Recognition, vol. 39, no. 2, pp. 288–302, 2006.
[99] K. W. Bowyer, K. Chang, and P. Flynn, "A survey of approaches and challenges in 3-D and multi-modal 3-D face recognition," Computer Vision Image Understanding, vol. 101, no. 1, pp. 1–15, 2006.
[100] S. G. Kong, J. Heo, B. R. Abidi, J. Paik, and M. A. Abidi, "Recent advances in visual and infrared face recognition: A review," Computer Vision Image Understanding, vol. 97, no. 1, pp. 103–135, 2005.
[101] M. E. Hennecke, D. G. Stork, and K. V. Prasad, "Visionary speech: Looking ahead to practical speechreading systems," in Speechreading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds. Berlin, Germany: Springer, 1996, pp. 331–349.
[102] P. S. Aleksic and A. K. Katsaggelos, "Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Montreal, Canada, 2004, pp. 917–920.
[103] I. Matthews, G. Potamianos, C. Neti, and J. Luettin, "A comparison of model and transform-based visual features for audio-visual LVCSR," in Proc. Int. Conf. Multimedia Expo, 2001, pp. 22–25.
[104] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network based face detection," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 1, pp. 23–38, Jan. 1998.
[105] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. Conf. Computer Vision Pattern Recognition, Kauai, HI, Dec. 11–13, 2001, pp. 511–518.
[106] H. P. Graf, E. Cosatto, and G. Potamianos, "Robust recognition of faces and facial features with a multi-modal system," in Proc. Int. Conf. Systems, Man, Cybernetics, Orlando, FL, 1997, pp. 2034–2039.
[107] M. T. Chan, Y. Zhang, and T. S. Huang, "Real-time lip tracking and bimodal continuous speech recognition," in Proc. Workshop Multimedia Signal Processing, Redondo Beach, CA, 1998, pp. 65–70.
[108] G. Chetty and M. Wagner, "'Liveness' verification in audio-video authentication," in Proc. Int. Conf. Spoken Language Processing, Jeju Island, Korea, 2004, pp. 2509–2512.
[109] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," Int. J. Computer Vision, vol. 4, no. 4, pp. 321–331, 1988.
[110] A. L. Yuille, P. W. Hallinan, and D. S. Cohen, "Feature extraction from faces using deformable templates," Int. J. Computer Vision, vol. 8, no. 2, pp. 99–111, 1992.
[111] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Proc. Eur. Conf. Computer Vision, Freiburg, Germany, 1998, pp. 484–498.
[112] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Hoboken, NJ: Wiley, 2001.
[113] B. Maison, C. Neti, and A. Senior, "Audio-visual speaker recognition for broadcast news: Some fusion techniques," in Proc. Works. Multimedia Signal Processing, Copenhagen, Denmark, 1999, pp. 161–167.
[114] P. Duchnowski, U. Meier, and A. Waibel, "See me, hear me: Integrating automatic speech recognition and lip-reading," in Proc. Int. Conf. Spoken Lang. Processing, Yokohama, Japan, Sep. 18–22, 1994, pp. 547–550.
[115] G. Potamianos, H. P. Graf, and E. Cosatto, "An image transform approach for HMM based automatic lipreading," in Proc. Int. Conf. Image Processing, Chicago, IL, Oct. 4–7, 1998, vol. 1, pp. 173–177.

[116] P. S. Aleksic and A. K. Katsaggelos, "Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance," in Proc. Int. Conf. Image Processing, Italy, Sep. 2005, vol. 5, pp. 501–504.
[117] X. Zhang, C. C. Broun, R. M. Mersereau, and M. Clements, "Automatic speechreading with applications to human-computer interfaces," EURASIP J. Appl. Signal Processing, vol. 2002, no. 11, pp. 1228–1247, 2002.
[118] M. Gordan, C. Kotropoulos, and I. Pitas, "A support vector machine-based dynamic network for visual speech recognition applications," EURASIP J. Appl. Signal Processing, vol. 2002, no. 11, pp. 1248–1259, 2002.
[119] F. Cardinaux, C. Sanderson, and S. Bengio, "User authentication via adapted statistical models of face images," IEEE Trans. Signal Processing, vol. 54, no. 1, pp. 361–373, Jan. 2006.
[120] G. R. Doddington, M. A. Przybycki, A. F. Martin, and D. A. Reynolds, "The NIST speaker recognition evaluation: Overview, methodology, systems, results, perspective," Speech Commun., vol. 31, no. 2–3, pp. 225–254, 2000.
[121] S. Bengio, J. Mariethoz, and M. Keller, "The expected performance curve," in Int. Conf. Machine Learning, Workshop ROC Analysis Machine Learning, Bonn, Germany, 2005.
[122] S. Bengio and J. Mariethoz, "The expected performance curve: A new assessment measure for person authentication," in Proc. Speaker Language Recognition Works. (Odyssey), Toledo, 2004, pp. 279–284.
[123] D. L. Hall and J. Llinas, "Multisensor data fusion," in Handbook of Multisensor Data Fusion, D. L. Hall and J. Llinas, Eds. Boca Raton, FL: CRC, 2001, pp. 1–10.
[124] T. K. Ho, J. J. Hull, and S. N. Srihari, "Decision combination in multiple classifier systems," IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 66–75, 1994.
[125] R. C. Luo and M. G. Kay, "Introduction," in Multisensor Integration and Fusion for Intelligent Machines and Systems, R. C. Luo and M. G. Kay, Eds. Norwood, NJ: Ablex, 1995, pp. 1–26.
[126] S. Pigeon and L. Vandendorpe, "The M2VTS multimodal face database (release 1.00)," in Proc. 1st Int. Conf. Audio- and Video-Based Biometric Person Authentication, Crans-Montana, Switzerland, 1997, pp. 403–409.
[127] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: The extended M2VTS database," in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, Washington, DC, 1999, pp. 72–77.
[128] E. Bailly-Bailliere, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariethoz, J. Matas, K. Messer, V. Popovici, F. Poree, B. Ruiz, and J.-P. Thiran, "The BANCA database and evaluation protocol," in Proc. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 625–638.
[129] C. C. Chibelushi, F. Deravi, and J. S. Mason, "DAVID database," Internal Rep., Speech and Image Processing Research Group, Dept. of Electrical and Electronic Engineering, Univ. Wales Swansea, 1996.
[130] N. Fox, B. O'Mullane, and R. B. Reilly, "The realistic multi-modal VALID database and visual speaker identification comparison experiments," in Lecture Notes in Computer Science, T. Kanade, A. K. Jain, and N. K. Ratha, Eds. New York: Springer-Verlag, 2005, vol. 3546, p. 777.
[131] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and T. Huang, "AVICAR: Audio-visual speech corpus in a car environment," in Proc. Conf. Spoken Language, Jeju, Korea, 2004.
[132] E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: A new audio-visual database for multimodal human-computer interface research," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Orlando, FL, 2002.
[133] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Mag., vol. 18, pp. 9–21, Jan. 2001.
[134] J. R. Movellan, "Visual speech recognition with stochastic networks," in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds. Cambridge, MA: MIT Press, 1995, vol. 7.
[135] J. Luettin, N. Thacker, and S. Beet, "Speaker identification by lipreading," in Proc. Int. Conf. Speech and Language Processing, Philadelphia, PA, 1996, pp. 62–64.
[136] S. Bengio and J. Mariethoz, "A statistical significance test for person authentication," in Proc. Speaker and Language Recognition Workshop (Odyssey), Toledo, 2004, pp. 237–244.
[137] N. Poh and J. Korczak, "Biometric authentication in the e-World," in Automated Authentication Using Hybrid Biometric System, D. Zhang, Ed. Boston, MA: Kluwer, 2003, ch. 16.

ABOUT THE AUTHORS

Petar S. Aleksic received the B.S. degree in electrical engineering from the University of Belgrade, Serbia, in 1999, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University, Evanston, IL, in 2001 and 2004, respectively.

He has been a member of the Image and Video Processing Lab at Northwestern University since 1999, where he is currently a Postdoctoral Fellow. He has published more than ten articles in the area of audio-visual signal processing, pattern recognition, and computer vision. His primary research interests include visual feature extraction and analysis, audio-visual speech recognition, audio-visual biometrics, multimedia communications, computer vision, pattern recognition, and multimedia data mining.

Aggelos K. Katsaggelos received the Diploma degree in electrical and mechanical engineering from the Aristotelian University, Thessaloniki, Greece, in 1979, and the M.S. and Ph.D. degrees from the Georgia Institute of Technology, Atlanta, in 1981 and 1985, respectively, both in electrical engineering.

He is currently a Professor of electrical engineering and computer science at Northwestern University, Evanston, IL, and also the Director of the Motorola Center for Seamless Communications and a member of the academic affiliate staff, Department of Medicine, Evanston Hospital. He is the Editor of Digital Image Restoration (New York: Springer, 1991), coauthor of Rate-Distortion Based Video Compression (Norwell, MA: Kluwer, 1997), and coeditor of Recovery Techniques for Image and Video Compression and Transmission (Norwell, MA: Kluwer, 1998). He is also a coinventor of ten international patents.

Dr. Katsaggelos is a member of the Publication Board of the IEEE PROCEEDINGS and has served as the Editor-in-Chief of the IEEE Signal Processing Magazine (1997–2002). He has been a recipient of the IEEE Third Millennium Medal (2000), the IEEE Signal Processing Society Meritorious Service Award (2001), and an IEEE Signal Processing Society Best Paper Award (2001).
