IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 14, NO. 2, FEBRUARY 2005

Face Localization and Authentication Using Color and Depth Images

Filareti Tsalakanidou, Sotiris Malassiotis, and Michael G. Strintzis, Fellow, IEEE

Abstract—This paper presents a complete face authentication system integrating both two-dimensional (color or intensity) and three-dimensional (3-D) range data, based on a low-cost 3-D sensor capable of real-time acquisition of 3-D and color images. Novel algorithms are proposed that exploit depth information to achieve robust face detection and localization under conditions of background clutter, occlusion, face pose alteration, and harsh illumination. The well-known embedded hidden Markov model technique for face authentication is applied to depth maps and color images. To cope with pose and illumination variations, the enrichment of face databases with synthetically generated views is proposed. The performance of the proposed authentication scheme is tested thoroughly on two distinct face databases of significant size. Experimental results demonstrate significant gains resulting from the combined use of depth and color or intensity information.

Index Terms—Depth maps, embedded hidden Markov models, face authentication, face localization, face recognition, fusion, illumination, pose, synthetical views.

I. INTRODUCTION

IN THE LAST 25 years, face authentication has received growing interest, in response to the increased number of real-world applications requiring detection and recognition of humans, as well as the availability of low-cost hardware. Although human faces generally have the same structure, they are at the same time very different from each other due to gender, race, and individual variations. In addition to these variations, facial expressions can change their appearance. A robust detection-authentication system must also overcome variations due to lighting conditions, rotations of the head, complex background, etc.

The majority of face recognition techniques employ two-dimensional (2-D) grayscale or color images [1]. Although the three-dimensional (3-D) structure of the human face intuitively provides highly discriminatory information and is insensitive to environmental conditions, only a few techniques have been proposed that are based on range or depth images. This is mainly due to the high cost of available 3-D digitizers, which makes their use prohibitive in real-world applications. Furthermore, these devices often do not operate in real time (e.g., time-of-flight laser scanners) or produce inaccurate depth information (e.g., stereo vision).

Manuscript received August 1, 2003; revised April 20, 2004. This work was supported in part by research projects BioSec IST-2002-001766 ("Biometrics and Security") under the Information Society Technologies (IST) priority of the Sixth Framework Programme of the European Community and in part by HiScore IST-1999-10087-FP5 ("High Speed 3-D and Color Interface to the Real World"). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Nasser Kehtarnavaz.

F. Tsalakanidou is with the Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki, Greece, and also with the Informatics and Telematics Institute, Centre for Research and Technology Hellas (ITI/CERTH), Thessaloniki, Greece (e-mail: [email protected]).

S. Malassiotis is with the Informatics and Telematics Institute, Centre for Research and Technology Hellas (ITI/CERTH), Thessaloniki, Greece (e-mail: [email protected]).

M. G. Strintzis is with the Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki, Greece, and also with the Informatics and Telematics Institute, Centre for Research and Technology Hellas (ITI/CERTH), Thessaloniki, Greece (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIP.2004.840714

The work presented in this paper is partly motivated by the recent development of novel low-cost 3-D sensors that are capable of real-time 3-D acquisition [2]. Another motivation comes from the fact that the 3-D structure of the face may be exploited to discriminate among individuals or to aid 2-D face authentication, since the face shape data is not sensitive to variations of illumination, face pigmentation, and cosmetics, and is less sensitive to the use or nonuse of glasses and to facial expressions compared with the 2-D surface reflectance data represented by 2-D intensity images.

A common approach adopted toward 3-D face recognition is based on the extraction of 3-D facial features by means of differential geometry techniques. Facial features invariant to rigid transformations of the face may be detected using surface curvature measures [3]. Curvature information is subsequently used to extract higher level facial features. The extended Gaussian image (EGI) has been used in [4] and [5] for the representation of regions corresponding to distinct facial features. A knowledge-based approach for feature extraction has been adopted in [6] and [7]. An alternative method proposed for 3-D surface feature extraction is by means of point signatures. An application of this technique to face recognition is presented in [8] and [9], where the problem of recognizing faces is treated as a problem of nonrigid object recognition.

Feature-based face recognition techniques based on a combination of 3-D and grayscale images are used in [9] and [10]. The grayscale image in [10] is used to guide the detection of facial features on the 3-D surface (the eyes, for example, are more easily detected on the intensity image), while the 3-D image is used to compensate for the pose of the face.

The most important argument against techniques using a feature-based approach is that they rely on accurate 3-D maps of faces, usually extracted by expensive off-line 3-D scanners. Low-cost scanners, however, produce very noisy 3-D data. The applicability of feature-based approaches when using such data is questionable, especially if computation of curvature information is involved. Also, the computational cost associated with the extraction of the features (e.g., curvatures) is significantly high. This hinders the application of such techniques in real-world security systems. The recognition rates claimed by the above techniques [4]–[10] were estimated using databases of limited size and without significant variations of the faces. Only recently Chang et al. [11] conducted an experiment with a database of significant size (275 persons) containing both grayscale and range images, and produced comparative results of face identification using eigenfaces for 2-D, 3-D, and their combination and for varying image quality. This test, however, considered only frontal images with neutral expression, captured under constant illumination conditions.

Fig. 1. (a) Image captured by the color camera with the color pattern projected. (b) Computed range image. Darker pixels correspond to points closer to the camera and black pixels correspond to undetermined depth values.

The problem of face authentication using 3-D images has so far only been investigated under ideal conditions, both regarding 3-D image quality and environmental and facial variability. In practice, however, the sensors are imperfect, more so in systems suitable for mass screening. Face characteristics change with time, with lighting conditions, with pose, and with facial expressions; moreover, on-line, split-second authentication is needed. This far from trivial face authentication problem has not been solved by any methodology thus far proposed in the literature. To address some of the problems usually reported in 2-D and 3-D face classification, an appearance-based face authentication and recognition system integrating 3-D range data and 2-D color or intensity images is presented in this paper. The proposed system uses a structured light approach based on inexpensive off-the-shelf components, by which a 3-D depth map of the face along with its color image become available. It is emphasized that the proposed face classification scheme may be easily combined with the majority of existing 2-D systems to improve their performance and that it is capable of real-time face authentication, notwithstanding arbitrary pose, lighting, and expression variations.

Apart from the combination of 2-D and 3-D information, this paper introduces several novel techniques that exploit the availability of 3-D information to improve classification accuracy.

1) A novel face detection and localization method using a combination of color and depth data is presented. The algorithm exhibits robustness to background clutter, occlusion, face pose variation, and harsh illumination conditions by exploiting depth information and prior knowledge of face geometry and symmetry. The fast response of the algorithm makes it suitable for real-time applications.

2) Unlike techniques that rely on an extensive training set to achieve high recognition rates, our system requires only a few images per person. This is achieved by enriching the original database with synthetically generated images depicting variations in pose and illumination. Thus, a cumbersome enrollment stage is avoided, while better recognition rates are achieved. Although the idea of enlarging the training set with novel examples is not new, the investigation of such techniques for 3-D face classification is novel.

3) The performance of the proposed system is extensively evaluated using two face databases of significant size (2500 images each) containing several facial and lighting variations. The combination of 2-D and 3-D information was shown to produce consistently superior results in comparison with a 2-D state-of-the-art algorithm, while demonstrating near real-time performance.

The paper is organized as follows. A brief description of the 3-D sensor is given in Section II. Face localization based on depth and color information is described in Section III, while in Section IV, the embedded hidden Markov model (EHMM) method for both color and depth images is outlined. In Section V, we examine algorithms for the enrichment of face databases with novel views. Experimental results evaluating the developed techniques are presented in Section VI. Finally, a discussion of our achievements, limitations, and ideas for future work is presented in Section VII.

II. ACQUISITION OF 3-D AND COLOR IMAGES

A 3-D and color camera capable of synchronous real-time acquisition of 3-D images and associated 2-D color images is employed [2]. The 3-D data acquisition system is based on an active triangulation principle, making use of an improved and extended version of the well-known coded light approach (CLA). The basic principle behind this device is the projection of a color-encoded light pattern on the scene and the measurement of its deformation on the object surfaces (see Fig. 1). The 3-D camera achieves real-time image acquisition for range images and near real-time (14 fps) acquisition for color plus depth images on a standard PC. It is based on low-cost devices, an off-the-shelf CCTV color camera and a standard slide projector [2]. By rapidly alternating the color-coded light pattern with a white-light pattern, both color and depth images are acquired (Fig. 1).

For the experiments in this paper, the system was optimized for an access-control application scenario, leading to an average depth accuracy of 1 mm for objects located about 1 m from the camera in an effective working space of 60 × 50 × 50 cm. Computed depth values are quantized into 16 bits. The spatial resolution of the range images is equal to the horizontal color camera resolution in the horizontal dimension, while in the vertical direction it depends on the width of the color stripes of the projected light pattern and the bandwidth of the surface signal. For a low-bandwidth surface, such as the face, the vertical range image resolution is close to the vertical resolution of the color camera. For objects outside the working volume, the projected pattern will be fuzzy, leading to undetermined depth values. The acquired range images contain artifacts and missing points, mainly over areas that cannot be reached by the projected light and/or over highly refractive (e.g., eyeglasses) or low-reflective surfaces (e.g., hair and beard).

III. FACE LOCALIZATION USING COLOR AND DEPTH INFORMATION

Although face detection is an ongoing subject of research, face localization is a problem rarely addressed in the face recognition literature. While face detection is mostly concerned with roughly finding all the faces in large, complex images, which include many faces and much clutter, localization emphasizes spatial accuracy, usually achieved by accurate detection of facial features.

Several face detection techniques have been proposed for grayscale images [12]. These may be roughly categorized into those based on the detection of facial features, possibly exploiting their relative geometric arrangement, and those based on the classification of the brightness pattern inside an image window, obtained by exhaustively sweeping the whole image, as face or nonface. Techniques in the second category were recently shown to be more successful in detecting faces in cluttered backgrounds [13]; however, the correct detection rates reported were below 90%. Further shortcomings of existing face detection algorithms include their sensitivity to partial occlusion of the face (e.g., glasses and hair), harsh illumination, and head pose, as well as their computational complexity.

Color information, when available, is a powerful cue for detecting the face [14]. However, the parameters of the color distribution were shown to depend on the environmental illumination and the response characteristics of the acquisition device. Furthermore, irrelevant skin-colored image regions will result in erroneous face candidates.

More robust face localization may be achieved by using depth information. In this paper, a face localization procedure combining both depth and color data is proposed. By exploiting depth information, the human body may be easily separated from the background, while by using a priori knowledge of its geometric structure, efficient segmentation of the head from the body (neck and shoulders) is achieved. The position of the face is further refined using the color image to locate the point that lies just above the nose, in between the eyes, by exploiting face symmetry.

In the following, we shall assume that the target face is the closest object to the camera, as is reasonable for all access control applications. Even in a surveillance application, where multiple faces may be visible, the face closest to the camera will be selected for recognition, since the faces further back will be classified as background. With a small effective working space of 60 × 50 × 50 cm (see Section II), little background clutter exists. Even with a large working volume, where 50% of the image pixels are on the background, the proposed detection technique works successfully. This is due to the reliance of the structured light approach on the projection of a light pattern on the object. For objects outside the working volume, the projected pattern will be fuzzy, leading to undetermined depth values over these areas. In other words, the 3-D sensor automatically segments the foreground from the background, i.e., objects within the working volume from objects outside the working volume. Nevertheless, there may be background objects inside the working volume, for example, when another person stands right behind the user. Even in such a case, the depth histogram contains two sufficiently distinct modes (body and background) that may be separated with simple thresholding. The optimal threshold is selected using the algorithm in [15].
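To make this thresholding step concrete, the sketch below separates the body from residual background depth values. It is a minimal illustration, not the authors' implementation: Otsu's between-class-variance criterion is used as a stand-in for the optimal threshold selection algorithm of [15], and a depth map with zeros marking undetermined pixels is assumed.

```python
import numpy as np

def segment_body(depth, num_bins=256):
    """Separate the user's body from background depth values.

    Assumes `depth` is a 2-D array in millimetres with 0 marking
    undetermined pixels (points outside the working volume).  Otsu's
    criterion is used here as a stand-in for the optimal threshold
    selection algorithm cited as [15] in the paper.
    """
    valid = depth[depth > 0]
    hist, edges = np.histogram(valid, bins=num_bins)
    p = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])

    best_t, best_var = edges[0], -1.0
    for k in range(1, num_bins):
        w0, w1 = p[:k].sum(), p[k:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (p[:k] * centers[:k]).sum() / w0
        m1 = (p[k:] * centers[k:]).sum() / w1
        between = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if between > best_var:
            best_var, best_t = between, edges[k]

    # The body is the mode closer to the camera (smaller depth values).
    return (depth > 0) & (depth < best_t)
```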

Segmentation of the head from the body relies on statistical modeling of the head–torso points in 3-D space, inspired by the approach adopted in [16] for the segmentation of the arm from the body. A novel technique for the initialization of the 3-D blob parameters is introduced to ensure the successful and rapid convergence of the expectation–maximization (EM) algorithm. The probability distribution of a 3-D point x is modeled as a mixture of two Gaussians

P(x) = π1 N(x; μ1, Σ1) + π2 N(x; μ2, Σ2)                                        (1)

N(x; μ, Σ) = (2π)^(−3/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)^T Σ^(−1) (x − μ))         (2)

where π1 and π2 are the prior probabilities of the head and torso, respectively, and N(x; μ, Σ) is the 3-D Gaussian distribution with mean μ and covariance Σ. Maximum-likelihood estimation of the unknown parameters {πk, μk, Σk} from the 3-D data is obtained by means of the EM algorithm [17]. To avoid the convergence of the algorithm to local minima, good initial parameter values are required. In our case, these may be obtained by using 3-D moments, i.e., the center of mass and the covariance matrix. Let x̄ be the center of mass, S the scatter matrix computed from the data points, and e1, e2, e3 the eigenvectors of S, ordered according to the magnitude of the corresponding eigenvalues λ1 ≥ λ2 ≥ λ3. The blob centers are initialized on the principal axis e1, with the head center placed above and the torso center below the center of mass x̄, and the blob covariances are initialized as scaled versions of S expressed in the orthogonal eigenvector basis of S, where the scaling constants are related to the relative size of the head with respect to the torso.

The physical interpretation of the above parameter selection is illustrated in Fig. 2. That is, the 3-D blob centers and covariances are initialized based on prior knowledge of the human body structure (e.g., head above torso) and the relative dimensions of body parts. In the case that no torso is apparent in the effective working space, for example, when someone is wearing a dark shirt, the prior probability of the pixel blob corresponding to the torso converges to zero.

Fig. 2. Illustration of knowledge-based initialization of the 3-D blob distribution parameters. Ellipses represent iso-probability contours of the posterior distributions. The lengths of the axes of the ellipses are selected on the basis of the iso-probability ellipse estimate computed using all 3-D data.

A 3-D point is classified as head or torso using the maximum-likelihood criterion. Experimental results demonstrate the robustness of the algorithm to head orientation variability and moderate occlusion of the face (e.g., a hand in front of the face; see Fig. 3), leading to correct classification of face pixels in almost 100% of the images.
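A compact sketch of this head–torso segmentation is given below. It is illustrative only: scikit-learn's EM implementation of a two-component Gaussian mixture stands in for the authors' EM, and the offset constants c1, c2, the initial weights, and the "up" direction are assumptions, since the exact values are not recoverable from this copy of the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_head_torso_blobs(points):
    """Fit the two-Gaussian head/torso model of Section III with EM.

    `points` is an (N, 3) array of 3-D body points.  The blob centres are
    initialised from 3-D moments: both lie on the principal axis of the
    point cloud, with the head above the torso.  The constants below are
    illustrative guesses, not the values used in the paper.
    """
    mean = points.mean(axis=0)
    scatter = np.cov(points.T)
    eigvals, eigvecs = np.linalg.eigh(scatter)
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    e1 = eigvecs[:, 0]                         # principal (roughly vertical) axis
    if e1[1] < 0:                              # assumed sign convention: +y is "up"
        e1 = -e1

    # Knowledge-based initialisation: head above torso along e1 (c1, c2 assumed).
    c1, c2 = 0.7, 0.3
    mu_head = mean + c1 * np.sqrt(eigvals[0]) * e1
    mu_torso = mean - c2 * np.sqrt(eigvals[0]) * e1

    gmm = GaussianMixture(
        n_components=2,
        covariance_type="full",
        weights_init=[0.3, 0.7],               # head is the smaller blob (assumed priors)
        means_init=np.vstack([mu_head, mu_torso]),
        max_iter=50,
    )
    gmm.fit(points)
    labels = gmm.predict(points)               # component 0 = head, 1 = torso (init order)
    return labels, gmm
```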

One shortcoming of the above iterative procedure is that the estimated segmentation is biased by erroneous depth estimates. In fact, since one side of the face is partially occluded from the structured light source, no depth is computed over the corresponding image region; hence, the center of the face may not be accurately localized. Therefore, a second step is required that refines the localization using brightness information.

The aim of this second step is the localization of the point that lies in the middle of the line segment defined by the centers of the eyes. Then, an image window containing the face is centered around this point, thus achieving approximate alignment of facial features in all images, which is very important for face classification. A novel localization technique is proposed that exploits the highly symmetric structure of the face. The estimation of the horizontally oriented axis of bilateral symmetry between the eyes is sought first. Then, the vertically oriented axis of bilateral symmetry between the eyes is estimated. The intersection of these two axes defines the point of interest.

Based on the technique described in [18], which proposes a measure of bilateral symmetry for arbitrary 2-D objects, we propose a novel method for the estimation of the symmetry axis of faces by minimization of this symmetry measure using the Hough transform. Furthermore, we introduce the idea of using the symmetry of the face to localize its center. Let p be any image point and ∇I(p) the intensity gradient at p. We also denote by m(p) and φ(p), respectively, the magnitude and phase of the gradient vector. For a pair of points p_i and p_j, let p_m be their midpoint and l the line that passes through p_m and is perpendicular to the direction defined by p_i and p_j. Then, a measure s(p_i, p_j) of the reflectional symmetry of the two points about the line l is computed from the gradient magnitudes m(p_i), m(p_j) and the gradient phases φ(p_i), φ(p_j) relative to the orientation of l [18].

Let the line of symmetry be defined in the parametric form x cos θ + y sin θ = ρ, where θ is the slant of the line and ρ its distance from the origin. The unknown parameters (θ, ρ) are estimated by means of a Hough transform technique described in the following.

A bounding box large enough to contain the eyes with certainty is calculated from the initial depth segmentation, using prior knowledge of the average dimensions of the eyes in the image, thus constraining the range of values of ρ. The rotation of the head is assumed to be limited, which constrains θ. Then, the range of parameter values is efficiently quantized and a 2-D Hough accumulator table H is created (30 × 60 bins). Each pair of quantized parameter values (θ, ρ) defines a candidate symmetry axis l. For each pixel p on this line, we gather all intensity edge pairs p_i and p_j whose midpoint coincides with p and whose distance is smaller than D, where D is a constant that equals approximately the maximum vertical pixel size of the eyes in the image (approximately 1.5 cm in 3-D space). Then, for each pair, H(θ, ρ) is incremented by s(p_i, p_j). The symmetry axis is finally found by locating the cell in table H which received the maximum number of votes. The vertical axis of symmetry is similarly estimated; D in this case equals approximately the inter-ocular distance recorded in the image (6.5 cm on average in 3-D space). The intersection of the estimated symmetry axes defines the center of the face. A window of constant aspect ratio is centered on this point, while its size is scaled according to the depth value of the center point.
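The sketch below illustrates the Hough-based voting for a near-horizontal symmetry axis. It is an approximation of the procedure just described: the pairwise weight is an assumed Reisfeld-style phase/magnitude term standing in for the exact symmetry measure of [18], which is not recoverable here, and the bin ranges and edge-selection percentile are likewise illustrative.

```python
import numpy as np

def symmetry_axis(gray, roi, d_max=25, n_theta=30, n_rho=60):
    """Estimate the horizontal axis of bilateral symmetry inside `roi`.

    `gray` is a 2-D intensity image; `roi` = (top, bottom, left, right) is
    the bounding box obtained from the depth-based segmentation.  Edge
    pairs vote for the candidate lines passing through their midpoint.
    The pairwise weight below is an assumed stand-in for the measure of [18].
    """
    top, bottom, left, right = roi
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    phase = np.arctan2(gy, gx)

    # Keep only strong edges inside the ROI to limit the number of pairs.
    ys, xs = np.nonzero(mag > np.percentile(mag, 95))
    keep = (ys >= top) & (ys < bottom) & (xs >= left) & (xs < right)
    ys, xs = ys[keep], xs[keep]

    thetas = np.linspace(np.deg2rad(70), np.deg2rad(110), n_theta)  # near-horizontal axes
    rhos = np.linspace(top, bottom, n_rho)
    acc = np.zeros((n_theta, n_rho))

    for i in range(len(ys)):
        for j in range(i + 1, len(ys)):
            if abs(int(ys[i]) - int(ys[j])) + abs(int(xs[i]) - int(xs[j])) > d_max:
                continue                       # pair too far apart (constant D)
            my, mx = 0.5 * (ys[i] + ys[j]), 0.5 * (xs[i] + xs[j])
            for ti, th in enumerate(thetas):
                rho = mx * np.cos(th) + my * np.sin(th)   # line: x cos(t) + y sin(t) = rho
                ri = np.searchsorted(rhos, rho)
                if 0 <= ri < n_rho:
                    w = mag[ys[i], xs[i]] * mag[ys[j], xs[j]] * \
                        (1.0 - np.cos(phase[ys[i], xs[i]] + phase[ys[j], xs[j]] - 2 * th))
                    acc[ti, ri] += w           # accumulate the symmetry weight

    ti, ri = np.unravel_index(np.argmax(acc), acc.shape)
    return thetas[ti], rhos[ri]                # parameters of the detected symmetry axis
```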

Using the above two-step procedure, highly robust and computationally efficient face localization is achieved (Fig. 3). We have tested the algorithm on 500 images, where the center of the face was manually specified. The average localization error was four pixels in the horizontal and three pixels in the vertical direction.

Fig. 3. Face localization results. The box with the dashed line corresponds to the depth-based segmentation. The horizontal line of symmetry and the intersection point with the vertical line of symmetry are illustrated. The window that encloses the face is centered on this point (continuous line).

IV. FACE CLASSIFICATION

Face classification aims to identify individuals by means of discriminatory facial attributes extracted from one or more images belonging to the same person. The challenge is, therefore, in the selection of appropriate features and their efficient matching. Face classification techniques can be roughly divided into two main categories: global approaches and feature-based techniques. In global approaches, the whole image serves as a feature vector, while in local feature approaches, a number of fiducial or control points are extracted and used for classification. Global approaches model the variability of the face by analyzing its statistical properties based on a large set of training images. Representative global techniques are eigenfaces [19], linear/Fisher discriminant analysis (LDA) [20], [21], support vector machines (SVM) [22], and hidden Markov models (HMMs) [23], [24]. Feature-based techniques, on the other hand, discriminate among different faces based on measurements of structural attributes of the face. More recent approaches include elastic graph matching [25] and dynamic link architecture [26]–[28].

In this paper, we employ the EHMM technique of [23] as a baseline classifier for both color and depth images. We have also tested other state-of-the-art algorithms, such as LDA and elastic bunch graphs, and obtained similar results. The EHMM algorithm models the face as a sequence of states corresponding to homogeneous image regions. The states are initialized so that they are roughly aligned with facial features [29]. The probability distribution function corresponding to each state of the EHMM is approximated by a mixture of Gaussians. The parameters of the distribution are estimated given observations extracted from training images [29]. The observation sequence for a face image is formed by the low-frequency DCT coefficients calculated from image blocks that are extracted by scanning the image from left to right and top to bottom [23]. For color images, the image is transformed from the RGB to the YUV color space to decorrelate the color components, and the DCT is applied to every component separately. Then, the observation vector is constructed by concatenation of the color components. The EHMM technique was applied to depth maps as well.
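As an illustration of the observation extraction, the sketch below scans a single-channel image with an overlapping window and keeps the low-frequency 2-D DCT coefficients of each block, dropping the DC term as discussed in Section VI-C. The default window, step, and coefficient counts follow the values reported later in the paper, but the function itself is only a simplified reading of the procedure in [23], [29], not the authors' code.

```python
import numpy as np
from scipy.fft import dctn

def ehmm_observations(img, win=12, step=2, keep=3):
    """Build an EHMM observation grid from a single-channel image.

    Blocks of `win` x `win` pixels are scanned left-to-right and
    top-to-bottom; for each block the `keep` x `keep` low-frequency 2-D DCT
    coefficients form one observation vector.  `step` is the scanning step
    (window size minus overlap).
    """
    h, w = img.shape
    obs = []
    for y in range(0, h - win + 1, step):
        row = []
        for x in range(0, w - win + 1, step):
            block = img[y:y + win, x:x + win].astype(float)
            coeffs = dctn(block, norm="ortho")[:keep, :keep].ravel()
            row.append(coeffs[1:])     # drop the DC coefficient (lighting dependent)
        obs.append(np.array(row))
    return np.array(obs)               # shape: (rows, cols, keep*keep - 1)

# For a color image in YUV space, the same extraction would be applied to
# each component and the per-block vectors concatenated.
```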

Face classification begins with the extraction of the observations from the input face image. Next, the trained EHMMs in the database are used to calculate the probability of the observations, given the EHMM of each person. The model presenting the highest probability is considered to reveal the identity of the person appearing in the input image.

Two independent EHMM classifiers, one for color and one for depth, are combined to handle both color and depth information. In general, the use of multimodal classifiers aims at exploiting the extra information that independent classifiers can offer, in order to achieve higher recognition rates. Since classifiers based on depth information demonstrate robustness in handling cases where color- or intensity-based classifiers show weakness, and vice versa, their combination results in increased recognition robustness in all cases. Let C = {C_1, ..., C_N} be the set of scores computed for each of the N EHMM models in the database using the color classifier. Similarly, let D = {D_1, ..., D_N} be the set of scores estimated by the depth classifier. Each set of scores is normalized in the range [0, 1], yielding C'_i and D'_i, and a new set of combined scores Q = {Q_1, ..., Q_N} is computed by

Q_i = w_1 C'_i + w_2 D'_i                                        (3)

where w_1 and w_2 are normalizing constants, chosen experimentally once during the training of the classifier. The element of Q with the lowest value (highest similarity) corresponds to the identity of the most similar person. The fusion technique described above was shown elsewhere to give the best recognition results [30].
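A minimal sketch of the score fusion in (3) is shown below, assuming the EHMM scores behave as distances (lower is better) and using simple min–max normalization; the weight value is a placeholder, since the paper selects it experimentally by minimizing the EER.

```python
import numpy as np

def fuse_scores(color_scores, depth_scores, w1=0.6):
    """Combine color and depth EHMM scores as in (3).

    Scores are treated as distances (lower = more similar).  Each set is
    min-max normalised to [0, 1] and combined with a weighted sum; w1 is
    the color weight and 1 - w1 the depth weight (0.6 is a placeholder).
    """
    def normalise(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)

    fused = w1 * normalise(color_scores) + (1.0 - w1) * normalise(depth_scores)
    return fused, int(np.argmin(fused))      # lowest fused score = claimed identity
```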

The proposed method combines color and 3-D information in order to exploit the advantages of both approaches and to deal successfully with the cases where one alone is not sufficient for accurate identification. In the subsequent sections, we provide experimental results to substantiate our claim that such an approach results in a more robust recognition/authentication scheme.

V. ENROLLMENT

One of the main problems in face recognition is that facial appearance is distorted by, for example, seasonal changes (aging, hairstyle, usage of cosmetics, etc.), rotation, harsh or heterogeneous illumination, and occlusions caused by glasses, scarves, etc. This problem may be partly alleviated by recording a rich training database containing representative variations. Such an approach is shown to lead to improved recognition/authentication rates [31].

In this paper, a database enrichment procedure is described that avoids a cumbersome enrollment process. A small set of images (normally fewer than five per person) depicting different facial expressions, with and without eyeglasses, is originally recorded. These are subsequently used to create canonical face images (upright pose). For each canonical image pair (color and depth), a 3-D model is constructed and used to automatically generate artificial views of the face depicting pose and illumination variations. The various steps of this procedure are described in the sequel.


Fig. 4. Surface interpolation example. (a) Original color image with interest points identified. (b) Original depth image. (c) Warped depth image. (d) Symmetry-based interpolation. (e) Final linearly interpolated depth image. (f) Warped color image.

A. Surface Interpolation

As mentioned above, face authentication algorithms may cope with moderate pose variations by enriching the training database with a set of novel views. In our case, this requires that the depth images do not contain holes. If some pixels are undetermined in the depth image, the corresponding pixels will also be missing in the artificially generated color images depicting various pose and illumination combinations. In practice, more than 20% of the face surface pixels are missing, mainly over occluded points on one side of the nose and face or over the eyes, for people wearing glasses. Therefore, a semiautomated surface interpolation procedure has been developed that exploits face symmetry. Missing pixels from the right part of the face may be copied from symmetrically located pixels on the left part of the face, or vice versa. This procedure requires the estimation of the vertical bilateral symmetry axis with high accuracy.

In order to achieve optimal rectification of the training images during the enrollment of a new user, the estimation of the symmetry axis is not performed automatically as described in Section III, but manually. The user has to mark the two points corresponding to the centers of the eyes and another point close to the chin or mouth on the input image, so that the vertical symmetry axis of the face is defined. The complete enrollment time is about 2–3 min for an experienced user, since less than 10 s are required for this step, and the enrollment of a new person in the database requires about five images of this person. Once the three interest points have been located and the corresponding depth values have been computed from the depth image, a 3-D coordinate frame centered on the face is defined. A warping procedure is applied to align this local coordinate frame with the frame of the camera, thus bringing the face into an upright orientation. Then, the following operation is applied to the depth image D: if D(p) is originally undefined (i.e., if the depth of pixel p is missing), then D(p) is set to D(p'), where p' is the point that is reflectionally symmetric to p with respect to the vertical symmetry axis. Note that this procedure is only applied to missing pixels of the depth map, not to the color image. Also, since these pixels are normally on the sides of the face and over low-reflectivity areas, the lower part of the face is not affected, and, thus, inherent facial asymmetry, which may be a cue for discrimination, is not violated.

After the symmetry-based interpolation, some pixel values may still remain undefined. These missing points are linearly interpolated using neighboring points. Since the dimension and shape of the holes vary, a 2-D Delaunay triangulation procedure is applied [32]. In Fig. 4, the various steps of the surface interpolation procedure are illustrated. To compute the corresponding rectified color image, the inverse 3-D transformation is first applied to the interpolated depth map. Then, the forward 3-D transformation is applied to the color image using the back-projected interpolated depth values.
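The sketch below illustrates the two interpolation steps on an upright depth map, assuming zeros mark missing values and that the vertical symmetry axis is a given image column; scipy's griddata supplies the Delaunay-based linear interpolation. It is a simplification of the semiautomated procedure described above, not the authors' implementation.

```python
import numpy as np
from scipy.interpolate import griddata

def fill_depth(depth, axis_x):
    """Fill undetermined depth pixels of an upright face depth map.

    Step 1 mirrors missing values across the vertical symmetry axis at
    column `axis_x` (only missing pixels are overwritten, so measured
    facial asymmetry is preserved).  Step 2 linearly interpolates any
    remaining holes over a Delaunay triangulation.
    """
    out = depth.astype(float).copy()
    h, w = out.shape

    # Symmetry-based fill: copy from the reflected column when it is valid.
    for y in range(h):
        for x in range(w):
            if out[y, x] == 0:
                xr = int(round(2 * axis_x - x))
                if 0 <= xr < w and out[y, xr] > 0:
                    out[y, x] = out[y, xr]

    # Linear interpolation of the remaining holes.
    yy, xx = np.nonzero(out > 0)
    holes_y, holes_x = np.nonzero(out == 0)
    if holes_y.size:
        filled = griddata(np.column_stack([yy, xx]), out[yy, xx],
                          np.column_stack([holes_y, holes_x]), method="linear")
        nan = np.isnan(filled)
        if nan.any():   # outside the convex hull: fall back to nearest neighbour
            filled[nan] = griddata(np.column_stack([yy, xx]), out[yy, xx],
                                   np.column_stack([holes_y[nan], holes_x[nan]]),
                                   method="nearest")
        out[holes_y, holes_x] = filled
    return out
```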

B. Creation of Artificially Rotated Depth and Color Images

Pose variation presents one of the most challenging problems in face recognition, leading to significant deterioration in the performance of face recognition algorithms, since the resulting changes in appearance can be greater than the perceived variability between persons [33]. Several techniques have been proposed to recognize faces under varying pose. One approach is the automatic generation of novel views resembling the pose in the probe image. This is achieved either by using a face model (an active appearance model in [34] and a deformable 3-D model in [35]) or by warping frontal images using the estimated optical flow between probe and gallery [36]. Classification is subsequently based on the similarity between the probe image and the generated view.

An approach quite different from the above is based on building a pose-varying eigenspace by recording several images of each person under varying pose. Representative techniques are the view-based subspace of [37] and the predictive characterized subspace of [38]. The proposed technique may be classified in this latter category. However, in our case the view-based subspace is automatically generated from a few frontal-view images by exploiting the information about the 3-D structure of the human face.

In order to create artificially rotated depth maps and associated color images, a 3-D model of the face is first constructed. The 3-D model consists of quadrilaterals, where the vertices of each quadrilateral are computed from the depth values of four neighboring pixels on the image grid. The 3-D coordinates of each vertex are computed from the corresponding depth value using the camera calibration parameters. Also, associated with each vertex is the color value of the corresponding pixel in the color image.

A global rotation is subsequently applied to the vertices of the 3-D model. The rotation center is fixed in 3-D space so that its projection corresponds to the point between the eyes and it is at a fixed distance from the camera. The transformed 3-D model is subsequently rendered using the z-buffer rendering algorithm [39], giving rise to a synthetic color and range image (Fig. 5). To limit the effect of points on the face coming into view when the face model is rotated, a mask that includes the central part of the face is used.
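To make the view-synthesis step concrete, the sketch below back-projects a depth map, rotates the points about the face center, and re-renders with a z-buffer. A generic pinhole model with an assumed focal length replaces the sensor's actual calibration, and point splatting replaces the quadrilateral-mesh rendering of [39], so this is only an approximation of the procedure described above.

```python
import numpy as np

def rotate_and_render(depth, color, R, center, f=1000.0):
    """Generate a synthetically rotated depth/color pair (Section V-B).

    Each valid depth pixel is back-projected to 3-D with a pinhole model
    (focal length `f` in pixels is an assumption, not the sensor's
    calibration), rotated by the 3x3 matrix `R` about the 3-D point
    `center`, and re-projected.  A z-buffer keeps the nearest point per
    pixel (a point-splatting simplification of mesh rendering).
    """
    h, w = depth.shape
    cy, cx = h / 2.0, w / 2.0
    ys, xs = np.nonzero(depth > 0)
    z = depth[ys, xs].astype(float)
    pts = np.column_stack([(xs - cx) * z / f, (ys - cy) * z / f, z])

    pts = (pts - center) @ R.T + center          # rotate about the face centre

    zbuf = np.full((h, w), np.inf)
    new_depth = np.zeros_like(depth, dtype=float)
    new_color = np.zeros_like(color)
    u = np.round(pts[:, 0] * f / pts[:, 2] + cx).astype(int)
    v = np.round(pts[:, 1] * f / pts[:, 2] + cy).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (pts[:, 2] > 0)
    for i in np.nonzero(ok)[0]:
        if pts[i, 2] < zbuf[v[i], u[i]]:         # keep the point closest to the camera
            zbuf[v[i], u[i]] = pts[i, 2]
            new_depth[v[i], u[i]] = pts[i, 2]
            new_color[v[i], u[i]] = color[ys[i], xs[i]]
    return new_depth, new_color
```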


Fig. 5. Artificial rotation example. Original color image and depth map and five generated views: rotations of −20° and +20° around the vertical axis y, −15° and +15° around the horizontal axis x, and +5° around the axis z.

Fig. 6. Artificial illumination example. Original color image and depth map and four synthetical views.

C. Simulating Illumination

Another source of variation in facial appearance is the illumination of the face. Several techniques have been proposed to cope with varying illumination. The majority of these techniques exploit the low dimensionality of the face space under varying illumination conditions and the Lambertian assumption [40]. They either use several images of the same person recorded under varying illumination conditions [41] or rely on the availability of 3-D face models and albedo maps [42]–[44] to generate novel views. The main shortcoming of this approach is the requirement in practice of large example sets to achieve good reconstructions.

Our approach, on the other hand, builds an illumination-varying subspace by constructing artificially illuminated color images from an original image. This normally requires the availability of surface gradient information, which, in our case, may be easily computed from the depth data. Since it is impossible to simulate all types of illumination conditions that one may find in the real world, we try to simulate those conditions that have the greatest effect on face recognition performance. Heterogeneous shading of the face caused by a directional light coming from one side of the face was experimentally shown to be most commonly liable for misclassification.

Given the surface normal vector n(p) computed at each point p of the surface and the direction l of the artificial light source, specified by the azimuth angles θ and φ, the RGB color vector of a pixel in the artificial view is given by

c'(p) = (k_a + k_d n(p) · l) c(p)

where c(p) is the corresponding color value in the original view, and k_a and k_d weigh the effect of ambient light and diffuse reflectance, respectively (in our experiments, values approximating the average reflectance of the human face were chosen). Fig. 6 shows an example of artificially illuminated views.
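A minimal relighting sketch under the Lambertian assumption follows; the surface normals are estimated directly from the depth gradient, 8-bit RGB color is assumed, and the ambient/diffuse weights are placeholders rather than the values chosen in the paper.

```python
import numpy as np

def relight(color, depth, light_dir, k_a=0.4, k_d=0.6):
    """Create an artificially illuminated view (Section V-C).

    Surface normals are estimated from the depth-map gradient and a single
    directional light plus an ambient term scales the original colors,
    assuming Lambertian reflectance.  k_a and k_d are placeholder weights.
    """
    dzdy, dzdx = np.gradient(depth.astype(float))
    normals = np.dstack([-dzdx, -dzdy, np.ones_like(depth, dtype=float)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)

    l = np.asarray(light_dir, dtype=float)
    l /= np.linalg.norm(l)
    diffuse = np.clip(normals @ l, 0.0, None)          # n . l, clamped at zero

    shading = k_a + k_d * diffuse
    return np.clip(color.astype(float) * shading[..., None], 0, 255).astype(np.uint8)
```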

VI. EXPERIMENTAL EVALUATION

The focus of the experimental evaluation was to show that the proposed integration of 2-D color (or intensity) information and 3-D range data demonstrates superior performance compared with state-of-the-art 2-D face authentication techniques, as well as robustness against natural facial and environmental variability. Two databases were recorded, comprised of 2-D color images and their corresponding depth maps. In order to make the experimental results comparable to public evaluations of face recognition algorithms and systems, international practices such as [45] and [46] were followed regarding the evaluation concept and database recording methodology.

A. First Evaluation Database

The first face database for evaluation contains 2818 recordings of 50 individuals, where each recording consists of a color image (571 × 752 pixels, 24 bits/pixel) and the corresponding depth map (571 × 752 pixels, 16 bits/pixel). The images were recorded by the 3-D sensor prototype over a period of three months in an indoor office environment. The database consists of five recording sessions per person, captured at intervals of ten days on average. The test population contains 33 male and 17 female individuals. The age of the participants was between 20 and 70 years, but most of the individuals were between 20 and 35 years old. Each session contains on average 12 recordings per person: three frontal images (frontal upright orientation, neutral facial expression), two poses (head rotated about the vertical axis), three expressions (smile, laugh, and eyes half-closed), three illumination recordings (lights out and an extra light source coming from the left or from the right), and one recording with/without eyeglasses. Table I tabulates the number of images used for enrollment and the number of test images used for the different facial variation tests.

Some examples of color images and corresponding depth maps belonging to the first evaluation database can be seen in Fig. 7. Examples of face images of an individual depicting various expressions, poses, and illumination variations during the recording of a single session can be seen in Fig. 8. During the recording of the database sessions, intra-person variability due to different hairstyle, makeup, suntan, beard, and glasses was also observed, as can be clearly seen in Fig. 9.

TABLE I: NUMBER OF ENROLLMENT AND TEST IMAGES DEPICTING DIFFERENT FACIAL VARIATIONS ON THE FIRST AND SECOND FACE DATABASE

Fig. 7. Examples of face color images and corresponding depth maps belonging to the first face database.

Fig. 8. Examples of face color images and corresponding depth maps depicting various expressions, poses, and illumination variations during the recording of a single session.

Fig. 9. Examples of face color images and corresponding depth maps depicting facial variations due to hairstyle, makeup, beard, glasses, and suntan during the recording of different sessions.

B. Second Evaluation Database

Due to a miscalibration of the camera during the recording sessions, the quality of the color data in the first database did not allow testing of color images or of the combination of color and depth data. Also, an improved 3-D sensor prototype was developed, which led to better depth quality compared to the first database. In order to evaluate the combination of color and depth images for face authentication, as well as to examine the effect of data quality on the performance of the system, a second evaluation database was recorded. It contains 2244 recordings of 20 individuals captured during two sessions in an indoor office environment (see Table I). The time between the recording sessions was at least one week. For this database, only 20 male individuals were available, other than those participating in the first database. The age of the participants was between 28 and 60.

C. Training and Testing

For the training of the EHMMs, five image pairs per subject (two frontal, two expressions, and one with eyeglasses), taken from a single recording session, were used. For testing of the face authentication system, all images of the remaining sessions were used. The procedure described in Section V was followed for every training image pair: for every pair of depth and color images, surface interpolation was performed to cover missing pixels in the depth map and to bring the face into an upright orientation. The resulting images were scaled and cropped to dimensions 140 × 200. Artificial views depicting pose and/or illumination variations were then generated from the resulting rectified enrollment images. Since it is impossible to simulate the full range of facial poses and all orientations of the light source, we have experimentally determined the range of face orientations that are typical for cooperating users; this range covers limited rotations around the vertical axis y and the horizontal axis x. As far as the illumination parameters are concerned, we consider fixed azimuth angles θ and φ, with φ equal to 0°. A shift parameter was also considered to compensate for any misalignment during face localization.

Three training sets were created using different enrichment schemes. Each training set is abbreviated as TSi, i = 1, 2, 3, in the following. TS1 is comprised of the original (rectified) enrollment images and does not contain synthetical views. TS2 results from the enrichment of TS1 with views depicting artificial rotations around the x and y axes; a shift of a few pixels along the horizontal axis was also incorporated. For every image pair of TS1, 124 synthetical views were generated in TS2. For the creation of TS3, all images in TS2 were additionally illuminated artificially, with the light source azimuth angles set to fixed values.

We have trained three EHMM classifiers for each training set, using depth, color, and intensity images, respectively. Several combinations of EHMM parameters were experimentally tested. The optimal structure was shown to consist of six superstates with 3, 7, 7, 7, 7, and 3 embedded states, respectively. A sampling window of 12 × 12 pixels presented the best results, while a 3 × 3 observation window (number of DCT coefficients forming an observation) was selected. Since the global intensity of the light affects only the dc component of the 2-D DCT, while the rest of the coefficients are relatively unaffected [29], the dc coefficient is not taken into account in our experiments. Investigating the optimal size of the overlap along the x and y axes, we found that 10 × 10 is a good compromise. Higher values of overlap lead to high computational load and delay the response of the verification system. The selection of the color space is also of great importance. The YCrCb color space yielded the best results among the different color spaces tested.

The test phase consists of a face localization step and a face classification step. The result of face localization on an input image (as described in Section III) is a pair of cropped images (140 × 200 pixels). Each image contains the face of the person, positioned so that the point between the eyes is in the center of the image. A simple linear surface interpolation is then applied to the depth image, so that missing pixel values are interpolated. A mask of elliptical shape and standard dimensions is used to crop parts of the face near the boundaries of the image. In cases where only 2-D information is used (intensity or color approach), the face detection algorithm described in [47] was applied. The algorithm was trained using manually cropped images from the training set and attains similar accuracy to the 3-D-based face localization, but at a very high computational cost.

The face classification system takes as input the images resulting from the previous step and employs the EHMM classification method. The probability of the observations is calculated for every enrolled person. The person associated with the highest resulting probability score is identified as the person in the test image (see Section IV).

The weights w_1 and w_2 for the combination approaches (Section IV) were chosen experimentally. Multiple tests were performed on a set of test images, other than those employed for training, varying the value of w_1 in the range (0, 1), with w_1 + w_2 = 1. The selection is made in terms of minimizing the total equal error rate (EER). The EER values were plotted versus w_1; the resulting curve exhibits a global minimum for a specific value of w_1. This value is used for the combination of authentication scores resulting from the use of different modalities and is kept fixed for all tests.

For each pair of test images, five tests are performed, corresponding to the different modality combinations: depth, color, intensity, combination of depth and color, and combination of intensity and depth. The experimental results are described in detail in Section VI-D.

D. Evaluation Results

1) Performance Evaluation: In this section, we evaluate the performance of the proposed face authentication system in terms of receiver operating characteristic (ROC) diagrams and the equal error rate (EER), which are established methods for the comparison of face authentication algorithms.

Different operating points of biometric systems are commonly illustrated in a ROC diagram, where the false acceptance rate (FAR) values are plotted versus the corresponding false rejection rate (FRR) values. FAR is defined as the percentage of instances in which a nonauthorized individual is falsely accepted by the system, while FRR is defined as the percentage of instances in which an authorized individual is falsely rejected by the system. The (FAR, FRR) pairs calculated at a specific operating point of the system (a threshold on the maximum observation probability in our case) define a point on the ROC.

Comparison of different algorithms may be performed by comparing their ROC curves (the one that is closer to the lower left corner is the one demonstrating better performance) or by comparing their equal error rates. The EER corresponds to the FAR value at an operating point such that FAR = FRR. The operating point of a biometric system can often be manipulated by the setting of a threshold and is an application-dependent compromise between FAR and FRR. The EER is mostly a statistical measurement and represents a better measure of performance than either FAR or FRR quoted in isolation. In practice, however, biometric verification applications do not operate at the EER. One may bias the recognition system toward a larger FAR but a smaller FRR (user friendly) or a larger FRR but a smaller FAR (user unfriendly, high-security oriented).

Fig. 10. First face database. ROC curves of "i," "d," and "i + d" when training sets (a) TS1 and (b) TS3 are used. All image variation types are included in the test set. The 95% confidence interval of FAR and FRR is indicated with horizontal and vertical error bars, respectively.

In the test procedure, each individual is considered as an impostor against all remaining persons, who are considered eligible. Each user claims his/her own identity, while each impostor claims the identity of each real user in turn. In each claim, the observations corresponding to the test image are extracted and their probability is calculated given the EHMM of the user in the training set whose identity is claimed. Then, for each of these claims, a distance measure or matching score between the probe and the corresponding gallery images is calculated. The calculated distance measure is then compared to a threshold, and accordingly the claimed identity is verified or rejected. Thus, by varying this threshold, FAR and FRR values are computed, defining points on the ROC curve.
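The sketch below turns sets of genuine (user) and impostor claim scores into FAR/FRR pairs and an EER by sweeping the acceptance threshold, mirroring the test procedure just described; the score polarity (lower = better match) is an assumption consistent with the distance measure used here.

```python
import numpy as np

def roc_and_eer(genuine, impostor):
    """Compute FAR/FRR pairs and the EER from matching scores.

    `genuine` are scores of true-identity claims, `impostor` of false
    claims; lower scores mean better matches.  Sweeping the acceptance
    threshold traces the ROC curve; the EER is read off where FAR and FRR
    cross.
    """
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor <= t).mean() for t in thresholds])   # false accepts
    frr = np.array([(genuine > t).mean() for t in thresholds])     # false rejects
    i = np.argmin(np.abs(far - frr))
    return far, frr, 0.5 * (far[i] + frr[i])    # approximate equal error rate
```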

Each authentication test can be regarded as a binary event with outcome 1 for false acceptance and 0 for correct acceptance, where p is the probability of false acceptance and 1 − p that of correct acceptance (or of false rejection and correct rejection, respectively). If we consider the authentication tests as Bernoulli trials, where p is the true value of FAR or FRR and p̂ is the value of FAR or FRR estimated by experimental testing of the proposed technique, then the confidence interval of FAR and FRR is given by [27]

p̂ − z_(α/2) √(p̂(1 − p̂)/N) ≤ p ≤ p̂ + z_(α/2) √(p̂(1 − p̂)/N)                (4)

where N is the number of trials (user or impostor claims in our case) performed to estimate FAR and FRR, and z_(α/2) is the (1 − α/2)-percentile of the standard normal distribution with mean 0 and variance 1. The 95% confidence interval of FAR and FRR is indicated with horizontal and vertical error bars, respectively, in all ROC curves presented in Sections VI-D.2 and VI-D.3. Since the first database consists of 2818 images of 50 persons in total, the test procedure provides 2818 user claims and 2818 × 49 = 138 082 impostor claims. The second database consists of 20 people and 2244 images, providing 2244 user claims and 2244 × 19 = 42 636 impostor claims. For these values of N, we get a confidence interval of about 1% for an estimated FRR of 10% and 0.15% for an estimated FAR of 10% in the first database. The corresponding confidence intervals in the second database are 1.1% and 0.3%, respectively.
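For completeness, the normal-approximation interval in (4) can be evaluated as follows; the claim count in the example call is taken from the text above.

```python
import numpy as np
from scipy.stats import norm

def error_rate_confidence(p_hat, n_trials, alpha=0.05):
    """Confidence half-width for an estimated FAR or FRR, as in (4).

    Treats each claim as a Bernoulli trial; with n_trials claims and an
    estimated error rate p_hat, the normal-approximation half-width is
    z * sqrt(p_hat * (1 - p_hat) / n_trials).
    """
    z = norm.ppf(1.0 - alpha / 2.0)
    return z * np.sqrt(p_hat * (1.0 - p_hat) / n_trials)

# Example: an FRR of 10% estimated from 2818 user claims has a half-width
# of about 0.011, i.e., roughly 1%, the same order as the figures quoted above.
print(error_rate_confidence(0.10, 2818))
```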

The above methodology was used to provide a comparative evaluation among different techniques and for different facial variations. The approaches which utilize only color or intensity images represent the baseline of this evaluation. The performance of the approaches based on depth, on the combination of color and depth, and on the combination of intensity and depth was compared to this baseline to determine whether and to what extent the utilization of depth images can improve the performance of face authentication systems.

For ease of reference, the following abbreviations will be used throughout the rest of the paper: "c" for the color approach, "d" for depth, "i" for intensity, and "c + d" and "i + d" for the combination of color and depth and the combination of intensity and depth, respectively.

2) Experimental Results of the First Evaluation Database: In this section, we present the experimental results of the tests performed on the first evaluation database (see Section VI-A). Due to color miscalibration of the camera during recording sessions, only the methods that use intensity and/or depth information were tested on the first evaluation database.

Fig. 10(a) and (b) illustrate the ROC diagrams of approaches “i,” “d,” and “i+d” when training sets TS1 and TS3 are used, respectively. The depth-only approach has the worst performance in both cases, presenting higher error rates compared to the other two approaches. The combination of depth and intensity decreases the EER of the intensity approach by almost 4%.

To investigate the benefits of incorporating artificially generated images in the training set, multiple tests were performed employing synthetical images depicting various head rotations and illumination variations. The ROC diagrams of “i” and “i+d” that resulted from the use of different training sets are plotted in Fig. 11. As can be clearly seen, the use of synthetical images for training substantially improves the authentication rate.


Fig. 11. First face database. ROC curves of (a) “i” and (b) “i+d” resulting from the use of different training sets. All image variation types are included in the test set. The 95% confidence interval of FAR and FRR is indicated with horizontal and vertical error bars, respectively.

TABLE II
FIRST FACE DATABASE: EQUAL ERROR RATES ACHIEVED BY “i,” “d,” AND “i+d” FOR DIFFERENT FACIAL VARIATIONS. TS3 WAS USED

TABLE III
FIRST FACE DATABASE: EQUAL ERROR RATES ACHIEVED BY THE “i+d” APPROACH FOR DIFFERENT FACIAL VARIATIONS AND TRAINING SETS

Note how the use of different training sets gradually decreases the error rate when “i+d” is employed: TS1, which is comprised of original images only, presents the highest EER. Enrichment of the training set with rotated and translated views (TS2) decreases the error rate by 2.5%, while the additional use of artificially illuminated images (TS3) further enhances the performance of the face authentication system and leads to a total decrease of the EER by almost 4%.

Table II tabulates the EERs reported for test images depicting different appearance variations, when different modalities are used. By an inspection of Table II, it can be seen that while variations in pose, expression, and eyeglasses decrease the recognition quality only slightly, illumination variations seriously affect the recognition performance in every approach, but most noticeably in the approach that uses only intensity images. While the EER reported for all test images is about 7.5% (“i,” TS3), the EER for images depicting illumination variations is about 15.5%. The combination of 2-D and depth data is shown to improve the EER of the 2-D intensity approach whatever the variation of the test images. As expected, depth is robust to lighting variations, although in practice strong side spot light deteriorates the quality of depth images for this database. Depth data mainly suffers from pose variations and the use of eyeglasses. The projected pattern on the face creates reflections on the glasses, resulting in a distorted depth map, while pose variations introduce occlusions of parts of the face, which may not be fully compensated.

Table III tabulates the EERs achieved for test images depicting different facial variations, when different training sets are used. By inspection of the reported EERs, it can be seen that the proposed enrichment procedure benefits authentication not only for images depicting pose and illumination variations, but also for frontal neutral views, varying expressions, and the presence of glasses. TS3 produced the best results among all training sets used.

The proposed combination of 3-D and 2-D data, along with the enrichment of the training set with synthetical views, results in a decrease of the EER by almost 2% for frontal views, 4% for poses, 2% for expressions, 7% for images depicting illumination variations, and almost 4% for all images (“i+d,” TS3 versus “i,” TS1).


Fig. 12. First face database: Cumulative recognition rates versus rank achieved by “i” and “i+d”. Training sets TS1 and TS3 were used. All image variation types are included in the test set.

Fig. 13. Second face database. ROC curves of “c,” “i,” “d,” “c+d,” and “i+d,” when training sets (a) TS1 and (b) TS3 are used. All image variation types are included in the test set. The 95% confidence interval of FAR and FRR is indicated with horizontal and vertical error bars, respectively.

Up to this point, the proposed system was tested only for face authentication purposes. Concluding this section, we study the performance of the proposed approaches for face recognition. Fig. 12 shows the percentage of test images correctly identified within the n best matches, for the “i” and “i+d” approaches, when TS1 and TS3 are employed for the training of the EHMMs. It clearly demonstrates the superiority of the combination approach followed by enrichment with synthetical images over the intensity approach that uses only original images, as do most state-of-the-art algorithms. It is interesting to notice the increase of almost 10% between (“i,” TS1) and (“i+d,” TS3) depicted in Fig. 12.
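The cumulative (rank-n) recognition rates plotted in Fig. 12 can be computed from an identification score matrix as sketched below; the data layout and names are assumptions for illustration only.

```python
import numpy as np

def cumulative_match_rates(scores, gallery_ids, probe_ids, max_rank=10):
    """Rank-n recognition rates from a probe-by-gallery similarity matrix.

    `scores[i, j]` is the similarity of probe i to gallery subject j;
    a closed-set gallery is assumed. The interface is illustrative.
    """
    gallery_ids = np.asarray(gallery_ids)
    hits = np.zeros(max_rank)
    for i, true_id in enumerate(probe_ids):
        order = np.argsort(-scores[i])                      # best matches first
        rank = int(np.where(gallery_ids[order] == true_id)[0][0])
        if rank < max_rank:
            hits[rank:] += 1                                # correct at this rank and all larger ranks
    return hits / len(probe_ids)
```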

3) Experimental Results of the Second Evaluation Database: Color calibration was consistent among recording sessions for this database, which allowed the experimental evaluation of the combined use of color and depth images. Approaches “c,” “i,” “d,” “c+d,” and “i+d” were tested.

Detailed experimental results of all the tests performed on the second database are outlined below, following the structure of Section VI-D.2.

Approaches “c+d” and “i+d” were compared against “c,” “i,” and “d,” employing training sets TS1-TS3. The resulting ROC diagrams are plotted in Fig. 13. The depth-only approach has the worst performance, exactly as reported for the tests performed on the first database. The performance of the intensity and color approaches is relatively similar when the training set is comprised of original images (TS1). The use of additional artificially rotated and/or translated images during training (TS2, TS3) favors the intensity approach more than the color approach. In all cases, the combination approaches “c+d” and “i+d” distinctly improve the performance of the color and intensity approaches, by 4%–5% and 2%–3%, respectively.

Next, we explore the merits of using artificially generated images for the training. Fig. 14 shows the resulting ROC diagrams


Fig. 14. Second face database: ROC curves of (a) “c+d” and (b) “i+d,” resulting from the use of different training sets. All image variation types are included in the test set. The 95% confidence interval of FAR and FRR is indicated with horizontal and vertical error bars, respectively.

TABLE IV
SECOND FACE DATABASE: EQUAL ERROR RATES ACHIEVED BY “i,” “c,” “d,” “c+d,” AND “i+d” FOR DIFFERENT FACIAL VARIATIONS. TS1 WAS USED

TABLE V
SECOND FACE DATABASE: EQUAL ERROR RATES ACHIEVED BY “i+d” FOR DIFFERENT FACIAL VARIATIONS AND TRAINING SETS

of “c+d” and “i+d.” As can be clearly seen, the enhancement of the training set with synthetical novel views depicting rotations and translations of the head improves the authentication rate significantly. The improvement is obvious in the case of the “i+d” test, where a decrease of the EER of almost 4% is reported.

Table IV tabulates the EER values of test images depicting different facial variations. The combination of 2-D and depth data is shown to improve the EER of the 2-D approaches (color, intensity) significantly, whatever the facial variation of the test images. The improvement is particularly significant in the case of illumination variations. Note how such variations affect the authentication rate when the 2-D approach is used: while the total EER achieved by “c” is about 13%, in the case of illumination variations, the EER increases dramatically to 23.58%. The use of depth decreases the EER of illumination images from 23.58% (“c,” TS1) to 15.46% (“c+d,” TS1). The use of additional synthetical images for training reaches an EER of 13.15% (“c+d,” TS3). Note, also, that “c+d” is better than “i+d” for all facial variations, except for illumination variations. In general, intensity is more robust than color to such variations. Table V tabulates the EERs achieved by “i+d,” when different training sets are used. TS3 produced the best results among all training sets used.

Concluding, we study the performance of the proposed approaches for face recognition. Fig. 15 shows the percentage of test images correctly identified within the n best matches, for the combination approaches versus the commonly used 2-D approaches. It can be clearly seen that the combination approaches outperform the approaches that use only color or intensity. The recognition rate for all test images increases from 73% (“c,” TS1) and 67% (“i,” TS1) to 77% (“c+d,”


Fig. 15. Second face database: Cumulative recognition rates versus rank achieved by the 2-D approaches (“c” and “i”) versus the combination approaches (“c+d” and “i+d”). Training sets TS1 and TS3 were used. All image variation types are included in the test set.

TABLE VI
EQUAL ERROR RATES ACHIEVED FOR THE SUBSETS OF THE FIRST AND SECOND FACE DATABASE (DB1 AND DB2)

TS1) and 74% (“i+d,” TS1), respectively. The additional use of synthetical views enhances the performance of the combination classifier even further, achieving recognition rates of 79.2% (“c+d,” TS3) and 77.3% (“i+d,” TS3), respectively.

4) Comparison Between the Two Evaluation Databases: Experimental results presented in Sections VI-D.2 and VI-D.3 clearly establish the adequacy of the proposed scheme on both databases. Tests performed on the first database provide solid experimental results, which prove that the use of depth enhances the performance of 2-D classifiers when tested on a database comprised of many subjects and containing images depicting not only facial variations but also variations over time. The second database was recorded using different camera settings, which improved the quality of the depth maps (fewer missing pixels, higher accuracy) and allowed the experimental testing of “c” and “c+d.” In this section, we provide experimental results to support our claim that the higher the quality of the 3-D shape data, the better the performance of the combination classifier, as well as the intuitive notion that 3-D shape data can be discriminatory provided its representation is accurate enough.

In order to establish this point, all other database variations (time variation, number of subjects, and number of images) had to be eliminated. A subset of the first database containing 20 subjects and two sessions was used for this particular experiment. The time interval between the two sessions is the same in both databases. Table VI tabulates the EERs reported for the two subdatabases and all modalities, when various training sets are used. It can be seen that the improved quality of the depth maps in the second database yields better verification efficiency compared to the EERs of the first database.

5) Evaluation of the Performance of the Proposed Algorithms: In this section, we briefly discuss and evaluate the accuracy of the proposed authentication scheme following the method presented in [28]. First, we calculate the number of tests that gives statistically significant results. For the experimental evaluation of the proposed method, we used two databases. As already mentioned in Section VI-D.1, the first database consists of 50 persons and 2818 pairs of test images, providing 2818 user claims and 138 082 impostor claims, while the second database consists of 20 persons and 2244 images, providing 2244 user and 42 636 impostor claims.

Following the example and notation of [28], let us denote by $z_a$ the $a$-percentile of the standard Gaussian distribution with zero mean and unit variance. The desired EER is set to 6%, which is rather strict considering that the desired EER for the


XM2VTS database, which is comprised of neutral frontal images with no significant illumination or background variations, is set to 6% as well. The confidence level is $\gamma = 95\%$ and the error margin $\delta$ is set to 1%. The authentication tests can be considered as Bernoulli trials. In this case, we can assume with confidence $\gamma$ that the minimum number of client claims $N_C$ which ensures that the absolute value of the difference between the expected value of the FRR and its empirical value is less than $\delta$ is given by

$$N_C \ge z_{(1+\gamma)/2}^{2}\,\frac{P_{FR}(1-P_{FR})}{\delta^{2}} \qquad (5)$$

Substituting the parameters of (5) with the aforementioned values, we get $N_C \approx 2167$, a requirement satisfied by the user claims available in both databases.
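As a numerical check of (5) as reconstructed above (and only under that assumption), the minimum number of client claims can be computed as follows.

```python
import math

def min_client_claims(p_fr=0.06, delta=0.01, z=1.96):
    """Minimum client claims so that the empirical FRR deviates from its
    expected value by less than delta with ~95% confidence (eq. (5) as
    reconstructed above; the formula itself is an assumption)."""
    return math.ceil(z ** 2 * p_fr * (1.0 - p_fr) / delta ** 2)

print(min_client_claims())  # 2167, below the 2818 and 2244 user claims available
```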

Next, we estimate the number of test impostor claims $N_I$ which is sufficient for ensuring that the deviation of the FAR estimate from its expected value is smaller than the corresponding error margin, according to [28]

(6)

where $P_I$ is the probability that one impostor is falsely accepted as one client and $C$ is the number of clients. Substituting the corresponding values of $P_I$ and $C$ for each database into (6) yields the minimum required number of test impostor claims for the selected error margin. Both demands on the estimated minimum numbers of test impostor claims are met by the tests performed.

VII. DISCUSSION

In this paper, a face authentication system integrating 3-D range data and 2-D color or grayscale images is presented. Novel algorithms for face detection and localization are proposed that exploit depth information to achieve robustness under background clutter, occlusions, face pose, and illumination variations. Face authentication is based on the well-known EHMM method, which is also applied to range images. Unlike most methods in the 3-D face recognition literature, which perform recognition or authentication using a 3-D model of the human head derived from range data, the proposed method simplifies the processing of 3-D data by regarding it as a 2-D image, with pixel values corresponding to the distance of the surface of the human head from the camera plane.
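As an illustration of treating range data as an ordinary image, a depth buffer can be normalized to an 8-bit map in a few lines; the clipping range and the handling of invalid pixels below are assumptions, not the paper's exact preprocessing.

```python
import numpy as np

def depth_to_image(depth_mm, z_near, z_far):
    """Map per-pixel distances from the camera plane to an 8-bit depth image.

    `depth_mm` is an H x W array of distances with NaN marking missing
    measurements. The clipping range and the choice to map nearer surfaces to
    brighter values are illustrative assumptions.
    """
    d = np.clip(depth_mm, z_near, z_far)
    img = 255.0 * (z_far - d) / (z_far - z_near)     # nearer surfaces appear brighter
    img = np.where(np.isnan(depth_mm), 0.0, img)     # missing pixels mapped to background
    return img.astype(np.uint8)
```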

From an inspection of the experimental results presented in Section VI-D, it can be concluded that the verification performance of the approach that uses only depth maps is no better than that of the standard 2-D approaches. The limited resolution and accuracy of the range image acquisition system was shown to be partly responsible for the low performance of the depth-only approach. Occlusions, missing pixels, and artifacts in depth maps hinder face localization and authentication, especially when they appear near important facial features, such as the nose and the mouth. Despite the rather low performance of the depth-only approach, the combination of depth and color (or intensity) images decreases the EER of the 2-D approaches significantly.
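The exact way in which the depth and 2-D matching scores are combined is not restated in this section; purely for illustration, a generic score-level (late) fusion rule could look like the sketch below, where the z-score normalization and the equal weighting are assumptions rather than the scheme actually used in the paper.

```python
import numpy as np

def fuse_scores(scores_2d, scores_depth, w_depth=0.5):
    """Weighted-sum fusion of 2-D and depth matching scores after z-normalization.

    A generic late-fusion sketch: the normalization, the weight, and the sum
    rule itself are illustrative assumptions, not the paper's exact scheme.
    """
    def znorm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean()) / (s.std() + 1e-9)
    return (1.0 - w_depth) * znorm(scores_2d) + w_depth * znorm(scores_depth)
```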

The use of synthetic images presenting various rotations enhances the authentication of faces depicting poses other than frontal, as well as the verification of frontal views. The latter is justified by the fact that even images assumed to be frontal may depict a small rotation. The use of artificial translations also compensates for any possible inaccuracies of the face localization algorithm. Experimental results presented in Tables III and V provide evidence to support this conclusion.
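To make the enrichment step concrete, the crude sketch below splats a registered intensity/depth pair into an approximately rotated view; it is a deliberately simplified stand-in (orthographic projection, no occlusion handling or hole filling) for the synthetic-view generation performed on the reconstructed 3-D face surface, and all names are illustrative.

```python
import numpy as np

def splat_rotated_view(intensity, depth, yaw_deg):
    """Crude synthetic view: rotate the (x, depth) coordinates of each pixel
    about the vertical axis and splat the intensity values to the new columns.

    Orthographic projection, no occlusion handling, no hole filling; intended
    only to illustrate the idea of generating rotated training views.
    """
    h, w = depth.shape
    rows, cols = np.mgrid[0:h, 0:w]
    x = cols.astype(float) - w / 2.0
    z = depth.astype(float) - np.nanmedian(depth)     # center depth so small yaws give small shifts
    z = np.nan_to_num(z)                              # missing depth treated as lying on the median plane
    th = np.deg2rad(yaw_deg)
    x_rot = np.cos(th) * x + np.sin(th) * z           # rotate (x, z) about the y axis
    new_cols = np.clip(np.round(x_rot + w / 2.0).astype(int), 0, w - 1)
    out = np.zeros_like(intensity)
    out[rows.ravel(), new_cols.ravel()] = intensity.ravel()
    return out
```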

From an inspection of the experimental results reported for the second database, it can be seen that the approaches using color or intensity are especially sensitive to illumination variations, the color approach being more sensitive than the intensity approach. Although the enhancement of the training set with synthetic illumination images improves the verification capability of the proposed algorithms, the reported EERs are still higher than those reported for frontal and expression images, or pose variations. The proposed method may enhance the authentication efficiency, but unfortunately, modeling illumination sources is extremely challenging and was shown to be only partially successful. Automatic recovery of the illumination parameters by inverse rendering (using color and associated depth images) is currently under investigation.

Two issues were mainly addressed by the present work: the use of depth data for face localization and classification, and the enrichment of the training set with artificially generated views based on the 3-D structure of the human face. Both proposed methods contribute to the enhancement of the verification performance, as demonstrated by the results presented in Section VI. The combined use of depth data and 2-D intensity or color images yields better results in terms of verification efficiency compared to the use of 2-D images alone, which is most common among state-of-the-art face authentication systems. Additionally, the use of synthetical images further improves the authentication rate of all approaches, whether they use 2-D data alone or a combination of 2-D and 3-D data.

In general, the 3-D acquisition system described in Section II can be improved further, so that the quality and resolution of range images are enhanced. Investigation of alternative fusion techniques and, particularly, data fusion at an earlier stage, use of more advanced surface features, and exploitation of illumination compensation techniques are among our future research plans.

Since the evaluation on both databases demonstrates that the performance of the 2-D system is considerably improved by the utilization of 3-D depth data, we argue that the depth approach could be incorporated into any state-of-the-art 2-D system and that such a combined system would be expected to yield superior performance.

REFERENCES

[1] W. Zhao, R. Chellappa, A. Rosenfeld, and P. J. Phillips, “Face recognition: A literature survey,” Univ. Maryland, College Park, MD, Tech. Rep. CS-TR4167, Oct. 2000.

[2] F. Forster, P. Rummel, M. Lang, and B. Radig, “The HiScore camera: A real time three dimensional and color camera,” in Proc. IEEE Int. Conf. Image Processing, vol. 2, Oct. 2001, pp. 598–601.

[3] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision. Englewood Cliffs, NJ: Prentice-Hall, 1998.


[4] J. C. Lee and E. Milios, “Matching range images of human faces,” in Proc. 3rd IEEE Int. Conf. Image Processing, Dec. 1990, pp. 722–726.

[5] H. T. Tanaka, M. Ikeda, and H. Chiaki, “Curvature-based face surface recognition using spherical correlation. Principal directions for curved object recognition,” in Proc. 3rd IEEE Int. Conf. Automatic Face Gesture Recognition, 1998, pp. 372–377.

[6] G. G. Gordon, “Face recognition based on depth and curvature features,” in Proc. Conf. Computer Vision Pattern Recognition, 1992, pp. 808–810.

[7] T. K. Kim, S. C. Kee, and S. R. Kim, “Real-time normalization and feature extraction of 3-D face data using curvature characteristics,” in Proc. 10th IEEE Int. Workshop Robot Human Interactive Communication, 2001, pp. 74–79.

[8] C.-S. Chua, F. Han, and Y.-K. Ho, “3-D human face recognition using point signature,” in Proc. 4th IEEE Int. Conf. Automatic Face Gesture Recognition, 2000, pp. 233–238.

[9] Y. Wang, C.-S. Chua, Y.-K. Ho, and Y. Ren, “Integrated 2-D and 3-D images for face recognition,” in Proc. 11th Int. Conf. Image Analysis Processing, 2001, pp. 48–53.

[10] S. Tsutsumi, S. Kikuchi, and M. Nakajima, “Face identification using a 3-D gray-scale image - a method for lessening restrictions on facial directions,” in Proc. 3rd IEEE Int. Conf. Automatic Face Gesture Recognition, 1998, pp. 306–311.

[11] K. I. Chang, K. W. Bowyer, and P. J. Flynn, “Face recognition using 2-D and 3-D facial data,” in Proc. ACM Workshop on Multimodal User Authentication, Santa Barbara, CA, Dec. 2003, pp. 25–32.

[12] M.-H. Yang, D. J. Kriegman, and N. Ahuja, “Detecting faces in images: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 1, pp. 34–58, Jan. 2002.

[13] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 23–38, Jan. 1998.

[14] R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, “Face detection in color images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 696–706, May 2002.

[15] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Syst., Man, Cybern. A, vol. 9, no. 1, pp. 62–66, Jan. 1979.

[16] N. Jojic, B. Brumitt, B. Meyers, S. Harris, and T. Huang, “Detection and estimation of pointing gestures in dense disparity maps,” in Proc. 4th IEEE Int. Conf. Automatic Face Gesture Recognition, 2000, pp. 468–475.

[17] S. Malassiotis and M. G. Strintzis, “Real-time head tracking and 3-D pose estimation from range data,” in Proc. IEEE Int. Conf. Image Processing, vol. 2, Barcelona, Spain, Sept. 2003, pp. 859–862.

[18] D. Reisfeld, H. Wolfson, and Y. Yeshurun, “Context free attentional operators: The generalized symmetry transform,” Int. J. Comput. Vis., vol. 14, no. 2, pp. 119–130, Mar. 1995.

[19] M. A. Turk and A. P. Pentland, “Face recognition using eigenfaces,” in Proc. Conf. Computer Vision Pattern Recognition, 1991, pp. 586–591.

[20] K. Etemad and R. Chellappa, “Face recognition using discriminant eigenvectors,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 4, 1996, pp. 2148–2151.

[21] W. Zhao, R. Chellappa, and A. Krishnaswamy, “Discriminant analysis of principal components for face recognition,” in Proc. 3rd IEEE Int. Conf. Automatic Face Gesture Recognition, 1998, pp. 336–341.

[22] B. Heisele, P. Ho, and T. Poggio, “Face recognition with support vector machines: Global versus component-based approach,” in Proc. IEEE Int. Conf. Computer Vision, vol. 2, 2001, pp. 688–694.

[23] A. V. Nefian and M. H. Hayes III, “An embedded HMM-based approach for face detection and recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 6, 1999, pp. 3553–3556.

[24] ——, “Maximum likelihood training of the embedded HMM for face detection and recognition,” in Proc. Int. Conf. Image Processing, vol. 1, 2000, pp. 33–36.

[25] L. Wiskott, J. M. Fellous, N. Kuiger, and C. von der Malsburg, “Face recognition by elastic bunch graph matching,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 775–779, Jul. 1997.

[26] M. Lades, J. C. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R. P. Wurtz, and W. Konen, “Distortion invariant object recognition in the dynamic link architecture,” IEEE Trans. Comput., vol. 42, no. 3, pp. 300–311, Mar. 1993.

[27] C. Kotropoulos, A. Tefas, and I. Pitas, “Frontal face authentication using morphological elastic graph matching,” IEEE Trans. Image Process., vol. 9, no. 4, pp. 555–560, Apr. 2000.

[28] ——, “Frontal face authentication using discriminating grids with morphological feature vectors,” IEEE Trans. Multimedia, vol. 2, no. 1, pp. 14–26, Mar. 2000.

[29] A. Nefian, “A hidden Markov model-based approach for face detection and recognition,” Ph.D. dissertation, Georgia Inst. Technol., Atlanta, GA, 1999.

[30] J. Kittler, M. Ballette, J. Czyz, F. Roli, and L. Vandendorpe, “Enhancing the performance of personal identity authentication systems by fusion of face verification experts,” in Proc. IEEE Int. Conf. Multimedia Expo, vol. 2, 2002, pp. 581–584.

[31] H. Murase and S. K. Nayar, “Learning by a generation approach to appearance-based object recognition,” in Proc. 13th Int. Conf. Pattern Recognition, vol. 1, Aug. 1996, pp. 24–29.

[32] J. Ruppert, “A Delaunay refinement algorithm for quality 2-D mesh generation,” J. Algorithms, vol. 18, no. 3, pp. 548–585, May 1995.

[33] S. Z. Li and A. K. Jain, Eds., Handbook of Face Recognition. New York: Springer-Verlag, 2004.

[34] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 681–685, Jun. 2001.

[35] V. Blanz, S. Romdhani, and T. Vetter, “Face identification across different poses and illuminations with a 3-D morphable model,” in Proc. 5th Int. Conf. Automatic Face Gesture Recognition, May 2002, pp. 192–197.

[36] D. Beymer and T. Poggio, “Face recognition from one example view,” in Proc. 5th Int. Conf. Computer Vision, Jun. 1995, pp. 500–507.

[37] A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenspaces for face recognition,” in Proc. Conf. Computer Vision Pattern Recognition, Jun. 1994, pp. 84–91.

[38] D. B. Graham and N. M. Allinson, “Face recognition from unfamiliar views: Subspace methods and pose dependency,” in Proc. 3rd Int. Conf. Automatic Face Gesture Recognition, Apr. 1998, pp. 348–353.

[39] J. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes, Computer Graphics, Principles, and Practice, 2nd ed. Reading, MA: Addison-Wesley, 1991.

[40] P. N. Belhumeur and D. J. Kriegman, “What is the set of images of an object under all possible illumination conditions?,” Int. J. Comput. Vis., vol. 28, no. 3, pp. 1–16, 1998.

[41] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 643–660, Jun. 2001.

[42] R. Ishiyama and S. Sakamoto, “Geodesic illumination basis: Compensating for illumination variations in any pose for face recognition,” in Proc. 16th Int. Conf. Pattern Recognition, vol. 4, 2002, pp. 297–301.

[43] R. Basri and D. Jacobs, “Lambertian reflectance and linear subspaces,” in Proc. IEEE Int. Conf. Computer Vision, vol. 2, 2001, pp. 383–390.

[44] L. Zhang and D. Samaras, “Face recognition under variable lighting using harmonic image exemplars,” in Proc. Conf. Computer Vision Pattern Recognition, vol. 1, 2003, pp. 9–25.

[45] A. J. Mansfield and J. L. Wayman, “Best practices in testing and reporting performance of biometric devices v. 2.01,” Centre for Mathematics and Scientific Computing, National Physical Laboratory, U.K., Evaluation Rep., Aug. 2002.

[46] P. J. Phillips, H. Moon, P. Rauss, and S. A. Rizvi, “The FERET evaluation methodology for face-recognition algorithms,” in Proc. Conf. Computer Vision Pattern Recognition, 1997, pp. 137–143.

[47] B. Moghaddam and A. Pentland, “Probabilistic visual learning for object detection,” in Proc. 5th Int. Conf. Computer Vision, Jun. 1995, pp. 786–793.

Filareti Tsalakanidou received the Diploma in electrical and computer engineering from the Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki, Greece, in 2000. She is currently pursuing the Ph.D. degree in the Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, where she holds research and teaching assistantship positions.

She is a Graduate Research Assistant with the Informatics and Telematics Institute, Centre for Research and Technology Hellas, Thessaloniki. Her research interests include face and gesture recognition, computer vision, pattern recognition, and image processing.

Ms. Tsalakanidou is a member of the Technical Chamber of Greece.


Sotiris Malassiotis was born in Thessaloniki, Greece, in 1971. He received the B.S. and Ph.D. degrees in electrical engineering from the Aristotle University of Thessaloniki, in 1993 and 1998, respectively.

From 1994 to 1997, he was conducting research in the Information Processing Laboratory, Aristotle University of Thessaloniki. He is currently a Senior Researcher in the Informatics and Telematics Institute, Centre for Research and Technology Hellas, Thessaloniki. He has participated in several European and national research projects. He is the author of more than ten articles in refereed journals and more than 20 papers in international conferences. His research interests include stereoscopic image analysis, range image analysis, pattern recognition, and computer graphics.

Michael G. Strintzis (S'68–M'70–SM'80–F'03) received the Diploma in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1967 and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, in 1969 and 1970, respectively.

He joined the Electrical Engineering Department, University of Pittsburgh, Pittsburgh, PA, where he served as an Assistant Professor from 1970 to 1976 and an Associate Professor from 1976 to 1980. During that time, he worked in the area of stability of multidimensional systems. Since 1980, he has been a Professor of electrical and computer engineering at the Aristotle University of Thessaloniki, Thessaloniki, Greece. He has worked in the areas of multidimensional imaging and video coding. Over the past ten years, he has authored over 100 journal publications and over 200 conference presentations. In 1998, he founded the Informatics and Telematics Institute, currently part of the Centre for Research and Technology Hellas, Thessaloniki.

Dr. Strintzis was awarded the Centennial Medal of the IEEE in 1984 and the Empirikeion Award for Research Excellence in Engineering in 1999.