
Face Recognition in Low-Resolution Videos Using Learning-Based Likelihood Measurement Model∗

Soma Biswas, Gaurav Aggarwal and Patrick J. Flynn
Department of Computer Science and Engineering,
University of Notre Dame, Notre Dame
{sbiswas, gaggarwa, flynn}@nd.edu

Abstract

Low-resolution surveillance videos with uncontrolled pose and illumination present a significant challenge to both face tracking and recognition algorithms. The considerable appearance difference between the probe videos and the high-resolution controlled images in the gallery acquired during enrollment makes the problem even harder. In this paper, we extend the simultaneous tracking and recognition framework [22] to address the problem of matching high-resolution gallery images with surveillance-quality probe videos. We propose using a learning-based likelihood measurement model to handle the large appearance and resolution difference between the gallery images and probe videos. The measurement model consists of a mapping which transforms the gallery and probe features to a space in which their inter-Euclidean distances approximate the distances that would have been obtained had all the descriptors been computed from good-quality frontal images. Experimental results on real surveillance-quality videos and comparisons with related approaches show the effectiveness of the proposed framework.

1. Introduction

The wide range of applications in law enforcement and security has made face recognition (FR) a very important area of research in computer vision and pattern recognition. The ubiquitous use of surveillance cameras for improved security has shifted the focus of face recognition from controlled scenarios to the uncontrolled environments typical of surveillance settings [17].

∗This research was funded by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the Army Research Laboratory (ARL). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of IARPA, the ODNI, the Army Research Laboratory, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Typically, the images or videos captured by surveillance systems exhibit non-frontal pose and uncontrolled illumination, in addition to low resolution due to the distance of the subjects from the cameras. On the other hand, good high-resolution images of the subjects may be available in the gallery from enrollment. This presents the challenge of matching gallery and probe images or videos which differ significantly in resolution, pose and illumination. In this paper, we consider the scenario in which the gallery consists of one or more high-resolution frontal images, while the probe consists of low-resolution videos with uncontrolled pose and illumination, as is typically obtained in surveillance systems.

Most of the research in video-based face recognition has focused on dealing with one or more challenges like uncontrolled pose, illumination, etc. [23], but there are very few approaches which simultaneously deal with all of these challenges. Some of the recent approaches which handle the resolution difference between the gallery and probe are either restricted to frontal images [6] or require videos for enrollment [2]. For video-based FR, a tracking-then-recognition paradigm is typically followed, in which the faces are first tracked and then used for recognition. But both tracking and recognition are very challenging for low-quality videos with low resolution and significant variations in pose and illumination.

In this paper, we extend the simultaneous tracking and recognition framework [22], which performs the two tasks of tracking and recognition in a single unified framework, to address these challenges. We propose using distance-learning based techniques for better modeling the appearance changes between the frames of the low-resolution probe videos and the high-resolution gallery images, for better recognition and tracking accuracy. Multidimensional Scaling [4] is used to learn a mapping from training images which transforms the gallery and probe features to a space in which their inter-Euclidean distances approximate the distances that would have been obtained had all the descriptors been computed from high-resolution frontal images.


We evaluate the effectiveness of the proposed approach on surveillance-quality videos from the MBGC data [16]. We observe that the proposed approach performs significantly better in terms of both tracking and recognition accuracy as compared to standard appearance modeling approaches.

The rest of the paper is organized as follows. An overview of related approaches is given in Section 2. The details of the proposed approach are provided in Section 3. The results of the experimental evaluation are presented in Section 4. The paper concludes with a brief summary and discussion.

2. Previous Work

In this section, we discuss the related work in the literature. For brevity, we will refer to high-resolution as HR and low-resolution as LR. There has been a considerable amount of work in general video-based FR addressing two kinds of scenarios: (1) both the gallery and probe are video sequences [11] [13] [10] [18], and (2) the probe videos are compared with one or multiple still images in the gallery [22]. For tracking and recognizing faces in real-world, noisy videos, Kim et al. [10] propose a tracker that adaptively builds a target model reflecting changes in appearance typical of a video setting. In the subsequent recognition phase, the identity of the tracked subject is established by fusing pose-discriminant and person-discriminant features over the duration of a video sequence. Stallkamp et al. [18] classify faces using a local appearance-based FR algorithm for real-time video-based face identification. The confidence scores obtained from each classification are progressively combined to provide an identity estimate for the entire sequence. Many researchers have also addressed the problem of video-based FR by treating the videos as image sets [20].

Most of the current approaches which address the problem of LR still-face recognition follow a super-resolution approach. Given an LR face image, Jia and Gong [8] propose directly computing a maximum likelihood identity parameter vector in the HR tensor space, which can be used for recognition and reconstruction of HR face images. Liu et al. [12] propose a two-step statistical modeling approach for hallucinating an HR face image from an LR input. The relationship between the HR images and their corresponding LR images is learned using a global linear model, and the residual high-frequency content is modeled by a patch-based non-parametric Markov network. Several other super-resolution techniques have also been proposed [5] [9]. The main aim of these techniques is to produce a high-resolution image from the low-resolution input using assumptions about the image content, and they are usually not designed from a matching perspective. A Multidimensional Scaling (MDS)-based approach has recently been proposed to improve matching performance for still LR images, but it does not deal with matching an HR gallery image with an LR probe video [3].

Recently, Hennings-Yeomans et al. [6] proposed an approach to perform super-resolution and recognition simultaneously. Using features from the face and super-resolution priors, they extract an HR template that simultaneously fits the super-resolution as well as the face-feature constraints. The formulation was extended to use multiple frames, and the authors showed that it can also be generalized to use multiple image formation processes, modeling different cameras [7]. But this approach assumes that the probe and gallery images are in the same pose, making it not directly applicable to more general scenarios. Arandjelovic and Cipolla [2] propose a generative model that separates the illumination and down-sampling effects for the problem of matching a face in an LR query video sequence against a set of HR gallery sequences. It is an extension of the Generic Shape-Illumination Manifold framework [1], which was used to describe the appearance variations due to the combined effects of facial shape and illumination. As noted in [7], a limitation of this approach is that it requires a video sequence at enrollment.

3. Proposed Approach

For matching LR probe videos having significant pose and illumination variations against HR frontal gallery images, we propose to use a learning-based appearance model in a simultaneous tracking and recognition framework.

3.1. Simultaneous Tracking and Recognition

First, we briefly describe the tracking and recognition framework [22], which uses a modified version of the CONDENSATION algorithm to track the facial features across the frames of the poor-quality probe video and to perform recognition. The filtering framework consists of a motion model which characterizes the motion of the subject in the video. The overall state vector of this unified tracking and recognition framework consists of an identity variable in addition to the usual motion parameters. The observation model determines the measurement likelihood, i.e., the likelihood of observing the particular measurement given the current state consisting of the motion and identity variables.

Motion Model: The motion model is given by the first-order Markov chain

$$\theta_t = \theta_{t-1} + u_t; \quad t \ge 1 \qquad (1)$$

Here affine motion parameters are used, so $\theta = (a_1, a_2, a_3, a_4, t_x, t_y)$, where $\{a_1, a_2, a_3, a_4\}$ are deformation parameters and $\{t_x, t_y\}$ are 2D translation parameters. $u_t$ is the noise in the motion model.


Identity equation: Assuming that the identity does not change as time proceeds, the identity equation is given by

$$n_t = n_{t-1}; \quad t \ge 1 \qquad (2)$$

Observation Model: Assuming that the transformed observation is a noise-corrupted version of some still template in the gallery, the observation equation can be written as

$$\mathcal{T}_{\theta_t}\{z_t\} = I_{n_t} + v_t; \quad t \ge 1 \qquad (3)$$

where $v_t$ is the observation noise at time $t$ and $\mathcal{T}_{\theta_t}\{z_t\}$ is a transformed version of the observation $z_t$. Here $\mathcal{T}_{\theta_t}\{z_t\}$ is composed of (1) an affine transform of $z_t$ using $\{a_1, a_2, a_3, a_4\}$, (2) cropping the region of interest at position $\{t_x, t_y\}$ with the same size as the still templates, and (3) performing zero-mean, unit-variance normalization.
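As an illustration, a minimal Python sketch of this three-step transformation is given below. The template size, the scipy-based warping, and the exact handling of the crop offset are our assumptions for illustration, not details taken from [22].

```python
import numpy as np
from scipy.ndimage import affine_transform

def transform_observation(frame, theta, template_shape=(48, 40)):
    """T_theta{z_t} of Eq. (3): affine-deform the frame, crop the region of
    interest at (t_x, t_y) to the template size, and z-score normalize.
    template_shape and the (row, col) offset convention are assumptions."""
    a1, a2, a3, a4, tx, ty = theta
    A = np.array([[a1, a2],
                  [a3, a4]])                 # deformation parameters
    # map output (template) coordinates back into the frame, starting at (ty, tx)
    patch = affine_transform(frame, A, offset=(ty, tx),
                             output_shape=template_shape, order=1)
    return (patch - patch.mean()) / (patch.std() + 1e-8)  # zero-mean, unit-variance
```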

In this modified version of the CONDENSATION algorithm, random samples are propagated on the motion vector while the samples on the identity variable are kept fixed. Although only the marginal distribution is propagated for motion tracking, the joint distribution is propagated for recognition purposes. This results in a considerable improvement in computation over propagating random samples on both the motion vector and the identity variable for large databases. The different steps of the simultaneous tracking and recognition framework are given in Figure 1. The mean of the Gaussian-distributed prior comes from the initial detector, and its covariance matrix is manually specified. Please refer to [22] for more details of the algorithm.
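The loop in Figure 1 can be summarized by the following compact Python sketch, assuming the transform_observation function above and a likelihood(patch, template) callable such as the learned likelihood of Section 3.3. Resampling of the motion particles is omitted for brevity, and all names are illustrative rather than taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def track_and_recognize(frames, gallery, likelihood, J=200,
                        theta0=np.zeros(6), prior_std=1.0, motion_std=0.05):
    """Sketch of the simultaneous tracking and recognition filter (Figure 1).
    gallery: list of N still templates; likelihood: p(z_t | n, theta)."""
    N = len(gallery)
    theta = theta0 + prior_std * rng.standard_normal((J, 6))  # Gaussian prior p(theta_0|z_0)
    w = np.full((J, N), 1.0)                                  # weights w_{0,n}^{(j)} = 1
    for z in frames:
        theta += motion_std * rng.standard_normal((J, 6))     # 1. Predict via Eq. (1)
        for j in range(J):
            patch = transform_observation(z, theta[j])        # T_theta{z_t}, Eq. (3)
            for n in range(N):
                w[j, n] *= likelihood(patch, gallery[n])      # 2. Update: alpha_{t,n}^j
        w /= w.sum()                                          # normalize over j and n
        # (3. Resample step omitted in this sketch)
    return int(w.sum(axis=0).argmax())                        # marginalize over theta_t -> probe id
```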

3.2. Traditional Likelihood Measurement

If there is no significant facial appearance difference between the probe frames and the gallery templates, a simple likelihood measurement like a truncated Laplacian is sufficient [22]. More sophisticated likelihood measurement models, like the probabilistic subspace density approach, are required to handle greater appearance differences between the probe and the gallery [22]. In this approach, the intra-personal variations are learned using the available gallery and one frame of the video sequences. Usually, surveillance videos have very poor resolution, in addition to large variations in pose and illumination, which results in a decrease in both tracking and recognition performance. Here we propose a multidimensional scaling (MDS)-based approach for computing the measurement likelihood, which better models the appearance difference between the gallery and probe, resulting in both better tracking and better recognition.

3.3. Learning-Based Likelihood Measurement

In this work, we use local SIFT features [14] at seven fiducial locations to represent a face (Figure 2). SIFT descriptors are fairly robust to modest variations in pose and resolution, and this kind of representation has been shown to be useful for matching facial images in uncontrolled scenarios.

Initialize a sample set $S_0 = \{\theta_0^{(j)}\}_{j=1}^{J}$ according to the prior distribution $p(\theta_0|z_0)$, which is assumed to be Gaussian. The particle weights for each subject $\{w_{0,n}^{(j)}\}_{j=1}^{J}$, $n = 1, \cdots, N$ are initialized to 1. $J$ and $N$ denote the number of particles and subjects respectively.

1. Predict: sample by drawing $\theta_t^{(j)}$ from the motion state transition probability $p(\theta_t|\theta_{t-1}^{(j)})$ and compute the transformed image $\mathcal{T}$ corresponding to the predicted sample.

2. Update: the weights using $\alpha_{t,n}^{j} = w_{t-1,n}^{(j)} \cdot p(z_t|n, \theta_t^{(j)})$ (measurement likelihood) for each subject in the gallery. The normalized weights are given by $w_{t,n}^{(j)} = \alpha_{t,n}^{j} / \sum_{n=1}^{N}\sum_{j=1}^{J} \alpha_{t,n}^{j}$. The measurement likelihood is learned from a set of HR training images (Section 3.3).

3. Resample: Particles for all subjects are re-weighted to obtain samples with new weights $w_{t,n}^{(j)} = w_{t,n}^{(j)}/w_t^{(j)}$, where the denominator is given by $w_t^{(j)} = \sum_{n=1}^{N} w_{t,n}^{(j)}$.

Marginalize over $\theta_t$ to obtain the weights for $n_t$ to obtain the probe id.

Figure 1. Simultaneous tracking and recognition framework [22].

However, the large variations in pose, illumination and resolution observed in surveillance-quality videos result in a significant decrease in recognition performance when the SIFT descriptors are used directly. The MDS-based approach transforms the SIFT descriptors extracted from the gallery/probe images to a space in which their inter-Euclidean distances approximate the distances had all the descriptors been computed from HR frontal images. The transformation is learned from a set of HR and corresponding LR training images.
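Assuming standard SIFT machinery, the descriptor extraction can be sketched as follows in Python with OpenCV; the keypoint scale and the stacking of the seven descriptors into a single vector are illustrative choices, not details specified in the paper.

```python
import cv2
import numpy as np

def face_descriptor(gray_face, fiducials, scale=8.0):
    """Stack SIFT descriptors computed at the seven fiducial locations
    (Figure 2). gray_face: 8-bit grayscale image; fiducials: list of
    (x, y) points; scale: assumed SIFT keypoint size."""
    sift = cv2.SIFT_create()
    kps = [cv2.KeyPoint(float(x), float(y), scale) for (x, y) in fiducials]
    _, desc = sift.compute(gray_face, kps)   # one 128-d descriptor per point
    return desc.reshape(-1)                  # 7 x 128 -> single feature vector
```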

Figure 2. SIFT features at fiducial locations used for representing the face.

Let the HR frontal images be denoted by $I^{(h,f)}$ and the LR non-frontal images by $I^{(l,p)}$. The corresponding SIFT-based feature descriptors are denoted by $x^{(h,f)}$ and $x^{(l,p)}$. Let $f : \mathbb{R}^d \rightarrow \mathbb{R}^m$ denote the mapping from the input feature space $\mathbb{R}^d$ to the embedded Euclidean space $\mathbb{R}^m$:

Page 4: Face Recognition in Low-Resolution Videos Using Learning-Based … · Face Recognition in Low-Resolution Videos Using Learning-Based Likelihood Measurement Model Soma Biswas, Gaurav

Figure 3. Flow chart showing the steps of the proposed algorithm.

$$f(x; \mathbf{W}) = \mathbf{W}^T \phi(x) \qquad (4)$$

Here $\phi(x)$ can be a linear or non-linear function of the feature vectors, and $\mathbf{W}$ is the matrix of weights to be determined. The goal is to simultaneously transform the feature vectors from $I_i^{(h,f)}$ and $I_j^{(l,p)}$ such that the Euclidean distance between the transformed feature vectors approximates $d_{i,j}^{(h,f)}$ (the distance if both images were frontal and high resolution).

Thus the objective function to be minimized is given by the distance-preserving term $J_{DP}$, which ensures that the distance between the transformed feature vectors approximates $d_{i,j}^{(h,f)}$:

$$J_{DP}(\mathbf{W}) = \sum_{i=1}^{N}\sum_{j=1}^{N} \left( q_{ij}(\mathbf{W}) - d_{i,j}^{(h,f)} \right)^2 \qquad (5)$$

Here $q_{ij}(\mathbf{W})$ is the distance between the transformed feature vectors of the images $I_i^{(h,f)}$ and $I_j^{(l,p)}$. An optional class separability term $J_{CS}$ can also be incorporated in the objective function to further facilitate discriminability:

$$J_{CS}(\mathbf{W}) = \sum_{i=1}^{N}\sum_{j=1}^{N} \delta(\omega_i, \omega_j)\, q_{i,j}^2(\mathbf{W}) \qquad (6)$$

This term tries to minimize the distance between feature vectors belonging to the same class [21]. Here $\delta(\omega_i, \omega_j) = 0$ when $\omega_i \ne \omega_j$ and 1 otherwise ($\omega_i$ denotes the class label of the $i$th image). Combining the above two terms, the transformation is obtained by minimizing the following objective function:

$$J(\mathbf{W}) = \lambda J_{DP}(\mathbf{W}) + (1-\lambda) J_{CS}(\mathbf{W}) \qquad (7)$$

The relative effect of the two terms in the objective function is controlled by the parameter $\lambda$. The iterative majorization algorithm [21] is used to minimize the objective function (7) and solve for the transformation matrix $\mathbf{W}$.
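To make the objective concrete, here is a minimal Python sketch that evaluates Eq. (7) with $\phi$ set to the identity (as in Section 4.2) and minimizes it with an off-the-shelf optimizer. The L-BFGS optimizer with numeric gradients is a stand-in for the iterative majorization algorithm the paper uses, and the data layout and function names are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def learn_mapping(X_hf, X_lp, labels, m=100, lam=0.8, seed=0):
    """Learn W of Eq. (4) by minimizing Eq. (7), with phi(x) = x.
    X_hf: N x d descriptors from HR frontal images; X_lp: N x d descriptors
    from the corresponding LR non-frontal images; labels: N class labels."""
    N, d = X_hf.shape
    D_target = cdist(X_hf, X_hf)                                # d_{i,j}^{(h,f)}
    delta = (labels[:, None] == labels[None, :]).astype(float)  # delta of Eq. (6)

    def J(w_flat):
        W = w_flat.reshape(d, m)
        Q = cdist(X_hf @ W, X_lp @ W)          # q_{ij}(W): HR-to-LR distances
        J_dp = np.sum((Q - D_target) ** 2)     # distance-preserving term, Eq. (5)
        J_cs = np.sum(delta * Q ** 2)          # class-separability term, Eq. (6)
        return lam * J_dp + (1 - lam) * J_cs   # combined objective, Eq. (7)

    rng = np.random.default_rng(seed)
    w0 = 0.01 * rng.standard_normal(d * m)
    # numeric gradients make this slow; practical only for small d and m
    res = minimize(J, w0, method="L-BFGS-B")
    return res.x.reshape(d, m)
```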

To compute the measurement likelihood, the SIFT descriptors of the gallery and the affine-transformed probe frame are mapped using the learned transformation $\mathbf{W}$, followed by computation of the Euclidean distance between the transformed features:

$$p(z_t|n_t, \theta_t) = \left\| \mathbf{W}^T \left[ \phi(\mathcal{T}_{\theta_t}\{z_t\}) - \phi(x_{n_t}) \right] \right\| \qquad (8)$$

Figure 3 shows a flow-chart of the proposed learning-based simultaneous tracking and recognition framework.
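A sketch of this likelihood computation, usable as the likelihood callable in the filter sketch of Section 3.1, might look as follows. Eq. (8) yields a distance, so the exponential map used here to turn it into a positive weight that decreases with distance is our assumption, not a detail from the paper.

```python
import numpy as np

def learned_likelihood(probe_desc, gallery_desc, W, sigma=1.0):
    """Measurement likelihood from the learned mapping W (Eq. 8).
    probe_desc: stacked SIFT descriptors of the affine-transformed probe
    patch; gallery_desc: stacked descriptors x_{n_t} of gallery subject n_t.
    phi is the identity here, as in the experiments (Section 4.2)."""
    dist = np.linalg.norm(W.T @ (probe_desc - gallery_desc))
    return np.exp(-dist / sigma)  # assumed monotone map from distance to weight
```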

4. Experimental Evaluation

In this section, we discuss in detail the experimental evaluation of the proposed approach.

4.1. Dataset Used

For our experiments, we use 50 surveillance-quality videos (each 40–100 frames, from 50 subjects) from the Multiple Biometric Grand Challenge (MBGC) [16] video challenge data for the probe videos.


Figure 4. Example frames from MBGC video challenge [16].

Figure 4 shows some sample frames from a video sequence. Since the MBGC video challenge data does not contain the high-resolution frontal still images needed to form the HR gallery set, we select images of the same subjects from the FRGC data, which has considerable subject overlap with the MBGC data. Figure 5 (top row) shows some sample gallery images from the dataset used, and the bottom row shows cropped face regions from the corresponding probe videos. We can see that there is a considerable difference in pose, illumination and resolution between the gallery images and the probe videos.

Figure 5. (Top) Example high-resolution gallery images; (Bottom) Cropped facial regions from the corresponding low-resolution probe videos.

4.2. Recognition and Tracking Accuracy

Here we report both the tracking and the recognition performance of the proposed approach. The proposed learning-based likelihood measurement model is compared with the following two approaches for computing the likelihood measurement [22]; minimal sketches of both baselines follow the list:

1. Truncated Laplacian likelihood: Here the likelihood measurement model is given by [22]

$$p(z_t|n_t, \theta_t) = \mathrm{LAP}\left( \| \mathcal{T}_{\theta_t}\{z_t\} - I_{n_t} \|; \sigma_1, \tau_1 \right) \qquad (9)$$

Here $\|\cdot\|$ is the absolute distance and

$$\mathrm{LAP}(x; \sigma, \tau) = \begin{cases} \sigma^{-1}\exp(-x/\sigma) & \text{if } x \le \tau\sigma \\ \sigma^{-1}\exp(-\tau) & \text{otherwise} \end{cases}$$

2. Probabilistic subspace density based likelihood: To handle significant appearance differences between the facial images in the gallery and probe, Zhou et al. [22] proposed using the probabilistic subspace density approach of Moghaddam et al. [15], due to its computational efficiency and high recognition accuracy. The available gallery and one video frame were used for constructing the intra-personal space (IPS). Using this approach, the measurement likelihood can be written as

$$p(z_t|n_t, \theta_t) = \mathrm{PS}\left( \mathcal{T}_{\theta_t}\{z_t\} - I_{n_t} \right) \qquad (10)$$

where

$$\mathrm{PS}(x) = \frac{\exp\left( -\frac{1}{2} \sum_{i=1}^{s} y_i^2/\lambda_i \right)}{(2\pi)^{s/2} \prod_{i=1}^{s} \lambda_i^{1/2}}$$

Here $\{\lambda_i, e_i\}_{i=1}^{s}$ are the top $s$ eigenvalues and the corresponding eigenvectors obtained by performing regular Principal Component Analysis [19] on the IPS, and $y_i = e_i^T x$ is the $i$th principal component of $x$.

We build upon the code provided on the authors' website (http://www.cfar.umd.edu/shaohua/sourcecodes.html). For all experiments, the kernel mapping $\phi$ is set to the identity (i.e., $\phi(x) = x$) to highlight just the performance improvement due to the proposed learning approach. Training is done using images from a separate set of 50 subjects. For the computation of the transformation matrix using the iterative majorization algorithm, we observe that the objective function decreases until around 20 iterations and then stabilizes. The value of the parameter $\lambda$ is set to 0.8, and the output dimension $m$ is set to 100. The number of particles for the particle filtering framework is taken to be 200.

The recognition performance of the proposed approach is shown in Table 1, along with comparisons against the two other likelihood models. Each of the three approaches labels a video as belonging to one of the subjects in the gallery, and the recognition rate is calculated as the percentage of correctly labeled videos. We see that the recognition performance of the proposed learning-based simultaneous tracking and recognition framework is considerably better than that of the other approaches, due to better modeling of the appearance difference between the gallery and the probe images.

Page 6: Face Recognition in Low-Resolution Videos Using Learning-Based … · Face Recognition in Low-Resolution Videos Using Learning-Based Likelihood Measurement Model Soma Biswas, Gaurav

Method                              Truncated Laplacian    Probabilistic subspace      Proposed
                                    likelihood             density based likelihood    approach
Rank-1 recognition accuracy         24%                    40%                         68%
Tracking accuracy (pixels/frame)    4.8                    5.8                         2.8

Table 1. Rank-1 recognition accuracy and tracking accuracy (pixels/frame) using the proposed approach. Comparisons with the other approaches are also provided.


To compute the tracking error, we manually marked three fiducial locations (the centers of the two eyes and the bottom of the nose) in every fifth frame of each video. For each probe video, we measured the difference between the manually marked ground-truth locations and the locations given by the tracker; the tracking error is given by the average difference in the fiducial locations, averaged over all the annotated frames. Figure 6 shows the tracking results for a few frames of a probe video using the proposed approach. Figure 7 shows the tracking error for the proposed approach, the truncated Laplacian-based likelihood, and the probabilistic subspace density-based likelihood model. We see that for 49 out of 50 videos, the proposed approach achieves a lower tracking error than the other approaches. The mean tracking error (in pixels) over all the probe videos for all the approaches is shown in Table 1.
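The error metric can be summarized in a few lines of Python; the array layout (annotated frames by three fiducials by two coordinates) and the use of the Euclidean distance per fiducial are our assumptions about how the difference is measured.

```python
import numpy as np

def tracking_error(tracked, ground_truth):
    """Mean tracking error (pixels/frame): distance between tracked and
    manually marked fiducials, averaged over fiducials and annotated frames.
    tracked, ground_truth: arrays of shape (num_frames, 3, 2)."""
    per_frame = np.linalg.norm(tracked - ground_truth, axis=2).mean(axis=1)
    return per_frame.mean()
```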

Figure 7. Average tracking accuracy of the proposed learning-based approach. Comparisons with the other approaches are also provided.

5. Summary and Discussion

In this paper, we consider the problem of matching faces in low-resolution surveillance videos against good high-resolution images in the gallery. Tracking and recognizing faces in low-resolution videos with considerable variations in pose, illumination, expression, etc. is a very challenging problem. Performing tracking and recognition simultaneously in a unified framework, as opposed to first performing tracking and then recognition, has been shown to improve both tracking and recognition performance. But simple likelihood measurement models like the truncated Laplacian, IPS, etc. fail to give satisfactory performance when there is a significant difference between the appearance of the gallery images and the faces in the probe videos. In this paper, we propose using a learning-based likelihood measurement model to improve both the recognition and tracking accuracy for surveillance-quality videos. In the training stage, a transformation is learned to simultaneously transform the features from the poor-quality probe images and the high-quality gallery images in such a manner that the distances between them approximate the distances had the probe videos been captured under the same conditions as the gallery images. In the testing stage, the learned transformation matrix is used to transform the features from the gallery images and the different particles, to compute the likelihood of each particle in the modified particle-filtering framework. Experiments on surveillance-quality videos show the usefulness of the proposed approach.

References

[1] O. Arandjelovic and R. Cipolla. Face recognition from video using the generic shape-illumination manifold. In European Conf. on Computer Vision, pages 27–40, 2006.

[2] O. Arandjelovic and R. Cipolla. A manifold approach to face recognition from low quality video across illumination and pose using implicit super-resolution. In IEEE International Conf. on Computer Vision, 2007.

[3] S. Biswas, K. W. Bowyer, and P. J. Flynn. Multidimensional scaling for matching low-resolution facial images. In IEEE International Conf. on Biometrics: Theory, Applications and Systems, 2010.

[4] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, Second Edition, New York, NY, 2005.

[5] B. Gunturk, A. Batur, Y. Altunbasak, M. Hayes, and R. Mersereau. Eigenface-domain super-resolution for face recognition. IEEE Trans. on Image Processing, 12(5):597–606, May 2003.

[6] P. Hennings-Yeomans, S. Baker, and B. Kumar. Simultaneous super-resolution and feature extraction for recognition of low-resolution faces. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 1–8, 2008.


Figure 6. A few frames showing the tracking results obtained using the proposed approach. Here only the region of the frames containing the person is shown for better visualization.

[7] P. Hennings-Yeomans, B. Kumar, and S. Baker. Recognition of low-resolution faces using multiple still images and multiple cameras. In IEEE International Conf. on Biometrics: Theory, Applications and Systems, pages 1–6, 2008.

[8] K. Jia and S. Gong. Multi-modal tensor face for simultaneous super-resolution and recognition. In IEEE International Conf. on Computer Vision, pages 1683–1690, 2005.

[9] K. Jia and S. Gong. Generalized face super-resolution. IEEE Trans. on Image Processing, 17(6):873–886, June 2008.

[10] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley. Face tracking and recognition with visual constraints in real-world videos. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 1–8, 2008.

[11] K. C. Lee, J. Ho, M. H. Yang, and D. Kriegman. Video-based face recognition using probabilistic appearance manifolds. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 313–320, 2003.

[12] C. Liu, H. Y. Shum, and W. T. Freeman. Face hallucination: Theory and practice. International Journal of Computer Vision, 75(1):115–134, 2007.

[13] X. Liu and T. Chen. Video-based face recognition using adaptive hidden markov models. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 340–345, 2003.

[14] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[15] B. Moghaddam. Principal manifolds and probabilistic subspaces for visual recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(6):780–788, June 2002.

[16] P. J. Phillips, P. J. Flynn, J. R. Beveridge, W. T. Scruggs, A. J. O'Toole, D. S. Bolme, K. W. Bowyer, B. A. Draper, G. H. Givens, Y. M. Lui, H. Sahibzada, J. A. Scallan, and S. Weimer. Overview of the multiple biometrics grand challenge. In International Conference on Biometrics, pages 705–714, 2009.

[17] P. J. Phillips, W. T. Scruggs, A. J. O'Toole, P. J. Flynn, K. W. Bowyer, C. L. Schott, and M. Sharpe. FRVT 2006 and ICE 2006 large-scale experimental results. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):831–846, 2010.

[18] J. Stallkamp, H. K. Ekenel, and R. Stiefelhagen. Video-based face recognition on real-world data. In IEEE International Conf. on Computer Vision, 2007.

[19] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

[20] R. Wang, S. Shan, X. Chen, and W. Gao. Manifold-manifold distance with application to face recognition based on image set. In IEEE Conf. on Computer Vision and Pattern Recognition, 2008.

[21] A. Webb. Multidimensional scaling by iterative majorization using radial basis functions. Pattern Recognition, 28(5):753–759, May 1995.

[22] S. K. Zhou, V. Krueger, and R. Chellappa. Probabilistic recognition of human faces from video. Computer Vision and Image Understanding, 91:214–245, 2003.

[23] W. Zhao, R. Chellappa, P. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.