
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Appearance-based Gaze Estimation using Visual Saliency

Yusuke Sugano, Yasuyuki Matsushita, and Yoichi Sato

Abstract—We propose a gaze sensing method using visual saliency maps that does not need explicit personal calibration. Our goal is to create a gaze estimator using only the eye images captured from a person watching a video clip. Our method treats the saliency maps of the video frames as the probability distributions of the gaze points. We aggregate the saliency maps based on the similarity in eye images to efficiently identify the gaze points from the saliency maps. We establish a mapping from the eye images to the gaze points by using Gaussian process regression. In addition, we use a feedback loop from the gaze estimator to refine the gaze probability maps to improve the accuracy of the gaze estimation. The experimental results show that the proposed method works well with different people and video clips and achieves a 3.5-degree accuracy, which is sufficient for estimating a user's attention on a display.

Index Terms—Gaze estimation, Visual attention, Face and gesture recognition.


1 INTRODUCTION

Gaze estimation is important for predicting human attention, and therefore can be used both to better understand human activities and to build interactive systems. There is a wide range of applications for gaze estimation, including market analysis of online content and digital signage, gaze-driven interactive displays, and many other human-machine interfaces.

In general, gaze estimation is achieved by analyzing the appearance of a person's eyes. There are two categories of camera-based remote sensing methods: model-based and appearance-based. Model-based methods use a geometric eye model and its associated features. Using specialized hardware such as multiple synchronized cameras and infrared light sources, they extract the geometric features of an eye to determine the gaze direction. Appearance-based methods, on the other hand, use the natural appearances of eyes observed from a commodity camera without requiring any dedicated hardware. Various implementations of camera-based gaze estimators have been proposed, including commercial products (see [1] for a recent survey).

One of the key challenges in previous gaze estimators is the need for explicit personal calibration to adapt to individual users. The users in these existing systems are always required to actively participate in calibration tasks by fixating their eyes on explicit reference points. Another problem that most estimation methods suffer from is calibration drift, and their calibration accuracy highly depends on the users and installation settings. An interactive local calibration scheme with, e.g., user feedback [2], is sometimes required in practical application systems to correct personal calibration errors.

• Y. Sugano and Y. Sato are with the Institute of Industrial Science, the University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo, 153-8505, Japan. E-mail: {sugano,ysato}@iis.u-tokyo.ac.jp

• Y. Matsushita is with Microsoft Research Asia, 13F, Building 2, No. 5 Dan Ling Street, Haidian District, Beijing, 100080, China. E-mail: [email protected]

In many scenarios, such active personal calibration is too restrictive as it interrupts natural interactions and makes unnoticeable gaze estimation impossible. Although the number of reference points for personal calibration can be reduced using specialized hardware such as multiple light sources [3], [4], [5] and stereo cameras [6], it still requires a user to actively participate in the calibration task.

It is also well known in the class of model-based approaches that the gaze direction can be approximately estimated as the direction of the optical axis without requiring personal calibration [7]. However, its offset from the visual axis, which corresponds to the actual gaze direction, can be as large as 5 degrees [1], [4], and the accuracy varies significantly between individuals. More importantly, such hardware-based attempts add a strong constraint to the application setting, and this naturally limits the user scenarios.

There are previous studies that aim at completely removing the need for explicit personal calibration processes. Yamazoe et al. use a simple eyeball model for gaze estimation and perform automatic calibration by fitting the model to the appearance of a user's eye while the user is moving his/her eyes [8]. In Sugano et al.'s method, in a similar spirit to [2], a user's natural mouse inputs are used for the incremental personal calibration of the appearance-based gaze estimation without any calibration instructions [9]. Both methods use only a monocular camera; however, these approaches still have some limitations. Yamazoe et al.'s approach suffers from inaccuracy due to the simplified eyeball model, and Sugano et al.'s approach can only be applied to interactive environments with user inputs.

Apart from these gaze estimation studies, computational models of visual saliency have been studied to estimate the visual attention on an image, which is computed in a bottom-up manner. In contrast to gaze estimation approaches that aim to determine where people's eyes actually look, visual saliency computes the image region that is likely to attract human attention.


Fig. 1. Illustration of our method. Our method uses saliency maps computed from video frames in a bottom-up manner to automatically construct a gaze estimator.

Biologically, a human tends to gaze at an image region with high saliency, i.e., a region containing unique and distinctive visual features compared to the surrounding regions. After Koch and Ullman's original concept [10] of visual saliency, various bottom-up computational models of visual saliency maps have been proposed in [11], [12], [13], [14], [15]. Experiments show that there is a correlation between bottom-up visual saliency and fixation locations [16]. However, the visual attention mechanism is not yet fully understood. It is already known that fixation prediction becomes much more difficult under natural dynamic scenes, in which high-level tasks and knowledge have a stronger influence on gaze control [17].

Gaze estimation (top-down) and visual saliency (bottom-up) models are closely related. Nonetheless, not many studies exist that bridge these two subjects. Kienzle et al. [18], [19] propose a method for learning the computational models of bottom-up visual saliency by using gaze estimation data. A visual saliency map is modeled in their work as a linear combination of Gaussian radial basis functions, and their coefficients are learned using a support vector machine (SVM). Judd et al. [20] and Zhao and Koch [21] also use this approach with different features and a larger database. The linear weights of low-level image features (e.g., color and intensity) and high-level features (e.g., a face detector) are learned via the SVM in [20]. In [21], the optimal feature weights are learned by solving a non-negative least squares problem using an active set method. These approaches learn accurate saliency models using gaze points. In contrast to these methods, our goal is to create a gaze estimator from a collection of visual saliency maps. To our knowledge, this is the first work using visual saliency as prior information for gaze estimation.

In this paper, we propose a novel gaze sensing method that uses computational visual saliency, as illustrated in Fig. 1. Our approach is based on the assumption that bottom-up visual saliency is correlated with actual gaze points. By computing the visual saliency maps from a video and relating them with the associated eye images of a user, our method automatically learns the mapping from the eye images to the gaze points. We aggregate the saliency maps based on the similarity of the eye images to produce reliable maps, which we call gaze probability maps in this paper, to handle the low prediction accuracy of raw saliency maps. Once the gaze probability maps are obtained, our method learns the relationship between the gaze probability maps and the eye images.

In addition, a feedback scheme optimizes the feature weights used to compute the visual saliency maps. The feedback loop enables us to further strengthen the correlation between the gaze probability maps and the eye images. From one point of view, our method closes the loop between bottom-up visual saliency and top-down gaze estimates; the visual saliency determines the likely locations of the gaze points, and the gaze points in return refine the computation of the visual saliency. We demonstrate our approach through extensive user testing and verify the effectiveness of the use of visual saliency for gaze estimation.

Our method takes a set of eye images recorded in synchronization with any video clip as the input. From such an input, our method automatically determines the relationship between the eye images and gaze directions. In addition, our method does not distinguish between the test data and training data, i.e., one dataset can be used for both the calibration and estimation at the same time. Therefore, when only the gaze estimates for a particular video clip are needed, a user only needs to watch the video clip once. Once the relationship is learned, our gaze estimator can be used in other application scenarios as long as the configuration between the camera and user remains unchanged. In this manner, the proposed framework leads to a gaze estimation technique that exempts the users from active personal calibration.

In general, there is a fundamental trade-off between accuracy and a system's portability. Our system aims at minimizing the hardware and calibration constraints to develop a fully ambient gaze estimation technique, which is a key factor for opening up a new class of attentive user interfaces [22], [23]. For example, to collect the gaze data over a film clip on a public display, the film creator may only have to place a camera to capture the eye images of the audience. Similarly, movie players on PCs can naturally obtain gaze data for media understanding without the users' notice. In addition, the calibrated gaze estimators can be used for gaze-based interactions. Our method can further enhance the calibration accuracy during the gaze estimation process by using the eye images as input. By closing the loop of calibration and estimation in this manner, this work aims at enhancing the approach of calibrating gaze estimators through daily activities [9].

The preliminary version of this work appeared in [24]. A closely related work was recently introduced by Chen and Ji [25]. They use the idea of using visual saliency maps for model-based gaze estimation of a person looking at still pictures. While Chen and Ji's approach achieves a higher level of accuracy and allows for free head movement, their results rely on a model-based setup with a longer recording time on a single image. In contrast, our system uses an appearance-based estimation and is built using only a monocular camera. While it is often discussed that gaze prediction from saliency maps is more reliable when using static photographs than when using video clips, our method avoids this problem via the aggregation of the saliency maps, which results in statistically accurate and stable gaze probability maps.


Fig. 2. Illustration of the proposed approach. Our method consists of four steps. The saliency extraction step computes saliency maps from the input video. The saliency aggregation step combines saliency maps to produce gaze probability maps. Using the gaze probability maps and associated average eye images, the estimator construction step learns the mapping from an eye image to a gaze point. A feedback loop is used to optimize the feature weights to improve the accuracy by using cross validation.


This paper is organized as follows. In Section 2, we describe the proposed gaze estimation method that auto-calibrates from the bottom-up saliency maps. Section 3 describes the feedback loop from the estimated gaze point to the saliency weight computation. This feedback loop is intended to bridge the gap between the top-down gaze point and the bottom-up visual saliency, and improves the gaze estimation accuracy. Finally, we validate the proposed method by conducting user tests in Section 4. Our results show that our method can achieve a 3.5-degree accuracy without needing any specialized hardware or explicit personal calibration processes.

2 GAZE ESTIMATION FROM SALIENCY MAPS

Our goal is to construct a gaze estimator without a calibration stage. Our method assumes a fixed head pose and fixed relative positions among the user's head, camera, and display. The term calibration indicates obtaining the mapping function from an eye image to a point in the display coordinates. The relationship between the eye images (input) and the gaze points (output) is expressed as a single regression function in an appearance-based gaze estimation, and our goal is to estimate the parameters of the gaze estimation function without using explicit training data.

The inputs to our system are $N$ video frames $\{I_1, \ldots, I_N\}$ and associated feature vectors $\{e_1, \ldots, e_N\}$ extracted from the eye images of a person who is watching a video clip with a fixed head position. The implementation details of the feature vector $e$ are described in Section 4.1, but our framework does not depend on specific image features. For presentation clarity, we denote $e$ simply as an eye image in this paper. In our setting, the eye images and video frames are synchronized; the $i$-th eye image $e_i$ is captured at the same time as frame $I_i$ is shown to the person. Using this dataset $\{(I_1, e_1), \ldots, (I_N, e_N)\}$, a gaze estimation function from an eye image $e^*$ to an unknown gaze point $g^*$ is built.

Our method consists of four steps: saliency extraction, saliency aggregation, estimator construction, and feature weight optimization, as illustrated in Fig. 2. Once the saliency maps are computed in the saliency extraction step, the saliency aggregation step produces gaze probability maps whose probability mass is more concentrated around the gaze point than that of the raw saliency maps. The average eye images are computed by clustering the eye images, and all the saliency maps are aggregated according to the eye image similarities to compute the gaze probability maps. Using the gaze probability maps and associated average eye images, the estimator construction step learns the mapping from an eye image to a gaze point by using a variant of Gaussian process regression. Our method further optimizes the feature weights that are used for the saliency computation by using the feedback loop. By optimizing the weights in a cross-validation manner, this fourth step improves the accuracy of the gaze estimator. The resulting gaze estimator outputs the gaze points for any eye image of the user. In the following subsections, we describe the details of the saliency extraction, aggregation, and estimator construction steps, and in Section 3 the feature weight optimization.

2.1 Saliency Extraction

This step extracts the visual saliency maps from the input video frames $\{I_1, \ldots, I_N\}$. As shown in Fig. 3, our method adopts six features to compute the saliency maps: five low-level features and one high-level feature.

Each frame $I$ is first decomposed into multiple feature maps $F$. We use commonly used feature channels, i.e., color, intensity, and orientation as the static features, and flicker and motion as the dynamic features. The intensity channel indicates the grayscale luminance, the two color channels are the red/green and blue/yellow differences, and the four orientation channels are the responses of 2D Gabor filters with orientations of 0, 45, 90, and 135 degrees, respectively.


Fig. 3. Computation of visual saliency maps. Our method uses six features to compute the saliency maps: face detection-based high-level saliency, and five low-level features obtained by using a graph-based saliency computation: color, intensity, orientation, flicker, and motion. $s^{(1)} \sim s^{(6)}$ show examples of computed saliency maps.

The flicker channel indicates the absolute intensity difference from the previous frame, and the four motion channels use the spatially shifted differences between the Gabor responses. The feature maps are computed at three levels of the image pyramid, which are 1/2, 1/4, and 1/8 of the original image resolution. As a result, 36 (3 levels × (1 intensity + 2 color + 4 orientation + 1 flicker + 4 motion)) feature maps $F$ are computed.

The saliency maps are then computed from the feature maps $F$ using Graph-Based Visual Saliency (GBVS) [14]. Computation in the GBVS algorithm is conducted in two stages: activation and normalization. Activation maps $A$ are first computed from the feature maps $F$ to locate the regions with prominent image features. Greater values are assigned to the pixels in activation maps $A$ where they have distinct values compared with their surrounding regions in the feature maps. In the GBVS algorithm, this computation is performed in the form of a steady-state analysis of a Markov chain $G_A$. Each node of $G_A$ corresponds to a pixel position in the feature maps $F$, and a transition probability $\Omega_a$ between nodes $(i, j)$ and $(p, q)$ is defined based on the dissimilarity between the two corresponding pixels in $F$ as

$$\Omega_a((i, j), (p, q)) \triangleq \Omega_d \, |F(i, j) - F(p, q)|, \qquad (1)$$

where $\Omega_d$ indicates the Gaussian weight evaluating the Euclidean distance between $(i, j)$ and $(p, q)$. In this way, the nodes (= pixels) with a higher dissimilarity to their surroundings have higher transition probabilities. Therefore, by iteratively computing the equilibrium distribution $d_a$ (a raster-scanned vector form of $A$) of $G_A$ that satisfies

$$\boldsymbol{\Omega}_a d_a = d_a, \qquad (2)$$

where $\boldsymbol{\Omega}_a$ is the transition probability matrix consisting of $\Omega_a$, the salient pixels in $F$ obtain larger values in $A$.

Since the resulting activation maps often have many insignificant peaks, the GBVS algorithm further normalizes them to suppress the local maxima. Using the computed activation maps $A$, a Markov chain $G_N$ is defined in a similar way with a transition probability $\Omega_n$ as

$$\Omega_n((i, j), (p, q)) \triangleq \Omega_d \, |A(p, q)|. \qquad (3)$$

By computing the equilibrium distribution of $G_N$ as described above, the resulting maps are concentrated so that they have fewer but more prominent peaks. These normalized activation maps are averaged within each channel, and as a result, five low-level saliency maps $s^{(1)}, \ldots, s^{(5)}$ are computed.
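To make the two-stage computation concrete, the following is a minimal dense-matrix sketch of Eqs. (1)-(3), not the authors' implementation: the Gaussian width `sigma`, the power-iteration solver, and the toy map size are illustrative assumptions.

```python
import numpy as np

def equilibrium(P, iters=200):
    """Equilibrium distribution of a column-stochastic transition matrix via power iteration (Eq. 2)."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d = P @ d
        d /= d.sum()
    return d

def gbvs_activation_and_normalization(F, sigma=3.0):
    """Simplified GBVS on a single small feature map F: activation (Eqs. 1-2), then normalization (Eq. 3)."""
    h, w = F.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    # Omega_d: Gaussian weight on the Euclidean distance between pixel positions
    omega_d = np.exp(-((pos[:, None] - pos[None, :]) ** 2).sum(-1) / (2.0 * sigma ** 2))

    # Activation: transition weights proportional to the feature dissimilarity |F(i,j) - F(p,q)|
    f = F.ravel()
    W_a = omega_d * np.abs(f[None, :] - f[:, None])
    A = equilibrium(W_a / W_a.sum(axis=0, keepdims=True)).reshape(h, w)

    # Normalization: transition weights proportional to the destination's activation |A(p,q)|
    W_n = omega_d * A.ravel()[:, None]
    return equilibrium(W_n / W_n.sum(axis=0, keepdims=True)).reshape(h, w)

# toy usage at the saliency-map resolution used later in the paper (32 x 18)
F = np.random.rand(18, 32)
print(gbvs_activation_and_normalization(F).shape)
```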

It is well known that humans tend to fixate on faces, especially on the eyes, which are highly salient for humans. Based on this observation, Cerf et al. [26] proposed a face channel-based saliency model using a face detector. We follow this approach to produce reliable saliency maps using facial features. We use the OKAO Vision facial feature detector developed by OMRON Corporation [27] to obtain the facial features. The sixth saliency map $s^{(6)}$ is modeled as 2-D Gaussian circles with a fixed variance centered at the two detected eye positions. When the detector only detects a face but not the eyes, e.g., due to a limited resolution, the facial saliency is defined at the center of the facial region.

Finally, our method computes the temporal average of each saliency map $s^{(1)}_i, \ldots, s^{(6)}_i$ within a temporal window $n_s$ as
$$\bar{s}^{(f)}_i = \frac{1}{n_s + 1} \sum_{j=i-n_s}^{i} s^{(f)}_j, \qquad (4)$$
where $s^{(f)}_j$ is the raw saliency map of the $f$-th feature computed from the $j$-th frame, and $n_s$ is the number of frames used for temporal averaging. This is because humans cannot instantly follow rapid scene changes, and only the past frames are used for the smoothing to account for this latency. As a result, synchronized pairs of saliency maps and eye images $\mathcal{D}_s = \{(\bar{s}^{(1)}_1, \ldots, \bar{s}^{(6)}_1, e_1), \ldots, (\bar{s}^{(1)}_N, \ldots, \bar{s}^{(6)}_N, e_N)\}$ are produced.
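As a sketch only (the handling of the first few frames of a clip is an assumption not specified in the text), the causal averaging of Eq. (4) can be written as:

```python
import numpy as np

def temporal_average(raw_maps, n_s=5):
    """Causal temporal smoothing of per-frame saliency maps (Eq. 4): only the current
    and the past n_s frames are averaged, to account for the latency of human gaze."""
    N = raw_maps.shape[0]
    smoothed = np.empty_like(raw_maps, dtype=float)
    for i in range(N):
        j0 = max(0, i - n_s)              # at the start of the clip, use the frames available so far
        smoothed[i] = raw_maps[j0:i + 1].mean(axis=0)
    return smoothed

# toy usage: 100 frames of one feature channel at saliency-map resolution
s = np.random.rand(100, 18, 32)
s_bar = temporal_average(s, n_s=5)
```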

2.2 Saliency Aggregation

Although it is assumed that the saliency maps can predict gaze points, their accuracy is insufficient for determining the exact gaze point locations, as discussed in previous studies [17]. In this section, we describe our method to compute the probability distribution of the gaze point by aggregating the computed saliency maps.



The computed saliency maps $\bar{s}^{(f)}$ encode the distinctive visual features of the video frames. While a saliency map does not provide the exact gaze point, highly salient regions in the saliency map are likely to coincide with the actual gaze point. Suppose we have a set of saliency maps that statistically have high saliency scores around the actual gaze point and random saliency scores in other regions. Since we assume a fixed head position, there is a one-to-one correspondence between the gaze points and the eye images; actual gaze points would be almost the same between visually similar eye images. Therefore, by aggregating the saliency maps based on the similarity of the associated eye images, we can expect the image region around the actual gaze point to show a clear peak of saliency compared with other regions. The aggregated map can be used as the gaze probability map, i.e., the probability distribution of the gaze point.

The similarity score $w_s$ of a pair of eye images $e_i$ and $e_j$ is defined as
$$w_s(e_i, e_j) = \exp(-\kappa_s^2 \, \|e_i - e_j\|^2), \qquad (5)$$

where the factor $\kappa_s$ controls the similarity score. The similarity score $w_s$ is higher when the appearances of two eye images are similar, i.e., when the gaze points of the eye images are close. Since the appearance variation of the eye images is quite large across different people, the optimal value of $\kappa_s$ in Eq. (5) is highly person-dependent. Therefore, in this work, the factor $\kappa_s$ is indirectly defined via the range of the values taken by $w_s$. More specifically, $\kappa_s$ is optimized by minimizing an error defined as
$$\hat{\kappa}_s = \arg\min_{\kappa_s} \|T_s - \det(W_s)\|^2, \qquad (6)$$
where $W_s \in \mathbb{R}^{N_s \times N_s}$ is a similarity weight matrix computed using $N_s$ randomly selected eye images in $\mathcal{D}_s$. $T_s$ is the target value of the determinant that is empirically defined, e.g., by quantitatively checking a sample dataset. The factor $\kappa_s$ is determined to adapt to the person dependency by minimizing Eq. (6) via gradient descent.
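A rough sketch of Eqs. (5) and (6) is given below. The paper optimizes $\kappa_s$ by gradient descent; the bounded 1-D scalar minimization, the search interval, and the toy feature dimensions used here are stand-in assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def similarity_matrix(E, kappa_s):
    """Pairwise similarity scores w_s (Eq. 5) for eye-image features stacked row-wise in E."""
    d2 = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    return np.exp(-(kappa_s ** 2) * d2)

def tune_kappa_s(E_subset, T_s):
    """Choose kappa_s so that det(W_s) approaches the empirically defined target T_s (Eq. 6)."""
    def objective(kappa_s):
        return (T_s - np.linalg.det(similarity_matrix(E_subset, kappa_s))) ** 2
    return minimize_scalar(objective, bounds=(1e-6, 10.0), method="bounded").x

# toy usage: N_s = 20 randomly selected eye-image feature vectors
E = np.random.rand(20, 50)
kappa_s = tune_kappa_s(E, T_s=-300.0)
```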

We eliminate the eye images that are not useful for gaze estimation, e.g., eye images captured during blinking, from the dataset prior to the computation of the gaze probability maps. On the other hand, the eye images recorded during fixation are useful as training data. To automatically identify such eye images, we use a fixation measure of an eye image $e$ defined as
$$w_e(e_i) = \exp(-\alpha_e \kappa_s^2 \, \mathrm{Var}(e_i)), \qquad (7)$$
where $\alpha_e$ is a weighting factor, and $\mathrm{Var}(e_i)$ denotes the variance of the eye images $e_{i-n_f}, \ldots, e_{i+n_f}$ over a temporal window $2 n_f + 1$ centered at $i$,
$$\mathrm{Var}(e_i) = \sum_{j=i-n_f}^{i+n_f} \|e_j - \mu_{e_i}\|^2, \qquad (8)$$
$$\mu_{e_i} = \frac{1}{2 n_f + 1} \sum_{k=i-n_f}^{i+n_f} e_k. \qquad (9)$$

Eq. (7) evaluates the stability of the eye regions, and it is assumed that there are no significant changes in the lighting conditions during the temporal window. Since the appearance of the eye images changes rapidly during fast eye movements, $w_e(e_i)$ becomes small when $e_i$ is captured during eye movement or blinking. A subset $\mathcal{D}_s' = \{(\bar{s}^{(1)}_1, \ldots, \bar{s}^{(6)}_1, e_1), \ldots, (\bar{s}^{(1)}_{N'}, \ldots, \bar{s}^{(6)}_{N'}, e_{N'})\}$ is created from $\mathcal{D}_s$ by removing the eye images whose $w_e$ scores are lower than a predefined threshold $\tau_f$.
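A small sketch of the fixation measure (Eqs. (7)-(9)) and the thresholding by $\tau_f$ follows; the boundary handling at the ends of the sequence and the value of $\kappa_s$ are illustrative assumptions.

```python
import numpy as np

def fixation_weights(E, alpha_e, kappa_s, n_f=3):
    """Fixation measure w_e (Eqs. 7-9) for a temporally ordered sequence of eye-image features E.

    Frames with a large local appearance variance (eye movements, blinks) receive
    a small weight; frames near the sequence ends simply use the available neighbors."""
    N = E.shape[0]
    w = np.empty(N)
    for i in range(N):
        window = E[max(0, i - n_f):min(N, i + n_f + 1)]
        mu = window.mean(axis=0)                          # Eq. (9)
        var = ((window - mu) ** 2).sum()                  # Eq. (8)
        w[i] = np.exp(-alpha_e * kappa_s ** 2 * var)      # Eq. (7)
    return w

# keep only frames whose fixation measure exceeds tau_f (parameter values from Section 4)
E = np.random.rand(200, 50)
keep = fixation_weights(E, alpha_e=0.008, kappa_s=0.5, n_f=3) > 0.2
```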

Since the variation in the gaze points is limited in $\mathcal{D}_s'$ and many samples can share almost the same gaze point, the eye images are clustered according to the similarity $w_s$ to reduce redundancy and computational cost. Using the similarity score (Eq. (5)), each eye image $e_i$ is sequentially added to the cluster whose average eye image $\bar{e}$ is the most similar to $e_i$. A new cluster is adaptively created if the highest similarity among all existing clusters is lower than a threshold $\tau_e$. These computations yield $M$ clusters and their average eye images $\bar{e}_1, \ldots, \bar{e}_M$.
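The sequential clustering can be sketched as follows. The running-average update of each cluster's average eye image is an assumption (the text does not specify how $\bar{e}$ is maintained), and the similarity reuses Eq. (5).

```python
import numpy as np

def cluster_eye_images(E, kappa_s, tau_e):
    """Sequentially assign each eye image to the cluster whose average eye image is most
    similar under Eq. (5); open a new cluster when the best similarity falls below tau_e."""
    centers, counts, labels = [], [], []
    for e in E:
        if centers:
            sims = [np.exp(-(kappa_s ** 2) * ((e - c) ** 2).sum()) for c in centers]
            best = int(np.argmax(sims))
        if not centers or sims[best] < tau_e:
            centers.append(e.astype(float).copy())
            counts.append(1)
            labels.append(len(centers) - 1)
        else:
            counts[best] += 1
            centers[best] += (e - centers[best]) / counts[best]   # running average eye image
            labels.append(best)
    return np.array(centers), np.array(labels)

E = np.random.rand(200, 50)
avg_eyes, labels = cluster_eye_images(E, kappa_s=0.5, tau_e=0.45)
```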

After these steps, a gaze probability map $p^{(f)}_i$ for each feature $f$ is computed as
$$p^{(f)}_i = \frac{\sum_{j=1}^{N'} w_s(\bar{e}_i, e_j) \, (\bar{s}^{(f)}_j - \bar{s}^{(f)}_{\mathrm{all}})}{\sum_{j=1}^{N'} w_s(\bar{e}_i, e_j)}, \qquad (10)$$
where $\bar{s}^{(f)}_{\mathrm{all}}$ is the average of all the maps $\bar{s}^{(f)}_1, \ldots, \bar{s}^{(f)}_{N'}$ in $\mathcal{D}_s'$. It is known that a center bias of visual saliency exists [20], [21], because man-made pictures usually have a higher saliency at the center of the image. The average saliency map $\bar{s}^{(f)}_{\mathrm{all}}$ is used to eliminate this center bias in the gaze probability map. Without it, the gaze probability map tends to have a higher value at the center regardless of the eye image $\bar{e}_i$. The gaze probability map $p^{(f)}_i$ can also have negative values. In our case, only the relative differences matter, and therefore we use the computed results after normalizing the values to a fixed range. We again apply the graph-based normalization scheme (Eq. (3)) to the gaze probability maps $p^{(f)}_i$ in order to enhance the peaks in the gaze probability maps.

The final gaze probability map $p_i$ is computed as a weighted sum of all the feature-dependent maps $p^{(f)}_i$ as
$$p_i = \sum_{f=1}^{6} \omega_f \, p^{(f)}_i, \qquad (11)$$
where $\omega_f$ is the weight for the $f$-th feature. $p_i$ is then normalized to a fixed range, and we obtain a dataset $\mathcal{D}_p = \{(p_1, \bar{e}_1), \ldots, (p_M, \bar{e}_M)\}$. We followed many existing visual saliency map models and used equal weights at this step to aggregate the feature maps. However, it is often pointed out that the contribution of each feature is not uniform, and there is a certain degree of data dependency. We use a feedback scheme to optimally adjust the weight parameters to address these issues. The feedback scheme is discussed in Section 3.
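The aggregation of Eqs. (10) and (11) can be sketched as below. A simple min-max normalization is used here as a stand-in for the graph-based normalization of Eq. (3) that the paper applies to each $p^{(f)}_i$, and the array shapes are illustrative assumptions.

```python
import numpy as np

def gaze_probability_maps(S, E, avg_E, kappa_s, weights=None):
    """Aggregate saliency maps into gaze probability maps (Eqs. 10-11).

    S:      temporally averaged saliency maps, shape (6, N', H, W)
    E:      eye-image features of the N' kept frames, shape (N', D)
    avg_E:  average eye images of the M clusters, shape (M, D)"""
    F, _, H, W = S.shape
    M = avg_E.shape[0]
    weights = np.ones(F) if weights is None else np.asarray(weights, dtype=float)
    s_all = S.mean(axis=1)                                               # per-feature mean map (center-bias term)
    P = np.zeros((M, H, W))
    for i in range(M):
        w = np.exp(-(kappa_s ** 2) * ((E - avg_E[i]) ** 2).sum(-1))      # w_s(e_bar_i, e_j)
        for f in range(F):
            p_f = np.tensordot(w, S[f] - s_all[f], axes=1) / w.sum()     # Eq. (10)
            p_f = (p_f - p_f.min()) / (p_f.max() - p_f.min() + 1e-12)    # stand-in normalization
            P[i] += weights[f] * p_f                                     # Eq. (11)
        P[i] = (P[i] - P[i].min()) / (P[i].max() - P[i].min() + 1e-12)
    return P

S = np.random.rand(6, 150, 18, 32)
E = np.random.rand(150, 50)
avg_E = np.random.rand(20, 50)
P = gaze_probability_maps(S, E, avg_E, kappa_s=0.5)
```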


Fig. 4. Examples of gaze probability maps $p$ and corresponding average eye images $\bar{e}$. The overlaid dots depict the actual gaze points of $\bar{e}$ to illustrate the correspondence between the gaze points and the gaze probability. The true gaze points are obtained using a calibration-based gaze estimator.

Fig. 5. ROC curves of raw saliency maps and gaze probability maps. The horizontal axis indicates the false positive rate, i.e., the rate of the pixels above a threshold. The vertical axis indicates the true positive rate, i.e., the rate of the frames that have a higher saliency value than the threshold at the true gaze point. The thin line (AUC = 0.82) indicates the performance of the raw saliency maps obtained by the process described in Section 2.1. The bold line (AUC = 0.93) indicates the performance of the gaze probability maps described in Section 2.2.


Fig. 4 shows examples of the obtained gaze probability maps $p$ for six people. The eye images shown at the top left of each sub-figure indicate the corresponding average eye images $\bar{e}$, and the overlaid dots indicate the actual gaze points of $\bar{e}$. Note that $\bar{e}$ is a prototype of the eye images synthesized through the above process, and its actual gaze points are unknown. Therefore, we used the estimates from the appearance-based gaze estimator with explicit calibration, which is described in Section 4, to obtain the true gaze points as a reference. Although the gaze probability maps $p_i$ are generated without knowing the actual gaze points, they have a significant correlation with the actual gaze points.

We compare the gaze probability maps with the original raw saliency maps to assess the improvement in correlation with the actual gaze points. Fig. 5 shows this improvement using a receiver operating characteristic (ROC) curve. We sweep the threshold value for the gaze probability maps and raw saliency maps to obtain the plots, and assess all the ground truth gaze points that we obtain through the experiment. The horizontal axis represents the false positive rate, i.e., the rate of the pixels in a map above a threshold value. The vertical axis is the true positive rate, which indicates the rate of frames whose saliency value at the gaze point is greater than the threshold. The area under the curve (AUC) of the gaze probability maps is 0.93, and that of the raw saliency maps is 0.82. This result shows that the correlation is significantly enhanced by the aggregation process.
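The ROC construction of Fig. 5 can be reproduced with a short sketch like the following; the number of thresholds and the toy data are assumptions.

```python
import numpy as np

def auc_at_gaze_points(maps, gaze_xy, n_thresh=100):
    """AUC of the ROC curve used in Fig. 5: for each threshold, the false positive rate is the
    fraction of all map pixels above the threshold, and the true positive rate is the fraction
    of frames whose map value at the true gaze point exceeds the threshold."""
    maps = np.asarray(maps, dtype=float)
    at_gaze = np.array([m[y, x] for m, (x, y) in zip(maps, gaze_xy)])
    thresholds = np.linspace(maps.max(), maps.min(), n_thresh)
    fpr = np.array([(maps > t).mean() for t in thresholds])
    tpr = np.array([(at_gaze > t).mean() for t in thresholds])
    return np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)   # trapezoidal area

maps = np.random.rand(50, 18, 32)                                       # e.g., gaze probability maps
gaze = [(np.random.randint(32), np.random.randint(18)) for _ in range(50)]
print(auc_at_gaze_points(maps, gaze))
```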

2.3 Estimator Construction

In the previous step, $M$ average eye images $\bar{e}_1, \ldots, \bar{e}_M$ and corresponding gaze probability maps $p_1, \ldots, p_M$ are obtained. This section describes our method for creating a gaze estimator using them as the training dataset. Our goal is to establish a mapping from eye images to gaze points. To efficiently achieve this task, we develop a method based on Gaussian process regression, which has been successfully applied to both appearance-based [28] and model-based [29] gaze estimation.

With the standard Gaussian process regression framework, an estimator is built to output the probability distribution $P(g^* \mid e^*, \mathcal{D}_g)$ of an unknown gaze point $g^*$ from an eye image $e^*$, given the labeled training data $\mathcal{D}_g = \{(g_1, \bar{e}_1), \ldots, (g_M, \bar{e}_M)\}$, which consists of the eye images and corresponding gaze points. In our case, however, we only know $\mathcal{D}_p = \{(p_1, \bar{e}_1), \ldots, (p_M, \bar{e}_M)\}$, where the gaze probability map $p_i$ only provides the probability distribution of the gaze point $g_i$. Therefore, instead of directly applying the standard Gaussian process regression, we work on a marginalized probability for which explicit training labels are not required.

After normalizing the gaze probability maps, we define the gaze probability distribution $P(g \mid p)$ as
$$P(g \mid p) = \frac{p(g)}{\sum_x \sum_y p}, \qquad (12)$$
where $p(g)$ indicates the value of $p$ at the gaze point $g$, and $\sum_x \sum_y p$ is the overall summation of $p$. In the above equation, we describe the estimation of a one-dimensional scalar $g$ to simplify the notation, but two regressors are independently built for the $X$- and $Y$-directions in the actual implementation. Using Eq. (12), the target distribution $P(g^* \mid e^*, \mathcal{D}_p)$ can be obtained by marginalizing over all the possible combinations of $M$ gaze points $D_g = \{g_1, \ldots, g_M\}$ as
$$P(g^* \mid e^*, \mathcal{D}_p) = \sum_{D_g} P(g^* \mid e^*, D_g) \, P(D_g \mid \mathcal{D}_p), \qquad (13)$$


where
$$P(D_g \mid \mathcal{D}_p) = \prod_{i=1}^{M} P(g_i \mid p_i). \qquad (14)$$
In Eq. (13), $g^*$ indicates an unknown gaze point associated with the eye image $e^*$, and $g_i$ is a candidate of the gaze point that corresponds to $\bar{e}_i$.

However, the summations in Eq. (13) are computationally expensive. Various approximation techniques for Gaussian process regression [30] can reduce the cost of computing $P(g^* \mid e^*, D_g)$; however, they cannot directly approximate Eq. (13) itself. For these reasons, we solve Eq. (13) using a Monte Carlo approximation. We randomly produce $n_g$ sets of samples $\{\mathcal{D}^{(l)}_g = \{(g^{(l)}_1, \bar{e}_1), \ldots, (g^{(l)}_M, \bar{e}_M)\}\}_{l=1}^{n_g}$ according to the probability distribution defined by Eq. (12). In particular, $g^{(l)}_i$ in the $l$-th set is generated according to the distribution $P(g_i \mid p_i)$ defined by the $i$-th probability map. It has been experimentally shown in Fig. 5 that gaze probability maps have a high correlation with the actual gaze points. From this observation, we discard the low saliency values from the gaze probability maps to reduce the number of samples; we use a threshold $\tau_s$ and set the probability to zero where $p(x, y)$ is lower than the threshold. Using these sets, Eq. (13) can be approximated as
$$P(g^* \mid e^*, \mathcal{D}_p) \approx \frac{1}{n_g} \sum_{l=1}^{n_g} P(g^* \mid e^*, \mathcal{D}^{(l)}_g). \qquad (15)$$
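A sketch of the sampling step behind Eq. (15) is shown below; the thresholding keeps the top 15% of pixels as reported in Section 4, the per-map renormalization follows Eq. (12), and the data layout is an assumption.

```python
import numpy as np

def sample_gaze_points(P, n_g=1000, top_ratio=0.15, rng=None):
    """Draw n_g candidate gaze points per gaze probability map according to Eq. (12),
    after zeroing all but the top 15% of pixels (the threshold tau_s from Section 4).

    P: gaze probability maps of shape (M, H, W); returns an (n_g, M, 2) integer array of
    (x, y) samples, i.e., one full label set D_g^(l) for every l."""
    rng = np.random.default_rng() if rng is None else rng
    M, H, W = P.shape
    samples = np.empty((n_g, M, 2), dtype=int)
    for i in range(M):
        p = P[i].ravel().copy()
        p[p < np.quantile(p, 1.0 - top_ratio)] = 0.0
        p /= p.sum()
        idx = rng.choice(H * W, size=n_g, p=p)
        samples[:, i, 0] = idx % W            # x coordinate in map pixels
        samples[:, i, 1] = idx // W           # y coordinate in map pixels
    return samples

P = np.random.rand(20, 18, 32)
D_g = sample_gaze_points(P, n_g=100)
```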

Finally, $P(g^* \mid e^*, \mathcal{D}^{(l)}_g)$ for each $l$ is computed using Gaussian process regression as follows [30].

Gaussian Process Regression.
We assume a noisy observation model for a gaze point, $g_i = f(\bar{e}_i) + \varepsilon_i$, i.e., a gaze point $g_i$ is given as a function of the eye image $\bar{e}_i$ with a noise term $\varepsilon_i \sim \mathcal{N}(0, \gamma^2 \varsigma_i^2)$. The data-dependent noise variance $\gamma^2 \varsigma_i^2$ is defined as being proportional to $\varsigma_i^2$, which is the actual variance of the generated samples $g^{(1)}_i, \ldots, g^{(n_g)}_i$. It explicitly assigns a higher noise variance to samples from ambiguous saliency maps that have several peaks. The function $f(\bar{e}_i)$ is assumed to be a zero-mean Gaussian process with a covariance function $k$:
$$k(\bar{e}_i, \bar{e}_j) = \alpha^2 \exp(-\beta^2 \|\bar{e}_i - \bar{e}_j\|^2), \qquad (16)$$

with hyperparameters $\alpha$ and $\beta$. With this assumption, $P(g^* \mid e^*, \mathcal{D}^{(l)}_g)$ is derived as a Gaussian distribution $\mathcal{N}(\mu_l, \sigma^2_l)$ with mean $\mu_l$ and variance $\sigma^2_l$:
$$\mu_l = K^* K^{-1} \, {}^t G^{(l)}, \qquad (17)$$
and
$$\sigma^2_l = k(e^*, e^*) - K^* K^{-1} \, {}^t K^*, \qquad (18)$$
where the matrix $K \in \mathbb{R}^{M \times M}$ is a covariance matrix whose $(i, j)$-th element is defined as $K_{ij} = k(\bar{e}_i, \bar{e}_j) + \gamma^2 \varsigma_i^2 \delta_{ij}$. The vector $K^* \in \mathbb{R}^{1 \times M}$ represents a covariance vector between the input eye image and the average eye images, whose $i$-th element is $K^*_i = k(\bar{e}_i, e^*)$, and $G^{(l)} \in \mathbb{R}^{1 \times M}$ is a vector of the gaze points, whose $i$-th element is $G^{(l)}_i = g^{(l)}_i$. As a result, the distribution $P(g^* \mid e^*, \mathcal{D}_p)$ can be estimated as a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ with
$$\mu = \frac{1}{n_g} \sum_{l=1}^{n_g} \mu_l, \qquad \sigma^2 = \frac{1}{n_g} \sum_{l=1}^{n_g} \sigma^2_l = \sigma^2_1. \qquad (19)$$
The variance $\sigma^2$ is simply equal to $\sigma^2_1$, because $\sigma^2_l$ in Eq. (18) is actually independent of the index $l$.
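For illustration, Eqs. (16)-(19) for a single coordinate can be sketched as follows; this is not the authors' code, two such regressors (for X and Y) are built in practice, and the explicit matrix inverse is used only for readability.

```python
import numpy as np

def gp_predict(E_train, g_samples, e_star, alpha, beta, gamma):
    """Gaussian process prediction of one gaze coordinate (Eqs. 16-19).

    E_train:   average eye images, shape (M, D)
    g_samples: sampled gaze coordinates, shape (n_g, M), one label set per row
    e_star:    query eye image feature, shape (D,)"""
    def k(a, b):
        return alpha ** 2 * np.exp(-beta ** 2 * ((a - b) ** 2).sum(-1))   # Eq. (16)

    M = E_train.shape[0]
    var_i = g_samples.var(axis=0)                                         # variance of the generated samples
    K = np.array([[k(E_train[i], E_train[j]) for j in range(M)] for i in range(M)])
    K += np.diag(gamma ** 2 * var_i)                                      # data-dependent noise term
    K_star = k(E_train, e_star)                                           # covariance with the query, shape (M,)
    K_inv = np.linalg.inv(K)

    mus = K_star @ K_inv @ g_samples.T                                    # Eq. (17) for every label set l
    mu = mus.mean()                                                       # Eq. (19)
    var = k(e_star, e_star) - K_star @ K_inv @ K_star                     # Eq. (18), identical for every l
    return mu, var

E_train = np.random.rand(30, 50)
g_samples = np.random.rand(100, 30) * 32.0                                # x coordinates in map pixels
mu, var = gp_predict(E_train, g_samples, np.random.rand(50), alpha=1.0, beta=0.1, gamma=1.0)
```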

Tuning Hyperparameters.
There are three hyperparameters, $\alpha$, $\beta$, and $\gamma$, in the above formulation, which need to be optimized for each dataset. We use a cross-validation approach [31] for the optimization in our method. The optimal parameters can be estimated by maximizing a leave-one-out log predictive probability $L$ defined as
$$L(\{\mathcal{D}^{(l)}_g\}, \boldsymbol{\theta}) = \sum_{l=1}^{n_g} \sum_{i=1}^{M} \log p(g^{(l)}_i \mid \mathcal{D}^{(l)}_{g,-i}, \boldsymbol{\theta}), \qquad (20)$$

where $\boldsymbol{\theta}$ is the set of hyperparameters $\boldsymbol{\theta} = \{\alpha, \beta, \gamma\}$, and $\mathcal{D}^{(l)}_{g,-i}$ is the set of generated samples that excludes the samples with the $i$-th eye image. The predictive probability $p(g^{(l)}_i \mid \mathcal{D}^{(l)}_{g,-i}, \boldsymbol{\theta})$ is defined as a Gaussian function as

$$\log p(g^{(l)}_i \mid \mathcal{D}^{(l)}_{g,-i}, \boldsymbol{\theta}) = -\frac{1}{2} \log \sigma^2_{-i} - \frac{(g^{(l)}_i - \mu_{-i})^2}{2 \sigma^2_{-i}} - \frac{1}{2} \log 2\pi, \qquad (21)$$
where $\mu_{-i}$ and $\sigma^2_{-i}$ are the estimated mean and variance using the sample set $\mathcal{D}^{(l)}_{g,-i}$.

In our case, however, the actual gaze points of the eye images in the samples $\mathcal{D}^{(l)}_g$ also have a center bias for the same reason discussed in Section 2.2. This results in fewer samples in the peripheral region, so the errors in this region contribute only weakly to the objective. When Eq. (20) is directly maximized, the optimization result therefore also tends to be biased toward the center. We modify Eq. (20) to remove this center bias by normalizing the predictive probability as
$$L = \sum_{l=1}^{n_g} \sum_{i=1}^{M} \frac{1}{n_{(i,l)}} \log p(g^{(l)}_i \mid \mathcal{D}^{(l)}_{g,-i}, \boldsymbol{\theta}), \qquad (22)$$
where $n_{(i,l)}$ is the total number of samples that have the same gaze point as $g^{(l)}_i$. By evaluating the average errors at each gaze point, Eq. (22) can evaluate the estimation errors in an unbiased manner over the display coordinates. Using partial derivatives with respect to the hyperparameters, Eq. (22) is maximized via a conjugate gradient method. The readers are referred to [30] for a detailed derivation of the partial derivatives.
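The objective of Eq. (22) can be evaluated with the closed-form leave-one-out identities from [30], as sketched below. This sketch only evaluates the objective for fixed hyperparameters (the paper maximizes it over θ = {α, β, γ} with a conjugate gradient method), and the way n_(i,l) is counted over the whole sample set is an interpretation.

```python
import numpy as np

def loo_mean_var(K, g):
    """Closed-form leave-one-out predictive means and variances for one label set,
    using the standard Gaussian-process identities from [30]; K already contains the noise term."""
    K_inv = np.linalg.inv(K)
    diag = np.diag(K_inv)
    return g - (K_inv @ g) / diag, 1.0 / diag

def normalized_loo_objective(K, g_samples):
    """Center-bias-corrected LOO log predictive probability (Eq. 22)."""
    values, counts = np.unique(g_samples, return_counts=True)
    count_of = dict(zip(values.tolist(), counts.tolist()))
    L = 0.0
    for g in g_samples:                                   # one label set D_g^(l) per row
        mu_loo, var_loo = loo_mean_var(K, g)
        logp = (-0.5 * np.log(var_loo)
                - (g - mu_loo) ** 2 / (2.0 * var_loo)
                - 0.5 * np.log(2.0 * np.pi))              # Eq. (21)
        n_il = np.array([count_of[v] for v in g.tolist()])
        L += np.sum(logp / n_il)
    return L

# toy usage: a positive-definite covariance and 50 sampled label sets of 30 gaze coordinates
A = np.random.rand(30, 5)
K = A @ A.T + np.eye(30)
g_samples = np.random.randint(0, 32, size=(50, 30)).astype(float)
print(normalized_loo_objective(K, g_samples))
```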

2.4 Gaze Estimation

Once we have the matrices $K$, $S$, and $G^{(1)}, \ldots, G^{(n_g)}$ in Eqs. (17) and (18), a gaze point can be estimated by taking a new eye image $e$ as an input.


Fig. 6. Illustration of feature weight optimization. Target attention maps $a$ are generated based on leave-one-out estimates, and the feature weights are optimized by minimizing the sum of squared residuals between the target maps and the summed maps $p$.

The estimated distributions for the $X$- and $Y$-directions, $\mathcal{N}(\mu_x, \sigma^2_x)$ and $\mathcal{N}(\mu_y, \sigma^2_y)$, are converted to the display coordinates $\mathcal{N}(\tilde{\mu}_x, \tilde{\sigma}^2_x)$ and $\mathcal{N}(\tilde{\mu}_y, \tilde{\sigma}^2_y)$ by
$$\tilde{\mu}_x = x_o + \frac{W_I}{W_s} \mu_x, \qquad \tilde{\mu}_y = y_o + \frac{H_I}{H_s} \mu_y, \qquad (23)$$
and
$$\tilde{\sigma}^2_x = \frac{W_I}{W_s} \sigma^2_x, \qquad \tilde{\sigma}^2_y = \frac{H_I}{H_s} \sigma^2_y, \qquad (24)$$
where $W_s$ and $H_s$ indicate the width and height of the saliency maps, $W_I$ and $H_I$ indicate the actual width and height of the displayed images $\{I_1, \ldots, I_N\}$, and $(x_o, y_o)$ indicates the display origin of the images. The mean $(\tilde{\mu}_x, \tilde{\mu}_y)$ corresponds to the estimated gaze point $g$. The estimated variances are not directly used in our current system; however, the estimation accuracy could be further improved by incorporating techniques such as the Kalman filter as in [28].
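Eqs. (23) and (24) amount to a simple rescaling; a sketch with the sizes from Section 4 follows, where the zero origin assumes the video fills the display (an assumption):

```python
def to_display_coords(mu_x, mu_y, var_x, var_y,
                      W_s=32, H_s=18, W_I=1920, H_I=1080, x_o=0.0, y_o=0.0):
    """Convert a gaze estimate from saliency-map pixels to display coordinates (Eqs. 23-24).
    The default sizes match the setup in Section 4; the zero origin is an assumption."""
    mu_xd = x_o + (W_I / W_s) * mu_x
    mu_yd = y_o + (H_I / H_s) * mu_y
    var_xd = (W_I / W_s) * var_x
    var_yd = (H_I / H_s) * var_y
    return (mu_xd, mu_yd), (var_xd, var_yd)

print(to_display_coords(16.0, 9.0, 1.0, 1.0))
```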

3 FEATURE WEIGHT OPTIMIZATION

The previous section describes the complete pipeline for training and testing the gaze estimator. In the saliency aggregation step (Section 2.2), all six saliency features $p^{(f)}$ are independently aggregated, and the aggregated maps are linearly combined by Eq. (11) to produce the summed map $p$.

Although most existing methods use equal weights for simplicity [11], [14], [26], how to optimally tune the weights to achieve an even higher correlation between the visual saliency and gaze points remains unclear. Some studies use a data-driven learning approach to optimally tune the weight parameters using known gaze points [18], [19], [20], [21]. In our scenario, we do not have access to ground truth gaze points. Thus, the feature weights are refined in our method by applying the data-driven learning approach to the estimated gaze points. The correlation between the combined saliency maps and gaze points is improved using this feedback loop.

Once the gaze estimator is built (Section 2.3), it can be used to estimate the gaze points from the associated average eye images $\bar{e}$ in $\mathcal{D}_p$. Using this dataset, our method optimizes the feature weights so that the correlation between the peak of the gaze probability map $p$ and the gaze point estimate becomes higher.

Our approach is illustrated in Fig. 6. As described in Section 2.3, leave-one-out estimates $\mu_{-1}, \ldots, \mu_{-M}$ are obtained for each average eye image in both the $X$- and $Y$-directions. Using these estimated gaze coordinates, target attention maps $\{a_1, \ldots, a_M\}$, which represent the top-down gaze point distribution, are generated by drawing circles at the gaze points with a fixed radius. Since these leave-one-out estimates can include a large error, the radius is set to a value ($\sim 4$ degrees in our current implementation) that is larger than the central area of human vision ($\sim 1$ degree). Using the target attention maps $a$, the feature weights $\boldsymbol{\omega} = {}^t(\omega_1, \ldots, \omega_6)$ are optimized by minimizing the sum of the squared residuals as

$$\hat{\boldsymbol{\omega}} = \arg\min_{\boldsymbol{\omega}} \sum_{i=1}^{M} \Big\| a_i - \sum_{f=1}^{6} \omega_f \, p^{(f)}_i \Big\|^2, \qquad (25)$$
with a non-negativity constraint
$$\boldsymbol{\omega} \geq 0. \qquad (26)$$

To reduce the number of equations in Eq. (25), $n_a$ points are randomly sampled from both the positive and zero regions in every attention map $a$. The matrix form of Eq. (25) can then be written as
$$\hat{\boldsymbol{\omega}} = \arg\min_{\boldsymbol{\omega}} \| A - P \boldsymbol{\omega} \|^2, \qquad (27)$$
where $A \in \mathbb{R}^{2 M n_a \times 1}$ is a vector that consists of the values at the selected points and $P \in \mathbb{R}^{2 M n_a \times 6}$ contains the corresponding feature values in each row. Eq. (27) is solved by using the non-negative least-squares algorithm of Lawson and Hanson [32] to obtain the optimal set of feature weights.
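Eq. (27) can be solved directly with the Lawson-Hanson solver available as `scipy.optimize.nnls`, as in the sketch below; the sampling of the positive and zero regions and the toy shapes are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def optimize_feature_weights(attention_maps, feature_maps, n_a=40, rng=None):
    """Solve Eq. (27) with the Lawson-Hanson non-negative least-squares solver [32].

    attention_maps: target attention maps a_i, shape (M, H, W)
    feature_maps:   per-feature gaze probability maps p_i^(f), shape (M, 6, H, W)
    n_a points are sampled from both the positive and the zero region of every map."""
    rng = np.random.default_rng() if rng is None else rng
    rows_a, rows_p = [], []
    for a, p in zip(attention_maps, feature_maps):
        flat_a = a.ravel()
        flat_p = p.reshape(p.shape[0], -1)                # (6, H*W)
        for pool in (np.flatnonzero(flat_a > 0), np.flatnonzero(flat_a == 0)):
            if pool.size == 0:
                continue
            idx = rng.choice(pool, size=min(n_a, pool.size), replace=False)
            rows_a.append(flat_a[idx])
            rows_p.append(flat_p[:, idx].T)               # (n_a, 6)
    A = np.concatenate(rows_a)                            # stacked target values
    P = np.concatenate(rows_p)                            # stacked feature values, one row per point
    w, _ = nnls(P, A)
    return w

att = (np.random.rand(20, 18, 32) > 0.9).astype(float)    # toy binary attention maps
feat = np.random.rand(20, 6, 18, 32)
print(optimize_feature_weights(att, feat))
```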

4 EXPERIMENTAL RESULTS

In this section, we present our experimental results to evaluate our method. We use a set of 80 video sources in the experiments that are downloaded from the Vimeo website [33], which include various types of video clips, e.g., music videos and short films. Some example frames are shown in Fig. 7. A 30-second video sequence is randomly extracted from each video source without an audio signal, and resized to a fixed resolution of 960 × 540. These 80 short clips are divided into four datasets {A, B, C, D}, and seven novice test subjects {t1, ..., t7} are asked to watch all of them. The video clips are shown at 25 fps; therefore, the number of frames is $N = 15000$ in each set, and the display resolution is set to $W_I = 1920$ and $H_I = 1080$. Although it highly depends on the algorithms and can be done prior to the recording session of the eye images, the most time-consuming part of the proposed framework is the saliency extraction step.


Fig. 7. Examples of video clips used in the experiments. We use a set of 80 video clips downloaded from the Vimeo website [33]. All the pictures are licensed under a Creative Commons license.²

For an efficient computation, the saliency maps are calculated at a smaller resolution, $W_s = 32$ and $H_s = 18$, in our experiments. One pixel then corresponds to about 1.35 × 1.35 degrees in our current setting, which is close to the limit of the central area of human vision.

Throughout the experiments, the parameters are set to $n_s = 5$, $\tau_e = 0.45$, $T_s = -300$, $\alpha_e = 0.008$, $n_f = 3$, $\tau_f = 0.2$, $n_g = 1000$, and $n_a = 40$, and $\tau_s$ is adaptively set to retain the top 15% of pixels in each map while the remaining 85% are set to zero. These parameter settings are obtained empirically from our experiments. In our current implementation, when $M \simeq 100$, parameter optimization takes about 1 minute, and estimation takes about 1 millisecond per frame on a 3.33-GHz Core i7 CPU with simple code parallelization using OpenMP [34].

4.1 Experiment Details

A chin rest is used during the experiments to fix the subjects' head positions, and a 23-inch Full HD (508.8 × 286.2 mm) display is placed about 630 mm away from the subject to show the video clips. A VGA-resolution camera is placed under the display to capture eye images.

We use the OMRON OKAO Vision library to detect the corners of each eye, and apply template matching around the detected corners to ensure alignment accuracy. Small template images of the corners of each eye are registered during the initialization, and the locations with the highest normalized cross-correlation are used as the aligned corner positions. Based on the aligned positions, as illustrated in Fig. 8 (a), the eye images are cropped to a fixed size of 70 × 35 pixels.

The eye images are histogram-equalized, and pixels with intensity lower than a given threshold are truncated to zero so that the images contain only the eye regions (Fig. 8 (b)), which minimizes the effects caused by lighting changes. The threshold value is automatically decided using Otsu's method [35].

2. From top to bottom, left to right: "The Eyewriter" by Evan Roth (http://vimeo.com/6376466), "Balloons" by Javi Devitt (http://vimeo.com/10256420), "Tenniscoats - Baibaba Bimba — A Take Away Show" by La Blogotheque (http://vimeo.com/11046286), "Persona" by superhumanoids (http://vimeo.com/13848244), "MADRID LONGBOARD" by Juan Rayos (http://vimeo.com/12132621), and "MATATORO" by Matatoro Team (http://vimeo.com/13487624).

Fig. 8. Examples of eye images. (a) The eye images are cropped to a fixed size based on the detected positions. (b) The images are histogram-equalized and thresholded so that they contain only the iris and eye contours.

Fig. 9. Error comparison of the commercial gaze estimator (Tobii TX300) and the calibrated estimation method. A target point (the ground truth) is explicitly displayed to the test subjects, and the estimation accuracy is evaluated by assessing the deviation from the ground truth. (Bar chart: average error in mm and degrees per test subject t1-t7 and on average.)

Finally, we apply a discrete Fourier transform to obtain the 4900-dimensional feature vector $e$, which consists of the Fourier magnitudes of both the left and right eye images.
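A sketch of this feature extraction is given below using OpenCV and NumPy; the exact preprocessing order and the thresholding details are assumptions based on the description above, not the authors' code.

```python
import cv2
import numpy as np

def eye_feature(left_eye, right_eye):
    """Build the eye-image feature vector of Section 4.1: resize to 70 x 35, histogram-equalize,
    zero out pixels below an Otsu threshold, and concatenate the Fourier magnitudes of both
    eyes (2 x 70 x 35 = 4900 dimensions)."""
    feats = []
    for eye in (left_eye, right_eye):
        eye = cv2.resize(eye, (70, 35))                   # dsize is (width, height)
        eye = cv2.equalizeHist(eye)
        t, _ = cv2.threshold(eye, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        eye = np.where(eye >= t, eye, 0).astype(float)    # pixels below the threshold are truncated to zero
        feats.append(np.abs(np.fft.fft2(eye)).ravel())
    return np.concatenate(feats)

left = np.random.randint(0, 256, (35, 70), dtype=np.uint8)
right = np.random.randint(0, 256, (35, 70), dtype=np.uint8)
print(eye_feature(left, right).shape)                     # (4900,)
```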

Ground truth.
We use a Tobii TX300 gaze tracker [36] to obtain the ground truth to quantitatively assess the effectiveness of our proposed method. The Tobii gaze tracker is placed within our setup and run in parallel with our method to obtain the ground truth gaze points. In addition, we also run a standard appearance-based gaze estimation method that uses an explicit calibration (what we call the calibrated method hereafter) as a baseline for further assessing our method. We show 16 × 9 reference points to each test subject at a regular interval on the display to train the estimator of the calibrated method. The eye images are recorded during the calibration to establish the mapping between the eye images and the gaze points. Once the pairs of reference gaze points and eye images are obtained, a learning process, performed in the same manner as described in Section 2.3, yields the mapping.

It is important to discuss the accuracy of the reference method that we use as the ground truth. The catalog specification of the accuracy of the Tobii TX300 is less than 1 degree. However, the error can be larger depending on the test subjects and installation conditions. We conduct a preliminary experiment using our setting to verify the actual accuracy of both the Tobii gaze tracker and the calibrated gaze estimator. A total of 120 target points are randomly shown on the display to each of the seven test subjects.


The subjects are asked to look at these target points, and we assess the gaze estimation accuracy using the target points as the ground truth gaze points. Fig. 9 shows the average distance errors between the ground truth target points and the gaze points estimated by the commercial and calibrated gaze estimators. As shown in the plot, the estimation accuracy depends on the subject in our setting. The average errors are similar for these two approaches: about 30 mm ($\simeq$ 2.7 degrees) with the commercial estimator, and 25 mm ($\simeq$ 2.3 degrees) with the calibrated estimator. From these experiments, we observe that there is a fundamental limit in the accuracy evaluation of a gaze estimator. As described above, we use the output of the commercial gaze estimator as the ground truth because of its availability to the readers and reproducibility in the experiments.

4.2 Gaze Estimation Result

We examine the performance of the proposed method using the following procedure. First, we use the entire dataset for both training and testing to assess the upper-bound accuracy of the proposed method, i.e., the training data performance. Second, we divide the dataset into two parts, one for training and the other for testing, to evaluate the generalizability of the proposed gaze estimator, i.e., the test data performance.

Performance evaluation.
We first assess the performance when the same dataset is used for both training and testing. We perform this evaluation on the four datasets {A, B, C, D} independently. The estimation results are summarized in Table 1. Each row corresponds to the result for one dataset, where all 20 video clips are used for both training and testing. The first two columns indicate the AUCs of the average ROC curves of the raw saliency maps $s$ and the gaze probability maps $p$. The remaining columns list the estimation errors of the proposed method and the calibrated appearance-based estimator as both distance and angular errors. The errors are given in the form (average ± standard deviation). The distance error is evaluated using the Euclidean distance between the estimated and ground truth gaze points, and the angular errors are computed using the distance between the eyes and the display.

Similarly, Table 2 lists the estimation error for each subject t1, ..., t7. Each row corresponds to the average of the results of the corresponding test subject over the four datasets {A, B, C, D}. The columns list the AUCs and estimation errors in the same manner as in Table 1. The overall average error is 39 mm ($\simeq$ 3.5 degrees). It can be seen from Table 1 and Table 2 that the performance does not have a strong dependency on the dataset or the subject. Although our method has a larger error than the calibrated estimator, it still achieves an accuracy of 3.5 degrees, which is sufficient for obtaining the regions of attention in images.

Fig. 10. Estimation errors w.r.t. different amounts of training video clips (average error in mm and degrees vs. number of training video clips, for datasets A-D). Each point is plotted by choosing a limited number of training video clips from the corresponding dataset, and the average error is computed from the results of the seven test subjects.

Fig. 11. Comparison of the two estimation scenarios (error in mm and degrees for datasets A-D). All four datasets are divided into two subsets. One subset is used to train the gaze estimators, and the estimation accuracy is independently examined using each subset. The dark bars represent the training data performance, and the light bars represent the test data performance.

Performance variation w.r.t. the amount of training data.
We conduct a test using varying amounts of training video data to analyze the performance variation with respect to the amount of training data. Fig. 10 shows a comparison of the estimation errors for different numbers of training video clips. The X-axis represents the number of video clips used from each dataset. The Y-axis shows the average error computed over all seven test subjects. For this evaluation, five video clips are randomly selected from the training dataset and used as the test data. The results show that a larger amount of training data results in a more accurate estimation, though the improvement slows after 10 training clips.

Performance on test data.
In the above experiments, we used test data sampled from the training dataset to verify the upper bound on the accuracy. Our gaze estimator can also be applied to unseen video clips after training. We therefore perform a test with separated training and test datasets to evaluate the generalizability of the proposed method. Each of the four datasets A, ..., D in this test is divided into two subsets that consist of 10 video clips each.


TABLE 1
Average error for each dataset. The two AUC columns show the AUCs of the average ROC curves of the raw saliency maps s and the gaze probability maps p. The remaining columns are the distance and angular estimation errors (average ± standard deviation) of the two estimation methods.

            s      p      Proposed method             Calibrated method
Dataset    AUC    AUC    error [mm]   error [deg.]   error [mm]   error [deg.]
A          0.80   0.92   41 ± 26      3.7 ± 2.3      33 ± 15      3.0 ± 1.4
B          0.83   0.95   36 ± 23      3.3 ± 2.1      24 ± 13      2.2 ± 1.2
C          0.81   0.94   41 ± 25      3.7 ± 2.3      27 ± 15      2.5 ± 1.4
D          0.83   0.91   36 ± 25      3.3 ± 2.3      34 ± 16      3.1 ± 1.5
Average    0.82   0.93   39 ± 25      3.5 ± 2.3      30 ± 15      2.7 ± 1.4

TABLE 2
Average error of each subject. The columns indicate the AUCs of the average ROC curves and the estimation errors in the same manner as in Table 1.

            s      p      Proposed method             Calibrated method
Subject    AUC    AUC    error [mm]   error [deg.]   error [mm]   error [deg.]
t1         0.80   0.93   41 ± 28      3.7 ± 2.6      29 ± 16      2.6 ± 1.4
t2         0.79   0.92   41 ± 27      4.0 ± 2.4      30 ± 14      2.7 ± 1.3
t3         0.83   0.93   33 ± 22      3.0 ± 2.0      34 ± 15      3.1 ± 1.4
t4         0.81   0.94   42 ± 24      3.8 ± 2.2      30 ± 14      2.7 ± 1.3
t5         0.84   0.92   35 ± 23      3.2 ± 2.1      27 ± 15      2.5 ± 1.4
t6         0.83   0.94   36 ± 22      3.2 ± 2.0      24 ± 12      2.2 ± 1.1
t7         0.82   0.90   39 ± 27      3.6 ± 2.5      33 ± 18      3.0 ± 1.7

Fig. 12. Gaze estimation results. The estimation results of our method are rendered as 2-D Gaussian circles. The corresponding input eye images are shown at the top-left corner. The overlaid cross shapes represent the ground truth gaze points obtained by the commercial gaze tracker (Tobii TX300), and the solid circles indicate the gaze points obtained from the calibrated estimator.

Each of the four datasets A, . . . , D in this test is divided into two subsets that consist of 10 video clips each. One subset is used to train our gaze estimator, and the other subset is used for testing. At the same time, we evaluate the performance using the original training data as the test data for comparison. Fig. 11 shows the comparative results of these two estimation scenarios. The dark bars in the figure indicate the training data performance, and the light bars correspond to the test data performance. In most of the datasets, the training data performance shows the higher accuracy, as expected. However, the test data performance is comparable, without any significant degradation.

Fig. 12 shows some examples of the gaze estimation results. The output of our method is rendered as a 2-D Gaussian circle centered at (µx, µy) with a variance (σx², σy²) given by Eq. (19), and the mean coordinate (µx, µy) is used as the estimated gaze point.

Fig. 13. Comparison of hyperparameter tuning methods for each test subject (error in mm and degrees). The dark bars indicate the estimation errors after optimization using Eq. (22), and the light bars indicate the optimization results using Eq. (20). The proposed optimization using Eq. (22) results in lower errors.

The input eye images are shown at the top-left corner. The overlaid cross shape represents the ground truth gaze point obtained by the commercial gaze tracker, and the solid circle shows the gaze point estimated by the calibration-based estimator.
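To make the role of the predictive mean and variance concrete, the following sketch (our illustration, not the authors' implementation) maps an eye-image feature vector to a gaze point with Gaussian process regression and reads off the predictive mean and variance that would be rendered as the Gaussian circle above. The use of scikit-learn, the feature dimensionality, the kernel, and the placeholder data are all assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Placeholder training data: eye-image feature vectors and gaze coordinates [mm].
rng = np.random.default_rng(0)
X_train = rng.random((200, 64))            # assumed 64-D eye-image features
y_train = rng.random((200, 2)) * 500.0     # corresponding on-screen gaze points

# One independent GP per screen coordinate (a simplifying assumption).
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2)
gp_x = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train[:, 0])
gp_y = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train[:, 1])

x_new = rng.random((1, 64))                # feature of a new eye image
mu_x, sigma_x = gp_x.predict(x_new, return_std=True)
mu_y, sigma_y = gp_y.predict(x_new, return_std=True)

# (mu_x, mu_y) is the estimated gaze point; (sigma_x**2, sigma_y**2) plays the
# role of the variance used to draw the uncertainty circle.
print(mu_x[0], mu_y[0], sigma_x[0] ** 2, sigma_y[0] ** 2)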

4.3 Effects of Parameter Optimization

In this section, we quantitatively evaluate the effectiveness of our parameter optimizations. We first assess the effectiveness of the hyperparameter tuning described in Section 2.3, and second, we evaluate the feature weight optimization described in Section 3.

Hyperparameter tuning results.

The effectiveness of the hyperparameter tuning of the Gaussian process regression is summarized in Fig. 13.


Fig. 14. Optimized weights of the saliency features (face, color, flicker, intensity, motion, and orientation) for datasets A, B, C, and D. The pie graphs show the ratio of the optimized feature weights averaged over all test subjects.

Fig. 15. Normalized Scanpath Saliency (NSS) scores of the gaze probability maps for each dataset. Dark bars indicate NSS values using optimized weights, and light bars indicate NSS values using the initial equal weights. Optimized weights always outperform equal weights.

The dark bars indicate the estimation errors after our optimization using Eq. (22), and the light bars show the results of optimization using Eq. (20), which corresponds to the standard formulation that evaluates the sample-wise error. Our optimization method reduces the error by about 0.5 degrees.

Feature weight optimization results.

This experiment assesses the effectiveness of the feature weight optimization by our feedback loop, where the feature weights are updated to further enhance the accuracy of the gaze estimation. The original feature weights are all set to one; Fig. 14 shows the ratio of the feature weights after optimization. The pie graphs show the ratio of the average weights computed from the data of all test subjects. After optimization, the face and orientation features have greater weight than the others. This result is consistent with the report by Zhao et al. [21], which optimized the feature weights using known gaze points.
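For intuition only, the weighting itself amounts to a linear combination of the per-channel feature maps. The sketch below, with the channel names of Fig. 14 and placeholder maps, illustrates how such a weighted saliency map could be assembled; it is a simplified illustration under our own assumptions, not the paper's implementation.

import numpy as np

# Placeholder per-channel feature maps (H x W); in practice these would come
# from a saliency model with the channels shown in Fig. 14.
H, W = 60, 80
rng = np.random.default_rng(0)
channels = {name: rng.random((H, W))
            for name in ("face", "color", "flicker", "intensity", "motion", "orientation")}

# Initial equal weights; the feedback loop would replace these with optimized values.
weights = {name: 1.0 for name in channels}

def weighted_saliency(channels, weights):
    """Linear combination of feature maps, rescaled to [0, 1]."""
    s = sum(weights[name] * fmap for name, fmap in channels.items())
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

saliency_map = weighted_saliency(channels, weights)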

We compute the Normalized Scanpath Saliency (NSS) [37] to evaluate the correlation between the gaze probability map and the estimated gaze point. The original definition of NSS is the normalized average of the saliency values at the fixation locations: the saliency maps are linearly normalized to have a zero mean and a unit standard deviation, and the NSS is computed as the average of the saliency values at the fixated positions.

Fig. 16. Error comparison before and after feature weight optimization for each dataset (error in mm and degrees). Optimized weights lead to slightly reduced errors for 3 of the 4 datasets.

Therefore, a higher NSS score indicates a greater correlation. In our case, instead of using the fixation locations, we use the true gaze points associated with the average eye images to compute the NSS. Given all the sets of average eye images and gaze probability maps, we first compute the ground-truth gaze points. Since the average eye images are synthesized features, as discussed in Section 2.2, we use the estimates from the calibrated method instead of the commercial gaze tracker. Using the ground-truth gaze points, we compute the NSS using the normalized gaze probability maps. Fig. 15 summarizes the results; dark bars indicate NSS values after optimization, and light bars indicate NSS values with the initial equal weights. The NSS scores improve after optimization for all datasets.
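As a concrete illustration of the metric, the short sketch below computes an NSS score from a map and a set of gaze points exactly as defined above; the array layout and the (x, y) pixel-coordinate convention are our assumptions rather than the paper's code.

import numpy as np

def nss(prob_map, gaze_points):
    """Normalized Scanpath Saliency: normalize the map to zero mean and unit
    standard deviation, then average its values at the given gaze points.

    prob_map    : 2-D array (height x width), e.g., a gaze probability map
    gaze_points : iterable of (x, y) pixel coordinates of gaze points
    """
    z = (prob_map - prob_map.mean()) / (prob_map.std() + 1e-12)
    values = [z[int(round(y)), int(round(x))] for x, y in gaze_points]
    return float(np.mean(values))

# Example with placeholder data.
rng = np.random.default_rng(0)
score = nss(rng.random((60, 80)), [(10.2, 5.7), (40.0, 30.5)])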

Fig. 16 shows an error comparison before and after the feature weight optimization. Dark bars indicate errors after the feature weight optimization, and light bars represent errors when the initial equal weights are used. While the improvement is rather small, the optimized weights yield a higher accuracy in 3 out of 4 datasets. One possible explanation for this small improvement is that the aggregated feature maps p(f) are already sufficiently accurate for predicting gaze points, so the estimation accuracy is almost saturated in our setting. Nevertheless, this improvement suggests that it is useful to incorporate the feedback loop for saliency optimization, especially when more complex saliency map models with many features are used.

4.4 Spatial Bias in Error

The accuracy of our method depends on the distribution of the training samples. Fig. 17 shows the spatial distribution of the average estimation errors in display coordinates. In the figure, each grid cell corresponds to a ground truth gaze location, and the magnitude and direction of the average estimation errors are rendered. In Fig. 17 (a), a higher intensity indicates a greater magnitude of estimation error. In (b), the error directions are color coded: each color corresponds to a direction as illustrated in the reference circle, and a higher saturation indicates a greater error magnitude, as in Fig. 17 (a).


Fig. 17. Spatial distribution of estimation errors in display coordinates (error magnitude in mm): (a) a higher intensity corresponds to a greater magnitude of estimation error, and (b) the color wheel indicates error directions while the saturation indicates error magnitudes.

Fig. 18. Average saliency map (left) and spatial histogram of gaze points (right). The image on the left shows the average of all the raw saliency maps extracted from the four video clips used in the experiment. The image on the right shows the spatial histogram of the ground truth gaze points from the experimental dataset. A higher intensity corresponds to a larger count of gaze points.

There exists a systematic bias: large errors are observed around the peripheral areas, and the error directions are biased toward the center of the display.

Fig. 18 shows the average saliency map and the spatial histogram of the gaze points. The image on the left shows the average of all raw saliency maps extracted from all datasets used in our experiment. The image on the right shows the spatial histogram of the ground truth gaze points obtained from the same datasets, in which a brighter intensity represents a higher frequency. Since salient objects are typically located near the centers of video frames, the average saliency is lower around the display boundary. The actual gaze points also tend to concentrate around the center of the display. As a result, the number of learning samples near the display edges is limited, which causes the bias in the estimation accuracy.
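A map like Fig. 17 can be reproduced by binning the ground-truth gaze points into a coarse grid over the display and averaging the 2-D error vectors per cell. The sketch below does exactly that; the grid size, the coordinate convention, and the choice of averaging the error vectors (rather than their magnitudes) are our assumptions.

import numpy as np

def spatial_error_map(gt_points, est_points, screen_wh, grid=(8, 6)):
    """Per-cell average error vector over the display.

    gt_points, est_points : (N, 2) arrays of gaze points in display coordinates [mm]
    screen_wh             : (width, height) of the display [mm]
    grid                  : number of cells along (x, y)
    Returns the error magnitude (intensity in Fig. 17 (a)) and direction
    (hue in Fig. 17 (b)) for each cell.
    """
    gx, gy = grid
    sums = np.zeros((gy, gx, 2))
    counts = np.zeros((gy, gx))
    for (x, y), err in zip(gt_points, est_points - gt_points):
        j = min(max(int(x / screen_wh[0] * gx), 0), gx - 1)
        i = min(max(int(y / screen_wh[1] * gy), 0), gy - 1)
        sums[i, j] += err
        counts[i, j] += 1
    mean_err = sums / np.maximum(counts, 1)[..., None]
    magnitude = np.linalg.norm(mean_err, axis=-1)
    direction = np.arctan2(mean_err[..., 1], mean_err[..., 0])
    return magnitude, direction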

5 CONCLUSION

We propose a novel gaze estimation framework in this paper that auto-calibrates by using saliency maps. Unlike previous approaches that require an explicit calibration, our method automatically establishes the mapping from the eye image to the gaze point using video clips. Taking a synchronized set of eye images and video frames, our method trains the gaze estimator by regarding the saliency maps as the probability distributions of the gaze points. In our experimental setting with fixed head positions, our method achieves an accuracy of about 3.5 degrees.

We took an appearance-based gaze estimation approach. Appearance-based methods have a significant benefit in that they can be constructed using only a monocular camera without requiring a specialized hardware device. However, one of the biggest technical challenges common among existing appearance-based methods is the difficulty in handling head pose movements. This is mainly because allowing head pose movement significantly expands the space of the training samples, which makes the training more difficult. Some efforts have been made to handle head pose variations in an appearance-based setting [9], [38], and our future work includes adopting these techniques to allow head pose movement. In addition, as shown in [25], an alternative future direction is to incorporate our approach into a model-based gaze estimation framework to further improve the estimation accuracy.

Naturally, the estimation accuracy of our method depends on the quality of the raw saliency maps extracted from the input video clips. The human gaze control mechanism is not yet completely understood, and there is a wide range of possibilities for more advanced saliency models to be used in our method to improve the gaze estimation accuracy.

ACKNOWLEDGMENTS

This research was supported by CREST, JST.

REFERENCES

[1] D. W. Hansen and Q. Ji, "In the eye of the beholder: A survey of models for eyes and gaze," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 3, pp. 478–500, 2010.

[2] R. J. K. Jacob, "Eye tracking in advanced interface design," in Virtual Environments and Advanced Interface Design, W. Barfield and T. A. Furness, Eds. Oxford University Press, 1995, pp. 258–288.

[3] T. Ohno, "One-point calibration gaze tracking method," in Proceedings of the 2006 Symposium on Eye Tracking Research & Applications (ETRA '06), 2006, pp. 34–34.

[4] E. D. Guestrin and M. Eizenman, "Remote point-of-gaze estimation requiring a single-point calibration for applications with infants," in Proceedings of the 2008 Symposium on Eye Tracking Research & Applications (ETRA '08), 2008, pp. 267–274.

[5] A. Villanueva and R. Cabeza, "A novel gaze estimation system with one calibration point," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 1123–1138, 2008.

[6] T. Nagamatsu, J. Kamahara, T. Iko, and N. Tanaka, "One-point calibration gaze tracking based on eyeball kinematics using stereo cameras," in Proceedings of the 2008 Symposium on Eye Tracking Research & Applications (ETRA '08), 2008, pp. 95–98.

[7] E. D. Guestrin and M. Eizenman, "General theory of remote gaze estimation using the pupil center and corneal reflections," IEEE Transactions on Biomedical Engineering, vol. 53, no. 6, pp. 1124–1133, 2006.

[8] H. Yamazoe, A. Utsumi, T. Yonezawa, and S. Abe, "Remote gaze estimation with a single camera based on facial-feature tracking without special calibration actions," in Proceedings of the 2008 Symposium on Eye Tracking Research & Applications (ETRA '08), 2008, pp. 245–250.


[9] Y. Sugano, Y. Matsushita, Y. Sato, and H. Koike, "An incremental learning method for unconstrained gaze estimation," in Proceedings of the 10th European Conference on Computer Vision (ECCV 2008), 2008, pp. 656–667.

[10] C. Koch and S. Ullman, "Shifts in selective visual attention: towards the underlying neural circuitry," Human Neurobiology, vol. 4, no. 4, pp. 219–227, 1985.

[11] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.

[12] C. Privitera and L. Stark, "Algorithms for defining visual regions-of-interest: Comparison with eye fixations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 9, pp. 970–982, 2000.

[13] L. Itti and P. F. Baldi, "Bayesian surprise attracts human attention," in Advances in Neural Information Processing Systems, vol. 19 (NIPS 2005), 2006, pp. 547–554.

[14] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Advances in Neural Information Processing Systems (NIPS 2007), vol. 19, 2007, pp. 545–552.

[15] N. D. B. Bruce and J. K. Tsotsos, "Saliency, attention, and visual search: An information theoretic approach," Journal of Vision, vol. 9, no. 3, pp. 1–24, 2009.

[16] D. Parkhurst, K. Law, and E. Niebur, "Modeling the role of saliency in the allocation of overt visual attention," Vision Research, vol. 42, no. 1, pp. 107–123, 2002.

[17] A. C. Schutz, D. I. Braun, and K. R. Gegenfurtner, "Eye movements and perception: A selective review," Journal of Vision, vol. 11, no. 5, 2011.

[18] W. Kienzle, F. A. Wichmann, B. Scholkopf, and M. O. Franz, "A nonparametric approach to bottom-up visual saliency," in Advances in Neural Information Processing Systems (NIPS 2006), vol. 19, 2006, pp. 689–696.

[19] W. Kienzle, B. Scholkopf, F. A. Wichmann, and M. O. Franz, "How to find interesting locations in video: a spatiotemporal interest point detector learned from human eye movements," in Proceedings of the 29th Annual Symposium of the German Association for Pattern Recognition (DAGM 2007), 2007, pp. 405–414.

[20] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in Proceedings of the Twelfth IEEE International Conference on Computer Vision (ICCV 2009), 2009.

[21] Q. Zhao and C. Koch, "Learning a saliency map using fixated locations in natural scenes," Journal of Vision, vol. 11, no. 3, 2011.

[22] E. Horvitz, C. Kadie, T. Paek, and D. Hovel, "Models of attention in computing and communication: from principles to applications," Communications of the ACM, vol. 46, no. 3, pp. 52–59, 2003.

[23] R. Vertegaal, J. Shell, D. Chen, and A. Mamuji, "Designing for augmented attention: Towards a framework for attentive user interfaces," Computers in Human Behavior, vol. 22, no. 4, pp. 771–789, 2006.

[24] Y. Sugano, Y. Matsushita, and Y. Sato, "Calibration-free gaze sensing using saliency maps," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2010), 2010, pp. 2667–2674.

[25] J. Chen and Q. Ji, "Probabilistic gaze estimation without active personal calibration," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2011), 2011.

[26] M. Cerf, J. Harel, W. Einhauser, and C. Koch, "Predicting human gaze using low-level saliency combined with face detection," in Advances in Neural Information Processing Systems (NIPS 2008), vol. 20, 2008, pp. 241–248.

[27] OMRON OKAO Vision library, https://www.omron.com/r_d/coretech/vision/okao.html.

[28] O. Williams, A. Blake, and R. Cipolla, "Sparse and semi-supervised visual mapping with the S3GP," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), vol. 1, 2006, pp. 230–237.

[29] D. W. Hansen, J. S. Agustin, and A. Villanueva, "Homography normalization for robust gaze estimation in uncalibrated setups," in Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications (ETRA '10), 2010, pp. 13–20.

[30] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.

[31] S. Sundararajan and S. S. Keerthi, "Predictive approaches for choosing hyperparameters in Gaussian processes," Neural Computation, vol. 13, no. 5, pp. 1103–1118, 2001.

[32] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems. Society for Industrial Mathematics, 1987.

[33] Vimeo, http://vimeo.com/.

[34] OpenMP, http://openmp.org/.

[35] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.

[36] Tobii Technology, http://www.tobii.com/.

[37] R. Peters, A. Iyer, L. Itti, and C. Koch, "Components of bottom-up gaze allocation in natural images," Vision Research, vol. 45, no. 18, pp. 2397–2416, 2005.

[38] F. Lu, Y. Sugano, O. Takahiro, and Y. Sato, "A head pose-free approach for appearance-based gaze estimation," in Proceedings of the 22nd British Machine Vision Conference (BMVC 2011), 2011.

Yusuke Sugano received his B.S., M.S. and Ph.D. degrees in information science and technology from the University of Tokyo in 2005, 2007 and 2010, respectively. He is currently a Project Research Associate at the Institute of Industrial Science, the University of Tokyo. His research interests include computer vision and human-computer interaction.

Yasuyuki Matsushita received his B.S., M.S. and Ph.D. degrees in EECS from the University of Tokyo in 1998, 2000, and 2003, respectively. He joined Microsoft Research Asia in April 2003. He is a Lead Researcher in the Visual Computing Group of MSRA. His major areas of research are computer vision (photometric techniques, such as radiometric calibration, photometric stereo, shape-from-shading) and computer graphics (image relighting, video analysis and synthesis). Dr. Matsushita is on the editorial board of IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), International Journal of Computer Vision (IJCV), IPSJ Journal of Computer Vision and Applications (CVA), The Visual Computer Journal, and Encyclopedia of Computer Vision. He served/is serving as a Program Co-Chair of PSIVT 2010, 3DIMPVT 2011 and ACCV 2012. He is appointed as a Guest Associate Professor at Osaka University (April 2010–) and the National Institute of Informatics, Japan (April 2011–). He is a senior member of the IEEE.

Yoichi Sato is a professor at the Institute of Industrial Science, the University of Tokyo. He received his B.S. degree from the University of Tokyo in 1990, and his M.S. and Ph.D. degrees in robotics from the School of Computer Science, Carnegie Mellon University, in 1993 and 1997, respectively. His research interests include physics-based vision, reflectance analysis, image-based modeling and rendering, and tracking and gesture analysis. He is currently on the Editorial Board of the International Journal of Computer Vision, IPSJ Journal of Computer Vision and Applications, and IET Computer Vision. He also served as an Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence. He is a Program Co-Chair of ECCV 2012.