Label Consistent Quadratic Surrogate Model for Visual Saliency Prediction

Yan Luo1, Yongkang Wong2, Qi Zhao1∗

1 Department of Electrical and Computer Engineering, National University of Singapore
2 Interactive & Digital Media Institute, National University of Singapore

{luoyan, yongkang.wong, eleqiz}@nus.edu.sg

Abstract

Recently, an increasing number of works have proposed to learn visual saliency by leveraging human fixations. However, the collection of human fixations is time consuming and the existing eye tracking datasets are generally small when compared with other domains. Thus, they contain a certain degree of dataset bias due to the large image variations (e.g., outdoor scenes vs. emotion-evoking images). In the learning based saliency prediction literature, most models are trained and evaluated within the same dataset, and cross dataset validation is not yet a common practice. Instead of directly applying a model learned from another dataset in a cross dataset fashion, it is better to transfer the prior knowledge obtained from one dataset to improve the training and prediction on another. In addition, since new datasets are built and shared in the community from time to time, it would be good not to retrain the entire model when new data are added. To address these problems, we propose a new learning based saliency model, namely the Label Consistent Quadratic Surrogate algorithm, which employs an iterative online algorithm to learn a sparse dictionary with a label consistent constraint. The advantages of the proposed model are threefold: (1) the quadratic surrogate function guarantees convergence at each iteration, (2) the label consistent constraint enforces the predicted sparse codes to be discriminative, and (3) the online properties enable the proposed algorithm to adapt an existing model to new data without retraining. As shown in this work, the proposed saliency model achieves better performance than the state-of-the-art saliency models.

1. Introduction

With recent advances in sensor technology, computer vision systems are undoubtedly facing great difficulty in processing the increasing number of pixels available from multiple visual sources. To tackle this information overload problem, visual saliency detection has emerged as an efficient solution to detect regions of interest and to enhance existing computer vision systems, for example, in image and video compression [15], visual tracking [25] and object recognition [28, 2, 33].

∗Corresponding author.

Conventional saliency models employ a straightforward bottom-up solution to predict visual saliency [13, 14, 17, 39]. Recently, learning based saliency prediction models have been proposed to leverage the power of machine learning techniques and human knowledge (from human fixation maps), and to decipher the patterns that better predict salient regions. These models generally achieve stable performance on various datasets. However, conventional learning based models in saliency prediction assume that the training data is fully observed and that sufficient training data exists. It is important to state that the collection of human fixations is time consuming and the existing eye tracking datasets are relatively small when compared with other domains. Thus, they contain a certain degree of dataset bias due to the large variations in images (e.g., outdoor scenes vs. emotion-evoking images) and human subjects. In addition, existing learning based models are trained and evaluated within the same dataset, and cross dataset validation is not yet a common practice. It is important to note that existing models are unable to adapt a learned model to new training data unless the model is retrained from scratch.

To address all the aforementioned problems, we propose a new saliency prediction model, namely the Label Consistent Quadratic Surrogate (LCQS) algorithm, which employs an iterative online dictionary learning framework with a label consistent constraint. The novelties of the proposed model are as follows. First, we adapt the Quadratic Surrogate (QS) algorithm [26] to solve the sparse dictionary learning problem. It enables the dictionary learning process to depend on one training sample at a time, which provides good training efficiency and convergence rate. Second, we add a label consistent constraint in the dictionary learning process


to ensure that the learned sparse dictionary can generate discriminative sparse codes for saliency prediction. Last but not least, the proposed saliency model can adapt a trained dictionary to new training data. This allows us to leverage the prior knowledge from other datasets to improve the quality of the dictionary on a new dataset. This property also addresses the limitation in the number and size of available human fixation datasets. As shown in Section 5, the proposed saliency prediction model achieves better performance than the state-of-the-art saliency models.

The remainder of the paper is organized as follows. Section 2 describes the related work. Sections 3 and 4 elaborate our proposed online saliency framework with the LCQS algorithm. Section 5 demonstrates qualitative and quantitative results, and Section 6 concludes the paper.

2. Related Works

2.1. Saliency model

Modeling visual attention has recently attracted a great amount of research interest [4, 13, 14, 17, 18, 39]. The first saliency model was proposed by Koch and Ullman [23]. Based on [23], Itti et al. [17] proposed a bottom-up computational model with center-surround features to detect conspicuous regions. Zhang et al. [39] exploited bottom-up saliency cues from natural statistics to measure the improbability of a local patch. Harel et al. [13] introduced a Graph-Based Visual Saliency (GBVS) model that weighs the dissimilarity between two arbitrary positions to detect conspicuous regions. In [14], Hou et al. considered saliency detection as a figure-ground separation problem and employed sparse signal analysis to solve it. Recently, Zhang et al. [38] proposed a Boolean map based saliency model that computes the saliency map by analyzing the topological structure of Boolean maps.

The aforementioned bottom-up saliency models are straightforward solutions for saliency detection. Recently, learning based saliency prediction models have emerged to leverage the power of machine learning techniques and human knowledge (from human fixation maps), and to decipher the patterns that better predict salient regions [18, 21, 40]. Jiang et al. [18] proposed a learning based model built on the Label Consistent K-SVD (LC-KSVD) algorithm [20], where the goal is to fill the semantic gap between computational saliency models and human behavior. Although the results demonstrated superiority over non-learning based methods, this method has to be retrained from scratch when new training data are built and shared in the community.

2.2. Online Learning

Machine learning techniques are widely employed in the signal processing, neuroscience and computer vision communities. Most machine learning techniques employ a batch learning framework, where the model is trained once with a set of training data. The training process is generally slow and the quality of the trained model is confined by the quality of the training data. In contrast, online machine learning is a model of induction that learns one instance at a time. There are two unique advantages of online learning: (1) the number of training instances in each learning stage is very small, which results in lower training cost and better model convergence; (2) it avoids model retraining from scratch and can adapt the existing model to new data in the future.

There exists a variety of online learning algorithms [6, 7, 9, 22, 29, 36]. The normal herd algorithm [7] was introduced to herd a Gaussian weight vector distribution by trading off velocity constraints against a loss function. In [36], the soft confidence-weighted algorithm was proposed to address the limitation of the confidence-weighted algorithm [6], which is prone to wrongly changing the parameters of the distribution. Kivinen et al. [22] presented a kernel-based algorithm with Stochastic Gradient Descent (SGD) in an online setting. This method suffers from high algorithmic complexity and extensive memory cost for large numbers of training instances. The aforementioned works rely on linear models and the SGD method within the original feature space, which is not capable of fully deciphering complicated patterns. Based on the K-SVD algorithm [1], Jiang et al. [19] proposed the LC-KSVD model to learn an overcomplete dictionary over a set of training instances and enforce the learned dictionary to be more discriminative. However, this method does not satisfy the mathematical properties of online learning and requires retraining the model when new training instances are available. In [20], Jiang et al. extended the LC-KSVD model to an incremental learning framework with SGD. However, there is no evidence to guarantee the convergence properties in each learning stage, and the model does not support online learning with new training data. In this work, we adopt an online dictionary learning algorithm, namely the Quadratic Surrogate (QS) algorithm [26], as the solution for learning the sparse dictionary.

3. Sparse Coding Based Saliency Model

3.1. Feature Extraction & Sampling

Itti et al. [16] showed that center-surround features are effective for modeling visual attention. The Histogram of Oriented Gradients (HOG) [8] has been widely accepted as one of the best features to capture edge or local shape information in detection. In this work, we adopt center-surround and HOG features as the input of the proposed model.

Center-Surround Feature. Following the conventional saliency model by Itti et al. [17], an input image is subsampled into a Gaussian pyramid of S scales, from 1/1


Figure 1: An overview of the LCQS saliency model. In the learning stage, image features and human fixations (v, Z) are sampled from training images and combined with the knowledge matrices (Q, H) to learn ⟨D, w, Q, H⟩ = F_lcqs(v, Z, Q, H); in the prediction stage, features Z sampled from a test image yield the saliency map M = S_lcqs(Z, D, w).

(scale 0) to 1/256 (scale 8). The image is decomposed into seven feature channels at each scale, including two color contrast channels (Red-Green C_RG and Blue-Yellow C_BY), an intensity channel I, and four local orientation channels (O_θ, θ ∈ {0°, 45°, 90°, 135°}) computed using Gabor filters. For each of these channels, center-surround feature maps are computed as the difference between each center pixel at a fine scale c ∈ {3, 4, 5} and the corresponding surrounding pixels at a coarse scale s = c + δ, δ ∈ {2, 3}, yielding 6 center-surround maps in total.

Histogram of Oriented Gradients Feature. HOG features can capture an object's texture and contour information robustly against noise or environmental changes. A locally normalized HOG representation with both contrast-sensitive and contrast-insensitive orientation bins is incorporated. Similar to the feature construction in [11], we define a dense representation of the image at each particular scale ĉ, where ĉ = c + 3.

Sampling Strategy. In this work, the dictionary is learned from both salient and non-salient training samples. First, a ground truth saliency map, M_GT, of an image is derived from the visual fixation maps from human eye tracking data. Specifically, each fixation location is represented as a white pixel while non-fixated locations are represented with black pixels, followed by a blur operation with a Gaussian kernel. Second, given the feature maps F from the various scales c or ĉ, training samples can be extracted based on the saliency value v_{x,y} at the corresponding location of M_GT. Each training sample is represented as a tuple ⟨v, z⟩ consisting of: (1) the saliency value v; and (2) a feature vector z obtained by extracting the r × r neighborhood at the corresponding pixel of F and concatenating all the center-surround and HOG features. We select the n and m samples with the highest and lowest saliency values, respectively. In this work, we empirically set both values to 3600.
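The sampling strategy above amounts to ranking pixels by their ground-truth saliency and keeping the two extremes. A minimal sketch follows (not the authors' code); the function name, array layout, and the use of pre-stacked per-pixel features are illustrative assumptions:

```python
# Sketch: select the n most salient and m least salient locations of the
# blurred ground-truth map M_GT and pair them with per-pixel feature vectors.
import numpy as np

def sample_training_pairs(M_GT, F, n=3600, m=3600):
    """M_GT: (h, w) ground-truth saliency; F: (h, w, d) per-pixel features."""
    v = M_GT.ravel()
    order = np.argsort(v)
    idx = np.concatenate([order[-n:], order[:m]])  # salient then non-salient
    Z = F.reshape(-1, F.shape[-1])[idx].T          # (d, n + m) feature matrix
    return v[idx], Z
```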

3.2. Label Consistent Quadratic Surrogate Model

In the context of sparse representation, the objective is to approximate a given sample as a linear combination of a small number of basis elements, where these basis elements form subspaces of a feature space. This feature space is assumed to be overcomplete such that any given sample can be represented with a relatively small set of basis elements. In this work, a sparse coding approach is employed to learn an efficient representation of image features in relation to visual saliency in an online fashion. An overview of the proposed model is shown in Fig. 1. Under a formal mathematical formulation, suppose that D = [d_1, d_2, ..., d_k] ∈ R^{m×k} is a dictionary where each column d_i is known as a basis. Given a set of training feature samples, Z = [z_1, z_2, ..., z_n] ∈ R^{m×n}, extracted from salient and non-salient image patches, the conventional sparse dictionary learning problem [1, 24, 26] is solved by optimizing the empirical cost function

$$D = \mathcal{F}(Z) = \arg\min_{D} f(Z, D) = \arg\min_{D} \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, D) \tag{1}$$

where ℓ is a loss function such that ℓ(z, D) approaches 0 when D perfectly represents z. ℓ(z, D) approximates the sparse solution x by solving the ℓ1-minimization problem, which yields the convex optimization problem [24, 26]

$$\ell(z, D) = \min_{x \in \mathbb{R}^k} \frac{1}{2}\|z - Dx\|_2^2 + \lambda\|x\|_1 \tag{2}$$

which can be rewritten as a matrix factorization problem with a sparsity penalty term

$$\ell(Z, D) = \min_{D \in \mathcal{C},\, X \in \mathbb{R}^{k \times n}} \frac{1}{2}\|Z - DX\|_F^2 + \lambda\|X\|_{1,1} \tag{3}$$

where λ is the regularization parameter and C is the convex set of matrices satisfying the constraint

$$\mathcal{C} \triangleq \{D \in \mathbb{R}^{m \times k} \ \mathrm{s.t.} \ \forall j = 1, \dots, k,\ d_j^T d_j \le 1\} \tag{4}$$
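To make Eqs. (2)-(4) concrete, here is a minimal sketch (not the authors' implementation) that encodes one sample against a dictionary whose columns satisfy the constraint of Eq. (4), using scikit-learn's LARS-based lasso solver; the dimensions and λ are assumed toy values:

```python
import numpy as np
from sklearn.decomposition import sparse_encode

m, k = 64, 256                       # feature dimension, dictionary size
rng = np.random.default_rng(0)
D = rng.standard_normal((m, k))
D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # enforce d_j^T d_j <= 1, Eq. (4)
z = rng.standard_normal(m)           # one feature sample

lam = 0.15
# sparse_encode expects row-major samples and a (k, m) dictionary
x = sparse_encode(z[None, :], D.T, algorithm='lasso_lars', alpha=lam)[0]
print(np.count_nonzero(x), 'nonzero coefficients out of', k)
```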

Assuming that the training set is composed of i.i.d. samples of a distribution p(z), i.e., z ∼ p(z), one element z_t is drawn from Z at a time in the inner loop of the learning process, where t is the current iteration. Given the dictionary D_{t−1} obtained from the previous iteration and the sparse codes x_i, ∀ i < t, computed during the previous iterations, the updated dictionary D_t is computed by minimizing the following Quadratic Surrogate (QS) function

$$f_t(Z, D) = \frac{1}{t} \sum_{i=1}^{t} \left( \frac{1}{2}\|z_i - Dx_i\|_2^2 + \lambda\|x_i\|_1 \right) \tag{5}$$


In [26], Mairal et al. proved that the QS function f_t is an approximate upper bound of f in Eq. (1) and converges to the same limit; thus, f_t acts as a surrogate for f. As f_t is close to f_{t−1} for large values of t, so are D_t and D_{t−1} under suitable assumptions, which makes it efficient to use D_{t−1} as a warm restart for computing D_t. The QS algorithm is guaranteed to converge to the set of stationary points of the dictionary learning problem; without such a proof, the convergence of K-SVD is uncertain. Furthermore, the QS solver can adopt prior knowledge from past learning processes to improve the current dictionary learning, a property not possessed by K-SVD.

Given the QS function f_t, Eq. (1) can be rewritten as

$$D = \mathcal{F}_{qs}(Z) = \arg\min_{D} f_t(Z, D) \tag{6}$$

The details of the dictionary learning and prior knowledge adaptation of F_qs will be elaborated in Section 4.

Similar to [18], the saliency prediction problem is cast as a binary classification problem in this work. Given the training samples from Section 3.1, a discriminative sparse error term, ‖U − LX‖²_F, and a classification error term, ‖v^T − w^T X‖²₂, are taken into account to approximate the discriminative sparse codes X = [x_1, x_2, ..., x_n] ∈ R^{k×n} and to learn a sparse dictionary D. The objective function of the dictionary learning problem for visual saliency prediction can be formulated as:

$$\langle D, L, X, w \rangle = \arg\min_{D, L, X, w} \|Z - DX\|_F^2 + \alpha\|U - LX\|_F^2 + \beta\|v^T - w^T X\|_2^2 + \lambda\|X\|_{1,1} \tag{7}$$

and

$$U = \begin{pmatrix} U_0^1 & & & & & \\ & \ddots & & & & \\ & & U_0^S & & & \\ & & & U_1^1 & & \\ & & & & \ddots & \\ & & & & & U_1^S \end{pmatrix} \tag{8}$$

where the coefficients α and β control the relative contributions of the discriminative sparse error term and the classification error term, respectively. v holds the saliency labels from the human fixation ground truth, and w holds the classification weights used to reconstruct the ground truth saliency labels. The matrix U ∈ {0, 1}^{k×n} contains the discriminative sparse codes of the input Z, and L ∈ R^{k×k} is a linear transformation matrix that enforces the original sparse codes in X to be more discriminative. Assume Z = (Z_0^1, ..., Z_0^S, Z_1^1, ..., Z_1^S) is a set of training features, where S is the maximal scale and the subscripts 0 and 1 indicate that the training features come from non-salient and salient samples, respectively. U_0^s, s ∈ {1, 2, ..., S}, is generated from the corresponding Z_0^s. For example, if Z_0^s only contains z_1 and z_2, U_0^s is a 2 × 2 all-ones matrix.

To compute the optimal sparse codes X, Eq. (7) can be rewritten as:

$$\langle \tilde{D}, X \rangle = \arg\min_{\tilde{D}, X} \|\tilde{Z} - \tilde{D}X\|_F^2 + \lambda\|X\|_{1,1} \tag{9}$$

where Z̃ and D̃ are denoted as

$$\tilde{Z} = (Z^T, \sqrt{\alpha}\,U^T, \sqrt{\beta}\,v)^T \tag{10}$$

$$\tilde{D} = (D^T, \sqrt{\alpha}\,L^T, \sqrt{\beta}\,w)^T \tag{11}$$

and λ is a regularization parameter. Now Eq. (9) becomes a typical sparse coding problem.
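Assembling the augmented matrices of Eqs. (10)-(11) is a simple stacking operation. A minimal sketch, with shapes as defined above and names chosen for illustration:

```python
import numpy as np

def augment(Z, U, v, D, L, w, alpha, beta):
    """Stack the label terms so Eq. (7) reduces to the plain sparse coding
    problem of Eq. (9). Z: (m, n), U: (k, n), v: (n,), D: (m, k),
    L: (k, k), w: (k,)."""
    Z_tilde = np.vstack([Z, np.sqrt(alpha) * U, np.sqrt(beta) * v[None, :]])
    D_tilde = np.vstack([D, np.sqrt(alpha) * L, np.sqrt(beta) * w[None, :]])
    return Z_tilde, D_tilde
```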

3.3. Saliency Prediction

Given a learned dictionary D and a set of feature patches Z = [z_1, z_2, ..., z_n] ∈ R^{m×n} extracted from all pixels of a test image, the sparse code x and saliency value v for each corresponding z can be computed as follows:

$$x = \arg\min_{x} \frac{1}{2}\|z - Dx\|_2^2 + \lambda\|x\|_1 \tag{12}$$

$$v = (w^T x) \cdot |w^T x| \tag{13}$$

where Eq. (12) is solved with the LARS algorithm [10] and w is obtained from Eq. (7). The predicted v from all pixel locations form a saliency response M.

Finally, to represent the conspicuity at every location of the visual field by a scalar quantity and to simulate the field of view of human attention, the saliency response M is convolved with a Gaussian kernel g, and the normalized saliency map M̄ is computed as:

$$\bar{M} = \frac{M * g - \min(M * g)}{\max(M * g) - \min(M * g)} \tag{14}$$

where ∗ represents the convolution operator.
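A minimal sketch of the whole prediction stage, Eqs. (12)-(14), assuming per-pixel features are already extracted; the solver choice mirrors the LARS-based encoding above, and the Gaussian σ is an illustrative assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.decomposition import sparse_encode

def predict_saliency(Z, D, w, lam, h, w_img, sigma=8.0):
    """Z: (m, n) features for n = h * w_img pixels; D: (m, k); w: (k,)."""
    X = sparse_encode(Z.T, D.T, algorithm='lasso_lars', alpha=lam)  # Eq. (12)
    s = X @ w
    v = s * np.abs(s)                             # Eq. (13): signed square
    M = gaussian_filter(v.reshape(h, w_img), sigma)
    return (M - M.min()) / (M.max() - M.min() + 1e-12)  # Eq. (14)
```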

4. Online Dictionary Learning

In this section, we elaborate the detailed steps to solve Eq. (9) with the Label Consistent Quadratic Surrogate (LCQS) algorithm, followed by the online mathematical structure to update the dictionary, and the initialization and optimization of the LCQS model. The proposed online saliency model with the LCQS algorithm is summarized in Algorithm 1.

4.1. Dictionary Learning & Update

Given a set of training samples Z̃ = [z̃_1, ..., z̃_n] where z̃_i ∼ p(z̃), one sample z̃_t is drawn from Z̃ at iteration t to compute the decomposition x_t of z̃_t with the dictionary learned in the previous iteration, D̃_{t−1}, using the LARS algorithm [10]

$$x_t = \arg\min_{x \in \mathbb{R}^k} \frac{1}{2}\|\tilde{z}_t - \tilde{D}_{t-1}x\|_2^2 + \lambda\|x\|_1 \tag{15}$$

The computed x_t is then used to update the knowledge matrices Q and H via

$$Q_t \leftarrow Q_{t-1} + x_t x_t^T, \qquad H_t \leftarrow H_{t-1} + \tilde{z}_t x_t^T \tag{16}$$

where Q_0 and H_0 are both zero matrices if there is no prior information.
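These two steps, a single LARS encoding followed by two rank-one updates, make up one online iteration. A minimal sketch (names and the solver choice are illustrative assumptions, not the authors' code):

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def lcqs_step(z_t, D_tilde, Q, H, lam):
    """One pass of Eqs. (15)-(16) for a single augmented sample z_t."""
    x_t = sparse_encode(z_t[None, :], D_tilde.T,
                        algorithm='lasso_lars', alpha=lam)[0]  # Eq. (15)
    Q += np.outer(x_t, x_t)                                    # Eq. (16)
    H += np.outer(z_t, x_t)
    return x_t, Q, H
```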

Meanwhile, the objective function in Eq. (9) can be rewritten in an iterative fashion

$$\tilde{D}_t = \arg\min_{\tilde{D} \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \left( \frac{1}{2}\|\tilde{z}_i - \tilde{D}x_i\|_2^2 + \lambda\|x_i\|_1 \right) = \arg\min_{\tilde{D} \in \mathcal{C}} \frac{1}{t} \left( \frac{1}{2}\mathrm{Tr}(\tilde{D}^T \tilde{D} Q_t) - \mathrm{Tr}(\tilde{D}^T H_t) \right) \tag{17}$$

In the dictionary update process, the block-coordinate descent method is applied with D̃_{t−1} as a warm restart. The update procedure does not require any parameter to control the learning rate. In addition, it does not store the training samples and sparse codes from the previous iterations, but only the thesaurus matrices Q_t = [q_{1,t}, ..., q_{k,t}] and H_t = [h_{1,t}, ..., h_{k,t}]. In each iteration, each basis in D̃ is sequentially updated, i.e., the j-th basis d_j is updated at a time while the other ones are frozen, under the constraint d_j^T d_j ≤ 1. Specifically, d_j is updated to optimize Eq. (17) via

$$y_j \leftarrow \frac{1}{Q_{jj}}(h_j - \tilde{D}q_j) + d_j, \qquad d_j \leftarrow \frac{1}{\max(\|y_j\|_2, 1)}\, y_j \tag{18}$$

In the LCQS model, as x_i is a sparse vector and the coefficients of Q_t are often concentrated in the diagonal region, the block-coordinate descent method can be performed efficiently. In the dictionary update process, each basis in D̃ undergoes the update until a convergence criterion is satisfied [26].
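The column-wise update of Eq. (18) touches only Q and H, never the past samples. A minimal sketch under the shapes defined above (a fixed pass count stands in for the convergence test of [26]):

```python
import numpy as np

def update_dictionary(D, Q, H, n_passes=1, eps=1e-10):
    """Block-coordinate descent of Eq. (18); D: (m, k), Q: (k, k), H: (m, k)."""
    k = D.shape[1]
    for _ in range(n_passes):          # repeat until convergence in practice
        for j in range(k):
            y_j = (H[:, j] - D @ Q[:, j]) / (Q[j, j] + eps) + D[:, j]
            D[:, j] = y_j / max(np.linalg.norm(y_j), 1.0)  # project onto C
    return D
```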

4.2. Initialization

For the LCQS algorithm, D_0, L_0 and w_0 are initialized as follows. Given the training samples Z, D_0 can be learned with Eq. (3). For L_0, the multivariate ridge regression model [12] is applied with the quadratic loss and ℓ2-norm regularization as follows

$$L = \arg\min_{L} \|U - LX\|^2 + \lambda_2\|L\|_2^2 \tag{19}$$

which leads to the following solution

$$L_0 = (XX^T + \lambda_2 I)^{-1} X U^T \tag{20}$$

Algorithm 1: Pseudo code for the Label Consistent Quadratic Surrogate algorithm

Input: Z, v, T, λ, α, β
Output: D, w
1: Initialize w_0, L_0, U
2: if prior knowledge exists then
3:   Q_0 ← Q_past, H_0 ← H_past
4: else
5:   Q_0 ← 0, H_0 ← 0
6: end if
7: Compute Z̃ and D̃ by Eq. (10) and Eq. (11)
8: for t = 1, 2, ..., T do
9:   Draw z̃_t from Z̃
10:  Compute x_t with Eq. (15)
11:  Compute Q_t, H_t with Eq. (16)
12:  Compute D̃_t using the block-coordinate descent method with D̃_{t−1} as a warm restart:
13:  repeat
14:    for j = 1 to k do
15:      Sequentially update the j-th column by Eq. (18)
16:    end for
17:  until convergence
18:  Update D̃_t
19: end for
20: Decompose D and w from D̃ by Eq. (11)
21: return D and w

where I is an identity matrix and λ_2 is the regularization parameter. Similar to the initialization of L_0, w_0 can be obtained by

$$w_0 = (XX^T + \lambda_1 I)^{-1} X v^T \tag{21}$$

where λ_1 is the corresponding regularization parameter. Once D_0 is computed, the LARS algorithm is performed to compute X, which is then fed into Eq. (20) and Eq. (21) to initialize L_0 and w_0.
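The two closed forms share the same structure, so the initialization can be written in a few lines. A minimal sketch following Eqs. (20)-(21) as stated above; names are illustrative:

```python
import numpy as np

def init_L_w(X, U, v, lam2, lam1):
    """X: (k, n) sparse codes, U: (k, n) discriminative codes, v: (n,) labels."""
    k = X.shape[0]
    L0 = np.linalg.inv(X @ X.T + lam2 * np.eye(k)) @ X @ U.T  # Eq. (20)
    w0 = np.linalg.inv(X @ X.T + lam1 * np.eye(k)) @ X @ v    # Eq. (21)
    return L0, w0
```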

4.3. Prior Knowledge Adaptation

As illustrated in Algorithm 1, the thesaurus matrices Q and H can be generated and saved as prior knowledge at each iteration. When we initiate a new dictionary training process on an unseen dataset, the proposed model first checks whether prior knowledge generated from previous dictionary learning processes exists. If there is no prior knowledge, Q and H are initialized as zero matrices. As shown in our experiments, the prior knowledge improves the dictionary learning, especially in the scenario where the training dataset is relatively small.
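In code, the warm start of Algorithm 1 (lines 2-6) reduces to loading the saved thesaurus matrices when they exist. A minimal sketch; the file format and names are illustrative assumptions:

```python
import os
import numpy as np

def load_prior(path, k, m_aug):
    """Return (Q0, H0): saved prior knowledge if available, else zeros."""
    if path is not None and os.path.exists(path):
        prior = np.load(path)          # expects arrays saved under 'Q' and 'H'
        return prior['Q'], prior['H']
    return np.zeros((k, k)), np.zeros((m_aug, k))
```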


Table 1: Overview of the eye tracking datasets. The human eye fixation ground truth was collected from free-viewing on each image.

Dataset      Subjects   Duration   Images
MIT [21]     15         3 sec      1003 natural indoor and outdoor scene images
OSIE [37]    15         3 sec      700 natural indoor and outdoor scene images, aesthetic photographs from Flickr and Google
NUSEF [32]   25         5 sec      758 everyday scene images from Flickr, aesthetic content from Photo.net and Google, emotion-evoking IAPS pictures

4.4. Optimization

Leverage the prior knowledge. At each iteration, the new thesaurus information is given equal weight to the prior information. In the online learning literature, a common practice is to allocate more weight to new information while reducing the weight of existing information to speed up convergence [27]. Taking this practice into account, Eq. (16) can be replaced by

$$Q_t \leftarrow \beta Q_{t-1} + x_t x_t^T, \qquad H_t \leftarrow \beta H_{t-1} + \tilde{z}_t x_t^T \tag{22}$$

where β = (1 − 1/t)^ρ and ρ is the convergence rate factor. Correspondingly, Eq. (17) becomes

$$\tilde{D}_t = \arg\min_{\tilde{D} \in \mathcal{C}} \frac{1}{\sum_{j=1}^{t}(j/t)^{\rho}} \sum_{i=1}^{t} \left(\frac{i}{t}\right)^{\rho} \left( \frac{1}{2}\|\tilde{z}_i - \tilde{D}x_i\|_2^2 + \lambda\|x_i\|_1 \right) = \arg\min_{\tilde{D} \in \mathcal{C}} \frac{1}{\sum_{j=1}^{t}(j/t)^{\rho}} \left( \frac{1}{2}\mathrm{Tr}(\tilde{D}^T \tilde{D} Q_t) - \mathrm{Tr}(\tilde{D}^T H_t) \right) \tag{23}$$

Now, Eq. (17) is a special case of Eq. (23) when ρ = 0.

Update with mini-batch. To improve the convergence speed, η > 1 samples are drawn at each iteration instead of a single sample, using the same heuristic as in the stochastic gradient descent algorithm. Let us denote z̃_{t,1}, ..., z̃_{t,η} as the samples drawn at iteration t. Hence, Eq. (16) can be rewritten to update the thesaurus information with multiple training samples as:

$$Q_t \leftarrow Q_{t-1} + \frac{1}{\eta}\sum_{i=1}^{\eta} x_{t,i} x_{t,i}^T, \qquad H_t \leftarrow H_{t-1} + \frac{1}{\eta}\sum_{i=1}^{\eta} \tilde{z}_{t,i} x_{t,i}^T \tag{24}$$
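Both optimizations affect only the thesaurus update. A minimal sketch combining Eq. (22) and Eq. (24), where ρ = 0 recovers the unweighted update of Eq. (16); names are illustrative:

```python
import numpy as np

def update_QH(Q, H, Z_batch, X_batch, t, rho=0.0):
    """Z_batch: (m, eta) augmented samples; X_batch: (k, eta) their codes."""
    eta = Z_batch.shape[1]
    beta = (1.0 - 1.0 / t) ** rho       # Eq. (22); beta = 1 when rho = 0
    Q = beta * Q + (X_batch @ X_batch.T) / eta   # Eq. (24), mini-batch mean
    H = beta * H + (Z_batch @ X_batch.T) / eta
    return Q, H
```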

5. Experiments

5.1. Experimental Configuration

Datasets and Evaluation Configuration. The proposed model is evaluated on 3 benchmark eye tracking datasets: the MIT dataset (MIT) [21], the Object and Semantic Images and Eye-tracking dataset (OSIE) [37] and the NUS Eye-Fixation (NUSEF) dataset [32]. These datasets contain a large number of affective stimuli [5], which is beneficial in this work. The details are summarized in Table 1. We compare the proposed method with the LC-KSVD saliency model [18] and 4 state-of-the-art bottom-up saliency models (i.e., Itti [17], GBVS [13], SUN [39] and Image Signature [14]).

Three types of experiment configuration are used to validate the performance. First, we evaluate the proposed model with the conventional learning strategy (denoted as Baseline, see Fig. 2(a)). Under this strategy, the training set and the evaluation set are both selected from dataset A. Second, we conduct an evaluation with cross dataset validation (denoted as CrossDB, see Fig. 2(b)). In this configuration, the saliency model is trained on dataset B and predicts on dataset A. Third, we leverage the prior information from another dataset to improve the model's quality. The training and prediction are both conducted on dataset A, while the model is trained with the prior knowledge of dataset B under the online learning configuration (denoted as Adaptive, see Fig. 2(c)).

Figure 2: Conceptual illustration of the three experiment configurations employed in this work: (a) Baseline — a dictionary is learned from samples extracted from dataset A and predicts on A; (b) CrossDB (cross dataset validation) — a dictionary is learned on dataset B and predicts on A; (c) Adaptive — prior knowledge extracted from dataset B samples is leveraged together with dataset A samples for improved dictionary learning before predicting on A. A and B are datasets, where A is randomly divided into a training set and an evaluation set.

The correlations of the visual content between datasets are considered when selecting datasets for the CrossDB and Adaptive configurations. The MIT and OSIE datasets both contain natural scene images. Hence, we conduct two experiments by mirroring the roles of MIT and OSIE, which provides an understanding of leveraging prior knowledge from a similar dataset as well as of the impact of the center bias factor. We also conduct one experiment by leveraging information from MIT to NUSEF in order to expose the disadvantage of the naive CrossDB method. A large


portion of NUSEF images contain emotional faces, nudes and actions, which have a semantic impact on human fixations and differ from the MIT and OSIE datasets.

Evaluation Metrics. There are several widely used metrics to evaluate the performance of visual saliency models against human fixation data. The Area Under the ROC Curve (AUC) [35] considers human fixations as the positive set, with points randomly chosen from the image as the negative set. However, AUC generates a large value for a central Gaussian model and is affected by center bias [34]. To address this problem, the shuffled AUC (sAUC) [35, 39] was introduced, which selects negative samples from the human fixation locations of all other training samples. In addition, the Normalized Scanpath Saliency (NSS) [31] and the Correlation Coefficient (CC) [30] are employed to measure the performance. NSS is defined as the average saliency value at the fixated locations in the normalized predicted saliency map, which has zero mean and unit variance, whereas CC measures the linear correlation between the saliency prediction and the ground truth. As mentioned in [34], observers show a marked tendency to fixate on the screen center, and this center bias is present in the MIT dataset [21]. The GBVS model implicitly uses center preference to predict saliency [13]. To conduct a fair comparison, we use a 200 × 200 pixel Gaussian blob (σ = 60) as the center bias and multiply it with the saliency maps to compute CC and NSS [3].

The Gaussian kernel for blurring affects the sAUC, NSS and CC scores. We parametrize the standard deviation of the blurring kernel from 0 to 0.08 in steps of 0.01 (in image widths). We first generate the saliency maps of the various models without smoothing, then blur them with the various kernels. The blurred saliency maps are used to generate the respective scores. The three evaluation metrics are complementary and provide a more objective evaluation of the various models. All performance is reported as the mean accuracy over 10-fold cross validation.
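As a concrete reading of the NSS definition above, here is a minimal sketch (not the authors' evaluation code) that normalizes a predicted map to zero mean and unit variance and averages it at the fixated pixels:

```python
import numpy as np

def nss(saliency_map, fixation_mask):
    """fixation_mask: boolean array, True at human fixation locations."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-12)
    return s[fixation_mask].mean()
```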

5.2. Performance Evaluation

The quantitative results are shown in Fig. 3, and the maximum scores are shown in Table 2. We first evaluate the results under the baseline configuration. The proposed LCQS-Baseline outperforms all methods across all blur widths with a noticeable margin in most scenarios. This is due to the advantages of the dictionary training stage and the convergence properties of the QS algorithm. On MIT, LCQS-Baseline remarkably outperforms the other models on sAUC, whereas the best NSS and CC of the other models are close to those of LCQS-Baseline. Quite a number of MIT images have a dominant object in the center of the image; as a result, a saliency model detects the same object as predicted by the other models.

Table 2: Quantitative results of the proposed LCQS saliency model and various state-of-the-art models. The accuracy is measured with the shuffled AUC, and the reported results are the mean accuracy over 10-fold validation. The best performance on each dataset is achieved by LCQS-Adaptive.

Model              A=MIT, B=OSIE   A=OSIE, B=MIT   A=NUSEF, B=MIT
Itti               0.6271          0.6575          0.5816
SUN                0.6609          0.7353          0.6172
Signature          0.6795          0.7487          0.6267
GBVS               0.6694          0.7055          0.6112
LCKSVD-Baseline    0.6846          0.7479          0.6406
LCKSVD-CrossDB     0.6694          0.7127          0.6374
LCQS-Baseline      0.6898          0.7649          0.6495
LCQS-CrossDB       0.6935          0.7415          0.6379
LCQS-Adaptive      0.7012          0.7696          0.6517

For the cross dataset validation, LCQS-CrossDB significantly outperforms LCKSVD-CrossDB, which is consistent with the pattern between LCQS-Baseline and LCKSVD-Baseline. The sAUC, NSS and CC of LCQS-CrossDB are lower than those of LCQS-Baseline on OSIE and NUSEF, whereas LCQS-CrossDB is higher than LCQS-Baseline on MIT. This is partly due to the fact that MIT has a considerable portion of fixations in the center, and the saliency map of LCQS-CrossDB has more false detections in the center.

LCQS-Adaptive achieves the best performance on sAUC, NSS and CC across all blur widths on all datasets. Compared to the naive cross dataset configuration, it benefits from eliminating the dataset bias by leveraging the other dataset's prior knowledge. The improvement of LCQS-Adaptive over LCQS-Baseline on NUSEF is also observed (with a smaller margin), despite the fact that the visual content of MIT is significantly different.

The qualitative results are shown in Fig. 4. LCQS-Baseline produces maps more consistent with human fixations than the other comparative models. For example, it better detects the two ships in the fourth row. By taking the prior knowledge into account, LCQS-Adaptive has a stronger response on the human face in the sixth row, which better approximates the human fixations than the result of LCQS-Baseline.

6. Conclusions

In this work, we have presented a new learning based saliency prediction model, which employs an iterative online algorithm to learn a sparse dictionary with label consistent constraints. By utilizing the advantages of the quadratic surrogate algorithm and the label consistent constraints, the proposed model consistently achieves noticeable improvements over existing state-of-the-art saliency models, and addresses the problem of insufficient eye fixation datasets by leveraging the prior knowledge from a learned model to improve the quality of learning.


Figure 3: Shuffled AUC, NSS and CC scores on the three dataset configurations (A=MIT, B=OSIE; A=OSIE, B=MIT; A=NUSEF, B=MIT) for Itti, SUN, Signature, GBVS, LCKSVD-Baseline, LCKSVD-CrossDB, LCQS-Baseline, LCQS-CrossDB and LCQS-Adaptive. The blur width is parameterized by a Gaussian's standard deviation in image widths.



Figure 4: Qualitative results of the proposed LCQS saliency model and various state-of-the-art models. Columns (left to right): Input, Fixation, LCQS-A, LCQS-B, LCQS-C, LCKSVD-B, LCKSVD-C, GBVS, Signature, Itti, SUN. Rows 1 & 2 are samples from the MIT dataset, rows 3 & 4 are samples from the OSIE dataset, and rows 5 & 6 are samples from the NUSEF dataset. The configurations for the cross dataset learning are as stated in Section 5.1. The suffixes A, B, and C stand for Adaptive, Baseline, and CrossDB, respectively.


Acknowledgments

The research was supported by the Singapore NRF under its IRC@SG Funding Initiative and administered by the IDMPO, the Defense Innovative Research Programme (No. 9014100596), and the Ministry of Education Academic Research Fund Tier 1 (No. R-263-000-A49-112).

References

[1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.

[2] B. Babenko, P. Dollar, and S. J. Belongie. Task specific local region matching. In IEEE International Conference on Computer Vision, pages 1–8, 2007.

[3] A. Borji. Boosting bottom-up and top-down visual features for saliency estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 438–445, 2012.

[4] A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207, 2013.

[5] A. Borji, H. Tavakoli, D. Sihite, and L. Itti. Analysis of scores, datasets, and models in visual saliency prediction. In IEEE International Conference on Computer Vision, pages 921–928, 2013.

[6] K. Crammer, M. Dredze, and F. Pereira. Exact convex confidence-weighted learning. In Advances in Neural Information Processing Systems, pages 345–352, 2008.

[7] K. Crammer and D. D. Lee. Learning via Gaussian herding. In Advances in Neural Information Processing Systems, pages 451–459, 2010.

[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 886–893, 2005.

[9] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[10] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

[11] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.

[12] G. H. Golub, P. C. Hansen, and D. P. O'Leary. Tikhonov regularization and total least squares. SIAM Journal on Matrix Analysis and Applications, 21(1):185–194, 1999.

[13] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Advances in Neural Information Processing Systems, pages 545–552, 2006.

[14] X. Hou, J. Harel, and C. Koch. Image signature: Highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):194–201, 2012.

[15] L. Itti. Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Transactions on Image Processing, 13(10):1304–1318, 2004.

[16] L. Itti and C. Koch. Computational modelling of visual attention. Nature Reviews Neuroscience, 2(3):194–203, 2001.

[17] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.

[18] M. Jiang, M. Song, and Q. Zhao. Leveraging human fixations in sparse coding: Learning a discriminative dictionary for saliency prediction. In IEEE International Conference on Systems, Man, and Cybernetics, pages 2126–2133, 2013.

[19] Z. Jiang, Z. Lin, and L. S. Davis. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1697–1704, 2011.

[20] Z. Jiang, Z. Lin, and L. S. Davis. Label consistent K-SVD: Learning a discriminative dictionary for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2651–2664, 2013.

[21] T. Judd, K. A. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In IEEE International Conference on Computer Vision, pages 2106–2113, 2009.

[22] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.

[23] C. Koch and S. Ullman. Shifts in selective visual attention: Towards the underlying neural circuitry. Matters of Intelligence, 188:115–141, 1987.

[24] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pages 801–808, 2007.

[25] V. Mahadevan and N. Vasconcelos. Saliency-based discriminant tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1007–1013, 2009.

[26] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.

[27] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, pages 355–368, 1998.

[28] A. Oliva, A. Torralba, M. S. Castelhano, and J. M. Henderson. Top-down control of visual attention in object detection. In IEEE International Conference on Image Processing, pages 253–256, 2003.

[29] F. Orabona and K. Crammer. New adaptive algorithms for online classification. In Advances in Neural Information Processing Systems, pages 1840–1848, 2010.

[30] N. Ouerhani, R. von Wartburg, H. Hugli, and R. Muri. Empirical validation of the saliency-based model of visual attention. Electronic Letters on Computer Vision and Image Analysis, 3(1):13–24, 2004.

[31] R. J. Peters, A. Iyer, L. Itti, and C. Koch. Components of bottom-up gaze allocation in natural images. Vision Research, 45(18):2397–2416, 2005.

[32] S. Ramanathan, H. Katti, N. Sebe, M. S. Kankanhalli, and T.-S. Chua. An eye fixation database for saliency detection in images. In Lecture Notes in Computer Science, volume 6314, pages 30–43, 2010.

[33] U. Rutishauser, D. Walther, C. Koch, and P. Perona. Is bottom-up attention useful for object recognition? In IEEE Conference on Computer Vision and Pattern Recognition, pages 37–44, 2004.

[34] B. W. Tatler. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision, 7(14), 2007.

[35] B. W. Tatler, R. J. Baddeley, and I. D. Gilchrist. Visual correlates of fixation selection: Effects of scale and time. Vision Research, 45(5):643–659, 2005.

[36] J. Wang, P. Zhao, and S. C. Hoi. Exact soft confidence-weighted learning. In International Conference on Machine Learning, pages 121–128, 2012.

[37] J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao. Predicting human gaze beyond pixels. Journal of Vision, 14(1):1–20, 2014.

[38] J. Zhang and S. Sclaroff. Saliency detection: A Boolean map approach. In IEEE International Conference on Computer Vision, pages 153–160, 2013.

[39] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7):32, 2008.

[40] Q. Zhao and C. Koch. Learning a saliency map using fixated locations in natural scenes. Journal of Vision, 11(3):9, 2011.