
Latent Pyramidal Regions for Recognizing Scenes

Fereshteh Sadeghi, Marshall Tappen
University of Central Florida

Orlando, Florida
fsadeghi,[email protected]

Abstract

In this paper we propose a simple but efficient image representation for solving the scene classification problem. Our new representation combines the benefits of the spatial pyramid representation with nonlinear feature coding and a latent Support Vector Machine (LSVM) to train a set of Latent Pyramidal Regions (LPRs). Each of our LPRs captures a discriminative characteristic of the scenes and is trained by searching over all possible sub-windows of the images in a latent SVM training procedure. The final responses of the LPRs form a single feature vector, which we call the LPR representation, that can be used for the classification task. We tested our model on three datasets with a variety of scene categories (15-Scenes, UIUC-Sports, and MIT-Indoor) and obtained state-of-the-art results.

1. Introduction

In [6], we propose a new approach for representing images for image classification and, in particular, scene recognition. Our work is inspired by the success of latent-variable approaches [4, 1] and the Spatial Pyramid (SP) representation [2, 7]. The spatial pyramid representation can capture the spatial aspects of images; however, its ability to model images is limited by its fixed grid. In our model, a set of region detectors is learned. Each region is represented by a spatial pyramid and trained in a latent SVM framework so that it is flexible enough to capture the key characteristics of the scenes. Our model also has similarities with [4], where the deformable object detector of [1] is used to classify scenes without requiring human-segmented regions. The key differences of this work from [4] lie in how the models are constructed. To handle the varied appearance of scenes, our model removes spatial constraints and focuses on finding characteristic image regions. Also, we separate the localization of key regions from scene categorization. This allows the classifier to optimize its weights for distinguishing between classes without having to balance how the weight values affect which image regions are chosen.

1.1. The Latent Pyramidal Region Representation

We propose a new image representation designed for discriminating between image classes. In this representation, each feature value expresses a particular type of scene region that is present in the images of one category. To make the representation robust to different spatial configurations, the position of each scene region is treated as a latent variable that is optimized as part of the representation. To capture the structure within a region, each region is represented with a spatial pyramid; we refer to these regions as Latent Pyramidal Regions and to the resulting representation as the Latent Pyramidal Regions (LPR) representation. The fundamental unit in the LPR is an image region detector that is parameterized to find image regions with a specific appearance. Given an input image I, the vector v will denote the LPR representation of I.

Each element of v is computed by finding the maximum response of a cost function applied to different sub-windows of the image. If v_i is the i-th element of v, we formally denote it as

    v_i = max_{w ∈ I} θ_i^T f(I, w),    (1)

where f(I, w) is a function that returns a vector of features extracted from sub-window w of the image I. The max operation runs over the set of all possible sub-windows of the image, so w is the latent variable of our model. We represent these regions using the coding scheme proposed in [7]. Each vector θ_i is a set of parameters that defines what type of image region its detector selects for; the θ_i are trained discriminatively by one-versus-all training of a structural latent SVM.
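As a concrete sketch of Eq. (1), the response of one detector can be computed by exhaustively scoring candidate sub-windows. The toy feature function below is a hypothetical stand-in for the spatial-pyramid LLC coding of [7] that the paper actually uses:

```python
import numpy as np

def region_features(image, window):
    """Hypothetical stand-in for the spatial-pyramid LLC features
    f(I, w) extracted from sub-window w = (y0, x0, y1, x1) of image I."""
    y0, x0, y1, x1 = window
    patch = image[y0:y1, x0:x1]
    # Toy descriptor: mean intensity plus normalized window geometry.
    return np.array([patch.mean(),
                     (y1 - y0) / image.shape[0],
                     (x1 - x0) / image.shape[1]])

def lpr_response(image, theta, windows):
    """v_i = max over sub-windows w of theta_i^T f(I, w)  (Eq. 1)."""
    return max(theta @ region_features(image, w) for w in windows)

def lpr_representation(image, thetas, windows):
    """Stack the responses of all region detectors into the LPR vector v."""
    return np.array([lpr_response(image, th, windows) for th in thetas])
```

In practice `windows` would enumerate a dense sliding-window grid over scales; here it is any explicit list of candidate sub-windows.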

The idea underlying the training process is to build a set of region detectors optimized for separating each scene category from the others. A detector with parameters θ_k is created by first choosing a particular scene category k. Each training image I can then be assigned a label y ∈ {−1, +1}, with y taking the value +1 if I belongs to category k and −1 otherwise. The goal in training is to learn a prediction rule of the form:

    F_θ(I) = argmax_{k,w} [θ_k^T f(I, w)],    (2)

where k is the predicted label and w is the sub-window with the highest detection score. As in Eq. (1), the function f(I, w) evaluates to a vector of features extracted from sub-window w. The parameter vector θ is found by minimizing the cost function:

    f(θ) = (λ/2) ‖θ‖² + Σ_{j=1}^{N} R_j(θ),    (3)



Method                    Accuracy
LLC (baseline)            80.57
LPR-MS (our approach)     83.29
LPR-LIN (our approach)    85.72
LPR-RBF (our approach)    85.81

Table 1. The average per-class accuracy results on the 15-Scenes dataset.

Method                    Accuracy
LLC (our global term)     81.87
LPR-MS (our approach)     85.0
LPR-LIN (our approach)    85.2
LPR-RBF (our approach)    86.25

Table 2. The average per-class accuracy results on the UIUC-Sports dataset.

with λ balancing between the quadratic regularizer ‖θ‖² and the risk function R_j(θ), which is summed over the N training images. The risk function R_j(θ) is structured to penalize the prediction function when it predicts an incorrect label (see [6] for more details).
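For intuition, Eqs. (2) and (3) can be sketched together: prediction is a joint argmax over categories and sub-windows, and training minimizes a regularized sum of per-image risks. The hinge-style latent risk below is only an illustrative assumption; [6] gives the exact definition of R_j:

```python
import numpy as np

def predict(thetas, window_feats):
    """F_theta(I) = argmax_{k,w} theta_k^T f(I, w)  (Eq. 2).
    `window_feats` is the list of feature vectors f(I, w), one per
    candidate sub-window w; returns (predicted class k, window index w)."""
    scores = [(theta @ f, k, w)
              for k, theta in enumerate(thetas)
              for w, f in enumerate(window_feats)]
    _, k, w = max(scores)
    return k, w

def latent_hinge_risk(theta, window_feats, y):
    """Illustrative R_j(theta): hinge loss on the best-scoring sub-window,
    with label y in {-1, +1}. An assumption, not the exact risk of [6]."""
    score = max(theta @ f for f in window_feats)
    return max(0.0, 1.0 - y * score)

def objective(theta, lam, training_set):
    """f(theta) = (lambda/2) ||theta||^2 + sum_j R_j(theta)  (Eq. 3)."""
    reg = 0.5 * lam * float(theta @ theta)
    return reg + sum(latent_hinge_risk(theta, feats, y)
                     for feats, y in training_set)
```

Because the risk depends on a max over latent windows, the objective is non-convex and is typically minimized by alternating between choosing windows and updating θ.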

2. Experiments

The performance of the proposed method is evaluated on three scene datasets with diverse types of scenes (15-Scenes [2], UIUC-Sports [3], and MIT-Indoor [5]). We report three key results, in addition to results of previous work:

• LLC: the accuracy computed using a linear SVM combined with the spatial pyramid representation of the image using locality-constrained linear coding (LLC) [7].
• LPR-MS: the accuracy computed using the maximum responses of the region detectors associated with each class. If we denote by V_k the set of all region detectors trained to respond to class k, the classification score is computed by summing the responses of those detectors. This is expressed as y = argmax_{k ∈ K} [Σ_{i ∈ V_k} v_i], where there are K possible classes.
• LPR-RBF and LPR-LIN: the accuracies computed by an RBF-kernel SVM and a linear SVM using the LPR representation.
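The LPR-MS rule reduces to summing detector responses per class. A minimal sketch, where `class_detectors[k]` holds the index set V_k (the per-class grouping of detectors comes from the paper; this particular data layout is an assumption):

```python
import numpy as np

def lpr_ms_predict(v, class_detectors):
    """y = argmax_{k in K} sum_{i in V_k} v_i: sum the LPR responses of
    the detectors trained for class k and pick the best-scoring class."""
    scores = [sum(v[i] for i in idx) for idx in class_detectors]
    return int(np.argmax(scores))
```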

2.1. Key Results

We wish to highlight the following key results:

• While the LLC representation and the LPR representation use exactly the same descriptors and coding scheme, the LPR representation outperforms LLC.
• The LPR representation outperforms other single-feature accuracy results as well as the deformable part model [4]. Where other systems outperform LPR, they require the fusion of multiple features.

References

[1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. PAMI, 32(9):1627–1645, 2010.

[2] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.

Method                       Accuracy
DPM [4]                      30.4
DPM+KSPM+GIST-color [4]      43.1
LLC (our global term)        37.32
LPR-MS (our approach)        41.22
LPR-LIN (our approach)       44.84
LPR-RBF (our approach)       44.41

Table 3. The average per-class accuracy results on the MIT-indoor dataset.

[Figure 1 image omitted. The panels show example regions from the MIT-indoor, 15-Scenes, and UIUC-Sports datasets, with scene labels including movie theater, operating room, staircase, gym, living room, polo, sailing, bedroom, MIT mountain, and CAL suburb.]

Figure 1. The detected regions found by LPR. The first five columns show the characteristic regions found by LPR; the last column is an example of inappropriate LPR region selection.

[3] L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In ICCV, 2007.

[4] M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, 2011.

[5] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.

[6] F. Sadeghi and M. F. Tappen. Latent pyramidal regions for recognizing scenes. In ECCV, pages 228–241, 2012.

[7] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
