Visual search and recognition. Part II – category recognition
Andrew Zisserman, Visual Geometry Group, University of Oxford
http://www.robots.ox.ac.uk/~vgg
Includes slides from: Mark Everingham, Pedro Felzenszwalb, Rob Fergus, Kristen Grauman, Bastian Leibe, Fei-Fei Li, Marcin Marszalek, Pietro Perona, Deva Ramanan, Josef Sivic and Andrea Vedaldi
What we would like to be able to do …
• Visual recognition and scene understanding
• What is in the image and where:
  – scene type: outdoor, city …
  – object classes
  – material properties
  – actions
Recognition Tasks
• Image Classification
– Does the image contain an aeroplane?
• Object Class Detection/Localization
  – Where are the aeroplanes (if any)?
• Object Class Segmentation
  – Which pixels are part of an aeroplane (if any)?
Things vs. Stuff
Stuff (n): Material defined by a homogeneous or repetitive pattern of fine-scale properties, but has no specific or distinctive spatial extent or shape.
Thing (n): An object with a specific size and shape.
Ted Adelson, Forsyth et al. 1996.
Slide: Geremy Heitz
Challenges: Clutter
Challenges: Occlusion and truncation
Challenges: Intra-class variation
Object Category Recognition by Learning
• Difficult to define a model of a category. Instead, learn from example images.
Level of Supervision for Learning
Image-level label
Pixel-level segmentation
Bounding box
“Parts”
Outline
1. Image Classification
• Bag of visual words method
• Features and adding spatial information
• Encoding
• PASCAL VOC and other datasets
2. Object Category Detection
3. The future and challenges
Recognition Task
• Image Classification
– Does the image contain an aeroplane?
• Challenges
  – Imaging factors e.g. lighting, pose, occlusion, clutter
  – Intra-class variation
  – Position can vary within image
  – Training data may not specify position
Image classification
• Supervised approach – Training data with labels indicating presence/absence of the class
Positive training images containing an object class, here motorbike
Negative training images that don’t
– Learn classifier
Image classification
• Test image
  – Image without label
  – Determine whether the test image contains the object class or not
• Classify– Run classifier on the test image
Bag of visual words
• Images yield varying number of local features
• Features are high-dimensional, e.g. 128-D for SIFT
• How to summarize image content in a fixed-length vector for classification?
1. Map descriptors onto a common vocabulary of visual words
2. Represent image as a histogram over visual words – a bag of words
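The two steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the original pipeline: the tiny 2-D "vocabulary" stands in for a k-means codebook learned over 128-D SIFT descriptors, and all data here are hand-made toys.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Map each local descriptor to its nearest visual word and
    return an L1-normalised histogram of word counts."""
    # pairwise distances: (num_descriptors, num_words)
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)                       # hard assignment to words
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                       # fixed-length vector

# toy example: 2-D "descriptors" and a vocabulary of 3 visual words
vocab = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
desc = np.array([[0.1, 0.0], [0.9, 1.1], [5.2, 4.8], [4.9, 5.1]])
print(bow_histogram(desc, vocab))   # -> [0.25 0.25 0.5]
```

Whatever the number of detected features, the output always has one bin per visual word, which is what makes the representation usable by a standard classifier.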
Examples for visual words
Airplanes
Motorbikes
Faces
Wild Cats
Leaves
People
Bikes
Intuition
Visual Vocabulary
• Visual words represent “iconic” image fragments
• Discarding spatial information gives lots of invariance
positive negative
Train classifier, e.g. SVM
Training data: vectors are histograms, one from each training image
Example image collection: four object classes + background – the “Caltech 5”
Faces: 435
Motorbikes: 800
Airplanes: 800
Cars (rear): 1155
Background: 900
Total: 4090
Example: weak supervision
Training
• 50% of images
• No identification of object within image
Testing
• 50% of images
• Simple object present/absent test
Motorbikes Airplanes Frontal Faces
Cars (Rear) Background
Learning
• SVM classifier
• Gaussian kernel using the χ² distance as similarity between histograms:
  K(x, y) = exp(−γ χ²(x, y))
Result
• Between 98.3 – 100% correct, depending on class
Zhang et al 2005, Csurka et al 2004
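A minimal sketch of the kernel above, assuming the common convention χ²(x, y) = ½ Σᵢ (xᵢ − yᵢ)² / (xᵢ + yᵢ) over L1-normalised histograms (the slides do not spell out which χ² convention is used, so the ½ factor is an assumption):

```python
import numpy as np

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel K(x, y) = exp(-gamma * chi2(x, y))
    between two L1-normalised histograms."""
    eps = 1e-10                          # guard against division by zero in empty bins
    chi2 = 0.5 * np.sum((x - y) ** 2 / (x + y + eps))
    return np.exp(-gamma * chi2)

h1 = np.array([0.5, 0.5, 0.0])
h2 = np.array([0.5, 0.5, 0.0])
h3 = np.array([0.0, 0.0, 1.0])
print(chi2_kernel(h1, h2))   # identical histograms -> 1.0
print(chi2_kernel(h1, h3))   # disjoint histograms  -> exp(-1)
```

Because the kernel depends only on the two histograms, it plugs directly into any kernel SVM implementation as a precomputed Gram matrix.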
Localization according to visual word probability
foreground word more probable
background word more probable
sparse segmentation
Why does SVM learning work?
• Learns foreground and background visual words
foreground words – positive weight
background words – negative weight
weight vector w
Linear SVM: f(x) = wᵀx + b
Bag of visual words summary
• Advantages:
  – largely unaffected by position and orientation of object in image
  – fixed length vector irrespective of number of detections
  – very successful in classifying images according to the objects they contain
• Disadvantages:
  – no explicit use of configuration of visual word positions
  – poor at localizing objects within an image
Adding Spatial Information
Beyond BoW II: Grids and spatial pyramids
Start from BoW for image
• no spatial information recorded
Bag of Words
Feature Vector
Adding Spatial Information to Bag of Words
Bag of Words
Concatenate
Feature Vector [Fergus et al, ICCV 2005]
Keeps a fixed length feature vector for a window
Tiling defines (records) the spatial correspondence of the words
If the codebook has V visual words, then the representation has dimension 4V
• parameter: number of tiles
Spatial Pyramid – represent correspondence
1 BoW
4 BoW
16 BoW
[Grauman & Darrell, 2005] [Lazebnik et al, 2006]
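The pyramid of 1 + 4 + 16 BoW histograms can be sketched as follows, assuming each pixel of a "word map" already holds the index of its assigned visual word (a simplification of the real dense-feature pipeline; all names and sizes are illustrative):

```python
import numpy as np

def spatial_pyramid(word_map, num_words, levels=3):
    """Concatenate BoW histograms over a pyramid of grids:
    1x1, 2x2, 4x4 -> (1 + 4 + 16) * V dimensions for levels=3."""
    h, w = word_map.shape
    feats = []
    for level in range(levels):
        cells = 2 ** level                 # grid is cells x cells at this level
        for i in range(cells):
            for j in range(cells):
                tile = word_map[i*h//cells:(i+1)*h//cells,
                                j*w//cells:(j+1)*w//cells]
                feats.append(np.bincount(tile.ravel(), minlength=num_words))
    return np.concatenate(feats)

# toy "image" where each pixel holds its visual-word index
wm = np.random.randint(0, 10, size=(64, 64))
f = spatial_pyramid(wm, num_words=10)
print(f.shape)   # (210,) = (1 + 4 + 16) * 10 words
```

The first V bins are the plain BoW histogram; the remaining bins record where in the grid each word occurred, which is exactly the correspondence information plain BoW discards.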
Dense Visual Words
• Why extract only sparse visual words?
• Good where lots of invariance is needed (e.g. to rotation or scale), but not relevant if it isn’t
• Also, interest points do not necessarily capture “all” features
• Instead, extract dense visual words of fixed scales on an overlapping grid (patch / SIFT)
• More “detail” at the expense of invariance
• Improves performance for most categories
• Pyramid histogram of visual words (PHOW)
[Leung & Malik, 1999] [Varma & Zisserman, 2003] [Vogel & Schiele, 2004] [Jurie & Triggs, 2005] [Fei-Fei & Perona, 2005] [Bosch et al, 2006]
• Max AP: 93.3% (aeroplane) ... 53.3% (potted plant)
Progress 2008-2010
• Results on 2008 data improve for best 2009 and 2010 methods for all categories, by over 100% for some categories
– Caveat: Better methods or more training data?
[Bar chart: Max AP (%) for each of the 20 VOC classes – aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv monitor – for 2008, 2009 and 2010]
The Indoor Scene Dataset
• 67 indoor categories
• 15620 images
• At least 100 images per category
• Training 67 x 80 images
• Testing 67 x 20 images
• A. Quattoni, and A.Torralba. Recognizing Indoor Scenes. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
The Oxford Flowers Dataset
• Explore fine grained visual categorization
• 102 different species
Dataset statistics
• 102 categories
• Training set – 10 images per category
• Validation set – 10 images per category
• Test set – >20 images per category; total 6129 images
Fine grained visual classification – flowers
Y. Chai, M.E. Nilsback, V. Lempitsky, A. Zisserman, ICVGIP’08, ICCV’11
Outline
1. Image Classification
2. Object Category Detection
• Sliding window methods
• Histogram of Oriented Gradients (HOG)
• Learning an object detector
• PASCAL VOC (again) and two state of the art algorithms
3. The future and challenges
• Object Class Detection/Localization
  – Where are the aeroplanes (if any)?
Recognition Task
• Challenges
  – Imaging factors e.g. lighting, pose, occlusion, clutter
  – Intra-class variation
• Compared to Classification
  – Detailed prediction e.g. bounding box
  – Location usually provided for training
aeroplane bicycle
car cow
horse motorbike
Preview of typical results
Problem of background clutter
• Use a sub-window
  – At correct position, no clutter is present
  – Slide window to detect object
  – Change size of window to search over scale
Yes, a car
No, not a car
Detection by Classification
• Basic component of sliding window classifier: a binary classifier
Car/non-car Classifier
Detection by Classification
• Detect objects in clutter by search
Car/non-car Classifier
• Sliding window: exhaustive search over position and scale
  (can use the same size window over a spatial pyramid of images)
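The exhaustive search over position and scale can be sketched as a simple window enumerator. This is an illustrative toy, not the original implementation: the stride and scale factors are assumptions, and the window coordinates live in each rescaled image.

```python
def sliding_windows(img_h, img_w, win_h, win_w, stride=8,
                    scales=(1.0, 0.75, 0.5)):
    """Enumerate (x, y, scale) candidate windows. The image is conceptually
    rescaled by each factor, then scanned with a fixed-size window."""
    windows = []
    for s in scales:
        h, w = int(img_h * s), int(img_w * s)      # rescaled image size
        for y in range(0, h - win_h + 1, stride):  # skip scales the window no longer fits
            for x in range(0, w - win_w + 1, stride):
                windows.append((x, y, s))
    return windows

wins = sliding_windows(240, 320, 128, 64)
print(len(wins))   # hundreds of candidate windows even for a small image
```

The count grows quickly with image size, stride and number of scales, which is why the later slides turn to efficient search and cascades.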
Window (Image) Classification
• Features usually engineered• Classifier learnt from data
Feature Extraction
Classifier
Training Data
Car/Non-car
Problems with sliding windows …
• aspect ratio
• granularity (finite grid)
• partial occlusion
• multiple responses
See work by
• Christoph Lampert et al CVPR 08, ECCV 08
Bag of (visual) Words representation
• Detect affine invariant local features (e.g. affine-Harris)
• Represent by high-dimensional descriptors, e.g. 128-D for SIFT
• Summarizes sliding window content in a fixed-length vector suitable for classification
1. Map descriptors onto a common vocabulary of visual words
2. Represent sliding window as a histogram over visual words – a bag of words
Sliding window detector
• Classifier: SVM with linear kernel
• BoW representation for ROI
Example detections for dog
Lampert et al CVPR 08: Efficient branch and bound search over all windows
Discussion: ROI as a Bag of Visual Words
• Advantages
  – No explicit modelling of spatial information ⇒ high level of invariance to position and orientation in image
  – Fixed length vector ⇒ standard machine learning methods applicable
• Disadvantages
  – No explicit modelling of spatial information ⇒ less discriminative power
  – Inferior to state of the art performance
  – Add dense features
Dalal & Triggs CVPR 2005: Pedestrian detection
• Objective: detect (localize) standing humans in an image
• Sliding window classifier
• Train a binary classifier on whether a window contains a standing person or not
• Histogram of Oriented Gradients (HOG) feature
• Although HOG + SVM was originally introduced for pedestrians, it has been used very successfully for many object categories
Feature: Histogram of Oriented Gradients (HOG)
[Figure: image, dominant gradient direction per cell, HOG descriptor; per-cell histogram of frequency vs orientation]
• tile 64 x 128 pixel window into 8 x 8 pixel cells
• each cell represented by histogram over 8 orientation bins (i.e. angles in range 0-180 degrees)
Histogram of Oriented Gradients (HOG) continued
• Adds a second level of overlapping spatial bins re-normalizing orientation histograms over a larger spatial area
• Feature vector dimension (approx) = 16 x 8 (for tiling) x 8 (orientations) x 4 (for blocks) = 4096
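The per-cell histogram step can be sketched as below. This is a deliberately simplified version: block re-normalisation is omitted, gradients come from `np.gradient` rather than the original [-1, 0, 1] filters, and only the cell size and bin count follow the slide.

```python
import numpy as np

def hog_cells(img, cell=8, bins=8):
    """Simplified HOG: per-cell histograms of unsigned gradient orientation
    (0-180 degrees), weighted by gradient magnitude. No block normalisation."""
    gy, gx = np.gradient(img.astype(float))        # gradients along rows, cols
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180     # fold to unsigned orientation
    h, w = img.shape
    bin_idx = np.minimum((ang / (180 / bins)).astype(int), bins - 1)
    out = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            out[i, j] = np.bincount(b, weights=m, minlength=bins)
    return out

img = np.tile(np.arange(64, dtype=float), (128, 1))  # pure horizontal gradient
H = hog_cells(img)
print(H.shape)   # (16, 8, 8): 16 x 8 cells, 8 orientation bins each
```

For this synthetic ramp image every gradient points the same way, so all the mass lands in a single orientation bin of every cell; a real image spreads mass across bins according to its local structure.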
Window (Image) Classification
• HOG Features
• Linear SVM classifier
Feature Extraction
Classifier
Training Data
pedestrian/Non-pedestrian
Averaged examples
Advantages of linear SVM:
• Training (Learning)• Very efficient packages for the linear case, e.g. LIBLINEAR for batch training and Pegasos for on-line training.
• Complexity O(N) for N training points (cf O(N^3) for general SVM)
• Testing (Detection)
Classifier: linear SVM, f(x) = wᵀx + b

Non-linear kernel:  f(x) = Σ_{i=1..S} αᵢ k(xᵢ, x) + b
  where S = # of support vectors = (worst case) N, the size of the training data

Linear kernel:  f(x) = Σ_{i=1..S} αᵢ xᵢᵀx + b = wᵀx + b
  ⇒ test cost independent of the size of the training data
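The collapse of the support-vector sum into a single weight vector can be verified numerically. A toy sketch with random "support vectors" (the coefficients fold αᵢyᵢ into a single signed αᵢ):

```python
import numpy as np

rng = np.random.default_rng(0)
sv = rng.normal(size=(50, 128))          # S support vectors (toy data)
alpha = rng.normal(size=50)              # signed coefficients alpha_i * y_i
b = 0.1
x = rng.normal(size=128)                 # one test point

# kernel expansion: sum_i alpha_i <x_i, x> + b  -- cost O(S * d) per window
f_kernel = np.sum(alpha * (sv @ x)) + b

# linear form: precompute w = sum_i alpha_i x_i once, then w^T x + b -- O(d)
w = alpha @ sv
f_linear = w @ x + b

print(np.isclose(f_kernel, f_linear))    # identical scores
```

This is exactly why linear SVMs are preferred for sliding windows: with ~100,000 windows per image, per-window cost must not depend on the training set size.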
Dalal and Triggs, CVPR 2005
Learned model
f(x) = wᵀx + b
average over positive training data
Slide from Deva Ramanan
Why does HOG + SVM work so well?
• Similar to SIFT, records the spatial arrangement of histogram orientations
• Compare to learning only edges:
  – Complex junctions can be represented
  – Avoids the problem of early thresholding
  – Also represents soft internal gradients
• Older methods based on edges have become largely obsolete
• HOG gives a fixed length vector for the window, suitable as a feature vector for an SVM
Training a sliding window detector
• Object detection is inherently asymmetric: much more “non-object” than “object” data
• Classifier needs to have a very low false positive rate
• Non-object category is very complex – need lots of data
Bootstrapping
1. Pick negative training set at random
2. Train classifier
3. Run on training data
4. Add false positives to training set
5. Repeat from 2
• Collect a finite but diverse set of non-object windows
• Force classifier to concentrate on hard negative examples
• For some classifiers can ensure equivalence to training on entire data set
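The five bootstrapping steps can be sketched as a small mining loop. Everything here is a stand-in: a perceptron-style update replaces the SVM, 2-D points replace HOG windows, and the cluster means are arbitrary; only the loop structure mirrors the slide.

```python
import numpy as np

def train_linear(X, y, epochs=100, lr=0.1):
    """Tiny perceptron-style linear classifier standing in for an SVM."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified -> update
                w += lr * yi * xi
                b += lr * yi
    return w, b

rng = np.random.default_rng(1)
pos = rng.normal(2.0, 0.5, size=(50, 2))            # "object" windows
neg_pool = rng.normal(-1.0, 1.5, size=(2000, 2))    # huge "non-object" pool

neg = neg_pool[rng.choice(len(neg_pool), 50, replace=False)]  # 1. random negatives
for _ in range(3):                                  # bootstrapping rounds
    X = np.vstack([pos, neg])
    y = np.r_[np.ones(50), -np.ones(len(neg))]
    w, b = train_linear(X, y)                       # 2. train classifier
    scores = neg_pool @ w + b                       # 3. run on training data
    hard = neg_pool[scores > 0]                     # 4. false positives = hard negatives
    neg = np.vstack([neg, hard])                    # 5. add them and repeat
print(len(neg))
```

The negative set only ever grows with examples the current classifier gets wrong, so training effort concentrates on the hard part of the (enormous) non-object class.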
Example: train an upper body detector
– Training data – used for training and validation sets
  • 33 Hollywood2 training movies
  • 1122 frames with upper bodies marked
– First stage training (bootstrapping)
  • 1607 upper body annotations jittered to 32k positive samples
  • 55k negatives sampled from the same set of frames
– Second stage training (retraining)
  • 150k hard negatives found in the training data
Training data – positive annotations
Positive windows
Note: common size and alignment
Jittered positives
Random negatives
Window (Image) first stage classification
HOG Feature Extraction
Linear SVM Classifier, f(x) = wᵀx + b
Jittered positives, random negatives
• find high scoring false positive detections
• these are the hard negatives for the next round of training
• cost = # training images × inference on each image
Hard negatives
First stage performance on validation set
Performance after retraining
Effects of retraining
Side by side
before retraining after retraining
Tracked upper body detections
Accelerating Sliding Window Search
• Sliding window search is slow because so many windows are needed, e.g. x × y × scale ≈ 100,000 for a 320×240 image
• Most windows are clearly not the object class of interest
• Can we speed up the search?
Cascaded Classification
• Build a sequence of classifiers with increasing complexity
[Diagram: Window → Classifier 1 → Classifier 2 → … → Classifier N → Face; each classifier either rejects the window as Non-face or passes it on as “possibly a face”; later classifiers are more complex and slower, with lower false positive rate]
• Reject easy non-objects using simpler and faster classifiers
Cascaded Classification
• Slow expensive classifiers only applied to a few windows ⇒ significant speed-up
• Controlling classifier complexity/speed:
  – Number of support vectors [Romdhani et al, 2001]
  – Number of features [Viola & Jones, 2001]
  – Type of SVM kernel [Vedaldi et al, 2009]
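The cascade control flow is simple enough to sketch directly. The two "stages" below are hypothetical toy score functions and thresholds, not anything from the cited detectors; only the early-reject structure matters.

```python
def cascade(window, stages):
    """Apply classifiers in order of increasing cost; reject at the first
    stage whose score falls below its threshold, accept only if all pass."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # rejected early by a cheap stage
    return True                   # "possibly a face" survived every stage

# toy stages: a fast weak filter followed by a slower, stricter one
stages = [(lambda w: sum(w) / len(w), 0.2),   # cheap: mean intensity
          (lambda w: min(w), 0.5)]            # expensive: strictest test

print(cascade([0.9, 0.8, 0.7], stages))  # True: passes both stages
print(cascade([0.1, 0.0, 0.1], stages))  # False: rejected by stage 1
```

Because the vast majority of windows are rejected by the first, cheapest stage, the expensive final classifier runs on only a tiny fraction of the ~100,000 candidates.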
Summary: Sliding Window Detection
• Can convert any image classifier into an object detector by sliding window. Efficient search methods available.
• Requirements for invariance are reduced by searching over e.g. translation and scale
• Spatial correspondence can be “engineered in” by spatial tiling
Outline
1. Image Classification
2. Object Category Detection
• Sliding window methods
• Histogram of Oriented Gradients (HOG)
• Learning an object detector
• PASCAL VOC (again) and two state of the art algorithms
3. The future and challenges
The PASCAL Visual Object Classes (VOC) Dataset and Challenge
Mark Everingham, Luc Van Gool, Chris Williams, John Winn, Andrew Zisserman
Detection: Evaluation of Bounding Boxes
• Area of Overlap (AO) Measure
Ground truth B_gt, predicted B_p
Detection if  area(B_gt ∩ B_p) / area(B_gt ∪ B_p)  >  50% threshold
• Evaluation: Average precision per class on predictions
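The area-of-overlap test above (intersection over union of the two boxes) can be implemented in a few lines; boxes here are assumed to be axis-aligned (x1, y1, x2, y2) tuples:

```python
def overlap_measure(bgt, bp):
    """Area of overlap: intersection over union of ground-truth and
    predicted boxes, each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(bgt[0], bp[0]), max(bgt[1], bp[1])
    ix2, iy2 = min(bgt[2], bp[2]), min(bgt[3], bp[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # 0 if boxes are disjoint
    union = ((bgt[2] - bgt[0]) * (bgt[3] - bgt[1]) +
             (bp[2] - bp[0]) * (bp[3] - bp[1]) - inter)
    return inter / union

gt = (0, 0, 10, 10)
pred = (5, 0, 15, 10)                # shifted halfway off the ground truth
ao = overlap_measure(gt, pred)
print(ao, ao > 0.5)                  # 1/3 -> not counted as a detection
```

Note that a box covering half the ground truth scores only 1/3, not 1/2, because the union in the denominator also grows; this makes the 50% threshold stricter than it may first appear.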