Visual search and recognition. Part II – category recognition
Andrew Zisserman, Visual Geometry Group, University of Oxford
http://www.robots.ox.ac.uk/~vgg
Includes slides from: Mark Everingham, Pedro Felzenszwalb, Rob Fergus, Kristen Grauman, Bastian Leibe, Fei-Fei Li, Marcin Marszalek, Pietro Perona, Deva Ramanan, Josef Sivic and Andrea Vedaldi
What we would like to be able to do …
• Visual recognition and scene understanding
• What is in the image and where:
  – scene type: outdoor, city …
  – object classes
  – material properties
  – actions
Recognition Tasks
• Image Classification
– Does the image contain an aeroplane?
• Object Class Detection/Localization
  – Where are the aeroplanes (if any)?
• Object Class Segmentation
  – Which pixels are part of an aeroplane (if any)?
Things vs. Stuff
Stuff (n): Material defined by a homogeneous or repetitive pattern of fine-scale properties, but has no specific or distinctive spatial extent or shape.
Thing (n): An object with a specific size and shape.
Ted Adelson, Forsyth et al. 1996.
Slide: Geremy Heitz
Challenges: Clutter
Challenges: Occlusion and truncation
Challenges: Intra-class variation
Object Category Recognition by Learning
• Difficult to define a model of a category. Instead, learn from example images.
Level of Supervision for Learning
Image-level label
Pixel-level segmentation
Bounding box
“Parts”
Outline
1. Image Classification
• Bag of visual words method
• Features and adding spatial information
• Encoding
• PASCAL VOC and other datasets
2. Object Category Detection
3. The future and challenges
Recognition Task
• Image Classification
– Does the image contain an aeroplane?
• Challenges
  – Imaging factors e.g. lighting, pose, occlusion, clutter
  – Intra-class variation
  – Position can vary within image
  – Training data may not specify position
Image classification
• Supervised approach – Training data with labels indicating presence/absence of the class
Positive training images containing an object class, here motorbike
Negative training images that don’t
– Learn classifier
Image classification
• Test image
  – Image without label
  – Determine whether the test image contains the object class or not
• Classify– Run classifier on the test image
Bag of visual words
• Images yield varying number of local features
• Features are high-dimensional, e.g. 128-D for SIFT
• How to summarize image content in a fixed-length vector for classification?
1. Map descriptors onto a common vocabulary of visual words
2. Represent image as a histogram over visual words – a bag of words
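The two steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the original pipeline: the tiny 2-D "vocabulary" stands in for a k-means codebook learned over 128-D SIFT descriptors, and all data here are hand-made toys.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Map each local descriptor to its nearest visual word and
    return an L1-normalised histogram of word counts."""
    # pairwise distances: (num_descriptors, num_words)
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)                       # hard assignment to words
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                       # fixed-length vector

# toy example: 2-D "descriptors" and a vocabulary of 3 visual words
vocab = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
desc = np.array([[0.1, 0.0], [0.9, 1.1], [5.2, 4.8], [4.9, 5.1]])
print(bow_histogram(desc, vocab))   # -> [0.25 0.25 0.5]
```

Whatever the number of detected features, the output always has one bin per visual word, which is what makes the representation usable by a standard classifier.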
Examples for visual words
Airplanes
Motorbikes
Faces
Wild Cats
Leaves
People
Bikes
Intuition
Visual Vocabulary
• Visual words represent “iconic” image fragments
• Discarding spatial information gives lots of invariance
positive negative
Train classifier, e.g. SVM
Training data: vectors are histograms, one from each training image
Example image collection: four object classes + background – the “Caltech 5”
Faces: 435
Motorbikes: 800
Airplanes: 800
Cars (rear): 1155
Background: 900
Total: 4090
Example: weak supervision
Training
• 50% of images
• No identification of object within image
Testing
• 50% of images
• Simple object present/absent test
Motorbikes Airplanes Frontal Faces
Cars (Rear) Background
Learning
• SVM classifier
• Gaussian kernel using the χ² distance as similarity between histograms:
  K(x, y) = exp(−γ χ²(x, y))
Result
• Between 98.3 – 100% correct, depending on class
Zhang et al 2005, Csurka et al 2004
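A minimal sketch of the kernel above, assuming the common convention χ²(x, y) = ½ Σᵢ (xᵢ − yᵢ)² / (xᵢ + yᵢ) over L1-normalised histograms (the slides do not spell out which χ² convention is used, so the ½ factor is an assumption):

```python
import numpy as np

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel K(x, y) = exp(-gamma * chi2(x, y))
    between two L1-normalised histograms."""
    eps = 1e-10                          # guard against division by zero in empty bins
    chi2 = 0.5 * np.sum((x - y) ** 2 / (x + y + eps))
    return np.exp(-gamma * chi2)

h1 = np.array([0.5, 0.5, 0.0])
h2 = np.array([0.5, 0.5, 0.0])
h3 = np.array([0.0, 0.0, 1.0])
print(chi2_kernel(h1, h2))   # identical histograms -> 1.0
print(chi2_kernel(h1, h3))   # disjoint histograms  -> exp(-1)
```

Because the kernel depends only on the two histograms, it plugs directly into any kernel SVM implementation as a precomputed Gram matrix.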
Localization according to visual word probability
foreground word more probable
background word more probable
sparse segmentation
Why does SVM learning work?
• Learns foreground and background visual words
foreground words – positive weight
background words – negative weight
weight vector w
Linear SVM: f(x) = wᵀx + b
Bag of visual words summary
• Advantages:
  – largely unaffected by position and orientation of object in image
  – fixed length vector irrespective of number of detections
  – very successful in classifying images according to the objects they contain
• Disadvantages:
  – no explicit use of configuration of visual word positions
  – poor at localizing objects within an image
Adding Spatial Information
Beyond BoW II: Grids and spatial pyramids
Start from BoW for image
• no spatial information recorded
Bag of Words
Feature Vector
Adding Spatial Information to Bag of Words
Bag of Words
Concatenate
Feature Vector [Fergus et al, ICCV 2005]
Keeps a fixed length feature vector for a window
Tiling defines (records) the spatial correspondence of the words
If the codebook has V visual words, then the representation has dimension 4V
• parameter: number of tiles
Spatial Pyramid – represent correspondence
1 BoW
4 BoW
16 BoW
[Grauman & Darrell, 2005] [Lazebnik et al, 2006]
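The pyramid of 1 + 4 + 16 BoW histograms can be sketched as follows, assuming each pixel of a "word map" already holds the index of its assigned visual word (a simplification of the real dense-feature pipeline; all names and sizes are illustrative):

```python
import numpy as np

def spatial_pyramid(word_map, num_words, levels=3):
    """Concatenate BoW histograms over a pyramid of grids:
    1x1, 2x2, 4x4 -> (1 + 4 + 16) * V dimensions for levels=3."""
    h, w = word_map.shape
    feats = []
    for level in range(levels):
        cells = 2 ** level                 # grid is cells x cells at this level
        for i in range(cells):
            for j in range(cells):
                tile = word_map[i*h//cells:(i+1)*h//cells,
                                j*w//cells:(j+1)*w//cells]
                feats.append(np.bincount(tile.ravel(), minlength=num_words))
    return np.concatenate(feats)

# toy "image" where each pixel holds its visual-word index
wm = np.random.randint(0, 10, size=(64, 64))
f = spatial_pyramid(wm, num_words=10)
print(f.shape)   # (210,) = (1 + 4 + 16) * 10 words
```

The first V bins are the plain BoW histogram; the remaining bins record where in the grid each word occurred, which is exactly the correspondence information plain BoW discards.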
Dense Visual Words
• Why extract only sparse visual words?
• Good where lots of invariance is needed (e.g. to rotation or scale), but not relevant if it isn’t
• Also, interest points do not necessarily capture “all” features
• Instead, extract dense visual words of fixed scales on an overlapping grid (patch / SIFT)
• More “detail” at the expense of invariance
• Improves performance for most categories
• Pyramid histogram of visual words (PHOW)
[Leung & Malik, 1999] [Varma & Zisserman, 2003] [Vogel & Schiele, 2004] [Jurie & Triggs, 2005] [Fei-Fei & Perona, 2005] [Bosch et al, 2006]
• Max AP: 93.3% (aeroplane) ... 53.3% (potted plant)
Progress 2008-2010
• Results on 2008 data improve for best 2009 and 2010 methods for all categories, by over 100% for some categories
– Caveat: Better methods or more training data?
[Bar chart: Max AP (%) for each of the 20 VOC classes – aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv monitor – for 2008, 2009 and 2010]
The Indoor Scene Dataset
• 67 indoor categories
• 15620 images
• At least 100 images per category
• Training 67 x 80 images
• Testing 67 x 20 images
• A. Quattoni, and A.Torralba. Recognizing Indoor Scenes. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
The Oxford Flowers Dataset
• Explore fine grained visual categorization
• 102 different species
Dataset statistics
• 102 categories
• Training set – 10 images per category
• Validation set – 10 images per category
• Test set – >20 images per category; total 6129 images
Fine grained visual classification – flowers
Y. Chai, M.E. Nilsback, V. Lempitsky, A. Zisserman, ICVGIP’08, ICCV’11
Outline
1. Image Classification
2. Object Category Detection
• Sliding window methods
• Histogram of Oriented Gradients (HOG)
• Learning an object detector
• PASCAL VOC (again) and two state of the art algorithms
3. The future and challenges
• Object Class Detection/Localization
  – Where are the aeroplanes (if any)?
Recognition Task
• Challenges
  – Imaging factors e.g. lighting, pose, occlusion, clutter
  – Intra-class variation
• Compared to Classification
  – Detailed prediction e.g. bounding box
  – Location usually provided for training
aeroplane bicycle
car cow
horse motorbike
Preview of typical results
Problem of background clutter
• Use a sub-window
  – At correct position, no clutter is present
  – Slide window to detect object
  – Change size of window to search over scale
Yes, a car
No, not a car
Detection by Classification
• Basic component of sliding window classifier: a binary classifier
Car/non-car Classifier
Detection by Classification
• Detect objects in clutter by search
Car/non-car Classifier
• Sliding window: exhaustive search over position and scale
  (can use the same size window over a spatial pyramid of images)
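The exhaustive search over position and scale can be sketched as a simple window enumerator. This is an illustrative toy, not the original implementation: the stride and scale factors are assumptions, and the window coordinates live in each rescaled image.

```python
def sliding_windows(img_h, img_w, win_h, win_w, stride=8,
                    scales=(1.0, 0.75, 0.5)):
    """Enumerate (x, y, scale) candidate windows. The image is conceptually
    rescaled by each factor, then scanned with a fixed-size window."""
    windows = []
    for s in scales:
        h, w = int(img_h * s), int(img_w * s)      # rescaled image size
        for y in range(0, h - win_h + 1, stride):  # skip scales the window no longer fits
            for x in range(0, w - win_w + 1, stride):
                windows.append((x, y, s))
    return windows

wins = sliding_windows(240, 320, 128, 64)
print(len(wins))   # hundreds of candidate windows even for a small image
```

The count grows quickly with image size, stride and number of scales, which is why the later slides turn to efficient search and cascades.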
Window (Image) Classification
• Features usually engineered• Classifier learnt from data
Feature Extraction
Classifier
Training Data
Car/Non-car
Problems with sliding windows …
• aspect ratio
• granularity (finite grid)
• partial occlusion
• multiple responses
See work by
• Christoph Lampert et al CVPR 08, ECCV 08
Bag of (visual) Words representation
• Detect affine invariant local features (e.g. affine-Harris)
• Represent by high-dimensional descriptors, e.g. 128-D for SIFT
• Summarizes sliding window content in a fixed-length vector suitable for classification
1. Map descriptors onto a common vocabulary of visual words
2. Represent sliding window as a histogram over visual words – a bag of words
Sliding window detector
• Classifier: SVM with linear kernel
• BoW representation for ROI
Example detections for dog
Lampert et al CVPR 08: Efficient branch and bound search over all windows
Discussion: ROI as a Bag of Visual Words
• Advantages
  – No explicit modelling of spatial information ⇒ high level of invariance to position and orientation in image
  – Fixed length vector ⇒ standard machine learning methods applicable
• Disadvantages
  – No explicit modelling of spatial information ⇒ less discriminative power
  – Inferior to state of the art performance
  – Add dense features
Dalal & Triggs CVPR 2005: Pedestrian detection
• Objective: detect (localize) standing humans in an image
• Sliding window classifier
• Train a binary classifier on whether a window contains a standing person or not
• Histogram of Oriented Gradients (HOG) feature
• Although HOG + SVM was originally introduced for pedestrians, it has been used very successfully for many object categories
Feature: Histogram of Oriented Gradients (HOG)
[Figure: image, dominant gradient direction per cell, HOG descriptor; per-cell histogram of frequency vs orientation]
• tile 64 x 128 pixel window into 8 x 8 pixel cells
• each cell represented by histogram over 8 orientation bins (i.e. angles in range 0-180 degrees)
Histogram of Oriented Gradients (HOG) continued
• Adds a second level of overlapping spatial bins re-normalizing orientation histograms over a larger spatial area
• Feature vector dimension (approx) = 16 x 8 (for tiling) x 8 (orientations) x 4 (for blocks) = 4096
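The per-cell histogram step can be sketched as below. This is a deliberately simplified version: block re-normalisation is omitted, gradients come from `np.gradient` rather than the original [-1, 0, 1] filters, and only the cell size and bin count follow the slide.

```python
import numpy as np

def hog_cells(img, cell=8, bins=8):
    """Simplified HOG: per-cell histograms of unsigned gradient orientation
    (0-180 degrees), weighted by gradient magnitude. No block normalisation."""
    gy, gx = np.gradient(img.astype(float))        # gradients along rows, cols
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180     # fold to unsigned orientation
    h, w = img.shape
    bin_idx = np.minimum((ang / (180 / bins)).astype(int), bins - 1)
    out = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            out[i, j] = np.bincount(b, weights=m, minlength=bins)
    return out

img = np.tile(np.arange(64, dtype=float), (128, 1))  # pure horizontal gradient
H = hog_cells(img)
print(H.shape)   # (16, 8, 8): 16 x 8 cells, 8 orientation bins each
```

For this synthetic ramp image every gradient points the same way, so all the mass lands in a single orientation bin of every cell; a real image spreads mass across bins according to its local structure.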
Window (Image) Classification
• HOG Features
• Linear SVM classifier
Feature Extraction
Classifier
Training Data
pedestrian/Non-pedestrian
Averaged examples
Advantages of linear SVM:
• Training (Learning)• Very efficient packages for the linear case, e.g. LIBLINEAR for batch training and Pegasos for on-line training.
• Complexity O(N) for N training points (cf O(N^3) for general SVM)
• Testing (Detection)
Classifier: linear SVM, f(x) = wᵀx + b

Non-linear kernel:  f(x) = Σ_{i=1..S} αᵢ k(xᵢ, x) + b
  where S = # of support vectors = (worst case) N, the size of the training data

Linear kernel:  f(x) = Σ_{i=1..S} αᵢ xᵢᵀx + b = wᵀx + b
  ⇒ test cost independent of the size of the training data
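The collapse of the support-vector sum into a single weight vector can be verified numerically. A toy sketch with random "support vectors" (the coefficients fold αᵢyᵢ into a single signed αᵢ):

```python
import numpy as np

rng = np.random.default_rng(0)
sv = rng.normal(size=(50, 128))          # S support vectors (toy data)
alpha = rng.normal(size=50)              # signed coefficients alpha_i * y_i
b = 0.1
x = rng.normal(size=128)                 # one test point

# kernel expansion: sum_i alpha_i <x_i, x> + b  -- cost O(S * d) per window
f_kernel = np.sum(alpha * (sv @ x)) + b

# linear form: precompute w = sum_i alpha_i x_i once, then w^T x + b -- O(d)
w = alpha @ sv
f_linear = w @ x + b

print(np.isclose(f_kernel, f_linear))    # identical scores
```

This is exactly why linear SVMs are preferred for sliding windows: with ~100,000 windows per image, per-window cost must not depend on the training set size.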
Dalal and Triggs, CVPR 2005
Learned model
f(x) = wᵀx + b
average over positive training data
Slide from Deva Ramanan
Why does HOG + SVM work so well?
• Similar to SIFT, records the spatial arrangement of histogram orientations
• Compare to learning only edges:
  – Complex junctions can be represented
  – Avoids the problem of early thresholding
  – Also represents soft internal gradients
• Older methods based on edges have become largely obsolete
• HOG gives a fixed length vector for the window, suitable as a feature vector for an SVM
Training a sliding window detector
• Object detection is inherently asymmetric: much more “non-object” than “object” data
• Classifier needs to have a very low false positive rate
• Non-object category is very complex – need lots of data
Bootstrapping
1. Pick negative training set at random
2. Train classifier
3. Run on training data
4. Add false positives to training set
5. Repeat from 2
• Collect a finite but diverse set of non-object windows
• Force classifier to concentrate on hard negative examples
• For some classifiers can ensure equivalence to training on entire data set
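The five bootstrapping steps can be sketched as a small mining loop. Everything here is a stand-in: a perceptron-style update replaces the SVM, 2-D points replace HOG windows, and the cluster means are arbitrary; only the loop structure mirrors the slide.

```python
import numpy as np

def train_linear(X, y, epochs=100, lr=0.1):
    """Tiny perceptron-style linear classifier standing in for an SVM."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified -> update
                w += lr * yi * xi
                b += lr * yi
    return w, b

rng = np.random.default_rng(1)
pos = rng.normal(2.0, 0.5, size=(50, 2))            # "object" windows
neg_pool = rng.normal(-1.0, 1.5, size=(2000, 2))    # huge "non-object" pool

neg = neg_pool[rng.choice(len(neg_pool), 50, replace=False)]  # 1. random negatives
for _ in range(3):                                  # bootstrapping rounds
    X = np.vstack([pos, neg])
    y = np.r_[np.ones(50), -np.ones(len(neg))]
    w, b = train_linear(X, y)                       # 2. train classifier
    scores = neg_pool @ w + b                       # 3. run on training data
    hard = neg_pool[scores > 0]                     # 4. false positives = hard negatives
    neg = np.vstack([neg, hard])                    # 5. add them and repeat
print(len(neg))
```

The negative set only ever grows with examples the current classifier gets wrong, so training effort concentrates on the hard part of the (enormous) non-object class.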
Example: train an upper body detector
– Training data – used for training and validation sets
  • 33 Hollywood2 training movies
  • 1122 frames with upper bodies marked
– First stage training (bootstrapping)
  • 1607 upper body annotations jittered to 32k positive samples
  • 55k negatives sampled from the same set of frames
– Second stage training (retraining)
  • 150k hard negatives found in the training data
Training data – positive annotations
Positive windows
Note: common size and alignment
Jittered positives
Random negatives
Window (Image) first stage classification
HOG Feature Extraction
Linear SVM Classifier, f(x) = wᵀx + b
Jittered positives, random negatives
• find high scoring false positive detections
• these are the hard negatives for the next round of training
• cost = # training images × inference on each image
Hard negatives
First stage performance on validation set
Performance after retraining
Effects of retraining
Side by side
before retraining after retraining
Tracked upper body detections
Accelerating Sliding Window Search
• Sliding window search is slow because so many windows are needed, e.g. x × y × scale ≈ 100,000 for a 320×240 image
• Most windows are clearly not the object class of interest
• Can we speed up the search?
Cascaded Classification
• Build a sequence of classifiers with increasing complexity
[Diagram: Window → Classifier 1 → Classifier 2 → … → Classifier N → Face; each classifier either rejects the window as Non-face or passes it on as “possibly a face”; later classifiers are more complex and slower, with lower false positive rate]
• Reject easy non-objects using simpler and faster classifiers
Cascaded Classification
• Slow expensive classifiers only applied to a few windows ⇒ significant speed-up
• Controlling classifier complexity/speed:
  – Number of support vectors [Romdhani et al, 2001]
  – Number of features [Viola & Jones, 2001]
  – Type of SVM kernel [Vedaldi et al, 2009]
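The cascade control flow is simple enough to sketch directly. The two "stages" below are hypothetical toy score functions and thresholds, not anything from the cited detectors; only the early-reject structure matters.

```python
def cascade(window, stages):
    """Apply classifiers in order of increasing cost; reject at the first
    stage whose score falls below its threshold, accept only if all pass."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # rejected early by a cheap stage
    return True                   # "possibly a face" survived every stage

# toy stages: a fast weak filter followed by a slower, stricter one
stages = [(lambda w: sum(w) / len(w), 0.2),   # cheap: mean intensity
          (lambda w: min(w), 0.5)]            # expensive: strictest test

print(cascade([0.9, 0.8, 0.7], stages))  # True: passes both stages
print(cascade([0.1, 0.0, 0.1], stages))  # False: rejected by stage 1
```

Because the vast majority of windows are rejected by the first, cheapest stage, the expensive final classifier runs on only a tiny fraction of the ~100,000 candidates.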
Summary: Sliding Window Detection
• Can convert any image classifier into an object detector by sliding window. Efficient search methods available.
• Requirements for invariance are reduced by searching over e.g. translation and scale
• Spatial correspondence can be “engineered in” by spatial tiling
Outline
1. Image Classification
2. Object Category Detection
• Sliding window methods
• Histogram of Oriented Gradients (HOG)
• Learning an object detector
• PASCAL VOC (again) and two state of the art algorithms
3. The future and challenges
The PASCAL Visual Object Classes (VOC) Dataset and Challenge
Mark Everingham, Luc Van Gool, Chris Williams, John Winn, Andrew Zisserman
Detection: Evaluation of Bounding Boxes
• Area of Overlap (AO) Measure
Ground truth B_gt, predicted B_p
Detection if  area(B_gt ∩ B_p) / area(B_gt ∪ B_p)  >  50% threshold
• Evaluation: Average precision per class on predictions
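The area-of-overlap test above (intersection over union of the two boxes) can be implemented in a few lines; boxes here are assumed to be axis-aligned (x1, y1, x2, y2) tuples:

```python
def overlap_measure(bgt, bp):
    """Area of overlap: intersection over union of ground-truth and
    predicted boxes, each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(bgt[0], bp[0]), max(bgt[1], bp[1])
    ix2, iy2 = min(bgt[2], bp[2]), min(bgt[3], bp[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # 0 if boxes are disjoint
    union = ((bgt[2] - bgt[0]) * (bgt[3] - bgt[1]) +
             (bp[2] - bp[0]) * (bp[3] - bp[1]) - inter)
    return inter / union

gt = (0, 0, 10, 10)
pred = (5, 0, 15, 10)                # shifted halfway off the ground truth
ao = overlap_measure(gt, pred)
print(ao, ao > 0.5)                  # 1/3 -> not counted as a detection
```

Note that a box covering half the ground truth scores only 1/3, not 1/2, because the union in the denominator also grows; this makes the 50% threshold stricter than it may first appear.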