Lecture 17: Object Detection
Professor Fei-Fei Li
Stanford Vision Lab
30-Nov-11
Object detection
What will we learn today?
• Implicit Shape Model
– Representation
– Recognition
– Experiments and results
• Deformable Models
– The PASCAL challenge
– Latent SVM Model
Implicit Shape Model (ISM)
• Basic ideas
– Learn an appearance codebook
– Learn a star-topology structural model
• Features are considered independent given obj. center
• Algorithm: probabilistic Gen. Hough Transform
– Exact correspondences → Prob. match to object part
– NN matching → Soft matching
– Feature location on obj. → Part location distribution
– Uniform votes → Probabilistic vote weighting
– Quantized Hough array → Continuous Hough space
Source: Bastian Leibe
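The voting scheme above can be sketched in a few lines of Python (an illustrative toy, not the authors' implementation; the codebook entries and offsets are made up):

```python
import numpy as np

# Toy codebook: each entry stores the offsets from the feature to the
# object center that were observed during training.
codebook = {
    0: np.array([[10.0, 0.0], [12.0, 0.0]]),   # entry 0: two occurrences
    1: np.array([[-5.0, 8.0]]),                # entry 1: one occurrence
}

def cast_votes(feature_loc, matched_entries):
    """Each matched entry votes for object centers with probabilistic weights.

    Weight of a vote = p(C_i | f) * p(o, x | C_i, l): the matching mass is
    split uniformly over the matched entries, and each entry's mass is
    split uniformly over its training occurrences.
    """
    votes = []  # (center, weight) pairs kept at continuous locations
    p_match = 1.0 / len(matched_entries)       # p(C_i | f)
    for i in matched_entries:
        occ = codebook[i]
        p_occ = 1.0 / len(occ)                 # p(o, x | C_i, l)
        for offset in occ:
            votes.append((feature_loc + offset, p_match * p_occ))
    return votes

votes = cast_votes(np.array([50.0, 50.0]), matched_entries=[0, 1])
total = sum(w for _, w in votes)
print(len(votes), round(total, 6))  # 3 votes whose weights sum to 1.0
```

Because the weights of all votes cast by one feature sum to one, every feature contributes equally, no matter how many codebook entries it soft-matched.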
Implicit Shape Model: Basic Idea
• Visual vocabulary is used to index votes for object position (a visual word = "part").
Training image
Visual codeword with displacement vectors
Source: Bastian Leibe
B. Leibe, A. Leonardis, and B. Schiele, Robust Object Detection with Interleaved Categorization and
Segmentation, International Journal of Computer Vision, Vol. 77(1-3), 2008.
Implicit Shape Model: Basic Idea
• Objects are detected as consistent configurations of the observed parts (visual words).
Test image
Source: Bastian Leibe
B. Leibe, A. Leonardis, and B. Schiele, Robust Object Detection with Interleaved Categorization and
Segmentation, International Journal of Computer Vision, Vol. 77(1-3), 2008.
Implicit Shape Model - Representation
• Learn appearance codebook
– Extract local features at interest points
– Agglomerative clustering ⇒ codebook
• Learn spatial distributions
– Match codebook to training images
– Record matching positions on object
Training images (+ reference segmentation)
Appearance codebook
Spatial occurrence distributions (x, y, s) + local figure-ground labels
Source: Bastian Leibe
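As a toy sketch of this training procedure (illustrative only; `greedy_cluster` is a simple stand-in for the agglomerative clustering actually used, and all descriptors and offsets are made up):

```python
import numpy as np

def greedy_cluster(descriptors, thresh):
    """Stand-in for agglomerative clustering: assign each descriptor to
    the first cluster center within `thresh`, else start a new cluster."""
    centers = []
    assignment = []
    for d in descriptors:
        for ci, c in enumerate(centers):
            if np.linalg.norm(d - c) <= thresh:
                assignment.append(ci)
                break
        else:
            centers.append(d.copy())
            assignment.append(len(centers) - 1)
    return centers, assignment

# Descriptors extracted at interest points, plus their offsets from the
# object center in the (segmented) training images.
descs = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 5.0])]
offsets = [np.array([10.0, 2.0]), np.array([11.0, 2.0]), np.array([-4.0, 0.0])]

centers, assign = greedy_cluster(descs, thresh=1.0)
# Record the spatial occurrence distribution per codebook entry.
occurrences = {ci: [] for ci in range(len(centers))}
for ci, off in zip(assign, offsets):
    occurrences[ci].append(off)

print(len(centers), [len(v) for v in occurrences.values()])  # 2 entries with 2 and 1 occurrences
```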
Implicit Shape Model - Recognition
Interest Points → Matched Codebook Entries → Probabilistic Voting → 3D Voting Space (continuous; x, y, s) → Object Position
• For an image feature f with interpretation (codebook match) C_i, votes for object position (o, x) are weighted by p(C_i | f) · p(o_n, x | C_i, l).
• Probabilistic vote weighting (will be explained later in detail)
[Leibe, Leonardis, Schiele, SLCV'04; IJCV'08]
Implicit Shape Model - Recognition
[Leibe, Leonardis, Schiele, SLCV'04; IJCV'08]
Interest Points → Matched Codebook Entries → Probabilistic Voting → 3D Voting Space (continuous; x, y, s) → Backprojection of Maxima → Backprojected Hypotheses
Example: Results on Cows
• Original image → interest points → matched patches → probabilistic votes
• 1st hypothesis, 2nd hypothesis, 3rd hypothesis
Source: K. Grauman & B. Leibe
Scale Invariant Voting
• Scale-invariant feature selection
– Scale-invariant interest points
– Rescale extracted patches
– Match to constant-size codebook
• Generate scale votes
– Scale as 3rd dimension in voting space
– Search for maxima in 3D voting space (search window in x, y, s)
Source: Bastian Leibe
Scale Voting: Efficient Computation
• Continuous Generalized Hough Transform
– Binned accumulator array similar to the standard Generalized Hough Transform
– Quickly identify candidate maxima locations
– Refine locations by Mean-Shift search only around those points
⇒ Avoid quantization effects by keeping exact vote locations.
⇒ Mean-shift interpretation as kernel probability density estimation.
(Pipeline: scale votes → binned accumulator array → candidate maxima → refinement by Mean-Shift, all in (x, y, s))
Source: Bastian Leibe
Scale Voting: Efficient Computation
• Scale-adaptive Mean-Shift search for refinement
– Increase search window size with hypothesis scale
– Scale-adaptive balloon density estimator
Source: Bastian Leibe
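A minimal mean-shift refinement sketch, assuming votes kept at exact continuous locations and a flat kernel (illustrative, not the authors' code; the balloon estimator would additionally grow the bandwidth with the hypothesis scale s):

```python
import numpy as np

def mean_shift(votes, weights, start, bandwidth, iters=50):
    """Weighted mean-shift with a flat (uniform) kernel of given bandwidth."""
    pos = np.asarray(start, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(votes - pos, axis=1)
        inside = d <= bandwidth          # votes inside the search window
        if not inside.any():
            break
        new_pos = np.average(votes[inside], axis=0, weights=weights[inside])
        if np.linalg.norm(new_pos - pos) < 1e-6:
            pos = new_pos
            break
        pos = new_pos
    return pos

# (x, y, s) votes clustered around (20, 30, 1.0) plus one far-away outlier;
# the binned accumulator would have proposed a start near the cluster.
votes = np.array([[20.0, 30.0, 1.0], [20.5, 30.2, 1.1], [19.6, 29.9, 0.9],
                  [80.0, 10.0, 2.0]])
weights = np.array([0.4, 0.3, 0.3, 1.0])
mode = mean_shift(votes, weights, start=[21.0, 30.0, 1.0], bandwidth=3.0)
print(np.round(mode, 2))  # converges onto the cluster, ignoring the outlier
```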
Detection Results
• Qualitative Performance
– Recognizes different kinds of objects
– Robust to clutter, occlusion, noise, low contrast
Source: Bastian Leibe
Figure-Ground Segregation
• What happens first – segmentation or recognition?
• Problem extensively studied in psychophysics
• Experiments with ambiguous figure-ground stimuli
• Results:
– Evidence that object recognition can and does operate before figure-ground organization
– Interpreted as a Gestalt cue: familiarity.
M.A. Peterson, "Object Recognition Processes Can and Do Operate Before Figure-Ground Organization", Current Directions in Psychological Science, 3:105-111, 1994.
ISM – Top-Down Segmentation
Interest Points → Matched Codebook Entries → Probabilistic Voting → 3D Voting Space (continuous; x, y, s) → Backprojection of Maxima → Backprojected Hypotheses → Segmentation (p(figure) probabilities)
[Leibe, Leonardis, Schiele, SLCV'04; IJCV'08]
Top-Down Segmentation: Motivation
• Secondary hypotheses ("mixtures of cars/cows/etc.")
– Desired property of the algorithm! ⇒ robustness to occlusion
– Standard solution: reject based on bounding box overlap
⇒ Problematic: may lead to missing detections!
⇒ Use segmentations to resolve ambiguities instead.
– Basic idea: each observed pixel can only be explained by (at most) one detection.
Source: Bastian Leibe
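The pixel-explanation idea can be sketched as a greedy selection over hypothesis masks (an illustrative toy, not the authors' procedure; `min_support` and the masks are made up):

```python
import numpy as np

def select_hypotheses(masks, scores, min_support=20):
    """Greedily accept hypotheses in score order; a hypothesis only gets
    credit for pixels not yet claimed by a previously accepted one."""
    order = np.argsort(scores)[::-1]
    claimed = np.zeros_like(masks[0], dtype=bool)
    accepted = []
    for i in order:
        novel = masks[i] & ~claimed       # pixels this hypothesis adds
        if novel.sum() >= min_support:    # enough unexplained evidence?
            accepted.append(int(i))
            claimed |= masks[i]
    return accepted

h, w = 20, 20
m1 = np.zeros((h, w), bool); m1[2:12, 2:12] = True    # strong detection
m2 = np.zeros((h, w), bool); m2[3:11, 3:11] = True    # duplicate inside m1
m3 = np.zeros((h, w), bool); m3[12:19, 12:19] = True  # separate object
accepted = select_hypotheses([m1, m2, m3], scores=np.array([0.9, 0.5, 0.7]))
print(accepted)  # the duplicate hypothesis is suppressed
```

Unlike bounding-box overlap tests, the duplicate is rejected because its pixels are already explained, while genuinely overlapping but distinct objects can both survive.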
Segmentation: Probabilistic Formulation
• Influence of a patch (f, l) on the object hypothesis (vote weight):
p(o_n, x | f, l) = Σ_i p(o_n, x | C_i, l) · p(C_i | f)
• Backprojection to features f and pixels p:
p(p = figure | o_n, x) = Σ_{(f,l)} p(p = figure | o_n, x, f, l) · p(f, l | o_n, x)
(segmentation information × influence on the object hypothesis)
[Leibe, Leonardis, Schiele, SLCV'04; IJCV'08]
Derivation: ISM Recognition
[Leibe, Leonardis, Schiele, SLCV'04; IJCV'08]
• Algorithm stages
1. Voting
2. Mean-shift search
3. Backprojection
• Vote weights: contribution of a single feature f at location l, routed through its codebook matches C_i (matching probability p(C_i | f), occurrence distribution p(o_n, x | C_i, l)).
• Probability that object o_n occurs at location x given (f, l):
p(o_n, x | f, l) = Σ_i p(C_i | f) · p(o_n, x | C_i, l)
• How to measure those probabilities?
– p(C_i | f) = 1 / |C*|, where C* = {C_i : d(C_i, f) ≤ θ} is the set of activated codebook entries
– p(o_n, x | C_i, l) = 1 / #occurrences(C_i)
• Likelihood of the observed features given the object hypothesis:
p(f, l | o_n, x) = p(o_n, x | f, l) · p(f, l) / p(o_n, x) = Σ_i p(o_n, x | C_i, l) · p(C_i | f) · p(f, l) / p(o_n, x)
where p(o_n, x) is a prior for the object location and p(f, l) is an indicator variable for the sampled features.
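As a sketch, the two uniform distributions, p(C_i | f) = 1/|C*| over activated entries and p(o_n, x | C_i, l) = 1/#occurrences(C_i), can be computed like this (toy data; names are illustrative):

```python
import numpy as np

# Hypothetical codebook centers and per-entry occurrence counts.
codebook_centers = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]])
n_occurrences = [4, 2, 10]  # training occurrences stored per entry

def match_and_weight(f, theta):
    d = np.linalg.norm(codebook_centers - f, axis=1)
    activated = np.where(d <= theta)[0]   # C* = {C_i : d(C_i, f) <= theta}
    p_match = 1.0 / len(activated)        # p(C_i | f), uniform over C*
    # each vote cast from entry i carries weight p(C_i|f) / #occurrences(C_i)
    return {int(i): p_match / n_occurrences[i] for i in activated}

weights = match_and_weight(np.array([0.1, 0.0]), theta=1.0)
print(weights)  # entries 0 and 1 activated; per-vote weights 1/8 and 1/4
```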
Derivation: ISM Top-Down Segmentation
[Leibe, Leonardis, Schiele, SLCV'04; IJCV'08]
• Algorithm stages
1. Voting
2. Mean-shift search
3. Backprojection
• Figure-ground backprojection: each recorded occurrence carries a figure/ground label, p(p = figure | o_n, x, f, C_i, l), and each vote has a known influence on the object hypothesis, p(f, l | o_n, x) = Σ_i p(o_n, x | C_i, l) · p(C_i | f) · p(f, l) / p(o_n, x).
• Marginalize over all codebook entries matched to f, then over all features containing pixel p:
p(p = figure | o_n, x) = Σ_{(f,l)∋p} Σ_i p(p = figure | o_n, x, f, C_i, l) · p(o_n, x | C_i, l) · p(C_i | f) · p(f, l) / p(o_n, x)
Top-Down Segmentation Algorithm
• This may sound quite complicated, but it boils down to a very simple algorithm…
[Leibe, Leonardis, Schiele, SLCV'04; IJCV'08]
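A sketch of that simple algorithm, assuming we already have the backprojected votes for one hypothesis, each with a patch position, a stored figure/ground mask, and a vote weight (all toy values; not the authors' implementation):

```python
import numpy as np

h, w = 10, 10
p_figure = np.zeros((h, w))
p_ground = np.zeros((h, w))

# Each backprojected vote: (patch top-left, patch figure-mask, vote weight).
patch_mask = np.ones((4, 4)); patch_mask[3, :] = 0.0  # bottom row was ground
votes = [((2, 2), patch_mask, 0.6), ((3, 3), patch_mask, 0.4)]

for (r, c), mask, wgt in votes:
    ph, pw = mask.shape
    p_figure[r:r+ph, c:c+pw] += wgt * mask          # accumulate figure evidence
    p_ground[r:r+ph, c:c+pw] += wgt * (1.0 - mask)  # accumulate ground evidence

# Per-pixel decision: figure wherever figure evidence dominates.
segmentation = p_figure > p_ground
print(int(segmentation.sum()))
```

Every contributing patch simply stamps its weighted figure/ground labels back into the image; the segmentation is a per-pixel comparison of the two accumulators.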
Segmentation
• Interpretation of the p(figure) map
– Per-pixel confidence in the object hypothesis
– Use for hypothesis verification
(Original image → p(figure) / p(ground) → segmentation)
[Leibe, Leonardis, Schiele, SLCV'04; IJCV'08]
Example Results: Motorbikes
[Leibe, Leonardis, Schiele, SLCV’04; IJCV’08]
Example Results: Cows
• Training: 112 hand-segmented images
• Results on novel sequences: single-frame recognition – no temporal continuity used!
[Leibe, Leonardis, Schiele, SLCV'04; IJCV'08]
Example Results: Chairs
• Office chairs
• Dining room chairs
Source: Bastian Leibe
Detections Using Ground Plane Constraints
• Left camera, 1175 frames
• Battery of 5 ISM detectors for different car views
[Leibe, Leonardis, Schiele, SLCV'04; IJCV'08]
Inferring Other Information: Part Labels (1)
Training → Test → Output
[Thomas, Ferrari, Tuytelaars, Leibe, Van Gool, 3DRR'07; RSS'08]
Inferring Other Information: Part Labels (2)
[Thomas, Ferrari, Tuytelaars, Leibe, Van Gool, 3DRR'07; RSS'08]
Inferring Other Information: Depth Maps
"Depth from a single image"
[Thomas, Ferrari, Tuytelaars, Leibe, Van Gool, 3DRR'07; RSS'08]
Extension: Estimating Articulation
• Try to fit a silhouette to the detected person
• Basic idea: search for the silhouette that simultaneously optimizes
– the Chamfer match to the distance-transformed edge image
– the overlap with the top-down segmentation
• Enforces global consistency
• Caveat: reintroduces reliance on a global model
[Leibe, Seemann, Schiele, CVPR'05]
Extension: Rotation-Invariant Detection
• Polar instead of Cartesian voting scheme: votes are cast in terms of angles and distance (θ, φ, d) rather than (x, y)
• Benefits:
– Recognize objects under image-plane rotations
– Possibility to share parts between articulations
• Caveats:
– Rotation invariance should only be used when it is really needed (it also increases false positive detections)
[Mikolajczyk, Leibe, Schiele, CVPR'06]
Sometimes, Rotation Invariance Is Needed…
[Mikolajczyk et al., CVPR’06]
You Can Try It At Home…
• Linux binaries available
– Including datasets & several pre-trained detectors
– http://www.vision.ee.ethz.ch/bleibe/code
Source: Bastian Leibe
Discussion: Implicit Shape Model
• Pros:
– Works well for many different object categories, both rigid and articulated objects
– Flexible geometric model: can recombine parts seen on different training examples
– Learning from relatively few (50-100) training examples
– Optimized for detection; good localization properties
• Cons:
– Needs supervised training data: object bounding boxes for detection, reference segmentations for top-down segmentation
– Only weak geometric constraints: result segmentations may contain superfluous body parts
– Purely representative model: no discriminative learning
Source: Bastian Leibe
What will we learn today?
• Implicit Shape Model
– Representation
– Recognition
– Experiments and results
• Deformable Models
– The PASCAL challenge
– Latent SVM Model
Object Detection – the PASCAL Challenge
• ~10,000 images, with ~25,000 target objects.
– Objects from 20 categories (person, car, bicycle, cow, table...).
– Objects are annotated with labeled bounding boxes.
Source: Pedro Felzenszwalb
Latent SVM Model: an Overview
root filter + part filters + deformation models → detection
Source: Pedro Felzenszwalb
Histogram of Oriented Gradient (HOG) Features
• The image is partitioned into 8x8 pixel blocks.
• In each block we compute a histogram of gradient orientations.
– Invariant to changes in lighting, small deformations, etc.
• We compute features at different resolutions (pyramid).
Source: Pedro Felzenszwalb
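A minimal sketch of the per-cell histograms (illustrative only; real HOG additionally interpolates votes between bins and cells and normalizes over blocks):

```python
import numpy as np

def cell_histograms(img, cell=8, bins=9):
    """Accumulate gradient magnitude into orientation bins per cell."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    h, w = img.shape
    hist = np.zeros((h // cell, w // cell, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(h // cell):
        for j in range(w // cell):
            sl = np.s_[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            hist[i, j] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)
    return hist

# A vertical step edge: all gradient energy falls into the 0-degree
# (horizontal gradient) orientation bin.
img = np.zeros((16, 16)); img[:, 8:] = 1.0
H = cell_histograms(img)
print(H.shape, int(np.argmax(H.sum(axis=(0, 1)))))
```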
Filters
• Filters are rectangular templates defining weights for features.
• The score is the dot product of the filter and a subwindow of the HOG pyramid: the score of filter W at a location in HOG pyramid H is H · W.
Source: Pedro Felzenszwalb
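The dot-product scoring can be sketched directly (toy feature dimensions; in practice this is computed as a correlation over every level of the HOG pyramid):

```python
import numpy as np

def filter_scores(feat, filt):
    """feat: (H, W, d) feature map, filt: (h, w, d) filter. Returns the
    (H-h+1, W-w+1) map of dot-product scores, one per placement."""
    H, W, _ = feat.shape
    h, w, _ = filt.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feat[i:i+h, j:j+w] * filt)
    return out

feat = np.zeros((5, 5, 2))
feat[2, 2] = [1.0, 1.0]                               # one "strong" HOG cell
filt = np.zeros((3, 3, 2)); filt[1, 1] = [2.0, 2.0]   # template liking it centered
S = filter_scores(feat, filt)
print(S.shape, np.unravel_index(np.argmax(S), S.shape))  # best placement (1, 1)
```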
Object Hypothesis
• The multiscale model captures features at two resolutions.
• The score is the sum of filter scores plus deformation scores.
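A sketch of that score, assuming quadratic deformation costs relative to each part's anchor (hypothetical toy numbers; the released system also learns linear deformation terms):

```python
import numpy as np

def hypothesis_score(root_score, part_scores, anchors, placements, defs):
    """part_scores[k]: appearance score of part k at its chosen placement;
    anchors[k]/placements[k]: ideal and actual (x, y); defs[k]: (dx2, dy2)
    quadratic deformation weights."""
    total = root_score
    for k in range(len(part_scores)):
        dx, dy = np.subtract(placements[k], anchors[k])
        cost = defs[k][0] * dx**2 + defs[k][1] * dy**2  # deformation penalty
        total += part_scores[k] - cost
    return total

# Two parts: one sits exactly on its anchor, one is displaced by (1, 2).
score = hypothesis_score(
    root_score=3.0,
    part_scores=[2.0, 1.5],
    anchors=[(4, 4), (10, 4)],
    placements=[(4, 4), (11, 6)],
    defs=[(0.1, 0.1), (0.1, 0.1)],
)
print(score)  # 3.0 + 2.0 + (1.5 - 0.1*1 - 0.1*4) = 6.0
```

At detection time the parts are placed to maximize this score, which can be done efficiently with a distance transform over the part score maps.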
Training the Latent SVM Model
• Training data consists of images with labeled bounding boxes.
• We need to learn the model structure, the filters, and the deformation costs.
Source: Pedro Felzenszwalb
Connection with Linear Classifiers
• The score of the model is the sum of filter scores plus deformation scores.
– The bounding box in the training data specifies that the score should be high for some placement in a range.
• Standard SVM: the score is a dot product of a weight vector and features.
• Latent SVM: the score is the maximum over latent values z of w · Φ(x, z), where w is a model (a concatenation of filters and deformation parameters), x is a detection window, and Φ(x, z) is a concatenation of features and part displacements for filter placements z.
Latent SVM Training
• Semi-convex optimization problem
– The objective is linear in w if the latent variables z are fixed
– It becomes convex if we fix z for the positive examples
• Iterative optimization procedure:
– Initialize w
– Iterate:
• Pick the best z for each positive example
• Optimize w via gradient descent with data mining (hard negatives)
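The alternation can be sketched on a toy problem where the latent variable just selects one of several candidate feature vectors per example (illustrative only; the real trainer mines hard negatives over millions of windows):

```python
import numpy as np

def best_z(w, phis):
    """Pick the highest-scoring latent placement under the current w."""
    return int(np.argmax([w @ p for p in phis]))

def train(positives, negatives, outer_iters=10, inner_iters=50, lr=0.1, C=1.0):
    w = np.zeros(len(positives[0][0]))
    for _ in range(outer_iters):
        # Step 1: fix z for the positives -> objective becomes convex in w.
        pos_feats = [ex[best_z(w, ex)] for ex in positives]
        # Step 2: subgradient descent on the regularized hinge loss.
        for _ in range(inner_iters):
            g = w.copy()                 # gradient of the 1/2 ||w||^2 term
            for phi in pos_feats:
                if w @ phi < 1:          # margin violated by a positive
                    g -= C * phi
            for ex in negatives:
                phi = ex[best_z(w, ex)]  # max over z stays inside the loss
                if w @ phi > -1:         # margin violated by a negative
                    g += C * phi
            w -= lr * g
    return w

# Each example = list of candidate feature vectors (one per latent shift).
positives = [[np.array([1.0, 0.0]), np.array([0.9, 0.1])] for _ in range(5)]
negatives = [[np.array([-1.0, 0.2]), np.array([-0.8, 0.0])] for _ in range(5)]
w = train(positives, negatives)
print(w @ np.array([1.0, 0.0]) > 0, w @ np.array([-1.0, 0.0]) < 0)
```

Note the asymmetry that makes the problem semi-convex: for negatives the max over z is kept inside the loss (still convex), while for positives it is fixed each outer iteration.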
Latent SVM Training: Initializing w
• For a k-component mixture model: split the examples into k sets based on bounding box aspect ratio
• Learn k root filters using a standard SVM
– Training data: warped positive examples and random windows from negative images (Dalal & Triggs)
• Initialize parts by selecting patches from the root filters:
– Sub-windows with strong coefficients
– Interpolate to get higher-resolution filters
– Initialize the spatial model using fixed spring constants
Learned Models
Example Results
Source: Pedro Felzenszwalb
More Results
Quantitative Results
• 9 systems competed in the 2007 challenge.
• Out of 20 classes:
– First place in 10 classes
– Second place in 6 classes
• Some statistics:
– It takes ~2 seconds to evaluate a model on one image.
– It takes ~3 hours to train a model.
– MUCH faster than most systems.
Source: Pedro Felzenszwalb
Code for Latent SVM
Source code for the system, and models trained on PASCAL 2006, 2007, and 2008 data, are available at:
http://www.cs.uchicago.edu/~pff/latent
Source: Pedro Felzenszwalb
Summary
• Deformable models provide an elegant framework for object detection and recognition.
– Efficient algorithms exist for matching models to images.
– Applications: pose estimation, medical image analysis, object recognition, etc.
• We can learn models from partially labeled data.
– Generalizes standard ideas from machine learning.
– Leads to state-of-the-art results in the PASCAL challenge.
• Future work: hierarchical models, grammars, 3D objects.
Source: Pedro Felzenszwalb
What we have learned today
• Implicit Shape Model
– Representation
– Recognition
– Experiments and results
• Deformable Models
– The PASCAL challenge
– Latent SVM Model