Weakly supervised learning – motivation
• Massive and ever-growing amount of digital image and video content
  – Flickr and YouTube
  – Audiovisual archives (BBC, INA)
  – Personal collections
• Comes with meta-data: text, audio, user click data, …
• Meta-data is sparse and noisy
Object detection
Weakly supervised learning for images
• Given a set of images with positive and negative labels, determine the object region and learn a detector
• Avoids costly annotation of object regions
Our approach – descriptors
• Extract selective search regions [Uijlings et al., ICCV'13]
• Regions described with high-dimensional Fisher vectors or CNNs
• Images labeled as positive or negative
Multi-fold MIL
[Cinbis, Verbeek & Schmid, Multi-fold MIL for WS object localization, CVPR'14]
Multi-fold MIL
• Avoid re-localization bias, since the windows used for training and evaluation are different
Our approach: multi-fold training for MIL
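To make the multi-fold training concrete, here is a minimal sketch, assuming each positive image is represented as a list of candidate-window feature vectors (index 0 taken as the full-image window) and negatives as a feature matrix; the data layout and the linear SVM are illustrative assumptions, not the authors' code.

```python
# Minimal multi-fold MIL sketch (assumptions: pos_images[i] is a list of
# feature vectors, one per candidate window, with index 0 the full image;
# neg_features is a 2-D array of negative-window features; a linear SVM
# stands in for the detector).
import numpy as np
from sklearn.svm import LinearSVC

def multifold_mil(pos_images, neg_features, n_folds=10, n_iters=5):
    selected = [0] * len(pos_images)            # current window per positive
    folds = np.arange(len(pos_images)) % n_folds
    for _ in range(n_iters):
        for k in range(n_folds):
            # Train on windows currently selected in the *other* folds ...
            train_idx = [i for i in range(len(pos_images)) if folds[i] != k]
            X = np.vstack([pos_images[i][selected[i]] for i in train_idx]
                          + [neg_features])
            y = np.hstack([np.ones(len(train_idx)),
                           -np.ones(len(neg_features))])
            clf = LinearSVC(C=1.0).fit(X, y)
            # ... then re-localize in fold k, so the windows used for training
            # and for re-localization never come from the same images.
            for i in np.where(folds == k)[0]:
                scores = clf.decision_function(np.vstack(pos_images[i]))
                selected[i] = int(np.argmax(scores))
    # Final detector trained on all selected windows.
    X = np.vstack([im[s] for im, s in zip(pos_images, selected)] + [neg_features])
    y = np.hstack([np.ones(len(pos_images)), -np.ones(len(neg_features))])
    return LinearSVC(C=1.0).fit(X, y), selected
```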
Refinement of selected boxes
• Window refinement by local search to align windows with contours
[Zitnick & Dollár, Edge boxes: locating object proposals from edges, ECCV'14]
Refinement of selected boxes
• Locally refine the top 10 scoring boxes with the "edgebox" score
• "Edgebox" score: encourages alignment with long contours, discourages contours straddling the window
• Final score: "edgebox" score + "selection" score (rough sketch below)
[Cinbis, Verbeek & Schmid, WS Object Localization with Multi-fold MIL, arXiv'15]
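A rough sketch of that refinement step, assuming two hypothetical callables: `selection_score(box)` for the detector's selection score and `edgebox_score(box)` wrapping an Edge-Boxes style contour score; the step sizes and the greedy local search are illustrative choices.

```python
# Hypothetical helpers: selection_score(box) and edgebox_score(box),
# with boxes given as (x, y, w, h).
def refine_top_boxes(boxes, selection_score, edgebox_score,
                     top_k=10, steps=(-8, 0, 8)):
    refined = []
    # Refine only the top-k boxes under the selection score.
    for x, y, w, h in sorted(boxes, key=selection_score, reverse=True)[:top_k]:
        best, best_score = None, float('-inf')
        # Local search over small shifts of position and size; keep the
        # window maximising "edgebox" score + "selection" score.
        for dx in steps:
            for dy in steps:
                for dw in steps:
                    for dh in steps:
                        cand = (x + dx, y + dy, max(8, w + dw), max(8, h + dh))
                        s = edgebox_score(cand) + selection_score(cand)
                        if s > best_score:
                            best, best_score = cand, s
        refined.append((best, best_score))
    return refined
```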
Comparison to the state of the art
• State-of-the-art results for WS localization
• Further improve “initial” and “selected” windows
• Update the CNN features (fine-tuning)
• Dealing with noisy or missing image labels (e.g. Google image download)
Overview
Supervision spectrum, from full supervision to none:
• Positives + bounding boxes: object detection (Leibe et al.'08; Felzenszwalb et al.'10; Girshick et al.'14)
• Positive and negative image labels: weakly supervised localization (Chum'07; Pandey'11; Deselaers'12; Siva'12; Shi'13; Cinbis'14; Wang'14)
• Positives only: co-segmentation / co-localization (Rother'06; Russell'06; Joulin'10; Kim'11; Vicente'11; Joulin'14; Tang'14)
• None: unsupervised discovery (Grauman & Darrell'05; Sivic et al.'05,'08; Kim et al.'05,'09)
• Correspondence-based methods (Russell et al.'06; Cho et al.'10; Rubinstein & Joulin'13; Rubio et al.'13)
Our approach
• Correspondences as a substitute for supervision
  – between parts and objects
  – picked from bottom-up segmentation proposals
  – and k-nearest-neighbor images
• How?
  – Probabilistic Hough matching
  – Stand-out scoring of part hierarchies
[Cho, Kwak, Schmid & Ponce, Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals, CVPR'15]
Finding parts and objects among region candidates
Here: bottom-up segmentation proposals (Manen et al.'13, Uijlings et al.'13) and HOG descriptors (Dalal & Triggs'05); a small descriptor sketch follows below.
Matching model – Probabilistic Hough matching
(match m, data d, configuration c)

• Probabilistic model
  P(m | d) = Σ_c P(m | c, d) P(c | d)
           = P(m_a | d) Σ_c P(m_g | c) P(c | d)
  with an appearance term m_a and a geometry term m_g

• Probabilistic Hough transform
  P(c | d) ≈ H(c | d) = Σ_m P(m | c, d) = Σ_m P(m_g | c) P(m_a | d)
  (Hough'59; Ballard'81; Stephens'91; Leibe et al.'04; Maji & Malik'09; Barinova et al.'12)

• Region confidence
  C(r' | [d', d'']) = max_{r''} P((r', r'') | [d', d''])

• From two images to multiple images (a rough sketch of the two-image case follows below)
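To make the matching model concrete, here is a minimal NumPy sketch of the two-image case, assuming each image comes with region-proposal boxes (x, y, w, h) and appearance feature vectors; the offset binning, the hard geometric assignment and all parameter values are illustrative assumptions, not the exact model of the CVPR'15 paper.

```python
# Minimal probabilistic Hough matching sketch between two images.
# boxes*: (n, 4) arrays of (x, y, w, h); feats*: (n, d) appearance features.
import numpy as np

def phm_match_scores(boxes1, feats1, boxes2, feats2, n_bins=8):
    """Return an (n1, n2) array of match scores P(m | d)."""
    # Appearance term P(m_a | d): similarity of L2-normalised features.
    f1 = feats1 / np.linalg.norm(feats1, axis=1, keepdims=True)
    f2 = feats2 / np.linalg.norm(feats2, axis=1, keepdims=True)
    app = np.clip(f1 @ f2.T, 0.0, None)                   # (n1, n2)

    # Geometry: offsets between region centres define the configuration c.
    c1 = boxes1[:, :2] + boxes1[:, 2:] / 2.0
    c2 = boxes2[:, :2] + boxes2[:, 2:] / 2.0
    offsets = c1[:, None, :] - c2[None, :, :]              # (n1, n2, 2)

    # Discretise offsets into a coarse 2-D Hough space.
    lo = offsets.min(axis=(0, 1))
    span = offsets.max(axis=(0, 1)) - lo + 1e-9
    bins = np.clip((offsets - lo) / span * n_bins, 0, n_bins - 1).astype(int)
    cell = bins[..., 0] * n_bins + bins[..., 1]            # flat config index

    # Hough transform: H(c | d) = sum_m P(m_g | c) P(m_a | d),
    # with P(m_g | c) taken here as a hard assignment to the offset bin.
    hough = np.bincount(cell.ravel(), weights=app.ravel(),
                        minlength=n_bins * n_bins)

    # Match score: P(m | d) ~ P(m_a | d) * H(c(m) | d).
    return app * hough[cell]
```

The region confidence from the slide, C(r' | [d', d'']) = max_{r''} P((r', r'') | [d', d'']), is then simply a row-wise maximum of the returned matrix.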
Stand-out scoring of part hierarchies
• Object regions should contain
  – more foreground than part regions
  – less background than larger regions
• Stand-out score: S(r) = C(r) − max_{r' ⊃ r} C(r'), i.e. the confidence of r minus the best confidence of any region containing r (sketch below)
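A small sketch of the stand-out score under that definition, with boxes as (x, y, w, h) and `conf` the per-region confidences from matching; treating a region with no container as keeping its raw confidence is an assumption of this sketch.

```python
# Stand-out scoring: a region's confidence is reduced by the best confidence
# of any region containing it, so boxes that stand out from their
# backgrounds score highest.
import numpy as np

def contains(outer, inner):
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ox + ow >= ix + iw and oy + oh >= iy + ih

def standout_scores(boxes, conf):
    scores = []
    for i, r in enumerate(boxes):
        # Confidences of other regions that contain r.
        containers = [conf[j] for j, r2 in enumerate(boxes)
                      if j != i and contains(r2, r)]
        scores.append(conf[i] - (max(containers) if containers else 0.0))
    return np.array(scores)
```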
An iterative algorithm – iteration 1
• Retrieve 10 nearest neighbors with GIST (Oliva & Torralba'06)
• Probabilistic Hough matching with the 10 NN
• For all images: localize the top 5 scoring windows with the stand-out score

An iterative algorithm – next iterations
• Exploit the selected regions: retrieve 10 NN using PHM with the top-confidence regions
• Probabilistic Hough matching with the 10 NN
• Localize the 5 top scoring windows per image with the stand-out score
(A sketch of the overall loop follows below.)
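A high-level sketch of that loop; only the control flow is taken from the slides, and the four callables (GIST retrieval, PHM confidences, PHM-based retrieval, stand-out scoring) are placeholders supplied by the caller.

```python
# Iterative discovery loop (control-flow sketch; the callables are
# placeholders for the steps described on the slides).
import numpy as np

def discover(images, gist_neighbours, phm_confidences, phm_neighbours,
             standout_scores, n_neighbours=10, n_windows=5, n_iters=5):
    """images: list of objects with a `.boxes` attribute (region proposals)."""
    # Iteration 1: neighbours retrieved with global GIST descriptors.
    neighbours = gist_neighbours(images, k=n_neighbours)
    selected = {}
    for _ in range(n_iters):
        for i, img in enumerate(images):
            # Probabilistic Hough matching against the current neighbours.
            conf = phm_confidences(img, [images[j] for j in neighbours[i]])
            # Keep the top-scoring windows under the stand-out score.
            order = np.argsort(-standout_scores(img.boxes, conf))
            selected[i] = [img.boxes[j] for j in order[:n_windows]]
        # Next iterations: re-retrieve neighbours with PHM on the
        # top-confidence regions selected so far.
        neighbours = phm_neighbours(images, selected, k=n_neighbours)
    return selected
```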
Localization improvement over iterations
Comparative evaluation
Two benchmarks:
• Object discovery dataset (Rubinstein et al.'13)
  – Subset of 300 images from 3 classes
  – Includes between 7 and 18 outliers per class
• Pascal'07 – all (Everingham et al.'07)
  – 4548 images from 20 classes
  – From the train/val set, minus difficult/truncated images

Computing time: < 1 h for 500 images on a 10-core desktop

Performance metrics:
• CorLoc: percentage of boxes with intersection-over-union > 0.5 (see the helper below)
• CorRet: percentage of the retrieved 10 NNs in the same class as the query
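For reference, a minimal helper for the CorLoc criterion, with boxes as (x, y, w, h) and one predicted box per image compared against its ground-truth box:

```python
# Intersection-over-union and CorLoc (fraction of images with IoU > 0.5).
def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def corloc(pred_boxes, gt_boxes, thresh=0.5):
    hits = [iou(p, g) > thresh for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / float(len(hits))
```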
Experimental results: Object discovery dataset
• CorLoc – separate classes
• CorLoc / CorRet – mixed classes without labels
• Examples – mixed classes without labels

Experimental results: Pascal'07 – all
• CorLoc – separate classes
• Examples – mixed classes without labels: successes
• Examples – mixed classes without labels: failures
Discussion and future work
• Effective method for object discovery and localization in challenging unlabeled scenarios
• No use of saliency or objectness measures
• No use of negative examples or pretrained features
• Next:
  – Image categorization and object detection
  – Handling multiple objects per image
Overview
• Easier to separate the object from the background: reduces the need for bounding-box annotation
• A video shows a range of variations for an object: easier to learn multi-view, articulation, illumination
• Many frames, easy to access (e.g. YouTube): lots of extra data!
Automatic extraction of objects
[Prest, Leistner, Civera, Schmid & Ferrari, Learning object
detectors from weakly annotated video, CVPR’12]
Data collection
1. Pick 10 moving classes from PASCAL VOC
2. Collect 9–24 videos per class from YouTube (~500 shots per class)
3. Shot change detection chunks the videos into shots (a generic sketch follows below)
• Total: 0.57 million frames
• Video-level label only (i.e. some shots that do not show the class are kept)
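Step 3 can be illustrated with a generic colour-histogram shot-boundary detector; this is a standard heuristic, not necessarily the detector used to build the dataset.

```python
# Generic shot-boundary detection: flag a boundary when colour histograms of
# consecutive frames are poorly correlated.
import cv2

def shot_boundaries(video_path, thresh=0.5):
    cap = cv2.VideoCapture(video_path)
    prev_hist, boundaries, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Low correlation between consecutive histograms => new shot.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < thresh:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```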
Step 1
[N. Sundaram et al., Dense point trajectories by GPU-accelerated
large displacement optical flow, ECCV 2010]
Candidate tubes – motion segmentation
Selecting tubes
• Jointly select one tube per shot by energy minimization (toy sketch below)
• Graph: one node per shot; node states = candidate tubes
• Unary potential: homogeneity within a tube, location prior
• Pairwise potential: similarity between tubes from different shots, based on appearance descriptors (BOW, HOG) extracted for a fixed number of frames per tube
• Find the states minimizing the sum of potentials; inference with TRW-S [V. Kolmogorov, Convergent tree-reweighted message passing for energy minimization, PAMI'06]
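A toy sketch of this energy, assuming `unary[s][t]` is the cost of picking tube t in shot s (homogeneity, location prior) and `pairwise[s1][s2][t1][t2]` the cost of picking tubes t1 and t2 in two different shots (appearance dissimilarity); a simple coordinate-descent pass stands in for the TRW-S inference used in the paper.

```python
# Toy tube selection: minimise sum of unary + pairwise costs by
# ICM-style coordinate descent (a stand-in for TRW-S).
import numpy as np

def select_tubes(unary, pairwise, n_sweeps=10):
    n_shots = len(unary)
    labels = [int(np.argmin(u)) for u in unary]          # greedy init
    for _ in range(n_sweeps):
        for s in range(n_shots):
            costs = np.array(unary[s], dtype=float)
            for t in range(len(costs)):
                for s2 in range(n_shots):
                    if s2 != s:
                        costs[t] += pairwise[s][s2][t][labels[s2]]
            labels[s] = int(np.argmin(costs))
    return labels   # one selected tube index per shot
```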
Selecting tubes
Over-segmentation
Experiments: tube quality
• 10 object classes that move, from PASCAL: aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, train
• ~500 shots per class from YouTube
• Evaluate on 100–290 manually annotated frames per class (1407 in total)
• Performance = detection rate as defined in Pascal (> 0.5 IoU)
Results:
• Automatic selection picks the best available tube 80% of the time
• Motion segmentation is far from perfect (the best tube covers 35% of objects)
Train detector
Experiments: detection in PASCAL VOC
• Test on the Pascal 2007 test set: 4952 test images; multiple classes per image
• Many variations: scale, viewpoint, illumination, occlusion, intra-class
• Standard Pascal evaluation protocol
• mAP: 15.2 (detector trained from video) vs 31.6 (DPM trained on still images), with about the same number of training instances (500/class)
• Only about half the mAP! Big gap!
Experiments: detection in PASCAL VOC
[Chart comparing training sources: Image GT; Video (ICCV'13 seg); Image GT + Video; induced by VOC transfer; still images from PASCAL VOC 2007]
Summary and discussion
• Video: significantly more training data
• Motion as an additional cue
• Improve extraction of spatio-temporal tubes
• Domain shift factors need to be investigated
• Construction of more complete models