Weakly-supervised learning
Cordelia Schmid
Weakly supervised learning – motivation
• Massive and ever-growing amount of digital image and video content
  – Flickr and YouTube
  – Audiovisual archives (BBC, INA)
  – Personal collections
• Comes with meta-data: text, audio, user click data, …
• Meta-data is sparse and noisy
Object detection
Weakly-supervised learning for images
• Given a set of images with positive and negative labels, determine the object region and learn a detector
• Avoids costly annotation of object regions
Our approach – descriptors
• Extract selective search regions [Uijlings et al., ICCV'13]
• Regions described with high-dimensional Fisher vectors or CNNs
• Images labeled as positive or negative
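This setup can be pictured as turning each weakly labeled image into a "bag" of candidate regions: only the image-level label is known, not which region contains the object. Below is a minimal sketch of that structure; `make_bag`, the random proposals, and the random descriptors are illustrative stand-ins (the actual pipeline uses selective search proposals and Fisher-vector or CNN features), not the authors' code.

```python
import numpy as np

def make_bag(image, label, n_proposals=200, feat_dim=512, rng=None):
    """Illustrative stand-in for proposal extraction and description.

    Returns a "bag": candidate boxes, one descriptor per box, and the
    image-level label only (no region-level annotation).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = image.shape[:2]
    # random boxes stand in for selective search proposals
    x1 = rng.integers(0, w - 1, n_proposals)
    y1 = rng.integers(0, h - 1, n_proposals)
    boxes = np.stack([x1, y1,
                      rng.integers(x1 + 1, w),
                      rng.integers(y1 + 1, h)], axis=1)
    # random vectors stand in for Fisher-vector / CNN descriptors
    feats = rng.standard_normal((n_proposals, feat_dim))
    return {"boxes": boxes, "features": feats, "label": label}
```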
Multi-fold MIL
[Cinbis, Verbeek & Schmid, Multi-fold MIL for weakly supervised object localization, CVPR'14]
Multi-fold MIL
Avoids re-localization bias, since the windows used for training and evaluation are different
Our approach: multi-fold training for MIL
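A minimal sketch of the multi-fold training loop, assuming bags of region descriptors as above and a linear SVM from scikit-learn as the detector. The fold count, iteration count, and the choice of region 0 as initialization are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.svm import LinearSVC

def multifold_mil(pos_feats, neg_feats, n_folds=10, n_iters=5):
    """Sketch of multi-fold MIL training.

    pos_feats: list of (N_i, D) arrays, candidate-region descriptors of
    positive images; neg_feats: (M, D) descriptors from negative images.
    """
    folds = np.arange(len(pos_feats)) % n_folds
    selected = [0] * len(pos_feats)   # illustrative init (e.g. a large window)
    for _ in range(n_iters):
        for k in range(n_folds):
            # train a detector on all folds except k ...
            train = [i for i in range(len(pos_feats)) if folds[i] != k]
            if not train:
                continue
            X = np.vstack([pos_feats[i][selected[i]] for i in train] + [neg_feats])
            y = np.array([1] * len(train) + [0] * len(neg_feats))
            clf = LinearSVC(C=1.0).fit(X, y)
            # ... and re-localize the held-out fold k with that detector,
            # so windows are never scored by a detector trained on them
            for i in np.where(folds == k)[0]:
                selected[i] = int(np.argmax(clf.decision_function(pos_feats[i])))
    # final detector trained on all selected windows
    X = np.vstack([f[s] for f, s in zip(pos_feats, selected)] + [neg_feats])
    y = np.array([1] * len(pos_feats) + [0] * len(neg_feats))
    return LinearSVC(C=1.0).fit(X, y), selected
```

The held-out re-localization is the point of the method: in standard MIL the detector tends to re-select exactly the windows it was trained on, which the fold split prevents.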
Refinement of selected boxes
Window refinement by local search to align windows with contours [Zitnick & Dollár, Edge boxes: locating object proposals from edges, ECCV'14]
Refinement of selected boxes
• Locally refine the top 10 scoring boxes with the “edgebox” score
• “Edgebox” score: encourages alignment with long contours, discourages contours straddling the window
• Final score: “edgebox” score + “selection” score
[Cinbis, Verbeek & Schmid, WS Object Localization with Multi-fold MIL, arXiv’15]
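A sketch of what such a refinement step could look like: a greedy coordinate-wise local search maximizing the combined score. The perturbation scheme and step size are assumptions for illustration; `edgebox_score` and `selection_score` are placeholders for the actual scoring functions.

```python
import numpy as np

def refine_box(box, edgebox_score, selection_score, steps=20, delta=4):
    """Greedy local refinement of one box (x1, y1, x2, y2).

    edgebox_score / selection_score: callables mapping a box to a scalar;
    their sum is the final score being maximized.
    """
    def total(b):
        return edgebox_score(b) + selection_score(b)

    best = np.asarray(box, dtype=float)
    best_s = total(best)
    for _ in range(steps):
        improved = False
        for i in range(4):                      # nudge each coordinate
            for d in (-delta, delta):
                cand = best.copy()
                cand[i] += d
                if cand[2] > cand[0] and cand[3] > cand[1]:  # keep box valid
                    s = total(cand)
                    if s > best_s:
                        best, best_s, improved = cand, s, True
        if not improved:                        # local maximum reached
            break
    return best, best_s
```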
Comparison to the state of the art
Comparison to the state of the art
• State-of-the-art results for WS localization
• Further improve “initial” and “selected” windows
• Update the CNN features (fine-tuning)
• Dealing with noisy or missing image labels (e.g., Google image download)
Overview
Supervision spectrum:
• Positives + bounding boxes: object detection (Leibe et al.'08; Felzenszwalb et al.'10; Girshick et al.'14)
• Positives only: weakly supervised localization (Chum'07; Pandey'11; Deselaers'12; Siva'12; Shi'13; Cinbis'14; Wang'14)
• Positives only: co-segmentation/localization (Rother'06; Russell'06; Joulin'10; Kim'11; Vicente'11; Joulin'14; Tang'14)
• None: unsupervised discovery (Grauman & Darrell'05; Sivic et al.'05,'08; Kim et al.'05,'09)
Correspondence-based methods (Russell et al.'06; Cho et al.'10; Rubinstein & Joulin'13; Rubio et al.'13)
Our approach
• Correspondences
  – as a substitute for supervision
  – between parts and objects
  – picked from bottom-up segmentation proposals
  – and k-nearest-neighbor images
• How?
  – Probabilistic Hough matching
  – Stand-out scoring of part hierarchies
[Cho, Kwak, Schmid & Ponce, Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals, CVPR'15]
Finding parts and objects among region candidates
Here: bottom-up segmentation proposals (Manen et al.’13, Uijlings et al.’13) and HOG descriptors (Dalal & Triggs’05)
Matching model – Probabilistic Hough matching
• Probabilistic model, with match m, data d, and configuration c:
  P(m | d) = Σ_c P(m | c, d) P(c | d)
           = P(m_a | d) Σ_c P(m_g | c) P(c | d)
  (appearance term P(m_a | d), geometry term P(m_g | c))
• Probabilistic Hough transform:
  P(c | d) ≈ H(c | d) = Σ_m P(m | c, d) = Σ_m P(m_g | c) P(m_a | d)
  (Hough'59; Ballard'81; Stephens'91; Leibe et al.'04; Maji & Malik'09; Barinova et al.'12)
• Region confidence, for a region r' matched against regions r'':
  C(r' | [d', d'']) = max_{r''} P(r' ↔ r'' | [d', d''])
• Extends from two images to multiple images
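A heavily simplified sketch of PHM between two images. Geometry is reduced to a quantized 2-D translation offset with a hard voting kernel (the actual method also handles scale and uses smoother kernels), appearance likelihood is approximated by cosine similarity of descriptors, and box centers are assumed normalized to [0, 1].

```python
import numpy as np

def phm(desc1, centers1, desc2, centers2, n_bins=16):
    """Toy Probabilistic Hough Matching between two region sets."""
    # appearance likelihood p(m_a | d): cosine similarity of descriptors
    a = desc1 / (np.linalg.norm(desc1, axis=1, keepdims=True) + 1e-8)
    b = desc2 / (np.linalg.norm(desc2, axis=1, keepdims=True) + 1e-8)
    app = np.clip(a @ b.T, 0.0, None)                   # (N1, N2)

    # configuration c of a match: quantized center offset, values in [-1, 1]
    off = centers2[None, :, :] - centers1[:, None, :]   # (N1, N2, 2)
    bins = np.clip(((off + 1.0) / 2.0 * n_bins).astype(int), 0, n_bins - 1)
    cell = bins[..., 0] * n_bins + bins[..., 1]         # (N1, N2) bin index

    # Hough score h(c | d) = sum_m p(m_g | c) p(m_a | d), hard geometry kernel
    hough = np.zeros(n_bins * n_bins)
    np.add.at(hough, cell.ravel(), app.ravel())

    # match confidence p(m | d) = p(m_a | d) * h(c(m) | d)
    match = app * hough[cell]

    # region confidence C(r' | [d', d'']) = max over r'' of the match score
    return match, match.max(axis=1)
```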
Stand-out scoring of part hierarchies
• Object regions should contain
  – more foreground than part regions
  – less background than larger regions
• Stand-out score: S(r) = C(r) − max_{r' ⊃ r} C(r')
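A small sketch of the stand-out score over a set of boxes, assuming axis-aligned boxes and plain geometric inclusion as the (simplified) containment test:

```python
def standout_scores(boxes, conf):
    """Stand-out score S(r) = C(r) - max_{r' containing r} C(r').

    boxes: list of (x1, y1, x2, y2); conf[i]: region confidence C(r_i),
    e.g. from Hough matching. A region scores high when it is confident
    while every larger region containing it is not: more foreground than
    its parts, less background than its containers.
    """
    def contains(outer, inner):
        return (outer[0] <= inner[0] and outer[1] <= inner[1] and
                outer[2] >= inner[2] and outer[3] >= inner[3])

    scores = []
    for i, r in enumerate(boxes):
        parents = [conf[j] for j, r2 in enumerate(boxes)
                   if j != i and contains(r2, r)]
        scores.append(conf[i] - (max(parents) if parents else 0.0))
    return scores
```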
An iterative algorithm – iteration 1
• Retrieve 10 nearest neighbors with GIST (Oliva & Torralba'06)
• Probabilistic Hough matching with the 10 NNs
• For all images: localize the top 5 scoring windows with the stand-out score

An iterative algorithm – next iterations
• Exploit the selected regions: retrieve 10 NNs using PHM with the top-confidence regions
• Probabilistic Hough matching with the 10 NNs
• Localize the top 5 scoring windows per image with the stand-out score
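A self-contained toy version of the loop. To keep it runnable, PHM and the stand-out score are replaced by plain cosine-similarity stand-ins and GIST by a mean region descriptor; only the iterative structure (retrieve neighbors, match, select top regions, re-retrieve with the selections) mirrors the method.

```python
import numpy as np

def iterative_discovery(region_feats, n_iters=3, k=10, top=5):
    """region_feats: list of (N_i, D) region-descriptor arrays, one per image."""
    def norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    feats = [norm(f) for f in region_feats]
    sel = [np.arange(len(f)) for f in feats]     # start from all regions
    glob = norm(np.stack([f.mean(axis=0) for f in feats]))  # GIST stand-in

    for it in range(n_iters):
        # neighbor retrieval: global features first, selected regions later
        desc = glob if it == 0 else norm(
            np.stack([f[s].mean(axis=0) for f, s in zip(feats, sel)]))
        sim = desc @ desc.T
        np.fill_diagonal(sim, -np.inf)
        nn = np.argsort(-sim, axis=1)[:, :k]

        # region confidence: best similarity to any neighbor's selected region
        new_sel = []
        for i, f in enumerate(feats):
            neigh = np.concatenate([feats[j][sel[j]] for j in nn[i]])
            conf = (f @ neigh.T).max(axis=1)
            new_sel.append(np.argsort(-conf)[:top])  # keep top-scoring regions
        sel = new_sel
    return sel
```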
Localization improvement over iterations
Comparative evaluation
Two benchmarks:
• Object discovery dataset (Rubinstein et al.'13)
  – Subset of 300 images from 3 classes
  – Includes between 7 and 18 outliers per class
• Pascal'07 – all (Everingham et al.'07)
  – 4548 images from 20 classes
  – From the train/val set, minus difficult/truncated images

Computing time: < 1h for 500 images on a 10-core desktop

Performance metrics:
• CorLoc: percentage of boxes with intersection-over-union > 0.5
• CorRet: percentage of the 10 retrieved NNs in the same class as the query
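The CorLoc criterion is easy to state in code. A minimal sketch, assuming one predicted box per image and the Pascal IoU > 0.5 criterion:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def corloc(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of images whose predicted box matches some ground-truth
    box with IoU above the threshold."""
    hits = sum(any(iou(p, g) > thresh for g in gts)
               for p, gts in zip(pred_boxes, gt_boxes))
    return hits / len(pred_boxes)
```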
Experimental results: Object discovery dataset
CorLoc – separate classes
CorLoc / CorRet – mixed classes without labels
Examples – mixed classes without labels
Experimental results: Pascal’07 – all
CorLoc – separate classes
Experimental results: Pascal’07 – all
Examples – mixed classes without labels: successes
Experimental results: Pascal’07 – all
Examples – mixed classes without labels: failures
Discussion and future work
• Effective method for object discovery and localization in challenging unlabeled scenarios
• No use of saliency or objectness measures
• No use of negative examples or pretrained features
• Next:
  – Image categorization and object detection
  – Handling multiple objects per image
Overview
• Easier to separate an object from the background → reduces the need for bounding-box annotation
• A video shows a range of variations of an object → easier to learn multi-view, articulation, illumination
• Many frames, easy to access (e.g., YouTube) → lots of extra data!
Automatic extraction of objects
[Prest, Leistner, Civera, Schmid & Ferrari, Learning object detectors from weakly annotated video, CVPR’12]
Automatic extraction of objects
Data collection
1. Pick 10 moving classes from PASCAL VOC
2. Collect 9–24 videos per class from YouTube (~500 shots per class)
3. Shot change detection → chunks videos into shots
• Total: 0.57 million frames
• Video-level label only (i.e., keep some shots without the class)
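Step 3 can be illustrated with the classic color-histogram approach to shot-change detection; the histogram size and threshold below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def shot_boundaries(frames, bins=16, thresh=0.4):
    """Naive shot-change detection via color-histogram differences.

    frames: iterable of HxWx3 uint8 arrays. A boundary is declared when
    consecutive normalized histograms differ by more than `thresh` in L1.
    """
    prev, cuts = None, []
    for i, f in enumerate(frames):
        h, _ = np.histogramdd(f.reshape(-1, 3), bins=(bins,) * 3,
                              range=((0, 256),) * 3)
        h = h.ravel() / h.sum()
        if prev is not None and np.abs(h - prev).sum() > thresh:
            cuts.append(i)          # first frame of a new shot
        prev = h
    return cuts
```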
Step 1
[N. Sundaram et al., Dense point trajectories by GPU-accelerated large displacement optical flow, ECCV 2010]
Candidate tubes
Motion segmentation
Selecting tubes
• Jointly select one tube per shot by energy minimization
• Nodes: shots; node states: candidate tubes
• Unary potential: homogeneity within a tube, location prior
• Pairwise potential: similarity between tubes from different shots, based on appearance descriptors (BOW, HOG) extracted for a fixed number of frames per tube
• Find the states minimizing the sum of potentials; inference with TRW-S [V. Kolmogorov, Convergent tree-reweighted message passing for energy minimization, PAMI'06]
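A toy version of the energy being minimized. Exhaustive search stands in for TRW-S so the sketch stays self-contained; it is only feasible for a handful of shots.

```python
import itertools
import numpy as np

def select_tubes(unary, pairwise):
    """Toy tube selection by exhaustive energy minimization.

    unary:    list of 1-D arrays; unary[s][t] = cost of tube t in shot s
              (e.g. from tube homogeneity and a location prior)
    pairwise: dict mapping (s1, s2) to a cost matrix between their tubes
              (e.g. appearance dissimilarity of BOW/HOG descriptors)
    Returns the per-shot tube indices minimizing the summed potentials.
    """
    best, best_e = None, np.inf
    for states in itertools.product(*[range(len(u)) for u in unary]):
        e = sum(u[t] for u, t in zip(unary, states))
        e += sum(c[states[s1], states[s2]] for (s1, s2), c in pairwise.items())
        if e < best_e:
            best, best_e = states, e
    return best, best_e
```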
Selecting tubes
Over-segmentation
Experiments: tube quality
10 moving object classes from PASCAL
• aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, train
• ~500 shots per class from YouTube
Evaluate on 100–290 manually annotated frames per class (1,407 in total)
• Performance = detection rate as defined in Pascal (> 0.5 IoU)
[Bar chart, detection rate 0–80%: selected tubes reach 28.5%]
• Automatic selection picks the best available tube 80% of the time
• Motion segmentation is far from perfect (the best tube covers 35% of objects)
Train detector
Experiments: detection in PASCAL VOC
Test on the Pascal 2007 test set
• 4952 test images; multiple classes per image
• Many variations: scale, viewpoint, illumination, occlusion, intra-class
Standard Pascal evaluation protocol
[Bar chart, mAP 0–50: video-trained detector 15.2 vs. DPM 31.6]
• About the same number of training instances (500/class)
• Only about half the mAP! Big gap!
Experiments: detection in PASCAL VOC
[Bar chart, mAP 0–50, detection on still images from PASCAL VOC 2007: training on Image GT, on Video (ICCV'13 segmentation), and on Image GT + Video (transfer)]
Summary and discussion
• Video: significantly more training data
• Motion as an additional cue
• Improve extraction of spatio-temporal tubes
• Domain shift factors need to be investigated
• Construction of more complete models