Page 1: weakly supervised gatsby oxford

Weakly-supervised learning

Cordelia Schmid

Page 2: weakly supervised gatsby oxford

Weakly-supervised learning - motivation

Massive and ever-growing amount of digital image and video content
- Flickr and YouTube
- Audiovisual archives (BBC, INA)
- Personal collections

Comes with meta-data
- Text, audio, user click data, …

Meta-data is a sparse and noisy, yet rich and diverse source of annotation

Page 3: weakly supervised gatsby oxford

Weakly-supervised learning - motivation

Object detection

Weakly supervised large-scale learning

Action recognition

Page 4: weakly supervised gatsby oxford

Overview

• Multi-fold MIL for weakly-supervised learning from images

• Unsupervised learning from images based on matching

• Weakly-supervised learning from videos with motion segmentation

Page 5: weakly supervised gatsby oxford

Weakly-supervised learning for images

• Given a set of images with positive and negative labels, determine the object region and learn a detector

• Avoids costly annotation of object regions

Page 6: weakly supervised gatsby oxford


Page 7: weakly supervised gatsby oxford


Page 8: weakly supervised gatsby oxford

Our approach – descriptors

• Extract selective search regions [Uijlings et al., ICCV’13]
• Regions described with high-dimensional Fisher vectors or CNNs
• Images labeled as positive or negative

Page 9: weakly supervised gatsby oxford

Standard MIL (Fisher vectors)

Page 10: weakly supervised gatsby oxford

Multi-fold MIL

[Cinbis, Verbeek & Schmid, Multi-fold MIL for WS object localization, CVPR’14]

Page 11: weakly supervised gatsby oxford

Multi-fold MIL

Avoids re-localization bias, since the windows used for training and those used for evaluation come from different images
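The multi-fold idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the detector from the talk: each window is a plain feature list, a difference-of-class-means linear scorer stands in for the real detector, and the initialization (first window of every positive image) as well as the names `train_detector` and `multifold_mil` are hypothetical. The point it shows is that the windows of a positive image are re-scored only by a detector trained on the other folds.

```python
import random


def mean_vec(rows):
    """Component-wise mean of a list of equal-length feature lists."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train_detector(pos, neg):
    """Stand-in linear detector: difference of class means (an assumption)."""
    return [p - q for p, q in zip(mean_vec(pos), mean_vec(neg))]


def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def multifold_mil(pos_bags, neg_windows, n_folds=3, n_iters=3, seed=0):
    """pos_bags[i]: candidate windows (feature lists) of positive image i.
    neg_windows: windows taken from negative images.
    Requires n_folds >= 2. Returns (selected window index per image, detector)."""
    rng = random.Random(seed)
    selected = [0] * len(pos_bags)  # assumed initialization
    for _ in range(n_iters):
        order = list(range(len(pos_bags)))
        rng.shuffle(order)
        folds = [order[k::n_folds] for k in range(n_folds)]
        new_selected = list(selected)
        for fold in folds:
            held_out = set(fold)
            # Train only on the windows currently selected in the OTHER folds.
            train_pos = [pos_bags[i][selected[i]] for i in order if i not in held_out]
            w = train_detector(train_pos, neg_windows)
            # Re-localize only in the held-out images.
            for i in fold:
                scores = [score(w, x) for x in pos_bags[i]]
                new_selected[i] = scores.index(max(scores))
        selected = new_selected
    final_pos = [bag[s] for bag, s in zip(pos_bags, selected)]
    return selected, train_detector(final_pos, neg_windows)
```

In standard MIL the same detector is trained on and then re-applied to the same positive images, so it tends to re-select its own training windows; splitting into folds removes that shortcut.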

Page 12: weakly supervised gatsby oxford

Comparing standard and multi-fold

Page 13: weakly supervised gatsby oxford

Performance over iterations (Fisher Vectors)

Page 14: weakly supervised gatsby oxford

Our approach: multi-fold training for MIL

Page 15: weakly supervised gatsby oxford

Localization examples

Page 16: weakly supervised gatsby oxford

Failure cases

Page 17: weakly supervised gatsby oxford

Refinement of selected boxes

Window refinement by local search to align windows with contours [Zitnick & Dollár, Edge boxes: locating object proposals from edges, ECCV’14]

Page 18: weakly supervised gatsby oxford

Refinement of selected boxes

• Locally refine the top 10 scoring boxes with “edgebox” score
• “Edgebox” score: encourages alignment with long contours, discourages contours straddling the window
• Final score: “edgebox” score + “selection” score

[Cinbis, Verbeek & Schmid, WS Object Localization with Multi-fold MIL, arXiv’15]

Page 19: weakly supervised gatsby oxford

Comparison to the state-of-the-art

Page 20: weakly supervised gatsby oxford

Comparison to the state-of-the-art

Page 21: weakly supervised gatsby oxford

Summary and future work

• State-of-the-art results for WS localization

• Further improve “initial” and “selected” windows

• Update the CNN features (fine-tuning)

• Dealing with noisy or missing image labels (e.g. Google image download)

Page 22: weakly supervised gatsby oxford

Overview

• Multi-fold MIL for weakly-supervised learning from images

• Unsupervised learning from images based on matching

• Weakly-supervised learning from videos with motion segmentation

Page 23: weakly supervised gatsby oxford

How much supervision for localization?

• Strong: positives + BB + negatives
• Weak: positives + negatives
• Very weak: positives
• None

• Object detection (Leibe et al.’08; Felzenszwalb et al.’10; Girshick et al.’14)
• Weakly supervised localization (Chum’07; Pandey’11; Deselaers’12; Siva’12; Shi’13; Cinbis’14; Wang’14)
• Co-segmentation/localization (Rother’06; Russell’06; Joulin’10; Kim’11; Vicente’11; Joulin’14; Tang’14)
• Unsupervised discovery (Grauman & Darrell’05; Sivic et al.’05,08; Kim et al.’05,09)

Page 24: weakly supervised gatsby oxford

Supervision

Page 25: weakly supervised gatsby oxford

Correspondence

(Russell et al.’06; Cho et al.’10; Rubinstein & Joulin’13; Rubio et al.’13)

Page 26: weakly supervised gatsby oxford

Our approach

• Correspondences
  • as a substitute for supervision
  • between parts and objects
  • picked from bottom-up segmentation proposals
  • and k-nearest-neighbor images

• How?
  • Probabilistic Hough matching
  • Stand-out scoring of part hierarchies

[Cho, Kwak, Schmid & Ponce, Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals, CVPR’15]

Page 27: weakly supervised gatsby oxford

Finding parts and objects among region candidates

Here: bottom-up segmentation proposals (Manen et al.’13, Uijlings et al.’13) and HOG descriptors (Dalal & Triggs’05)

Page 28: weakly supervised gatsby oxford

Matching model – Probabilistic Hough matching

(m: match, d: data, c: configuration)

$P(m \mid d) = \sum_c P(m \mid c, d)\, P(c \mid d)$

Page 29: weakly supervised gatsby oxford

Matching model – Probabilistic Hough matching

(m: match, d: data, c: configuration)

$P(m \mid d) = \sum_c P(m \mid c, d)\, P(c \mid d)$

$= P(m_a \mid d) \sum_c P(m_g \mid c)\, P(c \mid d)$

($m_a$: appearance, $m_g$: geometry)

Page 30: weakly supervised gatsby oxford

Matching model – Probabilistic Hough matching

• Probabilistic model

$P(m \mid d) = \sum_c P(m \mid c, d)\, P(c \mid d)$

$= P(m_a \mid d) \sum_c P(m_g \mid c)\, P(c \mid d)$

Page 31: weakly supervised gatsby oxford

Matching model – Probabilistic Hough matching

• Probabilistic model

$P(m \mid d) = \sum_c P(m \mid c, d)\, P(c \mid d)$

$= P(m_a \mid d) \sum_c P(m_g \mid c)\, P(c \mid d)$

• Probabilistic Hough transform

$P(c \mid d) \approx H(c \mid d) = \sum_m P(m \mid c, d)$

$= \sum_m P(m_g \mid c)\, P(m_a \mid d)$

(Hough’59; Ballard’81; Stephens’91; Leibe et al.’04; Maji & Malik’09; Barinova et al.’12)

Page 32: weakly supervised gatsby oxford

Matching model – Probabilistic Hough matching

• Probabilistic model

$P(m \mid d) = \sum_c P(m \mid c, d)\, P(c \mid d) = P(m_a \mid d) \sum_c P(m_g \mid c)\, P(c \mid d)$

• Probabilistic Hough transform

$P(c \mid d) \approx H(c \mid d) = \sum_m P(m \mid c, d) = \sum_m P(m_g \mid c)\, P(m_a \mid d)$

• Region confidence

$C(r' \mid [d', d'']) = \max_{r''} P(r' \to r'' \mid [d', d''])$
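The voting scheme can be illustrated with a toy sketch. It rests on several assumptions that do not come from the talk: region positions are 1-D integers, the appearance term is an exponential of the squared descriptor distance, and the geometry term is an indicator (a match supports configuration c only if its offset equals c). Scores are left unnormalized, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def appearance(f1, f2):
    """Assumed appearance term: exp(-squared distance) between descriptors."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(f1, f2)))


def phm_region_confidence(regions1, regions2):
    """regions: non-empty list of (position, descriptor) pairs, position an int.
    Returns, for each region r' of image 1, the best match score over all
    regions r'' of image 2 (the region confidence, unnormalized)."""
    # Hough voting: H(c) = sum_m P(m_g | c) P(m_a | d), where P(m_g | c) is 1
    # if the offset of match m equals c and 0 otherwise (toy assumption).
    votes = defaultdict(float)
    app = {}
    for i, (p1, f1) in enumerate(regions1):
        for j, (p2, f2) in enumerate(regions2):
            app[i, j] = appearance(f1, f2)
            votes[p2 - p1] += app[i, j]
    # Match score: P(m_a | d) * sum_c P(m_g | c) H(c | d), which under the
    # indicator assumption reduces to appearance * H(offset of the match).
    conf = []
    for i, (p1, _) in enumerate(regions1):
        conf.append(max(app[i, j] * votes[p2 - p1]
                        for j, (p2, _) in enumerate(regions2)))
    return conf
```

Two matches that agree on the same offset reinforce each other through the shared vote, which is what separates this score from an appearance-only score.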

Page 33: weakly supervised gatsby oxford
Page 34: weakly supervised gatsby oxford
Page 35: weakly supervised gatsby oxford

Matching model – Probabilistic Hough matching

• Probabilistic model

$P(m \mid d) = \sum_c P(m \mid c, d)\, P(c \mid d) = P(m_a \mid d) \sum_c P(m_g \mid c)\, P(c \mid d)$

• Probabilistic Hough transform

$P(c \mid d) \approx H(c \mid d) = \sum_m P(m \mid c, d) = \sum_m P(m_g \mid c)\, P(m_a \mid d)$

• Two images -> multiple images

$C_{d'}(r') = \sum_{d''} C(r' \mid [d', d''])$

Page 36: weakly supervised gatsby oxford

multiple

two

Page 37: weakly supervised gatsby oxford

Appearance only

Page 38: weakly supervised gatsby oxford

PHM

Page 39: weakly supervised gatsby oxford

Stand-out scoring of part hierarchies

• Object regions should contain
  • more foreground than part regions
  • less background than larger regions

Page 40: weakly supervised gatsby oxford

Stand-out scoring of part hierarchies

• Object regions should contain
  • more foreground than part regions
  • less background than larger regions

• $S(r) = C(r) - \max_{r' \supsetneq r} C(r')$
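A small sketch of the stand-out score follows. It assumes regions are axis-aligned boxes given as (x1, y1, x2, y2) tuples and that nothing is subtracted when no larger region contains the box; the slide does not specify that case. The function names are illustrative only.

```python
def contains(outer, inner):
    """True if box `outer` contains box `inner` and the two boxes differ."""
    return (outer != inner
            and outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def standout_scores(boxes, conf):
    """S(r) = C(r) - max over larger regions r' containing r of C(r')."""
    scores = []
    for r, c in zip(boxes, conf):
        larger = [conf[k] for k, q in enumerate(boxes) if contains(q, r)]
        # Assumption: a region with no containing region keeps its confidence.
        scores.append(c - max(larger) if larger else c)
    return scores
```

With a low-confidence whole-image box, a confident object box, and an equally confident part inside it, only the object box keeps a high score: the part is penalized by the object that contains it, and the object is barely penalized by the background-heavy box around it.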

Page 41: weakly supervised gatsby oxford

An iterative algorithm – iteration 1

Retrieve 10 NN with GIST (Oliva & Torralba’06)

Page 42: weakly supervised gatsby oxford

An iterative algorithm – iteration 1

Probabilistic Hough Matching with 10 NN

Page 43: weakly supervised gatsby oxford

An iterative algorithm – iteration 1

Localize top 5 scoring windows with stand-out score

Page 44: weakly supervised gatsby oxford

An iterative algorithm – iteration 1

For all images: localize the 5 top-scoring windows

Page 45: weakly supervised gatsby oxford

An iterative algorithm – next iterations

Exploit selected regions: retrieve 10 NN using PHM with top-confidence regions

Page 46: weakly supervised gatsby oxford

An iterative algorithm – next iterations

Probabilistic Hough Matching with 10 NN

Page 47: weakly supervised gatsby oxford

An iterative algorithm – next iterations

Localize 5 top scoring windows per image with stand-out score

Page 48: weakly supervised gatsby oxford

Localization improvement over iterations

(Figure panels labeled 1 and 3)

Page 49: weakly supervised gatsby oxford

Retrieval improvement over iterations

1st iteration

5th iteration

Page 50: weakly supervised gatsby oxford

Retrieval improvement over iterations

1st iteration

5th iteration

Page 51: weakly supervised gatsby oxford

Comparative evaluation

Two benchmarks:
• Object discovery dataset (Rubinstein et al.’13)
  • Subset of 300 images from 3 classes
  • Includes between 7 and 18 outliers per class
• Pascal’07 – all (Everingham et al.’07)
  • 4548 images from 20 classes
  • From train/val set, minus difficult/truncated images

Computing time: < 1h for 500 images on 10-core desktop

Performance metrics:
• CorLoc: percentage of boxes such that intersection/union > 0.5
• CorRet: percentage of retrieved 10 NNs in the same class as query
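The two metrics can be written down directly. The sketch below assumes boxes are (x1, y1, x2, y2) tuples, one predicted box per image, and a list of ground-truth boxes per image; the function names `iou`, `corloc`, and `corret` are illustrative and not taken from any evaluation toolkit.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def corloc(pred_boxes, gt_boxes, thresh=0.5):
    """Percentage of images whose predicted box overlaps some ground-truth
    box with intersection/union above `thresh`."""
    hits = sum(any(iou(p, g) > thresh for g in gts)
               for p, gts in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(pred_boxes)


def corret(query_labels, neighbor_labels):
    """Percentage of retrieved neighbors sharing the class of their query."""
    same = sum(lab == q
               for q, nbrs in zip(query_labels, neighbor_labels)
               for lab in nbrs)
    total = sum(len(nbrs) for nbrs in neighbor_labels)
    return 100.0 * same / total
```

Class labels are used here only to evaluate retrieval; the method itself runs without them in the mixed-class setting.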

Page 52: weakly supervised gatsby oxford

Experimental results: Object discovery dataset

CorLoc – separate classes

CorLoc / CorRet – mixed classes without labels

Examples – mixed classes without labels

Page 53: weakly supervised gatsby oxford

Experimental results: Pascal’07 – all

CorLoc – separate classes

CorLoc and CorRet – mixed classes without labels

Page 54: weakly supervised gatsby oxford

Experimental results: Pascal’07 – all

Examples – mixed classes without labels: successes

Page 55: weakly supervised gatsby oxford

Experimental results: Pascal’07 – all

Examples – mixed classes without labels: failures

Page 56: weakly supervised gatsby oxford

Discussion and future work

• Effective method for object discovery and localization in challenging unlabeled scenarios
• No use of saliency or objectness measures
• No use of negative examples or pretrained features

• Next:
  • Image categorization and object detection
  • Handling multiple objects per image

Page 57: weakly supervised gatsby oxford

Overview

• Multi-fold MIL for weakly-supervised learning from images

• Unsupervised learning from images based on matching

• Weakly-supervised learning from videos with motion segmentation

Page 58: weakly supervised gatsby oxford

Learning from videos

• easier to separate object from background → reduce need for bounding-box annotation

• a video shows a range of variations for an object → easier to learn multi-view, articulation, illumination

• many frames and easy to access (e.g. YouTube) → lots of extra data!

Page 59: weakly supervised gatsby oxford

Automatic extraction of objects

??

[Prest, Leistner, Civera, Schmid & Ferrari, Learning object detectors from weakly annotated video, CVPR’12]

Page 60: weakly supervised gatsby oxford

Automatic extraction of objects

Page 61: weakly supervised gatsby oxford

Data collection

1. pick 10 moving classes from PASCAL VOC
2. collect 9-24 videos per class from YouTube (~500 shots per class)
3. shot change detection → chunks videos into shots

• total 0.57 million frames
• video-level label only (i.e. keep some shots without the class)

Page 62: weakly supervised gatsby oxford

Step 1

Localize object tubes

Page 63: weakly supervised gatsby oxford

Candidate tubes

dense point tracks

[N. Sundaram et al., Dense point trajectories by GPU-accelerated large displacement optical flow, ECCV 2010]

Page 64: weakly supervised gatsby oxford

Candidate tubes

motion segmentation

Page 65: weakly supervised gatsby oxford

Candidate tubes

motion segmentation

Page 66: weakly supervised gatsby oxford

Selecting tubes

• jointly select one tube per shot by energy minimization

(Figure labels: node, states)

Page 67: weakly supervised gatsby oxford

Selecting tubes

(Figure labels: shot, candidate tubes)

Unary potential: homogeneity within a tube, location prior

Pairwise potential: similarity between tubes from different shots based on appearance descriptors (BOW, HOG) extracted for a fixed number of frames per tube

Find states minimizing sum of potentials; inference with TRW-S [V. Kolmogorov, Convergent tree-reweighted message passing for energy minimization, PAMI’06]
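The joint selection can be illustrated on a toy problem. This sketch assumes each shot is a node whose states are its candidate tubes, with caller-supplied unary costs and a caller-supplied pairwise cost function (both hypothetical). Exhaustive search stands in for TRW-S here; it is only feasible for a handful of shots and is not the inference method named on the slide.

```python
from itertools import product


def select_tubes(unary, pairwise):
    """unary[s][k]: cost of choosing tube k in shot s.
    pairwise(s, k, t, l): cost of jointly choosing tube k in shot s and
    tube l in shot t (for example, a descriptor distance).
    Returns (chosen tube index per shot, total energy)."""
    n = len(unary)
    best_states, best_energy = None, float("inf")
    # Enumerate every assignment of one tube per shot (brute force).
    for states in product(*(range(len(u)) for u in unary)):
        energy = sum(unary[s][states[s]] for s in range(n))
        energy += sum(pairwise(s, states[s], t, states[t])
                      for s in range(n) for t in range(s + 1, n))
        if energy < best_energy:
            best_states, best_energy = states, energy
    return list(best_states), best_energy
```

The pairwise term is what makes the selection joint: a tube that looks slightly worse on its own can still be chosen if it is consistent in appearance with the tubes picked in the other shots.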

Page 68: weakly supervised gatsby oxford

Selecting tubes

Motion Segments | Candidate Tubes | Automatically Selected Tube

Page 69: weakly supervised gatsby oxford

Over-segmentation

Wrong tube selection

Heavy occlusion and lack of motion

Page 70: weakly supervised gatsby oxford

Experiments: tube quality

10 object classes that move from PASCAL
• aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, train
• ~500 shots per class from YouTube

Evaluate on 100-290 frames/class manually annotated (total 1407)
• Performance = detection-rate as defined in Pascal (>0.5 IoU)

Detection rate (%):
• Best segment: 34.8
• Automatically selected: 28.5

+ auto selection picks best available tube 80% of the time

- motion segmentation far from perfect (best tube covers 35% objects)

Page 71: weakly supervised gatsby oxford

Train detector

Localize object tubes

Page 72: weakly supervised gatsby oxford

Experiments: detection in PASCAL VOC

Test on Pascal 2007 test set
• 4952 test images; multiple classes per image
• Many variations: scale, viewpoint, illumination, occlusion, intra-class

Standard Pascal evaluation protocol

DPM object detector [Felzenszwalb 2010]

Page 73: weakly supervised gatsby oxford

Experiments: detection in PASCAL VOC

DPM, mAP (%):
• Image GT: 31.6
• Video CVPR12 seg: 15.2

About same number of training instances (500/class)

• Only about half the mAP! Big gap!

Page 74: weakly supervised gatsby oxford

Experiments: detection in PASCAL VOC

DPM, mAP (%):
• Image GT: 31.6
• Video CVPR12 seg: 15.2
• Image GT + Video induced by VOC: 33.6
• Video ICCV13 seg, Video GT, Induced by VOC: 18.6, 17.1, 20.5 (label-to-value assignment lost in extraction)

• Adding video data to Image GT: +2% mAP

• But only with some domain adaptation, otherwise negative transfer

Page 75: weakly supervised gatsby oxford

Training + testing object detectors

Video | Still images | Combination

Still images from PASCAL VOC 2007

Page 76: weakly supervised gatsby oxford

Summary and discussion

• Video: significantly more training data

• Motion as an additional cue

• Improve extraction of spatio-temporal tubes

• Domain shift factors need to be investigated

• Construction of more complete models

