Weakly supervised learning – motivation
• Massive and ever-growing amount of digital image and video content
  – Flickr and YouTube
  – Audiovisual archives (BBC, INA)
  – Personal collections
• Comes with meta-data: text, audio, user click data, …
• Meta-data is sparse and noisy
Object detection
Weakly supervised learning for images
• Given a set of images with positive and negative labels, determine the object region and learn a detector
• Avoids costly annotation of object regions
Our approach – descriptors
• Extract selective search regions [Uijlings et al., ICCV'13]
• Regions described with high-dimensional Fisher vectors or CNNs
• Images labeled as positive or negative
Multi-fold MIL
[Cinbis, Verbeek & Schmid, Multi-fold MIL for WS object localization, CVPR'14]
Multi-fold MIL
• Avoid re-localization bias, since the windows used for training and evaluation are different
Our approach: multi-fold training for MIL
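To make the multi-fold training concrete, here is a minimal sketch, assuming each positive image is represented as a list of candidate-window feature vectors (index 0 taken as the full-image window) and negatives as a feature matrix; the data layout and the linear SVM are illustrative assumptions, not the authors' code.

```python
# Minimal multi-fold MIL sketch (assumptions: pos_images[i] is a list of
# feature vectors, one per candidate window, with index 0 the full image;
# neg_features is a 2-D array of negative-window features; a linear SVM
# stands in for the detector).
import numpy as np
from sklearn.svm import LinearSVC

def multifold_mil(pos_images, neg_features, n_folds=10, n_iters=5):
    selected = [0] * len(pos_images)            # current window per positive
    folds = np.arange(len(pos_images)) % n_folds
    for _ in range(n_iters):
        for k in range(n_folds):
            # Train on windows currently selected in the *other* folds ...
            train_idx = [i for i in range(len(pos_images)) if folds[i] != k]
            X = np.vstack([pos_images[i][selected[i]] for i in train_idx]
                          + [neg_features])
            y = np.hstack([np.ones(len(train_idx)),
                           -np.ones(len(neg_features))])
            clf = LinearSVC(C=1.0).fit(X, y)
            # ... then re-localize in fold k, so the windows used for training
            # and for re-localization never come from the same images.
            for i in np.where(folds == k)[0]:
                scores = clf.decision_function(np.vstack(pos_images[i]))
                selected[i] = int(np.argmax(scores))
    # Final detector trained on all selected windows.
    X = np.vstack([im[s] for im, s in zip(pos_images, selected)] + [neg_features])
    y = np.hstack([np.ones(len(pos_images)), -np.ones(len(neg_features))])
    return LinearSVC(C=1.0).fit(X, y), selected
```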
Refinement of selected boxes
• Window refinement by local search to align windows with contours
[Zitnick & Dollár, Edge boxes: locating object proposals from edges, ECCV'14]
Refinement of selected boxes
• Locally refine the top 10 scoring boxes with the "edgebox" score
• "Edgebox" score: encourages alignment with long contours, discourages contours straddling the window
• Final score: "edgebox" score + "selection" score (rough sketch below)
[Cinbis, Verbeek & Schmid, WS Object Localization with Multi-fold MIL, arXiv'15]
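A rough sketch of that refinement step, assuming two hypothetical callables: `selection_score(box)` for the detector's selection score and `edgebox_score(box)` wrapping an Edge-Boxes style contour score; the step sizes and the greedy local search are illustrative choices.

```python
# Hypothetical helpers: selection_score(box) and edgebox_score(box),
# with boxes given as (x, y, w, h).
def refine_top_boxes(boxes, selection_score, edgebox_score,
                     top_k=10, steps=(-8, 0, 8)):
    refined = []
    # Refine only the top-k boxes under the selection score.
    for x, y, w, h in sorted(boxes, key=selection_score, reverse=True)[:top_k]:
        best, best_score = None, float('-inf')
        # Local search over small shifts of position and size; keep the
        # window maximising "edgebox" score + "selection" score.
        for dx in steps:
            for dy in steps:
                for dw in steps:
                    for dh in steps:
                        cand = (x + dx, y + dy, max(8, w + dw), max(8, h + dh))
                        s = edgebox_score(cand) + selection_score(cand)
                        if s > best_score:
                            best, best_score = cand, s
        refined.append((best, best_score))
    return refined
```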
Comparison to the state of the art
• State-of-the-art results for WS localization
• Further improve “initial” and “selected” windows
• Update the CNN features (fine-tuning)
• Dealing with noisy or missing image labels (e.g. Google image download)
Overview
Supervision spectrum, from full supervision to none:
• Positives + bounding boxes: object detection (Leibe et al.'08; Felzenszwalb et al.'10; Girshick et al.'14)
• Positive and negative image labels: weakly supervised localization (Chum'07; Pandey'11; Deselaers'12; Siva'12; Shi'13; Cinbis'14; Wang'14)
• Positives only: co-segmentation / co-localization (Rother'06; Russell'06; Joulin'10; Kim'11; Vicente'11; Joulin'14; Tang'14)
• None: unsupervised discovery (Grauman & Darrell'05; Sivic et al.'05,'08; Kim et al.'05,'09)
• Correspondence-based methods (Russell et al.'06; Cho et al.'10; Rubinstein & Joulin'13; Rubio et al.'13)
Our approach
• Correspondences as a substitute for supervision
  – between parts and objects
  – picked from bottom-up segmentation proposals
  – and k-nearest-neighbor images
• How?
  – Probabilistic Hough matching
  – Stand-out scoring of part hierarchies
[Cho, Kwak, Schmid & Ponce, Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals, CVPR'15]
Finding parts and objects among region candidates
Here: bottom-up segmentation proposals (Manen et al.'13, Uijlings et al.'13) and HOG descriptors (Dalal & Triggs'05); a small descriptor sketch follows below.
Matching model – Probabilistic Hough matching
(match m, data d, configuration c)

• Probabilistic model
  P(m | d) = Σ_c P(m | c, d) P(c | d)
           = P(m_a | d) Σ_c P(m_g | c) P(c | d)
  with an appearance term m_a and a geometry term m_g

• Probabilistic Hough transform
  P(c | d) ≈ H(c | d) = Σ_m P(m | c, d) = Σ_m P(m_g | c) P(m_a | d)
  (Hough'59; Ballard'81; Stephens'91; Leibe et al.'04; Maji & Malik'09; Barinova et al.'12)

• Region confidence
  C(r' | [d', d'']) = max_{r''} P((r', r'') | [d', d''])

• From two images to multiple images (a rough sketch of the two-image case follows below)
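To make the matching model concrete, here is a minimal NumPy sketch of the two-image case, assuming each image comes with region-proposal boxes (x, y, w, h) and appearance feature vectors; the offset binning, the hard geometric assignment and all parameter values are illustrative assumptions, not the exact model of the CVPR'15 paper.

```python
# Minimal probabilistic Hough matching sketch between two images.
# boxes*: (n, 4) arrays of (x, y, w, h); feats*: (n, d) appearance features.
import numpy as np

def phm_match_scores(boxes1, feats1, boxes2, feats2, n_bins=8):
    """Return an (n1, n2) array of match scores P(m | d)."""
    # Appearance term P(m_a | d): similarity of L2-normalised features.
    f1 = feats1 / np.linalg.norm(feats1, axis=1, keepdims=True)
    f2 = feats2 / np.linalg.norm(feats2, axis=1, keepdims=True)
    app = np.clip(f1 @ f2.T, 0.0, None)                   # (n1, n2)

    # Geometry: offsets between region centres define the configuration c.
    c1 = boxes1[:, :2] + boxes1[:, 2:] / 2.0
    c2 = boxes2[:, :2] + boxes2[:, 2:] / 2.0
    offsets = c1[:, None, :] - c2[None, :, :]              # (n1, n2, 2)

    # Discretise offsets into a coarse 2-D Hough space.
    lo = offsets.min(axis=(0, 1))
    span = offsets.max(axis=(0, 1)) - lo + 1e-9
    bins = np.clip((offsets - lo) / span * n_bins, 0, n_bins - 1).astype(int)
    cell = bins[..., 0] * n_bins + bins[..., 1]            # flat config index

    # Hough transform: H(c | d) = sum_m P(m_g | c) P(m_a | d),
    # with P(m_g | c) taken here as a hard assignment to the offset bin.
    hough = np.bincount(cell.ravel(), weights=app.ravel(),
                        minlength=n_bins * n_bins)

    # Match score: P(m | d) ~ P(m_a | d) * H(c(m) | d).
    return app * hough[cell]
```

The region confidence from the slide, C(r' | [d', d'']) = max_{r''} P((r', r'') | [d', d'']), is then simply a row-wise maximum of the returned matrix.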
Stand-out scoring of part hierarchies
• Object regions should contain
  – more foreground than part regions
  – less background than larger regions
• Stand-out score: S(r) = C(r) − max_{r' ⊃ r} C(r'), i.e. the confidence of r minus the best confidence of any region containing r (sketch below)
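A small sketch of the stand-out score under that definition, with boxes as (x, y, w, h) and `conf` the per-region confidences from matching; treating a region with no container as keeping its raw confidence is an assumption of this sketch.

```python
# Stand-out scoring: a region's confidence is reduced by the best confidence
# of any region containing it, so boxes that stand out from their
# backgrounds score highest.
import numpy as np

def contains(outer, inner):
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ox + ow >= ix + iw and oy + oh >= iy + ih

def standout_scores(boxes, conf):
    scores = []
    for i, r in enumerate(boxes):
        # Confidences of other regions that contain r.
        containers = [conf[j] for j, r2 in enumerate(boxes)
                      if j != i and contains(r2, r)]
        scores.append(conf[i] - (max(containers) if containers else 0.0))
    return np.array(scores)
```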
An iterative algorithm – iteration 1
• Retrieve 10 nearest neighbors with GIST (Oliva & Torralba'06)
• Probabilistic Hough matching with the 10 NN
• For all images: localize the top 5 scoring windows with the stand-out score

An iterative algorithm – next iterations
• Exploit the selected regions: retrieve 10 NN using PHM with the top-confidence regions
• Probabilistic Hough matching with the 10 NN
• Localize the 5 top scoring windows per image with the stand-out score
(A sketch of the overall loop follows below.)
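A high-level sketch of that loop; only the control flow is taken from the slides, and the four callables (GIST retrieval, PHM confidences, PHM-based retrieval, stand-out scoring) are placeholders supplied by the caller.

```python
# Iterative discovery loop (control-flow sketch; the callables are
# placeholders for the steps described on the slides).
import numpy as np

def discover(images, gist_neighbours, phm_confidences, phm_neighbours,
             standout_scores, n_neighbours=10, n_windows=5, n_iters=5):
    """images: list of objects with a `.boxes` attribute (region proposals)."""
    # Iteration 1: neighbours retrieved with global GIST descriptors.
    neighbours = gist_neighbours(images, k=n_neighbours)
    selected = {}
    for _ in range(n_iters):
        for i, img in enumerate(images):
            # Probabilistic Hough matching against the current neighbours.
            conf = phm_confidences(img, [images[j] for j in neighbours[i]])
            # Keep the top-scoring windows under the stand-out score.
            order = np.argsort(-standout_scores(img.boxes, conf))
            selected[i] = [img.boxes[j] for j in order[:n_windows]]
        # Next iterations: re-retrieve neighbours with PHM on the
        # top-confidence regions selected so far.
        neighbours = phm_neighbours(images, selected, k=n_neighbours)
    return selected
```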
Localization improvement over iterations
Comparative evaluation
Two benchmarks:
• Object discovery dataset (Rubinstein et al.'13)
  – Subset of 300 images from 3 classes
  – Includes between 7 and 18 outliers per class
• Pascal'07 – all (Everingham et al.'07)
  – 4548 images from 20 classes
  – From the train/val set, minus difficult/truncated images

Computing time: < 1 h for 500 images on a 10-core desktop

Performance metrics:
• CorLoc: percentage of boxes with intersection-over-union > 0.5 (see the helper below)
• CorRet: percentage of the retrieved 10 NNs in the same class as the query
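For reference, a minimal helper for the CorLoc criterion, with boxes as (x, y, w, h) and one predicted box per image compared against its ground-truth box:

```python
# Intersection-over-union and CorLoc (fraction of images with IoU > 0.5).
def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def corloc(pred_boxes, gt_boxes, thresh=0.5):
    hits = [iou(p, g) > thresh for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / float(len(hits))
```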
Experimental results: Object discovery dataset
• CorLoc – separate classes
• CorLoc / CorRet – mixed classes without labels
• Examples – mixed classes without labels

Experimental results: Pascal'07 – all
• CorLoc – separate classes
• Examples – mixed classes without labels: successes
• Examples – mixed classes without labels: failures
Discussion and future work
• Effective method for object discovery and localization in challenging unlabeled scenarios
• No use of saliency or objectness measures
• No use of negative examples or pretrained features
• Next:
  – Image categorization and object detection
  – Handling multiple objects per image
Overview
• Easier to separate the object from the background: reduces the need for bounding-box annotation
• A video shows a range of variations for an object: easier to learn multi-view, articulation, illumination
• Many frames, easy to access (e.g. YouTube): lots of extra data!
Automatic extraction of objects
[Prest, Leistner, Civera, Schmid & Ferrari, Learning object
detectors from weakly annotated video, CVPR’12]
Data collection
1. Pick 10 moving classes from PASCAL VOC
2. Collect 9–24 videos per class from YouTube (~500 shots per class)
3. Shot change detection chunks the videos into shots (a generic sketch follows below)
• Total: 0.57 million frames
• Video-level label only (i.e. some shots that do not show the class are kept)
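Step 3 can be illustrated with a generic colour-histogram shot-boundary detector; this is a standard heuristic, not necessarily the detector used to build the dataset.

```python
# Generic shot-boundary detection: flag a boundary when colour histograms of
# consecutive frames are poorly correlated.
import cv2

def shot_boundaries(video_path, thresh=0.5):
    cap = cv2.VideoCapture(video_path)
    prev_hist, boundaries, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Low correlation between consecutive histograms => new shot.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < thresh:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```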
Step 1
[N. Sundaram et al., Dense point trajectories by GPU-accelerated
large displacement optical flow, ECCV 2010]
Candidate tubes – motion segmentation
Selecting tubes
• Jointly select one tube per shot by energy minimization (toy sketch below)
• Graph: one node per shot; node states = candidate tubes
• Unary potential: homogeneity within a tube, location prior
• Pairwise potential: similarity between tubes from different shots, based on appearance descriptors (BOW, HOG) extracted for a fixed number of frames per tube
• Find the states minimizing the sum of potentials; inference with TRW-S [V. Kolmogorov, Convergent tree-reweighted message passing for energy minimization, PAMI'06]
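A toy sketch of this energy, assuming `unary[s][t]` is the cost of picking tube t in shot s (homogeneity, location prior) and `pairwise[s1][s2][t1][t2]` the cost of picking tubes t1 and t2 in two different shots (appearance dissimilarity); a simple coordinate-descent pass stands in for the TRW-S inference used in the paper.

```python
# Toy tube selection: minimise sum of unary + pairwise costs by
# ICM-style coordinate descent (a stand-in for TRW-S).
import numpy as np

def select_tubes(unary, pairwise, n_sweeps=10):
    n_shots = len(unary)
    labels = [int(np.argmin(u)) for u in unary]          # greedy init
    for _ in range(n_sweeps):
        for s in range(n_shots):
            costs = np.array(unary[s], dtype=float)
            for t in range(len(costs)):
                for s2 in range(n_shots):
                    if s2 != s:
                        costs[t] += pairwise[s][s2][t][labels[s2]]
            labels[s] = int(np.argmin(costs))
    return labels   # one selected tube index per shot
```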
Selecting tubes
Over-segmentation
Experiments: tube quality
• 10 object classes that move, from PASCAL: aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, train
• ~500 shots per class from YouTube
• Evaluate on 100–290 manually annotated frames per class (1407 in total)
• Performance = detection rate as defined in Pascal (> 0.5 IoU)
Results:
• Automatic selection picks the best available tube 80% of the time
• Motion segmentation is far from perfect (the best tube covers 35% of objects)
Train detector
Experiments: detection in PASCAL VOC
• Test on the Pascal 2007 test set: 4952 test images; multiple classes per image
• Many variations: scale, viewpoint, illumination, occlusion, intra-class
• Standard Pascal evaluation protocol
• mAP: 15.2 (detector trained from video) vs 31.6 (DPM trained on still images), with about the same number of training instances (500/class)
• Only about half the mAP! Big gap!
Experiments: detection in PASCAL VOC
[Chart comparing training sources: Image GT; Video (ICCV'13 seg); Image GT + Video; induced by VOC transfer; still images from PASCAL VOC 2007]
Summary and discussion
• Video: significantly more training data
• Motion as an additional cue
• Improve extraction of spatio-temporal tubes
• Domain shift factors need to be investigated
• Construction of more complete models