Page 1: Analysis of Large Scale Visual Recognition

Analysis of Large Scale Visual Recognition

Fei-Fei Li and Olga Russakovsky

Olga Russakovsky, Jia Deng, Zhiheng Huang, Alex Berg, Li Fei-Fei. Detecting avocados to zucchinis: what have we done, and where are we going? ICCV 2013. http://image-net.org/challenges/LSVRC/2012/analysis

Page 2: Analysis of Large Scale Visual Recognition

Backpack

Page 3: Analysis of Large Scale Visual Recognition

Backpack, Flute, Strawberry, Traffic light, Bathing cap, Matchstick, Racket, Sea lion

Page 4: Analysis of Large Scale Visual Recognition

Large-scale recognition

Page 5: Analysis of Large Scale Visual Recognition

Large-scale recognition

Need benchmark datasets

Page 6: Analysis of Large Scale Visual Recognition

PASCAL VOC 2005-2012

Classification: person, motorcycle

Detection, Segmentation

Person

Motorcycle

Action: riding bicycle

Everingham, Van Gool, Williams, Winn and Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV 2010.

20 object classes, 22,591 images

Page 7: Analysis of Large Scale Visual Recognition

Large Scale Visual Recognition Challenge (ILSVRC) 2010-2012

PASCAL VOC: 20 object classes, 22,591 images
ILSVRC: 1000 object classes, 1,431,167 images

Dalmatian

http://image-net.org/challenges/LSVRC/{2010,2011,2012}

Page 8: Analysis of Large Scale Visual Recognition

Variety of object classes in ILSVRC

Page 9: Analysis of Large Scale Visual Recognition

Variety of object classes in ILSVRC

Page 10: Analysis of Large Scale Visual Recognition

ILSVRC Task 1: Classification

Steel drum

Page 11: Analysis of Large Scale Visual Recognition

ILSVRC Task 1: Classification

Ground truth: Steel drum

✔ Output: Scale, T-shirt, Steel drum, Drumstick, Mud turtle

✗ Output: Scale, T-shirt, Giant panda, Drumstick, Mud turtle

Page 12: Analysis of Large Scale Visual Recognition

ILSVRC Task 1: Classification

Ground truth: Steel drum

✔ Output: Scale, T-shirt, Steel drum, Drumstick, Mud turtle

✗ Output: Scale, T-shirt, Giant panda, Drumstick, Mud turtle

$$\text{Accuracy} = \frac{1}{100{,}000} \sum_{i=1}^{100{,}000} \mathbf{1}[\text{correct on image } i]$$
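The accuracy on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming each image comes with a list of up to five predicted labels and one ground-truth label; the function name `top5_accuracy` and the two-image example are illustrative and not part of the challenge toolkit.

```python
from typing import Sequence


def top5_accuracy(predictions: Sequence[Sequence[str]],
                  ground_truth: Sequence[str]) -> float:
    """Fraction of images whose true label is among the first 5 predictions."""
    if len(predictions) != len(ground_truth):
        raise ValueError("need one prediction list per ground-truth label")
    if not ground_truth:
        raise ValueError("need at least one image")
    # The indicator 1[correct on image i] is 1 when the true label is predicted.
    correct = sum(1 for preds, truth in zip(predictions, ground_truth)
                  if truth in preds[:5])
    return correct / len(ground_truth)


# The two outputs shown on the slide, both for a "Steel drum" image.
preds = [
    ["Scale", "T-shirt", "Steel drum", "Drumstick", "Mud turtle"],   # correct
    ["Scale", "T-shirt", "Giant panda", "Drumstick", "Mud turtle"],  # incorrect
]
truth = ["Steel drum", "Steel drum"]
print(top5_accuracy(preds, truth))  # 0.5
```

On the real test set the sum would run over all 100,000 images rather than two.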

Page 13: Analysis of Large Scale Visual Recognition

ILSVRC Task 1: Classification

[Figure: number of submissions versus accuracy (5 predictions/image) for 2010, 2011 and 2012; marked values 0.72, 0.74 and 0.85]

Page 14: Analysis of Large Scale Visual Recognition

ILSVRC Task 2: Classification + Localization

Steel drum

Page 15: Analysis of Large Scale Visual Recognition

ILSVRC Task 2: Classification + Localization

Ground truth: Steel drum

✔ Output: Folding chair, Persian cat, Loud speaker, Steel drum, Picket fence

Page 16: Analysis of Large Scale Visual Recognition

ILSVRC Task 2: Classification + Localization

Ground truth: Steel drum

✔ Output: Folding chair, Persian cat, Loud speaker, Steel drum, Picket fence

✗ Output (bad localization): Folding chair, Persian cat, Loud speaker, Steel drum, Picket fence

✗ Output (bad classification): Folding chair, Persian cat, Loud speaker, Picket fence, King penguin

Page 17: Analysis of Large Scale Visual Recognition

ILSVRC Task 2: Classification + Localization

Ground truth: Steel drum

✔ Output: Folding chair, Persian cat, Loud speaker, Steel drum, Picket fence

$$\text{Accuracy} = \frac{1}{100{,}000} \sum_{i=1}^{100{,}000} \mathbf{1}[\text{correct on image } i]$$
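For this task, "correct on image i" needs both the right label and a well-placed box. The sketch below is one way to express that check. It assumes boxes are given as (x_min, y_min, x_max, y_max) and that a box counts as well localized when its intersection over union (IoU) with the ground-truth box is at least 0.5; the threshold, the box coordinates and the function names are assumptions for illustration, since the slides do not state them.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def is_correct(preds: List[Tuple[str, Box]], true_label: str,
               true_box: Box, threshold: float = 0.5) -> bool:
    """True if any of the first 5 (label, box) guesses matches label and box."""
    return any(label == true_label and iou(box, true_box) >= threshold
               for label, box in preds[:5])


true_box = (10.0, 10.0, 110.0, 110.0)  # hypothetical ground-truth box
others = [("Folding chair", (0.0, 0.0, 50.0, 50.0)),
          ("Persian cat", (0.0, 0.0, 50.0, 50.0)),
          ("Loud speaker", (0.0, 0.0, 50.0, 50.0)),
          ("Picket fence", (0.0, 0.0, 50.0, 50.0))]

good = others + [("Steel drum", (20.0, 20.0, 120.0, 120.0))]
bad_localization = others + [("Steel drum", (100.0, 100.0, 200.0, 200.0))]
bad_classification = others + [("King penguin", (20.0, 20.0, 120.0, 120.0))]

print(is_correct(good, "Steel drum", true_box))                # True
print(is_correct(bad_localization, "Steel drum", true_box))    # False
print(is_correct(bad_classification, "Steel drum", true_box))  # False
```

The three calls mirror the three outputs on the earlier slide: a correct one, one with bad localization, and one with bad classification.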

Page 18: Analysis of Large Scale Visual Recognition

ILSVRC Task 2: Classification + Localization

[Figure: accuracy (5 predictions) of ISI, OXFORD_VGG and SuperVision]

Page 19: Analysis of Large Scale Visual Recognition

What happens under the hood?

Page 20: Analysis of Large Scale Visual Recognition

What happens under the hood on classification+localization?

Page 21: Analysis of Large Scale Visual Recognition

What happens under the hood on classification+localization?


Page 22: Analysis of Large Scale Visual Recognition

What happens under the hood on classification+localization?

Preliminaries:
• ILSVRC-500 (2012) dataset
• Leading algorithms


Page 23: Analysis of Large Scale Visual Recognition

What happens under the hood on classification+localization?

Preliminaries:
• ILSVRC-500 (2012) dataset
• Leading algorithms

• A closer look at small objects
• A closer look at textured objects



Page 25: Analysis of Large Scale Visual Recognition

Easy to localize Hard to localize

1000 object classes

ILSVRC (2012)

Page 26: Analysis of Large Scale Visual Recognition

Easy to localize Hard to localize

500 classes with smallest objects

ILSVRC-500 (2012)

Page 27: Analysis of Large Scale Visual Recognition

Easy to localize Hard to localize

ILSVRC-500 (2012): 500 classes with smallest objects

Object scale (fraction of image area occupied by target object)

Dataset             Object categories   Object scale
ILSVRC-500 (2012)   500                 25.3%
PASCAL VOC (2012)   20                  25.2%

Page 28: Analysis of Large Scale Visual Recognition

Chance Performance of Localization

Steel drum

[Figure: bounding boxes B1 through B9; N = 9 here]


Page 30: Analysis of Large Scale Visual Recognition

Chance Performance of Localization

Steel drum

[Figure: bounding boxes B1 through B9; N = 9 here]

Dataset             Object categories   Chance performance of localization
ILSVRC-500 (2012)   500                 8.4%
PASCAL VOC (2012)   20                  8.8%
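The slides show only the boxes B1 through B9 of one class and do not spell out the formula. One plausible reading, assumed here, is that chance performance of localization is the fraction of ordered pairs of boxes from the same class that overlap with IoU of at least 0.5, with boxes expressed in image-relative coordinates. The sketch below follows that assumption; the function names and the three example boxes are hypothetical.

```python
from itertools import permutations
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), relative to image size


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def chance_performance_of_localization(boxes: Sequence[Box],
                                       threshold: float = 0.5) -> float:
    """Assumed definition: share of ordered box pairs (i != j) with IoU >= threshold."""
    n = len(boxes)
    if n < 2:
        raise ValueError("need at least two boxes")
    hits = sum(1 for a, b in permutations(boxes, 2) if iou(a, b) >= threshold)
    return hits / (n * (n - 1))


# Hypothetical class with N = 3 instances: two coincide, one is far away.
boxes = [(0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 0.5, 0.5), (0.6, 0.6, 1.0, 1.0)]
cpl = chance_performance_of_localization(boxes)  # 2 of 6 ordered pairs overlap
```

Under this reading, a class whose instances tend to sit in the same place at the same size scores high, which would make it easy to localize by guessing.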

Page 31: Analysis of Large Scale Visual Recognition

Level of clutter

Steel drum

- Generate candidate object regions using method of Selective Search for Object Detection (van de Sande et al. ICCV 2011)
- Filter out regions inside object
- Count regions

Page 32: Analysis of Large Scale Visual Recognition

Level of clutter

Steel drum

- Generate candidate object regions using method of Selective Search for Object Detection (van de Sande et al. ICCV 2011)
- Filter out regions inside object
- Count regions

Dataset             Object categories   Level of clutter
ILSVRC-500 (2012)   500                 128 ± 35
PASCAL VOC (2012)   20                  130 ± 29
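The last two steps of the clutter measure (filter, then count) can be sketched as follows. The candidate regions are taken as given, since the sketch does not run selective search itself, and "inside object" is assumed to mean fully contained in the ground-truth box. The function names and the example boxes are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def is_inside(inner: Box, outer: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def clutter_level(candidates: Sequence[Box], object_box: Box) -> int:
    """Count candidate regions that are not contained in the object box."""
    return sum(1 for region in candidates if not is_inside(region, object_box))


# Hypothetical candidate regions from some region-proposal method.
object_box = (50.0, 50.0, 150.0, 150.0)
candidates = [
    (60.0, 60.0, 100.0, 100.0),    # inside the object: filtered out
    (0.0, 0.0, 40.0, 40.0),        # background region: counted
    (100.0, 100.0, 200.0, 200.0),  # straddles the object boundary: counted
]
clutter = clutter_level(candidates, object_box)  # 2 regions remain
```

Averaging this count over the images of a dataset would give a single figure per dataset, comparable in form to the values in the table above.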

Page 33: Analysis of Large Scale Visual Recognition

What happens under the hood on classification+localization?

Preliminaries:
• ILSVRC-500 (2012) dataset, similar to PASCAL
• Leading algorithms

• A closer look at small objects
• A closer look at textured objects


Page 34: Analysis of Large Scale Visual Recognition

SuperVision (SV)
Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton (Krizhevsky NIPS12)

Image classification: Deep convolutional neural networks
• 7 hidden “weight” layers, 650K neurons, 60M parameters, 630M connections
• Rectified Linear Units, max pooling, dropout trick
• Randomly extracted 224x224 patches for more data
• Trained with SGD on two GPUs for a week, fully supervised

Localization: Regression on (x,y,w,h)

http://image-net.org/challenges/LSVRC/2012/supervision.pdf
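The "randomly extracted 224x224 patches" step is a form of data augmentation. The sketch below illustrates the idea only and is not the SuperVision code: it assumes an image stored as a list of pixel rows and a source size of 256x256, which the slide does not state. The function name `random_crop` is hypothetical.

```python
import random
from typing import List, Sequence


def random_crop(image: Sequence[Sequence[int]], size: int = 224,
                rng: random.Random = random.Random(0)) -> List[Sequence[int]]:
    """Return a size x size patch taken at a uniformly random position."""
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the requested patch")
    top = rng.randrange(height - size + 1)   # random vertical offset
    left = rng.randrange(width - size + 1)   # random horizontal offset
    return [row[left:left + size] for row in image[top:top + size]]


# Assumed 256x256 single-channel image filled with zeros.
image = [[0] * 256 for _ in range(256)]
patch = random_crop(image)
print(len(patch), len(patch[0]))  # 224 224
```

Each training pass can draw a different patch from the same image, which effectively enlarges the training set without new annotations.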


Page 36: Analysis of Large Scale Visual Recognition

OXFORD_VGG (VGG)
Karen Simonyan, Yusuf Aytar, Andrea Vedaldi, Andrew Zisserman

Image classification: Fisher vector + linear SVM (Sanchez CVPR11)
• Root-SIFT (Arandjelovic CVPR12), color statistics, augmentation with patch location (x,y) (Sanchez PRL12)
• Fisher vectors: 1024 Gaussians, 135K dimensions
• No SPM, product quantization to compress
• Semi-supervised learning to find additional bounding boxes
• 1000 one-vs-rest SVM trained with Pegasos SGD
• 135M parameters!

Localization: Deformable part-based models (Felzenszwalb PAMI10), without parts (root-only)

http://image-net.org/challenges/LSVRC/2012/oxford_vgg.pdf

Page 37: Analysis of Large Scale Visual Recognition

What happens under the hood on classification+localization?

Preliminaries:
• ILSVRC-500 (2012) dataset, similar to PASCAL
• Leading algorithms: SV and VGG

• A closer look at small objects
• A closer look at textured objects


Page 38: Analysis of Large Scale Visual Recognition

Results on ILSVRC-500

Cls+loc accuracy: SV 54.3%, VGG 45.8%

Page 39: Analysis of Large Scale Visual Recognition

Difference in accuracy: SV versus VGG

Classification-only

✔ Output: Folding chair, Persian cat, Loud speaker, Steel drum, Picket fence

Page 40: Analysis of Large Scale Visual Recognition

Difference in accuracy: SV versus VGG

Classification-only

[Figure: difference in classification accuracy (SV - VGG) versus object scale]

Page 41: Analysis of Large Scale Visual Recognition

Difference in accuracy: SV versus VGG

Classification-only

[Figure: difference in classification accuracy (SV - VGG) versus object scale]

SV better (452 classes), VGG better (34 classes)

Page 42: Analysis of Large Scale Visual Recognition

Difference in accuracy: SV versus VGG

Classification-only

[Figure: difference in classification accuracy (SV - VGG) versus object scale; markers (*, ***) annotate the regions labelled "SV beats VGG" and "VGG beats SV"]

SV better (452 classes), VGG better (34 classes)

Page 43: Analysis of Large Scale Visual Recognition

Difference in accuracy: SV versus VGG

Classification-only: SV better (452 classes), VGG better (34 classes)

[Figure: difference in classification accuracy (SV - VGG) versus object scale]

Classification+Localization: SV better (338 classes), VGG better (150 classes)

[Figure: difference in cls+loc accuracy (SV - VGG) versus object scale]

Page 44: Analysis of Large Scale Visual Recognition

Cumulative accuracy across scales

[Figure, two panels, each with curves for SV and VGG: cumulative cls. accuracy versus object scale (Classification-only) and cumulative cls+loc accuracy versus object scale (Classification+Localization)]

Page 45: Analysis of Large Scale Visual Recognition

Cumulative accuracy across scales

[Figure, two panels, each with curves for SV and VGG: cumulative cls. accuracy versus object scale (Classification-only) and cumulative cls+loc accuracy versus object scale (Classification+Localization)]

Marked point at object scale 0.24: 205 smallest object classes
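A cumulative accuracy curve of this kind can be computed from per-class numbers. The sketch below assumes the curve at object scale s is the mean accuracy over all classes whose average object scale is at most s; that definition, the function name and the three example classes are assumptions for illustration, not values from the slides.

```python
from typing import List, Sequence, Tuple


def cumulative_accuracy(scales: Sequence[float],
                        accuracies: Sequence[float]) -> List[Tuple[float, float]]:
    """For each class (sorted by scale), mean accuracy of it and all smaller classes."""
    if len(scales) != len(accuracies):
        raise ValueError("need one accuracy per class")
    curve = []
    running_total = 0.0
    for count, (scale, acc) in enumerate(sorted(zip(scales, accuracies)), start=1):
        running_total += acc
        curve.append((scale, running_total / count))
    return curve


# Hypothetical per-class average object scale and accuracy.
scales = [0.3, 0.1, 0.2]
accuracies = [0.9, 0.5, 0.7]
curve = cumulative_accuracy(scales, accuracies)
for scale, acc in curve:
    print(f"scale <= {scale:.1f}: cumulative accuracy {acc:.2f}")
```

Plotting one such curve per method shows up to which object scale one method stays ahead, which is how a crossover such as the 205 smallest classes on the slide could be read off.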

Page 46: Analysis of Large Scale Visual Recognition

What happens under the hood on classification+localization?

Preliminaries:
• ILSVRC-500 (2012) dataset, similar to PASCAL
• Leading algorithms: SV and VGG

• SV always great at classification, but VGG does better than SV at localizing small objects

• A closer look at textured objects


Page 47: Analysis of Large Scale Visual Recognition

What happens under the hood on classification+localization?

Preliminaries:
• ILSVRC-500 (2012) dataset, similar to PASCAL
• Leading algorithms: SV and VGG

• SV always great at classification, but VGG does better than SV at localizing small objects

WHY?

• A closer look at textured objects


Page 48: Analysis of Large Scale Visual Recognition

What happens under the hood on classification+localization?

Preliminaries:
• ILSVRC-500 (2012) dataset, similar to PASCAL
• Leading algorithms: SV and VGG

• SV always great at classification, but VGG does better than SV at localizing small objects

• A closer look at textured objects


Page 49: Analysis of Large Scale Visual Recognition

Textured objects (ILSVRC-500)

Amount of texture: low to high

Page 50: Analysis of Large Scale Visual Recognition

Textured objects (ILSVRC-500)

Amount of texture: low to high

               No texture   Low texture   Medium texture   High texture
# classes      116          189           143              52

Page 51: Analysis of Large Scale Visual Recognition

Textured objects (ILSVRC-500)

Amount of texture: low to high

               No texture   Low texture   Medium texture   High texture
# classes      116          189           143              52
Object scale   20.8%        23.7%         23.5%            25.0%

Page 52: Analysis of Large Scale Visual Recognition

Textured objects (416 classes)

Amount of texture: low to high

               No texture   Low texture      Medium texture   High texture
# classes      116          189 → 149        143 → 115        52 → 35
Object scale   20.8%        23.7% → 20.8%    23.5% → 20.8%    25.0% → 20.8%

(Value pairs: before → after restricting to the 416-class subset.)

Page 53: Analysis of Large Scale Visual Recognition

Localizing textured objects (416 classes, same average object scale at each level of texture)

[Figure: localization accuracy versus level of texture, for SV and VGG]

Page 54: Analysis of Large Scale Visual Recognition

Localizing textured objects (416 classes, same average object scale at each level of texture)

[Figure: localization accuracy on correctly classified images versus level of texture, for SV and VGG]


Page 56: Analysis of Large Scale Visual Recognition

What happens under the hood on classification+localization?

Preliminaries:
• ILSVRC-500 (2012) dataset, similar to PASCAL
• Leading algorithms: SV and VGG

• SV always great at classification, but VGG does better than SV at localizing small objects

• Textured objects easier to localize, especially for SV


Page 57: Analysis of Large Scale Visual Recognition

ILSVRC 2013 with large-scale object detection

http://image-net.org/challenges/LSVRC/2013/

Fully annotated 200 object classes across 60,000 images

Allows evaluation of generic object detection in cluttered scenes at scale

Person, Car, Motorcycle, Helmet

NEW

Page 58: Analysis of Large Scale Visual Recognition

ILSVRC 2013 with large-scale object detection

Statistics           PASCAL VOC 2012   ILSVRC 2013
Object classes       20                200
Training images      5.7K              395K
Training objects     13.6K             345K
Validation images    5.8K              20.1K
Validation objects   13.8K             55.5K
Testing images       11.0K             40.1K
Testing objects      ---               ---

Scale-up annotations on the slide: 4x, 10x, 25x

http://image-net.org/challenges/LSVRC/2013/

More than 50,000 person instances annotated

NEW

Page 59: Analysis of Large Scale Visual Recognition

• 159 downloads so far: http://image-net.org/challenges/LSVRC/2013/

• Submission deadline Nov. 15th

• ICCV workshop on December 7th, 2013

• Fine-Grained Challenge 2013: https://sites.google.com/site/fgcomp2013/

ILSVRC 2013 with large-scale object detection

NEW

Page 60: Analysis of Large Scale Visual Recognition

Thank you!

Prof. Alex Berg, UNC Chapel Hill

Jonathan Krause, Stanford U.

Sanjeev Satheesh, Stanford U.

Zhiheng Huang, Stanford U.

Dr. Jia Deng, Stanford U.

Hao Su, Stanford U.

