Deep Fisher Networks and Class Saliency Maps for Object Classification and Localisation
Karén Simonyan, Andrea Vedaldi, Andrew Zisserman
Visual Geometry Group, University of Oxford
Outline
• Classification challenge
  • can Fisher Vector encodings be improved by a deep architecture?
  • deep Fisher Network (FN)
  • combination of two deep models: Convolutional Network (CN) and deep Fisher Network
• Localisation challenge
  • visualisation of class saliency maps and per-image foreground pixels from a single classification CN
  • bounding boxes computed from foreground pixels
  • weak supervision: only image class labels used for training
Shallow Image Encoding & Classification
• Dense SIFT features
• Bag of Visual Words (BOW) pipeline: dense local features → vector quantisation (VQ) against a visual vocabulary → linear SVM → class label (e.g. "dogs")
[Leung & Malik, 1999] [Varma & Zisserman, 2003] [Csurka et al., 2004] [Vogel & Schiele, 2004] [Jurie & Triggs, 2005] [Lazebnik et al., 2006] [Bosch et al., 2006]
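A minimal sketch of the shallow pipeline above, assuming dense SIFT descriptors and a k-means visual vocabulary have already been computed (array names are placeholders):

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Vector-quantise local descriptors against a visual vocabulary
    and return an L1-normalised histogram of visual-word counts."""
    # descriptors: (N, 128) dense SIFT; vocabulary: (K, 128) k-means centres
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                                  # hard assignment (VQ)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

# The resulting histograms would then be fed to a one-vs-rest linear SVM
# (e.g. sklearn.svm.LinearSVC) for classification.
```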
Fisher Vector (FV) – Encoding
[Perronnin et al., CVPR 2007 & 2010, ECCV 2010]
• Dense set of local SIFT features → Fisher vector (high dim)
• each SIFT descriptor x reduced to e.g. 80 dimensions by PCA
• soft-assignment of each descriptor to a GMM: α_k(x)
• 1st order stats (k-th Gaussian): u_k ∝ Σ_p α_k(x_p) (x_p − μ_k)/σ_k   (80-D)
• 2nd order stats (k-th Gaussian): v_k ∝ Σ_p α_k(x_p) [((x_p − μ_k)/σ_k)² − 1]   (80-D)
• FV = stacking of the per-Gaussian statistics
• FV dimensionality: 80 × 2 × 512 = 81,920 (for a mixture of 512 Gaussians)
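A hedged numpy sketch of the FV encoding above for one image, assuming PCA-reduced descriptors x (N × 80) and a diagonal-covariance GMM with weights pi, means mu and standard deviations sigma (K × 80); the exact normalisation constants follow the standard improved-FV formulation rather than anything stated on the slide:

```python
import numpy as np

def fisher_vector(x, pi, mu, sigma):
    """Improved Fisher Vector of local descriptors x (N x D) w.r.t. a
    diagonal GMM with K components (pi: K, mu/sigma: K x D)."""
    N, D = x.shape
    # soft assignments q (N x K): posterior of each Gaussian per descriptor
    log_p = (-0.5 * ((((x[:, None, :] - mu[None]) / sigma[None]) ** 2
                      + np.log(2 * np.pi * sigma[None] ** 2)).sum(-1))
             + np.log(pi)[None])
    q = np.exp(log_p - log_p.max(1, keepdims=True))
    q /= q.sum(1, keepdims=True)
    diff = (x[:, None, :] - mu[None]) / sigma[None]                         # N x K x D
    u = (q[..., None] * diff).sum(0) / (N * np.sqrt(pi)[:, None])           # 1st order stats
    v = (q[..., None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * pi)[:, None])  # 2nd order
    fv = np.concatenate([u.ravel(), v.ravel()])                             # 2*K*D dims
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                                  # SSR normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)                                # L2 normalisation
```

With D = 80 and K = 512 this gives the 81,920-dimensional encoding quoted on the slide.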
Projection Learning
Fisher vector (high dim) → low-dimensional representation
• Learn a projection onto a low-dim space where classes are well separated
• Joint learning of the projection and the projected-space classifiers (WSABIE)
• Or project onto the space of classifier scores: u = Wφ
  • the rows of W are linear SVM classifiers in the high-dimensional FV space
  • fast to learn
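A minimal illustration of the second option, where W stacks the one-vs-rest linear SVM weight vectors learnt in the high-dimensional FV space (all variable names are placeholders):

```python
import numpy as np

def project_to_scores(phi, W, b):
    """Project a high-dimensional FV phi (D,) onto the low-dimensional
    space of classifier scores: one coordinate per class."""
    # W: (C, D) rows are linear SVM weight vectors; b: (C,) biases
    return W @ phi + b
```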
Deep Fisher Network
[Architecture diagram]
• input image → 0-th layer: dense feature extraction (SIFT, colour)
• 1st Fisher layer (local & global pooling): low-dim FV encoder, spatial stacking, L2 norm. & PCA, SSR & L2 norm.
• 2nd Fisher layer (global pooling): FV encoder, SSR & L2 norm.
• classifier layer: one-vs-rest linear SVMs
Shallow Fisher Vector (for comparison): dense feature extraction (SIFT, raw patches, …) → FV encoder → SSR & L2 norm. → one-vs-rest linear SVMs
Fisher Layer
[Diagram: input feature map (w × h × 80) → compressed local Fisher encoding (w/2 × h/2 × 1,000; ~82,000-D local FVs reduced to 1,000-D) → spatial stacking (2×2) → w/2 × h/2 × 4,000 → L2 normalisation & PCA decorrelation → w/2 × h/2 × 256]
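A rough sketch of one Fisher-layer forward pass matching the diagram above, reusing the fisher_vector sketch from the FV slide; the GMM, projection matrices and the exact pooling geometry are placeholders and simplifications, not the settings used in the paper:

```python
import numpy as np

def fisher_layer(feat, gmm, P_compress, P_pca):
    """One Fisher layer: local FV encoding -> compression -> 2x2 spatial
    stacking -> L2 normalisation & PCA decorrelation.
    feat: (h, w, 80) dense features; gmm: (pi, mu, sigma);
    P_compress: (1000, 81920) discriminative projection;
    P_pca: (256, 4000) PCA projection."""
    h, w, _ = feat.shape
    # 1. Local FV encoding over 2x2 cells (halves the spatial resolution),
    #    then compression of each ~82,000-D local FV to 1,000 dims.
    local = np.zeros((h // 2, w // 2, P_compress.shape[0]))
    for i in range(h // 2):
        for j in range(w // 2):
            cell = feat[2*i:2*i+2, 2*j:2*j+2].reshape(-1, feat.shape[-1])
            local[i, j] = P_compress @ fisher_vector(cell, *gmm)  # see FV sketch above
    # 2. Spatial stacking of 2x2 neighbourhoods -> 4,000-D per location.
    stacked = np.concatenate([local[:-1, :-1], local[:-1, 1:],
                              local[1:, :-1], local[1:, 1:]], axis=-1)
    # 3. L2 normalisation and PCA decorrelation to 256-D.
    stacked /= np.linalg.norm(stacked, axis=-1, keepdims=True) + 1e-12
    return stacked @ P_pca.T
```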
Classification Results for Fisher Network
ImageNet 2010 challenge dataset:
• 1.2M images, 1K classes
• SIFT & colour features
• Learning: 2-3 days on 200 CPU cores (MATLAB + MEX implementation)
Adding a Fisher layer improves classification accuracy
Deep ConvNet Implementation
• Based on cuda-convnet [Krizhevsky et al., 2012]
• 8 weight layers (rather narrow):
  conv64-conv256-conv256-conv256-conv256-full4096-full4096-full1000
• Jittering: cropping, flipping, PCA-aligned noise, random occlusion (see the sketch below)
• Single ConvNet instance
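A hedged sketch of the jittering bullet above; the crop size, noise scale and occlusion size are illustrative values, not those used in the submission:

```python
import numpy as np

def jitter(img, eigvec, eigval, crop=224, occ=64, rng=np.random):
    """img: (H, W, 3) float RGB image; eigvec (3x3) / eigval (3,) are the
    PCA of RGB values over the training set."""
    H, W, _ = img.shape
    # random crop
    y, x = rng.randint(H - crop + 1), rng.randint(W - crop + 1)
    out = img[y:y+crop, x:x+crop].copy()
    # random horizontal flip
    if rng.rand() < 0.5:
        out = out[:, ::-1]
    # PCA-aligned colour noise (as in Krizhevsky et al., 2012)
    alpha = rng.randn(3) * 0.1
    out = out + eigvec @ (alpha * eigval)
    # random occlusion: blank out a square patch at a random position
    oy, ox = rng.randint(crop - occ + 1), rng.randint(crop - occ + 1)
    out[oy:oy+occ, ox:ox+occ] = 0
    return out
```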
Classification Results
ImageNet 2012 challenge dataset:
• 1.2M images, 1K classes
• top-5 classification accuracy

Method                                      top-5 accuracy
FV encoding (our 2012 entry)                72.7%
Deep FishNet                                76.9%
Deep ConvNet [Krizhevsky et al., 2012]      81.8% (83.6% with 5 ConvNets)
Deep ConvNet (our implementation)           82.3%
Deep ConvNet + Deep FishNet                 84.8%
ConvNet and FisherNet are complementary
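The slide does not say how the two models' outputs are combined; below is a minimal late-fusion sketch under the assumption that normalised per-class scores are simply averaged (the weighting and normalisation are illustrative choices):

```python
import numpy as np

def combine_scores(convnet_scores, fishnet_scores, w=0.5):
    """Late fusion of two models' per-class scores (C,) by a weighted sum;
    the top-5 prediction is taken from the fused scores."""
    def z(s):  # normalise each model's scores to a comparable scale
        return (s - s.mean()) / (s.std() + 1e-12)
    fused = w * z(convnet_scores) + (1 - w) * z(fishnet_scores)
    return np.argsort(fused)[::-1][:5]   # top-5 class indices
```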
Outline
• Classification challenge
  • can Fisher Vector encodings be improved by a deep architecture?
  • deep Fisher Network (FN)
  • combination of two deep models: Convolutional Network (CN) and deep Fisher Network
• Localisation challenge
  • visualisation of class saliency maps and per-image foreground pixels from a single classification CN
  • bounding boxes computed from foreground pixels
  • weak supervision: only image class labels used for training
Deep Inside ConvNets: What Has Been Learnt?
ConvNet class model visualisation
• find a (regularised) image I with a high score S_c for class c:  argmax_I  S_c(I) − λ‖I‖²
• with a fixed learnt model (the weights are not changed)
• computed using back-prop w.r.t. the image
Cf. ConvNet training
• maximise the log-likelihood of the correct class w.r.t. the weights, with the images fixed
• also using back-prop
Visualizing higher-layer features of a deep network. Erhan, D., Bengio, Y., Courville, A., Vincent, P. Technical report, University of Montreal, 2009.
[Class model visualisations, e.g. fox, pepper, dumbbell; class scores are taken from the fully-connected classifier layer, before the soft-max layer]
NB: maximising the soft-max output instead of the unnormalised class score gives a less prominent visualisation, as it concentrates on reducing the scores of other classes.
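A rough sketch of the class model visualisation by gradient ascent on the input image, assuming a differentiable model that returns unnormalised class scores (PyTorch is used here only for the back-prop; all names are placeholders):

```python
import torch

def class_model_visualisation(model, class_idx, size=224, steps=200,
                              lr=1.0, weight_decay=1e-4):
    """Find a regularised image that maximises the (unnormalised) score of
    one class: argmax_I S_c(I) - lambda * ||I||^2, optimised by back-prop
    with the learnt weights kept fixed."""
    model.eval()
    img = torch.zeros(1, 3, size, size, requires_grad=True)
    # SGD weight_decay on the image implements the L2 regulariser lambda*||I||^2
    opt = torch.optim.SGD([img], lr=lr, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        score = model(img)[0, class_idx]          # unnormalised class score
        (-score).backward()                       # ascend the class score
        opt.step()
    return img.detach()
```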
Deep Inside ConvNets: What Makes an Image Belong to a Class?
• ConvNets are highly non-linear → local linear approximation
• 1st order expansion of the class score S_c around a given image I₀:  S_c(I) ≈ wᵀI + b,  where w = ∂S_c/∂I evaluated at I₀
• w has the same dimensions as the image
• the magnitude of w defines a saliency map for image I₀ and class c
  – w is computed using back-prop
  – S_c is the score of the c-th class
How to Explain Individual Classification Decisions. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Müller, K.‐R. JMLR, 2010.
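A matching sketch of the image-specific class saliency map: a single back-prop pass of the class score to the input image, taking the maximum gradient magnitude over colour channels (same placeholder model as in the previous sketch):

```python
import torch

def class_saliency_map(model, img, class_idx):
    """img: (1, 3, H, W) input image; returns an (H, W) saliency map given
    by the magnitude of dS_c/dI, computed with one back-prop pass."""
    model.eval()
    img = img.detach().clone().requires_grad_(True)
    score = model(img)[0, class_idx]              # unnormalised class score
    score.backward()
    # per-pixel saliency = max over colour channels of |gradient|
    return img.grad.abs().max(dim=1)[0].squeeze(0)
```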
Saliency Maps For Top-1 Class
[Example images and their class saliency maps]
• Weakly supervised
  • computed using a classification ConvNet, trained only on image class labels
  • no additional annotation required (e.g. boxes or masks)
• Highlights discriminative object parts
• Instant computation – no sliding window
• Fires on several object instances
• Related to deconvnet [Zeiler and Fergus, 2013]
  • very similar for convolution, max-pooling and ReLU layers
  • but we also back-prop through fully-connected layers
Saliency Maps for Object Localisation
• Image → top-k class → class saliency map → object box
BBox Localisation for ILSVRC Submission
• Given an image and a saliency map:
  1. Foreground/background mask using thresholds on saliency (blue – foreground, cyan – background, red – undefined)
  2. GraphCut colour segmentation [Boykov and Jolly, 2001]
  3. Bounding box of the largest connected component
• Colour information propagates the segmentation from the most discriminative areas
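A hedged sketch of the three-step procedure above, using OpenCV's GrabCut as a stand-in for the Boykov-Jolly GraphCut colour segmentation; the saliency thresholds are illustrative, not the values used in the submission:

```python
import cv2
import numpy as np

def saliency_to_bbox(img_bgr, saliency, fg_thresh=0.95, bg_thresh=0.30):
    """img_bgr: (H, W, 3) uint8 image; saliency: (H, W) class saliency map.
    1. threshold the saliency into foreground / background seeds,
    2. colour segmentation (GrabCut) initialised with the seed mask,
    3. bounding box of the largest connected foreground component."""
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-12)
    mask = np.full(s.shape, cv2.GC_PR_BGD, np.uint8)   # undefined: left to GrabCut
    mask[s > np.quantile(s, fg_thresh)] = cv2.GC_FGD   # confident foreground seeds
    mask[s < np.quantile(s, bg_thresh)] = cv2.GC_BGD   # confident background seeds
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(img_bgr, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    fg = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(fg)
    if n < 2:
        return None
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])  # skip background label 0
    x, y, w, h = stats[largest, :4]
    return x, y, x + w, y + h
```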
Segmentation-Localisation Examples
[Example images with segmentations and predicted bounding boxes]
Segmentation-Localisation Failure Cases
• Several object instances
• Segmentation isn't propagated from the salient parts
• Limitations of GraphCut segmentation
Summary
• Fisher encoding benefits from stacking
• Deep FishNet is complementary to Deep ConvNet
• Class saliency maps are useful for localisation
  • location of discriminative object parts
  • weakly supervised: bounding boxes not used for training
  • fast to compute