  • AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)

    Donggeun Yoo, Sunggyun Park, Joon-Young Lee, Anthony Paek, In So Kweon.

  • State-of-the-art frameworks for object detection.

    1. Region-CNN framework. [Gkioxari et al., CVPR’14]

       Pipeline: Object proposal → CNN → SVM → NMS → BB Reg.

       (−) The maximally scored region is prone to focus on a discriminative part (e.g. face) rather than the entire object (e.g. human body).

  • State-of-the-art frameworks for object detection.

    2. Detection by CNN-regression. [Szegedy et al., NIPS’13]

       The CNN regresses the two box corners (x1, y1) and (x2, y2) directly from the image.

       (−) Direct mapping from an image to an exact bounding box is relatively difficult for a CNN.

  • Idea: Ensemble of weak predictions, terminated by a stop signal.

  • Model: Rather than a CNN regression model, use a CNN classification model.

    Shared trunk: Convolution → Normalization → Pooling → Convolution → Normalization → Pooling → Convolution → Convolution → Convolution, with two fully-connected output heads for top-left and bottom-right direction prediction.

    Each head predicts [ 3 directions, stop signal, no object ] ∈ ℝ⁵: the top-left head over { →, ↘, ↓, •, F } and the bottom-right head over { ←, ↖, ↑, •, F }.

  • Iterative test: Ensemble of weak directions.
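The iterative test can be sketched as a simple loop: predict one weak direction per corner on the current box, move each corner a small step, and stop when both heads emit the stop signal. The function `predict_directions`, the string labels, and the step size below are hypothetical stand-ins for illustration, not the paper's actual interface.

```python
# Sketch of the iterative test: each corner is moved one small step in the
# predicted direction until both heads emit the stop signal, or the window
# is rejected as containing no object.
# `predict_directions` is a hypothetical stand-in for the CNN forward pass:
# it returns one label per corner head.

STEP = 30  # pixels moved per weak prediction (assumed value)

def refine_box(image, box, predict_directions, max_iters=50):
    """Iteratively refine (x1, y1, x2, y2) with weak direction predictions."""
    x1, y1, x2, y2 = box
    for _ in range(max_iters):
        tl, br = predict_directions(image, (x1, y1, x2, y2))
        if tl == "no_object" or br == "no_object":
            return None                      # reject this hypothesis
        if tl == "stop" and br == "stop":
            return (x1, y1, x2, y2)          # converged on the object
        # Top-left corner may move right, down, or diagonally (→ ↘ ↓);
        # bottom-right corner may move left, up, or diagonally (← ↖ ↑).
        if tl in ("right", "down_right"):
            x1 += STEP
        if tl in ("down", "down_right"):
            y1 += STEP
        if br in ("left", "up_left"):
            x2 -= STEP
        if br in ("up", "up_left"):
            y2 -= STEP
    return (x1, y1, x2, y2)
```

Because each step is a coarse classification rather than an exact regression, many easy weak decisions accumulate into an accurate final box.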

  • Training AttentionNet.

    1. Generating training samples.

    2. Minimizing the loss function by back-propagation and stochastic gradient descent:

       L = (1/2) L_softmax(y_TL, t_TL) + (1/2) L_softmax(y_BR, t_BR).
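The loss is an equally weighted sum of two softmax cross-entropies, one per output head. A minimal numpy sketch (the logit and label values in any usage are made up for illustration):

```python
import numpy as np

def softmax_ce(logits, target):
    """Softmax cross-entropy for a single 5-way prediction in R^5."""
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log-softmax
    return -log_probs[target]

def attention_loss(y_tl, t_tl, y_br, t_br):
    """L = 1/2 L_softmax(y_TL, t_TL) + 1/2 L_softmax(y_BR, t_BR)."""
    return 0.5 * softmax_ce(y_tl, t_tl) + 0.5 * softmax_ce(y_br, t_br)
```

With uniform logits each head contributes log 5, so the total is also log 5; a confident correct prediction drives the loss toward zero.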

  • Result. (Good examples.)

  • Result. (Bad examples.)

  • How to detect multiple instances?

  • Extension to multiple-instance: 1. Fast multi-scale sliding window search

    using fully-convolutional network.

  • *Fast extraction of multi-scale dense activations.

    Idea: A fully-connected layer can be equally implemented by a convolutional layer.

    The network (Conv. 1–5, FC 6–8) takes a fixed 227×227×3 input. Converting FC 6 and FC 7 into Conv. 6 and Conv. 7 makes the network fully convolutional, so a larger input (e.g. 322×322×3) yields multi-scale dense activations: a grid of 4,096-dimensional vectors, where each activation vector comes from a patch of the input.
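The FC-to-convolution equivalence is easy to verify numerically: reinterpreting the FC weight matrix as convolution kernels and sliding them over a larger input yields a grid of activation vectors whose entries match the FC output on the corresponding patches. The sizes below are toy values, not the 227×227 of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "FC" layer: flattens a 4x4 single-channel patch into 16 inputs, 3 outputs.
k, out_dim = 4, 3
W = rng.standard_normal((out_dim, k * k))

def fc(patch):
    """Fully-connected layer applied to one k x k patch."""
    return W @ patch.reshape(-1)

# The same weights viewed as out_dim convolution kernels of size k x k.
kernels = W.reshape(out_dim, k, k)

def conv_valid(image):
    """Naive 'valid' convolution: one out_dim-vector per k x k window."""
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1, out_dim))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            out[i, j] = (kernels * image[i:i + k, j:j + k]).sum(axis=(1, 2))
    return out

img = rng.standard_normal((6, 6))   # larger than the 4x4 "training" size
dense = conv_valid(img)             # 3x3 grid of dense activation vectors
# The top-left activation equals the FC output on the top-left patch:
assert np.allclose(dense[0, 0], fc(img[:4, :4]))
```

This is why one forward pass over a large image produces the dense multi-scale activations in a single shot, instead of re-running the network per window.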


  • Extension to multiple-instance: 2. Early rejection with the {↘𝑇𝐿, ↖𝐵𝑅} constraint.

    Windows satisfying {↘𝑇𝐿, ↖𝐵𝑅} start the iterative test; windows not satisfying it are rejected immediately.
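The early-rejection step keeps only those initial windows whose very first prediction is the pair (TL: ↘, BR: ↖), i.e. windows that appear to contain an object strictly inside. A minimal sketch, where `predict_directions` and the string labels are hypothetical stand-ins for the network output:

```python
# Sketch of early rejection: a window survives only if the first
# prediction pair satisfies the {TL: down-right, BR: up-left} constraint;
# everything else is discarded before the costly iterative test.
# `predict_directions` is a hypothetical stand-in for the CNN forward pass.

def early_reject(windows, image, predict_directions):
    """Return the windows satisfying the {TL: down_right, BR: up_left} constraint."""
    survivors = []
    for win in windows:
        tl, br = predict_directions(image, win)
        if tl == "down_right" and br == "up_left":
            survivors.append(win)   # eligible to start the iterative test
        # otherwise: reject immediately, saving refinement iterations
    return survivors
```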

  • Extension to multiple-instance: Overall architecture for sliding window search.

  • Extension to multiple-instance: Merging multiple bounding boxes.
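The slides do not spell out the merging rule; as a generic illustration, the sketch below greedily groups boxes whose IoU exceeds a threshold and averages each group. This is a common merging heuristic, not necessarily the paper's exact scheme.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def merge_boxes(boxes, thresh=0.5):
    """Greedily group boxes with IoU >= thresh and average each group."""
    merged, used = [], [False] * len(boxes)
    for i, seed in enumerate(boxes):
        if used[i]:
            continue
        group, used[i] = [seed], True
        for j in range(i + 1, len(boxes)):
            if not used[j] and iou(seed, boxes[j]) >= thresh:
                group.append(boxes[j])
                used[j] = True
        # One merged box per group: the coordinate-wise mean.
        merged.append(tuple(sum(b[k] for b in group) / len(group)
                            for k in range(4)))
    return merged
```

Averaging overlapping detections (rather than keeping only the top-scored one, as in NMS) fits the ensemble-of-weak-predictions spirit, since each surviving window contributes to the final box.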

  • Evaluation on PASCAL VOC series.

    [Figure: detection results on PASCAL VOC 2007 “Person” and PASCAL VOC 2012 “Person”, comparing RCNN (58.7), an RCNN-based baseline, AttentionNet, and AttentionNet+RCNN, together with a precision-recall curve on PASCAL VOC 2007 “Person”.]