  • AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)

    Donggeun Yoo, Sunggyun Park, Joon-Young Lee, Anthony Paek, In So Kweon.

  • State-of-the-art frameworks for object detection.

    1. Region-CNN framework. [Gkioxari et al., CVPR’14]

       Pipeline: Object proposal → CNN → SVM → NMS → BB Reg.

       (−) The maximally scored region is prone to focus on a discriminative part (e.g. face) rather than the entire object (e.g. human body).

  • State-of-the-art frameworks for object detection.

    2. Detection by CNN-regression. [Szegedy et al., NIPS’13]

       The CNN regresses the two box corners (x1, y1) and (x2, y2) directly from the image.

       (−) Direct mapping from an image to an exact bounding box is relatively difficult for a CNN.

  • Idea: Ensemble of weak predictions, terminated by a stop signal.

  • Model: Rather than a CNN regression model, use a CNN classification model.

    Shared trunk: Convolution → Normalization → Pooling → Convolution → Normalization → Pooling → Convolution → Convolution → Convolution, with two fully-connected output heads for top-left and bottom-right direction prediction.

    Each head predicts [ 3 directions, stop signal, no object ] ∈ ℝ⁵: the top-left head over { →, ↘, ↓, •, F } and the bottom-right head over { ←, ↖, ↑, •, F }.

  • Iterative test: Ensemble of weak directions.
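The iterative test can be sketched as a simple loop: predict one weak direction per corner on the current box, move each corner a small step, and stop when both heads emit the stop signal. The function `predict_directions`, the string labels, and the step size below are hypothetical stand-ins for illustration, not the paper's actual interface.

```python
# Sketch of the iterative test: each corner is moved one small step in the
# predicted direction until both heads emit the stop signal, or the window
# is rejected as containing no object.
# `predict_directions` is a hypothetical stand-in for the CNN forward pass:
# it returns one label per corner head.

STEP = 30  # pixels moved per weak prediction (assumed value)

def refine_box(image, box, predict_directions, max_iters=50):
    """Iteratively refine (x1, y1, x2, y2) with weak direction predictions."""
    x1, y1, x2, y2 = box
    for _ in range(max_iters):
        tl, br = predict_directions(image, (x1, y1, x2, y2))
        if tl == "no_object" or br == "no_object":
            return None                      # reject this hypothesis
        if tl == "stop" and br == "stop":
            return (x1, y1, x2, y2)          # converged on the object
        # Top-left corner may move right, down, or diagonally (→ ↘ ↓);
        # bottom-right corner may move left, up, or diagonally (← ↖ ↑).
        if tl in ("right", "down_right"):
            x1 += STEP
        if tl in ("down", "down_right"):
            y1 += STEP
        if br in ("left", "up_left"):
            x2 -= STEP
        if br in ("up", "up_left"):
            y2 -= STEP
    return (x1, y1, x2, y2)
```

Because each step is a coarse classification rather than an exact regression, many easy weak decisions accumulate into an accurate final box.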

  • Training AttentionNet.

    1. Generating training samples.

    2. Minimizing the loss function by back-propagation and stochastic gradient descent:

       L = (1/2) L_softmax(y_TL, t_TL) + (1/2) L_softmax(y_BR, t_BR).
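The loss is an equally weighted sum of two softmax cross-entropies, one per output head. A minimal numpy sketch (the logit and label values in any usage are made up for illustration):

```python
import numpy as np

def softmax_ce(logits, target):
    """Softmax cross-entropy for a single 5-way prediction in R^5."""
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log-softmax
    return -log_probs[target]

def attention_loss(y_tl, t_tl, y_br, t_br):
    """L = 1/2 L_softmax(y_TL, t_TL) + 1/2 L_softmax(y_BR, t_BR)."""
    return 0.5 * softmax_ce(y_tl, t_tl) + 0.5 * softmax_ce(y_br, t_br)
```

With uniform logits each head contributes log 5, so the total is also log 5; a confident correct prediction drives the loss toward zero.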

  • Result. (Good examples.)

  • Result. (Bad examples.)

  • How to detect multiple instances?

  • Extension to multiple-instance: 1. Fast multi-scale sliding window search

    using fully-convolutional network.

  • *Fast extraction of multi-scale dense activations.

    Idea: A fully-connected layer can be equally implemented by a convolutional layer.

    The network (Conv. 1–5, FC 6–8) takes a fixed 227×227×3 input. Converting FC 6 and FC 7 into Conv. 6 and Conv. 7 makes the network fully convolutional, so a larger input (e.g. 322×322×3) yields multi-scale dense activations: a grid of 4,096-dimensional vectors, where each activation vector comes from a patch of the input.
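The FC-to-convolution equivalence is easy to verify numerically: reinterpreting the FC weight matrix as convolution kernels and sliding them over a larger input yields a grid of activation vectors whose entries match the FC output on the corresponding patches. The sizes below are toy values, not the 227×227 of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "FC" layer: flattens a 4x4 single-channel patch into 16 inputs, 3 outputs.
k, out_dim = 4, 3
W = rng.standard_normal((out_dim, k * k))

def fc(patch):
    """Fully-connected layer applied to one k x k patch."""
    return W @ patch.reshape(-1)

# The same weights viewed as out_dim convolution kernels of size k x k.
kernels = W.reshape(out_dim, k, k)

def conv_valid(image):
    """Naive 'valid' convolution: one out_dim-vector per k x k window."""
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1, out_dim))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            out[i, j] = (kernels * image[i:i + k, j:j + k]).sum(axis=(1, 2))
    return out

img = rng.standard_normal((6, 6))   # larger than the 4x4 "training" size
dense = conv_valid(img)             # 3x3 grid of dense activation vectors
# The top-left activation equals the FC output on the top-left patch:
assert np.allclose(dense[0, 0], fc(img[:4, :4]))
```

This is why one forward pass over a large image produces the dense multi-scale activations in a single shot, instead of re-running the network per window.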


  • Extension to multiple-instance: 2. Early rejection with the {↘𝑇𝐿, ↖𝐵𝑅} constraint.

    Windows satisfying {↘𝑇𝐿, ↖𝐵𝑅} start the iterative test; windows not satisfying it are rejected immediately.
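The early-rejection step keeps only those initial windows whose very first prediction is the pair (TL: ↘, BR: ↖), i.e. windows that appear to contain an object strictly inside. A minimal sketch, where `predict_directions` and the string labels are hypothetical stand-ins for the network output:

```python
# Sketch of early rejection: a window survives only if the first
# prediction pair satisfies the {TL: down-right, BR: up-left} constraint;
# everything else is discarded before the costly iterative test.
# `predict_directions` is a hypothetical stand-in for the CNN forward pass.

def early_reject(windows, image, predict_directions):
    """Return the windows satisfying the {TL: down_right, BR: up_left} constraint."""
    survivors = []
    for win in windows:
        tl, br = predict_directions(image, win)
        if tl == "down_right" and br == "up_left":
            survivors.append(win)   # eligible to start the iterative test
        # otherwise: reject immediately, saving refinement iterations
    return survivors
```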

  • Extension to multiple-instance: Overall architecture for sliding window search.

  • Extension to multiple-instance: Merging multiple bounding boxes.
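The slides do not spell out the merging rule; as a generic illustration, the sketch below greedily groups boxes whose IoU exceeds a threshold and averages each group. This is a common merging heuristic, not necessarily the paper's exact scheme.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def merge_boxes(boxes, thresh=0.5):
    """Greedily group boxes with IoU >= thresh and average each group."""
    merged, used = [], [False] * len(boxes)
    for i, seed in enumerate(boxes):
        if used[i]:
            continue
        group, used[i] = [seed], True
        for j in range(i + 1, len(boxes)):
            if not used[j] and iou(seed, boxes[j]) >= thresh:
                group.append(boxes[j])
                used[j] = True
        # One merged box per group: the coordinate-wise mean.
        merged.append(tuple(sum(b[k] for b in group) / len(group)
                            for k in range(4)))
    return merged
```

Averaging overlapping detections (rather than keeping only the top-scored one, as in NMS) fits the ensemble-of-weak-predictions spirit, since each surviving window contributes to the final box.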

  • Evaluation on PASCAL VOC series.

    [Figure: detection results on PASCAL VOC 2007 “Person” and PASCAL VOC 2012 “Person”, comparing RCNN (58.7), an RCNN-based baseline, AttentionNet, and AttentionNet+RCNN, together with a precision-recall curve on PASCAL VOC 2007 “Person”.]