
WEAKLY- AND SEMI-SUPERVISED PANOPTIC SEGMENTATION
Qizhu Li*, Anurag Arnab*, Philip H.S. Torr

* Indicates equal contribution


INTRODUCTION

We present a weakly supervised model that jointly performs both semantic and instance segmentation – a particularly relevant problem given the substantial cost of obtaining pixel-perfect annotation for these tasks. In contrast to many popular instance segmentation approaches based on object detectors, our method does not predict any overlapping instances. Moreover, we are able to segment both “thing” and “stuff” classes, and thus explain all the pixels in the image.

QUANTITATIVE RESULTS

Table 1. Semantic and instance segmentation performance on Pascal VOC with varying levels of supervision. We obtain state-of-the-art results for both full and weak supervision.

  Supervision (VOC)  Supervision (COCO) | $IoU$  $AP^r_{vol}$  $PQ$
  Weak               Weak               | 75.7   55.5          59.5
  Weak               Full               | 75.8   56.1          59.8
  Full               Weak               | 77.5   58.9          62.7
  Full               Full               | 79.0   59.5          63.1

Table 2. Semantic segmentation results on the Cityscapes val. set. $FS\%$ denotes the weakly supervised $IoU$ as a percentage of the fully supervised $IoU$. Using more informative bounding-box cues for “thing” classes leads to a higher $FS\%$ than for “stuff” classes, which are trained with only image-level tags.

  Method     | $IoU$ (weak)  $IoU$ (full)  $FS\%$
  Ours (th.) | 68.2          70.4          96.9
  Ours (st.) | 60.2          72.4          83.1
  Ours (all) | 63.6          71.6          88.8

Table 3. Instance-level segmentation results on Cityscapes. On the validation set, we report results for both “thing” (th.) and “stuff” (st.) classes. The online server, which evaluates the test set, only computes $AP^r_{vol}$ for “thing” classes. We compare to other fully supervised methods which produce non-overlapping instances.

                                     |              Validation               | Test
  Method                             | $AP^r_{vol}$        | $PQ$            | $AP^r_{vol}$
                                     | th.   st.   all     | th.  st.  all   | th.
  Ours (weak, ImageNet init.)        | 17.0  33.1  26.3    | 35.8 43.9 40.5  | 12.8
  Ours (full, ImageNet init.)        | 24.3  42.6  34.9    | 39.6 52.9 47.3  | 18.8
  Ours (full, PSPNet [8] init.) [1]  | 28.6  52.6  42.5    | 42.5 62.1 53.8  | 23.4
  Pixel Encoding [3]                 |  9.9  -     -       | -    -    -     |  8.9
  RecAttend [4]                      |  -    -     -       | -    -    -     |  9.5
  InstanceCut [5]                    |  -    -     -       | -    -    -     | 13.0
  DWT [6]                            | 21.2  -     -       | -    -    -     | 19.4
  SGN [7]                            | 29.2  -     -       | -    -    -     | 25.0
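For reference, the panoptic quality metric reported above follows the standard definition of Kirillov et al.’s “Panoptic Segmentation”, which matches predicted segments $p$ to ground-truth segments $g$ (a match is a true positive when $IoU(p,g) > 0.5$) and scores:

$$PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|}$$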

QUALITATIVE RESULTS

[Figure panels]
Semantic and instance segmentation on Pascal VOC (weak, semi, full)
Semantic segmentation on Cityscapes (weak, full)
Instance-level segmentation on Cityscapes (weak, full)
(a) Input image (b) Weakly supervised model (c) Fully supervised model

SEGMENTATION NETWORK STRUCTURE

[Network diagram] The input image ($H \times W \times 3$) feeds two modules. The category-level segmentation module, a fully convolutional network, outputs semantic scores of shape $H \times W \times (C_s + C_t)$; a “thing” detector outputs $D_t$ detections ($5 \times D_t$). The instance-level segmentation module computes the box consistency term and the global term from these, each of shape $H \times W \times (C_s + D_t)$, and passes them to an instance CRF whose output is the instance-level segmentation, also of shape $H \times W \times (C_s + D_t)$. (Diagram legend distinguishes forward-and-backward from forward-only computation paths.)

We use the network architecture proposed in our previous fully supervised work [1], which produces non-overlapping instances. Each of the $D_t$ detections (a variable number per image) defines a possible “thing” instance, and we assume that there can only be a single instance of each “stuff” class in an image. Therefore, there are $(C_s + D_t)$ instances per image which we need to label.

The box consistency term $\psi_{Box}$ encourages pixels inside a bounding box $B_i$ (given by the detector for “things”, or covering the whole image for “stuff”) to associate with the $i$-th instance:

$$\psi_{Box}(V_k = i) = \begin{cases} s_i\, Q_k(l_i), & k \in B_i \\ 0, & \text{otherwise} \end{cases}$$

where $s_i$ is the detection score and $Q_k(l_i)$ is the semantic segmentation probability of pixel $k$ taking the class label $l_i$ of instance $i$.

The global term $\psi_{Global}$ handles poor detection localisation:

$$\psi_{Global}(V_k = i) = Q_k(l_i)$$
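As a concrete illustration, here is a minimal NumPy sketch of the two cues. The array layout, the helper name, and the treatment of “stuff” classes as full-image boxes with score 1 are our assumptions for exposition, not the authors’ released code:

```python
import numpy as np

def unary_cues(Q, thing_dets, stuff_labels):
    """Compute psi_Box and psi_Global for every candidate instance.

    Q            : (H, W, C) semantic segmentation probabilities.
    thing_dets   : list of ((y0, x0, y1, x1), score, label) detections.
    stuff_labels : semantic labels treated as "stuff" (one instance each,
                   modelled here as a full-image box with score 1).
    """
    H, W, _ = Q.shape
    instances = [((0, 0, H, W), 1.0, l) for l in stuff_labels] + list(thing_dets)
    n = len(instances)  # C_s + D_t candidate instances to label
    psi_box = np.zeros((H, W, n), dtype=Q.dtype)
    psi_global = np.zeros((H, W, n), dtype=Q.dtype)
    for i, ((y0, x0, y1, x1), s_i, l_i) in enumerate(instances):
        psi_global[:, :, i] = Q[:, :, l_i]                      # Q_k(l_i) everywhere
        psi_box[y0:y1, x0:x1, i] = s_i * Q[y0:y1, x0:x1, l_i]   # s_i * Q_k(l_i) inside B_i
    return psi_box, psi_global
```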

We use the same CRF formulation as our earlier work [1], with densely connected pairwise terms [2]:

$$E(\mathbf{V} = \mathbf{v}) = -\sum_{i}^{N} \ln\big(w_1\,\psi_{Box}(v_i) + w_2\,\psi_{Global}(v_i) + \varepsilon\big) + \sum_{i<j}^{N} \psi_{Pairwise}(v_i, v_j)$$
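The unary part of this energy is straightforward to assemble from the two cues above. Below is a sketch in which the weights $w_1$, $w_2$, $\varepsilon$ and the dense-CRF parameters are illustrative values, not the authors’ settings; the pairwise term is delegated to mean-field inference in the pydensecrf package rather than written out:

```python
import numpy as np
import pydensecrf.densecrf as dcrf

def instance_crf(image, psi_box, psi_global, w1=1.0, w2=1.0, eps=1e-6, steps=5):
    """image : (H, W, 3) uint8 RGB image; psi_* : (H, W, n) unary cues."""
    H, W, n = psi_box.shape
    # Unary energy: -ln(w1*psi_Box + w2*psi_Global + eps), per pixel and instance.
    unary = -np.log(w1 * psi_box + w2 * psi_global + eps)
    crf = dcrf.DenseCRF2D(W, H, n)
    crf.setUnaryEnergy(np.ascontiguousarray(
        unary.reshape(-1, n).T, dtype=np.float32))    # expects (n, H*W) float32
    # Densely connected pairwise terms with Gaussian edge potentials [2].
    crf.addPairwiseBilateral(sxy=80, srgb=13, rgbim=image, compat=10)
    q = np.array(crf.inference(steps))                # (n, H*W) marginals
    return q.argmax(axis=0).reshape(H, W)             # per-pixel instance labels
```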


[1] A. Arnab et al. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.

[2] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.

[3] J. Uhrig et al. Pixel-level encoding and depth layering for instance-level semantic labeling. In GCPR, 2016.

[4] M. Ren and R. S. Zemel. End-to-end instance segmentation with recurrent attention. In CVPR, 2017.

[5] A. Kirillov et al. InstanceCut: from edges to instances with multicut. In CVPR, 2017.

[6] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.

[7] S. Liu et al. SGN: Sequential grouping networks for instance segmentation. In ICCV, 2017.

[8] H. Zhao et al. Pyramid scene parsing network. In CVPR, 2017.

Project page: qizhuli.github.io/publication/weakly-supervised-panoptic-segmentation/

Code release: github.com/qizhuli/Weakly-Supervised-Panoptic-Segmentation


WEAKLY- AND SEMI-SUPERVISED TRAINING

[Pipeline diagram] Pseudo ground truth is generated by merging the outputs of two branches:

“STUFF” BRANCH: image-level tags → train multi-label classifier → class activation maps (CAM).
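A minimal sketch of the CAM computation (Zhou et al., CVPR 2016) that produces such weak localisation cues; the tensor names and the use of PyTorch are our assumptions, not the authors’ implementation:

```python
import torch

def class_activation_maps(features, fc_weight):
    """features  : (C_feat, h, w) conv features before global average pooling.
    fc_weight : (num_classes, C_feat) weights of the final linear classifier.
    Returns per-class activation maps of shape (num_classes, h, w)."""
    return torch.einsum('kc,chw->khw', fc_weight, features)
```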

“THING” BRANCH: bounding boxes → run MCG and GrabCut → coarse foreground masks (FM).
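A sketch of how a bounding box can be refined into a coarse foreground mask with OpenCV’s GrabCut; the MCG proposals, which the branch also uses, are omitted here, and the helper name and iteration count are illustrative:

```python
import cv2
import numpy as np

def box_to_foreground_mask(image_bgr, box, iters=5):
    """image_bgr : (H, W, 3) uint8 image; box : (x, y, w, h) detection box."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, tuple(box), bgd_model, fgd_model,
                iters, cv2.GC_INIT_WITH_RECT)
    # Keep pixels labelled foreground or probably-foreground.
    return ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
```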

ITERATIVE TRAINING: the merged pseudo ground truth is used to train the segmentation network; the network’s predictions are then merged back into better pseudo ground truth, and the train-and-merge cycle is repeated n times.
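Schematically, the loop looks as follows; every function here is a hypothetical placeholder standing in for the corresponding pipeline stage above, not the released code’s API:

```python
def iterative_training(images, weak_cues, n_iters=5):
    # Iteration 0: pseudo ground truth merged from CAMs ("stuff") and
    # GrabCut/MCG foreground masks ("things"), as in the two branches above.
    pseudo_gt = merge_branches(weak_cues)
    for _ in range(n_iters):
        model = train_segmentation_network(images, pseudo_gt)
        predictions = [model(image) for image in images]
        # Refine: combine predictions with the weak cues; ambiguous
        # regions become "ignore" labels excluded from the loss.
        pseudo_gt = refine_pseudo_gt(predictions, weak_cues)
    return model
```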

Figure 1. (1a) Input image; (1b) localisation heatmaps; (1c) approximate ground truth. Approximate ground truth generated from image-level tags using weak localisation cues from a multi-label classification network.

Figure 2. (2a) Bounding boxes; (2b) approximate semantic ground truth; (2c) approximate instance ground truth. Approximate ground truth generated from bounding boxes using coarse object masks from MCG and GrabCut. (Class legend: road, building, vegetation, sky.)

Figure 3. (3a) Input image; (3b) iteration 0; (3c) iteration 2; (3d) iteration 5; (3e) ground truth; (3f) $PQ$ vs. iteration. By using the output of the trained network, the initial approximate ground truth produced above (Iteration 0) can be iteratively refined. Black regions are “ignore” labels over which the loss is not computed during training. Note that for instance segmentation, permutations of instance labels of the same class are equivalent. The panoptic quality ($PQ$) of our panoptic segmentation results improves significantly with iterative training.