WEAKLY- AND SEMI-SUPERVISED PANOPTIC SEGMENTATION
Qizhu Li*, Anurag Arnab*, Philip H.S. Torr
* Indicates equal contribution

INTRODUCTION
We present a weakly supervised model that jointly performs both semantic and instance segmentation – a particularly relevant problem given the substantial cost of obtaining pixel-perfect annotations for these tasks. In contrast to many popular instance segmentation approaches based on object detectors, our method does not predict any overlapping instances. Moreover, we are able to segment both "thing" and "stuff" classes, and thus explain all the pixels in the image.

QUANTITATIVE RESULTS

Table 1. Semantic and instance segmentation performance on Pascal VOC with varying levels of supervision. We obtain state-of-the-art results for both full and weak supervision.

VOC sup.  COCO sup. |  IoU  AP_vol  AP_0.5
Weak      Weak      | 75.7    55.5    59.5
Weak      Full      | 75.8    56.1    59.8
Full      Weak      | 77.5    58.9    62.7
Full      Full      | 79.0    59.5    63.1

Table 2. Semantic segmentation results on the Cityscapes validation set. Using more informative bounding-box cues for "thing" classes leads to a higher percentage of fully supervised performance than for "stuff" classes, which are trained with only image-level tags.

Method      IoU (weak)  IoU (full)     %
Ours (th.)        68.2        70.4  96.9
Ours (st.)        60.2        72.4  83.1
Ours (all)        63.6        71.6  88.8

Table 3. Instance-level segmentation results on Cityscapes. On the validation set, we report results for both "thing" (th.) and "stuff" (st.) classes. The online server, which evaluates the test set, only computes the AP for "thing" classes.

Method                            | Val. AP           | Val. AP_0.5       | Test AP
                                  |  th.   st.   all  |  th.   st.   all  |  th.
Ours (weak, ImageNet init.)       | 17.0  33.1  26.3  | 35.8  43.9  40.5  | 12.8
Ours (full, ImageNet init.)       | 24.3  42.6  34.9  | 39.6  52.9  47.3  | 18.8
Ours (full, PSPNet [8] init.) [1] | 28.6  52.6  42.5  | 42.5  62.1  53.8  | 23.4
Pixel Encoding [3]                |  9.9     -     -  |    -     -     -  |  8.9
RecAttend [4]                     |    -     -     -  |    -     -     -  |  9.5
InstanceCut [5]                   |    -     -     -  |    -     -     -  | 13.0
DWT [6]                           | 21.2     -     -  |    -     -     -  | 19.4
SGN [7]                           | 29.2     -     -  |    -     -     -  | 25.0
We compare to other fully supervised methods which produce non-overlapping instances.

Results are reported for:
- Semantic and instance segmentation on Pascal VOC (weak, semi, full)
- Semantic segmentation on Cityscapes (weak, full)
- Instance-level segmentation on Cityscapes (weak, full)

SEGMENTATION NETWORK STRUCTURE

Network structure (figure): the input image (W × H × 3) is passed to a "thing" detector and a fully convolutional network, which together form the category-level segmentation module. The box consistency and global terms derived from their outputs feed an instance CRF (the instance-level segmentation module), which produces the final W × H × (D + S) instance-level segmentation. The figure also marks which components are run forward and backward during training and which are run forward only.

We use the network architecture proposed in our previous fully supervised work [1], which produces non-overlapping instances. Each of the D detections (a variable number per image) defines a possible "thing" instance. We assume that there can only be a single instance of each "stuff" class in an image. Therefore, there can be (D + S) instances per image which we need to label, where S is the number of "stuff" classes.

The box consistency term encourages pixels inside a bounding box B_k (given by the detector for "things", or covering the whole image for "stuff") to associate with the k-th instance:

    ψ_Box(v_i = k) = { s_k,  i ∈ B_k
                     { 0,    otherwise

where s_k is the score of the k-th detection.

The global term handles poor detection localisation:

    ψ_Glob(v_i = k) = Q_i(l_k)

where Q_i(l_k) is the semantic segmentation probability of pixel i taking the class label l_k of the k-th instance.

We use the same CRF formulation as our earlier work [1] with densely connected pairwise terms [2]:

    E(v) = Σ_i −ln( w_1 ψ_Box(v_i) + w_2 ψ_Glob(v_i) ) + Σ_{i<j} ψ_Pair(v_i, v_j)

QUALITATIVE RESULTS

Qualitative comparison (figure): (a) Input image; (b) Weakly supervised model; (c) Fully supervised model.

[1] A. Arnab et al. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.
[2] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
[3] J. Uhrig et al. Pixel-level encoding and depth layering for instance-level semantic labeling. In GCPR, 2016.
[4] M. Ren and R. S. Zemel. End-to-end instance segmentation with recurrent attention. In CVPR, 2017.
[5] A. Kirillov et al. InstanceCut: from edges to instances with multicut. In CVPR, 2017.
[6] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
[7] S. Liu et al.
SGN: Sequential grouping networks for instance segmentation. In ICCV, 2017.
[8] H. Zhao et al. Pyramid scene parsing network. In CVPR, 2017.

Project page: qizhuli.github.io/publication/weakly-supervised-panoptic-segmentation/
Code release: github.com/qizhuli/Weakly-Supervised-Panoptic-Segmentation

WEAKLY- AND SEMI-SUPERVISED TRAINING

"Stuff" branch: image-level tags → train a multi-label classifier → class activation maps (CAMs).
"Thing" branch: bounding boxes → run MCG and GrabCut → coarse foreground masks (FMs).
The outputs of the two branches are merged into pseudo ground truth, on which the segmentation network is trained. Iterative training: the network's predictions are merged back in to produce better pseudo ground truth, and this train-and-relabel cycle is repeated n times.

Figure 1. Approximate ground truth generated from image-level tags using weak localisation cues from a multi-label classification network. (1a) Input image; (1b) Localisation heatmaps; (1c) Approximate ground truth.

Figure 2. Approximate ground truth generated from bounding boxes using coarse object masks from MCG and GrabCut. (2a) Bounding boxes; (2b) Approximate semantic ground truth; (2c) Approximate instance ground truth.

Figure 3. (3a-3e) By using the output of the trained network, the initial approximate ground truth produced above (Iteration 0) can be iteratively refined. Black regions are "ignore" labels over which the loss is not computed during training. Note that for instance segmentation, permutations of instance labels of the same class are equivalent. (3f) The panoptic quality (PQ) of our panoptic segmentation results shows a significant improvement due to iterative training. Panels: (3a) Input image; (3b) Iteration 0; (3c) Iteration 2; (3d) Iteration 5; (3e) Ground truth; (3f) PQ vs. iteration. Colour legend: road, building, vegetation, sky.
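The merge of the two branches above can be sketched in code. This is a minimal illustration under stated assumptions, not the released implementation: the array shapes, the CAM threshold, the `IGNORE` value, and the function name are all hypothetical.

```python
import numpy as np

# Hypothetical sketch of the pseudo-ground-truth merge described above:
# "stuff" cues come from thresholded class activation maps (CAMs), "thing"
# cues from per-detection foreground masks (e.g. MCG + GrabCut). Pixels
# claimed by no cue, or by conflicting "thing" cues, get an "ignore" label
# so no loss is computed there during training.

IGNORE = 255  # common convention for an ignore label

def merge_pseudo_gt(cams, stuff_ids, thing_masks, thing_ids, cam_thresh=0.2):
    """cams: (S, H, W) CAM scores in [0, 1] for S stuff classes.
    thing_masks: (D, H, W) boolean foreground masks for D detections.
    Returns an (H, W) semantic pseudo-ground-truth map."""
    S, H, W = cams.shape
    gt = np.full((H, W), IGNORE, dtype=np.uint8)

    # Stuff: assign each pixel to its highest-scoring CAM, if confident enough.
    best = cams.argmax(axis=0)
    conf = cams.max(axis=0)
    stuff_ids = np.asarray(stuff_ids)
    gt[conf > cam_thresh] = stuff_ids[best][conf > cam_thresh]

    # Things: detection masks override stuff labels; pixels covered by
    # overlapping detections of different classes become "ignore".
    claimed = np.zeros((H, W), dtype=bool)
    for mask, cls in zip(thing_masks, thing_ids):
        gt[mask & claimed & (gt != cls)] = IGNORE
        gt[mask & ~claimed] = cls
        claimed |= mask
    return gt
```

In the iterative-training loop, the network's own predictions would simply replace (or be intersected with) these first-round cues before re-running the merge.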
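The box consistency and global terms from the segmentation network structure can likewise be sketched as unary energies for the instance CRF. This is a hedged sketch of the general idea only: `w1`, `w2`, the numerical epsilon, and all names are assumptions, and "stuff" instances (whose "box" spans the whole image) are omitted for brevity.

```python
import numpy as np

# Illustrative (not the authors') computation of the two unary terms:
# psi_box gives score s_k to pixels inside detection box B_k, 0 outside;
# psi_glob gives each pixel the semantic probability of the detection's
# class, which helps recover instances when the box is poorly localised.

def instance_unaries(Q, detections, w1=1.0, w2=1.0):
    """Q: (H, W, C) semantic softmax output.
    detections: list of (class_id, score, (x1, y1, x2, y2)) "thing" boxes.
    Returns (H, W, K) unary energies for the K = len(detections) instances."""
    H, W, _ = Q.shape
    K = len(detections)
    box = np.zeros((H, W, K))
    glob = np.zeros((H, W, K))
    for k, (cls, score, (x1, y1, x2, y2)) in enumerate(detections):
        box[y1:y2, x1:x2, k] = score      # box consistency term
        glob[:, :, k] = Q[:, :, cls]      # global term
    # Unary energy: -ln(w1 * psi_box + w2 * psi_glob); the epsilon
    # guards against log(0) where both terms vanish.
    return -np.log(w1 * box + w2 * glob + 1e-12)
```

Pixels inside a confident detection box get a lower energy (stronger pull) towards that instance than pixels outside it; the pairwise Gaussian terms [2] would then sharpen the labelling to image edges.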