Built-in Foreground/Background Prior for Weakly-Supervised Semantic Segmentation

Fatemehsadat Saleh 1,2, M. Sadegh Ali Akbarian 1,2, Mathieu Salzmann 3, Lars Petersson 1,2, Stephen Gould 1, and Jose M. Alvarez 1,2
1 The Australian National University (ANU)   2 Data61-CSIRO, Australia   3 CVLab, EPFL, Switzerland
For further information: Fatemehsadat Saleh, E: [email protected], www.data61.csiro.au

Introduction

Goal: assign a semantic label to every pixel in the image.
Problem: acquiring the huge amount of pixel-level annotations this requires is expensive.
Our approach: we aim at using one of the weakest levels of annotation, image-level tags.

Drawbacks of current approaches using image-level tags:
• Poor localization and inaccurate object boundaries.
• Additional priors require pixel-level annotations or bounding boxes.

[Figure: different types of annotation used in related work, illustrated on a horse/person image: pixel-level annotation, image-level annotation [2,4,5], point-level annotation [3], and bounding-box annotation [4,2].]

Contributions

• A method to extract accurate foreground/background masks from a network pre-trained for object recognition.
• A novel loss function to incorporate these masks during training.
• A new form of weak supervision, where the user selects the best mask among several automatically generated candidates.

Our Method

Built-in Foreground/Background Model

From a network pre-trained on ImageNet, we propose to exploit the unit activations of the hidden layers to extract a foreground/background mask. We make use of the fourth and fifth convolutional layers to compute foreground probabilities, which act as unary potentials in a fully-connected CRF [9].

[Figure: mask-extraction pipeline: Image -> Fourth Conv. -> Fifth Conv. -> Fusion -> Mask.]

Benefits
• The foreground/background mask is extracted without relying on an external method.
• No additional annotations are required.

Novel Loss Function

Let S_{i,j,k} denote the network's score for class k at pixel (i,j) of image I, M the foreground mask with complement \bar{M}, L the set of labels present in the image (including the background label 0), and \bar{L} the set of absent labels. We define the mean foreground and background scores

S^{fg}_k = \frac{1}{|M|} \sum_{(i,j) \in M} S_{i,j,k}, \qquad S^{bg}_0 = \frac{1}{|\bar{M}|} \sum_{(i,j) \in \bar{M}} S_{i,j,0},

and train with the loss

L_{mask} = -\frac{1}{|L|-1} \sum_{k \in L,\, k \neq 0} \log S^{fg}_k \;-\; \log S^{bg}_0 \;-\; \frac{1}{|\bar{L}|\,|I|} \sum_{(i,j) \in I,\, k \in \bar{L}} \log\big(1 - S_{i,j,k}\big).
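As a rough illustration, the loss above can be sketched in a few lines of numpy. This is our own minimal sketch, not the authors' code: the function name, argument shapes, and the epsilon for numerical stability are assumptions.

```python
import numpy as np

def mask_loss(scores, fg_mask, present, absent, eps=1e-8):
    """Sketch of the mask-based loss (assumed names/shapes, not the authors' code).

    scores:  (H, W, K) per-pixel class probabilities, class 0 = background.
    fg_mask: (H, W) boolean foreground mask M from the built-in prior.
    present: indices of classes present in the image (excluding background 0).
    absent:  indices of absent classes (the set L-bar).
    """
    bg_mask = ~fg_mask
    # S_k^fg: mean score of each class over the foreground mask M
    s_fg = scores[fg_mask].mean(axis=0)            # shape (K,)
    # S_0^bg: mean background score over the complement M-bar
    s_bg0 = scores[bg_mask][:, 0].mean()
    # Term 1: present classes should dominate the foreground
    t1 = -np.mean([np.log(s_fg[k] + eps) for k in present])
    # Term 2: the background class should dominate outside the mask
    t2 = -np.log(s_bg0 + eps)
    # Term 3: absent classes should be suppressed at every pixel of the image
    t3 = -np.mean(np.log(1.0 - scores[:, :, absent] + eps))
    return t1 + t2 + t3
```

Predictions that put present classes in the foreground, background outside, and no mass on absent classes incur a small loss; violating any of the three constraints drives the corresponding term up.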
This loss encodes three constraints:
• Present classes should appear in the foreground mask.
• No pixel in the whole image should take on an absent label.
• Pixels predicted as background should be assigned to the background class.

Our Method (Cont.)

Novel Weak Supervision

• The mask obtained by inference in the CRF is not always the desired one.
• We therefore generate multiple, diverse, low-energy predictions (the M-best problem [1]).
• A user can then decide which prediction is the best one.
• This manual selection takes roughly 2-3 seconds per image.

Experiments

Network Structure

[Figure: network architecture.]

Mask Evaluation

Method                                        | Mean IOU
Mask obtained using the objectness method [8] | 52.3%
Mask obtained using MCG                       | 50.2%
Our masks                                     | 60.1%

Mask evaluation results on 10% of the PASCAL VOC training set.

[Figure: qualitative mask comparison: image, fused probability map, objectness map [8], MCG map, our mask, objectness mask [8], MCG mask.]

Semantic Segmentation Results

Method                        | Validation | Test
[2] MIL + sspxl               | 36.6%      | 35.8%
[3] What's the point (w/ Obj) | 32.2%      | ---
[4] EM-Adapt                  | 38.2%      | 39.6%
[5] CCNN                      | 35.3%      | 35.6%
Ours (Tag)                    | 46.6%      | 48.0%

Mean IOU on PASCAL VOC using image tags.

Method               | Additional supervision | Validation | Test
[2] MIL              | + bbox                 | 37.8%      | 37.0%
[2] MIL              | + seg                  | 42.0%      | 40.6%
[6] SN-B             | + MCG seg              | 41.9%      | 43.2%
[7] STC              | + add. training data   | 49.8%      | 51.2%
[3] What's the point | + 1 point              | 42.7%      | 43.6%
[5] CCNN             | + size info.           | 42.4%      | 45.1%
Ours (CheckMask)     | mask selection         | 51.5%      | 52.9%

Mean IOU on PASCAL VOC using additional supervision.

Method           | Mean IOU
[5] CCNN (Tags)  | 32.2%
Ours (Tags)      | 39.0%
Ours (CheckMask) | 46.3%

Mean IOU on the PASCAL VOC validation set using Flickr for training.

[Figure: qualitative segmentation results: image, baseline, Ours (Tags) on PASCAL, Ours (CheckMask) on PASCAL, Ours (Tags) on Flickr, Ours (CheckMask) on Flickr, ground truth.]

References:
[1] Batra, D., et al.: Diverse M-best solutions in Markov random fields. In: ECCV (2012)
[2] Pinheiro, P.O., et al.: From image-level to pixel-level labeling with convolutional networks. In: CVPR (2015)
[3] Bearman, A., et al.: What's the point: Semantic segmentation with point supervision. ArXiv e-prints (2015)
[4] Papandreou, G., et al.: Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: ICCV (2015)
[5] Pathak, D., et al.: Constrained convolutional neural networks for weakly supervised segmentation. In: ICCV (2015)
[6] Wei, Y., et al.: Learning to segment with image-level annotations. Pattern Recognition (2016)
[7] Wei, Y., et al.: STC: A simple to complex framework for weakly-supervised semantic segmentation. ArXiv e-prints (2015)
[8] Alexe, B., et al.: Measuring the objectness of image windows. IEEE Trans. Pattern Analysis and Machine Intelligence (2012)
[9] Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: NIPS (2011)
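To make the built-in prior concrete, the fusion of fourth- and fifth-layer activations into a foreground map can be sketched as below. This is a deliberately simplified toy version: the channel-averaging, min-max normalization, equal-weight fusion, and hard threshold are our own assumptions, and the paper instead uses the fused probabilities as unary potentials in a fully-connected CRF [9] rather than thresholding them.

```python
import numpy as np

def fg_prior(conv4, conv5, thresh=0.5):
    """Toy sketch of the built-in foreground prior (simplified, assumed details).

    conv4, conv5: (C, H, W) activation maps from a network pre-trained on
    ImageNet, assumed already resized to the same spatial resolution.
    Returns a per-pixel foreground probability map and a binary mask.
    """
    def heat(x):
        m = x.mean(axis=0)               # average activations over channels
        m = m - m.min()
        return m / (m.max() + 1e-8)      # normalize to [0, 1]
    prob = 0.5 * (heat(conv4) + heat(conv5))  # fuse the two layers
    # The paper feeds these probabilities to a dense CRF; here we threshold.
    return prob, prob > thresh
```

High-activation regions of the two convolutional layers then mark the foreground, with everything else treated as background.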