Fully Convolutional Networks for Semantic Segmentation. By Jonathan Long*, Evan Shelhamer*, Trevor Darrell
Instance-sensitive Fully Convolutional Networks. By Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, Jian Sun
Presented by Zilong Bai, [email protected]
Fully Convolutional Networks for Semantic Segmentation. By Jonathan Long*, Evan Shelhamer*, Trevor Darrell
Instance-sensitive Fully Convolutional Networks. By Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, Jian Sun
To be honest, full convolution is just another way of thinking…
But it makes a significant difference in training and in maintaining the network structure in implementation!
- Only convolution kernels are maintained; downsampling ratios are controlled by strides.
- Arbitrary input size.
- Faster, compared to a naive patch-wise implementation!
Layer 6 can be generated with a kernel of size 13 x 13 x d_5 that spans its entire input, so it does not move around.
Layer 7 can be generated with a kernel of size 1 x 1 x d_6: at the training size its input is 1 x 1, so it too does not move around.
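The equivalence above can be sketched in numpy: a fully connected layer over a feature map is the same computation as a convolution whose kernel covers the whole map. The sizes below are illustrative toy dimensions, not the actual network's.

```python
import numpy as np

# Toy sizes: a 13 x 13 x d feature map and n_out output units.
h, w, d, n_out = 13, 13, 5, 4
rng = np.random.default_rng(0)
feat = rng.standard_normal((h, w, d))
weights = rng.standard_normal((n_out, h, w, d))  # one full-size kernel per output unit

# Fully connected view: flatten the feature map and matrix-multiply.
fc_out = weights.reshape(n_out, -1) @ feat.reshape(-1)

# Convolutional view: each kernel is as large as its input, so there is
# exactly one valid position -- the kernel "does not move around".
conv_out = np.array([(k * feat).sum() for k in weights])

assert np.allclose(fc_out, conv_out)
```

On larger test-time inputs the same kernels simply slide, producing a spatial map of outputs instead of a single vector.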
Each semantic segmentation ground-truth image actually needs to be divided into (number of classes + 1) slices, and each slice corresponds to the ground-truth heat map of one category (the +1 is background).
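This slicing amounts to one-hot encoding the label map along a new channel axis; a minimal sketch with toy sizes (names and shapes are illustrative):

```python
import numpy as np

# Toy 2 x 2 ground-truth label map: 0 = background, 1..n_classes = categories.
labels = np.array([[0, 1],
                   [2, 2]])
n_classes = 2

# One binary H x W slice per category (plus background): shape H x W x (n_classes + 1).
slices = np.stack([(labels == c).astype(np.float32)
                   for c in range(n_classes + 1)], axis=-1)

# Each pixel is "on" in exactly one slice -- its ground-truth category.
assert slices.shape == (2, 2, n_classes + 1)
assert slices.sum() == labels.size
```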
Take FCN-16s for instance: pool4 and conv7 are fused in the following steps:
1. Add a 1 x 1 convolution layer on top of pool4 to produce additional class predictions.
   a. The output predictions of pool4 are at stride 16.
2. 2x upsample the output of conv7, which is at stride 32.
   a. The upsampled conv7 predictions are then at stride 16 as well.
3. Add these stride-16 predictions together.
4. Upsample the summed stride-16 predictions back to image size.
NOTE: ALL the weights can be learned. The upsampling weights can be initialized with bilinear interpolation.
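The bilinear initialization mentioned in the NOTE can be sketched as follows. This mirrors the commonly used recipe for seeding learnable deconvolution (transposed convolution) filters before fine-tuning; it is a sketch, not necessarily the paper's exact code.

```python
import numpy as np

def bilinear_kernel(factor):
    """Weights for a transposed convolution that performs bilinear
    upsampling by the given integer factor.

    Used to initialize the learnable upsampling layers; training can
    then adjust them away from pure interpolation.
    """
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    # Separable tent (triangle) filter in each spatial dimension.
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

# Example: 2x upsampling uses a 4 x 4 kernel.
k = bilinear_kernel(2)
assert k.shape == (4, 4)
```

Each class channel gets its own copy of this kernel, with no cross-channel mixing at initialization.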
Training + Testing
- Train on a full image at a time, without patch sampling.
- Reshape the network to take input of any size.
- Forward time is ~100 ms for a 500 x 500 x 21 output (this is really fast!)
*Simultaneous Detection and Segmentation Hariharan et al. ECCV14
Results (columns: FCN, SDS*, ground truth, input)
Relative to prior state-of-the-art SDS:
- 30% relative improvement in mean IoU
- 286× faster
Ghosts sitting on that boat?!!
Qualitative Results
Experimental Setup
1) AlexNet architecture.
2) VGG nets: pick the VGG 16-layer net.
3) GoogLeNet: use only the final loss layer, and improve performance by discarding the final average pooling layer.
*Decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions.
A boring extension: if we directly use shallower layers and upsample without fusing with deeper layers, how bad would it be?
An interesting, promising, and intuitive extension: what the next paper attempted to address =>
Instance-sensitive Fully Convolutional Networks
Jifeng Dai, Kaiming He, Jian Sun. Microsoft Research
Yi Li. Tsinghua University (while interning at Microsoft Research)
Shaoqing Ren. University of Science and Technology of China (while interning at Microsoft Research)
Problem to solve:Instance-level Segmentation. Pixels in, pixels out.
Major Contributions
A fully convolutional network architecture that:
1) Computes a set of instance-sensitive score maps
   a) Each pixel is a classifier of relative positions to an object instance
   b) The maps are assembled to output an instance candidate at each position
2) Reuses semantic segmentation results
3) Exploits image local coherence
   a) without any high-dimensional layer related to the mask resolution (compare with DeepMask)
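The per-position assembling of instance-sensitive score maps (1b above) can be sketched in numpy: a sliding window is split into a k x k grid, and cell (i, j) copies its pixels from the score map responsible for that relative position. Window size, k, and all names below are illustrative, not the paper's implementation.

```python
import numpy as np

def assemble_instance(score_maps, y, x, win, k=3):
    """Assemble one instance candidate from k*k instance-sensitive score maps.

    score_maps: array of shape (k*k, H, W); map i*k + j scores
    "this pixel is at relative position (i, j) inside an instance".
    (y, x): top-left corner of a win x win sliding window.
    """
    cell = win // k
    out = np.empty((win, win), dtype=score_maps.dtype)
    for i in range(k):
        for j in range(k):
            ys, xs = y + i * cell, x + j * cell
            # Cell (i, j) of the output copies the corresponding
            # sub-window from relative-position map i*k + j.
            out[i*cell:(i+1)*cell, j*cell:(j+1)*cell] = \
                score_maps[i * k + j, ys:ys + cell, xs:xs + cell]
    return out

# Toy check: make map m constant m, so each output cell reveals its source map.
maps = np.stack([np.full((8, 8), m, dtype=float) for m in range(9)])
mask = assemble_instance(maps, 1, 1, win=6, k=3)
```

Because the score maps are computed once and only copied per window, nearby windows reuse almost all of their computation, which is the local-coherence point above.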
Major Contributions
A fully convolutional network architecture for instance-level segmentation.
Recall: upsampled output
NOTE: The upsampled output is H x W x (class number + 1).
Each H x W slice is the heat map for one category.
Experimental Setup
1) Use the VGG-16 network pre-trained on ImageNet as the feature extractor.
2) The 13 convolutional layers in VGG-16 are applied fully convolutionally on an input image of arbitrary size.
3) Reduce the network stride and increase the feature-map resolution:
   a) the max pooling layer pool4 (between conv4_3 and conv5_1) is modified to have a stride of 1 instead of 2,
   b) accordingly, the filters in conv5_1 to conv5_3 are adjusted by the "hole algorithm".
*Using this modified VGG network, the effective stride of the conv5_3 feature map is s = 8 pixels w.r.t. the input image.
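The "hole algorithm" (à trous convolution) can be sketched as inserting zeros between the taps of a kernel: a stride-1 convolution with the dilated kernel on the higher-resolution map then covers the same receptive field as the original kernel did on the downsampled map. This is a simplified 2-D sketch of the idea, not the paper's code.

```python
import numpy as np

def dilate_kernel(k, rate):
    """Insert (rate - 1) zeros between kernel taps ("holes").

    A kh x kw kernel becomes ((kh-1)*rate + 1) x ((kw-1)*rate + 1),
    with the original taps at every `rate`-th position and zeros between.
    """
    kh, kw = k.shape
    d = np.zeros(((kh - 1) * rate + 1, (kw - 1) * rate + 1), dtype=k.dtype)
    d[::rate, ::rate] = k
    return d

# Example: a 3 x 3 kernel dilated with rate 2 becomes 5 x 5.
k = np.arange(1, 10).reshape(3, 3)
dk = dilate_kernel(k, rate=2)
```

This lets the modified conv5 layers keep their pre-trained weights while operating on the finer, stride-1 pool4 output.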
DeepMask looks similar, but it doesn't know how to exploit the local coherence.
Quantitative Results: ablation comparisons on the PASCAL VOC 2012 validation set
Quantitative Results: performance evaluations on the PASCAL VOC 2012 validation set
Quantitative Results: performance evaluations on the MS COCO validation set
Qualitative Result
Qualitative Result
Strengths and Weaknesses
Strengths:
1) Both papers addressed very important questions with fully convolutional networks, efficiently.
2) Both papers have novelty with respect to network architectures.
3) Both papers have convincing experiments.
   a) Visualization and numerical results are clear and convincing.
4) The discussion of the convolution operations in the first paper is helpful for interpretation and better understanding of convolutional networks.
5) The second paper doesn't require a separate process to generate region proposals.
Weaknesses:
1) How to use the training data is never clearly addressed.
   a) What ground truth is used together with the forwarded heat maps in the loss functions?
      i) The first paper is intuitive in this part, but the second paper is very confusing.
2) Several essential points are unclear in the second paper.
   a) Did the second paper skip layers?
   b) Where did the second paper upsample? Or did they not?
3) The relative-location grids in the second paper worked well but look strange:
   a) One person's "left" could be another's "right", yet each channel is in charge of the same relative location for all sliding windows.
Potential directions
1) Other tasks to be solved by fully convolutional networks
   a) Scene recognition?
      (1) Semantic combination of objects
2) Why is the size of the sliding windows fixed in the second paper?
   a) Many small instances may crowd together.
3) What about combining box-level object recognition with semantic segmentation?
+ NYUD net for multi-modal input and SIFT Flow net for multi-task output
PASCAL VOC Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS [17], and the well-known R-CNN [12]. NYUDv2 [33] is an RGB-D dataset collected using the Microsoft Kinect. It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. [14].
SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories ("bridge", "mountain", "sun"), as well as three geometric categories ("horizontal", "vertical", and "sky").
Past and future history of fully convolutional networks
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Chen* & Papandreou* et al. ICLR 2015.
Content source: https://docs.google.com/presentation/d/1VeWFMpZ8XN7OC3URZP4WdXvOGYckoFWGVN7hApoXVnc/edit#slide=id.g529579d43_3_7