Enhancing Salient Object Segmentation Through Attention

Anuj Pahuja*  Avishek Majumder*  Anirban Chakraborty  R. Venkatesh Babu
Indian Institute of Science, Bangalore, India
[email protected], [email protected], {anirban,venky}@iisc.ac.in
*Equal contribution

Figure 1: Overview of the proposed approach. An input RGB image goes through three different modules. The Patch Generation Module learns to generate image patches in a differentiable manner. The Saliency Prediction Module operates on every patch and generates saliency feature maps. Finally, the Recurrent Attention Module aggregates the bag of features and iteratively refines the complete segmentation map.

Abstract

Segmenting salient objects in an image is an important vision task with ubiquitous applications. The problem becomes more challenging in the presence of a cluttered and textured background, low resolution and/or low contrast images. Even though existing algorithms perform well in segmenting most of the object(s) of interest, they often end up segmenting false positives due to resembling salient objects in the background. In this work, we tackle this problem by iteratively attending to image patches in a recurrent fashion and subsequently enhancing the predicted segmentation mask. Saliency features are estimated independently for every image patch and are further combined using an aggregation strategy based on a Convolutional Gated Recurrent Unit (ConvGRU) network. The proposed approach works in an end-to-end manner, removing background noise and false positives incrementally. Through extensive evaluation on various benchmark datasets, we show superior performance over existing approaches without any post-processing.

1. Introduction

Saliency is an important aspect of human vision. It is the phenomenon that allows our brain to focus on some parts of a scene more than the rest.
Thousands of years of evolution have optimized our brain usage by focusing only on the most important regions in our field of view and ignoring the rest. Indeed, even in computer vision, saliency plays a huge role in many applications, including what humans use it for - compressed representation [16, 20]. Saliency can be exploited to improve agent navigation in the wild [10], image retrieval [5, 14, 17], content-based object re-targeting [8, 37], scene parsing [51], and object detection and segmentation [13, 32, 34], among many others. Due to its vast applications in vision, saliency prediction is a well-established problem with decades of on-going research.
Table 1: Quantitative comparison with other state-of-the-art methods on various datasets. The top two results are shown in bold.
UCF [50], RFCN [42], and two non-deep methods, DRFI [22] and BSCA [35].
Our method consistently achieves the top Fβ scores, implying greater precision in the predicted maps. The high precision showcases its effectiveness in suppressing false positives from cluttered backgrounds and partly salient objects. Metrics of other methods have either been reported by the respective authors or have been computed by us using available predictions/weights. For a fair comparison, we use the scores obtained without post-processing for all methods.
In Figure 5, we compare the qualitative results of the
aforementioned methods with ours. We show results for a
set of images with varying difficulties:
Cluttered background. Row 1 contains a textured back-
ground, making algorithms prone to background noise.
Shadows in background. Rows 2 and 4 include images with object shadows. While every method performs well on Row 4, our method is able to suppress much of the ‘shadow saliency’ in the background of Row 2, and the small residue is easily thresholded out during inference.
Low contrast. Row 7 contains an image with low contrast
between object and background. We are able to segment
better with fewer false positives than others.
Multiple Objects. Rows 5, 6, 7 and 10 contain multiple foreground objects. Rows 6 and 10 contain multiple salient objects, whereas Rows 5 and 7 have only a single salient object. Our algorithm performs very well in these scenarios.
Complex foreground. Row 12 contains a complex salient
object where most other algorithms create holes in the pre-
diction. Our method is able to better understand the regional
and global context.
Object within an object. Row 9 contains an interesting image which contains an image of a bird (with sharp contrast) within a poster (the salient object). Our method is the only one that does not fail by trying to segregate these two objects.

*DSS also employs a CRF post-processing step.

Module                      MAE (↓)   max. Fβ (↑)
SPM                         0.080     0.870
SPM + RAM (Epoch 2)         0.0692    0.9193
PGM + SPM + RAM (Epoch 2)   0.0662    0.9196
SPM + RAM (Epoch 5)         0.0661    0.9205
PGM + SPM + RAM (Epoch 5)   0.0623    0.9204
Table 2: Incremental performance gains for different modules on ECSSD.
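The thresholding mentioned for the shadow examples can be a simple adaptive cut-off; a common choice in the salient-object literature is twice the mean saliency value, as in [1]. A minimal sketch (the function name and toy map below are illustrative, not from the paper):

```python
import numpy as np

def binarize_saliency(smap):
    """Binarize a saliency map with the common adaptive threshold
    of twice the mean saliency value (as in [1])."""
    return (smap >= 2.0 * smap.mean()).astype(np.uint8)

# Toy map: weak background response (0.1) with a strong object (0.9).
smap = np.full((4, 4), 0.1)
smap[1:3, 1:3] = 0.9
mask = binarize_saliency(smap)  # only the 2x2 object survives
```

With the mean at 0.3, the cut-off lands at 0.6, so the weak shadow-like responses fall away while the object region is kept.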
4.4. Method Analysis
We analyze our network’s performance by evaluating
component-wise and step-wise results. The results shed
light on our design choices and incremental gains. The eval-
uation metrics have been described in Section 4.2.
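Section 4.2 falls outside this excerpt, but the two metrics used throughout are standard in salient-object evaluation and can be sketched as follows (β² = 0.3 is the usual precision-weighted setting; max. Fβ is obtained by sweeping the binarization threshold; function names are ours):

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted saliency map in [0, 1]
    and a binary ground-truth mask."""
    return float(np.abs(pred - gt).mean())

def f_beta(pred, gt, thresh=0.5, beta2=0.3):
    """F-measure with beta^2 = 0.3; max. F_beta sweeps `thresh`."""
    binary = pred >= thresh
    tp = float(np.logical_and(binary, gt == 1).sum())
    precision = tp / max(binary.sum(), 1)
    recall = tp / max((gt == 1).sum(), 1)
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0

pred = np.array([[0.9, 0.1], [0.2, 0.8]])
gt = np.array([[1, 0], [0, 1]])
# binarizing at 0.5 recovers the ground truth exactly here
```

Lower MAE and higher Fβ are better, which is why the two columns in Tables 2 and 3 are marked (↓) and (↑) respectively.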
To better quantify the role of every module in our ar-
chitecture, we do a component-wise performance analysis
on the ECSSD dataset (Table 2). We first compute the results using just the SPM with added convolutional layers as described in Section 3.4. We can easily evaluate its performance independently since it is trained first. We then
plug in RAM to SPM’s 512-dimensional features and do
an inference on trained SPM and RAM by only evaluat-
ing on Pred0. We see an immediate performance boost
with this setting. While this could just be attributed to an increase in model complexity, we argue that the initial setup with SPM + 3 layers has similar complexity. This observation shows that RAM not only predicts a better output for every Predk+1 (k ∈ [0, N − 1]), but also improves Predi (i ∈ [0, k]) in the process. We do a final evaluation
by allowing a forward pass through all three modules.
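The aggregation step inside RAM is a ConvGRU [2] run over the bag of per-patch features. A minimal numpy sketch of the ConvGRU update, with 1×1 convolutions in place of spatial kernels and random weights — illustrative of the mechanism only, not the trained module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConvGRUCell:
    """Minimal ConvGRU cell (Ballas et al. [2]) sketched with 1x1
    convolutions (per-pixel linear maps). Real implementations use
    spatial kernels; the weight names here are illustrative."""
    def __init__(self, channels, rng):
        shape = (channels, channels)  # one (out_ch, in_ch) map per gate
        self.Wz, self.Uz = rng.normal(0, 0.1, shape), rng.normal(0, 0.1, shape)
        self.Wr, self.Ur = rng.normal(0, 0.1, shape), rng.normal(0, 0.1, shape)
        self.Wh, self.Uh = rng.normal(0, 0.1, shape), rng.normal(0, 0.1, shape)

    @staticmethod
    def conv1x1(W, x):
        # x: (C, H, W) feature map; contract over the channel axis
        return np.einsum('oc,chw->ohw', W, x)

    def step(self, x, h):
        z = sigmoid(self.conv1x1(self.Wz, x) + self.conv1x1(self.Uz, h))  # update gate
        r = sigmoid(self.conv1x1(self.Wr, x) + self.conv1x1(self.Ur, h))  # reset gate
        h_tilde = np.tanh(self.conv1x1(self.Wh, x) + self.conv1x1(self.Uh, r * h))
        return (1 - z) * h + z * h_tilde  # convex blend of old and candidate state

# Aggregate a bag of per-patch saliency features into one hidden state.
rng = np.random.default_rng(0)
cell = ConvGRUCell(channels=4, rng=rng)
h = np.zeros((4, 8, 8))
patch_features = [rng.normal(size=(4, 8, 8)) for _ in range(5)]
for f in patch_features:
    h = cell.step(f, h)
```

Each recurrent step blends the running hidden state with a new patch's features, which is what lets the module revise earlier predictions as more patches are seen.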
Image GT Ours DSS Amulet DHS UCF NLDF
Figure 5: Qualitative comparison with various state-of-the-art approaches on some challenging images from ECSSD [45]. Most of the images where we perform better are the ones where global spatial context is important to distinguish between foreground and background.

In a recurrent network, we should ideally see performance improvements for every iterative step. To verify this,
we evaluate the predicted saliency maps computed at every
k-th step and compare the results in Table 3. We evaluate results after the 2nd and 5th (final) epochs. For both readings, we observe that the Fβ scores do not vary much across the
(N + 1) predictions.

Input Pred1 Pred2 Pred3 Pred4 Pred5 GT
Figure 6: Recurrent step-wise qualitative performance analysis. We observe that Pred1 captures a lot of ‘pseudo’ salient objects. As we go from left to right, we see a clear reduction in the number of false positives that arise from the background.

Predk (Epoch 2)   MAE (↓)   Fβ (↑)
k = 1             0.0692    0.9193
k = 2             0.0687    0.9196
k = 3             0.0666    0.9197
k = 4             0.0669    0.9197
k = 5             0.0662    0.9196

Predk (Epoch 5)   MAE (↓)   Fβ (↑)
k = 1             0.0661    0.9205
k = 2             0.0637    0.9210
k = 3             0.0629    0.9208
k = 4             0.0625    0.9206
k = 5             0.0623    0.9204
Table 3: Quantitative performance comparison of the N predictions from the Recurrent Attention Module on ECSSD.

Higher Fβ does not imply a lower MAE [4]. Often, a decrease in MAE also leads to a decrease in Fβ. Thus, an important observation is the gradual
decrease in MAE as we increase k. A decrease in MAE
while maintaining the Fβ scores affirms that RAM reduces
the false positives incrementally without losing precision.
Qualitatively, we observed that visible differences in the saliency maps are more noticeable during the initial epochs. Hence, we show the (N + 1) predicted maps after the 1st epoch in Figure 6. We can clearly notice the suppression of false positives in the background for every subsequent prediction.
5. Conclusion
We present an intuitive, scalable and effective approach for detecting salient objects in a scene. Our approach is modular, yielding interpretable results. We propose a Patch Generation Module, a Saliency Prediction Module and a Recurrent Attention Module that work in tandem to improve overall object segmentation by generating image patches, computing their corresponding feature maps and effectively aggregating them. Through quantitative and qualitative evaluation on benchmark datasets, we show the importance of region-wise attention in saliency prediction. An easy and important extension to our work could be a dynamic improvement of predictions based on the number of allowed patches. This can reduce the inference time significantly for an accuracy trade-off. In the future, we would also like to test our method's effectiveness on the task of video object segmentation in an unsupervised setting.
References
[1] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk. Frequency-tuned salient region detection. In CVPR, 2009.
[2] Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432, 2015.
[3] Nicolas Ballas, Li Yao, Chris Pal, and Aaron C. Courville. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432, 2015.
[4] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24(12):5706–5722, 2015.
[5] Ming-Ming Cheng, Qi-Bin Hou, Song-Hai Zhang, and Paul L Rosin. Intelligent visual media processing: When graphics meets vision. Journal of Computer Science and Technology, 32(1):110–121, 2017.
[6] Ming-Ming Cheng, Niloy J Mitra, Xiaolei Huang, Philip HS Torr, and Shi-Min Hu. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):569–582, 2015.
[7] Ming-Ming Cheng, Jonathan Warrell, Wen-Yan Lin, Shuai Zheng, Vibhav Vineet, and Nigel Crook. Efficient salient region detection with soft image abstraction. In ICCV, 2013.