Online Adaptation of Convolutional Neural Networks for the 2017 DAVIS
Challenge on Video Object Segmentation
Paul Voigtlaender Bastian Leibe
Visual Computing Institute
RWTH Aachen University
{voigtlaender,leibe}@vision.rwth-aachen.de
Abstract
This paper describes our method used for the 2017 DAVIS Challenge on Video Object Segmentation [26]. The challenge's task is to segment the pixels belonging to multiple objects in a video using the ground truth pixel masks, which are given for the first frame. We build on our recently proposed Online Adaptive Video Object Segmentation (OnAVOS) [28] method, which pretrains a convolutional neural network for objectness, fine-tunes it on the first frame, and further updates the network online while processing the video. OnAVOS selects confidently predicted foreground pixels as positive training examples, and pixels that are far away from the last assumed object position as negative examples. While OnAVOS was designed to work with a single object, we extend it to handle multiple objects by combining the predictions of multiple single-object runs. We introduce further extensions, including upsampling layers which increase the output resolution. We achieved fifth place out of 22 submissions in the competition.
1. Introduction
Video object segmentation (VOS) is a fundamental computer vision task with important applications in video editing, robotics, and autonomous driving. The goal of VOS is to segment the pixels of one or more objects in a video using the ground truth pixel masks of the first frame. While single-object and multi-object tracking on the bounding box level have received much attention in the computer vision community, their counterpart on the pixel level, i.e. VOS, has been less well explored, mainly due to the lack of datasets of sufficient size and quality. However, the recent introduction of the DAVIS 2016 dataset [25] for single-object VOS and the DAVIS 2017 dataset and competition [26] for multi-object VOS, together with the adoption of deep learning techniques, has led to a significant advancement in the state of the art of VOS.
[Figure 1 panel labels: no adaptation, adaptation targets, online adapted, merged]
Figure 1: Overview of the proposed multi-object version of OnAVOS [28]. For each object in the video, OnAVOS is run once using the corresponding pixel mask for the first frame (not shown). OnAVOS updates the network online using the adaptation targets to improve the results. Positive adaptation targets are shown in yellow and negative targets in blue. Finally, the single-object predictions are merged to yield the multi-object segmentation. It can be seen that the online adaptation significantly improves the segmentation of the left person.
The most successful methods are based on pretrained fully convolutional neural networks, which are fine-tuned on the first frame of the target video [24, 5, 17, 28]. Most of these methods leave the network parameters fixed after fine-tuning, which means that they cannot deal well with large changes in appearance, e.g. arising from an altered viewpoint. This is in contrast to our recently introduced Online Adaptive Video Object Segmentation (OnAVOS) [28] method, which adapts to these changes by updating the network online while processing the video, leading to significant improvements on the DAVIS 2016 [25] and YouTube-Objects [27, 13] datasets for single-object VOS. See Fig. 1 for an overview of the online adaptation approach. In this work, we adopt OnAVOS and generalize it in a simple way to work with multiple objects. We demonstrate its effectiveness for this task by achieving fifth place in the 2017 DAVIS challenge [26].
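To make the merging step concrete, the sketch below shows one simple way to combine per-object foreground probability maps from the single-object runs into a multi-object label map via a pixel-wise argmax. The argmax rule and the background threshold are illustrative assumptions, not necessarily the exact merging strategy used in our submission.

import numpy as np

def merge_single_object_predictions(prob_maps, bg_threshold=0.5):
    """Merge per-object foreground probability maps into one label map.

    prob_maps: array of shape (num_objects, H, W) with values in [0, 1],
               one foreground probability map per single-object run.
    Returns an (H, W) integer map: 0 = background, k = object k (1-based).
    Illustrative sketch only, not the exact rule of the submission.
    """
    prob_maps = np.asarray(prob_maps)
    best_obj = np.argmax(prob_maps, axis=0)   # most likely object per pixel
    best_prob = np.max(prob_maps, axis=0)     # its foreground probability
    labels = best_obj + 1                     # 1-based object ids
    labels[best_prob < bg_threshold] = 0      # no confident object -> background
    return labels

# Toy usage: two objects on a 4x4 frame.
maps = np.random.rand(2, 4, 4)
print(merge_single_object_predictions(maps))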
2. Related Work
Fully Convolutional Neural Networks for Semantic Segmentation. Long et al. [20] introduced fully convolutional neural networks (FCNs) for semantic segmentation, which replace the fully-connected layers of a pretrained convolutional classification network by 1×1 convolutions, enabling the network to output dense predictions for semantic segmentation instead of just one global class prediction. Recently, Wu et al. [29] introduced a very wide fully convolutional ResNet [12] variant, which achieved outstanding results for classification and semantic segmentation. We adopt their network architecture and pretrained weights for our experiments.
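As an illustration of the FCN idea (using PyTorch as one concrete framework choice; the layer sizes are made up), a fully-connected classifier can be rewritten as an equivalent 1×1 convolution that produces dense score maps on inputs of arbitrary spatial size:

import torch
import torch.nn as nn

# A fully-connected classifier over C-dimensional features ...
C, num_classes = 512, 21
fc = nn.Linear(C, num_classes)

# ... is equivalent to a 1x1 convolution with the same weights,
# which can be applied to feature maps instead of single vectors.
conv1x1 = nn.Conv2d(C, num_classes, kernel_size=1)
conv1x1.weight.data.copy_(fc.weight.data.view(num_classes, C, 1, 1))
conv1x1.bias.data.copy_(fc.bias.data)

features = torch.randn(1, C, 30, 40)  # feature map from a conv backbone
dense_scores = conv1x1(features)      # (1, num_classes, 30, 40) score map
print(dense_scores.shape)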
Video Object Segmentation with Convolutional Neural Networks. Caelles et al. introduced the one-shot video object segmentation (OSVOS) [5] approach, which pretrains a convolutional neural network on ImageNet [7], then fine-tunes it on the 30 training videos of DAVIS 2016, and finally fine-tunes it on the first frame of the target video. The resulting fine-tuned network is then applied to each frame of the video individually. The MaskTrack method [24] pretrains a convolutional neural network to propagate pixel masks from one frame to the next while exploiting optical flow information; it is also fine-tuned on the first frame. LucidTracker [17] uses a similar approach to MaskTrack and introduces an elaborate data augmentation method, which generates a large number of training examples from the first frame. Caelles et al. [4] incorporate the semantic information of an instance segmentation method into their VOS pipeline. The current best result on DAVIS 2016 is obtained by OnAVOS, which extends the basic pipeline of OSVOS by an online adaptation mechanism. OnAVOS is described in more detail in Section 3.
Online Adaptation. Online adaptation is a common element of many multi-object tracking methods, both in classical approaches like online boosting [10] or the Tracking-Learning-Detection framework [16], and in the context of deep learning [21]. However, its use on the pixel level, i.e. for VOS, is less well explored, and prior work mainly focuses on classical methods like online updated color or shape models [2, 1, 23] or random forests [8].
3. Online Adaptive Video Object Segmentation
The Online Adaptive Video Object Segmentation (OnAVOS) [28] method extends OSVOS by an additional objectness pretraining step and online updates. Fig. 2 illustrates its pipeline, which is described in the following.
Base Network. The first step of OnAVOS is to pretrain a convolutional neural network on large datasets like ImageNet [7], Microsoft COCO [19], and PASCAL [9] for classification or semantic segmentation, in order to learn a powerful representation of objects. The resulting pretrained network is called the base network.
Objectness Network. In the next step, the base network is fine-tuned for pixel objectness [15, 14] on the PASCAL dataset [9] with extended annotations [11]. By treating each of the 20 classes as foreground and all other pixels as background, the network is trained for binary classification and learns a general notion of which pixels belong to objects.
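A minimal sketch of how such binary objectness targets can be derived from PASCAL VOC label maps, assuming the standard label conventions (0 = background, 1-20 = object classes, 255 = void):

import numpy as np

VOID_LABEL = 255  # PASCAL VOC "ignore" label on object boundaries

def to_objectness_target(semantic_labels):
    """Map a PASCAL VOC semantic segmentation map (H, W) to binary
    objectness targets: 0 = background, 1 = any of the 20 object
    classes, VOID_LABEL kept as "don't care" for the loss."""
    target = np.zeros_like(semantic_labels)
    target[(semantic_labels >= 1) & (semantic_labels <= 20)] = 1
    target[semantic_labels == VOID_LABEL] = VOID_LABEL
    return target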
Domain Specific Objectness Network. In order to better match the target domain, i.e. videos from DAVIS, the network is then further fine-tuned for objectness on the DAVIS 2017 training sequences, using all annotated objects as foreground targets and all other pixels as background. This yields the domain specific objectness network.
Test Network. At test time, the domain specific objectness network is fine-tuned on the first frame of the target video in order to adapt to the specific appearance of the object of interest. The resulting network is called the test network and can either be applied directly to the rest of the video, as in OSVOS [5], or be further updated online as described below.
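A minimal sketch of this first-frame fine-tuning step in Python/PyTorch; the network interface (two-channel logits), optimizer, loss, and hyperparameters are illustrative assumptions, not our exact training configuration:

import torch
import torch.nn.functional as F

def finetune_on_first_frame(net, first_frame, first_mask,
                            steps=100, lr=1e-5):
    """Fine-tune the domain specific objectness network on the ground
    truth mask of the first frame to obtain the test network.

    first_frame: (1, 3, H, W) image tensor; first_mask: (1, H, W)
    tensor with values {0, 1}. Hyperparameters are illustrative.
    """
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    net.train()
    for _ in range(steps):
        optimizer.zero_grad()
        logits = net(first_frame)  # assumed (1, 2, H, W) score map
        loss = F.cross_entropy(logits, first_mask.long())
        loss.backward()
        optimizer.step()
    return net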
Online Adapted Test Network. Algorithm 1 shows the online update mechanism of OnAVOS for single-object VOS. When processing a new video frame, positive and negative pixels are selected as training examples. As positive examples, we select pixels for which the network is very confident that they belong to the foreground, i.e. pixels whose foreground probability provided by the network is above a threshold α = 0.99. For negative examples, a different strategy is employed, since using strong background predictions as negative examples would destroy all chances to adapt to changes in appearance. Instead, all pixels which have a distance of more than d = 190 pixels from the predicted foreground mask of the last frame are selected as negative examples, and all remaining pixels are assigned a "don't care" label. To deal with noise, an erosion operation can optionally be applied to the mask of the last frame before calculating the distance. See Fig. 1 for an example of the selected adaptation targets. The obtained pixel labels can then be used to fine-tune the network on the current frame. However, naively doing so quickly leads to drift.
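The target selection just described can be sketched as follows; scipy is used for the erosion and distance computation, the erosion size is an illustrative assumption, and α and d take the values given above:

import numpy as np
from scipy import ndimage

ALPHA = 0.99      # foreground probability threshold for positives
DIST = 190        # distance threshold (in pixels) for negatives
DONT_CARE = 255   # label ignored by the training loss

def select_adaptation_targets(fg_prob, last_mask, erosion_size=15):
    """Select online adaptation targets for the current frame.

    fg_prob:   (H, W) foreground probabilities for the current frame.
    last_mask: (H, W) boolean predicted mask of the previous frame.
    Returns an (H, W) target map: 1 = positive, 0 = negative,
    DONT_CARE = excluded from the update. erosion_size is illustrative.
    """
    targets = np.full(fg_prob.shape, DONT_CARE, dtype=np.uint8)

    # Positives: pixels the network is very confident about.
    targets[fg_prob > ALPHA] = 1

    # Optionally erode the last mask to suppress noisy predictions
    # before measuring distances.
    eroded = ndimage.binary_erosion(
        last_mask, np.ones((erosion_size, erosion_size)))

    # Negatives: pixels far away from the last predicted mask.
    # distance_transform_edt gives each pixel's distance to the
    # nearest pixel of the (eroded) mask.
    dist = ndimage.distance_transform_edt(~eroded)
    targets[dist > DIST] = 0
    return targets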
[Figure 2 stages: (a) Objectness Network (pretrained on PASCAL); (b) Domain Specific Objectness Network (pretrained on DAVIS); (c) Test Network (fine-tuned on first frame); (d) Online Adapted Test Network (fine-tuned online)]
Figure 2: The pipeline of OnAVOS [28] for a single object. Starting with pretrained weights, the network learns a general notion of objectness on PASCAL (a). The network is then further pretrained for objectness on the DAVIS training set to better match the target domain (b). At test time, the network is fine-tuned on the first frame of the target video to adapt to the appearance of the object of interest (c). OnAVOS adapts the network online while processing the video, which makes it more robust against appearance changes (d).
Algorithm 1 Online Adaptive Video Object Segmentation