Video Object Segmentation by Hierarchical Localized Classification of Regions
Chenguang Zhang, Haizhou Ai
Dept. of Computer Science and Technology, Tsinghua University, Beijing, P.R. China
zhangcg06@mails.tsinghua.edu.cn, [email protected]ghua.edu.cn
Abstract: Video Object Segmentation (VOS) aims to cut out a selected object from video sequences, where the main difficulties are shape deformation, appearance variation and background clutter. To cope with these difficulties, we propose a novel method named Hierarchical Localized Classification of Regions (HLCR). We suggest that appearance models, together with the spatial and temporal coherence between frames, are the keys to breaking through this bottleneck. Locally, in order to identify foreground regions, we propose Hierarchical Localized Classifiers, which organize regional features as decision trees. Globally, we adopt Gaussian Mixture Color Models (GMMs). After integrating the local and global results into a probability mask, we obtain the final segmentation result by graph cut. Experiments on various challenging video sequences demonstrate the efficiency and adaptability of the proposed method.
Index Terms: video object segmentation, classification, tracking, graph cut
I. INTRODUCTION
In computer vision, Video Object Segmentation (VOS) is
an attractive task with many applications, such as video
editing, video composition and object recognition. Generally, a
VOS system faces two basic problems in computer
vision: object tracking and segmentation. There are numerous
algorithms for object tracking [1], such as mean shift [2],
particle filter [3], online boosting [4] and random forest [5].
There is also a great deal of work on object segmentation,
such as level set methods [6], graph cut [7] and grab cut [8].
It is well known that, for a VOS system, dealing with
general video sequences is extremely challenging
because of appearance variation, irregular motion
and background clutter. Building on object tracking and
segmentation, various approaches have been proposed for VOS
in recent years. Li et al. [9] directly extend the traditional
graph cut [7] algorithm from 2D images to 3D image sequences,
and optimize a global energy function to yield the segmentation
result. Apart from relying heavily on Gaussian
mixture color models, this 3D graph cut method is quite
time-consuming and does not allow user interaction. Later,
localized color and shape models were introduced by Bai et al. [10] in
the Video SnapCut system, which shows increased discriminative
ability and proves to be more efficient. However, because
optical flow produces unexpected errors when the object is occluded
by itself or by others, it is not reliable to perform classification
on the object boundary and to shift local windows. An alternative
method by Brendel et al. [11], which focuses on tracking regions
across frames, is attractive for its computational benefit and
spatial-temporal coherence. However, because it can fail
to match the contours of regions, this method lacks the
ability to deal with complex deformation of non-rigid objects.
Meanwhile, Niebles et al. [12] demonstrate how to combine
model-based information (e.g. part-based detection results for
humans) with appearance approaches to extract human body
regions. Nevertheless, for general objects, high-performance
detectors are usually not available, which limits the
generalization of that method.
Inspired by previous work on localized windows [10] and
region tracking [11], we propose a novel method, named
Hierarchical Localized Classification of Regions (HLCR),
for video object segmentation. The main contribution of our
approach is to overcome the limitations of directly shifting
local windows and of unreliable region tracking, by taking the
spatial-temporal relationship between corresponding regions
in neighboring frames as the inference strategy.
The rest of this paper is organized as follows. In Section II,
we first give a formulation and then a brief overview of
our system. Section III introduces the whole pipeline of our
approach. Experimental results on different video sequences
are presented in Section IV. Finally, in Section V, we
conclude and discuss future work.
II. PROBLEM FORMULATION AND SYSTEM OVERVIEW
Given an input video sequence $I = \{I_0, I_1, \ldots, I_{N-1}\}$, the
VOS system is initialized by a selected key frame $I_k$ with
known foreground mask $F(I_k)$. The output of a typical VOS
system is the foreground mask $M(I_t)$ for each
frame $I_t$.
Taking the foreground mask in a particular frame as input,
as illustrated in Fig. 1, our system is designed to generate
the foreground mask in the next frame. With the help of the
Regional Back-Track Method for motion estimation, we
assign regions to a series of Hierarchical Localized Classifiers
to predict potential foreground and background regions locally.
Combining the classification result with Gaussian Mixture
Color Models (GMMs), we produce a probability mask,
which is then optimized with the graph cut [7] algorithm
to yield the final segmentation result.
window $W_i$, trained by all inner regions which have already
been labeled as foreground or background according to the
foreground mask $M(I_t)$. Here, we build a multi-dimensional
feature vector $f(R) = (r, g, b, y, u, v, c_x, c_y)$ for region $R$,
where $(r, g, b, y, u, v)$ denotes the average value of all pixels
in region $R$ in the RGB and YUV color spaces and $(c_x, c_y)$ denotes
the center of region $R$. If $W_i$ contains both foreground and
background regions, we use a decision tree for classification.
Otherwise, the localized classifier $L(W_i)$ degenerates into
a constant function (returning 1 if the window contains only
foreground regions, and 0 otherwise).
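The feature vector and the degenerate case described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the BT.601 YUV coefficients, the use of the pixel centroid as the region center, and the names `region_feature`, `train_localized_classifier` and `fit_tree` are all assumptions introduced here. The decision-tree learner itself is left as a caller-supplied function.

```python
from statistics import mean

def rgb_to_yuv(r, g, b):
    # BT.601 analog YUV (assumption: the paper does not state which variant it uses).
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, 0.492 * (b - y), 0.877 * (r - y)

def region_feature(pixels):
    """Build f(R) = (r, g, b, y, u, v, cx, cy) from (x, y, r, g, b) pixel tuples."""
    r = mean(p[2] for p in pixels)
    g = mean(p[3] for p in pixels)
    b = mean(p[4] for p in pixels)
    # The conversion is linear, so converting the mean color equals
    # averaging the converted pixels.
    y, u, v = rgb_to_yuv(r, g, b)
    # Assumption: the region center is the centroid of its pixels.
    cx = mean(p[0] for p in pixels)
    cy = mean(p[1] for p in pixels)
    return (r, g, b, y, u, v, cx, cy)

def train_localized_classifier(features, labels, fit_tree):
    """Return a callable L(W): feature -> 0/1 for one local window.

    labels use 1 for foreground and 0 for background. fit_tree is a
    hypothetical caller-supplied learner that returns such a callable.
    """
    kinds = set(labels)
    if kinds == {1}:
        return lambda f: 1          # only foreground: constant 1
    if kinds != {0, 1}:
        return lambda f: 0          # only background (or empty): constant 0
    return fit_tree(features, labels)  # mixed window: learn a decision tree
```

Any tree learner can be plugged in as `fit_tree`; the window only needs a callable that maps a feature vector to a binary label.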
Fig. 3. Hierarchical Localized Classifiers based on quad-tree partition. If a local window is larger than a fixed size $\lambda$ and contains both foreground and background regions, e.g. $W_i$, we split it into four sub-windows. Otherwise, the partition terminates and the window becomes a leaf node, e.g. $W_j$. For each window $W_i$, a localized classifier $L(W_i)$ is trained by all the regions inside it.
As for prediction, instead of shifting local windows, we
assign each region $R_a$ in frame $t+1$ to a series
of windows $\{W_{i_0}, W_{i_1}, \ldots, W_{i_{n-1}}\}$ in frame $t$. Recall the
Regional Back-Track Method introduced in Section III-A.
Assuming we have found the best matching region $R_b$ in frame $t$
(the handling of a mismatched $R_a$ is discussed
in Section III-C), $R_b$ is covered by a unique
leaf node of the quad-tree partition. Tracing back through all
the ancestor nodes in the quad-tree, we obtain the series
of windows $\{W_{i_0}, W_{i_1}, \ldots, W_{i_{n-1}}\}$. For each window $W_{i_k}$,
we use the pre-trained localized classifier $L(W_{i_k})$ to predict
whether $R_a$ belongs to the foreground. (Note that here we use
$(r, g, b, y, u, v, c_x - v^x_{R_a}, c_y - v^y_{R_a})$ as the feature vector,
where $(v^x_{R_a}, v^y_{R_a})$ is the averaged motion vector of $R_a$.)
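The quad-tree partition and the leaf-to-root window chain can be sketched as below. This is a hedged illustration under stated assumptions: a region is assigned to a window by its center, "larger than a fixed size" is read as the longer side exceeding `min_size`, and windows are half-open rectangles. The names `Window`, `build_quadtree` and `window_chain` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Window:
    x: float
    y: float
    w: float
    h: float
    children: list = field(default_factory=list)

    def contains(self, cx, cy):
        # Half-open rectangle, so the four children tile the parent exactly.
        return self.x <= cx < self.x + self.w and self.y <= cy < self.y + self.h

def build_quadtree(win, regions, min_size):
    """Split win recursively. regions: (cx, cy, label), label 1 = fg, 0 = bg."""
    inside = [r for r in regions if win.contains(r[0], r[1])]
    labels = {r[2] for r in inside}
    # Split only if the window is large enough and mixes both labels.
    if max(win.w, win.h) > min_size and labels == {0, 1}:
        hw, hh = win.w / 2, win.h / 2
        for dx in (0, hw):
            for dy in (0, hh):
                child = Window(win.x + dx, win.y + dy, hw, hh)
                win.children.append(build_quadtree(child, inside, min_size))
    return win

def window_chain(root, cx, cy):
    """Windows covering (cx, cy), ordered from the leaf up to the root."""
    chain, node = [root], root
    while node.children:
        node = next(c for c in node.children if c.contains(cx, cy))
        chain.append(node)
    return chain[::-1]
```

Calling `window_chain` with the center of the matched region $R_b$ gives the windows whose classifiers vote on $R_a$; the point is assumed to lie inside the root window.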
To produce the final classification result $q_{R_a}$, we need to
integrate the localized classifiers, using the following equation:
$$q_{R_a} = \frac{\sum_{k=0}^{n-1} \omega_k q_k}{\sum_{k=0}^{n-1} \omega_k} \qquad (4)$$
where $q_k$ denotes the binary prediction of $L(W_{i_k})$ and $\omega_k$
denotes the weight of classifier $L(W_{i_k})$. Classifiers
with high confidence should be weighted more than
those with low confidence. Therefore, in our experiments, the
classification ratio on the training set is used as $\omega_k$.
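As a small worked illustration of Eq. (4), the weighted vote can be written as below. The function name `integrate_predictions` and the example numbers are made up for illustration; the handling of an all-zero weight sum is an assumption, since the paper does not discuss that case.

```python
def integrate_predictions(predictions, weights):
    """Eq. (4): weighted average of binary predictions q_k with weights w_k."""
    total = sum(weights)
    if total == 0:
        # Assumption: with no usable weight there is no meaningful vote.
        raise ValueError("sum of classifier weights must be positive")
    return sum(w * q for w, q in zip(weights, predictions)) / total

# Three windows along the leaf-to-root chain vote 1, 0, 1 with
# training-set classification ratios 0.9, 0.6 and 1.0.
q_Ra = integrate_predictions([1, 0, 1], [0.9, 0.6, 1.0])
# (0.9 + 1.0) / (0.9 + 0.6 + 1.0) = 1.9 / 2.5 = 0.76
```

Because each $q_k$ is binary, the result is the share of the total weight held by windows that vote foreground, so it always lies in $[0, 1]$.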
In summary, for an arbitrary region $R_a$ in frame $t+1$ which
has a corresponding region $R_b$ in frame $t$, the Hierarchical
Localized Classifiers make an integrated prediction of the
probability that $R_a$ will be included in the foreground mask.
C. Combined Probability Mask and Iterative Refinement
The Combined Probability Mask is introduced to integrate the
localized classification result with the global GMMs. With it,
we can use the graph cut algorithm to optimize the segmentation
result.
For the graph cut method, we need to optimize the following
energy function
$$E = \lambda \sum_{i} E_d(R_i) + \sum_{i \neq j} E_c(R_i, R_j) \qquad (5)$$
where $E_d(R_i)$ is the data energy and $E_c(R_i, R_j)$ is the regional
connection energy. In our framework, $E_c(R_i, R_j)$ is the color
difference between regions $R_i$ and $R_j$, which is the same as in the
traditional graph cut method [7], and $E_d(R_i)$ is the
combined probability of the global Gaussian Mixture Color Models
(GMMs) and the Hierarchical Localized Classifiers' predictions,
defined as follows.
GMMs are widely used in segmentation and tracking tasks
and have proved quite effective. In our system, both the
foreground and the background GMMs are acquired by clustering
regions in the reference frame $t$ according to the given mask.
Note that directly updating the foreground GMMs is very risky.
Because the initial foreground mask provided by user
input in the key frame is extremely important, we
combine the foreground of the initial key frame
with that of the reference frame. In general, although
the discriminative ability of the Hierarchical Localized Classifiers
is better than that of the GMMs, they may suffer from
over-fitting and cannot handle the mismatched regions of
Section III-A. Consequently, we combine these two responses
to generate a more reliable foreground probability $p(R_a)$,
using the formulas below.
1) If $R_a$ has a corresponding region $R_b$ in frame $t$, then
$$p(R_a) = \frac{q_{fg}(R_a) \cdot q_{R_a}}{q_{fg}(R_a) \cdot q_{R_a} + q_{bg}(R_a) \cdot (1 - q_{R_a})} \qquad (6)$$
2) Otherwise, $R_a$ is mismatched. Since $q_{R_a}$ is not available,
we have
$$p(R_a) = \frac{q_{fg}(R_a)}{q_{fg}(R_a) + q_{bg}(R_a)} \qquad (7)$$
where $q_{fg}(R_a)$ is the probability of $R_a$ under the foreground GMMs,
$q_{bg}(R_a)$ is the probability of $R_a$ under the background GMMs and
$q_{R_a}$ is the classification response of Section III-B.
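Eqs. (6) and (7) can be sketched in a few lines. The function name `combined_probability` and the example values are illustrative, and the 0.5 fallback for a zero denominator is an assumption added here, since the paper does not say how that case is treated.

```python
def combined_probability(q_fg, q_bg, q_local=None):
    """Foreground probability p(R_a) from GMM and localized responses.

    q_fg, q_bg: responses of the foreground and background GMMs.
    q_local: integrated classifier response q_{R_a}, or None when R_a
             has no matched region in frame t (the mismatched case).
    """
    if q_local is None:
        num, den = q_fg, q_fg + q_bg                    # Eq. (7)
    else:
        num = q_fg * q_local                            # Eq. (6)
        den = num + q_bg * (1.0 - q_local)
    if den == 0:
        return 0.5  # assumption: no evidence either way
    return num / den

p_matched = combined_probability(0.6, 0.2, q_local=0.76)   # 0.456 / 0.504
p_mismatched = combined_probability(0.6, 0.2)              # 0.6 / 0.8 = 0.75
```

Setting `q_local` to 0.5 in Eq. (6) reduces it to Eq. (7), so the mismatched case behaves like a matched region whose localized classifiers are undecided.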
Given the combined probability $p(R_a)$ as the data energy
$E_d(R_i)$, we can solve this two-label graph cut problem
with the max-flow method. However, since complex videos
often contain unexpected noise, the combined probability
$p(R_a)$ may drift in a few regions. Therefore, we apply an
iterative refinement to the graph cut result, as follows.
1) Perform graph cut based on the combined probability
$p(R_a)$ to get the foreground regions.
2) Detect the maximum connected component of the
foreground regions to filter out false alarm regions.
3) Update the foreground and background GMMs and the
combined probability $p(R_a)$. Repeat Steps 1) and 2) until
convergence.
In our experiments, only 2 or 3 iterations of the
refinement are needed to produce a convincing result.
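Step 2) of the refinement can be sketched as a breadth-first search over the region adjacency graph. This is an illustrative sketch only: measuring a component by its number of regions (rather than its pixel area) and the names `largest_foreground_component`, `foreground` and `adjacency` are assumptions, and the graph cut of Step 1) is not shown.

```python
from collections import deque

def largest_foreground_component(foreground, adjacency):
    """Keep only the largest connected set of foreground regions.

    foreground: set of region ids labeled foreground by graph cut.
    adjacency:  dict mapping a region id to its neighboring region ids.
    """
    seen, best = set(), set()
    for start in foreground:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            for nb in adjacency.get(current, ()):
                # Grow only through neighbors that are also foreground.
                if nb in foreground and nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    queue.append(nb)
        if len(component) > len(best):
            best = component
    return best

adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 7], 7: [4]}
kept = largest_foreground_component({1, 2, 3, 7}, adjacency)
# Region 7 touches only background region 4, so it is dropped as a false alarm.
```

The returned set would then be used to update the GMMs in Step 3) before the next graph cut pass.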
In terms of complexity, our method takes only about 300
milliseconds per frame on an Intel Core quad 2.40 GHz
CPU with 3 GB of memory. With the help of the initial labeled
foreground mask and a reliable frame-by-frame inference
strategy, our method can deal with very complex videos.
Nevertheless, our method fails when an unexpected sudden change
of foreground appearance occurs.
V. CONCLUSION
In this paper, we propose a novel method that treats VOS
as a problem of tracking and classifying regions in local
windows. The Regional Back-Track Method, which is based on
optical flow, is applied to track regions across frames. The
Hierarchical Localized Classifiers are introduced for the
prediction of potential foreground regions. A combined probability
mask based on the classification results and GMMs is used by the
graph cut algorithm with iterative refinement, which produces
reliable segmentation results. Experiments on various videos
demonstrate its effectiveness.
The current version uses only single-frame propagation,
which may lead to unexpected drift in certain
extreme scenarios. Although the foreground GMMs of the
initial key frame are used as global constraints, which enhances
the stability of our method, we believe that multi-frame
propagation will benefit more from spatial-temporal information.
Another potential direction is extending this work to multi-object
cutout, which has broader application prospects. We
expect to investigate these issues in our future work.
ACKNOWLEDGMENT
This work is supported by the National Science Foundation of
China under grant No. 61075026.
REFERENCES
[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.
[2] D. Comaniciu and P. Meer, “Mean shift analysis and applications,” in The Proceedings of IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1197–1203.
[3] K. Nummiaro, E. Koller-Meier, and L. J. V. Gool, “An adaptive color-based particle filter,” Image Vision Comput., vol. 21, no. 1, pp. 99–110, 2003.
[4] H. Grabner and H. Bischof, “On-line boosting and vision,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2006, pp. 260–267.
[5] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof, “On-line random forests,” in IEEE International Conference on Computer Vision Workshops, 2009, pp. 1393–1400.
[6] A.-R. Mansouri and J. Konrad, “Motion segmentation with level sets,” in IEEE International Conference on Image Processing, 1999, pp. 126–130.
[7] Y. Boykov and M.-P. Jolly, “Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images,” in IEEE International Conference on Computer Vision, 2001, pp. 105–112.
[8] C. Rother, V. Kolmogorov, and A. Blake, “Grab cut: interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics, vol. 23, pp. 309–314, 2004.
[9] Y. Li, J. Sun, and H.-Y. Shum, “Video object cut and paste,” ACM Transactions on Graphics, vol. 24, pp. 595–600, 2005.
[10] X. Bai, J. Wang, D. Simons, and G. Sapiro, “Video snapcut: robust video object cutout using localized classifiers,” vol. 28, 2009.
[11] W. Brendel and S. Todorovic, “Video object segmentation by tracking regions,” in IEEE International Conference on Computer Vision, 2009, pp. 833–840.
[12] J. C. Niebles, B. Han, A. Ferencz, and L. Fei-Fei, “Extracting moving people from internet videos,” in European Conference on Computer Vision, 2008, pp. 527–540.
[13] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels,” EPFL, Tech. Rep., Jun. 2010.
[14] R. Nock and F. Nielsen, “Statistical region merging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1452–1458, 2004.
[15] M. Grundmann, V. Kwatra, M. Han, and I. Essa, “Efficient hierarchical graph based video segmentation,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 2141–2148.
Fig. 4. Experimental Results. From left to right, 1st row: Original Key Frame Image, Segmentation Results of Our Approach; 2nd row: Initial Labeled Foreground Mask, Segmentation Results of Grab Cut [8]. Please zoom in to check for more segmentation details.