[2011][Acpr]Zhang Chenguang


Video Object Segmentation by Hierarchical

Localized Classification of Regions

Chenguang Zhang, Haizhou Ai

Dept. of Computer Science and Technology

Tsinghua University, Beijing, P.R. China

[email protected], [email protected]

Abstract—Video Object Segmentation (VOS) aims to cut out a selected object from a video sequence; the main difficulties are shape deformation, appearance variation and background clutter. To cope with these difficulties, we propose a novel method named Hierarchical Localized Classification of Regions (HLCR). We suggest that appearance models, together with the spatial and temporal coherence between frames, are the keys to overcoming this bottleneck. Locally, to identify foreground regions, we propose Hierarchical Localized Classifiers, which organize regional features as decision trees. Globally, we adopt Gaussian Mixture Color Models (GMMs). After integrating the local and global results into a probability mask, we obtain the final segmentation by graph cut. Experiments on various challenging video sequences demonstrate the efficiency and adaptability of the proposed method.

Index Terms—video object segmentation, classification, tracking, graph cut

I. INTRODUCTION

In computer vision, Video Object Segmentation (VOS) is an attractive task with many applications, such as video editing, video composition and object recognition. A VOS system mainly faces two basic problems in computer vision: object tracking and segmentation. There are numerous algorithms for object tracking [1], such as mean shift [2], particle filter [3], online boosting [4] and random forest [5]. There is also a great deal of work on object segmentation, such as level set methods [6], graph cut [7] and grab cut [8].

It is well known that, for a VOS system, dealing with general video sequences is extremely challenging because of appearance variations, irregular motion and background clutter. On the basis of object tracking and

segmentation, various approaches have been proposed for VOS

in recent years. Li et al. [9] directly extend the traditional graph cut [7] algorithm from 2D images to 3D image sequences and optimize a global energy function to yield the segmentation result. Apart from relying heavily on Gaussian mixture color models, this 3D graph cut method is quite time-consuming and does not allow user interaction. Afterward, localized color and shape models were introduced by Bai et al. [10] in the

Video SnapCut system, which shows increased discriminative

ability and proves to be more efficient. However, because optical flow can fail when the object is occluded by itself or by others, classifying on the object boundary and shifting local windows are not reliable. An alternative

method by Brendel et al. [11], focusing on tracking region

across frames, is attractive for its computational benefit and

spatial-temporal coherence. However, because it can fail to match the contours of regions, this method lacks the ability to deal with complex deformation of non-rigid objects.

Meanwhile, Niebles et al. [12] demonstrate how to combine

model-based information (e.g. part-based detection result for

human) and appearance approaches to extract human body

regions. Nevertheless, for general objects, high-performance detectors are usually not available, which limits the generalization of that method.

Inspired by previous work on localized windows [10] and region tracking [11], we propose a novel method, named Hierarchical Localized Classification of Regions (HLCR), for video object segmentation. The main contribution of our approach is to overcome the limitations of directly shifting local windows and of unreliable region tracking by using the spatial-temporal relationship between corresponding regions in neighboring frames as the inference strategy.

The rest of this paper is organized as follows. In Section II,

we first give a formulation, and then show a brief overview of

our system. Section III introduces the whole pipeline of our

approach. Experimental results on different video sequences are presented in Section IV. Finally, in Section V, we offer a

conclusion of our method, followed by a discussion about the

future work.

II. PROBLEM FORMULATION AND SYSTEM OVERVIEW

Given an input video sequence $I = \{I_0, I_1, \ldots, I_{N-1}\}$, the VOS system is initialized by a selected key frame $I_k$ with known foreground mask $F(I_k)$. The output of a typical VOS system is the foreground mask $M(I_t)$ for each frame $I_t$.

Taking the foreground mask in a particular frame as input, as illustrated in Fig. 1, our system is designed to generate

the foreground mask in the next frame. With the help of

Regional Back-Track Method for motion estimation, we can

assign regions to a series of Hierarchical Localized Classifiers,

to predict potential foreground and background regions locally.

Combining the classification result with Gaussian Mixture

Color Models (GMMs), we can produce a probability mask,

followed by an optimization based on the mask to yield final

segmentation results with graph cut [7] algorithm.



Fig. 1. Outline of our approach

III. OUR APPROACH

First of all, the initial foreground mask $F(I_k)$ is provided by the user. Since video frames are spatially and temporally coherent, we can propagate the foreground mask between neighboring frames. From the reference frame (Fig. 2(a)) to the target frame (Fig. 2(b)), propagation is feasible in both directions. Without loss of generality, the following analysis only explains the forward direction, from frame $I_t$ to frame $I_{t+1}$. Using the selected key frame $I_k$ as the first reference frame and repeatedly applying this propagation procedure, we obtain foreground masks in all frames.

For computational benefit as well as distinctiveness and robustness, each frame is over-segmented into SLIC superpixels [13], which convert the original pixel-connected graph $G_P$ (Fig. 2(b)) to a region-connected graph $G_R$ (Fig. 2(c)).

Fig. 2. An example of the procedure of processing a single frame. Panels: (a) Frame $I_t$; (b) Frame $I_{t+1}$; (c) SLIC regions; (d) optical flow; (e) classification; (f) GMMs probability; (g) graph probability; (h) segmentation result.
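The conversion from the pixel graph to the region graph can be illustrated with a minimal sketch. It assumes the superpixel labels are already available as a 2-D integer map (the SLIC step itself is not shown) and that two regions are neighbors when any of their pixels are 4-connected; the function name `region_adjacency` and the toy label map are hypothetical and not part of the original system.

```python
def region_adjacency(labels):
    """Build a region graph from a 2-D superpixel label map.

    labels[y][x] is the superpixel id of pixel (x, y).
    Returns {region_id: set of neighbouring region ids}.
    Assumption: regions are neighbours if they share a 4-connected pixel pair.
    """
    graph = {}
    height, width = len(labels), len(labels[0])
    for y in range(height):
        for x in range(width):
            a = labels[y][x]
            graph.setdefault(a, set())
            # Look right and down only; every pixel pair is then visited once.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < height and nx < width:
                    b = labels[ny][nx]
                    if a != b:
                        graph[a].add(b)
                        graph.setdefault(b, set()).add(a)
    return graph


# Hypothetical 3x3 label map with three superpixels.
toy_labels = [
    [0, 0, 1],
    [0, 2, 1],
    [2, 2, 1],
]
toy_graph = region_adjacency(toy_labels)
```

Working on a few hundred regions instead of every pixel is what makes the later classification and graph cut steps cheap.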

A. Regional Back-Track Method

For a region in frame $t+1$, the Regional Back-Track Method finds the best matching region in frame $t$ and determines whether the two truly correspond.

Pixel-level optical flow (Fig. 2(d)) is not reliable when heavy occlusion happens. Although it is claimed in [10] that averaging the flow in a local window generates a more robust result, it still produces a meaningless motion vector when there are no truly “matched” regions.

Based on this observation, we suggest that a reliable region tracking method should not only be insensitive to minor optical flow errors, but also judge whether the matched regions truly correspond. For an arbitrary region $R_a$ in frame $t+1$, the Regional Back-Track Method is defined as

$$\mathrm{BackTrack}(R_a) = \min_{\|c_{R_a} - v_{R_a} - c_{R_b}\| \le \delta} \mathrm{Diff}(R_a, R_b) \tag{1}$$

where $R_b$ is in frame $t$, $c_{R_a}$ denotes the center of region $R_a$, $v_{R_a}$ denotes the averaged motion vector of all pixels in region $R_a$, and $\mathrm{Diff}(R_a, R_b)$ denotes the difference between regions $R_a$ and $R_b$. A larger $\delta$ is more robust to optical flow errors but more likely to introduce mistaken regions. On the other hand, $\delta$ is closely related to the radius $r_{R_a}$, since the centers of large regions drift more easily than those of small ones. Consequently, in our experiments, $\delta$ is set to $r_{R_a}$ and $\mathrm{Diff}(R_a, R_b)$ is set to the Euclidean distance between the mean colors of the two regions.
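The search in Eq. (1) can be sketched as follows. This is an illustrative reading rather than the authors' implementation: regions are represented as plain dictionaries with hypothetical keys (`center`, `flow`, `radius`, `mean_color`, `id`), the flow vector is assumed to point from frame $t$ to frame $t+1$, and the function returns the minimizing region (or `None` when no candidate lies within $\delta$).

```python
import math


def back_track(region_a, regions_prev):
    """Return the region of frame t that best matches region_a, or None.

    region_a:     dict with 'center' (x, y), 'flow' (vx, vy), 'radius', 'mean_color'.
    regions_prev: list of dicts with 'center' and 'mean_color' (frame t).
    """
    ax, ay = region_a["center"]
    vx, vy = region_a["flow"]
    # Project the centre of R_a back into frame t: c_Ra - v_Ra.
    px, py = ax - vx, ay - vy
    delta = region_a["radius"]  # the paper sets delta to r_Ra

    best, best_diff = None, math.inf
    for region_b in regions_prev:
        bx, by = region_b["center"]
        if math.hypot(px - bx, py - by) > delta:
            continue  # candidate centre is outside the search radius
        # Diff: Euclidean distance between the mean colours.
        diff = math.dist(region_a["mean_color"], region_b["mean_color"])
        if diff < best_diff:
            best, best_diff = region_b, diff
    return best


# Hypothetical toy data: one region in frame t+1, three candidates in frame t.
ra = {"center": (12.0, 10.0), "flow": (2.0, 0.0), "radius": 3.0,
      "mean_color": (200, 10, 10)}
prev = [
    {"id": "near_red", "center": (10.5, 10.0), "mean_color": (198, 12, 9)},
    {"id": "near_blue", "center": (9.0, 11.0), "mean_color": (10, 10, 200)},
    {"id": "far_red", "center": (30.0, 30.0), "mean_color": (200, 10, 10)},
]
match = back_track(ra, prev)
```

In the toy data, `far_red` has the identical color but is rejected by the distance constraint, which is exactly the role of $\delta$.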

A key issue of the Regional Back-Track Method is how to convert $\mathrm{Diff}(R_a, R_b)$ into a binary decision. Traditional methods, such as selecting a global threshold or using a Chi-square test, are tricky to tune and unstable. Here, inspired by the Statistical Region Merging method [14], we choose the independent bounded difference inequality as the decision function, treating each pixel in $R_a$ as a bounded independent random variable. The resulting predicate is

$$B(R_a, R_b) = \begin{cases} 1 & \text{if } |R_a - R_b| \le \sqrt{b^2(R_a) + b^2(R_b)} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

To summarize, for an arbitrary region $R_a$ in frame $t+1$, the Regional Back-Track Method provides the best matching region $R_b$ in frame $t$ if the two truly correspond. Otherwise, it marks $R_a$ as a “mismatched” region.
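The binary decision of Eq. (2) might look like the sketch below. The text does not spell out the bound $b(R)$, so the form used here is an assumption: a simplified bounded-difference bound in the spirit of Statistical Region Merging [14], $b(R) = g\sqrt{\ln(2/\delta_p) / (2Q|R|)}$, with hypothetical default parameters `g`, `q` and `delta_p`. The color gap is taken as the Euclidean distance between mean colors, following the definition of Diff above.

```python
import math


def srm_bound(size, g=256.0, q=32.0, delta_p=1e-3):
    """Assumed deviation bound b(R) for a region of `size` pixels.

    g, q and delta_p are illustrative defaults, not values from the paper.
    """
    return g * math.sqrt(math.log(2.0 / delta_p) / (2.0 * q * size))


def corresponds(mean_a, size_a, mean_b, size_b):
    """Predicate B(R_a, R_b): 1 if the colour gap is within the joint bound, else 0."""
    gap = math.dist(mean_a, mean_b)
    limit = math.sqrt(srm_bound(size_a) ** 2 + srm_bound(size_b) ** 2)
    return 1 if gap <= limit else 0


# Two similar reddish regions correspond; a red and a blue region do not.
same = corresponds((200, 10, 10), 200, (198, 12, 9), 200)
different = corresponds((200, 10, 10), 200, (10, 10, 200), 200)
```

Because the bound shrinks as the region grows, large regions must agree more closely in mean color than small ones, so no single global threshold has to be tuned.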

B. Hierarchical Localized Classifiers

In this section, we introduce Hierarchical Localized Classifiers to evaluate the probability that a region in frame $t+1$ belongs to the foreground.

Localized classifiers for VOS were introduced in the Video SnapCut system [10], in which a series of overlapping, fixed-size local windows is created along the foreground boundary and then propagated through frames. However, because of large boundary variation and local window drift, that method is limited when facing topology changes. In addition, since the size of the local windows is fixed, it sacrifices the benefit of multi-scale information. To overcome these limitations, we propose a new solution called Hierarchical Localized Classifiers.

Given a foreground mask $M(I_t)$ and the corresponding foreground bounding box $B(I_t)$ in reference frame $t$, we define a potential searching box $S(I_t)$ by extending $B(I_t)$ by a fixed ratio $\beta$ ($\beta = 0.3$ in our experiments), using the following equations.

$$\begin{aligned} \mathrm{center}(S(I_t)) &= \mathrm{center}(B(I_t)) \\ \mathrm{height}(S(I_t)) &= (1 + \beta)\,\mathrm{height}(B(I_t)) \\ \mathrm{width}(S(I_t)) &= (1 + \beta)\,\mathrm{width}(B(I_t)) \end{aligned} \tag{3}$$
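As a small worked example of Eq. (3), the sketch below expands a bounding box about its center. The `(x_min, y_min, x_max, y_max)` box layout and the function name `search_box` are assumptions made for illustration; clipping to the frame boundary is not mentioned in the text and is therefore omitted.

```python
def search_box(bbox, beta=0.3):
    """Expand bbox = (x_min, y_min, x_max, y_max) about its centre by ratio beta."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0   # centre is unchanged
    half_w = (1.0 + beta) * (x1 - x0) / 2.0     # width scaled by (1 + beta)
    half_h = (1.0 + beta) * (y1 - y0) / 2.0     # height scaled by (1 + beta)
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# A 100 x 200 box grows to 130 x 260 around the same centre (150, 150).
expanded = search_box((100, 50, 200, 250))
```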

Next, we build a hierarchical quad-tree structure by splitting the searching box $S(I_t)$, in which each tree node corresponds to a local window. The partition rules are shown in Fig. 3. Then, we generate a localized classifier $L(W_i)$ for each window $W_i$, trained on all inner regions, which have already been labeled as foreground or background according to the foreground mask $M(I_t)$. Here, we build a multi-dimensional feature vector $f(R) = (r, g, b, y, u, v, c_x, c_y)$ for region $R$, where $(r, g, b, y, u, v)$ denotes the average value of all pixels in region $R$ in the RGB and YUV color spaces and $(c_x, c_y)$ denotes the center of region $R$. If $W_i$ contains both foreground and background regions, we use a decision tree for classification. Otherwise, the localized classifier $L(W_i)$ degenerates into a constant function (returning 1 if the window contains only foreground, and 0 otherwise).

Fig. 3. Hierarchical Localized Classifiers based on quad-tree partition. If a local window is larger than a fixed size $\lambda$ and contains both foreground and background regions, e.g. $W_i$, we split it into four sub-windows. Otherwise, the partition terminates and the window becomes a leaf node, e.g. $W_j$. For each window $W_i$, a localized classifier $L(W_i)$ is trained on all the regions inside it.
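The partition rule in Fig. 3 can be sketched as a short recursive function. Several details are assumptions made for illustration only: a region counts as inside a window when its center falls in the window, "larger than $\lambda$" is read as both sides exceeding `min_size`, and the names `Window`, `build_quadtree` and `count_leaves` are hypothetical. Training the per-window classifiers is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    box: tuple                                   # (x0, y0, x1, y1)
    children: list = field(default_factory=list)


def build_quadtree(box, regions, min_size):
    """regions: list of ((cx, cy), is_foreground) pairs taken from frame t."""
    x0, y0, x1, y1 = box
    node = Window(box)
    inside = [fg for (cx, cy), fg in regions if x0 <= cx < x1 and y0 <= cy < y1]
    mixed = any(inside) and not all(inside)      # both foreground and background
    larger = (x1 - x0) > min_size and (y1 - y0) > min_size
    if mixed and larger:
        mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        for sub in ((x0, y0, mx, my), (mx, y0, x1, my),
                    (x0, my, mx, y1), (mx, my, x1, y1)):
            node.children.append(build_quadtree(sub, regions, min_size))
    return node


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


# Hypothetical toy scene: one foreground and two background region centres.
toy_regions = [((1, 1), True), ((3, 3), False), ((6, 6), False)]
coarse = build_quadtree((0, 0, 8, 8), toy_regions, min_size=4)
fine = build_quadtree((0, 0, 8, 8), toy_regions, min_size=2)
```

With `min_size=4` only the root is split; with `min_size=2` the mixed top-left quadrant is split once more, so windows become small only where foreground and background meet.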

As for prediction, instead of shifting local windows, we assign each region $R_a$ in frame $t+1$ to a series of windows $\{W_{i_0}, W_{i_1}, \ldots, W_{i_{n-1}}\}$ in frame $t$. Recall the Regional Back-Track Method introduced in Section III-A. Assuming we have found the best matching region $R_b$ in frame $t$ (Section III-C discusses how to handle a mismatched $R_a$), $R_b$ should be covered by a unique leaf node of the quad-tree partition. Tracing back through all the ancestor nodes in the quad-tree, we get a series of windows $\{W_{i_0}, W_{i_1}, \ldots, W_{i_{n-1}}\}$. For each window $W_{i_k}$, we use the pre-trained localized classifier $L(W_{i_k})$ to predict whether $R_a$ belongs to the foreground. (Note that here we use $(r, g, b, y, u, v, c_x - v^x_{R_a}, c_y - v^y_{R_a})$ as the feature vector, where $(v^x_{R_a}, v^y_{R_a})$ is the averaged motion vector of $R_a$.)

To produce the final classification result $q_{R_a}$, we integrate the localized classifiers using

$$q_{R_a} = \frac{\sum_{k=0}^{n-1} \omega_k q_k}{\sum_{k=0}^{n-1} \omega_k} \tag{4}$$

where $q_k$ denotes the binary prediction of $L(W_{i_k})$ and $\omega_k$ denotes the weight of classifier $L(W_{i_k})$. Classifiers with high confidence should be weighted more than those with low confidence. Therefore, in our experiments, the classification rate on the training set is used as $\omega_k$.
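Eq. (4) is a weighted average of binary votes, which the following minimal sketch makes concrete. The function name and the example weights are hypothetical; the weights stand in for the per-window training-set classification rates described above.

```python
def integrate_predictions(predictions, weights):
    """Weighted average of binary window predictions q_k with weights w_k (Eq. 4)."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need at least one window and one weight per prediction")
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, predictions)) / total


# Three windows on the path from leaf to root: two vote foreground, one background.
q_ra = integrate_predictions([1, 1, 0], [0.9, 0.8, 0.6])
# (0.9 + 0.8) / (0.9 + 0.8 + 0.6) = 1.7 / 2.3
```

The result is a soft score in [0, 1] even though each individual classifier is binary, which is what allows it to be combined with the GMM likelihoods later.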

In summary, for an arbitrary region $R_a$ in frame $t+1$ that has a corresponding region $R_b$ in frame $t$, the Hierarchical Localized Classifiers make an integrated prediction of the probability that $R_a$ will be included in the foreground mask.

C. Combined Probability Mask and Iterative Refinement

The Combined Probability Mask integrates the localized classification result with the global GMMs, so that the graph cut algorithm can be used to optimize the segmentation result.

For the graph cut method, we need to optimize the following energy function

$$E = \lambda \sum_{i} E_d(R_i) + \sum_{i \ne j} E_c(R_i, R_j) \tag{5}$$

where $E_d(R_i)$ is the data energy and $E_c(R_i, R_j)$ is the regional connection energy. In our framework, $E_c(R_i, R_j)$ is the color difference between regions $R_i$ and $R_j$, as in the traditional graph cut method [7], and $E_d(R_i)$ is the combined probability of the global Gaussian Mixture Color Models (GMMs) and the Hierarchical Localized Classifiers' predictions, defined as follows.

GMMs are widely used in segmentation and tracking tasks and turn out to be quite effective. In our system, both foreground and background GMMs are acquired by clustering regions in the reference frame $t$ according to the given mask. Note that directly updating the foreground GMMs is very risky. Because the initial foreground mask provided by the user in the key frame is extremely important, we combine the foreground of the initial key frame with that of the reference frame. In general, though the Hierarchical Localized Classifiers discriminate better than the GMMs, they may suffer from over-fitting and cannot handle the mismatched regions of Section III-A. Consequently, we combine these two responses to generate a more reliable foreground probability $p(R_a)$, using the formulas below.

1) If $R_a$ has a corresponding region $R_b$ in frame $t$, then

$$p(R_a) = \frac{q_{fg}(R_a) \cdot q_{R_a}}{q_{fg}(R_a) \cdot q_{R_a} + q_{bg}(R_a) \cdot (1 - q_{R_a})} \tag{6}$$

2) Otherwise, $R_a$ is mismatched. Since $q_{R_a}$ is not available, we have

$$p(R_a) = \frac{q_{fg}(R_a)}{q_{fg}(R_a) + q_{bg}(R_a)} \tag{7}$$

where $q_{fg}(R_a)$ is the probability of $R_a$ under the foreground GMMs, $q_{bg}(R_a)$ is the probability of $R_a$ under the background GMMs, and $q_{R_a}$ is the classification response of Section III-B.
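The two cases of Eqs. (6) and (7) fit in one small function, sketched below. Passing `None` for the local response to signal a mismatched region, and returning 0.5 when the denominator is zero, are illustrative choices that the text does not specify.

```python
def combined_probability(q_fg, q_bg, q_local=None):
    """Foreground probability p(R_a) from GMM likelihoods and the local response.

    q_fg, q_bg: likelihoods of the region under the foreground/background GMMs.
    q_local:    integrated classifier response q_Ra, or None if R_a is mismatched.
    """
    if q_local is None:                  # mismatched region, Eq. (7)
        num = q_fg
        den = q_fg + q_bg
    else:                                # matched region, Eq. (6)
        num = q_fg * q_local
        den = num + q_bg * (1.0 - q_local)
    # The 0.5 fallback for a zero denominator is an assumption, not from the paper.
    return num / den if den > 0 else 0.5


p_matched = combined_probability(0.6, 0.4, q_local=0.9)   # 0.54 / 0.58
p_mismatched = combined_probability(0.6, 0.4)             # 0.6 / 1.0
```

A confident local response sharpens a weak GMM preference (0.6 becomes about 0.93 here), while a mismatched region falls back to the GMM evidence alone.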

Given the combined probability $p(R_a)$ as data energy $E_d(R_i)$, we can solve this two-label graph cut problem with the max-flow method. However, since complex videos often contain unexpected noise, the combined probability $p(R_a)$ may drift in a few regions. Therefore, we apply an iterative refinement to the graph cut result, as follows.

1) Perform graph cut based on the combined probability $p(R_a)$ to get the foreground regions.

2) Detect the maximal connected component of the foreground regions to filter out false-alarm regions.

3) Update the foreground and background GMMs and the combined probability $p(R_a)$. Repeat Steps 1) and 2) until convergence.

In our experiments, 2 or 3 iterations are enough for the iterative refinement to produce a convincing result.
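Step 2) of the refinement can be sketched as a breadth-first search over the region graph. This sketch assumes that the component to keep is the one with the most regions (the text does not say whether size is counted in regions or in pixels) and that region adjacency is given as a dictionary; the graph cut and GMM update steps are not shown.

```python
from collections import deque


def largest_foreground_component(foreground, adjacency):
    """Keep only the largest connected set of foreground regions.

    foreground: set of region ids labelled foreground by graph cut.
    adjacency:  {region_id: iterable of neighbouring region ids}.
    Assumption: component size is measured in number of regions.
    """
    seen, best = set(), set()
    for start in foreground:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            for neighbour in adjacency.get(current, ()):
                if neighbour in foreground and neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(best):
            best = component
    return best


# Hypothetical region graph: regions 1-2-3-4 form a chain, 5-6 are elsewhere.
toy_adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [6], 6: [5]}
kept = largest_foreground_component({1, 2, 3, 6}, toy_adjacency)
```

Region 6 is labelled foreground but is disconnected from the main object, so it is dropped as a false alarm.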



IV. EXPERIMENTS

Since there is currently no standard dataset for video segmentation, the test videos in our experiments are collected from [15] and [12]. The first video clip is water-skiing from [15] (97 frames, 544 × 280). The second is diving from [15] (179 frames, 880 × 488). The third is skating from [15] (573 frames, 552 × 310). The fourth is dancing from [12] (138 frames, 320 × 240). These videos are very challenging in terms of dynamic camera, background clutter, motion blur, object shadows, etc.

We quantitatively analyze our approach on these test videos. We randomly select 10 frames from each video clip for evaluation and manually label the true foreground. The metric is the standard F-Measure, defined as

$$F\text{-Measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}$$

where Precision is the probability that an auto-segmented foreground pixel is a true foreground pixel and Recall is the probability that a true foreground pixel is detected.
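For reference, Eq. (8) is straightforward to compute. The sketch below uses a hypothetical function name and guards against the degenerate case where both inputs are zero, which the text does not discuss. The example values are the Precision and Recall reported for our method on the water-skiing clip in Table I.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (Eq. 8)."""
    if precision + recall == 0:
        return 0.0  # degenerate case, handled here by convention
    return 2.0 * precision * recall / (precision + recall)


# Water-skiing, our method: Precision 0.938, Recall 0.849.
score = round(f_measure(0.938, 0.849), 3)
print(score)  # 0.891
```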

Since no source code or executable binary is available for current VOS methods such as [10] and [11], we use the Grab Cut [8] algorithm for comparison, drawing the foreground bounding box several times and selecting the best result for each frame. Table I summarizes the comparison: our approach achieves a higher F-Measure than Grab Cut on all four clips, with gains from 0.015 to 0.221. Our method works especially well on visually similar foreground and background (such as dark legs against a black background in Fig. 4(d)), where it improves the F-Measure by as much as twenty percentage points. Some examples are shown in Fig. 4, which demonstrate that our method significantly improves the subjective quality of segmentation.

TABLE I
EXPERIMENTAL RESULTS

Video Clip     Method        Precision   Recall    F-Measure
Water-skiing   Grab Cut      0.753       0.911     0.836
               Our Method    0.938       0.849     0.891
               +/−           0.185       -0.062    0.067
Diving         Grab Cut      0.823       0.849     0.836
               Our Method    0.914       0.950     0.931
               +/−           0.091       0.101     0.096
Skating        Grab Cut      0.956       0.905     0.930
               Our Method    0.973       0.919     0.945
               +/−           0.017       0.014     0.015
Dancing        Grab Cut      0.873       0.620     0.725
               Our Method    0.946       0.947     0.947
               +/−           0.073       0.327     0.221

In terms of complexity, our method takes only about 300 milliseconds per frame on an Intel Core quad 2.40 GHz CPU with 3 GB memory. With the help of the initial labeled foreground mask and a reliable frame-by-frame inference strategy, our method can deal with very complex videos. Nevertheless, our method fails when an unexpected sudden change of foreground appearance occurs.

V. CONCLUSION

In this paper, we propose a novel method that treats VOS as a problem of tracking and classifying regions in local windows. The Regional Back-Track Method, which is based on optical flow, is applied to track regions across frames. The Hierarchical Localized Classifiers are introduced to predict potential foreground regions. A combined probability mask based on classification results and GMMs is used for the graph cut algorithm with iterative refinement, which produces reliable segmentation results. Experiments on four challenging videos show a higher F-Measure than Grab Cut on every clip.

The current version uses only single-frame propagation, which may lead to unexpected drift in certain extreme scenarios. Although the foreground GMMs of the initial key frame are used as global constraints, which enhances the stability of our method, we believe that multi-frame propagation would benefit more from spatial-temporal information. Another direction is extending this work to multi-object cutout, which has broader applications. We expect to investigate these issues in our future work.

ACKNOWLEDGMENT

This work is supported by National Science Foundation of

China under grant No.61075026.

REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.

[2] D. Comaniciu and P. Meer, “Mean shift analysis and applications,” in The Proceedings of IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1197–1203.

[3] K. Nummiaro, E. Koller-Meier, and L. J. V. Gool, “An adaptive color-based particle filter,” Image Vision Comput., vol. 21, no. 1, pp. 99–110, 2003.

[4] H. Grabner and H. Bischof, “On-line boosting and vision,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2006, pp. 260–267.

[5] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof, “On-line random forests,” in IEEE International Conference on Computer Vision Workshops, 2009, pp. 1393–1400.

[6] A.-R. Mansouri and J. Konrad, “Motion segmentation with level sets,” in IEEE International Conference on Image Processing, 1999, pp. 126–130.

[7] Y. Boykov and M.-P. Jolly, “Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images,” in IEEE International Conference on Computer Vision, 2001, pp. 105–112.

[8] C. Rother, V. Kolmogorov, and A. Blake, “Grab cut: interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics, vol. 23, pp. 309–314, 2004.

[9] Y. Li, J. Sun, and H.-Y. Shum, “Video object cut and paste,” ACM Transactions on Graphics, vol. 24, pp. 595–600, 2005.

[10] X. Bai, J. Wang, D. Simons, and G. Sapiro, “Video snapcut: robust video object cutout using localized classifiers,” vol. 28, 2009.

[11] W. Brendel and S. Todorovic, “Video object segmentation by tracking regions,” in IEEE International Conference on Computer Vision, 2009, pp. 833–840.

[12] J. C. Niebles, B. Han, A. Ferencz, and L. Fei-Fei, “Extracting moving people from internet videos,” in European Conference on Computer Vision, 2008, pp. 527–540.

[13] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels,” EPFL, Tech. Rep., Jun. 2010.

[14] R. Nock and F. Nielsen, “Statistical region merging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1452–1458, 2004.

[15] M. Grundmann, V. Kwatra, M. Han, and I. Essa, “Efficient hierarchical graph based video segmentation,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 2141–2148.



(a) Water-skiing Sequence on Frame 27, 48, 57, 67

(b) Diving Sequence on Frame 35, 64, 83, 122

(c) Skating Sequence on Frame 12, 18, 63, 111

(d) Dancing Sequence on Frame 5, 20, 101, 130

Fig. 4. Experimental Results. From left to right, 1st row: Original Key Frame Image, Segmentation Results of Our Approach; 2nd row: Initial Labeled Foreground Mask, Segmentation Results of Grab Cut [8]. Please zoom in to check for more segmentation details.
