
Improving video foreground segmentation with an object-like pool

Xiaoliu Cheng,^{a,b} Wei Lv,^{a} Huawei Liu,^{a,b} Xing You,^{a,*} Baoqing Li,^{a} and Xiaobing Yuan^{a}

^{a}Chinese Academy of Sciences, Shanghai Institute of Microsystem and Information Technology, Wireless Sensor Network Laboratory, No. 865 Changning Road, Changning District, Shanghai 200050, China
^{b}University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
*Address all correspondence to: Xing You, E-mail: [email protected]

Abstract. Foreground segmentation in video frames is quite valuable for object and activity recognition, while the existing approaches often demand training data or initial annotation, which is expensive and inconvenient. We propose an automatic and unsupervised method of foreground segmentation given an unlabeled and short video. The pixel-level optical flow and binary mask features are converted into normalized probabilistic superpixels; they are therefore adaptable to build the superpixel-level conditional random field, which aims to label the foreground and background. We exploit the fact that the appearance and motion features of the moving object are, in general, temporally and spatially coherent to construct an object-like pool and a background-like pool from the previously segmented results. The continuously updated pools can be regarded as the "prior" knowledge of the current frame and provide a reliable way to learn the features of the object. Experimental results demonstrate that our approach outperforms the current methods, both qualitatively and quantitatively. © The Authors. Published by SPIE under a Creative Commons Attribution 3.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI. [DOI: 10.1117/1.JEI.24.2.023034]

Keywords: video foreground segmentation; object-like pool; unsupervised; unlabeled; conditional random field; probabilistic superpixels.

Paper 15016 received Jan. 9, 2015; accepted for publication Mar. 27, 2015; published online Apr. 23, 2015.

1 Introduction

Video foreground segmentation plays a prerequisite role in a variety of visual applications such as safety surveillance1 and intelligent transportation.2 The existing algorithms usually use supervised or semisupervised methods and achieve satisfying results. However, their performance is still limited when they are applied to unsupervised and short videos, because supervised methods usually demand many training examples that are expensive to label manually. Furthermore, the training examples cannot cover all conditions, and new examples must be added and the model retrained to improve generalization. Some semisupervised methods require accurate object region annotation only for the first frame and then exploit region-tracking methods to segment the rest of the frames. However, many visual applications like safety surveillance demand intelligent and unattended operation, which makes initial annotation impractical. Moreover, the available video frames may sometimes be insufficient, since objects can move rapidly into and out of the visual field when they are near the camera.

There has been a substantial amount of work related to foreground segmentation. Classical segmentation methods that operate at the pixel level are often based on local features like textons,3 which are then augmented by Markov random field or graph-cut based methods to refine the results.4,5 Furthermore, some newer methods of this type regard meaningful superpixels as the basic units instead of rigid pixels to get better results,6–10 because superpixels are efficient in practice, more robust to noise than pixels, and represent objects well. For instance, Tian et al.6 propose two superpixel-based data terms and smoothness terms defined on the spatiotemporal superpixel neighborhood with a shape cue to implement the segmentation. Their method can handle video sequences of arbitrary length, although it demands that the first frame be manually labeled. Shu et al.9 apply a superpixel-based bag-of-words model to iteratively refine the output of a generic detector; an online-learning appearance model is then exploited to train a support vector machine and to extract the exact objects using a conditional random field (CRF). However, it requires a mass of varied examples to train the classifier, and it is not well adapted to short videos.

Perhaps the work most closely related to ours is that of Schick et al.8 They convert the traditional pixel-based segmentation into a probabilistic superpixel representation and integrate structure information and similarities into a Markov random field (MRF) to improve the segmentation. The shape of the object in the given foreground segmentation is improved by their probabilistic superpixel Markov random field (PSP-MRF) method. Moreover, it also reduces the noisy regions and improves recall, precision, and F-measure. However, it depends stringently on the binary mask (see Sec. 3.3). For instance, if the given binary mask is quite poor because of a cluttered background, the performance declines rapidly. In addition, full use is not made of the local features and environmental information to achieve more robust results.

To improve the performance of unsupervised and short video segmentation, we propose an online unsupervised learning approach inspired by Ref. 9. The intuition is that the appearance and motion features of the moving object vary slowly from frame to frame in a typical video. According to this temporal and spatial coherence, we can exploit the segmented result of the previous frame to provide valuable cues for the current segmentation.

This paper aims to segment the moving foreground from an unlabeled and short video in an unsupervised way without prior knowledge. An overview of our approach is illustrated in Fig. 1. The main contributions of our work are as follows: (1) The pixel-level optical flow and binary mask features are converted into normalized probabilistic superpixels, which fit naturally into the CRF. (2) Owing to the temporal and spatial coherence of the appearance and motion features of the moving object, we leverage the previous segmented result to build an object-like pool and a background-like pool, which serve as the "prior" knowledge for the current segmentation. The continuously updated pools provide a reliable and continuous way to learn the features of the object. The proposed algorithm has been validated on several challenging videos from the Change Detection 2014 dataset, and experimental results demonstrate that our approach outperforms the other methods in both accuracy and robustness, even when the basic features suffer from great interference.

The rest of this paper is organized as follows: Sec. 2 presents our detailed approach, experimental results are given in Sec. 3, and conclusions are discussed in Sec. 4.

2 Our Approach

Since we have no prior knowledge about the unlabeled video, we initially know nothing about the object: we do not know its type, size, moving direction, and so on. Similarly, the scenario is also unpredictable: it may suffer from swaying trees, illumination changes, bad weather, shadows, and so on. Therefore, an unsupervised and efficient approach must be developed because of the limited information in the short video.

First, the optical flow field is regarded as the initial detector to extract the moving region, which is actually a coarse bounding box. Second, the pixel-level optical flow and binary mask features are converted into normalized probabilistic superpixels. Combining the normalized probabilistic superpixels with the foreground likelihood generated by the object-like pool and background-like pool, we build a superpixel-based CRF model to provide a natural way to learn the conditional distribution over the class labeling.

Afterward, a graph-cut based method is adopted to achieve the foreground segmentation. Last, an exception-handling mechanism is applied to avoid error accumulation in the case of abnormal events.

2.1 Superpixel Segmentation

Superpixels11,12 have become a significant tool in computer vision. They group pixels into meaningful subregions instead of rigid pixels, which greatly reduces the complexity of image-processing tasks. What is more, superpixels have uniform information in color and space and adhere well to the contour of the object. They have become the basic blocks of many computer vision algorithms, such as object segmentation,9 depth estimation,13 and object tracking.14 As a kind of middle-level feature, superpixels both increase the speed and improve the quality of the segmented results.

Simple linear iterative clustering (SLIC)15 is an efficient method of superpixel segmentation, which is also simple to implement and easy to apply in practice. In this paper, we set a proper size of superpixels (8 × 8 in all the experiments) and segment the image with the SLIC algorithm. We then acquire the table of labeled superpixels, the seeds of the superpixels, and the number of superpixels. Specifically, the table gives the label values of all the pixels, and the maximum value represents the total number of final superpixels. Note that the exact number of segmented superpixels is usually not equal to the given number because some small superpixels are integrated into larger ones. The seeds of the superpixels are used to judge neighbor information, since the label values of the superpixels are not in order.
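As a concrete illustration of this step (not the authors' code), the SLIC segmentation and seed extraction might be sketched as follows in Python with scikit-image; the target superpixel size of 8 × 8 follows the text above, while the library, file name, and parameter choices are our own.

```python
# Sketch of the superpixel step with scikit-image's SLIC (our library choice).
import numpy as np
from skimage import io
from skimage.segmentation import slic

frame = io.imread("frame_0001.png")        # hypothetical input frame
h, w = frame.shape[:2]

# Request roughly (h*w)/64 superpixels so each covers about 8x8 pixels.
labels = slic(frame, n_segments=(h * w) // 64, compactness=10, start_label=0)

# "Table of labeled superpixels": one integer label per pixel. The actual
# count can differ from the requested number, as noted in the text.
num_sp = labels.max() + 1

# Superpixel "seeds" (centroids), used later for neighbor queries.
ys, xs = np.mgrid[0:h, 0:w]
counts = np.bincount(labels.ravel())
seeds = np.stack([np.bincount(labels.ravel(), weights=ys.ravel()) / counts,
                  np.bincount(labels.ravel(), weights=xs.ravel()) / counts],
                 axis=1)                   # (num_sp, 2) array of (y, x) centers
```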

2.2 Probabilistic Superpixels

Pixel-level processing is vulnerable to unpredictable noise and suffers from a heavy calculation burden as well. To achieve a robust and efficient segmentation, we operate at the superpixel level in the following steps. According to Ref. 8, a probabilistic superpixel gives the probability that its pixels belong to a certain class, so it fits well into probabilistic frameworks like the CRF, as we will show later.

Fig. 1 The overview of our approach: (a) input sequential frames, (b) moving region, (c) binary mask, (d) superpixel-level optical flow, (e) foreground likelihood, (f) segmented results, (g) object-like pool, and (h) background-like pool.

Though we have no prior knowledge, the pixel-level optical flow and binary mask can be converted into probabilistic superpixels to measure the foreground likelihood. Let $B$ be the pixel-level binary mask and $sp$ a superpixel with pixels $p \in sp$ and $|sp|$ its size; the likelihood of the superpixel-based binary mask to constitute the object is then defined as8

$L_{\text{binary}}(sp) = \frac{\sum_{p \in sp} B(p)}{|sp|}. \qquad (1)$

The optical flow of each superpixel is represented by the average optical flow of its inside pixels. The likelihood of a superpixel $sp$ (let $\vec{sp}$ be its optical flow vector) to form the foreground based on optical flow is then defined as

$L_{\text{flow}}(\vec{sp}) = \cos\langle \vec{sp}, \vec{r} \rangle \cdot \frac{\|\vec{sp}\|}{\|\vec{r}\|}, \qquad (2)$

where $\langle \vec{sp}, \vec{r} \rangle$ denotes the angle between the vectors $\vec{sp}$ and $\vec{r}$. The reference optical flow vector $\vec{r}$ is defined as the mean optical flow of all the superpixels in the moving region. Finally, the superpixel-level optical flow and binary mask are normalized to represent the foreground and background probabilities by the following equations:

$P_{fg} = \alpha \cdot L_{\text{flow}} + (1 - \alpha) \cdot L_{\text{binary}}, \qquad (3)$

$P_{bg} = 1 - P_{fg}, \qquad (4)$

where $\alpha \in (0, 1)$ represents the tradeoff between the features of the binary mask and the optical flow.
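A minimal sketch of Eqs. (1)–(4), assuming the superpixel labels from the SLIC step and a dense optical flow field (from any standard flow estimator) are available; all function and variable names are ours. Note that the paper computes the reference flow $\vec{r}$ over the superpixels of the moving region only, which this sketch simplifies to the mean over all superpixels, and the clipping to [0, 1] is our safeguard, not in the paper.

```python
# Sketch of Eqs. (1)-(4): pixel-level binary mask B (0/1) and dense flow
# (H, W, 2) are pooled per superpixel and fused into P_fg and P_bg.
import numpy as np

def superpixel_probabilities(labels, num_sp, B, flow, alpha=0.5, eps=1e-8):
    counts = np.bincount(labels.ravel(), minlength=num_sp)

    # Eq. (1): fraction of mask pixels inside each superpixel.
    L_binary = np.bincount(labels.ravel(), weights=B.ravel().astype(float),
                           minlength=num_sp) / counts

    # Mean optical flow of the pixels inside each superpixel.
    fx = np.bincount(labels.ravel(), weights=flow[..., 0].ravel(), minlength=num_sp)
    fy = np.bincount(labels.ravel(), weights=flow[..., 1].ravel(), minlength=num_sp)
    sp_flow = np.stack([fx, fy], axis=1) / counts[:, None]

    # Eq. (2): cosine of the angle to the reference flow r, scaled by norms.
    # (The paper averages r over the moving region; we simplify to all superpixels.)
    r = sp_flow.mean(axis=0)
    norms = np.linalg.norm(sp_flow, axis=1)
    cos = sp_flow @ r / (norms * np.linalg.norm(r) + eps)
    L_flow = cos * norms / (np.linalg.norm(r) + eps)

    # Eqs. (3)-(4); clipping to [0, 1] is our addition.
    P_fg = np.clip(alpha * L_flow + (1.0 - alpha) * L_binary, 0.0, 1.0)
    return P_fg, 1.0 - P_fg
```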

2.3 Superpixel-Based Conditional Random Field

The CRF16 is a class of statistical modeling methods widely applied in computer vision. After superpixel segmentation, the foreground objects are usually oversegmented and consist of more than one superpixel. Therefore, it is essential to cluster and label the superpixels based on their features. Fortunately, the CRF provides a natural way to incorporate superpixel-based features into a single unified model3 that learns the conditional distribution over the class labeling.

Let $G(S, E)$ be the adjacency graph of the superpixels $sp_i$ ($sp_i \in S$) in a frame, where $E$ is the set of edges formed between pairs of adjacent superpixels in the eight-connected neighborhood. Let $P(c|G, w)$ be the conditional probability10 of the set of class assignments $c$ given the adjacency graph $G(S, E)$ and a weight $w$:

$-\log[P(c|G, w)] = \sum_{sp_i \in S} \Psi(c_i | sp_i) + w \sum_{(sp_i, sp_j) \in E} \Phi(c_i, c_j | sp_i, sp_j), \qquad (5)$

where $\Psi(\cdot)$ and $\Phi(\cdot)$ represent the unary potential and pairwise edge potential, respectively.

The unary potential $\Psi(\cdot)$ defines the cost of labeling superpixel $sp_i$ with label $c_i$ and is represented as follows:

$\Psi(c_i | sp_i) = -\log[P_{fg}(c_i, sp_i)]. \qquad (6)$

The relationship between two adjacent superpixels $sp_i$ and $sp_j$ is modeled by the pairwise potential4 $\Phi(\cdot)$:

$\Phi(c_i, c_j | sp_i, sp_j) = [c_i \neq c_j] \exp(-\beta \|c_i - c_j\|^2), \qquad (7)$

$\beta = (2 \langle \|c_i - c_j\|^2 \rangle)^{-1}, \quad (sp_i, sp_j) \in E, \qquad (8)$

where $[\cdot]$ denotes the indicator function with values 0 or 1, $\|c_i - c_j\|^2$ is the L2 norm of the color difference between two adjacent nodes in LAB color space, and $\langle \cdot \rangle$ is the expectation operator.

The conditional probability can be optimized by graph cuts.17 Once the CRF model has been built, we minimize Eq. (5) with multilabel graph cuts18–20 based on an optimization library10 using the swap algorithm. This is quite efficient since the CRF model is defined on the superpixel-level graph.
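For the two-label (foreground/background) case, minimizing Eq. (5) reduces to a single s–t min-cut; a sketch with the PyMaxflow library (our choice, not the optimization library used by the authors) could look as follows. The mapping of segment indices to labels follows PyMaxflow's source/sink convention.

```python
# Two-label minimization of Eq. (5) as a single s-t min-cut via PyMaxflow.
import numpy as np
import maxflow

def segment_crf(P_fg, edges, edge_weights, w=1.0, eps=1e-8):
    """P_fg: (K,) foreground probability per superpixel.
    edges: iterable of (i, j) adjacent superpixel index pairs.
    edge_weights: exp(-beta * ||ci - cj||^2) per pair, as in Eq. (7)."""
    K = len(P_fg)
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(K)

    # Unary terms, Eq. (6): the two t-link capacities carry the costs of
    # labeling a superpixel foreground or background.
    for i in range(K):
        g.add_tedge(nodes[i], -np.log(P_fg[i] + eps),
                    -np.log(1.0 - P_fg[i] + eps))

    # Pairwise terms, Eq. (7): a Potts penalty paid only when labels differ.
    for (i, j), wt in zip(edges, edge_weights):
        g.add_edge(nodes[i], nodes[j], w * wt, w * wt)

    g.maxflow()
    # Segment 1 (sink side) is interpreted as foreground here.
    return np.array([g.get_segment(nodes[i]) for i in range(K)])
```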

2.4 Pool Construction

Now the superpixels are classified into two clusters: foreground and background. To learn the features of the object from the segmented result, the superpixels belonging to the foreground and background are separately selected to construct the object-like pool $o^{t-1}$ and the background-like pool $bg^{t-1}$:

$o^{t-1} = \{sp_i\}, \quad sp_i \in \text{foreground}, \qquad (9)$

$bg^{t-1} = \{sp_j\}, \quad sp_j \in \text{background}, \qquad (10)$

where $o^{t-1}$ and $bg^{t-1}$ are the independent object-like pool and background-like pool generated from the segmented result of the $(t-1)$'th frame. The color distribution and optical flow of each superpixel within the pools are recorded. Based on the temporal and spatial coherence of appearance and motion features, the real object in the next frame should be similar to the previously segmented foreground in both color and optical flow. Therefore, the two pools can be regarded as the "prior" knowledge for the object in the next frame. By comparing the features of the "new" superpixels in the current frame with the "old" superpixels in the two pools, we assign each "new" superpixel a likelihood of belonging to the foreground.
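One possible data layout for the two pools (a sketch under our own naming, not the authors' implementation): each pool simply records the color histogram, mean optical flow, and seed position of every superpixel assigned to its class.

```python
# Sketch of a pool layout: keep, for every superpixel of the segmented frame,
# its color histogram, mean flow, and seed position (names are ours).
def build_pools(sp_hists, sp_flow, seeds, fg_labels):
    """sp_hists: (K, bins) per-superpixel color histograms.
    sp_flow: (K, 2) mean optical flow per superpixel.
    seeds: (K, 2) superpixel centers, kept for neighbor lookups.
    fg_labels: (K,) 0/1 labels from the CRF segmentation."""
    fg = fg_labels.astype(bool)
    object_pool = {"hist": sp_hists[fg], "flow": sp_flow[fg], "seeds": seeds[fg]}
    background_pool = {"hist": sp_hists[~fg], "flow": sp_flow[~fg], "seeds": seeds[~fg]}
    return object_pool, background_pool
```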

2.5 Foreground Likelihood

Based on the segmented result of the previous frame, the object-like pool $o^{t-1}$ formed from the $(t-1)$'th frame is obtained. As discussed above, $o^{t-1}$ can be regarded as the "prior" knowledge for the current frame $t$, hence the key features of the object can be learned. Let $sp_i^t$ be the $i$'th superpixel in frame $t$ and $sp_k^{t-1}$ ($sp_k^{t-1} \in o^{t-1}$) be one of the $M_k$ nearest neighbors of $sp_i^t$. The similarity of $sp_i^t$ to the object is denoted as

$S_o^t(sp_i^t) = \frac{1}{M_k} \sum_{sp_k^{t-1} \in N(sp_i^t)} H(sp_k^{t-1}) \cdot H(sp_i^t)^T \cdot \exp\left(-\frac{D(\vec{sp}_k^{t-1}, \vec{sp}_i^t)}{\eta}\right), \qquad (11)$

where $H(\cdot)$ and $D(\cdot)$ are the histogram distribution and the Euclidean distance between optical flow vectors, respectively. The optical flow vector of $sp_i^t$ is denoted as $\vec{sp}_i^t$, and $\eta$ is the expectation of $D(\cdot)$.

Similarly, we repeat the aforementioned procedure with the background-like pool $bg^{t-1}$ and obtain the background similarity $S_{bg}^t$, so the likelihood of a certain superpixel in frame $t$ belonging to the foreground is

$L_{fg}^t = S_o^t / (S_o^t + S_{bg}^t). \qquad (12)$

The comprehensive probability of the superpixels to form the foreground is represented as

$P_{fg} = \beta \cdot L_{\text{flow}} + \gamma \cdot L_{fg} + (1 - \beta - \gamma) \cdot L_{\text{binary}}, \qquad (13)$

where $\beta$ and $\gamma$ weight the three features, with $\beta, \gamma \in (0, 1)$ and $(\beta + \gamma) \in (0, 1)$.

We then return to Sec. 2.3, where $P_{fg}$ is calculated by Eq. (13) instead of Eq. (3). Just as before, a new superpixel-based CRF model is built and a new segmentation is implemented by graph cut.
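The pool-based similarity of Eq. (11) and its use in Eqs. (12) and (13) might be sketched as follows; the nearest-neighbor search on superpixel seed positions and all names are our assumptions.

```python
# Sketch of Eqs. (11)-(12): score each current superpixel against its M_k
# nearest pool members; neighbors are found on seed positions (our choice).
import numpy as np

def pool_similarity(sp_hists, sp_flow, seeds, pool, M_k=9, eps=1e-8):
    K = sp_hists.shape[0]
    S = np.zeros(K)
    # eta in Eq. (11): the expectation of the flow distance D.
    eta = np.mean(np.linalg.norm(sp_flow[:, None] - pool["flow"][None], axis=2)) + eps
    for i in range(K):
        d_seed = np.linalg.norm(pool["seeds"] - seeds[i], axis=1)
        nn = np.argsort(d_seed)[:M_k]                   # M_k nearest pool members
        h_sim = pool["hist"][nn] @ sp_hists[i]          # H(sp_k) . H(sp_i)^T
        d_flow = np.linalg.norm(pool["flow"][nn] - sp_flow[i], axis=1)
        S[i] = np.mean(h_sim * np.exp(-d_flow / eta))   # Eq. (11)
    return S

# Usage: S_o from the object-like pool and S_bg from the background-like pool
# give L_fg = S_o / (S_o + S_bg) via Eq. (12); Eq. (13) then fuses L_flow,
# L_fg, and L_binary with weights beta and gamma.
```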

2.6 Exception Handling

The object-like pool works well most of the time, and the segmented results will theoretically improve frame by frame. However, when the previous segmented foreground is mixed with some noise, it has a negative effect on the object-like pool. Furthermore, the error accumulates in the current segmentation based on the inaccurate object-like pool, so a vicious circle occurs. This is most likely to happen at the first initial segmentation because the initially segmented result is generally coarse. Therefore, measures should be taken to prevent error accumulation.

Let $R_n^t$ be the mean ratio of the number of superpixels in the object-like pool from frame $(t-n)$ to frame $t$:

$R_n^t = \frac{1}{n} \sum_{i=1}^{n} \frac{N_{sp}^{t-i+1}}{N_{sp}^{t-i}}, \qquad (14)$

where $N_{sp}^t$ represents the number of foreground superpixels in frame $t$. Therefore, $R_1^t$ is the ratio of the foreground superpixels from frame $(t-1)$ to frame $t$. Let $R$ be the set of normal ratios. Then the state of the object-like pool is represented as

$\text{state} = \begin{cases} \text{normal}, & R_1^t \in R, & \text{if } R_1^t / R_n^{t-1} \in (1 - \lambda, 1 + \lambda) \\ \text{abnormal}, & R_1^t \notin R, & \text{otherwise} \end{cases} \qquad (15)$

The parameter $n$ ($n = 3$ recommended) denotes the number of previous reference frames, and the parameter $\lambda$ ($\lambda = 0.2$ in our experiments) is the offset of the floor and ceiling bounds, respectively.

Once the state of the object-like pool is abnormal, the exception handling is activated: we discard the object-like pool and the background-like pool and reinitialize the foreground likelihood based on Eq. (3) instead of Eq. (13). This exception-handling mechanism is quite effective at avoiding error accumulation.
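A direct transcription of the state check of Eqs. (14) and (15) (function and variable names are ours; the per-frame counts are assumed to be nonzero):

```python
# Transcription of Eqs. (14)-(15); N_sp is a sequence of foreground-superpixel
# counts per frame, indexed by frame number.
def pool_state(N_sp, t, n=3, lam=0.2):
    # Eq. (14) evaluated at frame t-1: mean frame-to-frame count ratio over
    # the previous n frames.
    R_n_prev = sum(N_sp[t - i] / N_sp[t - i - 1] for i in range(1, n + 1)) / n
    R_1 = N_sp[t] / N_sp[t - 1]                 # current ratio R^t_1
    # Eq. (15): normal only if the current ratio stays within a factor of
    # (1 - lambda, 1 + lambda) of the recent mean ratio.
    ok = (1 - lam) < R_1 / R_n_prev < (1 + lam)
    return "normal" if ok else "abnormal"
```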

3 Experimental Results

Our algorithm is evaluated on several challenging sequences: "bungalows," "twoPositionPTZCam," "highway," "fall," "snowFall," and "blizzard." They are from the Change Detection 2014 dataset and cover objects running out of sight, direction changes, shadows, dynamic backgrounds, partial occlusion, bad weather, and similar colors. The proposed algorithm (ours) is compared with the binary mask (BM), ours-shortcut (ours-SC), and PSP-MRF algorithms.8

Note that the ours-SC algorithm lacks the object-like pool and background-like pool that provide "prior" information for the next segmentation. In addition, only a few sequential frames (fewer than 25 in all the experiments) are chosen to run our unsupervised algorithm, because we do not need a huge number of frames to build and update a background model or to serve as training frames. Moreover, we only consider a single rigid moving object with a motionless camera in our experiments.

3.1 Qualitative Evaluation

The dataset provides various types of noise: "bungalows" shows the condition where the moving object is running out of the camera's visual field, so several frames capture only a part of the object. In "twoPositionPTZCam," the object continuously changes its moving direction around a corner. The car in "highway" suffers from the shadows of the trees above, and "fall" presents the dynamic background of swaying leaves and partial occlusion from the middle tree. In addition, a mass of snow is falling in "snowFall," which features very bad weather. In "blizzard," the small car has a color similar to the snowy background.

Figure 2 shows the qualitative results of ours, ours-SC, PSP-MRF, and the ground truth. BM results are not drawn because they are mostly fragmentary, which would clutter the figure. According to the visual evaluation, the PSP-MRF method performs the worst on average because of its incomplete and even fragmentary segmentations. Furthermore, ours-SC achieves better results than PSP-MRF, although it still lacks some detailed components of the object. By learning the object-like pool and background-like pool, our approach outperforms all the compared methods in terms of robustness and completeness.

3.2 Quantitative Evaluation

The performance of the different methods is evaluated by two measures: F-measure and percentage of wrong classification (PWC). F-measure is the harmonically weighted balance of precision and recall.21 F-measure and PWC are defined as

$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad (16)$

$\text{PWC} = \frac{FN + FP}{TP + TN + FP + FN}, \qquad (17)$

where TP, TN, FP, and FN are abbreviations for true positive, true negative, false positive, and false negative, respectively. The detailed quantitative performances are shown in Fig. 3. Although ours-SC shows comparatively good results on "snowFall" and "blizzard," it sometimes produces terrible results (see the result of "fall"). We conclude that it is not robust, and neither is PSP-MRF. Above all, the average scores of our method in terms of F-measure and PWC are the best among the compared methods.

Fig. 2 Visual segmentation results: (a) bungalows, (b) twoPositionPTZCam, (c) highway, (d) fall, (e) snowFall, and (f) blizzard. The results of ours, ours-SC, PSP-MRF, and ground truth are respectively represented by red, green, blue, and yellow curves.
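For concreteness, the two measures of Eqs. (16) and (17) can be computed from boolean prediction and ground-truth masks as in the following sketch (names are ours):

```python
# Sketch of Eqs. (16)-(17) on boolean prediction/ground-truth masks.
import numpy as np

def f_measure_and_pwc(pred, gt):
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f = 2 * precision * recall / max(precision + recall, 1e-8)   # Eq. (16)
    pwc = (fn + fp) / (tp + tn + fp + fn)                        # Eq. (17)
    return f, pwc
```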

3.3 Impact of Binary Mask

The binary mask is one of the basic cues exploited by PSP-MRF, ours-SC, and ours. Specifically, it makes up the probabilistic superpixels in PSP-MRF and occupies a weighted part in both ours-SC and ours, so their results are closely related to the binary mask. For the implementation of the binary mask, we use the temporal difference method. Although it is simple and sensitive for detecting changes, it has poor antinoise performance and outputs an incomplete object with "ghosts" (see the rapidly descending magenta line in "bungalows" in Fig. 3).
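A minimal sketch of such a temporal-difference mask (the threshold value and the use of OpenCV are our choices; the paper does not specify them):

```python
# Sketch of a temporal-difference binary mask: threshold the absolute
# difference of consecutive grayscale frames (threshold is illustrative).
import cv2

def temporal_difference_mask(prev_gray, curr_gray, thresh=25):
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, thresh, 1, cv2.THRESH_BINARY)
    return mask   # 0/1 pixel-level mask B, as consumed by Eq. (1)
```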

In Fig. 3, it is easy to see that the blue PSP-MRF line has a certain positive correlation with the magenta BM line. According to Ref. 8, the binary mask directly determines the unary term, which captures the likelihood of superpixels belonging to the foreground. As a result, the performance of PSP-MRF worsens when the binary mask goes bad. Furthermore, the ours-SC method fuses the optical flow and binary mask together, so its performance is partly influenced by the binary mask. Moreover, with the object-like pool and background-like pool, our method is only slightly influenced by the binary mask even when it goes bad (see the red line in "bungalows," "twoPositionPTZCam," and "blizzard" in Fig. 3). Overall, the proposed algorithm is the least sensitive to the performance of the binary mask.

3.4 Impact of Optical Flow

Like the binary mask discussed previously, optical flow constitutes one of the elements of ours-SC and ours. However, it is vulnerable to noise that may be generated by illumination changes or areas of uniform color. For example, in the "fall" dataset of Fig. 3, the reflection of the ground increases the error of the optical flow, and the green line deteriorates quickly even though the binary mask is not so bad. In contrast, our algorithm remains the best under this condition. As with the binary mask, the proposed algorithm is also the least sensitive to the performance of the optical flow.

3.5 Effectiveness of Object-Like Pool

To further evaluate the effectiveness of our object-like pool, a comparison is conducted between the method with the object-like pool (ours) and without it (ours-SC). According to the performance in Fig. 3, our proposed algorithm achieves the smoothest and highest F-measure curves and the lowest PWC on average, while the curves of ours-SC fluctuate heavily and perform worse than ours. The reason is that the object-like pool provides a reliable and continuous way to propagate the object against the noise from other features. Besides, the details of the objects can still be improved by our algorithm even when ours-SC has already achieved good results, as with the performances on "snowFall" and "blizzard" shown in Fig. 3. In brief, the proposed method with an object-like pool achieves more robust and accurate results than the methods without it.

Fig. 3 Performance comparison of different methods: (a) the quantitative result of bungalows, (b) the quantitative result of twoPositionPTZCam, (c) the quantitative result of highway, (d) the quantitative result of fall, (e) the quantitative result of snowFall, and (f) the quantitative result of blizzard.

3.6 Impact of Parameter Selection

To study the sensitivity of parameter selection, different values of α, β, and γ are chosen. Taking the typical "bungalows" sequence as an example, we compute the segmented results for three groups of parameters; the performance is illustrated in Fig. 4. We call "bungalows" typical because the last two frames have comparatively satisfying optical flows but terrible binary masks, which are balanced by α, β, and γ. According to the F-measure curves in Fig. 4, the last two points of ours-SC descend quickly as the weight of the binary mask increases. However, our approach still maintains an excellent performance even when faced with an awful binary mask. Therefore, our approach is more robust than ours-SC with respect to the parameters.

3.7 Comparison of Computational Complexity

Computational complexity is introduced to make a scientific comparison of the time cost of the different approaches. We first establish the notation used:

1. Let H and W be the height and width of the video frame.
2. Let h and w be the height and width of the moving region.
3. Let K be the total number of superpixels.
4. Let S be the number of pixels between two adjacent seeds of the superpixels.
5. Let T be the number of iterations of superpixel segmentation in the SLIC method.
6. Let L be the length of the search range in the SLIC method.
7. Let N be the number of neighbors described in Eq. (11).

According to the detailed algorithm of SLIC, its running time is $O(whTL^2/K)$. We set $T = 10$ and $L = 3$ for the realization of SLIC in all the experiments, and $K$ is generally larger than 100. Therefore, we have $O(whTL^2/K) \leq O(wh) < O(WH)$. The proposed object-like pool and background-like pool cost $O(NK)$ running time in total, in which we choose $N = 9$ as the nine-connected neighbors in Eq. (11). Since the features of the binary mask and optical flow are defined at the superpixel level, they take at most $O(K) \leq O(WH)$ running time. The implementation of graph cut costs $O(wh/S^2) = O(K)$ running time because $S = \sqrt{wh/K}$.

Based on these inferences, we compare our approach (ours) in terms of computational complexity with ours-SC, PSP-MRF, and BM in Table 1. We find that the computational complexity of all the methods is equal in polynomial time.

Fig. 4 Performance comparison of different parameters: (a) high weight for optical flow and low weight for binary mask; (b) equal weights for optical flow and binary mask; (c) low weight for optical flow and high weight for binary mask.

Table 1 Computational complexity of different methods.

Method                    Computational complexity
BM                        $O(WH)$
PSP-MRF                   $O(whTL^2/K) + O(WH) = O(WH)$
Ours-shortcut (ours-SC)   $O(whTL^2/K) + O(WH) + O(K) = O(WH)$
Ours                      $O(whTL^2/K) + O(WH) + O(K) + O(NK) = O(WH)$

4 Conclusions

We have proposed a robust and effective method to improve unlabeled short video segmentation based on an object-like pool. Our approach exploits the temporal and spatial coherence of the appearance and motion features of the moving object to generate the foreground likelihood across frames. According to the qualitative and quantitative results, our approach outperforms the other compared methods in both accuracy and robustness, even when the binary mask and optical flow suffer from great interference.

However, the proposed algorithm still has some limitations. Occasionally we need to empirically tune the weighting parameters among the different features to produce satisfactory results, so an intelligent and adaptive method to automatically generate the weights should be developed. In addition, our method works worse for nonrigid objects than for rigid objects because of the conflicting optical flow within them. Therefore, a more generalized algorithm should be developed to solve this problem in future work.

Acknowledgments

This work is partly supported by the National Natural Science Foundation of China (14ZR1447200).

References

1. S. C. Huang, "An advanced motion detection algorithm with video quality analysis for video surveillance systems," IEEE Trans. Circuits Syst. Video Technol. 21(1), 1–14 (2011).
2. N. C. Mithun, N. U. Rashid, and S. M. M. Rahman, "Detection and classification of vehicles from video using multiple time-spatial images," IEEE Trans. Intell. Transp. Syst. 13(3), 1215–1225 (2012).
3. J. Shotton et al., "TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation," Lect. Notes Comput. Sci. 3951, 1–15 (2006).
4. D. Zhang, O. Javed, and M. Shah, "Video object segmentation through spatially accurate and temporally dense extraction of primary object regions," in Proc. 2013 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 628–635 (2013).
5. X. M. He, R. S. Zemel, and M. A. Carreira-Perpinan, "Multiscale conditional random fields for image labeling," in Proc. 2004 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Vol. 2, pp. 695–702 (2004).
6. Z. Q. Tian et al., "Video object segmentation with shape cue based on spatiotemporal superpixel neighbourhood," IET Comput. Vision 8(1), 16–25 (2014).
7. X. F. Wang and X. P. Zhang, "A new localized superpixel Markov random field for image segmentation," in Proc. IEEE Int. Conf. on Multimedia and Expo (ICME 2009), Vol. 1–3, pp. 642–645 (2009).
8. A. Schick, M. Bauml, and R. Stiefelhagen, "Improving foreground segmentations with probabilistic superpixel Markov random fields," in Proc. 2012 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 27–31 (2012).
9. G. Shu, A. Dehghan, and M. Shah, "Improving an object detector and extracting regions using superpixels," in Proc. 2013 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 3721–3727 (2013).
10. B. Fulkerson, A. Vedaldi, and S. Soatto, "Class segmentation and object localization with superpixel neighborhoods," in Proc. 2009 IEEE 12th Int. Conf. on Computer Vision (ICCV), pp. 670–677 (2009).
11. X. F. Ren and J. Malik, "Learning a classification model for segmentation," in Proc. Ninth IEEE Int. Conf. on Computer Vision, Vol. 1, pp. 10–17 (2003).
12. A. Levinshtein et al., "TurboPixels: fast superpixels using geometric flows," IEEE Trans. Pattern Anal. Mach. Intell. 31(12), 2290–2297 (2009).
13. Y. Yuan, J. W. Fang, and Q. Wang, "Robust superpixel tracking via depth fusion," IEEE Trans. Circuits Syst. Video Technol. 24(1), 15–26 (2014).
14. F. Yang, H. C. Lu, and M. H. Yang, "Robust superpixel tracking," IEEE Trans. Image Process. 23(4), 1639–1651 (2014).
15. R. Achanta et al., "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2274–2281 (2012).
16. C. Sutton, A. McCallum, and K. Rohanimanesh, "Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data," J. Mach. Learn. Res. 8, 693–723 (2007).
17. Y. Boykov and G. Funka-Lea, "Graph cuts and efficient N-D image segmentation," Int. J. Comput. Vision 70(2), 109–131 (2006).
18. V. Kolmogorov and R. Zabih, "What energy functions can be minimized via graph cuts?" IEEE Trans. Pattern Anal. Mach. Intell. 26(2), 147–159 (2004).
19. Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell. 23(11), 1222–1239 (2001).
20. Y. Boykov and V. Kolmogorov, "An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision," IEEE Trans. Pattern Anal. Mach. Intell. 26(9), 1124–1137 (2004).
21. L. Maddalena and A. Petrosino, "A self-organizing approach to background subtraction for visual surveillance applications," IEEE Trans. Image Process. 17(7), 1168–1177 (2008).

Xiaoliu Cheng received his BS degree in electronic science and technology from Zhengzhou University, Zhengzhou, China, in 2011. From 2011 to 2012, he studied signal processing at the University of Science and Technology of China, Hefei, China. Currently, he is pursuing his PhD degree at the Shanghai Institute of Microsystem and Information Technology (SIMIT), Chinese Academy of Sciences (CAS), Shanghai, China. His research interests include computer vision, machine learning, and wireless sensor networks.

Wei Lv received her MS degree from Harbin Engineering University, Harbin, China, in 2007. She is an assistant researcher at SIMIT, CAS, Shanghai, China. Her research interests include image processing and wireless sensor networks.

Huawei Liu received his MS degree from Harbin Engineering University, Harbin, China, in 2008. He is an assistant researcher at SIMIT, CAS, Shanghai, China. His research interests include sensor signal processing and wireless sensor networks.

Xing You received her PhD from SIMIT, CAS, Shanghai, China, in 2013. She is an assistant professor at SIMIT, CAS, Shanghai, China. Her research interests include video processing and information hiding.

Baoqing Li received his PhD from the State Key Laboratory of Transducer Technology, Shanghai Institute of Metallurgy, CAS, Shanghai, China, in 1999. Currently, he is a professor at SIMIT, CAS, Shanghai, China. His research interests include signal processing, microelectromechanical systems, and wireless sensor networks.

Xiaobing Yuan received his PhD from the Changchun Institute of Optics, Fine Mechanics and Physics, CAS, Changchun, China, in 2000. Currently, he is a professor at SIMIT, CAS, Shanghai, China. His research interests include wireless sensor networks and information transmission and processing.
