IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 13, NO. 11, NOVEMBER 2004 1459

Statistical Modeling of Complex Backgrounds for Foreground Object Detection

Liyuan Li, Member, IEEE, Weimin Huang, Member, IEEE, Irene Yu-Hua Gu, Senior Member, IEEE, and Qi Tian, Senior Member, IEEE

Abstract—This paper addresses the problem of background modeling for foreground object detection in complex environments. A Bayesian framework that incorporates spectral, spatial, and temporal features to characterize the background appearance is proposed. Under this framework, the background is represented by the most significant and frequent features, i.e., the principal features, at each pixel. A Bayes decision rule is derived for background and foreground classification based on the statistics of principal features. Principal feature representation for both the static and dynamic background pixels is investigated. A novel learning method is proposed to adapt to both gradual and sudden "once-off" background changes. The convergence of the learning process is analyzed and a formula to select a proper learning rate is derived. Under the proposed framework, a novel algorithm for detecting foreground objects from complex environments is then established. It consists of change detection, change classification, foreground segmentation, and background maintenance. Experiments were conducted on image sequences containing targets of interest in a variety of environments, e.g., offices, public buildings, subway stations, campuses, parking lots, airports, and sidewalks. Good results of foreground detection were obtained. Quantitative evaluation and comparison with the existing method show that the proposed method provides much improved results.

Index Terms—Background maintenance, background modeling, background subtraction, Bayes decision theory, complex background, feature extraction, motion analysis, object detection, principal features, video surveillance.

I. INTRODUCTION

IN COMPUTER vision applications, such as video surveillance, human motion analysis, human-machine interaction, and object-based video encoding (e.g., MPEG-4), objects of interest are often the moving foreground objects in an image sequence. One effective way of foreground object extraction is to suppress the background points in the image frames [1]–[6]. To achieve this, an accurate and adaptive background model is often desirable.

Background usually contains nonliving objects that remain passive in the scene. The background objects can be stationary objects, such as walls, doors, and room furniture, or nonstationary objects, such as wavering bushes or moving escalators.

Manuscript received June 19, 2003; revised January 29, 2004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Luca Lucchese.

L. Li, W. Huang, and Q. Tian are with the Institute for Infocomm Research, Singapore 119613 (e-mail: [email protected]; [email protected]; [email protected]).

I. Y.-H. Gu is with the Department of Signals and Systems, Chalmers University of Technology, SE-412 96 Göteborg, Sweden (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIP.2004.836169

The appearance of background objects often undergoes various changes over time, e.g., changes in brightness caused by changing weather conditions or the switching on/off of lights. The background image can be described as consisting of static and dynamic pixels. The static pixels belong to the stationary objects, and the dynamic pixels are associated with nonstationary objects. A static background pixel can be converted to a dynamic one as time advances, e.g., by turning on a computer screen. A dynamic background pixel can also turn into a static one, such as a pixel on a bush when the wind stops. To describe a general background scene, a background model must be able to

1) represent the appearance of a static background pixel;
2) represent the appearance of a dynamic background pixel;
3) self-evolve to gradual background changes;
4) self-evolve to sudden "once-off" background changes.

For background modeling without specific domain knowledge, the background is usually represented by image features at each pixel. The features extracted from an image sequence can be classified into three types: spectral, spatial, and temporal features. Spectral features could be associated with gray-scale or color information, spatial features with gradient or local structure, and temporal features with interframe changes at the pixel. Many existing methods utilize spectral features (distributions of intensities or colors at each pixel) to model the background [4], [5], [7]–[9]. To be robust to illumination changes, some spatial features are also exploited [2], [10], [11]. The spectral and spatial features are suitable to describe the appearance of static background pixels. Recently, a few methods have introduced temporal features to describe the dynamic background pixels associated with nonstationary objects [6], [12], [13]. There is, however, a lack of systematic approaches to incorporate all three types of features into a representation of a complex background containing both stationary and nonstationary objects.

The features that characterize stationary and dynamic background objects should be different. For a background model to describe a general background, it should be able to learn the significant features of the background at each pixel and provide the information needed for foreground and background classification. Motivated by this, a Bayesian framework which incorporates multiple types of features for modeling complex backgrounds is proposed in this paper. The major novelties of the proposed method are as follows.

1) A Bayesian framework is proposed for incorporating spectral, spatial, and temporal features in the background modeling.


2) A new formula of the Bayes decision rule is derived for background and foreground classification.

3) The background is represented using statistics of principal features associated with stationary and nonstationary background objects.

4) A novel method is proposed for learning and updating background features to adapt to both gradual and "once-off" background changes.

5) The convergence of the learning process is analyzed and a formula is derived to select a proper learning rate.

6) A new real-time algorithm is developed for foreground object detection from complex environments.

Further, a wide range of tests is conducted on a variety of environments, including offices, campuses, parks, commercial buildings, hotels, subway stations, airports, and sidewalks.

The remaining part of the paper is organized as follows. After a brief literature review of existing work in Section I-A, Section II describes the statistical modeling of complex backgrounds based on principal features. First, a new formula of the Bayes decision rule for background and foreground classification is derived. Based on this formula, an effective data structure to record the statistics of principal features is established. Principal feature representation for different background objects is addressed. In Section III, the method for learning and updating the statistics of principal features is described. Strategies to adapt to both gradual and sudden "once-off" background changes are proposed. Properties of the learning process are analyzed. In Section IV, an algorithm for foreground object detection based on the statistical background modeling is described. It contains four steps: change detection, change classification, foreground segmentation, and background maintenance. Section V presents the experimental results on various environments. Evaluations and comparisons with an existing method are also included. Finally, conclusions are given in Section VI.

A. Related Work

A simple and direct way to describe the background at each pixel is to use the spectral information, i.e., the gray-scale or color of the background pixel. Early studies describe background features using an average of gray-scale or color intensities at each pixel. Infinite impulse response (IIR) or Kalman filters [7], [14], [15] are employed to update slow and gradual changes in the background. These methods are applicable to backgrounds consisting of stationary objects. To tolerate the background variations caused by imaging noise, illumination changes, and the motion of nonstationary objects, statistical models are used to represent the spectral features at each background pixel. The frequently used models include Gaussians [8], [16]–[22] and mixtures of Gaussians (MoG) [4], [23]–[25]. In these models, one or a few Gaussians are used to represent the color distributions at each background pixel. A mixture of Gaussian distributions can represent various background appearances, e.g., road surfaces under the sun or in the shadows [23]. The parameters (mean, variance, and weight) of each Gaussian are updated using an IIR filter to adapt to gradual background changes. Moreover, by replacing an old Gaussian with a newly learned color distribution, MoG can adapt to

"once-off" background changes. In [9], a nonparametric model is proposed for background modeling, where a kernel-based function is employed to represent the color distribution of each background pixel. The kernel-based distribution is a generalization of MoG which does not require parameter estimation; its computational cost, however, is high. A variant model is used in [5], where the distribution of temporal variations in color at each pixel is used to model the spectral feature of the background. MoG performs better in a time-varying environment where the background is not completely stationary, but the method can misclassify the foreground if the background scene is complex [19], [26]. For example, if the background contains a nonstationary object with significant motion, the colors of pixels in that region may change widely over time. Foreground objects with similar colors (camouflage foreground objects) could then easily be misclassified as background.

Spatial information has recently been exploited to improve the accuracy of background representation. Local statistics of the spectral features [27], [28], local texture features [2], [3], or global structure information [29] are found helpful for accurate foreground extraction. These methods are most suitable for stationary backgrounds. Paragios and Ramesh [10] use a mixture model (Gaussians or Laplacians) to represent the distributions of background differences for static background points. A Markov random field (MRF) model is developed to incorporate the spatio-spectral coherence for robust foreground segmentation. In [11], gradient distributions are introduced to MoG to reduce the misclassifications that arise from depending purely on color distributions. Spatial information helps to detect camouflage foreground objects and suppress shadows. Spatial features are, however, not applicable to nonstationary background objects at the pixel level, since the corresponding spatial features vary over time.

A few more attempts to segment foreground objects from nonstationary backgrounds have been made by using temporal features. One way is to estimate the consistency of optical flow over a short duration of time [13], [30]. The dynamic features of nonstationary background objects are represented by the significant variation of accumulated local optical flows. In [12], Li et al. propose a method that employs the statistics of color co-occurrence between two consecutive frames to model the dynamic features associated with a nonstationary background object. Temporal features are suitable for modeling the appearance of nonstationary objects. In Wallflower [6], Toyama et al. use a linear Wiener filter, a self-regression model, to represent intensity changes for each background pixel. The linear predictor can learn and estimate the intensity variations of a background pixel and works well for periodic changes. The linear regression model, however, has difficulty predicting shadows and background changes of varying frequency in natural scenes. A brief summary of the existing methods based on the types of features used is listed in Table I. Further, most existing methods perform the background and foreground classification with one or more heuristic thresholds. For backgrounds with different complexities, the thresholds must be adjusted empirically. In addition, these methods are often tested only on a few background environments (e.g., laboratories, campuses, etc.).


TABLE I: CLASSIFICATION OF PREVIOUS METHODS AND THE PROPOSED METHOD

II. STATISTICAL MODELING OF THE BACKGROUND

A. Bayes Classification of Background and Foreground

For arbitrary background and foreground objects or regions, the classification of the background and the foreground can be formulated under Bayes decision theory.

Let $s = (x, y)$ be the position of an image pixel, $I_t$ be the input image at time $t$, and $\mathbf{v}_t$ be an $n$-dimensional feature vector extracted at the position $s$ and time $t$ from the image sequence. Then, the posterior probability of the feature vector coming from the background at $s$ can be computed by using the Bayes rule

$$P(b \mid \mathbf{v}_t, s) = \frac{P(\mathbf{v}_t \mid b, s)\, P(b \mid s)}{P(\mathbf{v}_t \mid s)} \qquad (1)$$

where $b$ indicates the background, $P(\mathbf{v}_t \mid b, s)$ is the probability of the feature vector being observed as background at $s$, $P(b \mid s)$ is the prior probability of the pixel belonging to the background, and $P(\mathbf{v}_t \mid s)$ is the prior probability of the feature vector being observed at the position $s$. Similarly, the posterior probability that the feature vector comes from a foreground object at $s$ is

$$P(f \mid \mathbf{v}_t, s) = \frac{P(\mathbf{v}_t \mid f, s)\, P(f \mid s)}{P(\mathbf{v}_t \mid s)} \qquad (2)$$

where $f$ denotes the foreground. Using the Bayes decision rule, a pixel is classified as belonging to the background according to its feature vector $\mathbf{v}_t$ observed at time $t$ if

$$P(b \mid \mathbf{v}_t, s) > P(f \mid \mathbf{v}_t, s) \qquad (3)$$

Otherwise, it is classified as belonging to the foreground. Noting that a feature vector observed at an image pixel comes from either background or foreground objects, it follows that

$$P(\mathbf{v}_t \mid s) = P(\mathbf{v}_t \mid b, s)\, P(b \mid s) + P(\mathbf{v}_t \mid f, s)\, P(f \mid s) \qquad (4)$$

Substituting (1) and (4) into (3), the Bayes decision rule (3) becomes

$$2\, P(\mathbf{v}_t \mid b, s)\, P(b \mid s) > P(\mathbf{v}_t \mid s) \qquad (5)$$

Using (5), the pixel with observed feature vector $\mathbf{v}_t$ at time $t$ can be classified as a background or a foreground point, provided that the probabilities $P(b \mid s)$, $P(\mathbf{v}_t \mid s)$, and $P(\mathbf{v}_t \mid b, s)$ are known in advance.
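Read as code, rule (5) needs only two numbers per pixel and feature type. The minimal sketch below (our illustration, not the authors' implementation; function and parameter names are ours) assumes those probabilities have already been estimated:

```python
def is_background(p_v: float, p_vb: float) -> bool:
    """Bayes decision rule (5): background iff 2*P(v_t|b,s)*P(b|s) > P(v_t|s).

    p_v  -- estimate of the feature prior P(v_t | s)
    p_vb -- estimate of the joint term P(v_t | b, s) * P(b | s)
    """
    return 2.0 * p_vb > p_v
```

Section II-B describes how both quantities are read off a small per-pixel table of principal feature statistics.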

B. Principal Feature Representation of Background

To apply (5) for classification of background and foreground, the probability functions $P(b \mid s)$, $P(\mathbf{v}_t \mid s)$, and $P(\mathbf{v}_t \mid b, s)$ should be known in advance, or should be properly estimated. For complex backgrounds, the forms of these probability functions are unknown. One way to estimate them is to use the histogram of features. The problem that would be encountered is the high cost of storage and computation. Assuming $\mathbf{v}$ is an $n$-dimensional vector and each of its elements is quantized to $L$ values, the histogram would contain $L^n$ cells. For example, for a color vector whose three components each have 256 levels, the histogram would contain $256^3$ cells. The method would be unrealistic in terms of computational and memory requirements.

It is reasonable to assume that, if the selected features represent the background effectively, the intraclass spread of background features should be small, which implies that the distribution of background features will be highly concentrated in a small region of the histogram. Further, features from various foreground objects would spread widely in the feature space. Therefore, there would be less overlap between the distributions of background and foreground features. This implies that, with a proper selection and quantization of features, it would be possible to approximately describe the background by using only a small number of feature vectors. A concise data structure to implement such a representation of the background is created as follows.

Let $\{\mathbf{v}_i\}_{i=1,\ldots,N}$ be the quantized feature vectors sorted in descending order with respect to $P(\mathbf{v}_i \mid b, s)$ for each pixel $s$. Then, for a proper selection of features, there would be a small integer $N_1$ ($N_1 \ll N$), a high percentage value $M_1$, and a low percentage value $M_2$ (e.g., $M_1$ close to 100% and $M_2$ close to 0%) such that the background could be well approximated by the first $N_1$ feature vectors

$$\sum_{i=1}^{N_1} P(\mathbf{v}_i \mid b, s) > M_1 \quad \text{and} \quad \sum_{i=1}^{N_1} P(\mathbf{v}_i \mid f, s) < M_2 \qquad (6)$$

The value of $N_1$ and the existence of $M_1$ and $M_2$ depend on the selection and quantization of the feature vectors. The first $N_1$ feature vectors are defined as the principal features of the background at the pixel $s$.

To learn and update the prior and conditional probabilities for the principal feature vectors, a table of statistics for the possible principal features is established for each feature type at $s$. The table is denoted as

$$\mathbf{T}^{t}_{\mathbf{v}}(s) = \left\{ S^{i}_{\mathbf{v},t}(s) \right\}_{i=1,\ldots,N_2}, \qquad N_2 > N_1 \qquad (7)$$

where $\mathbf{T}^{t}_{\mathbf{v}}(s)$ is learned based on the observation of the features up to time $t$ and records the statistics of the $N_2$ most frequent feature vectors at pixel $s$.


Fig. 1. One example of learned principal features for a static background pixel in a busy scene. The left image shows the position of the selected pixel. The two right images are the histograms of the statistics for the most significant colors and gradients, where the height of a bar is the value of $p^{i}_{\mathbf{v},t}$, the light gray part is $p^{i}_{\mathbf{v}b,t}$, and the top dark gray part is $p^{i}_{\mathbf{v},t} - p^{i}_{\mathbf{v}b,t}$. The icons below the histograms are the corresponding color and gradient features.

Each element $S^{i}_{\mathbf{v},t}(s)$ contains three components

$$S^{i}_{\mathbf{v},t}(s) = \left\{ p^{i}_{\mathbf{v},t}(s),\; p^{i}_{\mathbf{v}b,t}(s),\; \mathbf{v}_i(s) = [v^{1}_{i}, \ldots, v^{n}_{i}]^{T} \right\} \qquad (8)$$

where $n$ is the dimension of the feature vector, $p^{i}_{\mathbf{v},t}(s)$ approximates $P(\mathbf{v}_i \mid s)$, and $p^{i}_{\mathbf{v}b,t}(s)$ approximates $P(\mathbf{v}_i \mid b, s)P(b \mid s)$. The elements $S^{i}_{\mathbf{v},t}(s)$ in the table are sorted in descending order with respect to the value $p^{i}_{\mathbf{v},t}(s)$. The first $N_1$ elements from the table $\mathbf{T}^{t}_{\mathbf{v}}(s)$, together with the learned prior $p^{t}_{b}(s) \approx P(b \mid s)$, are used in (5) for background and foreground classification.
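A direct transcription of (7) and (8) into a data structure might look as follows (a sketch under our naming; the paper does not prescribe an implementation):

```python
from dataclasses import dataclass, field

@dataclass
class FeatureStat:
    # One element S_i of (8): feature prior p_v ~ P(v_i|s), joint
    # statistic p_vb ~ P(v_i|b,s)P(b|s), and the quantized vector itself.
    p_v: float
    p_vb: float
    v: tuple

@dataclass
class FeatureTable:
    # Table (7) kept per pixel and per feature type: up to n2 elements
    # sorted by p_v in descending order, plus the prior p_b ~ P(b|s).
    n1: int      # number of principal features used in classification
    n2: int      # table capacity, n2 > n1
    p_b: float = 0.0
    stats: list = field(default_factory=list)
```

One such table per feature type (colors, gradients, color co-occurrences) is kept at every pixel; the update rules of Section III operate directly on these fields.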

C. Feature Selection

The next essential issue for principal feature representation is feature selection. The significant features of different background objects are different. To achieve an effective and accurate representation of background pixels with principal features, the employment of proper types of features is important. Three types of features, the spectral, spatial, and temporal features, are used for background modeling.

1) Features for Static Background Pixels: For a pixel belonging to a stationary background object, the stable and most significant features are its color and local structure (gradient). Hence, two tables are used to learn the principal features: $\mathbf{T}^{t}_{\mathbf{c}}(s)$ and $\mathbf{T}^{t}_{\mathbf{e}}(s)$, with $\mathbf{c}_t$ and $\mathbf{e}_t$ representing the color and gradient vectors, respectively. Since the gradient is less sensitive to illumination changes, the two types of feature vectors can be integrated under the Bayes framework as follows.

Let $\mathbf{v}_t = \{\mathbf{c}_t, \mathbf{e}_t\}$ and assume that $\mathbf{c}_t$ and $\mathbf{e}_t$ are independent; the Bayes decision rule (5) then becomes

$$2\, P(\mathbf{c}_t \mid b, s)\, P(\mathbf{e}_t \mid b, s)\, P(b \mid s) > P(\mathbf{c}_t \mid s)\, P(\mathbf{e}_t \mid s) \qquad (9)$$

For the features from static background pixels, the quantization measure should be less sensitive to illumination changes. Here, a normalized distance measure based on the inner product of two vectors is employed for both color and gradient vectors. The distance measure is

$$d(\mathbf{v}_1, \mathbf{v}_2) = 1 - \frac{2\, \mathbf{v}_1^{T} \mathbf{v}_2}{\|\mathbf{v}_1\|^2 + \|\mathbf{v}_2\|^2} \qquad (10)$$

where $\mathbf{v}$ can be $\mathbf{c}$ or $\mathbf{e}$, respectively. If $d(\mathbf{v}_1, \mathbf{v}_2)$ is less than a small value $\delta$, $\mathbf{v}_1$ and $\mathbf{v}_2$ are matched to each other. The robustness of the distance measure (10) to illumination changes and imaging noise is shown in [2]. The color vector is directly obtained from the input images with 256 resolution levels for each component, while the gradient vector is obtained by applying the Sobel operator to the corresponding gray-scale input images, again with 256 resolution levels. With a small table size $N_2$, this representation is found accurate enough to learn the principal features for static background pixels. An example of the principal feature representation for a static background pixel is shown in Fig. 1, where the histograms of the most significant color and gradient features in $\mathbf{T}^{t}_{\mathbf{c}}(s)$ and $\mathbf{T}^{t}_{\mathbf{e}}(s)$ are displayed. The histogram of the color features shows that only the first two are the principal colors of the background, and the histogram of the gradients shows that the first six, excluding the fourth, are the principal gradients of the background.
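A sketch of the match test built on (10) (our code; the threshold delta is left to the caller):

```python
import numpy as np

def feature_distance(v1, v2) -> float:
    # Normalized inner-product distance (10): 0 when the vectors are
    # identical, growing as their direction or magnitude diverge.
    v1 = np.asarray(v1, dtype=np.float64)
    v2 = np.asarray(v2, dtype=np.float64)
    denom = float(v1 @ v1 + v2 @ v2)
    if denom == 0.0:
        return 0.0      # two zero vectors: treat as a trivial match
    return 1.0 - 2.0 * float(v1 @ v2) / denom

def matches(v1, v2, delta: float) -> bool:
    # Two feature vectors are matched when (10) falls below delta.
    return feature_distance(v1, v2) < delta
```

For two vectors differing only by a scale factor $k$, (10) evaluates to $(1-k)^2/(1+k^2)$, which stays small for moderate illumination-induced scalings; this is the source of the robustness noted above.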

2) Features for Dynamic Background Pixels: For dynamic background pixels associated with nonstationary objects, color co-occurrences are used as their dynamic features. This is because the color co-occurrence between consecutive frames has been found suitable to describe the dynamic features associated with nonstationary background objects, such as moving tree branches or a flickering screen [12]. Given an interframe change from the color $[r_{t-1}, g_{t-1}, b_{t-1}]^{T}$ to $[r_t, g_t, b_t]^{T}$ at the time instant $t$ and the pixel $s$, the feature vector of color co-occurrence is defined as $\mathbf{cc}_t = [r_{t-1}, g_{t-1}, b_{t-1}, r_t, g_t, b_t]^{T}$. Similarly, a table of statistics $\mathbf{T}^{t}_{\mathbf{cc}}(s)$ for color co-occurrences is maintained at each pixel. The color co-occurrence vector is generated by quantizing the color components of the input images to a low resolution. For example, by quantizing the color resolution to 32 levels for each component and selecting a moderate table size $N_2$, one may obtain a good principal feature representation for dynamic background pixels. An example of the principal feature representation with color co-occurrences for a flickering screen is shown in Fig. 2. Compared with the $32^6$ cells of the quantized color co-occurrence feature space, this implies that, with a very small number of feature vectors, the principal features are capable of modeling the dynamic background pixels.
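The feature construction itself is mechanical; a sketch (our names, using the 32-level quantization mentioned above):

```python
import numpy as np

def cooccurrence_vector(prev_rgb, cur_rgb, levels: int = 32) -> tuple:
    # Build the 6-D color co-occurrence vector cc_t by quantizing the
    # previous and current colors from 256 levels down to `levels` and
    # concatenating them: [r', g', b', r, g, b].
    step = 256 // levels
    prev_q = np.asarray(prev_rgb, dtype=np.int64) // step
    cur_q = np.asarray(cur_rgb, dtype=np.int64) // step
    return tuple(np.concatenate([prev_q, cur_q]))
```

The coarse quantization is what keeps the table small: many slightly different interframe transitions collapse into the same co-occurrence cell, so a handful of principal vectors can cover a periodic background motion.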


Fig. 2. One example of learned principal features for a dynamic background pixel. The left image shows the position of the selected pixel. The right image is the histogram of the statistics for the most significant color co-occurrences in $\mathbf{T}^{t}_{\mathbf{cc}}(s)$, where the height of a bar is the value of $p^{i}_{\mathbf{cc},t}$, the light gray part is $p^{i}_{\mathbf{cc}b,t}$, and the top dark gray part is $p^{i}_{\mathbf{cc},t} - p^{i}_{\mathbf{cc}b,t}$. The icons below the histogram are the corresponding color co-occurrence features. On the screen, the color changes periodically among white, dark blue, and light blue.

III. LEARNING AND UPDATING THE STATISTICS FOR PRINCIPAL FEATURES

Since the background might undergo both gradual and "once-off" changes, two strategies to learn and update the statistics of principal features are proposed. The convergence of the learning process is analyzed and a formula to select a proper learning rate is derived.

A. For Gradual Background Changes

At each time instant, if the pixel $s$ is identified as a static point, the features of color and gradient are used for foreground and background classification; otherwise, the feature of color co-occurrence is used. Let us assume that the feature vector $\mathbf{v}_t$ is used to classify the pixel at time $t$ based on the principal features learned previously. Then the statistics of the corresponding feature vectors in the table ($\mathbf{T}^{t}_{\mathbf{c}}(s)$ and $\mathbf{T}^{t}_{\mathbf{e}}(s)$, or $\mathbf{T}^{t}_{\mathbf{cc}}(s)$) are gradually updated at each time instant by

$$p^{t+1}_{b} = (1 - \alpha_1)\, p^{t}_{b} + \alpha_1 M^{t}$$
$$p^{i,t+1}_{\mathbf{v}} = (1 - \alpha_1)\, p^{i,t}_{\mathbf{v}} + \alpha_1 M^{t}_{i}$$
$$p^{i,t+1}_{\mathbf{v}b} = (1 - \alpha_1)\, p^{i,t}_{\mathbf{v}b} + \alpha_1 \left( M^{t} M^{t}_{i} \right) \qquad (11)$$

where the learning rate $\alpha_1$ is a small positive number and $0 < \alpha_1 < 1$. In (11), $M^{t} = 1$ means that $s$ is classified as a background point at time $t$ in the final segmentation; otherwise, $M^{t} = 0$. Similarly, $M^{t}_{i} = 1$ means that the $i$th vector of the table matches the input feature vector $\mathbf{v}_t$, and $M^{t}_{i} = 0$ otherwise.

The above updating operation states the following. If the pixel is labeled as a background point at time $t$, $p^{t+1}_{b}$ is slightly increased from $p^{t}_{b}$ due to $M^{t} = 1$. Further, the probabilities for the matched feature vector are also increased due to $M^{t}_{i} = 1$, whereas the statistics for the unmatched feature vectors ($M^{t}_{i} = 0$) are slightly decreased. If there is no match between the feature vector $\mathbf{v}_t$ and the vectors in the table $\mathbf{T}^{t}_{\mathbf{v}}(s)$, the $N_2$th vector in the table is replaced by a new feature vector

$$p^{N_2,t+1}_{\mathbf{v}} = \alpha_1, \qquad p^{N_2,t+1}_{\mathbf{v}b} = \alpha_1 M^{t}, \qquad \mathbf{v}_{N_2} = \mathbf{v}_t \qquad (12)$$

If the pixel is labeled as a foreground point at time $t$, $p^{t+1}_{b}$ and $p^{i,t+1}_{\mathbf{v}b}$ are slightly decreased since $M^{t} = 0$; however, $p^{i,t+1}_{\mathbf{v}}$ for the matched vector is still slightly increased.

The updated elements in the table are resorted in descending order with respect to $p^{i,t+1}_{\mathbf{v}}$, such that the table keeps the $N_2$ most frequent and significant feature vectors observed at pixel $s$.

B. For “Once-Off” Background Changes

According to (4), the statistics of the principal features satisfy

$$\sum_{i=1}^{N_2} P(\mathbf{v}_i \mid s) = P(b \mid s) \sum_{i=1}^{N_2} P(\mathbf{v}_i \mid b, s) + P(f \mid s) \sum_{i=1}^{N_2} P(\mathbf{v}_i \mid f, s) \qquad (13)$$

These probabilities are learned gradually with the operations described by (11) and (12) at each pixel $s$. When a "once-off" background change has happened, the new background appearance soon becomes dominant after the change. With the replacement operation (12), the gradual accumulation operation (11), and the resorting at each time step, the newly learned features will be gradually moved to the first few positions in $\mathbf{T}^{t}_{\mathbf{v}}(s)$. After some time duration, the term on the left-hand side of (13) becomes large ($\approx 1$) while the first term on the right-hand side becomes very small, since the new background features are classified as foreground. From (6) and (13), a new background appearance at $s$ can be found if

$$P(f \mid s) \sum_{i=1}^{N_2} P(\mathbf{v}_i \mid f, s) > (1 + M_2)\, P(b \mid s) \sum_{i=1}^{N_2} P(\mathbf{v}_i \mid b, s) \qquad (14)$$

In (14), $b$ denotes the previous background before the "once-off" change and $f$ denotes the new background appearance after the "once-off" change. The factor $(1 + M_2)$ prevents errors caused by a small number of foreground features. Using the notation in (7) and (8), the condition (14) becomes

$$\sum_{i=1}^{N_2} \left( p^{i,t}_{\mathbf{v}} - p^{i,t}_{\mathbf{v}b} \right) > (1 + M_2) \sum_{i=1}^{N_2} p^{i,t}_{\mathbf{v}b} \qquad (15)$$

Once the above condition is satisfied, the statistics learned for the foreground should be turned into the new background appearance. According to (4), $P(\mathbf{v}_i \mid f, s)P(f \mid s) = P(\mathbf{v}_i \mid s) - P(\mathbf{v}_i \mid b, s)P(b \mid s)$, so the "once-off" learning operation is performed as follows:

$$p^{t+1}_{b} = 1 - p^{t}_{b}, \qquad p^{i,t+1}_{\mathbf{v}b} = p^{i,t}_{\mathbf{v}} - p^{i,t}_{\mathbf{v}b} \qquad (16)$$

for $i = 1, \ldots, N_2$.
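In the sketch's terms (again our code), the swap (16) is a two-liner:

```python
def once_off_update(table):
    # "Once-off" learning (16): the statistics accumulated for the
    # foreground are re-labeled as the new background, using
    # p_v - p_vb ~ P(v_i|f,s)P(f|s) and 1 - p_b ~ P(f|s), both from (4).
    for st in table.stats:
        st.p_vb = max(st.p_v - st.p_vb, 0.0)
    table.p_b = 1.0 - table.p_b
```

The `max(..., 0.0)` clamp is ours, guarding against small negative values introduced by the approximations; the paper states the operation only through (16).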


C. Convergence of the Learning Process

If the time-evolving principal feature representation has successfully approximated the background, then $\sum_{i=1}^{N_2} P(\mathbf{v}_i \mid b, s) \approx 1$ should be satisfied. In the notation of (7) and (8), let

$$\sigma_t = \frac{1}{p^{t}_{b}} \sum_{i=1}^{N_2} p^{i,t}_{\mathbf{v}b} \approx \sum_{i=1}^{N_2} P(\mathbf{v}_i \mid b, s)$$

Hence, it is desirable that $\sigma_t$ converge to 1 with the evolution of the learning process. We shall show in the following that the learning operation (11) indeed meets such a condition.

Suppose that $\sigma_t = 1$ at time $t$ and that the $k$th vector in the table matches the input feature vector $\mathbf{v}_t$, which has been detected as background in the final segmentation at time $t$. Then, according to (11), we have

$$\sigma_{t+1} = \frac{(1 - \alpha_1) \sum_{i=1}^{N_2} p^{i,t}_{\mathbf{v}b} + \alpha_1}{(1 - \alpha_1)\, p^{t}_{b} + \alpha_1} = 1 \qquad (17)$$

which implies that the sum of the conditional probabilities of the principal features being background will remain equal or close to 1 during the evolution of the learning process.

Now suppose that $\sigma_t \neq 1$ at time $t$, due to some reasons such as the disturbance from foreground objects or the operation of "once-off" learning, and that the $k$th of the first $N_2$ vectors in $\mathbf{T}^{t}_{\mathbf{v}}(s)$ matches the input feature vector $\mathbf{v}_t$. Then we have

$$\sigma_{t+1} = \frac{(1 - \alpha_1)\, \sigma_t\, p^{t}_{b} + \alpha_1 M^{t}}{(1 - \alpha_1)\, p^{t}_{b} + \alpha_1 M^{t}} \qquad (18)$$

If the pixel is detected as a background point at time $t$ ($M^{t} = 1$), this leads to

$$\sigma_{t+1} - 1 = \frac{(1 - \alpha_1)\, p^{t}_{b}}{(1 - \alpha_1)\, p^{t}_{b} + \alpha_1} \left( \sigma_t - 1 \right) \qquad (19)$$

If $\sigma_t < 1$, then $\sigma_t < \sigma_{t+1} < 1$; in this case, the sum of the conditional probabilities of the principal features being background increases slightly. On the other hand, if $\sigma_t > 1$, there will be $1 < \sigma_{t+1} < \sigma_t$, and the sum decreases slightly. From these two cases, it can be concluded that the sum of the conditional probabilities of the principal features being background converges to 1 as long as the background features are observed frequently.

D. Selection of the Learning Rate

In general, for an IIR filtering-based learning process, there is a tradeoff in the selection of the learning rate $\alpha_1$. To make the learning process adapt smoothly to gradual background changes without being perturbed by noise and foreground objects, a small value should be selected for $\alpha_1$. On the other hand, if $\alpha_1$ is too small, the system becomes too slow to respond to "once-off" background changes. Previous methods select the rate empirically [4], [5], [8], [14]. Here, a formula is derived to select $\alpha_1$ according to the required time for the system to respond to "once-off" background changes.

An ideal "once-off" background change at time $t_0$ can be assumed to be a step function. Suppose the features observed before $t_0$ fall into the first $K$ vectors of the table $\mathbf{T}^{t_0}_{\mathbf{v}}(s)$, and the features observed after $t_0$ fall into the next elements of the table. Then the statistics at time $t_0$ can be described as

$$p^{t_0}_{b} \approx 1, \qquad \sum_{i=1}^{K} p^{i,t_0}_{\mathbf{v}} \approx \sum_{i=1}^{K} p^{i,t_0}_{\mathbf{v}b} \approx 1 \qquad (20)$$

Since the new background appearance at pixel $s$ after time $t_0$ is classified as foreground before the "once-off" updating with (16), $p^{t}_{b}$, $\sum_i p^{i,t}_{\mathbf{v}b}$, and the $p^{i,t}_{\mathbf{v}}$ of the old features decrease exponentially, whereas the $p^{i,t}_{\mathbf{v}}$ of the new features increase exponentially and will be shifted to the first positions in the updated table by the sorting at each time step. Once the condition (15) is met at some time $t_0 + T$, the new background state is learned. To simplify the expressions, let us assume that there is no resorting operation. Then the condition (15) becomes

$$\sum_{i=K+1}^{N_2} p^{i,t_0+T}_{\mathbf{v}} > (1 + M_2) \sum_{i=1}^{N_2} p^{i,t_0+T}_{\mathbf{v}b} \qquad (21)$$

From (11) and (20), it follows that at time $t_0 + T$ the following conditions hold:

$$\sum_{i=1}^{N_2} p^{i,t_0+T}_{\mathbf{v}b} \approx (1 - \alpha_1)^{T} \qquad (22)$$

$$\sum_{i=1}^{K} p^{i,t_0+T}_{\mathbf{v}} \approx (1 - \alpha_1)^{T} \qquad (23)$$

$$\sum_{i=K+1}^{N_2} p^{i,t_0+T}_{\mathbf{v}} \approx 1 - (1 - \alpha_1)^{T} \qquad (24)$$

By substituting (22)–(24) into (21) and rearranging terms, one can obtain

$$\alpha_1 \geq 1 - \left( \frac{1}{2 + M_2} \right)^{1/T} \qquad (25)$$

where $T$ is the number of frames required to learn the new background appearance. Equation (25) implies that, if one wishes the system to learn the new background state in no more than $T$ frames, one should choose $\alpha_1$ such that (25) is satisfied. For example, if the system is to respond to a "once-off" background change within 20 s at a frame rate of 20 fps ($T = 400$), roughly $\alpha_1 \geq 0.002$ should be satisfied.
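As a numeric check of (25) as reconstructed above (the closed form and the constants are our reading of the derivation):

```python
def min_learning_rate(t_frames: int, m2: float = 0.1) -> float:
    # Smallest alpha_1 satisfying (25): after t_frames decay steps the
    # old background mass (1 - a)^T must fall below 1 / (2 + M2).
    return 1.0 - (1.0 / (2.0 + m2)) ** (1.0 / t_frames)

print(min_learning_rate(20 * 20))   # 20 s at 20 fps -> about 0.0019
```

For small rates the bound is close to $\ln(2 + M_2)/T$, so halving the allowed response time roughly doubles the required learning rate.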

IV. FOREGROUND OBJECT DETECTION: THE ALGORITHM

With the Bayesian formulation of background and foreground classification, as well as the background representation with principal features, an algorithm for foreground object detection from complex environments is developed. It consists of four parts: change detection, change classification, foreground object segmentation, and background maintenance. The block diagram of the algorithm is shown in Fig. 3. The white blocks from left to right correspond to the first three steps, and the blocks with gray shades correspond to background maintenance. In the first step, unchanged background pixels in the current frame are filtered out by using simple background and temporal differencing.


Fig. 3. Block diagram of the proposed method.

The detected changes are separated into static and dynamic points according to the interframe changes. In the second step, the detected static and dynamic change points are further classified as background or foreground using the Bayes rule and the statistics of principal features for the background. Static points are classified based on the statistics of principal colors and gradients, whereas dynamic points are classified based on those of principal color co-occurrences. In the third step, foreground objects are segmented by combining the classification results from both static and dynamic points. In the fourth step, the background models are updated; this includes updating the statistics of principal features for the background as well as a reference background image. Brief descriptions of the steps are presented in the following.

A. Change Detection

In this step, simple adaptive image differencing is used to filter out nonchange background pixels. The minor variations of colors caused by imaging noise are filtered out to save computation in the further processing.

Let $I_t(s)$ be the input image and $B_t(s)$ be the reference background image maintained at time $t$, each with three color components. The background difference is obtained as follows. First, image differencing and thresholding for each color component are performed, where the threshold is automatically generated using the least median of squares (LMedS) method [31]. The background difference $F_{bd}(s, t)$ is then obtained by fusing the results from the three color components. Similarly, the temporal (or interframe) difference $F_{td}(s, t)$ between the two consecutive frames $I_{t-1}$ and $I_t$ is obtained. If both $F_{bd}(s, t) = 0$ and $F_{td}(s, t) = 0$, the pixel is classified as a nonchange background point. In general, more than 50% of the pixels are filtered out in this step.
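A rough sketch of this step (our code; the paper delegates threshold generation to the LMedS method of [31], so the robust-scale estimate and the factor 2.5 here are illustrative stand-ins):

```python
import numpy as np

def change_mask(cur: np.ndarray, ref: np.ndarray) -> np.ndarray:
    # Per-channel absolute differencing with an automatically chosen
    # threshold, fused across the three color channels by logical OR.
    changed = np.zeros(cur.shape[:2], dtype=bool)
    for c in range(3):
        d = cur[..., c].astype(np.float64) - ref[..., c].astype(np.float64)
        # Robust noise scale from the median of squared differences,
        # in the spirit of least median of squares.
        sigma = 1.4826 * np.sqrt(np.median(d ** 2))
        changed |= np.abs(d) > 2.5 * max(sigma, 1.0)
    return changed
```

Applying this once against the reference image $B_t$ gives $F_{bd}$ and once against the previous frame $I_{t-1}$ gives $F_{td}$; pixels where both masks are zero are dropped from further processing.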

B. Change Classification

If an interframe change is detected at a pixel $s$ (i.e., $F_{td}(s, t) = 1$), the pixel is classified as a dynamic point; otherwise, it is classified as a static point. A change that occurs at a static point could be caused by illumination changes, "once-off" background changes, or a temporarily motionless foreground object. A change detected at a dynamic point could be caused by a moving background object or a foreground object. They are further classified as background or foreground by using the Bayes decision rule and the statistics of the corresponding principal features.

Let $\mathbf{v}_t$ be the input feature vector at $s$ and time $t$. The probabilities in (5) are estimated from the table as

$$P(b \mid s) = p^{t}_{b}, \qquad P(\mathbf{v}_t \mid s) = \sum_{i \in U} p^{i,t}_{\mathbf{v}}, \qquad P(\mathbf{v}_t \mid b, s)\, P(b \mid s) = \sum_{i \in U} p^{i,t}_{\mathbf{v}b} \qquad (26)$$

where $U$ is a feature vector set composed of those in $\mathbf{T}^{t}_{\mathbf{v}}(s)$ which match the input vector $\mathbf{v}_t$, i.e.,

$$U = \left\{ i : d(\mathbf{v}_i, \mathbf{v}_t) < \delta \ \ \text{and} \ \ i \leq N_1 \right\} \qquad (27)$$

If no principal feature vector in the table matches $\mathbf{v}_t$, both $P(\mathbf{v}_t \mid s)$ and $P(\mathbf{v}_t \mid b, s)$ are set to 0. Then, the change point is classified as background or foreground as follows.

Classification of Static Points: For a static point, the probabilities for both the color and the gradient features are estimated by (26) with $\mathbf{v} = \mathbf{c}$ and $\mathbf{v} = \mathbf{e}$, respectively, where the vector distance measure in (27) is calculated as (10). In this work, the statistics of the two types of principal features ($\mathbf{T}^{t}_{\mathbf{c}}$ and $\mathbf{T}^{t}_{\mathbf{e}}$) are learned separately. In general cases, the two feature types yield consistent estimates, and the Bayes decision rule (9) can be applied for background and foreground classification. In some complex cases,


one type of the features from the background might be unstable. One example is the temporarily static state of a wavering water surface, for which the gradient features are not constant. Another example is video captured with an auto-gain camera: the gain is often self-tuned due to the motion of objects, and the gradient features are then more stable than the color features for static background pixels. To work stably in various conditions, the following method is adopted. Let $p_{c}$ and $p_{e}$ be the prior values $P(b \mid s)$ learned in the color table and in the gradient table, respectively. If the two priors are close to each other (their difference is below a small threshold in our tests), the color and gradient features are coincident and both are used for classification with the Bayes rule (9); otherwise, only the type of features with the larger prior value $P(b \mid s)$ is used for classification with the Bayes rule (5).

Classification of Dynamic Points: For a dynamic point at time $t$, the feature vector of color co-occurrence $\mathbf{cc}_t$ is generated. The probabilities for $\mathbf{cc}_t$ are calculated as in (26), where the distance between two feature vectors in (27) is computed componentwise on the quantized vectors as

$$d(\mathbf{cc}_1, \mathbf{cc}_2) = \max_{1 \leq k \leq 6} \left| cc^{k}_{1} - cc^{k}_{2} \right| \qquad (28)$$

with a small $\delta$ chosen. Finally, the Bayes rule (5) is applied for background and foreground classification. As observed in our experiments, only a small percentage of the dynamic background points are wrongly classified as foreground changes [12]; further, the remainders are isolated points, which can easily be removed by a smoothing operation.
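Tying (26) and (27) back to the earlier sketch (our code), the estimation plus decision reads:

```python
def estimate_probs(table, v_t, delta: float):
    # (26)-(27): sum the statistics of the principal features (the first
    # n1 table entries) that match the input vector; if none match, both
    # estimates stay zero and the pixel will be called foreground by (5).
    p_v = p_vb = 0.0
    for st in table.stats[:table.n1]:
        if matches(st.v, v_t, delta):
            p_v += st.p_v
            p_vb += st.p_vb
    return p_v, p_vb

# Decision for one change point:
# p_v, p_vb = estimate_probs(table, v_t, delta)
# label_background = is_background(p_v, p_vb)   # rule (5)
```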

C. Foreground Object Segmentation

Post-processing is applied to segment the remaining change points into foreground regions. This is done by first applying a morphological operation (a pair of open and close operations) to suppress residual errors. Then the foreground regions are extracted, holes are filled, and small regions are removed. Further, an AND operation is applied to the resulting segments in consecutive frames to remove false foreground regions detected by temporal differencing [32].

D. Background Maintenance

With the feedback from the above segmentation, the background models are updated. First, the statistics of principal features are updated as described in Section III: for the static points, the tables $\mathbf{T}^{t}_{\mathbf{c}}(s)$ and $\mathbf{T}^{t}_{\mathbf{e}}(s)$ are updated; for the dynamic points, the table $\mathbf{T}^{t}_{\mathbf{cc}}(s)$ is updated. Meanwhile, a reference background image is also maintained to keep the background difference accurate. Let $s$ be a background point in the final segmentation result at time $t$. If it is identified as an unchanged background point in the change detection step, the background reference image at $s$ is smoothly updated by

$$B_{t+1}(s) = (1 - \alpha_2)\, B_t(s) + \alpha_2\, I_t(s) \qquad (29)$$

where $\alpha_2$ ($0 < \alpha_2 < 1$) is a small positive number. If $s$ is classified as background in the change classification step, the background reference image at $s$ is replaced by the new background appearance

$$B_{t+1}(s) = I_t(s) \qquad (30)$$

With (30), the reference background image can follow the dynamic background changes, e.g., the changes of color between tree branches and sky, as well as "once-off" background changes.

Fig. 4. Summary of the complete algorithm.
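A vectorized sketch of the two maintenance rules (ours; the mask names are assumptions):

```python
import numpy as np

def update_reference(ref, cur, unchanged_bg, reclassified_bg, alpha2):
    # (29): IIR blending where change detection saw no change;
    # (30): outright replacement where change classification voted
    # background (dynamic or recovered "once-off" regions).
    out = ref.astype(np.float64)
    blend = unchanged_bg[..., None]          # HxW bool -> HxWx1
    out = np.where(blend, (1.0 - alpha2) * out + alpha2 * cur, out)
    replace = reclassified_bg[..., None]
    out = np.where(replace, cur.astype(np.float64), out)
    return out.astype(ref.dtype)
```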

E. Memory Requirement and Computational Time

The complete algorithm is summarized in Fig. 4. The major part of the memory usage is to store the tables of statistics ($\mathbf{T}^{t}_{\mathbf{c}}$, $\mathbf{T}^{t}_{\mathbf{e}}$, and $\mathbf{T}^{t}_{\mathbf{cc}}$) for each pixel. In our implementation, the memory requirement for each pixel is approximately 1.78 KB. For a video with images sized 160 × 120 pixels, the required memory is approximately 33.4 MB, while for images sized 320 × 240 pixels, 133.5 MB of memory is required. For a standard PC, this is still feasible. With a PC with a 1.7-GHz Pentium CPU, real-time processing of image sequences is achievable at a rate of about 15 frames per second (fps) for images sized 160 × 120 pixels and at a rate of 3 fps for images sized 320 × 240 pixels.
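The stated figures follow directly from the per-pixel cost:

```python
# Back-of-envelope check of the reported memory usage.
per_pixel_kb = 1.78
print(160 * 120 * per_pixel_kb / 1024)   # ~33.4 MB
print(320 * 240 * per_pixel_kb / 1024)   # ~133.5 MB
```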


Fig. 5. Experimental results on a meeting room environment (MR) with curtains wavering in the wind. The two examples are the results for frames 1816 and 2268.

Fig. 6. Experimental results on a lobby environment (LB) in an office building with lights being switched on/off. Upper row: a frame before some lights are switched off (364). Lower row: a frame 15 s after some lights are switched off (648).

V. EXPERIMENTAL RESULTS

The proposed method has been tested on a variety of indoor and outdoor environments, including offices, campuses, parking lots, shopping malls, restaurants, airports, subway stations, sidewalks, and other private and public sites. It has also been tested on image sequences captured in various weather conditions, including sunny, cloudy, and rainy weather, as well as night and crowd scenes. In all the tests, the proposed method was automatically initialized (bootstrapped) from a "blinking background" (i.e., $p^{0}_{b}(s) = 0$ and $p^{i,0}_{\mathbf{v}}(s) = p^{i,0}_{\mathbf{v}b}(s) = 0$ for $i = 1, \ldots, N_2$ and every pixel $s$). The system gradually learned the most significant features for both stationary and nonstationary background objects. Once the "once-off" updating is performed, the system is able to separate the foreground from the background well.

MoG [4] is a widely used adaptive background subtraction method that performs quite well among the existing methods for both stationary and nonstationary backgrounds [6]. The proposed method has therefore been compared with MoG in the experiments. The same learning rate was used for both the proposed method and MoG in each test.¹ Further, for a fair comparison, the post-processing used in the proposed method was applied to the MoG method as well.

¹A similar analysis of the learning process and dynamic performance can be made for MoG as in Sections III-C and III-D.

The visual examples and quantitative evaluations of the experiments are described in the following two subsections, respectively.

A. Examples on Various Environments

Selected results on five typical indoor and outdoor environments are displayed in this section; the typical environments are offices, campuses, shopping malls, subway stations, and sidewalks. In the figures of this subsection, pictures are arranged in rows. In each row, the images from left to right are the input frame, the background reference image maintained by the proposed method at that moment, the manually generated "ground truth," and the results of the proposed method and of MoG.

1) Office Environments: Office environments include offices, laboratories, meeting rooms, corridors, lobbies, and entrances. An office environment is usually composed of stationary background objects. The difficulties for foreground detection in these scenes can be caused by shadows, changes of illumination conditions, and camouflage foreground objects (i.e., the color of the foreground object is similar to that of the covered background). In some cases, the background may contain dynamic objects, such as waving curtains, running fans, and flickering screens. Examples from two test sequences are shown in Figs. 5 and 6, respectively.


Fig. 7. Experimental results on a campus environment (CAM) containing tree branches wavering in strong winds. Shown are frames 1019, 1337, and 1393.

The first sequence (MR) was captured by an auto-gain camera in a meeting room where the background curtain was moving in the wind. The first example, in the upper row, came from a scenario containing significant motion of the curtain as well as background changes caused by automatic gain adjustment. In the next example, the person wore brightly colored clothes similar in color to the curtain. In both cases, the proposed method separated the background and foreground satisfactorily.

The second sequence (LB) was captured from a lobby in an office building. On this occasion, background changes were mainly caused by switching lights on and off. Two examples from this sequence are shown in Fig. 6. The first example shows a scene before some lights are switched off; a significant shadow of the person can be observed. The result of the proposed method is rather satisfactory apart from a small included shadow. The second example shows a scene about 220 frames (about 15 s) after some lights have been switched off. In this example, even though the background reference image had not yet been recovered completely, the proposed method detected the person successfully.

2) Campus Environments: The second type of environment is campuses or parks. Changes in the background are often caused by the motion of tree branches and their shadows on the ground surface, or by changes in the weather. The three examples displayed in Fig. 7 are from a sequence (CAM) captured on a campus containing moving tree branches. The large motion of the tree branches was caused by strong winds, which can be observed from the waving yellow flag at the left of the images. The moving tree branches also resulted in changes of the tree shadows. The three example frames contain vehicles of different colors. The results show that the proposed method detected the vehicles quite well in such an environment.

3) Shopping Malls: The third type of typical environments is shopping centers, hotels, museums, airports, and restaurants. In these environments, the lighting is distributed from the ceilings and some ground surfaces produce specular highlights. In such cases, if multiple persons move in the scene, the shadows on the ground surface vary significantly in the image sequences. In these environments, the shadows can be classified into umbra and penumbra [33]. The umbra corresponds to the background area where the direct light is almost totally blocked by the foreground object, whereas in the penumbra area of the background the lighting is only partially blocked.

Three examples from such environments are shown in Fig. 8. They come from a busy shopping center (SC), an airport (AP), and a buffet restaurant (BR) [6]. Significant shadows of moving persons cast on the ground surfaces from different directions can be observed. As one can see, the proposed method obtained satisfactory results in all three environments apart from small parts of the shadows being detected. The recognized shadows can also be observed in the maintained background reference images. This can be explained as follows: a) the feature distance measure (10), being robust to illumination changes, plays a major role in suppressing the penumbra areas; b) the learned color co-occurrences of the changes from the normal background appearance to umbra, and vice versa, identify many background pixels in the umbra areas. Hence, without any special model for the shadows, the proposed method suppressed much of the various shadows in these environments.

4) Subway Stations: Subway stations are other public sites that often require monitoring. In these situations, the motion of background objects (e.g., trains and escalators) makes the background modeling difficult. Further, the background model is hard to establish if there are frequent human crowds in the scene. Fig. 9 shows two examples from a sequence of a subway station (SS) recorded on tape by a CCTV surveillance system. The scene contains three moving escalators and frequent human flows on the right side of the images. In addition, there are significant background changes caused by the variation of lighting conditions due to the many glass and stainless steel materials inside the building.


Fig. 8. Experimental results on shopping mall environments containing specular ground surfaces. The three examples came from a busy shopping center (SC), an airport (AP), and a buffet restaurant (BR), respectively.

Fig. 9. Experimental results on a subway station environment (SS). The examples are frames 1993 and 2634.

Another difficulty with this sequence is the noise due to the old video recording device. The busy flow of human crowds can be observed in the first example in the figure. Our test results show that the proposed method performed quite satisfactorily for such difficult scenarios.

5) Sidewalks: Pedestrians are often the targets of interest in many video surveillance systems. In such cases, a surveillance system may monitor the scene from day to night under a range of weather conditions. The tests were performed on such an environment around the clock. The image sequences (SW) were obtained from highly compressed MPEG-4 videos through a local wireless network, and there were large variations of the background in the images. Five examples and test results are shown in Fig. 10. These correspond to sunny, cloudy, and rainy weather conditions, as well as night and crowded scenes. The interval between the first two frames was less than 10 s. Comparing the results with the "ground truths," one can see that the proposed method performed very robustly in this complex environment.

From the comparisons with MoG in the examples shown in Figs. 5–10, one can see that the proposed method outperformed the MoG method in these selected difficult situations.

The parameters used for these tests are listed in Tables II and III. The parameters in Table II were applied to all tests. The learning rates in the first row of Table III were applied to all tests except for three shorter sequences, where the larger rates in the second row of the table were applied; if an image sequence is short, a slightly faster learning rate should be used to speed up the initial learning. Since the decision rule (5) for the classification of background and foreground does not depend directly on any threshold, the performance of the proposed method is not very sensitive to these parameters.

B. Quantitative Evaluations

For a systematic evaluation, the performance of the proposed method was also assessed quantitatively on randomly selected samples from ten sequences.


Fig. 10. Experimental results of pedestrian detection in a sidewalk environment (SW) around the clock. From top to bottom are frames from sunny, cloudy, rainy, night, and crowded scenes.

TABLE II: PARAMETERS USED FOR ALL TEST EXAMPLES

TABLE III: LEARNING RATES USED IN THE TEST EXAMPLES

In the previous work [6], the results were evaluated quantitatively by comparison with the "ground truths" in terms of

1) false negative error: the number of foreground pixels that are missed;

2) false positive error: the number of background pixels that are misdetected as foreground.

However, it is found that these measures are not accurate enough when averaged over various environments. In this paper, a new similarity measure is introduced to evaluate the results of foreground segmentation. Let $A$ be a detected region and $B$ be the corresponding "ground truth"; then the similarity measure between the regions $A$ and $B$ is defined as

$$S(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (31)$$

Using this measure, $S(A, B)$ approaches its maximum value of 1.0 if $A$ and $B$ are the same; otherwise, $S(A, B)$ varies between 1 and 0 according to the similarity of the regions, approaching 0 when they are least similar. It integrates the false positive and false negative errors in one measure. One drawback of the measure (31) is that it is nonlinear. To give a visual impression of the magnitudes of the similarity measure, some matching images and their similarity values are displayed in Fig. 11.
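Computed on binary masks, (31) is a few lines (our sketch; the empty-mask convention is ours):

```python
import numpy as np

def similarity(detected: np.ndarray, truth: np.ndarray) -> float:
    # Region similarity (31): S(A, B) = |A intersect B| / |A union B|.
    a = detected.astype(bool)
    b = truth.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0   # both masks empty: count as perfect agreement
    return float(np.logical_and(a, b).sum() / union)
```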

For systematic evaluation and comparison, the similarity measure (31) has been applied to the experimental results of the proposed method and of the MoG method. A total of ten image sequences were used, including those in Figs. 5–10 as well as two others [water surface (WS) and fountain (FT)]. We randomly selected 20 frames from each sequence, leading to a total of 200 sample frames for evaluation. The "ground truths" of these 200 frames were generated manually by four invited persons. All ten test sequences, the results, and the "ground truths" of the sample frames are available.²

2http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html.


Fig. 11. Some examples of matching images with different similarity measure values. In the images, the bright color indicates the intersection of the detected regions and the "ground truths," the dark gray color indicates the false negatives, and the light gray color indicates the false positives.

TABLE IV: QUANTITATIVE EVALUATION AND COMPARISON RESULTS: SIMILARITY VALUES FROM THE TEST SEQUENCES

The average similarity values for each individual sequence and over all ten sequences are shown in Table IV, together with the corresponding values obtained with the MoG method. The ten test sequences are chosen among the difficult ones: besides the various background changes described in the previous subsection, they contain global background changes as well as persons staying motionless for quite a while. Taking these situations into account, the obtained evaluation values for both methods are quite good. Comparing the results in Table IV with those in Fig. 11, the performance of the proposed method is rather satisfactory. The comparison shows that the proposed method provides improved results over the MoG method, especially for image sequences with complex backgrounds.

C. Limitations of the Method

Since the statistics are related to each individual pixel without considering its neighborhood, the method can wrongly absorb a foreground object into the background if the object remains motionless for a long time, e.g., a moving person or car that suddenly stops and then stays still in the scene. Further improvement should be made, e.g., by incorporating information from high-level object recognition and tracking into the background updating [34], [35].

Another potential problem is that the method can wrongly learn the features of foreground objects as background when crowded foreground objects (e.g., crowds) are constantly present in the scene. Adjusting the learning rate based on feedback from optical flow could provide a possible solution [36], as sketched below. A method of controlling the learning processes through multilevel feedback is being investigated to further improve the results.
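To make the flow-feedback idea concrete, one possible (hypothetical) mechanism is to suppress the per-pixel learning rate wherever the optical-flow magnitude indicates coherent motion; this is only a sketch of the concept, not the method of [36] or of this paper:

```python
import numpy as np

def modulated_learning_rate(base_rate: float,
                            flow_magnitude: np.ndarray,
                            motion_thresh: float = 1.0) -> np.ndarray:
    """Hypothetical feedback rule: freeze background learning at pixels
    whose optical-flow magnitude exceeds a threshold, so that slowly
    moving crowds are not absorbed into the background model."""
    rate = np.full(flow_magnitude.shape, base_rate, dtype=np.float32)
    # Zero learning rate where coherent motion persists.
    rate[flow_magnitude > motion_thresh] = 0.0
    return rate
```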

VI. CONCLUSION

For detecting foreground objects in complex environments, this paper has proposed a novel statistical method for background modeling. In the proposed method, the background appearance is characterized by the principal features and their statistics.

Foreground objects are detected through foreground and background classification under the Bayesian framework. Our test results have shown that the principal features are effective in representing the spectral, spatial, and temporal characteristics of the background. A learning method that adapts to the time-varying background features has been proposed and analyzed. Experiments have been conducted in a variety of environments, including offices, public buildings, subway stations, campuses, parking lots, airports, and sidewalks. The experimental results have demonstrated the effectiveness of the proposed method. Quantitative evaluation and comparison with the existing method have shown that improved performance for foreground object detection in complex backgrounds has been achieved. Some limitations of the method have been discussed, with suggestions for possible improvements.

ACKNOWLEDGMENT

The authors would like to thank R. Luo, J. Shang, X. Huang, and W. Liu for their work in generating the “ground truths” for the evaluation.

REFERENCES

[1] D. Gavrila, “The visual analysis of human movement: A survey,” Comput. Vis. Image Understanding, vol. 73, no. 1, pp. 82–98, 1999.

[2] L. Li and M. Leung, “Integrating intensity and texture differences for robust change detection,” IEEE Trans. Image Processing, vol. 11, pp. 105–112, Feb. 2002.

[3] E. Durucan and T. Ebrahimi, “Change detection and background extraction by linear algebra,” Proc. IEEE, vol. 89, pp. 1368–1381, Oct. 2001.

[4] C. Stauffer and W. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 747–757, Aug. 2000.

[5] I. Haritaoglu, D. Harwood, and L. Davis, “W4: Real-time surveillance of people and their activities,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 809–830, Aug. 2000.

[6] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, “Wallflower: Principles and practice of background maintenance,” in Proc. IEEE Int. Conf. Computer Vision, Sept. 1999, pp. 255–261.

[7] K. Karmann and A. von Brandt, “Moving object recognition using an adaptive background memory,” in Time-Varying Image Processing and Moving Object Recognition, vol. 2, 1990, pp. 289–296.

[8] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 780–785, July 1997.

[9] A. Elgammal, D. Harwood, and L. Davis, “Non-parametric model for background subtraction,” in Proc. Eur. Conf. Computer Vision, 2000.


[10] N. Paragios and V. Ramesh, “A MRF-based approach for real-time subway monitoring,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, Dec. 2001, pp. I-1034–I-1040.

[11] O. Javed, K. Shafique, and M. Shah, “A hierarchical approach to robust background subtraction using color and gradient information,” in Proc. IEEE Workshop Motion and Video Computing, Dec. 2002, pp. 22–27.

[12] L. Li, W. M. Huang, I. Y. H. Gu, and Q. Tian, “Foreground object detection in changing background based on color co-occurrence statistics,” in Proc. IEEE Workshop Applications of Computer Vision, Dec. 2002, pp. 269–274.

[13] L. Wixson, “Detecting salient motion by accumulating directionally-consistent flow,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 774–780, Aug. 2000.

[14] N. J. B. McFarlane and C. P. Schofield, “Segmentation and tracking of piglets in images,” Mach. Vis. Applicat., vol. 8, pp. 187–193, 1995.

[15] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russell, “Toward robust automatic traffic scene analysis in real-time,” in Proc. Int. Conf. Pattern Recognition, 1994, pp. 126–131.

[16] A. Bobick, J. Davis, S. Intille, F. Baird, L. Campbell, Y. Ivanov, C. Pinhanez, and A. Wilson, “KidsRoom: Action recognition in an interactive story environment,” Mass. Inst. Technol., Cambridge, Perceptual Computing Tech. Rep. 398, 1996.

[17] J. Rehg, M. Loughlin, and K. Waters, “Vision for a smart kiosk,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1997, pp. 690–696.

[18] T. Olson and F. Brill, “Moving object detection and event recognition algorithms for smart cameras,” in Proc. DARPA Image Understanding Workshop, 1997, pp. 159–175.

[19] T. Boult, “Frame-rate multi-body tracking for surveillance,” in Proc. DARPA Image Understanding Workshop, 1998.

[20] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, “Integrated person tracking using stereo, color, and pattern detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1998, pp. 601–608.

[21] A. Shafer, J. Krumm, B. Brumitt, B. Meyers, M. Czerwinski, and D. Robbins, “The new EasyLiving project at Microsoft,” in Proc. DARPA/NIST Smart Space Workshop, 1998.

[22] C. Eveland, K. Konolige, and R. C. Bolles, “Background modeling for segmentation of video-rate stereo sequences,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1998, pp. 266–271.

[23] N. Friedman and S. Russell, “Image segmentation in video sequences: A probabilistic approach,” in Proc. 13th Conf. Uncertainty in Artificial Intelligence, 1997.

[24] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, “Moving target classification and tracking from real-time video,” in Proc. IEEE Workshop Applications of Computer Vision, Oct. 1998, pp. 8–14.

[25] M. Harville, G. Gordon, and J. Woodfill, “Foreground segmentation using adaptive mixture models in color and depth,” in Proc. IEEE Workshop Detection and Recognition of Events in Video, July 2001, pp. 3–11.

[26] X. Gao, T. Boult, F. Coetzee, and V. Ramesh, “Error analysis of background adaption,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2000, pp. 503–510.

[27] K. Skifstad and R. Jain, “Illumination independent change detection from real world image sequences,” Comput. Vis., Graph., Image Process., vol. 46, pp. 387–399, 1989.

[28] S. C. Liu, C. W. Fu, and S. Chang, “Statistical change detection with moments under time-varying illumination,” IEEE Trans. Image Processing, vol. 7, pp. 1258–1268, Aug. 1998.

[29] N. Oliver, B. Rosario, and A. Pentland, “A Bayesian computer vision system for modeling human interactions,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 831–843, Aug. 2000.

[30] A. Iketani, A. Nagai, Y. Kuno, and Y. Shirai, “Detecting persons on changing background,” in Proc. Int. Conf. Pattern Recognition, vol. 1, 1998, pp. 74–76.

[31] P. Rosin, “Thresholding for change detection,” in Proc. IEEE Int. Conf. Computer Vision, Jan. 1998, pp. 274–279.

[32] Q. Cai, A. Mitiche, and J. K. Aggarwal, “Tracking human motion in an indoor environment,” in Proc. IEEE Int. Conf. Image Processing, Oct. 1995, pp. 215–218.

[33] C. Jiang and M. O. Ward, “Shadow identification,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1992, pp. 606–612.

[34] L. Li, I. Y. H. Gu, M. K. H. Leung, and Q. Tian, “Knowledge-based fuzzy reasoning for maintenance of moderate-to-fast background changes in video surveillance,” in Proc. 4th IASTED Int. Conf. Signal and Image Processing, 2002, pp. 436–440.

[35] M. Harville, “A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models,” in Proc. Eur. Conf. Computer Vision, 2002, pp. 543–560.

[36] D. Gutchess, M. Trajkovic, E. Cohen-Solal, D. Lyons, and A. K. Jain, “A background model initialization algorithm for video surveillance,” in Proc. IEEE Int. Conf. Computer Vision, vol. 1, July 2001, pp. 733–740.

Liyuan Li (M’96) received the B.E. and M.E. degrees from Southeast University, Nanjing, China, in 1985 and 1988, respectively, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2001.

From 1988 to 1999, he was on the faculty at Southeast University, where he was an Assistant Lecturer (1988 to 1990), Lecturer (1990 to 1994), and Associate Professor (1995 to 1999). Since 2001, he has been a Research Scientist at the Institute for Infocomm Research, Singapore. His current research interests include video surveillance, object tracking, event and behavior understanding, etc.

Weimin Huang (M’97) received the B.Eng. degree in automation and the M.Eng. and Ph.D. degrees in computer engineering from Tsinghua University, Beijing, China, in 1989, 1991, and 1996, respectively.

He is a Research Scientist at the Institute for Infocomm Research, Singapore. He has worked on handwriting signature verification, biometrics authentication, and audio/video event detection. His current research interests include image processing, computer vision, pattern recognition, human computer interaction, and statistical learning.

Irene Yu-Hua Gu (M’94–SM’03) received the Ph.D. degree in electrical engineering from the Eindhoven University of Technology, Eindhoven, The Netherlands, in 1992.

She is an Associate Professor in the Department of Signals and Systems, Chalmers University of Technology, Göteborg, Sweden. She was a Research Fellow at Philips Research Institute IPO, The Netherlands, and Staffordshire University, Staffordshire, U.K., and a Lecturer at The University of Birmingham, Birmingham, U.K., from 1992 to 1996. Since 1996, she has been with the Department of Signals and Systems, Chalmers University of Technology. Her current research interests include image processing, video surveillance and object tracking, video communications, and signal processing applications to electric power systems.

Dr. Gu has served as an Associate Editor for the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS since 2000, and she is currently the Chair-Elect of the IEEE Swedish Signal Processing Chapter.

Qi Tian (M’83–SM’90) received the B.S. and M.S. degrees in electrical and computer engineering from Tsinghua University, Beijing, China, in 1967 and 1981, respectively, and the Ph.D. degree in electrical and computer engineering from the University of South Carolina, Columbia, in 1984.

He is a Principal Scientist in the Media Division, Institute for Infocomm Research, Singapore. His main research interests are image/video/audio analysis, indexing and retrieval, media content identification and security, computer vision, and pattern recognition. He joined the Institute of System Science, National University of Singapore, in 1992 and has since worked on robust character ID recognition and video indexing. He was the Program Director for the Media Engineering Program at Kent Ridge Digital Labs (later Laboratories for Information Technology) from 2001 to 2002.

Dr. Tian has served on the editorial boards of professional journals and as a chair and member of technical committees of the IEEE Pacific-Rim Conference on Multimedia (PCM), the IEEE International Conference on Multimedia and Expo (ICME), etc.