Multi-Object Tracking Using Color, Texture and Motion

Valtteri Takala and Matti Pietikäinen
Machine Vision Group
Infotech Oulu and Dept. of Electrical and Information Engineering
P.O. Box 4500, FIN-90014 University of Oulu, Finland

http://www.ee.oulu.fi/mvg

Abstract

In this paper, we introduce a novel real-time tracker based on color, texture and motion information. An RGB color histogram and correlogram (autocorrelogram) are exploited as color cues, and texture properties are represented by local binary patterns (LBP). The object's motion is taken into account through its location and trajectory. After extraction, these features are used to build a unifying distance measure. The measure is utilized in tracking and in the classification event in which an object is leaving a group. The initial object detection is done by a texture-based background subtraction algorithm. Experiments on indoor and outdoor surveillance videos show that the unified system works better than versions based on single features. It also copes well with the low illumination conditions and low frame rates which are common in large-scale surveillance systems.

1. Introduction

One of the most useful and valuable applications of current vision technology is visual surveillance. It is feasible to construct and generates direct benefits for its users. It can be used for many purposes in a number of different environments: people and vehicle tracking in traffic scenes [6], proximity detection on battlefields [18] (original study: [15]), suspicious event detection [13] ([5]), and geriatric care, to name but a few. This has already been noticed, and research is in full flow on both the academic and industrial sides.

Tracking is the primary part of active visual surveillance, where human intervention is to be minimized. It is also a field with numerous methods for numerous different tracking cases: Mean Shift based algorithms [1], [3], [4], [7] can be used in single-object problems with both static and moving cameras. Multi-object tracking, however, is a very different and more challenging problem. In addition to the normal frame-to-frame following of a salient area, the system must be able to handle occlusion, splitting, merging and other complicated events related to multiple moving targets. Existing solutions [6], [8], [9], [15] are meant for static cameras and limited types of scenes. Lately, particle filters [14] have gained a great deal of attention [8], [17], [20], [21], [26].

To survive in diverse environments, one should take advantage of multiple image properties, such as color, texture and temporal information, as none of them alone provides all-around invariance to different imaging conditions. By using a versatile collection of properties, the system performance can be enhanced and made more robust against the large variation of data common in surveillance. Still, one has to be careful when choosing multiple features, as they may also have a negative effect on each other.

The color correlogram [11] (also known as the autocorrelogram) and Local Binary Patterns (LBP) [19] are well-developed descriptors for general image classification and retrieval. They are fast to extract and provide features with varying size and discrimination power, depending on the parameters used (kernel radius, color channel quantization, sample count, etc.). They offer good histogram-based descriptions for object detection and matching.

At low frame rates (< 10 fps), some sort of model-based matching is a worthwhile way to go, as the spatial correspondence of objects in adjacent frames is often low. Frame rates like 2 fps are common in large-scale surveillance systems, where the amount of collected data can be tremendous. One server may have to handle a data stream of tens of video sources. In such situations the load on the system's IO is significant, and frame rates like 25 fps per video source are not easily feasible. The situation is likely to stay the same in the near future, as camera technology advances rapidly and megapixel-class video frames are already a reality.

This paper introduces an approach which uses multiple image features for frame-to-frame correspondence matching. The RGB color histogram and correlogram are used to describe the object's color properties, LBP is chosen for the texture, and the geometric location and the smoothness of the trajectory provide the motional support. The merging and splitting of objects are handled using the same set of features. The tracker's performance on low frame rate video is emphasized, as it is an area which has not been considered often enough.

2. Features for Tracking

Matching-based tracking requires good feature descriptors to be usable in the diverse conditions of real-world video surveillance. The main sources of descriptors are color, texture, shape, and temporal (motion) properties. Each of them has its pros and cons, but color has gained the most attention, as it is readily distinguishable to the human eye and seems to contain a good amount of useful information.

2.1. Color

Color provides many cues. The most well-known color descriptor is the RGB color histogram [24], which has been used for tracking on various occasions [1], [3], [27]. There are also other potential features like color moments [23], MPEG-7 color descriptors [16], and color correlograms [11], to mention but a few. In this study, the last one was selected together with the RGB color histogram to describe the color properties of objects. Selecting the color correlogram was natural due to its good discrimination power [11]. The main advantage of the correlogram is that it pays attention to the local spatial correlation of color pixels, thus increasing the value of color, whereas the color histogram is a purely global measure.
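
The correlogram computation is not spelled out in the paper, but its autocorrelogram variant can be sketched roughly as follows. This is an illustrative NumPy approximation, not the authors' implementation: the chessboard-distance neighborhood is approximated with four axis-aligned offsets, and the quantization matches the 64-color, two-distance setting reported later in Section 4.

```python
import numpy as np

def quantize_rgb(image, levels=4):
    """Map an HxWx3 uint8 RGB image to per-pixel color indices
    (levels**3 colors; 4x4x4 = 64 colors as in Section 4)."""
    bins = (image.astype(np.int32) * levels) // 256
    return bins[..., 0] * levels * levels + bins[..., 1] * levels + bins[..., 2]

def autocorrelogram(image, levels=4, distances=(1, 3)):
    """For each color c and distance d, estimate the probability that a
    pixel at distance d from a c-colored pixel also has color c.
    Two distances x 64 colors gives a 128-bin descriptor (Section 4)."""
    q = quantize_rgb(image, levels)
    h, w = q.shape
    n_colors = levels ** 3
    feat = np.zeros(len(distances) * n_colors)
    for k, d in enumerate(distances):
        same = np.zeros(n_colors)
        total = np.zeros(n_colors)
        # Four axis-aligned neighbors at offset d: a cheap stand-in for
        # the full chessboard-distance neighborhood of the original paper.
        for dy, dx in ((0, d), (0, -d), (d, 0), (-d, 0)):
            src = q[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
            dst = q[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
            same += np.bincount(src[src == dst], minlength=n_colors)
            total += np.bincount(src.ravel(), minlength=n_colors)
        feat[k * n_colors:(k + 1) * n_colors] = same / np.maximum(total, 1)
    return feat
```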

2.2. Texture

It would be shortsighted to rely on color properties alone. For instance, colors are very sensitive to illumination changes. This trouble can be alleviated, to some extent, by using other features that are less responsive to such image transformations. Texture, which has not enjoyed major attention in tracking applications, provides a good option for enhancing the power of color descriptors. The list of available texture features is quite a long one, but a good survey of different approaches has been made by Tuceryan and Jain [25].

As one of the most efficient texture descriptors, the LBP texture measure [19] is a logical choice for describing an object's textural properties. We selected a modified version of it [10] which is more stable against noise. LBP's main characteristics are invariance to monotonic changes in gray-scale and fast computation, and it has a proven performance background in texture classification [19]. As it operates in gray-scale, LBP is also robust to the illumination changes common in surveillance videos.
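
As a concrete illustration, an LBP histogram with the modified thresholding of [10] (a neighbor contributes a bit only when it exceeds the center by the constant a, set to 3 in Section 4) could be computed as below. This is a simplified sketch: circle points are rounded to integer offsets, while the reference operator uses bilinear interpolation.

```python
import numpy as np

def lbp_histogram(gray, radius=2, samples=8, a=3):
    """LBP histogram (2**samples bins) with the modified thresholding
    of [10]: a neighbor sets its bit only if neighbor >= center + a.
    Nearest-neighbor sampling on the circle, for simplicity."""
    gray = gray.astype(np.int32)
    h, w = gray.shape
    center = gray[radius:h - radius, radius:w - radius]
    code = np.zeros_like(center)
    for p in range(samples):
        angle = 2.0 * np.pi * p / samples
        dy = int(round(-radius * np.sin(angle)))
        dx = int(round(radius * np.cos(angle)))
        neighbor = gray[radius + dy:h - radius + dy,
                        radius + dx:w - radius + dx]
        code |= (neighbor >= center + a).astype(np.int32) << p
    hist = np.bincount(code.ravel(), minlength=2 ** samples)
    return hist / hist.sum()  # normalized 256-bin distribution for 8 samples
```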

2.3. Motion

In addition to visually descriptive features, video provides temporal properties. One can, of course, add a temporal dimension to any feature described above by combining the information extracted at consecutive instants of time in some meaningful way. We decided to use the object's location and trajectory to describe its motional properties. In addition to having close positions in successive frames, physical objects tend to have smooth trajectories, at least when the frame rate is high enough, and this can be exploited. We can calculate the smoothness of direction and speed [22] for each existing object track i:

$$S_{i,t} = w\left(\frac{\mathbf{v}_{i,t-1} \cdot \mathbf{v}_{i,t}}{|\mathbf{v}_{i,t-1}|\,|\mathbf{v}_{i,t}|}\right) + (1-w)\left(\frac{2\sqrt{|\mathbf{v}_{i,t-1}|\,|\mathbf{v}_{i,t}|}}{|\mathbf{v}_{i,t-1}| + |\mathbf{v}_{i,t}|}\right), \qquad (1)$$

where the first term defines the smoothness of direction and the second one the smoothness of speed. $S_{i,t}$ is the combined smoothness of the track i between the time instants t and t − 1, $\mathbf{v}$ stands for the difference vector of two points, and w is a weight.

If m paths are extracted from n frames, the total smoothness is defined by Equation (2), which is the sum of the smoothnesses of all the interior points of all the m paths.

$$T_s = \sum_{i=1}^{m} \sum_{t=2}^{n-1} S_{i,t} \qquad (2)$$
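
A direct transcription of Equations (1) and (2), assuming each track is a sequence of 2-D center points (one per frame) and w = 0.5, since Section 4 states that direction and speed are equally weighted; an illustrative sketch, not the authors' code.

```python
import numpy as np

def smoothness(track, w=0.5):
    """Total smoothness of one track per Equation (1), summed over the
    interior points. `track` is an (n, 2) array of center points."""
    track = np.asarray(track, dtype=float)
    v = np.diff(track, axis=0)            # difference vectors v_t
    total = 0.0
    for t in range(1, len(v)):            # interior points of the path
        n1, n2 = np.linalg.norm(v[t - 1]), np.linalg.norm(v[t])
        if n1 == 0 or n2 == 0:
            continue                      # undefined for zero motion
        direction = np.dot(v[t - 1], v[t]) / (n1 * n2)
        speed = 2.0 * np.sqrt(n1 * n2) / (n1 + n2)
        total += w * direction + (1.0 - w) * speed
    return total

# Equation (2): total smoothness over m tracks.
# Ts = sum(smoothness(track) for track in tracks)
```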

3. Tracker

Our tracker (see Figure 1) consists of two main elements: background subtraction (detection) and tracking. The subtraction, performed on video data that is first processed with a Gaussian filter to remove noise, is done by an adaptive algorithm based on LBP texture distributions [10]. The algorithm was chosen because of its good performance in most environments and the fact that it exploits the same texture properties as the object matching part of the tracker, so the re-use of features is possible.

The subtracted foreground is enhanced by filtering out the artifacts caused by noise and moving background using standard morphological operations. All the remaining foreground areas are considered possible object candidates and filtered according to the needs: the size of an object depends heavily on the surveillance scene.
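
The pre- and post-processing around the subtraction can be sketched with OpenCV, using the parameter values reported in Section 4 (a 3x3 Gaussian with σ = 0.95, one close/open pass with a 3-pixel circular kernel, and scene-dependent size filtering). The LBP background model itself, from [10], is assumed to produce the binary `mask`. A rough sketch, assuming OpenCV 4:

```python
import cv2

KERNEL = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))

def preprocess(frame):
    # 3x3 Gaussian, sigma = 0.95, to suppress sensor noise (Section 4)
    return cv2.GaussianBlur(frame, (3, 3), 0.95)

def clean_foreground(mask):
    # One close and one open with a circular 3-pixel kernel removes
    # small artifacts from noise and moving background
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, KERNEL)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, KERNEL)

def detect_objects(mask, frame_area, min_ratio=0.003):
    # Remaining foreground blobs above a scene-dependent size ratio
    # (0.02 for short-distance, 0.003 for long-distance scenes)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [c for c in contours
            if cv2.contourArea(c) >= min_ratio * frame_area]
```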

The tracking is done by matching features extracted from the subtracted foreground shapes. As stated in Section 2, three of these are based on histogram distributions: the RGB color histogram [24] and correlogram [11], and LBP [19].

Figure 1. The tracker. The same features (color, texture and motion) are used in initial matching and group handling. [Block diagram: frames → texture-based background subtraction → object detection using contours (detection stage); matching using color, texture and motion features → group handling (merging and splitting) (tracking stage) → tracked objects.]

The reason for choosing two color-based descriptors is the difference in their spatial performance. While the histogram is a global measure and thus invariant to many local attributes like scale, the correlogram takes into account spatial color distributions and has better discrimination performance on coherent data. The spatial properties are usually well preserved in tracking, as the objects do not change much between successive frames, depending, of course, on the frame rate.

The LBP texture measure was selected due to its qualified performance, as stated in Section 2.2. It supports the color features well in natural scenes, as those often contain a lot of textural information. It also provides better discrimination capabilities in many situations where a simple color descriptor may fail, for example in low lighting conditions.

The other two cues for matching, the geometric distance and the combined smoothness of speed and direction, were included to emphasize the importance of motion in tracking.

The tracker uses a similar structure to Yang et al.'s system [27], in which tracking is based on distance and correspondence matrices and object occlusion is managed through the detection of splitting and merging events. First, a distance matrix with new measures as columns and existing tracks as rows is built. This is followed by creating a zero-initialized correspondence matrix in which the best matching track for each measure is marked by incrementing the corresponding matrix element by one. The same is done for each track, after which the correspondence matrix elements with a value of two are considered definite matches, as sketched below. Unmatched measures and tracks are considered possible candidates for merging and splitting. The complete details of the logic are described in Yang et al.'s paper [27].
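
A compact sketch of this matrix-based matching as summarized above (following Yang et al. [27] in spirit, not their actual code): track i and measure j are a definite match only when each is the other's nearest neighbor, i.e., the correspondence matrix element reaches two.

```python
import numpy as np

def match(distance):
    """distance: (n_tracks, n_measures) matrix of overall distances.
    Returns definite (track, measure) matches plus the unmatched
    tracks and measures, which become merge/split candidates."""
    n_tracks, n_measures = distance.shape
    corr = np.zeros((n_tracks, n_measures), dtype=int)
    for j in range(n_measures):            # best track for each measure
        corr[np.argmin(distance[:, j]), j] += 1
    for i in range(n_tracks):              # best measure for each track
        corr[i, np.argmin(distance[i, :])] += 1
    matches = [(i, j) for i in range(n_tracks)
               for j in range(n_measures) if corr[i, j] == 2]
    unmatched_tracks = set(range(n_tracks)) - {i for i, _ in matches}
    unmatched_measures = set(range(n_measures)) - {j for _, j in matches}
    return matches, unmatched_tracks, unmatched_measures
```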

In our system, a group of diverse features is employed for measure-to-track matching, and the events are detected differently. Instead of using the bounding boxes themselves for occlusion detection, we surround the boxes with circles that have the radii of the half diagonals of the boxes, as in Figure 2, and use them for event detection: if the object circles are occluding each other in the previous frame n − 1, a merging event in frame n is possible. If the occlusion holds in frame n + 1 and the closest occluding object is a group object (1(1&2) in Figure 2), then a split might have happened.
Figure 2. Diagonal occlusion. The merging event in frame n is detected by analyzing the situation in the previous frame n − 1. The splitting detection is done in frame n + 1.

Figure 3. Weight selection. The weight of the event detection circle radius is selected according to the frame rate. The weights for the 2 and 25 fps frame rates are displayed with small circles. [Plot: radius weight w_r (x-axis, 0.4-1.6) vs. frame rate f_fps (y-axis, 0-90).]

The circle's radius can be weighted according to the frame rate to cover a smaller or larger detection area. The radius weight $w_r$ is inversely related to the frame rate $f_{\mathrm{fps}}$ and can be estimated with a hyperbola function:
$$f_{\mathrm{fps}} = \frac{\alpha}{w_r^2} + \beta, \qquad (3)$$

from which we get

$$w_r = \sqrt{\frac{\alpha}{f_{\mathrm{fps}} - \beta}}, \qquad (4)$$

where α and β are constants. Figure 3 shows an example curve where α = 15.3 and β = −5.7.
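
The radius weighting of Equation (4) and the circle-based event test can be sketched as follows, with α = 15.3 and β = −5.7 as reported above. The occlusion test simply checks whether the weighted half-diagonal circles of two bounding boxes intersect; the box representation is our own choice for illustration.

```python
import math

ALPHA, BETA = 15.3, -5.7  # constants fitted by the authors (1-30 fps)

def radius_weight(fps):
    """Equation (4): circle radius weight as a function of frame rate."""
    return math.sqrt(ALPHA / (fps - BETA))

def circles_occlude(box_a, box_b, fps):
    """Boxes are (cx, cy, w, h). Each box gets a circle whose radius is
    the weighted half diagonal; overlap flags a possible merge (checked
    in frame n-1) or split (checked in frame n+1) event."""
    wr = radius_weight(fps)
    ra = wr * 0.5 * math.hypot(box_a[2], box_a[3])
    rb = wr * 0.5 * math.hypot(box_b[2], box_b[3])
    dist = math.hypot(box_a[0] - box_b[0], box_a[1] - box_b[1])
    return dist <= ra + rb
```

For example, radius_weight(2) is about 1.41 and radius_weight(25) about 0.71, matching the trend of Figure 3: the lower the frame rate, the larger the detection area.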

The matching itself is carried out using an overall distance obtained from several descriptors. It is done both in the initial correspondence matching process and in the splitting event, in which one has to recognize the object that is leaving its group. After filtering out the least probable candidates with a separate geometric distance threshold, five descriptors are used: the RGB color histogram and correlogram, LBP, the geometric distance, and smoothness. The first three are applied to the upper and lower halves of the object separately. This is done to add more spatial discrimination power in situations where the interesting objects are people wearing two distinctive pieces of clothing, like a shirt and trousers. It should also be noted that the distribution-based features are extracted from the foreground only.

The distance matrix D(i, j) of the initial matching phase is constructed as follows. Each matrix column j corresponding to a measure is filled with a distance vector $d_j$ of M elements (tracks). The vector elements are sums of five descriptor measures that have been normalized to the range [0, 1] prior to the final summing into the vector:

$$d_j = \sum_{f=1}^{5} d_{i,f}, \quad i = 1, 2, \ldots, M, \qquad (5)$$

where $d_{i,f}$ is the normalized distance of feature f of the ith element. The feature-specific normalization is done over the original distance values $d_{i,f}$ as

$$d_{i,f} = \frac{d_{i,f}}{\sum_{i=1}^{M} d_{i,f}}. \qquad (6)$$
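
Equations (5) and (6) amount to normalizing each feature's distances over the candidate tracks before summing, so that no single descriptor dominates the overall distance. A minimal sketch:

```python
import numpy as np

def overall_distances(raw):
    """raw: (M, 5) array; raw[i, f] is the distance between the measure
    and candidate track i for feature f (color histogram, correlogram,
    LBP, geometric distance, smoothness).
    Returns the M-element distance vector d_j of Equation (5)."""
    col_sums = raw.sum(axis=0)
    normalized = raw / np.maximum(col_sums, 1e-12)  # Equation (6)
    return normalized.sum(axis=1)                   # Equation (5)
```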

The smoothness is obtained by adding the center point of the measure to the track's trajectory and then calculating a new smoothness value. Note that if the candidate tracks have a long common history (a period of occlusion), there is no difference in smoothness, as the active trajectories from which the feature is extracted have become identical. The actual smoothness value $T_s$, not being a distance measure converging to zero, is first shifted by its bias, which depends on the length of the trajectory, and then inverted to make it comparable with the other descriptors:
$$d_s = \frac{1}{T_s - \lfloor T_s \rfloor}, \qquad (7)$$

where $d_s$ is the final measure nearing zero and $\lfloor T_s \rfloor$ is the bias value, which has been rounded down to the closest smaller integer.
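
A direct transcription of Equation (7); the zero-fraction guard is our own addition, as the paper does not address that degenerate case.

```python
import math

def smoothness_distance(ts):
    """Equation (7): shift the total smoothness Ts by its bias (floor)
    and invert, so the result is comparable with the other distances."""
    frac = ts - math.floor(ts)
    # Guard an exactly integral Ts (not covered in the paper): return a
    # large distance instead of dividing by zero.
    return 1.0 / frac if frac > 0.0 else float("inf")
```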

The geometric distance is used in the same way as the distribution-based descriptors. After calculating the initial geometric distances between a measure and its candidate tracks, the obtained vector of distance values is normalized to one.

Before distance measurement, the track's color and texture descriptors, all consisting of histogram representations, are updated by filtering the last N histograms with Gaussian weights:
$$h_i^{\mathrm{updated}} = \sum_{t=1}^{N} h_{i,t}\, w_G(t), \qquad (8)$$

where $h_i^{\mathrm{updated}}$ is the updated value of bin i and $h_{i,t}$ the corresponding original bin value, in which t is the index in the temporal dimension, 1 referring to the latest bin value and N to the oldest in the history. The weights $w_G$ are obtained using a standard Gaussian distribution (µ = 0, σ² = 1):

$$w_{\mathrm{init}}(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}, \qquad (9)$$

which is then normalized to one to get $w_G$.

The histogram distances are compared with a Euclidean distance measure:

$$D_{\mathrm{eucl}}(x_1, x_2) = \sqrt{\sum_{i=1}^{N} (x_{1,i} - x_{2,i})^2}, \qquad (10)$$

where i is the corresponding bin of a histogram x of length N.
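
Equations (8)-(10) taken together: the track histograms are smoothed over the last N frames with normalized Gaussian weights, and the resulting distributions are compared with the Euclidean distance. A sketch assuming the histogram history is stored newest-first, with N = 5 as in Section 4:

```python
import numpy as np

def gaussian_weights(n=5):
    """Equation (9), normalized to sum to one: w_G(t), t = 1 (newest)
    through n (oldest)."""
    t = np.arange(1, n + 1)
    w = np.exp(-t ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    return w / w.sum()

def updated_histogram(history):
    """Equation (8): history is an (N, bins) array, newest row first."""
    w = gaussian_weights(len(history))
    return (history * w[:, None]).sum(axis=0)

def euclidean(h1, h2):
    """Equation (10): Euclidean distance between two histograms."""
    return np.sqrt(((h1 - h2) ** 2).sum())
```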

4. Experiments

Our system consists of a software framework operating on standard Intel Pentium 4 (3 GHz) PC hardware with 1 GB of memory. The details of the test video sequences are collected in Table 1. The first two sequences (Indoor and Outdoor) are from real surveillance scenes (camera type unknown), and their original frame rate is 2 fps. The Merge & split video was created using a Sony DFW-VL500 camera. The Near-IR dataset was made with an Axis 213 PTZ Network Camera, which has a built-in IR source and is capable of sensing near-infrared light. The last two video sets belong to the publicly available CAVIAR [2] and IBM [12] databases. The CAVIAR sequence had an original frame rate of 25 fps, which was decreased to 1 fps for testing purposes.

The background subtraction parameters include the radius and sample count of the LBP operator, the radius of a circular region around a pixel over which the LBP histogram is calculated (Rregion), the number of LBP histograms per pixel (K), the thresholds of the histogram proximity measure (TP) and background histogram selection (TB), and the learning rates for model histogram updating (αb, αw). The selected parameter values are collected in Table 2. Guidance for selecting them properly can be found in the original paper [10]. Both our background subtraction and tracking implementations use the modified version of LBP with the thresholding constant a set to 3, as suggested by the original study. To speed up the subtraction, we apply it to every third pixel in the horizontal and vertical directions and use that value as an approximation of the pixel's circular surroundings. Prior to the subtraction, the video frame is processed with a 3x3 Gaussian kernel with a standard deviation σ of 0.95. Afterwards, the small artifacts in the subtracted scene are morphologically filtered by doing one close and one open operation with a circular kernel 3 pixels in diameter.

The RGB color histogram has 216 bins (6x6x6), and the RGB color correlogram is created from two distances (1 and 3) for 64 colors (4x4x4), making up a 128-bin descriptor in total. The LBP feature is calculated from a circular neighborhood of eight samples with a radius of two; thus the length of the LBP histogram becomes 256 bins.
Test sequence   Size      Fps   Length (s)
Indoor          352x288    2    30
Outdoor         352x288    2    61
Merge & split   320x240   15    31
Near-IR         352x288    6    83
CAVIAR          384x288    1    16
IBM             320x240   30    11

Table 1. Test videos and their sizes, frame rates and lengths. Examples of tracking in each video are included in the supporting material.

Parameter          Value
LBP radius         2
LBP sample count   6
Rregion            9
K                  3
TP                 0.6
TB                 0.65
αb                 0.01
αw                 0.01

Table 2. Background subtraction parameters.

In the histogram filtering process, as described by Equations (8) and (9), we use N = 5. In the smoothness extraction, the active trajectory is taken from the last three seconds, and the direction and speed are equally weighted.

The other tracker parameters are chosen as follows. The interesting contours for object detection are thresholded by a minimum size that depends on the scene. At short distances, when the objects tend to be large, we use a foreground/background ratio of 0.02. For long-distance surveillance data, the size filtering is done with a ratio of 0.003. The geometric distance threshold for matching, as introduced in Section 3, is selected according to the frame size. We use values equal to 0.2-0.3 times the diagonal of the frame. For smaller frame rates or very large objects, even greater values could be used, as the locations of the targets change more rapidly. The radius of the merging and splitting detection circle is weighted using Equation (4) with α = 15.3 and β = −5.7; α and β were obtained through experiments with different frame rates (1-30 fps).
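
For reference, the parameter values reported in this section can be collected into one configuration. The key names below are our own and purely hypothetical; only the values come from the paper.

```python
# Hypothetical parameter names; values are those reported in Section 4.
TRACKER_CONFIG = {
    "hist_bins": (6, 6, 6),             # 216-bin RGB histogram
    "correlogram_colors": (4, 4, 4),    # 64 colors ...
    "correlogram_distances": (1, 3),    # ... at two distances -> 128 bins
    "lbp_radius": 2,
    "lbp_samples": 8,                   # 256-bin LBP histogram
    "lbp_threshold_a": 3,
    "hist_filter_n": 5,                 # N in Equations (8)-(9)
    "trajectory_window_s": 3.0,         # active trajectory length
    "smoothness_weight_w": 0.5,         # direction and speed weighted equally
    "min_object_ratio": (0.02, 0.003),  # short / long distance scenes
    "geom_threshold_diag": (0.2, 0.3),  # times the frame diagonal
    "circle_alpha": 15.3,               # Equation (4)
    "circle_beta": -5.7,
}
```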

Figure 4 presents our system's performance in indoor and outdoor scenes at a very low frame rate (2 fps). Tracking is successful through the heavy occlusion events of the indoor dataset, and it is not easily disturbed by moving background elements, like moving trees and reflections from the asphalt.

To demonstrate the performance at even lower frame rates, we took a sequence from the CAVIAR database and lowered its frame rate from 25 to 1 fps. Figure 5 shows the result, where object identities are maintained through a fighting scene containing partial occlusion.

Figure 4. Indoor and Outdoor sequences. The tracker survives in situations where the frame rate is very low (2 fps), heavy occlusion exists and the background contains a lot of movement (moving trees, reflections).

Figure 5. CAVIAR. A low frame rate does not cause major problems for the tracker.

Feature(s)           Precision
All features         87%
Color histogram      73%
Color correlogram    73%
LBP                  60%
Geometric distance   60%
Smoothness           40%

Table 3. Splitting performance.

The overall performance of the different features in the critical object splitting event is shown in Table 3. Each number is the average percentage of correct splitting decisions (15 in total) on the six test videos of Table 1. The performance of the combined features is clearly better than that of the others. The color histogram and correlogram are equal on average, but their performance deviates on different videos. LBP has a clearly lower recognition rate, which may have been affected by the Gaussian filtering of the preprocessing phase: the filtering affects the structural patterns that are considered textures by the LBP operator. The poor performance of the smoothness feature may indicate long occlusion times of objects in the test datasets. If the period of occlusion is considerably longer than in normal passing and the objects move in a uniform manner, the reliability of the trajectory measure will decrease, together with its impact on the final classification decision.

Figure 6 contains a few example frames from the multi-feature tracking in the Merge & split video together with the corresponding background subtraction steps.

Figure 6. Merge & split. The upper row contains the results of the background subtraction process and the lower row shows the actual tracking.

Figure 7. IBM dataset. The system tracks multiple people through several successive occlusion events.

The dependency on the texture-based background subtraction is especially high in this sequence, as there are substantial color similarities between one of the foreground objects and the background.

Figure 7 shows tracking results on an IBM dataset in which three people enter the scene and become occluded by each other several times. Our system has some problems at the beginning of the sequence, as the persons enter the room in parallel, but it is able to maintain the number of people and their identities most of the time after they have been initially discovered.

The system was also tested in low-light conditions. In the sequence of Figure 8, the persons were tracked correctly even though no color information was available. In the left image, the person on the left wears a turquoise shirt while the other wears a white one. Both look very similar in noisy gray-scale imagery.

5. Conclusions

In this study, we have introduced a novel tracker based on the combined use of color, texture and motion features, together with texture-based background subtraction. The system is able to track multiple objects in diverse conditions while achieving speeds of 10-15 fps on a 3 GHz Intel Pentium 4 computer. It is also less sensitive to color due to the use of a versatile collection of cues. The system shows robust performance while most of the parameters remain fixed.

Future research will concentrate on a different weighting scheme in which the illumination conditions and other effectors are taken into account adaptively.

Figure 8. Near-IR. The system also works in noisy near-infrared conditions, where color information is limited to gray-scale. The background subtraction results are shown in the upper row.

The current way of using static feature weighting prevents the cues from dying out, but in some cases some of the features become useless and may even decrease the probability of a correct classification. For example, after a very long period of occlusion the smoothness of the trajectory has no use for classification. The long-term accuracy of people tracking could be improved by confirming the number of people in a group with an additional detector of human shapes or faces. To include cars and other interesting objects of arbitrary shape, a more versatile classifier would be needed.

6. Acknowledgment

This research was supported by the Infotech Oulu Graduate School and the Finnish Funding Agency for Technology and Innovation (Tekes).

References

[1] G. R. Bradski. Computer video face tracking for use in a perceptual user interface. Intel Technology Journal, 2, 1998.
[2] CAVIAR project: benchmark datasets for video surveillance. http://homepages.inf.ed.ac.uk/rbf/caviar/, 2005.
[3] R. Collins, Y. Liu, and M. Leordeanu. Online selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1631–1643, 2005.
[4] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:603–619, 2002.
[5] D. Gibbins, G. N. Newsam, and M. J. Brooks. Detecting suspicious background changes in video surveillance of busy scenes. In Proceedings of the Third IEEE Workshop on Applications of Computer Vision (WACV'96), Sarasota, Florida, USA, 2-4 December 1996, pages 22–26, 1996.
[6] A. Hampapur, L. Brown, J. Connell, A. Ekin, N. Haas, M. Lu, H. Merkl, S. Pankanti, A. Senior, C.-F. Shu, and Y. L. Tian. Smart video surveillance: Exploring the concept of multiscale spatiotemporal tracking. IEEE Signal Processing Magazine, 22:38–51, 2005.
[7] B. Han and L. Davis. Object tracking by adaptive feature extraction. In Proceedings of the International Conference on Image Processing (ICIP 2004), Singapore, 27 June - 2 July 2004, pages 638–644, 2004.
[8] B. Han, Y. Zhu, D. Comaniciu, and L. Davis. Kernel-based Bayesian filtering for object tracking. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), San Diego, California, USA, 20-25 June 2005, pages 227–234, 2005.
[9] I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:809–830, 2000.
[10] M. Heikkilä and M. Pietikäinen. A texture-based method for modeling the background and detecting moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:657–662, 2006.
[11] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih. Image indexing using color correlograms. In Proceedings of the 1997 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 1997), San Juan, Puerto Rico, 17-19 June 1997, pages 762–768, 1997.
[12] IBM datasets. http://www.research.ibm.com/peoplevision/, 2005.
[13] iOmniScient. http://www.iomniscient.com/, 2006.
[14] M. Isard and A. Blake. CONDENSATION - conditional density propagation for visual tracking. International Journal of Computer Vision, 29:5–28, 1998.
[15] A. J. Lipton, H. Fujiyoshi, and R. S. Patil. Moving target classification and tracking from real-time video. In Proceedings of the Fourth IEEE Workshop on Applications of Computer Vision (WACV'98), Princeton, New Jersey, USA, 19-21 October 1998, pages 8–14, 1998.
[16] B. S. Manjunath, J.-R. Ohm, V. Vasudevan, and A. Yamada. Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11:703–715, 2001.
[17] K. Nummiaro, E. Koller-Meier, and L. V. Gool. An adaptive color-based particle filter. Image and Vision Computing, 21:99–110, 2003.
[18] ObjectVideo. http://www.objectvideo.com/, 2006.
[19] T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:971–987, 2002.
[20] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet. Color-based probabilistic tracking. In Proceedings of the 7th European Conference on Computer Vision (ECCV 2002), Copenhagen, Denmark, 28-31 May 2002, pages 661–675, 2002.
[21] V. Philomin, R. Duraiswami, and L. Davis. Quasi-random sampling for Condensation. In Proceedings of the 6th European Conference on Computer Vision (ECCV 2000), Dublin, Ireland, 26 June - 1 July 2000, pages 134–149, 2000.
[22] L. G. Shapiro and G. Stockman. Computer Vision, page 267. Prentice Hall, first edition, 2001.
[23] M. Stricker and M. Orengo. Similarity of color images. In Proceedings of the SPIE Conference on Storage and Retrieval for Image and Video Databases, San Jose, California, USA, 9 February 1995, pages 381–392, 1995.
[24] M. Swain and D. Ballard. Color indexing. In Proceedings of the Third IEEE International Conference on Computer Vision (ICCV 1990), Osaka, Japan, December 1990, pages 11–32, 1990.
[25] M. Tuceryan and A. Jain. Texture analysis. In C. Chen, L. Pau, and P. Wang, editors, Handbook of Pattern Recognition and Computer Vision, pages 207–248. World Scientific, second edition, 1999.
[26] C. Yang, R. Duraiswami, and L. Davis. Fast multiple object tracking via a hierarchical particle filter. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV 2005), Beijing, China, 17-21 October 2005, pages 212–219, 2005.
[27] T. Yang, S. Z. Li, Q. Pan, and J. Li. Real-time multiple objects tracking with occlusion handling in dynamic scenes. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), San Diego, California, USA, 20-25 June 2005, pages 970–975, 2005.