Image and Vision Computing 30 (2012) 966–977

Multiple human tracking in high-density crowds

Irshad Ali, Matthew N. Dailey
Computer Science and Information Management Program, Asian Institute
of Technology (AIT), Pathumthani, Thailand

This paper has been recommended for acceptance by Massimo Piccardi,
Ph.D.
Corresponding author at: Computer Science and Information Management
Department, Asian Institute of Technology (AIT), P.O. Box 4, Klong
Luang, Pathumthani 12120, Thailand. Tel.: +66 875972954.
E-mail addresses: [email protected] (I. Ali), mdailey@ait.asia
(M.N. Dailey).
0262-8856/$ see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.imavis.2012.08.013

Article history: Received 19 January 2012. Received in revised form
26 July 2012. Accepted 22 August 2012.
Keywords: Head detection; Pedestrian tracking; Crowd tracking;
Particle filters; 3D object tracking; 3D head plane estimation; Human
detection; Least-squares plane estimation; AdaBoost detection cascade

Abstract

In this paper, we introduce a fully automatic algorithm to detect and
track multiple humans in high-density crowds in the presence of
extreme occlusion. Typical approaches such as background modeling
and body part-based pedestrian detection fail when most of the scene
is in motion and most body parts of most of the pedestrians are
occluded. To overcome this problem, we integrate human detection
and tracking into a single framework and introduce a
confirmation-by-classification method for tracking that associates
detections with tracks, tracks humans through occlusions, and
eliminates false positive tracks. We use a Viola and Jones AdaBoost
detection cascade, a particle filter for tracking, and color
histograms for appearance modeling. To further reduce false
detections due to dense features and shadows, we introduce a method
for estimation and utilization of a 3D head plane that reduces false
positives while preserving high detection rates. The algorithm learns
the head plane from observations of human heads incrementally,
without any a priori extrinsic camera calibration information, and
only begins to utilize the head plane once confidence in the
parameter estimates is sufficiently high. In an experimental
evaluation, we show that confirmation-by-classification and head
plane estimation together enable the construction of an excellent
pedestrian tracker for dense crowds.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

As public concern about crime and terrorist activity increases, the
importance of security is growing, and video surveillance systems are
increasingly widespread tools for monitoring, management, and law
enforcement in public areas. Since it is difficult for human operators
to monitor surveillance cameras continuously, there is strong interest
in automated analysis of video surveillance data. Some of the
important problems include pedestrian tracking, behavior
understanding, anomaly detection, and unattended baggage detection.
In this paper, we focus on pedestrian tracking.

Automatic pedestrian detection and tracking is a well-studied problem
in computer vision research, but the solutions proposed thus far are
only able to track a few people. Inter-object occlusion,
self-occlusion, reflections, and shadows are some of the factors
making automatic detection and tracking of people in crowds difficult.
The pedestrian tracking problem is especially difficult when the task
is to monitor and manage a large crowd in gathering areas such as
airports and train stations. See the example shown in Fig. 1. There
has been a great deal of progress in recent years, but still, most
state-of-the-art systems are inapplicable to large crowd management
situations because they rely on background modeling [1–5], body part
detection [3,6], or body shape models [7,8,1]. These techniques are
not applicable to heavily crowded scenes in which the majority of the
scene is in motion (rendering background modeling useless) and most
human bodies are partially or fully occluded. Under these conditions,
we believe that the head is the only body part that can be robustly
detected and tracked. In this paper we therefore present a method for
tracking pedestrians that detects and tracks heads rather than full
bodies. The main contributions of our work are as follows:

1. We combine a head detector and particle filter to track multiple
   people in high-density crowds.
2. We introduce a method for estimation and utilization of a head
   plane parallel to the ground plane at the expected human height
   that is extracted automatically from observations from a single,
   uncalibrated camera. The head plane is estimated incrementally,
   and when the confidence in the estimate is sufficiently high, we
   use it to reject likely false detections produced by the head
   detector.
3. We introduce a confirmation-by-classification method for tracking
   that associates detections with tracks over an image and handles
   occlusions in a single step.

Our system assumes a single static uncalibrated camera placed at a
sufficient height so that the heads of people traversing the scene
can be observed. For detection we use a standard Viola and Jones
Haar-like AdaBoost cascade [9], but the detector could be replaced
generically with any real time detector capable of detecting heads in
crowds. For tracking we use a particle filter [10,11] for each head
that incorporates a simple motion model and a color histogram-based
appearance model.

Fig. 1. A sample frame from the Mochit station dataset.

The main difficulty in using a generic object detector for human
tracking is that the detector's output is unreliable; all detectors
make errors. We have a tradeoff between detection rates and false
positive rates: when we try to increase the detection rate, in most
cases we also increase the false positive rate. However, we can
alleviate this dilemma when scene constraints are available;
detections inconsistent with scene constraints can be rejected
without affecting the true detection rate. One such constraint is 3D
scene information. We propose a technique that neither assumes known
scene geometry nor computes interdependencies between objects. We
merely assume the existence of a head plane that is parallel to the
ground plane at the average human height. Nearly all human heads in a
crowded scene will appear within a meter or two of this head plane.
If the relationship between the camera and the head plane is known,
and the camera's intrinsic parameters are known, we can predict the
approximate size of a head's projection into the image plane, and we
can use this information to reject inconsistent candidate
trajectories or only search for heads at appropriate scales for each
position in the image. To find the head plane, we run our head
detector over one or more images of a scene at multiple scales,
compute the 3D position of each head based on an assumed real-world
head size and the camera's intrinsics, and then we find the head
plane using robust nonlinear least squares.

When occlusion is not a problem, constrained head detection works
fairly well, and we can use the detector to guide the frame-to-frame
tracker using simple rules for data association and elimination of
false tracks due to false alarms in the detector. However, when
partial or full occlusions are frequent, data association becomes
critical, and simple matching algorithms no longer work. False
detections often misguide tracks, and tracked heads are frequently
lost due to occlusion. To address these issues, we introduce a
confirmation-by-classification method that performs data association
and occlusion handling in a single step. On each frame, we first use
the detector to confirm the tracking prediction result for each live
trajectory, then we eliminate live trajectories that have not been
confirmed for some number of frames. This process allows us to
minimize the number of false positive trajectories without losing
track of heads that are occluded for short periods of time.

In an experimental evaluation, we find that the proposed method
provides for effective tracking of large numbers of people in a
crowd. Using the automatically-extracted 3D head plane information
improves accuracy, reducing false positive rates while preserving
high detection rates. To our knowledge, this is the largest-scale
individual human tracking experiment performed thus far, and the
results are extremely encouraging. In future work, with further
algorithmic improvements and runtime optimization, we hope to achieve
robust, real time pedestrian tracking for even larger crowds.

The paper is organized as follows: in Section 2, we provide a brief
survey of related work. Section 3 describes the detection and
tracking algorithms in detail. In Section 4, we describe an
experimental evaluation of the algorithm. Section 5 concludes the
paper.

2. Related work

In this section, we provide a summary of related work. While space
limitations make it impossible to provide a complete survey, we
identify the major trends in the research on tracking pedestrians in
crowds.

In crowds, the head is the most reliably visible part of the human
body. Many researchers have attempted to detect pedestrians through
head detection. Zhao et al. [1,12] detect heads from foreground
boundaries, intensity edges, and foreground residues (foreground
regions with previously detected object regions removed). Wu and
Nevatia [2] detect humans using body part detection. They train their
detector on examples of heads and shoulders as well as other body
parts. These methods use background modeling, so while they are
effective for isolated pedestrians or small groups of people, they
fail in high density crowds. For a broad view of pedestrian detection
methods, see the recent survey by Dollár et al. [13]. To attain
robust head detection in high density crowds, we use a Viola and
Jones AdaBoost cascade classifier using Haar-like features [9,14]. We
train the AdaBoost cascade offline, then, at runtime, we use the
classifier as a detector, running a sliding window over the image at
the specific range of scales expected for the scene.

For tracking, we use a particle filter [11]. The particle filter or
sequential Monte Carlo method was introduced to the computer vision
community by Isard and Blake [10,15] and is well known to enable
robust object tracking (see e.g. [16–21]). In this paper, we use the
object detector to guide the tracker. To track heads from frame to
frame, we use the standard approach in which the uncertainty about an
object's state (position) is represented as a set of weighted
particles, each particle representing one possible state. The filter
propagates particles from one frame to another frame using a motion
model, computes a weight for each propagated particle using a sensor
or appearance model, then resamples the particles according to their
weights. The initial distribution for the filter is centered on the
location of the object the first time it is detected.

Building on recent advances in object detection, many researchers
have proposed tracking methods that utilize object detection. These
algorithms use appearance, size, and motion information to measure
similarities between detections and trajectories. Many solutions
exist for the problem of data association between detections and
trajectories. In the joint probabilistic data association filter
approach [22], joint posterior association probabilities are computed
for multiple targets in Poisson clutter, and the best possible
assignment is made on each time step. Reid [23] generates a set of
data-association hypotheses to account for all possible origins of
every measurement over several time steps. The well-known Hungarian
algorithm [24] can also be used for optimal matching between all
possible pairs of detections and live tracker trajectories. Very
recent work [16,25] uses an appearance-based classifier at each time
step to solve the data association problem. In dense crowds, where
appearance ambiguity is high, it is difficult to use global
appearance-based data association between detections and
trajectories; spatial locality constraints need to be exploited. In
this work, we combine head detection with particle filters to perform
spatially-constrained data association. We use the particle filter to
constrain the search for a detection for each trajectory.

Rodriguez et al. [26] first combine crowd density estimates with
individual person detections and minimize an energy function to
jointly optimize the estimates of the density and locations of
individual people in the crowd. In a second part, the authors use the
scene geometry method proposed by Hoiem et al. [27,28]. They select a
few detected heads and compute vanishing lines to estimate the camera
height. After getting the camera height, the authors estimate the 3D
locations of detected heads and compare each 3D location with the
average human height to reject head detections inconsistent with the
scene geometry. Rodriguez et al. only estimate the head plane in
order to estimate the camera height, whereas we estimate the head
plane incrementally from a series of head detections then use the
head plane to reject detections once sufficient confidence in the
head plane estimate is achieved. The Rodriguez et al. method compares
detections to a fixed average human height to reject inconsistent
heads. Our method is more adaptive and is only used if sufficient
confidence in the head plane estimate is achieved.

The multiple-view approach takes input from two or more cameras, with
or without overlapping fields of view. Several research groups use
the multiple-view approach to track people [29–32]. In most cases,
evidence from all cameras is merged to avoid the occlusion problem.
The multiple-view approach can reduce the ambiguity inherent in a
single camera view and can help solve the problem of partial or full
occlusion in dense crowds, but in many real environments, it is not
feasible.

Several research groups have proposed the use of 3D information for
segmentation, for occlusion reasoning, and to recover 3D
trajectories. Lv, Zhao and Nevatia [33] propose a method for auto
calibration from a video of a walking human. First, they detect the
human's head and legs at leg-crossing phases using background
subtraction and temporal analysis of the object shape. Then they
locate the head and feet positions from the principal axis of the
human blob. Finally, they find vanishing points and calibrate the
camera. The method is based on background modeling and shape analysis
and is effective for isolated pedestrians or small groups of people,
but these techniques fail in high density crowds. Zhao, Nevatia, and
Wu [1] use a known ground plane for segmentation and to reduce
inter-object occlusion; Rosales and Sclaroff [34] recover 3D
trajectories using an extended Kalman filter (EKF); Leibe, Schindler,
and van Gool [35] formulate object detection and trajectory
estimation as a coupled optimization problem on a known ground plane.
Hoiem, Efros, and Hebert [27,28] first estimate rough surface
geometry in the scene and then use this information to adjust the
probability of finding a pedestrian at a given image location. In
other words, they estimate possible object locations before applying
an object detector to the image. Their algorithm is based on recovery
of surface geometry and camera height. They use a publicly available
executable to produce confidence maps for three main classes: ground,
vertical, and sky, and five subclasses of vertical: planar surfaces
facing left, center, and right, and non-planar solid and porous
surfaces. To recover the camera height, the authors use manually
labeled training images and compute a maximum likelihood estimate of
the camera height based on the labeled horizon and the height
distributions of cars and people in the scene. Finally, they put the
object into perspective, modeling the location and scale in the
image. They model interdependencies between objects, surface
orientations, and the camera viewpoint. Our algorithm works directly
from the object detector's results, without any assumptions about
scene geometry or interdependencies between objects. We first apply
the head detector to the image and estimate a head plane parallel to
the ground plane at the expected human height directly from detected
heads without any other information. The head plane estimate is
updated incrementally, and when the confidence in the estimate is
sufficiently high, we use it to reject false detections produced by
the head detector.

Ge, Collins and Ruback [36] use a hierarchical clustering algorithm
to divide low and medium density crowds into small groups of
individuals traveling together. To discover the group structure, the
authors detect and track the moving individuals in the scene. The
method has very interesting applications such as abnormal event
detection in crowds and discovering pathways in crowds.

Preliminary reports on our pedestrian tracker have previously
appeared in two conference papers. In the first [37], we introduce
the confirmation-by-classification method, and in the second [38], we
introduce automatic identification of the head plane from a single
image. In the current paper, we have improved the particle filter for
tracking, added robust incremental estimation of the head plane with
a stopping criterion, and performed an extensive empirical evaluation
of the method on several publicly available data sets. We compare our
results to the state of the art on the same data sets.

3. Human head detection and tracking

Here we provide a summary of our head detection and tracking
algorithm in pseudocode then give the details of each of the main
components of the system.

3.1. Summary

1. Acquire input crowd video $V$.
2. In the first frame $v_0$ of $V$, detect heads. Let
   $\mathbf{x}_{i,0} = (x_i, y_i)$, $i \in 1 \ldots N$, be the 2D
   positions of the centers and let $h_i$, $i \in 1 \ldots N$, be the
   heights of the detection windows for the detected heads.
3. For each detected head $i$, compute the approximate 3D location
   $\mathbf{X}_i = (X_i, Y_i, Z_i)$ corresponding to
   $\mathbf{x}_{i,0}$ and $h_i$.
4. Find the 3D plane $\boldsymbol{\pi} = (a, b, c, d)$ best fitting
   the 3D locations $\mathbf{X}_i$ using RANSAC, then refine the
   estimate of $\boldsymbol{\pi}$ using Levenberg–Marquardt to
   minimize the sum squared difference between observed and predicted
   head heights.
5. From the error covariance matrix for the parameters of
   $\boldsymbol{\pi}$, find the volume $V$ of the error ellipsoid as
   an indicator of the uncertainty in the head plane.
6. Initialize trajectories $T_j$, $j \in 1 \ldots N$, with initial
   positions $\mathbf{x}_{j,0}$.
7. Initialize occlusion count $O_j$ for each trajectory $j$ to 0.
8. Initialize the appearance model (color histogram) $h_{j,0}$ for
   each trajectory from the region around $\mathbf{x}_{j,0}$.
9. For each subsequent frame $v_i$ of the input video,
   (a) For each existing trajectory $T_j$:
       i. use the motion model to predict the distribution
          $p(\mathbf{x}_{j,i} \mid \mathbf{x}_{j,i-1})$ over
          locations for head $j$ in frame $i$, creating a set of
          candidate particles $\mathbf{x}_{j,i}^{(k)}$,
          $k \in 1 \ldots K$.
       ii. compute the color histogram $h_{j,i}^{(k)}$ and likelihood
          $p(h_{j,i}^{(k)} \mid \mathbf{x}_{j,i}^{(k)}, h_{j,i-1})$
          for each particle $k$ using the appearance model.
       iii. resample the particles according to their likelihood. Let
          $k_j^*$ be the index of the most likely particle for
          trajectory $j$.
   (b) Perform confirmation by classification:
       i. Run the head detector on frame $v_i$ and get 3D locations
          for each detection.
       ii. If head plane uncertainty $V$ is greater than threshold,
          add the new observations and 3D locations, reestimate
          $\boldsymbol{\pi}$, and recalculate $V$ (Steps 4–5).
       iii. If head plane uncertainty $V$ is less than threshold, use
          the 3D positions and current estimate of $\boldsymbol{\pi}$
          to filter out detections too far from the head plane. Let
          $\mathbf{x}_l$, $l \in 1 \ldots M$, be the 2D positions of
          the centers of new detections after filtering.
       iv. For each trajectory $T_j$, find the detection
          $\mathbf{x}_l$ nearest to $\mathbf{x}_{j,i}^{(k_j^*)}$
          within some distance $C$. If found, consider the location
          classified as a head and reset $O_j$ to 0; otherwise,
          increment $O_j$. In our experiments, we set $C$ to 75% of
          the width (in pixels) of head $j$.
       v. Initialize a new trajectory for each detection not
          associated with a trajectory in the previous step.
       vi. Delete each trajectory $T_j$ that has occlusion count
          $O_j$ greater than a threshold and history length $|T_j|$
          less than the track survival threshold.
       vii. Deactivate each trajectory $T_j$ with occlusion count
          $O_j$ greater than the threshold and history length
          $|T_j|$ greater than or equal to the track survival
          threshold.

3.2. Detection

For object detection, although more promising algorithms have
recently appeared [39,40], we currently use a standard Viola and
Jones AdaBoost cascade [9,14] trained on Haar-like features offline
with a few thousand example heads and negative images. At runtime, we
use the classifier as a detector, running a sliding window over the
image at the specific range of scales expected for the scene.

Our approach to head plane estimation is based on a few assumptions.
We assume a pinhole camera with known focal length and that all human
heads are approximately the same size in the real world. We further
assume that the heads visible in a crowded scene will lie close to a
plane that is parallel to the ground at the average height of the
humans in the scene. Based on these assumptions, we can compute the
approximate 3D position of a detected head using the size of the
detection window then estimate the plane best fitting the data. Once
sufficient confidence in the plane is obtained, we can then reject
detections corresponding to 3D positions too far from the head plane.
See Fig. 2 for a schematic diagram.

We use a robust incremental estimation method. First, we detect heads
in the first frame and obtain a robust estimate of the best head
plane using RANSAC. Second, we refine the estimate by minimizing the
squared difference between measured and predicted head heights using
the Levenberg–Marquardt nonlinear least squares algorithm. Using the
normalized error covariance matrix, we compute the volume of the
error ellipsoid, which indicates the uncertainty in the estimated
plane's parameters. On subsequent frames, we add any new detections
and reestimate the plane until the volume of the error ellipsoid is
below a threshold. We determine the threshold experimentally. Details
of the method are given below.

3.2.1. 3D head position estimation from a 2D detection

Given the approximate actual height $h_0$ of human heads, a 2D head
detection at image position $\mathbf{x}_i = (x_i, y_i)$ with height
$h_i$, and the camera focal length $f$, we can compute the
approximate 3D location $\mathbf{X}_i = (X_i, Y_i, Z_i)$ of the
candidate head in the camera coordinate system as follows.

$$ Z_i = \frac{h_0}{h_i} f \tag{1} $$

$$ X_i = \frac{Z_i}{f} x_i = \frac{h_0}{h_i} x_i \tag{2} $$

$$ Y_i = \frac{Z_i}{f} y_i = \frac{h_0}{h_i} y_i \tag{3} $$

Fig. 2. Flow of the incremental head plane estimation algorithm.
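
As a worked illustration of Eqs. (1)–(3), the following sketch back-projects a single detection. It assumes that image coordinates are measured from the principal point and that the focal length is given in pixels; the default head height of 0.2 m and the function name are illustrative choices, not values from the paper.

```python
def head_position_3d(x, y, h, f, h0=0.2):
    """Approximate 3D head location in camera coordinates.

    x, y : detection centre in pixels, relative to the principal point
    h    : detection window height in pixels
    f    : focal length in pixels
    h0   : assumed real-world head height in metres (illustrative)
    """
    if h <= 0:
        raise ValueError("detection height must be positive")
    Z = h0 / h * f      # Eq. (1): depth from apparent size
    X = Z / f * x       # Eq. (2): same as (h0 / h) * x
    Y = Z / f * y       # Eq. (3): same as (h0 / h) * y
    return X, Y, Z
```

Because depth is inversely proportional to the window height, a one-pixel error in `h` moves a distant (small) head much further in 3D than a nearby (large) one, which is the motivation for the height-based refinement later in this section.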

3.2.2. Linear head plane estimation

After obtaining a set of 3D locations
$\mathcal{X} = \{\mathbf{X}_i\}_{i=1}^{n}$ of possible heads, we
compute the parameters $\boldsymbol{\pi} = (a, b, c, d)$ of the plane

$$ aX + bY + cZ + d = 0 $$

minimizing the objective function

$$ q(\boldsymbol{\pi}) = \sum_{i=1}^{n} (aX_i + bY_i + cZ_i + d)^2. \tag{4} $$

Since the set of 3D locations $\mathcal{X}$ will in general contain
outliers due to errors in head height estimates and false detections
from the detector, before performing the above minimization, we
eliminate outliers using RANSAC [41]. On each iteration of RANSAC, we
sample three points from $\mathcal{X}$, compute the corresponding
plane, find the consensus set for that plane, and retain the largest
consensus set. The number of iterations is calculated adaptively
based on the size of the largest consensus set. We set the number of
iterations $k$ to the minimum number of iterations required to
guarantee, with a small probability of failure, that the model with
the largest consensus set has been found. The probability of
selecting three inliers from $\mathcal{X}$ at least once is
$p = 1 - (1 - w^3)^k$, where $w$ is the probability of selecting an
inlier in a single sample. We initialize $k$ to infinity, then on
each iteration, we recalculate $w$ using the size of the largest
consensus set found so far and then find the number of iterations $k$
needed to achieve a success rate of $p$.

3.2.3. Nonlinear head plane refinement

The linear estimate of the head plane computed in the previous
section minimizes an algebraic objective function (Eq. (4)) that does
not take into account the fact that head detections close to the
camera are more accurately localized in 3D than head detections far
away from the camera.

In this step, we refine the linear estimate of the head plane to
minimize the objective function

$$ q(\boldsymbol{\pi}) = \sum_{i=1}^{n} (h_i - \hat{h}_i)^2, \tag{5} $$

where $h_i$ is the height of the detection window for head $i$ and
$\hat{h}_i$ is the predicted height of head $i$ based on the 2D
location $(x_i, y_i)$ of its detection and the plane
$\boldsymbol{\pi}$. To calculate $\hat{h}_i$, we find the ray through
the camera center $C = (0, 0, 0)$ passing through $(x_i, y_i)$, find
the intersection
$\hat{\mathbf{X}}_i = (\hat{X}_i, \hat{Y}_i, \hat{Z}_i)^T$ of that
ray with the head plane $\boldsymbol{\pi}$, then calculate the
expected height of an object with height $h_0$ at
$\hat{\mathbf{X}}_i$ when projected into the image.

To find the intersection of the ray with the plane, given the camera
matrix

$$ K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix} $$

containing the focal length $f$ and principal point $(c_x, c_y)$, we
find an arbitrary point

$$ \mathbf{X}'_i = K^{-1} \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} $$

on the ray then find the scalar $\lambda$ such that

$$ \boldsymbol{\pi}^T \begin{bmatrix} \lambda \mathbf{X}'_i \\ 1 \end{bmatrix} = 0. $$

Finally, we calculate $\hat{\mathbf{X}}_i = \lambda \mathbf{X}'_i$
and $\hat{h}_i = \frac{h_0}{\hat{Z}_i} f$.

We use the Lourakis implementation of the Levenberg–Marquardt
nonlinear least squares algorithm [42] to find the plane
$\boldsymbol{\pi}$ minimizing the objective function of Eq. (5) and
obtain an error covariance matrix $Q$ for the elements of
$\boldsymbol{\pi}$.

3.2.4. Incremental head plane estimation

Based on the parameter vector $\boldsymbol{\pi}$ and covariance
matrix $Q$ obtained as described in the previous section, we compute
the volume of the error ellipsoid to indicate the uncertainty in the
estimated plane's parameters. Since the uncertainty in the plane's
orientation only depends weakly on the distance of the camera to the
head plane, whereas the uncertainty in the plane's distance to the
camera depends strongly on that distance, we only consider the
uncertainty in the plane's (normalized) orientation, ignoring the
uncertainty in the distance to the plane. Let $\lambda_1$,
$\lambda_2$, and $\lambda_3$ be the eigenvalues of the upper
$3 \times 3$ submatrix of $Q$ representing the uncertainty in the
plane orientation parameters $a$, $b$, and $c$. The radii of the
normalized plane orientation error ellipsoid are

$$ r_i = \sqrt{\frac{\lambda_i}{a^2 + b^2 + c^2}} $$

for $i \in 1, 2, 3$. The volume of the plane orientation error
ellipsoid is then

$$ V = \frac{4}{3} \pi r_1 r_2 r_3, \tag{6} $$

where $\pi$ here is the usual constant, not the plane
$\boldsymbol{\pi}$.

$V$ quantifies the uncertainty of the estimated head plane's
orientation. During tracking, for each frame, we add any newly
detected heads, reestimate the head plane, and recalculate $V$. If
$V$ is less than a threshold, we stop the process and use the head
plane to filter subsequent detections. We determine the threshold
empirically.

3.3. Particle filter

We use particle filters [10,11] to track heads. The particle filter
is well known to enable robust object tracking (see e.g. [16–21]). We
use the standard approach in which the uncertainty about an object's
state (position) is represented as a set of weighted particles, each
particle representing one possible state. Our method automatically
initializes separate filters for each new trajectory. The initial
distribution of the particles is centered on the location of the
object the first time it is detected. The filters propagate particles
from frame $i-1$ to frame $i$ using a motion model then compute
weights for each propagated particle using a sensor or appearance
model. Here are the steps in more detail:

1. Predict: we predict $p(\mathbf{x}_{j,i} \mid \mathbf{x}_{j,i-1})$,
   a distribution over head $j$'s position in frame $i$ given our
   belief in its position in frame $i-1$. The motion model is
   described in the next section.
2. Measure: for each propagated particle $k$, we measure the
   likelihood
   $p(h_{j,i}^{(k)} \mid \mathbf{x}_{j,i}^{(k)}, h_{j,i-1})$ using a
   color histogram-based appearance model. After computing the
   likelihood of each particle, we treat the likelihoods as weights,
   normalizing them to sum to 1.
3. Resample: we resample the particles to avoid degenerate weights,
   obtaining a new set of equally-weighted particles. We use
   sequential importance resampling (SIR) [11].

3.3.1. Motion model

We use a second-order auto-regressive dynamical model to predict the
2D position in the current frame based on the 2D positions in the
past two frames. In particular, we assume the simple second-order
linear autoregressive model

$$ \mathbf{x}_{j,i} = 2\mathbf{x}_{j,i-1} - \mathbf{x}_{j,i-2} + \epsilon_i $$

in which $\epsilon_i$ is distributed as a circular Gaussian.
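
A minimal sketch of this prediction step follows. The model extrapolates with constant velocity, since $2\mathbf{x}_{j,i-1} - \mathbf{x}_{j,i-2} = \mathbf{x}_{j,i-1} + (\mathbf{x}_{j,i-1} - \mathbf{x}_{j,i-2})$, and then adds isotropic Gaussian noise. The noise standard deviation `sigma` and the function name are assumptions; the paper's value is not given in this section.

```python
import random


def propagate(prev, prev2, sigma, rng=random):
    """Sample x_i = 2 * x_{i-1} - x_{i-2} + eps, eps ~ N(0, sigma^2 I).

    prev  : (x, y) position in frame i-1
    prev2 : (x, y) position in frame i-2
    sigma : noise standard deviation in pixels (illustrative parameter)
    """
    return (2.0 * prev[0] - prev2[0] + rng.gauss(0.0, sigma),
            2.0 * prev[1] - prev2[1] + rng.gauss(0.0, sigma))
```

Calling `propagate` once per particle yields the candidate particle set for the current frame.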

3.3.2. Appearance model

Our appearance model uses color histograms to compute particle
likelihoods. We use the simple method of quantizing to fixed-width
bins. We use 30 bins for hue and 32 bins for saturation. Learning
optimized bins or using a more sophisticated appearance model based
on local histograms along with other information such as spatial or
structural information would most likely improve our tracking
performance, but the simple method works well in our experiments.

Whenever we create a new track, we compute a color histogram $h_j$
for the detection window in HSV space and save it for comparison with
histograms extracted from future frames. To extract the histogram
from a detection window, we use a circular mask to remove the corners
of the window.

We use the Bhattacharyya similarity coefficient between model
histogram $h_j$ and observed histogram $h^{(k)}$ to compute a
particle's likelihood as follows, assuming $n$ bins in each
histogram:

$$ p(h \mid \mathbf{x}, h') \propto e^{-d(h, h')}, \tag{7} $$

where

$$ d(h, h') = 1 - \sum_{b=1}^{n} \sqrt{h_b h'_b} $$

and $h_b$ and $h'_b$ denote bin $b$ of $h$ and $h'$, respectively.

When we track an object for a long time, its appearance will change,
so we update the track histogram for every frame in which the track
is confirmed.
3.4. Confirmation by classification
To reduce tracking errors, we introduce a simple
confirmation-by-classification method, described in detail in this
section.
3.4.1. Recovery from missesMany researchers, for example
Breitenstein et al. [16], initialize
trackers for only those detections appearing in a zone along the
imageborder. In high density crowds, this assumption is invalid, so
we maymiss many heads. Due to occlusion and appearance variation,
we maynot detect all heads in the first frame or when they
initially appear. Tosolve this problem, in each image,we search for
newheads in all regions
-
971I. Ali, M.N. Dailey / Image and Vision Computing 30 (2012)
966977of the image not predicted by themotionmodel for a previously
trackedhead. Any newly detected headwithin some distance C of the
predictedposition of a previously tracked head is assumed to be
associated withthe existing trajectory and ignored. If the distance
is greater than C,we create a new trajectory for that detection. We
currently set C to be75% of the width of the detection
window.3.4.2. Data associationWith detection-based tracking, it is
difficult to decide which detection should guide which track. Most
researchers compute a similarity matrix between new detections in
the frame and existing trajectories using color, size, position, and
motion features, then find an optimal assignment. These solutions
work well in many cases, but in high density crowds in which the
majority of the scene is in motion and most of the humans' bodies
are partially or fully occluded, they tend to introduce tracking
errors such as ID switches. In this work, we use the particle filter
to guide the search for a detection for each trajectory. For each
trajectory $T_j$, we search for a detection at location
$x_{j,i}^{(k)}$ within some distance C, where $x_{j,i}^{(k)}$ is the
position of the most likely particle for trajectory j. We currently
set C to be 75% of the width of the detection window. If a detection
is found, we consider the location to be classified as a head, mark
the trajectory as confirmed in this frame, and associate the
detection with the current track; if no detection is found, we
consider the trajectory to be occluded in this frame. We use the
confirmation and occlusion information to reduce tracking errors.
Details are given in the next section.
3.4.3. Occlusion count
Occlusion handling and
rejection of inconsistent false tracks are the main challenges for
any tracking algorithm. To handle these problems, we introduce a
simple occlusion count scheme. When head j is first detected and its
trajectory is initialized, we set the occlusion count $O_j = 0$.
After updating the head's position in frame i, we confirm the
estimated position through detection as described in the previous
section. On each frame, the occlusion count of each trajectory not
confirmed through classification is incremented, and the occlusion
count of each confirmed trajectory is reset to 0. An example of the
increment and reset process is shown in Fig. 3(a). The details of the
algorithm are as follows:
1. Eliminate false tracks: shadows and other non-head objects in the
scene tend to produce transient false detections that could lead to
tracking errors. In order to prevent these false detections from
being tracked by the appearance-based tracker through time, we use
the head detector to confirm the estimated head position for each
trajectory and eliminate any new trajectory not confirmed for some
number of frames. A trajectory is considered transient until it is
confirmed in several frames.
2. Short occlusions: to handle short occlusions during tracking, we
keep track of the occlusion count for each trajectory. If the head
is confirmed before the occlusion count reaches the deactivation
threshold, we consider the head successfully tracked through the
occlusion. An example is shown in Fig. 3(a).
3. Long occlusions: when an occlusion in a crowded scene is long, it
is often impossible to recover, due to appearance changes and
uncertainty in tracking. However, if the object's appearance when it
reappears is sufficiently similar to its appearance before the
occlusion, it can be restored. We use the occlusion count and number
of confirmations to handle long occlusions through deactivation and
reactivation. When an occlusion count reaches the deactivation
threshold, we deactivate the trajectory. An example is shown in
Fig. 3(b). Subsequently, when a new trajectory is confirmed by
detections in several consecutive frames, we consider it a candidate
continuation of existing deactivated trajectories. An example is
shown in Fig. 3(c). Whenever a newly confirmed trajectory matches a
deactivated trajectory sufficiently strongly, the deactivated
trajectory is reactivated from the position of the new trajectory.
Fig. 3. Occlusion count scheme. (a) Short occlusion. The occlusion
count does not reach the deactivation threshold, so the head is
successfully tracked through the occlusion. (b) Long occlusion. The
occlusion count reaches the deactivation threshold, so the track is
deactivated. (c) Possible reactivation of a deactivated trajectory.
Newly confirmed trajectories are compared with deactivated
trajectories, and if the appearance match is sufficiently strong, the
deactivated track is reactivated from the position of the matching
new trajectory.
4. Experimental evaluation
In this section, we provide details of a series of experiments to
evaluate our algorithm. First we describe the training and test data
sets. Second, we describe some of the important implementation
details. Third, we describe the evaluation metrics we use. Finally we
provide results and discussion for the detection and tracking
evaluations.
Fig. 4. Positive head samples used to train the head detector
offline. (a) Example images collected for training. (b) Example
scaled images.
4.1. Training data
To train the Viola and Jones Haar-like AdaBoost cascade detector,
we cropped 5396 heads (examples are shown in Fig. 4) from videos
collected at various locations and scaled each to 16×16 pixels. We
also collected 4187 negative images not containing human heads.
For the CAVIAR experiments described in the next section, since
many of the heads appearing in the CAVIAR test set are very small,
we created a second training set by scaling the same 5396 positive
examples to a size of 10×10.
4.2. Test data
There is no generally accepted dataset available for crowd
tracking. Most researchers use their own datasets to evaluate their
algorithms. In this work, for the experimental evaluation, we have
created, to the best of our knowledge, the most challenging existing
dataset specifically for tracking people in high density crowds; the
dataset with ground truth information is publicly available for
download at http://www.cs.ait.ac.th/vgl/irshad/.
We captured the video at 640×480 pixels and 30 frames per second
at the Mochit light rail station in Bangkok, Thailand. A sample
frame is shown in Fig. 1. We then hand labeled the locations of all
heads present in every frame of the video. For the ground truth data
format, we followed the annotation guidelines for the Video Analysis
and Content Extraction (VACE-II) workshop [43]. This
means that for each head present in each frame, we record the
bounding box, a unique ID that is consistent across frames, and a
flag indicating whether the head is occluded or not. We labeled a
total of 700 frames containing a total of 28,430 heads, for an
average of 40.6 heads/frame.
For comparison with existing state of the art research, we also
test our algorithm on the well-known indoor Context-Aware Vision
using Image-based Active Recognition (CAVIAR) [44] dataset. Since
our evaluation requires ground truth positions of each pedestrian's
head in each frame, we selected the four sequences from the shopping
center corridor view for which the needed information is available.
The sequences contain a total of 6315 frames, the frame size is
384×288, and the sequences were captured at 25 frames per second.
There are a total of 26,950 heads over all four sequences, with an
average of 4.27 heads per frame.
Fig. 5. Error in head plane estimation. We plot the volume of the
ellipsoid computed using the covariance matrix of the nonlinear head
plane estimation based on the 3D positions of heads. The graph shows
the error in each frame after adding newly detected heads in the
frame. (Axes: frame number vs. error ellipsoid volume.)
Table 1
Detection results for single image with and without head plane
estimation.
                               GT  Hits  Misses  FP
Without head plane estimation  34  31    3       35
With head plane estimation     34  30    4       12
Our algorithm is designed to track heads in high density crowds.
The performance of any tracking algorithm will depend upon the
density of the crowd. In order to characterize this relationship we
introduce a simple crowd density measure
$$D = \frac{\sum_i P_i}{N}, \qquad (8)$$
where $P_i$ is the number of pixels in pedestrian i's bounding box
and $N$ is the total number of pixels in all of the images.
According to this measure, the highest per-frame crowd density in
our Mochit test sequence is 0.63, whereas the highest per-frame
crowd density in CAVIAR is 0.27. The crowd density in the Mochit
sequence is higher than that in any publicly-available pedestrian
tracking video database.
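As a rough illustration of Eq. (8) applied to a single frame, the sketch below sums the bounding-box areas and divides by the frame area. The `(x, y, w, h)` box layout and the function name are assumptions; as in the formula, overlapping boxes are simply summed, so pixels covered by more than one box are counted more than once.

```python
def crowd_density(boxes, frame_width, frame_height):
    """Per-frame crowd density: total bounding-box pixels over frame pixels.

    boxes is an iterable of (x, y, w, h) tuples (assumed layout).
    """
    box_pixels = sum(w * h for (_x, _y, w, h) in boxes)
    return box_pixels / float(frame_width * frame_height)


# Two boxes in a 640x480 frame: (100*100 + 50*40) / (640*480)
example = crowd_density([(10, 10, 100, 100), (200, 50, 50, 40)], 640, 480)
```

The highest per-frame value over a sequence is then the maximum of this quantity across its frames.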
To directly evaluate the accuracy of our head plane estimation
method, we use the People Tracking sequence (S2.L1) of the
Performance Evaluation of Tracking and Surveillance (PETS) [45]
dataset. In PETS, camera calibration information is given for every
camera. In this sequence there are a total of eight views. In views
1, 2, 3, and 4, the head sizes are very small; we excluded these
views because our method has a limit on the minimum head size that
can be detected. We trained a new head detector specifically on the
PETS training data (which is separate from the people tracking test
sequence), ran the tracking and head plane estimation algorithm on
views 5, 6, 7, and 8, then compared our estimated head plane with
the actual ground plane information that comes with the dataset.
4.3. Implementation details
We implemented the system in C++ with OpenCV without any special
code optimization. The system attempts to track anything head-like
in the scene, whether moving or not, since it does not rely on any
background modeling. We detect heads and create initial trajectories
based on the first frame, and then we track heads from frame to
frame. Further implementation details are given in the following
sections.
4.3.1. Trajectory initialization and termination
We use the head detector to find heads in the first frame and
create initial trajectories. As previously mentioned, rather than
detect heads only in the border region of the image, we detect all
heads in every frame. We first try to associate new heads with
existing trajectories; when this fails for a new head detection, a
new trajectory is initialized from the current frame. Any head
trajectory in the exit zone (close to the image border) for which
the motion model predicts a location outside the frame is
eliminated.
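The association and initialization logic can be sketched as below, under stated assumptions. The paper says only that each trajectory searches for a detection within distance C (75% of the detection window width) of its most likely particle, and that detections farther than C from every predicted position start new trajectories. The greedy nearest-detection choice, the single `window_width` parameter, and all names here are illustrative and not taken from the authors' C++ implementation.

```python
import math


def associate_detections(predicted, detections, window_width, ratio=0.75):
    """Match detections to tracks and find detections that start new tracks.

    predicted:  dict track_id -> (x, y) of the track's most likely particle
    detections: list of (x, y) detection centers
    Returns (confirmed, occluded, new_indices) where confirmed maps
    track_id -> detection index.
    """
    c = ratio * window_width  # search radius C
    confirmed = {}
    unused = set(range(len(detections)))
    for track_id, (px, py) in predicted.items():
        best, best_dist = None, c
        for idx in sorted(unused):
            dx, dy = detections[idx]
            dist = math.hypot(dx - px, dy - py)
            if dist <= best_dist:  # assumed rule: nearest detection within C
                best, best_dist = idx, dist
        if best is not None:
            confirmed[track_id] = best
            unused.discard(best)
    # Tracks without a detection in range are treated as occluded this frame.
    occluded = [t for t in predicted if t not in confirmed]
    # Only detections farther than C from every prediction start new tracks;
    # leftover detections within C of some track are ignored.
    new_indices = [
        idx for idx, (dx, dy) in enumerate(detections)
        if all(math.hypot(dx - px, dy - py) > c for px, py in predicted.values())
    ]
    return confirmed, occluded, new_indices
```

Because the search is driven per trajectory by the particle filter's prediction, no global similarity matrix or optimal assignment step is needed.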
Fig. 6. Detection results for one frame of the Mochit test video.
Rectangles indicate candidate head positions. (a) Without 3D head
plane estimation. (b) With 3D head plane estimation.
4.3.2. Identity management
It is also important to assign and maintain object IDs automatically
during tracking. We assign a unique ID to each trajectory during
initialization then maintain the ID during tracking. Trajectories
that are temporarily lost due to occlusion are reassigned the same
ID on recovery to avoid identity changes. During long occlusions,
when a track is not confirmed for several frames, we deactivate that
track and search for new matching detections in subsequent frames.
If a match is found, the track is reactivated from the position of
the new detection and reassigned the same ID.
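The occlusion-count bookkeeping behind this identity management (Section 3.4.3) can be sketched as a small state machine. The paper gives no numeric values for the deactivation threshold, the number of confirmations, or the appearance-match threshold, so the defaults below are placeholders, and the `similarity` callback stands in for whatever histogram comparison is used.

```python
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    occlusion_count: int = 0
    consecutive_confirmations: int = 0
    active: bool = True


def update_track(track, confirmed, deactivation_threshold=10):
    """Apply one frame of confirmation-by-classification bookkeeping."""
    if not track.active:
        return track
    if confirmed:
        track.occlusion_count = 0  # reset on every confirmed frame
        track.consecutive_confirmations += 1
    else:
        track.occlusion_count += 1
        track.consecutive_confirmations = 0
        if track.occlusion_count >= deactivation_threshold:
            track.active = False  # long occlusion: deactivate, keep the ID
    return track


def try_reactivate(new_track, deactivated, similarity,
                   min_similarity=0.8, min_confirmations=3):
    """Return the deactivated track that new_track continues, or None."""
    if new_track.consecutive_confirmations < min_confirmations:
        return None  # still transient; not yet a candidate continuation
    best, best_score = None, min_similarity
    for old in deactivated:
        if old.active:
            continue
        score = similarity(new_track, old)
        if score >= best_score:
            best, best_score = old, score
    if best is not None:
        best.active = True  # the old ID resumes from the new track's position
        best.occlusion_count = 0
    return best
```

A caller would copy the new track's position into the reactivated track and discard the new track's ID, so the person keeps one identity across the occlusion.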
4.3.3. Incremental head plane estimation
As previously discussed, we incrementally estimate the head plane.
During tracking, we collect detected heads cumulatively over each
frame and perform head plane estimation. As discussed in
Section 3.2.4, to determine when to start using the estimated plane
to filter detections, we use the volume of the normalized plane
orientation error ellipsoid (see Eq. (6)) as a measure of the
uncertainty in the estimate. Fig. 5 shows how the error ellipsoid
volume evolves over time on the Mochit data set as heads detected in
subsequent frames are added to the data set. We stop head plane
estimation when the error ellipsoid volume is less than 0.0003.
As a quantitative evaluation of the head plane estimation method,
we trained a new head detector on the PETS training set then ran our
system on views 5–8 from the PETS People Tracking sequence (S2.L1).
The estimation error was 305 mm, 433 mm, 355 mm, and 280 mm for the
orthogonal distance between the plane and the camera center and 25,
20, 17, and 16 for the orientation. This indicates that the method
is quite effective at unsupervised estimation of the head plane.
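The two error measures reported above (difference in orthogonal distance from the camera center and difference in orientation) can be computed for a pair of planes roughly as follows. The plane parameterization `(a, b, c, d)` with `a*x + b*y + c*z + d = 0` in camera coordinates, the use of degrees for the angle, and the function name are assumptions. How the height offset between the head plane and the ground plane is handled is not stated in this section, and the sketch ignores it.

```python
import math


def plane_errors(est, ref):
    """Return (distance_error, angle_error_deg) between two planes.

    Each plane is (a, b, c, d) with a*x + b*y + c*z + d = 0 and the camera
    center at the origin (assumed convention).
    """
    def unit_normal_and_distance(plane):
        a, b, c, d = plane
        norm = math.sqrt(a * a + b * b + c * c)
        # |d| / ||n|| is the orthogonal distance from the origin to the plane.
        return (a / norm, b / norm, c / norm), abs(d) / norm

    n_est, dist_est = unit_normal_and_distance(est)
    n_ref, dist_ref = unit_normal_and_distance(ref)
    # Use |cos| so that a flipped normal does not count as a 180 degree error.
    cos_angle = min(1.0, abs(sum(e * r for e, r in zip(n_est, n_ref))))
    return abs(dist_est - dist_ref), math.degrees(math.acos(cos_angle))
```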
4.4. Evaluation metrics
In this section, we describe the methods we use to evaluate the
tracking algorithm. Unfortunately there are no commonly-used metrics
for human detection and tracking in crowded scenes. We adopt
measures similar to those proposed by Nevatia et al. [2,46,47] for
tracking pedestrians in sparse scenes. In their work, there are
different definitions for ID switch and trajectory fragmentation
errors. Wu and Nevatia [2] define ID switches as identity exchanges
between a pair of result trajectories, while Li, Huang, and Nevatia
[47] define an ID switch as a tracked trajectory changing its
matched GT ID. We adopt the definitions of ID switch and fragment
errors proposed by Li, Huang, and Nevatia [47]. If a trajectory ID
is changed but not exchanged, we count it as one ID switch,
similarly for fragments. This definition is more strict and leads to
higher numbers of ID switch and fragment errors, but it is well
defined.
Bernardin and Stiefelhagen have proposed an alternative set of
metrics, the CLEAR MOT metrics [48], for multiple object tracking
performance. Their multiple object tracking precision (MOTP) and
multiple object tracking accuracy (MOTA) methods are not suitable for
fordate head positions. (a) Without 3D head plane estimation. (b)
With 3D head plane
-
Table 2Overall detection results with and without head plane
estimation.
GT Hits Misses FP
Without head plane estimation 24,605 19,277 5328 7053With head
plane estimation 24,605 17,950 6655 3328
Table 3Tracking results with and without head plane estimation
for the Mochit station dataset.Total number of trajectories is
74.
MT% PT% ML% Frag IDS FAT
Without head plane 67.6 28.4 4.0 43 20 41With head plane 70.3
25.7 4.0 46 17 27
974 I. Ali, M.N. Dailey / Image and Vision Computing 30 (2012)
966977crowd tracking because they integrate multiple factors into
onescalar-valued measure. Kasturi and colleagues [51] have proposed
aframework to evaluate face, text and vehicle detection and
trackingin video. Their method are not suitable for crowd tracking
becausethey integrate multiple factors into one scalar-valued
measure.
We specifically use the following evaluation criteria:
1. Ground truth (GT): number of ground truth trajectories.
2. Mostly tracked (MT): number of trajectories that are successfully
tracked for more than 80% of their length (tracked length divided
by the ground truth track length).
3. Partially tracked (PT): number of trajectories that are
successfully tracked in 20%–80% of the ground truth frames.
4. Mostly lost (ML): number of trajectories that are successfully
tracked for less than 20% of the ground truth frames.
5. Fragments (Frag): number of times that a ground truth trajectory
is interrupted in the tracking results.
6. ID switches (IDS): number of times the system-assigned ID changes
over all ground truth trajectories.
7. False trajectories (FAT): number of system trajectories that do
not correspond to ground truth trajectories.
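The MT/PT/ML split in criteria 2 to 4 depends only on the fraction of ground truth frames in which a trajectory is tracked. A minimal sketch, assuming each trajectory is summarized as a `(tracked_frames, gt_frames)` pair (a hypothetical representation) and that the boundary values of exactly 20% and 80% fall into PT:

```python
def classify_trajectory(tracked_frames, gt_frames):
    """Label one ground truth trajectory as 'MT', 'PT', or 'ML'."""
    # Integer comparisons avoid floating-point issues at the 20%/80% bounds.
    if 5 * tracked_frames > 4 * gt_frames:   # tracked for more than 80%
        return "MT"
    if 5 * tracked_frames < gt_frames:       # tracked for less than 20%
        return "ML"
    return "PT"


def summarize(trajectories):
    """Return the percentage of trajectories in each class."""
    counts = {"MT": 0, "PT": 0, "ML": 0}
    for tracked_frames, gt_frames in trajectories:
        counts[classify_trajectory(tracked_frames, gt_frames)] += 1
    total = len(trajectories)
    return {label: 100.0 * n / total for label, n in counts.items()}
```

Reporting the three classes as percentages of GT, as in the result tables, makes sequences with different numbers of trajectories comparable.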
4.5. Detection results
Fig. 7. Sample tracking results for the Mochit test video. Blue
rectangles indicate estimated head positions; red rectangles
indicate ground truth head positions.
We trained our head detection cascade using the OpenCV haartraining
utility. We set the number of training stages to 20, the minimum hit
rate per stage to 0.995, and the maximum false alarm rate per stage
to 0.5. The training process required about 16 h on a 2.8 GHz Intel
Pentium 4 with 4 GB of RAM.
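As a back-of-the-envelope check on these settings, the per-stage rates can be compounded over the 20 stages. A detection must pass every stage, so under the usual simplifying assumption that stages behave independently, the cascade's overall rates are products of the per-stage rates. The figures below are training targets implied by the settings, not measured detector performance, and the function name is illustrative.

```python
def cascade_targets(num_stages, min_hit_rate, max_false_alarm):
    """Overall hit-rate floor and false-alarm ceiling implied by stage targets."""
    overall_hit_rate = min_hit_rate ** num_stages
    overall_false_alarm = max_false_alarm ** num_stages
    return overall_hit_rate, overall_false_alarm


# Settings quoted in the text: 20 stages, 0.995 hit rate, 0.5 false alarm rate.
hit, false_alarm = cascade_targets(20, 0.995, 0.5)
# hit is roughly 0.905 and false_alarm is roughly 9.5e-7 per scanned window
```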
To test the system's raw head detection performance with and
without head plane estimation on a single image, we ran our head
detector on an arbitrary single frame extracted from the Mochit test
data sequence. The results are summarized in Table 1 and visualized
in Fig. 6.
There are a total of 34 visible ground truth (GT) heads in the
frame. Using the head plane to reject detections inconsistent with
the scene geometry reduces the number of false positives (FP) from
35 to 12 and only reduces the number of detections (hits) from 31
to 30. The results show that the head plane estimation method is
very useful for filtering false detections.
To test the system's head detection performance with and without
head plane estimation on the whole sequence, we ran our head
detector on each frame extracted from the Mochit test data sequence
and compared with the head location reported by our tracking
algorithm. The results are summarized in Table 2.
There are a total of 24,605 visible ground truth (GT) heads in the
sequence of 700 frames. Our algorithm reduces the number of false
positives (FP) from 7053 to 3328 and only reduces the number of
detections (hits) from 19,277 to 17,950.
4.6. Tracking results
In the Mochit station dataset [49], there are an average of 40.6
individuals per frame over the 700 hand-labeled ground truth frames,
for a total of 28,430 heads, and a total of 74 individual ground
truth trajectories. We used 20 particles per head. Tracking results
with and without head plane estimation are shown in Table 3. The
head plane estimation method improves accuracy slightly, but more
importantly, it reduces the false positive rate while preserving a
high rate of successful tracking. For a frame size of 640×480, the
processing time was approximately 1.4 s per frame, with or without
head plane estimation, on a 3.2 GHz Intel Core i5 with 4 GB RAM.
Fig. 7 shows tracking results for several frames of the Mochit test
video.
In the four selected sequences from the CAVIAR dataset for which
the ground truth pedestrian head information is available, there are
a total of 43 ground truth trajectories, with an average of 4.27
individuals per frame or 26,950 heads total over the 6315
hand-labeled ground truth frames. Since our detector cannot detect
heads smaller than 15×15 reliably, in the main evaluation, we
exclude ground truth heads smaller than 15×15. However, to enable
direct comparison to existing full-body pedestrian detection and
tracking methods, we also provide results including the untracked
small heads as errors. We again used 20 particles per head. Tracking
results with and without small heads and with and without head plane
estimation are shown in Table 4. For a frame size of 384×288, the
processing time was approximately 0.350 s per frame on the same
3.2 GHz Intel Core i5 with 4 GB RAM. Fig. 8 shows tracking results
for several frames of the CAVIAR test set.
Fig. 8. Sample tracking results on the CAVIAR dataset. Blue
rectangles indicate estimated head positions; red rectangles
indicate ground truth head positions.
Our head tracking algorithm is designed especially for high density
crowds. We do not expect it to work as well as full-body tracking
algorithms on sparse data where the full body is in most cases
visible. Although it is difficult to draw any strong conclusion from
the data in Table 4, since none of the reported work is using
precisely the same subset of the CAVIAR data, we tentatively
conclude that our method gives comparable performance to the state
of the art, even though it is using less information.
Although the researchers whose work is summarized in Table 4 have
not made their code public to enable direct comparison on the Mochit
test set, we would expect that any method relying on full body or
body part based tracking would perform much more poorly than our
method on that data set.
5. Conclusion
Tracking people in high density crowds such as the one shown in
Fig. 1 is a real challenge and is still an open problem. In this
paper, we introduce a fully automatic algorithm to detect and track
multiple humans in high-density crowds in the presence of extreme
occlusion. We integrate human detection and tracking into a single
framework and introduce a confirmation by classification method to
estimate confidence in a tracked trajectory, track humans through
occlusions, and eliminate false positive tracks. We find that
confirmation by classification dramatically reduces tracking errors
such as ID switches and fragments.
The main difficulty in using a generic object detector for human
tracking is that the detector's output is unreliable; all detectors
make errors. To further reduce false detections due to dense
features and shadows, we present an algorithm using an estimate of
the 3D head plane to reduce false positive head detections and
improve pedestrian tracking accuracy in crowds. The method is
straightforward, makes reasonable assumptions, and does not require
any knowledge of camera extrinsics. Based on the projective geometry
of the pinhole camera and an assumed approximate head size, we
compute 3D locations of candidate head detections. We then fit a
plane to the set of detections and reject detections inconsistent
with the estimated scene geometry. The algorithm learns the head
plane from observations of human heads incrementally, and only
begins to utilize the head plane once confidence in the parameter
estimates is sufficiently high.
We find that together, the confirmation-by-classification and head
plane estimation methods enable the construction of an excellent
pedestrian tracker for dense crowds. In future work, with further
algorithmic improvements and runtime optimization, we hope to
achieve robust, real time pedestrian tracking for even larger
crowds.
Table 4
Tracking results for CAVIAR dataset.
                                     GT   MT%   PT%   ML%   Frag  IDS  FAT
Zhao, Nevatia, and Wu [1]            227  62.1  –     5.3   89a   22a  27
Wu and Nevatia [2]                   189  74.1  –     4.2   40a   19a  4
Xing, Ai and Lao [50]b               140  84.3  12.1  3.6   24    14   –
Li, Huang and Nevatia [47]           143  84.6  14.0  1.4   17    11   –
Ali and Daileyc                      33   75.8  24.2  0.0   10    1    21
Ali and Dailey (without head plane)  33   75.8  24.2  0.0   14    2    34
Ali and Dailey (all heads)           43   16.3  58.1  25.6  11    1    21
a The Frag and IDS definitions are less strict than ours, giving
lower numbers of fragments and ID switches.
b Does not count people less than 24 pixels wide.
c Does not count heads less than 15 pixels wide.
Acknowledgments
This research was supported by graduate fellowships from the Higher
Education Commission of Pakistan (HEC) and the Asian Institute of
Technology (AIT) to Irshad Ali. We are grateful to Shashi Gharti for
help with ground truth labeling software. We thank Faisal Bukhari
and Waheed Iqbal for valuable discussions related to this work.
Appendix A. Supplementary data
Supplementary data to this article can be found online at
http://dx.doi.org/10.1016/j.imavis.2012.08.013.
References
[1] T. Zhao, R. Nevatia, B. Wu, Segmentation and tracking of
multiple humans in crowded environments, IEEE Trans. Pattern Anal.
Mach. Intell. (PAMI) 30 (7) (2008) 1198–1211.
[2] B. Wu, R. Nevatia, Detection and tracking of multiple,
partially occluded humans by Bayesian combination of edgelet based
part detectors, Int. J. Comput. Vision (IJCV) 75 (2) (2007) 247–266.
[3] B. Wu, R. Nevatia, Y. Li, Segmentation of multiple, partially
occluded objects by grouping, merging, assigning part detection
responses, in: IEEE Conference Computer Vision and Pattern
Recognition (CVPR), 2008.
[4] S.M. Khan, M. Shah, A multiview approach to tracking people in
crowded scenes using a planar homography constraint, in: European
Conference on Computer Vision (ECCV), 2006.
[5] J. Berclaz, F. Fleuret, P. Fua, Robust people tracking with
global trajectory optimization, in: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2006.
[6] M. Andriluka, S. Roth, B. Schiele, People-tracking-by-detection
and people-detection-by-tracking, in: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[7] D. Ramanan, D.A. Forsyth, A. Zisserman, Tracking people by
learning their appearance, IEEE Trans. Pattern Anal. Mach. Intell.
(PAMI) 29 (1) (2007) 65–81.
[8] T. Zhao, R. Nevatia, Tracking multiple humans in crowded
environment, in: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Vol. 2, 2004.
[9] P. Viola, M. Jones, Robust real time object detection, Int. J.
Comput. Vision (IJCV) 57 (2001) 137–154.
[10] M. Isard, A. Blake, A mixed-state condensation tracker with
automatic model-switching, in: IEEE International Conference on
Computer Vision (ICCV), 1998, pp. 107–112.
[11] A. Doucet, N. de Freitas, N. Gordon, Sequential Monte Carlo
Methods in Practice, Springer, New York, 2001.
[12] T. Zhao, R. Nevatia, Tracking multiple humans in complex
situations, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 26 (9)
(2004) 1208–1221.
[13] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian
detection: an evaluation of the state of the art, IEEE Trans.
Pattern Anal. Mach. Intell. (PAMI) 34 (2012) 743–761.
[14] P. Viola, M. Jones, Rapid object detection using a boosted
cascade of simple features, in: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2001, pp. 511–518.
[15] M. Isard, A. Blake, CONDENSATION - conditional density
propagation for visual tracking, Int. J. Comput. Vision (IJCV) 29
(1998) 5–28.
[16] M.D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier,
L.V. Gool, Online multi-person tracking-by-detection from a single,
uncalibrated camera, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI)
33 (9) (2011) 1820–1833.
[17] H.-G. Kang, D. Kim, Real-time multiple people tracking using
competitive condensation, Pattern Recognit. 38 (2005) 1045–1058.
[18] S.V. Martínez, J. Knebel, J. Thiran, Multi-object tracking
using the particle filter algorithm on the top-view plan, in:
European Signal Processing Conference (EUSIPCO), 2004.
[19] J. Vermaak, A. Doucet, P. Perez, Maintaining multi-modality
through mixture tracking, in: IEEE International Conference on
Computer Vision (ICCV), 2003.
[20] Z. Khan, T. Balch, F. Dellaert, MCMC-based particle filtering
for tracking a variable number of interacting targets, IEEE Trans.
Pattern Anal. Mach. Intell. (PAMI) 27 (2005) 1805–1918.
[21] K. Okuma, A. Taleghani, N.D. Freitas, J.J. Little, D.G. Lowe,
A boosted particle filter: multitarget detection and tracking, in:
European Conference on Computer Vision (ECCV), 2004.
[22] C. Rasmussen, G.D. Hager, Probabilistic data association
methods for tracking complex visual objects, IEEE Trans. Pattern
Anal. Mach. Intell. (PAMI) 23 (6) (2001) 560–576.
[23] D.B. Reid, An algorithm for tracking multiple targets, IEEE
Trans. Autom. Control. 24 (6) (1979) 843–854.
[24] H.W. Kuhn, The Hungarian method for the assignment problem,
Nav. Res. Logist. Q. 2 (1955) 83–87.
[25] C.-H. Kuo, C. Huang, R. Nevatia, Multi-target tracking by
on-line learned discriminative appearance models, in: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[26] M. Rodriguez, I. Laptev, J. Sivic, J.-Y. Audibert,
Density-aware person detection and tracking in crowds, in: IEEE
International Conference on Computer Vision (ICCV), 2011.
[27] D. Hoiem, A. Efros, M. Hebert, Putting objects into
perspective, in: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2006, pp. 2137–2144.
[28] D. Hoiem, A. Efros, M. Hebert, Putting objects into
perspective, Int. J. Comput. Vision (IJCV) 80 (1) (2008) 3–15.
[29] F. Fleuret, J. Berclaz, R. Lengagne, P. Fua, Multicamera
people tracking with a probabilistic occupancy map, IEEE Trans.
Pattern Anal. Mach. Intell. (PAMI) 30 (2008) 267–282.
[30] A. Mittal, L.S. Davis, M2tracker: a multi-view approach to
segmenting and tracking people in a cluttered scene, Int. J. Comput.
Vision (IJCV) 51 (2003) 189–203.
[31] R. Eshel, Y. Moses, Homography based multiple camera detection
and tracking of people in a dense crowd, in: IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2008.
[32] T. Zhao, M. Aggarwal, R. Kumar, H. Sawhney, Real-time wide
area multi-camera stereo tracking, in: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2005.
[33] M. Fengjun Lv, T. Zhao, R. Nevatia, Camera calibration from
video of a walking human, IEEE Trans. Pattern Anal. Mach. Intell.
(PAMI) 28 (9) (2006) 1513–1518.
[34] R. Rosales, S. Sclaroff, 3D trajectory recovery for tracking
multiple objects and trajectory guided recognition of actions, in:
IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
1999.
[35] B. Leibe, K. Schindler, L.V. Gool, Coupled detection and
trajectory estimation for multi-object tracking, in: IEEE
International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[36] W. Ge, R.T. Collins, R.B. Ruback, Vision-based analysis of
small groups in pedestrian crowds, IEEE Trans. Pattern Anal. Mach.
Intell. (PAMI) 34 (5) (2011) 1003–1016.
[37] I. Ali, M.N. Dailey, Multiple human tracking in high-density
crowds, in: Advanced Concepts for Intelligent Vision Systems
(ACIVS), Vol. LNCS 5807, 2009, pp. 540–549.
[38] I. Ali, M.N. Dailey, Head plane estimation improves the
accuracy of pedestrian tracking in dense crowds, in: International
Conference on Control, Automation, Robotics and Vision (ICARCV),
2010, pp. 2054–2059,
http://dx.doi.org/10.1109/ICARCV.2010.5707425.
[39] B. Leibe, A. Leonardis, B. Schiele, Robust object detection
with interleaved categorization and segmentation, Int. J. Comput.
Vision (IJCV) 77 (2008) 259–289.
[40] N. Dalal, B. Triggs, Histograms of oriented gradients for
human detection, in: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2005.
[41] M.A. Fischler, R.C. Bolles, Random sample consensus: a
paradigm for model fitting with applications to image analysis and
automated cartography, Commun. ACM 24 (6) (1981) 381–395.
[42] M. Lourakis, levmar: Levenberg–Marquardt nonlinear least
squares algorithms in C/C++, available at
http://www.ics.forth.gr/lourakis/levmar/ Jul. 2004.
[43] H. Raju, S. Prasad, Annotation guidelines for video analysis
and content extraction (VACE-II), available at
http://isl.ira.uka.de/clear07/downloads/ 2006.
[44] The CAVIAR data set, available at
http://homepages.inf.ed.ac.uk/rbf/CAVIAR/ 2011.
[45] PETS benchmark data, available at
http://www.cvg.rdg.ac.uk/PETS2009/a.html 2009.
[46] C.-H. Kuo, C. Huang, R. Nevatia, Multi-target tracking by
on-line learned discriminative appearance models, in: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2010,
pp. 685–692.
[47] Y. Li, C. Huang, R. Nevatia, Learning to associate: hybrid
boosted multi-target tracker for crowded scene, in: IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2009,
pp. 2953–2960.
[48] K. Bernardin, R. Stiefelhagen, Evaluating multiple object
tracking performance: the CLEAR MOT metrics, EURASIP J. Image Video
Process. 2008 (2008) 1–10.
[49] Mochit station dataset, available at
http://www.cs.ait.ac.th/vgl/irshad/ 2009.
[50] J. Xing, H. Ai, S. Lao, Multi-object tracking through
occlusions by local tracklets filtering and global tracklets
association with detection responses, in: IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2009,
pp. 1200–1207.
[51] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J.
Garofolo, R. Bowers, M. Boonstra, V. Korzhova, J. Zhang, Framework
for performance evaluation of face, text, and vehicle detection and
tracking in video: Data, metrics, and protocol, IEEE Trans. Pattern
Anal. Mach. Intell. (PAMI) 31 (2) (2009) 319–336.