SEGMENTATION AND TRACKING OF MULTIPLE
HUMANS IN A SURVEILLANCE VIDEO STREAM
Abstract:
This paper proposes a method for rapid and robust human detection and tracking based on the omega-shape (Ω, also written "ohm-shape") features of people's head-shoulder region. The method consists of five phases: (1) frame separation and extraction of the foreground from the background using a background appearance model; (2) a human shape model to identify the head of each person; (3) identification of heads in occluded sequences using the Canny edge algorithm, followed by detection of the omega-shaped head-shoulder model to confirm that the object is a human; (4) tracking using a color histogram together with the Mean Shift technique; and (5) Markov chain dynamics (add, remove, establish, break, exchange, update).
Introduction:
Human detection and tracking are used in many applications, including people counting and security surveillance in public scenes. However, full-body human detection often suffers from occlusion among individuals and from scenes in which people are not necessarily standing. Hence, much recent research focuses instead on the upper part of the human body.
The ultimate goal of video understanding is to decompose a video into a structured description of the scene, to describe the objects and their time-varying properties, and to extract semantic meaning from them. Humans are of special significance because they are the main class of actors in daily life, so the ability to detect human objects and track their motion in video sequences is highly useful. It enables many applications that provide information, convenience, and security. Detection and tracking alone can be used for intrusion detection, human counting, and estimation of crowd flow. Further applications include automatic video surveillance, which helps fight crime and provides a higher level of security, advanced human-computer interaction, behavioral assistance, and content-based video indexing.
Goals:
To segment (detect) multiple, possibly inter-occluded human objects in the image, answering how many people are in a frame and where they are.
To reliably track the global motion (i.e., position) of multiple, possibly inter-occluded human objects in the scene and provide consistent trajectories.
To estimate the motion modes (e.g., walking, running, standing) and phases (i.e., the human postures).
Main Features Of This Work:
A three-dimensional, part-based human body model enables the segmentation and tracking of humans in 3D and makes the inference of inter-object occlusion natural. A Bayesian framework integrates segmentation and tracking based on a joint likelihood for the appearance of multiple objects. The method is feasible for a crowded scene with occluded sequences, and it does not require humans to be isolated when they first enter the scene. Such scenes call for more complex shape models and for joint reasoning about the collection of objects.
Image segmentation is the process of partitioning a digital image into multiple regions according to one or more criteria. It is an initial and vital step in pattern recognition, a series of processes aimed at overall image understanding. Segmentation techniques include clustering, histogram-based methods, edge detection, thresholding, motion blob detection, region growing, the watershed method, and model-based segmentation.
Drawbacks Of Previous Segmentation Techniques:
Segmentation techniques such as thresholding, region growing, and watershed are poorly suited to separating individual humans. In motion blob detection, blobs are detected by comparison with a learned background, but segmenting humans from such blobs is not straightforward: one blob may include multiple objects, while one object may split into multiple blobs. Such approaches are likely to fail when occlusion is persistent. Some approaches have been developed to handle occlusion, but they require the objects to be initialized before occlusion happens, which is usually infeasible for a crowded scene.
Image (color/texture) segmentation groups image pixels into regions of similar color or texture. It is unlikely to segment individual humans, because human clothing may not have uniform color or texture and adjacent people may wear similar clothes. Motion segmentation groups image pixels of consistent motion; it may not give satisfactory results because people in a group move similarly. Face detection by statistical methods fails because of the low resolution of surveillance video.
Model-based segmentation is used when the region of interest has a repetitive geometric form, and it requires a probabilistic model with a prior. A multi-ellipsoid model is used as the prior because humans are assumed to be either walking or standing. An ellipsoid fits a human body part well, and its projection, an ellipse, is a convenient form for representing humans.
Proposed Method:
To find the best solution in the complex solution space, we use the Markov chain Monte Carlo (MCMC) approach. Tracking involves detecting human movement in the video and then calculating the human motion trajectory. The various tracking methods are
Detection-Based Tracking
Matching-Based Tracking
Blob Tracking
The basic assumptions are that the camera is stationary, that people walk on a known ground plane, and that there are no significant false alarms due to shadows, reflections, or other causes.
Main Features Of Proposed Plan:
A 3D part-based human body model enables segmentation and tracking of humans in 3D and makes the inference of inter-object occlusion natural. A Bayesian framework integrates segmentation and tracking based on a joint likelihood for the appearance of multiple objects, together with the design of efficient Markov chain dynamics. The foreground blobs are extracted using the background model. Using the camera model and the assumption that objects move on a known ground plane, multiple 3D human hypotheses are projected onto the image plane and matched with the foreground blobs. In each frame, the foreground blobs are segmented into multiple humans, and the segmented humans are associated with existing trajectories.
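The projection of a 3D human hypothesis onto the image plane can be illustrated with a pinhole camera. This is a minimal sketch, not the paper's calibration: the 3x4 camera matrix `P`, the helper names `project_point` and `project_human`, and the numeric camera parameters are all assumptions chosen for illustration. It only projects the foot and head points of a person standing on the ground plane z = 0.

```python
import numpy as np

def project_point(P, X):
    """Project a 3D world point X = (x, y, z) with a 3x4 camera matrix P."""
    x_h = P @ np.append(np.asarray(X, dtype=float), 1.0)
    return x_h[:2] / x_h[2]  # homogeneous -> pixel coordinates

def project_human(P, ground_xy, height):
    """Image positions of the feet and head of a person standing on z = 0."""
    gx, gy = ground_xy
    foot = project_point(P, (gx, gy, 0.0))
    head = project_point(P, (gx, gy, height))
    return foot, head

# Hypothetical camera: focal length 500 px, principal point (320, 240),
# mounted 3 m above the ground and looking along the world y axis.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
Rt = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, -1.0, 3.0], [0.0, 1.0, 0.0, 0.0]])
P = K @ Rt

# A 1.7 m tall person standing 10 m in front of the camera.
foot, head = project_human(P, ground_xy=(0.0, 10.0), height=1.7)
```

The distance between `foot` and `head` in the image gives the expected pixel height of a person at that ground location, which is what allows a hypothesis to be compared with a foreground blob.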
Architecture-Diagram:
Segmentation and tracking are integrated in a unified framework and interoperate over time.
[Figure: architecture diagram. Block labels: Video Input; Background Model; Human Shape Model; Camera Model and Ground Plane; Model Based Segmentation; Model Based; Number Of Humans And Their Positions; Global Motion Trajectories.]
Prior Models:
Background model: Based on a background model, the foreground blobs are extracted as the basic observation.
3D human shape model: Since the hypotheses are in 3D, occlusion reasoning is straightforward.
Camera model and ground plane: Multiple 3D human hypotheses are projected onto the image plane and matched with the foreground blobs.
Phases Involved:
Frame separation and extraction of the foreground from the background using the background appearance model.
Human shape model to identify the head of each person.
Identification of heads in occluded sequences using the Canny edge algorithm, followed by detection of the omega-shaped head-shoulder model to confirm that the object is a human.
Tracking using a color histogram together with the Mean Shift technique.
Markov chain dynamics: add, remove, establish, break, exchange, update.
Frame Separation and Foreground Extraction (Phase I):
In this phase, frames are extracted from the input video sequence at 25 frames per second (fps = 25). Based on probabilistic modeling, a subset of these 25 frames is selected for efficient computation.
$$\theta^{(t)*} = \arg\max_{\theta^{(t)} \in \Theta} P\left(\theta^{(t)} \mid I^{(1,\ldots,t)}\right)$$
where,
$\Theta$: the solution space (represents the chosen frame).
$\theta^{(t)}$: the state of the objects (represents the objects in the chosen frame).
$I^{(1,\ldots,t)}$: the image observation (represents the set of extracted frames).
Foreground extraction is performed using the background appearance model. The probability of pixel $j$ being from the background is calculated from a Gaussian distribution,
$$P_b(I_j) = P_b(r_j, g_j, b_j) = \max\left( G(r_j, g_j, b_j;\ \bar{r}_j, \bar{g}_j, \bar{b}_j, \sigma),\ \varepsilon \right)$$
where $G$ denotes the Gaussian term,
$r_j, g_j, b_j$ are the $j$th pixel values in the current frame,
$\bar{r}_j, \bar{g}_j, \bar{b}_j$ are the $j$th pixel values in the background image,
$\varepsilon$ is a small constant (= 0.5), and $\sigma$ is the variance.
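The background test described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the paper's exact formula: the Gaussian term is taken as an unnormalised exponential of the squared RGB difference, a single shared `sigma` is used for all pixels and channels, and a pixel is called foreground when its score has dropped to the floor `eps`. The function names and the value of `sigma` are hypothetical.

```python
import numpy as np

def background_probability(frame, background, sigma=10.0, eps=0.5):
    """Per-pixel background score: an unnormalised Gaussian of the RGB
    difference, floored at eps (the max(., eps) step in the text)."""
    diff = frame.astype(np.float64) - background.astype(np.float64)
    sq_dist = np.sum(diff ** 2, axis=-1)  # sum over the r, g, b channels
    gauss = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return np.maximum(gauss, eps)

def foreground_mask(frame, background, sigma=10.0, eps=0.5):
    """Assumed decision rule: foreground where the score sits at the floor."""
    return background_probability(frame, background, sigma, eps) <= eps

# Toy frame: a uniform background with a bright 2x2 "object".
background = np.full((4, 4, 3), 100, dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 200
mask = foreground_mask(frame, background)
```

Connected regions of `mask` would then play the role of the foreground blobs used in the later phases.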
Experimental Results For Separation And Foreground Extraction:
3D Human Shape Model (Phase II):
The parameters of an individual human, $m_i$, are defined on a 3D human shape model that captures the basic shape and articulation parameters of the human body. It is a multi-ellipsoid model. The parameters that describe a 3D human hypothesis are
$$m_i = \{h_i, f_i, x_i, y_i, o_i, i_i\}$$
size ($h_i$): 3D height of the model; it also controls the overall scaling of the object in the three (X, Y, and Z) directions.
thickness ($f_i$): captures extra scaling in the horizontal directions.
position ($u_i$ or $(x_i, y_i)$): image position of the head.
orientation ($o_i$): 3D orientation of the body. Orientations are quantized into a few levels (0° and 90°) for computational efficiency.
inclination ($i_i$): 2D inclination of the body, since the body may be slightly inclined; the inclination angle can be positive, negative, or zero.
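The parameter set can be represented as a small record type. The sketch below is only one possible encoding: the class name `HumanHypothesis`, the field names, the validation rules, and the helper `quantize_orientation` (which assumes that orientations 180° apart look the same) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

ORIENTATION_LEVELS = (0, 90)  # degrees; the two quantized levels named in the text

@dataclass(frozen=True)
class HumanHypothesis:
    height: float       # h_i: 3D height, also the overall scale
    thickness: float    # f_i: extra scaling in the horizontal directions
    x: float            # x_i: image x position of the head
    y: float            # y_i: image y position of the head
    orientation: int    # o_i: quantized 3D orientation in degrees
    inclination: float  # i_i: 2D inclination in degrees (may be negative)

    def __post_init__(self):
        if self.orientation not in ORIENTATION_LEVELS:
            raise ValueError("orientation must be one of the quantized levels")
        if self.height <= 0 or self.thickness <= 0:
            raise ValueError("height and thickness must be positive")

def quantize_orientation(angle_deg):
    """Snap an arbitrary angle to the nearest allowed level (0 or 90)."""
    folded = abs(angle_deg) % 180        # assumption: 180-degree symmetry
    folded = min(folded, 180 - folded)   # now in [0, 90]
    return 0 if folded < 45 else 90

person = HumanHypothesis(1.7, 1.0, 120.0, 40.0, quantize_orientation(100), -5.0)
```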
Detection of Omega Shape using Canny Edge algorithm (Phase III):
Peak points are identified in the foreground frame. To decide whether a peak belongs to a human, the method checks that there is a sufficient number of foreground pixels below the peak position. The omega shape is then obtained by taking one half of the head ellipse and the upper quarter of the torso, which corresponds to the shoulders.
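The peak test can be sketched on a binary foreground mask. This is a simplified illustration, assuming that a "peak" is a local maximum of the upper boundary of the foreground and that "pixels below" are counted in the same column only. The function name `head_candidates` and the threshold `min_pixels_below` are hypothetical, and no edge detection or omega-shape fitting is attempted here.

```python
import numpy as np

def head_candidates(mask, min_pixels_below=20):
    """Return (row, col) of local peaks of the foreground's upper boundary
    that have at least min_pixels_below foreground pixels in their column."""
    mask = np.asarray(mask, dtype=bool)
    rows, cols = mask.shape
    has_fg = mask.any(axis=0)
    # Row of the topmost foreground pixel per column; `rows` if the column is empty.
    top = np.where(has_fg, mask.argmax(axis=0), rows)
    peaks = []
    for c in range(cols):
        if not has_fg[c]:
            continue
        left = top[c - 1] if c > 0 else rows
        right = top[c + 1] if c < cols - 1 else rows
        # Smaller row index means higher in the image; a flat top yields one
        # candidate at its left edge.
        if top[c] < left and top[c] <= right:
            below = int(mask[top[c]:, c].sum())
            if below >= min_pixels_below:
                peaks.append((int(top[c]), c))
    return peaks

# Toy mask: a body with a one-pixel-wide head, plus a small noise blob.
toy = np.zeros((12, 9), dtype=bool)
toy[4:12, 1:4] = True   # torso
toy[2:4, 2] = True      # head
toy[10:12, 6] = True    # noise
candidates = head_candidates(toy, min_pixels_below=5)
```

The noise blob also forms a peak, but it is rejected because too few foreground pixels lie beneath it.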
Blob Tracking (Phase IV):
Each human object is assigned a unique ID so that it can be tracked. In each frame, the blobs $B^{(t)} = \{B_1^{(t)}, \ldots, B_n^{(t)}\}$ are matched with the blobs in the previous frame, $B^{(t-1)} = \{B_1^{(t-1)}, \ldots, B_m^{(t-1)}\}$. Two blobs are declared a perfect match if their centroids are close and their size difference is sufficiently small. For each blob $B_i^{(t)}$, a perfect match is sought in $B^{(t-1)}$. If a match is found, the unique ID of the human object in the previous frame is copied to the current frame; otherwise, a new ID is given to the human object.
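The ID propagation rule can be sketched as a greedy nearest-centroid match. The thresholds `max_dist` and `max_area_ratio`, the `Blob` record, and the greedy one-to-one strategy are assumptions made for this example; the text only states that centroids must be close and the size difference small.

```python
import math
from dataclasses import dataclass
from itertools import count
from typing import Optional

@dataclass
class Blob:
    cx: float                 # centroid x
    cy: float                 # centroid y
    area: float               # size in pixels
    id: Optional[int] = None  # unique ID, filled in by assign_ids

def assign_ids(current, previous, next_id, max_dist=20.0, max_area_ratio=0.3):
    """Copy IDs from matching previous blobs; give unmatched blobs new IDs."""
    unused = list(previous)
    for blob in current:
        best, best_dist = None, max_dist
        for prev in unused:
            dist = math.hypot(blob.cx - prev.cx, blob.cy - prev.cy)
            size_diff = abs(blob.area - prev.area) / max(blob.area, prev.area)
            if dist <= best_dist and size_diff <= max_area_ratio:
                best, best_dist = prev, dist
        if best is not None:
            blob.id = best.id        # same person as in the previous frame
            unused.remove(best)      # each previous blob is matched at most once
        else:
            blob.id = next(next_id)  # a new person entered the scene
    return current

previous = [Blob(10, 10, 100, id=1), Blob(50, 50, 120, id=2)]
current = [Blob(52, 49, 115), Blob(12, 11, 105), Blob(200, 200, 90)]
assign_ids(current, previous, next_id=count(3))
```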
Overview of Mean Shift algorithm:
The Mean Shift technique is used to find the best match. The algorithm is implemented by the following steps:
1. Calculate an initial histogram that identifies the object being tracked.
2. Apply the initial histogram to every new frame from the input stream using a technique called back projection. This yields a single-channel (grayscale) image in which each pixel contains the value of the initial-histogram bin for the color of the corresponding pixel in the new frame.
3. Search the back projection for the region with the highest intensity, which corresponds to the area where the tracked object most probably resides.
[Figure: (a) input image; (b) back-projection image.]
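The search in step 3 can be sketched as a basic mean shift iteration on a back-projection image. This is a minimal version under simplifying assumptions: a fixed-size rectangular window, a flat kernel, integer pixel shifts, and a window that fits inside the image. The function name `mean_shift` and its parameters are hypothetical.

```python
import numpy as np

def mean_shift(prob, window, max_iter=20, tol=1):
    """Move window (x, y, w, h) towards the densest region of prob."""
    x, y, w, h = window
    rows, cols = prob.shape
    for _ in range(max_iter):
        patch = prob[y:y + h, x:x + w]
        total = patch.sum()
        if total == 0:
            break  # nothing to follow inside the window
        ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
        # Centroid of the weights inside the window, in window coordinates.
        cx = (xs * patch).sum() / total
        cy = (ys * patch).sum() / total
        # Shift so that the window centre moves onto the centroid.
        dx = int(round(cx - (patch.shape[1] - 1) / 2.0))
        dy = int(round(cy - (patch.shape[0] - 1) / 2.0))
        new_x = min(max(x + dx, 0), cols - w)
        new_y = min(max(y + dy, 0), rows - h)
        moved = abs(new_x - x) + abs(new_y - y)
        x, y = new_x, new_y
        if moved < tol:
            break  # converged
    return x, y, w, h

# Toy back projection: the tracked object occupies a 10x10 square.
prob = np.zeros((40, 40))
prob[20:30, 25:35] = 1.0
x, y, w, h = mean_shift(prob, window=(18, 14, 10, 10))
```

Starting from a window that only partly overlaps the object, the window drifts until it is centred on the object to within the integer rounding of the shifts.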
Calculation Of Initial Histogram:
The initial histogram is calculated for the object defined within the object shape. A bounding rectangle is drawn on the target object. A single red, green, blue (RGB) histogram with 512 bins is constructed using all the pixels within the three elliptic regions of the object model. It helps establish correspondence in tracking because it is insensitive to the non-rigidity of human motion. The back-projection image used by the Mean Shift technique is generated from this initial histogram.
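A 512-bin RGB histogram and its back projection can be sketched as follows. The split into 8 levels per channel (8 x 8 x 8 = 512) is an assumption, since the text gives only the total bin count, and a boolean mask stands in for the three elliptic regions of the object model. The function names are hypothetical.

```python
import numpy as np

BINS_PER_CHANNEL = 8  # assumed split: 8 * 8 * 8 = 512 bins

def bin_index(pixels):
    """Map uint8 RGB pixels of shape (..., 3) to one bin index in [0, 512)."""
    q = pixels.astype(np.int64) // (256 // BINS_PER_CHANNEL)
    return (q[..., 0] * BINS_PER_CHANNEL + q[..., 1]) * BINS_PER_CHANNEL + q[..., 2]

def rgb_histogram(image, mask):
    """Normalised 512-bin histogram of the pixels where mask is True."""
    idx = bin_index(image)[mask]
    hist = np.bincount(idx, minlength=BINS_PER_CHANNEL ** 3).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def back_project(image, hist):
    """Single-channel image: each pixel gets the histogram value of its colour bin."""
    return hist[bin_index(image)]

# Toy image: a red 2x2 object on a black background.
image = np.zeros((6, 6, 3), dtype=np.uint8)
image[2:4, 2:4] = (255, 0, 0)
object_mask = np.zeros((6, 6), dtype=bool)
object_mask[2:4, 2:4] = True

hist = rgb_histogram(image, object_mask)
bp = back_project(image, hist)
```

In `bp`, pixels whose colour is common inside the object are bright and all others are dark, which is the input the Mean Shift search expects.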
Generation Of Trajectory:
The trajectory is the path of a moving human blob across the frames. The centroid in the previous frame is already known, and the new centroid in the current frame is calculated by the Mean Shift technique. Joining these centroids frame by frame forms the trajectory, which traces the path taken by an individual human object.
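Trajectory bookkeeping amounts to keeping one list of centroids per object ID. The sketch below assumes a plain dictionary keyed by ID; the function name `extend_trajectory` is hypothetical.

```python
from collections import defaultdict

def extend_trajectory(trajectories, object_id, centroid):
    """Append the new centroid and return the newest segment, if there is one."""
    path = trajectories[object_id]
    path.append(centroid)
    # A segment joins the previous centroid to the current one.
    return (path[-2], path[-1]) if len(path) >= 2 else None

trajectories = defaultdict(list)  # object ID -> list of (x, y) centroids
first = extend_trajectory(trajectories, 1, (10, 10))
segment = extend_trajectory(trajectories, 1, (12, 11))
```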
Block Diagram of the MCMC Tracking Algorithm (Phase V):
[Figure: block diagram of the MCMC tracking algorithm. Block labels: Addition of object; Removal of object; Object Split; Object Merge; Exchange Identity; Compute Acceptance Ratio; Probabilistic Acceptance; Accept (Yes/No).]
Computing MAP by efficient MCMC:
The maximum a posteriori (MAP) estimate is computed with the MCMC method as follows. A Markov chain with stationary distribution $P(\theta)$ is designed. At the $g$th iteration, a candidate state $\theta'$ is sampled from a proposal distribution $q(\theta_g \mid \theta_{g-1})$. If the candidate state $\theta'$ is accepted, $\theta_g = \theta'$; otherwise, $\theta_g = \theta_{g-1}$. A Markov chain constructed in this way has its stationary distribution equal to $P(\theta)$, independent of the choice of the proposal probability $q(\cdot)$ and the initial state $\theta_0$. However, the choice of $q(\cdot)$ can affect the efficiency of MCMC significantly. More informed proposal probabilities, as in data-driven MCMC, make the Markov chain traverse the solution space more efficiently. Therefore, the proposal distribution is written as $q(\theta_g \mid \theta_{g-1}, I)$.
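The accept/reject loop can be sketched with the standard Metropolis-Hastings rule. The text does not spell out the acceptance ratio, so the rule used here is an assumption, as are the function name `mcmc_map`, the `propose` interface, and the toy posterior over the number of people in a frame.

```python
import math
import random

def mcmc_map(log_posterior, propose, initial, iterations=2000, seed=0):
    """Metropolis-Hastings search that remembers the best state visited.

    propose(state, rng) returns (candidate, log_q_forward, log_q_backward).
    """
    rng = random.Random(seed)
    state, state_lp = initial, log_posterior(initial)
    best, best_lp = state, state_lp
    for _ in range(iterations):
        cand, log_q_fwd, log_q_bwd = propose(state, rng)
        cand_lp = log_posterior(cand)
        # log of P(cand) q(state | cand) / (P(state) q(cand | state))
        log_ratio = (cand_lp + log_q_bwd) - (state_lp + log_q_fwd)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state, state_lp = cand, cand_lp   # candidate accepted
            if state_lp > best_lp:
                best, best_lp = state, state_lp
        # otherwise the chain stays at the previous state
    return best

# Toy problem: the state is the number of people n, with a posterior peaked at 3.
def log_posterior(n):
    return -float((n - 3) ** 2) if 0 <= n <= 10 else -math.inf

def propose(n, rng):
    return n + rng.choice((-1, 1)), 0.0, 0.0  # symmetric random walk

best_n = mcmc_map(log_posterior, propose, initial=9)
```

A data-driven proposal would replace the blind random walk in `propose` with moves suggested by the image, and would then return the matching forward and backward proposal probabilities.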
Markov Chain Dynamics:
The Data-Driven Markov Chain Monte Carlo (DDMCMC) algorithm uses the following Markov chain dynamics:
Object Addition: Whenever a new human enters the frame, a new human object is added, with proposal $q_{add}(\theta_{g-1} \cup \{k_{n+1}, m_{n+1}\} \mid \theta_{g-1}, I)$.
Object Removal: Whenever a human moves out of the frame, that human object is removed, with proposal $q_{remove}(\theta_{g-1} \setminus \{k_r, m_r\} \mid \theta_{g-1})$.
$k_i$: the unique identity of the $i$th human object.
$m_i$: the parameters of the $i$th human object.
$\theta_{g-1}$: the current solution, i.e. the state of the chain before the move.
Object Split: When a blob contains more than one object, it is split into two. The blob is first divided into six equal parts, and a histogram is constructed for each part. The split angle is determined from the part whose histogram has the highest intensity, and the blob is split in two along this angle.
Object Merge: When a single object is spread over more than one blob, the blobs are merged into a single blob. Blobs are merged when the area of a human blob is smaller than the required value.
Exchange Identity: Under full occlusion, IDs may be assigned to the wrong objects, so those IDs have to be exchanged.
Parameter Update: The continuous parameters of a human object are updated.
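Three of the dynamics above (addition, removal, and identity exchange) can be sketched as pure functions on a state. The representation of the state as a dictionary from ID to parameters, and the function names, are assumptions made for illustration. The proposal probabilities are not modelled here.

```python
def add_object(state, new_id, params):
    """Object addition: the state plus {k_new, m_new}; the input is not modified."""
    if new_id in state:
        raise KeyError(f"ID {new_id} is already in use")
    return {**state, new_id: params}

def remove_object(state, object_id):
    """Object removal: the state without {k_r, m_r}."""
    return {k: m for k, m in state.items() if k != object_id}

def exchange_identity(state, id_a, id_b):
    """Swap the parameters attached to two IDs, e.g. after a full occlusion."""
    swapped = dict(state)
    swapped[id_a], swapped[id_b] = state[id_b], state[id_a]
    return swapped

state = {1: "m1", 2: "m2"}
added = add_object(state, 3, "m3")
removed = remove_object(state, 1)
exchanged = exchange_identity(state, 1, 2)
```

Returning a new dictionary from each move keeps the previous state intact, which is convenient when a proposed candidate is rejected.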
Example of alert system:
[Figure: Input and Output Video Sequences of a Security]
Criteria                                                                            Number   Percentage
Total number of people in the input video                                           3        -
Total number of heads detected                                                      3        100%
Total number of persons segmented according to the Human Shape Model                3        100%
Total number of heads detected according to the Omega-Shape Head-Shoulder Model     3        100%
Thus, the probability factor (the constant in the Gaussian distribution formula) is set to 0.5 when few humans are present and to 0.9 when many are present. An inner-box histogram is used instead of an outer-box histogram because it is more accurate and therefore gives better tracking results. A new parameter, the area of the human body measured in pixels, has been added; it helps in counting the humans during occlusion.
[Figure: Evaluation of Tracking Output]
Social Impact and Applicability:
When the system is installed in supermarkets and shopping malls, human movement can be detected automatically. If theft is detected, an automatic theft alarm can be triggered and products can be secured. In highly sensitive areas of Indian defense or at airports, arms and ammunition taken out illegally can be detected and an automated alarm can be raised.
Future Enhancements:
The current system may be enhanced in the following ways:
Extension to track multiple classes of objects (e.g., humans and cars) by adding model switching to the MCMC dynamics.
Complete elimination of the ambiguities that inevitably exist, especially when tracking fully occluded objects.
Improved tracking accuracy using multiple cameras, together with grid computing technology to increase the processing speed so that all 25 frames per second can be processed.
Conclusion:
A principled approach has been developed to simultaneously detect and track humans in a crowded scene acquired from a single stationary camera. Experiments and evaluations on challenging real-life data show promising results. The success of the approach lies mainly in the integration of the top-down Bayesian formulation, which follows the image formation process, with the bottom-up features that are extracted directly from images. This integration combines the computational efficiency of image features with the optimality of a Bayesian formulation.
REFERENCES:
Tao Zhao, Ram Nevatia, and Bo Wu, “Segmentation and Tracking of Multiple Humans in Crowded Environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 7, July 2008.
Rafael C. Gonzalez and Richard E. Woods, “Digital Image Processing,” Second Edition, Prentice Hall.
Elad Ben-Israel, “Tracking of Humans Using Masked Histograms and Mean Shift,” Efi Arazi School of Computer Science, The Interdisciplinary Center Herzliya, March 2007.
http://www.ph.tn.tudelft.nl/Courses/FIP/noframes/fip-Spectral.html
http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2007.70770
http://groups.inf.ed.ac.uk/vision/CAVIAR/CAVIARDATA1/