Segmentation and Tracking of Multiple Humans in a Surveillance Video Stream


Abstract:

This paper proposes a novel method for rapid and robust human detection and tracking based on the omega-shape (Ω) features of people's head-shoulder regions. The method consists of five phases: (1) frame separation and extraction of the foreground from the background using a background appearance model; (2) a human shape model to identify the head of each person; (3) identification of heads in occluded sequences using the Canny edge algorithm, followed by detection of the omega-shaped head-shoulder contour to confirm that the object is a human; (4) tracking using a color histogram together with the Mean Shift technique; and (5) Markov chain dynamics (add, remove, establish, break, exchange, update).

Introduction:

Human detection and tracking are used in many applications, including people counting and security surveillance in public scenes. However, full-body human detection often suffers from occlusion among individuals and from scenes in which people are not necessarily standing. Hence, much recent research focuses instead on the upper part of the human body.

The ultimate goal of video understanding is to decompose video into a structured description of the scene, to describe the objects and their time-varying properties, and to extract semantic meaning from them. Humans are of special significance since they are the main class of actors in daily life. Being able to detect human objects and track their motion in video sequences enables many applications that provide information, convenience, and security. Detection and tracking by themselves can be used for intrusion detection, human counting, and estimation of crowd flow. Further applications include automatic video surveillance (which helps fight crime and provides a higher level of security), advanced human-computer interaction, behavioral assistance, and content-based video indexing.

Goals:

To segment (detect) multiple, possibly inter-occluded human objects in the image, answering how many people are in a frame and where they are.

To reliably track the global motion (i.e., position) of multiple, possibly inter-occluded human objects in the scene and provide consistent trajectories.

To estimate the motion modes (e.g., walking, running, standing) and phases (i.e., the human postures).

Main Features Of This Work:

A three-dimensional, part-based human body model enables the segmentation and tracking of humans in 3D and makes the inference of inter-object occlusion natural. A Bayesian framework integrates segmentation and tracking based on a joint likelihood for the appearance of multiple objects. The method is feasible for crowded scenes with occluded sequences and does not require humans to be isolated when they first enter the scene. Such scenes call for more complex shape models and for joint reasoning about the collection of objects.

Image segmentation is the process of partitioning a digital image into multiple regions according to one or more criteria. It is an initial and vital step in pattern recognition, a series of processes aimed at overall image understanding. Segmentation techniques include clustering, histogram-based methods, edge detection, thresholding, motion blob detection, region growing, the watershed method, and model-based segmentation.

Drawbacks Of Previous Segmentation Techniques:

Generic segmentation techniques such as thresholding, region growing, and watershed are poorly suited to separating individual humans in a crowd. In motion blob detection, blobs are detected by comparison with a learned background. Segmenting humans from such blobs is not straightforward: one blob may include multiple objects, while one object may split into multiple blobs. Such approaches are likely to fail when occlusion is persistent. Some approaches have been developed to handle occlusion, but they require the objects to be initialized before the occlusion happens, which is usually infeasible in a crowded scene.

Image (color/texture) segmentation groups pixels into regions of similar color or texture. It is unlikely to segment individual humans, because human clothing may not have uniform color or texture and adjacent people may wear similar clothes. Motion segmentation groups pixels with consistent motion; it may not give satisfactory results because people in a group move similarly. Face detection with statistical methods fails because of the low resolution of surveillance video.

Model-based segmentation is used when the region of interest has a repetitive geometric form, and it requires a probabilistic model with a prior. A multi-ellipsoid model is used as the prior here, since the humans are either walking or standing. An ellipsoid fits a human body part well, and its projection, an ellipse, is a convenient form for representing humans in the image.

Proposed Method:

To find the best solution in the complex solution space, we use the Markov chain Monte Carlo (MCMC) approach. Tracking involves detecting human movement in the video and then calculating the human motion trajectory. The main tracking methods are:

Detection-based tracking

Matching-based tracking

Blob tracking

The basic assumptions are that the camera is stationary, that people walk on a known ground plane, and that there are no significant false alarms due to shadows, reflections, or other causes.

Main Features Of Proposed Plan:

A 3D part-based human body model enables segmentation and tracking of humans in 3D and makes the inference of inter-object occlusion natural. A Bayesian framework integrates segmentation and tracking based on a joint likelihood for the appearance of multiple objects, together with efficient Markov chain dynamics. The foreground blobs are extracted using the background model. Using the camera model and the assumption that objects move on a known ground plane, multiple 3D human hypotheses are projected onto the image plane and matched with the foreground blobs. In each frame, the foreground blobs are segmented into multiple humans, and the segmented humans are associated with existing trajectories.

Architecture-Diagram:

Segmentation and tracking are integrated in a unified framework and interoperate over time.

Prior Models:

Background model: Based on a background model, the foreground blobs are extracted as the basic observation.

3D human shape model: Since the hypotheses are in 3D, occlusion reasoning is straightforward.

Camera model and ground plane: Multiple 3D human hypotheses are projected onto the image plane and matched with the foreground blobs.

Phases Involved:

1. Frame separation and extraction of the foreground from the background using the background appearance model.

2. Human shape model to identify the head of each person.

3. Identification of heads in occluded sequences using the Canny edge algorithm, followed by detection of the omega-shaped (Ω) head-shoulder contour to confirm that the object is a human.

4. Tracking using a color histogram together with the Mean Shift technique.

5. Markov chain dynamics (add, remove, establish, break, exchange, update).

[Figure: architecture diagram. Labels: Video Input; Background Model; Human Shape Model; Camera Model and Ground Plane; Model Based Segmentation; Number of Humans and Their Positions; Global Motion Trajectories.]

Frame Separation and Foreground Extraction (Phase I):

In this phase, frames are extracted from the input video sequence at 25 frames per second (fps = 25). Based on probabilistic modeling, a subset of these 25 frames is selected for efficient computation:

\[ \theta^{(t)*} = \arg\max_{\theta^{(t)} \in \Theta} P\left(\theta^{(t)} \mid I^{(1,\ldots,t)}\right) \]

where

$\Theta$: the solution space (represents the chosen frame).

$\theta^{(t)}$: the state of the objects (represents the objects in the chosen frame).

$I^{(1,\ldots,t)}$: the image observation (represents the set of extracted frames).

Foreground extraction is performed using the background appearance model. The probability of pixel j being from the background is calculated from a Gaussian distribution,

\[ P_b(I_j) = P_b(r_j, g_j, b_j) = \max\left\{ \frac{1}{(2\pi)^{3/2}\sigma^{3}} \exp\left( -\frac{(r_j-\bar{r}_j)^2 + (g_j-\bar{g}_j)^2 + (b_j-\bar{b}_j)^2}{2\sigma^{2}} \right),\ \epsilon \right\} \]

where $r_j, g_j, b_j$ are the jth pixel values in the current frame, $\bar{r}_j, \bar{g}_j, \bar{b}_j$ are the jth pixel values in the background image, $\epsilon$ is a small constant (= 0.5), and $\sigma^2$ is the variance.
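The background test in this phase can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses an unnormalized Gaussian score so that the 0.5 constant can act as a decision threshold (that reading of the constant is an assumption), and the value of `sigma`, the function names, and the tiny test image are all hypothetical.

```python
import math

def background_score(pixel, background, sigma=20.0):
    """Unnormalized Gaussian score in (0, 1]; 1.0 means identical to the background.

    sigma is an assumed standard deviation; the text does not give its value.
    """
    sq_dist = sum((c - b) ** 2 for c, b in zip(pixel, background))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def foreground_mask(frame, background, threshold=0.5, sigma=20.0):
    """Return a 0/1 mask; 1 marks pixels unlikely to come from the background."""
    return [
        [1 if background_score(p, b, sigma) < threshold else 0
         for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]

# Hypothetical 2x3 RGB images: a gray background and a frame with one red pixel.
background = [[(100, 100, 100)] * 3 for _ in range(2)]
frame = [[(100, 100, 100), (102, 99, 101), (200, 30, 40)],
         [(100, 100, 100), (100, 100, 100), (100, 100, 100)]]
mask = foreground_mask(frame, background)
```

Small sensor noise (the second pixel) keeps a score close to 1 and stays in the background, while the strongly different pixel is marked as foreground.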

Experimental Results For Separation And Foreground Extraction:

3D Human Shape Model(Phase II):

The parameters of an individual human, $m_i$, are defined based on a 3D human shape model that captures the basic shape and articulation parameters of the human body. It is a multi-ellipsoid model. The parameters $m_i$ that describe a 3D human hypothesis are:

size ($h_i$): 3D height of the model; it also controls the overall scaling of the object in the three (X, Y, and Z) directions.

thickness ($f_i$): captures extra scaling in the horizontal directions.

position ($u_i$, or $(x_i, y_i)$): image position of the head.

orientation ($o_i$): 3D orientation of the body. Orientations are quantized into a few levels (0° and 90°) for computational efficiency.

inclination ($i_i$): 2D inclination of the body, since the body may be slightly inclined; the inclination angle can be positive, negative, or zero.

\[ m_i = \{h_i, f_i, x_i, y_i, o_i, i_i\} \]
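As an illustration, the parameter set of one hypothesis could be held in a small Python record. This is a sketch under stated assumptions: the class and function names are hypothetical, the units (metres, pixels, degrees) are not specified in the text, and folding orientations modulo 180° before quantizing is an assumption.

```python
from dataclasses import dataclass

ORIENTATION_LEVELS = (0, 90)  # quantization levels in degrees, as listed in the text

def quantize_orientation(angle_deg):
    """Snap an angle to 0 or 90 degrees (assumes opposite directions are equivalent)."""
    folded = angle_deg % 180
    return 90 if 45 <= folded < 135 else 0

@dataclass(frozen=True)
class HumanHypothesis:
    height: float       # h_i: 3D height, also the overall scale
    thickness: float    # f_i: extra horizontal scaling
    x: float            # x_i: image x position of the head
    y: float            # y_i: image y position of the head
    orientation: int    # o_i: quantized 3D orientation in degrees
    inclination: float  # i_i: 2D inclination, may be negative, zero, or positive

    def as_tuple(self):
        """Return the parameters in the order used for m_i."""
        return (self.height, self.thickness, self.x, self.y,
                self.orientation, self.inclination)

hyp = HumanHypothesis(height=1.75, thickness=1.0, x=120.0, y=64.0,
                      orientation=quantize_orientation(80.0), inclination=-2.5)
```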

Detection of Ohm Shape using Canny Edge algorithm(Phase III):

First, the peak points in the foreground frame are identified. To decide whether a peak belongs to a human, the method checks that there is a sufficient number of foreground pixels below the peak position. The omega (ohm, Ω) shape is then obtained by taking one half of the head ellipse and the upper quarter of the torso, which corresponds to the shoulders.
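The peak test can be sketched in Python as follows. This is only an illustration of the idea, not the authors' code: the function name, the rule for flat runs (the middle column of a plateau is taken as the peak), and the default threshold `min_pixels_below` are assumptions, and the omega-shape fitting itself is not shown.

```python
def head_candidates(mask, min_pixels_below=3):
    """Return (x, y) peaks of the foreground's upper contour that have enough
    foreground pixels beneath them to plausibly be a head."""
    height, width = len(mask), len(mask[0])
    # top[x]: row of the highest foreground pixel in column x (None if empty).
    top = [next((y for y in range(height) if mask[y][x]), None)
           for x in range(width)]
    peaks = []
    x = 0
    while x < width:
        if top[x] is None:
            x += 1
            continue
        end = x
        while end + 1 < width and top[end + 1] == top[x]:
            end += 1  # extend over a flat run of equal height
        left = top[x - 1] if x > 0 else None
        right = top[end + 1] if end + 1 < width else None
        # A smaller row index means higher in the image.
        is_peak = ((left is None or left > top[x]) and
                   (right is None or right > top[x]))
        if is_peak:
            cx = (x + end) // 2
            below = sum(mask[y][cx] for y in range(top[cx], height))
            if below >= min_pixels_below:
                peaks.append((cx, top[cx]))
        x = end + 1
    return peaks

# Hypothetical merged blob of two people standing side by side.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
heads = head_candidates(mask)
```

Raising `min_pixels_below` rejects peaks that sit on too little foreground, which is the role of the "sufficient number of pixels" check.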

Blob Tracking (Phase IV):

Each human object is assigned a unique ID so that it can be tracked. In each frame, the blobs B(t) = {B1(t), …, Bn(t)} are matched with the blobs in the previous frame, B(t-1) = {B1(t-1), …, Bm(t-1)}. Two blobs are declared a perfect match if their centroids are close and their size difference is sufficiently small. For each blob Bi(t), the best match in B(t-1) is sought. If a match is found, the unique ID of the human object in the previous frame is copied to the current frame; otherwise, a new ID is given to the human object.
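A minimal Python sketch of this ID propagation is given below. The thresholds `max_dist` and `max_area_ratio`, the greedy nearest-centroid strategy, and all names are assumptions made for illustration; the text only states that centroids must be close and sizes similar.

```python
import math
from dataclasses import dataclass
from itertools import count

@dataclass
class Blob:
    cx: float           # centroid x
    cy: float           # centroid y
    area: float         # blob size in pixels
    track_id: int = -1  # -1 means "not yet assigned"

def assign_ids(previous, current, next_id, max_dist=20.0, max_area_ratio=0.3):
    """Copy IDs from matching blobs of the previous frame; otherwise issue new IDs."""
    unused = list(previous)
    for blob in current:
        best, best_dist = None, max_dist
        for cand in unused:
            dist = math.hypot(blob.cx - cand.cx, blob.cy - cand.cy)
            size_diff = abs(blob.area - cand.area) / max(blob.area, cand.area)
            if dist <= best_dist and size_diff <= max_area_ratio:
                best, best_dist = cand, dist
        if best is not None:
            blob.track_id = best.track_id
            unused = [c for c in unused if c is not best]  # each ID is used once
        else:
            blob.track_id = next(next_id)
    return current

previous = [Blob(10, 10, 100, track_id=1), Blob(50, 50, 120, track_id=2)]
current = [Blob(52, 51, 115), Blob(12, 11, 105), Blob(90, 90, 80)]
assign_ids(previous, current, next_id=count(3))
ids = [b.track_id for b in current]
```

The first two blobs inherit IDs 2 and 1 from their nearby predecessors, and the third, which has no close predecessor, receives the new ID 3.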

Overview of Mean Shift algorithm:

The Mean Shift technique is used to find the best match. It is implemented in the following steps:

1. Calculate an initial histogram that identifies the object being tracked.

2. Apply the initial histogram to every new frame from the input stream using back projection. This yields a single-channel (grayscale) image in which each pixel holds the value of the initial-histogram bin for the color of the corresponding pixel in the new frame.

3. Search the back projection for the region with the highest intensity, which corresponds to the area where the tracked object most probably resides.

[Figure: (a) input image; (b) back-projection image.]

Calculation Of Initial Histogram:

The initial histogram is calculated for the object defined within the object shape. A bounding rectangle is drawn on the target object. A single red, green, blue (RGB) histogram with 512 bins is constructed from all the pixels within the three elliptic regions of the object model. The color histogram helps establish correspondence in tracking because it is insensitive to the non-rigidity of human motion. Using the initial histogram, the back-projection image used by the Mean Shift technique is generated.
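The histogram and back-projection steps can be sketched in plain Python as follows. The split of the 512 bins into 8 levels per channel (8 × 8 × 8) is an assumption, as are the function names; pixels are assumed to be 8-bit RGB tuples, and selecting the pixels inside the elliptic regions is left out.

```python
def bin_index(pixel):
    """Map an 8-bit RGB pixel to one of 512 bins (8 levels per channel, assumed)."""
    r, g, b = pixel
    return (r >> 5) * 64 + (g >> 5) * 8 + (b >> 5)

def rgb_histogram(pixels):
    """Normalized 512-bin color histogram of the given object pixels."""
    hist = [0.0] * 512
    for p in pixels:
        hist[bin_index(p)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def back_project(frame, hist):
    """Replace every pixel by the (peak-normalized) histogram value of its color."""
    peak = max(hist) or 1.0
    return [[hist[bin_index(p)] / peak for p in row] for row in frame]

# Hypothetical object that is mostly red with a little blue.
object_pixels = [(255, 0, 0)] * 3 + [(0, 0, 255)]
hist = rgb_histogram(object_pixels)
projection = back_project([[(250, 10, 5), (0, 0, 0)]], hist)
```

A reddish pixel in the new frame falls into the dominant bin and gets the maximum back-projection value, while a black pixel, whose bin is empty, gets zero.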

Generation Of Trajectory:

The trajectory is the path of a moving human blob across frames. The centroid in the previous frame is already known, and the new centroid in the current frame is calculated by the Mean Shift technique. Joining these centroids frame by frame forms the trajectory, which traces the path taken by an individual human object.
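The centroid update can be sketched as a basic Mean Shift iteration over a back-projection image. This is a simplified illustration with a rectangular window and uniform kernel; the window size, iteration cap, and names are assumptions rather than details from the text.

```python
def mean_shift(weights, center, half_size, max_iter=20):
    """Move a window to the weighted centroid of `weights` until it stops moving.

    weights: 2D list (e.g. a back-projection image); center: (x, y);
    half_size: (half_width, half_height) of the search window.
    """
    height, width = len(weights), len(weights[0])
    (cx, cy), (hw, hh) = center, half_size
    for _ in range(max_iter):
        total = sum_x = sum_y = 0.0
        for y in range(max(0, cy - hh), min(height, cy + hh + 1)):
            for x in range(max(0, cx - hw), min(width, cx + hw + 1)):
                w = weights[y][x]
                total += w
                sum_x += w * x
                sum_y += w * y
        if total == 0.0:
            break  # nothing to follow inside the window
        new_center = (int(sum_x / total + 0.5), int(sum_y / total + 0.5))
        if new_center == (cx, cy):
            break  # converged
        cx, cy = new_center
    return cx, cy

# Hypothetical back projection: a bright 3x3 object centered at (4, 4).
weights = [[1.0 if 3 <= x <= 5 and 3 <= y <= 5 else 0.0 for x in range(7)]
           for y in range(7)]
trajectory = [(2, 2)]  # centroid from the previous frame
trajectory.append(mean_shift(weights, trajectory[-1], half_size=(2, 2)))
```

Appending the new centroid to the list of earlier centroids is what builds the trajectory over time.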

Block Diagram of the MCMC Tracking Algorithm(Phase V):

Computing the MAP by efficient MCMC: the maximum a posteriori (MAP) solution is calculated using the MCMC method as follows.

A Markov chain with stationary distribution $P(\theta)$ is designed. At the g-th iteration, a candidate state $\theta'$ is sampled from a proposal distribution $q(\theta_g \mid \theta_{g-1})$. If the candidate state $\theta'$ is accepted, $\theta_g = \theta'$; otherwise, $\theta_g = \theta_{g-1}$. A Markov chain constructed in this way has its stationary distribution equal to $P(\theta)$, independent of the choice of the proposal probability $q(\cdot)$ and the initial state $\theta_0$. However, the choice of the proposal probability $q(\cdot)$ can affect the efficiency of MCMC significantly. Using more informed proposal probabilities, as in data-driven MCMC, makes the Markov chain traverse the solution space more efficiently. Therefore, the proposal distribution is written as $q(\theta_g \mid \theta_{g-1}, I)$.

[Figure: block diagram of the MCMC tracking algorithm. Labels: Addition of object; Removal of object; Object Merge; Object Split; Exchange Identity; Compute Acceptance Ratio; Probabilistic Acceptance; Accept (Yes/No).]
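The accept/reject loop can be illustrated with a generic Metropolis-Hastings sketch in Python. The standard acceptance ratio min(1, P(θ')q(θ_{g-1}|θ') / (P(θ_{g-1})q(θ'|θ_{g-1}))) is assumed here, since the text does not spell it out, and the toy posterior over "number of humans" with its random-walk proposal is entirely hypothetical.

```python
import random

def map_by_metropolis_hastings(posterior, propose, proposal_prob,
                               initial, iterations, rng):
    """Run a Metropolis-Hastings chain and return the best state visited.

    posterior(s) is an unnormalized, strictly positive P(s);
    proposal_prob(new, old) is q(new | old).
    """
    state = best = initial
    for _ in range(iterations):
        candidate = propose(state, rng)
        ratio = (posterior(candidate) * proposal_prob(state, candidate)) / (
            posterior(state) * proposal_prob(candidate, state))
        if rng.random() < min(1.0, ratio):  # probabilistic acceptance
            state = candidate
        if posterior(state) > posterior(best):
            best = state
    return best

# Toy problem (hypothetical): the state is a human count from 0 to 3.
WEIGHTS = {0: 1.0, 1: 2.0, 2: 8.0, 3: 3.0}

def toy_posterior(n):
    return WEIGHTS[n]

def toy_propose(n, rng):
    return (n + rng.choice((-1, 1))) % 4  # symmetric step, wrapping around

def toy_proposal_prob(new, old):
    return 0.5  # both neighbors are equally likely

best = map_by_metropolis_hastings(toy_posterior, toy_propose, toy_proposal_prob,
                                  initial=0, iterations=200, rng=random.Random(0))
```

In a data-driven variant, `propose` would also look at the image observation to suggest likely states, which is what the dependence of q on I expresses.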

Markov Chain Dynamics:

The Data-Driven Markov Chain Monte Carlo (DDMCMC) algorithm uses the following Markov chain dynamics:

Object Addition: Whenever a new human enters the frame, a new human object is added.

\[ q_{add}\left(\theta_{g-1} \cup \{k_{n+1}, m_{n+1}\} \mid \theta_{g-1}, I\right) \]

Object Removal: Whenever a human moves out of the frame, that human object is removed.

\[ q_{remove}\left(\theta_{g-1} \setminus \{k_r, m_r\} \mid \theta_{g-1}\right) \]

Here $k_i$ is the unique identity of the ith human object, $m_i$ describes the parameters of the ith human object, and $\Theta$ is the solution space.

Object Split: When a blob contains more than one object, it is split into two. The blob is first divided into six equal parts, and a histogram is constructed for each part. The split angle is calculated from the part with the highest intensity in the histogram, and the blob is split in two using this angle.

Object Merge: When a single object is spread over more than one blob, the blobs are merged into a single blob. Blobs are merged when the area of a human blob is smaller than the required value.

Exchange Identity: Under full occlusion, IDs can be wrongly assigned, so those IDs have to be exchanged.

Parameter Update: The continuous parameters of a human object are updated.
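Three of these dynamics (addition, removal, and identity exchange) can be sketched as pure functions on a state that maps each identity k_i to its parameters m_i. The dictionary representation, the ID numbering rule, and the function names are assumptions for illustration; split, merge, and parameter update are omitted because they depend on image data.

```python
def add_object(state, params):
    """Return a new state with one more object under a fresh ID."""
    new_id = max(state, default=0) + 1
    return {**state, new_id: params}

def remove_object(state, object_id):
    """Return a new state without the given object."""
    return {k: v for k, v in state.items() if k != object_id}

def exchange_identity(state, id_a, id_b):
    """Return a new state in which two objects have swapped their IDs."""
    swapped = dict(state)
    swapped[id_a], swapped[id_b] = state[id_b], state[id_a]
    return swapped

# Hypothetical state: ID -> (x, y) head position.
state = {1: (10, 20), 2: (30, 40)}
added = add_object(state, (50, 60))
removed = remove_object(added, 1)
exchanged = exchange_identity(state, 1, 2)
```

Because each function returns a new state and leaves its input unchanged, a rejected proposal can simply be discarded and the chain stays at the previous state.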

Example of alert system:

Input and Output Video Sequences of a Security

Criteria | Number | Percentage
Total number of people in the input video | 3 | -
Total number of heads detected | 3 | 100%
Total number of persons segmented according to the Human Shape Model | 3 | 100%
Total number of heads detected according to the Ohm Shape Head Shoulder Model | 3 | 100%

Thus, in the Gaussian distribution formula, a probability factor of 0.5 is used when the number of humans is small and 0.9 when it is large. An inner-box histogram is used instead of an outer-box histogram because it is more accurate and delivers better tracking results. A new parameter, the area of the human body in pixels, has been added; it helps calculate the number of humans during occlusion.
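One way the area parameter could be used is sketched below: the area of an occluded blob is divided by the expected area of a single human. The text does not state the exact rule, so the rounding and the minimum of one person are assumptions, and the pixel areas in the example are hypothetical.

```python
def estimate_count(blob_area, human_area):
    """Estimate how many humans a blob contains from its pixel area."""
    if human_area <= 0:
        raise ValueError("human_area must be positive")
    return max(1, int(blob_area / human_area + 0.5))  # round to nearest, at least 1

count_in_blob = estimate_count(blob_area=2900, human_area=1000)
```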

Social Impact and Applicability:

When the system is installed in supermarkets and shopping malls, human movement can be detected automatically. If a theft is detected, an automatic theft alarm can be triggered and the products secured. In highly sensitive areas such as Indian defense installations or airports, arms and ammunition being taken out illegally can be detected and an automated alarm raised.

[Caption: Evaluation of Tracking Output]

Future Enhancements:

The current system could be enhanced in the following ways:

Extension to track multiple classes of objects (e.g., humans and cars) by adding model switching to the MCMC dynamics.

Complete elimination of the ambiguities that inevitably exist, especially when tracking fully occluded objects.

Improved tracking accuracy using multiple cameras, along with grid computing to increase processing speed so that all 25 frames per second can be processed.

Conclusion:

A principled approach has been developed to simultaneously detect and track humans in a crowded scene acquired from a single stationary camera. Experiments and evaluations on challenging real-life data show promising results. The success of the approach lies mainly in the integration of the top-down Bayesian formulation, which follows the image formation process, with bottom-up features extracted directly from images. This integration combines the computational efficiency of image features with the optimality of a Bayesian formulation.

REFERENCES:

Tao Zhao, Ram Nevatia, and Bo Wu, “Segmentation and Tracking of Multiple Humans in Crowded Environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 7, July 2008.

Rafael C. Gonzalez and Richard E. Woods, “Digital Image Processing,” Second Edition, Prentice Hall.

Elad Ben-Israel, “Tracking of Humans Using Masked Histograms and Mean Shift,” Efi Arazi School of Computer Science, The Interdisciplinary Center Herzliya, March 2007.

http://www.ph.tn.tudelft.nl/Courses/FIP/noframes/fip-Spectral.html

http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2007.70770

http://groups.inf.ed.ac.uk/vision/CAVIAR/CAVIARDATA1/
