CHAPTER 2
LITERATURE SURVEY
2.1 INTRODUCTION
This chapter presents a detailed literature survey on facial tracking using lip movement, skin color and mouth movement in video sequences. The automatic facial feature extraction methods, 3D model shaping techniques, and algorithms for robust segmentation of various facial parts designed by various authors are discussed.
2.2 FACIAL TRACKING USING LIP READING
Yuille et al., 1992, develop an automatic facial feature
extraction
system, which is able to identify the detailed shape of eyes,
eyebrows and
mouth from facial images. The developed system not only extracts
the
location information of the features, but also estimates the parameters pertaining to the contours and parts of the features using a parametric deformable template approach. In order to extract facial features, deformable models for the eye, eyebrow, and mouth are developed. The
development steps of the geometry, imaging model and
matching
algorithms, and energy functions for each of these templates are
presented
in detail, along with the important implementation issues. An
eigenface
based multi-scale face detection algorithm which incorporates
standard facial
proportions is implemented, so that when a face is detected, the
rough
search regions for the facial features are readily available.
The developed
system is tested on JAFFE (Japanese Females Facial Expression
Database),
Yale Faces, and ORL (Olivetti Research Laboratory) face image
databases.
The performance of each deformable template and the face
detection
algorithm are discussed separately.
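The deformable-template idea above can be sketched in Python. The following toy example, which is an illustrative assumption and not Yuille et al.'s actual formulation, fits a parametric mouth template (a centre and half-width) by minimising an energy that is low where the template lies on strong image edges:

```python
import numpy as np

# Toy edge image: a single horizontal "lip" edge. Real systems use
# gradient magnitudes of the input image; this is an assumption.
H, W = 64, 64
edge_map = np.zeros((H, W))
edge_map[40, 20:45] = 1.0

def energy(cx, cy, half_w=12):
    # Negative edge support summed along the template: lower is better.
    xs = np.clip(np.arange(cx - half_w, cx + half_w), 0, W - 1)
    y = int(np.clip(cy, 0, H - 1))
    return -edge_map[y, xs].sum()

# A crude exhaustive search stands in for the gradient descent that
# differentiable energy functions permit.
best = min(((energy(cx, cy), (cx, cy))
            for cx in range(10, 55) for cy in range(10, 55)),
           key=lambda t: t[0])
print(best[1][1])   # fitted centre row lands on the edge at y = 40
```

The real templates carry many more parameters (parabola and circle coefficients for the lips and eyes) and several energy terms, but the fit-by-energy-minimisation structure is the same.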
Rabiner, 1993, states that although the face detection algorithm is designed for frontal faces, the same mechanism can also be applied to track non-frontal faces with online adapted face models. Owing to the nature of template matching, the algorithm is capable of comparing the similarity among different faces, which makes it suitable for tracking the same face when it occurs at disjoint temporal locations in video. While the proposed face detection method provides accuracy comparable to the neural network based approach, it is much faster.
Terzopoulos et al., 1993, present a new approach to the analysis
of
dynamic facial images for the purposes of estimating and
resynthesizing
dynamic facial expressions. The approach exploits a
sophisticated generative
model of the human face originally developed for realistic
facial animation.
The face model, which may be simulated and rendered at
interactive rates
on a graphics workstation, incorporates a physics-based
synthetic facial
tissue and a set of anatomically motivated facial muscle
actuators. They
consider the estimation of dynamic facial muscle contractions
from video
sequences of expressive human faces. They develop an estimation
technique
that uses deformable contour models (snakes) to track the non-rigid motions of facial features in video images.
Lanitis et al., 1994, present flexible shape and flexible
grey-level
models for representing variations in the appearance of human
faces. These
models are controlled by a small number of parameters which can
be used
for coding and reconstructing a face image.
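The way a flexible shape model codes a face with a small number of parameters can be illustrated as follows. This is a minimal sketch under assumed synthetic training data, not the authors' implementation: shapes are represented as a mean shape plus a few principal modes of variation.

```python
import numpy as np

np.random.seed(0)
# Hypothetical training set: 50 shapes, each 10 landmarks flattened
# to a 20-vector (an assumption for illustration).
shapes = np.random.randn(50, 20) * 3.0 + np.linspace(0, 9, 20)

mean_shape = shapes.mean(axis=0)
# Principal modes of variation from the training-shape covariance.
cov = np.cov(shapes - mean_shape, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
modes = eigvecs[:, order[:3]]            # keep only 3 modes

def reconstruct(b):
    """Shape coded by a small parameter vector b, one value per mode."""
    return mean_shape + modes @ b

# Coding a face shape: project onto the modes, then reconstruct.
b = modes.T @ (shapes[0] - mean_shape)
approx = reconstruct(b)
print(np.linalg.norm(approx - shapes[0])
      < np.linalg.norm(shapes[0] - mean_shape))   # True: 3 numbers suffice
```

The same mean-plus-modes construction applies to the grey-level models, with pixel intensities in place of landmark coordinates.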
Jacquin Arnaud et al., 1995, address the issue of automatically tracking the faces and facial features of persons in head-and-shoulders video sequences. They propose two fully automatic algorithms which respectively detect head outlines and identify rectangular eyes-nose-mouth regions, both from down-sampled binary thresholded edge images. Unlike recently proposed methods, their techniques make minimal a priori assumptions regarding the nature and content of the sequences to be coded, and the algorithms operate accurately and robustly, even in cases of significant head rotation or partial occlusion by moving objects.
Gavrila and Davis, 1996, present a vision system for the 3-D
model-
based tracking of unconstrained human movement. Using image
sequences
acquired simultaneously from multiple views, they recover the 3D
body pose
at each time instant without the use of markers. The pose
recovery problem
is formulated as a search problem and entails finding the pose
parameters of
a graphical human model whose synthesized appearance is most
similar to
the actual appearance of the real human in the multi-view
images. The
models used for this purpose are acquired from the images. They
use a
decomposition approach and a best-first technique to search
through the
high dimensional pose parameter space. A robust variant of
chamfer
matching is used as a fast similarity measure between
synthesized and real
edge images. They present initial tracking results from a large new human-in-action database containing more than 2500 frames in each of four orthogonal views. The four image streams are synchronized. They contain subjects involved in a variety of activities of various degrees of complexity, ranging from simple one-person hand waving to the challenging close interaction of two people in the Argentine Tango.
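The chamfer matching used by Gavrila and Davis as a fast similarity measure can be sketched as follows. The images below are toy assumptions: the score is the mean distance from each synthesized model edge pixel to the nearest real edge pixel, computed from a distance transform of the real edge image.

```python
import numpy as np

# Observed (real) edge image: one horizontal edge segment.
real = np.zeros((32, 32), bool)
real[10, 5:25] = True

# Brute-force distance transform: distance from every pixel to the
# nearest real edge pixel (real systems use a fast two-pass transform).
ey, ex = np.nonzero(real)
yy, xx = np.mgrid[0:32, 0:32]
dist = np.sqrt((yy[..., None] - ey) ** 2 + (xx[..., None] - ex) ** 2).min(-1)

def chamfer_score(model):
    # Mean distance-transform value under the model's edge pixels;
    # 0.0 means the model edges lie exactly on the observed edges.
    my, mx = np.nonzero(model)
    return dist[my, mx].mean()

aligned = np.zeros_like(real); aligned[10, 5:25] = True
shifted = np.zeros_like(real); shifted[14, 5:25] = True
print(chamfer_score(aligned), chamfer_score(shifted))   # 0.0 4.0
```

Because the distance transform is precomputed once per frame, many candidate poses can be scored cheaply, which is what makes the search through the high-dimensional pose space feasible.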
McKenna et al., 1996, describe a dynamic face tracking system based on an integrated motion-based object tracking and model-based face detection framework. The motion-based tracker focuses attention
for the
face detector whilst the latter aids the tracking process. The
system
produces segmented face sequences from complex scenes with poor
viewing
conditions in surveillance applications. They also investigate a
Gabor wavelet
transform as a representation scheme for capturing head
rotations in depth.
Principal components analysis was used to visualize the manifolds described by pose change.

Heinzmann and Zelinsky, 1997, state that people naturally express themselves through facial gestures. They have implemented an
interface that tracks a person's facial features robustly in
real time (30Hz)
and does not require artificial artifacts such as special
illumination or facial
makeup. Even if features become occluded, the system is capable
of
recovering tracking in a couple of frames after the features
reappear in the
image. Based on this fault tolerant face tracker they have
implemented real
time gesture recognition capable of distinguishing 12 different
gestures
ranging from "yes", "no" and "maybe" to winks, blinks and "asleep".
Sanchez et al., 1997, present a method for lip tracking intended to support personal verification. Lip contours are represented
by means of
quadratic B-splines. The lips are automatically localized in the
original image
and an elliptic B-spline is generated to start up tracking. Lip
localization
exploits grey-level gradient projections as well as chromaticity
models to
find the lips in an automatically segmented region corresponding
to the face
area. Tracking proceeds by estimating new lip contour positions
according to
a statistical chromaticity model for the lips. The current
tracker
implementation follows a deterministic second order model for
the spline
motion based on a Lagrangian formulation of contour dynamics.
The method
has been tested on the M2VTS database. Lips were accurately
tracked on
sequences consisting of more than a hundred frames.
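The quadratic B-spline representation used for the lip contours can be illustrated with a short sketch. The control points below are an assumed toy mouth outline, not data from the paper; a closed contour is evaluated segment by segment from the uniform quadratic B-spline basis.

```python
import numpy as np

# Toy control polygon for a closed mouth-like contour (an assumption).
ctrl = np.array([[0, 0], [2, 1], [4, 1], [6, 0], [4, -1], [2, -1]], float)

def quad_bspline(ctrl, samples_per_seg=10):
    pts = []
    n = len(ctrl)
    for i in range(n):                       # closed curve: wrap around
        p0, p1, p2 = ctrl[i], ctrl[(i + 1) % n], ctrl[(i + 2) % n]
        for t in np.linspace(0, 1, samples_per_seg, endpoint=False):
            # Uniform quadratic B-spline basis functions (sum to 1).
            b0 = 0.5 * (1 - t) ** 2
            b1 = 0.5 * (-2 * t * t + 2 * t + 1)
            b2 = 0.5 * t * t
            pts.append(b0 * p0 + b1 * p1 + b2 * p2)
    return np.array(pts)

contour = quad_bspline(ctrl)
print(contour.shape)       # (60, 2): a smooth closed lip outline
```

Tracking then amounts to updating the small set of control points from frame to frame rather than every contour pixel, which is what makes the spline dynamics tractable.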
Basu et al., 1998, address the problem of tracking and
reconstructing
3D human lip motions from a 2D view. They build a
physically-based 3D
model of lips and train it to cover only the subspace of lip
motions. They
then track this model in video by finding the shape within the
subspace that
maximizes the posterior probability of the model given the
observed
features. The features are the likelihoods of the lip and
non-lip color classes:
they iteratively derive forces from these values to apply to the
physical
model and converge to the final solution. Because of the full 3D nature of the model, this framework allows the lips to be tracked from any head pose. In
addition, because of the constraints imposed by the learned
subspace of the
model, they are able to accurately estimate the full 3D lip
shape from the 2D
view.
Edwards et al., 1998, address the problem of robust face identification
in the presence of pose, lighting, and expression variation.
Previous
approaches to the problem have assumed similar models of
variation for
each individual, estimated from pooled training data. They
describe a
method of updating a first order global estimate to identity by
learning the
class specific correlation between the estimate and the residual
variation
during a sequence. This is integrated with an optimal tracking
scheme, in
which identity variation is decoupled from pose, lighting and
expression
variation. The method results in robust tracking and a more
stable estimate
of facial identity under changing conditions.
Schödl Arno et al., 1998, describe the use of a
three-dimensional
textured model of the human head under perspective projection to
track a
person’s face. The system is hand-initialized by projecting an
image of the
face onto a polygonal head model. Tracking is achieved by
finding the six
translation and rotation parameters to register the rendered
images of the
textured model with the video images. They find the parameters
by mapping
the derivative of the error with respect to the parameters to
intensity
gradients in the image. They use a robust estimator to pool the
information
and do gradient descent to find an error minimum.
Stan Birchfield, 1998, presents an algorithm for tracking a
person’s
head. The head’s projection onto the image plane is modeled as
an ellipse
whose position and size are continually updated by a local
search combining
the output of a module concentrating on the intensity gradient
around the
ellipse’s perimeter with that of another module focusing on the
color
histogram of the ellipse’s interior. Since these two modules
have roughly
orthogonal failure modes, they serve to complement one another.
The result
is a robust, real-time system that is able to track a person’s
head with
enough accuracy to automatically control the camera’s pan, tilt,
and zoom in
order to keep the person centered in the field of view at a
desired size.
Extensive experimentation shows the algorithm’s robustness with
respect to
full 360-degree out-of-plane rotation, up to 90-degree tilting,
severe but
brief occlusion, arbitrary camera movement, and multiple moving
people in
the background.
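The colour-histogram half of Birchfield's two-module ellipse score can be sketched briefly. The histograms below are illustrative assumptions: histogram intersection scores how much of the model's learnt head-colour content is present inside a candidate ellipse, and complements the perimeter-gradient module.

```python
import numpy as np

def hist_intersection(h1, h2):
    # Both histograms are normalised; 1.0 means identical colour content.
    return np.minimum(h1, h2).sum()

model_hist = np.array([0.7, 0.2, 0.1])    # learnt head colours (assumed)
inside     = np.array([0.6, 0.3, 0.1])    # candidate ellipse interior
elsewhere  = np.array([0.1, 0.1, 0.8])    # a background candidate

print(round(hist_intersection(model_hist, inside), 2),
      round(hist_intersection(model_hist, elsewhere), 2))   # 0.9 0.3
```

Because the colour module fails under lighting changes and the gradient module fails in clutter, combining the two roughly orthogonal scores yields the robustness reported in the paper.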
Toyama, 1998, observes that real-time 3D face tracking is a task with applications to animation, video teleconferencing, speech reading, and accessibility. In spite
of advances in hardware and efficient vision algorithms, robust
face tracking
remains elusive for all of the reasons which make computer
vision difficult:
Variations in illumination, pose, expression, and visibility
complicate the
tracking process, especially under real-time constraints. He notes that robust systems tend to possess some state-based architecture
comprising
heterogeneous algorithms, and that robust recovery from tracking
failure
requires several other facial image analysis tasks.
Cascia et al., 2000, propose an improved technique for 3D head tracking under varying illumination conditions. The head is modeled as a
texture mapped cylinder. Tracking is formulated as an image
registration
problem in the cylinder's texture map image. The resulting
dynamic texture
map provides a stabilized view of the face that can be used as
input to many
existing 2D techniques for face recognition, facial expressions
analysis, lip
reading, and eye tracking.
Lievin and Luthon, 2000, propose an algorithm for speaker's
lip
segmentation and features extraction. A color video sequence of
speaker's
face is acquired, under natural lighting conditions and without
any particular
make-up. A logarithmic color transform is performed from the RGB
to HI
(hue, intensity) color space. A statistical approach using Markov random field modeling determines the red-hue-prevailing region and motion in a spatiotemporal neighborhood. Finally, the label field is used to extract the ROI (region of interest) and geometrical features.
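The logarithmic colour transform that motivates this approach can be sketched generically. The exact transform of Lievin and Luthon is not reproduced here; the code below is an assumption built on the same idea that log-ratios of channels give a hue-like measure that is robust to illumination scale.

```python
import numpy as np

def log_hi(rgb):
    # Generic logarithmic hue/intensity sketch (not the paper's formula).
    r, g, b = (np.asarray(rgb, float) + 1.0)   # +1 avoids log(0)
    hue = np.log(r) - np.log(g)                # red-prevalence measure
    intensity = np.log(r * g * b) / 3.0
    return hue, intensity

lip_pixel  = (180, 80, 90)      # reddish pixel (assumed values)
skin_pixel = (200, 160, 140)
print(log_hi(lip_pixel)[0] > log_hi(skin_pixel)[0])   # True: lips are "redder"
```

The red-prevalence measure is then what the Markov random field labels spatially, so that isolated reddish pixels outside the mouth neighbourhood are suppressed.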
Tian et al., 2000, propose a dual-state model-based system for tracking eye features that uses convergent tracking techniques and show
how it can
be used to detect whether the eyes are open or closed, and to
recover the
parameters of the eye model.
Jian et al., 2001, develop real-time lip tracking whose parameters can be used to implement and control a virtual lip. The use of soft
computing to
represent the real time lip parameters enables them to have a
more robust
and flexible system which can compensate for the potential
errors of lip
tracking.
Chan et al., 2002, state that contour model-based tracking is
more
robust if an accurate reference shape model of the underlying
object is
available. As lip shapes vary, the ability to automatically
extract user-
dependent lip models from input images is desirable. They
present an
unsupervised segmentation method to hierarchically locate the
user's face
and lips. Techniques employed include modeling in the hue /
saturation color
space using Gaussian mixture models and the use of geometric
constraints.
With the region of interest automatically located, the model
extraction
problem is formulated as a regularized model-fitting problem.
The use of a
generic shape as prior information improves the accuracy of the
extracted lip
model which is based on a cubic B-spline representation. They
describe a
method to compute automatically an optimal linear color space
transform
needed to obtain raw estimates of the lip boundary locations, as
required by
the fitting procedure.
Delman and Lievin, 2002, present an algorithm for speaker's
lip
segmentation and features extraction. A color video sequence of
speaker's
face is acquired, under natural lighting conditions and without
any particular
make-up. A logarithmic color transform is performed from RGB to
HI (hue,
intensity) color space. A statistical approach using Markov random field modeling determines the lip-prevailing region and motion in spatiotemporal neighborhoods.
Eveno et al., 2002, propose an accurate and robust lip
segmentation
algorithm. Characteristic points are found by using hybrid
edges, which
combine color and intensity information, and a priori knowledge
about the lip
structure. Corner position, which is crucial, is provided by a
coarse-to-fine
process. A model is fitted to the lips. Unlike most model-oriented methods, they consider the lip boundary to be composed of several independent cubic polynomial models. This gives the global model enough flexibility to reproduce the specificity of very different lip shapes. Compared
to existing
models, it brings a significant accuracy improvement. It ensures
a robust
convergence towards the edges.
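The idea of modelling the lip boundary as independent cubic polynomial pieces can be sketched with a least-squares fit. The boundary samples below are synthetic assumptions; each piece of the boundary gets its own cubic, which is where the flexibility of the global model comes from.

```python
import numpy as np

x = np.linspace(-1, 1, 20)
upper = 0.5 - 0.6 * x**2 + 0.1 * x**3     # synthetic upper-lip edge samples
lower = -0.4 + 0.4 * x**2                 # synthetic lower-lip edge samples

# Fit an independent cubic to each boundary piece.
c_up = np.polyfit(x, upper, 3)
c_lo = np.polyfit(x, lower, 3)

err = np.abs(np.polyval(c_up, x) - upper).max()
print(err < 1e-6)   # True: the cubic reproduces the synthetic edge
```

In practice the samples come from the hybrid colour/intensity edges, and the fitted pieces are joined at the detected characteristic points such as the mouth corners.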
Liew et al., 2002, note that visual information from lip movements can improve the accuracy and robustness of a speech recognition system. A region-based lip contour extraction
algorithm based
on deformable model is proposed. The algorithm employs a
stochastic cost
function to partition a color lip image into lip and non-lip
regions such that
the joint probability of the two regions is maximized. Given a
discrete
probability map generated by spatial fuzzy clustering, they show
how the
optimization of the cost function can be done in the continuous
setting. The
region-based approach makes the algorithm more tolerant to noise
and
artifacts in the image. It also allows a larger region of attraction, thus making the algorithm less sensitive to initial parameter settings. The
algorithm
works on unadorned lips and accurate extraction of lip contour
is possible.
Mark Barnard et al., 2002, propose a robust and adaptable lip
tracking
method that uses a combination of snakes and a 2D template
matching
technique. The snake, an energy minimizing spline, is driven by
2D template
matching techniques to find the expected lip contour of a
specific speaker.
Their experiments show that the technique can track the
unadorned lips in
various colors and shapes of speakers, including the lips of a
bearded
speaker.
Morency et al., 2002, present a robust implementation of
stereo-based
head tracking designed for interactive environments with
uncontrolled
lighting. They integrate fast face detection and drift reduction
algorithms
with a gradient-based stereo rigid motion tracking technique.
Their system
can automatically segment and track a user’s head under large
rotation and
illumination variations. Precision and usability of their
approach are
compared with previous tracking methods for cursor control and
target
selection in both desktop and interactive room environments.
Yang et al., 2002, state that images containing faces are essential to
intelligent vision-based human computer interaction, and
research efforts in
face processing include face recognition, face tracking, pose
estimation, and
expression recognition. Given a single image, the goal of face
detection is to
identify all image regions which contain a face regardless of
its three-
dimensional position, orientation, and lighting conditions. Such
a problem is
challenging because faces are not rigid and have a high degree
of variability
in size, shape, color, and texture. Numerous techniques have
been
developed to detect faces in a single image.
Blanz Volker and Vetter, 2003, present a method for face
recognition
across variations in pose, ranging from frontal to profile
views, and across a
wide range of illuminations, including cast shadows and specular reflections.
To account for these variations, the algorithm simulates the
process of
image formation in 3D space, using computer graphics, and it
estimates 3D
shape and texture of faces from single images. The estimate is achieved by fitting a statistical, morphable model of 3D faces to images. The model is learned from a set of textured 3D scans of heads. They describe the construction of the morphable model, an algorithm to fit the model to
images, and a framework for face identification. In this
framework, faces are
represented by model parameters for 3D shape and texture.
Liew, 2003, describes the application of a novel spatial fuzzy clustering
algorithm to the lip segmentation problem. The proposed spatial
fuzzy
clustering algorithm is able to take into account both the
distributions of
data in feature space and the spatial interactions between
neighboring pixels
during clustering. By appropriate pre- and post-processing utilizing the color and shape properties of the lip region, successful segmentation of most lip
images is possible. Comparative study with some existing lip
segmentation
algorithms such as the hue filtering algorithm and the fuzzy
entropy
histogram thresholding algorithm has demonstrated the
superior
performance of their method.
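A plain fuzzy c-means iteration, without the spatial interaction term that Liew adds, can be sketched as follows. The one-dimensional "hue" data are synthetic assumptions; memberships and centroids are updated alternately until they settle.

```python
import numpy as np

np.random.seed(1)
data = np.concatenate([np.random.randn(30) * 0.1 + 0.2,   # lip-like hues
                       np.random.randn(30) * 0.1 + 0.8])  # skin-like hues

m = 2.0                                   # fuzziness exponent
centers = np.array([0.0, 1.0])            # rough initial centroids
for _ in range(20):
    d = np.abs(data[None, :] - centers[:, None]) + 1e-9   # (2, n) distances
    u = d ** (-2.0 / (m - 1.0))
    u /= u.sum(axis=0)                                    # fuzzy memberships
    centers = (u ** m @ data) / (u ** m).sum(axis=1)      # weighted centroids
print(np.sort(centers))                   # one centre near each hue cluster
```

The spatial variant replaces the pure colour distance `d` with a term that also rewards agreement with neighbouring pixels' memberships, which is what suppresses isolated misclassified pixels.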
Suandi et al., 2003, introduce an extended technique in
template
matching to track eyes and mouth in real-time. The technique
makes use of
a set of ‘n’ correlation candidates from template matching. They
first list all
the candidates from each face model region, and select the best
candidates
based on two selective functions. These functions are for
right-left eyes pair
and eyes-mouth pair selection, respectively. They also introduce
a novel
technique in tracking framework, called feature selective (FS),
where the
system selects the features automatically so that it is feasible
for multiple
face types and conditions.
Wu et al., 2003, state that occlusion is a difficult problem
for
appearance-based target tracking, especially when it needs to
track multiple
targets simultaneously and maintain the target identities during
tracking.
They propose a dynamic Bayesian network which accommodates an
extra
hidden process for occlusion and stipulates the conditions on
which the
image observation likelihood is calculated. The statistical
inference of such a
hidden process can reveal the occlusion relations among
different targets,
which makes the tracker more robust against partial and even complete occlusions. In addition, considering the fact that target
appearances change
with views, another generative model for multiple view
representation is
proposed by adding a switching variable to select from different
view
templates. The integration of the occlusion model and multiple view model
results in a complex dynamic Bayesian network, where extra
hidden
processes describe the switch of targets’ templates, dynamics,
and the
occlusions among different targets. The tracking and inference
algorithms
are implemented by the sampling-based sequential Monte Carlo
strategies.
Their experiments show the effectiveness of the proposed probabilistic models and the algorithms.
Eveno Nicolas et al., 2004, propose an accurate and robust quasi-automatic lip segmentation algorithm. The upper mouth boundary
and
several characteristic points are detected in the first frame by
using a new
kind of active contour: the “jumping snake”. Unlike classic
snakes, it can be
initialized far from the final edge and the adjustment of its
parameters is
easy and intuitive. Then, to achieve the segmentation they
propose a
parametric model composed of several cubic curves. Its high
flexibility
enables accurate lip contour extraction even in the challenging
case of very
asymmetric mouth. It brings a significant accuracy and
realism
improvement. The segmentation in the following frames is
achieved by using
an inter frame tracking of the key points and the model
parameters. The
key points’ positions become unreliable after a few frames. They propose an adjustment process that enables accurate tracking even after hundreds of frames; the mean key-point tracking errors of their algorithm are comparable to manual point selection errors.
Leung Shu-Hung et al., 2004, present a new fuzzy clustering method for lip image segmentation. This clustering method takes
both the
color information and the spatial distance into account while
most of the
current clustering methods only deal with the former. A new
dissimilarity
measure, which integrates the color dissimilarity and the
spatial distance in
terms of an elliptic shape function, is introduced. Because of
the presence of
the elliptic shape function, the new measure is able to
differentiate the pixels
having similar color information but located in different regions. A new
iterative algorithm for the determination of the membership and
centroid for
each class is derived, which is shown to provide good
differentiation between
the lip region and the non-lip region.
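The elliptic-shape-weighted dissimilarity measure can be sketched directly. The weights and ellipse axes below are illustrative assumptions: a colour term is augmented with an elliptic spatial term, so a pixel with lip-like colour far from the mouth ellipse still scores as non-lip.

```python
import numpy as np

def dissimilarity(color, pos, centroid_color, mouth_center,
                  a=30.0, b=15.0, lam=1.0):
    # Elliptic spatial term: < 1 inside the assumed mouth ellipse,
    # growing quadratically outside it.
    dx, dy = pos[0] - mouth_center[0], pos[1] - mouth_center[1]
    elliptic = (dx / a) ** 2 + (dy / b) ** 2
    return (color - centroid_color) ** 2 + lam * elliptic

lip_color, mouth = 0.8, (100, 60)
near = dissimilarity(0.8, (105, 62), lip_color, mouth)
far  = dissimilarity(0.8, (160, 60), lip_color, mouth)  # same colour, wrong place
print(near < far)   # True: the spatial term separates them
```

A pure colour distance would score both pixels identically; adding the elliptic term is exactly what lets the clustering differentiate pixels with similar colour in different regions.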
Wang et al., 2004, note that visual information from lip shapes and movements helps improve the accuracy and robustness of a speech recognition system.
A new region-based lip contour extraction algorithm that
combines the
merits of the point-based model and the parametric model is
presented.
Their algorithm uses a 16-point lip model to describe the lip
contour. Given a
robust probability map of the color lip image generated by the
FCMS (fuzzy
clustering method incorporating shape function) algorithm, a
region-based
cost function that maximizes the joint probability of the lip
and non-lip
region can be established. Then an iterative point-driven
optimization
procedure has been developed to fit the lip model to the
probability map. In
each iteration, the adjustment of the 16 lip points is governed
by three
pieces of quadratic curves that constrain the points to form a
physical lip
shape.
Narayanan et al., 2006, present a lip contour tracking algorithm using attractor-guided particle filtering. It is difficult to robustly track the lip contour because the lip contour is highly deformable and the contrast between skin and lip colors is very low, which often makes traditional blind segmentation-based algorithms fail to produce robust and realistic results.
The lip contour is constrained by the facial muscles; the
tracking
configuration space can then be represented by a lower
dimensional
manifold. They take some representative lip shapes as the
attractors in the
lower dimensional manifold. To resolve the low contrast problem,
they adopt
a color feature selection algorithm to maximize the discrimination between skin and lip colors. Then they integrate the shape priors and the
discriminative feature
into the attractor-guided particle filtering framework to track
the lip contour.
Nguyen et al., 2008, propose and evaluate a novel method for enhancing the performance of lip contour tracking, based on the concept of active shape models (ASM) and multiple features. On the first
image of the video sequence, the lip region is detected using Bayes' rule, in which lip color information is modeled by a Gaussian Mixture Model (GMM) trained by the Expectation-Maximization (EM) algorithm. The lip region is then used to initialize the lip
shape model. A single-feature ASM performs well only in particular conditions and gets stuck in local minima under noisy conditions. To enhance convergence, they propose using two features, normal profiles and grey-level patches, and combining them through a voting approach. The standard ASM is not able to take temporal information from previous frames into account; therefore the lip contours are tracked by replacing the standard ASM with a hybrid active shape model (HASM), which is capable of taking advantage of the temporal information.
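The Bayesian lip/non-lip decision underlying the initialization step can be sketched with single Gaussians standing in for the trained mixtures. The means, variances and prior below are illustrative assumptions; a real system would use an EM-trained GMM per class.

```python
import numpy as np

def gauss(x, mu, var):
    # Univariate Gaussian density (one mixture component per class here).
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def is_lip(hue, prior_lip=0.3):
    # Bayes' rule: label "lip" when the lip-class posterior dominates.
    lip = gauss(hue, mu=0.9, var=0.01) * prior_lip
    non = gauss(hue, mu=0.5, var=0.04) * (1 - prior_lip)
    return lip > non

print(is_lip(0.88), is_lip(0.45))   # True False
```

With a full mixture, each class density becomes a prior-weighted sum of such components, but the decision rule is unchanged.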
Ong Eng-Jon and Bowden, 2008, propose a learnt data-driven approach to the accurate, real-time tracking of lip shapes using
only
intensity information. This has the advantage that constraints such as a priori shape models or temporal models for dynamics are not required or
used. Tracking the lip shape is simply the independent tracking
of a set of
points that lie on the lip’s contour. This allows us to cope
with different lip
shapes that were not present in the training data and performs
as well as
other approaches that have pre-learnt shape models such as the
AAM.
Tracking is achieved via linear predictors, where each linear predictor
essentially linearly maps sparse template difference vectors to
tracked
feature position displacements. Multiple linear predictors are
grouped into a
rigid flock to obtain increased robustness. To achieve accurate
tracking, two
approaches are proposed for selecting relevant sets of LPs
within each flock.
Analysis of the selection results shows that the LPs selected for tracking a feature point choose areas that are strongly correlated with the tracked target, and that these areas are not necessarily the region around the feature point, as is commonly assumed in LK-based approaches.

Effective fusion of acoustic and visual modalities in speech recognition has been an important issue in human computer interfaces, warranting further improvements in intelligibility and robustness. Speaker lip
motion stands out
as the most linguistically relevant visual feature for speech
recognition. They
present a new hybrid approach to improve lip localization and
tracking,
aimed at improving speech recognition in noisy environments. It
begins with
a new color space transformation for enhancing lip segmentation.
In the
color space transformation, a PCA method is employed to derive a
new one
dimensional color space which maximizes discrimination between
lip and
non-lip colors. Intensity information is also incorporated in
the process to
improve contrast of upper and corner lip segments. In the
subsequent step,
a constrained deformable lip model with high flexibility is
constructed to
accurately capture and track lip shapes. The model requires only
six degrees
of freedom, yet provides a precise description of lip shapes
using a simple
least square fitting method. Experimental results indicate that
the proposed
hybrid approach delivers reliable and accurate localization and
tracking of lip
motions under various measurement conditions.
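The derivation of a one-dimensional colour axis by PCA, as in the hybrid approach above, can be sketched as follows. The sample colours are synthetic assumptions: RGB pixels from the two classes are pooled and projected onto the leading principal component.

```python
import numpy as np

np.random.seed(2)
# Hypothetical colour samples: reddish lip pixels vs skin pixels.
lip  = np.random.randn(100, 3) * 5 + [150, 70, 80]
skin = np.random.randn(100, 3) * 5 + [200, 160, 140]
pooled = np.vstack([lip, skin])

# Leading principal component of the pooled, centred colour samples.
centered = pooled - pooled.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
axis = vt[0]                       # the derived 1-D colour space

proj_lip, proj_skin = lip @ axis, skin @ axis
gap = abs(proj_lip.mean() - proj_skin.mean())
print(gap > 50)   # True: the 1-D axis separates the two colour classes
```

Here the between-class spread dominates the within-class spread, so the first principal component aligns with the lip/non-lip colour difference, which is the property the transform exploits for segmentation.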
Rohani et al., 2008, state that lip feature extraction is one of the most challenging tasks affecting lip reading systems' performance. They propose a
new approach for lip contour extraction based on fuzzy
clustering. The
algorithm employs a stochastic cost function to partition a
color image into
lip and non-lip regions such that the joint probability of the
two regions is
maximized. The mouth location is determined and then, lip region
is
preprocessed using pseudo hue transformation. Fuzzy c-means
clustering is
applied to each transformed image along with the b component of the CIELAB color space. To remove clustered pixels around the lip, an ellipse and
a Gaussian
mask were used. In order to show the performance of the proposed
method,
the pseudo-hue segmentation and fuzzy c-means clustering without preprocessing are compared. The compared methods were applied to
the
VidTIMIT and M2VTS databases and the results show the
superiority of the
proposed method in comparison with other methods.
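The pseudo-hue transformation used in the preprocessing step is commonly defined as the ratio r/(r+g), which is high on red-dominant lip pixels and lower on skin; the pixel values below are illustrative assumptions.

```python
import numpy as np

def pseudo_hue(rgb):
    # Pseudo-hue: red channel relative to red + green; the small
    # epsilon guards against division by zero on black pixels.
    r, g, _ = np.asarray(rgb, float)
    return r / (r + g + 1e-9)

print(round(pseudo_hue((180, 80, 90)), 2),    # lip-like pixel
      round(pseudo_hue((200, 160, 140)), 2))  # skin-like pixel
```

Thresholding or clustering this single scalar per pixel is much cheaper than working in full RGB, which is why it is a popular first stage for lip localization.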
Chin Siew Wen et al., 2009, present an automatic lip detection and tracking system based on a watershed segmentation approach. For
some of
the lips detection systems, skin / non-skin detection is a
prerequisite step to
localize the face region followed by detection of lip region. A
direct lips
detection technique using watershed segmentation without
needing
preliminary face localization is proposed. The watershed
algorithm segments
the input image into regions. Cubic spline interpolant lip color modeling and symmetry detection are used to detect the lip region from the
segmented regions. The position of the segmented lips is passed
to the
tracking system to predict the location of the lips in the
succeeding video
frame.
Hoai Bac et al., 2010, address a narrower problem, lip tracking, which is an essential step in providing visual lip data for a lip-reading system. Inspired by the idea of AVCSR, which combines visual features with audio features to increase accuracy in noisy environments, they use the AdaBoost algorithm and a Kalman filter for the face and lip detectors.
2.3 FACIAL TRACKING USING SPEECH
Leymarie and Levine, 1993, propose a method for segmenting a noisy intensity image and tracking a non-rigid object. A technique based on an
active
contour model commonly called a snake is examined. The technique
is
applied to cell locomotion and tracking studies. The snake
permits both the
segmentation and tracking problems to be simultaneously solved
in
constrained cases. A detailed analysis of the snake model,
emphasizing its
limitations and shortcomings, is presented, and improvements to
the original
description of the model are proposed. Problems of convergence
of the
optimization scheme are considered. In particular, an improved
terminating
criterion for the optimization scheme that is based on
topographic features
of the graph of the intensity image is proposed. Hierarchical
filtering
methods, as well as a continuation method based on a discrete scale-space representation, are discussed.
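The snake model discussed above can be illustrated with a toy gradient-descent sketch. The image force, step sizes and weights below are arbitrary assumptions: contour points move downhill on an image potential while an internal smoothness term keeps neighbours together.

```python
import numpy as np

target_y = 20.0                      # row of the (assumed) image edge
pts = np.full(15, 5.0)               # snake points start far from it

def image_force(y):
    # Toy external force pulling points toward the edge; a real snake
    # uses the gradient of an image-derived potential.
    return target_y - y

alpha, step = 0.5, 0.2
for _ in range(200):
    # Discrete second difference: the internal (smoothness) term.
    smooth = np.roll(pts, 1) + np.roll(pts, -1) - 2 * pts
    pts = pts + step * (image_force(pts) + alpha * smooth)
print(np.allclose(pts, target_y, atol=1e-3))   # True: converged onto the edge
```

The convergence problems analysed by Leymarie and Levine arise precisely when the external force field is noisy or has spurious minima, motivating their improved terminating criterion.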
Luettin Juergen et al., 1996, describe a robust method for
extracting
visual speech information from the shape of the lips to be used in automatic speech reading (lip reading) systems. Lip deformation is modelled by a statistically based deformable contour model which learns typical lip deformation from a training set. The main difficulty in locating
and tracking
lips consists of finding dominant image features for
representing the lip
contours. They describe the use of a statistical profile model
which learns
dominant image features from a training set. The model captures
global
intensity variation due to different illumination and different
skin reflectance
as well as intensity changes at the inner lip contour due to
mouth opening
and visibility of teeth and tongue. The method is validated for
locating and
tracking lip movements on a database of a broad variety of
speakers.
Kaucic and Blake, 1998, state that human speech is inherently multi-modal, consisting of both audio and visual components. Recently researchers have shown that the incorporation of information about the position
of the lips
into an acoustic speech recognizer enables robust recognition of noisy speech. In the case of Hidden Markov model recognition, they show that
this
happens because the visual signal stabilizes the alignment of
states. It is
also shown that unadorned lips, both the inner and outer
contours, can be
robustly tracked in real time on general purpose workstations.
To accomplish
this, efficient algorithms are employed which contain three key
components: shape models, motion models, and focused color feature
detectors, all of which are learnt from examples.
Lei et al., 2004, present a robust hierarchical lip
tracking
approach (RoHiLTA) for lip-reading and audio visual speech
recognition
(AVSR) applications. Lip regions of interest are subtly detected
by motion
and facial structure information. Improvements are made on
active shape
models (ASMs) for extracting lip contours more accurately and
efficiently
from video sequences of a speaker's talking face in natural
lighting
conditions and without particular make-up. Local and global ASM
search
algorithms are both improved by introducing color information,
2D mouth
corner match, and robust estimation. For noise-free features,
localization
errors are automatically corrected by an interpolating scheme. A
fast
implementation of the hierarchical approach is also proposed.
Extensive
experiments show that the improved ASM can effectively reduce
the lip
locating errors. The fast implementation of RoHiLTA can
consistently achieve
superior performance to conventional ASMs in lip tracking tasks,
and then
can be effectively integrated in lip-reading and AVSR
systems.
2.3 FACIAL TRACKING USING SKIN AND COLOR
Sobottka Karin and Pitas Ioannis, 1996, present a new approach for
automatically segmenting and tracking faces in color images.
The
segmentation of faces is done based on color and shape
information. By
searching for facial features, face hypotheses are verified.
Afterwards
tracking is performed by using an active contour model. This
ensures fast
processing and an increase in robustness for the face
recognition process.
The exterior forces of the active contour are defined based on
color features.
Results for tracking are shown for an image sequence consisting
of 150
frames.
Yang and Waibel, 1996, present a real-time face tracker. The
system
has achieved a rate of 30+ frames / second using an HP-9000
workstation
with a frame grabber and a Canon VC-C1 camera. It can track a
person’s
face while the person moves freely (e.g., walks, jumps, sits
down and stands
up) in a room. Three types of models have been employed in
developing the
system. They present a stochastic model to characterize
skin-color
distributions of human faces. The information provided by the
model is
sufficient for tracking a human face in various poses and views.
This model
is adaptable to different people and different lighting
conditions in real-time.
A motion model is used to estimate image motion and to predict
search
window. A camera model is used to predict and to compensate for
camera
motion. The system can be applied to teleconferencing and many
HCI
applications including lip-reading and gaze tracking. The
principle in
developing this system can be extended to other tracking
problems such as
tracking the human hand.
Jebara et al., 1997, describe automatic detection, modeling and
tracking of faces in 3D. A closed loop approach is proposed which
utilizes
structure from motion to generate a 3D model of a face and then
feedback
the estimated structure to constrain feature tracking in the
next frame. The
system initializes by using skin classification, symmetry
operations, 3D
warping and eigenfaces to find a face. Feature trajectories are
then
computed by SSD or correlation-based tracking. The trajectories
are
simultaneously processed by an extended Kalman filter to stably
recover 3D
structure, camera geometry and facial pose. Adaptively weighted
estimation
is used in this filter by modeling the noise characteristics of
the 2D image
patch tracking technique. The structural estimate is constrained
by using
parameterized models of facial structure (eigen-heads). The
Kalman filter's
estimate of the 3D state and motion of the face predicts the
trajectory of the
features which constrains the search space for the next frame in
the video
sequence. The feature tracking and Kalman filtering closed loop
system
operates at 30Hz.
Bradski Gary, 1998, describes a first step towards a perceptual
user
interface. A computer vision color tracking algorithm is
developed and
applied towards tracking human faces. The algorithm is based on
a robust
nonparametric technique for climbing density gradients to find
the mode of
probability distributions called the mean shift algorithm. The
mean shift
algorithm is modified to deal with dynamically changing color
probability
distributions derived from video frame sequences. The modified
algorithm is
called the continuously adaptive mean shift (CAMSHIFT)
algorithm.
CAMSHIFT’s tracking accuracy is compared against a Polhemus
tracker.
Bradski, 1998, develops computer vision algorithms that are
intended
to form part of a perceptual user interface. They must be able
to track in
real time yet not absorb a major share of computational
resources: other
tasks must be able to run while the visual interface is being
used. The new
algorithm developed is based on a robust nonparametric technique
for
climbing density gradients to find the mode (peak) of
probability
distributions called the mean shift algorithm. They want to find
the mode of
a color distribution within a video scene. The mean shift
algorithm is
modified to deal with dynamically changing color probability
distributions
derived from video frame sequences. The modified algorithm is
called the
Continuously Adaptive Mean Shift (CAMSHIFT) algorithm.
CAMSHIFT’s
tracking accuracy is compared against a Polhemus tracker.
Tolerance to
noise, distracters and performance is studied. CAMSHIFT is then
used as a
computer interface for controlling commercial computer games and
for
exploring immersive 3D graphic worlds.
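The core of CAMSHIFT is the mean shift step: repeatedly moving a search window to the centroid of the color-probability mass beneath it. A minimal sketch of that step on a back-projected probability image (the window-size adaptation that makes the algorithm "continuously adaptive" is omitted, and all names are illustrative):

```python
import numpy as np

def mean_shift_window(prob, x, y, w, h, n_iter=200):
    """Shift a w-by-h window until it sits on the local centroid of prob."""
    for _ in range(n_iter):
        x0, y0 = max(int(round(x)), 0), max(int(round(y)), 0)
        patch = prob[y0:y0 + h, x0:x0 + w]
        m00 = patch.sum()
        if m00 == 0:
            break                                  # no probability mass here
        ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
        cx = x0 + (xs * patch).sum() / m00         # centroid from image moments
        cy = y0 + (ys * patch).sum() / m00
        nx, ny = cx - w / 2.0, cy - h / 2.0        # re-center window on centroid
        if abs(nx - x) < 0.1 and abs(ny - y) < 0.1:
            break                                  # converged
        x, y = nx, ny
    return x, y
```

CAMSHIFT additionally rescales the window each frame from the zeroth moment of the probability under it, which lets the track follow a face that grows or shrinks in the image.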
Raja Yogesh et al., 1998, describe an approach for obtaining robust
detection and tracking of people in relatively unconstrained
dynamic scenes.
Gaussian mixture models were used to estimate probability
densities of color
for skin, clothing and background. These models were used to
detect, track
and segment people, faces and hands. A technique for dynamically
updating
the models to accommodate changes in apparent color due to
varying
lighting conditions was used. Two applications are highlighted:
(1) actor
segmentation for virtual studios and (2) focus of attention for
face and
gesture recognition systems.
Yang et al., 1998, state that a human face provides a variety
of
different communicative functions. They present approaches for
real-time
face / facial feature tracking and their applications. They
present techniques
of tracking human faces. It is revealed that human skin color
can be used as
a major feature for tracking human faces. An adaptive stochastic
model has
been developed to characterize the skin-color distributions.
Based on the
maximum likelihood method, the model parameters can be adapted
for
different people and different lighting conditions. The
feasibility of the model
has been demonstrated by the development of a real time face
tracker. They
then present a top-down approach for tracking facial features
such as eyes,
nostrils, and lip corners. These real-time tracking techniques
have been
successfully applied to many applications such as eye-gaze
monitoring, head
pose tracking, and lip-reading.
Jordao et al., 1999, describe a method for the detection and
tracking
of human face and facial features. Skin segmentation is learnt
from samples
of an image. After detecting a moving object, the corresponding
area is
searched for clusters of pixels with a known distribution. This
process is
quite insensitive to illumination changes. The face localization
procedure
looks for areas in the segmented area which resemble a head.
Using simple
heuristics, the located head is searched and its centroid is fed
back to a
camera motion control algorithm which tries to keep the face
centered in the
image using a pan-tilt camera unit. The system is capable of
tracking, in
every frame, the three main features of a human face. Since
precise eye
location is computationally intensive, an eye and mouth locator
using fast
morphological and linear filters is developed. This allows for
frame-by-frame
checking, which reduces the probability of tracking a non-basis
feature,
yielding a higher success ratio. Velocity and robustness are the
main
advantages of this fast facial feature detector.
Lihin, 2000, proposes an algorithm for speaker’s lip contour
extraction.
A color video sequence of speaker’s face is acquired, under
natural lighting
conditions and without any particular make-up. A logarithmic
color transform
is performed from RGB to HI (hue, intensity) color space. A
Bayesian
approach segments the mouth area using Markov random field
modeling.
Motion is combined with red hue lip information into a
spatiotemporal
neighborhood. Simultaneously, a region of interest and relevant
boundaries
points are automatically extracted. An active contour using
spatially varying
coefficients is initialized with the results of the
preprocessing stage. An
accurate lip shape with inner and outer borders is obtained with
good quality
results in this challenging situation.
Schwerdt and Crowley, 2000, discuss a robust tracking
technique
applied to histograms of intensity normalized color. This
technique supports
a video codec based on orthonormal basis coding. Orthonormal
basis coding
can be very efficient when the images to be coded have been
normalized in
size and position. However, an imprecise tracking procedure can
have a
negative impact on the efficiency and the quality of
reconstruction of this
technique, since it may increase the size of the required basis
space. The
method has greater stability, higher precision and less jitter,
over
conventional tracking techniques using color histograms.
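Tracking with histograms of intensity-normalized color is typically implemented by histogram back-projection: each pixel is replaced by the probability of its chromaticity under the face histogram, and the resulting probability image drives the tracker. A minimal sketch (the bin count and colors are illustrative assumptions):

```python
import numpy as np

def normalized_rg(img):
    """Intensity-normalized chromaticity: (r, g) = (R, G) / (R + G + B)."""
    s = img.sum(axis=2, keepdims=True) + 1e-9
    return img[..., :2] / s

def rg_histogram(patch, bins=32):
    """Normalized 2-D chromaticity histogram of a face training patch."""
    rg = normalized_rg(patch.astype(float)).reshape(-1, 2)
    hist, _, _ = np.histogram2d(rg[:, 0], rg[:, 1], bins=bins,
                                range=[[0, 1], [0, 1]])
    return hist / hist.sum()

def backproject(img, hist):
    """Probability image: each pixel's histogram value at its chromaticity."""
    bins = hist.shape[0]
    rg = normalized_rg(img.astype(float))
    idx = np.minimum((rg * bins).astype(int), bins - 1)
    return hist[idx[..., 0], idx[..., 1]]
```

Normalizing out intensity is what gives this kind of tracker its stability under the illumination changes discussed above, since only chromaticity enters the histogram.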
Zhang and Mersereau, 2000, state that the use of color
information
can significantly improve the efficiency and robustness of lip
feature
extraction capability over purely grayscale-based methods. Edge
information
provides another useful tool in characterizing lip boundaries.
They present a
method of integrating both types of information to address the
problem of lip
feature extraction for the purpose of speech reading. They first
examine
various color models and view hue as an effective descriptor to
characterize
the lips due to its invariance to luminance and human skin
color, and its
discriminative properties. They use prominent red hue as an
indicator to
locate the position of the lips. Based on the identified lip
area, they further
refine the interior and exterior lip boundary using both color
and spatial edge
information, where those two are combined within a Markov random
field
(MRF) framework.
Spors et al., 2001, present a face localization and tracking
algorithm which is based upon skin color detection and principal
component analysis
(PCA) based eye localization. Skin color segmentation is
performed using
statistical models for human skin color. The skin color
segmentation task
results in a mask marking the skin color regions in the actual
frame, which is
further used to compute the position and size of the dominant
facial region
utilizing a robust statistics-based localization method. To
improve the results
of skin color segmentation a foreground / background
segmentation and an
adaptive background update scheme were added. The derived face
position
is tracked with a Kalman filter.
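A constant-velocity Kalman filter of the kind commonly used for such face-position tracking can be sketched as follows; the state holds the position and velocity of the face center, and the matrices and noise levels below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a linear Kalman filter."""
    x = F @ x                         # predict state
    P = F @ P @ F.T + Q               # predict covariance
    y = z - H @ x                     # innovation (measurement residual)
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x = x + K @ y                     # corrected state
    P = (np.eye(len(x)) - K @ H) @ P  # corrected covariance
    return x, P
```

With state (px, py, vx, vy), a constant-velocity transition F adds the velocity to the position each frame, and H extracts the measured position supplied by the skin-color localization stage.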
Gargesha, 2002, states that existing techniques for facial feature point
detection
from color images include template matching, facial geometry and
symmetry
analysis, mathematical morphology, luminance and chrominance
analysis,
and PCA. These techniques are plagued by poor performance in the
presence
of scale variations. A hybrid technique is proposed that employs
a
combination of the above approaches along with curvature
analysis of the
intensity surface of the face image in order to provide a
superior
performance with reduced computational complexity, even in the
presence
of scale variations.
Perez et al., 2002, propose color-based trackers for drastically
shape
varying objects. The method relies on the deterministic search
of a window
whose color content matches a reference histogram color model.
Relying on
the same principle of color histogram distance, but within a
probabilistic
framework, they introduce a Monte Carlo tracking technique. The
use of a
particle filter allows them to better handle color clutter in
the background, as
well as complete occlusion of the tracked entities over a few
frames. The
probabilistic approach is very flexible and can be extended in a
number of
useful ways. In particular, they introduce the following
ingredients: multi-
part color modeling to capture a rough spatial layout ignored by
global
histograms, incorporation of a background color model when
relevant, and
extension to multiple objects.
Andreas et al., 2003, present a hierarchical realization of an
enhanced
active shape model for color video tracking and study the
performance of
both hierarchical and nonhierarchical implementations in the
RGB, YUV, and
HSI color spaces. Active shape models can be applied to tracking
non-rigid
objects in video image sequences.
Huang and Trivedi, 2004, state that human face analysis has been
recognized as a crucial part of intelligent systems. There are
several
challenges before robust and reliable face analysis systems can
be deployed
in real-world environments. One of the main difficulties is
associated with
the detection of faces with variations in illumination
conditions and viewing
perspectives. They present the development of a computational
framework
for robust detection, tracking and pose estimation of faces
captured by video
arrays. They discuss development of a multi primitive skin-tone
and edge-
based detection module integrated with a tracking module for
efficient and
robust detection and tracking. A multi-state continuous density
Hidden
Markov Model based pose estimation module is developed for
providing an
accurate estimate of the orientation of the face.
Varona et al., 2005, present a robust real-time 3D
tracking
system of human hands and face. This system can be used as a
perceptual
interface for virtual reality activities in a workbench
environment. In front of the virtual reality device, the user does
not need any type of marker or special suit.
The system includes a real time color segmentation module to
detect in real-
time the skin-color pixels present in the images. The results of
this skin-
color segmentation are skin-color blobs that are the inputs of a
data
association module that labels the blobs pixels using a set of
object state
hypothesis from previous frames. The 2D tracking results are
used for the
3D reconstruction of hands and face in order to obtain the 3D
positions of
these limbs. They present several results using the H-ANIM
standard to
show the system output performance.
Stasiak and Vicente-Garcia, 2010, develop a system for parallel face
detection, tracking and recognition in real-time video sequences.
The particle filtering is utilized for the purpose of combined
and effective
detection, tracking and recognition. Temporal information
contained in
videos is utilized. Fast, skin color-based face extraction and
normalization
technique is applied. Consequently, real-time processing is
achieved.
Implementation of face recognition mechanisms within the
tracking
framework is used for the purpose of identity recognition, and
to improve
the tracking robustness in case of multi-person tracking
scenarios. Face-to-
track assignment conflicts can often be resolved with the use of
motion
modeling, but motion-based conflict resolution can be erroneous.
Identity cues can be used to improve tracking quality. They describe
the concept of face tracking corrections with the use of an identity
recognition mechanism, implemented within a compact particle
filtering-based framework for face detection, tracking and recognition.
Shi and Tomasi, 1994, state that no feature-based vision system
can
work unless good features can be identified and tracked from
frame to
frame. Although tracking itself is by and large a solved
problem, selecting
features that can be tracked well and correspond to physical
points in the
world is still hard. They propose a feature selection criterion
that is optimal
by construction because it is based on how the tracker works,
and a feature
monitoring method that can detect occlusions, disocclusions, and
features
that do not correspond to points in the world. These methods are
based on a
new tracking algorithm that extends previous Newton-Raphson
style search
methods to work under affine image transformations. They test
performance
with several simulations and experiments.
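The selection criterion proposed by Shi and Tomasi keeps a window only if the smaller eigenvalue of its 2x2 gradient structure tensor is large, which in effect demands strong gradients in two independent directions. A minimal sketch of that score at a single pixel (function name and window size are illustrative):

```python
import numpy as np

def min_eigen_score(img, y, x, win=2):
    """Shi-Tomasi 'good feature' score: min eigenvalue of the structure tensor."""
    gy, gx = np.gradient(img.astype(float))
    sl = (slice(y - win, y + win + 1), slice(x - win, x + win + 1))
    a = (gx[sl] ** 2).sum()           # sum of Ix*Ix over the window
    b = (gx[sl] * gy[sl]).sum()       # sum of Ix*Iy
    c = (gy[sl] ** 2).sum()           # sum of Iy*Iy
    tr, det = a + c, a * c - b * b    # trace/determinant of [[a, b], [b, c]]
    return tr / 2 - np.sqrt(max(tr * tr / 4 - det, 0.0))
```

Corners score high; straight edges score near zero because one eigenvalue vanishes, which is exactly why edge points slide along the edge when tracked frame to frame.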
Black et al., 1995, explore the use of local parameterized
models of
image motion for recovering and recognizing the non-rigid and
articulated
motion of human faces. Parametric models are popular for
estimating motion
in rigid scenes. They observe that within local regions in space
and time,
such models not only accurately model non-rigid facial motions
but also
provide a concise description of the motion in terms of a small
number of
parameters. These parameters are intuitively related to the
motion of facial
features during facial expressions and show how expressions can
be
recognized from the local parametric motions in the presence of
significant
head motion. The motion tracking and expression recognition
approach
performs with high accuracy on movie sequences.
MacCormick and Blake, 1995, state that tracking multiple targets is a
challenging
problem, especially when the targets are identical, in the sense
that the
same model is used to describe each target. They present an
observation
density for tracking, which solves the problem by exhibiting a
probabilistic
exclusion principle. Exclusion arises naturally from a
systematic derivation of
the observation density, without relying on heuristics. They
present
partitioned sampling, a new sampling method for multiple object
tracking.
Partitioned sampling avoids the high computational load
associated with fully
coupled trackers, while retaining the desirable properties of
coupling.
Basu Sumit et al., 1996, describe a method for the robust
tracking of
rigid head motion from video. This method uses a 3D ellipsoidal
model of the
head and interprets the optical flow in terms of the possible
rigid motions of
the model. This method is robust to large angular and
translational motions
of the head and is not subject to the singularities of a 2D
model. The method
has been successfully applied to heads with a variety of shapes and
hair styles.
This method has the advantage of accurately capturing the 3D
motion
parameters of the head. The accuracy is shown through comparison
with a
ground truth synthetic sequence. The ellipsoidal model is robust
to small
variations in the initial fit, enabling the automation of the
model
initialization.
Darrell et al., 1996, demonstrate real-time face tracking and
pose
estimation in an unconstrained office environment with a camera.
Using
vision routines previously implemented for an interactive
environment they
determine the spatial location of a user’s head and guide an
active camera
to obtain images of the face. Faces are analyzed using a set of
Eigen spaces
indexed over both pose and world location. Closed loop feedback
from the
estimated facial location is used to guide the camera when a
face is present
in the frontal view.
Crowley James et al., 1997, describe a system which uses
multiple
visual processes to detect and track faces for video compression
and
transmission. The system is based on an architecture in which a
supervisor
selects and activates visual processes in a cyclic manner. Control
of visual
processes is made possible by a confidence factor which
accompanies each
observation. Fusion of results into a unified estimation for
tracking is made
possible by estimating a covariance matrix with each
observation. Visual
processes for face tracking are described using blink detection,
normalized
color histogram matching, and cross correlation (SSD and NCC).
Ensembles
of visual processes are organized into processing states so as
to provide
robust tracking. Transition between states is determined by
events detected
by processes. The result of face detection is fed into a recursive
estimator. The
output from the estimator drives a PD controller for a pan / tilt /
zoom camera.
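Cross-correlation tracking of the kind used by one of these visual processes can be sketched as a brute-force sum-of-squared-differences (SSD) search; this is illustrative only, since the real system also uses normalized correlation (NCC) and attaches a confidence factor to each observation:

```python
import numpy as np

def ssd_match(img, tmpl):
    """Return (row, col) of the window minimizing SSD against the template."""
    H, W = img.shape
    h, w = tmpl.shape
    best, best_rc = np.inf, (0, 0)
    for r in range(H - h + 1):          # exhaustive search over all windows
        for c in range(W - w + 1):
            d = img[r:r + h, c:c + w] - tmpl
            score = (d * d).sum()       # sum of squared differences
            if score < best:
                best, best_rc = score, (r, c)
    return best_rc
```

In practice the search is restricted to a small region predicted by the recursive estimator, which is what keeps such processes fast enough to run in the supervisor's cycle.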
Fieguth et al., 1997, develop a simple and very fast method for
object
tracking based exclusively on color information in digitized
video images.
Running on a Silicon Graphics R4600 Indy system with an Indy
cam, the
algorithm is capable of simultaneously tracking objects at full
frame size
(640 x 480) pixels and video frame rate 50fps. Robustness with
respect to
occlusion is achieved via an explicit hypothesis tree model of
the occlusion
process. They demonstrate the efficacy of their techniques in
the challenging
task of tracking people, especially tracking human head and
hands.
Oliver Nuria and Pentland, 1997, describe an active-camera
real-time
system for tracking, shape description, and classification of
the human face
and mouth using only an SGI Indy computer. The system is based
on use of
2-D blob features, which are spatially-compact clusters of
pixels that are
similar in terms of low-level image properties. Patterns of behavior
such as facial expressions and head movements can be classified in
real-time
using Hidden
Markov Model (HMM) methods. The system has been tested on
hundreds of
users and has demonstrated extremely reliable and accurate
performance.
Birchfield, 1998, presents an algorithm for tracking a person’s head. The
head’s
projection onto the image plane is modeled as an ellipse whose
position and
size are continually updated by a local search combining the
output of a
module concentrating on the intensity gradient around the
ellipse’s
perimeter with that of another module focusing on the color
histogram of the
ellipse’s interior. These two modules have roughly orthogonal
failure modes;
they serve to complement one another. The result is a robust,
real-time
system that is able to track a person’s head with enough
accuracy to
automatically control the camera’s pan, tilt, and zoom in order
to keep the
person centered in the field of view at a desired size.
Hager Gregory and Belhumeur, 1998, develop an efficient,
general
framework for object tracking which addresses different
complications. They
first develop a computationally efficient method for handling
the geometric
distortions produced by changes in pose. They then combine geometry
and
illumination into an algorithm that tracks large image regions
using no more
computation than would be required to track with no
accommodation for
illumination changes. They augment these methods with techniques
from
robust statistics and treat occluded regions on the object as
statistical
outliers. Throughout, they present experimental results
performed on live
video sequences demonstrating the effectiveness and efficiency
of their
methods.
Hager Gregory and Toyama, 1998, describe X Vision as a
small
set of image-level tracking primitives, and a framework for
combining
tracking primitives to form complex tracking systems. Efficiency
and
robustness are achieved by propagating geometric and temporal
constraints
to the feature detection level, where image warping and
specialized image
processing are combined to perform feature detection quickly and
robustly.
They present some of these applications as an illustration of
how useful,
robust tracking systems can be constructed by simple
combinations of a few
basic primitives combined with the appropriate task-specific
constraints.
Colmenarez, 1999, describes a system that extracts information from
video to keep track of people, recognize their facial expressions and
gestures, and complement other forms of human computer interfaces. A
learning technique
based on
information-theoretic discrimination is used to construct face
and facial
feature detectors. A real-time system for face and facial
feature detection
and tracking in continuous video is developed. A probabilistic
framework for
embedded face and facial expression recognition from image
sequences is
obtained.
Harold Hualu Wang and Chang, 1999, present Face Track, a
system
that detects, tracks, and groups faces from compressed video
data. They
introduce the face tracking framework based on the Kalman filter
and
multiple hypothesis techniques. They compare and discuss the
effects of
various motion models on tracking performance. They investigate
constant-
velocity, constant-acceleration, correlated-acceleration, and
variable-
dimension-filter models. They find that constant-velocity and
correlated-
acceleration models work more effectively for commercial videos
sampled at
high frame rates. They also develop novel approaches based on
multiple
hypothesis techniques to resolving ambiguity issues. Simulation
results show
the effectiveness of the proposed algorithms on tracking faces
in real
applications.
Vieux et al., 1999, use a face tracking system developed in the
robotics
area to normalize a video sequence to centered images of the
face. The
face-tracking allowed them to implement a compression scheme based
on
Principal Component Analysis (PCA), which they call Orthonormal
Basis
Coding (OBC).
Comaniciu et al., 2000, propose a new method for real-time
tracking
of non-rigid objects seen from a moving camera. The central
computational
module is based on the mean shift iterations and finds the most
probable
target position in the current frame. The dissimilarity between
the target
model and the target candidates is expressed by a metric
derived from the
Bhattacharyya coefficient. The theoretical analysis of the
approach shows
that it relates to the Bayesian framework while providing a
practical, fast
and efficient solution. The capability of the tracker to handle
in real-time
partial occlusions, significant clutter, and target scale
variations is
demonstrated for several image sequences.
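The similarity measure underlying this tracker, the Bhattacharyya coefficient between target and candidate color histograms, is simple to state; a minimal sketch (function names are illustrative):

```python
import numpy as np

def bhattacharyya_coeff(p, q):
    """Bhattacharyya coefficient between two histograms (1.0 = identical)."""
    p = np.asarray(p, dtype=float) / np.asarray(p, dtype=float).sum()
    q = np.asarray(q, dtype=float) / np.asarray(q, dtype=float).sum()
    return float(np.sqrt(p * q).sum())

def bhattacharyya_dist(p, q):
    """Distance derived from the coefficient; small when histograms match."""
    return float(np.sqrt(max(1.0 - bhattacharyya_coeff(p, q), 0.0)))
```

Minimizing this distance over candidate locations is what the mean shift iterations accomplish without an exhaustive search, which is why the tracker runs in real time.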
Feris Rogério Schmidt et al., 2000, present a real time system
for
detection and tracking of facial features in video sequences.
Such system
may be used in visual communication applications, such as
teleconferencing,
virtual reality, intelligent interfaces, human machine
interaction, and
surveillance. They have used a statistical skin-color model to
segment face-
candidate regions in the image. The presence or absence of a
face in each
region is verified by means of an eye detector, based on an
efficient
template matching scheme. Once a face is detected, the pupils,
nostrils and
lip corners are located and these facial features are tracked in
the image
sequence, performing real time processing.
Liu Zhu and Wang, 2000, propose a new approach for combined
face
detection and tracking in video. The face detection algorithm is
a fast
template matching procedure using iterative dynamic programming
(DP).
Schneiderman and Kanade, 2000, describe a statistical method
for 3D
object detection. They represent the statistics of both object
appearance and
“non-object” appearance using a product of histograms. Each
histogram
represents the joint statistics of a subset of wavelet
coefficients and their
position on the object. Their approach is to use many such
histograms
representing a wide variety of visual attributes. Using this
method, they
have developed the first algorithm that can reliably detect
human faces with
out-of-plane rotation and the first algorithm that can reliably
detect
passenger cars over a wide range of viewpoints.
Shan et al., 2001, present a model-based bundle adjustment
algorithm
to recover the 3D model of a scene / object from a sequence of
images with
unknown motions. Instead of representing scene / object by a
collection of
isolated 3D features (usually points), their algorithm uses a
surface
controlled by a small set of parameters. Compared with previous
model
based approaches, their approach has the following advantages.
Instead of
using the model space as a regularized, they directly use it as
their search
space, thus resulting in a more elegant formulation with fewer
unknowns
and fewer equations. Their algorithm automatically associates
tracked points
with their correct locations on the surfaces, thereby
eliminating the need for
a prior 2D-to-3D association. Regarding face modeling, they use
a very
small set of face metrics to parameterize the face geometry,
resulting in a
smaller search space and a better posed system.
Toyama and Blake, 2001, present a probabilistic paradigm for
visual
tracking. Probabilistic mechanisms are attractive because they
handle fusion
of information, especially temporal fusion, in a principled
manner. Exemplars
are selected as representatives of raw training data. They
represent
probabilistic mixture distributions of object configurations.
Their use avoids
tedious hand-construction of object models, and problems with
changes of
topology. Using exemplars in place of a parameterized model
poses several
challenges. It uses a noise model that is learned from training
data. It
eliminates any need for an assumption of probabilistic pixel-wise
independence.
Arulampalam et al., 2002, review both optimal and suboptimal
Bayesian algorithms for nonlinear / non-Gaussian tracking
problems, with a
focus on particle filters. Particle filters are sequential Monte
Carlo methods
based on point mass representations of probability densities,
which can be
applied to any state-space model and generalize the
traditional Kalman
filtering methods. Several variants of the particle filter such
as SIR, ASIR,
and RPF are introduced within a generic framework of the
sequential
importance sampling (SIS) algorithm and compared with the
standard EKF.
Chiang et al., 2003, present a real-time face detection
algorithm for
locating faces in images and videos. This algorithm finds not
only the face
regions, but also the precise locations of the facial components
such as eyes
and lips. The algorithm starts from the extraction of skin
pixels based upon
rules derived from a simple quadratic polynomial model. With a
minor
modification, this polynomial model is also applicable to the
extraction of
lips. The benefits of applying these two similar polynomial
models are
twofold. First, much computation time is saved. Second, both
extraction
processes can be performed simultaneously in one scan of the
image or
video frame. The eye components are then extracted after the
extraction of
skin pixels and lips. The algorithm removes the falsely
extracted components
by verifying with rules derived from the spatial and geometrical
relationships
of facial components. The precise face regions are determined
accordingly.
According to the experimental results, the proposed algorithm
exhibits
satisfactory performance in terms of both accuracy and speed for
detecting
faces with wide variations in size.
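Rule-based skin-pixel extraction of this general kind can be illustrated with a well-known explicit RGB rule; this is a classic illustrative rule, not the quadratic polynomial model of Chiang et al.:

```python
def is_skin_rgb(r, g, b):
    """Classic explicit RGB skin rule for uniform daylight illumination
    (illustrative only; Chiang et al. use a quadratic polynomial model)."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15   # enough color spread
            and abs(r - g) > 15 and r > g and r > b)
```

Because such rules are a handful of comparisons per pixel, both skin and lip extraction can indeed be evaluated in a single scan of the frame, as the paragraph above notes.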
Verma et al., 2003, present a probabilistic method for detecting
and
tracking multiple faces in a video sequence. The proposed method
integrates
the information of face probabilities provided by the detector
and the
temporal information provided by the tracker to produce a method
superior
to the available detection and tracking methods. They claim 1)
Accumulation
of probabilities of detection over a sequence. This leads to
coherent
detection over time and improves detection results. 2)
Prediction of the
detection parameters which are position, scale, and pose. This
guarantees
the accuracy of accumulation as well as a continuous detection.
3) The
representation of pose is based on the combination of two
detectors, one for
frontal views and one for profiles.
Zhou et al., 2003, propose a time series state space model
to
fuse temporal information in a probe video, which
simultaneously
characterizes the kinematics and identity using a motion vector
and an
identity variable, respectively. The joint posterior
distribution of the motion
vector and the identity variable is estimated at each time
instant and then
propagated to the next time instant. Marginalization over the
motion vector
yields a robust estimate of the posterior distribution of the
identity variable.
A computationally efficient sequential importance sampling (SIS)
algorithm
is developed to estimate the posterior distribution. Through the
propagation of the identity variable over time, degeneracy in the
posterior probability of the identity variable is achieved, giving
improved recognition. The gallery
is generalized
to videos in order to realize video-to-video recognition. An
exemplar-based
learning strategy is adopted to automatically select video
representatives
from the gallery, serving as mixture centers in an updated
likelihood
measure. The SIS algorithm is applied to approximate the
posterior
distribution of the motion vector, the identity variable, and
the exemplar
index, whose marginal distribution of the identity variable
produces the
recognition result. The model formulation is very general and it
allows a
variety of image representations and transformations.
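The joint-posterior idea above can be sketched with a toy sequential importance sampling loop. The one-dimensional "motion" state, the two-entry gallery, and the Gaussian likelihood below are illustrative stand-ins, not the model of the cited work:

```python
import random, math

# Minimal SIS sketch: each particle carries a motion state x and an
# identity index n; marginalizing the weighted particles over x gives
# p(identity | observations). Gallery, dynamics and likelihood are toy.

GALLERY = [0.0, 5.0]              # "appearance" of identities 0 and 1

def likelihood(obs, x, n):
    # Observation modeled as identity appearance shifted by motion (toy)
    return math.exp(-0.5 * ((obs - (GALLERY[n] + x)) ** 2))

def sis_step(particles, obs):
    new = []
    for x, n, w in particles:
        x2 = x + random.gauss(0.0, 0.1)            # propagate motion
        new.append((x2, n, w * likelihood(obs, x2, n)))
    total = sum(w for _, _, w in new) or 1.0
    return [(x, n, w / total) for x, n, w in new]  # normalize weights

def identity_posterior(particles):
    post = [0.0] * len(GALLERY)
    for _, n, w in particles:
        post[n] += w                               # marginalize over x
    return post

random.seed(0)
particles = [(0.0, n, 1.0) for n in range(2) for _ in range(200)]
for obs in [5.1, 4.9, 5.0]:        # frames resembling identity 1
    particles = sis_step(particles, obs)
print(identity_posterior(particles))   # mass concentrates on identity 1
```

As the frames accumulate, the marginal posterior of the identity variable concentrates on the matching gallery entry, which is the degeneracy-driven recognition behavior described above.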
Okuma Kenji et al., 2004, introduce a vision system that is
capable of
learning, detecting and tracking the objects of interest. The
system is
demonstrated in the context of tracking hockey players using
video
sequences. Their approach combines the strengths of two
successful
algorithms: mixture particle filters and Adaboost. The mixture
particle filter
is ideally suited to multi-target tracking as it assigns a
mixture component to
each player. The crucial design issues in mixture particle
filters are the
choice of the proposal distribution and the treatment of objects
leaving and
entering the scene. They construct the proposal distribution
using a mixture
model that incorporates information from the dynamic models of
each player
and the detection hypotheses generated by Adaboost. The learned
Adaboost
proposal distribution allows us to quickly detect players
entering the scene,
while the filtering process enables us to keep track of the
individual players.
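The mixed proposal distribution can be sketched as follows; the mixture weight and the Gaussian spreads are illustrative parameters, not those of the cited system:

```python
import random

# Sketch of a mixture proposal: with probability alpha a particle is
# drawn around an Adaboost detection hypothesis, otherwise from the
# player's dynamic model. All parameters are illustrative.

def propose(prev_x, detection, alpha=0.3):
    if detection is not None and random.random() < alpha:
        return random.gauss(detection, 2.0)    # detection-driven sample
    return random.gauss(prev_x, 5.0)           # dynamics-driven sample

random.seed(1)
samples = [propose(100.0, 160.0) for _ in range(1000)]
near_detection = sum(1 for s in samples if abs(s - 160.0) < 10)
# roughly alpha of the samples cluster near the new detection,
# so a player entering the scene is picked up quickly
```

The detection component lets particles jump to newly detected players, while the dynamics component maintains the existing tracks.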
Perez, 2004, notes that the effectiveness of probabilistic tracking
of objects in image sequences has been revolutionized by the
development of particle filtering. Whereas Kalman filters are
restricted to Gaussian distributions, particle filters can propagate
more general distributions, albeit only approximately.
This is of particular benefit in visual tracking because of the
inherent
ambiguity of the visual world that stems from its richness and
complexity.
One important advantage of the particle filtering framework is
that it allows
the information from different measurement sources to be fused
in a
principled manner. They introduce generic importance sampling
mechanisms
for data fusion and discuss them for fusing color with either
stereo sound,
for teleconferencing, or with motion, for surveillance with a
still camera.
They show how each of the three cues can be modeled by an
appropriate
data likelihood function, and how the intermittent cues (sound
or motion)
are best handled by generating proposal distributions from their
likelihood
functions. The effective fusion of the cues by particle
filtering is
demonstrated on real teleconference and surveillance data.
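The principled fusion of measurement sources can be sketched in one line of particle-filter arithmetic: assuming the cues are conditionally independent given the state, the particle weight is the product of the per-cue likelihoods. The Gaussian likelihood models below are illustrative assumptions:

```python
import math

# Sketch of cue fusion in a particle filter: the weight of a particle
# at state x is the product of per-cue likelihoods; intermittent cues
# (sound or motion) simply drop out of the product when absent.

def gauss_like(z, x, sigma):
    return math.exp(-0.5 * ((z - x) / sigma) ** 2)

def fused_weight(x, color_obs, motion_obs):
    w = gauss_like(color_obs, x, 5.0)          # color cue
    if motion_obs is not None:                 # intermittent cue
        w *= gauss_like(motion_obs, x, 10.0)
    return w

# A particle consistent with both cues outweighs one matching neither
w_both = fused_weight(50.0, color_obs=51.0, motion_obs=52.0)
w_off  = fused_weight(80.0, color_obs=51.0, motion_obs=52.0)
```

Generating proposals from the intermittent cue's likelihood, as the authors suggest, then amounts to sampling candidate states where that product is likely to be large.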
Vacchetti et al., 2004, propose an efficient real-time
solution for
tracking rigid objects in 3D using a single camera that can
handle large
camera displacements, drastic aspect changes, and partial
occlusions. While
commercial products are already available for offline camera
registration,
robust online tracking remains an open issue because many
real-time
algorithms described in the literature still lack robustness and
are prone to
drift and jitter. To address these problems, they have
formulated the
tracking problem in terms of local bundle adjustment and have
developed a
method for establishing image correspondences that can equally
well handle
short and wide baseline matching. They then can merge the
information
from preceding frames with that provided by a very limited
number of key
frames created during a training stage, which results in a
real-time tracker
that does not jitter or drift and can deal with significant
aspect changes.
Dong-gil Jeong et al., 2005, propose a robust real-time head
tracking
algorithm using a pan-tilt-zoom camera. They assume the shape of
a head is
an ellipse and a model color histogram is acquired in advance.
In the first
frame, the appropriate position and scale of the head is
determined based
on the user input. In the subsequent frames, the initial
position is selected
at the same position of the ellipse as in the previous frame.
The mean shift
procedure is applied to make the ellipse position converge to
the target
center where the color histogram similarity to the model and
previous one is
maximized. The previous histogram means a color histogram
adaptively
extracted from the result of the previous frame. The
position-adjusted ellipse
is refined by using color and shape information. Large
background motion
often prohibits the initial position from converging to the
target position.
They estimate a robust initial position by compensating for the
background motion, using vertical and horizontal 1-D projection
datasets. Extensive
experiments prove that a head is well tracked even when the
person moves
fast and the scale of the head changes drastically.
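The histogram-similarity core of this mean shift procedure can be sketched with the Bhattacharyya coefficient; the four-bin histograms below are toy data, not measurements from the cited system:

```python
import math

# Sketch of the similarity measure driving the mean shift iteration:
# the Bhattacharyya coefficient between the model head histogram and
# the histogram sampled inside a candidate ellipse.

def bhattacharyya(p, q):
    """Similarity between two normalized histograms (1.0 = identical)."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def normalize(h):
    s = sum(h) or 1.0
    return [v / s for v in h]

model      = normalize([10, 40, 30, 20])   # model color histogram
candidate  = normalize([12, 38, 28, 22])   # histogram at current ellipse
background = normalize([40, 5, 5, 50])     # histogram off the head

sim_head = bhattacharyya(model, candidate)
sim_bg   = bhattacharyya(model, background)
# mean shift moves the ellipse toward the position maximizing this
# similarity; here sim_head is close to 1 while sim_bg is much lower
```

The adaptively updated "previous" histogram mentioned above would simply replace or blend into `model` after each frame.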
Fidaleo Douglas et al., 2005, provide an extensive analysis of
a state-
of-the-art key frame based tracker: quantitatively demonstrating
the
dependence of tracking performance on underlying mesh accuracy,
number
and coverage of reliably matched feature points, and initial key
frame
alignment. 3D tracking of faces in video streams is a difficult
problem that
can be assisted with the use of a priori knowledge of the
structure and
appearance of the subject’s face at predefined poses (key
frames). Tracking
with a generic face mesh can introduce an erroneous bias that
leads to
degraded tracking performance when the subject’s out-of-plane
motion is far
from the set of key frames. To reduce this bias, they show how
online
refinement of a rough estimate of face geometry may be used to
re-estimate
the 3D key frame features, thereby mitigating sensitivities to
initial key
frame inaccuracies in pose and geometry. An in-depth analysis is
performed
on sequences of faces with synthesized rigid head motion.
Subsequent trials
on real video sequences demonstrate that tracking performance is
more
sensitive to initial model alignment and geometry errors when
fewer feature
points are matched and/or do not adequately span the face. The
analysis
suggests several indications for most effective 3D tracking of
faces in real
environments.
Hampapur et al., 2005, state that situation awareness is the key
to security. Awareness requires information that spans multiple
scales of space and time. Smart video surveillance systems are
capable of enhancing situational awareness across multiple scales of
space and time; at present, however, the component technologies are
evolving in isolation. To provide comprehensive, nonintrusive
situation awareness, it is imperative to address the challenge of
multi-scale, spatiotemporal tracking. This article explores the
concepts of multi-scale spatiotemporal tracking through the use of
real-time video analysis, active cameras, multiple object models, and
long-term pattern analysis to provide comprehensive situation
awareness.
Koterba Seth et al., 2005, study the relationship between multi-view
Active Appearance Model (AAM) fitting and camera calibration.
They propose
to calibrate the relative orientation of a set of N > 1
cameras by fitting an
AAM to sets of N images. They use the human face as a
(non-rigid)
calibration grid. The algorithm calibrates a set of 2 × 3 weak
perspective camera
projection matrices, projections of the world coordinate system
origin into
the images, depths of the world coordinate system origin, and
focal lengths.
Roy-Chowdhury et al., 2005, present two algorithms for 3D
face
modeling from a monocular video sequence. The first method is
based on
Structure from Motion (SFM), while the second one relies on
contour
adaptation over time. The SFM based method incorporates
statistical
measures of quality of the 3D estimate into the reconstruction
algorithm.
The initial multi-frame SFM estimate is smoothed using a generic
face model
in an energy function minimization framework. Such a strategy
avoids
excessively biasing the final 3D estimate towards the generic
model. The
second method relies on matching a generic 3D face model to the
outer
contours of a face in the input video sequence, and integrating
this strategy
over all the frames in the sequence. It consists of an
edge-based head pose
estimation step, followed by global and local deformations of
the generic
face model in order to adapt it to the actual 3D face. This
contour adaptation
approach is able to separate the geometric subtleties of the
human head
from the variations in shading and texture and it does not rely
on finding
accurate point correspondences across frames.
Adam et al., 2006, present an algorithm for tracking an object
in a
video sequence. The template object is represented by multiple
image
fragments or patches. The patches are arbitrary and are not
based on an
object model. Every patch votes on the possible positions and
scales of the
object in the current frame, by comparing its histogram with
the
corresponding image patch histogram.
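The fragment-voting idea can be sketched as follows. The two-bin histograms and the median-based combination rule below are illustrative simplifications of the robust vote statistic used in the cited work:

```python
# Sketch of fragment-based voting: each template patch compares its
# histogram with the image patch at a candidate position, and the
# candidate with the best robust combined score wins. Toy data.

def patch_distance(h1, h2):
    """L1 distance between two patch histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def vote(template_patches, image_patches):
    """Score a candidate position by the median per-patch distance;
    the median makes the score robust to a few occluded patches."""
    dists = sorted(patch_distance(tp, ip)
                   for tp, ip in zip(template_patches, image_patches))
    return dists[len(dists) // 2]

template       = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
candidate_good = [[0.5, 0.5], [0.25, 0.75], [0.85, 0.15]]
candidate_bad  = [[0.1, 0.9], [0.7, 0.3], [0.4, 0.6]]

best = min([candidate_good, candidate_bad],
           key=lambda c: vote(template, c))
# the candidate whose patches match the template wins the vote
```

Because the score aggregates independent patch votes robustly, partial occlusion of the object corrupts only some of the patches without destroying the overall estimate.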
Dedeoglu et al., 2006, describe active appearance models (AAM)
as
compact representations of the shape and appearance of objects.
Fitting
AAMs to images is a difficult, non-linear optimization task.
Traditional
approaches minimize the L2 norm error between the model instance
and the
input image warped onto the model coordinate frame. While this
works well
for high resolution data, the fitting accuracy degrades quickly
at lower
resolutions. They show that a careful design of the fitting criterion
can overcome
many of the low resolution challenges. In the resolution-aware
formulation
(RAF), they explicitly account for the finite size sensing
elements of digital
cameras, and simultaneously model the processes of object
appearance
variation, geometric deformation, and image formation.
The Gauss-Newton gradient descent algorithm not only synthesizes
model instances as a function of the estimated parameters, but also
simulates the formation of low resolution
images in a digital camera. They compare the RAF algorithm
against a state-
of-the-art tracker across a variety of resolution and model
complexity levels.
Fonseca Pedro Miguel et al., 2006, state that a compressed-domain
generic object tracking algorithm offers, in combination with a
face detection
algorithm, a low-computational-cost solution to the problem of
detecting
and locating faces in frames of compressed video sequences (such
as MPEG-
1 or MPEG-2). Objects such as faces can thus be tracked through
a
compressed video stream using motion information provided by
existing
forward and backward motion vectors. The described solution
requires only
low computational resources on CE devices and offers at one and
the same
time sufficiently good location rates.
Lu Le and Dai Xiangtan, 2006, present a hybrid sampling solution
that combines RANSAC and particle filtering. RANSAC provides proposal
particles
that, with high probability, represent the observation
likelihood. Both
conditionally independent RANSAC sampling and boosting-like
conditionally
dependent RANSAC sampling are explored. They show that the use
of
RANSAC-guided sampling reduces the necessary number of particles
to
dozens for a full 3D tracking problem. The algorithm has been
applied to the
problem of 3D face pose tracking with changing expression.
They
demonstrate the validity of the approach with several video
sequences acquired
in an unstructured environment.
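The RANSAC-guided sampling can be sketched in a simplified form. A one-dimensional "pose" fitted from noisy scalar measurements stands in for the full 3D pose problem, and all sizes and tolerances below are illustrative:

```python
import random

# Sketch of RANSAC-guided particle proposal: hypotheses fitted from
# random minimal subsets of measurements seed the particle set, so
# particles start near the modes of the observation likelihood.

def ransac_hypotheses(measurements, n_hyp=10, seed_size=2):
    hyps = []
    for _ in range(n_hyp):
        subset = random.sample(measurements, seed_size)
        hyps.append(sum(subset) / seed_size)   # minimal-set "pose" fit
    return hyps

def score(h, measurements, tol=1.0):
    """Inlier count of a hypothesis."""
    return sum(1 for m in measurements if abs(m - h) < tol)

random.seed(2)
meas = [random.gauss(30.0, 0.3) for _ in range(20)] + [90.0, -5.0]
hyps = ransac_hypotheses(meas)                 # includes outliers
best = max(hyps, key=lambda h: score(h, meas))
particles = [random.gauss(best, 0.5) for _ in range(30)]  # dozens suffice
```

Because the particles are concentrated around hypotheses that already explain the data, a few dozen of them cover the posterior that a blind proposal would need orders of magnitude more samples to reach, which is the reduction the authors report.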
Xu and Roy Chowdhury, 2007, present a theory for combining the
effects of motion, illumination, 3D structure, and camera parameters
in
a sequence of images obtained by a perspective camera. The set
of all
Lambertian reflectance functions of a moving object, at any
position,
illuminated by arbitrarily distant light sources, lies “close”
to a bilinear
subspace consisting of nine illumination variables and six
motion variables.
This result implies that, given an arbitrary video sequence, it
is possible to
recover the 3D structure, motion and illumination conditions
simultaneously
using the bilinear subspace formulation. The derivation builds
upon existing
work on linear subspace representations of reflectance by
generalizing it to
moving objects. Lighting can change slowly or suddenly, locally
or globally,
and can originate from a combination of point and extended
sources. They
experimentally compare the results of their theory with ground
truth data
and also provide results on real data by using video sequences
of a 3D face
and the entire human body with various combinations of motion
and
illumination directions. They show results of their theory in
estimating 3D
motion and illumination model parameters from a video
sequence.
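One hedged way to write the bilinear form described above (the exact notation of the cited derivation may differ) is, with l_i the nine spherical-harmonic illumination coefficients and m_j the six rigid-motion parameters,

```latex
I_{t+1}(\mathbf{u}) \;\approx\; \sum_{i=1}^{9} l_i\, b_i(\mathbf{u})
\;+\; \sum_{i=1}^{9}\sum_{j=1}^{6} l_i\, m_j\, c_{ij}(\mathbf{u}),
```

where b_i are the basis images of the linear Lambertian reflectance model and c_ij are bilinear basis images coupling illumination with motion. The first sum reproduces the static nine-dimensional illumination subspace, and the second captures the first-order effect of the six motion variables, so the image sequence lies "close" to the bilinear subspace spanned by the two sets of variables.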
Yu et al., 2007, propose a method to incrementally super-resolve
3D facial texture by integrating information frame by frame from
a video
captured under changing poses and illuminations. They recover
illumination,
3D motion and shape parameters from their tracking algorithm. This
This
information is then used to super-resolve the 3D texture using the
Iterative Back-Projection (IBP) method. The super-resolved texture is
fed back
to the
tracking part to improve the estimation of illumination and
motion
parameters. This closed-loop process continues to refine the
texture as new
frames come in. They also propose a local-region based scheme to
handle
non-rigidity of the human face.
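The back-projection refinement at the heart of IBP can be sketched in one dimension. The simple 2x averaging below is an illustrative stand-in for the full warp/blur/decimate imaging model of the cited method:

```python
# Sketch of an iterative back-projection (IBP) refinement step: the
# high-resolution estimate is corrected by back-projecting the error
# between the observed low-res frame and the simulated low-res image.

def downsample(hr):
    """Simulate imaging by 2x averaging (illustrative model)."""
    return [(hr[2 * i] + hr[2 * i + 1]) / 2.0 for i in range(len(hr) // 2)]

def upsample(lr):
    """Back-project a low-res signal by duplication."""
    out = []
    for v in lr:
        out += [v, v]
    return out

def ibp_step(hr, lr_obs, step=1.0):
    err = [o - s for o, s in zip(lr_obs, downsample(hr))]
    back = upsample(err)
    return [h + step * b / 2.0 for h, b in zip(hr, back)]

true_hr = [1.0, 3.0, 2.0, 6.0]
lr_obs = downsample(true_hr)          # observed low-res frame: [2.0, 4.0]
est = [0.0, 0.0, 0.0, 0.0]
for _ in range(20):                   # iterate until the simulated
    est = ibp_step(est, lr_obs)       # low-res image matches the frame
```

In the cited closed-loop system, each new frame contributes another such observation, and the refined texture in turn improves the illumination and motion estimates used to simulate the low-resolution images.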
Stasiak and Pacut, 2008, develop a system for parallel face
detection, tracking and recognition in real-time video sequences.
They describe its face detection and tracking modules. The solution
is based on particle filtering in the conditional density propagation
framework of Isard and Blake and utilizes color information at
different levels of detail.
The use of color makes processing computationally cheap and
robust in
finding candidates for further processing.
Suandi et al., 2008, describe a technique to estimate human face
pose from a color video sequence using a Dynamic Bayesian Network
(DBN). As face and facial feature trackers usually track the eyes,
pupils, mouth corners and skin region (face), their proposed method
utilizes merely three of these features, namely the pupils, mouth
center and skin region, to compute the evidence for DBN inference. No
additional image processing algorithm is required; thus, it is simple
and operates in real-time. The evidence, called the horizontal ratio
and vertical ratio, is determined using a model-based technique and
designed to simultaneously solve two problems in the tracking task:
scale factor and noise influence.
Valenti and Gevers, 2008, state that the ubiquitous application of
eye tracking is precluded by the requirement of dedicated and
expensive hardware, such as infrared high definition cameras. Systems
based solely on appearance have been proposed in the literature;
although these systems are able to successfully locate the eyes,
their accuracy is significantly lower than that of commercial eye
tracking devices. Their aim is to perform very accurate eye center
location and tracking using a simple webcam. By means of a novel
relevance mechanism, the proposed method makes use of isophote
properties to gain invariance to linear lighting changes, to achieve
rotational invariance and to keep computational costs low. They test
their approach for
accurate eye
location and robustness to changes in illumination and pose,
using the BioID
and the Yale Face B databases. They demonstrate that their
system can
achieve a considerable improvement in accuracy over state of the
art
techniques.
Yung et al., 2011, review the state-of-the-art progress on visual
tracking methods, classify them into different categories, and
identify future trends. Visual tracking is a fundamental task in many
computer vision applications and has been well studied in recent
decades; nevertheless, robust visual tracking remains a huge
challenge. Difficulties in visual tracking can arise due to abrupt
object motion, appearance pattern change, non-rigid object
structures, occlusion and camera motion. They first analyze the
state-of-the-art feature descriptors used to represent the appearance
of tracked objects. They then categorize the tracking methods into
three groups, provide detailed descriptions of representative methods
in each group, examine their positive and negative aspects, and
discuss the future trends for visual tracking research.
2.5 SUMMARY
This chapter has presented the various methods used for face
tracking in a continuous video. Face tracking based on local features
such as the eyebrows, lips and mouth, as well as on skin color, has
been presented. Chapter 3 presents feature extraction.