On the Multi-View Fitting and Construction of Dense Deformable Face Models Krishnan Ramnath CMU-RI-TR-07-10 May 2007 Submitted in partial fulfillment of the requirements for the degree of Master of Science The Robotics Institute Carnegie Mellon University Pittsburgh, Pennsylvania 15213 c 2007 by Krishnan Ramnath
Chapter 1
Introduction
Active Appearance Models (AAMs) [11, 12, 14, 13, 17], and the related concepts of Active
Blobs [34, 35] and Morphable Models [5, 25, 41], are generative models of a certain visual
phenomenon. AAMs are examples of statistical models that are used to characterize the
shape and the appearance of the underlying object by a set of model parameters. Though
AAMs are useful for other phenomena [34, 25], they are commonly used to model faces. In
a typical application, once an AAM has been constructed, the first step is to fit it to an
input image, i.e. model parameters are found to maximize the match between the model
instance and the input image. The model parameters can then be passed to a classifier.
Many different classification tasks are possible.
In this thesis we study three important topics related to deformable face models such as
AAMs: (1) we outline various techniques to simultaneously fit a 3D face model to multiple
images captured from multiple viewpoints, (2) we present a multi-view algorithm for 3D face
model construction, and (3) we present an automatic algorithm for dense deformable face
model construction.
1.1 Multi-View Face Model Fitting
Although AAMs were originally formulated as 2D, there are other deformable 3D models
(3D Morphable Models [5]) and AAMs have also been extended to 3D (2D+3D AAMs [44].)
A number of algorithms have been proposed to build deformable 3D face models and to fit
them efficiently [44, 33, 2, 37, 43, 32, 16]. Deformable 3D face models have a wide variety
of applications. Not only can they be used for tasks like pose estimation, which just require
the estimation of the 3D rigid motion, but also for tasks such as expression recognition and
lipreading, which require, explicitly or implicitly, estimation of the 3D non-rigid motion.
Most of the previous algorithms for AAM fitting and construction have been single-view.
One area that has not been studied much in the past (an exception is [15]) is the development
of simultaneous multi-view algorithms. Multi-view algorithms can potentially perform better
than single-view as they can take into account more visual information. In this thesis we
present multi-view algorithms to both fit and build 3D AAMs.
In the first part of this thesis we study multi-view fitting of AAMs. Fitting an AAM to
an image consists of minimizing the error between the input image and the closest model
instance; i.e. solving a nonlinear optimization problem. Face models are usually fit to a single
image of a face. In many application scenarios, however, it is possible to set up two or more
cameras and acquire simultaneous multiple views of the face. If we integrate the information
from multiple views, we can possibly obtain better application performance. For example,
Gross et al. [19] demonstrated improved face recognition performance by combining multiple
images of the same face captured from multiple widely spaced viewpoints. In Chapter 3, we
describe how a single AAM can be fit to multiple images, captured by cameras with arbitrary
locations, rotations, and response functions.
The main technical challenge is relating the AAM shape parameters in one view with
the corresponding parameters in the other views. This relationship is complex for a 2D
shape model but is straightforward for a 3D shape model. We use 2D+3D AAMs [44] in this
thesis. A 2D+3D AAM contains both a 2D shape model and a 3D shape model. Besides the
requirement of having a 3D shape model, the main advantage of using a 2D+3D AAM is that
2D+3D AAMs can be fit very efficiently in real-time [44]. Corresponding multi-view fitting
algorithms could also be derived for other 3D face models such as 3D Morphable Models [5].
We could easily have used a 3D Morphable Model instead to conduct the research in this
thesis, but the fitting algorithms would have been slower.
To generalize the 2D+3D fitting algorithm to multiple images, we use a separate set of
2D shape parameters for each image, but just a single, global set of 3D shape parameters
as represented in Figure 1.1. We impose the constraints that for each view separately,
the 2D shape model for that view must approximately equal the projection of the single
3D shape model. Imposing these constraints indirectly couples the 2D shape parameters
for each view in a physically consistent manner. Our algorithm can use any number of
cameras, positioned arbitrarily. The cameras can be moved and replaced with different
cameras without any retraining. The computational cost of the multi-view 2D+3D algorithm
is only approximately N times more than the single-view algorithm where N is the number
of cameras. In Section 3.1 we present a qualitative evaluation of our multi-view 2D+3D
fitting algorithm. We defer the quantitative evaluation to Section 3.9 where we also compare
it with a calibrated multi-view algorithm.
We also study how our multi-view fitting algorithm can be used for camera calibration.
The multi-view fitting algorithm of Section 3.1 uses the scaled orthographic imaging model
used by previous authors, and in the process of fitting computes, or calibrates, the scaled
orthographic camera matrices. In Section 3.3 we describe an extension of this algorithm to
calibrate weak perspective (or full perspective) camera models for each of the cameras. In
essence, both of these algorithms use the human face as a (non-rigid) calibration grid. Such
Figure 1.1: A representation of the experimental setup for multi-view 2D+3D AAM fitting. For
each view we have a separate set of 2D shape parameters and camera projection matrices, but just
a single, global set of 3D shape parameters and the associated global 3D rotation and translation.
Our fitting algorithm imposes the constraints that for each view separately, the 2D shape model
for that view must approximately equal the projection of the single 3D shape model.
an algorithm may be useful in a surveillance setting where we wish to install the cameras on
the fly, but avoid walking around the scene with a calibration grid.
The perspective algorithm requires at least two sets of multi-view images of the face at two
different locations. More images can be used to improve the accuracy if they are available.
We evaluate our algorithm by comparing it with an algorithm that uses a calibration grid
and show the performance to be roughly comparable.
We then show how camera calibration can improve the performance of multi-view face
model fitting. We present an extension of the multi-view AAM fitting algorithm of Sec-
tion 3.1 that takes advantage of calibrated cameras. We use the calibration algorithm of
Section 3.3 to explicitly provide calibration information to the multi-view fitting algorithm.
We demonstrate that this algorithm results in far better fitting performance than either the
single-view fitting (Chapter 2) or the uncalibrated1 multi-view fitting (Section 3.1) algo-
rithms. We consider two performance measures: (1) the robustness of fitting - the likelihood
of convergence for a given magnitude perturbation from the ground-truth, and (2) speed
of fitting - the average number of iterations required to converge from a given magnitude
perturbation from the ground-truth.
1.2 Multi-View Face Model Construction
In the second part of this thesis we study calibrated multi-view construction of AAMs. A
variety of non-rigid structure-from-motion algorithms have been proposed, both non-linear
[8, 40] and linear [9, 45, 46], that can be used for deformable 3D model construction from
both a single view [8, 9, 45, 46] and multiple views [40].
In most cases, it is only practical to apply face model construction algorithms to data
with relatively little pose variation. Tracking facial feature points becomes more difficult the
more pose variation there is. Unfortunately, single-view and multi-view algorithms such as
non-rigid structure-from-motion have a tendency to scale (stretch or compress) the face in
the depth-direction when applied to data with only medium amounts of pose variation. The
problem is not the algorithms themselves, but the Bas-Relief ambiguity between the camera
translation/rotation and the depth [47, 38, 36, 23]. The Bas-Relief ambiguity is normally
formulated in the case of rigid structure-from-motion, but applies equally in the non-rigid
1Note that for the uncalibrated multi-view algorithm described in Section 3.1, the calibration parameters are unknown and are estimated as a part of the optimization. For the calibrated multi-view fitting algorithm the calibration parameters are known and are obtained from a calibration algorithm (possibly the algorithm of Section 3.3.)
case. As empirically validated in Chapter 4, the result is a compressed/stretched face model,
which gives erroneous estimates of the 3D rigid and non-rigid motion.
One way to eliminate the ambiguity is to use a calibrated stereo rig instead of a single
camera. The known, fixed translation between the cameras then sets the scale and breaks the
ambiguity. The straightforward approach is to use stereo to build a static 3D model at each
time instant and then build the deformable model by modeling how the 3D shape changes
across time. Two algorithms that take this approach are [10, 18], one in the uncalibrated
case [10], the other in the calibrated case [18]. An alternative approach is to extend the non-
rigid structure-from-motion paradigm of [9, 8, 40, 45] and pose the face model construction
problem as a single large optimization over the unknown shape model modes, in essence
a large bundle adjustment. In Chapter 4 of this thesis we derive a calibrated multi-view
non-rigid motion-stereo algorithm [42, 48] to do exactly this. Our multi-view algorithm
explicitly incorporates the knowledge of the calibrated relative orientation of the cameras in
the stereo rig. In Section 4.5 we present qualitative results to validate these claims. We also
use the multi-view calibration algorithm described in Chapter 3 to quantitatively compare
the fidelity of 3D models.
1.3 Dense Face Model Construction
Deformable face models are generative parametric models that are used to model both rigid
and non-rigid deformations. The two best known examples of deformable models are Active
Appearance Models (AAMs) [12, 14, 13, 17, 26] and 3D Morphable Models (3DMMs) [5,
25, 33, 41, 8]. Although AAMs and 3DMMs are closely related, there are a number of
differences between them. One main difference between AAMs and 3DMMs is that AAMs
are typically sparse whereas 3DMMs are typically dense. This difference is mainly based
Figure 1.2: An illustration of the effects that make optical flow hard for human faces: (1) appear-
ance/disappearance of structures such as teeth and wrinkles (2) non-Lambertian, largely textureless
regions of skin such as the cheeks.
on how these models are constructed. AAMs are normally constructed from a collection
of training images of faces with a mesh of canonical feature points hand-marked on them.
Since the feature points are hand-marked, the correspondence can only be sparse. 3DMMs
are usually computed by running an optical flow algorithm to estimate the dense non-rigid
alignment of the texture maps [5]. AAMs could also potentially be constructed from dense
correspondence estimation using optical flow.
Computing dense alignment of face images using optical flow is difficult for a number of
reasons, as illustrated in Figure 1.2. Observe the appearance/disappearance of structures
such as teeth and wrinkles. Also note the non-Lambertian reflectance of textureless regions
such as the cheeks. Optical flow algorithms are not robust to such variations in the images.
Note, however, that the failure of the optical flow algorithm in the construction of 3DMMs
is hidden in the fact that most applications are graphics. The artifacts are hidden by the
texture mapping of the 3D mesh.
In the final part of this thesis we propose a different approach to building a dense de-
formable face model. Rather than assuming that dense correspondence can be computed
in a pre-processing step, our algorithm instead builds a dense model (Figure 1.3) by itera-
tively building a face model, fitting the model to image data and then refining the model.
There are three ways in which the model is refined: (1) by adding more mesh vertices, (2) by
changing the mesh connectivity using image-consistent surface triangulation [30], and (3) by
refining the shape modes using a modification of the algorithm in [4]. Although the goal of
our algorithm is to compute a dense model, note that in the process it implicitly computes
the dense correspondence that an optical flow algorithm would. However, as the model is
refined, it builds a model of visual effects such as the appearance/disappearance of structures
such as teeth and wrinkles, and also builds an implicit model of the illumination variation
manifested across the face, including large textureless regions such as the cheeks. This is the
reason our algorithm is able to outperform standard optical flow algorithms.
In Section 5.2 we show a number of results to illustrate that our densification algorithm
can be used to accurately build dense models. We first evaluate our algorithm quantitatively
with a set of ground-truth data using a form of hidden markers. We compare with a number
of popular optical flow algorithms [24, 27, 31] for the same task and find our algorithm to
be more robust and more accurate. We then perform comparisons to show improvement in
fitting robustness. We also present a number of tracking results to qualitatively illustrate
other aspects of our algorithm. Finally, we also show how our algorithm can be used to
construct dense deformable models automatically, starting with a rigid planar model of the
face that is subsequently refined to model the non-planarity and the non-rigid components.
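The iterative refinement loop described in this section can be sketched at a high level as follows. The model representation and helper names here are hypothetical stand-ins (the actual steps involve fitting a 2D+3D AAM, image-consistent retriangulation [30], and shape mode refinement based on [4]); each step only updates simple counters so the control flow is visible:

```python
# Skeleton of the iterative densification loop: fit, then refine in three ways.
# All helpers are hypothetical placeholders, not the thesis' actual operations.
def fit_model(model, image):
    return {"image": image}                  # stand-in for 2D+3D AAM fitting

def add_vertices(model):
    return dict(model, n_vertices=model["n_vertices"] + 4)  # (1) densify mesh

def retriangulate(model):
    return dict(model, n_retri=model["n_retri"] + 1)        # (2) image-consistent connectivity

def refine_shape_modes(model):
    return dict(model, n_refine=model["n_refine"] + 1)      # (3) refine shape modes

def densify(model, images, n_rounds=3):
    for _ in range(n_rounds):
        fits = [fit_model(model, img) for img in images]    # fit current model to data
        model = add_vertices(model)
        model = retriangulate(model)
        model = refine_shape_modes(model)
    return model

model = densify({"n_vertices": 68, "n_retri": 0, "n_refine": 0}, images=[0, 1, 2])
```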
Figure 1.3: An example dense mesh achieved using our densification algorithm. On the left we
show the initial sparse mesh as well as the mesh vertices. On the right we show the resulting
triangulated mesh as well as vertices after applying the densification algorithm.
Chapter 2
Background
In this section we review 2D Active Appearance Models (AAMs) [13] and 2D+3D Active
Appearance Models [44]. We also revisit the efficient inverse compositional fitting algo-
rithms [3, 44].
2.1 2D Active Appearance Models
The 2D shape s of a 2D Active Appearance Model is a 2D triangulated mesh. In particular,
s is a column vector containing the vertex locations of the mesh. AAMs allow linear shape
variation. This means that the 2D shape s can be expressed as a base shape s0 plus a linear
combination of m shape vectors si:
s = s_0 + \sum_{i=1}^{m} p_i s_i \qquad (2.1)
where the coefficients pi are the shape parameters. AAMs are normally computed from
training data consisting of a set of images with the shape mesh (hand) marked on them
[13]. The Procrustes alignment algorithm and Principal Component Analysis (PCA) are
then applied to compute the base shape s0 and the shape variation si [13]. An example
mesh is shown in Figure 2.1.
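As a concrete illustration of Equation (2.1), a shape instance is simply the base shape plus a parameter-weighted sum of the shape vectors. The sketch below uses a hypothetical 4-vertex mesh with made-up shape vectors, not values from any trained AAM:

```python
import numpy as np

# Toy model: a base mesh of 4 vertices (x1, y1, ..., x4, y4) stacked into a
# column vector, plus m = 2 shape vectors, as in Equation (2.1).
s0 = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
shape_vectors = np.stack([
    np.array([0.1, 0.0, -0.1, 0.0, -0.1, 0.0, 0.1, 0.0]),   # s1: horizontal squeeze
    np.array([0.0, 0.1, 0.0, 0.1, 0.0, -0.1, 0.0, -0.1]),   # s2: vertical shift
])

def shape_instance(p):
    """s = s0 + sum_i p_i * s_i  (Equation 2.1)."""
    return s0 + p @ shape_vectors

s = shape_instance(np.array([0.5, -0.2]))
```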
Figure 2.1: The 2D linear shape model of an AAM. The model consists of a triangulated base
mesh s0 plus a linear combination of m shape vectors si. The base mesh is shown on the left,
followed by the first three shape vectors s1, s2, and s3 overlaid on the base mesh.
The appearance of an AAM is defined within the base mesh s0. Let s0 also denote the
set of pixels u = (u, v)T that lie inside the base mesh s0, a convenient notational short-cut.
The appearance of the AAM is then an image A(u) defined over the pixels u ∈ s0. AAMs
allow linear appearance variation. This means that the appearance A(u) can be expressed
as a base appearance A0(u) plus a linear combination of l appearance images Ai(u):
A(u) = A0(u) +l∑
i=1
λi Ai(u) (2.2)
where the coefficients λi are the appearance parameters. The base (mean) appearance A0
and appearance images Ai are usually computed by applying Principal Component Analysis
to the shape normalised training images [13]. The appearance variation of an AAM is
illustrated in Figure 2.2.
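The construction described above, PCA on shape-normalised training images, can be sketched as follows; the training data here is synthetic random noise standing in for real shape-normalised images, so only the structure of the computation is meaningful:

```python
import numpy as np

# Sketch of how A0 and the Ai of Equation (2.2) are typically obtained:
# PCA on shape-normalised training images flattened to vectors.
rng = np.random.default_rng(0)
train = rng.normal(size=(20, 100))       # 20 synthetic "images", 100 pixels each
A0 = train.mean(axis=0)                  # base (mean) appearance
U, sing, Vt = np.linalg.svd(train - A0, full_matrices=False)
l = 3
A = Vt[:l]                               # top-l appearance images (orthonormal rows)

def appearance_instance(lam):
    """A(u) = A0(u) + sum_i lambda_i * A_i(u)  (Equation 2.2)."""
    return A0 + lam @ A
```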
Although Equations (2.1) and (2.2) describe the AAM shape and appearance variation,
they do not describe how to generate a model instance. The AAM model instance (Figure 2.3)
with shape parameters p and appearance parameters λi is created by warping the appearance
A from the base mesh s0 to the model shape mesh s. In particular, the pair of meshes s0
and s define a piecewise affine warp from s0 to s, denoted1 W(u;p) [28].
1Note that for ease of presentation we have omitted any mention of the 2D similarity transformation that
Figure 2.2: The 2D linear appearance model of an AAM. The model consists of a base appearance
A0 defined over all the pixels inside the base shape mesh s0 plus a linear combination of l appearance
vectors Ai.
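The piecewise affine warp W(u;p) used in model instantiation is defined triangle by triangle: each base-mesh triangle maps to its counterpart in s by the unique affine transform taking its three vertices to theirs. A minimal sketch with illustrative coordinates:

```python
import numpy as np

# Per-triangle affine warp underlying W(u;p): solve for the 2x3 affine that
# maps the three source vertices to the three destination vertices.
def affine_for_triangle(tri_src, tri_dst):
    """2x3 affine A such that A @ [u, v, 1] maps tri_src vertices to tri_dst."""
    src = np.vstack([tri_src.T, np.ones(3)])    # 3x3: columns are [u, v, 1]
    return tri_dst.T @ np.linalg.inv(src)       # 2x3 affine matrix

tri0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # triangle in s0
tri1 = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]])   # corresponding triangle in s

A = affine_for_triangle(tri0, tri1)

def warp_point(A, uv):
    """Warp a pixel (u, v) inside the source triangle."""
    return A @ np.array([uv[0], uv[1], 1.0])
```

In the full warp, each pixel u in s0 is first assigned to the triangle containing it, then warped with that triangle's affine transform.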
2.2 Fitting a 2D AAM to a Single Image
The goal of fitting a 2D AAM to a single input image I [28] is to minimize:
\sum_{u \in s_0} \Big[ A_0(u) + \sum_{i=1}^{l} \lambda_i A_i(u) - I(W(u;p)) \Big]^2 = \Big\| A_0(u) + \sum_{i=1}^{l} \lambda_i A_i(u) - I(W(u;p)) \Big\|^2 \qquad (2.3)
with respect to the 2D shape p and appearance λi parameters. In [28] it was shown that the
inverse compositional algorithm [3] can be used to optimize the expression in Equation (2.3).
The algorithm uses the “project out” algorithm [21, 28] to break the optimization into two
steps. The first step consists of optimizing:
\| A_0(u) - I(W(u;p)) \|^2_{\mathrm{span}(A_i)^\perp} \qquad (2.4)
with respect to the shape parameters p where the subscript span(Ai)⊥ means project the
vector into the subspace orthogonal to the subspace spanned by Ai, i = 1, . . . , l. The second
step consists of solving for the appearance parameters:
is used with an AAM to normalise the shape [13]. In this thesis we include the normalising warp in W(u;p) and the similarity normalisation parameters in p. See [28] for a description of how to include the normalising warp in W(u;p).
Figure 2.3: An example of an AAM model instance. The shape parameters p are used to create
the shape model s and the appearance parameters λi are used to create the appearance model A.
The model instance is then created by warping the appearance A from the base mesh s0 to the
model shape mesh s using the piecewise affine warp W(u;p) defined by the pair of meshes s0
and s.
\lambda_i = -\sum_{u \in s_0} A_i(u) \, [A_0(u) - I(W(u;p))] \qquad (2.5)
where the appearance vectors Ai are orthonormal. Optimizing Equation (2.4) itself can be
performed by iterating the following two steps. Step 1 consists of computing:
\Delta p = -H_{2D}^{-1} \Delta p_{SD} \quad \text{where} \quad \Delta p_{SD} = \sum_{u \in s_0} [SD_{2D}(u)]^T \, [A_0(u) - I(W(u;p))]
where the following two terms can be pre-computed (and combined) to achieve high
efficiency:
SD_{2D}(u) = \Big[ \nabla A_0 \, \frac{\partial W}{\partial p} \Big]_{\mathrm{span}(A_i)^\perp} \qquad H_{2D} = \sum_{u \in s_0} [SD_{2D}(u)]^T \, SD_{2D}(u)

where \nabla A_0 = \Big( \frac{\partial A_0}{\partial x}, \frac{\partial A_0}{\partial y} \Big).
Step 2 consists of updating the warp by composing with the inverse incremental warp:
W(u;p) ← W(u;p) ◦W(u; ∆p)−1 (2.6)
The resulting 2D AAM fitting algorithm runs at over 200 frames per second. See [28] for
more details.
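To make the project-out structure of Equations (2.4)-(2.6) concrete, here is a toy 1D analogue: the "image" is a 1D signal, the warp is a pure translation W(u;p) = u + p (so the inverse compositional update reduces to a subtraction), and a single appearance mode is projected out. All signals are synthetic; this illustrates the structure of the algorithm, not the thesis' implementation:

```python
import numpy as np

# 1D analogue of project-out inverse compositional fitting.
u = np.linspace(0.0, 2 * np.pi, 200)
A0 = np.sin(u)                           # base template
A1 = np.cos(2 * u)
A1 = A1 / np.linalg.norm(A1)             # orthonormal appearance basis (l = 1)

true_p, true_lam = 0.3, 0.5
def image(x):                            # I(x) = A0(x - p*) + lam* A1(x - p*)
    return np.interp(x - true_p, u, A0) + true_lam * np.interp(x - true_p, u, A1)

# Pre-compute the steepest-descent image and Hessian with A1 projected out.
grad_A0 = np.gradient(A0, u)             # dA0/du; dW/dp = 1 for translation
SD = grad_A0 - A1 * (A1 @ grad_A0)       # project into span(A1)-perp
H = SD @ SD

p = 0.0
for _ in range(30):
    err = A0 - image(u + p)              # error image A0(u) - I(W(u;p))
    dp = -(SD @ err) / H                 # Gauss-Newton step on Equation (2.4)
    p = p - dp                           # composition with inverse warp

lam = -A1 @ (A0 - image(u + p))          # Equation (2.5)
```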
2.3 2D+3D Active Appearance Models
Most deformable 3D face models, including 3D Morphable Models [5] and the models in [9, 8,
40, 45], use a 3D linear shape variation model, essentially equivalent to a 3D generalization of
the model in Section 2.1. The 3D shape s̄ is a 3D triangulated mesh which can be expressed
as a base shape s̄0 plus a linear combination of m̄ shape vectors s̄j:

\bar{s} = \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \qquad (2.7)

where the coefficients p̄j are the 3D shape parameters.
A 2D+3D AAM [44] consists of the 2D shape variation si of a 2D AAM governed by
Equation (2.1), the appearance variation Ai(u) of a 2D AAM governed by Equation (2.2),
and the 3D shape variation s̄j of a 3D AAM governed by Equation (2.7). The 2D shape
variation si and the appearance variation Ai(u) of the 2D+3D AAM are constructed exactly
as for a 2D AAM. The construction of the 3D shape variation s̄j is the subject of Chapter 4
of this thesis.
To generate a 2D+3D model instance, an image formation model is needed to convert
the 3D shape s̄ into a 2D mesh, onto which the appearance is warped. In [44] the following
scaled orthographic imaging model was used:

u = P_{so}\,x = \sigma \begin{bmatrix} i_x & i_y & i_z \\ j_x & j_y & j_z \end{bmatrix} x + \begin{pmatrix} o_x \\ o_y \end{pmatrix} \qquad (2.8)
where x = (x, y, z) is a 3D vertex location, (ox, oy) is an offset to the origin, σ is the scale
and the projection axes i = (ix, iy, iz) and j = (jx, jy, jz) are unit length and orthogonal:
i · i = j · j = 1; i · j = 0. The model instance is then computed by projecting every 3D shape
vertex onto a 2D vertex using Equation (2.8). The 2D appearance A(u) is finally warped
onto the 2D mesh (taking into account visibility) to generate the final model instance.
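The scaled orthographic model of Equation (2.8) is straightforward to sketch; the scale, axes, and vertices below are illustrative values only:

```python
import numpy as np

# Scaled orthographic projection (Equation 2.8): scale sigma, two orthonormal
# projection axes i and j, and a 2D origin offset o = (ox, oy).
def project_so(X, sigma, i_axis, j_axis, o):
    """Project Nx3 vertices X to Nx2 image points: u = sigma * [i; j] x + o."""
    P = np.stack([i_axis, j_axis])            # 2x3, rows unit length and orthogonal
    assert np.allclose(P @ P.T, np.eye(2))    # i.i = j.j = 1, i.j = 0
    return sigma * X @ P.T + o

X = np.array([[0.0, 0.0, 5.0], [1.0, 2.0, 3.0]])   # illustrative 3D vertices
uv = project_so(X, sigma=2.0,
                i_axis=np.array([1.0, 0.0, 0.0]),
                j_axis=np.array([0.0, 1.0, 0.0]),
                o=np.array([10.0, 20.0]))
```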
2.4 Fitting a 2D+3D AAM to a Single Image
The goal of fitting a 2D+3D AAM to an image I [44] is to minimize:

\Big\| A_0(u) + \sum_{i=1}^{l} \lambda_i A_i(u) - I(W(u;p)) \Big\|^2 + K \cdot \Big\| s_0 + \sum_{i=1}^{m} p_i s_i - P_{so} \Big( \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \Big) \Big\|^2 \qquad (2.9)

with respect to p, λi, P_so, and p̄, where K is a large constant weight. A pictorial representation of the 2D+3D AAM fitting is shown in Figure 2.4.
Equation (2.9) should be interpreted as follows. The first term in Equation (2.9) is the
2D AAM fitting criterion. The second term enforces the (heavily weighted, soft) constraints
that the 2D shape s equals the projection of the 3D shape s̄ with projection matrix P_so.
In [44] it was shown that the 2D AAM fitting algorithm [28] can be extended to a 2D+3D
AAM. The resulting algorithm still runs in real-time [29].
As with the 2D AAM algorithm, the “project out” algorithm [28] is used to break the
optimization into two steps, the first optimizing:
\| A_0(u) - I(W(u;p)) \|^2_{\mathrm{span}(A_i)^\perp} + K \cdot \sum_i F_i^2(p; P_{so}; \bar{p}) \qquad (2.10)
Figure 2.4: A representation of the 2D+3D AAM fitting algorithm. The fitting goal consists of
two terms: (1) the 2D fitting goal, and (2) the regularization term that enforces the 2D shape s
to equal the projection of the 3D shape s̄ with projection matrix P_so.
with respect to p, P_so, and p̄, where F_i(p; P_so; p̄) is the error inside the L2 norm in the
second term in Equation (2.9) for each of the mesh x and y vertices. The second step
solves for the appearance parameters using Equation (2.5). The 2D+3D algorithm has more
unknowns to solve for than the 2D algorithm. As a notational convenience, concatenate all
the unknown parameters into one vector q = (p; P_so; p̄). Optimizing Equation (2.10) is then
performed by iterating the following two steps. Step 1 consists of computing2:
\Delta q = -H_{3D}^{-1} \Delta q_{SD} = -H_{3D}^{-1} \left[ \begin{pmatrix} \Delta p_{SD} \\ 0 \end{pmatrix} + K \cdot \sum_i \Big( \frac{\partial F_i}{\partial q} \Big)^T F_i(q) \right] \qquad (2.11)
2To simplify presentation, in this thesis we omit the additional correction that needs to be made to Fi(p; Pso; p̄) to use the inverse compositional algorithm. See [44] for details.
where:
H_{3D} = \begin{bmatrix} H_{2D} & 0 \\ 0 & 0 \end{bmatrix} + K \cdot \sum_i \Big( \frac{\partial F_i}{\partial q} \Big)^T \frac{\partial F_i}{\partial q} \qquad (2.12)
Step 2 consists of first extracting the parameters p, P_so, and p̄ from q, then updating
the warp using Equation (2.6), and updating the other parameters P_so and p̄ additively [29].
Chapter 3
Multi-View 2D+3D AAM Fitting and
Camera Calibration
In the previous chapter we reviewed some of the efficient algorithms to fit an AAM to
a single image. If we have multiple, simultaneous, views of the face, the performance of
AAM fitting can be improved if we use all views. In this chapter we first describe an
algorithm to fit a single 2D+3D AAM simultaneously to multiple images. During fitting
we impose the constraints that for each view separately, the 2D shape model for that view
must approximately equal the projection of the single 3D shape model. Our algorithm can
use any number of cameras, positioned arbitrarily. We then show how our multi-view fitting
algorithm can be used for camera calibration. We describe an algorithm to calibrate weak
perspective (or full perspective) camera models for each of the cameras using the human
face as a (non-rigid) calibration grid. Finally we show how camera calibration can improve
the performance of multi-view face model fitting.
3.1 Multi-View 2D+3D AAM Fitting
Suppose that we have N images I^n, n = 1, . . . , N, of a face that we wish to fit the 2D+3D
AAM to. In this section we assume that the images are captured simultaneously by syn-
chronized, but uncalibrated cameras (see Section 3.9 for a calibrated algorithm.) The naive
algorithm is to fit the 2D+3D AAM independently to each of the images. This algorithm can
be improved upon by using the fact that, since the images In are captured simultaneously,
the 3D shape of the face is the same in all views. We therefore pose fitting a single 2D+3D
AAM to multiple images as minimizing:
\sum_{n=1}^{N} \left[ \Big\| A_0(u) + \sum_{i=1}^{l} \lambda_i^n A_i(u) - I^n(W(u;p^n)) \Big\|^2 + K \cdot \Big\| s_0 + \sum_{i=1}^{m} p_i^n s_i - P_{so}^n \Big( \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \Big) \Big\|^2 \right] \qquad (3.1)
simultaneously with respect to the N sets of 2D shape parameters p^n, the N sets of appearance
parameters λ_i^n (the appearance may be different in different images due to different
camera response functions, etc.), the N sets of camera matrices P^n_so, and the one, global
set of 3D shape parameters p̄. Note that the 2D shape parameters in each image are not
independent, but are coupled in a physically consistent1 manner through the single set of
3D shape parameters p̄. Optimizing Equation (3.1) therefore cannot be decomposed into
N independent optimizations. The appearance parameters λ_i^n can, however, be dealt with
using the “project out” algorithm [21, 28] in the usual way; i.e. we first optimize:
$$\sum_{n=1}^{N} \left\| A_0(\mathbf{u}) - I^n(\mathbf{W}(\mathbf{u};\mathbf{p}^n)) \right\|^2_{\mathrm{span}(A_i)^\perp} + K \cdot \left\| \mathbf{s}_0 + \sum_{i=1}^{m} p_i^n \mathbf{s}_i - \mathbf{P}^n_{so} \left( \bar{\mathbf{s}}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{\mathbf{s}}_j \right) \right\|^2 \tag{3.2}$$
with respect to p^n, P^n_{so}, and \bar{p}, and then solve for the appearance parameters:
¹Note that directly coupling the 2D shape models would be difficult due to the complex relationship between the 2D shape in one image and another. Multi-view face model fitting is best achieved with a 3D model. A similar algorithm could be derived for other 3D face models such as 3D Morphable Models [5]. The main advantage of using a 2D+3D AAM [44] is the fitting speed.
$$\lambda_i^n = -\sum_{\mathbf{u} \in \mathbf{s}_0} A_i(\mathbf{u}) \cdot \left[ A_0(\mathbf{u}) - I^n(\mathbf{W}(\mathbf{u};\mathbf{p}^n)) \right].$$

Organize the unknowns in Equation (3.2) into a single vector r = (p^1; P^1_{so}; . . . ; p^N; P^N_{so}; \bar{p}).
Also, split the single-view 2D+3D AAM terms from Equations (2.11) and (2.12) into parts
that correspond to the 2D image parameters (p^n and P^n_{so}) and the 3D shape parameters (\bar{p}):

$$\Delta \mathbf{q}^n_{SD} = \begin{pmatrix} \Delta \mathbf{q}^n_{SD,2D} \\ \Delta \mathbf{q}^n_{SD,\bar{p}} \end{pmatrix} \quad \text{and} \quad \mathbf{H}^n_{3D} = \begin{pmatrix} \mathbf{H}^n_{3D,2D,2D} & \mathbf{H}^n_{3D,2D,\bar{p}} \\ \mathbf{H}^n_{3D,\bar{p},2D} & \mathbf{H}^n_{3D,\bar{p},\bar{p}} \end{pmatrix}.$$
Optimizing Equation (3.2) can then be performed by iterating the following two steps.
Step 1 consists of computing:
$$\Delta \mathbf{r} = -\mathbf{H}^{-1}_{MV} \Delta \mathbf{r}_{SD} = -\mathbf{H}^{-1}_{MV} \begin{pmatrix} \Delta \mathbf{q}^1_{SD,2D} \\ \vdots \\ \Delta \mathbf{q}^N_{SD,2D} \\ \sum_{n=1}^{N} \Delta \mathbf{q}^n_{SD,\bar{p}} \end{pmatrix} \tag{3.3}$$

where:

$$\mathbf{H}_{MV} = \begin{pmatrix} \mathbf{H}^1_{3D,2D,2D} & 0 & \cdots & 0 & \mathbf{H}^1_{3D,2D,\bar{p}} \\ 0 & \mathbf{H}^2_{3D,2D,2D} & \cdots & 0 & \mathbf{H}^2_{3D,2D,\bar{p}} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \mathbf{H}^N_{3D,2D,2D} & \mathbf{H}^N_{3D,2D,\bar{p}} \\ \mathbf{H}^1_{3D,\bar{p},2D} & \mathbf{H}^2_{3D,\bar{p},2D} & \cdots & \mathbf{H}^N_{3D,\bar{p},2D} & \sum_{n=1}^{N} \mathbf{H}^n_{3D,\bar{p},\bar{p}} \end{pmatrix}.$$
Step 2 consists of extracting the parameters p^n, P^n_{so}, and \bar{p} from r, updating the
warp parameters p^n using Equation (2.6), and updating the other parameters P^n_{so} and \bar{p} additively.

The N image algorithm is very similar to N copies of the single image algorithm. Almost
all of the computation is simply replicated N times, one copy for each image. The only extra
computation is adding the N terms in the components of Δr_{SD} and H_{MV} that correspond to
the single set of global 3D shape parameters \bar{p}, inverting the matrix H_{MV}, and the matrix
multiply in Equation (3.3). Overall, the N image algorithm is therefore approximately N
times slower than the single image 2D+3D fitting algorithm. (It is more than N times slower
due to the larger matrix inversion and matrix multiplication steps, but in practice only slightly
so.)
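The two-step update above can be sketched in code. The following is a minimal illustration (not the thesis implementation) of assembling the block-sparse multi-view Hessian H_MV of Equation (3.3) and solving for the parameter update; the function name `multiview_update` and the block layout of its inputs are our own assumptions:

```python
import numpy as np

def multiview_update(H_blocks, q_blocks):
    """One multi-view Gauss-Newton step (sketch of Equation (3.3)).

    H_blocks: list of N per-view Hessians, each partitioned as
        (H_2D2D, H_2Dp, H_p2D, H_pp), following the split of Equation (2.12).
    q_blocks: list of N per-view steepest-descent updates (dq_2D, dq_p).
    Returns the stacked parameter update dr = -(H_MV)^-1 dr_SD.
    """
    n2d = H_blocks[0][0].shape[0]   # per-view 2D parameter count
    npp = H_blocks[0][3].shape[0]   # global 3D parameter count
    dim = len(H_blocks) * n2d + npp

    H = np.zeros((dim, dim))
    g = np.zeros(dim)
    for n, ((H22, H2p, Hp2, Hpp), (dq2, dqp)) in enumerate(zip(H_blocks, q_blocks)):
        i = n * n2d
        H[i:i + n2d, i:i + n2d] = H22      # block-diagonal 2D terms
        H[i:i + n2d, -npp:] = H2p          # 2D / 3D coupling blocks
        H[-npp:, i:i + n2d] = Hp2
        H[-npp:, -npp:] += Hpp             # 3D terms sum over the views
        g[i:i + n2d] = dq2
        g[-npp:] += dqp                    # 3D steepest-descent terms sum too
    return -np.linalg.solve(H, g)
```

Note that only the last block row and column couple the views; everything else is block diagonal, which is why the cost is close to N independent single-view updates.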
3.2 Experimental Results
An example of using our algorithm to fit a single 2D+3D AAM to three simultaneously
captured images² of a face is shown in Figure 3.1. For all the results in this chapter,
the translation and scale of the 2D face model in each view are initialized by hand and the
2D shape is set to the mean shape. However, 2D+3D AAMs can easily be initialized
with a face detector [29]. See the movie iterations.mov for the fitting video sequence.
The initialization is displayed in the top row of the figure, the result after 5 iterations in
the middle row, and the final converged result in the bottom row. In each case, all three
input images are overlaid with the 2D shape p^n plotted in dark dots. We also display the
recovered pose angles (roll, pitch, and yaw) extracted from the three scaled orthographic
camera matrices P^n_{so} in the top left of each image. Each camera computes a different relative
head pose, illustrating that the estimate of P^n_{so} is view dependent. The single 3D shape \bar{p}
for all views at the current iteration is displayed in the top-right of the center image. The
view-dependent camera projection of this 3D shape is also plotted as a white mesh overlaid
on the face.
Applying the multi-view fitting algorithm sequentially allows us to track the face simultaneously
in N video sequences. Some example frames of the algorithm being used to track a
face in a trinocular sequence are shown in Figure 3.2. We also include the movie tracking.mov
²Note that the input images for all experiments described in this thesis are chosen such that there is no occlusion of the face. For ways to handle occlusion in the input data see [20, 29].
The multi-view fitting algorithm in Chapter 3 uses the scaled orthographic image formation
model in Equation (2.8). A more powerful model when working with multiple cameras
(because it models the coupling between the scales across the cameras through the focal
lengths and average depths) is the weak perspective model:
$$\mathbf{u} = \mathbf{P}_{wp}(\mathbf{x}) = \frac{f}{o_z + \bar{z}} \begin{pmatrix} i_x & i_y & i_z \\ j_x & j_y & j_z \end{pmatrix} \mathbf{x} + \begin{pmatrix} o_u \\ o_v \end{pmatrix}. \tag{3.4}$$

In Equation (3.4), o_z is the depth of the origin of the world coordinate system and \bar{z} is
the average depth of the scene points measured relative to the world coordinate origin. The
"z" (depth) direction is k = i × j, where × is the vector cross product, i = (i_x, i_y, i_z), and
j = (j_x, j_y, j_z). The average depth relative to the world origin \bar{z} equals the average value of
k · x computed over all points x in the scene.
The weak perspective model is an approximation to the full perspective model:
$$\mathbf{u} = \mathbf{P}_{persp}(\mathbf{x}) = \begin{pmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} i_x & i_y & i_z & o_u \\ j_x & j_y & j_z & o_v \\ k_x & k_y & k_z & o_z \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} \tag{3.5}$$
where the depth of the scene k · x is assumed to be roughly constant \bar{z}. The calibration
parameters of the two perspective models in Equations (3.4) and (3.5) are interchangeable.
When evaluating the calibration results in Section 3.8 below we use the full perspective
model. In the calibrated fitting algorithms in Section 3.9 we use the weak perspective model
because it is reasonable to assume that the depth of the face is roughly constant, a common
assumption in many face modeling papers [33, 44].
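The two image formation models can be compared directly in code. The sketch below implements Equations (3.4) and (3.5); the function names `project_weak` and `project_full` are ours, not from the thesis. When the true depth k · x of a point equals the average depth z̄ and the image origin offsets are zero, the two projections coincide:

```python
import numpy as np

def project_weak(x, i, j, f, o_u, o_v, o_z, z_bar):
    # Weak perspective, Equation (3.4): one shared scale f / (o_z + z_bar)
    # applied to every point, then a 2D offset.
    s = f / (o_z + z_bar)
    return s * (np.stack([i, j]) @ x) + np.array([o_u, o_v])

def project_full(x, i, j, f, o_u, o_v, o_z):
    # Full perspective, Equation (3.5): homogeneous projection with a
    # per-point division by the depth k . x + o_z, where k = i x j.
    k = np.cross(i, j)
    h = np.array([f * (i @ x + o_u), f * (j @ x + o_v), k @ x + o_z])
    return h[:2] / h[2]
```

The weak perspective model replaces the per-point divisor k · x + o_z with the constant z̄ + o_z, which is exactly the "roughly constant depth" assumption stated above.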
3.4 Camera Calibration Goal
Suppose we have N cameras n = 1, . . . , N. The goal of our camera calibration algorithm
is to compute the 2 × 3 camera projection matrix (i, j), the focal length f, the projection
of the world coordinate system origin into the image (o_u, o_v), and the depth of the world
coordinate system origin o_z for each camera. If we superscript the camera parameters with
n, we need to compute P^n_{wp} = (i^n, j^n, f^n, o^n_u, o^n_v, o^n_z). There are 7 unknowns in P^n_{wp} (rather
than 10) because there are only 3 degrees of freedom in choosing the 2 × 3 camera projection
matrix (i, j) such that it is orthonormal.
3.5 Calibration using Two Time Instants
For ease of understanding, we first describe an algorithm that uses two sets of multi-view
images captured at two time instants. Deriving this algorithm also allows us to show that
two sets of images are needed and to derive the requirements on the motion of the face between
the two time instants. In Section 3.6 we describe an algorithm that uses an arbitrary number
of multi-view image sets, and in Section 3.7 another algorithm that poses calibration as a
single large optimization.
The uncalibrated multi-view fitting algorithm of Chapter 3 uses the scaled orthographic
camera matrices P^n_{so} in Equation (2.8) and optimizes over the N scale parameters σ^n. Using
Equation (3.4) instead of Equation (2.8) and optimizing over the focal lengths f^n and origin
depths o^n_z is ambiguous: multiple values of f^n and o^n_z yield the same value of

$$\sigma^n = \frac{f^n}{o^n_z + \bar{z}^n}.$$

However, the values of f^n and o^n_z can be computed by applying (a slightly modified version
of) the uncalibrated multi-view fitting algorithm a second time with the face at a different
location. With the first set of images we compute i^n, j^n, o^n_u, o^n_v. Suppose that σ^n = σ^n_1 is
the scale at this location. Without loss of generality we also assume that the face model is
at the world coordinate origin at this first time instant. Finally, without loss of generality
we assume that the mean value of x computed across the face model (both the mean shape \bar{s}_0
and all shape vectors \bar{s}_i) is zero. It follows that \bar{z} is zero and so:

$$\frac{f^n}{o^n_z} = \sigma^n_1. \tag{3.6}$$
Suppose that at the second time instant the face has undergone a global 3D rotation³ R and a
3D translation T. Both the rotation R and the translation T have three degrees of freedom. We
then perform a modified multi-view fit, minimizing:
$$\sum_{n=1}^{N} \left\| A_0(\mathbf{u}) + \sum_{i=1}^{l} \lambda_i^n A_i(\mathbf{u}) - I^n(\mathbf{W}(\mathbf{u};\mathbf{p}^n)) \right\|^2 + K \cdot \left\| \mathbf{s}_0 + \sum_{i=1}^{m} p_i^n \mathbf{s}_i - \mathbf{P}^n_{so} \left( \mathbf{R} \left( \bar{\mathbf{s}}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{\mathbf{s}}_j \right) + \mathbf{T} \right) \right\|^2 \tag{3.7}$$
with respect to the N sets of 2D shape parameters p^n, the N sets of appearance parameters
λ^n_i, the one global set of 3D shape parameters \bar{p}, the 3D rotation R, the 3D translation T,
and the N scale values σ^n = σ^n_2. In this optimization all of the camera parameters (i^n, j^n,
o^n_u, and o^n_v) except the scale σ^n in the scaled orthographic model P^n_{so} are held fixed to the
values computed at the first time instant. Since the object underwent a global translation
T, we have \bar{z}^n = k^n · T, where k^n = i^n × j^n is the z-axis of camera n. It follows that:

$$\frac{f^n}{o^n_z + \mathbf{k}^n \cdot \mathbf{T}} = \sigma^n_2. \tag{3.8}$$
Equations (3.6) and (3.8) are two sets of linear simultaneous equations in the 2N unknowns
(f^n and o^n_z). Assuming that k^n · T ≠ 0 (the global translation T is not perpendicular to
any of the camera z-axes), these two equations can be solved for f^n and o^n_z to complete the
camera calibration. Note also that to uniquely compute all three components of T using the
optimization in Equation (3.7), at least one pair of the cameras must be verged (the axes (i^n,
j^n) of the camera matrices P^n_{so} must not all span the same 2D subspace).

³Note that in the case of calibrated camera(s) it is convenient to think of the relative motion between the object and the camera(s) as the motion of the object R, T. In the single camera case (Equation 2.9) and the multiple cameras, single time instant case with uncalibrated camera matrix P (Equation 3.1), it is convenient to think of the relative motion as camera motion.
3.6 Multiple Time Instant Calibration Algorithm
Rarely are two sets of multi-view images sufficient to obtain an accurate calibration. The
approach just described can easily be generalized to T time instants. The first time instant
is treated as above and used to compute i^n, j^n, o^n_u, o^n_v and to impose the constraint on f^n
and o^n_z in Equation (3.6). Equation (3.7) is then applied to the remaining T − 1 time instants to
obtain additional constraints:

$$\frac{f^n}{o^n_z + \mathbf{k}^n \cdot \mathbf{T}^t} = \sigma^n_t \quad \text{for } t = 2, 3, \ldots, T \tag{3.9}$$

where T^t is the translation estimated at the t-th time instant and σ^n_t is the scale of the face
in the n-th camera at the t-th time instant. Equations (3.6) and (3.9) are then re-arranged to
obtain an over-constrained linear system which can then be solved to obtain f^n and o^n_z.
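The re-arrangement and solve can be sketched with standard least squares. The helper below is a hypothetical illustration, not the thesis implementation: each constraint σ_t = f / (o_z + k · T_t) rearranges to the linear equation f − σ_t o_z = σ_t (k · T_t) in the unknowns (f, o_z) for one camera:

```python
import numpy as np

def solve_focal_and_depth(sigmas, depths):
    """Recover f and o_z for one camera from Equations (3.6) and (3.9).

    sigmas: scales sigma_t estimated at each time instant t = 1..T
        (with the face at the world origin at t = 1).
    depths: d_t = k . T_t, the camera z-axis component of the head
        translation at each time instant (d_1 = 0).
    Each constraint sigma_t = f / (o_z + d_t) is linear in (f, o_z):
        f - sigma_t * o_z = sigma_t * d_t.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    depths = np.asarray(depths, dtype=float)
    A = np.stack([np.ones_like(sigmas), -sigmas], axis=1)
    b = sigmas * depths
    (f, o_z), *_ = np.linalg.lstsq(A, b, rcond=None)
    return f, o_z
```

With only one time instant the system is rank deficient (the scale ambiguity noted above); two or more instants with distinct scales make it solvable, and extra instants simply over-constrain the least-squares fit.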
3.7 Calibration as a Single Optimization
The algorithms in Sections 3.5 and 3.6 have the disadvantage of being two-stage algorithms:
first they solve for i^n, j^n, o^n_u, and o^n_v, and then for f^n and o^n_z. It is better to pose calibration
as the single large non-linear optimization of:
$$\sum_{n=1}^{N} \sum_{t=1}^{T} \left\| A_0(\mathbf{u}) + \sum_{i=1}^{l} \lambda_i^{n,t} A_i(\mathbf{u}) - I^{n,t}(\mathbf{W}(\mathbf{u};\mathbf{p}^{n,t})) \right\|^2 + K \cdot \left\| \mathbf{s}_0 + \sum_{i=1}^{m} p_i^{n,t} \mathbf{s}_i - \mathbf{P}^n_{wp} \left( \mathbf{R}^t \left( \bar{\mathbf{s}}_0 + \sum_{j=1}^{\bar{m}} \bar{p}^t_j \bar{\mathbf{s}}_j \right) + \mathbf{T}^t \right) \right\|^2 \tag{3.10}$$
summed over all cameras n and time instants t, with respect to the 2D shape parameters
p^{n,t}, the appearance parameters λ^{n,t}_i, the 3D shape parameters \bar{p}^t, the rotations R^t, the
translations T^t, and the calibration parameters i^n, j^n, f^n, o^n_u, o^n_v, and o^n_z. In Equation (3.10),
I^{n,t} is the image captured by the n-th camera at the t-th time instant, and the average
depth in P^n_{wp} is given by \bar{z} = k^n · T^t from Equation (3.4). Finally, we define the world coordinate
system by enforcing R^1 = I and T^1 = 0.
The expression in Equation (3.10) can be optimized by iterating two steps: (1) The
calibration parameters are optimized given the 2D shape and the (rotated and translated) 3D shape;
i.e. the second term in Equation (3.10) is minimized with the 2D shape, 3D shape, R^t,
and T^t held fixed. This optimization decomposes into a separate 7-dimensional optimization for each
camera. (2) A calibrated multi-view fit (see Section 3.9) is performed on each frame in
the sequence; i.e. the entire expression in Equation (3.10) is minimized, keeping the
calibration parameters in P^n_{wp} fixed and optimizing over the 2D shape, 3D shape, R^t,
and T^t. The entire large optimization can be initialized using the multiple time instant
algorithm in Section 3.6.
3.8 Empirical Evaluation of Calibration
We tested our calibration algorithms on a trinocular stereo rig. Two example images of the
1300 input images from each of the three cameras are shown in Figure 3.3. The complete
input sequence is included in the movie calib input.mov. We wish to compare our calibration
algorithm with an algorithm that uses a calibration grid. In Sections 3.8.1 and 3.8.2
we present results for the epipolar geometry. We compute a fundamental matrix from the
camera parameters i^n, j^n, f^n, o^n_u, o^n_v, and o^n_z estimated by our algorithm, and use the 8-point
algorithm [22] to estimate the fundamental matrix from the calibration grid data. In
Section 4.5.3 we present results for the camera focal length and relative orientation of the
cameras, while also comparing the 3D model building algorithms.
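For reference, the grid-based fundamental matrix estimate uses the normalized 8-point algorithm [22]. The sketch below is a generic textbook implementation under our own naming, not the thesis code:

```python
import numpy as np

def eight_point(u1, u2):
    """Normalized 8-point algorithm for the fundamental matrix.

    u1, u2: (K, 2) arrays of corresponding image points (K >= 8) such
    that [u2; 1]^T F [u1; 1] = 0 for the returned 3x3 matrix F.
    """
    def normalize(u):
        # Translate to the centroid and scale to mean distance sqrt(2).
        c = u.mean(axis=0)
        s = np.sqrt(2.0) / np.linalg.norm(u - c, axis=1).mean()
        T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
        uh = np.column_stack([u, np.ones(len(u))])
        return uh @ T.T, T

    x1, T1 = normalize(u1)
    x2, T2 = normalize(u2)
    # Each correspondence gives one row of the linear system A f = 0,
    # with f the row-major vectorization of F.
    A = np.column_stack([x2[:, [0]] * x1, x2[:, [1]] * x1, x1])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Enforce the rank-2 constraint on F.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalizing transformations.
    return T2.T @ F @ T1
```

The normalization step is what makes the linear estimate numerically stable; without it, pixel coordinates in the hundreds make the design matrix badly conditioned.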
and 4) motion-stereo are summarized in Figure 4.2. Note that the input to the NR-SFM is
generated by stacking together the image sequences from each of the three cameras. All four
algorithms therefore use exactly the same set of input image data.
For each model, we display the mean shape (s0) and the first two shape modes (s1,
s2) from two viewpoints to help the reader visualize the 3D structure. The main thing
to note in Figure 4.2 is how “stretched” the NR-SFM and the MV-SFM models are. The
depth (z) values of all of the points in the mean shape appear to have been scaled by a
constant multiplier. The underlying cause of this stretching is the Bas-Relief ambiguity which
occurs when applying (non-rigid) structure-from-motion to data with little pose variation
[47, 38, 36, 23]. The problem manifests itself for both linear (NR-SFM) [9, 8, 45] and
non-linear (MV-SFM) [40] algorithms. The MV-SFM model is slightly better than the NR-SFM
model but the ambiguity persists as the problem is in the data. (Because the problem is
an ambiguity, it is possible that by chance the scale may be chosen more accurately. The
chance of accurate estimation of scale increases the more pose variation there is, and the
less noise there is [47, 38, 36, 23].) The motion-stereo and stereo models do not suffer from
this problem. In the next section we present a quantitative comparison using the calibration
algorithm derived in Section 3.3.
[Figure 4.2 panels: rows NR-SFM, MV-SFM, Stereo, and Motion-Stereo; columns Mean Shape s0, Shape Mode s1, and Shape Mode s2.]
Figure 4.2: This figure shows the mean shape and first two shape modes of the single-view and
multi-view non-rigid structure-from-motion models, the stereo model and the motion-stereo model.
The main thing to note is that the non-rigid structure-from-motion models are “stretched” in the
depth direction.
4.5.3 Quantitative Comparison using Camera Calibration
In this section we quantitatively compare the performance of the four 3D face model con-
struction algorithms in terms of how well the resulting models can be used to perform camera
calibration using the algorithm in Section 3.7. One possible way of obtaining quantitative
results might be to capture range data as ground-truth. This approach, however, requires
(1) calibrating and (2) aligning the range data to the image data. Static range data also
cannot be used to evaluate the deformable 3D shape modes. Ideally, we would like a way of
evaluating the 3D fidelity of the face models using video data of a moving face.
[Figure 4.3 bar charts: "Relative Yaw Between Each Pair of Cameras" (top) and "Focal Length of Each Camera" (bottom), comparing NR-SFM, MV-SFM, Stereo, and Motion-Stereo against GT.]
Figure 4.3: A quantitative evaluation of the 3D fidelity of the models, obtained by using the models
to calibrate the cameras using the algorithm in Section 3.7. The results show the motion-stereo
algorithm to perform the best. The single-view non-rigid structure-from-motion model results in
estimates of the yaw and focal length that are both off by a large factor. The two error factors
are roughly the same. Using multi-view non-rigid structure-from-motion does help in reducing the
errors to a significant degree, but the results are still not as good as the motion-stereo model. GT
refers to the ground truth values computed using the Matlab camera calibration toolbox [7].
The algorithm in Section 3.7 is used to calibrate weak perspective camera matrices for a
set of stereo cameras using a 3D face model. By comparing the results of this algorithm with
ground-truth calibration data, we can indirectly measure the 3D fidelity of the face models.
The relative orientation component of the calibration primarily measures the pose estimation
accuracy of the algorithms, without any absolute head pose ground-truth. Estimating the
focal lengths and the epipolar geometry requires more than the relative orientation: accurate
focal lengths and epipolar geometry require accurate non-rigid 3D tracking of the face
over an extended sequence.
We implemented the multi-view single optimization calibration algorithm in Section 3.7
and compared the results with a calibration performed using a standard calibration grid and
the Matlab Camera Calibration Toolbox [7]. In Figure 4.3 we present results for the yaw
rotation (about the vertical axis) between each pair of the three cameras and for each of the
three focal lengths. The yaw between each pair of the three cameras was computed from
the relative rotation matrices of the three cameras. We include results for each of the four
models, and compare them to the ground-truth. The results in Figure 4.3 clearly show that the
motion-stereo algorithm performs the best. The results for the NR-SFM model are a long
way off: the yaw¹ is underestimated by a large factor, and the focal length overestimated by
a similar factor. Based on the results in Figure 4.2, this is to be expected. The face model is
too deep, so a medium amount of parallax is generated by too small a yaw angle. Similarly,
a scaling of the model is interpreted as too large a motion in the depth direction and so too
large a focal length. The MV-SFM model suffers from the same problem due to the
scaled nature of the model, albeit generating better results than the NR-SFM model. Overall,
the motion-stereo² algorithm clearly outperforms both of these algorithms and gives estimates
¹The results for the pitch and roll between each pair of cameras are omitted. The pitch and roll are very close to zero and so there is little difference between any of the algorithms.
²Since the motion-stereo algorithm is the best among the four algorithms that we compared, we used the motion-stereo model for all the fitting and calibration experiments described in the previous sections.
                      Relative Yaw                    Focal Length
               Cam 1-2  Cam 1-3  Cam 2-3      Cam 1    Cam 2    Cam 3
NR-SFM          62.1%    66.2%    68.9%       193.5%   201.7%   214.8%
MV-SFM           8.6%    18.8%    25.7%        30.9%    35.5%    41.2%
Stereo          30.2%    15.4%     5.5%        23.9%    18.2%    15.1%
Motion-Stereo   21.7%     7.8%     1.5%         8.7%     3.0%     1.1%
Table 4.1: This table summarizes the results presented in Figure 4.3. For each 3D model we
compute the percentage deviation of the relative “yaw” between each pair of cameras and focal
length of each camera from the ground-truth data (computed using the Matlab camera calibration
toolbox [7].) The motion-stereo model results in estimates of yaw and focal length that are both
comparable to the ground-truth values whereas the estimates from the non-rigid structure-from-
motion (NR-SFM) model are both off by a large factor. The multi-view non-rigid structure-from-
motion (MV-SFM) model performs better than the NR-SFM model but overall the motion-stereo
model performs the best.
of yaw and focal lengths that are comparable to ground-truth calibration data (computed
using the Matlab camera calibration toolbox [7].) To further emphasize this observation, we
compute the percentage deviation of the yaw and focal length estimates of each 3D model
from the ground-truth data. Although the bar graphs in Figure 4.3 may look similar, the
motion-stereo results for the focal length are several times better than the stereo or MV-SFM
results by the relative error measure in Table 4.1.
Chapter 5
Dense Face Model Construction
In this chapter we outline an algorithm to build dense Active Appearance Models (AAMs) [12,
14, 13, 17, 26]. Our algorithm builds a dense model by iteratively building a face model,
fitting the model to image data and then refining the model. In the following section we
detail the refinement process of the algorithm.
5.1 Model Densification
In this section we describe our algorithm to construct a dense AAM. There are two main
reasons why we work with AAMs rather than 3D Morphable Models (3DMMs) [5, 25, 33,
41, 8]: (1) it allows us to avoid the issue of 3D data and instead focus on the core model
refinement algorithm, and (2) we already have an implementation of AAMs in our lab. With some
work, our algorithm could be extended to 3DMMs. However, no conceptual advancement is
required to do so; just a re-application of the same ideas to the 3D range scan and texture
map data.
The input to our algorithm can come from two different sources: (1) the vertices of a
sparse AAM, or (2) the vertices output by a rigid tracker or a face detector. Our algorithm
then constructs a dense AAM by iterating three important steps: (1)
Model Construction (Section 2.1), (2) Model Fitting (Section 2.2), and (3) Model Refinement
(described in this section). A flow diagram of our algorithm is given in Figure 5.1. The model
refinement step is the key part of the algorithm. We refine the model in three different ways:
(1) we add more mesh vertices to the AAM, (2) we improve the mesh connectivity by re-
triangulating the mesh, and (3) we refine the shape modes of the AAM. We give a detailed
description of each of these steps in the following sections.
1. Adding mesh vertices: The first step in the iterative refinement process is to add
more mesh vertices. There are a number of ways to choose a mesh triangle, and also the
location within the triangle at which to add the points. We adopt a simple but effective way
that ensures we end up with similarly sized triangles (see Figure 5.2). At each iteration, we
look at the current mesh triangulation and choose the mesh triangle with the longest edge.
Once we have chosen the triangle, a new point is added at the mid-point of the longest edge.
By making sure that the longest edge keeps being reduced, we avoid the formation of "long
thin" triangles. Figure 5.3 illustrates the addition of two points to the mesh. To maintain
symmetry we add a pair of points simultaneously to both halves of the face mesh at each
step.
One extension of this algorithm might be to explore other heuristics to choose where to
add the new points such as choosing the triangle with the largest average coding error, and
trying to place points on structural discontinuities. However, it should be noted that as the
mesh gets more and more dense, the choice of a specific heuristic becomes less important as
there are vertices close to any point on the face.
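The longest-edge subdivision step can be sketched as follows; this is an illustrative 2D implementation under assumed data structures (a vertex array and a list of index triples), not the thesis code:

```python
import numpy as np

def split_longest_edge(verts, tris):
    """One densification step (sketch): find the longest edge in the mesh,
    add its mid-point as a new vertex, and split every triangle that
    contains that edge into two.

    verts: (V, 2) vertex array for the mean shape s0.
    tris:  list of (a, b, c) vertex-index triples.
    """
    # Find the longest edge over all triangles.
    best, best_len = None, -1.0
    for tri in tris:
        for e in [(tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])]:
            length = np.linalg.norm(verts[e[0]] - verts[e[1]])
            if length > best_len:
                best, best_len = tuple(sorted(e)), length
    # Add the mid-point of that edge as a new vertex.
    new = len(verts)
    verts = np.vstack([verts, (verts[best[0]] + verts[best[1]]) / 2.0])
    # Split each triangle containing the edge (two if it is interior).
    out = []
    for a, b, c in tris:
        if best == tuple(sorted((a, b))):
            out += [(a, new, c), (new, b, c)]
        elif best == tuple(sorted((b, c))):
            out += [(b, new, a), (new, c, a)]
        elif best == tuple(sorted((c, a))):
            out += [(c, new, b), (new, a, b)]
        else:
            out.append((a, b, c))
    return verts, out
```

Repeated calls always halve the current longest edge, which is what keeps the triangles similarly sized as the mesh densifies.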
2. Image Consistent Re-Triangulation: Once we have the new points in place we
improve the mesh connectivity by doing an image consistent re-triangulation. This step is
inspired by the work in [30]. We look at each pair of adjacent triangles and flip the common
[Figure 5.1 flow diagram with nodes: Initial Training Images, Sparse Landmark Points, Model Build, Model Fit, Model Refinement (Shape Mode Refinement, Add Points, Image Consistent Triangulation), Iterate, Final Dense Models.]
Figure 5.1: An overview of our deformable dense model construction algorithm. The algorithm
is initialized using a set of sparse hand-labeled mesh points. The algorithm then iterates through
model building, model fitting and model refinement steps to produce the dense model. The refine-
ment step is further split into refining the shape modes, adding mesh vertices and image consistent
re-triangulation.
edge. We look at the RMS model reconstruction error:

$$\sqrt{\frac{1}{T} \sum_{t=1}^{T} \sum_{\mathbf{u} \in \mathbf{s}_0} \left[ A_0(\mathbf{u}) + \sum_{i=1}^{l} \lambda^t_i A_i(\mathbf{u}) - I^t(\mathbf{W}(\mathbf{u};\mathbf{p})) \right]^2} \tag{5.1}$$
across the training data to determine whether the flip was optimal or not. We repeat this
step for each pair of adjacent triangles formed by the newly added points. In Figure 5.5
[Figure content — mesh vertex addition pseudo code:
(1) Initialize using a sparse triangulated mesh.
(2) Choose the triangle with the longest edge in the mean shape s0.
(3) Add a new point at the mid-point of the longest edge.
(4) Propagate the new point to all training images from the mean shape s0 using the current estimate of the warp W(u; p).

Figure content — image consistent re-triangulation pseudo code:
Initialize with the current mesh topology.
Repeat:
    Get the initial image RMS error.
    Repeat (for each pair of adjacent triangles that include the new point):
        Flip the common edge of the quadrilateral.
        Note the image RMS error.
    Get the edge flip corresponding to the minimum image RMS error.
    Flip the edge, checking for thin or flipped triangles.]
Figure 5.2: The algorithm to add mesh vertices.
Figure 5.3: A pair of images showing the mesh before and after adding two new mesh points to
the longest edges. The newly added mesh points and the edges are highlighted. Note that adding
the new vertices causes two adjacent triangles to split.
we show the mesh before and after performing image consistent re-triangulation. Note that,
to keep the mesh visually regular, we make sure that the symmetry of the mesh is
maintained. The algorithm for image consistent re-triangulation is presented in Figure 5.4.
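As a concrete illustration, the vertex-addition and edge-flip steps above can be sketched in a few lines of Python. This is a simplified sketch, not our actual implementation: the mesh is represented as a list of 2D vertex coordinates plus triangles as vertex-index triples, and `image_rms_error` is a hypothetical callback standing in for the image RMS coding error computed over the training data.

```python
import math

def longest_edge(tri, verts):
    """Return (length, i, j) for the longest edge of triangle tri."""
    best = (0.0, tri[0], tri[1])
    for a, b in ((tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])):
        d = math.dist(verts[a], verts[b])
        if d > best[0]:
            best = (d, a, b)
    return best

def split_longest_edge(verts, tris):
    """Densification step: add a vertex at the mid-point of the globally
    longest edge, splitting every triangle that shares that edge."""
    _, i, j = max(longest_edge(t, verts) for t in tris)
    verts.append(((verts[i][0] + verts[j][0]) / 2.0,
                  (verts[i][1] + verts[j][1]) / 2.0))
    m = len(verts) - 1
    new_tris = []
    for t in tris:
        if i in t and j in t:
            k = next(v for v in t if v not in (i, j))   # opposite vertex
            new_tris += [(i, m, k), (m, j, k)]
        else:
            new_tris.append(t)
    return verts, new_tris

def flip_if_better(tris, pair, image_rms_error):
    """Image consistent re-triangulation step: flip the common edge of two
    adjacent triangles, keeping the flip only if the coding error drops."""
    t1, t2 = pair
    shared = [v for v in t1 if v in t2]                 # the common edge
    a = next(v for v in t1 if v not in shared)
    b = next(v for v in t2 if v not in shared)
    flipped = [(a, b, shared[0]), (a, b, shared[1])]
    others = [t for t in tris if t not in (t1, t2)]
    if image_rms_error(others + flipped) < image_rms_error(tris):
        return others + flipped
    return tris
```

In the real algorithm the error callback re-warps all training images through the candidate triangulation; here any scoring function with the same signature can be plugged in.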
3. Shape Mode Refinement: The third step of the refinement process is to refine the
shape modes. Since we are iteratively refining the model and building a new one, shape
mode refinement is equivalent to refining the locations of the mesh vertices in the training
data. The model fit step of our algorithm allows the mesh vertices to move around but the
movement is limited to the shape subspace of the face model. If we allow the mesh vertices to
(1) Initialize with the current mesh topology.
(2) For each pair of adjacent triangles that includes a new point:
(2.1) Flip the common edge of the quadrilateral.
(2.2) Note the image RMS error.
(3) Choose the edge flip with the minimum image RMS error.
(4) Flip the edge; check for thin or flipped triangles.

Figure 5.4: The algorithm for image consistent re-triangulation.
Figure 5.5: A pair of images showing the mesh before and after performing an image consistent
re-triangulation of the mesh [30]. The new points as well as the edges that were flipped are
highlighted.
move outside the shape subspace of the model, then we can potentially learn new deformations
and hence the current set of mesh vertices can better explain the face data. To do this we
perform a model fit step similar to the one described in Section 2.2, except that we replace
the shape modes with identity bases that span the entire 2D space. As indicated before, the
optimization equation is similar to Equation 2.3 except that the 2D shape s is now defined
using these basis vectors that allow all the points to move in both x and y directions:
\[
\left[\,\mathbf{s}_1 \;\cdots\; \mathbf{s}_{2M}\,\right] =
\begin{bmatrix}
1 & 0 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
0 & 0 & 1 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}_{2M \times 2M}
\]
where M is the number of mesh vertices. Even though the shape mode refinement step
is initialized by the model fit at the previous density, it is still very high dimensional and
so prone to local minima. Hence we regularize this step with two priors. The first is a
smoothness constraint. The second is a constraint that the initial sparse vertices cannot
move too far from the input (hand-marked) locations.
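The identity-basis construction used in the free fit can be sketched as follows. This is a minimal illustration under an assumed flattening convention (shapes stored as flat coordinate lists), not our actual implementation:

```python
def identity_shape_basis(M):
    """Build the 2M x 2M identity basis used in the free-fit step: one basis
    vector per coordinate of each of the M mesh vertices, so every vertex
    can move independently in both the x and y directions."""
    n = 2 * M
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def apply_shape(s0, basis, p):
    """s = s0 + sum_i p_i * s_i, with shapes flattened as flat lists of
    vertex coordinates."""
    s = list(s0)
    for pi, si in zip(p, basis):
        for k in range(len(s)):
            s[k] += pi * si[k]
    return s
```

With the identity basis, each parameter p_i simply adds a per-coordinate displacement, which is exactly what allows the vertices to leave the learned shape subspace.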
The smoothness constraint prevents each newly added point from moving too far from its
initial position with respect to the initial triangle that it was added
in. Figure 5.6 illustrates the mesh vertices that go into the optimization. The minimization
goal is given by:
\[
\sum_{\forall \mathbf{v}_4} \big\| \mathbf{v}_1 + \lambda\,(\mathbf{v}_2 - \mathbf{v}_1) + \mu\,(\mathbf{v}_3 - \mathbf{v}_1) - \mathbf{v}_4 \big\|^2 \tag{5.2}
\]
with respect to all newly added mesh vertices v4. The λ and µ coefficients are the barycentric
coordinates [6] with respect to the base triangle in the mean shape s0.
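The barycentric bookkeeping behind Equation 5.2 can be sketched directly. This is an illustrative sketch (function names are ours, not from the implementation): the coefficients are solved once in the mean shape, and the residual then measures how far the new vertex has drifted from the point those coefficients predict in an image.

```python
def barycentric_coeffs(v1, v2, v3, v4):
    """Solve v4 = v1 + lam*(v2 - v1) + mu*(v3 - v1) for (lam, mu)
    by Cramer's rule, in the mean shape s0."""
    ax, ay = v2[0] - v1[0], v2[1] - v1[1]
    bx, by = v3[0] - v1[0], v3[1] - v1[1]
    cx, cy = v4[0] - v1[0], v4[1] - v1[1]
    det = ax * by - ay * bx
    lam = (cx * by - cy * bx) / det
    mu = (ax * cy - ay * cx) / det
    return lam, mu

def smoothness_residual(v1, v2, v3, v4, lam, mu):
    """Squared residual of Eq. (5.2): distance between v4 and the point
    predicted by the mean-shape barycentric coordinates (lam, mu)."""
    px = v1[0] + lam * (v2[0] - v1[0]) + mu * (v3[0] - v1[0])
    py = v1[1] + lam * (v2[1] - v1[1]) + mu * (v3[1] - v1[1])
    return (px - v4[0]) ** 2 + (py - v4[1]) ** 2
```

If the new vertex stays at the location its base triangle predicts, the residual is zero; any drift is penalized quadratically.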
The second constraint restricts the movement of the initial points, enforcing that
they do not stray too far from their initial hand-specified locations. This constraint is
represented for a single image as:
\[
\Big\| \mathbf{s}_0 + \sum_{i=1}^{m} p_i\,\mathbf{s}_i - \mathbf{s} \Big\|^2 \tag{5.3}
\]
where the mean shape s0, 2D shape parameters p and the eigenvectors si are all defined over
the number of initial vertices and s is the initial hand-labeled mesh vertex locations [x y]
for a given image I.
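The prior of Equation 5.3 is straightforward to evaluate; a minimal sketch (with our own naming, and shapes stored as flat coordinate lists over the initial vertices) is:

```python
def landmark_prior(s0, modes, p, s_hand):
    """Squared error of Eq. (5.3): distance between the model-reconstructed
    sparse shape s0 + sum_i p_i * s_i and the hand-labeled locations s_hand.
    All shapes are flat lists [x1, y1, x2, y2, ...] over the initial
    (hand-marked) vertices only."""
    recon = list(s0)
    for pi, mode in zip(p, modes):
        for k in range(len(recon)):
            recon[k] += pi * mode[k]
    return sum((r - h) ** 2 for r, h in zip(recon, s_hand))
```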
The final optimization equation is a combination of the terms in Equations 2.3, 5.2 and 5.3.

Figure 5.6: This figure shows the triangle vertices used to impose the smoothness constraints. The new vertex is constrained by Equation 5.2 based on the location of the other three vertices.

The minimization goal is thus given by:
\[
\sum_{t=1}^{T} \sum_{\mathbf{u} \in \mathbf{s}_0}
\Big[ I^t\big(\mathbf{W}(\mathbf{u};\,\mathbf{p}^t)\big)
- \Big( A_0(\mathbf{u}) + \sum_{i=1}^{l} \lambda_i^t A_i(\mathbf{u}) \Big) \Big]^2
\;+\; K_1 \cdot \sum_{t=1}^{T} \Big\| \mathbf{s}_0 + \sum_{i=1}^{m} p_i^t\,\mathbf{s}_i - \mathbf{s}^t \Big\|^2
\;+\; K_2 \cdot \sum_{t=1}^{T} \sum_{\forall \mathbf{v}_4^t}
\big\| \mathbf{v}_1^t + \lambda\,(\mathbf{v}_2^t - \mathbf{v}_1^t) + \mu\,(\mathbf{v}_3^t - \mathbf{v}_1^t) - \mathbf{v}_4^t \big\|^2 \tag{5.4}
\]
with respect to the 2D shape p and appearance λi parameters. The optimization is done
using the algorithm in [4] but with a different prior. The weights K1 and K2 are chosen
by running the algorithm with different values of the weights and selecting the weighting
that gives the best performance. We also need to specify a suitable stopping criterion for
the algorithm to avoid the model getting unnecessarily dense. We choose a stopping criterion
based on the image coding error: if the coding error stops decreasing we terminate
the algorithm and output the dense model obtained at that iteration. Since our
model construction technique is an iterative offline algorithm, a typical iteration of our
MATLAB implementation on a Mac PowerPC G5 2.4 GHz machine takes
approximately 30 minutes.
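The outer control flow of the construction algorithm, including this stopping criterion, can be sketched as follows. All callables here are placeholders (our own names) standing in for the build, fit, refinement, and coding-error steps described in the text:

```python
def densify(images, sparse_landmarks, build_model, fit_model, refine,
            coding_error, tol=1e-3, max_iters=50):
    """Outer loop of the dense model construction algorithm: build, fit,
    and refine until the image coding error stops decreasing."""
    landmarks = sparse_landmarks
    prev_err = float("inf")
    model = None
    for _ in range(max_iters):
        model = build_model(images, landmarks)      # build AAM at current density
        landmarks = fit_model(model, images)        # model fit step
        err = coding_error(model, images)
        if prev_err - err < tol:                    # coding error stopped decreasing
            break
        prev_err = err
        # free fit + add mesh vertices + image consistent re-triangulation
        landmarks = refine(model, images, landmarks)
    return model
```

The `tol` threshold and iteration cap are illustrative; in practice the loop terminates when adding density no longer reduces the coding error.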
5.2 Experimental Results
In this section we present results from a number of experiments that demonstrate the improved
performance of our densification algorithm. There are two ways to initialize our algorithm. One
way is to start with an existing sparse AAM and then increase the mesh density. In Sec-
tions 5.2.1, 5.2.2 and 5.2.3 we present results for this case. We could also automatically
construct a dense AAM using the output of a rigid tracker as initialization. In Section 5.2.4
we present tracking results using this approach.
5.2.1 Quantitative Evaluation
In this section we present quantitative comparisons to demonstrate the improved perfor-
mance of our algorithm in building dense models. We evaluate our model construction
algorithm using the implicitly computed dense correspondence by comparing it with
those estimated by standard optical flow techniques [24, 27, 31]. We perform our comparison
with AAMs, but similar results could be obtained with 3DMMs.
5.2.1.1 Ground-Truth Data Collection
We collected high-resolution face data using Canon EOS SLR cameras capable of capturing
6 megapixel images. We obtained facial ground-truth data using a form of hidden markers
on the face. (See [39] for a different way of embedding hidden ground-truth.) These ground-
truth points have to be small so as not to interfere with the working of the algorithms. We
therefore mark a number of very small black dots on the face. We then record the facial
deformations along with the marked ground-truth points using the high-resolution cameras.
Figure 5.7 shows one such high resolution image with ground-truth points marked on it along
with a zoomed in version highlighting the ground-truth point locations. The input data to
Figure 5.7: On the left is an example of the high resolution image obtained using the experimental
setup described in Section 5.2.1.1. The hand-marked ground-truth points on the face are highlighted
using dark circles. On the right are two examples of the down sampled images. Notice that the
ground-truth points are almost invisible in the down sampled images.
all algorithms consists of all the high resolution images (3072 x 2040) down sampled to one
fourth their size (768 x 510). The ground-truth points are no longer visible in these low
resolution images and hence do not influence the working of the algorithms. Two example
down sampled images are also shown in Figure 5.7.
Note that we use only a single person’s ground-truthed data for the quantitative com-
parisons. The reason for this is that the notion of corresponding points is not well defined
across different people. We cannot determine which point on the face of person B a given
point on the face of person A should correspond to. Also note that we cannot use range data
to help with this process since the important aspect of the ground-truth is the non-rigid
mapping from frame to frame. Knowing the perfect 3D depth from range data does not
provide us with this information.
5.2.1.2 Images used for Optical Flow Computation
Optical flow can be particularly hard when the motion is large. In our case, the head moves
around quite a bit in the input image. Our algorithm keeps track of where the original
head locations were and so implicitly avoids this large search space. We provide this same
information to the optical flow algorithms by warping all the input images into the coordinate
frame of the mean face for the initial sparse model. This means that the maximum flow for
all of the images is of the order of 3-4 pixels, well within the search ranges of most optical flow
algorithms. Another issue that can cause difficulty for optical flow algorithms is boundary
effects. We avoid this by also warping a boundary region around the face. We present
examples of the face mesh and the original and warped images in Figure 5.9. Observe that
the warped images are closer to each other, which makes the task easier for the optical flow
algorithms.
5.2.1.3 2D Ground-Truth Points Prediction Results
We compare the performance of four different algorithms: (1) our densification algorithm, (2)
the optical flow algorithm of Horn and Schunck [24], (3) the optical flow algorithm of Lucas and
Kanade [27], and (4) optical flow with diffused connectivity (Openvis3D) [31], based on their
ability to generate accurate feature point locations, which are used to predict ground-truth
data point locations. We use the OpenCV implementations [1] for algorithms (2) and (3).
The evaluation methodology we adopt is based on ground-truth prediction. We use the
dense correspondence obtained from our algorithm and the optical flow algorithms to predict
the locations of the ground-truth points in all other images, given their position in one image.
We repeat this procedure for each image and finally average the predicted locations of the
ground-truth points. Once we have the predicted locations of the ground-truth points in
all images we compute the RMS spatial error between the predicted and the actual ground-truth
point locations. To perform a fair comparison among different algorithms we do all
the above computations in the mean shape by warping all images and correspondence onto
the mean shape.
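The evaluation metric above reduces to a short computation. This is a minimal sketch (our own naming), taking predicted and actual ground-truth point locations already warped into the mean-shape frame:

```python
import math

def rms_prediction_error(predicted, actual):
    """RMS spatial error between predicted and actual ground-truth point
    locations, both given as lists of (x, y) in the mean-shape frame."""
    assert len(predicted) == len(actual)
    total = sum((px - ax) ** 2 + (py - ay) ** 2
                for (px, py), (ax, ay) in zip(predicted, actual))
    return math.sqrt(total / len(predicted))
```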
We present the results of our algorithmic comparisons in Figure 5.8 for two different
people. We plot the RMS ground-truth prediction error against the number of mesh points
(algorithm iterations). The number of ground-truth points used for evaluation is 21 in the
first case and 13 in the second. The results indicate that the densification algorithm
produces dense correspondences that lead to greater accuracy in predicting ground-truth
data. The optical flow algorithms clearly perform worse. This validates our claim that many
standard optical flow techniques are poor predictors of point locations for images
taken under varying illumination, involving significant object deformations, and consisting of
sparsely textured data.
5.2.1.4 3D Ground-Truth Points Prediction Results
In this section we perform comparisons similar to the ones in the previous section to evaluate
the 3D consistency of the correspondence computed by our algorithm with respect to the
ground-truth. In this case we evaluate our algorithm on trinocular stereo data. We repeat
the experimental setup described in Section 5.2.1.1 except that now we have a stereo rig with
calibrated cameras [7]. We use the initial sparse correspondence (the input to our algorithm)
and the dense correspondence from our algorithm and triangulate them to obtain 3D point
locations. We also triangulate the 2D ground-truth points to obtain 3D ground-truth points.
We compare the 3D fidelity of the sparse and the dense correspondences by computing the
distance of each 3D ground-truth point from the corresponding triangular plane formed by
the sparse and dense mesh vertices. We find that the ground-truth points are closer (in
the depth direction) to the dense triangular mesh planes than the sparse ones, indicating
[Figure 5.8 plots omitted: RMS ground-truth point location error vs. number of mesh vertices, with curves for hand-labeled landmarks, Optical Flow (Openvis3D), Optical Flow (Lucas and Kanade), Optical Flow (Horn and Schunck), and the densification algorithm output; left panel Person 1, right panel Person 2.]
Figure 5.8: A comparison of the algorithms on their ability to generate landmarks that lead to
better prediction of ground-truth point locations for two different people. The x-axis plots
algorithmic iterations (each iteration adds 10 mesh points) and the y-axis plots the RMS
ground-truth point prediction error. The densification algorithm clearly performs the best.
that our densification algorithm generated mesh vertices with higher 3D fidelity. We plot
the results of our quantitative comparison in Figure 5.10.
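The point-to-plane measure used in this comparison can be sketched as follows; a minimal illustration (our own naming) of the distance from a triangulated 3D ground-truth point to the plane of its enclosing mesh triangle:

```python
import math

def point_to_plane_distance(p, a, b, c):
    """Distance from 3D point p to the plane through triangle (a, b, c)."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]          # plane normal: u x v
    norm = math.sqrt(sum(x * x for x in n))
    w = [p[i] - a[i] for i in range(3)]
    return abs(sum(ni * wi for ni, wi in zip(n, w))) / norm
```

Summing this distance over all ground-truth points, for the sparse and for the dense mesh, gives the per-image values compared in Figure 5.10.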
5.2.2 Fitting Robustness
In Figure 5.13 we show quantitative results to demonstrate the increased robustness of our
dense AAMs. In experiments similar to those in [28], we generated 1800 test cases (20
trials each for 90 images) by randomly perturbing the 2D shape model from a ground-truth
obtained by tracking the face in video sequences and allowing the algorithms to converge. The
2D shape and similarity parameters obtained from the dense AAM tracks were perturbed
and the perturbations were projected onto the ground-truth tracks of the sparse AAMs.
This ensures that the initial perturbation is a valid starting point for all algorithms. We
Figure 5.9: (a) An example of the mesh used to warp input images onto the mean shape for
computing optical flow. The face mesh is extended to eliminate boundary effects for optical flow
algorithms. (b) The original input images to our algorithm. Note that it is difficult for optical
flow algorithms to work on these images with varying head locations. (c) The two images from (b)
warped onto the mean shape using the mesh from (a). Note that by warping the images to mean
we make it easier for the optical flow algorithms.
then run each algorithm (one using the dense AAM and the other with the sparse AAM)
from the same perturbed starting point and determine their convergence by computing the
RMS error between the mesh location of the fit and the ground-truth mesh coordinates.
The algorithm is considered to have converged if the RMS spatial error is less than 2.0
pixels. The magnitude of the perturbation is chosen to vary on average from 0 to 4 times
the 2D shape standard deviation. The perturbation results were obtained on the trinocular
stereo data (Section 5.2.1.1) for each of the three camera views and the average frequency of
convergence is reported in Figure 5.13. The results show that the dense AAM converges to
ground truth more often than the sparse AAM. The increased robustness of the dense AAM
may be surprising given its apparent increased flexibility. But note that both the sparse and
dense AAMs have the same number of shape modes. The increased robustness of the dense
AAM is because it is a better (more compact) coding of the underlying phenomenon. Also
note that since both the sparse and the dense AAMs have the same number of parameters
[Figure 5.10 plot omitted: distance in mm (y-axis) vs. image index (x-axis), comparing sparse and dense correspondence.]
Figure 5.10: The distance of the triangulated 3D ground truth points from the 3D mesh plane
for each 3-frame. The values were computed for six 3-frames. The smallest triangle in which the
ground-truth point lies in 2D was computed. The distance was computed between the triangular
plane (formed by the 3D mesh vertices) and the corresponding 3D ground truth points. This was
repeated for 21 ground-truth points and the sum of the distances was computed. The average
distance across images for the sparse correspondence (69 mesh points) is 27.84 mm whereas for
the dense correspondence (168 mesh points) it is 13.625 mm.
that are optimized during the fit, the dense AAM fitting is as fast as the sparse AAM fitting.
The additional overheads, such as computing the affine warp for composition [28], hardly
affect the speed of fitting. A typical dense AAM (168 points) fit iteration takes 0.25 secs
using a MATLAB implementation on a Mac PowerPC G5 2.4 GHz machine, while fitting to
an image of VGA (640 x 480) resolution.
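The perturbation protocol above can be sketched compactly. This is an illustrative harness, not our experimental code: `fit` is a placeholder for running an AAM fit to convergence from a perturbed starting shape, and the Gaussian perturbation stands in for the scaled shape-mode perturbations used in the actual experiments:

```python
import math
import random

def convergence_frequency(gt_shapes, fit, sigma, trials=20, thresh=2.0, seed=0):
    """Fraction of randomly perturbed starting points from which `fit`
    returns a mesh within `thresh` RMS pixels of the ground truth."""
    rng = random.Random(seed)
    converged = 0
    total = 0
    for gt in gt_shapes:                       # one list of (x, y) per image
        for _ in range(trials):
            start = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))
                     for x, y in gt]
            result = fit(start)
            err = math.sqrt(sum((rx - gx) ** 2 + (ry - gy) ** 2
                                for (rx, ry), (gx, gy) in zip(result, gt))
                            / len(gt))
            converged += err < thresh
            total += 1
    return converged / total
```

Running this for a range of `sigma` values (0 to 4 times the shape standard deviation in the experiments) and plotting the returned frequency yields curves like those in Figure 5.13.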
Figure 5.11: Our algorithm can be applied to data of multiple people. Here we show a few frames
of a dense multi-person AAM being used to track three different people. See face track.mov for
the complete tracking sequences.
5.2.3 Face Tracking
In Section 5.2.1 we compared our algorithm on single-person data to allow a quantitative
comparison on ground-truth. Our algorithm can of course be applied to data of any number of