On the Multi-View Fitting and Construction of Dense Deformable Face Models Krishnan Ramnath CMU-RI-TR-07-10 May 2007 Submitted in partial fulfillment of the requirements for the degree of Master of Science The Robotics Institute Carnegie Mellon University Pittsburgh, Pennsylvania 15213 c 2007 by Krishnan Ramnath
Chapter 1
Introduction
Active Appearance Models (AAMs) [11, 12, 14, 13, 17], and the related concepts of Active
Blobs [34, 35] and Morphable Models [5, 25, 41], are generative models of a certain visual
phenomenon. AAMs are examples of statistical models that are used to characterize the
shape and the appearance of the underlying object by a set of model parameters. Though
AAMs are useful for other phenomena [34, 25], they are commonly used to model faces. In
a typical application, once an AAM has been constructed, the first step is to fit it to an
input image, i.e. model parameters are found to maximize the match between the model
instance and the input image. The model parameters can then be passed to a classifier.
Many different classification tasks are possible.
In this thesis we study three important topics related to deformable face models such as
AAMs: (1) we outline various techniques to simultaneously fit a 3D face model to multiple
images captured from multiple viewpoints, (2) we present a multi-view algorithm for 3D face
model construction, and (3) we present an automatic algorithm for dense deformable face
model construction.
1.1 Multi-View Face Model Fitting
Although AAMs were originally formulated as 2D, there are other deformable 3D models
(3D Morphable Models [5]) and AAMs have also been extended to 3D (2D+3D AAMs [44].)
A number of algorithms have been proposed to build deformable 3D face models and to fit
them efficiently [44, 33, 2, 37, 43, 32, 16]. Deformable 3D face models have a wide variety
of applications. Not only can they be used for tasks like pose estimation, which just require
the estimation of the 3D rigid motion, but also for tasks such as expression recognition and
lipreading, which require, explicitly or implicitly, estimation of the 3D non-rigid motion.
Most of the previous algorithms for AAM fitting and construction have been single-view.
One area that has not been studied much in the past (an exception is [15]) is the development
of simultaneous multi-view algorithms. Multi-view algorithms can potentially perform better
than single-view as they can take into account more visual information. In this thesis we
present multi-view algorithms to both fit and build 3D AAMs.
In the first part of this thesis we study multi-view fitting of AAMs. Fitting an AAM to
an image consists of minimizing the error between the input image and the closest model
instance; i.e. solving a nonlinear optimization problem. Face models are usually fit to a single
image of a face. In many application scenarios, however, it is possible to set up two or more
cameras and acquire simultaneous multiple views of the face. If we integrate the information
from multiple views, we can possibly obtain better application performance. For example,
Gross et al. [19] demonstrated improved face recognition performance by combining multiple
images of the same face captured from multiple widely spaced viewpoints. In Chapter 3, we
describe how a single AAM can be fit to multiple images, captured by cameras with arbitrary
locations, rotations, and response functions.
The main technical challenge is relating the AAM shape parameters in one view with
the corresponding parameters in the other views. This relationship is complex for a 2D
shape model but is straightforward for a 3D shape model. We use 2D+3D AAMs [44] in this
thesis. A 2D+3D AAM contains both a 2D shape model and a 3D shape model. Besides the
requirement of having a 3D shape model, the main advantage of using a 2D+3D AAM is that
2D+3D AAMs can be fit very efficiently in real-time [44]. Corresponding multi-view fitting
algorithms could also be derived for other 3D face models such as 3D Morphable Models [5].
We could easily have used a 3D Morphable Model instead to conduct the research in this
thesis, but the fitting algorithms would have been slower.
To generalize the 2D+3D fitting algorithm to multiple images, we use a separate set of
2D shape parameters for each image, but just a single, global set of 3D shape parameters
as represented in Figure 1.1. We impose the constraints that for each view separately,
the 2D shape model for that view must approximately equal the projection of the single
3D shape model. Imposing these constraints indirectly couples the 2D shape parameters
for each view in a physically consistent manner. Our algorithm can use any number of
cameras, positioned arbitrarily. The cameras can be moved and replaced with different
cameras without any retraining. The computational cost of the multi-view 2D+3D algorithm
is only approximately N times more than the single-view algorithm where N is the number
of cameras. In Section 3.1 we present a qualitative evaluation of our multi-view 2D+3D
fitting algorithm. We defer the quantitative evaluation to Section 3.9 where we also compare
it with a calibrated multi-view algorithm.
We also study how our multi-view fitting algorithm can be used for camera calibration.
The multi-view fitting algorithm of Section 3.1 uses the scaled orthographic imaging model
used by previous authors, and in the process of fitting computes, or calibrates, the scaled
orthographic camera matrices. In Section 3.3 we describe an extension of this algorithm to
calibrate weak perspective (or full perspective) camera models for each of the cameras. In
essence, both of these algorithms use the human face as a (non-rigid) calibration grid. Such
Figure 1.1: A representation of the experimental setup for multi-view 2D+3D AAM fitting. For
each view we have a separate set of 2D shape parameters and camera projection matrices, but just
a single, global set of 3D shape parameters and the associated global 3D rotation and translation.
Our fitting algorithm imposes the constraints that for each view separately, the 2D shape model
for that view must approximately equal the projection of the single 3D shape model.
an algorithm may be useful in a surveillance setting where we wish to install the cameras on
the fly, but avoid walking around the scene with a calibration grid.
The perspective algorithm requires at least two sets of multi-view images of the face at two
different locations. More images can be used to improve the accuracy if they are available.
We evaluate our algorithm by comparing it with an algorithm that uses a calibration grid
and show the performance to be roughly comparable.
We then show how camera calibration can improve the performance of multi-view face
model fitting. We present an extension of the multi-view AAM fitting algorithm of Sec-
tion 3.1 that takes advantage of calibrated cameras. We use the calibration algorithm of
Section 3.3 to explicitly provide calibration information to the multi-view fitting algorithm.
We demonstrate that this algorithm results in far better fitting performance than either the
single-view fitting (Chapter 2) or the uncalibrated1 multi-view fitting (Section 3.1) algo-
rithms. We consider two performance measures: (1) the robustness of fitting - the likelihood
of convergence for a given magnitude perturbation from the ground-truth, and (2) speed
of fitting - the average number of iterations required to converge from a given magnitude
perturbation from the ground-truth.
1.2 Multi-View Face Model Construction
In the second part of this thesis we study calibrated multi-view construction of AAMs. A
variety of non-rigid structure-from-motion algorithms have been proposed, both non-linear
[8, 40] and linear [9, 45, 46], that can be used for deformable 3D model construction from
both a single view [8, 9, 45, 46] and multiple views [40].
In most cases, it is only practical to apply face model construction algorithms to data
with relatively little pose variation. Tracking facial feature points becomes more difficult the
more pose variation there is. Unfortunately, single-view and multi-view algorithms such as
non-rigid structure-from-motion have a tendency to scale (stretch or compress) the face in
the depth-direction when applied to data with only medium amounts of pose variation. The
problem is not the algorithms themselves, but the Bas-Relief ambiguity between the camera
translation/rotation and the depth [47, 38, 36, 23]. The Bas-Relief ambiguity is normally
formulated in the case of rigid structure-from-motion, but applies equally in the non-rigid
1Note that for the uncalibrated multi-view algorithm described in Section 3.1, the calibration parameters are unknown and are estimated as a part of the optimization. For the calibrated multi-view fitting algorithm the calibration parameters are known and are obtained from a calibration algorithm (possibly the algorithm of Section 3.3.)
case. As empirically validated in Chapter 4, the result is a compressed/stretched face model,
which gives erroneous estimates of the 3D rigid and non-rigid motion.
One way to eliminate the ambiguity is to use a calibrated stereo rig instead of a single
camera. The known, fixed translation between the cameras then sets the scale and breaks the
ambiguity. The straightforward approach is to use stereo to build a static 3D model at each
time instant and then build the deformable model by modeling how the 3D shape changes
across time. Two algorithms that take this approach are [10, 18], one in the uncalibrated
case [10], the other in the calibrated case [18]. An alternative approach is to extend the non-
rigid structure-from-motion paradigm of [9, 8, 40, 45] and pose the face model construction
problem as a single large optimization over the unknown shape model modes, in essence
a large bundle adjustment. In Chapter 4 of this thesis we derive a calibrated multi-view
non-rigid motion-stereo algorithm [42, 48] to do exactly this. Our multi-view algorithm
explicitly incorporates the knowledge of the calibrated relative orientation of the cameras in
the stereo rig. In Section 4.5 we present qualitative results to validate these claims. We also
use the multi-view calibration algorithm described in Chapter 3 to quantitatively compare
the fidelity of 3D models.
1.3 Dense Face Model Construction
Deformable face models are generative parametric models that are used to model both rigid
and non-rigid deformations. The two best known examples of deformable models are Active
Appearance Models (AAMs) [12, 14, 13, 17, 26] and 3D Morphable Models (3DMMs) [5,
25, 33, 41, 8]. Although AAMs and 3DMMs are closely related, there are a number of
differences between them. One main difference between AAMs and 3DMMs is that AAMs
are typically sparse whereas 3DMMs are typically dense. This difference is mainly based
Figure 1.2: An illustration of the effects that make optical flow hard for human faces: (1) appear-
ance/disappearance of structures such as teeth and wrinkles (2) non-Lambertian, largely textureless
regions of skin such as the cheeks.
on how these models are constructed. AAMs are normally constructed from a collection
of training images of faces with a mesh of canonical feature points hand-marked on them.
Since the feature points are hand-marked, the correspondence can only be sparse. 3DMMs
are usually computed by running an optical flow algorithm to estimate the dense non-rigid
alignment of the texture maps [5]. AAMs could also potentially be constructed from dense
correspondence estimation using optical flow.
Computing dense alignment of face images using optical flow is difficult for a number of
reasons, as illustrated in Figure 1.2. Observe the appearance/disappearance of structures
such as teeth and wrinkles. Also note the non-Lambertian reflectance of textureless regions
such as the cheeks. Optical flow algorithms are not robust to such variations in the images.
Note, however, that the failure of the optical flow algorithm in the construction of 3DMMs
is hidden in the fact that most applications are graphics. The artifacts are hidden by the
texture mapping of the 3D mesh.
In the final part of this thesis we propose a different approach to building a dense de-
formable face model. Rather than assuming that dense correspondence can be computed
in a pre-processing step, our algorithm instead builds a dense model (Figure 1.3) by itera-
tively building a face model, fitting the model to image data and then refining the model.
There are three ways in which the model is refined: (1) by adding more mesh vertices, (2) by
changing the mesh connectivity using image-consistent surface triangulation [30], and (3) by
refining the shape modes using a modification of the algorithm in [4]. Although the goal of
our algorithm is to compute a dense model, note that in the process it implicitly computes
the dense correspondence that an optical flow algorithm would. However, as the model is
refined, it builds a model of visual effects such as the appearance/disappearance of structures
such as teeth and wrinkles, and also builds an implicit model of the illumination variation
manifested across the face, including large textureless regions such as the cheeks. This is the
reason our algorithm is able to outperform standard optical flow algorithms.
In Section 5.2 we show a number of results to illustrate that our densification algorithm
can be used to accurately build dense models. We first evaluate our algorithm quantitatively
with a set of ground-truth data using a form of hidden markers. We compare with a number
of popular optical flow algorithms [24, 27, 31] for the same task and find our algorithm to
be more robust and more accurate. We then perform comparisons to show improvement in
fitting robustness. We also present a number of tracking results to qualitatively illustrate
other aspects of our algorithm. Finally, we also show how our algorithm can be used to
construct dense deformable models automatically, starting with a rigid planar model of the
face that is subsequently refined to model the non-planarity and the non-rigid components.
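The iterative refinement loop described in this section can be sketched at a high level as follows. The model representation and helper names here are hypothetical stand-ins (the actual steps involve fitting a 2D+3D AAM, image-consistent retriangulation [30], and shape mode refinement based on [4]); each step only updates simple counters so the control flow is visible:

```python
# Skeleton of the iterative densification loop: fit, then refine in three ways.
# All helpers are hypothetical placeholders, not the thesis' actual operations.
def fit_model(model, image):
    return {"image": image}                  # stand-in for 2D+3D AAM fitting

def add_vertices(model):
    return dict(model, n_vertices=model["n_vertices"] + 4)  # (1) densify mesh

def retriangulate(model):
    return dict(model, n_retri=model["n_retri"] + 1)        # (2) image-consistent connectivity

def refine_shape_modes(model):
    return dict(model, n_refine=model["n_refine"] + 1)      # (3) refine shape modes

def densify(model, images, n_rounds=3):
    for _ in range(n_rounds):
        fits = [fit_model(model, img) for img in images]    # fit current model to data
        model = add_vertices(model)
        model = retriangulate(model)
        model = refine_shape_modes(model)
    return model

model = densify({"n_vertices": 68, "n_retri": 0, "n_refine": 0}, images=[0, 1, 2])
```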
Figure 1.3: An example dense mesh achieved using our densification algorithm. On the left we
show the initial sparse mesh as well as the mesh vertices. On the right we show the resulting
triangulated mesh as well as vertices after applying the densification algorithm.
Chapter 2
Background
In this section we review 2D Active Appearance Models (AAMs) [13] and 2D+3D Active
Appearance Models [44]. We also revisit the efficient inverse compositional fitting algo-
rithms [3, 44].
2.1 2D Active Appearance Models
The 2D shape s of a 2D Active Appearance Model is a 2D triangulated mesh. In particular,
s is a column vector containing the vertex locations of the mesh. AAMs allow linear shape
variation. This means that the 2D shape s can be expressed as a base shape s0 plus a linear
combination of m shape vectors si:
s = s_0 + \sum_{i=1}^{m} p_i s_i \qquad (2.1)
where the coefficients pi are the shape parameters. AAMs are normally computed from
training data consisting of a set of images with the shape mesh (hand) marked on them
[13]. The Procrustes alignment algorithm and Principal Component Analysis (PCA) are
then applied to compute the base shape s0 and the shape variation si [13]. An example
mesh is shown in Figure 2.1.
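As a concrete illustration of Equation (2.1), a shape instance is simply the base shape plus a parameter-weighted sum of the shape vectors. The sketch below uses a hypothetical 4-vertex mesh with made-up shape vectors, not values from any trained AAM:

```python
import numpy as np

# Toy model: a base mesh of 4 vertices (x1, y1, ..., x4, y4) stacked into a
# column vector, plus m = 2 shape vectors, as in Equation (2.1).
s0 = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
shape_vectors = np.stack([
    np.array([0.1, 0.0, -0.1, 0.0, -0.1, 0.0, 0.1, 0.0]),   # s1: horizontal squeeze
    np.array([0.0, 0.1, 0.0, 0.1, 0.0, -0.1, 0.0, -0.1]),   # s2: vertical shift
])

def shape_instance(p):
    """s = s0 + sum_i p_i * s_i  (Equation 2.1)."""
    return s0 + p @ shape_vectors

s = shape_instance(np.array([0.5, -0.2]))
```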
Figure 2.1: The 2D linear shape model of an AAM. The model consists of a triangulated base
mesh s0 plus a linear combination of m shape vectors si. The base mesh is shown on the left,
followed by the first three shape vectors s1, s2, and s3 overlaid on the base mesh.
The appearance of an AAM is defined within the base mesh s0. Let s0 also denote the
set of pixels u = (u, v)T that lie inside the base mesh s0, a convenient notational short-cut.
The appearance of the AAM is then an image A(u) defined over the pixels u ∈ s0. AAMs
allow linear appearance variation. This means that the appearance A(u) can be expressed
as a base appearance A0(u) plus a linear combination of l appearance images Ai(u):
A(u) = A0(u) +l∑
i=1
λi Ai(u) (2.2)
where the coefficients λi are the appearance parameters. The base (mean) appearance A0
and appearance images Ai are usually computed by applying Principal Component Analysis
to the shape normalised training images [13]. The appearance variation of an AAM is
illustrated in Figure 2.2.
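The construction described above, PCA on shape-normalised training images, can be sketched as follows; the training data here is synthetic random noise standing in for real shape-normalised images, so only the structure of the computation is meaningful:

```python
import numpy as np

# Sketch of how A0 and the Ai of Equation (2.2) are typically obtained:
# PCA on shape-normalised training images flattened to vectors.
rng = np.random.default_rng(0)
train = rng.normal(size=(20, 100))       # 20 synthetic "images", 100 pixels each
A0 = train.mean(axis=0)                  # base (mean) appearance
U, sing, Vt = np.linalg.svd(train - A0, full_matrices=False)
l = 3
A = Vt[:l]                               # top-l appearance images (orthonormal rows)

def appearance_instance(lam):
    """A(u) = A0(u) + sum_i lambda_i * A_i(u)  (Equation 2.2)."""
    return A0 + lam @ A
```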
Although Equations (2.1) and (2.2) describe the AAM shape and appearance variation,
they do not describe how to generate a model instance. The AAM model instance (Figure 2.3)
with shape parameters p and appearance parameters λi is created by warping the appearance
A from the base mesh s0 to the model shape mesh s. In particular, the pair of meshes s0
and s define a piecewise affine warp from s0 to s, denoted1 W(u;p) [28].
1Note that for ease of presentation we have omitted any mention of the 2D similarity transformation that
Figure 2.2: The 2D linear appearance model of an AAM. The model consists of a base appearance
A0 defined over all the pixels inside the base shape mesh s0 plus a linear combination of l appearance
vectors Ai.
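The piecewise affine warp W(u;p) used in model instantiation is defined triangle by triangle: each base-mesh triangle maps to its counterpart in s by the unique affine transform taking its three vertices to theirs. A minimal sketch with illustrative coordinates:

```python
import numpy as np

# Per-triangle affine warp underlying W(u;p): solve for the 2x3 affine that
# maps the three source vertices to the three destination vertices.
def affine_for_triangle(tri_src, tri_dst):
    """2x3 affine A such that A @ [u, v, 1] maps tri_src vertices to tri_dst."""
    src = np.vstack([tri_src.T, np.ones(3)])    # 3x3: columns are [u, v, 1]
    return tri_dst.T @ np.linalg.inv(src)       # 2x3 affine matrix

tri0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # triangle in s0
tri1 = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]])   # corresponding triangle in s

A = affine_for_triangle(tri0, tri1)

def warp_point(A, uv):
    """Warp a pixel (u, v) inside the source triangle."""
    return A @ np.array([uv[0], uv[1], 1.0])
```

In the full warp, each pixel u in s0 is first assigned to the triangle containing it, then warped with that triangle's affine transform.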
2.2 Fitting a 2D AAM to a Single Image
The goal of fitting a 2D AAM to a single input image I [28] is to minimize:
\sum_{u \in s_0} \Big[ A_0(u) + \sum_{i=1}^{l} \lambda_i A_i(u) - I(W(u;p)) \Big]^2 = \Big\| A_0(u) + \sum_{i=1}^{l} \lambda_i A_i(u) - I(W(u;p)) \Big\|^2 \qquad (2.3)
with respect to the 2D shape p and appearance λi parameters. In [28] it was shown that the
inverse compositional algorithm [3] can be used to optimize the expression in Equation (2.3).
The algorithm uses the “project out” algorithm [21, 28] to break the optimization into two
steps. The first step consists of optimizing:
\| A_0(u) - I(W(u;p)) \|^2_{\mathrm{span}(A_i)^\perp} \qquad (2.4)
with respect to the shape parameters p where the subscript span(Ai)⊥ means project the
vector into the subspace orthogonal to the subspace spanned by Ai, i = 1, . . . , l. The second
step consists of solving for the appearance parameters:
is used with an AAM to normalise the shape [13]. In this thesis we include the normalising warp in W(u;p) and the similarity normalisation parameters in p. See [28] for a description of how to include the normalising warp in W(u;p).
Figure 2.3: An example of an AAM model instance. The shape parameters p are used to create
the shape model s and the appearance parameters λi are used to create the appearance model A.
The model instance is then created by warping the appearance A from the base mesh s0 to the
model shape mesh s using the piecewise affine warp W(u;p) defined by the pair of meshes s0
and s.
\lambda_i = -\sum_{u \in s_0} A_i(u) \, [A_0(u) - I(W(u;p))] \qquad (2.5)
where the appearance vectors Ai are orthonormal. Optimizing Equation (2.4) itself can be
performed by iterating the following two steps. Step 1 consists of computing:
\Delta p = -H_{2D}^{-1} \Delta p_{SD} \quad \text{where} \quad \Delta p_{SD} = \sum_{u \in s_0} [SD_{2D}(u)]^T \, [A_0(u) - I(W(u;p))]
where the following two terms can be pre-computed (and combined) to achieve high
efficiency:
SD_{2D}(u) = \Big[ \nabla A_0 \, \frac{\partial W}{\partial p} \Big]_{\mathrm{span}(A_i)^\perp} \qquad H_{2D} = \sum_{u \in s_0} [SD_{2D}(u)]^T \, SD_{2D}(u)

where \nabla A_0 = \Big( \frac{\partial A_0}{\partial x}, \frac{\partial A_0}{\partial y} \Big).
Step 2 consists of updating the warp by composing with the inverse incremental warp:
W(u;p) ← W(u;p) ◦W(u; ∆p)−1 (2.6)
The resulting 2D AAM fitting algorithm runs at over 200 frames per second. See [28] for
more details.
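To make the project-out structure of Equations (2.4)-(2.6) concrete, here is a toy 1D analogue: the "image" is a 1D signal, the warp is a pure translation W(u;p) = u + p (so the inverse compositional update reduces to a subtraction), and a single appearance mode is projected out. All signals are synthetic; this illustrates the structure of the algorithm, not the thesis' implementation:

```python
import numpy as np

# 1D analogue of project-out inverse compositional fitting.
u = np.linspace(0.0, 2 * np.pi, 200)
A0 = np.sin(u)                           # base template
A1 = np.cos(2 * u)
A1 = A1 / np.linalg.norm(A1)             # orthonormal appearance basis (l = 1)

true_p, true_lam = 0.3, 0.5
def image(x):                            # I(x) = A0(x - p*) + lam* A1(x - p*)
    return np.interp(x - true_p, u, A0) + true_lam * np.interp(x - true_p, u, A1)

# Pre-compute the steepest-descent image and Hessian with A1 projected out.
grad_A0 = np.gradient(A0, u)             # dA0/du; dW/dp = 1 for translation
SD = grad_A0 - A1 * (A1 @ grad_A0)       # project into span(A1)-perp
H = SD @ SD

p = 0.0
for _ in range(30):
    err = A0 - image(u + p)              # error image A0(u) - I(W(u;p))
    dp = -(SD @ err) / H                 # Gauss-Newton step on Equation (2.4)
    p = p - dp                           # composition with inverse warp

lam = -A1 @ (A0 - image(u + p))          # Equation (2.5)
```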
2.3 2D+3D Active Appearance Models
Most deformable 3D face models, including 3D Morphable Models [5] and the models in [9, 8,
40, 45], use a 3D linear shape variation model, essentially equivalent to a 3D generalization of
the model in Section 2.1. The 3D shape s̄ is a 3D triangulated mesh which can be expressed
as a base shape s̄0 plus a linear combination of m̄ shape vectors s̄j:

\bar{s} = \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \qquad (2.7)

where the coefficients p̄j are the 3D shape parameters.
A 2D+3D AAM [44] consists of the 2D shape variation si of a 2D AAM governed by
Equation (2.1), the appearance variation Ai(u) of a 2D AAM governed by Equation (2.2),
and the 3D shape variation s̄j of a 3D AAM governed by Equation (2.7). The 2D shape
variation si and the appearance variation Ai(u) of the 2D+3D AAM are constructed exactly
as for a 2D AAM. The construction of the 3D shape variation s̄j is the subject of Chapter 4
of this thesis.
To generate a 2D+3D model instance, an image formation model is needed to convert
the 3D shape s̄ into a 2D mesh, onto which the appearance is warped. In [44] the following
scaled orthographic imaging model was used:

u = P_{so}\,x = \sigma \begin{bmatrix} i_x & i_y & i_z \\ j_x & j_y & j_z \end{bmatrix} x + \begin{pmatrix} o_x \\ o_y \end{pmatrix} \qquad (2.8)
where x = (x, y, z) is a 3D vertex location, (ox, oy) is an offset to the origin, σ is the scale
and the projection axes i = (ix, iy, iz) and j = (jx, jy, jz) are unit length and orthogonal:
i · i = j · j = 1; i · j = 0. The model instance is then computed by projecting every 3D shape
vertex onto a 2D vertex using Equation (2.8). The 2D appearance A(u) is finally warped
onto the 2D mesh (taking into account visibility) to generate the final model instance.
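The scaled orthographic model of Equation (2.8) is straightforward to sketch; the scale, axes, and vertices below are illustrative values only:

```python
import numpy as np

# Scaled orthographic projection (Equation 2.8): scale sigma, two orthonormal
# projection axes i and j, and a 2D origin offset o = (ox, oy).
def project_so(X, sigma, i_axis, j_axis, o):
    """Project Nx3 vertices X to Nx2 image points: u = sigma * [i; j] x + o."""
    P = np.stack([i_axis, j_axis])            # 2x3, rows unit length and orthogonal
    assert np.allclose(P @ P.T, np.eye(2))    # i.i = j.j = 1, i.j = 0
    return sigma * X @ P.T + o

X = np.array([[0.0, 0.0, 5.0], [1.0, 2.0, 3.0]])   # illustrative 3D vertices
uv = project_so(X, sigma=2.0,
                i_axis=np.array([1.0, 0.0, 0.0]),
                j_axis=np.array([0.0, 1.0, 0.0]),
                o=np.array([10.0, 20.0]))
```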
2.4 Fitting a 2D+3D AAM to a Single Image
The goal of fitting a 2D+3D AAM to an image I [44] is to minimize:

\Big\| A_0(u) + \sum_{i=1}^{l} \lambda_i A_i(u) - I(W(u;p)) \Big\|^2 + K \cdot \Big\| s_0 + \sum_{i=1}^{m} p_i s_i - P_{so} \Big( \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \Big) \Big\|^2 \qquad (2.9)

with respect to p, λi, P_so, and p̄, where K is a large constant weight. A pictorial representation of the 2D+3D AAM fitting is shown in Figure 2.4.
Equation (2.9) should be interpreted as follows. The first term in Equation (2.9) is the
2D AAM fitting criterion. The second term enforces the (heavily weighted, soft) constraints
that the 2D shape s equals the projection of the 3D shape s̄ with projection matrix P_so.
In [44] it was shown that the 2D AAM fitting algorithm [28] can be extended to a 2D+3D
AAM. The resulting algorithm still runs in real-time [29].
As with the 2D AAM algorithm, the “project out” algorithm [28] is used to break the
optimization into two steps, the first optimizing:
\| A_0(u) - I(W(u;p)) \|^2_{\mathrm{span}(A_i)^\perp} + K \cdot \sum_i F_i^2(p; P_{so}; \bar{p}) \qquad (2.10)
Figure 2.4: A representation of the 2D+3D AAM fitting algorithm. The fitting goal consists of
two terms: (1) the 2D fitting goal, and (2) the regularization term that enforces the 2D shape s
to equal the projection of the 3D shape s̄ with projection matrix P_so.
with respect to p, P_so, and p̄, where F_i(p; P_so; p̄) is the error inside the L2 norm in the
second term in Equation (2.9) for each of the mesh x and y vertices. The second step
solves for the appearance parameters using Equation (2.5). The 2D+3D algorithm has more
unknowns to solve for than the 2D algorithm. As a notational convenience, concatenate all
the unknown parameters into one vector q = (p; P_so; p̄). Optimizing Equation (2.10) is then
performed by iterating the following two steps. Step 1 consists of computing2:
\Delta q = -H_{3D}^{-1} \Delta q_{SD} = -H_{3D}^{-1} \left[ \begin{pmatrix} \Delta p_{SD} \\ 0 \end{pmatrix} + K \cdot \sum_i \Big( \frac{\partial F_i}{\partial q} \Big)^T F_i(q) \right] \qquad (2.11)
2To simplify presentation, in this thesis we omit the additional correction that needs to be made to Fi(p; Pso; p̄) to use the inverse compositional algorithm. See [44] for details.
where:
H_{3D} = \begin{bmatrix} H_{2D} & 0 \\ 0 & 0 \end{bmatrix} + K \cdot \sum_i \Big( \frac{\partial F_i}{\partial q} \Big)^T \frac{\partial F_i}{\partial q} \qquad (2.12)
Step 2 consists of first extracting the parameters p, P_so, and p̄ from q, then updating
the warp using Equation (2.6), and updating the other parameters P_so and p̄ additively [29].
Chapter 3
Multi-View 2D+3D AAM Fitting and
Camera Calibration
In the previous chapter we reviewed some of the efficient algorithms to fit an AAM to
a single image. If we have multiple, simultaneous, views of the face, the performance of
AAM fitting can be improved if we use all views. In this chapter we first describe an
algorithm to fit a single 2D+3D AAM simultaneously to multiple images. During fitting
we impose the constraints that for each view separately, the 2D shape model for that view
must approximately equal the projection of the single 3D shape model. Our algorithm can
use any number of cameras, positioned arbitrarily. We then show how our multi-view fitting
algorithm can be used for camera calibration. We describe an algorithm to calibrate weak
perspective (or full perspective) camera models for each of the cameras using the human
face as a (non-rigid) calibration grid. Finally we show how camera calibration can improve
the performance of multi-view face model fitting.
3.1 Multi-View 2D+3D AAM Fitting
Suppose that we have N images I^n, n = 1, . . . , N, of a face that we wish to fit the 2D+3D
AAM to. In this section we assume that the images are captured simultaneously by syn-
chronized, but uncalibrated cameras (see Section 3.9 for a calibrated algorithm.) The naive
algorithm is to fit the 2D+3D AAM independently to each of the images. This algorithm can
be improved upon by using the fact that, since the images In are captured simultaneously,
the 3D shape of the face is the same in all views. We therefore pose fitting a single 2D+3D
AAM to multiple images as minimizing:
\sum_{n=1}^{N} \left[ \Big\| A_0(u) + \sum_{i=1}^{l} \lambda_i^n A_i(u) - I^n(W(u;p^n)) \Big\|^2 + K \cdot \Big\| s_0 + \sum_{i=1}^{m} p_i^n s_i - P_{so}^n \Big( \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \Big) \Big\|^2 \right] \qquad (3.1)
simultaneously with respect to the N sets of 2D shape parameters p^n, the N sets of appearance
parameters λ_i^n (the appearance may be different in different images due to different
camera response functions, etc.), the N sets of camera matrices P^n_so, and the one, global
set of 3D shape parameters p̄. Note that the 2D shape parameters in each image are not
independent, but are coupled in a physically consistent1 manner through the single set of
3D shape parameters p̄. Optimizing Equation (3.1) therefore cannot be decomposed into
N independent optimizations. The appearance parameters λ_i^n can, however, be dealt with
using the “project out” algorithm [21, 28] in the usual way; i.e. we first optimize:
$$\sum_{n=1}^{N} \left\| A_0(\mathbf{u}) - I^n(\mathbf{W}(\mathbf{u};\mathbf{p}^n)) \right\|^2_{\mathrm{span}(A_i)^\perp} + K \cdot \left\| \mathbf{s}_0 + \sum_{i=1}^{m} p_i^n \mathbf{s}_i - \mathbf{P}^n_{so} \left( \bar{\mathbf{s}}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{\mathbf{s}}_j \right) \right\|^2 \tag{3.2}$$
with respect to p^n, P^n_{so}, and \bar{p}, and then solve for the appearance parameters:
¹Note that directly coupling the 2D shape models would be difficult due to the complex relationship between the 2D shape in one image and another. Multi-view face model fitting is best achieved with a 3D model. A similar algorithm could be derived for other 3D face models such as 3D Morphable Models [5]. The main advantage of using a 2D+3D AAM [44] is the fitting speed.
$$\lambda_i^n = -\sum_{\mathbf{u} \in \mathbf{s}_0} A_i(\mathbf{u}) \cdot \left[ A_0(\mathbf{u}) - I^n(\mathbf{W}(\mathbf{u};\mathbf{p}^n)) \right].$$

Organize the unknowns in Equation (3.2) into a single vector r = (p^1; P^1_{so}; . . . ; p^N; P^N_{so}; \bar{p}).
Also, split the single-view 2D+3D AAM terms from Equations (2.11) and (2.12) into parts
that correspond to the 2D image parameters (p^n and P^n_{so}) and the 3D shape parameters (\bar{p}):

$$\Delta \mathbf{q}^n_{SD} = \begin{pmatrix} \Delta \mathbf{q}^n_{SD,2D} \\ \Delta \mathbf{q}^n_{SD,\bar{p}} \end{pmatrix} \quad \text{and} \quad \mathbf{H}^n_{3D} = \begin{pmatrix} \mathbf{H}^n_{3D,2D,2D} & \mathbf{H}^n_{3D,2D,\bar{p}} \\ \mathbf{H}^n_{3D,\bar{p},2D} & \mathbf{H}^n_{3D,\bar{p},\bar{p}} \end{pmatrix}.$$
Optimizing Equation (3.2) can then be performed by iterating the following two steps.
Step 1 consists of computing:
$$\Delta \mathbf{r} = -\mathbf{H}^{-1}_{MV} \Delta \mathbf{r}_{SD} = -\mathbf{H}^{-1}_{MV} \begin{pmatrix} \Delta \mathbf{q}^1_{SD,2D} \\ \vdots \\ \Delta \mathbf{q}^N_{SD,2D} \\ \sum_{n=1}^{N} \Delta \mathbf{q}^n_{SD,\bar{p}} \end{pmatrix} \tag{3.3}$$

where:

$$\mathbf{H}_{MV} = \begin{pmatrix} \mathbf{H}^1_{3D,2D,2D} & 0 & \cdots & 0 & \mathbf{H}^1_{3D,2D,\bar{p}} \\ 0 & \mathbf{H}^2_{3D,2D,2D} & \cdots & 0 & \mathbf{H}^2_{3D,2D,\bar{p}} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \mathbf{H}^N_{3D,2D,2D} & \mathbf{H}^N_{3D,2D,\bar{p}} \\ \mathbf{H}^1_{3D,\bar{p},2D} & \mathbf{H}^2_{3D,\bar{p},2D} & \cdots & \mathbf{H}^N_{3D,\bar{p},2D} & \sum_{n=1}^{N} \mathbf{H}^n_{3D,\bar{p},\bar{p}} \end{pmatrix}.$$
Step 2 consists of extracting the parameters p^n, P^n_{so}, and \bar{p} from r, updating the
warp parameters p^n using Equation (2.6), and updating the other parameters P^n_{so} and \bar{p} additively.

The N image algorithm is very similar to N copies of the single image algorithm. Almost
all of the computation is simply replicated N times, one copy for each image. The only extra
computation is adding the N terms in the components of Δr_{SD} and H_{MV} that correspond to
the single set of global 3D shape parameters \bar{p}, inverting the matrix H_{MV}, and the matrix
multiply in Equation (3.3). Overall, the N image algorithm is therefore approximately N
times slower than the single image 2D+3D fitting algorithm. (It is more than N times slower
due to the larger matrix inversion and matrix multiplication steps, but in practice only slightly
so.)
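The two-step update above can be sketched in code. The following is a minimal illustration (not the thesis implementation) of assembling the block-sparse multi-view Hessian H_MV of Equation (3.3) and solving for the parameter update; the function name `multiview_update` and the block layout of its inputs are our own assumptions:

```python
import numpy as np

def multiview_update(H_blocks, q_blocks):
    """One multi-view Gauss-Newton step (sketch of Equation (3.3)).

    H_blocks: list of N per-view Hessians, each partitioned as
        (H_2D2D, H_2Dp, H_p2D, H_pp), following the split of Equation (2.12).
    q_blocks: list of N per-view steepest-descent updates (dq_2D, dq_p).
    Returns the stacked parameter update dr = -(H_MV)^-1 dr_SD.
    """
    n2d = H_blocks[0][0].shape[0]   # per-view 2D parameter count
    npp = H_blocks[0][3].shape[0]   # global 3D parameter count
    dim = len(H_blocks) * n2d + npp

    H = np.zeros((dim, dim))
    g = np.zeros(dim)
    for n, ((H22, H2p, Hp2, Hpp), (dq2, dqp)) in enumerate(zip(H_blocks, q_blocks)):
        i = n * n2d
        H[i:i + n2d, i:i + n2d] = H22      # block-diagonal 2D terms
        H[i:i + n2d, -npp:] = H2p          # 2D / 3D coupling blocks
        H[-npp:, i:i + n2d] = Hp2
        H[-npp:, -npp:] += Hpp             # 3D terms sum over the views
        g[i:i + n2d] = dq2
        g[-npp:] += dqp                    # 3D steepest-descent terms sum too
    return -np.linalg.solve(H, g)
```

Note that only the last block row and column couple the views; everything else is block diagonal, which is why the cost is close to N independent single-view updates.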
3.2 Experimental Results
An example of using our algorithm to fit a single 2D+3D AAM to three simultaneously
captured images² of a face is shown in Figure 3.1. For all the results in this chapter,
the translation and scale of the 2D face model in each view are initialized by hand and the
2D shape is set to the mean shape. However, 2D+3D AAMs can easily be initialized
with a face detector [29]. See the movie iterations.mov for the fitting video sequence.
The initialization is displayed in the top row of the figure, the result after 5 iterations in
the middle row, and the final converged result in the bottom row. In each case, all three
input images are overlaid with the 2D shape p^n plotted in dark dots. We also display the
recovered pose angles (roll, pitch, and yaw) extracted from the three scaled orthographic
camera matrices P^n_{so} in the top left of each image. Each camera computes a different relative
head pose, illustrating that the estimate of P^n_{so} is view dependent. The single 3D shape \bar{p}
for all views at the current iteration is displayed in the top-right of the center image. The
view-dependent camera projection of this 3D shape is also plotted as a white mesh overlaid
on the face.
Applying the multi-view fitting algorithm sequentially allows us to track the face simultaneously
in N video sequences. Some example frames of the algorithm being used to track a
face in a trinocular sequence are shown in Figure 3.2. We also include the movie tracking.mov
²Note that the input images for all experiments described in this thesis are chosen such that there is no occlusion of the face. For ways to handle occlusion in the input data see [20, 29].
The multi-view fitting algorithm in Chapter 3 uses the scaled orthographic image formation
model in Equation (2.8). A more powerful model when working with multiple cameras
(because it models the coupling between the scales across the cameras through the focal
lengths and average depths) is the weak perspective model:
$$\mathbf{u} = \mathbf{P}_{wp}(\mathbf{x}) = \frac{f}{o_z + \bar{z}} \begin{pmatrix} i_x & i_y & i_z \\ j_x & j_y & j_z \end{pmatrix} \mathbf{x} + \begin{pmatrix} o_u \\ o_v \end{pmatrix}. \tag{3.4}$$

In Equation (3.4), o_z is the depth of the origin of the world coordinate system and \bar{z} is
the average depth of the scene points measured relative to the world coordinate origin. The
"z" (depth) direction is k = i × j, where × is the vector cross product, i = (i_x, i_y, i_z), and
j = (j_x, j_y, j_z). The average depth relative to the world origin \bar{z} equals the average value of
k · x computed over all points x in the scene.
The weak perspective model is an approximation to the full perspective model:
$$\mathbf{u} = \mathbf{P}_{persp}(\mathbf{x}) = \begin{pmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} i_x & i_y & i_z & o_u \\ j_x & j_y & j_z & o_v \\ k_x & k_y & k_z & o_z \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} \tag{3.5}$$
where the depth of the scene k · x is assumed to be roughly constant \bar{z}. The calibration
parameters of the two perspective models in Equations (3.4) and (3.5) are interchangeable.
When evaluating the calibration results in Section 3.8 below we use the full perspective
model. In the calibrated fitting algorithms in Section 3.9 we use the weak perspective model
because it is reasonable to assume that the depth of the face is roughly constant, a common
assumption in many face modeling papers [33, 44].
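The two image formation models can be compared directly in code. The sketch below implements Equations (3.4) and (3.5); the function names `project_weak` and `project_full` are ours, not from the thesis. When the true depth k · x of a point equals the average depth z̄ and the image origin offsets are zero, the two projections coincide:

```python
import numpy as np

def project_weak(x, i, j, f, o_u, o_v, o_z, z_bar):
    # Weak perspective, Equation (3.4): one shared scale f / (o_z + z_bar)
    # applied to every point, then a 2D offset.
    s = f / (o_z + z_bar)
    return s * (np.stack([i, j]) @ x) + np.array([o_u, o_v])

def project_full(x, i, j, f, o_u, o_v, o_z):
    # Full perspective, Equation (3.5): homogeneous projection with a
    # per-point division by the depth k . x + o_z, where k = i x j.
    k = np.cross(i, j)
    h = np.array([f * (i @ x + o_u), f * (j @ x + o_v), k @ x + o_z])
    return h[:2] / h[2]
```

The weak perspective model replaces the per-point divisor k · x + o_z with the constant z̄ + o_z, which is exactly the "roughly constant depth" assumption stated above.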
3.4 Camera Calibration Goal
Suppose we have N cameras n = 1, . . . , N. The goal of our camera calibration algorithm
is to compute the 2 × 3 camera projection matrix (i, j), the focal length f, the projection
of the world coordinate system origin into the image (o_u, o_v), and the depth of the world
coordinate system origin o_z for each camera. If we superscript the camera parameters with
n, we need to compute P^n_{wp} = (i^n, j^n, f^n, o^n_u, o^n_v, o^n_z). There are 7 unknowns in P^n_{wp} (rather
than 10) because there are only 3 degrees of freedom in choosing the 2 × 3 camera projection
matrix (i, j) such that it is orthonormal.
3.5 Calibration using Two Time Instants
For ease of understanding, we first describe an algorithm that uses two sets of multi-view
images captured at two time instants. Deriving this algorithm also allows us to show that
two sets of images are needed and to derive the requirements on the motion of the face between
the two time instants. In Section 3.6 we describe an algorithm that uses an arbitrary number
of multi-view image sets, and in Section 3.7 another algorithm that poses calibration as a
single large optimization.
The uncalibrated multi-view fitting algorithm of Chapter 3 uses the scaled orthographic
camera matrices P^n_{so} in Equation (2.8) and optimizes over the N scale parameters σ^n. Using
Equation (3.4) instead of Equation (2.8) and optimizing over the focal lengths f^n and origin
depths o^n_z is ambiguous: multiple values of f^n and o^n_z yield the same value of

$$\sigma^n = \frac{f^n}{o^n_z + \bar{z}^n}.$$

However, the values of f^n and o^n_z can be computed by applying (a slightly modified version
of) the uncalibrated multi-view fitting algorithm a second time with the face at a different
location. With the first set of images we compute i^n, j^n, o^n_u, o^n_v. Suppose that σ^n = σ^n_1 is
the scale at this location. Without loss of generality we also assume that the face model is
at the world coordinate origin at this first time instant. Finally, without loss of generality
we assume that the mean value of x computed across the face model (both the mean shape \bar{s}_0
and all shape vectors \bar{s}_i) is zero. It follows that \bar{z} is zero and so:

$$\frac{f^n}{o^n_z} = \sigma^n_1. \tag{3.6}$$
Suppose that at the second time instant the face has undergone a global 3D rotation³ R and a
3D translation T. Both the rotation R and the translation T have three degrees of freedom. We
then perform a modified multi-view fit, minimizing:
$$\sum_{n=1}^{N} \left\| A_0(\mathbf{u}) + \sum_{i=1}^{l} \lambda_i^n A_i(\mathbf{u}) - I^n(\mathbf{W}(\mathbf{u};\mathbf{p}^n)) \right\|^2 + K \cdot \left\| \mathbf{s}_0 + \sum_{i=1}^{m} p_i^n \mathbf{s}_i - \mathbf{P}^n_{so} \left( \mathbf{R} \left( \bar{\mathbf{s}}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{\mathbf{s}}_j \right) + \mathbf{T} \right) \right\|^2 \tag{3.7}$$
with respect to the N sets of 2D shape parameters p^n, the N sets of appearance parameters
λ^n_i, the one global set of 3D shape parameters \bar{p}, the 3D rotation R, the 3D translation T,
and the N scale values σ^n = σ^n_2. In this optimization all of the camera parameters (i^n, j^n,
o^n_u, and o^n_v) except the scale σ^n in the scaled orthographic model P^n_{so} are held fixed to the
values computed at the first time instant. Since the object underwent a global translation
T, we have \bar{z}^n = k^n · T, where k^n = i^n × j^n is the z-axis of camera n. It follows that:

$$\frac{f^n}{o^n_z + \mathbf{k}^n \cdot \mathbf{T}} = \sigma^n_2. \tag{3.8}$$
Equations (3.6) and (3.8) are two sets of linear simultaneous equations in the 2N unknowns
(f^n and o^n_z). Assuming that k^n · T ≠ 0 (the global translation T is not perpendicular to
any of the camera z-axes), these two equations can be solved for f^n and o^n_z to complete the
camera calibration. Note also that to uniquely compute all three components of T using the
optimization in Equation (3.7), at least one pair of the cameras must be verged (the axes (i^n,
j^n) of the camera matrices P^n_{so} must not all span the same 2D subspace).

³Note that in the case of calibrated camera(s) it is convenient to think of the relative motion between the object and the camera(s) as the motion of the object R, T. In the single camera case (Equation 2.9) and the multiple cameras, single time instant case with uncalibrated camera matrix P (Equation 3.1), it is convenient to think of the relative motion as camera motion.
3.6 Multiple Time Instant Calibration Algorithm
Rarely are two sets of multi-view images sufficient to obtain an accurate calibration. The
approach just described can easily be generalized to T time instants. The first time instant
is treated as above and used to compute i^n, j^n, o^n_u, o^n_v and to impose the constraint on f^n
and o^n_z in Equation (3.6). Equation (3.7) is then applied to the remaining T − 1 time instants to
obtain additional constraints:

$$\frac{f^n}{o^n_z + \mathbf{k}^n \cdot \mathbf{T}^t} = \sigma^n_t \quad \text{for } t = 2, 3, \ldots, T \tag{3.9}$$

where T^t is the translation estimated at the t-th time instant and σ^n_t is the scale of the face
in the n-th camera at the t-th time instant. Equations (3.6) and (3.9) are then re-arranged to
obtain an over-constrained linear system which can then be solved to obtain f^n and o^n_z.
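The re-arrangement and solve can be sketched with standard least squares. The helper below is a hypothetical illustration, not the thesis implementation: each constraint σ_t = f / (o_z + k · T_t) rearranges to the linear equation f − σ_t o_z = σ_t (k · T_t) in the unknowns (f, o_z) for one camera:

```python
import numpy as np

def solve_focal_and_depth(sigmas, depths):
    """Recover f and o_z for one camera from Equations (3.6) and (3.9).

    sigmas: scales sigma_t estimated at each time instant t = 1..T
        (with the face at the world origin at t = 1).
    depths: d_t = k . T_t, the camera z-axis component of the head
        translation at each time instant (d_1 = 0).
    Each constraint sigma_t = f / (o_z + d_t) is linear in (f, o_z):
        f - sigma_t * o_z = sigma_t * d_t.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    depths = np.asarray(depths, dtype=float)
    A = np.stack([np.ones_like(sigmas), -sigmas], axis=1)
    b = sigmas * depths
    (f, o_z), *_ = np.linalg.lstsq(A, b, rcond=None)
    return f, o_z
```

With only one time instant the system is rank deficient (the scale ambiguity noted above); two or more instants with distinct scales make it solvable, and extra instants simply over-constrain the least-squares fit.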
3.7 Calibration as a Single Optimization
The algorithms in Sections 3.5 and 3.6 have the disadvantage of being two-stage algorithms:
first they solve for i^n, j^n, o^n_u, and o^n_v, and then for f^n and o^n_z. It is better to pose calibration
as the single large non-linear optimization of:
$$\sum_{n=1}^{N} \sum_{t=1}^{T} \left\| A_0(\mathbf{u}) + \sum_{i=1}^{l} \lambda_i^{n,t} A_i(\mathbf{u}) - I^{n,t}(\mathbf{W}(\mathbf{u};\mathbf{p}^{n,t})) \right\|^2 + K \cdot \left\| \mathbf{s}_0 + \sum_{i=1}^{m} p_i^{n,t} \mathbf{s}_i - \mathbf{P}^n_{wp} \left( \mathbf{R}^t \left( \bar{\mathbf{s}}_0 + \sum_{j=1}^{\bar{m}} \bar{p}^t_j \bar{\mathbf{s}}_j \right) + \mathbf{T}^t \right) \right\|^2 \tag{3.10}$$
summed over all cameras n and time instants t, with respect to the 2D shape parameters
p^{n,t}, the appearance parameters λ^{n,t}_i, the 3D shape parameters \bar{p}^t, the rotations R^t, the
translations T^t, and the calibration parameters i^n, j^n, f^n, o^n_u, o^n_v, and o^n_z. In Equation (3.10),
I^{n,t} is the image captured by the n-th camera at the t-th time instant, and the average
depth in P^n_{wp} is given by \bar{z} = k^n · T^t from Equation (3.4). Finally, we define the world coordinate
system by enforcing R^1 = I and T^1 = 0.
The expression in Equation (3.10) can be optimized by iterating two steps: (1) The
calibration parameters are optimized given the 2D shape and the (rotated and translated) 3D shape;
i.e. the second term in Equation (3.10) is minimized with the 2D shape, 3D shape, R^t,
and T^t held fixed. This optimization decomposes into a separate 7-dimensional optimization for each
camera. (2) A calibrated multi-view fit (see Section 3.9) is performed on each frame in
the sequence; i.e. the entire expression in Equation (3.10) is minimized, keeping the
calibration parameters in P^n_{wp} fixed and optimizing over the 2D shape, 3D shape, R^t,
and T^t. The entire large optimization can be initialized using the multiple time instant
algorithm in Section 3.6.
3.8 Empirical Evaluation of Calibration
We tested our calibration algorithms on a trinocular stereo rig. Two example images of the
1300 input images from each of the three cameras are shown in Figure 3.3. The complete
input sequence is included in the movie calib input.mov. We wish to compare our calibration
algorithm with an algorithm that uses a calibration grid. In Sections 3.8.1 and 3.8.2
we present results for the epipolar geometry. We compute a fundamental matrix from the
camera parameters i^n, j^n, f^n, o^n_u, o^n_v, and o^n_z estimated by our algorithm, and use the 8-point
algorithm [22] to estimate the fundamental matrix from the calibration grid data. In
Section 4.5.3 we present results for the camera focal length and relative orientation of the
cameras, while also comparing the 3D model building algorithms.
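For reference, the grid-based fundamental matrix estimate uses the normalized 8-point algorithm [22]. The sketch below is a generic textbook implementation under our own naming, not the thesis code:

```python
import numpy as np

def eight_point(u1, u2):
    """Normalized 8-point algorithm for the fundamental matrix.

    u1, u2: (K, 2) arrays of corresponding image points (K >= 8) such
    that [u2; 1]^T F [u1; 1] = 0 for the returned 3x3 matrix F.
    """
    def normalize(u):
        # Translate to the centroid and scale to mean distance sqrt(2).
        c = u.mean(axis=0)
        s = np.sqrt(2.0) / np.linalg.norm(u - c, axis=1).mean()
        T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
        uh = np.column_stack([u, np.ones(len(u))])
        return uh @ T.T, T

    x1, T1 = normalize(u1)
    x2, T2 = normalize(u2)
    # Each correspondence gives one row of the linear system A f = 0,
    # with f the row-major vectorization of F.
    A = np.column_stack([x2[:, [0]] * x1, x2[:, [1]] * x1, x1])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Enforce the rank-2 constraint on F.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalizing transformations.
    return T2.T @ F @ T1
```

The normalization step is what makes the linear estimate numerically stable; without it, pixel coordinates in the hundreds make the design matrix badly conditioned.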
and 4) motion-stereo are summarized in Figure 4.2. Note that the input to the NR-SFM is
generated by stacking together the image sequences from each of the three cameras. All four
algorithms therefore use exactly the same set of input image data.
For each model, we display the mean shape (s0) and the first two shape modes (s1,
s2) from two viewpoints to help the reader visualize the 3D structure. The main thing
to note in Figure 4.2 is how “stretched” the NR-SFM and the MV-SFM models are. The
depth (z) values of all of the points in the mean shape appear to have been scaled by a
constant multiplier. The underlying cause of this stretching is the Bas-Relief ambiguity which
occurs when applying (non-rigid) structure-from-motion to data with little pose variation
[47, 38, 36, 23]. The problem manifests itself for both linear (NR-SFM) [9, 8, 45] and
non-linear (MV-SFM) [40] algorithms. The MV-SFM model is slightly better than the NR-SFM
model but the ambiguity persists as the problem is in the data. (Because the problem is
an ambiguity, it is possible that by chance the scale may be chosen more accurately. The
chance of accurate estimation of scale increases the more pose variation there is, and the
less noise there is [47, 38, 36, 23].) The motion-stereo and stereo models do not suffer from
this problem. In the next section we present a quantitative comparison using the calibration
algorithm derived in Section 3.3.
[Figure 4.2 panels: rows NR-SFM, MV-SFM, Stereo, and Motion-Stereo; columns Mean Shape s0, Shape Mode s1, and Shape Mode s2.]
Figure 4.2: This figure shows the mean shape and first two shape modes of the single-view and
multi-view non-rigid structure-from-motion models, the stereo model and the motion-stereo model.
The main thing to note is that the non-rigid structure-from-motion models are “stretched” in the
depth direction.
4.5.3 Quantitative Comparison using Camera Calibration
In this section we quantitatively compare the performance of the four 3D face model con-
struction algorithms in terms of how well the resulting models can be used to perform camera
calibration using the algorithm in Section 3.7. One possible way of obtaining quantitative
results might be to capture range data as ground-truth. This approach, however, requires
(1) calibrating and (2) aligning the range data to the image data. Static range data also
cannot be used to evaluate the deformable 3D shape modes. Ideally, we would like a way of
evaluating the 3D fidelity of the face models using video data of a moving face.
[Figure 4.3 bar charts: "Relative Yaw Between Each Pair of Cameras" (top) and "Focal Length of Each Camera" (bottom), comparing NR-SFM, MV-SFM, Stereo, and Motion-Stereo against GT.]
Figure 4.3: A quantitative evaluation of the 3D fidelity of the models, obtained by using the models
to calibrate the cameras using the algorithm in Section 3.7. The results show the motion-stereo
algorithm to perform the best. The single-view non-rigid structure-from-motion model results in
estimates of the yaw and focal length that are both off by a large factor. The two error factors
are roughly the same. Using multi-view non-rigid structure-from-motion does help in reducing the
errors to a significant degree, but the results are still not as good as the motion-stereo model. GT
refers to the ground truth values computed using the Matlab camera calibration toolbox [7].
The algorithm in Section 3.7 is used to calibrate weak perspective camera matrices for a
set of stereo cameras using a 3D face model. By comparing the results of this algorithm with
ground-truth calibration data, we can indirectly measure the 3D fidelity of the face models.
The relative orientation component of the calibration primarily measures the pose estimation
accuracy of the algorithms, without any absolute head pose ground-truth. Estimating the
focal lengths and the epipolar geometry requires more than the relative orientation: accurate
focal lengths and epipolar geometry require accurate non-rigid 3D tracking of the face
over an extended sequence.
We implemented the multi-view single optimization calibration algorithm in Section 3.7
and compared the results with a calibration performed using a standard calibration grid and
the Matlab Camera Calibration Toolbox [7]. In Figure 4.3 we present results for the yaw
rotation (about the vertical axis) between each pair of the three cameras and for each of the
three focal lengths. The yaw between each pair of the three cameras was computed from
the relative rotation matrices of the three cameras. We include results for each of the four
models, and compare them to the ground-truth. The results in Figure 4.3 clearly show that the
motion-stereo algorithm performs the best. The results for the NR-SFM model are a long
way off: the yaw¹ is underestimated by a large factor, and the focal length overestimated by
a similar factor. Based on the results in Figure 4.2, this is to be expected. The face model is
too deep, so a medium amount of parallax is generated by too small a yaw angle. Similarly,
a scaling of the model is interpreted as too large a motion in the depth direction and so too
large a focal length. The MV-SFM model suffers from the same problem due to the
scaled nature of the model, albeit generating better results than the NR-SFM model. Overall,
the motion-stereo² algorithm clearly outperforms both of these algorithms and gives estimates
¹The results for the pitch and roll between each pair of cameras are omitted. The pitch and roll are very close to zero and so there is little difference between any of the algorithms.
²Since the motion-stereo algorithm is the best among the four algorithms that we compared, we used the motion-stereo model for all the fitting and calibration experiments described in the previous sections.
                      Relative Yaw                    Focal Length
               Cam 1-2  Cam 1-3  Cam 2-3      Cam 1    Cam 2    Cam 3
NR-SFM          62.1%    66.2%    68.9%       193.5%   201.7%   214.8%
MV-SFM           8.6%    18.8%    25.7%        30.9%    35.5%    41.2%
Stereo          30.2%    15.4%     5.5%        23.9%    18.2%    15.1%
Motion-Stereo   21.7%     7.8%     1.5%         8.7%     3.0%     1.1%
Table 4.1: This table summarizes the results presented in Figure 4.3. For each 3D model we
compute the percentage deviation of the relative “yaw” between each pair of cameras and focal
length of each camera from the ground-truth data (computed using the Matlab camera calibration
toolbox [7].) The motion-stereo model results in estimates of yaw and focal length that are both
comparable to the ground-truth values whereas the estimates from the non-rigid structure-from-
motion (NR-SFM) model are both off by a large factor. The multi-view non-rigid structure-from-
motion (MV-SFM) model performs better than the NR-SFM model but overall the motion-stereo
model performs the best.
of yaw and focal lengths that are comparable to ground-truth calibration data (computed
using the Matlab camera calibration toolbox [7].) To further emphasize this observation, we
compute the percentage deviation of the yaw and focal length estimates of each 3D model
from the ground-truth data. Although the bar graphs in Figure 4.3 may look similar, the
motion-stereo results for the focal length are several times better than the stereo or MV-SFM
results by the relative error measure in Table 4.1.
Chapter 5
Dense Face Model Construction
In this chapter we outline an algorithm to build dense Active Appearance Models (AAMs) [12,
14, 13, 17, 26]. Our algorithm builds a dense model by iteratively building a face model,
fitting the model to image data and then refining the model. In the following section we
detail the refinement process of the algorithm.
5.1 Model Densification
In this section we describe our algorithm to construct a dense AAM. There are two main
reasons why we work with AAMs rather than 3D Morphable Models (3DMMs) [5, 25, 33,
41, 8]: (1) it allows us to avoid the issue of 3D data and instead focus on the core model
refinement algorithm, and (2) we already have an implementation of AAMs in our lab. With some
work, our algorithm could be extended to 3DMMs. However, no conceptual advancement is
required to do so; just a re-application of the same ideas to the 3D range scan and texture
map data.
The input to our algorithm can come from two different sources: (1) the vertices of a
sparse AAM, or (2) the vertices output by a rigid tracker or a face detector. Our algorithm
then constructs a dense AAM by iterating three important steps: (1)
Model Construction (Section 2.1), (2) Model Fitting (Section 2.2), and (3) Model Refinement
(described in this section). A flow diagram of our algorithm is given in Figure 5.1. The model
refinement step is the key part of the algorithm. We refine the model in three different ways:
(1) we add more mesh vertices to the AAM, (2) we improve the mesh connectivity by re-
triangulating the mesh, and (3) we refine the shape modes of the AAM. We give a detailed
description of each of these steps in the following sections.
1. Adding mesh vertices: The first step in the iterative refinement process is to add
more mesh vertices. There are a number of ways to choose a mesh triangle, and also the
location within the triangle at which to add the points. We adopt a simple but effective way
that ensures we end up with similarly sized triangles (see Figure 5.2). At each iteration, we
look at the current mesh triangulation and choose the mesh triangle with the longest edge.
Once we have chosen the triangle, a new point is added at the mid-point of the longest edge.
By making sure that the longest edge keeps being reduced, we avoid the formation of "long
thin" triangles. Figure 5.3 illustrates the addition of two points to the mesh. To maintain
symmetry we add a pair of points simultaneously to both halves of the face mesh at each
step.
One extension of this algorithm might be to explore other heuristics to choose where to
add the new points such as choosing the triangle with the largest average coding error, and
trying to place points on structural discontinuities. However, it should be noted that as the
mesh gets more and more dense, the choice of a specific heuristic becomes less important as
there are vertices close to any point on the face.
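The longest-edge subdivision step can be sketched as follows; this is an illustrative 2D implementation under assumed data structures (a vertex array and a list of index triples), not the thesis code:

```python
import numpy as np

def split_longest_edge(verts, tris):
    """One densification step (sketch): find the longest edge in the mesh,
    add its mid-point as a new vertex, and split every triangle that
    contains that edge into two.

    verts: (V, 2) vertex array for the mean shape s0.
    tris:  list of (a, b, c) vertex-index triples.
    """
    # Find the longest edge over all triangles.
    best, best_len = None, -1.0
    for tri in tris:
        for e in [(tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])]:
            length = np.linalg.norm(verts[e[0]] - verts[e[1]])
            if length > best_len:
                best, best_len = tuple(sorted(e)), length
    # Add the mid-point of that edge as a new vertex.
    new = len(verts)
    verts = np.vstack([verts, (verts[best[0]] + verts[best[1]]) / 2.0])
    # Split each triangle containing the edge (two if it is interior).
    out = []
    for a, b, c in tris:
        if best == tuple(sorted((a, b))):
            out += [(a, new, c), (new, b, c)]
        elif best == tuple(sorted((b, c))):
            out += [(b, new, a), (new, c, a)]
        elif best == tuple(sorted((c, a))):
            out += [(c, new, b), (new, a, b)]
        else:
            out.append((a, b, c))
    return verts, out
```

Repeated calls always halve the current longest edge, which is what keeps the triangles similarly sized as the mesh densifies.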
2. Image Consistent Re-Triangulation: Once we have the new points in place we
improve the mesh connectivity by doing an image consistent re-triangulation. This step is
inspired by the work in [30]. We look at each pair of adjacent triangles and flip the common
[Figure 5.1 flow diagram with nodes: Initial Training Images, Sparse Landmark Points, Model Build, Model Fit, Model Refinement (Shape Mode Refinement, Add Points, Image Consistent Triangulation), Iterate, Final Dense Models.]
Figure 5.1: An overview of our deformable dense model construction algorithm. The algorithm
is initialized using a set of sparse hand-labeled mesh points. The algorithm then iterates through
model building, model fitting and model refinement steps to produce the dense model. The refine-
ment step is further split into refining the shape modes, adding mesh vertices and image consistent
re-triangulation.
edge. We look at the RMS model reconstruction error:

$$\sqrt{\frac{1}{T} \sum_{t=1}^{T} \sum_{\mathbf{u} \in \mathbf{s}_0} \left[ A_0(\mathbf{u}) + \sum_{i=1}^{l} \lambda^t_i A_i(\mathbf{u}) - I^t(\mathbf{W}(\mathbf{u};\mathbf{p})) \right]^2} \tag{5.1}$$
across the training data to determine whether the flip was optimal or not. We repeat this
step for each pair of adjacent triangles formed by the newly added points. In Figure 5.5
[Figure content — mesh vertex addition pseudo code:
(1) Initialize using a sparse triangulated mesh.
(2) Choose the triangle with the longest edge in the mean shape s0.
(3) Add a new point at the mid-point of the longest edge.
(4) Propagate the new point to all training images from the mean shape s0 using the current estimate of the warp W(u; p).

Figure content — image consistent re-triangulation pseudo code:
Initialize with the current mesh topology.
Repeat:
    Get the initial image RMS error.
    Repeat (for each pair of adjacent triangles that include the new point):
        Flip the common edge of the quadrilateral.
        Note the image RMS error.
    Get the edge flip corresponding to the minimum image RMS error.
    Flip the edge, checking for thin or flipped triangles.]
Figure 5.2: The algorithm to add mesh vertices.
Figure 5.3: A pair of images showing the mesh before and after adding two new mesh points to
the longest edges. The newly added mesh points and the edges are highlighted. Note that adding
the new vertices causes two adjacent triangles to split.
we show the mesh before and after performing image consistent re-triangulation. Note that,
to keep the mesh visually regular, we make sure that the symmetry of the mesh is
maintained. The algorithm for image consistent re-triangulation is presented in Figure 5.4.
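As a concrete illustration, the vertex-addition and edge-flip steps above can be sketched in a few lines of Python. This is a simplified sketch, not our actual implementation: the mesh is represented as a list of 2D vertex coordinates plus triangles as vertex-index triples, and `image_rms_error` is a hypothetical callback standing in for the image RMS coding error computed over the training data.

```python
import math

def longest_edge(tri, verts):
    """Return (length, i, j) for the longest edge of triangle tri."""
    best = (0.0, tri[0], tri[1])
    for a, b in ((tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])):
        d = math.dist(verts[a], verts[b])
        if d > best[0]:
            best = (d, a, b)
    return best

def split_longest_edge(verts, tris):
    """Densification step: add a vertex at the mid-point of the globally
    longest edge, splitting every triangle that shares that edge."""
    _, i, j = max(longest_edge(t, verts) for t in tris)
    verts.append(((verts[i][0] + verts[j][0]) / 2.0,
                  (verts[i][1] + verts[j][1]) / 2.0))
    m = len(verts) - 1
    new_tris = []
    for t in tris:
        if i in t and j in t:
            k = next(v for v in t if v not in (i, j))   # opposite vertex
            new_tris += [(i, m, k), (m, j, k)]
        else:
            new_tris.append(t)
    return verts, new_tris

def flip_if_better(tris, pair, image_rms_error):
    """Image consistent re-triangulation step: flip the common edge of two
    adjacent triangles, keeping the flip only if the coding error drops."""
    t1, t2 = pair
    shared = [v for v in t1 if v in t2]                 # the common edge
    a = next(v for v in t1 if v not in shared)
    b = next(v for v in t2 if v not in shared)
    flipped = [(a, b, shared[0]), (a, b, shared[1])]
    others = [t for t in tris if t not in (t1, t2)]
    if image_rms_error(others + flipped) < image_rms_error(tris):
        return others + flipped
    return tris
```

In the real algorithm the error callback re-warps all training images through the candidate triangulation; here any scoring function with the same signature can be plugged in.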
3. Shape Mode Refinement: The third step of the refinement process is to refine the
shape modes. Since we are iteratively refining the model and building a new one, shape
mode refinement is equivalent to refining the locations of the mesh vertices in the training
data. The model fit step of our algorithm allows the mesh vertices to move around but the
movement is limited to the shape subspace of the face model. If we allow the mesh vertices to
(1) Initialize with the current mesh topology.
(2) For each pair of adjacent triangles that includes a new point:
(2.1) Flip the common edge of the quadrilateral.
(2.2) Note the image RMS error.
(3) Choose the edge flip with the minimum image RMS error.
(4) Flip the edge; check for thin or flipped triangles.

Figure 5.4: The algorithm for image consistent re-triangulation.
Figure 5.5: A pair of images showing the mesh before and after performing an image consistent
re-triangulation of the mesh [30]. The new points as well as the edges that were flipped are
highlighted.
move outside the shape subspace of the model, then we can potentially learn new deformations
and hence the current set of mesh vertices can better explain the face data. To do this we
perform a model fit step similar to the one described in Section 2.2, except that we replace
the shape modes with identity bases that span the entire 2D space. As indicated before, the
optimization equation is similar to Equation 2.3 except that the 2D shape s is now defined
using these basis vectors that allow all the points to move in both x and y directions:
\[
\left[\,\mathbf{s}_1 \;\cdots\; \mathbf{s}_{2M}\,\right] =
\begin{bmatrix}
1 & 0 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
0 & 0 & 1 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}_{2M \times 2M}
\]
where M is the number of mesh vertices. Even though the shape mode refinement step
is initialized by the model fit at the previous density, it is still very high dimensional and
so prone to local minima. Hence we regularize this step with two priors. The first is a
smoothness constraint. The second is a constraint that the initial sparse vertices cannot
move too far from the input (hand-marked) locations.
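The identity-basis construction used in the free fit can be sketched as follows. This is a minimal illustration under an assumed flattening convention (shapes stored as flat coordinate lists), not our actual implementation:

```python
def identity_shape_basis(M):
    """Build the 2M x 2M identity basis used in the free-fit step: one basis
    vector per coordinate of each of the M mesh vertices, so every vertex
    can move independently in both the x and y directions."""
    n = 2 * M
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def apply_shape(s0, basis, p):
    """s = s0 + sum_i p_i * s_i, with shapes flattened as flat lists of
    vertex coordinates."""
    s = list(s0)
    for pi, si in zip(p, basis):
        for k in range(len(s)):
            s[k] += pi * si[k]
    return s
```

With the identity basis, each parameter p_i simply adds a per-coordinate displacement, which is exactly what allows the vertices to leave the learned shape subspace.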
The smoothness constraint prevents each newly added point from moving too far from its
initial position with respect to the initial triangle that it was added
in. Figure 5.6 illustrates the mesh vertices that go into the optimization. The minimization
goal is given by:
\[
\sum_{\forall \mathbf{v}_4} \big\| \mathbf{v}_1 + \lambda\,(\mathbf{v}_2 - \mathbf{v}_1) + \mu\,(\mathbf{v}_3 - \mathbf{v}_1) - \mathbf{v}_4 \big\|^2 \tag{5.2}
\]
with respect to all newly added mesh vertices v4. The λ and µ coefficients are the barycentric
coordinates [6] with respect to the base triangle in the mean shape s0.
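The barycentric bookkeeping behind Equation 5.2 can be sketched directly. This is an illustrative sketch (function names are ours, not from the implementation): the coefficients are solved once in the mean shape, and the residual then measures how far the new vertex has drifted from the point those coefficients predict in an image.

```python
def barycentric_coeffs(v1, v2, v3, v4):
    """Solve v4 = v1 + lam*(v2 - v1) + mu*(v3 - v1) for (lam, mu)
    by Cramer's rule, in the mean shape s0."""
    ax, ay = v2[0] - v1[0], v2[1] - v1[1]
    bx, by = v3[0] - v1[0], v3[1] - v1[1]
    cx, cy = v4[0] - v1[0], v4[1] - v1[1]
    det = ax * by - ay * bx
    lam = (cx * by - cy * bx) / det
    mu = (ax * cy - ay * cx) / det
    return lam, mu

def smoothness_residual(v1, v2, v3, v4, lam, mu):
    """Squared residual of Eq. (5.2): distance between v4 and the point
    predicted by the mean-shape barycentric coordinates (lam, mu)."""
    px = v1[0] + lam * (v2[0] - v1[0]) + mu * (v3[0] - v1[0])
    py = v1[1] + lam * (v2[1] - v1[1]) + mu * (v3[1] - v1[1])
    return (px - v4[0]) ** 2 + (py - v4[1]) ** 2
```

If the new vertex stays at the location its base triangle predicts, the residual is zero; any drift is penalized quadratically.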
The second constraint restricts the movement of the initial points, enforcing that
they do not stray too far from their initial hand-specified locations. This constraint is
represented for a single image as:
\[
\Big\| \mathbf{s}_0 + \sum_{i=1}^{m} p_i\,\mathbf{s}_i - \mathbf{s} \Big\|^2 \tag{5.3}
\]
where the mean shape s0, 2D shape parameters p and the eigenvectors si are all defined over
the number of initial vertices and s is the initial hand-labeled mesh vertex locations [x y]
for a given image I.
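The prior of Equation 5.3 is straightforward to evaluate; a minimal sketch (with our own naming, and shapes stored as flat coordinate lists over the initial vertices) is:

```python
def landmark_prior(s0, modes, p, s_hand):
    """Squared error of Eq. (5.3): distance between the model-reconstructed
    sparse shape s0 + sum_i p_i * s_i and the hand-labeled locations s_hand.
    All shapes are flat lists [x1, y1, x2, y2, ...] over the initial
    (hand-marked) vertices only."""
    recon = list(s0)
    for pi, mode in zip(p, modes):
        for k in range(len(recon)):
            recon[k] += pi * mode[k]
    return sum((r - h) ** 2 for r, h in zip(recon, s_hand))
```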
The final optimization equation is a combination of the terms in Equations 2.3, 5.2 and 5.3.

Figure 5.6: This figure shows the triangle vertices used to impose the smoothness constraints. The new vertex is constrained by Equation 5.2 based on the location of the other three vertices.

The minimization goal is thus given by:
\[
\sum_{t=1}^{T} \sum_{\mathbf{u} \in \mathbf{s}_0}
\Big[ I^t\big(\mathbf{W}(\mathbf{u};\,\mathbf{p}^t)\big)
- \Big( A_0(\mathbf{u}) + \sum_{i=1}^{l} \lambda_i^t A_i(\mathbf{u}) \Big) \Big]^2
\;+\; K_1 \cdot \sum_{t=1}^{T} \Big\| \mathbf{s}_0 + \sum_{i=1}^{m} p_i^t\,\mathbf{s}_i - \mathbf{s}^t \Big\|^2
\;+\; K_2 \cdot \sum_{t=1}^{T} \sum_{\forall \mathbf{v}_4^t}
\big\| \mathbf{v}_1^t + \lambda\,(\mathbf{v}_2^t - \mathbf{v}_1^t) + \mu\,(\mathbf{v}_3^t - \mathbf{v}_1^t) - \mathbf{v}_4^t \big\|^2 \tag{5.4}
\]
with respect to the 2D shape p and appearance λi parameters. The optimization is done
using the algorithm in [4] but with a different prior. The weights K1 and K2 are chosen
by running the algorithm with different values of the weights and selecting the weighting
that gives the best performance. We also need to specify a suitable stopping criterion for
the algorithm to avoid the model getting unnecessarily dense. We choose a stopping criterion
based on the image coding error: if the coding error stops decreasing we terminate
the algorithm and output the dense model obtained at that iteration. Since our
model construction technique is an iterative offline algorithm, a typical iteration of our
MATLAB implementation on a Mac PowerPC G5 2.4 GHz machine takes
approximately 30 minutes.
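The outer control flow of the construction algorithm, including this stopping criterion, can be sketched as follows. All callables here are placeholders (our own names) standing in for the build, fit, refinement, and coding-error steps described in the text:

```python
def densify(images, sparse_landmarks, build_model, fit_model, refine,
            coding_error, tol=1e-3, max_iters=50):
    """Outer loop of the dense model construction algorithm: build, fit,
    and refine until the image coding error stops decreasing."""
    landmarks = sparse_landmarks
    prev_err = float("inf")
    model = None
    for _ in range(max_iters):
        model = build_model(images, landmarks)      # build AAM at current density
        landmarks = fit_model(model, images)        # model fit step
        err = coding_error(model, images)
        if prev_err - err < tol:                    # coding error stopped decreasing
            break
        prev_err = err
        # free fit + add mesh vertices + image consistent re-triangulation
        landmarks = refine(model, images, landmarks)
    return model
```

The `tol` threshold and iteration cap are illustrative; in practice the loop terminates when adding density no longer reduces the coding error.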
5.2 Experimental Results
In this section we present results from a number of experiments that demonstrate the improved
performance of our densification algorithm. There are two ways to initialize our algorithm. One
way is to start with an existing sparse AAM and then increase the mesh density. In Sec-
tions 5.2.1, 5.2.2 and 5.2.3 we present results for this case. We could also automatically
construct a dense AAM using the output of a rigid tracker as initialization. In Section 5.2.4
we present tracking results using this approach.
5.2.1 Quantitative Evaluation
In this section we present quantitative comparisons to demonstrate the improved perfor-
mance of our algorithm in building dense models. We evaluate our model construction
algorithm using the implicitly computed dense correspondence by comparing it with
those estimated by standard optical flow techniques [24, 27, 31]. We perform our comparison
with AAMs, but similar results could be obtained with 3DMMs.
5.2.1.1 Ground-Truth Data Collection
We collected high-resolution face data using Canon EOS SLR cameras capable of capturing
6 megapixel images. We obtained facial ground-truth data using a form of hidden markers
on the face. (See [39] for a different way of embedding hidden ground-truth.) These ground-
truth points have to be small so as not to interfere with the working of the algorithms. We
therefore mark a number of very small black dots on the face. We then record the facial
deformations along with the marked ground-truth points using the high-resolution cameras.
Figure 5.7 shows one such high resolution image with ground-truth points marked on it along
with a zoomed in version highlighting the ground-truth point locations. The input data to
Figure 5.7: On the left is an example of the high resolution image obtained using the experimental
setup described in Section 5.2.1.1. The hand-marked ground-truth points on the face are highlighted
using dark circles. On the right are two examples of the down sampled images. Notice that the
ground-truth points are almost invisible in the down sampled images.
all algorithms consists of all the high resolution images (3072 x 2040) down sampled to one
fourth their size (768 x 510). The ground-truth points are no longer visible in these low
resolution images and hence do not influence the working of the algorithms. Two example
down sampled images are also shown in Figure 5.7.
Note that we use only a single person’s ground-truthed data for the quantitative com-
parisons. The reason for this is that the notion of corresponding points is not well defined
across different people. We cannot determine which point on the face of person B a given
point on the face of person A should correspond to. Also note that we cannot use range data
to help with this process since the important aspect of the ground-truth is the non-rigid
mapping from frame to frame. Knowing the perfect 3D depth from range data does not
provide us with this information.
5.2.1.2 Images used for Optical Flow Computation
Optical flow can be particularly hard when the motion is large. In our case, the head moves
around quite a bit in the input image. Our algorithm keeps track of where the original
head locations were and so implicitly avoids this large search space. We provide this same
information to the optical flow algorithms by warping all the input images into the coordinate
frame of the mean face for the initial sparse model. This means that the maximum flow for
all of the images is of the order of 3-4 pixels, well within the search ranges of most optical flow
algorithms. Another issue that can cause difficulty for optical flow algorithms is boundary
effects. We avoid this by also warping a boundary region around the face. We present
examples of the face mesh and the original and warped images in Figure 5.9. Observe that
the warped images are closer to each other, which makes the task easier for the optical flow
algorithms.
5.2.1.3 2D Ground-Truth Points Prediction Results
We compare the performance of four different algorithms: (1) our densification algorithm, (2)
the optical flow algorithm of Horn and Schunck [24], (3) the optical flow algorithm of Lucas and
Kanade [27], and (4) optical flow with diffused connectivity (Openvis3D) [31], based on their
ability to generate accurate feature point locations, which are used to predict ground-truth
data point locations. We use the OpenCV implementations [1] for algorithms (2) and (3).
The evaluation methodology we adopt is based on ground-truth prediction. We use the
dense correspondence obtained from our algorithm and the optical flow algorithms to predict
the locations of the ground-truth points in all other images, given their position in one image.
We repeat this procedure for each image and finally average the predicted locations of the
ground-truth points. Once we have the predicted locations of the ground-truth points in
all images we compute the RMS spatial error between the predicted and the actual ground-truth
point locations. To perform a fair comparison among different algorithms we do all
the above computations in the mean shape by warping all images and correspondence onto
the mean shape.
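The evaluation metric above reduces to a short computation. This is a minimal sketch (our own naming), taking predicted and actual ground-truth point locations already warped into the mean-shape frame:

```python
import math

def rms_prediction_error(predicted, actual):
    """RMS spatial error between predicted and actual ground-truth point
    locations, both given as lists of (x, y) in the mean-shape frame."""
    assert len(predicted) == len(actual)
    total = sum((px - ax) ** 2 + (py - ay) ** 2
                for (px, py), (ax, ay) in zip(predicted, actual))
    return math.sqrt(total / len(predicted))
```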
We present the results of our algorithmic comparisons in Figure 5.8 for two different
people. We plot the RMS ground-truth prediction error against the number of mesh points
(algorithm iterations). The number of ground-truth points used for evaluation is 21 in the
first case and 13 in the second. The results indicate that the densification algorithm
produces dense correspondences that lead to greater accuracy in predicting ground-truth
data. The optical flow algorithms clearly perform worse. This validates our claim that many
standard optical flow techniques are poor predictors of point locations for images
taken under varying illumination, involving significant object deformations, and consisting of
sparsely textured data.
5.2.1.4 3D Ground-Truth Points Prediction Results
In this section we perform comparisons similar to the ones in the previous section to evaluate
the 3D consistency of the correspondence computed by our algorithm with respect to the
ground-truth. In this case we evaluate our algorithm on trinocular stereo data. We repeat
the experimental setup described in Section 5.2.1.1 except that now we have a stereo rig with
calibrated cameras [7]. We use the initial sparse correspondence (the input to our algorithm)
and the dense correspondence from our algorithm and triangulate them to obtain 3D point
locations. We also triangulate the 2D ground-truth points to obtain 3D ground-truth points.
We compare the 3D fidelity of the sparse and the dense correspondences by computing the
distance of each 3D ground-truth point from the corresponding triangular plane formed by
the sparse and dense mesh vertices. We find that the ground-truth points are closer (in
the depth direction) to the dense triangular mesh planes than the sparse ones, indicating
[Figure 5.8 plots omitted: RMS ground-truth point location error vs. number of mesh vertices, with curves for hand-labeled landmarks, Optical Flow (Openvis3D), Optical Flow (Lucas and Kanade), Optical Flow (Horn and Schunck), and the densification algorithm output; left panel Person 1, right panel Person 2.]
Figure 5.8: A comparison of the algorithms on their ability to generate landmarks that lead to
better prediction of ground-truth point locations for two different people. The x-axis plots
algorithmic iterations (each iteration adds 10 mesh points) and the y-axis plots the RMS
ground-truth point prediction error. The densification algorithm clearly performs the best.
that our densification algorithm generated mesh vertices with higher 3D fidelity. We plot
the results of our quantitative comparison in Figure 5.10.
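The point-to-plane measure used in this comparison can be sketched as follows; a minimal illustration (our own naming) of the distance from a triangulated 3D ground-truth point to the plane of its enclosing mesh triangle:

```python
import math

def point_to_plane_distance(p, a, b, c):
    """Distance from 3D point p to the plane through triangle (a, b, c)."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]          # plane normal: u x v
    norm = math.sqrt(sum(x * x for x in n))
    w = [p[i] - a[i] for i in range(3)]
    return abs(sum(ni * wi for ni, wi in zip(n, w))) / norm
```

Summing this distance over all ground-truth points, for the sparse and for the dense mesh, gives the per-image values compared in Figure 5.10.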
5.2.2 Fitting Robustness
In Figure 5.13 we show quantitative results to demonstrate the increased robustness of our
dense AAMs. In experiments similar to those in [28], we generated 1800 test cases (20
trials each for 90 images) by randomly perturbing the 2D shape model from a ground-truth
obtained by tracking the face in video sequences and allowing the algorithms to converge. The
2D shape and similarity parameters obtained from the dense AAM tracks were perturbed
and the perturbations were projected onto the ground-truth tracks of the sparse AAMs.
This ensures that the initial perturbation is a valid starting point for all algorithms. We
Figure 5.9: (a) An example of the mesh used to warp input images onto the mean shape for
computing optical flow. The face mesh is extended to eliminate boundary effects for optical flow
algorithms. (b) The original input images to our algorithm. Note that it is difficult for optical
flow algorithms to work on these images with varying head locations. (c) The two images from (b)
warped onto the mean shape using the mesh from (a). Note that by warping the images to mean
we make it easier for the optical flow algorithms.
then run each algorithm (one using the dense AAM and the other with the sparse AAM)
from the same perturbed starting point and determine their convergence by computing the
RMS error between the mesh location of the fit and the ground-truth mesh coordinates.
The algorithm is considered to have converged if the RMS spatial error is less than 2.0
pixels. The magnitude of the perturbation is chosen to vary on average from 0 to 4 times
the 2D shape standard deviation. The perturbation results were obtained on the trinocular
stereo data (Section 5.2.1.1) for each of the three camera views and the average frequency of
convergence is reported in Figure 5.13. The results show that the dense AAM converges to
ground truth more often than the sparse AAM. The increased robustness of the dense AAM
may be surprising given its apparent increased flexibility. But note that both the sparse and
dense AAMs have the same number of shape modes. The increased robustness of the dense
AAM is because it is a better (more compact) coding of the underlying phenomenon. Also
note that since both the sparse and the dense AAMs have the same number of parameters
[Figure 5.10 plot omitted: distance in mm (y-axis) vs. image index (x-axis), comparing sparse and dense correspondence.]
Figure 5.10: The distance of the triangulated 3D ground truth points from the 3D mesh plane
for each 3-frame. The values were computed for six 3-frames. The smallest triangle in which the
ground-truth point lies in 2D was computed. The distance was computed between the triangular
plane (formed by the 3D mesh vertices) and the corresponding 3D ground truth points. This was
repeated for 21 ground-truth points and the sum of the distances was computed. The average
distance across images for the sparse correspondence (69 mesh points) is 27.84 mm whereas for
the dense correspondence (168 mesh points) it is 13.625 mm.
that are optimized during the fit, the dense AAM fitting is as fast as the sparse AAM fitting.
The additional overheads, such as computing the affine warp for composition [28], hardly
affect the speed of fitting. A typical dense AAM (168 points) fit iteration takes 0.25 secs
using a MATLAB implementation on a Mac PowerPC G5 2.4 GHz machine, while fitting to
an image of VGA (640 x 480) resolution.
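The perturbation protocol above can be sketched compactly. This is an illustrative harness, not our experimental code: `fit` is a placeholder for running an AAM fit to convergence from a perturbed starting shape, and the Gaussian perturbation stands in for the scaled shape-mode perturbations used in the actual experiments:

```python
import math
import random

def convergence_frequency(gt_shapes, fit, sigma, trials=20, thresh=2.0, seed=0):
    """Fraction of randomly perturbed starting points from which `fit`
    returns a mesh within `thresh` RMS pixels of the ground truth."""
    rng = random.Random(seed)
    converged = 0
    total = 0
    for gt in gt_shapes:                       # one list of (x, y) per image
        for _ in range(trials):
            start = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))
                     for x, y in gt]
            result = fit(start)
            err = math.sqrt(sum((rx - gx) ** 2 + (ry - gy) ** 2
                                for (rx, ry), (gx, gy) in zip(result, gt))
                            / len(gt))
            converged += err < thresh
            total += 1
    return converged / total
```

Running this for a range of `sigma` values (0 to 4 times the shape standard deviation in the experiments) and plotting the returned frequency yields curves like those in Figure 5.13.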
Figure 5.11: Our algorithm can be applied to data of multiple people. Here we show a few frames
of a dense multi-person AAM being used to track three different people. See face track.mov for
the complete tracking sequences.
5.2.3 Face Tracking
In Section 5.2.1 we compared our algorithm on single-person data to allow a quantitative
comparison on ground-truth. Our algorithm can of course be applied to data of any number of