
Sparse Coding for Flexible, Robust 3D Facial-Expression Synthesis

Yuxu Lin and Mingli Song ■ Zhejiang University

Dao Thi Phuong Quynh and Ying He ■ Nanyang Technological University

Chun Chen ■ Zhejiang University

A proposed modeling framework applies sparse coding to synthesize 3D expressive faces, using specified coefficients or expression examples. It also robustly recovers facial expressions from noisy and incomplete data. This approach can synthesize higher-quality expressions in less time than the state-of-the-art techniques.

The last decade has witnessed rapid development of 3D facial-expression synthesis owing to the extensive requirements for highly realistic animation in movies and video games. Although 3D scanners enable the scanning of real human faces, capturing facial-expression sequences for large numbers of people is expensive and difficult. On the other hand, synthesizing realistic facial expressions without high-quality 3D face models is challenging and time-consuming. So, a flexible, robust framework to synthesize realistic 3D facial expressions from captured data would be highly desirable.

Existing 3D facial-expression modeling approaches involve complicated representation and manipulation. (For more on some of these approaches, see the sidebar.) In contrast, the human brain can represent and reconstruct facial expressions in a sparse way and figure out intact faces from incomplete or vague faces. So, it's natural to follow the human manner of representing and recovering facial expressions to overcome the conventional approaches' limitations. Sparse coding is the representation of items by the strong activation of a relatively small set of neurons. It has proven effective in mimicking receptive fields of neurons in the visual cortex, and its sparse representation is close to how human brains represent objects.1 Moreover, sparse coding can be used to train redundant (overcomplete) dictionaries and sparse coefficients, which can stably recover signals, with a very low noise level, from noisy data.2

We've developed an approach that flexibly applies sparse coding to obtain a redundant dictionary of subject faces and basic expressions. It can then synthesize 3D facial expressions from newly specified coefficients based on the redundant dictionary. Unlike existing approaches, ours can flexibly synthesize facial expressions and generate expressions based on noisy or incomplete 3D faces.

The Basic Problem

A practical system for 3D facial animation should have three features:

■ Expression generation. It can accurately generate realistic expressions for an arbitrary neutral face from specified coefficients.

■ Expression retargeting or cloning. It can reproduce an example 3D face's expressions on any other neutral face.

■ Robustness. It can generate realistic facial expressions robustly from noisy or even incomplete 3D faces.

The key issue in designing such a system is finding an effective representation of facial expressions that's robust to noise. Strongly inspired by the abstracting nature of sparse coding, we adopt a sparse-coding formulation that minimizes this function:

\arg\min_{C,D}\left(\|X - D\cdot C\|^{2} + \gamma\cdot\|C\|_{0}\right), (1)

where D is a dictionary consisting of several basis vectors of the linear space spanned by the training set. C is the set of coefficient vectors corresponding to the set of training faces, X. Each face is represented by its vertex coordinates in a column vector [v_{1x}, v_{1y}, v_{1z}, \ldots, v_{ix}, v_{iy}, v_{iz}, \ldots, v_{kx}, v_{ky}, v_{kz}]^{T}, where k represents the number of vertices and v_{ix}, v_{iy}, and v_{iz} are the x, y, and z coordinates of vertex i. The arrangement of faces in X depends on the application—that is, coefficient-based facial-expression synthesis or facial-expression retargeting. We discuss this in more detail later.

The second term of Equation 1, \gamma\cdot\|C\|_{0}, measures the sparseness of C corresponding to the training faces. That is, it constrains most elements of C to be zero. So, it provides a compact representation of the training set.

Although solving Equation 1 is NP-hard, David Donoho proved that replacing \gamma\cdot\|C\|_{0} with the L1-norm \gamma\cdot\|C\|_{1} preserves the sparsest solution in most situations, which leads to a much simpler optimization:3

\arg\min_{C,D}\left(\|X - D\cdot C\|^{2} + \gamma\cdot\|C\|_{1}\right) (2)

such that \sum_{i} D_{ij}^{2} = 1, \forall j, (3)

where Equation 3 normalizes the basis vectors in D, preventing them from being zero, which would be the trivial solution. We omit Equation 3 in the rest of this explanation for convenience.

To efficiently solve Equation 2, we use Honglak Lee and his colleagues' approach.4 They formulated the equation as a combination of two convex optimization problems, then employed feature-sign search to solve the L1-regularized least-squares problem to learn C. To learn D, they proposed a Lagrange-dual method for the L2-constrained least-squares problem.

Related Work in Facial-Expression Synthesis

There has been considerable research on 3D facial-expression synthesis since the 1970s.1,2 The existing approaches fall into three categories: parameter-driven, example-based, and learning-based synthesis.

Parameter-driven synthesis parameterizes 3D faces and controls their shape and action by a parameter set. This technique was common in computer graphics' early days.3,4 Unfortunately, it usually uses a low-resolution face model to reduce computational cost and thus can't mimic subtle expression details (wrinkles, furrows, and so on) on the target face, owing to the sparse vertex distribution. Although this technique adds textures to the 3D faces to enhance the realism, assessing the synthesized deformations' quality is difficult because the textures mask them.

To overcome parameter-driven methods' limitations, researchers developed example-based synthesis. Jun-Yong Noh and Ulrich Neumann used motion vectors to represent the vertex deformation that expressions cause in the source face.5 They cloned facial expressions by applying motion vectors on the target face. Mingli Song and his colleagues introduced the vertex-tent coordinate to model local deformations in the source face.6 They transferred these local deformations to the target face under a consistency constraint. Example-based synthesis is popular but is computationally expensive and sensitive to noise, which might lead to a singular deformation and produce flaws and artifacts on the target face. In addition, it can't produce an expressive face without example faces.

To make expression synthesis more adaptable, some researchers have proposed learning-based methods that gain knowledge from training faces. Daniel Vlasic and his colleagues presented a multilinear model for face modeling.7 They organized the training faces to construct a three-mode tensor (expression, subject, and vertex). They synthesized facial expressions by applying different coefficients obtained by tensor decomposition. Later, Dacheng Tao and his colleagues presented Bayesian tensor analysis to explain the multilinear model from a probabilistic view.2 Both methods work well for clean and well-tessellated data but can't deal with 3D faces with missing data, which structured-light scanners often produce owing to highlights or occlusions.

References
1. F.I. Parke and K. Waters, Computer Facial Animation, A K Peters, 1996.
2. D. Tao et al., "Bayesian Tensor Approach for 3-D Face Modeling," IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 10, 2008, pp. 1397–1410.
3. F.I. Parke, "Parameterized Models for Facial Animation," IEEE Computer Graphics and Applications, vol. 2, no. 9, 1982, pp. 61–68.
4. K. Waters, "A Muscle Model for Animating Three-Dimensional Facial Expression," Proc. Siggraph, ACM, 1987, pp. 17–24.
5. J.-Y. Noh and U. Neumann, "Expression Cloning," Proc. Siggraph, ACM, 2001, pp. 277–288.
6. M. Song et al., "A Generic Framework for Efficient 2-D and 3-D Facial Expression Analogy," IEEE Trans. Multimedia, vol. 9, no. 7, 2007, pp. 1384–1395.
7. D. Vlasic et al., "Face Transfer with Multilinear Models," ACM Trans. Graphics, vol. 24, no. 3, 2005, pp. 426–433.



Our sparse-coding facial-expression-synthesis framework uses three operations. The train operation solves Equation 2, abstracting the linear space spanned by the training set. The project operation computes the coefficient vector set (C) for X on the basis of D, which solves this subproblem of Equation 2:

\arg\min_{C}\left(\|X - D\cdot C\|^{2} + \gamma\cdot\|C\|_{1}\right).

The recover operation recovers a face from C:

X_{i} = D \cdot C_{i},

where C_{i} is the coefficient vector and X_{i} is its corresponding recovered face.
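To make the three operations concrete, here is a minimal sketch in Python. It uses scikit-learn's DictionaryLearning and sparse_encode as stand-ins for the feature-sign-search and Lagrange-dual solvers described above; the function and parameter names (train, project, recover, n_atoms, gamma) are ours, not the authors'.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

def train(X, n_atoms=20, gamma=0.3):
    """Approximate argmin_{C,D} ||X - D*C||^2 + gamma*||C||_1 (Equation 2).

    X is a (3k, m) matrix whose columns are training faces (stacked vertex
    coordinates). scikit-learn expects row samples, hence the transposes.
    Returns D (3k, n_atoms) with unit-norm atoms and C (n_atoms, m)."""
    learner = DictionaryLearning(n_components=n_atoms, alpha=gamma,
                                 fit_algorithm='lars',
                                 transform_algorithm='lasso_lars')
    learner.fit(X.T)
    D = learner.components_.T            # basis vectors as columns
    C = learner.transform(X.T).T         # sparse coefficient vectors
    return D, C

def project(X, D, gamma=0.3):
    """Estimate sparse coefficients for faces X given a fixed dictionary D."""
    return sparse_encode(X.T, D.T, algorithm='lasso_lars', alpha=gamma).T

def recover(D, C):
    """Reconstruct faces from coefficients: X_i = D * C_i."""
    return D @ C
```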

Facial-Expression Synthesis

Our framework starts with data acquisition and model training to learn the dictionary. Using the dictionary, it then carries out facial-expression synthesis, facial-expression retargeting, and incomplete-face recovery.

Data Acquisition

To capture expressions, we employ the video-based 3D-shape acquisition system that Song Zhang and Peisen Huang developed.5 The system can obtain the geometry and textures with 512 × 512 resolution at 30 fps. This lets us accurately measure all subtle facial expressions.

We hypothesized that facial expressions are approximate isometric transformations. If that’s the case, intrinsic properties such as geodesics and Gaussian curvature will be expression invariant.

To verify our hypothesis, we marked feature points on two expressions of the same subject face (Figure 1a shows one example) and computed the pairwise geodesics—that is, the geodesic between every pair of feature points. We removed the eyes and mouth to eliminate topological ambiguity. Then we computed the corresponding geodesic difference between the two expressions.

Figure 1b shows the geodesic distances for different expressions of the same subject. More than 75 percent of the geodesic differences are within ±3 mm (or 1.5 percent), and 90 percent are within ±5 mm (or 2.5 percent). These results verify that facial expressions are approximate isometric transformations. As Figure 2 shows, the geodesic patterns are highly consistent among the expressions.

After computing the geodesic distances for a face, we crop the face using a user-specified threshold.

Given the captured data (see Figure 3a), we first specify the salient feature points, such as the eyes and mouth. Then we compute a multisource geodesic mask using the detected feature points (see Figure 3b). Because the geodesic is invariant in facial expressions, we can accurately and robustly segment the expression data (see Figure 3c).
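As an illustration of this segmentation step (not the authors' implementation), the multisource geodesic distances can be approximated by shortest paths along mesh edges; the names vertices, faces, feature_idx, and threshold below are assumed inputs.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_mask(vertices, faces, feature_idx, threshold):
    # Build a weighted edge graph of the triangle mesh.
    i = np.concatenate([faces[:, 0], faces[:, 1], faces[:, 2]])
    j = np.concatenate([faces[:, 1], faces[:, 2], faces[:, 0]])
    w = np.linalg.norm(vertices[i] - vertices[j], axis=1)
    n = len(vertices)
    graph = coo_matrix((w, (i, j)), shape=(n, n))

    # Multisource geodesic distance: minimum over all feature points.
    dist = dijkstra(graph, directed=False, indices=feature_idx)
    dist = dist.min(axis=0)

    # Keep the facial region within the user-specified geodesic threshold.
    return dist <= threshold
```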

Model Training

We collect the training dataset of 3D faces from the subjects using the 3D scanner, as we just discussed. For each person, we obtain one neutral 3D face and a number of basic expressions (our experiments used nine). We denote the 3D face dataset as Tr = {Xij}, where Xij is the ith subject with the jth basic facial expression. Xij is a column vector containing vertex coordinates.

Using the training dataset, the train operation abstracts the linear space of each basic facial expression:

Figure 1. Data acquisition. (a) Two expressions of the same subject. We marked 37 salient features on two 3D faces of the same subject and computed the geodesics between every pair of features. Then, we computed the difference of the pairwise geodesics between the two expressions. (b) The histogram of these differences (frequency versus error in mm): more than 75 percent of the geodesic differences are within ±3 mm (or 1.5 percent), and 90 percent are within ±5 mm (or 2.5 percent). The results demonstrate that human facial expressions are approximate isometric transformations.


\arg\min_{C,D}\left(\|X - D\cdot C\|^{2} + \gamma\cdot\|C\|_{1}\right), (4)

where X = [X_{:,1}^{T}, X_{:,2}^{T}, \ldots, X_{:,n}^{T}]^{T}, D = [D_{1}^{T}, D_{2}^{T}, \ldots, D_{n}^{T}]^{T}, and X_{:,j} = [X_{1j}, X_{2j}, \ldots, X_{mj}] denotes the 3D faces with the jth basic facial expression. In other words, C represents each type of basic facial expression X_{:,j} in terms of the corresponding dictionary D_{j}. Also, C measures only the variance of face shapes in a single basic facial expression instead of the variance among the different basic facial expressions. That is, we expect that different facial expressions of the same subject will share the same coefficient vector in terms of D, which we use as the constraint for facial-expression synthesis.
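A small sketch of how this stacked training matrix can be assembled; Tr[i][j], the ith subject's jth basic expression as a 3k-vector, is an assumed data layout. Running the hypothetical train helper from earlier on this matrix yields the vertically stacked dictionary D = [D_1; ...; D_n] with one shared coefficient column per subject.

```python
import numpy as np

def stack_training_set(Tr, n_subjects, n_expressions):
    # X_{:,j} = [X_1j, X_2j, ..., X_mj]: one (3k x m) block per basic expression.
    blocks = [np.column_stack([Tr[i][j] for i in range(n_subjects)])
              for j in range(n_expressions)]
    # X = [X_{:,1}^T, ..., X_{:,n}^T]^T: stacking the blocks vertically forces a
    # single coefficient column to explain all expressions of the same subject.
    return np.vstack(blocks)
```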

We can also use the training dataset to train a dictionary D_{A} that abstracts the linear space spanned by faces from different subjects:

\arg\min_{D_{A},C_{A}}\left(\|Y - D_{A}\cdot C_{A}\|^{2} + \gamma\cdot\|C_{A}\|_{1}\right),

where Y = [X_{11}, X_{12}, \ldots, X_{21}, X_{22}, \ldots, X_{mn}] is a matrix consisting of all the training faces. We use D_{A} to recover facial expressions from noisy or incomplete faces.

Coefficient-Based Synthesis

Our approach requires only one neutral face to synthesize any other expressions of the same subject. By treating that face as a new subject, we synthesize expressions in the following three steps (see Figure 4).

Basic-expression recovery. As we described earlier, the learned dictionary D_{i} reflects only the variance of facial shapes in the ith basic expression, and different basic expressions from the same subject should share the same coefficient vector. So, we can synthesize all basic expressions of the new subject if we can compute the coefficient vector corresponding to each basic expression. Given a neutral face of a new subject and a subdictionary D_{1}, we estimate the corresponding coefficient vector by projecting the neutral face to D_{1}. The recover operation synthesizes basic expressions:

\arg\min_{C_{F}}\left(\|F - D_{1}\cdot C_{F}\|^{2} + \gamma\cdot\|C_{F}\|_{1}\right),

[Y_{F1}^{T}, Y_{F2}^{T}, \ldots, Y_{Fm}^{T}]^{T} = D \cdot C_{F},

where C_{F} is the estimated corresponding coefficient vector, and Y_{F} = [Y_{F1}, Y_{F2}, \ldots, Y_{Fm}] denotes the m synthesized basic expressions of the given neutral face.

Expression space learning. Using the synthesized basic expressions, the train operation learns the new subject's (linear) expression space D_{F}:

Figure 2. Given (a) four expressions of the same subject, we compute the geodesic distance from the nose tip and (b) visualize them using isolines. These geodesic patterns are highly consistent except for some small differences near the mouth and eyes.


\arg\min_{D_{F}}\left(\|Y_{F} - D_{F}\cdot C_{F}\|^{2} + \gamma\cdot\|C_{F}\|_{1}\right).

Similar to Equation 4, this problem’s solution also employs the feature-sign search and a Lagrange-dual method.

Expression synthesizing. After reconstructing D_{F}, we can synthesize any expression F_{e} by the recover operation—that is, a linear combination of the basis vectors in D_{F}:

F_{e} = D_{F} \cdot C_{e},

where C_{e} is the expression's coefficient vector. We can specify C_{e} manually or automatically (for example, through expression retargeting).
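The three steps can be condensed into a short sketch that reuses the hypothetical train and project helpers from earlier; D_stacked is the dictionary learned from the stacked training set of Equation 4, and D1 is its block corresponding to the neutral expression. These names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def synthesize_from_neutral(neutral_face, D_stacked, D1, n_expressions, gamma=0.3):
    k3 = neutral_face.shape[0]                      # 3 * number of vertices
    # Step 1: basic-expression recovery. Project the neutral face onto D1,
    # then recover all basic expressions with the stacked dictionary.
    C_F = project(neutral_face.reshape(-1, 1), D1, gamma)
    Y_F = (D_stacked @ C_F).reshape(n_expressions, k3).T   # one column per expression

    # Step 2: expression space learning. Learn the new subject's dictionary D_F.
    D_F, _ = train(Y_F, n_atoms=n_expressions, gamma=gamma)

    # Step 3: expression synthesizing. Any coefficient vector C_e gives a face.
    def synthesize(C_e):
        return D_F @ C_e
    return D_F, synthesize
```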

Figure 5 illustrates the results. Although most of the synthesized expressions in Figure 5 are realistic, the last two seem unnatural owing to infeasible coefficient vectors, which usually exceed the natural faces' scope.

Facial-Expression Retargeting

Expression retargeting usually makes an expression analogy between the performer and an avatar. So, we can state a typical expression-retargeting problem as follows: given a source neutral face S, a target neutral face T, and a source expression face S′, construct a target face T′ whose expression is the same as S′.

Figure 6 depicts the expression-retargeting workflow, which comprises the following three steps.

Basic-expression recovery. To recover the basic expressions for the source and target faces, we employ the method we used for expression synthesis. We denote the source face's basic expressions as Y_{S} = [Y_{S1}, Y_{S2}, \ldots, Y_{Sn}] and the target face's expressions as Y_{T} = [Y_{T1}, Y_{T2}, \ldots, Y_{Tn}].

Expression space colearning. Using the recovered basic expressions, the train operation jointly learns the expression spaces of the source and target:

Figure 3. Creating and using geodesic masks. (a) Captured facial data. (b) Geodesic masks from the detected features on the mouth and eyes, which are invariant to the expressions because expressions are approximate isometric transformations. (c) Segmented facial expressions using the geodesic masks. Because the geodesic is invariant under facial expressions, we can accurately and robustly segment the expression data.


\arg\min_{D_{ST},C_{ST}}\left(\left\|\begin{bmatrix} Y_{S} \\ Y_{T} \end{bmatrix} - D_{ST}\cdot C_{ST}\right\|^{2} + \gamma\cdot\|C_{ST}\|_{1}\right),

where D_{ST} = [D_{S}^{T}, D_{T}^{T}]^{T}. We constrain the source and target faces to share the same coefficient vector set, C_{ST}. So, the same expressions of the source and target faces share the same corresponding coefficient vector in terms of D_{ST}.

Coefficient transfer. Because T′ and S′ have the same corresponding coefficient vector in terms of D_{ST}, we can estimate the coefficient vector by projecting S′ to D_{S}. Then, the recover operation synthesizes T′ with the same coefficient vector and D_{T}:

\arg\min_{C'}\left(\|S' - D_{S}\cdot C'\|^{2} + \gamma\cdot\|C'\|_{1}\right),

T' = D_{T} \cdot C',

where C′ is the coefficient vector of S′ projected to D_{S}, and T′ shares the same basic coefficient vector in terms of D_{T}.
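A sketch of these retargeting steps under the same assumptions as before: stack the source and target basic expressions so they share one coefficient set, then transfer the coefficients of the source expression S′ to the target. Y_S, Y_T, and S_prime are assumed inputs, and train and project are the hypothetical helpers sketched earlier.

```python
import numpy as np

def retarget(Y_S, Y_T, S_prime, gamma=0.3):
    # Expression space colearning: D_ST = [D_S; D_T] is learned on the
    # vertically stacked expressions so both faces share C_ST.
    Y_ST = np.vstack([Y_S, Y_T])
    D_ST, _ = train(Y_ST, n_atoms=Y_S.shape[1], gamma=gamma)
    D_S, D_T = D_ST[:Y_S.shape[0]], D_ST[Y_S.shape[0]:]

    # Coefficient transfer: project S' onto D_S, then recover T' with D_T.
    C_prime = project(S_prime.reshape(-1, 1), D_S, gamma)
    return (D_T @ C_prime).ravel()
```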

Incomplete-Face Recovery

Owing to unavoidable scanning errors, input faces often suffer from noise or unwanted holes on their surfaces. Such incomplete faces can't be used directly for facial-expression synthesis, retargeting, and so on.

To synthesize facial expressions robustly, we employ partial projection. This strategy has two steps. First, the project operation estimates the coefficient vector. Then, the recover operation synthesizes the intact face.

Figure 4. The algorithmic pipeline for expression synthesis by setting coefficients (basic-expression recovery, expression space learning, and recovery from the learned dictionary and new coefficients). Our approach requires only one neutral face to synthesize any other expressions of the same subject.

Figure 5. Expression synthesizing. (a) The neutral face of a subject. (b) Basic expressions recovered from the neutral face. (c) Facial expressions synthesized by random coefficient vectors. Although most of the synthesized expressions are realistic, the last two seem unnatural owing to infeasible coefficient vectors, which usually exceed the natural faces' scope.


Let F_{inp} denote the incomplete face and I_{F} the face's valid vertex indices. We compute the partial projection as

\arg\min_{C_{inp}}\left(\|F_{inp} - \hat{D}_{A}\cdot C_{inp}\|^{2} + \gamma\cdot\|C_{inp}\|_{1}\right),

F_{comp} = D_{A} \cdot C_{inp}, (5)

where \hat{D}_{A} is the subdictionary of D_{A} (learned during model training) that contains only those rows with indices in I_{F}. Although the partial projection doesn't take into account the missing vertices, the optimization in Equation 5 recovers every vertex, including the missing ones. This is because \hat{D}_{A} contains information that implies the dependency between the known and missing vertices.

Figure 7 compares the faces recovered from incomplete or noisy data (Gaussian noise with a 0 mean and 0.2 percent standard deviation) with the ground truth F_{truth}. We compute the mean-square error of each recovered expressive face F_{comp} as

E = \frac{\|F_{comp} - F_{truth}\|^{2}}{\mathrm{DiagLength}(F_{truth})\cdot n_{e}},

where DiagLength(⋅) computes the diagonal length of the bounding box of the 3D face and n_{e} is the number of components of F_{truth}. The small mean-square errors demonstrate that our approach can robustly handle incomplete or noisy data.
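A sketch of partial projection (Equation 5) and of the error metric, again using the hypothetical project helper from earlier; the exact normalization of the error is our reading of the formula above, stated as an assumption.

```python
import numpy as np

def recover_incomplete(F_inp, valid_idx, D_A, gamma=0.3):
    # Equation 5: encode only the known rows against the matching rows of D_A,
    # then recover the full face (missing vertices included) with all of D_A.
    D_hat = D_A[valid_idx, :]
    C_inp = project(F_inp[valid_idx].reshape(-1, 1), D_hat, gamma)
    return (D_A @ C_inp).ravel()

def expression_error(F_comp, F_truth):
    # Squared error normalized by the bounding-box diagonal and vector length.
    pts = F_truth.reshape(-1, 3)
    diag = np.linalg.norm(pts.max(axis=0) - pts.min(axis=0))
    return np.sum((F_comp - F_truth) ** 2) / (diag * F_truth.size)
```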

Experimental Results

We performed our experiments on a 64-bit Windows system with an Intel T9300 processor and 2 Gbytes of RAM. The training faces consisted of 30 subjects with 10 basic facial expressions (the first one was neutral); we set the sparseness at 0.3.

Face Correspondence

To carry out the evaluation, we carefully aligned all the faces to find correspondences among them in advance. For higher accuracy, we employed a supervised-correspondence method;6 the algorithm employs the following two steps (see Figure 8a).

Cylindrical projection. For the face selected as the template (the top-left face in Figure 8a), we manually marked the predefined feature points on the template. Then, we obtained a template mask mesh by Delaunay triangulation. To maintain consistency, we had each input face’s mask mesh adopt the template mask mesh’s topology, rather than recompute the triangulation. Finally, we performed cylindrical projection on these 3D faces and their masks to obtain their corresponding 2D coordinates. We developed a friendly interface that lets users easily mark the feature points on 3D faces.

Mapping barycentric coefficients. As Figure 8b shows, given a vertex V = (x, y, z) in the template face, we obtained its corresponding 2D coordinate V_{2d} = (x_{2d}, y_{2d}) by cylindrical projection. Moreover, we computed its barycentric coefficients according to its surrounding mask triangle T_{2d}.

Figure 6. The algorithmic pipeline for expression retargeting. Given a source neutral face (S), a target neutral face (T), and a source expression face (S′), we construct a target face (T′) whose expression is the same as the source expression face.


Assuming the corresponding mask triangle of T_{2d} is T′_{2d} in the input face, we easily found V's corresponding vertex V′ with the same barycentric coefficients in T′_{2d}. By carrying out this barycentric-coefficient mapping vertex by vertex, we achieved correspondence among all the 3D faces.
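A brief sketch of one way to realize this mapping: compute the barycentric coordinates of V_2d inside its template mask triangle and evaluate them in the corresponding triangle of the input face. Lifting the 2D result back to 3D by interpolating the triangle's 3D corners is our assumption; the authors' data structures may differ.

```python
import numpy as np

def barycentric_2d(p, a, b, c):
    # Solve p = u*a + v*b + w*c with u + v + w = 1 for 2D points.
    T = np.array([[a[0] - c[0], b[0] - c[0]],
                  [a[1] - c[1], b[1] - c[1]]])
    u, v = np.linalg.solve(T, np.asarray(p, float) - np.asarray(c, float))
    return u, v, 1.0 - u - v

def map_vertex(V2d, tri2d_template, tri3d_input):
    # tri2d_template: the template mask triangle's 2D corners (a, b, c);
    # tri3d_input: the corresponding input-face triangle's 3D corners.
    u, v, w = barycentric_2d(V2d, *tri2d_template)
    A, B, C = (np.asarray(x, float) for x in tri3d_input)
    return u * A + v * B + w * C
```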

Comparing Approaches to Facial-Expression Retargeting

We compared our sparse-coding approach to two representative approaches to facial-expression synthesis: Bayesian tensor analysis (BTA) and facial-expression analogy (FEA).6 (For more on BTA, see the sidebar.) Table 1 lists the three approaches' capabilities. BTA uses different coefficients than our approach, and FEA doesn't support coefficient-based facial-expression synthesis. So, it's more feasible for us to evaluate these approaches on the basis of facial-expression retargeting instead of coefficient-based synthesis.

We retargeted the 3D facial expressions back to the source neutral face and compared the synthesized results with the ground truth (see Figure 9).

Figure 7. Facial-expression recovery from incomplete or noisy data. (a) The ground truth. (b) The incomplete or noisy faces. (c) The expressive faces recovered by partial projection. The small mean-square errors (shown below each recovered face; 0.57 to 0.81 percent) demonstrate that our approach can robustly handle incomplete or noisy data.

Table 1. The capabilities of three approaches to facial-expression synthesis.

Approach                   | Coefficient-based synthesis | Expression retargeting | Recovery with incomplete data
Bayesian tensor analysis   | Yes                         | Yes                    | —
Facial-expression analogy  | —                           | Yes                    | —
Our approach               | Yes                         | Yes                    | Yes


Given F_{truth}, we computed the mean-square error of each synthesized result T′ as

E = \frac{\|T' - F_{truth}\|^{2}}{\mathrm{DiagLength}(F_{truth})\cdot n_{e}}.

As the mean-square errors in Figure 9 show, our approach outperforms BTA and is comparable to FEA.

We also compared the three approaches with clean data (see Figure 10). BTA produced artifacts on the chin due to unstable solutions (see Figure 10b). FEA produced a twist near the mouth (see Figure 10c). Our approach produced artifact-free, realistic results (see Figure 10d).

In addition, we added Gaussian noise (with a 0 mean and 0.2 percent standard deviation) to the test 3D faces (see Figure 11). The learning-based approaches (BTA and our approach) were robust because the training faces provided knowledge of faces' shapes that constrained the recovered face during synthesis. FEA failed to produce satisfactory results, owing to the singular solution caused by the noise.

Our approach's execution time has two parts: facial-expression modeling and facial-expression synthesis. The former includes basic-expression recovery and expression space colearning. The latter only needs to run once for each newly input 3D face. In our experiments (see Figures 10 and 11), model training took an average of 15.04 sec. Synthesis took only 0.067 sec., which was much more efficient than BTA (0.64 sec.) and FEA (2.32 sec.).

Results with Incomplete Data or Noisy Faces

To further evaluate our approach's robustness and flexibility, we conducted experiments with incomplete and noisy data. (This experiment didn't include BTA and FEA because neither can deal with incomplete data.)

Figure 8. Face correspondence. (a) The pipeline: cylindrical projection of the input faces and the template, followed by barycentric-coefficient mapping to establish correspondence. (b) Mapping barycentric coefficients. V indicates a vertex; T indicates a mask triangle. This method provides higher accuracy.



We generated incomplete faces by removing some vertices manually (see Figure 12). We generated the noisy faces with missing data by adding Gaussian noise with a 0 mean and 0.2 percent standard deviation on the incomplete faces. The results show that our approach successfully recovers expressions for the target faces.

Figure 12. 3D facial-expression retargeting from incomplete or noisy faces. The (a) source faces and recovered results for (b) target face 1, (c) target face 2, and (d) target face 3. Our approach successfully recovered expressions for the targets.

Our approach has two limitations that deserve further attention. First, we had to carefully align the 3D faces before training and synthesis. We'd like to investigate an automatic algorithm for 3D face correspondence. Furthermore, conformal parameterization could replace cylindrical projection to provide more accurate feature correspondence by preserving more local shape information.

Second, our approach doesn't provide the range of coefficients for natural facial-expression synthesis. To find this range, we could use some statistical techniques.

Figure 9. Comparing three approaches to facial-expression retargeting: (a) the ground truth, (b) Bayesian tensor analysis (BTA), (c) facial-expression analogy (FEA), and (d) our approach. The source and target faces are the same. The mean-square error appears below each synthesized result. Our approach outperforms BTA and is comparable to FEA.



Acknowledgments

We thank the editor and all reviewers for their careful review and constructive suggestions. This article was supported by the National Natural Science Foundation of China under grant 60873124, the Natural Science Foundation of Zhejiang Province under grant Y1090516, the Fundamental Research Funds for the Central Universities under grant 2009QNA5015, and Singapore National Research Foundation grant NRF2008IDM-IDM004-006.

References
1. B. Olshausen and D. Field, "Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images," Nature, vol. 381, no. 6583, 1996, pp. 607–609.
2. E. Candès, J. Romberg, and T. Tao, "Stable Signal Recovery from Incomplete and Inaccurate Measurements," Comm. Pure and Applied Mathematics, vol. 59, no. 8, 2006, pp. 1207–1223.
3. D. Donoho, "For Most Large Underdetermined Systems of Linear Equations the Minimal ℓ1-Norm Solution Is Also the Sparsest Solution," Comm. Pure and Applied Mathematics, vol. 59, no. 6, 2006, pp. 797–829.
4. H. Lee et al., "Efficient Sparse Coding Algorithms," Proc. 20th Conf. Neural Information Processing Systems, MIT Press, 2007, pp. 801–808.
5. S. Zhang and P. Huang, "High-Resolution, Real-Time 3D Shape Acquisition," Proc. 2004 Computer Vision and Pattern Recognition Workshop (CVPRW 04), IEEE CS Press, 2004, p. 28.
6. M. Song et al., "A Generic Framework for Efficient 2-D and 3-D Facial Expression Analogy," IEEE Trans. Multimedia, vol. 9, no. 7, 2007, pp. 1384–1395.

Figure 10. Comparing the three approaches, using clean data: (a) neutral faces for the source and target, (b) BTA, (c) FEA, and (d) our approach. The red arrows in the BTA results highlight artifacts. Also, the FEA results in column 5 are inconsistent with the source. Our approach produced artifact-free, realistic results.



Yuxu Lin is a PhD candidate in Zhejiang University's College of Computer Science. His research interests mainly include 3D face modeling and deformation. Lin has a bachelor's degree in computer science and technology from Zhejiang University. He's a member of IEEE. Contact him at [email protected].

Mingli Song is an associate professor in Zhejiang University's College of Computer Science. His research interests include face modeling and facial-expression analysis. Song has a PhD in computer science from Zhejiang University. He's a member of IEEE. He's the corresponding author. Contact him at [email protected].

Dao Thi Phuong Quynh is a PhD student in Nanyang Technological University's School of Computer Engineering. Her research interests include computational geometry and machine learning. Quynh has a BS in applied mathematics and computer science from Moscow State University. Contact her at [email protected].

Figure 11. Comparing the three approaches with noisy data (Gaussian noise with a 0 mean and 0.2 percent standard deviation): (a) neutral faces, (b) BTA, (c) FEA, and (d) our approach. Unlike our approach, BTA and FEA produced artifacts and flaws.


Ying He is an assistant professor in Nanyang Technological University's School of Computer Engineering. His research interests are computer graphics, computer-aided design, and scientific visualization. He has a PhD in computer science from Stony Brook University. Contact him at [email protected].

Chun Chen is a professor in Zhejiang University's College of Computer Science. His research interests include computer vision, computer graphics, and embedded technology. Chen has a PhD in computer science from Zhejiang University. Contact him at [email protected].


