Real-Time Speech-Driven Face Animation
Pengyu Hong, Zhen Wen, Tom Huang
Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
Abstract
This chapter presents our research on real-time speech-driven face animation. First, a
visual representation for facial deformation, called the Motion Unit (MU), is learned from
a set of labeled face deformation data. A facial deformation can be approximated by a
linear combination of MUs weighted by the corresponding MU parameters (MUPs), which
serve as the visual features of facial deformations. MUs capture the correlation among
the facial feature points used by MPEG-4 face animation (FA) to describe facial
deformations, and MU-based FA is compatible with MPEG-4 FA. We then collect an
audio-visual (AV) training database and use it to train a real-time audio-to-visual
mapping (AVM).
1. Introduction
Speech-driven face animation exploits the correlation between speech and facial
coarticulation. It takes a speech stream as input and outputs the corresponding face
animation sequence. Speech-driven face animation therefore requires only very low
bandwidth for "face-to-face" communications. The AVM is the central research issue in
speech-driven face animation: first, audio features are computed from the raw speech
signal; then, the AVM maps those audio features to visual features that describe how the
face model should be deformed.
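As a point of reference for the methods surveyed below, the following is a minimal sketch of this generic two-stage pipeline. The frame sizes, the toy feature extractor, and the names `audio_features`, `drive_face`, and `avm` are illustrative assumptions, not this chapter's actual design:

```python
import numpy as np

FRAME_LEN = 512   # samples per analysis frame (assumed)
HOP = 256         # hop between consecutive frames (assumed)

def audio_features(frame: np.ndarray) -> np.ndarray:
    """Placeholder audio feature extractor (e.g., LPC cepstrum or MFCC)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    return np.log(spectrum[:13] + 1e-8)   # toy 13-dimensional feature

def drive_face(speech: np.ndarray, avm) -> list:
    """Map a raw speech stream to a sequence of visual feature vectors."""
    visual = []
    for start in range(0, len(speech) - FRAME_LEN, HOP):
        a = audio_features(speech[start:start + FRAME_LEN])
        visual.append(avm(a))   # trained audio-to-visual mapping
    return visual
```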
Some speech-driven face animation approaches use phonemes or words as intermediate
representations. Lewis [14] used linear prediction to recognize phonemes; the recognized
phonemes are associated with mouth shapes that provide keyframes for face animation.
Video Rewrite [2] trains hidden Markov models (HMMs) [18] to automatically label
phonemes in both the training audio tracks and new audio tracks, and models short-term
mouth coarticulation within the duration of triphones. The mouth image sequence for a
new audio track is generated by reordering mouth images selected from the training
footage. Video Rewrite is an offline approach: it requires a very large training database
to cover all possible triphones, and it needs substantial computational resources. Chen
and Rao [3] train HMMs to parse the audio feature vector sequences of isolated words
into state sequences. The trained HMMs evaluate the state probabilities of each audio
frame, a visual feature is estimated for every possible state of each frame, and the
estimates for all states are then weighted by the corresponding probabilities to obtain
the final visual features, which are used for lip animation.
Voice Puppetry [1] trains HMMs to model the probability distribution over the manifold
of possible facial motions given audio streams. The approach first estimates the
probabilities of the visual state sequence for a new speech stream. A closed-form
solution then determines the most probable series of facial control parameters, given the
boundary values of the parameters (the beginning and ending frames) and the visual
probabilities. An advantage of this approach is that it does not require recognizing
speech into high-level symbols (e.g., phonemes, words), for which it is very difficult to
obtain a high recognition rate. However, the speech-driven face animation approaches in
[1], [2], and [3] all incur relatively long time delays.
Some approaches attempt to generate lip shapes from a single audio frame via vector
quantization [16], affine transformation [21], a Gaussian mixture model [20], or
artificial neural networks [17], [11]. Vector quantization [16] first classifies the
audio feature into one of a number of classes, and each class is then mapped to a
corresponding visual feature. Though computationally efficient, vector quantization often
produces discontinuous mapping results. The affine transformation approach [21] maps an
audio feature to a visual feature by a simple linear matrix operation. The Gaussian
mixture approach [20] models the joint probability distribution of the audio-visual
vectors as a Gaussian mixture: each mixture component generates an estimate of the visual
feature for a given audio feature, and the estimates of all components are then weighted
to produce the final estimate. The Gaussian mixture approach produces smoother results
than vector quantization does. In [17], Morishima and Harashima trained a multilayer
perceptron (MLP) to map the LPC cepstrum coefficients of each speech frame to the
mouth-shape parameters of five vowels. Kshirsagar and Magnenat-Thalmann [11] trained an
MLP to classify each speech segment into vowel classes; each vowel is associated with a
mouth shape, and the average energy of the speech segment is then used to modulate the
lip shape of the recognized vowel.
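To make the Gaussian mixture estimate concrete, here is a minimal sketch of posterior-weighted conditional estimation from a joint audio-visual GMM. It follows the standard conditional-Gaussian formula rather than the exact formulation of [20]; all parameter names are illustrative, and training (e.g., by EM) is assumed to have been done elsewhere:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_avm(a, weights, mu_a, mu_v, S_aa, S_va):
    """Estimate a visual feature from audio feature `a`.
    weights[k], mu_a[k], mu_v[k]: prior, audio mean, visual mean of component k
    S_aa[k], S_va[k]: audio-audio and visual-audio covariance blocks
    """
    K = len(weights)
    post = np.array([
        weights[k] * multivariate_normal.pdf(a, mean=mu_a[k], cov=S_aa[k])
        for k in range(K)
    ])
    post /= post.sum()                        # posterior p(k | a)
    v = np.zeros_like(mu_v[0], dtype=float)
    for k in range(K):
        # conditional mean E[v | a, k] of a jointly Gaussian component
        v += post[k] * (mu_v[k] + S_va[k] @ np.linalg.solve(S_aa[k], a - mu_a[k]))
    return v
```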
However, the approaches proposed in [16], [21], [20], [17], and [11] do not consider
audio context information, which is very important for modeling mouth coarticulation
during speech production. Many approaches train neural networks as AVMs while taking the
audio context into account. Massaro et al. [15] trained an MLP as the AVM, modeling mouth
coarticulation with the context of eleven consecutive speech frames (five backward, the
current frame, and five forward). Lavagetto [12] and Curinga et al. [5] train time-delay
neural networks (TDNNs) to map the LPC cepstral coefficients of speech signals to lip
animation parameters. A TDNN is a special case of the MLP that captures context by
imposing fixed time delays on its input units. Nevertheless, the neural networks used in
[15], [12], and [5] require a large number of hidden units to handle a large vocabulary;
their training phases therefore face very large search spaces and have very high
computational complexity.
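A minimal sketch of a context-window mapping in the spirit of [15]: eleven consecutive audio feature frames (five backward, the current frame, five forward) are stacked into a single MLP input. The one-hidden-layer architecture and all sizes are illustrative assumptions, not the cited papers' configurations:

```python
import numpy as np

CONTEXT = 5   # frames of context on each side

def stack_context(features: np.ndarray, t: int) -> np.ndarray:
    """Concatenate frames t-5 ... t+5, repeating edge frames at boundaries."""
    T = len(features)
    idx = np.clip(np.arange(t - CONTEXT, t + CONTEXT + 1), 0, T - 1)
    return features[idx].ravel()          # (11 * feature_dim,) input vector

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: visual = W2 @ tanh(W1 @ x + b1) + b2."""
    return W2 @ np.tanh(W1 @ x + b1) + b2
```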
2. Motion Units – The Visual Representation
The MPEG-4 FA standard defines 68 FAPs. Two of them are high-level parameters that
specify visemes and expressions; the others are low-level parameters that describe the
movements of sparse feature points defined on the head, tongue, eyes, mouth, and ears.
MPEG-4 FAPs do not specify the detailed spatial information of facial deformation, so the
user must define a method to animate the rest of the face model. Nor do MPEG-4 FAPs
encode the correlation among facial feature points: a user may assign values to the
MPEG-4 FAPs that do not correspond to any natural facial deformation.
We are interested in investigating the natural facial movements caused by speech
production, as well as the relations among the facial feature points of the MPEG-4
standard. We first learn a set of MUs from real facial deformations to characterize
natural facial deformations during speech production, assuming that any facial
deformation can be approximated by a linear combination of MUs. Principal Component
Analysis (PCA) [10] is applied to learn the significant characteristics of the facial
deformation samples. Motion Units are related to the work in [4], [7].
We placed 62 markers on the lower face of the subject (see Figure 1). The markers cover
the facial feature points defined by the MPEG-4 FA standard to describe the movements of
the cheeks and the lips. The number of markers determines the representational capacity
of the MUs: more markers enable the MUs to encode more detailed information, and the user
can choose the number of markers to suit the needs of the system. Here, we focus only on
the lower face because the movements of the upper face are not closely related to speech
production. Currently, we deal only with 2D deformations of the lower face; however, the
method described in this chapter applies to the whole face, as well as to 3D facial
movements, if training data of 3D facial deformations are available. To handle the global
movement of the face, we add three additional markers: two on the subject's glasses and
one on the nose. These three markers move mainly rigidly, so we can use them to align the
data. A mesh is created from the markers to visualize facial deformations; it is shown
overlaid on the markers in Figure 1.
Figure 1. The markers and the mesh.
We capture the front view of the subject while he pronounces all English phonemes. The
subject is asked to keep his head as still as possible. The video is digitized at 30
frames per second, giving more than 1000 image frames. The markers are automatically
tracked by template matching, and an interactive graphical interface allows manual
correction of the tracked positions with the mouse when template matching fails due to
large facial motions. To achieve a balanced representation of facial deformations, we
manually select facial shapes from the more than 1000 samples so that each viseme, and
the transitions between each pair of visemes, are nearly evenly represented. To
compensate for global face motion, the tracking results are aligned by affine
transformations defined by the three additional markers.
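A minimal sketch of this alignment step, assuming per-frame 2D marker coordinates; the function names are illustrative. The three rigid markers in the current frame and in the reference (neutral) frame define a 2x3 affine matrix, which is then applied to all markers:

```python
import numpy as np

def affine_from_3pts(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """2x3 affine matrix M such that M @ [x, y, 1] maps src points to dst."""
    A = np.hstack([src, np.ones((3, 1))])   # 3x3 matrix of [x y 1] rows
    return np.linalg.solve(A, dst).T        # solves A @ M.T = dst

def align_frame(markers: np.ndarray, rigid_now: np.ndarray,
                rigid_ref: np.ndarray) -> np.ndarray:
    """Warp all (N, 2) marker positions so the three rigid markers in the
    current frame land on their reference-frame positions."""
    M = affine_from_3pts(rigid_now, rigid_ref)
    homog = np.hstack([markers, np.ones((len(markers), 1))])
    return homog @ M.T
```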
After normalization, we calculate the deformations of the markers with respect to their
positions in the neutral face. The deformations of the markers at each time frame are
concatenated to form a vector, and PCA is applied to the selected facial deformation
data. The mean facial deformation and the first seven eigenvectors of the PCA results,
corresponding to the seven largest eigenvalues, are selected as the MUs in our
experiments. The MUs are represented as $\{\vec{m}_i\}_{i=0}^{M}$. Hence, we have

$$\vec{s} = \vec{s}_0 + \vec{m}_0 + \sum_{i=1}^{M} c_i\,\vec{m}_i \qquad (1)$$

where $\vec{s}_0$ is the neutral facial shape and $\{c_i\}_{i=1}^{M}$ is the MUP set. The
first four MUs are shown in Figure 2. They respectively represent the mean deformation
and the local deformations around the cheeks, lips, and mouth corners.
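A minimal sketch of this PCA step, assuming the selected deformation vectors are stacked as rows of a matrix (names are illustrative):

```python
import numpy as np

def learn_mus(deformations: np.ndarray, n_units: int = 7):
    """deformations: (N, 2P) matrix, one concatenated marker-deformation
    vector per selected frame. Returns the mean deformation m0 and the
    MUs m1..m7, the top eigenvectors of the sample covariance."""
    m0 = deformations.mean(axis=0)                 # mean facial deformation
    centered = deformations - m0
    # principal directions via SVD of the centered data matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return m0, vt[:n_units]
```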
Figure 2. Motion Units $\vec{m}_k$, $k = 0, \ldots, 3$.
MUs are also used to derive robust face and facial motion tracking algorithms [9]. In this
chapter, we are only interested in speech-driven face animation.
3. MUPs and MPEG-4 FAPs
It can be shown that the conversion between the MUPs and the low-level MPEG-4 FAPs is
linear. If the values of the MUPs are known, the facial deformation can be calculated
using eq. (1). Consequently, the movements of the facial feature points in the lower face
used by the MPEG-4 FAPs can be calculated, because the MUs cover the feature points in
the lower face defined by the MPEG-4 standard. It is then straightforward to calculate
the values of the MPEG-4 FAPs.
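A minimal sketch of this direction, assuming eq. (1) and that FAP values are obtained by scaling feature-point displacements with the standard's FAPU normalization units; the index map and FAPU values are model-specific placeholders:

```python
import numpy as np

def mups_to_faps(s0, m0, mus, mups, feature_idx, fapu):
    """s0: neutral shape; m0: mean deformation; mus: (M, D) MU matrix;
    mups: length-M MUP vector; feature_idx: rows of the shape vector that
    hold the MPEG-4 feature points; fapu: per-FAP normalization units."""
    shape = s0 + m0 + np.asarray(mups) @ np.asarray(mus)   # eq. (1)
    displacement = (shape - s0)[feature_idx]   # feature-point movements
    return displacement / fapu                 # FAP values in FAPU units
```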
If the values of the MPEG-4 FAPs are known, we can calculate the MUPs in the following
way. First, the movements of the facial feature points are calculated, and their
concatenation forms a vector $\vec{p}$. Then we form a set of vectors
$\{\vec{f}_0, \vec{f}_1, \ldots, \vec{f}_M\}$ by extracting, from the MU set
$\{\vec{m}_0, \vec{m}_1, \ldots, \vec{m}_M\}$, the elements that correspond to those
facial feature points. The elements of $\{\vec{f}_0, \vec{f}_1, \ldots, \vec{f}_M\}$ and
those of $\vec{p}$ are arranged so that the information about the deformations of the
facial feature points is represented in the same order. The MUPs can then be calculated by
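The closed-form expression itself is cut off in this transcript. Since the relation is linear and eq. (1) restricted to the feature points gives $\vec{p} \approx \vec{f}_0 + \sum_{i=1}^{M} c_i \vec{f}_i$, one natural reconstruction, offered here only as an assumption and not as the chapter's verbatim formula, is a least-squares fit:

```python
import numpy as np

def mups_from_faps(p, f0, F):
    """p: concatenated feature-point movements; f0: f_0 (the restriction of
    the mean deformation m0); F: (M, d) matrix with rows f_1 ... f_M.
    Solves min_c || F.T @ c - (p - f0) ||^2 for the MUPs c."""
    c, *_ = np.linalg.lstsq(F.T, p - f0, rcond=None)
    return c
```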