ANIMATED SPEECH PROSODY MODELING
ABSTRACT
Current computer animated speech systems do not take into account the visual
impact of prosody. This leads to non-realistic animated figures, since prosodic effects are
fundamental in human communications. Prosody, for a person speaking in English, is the
stress, intonation, length, and rhythm of syllables and sentences. A person’s use of
prosody during speech causes his/her mouth, face, jaw, and lips to visually change.
The project had three distinct phases. The first phase started with an analysis of
the existing linguistics research on standard American English. During this phase an
experimental corpus of words and sentences was developed that exhibits prosody. This
corpus was used in a motion capture environment to capture raw data of prosodic effects
on the human face. In the second phase, computer assisted data segmentation was used
to remove noise, determine the timing of phonemes, and to match the data captured to
prosody parameters. During the third phase, computer software was developed to implement an
algorithm that extracts jaw, mouth, and facial muscle parameters from the motion capture data.
These parameters were used to animate a parametric facial model. The extracted parametric
curves can then be used to develop a model of prosody for creating advanced facial animations.
TABLE OF CONTENTS
Abstract ..................................................................................................... ii
Table of Contents ...................................................................................... iii
List of Figures ........................................................................................... v
1. Introduction and Background ................................................................ 1
   1.1 Background ...................................................................................... 1
   1.2 What Is Prosody? ............................................................................. 2
   1.3 What Using a Prosody Model Can Accomplish ............................... 4
   1.4 Current Research .............................................................................. 5
       1.4.1 Talking Head Model ................................................................. 5
       1.4.2 Structure of Talking Head ........................................................ 5
2. Animated Speech Prosody Modeling ..................................................... 7
   2.1 Corpus Selection Criteria ................................................................. 7
   2.2 Data Segmentation ........................................................................... 8
   2.3 Prosody Parameter Development ..................................................... 8
3. System Design or Research ................................................................... 9
   3.1 Determining a Prosody Corpus ........................................................ 9
   3.2 Use of Corpus during Motion Capture ............................................. 9
   3.3 Data Segmentation ........................................................................... 10
4. Evaluation and Results .......................................................................... 14
   4.1 Corpus Evaluation ........................................................................... 14
   4.2 Corpus Results ................................................................................. 14
       4.2.1 Suprasegmentals ...................................................................... 15
       4.2.2 Word Stress ............................................................................. 15
       4.2.3 Intonation ................................................................................ 15
       4.2.4 Length ..................................................................................... 16
       4.2.5 Sentence Stress and Rhythm .................................................... 16
   4.3 Data Segmentation Evaluation ......................................................... 16
   4.4 Data Segmentation Results .............................................................. 17
   4.5 Prosody Parameter Evaluation ......................................................... 19
   4.6 Prosody Parameter Results .............................................................. 20
5. Future Work .......................................................................................... 25
6. Conclusion ............................................................................................ 26
Bibliography and References ..................................................................... 28
Appendix A. Prosody Corpus ..................................................................... 31
Appendix B. Label and Hierarchy Files ...................................................... 45
Appendix C. Algorithm Code ..................................................................... 53
Appendix D. Example Phoneme-Viseme Parameter Set ............................. 74

LIST OF FIGURES

Figure 4.5. EMU Segmentation Results for "about" .................................... 20
Figure 4.6. Top 5 Key Parameter Values for "about" ................................... 21
Figure 4.7. Error Reduction for "about" ...................................................... 22
Figure 4.8. Word Parameter Impact ............................................................ 23
Figure 4.9. Parameter Impact ...................................................................... 23
1. INTRODUCTION AND BACKGROUND
Prosody is the stress, intonation, length, and rhythm of syllables and sentences as a
person speaks English. Audio prosody has been studied extensively; how a person visually
changes his or her mouth, face, jaw, and lips in synchrony with audio prosody has not. As a
result, computer animated speech systems do not take into account the visual impact of
prosody, which leads to non-realistic animated figures. Conducting basic research on prosody
and its visual components was the objective of this study.
1.1 Background
Audio prosody, sometimes called suprasegmentals in linguistics, has been studied
extensively in the field of phonetics [Coleman 2005]. This work was driven by the need to
understand how individuals learn languages, whether as a first or a later language. Most of
this research focused on phonemes, which are made up of one or more phones (the basic sounds
a human can produce) and are used to build syllables, which in turn build words, phrases, and
utterances. By applying prosodic effects to these phonemes, syllables, words, phrases, and
utterances, a speaker can change their information content and meaning. In addition, how
phonemes are physically produced was studied.
A field of study called English phonology has developed since the initial research
interest in phonetics. English phonology is the study of the patterns in speech sounds in
the English language [Wikipedia 2005]. In particular, it is this field of study that appears
to be the dominant field in the study of audio prosody. The concept of prosody is applied
to a subset of the phonological hierarchy, called the prosodic hierarchy.
Animated speech prosody modeling is a relatively new field of research. Prosodic
effects, i.e., how the mouth, face, and lips change as a result of the speaker's use of
prosody to communicate, are relatively unexplored at this time.
With the advent of digital computers and the advances in computer graphics since
1990, more and more full-length feature movies are computer animations. In the future,
computer animations of humans speaking will become a main human-computer interaction
mechanism. The present lack of naturally speaking animations that take into account both
visual effects and audio prosody has created an emerging field of study.
1.2 What Is Prosody?
Over the years audio prosody has been defined in many ways, with many different
terms, but the underlying concepts of what audio prosody is have been fairly standard. It
will be useful to explore the various definitions of prosody to build a case that it is the
perspective of the prosody researcher that varies, not the underlying concepts.
Three different definitions of prosody are given in Wikipedia, the free
encyclopedia [Wikipedia 2005]. They are:
• “Prosody consists of distinctive variations of stress, tone, and timing in
spoken language. How pitch changes from word to word, the speed of
speech, the loudness of speech, and the duration of pauses all contribute to
prosody.
• In linguistics, prosody includes intonation and vocal stress in speech.
• In poetry, prosody includes the scansion and metrical shape of the lines.”
Another definition of prosody can be gleaned from the International Phonetic
Alphabet's use of "a group of symbols for stress, length, intonation, syllabification and
tone under the general heading "suprasegmentals," reflecting a conceptual division of
speech into "segmental" and "suprasegmental" parts." [Coleman 2005]. Note that this
distinction is only partially correct, because the IPA also applies the concepts of stress
and intonation to vowels and consonants, which are at the segmental level of the
phonological hierarchy, i.e., the phoneme level.
Another definition of prosody comes from the field of speech synthesis, where
O'Shaughnessy defines prosody as the relationships among the duration, amplitude, and F0
(the fundamental frequency) of sound sequences [O'Shaughnessy 2000]. He believes
"segmentals cue phoneme and word identification, while prosody primarily cues other
linguistic phenomena." [O'Shaughnessy 2000]. Syllables are lexically stressed to place
emphasis, words and phrases are stressed syntactically and emotionally, and speakers work
to make their speech clearer to understand. He also believes that "highlighting stressed
syllables against a background of unstressed syllables is a primary function of prosody."
[O'Shaughnessy 2000].
A final definition comes from J. Bowen’s book entitled Patterns of ENGLISH
Pronunciation written in 1975 [Bowen 1975]. He states “The term intonation as used in
this book is intended to cover a number of separate phenomena, including stress, pitch,
juncture (the transitions between phrases from sound to silence at the ends of phrases),
and rhythm.” [Bowen 1975]. He further highlights that the combination of these
phenomena, with the syllables making up words and phrases, conveys the meaning
intended by the person speaking. He then proceeds to examine and to describe some of
the features of intonation and their combinations in the spoken informal English
language. Finally, he defines prosody as “the science of poetical forms, including
quantity and accent of syllables, meter, versification, and metrical composition.” [Bowen
1975].
1.3 What Using a Prosody Model Can Accomplish
Bowen's definition of intonation produces almost exactly the same effects as
does O'Shaughnessy's definition developed in 2000. So, if one considers all of the
prosody definitions, it is clear that audio prosody is an essential part of human
communications.
Now the question is, "Why do we want to capture the audio and visual elements
of prosody in facial computer animations?" The answer is that we want our computer
animations to reproduce, as closely as possible, the reality of human communications.
The underlying goal of this project is to conduct research that supports further
work toward a mathematical model of visual effects and audio prosody for computer
animations. For example, having a computer animated figure reading poetry with the
same level of quality, understandability, and emotions, as a real human, could result from
this and follow-on research.
1.4 Current Research
Speaking facial model animations currently exist, but to the best of my knowledge
none of them tightly couples the facial model with the linguistic prosody of standard
American English. Part of the problem is that the developers of these facial models do not
define prosody in a linguistic sense. An example of this disconnect appears in a recent
article: Somasundaram believes that the visual prosodic elements are the movements of the
head, eyes, eyebrows, and eyelids and that these elements improve the intelligibility of
speech [Somasundaram 2005].
Speaking facial model animations are fundamentally different from the facial
animations used in today's generation of animated movies. In animated movies, actors
provide the voices before the animation is created, frame by frame, by animation artists.
Speaking facial models instead generate the facial animation from synthesized or recorded
speech as the speaking occurs.
1.4.1 Talking Head Model
Talking Head is a 3D parametric lip model which supports the lip motion for
facial animation [King 2001]. It uses the approach given in the schematic in Figure 1.1.
It was primarily designed to provide a lip model for synchronized speech. It has also
been used successfully in a text-to-audio-visual-speech system to achieve facial
animations with synchronized speech.
1.4.2 Structure of Talking Head
The lip model is created from a B-spline surface which has parameters to define
the movement of the lip surface and its internal surface [King 2000]. The model
parameterization is muscle-based and allows for specification of a wide range of lip
motion. Figure 1.2 shows a screen shot of the talking head model control screen. The
lip model is combined with a model of a human head which uses computer graphics to
add color, lighting, and surface texture to increase the combined model's realism.
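As background for readers unfamiliar with B-spline geometry, the sketch below shows how a point on a uniform cubic B-spline is blended from four control points. This is an illustrative 1-D curve version only, with an assumed Vec3 type; the actual Talking Head lip model is a B-spline surface with muscle-based parameters, and this is not its code.

// Illustrative only: evaluate one span of a uniform cubic B-spline curve.
// The lip model itself is a surface; this shows the basic blending idea.
struct Vec3 { float x, y, z; };

// p0..p3 are consecutive control points; u is the local parameter in [0, 1].
Vec3 cubicBSplinePoint(const Vec3 &p0, const Vec3 &p1, const Vec3 &p2, const Vec3 &p3, float u) {
  // Uniform cubic B-spline basis functions (they always sum to 1).
  float b0 = (1 - u) * (1 - u) * (1 - u) / 6.0f;
  float b1 = (3 * u * u * u - 6 * u * u + 4) / 6.0f;
  float b2 = (-3 * u * u * u + 3 * u * u + 3 * u + 1) / 6.0f;
  float b3 = u * u * u / 6.0f;
  return { b0 * p0.x + b1 * p1.x + b2 * p2.x + b3 * p3.x,
           b0 * p0.y + b1 * p1.y + b2 * p2.y + b3 * p3.y,
           b0 * p0.z + b1 * p1.z + b2 * p2.z + b3 * p3.z };
}

Moving a control point (which is what the muscle-based parameters ultimately do) smoothly deforms the nearby portion of the curve or surface, which is why a small set of parameters can drive realistic lip shapes.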
Figure 1.1: A facial animation system general schematic [King 2001]
Figure 1.2: A parameter control screen [King 2001]
2. ANIMATED SPEECH PROSODY MODELING
This section provides an overview of the project. The project had three distinct
phases. The first phase developed a conceptual basis for determining an experimental
corpus of words and sentences that exhibit prosody. This corpus was used in a motion
capture environment at The Ohio State University to capture raw data of prosody effects
on the human face. During the second phase, computer assisted data segmentation was
used to remove noise, to determine timing of phonemes, to match the data captured to
prosody parameters, and to better understand the captured data. During the third phase,
computer software was developed that used the motion capture data as input. This
software implemented an algorithm to extract jaw, mouth, and facial muscle parameters.
These parameters drive an existing parametric talking head facial model and set the stage
for further research into the effect of prosody on animated speech.
2.1 Corpus Selection Criteria
Selection criteria were determined that focused on all three primary areas of
prosody in linguistics: word stress, intonation, and length. Words, phrases, and sentences
were selected to get a broad mix of the vowel phonemes (a, e, i, o, u, and y sounds) in
combination with the consonants that surround them. Sentences and a poem were selected in
an attempt to capture some data on sentence stress and rhythm.
The Carnegie Mellon University (CMU) Pronouncing Dictionary was used in the
selection process [CMU 2005]. It is a pronunciation dictionary for North American
English that contains over 125,000 words and their phoneme constructions. It is
particularly useful for speech recognition and synthesis research because it maps each
word to the phoneme set commonly used to pronounce it. It currently contains 39
phonemes; counting the stress-marked variants of the vowels, the dictionary uses 50
phoneme symbols that are common in standard American English (see Appendix A for a
listing of the phonemes).
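As an illustration of how the dictionary supports this kind of selection, the sketch below looks up a word's phoneme sequence in a local plain-text copy of the dictionary. The file name, the "word  phoneme phoneme ..." line format, and the helper function are assumptions for illustration; this is not the project's selection code.

// Minimal sketch: find the phoneme sequence for one word in a cmudict-style text file.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> lookupPhonemes(const std::string &word, const std::string &dictPath) {
  std::ifstream dict(dictPath);
  std::string line;
  while (std::getline(dict, line)) {
    std::istringstream entry(line);
    std::string headword;
    entry >> headword;                              // first token is the word
    if (headword == word) {                         // cmudict headwords are upper case
      std::vector<std::string> phonemes;
      std::string p;
      while (entry >> p) phonemes.push_back(p);     // e.g. ABOUT -> AH0 B AW1 T
      return phonemes;
    }
  }
  return {};                                        // word not found
}

int main() {
  for (const std::string &p : lookupPhonemes("ABOUT", "cmudict.txt"))  // file name is hypothetical
    std::cout << p << " ";
  std::cout << "\n";
  return 0;
}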
2.2 Data Segmentation
Data segmentation techniques were used to ensure the data represented the
prosodic aspects highlighted in each motion capture session. The timing and other
characteristics of the phonemes (which make up the syllables, words, phrases, and
sentences) were processed and evaluated.
2.3 Prosody Parameter Development
During the third phase, computer software was developed to extract prosody
parameters (jaw, mouth, and other facial movements, i.e., the visemes produced by
prosody) from the motion capture data. TalkingHead, an existing set of computer
programs, is a parametric talking head facial model that uses a set of these parameters to
deform the face. An algorithm was used to do an inverse mapping of motion capture data
to a set of 19 parameters for the 50 phoneme-viseme pairs. The local optimal set for the
individual phoneme-viseme pairs is the set of parameters that minimized the distance
(error) between the motion capture data and the virtual marker data for each motion
capture frame. Virtual marker data locations were initially set for a head at rest and then
were transformed for each frame to predict how the animated face should move.
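The distance (error) idea can be sketched as follows: for one frame, sum the Euclidean distances between each captured marker and the corresponding virtual marker. The Vec3 type and the function signature here are placeholders for illustration; the project's actual implementation is the markerError routine shown in Appendix C.

// Illustrative sketch of the per-frame fitting error described above.
#include <cmath>

struct Vec3 { float x, y, z; };

float frameError(const Vec3 *captured, const Vec3 *predicted, int numMarkers) {
  float error = 0.0f;
  for (int i = 0; i < numMarkers; ++i) {
    float dx = captured[i].x - predicted[i].x;
    float dy = captured[i].y - predicted[i].y;
    float dz = captured[i].z - predicted[i].z;
    error += std::sqrt(dx * dx + dy * dy + dz * dz);   // distance for one marker
  }
  return error;                                         // lower means a better parameter set
}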
3. SYSTEM DESIGN OR RESEARCH
This section provides a detailed description of how the project was accomplished.
As previously mentioned, this project had three phases. The first phase surveyed
linguistics research to develop a basis for determining an experimental corpus of words
and sentences that exhibit prosody. The second phase accomplished computer assisted
data segmentation to remove noise, to determine the timing of phonemes, to match the
data captured to prosody, and to clearly understand the data captured during the first
phase. During the third phase, computer software was developed to extract prosody
parameters from the motion capture data.
3.1 Determining a Prosody Corpus
A prosody corpus, designed to highlight the principal prosodic aspects of standard
American English, was developed using applied phonetics laboratory exercises. Applied
phonetics is a field primarily used to train teachers to help people learning English as a
second language, people with speaking difficulties, or people who want to improve their
pronunciation of American English.
The Carnegie Mellon University (CMU) Pronouncing Dictionary was used in the
selection process of various words and sentences that highlight prosodic aspects of
standard American English. The selected words and sentences are in Appendix A.
3.2 Use of Corpus during Motion Capture
Each word, phrase, and sentence was said twice, as it would be said in standard
American English, to establish a baseline for each particular prosody aspect, word, or
sentence. The procedure was then repeated by saying the corpus content one time loudly,
then softly, then faster, and then slower. By varying the volume and speaking rate of the
corpus, pitch changes under differing conditions were captured, a full range of viseme
patterns was available for parameterization, and several candidate viseme patterns to
simulate emotional activity of a speaker were captured in the 96 motion capture sessions.
The motion capture data collection design included many features to facilitate the
data segmentation process. A wired microphone was used for sound capture. A head jig
was used to provide the motion capture equipment a reference set of points which were
later used to remove rigid motion effects from the motion capture data. A rigid object
with a second set of points was placed next to the speaker for advanced noise estimation.
Data collection was broken up into separate sessions, each tagged with both the type of
session and a flash card number. The speaker made a deliberate pause, with his mouth
closed, between words, phrases, and sentences. The audio capture was synced to the motion
capture, so the motion capture data frames and the selected audio waveforms
3.3 Data Segmentation
The second phase involved computer assisted data segmentation to remove noise,
to adjust the timing of the phonemes, to match the data captured to prosody, and to
clearly understand the data captured during the first phase. An open source
text-to-speech program, Festival, and Festvox, a Festival tool, were used to match the
spoken phonemes to the 50 phonemes used in the CMU dictionary. They were also used to
generate initial estimates of the start and stop times of the collected phonemes. The EMU
Speech Database System, an audio analysis program, was used to fine-tune the start and
stop times of these phonemes against the actual audio waveforms. EMU was also used to
develop a hierarchical database of phones, phonemes, syllables, words, phrases, and
utterances for future use with the TalkingHead software.
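A minimal sketch of reading such phoneme timing information is shown below, assuming the .lab layout illustrated in Appendix B (a short header ending with "#", then one "end_time 125 label" record per line). The file name and the PhonemeSegment struct are illustrative only, not part of the Festival, EMU, or TalkingHead code.

// Minimal sketch: read a phoneme timing (.lab) file and print each phoneme's interval.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct PhonemeSegment { double endTime; std::string label; };

int main() {
  std::ifstream lab("a01.lab");
  std::string line;
  bool inHeader = true;
  std::vector<PhonemeSegment> segments;
  while (std::getline(lab, line)) {
    if (inHeader) {                         // header ends with a line containing only "#"
      if (line == "#") inHeader = false;
      continue;
    }
    std::istringstream rec(line);
    double endTime;
    int color;                              // the constant "125" field in each record
    std::string label;
    if (rec >> endTime >> color >> label)
      segments.push_back({endTime, label});
  }
  // Each phoneme starts where the previous one ended.
  for (size_t i = 1; i < segments.size(); ++i)
    std::cout << segments[i].label << "  "
              << segments[i - 1].endTime << " - " << segments[i].endTime << " s\n";
  return 0;
}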
3.4 Inverse Mapping Algorithm
The primary inputs to the inverse mapping algorithm were the x, y, and z locations
of the motion capture markers on the face of the speaker as the speaker went through the
data generation plan using the prosody corpus. These inputs were compared to the virtual
x, y, and z data that resulted from systematically varying combinations of individual
facial model parameters. For each combination, the virtual marker placements were
translated by the resulting facial deformation. These translated coordinates were then
subtracted from the motion capture data for each frame to determine the error. The
combination of parameters with the smallest distance error for each frame was determined
to be the local optimal set of parameters for a particular phoneme/viseme combination.
The combination of parameters unique to each frame was stored so it could be used to play
back the prosodic features of a particular phoneme, syllable, word, or sentence in the
future.
The Talking Head model code, primarily MoCap.C, was modified to create the new
code that extracts facial model parameters. This code is given in Appendix C - Algorithm
Code. The MoCap.C subroutine MoCapGUICB was modified to control the estimation of
parameters on a frame-by-frame basis.
The TalkingHead.C subroutine estimateJawParamFromMocap was used to provide an
initial estimate for the Jaw Open parameter. The TalkingHead.C subroutine
estimateOrbOrisParamFromMocap was used to provide an initial estimate for the OrbOris
parameter.
The error between the measured markers and the virtual markers was calculated in
the newly created TalkingHead.C subroutine markerError. The MoCap marker set
locations and the virtual marker set locations are subtracted in this subroutine.
The newly created TalkingHead::estimateParamsFromMocap(int f, float t)
subroutine then estimated 17 parameters to produce a locally optimal phoneme-viseme
set for each frame. The subroutine was based on a search over realistic values of each of
the parameters in a prioritized order of importance. Once the higher priority parameters
were determined, they were used as givens in the remaining search steps to reduce
computation time. The parameters were estimated, in a six-step search process, in the
following order: k0, k1; k3, k12, k14; k4, k5, k6; k7, k15, k16; k11, k17, k18; and k8,
k9, k10. A description of the parameters follows:
Step 1:
  k0 = OPEN_JAW - opens jaw
  k1 = JAW_IN - moves jaw in
Step 2:
  k3 = ORB_ORIS - contracts lips, making mouth opening smaller
  k12 = DEP_INF - opens both lips
  k14 = MENTALIS - pulls lips together
Step 3:
  k4 = L_RIS, Left Risorius - moves left corner towards ear
  k5 = R_RIS, Right Risorius - moves right corner towards ear
  k6 = L_PLATYSMA, Left Platysma - moves left corner downward and lateral
Step 4:
  k7 = R_PLATYSMA, Right Platysma - moves right corner downward and lateral
  k15 = L_BUCCINATOR - pulls back at left corner
  k16 = R_BUCCINATOR - pulls back at right corner
Step 5:
  k11 = R_LEV_SUP - moves right top lip up
  k17 = INCISIVE_SUP - top lip rolls over bottom lip
  k18 = INCISIVE_INF - bottom lip rolls over top lip
Step 6:
  k8 = L_ZYG, Left Zygomaticus - raises corner up and lateral
  k9 = R_ZYG, Right Zygomaticus - raises corner up and lateral
  k10 = L_LEV_SUP - moves left top lip up
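The sketch below gives a schematic of this prioritized search; the full version, tied to the Talking Head classes, is in Appendix C. For brevity it sweeps one parameter at a time within each group over a single 0-1 grid, whereas the project's code sweeps the members of a group jointly and uses parameter-specific values (for example, JAW_IN is tried at -0.2, 0.0, and 0.2). The error callback stands in for markerError; everything else here is illustrative.

// Schematic of the prioritized six-step search, not the project's exact code.
#include <functional>
#include <vector>

// params holds the 19 Talking Head parameters; error() plays the role of markerError
// for the frame being fit.
void prioritizedSearch(std::vector<float> &params,
                       const std::function<float(const std::vector<float> &)> &error) {
  // Parameter groups in priority order, matching the six steps listed above.
  const std::vector<std::vector<int>> groups = {
      {0, 1}, {3, 12, 14}, {4, 5, 6}, {7, 15, 16}, {11, 17, 18}, {8, 9, 10}};
  for (const auto &group : groups) {          // earlier groups stay fixed at their best values
    for (int index : group) {
      float bestValue = params[index];
      float bestError = error(params);
      for (float v = 0.0f; v <= 1.0f; v += 0.1f) {   // coarse grid over one parameter
        params[index] = v;
        float e = error(params);
        if (e < bestError) { bestError = e; bestValue = v; }
      }
      params[index] = bestValue;              // keep the best value found so far
    }
  }
}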
4. EVALUATION AND RESULTS
This section describes the evaluation and the results. The first phase's evaluation
and results consisted primarily of a statistical analysis to determine the completeness of
phoneme coverage and the development of a theory-based prosody corpus. The second phase's
evaluation and results consisted of utilizing data segmentation techniques and developing a
methodology to segment the individual phoneme-viseme pairs collected during motion capture.
This included common sense examination of the motion capture data, listening to the audio
while viewing the waveforms to remove noise, adjusting the start and stop times of the
phonemes, matching the captured data to prosody, and understanding the strengths and
weaknesses of the motion capture data. During the third phase, prosody parameters (visemes)
were generated from the motion capture data.
4.1 Corpus Evaluation
The primary way the corpus was tested prior to using it for collecting prosody
data during motion capture was through statistical analysis. Specifically, a histogram
was developed that showed the frequency distribution of all 50 phonemes to ensure that
complete coverage of these phonemes would occur during data capture.
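A minimal sketch of this coverage check appears below: it counts how often each stress-marked phoneme symbol occurs in a plain-text copy of the corpus transcriptions (in the style of Appendix A, e.g. "about AH0 B AW1 T ."). The file name and the simple token filter are assumptions for illustration, not the analysis actually used.

// Minimal sketch: phoneme frequency histogram over corpus transcriptions.
#include <cctype>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main() {
  std::ifstream corpus("prosody_corpus.txt");   // hypothetical transcription file
  std::map<std::string, int> histogram;
  std::string token;
  while (corpus >> token) {
    // Keep only tokens that look like phoneme symbols: upper-case letters,
    // optionally followed by a stress digit (e.g. "B", "AW1", "AH0").
    bool phoneme = !token.empty();
    for (char c : token)
      if (!(std::isupper(static_cast<unsigned char>(c)) ||
            std::isdigit(static_cast<unsigned char>(c))))
        phoneme = false;
    if (phoneme) ++histogram[token];
  }
  for (const auto &bin : histogram)
    std::cout << bin.first << "\t" << bin.second << "\n";   // phoneme and its count
  return 0;
}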
4.2 Corpus Results
As a result of the research into prosody, it was determined that the primary
aspects of prosody in standard American English were: word stress, intonation, and
vowel length. Secondary aspects of prosody were: sentence stress and rhythm. These
aspects were the prosody attributes that the corpus was developed to study.
4.2.1 Suprasegmentals
Suprasegmentals (prosodic features) are fundamental to the definition of prosody
in linguistics research. Their effects are said to be “superimposed on” or “distributed
over” strings of segments [Lehiste 1970].
In standard American English, a phoneme is a segment. A suprasegmental’s
domain is larger than just one segment and may apply to an entire syllable or word. To
study suprasegmentals, comparisons or relative values must be used, e.g., to decide if a
vowel is long it must be compared to another vowel. Also, different suprasegmental
aspects interact, e.g., stressed vowels tend to be lengthened.
4.2.2 Word Stress
Word stress is a primary prosodic effect in standard American English. Stressed
syllables are produced with extra force or increased muscular effort. Stress differences
are used to distinguish words containing the same sounds, e.g., co'nvict (noun) and
convi'ct (verb). Some guidelines exist for two-syllable words, but in general stress is not
predictable. In syllables with stressed vowels there is a slight change of pitch just prior to
the next phoneme.
4.2.3 Intonation
Intonation is a primary prosodic effect in standard American English. Intonation
refers to the function of pitch (vocal fold vibration or fundamental frequency) at the
phrase or sentence level [Lehiste 1970]. Pitch changes signal meaning differences, e.g.,
in statements, exclamations, and questions:
– You’re going.
– You’re going!
– You’re going?
Pitch change patterns also exist for commands; for where, what, why, and when questions;
for yes/no questions; and for tag questions.
4.2.4 Length
Length is less significant than word stress and intonation as a prosodic effect in
standard American English. Length is the perceived duration of a sound, and quantity refers
to how length is used in a language [Lehiste 1970]. Length changes are largely predictable and
often depend on neighboring phonemes for both vowels and consonants.
4.2.5 Sentence Stress and Rhythm
Sentence stress is often used in standard American English. Content words, such
as nouns and verbs, are normally stressed by speaking them more slowly or more
distinctly. Sentence stresses tend to occur at regular intervals, so English is said to be a
stress-timed language. Poetry and the rhythm of a poem are an example of the use of
sentence stress to convey different meanings and images.
4.3 Data Segmentation Evaluation
The primary way the motion capture data was tested was through a labor-intensive
analysis of the motion capture and audio data using two software programs that are
extensively used in speech synthesis and linguistics research. Data anomalies were
identified, and judgment calls were made about the timing and validity of the data. For
example, noise and other problems, such as the speaker saying extra words in a recording,
were removed from the data.
4.4 Data Segmentation Results
The following figures show samples of the data segmentation results. Figure 4.1
shows an example of phoneme timing, and Figure 4.2 shows an example of the
hierarchical database created for quickly finding phonemes and their characteristics.
Figure 4.3 shows the EMU query tool searching for the phoneme “IH0” in the
sound wave. Figure 4.4 shows the results of that search. It is now possible to create
libraries of the 50 phonemes and their timing characteristics. Appendix B - Label and
Hierarchy Files, contains an example of the *.lab file (which gives the phoneme timing
and is used in TalkingHead) and the *.hlb file (which gives the utterance hierarchy).
4.5 Prosody Parameter Evaluation
The intent of this phase was to determine a set of prosody parameters for each
frame that represented a local optimal solution for each phoneme-viseme pair. During
each developmental step of the computer software used to implement the inverse
mapping algorithm, the new code was white-box tested to ensure the appropriate action
was happening. This was necessary because the new code had to interface with the
TalkingHead model code, which is over 50 Mbytes and represents numerous years of
research work. Also, with any code project of this magnitude, there are numerous coding
techniques, object-oriented relationships, and Open Inventor (an open source 3D toolkit
built on OpenGL) techniques that had to be studied.
4.6 Prosody Parameter Results
A careful comparison of Figure 4.1 and Figure 4.5, for the phonemes AH1 and
AH0 respectively, clearly shows the prosodic effect of stress on the phoneme AH. AH0
is non-stressed, as indicated by the "0" following the AH. AH1 is the same phoneme
stressed, as indicated by the "1" following the AH. The "judgment" AH1 audio
waveform differs in the time domain, in the frequency domain, and in pitch.
Figure 4.5: EMU Segmentation Results for “about”
This project has captured similar information for every phoneme, syllable, word,
phrase, and utterance in the prosody experimental corpus. Associated with every audio
prosody event captured is a phoneme-viseme parameter set that animates the Talking Head
model in real time. An example phoneme-viseme parameter set for the word "about" is
given in Appendix D.
Figure 4.6 gives the top 5 viseme parameters for “about” for the MoCap frames
associated with saying the word “about”. These parameters move the Talking Head
model and thus exhibit the use of the non-stressed phoneme AH0. The actual word
“about” is shown from approximately frame 1073 to frame 1137.
Figure 4.6: Top 5 Key Parameter Values for “about”
Another set of parameters move the Talking Head model differently for the word
“judgment” for the stressed phoneme AH1. Thus, it is now possible to compare and to
study both the audio differences between AH0 and AH1 and the visual differences caused
by audio prosody.
Figure 4.7 graphically represents the six-step search process and an example of its
results. The line labeled "Guess" was a heuristic approach to model two parameters
quickly. The line labeled "Initial" was the best-case error given the current limitations of
the Talking Head model. These limitations included the model being based on the
speaker's face as it was several years earlier, along with errors in the MoCap and virtual
marker determination process.
Figure 4.7: Error Reduction for “about”
Step 1, using 2 different parameters, was the first step in the process and
immediately produced better results than "Guess" prior to "about" being said, during the
time "about" was voiced, and immediately after "about" was voiced. Each subsequent
step converged to the limiting "Initial" error in the virtual marker data. Similar search
results were obtained with the words "become" and "judgment." Also, only 3 of the 117
frames for "about" produced errors of 28-35 millimeters over the 44 marker locations
studied.
Figure 4.8: Word Parameter Impact. (Chart: rank of each parameter number for the words
"about," "become," and "judgment"; a lower rank means more impact.)

Figure 4.9: Parameter Impact. (Chart: average rank of each parameter number; a lower rank
means more impact.)
Figures 4.8 and 4.9 clearly show that the lip movement parameters, followed by the
jaw movement parameters, dominate the prosody parameters. These figures were created by
rank ordering the impact each parameter had in describing the facial movements during the
speaking of three different words. These words were chosen to give a wide range of facial
movements due to the speaking of different phonemes. A lower number means a higher rank;
e.g., a rank order of 1 indicates that the parameter was the dominant parameter produced by
the inverse mapping algorithm.
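As an illustration of how Figure 4.9 relates to Figure 4.8, the sketch below averages each parameter's rank across the words studied; a lower average rank means more impact. The rank values in the table are placeholders, not the project's measured data.

// Illustrative only: derive per-parameter average rank from per-word ranks.
#include <iostream>
#include <map>
#include <string>

int main() {
  // rank[word][parameter index] = rank of that parameter for that word (placeholder values)
  std::map<std::string, std::map<int, double>> rank = {
      {"about",    {{0, 2}, {3, 1}, {12, 3}}},
      {"become",   {{0, 1}, {3, 2}, {12, 4}}},
      {"judgment", {{0, 2}, {3, 1}, {12, 5}}}};
  std::map<int, double> sum;
  std::map<int, int> count;
  for (const auto &word : rank)
    for (const auto &entry : word.second) {
      sum[entry.first] += entry.second;
      ++count[entry.first];
    }
  for (const auto &s : sum)
    std::cout << "parameter k" << s.first << " average rank "
              << s.second / count[s.first] << "\n";   // lower average rank = more impact
  return 0;
}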
5. FUTURE WORK
Now that a research scope has been established, there are several areas where
future work can extend this project. One of the main things that needs to be done is
improving the "Initial" error situation. This will involve updating the Talking Head
facial model to reflect changes in the speaker and improving the translation process for
both the MoCap data and the virtual marker data.
Finding a better way to estimate the sets of parameters, given motion capture data,
is another primary task that should also be done. It is also envisioned that feedback loops
could be used in the optimization of a parameter set for each type of prosody and
phoneme used in the corpus.
Visemes could be evaluated by comparing the movies taken during each of the motion
capture sessions with the Talking Head facial model saying the same word or phrase with
the same prosodic effects. This would give a high-level indication of how well the
methodology worked.
Libraries of model parameters (visemes) could be developed at the phoneme
level. Different words that contain the same phonemes could be used to determine how
well the parameter set developed from one use of a phoneme applies to a different use
of that phoneme. This would give a more quantifiable measure of the prosody effects
captured and lead to Talking Head exhibiting real-time prosodic effects.
The model parameters (visemes) developed could also be studied at the word, phrase,
or sentence level across the different volume and speaking-rate levels. This would give
a range of parameter sets for each of the prosody effects captured.
Finally, some of the activities in each phase could be improved upon. Low
priority improvements were not attempted because of the limited time available for this
project. Also, applying the methodology to all of the motion capture data was not
attempted.
6. CONCLUSION
This project is important because it helps in understanding how prosody changes
the animated figure's face as the figure is speaking. A person's use of prosody
during speech causes his or her mouth, face, jaw, and lips to visually change. These
changes have been clearly documented by this project. This project also clearly
demonstrates that audio prosody has a visual signal associated with the audio signal.
This project has produced a research scope and an experimental methodology that
can help in future explorations of visemes in human communications. It has also
produced several results that are significant for the use of this methodology. The finding
that the "Initial" error in the MoCap and virtual marker translation process needs to be
improved is important for developing trust in the inverse mapping algorithm. The finding
that the inverse mapping algorithm converges to the "Initial" error indicates that a
tailored search procedure for a local optimal solution is workable and a great improvement
over simplistic heuristic models. Whether a global optimal solution is even necessary to
exhibit prosody in an animated facial model is now an open question. The finding that lip
movement and jaw movement are the primary parameters for describing prosodic events needs
to be further explored. All of these findings, as they are further explored and validated,
will lead to more realistic facial animations and improved human-computer interaction with
future computer systems.
BIBLIOGRAPHY AND REFERENCES
[Bowen 1975] Bowen, J. Donald. Patterns of ENGLISH Pronunciation. Newbury House Publishers, Inc., Rowley, Massachusetts, 1975.

[Bronstein 1960] Bronstein, Arthur J. The Pronunciation of AMERICAN ENGLISH, An Introduction to Phonetics. Appleton-Century-Crofts, Inc., New York, N.Y., 1960.

[Clarey 1963] Clarey, M. Elizabeth and Dixson, Robert J. Pronunciation Exercises in English. Regents Publishing Company, Inc., New York, N.Y., 1963.

[CMU 2005] Carnegie Mellon University. The CMU Pronouncing Dictionary. Available from http://www.speech.cs.cmu.edu/cgi-bin/cmudict (visited Oct 30, 2005).

[Coleman 2005] Oxford University Phonetics Laboratory. Prosody (and Suprasegmentals). Available from www.phon.ox.ac.uk/~jcoleman/PROSODY.htm (visited Oct 4, 2005).

[DeCarlo 2002] DeCarlo, Doug, Stone, Matthew, Revilla, Corey, and Venditti, Jennifer J. Making discourse visible: coding and animating conversational facial displays. Proceedings of Computer Animation, (June 2002), 11-16.

[DeCarlo 2004] DeCarlo, Doug, Stone, Matthew, Revilla, Corey, and Venditti, Jennifer J. Specifying and animating facial signals for discourse in embodied conversational agents. Computer Animation and Virtual Worlds, 15 (2004), 27-38.

[DeCarlo 2005] DeCarlo, Doug and Stone, Matthew. The Rutgers University Talking Head: RUTH. Available from http://www.cs.rutgers.edu/~village/ruth/ruthmanual10.pdf (visited Nov. 6, 2005).

[Edge 2001] Edge, James D. and Maddock, Steve. Expressive Visual Speech using Geometric Muscle Functions. Proceedings of Eurographics UK 2001, (2001).

[Edge 2003] Edge, James D. and Maddock, Steve. Image-based Talking Heads using Radial Basis Functions. Proceedings of the Theory and Practice of Computer Graphics 2003, (2003).

[Edge 2003] Edge, James D., Lorenzo, Manuel S., King, Scott A., and Maddock, Steve. Use and Re-use of Facial Motion Capture Data. Proceedings of Vision, Video, and Graphics 2003, University of Bath, (July 2003), 135-142.

[Edwards 1983] Edwards, Mary Louise and Shriberg, Lawrence D. Phonology: Applications in Communicative Disorders. College-Hill Press, Inc., San Diego, California, 1983.

[Edwards 1986] Edwards, Mary Louise. Introduction to Applied Phonetics, Laboratory Workbook. College-Hill Press, San Diego, California, 1986.

[Geroch 2004] Geroch, Margaret S. Motion Capture for the Rest of Us. Journal of Computing Sciences in Colleges, (2004).

[Goecke 2004] Goecke, Roland. A Stereo Vision Lip Tracking Algorithm and Subsequent Statistical Analyses of the Audio-Video Correlation in Australian English. Ph.D. Thesis, The Australian National University, Canberra, Australia, 2004.

[Harrington 1999] Harrington, Jonathan and Cassidy, Steve. Techniques in Speech Acoustics. Kluwer Academic Publishers, Boston, Massachusetts, 1999.

[Harris 2005] Harris, Randy Allen. Voice Interaction Design: Crafting the New Conversational Speech Systems. Morgan Kaufmann Publishers, Boston, Massachusetts, 2005.

[Horne 2000] Horne, Merle, editor. Prosody: Theory and Experiment, Studies Presented to Gosta Bruce. Kluwer Academic Publishers, Boston, Massachusetts, 2000.

[King 2000] King, Scott A., Parent, Richard E., and Olsafsky, Barbara. An Anatomically-Based 3D Parametric Lip Model to Support Facial Animation and Synchronized Speech. Proceedings of Deform 2000, 29-30 November, Geneva, (2000), 7-19.

[King 2001] King, Scott A. A Facial Model and Animation Techniques for Animated Speech. Ph.D. Thesis, Ohio State University, Columbus, Ohio, 2004.

[Ladefoged 1975] Ladefoged, Peter. A Course in Phonetics. Harcourt Brace Jovanovich, Inc., New York, N.Y., 1975.

[Lehiste 1970] Lehiste, I. Suprasegmentals. The M.I.T. Press, Cambridge, Massachusetts, 1970.

[Lieberman 1967] Lieberman, Philip. Intonation, Perception, and Language. The M.I.T. Press, Cambridge, Massachusetts, 1967.

[O'Shaughnessy 2000] O'Shaughnessy, Douglas. Speech Communications: Human and Machine, 2nd ed. The Institute of Electrical and Electronics Engineers, Inc., New York, N.Y., 2000.

[Sakiey 1980] Sakiey, Elizabeth, Fry, Edward, Goss, Albert, and Loigman, Barry. A Syllable Frequency Count. Visible Language (originally published as The Journal of Typographical Research), Vol. 14.2, (1980).

[Somasundaram 2005] Somasundaram, Arunachalam and Parent, Rick. Audio-Visual Speech Styles with Prosody. Eurographics/ACM SIGGRAPH Symposium on Computer Animation, Posters and Demos, 2005.

[Wennerstrom 2001] Wennerstrom, Ann. The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, Inc., New York, N.Y., 2001.

[Wikipedia 2005] Wikipedia. Available from en.wikipedia.org/wiki/ (visited Oct. 2, 2005).
APPENDIX A – PROSODY CORPUS
The current CMU phoneme set has 39 phonemes, not counting variants for lexical stress.

Phoneme   Example   Translation
-------   -------   -----------
AA        odd       AA D
AE        at        AE T
AH        hut       HH AH T
AO        ought     AO T
AW        cow       K AW
AY        hide      HH AY D
B         be        B IY
CH        cheese    CH IY Z
D         dee       D IY
DH        thee      DH IY
EH        Ed        EH D
ER        hurt      HH ER T
EY        ate       EY T
F         fee       F IY
G         green     G R IY N
HH        he        HH IY
IH        it        IH T
IY        eat       IY T
JH        gee       JH IY
K         key       K IY
L         lee       L IY
M         me        M IY
N         knee      N IY
NG        ping      P IH NG
OW        oat       OW T
OY        toy       T OY
P         pee       P IY
R         read      R IY D
S         sea       S IY
SH        she       SH IY
T         tea       T IY
TH        theta     TH EY T AH
UH        hood      HH UH D
UW        two       T UW
V         vee       V IY
W         we        W IY
Y         yield     Y IY L D
Z         zee       Z IY
ZH        seizure   S IY ZH ER
Two-syllable words (Word Stress) Slide 11 1’ – 2 ma’rry
marry M EH1 R IY0 .
ju’dgment judgment JH AH1 JH M AH0 N T .
la’ter later L EY1 T ER0 .
1 – 2’ abo’ut
about AH0 B AW1 T .
beco’me become B IH0 K AH1 M .
secu’re secure S IH0 K Y UH1 R .
ali’gn align AH0 L AY1 N .
disea’se disease D IH0 Z IY1 Z .
Two-syllable words (Word Stress) Slide 12 1’ – 2` gre’enhou`se
greenhouse G R IY1 N HH AW2 S .
dru’gsto`re drugstore D R AH1 G S T AO2 R .
bla’ckou`t blackout B L AE1 K AW2 T .
i’ncli`ne incline (IH0 N K L AY1 N | IH1 N K L AY0 N) .
1` – 2’ my`se’lf
myself M AY2 S EH1 L F .
hi`mse’lf himself HH IH0 M S EH1 L F .
po`stpo’ne postpone (P OW0 S T P OW1 N | P OW0 S P OW1 N) .
(W EH1 R | HH W EH1 R) . (IH1 Z | IH0 Z) . M AY1 . S UW1 T .
• Who is he’?
• Who is he
HH UW1 . (IH1 Z | IH0 Z) . HH IY1 .
• When’s the da’nce?
• When is the dance
(W EH1 N | HH W EH1 N | W IH1 N | HH W IH1 N) . (IH1 Z | IH0 Z) . (DH AH0 | DH AH1 | DH IY0) . D AE1 N S .
• Where is my co’at?
• Where is my coat
(W EH1 R | HH W EH1 R) . (IH1 Z | IH0 Z) . M AY1 . K OW1 T .
• Who just came i’n?
• Who just came in
HH UW1 . (JH AH1 S T | JH IH0 S T) . K EY1 M . (IH0 N | IH1 N) .
Yes-no Questions (Intonation) Slide 22
• Did they co’me?
• Did they come
(D IH1 D | D IH0 D) . DH EY1 . K AH1 M .
• Can you go wi’th us?
• Can you go with us
(K AE1 N | K AH0 N) . Y UW1 . G OW1 . (W IH1 DH | W IH1 TH | W IH0 TH | W IH0 DH) . (AH1 S | Y UW1 EH1 S) .
• Are you a’ngry?
• Are you angry
(AA1 R | ER0) . Y UW1 . AE1 NG G R IY0 .
• Have you met Bo’b?
• Have you met Bob
HH AE1 V . Y UW1 . M EH1 T . B AA1 B .
• Is that Su’san?
• Is that Susan
(IH1 Z | IH0 Z) . (DH AE1 T | DH AH0 T) . S UW1 Z AH0 N .
Tag Questions (Intonation) Slide 23
You don’t want to go’, do’ you?
You don’t want to go, do you Y UW1 . D AA1 N . ? . T IY1 . (W AA1 N T | W AO1 N T) . (T UW1 | T IH0 | T
AH0) . G OW1 . ? . D UW1 . Y UW1 .
You don’t want to go’, do’ you?
You don’t want to go, do you Y UW1 . D AA1 N . ? . T IY1 . (W AA1 N T | W AO1 N T) . (T UW1 | T IH0 | T
AH0) . G OW1 . ? . D UW1 . Y UW1 .
They haven’t se’en him, ha’ve they? They haven’t seen him, have they
DH EY1 . HH EY1 V AH0 N . ? . T IY1 . S IY1 N . (HH IH1 M | IH0 M) . ? . HH AE1 V . DH EY1 .
They haven’t se’en him, ha’ve they? They haven’t seen him, have they
DH EY1 . HH EY1 V AH0 N . ? . T IY1 . S IY1 N . (HH IH1 M | IH0 M) . ? . HH AE1 V . DH EY1 .
Vowel Length Prosody* Slide 24
• faze face • faze
F EY1 Z . • face
F EY1 S . • mat mass • mat
M AE1 T . • mass
M AE1 S . • “H” age • H
EY1 CH . • age
EY1 JH . • maid mate • maid
M EY1 D . • mate
M EY1 T . • match Madge • match
M AE1 CH . • Madge
M AE1 JH . • seed seat • seed
S IY1 D . • seat
S IY1 T . • bead beat • bead
B IY1 D . • beat
B IY1 T . • kiss kit • kiss
K IH1 S .
• kit K IH1 T .
• bit bid • bit
B IH1 T . • bid
B IH1 D . • Bert bird • Bert
B ER1 T . • bird
B ER1 D . • glo lock • log
L AO1 G . • lock
L AA1 K . • mop mob • mop
M AA1 P . • mob
M AA1 B . • moot moose • moot
M UW1 T . • moose
M UW1 S . • it suedsu • suit
S UW1 T . • sued
S UW1 D . • buck bug • buck
B AH1 K . • bug
B AH1 G .
Sentence Stress Prosody* Slide 25
• The bo’y bu’ilds ma’ny mo’dels.
• The boy builds many models
(DH AH0 | DH AH1 | DH IY0) . B OY1 . B IH1 L D Z . M EH1 N IY0 . M AA1 D AH0 L Z .
• The bo’y is interested in constru’cting mo’dels.
• The boy is interested in constructing models
(DH AH0 | DH AH1 | DH IY0) . B OY1 . (IH1 Z | IH0 Z) . (IH1 N T R AH0 S T AH0 D | IH1 N T R IH0 S T IH0 D | IH1 N T ER0 AH0 S T AH0 D | IH1 N T ER0 IH0 S T IH0 D) . (IH0 N | IH1 N) . K AH0 N S T R AH1 K T IH0 NG . M AA1 D AH0 L Z .

Twinkle, Twinkle Little Star* (Rhythm) Slide 26
Twinkle, twinkle, little star,
Twinkle, twinkle, little star,
T W IH1 NG K AH0 L . ? . T W IH1 NG K AH0 L . ? . L IH1 T AH0 L . S T AA1 R . ? .
How I wonder what you are.
How I wonder what you are.
HH AW1 . AY1 . W AH1 N D ER0 . (W AH1 T | HH W AH1 T) . Y UW1 . (AA1 R | ER0) . ? .
Up above the world so high,
Up above the world so high,
AH1 P . AH0 B AH1 V . (DH AH0 | DH AH1 | DH IY0) . W ER1 L D . S OW1 . HH AY1 . ? .
Like a diamond in the sky.
Like a diamond in the sky.
L AY1 K . (AH0 | EY1) . D AY1 M AH0 N D . (IH0 N | IH1 N) . (DH AH0 | DH AH1 | DH IY0) . S K AY1 . ? .
Twinkle, twinkle, little star,
Twinkle, twinkle, little star,
T W IH1 NG K AH0 L . ? . T W IH1 NG K AH0 L . ? . L IH1 T AH0 L . S T AA1 R . ? .
How I wonder what you are!
How I wonder what you are.
HH AW1 . AY1 . W AH1 N D ER0 . (W AH1 T | HH W AH1 T) . Y UW1 . (AA1 R | ER0) . ? .

Slide 27
When the blazing sun is gone,
When the blazing sun is gone,
(W EH1 N | HH W EH1 N | W IH1 N | HH W IH1 N) . (DH AH0 | DH AH1 | DH IY0) . B L EY1 Z IH0 NG . S AH1 N . (IH1 Z | IH0 Z) . G AO1 N . ? .
When he nothing shines upon,
When he nothing shines upon,
(W EH1 N | HH W EH1 N | W IH1 N | HH W IH1 N) . HH IY1 . N AH1 TH IH0 NG . SH AY1 N Z . AH0 P AA1 N . ? .
Then you show your little light,
Then you show your little light,
DH EH1 N . Y UW1 . SH OW1 . (Y AO1 R | Y UH1 R) . L IH1 T AH0 L . L AY1 T . ? .
Twinkle, twinkle, all the night.
Twinkle, twinkle, all the night.
T W IH1 NG K AH0 L . ? . T W IH1 NG K AH0 L . ? . AO1 L . (DH AH0 | DH AH1 | DH IY0) . N AY1 T . ? .
Twinkle, twinkle, little star,
Twinkle, twinkle, little star,
T W IH1 NG K AH0 L . ? . T W IH1 NG K AH0 L . ? . L IH1 T AH0 L . S T AA1 R . ? .
How I wonder what you are!
How I wonder what you are.
HH AW1 . AY1 . W AH1 N D ER0 . (W AH1 T | HH W AH1 T) . Y UW1 . (AA1 R | ER0) . ? .
APPENDIX B – LABEL AND HIERARCHY FILES
Phoneme Timing File – a01.lab

signal a01
nfields 1
#
2.137369 125 H#
2.234768 125 m
2.301221 125 eh
2.435000 125 r
2.575000 125 iy
4.303340 125 pau
4.354888 125 jh
4.473623 125 ah
4.573807 125 jh
4.599780 125 m
4.683267 125 ah
4.745000 125 n
4.910000 125 t
6.328551 125 pau
6.421517 125 l
6.552619 125 ey
6.597144 125 t
6.735000 125 er
8.221845 125 pau
8.282010 125 ah
8.385000 125 b
8.540329 125 aw
8.609592 125 t
10.121434 125 pau
10.160000 125 b
10.210000 125 ih
10.328170 125 k
10.509793 125 ah
10.594078 125 m
11.969260 125 pau
12.085000 125 s
12.140000 125 ih
12.290000 125 k
12.350000 125 y
12.455000 125 uh
12.540000 125 r
13.784136 125 pau
13.850000 125 ah
13.945000 125 l
14.150000 125 ay
14.308383 125 n
15.563582 125 pau
15.592183 125 d
15.663299 125 ih
15.762243 125 z
16.042069 125 iy
16.230000 125 z
18.395000 125 pau
Phoneme Hierarchical Labels File – a01.hlb **EMU hierarchical labels** 183 Utterance Utterance 6 marry 10 pau 26 judgment 32 pau 34 later 144 pau 150 about 145 pau 45 become 158 pau 33 secure 27 pau 25 align 154 pau 7 disease 151 pau Phrase Phrase 2 marry 15 pau 16 judgment 19 44 pau 141 later 146 pau 147 about 153 pau 157 become 163 pau 164 secure 172 pau 173 align 177 pau 178 disease 183 pau Word Word Accent Text 1 w n marry 8 pau pau 9 w n judgment 31 pau pau 36 w n later 37 pau pau 142 w n about 143 pau pau 148 w n become
46
149 pau pau 155 w n secure 156 pau pau 160 w n align 161 pau pau 166 w n disease 167 pau pau Syllable Syllable Pitch_Accent 0 mar y 3 ry n 4 pau 5 judg y 11 ment n 12 pau 13 lat y 14 er n 17 pau 18 a n 21 bout y 22 pau 23 be n 24 come y 28 pau 29 se n 30 cure y 35 pau 38 a n 39 lign y 40 pau 41 dis n 43 ease y 46 pau Phoneme Phoneme 47 M 48 EH1 49 R 50 IY0 51 pau 52 JH 53 AH1 54 JH 55 M 56 AH0 57 N 58 T 59 pau 60 L 61 EY1 62 T 63 ER0 64 pau 65 AH0 66 B 67 AW1 68 T
47
69 pau 70 B 71 IH0 72 K 73 AH1 74 M 75 pau 76 S 77 IH0 78 K 79 Y 80 UH1 81 R 82 pau 83 AH0 84 L 85 AY1 86 N 87 pau 88 D 89 IH0 90 Z 91 IY1 92 Z 93 pau Phonetic Phonetic 94 m 95 eh 96 r 97 iy 98 pau 99 jh 100 ah 101 jh 102 m 103 ah 104 n 105 t 106 pau 107 l 108 ey 109 t 110 er 111 pau 112 ah 113 b 114 aw 115 t 116 pau 117 b 118 ih 119 k 120 ah 121 m 122 pau 123 s
//Talking Head model code and MoCap.C was written by Dr. Scott King. //Stanley Leja modified the code and created new code to extract facial model parameters. //Some of the modified code and all of the newly created code are shown below. //This subroutine was modified to control the estimation of parameters on a frame by frame basis. Void MoCapGUICB(Widget w, XtPointer clientData, XtPointer user) { CGCBData *cbdata = (CGCBData *) clientData; TalkingHead *th = ((TalkingHead *)(cbdata->obj)); int which = cbdata->WhichScale; // cerr << "MoCapGUICB which " << which << endl; switch (which) { case -1: int Val; XtVaGetValues(w, XmNvalue, &Val, NULL); cbdata->GUI->setFrame((int) (Val * .0020 * cbdata->GUI->getMaxFrame())); break; case 1: cbdata->GUI->setFrame(cbdata->GUI->getFrame() - 1); break; case 2: cbdata->GUI->setFrame(cbdata->GUI->getFrame() + 1); //changed by Stan break; case 3: // Prev Phoneme //cbdata->GUI->setTime(cbdata->GUI->getTime() - 1/100.0); break; case 4: // Next Phoneme //cbdata->GUI->setTime(cbdata->GUI->getTime() + 1/100.0); break; default: cerr << "default which is " << which << "!\n"; break; } int f = cbdata->GUI->getFrame(); float t = ((float) f)/ 120.0; th->SetShowKeysTime(t); th->DisplayMoCapFrame(f);
53
cbdata->GUI->setTime(t); // cerr << "setting time to " << t << " and max is " << cbdata->GUI->getTime() << endl; cerr << "frame " << f << " \t Before estimating total markerError is " << th->markerError(f) << endl; if (th->estimatingParams) { // cerr << "We are estimating params\n"; th->estimateParamsFromMocap(f, t); //cerr << "\nframe = " << f << "\tAfter estimating makerError is " << th->markerError(f) << endl; } } //This subroutine was used to provide the initial estimate for the Jaw Open parameter. float TalkingHead::estimateJawParamFromMocap(int f) { float d, v; static int StartFrame = 0; // 30 for some datasets! int ChinMarker; switch (_MoCapData->numMarkers) { case 31: case 32: ChinMarker = 1; break; case 73: // Same as 74 only no MNOSE (it fell off case 74: ChinMarker = 31; // It is the 32nd marker break; case 75: ChinMarker = 31; // It is the 32nd marker break; case 90: ChinMarker = 78; // It is the 32nd marker break; default: ChinMarker = 1; // This will cause an error regardless. break; } d = (_MoCapData->Data[f][ChinMarker] - _MoCapData->Data[StartFrame][ChinMarker]).length(); // Experimentation is about 76 is the max. v = d/76.0; // Should give us between 0 and 1; if (v < 0) v = 0.0;
54
if (v > 1) v = 1.0; return(v); } //This subroutine was used to provide the initial estimate for the OrbOris parameter. float TalkingHead::estimateOrbOrisParamFromMocap(int f, float pJaw) { float d, v; static int StartFrame = 0; // 30 for some datasets! int LMouthMarker, RMouthMarker; switch (_MoCapData->numMarkers) { case 31: case 32: // Mouth markers are // 13 LMTH:X LMTH:Y LMTH:Z // 16 LULP:X LULP:Y LULP:Z // 30 RULP:X RULP:Y RULP:Z // 27 RMTH:X RMTH:Y RMTH:Z // 24 RLLP:X RLLP:Y RLLP:Z // 10 LLLP:X LLLP:Y LLLP:Z LMouthMarker = 13; RMouthMarker = 27; break; case 75: // jig1:LOH,,,jig1:LIH,,,jig1:RMH,,,jig1:ROH,,,jig1:UPV,,, // jig1:MIDV,,,jig1:LOWV,,,sk:RHAIR,,,sk:HAIR,,,sk:LHAIR,,, // sk:RFOR,,,sk:LFOR,,,sk:FORE,,,sk:BRDG,,,sk:NOSE,,, 10 // sk:RNOSE,,,sk:MNOSE,,,sk:LNOSE,,,sk:LMTH,,,sk:LIUL,,, // sk:LOUL,,,sk:MUL,,,sk:ROUL,,,sk:RIUL,,,sk:RMTH,,, 20 // sk:RILL,,,sk:ROLL,,,sk:MLL,,,sk:LOLL,,,sk:LILL,,, // sk:RCHN,,,sk:CHIN,,,sk:LCHN,,,sk:RTMP,,,sk:RSID,,, 30 // sk:REAR,,,sk:ROBW,,,sk:RMBW,,,sk:RIBW,,,sk:RUEYE,,, // sk:RIEYE,,,sk:REYE,,,sk:RUCB,,,sk:ROCB,,,sk:RMCB,,, 40 // sk:RICB,,,sk:ROCK,,,sk:RMCK,,,sk:RICK,,,sk:RLJO,,, // sk:RUJO,,,sk:RUJI,,,sk:RLJI,,,sk:RLJM,,,sk:LTMP,,, 50 // sk:LSID,,,sk:LEAR,,,sk:LOBW,,,sk:LMBW,,,sk:LIBW,,, // sk:LUEYE,,,sk:LIEYE,,,sk:LEYE,,,sk:LUCB,,,sk:LOCB,,, 60 // sk:LMCB,,,sk:LICB,,,sk:LOCK,,,sk:LMCK,,,sk:LICK,,, // sk:LLJO,,,sk:LUJO,,,sk:LUJI,,,sk:LLJI,,,sk:LLJM 70 LMouthMarker = 18; RMouthMarker = 24; case 90: // For 2005 LMouthMarker = 65; // ?? is this right RMouthMarker = 59; //LMouthMarker = 66; //RMouthMarker = 60;
*/ } if ((_MoCapData->numMarkers == 31) || (_MoCapData->numMarkers == 32)) { SbVec3f Dir = (_MoCapData->Data[f][LMouthMarker] - _MoCapData->Data[f][RMouthMarker]); d = (_MoCapData->Data[f][LMouthMarker] - _MoCapData->Data[f][RMouthMarker]).length(); if (d > 57) v = 0; else { // cerr << "v = (d - 57 - 18*pJaw))/15 = " << " (" << d << " - 57 - 18*" << pJaw << "))/15" << " 57-18*pJaw is " << 57-18*pJaw << endl; v = ((57.0 - 18.0*pJaw)-d)/15.0; } // Should give us between 0 and 1; if (v < 0) v = 0.0; if (v > 1) v = 1.0; //cerr << "For frame:" << f << " d: " << d << " Dir:" << PrintVec(Dir) //<< " orb_oris est ~= " << v << endl; return(v); } return(0); } //Written by Stanley Leja //The error between the measured markers and the virtual markers from TalkingHead is calculated here. // markerError is a member of the TalkingHead class for simplicity // It can be a convenience function or a member of some new Processing MoCap // class. The two marker set locations are passed in and used here. float TalkingHead:: markerError(int frame) { float error = 0; float squaredError = 0; float d; int i, N = _VirtualMarkers->point.getNum(); for (i = 39; i < N - 7; i++) { // starts with 39 because mouth muscles only effect markers 39 and up & deletes neck markers 83-89 //cerr << "_MoCapData->Data[f= " << frame << "][i= " << i << "] " << CerrPt(_MoCapData->Data[frame][i]) << endl; //cerr << "_VirtualMarkers->point[i]" << CerrPt(_VirtualMarkers->point[i]) << endl;
57
//cerr << "VirtualMarkerLocationsXfm[i]->translation.getValue()" << CerrPt(VirtualMarkerLocationsXfm[i]->translation.getValue()) << endl; // d = (_MoCapData->Data[frame][i] - (_VirtualMarkers->point[i] + VirtualMarkerLocationsXfm[i]->translation.getValue())).length(); // d = (_MoCapData->Data[frame][i] - _VirtualMarkers->point[i]).length(); // Import markers for the 2005 dataset // 78 bottom of chin // 68 middle of lower lip // 72 top of chin // 71 mid chin d = (_MoCapData->Data[frame][i] - VirtualMarkerLocationsXfm[i]->translation.getValue()).length(); //if (i == 78 || i == 68 || i == 72 || i == 71) //changed by Stan //cerr << "marker " << i << " frame " << frame << " d is " << d <<endl; //changed by Stan error += d; // squaredError += d*d; // weightedError += d*w[i]; // w[i] is some weights for the errors // You can use a SoMFFloat, or just and array // w can be passed in or made a member of the TalkingHead class // or the MoCap class, or have a new class for doing this. } return error; // or squared error; } Written by Stanley Leja /* This subroutine estimates 17 parameters which produce the phoneme-viseme set. It is based on a search of realistic values of each of the parameters in a prioritized order of importance. Once the higher priority parameters are locally searched they are used as given in the remaining local search processes to reduce computation time. The parameters were searched in the following order: k0, k1; k3, k12, k14; k4, k5, k6; k7, k15, k16; k11, k17, k18; and k8, k9, k10. k0 = OPEN_JAW - opens jaw k1 = JAW_IN - moves jaw in k2 = JAW_SIDE – does not work k3 = ORB_ORIS - contracts lips, making mouth opening smaller k4 = L_RIS Left Risorius - moves left corner towards ear k5 = R_RIS Right Risorius - moves right corner towards ear k6 = L_PLATYSMA Left Platysma - moves left corner downward and lateral k7 = R_PLATYSMA Right Platysma - moves right corner downward and lateral k8 = L_ZYG Left Zygomaticus - raises corner up and lateral
   k9  = R_ZYG      Right Zygomaticus - raises corner up and lateral
   k10 = L_LEV_SUP  - moves left top lip up
   k11 = R_LEV_SUP  - moves right top lip up
   k12 = DEP_INF    - opens both lips
   k13 = DEP_ORIS   - does not work
   k14 = MENTALIS   - pulls lips together
   k15 = L_BUCCINATOR - pulls back at left corner
   k16 = R_BUCCINATOR - pulls back at right corner
   k17 = INCISIVE_SUP - top lip rolls over bottom lip
   k18 = INCISIVE_INF - bottom lip rolls over top lip
*/
void TalkingHead::estimateParamsFromMocap(int f, float t)
{
    startFunc("\nestimateParamsFromMoCap");
    float pJaw, pOrb,
          k0 = 0.0, k1 = 0.0, k3 = 0.0, k12 = 0.0, k14 = 0.0,
          markerErrorStart = 10000.0, markerErrorNew = 9999.0;
    float k4 = 0.0, k5 = 0.0, k6 = 0.0, k7 = 0.0, k8 = 0.0, k9 = 0.0,
          k10 = 0.0, k11 = 0.0, k15 = 0.0, k16 = 0.0, k17 = 0.0, k18 = 0.0;

#define STUFF 1
#if STUFF
    SbVec3f localOptVM_Data[100];
#endif

    _MoCapData->MoCapToVM_Error[f][7] = markerError(f);   // establishes marker error without estimation

    //cerr << "time = " << t << endl;
    _MoCapData->MoCapToVM_Value[f][19] = t;   //storing time of frame

    for (int i = 0; i < 19; i++) {
        _LipModel->setParameter(i, 0.0);
    }

    duringFunc("\nAbout to do 0 & 1");
    // now doing 0 & 1
    for (int i = 0; i < 10; i++) {
        k0 = (float) i/10;
        _LipModel->setParameter(0, k0);
        for (int k = 0; k < 3; k++) {
            if (k == 0) k1 = -0.20;
            if (k == 1) k1 = 0.00;
            if (k == 2) k1 = 0.20;
            _LipModel->setParameter(1, k1);
            //needed to adjust prior to deforming characteristic points and markers
            SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40,
                        - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE)
                          * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0],
                        _LipModel->getParameter(_LIP_PARAM_JAW_IN)
                          * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]);
            DeformCharacteristicPoints();
            DeformMarkers();

            markerErrorNew = markerError(f);
            if (markerErrorNew < markerErrorStart) {
                //cerr << "\n*******markerErrorNew = " << markerErrorNew
                //     << "\tmarkerErrorStart = " << markerErrorStart << "\n";
                markerErrorStart = markerErrorNew;
                _MoCapData->MoCapToVM_Value[f][0] = k0;   //storing best parameters
                _MoCapData->MoCapToVM_Value[f][1] = k1;
#if STUFF
                for (int iii = 39; iii < 83; iii++) {
                    localOptVM_Data[iii] =
                        VirtualMarkerLocationsXfm[iii]->translation.getValue();
                    /*
                    if (iii == 78 || iii == 68 || iii == 72 || iii == 71) {
                        cerr << "VirtualMarkerLocationsXfm[" << iii << "]->translation.getValue()"
                             << CerrPt(VirtualMarkerLocationsXfm[iii]->translation.getValue()) << endl;
                        cerr << "localOptVM_Data[iii] = " << CerrPt(localOptVM_Data[iii]) << endl;
                    }
                    */
                }
#endif
            }
        } // k1
    } // k0

    // develops comparison data from old estimation process
    pJaw = estimateJawParamFromMocap(f);
    _LipModel->setParameter(_LIP_PARAM_OPEN_JAW, pJaw);
    pOrb = estimateOrbOrisParamFromMocap(f, pJaw);
    _LipModel->setParameter(_LIP_PARAM_ORB_ORIS, pOrb);
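    // Worked example of the old OrbOris estimate (hypothetical input values, for
    // illustration only): estimateOrbOrisParamFromMocap() measures the distance d
    // between the LMTH and RMTH mouth-corner markers and maps it to
    //     v = ((57 - 18*pJaw) - d) / 15, clamped to [0, 1].
    // If, say, d = 45 and pJaw = 0.3 in the capture units assumed by those
    // constants, then v = ((57 - 5.4) - 45) / 15 = 0.44, i.e. a moderate
    // orbicularis oris contraction.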
cerr << "The old estimated parameters are pJaw: " << pJaw << "\tpOrb: " << pOrb << endl; SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40, - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE) * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0], _LipModel->getParameter(_LIP_PARAM_JAW_IN) * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2] ); #if STUFF for (int ii = 39; ii < 83; ii++) { VirtualMarkerLocationsXfm[ii]->translation = localOptVM_Data[ii]; } #endif DeformCharacteristicPoints(); DeformMarkers(); _MoCapData->MoCapToVM_Error[f][0] = markerError(f); cerr << "\tmarkerErrorAfterDeformFace Using Old Estimates = " << markerError(f) << "\n\n"; // sets parameters to values from new estimation process // develops comparison from new process for (int i = 0; i < 19; i++) { _LipModel->setParameter(i, _MoCapData->MoCapToVM_Value[f][i]); if (_MoCapData->MoCapToVM_Value[f][i] != 0.0) cerr << " MoCapToVM_Value[" << f << "][" << i << "] = " << _MoCapData->MoCapToVM_Value[f][i] << "\n"; } _MoCapData->MoCapToVM_Error[f][1] = markerErrorStart; // use [1], etc., if decide to make more pass(es) for error reduction cerr << " MoCapToVM_Error[" << f << "][1] = " << markerErrorStart << "\n"; SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40, - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE) * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0], _LipModel->getParameter(_LIP_PARAM_JAW_IN) * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2] ); DeformCharacteristicPoints(); DeformMarkers();
#if STUFF
    for (int ii = 39; ii < 83; ii++) {
        VirtualMarkerLocationsXfm[ii]->translation = localOptVM_Data[ii];
    }
#endif
    DeformCharacteristicPoints();
    DeformMarkers();

    duringFunc("\nAbout to do 3,12,14");
    // now doing 3, 12, & 14
    k0 = _MoCapData->MoCapToVM_Value[f][0];
    k1 = _MoCapData->MoCapToVM_Value[f][1];
    //cerr << "***k0 = " << k0 << "\tk1 = " << k1 << "\tk3 = " << k3
    //     << "\tk12 = " << k12 << "\tk14 = " << k14 << endl;

    for (int k = 0; k < 5; k++) {
        if (k == 0) k3 = 0.0;
        if (k == 1) k3 = 0.2;
        if (k == 2) k3 = 0.4;
        if (k == 3) k3 = 0.6;
        if (k == 4) k3 = 0.8;
        _LipModel->setParameter(3, k3);

        for (int k = 0; k < 10; k++) {
            k12 = (float) k/10;
            _LipModel->setParameter(12, k12);

            for (int k = 0; k < 5; k++) {
                if (k == 0) k14 = 0.0;
                if (k == 1) k14 = 0.2;
                if (k == 2) k14 = 0.4;
                if (k == 3) k14 = 0.6;
                if (k == 4) k14 = 0.8;
                _LipModel->setParameter(14, k14);

                SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40,
                            - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE)
                              * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0],
                            _LipModel->getParameter(_LIP_PARAM_JAW_IN) *
                              _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]);
                DeformCharacteristicPoints();
                DeformMarkers();

                markerErrorNew = markerError(f);
                if (markerErrorNew < markerErrorStart) {
                    markerErrorStart = markerErrorNew;
                    _MoCapData->MoCapToVM_Value[f][3]  = k3;
                    _MoCapData->MoCapToVM_Value[f][12] = k12;
                    _MoCapData->MoCapToVM_Value[f][14] = k14;
#if STUFF
                    for (int iii = 39; iii < 83; iii++) {
                        localOptVM_Data[iii] =
                            VirtualMarkerLocationsXfm[iii]->translation.getValue();
                    }
#endif
                }
            } //k14
        } //k12
    } //k3

    // sets parameters to values from new estimation process
    // develops comparison from new process
    for (int i = 0; i < 19; i++) {
        _LipModel->setParameter(i, _MoCapData->MoCapToVM_Value[f][i]);
        //cerr << _MoCapData->MoCapToVM_Value[f][i];
        if (_MoCapData->MoCapToVM_Value[f][i] != 0.0)
            cerr << " MoCapToVM_Value[" << f << "][" << i << "] = "
                 << _MoCapData->MoCapToVM_Value[f][i] << "\n";
    }

    _MoCapData->MoCapToVM_Error[f][2] = markerErrorStart;
    // use [1], etc., if decide to make more pass(es) for error reduction
    cerr << " MoCapToVM_Error[" << f << "][2] = " << markerErrorStart << "\n";

    SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40,
                - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE)
                  * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0],
                _LipModel->getParameter(_LIP_PARAM_JAW_IN)
                  * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]);
#if STUFF
    for (int ii = 39; ii < 83; ii++) {
        VirtualMarkerLocationsXfm[ii]->translation = localOptVM_Data[ii];
    }
#endif
    DeformCharacteristicPoints();
    DeformMarkers();

    duringFunc("\nAbout to do 4,5,6");
    // now doing 4, 5, & 6
    k3  = _MoCapData->MoCapToVM_Value[f][3];
    k12 = _MoCapData->MoCapToVM_Value[f][12];
    k14 = _MoCapData->MoCapToVM_Value[f][14];

    for (int k = 0; k < 5; k++) {
        if (k == 0) k4 = 0.0;
        if (k == 1) k4 = 0.2;
        if (k == 2) k4 = 0.4;
        if (k == 3) k4 = 0.6;
        if (k == 4) k4 = 0.8;
        _LipModel->setParameter(4, k4);

        for (int k = 0; k < 5; k++) {
            if (k == 0) k5 = 0.0;
            if (k == 1) k5 = 0.2;
            if (k == 2) k5 = 0.4;
            if (k == 3) k5 = 0.6;
            if (k == 4) k5 = 0.8;
            _LipModel->setParameter(5, k5);

            for (int k = 0; k < 5; k++) {
                if (k == 0) k6 = 0.0;
                if (k == 1) k6 = 0.2;
                if (k == 2) k6 = 0.4;
                if (k == 3) k6 = 0.6;
                if (k == 4) k6 = 0.8;
                _LipModel->setParameter(6, k6);

                SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40,
                            - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE) *
                              _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0],
                            _LipModel->getParameter(_LIP_PARAM_JAW_IN)
                              * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]);
                DeformCharacteristicPoints();
                DeformMarkers();

                markerErrorNew = markerError(f);
                if (markerErrorNew < markerErrorStart) {
                    markerErrorStart = markerErrorNew;
                    _MoCapData->MoCapToVM_Value[f][4] = k4;
                    _MoCapData->MoCapToVM_Value[f][5] = k5;
                    _MoCapData->MoCapToVM_Value[f][6] = k6;
#if STUFF
                    for (int iii = 39; iii < 83; iii++) {
                        localOptVM_Data[iii] =
                            VirtualMarkerLocationsXfm[iii]->translation.getValue();
                    }
#endif
                }
            } //k6
        } // k5
    } // k4

    // sets parameters to values from new estimation process
    // develops comparison from new process
    for (int i = 0; i < 19; i++) {
        _LipModel->setParameter(i, _MoCapData->MoCapToVM_Value[f][i]);
        //cerr << _MoCapData->MoCapToVM_Value[f][i];
        if (_MoCapData->MoCapToVM_Value[f][i] != 0.0)
            cerr << " MoCapToVM_Value[" << f << "][" << i << "] = "
                 << _MoCapData->MoCapToVM_Value[f][i] << "\n";
    }

    _MoCapData->MoCapToVM_Error[f][3] = markerErrorStart;
    // use [1], etc., if decide to make more pass(es) for error reduction
    cerr << " MoCapToVM_Error[" << f << "][3] = " << markerErrorStart << "\n";

    SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40,
                - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE)
                  * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0],
                _LipModel->getParameter(_LIP_PARAM_JAW_IN)
                  * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]);
#if STUFF
    for (int ii = 39; ii < 83; ii++) {
        VirtualMarkerLocationsXfm[ii]->translation = localOptVM_Data[ii];
    }
#endif
    DeformCharacteristicPoints();
    DeformMarkers();

    duringFunc("\nAbout to do 7,15,16");
    // now doing 7, 15, & 16
    k4 = _MoCapData->MoCapToVM_Value[f][4];
    k5 = _MoCapData->MoCapToVM_Value[f][5];
    k6 = _MoCapData->MoCapToVM_Value[f][6];
    //cerr << "***k0 = " << k0 << "\tk1 = " << k1 << "\tk3 = " << k3
    //     << "\tk12 = " << k12 << "\tk14 = " << k14 << endl;
    //cerr << "***k4 = " << k4 << "\tk5 = " << k5 << "\tk6 = " << k6 << endl;
    //cerr << "***k7 = " << k7 << "\tk15 = " << k15 << "\tk16 = " << k16 << endl;

    for (int k = 0; k < 5; k++) {
        if (k == 0) k7 = 0.0;
        if (k == 1) k7 = 0.2;
        if (k == 2) k7 = 0.4;
        if (k == 3) k7 = 0.6;
        if (k == 4) k7 = 0.8;
        _LipModel->setParameter(7, k7);

        for (int k = 0; k < 5; k++) {
            if (k == 0) k15 = 0.0;
            if (k == 1) k15 = 0.2;
            if (k == 2) k15 = 0.4;
            if (k == 3) k15 = 0.6;
            if (k == 4) k15 = 0.8;
            _LipModel->setParameter(15, k15);

            for (int k = 0; k < 5; k++) {
                if (k == 0) k16 = 0.0;
                if (k == 1) k16 = 0.2;
                if (k == 2) k16 = 0.4;
                if (k == 3) k16 = 0.6;
                if (k == 4) k16 = 0.8;
                _LipModel->setParameter(16, k16);

                SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40,
                            - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE)
                              * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0],
                            _LipModel->getParameter(_LIP_PARAM_JAW_IN)
                              * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]);
                DeformCharacteristicPoints();
                DeformMarkers();

                markerErrorNew = markerError(f);
                if (markerErrorNew < markerErrorStart) {
                    markerErrorStart = markerErrorNew;
                    _MoCapData->MoCapToVM_Value[f][7]  = k7;
                    _MoCapData->MoCapToVM_Value[f][15] = k15;
                    _MoCapData->MoCapToVM_Value[f][16] = k16;
#if STUFF
                    for (int iii = 39; iii < 83; iii++) {
                        localOptVM_Data[iii] =
                            VirtualMarkerLocationsXfm[iii]->translation.getValue();
                    }
#endif
                }
            } //k16
        } // k15
    } // k7

    // sets parameters to values from new estimation process
    // develops comparison from new process
    for (int i = 0; i < 19; i++) {
        _LipModel->setParameter(i, _MoCapData->MoCapToVM_Value[f][i]);
        //cerr << _MoCapData->MoCapToVM_Value[f][i];
        if (_MoCapData->MoCapToVM_Value[f][i] != 0.0)
            cerr << " MoCapToVM_Value[" << f << "][" << i << "] = "
                 << _MoCapData->MoCapToVM_Value[f][i] << "\n";
    }

    _MoCapData->MoCapToVM_Error[f][4] = markerErrorStart;
    // use [1], etc., if decide to make more pass(es) for error reduction
    cerr << " MoCapToVM_Error[" << f << "][4] = " << markerErrorStart << "\n";

    SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40,
                - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE)
                  * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0],
                _LipModel->getParameter(_LIP_PARAM_JAW_IN)
                  * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]);

#if STUFF
    for (int ii = 39; ii < 83; ii++) {
        VirtualMarkerLocationsXfm[ii]->translation = localOptVM_Data[ii];
    }
#endif
    DeformCharacteristicPoints();
    DeformMarkers();

    duringFunc("\nAbout to do 11,17,18");
    // now doing 11, 17, & 18
    k7  = _MoCapData->MoCapToVM_Value[f][7];
    k15 = _MoCapData->MoCapToVM_Value[f][15];
    k16 = _MoCapData->MoCapToVM_Value[f][16];

    for (int k = 0; k < 5; k++) {
        if (k == 0) k11 = 0.0;
        if (k == 1) k11 = 0.2;
        if (k == 2) k11 = 0.4;
        if (k == 3) k11 = 0.6;
        if (k == 4) k11 = 0.8;
        _LipModel->setParameter(11, k11);

        for (int k = 0; k < 5; k++) {
            if (k == 0) k17 = 0.0;
            if (k == 1) k17 = 0.2;
            if (k == 2) k17 = 0.4;
            if (k == 3) k17 = 0.6;
            if (k == 4) k17 = 0.8;
            _LipModel->setParameter(17, k17);

            for (int k = 0; k < 5; k++) {
                if (k == 0) k18 = 0.0;
                if (k == 1) k18 = 0.2;
                if (k == 2) k18 = 0.4;
                if (k == 3) k18 = 0.6;
                if (k == 4) k18 = 0.8;
                _LipModel->setParameter(18, k18);

                SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40,
                            - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE)
                              * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0],
                            _LipModel->getParameter(_LIP_PARAM_JAW_IN)
                              * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]);
                DeformCharacteristicPoints();
                DeformMarkers();

                markerErrorNew = markerError(f);
                if (markerErrorNew < markerErrorStart) {
                    markerErrorStart = markerErrorNew;
                    _MoCapData->MoCapToVM_Value[f][11] = k11;
                    _MoCapData->MoCapToVM_Value[f][17] = k17;
                    _MoCapData->MoCapToVM_Value[f][18] = k18;
#if STUFF
                    for (int iii = 39; iii < 83; iii++) {
                        localOptVM_Data[iii] =
                            VirtualMarkerLocationsXfm[iii]->translation.getValue();
                    }
#endif
                }
            } //k18
        } // k17
    } // k11
    // sets parameters to values from new estimation process
    // develops comparison from new process
    for (int i = 0; i < 19; i++) {
        _LipModel->setParameter(i, _MoCapData->MoCapToVM_Value[f][i]);
        //cerr << _MoCapData->MoCapToVM_Value[f][i];
        if (_MoCapData->MoCapToVM_Value[f][i] != 0.0)
            cerr << " MoCapToVM_Value[" << f << "][" << i << "] = "
                 << _MoCapData->MoCapToVM_Value[f][i] << "\n";
    }

    _MoCapData->MoCapToVM_Error[f][5] = markerErrorStart;
    // use [1], etc., if decide to make more pass(es) for error reduction
    cerr << " MoCapToVM_Error[" << f << "][5] = " << markerErrorStart << "\n";

    SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40,
                - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE)
                  * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0],
                _LipModel->getParameter(_LIP_PARAM_JAW_IN)
                  * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]);

#if STUFF
    for (int ii = 39; ii < 83; ii++) {
        VirtualMarkerLocationsXfm[ii]->translation = localOptVM_Data[ii];
    }
#endif
    DeformCharacteristicPoints();
    DeformMarkers();

    duringFunc("\nAbout to do 8,9,10");
    // now doing 8, 9, & 10
    k11 = _MoCapData->MoCapToVM_Value[f][11];
    k17 = _MoCapData->MoCapToVM_Value[f][17];
    k18 = _MoCapData->MoCapToVM_Value[f][18];
    //cerr << "***k0 = " << k0 << "\tk1 = " << k1 << endl;
    //cerr << "***k3 = " << k3 << "\tk12 = " << k12 << "\tk14 = " << k14 << endl;
    //cerr << "***k4 = " << k4 << "\tk5 = " << k5 << "\tk6 = " << k6 << endl;
    //cerr << "***k7 = " << k7 << "\tk15 = " << k15 << "\tk16 = " << k16 << endl;
    //cerr << "***k11 = " << k11 << "\tk17 = " << k17 << "\tk18 = " << k18 << endl;
//cerr << "***k8 = " << k8 << "\tk9 = " << k9 << "\tk10 = " << k10 << endl; for (int k = 0; k < 5; k++) { if (k == 0) k8 = 0.0; if (k == 1) k8 = 0.2; if (k == 2) k8 = 0.4; if (k == 3) k8 = 0.6; if (k == 4) k8 = 0.8; _LipModel->setParameter(8, k8); for (int k = 0; k < 5; k++) { if (k == 0) k9 = 0.0; if (k == 1) k9 = 0.2; if (k == 2) k9 = 0.4; if (k == 3) k9 = 0.6; if (k == 4) k9 = 0.8; _LipModel->setParameter(9, k9); for (int k = 0; k < 5; k++) { if (k == 0) k10 = 0.0; if (k == 1) k10 = 0.2; if (k == 2) k10 = 0.4; if (k == 3) k10 = 0.6; if (k == 4) k10 = 0.8; _LipModel->setParameter(10, k10); SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40, - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE) * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0], _LipModel->getParameter(_LIP_PARAM_JAW_IN) * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]); DeformCharacteristicPoints(); DeformMarkers(); markerErrorNew = markerError(f); if (markerErrorNew < markerErrorStart) { markerErrorStart = markerErrorNew; _MoCapData->MoCapToVM_Value[f][8] = k8; _MoCapData->MoCapToVM_Value[f][9] = k9; _MoCapData->MoCapToVM_Value[f][10] = k10; #if STUFF for (int iii = 39; iii < 83; iii++) {
                        localOptVM_Data[iii] =
                            VirtualMarkerLocationsXfm[iii]->translation.getValue();
                    }
#endif
                }
            } //k10
        } // k9
    } // k8

    // sets parameters to values from new estimation process
    // develops comparison from new process
    for (int i = 0; i < 19; i++) {
        _LipModel->setParameter(i, _MoCapData->MoCapToVM_Value[f][i]);
        //cerr << _MoCapData->MoCapToVM_Value[f][i];
        if (_MoCapData->MoCapToVM_Value[f][i] != 0.0)
            cerr << " MoCapToVM_Value[" << f << "][" << i << "] = "
                 << _MoCapData->MoCapToVM_Value[f][i] << "\n";
    }

    _MoCapData->MoCapToVM_Error[f][6] = markerErrorStart;
    // use [1], etc., if decide to make more pass(es) for error reduction
    cerr << " MoCapToVM_Error[" << f << "][6] = " << markerErrorStart << "\n";

    SetMandible(_LipModel->getParameter(_LIP_PARAM_OPEN_JAW)*40,
                - _LipModel->getParameter(_LIP_PARAM_JAW_SIDE)
                  * _LipModel->getDelta(_LIP_PARAM_JAW_SIDE)[0],
                _LipModel->getParameter(_LIP_PARAM_JAW_IN)
                  * _LipModel->getDelta(_LIP_PARAM_JAW_IN)[2]);

#if STUFF
    for (int ii = 39; ii < 83; ii++) {
        VirtualMarkerLocationsXfm[ii]->translation = localOptVM_Data[ii];
    }
#endif
    //DeformCharacteristicPoints();
    //DeformMarkers();
    DeformFace();

    endFunc("\nestimateParamsFromMoCap");

    //used to generate viseme data for individual words
    if ((f == 444) || (f == 706) || (f == 930) || (f == 1174) ||
        (f == 1401) || (f == 1634) || (f == 1840) || (f == 2067)) {
        cerr << "\n\nMoCapToVM_Value\n";
        for (int i = 1057; i < 1174; i++) {
            cerr << i << ", ";
            for (int ii = 0; ii < 20; ii++) {
                cerr << _MoCapData->MoCapToVM_Value[i][ii] << ", ";
            }
            cerr << endl;
        }
        cerr << "\n\nMoCapToVM_Error\n";
        for (int i = 1057; i < 1174; i++) {
            cerr << i << ", ";
            for (int ii = 0; ii < 8; ii++) {
                cerr << _MoCapData->MoCapToVM_Error[i][ii] << ", ";
            }
            cerr << endl;
        }
    }
}
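The estimation routine above is, in effect, a greedy, prioritized grid search: each group of muscle parameters is swept over a small set of candidate activations while all previously fitted groups are held fixed, and a candidate combination is kept only if it lowers the marker error for the frame. The compact sketch below restates that pattern; the group lists, the single shared candidate set, and the evaluate callback are illustrative stand-ins for the per-frame SetMandible / DeformMarkers / markerError sequence, not part of the thesis source.

#include <cstddef>
#include <functional>
#include <vector>

// Greedy, prioritized grid search (illustrative sketch only).
// 'groups' lists parameter indices in priority order; 'candidates' holds the
// activation values tried for every parameter in a group; 'evaluate' stands in
// for applying the parameters to the face model and returning the marker error.
std::vector<float> prioritizedGridSearch(
    const std::vector<std::vector<int> >& groups,
    const std::vector<float>& candidates,
    int numParams,
    const std::function<float(const std::vector<float>&)>& evaluate)
{
    std::vector<float> best(numParams, 0.0f);
    float bestError = evaluate(best);                 // error with all parameters at 0

    for (const std::vector<int>& group : groups) {
        std::vector<float> trial = best;              // earlier groups stay fixed
        std::vector<float> groupBest(group.size(), 0.0f);
        std::vector<std::size_t> idx(group.size(), 0);
        bool done = false;
        while (!done) {                               // exhaustive sweep over the group
            for (std::size_t g = 0; g < group.size(); ++g)
                trial[group[g]] = candidates[idx[g]];
            float err = evaluate(trial);
            if (err < bestError) {                    // keep only improving combinations
                bestError = err;
                for (std::size_t g = 0; g < group.size(); ++g)
                    groupBest[g] = trial[group[g]];
            }
            std::size_t g = 0;                        // advance mixed-radix counter over candidates
            while (g < group.size() && ++idx[g] == candidates.size()) {
                idx[g] = 0;
                ++g;
            }
            done = (g == group.size());
        }
        for (std::size_t g = 0; g < group.size(); ++g)
            best[group[g]] = groupBest[g];            // freeze this group before the next one
    }
    return best;
}

Calling this sketch with the groups {0,1}, {3,12,14}, {4,5,6}, {7,15,16}, {11,17,18}, {8,9,10} and the candidate set {0.0, 0.2, 0.4, 0.6, 0.8} visits the parameter groups in the same priority order as estimateParamsFromMocap, although the real code uses a finer sweep for k0 and a signed range for k1.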