Create Photo-Realistic Talking Face
Changbo Hu
2001.11.26
* This work was done while visiting Microsoft Research China with Baining Guo and Bo Zhang.
Outline
  Introduction of talking face
  Motivations
  System overview
  Techniques
  Conclusions
Introduction
What is a talking face
  Face (lip) animation, driven by voice
  Applications
The process of talking face
  Face model
  Motion capture
  Mapping between audio and video
  Rendering
Photo-realistic?
Literature
  Waters, 93, DECface, 2D wire-frame model
  Terzopoulos, 95, Skin and muscle model
  Bregler, 97, Video Rewrite, sample-image based
  T.S. Huang, 98, Mesh model from range data
  Poggio, 98, MikeTalk, viseme morphing
  Guenter, 99, Making Faces, 3D from multiple cameras
  Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraint
  Cosatto, 00, Planar quads model
Some face models
Motivations
Aim: a graphics interface for a conversation agent
  Photo-realistic
  Driven by Chinese
  Smooth connection between sentences
Extended from “Video Rewrite”
System overview: Pipeline of the system (1)
[Diagram: Video with sound is split into images and sound. Images: pose tracking, then lip motion tracking. Sound: phoneme segmentation, then annotation. Both feed the train database.]
System overview: Pipeline of the system (2)
[Diagram: New text -> TTS system -> Wav sound -> Segmentation -> Triphone sequence -> Synthesized triphone sequence (drawn from the train database) -> Lip motion sequence -> Rewrite to faces (with the background sequence).]
Techniques
Analysis:
  Audio process
  Image process
Phonemes represent the basic elements of speech. All possible speech can be represented by a combination of phonemes.
  CH, JH, S, EH, EY, OY, AE, SIL…
A triphone is three consecutive phonemes. It not only represents pronunciation characteristics but also contains context information.
  T-IY-P, IY-P-AA, P-AA-T…
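As a small illustration of how a phoneme sequence turns into overlapping triphones, here is a minimal sketch. The function name `to_triphones` and the five-phoneme input are assumptions chosen to reproduce the slide's example; they are not taken from the system itself.

```python
def to_triphones(phonemes):
    """Slide a window of three over a phoneme list and join each window with '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


# Hypothetical input: a phoneme string that yields the slide's example triphones.
print(to_triphones(["T", "IY", "P", "AA", "T"]))
# prints ['T-IY-P', 'IY-P-AA', 'P-AA-T']
```

Adjacent triphones share two phonemes, which is what later allows neighbouring video clips to overlap when they are stitched.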
Chinese Phoneme vs. English
Chinese phonemes fall into two basic groups: initials and finals.
Each Chinese final has 5 tones: 1, 2, 3, 4, 5. Different tones: a1, a2, a3, a4, a5.
Chinese finals are actually not basic elements of speech.
Pose tracking
Assume a plane model for the face
Standard minimization method to find the transform matrix (affine transform) [Black, 95]
A mask is used to constrain tracking to the part of the face that is of interest
[Figures: template picture; mask image]
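Motion prediction using parameters with physical meaning
\[
\begin{bmatrix} a_0 & a_1 & a_2 \\ a_3 & a_4 & a_5 \\ 0 & 0 & 1 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}
\cdot
\begin{bmatrix} s_x & k & 0 \\ k & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix}
\cdot
\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]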
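The decomposition of the affine pose transform into translation, scale/shear and rotation can be sketched as follows. This is only an illustration: the helper names (`matmul3`, `pose_matrix`, `apply_pose`) are hypothetical, and the counter-clockwise sign convention for the rotation is an assumption, since the slide does not state it.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][n] * b[n][j] for n in range(3)) for j in range(3)]
            for i in range(3)]


def pose_matrix(tx, ty, sx, sy, k, theta):
    """Compose translation * scale/shear * rotation into one affine matrix."""
    translation = [[1, 0, tx], [0, 1, ty], [0, 0, 1]]
    scale_shear = [[sx, k, 0], [k, sy, 0], [0, 0, 1]]
    c, s = math.cos(theta), math.sin(theta)
    # Assumption: counter-clockwise rotation for positive theta.
    rotation = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    return matmul3(translation, matmul3(scale_shear, rotation))


def apply_pose(m, x, y):
    """Map a 2D point through the affine matrix using homogeneous coordinates."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])
```

Working with translation, scale, shear and rotation rather than the six raw matrix entries keeps each parameter interpretable, which is what makes frame-to-frame motion prediction on them meaningful.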
Pose tracking
Some tracking results:
Lip motion tracking
  Using Eigen Points (Covell, 91)
  Feature points include jaw, lip and teeth
  Training database specified manually
  Auto tracking through all pose-tracked images
Lip motion tracking
[Figures: train database (hand-labeled); auto tracking results]
Synthesis of new sentences
  New text is converted by the TTS system to wav
  The wav is segmented into a phoneme sequence
  DP is used to find an optimal video sequence from the training database
  Time-align the triphone videos and stitch them together
  Transform the lip sequence and paste it onto the background faces
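The slides do not say how triphone videos are time-aligned to the new audio. One simple possibility, shown here purely as an assumption, is to resample each clip's frame indices linearly so that the clip fills the number of frames its triphone occupies in the synthesized speech. `resample_indices` is a hypothetical helper.

```python
def resample_indices(src_len, dst_len):
    """Return dst_len source-frame indices spread linearly over src_len frames."""
    if src_len <= 0 or dst_len <= 0:
        return []
    if dst_len == 1:
        return [0]
    step = (src_len - 1) / (dst_len - 1)
    # Nearest-frame selection; a real system might blend neighbouring frames instead.
    return [round(i * step) for i in range(dst_len)]


# Stretch a 3-frame clip to 5 frames, or shrink a 5-frame clip to 3.
stretched = resample_indices(3, 5)
shrunk = resample_indices(5, 3)
```

The first and last source frames are always kept, so the clip boundaries that overlap with neighbouring triphones stay in place.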
Lip sequence synthesis
[Diagram: new phoneme sequences; candidate triphones 1 to C; optimal phoneme sequences]
Dynamic Programming
[Diagram: graph from Begin to End through candidate nodes Triphone1 to Triphone5]
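The search from Begin to End over candidate triphones can be illustrated with a generic dynamic-programming shortest path. This is a minimal sketch, not the system's implementation: `best_path`, `node_cost` and `edge_cost` are hypothetical names, and every position is assumed to have at least one candidate.

```python
def best_path(candidates, node_cost, edge_cost):
    """Pick one candidate per position so that summed node and edge costs are minimal.

    candidates[t] lists the options for position t (assumed non-empty).
    node_cost(t, c) is the cost of using candidate c at position t.
    edge_cost(a, b) is the cost of following candidate a with candidate b.
    """
    # best holds, per candidate at the current position, (cheapest cost, path so far).
    best = [(node_cost(0, c), [c]) for c in candidates[0]]
    for t in range(1, len(candidates)):
        new_best = []
        for c in candidates[t]:
            cost, path = min(
                ((prev_cost + edge_cost(prev_path[-1], c), prev_path)
                 for prev_cost, prev_path in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + node_cost(t, c), path + [c]))
        best = new_best
    cost, path = min(best, key=lambda item: item[0])
    return path, cost


# Toy example: only the transition a2 -> b1 is free, every other one costs 1.
path, cost = best_path(
    [["a1", "a2"], ["b1", "b2"]],
    node_cost=lambda t, c: 0,
    edge_cost=lambda a, b: 0 if (a, b) == ("a2", "b1") else 1,
)
```

Because each step keeps only the cheapest path into every candidate, the work grows with the number of positions times the square of the candidates per position rather than with the number of full paths.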
Edge Cost Definition
Two parts:
  1. Phoneme distance: the distances of the 3 phonemes added together
  2. Lip shape distance for the overlap portion of the triphone videos
The two parts are combined as a weighted sum
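The two-part edge cost can be sketched as below. Several details are assumptions made for illustration: the per-phoneme distance function `dist` is supplied by the caller, lip shapes are lists of (x, y) feature points per frame, the lip distance is taken as the mean Euclidean point distance over the overlapping frames, and the weights `w_phoneme` and `w_lip` are free parameters.

```python
import math


def phoneme_distance(target, candidate, dist):
    """Sum the per-position distances between two triphones (3 phonemes each)."""
    return sum(dist(t, c) for t, c in zip(target, candidate))


def lip_shape_distance(frames_a, frames_b):
    """Mean Euclidean distance between matching lip points in overlapping frames."""
    total, count = 0.0, 0
    for points_a, points_b in zip(frames_a, frames_b):
        for (xa, ya), (xb, yb) in zip(points_a, points_b):
            total += math.hypot(xa - xb, ya - yb)
            count += 1
    return total / count if count else 0.0


def edge_cost(target, candidate, frames_a, frames_b, dist,
              w_phoneme=1.0, w_lip=1.0):
    """Weighted sum of the phoneme distance and the lip shape distance."""
    return (w_phoneme * phoneme_distance(target, candidate, dist)
            + w_lip * lip_shape_distance(frames_a, frames_b))


# Hypothetical 0/1 phoneme distance: 0 for a match, 1 for a mismatch.
zero_one = lambda a, b: 0 if a == b else 1
```

A graded phoneme distance (for example, grouping phonemes with similar mouth shapes) would slot in through `dist` without changing the rest of the sketch.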
Background video generation
The background is a video sequence in which the virtual character spoke something else
Similarity measurement of background:
\[ D(F_1, F_2) = w_1 t_x + w_2 t_y + w_3 \theta + w_4 k + w_5 s_x + w_6 s_y \]
Select the “standard frame”: the frame with the maximal number of frames similar to it
Filter out the frames with jerkiness
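A weighted frame distance over the six pose parameters, and the choice of a standard frame as the one with the most similar frames, can be sketched as follows. The use of absolute parameter differences between two frames, the similarity `threshold`, and the function names are all assumptions for illustration.

```python
def frame_distance(pose_a, pose_b, weights):
    """Weighted sum of absolute differences of (tx, ty, theta, k, sx, sy)."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, pose_a, pose_b))


def select_standard_frame(poses, weights, threshold):
    """Return the index of the frame with the most other frames within threshold."""
    def similar_count(i):
        return sum(
            1 for j in range(len(poses))
            if j != i and frame_distance(poses[i], poses[j], weights) <= threshold
        )
    return max(range(len(poses)), key=similar_count)


# Four hypothetical poses that differ only in horizontal translation.
poses = [(0, 0, 0, 0, 1, 1), (1, 0, 0, 0, 1, 1),
         (2, 0, 0, 0, 1, 1), (10, 0, 0, 0, 1, 1)]
standard = select_standard_frame(poses, (1, 1, 1, 1, 1, 1), threshold=1)
```

The same distance could also serve the jerkiness filter, for instance by dropping frames whose distance to their predecessor exceeds a limit.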
Stitch the time-aligned result to background faces
  Write back with a mask
  Transform the synthesized lip to the background face
[Figures: mask image for write-back operation; original background frame; write-back result of the same frame]
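Writing the synthesized lip back with a mask amounts to a per-pixel blend. The sketch below assumes grayscale images stored as nested lists of equal size, a mask with values between 0 and 1, and a lip image that has already been transformed into the background frame's coordinates; `write_back` is a hypothetical name.

```python
def write_back(background, lip, mask):
    """Blend lip into background: mask 1 keeps the lip pixel, 0 keeps the background."""
    return [
        [m * l + (1 - m) * b for b, l, m in zip(bg_row, lip_row, mask_row)]
        for bg_row, lip_row, mask_row in zip(background, lip, mask)
    ]


# 2x2 toy images: one pixel fully replaced, one blended half and half.
result = write_back(
    background=[[10, 10], [10, 10]],
    lip=[[200, 200], [200, 200]],
    mask=[[0, 1], [0.5, 0]],
)
```

Fractional mask values near the border of the mouth region soften the seam between the pasted lip and the background face.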
More video results
Conclusion and Future Work
  Pose tracking and lip motion tracking
  Size of the train database
  Talking face with expression
  Real-time generation?
  Fast modeling for different persons