MikeTalk: An Adaptive Man-Machine Interface
Tony Ezzat, Volker Blanz, Tomaso Poggio
Jan 23, 2016
TTVS Overview
• Input: Text
• Output: Photo-realistic talking face uttering text
Desktop Agents
You have received 1 email from Tommy Poggio.
Customer Support
You have bought 20 shares of SONY at $40 each.
Advertisements
Hi Tony, would you be interested in a ticket from Boston to New York for $50.00?
Modules
Phoneme Corpus
Step 1:
– collect a visual corpus from a subject
– corpus contains 44 words
– one word for each American English phoneme
6 Consonantal Visemes
Step 2:
– extract one image per phoneme: viseme
– group visemes together by visual similarity
9 Vocalic Visemes (+ 1 Silence Viseme)
Problem 1: Need to Interpolate!
Solution: Morphing!
Simultaneous interpolation of shape & texture (Beier & Neely 1992).
Problem 2: Too tedious to specify correspondence by hand across many images!
Solution: Optical Flow
• To interpolate between two visemes, optical flow is first computed
• A 2D motion vector field is produced: (dx(x,y), dy(x,y))
(Horn & Schunck 1986) (Lucas & Kanade 1988)
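As a rough illustration of how a motion vector can be recovered from two images, the sketch below solves the Lucas-Kanade least-squares problem for a single window, yielding one (dx, dy) pair. A dense field such as dx(x,y), dy(x,y) would repeat this around every pixel. The function names and the synthetic Gaussian test image are assumptions for illustration; the slides do not say which flow implementation MikeTalk uses.

```python
import numpy as np

def lucas_kanade_window(frame0, frame1):
    """Estimate one (dx, dy) motion vector for a whole window by least squares."""
    # Differentiate the average of both frames; this symmetric form
    # reduces the bias of the first-order brightness-constancy model.
    mid = 0.5 * (frame0 + frame1)
    iy, ix = np.gradient(mid)      # np.gradient returns d/drow first, then d/dcol
    it = frame1 - frame0           # temporal derivative
    # Normal equations of  ix*dx + iy*dy + it = 0  summed over all pixels.
    a = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    b = -np.array([np.sum(ix * it), np.sum(iy * it)])
    dx, dy = np.linalg.solve(a, b)
    return dx, dy

def gaussian_blob(cx, cy, size=41, sigma=5.0):
    """Hypothetical smooth test image: a Gaussian bump centred at (cx, cy)."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# Synthetic check: the blob moves by a known sub-pixel amount.
frame0 = gaussian_blob(20.0, 20.0)
frame1 = gaussian_blob(20.3, 20.2)
dx, dy = lucas_kanade_window(frame0, frame1)
```

The estimate is only reliable for small motions and textured windows; larger motions are usually handled with a coarse-to-fine image pyramid.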
Morphing
• Forward warping A to B
• Forward warping B to A
• Blending
• Hole filling
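The four steps above can be sketched as follows, assuming grayscale images stored as 2D NumPy arrays and flow fields given as (dx, dy) array pairs. The nearest-pixel splatting and the left-neighbour hole filling are simplifications chosen for brevity, not necessarily what MikeTalk does, and all function names are hypothetical.

```python
import numpy as np

def forward_warp(img, dx, dy, t):
    """Push every pixel a fraction t of the way along its flow vector."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    filled = np.zeros((h, w), dtype=bool)
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + t * dx).astype(int)   # nearest target column
    ty = np.rint(ys + t * dy).astype(int)   # nearest target row
    ok = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    # If several source pixels land on one target, the last write wins.
    out[ty[ok], tx[ok]] = img[ys[ok], xs[ok]]
    filled[ty[ok], tx[ok]] = True
    return out, filled

def fill_holes(img, filled):
    """Copy the nearest filled pixel to the left into each hole."""
    out = img.copy()
    h, w = out.shape
    for y in range(h):
        last = None
        for x in range(w):
            if filled[y, x]:
                last = out[y, x]
            elif last is not None:
                out[y, x] = last   # holes at the row start stay at 0
    return out

def morph(img_a, img_b, flow_ab, flow_ba, t):
    """Blend A warped forward by t with B warped forward by (1 - t)."""
    warped_a, filled_a = forward_warp(img_a, flow_ab[0], flow_ab[1], t)
    warped_b, filled_b = forward_warp(img_b, flow_ba[0], flow_ba[1], 1.0 - t)
    warped_a = fill_holes(warped_a, filled_a)
    warped_b = fill_holes(warped_b, filled_b)
    return (1.0 - t) * warped_a + t * warped_b
```

With t = 0 the result is image A, with t = 1 it is image B, and intermediate values move the shape and cross-dissolve the texture at the same time.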
Synthesis Database
• 16 Visemes total
• 256 optical flow fields total (16 × 16), from every viseme to every other viseme
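The counts on this slide can be checked with a few lines. The integer viseme IDs are placeholders. That the database also stores a flow from each viseme to itself is an inference from the figure 256 = 16 × 16, not something the slides state.

```python
from itertools import product

CONSONANT_VISEMES = 6
VOWEL_VISEMES = 9
SILENCE_VISEMES = 1

# Placeholder IDs 0..15 stand in for the real viseme images.
viseme_ids = list(range(CONSONANT_VISEMES + VOWEL_VISEMES + SILENCE_VISEMES))

# Ordered pairs, because the flow A -> B is not the flow B -> A.
flow_keys = list(product(viseme_ids, repeat=2))

# Excluding the viseme-to-itself entries leaves the strictly distinct pairs.
distinct_pairs = [(a, b) for a, b in flow_keys if a != b]
```

Here `len(flow_keys)` matches the 256 quoted on the slide, while `distinct_pairs` holds the 240 transitions between different visemes.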
Concatenation and Lip Sync
• Load the correct viseme transitions
• Concatenate viseme transitions
• Sample the viseme transitions using audio durations
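One possible reading of these three steps is sketched below: each viseme arrives with a duration taken from the audio, consecutive visemes define a transition, and every video frame picks a transition plus a morph position in [0, 1). The rule that a transition lasts as long as its source viseme, the frame rate, and the function name are all assumptions made for this example. Durations are assumed to be positive.

```python
def lip_sync_frames(timed_visemes, fps=30.0):
    """Return one (src, dst, alpha) triple per video frame.

    timed_visemes: list of (viseme_id, duration_in_seconds) pairs.
    Assumption: the transition from viseme k to viseme k+1 spans the
    duration of viseme k; the last viseme only serves as an endpoint.
    """
    # Start time of each transition on the audio timeline.
    starts = []
    clock = 0.0
    for _, duration in timed_visemes[:-1]:
        starts.append(clock)
        clock += duration

    frames = []
    k = 0
    for f in range(int(round(clock * fps))):
        time = f / fps
        # Advance to the transition that contains this frame time.
        while k + 1 < len(starts) and time >= starts[k + 1]:
            k += 1
        src, duration = timed_visemes[k]
        dst = timed_visemes[k + 1][0]
        alpha = (time - starts[k]) / duration   # morph position in [0, 1)
        frames.append((src, dst, alpha))
    return frames

# Hypothetical viseme labels and durations for a short utterance.
frames = lip_sync_frames([("sil", 0.25), ("k", 0.5), ("ae", 0.25)], fps=4.0)
```

Each triple can then be rendered by morphing between the `src` and `dst` viseme images at position `alpha`, which keeps the video locked to the audio timeline.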
Examples
“1, 2, 3, 4, 5”
“cat, dog, pig, cow, moose, horse, sheep”
“you have received 10 email messages.”
Current Work
• Coarticulation
• Eye + head movements
• Emotion
• 3D instead of 2D
• Psychophysics
3D (with Volker Blanz)
The End
Coarticulation
• Problem: Current method does not handle coarticulation, so speech looks overly articulated
• Can record all possible triphones/quadriphones, but this approach requires a lot of data!
• Best method is to learn a model for coarticulation, but what is the representation for the lips?
Principal Components Analysis
• Each image is a vector in a high-dimensional space
• Using PCA, find the optimal set of vectors that span the space
• Project the entire corpus onto those basis vectors
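The three bullets above can be sketched with an SVD, assuming the corpus is a NumPy array with one flattened image per row. The function name `pca_basis` and the random stand-in images are hypothetical; the slides do not say how the PCA was computed.

```python
import numpy as np

def pca_basis(data, k):
    """Return the mean, the top-k basis vectors, and the projections.

    data: array of shape (n_images, n_pixels), one flattened image per row.
    """
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are orthonormal directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    coeffs = centered @ basis.T    # project the corpus onto the basis
    return mean, basis, coeffs

# Stand-in corpus: 6 random 4x4 "images" flattened to 16-vectors.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(6, 4, 4)).reshape(6, -1)
mean, basis, coeffs = pca_basis(corpus, k=2)
```

Each image is then summarized by its row of `coeffs`, and `mean + coeffs @ basis` gives the best rank-k approximation of the corpus in the least-squares sense.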
Top 2 PCA Bases for /buut/
Top 2 PCA Bases for /get/
Problem: Too nonlinear!
Flow Component Analysis
• Compute optical flow from a reference lip image to all other images in the corpus
• Compute PCA on all the flows
Top 2 FPCA Bases for /buut/
Top 2 FPCA Bases for /get/
Much more linear behavior!
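A minimal sketch of the flow-based variant, assuming the flows from the reference image have already been computed and stacked into an array of shape (n, h, w, 2). The names `flow_pca` and `synthesize_flow` are hypothetical. The second function shows how a pair of coefficients maps back to a full flow field, which could then be used to warp the reference lip image.

```python
import numpy as np

def flow_pca(flows, k):
    """PCA over flow fields instead of pixel intensities.

    flows: array of shape (n, h, w, 2) holding (dx, dy) for every pixel,
    each measured from the same reference image.
    """
    data = flows.reshape(flows.shape[0], -1)
    mean = data.mean(axis=0)
    centered = data - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    coeffs = centered @ basis.T
    return mean, basis, coeffs

def synthesize_flow(mean, basis, coeffs, shape):
    """Rebuild a flow field of the given (h, w, 2) shape from k coefficients."""
    return (mean + coeffs @ basis).reshape(shape)
```

Because every point in coefficient space maps to a warp of the reference image rather than to a pixel-wise blend, moving along a basis direction deforms the mouth instead of cross-fading it.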
Current Work
• Now that we have parameterized the mouth, what is the model for mouth synthesis?
• How is that model fit to the PCA data?