Video Rewrite Driving Visual Speech with Audio Christoph Bregler Michele Covell Malcolm Slaney Presenter : Jack jeryes 3/3/2008.

Video Rewrite: Driving Visual Speech with Audio

Christoph Bregler

Michele Covell

Malcolm Slaney

Presenter: Jack jeryes

3/3/2008

What is Video Rewrite?

use existing footage to create new video of a person mouthing words that he did not speak in the original footage

Example:

Why Video Rewrite?

Movie dubbing: sync the actors’ lip motions to the new soundtrack

Teleconferencing

Special effects

Approach

Learn from example footage how a person’s face changes during speech (dynamics and idiosyncrasies)

Stages

Video Rewrite has two stages:

Analysis stage

Synthesis stage

Analysis stage:

Use the audio track to segment the video into triphones. Vision techniques find the head orientation, the mouth and chin shape, and their position in each image.

Synthesis stage:

Segment the new audio and use it to select triphones from the video model. Based on labels from the analysis stage, the new mouth images are morphed into a new background face.

Analysis for video modeling

The analysis stage creates an annotated database of example video clips (the video model), derived from unconstrained footage.

-Annotation Using Image Analysis

-Annotation Using Audio Analysis

Annotation Using Image Analysis

As the face moves within the frame, we need to know at all times:

-mouth position

-lip shapes

This is done using eigenpoints (good for low resolution)

Eigenpoints:

A small set of hand-labeled facial images is used to train subspace models. Given a new image, the eigenpoint models tell us the positions of points on the lips and jaw.

Eigenpoints (cont.)

54 eigenpoints for each image :

34 on the mouth

20 on the chin and jaw line.

Only 26 images were hand-labeled: 26 / 14,218, about 0.2%

The hand-annotated dataset was extended by morphing pairs of images to form intermediate images

Eigenpoints (cont.)

Eigenpoints alone does not handle a wide variety of head motions. Thus, each face image is warped into a standard reference plane prior to eigenpoints labeling.

An affine transform is used to minimize the mean-squared error between a large portion of the face image and a facial template.

Mask to estimate global warp

Each image is warped to account for changes in the head’s position, size, and rotation. The transform minimizes the difference between the transformed images and the face template. The mask (left) forces the minimization to consider only the upper face (right).

Global mapping

Once the best global mapping is found, it is inverted and applied to the image, putting that face into the standard coordinate frame.

We then perform eigenpoints analysis on this pre-warped image to find the fiduciary points.

Finally, we back-project the fiduciary points through the global warp to place them on the original face image.
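The warp and back-projection steps can be illustrated with a small sketch. It assumes the global mapping is a 2D affine transform stored as two rows `((a, b, tx), (c, d, ty))` that maps standard-frame coordinates to original-image coordinates; the function names and the example numbers are hypothetical, and the image-based fitting of the transform against the masked template is not shown.

```python
def apply_affine(transform, point):
    """Map one (x, y) point through an affine transform."""
    (a, b, tx), (c, d, ty) = transform
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def invert_affine(transform):
    """Return the inverse affine transform, in the same two-row layout."""
    (a, b, tx), (c, d, ty) = transform
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine transform is not invertible")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return ((ia, ib, -(ia * tx + ib * ty)), (ic, id_, -(ic * tx + id_ * ty)))


# Hypothetical global warp: scale by 2, then shift by (10, 20).
global_warp = ((2.0, 0.0, 10.0), (0.0, 2.0, 20.0))

# The inverse warp takes original-image coordinates into the standard frame.
to_standard = invert_affine(global_warp)

# A point found by eigenpoints in the standard frame (made-up value)...
point_in_standard_frame = (3.0, 4.0)

# ...is back-projected through the global warp onto the original image.
point_on_original = apply_affine(global_warp, point_in_standard_frame)
print(point_on_original)
```

Applying `to_standard` to `point_on_original` returns the standard-frame point again, which is a quick way to check that the two transforms are consistent.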

Annotation Using Audio Analysis

All the speech is segmented into sequences of phonemes

The /T/ in “beet” looks different from the /T/ in “boot.”

Consider coarticulation


Use triphones: collections of three sequential phonemes

“teapot” is split into:

/SIL-T-IY/ /T-IY-P/ /IY-P-AA/ /P-AA-T/ and /AA-T-SIL/
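The “teapot” split can be reproduced with a short sketch. It assumes the word has already been transcribed into a phoneme list and that silence (SIL) pads both ends; the function name `split_into_triphones` is hypothetical.

```python
def split_into_triphones(phonemes, silence="SIL"):
    """Return every run of three sequential phonemes, padding with silence."""
    padded = [silence, *phonemes, silence]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


teapot = ["T", "IY", "P", "AA", "T"]
for triphone in split_into_triphones(teapot):
    print("/" + "-".join(triphone) + "/")
```

Each phoneme appears as the middle of exactly one triphone, and neighboring triphones share two phonemes.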


When synthesizing a video:

-Emphasize the middle of each triphone.

-Cross-fade the overlapping regions of neighboring triphones
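A minimal sketch of a cross-fade over an overlapping region, assuming the two neighboring triphones contribute the same number of frames to the overlap and that each frame is described by a tuple of numbers (for example, lip-shape parameters). The linear ramp and the name `cross_fade` are assumptions, not necessarily the blending used in Video Rewrite.

```python
def cross_fade(outgoing, incoming):
    """Blend two equally long frame sequences with a linear ramp.

    The first blended frame equals the outgoing triphone's frame and the
    last equals the incoming triphone's frame.
    """
    if len(outgoing) != len(incoming) or not outgoing:
        raise ValueError("overlap regions must be non-empty and equally long")
    n = len(outgoing)
    blended = []
    for i, (a, b) in enumerate(zip(outgoing, incoming)):
        # Weight of the incoming triphone rises from 0 to 1 across the overlap.
        w = i / (n - 1) if n > 1 else 0.5
        blended.append(tuple((1 - w) * x + w * y for x, y in zip(a, b)))
    return blended


overlap_out = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
overlap_in = [(2.0, 4.0), (2.0, 4.0), (2.0, 4.0)]
print(cross_fade(overlap_out, overlap_in))
```

Because the weights favor each triphone near its own middle, the shared outer phonemes are where the blending happens.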

Synthesis using a video model

Segment the new audio and use it to select triphones from the video model. Based on labels from the analysis stage, the new mouth images are morphed into a new background face.

Synthesis using a video model

The background, head tilts, and eye blinks are taken from the source footage in the same order as they were shot.

The triphone images include the mouth, chin, and part of the cheeks.

Illumination-matching techniques are used to avoid visible seams.

Selection of Triphone Videos

Choose a sequence of clips that approximates the desired transitions and shape continuity

Selection of Triphone Videos

Given a triphone in the new speech utterance, we compute a matching distance to each triphone in the video database

Dp = phoneme-context distance

Ds = lip-shape distance

error = α Dp + (1 − α) Ds

where α is a constant that trades off the two distances
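A minimal sketch of combining the two distances into one matching error, assuming a weighted sum error = alpha * Dp + (1 - alpha) * Ds with a trade-off constant alpha between 0 and 1. The default alpha, the candidate clip names, and the distance values are made up for illustration. Picking each clip independently, as done here, is a simplification, since Ds depends on the neighboring clips.

```python
def total_error(dp, ds, alpha=0.5):
    """Weighted combination of phoneme-context and lip-shape distances."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * dp + (1.0 - alpha) * ds


# Hypothetical candidates from the video database: (label, Dp, Ds).
candidates = [
    ("clip_a", 0.0, 3.0),
    ("clip_b", 0.5, 1.0),
    ("clip_c", 1.0, 0.5),
]

best = min(candidates, key=lambda c: total_error(c[1], c[2], alpha=0.5))
print(best[0])
```

A larger alpha favors clips whose phoneme context matches; a smaller alpha favors clips whose lip shapes line up with their neighbors.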


Dp is based on categorical distances between phoneme categories and between viseme classes

Dp = weighted sum (viseme-distance, phonemic-distance)

26 viseme classes :

1- /CH/ /JH/ /SH/ /ZH/

2- /K/ /G/ /N/ /L/ /T/ /D/

3- /P/ /B/ /M/

..


-Phonemic-distance ( /P/ , /P/ ) = 0: same phonemic category

-Viseme-distance ( /P/ , /IY/ ) = 1: different viseme classes

-Dp ( /P/ , /B/ ) = between 0 and 1: same viseme class, different phonemic category
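The three cases above can be sketched for a single pair of phonemes. The sketch uses only the three viseme classes listed on the slide, so any phoneme outside that partial table is treated as being in its own class, which is an assumption. The equal weights are hypothetical, and applying the distance across a whole triphone is not shown.

```python
# Partial viseme table: only the three classes listed on the slide.
VISEME_CLASSES = [
    {"CH", "JH", "SH", "ZH"},
    {"K", "G", "N", "L", "T", "D"},
    {"P", "B", "M"},
]


def phonemic_distance(a, b):
    """0 for the same phonemic category, 1 otherwise."""
    return 0.0 if a == b else 1.0


def viseme_distance(a, b):
    """0 for the same viseme class, 1 for different classes."""
    if a == b:
        return 0.0
    for members in VISEME_CLASSES:
        if a in members and b in members:
            return 0.0
    # Assumption: phonemes missing from the partial table share no class.
    return 1.0


def phoneme_context_distance(a, b, viseme_weight=0.5, phonemic_weight=0.5):
    """Weighted sum of the two categorical distances (weights assumed)."""
    return (viseme_weight * viseme_distance(a, b)
            + phonemic_weight * phonemic_distance(a, b))


print(phoneme_context_distance("P", "P"))
print(phoneme_context_distance("P", "B"))
print(phoneme_context_distance("P", "IY"))
```

With weights that sum to 1, identical phonemes score 0, phonemes in different viseme classes score 1, and phonemes that differ but share a viseme class fall in between.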


Ds measures how closely the mouth contours match in overlapping segments of adjacent triphone videos

In “teapot”: the contours for /IY/ and /P/ in /T-IY-P/ should match the contours for /IY/ and /P/ in /IY-P-AA/


Euclidean distance, frame by frame, between 4-element feature vectors

(overall lip width, overall lip height, inner lip height, height of visible teeth)
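A minimal sketch of the frame-by-frame lip-shape distance, assuming each frame in the overlap is summarized by the 4-element vector (overall lip width, overall lip height, inner lip height, height of visible teeth). Averaging the per-frame distances is an assumption, since the slide does not say how the frames are aggregated, and the feature values are made up.

```python
import math


def lip_shape_distance(frames_a, frames_b):
    """Mean Euclidean distance between corresponding 4-element vectors."""
    if len(frames_a) != len(frames_b) or not frames_a:
        raise ValueError("overlaps must be non-empty and equally long")
    per_frame = [math.dist(a, b) for a, b in zip(frames_a, frames_b)]
    return sum(per_frame) / len(per_frame)


# Overlap of /T-IY-P/ and /IY-P-AA/: two frames each (hypothetical values).
overlap_first = [(4.0, 2.0, 1.0, 0.0), (4.0, 2.0, 1.0, 0.0)]
overlap_second = [(4.0, 2.0, 1.0, 0.0), (7.0, 6.0, 1.0, 0.0)]
print(lip_shape_distance(overlap_first, overlap_second))
```

A distance of 0 means the mouth contours agree in every overlapping frame, so the two clips can be joined without a visible jump in lip shape.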

Stitching all Together

The remaining task is to stitch the triphone videos into the background sequence

