Adaptation of orofacial clones to the morphology and control strategies of target speakers for speech articulation
Julián Andrés VALDÉS VARGAS
Jury: Michel DESVIGNES (President), Yves LAPRIE (Reviewer), Rudolph SOCK (Reviewer), Thierry LEGOU (Examiner), Pierre BADIN (Thesis Director)
gipsa-lab
• Context of visual articulatory feedback
• Articulatory data
• Individual models and characterisation
• Multi-speaker models
• Conclusions and perspectives
2
Summary
Context
• Mastery of articulators for speech production
• Skill maintained/improved by the perception-action loop (Matthies et al., 1996)
• Feedback in speech:
  – Auditory
  – Proprioceptive
4
Vision of articulators
• Augmented speech: visual feedback
  – Display of articulators
• Vision of lips and face
  – Improves speech intelligibility (Sumby and Pollack, 1954)
  – Speech imitation is faster (Fowler et al., 2003)
• Vision of hidden articulations
  – Increases intelligibility (Badin et al., 2010)
5
Visual articulatory feedback system
• System of visual articulatory feedback (Ben Youssef et al., 2011)
• Applications:
  – Speech rehabilitation
  – Computer Aided Pronunciation Training (CAPT)
6
[Diagram: speech sound signal of a given speaker → visual articulatory feedback system → clone's animation]
Problem of articulatory adaptation
• Animation of the clone is based on a single speaker
• Adaptation to several speakers
7
[Diagram: speech sound of speakers 1, 2, …, n → visual articulatory feedback system → animation based on the reference speaker; mismatch between the reference speaker and the other speakers]
… (vowel-consonant-vowel)
• 63 articulations in total
12
Recording methods
• Several recording methods considered:
  • X-ray (Meyer, 1907; Mosher, 1927)
    – Difficult to accurately identify the contours
  • Electro-Magnetic Articulography (EMA)
    – No recording of the whole vocal tract
  • Magnetic Resonance Imaging (MRI) (Rokkaku et al., 1986)
    – Tomographic (imaging by sections)
    – Sustained vocal tract positions
    – Speakers in supine position; the gravitational effect is moderate (Engwall, 2003; 2006)
13
Decision to use MRI
• Whole vocal tract information (unlike EMA)
• Contours easier to identify than with X-ray
• No health hazard, unlike X-ray
• Recording parameters:
  • Midsagittal image of the vocal tract
  • Slice thickness: 4 mm
  • Spatial resolution: 1 mm/pixel
  • Acquisition time: 8-16 seconds
14
MRI recording
• The speaker goes through several stages:
  • Speakers lie in supine position
  • The bed is shifted into the MRI machine
  • Setting up of alignment and recording properties
  • Sustained pronunciation of the articulations for 8-16 seconds
  • Speakers are asked not to move their heads
15
Processing of MRI
• Midsagittal contours manually edited
• Rigid contours are drawn once for a given speaker
  • Positioning of the palate using skull bones as reference (rotation and translation)
  • Positioning of the jaw by means of roto-translations
• Editing of deformable contours: lips, tongue, velum, etc.
• Palates of all articulations are aligned, avoiding the noise introduced by head movement
16
Present study
Corpus: 63 articulations (vowels and consonants)
Speakers: 11 speakers
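The rigid repositioning step above (rotation plus translation of the palate using skull bones as reference) can be sketched with the Kabsch algorithm; the landmark coordinates below are made up for illustration only:

```python
import numpy as np

def rigid_align(src, ref):
    """Best rotation + translation mapping src points onto ref (Kabsch algorithm).

    src, ref: (n, 2) arrays of corresponding 2-D landmarks, e.g. skull-bone
    reference points used to reposition the palate contour in a new frame.
    """
    src_c, ref_c = src.mean(axis=0), ref.mean(axis=0)
    h = (src - src_c).T @ (ref - ref_c)         # cross-covariance of deviations
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))      # guard against reflections
    rot = vt.T @ np.diag([1.0, d]) @ u.T
    trans = ref_c - rot @ src_c
    return rot, trans

# Usage: hypothetical landmarks from one MRI frame aligned onto the reference frame.
ref = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
theta = 0.3
r = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
src = ref @ r.T + np.array([2.0, -1.0])         # rotated + translated copy
rot, trans = rigid_align(src, ref)
aligned = src @ rot.T + trans                   # recovers the reference pose
```

With exact point correspondences, as here, the recovered pose maps the source landmarks back onto the reference exactly; with noisy landmarks the same code gives the least-squares rigid fit.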
Multi-speaker Tongue models
39
Multi-speaker models: lips and velum
• Lips and velum models are comparable with the tongue models
• Lips:
  – individual models: 33 components (3 × 11)
  – multi-speaker joint PCA model: equivalent with 21 components
  – reduced number of components: 3 interpretable components (JH, protrusion, lip height)
• Velum:
  – individual models: 22 components (2 × 11)
  – multi-speaker joint PCA model: equivalent with 14 components
  – reduced number of components: 2 components (oblique, horizontal)
40
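As a rough illustration of the component bookkeeping above (separate PCA models per speaker versus one joint PCA over all speakers' stacked data), here is a sketch on synthetic data; the shapes, the 95 % threshold and the data are hypothetical and will not reproduce the exact counts from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def n_components_for(data, threshold=0.95):
    """Components needed to reach `threshold` cumulative variance (PCA via SVD)."""
    centred = data - data.mean(axis=0)
    s = np.linalg.svd(centred, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, threshold)) + 1

# Hypothetical lip data: 11 speakers x 63 articulations x 40 contour coordinates.
speakers = [rng.normal(size=(63, 40)) for _ in range(11)]

# Individual models: one PCA per speaker; the component counts add up
# (e.g. 3 components x 11 speakers = 33 in the presentation).
individual_total = sum(n_components_for(d) for d in speakers)

# Multi-speaker joint model: a single PCA over the stacked data,
# so all speakers share one set of common components.
joint_total = n_components_for(np.vstack(speakers))
```

The point of the comparison is that the joint model can describe all speakers with fewer total components than the sum of the individual models, at an equivalent reconstruction quality.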
Conclusions
• Data: a unique set of articulatory data for French
  – MRI of the whole vocal tract for 11 French speakers
  – Contours; vowels and consonants
  – More speakers than in the literature
• Characterisation of different speakers' strategies: tongue, upper and lower lip, velum
• Multi-speaker models (normalisation) of tongue, lip and velum contours
  – No work in the literature on lips and velum
42
Perspectives
43
• More speakers
• Relation between articulatory strategies and acoustics
• Cross-speaker velum variability
  – Influence of the tongue movement; nasality
• New modelling solutions: non-linear methods
  – Kernel PCA, Artificial Neural Networks (ANN), Support Vector Machines (SVM)
Acknowledgments: Laurent Lamalle (IRMaGe, Grenoble), the speakers, the ARTIS project (GIPSA-lab, LORIA)
43
Thank you for your attention
Questions?
44
Grid system
• Maeda (1979): fixed grid
• Busset (2013): adaptive grid system
  – Euclidean coordinates (intersections)
  – Distances and extreme angles
  – Polar coordinates (distances and angles for each grid line)
• Beautemps et al. (2001): grid adapted to each articulation
  – Euclidean coordinates
  – Distances and TngAdv + TngBot
46
Corr(Y-jaw, Angle_rotation) per speaker:
PB = 0.6611   YL = 0.7385   LH = 0.7174   RL = 0.3946
LD = 0.8423   BR = 0.7764   HL = 0.7913   AA = 0.4952
MG = 0.4151   AK = 0.8317   MGO = 0.9228
47
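A minimal sketch of how one entry of this table can be computed (the Pearson correlation between jaw vertical position and jaw rotation angle over the articulations); the measurements below are synthetic stand-ins for one speaker's data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical measurements over 63 articulations for one speaker:
# jaw rotation angle (degrees) and jaw vertical position (cm),
# correlated by construction to mimic the table above.
angle = rng.normal(0.0, 3.0, size=63)
y_jaw = 0.05 * angle + rng.normal(0.0, 0.1, size=63)

# Pearson correlation coefficient, as in Corr(Y-jaw, Angle_rotation).
corr = np.corrcoef(y_jaw, angle)[0, 1]
```

A strong positive correlation (as for speakers MGO or LD above) means the jaw's vertical displacement is largely explained by its rotation for that speaker.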
Acoustic simulation
• Grid system
• Midsagittal function → vocal tract area function (series of areas and lengths of each sagittal section)
  – α, β models (Beautemps et al., 1995; Heinz & Stevens, 1965): A = α d^β, where A is the area of a given grid section and d the midsagittal distance
  – α, β coefficients depend on the subject and the vocal tract location
  – α, β set according to the reference speaker, PB
• Vocal tract acoustic transfer function (Fant, 1960; Badin & Fant, 1984) → formants
48
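A minimal sketch of the α, β midsagittal-to-area conversion A = α d^β; the distances and coefficient values below are illustrative only (the actual α, β are fitted per speaker and per vocal tract region):

```python
import numpy as np

def area_function(d, alpha, beta):
    """Alpha-beta model (Heinz & Stevens, 1965): cross-sectional area
    A = alpha * d**beta from the midsagittal distance d."""
    return alpha * np.power(d, beta)

# Illustrative values only: midsagittal distances (cm) along five grid
# sections, with per-section alpha, beta coefficients (hypothetical).
d = np.array([0.5, 1.2, 2.0, 1.4, 0.8])        # cm
alpha = np.array([1.8, 1.6, 1.5, 1.6, 1.9])
beta = np.array([1.4, 1.5, 1.5, 1.5, 1.3])

areas = area_function(d, alpha, beta)          # cm^2, one area per section
```

The resulting series of areas (with the section lengths) is the area function fed to the acoustic transfer-function computation to obtain formants.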
49
No. of coefficients by method
[Plot: number of coefficients vs. number of components (1-20) for the PCA, Joint PCA, PARAFAC and TUCKER decompositions]
"Essentially, all models are wrong, but some are useful"
George Edward Pelham Box
50
Multi-speaker decomposition methods
• Joint PCA (two-way analysis adapted to multi-speaker models) (Ananthakrishnan et al., 2010, KTH, Sweden)
  – All speakers' articulatory measurements for one phoneme are considered as one set of data
  – Forces common components
51
52
Generalisation
Articulatory data
• Estimation of non-visible landmarks (tongue tip and jaw attachment)
• Computed as the average position over the articulations in which the landmark is distinguishable
53
[Figure: articulations with a non-distinguishable tongue tip or jaw attachment]
State of the art on articulatory normalisation
• Articulatory normalisation based on linear decomposition methods
  • PARAFAC tongue models, 2 components extracted
  • Data: 7-15 vowels, 3-9 speakers
  • Performance: 71%-96% of variance explained
• Geometric normalisation
  • Scaling transformations do not normalise the articulatory control strategies employed by different speakers
• Challenge
  • Modelling of other contours such as lips and velum
  • Extension to consonants
54
Linear regression between pairs of speakers
[Plot: percentage of variance explained vs. number of components (1-20) for the prediction of speaker PB: PCA model of PB, and predictions from YL, LH, RL, LD, BR, HL, AA, MG, AK and MGO]
[Plot: RMS error (cm) vs. number of components (1-20) for the prediction of speaker PB]
• Prediction of the PCA control parameters of a target speaker (π_TS) from the PCA control parameters of a source speaker (π_SS) by multi-linear regression:
  π_TS = Σ_{i=1}^{n_cmp} β_i π_SS,i
  evaluated in terms of variance explained (VarEx) and RMSE
• Overfitting from the 10th component on (LOOCV)
• At 10 components: 64.32 % variance explained, 0.37 cm RMSE
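The cross-speaker prediction above can be sketched as a multi-linear regression with leave-one-out cross-validation (LOOCV) to detect overfitting; the parameter matrices below are synthetic (shapes chosen to mimic 63 articulations and 10 components), so the numbers will not match the reported 64.32 % / 0.37 cm:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical PCA control parameters: 63 articulations x 10 components,
# for a source speaker (pi_ss) and a target speaker (pi_ts), related by a
# linear map plus noise.
n_artic, n_cmp = 63, 10
pi_ss = rng.normal(size=(n_artic, n_cmp))
true_map = rng.normal(size=(n_cmp, n_cmp))
pi_ts = pi_ss @ true_map + 0.1 * rng.normal(size=(n_artic, n_cmp))

def loocv_rmse(x, y):
    """Leave-one-out RMSE of a least-squares multi-linear regression y ~ x."""
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i          # hold out articulation i
        w, *_ = np.linalg.lstsq(x[keep], y[keep], rcond=None)
        errs.append(np.mean((x[i] @ w - y[i]) ** 2))
    return float(np.sqrt(np.mean(errs)))

rmse = loocv_rmse(pi_ss, pi_ts)                # close to the 0.1 noise level here
```

Repeating this while truncating the number of components shows the overfitting pattern reported above: held-out error stops improving once too many components are used.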