Adaptation of orofacial clones to the morphology and control strategies of target speakers for speech articulation
Julián Andrés VALDÉS VARGAS
Jury: Michel DESVIGNES (President), Yves LAPRIE (Reviewer), Rudolph SOCK (Reviewer), Thierry LEGOU (Examiner), Pierre BADIN (Thesis Director)
gipsa-lab
• Context of visual articulatory feedback
• Articulatory data
• Individual models and characterisation
• Multi-speaker models
• Conclusions and perspectives
2
Summary
Context
• Mastery of articulators for speech production
• Skill maintained/improved by the perception-action loop (Matthies et al., 1996)
• Feedback in speech:
  – Auditory
  – Proprioceptive
4
Vision of articulators
• Augmented speech: visual feedback
  – Display of articulators
• Vision of lips and face
  – Improves speech intelligibility (Sumby and Pollack, 1954)
  – Speech imitation is faster (Fowler et al., 2003)
• Vision of hidden articulations
  – Increases intelligibility (Badin et al., 2010)
5
Visual articulatory feedback system
• System of visual articulatory feedback (Ben Youssef et al., 2011)
• Applications:
  – Speech rehabilitation
  – Computer Aided Pronunciation Training (CAPT)
6
[Diagram: speech sound signal of a given speaker → visual articulatory feedback system → clone's animation]
Problem of articulatory adaptation
• Animation of the clone is based on a single speaker
• Adaptation to several speakers
7
[Diagram: speech sound of speakers 1, 2, …, n → visual articulatory feedback system → animation based on the reference speaker; mismatch between the reference speaker and the other speakers]
… (vowel-consonant-vowel)
• 63 articulations in total
12
Recording methods
• Several recording methods considered:
  • X-ray (Meyer, 1907; Mosher, 1927)
    – Difficult to accurately identify the contours
  • Electro-Magnetic Articulography (EMA)
    – No recording of the whole vocal tract
  • Magnetic Resonance Imaging (MRI) (Rokkaku et al., 1986)
    – Tomographic (imaging by sections)
    – Sustained vocal tract positions
    – Speakers in supine position; the gravitational effect is moderate (Engwall, 2003; 2006)
13
Decision to use MRI
• Whole vocal tract information (unlike EMA)
• Contours easier to identify than with X-ray
• No health hazard, unlike X-ray
• Recording parameters:
  • Midsagittal image of the vocal tract
  • Slice thickness: 4 mm
  • Spatial resolution: 1 mm/pixel
  • Acquisition time: 8-16 seconds
14
MRI recording
• The speaker goes through several stages:
  • Speakers lie in supine position
  • The bed is shifted into the MRI machine
  • Setting up of alignment and recording properties
  • Sustained pronunciation of the articulations for 8-16 seconds
  • Speakers are asked not to move their heads
15
Processing of MRI
• Midsagittal contours manually edited
• Rigid contours are drawn once for a given speaker
  • Positioning of the palate using skull bones as reference (rotation and translation)
  • Positioning of the jaw by means of roto-translations
• Editing of deformable contours: lips, tongue, velum, etc.
• Palates of all articulations are aligned, avoiding the noise introduced by head movement
16
Present study
Corpus: 63 articulations (vowels and consonants)
Speakers: 11 speakers
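The rigid repositioning step above (rotation plus translation of the palate using skull bones as reference) can be sketched with the Kabsch algorithm; the landmark coordinates below are made up for illustration only:

```python
import numpy as np

def rigid_align(src, ref):
    """Best rotation + translation mapping src points onto ref (Kabsch algorithm).

    src, ref: (n, 2) arrays of corresponding 2-D landmarks, e.g. skull-bone
    reference points used to reposition the palate contour in a new frame.
    """
    src_c, ref_c = src.mean(axis=0), ref.mean(axis=0)
    h = (src - src_c).T @ (ref - ref_c)         # cross-covariance of deviations
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))      # guard against reflections
    rot = vt.T @ np.diag([1.0, d]) @ u.T
    trans = ref_c - rot @ src_c
    return rot, trans

# Usage: hypothetical landmarks from one MRI frame aligned onto the reference frame.
ref = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
theta = 0.3
r = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
src = ref @ r.T + np.array([2.0, -1.0])         # rotated + translated copy
rot, trans = rigid_align(src, ref)
aligned = src @ rot.T + trans                   # recovers the reference pose
```

With exact point correspondences, as here, the recovered pose maps the source landmarks back onto the reference exactly; with noisy landmarks the same code gives the least-squares rigid fit.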
Multi-speaker Tongue models
39
Multi-speaker models: lips and velum
• Lips and velum models are comparable with the tongue models
• Lips:
  – individual models: 33 components (3 × 11)
  – multi-speaker joint PCA model: equivalent with 21 components
  – reduced number of components: 3 interpretable components (JH, protrusion, lip height)
• Velum:
  – individual models: 22 components (2 × 11)
  – multi-speaker joint PCA model: equivalent with 14 components
  – reduced number of components: 2 components (oblique, horizontal)
40
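As a rough illustration of the component bookkeeping above (separate PCA models per speaker versus one joint PCA over all speakers' stacked data), here is a sketch on synthetic data; the shapes, the 95 % threshold and the data are hypothetical and will not reproduce the exact counts from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def n_components_for(data, threshold=0.95):
    """Components needed to reach `threshold` cumulative variance (PCA via SVD)."""
    centred = data - data.mean(axis=0)
    s = np.linalg.svd(centred, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, threshold)) + 1

# Hypothetical lip data: 11 speakers x 63 articulations x 40 contour coordinates.
speakers = [rng.normal(size=(63, 40)) for _ in range(11)]

# Individual models: one PCA per speaker; the component counts add up
# (e.g. 3 components x 11 speakers = 33 in the presentation).
individual_total = sum(n_components_for(d) for d in speakers)

# Multi-speaker joint model: a single PCA over the stacked data,
# so all speakers share one set of common components.
joint_total = n_components_for(np.vstack(speakers))
```

The point of the comparison is that the joint model can describe all speakers with fewer total components than the sum of the individual models, at an equivalent reconstruction quality.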
Conclusions
• Data: a unique set of articulatory data for French
  – MRI of the whole vocal tract for 11 French speakers
  – Contours; vowels and consonants
  – More speakers than in the literature
• Characterisation of different speakers' strategies: tongue, upper and lower lip, velum
• Multi-speaker models (normalisation) of tongue, lip and velum contours
  – No work in the literature on lips and velum
42
Perspectives
43
• More speakers
• Relation between articulatory strategies and acoustics
• Cross-speaker velum variability
  – Influence of the tongue movement; nasality
• New modelling solutions: non-linear methods
  – Kernel PCA, Artificial Neural Networks (ANN), Support Vector Machines (SVM)
Acknowledgments: Laurent Lamalle (IRMaGe, Grenoble), the speakers, the ARTIS project (GIPSA-lab, LORIA)
43
Thank you for your attention
Questions?
44
Grid system
• Maeda (1979): fixed grid
• Busset (2013): adaptive grid system
  – Euclidean coordinates (intersections)
  – Distances and extreme angles
  – Polar coordinates (distances and angles for each grid line)
• Beautemps et al. (2001): grid adapted to each articulation
  – Euclidean coordinates
  – Distances and TngAdv + TngBot
46
Corr(Y-jaw, Angle_rotation) per speaker:
PB = 0.6611   YL = 0.7385   LH = 0.7174   RL = 0.3946
LD = 0.8423   BR = 0.7764   HL = 0.7913   AA = 0.4952
MG = 0.4151   AK = 0.8317   MGO = 0.9228
47
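A minimal sketch of how one entry of this table can be computed (the Pearson correlation between jaw vertical position and jaw rotation angle over the articulations); the measurements below are synthetic stand-ins for one speaker's data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical measurements over 63 articulations for one speaker:
# jaw rotation angle (degrees) and jaw vertical position (cm),
# correlated by construction to mimic the table above.
angle = rng.normal(0.0, 3.0, size=63)
y_jaw = 0.05 * angle + rng.normal(0.0, 0.1, size=63)

# Pearson correlation coefficient, as in Corr(Y-jaw, Angle_rotation).
corr = np.corrcoef(y_jaw, angle)[0, 1]
```

A strong positive correlation (as for speakers MGO or LD above) means the jaw's vertical displacement is largely explained by its rotation for that speaker.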
Acoustic simulation
• Grid system
• Midsagittal function → vocal tract area function (series of areas and lengths of each sagittal section)
  – α, β models (Beautemps et al., 1995; Heinz & Stevens, 1965): A = α d^β, where A is the area of a given grid section and d the midsagittal distance
  – α, β coefficients depend on the subject and the vocal tract location
  – α, β set according to the reference speaker, PB
• Vocal tract acoustic transfer function (Fant, 1960; Badin & Fant, 1984) → formants
48
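A minimal sketch of the α, β midsagittal-to-area conversion A = α d^β; the distances and coefficient values below are illustrative only (the actual α, β are fitted per speaker and per vocal tract region):

```python
import numpy as np

def area_function(d, alpha, beta):
    """Alpha-beta model (Heinz & Stevens, 1965): cross-sectional area
    A = alpha * d**beta from the midsagittal distance d."""
    return alpha * np.power(d, beta)

# Illustrative values only: midsagittal distances (cm) along five grid
# sections, with per-section alpha, beta coefficients (hypothetical).
d = np.array([0.5, 1.2, 2.0, 1.4, 0.8])        # cm
alpha = np.array([1.8, 1.6, 1.5, 1.6, 1.9])
beta = np.array([1.4, 1.5, 1.5, 1.5, 1.3])

areas = area_function(d, alpha, beta)          # cm^2, one area per section
```

The resulting series of areas (with the section lengths) is the area function fed to the acoustic transfer-function computation to obtain formants.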
49
No. of coefficients by method
[Plot: number of coefficients vs. number of components (1-20) for the PCA, Joint PCA, PARAFAC and TUCKER decompositions]
"Essentially, all models are wrong, but some are useful"
George Edward Pelham Box
50
Multi-speaker decomposition methods
• Joint PCA (two-way analysis adapted to multi-speaker models) (Ananthakrishnan et al., 2010, KTH, Sweden)
  – All speakers' articulatory measurements for one phoneme are considered as one set of data
  – Forces common components
51
52
Generalisation
Articulatory data
• Estimation of non-visible landmarks (tongue tip and jaw attachment)
• Computed as the average position over the articulations in which the landmark is distinguishable
53
[Figure: articulations with a non-distinguishable tongue tip or jaw attachment]
State of the art on articulatory normalisation
• Articulatory normalisation based on linear decomposition methods
  • PARAFAC tongue models, 2 components extracted
  • Data: 7-15 vowels, 3-9 speakers
  • Performance: 71%-96% of variance explained
• Geometric normalisation
  • Scaling transformations do not normalise the articulatory control strategies employed by different speakers
• Challenge
  • Modelling of other contours such as lips and velum
  • Extension to consonants
54
Linear regression between pairs of speakers
[Plot: percentage of variance explained vs. number of components (1-20) for the prediction of speaker PB: PCA model of PB, and predictions from YL, LH, RL, LD, BR, HL, AA, MG, AK and MGO]
[Plot: RMS error (cm) vs. number of components (1-20) for the prediction of speaker PB]
• Prediction of the PCA control parameters of a target speaker (π_TS) from the PCA control parameters of a source speaker (π_SS) by multi-linear regression:
  π_TS = Σ_{i=1}^{n_cmp} β_i π_SS,i
  evaluated in terms of variance explained (VarEx) and RMSE
• Overfitting from the 10th component on (LOOCV)
• At 10 components: 64.32 % variance explained, 0.37 cm RMSE
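The cross-speaker prediction above can be sketched as a multi-linear regression with leave-one-out cross-validation (LOOCV) to detect overfitting; the parameter matrices below are synthetic (shapes chosen to mimic 63 articulations and 10 components), so the numbers will not match the reported 64.32 % / 0.37 cm:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical PCA control parameters: 63 articulations x 10 components,
# for a source speaker (pi_ss) and a target speaker (pi_ts), related by a
# linear map plus noise.
n_artic, n_cmp = 63, 10
pi_ss = rng.normal(size=(n_artic, n_cmp))
true_map = rng.normal(size=(n_cmp, n_cmp))
pi_ts = pi_ss @ true_map + 0.1 * rng.normal(size=(n_artic, n_cmp))

def loocv_rmse(x, y):
    """Leave-one-out RMSE of a least-squares multi-linear regression y ~ x."""
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i          # hold out articulation i
        w, *_ = np.linalg.lstsq(x[keep], y[keep], rcond=None)
        errs.append(np.mean((x[i] @ w - y[i]) ** 2))
    return float(np.sqrt(np.mean(errs)))

rmse = loocv_rmse(pi_ss, pi_ts)                # close to the 0.1 noise level here
```

Repeating this while truncating the number of components shows the overfitting pattern reported above: held-out error stops improving once too many components are used.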