Page 1
Philip Jackson, Boon-Hooi Lo and Martin Russell
Electronic, Electrical and Computer Engineering
Models of speech dynamics for ASR, using intermediate linear representations
http://web.bham.ac.uk/p.jackson/balthasar/
Page 2
Abstract
INTRODUCTION
Page 3
Speech dynamics into ASR
• dynamics of speech production to constrain recognizer
  – noisy environments
  – conversational speech
  – speaker adaptation
• efficient, complete and trainable models
  – for recognition
  – for analysis
  – for synthesis
Page 4
Articulatory trajectories
from West (2000)
Page 5
Articulatory-trajectory model
Page 6
Articulatory-trajectory model
[Diagram labels: Level (finite-state, intermediate, surface); source dependent]
Page 7
Multi-level Segmental HMM
• segmental finite-state process
• intermediate “articulatory” layer
  – linear trajectories
• mapping required
  – linear transformation
  – radial basis function network
Page 8
Linear-trajectory model
[Figure layers, top to bottom: acoustic layer; articulatory-to-acoustic mapping; intermediate layer; segmental HMM with states 1 to 5]
Page 9
Linear-trajectory equations
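Defined as
\[ \mathbf{f}_i(t) = (t - \bar{t})\,\mathbf{m}_i + \mathbf{c}_i, \]
where
\[ \bar{t} = \frac{\tau + 1}{2}. \]
Segment probability:
\[ b_i(\mathbf{y}_1^{\tau}) = \prod_{t=1}^{\tau} \mathcal{N}\big(\mathbf{y}(t);\, W(\mathbf{f}_i(t)),\, R_i\big). \]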
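A minimal Python sketch of the segment model, for illustration only. It assumes a linear mapping W stored as a list of rows and a diagonal covariance given as a list of variances; the function names are hypothetical and not from the slides.

```python
import math

def linear_trajectory(m, c, tau):
    """f(t) = (t - t_bar) * m + c for t = 1..tau, with t_bar = (tau + 1) / 2."""
    t_bar = (tau + 1) / 2.0
    return [[(t - t_bar) * mj + cj for mj, cj in zip(m, c)]
            for t in range(1, tau + 1)]

def matvec(W, x):
    """Apply the (assumed linear) mapping W, a list of rows, to vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def log_gauss_diag(y, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (yi - mu) ** 2 / v)
               for yi, mu, v in zip(y, mean, var))

def segment_log_prob(ys, m, c, W, var):
    """Sum over frames of log N(y(t); W f(t), R), i.e. the log of the product."""
    traj = linear_trajectory(m, c, len(ys))
    return sum(log_gauss_diag(y, matvec(W, f), var) for y, f in zip(ys, traj))

# Usage: a 3-frame segment with a 1-D intermediate layer and 2-D acoustic layer.
W = [[1.0], [0.5]]
frames = [[8.1, 4.0], [10.0, 5.1], [11.9, 6.0]]
print(segment_log_prob(frames, m=[2.0], c=[10.0], W=W, var=[1.0, 1.0]))
```

Working in the log domain avoids underflow when the product runs over many frames.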
THEORY
Page 10
Linear mapping
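Objective function
\[ E = \sum_{t=1}^{T} \big(W\mathbf{x}(t) - \mathbf{y}(t)\big)^{\top} R_i^{-1} \big(W\mathbf{x}(t) - \mathbf{y}(t)\big), \]
with matched sequences \(\mathbf{x}_1^T\) and \(\mathbf{y}_1^T\).
[Two further expressions on this slide, in W, X, Y and D (one a minimisation), are garbled in the source.]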
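A hedged sketch of fitting a linear mapping to matched sequences by least squares. It assumes one covariance shared by all frames, in which case the weighting drops out of the minimiser and the normal equations (sum of x x') W' = (sum of x y') apply. The helper names `solve` and `fit_linear_map` are hypothetical.

```python
def solve(A, B):
    """Solve A Z = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(a) + list(b) for a, b in zip(A, B)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ValueError("singular system: x(t) do not span the input space")
        M[col], M[piv] = M[piv], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

def fit_linear_map(xs, ys):
    """Return W (rows = acoustic dims) minimising sum_t ||W x(t) - y(t)||^2."""
    n, p = len(xs[0]), len(ys[0])
    A = [[sum(x[i] * x[j] for x in xs) for j in range(n)] for i in range(n)]
    B = [[sum(x[i] * y[k] for x, y in zip(xs, ys)) for k in range(p)]
         for i in range(n)]
    Wt = solve(A, B)  # n x p, the transpose of W
    return [[Wt[i][k] for i in range(n)] for k in range(p)]

# Usage: four matched frames, 2-D intermediate layer, 3-D acoustic layer.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
ys = [[1.0, 0.0, 3.0], [2.0, -1.0, 0.5], [3.0, -1.0, 3.5], [0.0, 1.0, 5.5]]
print(fit_linear_map(xs, ys))
```

With state-dependent covariances the weighting no longer cancels, so this unweighted fit would only be an approximation.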
Page 11
Trajectory parameters
Utterance probability,
\[ \Pr(\mathbf{y} \mid \mathbf{s}, \mathbf{M}) = \prod_{i=1}^{S} b_i\big(\mathbf{y}_{t_{i-1}+1}^{t_i}\big), \]
and, for the optimal (ML) state sequence s,
[Equations for the trajectory-parameter estimates \(\hat{\mathbf{m}}_i\) and \(\hat{\mathbf{c}}_i\), sums over the frames of segment i involving W and D, are garbled in the source.]
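One way to obtain trajectory-parameter estimates in the linear case, given as a derivation sketch rather than the slide's own formulas. It assumes a segment of τ frames modelled as y(t) ~ N(W f_i(t), R_i) with f_i(t) = (t − t̄) m_i + c_i, and that W has full column rank; the shorthand P_i is introduced here for illustration.

```latex
\[
\begin{aligned}
E_i &= \sum_{t=1}^{\tau} \mathbf{e}(t)^{\top} R_i^{-1}\, \mathbf{e}(t),
  \qquad \mathbf{e}(t) = \mathbf{y}(t) - W\big[(t-\bar{t})\,\mathbf{m}_i + \mathbf{c}_i\big], \\
\frac{\partial E_i}{\partial \mathbf{c}_i} = 0
  &\;\Rightarrow\; \hat{\mathbf{c}}_i = P_i\, \frac{1}{\tau} \sum_{t=1}^{\tau} \mathbf{y}(t), \\
\frac{\partial E_i}{\partial \mathbf{m}_i} = 0
  &\;\Rightarrow\; \hat{\mathbf{m}}_i = P_i\,
     \frac{\sum_{t=1}^{\tau} (t-\bar{t})\,\mathbf{y}(t)}{\sum_{t=1}^{\tau} (t-\bar{t})^{2}}, \\
P_i &= \big(W^{\top} R_i^{-1} W\big)^{-1} W^{\top} R_i^{-1}.
\end{aligned}
\]
```

The two estimates decouple because the sum of (t − t̄) over t = 1, ..., τ is zero when t̄ = (τ + 1)/2; the slope estimate needs τ ≥ 2.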
Page 12
Non-linear (RBF) mapping
[Figure: RBF network from formant trajectories f_i(t), through units x_j(t), to acoustic-layer outputs y_k(t)]
Page 13
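Trajectory parameters
With the RBF, the least-squares solution is sought by gradient descent:
[Equations for the gradients \(\partial E / \partial \mathbf{m}_i\) and \(\partial E / \partial \mathbf{c}_i\), double sums over frames t and RBF units j, are garbled in the source.]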
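A hedged Python sketch of gradient descent on the trajectory parameters through an RBF mapping. It assumes Gaussian basis functions with given centres and widths, fixed output weights, and an unweighted squared error; all names, the learning rate and the step count are illustrative assumptions, not the authors' settings.

```python
import math

def rbf_forward(f, centres, widths, weights):
    """Return (y_hat, activations): y_hat = sum_j w_j * exp(-|f - mu_j|^2 / (2 s_j^2))."""
    acts = [math.exp(-sum((fi - mi) ** 2 for fi, mi in zip(f, mu)) / (2 * s * s))
            for mu, s in zip(centres, widths)]
    n_out = len(weights[0])
    y_hat = [sum(w[k] * a for w, a in zip(weights, acts)) for k in range(n_out)]
    return y_hat, acts

def loss_and_grads(ys, m, c, centres, widths, weights):
    """Squared error E over a segment and its gradients w.r.t. slope m and mid-point c."""
    t_bar = (len(ys) + 1) / 2.0
    E, gm, gc = 0.0, [0.0] * len(m), [0.0] * len(c)
    for t, y in enumerate(ys, start=1):
        a = t - t_bar
        f = [a * mi + ci for mi, ci in zip(m, c)]
        y_hat, acts = rbf_forward(f, centres, widths, weights)
        err = [yh - yk for yh, yk in zip(y_hat, y)]
        E += sum(e * e for e in err)
        for mu, s, w, x in zip(centres, widths, weights, acts):
            # Chain rule: dE/df = 2 * (err . w_j) * x_j * (mu_j - f) / s_j^2
            coeff = 2.0 * sum(e * wk for e, wk in zip(err, w)) * x / (s * s)
            for d in range(len(f)):
                g = coeff * (mu[d] - f[d])
                gc[d] += g        # df/dc = 1
                gm[d] += a * g    # df/dm = t - t_bar
    return E, gm, gc

def fit_trajectory(ys, m, c, centres, widths, weights, rate=0.01, steps=100):
    """Plain fixed-step gradient descent on (m, c)."""
    for _ in range(steps):
        _, gm, gc = loss_and_grads(ys, m, c, centres, widths, weights)
        m = [mi - rate * g for mi, g in zip(m, gm)]
        c = [ci - rate * g for ci, g in zip(c, gc)]
    return m, c

# Usage: 1-D trajectory, two RBF units, 1-D output, 5-frame segment.
centres, widths, weights = [[0.0], [2.0]], [1.0, 1.0], [[1.0], [-1.0]]
ys = [rbf_forward([(t - 3.0) * 0.5 + 1.0], centres, widths, weights)[0]
      for t in range(1, 6)]
print(fit_trajectory(ys, [0.3], [0.8], centres, widths, weights))
```

A fixed step size is the simplest choice; it has to be small relative to the curvature of E, otherwise the updates can overshoot.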
Page 14
Tests on TIMIT
• N. American English, at 8 kHz
  – MFCC13 acoustic features (incl. zeroth)
  a) F1-3: formants F1, F2 and F3, estimated by Holmes formant tracker
  b) F1-3+BE5: five band energies added
  c) PFS12: synthesiser control parameters
METHOD
Page 15
TIMIT baseline performance
[Bar chart: Accuracy (%), axis 47 to 54, against Features: ID_0, ID_1]
• Constant-trajectory SHMM (ID_0)
• Linear-trajectory SHMM (ID_1)
RESULTS
Page 16
Performance across feature sets
[Bar chart: Accuracy (%), axis 47 to 54, against Features: ID_0, (a) F1-3, (b) F1-3+BE5, (c) PFS12, ID_1]
Page 17
Phone categorisation
Grouping  No.  Description
A         1    all data
B         2    silence; speech
C         6    linguistic categories: silence/stop; vowel; liquid; nasal; fricative; affricate
D         10   as Deng and Ma (2000): silence; vowel; liquid; nasal; UV fric; /s,ch/; V fric; /z,jh/; UV stop; V stop
E         10   discrete articulatory regions
F         49   silence; individual phones
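A small sketch of how a grouping could select the mapping applied to each phone. The phone-to-category assignments below are a hypothetical subset for grouping C, and the mapping labels stand in for trained mapping matrices; neither is taken from the slides.

```python
# Hypothetical subset of TIMIT-style phone labels assigned to the six
# categories of grouping C (silence and stops share one category).
GROUPING_C = {
    "sil": "silence/stop", "p": "silence/stop", "t": "silence/stop",
    "aa": "vowel", "iy": "vowel",
    "l": "liquid", "r": "liquid",
    "m": "nasal", "n": "nasal",
    "s": "fricative", "z": "fricative",
    "ch": "affricate", "jh": "affricate",
}

# Placeholder labels standing in for one mapping W per category.
W_BY_CATEGORY = {
    "silence/stop": "W_silence_stop", "vowel": "W_vowel", "liquid": "W_liquid",
    "nasal": "W_nasal", "fricative": "W_fricative", "affricate": "W_affricate",
}

def mapping_for(phone, grouping, mappings):
    """Look up the mapping shared by every phone in the phone's category."""
    return mappings[grouping[phone]]

# Usage: phones in the same category share a mapping.
print(mapping_for("aa", GROUPING_C, W_BY_CATEGORY))
print(mapping_for("iy", GROUPING_C, W_BY_CATEGORY))
```

Under grouping A every phone would point to a single mapping, and under grouping F each of the 49 classes would have its own.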
METHOD
Page 18
Discrete articulatory regions
Region  Features                      Description
0       -voice                        Silence, non-speech
1       +voice, VT open               Vowel, glide
2       +voice, VT part.              Liquid, approximant
3       +voice, VT closed, +velum     Nasal
4       +voice, VT closed             Voiced plosive (closure)
5       -voice, VT closed             Voiceless plosive (closure)
6       +voice, VT open, +plosion     Voiced plosive (release)
7       -voice, VT open, +plosion     Voiceless plosive (release)
8       +voice, VT part., +fric/asp   Voiced fricative
9       -voice, VT part., +fric/asp   Voiceless fricative
Page 19
Performance across groupings
[Bar chart: Accuracy (%), axis 47 to 54, against Mappings: ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1]
RESULTS
Page 20
Results across groupings
[Chart: Accuracy (%), axis 47 to 54, against Mappings: ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1; one series per feature set: (a) F1-3, (b) F1-3+BE5, (c) PFS12]
Page 21
Tests on MOCHA
• S. British English, at 16 kHz
  – MFCC13 acoustic features (incl. zeroth)
  – articulatory x- & y-coords from 7 EMA coils
  – PCA9+Lx: first nine articulatory modes plus the laryngograph log energy
METHOD
Page 22
MOCHA baseline performance
[Bar chart: Accuracy (%), axis 53 to 56, against Mappings: ID_0, ID_1]
RESULTS
Page 23
Performance across mappings
[Bar chart: Accuracy (%), axis 53 to 56, against Mappings: ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1]
Page 24
Model visualisation
[Figure panels: Original acoustic data; Constant-trajectory model; Linear-trajectory model, (c) PFS12 (F)]
DISCUSSION
Page 25
Conclusions
• Theory of Multi-level Segmental HMMs
• Benefits of linear trajectories
• Results show near-optimal performance with linear mappings
• Progress towards unified models of the speech production process

• What next?
  – unsupervised (embedded) training, to derive pseudo-articulatory representations
  – implement non-linear mapping (i.e., RBF)
  – include biphone language model, and segment duration models
SUMMARY