Models of speech dynamics for ASR, using intermediate linear representations

Philip Jackson, Boon-Hooi Lo and Martin Russell

Electronic Electrical and Computer Engineering

http://web.bham.ac.uk/p.jackson/balthasar/

Abstract

INTRODUCTION

Speech dynamics into ASR

• dynamics of speech production to constrain recognizer
  – noisy environments
  – conversational speech
  – speaker adaptation

• efficient, complete and trainable models
  – for recognition
  – for analysis
  – for synthesis

INTRODUCTION

Articulatory trajectories

from West (2000)


Articulatory-trajectory model

[Diagram: levels labelled finite-state, intermediate and surface; annotation "source dependent"]


Multi-level Segmental HMM

• segmental finite-state process

• intermediate “articulatory” layer
  – linear trajectories

• mapping required
  – linear transformation
  – radial basis function network


Linear-trajectory model

[Diagram, top to bottom: acoustic layer; articulatory-to-acoustic mapping; intermediate layer; segmental HMM with states 1 to 5]

INTRODUCTION

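Linear-trajectory equations

Defined as

\[ f_i(t) = (t - \bar{t})\, m_i + c_i, \]

where

\[ \bar{t} = \frac{\tau + 1}{2}. \]

Segment probability:

\[ b_i\big(y_1^{\tau}\big) = \prod_{t=1}^{\tau} \mathcal{N}\big(y(t);\ W(f_i(t)),\ R_i\big). \]

THEORY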
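As an illustration of the linear-trajectory model, the following Python sketch evaluates f_i(t) = (t - t_bar) m_i + c_i over a segment of tau frames and the log of the segment probability, a product of Gaussians centred on the mapped trajectory. It assumes a linear mapping stored as a matrix `W` and a full covariance `R`; the function names and array layouts are illustrative choices, not taken from the slides.

```python
import numpy as np

def linear_trajectory(m, c, tau):
    """Return the (tau, dim) array f(t) = (t - t_bar) * m + c for t = 1..tau."""
    t = np.arange(1, tau + 1)
    t_bar = (tau + 1) / 2.0          # segment mid-point in time
    return (t - t_bar)[:, None] * m[None, :] + c[None, :]

def segment_log_prob(y, m, c, W, R):
    """Log of prod_t N(y(t); W f(t), R) for one segment y of shape (tau, d)."""
    tau, d = y.shape
    f = linear_trajectory(m, c, tau)
    resid = y - f @ W.T              # assumption: the mapping is linear, y ~ W f
    R_inv = np.linalg.inv(R)
    _, logdet = np.linalg.slogdet(R)
    # Mahalanobis distance of each frame's residual
    mahal = np.einsum("td,de,te->t", resid, R_inv, resid)
    return float(-0.5 * np.sum(mahal + logdet + d * np.log(2 * np.pi)))
```

Working in the log domain avoids numerical underflow when the per-frame densities are multiplied over long segments.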

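Linear mapping

Objective function

\[ E = \sum_{t} \big(W x(t) - y(t)\big)^{\mathsf{T}} R_i^{-1} \big(W x(t) - y(t)\big), \]

with matched sequences \(x_1^T\) and \(y_1^T\).

\[ W = Y X^{+}, \qquad D = \min_W \lVert W X - Y \rVert \]

THEORY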
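A least-squares fit of a linear mapping W to matched sequences can be sketched as follows. The sketch assumes the intermediate frames are stacked as columns of `X` and the acoustic frames as columns of `Y`, with no offset term and an unweighted (identity covariance) criterion; these layout choices and the function name are assumptions made for illustration.

```python
import numpy as np

def fit_linear_mapping(X, Y):
    """Least-squares W minimising ||W X - Y||_F.

    X: (d_in, T) intermediate-layer frames, Y: (d_out, T) acoustic frames.
    """
    # lstsq solves X^T W^T = Y^T in the least-squares sense
    W_T, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return W_T.T
```

When X has full row rank this agrees with multiplying Y by the pseudo-inverse of X, but `lstsq` avoids forming the pseudo-inverse explicitly.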

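Trajectory parameters

Utterance probability,

\[ \Pr(y \mid s, M) = \prod_{i=1}^{S} b_i\big(y_{t_{i-1}+1}^{t_i}\big), \]

and, for the optimal (ML) state sequence \(\hat{s}\),

[Equations: closed-form estimates of the slope \(\hat{m}_i\) and mid-point \(\hat{c}_i\), written as sums over the frames t of each segment in terms of W, D and y(t); the full expressions are not legible in this copy]

THEORY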
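For a linear mapping, the trajectory parameters of a single segment have a closed-form weighted least-squares solution, because the centred time index (t - t_bar) sums to zero and the slope and mid-point decouple. The sketch below is derived from the model definition under stated assumptions (one segment, fixed `W` and `R`, at least two frames); it is not a transcription of the slide's expressions, which pool over training data.

```python
import numpy as np

def fit_trajectory(y, W, R):
    """ML slope m and mid-point c for one segment y of shape (tau, d_out).

    Minimises sum_t (y(t) - W f(t))^T R^{-1} (y(t) - W f(t))
    with f(t) = (t - t_bar) m + c.
    """
    tau = y.shape[0]
    s = np.arange(1, tau + 1) - (tau + 1) / 2.0   # centred time, sums to zero
    P = np.linalg.inv(R)
    A = W.T @ P @ W
    B = np.linalg.solve(A, W.T @ P)   # projects an acoustic vector to the intermediate layer
    c_hat = B @ y.mean(axis=0)
    m_hat = B @ (s @ y) / np.sum(s ** 2)
    return m_hat, c_hat
```

The matrix `A` must be invertible, which requires `W` to have full column rank.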

Non-linear (RBF) mapping

[Diagram: RBF network with layers labelled f_i(t) (formant trajectories), x_j(t) and y_k(t) (acoustic layer)]

THEORY

Trajectory parameters

With the RBF, the least-squares solution is sought by gradient descent:

[Equations: gradients \(\partial E / \partial m_i\) and \(\partial E / \partial c_i\), each a sum over frames t and basis functions j; the full expressions are not legible in this copy]

THEORY
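Gradient descent on the trajectory parameters through an RBF mapping can be sketched as below. The sketch assumes Gaussian basis functions with centres `mu`, a single shared width `sigma`, output weights `w`, and an unweighted squared error; none of these choices are confirmed by the slides, and the step size and iteration count are arbitrary. The chain rule through f(t) = (t - t_bar) m + c gives dE/dm as the time-weighted sum of dE/df(t) and dE/dc as its plain sum.

```python
import numpy as np

def error_and_grads(m, c, y, mu, sigma, w):
    """Squared error E and its gradients with respect to m and c.

    y: (tau, d_out) acoustic frames, mu: (J, d_in) centres, w: (d_out, J) weights.
    """
    tau = y.shape[0]
    s = np.arange(1, tau + 1) - (tau + 1) / 2.0
    f = s[:, None] * m + c                       # (tau, d_in) linear trajectory
    diff = f[:, None, :] - mu[None, :, :]        # (tau, J, d_in)
    x = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))  # hidden units
    r = y - x @ w.T                              # residual y(t) - y_hat(t)
    E = np.sum(r ** 2)
    g = r @ w                                    # (tau, J): r(t) . w_j
    dE_df = 2 * np.einsum("tj,tj,tji->ti", g, x, diff) / sigma ** 2
    return E, s @ dE_df, dE_df.sum(axis=0)

def descend(m, c, y, mu, sigma, w, rate=1e-3, steps=100):
    """Plain fixed-step gradient descent on m and c."""
    for _ in range(steps):
        _, grad_m, grad_c = error_and_grads(m, c, y, mu, sigma, w)
        m, c = m - rate * grad_m, c - rate * grad_c
    return m, c
```

A fixed step size is the simplest option; a line search or an adaptive rate would be a natural refinement.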

Tests on TIMIT

• N. American English, at 8 kHz
  – MFCC13 acoustic features (incl. zero’th)
  a) F1-3: formants F1, F2 and F3, estimated by Holmes formant tracker
  b) F1-3+BE5: five band energies added
  c) PFS12: synthesiser control parameters

METHOD

TIMIT baseline performance

[Bar chart: Accuracy (%), axis 47 to 54, against Features: ID_0, ID_1]

• Constant-trajectory SHMM (ID_0)
• Linear-trajectory SHMM (ID_1)

RESULTS

Performance across feature sets

[Bar chart: Accuracy (%), axis 47 to 54, against Features: ID_0, (a) F1-3, (b) F1-3+BE5, (c) PFS12, ID_1]

RESULTS

Phone categorisation

     No.   Description
A    1     all data
B    2     silence; speech
C    6     linguistic categories: silence/stop; vowel; liquid; nasal; fricative; affricate
D    10    as Deng and Ma (2000): silence; vowel; liquid; nasal; UV fric; /s,ch/; V fric; /z,jh/; UV stop; V stop
E    10    discrete articulatory regions
F    49    silence; individual phones

METHOD

Discrete articulatory regions

     Features                       Description
0    -voice                         Silence, non-speech
1    +voice, VT open                Vowel, glide
2    +voice, VT part.               Liquid, approximant
3    +voice, VT closed, +velum      Nasal
4    +voice, VT closed              Voiced plosive (closure)
5    -voice, VT closed              Voiceless plosive (closure)
6    +voice, VT open, +plosion      Voiced plosive (release)
7    -voice, VT open, +plosion      Voiceless plosive (release)
8    +voice, VT part., +fric/asp    Voiced fricative
9    -voice, VT part., +fric/asp    Voiceless fricative

METHOD

Performance across groupings

[Bar chart: Accuracy (%), axis 47 to 54, against Mappings: ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1]

RESULTS

Results across groupings

[Bar chart: Accuracy (%), axis 47 to 54, against Mappings: ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1; series: (a) F1-3, (b) F1-3+BE5, (c) PFS12]

RESULTS

Tests on MOCHA

• S. British English, at 16 kHz
  – MFCC13 acoustic features (incl. zero’th)
  – articulatory x- & y-coords from 7 EMA coils
  – PCA9+Lx: first nine articulatory modes plus the laryngograph log energy

METHOD
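One way to build a PCA9+Lx style feature vector is sketched below: the 14 EMA coordinates (x and y for 7 coils) are projected onto their first nine principal components and the laryngograph log energy is appended. The exact preprocessing used for the slides is not stated, so the mean-centring without variance scaling, the function name, and the argument `lx_log_energy` are assumptions.

```python
import numpy as np

def pca_plus_lx(ema, lx_log_energy, n_modes=9):
    """Stack the first n_modes principal-component scores with Lx log energy.

    ema: (frames, 14) articulatory coordinates, lx_log_energy: (frames,).
    """
    centred = ema - ema.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    modes = centred @ Vt[:n_modes].T
    return np.hstack([modes, lx_log_energy[:, None]])
```

The result has ten columns per frame, nine articulatory modes followed by the laryngograph term.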

MOCHA baseline performance

[Bar chart: Accuracy (%), axis 53 to 56, against Mappings: ID_0, ID_1]

RESULTS

Performance across mappings

[Bar chart: Accuracy (%), axis 53 to 56, against Mappings: ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1]

RESULTS

Model visualisation

[Figure panels: Original acoustic data; Constant-trajectory model; Linear-trajectory model, (F) PFS12 (c)]

DISCUSSION

Conclusions

• Theory of Multi-level Segmental HMMs
• Benefits of linear trajectories
• Results show near-optimal performance with linear mappings
• Progress towards unified models of the speech production process

• What next?
  – unsupervised (embedded) training, to derive pseudo-articulatory representations
  – implement non-linear mapping (i.e., RBF)
  – include biphone language model, and segment duration models

SUMMARY
