Philip Jackson, Boon-Hooi Lo and Martin Russell. Electronic, Electrical and Computer Engineering. Models of speech dynamics for ASR, using intermediate linear representations. http://web.bham.ac.uk/p.jackson/balthasar/

Jan 19, 2016

Transcript
Page 1: Philip Jackson, Boon-Hooi Lo and Martin Russell Electronic Electrical and Computer Engineering Models of speech dynamics for ASR, using intermediate linear.

Philip Jackson, Boon-Hooi Lo and Martin Russell

Electronic, Electrical and Computer Engineering

Models of speech dynamics for ASR, using intermediate linear representations

http://web.bham.ac.uk/p.jackson/balthasar/

Page 2: Abstract

INTRODUCTION

Page 3: Speech dynamics into ASR

• dynamics of speech production to constrain recognizer
  – noisy environments
  – conversational speech
  – speaker adaptation

• efficient, complete and trainable models
  – for recognition
  – for analysis
  – for synthesis

INTRODUCTION

Page 4: Articulatory trajectories

[Figure: articulatory trajectories, from West (2000)]

INTRODUCTION

Page 5: Articulatory-trajectory model

INTRODUCTION

Page 6: Articulatory-trajectory model

[Diagram labels. Level: finite-state; intermediate; surface. Additional label: source dependent]

INTRODUCTION

Page 7: Multi-level Segmental HMM

• segmental finite-state process

• intermediate “articulatory” layer
  – linear trajectories

• mapping required
  – linear transformation
  – radial basis function network

INTRODUCTION

Page 8: Linear-trajectory model

[Diagram labels: acoustic layer; articulatory-to-acoustic mapping; intermediate layer; segmental HMM with states 1 2 3 4 5]

INTRODUCTION

Page 9: Linear-trajectory equations

Defined as

$$\mathbf{f}_i(t) = (t - \bar{t})\,\mathbf{m}_i + \mathbf{c}_i,$$

where

$$\bar{t} = \frac{\tau + 1}{2}.$$

Segment probability:

$$b_i(\mathbf{y}_1^{\tau}) = \prod_{t=1}^{\tau} \mathcal{N}\big(\mathbf{y}(t);\, W(\mathbf{f}_i(t)),\, R_i\big)$$

THEORY
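
To make the linear-trajectory definition concrete, here is a minimal Python sketch of a trajectory f(t) = (t - t_bar) m + c with t_bar = (tau + 1) / 2, and of the log of a segment probability formed as a product of Gaussians over the frames. It rests on assumptions that are not in the slides: the mapping is a plain matrix `W`, the covariance is diagonal (passed as `var`), and the function names are illustrative.

```python
import math

def trajectory(m, c, tau):
    """Return f(t) = (t - t_bar) * m + c for t = 1..tau, with t_bar = (tau + 1) / 2."""
    t_bar = (tau + 1) / 2.0
    return [[(t - t_bar) * mk + ck for mk, ck in zip(m, c)]
            for t in range(1, tau + 1)]

def apply_linear_map(W, f):
    """Multiply matrix W (a list of rows) by vector f."""
    return [sum(w * x for w, x in zip(row, f)) for row in W]

def segment_log_prob(ys, m, c, W, var):
    """Log of prod_t N(y(t); W f(t), R), assuming R is diagonal with entries `var`."""
    total = 0.0
    for y, f in zip(ys, trajectory(m, c, len(ys))):
        mean = apply_linear_map(W, f)
        for yk, mu, v in zip(y, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (yk - mu) ** 2 / v)
    return total
```

Because the time index is centred on t_bar, `c` is the trajectory value at the segment mid-point and `m` is its slope per frame.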

Page 10: Linear mapping

Objective function

$$E = \sum_{t=1}^{T} \big(\mathbf{y}(t) - W\mathbf{x}(t)\big)^{\top} R_i^{-1} \big(\mathbf{y}(t) - W\mathbf{x}(t)\big),$$

with matched sequences $\mathbf{x}_1^{T}$ and $\mathbf{y}_1^{T}$.

In matrix form, $D = \min_W \lVert Y - WX \rVert$, with solution $W = Y X^{+}$, where $X^{+}$ is the pseudo-inverse of $X$.

THEORY
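
As a worked step, the least-squares mapping can be derived by setting the gradient of the objective to zero. The sketch below assumes the quadratic form E = Σ_t (y(t) − W x(t))ᵀ R⁻¹ (y(t) − W x(t)) with a symmetric, invertible covariance R, stacks the matched sequences as the columns of X and Y (a notational assumption), and assumes X Xᵀ is invertible.

```latex
\begin{aligned}
\frac{\partial E}{\partial W}
  &= -2\, R^{-1} \sum_{t=1}^{T} \big(\mathbf{y}(t) - W\mathbf{x}(t)\big)\, \mathbf{x}(t)^{\top} = 0 \\
\Rightarrow\quad
W \sum_{t=1}^{T} \mathbf{x}(t)\, \mathbf{x}(t)^{\top}
  &= \sum_{t=1}^{T} \mathbf{y}(t)\, \mathbf{x}(t)^{\top} \\
\Rightarrow\quad
W &= Y X^{\top} \big(X X^{\top}\big)^{-1} = Y X^{+}
\end{aligned}
```

Under these assumptions R cancels, so the weighted and unweighted least-squares solutions coincide.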

Page 11: Trajectory parameters

Utterance probability,

$$\Pr(\mathbf{y} \mid \mathbf{s}, M) = \prod_{i=1}^{S} b_i\big(\mathbf{y}_{t_i}^{t_{i+1}-1}\big),$$

and, for the optimal (ML) state sequence $\hat{\mathbf{s}}$, closed-form estimates of the trajectory slope $\hat{\mathbf{m}}_i$ and mid-point $\hat{\mathbf{c}}_i$.

[Equations for $\hat{\mathbf{m}}_i$ and $\hat{\mathbf{c}}_i$: sums over the frames $t_i \ldots t_{i+1}-1$ of each segment, involving $W$ and $\mathbf{y}(t)$; not recoverable from the transcript]

THEORY
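
For orientation, the maximum-likelihood slope and mid-point can be derived for one segment under a linear mapping. This is a sketch under stated assumptions, not necessarily the form shown on the slide: a single segment of duration τ, frames y(t) ~ N(W f_i(t), R_i) with f_i(t) = (t − t̄) m_i + c_i, and Wᵀ R_i⁻¹ W invertible. Because Σ_t (t − t̄) = 0, the cross terms vanish and the two estimates decouple.

```latex
\begin{aligned}
\hat{\mathbf{c}}_i
  &= \big(W^{\top} R_i^{-1} W\big)^{-1} W^{\top} R_i^{-1}\,
     \frac{1}{\tau} \sum_{t=1}^{\tau} \mathbf{y}(t), \\
\hat{\mathbf{m}}_i
  &= \big(W^{\top} R_i^{-1} W\big)^{-1} W^{\top} R_i^{-1}\,
     \frac{\sum_{t=1}^{\tau} (t - \bar{t})\, \mathbf{y}(t)}
          {\sum_{t=1}^{\tau} (t - \bar{t})^{2}},
\qquad \bar{t} = \frac{\tau + 1}{2}.
\end{aligned}
```

When W is the identity, these reduce to the mean of the frames and the ordinary least-squares slope of the frames against time.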

Page 12: Non-linear (RBF) mapping

[Network diagram: layers labelled $f_i(t)$ (formant trajectories), $x_j(t)$ and $y_k(t)$ (acoustic layer)]

THEORY

Page 13: Trajectory parameters

With the RBF, the least-squares solution is sought by gradient descent, using the gradients $\partial E / \partial \mathbf{m}_i$ and $\partial E / \partial \mathbf{c}_i$.

[Equations for the two gradients: sums over $t$ and $j$ involving $x_j(t)$, $f_i(t)$, the weights $\mathbf{w}$ and $\mathbf{y}$, with an extra factor $(t - \bar{t})$ for the slope; not recoverable from the transcript]

THEORY
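
The gradient-descent idea can be illustrated with a small scalar example. This is a sketch under assumptions that are not in the slides: one-dimensional trajectory and output, Gaussian basis functions with fixed centres, a shared width and fixed output weights, a squared-error objective, and a fixed learning rate. All function and parameter names are hypothetical.

```python
import math

def rbf_output(f, centres, width, weights):
    """Scalar RBF network: y_hat = sum_j w_j * exp(-(f - mu_j)^2 / (2 * width^2))."""
    acts = [math.exp(-((f - mu) ** 2) / (2.0 * width ** 2)) for mu in centres]
    return sum(w * a for w, a in zip(weights, acts)), acts

def error_and_grads(ys, m, c, centres, width, weights):
    """Squared error E = sum_t (y(t) - y_hat(t))^2 and its gradients w.r.t. m and c."""
    tau = len(ys)
    t_bar = (tau + 1) / 2.0
    err = grad_m = grad_c = 0.0
    for t, y in enumerate(ys, start=1):
        f = (t - t_bar) * m + c                      # linear trajectory value
        y_hat, acts = rbf_output(f, centres, width, weights)
        resid = y - y_hat
        err += resid ** 2
        # Derivative of the network output with respect to its input f.
        dyhat_df = sum(w * a * (-(f - mu) / width ** 2)
                       for w, a, mu in zip(weights, acts, centres))
        dE_df = -2.0 * resid * dyhat_df
        grad_m += dE_df * (t - t_bar)                # chain rule: df/dm = t - t_bar
        grad_c += dE_df                              # chain rule: df/dc = 1
    return err, grad_m, grad_c

def fit_trajectory(ys, m, c, centres, width, weights, rate=0.05, steps=200):
    """Plain fixed-rate gradient descent on the slope m and mid-point c."""
    for _ in range(steps):
        _, grad_m, grad_c = error_and_grads(ys, m, c, centres, width, weights)
        m -= rate * grad_m
        c -= rate * grad_c
    return m, c
```

A fixed rate is the simplest choice; it is not guaranteed to converge for every setting, so in practice the rate would need tuning or a line search.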

Page 14: Tests on TIMIT

• N. American English, at 8 kHz
  – MFCC13 acoustic features (incl. zeroth)

  a) F1-3: formants F1, F2 and F3, estimated by Holmes formant tracker
  b) F1-3+BE5: five band energies added
  c) PFS12: synthesiser control parameters

METHOD

Page 15: TIMIT baseline performance

[Bar chart: Accuracy (%), axis from 47 to 54, by Features: ID_0, ID_1]

• Constant-trajectory SHMM (ID_0)
• Linear-trajectory SHMM (ID_1)

RESULTS

Page 16: Performance across feature sets

[Bar chart: Accuracy (%), axis from 47 to 54, by Features: ID_0, (a) F1-3, (b) F1-3+BE5, (c) PFS12, ID_1]

RESULTS

Page 17: Phone categorisation

Grouping  No.  Description
A         1    all data
B         2    silence; speech
C         6    linguistic categories: silence/stop; vowel; liquid; nasal; fricative; affricate
D         10   as Deng and Ma (2000): silence; vowel; liquid; nasal; UV fric; /s,ch/; V fric; /z,jh/; UV stop; V stop
E         10   discrete articulatory regions
F         49   silence; individual phones

METHOD

Page 18: Discrete articulatory regions

Region  Features                      Description
0       -voice                        Silence, non-speech
1       +voice, VT open               Vowel, glide
2       +voice, VT part.              Liquid, approximant
3       +voice, VT closed, +velum     Nasal
4       +voice, VT closed             Voiced plosive (closure)
5       -voice, VT closed             Voiceless plosive (closure)
6       +voice, VT open, +plosion     Voiced plosive (release)
7       -voice, VT open, +plosion     Voiceless plosive (release)
8       +voice, VT part., +fric/asp   Voiced fricative
9       -voice, VT part., +fric/asp   Voiceless fricative

METHOD
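
The ten discrete articulatory regions can be held in a small lookup table, for example when choosing which mapping to apply to a segment. The sketch below only transcribes the table above into a Python dictionary; the names `REGIONS` and `regions_with` are illustrative and do not come from the slides.

```python
# Region index -> (articulatory features, description), transcribed from the table.
REGIONS = {
    0: (("-voice",), "Silence, non-speech"),
    1: (("+voice", "VT open"), "Vowel, glide"),
    2: (("+voice", "VT part."), "Liquid, approximant"),
    3: (("+voice", "VT closed", "+velum"), "Nasal"),
    4: (("+voice", "VT closed"), "Voiced plosive (closure)"),
    5: (("-voice", "VT closed"), "Voiceless plosive (closure)"),
    6: (("+voice", "VT open", "+plosion"), "Voiced plosive (release)"),
    7: (("-voice", "VT open", "+plosion"), "Voiceless plosive (release)"),
    8: (("+voice", "VT part.", "+fric/asp"), "Voiced fricative"),
    9: (("-voice", "VT part.", "+fric/asp"), "Voiceless fricative"),
}

def regions_with(feature):
    """Return the sorted region indices whose feature list contains `feature`."""
    return sorted(k for k, (feats, _) in REGIONS.items() if feature in feats)
```

For instance, `regions_with("+plosion")` picks out the two plosive-release regions.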

Page 19: Performance across groupings

[Bar chart: Accuracy (%), axis from 47 to 54, by Mappings: ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1]

RESULTS

Page 20: Results across groupings

[Bar chart: Accuracy (%), axis from 47 to 54, by Mappings: ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1. Series: (a) F1-3; (b) F1-3+BE5; (c) PFS12]

RESULTS

Page 21: Tests on MOCHA

• S. British English, at 16 kHz
  – MFCC13 acoustic features (incl. zeroth)
  – articulatory x- & y-coords from 7 EMA coils
  – PCA9+Lx: first nine articulatory modes plus the laryngograph log energy

METHOD

Page 22: MOCHA baseline performance

[Bar chart: Accuracy (%), axis from 53 to 56, by Mappings: ID_0, ID_1]

RESULTS

Page 23: Performance across mappings

[Bar chart: Accuracy (%), axis from 53 to 56, by Mappings: ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1]

RESULTS

Page 24: Model visualisation

[Figure panels: Original acoustic data; Constant-trajectory model; Linear-trajectory model, (F) PFS12 (c)]

DISCUSSION

Page 25: Conclusions

• Theory of Multi-level Segmental HMMs
• Benefits of linear trajectories
• Results show near optimal performance with linear mappings
• Progress towards unified models of the speech production process

• What next?
  – unsupervised (embedded) training, to derive pseudo-articulatory representations
  – implement non-linear mapping (i.e., RBF)
  – include biphone language model, and segment duration models

SUMMARY