Page 1: Automatic speech recognition on the articulation index corpus

EPSRC Perceptual Constancy Steering Group Meeting | 19th May 2010 Slide 1

Automatic speech recognition on

the articulation index corpus

Guy J. Brown and Amy Beeston
Department of Computer Science
University of Sheffield
[email protected]

Page 2: Automatic speech recognition on the articulation index corpus

Aims

• Eventual aim is to develop a ‘perceptual constancy’ front-end for automatic speech recognition (ASR).

• Should be compatible with Watkins et al. findings but also validated on a ‘real world’ ASR task:
– wider vocabulary
– range of reverberation conditions
– variety of speech contexts
– naturalistic speech, rather than interpolated stimuli
– consider phonetic confusions in reverberation in general

• Initial ASR studies using the articulation index corpus
• Aim to compare human performance (Amy’s experiment) and machine performance on the same task

Page 3: Automatic speech recognition on the articulation index corpus

Articulation index (AI) corpus

• Recorded by Jonathan Wright (University of Pennsylvania)

• Intended for speech recognition in noise experiments similar to those of Fletcher.

• Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins:
– American English
– Target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)
– Target syllables are embedded in a context sentence drawn from a limited vocabulary

Page 4: Automatic speech recognition on the articulation index corpus

Grammar for Amy’s subset of AI corpus

$cw1 = YOU | I | THEY | NO-ONE | WE | ANYONE | EVERYONE | SOMEONE | PEOPLE;

$cw2 = SPEAK | SAY | USE | THINK | SENSE | ELICIT | WITNESS | DESCRIBE | SPELL | READ | STUDY | REPEAT | RECALL | REPORT | PROPOSE | EVOKE | UTTER | HEAR | PONDER | WATCH | SAW | REMEMBER | DETECT | SAID | REVIEW | PRONOUNCE | RECORD | WRITE | ATTEMPT | ECHO | CHECK | NOTICE | PROMPT | DETERMINE | UNDERSTAND | EXAMINE | DISTINGUISH | PERCEIVE | TRY | VIEW | SEE | UTILIZE | IMAGINE | NOTE | SUGGEST | RECOGNIZE | OBSERVE | SHOW | MONITOR | PRODUCE;

$cw3 = ONLY | STEADILY | EVENLY | ALWAYS | NINTH | FLUENTLY | PROPERLY | EASILY | ANYWAY | NIGHTLY | NOW | SOMETIME | DAILY | CLEARLY | WISELY | SURELY | FIFTH | PRECISELY | USUALLY | TODAY | MONTHLY | WEEKLY | MORE | TYPICALLY | NEATLY | TENTH | EIGHTH | FIRST | AGAIN | SIXTH | THIRD | SEVENTH | OFTEN | SECOND | HAPPILY | TWICE | WELL | GLADLY | YEARLY | NICELY | FOURTH | ENTIRELY | HOURLY;

$test = SIR | STIR | SPUR | SKUR;

( !ENTER $cw1 $cw2 $test $cw3 !EXIT )
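
The grammar above defines each utterance as a four-word carrier sentence built around the test syllable. As an illustration only (not part of the original slides), the Python sketch below encodes the same structure and samples random sentences from it; the word lists are abbreviated here since the full sets are given in the grammar above.

import random

# Word classes from the AI-corpus grammar above (abbreviated; see the full lists in the slide)
CW1 = ["YOU", "I", "THEY", "NO-ONE", "WE", "ANYONE", "EVERYONE", "SOMEONE", "PEOPLE"]
CW2 = ["SPEAK", "SAY", "USE", "THINK", "HEAR", "REPORT"]          # ... 49 verbs in total
CW3 = ["ONLY", "STEADILY", "ALWAYS", "NOW", "CLEARLY", "AGAIN"]   # ... 43 adverbs in total
TEST = ["SIR", "STIR", "SPUR", "SKUR"]

def sample_sentence():
    """Draw one sentence of the form $cw1 $cw2 $test $cw3."""
    return " ".join(random.choice(words) for words in (CW1, CW2, TEST, CW3))

if __name__ == "__main__":
    for _ in range(3):
        print(sample_sentence())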

Audio demos

Page 5: Automatic speech recognition on the articulation index corpus

ASR system

• HMM-based phone recogniser
– implemented in HTK
– monophone models
– 20 Gaussian mixtures per state
– adapted from scripts by Tony Robinson/Dan Ellis

• Bootstrapped by training on TIMIT, followed by a further 10-12 iterations of embedded training on the AI corpus

• Word-level transcripts in AI corpus expanded to phones using the CMU pronunciation dictionary

• All of AI corpus used for training, except the 80 utterances in Amy’s experimental stimuli
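
As a sketch of the dictionary expansion step mentioned above (an illustration, not the project’s actual scripts), the Python fragment below reads a CMU-style pronunciation dictionary and expands a word-level transcript into phones; the file path and example transcript are hypothetical.

# Expand a word-level transcript to phones using a CMU-style pronunciation dictionary.
# Dictionary lines look like:  STIR  S T ER1
def load_cmu_dict(path):
    pron = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if not line.strip() or line.startswith(";;;"):   # skip blank and comment lines
                continue
            word, *phones = line.split()
            word = word.split("(")[0]            # drop alternate-pronunciation markers, e.g. READ(2)
            pron.setdefault(word, phones)        # keep the first pronunciation of each word
    return pron

def words_to_phones(transcript, pron):
    """Map each word to its phone sequence, stripping lexical stress digits."""
    phones = []
    for word in transcript.split():
        phones.extend(p.rstrip("012") for p in pron[word.upper()])
    return phones

# Hypothetical usage:
# pron = load_cmu_dict("cmudict-0.7b")
# print(words_to_phones("THEY SAY STIR AGAIN", pron))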

Page 6: Automatic speech recognition on the articulation index corpus

MFCC features

• Baseline system trained using mel-frequency cepstral coefficients (MFCCs); see the sketch below
– 12 MFCCs + energy + delta + acceleration (39 features per frame in total)
– cepstral mean normalization

• Baseline system performance on Amy’s clean subset of the AI corpus (80 utterances, no reverberation):
– 98.75% context words correct
– 96.25% test words correct
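
A rough illustration of this kind of front-end in Python with librosa (an assumption for exposition, not the HTK configuration actually used); here the 0th cepstral coefficient stands in for the energy term, and cepstral mean normalization is applied per utterance.

import numpy as np
import librosa

def mfcc_39(wav_path):
    """12 MFCCs + an energy-like c0, plus deltas and accelerations (39 features per frame)."""
    y, sr = librosa.load(wav_path, sr=None)               # keep the file's native sample rate
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # c0 (energy-like) + c1..c12
    # Cepstral mean normalization: subtract the per-utterance mean of each coefficient
    static = static - static.mean(axis=1, keepdims=True)
    delta = librosa.feature.delta(static)                 # first-order derivatives
    accel = librosa.feature.delta(static, order=2)        # second-order derivatives
    feats = np.vstack([static, delta, accel])             # shape (39, n_frames)
    return feats.T                                        # one 39-dimensional vector per frame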

Page 7: Automatic speech recognition on the articulation index corpus

Amy experiment

• Amy’s first experiment used 80 utterances – 20 instances each of the “sir”, “skur”, “spur” and “stir” test words

• Overall confusion rate was controlled by lowpass filtering at 1, 1.5, 2, 3 and 4 kHz

• Same reverberation conditions as in Watkins et al. experiments

• Stimuli presented to the ASR system as in Amy’s human studies

                Test 0.32 m   Test 10 m
Context 0.32 m  near-near     near-far
Context 10 m    far-near      far-far
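
To make the lowpass-filtering manipulation above concrete, here is a minimal sketch (an assumed implementation, not the original stimulus-preparation code) that applies a Butterworth lowpass filter at one of the listed cutoff frequencies.

from scipy.signal import butter, sosfiltfilt

def lowpass(signal, sample_rate, cutoff_hz):
    """Zero-phase Butterworth lowpass filter; cutoff_hz in {1000, 1500, 2000, 3000, 4000}."""
    sos = butter(8, cutoff_hz, btype="low", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, signal)

# Hypothetical usage on a stimulus waveform x sampled at 16 kHz:
# x_filtered = lowpass(x, 16000, 2000)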

Page 8: Automatic speech recognition on the articulation index corpus

Baseline ASR: context words

• Performance falls as the cutoff frequency decreases

• Performance falls as level of reverberation increases

• Near context substantially better than far context at most cutoffs

Page 9: Automatic speech recognition on the articulation index corpus

Baseline ASR: test words

No particular pattern of confusions in the 2 kHz near-near case, but skur/spur/stir errors become more frequent

Page 10: Automatic speech recognition on the articulation index corpus

Baseline ASR: human comparison

• Data for 4 kHz cutoff
• Even mild reverberation (near-near) causes substantial errors in the baseline ASR system
• Human listeners exhibit compensation in the AIC task; the baseline ASR system does not (as expected)

[Figure: percentage error for far and near test words, comparing human data (20 subjects) with the baseline ASR system]

Page 11: Automatic speech recognition on the articulation index corpus

Training on auditory features

• 80 channels between 100 Hz and 8 kHz

• 15 DCT coefficients + delta + acceleration (45 features per frame)

• Efferent attenuation set to zero for initial tests

• Performance of auditory features on Amy’s clean subset of the AI corpus (80 utterances, no reverberation):
– 95% context words correct
– 97.5% test words correct

[Block diagram: stimulus → auditory periphery (OME, DRNL filterbank, hair cell) → frame & DCT → recogniser, with efferent attenuation (ATT) applied to the periphery via the efferent system]
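
As a rough sketch of the ‘frame & DCT’ stage (assuming generic filterbank envelope outputs rather than the DRNL/hair-cell model actually used, and with assumed framing parameters), the fragment below converts an 80-channel envelope into 15 DCT coefficients per frame plus deltas and accelerations, i.e. 45 features per frame.

import numpy as np
from scipy.fft import dct

def frame_dct_features(envelope, frame_len=160, hop=80, n_coeffs=15):
    """envelope: (80, n_samples) array of smoothed filterbank/hair-cell outputs."""
    n_channels, n_samples = envelope.shape
    frames = []
    for start in range(0, n_samples - frame_len + 1, hop):
        # Average each channel's envelope over the frame, then log-compress
        frame = np.log(envelope[:, start:start + frame_len].mean(axis=1) + 1e-8)
        # DCT across channels; keep the first n_coeffs coefficients
        frames.append(dct(frame, type=2, norm="ortho")[:n_coeffs])
    static = np.array(frames)                 # (n_frames, 15)
    delta = np.gradient(static, axis=0)       # simple delta estimate
    accel = np.gradient(delta, axis=0)        # simple acceleration estimate
    return np.hstack([static, delta, accel])  # (n_frames, 45)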

Page 12: Automatic speech recognition on the articulation index corpus

Auditory features: context words

• Take a big hit in performance using auditory features
– saturation in the AN (auditory nerve) is likely to be an issue
– mean normalization

• Performance falls sharply with decreasing cutoff

• As expected, best performance in the least reverberated conditions

Page 13: Automatic speech recognition on the articulation index corpus

Auditory features: test words

Page 14: Automatic speech recognition on the articulation index corpus

Effect of efferent suppression

• Not yet used the full closed-loop model in ASR experiments

• Indication of likely performance obtained by increasing efferent attenuation in ‘far’ context conditions (see the sketch below)

[Block diagram as on slide 11: stimulus → auditory periphery (OME, DRNL filterbank, hair cell) → frame & DCT → recogniser, with efferent attenuation (ATT) applied via the efferent system]
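
The open-loop approximation described above can be illustrated with a small sketch (an assumption about how a fixed efferent attenuation might be emulated, not the actual model code): a constant attenuation in dB is applied to the filterbank channels before feature extraction.

def apply_efferent_attenuation(channel_signals, attenuation_db=10.0):
    """Scale each filterbank channel by a fixed efferent attenuation (open-loop approximation).

    channel_signals: (n_channels, n_samples) array of band-limited signals.
    attenuation_db: e.g. 0 dB for 'near' contexts, 10 dB for 'far' contexts.
    """
    gain = 10.0 ** (-attenuation_db / 20.0)
    return channel_signals * gain

# Hypothetical usage before the frame-and-DCT stage:
# attenuated = apply_efferent_attenuation(filterbank_output, attenuation_db=10.0)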

Page 15: Automatic speech recognition on the articulation index corpus

Auditory features: human comparison

• 4 kHz cutoff
• Efferent suppression effective for mild reverberation
• Detrimental to far test word
• Currently unable to model human data, but:
– not closed loop
– same efferent attenuation in all bands

[Figure: percentage error for far and near test words, comparing human data (20 subjects), the system with no efferent suppression, and the system with 10 dB efferent suppression]

Page 16: Automatic speech recognition on the articulation index corpus

Confusion analysis: far-near condition

• Without efferent attenuation “skur”, “spur” and “stir” are frequently confused as “sir”

• These confusions are reduced by more than half when 10 dB of efferent attenuation is applied

far-near, 0 dB attenuation (rows: presented test word, columns: recognised word)

       SIR  SKUR  SPUR  STIR
SIR     12     3     3     2
SKUR    11     5     2     2
SPUR    10     2     4     4
STIR     7     1     7     5

far-near, 10 dB attenuation

       SIR  SKUR  SPUR  STIR
SIR      9     2     2     7
SKUR     3    10     4     3
SPUR     5     4     4     7
STIR     3     2     3    12
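
For reference, a small sketch (not the original analysis scripts) of how such a confusion matrix can be tallied from paired lists of presented and recognised test words.

from collections import Counter

TEST_WORDS = ["SIR", "SKUR", "SPUR", "STIR"]

def confusion_matrix(presented, recognised):
    """Count (presented, recognised) pairs and lay them out with one row per presented word."""
    counts = Counter(zip(presented, recognised))
    header = "      " + "".join(f"{w:>6}" for w in TEST_WORDS)
    rows = [header]
    for ref in TEST_WORDS:
        cells = "".join(f"{counts[(ref, rec)]:>6}" for rec in TEST_WORDS)
        rows.append(f"{ref:<6}{cells}")
    return "\n".join(rows)

# Hypothetical usage with lists of presented and recognised test words:
# print(confusion_matrix(presented_words, recognised_words))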

Page 17: Automatic speech recognition on the articulation index corpus

Confusion analysis: far-far condition

• Again “skur”, “spur” and “stir” are commonly reported as “sir”

• These confusions are somewhat reduced by 10 dB efferent attenuation, but:

– gain is outweighed by more frequent “skur”, “spur”, “stir” confusions

• Efferent attenuation recovers the dip in the temporal envelope but not cues to /k/, /p/ and /t/

far-far, 0 dB attenuation (rows: presented test word, columns: recognised word)

       SIR  SKUR  SPUR  STIR
SIR     18     0     1     1
SKUR    14     1     3     2
SPUR    12     1     5     2
STIR    12     0     3     5

far-far, 10 dB attenuation

       SIR  SKUR  SPUR  STIR
SIR     13     2     1     4
SKUR    11     3     1     5
SPUR     6     5     2     7
STIR    10     2     1     7

Page 18: Automatic speech recognition on the articulation index corpus

Summary

• ASR framework in place for the AI corpus experiments

• We can compare human and machine performance on the AIC task

• Reasonable performance from baseline MFCC system

• Need to address shortfall in performance when using auditory features

• Haven’t yet tried the full within-channel model as a front end