Page 1: Computer Speech Recognition: Mimicking the Human System Li Deng Microsoft Research, Redmond Feb. 2, 2005 at IPAM Workshop on Math of Ear and Sound Processing.

Computer Speech Recognition: Mimicking the Human System

Li Deng

Microsoft Research, Redmond
Feb. 2, 2005

at IPAM Workshop on Math of Ear and Sound Processing (UCLA)

Collaborators: Dong Yu (MSR), Xiang Li (CMU), A. Acero (MSR)

Page 2

Speech Recognition --- Introduction

• Converting naturally uttered speech into text and meaning
• Human-machine dialogues (scenario demos)
• Conventional technology --- statistical modeling and estimation (HMM)
• Limitations:
  – noisy acoustic environments
  – rigid speaking style
  – constrained tasks
  – unrealistic demand for training data
  – huge model sizes, etc.
  – far below human speech recognition performance
• Trend: incorporate key aspects of human speech processing mechanisms
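For context, the conventional HMM technology mentioned above evaluates the likelihood of an observation sequence with the forward algorithm; a minimal sketch, with toy state/emission probabilities invented purely for illustration:

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Forward algorithm: log P(obs | model) for a discrete-output HMM.
    pi: (S,) initial state probabilities
    A:  (S, S) transitions, A[i, j] = P(state j | state i)
    B:  (S, V) emissions, B[s, o] = P(symbol o | state s)
    obs: sequence of observation symbol indices
    """
    alpha = pi * B[:, obs[0]]          # initialize with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate one step, then emit
    return np.log(alpha.sum())

# Toy 2-state, 3-symbol model (all values illustrative only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_log_likelihood(pi, A, B, [0, 1, 2]))
```

The dynamic program sums over all state paths in O(T·S²) time rather than enumerating them.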

Page 3

Production & Perception: Closed-Loop Chain

[Diagram: closed-loop chain — SPEAKER: message → motor/articulators → speech acoustics; LISTENER: ear/auditory reception → internal model → decoded message; speech acoustics link the two in the closed-loop chain]

Page 4

Encoder: Two-Stage Production Mechanisms

[Diagram: SPEAKER — message → motor/articulators → speech acoustics]

Phonology (higher level):
• Symbolic encoding of linguistic message
• Discrete representation by phonological features
• Loosely-coupled multiple feature tiers
• Overcome beads-on-a-string phone model
• Theories of distinctive features, feature geometry & articulatory phonology
• Account for partial/full sound deletion/modification in casual speech

Phonetics (lower level):
• Convert discrete linguistic features to continuous acoustics
• Mediated by motor control & articulatory dynamics
• Mapping from articulatory variables to VT area function to acoustics
• Account for co-articulation and reduction (target undershoot), etc.

Page 5

Encoder: Phonological Modeling

[Diagram: SPEAKER — message → motor/articulators → speech acoustics]

Computational phonology:
• Represent pronunciation variations as a constrained factorial Markov chain
• Constraint: from articulatory phonology
• Language-universal representation

Example: "ten themes" / t ε n θ i: m z /, with overlapping feature tiers — TongueTip and TongueBody (High/Front, Mid/Front)
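A factorial Markov chain factors the hidden phonological state into loosely coupled tiers, each evolving on its own schedule; a minimal sketch with two illustrative tiers (the tier states, transition probabilities, and the absence of articulatory-phonology constraints are all simplifications invented for this example, not the slide's actual model):

```python
import random

def sample_factorial_chain(init, trans, steps, seed=0):
    """Sample a factorial Markov chain: each tier evolves with its own
    transition table, and the joint state is the tuple of tier states.
    init:  dict tier-name -> initial state
    trans: dict tier-name -> {state: [(next_state, prob), ...]}
    """
    rng = random.Random(seed)
    state = dict(init)
    path = [tuple(state.values())]
    for _ in range(steps):
        for name in state:                       # each tier steps independently
            nexts, probs = zip(*trans[name][state[name]])
            state[name] = rng.choices(nexts, probs)[0]
        path.append(tuple(state.values()))
    return path

# Illustrative tiers for the /t ε n/ region: tongue-tip closure can
# overlap asynchronously with tongue-body position.
trans = {
    "TongueTip":  {"closed-alveolar": [("closed-alveolar", 0.6), ("open", 0.4)],
                   "open":            [("open", 0.7), ("closed-alveolar", 0.3)]},
    "TongueBody": {"mid-front":  [("mid-front", 0.8), ("high-front", 0.2)],
                   "high-front": [("high-front", 0.9), ("mid-front", 0.1)]},
}
path = sample_factorial_chain(
    {"TongueTip": "closed-alveolar", "TongueBody": "mid-front"}, trans, 5)
print(path)
```

Because the tiers change state at different times, the joint path naturally produces the partial overlaps and asynchronies that a single beads-on-a-string phone chain cannot represent.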

Page 6

Encoder: Phonetic Modeling

[Diagram: SPEAKER — message → motor/articulators → speech acoustics]

Computational phonetics:
• Segmental factorial HMM for sequential targets in the articulatory or vocal tract resonance domain
• Switching trajectory model for target-directed articulatory dynamics
• Switching nonlinear state-space model for dynamics in speech acoustics
• Illustration:

Page 7

Phonetic Encoder: Computation

[Diagram: SPEAKER — message → motor/articulators → speech acoustics; computation pipeline: targets → articulation → distortion-free acoustics → distorted acoustics, with distortion factors & feedback to articulation]

Page 8

Phonetic Reduction Illustration

[Spectrograms: "yo-yo" (formal) vs. "yo-yo" (casual)]

$z_n = 2\gamma_s\, z_{n-1} - \gamma_s^2\, z_{n-2} + (1-\gamma_s)^2\, T_s + w_n$
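The target-directed dynamics on this slide, z_n = 2γ_s z_{n−1} − γ_s² z_{n−2} + (1−γ_s)² T_s + w_n (as reconstructed from the garbled transcript), form a second-order, critically damped system pulled toward the segmental target T_s. A deterministic sketch (noise term omitted; γ and target values invented for illustration) shows how short, fast segments undershoot the target — i.e., phonetic reduction:

```python
def trajectory(gamma, target, steps, z0=0.0):
    """Second-order, critically damped target-directed dynamics:
    z_n = 2*gamma*z_{n-1} - gamma**2*z_{n-2} + (1-gamma)**2*target
    (the noise term w_n is dropped for a deterministic illustration).
    """
    z_prev2 = z_prev = z0
    out = [z0]
    for _ in range(steps):
        z = 2*gamma*z_prev - gamma**2*z_prev2 + (1 - gamma)**2 * target
        z_prev2, z_prev = z_prev, z
        out.append(z)
    return out

# A long (slow-speech) segment nearly reaches the target; a short
# (fast-speech) segment with the same time constant undershoots it.
slow = trajectory(gamma=0.9, target=1.0, steps=80)
fast = trajectory(gamma=0.9, target=1.0, steps=8)
print(round(slow[-1], 3), round(fast[-1], 3))
```

The steady state of the recursion is exactly the target (set z_n = z_{n−1} = z_{n−2} = z and the coefficients cancel), so undershoot is purely a function of segment duration relative to 1/(1−γ).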

Page 9

Decoder I: Auditory Reception

[Diagram: closed-loop chain — LISTENER: ear/auditory reception → internal model → decoded message]

LISTENER:
• Convert speech acoustic waves into an efficient & robust auditory representation
• This processing is largely independent of phonological units
• Involves processing stages in the cochlea (ear), cochlear nucleus, SOC, IC, …, all the way to A1 cortex
• Principal roles: 1) combat environmental acoustic distortion; 2) detect relevant speech features; 3) provide temporal landmarks to aid decoding
• Key properties: 1) critical-band frequency scale, logarithmic compression; 2) adaptive frequency selectivity, cross-channel correlation; 3) sharp response to transient sounds; 4) modulation in independent frequency bands; 5) binaural noise suppression, etc.

Page 10

Decoder II: Cognitive Perception

[Diagram: closed-loop chain — LISTENER: ear/auditory reception → internal model → decoded message]

LISTENER:
• Cognitive process: recovery of the linguistic message
• Relies on: 1) the "internal" model — structural knowledge of the encoder (production system); 2) robust auditory representation of features; 3) temporal landmarks
• Child speech acquisition is the process of gradually establishing the "internal" model
• Strategy: analysis by synthesis, i.e., probabilistic inference on (deeply) hidden linguistic units using the internal model
• No motor theory: this strategy requires no articulatory recovery from speech acoustics

Page 11

Speaker-Listener Interaction

• On-line modification of the speaker's articulatory behavior (speaking effort, rate, clarity, etc.) based on the listener's "decoding" performance (i.e., discrimination)
• Especially important for conversational speech recognition and understanding
• On-line adaptation of "encoder" parameters
• Novel criterion: maximize discrimination while minimizing articulation effort
• In this closed-loop model, "effort" is quantified as the "curvature" of the temporal sequence of the articulatory vector z_t
• No such concept of "effort" exists in conventional HMM systems

Page 12

Stage-I illustration (effects of speaking rate)

Page 13

Sound Confusion for Casual Speech (model vs. data)

[Plots: model prediction vs. hand measurements of sound confusion as a function of speaking rate]

• Two sounds merge when they become "sloppy"
• Human perception does "extrapolation"; so does our model
• 5000 hand-labeled speech tokens
• Source: J. Acoustical Society of America, 2000

Page 14

Model Stage-I:

• Impulse response of FIR filter (non-causal):

• Output of filter:
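The filter equations on this slide were rendered as images and did not survive extraction. As a hedged sketch of the stated idea — a non-causal (bi-directional) FIR filter that smooths the segmental target sequence so each frame blends past and future targets, yielding coarticulation — the exponential impulse-response shape and the parameter values below are assumptions for illustration, not the slide's exact formula:

```python
def smooth_targets(targets, gamma=0.8, D=5):
    """Bi-directional FIR smoothing of a frame-level target sequence.
    Non-causal impulse response h(k) proportional to gamma**|k| for
    -D <= k <= D, normalized to unit sum, so each output frame is a
    weighted blend of neighboring (past AND future) targets.
    """
    h = [gamma ** abs(k) for k in range(-D, D + 1)]
    total = sum(h)
    h = [w / total for w in h]
    n = len(targets)
    out = []
    for t in range(n):
        acc = 0.0
        for k in range(-D, D + 1):
            j = min(max(t + k, 0), n - 1)   # clamp at utterance edges
            acc += h[k + D] * targets[j]
        out.append(acc)
    return out

# Step target sequence (two segments): the filtered trajectory glides
# smoothly across the segment boundary instead of jumping.
targets = [0.0] * 10 + [1.0] * 10
z = smooth_targets(targets)
print([round(v, 2) for v in z])
```

Because the kernel looks both backward and forward in time, context dependence falls out of the filtering itself, with no context-dependent parameters.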

Page 15

Model Stage-II:

• Analytical prediction of cepstra, assuming a P-th order all-pole model
• Residual random vector for statistical bias modeling (finite pole order, no zeros)
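The cepstra of a P-th order all-pole model have a standard closed form in terms of its pole (resonance) frequencies and bandwidths; a sketch of that mapping is below. The resonance values are invented for the example, and the slide's exact equation and residual model are not recoverable from the transcript:

```python
import math

def vtr_to_lpcc(freqs_hz, bws_hz, fs_hz, n_ceps):
    """Analytical LPC cepstra of an all-pole model whose pole pairs are
    given by resonance frequencies f_p and bandwidths b_p:
        c_n = (2/n) * sum_p exp(-pi*n*b_p/fs) * cos(2*pi*n*f_p/fs)
    """
    ceps = []
    for n in range(1, n_ceps + 1):
        c = 0.0
        for f, b in zip(freqs_hz, bws_hz):
            c += math.exp(-math.pi * n * b / fs_hz) * \
                 math.cos(2 * math.pi * n * f / fs_hz)
        ceps.append(2.0 * c / n)
    return ceps

# Illustrative vowel-like resonances (values invented for the example)
ceps = vtr_to_lpcc([500.0, 1500.0, 2500.0], [60.0, 80.0, 100.0], 8000.0, 12)
print([round(c, 4) for c in ceps])
```

The 1/n decay and the bandwidth-driven exponential make higher-order cepstra shrink quickly, which is why a finite-order fit leaves a structured bias that the residual vector then models statistically.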

Page 16

Illustration: Output of Stage-II (green)

[Plot: model prediction vs. data]

Page 17

Speech Recognizer Architecture

• Stages I and II of the hidden trajectory model combined into a speech recognizer

• No context-dependent parameters: the bi-directional FIR filter provides context dependence, as well as reduction

• Training procedure

• Recognition procedure

Page 18

Procedure --- Training

• Training of the residual parameters (mean and variance)

[Flow diagram: training waveform → feature extraction → LPCC; phonetic transcript w/ time → table lookup → target sequence → FIR target filtering → predicted VTR tracks → nonlinear mapping → predicted LPCC; the difference between observed and predicted LPCC (the LPCC residual) feeds a monophone HMM trainer that estimates the residual mean and variance]

Page 19

Procedure --- N-best Evaluation

[Flow diagram: test data → feature extraction → LPCC; a triphone HMM system outputs an N-best list (N = 1000), each hypothesis with phonetic transcript & time; per hypothesis: table lookup → FIR filtering → nonlinear mapping → predicted LPCC, subtracted from the observed LPCC and evaluated by a Gaussian scorer (parameter-free); finally H* = arg max { P(H_1), P(H_2), …, P(H_1000) }]
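The N-best evaluation above reduces to rescoring every first-pass hypothesis with the new model and taking the arg max; a minimal sketch with a toy stand-in scorer (the transcripts and score values are invented for illustration — a real run would score the FIR-predicted cepstra against the observations with the Gaussian scorer):

```python
import math

def rescore_nbest(hypotheses, score_fn):
    """N-best rescoring: re-evaluate each first-pass hypothesis with a
    second-stage scorer and return the arg max, as in
    H* = arg max { P(H_1), ..., P(H_N) }.
    hypotheses: list of (transcript, first_pass_score) pairs
    score_fn:   maps a transcript to a second-stage (log-domain) score
    """
    best, best_score = None, -math.inf
    for transcript, _first_pass_score in hypotheses:
        s = score_fn(transcript)
        if s > best_score:
            best, best_score = transcript, s
    return best, best_score

# Toy 3-best list and a stand-in log-score table (both invented).
nbest = [("tan teams", -118.5), ("ten teams", -119.0), ("ten themes", -120.0)]
toy_rescore = {"tan teams": -110.0, "ten teams": -102.0, "ten themes": -95.0}
best, score = rescore_nbest(nbest, toy_rescore.get)
print(best)   # the hypothesis with the highest second-stage score
```

Rescoring lets a model that is expensive to decode with directly (here, the hidden trajectory model) still drive recognition, by searching only the candidate space the HMM system proposes.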

Page 20

Results (recognition accuracy %)

[Plot: recognition accuracy (%) as a function of N in the N-best list, comparing the model against the HMM baseline; y-axis 30–100%, x-axis N = 1 to 1001]

Page 21

Summary & Conclusion

• Human speech production and perception viewed as synergistic elements in a closed-loop communication chain
• They function as encoding & decoding of linguistic messages, respectively
• In humans, the speech "encoder" (production system) consists of phonological (symbolic) and phonetic (numeric) levels
• The current HMM approach approximates these two levels in a crude way:
  – phone-based phonological model ("beads-on-a-string")
  – multiple Gaussians as the phonetic model, applied to acoustics directly
  – very weak hidden structure

Page 22

Summary & Conclusion (cont'd)

• "Linguistic message recovery" (decoding) formulated as:
  – auditory reception, for an efficient & robust speech representation and for providing temporal landmarks for phonological features
  – cognitive perception, using "encoder" knowledge (the "internal model") to perform probabilistic analysis by synthesis, or pattern matching
• Dynamic Bayes networks developed as a computational tool for constructing the encoder and decoder
• Speaker-listener interaction (in addition to poor acoustic environments) causes substantial changes in articulation behavior and acoustic patterns
• Scientific background and computational framework for our recent MSR speech recognition research

Page 23

End & Backup Slides