Open Problems in Speech Recognition Nelson Morgan, EECS and ICSI.

ICSI and EECS

• International Computer Science Institute

• Nonprofit, closely affiliated with UCB-EECS:

- faculty (e.g., Morgan, Feldman)
- Board (Berlekamp, Karp, Malik)
- students (PhD, MS)

• Focus areas in speech, language, theory, internet research; CITRIS involvement

A working speech recognizer (circa 1920)

A working speech recognizer (circa 2002)

Current Applications

• Toys

• Telephone queries (operator/touch tone replacement)

• Voice dialing (for cell phones)

• Dictation (esp. for specific domains)

Major Reasons for Success

• Late 60’s statistical methodology (HMMs, developed for cryptography) applied to speech in 70’s and 80’s

• Moore’s Law + engineering refinements to HMM training/recognition (1986-now)

• Normalization approaches (mean norms, RASTA filtering, vocal tract length approx)
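As a rough illustration of the "mean norms" idea, the sketch below subtracts the per-coefficient mean over time from a sequence of feature frames. It relies on the standard observation that a fixed linear channel shows up as an additive constant in the log-spectral or cepstral domain. The function name `mean_normalize` and the toy numbers are assumptions made for this example, not something taken from the slides.

```python
def mean_normalize(frames):
    """Subtract the mean of each coefficient (taken over time) from every frame."""
    n_frames = len(frames)
    n_dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n_frames for d in range(n_dims)]
    return [[f[d] - means[d] for d in range(n_dims)] for f in frames]


# Hypothetical log-domain features: 3 frames, 2 coefficients each.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# A fixed channel modeled (by assumption) as a constant offset per coefficient.
channel_offset = [5.0, -1.0]
shifted = [[c + o for c, o in zip(f, channel_offset)] for f in clean]

norm_clean = mean_normalize(clean)
norm_shifted = mean_normalize(shifted)
```

After normalization the clean and channel-shifted sequences coincide, which is the sense in which a mean norm removes a constant channel effect.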

Two examples of things that helped

• RASTA: digit error rose from 2% to 60% on a different phone system; RASTA brought it back down to 3%. Now used for voice dialing in millions of cell phones

• Vocal tract length normalization: 1 parameter for each speaker, significant effect on errors; now used in all large research systems
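To make the RASTA example more concrete, here is a minimal sketch of a RASTA-style band-pass filter applied to the time trajectory of one log-spectral coefficient. The coefficients (numerator 0.1 × [2, 1, 0, -1, -2], pole at 0.98) are the values commonly quoted for the RASTA filter in the literature; they are an assumption here, since the slides do not give them, and the sketch uses a causal form with zero initial state. The function name `rasta_like_filter` is hypothetical.

```python
def rasta_like_filter(trajectory, pole=0.98):
    """Band-pass filter one coefficient's trajectory over time.

    Assumed difference equation (not from the slides):
        y[t] = pole * y[t-1] + 0.1 * (2x[t] + x[t-1] - x[t-3] - 2x[t-4])
    The numerator sums to zero, so a constant input decays toward zero.
    """
    numer = [0.2, 0.1, 0.0, -0.1, -0.2]
    out = []
    prev = 0.0
    for t in range(len(trajectory)):
        # Samples before the start of the trajectory are treated as zero.
        fir = sum(b * trajectory[t - k] for k, b in enumerate(numer) if t - k >= 0)
        prev = pole * prev + fir
        out.append(prev)
    return out


# A constant log-domain offset, standing in for a fixed channel (toy data).
constant_channel = [5.0] * 500
filtered = rasta_like_filter(constant_channel)
```

After the start-up transient, the filtered output of a constant offset shrinks toward zero, which is the intuition behind using such filtering when the phone channel changes.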

Major Technical Challenges

• Speaker variability for fluent/conversational speech (pronunciation, rate, overlaps): 25-40% error on conversations

• Acoustic variability for general environments (noise, reverb, talker movement): 3-10% error on read digits (vs <1% in clean conditions)

Modern ASR Systems

• From 50,000 ft, all ASR systems are the same:

- compute local spectral envelope
- determine likelihoods of speech sounds
- search for most likely HMMs

• Spectral envelope distorted by many things

- Alternatives often are bad fits to the statistical models
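The third step above, searching for the most likely HMM state sequence, can be sketched with a small Viterbi decoder. This is only a toy illustration: the two-state model, the transition probabilities, and the per-frame likelihoods are invented for the example, and a real recognizer would search over word-level models built from a lexicon and grammar.

```python
import math


def viterbi(obs_loglik, log_trans, log_init):
    """Return the most likely state path.

    obs_loglik[t][s] is the log-likelihood of frame t under state s;
    log_trans[p][s] is the log-probability of moving from state p to s.
    """
    n_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    backptrs = []
    for frame in obs_loglik[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
            ptr.append(best_prev)
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Toy model (all numbers are assumptions): two "sticky" states.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
# Per-frame likelihoods that favor state 0 early and state 1 late.
likelihoods = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
obs_loglik = [[math.log(p) for p in frame] for frame in likelihoods]

best_path = viterbi(obs_loglik, log_trans, log_init)
```

Working in the log domain avoids numerical underflow when many frame probabilities are multiplied together.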

ASR in Brief

[Block diagram: Speech -> Signal Processing -> Phonetic Probability Estimator -> Decoder (word search) -> Words; the Pronunciation Lexicon and the Grammar feed the Decoder]

ASR is half-deaf

• Phonetic classification very poor

• Success due to constraints (domain, speaker, noise-canceling mic, etc)

• These constraints can mask the underlying weakness of the technology

Rethinking Acoustic Processing for ASR

• Escape dependence on spectral envelope

• Use multiple front ends across time/freq

• Modify statistical models to accommodate new front ends

• Design optimal combination schemes for multiple models

The DARPA (IAO) “EARS” Program

• New 5 year program to radically reduce errors in conversational speech-to-text

• Two components:

- Rich Transcription (large reductions in error rate, improvements in readability and portability to new languages)
- Novel Approaches (radical changes)

EARS: Effective Affordable Reusable Speech-to-text

• Rich Transcription: 4 teams

- SRI/ICSI/UW
- BBN/U.Pitt/UW/LIMSI
- Cambridge U.
- IBM

• Novel Approaches: 2 teams

- ICSI/SRI/UW/OGI/Columbia/IDIAP
- Microsoft

Novel Approach 1: Pushing the Envelope (aside)

• Problem: Spectral envelope is a fragile information carrier

• Solution: Probabilities from multiple time-frequency patches

[Figure, OLD: 10 ms frames along the time axis -> estimate of sound identity]

[Figure, PROPOSED: time-frequency patches of up to 1 s -> i-th, k-th, ..., n-th estimates -> information fusion -> estimate of sound identity]
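The "information fusion" step can be illustrated with one simple combination rule: a weighted geometric mean of the per-patch class probabilities, renormalized to sum to one. This is a sketch of a common choice, not necessarily the scheme used in the project; the function name `fuse_posteriors`, the uniform default weights, and the example probabilities are all assumptions.

```python
import math


def fuse_posteriors(estimates, weights=None):
    """Combine several class-probability vectors into one.

    Uses a weighted sum of log-probabilities (a log-linear / product rule),
    then renormalizes. Weights default to uniform.
    """
    n_classes = len(estimates[0])
    if weights is None:
        weights = [1.0 / len(estimates)] * len(estimates)
    floor = 1e-10  # avoid log(0) when a stream assigns zero probability
    log_scores = [
        sum(w * math.log(max(est[c], floor)) for est, w in zip(estimates, weights))
        for c in range(n_classes)
    ]
    peak = max(log_scores)  # subtract the peak for numerical stability
    unnorm = [math.exp(s - peak) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical estimates from three time-frequency patches over three sound classes.
patch_estimates = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
]
fused = fuse_posteriors(patch_estimates)
best_class = max(range(len(fused)), key=lambda c: fused[c])
```

A product-style rule lets any single stream that is confident a class is unlikely pull the fused probability down; averaging the probabilities instead would be a more forgiving alternative.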

Novel Approach 2: Beyond Frames…

• Problem: Features & models interact, new features may require different models

• Solution: Advanced features require advanced models, not limited by fixed-frame-rate paradigm

[Figure, OLD: short-term features -> conventional HMM]

[Figure, PROPOSED: advanced features -> multi-rate / dynamic scale classifier]

Other speech-to-text projects

• Dialog systems: DARPA Communicator/Symphony, German SmartKom

• Noise/reverberation for cell phone, military environments: DARPA SPINE program, various European projects (EU, ETSI)

• Recognition/retrieval/summarization for multiparty meetings: Swiss IM2, EU m4, ICSI/UW/SRI/Columbia NSF-ITR

Resource generation from Berkeley researchers

• gmtk - a new graphical model toolkit specialized for speech (extension of 2 PhD theses, Bilmes [UW] and Zweig [IBM])

• Publicly available speech/neural network software (RASTA, speech neural network training system)

• Soon: a “meeting data” corpus

Campus interaction

• Within EECS (CIS):

- Feldman (also ICSI), NLU
- Jordan and Russell, machine learning

• Linguists:

- Ohala, phonology
- Fillmore (ICSI), semantic lexicography

Natural Speech + Language Projects at ICSI/EECS

• Berkeley Restaurant Project (BeRP) - online stochastic context free grammar probabilities with natural mixed initiative

• SmartKom - tourist information query system w/American pronunciations of German place names

Summary

• Progress in speech recognition research led to working systems in particular domains

• Performance still severely limited for conversational speech, noisy/reverberant conditions

• We and others are working to transcend these limitations with novel approaches
