Open Problems in Speech Recognition Nelson Morgan, EECS and ICSI.
Post on 20-Dec-2015
ICSI and EECS
• International Computer Science Institute
• Nonprofit, closely affiliated with UCB-EECS:
- faculty (e.g., Morgan, Feldman)
- Board (Berlekamp, Karp, Malik)
- students (PhD, MS)
• Focus areas in speech, language, theory, internet research; CITRIS involvement
A working speech recognizer (circa 1920)
A working speech recognizer (circa 2002)
Current Applications
• Toys
• Telephone queries (operator/touch tone replacement)
• Voice dialing (for cell phones)
• Dictation (esp. for specific domains)
Major Reasons for Success
• Late 60’s statistical methodology (HMMs, developed for cryptography) applied to speech in 70’s and 80’s
• Moore’s Law + engineering refinements to HMM training/recognition (1986-now)
• Normalization approaches (mean norms, RASTA filtering, vocal tract length approx)
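Mean normalization, the simplest of the approaches listed above, can be sketched in a few lines. The idea is that a fixed linear channel (for example, a different telephone handset) multiplies the spectrum, so it adds a constant offset to log-spectral or cepstral features; subtracting the per-utterance mean of each coefficient removes that offset. This is only an illustrative sketch: the function name `mean_normalize` and the toy feature values are assumptions, not taken from the talk.

```python
from typing import List


def mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-coefficient utterance mean from every frame."""
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each feature coefficient over the whole utterance.
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Toy log-domain features: 3 frames, 2 coefficients (made-up values).
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# A fixed channel shows up as a constant additive offset in the log domain.
channel_offset = [5.0, -1.5]
shifted = [[x + c for x, c in zip(frame, channel_offset)] for frame in clean]

norm_clean = mean_normalize(clean)
norm_shifted = mean_normalize(shifted)

# After normalization the two versions of the utterance should match.
max_diff = max(
    abs(a - b)
    for fa, fb in zip(norm_clean, norm_shifted)
    for a, b in zip(fa, fb)
)
print(max_diff)
```

Mean normalization only removes a channel that is constant over the utterance; RASTA filtering and vocal tract length normalization address other kinds of variation and are not shown here.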
Two examples of things that helped
• RASTA: 2% digit error rose to 60% on a different phone system; back down to 3% using RASTA; now used for voice dialing in millions of cell phones
• Vocal tract length normalization: 1 parameter for each speaker, significant effect on errors; now used in all large research systems
Major Technical Challenges
• Speaker variability for fluent/conversational speech (pronunciation, rate, overlaps)
- 25-40% error on conversations
• Acoustic variability for general environments (noise, reverb, talker movement)
- 3-10% error on read digits (vs <1% in clean conditions)
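Error figures like these are usually reported as word error rate: the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The slides do not say which metric was used, so treating "error" as word error rate is an assumption here, and the example sentences below are invented.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Word-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Invented example: one deletion ("on") and one substitution ("cell" -> "sell").
reference = "call home on my cell phone".split()
hypothesis = "call home my sell phone".split()
wer = word_error_rate(reference, hypothesis)
print(f"{100 * wer:.1f}% word error rate")
```

Because insertions count as errors, word error rate can exceed 100% when the recognizer outputs many extra words.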
Modern ASR Systems
• From 50,000 ft, all ASR systems are the same:
- compute local spectral envelope
- determine likelihoods of speech sounds
- search for most likely HMMs
• Spectral envelope distorted by many things
- Alternatives often are bad fits to the statistical models
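The last step in the list above, the search for the most likely HMM state sequence, is commonly done with the Viterbi algorithm. The sketch below assumes the per-frame likelihoods of each speech sound are already available from the earlier steps. The two-state model, the state names, and every probability in it are invented for illustration; a real recognizer searches over word and phone models constrained by a lexicon and grammar.

```python
import math
from typing import Dict, List, Tuple


def viterbi(
    obs_loglik: List[Dict[str, float]],
    log_init: Dict[str, float],
    log_trans: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Return the most likely state path and its log score."""
    states = list(log_init)
    # best[s] = log score of the best path ending in state s at this frame.
    best = {s: log_init[s] + obs_loglik[0][s] for s in states}
    backpointers: List[Dict[str, str]] = []
    for frame in obs_loglik[1:]:
        new_best: Dict[str, float] = {}
        pointer: Dict[str, str] = {}
        for s in states:
            # Best predecessor for state s.
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[prev] + log_trans[prev][s] + frame[s]
            pointer[s] = prev
        best = new_best
        backpointers.append(pointer)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, best[last]


def to_log(probs: Dict[str, float]) -> Dict[str, float]:
    return {k: math.log(v) for k, v in probs.items()}


# Hypothetical two-state model: silence ("sil") and a vowel ("ah").
log_init = to_log({"sil": 0.9, "ah": 0.1})
log_trans = {
    "sil": to_log({"sil": 0.6, "ah": 0.4}),
    "ah": to_log({"sil": 0.3, "ah": 0.7}),
}
# Made-up likelihood of each sound for three successive frames.
obs_loglik = [
    to_log({"sil": 0.8, "ah": 0.2}),
    to_log({"sil": 0.3, "ah": 0.7}),
    to_log({"sil": 0.1, "ah": 0.9}),
]

path, log_score = viterbi(obs_loglik, log_init, log_trans)
print(path)  # ['sil', 'ah', 'ah']
```

Working in the log domain turns products of small probabilities into sums, which avoids numerical underflow on long utterances.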
ASR in Brief
[Figure: block diagram with the labels Speech, Signal Processing, Phonetic Probability Estimator, Decoder (word search), Pronunciation Lexicon, Grammar, Words]
ASR is half-deaf
• Phonetic classification very poor
• Success due to constraints (domain, speaker, noise-canceling mic, etc)
• These constraints can mask the underlying weakness of the technology
Rethinking Acoustic Processing for ASR
• Escape dependence on spectral envelope
• Use multiple front ends across time/freq
• Modify statistical models to accommodate new front ends
• Design optimal combination schemes for multiple models
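To make the combination idea concrete, here is one simple and widely used baseline: a weighted product of the class posteriors from each front end (a log-linear combination), renormalized to sum to one. It is only a sketch of what combining multiple models can look like, not the scheme proposed in this talk, which lists the design of such schemes as an open problem. The class labels, stream posteriors, and weights are all hypothetical.

```python
import math
from typing import Dict, List


def combine_posteriors(
    streams: List[Dict[str, float]],
    weights: List[float],
    floor: float = 1e-10,
) -> Dict[str, float]:
    """Log-linear (weighted product) combination of per-stream posteriors."""
    if len(streams) != len(weights):
        raise ValueError("need exactly one weight per stream")
    classes = list(streams[0])
    # Weighted sum of log posteriors; the floor guards against log(0).
    log_scores = {
        c: sum(w * math.log(max(p[c], floor)) for p, w in zip(streams, weights))
        for c in classes
    }
    # Subtract the peak before exponentiating for numerical stability.
    peak = max(log_scores.values())
    unnormalized = {c: math.exp(v - peak) for c, v in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: u / total for c, u in unnormalized.items()}


# Hypothetical posteriors for one frame from two different front ends.
stream_a = {"s": 0.6, "f": 0.4}
stream_b = {"s": 0.9, "f": 0.1}

combined = combine_posteriors([stream_a, stream_b], weights=[1.0, 1.0])
print(combined)
```

With equal weights this reduces to the product rule, which treats the streams as independent; unequal weights let a more reliable front end dominate.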
The DARPA (IAO) “EARS” Program
• New 5 year program to radically reduce errors in conversational speech-to-text
• Two components:
- Rich Transcription (large reductions in error rate, improvements in readability and portability to new languages)
- Novel Approaches (radical changes)
EARS: Effective Affordable Reusable Speech-to-text
• Rich Transcription: 4 teams
- SRI/ICSI/UW
- BBN/U.Pitt/UW/LIMSI
- Cambridge U.
- IBM
• Novel Approaches: 2 teams
- ICSI/SRI/UW/OGI/Columbia/IDIAP
- Microsoft
Novel Approach 1: Pushing the Envelope (aside)
• Problem: Spectral envelope is a fragile information carrier
• Solution: Probabilities from multiple time-frequency patches
[Figure: OLD: estimate of sound identity from a 10 ms frame. PROPOSED: i-th, k-th, and n-th estimates from patches of up to 1 s, combined by information fusion into an estimate of sound identity. Axis: time.]
Novel Approach 2: Beyond Frames…
• Problem: Features & models interact, new features may require different models
• Solution: Advanced features require advanced models, not limited by fixed-frame-rate paradigm
[Figure: OLD: short-term features, conventional HMM. PROPOSED: advanced features, multi-rate / dynamic scale classifier.]
Other speech-to-text projects
• Dialog systems: DARPA Communicator/Symphony, German SmartKom
• Noise/reverberation for cell phone, military environments: DARPA SPINE program, various European projects (EU, ETSI)
• Recognition/retrieval/summarization for multiparty meetings: Swiss IM2, EU m4, ICSI/UW/SRI/Columbia NSF-ITR
Resource generation from Berkeley researchers
• gmtk - a new graphical model toolkit specialized for speech (extension of 2 PhD theses, Bilmes [UW] and Zweig [IBM])
• Publicly available speech/neural network software (RASTA, speech neural network training system)
• Soon: a “meeting data” corpus
Campus interaction
• Within EECS (CIS):
- Feldman (also ICSI), NLU
- Jordan and Russell, machine learning
• Linguists:
- Ohala, phonology
- Fillmore (ICSI), semantic lexicography
Natural Speech + Language Projects at ICSI/EECS
• Berkeley Restaurant Project (BeRP) - online stochastic context free grammar probabilities with natural mixed initiative
• SmartKom - tourist information query system w/American pronunciations of German place names
Summary
• Progress in speech recognition research led to working systems in particular domains
• Performance still severely limited for conversational speech, noisy/reverberant conditions
• We and others are working to transcend these limitations with novel approaches