By the Novel Approaches team: Nelson Morgan, ICSI; Hynek Hermansky, OGI; Dan Ellis, Columbia; Kemal Sonmez, SRI; Mari Ostendorf, UW; Hervé Bourlard, IDIAP/EPFL; George Doddington, NA-sayer
EARS Kickoff Meeting: “Pushing the Envelope”
Dec 21, 2015
Modern ASR Systems
• From 50,000 ft, all ASR systems the same:
- compute local spectral envelope
- determine likelihoods of speech sounds
- search for most likely HMMs
• Spectral envelope distorted by many things
- Alternatives often are bad fits to the statistical models
ASR is half-deaf
• Phonetic classification very poor
• Success due to constraints (domain, speaker, noise-canceling mic, etc)
• These constraints can mask the underlying weakness of the technology
“Y'see, they just find out who complains
the loudest about the cooking, and he gets to be the
cook.”
- Utah Phillips
Who gets to try to fix it?
Rethinking Acoustic Processing for ASR
• Escape dependence on spectral envelope
• Use multiple front ends across time/freq
• Modify statistical models to accommodate new front ends
• Design optimal combination schemes for multiple models
The Two EARS-NA Tasks
• Signal processing: Replacing the spectral envelope by long-time and short-time (multirate) probabilistic functions of the spectro-temporal plane.
• Statistical Modeling: Modifying the statistical models, both to incorporate these new multirate front ends and to explicitly handle areas of missing information.
Task 1: Pushing the Envelope (aside)
• Problem: Spectral envelope is a fragile information carrier
• Solution: Probabilities from multiple time-frequency patches
[Figure: OLD, a single 10 ms frame gives the estimate of sound identity. PROPOSED, the i-th, k-th, and n-th estimates from time-frequency patches of up to 1 s go through information fusion to give the estimate of sound identity. Axis: time.]
Multiple time-frequency tradeoffs
• Temporal trajectories of narrow subbands
• Optimal search for more general patches
• Data-driven broad class probabilities
[Figure: i-th, k-th, and n-th estimates from time-frequency patches of up to 1 s, shown along the time axis.]
Pitch-related features
• Current recognizers have no use for pitch
• Listeners benefit from pitch
• Correlogram estimates spectrum of pitch
Principled multistream
• Not just different, but useful in combination
- minimizing relative entropy between error signals
- minimizing conditional information of posterior signals
• Choosing categories for per-stream probabilistic functions (e.g., broad classes)
Task 2: Beyond Frames…
• Problem: Features & models interact; new features may require different models
• Solution: Advanced features require advanced models, not limited by fixed-frame-rate paradigm
[Figure: OLD, short-term features with a conventional HMM. PROPOSED, advanced features with a multi-rate / dynamic scale classifier.]
Multirate Models
• Goal: Model features that span different time scales and dependence across scales/streams
[Figure: advanced features with a multirate classifier.]
Multirate Models (ctd)
• Why multirate vs. redundant features?
- Redundant features violate independence assumptions, lead to poor confidence (posterior) estimates
- Redundancy adds unnecessary computation
• Important research issues:
- Acoustically driven rate mixing and/or variable alignment
- Discriminative learning of dependence across streams
Partial information techniques
• Can integrate across unknown dimensions
• particularly simple for diagonal Gaussians
• e.g. Spectral masks: Skip missing dimensions
• Hard part is identifying the bad data
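The "skip missing dimensions" idea for diagonal Gaussians can be sketched as follows. This is a minimal illustration, not the team's implementation: the function name `diag_gauss_loglik` and the boolean `mask` argument are assumptions made for the example, and the mask is taken as given, since identifying the bad data is the hard part noted above.

```python
import math

def diag_gauss_loglik(x, mean, var, mask=None):
    """Log-likelihood of x under a diagonal-covariance Gaussian.

    mask[d] is True where dimension d is judged reliable (hypothetical
    input; how to obtain it is not addressed here). Unreliable dimensions
    are marginalized out, which for a diagonal covariance simply means
    leaving their terms out of the sum.
    """
    if mask is None:
        mask = [True] * len(x)
    total = 0.0
    for xd, md, vd, reliable in zip(x, mean, var, mask):
        if not reliable:
            # Integrating a 1-D Gaussian over the unknown value gives 1,
            # so the dimension contributes 0 to the log-likelihood.
            continue
        total += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return total

# Example: the second dimension is corrupted and flagged as missing.
x = [1.0, 50.0, -0.5]
mean = [1.0, 0.0, -0.5]
var = [1.0, 1.0, 1.0]
full = diag_gauss_loglik(x, mean, var)
masked = diag_gauss_loglik(x, mean, var, mask=[True, False, True])
```

With the corrupted dimension skipped, `masked` is much higher than `full`, so a single bad spectral channel no longer dominates the score.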
Multistream statistics
• All possible combinations of individual streams
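To make "all possible combinations" concrete, the sketch below enumerates every non-empty subset of a set of streams and merges per-subset posteriors with a weighted sum. The weighted sum, the function names, and the idea of one classifier per subset are assumptions for illustration; the slide does not specify the combination rule.

```python
from itertools import combinations

def stream_combinations(streams):
    """Return every non-empty subset of the given stream names."""
    subsets = []
    for size in range(1, len(streams) + 1):
        subsets.extend(combinations(streams, size))
    return subsets

def combine_posteriors(posteriors_by_subset, weights):
    """Weighted sum of class posteriors, one posterior vector per subset.

    posteriors_by_subset: dict mapping a subset (tuple of stream names) to
        a list of class posteriors from a classifier for that subset
        (hypothetical classifiers, not defined here).
    weights: dict mapping the same subsets to weights that sum to 1
        (assumed to reflect how reliable each subset is).
    """
    n_classes = len(next(iter(posteriors_by_subset.values())))
    combined = [0.0] * n_classes
    for subset, posterior in posteriors_by_subset.items():
        for c in range(n_classes):
            combined[c] += weights[subset] * posterior[c]
    return combined

subsets = stream_combinations(["low", "mid", "high"])
```

For N streams there are 2^N - 1 non-empty subsets, so the number of combinations grows quickly with the number of streams.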
Multistream statistics (ctd)
• Statistical modeling in both frequency and time: HMM2
Evaluation
• For greatest and most reliable progress, need frequent internal evaluations
• Most importantly, need to define helpful evaluation tasks – to guide the research
• Other considerations beyond the task:
- definition of performance measures
- choice of corpora
- establishment of an evaluation process
Task and corpus, initial plan
• Evaluation tasks – Recognition of words and syllables
• Cross-corpus testing
- training on Hub 5, Macrophone
- testing on OGI numbers for quick turnaround, debugging
• Testing on Hub 5 in due course
• Rescoring SRI decoder output (N-best or lattice)
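N-best rescoring can be sketched as re-ranking the decoder's hypotheses after adding a score from the new models. Everything named here is an assumption for illustration: the `(words, log_score)` list layout, the `new_score` callback, and the interpolation `weight` are not the SRI decoder's actual interface.

```python
def rescore_nbest(nbest, new_score, weight=0.5):
    """Re-rank an N-best list by adding a weighted score from a new model.

    nbest: list of (hypothesis_words, decoder_log_score) pairs
        (assumed layout, not a real decoder output format).
    new_score: function mapping hypothesis_words to a log-domain score
        from the new front end / model (hypothetical).
    weight: interpolation weight for the new score (assumed tunable).
    Returns the list sorted from best to worst combined score.
    """
    rescored = [
        (words, base + weight * new_score(words))
        for words, base in nbest
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Toy example: the new model prefers the second hypothesis.
nbest = [(["one", "nine"], -10.0), (["one", "five"], -10.5)]
toy_score = lambda words: -1.0 if words[-1] == "five" else -4.0
reranked = rescore_nbest(nbest, toy_score, weight=0.5)
```

Here the combined scores are -12.0 and -11.0, so the second hypothesis moves to the top. Lattice rescoring follows the same idea but operates on arcs rather than whole hypotheses.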
Metrics and diagnostics
• Word and syllable error statistics
• Detection statistics and error distribution across speakers (and other conditions that are deemed to be important)
• Comparison to human performance
• Running scores on dev sets within group, held-out evals at least annually (NA-sayer wants weekly)
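As an illustration of the word and syllable error statistics, a standard word error rate can be computed as the edit distance between reference and hypothesis token sequences divided by the reference length. This is a minimal sketch of that common definition, not the project's scoring tool; the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance between token lists, divided by reference length.

    Tokens are split on whitespace (an assumption); passing syllable
    strings instead of words gives a syllable error rate the same way.
    Raises ZeroDivisionError if the reference is empty.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference tokens
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("one two three four", "one too three")
```

In this example there is one substitution and one deletion against four reference words, so `wer` is 0.5.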
Connection to RT evals
• Rescore output of SRI system
• In later years work more closely with RT team to transfer most successful ideas
• Feedback from RT experience (error diagnostics) is also important
Summary
• An alternative view of acoustic processing for ASR for features+models
• Pushing the envelope … aside
• Matching new front end characteristics with appropriate statistical models
• Diagnostic evaluations a key feature
Closing Thought
“When you come to a fork in the road,
take it.”
- Yogi Berra