By the Novel Approaches team: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sonmez, SRI Mari Ostendorf, UW Hervé Bourlard, IDIAP/EPFL.

By the Novel Approaches team:

Nelson Morgan, ICSI

Hynek Hermansky, OGI

Dan Ellis, Columbia

Kemal Sonmez, SRI

Mari Ostendorf, UW

Hervé Bourlard, IDIAP/EPFL

George Doddington, NA-sayer

EARS Kickoff Meeting: “Pushing the Envelope”

Modern ASR Systems

• From 50,000 ft, all ASR systems the same:

- compute local spectral envelope

- determine likelihoods of speech sounds

- search for most likely HMMs

• Spectral envelope distorted by many things

- Alternatives often are bad fits to the statistical models

ASR is half-deaf

• Phonetic classification very poor

• Success due to constraints (domain, speaker, noise-canceling mic, etc)

• These constraints can mask the underlying weakness of the technology

Who gets to try to fix it?

“Y'see, they just find out who complains the loudest about the cooking, and he gets to be the cook.”

- Utah Phillips

Rethinking Acoustic Processing for ASR

• Escape dependence on spectral envelope

• Use multiple front ends across time/freq

• Modify statistical models to accommodate new front ends

• Design optimal combination schemes for multiple models

The Two EARS-NA Tasks

• Signal processing: replacing the spectral envelope with long-time and short-time (multirate) probabilistic functions of the spectro-temporal plane.

• Statistical modeling: modifying the statistical models, both to incorporate these new multirate front ends and to explicitly handle areas of missing information.


Task 1: Pushing the Envelope (aside)

• Problem: Spectral envelope is a fragile information carrier

• Solution: Probabilities from multiple time-frequency patches

[Figure: OLD, a single 10 ms frame gives the estimate of sound identity; PROPOSED, the i-th, k-th, and n-th estimates from time-frequency patches up to 1 s long pass through information fusion to give the estimate of sound identity]

Multiple time-frequency tradeoffs

• Temporal trajectories of narrow subbands

• Optimal search for more general patches

• Data-driven broad class probabilities

[Figure: i-th, k-th, and n-th estimates taken from time-frequency patches spanning up to 1 s along the time axis]
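The first bullet, temporal trajectories of narrow subbands, can be illustrated with a minimal sketch. It assumes a log-energy spectrogram stored as a list of frames at a 10 ms step, so that 101 frames cover roughly 1 s. The function name `subband_trajectory`, the default context of 50 frames, and the edge handling are illustrative assumptions, not details taken from the slides.

```python
def subband_trajectory(spectrogram, band, center, context=50):
    """Return the temporal trajectory of one subband around a center frame.

    spectrogram: list of frames, each a list of per-band log energies
                 (assumed 10 ms frame step, so context=50 spans about 1 s).
    band:        index of the narrow subband to follow over time.
    center:      index of the frame being classified.
    Frames beyond the utterance edges are filled by repeating the edge
    frame (an assumption made here for simplicity).
    """
    last = len(spectrogram) - 1
    trajectory = []
    for t in range(center - context, center + context + 1):
        clamped = min(max(t, 0), last)  # repeat first/last frame at the edges
        trajectory.append(spectrogram[clamped][band])
    return trajectory


# Toy spectrogram: 200 frames, 3 bands; band b at frame t holds t + 100 * b.
toy = [[t + 100 * b for b in range(3)] for t in range(200)]
traj = subband_trajectory(toy, band=1, center=100)
print(len(traj), traj[0], traj[50], traj[-1])
```

Each such trajectory would then feed a per-band classifier whose output is one of the estimates to be fused; that classifier is outside the scope of this sketch.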

Pitch-related features

• Current recognizers have no use for pitch

• Listeners benefit from pitch

• Correlogram estimates spectrum of pitch

Principled multistream

• Not just different, but useful in combination

- minimizing relative entropy between error signals

- minimizing conditional information of posterior signals

• Choosing categories for per-stream probabilistic functions (e.g., broad classes)

Task 2: Beyond Frames…

• Problem: Features & models interact, new features may require different models

• Solution: Advanced features require advanced models, not limited by fixed-frame-rate paradigm

[Figure: OLD, short-term features with a conventional HMM; PROPOSED, advanced features with a multi-rate / dynamic scale classifier]

Multirate Models

• Goal: Model features that span different time scales and dependence across scales/streams

[Figure: advanced features feeding a multirate classifier]

Multirate Models (ctd)

• Why multirate vs. redundant features?

- Redundant features violate independence assumptions, lead to poor confidence (posterior) estimates

- Redundancy adds unnecessary computation

• Important research issues:

- Acoustically driven rate mixing and/or variable alignment

- Discriminative learning of dependence across streams

Partial information techniques

• Can integrate across unknown dimensions

• particularly simple for diagonal Gaussians

• e.g. Spectral masks: Skip missing dimensions

• Hard part is identifying the bad data
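The diagonal-Gaussian case can be made concrete with a short sketch. With a diagonal covariance the density factors over dimensions, and integrating any one factor over its whole range gives 1, so marginalizing out a missing dimension reduces to skipping its term. The function name `log_likelihood_present` and the boolean mask format are assumptions for illustration; the mask is taken as given here, which sidesteps the hard part noted above.

```python
import math


def log_likelihood_present(x, mean, var, present):
    """Log-likelihood of x under a diagonal Gaussian, using only reliable dims.

    x, mean, var: per-dimension observation, mean, and variance.
    present:      per-dimension booleans; False marks a missing dimension.
    """
    total = 0.0
    for xi, mi, vi, ok in zip(x, mean, var, present):
        if not ok:
            continue  # missing dimension: its marginal contributes log(1) = 0
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


# Second dimension is corrupted (far from the mean) and flagged as missing.
x, mean, var = [0.0, 5.0], [0.0, 0.0], [1.0, 1.0]
print(log_likelihood_present(x, mean, var, [True, True]))
print(log_likelihood_present(x, mean, var, [True, False]))
```

In the toy example the corrupted dimension dominates the full log-likelihood, while the masked version depends only on the reliable dimension.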

Multistream statistics

• All possible combinations of individual streams
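As a rough sketch of what "all possible combinations" implies, the code below enumerates every non-empty subset of a set of streams and forms a weighted sum of class posteriors, one posterior vector per subset. The stream names, the helper names `stream_combinations` and `combine_posteriors`, the exclusion of the empty subset, and the uniform weights are all assumptions made for illustration; the slides do not specify how the combinations are weighted.

```python
from itertools import combinations


def stream_combinations(streams):
    """List every non-empty subset of the given stream names."""
    subsets = []
    for size in range(1, len(streams) + 1):
        subsets.extend(combinations(streams, size))
    return subsets


def combine_posteriors(posteriors, weights):
    """Weighted sum of per-combination class posteriors.

    posteriors: dict mapping a stream subset to a list of class posteriors.
    weights:    dict mapping the same subsets to weights that sum to 1
                (how to set them is left open; uniform is assumed below).
    """
    n_classes = len(next(iter(posteriors.values())))
    combined = [0.0] * n_classes
    for subset, probs in posteriors.items():
        for k, p in enumerate(probs):
            combined[k] += weights[subset] * p
    return combined


subsets = stream_combinations(["low", "mid", "high"])
print(len(subsets))  # 2**3 - 1 non-empty subsets

# Hypothetical two-class posteriors for a two-stream system.
post = {("a",): [0.9, 0.1], ("b",): [0.5, 0.5], ("a", "b"): [0.7, 0.3]}
uniform = {subset: 1.0 / len(post) for subset in post}
print(combine_posteriors(post, uniform))
```

The number of subsets grows as 2^N - 1 for N streams, which is one reason the choice of streams matters.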

Multistream statistics (ctd)

• Statistical modeling in both frequency and time: HMM2

Evaluation

• For greatest and most reliable progress, need frequent internal evaluations

• Most importantly, need to define helpful evaluation tasks – to guide the research

• Other considerations beyond the task:

- definition of performance measures

- choice of corpora

- establishment of an evaluation process

Task and corpus, initial plan

• Evaluation tasks – Recognition of words and syllables

• Cross-corpus testing

- training on Hub 5, Macrophone

- testing on OGI numbers for quick turnaround, debugging

• Testing on Hub 5 in due course

• Rescoring SRI decoder output (N-best or lattice)
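N-best rescoring, in general terms, can be sketched as follows. Each hypothesis carries a baseline score, a new model supplies a second score, and the two are interpolated before the best hypothesis is picked. The data layout, the `new_score` callback, and the interpolation weight are hypothetical; nothing here reflects the actual SRI decoder output format.

```python
def rescore_nbest(nbest, new_score, weight=0.5):
    """Pick the best hypothesis after adding a weighted new score.

    nbest:     list of (words, baseline_score) pairs; scores are in the
               log domain and higher is better.
    new_score: hypothetical function mapping a word list to a log-domain
               score from the new acoustic model.
    weight:    interpolation weight for the new score (a tunable assumption).
    """
    def total(entry):
        words, baseline = entry
        return (1.0 - weight) * baseline + weight * new_score(words)

    return max(nbest, key=total)


# Toy 2-best list and a stand-in scorer that prefers the word "one".
nbest = [(["nine", "one"], -10.0), (["nine", "won"], -9.5)]
toy_scorer = lambda words: -1.0 if "one" in words else -4.0
print(rescore_nbest(nbest, toy_scorer))
```

With `weight=0` the baseline ranking is returned unchanged, which gives a simple sanity check when plugging in a new scorer.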

Metrics and diagnostics

• Word and syllable error statistics

• Detection statistics and error distribution across speakers (and other conditions that are deemed to be important)

• Comparison to human performance

• Running scores on dev sets within group, held-out evals at least annually (NA-sayer wants weekly)
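For the word error statistics, a minimal sketch of the standard word error rate is given below: the edit distance between reference and hypothesis word sequences, divided by the reference length. The function name and the whitespace tokenization are assumptions; the same routine applies to syllables if both strings are tokenized into syllables instead.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / len(reference).

    Both arguments are whitespace-separated strings; the reference must
    contain at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("one two three four", "one too three"))
```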

Connection to RT evals

• Rescore output of SRI system

• In later years work more closely with RT team to transfer most successful ideas

• Feedback from RT experience (error diagnostics) is also important

Summary

• An alternative view of acoustic processing for ASR for features+models

• Pushing the envelope … aside

• Matching new front end characteristics with appropriate statistical models

• Diagnostic evaluations a key feature

Closing Thought

“When you come to a fork in the road, take it.”

- Yogi Berra
