Pushing the Envelope: Rethinking Acoustic Processing for Speech Recognition
A. Executive Summary
State-of-the-art speech recognition systems continue to improve, but the core acoustic operation
remains the same: a single feature vector (derived from the power spectral envelope over a 20-30
ms window, stepped forward by ~10 ms per frame) is compared to a set of distributions derived
from training data for an inventory of sub-word units (usually some variant of phones). This step
has remained essentially unchanged for decades, and we believe that this limited perspective is a
key weakness in speech recognizers. Note for instance that, under good conditions, human phone
error rate for nonsense syllables has been estimated to be as low as 1.5% [Allen 1994], as
compared with rates over an order of magnitude higher for the best machine phone recognizers
[Lee & Glass 1998, Deng & Sun 1994, Robinson et al. 1994]. In this light, our best current
recognizers appear "half-deaf", only making up for this deficiency by incorporating strong domain
constraints. To develop generally applicable and useful recognition techniques, we must
overcome the limitations of current acoustic processing. Interestingly, even human phonetic
categorization is poor for extremely short segments (e.g. <100 ms), suggesting that analysis of
longer time regions is somehow essential to the task. This suggestion is supported by information
theoretic analysis showing discriminative conditional dependence between features separated in
time by up to several hundred milliseconds [Yang et al. 2000, Bilmes 1998].
In the development of speech recognition, there have been certain innovations in feature
processing, such as delta calculation, cepstral mean normalization, or RASTA [Hermansky &
Morgan 1994], which have been able to effect valuable performance improvements with minimal
changes to the statistical processing. In general, however, signal processing and statistical
modeling techniques have co-evolved, making it unlikely that a modification in one domain will
significantly improve performance without a corresponding change in the other. This is clearly
illustrated for the case of longer temporal support, which is most simply introduced by using
highly overlapped analysis windows (e.g., 500 ms processing windows with a 10 ms frame step).
Unfortunately, successive frames of the resulting features are highly correlated when compared to
a standard 20-30 ms window, and this increased violation of the conditional independence
assumptions made in the statistical processing leads to the introduction of "tweak factors" at
every time scale in an effort to compensate. The close coupling of signal processing and statistical
modeling leads us to propose a balanced effort between the two areas — radical modifications to
the front end processing, along with a corresponding restructuring of the statistical models to
accommodate these modifications.
The proposed work consists of two tasks:
1) Signal Processing: Replacing the current notion of a spectral energy-based vector at
time t by a set of variables based on posterior probabilities of broad acoustic categories
for long-time and short-time functions of the spectro-temporal plane, where long-time
refers to periods of up to a second. Depending on the categories, nontraditional variables
such as pitch-related features may be useful, and other "style" variables such as speaking
rate could also be incorporated. These features will result in multiple streams of
probabilistic information.
2) Statistical Modeling: Modifying the statistical models both to incorporate these new
multirate front ends, and to handle explicitly areas of missing information, i.e. portions of
the time-frequency plane that are obscured by acoustic degradation. This task will also
cover the discriminative learning of dependence across streams, and the exploitation of
this information for optimal combination design. In addition, the work will result in new
event-based models, in the sense of allowing acoustic cues of multiple time spans
associated with one unit.
We will pursue a number of techniques, broadly inspired by both human audition and our wish to
develop compatible statistical underpinnings that work together within a unifying multirate (and
multistream) framework, itself derived from our understanding of perception. We seek to replace
energy-based features, the common currency of acoustic front ends, by values reflecting the
posterior probability of different signal categories, themselves defined by data-driven techniques.
These probabilistic estimates will be supported on time-frequency windows drawn from a large
and flexible family, selected by experimental results from the auditory system, by data-adaptive
decompositions, and by empirical evaluation. Further input from hearing research will come via
novel features developed to reflect pitch, rate and other perceptual attributes. These approaches
are particularly important for the case of conversational speech, which exhibits the greatest
variability in speaking rate and vocal quality, and which must be analyzed in terms of parameters
that vary across the full range of time scales, from phones through to phrases and beyond. The
best approach to variability in realization is to have a wide range of alternative information
sources from which to estimate the speech content, allied with a combination strategy able to
switch opportunistically amongst the most useful sources at any given instant.
Our proposal seeks to address the impairment of current speech recognizers through a radical
reconstruction of the interface between sound and search. The representation of speech as a
sequence of spectral envelopes will be pushed aside. Pronunciation and grammar constraints,
while invaluable for reducing word error rates, can often serve to mask basic problems in the
acoustic classification, and thus we will not explore their extension in this work. Instead, we will
concentrate on the modeling of the most basic speech sounds, with application to word
recognition tasks using systems that hold constant the later stages of processing (search,
pronunciation, language modeling, etc.). A solid foundation at this level — in itself a novel
approach in speech recognition — will accrue further benefits when constraints are re-applied.
The teams working on the two tasks in this proposal will interact closely, through membership
overlap, frequent meetings, and joint work on internal evaluations. Each team includes strong
senior players, known for their innovations in these areas. Task 1 will include Hynek
Hermansky; Task 2 will include Mari Ostendorf and Herve Bourlard. Evaluation strategies for
both tasks will be developed by George Doddington. PI Nelson Morgan, along with Dan Ellis and
Kemal Sonmez, will work on both Tasks 1 and 2. A separate proposal with SRI as the prime site
will focus on Rich Transcription, and we will ensure that SRI can exploit the technologies
developed in this Novel Approaches effort when they bear fruit.
While the approaches proposed here comprise a radical departure from mainstream methods, we
feel that "pushing the envelope" is a difficult but necessary step to achieve the dramatic
reduction in word error rate that this program seeks.
B. Innovative Claims
1) Use of broad category posterior probabilities over time-frequency patches rather
than short-window spectra or cepstra as basic features.
2) As a particular case, use of long temporal regions and limited spectral regions,
where "long" means analysis windows from 100 ms to 1 second, and "limited"
means 1-3 critical bands.
3) An alternate approach to time frequency analysis will incorporate a signal
adaptive front end, using an information-theoretic criterion to cluster temporal
regions with a local cosine basis tree.
4) Integration of features from analysis using differing temporal extent, using
multirate statistical models.
5) Integration of methods from Computational Auditory Scene Analysis (CASA)
into this new ASR framework.
6) Development of multiple streams based both on the methods referred to above
and on a criterion of minimal common errors between streams, subject to an
overall constraint on errors to avoid a trivial but useless solution.
7) Development of an event-based statistical model, in which event timing but not
extent is critical.
8) Development of multistream models incorporating all possible stream
combinations.
9) Use of partial information techniques so that low-confidence regions of time-
frequency are not given significant weight.
10) Development of task choice/evaluation methods to match the goals of this project.
C. Statement of Work
Core research - The contractor will research, develop, evaluate and document innovative
acoustic processing algorithms for ASR, including time-frequency tradeoffs for front end signal
processing, and development of statistical algorithms and models to optimally incorporate the
new front end features.
Evaluation — The contractor will develop methods for the rapid evaluation of progress in
intermediate stages of analysis, in the course of which small tasks will be proposed and
developed. In addition, more traditional means of ASR evaluation will be employed throughout
the project. Together, these methods will be used to guide progress within the project. In the 4th
and 5th years these procedures will be augmented by feedback from results in the governmental
evaluations of the Rich Transcription task, as the best Novel Approaches will be integrated into
the SRI-based evaluation system for the later years of the project.
Program Collaboration — The contractor will encourage collaboration within its team (in
particular with frequent internal meetings, conference calls, team access to a common Web site,
and the use of CVS or similar mechanisms to facilitate the development of common code
wherever possible). The contractor will also attend and support EARS meetings to facilitate
program level collaboration.
Reporting — The contractor will prepare and submit deliverable reports describing the progress,
results, and technical details of task-related activities.
Project Management — The contractor will manage the EARS Novel Approaches effort
including budgeting, scheduling, resource specification, and financial tracking of the project Tasks.
The contractor will coordinate, consolidate, and submit Task-level Status Report and Project
Summary deliverables as defined in the PIP.
D. Technical Rationale
Introduction
Word error rates for ASR are still too high in general, and particularly so for conversational
speech and other speech recorded under the imperfect acoustic conditions typical of many
military and commercial applications. The single largest contribution to the significant
improvements obtained by researchers over the last 5 years has been due to adaptation (in the
most general sense) over substantial amounts of testing data, and while this can be invaluable, it is
often not applicable for tasks that require excellent performance regardless of the data available
for each speaker. Even with these enhancements the performance is too poor for many
applications, and minor refinements of the basic methods are unlikely to yield the needed
improvement of conversational speech recognition down to the 5-10% range in word error rate,
particularly under general acoustic conditions (e.g., cell phone, speaker phone, and/or noisy
acoustic background).
It is further unlikely that any single "magic bullet" will be able to provide the desired degree of
improvement. Rather, as we have seen from the past, multiple innovations will be required to
provide significant change. However, there is an additional problem to be faced — the problem of
the so-called local minimum in error rates for complete ASR systems. As noted in [Bourlard,
Hermansky & Morgan 1996], once a system or set of approaches has been extensively
optimized, nearly any change to the system will lead to an increase in word error rates. While
most changes will simply be based on unsuccessful ideas, there is a small subset of initially
unpromising novel approaches that could lead to fundamental improvements in the complete
system once the consequences of the change are better understood. The core system design in
nearly every current state-of-the-art ASR system uses cepstral coefficients derived from an
auditory-scaled filter bank, computed over a 20-30 ms analysis window once every ~10 ms, with
Gaussian mixture models trained on such features to provide acoustic likelihoods (combined in
later components with the prior linguistic knowledge of pronunciation and grammar models). The
signal processing and statistical components for such a system have co-evolved so that it is
difficult to improve performance by modifying one without a corresponding change in the other.
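For concreteness, the conventional front end just described can be sketched as below. This is a
generic Python/NumPy illustration with typical parameter values (8 kHz sampling, 25 ms window,
10 ms hop, 23 mel filters, 13 cepstra); it is our sketch of the standard pipeline, not code drawn
from any system named in this proposal.

    # Conventional ASR front end: auditory-scaled (mel) filterbank energies
    # over a 25 ms window every 10 ms, followed by a DCT to give cepstra.
    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filt, n_fft, sr):
        # Triangular filters spaced evenly on the mel (auditory) scale.
        edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filt + 2))
        bins = np.floor((n_fft + 1) * edges / sr).astype(int)
        fb = np.zeros((n_filt, n_fft // 2 + 1))
        for i in range(n_filt):
            lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
            fb[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
            fb[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
        return fb

    def cepstral_features(x, sr=8000, win=0.025, hop=0.010, n_filt=23, n_ceps=13):
        n_win, n_hop, n_fft = int(sr * win), int(sr * hop), 256
        fb = mel_filterbank(n_filt, n_fft, sr)
        # DCT-II basis: one row per cepstral coefficient.
        basis = np.cos(np.arange(n_ceps)[:, None] * np.pi
                       * (np.arange(n_filt) + 0.5) / n_filt)
        frames = []
        for s in range(0, len(x) - n_win + 1, n_hop):
            seg = x[s:s + n_win] * np.hamming(n_win)
            power = np.abs(np.fft.rfft(seg, n_fft)) ** 2
            frames.append(basis @ np.log(fb @ power + 1e-10))
        return np.vstack(frames)  # one 13-dim vector per 10 ms step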
Consider a much simpler problem than conversational speech recognition, the recognition of read
connected digits. Surprisingly, for the case of noisy digit strings such as is explored in the Aurora
project [Hirsch & Pearce 2000], error rates are still extremely high, averaging 13.1% for the
baseline system when tested over SNRs between 20 and 0 dB. Therefore, even for cases where
the range of pronunciation variability is small and where there is very little to improve on in the
language model, the performance is poor, even with the best current systems. This is true
despite the use of a Maximum a Posteriori (MAP) recognition scheme, in which the most
probable model will always be chosen. The suboptimality in practice must arise from incorrect
models, in the sense that the statistics do not well represent the data that will be seen in
recognition. This needs to be corrected in two ways: first, the data representation (features)
needs to be chosen so that the ultimate hypotheses are invariant over a range of conditions that
may not be seen during the training phase; and second, the statistical models must be developed
to properly represent the distributions and dependencies that will be observed from the stream
(or streams) of new features.
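To fix notation for this argument (a standard gloss of the MAP rule, not new material), the
recognizer chooses the hypothesis

    \hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} p(X \mid W)\, P(W)

where X is the acoustic observation sequence, p(X | W) the acoustic likelihood, and P(W) the
linguistic prior. The rule is optimal only when these models match the true distributions of the
test data; when they do not, the most probable model under the trained statistics need not give
the most probable transcription, which is exactly the mismatch just described.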
Summarizing these arguments, we need a coordinated effort in signal processing and in statistical
modeling in order to successfully provide an improved alternative to today's state of the art.
However, we are left with a critical difficulty: When essentially all changes to the current
standard are likely to provide an increase in the error rate (or small decreases when used together
with the older system with some combination methods such as ROVER [Fiscus 1997]), how can
we determine the most likely directions for ultimate significant improvements? The core idea
common to all the pieces proposed in this document is to move beyond the current framewise
orientation of speech recognizers. Why should this be desirable (and not just different)? To begin
with, typical cepstral methods are inherently sensitive to spectral amplitudes, which are affected
by channel characteristics, noise, and reverberation. They also are sensitive to modification based
on context and speaking style. Temporal information is incorporated in a very specific and
limited way in the first order Markov models that we use, and it is likely that there is much more
that is fundamental to speech patterns that could be incorporated.
For these reasons, we are proposing to focus on the incorporation of acoustic information from
much longer time regions (100 ms to 1 s), using a number of different approaches to feature
extraction from the time-frequency plane. Accompanying this core approach are a number of
other key ideas that our preliminary efforts have suggested: multirate statistical models, partial
information techniques, and models of higher-order auditory processing such as pitch perception
and source formation. Additionally, our experience over the last 5 years has convinced us that
there is no single ideal form of front end signal processing, so that a multistream approach will
be used. Unlike earlier efforts in which arbitrary multiple front ends were used, we will be
focusing on developing a rational approach to the design of these front ends with the criterion of
minimizing the number of errors in common (subject to an overall constraint on errors to avoid a
trivial but useless solution).
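As one sketch of how that stream-design criterion might be operationalized (our illustration;
the function and threshold are hypothetical, not a committed algorithm), candidate stream pairs
can be scored by the fraction of frames on which both streams' classifiers err, with a cap on
each stream's individual error rate:

    import numpy as np

    def common_error_score(pred_a, pred_b, labels, max_stream_err=0.5):
        # pred_a, pred_b: per-frame class decisions from two candidate streams.
        err_a = pred_a != labels
        err_b = pred_b != labels
        # Overall constraint on errors: reject any stream so weak on its own
        # that minimizing shared errors becomes the trivial, useless solution.
        if err_a.mean() > max_stream_err or err_b.mean() > max_stream_err:
            return np.inf
        # Design criterion: the fewer errors in common, the more the streams
        # can correct each other when combined.
        return np.logical_and(err_a, err_b).mean()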
We have formulated the proposed work in terms of 2 tasks. We are proposing sufficient
personnel in each area to insure progress, but there will be enough overlap that a coordinated
effort between the task groups can be assured. The two tasks are:
1) Signal Processing: design and instantiate a new acoustic front end to calculate functions
of the time-frequency plane, particularly with a much longer time support than is
typically used for ASR. The output of the front end will be more like posterior
probabilities of broad classes than like spectral energies or cepstra.
2) Statistical Modeling: design and instantiate methods to handle incomplete information,
multiple rate data, and multiple streams of the posteriors based on the new time-
frequency functions.
Late in the project we will also take the best-scoring approaches from our internal evaluations
and provide them to our colleagues working on Rich Transcription for application to the ultimate
goals of the EARS program.
The proposed solutions within the two tasks are further summarized below.
Task 1: Signal Processing
A core idea for this task is to replace the current notion of a spectral energy-based vector at
time t with a vector based on posterior probabilities of broad categories for long-time (up to a second or
more) and short-time functions of the time frequency plane. Other analyses will use time and
frequency windows intermediate between these extremes. Depending on the categories,
nontraditional variables such as pitch-related features could be useful, and attributes such as
speaking rate could also be included in the classification. These features may be represented as
multiple streams of probabilistic information.
• Multiple time-frequency trade-offs: As we have noted, the dominant representation of
sound information is energy spectra calculated over brief time frames. This has been
advantageous in allowing hidden Markov models to accommodate timing variations
encountered in real speech, but at the cost of making certain kinds of temporal information
awkward or impossible to employ. Our recent work has shown effective and highly
complementary recognition by pushing the time-frequency balance to the opposite extreme:
TRAPS make independent first-stage classifications based only on information in a single
narrow frequency band, measured over an extended time window of up to 1 second
[Hermansky & Sharma 1998]. TRAPS are competitive with spectral features for clean
speech, and can halve the error rate for bandlimited noise corruption [Hermansky & Sharma
1999]. There is no particular reason to favor either dimension exclusively; a framework able
to integrate partially-dependent information will allow us to use both these views of the
signal, as well as a range of other analyses that use time and frequency windows intermediate
between these extremes. In particular, we will also use local cosine trees to "zoom" in and out in
time over a multitude of scales (bandwidths), weighted in probability by the entropy
criterion used to grow the multiresolution tree.
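A minimal sketch of the TRAPS-style analysis described in this bullet (array shapes are
illustrative; in the actual technique each band's temporal pattern feeds a band-specific neural
network, per [Hermansky & Sharma 1998]):

    import numpy as np

    def temporal_patterns(log_band_energy, half_span=50):
        # log_band_energy: (n_frames, n_bands) log energies in single critical
        # bands, one frame per 10 ms. For each band, the ~1 s trajectory
        # centered on each frame (101 points) is the input to that band's own
        # first-stage classifier; a second stage merges the band posteriors.
        n_frames, n_bands = log_band_energy.shape
        out = np.empty((n_frames - 2 * half_span, n_bands, 2 * half_span + 1))
        for t in range(half_span, n_frames - half_span):
            window = log_band_energy[t - half_span:t + half_span + 1, :]
            out[t - half_span] = (window - window.mean(axis=0)).T  # mean removal
        return out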
• Auditory-based signal cues: In a further re-evaluation of the information extracted from
the basic sound data, we will consider looking beyond the coarse spectral energy (in all its
time-frequency guises) to extract information from the finer time structure within each band.
This information is demonstrably important to listeners, as can be witnessed by the "blurry,
whispering crowd" effect that is the best that can be resynthesized from current speech
recognition representations [Ellis-surfsynth 1997]. Pitch information is noteworthy in
allowing listeners to recognize voiced segments as such, and particularly in helping to glue
together different parts of the signal energy that properly belong to the same voice, and
separating them from other interfering energy: Approaches of this kind, employing pitch,
onset, modulation and spatial cues, have been developed under the banner of computational
auditory scene analysis (CASA) [Cooke & Ellis 2001], and will be integrated as a further
basis for classification within our framework.
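For example, the simplest of these cues, a per-frame pitch and voicing estimate, might be
sketched as follows (our illustration; the lag range and plain autocorrelation method are
assumptions, not taken from the CASA systems cited):

    import numpy as np

    def pitch_and_voicing(frame, sr=8000, f_min=60.0, f_max=400.0):
        # Autocorrelation pitch detector for one analysis frame (the frame
        # must contain more than sr/f_min samples). Returns an f0 estimate
        # and a voicing strength in [0, 1]: the kind of cue used to glue
        # together energy belonging to one voice and reject interference.
        x = frame - frame.mean()
        ac = np.correlate(x, x, mode='full')[len(x) - 1:]
        if ac[0] <= 0.0:
            return 0.0, 0.0
        ac = ac / ac[0]
        lo, hi = int(sr / f_max), int(sr / f_min)
        lag = lo + int(np.argmax(ac[lo:hi]))
        return sr / lag, float(max(ac[lag], 0.0))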
• Principled multistream framework: The simplest approach to recognition is to identify a
single cue (such as broad short-time spectral profile) and use it for classification, but such an
approach is intrinsically brittle. Human perception apparently uses a far more robust
approach of employing numerous, redundant, parallel cues which are integrated to form a
single decision [Minsky 1986]. Recently, speech recognition systems based on multiple
independent recognizers have consistently and significantly outperformed other systems
[Fiscus 1997, Singh et al. 2001], but their hypothesis-level combinations are heuristic.
Instead, we will develop a probabilistically-rigorous system for combining many sources of
information, with different degrees of mutual dependence, to yield an optimal classification.
In this way, errors that occur independently in different information streams can be
discounted, and weak but consistent evidence from multiple sources can reinforce a correct
decision. We see relative error rate improvements of 25% or more over the best single stream
when complementary information streams are combined at the appropriate intermediate level
(for the noisy digits Aurora task) [Ellis & Bilmes 2000].
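A first step toward such a combination (a sketch under a weighted conditional-independence
assumption, simpler than the rigorous dependence modeling we propose) merges per-stream class
posteriors in the log domain, down-weighting streams judged unreliable:

    import numpy as np

    def combine_streams(posteriors, weights=None):
        # posteriors: list of (n_frames, n_classes) arrays, one per stream.
        # Weighted log-domain product: errors occurring independently in one
        # stream are discounted, while weak but consistent evidence from
        # several streams reinforces the correct class.
        n = len(posteriors)
        w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights)
        log_p = sum(wi * np.log(pi + 1e-10) for wi, pi in zip(w, posteriors))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(log_p)
        return p / p.sum(axis=1, keepdims=True)    # renormalize per frame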
For all of these, our vision is to displace spectral energy magnitude as the common element
underlying all acoustic modeling, replacing it with posterior probabilities of particular speech
classes — values with specific meaning that can be estimated from any kind of partial-signal cue,
and amenable to optimal combination via well-known procedures. A key part of the research will
be the determination of archetypical speech categories that will be used for generation of the low-
level posteriors. Our preference is to use data-driven methods to determine these categories,
rather than relying on linguistic theory-dependent classes such as articulatory features; however,
if data-driven approaches lead to something very close to traditional categories, we will use them.
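As a sketch of the data-driven route (illustration only; discovering the real category inventory
is itself part of the proposed research), one can cluster training frames and read soft cluster
memberships as the low-level posteriors:

    import numpy as np

    def learn_broad_categories(frames, n_classes=8, n_iter=25, seed=0):
        # Simple k-means over training frames; each centroid stands in for a
        # broad acoustic category. Returns the centroids and soft,
        # posterior-like memberships derived from squared distances.
        rng = np.random.default_rng(seed)
        centers = frames[rng.choice(len(frames), n_classes, replace=False)].copy()
        for _ in range(n_iter):
            d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(axis=1)
            for k in range(n_classes):
                members = frames[assign == k]
                if len(members):
                    centers[k] = members.mean(axis=0)
        logits = -((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        return centers, p / p.sum(axis=1, keepdims=True)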
Task 2: Statistical Modeling
A key goal in Task 2 is modifying the statistical models to both incorporate these new (multirate)
front ends, and to explicitly handle missing information (i.e., portions of the time-frequency
plane that are obscured by degradation such as noise or reverberation). This task will also include
the development of discriminative learning of dependence across streams, and incorporation of
this information for optimal combination.
• Partial information recognition: One of the greatest weaknesses of current speech
technology is the tacit assumption that the signal consists exclusively of the single voice of
interest, or, failing that, in trying to normalize the problem in such a way as to approximate this
condition. The strength, however, of a recognition scheme based on multiple alternative
information streams is that individual streams can become unreliable or unavailable without
compromising the overall classification — provided their unreliability is correctly detected.
Thus, detection and labeling of non-target information is central at every level of this
approach. To address a classification problem in which a certain, dynamically-varying
portion of the information is unavailable, we will use the optimal tools of the missing data
formalism, recently developed specifically to address situations in which high levels of
nonstationary noise can temporarily obscure any part of the signal. This approach has been
shown as extremely successful in small-vocabulary, high-noise conditions such as the Aurora
task [Barker, Green & Cooke 2001]. Classification that uses acoustic information only when
it is informative, and backs off to contextual inference for other dimensions, offers the best
promise of rising to human levels of recognition in the phone-detection task, and by extension
to the classification of larger units.
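The core computation of the missing data formalism can be sketched as follows (diagonal
Gaussian class models are assumed here purely for brevity): dimensions flagged as unreliable
are marginalized out of the class score, so only informative acoustic evidence is used:

    import numpy as np

    def masked_log_likelihood(x, reliable, means, variances):
        # x: one observation vector; reliable: boolean mask, True where a
        # dimension is judged uncorrupted; means, variances: (n_classes,
        # n_dims) diagonal Gaussians. Marginalizing a diagonal Gaussian over
        # the unreliable dimensions simply drops them from the score.
        r = np.asarray(reliable, dtype=bool)
        ll = -0.5 * (np.log(2.0 * np.pi * variances[:, r])
                     + (x[r] - means[:, r]) ** 2 / variances[:, r])
        return ll.sum(axis=1)  # one log-likelihood per class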
• Statistical models for multistream combination: In previous work on subband
(multiband) recognizers, we have developed methods for the combination of information from
disjoint parts of the spectrum. In one of these approaches, developed at IDIAP, likelihoods
are estimated by integrating over all possible reliable stream subsets. We will conduct research
to determine if such methods are useful outside of the multiband example, and in particular
for the streams developed in Task 1. In other work at OGI and ICSI, neural networks and
simple combination rules have been used to integrate information from multiple streams
outside of the multiband context, and we will incorporate this experience in our work.
Conditional auxiliary variables can also be used to generate functions that estimate the
reliability of streams for their combination. Finally, we will study a new model called HMM2
which is a mixture of HMMs, which in principle could permit dynamic subband
segmentation as well as optimal recombination [Bourlard et al 2000, Bengio et al 2000].
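The full-combination idea referred to above can be sketched as an expectation over all possible
subsets of reliable streams (uniform subset priors are assumed here purely for illustration):

    import numpy as np

    def full_combination(subset_log_likelihoods, subset_priors=None):
        # subset_log_likelihoods: dict mapping a frozenset of stream indices
        # to per-class log-likelihoods from an expert trained on exactly that
        # stream subset. The total likelihood integrates over all candidate
        # "reliable subset" hypotheses, weighted by their prior probability.
        subsets = list(subset_log_likelihoods)
        if subset_priors is None:
            subset_priors = {s: 1.0 / len(subsets) for s in subsets}
        stacked = np.stack([np.log(subset_priors[s]) + subset_log_likelihoods[s]
                            for s in subsets])
        m = stacked.max(axis=0)  # log-sum-exp over subsets, done stably
        return m + np.log(np.exp(stacked - m).sum(axis=0))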
• Multirate and event-based models for recognition: The obvious next step after dividing
the observation space into multiple streams is to allow multiple rates, but little work has been
done with such observations because of the synchronization problems in decoding associated
with having different rates. In this work, we propose two alternative models that provide a
mechanism for using multirate features: a multirate model which is essentially a coupled set
of HMMs and a multiresolution model which has a switching mechanism between HMM
streams associated with the different rates to accommodate signal-dependent analysis. In
work on machine toolwear (wear on the edge of a machining tool), UW has shown that the
multirate model outperforms a standard Gaussian mixture HMM [Fish 2001]. A key
difference in the application of this model to speech is that it is problematic to allow
segmentation times only at the slowest rate, so we introduce the notion of event-based
models with fixed temporal extent but variable-rate analysis within each feature stream and
changing temporal resolution across streams. At SRI, related preliminary work has been done
on a kind of HMM that can handle a multiresolution feature stream, i.e. where there is a
single stream but the resolution varies as a function of time. In both cases, much work is still
needed before the methods will be useful for large vocabulary recognition, including research
on parameter tying and discriminative stream coupling. Most fundamentally, we need to gain
experience in applying this approach to automatic speech recognition.
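The synchronization problem itself can be made concrete with a deliberately crude toy (our
sketch, far simpler than the coupled and event-based models proposed): hold each slow-stream
observation fixed across the fast frames it spans so the two rates can be scored
frame-synchronously:

    import numpy as np

    def merge_two_rates(fast_ll, slow_ll, ratio=10):
        # fast_ll: (n_fast, n_classes) per-frame log-likelihoods at a 10 ms
        # step; slow_ll: (n_slow, n_classes) at a 100 ms step. Repeating each
        # slow frame 'ratio' times forces frame synchrony -- exactly the
        # crude workaround that the multirate models above are meant to
        # replace with proper coupling.
        up = np.repeat(slow_ll, ratio, axis=0)[:len(fast_ll)]
        return fast_ll + up  # joint per-frame log-likelihood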
Throughout these subtasks runs the consistent theme: constructing both a formal and a practical
foundation for the incorporation of multiple, incomplete acoustic streams, including streams
having different temporal support.
It could be said that, in general, the work proposed here is motivated much more by speech
perception than speech production. In other words, we take a very different approach than in
the recent trends in speech recognition that involve articulatory modeling. Both approaches are
well motivated from the perspective of trying to account for the variability observed in
conversational speech, in providing mechanisms for handling phones of widely different
durations and articulation quality. However, our perceptually-motivated models have the added
advantages that (1) they can better account for a signal that is not due entirely to the articulators,
since there is noise (including multispeaker interference) and reverberation, and (2) they are more
amenable to discriminative training techniques. Of course, this argument does not preclude using
both approaches in a single system, but we believe that this would be too much to tackle in one
project. Like the other important efforts in pronunciation modeling or language modeling that we
will not work with here, the proposed project has the potential to yield a key component for
speech recognition systems of the future.
E. State-of-the-Art System
Several systems are available for use in the proposed project. First and foremost, due to the ICSI-
SRI collaboration, the complete SRI system is available for our use, including on-site SRI
researchers at ICSI who are expert in its use and modification. It will be used for evaluations at
ICSI and SRI of methods developed at all the sites, and is further described below. Team
members at all sites will also be able to use HTK, the UW Graphical Models Toolkit (GMTK),
and a variant of the ICSI hybrid neural network/HMM system that was used successfully in the
1998 Hub 4 evaluation. In addition, multirate modeling software (to be developed at UW as part
of this project) will be available to all sites.
The target Rich Transcription system for transferring the successful results of this project will be
the SRI DECIPHER™ system, which incorporates a combination of state-of-the-art techniques.
The DECIPHER™ system has consistently exhibited state-of-the-art performance throughout
several years of government-administered tests, and is distinguished by its detailed modeling of
pronunciation variation, its robustness to noise and channel distortion, and its multilingual
capabilities. These features make the system accurate in recognizing spontaneous speech of many
styles, dialects, languages, and noise conditions.
Among the techniques employed are N-best rescoring [Murveit et al. 1993] and minimum word
error decoding and posterior maximization in confusion networks [Stolcke et al. 1997, Mangu et
al. 2000]. SRI last participated in the Hub-4 evaluations in 1998, with a WER of 21%. The system later underwent
a major overhaul prior to the 2000 Hub-5 evaluation, resulting in a 24% relative WER reduction
on that task, and a 2001 Hub-5 eval performance of 29% WER. The Decipher system continues
to evolve and on the DARPA-sponsored 2001 SPINE evaluations achieved a WER of 27%, the
best performance of any submitted system.
OGI: The Anthropic Signal Processing Laboratory at OGI was established 9 years ago by Prof.
Hynek Hermansky. Hermansky’s early work on processing of corrupted speech introduced data-
guided speech analysis techniques, which later evolved into data-guided RASTA filters
[Hermansky & Morgan 1994] and data-guided spectral projections [Hermansky & Malayath
1998]. Multi-band recognition was introduced and extensively studied in collaboration with
Bourlard in Switzerland and Morgan at ICSI in 1995 and later evolved into the TRAP technique
[Hermansky & Sharma 1998], which provides robustness to many types of spectral distortion and
also addresses the context dependency of phonemes. A nonlinear generalization of early data-guided
techniques in the form of Feature Nets (tandem approach) was introduced with Ellis of
ICSI/Columbia [Hermansky et al. 2000]. The group successfully participates in NIST Speaker ID
evaluations, in DARPA’s SPINE evaluations, and in telecommunication industry standards
(ETSI) activities, and collaborates most intensively with ICSI, CMU’s robust speech recognition
group, IIT Madras (India), CSLP at Johns Hopkins University, and CSLU group at OGI.
Columbia: The Laboratory for Recognition and Organization of Speech and Audio was
established within Columbia’s Electrical Engineering department by Professor Ellis when he
moved there in 2000. Combining the statistical pattern recognition and learning techniques of
speech recognition with a broader range of audio processing techniques drawn from audio
engineering and auditory modeling, the lab addresses information extraction from all kinds of
sound signals. In addition to development of the Tandem approach to speech acoustic modeling,
the best performing system in the Aurora Eurospeech Special Event in September 2001 [Ellis &
Reyes 2001], current projects in the lab span a range of topics from speech through to music
analysis and clustering, to alarm sound detection and classification.
IDIAP: IDIAP (Dalle Molle Institute for Perceptual Artificial Intelligence) is a semi-private
research institute affiliated with the Swiss Federal Institute of Technology (EPFL) at Lausanne
and the University of Geneva and conducts research in the areas of speech and speaker
recognition, computer vision and machine learning. Speech recognition systems developed at
IDIAP are based on Hidden Markov Models (HMM) and on hybrid systems combining HMMs
and Neural Networks (NNs). To improve the performance of those systems, the speech group
research activities include the study of robust speech analysis techniques (e.g., measuring
information on different window lengths and using multiscale systems), robust speech modeling
(e.g., multi-stream approaches), and robust decision rules (e.g., measures of confidence).
UW: In 1999, the Department of Electrical Engineering at the University of Washington made a
major commitment to building a speech technology research program, hiring Mari Ostendorf and
two other faculty members who founded the Signal, Speech and Language Interpretation (SSLI)
Laboratory, building on the program that she had created 12 years before at Boston University.
There are currently 21 researchers in the lab, which is dedicated to solving core problems in
speech and language technologies, facilitating multidisciplinary research, and providing a broad
educational experience for students. Prof. Ostendorf’s research contributions include: segment-
based (or, trajectory) models of acoustic parameters for ASR, dependence models for speaker
adaptation, sentence-level mixture language models for topic and dialog structure, use of out-of-
domain data in language modeling, computational modeling of prosody for speech recognition and
synthesis, integration of prosody and parsing, and information extraction from speech. Her BU
group participated in several DARPA evaluations (in collaboration with BBN); she works on
Hub 5, and contributed to the UW SPINE evaluation effort this fall, led by Prof. Jeff Bilmes.
R. Technology Transfer
As noted in Section K (Resources Offered), each of the participating sites will make available to
the research community a number of resources resulting from our research. Particularly through
SRI, we have excellent pathways for transfer to government programs. Finally, nearly all sites
have close relations with one or more commercial entities (e.g., Qualcomm for ICSI) who will be
very interested in positive results in this project.
In some sense, though, the strongest route for technology transfer from a high-risk algorithms
program such as this is through our link with the SRI-centered Rich Transcription project, which
will be incorporating the most successful of our algorithms in their systems in years 4 and 5 of
their project.
S. Government-owned Resources
The Government-Furnished Equipment (GFE) identified below is required by our team for the
performance of the proposed effort, and should be included in the terms of a resulting contract.
This GFE has not been included in the price of this proposal. Our performance depends on our
team receiving approval to either transfer these items to the resulting contract or to allow for our
use of these items on a rent-free, non-interference basis. SRI has a pending bid into SPAWAR to
purchase this equipment; if the bid is not accepted then these items will need to be transferred to
the resultant contract.
Organization | Description | Tag Number | Accountable Contract Number
SRI | Sun Disk Drive | G443067 | N66001-94-C-6048
SRI | R-Squared Disk Drives | G443126, G444294, G444295 | N66001-94-C-6048
SRI | Seagate Disk Drives | G443127, G443129 | N66001-94-C-6048
SRI | Sun Computers | G443879, G443880, G442408 | N66001-94-C-6048
SRI | Seagate Disk Drives | G444136, G444137 | N66001-94-C-6048
SRI | R-Squared SCSI Drives | G444138, G444143, G444144 | N66001-94-C-6048
SRI | Sun Computers | G444148, G444316, G444141 | N66001-94-C-6048
SRI | Procom Hard Drives | G444376, G444377 | N66001-94-C-6048
SRI | Acropolis Disks | G444639, G444641 | N66001-94-C-6048
SRI | Seagate Disk Drives | G444674, G444675 | N66001-94-C-6048
SRI | Sun Storage System | G444940, G444941, G444942 | N66001-94-C-6048
SRI | Dell Computer | G445103 | N66001-94-C-6048
SRI | Vanguard Disk Drives | G445165, G445372, G445373, G445374 | N66001-94-C-6048
SRI | Vanguard Hard Drives | G445371, G445372, G445373, G445374 | N66001-94-C-6048
SRI | Sun CD/Rom Reader | G441922 | N66001-94-C-6046
SRI | R-Squared Disk Drives | G443711, G443712, G443713 | N66001-94-C-6046
SRI | DEC Computer | G443724, G443725, G443726, G443727 | N66001-94-C-6046
SRI | R-Squared Disk Drive | G444089 | N66001-94-C-6046
SRI | Sun Computers | G444122, G444158 | N66001-94-C-6046
SRI | Seagate Disk Drive | G444160 | N66001-94-C-6046
SRI | Sun Storage Systems | G444943, G444944 | N66001-94-C-6046
SRI | Dell Computers | G445123, G445124, G445125 | N66001-94-C-6046
UW | Dell 868 Computers | 1175736, 1175709, 1175680 | N66019928924
UW | Acer 868 Computers | No tag | N66019928924
UW | Sun Ultra-5_10 Computers | 1175755, 1175756, 1175760 | N66019928924
All team sites requesting equipment (ICSI, SRI, UW, OGI, and Columbia) have included the
requisite letters of notification with this proposal indicating that we cannot provide new
information technology resources to support the development and evaluation activities that are
required. These letters fully comply with the requirements of the solicitation and are included as
Appendix B in the paper versions of this proposal.
T. Organizational Conflict of Interest
SRI International is currently providing technical support to several offices within DARPA. We
do not believe that any potential conflict of interest exists as these individuals are not directly
involved in the effort proposed herein. The individuals and the offices they support are listed
below:
Murray Burke DARPA/IXO Program Manager for High Performance Knowledge Base and
Rapid Knowledge Formation programs
Tim Grayson DARPA/TTO Program Manager for Digital RF Tags (DRaFT) program
William Coleman OSD C3I/SAPCO Sr. Technical Advisor to DARPA Deputy Director
William Schneider DARPA/MTO Program Manager for Optoelectronics
Richard Wishner Director DARPA/IXO
Thomas Strat IXO Program Manager (Expected effective date 1/21/02)
None of the other sites (UW, Columbia, OGI, ICSI, and IDIAP) are providing such support.
Appendix A: References
Allen, J.,
How do humans process and recognize speech?
IEEE Transactions on Speech and Audio Processing, 2(4): 567-577, Oct. 1994.
Barker, J., Cooke, M. and Ellis, D.,
Decoding speech in the presence of other sound sources,
Proc. ICSLP-2000, Beijing, October 2000
Barker, J., Cooke, M. and Ellis, D.,
Combining bottom-up and top-down constraints for robust ASR: The multisource decoder,
Workshop on Consistent and Reliable Acoustic Cues CRAC-2001at Eurospeech-2001, Aalborg, Denmark, September 2001.
Barker, J., Green, P. and Cooke, M.,
Linking ASA and robust ASR by missing data techniques,
Proc. WISP-2001, Stratford UK, 2001.
Bengio, S., Bourlard, H. and Weber, K.,
An EM algorithm for HMMs with emission distribution represented by HMMs,
2000.
Using multiple time scales in a multistream speech recognition system,
Proc. Eurospeech-97, I-3-6, Rhodes, 1997.
Ellis, D.,
The Weft: A representation for periodic sounds,
Proc. ICASSP-97, Munich, II-1307-1310, 1997.
Ellis, D.,
Listening to speech recognition — the Surfsynth home page,
Web page, http://www.icsi.berkeley.edu/~dpwe/projects/surfsynth, 1997.
Ellis, D.,
Using knowledge to organize sound: The prediction-driven approach to computational auditory
scene analysis and its application to speech/nonspeech mixtures,
Speech Communication 27 3-4, pp. 281-298, April 1999.
Ellis, D. and Bilmes, J.,
Using mutual information to design feature combinations,
Proc. ICSLP-2000, Beijing, 2000.
Ellis, D. and Reyes, M.,
Investigations into Tandem acoustic modeling for the Aurora task,
Proc. Eurospeech-01, Aalborg, Denmark, September 2001.
Ellis, D., Singh, R. and Sivadas, S.,
Tandem Acoustic Modeling in Large-Vocabulary Recognition,
Proc. ICASSP-01, Salt Lake City UT, May 2001.
Fiscus, J.,
A post-processing system to yield reduced word error rates: Recognizer Output Voting Error
Reduction (ROVER),
Proc. ASRU-97, Santa Barbara, 1997.
Fish, R.,
Dynamic models of machining vibrations, designed for classification of tool wear,
Ph.D. dissertation, Univ. of Washington, 2001.
Fosler-Lussier, E.,
Multi-Level Decision Trees for Static and Dynamic Pronunciation Models,
Proc. Eurospeech-99, Budapest, I-463-466, 1999.
Fosler-Lussier, E., and Morgan, N.,
Effects of Speaking Rate and Word Frequency on Conversational
Pronunciations,
Speech Communication Special Issue on pronunciation variation, 29 (2-4),
137-157, 1999.
Ghahramani, Z. and Jordan, M.,
Factorial Hidden Markov Models,
Machine Learning 29, 245-273, 1997.
Greenberg, S., Hollenback, J. and Ellis, D.,
Insights into spoken language gleaned from phonetic transcriptions of the Switchboard corpus,
Proc. ICSLP-96, Philadelphia, 1996.
Hagen, A.,
Robust speech recognition based on multi-stream processing,
Ph.D. Thesis, Swiss Federal Institute of Technology, Lausanne, December 2001,
also IDIAP Research Report, RR01-4.
Hagen, A, Morris, A. and Bourlard, H.,
From Multi-Band Full Combination to Multi-Stream Full Combination Processing in Robust
ASR,
Proc. ISCA ITRW Intl. Workshop on Automatic Speech Recognition: Challenges for the Next
Millennium, Paris, Sep. 18-20, 2000.
Hall, J., Haggard, M. and Fernandes, M.
Detection in noise by spectro-temporal pattern analysis,
Journal of the Acoustical Society of America, 76, 50-56, 1984.
Hatch, A.,
Word-Level Confidence Estimation for Automatic Speech Recognition,
M.S. Thesis, University of California at Berkeley, August 2001.
Hermansky, H., Ellis, D., and Sharma, S.,
Tandem connectionist feature stream extraction for conventional HMM systems,
Proc. ICASSP-2000, Istanbul, June 2000, III-1635-1638
Hermansky, H. and Malayath, N.
Spectral Basis Functions from Discriminant Analysis,
Proc. ICSLP’98, Sydney, November 1998.
Hermansky, H. and Morgan, N.,
RASTA Processing of Speech,
IEEE Transactions on Speech and Audio Processing, special issue on Robust Speech Recognition
2(4), 578-589, Oct., 1994
Hermansky, H. and Sharma, S.,
TRAPS — Classifiers of temporal patterns,
Proc. ICSLP-98, Sydney, III-1003-1006, 1998
Hermansky, H. and Sharma, S.,
Temporal Patterns (TRAPS) in ASR of Noisy Speech,
Proc. ICASSP-99, Phoenix AZ, March 1999.
Hermansky, H., Sharma, S. and Jain, P.,
Data-derived nonlinear mapping for feature extraction in HMM,
Proc. ASRU-99, Keystone CO, 1999
Hermansky, H., Tibrewala, S. and Pavel, M.,
Towards ASR on Partially Corrupted Speech,
Proc. ICSLP-96, 462-465, Philadelphia, October 1996.
Hirsch, H. and Pearce, D.,
The AURORA Experimental Framework for the Performance Evaluations of Speech
Recognition Systems under Noisy Conditions,
Proc. ISCA ITRW ASR2000, Paris, September 2000.
Janin, A., Ellis, D. and Morgan, N.,
Multi-stream speech recognition: ready for prime time?
Proc. Eurospeech 1999,. II-591-594, 1999.
Jurafsky, D., Wooters, C., Segal, J., Stolcke, A., Fosler, E., Tajchman, G. and Morgan, N.,
Using a Stochastic Context-Free Grammar as a Language Model for Speech Recognition,
Proc. ICASSP-95, 189-192, 1995.
Kajarekar, S., Yagnanarayana, B. and Hermansky, H.,
A Study of Two Dimensional Linear Discriminants for ASR,
Proc. ICASSP-01, Salt Lake City UT, 2001.
Kajarekar, S. and Hermansky, H.,
Optimization of units for continuous-digit recognition task,
Proc. ICSLP-2000, Beijing, 2000.
Kingsbury, B.,
Perceptually-inspired signal processing strategies for robust speech recognition in reverberant
environments,
Ph.D. Dissertation, University of California at Berkeley, December 1998.
Konig, Y., Bourlard, H. and Morgan, N.,
REMAP - Experiments with Speech Recognition,
Proc. ICASSP-96, 3350-3353, 1996.
Lahiri, A.,
Speech recognition with phonological features,
Proc. Int. Congress on Phonetic Sciences, pp. 715-718, 1999.
Lee, S., and Glass, J.,
Real-time Probabilistic Segmentation for Segment-Based Speech Recognition,
Proc. ICSLP-1998, Sydney, 1998.
Li, J., Najmi, A. and Gray, R.M.,
Image classification by a two-dimensional hidden Markov model,
Proc. ICASSP-99, VI-3313-3316, Phoenix AZ, 1999.
Logan, B. and Moreno, P.,
Factorial HMMs for Acoustic Modeling,
Proc. ICASSP-98, 813-816, Seattle WA, 1998.
Malayath, N., Hermansky, H., Kajarekar, S., and Yegnanarayana B.,
Data-Driven Temporal Filters and Alternatives to GMM in Speaker Verification,
Digital Signal Processing 10, 55-74, 2000.
Minsky, M.
The Society of Mind,
Simon and Schuster, 1986.
Mirghafori, N., Fosler, E. and Morgan, N.,
Towards Robustness to Fast Speech in ASR,
Proc. ICASSP-96, 335-338, 1996.
Morgan, N., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Janin, A., Pfau, T., Shriberg, E. and
Stolcke, A.,
The Meeting Project at ICSI,
Proc. Human Language Technologies Conference, San Diego, March 2001
Morgan, N. and Bourlard, H.,
Continuous Speech Recognition: An Introduction to the Hybrid HMM/Connectionist
Approach,
Signal Processing Magazine, pp 25-42, May 1995
Morgan, N., Bourlard, H., Greenberg, S. and Hermansky, H.,
Stochastic Perceptual Models of Speech,
Proc. ICASSP-95, 397-400, 1995.
Morgan, N., Ellis, D., Fosler-Lussier, E., Janin, A. and Kingsbury, B.,
Reducing errors by increasing the error rate: MLP Acoustic Modeling
for Broadcast News Transcription,
Proc. DARPA LVCSR meeting, 1999.
Morgan, N., and Fosler-Lussier, E.,
Combining multiple estimators of speaking rate,
Proc. ICASSP-98, 729-732, Seattle WA, May 1998.
Niyogi, P., Mitra, P., and Sondhi, M. M.,
‘‘A detection framework for locating phonetic events,’’ Proc. ICSLP,
pp. 1067-1070, 1998.
Nock, H.,
Techniques for Modelling Phonological Processes in Automatic Speech Recognition,
Ph.D. dissertation, Cambridge Univ., 2001.
Nock, H. and Young, S.,
Loosely-coupled HMMs for ASR,
Proc. ICSLP-2000, III-143-146, Beijing, 2000.
Robinson, A., Hochberg, M., and Renals, S.,
IPA: Improved Modelling with Recurrent Neural Networks
Proc. ICASSP-94, 37-40, April 1994.
Saul, L. and Jordan, M.,
Mixed memory Markov models,
Machine Learning 37, 75-87, 1999.
Shire, M.,
Data-driven modulation filter design under adverse acoustic conditions and using phonetic and
syllabic units,
Proc. Eurospeech-99, Budapest, III-1123-1126, 1999.
Simon, J., Depireux, D. and Shamma, S.,
Representation of complex spectra in the auditory cortex,
Proc. 11th International Symposium on Hearing, ed. Palmer, Ress, Summerfield & Meddis, 513-
520, Whurr Publishers, 1998.
Singh, R., Seltzer, M., Raj, B. and Stern, R.,
Speech in noisy environments: Robust automatic segmentation, feature extraction and
hypothesis combination,
Proc. ICASSP-01, Salt Lake City, 2001.
Sonmez, M., Plauche, M., Shriberg, E., and Franco, H.,
Consonant Discrimination in Elicited and Spontaneous Speech: A Case for Signal-adaptive
Front Ends in ASR,
Proc. ICSLP-2000, Beijing, October 2000.
Wu, S., Kingsbury, B., Morgan, N. and Greenberg, S.,
Performance Improvements Through Combining Phone- and Syllable-Scale Information in
Automatic Speech Recognition,
Proc. ICSLP-98, 459-462, Sydney, 1998.
Wu, S., Shire, M., Greenberg, S. and Morgan, N.,
Integrating Syllable Boundary Information Into Speech Recognition,
Proc. ICASSP-97, 987-990, Munich, 1997.
Yang, H., Van Vuuren, S., Sharma, S. and Hermansky, H.
Relevance of Time-Frequency Features for Phonetic and Speaker-Channel Classification,
Speech Communication, 2000.