Pushing the Envelope: Rethinking Acoustic Processing for Speech Recognition
A. Executive Summary
State-of-the-art speech recognition systems continue to improve, but the core acoustic operation
remains the same: a single feature vector (derived from the power spectral envelope over a 20-30
ms window, stepped forward by ~10 ms per frame) is compared to a set of distributions derived
from training data for an inventory of sub-word units (usually some variant of phones). This step
has remained essentially unchanged for decades, and we believe that this limited perspective is a
key weakness in speech recognizers. Note for instance that, under good conditions, human phone
error rate for nonsense syllables has been estimated to be as low as 1.5% [Allen 1994], as
compared with rates over an order of magnitude higher for the best machine phone recognizers
[Lee & Glass 1998, Deng & Sun 1994, Robinson et al. 1994]. In this light, our best current
recognizers appear "half-deaf", only making up for this deficiency by incorporating strong domain
constraints. To develop generally applicable and useful recognition techniques, we must
overcome the limitations of current acoustic processing. Interestingly, even human phonetic
categorization is poor for extremely short segments (e.g. <100 ms), suggesting that analysis of
longer time regions is somehow essential to the task. This suggestion is supported by information
theoretic analysis showing discriminative conditional dependence between features separated in
time by up to several hundred milliseconds [Yang et al. 2000, Bilmes 1998].
In the development of speech recognition, there have been certain innovations in feature
processing, such as delta calculation, cepstral mean normalization, or RASTA [Hermansky &
Morgan 1994], which have been able to effect valuable performance improvements with minimal
changes to the statistical processing. In general, however, signal processing and statistical
modeling techniques have co-evolved, making it unlikely that a modification in one domain will
significantly improve performance without a corresponding change in the other. This is clearly
illustrated for the case of longer temporal support, which is most simply introduced by using
highly overlapped analysis windows (e.g., 500 ms processing windows with a 10 ms frame step).
Unfortunately, successive frames of the resulting features are highly correlated when compared to
a standard 20-30 ms window, and this increased violation of the conditional independence
assumptions made in the statistical processing leads to the introduction of "tweak factors" at
every time scale in an effort to compensate. The close coupling of signal processing and statistical
modeling leads us to propose a balanced effort between the two areas — radical modifications to
the front end processing, along with a corresponding restructuring of the statistical models to
accommodate these modifications.
The proposed work consists of two tasks:
1) Signal Processing: Replacing the current notion of a spectral energy-based vector at
time t by a set of variables based on posterior probabilities of broad acoustic categories
for long-time and short-time functions of the spectro-temporal plane, where long-time
refers to periods of up to a second. Depending on the categories, nontraditional variables
such as pitch-related features may be useful, and other "style" variables such as speaking
rate could also be incorporated. These features will result in multiple streams of
probabilistic information.
2) Statistical Modeling: Modifying the statistical models both to incorporate these new
multirate front ends, and to handle explicitly areas of missing information, i.e. portions of
the time-frequency plane that are obscured by acoustic degradation. This task will also
cover the discriminative learning of dependence across streams, and the exploitation of
this information for optimal combination design. In addition, the work will result in new
event-based models, in the sense of allowing acoustic cues of multiple time spans
associated with one unit.
We will pursue a number of techniques, broadly inspired by both human audition and our wish to
develop compatible statistical underpinnings that work together within a unifying multirate (and
multistream) framework, itself derived from our understanding of perception. We seek to replace
energy-based features, the common currency of acoustic front ends, by values reflecting the
posterior probability of different signal categories, themselves defined by data-driven techniques.
These probabilistic estimates will be supported on time-frequency windows drawn from a large
and flexible family, selected by experimental results from the auditory system, by data-adaptive
decompositions, and by empirical evaluation. Further input from hearing research will come via
novel features developed to reflect pitch, rate and other perceptual attributes. These approaches
are particularly important for the case of conversational speech, which exhibits the greatest
variability in speaking rate and vocal quality, and which must be analyzed in terms of parameters
that vary across the full range of time scales, from phones through to phrases and beyond. The
best approach to variability in realization is to have a wide range of alternative information
sources from which to estimate the speech content, allied with a combination strategy able to
switch opportunistically amongst the most useful sources at any given instant.
Our proposal seeks to address the impairment of current speech recognizers through a radical
reconstruction of the interface between sound and search. The representation of speech as a
sequence of spectral envelopes will be pushed aside. Pronunciation and grammar constraints,
while invaluable for reducing word error rates, can often serve to mask basic problems in the
acoustic classification, and thus we will not explore their extension in this work. Instead, we will
concentrate on the modeling of the most basic speech sounds, with application to word
recognition tasks using systems that hold constant the later stages of processing (search,
pronunciation, language modeling, etc.). A solid foundation at this level — in itself a novel
approach in speech recognition — will accrue further benefits when constraints are re-applied.
The teams working on the two tasks in this proposal will interact closely, through membership
overlap, frequent meetings, and joint work on internal evaluations. Each team includes strong
senior players, known for their innovations in these areas. Task 1 will include Hynek
Hermansky; Task 2 will include Mari Ostendorf and Herve Bourlard. Evaluation strategies for
both tasks will be developed by George Doddington. PI Nelson Morgan, along with Dan Ellis and
Kemal Sonmez, will work on both Tasks 1 and 2. A separate proposal with SRI as the prime site
will focus on Rich Transcription, and we will ensure that SRI can exploit the technologies
developed in this Novel Approaches effort when they bear fruit.
While the approaches proposed here comprise a radical departure from mainstream methods, we
feel that "pushing the envelope" is a difficult but necessary step to achieve the dramatic
reduction in word error rate that this program seeks.
B. Innovative Claims
1) Use of broad category posterior probabilities over time-frequency patches rather
than short-window spectra or cepstra as basic features.
2) As a particular case, use of long temporal regions and limited spectral regions,
where "long" means analysis windows from 100 ms to 1 second, and "limited"
means 1-3 critical bands.
3) An alternate approach to time frequency analysis will incorporate a signal
adaptive front end, using an information-theoretic criterion to cluster temporal
regions with a local cosine basis tree.
4) Integration of features from analysis using differing temporal extent, using
multirate statistical models.
5) Integration of methods from Computational Auditory Scene Analysis (CASA)
into this new ASR framework.
6) Development of multiple streams based both on the methods referred to above
and on a criterion of minimal common errors between streams, subject to an
overall constraint on errors to avoid a trivial but useless solution.
7) Development of an event-based statistical model, in which event timing but not
extent is critical.
8) Development of multistream models incorporating all possible stream
combinations.
9) Use of partial information techniques so that low-confidence regions of time-
frequency are not given significant weight.
10) Development of task choice/evaluation methods to match the goals of this project.
C. Statement of Work
Core research - The contractor will research, develop, evaluate and document innovative
acoustic processing algorithms for ASR, including time-frequency tradeoffs for front end signal
processing, and development of statistical algorithms and models to optimally incorporate the
new front end features.
Evaluation — The contractor will develop methods for the rapid evaluation of progress in
intermediate stages of analysis, in the course of which small tasks will be proposed and
developed. In addition, more traditional means of ASR evaluation will be employed throughout
the project. Together, these methods will be used to guide progress within the project. In the 4th
and 5th years these procedures will be augmented by feedback from results in the governmental
evaluations of the Rich Transcription task, as the best Novel Approaches will be integrated into
the SRI-based evaluation system for the later years of the project.
Program Collaboration — The contractor will encourage collaboration within its team (in
particular with frequent internal meetings, conference calls, team access to a common Web site,
and the use of CVS or similar mechanisms to facilitate the development of common code
wherever possible). The contractor will also attend and support EARS meetings to facilitate
program level collaboration.
Reporting — The contractor will prepare and submit deliverable reports describing the progress,
results, and technical details of task-related activities.
Project Management — The contractor will manage the EARS Novel Approaches effort
including budgeting, scheduling, resource specification, and financial tracking of the project Tasks.
The contractor will coordinate, consolidate, and submit Task-level Status Report and Project
Summary deliverables as defined in the PIP.
D. Technical Rationale
Introduction
Word error rates for ASR are still too high in general, and particularly so for conversational
speech and other speech recorded under the imperfect acoustic conditions typical of many
military and commercial applications. The single largest contribution to the significant
improvements obtained by researchers over the last 5 years has been due to adaptation (in the
most general sense) over substantial amounts of testing data, and while this can be invaluable, it is
often not applicable for tasks that require excellent performance regardless of the data available
for each speaker. Even with these enhancements the performance is too poor for many
applications, and minor refinements of the basic methods are unlikely to yield the needed
improvement of conversational speech recognition down to the 5-10% range in word error rate,
particularly under general acoustic conditions (e.g., cell phone, speaker phone, and/or noisy
acoustic background).
It is further unlikely that any single "magic bullet" will be able to provide the desired degree of
improvement. Rather, as we have seen from the past, multiple innovations will be required to
provide significant change. However, there is an additional problem to be faced — the problem of
the so-called local minimum in error rates for complete ASR systems. As noted in [Bourlard,
Hermansky & Morgan 1996], once a system or set of approaches has been extensively
optimized, nearly any change to the system will lead to an increase in word error rates. While
most changes will simply be based on unsuccessful ideas, there is a small subset of initially
unpromising novel approaches that could lead to fundamental improvements in the complete
system once the consequences of the change are better understood. The core system design in
nearly every current state-of-the-art ASR system uses cepstral coefficients derived from an
auditory-scaled filter bank, computed over a 20-30 ms analysis window once every ~10 ms, with
Gaussian mixture models trained on such features to provide acoustic likelihoods (combined in
later components with the prior linguistic knowledge of pronunciation and grammar models). The
signal processing and statistical components for such a system have co-evolved so that it is
difficult to improve performance by modifying one without a corresponding change in the other.
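For concreteness, the conventional front end just described can be sketched as below. This is a
generic Python/NumPy illustration with typical parameter values (8 kHz sampling, 25 ms window,
10 ms hop, 23 mel filters, 13 cepstra); it is our sketch of the standard pipeline, not code drawn
from any system named in this proposal.

    # Conventional ASR front end: auditory-scaled (mel) filterbank energies
    # over a 25 ms window every 10 ms, followed by a DCT to give cepstra.
    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filt, n_fft, sr):
        # Triangular filters spaced evenly on the mel (auditory) scale.
        edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filt + 2))
        bins = np.floor((n_fft + 1) * edges / sr).astype(int)
        fb = np.zeros((n_filt, n_fft // 2 + 1))
        for i in range(n_filt):
            lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
            fb[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
            fb[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
        return fb

    def cepstral_features(x, sr=8000, win=0.025, hop=0.010, n_filt=23, n_ceps=13):
        n_win, n_hop, n_fft = int(sr * win), int(sr * hop), 256
        fb = mel_filterbank(n_filt, n_fft, sr)
        # DCT-II basis: one row per cepstral coefficient.
        basis = np.cos(np.arange(n_ceps)[:, None] * np.pi
                       * (np.arange(n_filt) + 0.5) / n_filt)
        frames = []
        for s in range(0, len(x) - n_win + 1, n_hop):
            seg = x[s:s + n_win] * np.hamming(n_win)
            power = np.abs(np.fft.rfft(seg, n_fft)) ** 2
            frames.append(basis @ np.log(fb @ power + 1e-10))
        return np.vstack(frames)  # one 13-dim vector per 10 ms step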
Consider a much simpler problem than conversational speech recognition, the recognition of read
connected digits. Surprisingly, for the case of noisy digit strings such as is explored in the Aurora
project [Hirsch & Pearce 2000], error rates are still extremely high, averaging 13.1% for the
baseline system when tested over SNRs between 20 and 0 dB. Therefore, even for cases where
the range of pronunciation variability is small and where there is very little to improve on in the
language model, the performance is poor, even with the best current systems. This is true
despite the use of a Maximum a Posteriori (MAP) recognition scheme, in which the most
probable model will always be chosen. The suboptimality in practice must arise from incorrect
models, in the sense that the statistics do not well represent the data that will be seen in
recognition. This needs to be corrected in two ways: first, the data representation (features)
needs to be chosen so that the ultimate hypotheses are invariant over a range of conditions that
may not be seen during the training phase; and second, the statistical models must be developed
to properly represent the distributions and dependencies that will be observed from the stream
(or streams) of new features.
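To fix notation for this argument (a standard gloss of the MAP rule, not new material), the
recognizer chooses the hypothesis

    \hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} p(X \mid W)\, P(W)

where X is the acoustic observation sequence, p(X | W) the acoustic likelihood, and P(W) the
linguistic prior. The rule is optimal only when these models match the true distributions of the
test data; when they do not, the most probable model under the trained statistics need not give
the most probable transcription, which is exactly the mismatch just described.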
Summarizing these arguments, we need a coordinated effort in signal processing and in statistical
modeling in order to successfully provide an improved alternative to today's state of the art.
However, we are left with a critical difficulty: When essentially all changes to the current
standard are likely to provide an increase in the error rate (or small decreases when used together
with the older system with some combination methods such as ROVER [Fiscus 1997]), how can
we determine the most likely directions for ultimate significant improvements? The core idea
common to all the pieces proposed in this document is to move beyond the current framewise
orientation of speech recognizers. Why should this be desirable (and not just different)? To begin
with, typical cepstral methods are inherently sensitive to spectral amplitudes, which are affected
by channel characteristics, noise, and reverberation. They also are sensitive to modification based
on context and speaking style. Temporal information is incorporated in a very specific and
limited way in the first order Markov models that we use, and it is likely that there is much more
that is fundamental to speech patterns that could be incorporated.
For these reasons, we are proposing to focus on the incorporation of acoustic information from
much longer time regions (100 ms to 1 s), using a number of different approaches to feature
extraction from the time-frequency plane. Accompanying this core approach are a number of
other key ideas that our preliminary efforts have suggested: multirate statistical models, partial
information techniques, and models of higher-order auditory processing such as pitch perception
and source formation. Additionally, our experience over the last 5 years has convinced us that
there is no single ideal form of front end signal processing, so that a multistream approach will
be used. Unlike earlier efforts in which arbitrary multiple front ends were used, we will be
focusing on developing a rational approach to the design of these front ends with the criterion of
minimizing the number of errors in common (subject to an overall constraint on errors to avoid a
trivial but useless solution).
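As one sketch of how that stream-design criterion might be operationalized (our illustration;
the function and threshold are hypothetical, not a committed algorithm), candidate stream pairs
can be scored by the fraction of frames on which both streams' classifiers err, with a cap on
each stream's individual error rate:

    import numpy as np

    def common_error_score(pred_a, pred_b, labels, max_stream_err=0.5):
        # pred_a, pred_b: per-frame class decisions from two candidate streams.
        err_a = pred_a != labels
        err_b = pred_b != labels
        # Overall constraint on errors: reject any stream so weak on its own
        # that minimizing shared errors becomes the trivial, useless solution.
        if err_a.mean() > max_stream_err or err_b.mean() > max_stream_err:
            return np.inf
        # Design criterion: the fewer errors in common, the more the streams
        # can correct each other when combined.
        return np.logical_and(err_a, err_b).mean()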
We have formulated the proposed work in terms of 2 tasks. We are proposing sufficient
personnel in each area to insure progress, but there will be enough overlap that a coordinated
effort between the task groups can be assured. The two tasks are:
1) Signal Processing: design and instantiate a new acoustic front end to calculate functions
of the time-frequency plane, particularly with a much longer time support than is
typically used for ASR. The output of the front end will be more like posterior
probabilities of broad classes than like spectral energies or cepstra.
2) Statistical Modeling: design and instantiate methods to handle incomplete information,
multiple rate data, and multiple streams of the posteriors based on the new time-
frequency functions.
Late in the project we will also take the best-scoring approaches from our internal evaluations
and provide them to our colleagues working on Rich Transcription for application to the ultimate
goals of the EARS program.
The proposed solutions within the two tasks are further summarized below.
Task 1: Signal Processing
A core idea for this task is to replace the current notion of a spectral energy-based vector at
time t with a vector based on posterior probabilities of broad categories for long-time (up to a second or
more) and short-time functions of the time frequency plane. Other analyses will use time and
frequency windows intermediate between these extremes. Depending on the categories,
nontraditional variables such as pitch-related features could be useful, and attributes such as
speaking rate could also be included in the classification. These features may be represented as
multiple streams of probabilistic information.
• Multiple time-frequency trade-offs: As we have noted, the dominant representation of
sound information is energy spectra calculated over brief time frames. This has been
advantageous in allowing hidden Markov models to accommodate timing variations
encountered in real speech, but at the cost of making certain kinds of temporal information
awkward or impossible to employ. Our recent work has shown effective and highly
complementary recognition by pushing the time-frequency balance to the opposite extreme:
TRAPS make independent first-stage classifications based only on information in a single
narrow frequency band, measured over an extended time window of up to 1 second
[Hermansky & Sharma 1998]. TRAPS are competitive with spectral features for clean
speech, and can halve the error rate for bandlimited noise corruption [Hermansky & Sharma
1999]. There is no particular reason to favor either dimension exclusively; a framework able
to integrate partially-dependent information will allow us to use both these views of the
signal, as well as a range of other analyses that use time and frequency windows intermediate
between these extremes. In particular, we will also use local cosine trees to "zoom" in and out in
time over a multitude of scales (bandwidths), weighted in probability by the entropy
criterion used to grow the multiresolution tree.
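A minimal sketch of the TRAPS-style analysis described in this bullet (array shapes are
illustrative; in the actual technique each band's temporal pattern feeds a band-specific neural
network, per [Hermansky & Sharma 1998]):

    import numpy as np

    def temporal_patterns(log_band_energy, half_span=50):
        # log_band_energy: (n_frames, n_bands) log energies in single critical
        # bands, one frame per 10 ms. For each band, the ~1 s trajectory
        # centered on each frame (101 points) is the input to that band's own
        # first-stage classifier; a second stage merges the band posteriors.
        n_frames, n_bands = log_band_energy.shape
        out = np.empty((n_frames - 2 * half_span, n_bands, 2 * half_span + 1))
        for t in range(half_span, n_frames - half_span):
            window = log_band_energy[t - half_span:t + half_span + 1, :]
            out[t - half_span] = (window - window.mean(axis=0)).T  # mean removal
        return out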
• Auditory-based signal cues: In a further re-evaluation of the information extracted from
the basic sound data, we will consider looking beyond the coarse spectral energy (in all its
time-frequency guises) to extract information from the finer time structure within each band.
This information is demonstrably important to listeners, as can be witnessed by the "blurry,
whispering crowd" effect that is the best that can be resynthesized from current speech
recognition representations [Ellis-surfsynth 1997]. Pitch information is noteworthy in
allowing listeners to recognize voiced segments as such, and particularly in helping to glue
together different parts of the signal energy that properly belong to the same voice, and
separating them from other interfering energy: Approaches of this kind, employing pitch,
onset, modulation and spatial cues, have been developed under the banner of computational
auditory scene analysis (CASA) [Cooke & Ellis 2001], and will be integrated as a further
basis for classification within our framework.
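For example, the simplest of these cues, a per-frame pitch and voicing estimate, might be
sketched as follows (our illustration; the lag range and plain autocorrelation method are
assumptions, not taken from the CASA systems cited):

    import numpy as np

    def pitch_and_voicing(frame, sr=8000, f_min=60.0, f_max=400.0):
        # Autocorrelation pitch detector for one analysis frame (the frame
        # must contain more than sr/f_min samples). Returns an f0 estimate
        # and a voicing strength in [0, 1]: the kind of cue used to glue
        # together energy belonging to one voice and reject interference.
        x = frame - frame.mean()
        ac = np.correlate(x, x, mode='full')[len(x) - 1:]
        if ac[0] <= 0.0:
            return 0.0, 0.0
        ac = ac / ac[0]
        lo, hi = int(sr / f_max), int(sr / f_min)
        lag = lo + int(np.argmax(ac[lo:hi]))
        return sr / lag, float(max(ac[lag], 0.0))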
• Principled multistream framework: The simplest approach to recognition is to identify a
single cue (such as broad short-time spectral profile) and use it for classification, but such an
approach is intrinsically brittle. Human perception apparently uses a far more robust
approach of employing numerous, redundant, parallel cues which are integrated to form a
single decision [Minsky 1986]. Recently, speech recognition systems based on multiple
independent recognizers have consistently and significantly outperformed other systems
[Fiscus 1997, Singh et al. 2001], but their hypothesis-level combinations are heuristic.
Instead, we will develop a probabilistically-rigorous system for combining many sources of
information, with different degrees of mutual dependence, to yield an optimal classification.
In this way, errors that occur independently in different information streams can be
discounted, and weak but consistent evidence from multiple sources can reinforce a correct
decision. We see relative error rate improvements of 25% or more over the best single stream
when complementary information streams are combined at the appropriate intermediate level
(for the noisy digits Aurora task) [Ellis & Bilmes 2000].
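A first step toward such a combination (a sketch under a weighted conditional-independence
assumption, simpler than the rigorous dependence modeling we propose) merges per-stream class
posteriors in the log domain, down-weighting streams judged unreliable:

    import numpy as np

    def combine_streams(posteriors, weights=None):
        # posteriors: list of (n_frames, n_classes) arrays, one per stream.
        # Weighted log-domain product: errors occurring independently in one
        # stream are discounted, while weak but consistent evidence from
        # several streams reinforces the correct class.
        n = len(posteriors)
        w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights)
        log_p = sum(wi * np.log(pi + 1e-10) for wi, pi in zip(w, posteriors))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(log_p)
        return p / p.sum(axis=1, keepdims=True)    # renormalize per frame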
For all of these, our vision is to displace spectral energy magnitude as the common element
underlying all acoustic modeling, replacing it with posterior probabilities of particular speech
classes — values with specific meaning that can be estimated from any kind of partial-signal cue,
and amenable to optimal combination via well-known procedures. A key part of the research will
be the determination of archetypical speech categories that will be used for generation of the low-
level posteriors. Our preference is to use data-driven methods to determine these categories,
rather than relying on linguistic theory-dependent classes such as articulatory features; however,
if data-driven approaches lead to something very close to traditional categories, we will use them.
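As a sketch of the data-driven route (illustration only; discovering the real category inventory
is itself part of the proposed research), one can cluster training frames and read soft cluster
memberships as the low-level posteriors:

    import numpy as np

    def learn_broad_categories(frames, n_classes=8, n_iter=25, seed=0):
        # Simple k-means over training frames; each centroid stands in for a
        # broad acoustic category. Returns the centroids and soft,
        # posterior-like memberships derived from squared distances.
        rng = np.random.default_rng(seed)
        centers = frames[rng.choice(len(frames), n_classes, replace=False)].copy()
        for _ in range(n_iter):
            d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(axis=1)
            for k in range(n_classes):
                members = frames[assign == k]
                if len(members):
                    centers[k] = members.mean(axis=0)
        logits = -((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        return centers, p / p.sum(axis=1, keepdims=True)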
Task 2: Statistical Modeling
A key goal in Task 2 is modifying the statistical models to both incorporate these new (multirate)
front ends, and to explicitly handle missing information (i.e., portions of the time-frequency
plane that are obscured by degradation such as noise or reverberation). This task will also include
the development of discriminative learning of dependence across streams, and incorporation of
this information for optimal combination.
• Partial information recognition: One of the greatest weaknesses of current speech
technology is the tacit assumption that the signal consists exclusively of the single voice of
interest, or, failing that, in trying to normalize the problem in such a way as to approximate this
condition. The strength, however, of a recognition scheme based on multiple alternative
information streams is that individual streams can become unreliable or unavailable without
compromising the overall classification — provided their unreliability is correctly detected.
Thus, detection and labeling of non-target information is central at every level of this
approach. To address a classification problem in which a certain, dynamically-varying
portion of the information is unavailable, we will use the optimal tools of the missing data
formalism, recently developed specifically to address situations in which high levels of
nonstationary noise can temporarily obscure any part of the signal. This approach has been
shown as extremely successful in small-vocabulary, high-noise conditions such as the Aurora
task [Barker, Green & Cooke 2001]. Classification that uses acoustic information only when
it is informative, and backs off to contextual inference for other dimensions, offers the best
promise of rising to human levels of recognition in the phone-detection task, and by extension
to the classification of larger units.
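The core computation of the missing data formalism can be sketched as follows (diagonal
Gaussian class models are assumed here purely for brevity): dimensions flagged as unreliable
are marginalized out of the class score, so only informative acoustic evidence is used:

    import numpy as np

    def masked_log_likelihood(x, reliable, means, variances):
        # x: one observation vector; reliable: boolean mask, True where a
        # dimension is judged uncorrupted; means, variances: (n_classes,
        # n_dims) diagonal Gaussians. Marginalizing a diagonal Gaussian over
        # the unreliable dimensions simply drops them from the score.
        r = np.asarray(reliable, dtype=bool)
        ll = -0.5 * (np.log(2.0 * np.pi * variances[:, r])
                     + (x[r] - means[:, r]) ** 2 / variances[:, r])
        return ll.sum(axis=1)  # one log-likelihood per class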
• Statistical models for multistream combination: In previous work on subband
(multiband) recognizers, we have developed methods for the combination of information from
disjoint parts of the spectrum. In one of these approaches, developed at IDIAP, likelihoods
are estimated by integrating over all possible reliable stream subsets. We will conduct research
to determine if such methods are useful outside of the multiband example, and in particular
for the streams developed in Task 1. In other work at OGI and ICSI, neural networks and
simple combination rules have been used to integrate information from multiple streams
outside of the multiband context, and we will incorporate this experience in our work.
Conditional auxiliary variables can also be used to generate functions that estimate the
reliability of streams for their combination. Finally, we will study a new model called HMM2
which is a mixture of HMMs, which in principle could permit dynamic subband
segmentation as well as optimal recombination [Bourlard et al 2000, Bengio et al 2000].
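The full-combination idea referred to above can be sketched as an expectation over all possible
subsets of reliable streams (uniform subset priors are assumed here purely for illustration):

    import numpy as np

    def full_combination(subset_log_likelihoods, subset_priors=None):
        # subset_log_likelihoods: dict mapping a frozenset of stream indices
        # to per-class log-likelihoods from an expert trained on exactly that
        # stream subset. The total likelihood integrates over all candidate
        # "reliable subset" hypotheses, weighted by their prior probability.
        subsets = list(subset_log_likelihoods)
        if subset_priors is None:
            subset_priors = {s: 1.0 / len(subsets) for s in subsets}
        stacked = np.stack([np.log(subset_priors[s]) + subset_log_likelihoods[s]
                            for s in subsets])
        m = stacked.max(axis=0)  # log-sum-exp over subsets, done stably
        return m + np.log(np.exp(stacked - m).sum(axis=0))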
• Multirate and event-based models for recognition: The obvious next step after dividing
the observation space into multiple streams is to allow multiple rates, but little work has been
done with such observations because of the synchronization problems in decoding associated
with having different rates. In this work, we propose two alternative models that provide a
mechanism for using multirate features: a multirate model which is essentially a coupled set
of HMMs and a multiresolution model which has a switching mechanism between HMM
streams associated with the different rates to accommodate signal-dependent analysis. In
work on machine toolwear (wear on the edge of a machining tool), UW has shown that the
multirate model outperforms a standard Gaussian mixture HMM [Fish 2001]. A key
difference in the application of this model to speech is that it is problematic to allow
segmentation times only at the slowest rate, so we introduce the notion of event-based
models with fixed temporal extent but variable-rate analysis within each feature stream and
changing temporal resolution across streams. At SRI, related preliminary work has been done
on a kind of HMM that can handle a multiresolution feature stream, i.e. where there is a
single stream but the resolution varies as a function of time. In both cases, much work is still
needed before the methods will be useful for large vocabulary recognition, including research
on parameter tying and discriminative stream coupling. Most fundamentally, we need to gain
experience in applying this approach to automatic speech recognition.
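The synchronization problem itself can be made concrete with a deliberately crude toy (our
sketch, far simpler than the coupled and event-based models proposed): hold each slow-stream
observation fixed across the fast frames it spans so the two rates can be scored
frame-synchronously:

    import numpy as np

    def merge_two_rates(fast_ll, slow_ll, ratio=10):
        # fast_ll: (n_fast, n_classes) per-frame log-likelihoods at a 10 ms
        # step; slow_ll: (n_slow, n_classes) at a 100 ms step. Repeating each
        # slow frame 'ratio' times forces frame synchrony -- exactly the
        # crude workaround that the multirate models above are meant to
        # replace with proper coupling.
        up = np.repeat(slow_ll, ratio, axis=0)[:len(fast_ll)]
        return fast_ll + up  # joint per-frame log-likelihood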
Throughout these subtasks runs the consistent theme: constructing both a formal and a practical
foundation for the incorporation of multiple, incomplete acoustic streams, including streams
having different temporal support.
It could be said that, in general, the work proposed here is motivated much more by speech
perception than speech production. In other words, we take a very different approach than in
the recent trends in speech recognition that involve articulatory modeling. Both approaches are
well motivated from the perspective of trying to account for the variability observed in
conversational speech, in providing mechanisms for handling phones of widely different
durations and articulation quality. However, our perceptually-motivated models have the added
advantages that (1) they can better account for a signal that is not due entirely to the articulators,
since there is noise (including multispeaker interference) and reverberation, and (2) they are more
amenable to discriminative training techniques. Of course, this argument does not preclude using
both approaches in a single system, but we believe that this would be too much to tackle in one
project. Like the other important efforts in pronunciation modeling or language modeling that we
will not work with here, the proposed project has the potential to yield a key component for
speech recognition systems of the future.
E. State-of-the-Art System
Several systems are available for use in the proposed project. First and foremost, due to the ICSI-
SRI collaboration, the complete SRI system is available for our use, including on-site SRI
researchers at ICSI who are expert in its use and modification. It will be used for evaluations at
ICSI and SRI of methods developed at all the sites, and is further described below. Team
members at all sites will also be able to use HTK, the UW Graphical Models Toolkit (GMTK),
and a variant of the ICSI hybrid neural network/HMM system that was used successfully in the
1998 Hub 4 evaluation. In addition, multirate modeling software (to be developed at UW as part
of this project) will be available to all sites.
The target Rich Transcription system for transferring the successful results of this project will be
the SRI DECIPHER™ system, which incorporates a combination of state-of-the-art techniques.
The DECIPHER™ system has consistently exhibited state-of-the-art performance throughout
several years of government-administered tests, and is distinguished by its detailed modeling of
pronunciation variation, its robustness to noise and channel distortion, and its multilingual
capabilities. These features make the system accurate in recognizing spontaneous speech of many
styles, dialects, languages, and noise conditions.
Among the techniques employed are N-best rescoring [Murveit et al. 1993] and minimum word
error decoding and posterior maximization in confusion networks [Stolcke et al. 1997, Mangu et
al. 2000]. SRI last participated in the Hub-4 evaluations in 1998, with a WER of 21%. The system later underwent
a major overhaul prior to the 2000 Hub-5 evaluation, resulting in a 24% relative WER reduction
on that task, and a 2001 Hub-5 eval performance of 29% WER. The Decipher system continues
to evolve and on the DARPA-sponsored 2001 SPINE evaluations achieved a WER of 27%, the
best performance of any submitted system.
OGI: The Anthropic Signal Processing Laboratory at OGI was established 9 years ago by Prof.
Hynek Hermansky. Hermansky’s early work on processing of corrupted speech introduced data-
guided speech analysis techniques, which later evolved into data-guided RASTA filters
[Hermansky & Morgan 1994] and data-guided spectral projections [Hermansky & Malayath
1998]. Multi-band recognition was introduced and extensively studied in collaboration with
Bourlard in Switzerland and Morgan at ICSI in 1995 and later evolved into the TRAP technique
[Hermansky & Sharma 1998], which provides robustness to many types of spectral distortion and
also addresses the context dependency of phonemes. A nonlinear generalization of early data-guided
techniques in the form of Feature Nets (tandem approach) was introduced with Ellis of
ICSI/Columbia [Hermansky et al. 2000]. The group successfully participates in NIST Speaker ID
evaluations, in DARPA’s SPINE evaluations, and in telecommunication industry standards
(ETSI) activities, and collaborates most intensively with ICSI, CMU’s robust speech recognition
group, IIT Madras (India), CSLP at Johns Hopkins University, and CSLU group at OGI.
Columbia: The Laboratory for Recognition and Organization of Speech and Audio was
established within Columbia’s Electrical Engineering department by Professor Ellis when he
moved there in 2000. Combining the statistical pattern recognition and learning techniques of
speech recognition with a broader range of audio processing techniques drawn from audio
engineering and auditory modeling, the lab addresses information extraction from all kinds of
sound signals. In addition to development of the Tandem approach to speech acoustic modeling,
the best performing system in the Aurora Eurospeech Special Event in September 2001 [Ellis &
Reyes 2001], current projects in the lab span a range of topics from speech through to music
analysis and clustering, to alarm sound detection and classification.
IDIAP: IDIAP (Dalle Molle Institute for Perceptual Artificial Intelligence) is a semi-private
research institute affiliated with the Swiss Federal Institute of Technology (EPFL) at Lausanne
and the University of Geneva and conducts research in the areas of speech and speaker
recognition, computer vision and machine learning. Speech recognition systems developed at
IDIAP are based on Hidden Markov Models (HMM) and on hybrid systems combining HMMs
and Neural Networks (NNs). To improve the performance of those systems, the speech group
research activities include the study of robust speech analysis techniques (e.g., measuring
information on different window lengths and using multiscale systems), robust speech modeling
(e.g., multi-stream approaches), and robust decision rules (e.g., measures of confidence).
UW: In 1999, the Department of Electrical Engineering at the University of Washington made a
major commitment to building a speech technology research program, hiring Mari Ostendorf and
two other faculty members who founded the Signal, Speech and Language Interpretation (SSLI)
Laboratory, building on the program that she had created 12 years before at Boston University.
There are currently 21 researchers in the lab, which is dedicated to solving core problems in
speech and language technologies, facilitating multidisciplinary research, and providing a broad
educational experience for students. Prof. Ostendorf’s research contributions include: segment-
based (or, trajectory) models of acoustic parameters for ASR, dependence models for speaker
adaptation, sentence-level mixture language models for topic and dialog structure, use of out-of-
domain data in language modeling, computational modeling of prosody for speech recognition and
synthesis, integration of prosody and parsing, and information extraction from speech. Her BU
group participated in several DARPA evaluations (in collaboration with BBN); she works on
Hub 5, and contributed to the UW SPINE evaluation effort this fall, led by Prof. Jeff Bilmes.
R. Technology Transfer
As noted in Section K (Resources Offered), each of the participating sites will make available to
the research community a number of resources resulting from our research. Particularly through
SRI, we have excellent pathways for transfer to government programs. Finally, nearly all sites
have close relations with one or more commercial entities (e.g., Qualcomm for ICSI) who will be
very interested in positive results in this project.
In some sense, though, the strongest route for technology transfer from a high-risk algorithms
program such as this is through our link with the SRI-centered Rich Transcription project, which
will be incorporating the most successful of our algorithms in their systems in years 4 and 5 of
their project.
S. Government-owned Resources
The Government-Furnished Equipment (GFE) identified below is required by our team for the
performance of the proposed effort, and should be included in the terms of a resulting contract.
This GFE has not been included in the price of this proposal. Our performance depends on our
team receiving approval to either transfer these items to the resulting contract or to allow for our
use of these items on a rent-free, non-interference basis. SRI has a pending bid into SPAWAR to
purchase this equipment; if the bid is not accepted then these items will need to be transferred to
the resultant contract.
Organization | Description | Tag Number | Accountable Contract Number
SRI | Sun Disk Drive | G443067 | N66001-94-C-6048
SRI | R-Squared Disk Drives | G443126, G444294, G444295 | N66001-94-C-6048
SRI | Seagate Disk Drives | G443127, G443129 | N66001-94-C-6048
SRI | Sun Computers | G443879, G443880, G442408 | N66001-94-C-6048
SRI | Seagate Disk Drives | G444136, G444137 | N66001-94-C-6048
SRI | R-Squared SCSI Drives | G444138, G444143, G444144 | N66001-94-C-6048
SRI | Sun Computers | G444148, G444316, G444141 | N66001-94-C-6048
SRI | Procom Hard Drives | G444376, G444377 | N66001-94-C-6048
SRI | Acropolis Disks | G444639, G444641 | N66001-94-C-6048
SRI | Seagate Disk Drives | G444674, G444675 | N66001-94-C-6048
SRI | Sun Storage System | G444940, G444941, G444942 | N66001-94-C-6048
SRI | Dell Computer | G445103 | N66001-94-C-6048
SRI | Vanguard Disk Drives | G445165, G445372, G445373, G445374 | N66001-94-C-6048
SRI | Vanguard Hard Drives | G445371, G445372, G445373, G445374 | N66001-94-C-6048
SRI | Sun CD/Rom Reader | G441922 | N66001-94-C-6046
SRI | R-Squared Disk Drives | G443711, G443712, G443713 | N66001-94-C-6046
SRI | DEC Computer | G443724, G443725, G443726, G443727 | N66001-94-C-6046
SRI | R-Squared Disk Drive | G444089 | N66001-94-C-6046
SRI | Sun Computers | G444122, G444158 | N66001-94-C-6046
SRI | Seagate Disk Drive | G444160 | N66001-94-C-6046
SRI | Sun Storage Systems | G444943, G444944 | N66001-94-C-6046
SRI | Dell Computers | G445123, G445124, G445125 | N66001-94-C-6046
UW | Dell 868 Computers | 1175736, 1175709, 1175680 | N66019928924
UW | Acer 868 Computers | No tag | N66019928924
UW | Sun Ultra-5_10 Computers | 1175755, 1175756, 1175760 | N66019928924
All team sites requesting equipment (ICSI, SRI, UW, OGI, and Columbia) have included the
requisite letters of notification with this proposal indicating that we cannot provide new
information technology resources to support the development and evaluation activities that are
required. These letters fully comply with the requirements of the solicitation and are included as
Appendix B in the paper versions of this proposal.
T. Organizational Conflict of Interest
SRI International is currently providing technical support to several offices within DARPA. We
do not believe that any potential conflict of interest exists as these individuals are not directly
involved in the effort proposed herein. The individuals and the offices they support are listed
below:
Murray Burke DARPA/IXO Program Manager for High Performance Knowledge Base and
Rapid Knowledge Formation programs
Tim Grayson DARPA/TTO Program Manager for Digital RF Tags (DRaFT) program
William Coleman OSD C3I/SAPCO Sr. Technical Advisor to DARPA Deputy Director
William Schneider DARPA/MTO Program Manager for Optoelectronics
Richard Wishner Director DARPA/IXO
Thomas Strat IXO Program Manager (Expected effective date 1/21/02)
None of the other sites (UW, Columbia, OGI, ICSI, and IDIAP) are providing such support.
Appendix A: References
Allen, J.,
How do humans process and recognize speech?
IEEE Transactions on Speech and Audio Processing, 2(4): 567-577, Oct. 1994.
Barker, J., Cooke, M. and Ellis, D.,
Decoding speech in the presence of other sound sources,
Proc. ICSLP-2000, Beijing, October 2000
Barker, J., Cooke, M. and Ellis, D.,
Combining bottom-up and top-down constraints for robust ASR: The multisource decoder,
Workshop on Consistent and Reliable Acoustic Cues CRAC-2001at Eurospeech-2001, Aalborg, Denmark, September 2001.
Barker, J., Green, P. and Cooke, M.,
Linking ASA and robust ASR by missing data techniques,
Proc. WISP-2001, Stratford UK, 2001.
Bengio, S., Bourlard, H. and Weber, K.,
An EM algorithm for HMMs with emission distribution represented by HMMs,
2000.
Using multiple time scales in a multistream speech recognition system,
Proc. Eurospeech-97, I-3-6, Rhodes, 1997.
Ellis, D.,
The Weft: A representation for periodic sounds,
Proc. ICASSP-97, Munich, II-1307-1310, 1997.
Ellis, D.,
Listening to speech recognition — the Surfsynth home page,
Web page, http://www.icsi.berkeley.edu/~dpwe/projects/surfsynth, 1997.
Ellis, D.,
Using knowledge to organize sound: The prediction-driven approach to computational auditory
scene analysis and its application to speech/nonspeech mixtures,
Speech Communication 27 3-4, pp. 281-298, April 1999.
Ellis, D. and Bilmes, J.,
Using mutual information to design feature combinations,
Proc. ICSLP-2000, Beijing, 2000.
Ellis, D. and Reyes, M.,
Investigations into Tandem acoustic modeling for the Aurora task,
Proc. Eurospeech-01, Aalborg, Denmark, September 2001.
Ellis, D., Singh, R. and Sivadas, S.,
Tandem Acoustic Modeling in Large-Vocabulary Recognition,
Proc. ICASSP-01, Salt Lake City UT, May 2001.
Fiscus, J.,
A post-processing system to yield reduced word error rates: Recognizer Output Voting Error
Reduction (ROVER),
Proc. ASRU-97, Santa Barbara, 1997.
Fish, R.,
Dynamic models of machining vibrations, designed for classification of tool wear,
Ph.D. dissertation, Univ. of Washington, 2001.
Fosler-Lussier, E.,
Multi-Level Decision Trees for Static and Dynamic Pronunciation Models,
Proc. Eurospeech-99, Budapest, I-463-466, 1999.
Fosler-Lussier, E., and Morgan, N.,
Effects of Speaking Rate and Word Frequency on Conversational
Pronunciations,
Speech Communication Special Issue on pronunciation variation, 29 (2-4),
137-157, 1999.
Ghahramani, Z. and Jordan, M.,
Factorial Hidden Markov Models,
Machine Learning 29, 245-273, 1997.
Greenberg, S., Hollenback, J. and Ellis, D.,
Insights into spoken language gleaned from phonetic transcriptions of the Switchboard corpus,
Proc. ICSLP-96, Philadelphia, 1996.
Hagen, A.,
Robust speech recognition based on multi-stream processing,
Ph.D. Thesis, Swiss Federal Institute of Technology, Lausanne, December 2001,
also IDIAP Research Report, RR01-4.
Hagen, A, Morris, A. and Bourlard, H.,
From Multi-Band Full Combination to Multi-Stream Full Combination Processing in Robust
ASR,
Proc. ISCA ITRW Intl. Workshop on Automatic Speech Recognition: Challenges for the Next
Millennium, Paris, Sep. 18-20, 2000.
Hall, J., Haggard, M. and Fernandes, M.
Detection in noise by spectro-temporal pattern analysis,
Journal of the Acoustical Society of America, 76, 50-56, 1984.
Hatch, A.,
Word-Level Confidence Estimation for Automatic Speech Recognition,
M.S. Thesis, University of California at Berkeley, August 2001.
Hermansky, H., Ellis, D., and Sharma, S.,
Tandem connectionist feature stream extraction for conventional HMM systems,
Proc. ICASSP-2000, Istanbul, June 2000, III-1635-1638
Hermansky, H. and Malayath, N.
Spectral Basis Functions from Discriminant Analysis,
Proc. ICSLP’98, Sydney, November 1998.
Hermansky, H. and Morgan, N.,
RASTA Processing of Speech,
IEEE Transactions on Speech and Audio Processing, special issue on Robust Speech Recognition
2(4), 578-589, Oct., 1994
Hermansky, H. and Sharma, S.,
TRAPS — Classifiers of temporal patterns,
Proc. ICSLP-98, Sydney, III-1003-1006, 1998
Hermansky, H. and Sharma, S.,
Temporal Patterns (TRAPS) in ASR of Noisy Speech,
Proc. ICASSP-99, Phoenix AZ, March 1999.
Hermansky, H., Sharma, S. and Jain, P.,
Data-derived nonlinear mapping for feature extraction in HMM,
Proc. ASRU-99, Keystone CO, 1999
Hermansky, H., Tibrewala, S. and Pavel, M.,
Towards ASR on Partially Corrupted Speech,
Proc. ICSLP-96, 462-465, Philadelphia, October 1996.
Hirsch, H. and Pearce, D.,
The AURORA Experimental Framework for the Performance Evaluations of Speech
Recognition Systems under Noisy Conditions,
Proc. ISCA ITRW ASR2000, Paris, September 2000.
Janin, A., Ellis, D. and Morgan, N.,
Multi-stream speech recognition: ready for prime time?
Proc. Eurospeech 1999,. II-591-594, 1999.
Jurafsky, D., Wooters, C., Segal, J., Stolcke, A., Fosler, E., Tajchman, G. and Morgan, N.,
Using a Stochastic Context-Free Grammar as a Language Model for Speech Recognition,
Proc. ICASSP-95, 189-192, 1995.
Kajarekar, S., Yagnanarayana, B. and Hermansky, H.,
A Study of Two Dimensional Linear Discriminants for ASR,
Proc. ICASSP-01, Salt Lake City UT, 2001.
Kajarekar, S. and Hermansky, H.,
Optimization of units for continuous-digit recognition task,
Proc. ICSLP-2000, Beijing, 2000.
Kingsbury, B.,
Perceptually-inspired signal processing strategies for robust speech recognition in reverberant
environments,
Ph.D. Dissertation, University of California at Berkeley, December 1998.
Konig, Y., Bourlard, H. and Morgan, N.,
REMAP - Experiments with Speech Recognition,
Proc. ICASSP-96, 3350-3353, 1996.
Lahiri, A.,
Speech recognition with phonological features,
Proc. Int. Congress on Phonetic Sciences, pp. 715-718, 1999.
Lee, S., and Glass, J.,
Real-time Probabilistic Segmentation for Segment-Based Speech Recognition,
Proc. ICSLP-1998, Sydney, 1998.
Li, J., Najmi, A. and Gray, R.M.,
Image classification by a two-dimensional hidden Markov model,
Proc. ICASSP-99, VI-3313-3316, Phoenix AZ, 1999.
Logan, B. and Moreno, P.,
Factorial HMMs for Acoustic Modeling,
Proc. ICASSP-98, 813-816, Seattle WA, 1998.
Malayath, N., Hermansky, H., Kajarekar, S., and Yegnanarayana B.,
Data-Driven Temporal Filters and Alternatives to GMM in Speaker Verification,
Digital Signal Processing 10, 55-74, 2000.
Minsky, M.
The Society of Mind,
Simon and Schuster, 1986.
Mirghafori, N., Fosler, E. and Morgan, N.,
Towards Robustness to Fast Speech in ASR,
Proc. ICASSP-96, 335-338, 1996.
Morgan, N., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Janin, A., Pfau, T., Shriberg, E. and
Stolcke, A.,
The Meeting Project at ICSI,
Proc. Human Language Technologies Conference, San Diego, March 2001
Morgan, N. and Bourlard, H.,
Continuous Speech Recognition: An Introduction to the Hybrid HMM/Connectionist
Approach,
Signal Processing Magazine, pp 25-42, May 1995
Morgan, N., Bourlard, H., Greenberg, S. and Hermansky, H.,
Stochastic Perceptual Models of Speech,
Proc. ICASSP-95, 397-400, 1995.
Morgan, N., Ellis, D., Fosler-Lussier, E., Janin, A. and Kingsbury, B.,
Reducing errors by increasing the error rate: MLP Acoustic Modeling
for Broadcast News Transcription,
Proc. DARPA LVCSR meeting, 1999.
Morgan, N., and Fosler-Lussier, E.,
Combining multiple estimators of speaking rate,
Proc. ICASSP-98, 729-732, Seattle WA, May 1998.
Niyogi, P., Mitra, P., and Sondhi, M. M.,
‘‘A detection framework for locating phonetic events,’’ Proc. ICSLP,
pp. 1067-1070, 1998.
Nock, H.,
Techniques for Modelling Phonological Processes in Automatic Speech Recognition,
Ph.D. dissertation, Cambridge Univ., 2001.
Nock, H. and Young, S.,
Loosely-coupled HMMs for ASR,
Proc. ICSLP-2000, III-143-146, Beijing, 2000.
Robinson, A., Hochberg, M., and Renals, S.,
IPA: Improved Modelling with Recurrent Neural Networks
Proc. ICASSP-94, 37-40, April 1994.
Saul, L. and Jordan, M.,
Mixed memory Markov models,
Machine Learning 37, 75-87, 1999.
Shire, M.,
Data-driven modulation filter design under adverse acoustic conditions and using phonetic and
syllabic units,
Proc. Eurospeech-99, Budapest, III-1123-1126, 1999.
Simon, J., Depireux, D. and Shamma, S.,
Representation of complex spectra in the auditory cortex,
Proc. 11th International Symposium on Hearing, ed. Palmer, Ress, Summerfield & Meddis, 513-
520, Whurr Publishers, 1998.
Singh, R., Seltzer, M., Raj, B. and Stern, R.,
Speech in noisy environments: Robust automatic segmentation, feature extraction and
hypothesis combination,
Proc. ICASSP-01, Salt Lake City, 2001.
Sonmez, M., Plauche, M., Shriberg, E., and Franco, H.,
Consonant Discrimination in Elicited and Spontaneous Speech: A Case for Signal-adaptive
Front Ends in ASR,
Proc. ICSLP-2000, Beijing, October 2000.
Wu, S., Kingsbury, B., Morgan, N. and Greenberg, S.,
Performance Improvements Through Combining Phone- and Syllable-Scale Information in
Automatic Speech Recognition,
Proc. ICSLP-98, 459-462, Sydney, 1998.
Wu, S., Shire, M., Greenberg, S. and Morgan, N.,
Integrating Syllable Boundary Information Into Speech Recognition,
Proc. ICASSP-97, 987-990, Munich, 1997.
Yang, H., Van Vuuren, S., Sharma, S. and Hermansky, H.
Relevance of Time-Frequency Features for Phonetic and Speaker-Channel Classification,
Speech Communication, 2000.