
EURASIP Journal on Applied Signal Processing

Anthropomorphic Processing of Audio and Speech

Guest Editors: Werner Verhelst, Jürgen Herre, Gernot Kubin, Hynek Hermansky, and Søren Holdt Jensen


Copyright © 2005 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2005 of “EURASIP Journal on Applied Signal Processing.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editor-in-Chief: Marc Moonen, Belgium

Senior Advisory Editor: K. J. Ray Liu, College Park, USA

Associate Editors: Gonzalo Arce, USA; Jaakko Astola, Finland; Kenneth Barner, USA; Mauro Barni, Italy; Jacob Benesty, Canada; Kostas Berberidis, Greece; Helmut Bölcskei, Switzerland; Joe Chen, USA; Chong-Yung Chi, Taiwan; Satya Dharanipragada, USA; Petar M. Djurić, USA; Jean-Luc Dugelay, France; Frank Ehlers, Germany; Moncef Gabbouj, Finland; Sharon Gannot, Israel; Fulvio Gini, Italy; A. Gorokhov, The Netherlands; Peter Handel, Sweden; Ulrich Heute, Germany; John Homer, Australia; Arden Huang, USA; Jiri Jan, Czech Republic; Søren Holdt Jensen, Denmark; Mark Kahrs, USA; Thomas Kaiser, Germany; Moon Gi Kang, Korea; Aggelos Katsaggelos, USA; Walter Kellermann, Germany; Alex Kot, Singapore; C.-C. Jay Kuo, USA; Geert Leus, The Netherlands; Bernard C. Levy, USA; Mark Liao, Taiwan; Yuan-Pei Lin, Taiwan; Shoji Makino, Japan; Stephen Marshall, UK; C. Mecklenbräuker, Austria; Gloria Menegaz, Italy; Bernie Mulgrew, UK; King N. Ngan, Hong Kong; Douglas O’Shaughnessy, Canada; Antonio Ortega, USA; Montse Pardas, Spain; Wilfried Philips, Belgium; Vincent Poor, USA; Phillip Regalia, France; Markus Rupp, Austria; Hideaki Sakai, Japan; Bill Sandham, UK; Dirk Slock, France; Piet Sommen, The Netherlands; Dimitrios Tzovaras, Greece; Hugo Van hamme, Belgium; Jacques Verly, Belgium; Xiaodong Wang, USA; Douglas Williams, USA; Roger Woods, UK; Jar-Ferr Yang, Taiwan


Contents

Editorial, Werner Verhelst, Jürgen Herre, Gernot Kubin, Hynek Hermansky, and Søren Holdt Jensen. Volume 2005 (2005), Issue 9, Pages 1289-1291

A Perceptual Model for Sinusoidal Audio Coding Based on Spectral Integration, Steven van de Par, Armin Kohlrausch, Richard Heusdens, Jesper Jensen, and Søren Holdt Jensen. Volume 2005 (2005), Issue 9, Pages 1292-1304

Parametric Coding of Stereo Audio, Jeroen Breebaart, Steven van de Par, Armin Kohlrausch, and Erik Schuijers. Volume 2005 (2005), Issue 9, Pages 1305-1322

Analysis of the IHC Adaptation for the Anthropomorphic Speech Processing Systems, Alexei V. Ivanov and Alexander A. Petrovsky. Volume 2005 (2005), Issue 9, Pages 1323-1333

Anthropomorphic Coding of Speech and Audio: A Model Inversion Approach, Christian Feldbauer, Gernot Kubin, and W. Bastiaan Kleijn. Volume 2005 (2005), Issue 9, Pages 1334-1349

Neuromimetic Sound Representation for Percept Detection and Manipulation, Dmitry N. Zotkin, Taishih Chi, Shihab A. Shamma, and Ramani Duraiswami. Volume 2005 (2005), Issue 9, Pages 1350-1364

Source Separation with One Ear: Proposition for an Anthropomorphic Approach, Jean Rouat and Ramin Pichevar. Volume 2005 (2005), Issue 9, Pages 1365-1373

A Physiologically Inspired Method for Audio Classification, Sourabh Ravindran, Kristopher Schlemmer, and David V. Anderson. Volume 2005 (2005), Issue 9, Pages 1374-1381

A Two-Channel Training Algorithm for Hidden Markov Model and Its Application to Lip Reading, Liang Dong, Say Wei Foo, and Yong Lian. Volume 2005 (2005), Issue 9, Pages 1382-1399

Disordered Speech Assessment Using Automatic Methods Based on Quantitative Measures, Lingyun Gu, John G. Harris, Rahul Shrivastav, and Christine Sapienza. Volume 2005 (2005), Issue 9, Pages 1400-1409

Objective Speech Quality Measurement Using Statistical Data Mining, Wei Zha and Wai-Yip Chan. Volume 2005 (2005), Issue 9, Pages 1410-1424

Fourier-Lapped Multilayer Perceptron Method for Speech Quality Assessment, Moisés Vidal Ribeiro, Jayme Garcia Arnal Barbedo, João Marcos Travassos Romano, and Amauri Lopes. Volume 2005 (2005), Issue 9, Pages 1425-1434


Simulation of Human Speech Production Applied to the Study and Synthesis of European Portuguese, António J. S. Teixeira, Roberto Martinez, Luís Nuno Silva, Luis M. T. Jesus, Jose C. Príncipe, and Francisco A. C. Vaz. Volume 2005 (2005), Issue 9, Pages 1435-1448


EURASIP Journal on Applied Signal Processing 2005:9, 1289–1291. © 2005 Hindawi Publishing Corporation

Editorial

Werner Verhelst, Department of Electronics and Information Processing, Vrije Universiteit Brussel, 1050 Brussel, Belgium. Email: [email protected]

Jürgen Herre, Fraunhofer Institute for Integrated Circuits (IIS), 91058 Erlangen, Germany. Email: [email protected]

Gernot Kubin, Signal Processing and Speech Communication Laboratory, Graz University of Technology, 8010 Graz, Austria. Email: [email protected]

Hynek Hermansky, IDIAP Research Institute, 1920 Martigny, Switzerland. Email: [email protected]

Søren Holdt Jensen, Department of Communication Technology, Institute of Electronic Systems, Aalborg University, Fredrik Bajers Vej 7A, DK-9220 Aalborg, Denmark. Email: [email protected]

Anthropomorphic systems process signals “at the image of man.” They are designed to solve a problem in signal processing by imitation of the processes that accomplish the same task in humans. In the area of audio and speech processing, remarkable successes have been obtained by anthropomorphic systems: perceptual audio coding even caused a landslide in the music business.

At first sight, it could seem obvious that the performance of audio processing systems should benefit from taking into account the perceptual properties of human audition. For example, front ends that extract perceptually meaningful features currently show the best results in speech recognizers. However, their features are typically used for a stochastic optimization that is itself not anthropomorphic at all. Thus, it is not obvious why they should perform best, and perhaps the truly optimal features have not yet been found because, after all, “airplanes do not flap their wings.”

In general, we believe that there are several situations when an anthropomorphic approach may not be the best solution. First, its combination with nonanthropomorphic systems could result in a suboptimal overall performance (the quantization noise that was cleverly concealed by a perceptual audio coder could become unmasked by subsequent linear or nonlinear processing). Second, other approaches that are not anthropomorphic might be better adapted to the technology that is chosen for the implementation (airplanes do not flap their wings because it is technically much more efficient to use jet engines for propulsion). Nevertheless, a lot can be learned from imitating natural systems that were optimized through natural selection. As such, anthropomorphic and, by extension, biomorphic systems can be considered to play an important role in the process of developing new technologies.

This special issue brings together a dozen papers from different areas of audio and speech processing that deal with aspects of anthropomorphic processing or in which an anthropomorphic or perceptual approach was taken.

The first of two papers on perceptual audio coding proposes a perceptual model for the specific distortion that is typically encountered in sinusoidal modelling, while the second paper introduces a novel parametric stereo coding technique based on binaural psychoacoustics. While these papers illustrate the use of human auditory perception for efficient audio coding, the three following papers present examples of efforts towards using different levels of neurophysiologic modelling directly for the representation and processing of audio signals: from a model for the adaptation behaviour in the chemical synapses between the inner hair cells and the auditory neurons, to a signal processing model for the early auditory system, and then a cortical audio representation for sound modification. In the last pair of audio papers, signal features that are based on our knowledge of the auditory system are used in conjunction with machine learning techniques, such as neural networks, to achieve more cognitive goals, such as audio source separation and classification.

A generally applicable technique that allows for discriminative training of hidden Markov models is introduced and applied on the confusable set of visemes for lip reading purposes in the first of five papers on speech processing. The next three of these papers all deal with the important problem of finding objective distortion measures for speech, and the last paper describes an articulatory speech synthesizer that, among other things, brought a better understanding of the Portuguese nasal vowels.

While the papers in this special issue can represent only a small sampling of anthropomorphic techniques in audio and speech processing, they are all very valuable in their own right and together, if nothing else, they show that anthropomorphic sound processing systems are invaluable in the form of computational models for human perception and that they can fuel our quest for further understanding of human nature and self-knowledge.

Werner Verhelst
Jürgen Herre
Gernot Kubin
Hynek Hermansky
Søren Holdt Jensen

Werner Verhelst obtained the Engineering degree, Burgerlijk Werktuigkundig Electrotechnisch Ingenieur, in 1980, and the Ph.D. degree in 1985, both from the Vrije Universiteit Brussel, Belgium. He specialised in digital speech and audio processing in general, and in speech and audio signal modification in particular. Verhelst also studied speech synthesis at the Institute for Perception Research, and audio signal modelling at the Katholieke Universiteit Leuven, Belgium. Since his graduation, he has been with the Vrije Universiteit Brussel where he is heading the Research Laboratory on Digital Speech and Audio Processing (DSSP) and teaching courses on digital signal processing and speech and audio processing.

Jürgen Herre joined the Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany, in 1989. Since then he has been involved in the development of perceptual coding algorithms for high-quality audio, including the well-known ISO/MPEG-Audio Layer III coder (aka “MP3”). In 1995, Dr. Herre joined Bell Laboratories for a postdoc term working on the development of MPEG-2 advanced audio coding (AAC). At the end of 1996, he returned to Fraunhofer to work on the development of advanced multimedia technologies including MPEG-4, MPEG-7, and secure delivery of audiovisual content. Currently he is the Chief Scientist for the audio/multimedia activities at the Fraunhofer Institute for Integrated Circuits (IIS), Erlangen. Dr. Herre is a Fellow of the Audio Engineering Society, Cochair of the AES Technical Committee on Coding of Audio Signals, and Vice Chair of the AES Technical Council. He also served as an Associate Editor of the IEEE Transactions on Speech and Audio Processing and is an active member of the MPEG audio subgroup.

Gernot Kubin was born in Vienna, Austria, on June 24, 1960. He received the Dipl.-Ing. degree in 1982, and Dr. Techn. degree (sub auspiciis praesidentis) in 1990, both in electrical engineering from TU Vienna. He has been a Professor of nonlinear signal processing and the Head of the Signal Processing and Speech Communication Laboratory (SPSC), Graz University of Technology, Austria, since September 2000. Earlier international appointments include CERN, Geneva, Switzerland, 1980; TU Vienna, from 1983 to 2000; Erwin Schroedinger Fellow at Philips Natuurkundig Laboratorium, Eindhoven, The Netherlands, 1985; AT&T Bell Labs, Murray Hill, USA, from 1992 to 1993, and 1995; KTH, Stockholm, Sweden, 1998; Vienna Telecommunications Research Centre (FTW) from 1999 up to date as Key Researcher and Member of the Board; Global IP Sound, Sweden and USA, from 2000 to 2001 as a Scientific Consultant; Christian Doppler Laboratory for Nonlinear Signal Processing from 2002 up to date as the Founding Director. Dr. Kubin is a Member of the Board of the Austrian Acoustics Association and Vice Chair for the European COST Action 277, Nonlinear Speech Processing. He has authored or coauthored over ninety peer-reviewed publications and three patents.

Hynek Hermansky works at the IDIAP Research Institute, Martigny, Switzerland. He has been working in speech processing for over 30 years, previously as a Research Fellow at the University of Tokyo, a Research Engineer at Panasonic Technologies, Santa Barbara, California, a Senior Member of the research staff at US WEST Advanced Technologies, and a Professor and Director of the Center for Information Processing, OHSU, Portland, Oregon. He is a Fellow of the IEEE for the “invention and development of perceptually-based speech processing methods,” a Member of the Board of the International Speech Communication Association, and a Member of the Editorial Boards of Speech Communication and of Phonetica. He holds 5 US patents and authored or coauthored over 130 papers in reviewed journals and conference proceedings. He holds a Dr. Eng. degree from the University of Tokyo, and Dipl.-Ing. degree from Brno University of Technology, Czech Republic. His main research interests are in acoustic processing for speech and speaker recognition.


Søren Holdt Jensen received the M.S. degree in electrical engineering from Aalborg University, Denmark, in 1988, and the Ph.D. degree from the Technical University of Denmark, in 1995. He has been with the Telecommunications Laboratory of Telecom Denmark, the Electronics Institute of the Technical University of Denmark, the Scientific Computing Group of the Danish Computing Center for Research and Education (UNI-C), the Electrical Engineering Department of Katholieke Universiteit Leuven, Belgium, the Center for PersonKommunikation (CPK) of Aalborg University, and is currently an Associate Professor in the Department of Communication Technology, Aalborg University. His research activities are in digital signal processing, communication signal processing, and speech and audio processing. Dr. Jensen is a Member of the Editorial Board of the EURASIP Journal on Applied Signal Processing, and a former Chairman of the IEEE Denmark Section and the IEEE Denmark Section’s Signal Processing Chapter.


EURASIP Journal on Applied Signal Processing 2005:9, 1292–1304. © 2005 Steven van de Par et al.

A Perceptual Model for Sinusoidal Audio Coding Based on Spectral Integration

Steven van de Par, Digital Signal Processing Group, Philips Research Laboratories, 5656 AA Eindhoven, The Netherlands. Email: [email protected]

Armin Kohlrausch, Digital Signal Processing Group, Philips Research Laboratories, 5656 AA Eindhoven, The Netherlands, and Department of Technology Management, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands. Email: [email protected]

Richard Heusdens, Department of Mediamatics, Delft University of Technology, 2600 GA Delft, The Netherlands. Email: [email protected]

Jesper Jensen, Department of Mediamatics, Delft University of Technology, 2600 GA Delft, The Netherlands. Email: [email protected]

Søren Holdt Jensen, Department of Communication Technology, Institute of Electronic Systems, Aalborg University, DK-9220 Aalborg, Denmark. Email: [email protected]

Received 31 October 2003; Revised 22 July 2004

Psychoacoustical models have been used extensively within audio coding applications over the past decades. Recently, parametric coding techniques have been applied to general audio and this has created the need for a psychoacoustical model that is specifically suited for sinusoidal modelling of audio signals. In this paper, we present a new perceptual model that predicts masked thresholds for sinusoidal distortions. The model relies on signal detection theory and incorporates more recent insights about spectral and temporal integration in auditory masking. As a consequence, the model is able to predict the distortion detectability. In fact, the distortion detectability defines a (perceptually relevant) norm on the underlying signal space which is beneficial for optimisation algorithms such as rate-distortion optimisation or linear predictive coding. We evaluate the merits of the model by combining it with a sinusoidal extraction method and compare the results with those obtained with the ISO MPEG-1 Layer I-II recommended model. Listening tests show a clear preference for the new model. More specifically, the model presented here leads to a reduction of more than 20% in terms of number of sinusoids needed to represent signals at a given quality level.

Keywords and phrases: audio coding, psychoacoustical modelling, auditory masking, spectral masking, sinusoidal modelling, psychoacoustical matching pursuit.

1. INTRODUCTION

The ever-increasing growth of application areas such as consumer electronics, broadcasting (digital radio and television), and multimedia/Internet has created a demand for high-quality digital audio at low bit rates. Over the last decade, this has led to the development of new coding techniques based on models of human auditory perception (psychoacoustical masking models). Examples include the coding techniques used in the ISO/IEC MPEG family, for example, [1], the MiniDisc from Sony [2], and the digital compact cassette (DCC) from Philips [3]. For an overview of recently proposed perceptual audio coding schemes and standards, we refer to the tutorial paper by Painter and Spanias [4].

(This is an open-access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.)


A promising approach to achieve low bit rate coding of digital audio signals with minimum perceived loss of quality is to use perception-based hybrid coding schemes, where audio signals are decomposed and coded as a sinusoidal part and a residual. In these coding schemes, different signal components occurring simultaneously are encoded with different encoders. Usually, tonal components are encoded with a specific encoder aimed at signals composed of sinusoids and the remaining signal components are coded with a waveform or noise encoder [5, 6, 7, 8, 9]. To enable the selection of the perceptually most suitable sinusoidal description of an audio signal, dedicated psychoacoustical models are needed and this will be the topic of this paper.

One important principle by which auditory perception can be exploited in general audio coding is that the modelling error generated by the audio coding algorithm is masked by the original signal. When the error signal is masked, the modified audio signal generated by the audio coding algorithm is indistinguishable from the original signal.

To determine what level of distortion is allowable, an auditory masking model can be used. Consider, for example, the case where the masking model is used in a transform coder. Here the model will specify, for each spectro-temporal interval within the original audio signal, what distortion level can be allowed within that interval such that it is perceptually just not detectable. With an appropriate signal transformation, for example, an MDCT filter bank [10, 11], it is possible to selectively adapt the accuracy with which each different spectro-temporal interval is described, that is, the number of bits used for quantisation. In this way, the spectro-temporal characteristics of the error signal can be adapted such that auditory masking is exploited effectively, leading to the lowest possible bit rate without perceptible distortions.
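To make the bit-allocation idea concrete, the following toy sketch (not taken from this paper; the band powers, thresholds, and the roughly 6 dB-of-noise-reduction-per-bit rule are illustrative assumptions) spends just enough bits per band to push the quantisation noise below a given masked threshold.

```python
import math

def allocate_bits(band_power_db, masked_threshold_db, max_bits=16):
    """Toy per-band bit allocation: spend just enough bits so that the
    quantisation noise in each band stays below its masked threshold,
    assuming roughly 6 dB of noise reduction per additional bit."""
    bits = []
    for power, threshold in zip(band_power_db, masked_threshold_db):
        margin_db = power - threshold   # noise must end up this far below the signal
        bits.append(min(max(0, math.ceil(margin_db / 6.02)), max_bits))
    return bits

# Hypothetical band powers and masked thresholds (dB SPL) for one time segment.
print(allocate_bits([70, 55, 40, 62], [45, 50, 42, 30]))   # -> [5, 1, 0, 6]
```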

Most existing auditory masking models are based on the psychoacoustical literature that predominantly studied the masking of tones by noise signals (e.g., [12]). Interestingly, for subband coders and transform coders the nature of the signals is just the reverse; the distortion is noise-like, while the masker, or original signal, is often tonal in character. Nevertheless, based on this psychoacoustical literature, dedicated psychoacoustical models have been developed for audio coding for the situation where the distortion signal is noise-like, such as the ISO MPEG model [1].

Masking models are also used for sinusoidal coding, where the signal is modelled by a sum of sinusoidal components. Most existing sinusoidal audio coders, for example, [5, 6, 13] rely on masking curves derived from spectral-spreading-based perceptual models in order to decide which components are masked by the original signal, and which are not. As a consequence of this decision process, a number of masked components are rejected by the coder, resulting in a distortion signal that is sinusoidal in nature. In this paper a model is introduced that is specifically designed for predicting the masking of sinusoidal components. In addition, the proposed model takes into account some new findings in the psychoacoustical literature about spectral and temporal integration in auditory masking.

This paper is organised as follows. In Section 2 we discuss the psychoacoustical background of the proposed model. Next, in Section 3, the new psychoacoustical model will be introduced, followed by Section 4, which describes the calibration of the model. Section 5 compares predictions of the model with some basic psychoacoustical findings. In Section 6, we apply the proposed model in a sinusoidal audio modelling method and in Section 7 we compare, in a listening test, the resulting audio quality to that obtained with the ISO MPEG model [1]. Finally, in Section 8, we will present some conclusions.

2. PSYCHOACOUSTICAL BACKGROUND

Auditory masking models that are used in audio coding are predominantly based on a phenomenon known as simultaneous masking (see, e.g., [14]). One of the earlier relevant studies goes back to Fletcher [15] who performed listening experiments with tones that were masked by noise. In his experiments the listeners had to detect a tone that was presented simultaneously with a bandpass noise masker that was spectrally centred around the tone. The threshold level for detecting the tones was measured as a function of the masker bandwidth while the power spectral density (spectrum level) was kept constant. Results showed that an increase of bandwidth, thus increasing the total masker power, led to an increase of the detection thresholds. However, this increase was only observed when the bandwidth was below a certain critical bandwidth; beyond this critical bandwidth, thresholds were independent of bandwidth. These observations led to the critical band concept which is the spectral interval across which masker power is integrated to contribute to the masking of a tone centred within the interval.

An explanation for these observations is that the signal processing in the peripheral auditory system, specifically by the basilar membrane in the cochlea, can be represented as a series of bandpass filters which are excited by the input signal, and which produce parallel bandpass-filtered outputs (see, e.g., [16]). The detection of the tone is thought to be governed by the bandpass filter (or auditory filter) that is centred around the tone. When the power ratio between the tone and the masker at the output of this filter exceeds a certain criterion value, the tone is assumed to be detectable. With these assumptions the observations of Fletcher can be explained; as long as the masker has a bandwidth smaller than that of the auditory filter, an increase in bandwidth will also lead to an increase in the masker power seen at the output of the auditory filter, which, in turn, leads to an increase in detection threshold. Beyond the auditory filter bandwidth the added masker components will not contribute to the masker power at the output of the auditory filter because they are rejected by the bandpass characteristic of the auditory filter. Whereas in Fletcher's experiments the tone was centred within the noise masker, later on experiments were conducted where the masker did not spectrally overlap with the tone to be detected (see, e.g., [17]). Such experiments reveal more information on the auditory filter characteristic, specifically about the tails of the filters.


The implication of such experiments should be treated with care. When different maskers and signals are chosen, the resulting conclusions about the auditory filter shape are quite different. For example, a tonal masker proves to be a much poorer masker than a noise signal [17]. In addition, the filter shapes seem to depend on the masker type as well as on the masker level. These observations suggest that the basic assumptions of linear, that is, level-independent, auditory filters and an energy criterion that defines audibility of distortion components, are only a first-order approximation and that other factors play a role in masking. For instance, it is known that the basilar membrane behaves nonlinearly [18], which may explain, for instance, the level dependence of the auditory filter shape. For a more elaborate discussion of auditory masking and auditory filters, the reader is referred to [19, 20, 21].

Despite the fact that the assumption of a linear auditory filter and an energy detector can only be regarded as a first-order approximation of the actual processing in the auditory system, we will proceed with this assumption because it proves to give very satisfactory results in the context of audio coding with relatively simple means in terms of computational complexity.

Along similar lines as outlined above, the ISO MPEG model [1] assumes that the distortion or noise level that is allowed within a specific critical band is determined by the weighted power addition of all masker components spread on and around the critical band containing the distortion. The shape of the weighting function that is applied is based on auditory masking data and essentially reflects the underlying auditory filter properties. These “spectral-spreading”-based perceptual models have been used in various parametric coding schemes for sinusoidal component selection [5, 6, 13]. It should be noted that in these models, it is assumed that only the auditory filter centred around the distortion determines the detectability of the distortion. When the distortion-to-masker ratio is below a predefined threshold value in each auditory filter, the distortion is assumed to be inaudible. On the other hand, when one single filter exceeds this threshold value, the distortion is assumed to be audible. This assumption is not in line with more recent insights in the psychoacoustical literature on masking and will later in the paper be shown to have a considerable impact on the predicted masking curves. Moreover, in the ISO MPEG model [1], a distinction is made between masking by noisy and tonal spectral components to be able to account for the difference in masking power of these signal types. For this purpose a tonality detector is required which, in the Layer I model, is based on a spectral peak detector.

Threshold measurements in the psychoacoustical literature consistently show that a detection threshold is not a rigid threshold. A rigid threshold would imply that if the signal to be detected were just above the detection threshold, the signal would always be detected, while it would never be detected when it was just below the threshold. Contrary to this pattern, it is observed in detection threshold measurements that the percentages of correct detection as a function of signal level follow a sigmoid psychometric function [22]. The detection threshold is defined as the level for which the signal is detected correctly with a certain probability of, typically, 70%–75%.

In various theoretical considerations, the shape of the psychometric function is explained by assuming that within the auditory system some variable, for example, the stimulus power at the output of an auditory filter, is observed. In addition, it is assumed that noise is present in this observation due to, for example, internal noise in the auditory system. When the internal noise is assumed to be Gaussian and additive, the shape of the sigmoid function can be predicted. For the case that a tone has to be detected within broadband noise, the assumption of a stimulus power measurement with additive Gaussian noise leads to good predictions of the psychometric function. When the increase in the stimulus power caused by the presence of the tonal signal is large compared to the standard deviation of the internal noise, high percentages of correct detection are expected, while the reverse is true for small increases in stimulus power. The ratio between the increase in stimulus power and the standard deviation of the internal noise is defined as the sensitivity index d′ and can be calculated from the percentage of correct responses of the subjects. This theoretical framework is based on signal detection theory and is described more extensively in, for example, [23].
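As an illustration of how d′ maps onto a percentage of correct responses, the following sketch assumes the equal-variance Gaussian internal-noise model and a two-interval forced-choice task (the choice of task paradigm is our assumption, not specified in the text), for which Pc = Φ(d′/√2).

```python
import math

def percent_correct_2afc(d_prime):
    """Percentage of correct responses in a two-interval forced-choice task
    under the equal-variance Gaussian internal-noise model: Pc = Phi(d'/sqrt(2))."""
    z = d_prime / math.sqrt(2.0)
    pc = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF at z
    return 100.0 * pc

for d in (0.5, 1.0, 2.0):
    print(f"d' = {d}: {percent_correct_2afc(d):.1f}% correct")   # ~63.8, 76.0, 92.1
```

With this mapping, d′ close to 1 lands in the 70%–75% region that is conventionally taken as the detection threshold.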

In several more recent studies it is shown that the audibility of distortion components is not determined solely by the critical band with the largest audible distortion [24, 25]. Buus et al. [24] performed listening tests where tone complexes had to be detected when presented in a noise masker. They first measured the threshold levels of several tones separately, each of which was presented simultaneously with wideband noise. Due to the specific spectral shape of the masking noise, thresholds for individual tones were found to be constant across frequency. In addition to the threshold measurements for a single tone, thresholds were also measured for a complex of 18 equal-level tones. The frequency spacing of the tones was such that each auditory critical band contained only a single tone. If the detectability of the tones was only determined by the filter with the best detectable tone, the complex of tones would be just audible when one individual component of the complex had the same level as the measured threshold level of the individual tones. However, the experiments showed that thresholds for the tone complex were considerably lower than expected based on the best-filter assumption, indicating that information is integrated across auditory filters.

In the paper by Buus et al. [24], a number of theoretical explanations are presented. We will discuss only the multiband detector model [23]. This model assumes that the changes in signal power at the output of each auditory filter are degraded by additive internal noise that is independent in each auditory filter. It is then assumed that an optimally weighted sum of the signal powers at the outputs of the various auditory filters is computed which serves as a new decision variable. Based on these assumptions, it can be shown that the sensitivity index of a tone complex, d′_total, can be derived from the individual sensitivity indices d′_n as follows:

    d'_{total} = \sqrt{ \sum_{n=1}^{K} d_n'^2 },    (1)

where K denotes the number of tones and where each individual sensitivity index is proportional to the tone-to-masker power ratio [22]. According to such a framework, each doubling of the number of auditory filters that can contribute to the detection process will lead to a reduction of 1.5 dB in threshold. The measured thresholds by Buus et al. are well in line with this prediction. In their experiments, the complex of 18 tones leads to a reduction of 6 dB in detection threshold as compared to the detection threshold of a single tone. Based on (1) a change of 6.3 dB was expected. More recently, Langhans and Kohlrausch [25] performed similar experiments with complex tones having a constant spacing of 10 Hz presented in a broadband noise masker, confirming that information is integrated across auditory filters. In addition, results obtained by van de Par et al. [26] indicate that also for bandpass noise signals that had to be detected against the background of wideband noise maskers, the same integration across auditory filters is observed.
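The numbers quoted above follow directly from (1) when all K tones are equally detectable and each d′_n is proportional to the tone-to-masker power ratio; a quick numerical check:

```python
import math

def per_tone_threshold_reduction_db(num_tones):
    """Reduction of the per-tone threshold predicted by (1) for K equally
    detectable tones: d'_total = sqrt(K) * d'_n, and d'_n is proportional to
    the tone-to-masker power ratio, so the per-tone power can drop by a
    factor sqrt(K), i.e. 5*log10(K) dB."""
    return 5.0 * math.log10(num_tones)

print(f"{per_tone_threshold_reduction_db(2):.1f} dB per doubling")    # 1.5 dB
print(f"{per_tone_threshold_reduction_db(18):.1f} dB for 18 tones")   # 6.3 dB
```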

As indicated, integration of information across a wide range of frequencies is found in auditory masking. Similarly, integration across time has been shown to occur in the auditory system. Van den Brink [27] investigated the detection of tones of variable duration that were presented simultaneously with a noise masker with a fixed duration that was always longer than that of the tone. Increasing the duration of the tone reduced the detection thresholds up to a duration of about 300 milliseconds. While this result is an indication of integration across time, it also shows that there is a limitation in the interval for which temporal integration occurs.

The above findings with respect to spectral and temporal integration of information in auditory masking have implications for audio coding which have not been considered in previous studies. On the one hand, this influences the masking properties of complex signals, as will be discussed in Section 5; on the other hand, it has implications for rate-distortion optimisation algorithms. To understand this, consider the case where for one particular frequency region a threshold level is determined for distortions that can be introduced by an audio coder. For another frequency region a threshold can be determined similarly. When both distortions are presented at the same time, the total distortion is expected to become audible due to the spectral integration given by (1). This is in contrast to the more conventional models, such as the ISO MPEG model [1], which would predict this simultaneous distortion to be inaudible.

The effect of spectral integration, of course, can easily be compensated for by reducing the level of the masking thresholds such that the total distortion will be inaudible. But, based on (1), assuming that it holds for masking by complex audio signals, there are many different solutions to this equation which lead to the same d′_total. In other words, many different distributions of distortion levels per spectral region will lead to the same total sensitivity index. However, not every distribution of distortion levels will lead to the same amount of bits spent by the audio coder. Thus, the concept of a masking curve which determines the maximum level of distortion allowed within each frequency region is too restrictive and can be expected to lead to suboptimal audio coders. In fact, spectral distortion can be shaped such that the associated bit rate is minimised. For more information the reader is referred to a study where these ideas were confirmed by listening tests [28].

Figure 1: Block diagram of the masking model (input spectrum x, outer- and middle-ear filter h_om, gammatone filters γ_i, internal-noise constant C_a, within-channel distortion detectabilities D_i, calibration constant C_s, and total distortion detectability D).

3. DESCRIPTION OF THE MODEL

In line with various state-of-the-art auditory models that have been presented in the psychoacoustical literature, for example, [29], the structure of the proposed model follows the various stages of auditory signal processing. In view of the computational complexity, the model is based on frequency domain processing and consequently neglects some parts of peripheral processing, such as the hair cell transformation which performs inherent nonlinear time-domain processing.

A block diagram of the model is given in Figure 1. The model input x is the frequency domain representation of a short windowed segment of audio. The window should lead to sufficient rejection of spectral side lobes in order to facilitate adequate spectral resolution of the auditory filters. The first stage of the model resembles the outer- and middle-ear transfer function h_om, which is related to the filtering of the ear canal and the ossicles in the middle ear. The transfer function is chosen to be the inverse of the threshold-in-quiet function h_tq. This particular shape is chosen to obtain an accurate prediction of the threshold-in-quiet function when no masker signal is present.

The outer- and middle-ear transfer function is followed by a gammatone filter bank (see, e.g., [30]) which resembles the filtering property of the basilar membrane in the inner ear. The transfer function of an nth-order gammatone filter has a magnitude spectrum that is approximated well by

    \gamma(f) = \left( 1 + \left( \frac{f - f_0}{k\,\mathrm{ERB}(f_0)} \right)^2 \right)^{-n/2},    (2)

where f_0 is the centre frequency of the filter, ERB(f_0) is the equivalent rectangular bandwidth of the auditory filter centred at f_0 as suggested by Glasberg and Moore [31], n is the filter order which is commonly assumed to be 4, and k = 2^{n-1}(n-1)!/(\pi\,(2n-3)!!), a factor needed to ensure that the filter indeed has the specified ERB. The centre frequencies of the filters are uniformly spaced on an ERB-rate scale and follow the bandwidths as specified by the ERB scale [31]. The power at the output of each auditory filter is measured and a constant C_a is added to this output as a means to limit the detectability of very weak signals at or below the threshold in quiet.
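A small sketch of (2), using the Glasberg and Moore ERB expression ERB(f0) = 24.7(4.37 f0/1000 + 1) Hz; the ERB formula is quoted from the literature [31] rather than from the text above, and the printed values are only spot checks.

```python
import math

def erb(f0):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore [31])."""
    return 24.7 * (4.37 * f0 / 1000.0 + 1.0)

def gammatone_gain(f, f0, n=4):
    """Magnitude response of an nth-order gammatone filter, eq. (2)."""
    # k = 2^(n-1) (n-1)! / (pi (2n-3)!!) ensures the filter has the specified ERB.
    double_factorial = 1
    for m in range(2 * n - 3, 0, -2):
        double_factorial *= m
    k = 2 ** (n - 1) * math.factorial(n - 1) / (math.pi * double_factorial)
    return (1.0 + ((f - f0) / (k * erb(f0))) ** 2) ** (-n / 2.0)

print(round(erb(1000.0), 1))                      # about 132.6 Hz (cf. Section 5)
print(round(gammatone_gain(1100.0, 1000.0), 3))   # response 100 Hz above the centre
```

For n = 4 the factor k evaluates to about 1.019, the value commonly used for fourth-order gammatone filters.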

In the next stage, within-channel distortion detectabilities are computed and are defined as the ratios between the distortion and the masker-plus-internal noise seen at the output of each auditory filter. In fact, the within-channel distortion detectability D_i is proportional to the sensitivity index d′ as described earlier. This is an important step; the distortion detectability (or d′) will be used as a measure of perceptual distortion. This perceptual distortion measure can be interpreted as a measure of the probability that subjects can detect a distortion signal in the presence of a masking signal. The masker power within the ith filter due to an original (masking) signal x is given by

    M_i = \frac{1}{N} \sum_f |h_{om}(f)|^2\, |\gamma_i(f)|^2\, |x(f)|^2,    (3)

where N is the segment size in number of samples. Equivalently, the distortion power within the ith filter due to a distortion signal ε is given by

    S_i = \frac{1}{N} \sum_f |h_{om}(f)|^2\, |\gamma_i(f)|^2\, |\varepsilon(f)|^2.    (4)

Note that (1/N)|x(f)|^2 denotes the power spectral density of the original, masking signal in sound pressure level (SPL) per frequency bin, and similarly (1/N)|ε(f)|^2 is the power spectral density of the distorting signal. The within-channel distortion detectability D_i is given by

    D_i = \frac{S_i}{M_i + (1/N)\,C_a}.    (5)

From this equation two properties of the within-channel distortion detectability D_i can be seen. When the distortion-to-masker ratio S_i/M_i is kept constant while the masker power is much larger than (1/N)C_a, the distortion detectability is also constant. In other words, at medium and high masker levels the detectability D_i is mainly determined by the distortion-to-masker ratio. Secondly, when the masker power is small compared to (1/N)C_a, the distortion detectability is independent of the masker power, which resembles the perception of signals near the threshold in quiet.
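A tiny numeric illustration of these two regimes of (5), with arbitrary placeholder values for C_a and N:

```python
# Illustration of the two regimes of eq. (5), D_i = S_i / (M_i + C_a / N),
# using arbitrary placeholder values for C_a and N.
Ca, N = 1e3, 1024

def within_channel_detectability(Si, Mi):
    return Si / (Mi + Ca / N)

# High masker level, fixed distortion-to-masker ratio: D_i stays (almost) constant.
print(within_channel_detectability(1.0, 100.0),
      within_channel_detectability(10.0, 1000.0))
# Masker power far below C_a/N: D_i no longer depends on the masker level,
# resembling detection near the threshold in quiet.
print(within_channel_detectability(0.1, 1e-6),
      within_channel_detectability(0.1, 1e-8))
```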

In line with the multiband energy detector model [23], we assume that within-channel distortion detectabilities D_i are combined into a total distortion detectability by an additive operation. However, we do not add the squared sensitivity indices as in (1), but we simply add the indices directly. Although this may introduce inaccuracies, these will later turn out to be small. A benefit of this choice is that the distortion measure that will be derived from this assumption will have properties that allow a computationally simple formulation of the model (see (11)). In addition, recent results [26] show that at least for the detection of closely spaced tones (20 Hz spacing) masked by noise, the reduction in thresholds when increasing the signal bandwidth is more in line with a direct addition of distortion detectabilities than with (1). Therefore, we state that

    D(x, \varepsilon) = C_s L_{eff} \sum_i D_i    (6)

                      = C_s L_{eff} \sum_i \sum_f \frac{|h_{om}(f)|^2\, |\gamma_i(f)|^2\, |\varepsilon(f)|^2}{N M_i + C_a},    (7)

where D(x, ε) is the total distortion detectability as it is predicted for a human observer given an original signal x and a distortion signal ε. The calibration constant C_s is chosen such that D = 1 at the threshold of detectability. To account for the dependency of distortion detectability on the duration of the distortion signal (in line with [27]), a scaling factor L_eff is introduced, defined as

    L_{eff} = \min\left( \frac{L}{300\ \mathrm{ms}},\, 1 \right),    (8)

where L is the segment duration in milliseconds. Equation (8) resembles the temporal integration time of the human auditory system, which has an upper bound of 300 milliseconds [27].

Footnote 1: An alternative definition would be to state that L_eff = N, the total duration of the segment in number of samples. According to this definition it is assumed that distortions are integrated over the complete excerpt at hand, which is not in line with perceptual masking data, but which in our experience still leads to very satisfactory results [32].

Equation (7) gives a complete description of the model. However, it defines only a perceptual distortion measure and not a masking curve such as is widely used in audio coding, nor a masked threshold such as is often used in psychoacoustical experiments.

In order to derive a masked threshold, we assume that the distortion signal ε(f) = A·ε̂(f). Here, A is the amplitude of the distortion signal and ε̂ the normalised spectrum of the distortion signal such that ‖ε̂‖² = 1, which is assumed to correspond to a sound pressure level of 0 dB. Without yet making an assumption about the spectral shape of ε̂, we can derive that, assuming that D = 1 at the threshold of detectability, the masked threshold A² for the distortion signal ε̂ is given by

    \frac{1}{A^2} = C_s L_{eff} \sum_i \sum_f \frac{|h_{om}(f)|^2\, |\gamma_i(f)|^2\, |\hat{\varepsilon}(f)|^2}{N M_i + C_a}.    (9)

When deriving a masking curve it is important to consider exactly what type of signal is masked. When a masking model is used in the context of a waveform coder, the distortion signal introduced by the coder is typically assumed to consist of bands of noise. For a sinusoidal coder, however, the distortion signal contains the sinusoids that are rejected by the perceptual model. Thus, the components of the distortion signal are in fact more sinusoidal in nature. Assuming now that a distortion component is present in only one bin of the spectrum, we can derive the masked thresholds for sinusoidal distortions. We assume that ε(f) = v(f_m)δ(f − f_m), with v(f_m) being the sinusoidal amplitude and f_m the sinusoidal frequency. Together with the assumption that D = 1 at the threshold of detectability, v can be derived such that the distortion is just not detectable. In this way, by varying f_m over the entire frequency range, v² constitutes the masking curve for sinusoidal distortions in the presence of a masker x. By substituting the above assumptions in (7) we obtain

    \frac{1}{v^2(f_m)} = C_s L_{eff} \sum_i \frac{|h_{om}(f_m)|^2\, |\gamma_i(f_m)|^2}{N M_i + C_a}.    (10)

Substituting (10) in (7), we get

    D(x, \varepsilon) = \sum_f \frac{|\varepsilon(f)|^2}{v^2(f)}.    (11)

This expression shows that the computational load for calculating the perceptual distortion D(x, ε) can be very low once the masking curve v² has been calculated. This simple form of the perceptual distortion, such as given in (11), arises due to the specific choice of the addition as defined in (6).
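The following NumPy sketch puts (2), (3), (8), (10), and (11) together for one windowed segment. It assumes already-calibrated constants C_a and C_s, a precomputed squared outer/middle-ear weighting |h_om(f)|² on the FFT grid, and a particular filter density and ERB-rate spacing rule; these are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def erb(f):                                   # ERB in Hz, Glasberg and Moore [31]
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def erb_rate(f):                              # ERB-rate scale (assumed spacing rule)
    return 21.4 * np.log10(0.00437 * f + 1.0)

def erb_rate_inv(e):
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def gammatone_gain2(f, f0, n=4, k=1.019):     # |gamma_i(f)|^2 from eq. (2)
    return (1.0 + ((f - f0) / (k * erb(f0))) ** 2) ** (-n)

def masking_curve(x, fs, hom2, Ca, Cs, n_filters=64):
    """Masking curve v^2(f) for sinusoidal distortions, eqs. (3), (8), (10).
    x: windowed time-domain segment; hom2: |h_om(f)|^2 on the rFFT grid;
    Ca, Cs: calibration constants (assumed already known)."""
    N = len(x)
    f = np.fft.rfftfreq(N, 1.0 / fs)
    X2 = np.abs(np.fft.rfft(x)) ** 2                     # |x(f)|^2
    Leff = min((N / fs) / 0.3, 1.0)                      # eq. (8), 300 ms limit
    f0s = erb_rate_inv(np.linspace(erb_rate(50.0), erb_rate(0.45 * fs), n_filters))
    inv_v2 = np.zeros_like(f)
    for f0 in f0s:
        g2 = gammatone_gain2(f, f0)                      # |gamma_i(f)|^2
        Mi = np.sum(hom2 * g2 * X2) / N                  # eq. (3)
        inv_v2 += hom2 * g2 / (N * Mi + Ca)              # summand of eq. (10)
    return 1.0 / (Cs * Leff * inv_v2)

def distortion_detectability(eps2, v2):
    """Perceptual distortion D = sum_f |eps(f)|^2 / v^2(f), eq. (11)."""
    return float(np.sum(eps2 / v2))
```

Once v² has been computed for a segment, evaluating D for any candidate distortion spectrum reduces to the single weighted sum in distortion_detectability, which is what makes (11) attractive inside an optimisation loop.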

4. CALIBRATION OF THE MODEL

For the purpose of calibration of the model, the constants C_a for absolute thresholds and C_s for the general sensitivity of the model in (7) need to be determined. This will be done using two basic findings from the psychoacoustical literature, namely the threshold in quiet and the just noticeable difference (JND) in level of about 0.5–1 dB for sinusoidal signals [33].

When considering the threshold in quiet, we assume that the masking signal is equal to zero, that is, x = 0, and that the just detectable sinusoidal distortion signal is given by ε(f) = h_tq(f_m)δ(f − f_m) for some f_m, where h_tq is the threshold-in-quiet curve. By substituting these assumptions in (7) (assuming that D = 1 corresponds to a just detectable distortion signal), we obtain

    C_a = C_s L_{eff} \sum_i |\gamma_i(f_m)|^2.    (12)

Note that (12) only holds if \sum_i |\gamma_i(f_m)|^2 is constant for all f_m, which is approximately true for gammatone filters.

We assume a 1 dB JND, which corresponds to a masking condition where a sinusoidal distortion is just detectable in the presence of a sinusoidal masker at the same frequency, say f_m. For this to be the case, the distortion level has to be 18 dB lower than the masker level, assuming that the masker and distortion are added in phase. This specific phase assumption is made because it leads to similar thresholds as when the masker and signal are slightly off-frequency with respect to one another, the case which is most likely to occur in audio coding contexts. We therefore assume that the masker signal is x(f) = A_70 δ(f − f_m) and the distortion signal ε(f) = A_52 δ(f − f_m), with A_70 and A_52 being the amplitudes for a 70 and 52 dB SPL sinusoidal signal, respectively. Using (3) and (7), this leads to the expression

    \frac{1}{C_s} = L_{eff} \sum_i \frac{|h_{om}(f_m)|^2\, |\gamma_i(f_m)|^2\, A_{52}^2}{|h_{om}(f_m)|^2\, |\gamma_i(f_m)|^2\, A_{70}^2 + C_a}.    (13)

When (12) is substituted into (13), an expression is obtained where C_s is the only unknown. A numerical solution to this equation can be found using, for example, the bisection method (cf. [34]). A suitable choice for f_m would be f_m = 1 kHz, since it is in the middle of the auditory range. This calibration at 1 kHz does not significantly reduce the accuracy of the model at other frequencies. On the one hand, the incorporation of a threshold-in-quiet curve prefilter provides the proper frequency dependence of thresholds in quiet. On the other hand, JNDs do not differ much across frequency, both in the model predictions and in humans.
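A sketch of this calibration step, solving (12) and (13) jointly for C_s by bisection at f_m = 1 kHz. The threshold-in-quiet value at 1 kHz used to form |h_om|² is a placeholder, and the amplitude convention (unit squared amplitude corresponding to 0 dB SPL) follows the assumption stated before (9).

```python
import numpy as np

def calibrate(gamma2_fm, Leff=1.0, threshold_quiet_db_1k=3.0):
    """Solve eqs. (12) and (13) jointly for Cs (and Ca) by bisection at fm = 1 kHz.
    gamma2_fm: array of |gamma_i(fm)|^2 over all filters of the bank;
    threshold_quiet_db_1k: assumed threshold in quiet at 1 kHz (placeholder)."""
    hom2 = 10.0 ** (-threshold_quiet_db_1k / 10.0)     # |h_om|^2 = 1 / |h_tq|^2
    A70_sq = 10.0 ** (70.0 / 10.0)                     # squared amplitude, 70 dB SPL
    A52_sq = 10.0 ** (52.0 / 10.0)                     # squared amplitude, 52 dB SPL
    G = float(np.sum(gamma2_fm))                       # sum_i |gamma_i(fm)|^2

    def residual(Cs):                                  # zero when (13) is satisfied
        Ca = Cs * Leff * G                             # eq. (12)
        w = hom2 * gamma2_fm
        return Cs * Leff * np.sum(w * A52_sq / (w * A70_sq + Ca)) - 1.0

    lo, hi = 1e-12, 1e6                                # residual < 0 at lo, > 0 at hi
    for _ in range(200):                               # plain bisection (cf. [34])
        mid = np.sqrt(lo * hi)                         # geometric midpoint
        lo, hi = (lo, mid) if residual(mid) > 0 else (mid, hi)
    Cs = np.sqrt(lo * hi)
    return Cs, Cs * Leff * G                           # (Cs, Ca)
```

The input gamma2_fm could, for instance, be formed by evaluating the squared gammatone responses of the whole filter bank at 1 kHz, as in the earlier sketches.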

5. MODEL EVALUATION AND COMPARISON WITH PSYCHOACOUSTICAL DATA

To show the validity of the model, some basic psychoacoustical data from listening experiments will be compared to model predictions. We will consider two cases, namely sinusoids masked by noise and sinusoids masked by sinusoids.

Masking of sinusoids has been measured in several experiments for both (white) noise maskers [12, 35] and for sinusoidal maskers [36]. Figure 2a shows masking curves predicted by the model for a white noise masker with a spectrum level of 30 dB/Hz for a long duration signal (solid line) and a 200-millisecond signal (dashed line) with corresponding listening test data represented by circles [12] and asterisks [35], respectively. Figure 2b shows the predicted masking curve (solid line) for a 1 kHz 50 dB SPL sinusoidal masker along with corresponding measured masking data [36]. The model predictions are well in line with data for both sinusoidal and noise maskers, despite the fact that no tonality detector was included in the model such as is conventionally needed in masking models for audio coding (e.g., [1]). Only at lower frequencies, there is a discrepancy between the data for the noise masker and the predictions by the model. The reason for this discrepancy may be that in psychoacoustical studies, running noise generators are used to generate the masker signal rather than a single noise realisation, as it is done in audio coding applications. The latter case has, according to several studies, a lower masking strength [37]. This difference in masking strength is due to the inherent masker power fluctuations when a running noise is presented, which depends inversely on the product of time and bandwidth seen at the output of an auditory filter. The narrower the auditory filter (i.e., the lower its centre frequency), the larger these fluctuations will be and the larger the difference is expected to be.


Figure 2: (a) Masking curves predicted by the model for a white noise masker with a spectrum level of 30 dB/Hz for a long duration signal (solid line) and a 200-millisecond signal (dashed line), with corresponding listening test data represented by the circles [12] and asterisks [35], respectively. (b) Masking curves for a 1 kHz 50 dB SPL sinusoidal masker. The dashed line is the threshold in quiet. Circles show data from [36]. (Both panels plot masked threshold in dB SPL against frequency in Hz.)

As can be seen in Figure 2, the relatively weaker masking power of a sinusoidal signal is predicted well by the model without the need for explicit assumptions about the tonality of the masker such as those included in, for example, the ISO MPEG model [1]. Indeed, in the case of a noise masker (Figure 2a), the masker power within the critical band centred around 1 kHz (bandwidth 132 Hz) is approximately 51.2 dB SPL, whereas the sinusoidal masker (Figure 2b) has a power of 50 dB SPL. Nevertheless, predicted detection thresholds are considerably lower for the sinusoidal masker (35 dB SPL) than for the noise masker (45 dB SPL). The reason why the model is able to predict these data well is that for the tonal masker, the distortion-to-masker ratio is constant over a wide range of auditory filters. Due to the addition of within-channel distortion detectabilities, the total distortion detectability will be relatively large. In contrast, for a noise masker, only the filter centred on the distortion component will contribute to the total distortion detectability because the off-frequency filters have a very low distortion-to-masker ratio. Therefore, the wideband noise masker will have a stronger masking effect. Note that for narrowband noise signals, the predicted masking power, in line with the argumentation for a sinusoidal masker, will also be weak. This, however, seems to be too conservative [38].

Figure 3: Masked thresholds predicted by the model (solid line) and psychoacoustical data (circles) [25], plotted as a function of the number of components. Masked thresholds are expressed in dB SPL per component.

A specific assumption in this model is the integration of distortion detectabilities over a wide range of auditory filters. This should allow the model to predict correctly the threshold difference between narrowband distortion signals and more wideband distortion signals. For this purpose an experiment is considered where a complex of tones had to be detected in the presence of masking noise [25]. The tone complex consisted of equal-level sinusoidal components with a frequency spacing of 10 Hz centred around 400 Hz. The masker was a 0–2 kHz noise signal with an overall level of 80 dB SPL. The number of components in the complex was varied from one up to 41. The latter case corresponds to a bandwidth of 400 Hz, which implies that the tone complex covers more than one critical band. Equation (9) was used to derive masked thresholds. As can be seen in Figure 3, there is a good correspondence between the model predictions and the data from [25]. Therefore, it seems that the choice of the linear addition that was made in (6) did not lead to large discrepancies between psychoacoustical data and model predictions.

To conclude this section, a comparison is made between predictions of the MPEG-1 Layer I model [1] and the model presented in this study, which incorporates spectral integration in masking. The MPEG model is one of a family of models used in audio coding that are based on spectral-spreading functions to model spectral masking. When the masking of a narrowband distortion signal is considered, it is assumed that the auditory filter that is spectrally centred on this distortion signal determines whether the distortion is audible or not. When the energy ratio between distortion signal and masking signal as seen at the output of this auditory filter is smaller than a certain criterion value, the distortion is inaudible. In this manner the maximum allowable distortion signal level at each frequency can be determined, which constitutes the masking curve. An efficient implementation for calculating this masking curve is a convolution between the masker spectrum and a spreading function, both represented on a Bark scale. The Bark scale is a perceptually motivated frequency scale similar to the ERB-rate scale [39].
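For contrast, here is a schematic sketch of a spreading-based masking curve of this kind: every masker bin contributes a triangular skirt on a Bark axis and the contributions are power-added. The Bark formula is one common approximation, and the slopes and offset are placeholders, not the actual ISO MPEG-1 Layer I values.

```python
import numpy as np

def hz_to_bark(f):
    """One common Bark-scale approximation (Traunmueller/Zwicker style)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def spreading_masking_curve_db(freqs_hz, masker_power_db,
                               lower_slope=27.0, upper_slope=10.0, offset_db=15.0):
    """Illustrative spreading-based masking curve: every masker bin contributes
    a triangular skirt on the Bark axis and the contributions are power-added.
    The slopes (dB/Bark) and offset are placeholders, not the ISO MPEG values."""
    z = hz_to_bark(np.asarray(freqs_hz, dtype=float))
    total_lin = np.zeros_like(z)
    for z_m, p_m in zip(z, masker_power_db):
        dz = z - z_m
        skirt_db = np.where(dz < 0.0, lower_slope * dz, -upper_slope * dz)
        total_lin += 10.0 ** ((p_m - offset_db + skirt_db) / 10.0)
    return 10.0 * np.log10(total_lin + 1e-12)
```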

The spectral integration model presented here does notconsider only a single auditory filter to contribute to thedetection of distortions, but potentially a whole range of

Page 18: Anthropomorphic Processing of Audio and Speechdownloads.hindawi.com/journals/specialissues/807173.pdf · Anthropomorphic Processing of Audio and Speech Guest Editors: Werner Verhelst,

Perceptual Model for Sinusoidal Audio Coding 1299

Figure 4: Masked thresholds predicted by the spectral integration model (dashed line) and the ISO MPEG model (solid line), plotted as level (dB) versus frequency (Hz). The masking spectrum (dotted line) is for (a) a 1 kHz sinusoidal signal and (b) a short segment of a harpsichord signal.

This can have a strong impact on the predicted masking curves. Figure 4a shows the masking curves for a sinusoidal masker at 1 kHz for the MPEG model (solid line) and the spectral integration model (dashed line). The spectrum of the sinusoidal signal is also plotted (dotted line), but scaled down for visual clarity. As can be seen, there is a reasonable match between both models, showing some differences at the tails. Figure 4b shows the masking curves in a similar way, but now for a complex spectrum (part of a harpsichord signal). It can be seen that the masking curves differ systematically, the spectral integration model producing much smoother masking curves than the MPEG model. For the spectral integration model, masking curves are considerably higher in spectral valleys. This effect is a direct consequence of the spectral integration assumption that was adopted in our model (cf. (6)). In the spectral valleys of the masker, distortion signals can only be detected using the auditory filter centred on the distortion, which will lead to relatively high masked thresholds. This is so because off-frequency filters will be dominated by the masker spectrum. However, detection of distortion signals at the spectral peaks of the masker is mediated by a range of auditory filters centred around the peak, resulting in relatively low masked thresholds. In this case, the off-frequency filters will reveal similar distortion-to-masker ratios as the on-frequency filter. Thus, in the model proposed here, detection differences between peaks and troughs are smaller, resulting in smoother masking curves as compared to those observed in a spreading-based model such as the ISO MPEG model.

This smoothing effect is observed systematically in complex signal spectra typically encountered in practical situations and represents the main difference between the spectral integration model presented here and existing spreading-based models.

6. APPLICATION TO SINUSOIDAL MODELLING

Sinusoidal modelling has proven to be an efficient technique for the purpose of coding speech signals [40]. More recently, it has been shown that this method can also be exploited for low-rate audio coding, for example, [41, 42, 43]. To account for the time-varying nature of the signal, the sinusoidal analysis/synthesis is done on a segment-by-segment basis, with each segment being modelled as a sum of sinusoids. The sinusoidal parameters have been selected with a number of methods, including spectral peak-picking [44], analysis-by-synthesis [41, 43], and subspace-based methods [42].

In this section we describe an algorithm for selecting sinusoidal components using the psychoacoustical model described in the previous section. The algorithm is based on the matching pursuit algorithm [45], a particular analysis-by-synthesis method. Matching pursuit approximates a signal by a finite expansion into elements (functions) chosen from a redundant dictionary. In the example of sinusoidal modelling, one can think of such functions as (complex) exponentials or as real sinusoidal functions. Matching pursuit is a greedy, iterative algorithm which searches the dictionary for the function that best matches the signal and subtracts this function (properly scaled) to form a residual signal to be approximated in the next iteration.

In order to determine which is the best matching function or dictionary element at each iteration, we need to formalise the problem. To do so, let D = (g_ξ)_{ξ∈Γ} be a complete dictionary, that is, a set of elements indexed by ξ ∈ Γ, where Γ is an arbitrary index set. As an example, consider a dictionary consisting of complex exponentials g_ξ = e^{i2πξ(·)}. In this case, the index set Γ is given by Γ = [0, 1). Obviously, the indexing parameter ξ is nothing more than the frequency of the complex exponential. Given a dictionary D, the best matching function can be found by, for each and every function, computing the best approximation and selecting that function whose corresponding approximation is "closest" to the original signal.

In order to facilitate the following discussion, we assume without loss of generality that ‖g_ξ‖ = 1 for all ξ. Given a particular function g_ξ, the best possible approximation of the signal x is obtained by the orthogonal projection of x onto the subspace spanned by g_ξ (see Figure 5). This projection is given by ⟨x, g_ξ⟩g_ξ. Hence, we can decompose x as

\[
x = \langle x, g_\xi \rangle g_\xi + Rx, \tag{14}
\]

where Rx is the residual signal after subtracting the projection ⟨x, g_ξ⟩g_ξ. The orthogonality of Rx and g_ξ implies that

\[
\|x\|^2 = \left|\langle x, g_\xi \rangle\right|^2 + \|Rx\|^2. \tag{15}
\]



Figure 5: Orthogonal projection of x onto span(g_ξ), decomposing x into ⟨x, g_ξ⟩g_ξ and the residual Rx.

We can do this decomposition for each and every dictionary element, and the best matching one is then found by selecting the element g_ξ′ for which ‖Rx‖ is minimal or, equivalently, for which |⟨x, g_ξ⟩| is maximal. A precise mathematical formulation is

\[
\xi' = \arg \sup_{\xi \in \Gamma} \left|\langle x, g_\xi \rangle\right|. \tag{16}
\]

It must be noted that the matching pursuit algorithm is only optimal for a particular iteration. If we subtract the approximation to form a residual signal and approximate this residual in a similar way as we approximated the original signal, then the two dictionary elements thus obtained are not jointly optimal; it is in general possible to find two different elements which together form a better approximation. This is a direct consequence of the greedy nature of the algorithm. The two dictionary elements which together are optimal could be obtained by projecting the signal x onto all possible two-dimensional subspaces. This, however, is in general computationally very complex. An alternative solution to this problem is to apply, after each iteration, a Newton optimisation step [46].
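As a minimal illustration of the greedy loop just described (not the authors' implementation), the sketch below runs plain matching pursuit over an explicit dictionary whose columns are unit-norm, Hanning-windowed real sinusoids; the dictionary construction and the fixed iteration count are arbitrary choices.

```python
import numpy as np

def matching_pursuit(x, dictionary, num_iterations):
    """Greedy matching pursuit over unit-norm dictionary columns.

    Returns the selected coefficients, the column indices, and the residual.
    """
    residual = np.asarray(x, dtype=float).copy()
    coeffs, indices = [], []
    for _ in range(num_iterations):
        # Inner products of the residual with every dictionary element.
        correlations = dictionary.T @ residual
        best = int(np.argmax(np.abs(correlations)))          # cf. eq. (16)
        coeffs.append(correlations[best])
        indices.append(best)
        # Subtract the orthogonal projection onto the selected element, cf. eq. (14).
        residual = residual - correlations[best] * dictionary[:, best]
    return np.array(coeffs), np.array(indices), residual

# Example: a dictionary of Hanning-windowed cosines with random normalized
# frequencies and phases (purely illustrative).
N, M = 1024, 4000
rng = np.random.default_rng(0)
freqs = rng.uniform(0.0, 0.5, M)            # cycles per sample
phases = rng.uniform(0.0, 2.0 * np.pi, M)
n = np.arange(N)
D = np.hanning(N)[:, None] * np.cos(2.0 * np.pi * freqs[None, :] * n[:, None] + phases[None, :])
D /= np.linalg.norm(D, axis=0)              # enforce unit norm per column
x = 3.0 * D[:, 123] + 0.1 * rng.standard_normal(N)
c, idx, r = matching_pursuit(x, D, num_iterations=5)
```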

To account for human auditory perception, the unit-norm dictionary elements can be scaled [43], which is equivalent to scaling the inner products in (16). We will refer to this method as the weighted matching pursuit (WMP) algorithm. While this method performs well, it can be shown that it does not provide a consistent selection measure for elements of finite time support [47]. Rather than scaling the dictionary elements, we introduce a matching pursuit algorithm where psychoacoustical properties are accounted for by a norm on the signal space. We will refer to this method as psychoacoustical matching pursuit (PAMP). As mentioned in Section 3 (see (11)), the perceptual distortion can be expressed as

\[
D = \sum_f \frac{\left|\varepsilon(f)\right|^2}{v^2(f)} = \sum_f a(f)\left|\varepsilon(f)\right|^2, \tag{17}
\]

where a = v^{-2}. It follows from (10) that

\[
a(f) = C_s L_{\mathrm{eff}} \sum_i \frac{\left|h_{\mathrm{om}}(f)\right|^2 \left|\gamma_i(f)\right|^2}{N_{M_i} + C_a}. \tag{18}
\]

Figure 6: Perceptual distortion associated with the residual signal after sinusoidal modelling as a function of the number of sinusoidal components that were extracted.

By inspection of (18), we conclude that a is real and positive so that, in fact, the perceptual distortion measure (17) defines a norm

\[
\|x\|^2 = \sum_f a(f)\left|x(f)\right|^2. \tag{19}
\]

This norm is induced by the inner product

\[
\langle x, y \rangle = \sum_f a(f)\,x(f)\,y^*(f), \tag{20}
\]

facilitating the use of the distortion measure in selecting the perceptually best matching dictionary element in a matching pursuit algorithm. In Figure 6, the perceptual distortion associated with the residual signal is shown as a function of the number of real-valued sinusoids that have been extracted for a short segment of a harpsichord excerpt (cf. (11)). As can be seen, the perceptually most relevant components are selected first, resulting in a fast reduction of the perceptual distortion for the first components. For a detailed description, the reader is referred to [47, 48]. The fact that the distortion detectability defines a norm on the underlying signal space is important, since it allows for incorporating psychoacoustics in optimisation algorithms. Indeed, rather than minimising the commonly used l2-norm, we can minimise the perceptually relevant norm given by (19). Examples include rate-distortion optimisation [32], linear predictive coding [49], and subspace-based modelling techniques [50].
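The following sketch shows how a weighting a(f) as in (17)–(20) can be used as a frequency-domain inner product when selecting dictionary elements, which is the essence of PAMP. The weighting values, dictionary, and normalisation details are placeholders rather than the exact implementation of [47, 48].

```python
import numpy as np

def perceptual_inner_product(x, y, a):
    """<x, y> = sum_f a(f) X(f) Y*(f), cf. eqs. (17)-(20); a holds the inverse
    of the squared masking threshold per rfft bin and is an input here."""
    X, Y = np.fft.rfft(x), np.fft.rfft(y)
    return np.sum(a * X * np.conj(Y)).real

def perceptual_norm(x, a):
    return np.sqrt(perceptual_inner_product(x, x, a))

def pamp_select(residual, dictionary, a):
    """One PAMP selection step (sketch): pick the dictionary column that gives
    the largest reduction of the perceptual norm of the residual."""
    best_idx, best_gain, best_score = -1, 0.0, -1.0
    for i in range(dictionary.shape[1]):
        g = dictionary[:, i]
        g_norm2 = perceptual_inner_product(g, g, a)
        corr = perceptual_inner_product(residual, g, a)
        # score**2 equals the decrease in squared perceptual norm obtained by
        # subtracting the projection (corr / g_norm2) * g from the residual.
        score = abs(corr) / np.sqrt(g_norm2)
        if score > best_score:
            best_idx, best_gain, best_score = i, corr / g_norm2, score
    return best_idx, best_gain
```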

7. COMPARISON WITH THE ISO MPEG MODEL IN A LISTENING TEST

In this section we assess the performance of the proposed perceptual model in the context of sinusoidal parameter estimation. The PAMP method for estimating perceptually relevant sinusoids relies on the weighting function a which, by definition, is the inverse of the masking curve. Equation (18) describes how to compute the masking curve for the proposed perceptual model. We compare the use of the proposed perceptual model in PAMP to the situation where the masking curve is computed using the MPEG-1 Layer I-II (ISO/IEC 11172-3) psychoacoustical model [1]. There are several reasons for comparing with the MPEG psychoacoustic model: it provides a well-known



reference and, because of its frequent application, it is still a de facto state-of-the-art model.

Using the MPEG-1 psychoacoustic model masking curve directly in the PAMP algorithm for sinusoidal extraction is not reasonable, because the MPEG-1 psychoacoustic model was developed to predict the masking curve in the case of noise maskees (distortion signals). It predicts for every frequency bin how much distortion can be added within the critical band centred around that frequency bin. This prediction is, however, too conservative in the case that distortions are sinusoidal in nature, since in this case the distortion energy is not spread over a complete critical band but is concentrated in one frequency bin only. Hence, we can adapt the MPEG-1 model by scaling the masking function with the critical bandwidth such that the model now predicts the detection thresholds in the case of sinusoidal distortion. The net effect of this compensation procedure is an increase of the masking curve at high frequencies by about 10 dB, thereby de-emphasizing high-frequency regions during sinusoidal estimation. In fact, this masking power increase at higher frequencies reduces the gap between the masking curves of the ISO MPEG model and the proposed model (cf. Figure 4). By applying this modification to the ISO MPEG model, and by extending the FFT order to the size of the PAMP dictionary, it is suited to be used in the PAMP method. The dictionary elements in our implementation of the PAMP method were real-valued sinusoidal functions windowed with a Hanning window, identical to the window used in the analysis-synthesis procedure described below.

In the following, we present results obtained by listening tests with audio signals. The signals are mono, sampled at 44.1 kHz, where each sample is represented by 16 bits. The test excerpts are Carl Orff, Castanet, Celine Dion, Harpsichord Solo, contemporary pop music, and Suzanne Vega.

The excerpts were segmented into fixed-length frames of 1024 samples (corresponding to 23.2 milliseconds) with an overlap of 50% between consecutive frames using a Hanning window. For each signal frame, a fixed number of perceptually relevant sinusoids was extracted using the PAMP method described above, where the perceptual weighting functions a were generated from masking curves derived from the proposed perceptual model (see (18)) and the modified MPEG model described above, respectively. For the MPEG model we made use of the recommendations of MPEG Layer II, since these support input frame lengths of 1024 samples. The masking curves were calculated from the Hanning-windowed original signal contained within the same frame that is being modelled using the PAMP method. Finally, modelled frames were synthesized from the estimated sinusoidal parameters and concatenated to form modelled test excerpts, using a Hanning window-based overlap-add procedure.
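A sketch of this segmentation and synthesis procedure is given below: 1024-sample Hanning-windowed frames with 50% overlap are modelled independently and recombined by overlap-add. The `model_frame` callable is a placeholder for the per-frame sinusoidal analysis/synthesis step; the exact window placement in the coder may differ.

```python
import numpy as np

FRAME_LEN = 1024          # 23.2 ms at 44.1 kHz
HOP = FRAME_LEN // 2      # 50% overlap

def analysis_synthesis(x, model_frame):
    """Hanning-windowed, 50%-overlap analysis/synthesis by overlap-add."""
    w = np.hanning(FRAME_LEN)
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    y = np.zeros(len(x))
    for k in range(n_frames):
        start = k * HOP
        frame = w * x[start:start + FRAME_LEN]
        # model_frame maps a windowed frame to its modelled version,
        # e.g. a sum of extracted sinusoids (placeholder here).
        y[start:start + FRAME_LEN] += model_frame(frame)
    return y

# With an identity "model", the overlapping Hanning windows sum to roughly a
# constant, so the input is approximately reconstructed (aside from the edges).
x = np.random.randn(44100)
y = analysis_synthesis(x, model_frame=lambda f: f)
```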

To evaluate the performance of the proposed method, we used a subjective listening test procedure which is somewhat comparable to the MUSHRA test (multistimulus test with hidden reference and anchors) [51]. For each test excerpt, listeners were asked to rank 6 different versions: 4 excerpts modelled using the modified MPEG masking curve and fixed

Table 1: Scores used in subjective test.

Score   Equivalent
  5     Best
  4     Good
  3     Medium
  2     Poor
  1     Poorest

model orders (i.e., the number of sinusoidal components per segment) of K = 20, 25, 30, and K = 35, and one excerpt modelled using the proposed perceptual model with K = 25. In addition, to have a low-quality reference signal, an excerpt modelled with K = 30, but using the unmodified MPEG masking curve, was included. As a reference, the listeners had the original excerpt available as well, which was identified to the subjects. Unlike the MUSHRA test, no hidden reference and no anchors were presented to the listeners.

The test excerpts were presented in a "parallel" way, using the interactive benchmarking tool described in [52] as an interface to the listeners. For each excerpt, listeners were requested to rank the different modelled signals on a scale from 1 to 5 (in steps of 0.1) as outlined in Table 1. The listeners were instructed to use the complete scale such that the poorest-quality excerpt was rated with 1 and the highest-quality excerpt with 5. The excerpts were presented through high-quality headphones (Beyer-Dynamic DT990 PRO) in a quiet room, and the listeners could listen to each signal version as often as needed to determine the ranking. A total of 12 listeners participated in the listening test, of which 6 listeners worked in the area of acoustic signal processing and had previously participated in such tests. The authors did not participate in the test.

Figure 7 shows the overall scores of the listening test, averaged across all listeners and excerpts. The circles represent the median score, and the error bars depict the 25 and 75 percent ranges of the total response distributions. As can be seen, the excerpts generated with the proposed perceptual model (SiCAS@25) show better average subjective performance than any of the excerpts based on the MPEG psychoacoustic model, except for the MPEG case using a fixed model order of 35 (MPEG@35). As expected, the MPEG-based excerpts have decreasing quality scores for decreasing model order. Furthermore, the low-quality anchor (MPEG@30nt, i.e., the MPEG model without spectral tilt modification) received the lowest quality score on average. The statistical difference between the quality scores was analysed using a paired t-test with a significance level of p < 0.01, applied to the score differences between the proposed perceptual model and each of the MPEG-based methods. The H0 hypothesis was that the mean of this difference distribution was zero (µ∆ = 0), while the alternative hypothesis H1

was that µ∆ > 0. The statistical analysis supports the quality ordering suggested by Figure 7. In particular, there is a statistically significant improvement in using the proposed perceptual model (SiCAS@25) over any of the MPEG-based methods, except for MPEG@35, which performs better than SiCAS@25 (p < 7.0 · 10^-3).
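For illustration, the sketch below carries out such a one-sided paired t-test on score differences; the listener scores used here are invented placeholders, not the actual test data.

```python
import numpy as np
from scipy.stats import t as student_t

def paired_t_test_one_sided(scores_a, scores_b):
    """Test H0: mean(a - b) = 0 against H1: mean(a - b) > 0."""
    d = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    n = len(d)
    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    p_value = student_t.sf(t_stat, df=n - 1)   # upper-tail probability
    return t_stat, p_value

# Hypothetical scores for 12 listeners (placeholders only).
sicas_25 = np.array([4.1, 3.9, 4.3, 4.0, 3.8, 4.2, 4.4, 3.7, 4.0, 4.1, 3.9, 4.2])
mpeg_25  = np.array([3.6, 3.8, 3.9, 3.5, 3.4, 3.9, 4.0, 3.3, 3.7, 3.8, 3.5, 3.9])
t_stat, p = paired_t_test_one_sided(sicas_25, mpeg_25)
print(f"t = {t_stat:.2f}, one-sided p = {p:.4f}")   # reject H0 if p < 0.01
```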



Figure 7: Subjective test results (scores from 1, poorest, to 5, best) averaged across all listeners and excerpts for the conditions SiCAS@25, MPEG@35, MPEG@30, MPEG@25, MPEG@20, and MPEG@30nt.

In fact, the model presented here leads to a reduction of more than 20% in the number of sinusoids needed to represent signals at a given quality level.

As mentioned already in Section 5, the most relevant difference between the proposed model and the ISO MPEG model is the incorporation of spectral integration properties in the proposed model. This leads to systematically smoother masking curves, as predicted by our model for complex masker spectra (cf. Figure 4). The effect of this is that fewer sinusoidal components are used for modelling spectral valleys of a signal with the proposed perceptual model as compared to the ISO MPEG model. We think that this difference accounts for the improvement in modelling efficiency that we observed in the listening tests, and we expect that similar improvements would have been observed if our approach were compared to other perceptual models that are based on the spectral-spreading approach, such as those used in the ISO MPEG model.

8. CONCLUSIONS

In this paper we presented a psychoacoustical model that is suited for predicting masked thresholds for sinusoidal distortions. The model relies on signal detection theory and incorporates more recent insights about spectral and temporal integration in auditory masking. We showed that, as a consequence, the model is able to predict distortion detectabilities. In fact, the distortion detectability defines a (perceptually relevant) norm on the underlying signal space which is beneficial for optimisation algorithms such as rate-distortion optimisation or linear predictive coding. The model proves to be very suitable for application in the context of sinusoidal modelling, although it is also applicable in other audio coding contexts such as transform coding. A comparative listening test using a sinusoidal analysis method called psychoacoustical matching pursuit showed a clear preference for the model presented here over the ISO MPEG model [1].

More specifically, the model presented here leads to a reduction of more than 20% in the number of sinusoids needed to represent signals at a given quality level.

ACKNOWLEDGMENTS

The authors would like to thank Nicolle H. van Schijndel, Gerard Hotho, Jeroen Breebaart, and the reviewers for their helpful comments on this manuscript. Furthermore, the authors thank the participants in the listening test. The research was supported by Philips Research, the Technology Foundation STW, Applied Science Division of NWO, the Technology Programme of the Dutch Ministry of Economic Affairs, and the EU project ARDOR, IST-2001-34095.

REFERENCES

[1] ISO/MPEG Committee, Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s—Part 3: Audio, 1993, ISO/IEC 11172-3.

[2] T. Yoshida, "The rewritable minidisc system," Proc. IEEE, vol. 82, no. 10, pp. 1492–1500, 1994.

[3] A. Hoogendoorn, "Digital compact cassette," Proc. IEEE, vol. 82, no. 10, pp. 1479–1589, 1994.

[4] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. IEEE, vol. 88, no. 4, pp. 451–515, 2000.

[5] K. N. Hamdy, M. Ali, and A. H. Tewfik, "Low bit rate high quality audio coding with combined harmonic and wavelet representation," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '96), vol. 2, pp. 1045–1048, Atlanta, Ga, USA, May 1996.

[6] S. N. Levine, Audio representations for data compression and compressed domain processing, Ph.D. thesis, Stanford University, Stanford, Calif, USA, 1998.

[7] H. Purnhagen and N. Meine, "HILN—the MPEG-4 parametric audio coding tools," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS '00), vol. 2000, pp. 201–204, Geneva, Switzerland, May 2000.

[8] W. Oomen, E. Schuijers, B. den Brinker, and J. Breebaart, "Advances in parametric coding for high-quality audio," in Proc. 114th AES Convention, Amsterdam, The Netherlands, March 2003, preprint 5852.

[9] F. P. Myburg, Design of a scalable parametric audio coder, Ph.D. thesis, Technische Universiteit Eindhoven, Eindhoven, The Netherlands, 2004.

[10] H. S. Malvar, Signal Processing with Lapped Transforms, Artech House, Boston, Mass, USA, 1992.

[11] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall Signal Processing Series, Prentice Hall, Englewood Cliffs, NJ, USA, 1993.

[12] J. E. Hawkins and S. S. Stevens, "The masking of pure tones and of speech by white noise," Journal of the Acoustical Society of America, vol. 22, pp. 6–13, 1950.

[13] T. S. Verma, A perceptually based audio signal model with application to scalable audio coding, Ph.D. thesis, Stanford University, Stanford, Calif, USA, 1999.

[14] R. L. Wegel and C. E. Lane, "The auditory masking of one pure tone by another and its probable relation to the dynamics of the inner ear," Phys. Rev., vol. 23, pp. 266–285, 1924.

[15] H. Fletcher, "Auditory patterns," Reviews of Modern Physics, vol. 12, no. 1, pp. 47–65, 1940.

[16] P. M. Sellick, R. Patuzzi, and B. M. Johnstone, "Measurements of BM motion in the guinea pig using Mössbauer technique," Journal of the Acoustical Society of America, vol. 72, pp. 131–141, 1982.


[17] J. P. Egan and H. W. Hake, "On the masking pattern of a simple auditory stimulus," Journal of the Acoustical Society of America, vol. 22, pp. 622–630, 1950.

[18] K. G. Yates, I. M. Winter, and D. Robertson, "Basilar membrane nonlinearity determines auditory nerve rate-intensity functions and cochlear dynamic range," Hearing Research, vol. 45, no. 3, pp. 203–220, 1990.

[19] R. D. Patterson, "Auditory filter shapes derived with noise stimuli," Journal of the Acoustical Society of America, vol. 59, pp. 1940–1947, 1976.

[20] M. van der Heijden and A. Kohlrausch, "The role of envelope fluctuations in spectral masking," Journal of the Acoustical Society of America, vol. 97, no. 3, pp. 1800–1807, 1995.

[21] M. van der Heijden and A. Kohlrausch, "The role of distortion products in masking by single bands of noise," Journal of the Acoustical Society of America, vol. 98, no. 6, pp. 3125–3134, 1995.

[22] J. P. Egan, W. A. Lindner, and D. McFadden, "Masking-level differences and the form of the psychometric function," Perception and Psychophysics, vol. 6, pp. 209–215, 1969.

[23] D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics, Krieger, New York, NY, USA, 1974.

[24] S. Buus, E. Schorer, M. Florentine, and E. Zwicker, "Decision rules in detection of simple and complex tones," Journal of the Acoustical Society of America, vol. 80, no. 6, pp. 1646–1657, 1986.

[25] A. Langhans and A. Kohlrausch, "Spectral integration of broadband signals in diotic and dichotic masking experiments," Journal of the Acoustical Society of America, vol. 91, no. 1, pp. 317–326, 1992.

[26] S. van de Par, A. Kohlrausch, J. Breebaart, and M. McKinney, "Discrimination of different temporal envelope structures of diotic and dichotic target signals within diotic wide-band noise," in Proc. 13th International Symposium on Hearing, pp. 334–340, Dourdan, France, August 2003.

[27] G. van den Brink, "Detection of tone pulse of various durations in noise of various bandwidths," Journal of the Acoustical Society of America, vol. 36, pp. 1206–1211, 1964.

[28] S. van de Par and A. Kohlrausch, "Application of a spectrally integrating auditory filterbank model to audio coding," in Fortschritte der Akustik, Plenarvorträge der 28. Deutschen Jahrestagung für Akustik, DAGA-02, pp. 484–485, Bochum, Germany, 2002.

[29] T. Dau, D. Püschel, and A. Kohlrausch, "A quantitative model of the 'effective' signal processing in the auditory system. I. Model structure," Journal of the Acoustical Society of America, vol. 99, no. 6, pp. 3615–3622, 1996.

[30] R. D. Patterson, "The sound of a sinusoid; spectral models," Journal of the Acoustical Society of America, vol. 96, no. 3, pp. 1409–1418, 1994.

[31] B. R. Glasberg and B. C. J. Moore, "Derivation of auditory filter shapes from notched-noise data," Hearing Research, vol. 47, no. 1-2, pp. 103–138, 1990.

[32] R. Heusdens, J. Jensen, W. B. Kleijn, V. Kot, O. Niamut, S. van de Par, N. H. van Schijndel, and R. Vafin, "Sinusoidal coding of audio and speech," in preparation for Journal of the Audio Engineering Society, 2005.

[33] B. C. J. Moore, An Introduction to the Psychology of Hearing, Academic Press, London, UK, 3rd edition, 1989.

[34] G. Charestan, R. Heusdens, and S. van de Par, "A gammatone based psychoacoustical modeling approach for speech and audio coding," in Proc. ProRISC/IEEE: Workshop on Circuits, Systems and Signal Processing, pp. 321–326, Veldhoven, The Netherlands, November 2001.

[35] A. J. M. Houtsma, "Hawkins and Stevens revisited at low frequencies," Journal of the Acoustical Society of America, vol. 103, no. 5, pp. 2848–2848, 1998.

[36] E. Zwicker and A. Jaroszewski, "Inverse frequency dependence of simultaneous tone-on-tone masking patterns at low levels," Journal of the Acoustical Society of America, vol. 71, pp. 1508–1512, 1982.

[37] A. Langhans and A. Kohlrausch, "Differences in auditory performance between monaural and diotic conditions. I. Masked thresholds in frozen noise," Journal of the Acoustical Society of America, vol. 91, pp. 3456–3470, 1992.

[38] S. van de Par and A. Kohlrausch, "Dependence of binaural masking level differences on center frequency, masker bandwidth and interaural parameters," Journal of the Acoustical Society of America, vol. 106, pp. 1940–1947, 1999.

[39] E. Zwicker and H. Fastl, Psychoacoustics—Facts and Models, Springer, Berlin, Germany, 2nd edition, 1999.

[40] R. J. McAulay and T. F. Quatieri, "Sinusoidal coding," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., chapter 4, pp. 121–173, Elsevier Science B. V., Amsterdam, The Netherlands, 1995.

[41] M. Goodwin, "Matching pursuit with damped sinusoids," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '97), vol. 3, pp. 2037–2040, Munich, Germany, April 1997.

[42] J. Nieuwenhuijse, R. Heusdens, and E. F. Deprettere, "Robust exponential modeling of audio signals," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '98), vol. 6, pp. 3581–3584, Seattle, Wash, USA, May 1998.

[43] T. S. Verma and T. H. Y. Meng, "Sinusoidal modeling using frame-based perceptually weighted matching pursuits," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '99), vol. 2, pp. 981–984, Phoenix, Ariz, USA, May 1999.

[44] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.

[45] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.

[46] K. Vos and R. Heusdens, "Rate-distortion optimal exponential modeling of audio and speech signals," in Proc. 21st Symposium on Information Theory in the Benelux, pp. 77–84, Wassenaar, The Netherlands, May 2000.

[47] R. Heusdens, R. Vafin, and W. B. Kleijn, "Sinusoidal modeling using psychoacoustic-adaptive matching pursuits," IEEE Signal Processing Lett., vol. 9, no. 8, pp. 262–265, 2000.

[48] R. Heusdens and S. van de Par, "Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustical matching pursuits," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '02), vol. 2, pp. 1809–1812, Orlando, Fla, USA, May 2002.

[49] R. C. Hendriks, R. Heusdens, and J. Jensen, "Perceptual linear predictive noise modelling for sinusoid-plus-noise audio coding," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '04), vol. 4, pp. 189–192, Montreal, Quebec, Canada, May 2004.

[50] J. Jensen, R. Heusdens, and S. H. Jensen, "A perceptual subspace approach for modeling of speech and audio," IEEE Trans. Speech Audio Processing, vol. 12, no. 2, pp. 121–132, 2004.

[51] ITU, ITU-R BS 1534: Method for subjective assessment of intermediate quality level of coding systems, 2001.

[52] O. A. Niamut, Audio codec benchmark manual, Department of Mediamatics, Delft University of Technology, January 2003.



Steven van de Par studied physics at the Eindhoven University of Technology (TU/e), and received his Ph.D. degree in 1998 from the Institute for Perception Research on a topic related to binaural hearing. As a Postdoctoral Researcher at the same institute, he studied auditory-visual interaction and he was a Guest Researcher at the University of Connecticut Health Centre. In the beginning of 2000 he joined Philips Research, Eindhoven. Main fields of expertise are auditory and multisensory perception and low-bit-rate audio coding. He published various papers on binaural detection, auditory-visual synchrony perception, and audio-coding-related topics. He participated in several projects on low-bit-rate audio coding based on sinusoidal techniques and is presently participating in the EU Adaptive Rate-Distortion Optimized Audio codeR (ARDOR) project.

Armin Kohlrausch studied physics at the University of Göttingen, Germany, and specialized in acoustics. He received his M.S. degree in 1980 and his Ph.D. degree in 1984, both in perceptual aspects of sound. From 1985 until 1990 he worked at the Third Physical Institute, University of Göttingen, and was responsible for research and teaching in the fields of psychoacoustics and room acoustics. In 1991 he joined the Philips Research Laboratories, Eindhoven, and worked in the Speech and Hearing Group, Institute for Perception Research (IPO). Since 1998, he has combined his work at Philips Research Laboratories with a Professor position for multisensory perception at the TU/e. In 2004 he was appointed a Research Fellow of Philips Research. He is a member of a great number of scientific societies, both in Europe and the USA. Since 1998 he has been a Fellow of the Acoustical Society of America and currently serves as an Associate Editor for the Journal of the Acoustical Society of America, covering the areas of binaural and spatial hearing. His main scientific interests are in the experimental study and modelling of auditory and multisensory perception in humans and the transfer of this knowledge to industrial media applications.

Richard Heusdens is an Associate Professor in the Department of Mediamatics, Delft University of Technology. He received his M.S. and Ph.D. degrees from the Delft University of Technology, the Netherlands, in 1992 and 1997, respectively. In the spring of 1992 he joined the Digital Signal Processing Group, Philips Research Laboratories, Eindhoven, the Netherlands. He has worked on various topics in the field of signal processing, such as image/video compression and VLSI architectures for image-processing algorithms. In 1997, he joined the Circuits and Systems Group, Delft University of Technology, where he was a Postdoctoral Researcher. In 2000, he moved to the Information and Communication Theory (ICT) Group where he became an Assistant Professor, responsible for the audio and speech processing activities within the ICT Group. Since 2002, he has been an Associate Professor. Research projects he is involved in cover subjects such as audio and speech coding, speech enhancement, and digital watermarking of audio.

Jesper Jensen received the M.S. and Ph.D. degrees from Aalborg University, Aalborg, Denmark, in 1996 and 2000, respectively, both in electrical engineering. From 1996 to 2001, he was with the Center for PersonKommunikation (CPK), Aalborg University, as a Researcher, Ph.D. student, and Assistant Research Professor. In 1999, he was a Visiting Researcher at the Center for Spoken Language Research, University of Colorado at Boulder. Currently, he is a Postdoctoral Researcher at Delft University of Technology, Delft, the Netherlands. His main research interests are in digital speech and audio signal processing, including coding, synthesis, and enhancement.

Søren Holdt Jensen received the M.S. degree in electrical engineering from Aalborg University, Denmark, in 1988, and the Ph.D. degree from the Technical University of Denmark, in 1995. He has been with the Telecommunications Laboratory of Telecom Denmark, the Electronics Institute of the Technical University of Denmark, the Scientific Computing Group of the Danish Computing Center for Research and Education (UNI-C), the Electrical Engineering Department of Katholieke Universiteit Leuven, Belgium, and the Center for PersonKommunikation (CPK) of Aalborg University, and is currently an Associate Professor in the Department of Communication Technology, Aalborg University. His research activities are in digital signal processing, communication signal processing, and speech and audio processing. He is a Member of the Editorial Board of EURASIP Journal on Applied Signal Processing, and a former Chairman of the IEEE Denmark Section and the IEEE Denmark Section's Signal Processing Chapter.


EURASIP Journal on Applied Signal Processing 2005:9, 1305–1322
© 2005 Jeroen Breebaart et al.

Parametric Coding of Stereo Audio

Jeroen Breebaart
Digital Signal Processing Group, Philips Research Laboratories, 5656 AA Eindhoven, The Netherlands
Email: [email protected]

Steven van de Par
Digital Signal Processing Group, Philips Research Laboratories, 5656 AA Eindhoven, The Netherlands
Email: [email protected]

Armin Kohlrausch
Digital Signal Processing Group, Philips Research Laboratories, 5656 Eindhoven, The Netherlands
Department of Technology Management, Eindhoven University of Technology, 5656 AA Eindhoven, The Netherlands
Email: [email protected]

Erik Schuijers
Philips Digital Systems Laboratories, 5616 LW Eindhoven, The Netherlands
Email: [email protected]

Received 27 January 2004; Revised 22 July 2004

Parametric-stereo coding is a technique to efficiently code a stereo audio signal as a monaural signal plus a small amount of parametric overhead to describe the stereo image. The stereo properties are analyzed, encoded, and reinstated in a decoder according to spatial psychoacoustical principles. The monaural signal can be encoded using any (conventional) audio coder. Experiments show that the parameterized description of spatial properties enables a highly efficient, high-quality stereo audio representation.

Keywords and phrases: parametric stereo, audio coding, perceptual audio coding, stereo coding.

1. INTRODUCTION

Efficient coding of wideband audio has gained large interest during the last decades. With the increasing popularity of mobile applications, the Internet, and wireless communication protocols, the demand for more efficient coding systems continues to grow. A large variety of different coding strategies and algorithms has been proposed, and several of them have been incorporated in international standards [1, 2]. These coding strategies reduce the required bit rate by exploiting two main principles for bit-rate reduction. The first principle is the fact that signals may exhibit redundant information. A signal may be partly predictable from its past, or the signal can be described more efficiently using a suitable set of signal functions. For example, a single sinusoid can be described by its successive time-domain samples, but a more efficient description would be to transmit its amplitude, frequency, and

This is an open-access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

starting phase. This source of bit-rate reduction is often referred to as "signal redundancy." The second principle (or source) for bit-rate reduction is the exploitation of "perceptual irrelevancy." Signal properties that are irrelevant from a perceptual point of view can be discarded without a loss in perceptual quality. In particular, a significant amount of bit-rate reduction in current state-of-the-art audio coders is obtained by exploiting auditory masking.

Basically, two different coding approaches can be distinguished that aim at bit-rate reduction. The first approach, often referred to as "waveform coding," describes the actual waveform (in frequency subbands or transform-based) with a limited (sample) accuracy. By ensuring that the quantization noise that is inherently introduced is kept below the masking curve (both across time and frequency), the concept of auditory masking (e.g., perceptual intrachannel irrelevancy) is effectively exploited.

The second coding approach relies on parametric descriptions of the audio signal. Such methods decompose the audio signal into several "objects," such as transients, sinusoids, and noise (cf. [3, 4]). Each object is subsequently



parameterized and its parameters are transmitted. The decoder at the receiving end resynthesizes the objects according to the transmitted parameters. Although it is difficult to obtain transparent audio quality using such coding methods, parametric coders often perform better than waveform or transform coders (i.e., with a higher perceptual quality) at extremely low bit rates (typically up to about 32 kbps).

Recently, hybrid forms of waveform coders and parametric coders have been developed. For example, spectral band replication (SBR) techniques are proposed as a parametric coding extension for high-frequency content combined with a waveform or transform coder operating at a limited bandwidth [5, 6]. These techniques reduce the bit rate of waveform or transform coders by reducing the signal bandwidth that is sent to the encoder, combined with a small amount of parametric overhead. This parametric overhead describes how the high-frequency part, which is not encoded by the waveform coder, can be resynthesized from the low-frequency part.

The techniques described up to this point aim at encoding a single audio channel. In the case of a multichannel signal, these methods have to be performed for each channel individually. Therefore, adding more independent audio channels will result in a linear increase of the total required bit rate. It is often suggested that for multichannel material, cross-channel redundancies can be exploited to increase the coding efficiency. A technique referred to as "mid-side coding" exploits the common part of a stereophonic input signal by encoding the sum and difference signals of the two input signals rather than the input signals themselves [7]. If the two input signals are sufficiently correlated, sum/difference coding requires fewer bits than dual-mono coding. However, some investigations have suggested that the amount of mutual information in the signals for such a transform is rather low [8].
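A minimal sketch of the sum/difference ("mid-side") transform and its inverse is given below; the factor 1/2 is one common scaling convention.

```python
import numpy as np

def ms_encode(left, right):
    """Mid-side transform: mid carries the common part, side the difference."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

def ms_decode(mid, side):
    return mid + side, mid - side   # exact inverse of the transform above

# For strongly correlated channels the side signal has little energy,
# so it can be coded with far fewer bits than a second full channel.
l = np.random.randn(1024)
r = 0.9 * l + 0.1 * np.random.randn(1024)
m, s = ms_encode(l, r)
print(np.var(s) / np.var(m))   # small ratio -> potential coding gain
```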

One possible explanation for this finding is related to the (limited) signal model. To be more specific, the cross-correlation coefficient (or the value of the cross-correlation function at lag zero) of the two input signals must be significantly different from zero in order to obtain a bit-rate reduction. If the two input signals are (nearly) identical but have a relative time delay, the cross-correlation coefficient will (in general) be very low, despite the fact that there exists significant signal redundancy between the input signals. Such a relative time delay may result from the usage of a stereo microphone setup during the recording stage or may result from effect processors that apply (relative) delays to the input signals. In this case, the cross-correlation function shows a clear maximum at a certain nonzero delay. The maximum value of the cross-correlation as a function of the relative delay is also known as "coherence." Coherent signals can in principle be modeled using more advanced signal models, for example, using cross-channel prediction schemes. However, studies indicate only limited success in exploiting coherence using such techniques [9, 10]. These results indicate that exploiting cross-channel redundancies, even if the signal model is able to capture relative time delays, does not lead to a large coding gain.
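The sketch below estimates the coherence as the maximum of the normalized cross-correlation over a range of lags, illustrating why a simple relative delay hides redundancy from the zero-lag correlation coefficient. The lag search range and normalisation are illustrative choices.

```python
import numpy as np

def coherence(left, right, max_lag):
    """Maximum of the normalized cross-correlation over lags in [-max_lag, max_lag]."""
    norm = np.sqrt(np.dot(left, left) * np.dot(right, right))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            c = np.dot(left[lag:], right[:len(right) - lag])
        else:
            c = np.dot(left[:lag], right[-lag:])
        best = max(best, abs(c) / norm)
    return best

# A delayed copy: near-zero correlation at lag 0, but coherence close to 1.
rng = np.random.default_rng(1)
x = rng.standard_normal(4096)
left, right = x[:-20], x[20:]              # right is left advanced by 20 samples
print(np.corrcoef(left, right)[0, 1])      # small value
print(coherence(left, right, max_lag=50))  # close to 1
```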

The second source for bit-rate reduction in multichannel audio relates to cross-channel perceptual irrelevancies. For example, it is well known that for high frequencies (typically above 2 kHz), the human auditory system is not sensitive to fine-structure phase differences between the left and right signals in a stereo recording [11, 12]. This phenomenon is exploited by a technique referred to as "intensity stereo" [13, 14]. Using this technique, a single audio signal is transmitted for the high-frequency range, combined with time- and frequency-dependent scale factors to encode level differences. More recently, the so-called binaural-cue coding (BCC) schemes have been described that initially aimed at modeling the most relevant sound-source localization cues [15, 16, 17], while discarding other spatial attributes such as the ambiance level and room size. BCC schemes can be seen as an extension of intensity stereo in terms of bandwidth and parameters. For the full-frequency range, only a single audio channel is transmitted, combined with time- and frequency-dependent differences in level and arrival time between the input channels. Although the BCC schemes are able to capture the majority of the sound localization cues, they suffer from narrowing of the stereo image and spatial instabilities [18, 19], suggesting that these techniques are mostly advantageous at low bit rates [20]. A solution that was suggested to reduce the narrowing stereo image artifact is to transmit the interchannel coherence as a third parameter [4]. Informal listening results in [21, 22] claim improvements in spatial image width and stability.

In this paper, a parametric description of the spatial sound field will be presented which is based on the three spatial properties described above (i.e., level differences, time differences, and the coherence). The analysis, encoding, and synthesis of these parameters are largely based on binaural psychoacoustics. The amount of spatial information is extracted and parameterized in a scalable fashion. At low parameter rates (typically in the order of 1 to 3 kbps), the coder is able to represent the spatial sound field in an extremely compact way. It will be shown that this configuration is very suitable for low-bit-rate audio coding applications. It will also be demonstrated that, in contrast to statements on BCC schemes [20, 21], if the spatial parameter bit rate is increased to about 8 kbps, the underlying spatial model is able to encode and recreate a spatial image which has a subjective quality equivalent to the quality of current high-quality stereo audio coders (such as MPEG-1 Layer 3 at a bit rate of 128 kbps). Inspection of the coding scheme proposed here and BCC schemes reveals (at least) three important differences that all contribute to quality improvements:

(1) dynamic window switching (see Section 5.1);

(2) different methods of decorrelation synthesis (see Section 6);

(3) the necessity of encoding interchannel time or phase differences, even for loudspeaker playback conditions (see Section 3.1).

Finally, the bit-rate scalability options and the fact that a high-quality stereo image can be obtained enable integration



of parametric stereo in state-of-the-art transform-based [23, 24] and parametric [4] mono audio coders for a wide quality/bit-rate range.

The paper outline is as follows. First, the psychoacoustic background of the parametric-stereo coder is discussed. Section 4 discusses the general structure of the coder. In Section 5, an FFT-based encoder is described. In Section 6, an FFT-based decoder is outlined. In Section 7, an alternative decoder based on a filter bank is given. In Section 8, results from listening tests are discussed, followed by a concluding section.

2. PSYCHOACOUSTIC BACKGROUND

In 1907, Lord Rayleigh formulated the duplex theory [25], which states that sound-source localization is facilitated by interaural intensity differences (IIDs) at high frequencies and by interaural time differences (ITDs) at low frequencies. This theory was (in part) based on the observation that at low frequencies, IIDs between the eardrums do not occur due to the fact that the signal wavelength is much larger than the size of the head, and hence the acoustical shadow of the head is virtually absent. According to Lord Rayleigh, this had the consequence that human listeners can only use ITD cues for sound-source localization at low frequencies. Since then, a large amount of research has been conducted to investigate the human sensitivity to both IIDs and ITDs as a function of various stimulus parameters. One of the striking findings is that although it seems that IID cues are virtually absent at low frequencies for free-field listening conditions, humans are nevertheless very sensitive to IID and ITD cues at low and high frequencies. Stimuli with specified, frequency-independent values of the ITD and IID can be presented over headphones, resulting in a lateralization of the sound source which depends on the magnitude of the ITD as well as the IID [26, 27, 28]. The usual result of such laboratory headphone-based experiments is that the source images are located inside the head and are lateralized along the axis connecting the left and the right ears. The reason for the fact that these stimuli are not perceived externalized is that the single frequency-independent IID or ITD is a poor representation of the acoustic signals at the listener's eardrums in free-field listening conditions. The waveforms of sounds are filtered by the acoustical transmission path between the source and the listener's eardrums, which includes room reflections and pinna filtering, resulting in an intricate frequency dependence of the ITD and IID [29]. Moreover, if multiple sound sources with different spectral properties exist at different spatial locations, the spatial cues of the signals arriving at the eardrums will show a frequency dependence which is even more complex because they are constituted by (weighted) combinations of the spatial cues of the individual sound sources.

Extensive psychophysical research (cf. [30, 31, 32]) and efforts to model the binaural auditory system (cf. [33, 34, 35, 36, 37]) have suggested that the human auditory system extracts spatial cues as a function of time and frequency.

To be more specific, there is considerable evidence that the binaural auditory system renders its binaural cues in a set of frequency bands, without having the possibility to acquire these properties at a finer frequency resolution. This spectral resolution of the binaural auditory system can be described by a filter bank with filter bandwidths that follow the ERB (equivalent rectangular bandwidth) scale [38, 39, 40].
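For reference, a widely used approximation of the ERB bandwidth and the ERB-rate scale (following Glasberg and Moore) is sketched below; the band count in the example is an arbitrary choice, not the one used in the coder described later.

```python
import numpy as np

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth (Hz) at centre frequency f_hz
    (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_rate(f_hz):
    """ERB-rate scale: number of ERBs below frequency f_hz."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

# Example: band edges of an ERB-spaced analysis filter bank between 50 Hz
# and 18 kHz (the number of bands is chosen arbitrarily for illustration).
n_bands = 34
edges = np.linspace(erb_rate(50.0), erb_rate(18000.0), n_bands + 1)
edges_hz = (10.0 ** (edges / 21.4) - 1.0) * 1000.0 / 4.37   # inverse ERB-rate
```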

The limited temporal resolution at which the auditory system can track binaural localization cues is often referred to as "binaural sluggishness," and the associated time constants are between 30 and 100 milliseconds [32, 41]. Although the auditory system is not able to follow IIDs and ITDs that vary quickly over time, this does not mean that listeners are not able to detect the presence of quickly varying cues. Slowly varying IIDs and/or ITDs result in a movement of the perceived sound-source location, while fast changes in binaural cues lead to a percept of "spatial diffuseness," or a reduced "compactness" [42]. Despite the fact that the perceived "quality" of the presented stimulus depends on the movement speed of the binaural cues, it has been shown that the detectability of IIDs and ITDs is practically independent of the variation speed [43]. The sensitivity of human listeners to time-varying changes in binaural cues can be described by sensitivity to changes in the maximum of the cross-correlation function (e.g., the coherence) of the incoming waveforms [44, 45, 46, 47]. There is considerable evidence that the sensitivity to changes in the coherence is the basis of the phenomenon of the binaural masking level difference (BMLD) [48, 49]. Moreover, the sensitivity to quasistatic ITDs can also be described by (changes in) the cross-correlation function [35, 36, 50].

Recently, it has been demonstrated that the concept of "spatial diffuseness" mostly depends on the coherence value itself and is relatively unaffected by the temporal fine-structure details of the coherence within the temporal integration time of the binaural auditory system. For example, van de Par et al. [51] measured the detectability and discriminability of interaurally out-of-phase test signals presented in an interaurally in-phase masker. The subjects were perfectly able to detect the presence of the out-of-phase test signal, but they had great difficulty in discriminating different test signal types (i.e., noise versus harmonic tone complexes).

Besides the limited spectral and temporal resolution that seems to underlie the extraction of spatial sound-field properties, it has also been shown that the auditory system exhibits a limited spatial resolution. The spatial parameters have to change by a certain minimum amount before subjects are able to detect the change. For IIDs, the resolution is between 0.5 and 1 dB for a reference IID of 0 dB and is relatively independent of frequency and stimulus level [52, 53, 54, 55]. If the reference IID increases, IID thresholds increase also. For reference IIDs of 9 dB, the IID threshold is about 1.2 dB, and for a reference IID of 15 dB, the IID threshold amounts to between 1.5 and 2 dB [56, 57, 58].

The sensitivity to changes in ITDs strongly depends on frequency. For frequencies below 1000 Hz, this sensitivity can be described as a constant interaural phase difference (IPD)



sensitivity of about 0.05 rad [11, 53, 59, 60]. The reference ITD has some effect on the ITD thresholds: large ITDs in the reference condition tend to decrease sensitivity to changes in the ITDs [52, 61]. There is almost no effect of stimulus level on ITD sensitivity [12]. At higher frequencies, the binaural auditory system is not able to detect time differences in the fine-structure waveforms. However, time differences in the envelopes can be detected quite accurately [62, 63]. Despite this high-frequency sensitivity, ITD-based sound-source localization is dominated by low-frequency cues [64, 65].
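As a worked example of this constant-IPD description, a just-noticeable IPD of 0.05 rad corresponds to an ITD threshold of roughly

\[
\Delta\tau = \frac{\Delta\phi}{2\pi f} = \frac{0.05\ \mathrm{rad}}{2\pi \cdot 500\ \mathrm{Hz}} \approx 16\ \mu\mathrm{s}
\]

at 500 Hz, and about 8 µs at 1 kHz; that is, the just-noticeable ITD scales inversely with frequency in this range.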

The sensitivity to changes in the coherence strongly depends on the reference coherence. For a reference coherence of +1, changes of about 0.002 can be perceived, while for a reference coherence around 0, the change in coherence must be about 100 times larger to be perceptible [66, 67, 68, 69]. The sensitivity to interaural coherence is practically independent of stimulus level, as long as the stimulus is sufficiently above the absolute threshold [70]. At high frequencies, the envelope coherence seems to be the relevant descriptor of the spatial diffuseness [47, 71].

The threshold values described above are typical for spatial properties that exist during a prolonged time (i.e., 300 to 400 milliseconds). If the duration is smaller, thresholds generally increase. For example, if the duration of the IID and ITD in a stimulus is decreased from 310 to 17 milliseconds, the thresholds may increase by up to a factor of 4 [72]. Interaural coherence sensitivity also strongly depends on the duration [73, 74, 75]. It is often assumed that the increased sensitivity for longer durations results from temporal integration properties of the auditory system. There is, however, one important exception in which the auditory system does not seem to integrate spatial information across time. In reverberant rooms, the perceived location of a sound source is dominated by the first 2 milliseconds of the onset of the sound source, while the remaining signal is largely discarded in terms of spatial cues. This phenomenon is referred to as "the law of the first wavefront" or "precedence effect" [76, 77, 78, 79].

In summary, it seems that the auditory system performs a frequency separation and temporal averaging process in its determination of IIDs, ITDs, and the coherence. This estimation process leads to the concept of a certain sound-source location as a function of frequency and time, while the variability of the localization cues leads to a certain degree of "diffuseness," or spatial "widening," with hardly any interaction between diffuseness and location [72]. Furthermore, these cues are rendered with a limited (spatial) resolution. These observations form the basis of the parametric-stereo coder as described in the following sections. The general idea is to encode all (monaurally) relevant sound sources using a single audio channel, combined with a parameterization of the spatial sound stage. The parameterized sound stage consists of IID, ITD, and coherence parameters as a function of frequency and time. The update rate, frequency resolution, and quantization of these parameters are determined by the human sensitivity to (changes in) these parameters.
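A sketch of the kind of per-band, per-frame analysis implied by this summary is given below: from the FFT spectra of a windowed stereo frame, an interchannel level difference, phase difference, and coherence are estimated in each analysis band. The estimators and band definition are simple illustrative choices, not the coder's actual implementation (described in Section 5).

```python
import numpy as np

def spatial_parameters(left_frame, right_frame, band_edges_bins):
    """Per-band level difference (dB), phase difference (rad), and coherence.

    `band_edges_bins` holds FFT-bin boundaries of the analysis bands
    (e.g., ERB-spaced); all estimators below are illustrative.
    """
    L = np.fft.rfft(np.hanning(len(left_frame)) * left_frame)
    R = np.fft.rfft(np.hanning(len(right_frame)) * right_frame)

    iid, ipd, coh = [], [], []
    for lo, hi in zip(band_edges_bins[:-1], band_edges_bins[1:]):
        pl = np.sum(np.abs(L[lo:hi]) ** 2) + 1e-12      # left band power
        pr = np.sum(np.abs(R[lo:hi]) ** 2) + 1e-12      # right band power
        cross = np.sum(L[lo:hi] * np.conj(R[lo:hi]))    # band cross-spectrum
        iid.append(10.0 * np.log10(pl / pr))
        ipd.append(np.angle(cross))
        coh.append(np.abs(cross) / np.sqrt(pl * pr))
    return np.array(iid), np.array(ipd), np.array(coh)
```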

3. CODING ISSUES

3.1. Headphones versus loudspeaker rendering

The psychoacoustic background as discussed in Section 2 is based on spatial cues at the level of the listener's eardrums. In the case of headphone rendering, the spatial cues which are presented to the human hearing system (i.e., the interaural cues ILD, ITD, and coherence) are virtually the same as the spatial cues in the original stereo signal (interchannel cues). For loudspeaker playback, however, the complex acoustical transmission paths between loudspeakers and eardrums (as described in Section 2) may cause significant changes in the spatial cues. It is therefore highly unlikely that the spatial cues of the original stereo signal (e.g., the interchannel cues) and the spatial cues at the level of the listener's eardrums (interaural cues) are even comparable in the case of loudspeaker playback. In fact, it has been suggested that the acoustical transmission path effectively converts certain spatial cues (for example, interchannel intensity differences) to other cues at the level of the eardrums (e.g., interaural time differences) [80, 81]. However, this effect of the transmission path is not necessarily problematic for parametric-stereo coding. As long as the interaural cues are the same for original material and material which has been processed by a parametric-stereo coder, the listener should have a similar percept of the spatial sound field. Although a detailed analysis of this problem is beyond the scope of this paper, we state that, given certain restrictions on the acoustical transmission path, it can be shown that the interaural spatial cues are indeed comparable for original and decoded signals, provided that all three interchannel parameters are encoded and reconstructed correctly. Moreover, well-known algorithms that aim at widening of the perceived sound stage for loudspeaker playback (so-called crosstalk-cancellation algorithms, which are used frequently in commercial recordings) heavily rely on correct interchannel phase relationships (cf. [82]). These observations are in contrast to statements by others (cf. [18, 21, 22]) that interchannel time or phase differences are irrelevant for loudspeaker playback.

Supported by the observations given above, we will refer to ILD, ITD, and coherence as interchannel parameters. If all three interchannel parameters are reconstructed correctly, we assume that the interaural parameters of original and decoded signals are very similar as well (but different from the interchannel parameters).

3.2. Mono coding effects

As discussed in Section 1, bit-rate reduction in conventional lossy audio coders is obtained predominantly by exploiting the phenomenon of masking. Therefore, lossy audio coders rely on accurate and reliable masking models, which are often applied to individual channel signals in the case of a stereo or multichannel signal. For a parametric-stereo extended audio coder, however, the masking model is applied only once on a certain combination of the two input signals. This scheme has two implications with respect to masking phenomena.

The first implication relates to spatial unmasking of quantization noise. In stereo waveform or transform coders, individual quantizers are applied on the two input signals or on linear combinations of the input signals. As a consequence, the injected quantization noise may exhibit different spatial properties than the audio signal itself. Due to binaural unmasking, the quantization noise may thus become audible, even if it is inaudible if presented monaurally. For tonal material, this unmasking effect (or BMLD, quantified as threshold difference between a binaural condition and a monaural reference condition) has been shown to be relatively small (about 3 dB, see [83, 84]). However, we expect that for broadband maskers, the unmasking effect is much more prominent. If one assumes an interaurally in-phase noise as a masker, and a quantization noise which is either interaurally in-phase or interaurally uncorrelated, BMLDs of 6 dB have been reported [85]. More recent data revealed BMLDs of 13 dB for this condition, based on a sensitivity to changes in the correlation of 0.045 [86]. To prevent these spatial unmasking effects of quantization noise, conventional stereo coders often apply some sort of spatial unmasking protection algorithm.

Figure 1: Structure of the parametric-stereo encoder. The two input signals are first processed by a parameter extraction and downmix stage. The parameters are subsequently quantized and encoded, while the mono downmix can be encoded using an arbitrary mono audio coder. The mono bit stream and spatial parameters are subsequently combined into a single output bit stream.

For a parametric stereo coder, on the other hand, there is only one waveform or transform quantizer, working on the mono (downmix) signal. In the stereo reconstruction phase, both the quantization noise and the audio signal present in each frequency band will obey the same spatial properties. Since a difference in spatial characteristics of quantization noise and audio signal is a prerequisite for spatial unmasking, this effect is less likely to occur for parametric-stereo enhanced coders than for conventional stereo coders.

4. CODER IMPLEMENTATION

The generic structure of the parametric-stereo encoder is shown in Figure 1. The two input channels are fed to a stage that extracts spatial parameters and generates a mono downmix of the two input channels. The spatial parameters are subsequently quantized and encoded, while the mono downmix is encoded using an arbitrary mono audio coder. The resulting mono bit stream is combined with the encoded spatial parameters to form the output bit stream.

The parametric-stereo decoder basically performs the reverse process, as shown in Figure 2. The spatial parameters are separated from the incoming bit stream and decoded.

Figure 2: Structure of the parametric-stereo decoder. The demultiplexer splits mono and spatial parameter information. The mono audio signal is decoded and fed into the spatial synthesis stage, which reinstates the spatial cues based on the decoded spatial parameters.

The mono bit stream is decoded using a mono audio decoder. The decoded audio signal is fed into the spatial synthesis stage, which reinstates the spatial image, resulting in a two-channel output.

Since the spatial parameters are estimated (at the encoder side) and applied (at the decoder side) as a function of time and frequency, both the encoder and decoder require a transform or filter bank that generates individual time/frequency tiles. The frequency resolution of this stage should be nonuniform according to the frequency resolution of the human auditory system. Furthermore, the temporal resolution should generally be fairly low (in the order of tens of milliseconds), reflecting the concept of binaural sluggishness, except in the case of transients, where the precedence effect dictates a time resolution of only a few milliseconds. Furthermore, the transform or filter bank should be oversampled, since time- and frequency-dependent changes will be made to the signals which would lead to audible aliasing distortion in a critically-sampled system. Finally, a complex-valued transform or filter bank is preferred to enable easy estimation and modification of (cross-channel) phase- or time-difference information. A process that meets these requirements is a variable segmentation process with temporally overlapping segments, followed by forward and inverse FFTs. Complex-modulated filter banks can be employed as a low-complexity alternative [23, 24].

5. FFT-BASED ENCODER

The spatial analysis and downmix stage of the encoder is shown in more detail in Figure 3. The two input signals are first segmented by an analysis windowing process. Subsequently, each windowed segment is transformed to the frequency domain using a fast Fourier transform (FFT). The transformed segments are used to extract spatial parameters and to generate a mono downmix signal. The mono signal is transformed to the time domain using an inverse FFT, followed by synthesis windowing and overlap-add (OLA).

Figure 3: Spatial analysis and downmix stage of the encoder.

5.1. Segmentation

The encoder receives a stereo input signal pair x1[n], x2[n] with a sampling rate fs. The input signals are segmented using overlapping frames of total length N with a (fixed) hop size of Nh samples. If no transients are detected, the analysis window length and the window hop size (or parameter update rate) should match the lower bound of the measured time constants of the binaural auditory system. In the following, a parameter update interval of approximately 23 milliseconds is used. Each segment is windowed using overlapping analysis windows and subsequently transformed to the frequency domain using an FFT. Dynamic window switching is used in the case of transients. The purpose of window switching is twofold: firstly, to account for the precedence effect, which dictates that only the first 2 milliseconds of a transient in a reverberant environment determine its perceived location; secondly, to prevent pre-echos resulting from the frequency-dependent processing which is applied in otherwise relatively long segments. The window switching procedure, of which the essence is demonstrated in Figure 4, is controlled by a transient detector.

If a transient is detected at a certain temporal position, a stop window of variable length is applied which just stops before the transient. The transient itself is captured using a very short window (in the order of a few milliseconds). A start window of variable length is subsequently applied to ensure segmentation at the same temporal grid as before the transient.

5.2. Frequency separation

Each segment is transformed to the frequency domain using an FFT of length N (N = 4096 for a sampling rate fs of 44.1 kHz). The frequency-domain signals X1[k], X2[k] (k = [0, 1, . . . , N/2]) are divided into nonoverlapping subbands by grouping of FFT bins. The frequency bands are formed in such a way that each band has a bandwidth, BW (in Hz), which is approximately equal to the equivalent rectangular bandwidth (ERB) [40], following

\[
\mathrm{BW} = 24.7\,(0.00437 f + 1), \tag{1}
\]

with f the (center) frequency given in Hz. This process results in B = 34 frequency bands with FFT start indices kb of subband b (b = [0, 1, . . . , B − 1]). The center frequencies of each analysis band vary from 28.7 Hz (b = 0) to 18.1 kHz (b = 33).
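As an illustration of this band-grouping step, the following sketch (Python) greedily assigns FFT bins to bands whose widths track the ERB rule of (1). The paper does not specify the exact boundary construction behind the 34 bands quoted above, so the greedy rule, its rounding, and the resulting band count are assumptions of this sketch rather than the coder's actual tables.

```python
import numpy as np

def erb(f_hz):
    """Equivalent rectangular bandwidth in Hz at frequency f, eq. (1)."""
    return 24.7 * (0.00437 * f_hz + 1.0)

def erb_band_edges(n_fft=4096, fs=44100.0):
    """Greedily group FFT bins 0..N/2 into bands of roughly one ERB each.

    Returns the band start indices k_b.  This grouping rule is an
    illustrative assumption; the resulting band count may differ from the
    34 bands used by the coder.
    """
    n_bins = n_fft // 2 + 1
    bin_hz = fs / n_fft
    edges = [0]
    k = 0
    while True:
        f_lo = k * bin_hz
        # span approximately one ERB, but always at least one bin
        k += max(1, int(round(erb(f_lo) / bin_hz)))
        if k >= n_bins:
            break
        edges.append(k)
    return edges

if __name__ == "__main__":
    kb = erb_band_edges()
    print(len(kb), "bands, first start indices:", kb[:6])
```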

5.3. Parameter extraction

For each frequency band b, three spatial parameters are computed. The first parameter is the interchannel intensity difference (IID[b]), defined as the logarithm of the power ratio of corresponding subbands from the input signals:

\[
\mathrm{IID}[b] = 10 \log_{10} \frac{\sum_{k=k_b}^{k_{b+1}-1} X_1[k]\, X_1^*[k]}{\sum_{k=k_b}^{k_{b+1}-1} X_2[k]\, X_2^*[k]}, \tag{2}
\]

where ∗ denotes complex conjugation. The second parameter is the relative phase rotation. The phase rotation aims at optimal (in terms of correlation) phase alignment between the two signals. This parameter is denoted by the interchannel phase difference (IPD[b]) and is obtained as follows:

\[
\mathrm{IPD}[b] = \angle \left( \sum_{k=k_b}^{k_{b+1}-1} X_1[k]\, X_2^*[k] \right). \tag{3}
\]

Using the IPD as specified in (3), (relative) delays between the input signals are represented as a constant phase difference in each analysis frequency band, hence resulting in a fractional delay. Thus, within each analysis band, the constant slope of phase with frequency is modeled by a constant phase difference per band, which is a somewhat limited model for the delay. On the other hand, constant phase differences across the input signals are described accurately, which is in turn not possible if an ITD parameter (i.e., a parameterized slope of phase with frequency) would have been used. An advantage of using IPDs over ITDs is that the estimation of ITDs requires accurate unwrapping of bin-by-bin phase differences within each analysis frequency band, which can be prone to errors. Thus, usage of IPDs circumvents this potential problem at the cost of a possibly limited model for ITDs.

The third parameter is the interchannel coherence (IC[b]), which is, in our context, defined as the normalized cross-correlation coefficient after phase alignment according to the IPD. The coherence is derived from the cross-spectrum in the following way:

\[
\mathrm{IC}[b] = \frac{\left| \sum_{k=k_b}^{k_{b+1}-1} X_1[k]\, X_2^*[k] \right|}{\sqrt{\left( \sum_{k=k_b}^{k_{b+1}-1} X_1[k]\, X_1^*[k] \right) \left( \sum_{k=k_b}^{k_{b+1}-1} X_2[k]\, X_2^*[k] \right)}}. \tag{4}
\]
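The three parameter definitions (2)–(4) translate directly into a few lines of code. The sketch below (Python, NumPy) computes IID, IPD, and IC for one frame given the two FFT spectra and the band start indices; the small epsilon used to guard silent bands is an addition of this sketch, not part of the paper.

```python
import numpy as np

def spatial_parameters(X1, X2, kb):
    """Per-band IID (dB), IPD (rad) and IC for one frame, following (2)-(4).

    X1, X2 : complex FFT spectra (bins 0..N/2) of the two input channels.
    kb     : band start indices; band b covers kb[b] .. kb[b+1]-1.
    """
    eps = 1e-12
    edges = list(kb) + [len(X1)]
    iid, ipd, ic = [], [], []
    for b in range(len(kb)):
        s1 = X1[edges[b]:edges[b + 1]]
        s2 = X2[edges[b]:edges[b + 1]]
        p1 = np.sum(s1 * np.conj(s1)).real            # band power, channel 1
        p2 = np.sum(s2 * np.conj(s2)).real            # band power, channel 2
        cross = np.sum(s1 * np.conj(s2))              # summed cross-spectrum
        iid.append(10.0 * np.log10((p1 + eps) / (p2 + eps)))          # eq. (2)
        ipd.append(np.angle(cross))                                   # eq. (3)
        ic.append(np.abs(cross) / np.sqrt((p1 + eps) * (p2 + eps)))   # eq. (4)
    return np.array(iid), np.array(ipd), np.array(ic)
```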

Figure 4: Schematic presentation of dynamic window switching in case of a transient. A stop window is placed just before the detected transient position. The transient itself is captured using a short window.

5.4. Downmix

A suitable mono signal S[k] is obtained by a linear combination of the input signals X1[k] and X2[k]:

\[
S[k] = w_1 X_1[k] + w_2 X_2[k], \tag{5}
\]

where w1 and w2 are weights that determine the relative amount of X1 and X2 in the mono output signal. For example, if w1 = w2 = 0.5, the output will consist of the average of the two input signals. A downmix that is created using fixed weights however bears the risk that the power of the downmix signal strongly depends on the cross-correlation of the two input signals. To circumvent signal loss and signal coloration due to time- and frequency-dependent cross-correlations, the weights w1 and w2 are (1) complex-valued, to prevent phase cancellation, and (2) varying in magnitude, to ensure overall power preservation. Specific details of the downmix procedure are however beyond the scope of this paper.

After the mono signal is generated, the last parameter that has to be extracted is computed. The IPD parameter as described above specifies the relative phase difference between the stereo input signal (at the encoder) and the stereo output signals (at the decoder). Hence the IPD does not indicate how the decoder should distribute these phase differences across the output channels. In other words, an IPD parameter alone does not indicate whether a first signal is lagging the second signal, or vice versa. Thus, it is generally impossible to reconstruct the absolute phase for the stereo signal pair using only the relative phase difference. Absolute phase reconstruction is required to prevent signal cancellation in the applied overlap-add procedure in both the encoder as well as the decoder (see below). To signal the actual distribution of phase modifications, an overall phase difference (OPD) is computed and transmitted. To be more specific, the decoder applies a phase modification equal to the OPD to compute the first output signal, and applies a phase modification of the OPD minus the IPD to obtain the second output signal. Given this specification, the OPD is computed as the average phase difference between X1[k] and S[k], following

\[
\mathrm{OPD}[b] = \angle \left( \sum_{k=k_b}^{k_{b+1}-1} X_1[k]\, S^*[k] \right). \tag{6}
\]

Subsequently, the mono signal S[k] is transformed to the time domain using an inverse FFT. Finally, a synthesis window is applied to each segment followed by overlap-add, resulting in the desired mono output signal.
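To make the relation between downmix and OPD concrete, the sketch below computes (5) and (6) for one frame. Since the actual complex, power-preserving downmix weights are explicitly left out of scope by the paper, the sketch falls back to the fixed weights w1 = w2 = 0.5 mentioned as an example above; it should therefore be read as an illustration of the OPD definition, not as the coder's downmix.

```python
import numpy as np

def downmix_and_opd(X1, X2, kb, w1=0.5, w2=0.5):
    """Fixed-weight downmix plus the per-band overall phase difference of (6).

    The actual coder uses complex, power-preserving weights whose details
    are outside the scope of the paper, so this is only an illustration.
    """
    S = w1 * X1 + w2 * X2                      # eq. (5) with fixed weights
    edges = list(kb) + [len(X1)]
    opd = []
    for b in range(len(kb)):
        cross = np.sum(X1[edges[b]:edges[b + 1]] *
                       np.conj(S[edges[b]:edges[b + 1]]))
        opd.append(np.angle(cross))            # eq. (6)
    return S, np.array(opd)
```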

5.5. Parameter quantization and coding

The IID, IPD, OPD, and IC parameters are quantized according to perceptual criteria. The quantization process aims at introducing quantization errors which are just inaudible. For the IID, this constraint requires a nonlinear quantizer, or nonlinearly spaced IID values, given the fact that the sensitivity for changes in IID depends on the reference IID. The vector IIDs contains the possible discrete IID values that are available for the quantizer. Each element in IIDs represents a single quantization level for the IID parameter and is indicated by IIDq[i] (i = [0, . . . , 30]):

\[
\begin{aligned}
\mathrm{IIDs} &= \left[ \mathrm{IID}_q[0], \mathrm{IID}_q[1], \ldots, \mathrm{IID}_q[30] \right] \\
&= [-50, -45, -40, -35, -30, -25, -22, -19, -16, -13, -10, -8, -6, -4, -2, 0, \\
&\qquad\; 2, 4, 6, 8, 10, 13, 16, 19, 22, 25, 30, 35, 40, 45, 50].
\end{aligned} \tag{7}
\]

The IID index for subband b, IDXIID[b], is then equal to

\[
\mathrm{IDX}_{\mathrm{IID}}[b] = \arg\min_i \left| \mathrm{IID}[b] - \mathrm{IID}_q[i] \right|. \tag{8}
\]

For the IPD parameter, the vector IPDs represents the available quantized IPD values:

\[
\mathrm{IPDs} = \left[ \mathrm{IPD}_q[0], \mathrm{IPD}_q[1], \ldots, \mathrm{IPD}_q[7] \right] = \left[ 0, \frac{\pi}{4}, \frac{2\pi}{4}, \frac{3\pi}{4}, \frac{4\pi}{4}, \frac{5\pi}{4}, \frac{6\pi}{4}, \frac{7\pi}{4} \right]. \tag{9}
\]

This repertoire is in line with the finding that the human sensitivity to changes in timing differences at low frequencies can be described by a constant phase difference sensitivity. The IPD index for subband b, IDXIPD[b], is given by

\[
\mathrm{IDX}_{\mathrm{IPD}}[b] = \operatorname{mod}\left( \left\lfloor \frac{4\,\mathrm{IPD}[b]}{\pi} + \frac{1}{2} \right\rfloor, \Lambda_{\mathrm{IPDs}} \right), \tag{10}
\]


where mod(·) denotes the modulo operator, ⌊·⌋ the floor function, and Λ_IPDs the cardinality of the set of possible quantized IPD values (i.e., the number of elements in IPDs). The OPD is quantized using the same quantizer, resulting in IDXOPD[b] according to

\[
\mathrm{IDX}_{\mathrm{OPD}}[b] = \operatorname{mod}\left( \left\lfloor \frac{4\,\mathrm{OPD}[b]}{\pi} + \frac{1}{2} \right\rfloor, \Lambda_{\mathrm{IPDs}} \right). \tag{11}
\]

Finally, the repertoire for IC, represented in the vector ICs, is given by (see also (21))

\[
\mathrm{ICs} = \left[ \mathrm{IC}_q[0], \mathrm{IC}_q[1], \ldots, \mathrm{IC}_q[7] \right] = [1, 0.937, 0.84118, 0.60092, 0.36764, 0, -0.589, -1]. \tag{12}
\]

This repertoire is based on just-noticeable differences in correlation reported by [69]. The coherence index IDXIC[b] for subband b is determined by

\[
\mathrm{IDX}_{\mathrm{IC}}[b] = \arg\min_i \left| \mathrm{IC}[b] - \mathrm{IC}_q[i] \right|. \tag{13}
\]
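A minimal sketch of the quantizers of (7)–(13) is given below. The level tables transcribe the repertoires as reconstructed above; the nearest-neighbour search for IID and IC and the rounding-plus-wrap-around rule for IPD/OPD follow (8), (10), (11), and (13) directly.

```python
import numpy as np

# Quantizer repertoires of (7), (9) and (12).
IID_LEVELS = np.array([-50, -45, -40, -35, -30, -25, -22, -19, -16, -13,
                       -10, -8, -6, -4, -2, 0, 2, 4, 6, 8, 10, 13, 16, 19,
                       22, 25, 30, 35, 40, 45, 50], dtype=float)
IPD_LEVELS = np.arange(8) * np.pi / 4.0
IC_LEVELS = np.array([1.0, 0.937, 0.84118, 0.60092, 0.36764, 0.0, -0.589, -1.0])

def quantize_iid(iid):
    """Index of the nearest IID level, eq. (8)."""
    return int(np.argmin(np.abs(IID_LEVELS - iid)))

def quantize_ipd(ipd):
    """Uniform pi/4 quantizer with wrap-around, eq. (10); also used for OPD, eq. (11)."""
    return int(np.floor(4.0 * ipd / np.pi + 0.5)) % len(IPD_LEVELS)

def quantize_ic(ic):
    """Index of the nearest coherence level, eq. (13)."""
    return int(np.argmin(np.abs(IC_LEVELS - ic)))

if __name__ == "__main__":
    print(quantize_iid(-17.0), quantize_ipd(-0.1), quantize_ic(0.7))
```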

The IPD and OPD indices are not transmitted for subbands b > 17 (approximately 2 kHz), given the fact that the human auditory system is insensitive to fine-structure phase differences at high frequencies. ITDs present in the high-frequency envelopes are supposed to be represented by the time-varying nature of IID parameters (hence discarding ITDs presented in envelopes that fluctuate faster than the parameter update rate).

Thus, for each frame, 34 indices for the IID and IC have to be transmitted, and 17 indices for the IPD and OPD. All parameters are transmitted differentially across time. In principle, differential coding of indices λ (λ = 0, . . . , Λ − 1) requires 2Λ − 1 codewords λd = −Λ + 1, . . . , 0, . . . , Λ − 1. Assuming that each differential index λd has a probability of occurrence p(λd), the entropy H(p) (in bits/symbol) of this distribution is given by

\[
H(p) = \sum_{\lambda_d = -\Lambda + 1}^{\Lambda - 1} -p(\lambda_d) \log_2\left( p(\lambda_d) \right). \tag{14}
\]

Given the fact that the cardinality of each parameter Λ is known by the decoder, each differential index λd can also be modulo-encoded by λmod, which is given by

\[
\lambda_{\mathrm{mod}} = \operatorname{mod}\left( \lambda_d, \Lambda \right). \tag{15}
\]

The decoder can simply retrieve the transmitted index λ recursively following

\[
\lambda[q] = \operatorname{mod}\left( \lambda_{\mathrm{mod}}[q] + \lambda[q-1], \Lambda \right), \tag{16}
\]

with q the frame number of the current frame. The entropy for λmod, H(pmod), is given by

\[
H(p_{\mathrm{mod}}) = \sum_{\lambda_{\mathrm{mod}} = 0}^{\Lambda - 1} -p_{\mathrm{mod}}(\lambda_{\mathrm{mod}}) \log_2\left( p_{\mathrm{mod}}(\lambda_{\mathrm{mod}}) \right). \tag{17}
\]

Given that

\[
\begin{aligned}
p_{\mathrm{mod}}(0) &= p(0), \\
p_{\mathrm{mod}}(z) &= p(z) + p(z - \Lambda) \quad \text{for } z = 1, \ldots, \Lambda - 1,
\end{aligned} \tag{18}
\]

it follows that the difference in entropy between differential and modulo-differential coding, H(p) − H(pmod), equals

\[
\begin{aligned}
H(p) - H(p_{\mathrm{mod}}) &= \sum_{\lambda_d = 1}^{\Lambda - 1} p(\lambda_d) \log_2 \frac{p(\lambda_d) + p(\lambda_d - \Lambda)}{p(\lambda_d)} \\
&\quad + \sum_{\lambda_d = 1}^{\Lambda - 1} p(\lambda_d - \Lambda) \log_2 \frac{p(\lambda_d) + p(\lambda_d - \Lambda)}{p(\lambda_d - \Lambda)}.
\end{aligned} \tag{19}
\]

For nonnegative probabilities p(·), it follows that

\[
H(p) - H(p_{\mathrm{mod}}) \geq 0. \tag{20}
\]

In other words, modulo-differential coding results in an entropy which is equal to or smaller than the entropy obtained for non-modulo-differential coding. However, the bit-rate gains for modulo time-differential coding compared to time-differential coding are relatively small: about 15% for the IPD and OPD parameters, and virtually no gain for the IID and IC parameters. The entropy per symbol, using modulo-differential coding, and the resulting contribution to the overall bit rate are given in Table 1. These numbers were obtained by analysis of 80 different audio recordings representing a large variety of material.

Table 1: Entropy per parameter symbol, number of symbols per second, and bit rate for spatial parameters.

Parameter   Bits/symbol   Symbols/s   Bit rate (bps)
IID         1.94          1464        2840
IPD         1.58          732         1157
OPD         1.31          732         959
IC          1.88          1464        2752
Total       —             —           7708
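The modulo-differential scheme of (15)–(16) can be illustrated with a few lines of code. How the very first frame is initialized is not described here, so the `start` reference index below is an assumption of the sketch.

```python
def modulo_differential_encode(indices, cardinality, start=0):
    """Modulo time-differential encoding of a quantizer-index stream, eq. (15).

    `start` is the reference index assumed for the first frame; the actual
    signalling of the first frame is not specified here.
    """
    prev = start
    out = []
    for idx in indices:
        out.append((idx - prev) % cardinality)
        prev = idx
    return out

def modulo_differential_decode(codes, cardinality, start=0):
    """Inverse operation performed at the decoder, eq. (16)."""
    prev = start
    out = []
    for c in codes:
        prev = (c + prev) % cardinality
        out.append(prev)
    return out

if __name__ == "__main__":
    idx = [15, 15, 16, 14, 14, 20]            # e.g., IID indices (cardinality 31)
    codes = modulo_differential_encode(idx, 31)
    assert modulo_differential_decode(codes, 31) == idx
```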

The total estimated parameter bit rate for the configuration as described above, excluding bit-stream overhead, and averaged across a large amount of representative stereo material amounts to 7.7 kbps. If further parameter bit-rate reduction is required, the following changes can be made.

(i) Reduction of the number of frequency bands (e.g., using 20 instead of 34). The parameter bit rate increases approximately linearly with the number of bands. This results in a bit rate of approximately 4.5 kbps for the 20-band case, assuming an update rate of 23 milliseconds and including transmission of IPD and OPD parameters. Informal listening experiments showed that lowering the number of frequency bands below 10 results in severe degradation of the perceived spatial quality.

(ii) No transmission of IPD and OPD parameters. As described above, the coherence is a measure of the difference between the input signals which cannot be accounted for by (subband) phase and level differences. A lower bit rate is obtained if the applied signal model does not incorporate phase differences. In that case, the normalized cross-correlation is the relevant measure of differences between the input signals that cannot be accounted for by level differences. In other words, phase or time differences between the input signals are modeled as (additional) changes in the coherence. The estimated coherence value (which is in fact the normalized cross-correlation) is then derived from the cross-spectrum following

\[
\mathrm{IC}[b] = \frac{\operatorname{Re} \sum_{k=k_b}^{k_{b+1}-1} X_1[k]\, X_2^*[k]}{\sqrt{\left( \sum_{k=k_b}^{k_{b+1}-1} X_1[k]\, X_1^*[k] \right) \left( \sum_{k=k_b}^{k_{b+1}-1} X_2[k]\, X_2^*[k] \right)}}. \tag{21}
\]

The associated bit-rate reduction amounts to approximately 27% compared to parameter sets which do include the IPD and OPD values.

(iii) Increasing the quantization errors of the parameters. The bit-rate reduction is only marginal, given the fact that the distribution of time-differential parameters is very peaky.

(iv) Decreasing the parameter update rate. The bit rate scales approximately linearly with the update rate.

In summary, the parameter bit rate can be scaled from approximately 8 kbps for maximum quality (using 34 analysis bands, an update rate of 23 milliseconds, and transmitting all relevant parameters) down to about 1.5 kbps (using 20 analysis frequency bands, an update rate of 46 milliseconds, and no transmission of IPD and OPD parameters).

6. FFT-BASED DECODER

The spatial synthesis part of the decoder receives a mono input signal s[n] and has to generate two output signals y1[n] and y2[n]. These two output signals should obey the transmitted spatial parameters. A more detailed overview of the spatial synthesis stage is shown in Figure 5.

Figure 5: Spatial synthesis stage of the decoder.

In order to generate two output signals with a variable (i.e., parameter-dependent) coherence, a second signal has to be generated which has a similar spectral-temporal envelope as the mono input signal, but is incoherent from a fine-structure waveform point of view. This incoherent (or orthogonal) signal, sd[n], is obtained by convolving the mono input signal s[n] with an allpass decorrelation filter hd[n]. A very cost-effective decorrelation allpass filter is obtained by a simple delay. The combination of a delay and a (fixed) mixing matrix to produce two signals with a certain spatial diffuseness is known as a Lauridsen decorrelator [87]. The decorrelation is produced by complementary comb-filter peaks and troughs in the two output signals. This approach works well provided that the delay is sufficiently long to result in multiple comb-filter peaks and troughs in each auditory filter. Due to the fact that the auditory filter bandwidth is larger at higher frequencies, the delay is preferably frequency dependent, being shorter at higher frequencies. A frequency-dependent delay has the additional advantage that it does not result in harmonic comb-filter effects in the output. A suitable decorrelation filter consists of a single period of a positive Schroeder-phase complex [88] of length Ns = 640 (i.e., with a fundamental frequency of fs/Ns). The Schroeder-phase complex exhibits low autocorrelation at nonzero lags and its impulse response hd[n] for 0 ≤ n ≤ Ns − 1 is given by

\[
h_d[n] = \sum_{k=0}^{N_s/2} \frac{2}{N_s} \cos\left( \frac{2\pi k n}{N_s} + \frac{2\pi k (k-1)}{N_s} \right). \tag{22}
\]
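For illustration, the sketch below generates the Schroeder-phase impulse response of (22) and applies it by direct convolution. Truncating the convolution output to the input length is a simplification of this sketch; the text above only prescribes the filter itself.

```python
import numpy as np

def schroeder_decorrelation_filter(ns=640):
    """Impulse response of the decorrelation allpass filter, following (22)."""
    n = np.arange(ns)
    h = np.zeros(ns)
    for k in range(ns // 2 + 1):
        h += (2.0 / ns) * np.cos(2.0 * np.pi * k * n / ns
                                 + 2.0 * np.pi * k * (k - 1) / ns)
    return h

def decorrelate(s, ns=640):
    """Convolve the mono signal with the Schroeder-phase complex
    (truncated to the input length, a simplification of this sketch)."""
    return np.convolve(s, schroeder_decorrelation_filter(ns))[:len(s)]
```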

Subsequently, the segmentation, windowing, and transform operations that are performed are equal to those performed in the encoder, resulting in the frequency-domain representations S[k] and Sd[k], for the mono input signal s[n] and its decorrelated version sd[n], respectively. The next step consists of computing linear combinations of the two input signals to arrive at the two frequency-domain output signals Y1[k] and Y2[k]. The dynamic mixing process, which is performed on a subband basis, is described by the matrix multiplication RB. For each subband b (i.e., kb ≤ k < kb+1), we have

\[
\begin{bmatrix} Y_1[k] \\ Y_2[k] \end{bmatrix} = \mathbf{R}_B \begin{bmatrix} S[k] \\ S_d[k] \end{bmatrix}, \tag{23}
\]

with

\[
\mathbf{R}_B[b] = \sqrt{2}\, \mathbf{P}[b]\, \mathbf{A}[b]\, \mathbf{V}[b]. \tag{24}
\]


The diagonal matrix V enables real-valued (relative) scaling of the two orthogonal signals S[k] and Sd[k]. The matrix A is a real-valued rotation in the two-dimensional signal space, that is, $A^{-1} = A^{T}$, and the diagonal matrix P enables modification of the complex-phase relationships between the output signals, hence $|p_{ij}| = 1$ for $i = j$ and 0 otherwise. The nonzero entries in the matrices P, A, and V are determined by the following constraints.

(1) The power ratio of the two output signals must obey the transmitted IID parameter.
(2) The coherence of the two output signals must obey the transmitted IC parameter.
(3) The average energy of the two output signals must be equal to the energy of the mono input signal.
(4) The total amount of S[k] present in the two output signals should be maximum (i.e., v11 should be maximum).
(5) The average phase difference between the output signals must be equal to the transmitted IPD value.
(6) The average phase difference between S[k] and Y1[k] should be equal to the OPD value.

The solution for the matrix P is given by

\[
\mathbf{P}[b] = \begin{bmatrix} e^{j\,\mathrm{OPD}[b]} & 0 \\ 0 & e^{j\,\mathrm{OPD}[b] - j\,\mathrm{IPD}[b]} \end{bmatrix}. \tag{25}
\]

The matrices A and V can be interpreted as the eigenvector, eigenvalue decomposition of the covariance matrix of the (desired) output signals, assuming (optimum) phase alignment (P) prior to correlation. The solution for the eigenvectors and eigenvalues (maximizing the first eigenvalue v11) results from a singular value decomposition (SVD) of the covariance matrix. The matrices A and V are given by (see [89] for more details)

\[
\mathbf{A}[b] = \begin{bmatrix} \cos(\alpha[b]) & -\sin(\alpha[b]) \\ \sin(\alpha[b]) & \cos(\alpha[b]) \end{bmatrix}, \qquad
\mathbf{V}[b] = \begin{bmatrix} \cos(\gamma[b]) & 0 \\ 0 & \sin(\gamma[b]) \end{bmatrix}, \tag{26}
\]

with α[b] being a rotation angle in the two-dimensional signal space defined by S and Sd, which is given by

\[
\alpha[b] =
\begin{cases}
\dfrac{\pi}{4} & \text{for } (\mathrm{IC}[b], c[b]) = (0, 1), \\[2mm]
\operatorname{mod}\left( \dfrac{1}{2} \arctan\left( \dfrac{2\, c[b]\, \mathrm{IC}[b]}{c[b]^2 - 1} \right), \dfrac{\pi}{2} \right) & \text{otherwise},
\end{cases} \tag{27}
\]

and γ[b] a parameter for relative scaling of S and Sd (i.e., the relation between the eigenvalues of the desired covariance matrix):

\[
\gamma[b] = \arctan \sqrt{ \frac{1 - \sqrt{\mu[b]}}{1 + \sqrt{\mu[b]}} }, \tag{28}
\]

with

\[
\mu[b] = 1 + \frac{4\,\mathrm{IC}^2[b] - 4}{\left( c[b] + 1/c[b] \right)^2}, \tag{29}
\]

and c[b] the square root of the power ratio of the two subband output signals:

\[
c[b] = 10^{\mathrm{IID}[b]/20}. \tag{30}
\]
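Equations (24)–(30) fully determine R_B[b] from the four (dequantized) parameters of one band. The sketch below evaluates them for scalar per-band inputs; the arctan2 form and the small numerical guard on µ are additions of the sketch, chosen to be equivalent to (27)–(29) while avoiding division by zero at c[b] = 1.

```python
import numpy as np

def mixing_matrix_rb(iid_db, ipd, opd, ic):
    """Per-band decoder mixing matrix R_B = sqrt(2) * P * A * V, eqs. (24)-(30)."""
    c = 10.0 ** (iid_db / 20.0)                                  # eq. (30)
    if ic == 0.0 and c == 1.0:
        alpha = np.pi / 4.0                                      # special case in (27)
    else:
        # arctan2 is equivalent to (27) after the mod pi/2, and it avoids
        # the division by zero at c = 1
        alpha = np.mod(0.5 * np.arctan2(2.0 * c * ic, c * c - 1.0), np.pi / 2.0)
    mu = 1.0 + (4.0 * ic * ic - 4.0) / (c + 1.0 / c) ** 2        # eq. (29)
    mu = max(mu, 0.0)                                            # numerical guard
    gamma = np.arctan(np.sqrt((1.0 - np.sqrt(mu)) / (1.0 + np.sqrt(mu))))  # eq. (28)
    P = np.diag([np.exp(1j * opd), np.exp(1j * (opd - ipd))])    # eq. (25)
    A = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])              # eq. (26)
    V = np.diag([np.cos(gamma), np.sin(gamma)])                  # eq. (26)
    return np.sqrt(2.0) * P @ A @ V                              # eq. (24)
```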

It should be noted that a two-dimensional eigenvector problem has in principle four possible solutions: each eigenvector, which is represented as a column in the matrix A, may be multiplied with a factor −1. The modulo operator in (27) ensures that the first eigenvector is always positioned in the first quadrant. However, this technique only works under the constraint of IC > 0, which is guaranteed if phase alignment is applied. If no IPD/OPD parameters are transmitted, however, the IC parameters may become negative, which requires a different solution for the matrix R. A convenient solution is obtained if we maximize S[k] in the sum of the output signals (i.e., Y1[k] + Y2[k]). This results in the mixing matrix RA[b]:

\[
\mathbf{R}_A[b] = \begin{bmatrix}
c_1 \cos(\nu[b] + \mu[b]) & c_1 \sin(\nu[b] + \mu[b]) \\
c_2 \cos(\nu[b] - \mu[b]) & c_2 \sin(\nu[b] - \mu[b])
\end{bmatrix}, \tag{31}
\]

with

\[
\begin{aligned}
c_1[b] &= \sqrt{\frac{2\, c^2[b]}{1 + c^2[b]}}, \\
c_2[b] &= \sqrt{\frac{2}{1 + c^2[b]}}, \\
\mu[b] &= \tfrac{1}{2} \arccos\left( \mathrm{IC}[b] \right), \\
\nu[b] &= \frac{\mu[b] \left( c_2[b] - c_1[b] \right)}{\sqrt{2}}.
\end{aligned} \tag{32}
\]
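Analogously, the phase-less mixing matrix of (31)–(32) can be evaluated as follows; the clipping of IC to [−1, 1] before the arccos is only a numerical guard added in this sketch.

```python
import numpy as np

def mixing_matrix_ra(iid_db, ic):
    """Per-band mixing matrix for the configuration without IPD/OPD, eqs. (31)-(32)."""
    c = 10.0 ** (iid_db / 20.0)
    c1 = np.sqrt(2.0 * c * c / (1.0 + c * c))
    c2 = np.sqrt(2.0 / (1.0 + c * c))
    mu = 0.5 * np.arccos(np.clip(ic, -1.0, 1.0))   # clipping is a numerical guard
    nu = mu * (c2 - c1) / np.sqrt(2.0)
    return np.array([[c1 * np.cos(nu + mu), c1 * np.sin(nu + mu)],
                     [c2 * np.cos(nu - mu), c2 * np.sin(nu - mu)]])
```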

Finally, the frames are transformed to the time domain, windowed (using equal synthesis windows as in the encoder), and combined using overlap-add.

7. QMF-BASED DECODER

The FFT-based decoder as described in the previous section requires a relatively long FFT length to provide sufficient frequency resolution at low frequencies. As a result, the resolution at high frequencies is unnecessarily high, and consequently the memory requirements of an FFT-based decoder are larger than necessary. To reduce the frequency resolution at high frequencies while still maintaining the required resolution at low frequencies, a hybrid complex filter bank is used. To be more specific, a hybrid complex-modulated quadrature mirror filter bank (QMF) is used which is an extension to the filter bank as used in spectral band replication (SBR) techniques [5, 6, 90]. The outline of the QMF-based parametric-stereo decoder is shown in Figure 6.

Figure 6: Structure of the QMF-based decoder. The signal is first fed through a hybrid QMF analysis filter bank. The filter-bank output and a decorrelated version of each filter-bank signal are subsequently fed into the mixing and phase-adjustment stage. Finally, two hybrid QMF banks generate the two output signals.

Figure 7: Structure of the hybrid QMF analysis and synthesis filter banks.

The input signal is first processed by the hybrid QMF analysis filter bank. A copy of each filter-bank output is processed by a decorrelation filter. This filter has the same purpose as the decorrelation filter in the FFT-based decoder; it generates a decorrelated version of the input signal in the QMF domain. Subsequently, both the QMF output and its decorrelated version are fed into the mixing and phase-adjustment stage. This stage generates two hybrid QMF-domain output signals with spatial parameters that match the transmitted parameters. Finally, the output signals are fed through a pair of hybrid QMF synthesis filter banks to result in the final output signals.

The hybrid QMF analysis filter bank consists of a cascade of two filter banks. The structure is shown in Figure 7.

The first filter bank is compatible with the filter bank as used in SBR algorithms. The subband signals which are generated by this filter bank are obtained by convolving the input signal with a set of analysis filter impulse responses hk[n] given by

\[
h_k[n] = p_0[n] \exp\left( j \frac{\pi}{4K} (2k + 1)(2n - 1) \right), \tag{33}
\]

with p0[n], for n = 0, . . . , Nq − 1, the prototype window of the filter, K = 64 the number of output channels, k the subband index (k = 0, . . . , K − 1), and Nq = 640 the filter length. The filtered outputs are subsequently downsampled by a factor K, to result in a set of down-sampled QMF outputs (or subband signals) Sk[q] (the equations given here are purely analytical; in practice, the computational efficiency of the filter bank can be increased using decomposition methods):

\[
S_k[q] = \left( s * h_k \right)[Kq]. \tag{34}
\]
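The sketch below implements (33)–(34) in their direct, purely analytical form (and is therefore as inefficient as the remark above warns); the SBR prototype window p0 is standardized elsewhere and is assumed to be supplied by the caller.

```python
import numpy as np

def qmf_analysis(s, p0, K=64):
    """Complex-modulated QMF analysis following (33)-(34).

    s  : real input signal.
    p0 : prototype lowpass window of length Nq = 640 (the SBR prototype;
         its coefficients are not reproduced here and must be supplied).
    Returns an array of shape (num_slots, K) with the downsampled subbands.
    """
    nq = len(p0)
    n = np.arange(nq)
    k = np.arange(K)[:, None]
    # analysis filters h_k[n] = p0[n] * exp(j*pi/(4K)*(2k+1)*(2n-1)), eq. (33)
    h = p0[None, :] * np.exp(1j * np.pi / (4.0 * K)
                             * (2 * k + 1) * (2 * n[None, :] - 1))
    subbands = []
    for kk in range(K):
        full = np.convolve(s, h[kk])          # eq. (34) before downsampling
        subbands.append(full[::K])            # downsample by a factor K
    return np.array(subbands).T
```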

The magnitude responses of the first 4 frequency bands (k = 0, . . . , 3) of the QMF analysis bank are illustrated in Figure 8.

Figure 8: Magnitude responses of the first 4 of the 64-band SBR complex-exponential modulated analysis filter bank. The magnitude for k = 0 is highlighted.

The down-sampled subband signals Sk[q] of the lowest QMF subbands are subsequently fed through a second complex-modulated filter bank (sub-filter bank) to further enhance the frequency resolution; the remaining subband signals are delayed to compensate for the delay which is introduced by the sub-filter bank. The output of the hybrid (i.e., combined) filter bank is denoted by Sk,m[q], with k the subband index of the initial QMF bank, and m the filter index of the sub-filter bank. To allow easy identification of the two filter banks and their outputs, the index k of the first filter bank will be denoted “subband index,” and the index m of the sub-filter bank is denoted “sub-subband index.” The sub-filter bank has a filter order of Ns = 12, and an impulse response Gk,m[q] given by

\[
G_{k,m}[q] = g_k[q] \exp\left( j \frac{2\pi}{M_k} \left( m + \frac{1}{2} \right) \left( q - \frac{N_s}{2} \right) \right), \tag{35}
\]

with gk[q] the prototype window associated with QMF band k, q the sample index, and Mk the number of sub-subbands in QMF subband k (m = 0, . . . , Mk − 1). Table 2 gives the number of sub-subbands Mk as a function of the QMF band k, for both the 34 and 20 analysis-band configurations. As an example, the magnitude response of the 4-band sub-filter bank (Mk = 4) is given in Figure 9. Obviously, due to the limited prototype length (Ns = 12), the stop-band attenuation is only in the order of 20 dB.

Table 2: Specification of Mk for the first 5 QMF subbands.

QMF subband (k)   Mk (B = 34)   Mk (B = 20)
0                 12            8
1                 8             4
2                 4             4
3                 4             1
4                 4             1

As a result of this hybrid QMF filter-bank structure, 91 (for B = 34) or 77 (B = 20) down-sampled filter outputs Sk,m[q] and their filtered (decorrelated) counterparts Sk,m,d[q] are available for further processing. The decorrelation filter can be implemented in various ways. An elegant method comprises a reverberator [24]; a low-complexity alternative consists of a (frequency-dependent) delay Tk of which the delay time depends on the QMF subband index k.

The next stage of the QMF-based spatial synthesis stage performs a mixing and phase-adjustment process. For each sub-subband signal pair Sk,m[q], Sk,m,d[q], an output signal pair Yk,m,1[q], Yk,m,2[q] is generated by

\[
\begin{bmatrix} Y_{k,m,1}[q] \\ Y_{k,m,2}[q] \end{bmatrix} = \mathbf{R}_{k,m} \begin{bmatrix} S_{k,m}[q] \\ S_{k,m,d}[q] \end{bmatrix}. \tag{36}
\]

The mixing matrix Rk,m is determined as follows. Each quartet of the parameters IID, IPD, OPD, and IC for a single parameter subband b represents a certain frequency range and a certain moment in time. The frequency range depends on the specification of the encoder analysis frequency bands (i.e., the grouping of FFT bins), while the position in time depends on the encoder time-domain segmentation. If the encoder is designed properly, the time/frequency localization of each parameter quartet coincides with a certain sample index in a sub-subband or set of sub-subbands in the QMF domain. For that particular QMF sample index, the mixing matrices are exactly the same as their FFT-based counterparts (as specified by (25)–(32)). For QMF sample indices in between, the mixing matrices are interpolated linearly (i.e., their real and imaginary parts are interpolated individually).
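The per-slot mixing of (36) together with the linear interpolation of the matrices between parameter positions can be sketched as follows; the exact alignment of parameter positions to QMF time slots is not detailed above and is therefore an assumption of this sketch.

```python
import numpy as np

def interpolate_mixing_matrices(R_prev, R_next, num_slots):
    """Linearly interpolate complex 2x2 mixing matrices between two parameter
    positions, element-wise on real and imaginary parts.  Returns `num_slots`
    matrices moving from R_prev (exclusive) to R_next (inclusive); the slot
    alignment is an assumption of this sketch.
    """
    return [R_prev + (R_next - R_prev) * (i + 1) / num_slots
            for i in range(num_slots)]

def apply_mixing(S, Sd, R):
    """Apply a 2x2 mixing matrix to one (sub-)subband sample pair, eq. (36)."""
    y = R @ np.array([S, Sd])
    return y[0], y[1]
```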

The mixing process is followed by a pair of hybrid QMF synthesis filter banks (one for each output channel), which also consist of two stages. The first stage comprises summation of the sub-subbands m which stem from the same subband k:

\[
Y_{k,1}[q] = \sum_{m=0}^{M_k - 1} Y_{k,m,1}[q], \qquad
Y_{k,2}[q] = \sum_{m=0}^{M_k - 1} Y_{k,m,2}[q]. \tag{37}
\]

Figure 9: Magnitude response of the 4-band sub-filter bank. The response for m = 0 is highlighted.

Finally, upsampling and convolution with synthesis filters (which are similar to the QMF analysis filters as specified by (33)) results in the final stereo output signal.

The fact that the same filter-bank structure is used for both PS and SBR enables an easy and low-cost integration of SBR and parametric stereo in a single decoder structure (cf. [23, 24, 91, 92]). This combination is known as enhanced aacPlus and is under consideration for standardization in MPEG-4 as the HE-AAC/PS profile [93]. The structure of the decoder is shown in Figure 10. The incoming bit stream is demultiplexed into a band-limited AAC bit stream, SBR parameters, and parametric-stereo parameters. The AAC bit stream is decoded by an AAC decoder and fed into a 32-band QMF analysis bank. The output of this filter bank is processed by the SBR stage and by the sub-filter bank as described in Section 7. The resulting full-bandwidth mono signal is converted to stereo by the PS stage, which performs decorrelation and mixing. Finally, two hybrid QMF synthesis banks result in the final output signals. More details on enhanced aacPlus can be found in [23, 92].

Figure 10: Structure of enhanced aacPlus.

8. PERCEPTUAL EVALUATION

To evaluate the parametric-stereo coder, two listening tests were conducted. The first test aims at establishing the maximum perceptual quality that can be obtained given the underlying spatial model. Other authors have argued that parametric-stereo coding techniques are only advantageous in the low-bit-rate range, since near transparency could not be achieved [20, 21, 22]. Therefore, this experiment is useful for two reasons: firstly, to verify statements by others on the maximum quality that can be obtained using parametric stereo; secondly, if parametric stereo is included in an audio coder, the maximum overall bit rate at which parametric stereo still leads to a coding gain compared to conventional stereo techniques is in part dependent on the quality limitations induced by the parametric-stereo algorithm only. To exclude quality limitations induced by other coding processes besides parametric stereo, this experiment was performed without a mono coder. The second listening test was performed to derive the actual coding gain of parametric stereo in a complete coder. For this purpose, a comparison was made between a state-of-the-art stereo coder (i.e., aacPlus) and the same coder extended with parametric stereo (i.e., enhanced aacPlus) as described in Section 7.

8.1. Listening test I

Nine listeners participated in this experiment. All listeners had experience in evaluating audio codecs and were specifically instructed to evaluate both the spatial audio quality as well as other noticeable artifacts. In a double-blind MUSHRA test [94], the listeners had to rate the perceived quality of several processed items against the original (i.e., unprocessed) excerpts on a 100-point scale with 5 anchors. All excerpts were presented over Stax Lambda Pro headphones. The processed items included

(1) encoding and decoding using a state-of-the-art MPEG-1 layer 3 (MP3) coder at a bit rate of 128 kbps stereo and using its highest possible quality settings;

(2) encoding and decoding using the FFT-based parametric-stereo coder as described above without mono coder (i.e., assuming transparent mono coding) operating at 8 kbps;

(3) encoding and decoding using the FFT-based parametric-stereo coder without mono coder operating at a bit rate of 5 kbps (using 20 analysis frequency bands instead of 34);

(4) the original as hidden reference.

The 13 test excerpts are listed in Table 3. All items are stereo, 16-bit resolution per sample, at a sampling frequency of 44.1 kHz.

The subjects could listen to each excerpt as often as they liked and could switch in real time between the four versions of each item. The 13 selected items showed to be the most critical items from an 80-item test set for either parametric stereo or MP3 during development and in-between evaluations of the algorithms described in this paper. The items had a duration of about 10 seconds and contained a large variety of audio classes. The average scores of all subjects are shown in Figure 11. The top panel shows mean MUSHRA scores for 8 kbps parametric stereo (black bars) and MP3 at 128 kbps (white bars) as a function of the test item. The rightmost bars indicate the mean across all test excerpts. Most excerpts show very similar scores, except for excerpts 4, 8, 10, and 13. Excerpts 4 (“Harpsichord”) and 8 (“Plucked string”) show a significantly higher quality for parametric stereo. These items contain many tonal components, a property that is typically problematic for waveform coders due to the large audibility of quantization noise for such material. On the other hand, excerpts 10 (“Man in the long black coat”) and 13 (“Two voices”) have higher scores for MP3. Item 13 exhibits an (unnaturally) large amount of channel separation, which is partially lost after parametric-stereo decoding. On average, both coders have equal scores.

The middle panel shows results for the parametric-stereo coder working at 5 kbps (black bars) and 8 kbps (white bars). In most cases, the 8 kbps coder has a higher quality than the 5 kbps coder, except for excerpts 5 (“Castanets”) and 7 (“Glockenspiel”). On average, the quality of the 5 kbps coder is only marginally lower than for 8 kbps, which demonstrates the shallow bit-rate/quality slope for the parametric-stereo coder.

The bottom panel shows 128 kbps MP3 (white bars) against the hidden reference (black bars). As expected, the hidden reference scores are close to 100. For fragments 7 (“Glockenspiel”) and 10 (“Man in the long black coat”), the hidden reference scores lower than MP3 at 128 kbps, which indicates transparent coding.

It is important to note that the results described here were obtained for headphone listening conditions. We have found that headphone listening conditions are much more critical for parametric stereo than playback using loudspeakers. In fact, a listening test has shown that on average, the difference in MUSHRA scores between headphones and loudspeaker playback amounts to 17 points in favor of loudspeaker playback for an 8 kbps FFT-based encoder/decoder. This means that the perceptual quality for loudspeaker playback has an average MOS of over 90, indicating excellent perceptual quality. The difference between these playback conditions is most probably the result of the combination of an unnaturally large channel separation which is obtained using headphones on the one hand, and crosstalk resulting from the downmix procedure on the other hand. It seems that the amount of interchannel crosstalk that is inherently introduced by transmission of a single audio channel only is less than the amount of interaural crosstalk that occurs in free-field listening conditions. A consequence of this observation is that a comparison of the present coder with BCC schemes is rather difficult, since the BCC algorithms were all tested under subcritical conditions using loudspeaker playback (cf. [16, 17, 18, 19, 20]).

Table 3: Description of test material.

Item index   Name                          Origin/artist
1            Starship Trooper              Yes
2            Day tripper                   The Beatles
3            Eye in the sky                Alan Parsons
4            Harpsichord                   MPEG si01
5            Castanets                     MPEG si02
6            Pitch pipe                    MPEG si03
7            Glockenspiel                  MPEG sm02
8            Plucked string                MPEG sm03
9            Yours is no disgrace          Yes
10           Man in the long black coat    Bob Dylan
11           Vogue                         Madonna
12           Applause                      SQAM disk
13           Two voices                    Left = MPEG es03 = English female; Right = MPEG es02 = German male

Figure 11: MUSHRA scores averaged across listeners as a function of test item and various coder configurations (see text). The upper panel shows the results for 8 kbps parametric stereo (black bars) against stereo MP3 at 128 kbps (white bars). The middle panel shows the results for 5 kbps parametric stereo (black bars) versus 8 kbps parametric stereo (white bars). The lower panel shows the hidden reference (black bars) versus MP3 at 128 kbps (white bars).

8.2. Listening test II

This test also employed MUSHRA [94] methodology and included 10 items which were selected for the MPEG-4 HE-AAC stereo verification test [95]. The following versions of each item were included in the test:

(1) the original as hidden reference;
(2) a first lowpass filtered anchor (3.5 kHz bandwidth);
(3) a second lowpass filtered anchor (7 kHz bandwidth);
(4) aacPlus (HE-AAC) encoded at a bit rate of 24 kbps;
(5) aacPlus (HE-AAC) encoded at a bit rate of 32 kbps;
(6) enhanced aacPlus (HE-AAC/PS) encoded at a total bit rate of 24 kbps. Twenty analysis bands were used, and no IPD or OPD parameters were transmitted. The average parameter update rate amounted to 46 milliseconds. For each frame, the required number of bits for the stereo parameters was calculated. The remaining number of bits was available for the mono coder (HE-AAC).

Two different test sites participated in the test, with 8 and 10 experienced subjects per site, respectively. All excerpts were presented over headphones. The results per site, averaged across excerpts, are given in Figure 12.

At both test sites, it was found that aacPlus with parametric stereo (enhanced aacPlus) at 24 kbps achieves a respectable average subjective quality of around 70 on a MUSHRA scale. Moreover, at 24 kbps, the subjective quality of enhanced aacPlus is equal to aacPlus at 32 kbps and significantly better than aacPlus at 24 kbps. These results indicate a coding gain for enhanced aacPlus of 25% over stereo aacPlus.

Figure 12: MUSHRA listening test results for two sites (black and gray symbols) showing mean grading and 95% confidence interval.

9. CONCLUSIONS

We have described a parametric-stereo coder which enables stereo coding using a mono audio channel and spatial parameters. Depending on the desired spatial quality, the spatial parameters require between 1 and 8 kbps. It has been demonstrated that for headphone playback, a spatial parameter bit stream of 5 to 8 kbps is sufficient to reach a quality level that is comparable to popular coding techniques currently on the market (i.e., MPEG-1 layer 3). Furthermore, it has been shown that a state-of-the-art coder such as aacPlus benefits from a significant reduction in bit rate without subjective quality loss if enhanced with parametric stereo.


ACKNOWLEDGMENTS

We would like to thank the colleagues at Coding Technologies for their fruitful cooperation in the development of HE-AAC/PS and especially Heiko Purnhagen and Jonas Engdegard for their valuable input for this paper. We would also like to thank our colleagues Rob Sluijter, Michel van Loon, and Gerard Hotho for their helpful comments on earlier versions of this manuscript. Furthermore, we would like to thank our colleagues at the Philips Digital Systems Lab for carrying out the listening tests. Finally, we would like to thank the anonymous reviewers for their thorough review and helpful suggestions to improve the manuscript.

REFERENCES

[1] K. Brandenburg and G. Stoll, “ISO-MPEG-1 Audio: A generic standard for coding of high-quality digital audio,” Journal of the Audio Engineering Society, vol. 42, no. 10, pp. 780–792, 1994.
[2] K. Brandenburg, “MP3 and AAC explained,” in Proc. 17th International AES Conference, Florence, Italy, September 1999.
[3] A. C. den Brinker, E. G. P. Schuijers, and A. W. J. Oomen, “Parametric coding for high-quality audio,” in Proc. 112th AES Convention, Munich, Germany, May 2002, preprint 5554.
[4] E. Schuijers, W. Oomen, B. den Brinker, and J. Breebaart, “Advances in parametric coding for high-quality audio,” in Proc. 114th AES Convention, Amsterdam, The Netherlands, March 2003, preprint 5852.
[5] O. Kunz, “Enhancing MPEG-4 AAC by spectral band replication,” in Technical Sessions Proceedings of Workshop and Exhibition on MPEG-4 (WEMP4), pp. 41–44, San Jose, Calif, USA, June 2002.
[6] M. Dietz, L. Liljeryd, K. Kjorling, and O. Kunz, “Spectral band replication, a novel approach in audio coding,” in Proc. 112th AES Convention, Munich, Germany, May 2002, preprint 5553.
[7] J. D. Johnston and A. J. Ferreira, “Sum-difference stereo transform coding,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’92), vol. 2, pp. 569–572, San Francisco, Calif, USA, March 1992.
[8] R. G. van der Waal and R. N. J. Veldhuis, “Subband coding of stereophonic digital audio signals,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’91), Toronto, Ontario, Canada, April 1991.
[9] S.-S. Kuo and J. D. Johnston, “A study of why cross channel prediction is not applicable to perceptual audio coding,” IEEE Signal Processing Lett., vol. 8, no. 9, pp. 245–247, 2001.
[10] T. Liebchen, “Lossless audio coding using adaptive multichannel prediction,” in Proc. 113th AES Convention, Los Angeles, Calif, USA, October 2002, preprint 5680.
[11] R. G. Klumpp and H. R. Eady, “Some measurements of interaural time difference thresholds,” Journal of the Acoustical Society of America, vol. 28, pp. 859–860, 1956.
[12] J. Zwislocki and R. S. Feldman, “Just noticeable differences in dichotic phase,” Journal of the Acoustical Society of America, vol. 28, pp. 860–864, 1956.
[13] J. D. Johnston and K. Brandenburg, “Wideband coding—Perceptual considerations for speech and music,” in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds., chapter 4, pp. 109–140, Marcel Dekker, New York, NY, USA, 1992.
[14] J. Herre, K. Brandenburg, and D. Lederer, “Intensity stereo coding,” in Proc. 96th AES Convention, Amsterdam, The Netherlands, February–March 1994, preprint 3799.
[15] C. Faller and F. Baumgarte, “Efficient representation of spatial audio using perceptual parameterization,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA ’01), pp. 199–202, New Platz, NY, USA, October 2001.
[16] C. Faller and F. Baumgarte, “Binaural cue coding: a novel and efficient representation of spatial audio,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’02), vol. 2, pp. 1841–1844, Orlando, Fla, USA, May 2002.
[17] F. Baumgarte and C. Faller, “Design and evaluation of binaural cue coding schemes,” in Proc. 113th AES Convention, Los Angeles, Calif, USA, October 2002, preprint 5706.
[18] F. Baumgarte and C. Faller, “Why binaural cue coding is better than intensity stereo coding,” in Proc. 112th AES Convention, Munich, Germany, May 2002, preprint 5575.
[19] F. Baumgarte and C. Faller, “Estimation of auditory spatial cues for binaural cue coding,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’02), vol. 2, pp. 1801–1804, Orlando, Fla, USA, May 2002.
[20] C. Faller and F. Baumgarte, “Binaural cue coding applied to stereo and multi-channel audio compression,” in Proc. 112th AES Convention, Munich, Germany, May 2002, preprint 5574.
[21] F. Baumgarte and C. Faller, “Binaural cue coding—part I: Psychoacoustic fundamentals and design principles,” IEEE Trans. Speech Audio Processing, vol. 11, no. 6, pp. 509–519, 2003.
[22] C. Faller and F. Baumgarte, “Binaural cue coding—part II: Schemes and applications,” IEEE Trans. Speech Audio Processing, vol. 11, no. 6, pp. 520–531, 2003.
[23] E. Schuijers, J. Breebaart, H. Purnhagen, and J. Engdegard, “Low complexity parametric stereo coding,” in Proc. 116th AES Convention, Berlin, Germany, May 2004, preprint 6073.
[24] H. Purnhagen, J. Engdegard, J. Roden, and L. Liljeryd, “Synthetic ambience in parametric stereo coding,” in Proc. 116th AES Convention, Berlin, Germany, May 2004, preprint 6074.
[25] J. W. Strutt (Lord Rayleigh), “On our perception of sound direction,” Philosophical Magazine, vol. 13, pp. 214–232, 1907.
[26] B. Sayers, “Acoustic image lateralization judgments with binaural tones,” Journal of the Acoustical Society of America, vol. 36, pp. 923–926, 1964.


[27] E. R. Hafter and S. C. Carrier, “Masking-level differences obtained with pulsed tonal maskers,” Journal of the Acoustical Society of America, vol. 47, pp. 1041–1047, 1970.
[28] W. A. Yost, “Lateral position of sinusoids presented with interaural intensive and temporal differences,” Journal of the Acoustical Society of America, vol. 70, no. 2, pp. 397–409, 1981.
[29] F. L. Wightman and D. J. Kistler, “Headphone simulation of free-field listening. I. Stimulus synthesis,” Journal of the Acoustical Society of America, vol. 85, no. 2, pp. 858–867, 1989.
[30] B. Kollmeier and I. Holube, “Auditory filter bandwidths in binaural and monaural listening conditions,” Journal of the Acoustical Society of America, vol. 92, no. 4, pp. 1889–1901, 1992.
[31] M. van der Heijden and C. Trahiotis, “Binaural detection as a function of interaural correlation and bandwidth of masking noise: Implications for estimates of spectral resolution,” Journal of the Acoustical Society of America, vol. 103, no. 3, pp. 1609–1614, 1998.
[32] I. Holube, M. Kinkel, and B. Kollmeier, “Binaural and monaural auditory filter bandwidths and time constants in probe tone detection experiments,” Journal of the Acoustical Society of America, vol. 104, no. 4, pp. 2412–2425, 1998.
[33] H. S. Colburn and N. I. Durlach, “Models of binaural interaction,” in Handbook of Perception, E. C. Carterette and M. P. Friedman, Eds., vol. IV, pp. 467–518, Academic Press, New York, NY, USA, 1978.
[34] W. Lindemann, “Extension of a binaural cross-correlation model by contralateral inhibition. I. Simulation of lateralization for stationary signals,” Journal of the Acoustical Society of America, vol. 80, no. 6, pp. 1608–1622, 1986.
[35] R. M. Stern, A. S. Zeiberg, and C. Trahiotis, “Lateralization of complex binaural stimuli: A weighted-image model,” Journal of the Acoustical Society of America, vol. 84, no. 1, pp. 156–165, 1988.
[36] W. Gaik, “Combined evaluation of interaural time and intensity differences: psychoacoustic results and computer modeling,” Journal of the Acoustical Society of America, vol. 94, no. 1, pp. 98–110, 1993.
[37] J. Breebaart, S. van de Par, and A. Kohlrausch, “Binaural processing model based on contralateral inhibition. I. Model structure,” Journal of the Acoustical Society of America, vol. 110, no. 2, pp. 1074–1088, 2001.
[38] J. W. Hall and M. A. Fernandes, “The role of monaural frequency selectivity in binaural analysis,” Journal of the Acoustical Society of America, vol. 76, no. 2, pp. 435–439, 1984.
[39] A. Kohlrausch, “Auditory filter shape derived from binaural masking experiments,” Journal of the Acoustical Society of America, vol. 84, no. 2, pp. 573–583, 1988.
[40] B. R. Glasberg and B. C. J. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hearing Research, vol. 47, no. 1-2, pp. 103–138, 1990.
[41] B. Kollmeier and R. H. Gilkey, “Binaural forward and backward masking: evidence for sluggishness in binaural detection,” Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1709–1719, 1990.
[42] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, MIT Press, Cambridge, Mass, USA, 1997.
[43] J. Breebaart, S. van de Par, and A. Kohlrausch, “The contribution of static and dynamically varying ITDs and IIDs to binaural detection,” Journal of the Acoustical Society of America, vol. 106, no. 2, pp. 979–992, 1999.
[44] L. A. Jeffress, “A place theory of sound localization,” Journal of Comparative and Physiological Psychology, vol. 41, pp. 35–39, 1948.
[45] H. S. Colburn, “Theory of binaural interaction based on auditory-nerve data. II. Detection of tones in noise,” Journal of the Acoustical Society of America, vol. 61, no. 2, pp. 525–533, 1977.
[46] R. M. Stern and G. D. Shear, “Lateralization and detection of low-frequency binaural stimuli: Effects of distribution of internal delay,” Journal of the Acoustical Society of America, vol. 100, no. 4, pp. 2278–2288, 1996.
[47] L. R. Bernstein and C. Trahiotis, “The normalized correlation: accounting for binaural detection across center frequency,” Journal of the Acoustical Society of America, vol. 100, no. 6, pp. 3774–3784, 1996.
[48] N. I. Durlach, “Equalization and cancellation theory of binaural masking-level differences,” Journal of the Acoustical Society of America, vol. 35, pp. 1206–1218, 1963.
[49] D. M. Green, “Signal-detection analysis of equalization and cancellation model,” Journal of the Acoustical Society of America, vol. 40, pp. 833–838, 1966.
[50] T. M. Shackleton, R. Meddis, and M. J. Hewitt, “Across frequency integration in a model of lateralization,” Journal of the Acoustical Society of America, vol. 91, no. 4, pp. 2276–2279, 1992.
[51] S. van de Par, A. Kohlrausch, J. Breebaart, and M. McKinney, “Discrimination of different temporal envelope structures of diotic and dichotic target signals within diotic wide-band noise,” in Auditory Signal Processing: Physiology, Psychoacoustics, and Models, D. Pressnitzer, A. de Cheveigne, S. McAdams, and L. Collet, Eds., Springer, New York, NY, USA, November 2004.
[52] R. M. Hershkowitz and N. I. Durlach, “Interaural time and amplitude jnds for a 500-Hz tone,” Journal of the Acoustical Society of America, vol. 46, pp. 1464–1467, 1969.
[53] D. McFadden, L. A. Jeffress, and H. L. Ermey, “Difference in interaural phase and level in detection and lateralization: 250 Hz,” Journal of the Acoustical Society of America, vol. 50, pp. 1484–1493, 1971.
[54] W. A. Yost, “Weber’s fraction for the intensity of pure tones presented binaurally,” Perception and Psychophysics, vol. 11, pp. 61–64, 1972.
[55] D. W. Grantham, “Interaural intensity discrimination: insensitivity at 1000 Hz,” Journal of the Acoustical Society of America, vol. 75, no. 4, pp. 1191–1194, 1984.
[56] A. W. Mills, “Lateralization of high-frequency tones,” Journal of the Acoustical Society of America, vol. 32, pp. 132–134, 1960.
[57] R. C. Rowland Jr. and J. V. Tobias, “Interaural intensity difference limen,” Journal of Speech and Hearing Research, vol. 10, pp. 733–744, 1967.
[58] W. A. Yost and E. R. Hafter, “Lateralization,” in Directional Hearing, W. A. Yost and G. Gourevitch, Eds., pp. 49–84, Springer, New York, NY, USA, 1987.
[59] L. A. Jeffress and D. McFadden, “Differences of interaural phase and level in detection and lateralization,” Journal of the Acoustical Society of America, vol. 49, pp. 1169–1179, 1971.
[60] W. A. Yost, D. W. Nielsen, D. C. Tanis, and B. Bergert, “Tone-on-tone binaural masking with an antiphasic masker,” Perception and Psychophysics, vol. 15, pp. 233–237, 1974.
[61] W. A. Yost, “Discrimination of interaural phase differences,” Journal of the Acoustical Society of America, vol. 55, pp. 1299–1303, 1974.
[62] S. van de Par and A. Kohlrausch, “A new approach to comparing binaural masking level differences at low and high frequencies,” Journal of the Acoustical Society of America, vol. 101, no. 3, pp. 1671–1680, 1997.
[63] L. R. Bernstein and C. Trahiotis, “The effects of signal duration on NoSo and NoSπ thresholds at 500 Hz and 4 kHz,” Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1776–1783, 1999.

Page 40: Anthropomorphic Processing of Audio and Speechdownloads.hindawi.com/journals/specialissues/807173.pdf · Anthropomorphic Processing of Audio and Speech Guest Editors: Werner Verhelst,

Parametric Coding of Stereo Audio 1321

[64] F. A. Bilsen and J. Raatgever, “Spectral dominance in binauralhearing,” Acustica, vol. 28, pp. 131–132, 1973.

[65] F. A. Bilsen and J. Raatgever, “Spectral dominance in binaurallateralization,” Acustica, vol. 28, pp. 131–132, 1977.

[66] D. E. Robinson and L. A. Jeffress, “Effect of varying the inter-aural noise correlation on the detectability of tonal signals,”Journal of the Acoustical Society of America, vol. 35, pp. 1947–1952, 1963.

[67] T. L. Langford and L. A. Jeffress, “Effect of noise crosscorre-lation on binaural signal detection,” Journal of the AcousticalSociety of America, vol. 36, pp. 1455–1458, 1964.

[68] K. J. Gabriel and H. S. Colburn, “Interaural correlation dis-crimination: I. Bandwidth and level dependence,” Journal ofthe Acoustical Society of America, vol. 69, no. 5, pp. 1394–1401,1981.

[69] J. F. Culling, H. S. Colburn, and M. Spurchise, “Interaural cor-relation sensitivity,” Journal of the Acoustical Society of Amer-ica, vol. 110, no. 2, pp. 1020–1029, 2001.

[70] J. W. Hall and A. D. G. Harvey, “NoSo and NoSπ thresholdsas a function of masker level for narrow-band and widebandmasking noise,” Journal of the Acoustical Society of America,vol. 76, no. 6, pp. 1699–1703, 1984.

[71] L. R. Bernstein and C. Trahiotis, “Discrimination of interauralenvelope correlation and its relation to binaural unmasking athigh frequencies,” Journal of the Acoustical Society of America,vol. 91, no. 1, pp. 306–316, 1992.

[72] L. R. Bernstein and C. Trahiotis, “The effects of randomiz-ing values of interaural disparities on binaural detection andon discrimination of interaural correlation,” Journal of theAcoustical Society of America, vol. 102, no. 2, pp. 1113–1120,1997.

[73] U. T. Zwicker and E. Zwicker, “Binaural masking-level differ-ence as a function of masker and test-signal duration,” Hear-ing Research, vol. 13, no. 3, pp. 215–219, 1984.

[74] R. H. Wilson and C. G. Fowler, “Effects of signal durationon the 500-Hz masking-level difference,” Scandinavian Audi-ology, vol. 15, no. 4, pp. 209–215, 1986.

[75] R. H. Wilson and R. A. Fugleberg, “Influence of signal dura-tion on the masking-level difference,” Journal of Speech andHearing Research, vol. 30, no. 3, pp. 330–334, 1987.

[76] H. Wallach, E. B. Newman, and M. R. Rosenzweig, “Theprecedence effect in sound localization,” American Journal ofPsychology, vol. 62, pp. 315–336, 1949.

[77] P. M. Zurek, “The precedence effect and its possible role in theavoidance of interaural ambiguities,” Journal of the AcousticalSociety of America, vol. 67, no. 3, pp. 952–964, 1980.

[78] B. G. Shinn-Cunningham, P. M. Zurek, and N. I. Durlach,“Adjustment and discrimination measurements of the prece-dence effect,” Journal of the Acoustical Society of America,vol. 93, no. 5, pp. 2923–2932, 1993.

[79] R. Y. Litovsky, H. S. Colburn, W. A. Yost, and S. J. Guzman,“The precedence effect,” Journal of the Acoustical Society ofAmerica, vol. 106, no. 4, pp. 1633–1654, 1999.

[80] S. P. Lipshitz, “Stereo microphone techniques; are the puristswrong?” Journal of the Audio Engineering Society, vol. 34, no. 9,pp. 716–744, 1986.

[81] V. Pulkki, M. Karjalainen, and J. Huopaniemi, “Analyzingvirtual sound source attributes using a binaural auditorymodel,” Journal of the Audio Engineering Society, vol. 47, no. 4,pp. 203–217, 1999.

[82] B. S. Atal and M. R. Schroeder, “Apparent sound source trans-lator,” US Patent 3,236,949, February 1966.

[83] A. J. M. Houtsma, C. Trahiotis, R. N. J. Veldhuis, and R. vander Waal, “Bit rate reduction and binaural masking release indigital coding of stereo sound,” Acustica/Acta Acustica, vol. 82,pp. 908–909, 1996.

[84] A. J. M. Houtsma, C. Trahiotis, R. N. J. Veldhuis, and R.van der Waal, “Further bit rate reduction through binau-ral processing,” Acustica/Acta Acustica, vol. 82, pp. 909–910,1996.

[85] N. I. Durlach and H. S. Colburn, “Binaural phenomena,” inHandbook of Perception, E. C. Carterette and M. P. Friedman,Eds., vol. IV, pp. 365–466, Academic Press, New York, NY,USA, 1978.

[86] S. E. Boehnke, S. E. Hall, and T. Marquardt, “Detection ofstatic and dynamic changes in interaural correlation,” Journalof the Acoustical Society of America, vol. 112, no. 4, pp. 1617–1626, 2002.

[87] H. Lauridsen, “Experiments concerning different kinds ofroom-acoustics recording,” Ingenioren, vol. 47, 1954.

[88] M. R. Schroeder, “Synthesis of low-peak-factor signals and bi-nary sequences with low autocorrelation,” IEEE Trans. Inform.Theory, vol. 16, no. 1, pp. 85–89, 1970.

[89] R. Irwan and R. M. Aarts, “Two-to-five channel sound pro-cessing,” Journal of the Audio Engineering Society, vol. 50,no. 11, pp. 914–926, 2002.

[90] M. Wolters, K. Kjorling, D. Homm, and H. Purnhagen, “Acloser look into MPEG-4 high efficiency AAC,” in Proc. 115thAES Convention, New York, NY, USA, October 2003, preprint5871.

[91] J. Breebaart, S. van de Par, A. Kohlrausch, and E. Schui-jers, “High-quality parametric spatial audio coding at low bi-trates,” in Proc. 116th AES Convention, Berlin, Germany, May2004, preprint 6072.

[92] H. Purnhagen, “Low complexity parametric stereo coding inMPEG-4,” in Proc. 7th International Conference on Digital Au-dio Effects (DAFx ’04), Naples, Italy, October 2004, available:http://dafx04.na.infn.it/.

[93] ISO/IEC, “Coding of audio-visual objects—Part 3: Audio,AMENDMENT 1: Bandwidth Extension,” ISO/IEC Int. Std.14496-3:2001/Amd.1:2003, 2003.

[94] G. Stoll and F. Kozamernik, “EBU listening tests on internetaudio codecs,” in EBU Technical Review, no. 28, 2000.

[95] ISO/IEC JTC1/SC29/WG11, “Report on the Verifica-tion Tests of MPEG-4 High Efficiency AAC,” ISO/IECJTC1/SC29/WG11 N6009, October 2003.

Jeroen Breebaart was born in the Netherlands in 1970. He studied biomedical engineering at the Technical University Eindhoven. He received his Ph.D. degree in 2001 from the Institute for Perception Research (IPO) in the field of mathematical models of human spatial hearing. Currently, he is a researcher in the Digital Signal Processing Group, Philips Research Laboratories Eindhoven. His main fields of interest and expertise are spatial hearing, parametric stereo and multichannel audio coding, automatic audio content analysis, and audio signal processing tools. He published several papers on binaural detection, binaural modeling, and spatial audio coding. He also contributed to the development of parametric stereo coding algorithms as currently standardized in MPEG-4 and 3GPP.


Steven van de Par studied physics at the Technical University Eindhoven, and received his Ph.D. degree in 1998 from the Institute for Perception Research on a topic related to binaural hearing. As a Postdoc at the same institute, he studied auditory-visual interaction and he was a Guest Researcher at the University of Connecticut Health Centre. In the beginning of 2000, he joined Philips Research Laboratories in Eindhoven. Main fields of expertise are auditory and multisensory perception and low-bit-rate audio coding. He published various papers on binaural detection, auditory-visual synchrony perception, and audio-coding-related topics. He participated in several projects on low-bit-rate audio coding based on sinusoidal techniques and is presently participating in the EU project Adaptive Rate-Distortion Optimized audio codeR (ARDOR).

Armin Kohlrausch studied physics at the University of Göttingen, Germany, and specialized in acoustics. He received his M.S. degree in 1980 and his Ph.D. degree in 1984, both in perceptual aspects of sound. From 1985 until 1990, he worked at the Third Physical Institute, University of Göttingen, being responsible for research and teaching in the fields of psychoacoustics and room acoustics. In 1991, he joined the Philips Research Laboratories in Eindhoven and worked in the Speech and Hearing Group of the Institute for Perception Research (IPO). Since 1998, he combines his work at Philips Research Laboratories with a Professor position for multisensory perception at the TU/e. In 2004, he was appointed a Research Fellow of Philips Research. He is a member of a great number of scientific societies, both in Europe and the US. Since 1998, he has been a Fellow of the Acoustical Society of America and serves currently as an Associate Editor for the Journal of the Acoustical Society of America, covering the areas of binaural and spatial hearing. His main scientific interest is in the experimental study and modelling of auditory and multisensory perception in humans and the transfer of this knowledge to industrial media applications.

Erik Schuijers was born in the Netherlands in 1976. He received the M.S. degree in electrical engineering from the Eindhoven University of Technology, the Netherlands, in 1999. Since 2000, he has joined the Sound Coding Group of Philips Digital Systems Laboratories in Eindhoven, the Netherlands. His main activity has been the research and development of the MPEG-4 parametric-audio and parametric-stereo coding tools. Currently Mr. Schuijers is contributing to the recent standardization of MPEG-4 spatial audio coding.


EURASIP Journal on Applied Signal Processing 2005:9, 1323–1333, © 2005 Hindawi Publishing Corporation

Analysis of the IHC Adaptation for the Anthropomorphic Speech Processing Systems

Alexei V. Ivanov
Computer Engineering Department, the Belarusian State University of Informatics and Radioelectronics, 220013 Minsk, Belarus
Email: alexei v [email protected]

Alexander A. Petrovsky
Real-Time Systems Department, the Bialystok Technical University, 15351 Bialystok, Poland
Email: [email protected]

Received 1 November 2003; Revised 5 September 2004

We analyse the properties of the physiological model of the adaptive behaviour of the chemical synapse between inner hair cells (IHC) and auditory neurons. On the basis of the performed analysis, we propose equivalent structures of the model for implementation in the digital domain. The main conclusion of the analysis is that the synapse reservoir model is equivalent in its properties to a signal-dependent automatic gain-control mechanism. We plot guidelines for the creation of artificial anthropomorphic algorithms, which exploit properties of the original synapse model. This paper also presents a concise description of the experiments, which prove the presence of a positive effect from the introduction of the depicted anthropomorphic algorithm into the feature extraction of an automated speech recognition engine.

Keywords and phrases: inner hair cell (IHC), Meddis IHC model, IHC adaptation, auditory models, modulation spectrum filtering.

1. INTRODUCTION

1.1. Anthropomorphism, psychoacoustics, and auditory physiology

Many contemporary speech processing techniques tend to reflect properties of the human auditory apparatus. As a rule, most of the information about the way human beings process acoustic data comes into artificial applications from the field of psychoacoustics (for classical psychoacoustics work, refer to [1]).

Apart from the experiments with subjects that have reliably diagnosed and anatomically localised auditory pathology, psychoacoustics treats the whole human auditory system as a "black box" and tries to infer its properties without particular interest in its internal structure. Most of the psychoacoustical experiments include analysis of the responses to "simple" sounds, like pure tones, wideband noise, coloured noises, clicks, and so forth. But a lot of evidence (simultaneous and nonsimultaneous masking, pitch perception, etc.) points to the fact that the auditory system is essentially a nonlinear system.

From system identification theory, it is known that the response of a linear system to an arbitrary excitation can be derived from the study of the responses of such a system to simple sounds, for example, tones, noises, and clicks.

There is no need to study the internal structure of the linear black box as long as the responses to the simple input signals are known. Strictly speaking, for the case of nonlinear systems, this black-box approach is not applicable. There are mainly two possibilities to model a nonlinear system: either to construct a semiparametric statistical learning machine, a "neural-network-like" structure, and let it adapt through a kind of learning algorithm, or to follow the parametric approach and somehow infer the internal structure of the nonlinear system to be modelled, parse it into smaller and, hopefully, simpler building blocks, and then tune the parameters of those blocks so that the model response matches that of the original system.

The first alternative suffers from the problems in creating a representative training set, as well as from the absence of a priori information regarding the required model complexity. The mentioned difficulties virtually prohibit application of this approach to auditory modelling. The second of the mentioned approaches corresponds to the physiologically grounded studies of the auditory apparatus.

Among the solutions which could benefit most from the employment of physiological models, one can name the development of cochlear implants, the objective and quantitative quality assessment of coded audio reconstruction, anthropomorphic audio coding, and automated speech recognition applications. While the first two mentioned branches are concentrated on the closest possible literal reproduction of the auditory apparatus properties in an artificial device, the latter two imply a computationally efficient way to implement the "biological" audio processing algorithm with a certain predefined precision.

In spite of being precise and objective, the physiological hearing models neither provide a clear signal processing interpretation of the modelled phenomena, nor give a ready answer regarding the relevance of those phenomena to the hearing process in general. Thus, a straightforward application of the physiological models to the fields of audio coding and speech recognition may not easily gain an advantage over the conventional algorithms [2]. Before the employment of a certain physiological model in the mentioned applications, one should answer the questions of why it is important (i.e., what result is expected from it) and what is the most efficient way of its implementation. This reasoning leads to the conclusion that further analysis of the available physiological models, with the aim of finding their algorithmic interpretation, is needed. This paper is devoted to such an analysis.

Particularly, we aim at analysing the adaptation of the chemical "inner-hair-cell auditory nerve" (IHC-AN) synapse, and at inferring its importance for artificial anthropomorphic audio signal (and particularly speech) processing systems in adverse environments. Indeed, strong onset responses of the auditory nerve (AN) fibers to the presented stimulus are followed by "adaptation", that is, a gradual decrease of the response amplitude over time while the stimulus amplitude remains constant. This "adaptive strategy" at first glance seems advantageous, since it allows an emphasis of nonstationarities within the incoming signal.

2. RESERVOIR MODEL OF IHC-AN CHEMICAL SYNAPSE

Physiological research into the way the inner ear converts an acoustical stimulation into a response of the auditory nerve fibers (for a brief summary and review, refer to [3]) has, among many other findings, led to the following conclusions:

(i) inner hair cells are sensory cells for mechanical vibrations;

(ii) each IHC makes chemical synapses with approximately 10–30 peripheral axons of primary bipolar neurons, whose cell bodies are contained in the spiral ganglion and whose modiolar axons form the auditory (VIIIth) nerve;

(iii) one can distinguish three groups of afferent neurons based on the level of their spontaneous activity: low-spontaneous rate, medium-spontaneous rate, and high-spontaneous rate fibers; the level of spontaneous activity of a fiber is closely related to the form and the size of the synapse it forms with the IHC;

(iv) the chemical nature of the IHC-AN synapse determines the following properties: adaptive responses, synaptic delays, and quantised response amplitudes.

[Figure 1: Schematic representation of the Meddis reservoir model (hair cell with transmitter factory, free transmitter pool q(t), reprocessing store w(t), release rate k(t)q(t), replenishment y[M − q(t)], recirculation x w(t), synaptic cleft c(t) with loss l c(t) and return r c(t), and the postsynaptic afferent auditory nerve fiber).]

Properties of the chemical IHC-AN synapse are successfully captured by the so-called "reservoir models," in which neurotransmitter is produced and stored in the IHC to be released in accordance with an IHC transmitter release probability that changes with mechanical vibrations in the inner ear. The first reservoir models for IHC-AN synapses were proposed as early as in [4, 5].

Meddis has put forward [6] and further developed [7, 8, 9, 10, 11] a model of the IHC, which includes a version of the reservoir model of the chemical synapse. The latest model [10, 11] allows for a nice fit between experimental and model data for all three groups of IHC-AN synapses (low-, medium-, and high-spontaneous rate fibers) with only the calcium conductance parameters being changed.

It must be noted here that in reality neurotransmitter release into the synaptic cleft is a probabilistic and quantal process. However, to a certain degree, the dynamical properties of the synapse may be reflected by a model that assumes that the neurotransmitter flow is deterministic and continuous. From the practical point of view, this assumption corresponds to averaging the synapse response over many identical stimulations. The latest Meddis models [9, 10, 11] depart from this assumption, offering a better correspondence to the data recordings of individual experiments. For the purpose of the analysis of the core properties of the IHC-AN synapse and the construction of anthropomorphic artificial algorithms, we further narrow our consideration to the deterministic and continuous case.

The Meddis version of the reservoir model is represented by the schematic drawing in Figure 1 and is described by the set of equations (1). The "free transmitter pool" is the main storage facility for the transmitter that is immediately ready to be released from the cell into the "synaptic cleft." It is filled with neurotransmitter coming from the "transmitter factory" as well as with that recycled at the "reprocessing store." Neurotransmitter is released into the "synaptic cleft" at a certain rate, dependent upon the IHC stimulation as well as on the instantaneous quantity of the stored transmitter. From the "synaptic cleft," the transmitter is either returned to the cell for reprocessing or lost by diffusion.


We assume that the pool capacity equals M. The quantity of the transmitter stored in the pool at a certain time instant will be denoted by q(t). The rate at which the factory produces new transmitter is proportional to the free volume of the pool, y[M − q(t)]; here the operation [· · ·] constitutes the choice of the larger value between zero and the value inside the square brackets. Alternatively, we may put that the coefficient y becomes zero at the moment the pool is filled to the limit. We denote the instantaneous amount of the transmitter in the reprocessing store by w(t). The recirculation rate is proportional to the amount of the transmitter in the reprocessing store, x w(t). The rate at which transmitter is sent to the cleft is equal to the product of the membrane permeability k(t) and the quantity of the transmitter in the pool, q(t). The quantity of the neurotransmitter in the cleft at a certain instant will be denoted by c(t). The rates of neurotransmitter loss and of its return for reprocessing are proportional to the amount of the transmitter in the cleft, l c(t) and r c(t), respectively.

As follows from the above description, the Meddis version of the reservoir model is described by the following set of differential equations:

dq(t)/dt = x w(t) + y[M − q(t)] − k(t) q(t),
dc(t)/dt = k(t) q(t) − (l + r) c(t),                      (1)
dw(t)/dt = r c(t) − x w(t).

The initial conditions of the model are taken in accordance with the assumption that at a certain instant t_0 the system is in an equilibrium state:

x w(t_0) + y[M − q(t_0)] = k(t_0) q(t_0),
k(t_0) q(t_0) = (l + r) c(t_0),                           (2)
r c(t_0) = x w(t_0).

3. ADAPTATION PROPERTY OF THE RESERVOIR MODEL OF IHC-AN CHEMICAL SYNAPSE

Figure 2 presents a typical response of the Meddis model to an excitation. The signal k(t) is the input to the reservoir model and is computed by the earlier stages of the cochlear model (the cochlear filter bank [12] in combination with the first part of the IHC model [10]) when a test tone of 6 kHz is presented. The IHC medium-spontaneous rate fiber model gets its input from the cochlear filter bank section whose centre frequency is closest to 6 kHz. It is running at the sampling frequency of 16 kHz.

Typical values of the model coefficients were taken from the works of Meddis [6, 7, 8, 9] and are as follows:

M = 10,  x = 66.3,  y = 10,  l = 2580,  r = 6580.          (3)

[Figure 2: Reservoir model response to the excitation with the 6 kHz tone, CF ∼ 6 kHz, Fs = 48 kHz, medium-spontaneous rate fiber. Panels show k(t), q(t), c(t), and w(t) versus time. A: steady state, B: onset, C: adaptation, D: offset.]

In order to perform this digital simulation of the synapse model (depicted in Figure 2), the forward difference approximation of the set of differential equations (1) was used, as advised in [8].
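As an illustration, the following Python sketch (not the authors' implementation; NumPy and the parameter values of (3) are assumed) integrates the equation set (1) with the forward difference (Euler) scheme for a given discrete permeability signal k[n]:

import numpy as np

def simulate_reservoir(k, Fs=16000.0, M=10.0, x=66.3, y=10.0, l=2580.0, r=6580.0):
    """Forward-difference (Euler) simulation of the reservoir equations (1)
    for a given membrane-permeability signal k[n]; returns the cleft contents c[n]."""
    N = len(k)
    q = np.empty(N); c = np.empty(N); w = np.empty(N)
    # equilibrium initial conditions (2) for the initial operating point k[0]
    k0 = k[0]
    q[0] = M * y * (l + r) / (y * (l + r) + l * k0)
    c[0] = k0 * q[0] / (l + r)
    w[0] = r * c[0] / x
    dt = 1.0 / Fs
    for n in range(1, N):
        dq = x * w[n-1] + y * max(M - q[n-1], 0.0) - k[n-1] * q[n-1]
        dc = k[n-1] * q[n-1] - (l + r) * c[n-1]
        dw = r * c[n-1] - x * w[n-1]
        q[n] = q[n-1] + dt * dq
        c[n] = c[n-1] + dt * dc
        w[n] = w[n-1] + dt * dw
    return c

Feeding a constant k[n] reproduces the steady-state region, while a stepwise increase of k[n] produces onset and adaptation regions analogous to B and C in Figure 2.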

As can be seen from the figure, there are four distinct regions in the model response signal c(t): the steady-state response to a long-term absence of stimulation (denoted as region A); the onset response (region B), a brief rise of the response level to higher values; the subsequent adaptation of the response level to a much lower activity (region C); and the offset region (region D), when the synapse recovers from the stimulation and the response level slowly converges back to the steady-state level.

For a detailed review of the adaptation properties of the IHC, please refer to [11].

4. ANALYSIS OF THE RESERVOIR MODEL OF IHC-AN CHEMICAL SYNAPSE

Looking at the equation set (1), one can easily notice that the functions c(t) and f(t) = k(t)q(t) are linked by a first-order linear constant-coefficient differential equation with zero free term:

dc(t)/dt + (l + r) c(t) = f(t).                            (4)

[Figure 3: Frequency characteristic of filter A (F_s = 16 kHz); magnitude (dB) and phase (degrees) versus frequency (Hz).]

Thus, (4) describes a linear time-invariant system, which performs the transformation of f(t) into c(t). Taking the forward difference approximation of the differential problem and assuming that both functions take discrete values at discrete time instants, it is possible to approximate this system with a digital filter:

c(n) = (1/F_s) f(n−1) − ((l + r − F_s)/F_s) c(n−1),        (5)

H_A(z) = ((1/F_s) z^{-1}) / (1 − (1 − (l + r)/F_s) z^{-1}).   (6)

Here F_s denotes the sampling frequency. We will further refer to this filter as "filter A." With the typical values of the parameters l and r, this is a lowpass filter with a rather gently sloping frequency response characteristic, which is presented in Figure 3.
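For reference, the curve of Figure 3 can be reproduced numerically; the following sketch (SciPy is assumed and is not part of the original paper) evaluates (6):

import numpy as np
from scipy.signal import freqz

Fs, l, r = 16000.0, 2580.0, 6580.0
# filter A, equation (6): numerator (1/Fs) z^-1, denominator 1 - (1 - (l+r)/Fs) z^-1
b = [0.0, 1.0 / Fs]
a = [1.0, -(1.0 - (l + r) / Fs)]
w, H = freqz(b, a, worN=512, fs=Fs)          # frequency axis in Hz
print(20 * np.log10(np.abs(H[0])))            # DC gain = 1/(l+r), about -79 dB (cf. Figure 3)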

Further analysis of the equation set (1) leads to the conclusion that the functions s(t) = M − q(t) and f(t) = k(t)q(t) are also linked by a linear constant-coefficient differential equation with zero free term, this time of the third order:

d^3 s(t)/dt^3 + (x + y + l + r) d^2 s(t)/dt^2 + ((x + y)(l + r) + xy) ds(t)/dt + xy(l + r) s(t)
    = d^2 f(t)/dt^2 + (x + l + r) df(t)/dt + xl f(t).      (7)

We note that this equation is valid for all values s(t) = M − q(t) ≥ 0. If s(t) = M − q(t) ≤ 0, then it must be substituted with the following equation, which is obtained from (7) by letting y = 0:

d^3 s(t)/dt^3 + (x + l + r) d^2 s(t)/dt^2 + x(l + r) ds(t)/dt
    = d^2 f(t)/dt^2 + (x + l + r) df(t)/dt + xl f(t).      (8)

The performed digital simulations show that for realistic input signals and a reasonably high sampling frequency, it is enough to use (7) only.

[Figure 4: Frequency characteristic of filter B (F_s = 16 kHz); magnitude (dB) and phase (degrees) versus frequency (Hz).]

Again it is possible to approximate the system described by (7) with a digital filter:

s(n) = (b_1/a_0) f(n−1) + (b_2/a_0) f(n−2) + (b_3/a_0) f(n−3)
       − (a_1/a_0) s(n−1) − (a_2/a_0) s(n−2) − (a_3/a_0) s(n−3),
a_0 = F_s^3,
a_1 = −3F_s^3 + F_s^2 (x + y + l + r),
a_2 = 3F_s^3 − 2F_s^2 (x + y + l + r) + F_s ((x + y)(l + r) + xy),
a_3 = −F_s^3 + F_s^2 (x + y + l + r) − F_s ((x + y)(l + r) + xy) + xy(l + r),
b_1 = F_s^2,
b_2 = −2F_s^2 + F_s (x + l + r),
b_3 = F_s^2 − F_s (x + l + r) + xl.                         (9)

We denote this filter as "filter B." This is a lowpass filter with a rather sharp frequency response characteristic (see Figure 4) for typical values of the parameters x, y, l, and r.

Filter B has two real zeros and three real poles:

n_{B1,2} = 1 − (1/(2F_s)) ((x + l + r) ± sqrt((x + l + r)^2 − 4xl)),   (10)

p_{B1} = 1 − (l + r)/F_s,   p_{B2} = 1 − x/F_s,   p_{B3} = 1 − y/F_s.   (11)
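A small numerical check of (9)–(11) can be made as follows (a sketch assuming NumPy, with the parameter values of (3)):

import numpy as np

Fs, x, y, l, r = 16000.0, 66.3, 10.0, 2580.0, 6580.0
# filter B coefficients, equation (9)
a0 = Fs**3
a1 = -3*Fs**3 + Fs**2*(x + y + l + r)
a2 = 3*Fs**3 - 2*Fs**2*(x + y + l + r) + Fs*((x + y)*(l + r) + x*y)
a3 = -Fs**3 + Fs**2*(x + y + l + r) - Fs*((x + y)*(l + r) + x*y) + x*y*(l + r)
b1 = Fs**2
b2 = -2*Fs**2 + Fs*(x + l + r)
b3 = Fs**2 - Fs*(x + l + r) + x*l
# the denominator roots coincide with the closed-form poles of equation (11)
print(np.sort(np.roots([a0, a1, a2, a3])))
print(np.sort([1 - (l + r)/Fs, 1 - x/Fs, 1 - y/Fs]))
# and the numerator roots with the zeros of equation (10)
print(np.sort(np.roots([b1, b2, b3])))
print(np.sort(1 - ((x + l + r) + np.array([1, -1]) * np.sqrt((x + l + r)**2 - 4*x*l)) / (2*Fs)))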

[Figure 5: Frequency characteristic of filter C (F_s = 16 kHz); magnitude (dB) and phase (degrees) versus frequency (Hz).]

The above conclusions imply that there must be a link between the functions c(t) and s(t) = M − q(t) in the form of a linear constant-coefficient differential equation, this time of the second order. Indeed, this is the case:

d^2 c(t)/dt^2 + (x + l + r) dc(t)/dt + xl c(t)
    = d^2 s(t)/dt^2 + (x + y) ds(t)/dt + xy s(t).          (12)

As in the case of (7), this equation is valid for s(t) = M − q(t) ≥ 0; if s(t) is less than zero, then y in (12) should be set to zero.

The digital filter which is equivalent to the system (12) is defined as follows:

c(n) = (d_0/c_0) s(n) + (d_1/c_0) s(n−1) + (d_2/c_0) s(n−2)
       − (c_1/c_0) c(n−1) − (c_2/c_0) c(n−2),
c_0 = F_s^2,
c_1 = −2F_s^2 + F_s (x + l + r),
c_2 = F_s^2 − F_s (x + l + r) + xl,
d_0 = F_s^2,
d_1 = −2F_s^2 + F_s (x + y),
d_2 = F_s^2 − F_s (x + y) + xy.                             (13)

We will further denote this filter as "filter C." It is a highpass filter with a rather sharp frequency response characteristic (see Figure 5) for typical values of its parameters.

We also note that a cascade connection of filters B and C should be equivalent to filter A. This is true and can be immediately proved by looking at (9), (13), and (5).

[Figure 6: Reservoir model equivalent structures (blocks: filter A, filter B, filter C; signals: k(t), k(t)q(t), s(t), c(t); constant M).]

5. EQUIVALENT DIGITAL STRUCTURES FOR THE RESERVOIR MODEL

The analysis of the Meddis reservoir model allows us to plot its equivalent structures for realisation in the digital form (see Figure 6). The realisation with the help of filter A is preferable since it is more computationally efficient.

Apart from the linear digital filters, the developed equivalent representations include an operation of multiplication of the signals in the time domain. It should be noted that, in general, multiplication of time-varying signals does not comply with the superposition principle; thus, the reservoir model equivalent structure performs a nonlinear signal transformation. The signal q(t) = M − s(t), which is multiplied by k(t), is confined to the interval [0, M] in accordance with the reservoir model definition. It consists mainly of the low-frequency components of the signal k(t)q(t), in accordance with the properties of filter B.

The operation of multiplication in the equivalent structure may be viewed as an automatic gain-control (AGC) operation. The gain q(t) is a parameter which slowly varies through time between M in the case of a weak input signal and zero in the case of a strong one.
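To make the signal-flow interpretation concrete, the following Python sketch (an illustration, not the authors' code; NumPy assumed) evaluates the upper structure of Figure 6: filter B runs inside the multiplicative feedback loop, and filter A produces c(t) from the product f(t) = k(t)q(t). With the forward-difference coefficients of (9), b_0 = 0, so there is no delay-free loop and the structure can be evaluated sample by sample:

import numpy as np

def reservoir_equivalent(k, Fs=16000.0, M=10.0, x=66.3, y=10.0, l=2580.0, r=6580.0):
    """Equivalent structure of Figure 6 (forward-difference filters A and B).
    Zero initial state is used for brevity; the strict model would start from
    the equilibrium (2) and switch to (8) whenever M - q(t) < 0."""
    # filter B coefficients of (9), normalised by a0
    a0 = Fs**3
    a = np.array([-3*Fs**3 + Fs**2*(x + y + l + r),
                  3*Fs**3 - 2*Fs**2*(x + y + l + r) + Fs*((x + y)*(l + r) + x*y),
                  -Fs**3 + Fs**2*(x + y + l + r) - Fs*((x + y)*(l + r) + x*y) + x*y*(l + r)]) / a0
    b = np.array([Fs**2, -2*Fs**2 + Fs*(x + l + r), Fs**2 - Fs*(x + l + r) + x*l]) / a0
    N = len(k)
    s = np.zeros(N); f = np.zeros(N); c = np.zeros(N)
    for n in range(3, N):
        # filter B (9): s(n) from past values of f and s
        s[n] = (b[0]*f[n-1] + b[1]*f[n-2] + b[2]*f[n-3]
                - a[0]*s[n-1] - a[1]*s[n-2] - a[2]*s[n-3])
        # AGC multiplication: gain q(n) = M - s(n), confined to [0, M]
        q = min(max(M - s[n], 0.0), M)
        f[n] = k[n] * q
        # filter A (5): c(n) from f(n-1) and c(n-1)
        c[n] = f[n-1] / Fs + (1.0 - (l + r) / Fs) * c[n-1]
    return c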

Our equivalent structure of the Meddis reservoir model has similarities with the one presented in the works of Perdigao [13, 14].

6. LINEAR APPROXIMATION OF THE SIGNAL MULTIPLICATION OPERATION IN THE EQUIVALENT STRUCTURE OF THE RESERVOIR MODEL

It is possible to build a linear digital filter which approximates the effect of the AGC mechanism for the case of small deviations of the system from the equilibrium state. The particular form of such a filter depends on the initial conditions, namely, on the steady-state input signal value k_0.

The method we are going to use is thoroughly investigated in [15]. Similar methods of differential equation linearisation (which lead to identical results) are widely known and used in the classical literature on theoretical mechanics.


Indeed, we assume that the system depicted in Figure 6, at a certain time instant, resides in equilibrium. For this case, we may write

f_0 = k_0 q_0,
q_0 = M − s_0,                                             (14)
y(l + r) s_0 = l f_0.

Any deviations from the steady state are assumed to be sufficiently small:

k(n) = k_0 + δk(n),
f(n) = f_0 + δf(n),
q(n) = q_0 + δq(n),                                        (15)
s(n) = s_0 + δs(n).

Thus, for such a system at an arbitrary time instant, we may write the following set of equations (see Figure 6):

f_0 + δf(n) = (k_0 + δk(n))(q_0 + δq(n)),
q_0 + δq(n) = M − (s_0 + δs(n)),
(a_0 + a_1 + a_2 + a_3) s_0 + a_0 δs(n) + a_1 δs(n−1) + a_2 δs(n−2) + a_3 δs(n−3)
    = (b_1 + b_2 + b_3) f_0 + b_1 δf(n−1) + b_2 δf(n−2) + b_3 δf(n−3).   (16)

The coefficients in the third equation of the set are those of filter B. Comparing the sets (15) and (16) (the constant terms cancel by virtue of (14), and the second-order term δk(n)δq(n) is neglected), we may conclude that the following set of equations holds for the deviations:

δf(n) = k_0 δq(n) + q_0 δk(n),
δq(n) = −δs(n),
a_0 δs(n) + a_1 δs(n−1) + a_2 δs(n−2) + a_3 δs(n−3)
    = b_1 δf(n−1) + b_2 δf(n−2) + b_3 δf(n−3).             (17)

The solution of the equation set (17) with respect to the variables δk(n) and δf(n) is represented as

q_0 = M y (l + r) / (y(l + r) + l k_0),
q_0 (a_0 δk(n) + a_1 δk(n−1) + a_2 δk(n−2) + a_3 δk(n−3))
    = a_0 δf(n) + (b_1 k_0 + a_1) δf(n−1) + (b_2 k_0 + a_2) δf(n−2) + (b_3 k_0 + a_3) δf(n−3).   (18)

[Figure 7: Frequency characteristic of filter D (F_s = 16 kHz, k_0 = 10); magnitude (dB) and phase (degrees) versus frequency (Hz).]

This equation represents the desired linear digital filter, which linearly approximates the AGC of the equivalent structure. This filter is capable of transforming the signal δk(t) = k(t) − k_0 into δf(t) = δ(k(t)q(t)) = f(t) − f_0 under the condition that these deviations are sufficiently small. The transfer function of this filter is expressed as

H_D(z, k_0) = [M y (l + r) / (y(l + r) + l k_0)]
    · (a_0 + a_1 z^{-1} + a_2 z^{-2} + a_3 z^{-3})
      / (a_0 + (b_1 k_0 + a_1) z^{-1} + (b_2 k_0 + a_2) z^{-2} + (b_3 k_0 + a_3) z^{-3}).   (19)

Note the explicit dependency of the form of this transfer function on the value of k_0. We will further denote this filter as "filter D."

The steady-state output f_0(k_0) of the system is derived from the equilibrium set (14) and is expressed as

f_0(k_0) = q_0 k_0 = M y (l + r) k_0 / (y(l + r) + l k_0).   (20)

Filter D is a highpass filter with a quite sharp frequency response characteristic (see Figure 7) for a typical value of k_0 = 10.

In order to illustrate the dependence of the properties of filter D upon the value of k_0, Figure 8 depicts the frequency characteristic of that filter with k_0 = 1000. As can be seen from the comparison of Figures 7 and 8, apart from the change of the gain, the cut-off frequency of the filter increases with k_0.
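The dependence on k_0 can also be checked numerically; the sketch below (SciPy assumed, not part of the original paper) builds filter D from (19) for two operating points and evaluates its magnitude at a few frequencies:

import numpy as np
from scipy.signal import freqz

Fs, M, x, y, l, r = 16000.0, 10.0, 66.3, 10.0, 2580.0, 6580.0
# filter B coefficients of (9)
a0 = Fs**3
a1 = -3*Fs**3 + Fs**2*(x + y + l + r)
a2 = 3*Fs**3 - 2*Fs**2*(x + y + l + r) + Fs*((x + y)*(l + r) + x*y)
a3 = -Fs**3 + Fs**2*(x + y + l + r) - Fs*((x + y)*(l + r) + x*y) + x*y*(l + r)
b1, b2, b3 = Fs**2, -2*Fs**2 + Fs*(x + l + r), Fs**2 - Fs*(x + l + r) + x*l

def filter_d(k0):
    # filter D, equation (19), for the operating point k0
    gain = M*y*(l + r) / (y*(l + r) + l*k0)
    return gain * np.array([a0, a1, a2, a3]), np.array([a0, b1*k0 + a1, b2*k0 + a2, b3*k0 + a3])

for k0 in (10.0, 1000.0):
    num, den = filter_d(k0)
    w, H = freqz(num, den, worN=[10.0, 100.0, 1000.0], fs=Fs)
    print(k0, np.round(20*np.log10(np.abs(H)), 1))   # cf. Figures 7 and 8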

From digital filter theory it is known that a linear digital filter is "bounded-input bounded-output" (BIBO) stable if all of its poles lie inside the unit circle in the z-plane. Filter D has three real poles. The analytical derivation of their values is rather complex in general. To perform such a derivation, one could take advantage of Cardano's formula for the roots of a cubic equation.


[Figure 8: Frequency characteristic of filter D (F_s = 16 kHz, k_0 = 1000); magnitude (dB) and phase (degrees) versus frequency (Hz).]

An alternative way is to estimate the positions of the filter poles. Indeed, for realistic values of k_0 ∼ 10^1–10^2, the poles of filter D lie in the vicinity of its zeros with quite good precision. The zeros of filter D coincide with the poles of filter B (11), and approximately we may put

p_{D1} ≈ n_{D1} = 1 − (l + r)/F_s,   p_{D2} ≈ n_{D2} = 1 − x/F_s,   p_{D3} ≈ n_{D3} = 1 − y/F_s.   (21)

It must also be noted that if k_0 → 0, then p_{DN} → n_{DN}. Pole p_{D1} is the first to leave the unit circle as the sampling frequency is decreased, since the realistic values of l + r are significantly larger than those of x and y. Consequently, the approximation of the position of the first pole gives us a condition of filter D stability as k_0 → 0:

F_s > (l + r)/2.                                           (22)

Pole p_{D1} moves to the right on the real axis as the value of k_0 is increased. This allows filter D to become stable with an increased k_0 even if it was unstable with smaller values of k_0. This leads us to the conclusion that (22) represents a sufficient condition for filter D to be stable for arbitrary realistic values of k_0.

In the work [8], it is required that the sampling frequency be sufficiently large for a successful digital implementation of the reservoir model. Our stability condition (22) puts a quantitative restriction on the sampling frequency for the linearised approximation (with the typical parameter values (3), it amounts to F_s > (2580 + 6580)/2 = 4580 Hz).

Under the same assumption of small deviations from the equilibrium state, it is possible to construct an equivalent linear filter which would serve as a linear approximation of the relation between the signals δk(t) and δc(t) = c(t) − c_0, that is, the input and output signals of the reservoir model measured relative to their corresponding equilibrium values.

Such a filter (further denoted as filter E) corresponds to the cascade of filters D and A. Its frequency response characteristic is presented in Figure 9. The filter E transfer function is defined as

H_E(z, k_0) = (1/F_s) · [M y (l + r) / (y(l + r) + l k_0)]
    · (1 − n_{D2} z^{-1})(1 − n_{D3} z^{-1}) z^{-1}
      / (1 + ((b_1 k_0 + a_1)/a_0) z^{-1} + ((b_2 k_0 + a_2)/a_0) z^{-2} + ((b_3 k_0 + a_3)/a_0) z^{-3}).   (23)

It should be noted that the pole of filter A and the first zero of filter D are equal; thus, they cancel and are removed from (23).

The response magnitude in the equilibrium state is derived from (2) and (20) and reads

c_0(k_0) = f_0(k_0)/(l + r) = M y k_0 / (y(l + r) + l k_0).   (24)

Since the poles of filter E coincide with those of filter D, its stability condition is identical to that of filter D.

7. PRACTICAL OUTCOME OF THE PRESENTED RESERVOIR MODEL ANALYSIS

As can be seen from Figure 6, the reservoir model is equivalent to a kind of signal-dependent gain-control mechanism. The presented equivalent structure may be perceived as an interpretation of the IHC adaptation mechanism from the algorithmic signal processing point of view. In the equivalent structure, filters A, B, and C are all linear time-invariant structures; the only nonlinear element is the multiplication of the signals. Implementation of the equivalent structure via a combination of filters A and B seems preferable among the alternatives presented in Figure 6, since it requires less computational effort.

[Figure 9: Frequency characteristic of filter E (F_s = 16 kHz, k_0 = 50); magnitude (dB) and phase (degrees) versus frequency (Hz).]

A brief look at the poles of filter A (6) and filter B (11) indicates that their stability conditions are identical to that of filter D (22). This fact is a direct result of the employment of the forward difference approximation of the differential problem in the filter synthesis. All known digital implementations of the IHC reservoir model [16, 17, 18] share this method of differential approximation. However, this limitation seems impractical from the technological point of view, since it prevents implementation of the described equivalent structure, as well as of the other implementations mentioned above, for signals with sampling frequencies below ∼ 4.6 kHz using the realistic values of the model parameters. Indeed, such a limitation of the lowest possible sampling frequency makes an efficient combination of the model with multirate cochlear filter banks impossible.

Fortunately, there exist other methods of approximation of the differential problem in the digital domain, for example, the bilinear transformation. In accordance with its properties, any stable analog linear time-invariant filter, described by the corresponding differential equation, is converted into a stable digital filter. With the help of the bilinear transformation, it is possible to construct universally stable digital filters A and B regardless of the sampling frequency. This procedure, as well as its combination with a computationally efficient implementation of the multirate cochlear filter bank, is described in detail in [19].
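As a hedged illustration (SciPy is assumed; the paper itself gives only the closed-form result), filter A of (28) can be obtained directly by applying the bilinear transformation to the analog prototype 1/(s + l + r) that follows from (4):

import numpy as np
from scipy.signal import bilinear

Fs, l, r = 16000.0, 2580.0, 6580.0
bz, az = bilinear([1.0], [1.0, l + r], fs=Fs)     # analog prototype H(s) = 1/(s + l + r)
# closed form of equation (28)
b_ref = np.array([1.0, 1.0]) / (l + r + 2*Fs)
a_ref = np.array([1.0, (l + r - 2*Fs) / (l + r + 2*Fs)])
print(np.allclose(bz / az[0], b_ref), np.allclose(az / az[0], a_ref))   # both should be True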

However, in the case of the bilinear transformation, unlike the situation with the difference approximation, the coefficient b_0 of filter B is not equal to zero:

H_B(z) = (b_0 + b_1 z^{-1} + b_2 z^{-2} + b_3 z^{-3}) / (a_0 + a_1 z^{-1} + a_2 z^{-2} + a_3 z^{-3}),
a_0 = 8F_s^3 + 4F_s^2 (x + y + l + r) + 2F_s ((x + y)(l + r) + xy) + xy(l + r),
a_1 = −24F_s^3 − 4F_s^2 (x + y + l + r) + 2F_s ((x + y)(l + r) + xy) + 3xy(l + r),
a_2 = 24F_s^3 − 4F_s^2 (x + y + l + r) − 2F_s ((x + y)(l + r) + xy) + 3xy(l + r),
a_3 = −8F_s^3 + 4F_s^2 (x + y + l + r) − 2F_s ((x + y)(l + r) + xy) + xy(l + r),
b_0 = 4F_s^2 + 2F_s (x + l + r) + xl,
b_1 = −4F_s^2 + 2F_s (x + l + r) + 3xl,
b_2 = −4F_s^2 − 2F_s (x + l + r) + 3xl,
b_3 = 4F_s^2 − 2F_s (x + l + r) + xl.                       (25)

[Figure 10: Transposed direct form II realization of filter B with the signal feedback (delay elements z^{-1}, coefficients b_i/a_0 and a_i/a_0, input k(n), constant M, and the factor 1 + (b_0/a_0) k(n)).]

This fact leads to additional operations in the implementation of the signal flow of Figure 6. Indeed, writing a set of equations describing the signal flow over the feedback loop of Figure 6 results in the following relations:

f(n) = k(n) q(n) = k(n)(M − s(n)),

Σ_{i=0}^{3} b_i f(n − i) = Σ_{i=0}^{3} a_i s(n − i).        (26)

It is evident that a simple substitution of the second equation into the first does not lead to an expression of the output signal f(n) through the current value of the input signal k(n) and previous values of the signals f and s. The current value of the output is present on both sides of the equation. Separation of the variables leads to the following expression for the output signal:

f(n) = [k(n) / (1 + (b_0/a_0) k(n))]
       · [M − Σ_{i=1}^{3} (b_i/a_0) f(n − i) + Σ_{i=1}^{3} (a_i/a_0) s(n − i)].   (27)

It appears that the most computationally effective way to implement filter B with its signal feedback is a transposed direct form II structure (Figure 10). This realisation minimises the number of delay units.
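The following Python sketch (an illustration, not the authors' implementation) evaluates this feedback loop with the bilinear-transform coefficients of (25); it uses a plain direct-form recursion for clarity rather than the transposed direct form II of Figure 10, and it omits the equilibrium initialisation and the clipping of q(n) = M − s(n) to [0, M]:

import numpy as np

def reservoir_bilinear(k, Fs=16000.0, M=10.0, x=66.3, y=10.0, l=2580.0, r=6580.0):
    """Feedback loop of Figure 6 with the bilinear-transform filter B of (25).
    Since b0 != 0, f(n) appears on both sides of (26); equation (27) resolves
    this delay-free loop before s(n) is updated."""
    a = np.array([ 8*Fs**3 + 4*Fs**2*(x + y + l + r) + 2*Fs*((x + y)*(l + r) + x*y) + x*y*(l + r),
                  -24*Fs**3 - 4*Fs**2*(x + y + l + r) + 2*Fs*((x + y)*(l + r) + x*y) + 3*x*y*(l + r),
                   24*Fs**3 - 4*Fs**2*(x + y + l + r) - 2*Fs*((x + y)*(l + r) + x*y) + 3*x*y*(l + r),
                  -8*Fs**3 + 4*Fs**2*(x + y + l + r) - 2*Fs*((x + y)*(l + r) + x*y) + x*y*(l + r)])
    b = np.array([ 4*Fs**2 + 2*Fs*(x + l + r) + x*l,
                  -4*Fs**2 + 2*Fs*(x + l + r) + 3*x*l,
                  -4*Fs**2 - 2*Fs*(x + l + r) + 3*x*l,
                   4*Fs**2 - 2*Fs*(x + l + r) + x*l])
    b, a = b / a[0], a / a[0]                     # normalise so that a0 = 1
    N = len(k)
    s = np.zeros(N); f = np.zeros(N)
    for n in range(3, N):
        # history part of (26) with the current sample excluded
        past = (b[1]*f[n-1] + b[2]*f[n-2] + b[3]*f[n-3]
                - a[1]*s[n-1] - a[2]*s[n-2] - a[3]*s[n-3])
        f[n] = k[n] / (1.0 + b[0]*k[n]) * (M - past)   # equation (27)
        s[n] = b[0]*f[n] + past                        # completes (26) for s(n)
    return f, s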

For the sake of completeness, the following formula presents the version of the digital filter A which is obtained with the help of the bilinear transformation:

H_A(z) = (1/(l + r + 2F_s)) · (1 + z^{-1}) / (1 + ((l + r − 2F_s)/(l + r + 2F_s)) z^{-1}).   (28)

The formulae (25), (27), and (28), as well as Figures 6 and 10, contain exact instructions for an implementation of the reservoir IHC model which remains stable at any sampling frequency. As noted above, this property saves computational load and is desirable for an efficient incorporation of the model into a multirate cochlear filter bank.


The linear approximation (23) of the reservoir model might be viewed as a computationally effective way to implement the model when the input signal does not significantly deviate from a certain fixed stationary value. It might also serve as a linear time-variant filter which simulates the reservoir model when the slowly varying stationary value of the signal, k_0, is known in advance or is estimated through a long-term moving-average procedure.
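As a further, purely illustrative sketch of the second use (the one-pole tracker and its time constant tau are assumptions, not taken from the paper), the operating point k_0 can be tracked by a long-term average and the filter D coefficients of (19) updated accordingly:

import numpy as np

def linearised_reservoir(k, Fs=16000.0, M=10.0, x=66.3, y=10.0, l=2580.0, r=6580.0, tau=0.2):
    """Time-variant linearised approximation: filter D of (19) with a slowly
    tracked operating point k0 (one-pole average with time constant tau)."""
    # fixed filter B coefficients of (9)
    a = np.array([Fs**3,
                  -3*Fs**3 + Fs**2*(x + y + l + r),
                  3*Fs**3 - 2*Fs**2*(x + y + l + r) + Fs*((x + y)*(l + r) + x*y),
                  -Fs**3 + Fs**2*(x + y + l + r) - Fs*((x + y)*(l + r) + x*y) + x*y*(l + r)])
    b = np.array([0.0, Fs**2, -2*Fs**2 + Fs*(x + l + r), Fs**2 - Fs*(x + l + r) + x*l])
    alpha = np.exp(-1.0 / (tau * Fs))
    N = len(k)
    dk = np.zeros(N); df = np.zeros(N); f = np.zeros(N)
    k0 = k[0]
    for n in range(3, N):
        k0 = alpha * k0 + (1.0 - alpha) * k[n]          # long-term average of k(n)
        q0 = M*y*(l + r) / (y*(l + r) + l*k0)           # equation (18)
        den = a + k0 * b                                # denominator coefficients of (19)
        dk[n] = k[n] - k0
        df[n] = (q0*(a[0]*dk[n] + a[1]*dk[n-1] + a[2]*dk[n-2] + a[3]*dk[n-3])
                 - den[1]*df[n-1] - den[2]*df[n-2] - den[3]*df[n-3]) / a[0]
        f[n] = q0 * k0 + df[n]                          # f0(k0) of (20) plus the deviation
    return f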

This linear approximation is also important because of its link to the RASTA filtering technique [20, 21], a well-established channel normalisation and speech augmentation means in ASR. Although the nature of this link needs further investigation, both techniques represent low-passband filters, running in separate frequency channels, which are converted with the help of a nonlinearity. In the case of RASTA, each frequency channel is decimated to represent one frequency bin of the short-time Fourier transform spectrogram and is converted into the modulation-frequency domain by the Jah-log transformation [16]. In the case of the reservoir model, there is no explicit decimation, and the passband signal is transformed by the "BM vibration—membrane permeability" transformation [6], which somewhat resembles the Jah-log transform.

8. EXPERIMENTS

Several experiments were run in order to validate the original assumption that anthropomorphic auditory modelling in general, and the IHC adaptation model in particular, may indeed augment the performance of ASR systems. The comparison involved three experimental setups, which are described in more detail in [22].

(i) BASELINE: an ASR feature extraction (FE) algorithm, which is based on linear time-invariant perceptually aligned filters.

(ii) A-MORPHIC: an anthropomorphic feature extraction algorithm [22], which combined linear time-variant cochlear filters (to model auditory suppression) with the above-described IHC reservoir model implementation. However, the results mainly reflect the effect of the IHC reservoir model, since the speech recordings in the experiment had approximately the same loudness level (∼ 40 dB SPL).

(iii) RASTA: the conventional RASTA-algorithm-based feature extraction [16].

In order to be effective, ASR FE algorithms should convey as much information about the speech source as possible. The measure of the amount of conveyed information, that is, the mutual information between a speech source S (which at any instant of time resides in one of the possible states Ci, i = 1, 2, ..., N) and a measured feature vector component X, is defined as follows:

I(S, X) = H(S) − H(S | X)
        = −Σ_{C_i ∈ c} P(C_i) log_2 P(C_i) + Σ_{C_i ∈ c} ∫_{G(X)} P(C_i, X) log_2 P(C_i | X) dX.   (29)

[Figure 11: Mutual information of feature components (ΔX = 0.01). Mutual information (bits) per feature component (c0–c8, dc1–dc7, ddc0–ddc8) for the BASELINE, A-MORPHIC, and RASTA feature sets.]

The estimation of the mutual information has been performed with the help of the following procedure [22]:

I_{ΔX}(S, X) ≈ log_2 N + (1/N) [ Σ_i Σ_j N(C_i, ΔX_j) log_2 N(C_i, ΔX_j)
    − Σ_i N(C_i) log_2 N(C_i) − Σ_j N(ΔX_j) log_2 N(ΔX_j) ].   (30)

Here N denotes the total number of feature frames in the measurement; N(ΔX_j) is the number of frames for which the feature value falls into the interval [min(X) + (j − 1)ΔX, min(X) + jΔX]; N(C_i) is the number of frames generated in the state C_i; and N(C_i, ΔX_j) is the number of frames belonging to a certain feature interval that were generated by the source in the state C_i.
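For concreteness, a minimal Python implementation of this histogram-based estimator (a sketch assuming NumPy; not the authors' code) could look as follows:

from collections import Counter
import numpy as np

def mutual_information(states, values, delta_x=0.01):
    """Histogram estimate (30) of the mutual information between a discrete
    state sequence (e.g., phone labels) and one feature-vector component."""
    values = np.asarray(values, dtype=float)
    N = len(values)
    bins = np.floor((values - values.min()) / delta_x).astype(int)
    n_joint = Counter(zip(states, bins))   # N(Ci, dXj)
    n_state = Counter(states)              # N(Ci)
    n_bin = Counter(bins.tolist())         # N(dXj)
    plogp = lambda cnt: sum(n * np.log2(n) for n in cnt.values())
    return np.log2(N) + (plogp(n_joint) - plogp(n_state) - plogp(n_bin)) / N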

The phonetically labelled TIMIT speech corpus was used in this experiment. Probability distributions were approximated with histograms with a step size of ΔX = 0.01. The results, presented in Figure 11, show that the A-MORPHIC features are generally the most informative.

Another experiment was performed to estimate the degree of invariance of the feature vectors to different kinds of adverse interference. To provide estimates of the feature invariance degree, a simple Euclidean distance between feature vectors was used. The exact experiment description may be found in [22]. The results of the experiment, presented in Table 1, reflect the mean distance of the feature vectors in adverse conditions to those perceived in a "clean" environment. As can be seen from the table, the A-MORPHIC features are less invariant to the adverse interference than RASTA. Nevertheless, the distance between "clean" and severely noisy (SNR 0 dB) features in the case of A-MORPHIC FE matches that between "clean" and mildly noisy (SNR 30 dB) features in the BASELINE case.

Table 1: Expected mean distance between the feature vectors in adverse conditions and the clean environment.

Feature extraction algorithm   Noise 30 dB   Noise 10 dB   Noise 0 dB   Convol. channel
BASELINE                       0.41597       0.78894       1.05047      0.49298
RASTA FE                       0.09842       0.17563       0.22338      0.05300
A-MORPHIC                      0.26853       0.44951       0.42615      0.16665

The results of the depicted experiments are also supported by the comparison of speech recogniser performances reported in [22] (refer to [23] for a description of the recogniser). Its main result is that in adverse environments the recogniser with A-MORPHIC FE performs at least as well as the one with RASTA FE. These facts support the conjecture of the present paper that the application of anthropomorphic algorithms in technical devices, namely ASR engines, is fruitful.

9. CONCLUSIONS

The analysis of the physiological model of the chemical IHC-AN synapse creates an opportunity to implement it in the form of an anthropomorphic algorithm, which is computationally efficient and thus may be used in technical devices. The equivalent digital and linearised equivalent representations create alternatives to the traditional direct difference approximation of the original set of differential equations. These representations allow for multiple "accuracy versus computational load" tradeoffs at the implementation stage. Within the described framework, it is possible to create implementations which remain stable regardless of the signal sampling frequency.

It was found that the effect of the IHC adaptation model is equivalent to the action of a signal-dependent automatic gain-control mechanism. It is also conjectured that the effect of the linearised equivalent representation resembles that of RASTA, an algorithm engineered with the aim of alleviating the influence of additive and convolutive noises. This interpretation of the IHC-AN synapse model gives us reasons to believe that it is important as a means of increasing ASR robustness to real-world environments (e.g., "too slow" and "too fast" varying additive and convolutive noises) and also as a means of enhancement of the useful signal in speech coding applications. The presented and referenced experiments confirm the viability of applying the discussed anthropomorphic algorithm to the ASR field. However, the exact form of the relation between the IHC-AN synapse model and RASTA should be investigated further.

ACKNOWLEDGMENTS

The authors would like to thank G. Kubin and the anonymous reviewers for the valuable insights they provided as this article was developed. This work is supported in part by the Bialystok Technical University under the Grant no. W/WI/02/03.

REFERENCES

[1] E. Zwicker and H. Fastl, Psychoacoustics, Facts and Models, Springer, Berlin, Germany, 1990.

[2] H. Hermansky, "Should recognizers have ears?" Speech Communication, vol. 25, no. 1, pp. 3–27, 1998.

[3] D. C. Geisler, From Sound to Synapse: Physiology of the Mammalian Ear, Oxford University Press, New York, NY, USA, 1998.

[4] M. R. Schroeder and J. L. Hall, "Model for mechanical to neural transduction in the auditory receptor," Journal of the Acoustical Society of America, vol. 55, no. 5, pp. 1055–1060, 1974.

[5] Y. Oono and Y. Sujaku, "A model for automatic gain control observed in the firings of primary auditory neurons," Abstracts of IECE Transactions, vol. 58, no. 6, pp. 61–62, 1975.

[6] R. Meddis, "Simulation of mechanical to neural transduction in the auditory receptor," Journal of the Acoustical Society of America, vol. 79, no. 3, pp. 702–711, 1986.

[7] R. Meddis, "Simulation of auditory-neural transduction: Further studies," Journal of the Acoustical Society of America, vol. 83, no. 3, pp. 1056–1063, 1988.

[8] R. Meddis, M. J. Hewitt, and T. M. Shackleton, "Implementation details of a computation model of the inner hair-cell auditory-nerve synapse," Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1813–1816, 1990.

[9] E. A. Lopez-Poveda, L. P. O'Mard, and R. Meddis, "A revised computational inner hair cell model," in Proc. 11th International Symposium on Hearing, pp. 112–121, Grantham, UK, August 1997.

[10] C. J. Sumner, E. A. Lopez-Poveda, L. P. O'Mard, and R. Meddis, "A revised model of the inner-hair cell and auditory-nerve complex," Journal of the Acoustical Society of America, vol. 111, no. 5, pp. 2178–2188, 2002.

[11] C. J. Sumner, E. A. Lopez-Poveda, L. P. O'Mard, and R. Meddis, "Adaptation in a revised inner-hair cell model," Journal of the Acoustical Society of America, vol. 113, no. 2, pp. 893–901, 2003.

[12] A. Ivanov and A. Petrovsky, "Auditory models for robust feature extraction: suppression," in Proc. IEEE Signal Processing Workshop, pp. 23–28, Poznan, Poland, October 2003.

[13] F. S. Perdigao and L. V. Sa, "Properties of auditory model representations," in Proc. European Conference on Speech Communication and Technology (EUROSPEECH '97), vol. 5, pp. 2499–2502, Rhodes, Greece, September 1997.

[14] F. S. Perdigao and L. V. Sa, "Auditory models as front-ends for speech recognition," in Proc. NATO Advanced Study Institute on Computational Hearing, Il Ciocco, Italy, July 1988.

[15] D. N. Morgan, "On discrete-time AGC amplifiers," IEEE Trans. Circuits Syst., vol. 22, no. 2, pp. 135–146, 1975.

[16] A. Harma and K. Palomaki, "HUTear—a free Matlab toolbox for modeling of human auditory system," in Proc. 1999 MATLAB DSP Conference, pp. 96–99, Espoo, Finland, November 1999.

[17] M. Slaney, "Auditory toolbox, version 2," Tech. Rep. 1998-10, Interval Research Corporation, Palo Alto, Calif, USA, 1998.

[18] R. D. Patterson and M. H. Allerhand, "Time-domain modelling of peripheral auditory processing: a modular architecture and a software platform," Journal of the Acoustical Society of America, vol. 98, no. 4, pp. 1890–1894, 1995.

[19] A. V. Ivanov and A. A. Petrovsky, "A composite physiological model of the inner ear for audio coding," in Proc. 116th AES Convention, Berlin, Germany, May 2004, preprint 6082.

[20] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.


[21] J. Baszun and A. Petrovsky, "Enhancement of speech as a preprocessing for hearing prosthesis by time-varying tunable modulation filters," in Proc. 17th International Congress on Acoustics, Rome, Italy, September 2001.

[22] A. V. Ivanov and A. A. Petrovsky, "Anthropomorphic feature extraction algorithm for speech recognition in adverse environments," in Proc. 9th International Conference "Speech and Computer" (SPECOM '04), St. Petersburg, Russia, September 2004.

[23] A. V. Ivanov and A. A. Petrovsky, "Speech recognition based on hybrid neural network/hidden Markov model approach," Neurocomputers Design and Applications, no. 12, pp. 27–36, 2002.

Alexei V. Ivanov received the M.S. degree in applied mathematics and physics from the Moscow Institute of Physics and Technology, Moscow, Russia, in 1995. He received the Ph.D. degree from the Computer Engineering Department, the Belarusian State University of Informatics and Radioelectronics, in 2004. His Ph.D. thesis is entitled "Feature space construction based on anthropomorphic information processing for speech recognisers in adverse environments." In 2000, he joined Lernout & Hauspie Speech Products NV Research Laboratory, Wemmel, Belgium, as a Research Engineer. Currently he is with the Computer Engineering Department, the Belarusian State University of Informatics and Radioelectronics, working as a Researcher in the field of automated speech recognition. His research interests include application of the detailed hearing models to artificial speech processing systems and, in particular, construction of the anthropomorphic feature extraction algorithms for speech recognition, with the aim to increase its robustness towards adverse interference. Dr. Ivanov is a Member of the Institute of Electrical and Electronics Engineers (IEEE); the IEEE Signal Processing & Information Theory Societies; the Association for Computing Machinery (ACM); the International Speech Communication Association (ISCA); the Acoustic Engineering Society (AES); and the Acoustical Society of America (ASA).

Alexander A. Petrovsky received the Dipl.-Ing. degree in computer engineering in 1975 and the Ph.D. degree in 1980, both from the Minsk Radio-Engineering Institute, Belarus. In 1989, he received the Doctor of Science degree from The Institute of Simulation Problems in Power Engineering, Academy of Science, Kiev, Ukraine. In 1975, he joined the Minsk Radio-Engineering Institute. He became a Research Worker and Assistant Professor and since 1980 he has been an Associate Professor at the Computer Science Department. From 1983 to 1984, he was a Research Worker at the Royal Holloway College and the Imperial College of Science and Technology, University of London, UK. Since May 1990, he has been a Professor and Head of the Computer Engineering Department, the Belarusian State University of Informatics and Radioelectronics, and he is with the Real-Time Systems Department, Faculty of Computer Science, Bialystok Technical University, Poland. Recently his main research interests are in acoustic signal processing, such as speech and audio coding, noise reduction and acoustic echo cancellation, robust speech recognition, and real-time signal processing.

A. A. Petrovsky is a Member of the Russian A. S. Popov Society for Radioengineering, Electronics and Communications, and an Editorial Staff Member of the Russian journal Digital Signal Processing, AES, IEEE, and IIAV.


EURASIP Journal on Applied Signal Processing 2005:9, 1334–1349
© 2005 Hindawi Publishing Corporation

Anthropomorphic Coding of Speech and Audio: A Model Inversion Approach

Christian Feldbauer
Signal Processing and Speech Communication Laboratory, Graz University of Technology, 8010 Graz, Austria
Email: [email protected]

Gernot Kubin
Signal Processing and Speech Communication Laboratory, Graz University of Technology, 8010 Graz, Austria
Email: [email protected]

W. Bastiaan Kleijn
Department for Signals, Sensors and Systems, KTH (Royal Institute of Technology), 10044 Stockholm, Sweden
Email: [email protected]

Received 14 November 2003; Revised 25 August 2004

Auditory modeling is a well-established methodology that provides insight into human perception and that facilitates the extraction of signal features that are most relevant to the listener. The aim of this paper is to provide a tutorial on perceptual speech and audio coding using an invertible auditory model. In this approach, the audio signal is converted into an auditory representation using an invertible auditory model. The auditory representation is quantized and coded. Upon decoding, it is then transformed back into the acoustic domain. This transformation converts a complex distortion criterion into a simple one, thus facilitating quantization with low complexity. We briefly review past work on auditory models and describe in more detail the components of our invertible model and its inversion procedure, that is, the method to reconstruct the signal from the output of the auditory model. We summarize attempts to use the auditory representation for low-bit-rate coding. Our approach also allows the exploitation of the inherent redundancy of the human auditory system for the purpose of multiple description (joint source-channel) coding.

Keywords and phrases: speech and audio coding, auditory representation, auditory model inversion, auditory synthesis, perceptual domain coding, multiple description coding.

1. INTRODUCTION

1.1. Motivation

The encoding of an analog signal at a finite rate requiresquantization and introduces distortion. Models of the hu-man auditory system can be exploited to minimize, for agiven rate (specified either as an average or as a fixed rate),the audible distortion (as quantified by the model) intro-duced by the encoding [1, 2, 3]. Signal features will thenbe specified with a precision that reflects audible distor-tion. However, the introduction of knowledge of the audi-tory system into coding has been handicapped by delay andcomputational constraints. For instance, temporal maskingand the adaptation of the hearing system to a stimulus arehighly nonlinear effects [4, 5]. A time-localized quantiza-tion error in the perceived signal can result in a significantchange in the auditory nerve firings over a response time in-terval that can last on the order of hundreds of milliseconds.

Therefore, the effect of time-localized quantization errorsthat are hundreds of milliseconds apart cannot be separatedinto additive terms. As a result, it is difficult to include suchdependencies of quantization errors during the quantizationprocess.

The simple distortion criteria used in practical systems result from a desire to perform efficient quantization at reasonable computational complexity. Such efficient, low-complexity quantization is facilitated by three conditions: (i) the (vector) variable is of low dimension, (ii) the distortion criterion is a single-letter one (i.e., the distortion measure is a sum over many sample distortions), and (iii) the variables are independent. This is particularly well illustrated by the discrete-cosine-transform (DCT) based lapped transforms commonly used in audio coding [6]. These transforms allow a spectrally weighted mean-square error distortion measure to be approximated as a single-letter criterion. For wide-sense stationary signals, the results of the DCT are asymptotically equivalent to the results of the Karhunen-Loeve transform, thus performing an approximate decorrelation of the data. Finally, scalar quantization is used to have low complexity.

Our objective is to use sophisticated auditory-model-based distortion criteria without the significant approxima-tions commonly used (such as simple error-weighting fil-ters in linear-prediction-based speech coders or the exclusiveconsideration of frequency-domain masking in many audiocoders).

Most quantitative models of the human auditory percep-tion provide an auditory representation of the acoustic sig-nal as output. However, the models generally do not includea quantitative measure of the perceptual distance of two real-izations of the auditory representation. In [7], a correlationmeasure of the internal representations was proposed as anobjective distortion measure. Such a measure is closely re-lated to a single-letter weighted squared-error measure. Wewill assume that a single-letter distortion criterion on the au-ditory representation can provide a high-quality distortionmeasure.

The usage of sophisticated distortion criteria withinthe existing coding architectures leads to so-called delayed-decision coding. Delayed-decision coding methods havebeen used in the context of a squared-error criterion andlinear-prediction-based waveform coding (e.g., [8]). In thedelayed-decision approach, the quantization of a signal blockis decided only after consideration of the quantization of acertain number of future blocks. Even when using pruningprocedures that eliminate the consideration of unlikely con-figurations, this method becomes computationally very ex-pensive for distortion measures that have the long time re-sponses associated with hearing models [9]. This motivatesthe consideration of less conventional coding architectures.

The coding approach we presented in [10], which is thebasis throughout this paper, avoids the high computationalcomplexity of the delayed-decision approach by exploitingthe single-letter nature of the criterion in the auditory rep-resentation. The signal is transformed to the auditory do-main and coded in that domain. The decoding is followedby a transform back towards the acoustic domain. The trans-form from the acoustic to the auditory domain can be many-to-one, making the inverse transform in general nonunique.This auditory-domain approach towards coding allows theusage of a single-letter distortion criterion and yet accountsfor the dependency of perceived distortion on errors in thesignal that are far apart in time.

It is important to note that virtually all state-of-the-artspeech and audio coding methods operate on a block-by-block basis (e.g., [1, 2, 6, 8]). For subband/transform codingfor example, decimated filterbanks or lapped transforms areused, which introduce block boundaries at regular time po-sitions (often even independent of the actual audio signal).Such a signal representation allows only a suboptimal quan-tization (in the sense of rate versus distortion) since a signal isgenerally not stationary within a block and audible artefactssuch as pre-echoes or musical noise can occur [1, 3].

In our coding approach, we use a block-free signal rep-resentation and utilize a signal-adaptive decimation (i.e.,subsampling) method, thus bypassing the suboptimality ofblock-based and constantly decimated processing. Further-more, since our approach combines the signal representationused for the quantization with the perceptual measure, weno longer need two separate signal paths with different sig-nal representations as common in many existing coders (e.g.,the MPEG audio coders in [1]).

Finally, we note that the parameters making up the au-ditory representation generally are not independent. Thatis, coding of the auditory representation removes compu-tational complexity associated with the distortion criterion,but it does not eliminate the need for signal modeling orother additional considerations to reduce the amount ofdata. In Section 4.1 and beyond, we will discuss methods thatdeal with this redundancy in an efficient manner.

In the next subsection, we review our auditory model,which can be inverted very efficiently to allow auditory resyn-thesis at high quality so that it can be used for robust codingof speech and audio signals.

1.2. An invertible auditory model

In [10] a speech coding paradigm was introduced in whichthe coding is performed in a perceptual domain where a sim-ple distortion criterion (e.g., a single-letter squared error)should form an accurate and meaningful measure for theperceived distortion. In other words, the speech or audio sig-nal is transformed into an auditory representation by passingit through an auditory model. This auditory representation isquantized and coded and the signal can be reconstructed inthe decoder by an inverse auditory model.

This approach is new and different from the one used inclassical perceptual audio or speech coders where an audi-tory model is used only in the analysis stage in parallel withthe main signal path to control the quantization and bit allo-cation [1].

The proposed paradigm requires a model of the human auditory system that satisfies the following requirements:

(1) it provides an accurate quantitative description of perception;

(2) it leads to an auditory signal representation with relatively few parameters (to have a good basis for data compression);

(3) it can be inverted with a relatively low computational effort.

An invertible auditory model that satisfies these requirements was proposed in [10]. It is depicted in Figure 1.

Figure 1: Invertible auditory model. (Block diagram: the input signal passes through a gammatone analysis filterbank (basilar membrane, BM), half-wave rectification and power-law compression (inner hair cells, IHCs), and peak picking (neuron ensembles); after quantization and coding, transmission/storage, and decoding, amplitude correction, power-law expansion, and a synthesis filterbank yield the reconstructed signal.)

In this model, the first stage is a nondecimated analysis filterbank that simulates the motion of the basilar membrane caused by acoustic stimulation. It is well known that stimuli with different frequencies produce responses with maxima at different locations along the basilar membrane. For this purpose, a functional model consists of a bank of bandpass filters with different center frequencies. Note that in a human cochlea, about 2 500 inner hair cells [11] are located along the basilar membrane and, therefore, this is the actual number of bandpass channels. One reason for this high redundancy is to be robust against damages such as loss of hair cells. But this also means that neighboring auditory filters would look rather similar and, hence, for modeling purposes or coding applications, it is not necessary (and hardly possible) to implement such a high number of cochlea channels. For the invertible model in [10], the well-known gammatone filterbank with 20 channels for 8 kHz-sampled speech is used.

In each auditory channel, the analysis filterbank is fol-lowed by a model of an inner hair cell. The task of the innerhair cells is to convert the displacement of the basilar mem-brane in electrical receptor potentials. These receptor poten-tials cause a release of neurotransmitters and excite the pe-ripheral terminals of cochlear-afferent neurons [12, 13]. Inour model, this transduction process is reproduced in a verysimplified way using static nonlinearities only, namely, a half-wave rectifier and a compressive nonlinearity.

The final stage in our invertible model mimics the be-havior of an ensemble of cochlear-afferent neurons in eachauditory channel. According to the excitation by neurotrans-mitters, these neurons produce action potentials (“firingpulses”) caused by depolarization of an auditory nerve fiber.We model this generation of pulses using a peak-picking pro-cedure. The set of firing-pulse trains obtained from all au-ditory channels is referred to as the auditory representationwhich is a perceptual time-frequency representation of theoriginal speech or audio signal.

In the next section, we will describe the components of our auditory model in more detail. We cover the basilar membrane, inner hair cells, and first neural stages, that is, we model the cochlea and the auditory nerve in the human inner ear but skip the outer and the middle ear. We deal with filterbanks whose characteristics are matched to the acoustical and mechanical behavior of the cochlea and basilar membrane. One of these characteristics is that the spectral resolution decreases with increasing frequency. Therefore, warped frequency scales have been introduced long ago where selectivity bandwidths remain approximately constant along the frequency axis (auditory scales), for example, the Bark (critical-band rate) scale [14] or the ERB (equivalent rectangular bandwidth) rate scale [15]. We give a survey of auditory scales and auditory filters. The emphasis is placed on invertibility so as to allow reconstruction of the input signal. This enables the filterbank pair—analysis and synthesis filterbank—to be used for auditory subband coders or to be used in an invertible auditory model. Furthermore, we will consider important aspects for the implementation of the auditory filterbank, which is the most complex component in our model.

In Section 3 we describe the computationally efficient,nonrecursive inversion procedure of our auditory modelwhich allows to reconstruct the input signal at a high qualityfrom the auditory representation. We investigate our analy-sis/synthesis system using frame theory, which provides uswith a bound for the reconstruction error.

Section 4.1 deals with the compression and quantizationof the auditory representation obtained by our model andsummarizes the first approaches towards low-bit-rate cod-ing.

Since the auditory representation is highly overcompleteand does not rely on a hierarchical signal decomposition, itcan be used directly for multiple description coding. We re-view the incorporation of our auditory model into this jointsource-channel coding strategy in Section 4.2.

2. AUDITORY ANALYSIS

We selected the components of the proposed auditory modelbased on existing knowledge of the human auditory system.In this section, we provide additional detail for the motiva-tion of our choices.

2.1. Basilar membrane filterbank

The filterbank to simulate the behavior of the basilar mem-brane is the most complex component in our model. Afterproviding an overview of auditory filters, we consider differ-ent aspects for the implementation of an auditory filterbank.

2.1.1. Brief overview

The frequency selectivity of the human auditory system hasbeen studied by means of psychoacoustic experiments andmeasurements in the cochlea and on the auditory nerve overmany decades. The results of these experiments have led tothe concept of auditory filters. For a historical overview, werefer to [16].

Figure 2: Comparison of the frequency-position mapping [17], the ERB rate [15], the Bark scale [50], and the frequency warping (see Appendix A.2) with λ = 0.5 for a sampling rate of 8 kHz. (Plot of normalized warped frequency versus frequency in Hz.)

Once the bandwidths of these filters are found and expressed as a function of the center frequency, an auditory scale can be defined by integrating the reciprocal of the bandwidth function (the bandwidth function can be seen as the first derivative of the frequency with respect to the unit of the bandwidth). For instance, the equivalent rectangular bandwidth ERB(f_c) as a function of the filter's center frequency f_c in Hz is [15]

\mathrm{ERB}(f_c) = 0.1079 f_c + 24.7,  (1)

and the corresponding frequency scale, the ERB rate (or “number of ERBs”), is then

\#\,\mathrm{ERBs}(f) = \int \frac{df}{\mathrm{ERB}(f)} + \mathrm{const} = 21.4 \log_{10}(1 + 0.00437 f),  (2)

where the integration constant has been chosen to make #ERBs(0) = 0.

Auditory frequency scales are related to the frequency-position mapping performed by the cochlea. In Figure 2,the ERB rate and the Bark [14] scales are compared witha position-frequency function which has been derived byGreenwood [17] from measurements of the mechanical mo-tion of the basilar membrane. For more details, see [18]. Inthis comparison, the scales are normalized. At the maximumpresented frequency of 4000 Hz, the basilar membrane po-sition is 23.4 mm, the ERB rate reaches 27.1 ERBs, and theBark scale has 18 Bark.

The shape of the auditory filters has been obtained by fitting different parametric expressions to experimental data. A simple linear frequency-domain description of auditory filters is the rounded exponential “roex(p, r)” function [19]

\left| H(f) \right|^2 = (1 - r)(1 + pg)\, e^{-pg} + r,  (3)

where g is the normalized deviation from the center frequency f_c:

g = \frac{|f - f_c|}{f_c}.  (4)

Figure 3: Analysis and synthesis filterbanks. (Block diagram: the analysis filters H_0(z), H_1(z), ..., H_{L-1}(z) operate in parallel on x[n]; the outputs of the synthesis filters G_0(z), G_1(z), ..., G_{L-1}(z) are summed to give y[n].)

The parameter p determines the bandwidth and should be chosen as p = 4 f_c / ERB(f_c). The second parameter r flattens the shape outside the passband.

A more recent, time-domain description is the well-known gammatone function [20] for the filter impulse response

h(t) = t^{\,l-1}\, e^{-2\pi b t} \cos(2\pi f_c t) \quad \text{for } t > 0,  (5)

where f_c is the frequency of the carrier and, therefore, the center frequency of the filter, b largely determines the bandwidth, and l is the order. Patterson [20] determined the choice l = 4 and b = 1.019 ERB(f_c). For our simulations, we will use gammatone filters since the time-domain description allows straightforward FIR filter design. We will discuss this issue in more detail in the next subsection.

Several nonlinear effects have been described such as thedependency on the sound pressure level [21] which causesmore asymmetric shapes of the frequency responses. For thisreason, both filter descriptions have been extended [15, 22]to account for this dependency. For simplicity, particularlywith respect to invertibility, we will only consider linear fil-ters for which the above descriptions are valid for moderatesound pressure levels.

2.1.2. Implementation aspects

An implementation of an auditory filterbank consists of many auditory filters with different center frequencies in parallel. For coding applications, we should be able to reconstruct the input signal from the channel signals and the filterbank should be invertible. We denote the analysis filters as H_k(z) for k = 0, ..., L-1 and the synthesis filters as G_k(z) for k = 0, ..., L-1. We thus obtain the analysis-synthesis structure shown in Figure 3. Filterbank inversion and the design of synthesis filters are described in more detail in Section 3.2.



Figure 4: Impulse response and impulse response envelopes of gammatone filters for different center frequencies (f_c = 80 Hz, 200 Hz, and 500 Hz; amplitude versus time in ms).

A commonly used method to compute the proper center frequencies for the filters is to transform the minimum and the maximum center frequency of interest from Hz into ERB rate. This range is divided into L − 1 uniform sections and the obtained ERB rates are finally transformed back into Hz.
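As an illustration of this placement rule, the following Python sketch (using NumPy; the function names are our own) maps equations (1) and (2) and spaces L center frequencies uniformly on the ERB-rate scale. The 100 Hz and 3600 Hz limits in the example call are the extreme center frequencies reported in Section 3.3 for the 20-channel, 8 kHz configuration.

import numpy as np

def erb(fc):
    # Equivalent rectangular bandwidth in Hz, eq. (1)
    return 0.1079 * fc + 24.7

def hz_to_erbrate(f):
    # ERB-rate scale ("number of ERBs"), eq. (2)
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def erbrate_to_hz(e):
    # Inverse of eq. (2)
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def center_frequencies(f_min, f_max, L):
    # L center frequencies spaced uniformly on the ERB-rate scale
    return erbrate_to_hz(np.linspace(hz_to_erbrate(f_min), hz_to_erbrate(f_max), L))

fc = center_frequencies(100.0, 3600.0, 20)   # example: 20 channels at 8 kHz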

The discrete-time impulse responses of the gammatone filters can be designed by sampling and windowing the continuous-time infinite-length impulse responses of (5). A problem with direct usage of these impulse responses for FIR implementations is that the impulse responses are very long. In Figure 4, a gammatone impulse response for a center frequency of 500 Hz is plotted. Its envelope is shown as well and compared with the envelopes obtained for center frequencies of 200 and 80 Hz. As can be seen from this figure, an impulse response with about 400 samples is needed at a sampling rate of 8 kHz for a center frequency f_c = 200 Hz to approximate accurately the frequency response of an ideal gammatone filter. For lower center frequencies, the length increases further (e.g., 600 samples for f_c = 80 Hz). Therefore, the corresponding FIR implementations are computationally expensive and memory consuming.
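A minimal sketch of such an FIR design, assuming the sampled-and-truncated form of eq. (5) with l = 4 and b = 1.019 ERB(f_c); the normalization to unity peak magnitude response is our own choice, not specified in the text:

import numpy as np

def fir_gammatone(fc, fs, n_taps, order=4):
    # Sample the continuous-time gammatone impulse response of eq. (5)
    b = 1.019 * (0.1079 * fc + 24.7)              # b = 1.019 ERB(fc)
    t = np.arange(n_taps) / fs
    h = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)
    # Normalize to unity peak magnitude response (assumed normalization)
    return h / np.max(np.abs(np.fft.rfft(h, 8 * n_taps)))

h200 = fir_gammatone(200.0, 8000.0, 400)   # the ~400-tap case mentioned above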

In the appendix, we discuss alternative implementation methods, which are computationally less expensive and should, therefore, be preferred when real-time applications running on a DSP are considered. However, for the experiments and simulations described in the following sections, we use FIR gammatone filters because computational complexity was not an issue.

2.2. Inner-hair-cell model

The auditory filterbank is followed by a half-wave rectifier and a power-law compressor, simulating the behavior of inner hair cells. The task of the inner hair cells is the so-called transduction process, that is, to convert mechanical movements into electrical potentials. It is assumed that the displacement of the cilia of the cells is proportional to the basilar membrane velocity [21]. Measurements of electrical responses have revealed a directional sensitivity: while displacement in one direction is excitatory, movement in the opposite direction is inhibitory [21]. Thus, the cells mainly react to positive deflection of the basilar membrane and, consequently, it is reasonable to model this behavior with a half-wave rectifier. Half-wave rectification is commonly used to model this aspect of physiology, for example, [4, 23, 24].

The aforementioned measurements also show a compressive response [21]. Therefore, we apply a power-law compressor to the half-wave rectified signals. The input x[n] and the output y[n] of the compression stage are related by

y[n] = x^{c}[n],  (6)

with c = 0.4. This stage is similar to logarithmic amplitude compression schemes in ordinary waveform coders (e.g., µ-law).
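In code, the inner-hair-cell stage reduces to two element-wise operations per subband signal. A sketch, where `subband` is a hypothetical NumPy array holding one channel of the analysis filterbank output:

import numpy as np

def inner_hair_cell(subband, c=0.4):
    # Half-wave rectification followed by power-law compression, eq. (6)
    return np.maximum(subband, 0.0) ** c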

The static nonlinearity is a strongly simplified model ofthe human peripheral processing. In related literature, moresophisticated compression or adaptation stages have beenproposed. In [4], a cascade of five feedback loops with dif-ferent time constants is used. The cascade compresses sta-tionary sounds almost logarithmically whereas rapidly vary-ing signals are transformed more linearly, thus modeling the“overshoot effect,” that is, a higher sensitivity at the onset of astimulus. Other examples can be found in [23, 24] where au-tomatic gain controllers model the synaptic region betweenthe hair cell and the nerve fiber. In our first implementationof an invertible model, we use the simple half-wave rectifierand power-law compressor to avoid stability problems wheninverting the gain control loops.

2.3. Neuron model

Contrary to many other auditory models (e.g., [4, 23, 24]), we preserve the temporal fine structure of the signal, that is, we do not apply time averaging to the subband signals because this would lead to a low reconstruction quality. In our model the power-law compressor is followed by an adaptive subsampling mechanism (“peak picking”), which searches for local maxima and sets all other samples to zero. Let the input and the output of the peak-picking stage be denoted by x[n] and y[n], respectively; then the output can be calculated as

y[n] = \begin{cases} x[n], & x[n] > x[n-1] \;\wedge\; x[n] > x[n+1], \\ 0, & \text{otherwise}. \end{cases}  (7)
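A direct sketch of eq. (7), keeping strict local maxima of the compressed subband signal and zeroing all other samples (vectorized NumPy; leaving the two boundary samples at zero is an assumption of this sketch):

import numpy as np

def peak_picking(x):
    # Keep samples that exceed both neighbors (eq. (7)); all others become zero
    y = np.zeros_like(x)
    peaks = np.flatnonzero((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:])) + 1
    y[peaks] = x[peaks]
    return y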

This model simulates the firing behavior of an ensemble of auditory neurons. The responses are clusters of high firing activity that are synchronized (phase-locked) with the waveform shape of the input signal.

It is known that a single neuron generally does not fire more often than 250 times per second [12, 13] and, therefore, it is by itself not able to preserve the time structure of high-frequency components. Since in the early human auditory system, about 30 000 neurons [11] encode the signals of significantly fewer hair cells, we can associate several neurons with one hair cell output. Our model of the neurons is physiologically plausible. Each neuron has an internal state that decays exponentially with a relatively large time constant. When it fires, this state is reset to a value that depends on the input signal level. The firing probability increases monotonically with the difference between the neuron's input and its state. So an ensemble of neurons shows a high firing rate at the peak of the input signal. The amplitude of a pulse in our model represents the firing rate, that is, the number of neurons of the ensemble that fire at the peak location.

Figure 5: Auditory representation (here with 50 channels) of the sound [I] taken from “there is,” spoken by a male. Peaks are shown as rectangles with their intensity representing their amplitude. The time axis covers three pitch periods. (Plot of channel center frequency in Hz, from 284 Hz to 3600 Hz, versus time in ms.)

The effect of phase locking is known to occur only at frequencies below 4 kHz [12, 13]. So the model used is physiologically plausible for the coding of narrowband speech signals. For simplicity, we use this neuron model even if we process signals at higher sampling rates, for example, wideband speech or general audio signals.

The consideration of pulsed neural models where infor-mation is carried in the pulse timings is clearly motivated byobservations of biological neural networks. In [25] it is welldemonstrated that these models should be preferred to clas-sical neuron models such as firing-rate models or even moresimplified ones for many applications of artificial neural net-works.

In Figure 5, a pulse representation of a segment of about30 milliseconds duration taken from a voiced speech isshown. For this example, a 50-channel FIR gammatone anal-ysis filter bank was used. The neuron firings are not strictlyaligned across the frequency channels due to different delaysof the filters. Nevertheless, the phase-locking effect can beseen clearly. Also the formant structure is visible with for-mants around 400 Hz, 1700 Hz, and 2800 Hz.

Weintraub [26] used a similar deterministic model for neural firing in his sound separation system. There is also similarity to Patterson's pulse ribbon model [27] but we preserve the amplitudes of the pulses in addition to the locations. Contrary to [26, 27], we are able to resynthesize the original audio signal directly from these neural firing pulses, whereas Weintraub uses the (unprocessed) signals from the auditory filterbank for the resynthesis [28] and Patterson does not resynthesize at all.

3. AUDITORY SYNTHESIS

Attempts to resynthesize the input signal from an auditory representation are not new. In [29] a historical overview is given. The aim of various model inversions was to understand perception [30, 31, 32], to test the accuracy of the model [33, 34], and to separate speech from noisy backgrounds or interfering speakers [26, 28, 32]. We propose to use an invertible auditory model for coding of speech and audio signals [10, 35].

For the most recent models, the inversion method is based on projections onto convex sets [32, 34] and utilizes iterative signal reconstruction algorithms. The resynthesis of our auditory model does not need iterative procedures and is, therefore, computationally very efficient and nevertheless perceptually accurate.

3.1. Inversion of neuron and inner-hair-cell models

The first step in the inversion procedure is to undo the power-law compression using the proper inverse expansion to get the positive peak amplitudes of the original subband signal:

y[n] = x^{1/c}[n].  (8)

Now, each of the channel signals approximates the situation where a signal is downsampled and then upsampled by means of inserting zeros. This insertion of zeros leads to aliasing which can be removed by bandpass filtering. The bandpass filters are located in the synthesis filterbank. Before they are applied, the amplitude of the pulses has to be corrected to compensate for the loss of energy due to (i) the adaptive downsampling and (ii) the peak amplitude errors at higher frequencies introduced by the finite sampling rate.

We consider one auditory channel. The output of one channel of the analysis filterbank resembles a sinusoid with a period of P samples that is related (but not identical) to the inverse of the center frequency of the filter. Then the peak-picking procedure behaves like a cascade of an ordinary downsampler and upsampler with a fixed decimation/interpolation factor P for which the Fourier transform relation is

Y(e^{j\theta}) = \frac{1}{P} \sum_{k=0}^{P-1} X\bigl(e^{j(\theta - k 2\pi/P)}\bigr).  (9)

The cosine signal with amplitude 1 and angular frequency 2\pi/P with Fourier transform

X(e^{j\theta}) = \pi \left( \delta_{2\pi}\!\left(\theta - \frac{2\pi}{P}\right) + \delta_{2\pi}\!\left(\theta + \frac{2\pi}{P}\right) \right)  (10)

is transformed into the pulse train with Fourier transform

Y(e^{j\theta}) = \frac{2\pi}{P} \sum_{k=0}^{P-1} \delta_{2\pi}\!\left(\theta - k\frac{2\pi}{P}\right),  (11)



where \delta_{2\pi}(\theta) is the 2\pi-periodic delta distribution. All additional frequency components have to be attenuated by the synthesis filter and the remaining components yield the cosine signal with amplitude 2/P. Therefore, the amplitude in this channel has to be corrected by a factor of P/2. This method is very simple and contributes substantially to good resynthesis results. Another slightly more elaborate correction method is to count the actual number of zeros between adjacent pulses, which replaces the constant correction factor with an adaptive one.
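The following sketch combines the power-law expansion of eq. (8) with this first correction step in its adaptive variant; estimating the local period P from the spacing to the previous pulse is our reading of the zero-counting method and is an assumption:

import numpy as np

def expand_and_correct(pulses, c=0.4):
    y = np.zeros_like(pulses)
    pos = np.flatnonzero(pulses)
    for i, n in enumerate(pos):
        amp = pulses[n] ** (1.0 / c)          # power-law expansion, eq. (8)
        if i > 0:
            P = n - pos[i - 1]                # local period from pulse spacing
        elif len(pos) > 1:
            P = pos[1] - pos[0]
        else:
            P = 2
        y[n] = amp * P / 2.0                  # compensate the 2/P amplitude loss
    return y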

For the second correction step, we observe that the measurement of the peak amplitude is exact in continuous time only. In discrete time, errors due to the finite sampling interval are inevitable. These errors become significant in particular for those auditory channels whose center frequencies are close to half the sampling frequency. To compensate for these errors, a method based on the assumption of a uniformly distributed random sampling error was proposed in [10]. The method evaluates the average per-cycle maximum amplitude of a sampled sinusoid, \alpha, which, for the case of a unity amplitude sine wave and a unity sampling period, is given by

\alpha = \int_{-1/2}^{1/2} \cos\!\left(\frac{2\pi t}{P}\right) dt = \frac{P}{\pi} \sin\!\left(\frac{\pi}{P}\right).  (12)

Thus, the correction factor due to the finite sampling rate for this channel is 1/\alpha.

An improved correction factor was introduced in [35] which is based on least-squares optimization. For a sinusoidal signal with amplitude A and period P, we observe the maximum sample w_{\max} = A \cos(2\pi t/P) with t uniform over [-1/2, 1/2]. The nonlinear least-squares estimate for the amplitude A in terms of the observation w_{\max} is given by \hat{A} = E\{A \mid w_{\max}\} = \beta \cdot w_{\max} with

\beta = \int_{-1/2}^{1/2} \frac{dt}{\cos(2\pi t / P)} = \frac{P}{\pi} \ln\!\left[\tan\!\left(\frac{\pi}{4} + \frac{\pi}{2P}\right)\right].  (13)
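Both correction factors are closed-form functions of the per-channel period estimate P and can be tabulated once per channel; a sketch:

import numpy as np

def alpha(P):
    # Average per-cycle maximum of a sampled unit sinusoid, eq. (12)
    return (P / np.pi) * np.sin(np.pi / P)

def beta(P):
    # Least-squares amplitude correction factor, eq. (13)
    return (P / np.pi) * np.log(np.tan(np.pi / 4.0 + np.pi / (2.0 * P)))

For large P both factors approach 1, which matches the observation that the sampling error matters mainly in channels whose center frequencies are close to half the sampling frequency.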

In Figure 6, these two compensation methods are compared.For a white-noise input signal, the power spectral densityfunction of the output signal is plotted for the cases of nopeak picking and therefore no correction (“nondecimatedcase”), peak picking with correction by 1/α, with correctionby β, and peak picking without correction. We recognizethat the correction factor β according to (13) keeps the re-construction error less than 1 dB across the entire frequencyrange covered by the auditory filterbank.

3.2. Synthesis filterbank

The last stage is the synthesis filter bank, which should bean inverse of the analysis filterbank. For proper signal re-construction from a firing-pulse representation, it is essen-tial that the synthesis filters have bandpass characteristics toeliminate aliasing. This also keeps the introduced quantiza-tion noise within a local frequency range.

Figure 6: Comparison of reconstruction quality with different amplitude correction methods for the peak amplitude sampling errors (output power spectral density in dB versus frequency in Hz for white input; curves: nondecimated, β, 1/α, and no correction).

In general, the inverse operator for a nondecimated, invertible filterbank is not unique. A natural method of inversion of a nondecimated FIR filterbank is based on the following condition for perfect reconstruction:¹

G_k(z) = \frac{H_k(z^{-1})}{\sum_{i=0}^{L-1} H_i(z) H_i(z^{-1})}.  (14)

For the case that \sum_{i=0}^{L-1} H_i(z) H_i(z^{-1}) = 1, the synthesis filterbank is the analysis filterbank with time-reversed impulse responses. A delay equal to the length of the analysis filters minus one is needed to make the synthesis filterbank causal.

In the general case, when the denominator of (14) is not equal to one (e.g., when a low number of auditory channels is used), accurate signal reconstruction can be obtained with an additional linear-phase equalization filter (see [10]) that operates on the sum of all channels synthesized with G_k(z) = H_k(z^{-1}). This equalizer has to be designed to approximate the frequency response

E(e^{j\theta}) \overset{!}{=} \left[ \sum_k \left| H_k(e^{j\theta}) \right|^2 \right]^{-1}  (15)

to reduce the remaining magnitude ripple.

The ripple decreases with increasing order of the FIR equalizer. However, an additional delay of half the filter order is introduced. Thus, for the choice of the impulse response length, a suitable compromise must be found.
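A possible design route for such an equalizer, sketched below with SciPy's firwin2 (the grid size, tap count, and the cap on the boost outside the covered band are our assumptions), samples the target response of eq. (15) on a frequency grid and fits a linear-phase FIR filter to it:

import numpy as np
from scipy.signal import firwin2

def design_equalizer(analysis_filters, fs, n_taps=257, n_grid=512, max_gain=10.0):
    # Summed magnitude-squared response of the analysis filterbank
    freqs = np.linspace(0.0, fs / 2.0, n_grid)
    power = np.zeros(n_grid)
    for h in analysis_filters:
        power += np.abs(np.fft.rfft(h, 2 * (n_grid - 1))) ** 2
    # Target response of eq. (15), with the boost limited outside the covered band
    target = np.minimum(1.0 / np.maximum(power, 1e-12), max_gain)
    # Linear-phase FIR fit (frequencies normalized so that Nyquist = 1)
    return firwin2(n_taps, freqs / (fs / 2.0), target)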

¹Here, perfect reconstruction refers to processing of the input signal by the analysis and the synthesis filterbank only.



Figure 7: Comparison of the original waveform (upper plot), the waveform reconstructed from the auditory representation (middle plot), and the reconstruction error (lower plot, note the finer amplitude scale). Speech segment taken from “The source,” spoken by a male speaker. (Amplitude versus time in ms.)

The minimum delay solution without equalization has often been used [32]. We found that, for 20 channels and for a sampling rate of 8 kHz, this results in a 4 dB ripple. The ripple decreases with a further increase of the number of channels.

As already mentioned for the analysis filters, FIR gammatone filters are memory consuming. Although the synthesis filters can use the same coefficients as used for the analysis filters, separate ring buffers are needed for every auditory channel in the synthesis filterbank. Consequently, the necessary amount of memory is doubled. For an accurate FIR gammatone filterbank implementation with long impulse responses, the memory of most currently used DSPs is not sufficient. One solution to this problem is to take shorter impulse responses and accept deviations from the ideal frequency responses. Another possibility is to consider alternative filterbank implementations as described in the appendix.

3.3. Simulation results

In Figure 7, a segment of the original waveform with 8 kHz sampling rate is compared with the output of our inverse auditory model with 20 channels. For this simulation, the FIR gammatone filterbank from Irino's Matlab toolbox² was used, where the lowest center frequency was 100 Hz (order of filter 666) and the highest 3600 Hz (order 56). The auditory representation has been left uncompressed. The output of the synthesis filterbank has been passed through a linear-phase equalizer with a group delay of 25 milliseconds.

2This toolbox can be found at http://www.mrc-cbu.cam.ac.uk/cnbh/aimmanual/.

Although the average segmental signal-to-noise ratio (SNR) is only 17.9 dB, the reconstructed signal is without audible distortion (evaluated by two experienced listeners in an original/reconstructed-comparison listening test).
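The segmental SNR figure quoted here is the average of per-segment SNRs; a sketch of the measure under the assumption of nonoverlapping 20-millisecond segments:

import numpy as np

def segmental_snr(x, x_hat, fs, seg_ms=20.0):
    # Average of per-segment SNRs in dB (segments with zero energy are skipped)
    n = int(fs * seg_ms / 1000.0)
    snrs = []
    for start in range(0, len(x) - n + 1, n):
        sig = np.sum(x[start:start + n] ** 2)
        err = np.sum((x[start:start + n] - x_hat[start:start + n]) ** 2)
        if sig > 0.0 and err > 0.0:
            snrs.append(10.0 * np.log10(sig / err))
    return float(np.mean(snrs))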

3.4. Frame-theoretic interpretation of auditory synthesis

It is useful to consider the auditory resynthesis from the per-spective of frame theory. This endorses our choice of synthe-sis filterbank and provides a bound for the reconstruction er-ror introduced by the analysis/synthesis filterbank pair. Fur-thermore, it justifies our simple method to reconstruct thesignal from the pulse representation and allows us to reducethe number of pulses in the auditory representation.

In practical implementations of the filterbank structure, the analysis and synthesis filterbanks are identical, except for a time reversal of the impulse responses. We first evaluate the validity and implications of this choice. The analysis filterbank maps the input sequence³ to a set of channel sequences, one for each filter. It is essential that the analysis filterbank is invertible and that means it can be interpreted as a frame operator, which we denote as F. The analysis filterbank operation can be written as a set of inner products, denoted as (Fx)[j] = \sum_i \psi_j^*[i]\, x[i], with functions \{\psi_j\}_{j \in J} where each is a translate of one of the L time-reversed impulse responses. The indexes j enumerate each output sample of all L channels. Invertibility of the filterbank is guaranteed if the frame condition is satisfied:

A \sum_{i \in \mathbb{Z}} |x[i]|^2 \;\le\; \sum_{j \in J} |(Fx)[j]|^2 \;\le\; B \sum_{i \in \mathbb{Z}} |x[i]|^2 \quad \forall x \in \ell^2(\mathbb{Z}),  (16)

where A and B are finite, positive, scalar frame bounds. The adjoint operator F^* maps an L-channel signal, y, to a single-channel signal F^* y = \sum_{j \in J} y_j \psi_j.

In general, the inverse frame operator (the synthesis filterbank) is not unique. We are interested in an inverse that is easy to compute, and, importantly, that minimizes the effect of quantization errors in (Fx)[j] on the reconstruction. The so-called frame algorithm is an iterative procedure that provides the inverse frame operator that minimizes the effect of quantization errors. The first iteration often provides a useful approximation to the inverse or even the exact inverse. The estimate x_m of x at iteration m of the frame algorithm is

x_m = \rho F^* y + (\mathrm{Id} - \rho F^* F)\, x_{m-1},  (17)

where \rho is a scalar relaxation parameter, Id is the identity operator, and x_0 = 0. The estimation error at iteration m is then

x - x_m = (\mathrm{Id} - \rho F^* F)(x - x_{m-1}) = (\mathrm{Id} - \rho F^* F)^m x.  (18)

³We assume that the input sequence is in the Hilbert space \ell^2(\mathbb{Z}).



With the optimal selection \rho = 2/(B + A), the error is bounded by

\|x - x_m\| = \min_\rho \|(\mathrm{Id} - \rho F^* F)^m x\| \le \min_\rho \max\bigl(|1 - \rho A|, |1 - \rho B|\bigr)^m \|x\| = \left(\frac{B - A}{B + A}\right)^m \|x\|.  (19)

The values A and B form the minimum and maximum eigenvalues of the operator F^*F, which are precisely the frame bounds.

The first-iteration estimate of x by the frame algorithm is the expansion \rho F^* y = \rho \sum_{j \in J} y_j \psi_j, which implies that \rho F^* is the approximation to the inverse operator. It is easily seen that this corresponds to a synthesis filterbank with impulse responses that are the time-reversed impulse responses of the analysis filterbank, scaled by \rho. Moreover, we see from (19) that the relative error is bounded by the factor (B − A)/(B + A).⁴
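For reference, the frame algorithm of eq. (17) can be written generically as below, with the analysis operator F and its adjoint F* passed in as callables; both callables and the fixed iteration count are assumptions of this sketch, and its first iteration corresponds to the time-reversed synthesis filterbank used in practice.

import numpy as np

def frame_algorithm(F, F_adj, y, rho, n_iter):
    # x_m = rho F* y + (Id - rho F* F) x_{m-1}, starting from x_0 = 0, eq. (17)
    x = np.zeros_like(F_adj(y))
    for _ in range(n_iter):
        x = rho * F_adj(y) + x - rho * F_adj(F(x))
    return x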

For a nondecimated filterbank, the discrete-time Fourier transform (which is unitary) simplifies the analysis of the operator F^*F. In the Fourier domain, the operator F^*F corresponds to the operator [36]

\mathcal{F}\, F^* F\, \mathcal{F}^{-1} = \sum_{i=0}^{L-1} H_i(e^{j\theta})\, H_i(e^{-j\theta}),  (20)

where \mathcal{F} denotes the discrete-time Fourier transform operator. This immediately leads to the inversion formula given in (14). The same Fourier-domain equivalence shows that the frame bounds then correspond to the essential infimum and supremum of \sum_{i=0}^{L-1} H_i(e^{j\theta}) H_i(e^{-j\theta}).

We can now draw some conclusions for our auditory filterbanks based on the frame-theoretical viewpoint. First, the synthesis filterbank based on time-reversing the impulse responses is an approximation to the perfect synthesis filterbank that has minimum sensitivity to quantization errors in the perceptual domain. Second, the accuracy of this approximation is governed by the relative error (B − A)/(B + A), where A and B can be evaluated as the essential infimum and supremum of the summed responses of the analysis filterbank. For an auditory filterbank implementation based on FIR gammatone filters, the relative error (B − A)/(B + A) is −30.7 dB for 50 channels and −5.9 dB for 20 channels.
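These quantities can be estimated numerically by evaluating the summed magnitude-squared response on a dense DFT grid; restricting the evaluation to the band actually covered by the filterbank is our assumption in the sketch below:

import numpy as np

def frame_bounds(analysis_filters, fs, band=(100.0, 3600.0), n_fft=8192):
    # A, B as min/max of sum_k |H_k|^2 on the grid (cf. (20)), plus (B-A)/(B+A) in dB
    f = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    power = np.zeros_like(f)
    for h in analysis_filters:
        power += np.abs(np.fft.rfft(h, n_fft)) ** 2
    in_band = (f >= band[0]) & (f <= band[1])
    A, B = power[in_band].min(), power[in_band].max()
    return A, B, 20.0 * np.log10((B - A) / (B + A))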

Frame theory can also be used to provide an interpretation of the peak-picking procedure that we use in our auditory model. It is convenient to look at a single channel first. A frame algorithm that can be used for the reconstruction of continuous lowpass band-limited signals from irregularly spaced samples and their derivatives was presented in [37]. In this case, the frame is formed by the translates of the impulse response of an ideal lowpass filter and its derivatives.

⁴The first-iteration estimate is exact for A = B, which corresponds to a tight frame.

For our case, the first-order derivative of the signal samples is selected as zero and the reconstruction method is essentially identical to the reconstruction applicable if no derivative is given. However, reconstruction is possible with a larger spacing between the samples than if no information was known about the derivatives (a factor two for regularly spaced samples). In practice, the first iteration of the frame algorithm consists of ideal lowpass filtering of the upsampled (inserting zeros) weighted signal. The weighting of each sample is linear with the distance to the previous sample. Nearly uniform spacing, as we have in our case, results in nearly uniform weighting, reducing the first iteration of the frame algorithm essentially to a lowpass filter. Moreover, it is easy to see that the frame is tight for the regular sampling case, which means that the first iteration renders the exact inverse.

We note that the frame algorithm of [37] assumes a bandlimited signal and a sample spacing that is at most 2π/θ fora band limitation of θ (in practice, the band limitation issomewhat less). Since the output of the auditory filters re-sembles sinusoids, and since a sinusoid of frequency θs hasits maxima spaced at 2π/θs, this implies that the frame al-gorithm of [37] does not apply to our case without modi-fication. The required modification consists of replacing theimpulse response of the ideal lowpass filter by the impulse re-sponse of an ideal bandpass filter.5 For regularly spaced sam-ples, the reconstruction algorithm then consists of a simplebandpass filtering. For irregular spacing, the samples mustfirst be weighted appropriately.

In practice, the bandpass filtering operation required for the reconstruction of each of the irregularly sampled channels can be usurped by the corresponding synthesis filter within the inverse of the basilar membrane filterbank. In our practical implementation, we then make the following approximations with respect to inverting the peak-picking procedure: (i) we use the first iteration of the frame algorithm and this is not accurate since the frame is not tight for irregular sampling, (ii) we neglect the sample weightings that are needed to account for irregular sampling, and (iii) we assume that the narrowband character of the inverse basilar membrane filterbank filters allows the bandpass filters to be omitted. The perceptual effect of these approximations on auditory synthesis is small; the samples are almost uniformly spaced and the bandpass filters used to invert the peak picking can be very broad, broader than the auditory filters, as is confirmed by the results provided in Section 3.3.

The frame interpretation leads directly to a method to reduce the coding rate of our basic model. Particularly for the filters of the basilar membrane filterbank with high center frequency, the peak-picking procedure leads to a high rate of peaks. Since the peak locations and amplitudes must be encoded as side information, the resulting parameterization is not a good basis for coding.

⁵We note that, in general, sampling rates that are sufficient for lowpass signals may not be so for bandpass signals of identical bandwidth, for example, see [38]. However, this aliasing problem is unlikely to occur for spectra that essentially consist of a single line.



However, we note that the described frame-algorithm-based reconstruction from peak amplitudes and locations only requires that the peaks be not separated by more than a given distance. Importantly, there is no requirement to include all peaks of the signal. As a result, we can downsample the peak sequence in the channels with higher center frequency by a significant factor without losing the ability to reconstruct the signal.

The amount of downsampling that can be applied to apeak sequence is constrained by the bandwidth of the idealbandpass filter of the frame. With increasing downsamplingof the peak sequence, the importance of the bandpass filter-ing operation increases and then it cannot be omitted fromthe synthesis structure. On the other hand, the bandpassfilter cannot be selected to be narrower than the nominalwidth of the basilar membrane filters, since that removes rel-evant information. It is interesting to note that this frame-theoretical vantage point leads to a new interpretation of theresults obtained in [39]. In [39], downsampling of the peaksequence was justified from a masking argument, which isnot physiologically plausible for the auditory representation.

4. EXEMPLARY APPLICATIONS IN AUDIO AND SPEECH CODING

The proposed invertible auditory model allows the input signal to be resynthesized with high quality and, therefore, can build a basis for coding of audio signals. The next section describes first approaches for quantization and coding to reduce the amount of data needed to transmit an auditory representation, whereas in Section 4.2, we exploit the inherent redundancy of the auditory representation in a joint source-channel coding strategy to protect against possible losses of data during the transmission in a packet-switched network.

4.1. Auditory-domain compression

The auditory representation provided by our model is sparse,consisting mostly of zeros. However, it contains more firingpulses in total compared to the number of samples that theoriginal input signal has (about three times more for the 20-channel case and a sampling rate of 8 kHz).

Experiments have shown that the firing-pulse amplitudes can be quantized coarsely, for example, using a block scalar quantizer with a block duration of 20 milliseconds and 1 bit [10] per pulse amplitude, without introducing audible distortion. The maximum amplitude of a block has to be transmitted as side information for each channel where 6 bits per value are enough. In fact, quantizing the peak amplitudes with 1 bit enables us to refer to three amplitude values—high (= 1), middle (= 0), and zero—since zero denotes that there is no pulse at all (no pulse time position transmitted as side information).
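A sketch of this block-adaptive scheme; the decision threshold at half the block maximum and the two reconstruction levels are our assumptions, since the text only fixes the bit budget:

import numpy as np

def quantize_pulse_amplitudes(pulses, fs, block_ms=20.0):
    block_len = int(fs * block_ms / 1000.0)     # e.g., 160 samples at 8 kHz
    out = np.zeros_like(pulses)
    block_maxima = []                           # side information, 6 bits per value
    for start in range(0, len(pulses), block_len):
        block = pulses[start:start + block_len]
        a_max = float(block.max()) if block.size else 0.0
        block_maxima.append(a_max)
        if a_max <= 0.0:
            continue
        idx = np.flatnonzero(block)
        high = block[idx] > 0.5 * a_max         # the 1 bit per pulse (assumed threshold)
        out[start + idx] = np.where(high, a_max, 0.5 * a_max)
    return out, block_maxima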

In [39] even 0 bits were found to be sufficient for thepulse amplitudes, that is, only the side information (block-average pulse amplitudes and pulse positions) requires trans-mission. However, in that work shorter block lengths areused to determine the block energy. Especially for higher-frequency channels, the block duration is only 4 millisec-onds.

Much more important than the firing-pulse amplitudes are the pulse-time positions. These positions also have to be transmitted as side information, which produces by far the major part of the transmitted data. In [39], these positions are compressed using arithmetic coding for low-frequency channels and vector quantization for high-frequency channels. Furthermore, models of temporal and simultaneous masking were added to reduce the overall number of firing pulses drastically. While the consideration of simultaneous masking does not bring a remarkable reduction, exploiting temporal masking does. Our own experiments with the model for temporal postmasking adopted from [39] show that an average reduction in the number of pulses by 50% for 16 kHz-sampled speech does not affect the audible quality of the reconstructed signal [40]. For this model, a masking threshold signal is computed in each channel. Let x[n] be the firing pulse train of one auditory channel and T[n] the corresponding masking threshold. Then T[n] is defined as

T[n] = \begin{cases} x[n], & x[n] > T[n-1]\, e^{-1/\tau}, \\ T[n-1]\, e^{-1/\tau}, & \text{otherwise}. \end{cases}  (21)

The time constant \tau was set to 125 samples for the lowest-frequency channel and to 33 samples for the highest (according to the empirically determined values from [39]). Once this threshold is computed, the output signal of the masking stage is

y[n] = \begin{cases} x[n], & x[n] > T[n-1]\, e^{-1/\tau}, \\ 0, & \text{otherwise}. \end{cases}  (22)
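A direct sketch of eqs. (21)-(22) for one channel; the per-channel interpolation of the time constant between the 125-sample and 33-sample values quoted above is not shown:

import numpy as np

def temporal_masking(x, tau):
    # Remove pulses that fall below an exponentially decaying threshold
    decay = np.exp(-1.0 / tau)
    y = np.zeros_like(x)
    T_prev = 0.0
    for n in range(len(x)):
        if x[n] > T_prev * decay:
            y[n] = x[n]            # pulse survives and resets the threshold
            T_prev = x[n]
        else:
            T_prev = T_prev * decay
    return y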

It is natural to observe many more pulses in high-frequencychannels6 than in low-frequency channels. Thus, the re-duction of the number of pulses is most effective in high-frequency channels. This is in accordance with the frame-theoretic consideration of Section 3.4.

We have performed experiments with 16 kHz-sampled speech and a 16-channel auditory model. The aforementioned temporal masking model has been included to reduce the number of pulses. The positions of the remaining firing pulses have been coded using run-length encoding combined with arithmetic coding, which results in an average bit rate of about 100 kbps [40] for the transmission of the pulse positions only.⁷ Further compression can be achieved with vector quantization (cf. [39]), where an average bit rate of about 70 kbps has been achieved for the overall bit stream.

We expect that a coarse quantization of the pulse positions in high-frequency channels should be sufficient since neurons of the auditory nerve no longer show phase-locked firing behavior above 4 kHz.

⁶The average number of pulses per second in an auditory channel can be predicted by the channel's center frequency.

7The amplitude information must be added.



Thus, we expect only a minor increase in the necessary bit rate when audio signals at higher sampling rates (e.g., 44.1 kHz) are coded. Confirming this is a matter of our current research.

We have to reduce the number of pulses significantly further, particularly in higher-frequency channels, to achieve a better compression. In our most recent work [41], we incorporated a combined model for both simultaneous and temporal masking. Together with another pulse-amplitude correction step, which compensates for the loss of energy due to the elimination of pulses, we are able to omit even 74% of the original pulses of speech signals sampled at both 8 kHz and 16 kHz without degrading the reconstruction quality. This result is a step further towards an efficient compression method since it reduces the amount of side information considerably. Finding the upper bound of the downsampling factor without losing the ability to reconstruct the signal is a matter of further investigation.

4.2. Multiple description coding

The high degree of redundancy in the human peripheral au-ditory system forms a motivation to use our invertible audi-tory model in a joint source-channel coding strategy. In otherwords, we use the overcomplete auditory representation toprotect the transmitted signal against erasure of coded in-formation (packet loss in packet networks). In [35], we pro-posed the first instance of a highly redundant speech coderoptimal for packet-switched networks, for example, for voiceover IP applications. There, we use the auditory model formultiple-description coding where the source informationis spread over multiple signal descriptions which are carriedover M independent subchannels. These transport channelsmay be physically distinct as in a packet-switched networkor correspond to multiplexed subchannels on a single phys-ical channel. When an arbitrary set of K < M subchannelsfails, the receiver uses the information from the remainingM − K intact channels to reconstruct the transmitted signal.Therefore, the encoder should be based on a nonhierarchicalsignal decomposition [42]—this is the case for our auditoryrepresentation—and should assign descriptions of equal im-portance to each transport channel. The descriptions mustbe different, that is, each must carry new information, suchthat receiving more descriptions enables the decoder to im-prove the reconstruction quality.

A grouping of the L auditory channels into M ≤ L transport channels provides an immediate application of our coding paradigm in this context. To form descriptions of equal importance, L should be chosen as an integer multiple of M such that a constant number of L/M uniformly spread auditory channels are packaged together into one transport channel. In this respect, each description is obtained by frequency-domain subsampling of an overcomplete signal representation. One extreme case is given if M = L, that is, the maximum possible number of transport channels is used to achieve superior robustness. The other extreme case is a simple interleaving of odd and even indexed auditory channels and assigning them to M = 2 transport channels.
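The grouping itself is a simple interleaving of channel indices; a sketch (the dictionary layout is our own choice):

def group_channels(L, M):
    # Assign L auditory channels to M transport channels by uniform interleaving
    assert L % M == 0, "L should be an integer multiple of M"
    return {m: list(range(m, L, M)) for m in range(M)}

groups = group_channels(50, 2)   # e.g., odd/even interleaving into two descriptions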

If erasures occur, the coded information about some auditory channels is lost. However, the information at the affected frequencies is generally not lost because neighboring auditory filters overlap. Assuming that the decoder knows which channels are erased, as is the case in packet networks, a time-varying equalizer filter can be designed for the reconstruction after the synthesis filterbank to amplify the attenuated regions.

Figure 8: Channel erasure pattern for the 50-channel auditory representation. Black bars indicate erased channels (40%); white bars stand for intact ones (60%).

Figure 9: Overall unequalized frequency response of the nondecimated analysis/synthesis filterbank with channel erasures as in Figure 8 (magnitude in dB versus frequency in Hz).

From a frame-theoretical viewpoint, the perfect inverse filter bank can be constructed as long as the frame functions corresponding to the received information form a frame, that is, if they satisfy the frame condition displayed in (16). However, since the separation between the essential infimum and supremum will increase, the approximation made by using time-reversed impulse responses will become less accurate. The accuracy of the approximation prior to equalization can be quantified by means of the factor (B − A)/(B + A), which bounds the distortion.
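To make the distortion measure concrete, the following sketch assumes the standard frame-theoretic result for nondecimated filterbanks (cf. [36]) that the bounds A and B can be taken as the essential infimum and supremum over frequency of the summed squared magnitude responses of the retained channels; the filterbank itself is an arbitrary FIR stand-in, not the gammatone bank of the paper, and reporting the factor in dB (20 log10) is our convention, not necessarily the one used in the experiment below.

```python
import numpy as np
from scipy.signal import firwin, freqz

fs = 8000
freqs = np.linspace(0.0, fs / 2.0, 2048, endpoint=False)

# An arbitrary bank of overlapping FIR bandpass filters (a stand-in, not the
# auditory filterbank of the paper); adjacent filters overlap by one band.
edges = np.linspace(100.0, 3900.0, 11)
bank = [firwin(129, [edges[i], edges[i + 2]], pass_zero=False, fs=fs)
        for i in range(len(edges) - 2)]

def frame_bounds(filters, erased=(), band=(200.0, 3800.0)):
    """Essential frame bounds of the retained channels of a nondecimated FIR
    filterbank, evaluated over the frequency band the bank is meant to cover."""
    total = np.zeros(len(freqs))
    for k, h in enumerate(filters):
        if k in erased:
            continue
        _, H = freqz(h, worN=freqs, fs=fs)
        total += np.abs(H) ** 2
    inband = (freqs >= band[0]) & (freqs <= band[1])
    return total[inband].min(), total[inband].max()

A, B = frame_bounds(bank, erased={2, 5})
factor = (B - A) / (B + A)
print(f"A = {A:.3f}, B = {B:.3f}, (B - A)/(B + A) = {factor:.3f} "
      f"({20.0 * np.log10(factor):.1f} dB)")
```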

Experiment and Results

We have run an experiment with an auditory model with 50 auditory channels assigned to 50 transport channels. We generated the channel-erasure pattern with 40% randomly muted channels shown in Figure 8. Although the 50 auditory channels highly overlap, the high proportion of erased channels creates a clearly perceptible spectral distortion (cf. Figure 9) if no equalizer is used. The factor (B − A)/(B + A), which bounds distortion, is −3.0 dB.

The amplitudes of the firing pulses are represented with 1 bit each using block-adaptive quantization, whereas the pulse positions are left unquantized. In Figure 10, the reconstruction results are shown with a waveform of the initial part of the word “player” spoken by a female speaker sampled at a rate of 8 kHz. In the first plot, the original waveform (compensated for the processing delay) is drawn. The second plot shows the output of the decoder for the case that 40% of the channels are erased and an appropriate equalizer with an impulse response of length 256 samples is used. In the third plot, the reconstruction error, that is, the difference between the original and the reconstructed signal, is plotted. The average segmental SNR is 15.5 dB compared to 16.4 dB in the case without channel erasures.

Figure 10: Original segment of a speech waveform (first plot) versus reconstruction from decimated and quantized auditory representation with 40% channel erasures (middle) and the difference between both (third plot).

These results show the potential applicability of our invertible auditory model in joint source-channel coding methods such as multiple description coding for robust transmission over packet-switched networks.

5. CONCLUSION

We have reviewed an invertible auditory model and its usage for robust coding of speech and audio signals. The inversion procedure to reconstruct the original signal from its auditory representation does not need computationally expensive iterative algorithms and produces reconstructed audio signals with very high quality. The overcomplete auditory representation suggests the application of the invertible auditory model in multiple description coding. Our experiments have shown promising results, indicating that the auditory model provides an ideal basis for this joint source-channel coding method and allows robust transmission over packet-switched networks even if a high number of packets get lost.

Figure 11: Modification of a nondecimated transform filterbank to obtain a frequency-warped version.

APPENDIX

A. ALTERNATIVE FILTERBANK IMPLEMENTATION METHODS

As discussed in Section 2.1.2, FIR implementations are computationally expensive and memory consuming.

A.1. IIR filterbank

Several computationally less expensive IIR implementations for gammatone filters [43] have been suggested. These are based on standard transforms from continuous-time transfer functions to discrete-time transfer functions (e.g., the impulse-invariance transformation), which result in filters of order 8.
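As one concrete route to such a low-order IIR bank (not necessarily the specific design of [43]), recent SciPy releases ship an IIR gammatone design; the sketch below assumes scipy.signal.gammatone is available and uses the Glasberg-Moore ERB-rate formula [15] to space the centre frequencies.

```python
import numpy as np
from scipy.signal import gammatone, lfilter

fs = 16000

def erb_rate(f):       # Glasberg-Moore ERB-rate scale [15], f in Hz
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def erb_rate_inv(e):   # inverse mapping: ERB-rate back to Hz
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

# 30 channels spaced uniformly on the ERB-rate scale between 100 Hz and 6 kHz.
centres = erb_rate_inv(np.linspace(erb_rate(100.0), erb_rate(6000.0), 30))

# One low-order IIR gammatone per channel (recent SciPy versions provide the design).
bank = [gammatone(fc, 'iir', fs=fs) for fc in centres]

x = np.random.randn(fs)                                    # 1 s of noise as a test input
subbands = np.stack([lfilter(b, a, x) for b, a in bank])   # (30, 16000) subband signals
print(subbands.shape)
```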

Inversion

An inversion based on FIR filters according to (14) is not possible for infinite impulse response filters. While for nondecimated filterbanks the direct channel-by-channel inversion of minimum-phase analysis filters is possible with stable and causal synthesis filters, this is not advisable since the frequency response of the inverse is complementary, that is, the inverse of a bandpass filter gives a bandstop. In this paper, we do not deal with further inversion possibilities for IIR filter banks, but refer to [44].
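The complementary-response argument is easy to verify numerically: the sketch below evaluates the magnitude response of an arbitrary IIR bandpass (a Butterworth stand-in, not an auditory filter) and of its channel-wise inverse 1/H; relative to its large out-of-band gain, the inverse has a notch at the original passband, i.e., it behaves as a bandstop that strongly amplifies out-of-band components.

```python
import numpy as np
from scipy.signal import butter, freqz

fs = 8000
b, a = butter(2, [400, 800], btype='bandpass', fs=fs)   # stand-in bandpass channel

freqs = np.array([100.0, 600.0, 2000.0, 3500.0])        # in-band and out-of-band points
_, H = freqz(b, a, worN=freqs, fs=fs)
inv_dB = -20.0 * np.log10(np.abs(H))                    # |1/H| in dB

for f, h_dB, g_dB in zip(freqs, 20.0 * np.log10(np.abs(H)), inv_dB):
    print(f"{f:6.0f} Hz: |H| = {h_dB:7.1f} dB, |1/H| = {g_dB:7.1f} dB")
# The reciprocal response is small in the passband and huge in the stopbands.
```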

A.2. Frequency-warped transform filterbank

Another computationally very efficient approximation of an auditory filterbank is to take a frequency-warped transform filterbank. In the early 1970s, Oppenheim et al. [45] introduced a technique for computing nonuniform-resolution Fourier transforms. They first transform the input sequence into a frequency-warped version by time-reversing it and passing it through a chain of allpass filters. After that, an FFT of the samples along this allpass cascade is performed. This is a computationally very efficient method for constant relative-bandwidth spectral analysis of finite-length signals.

In the late 1970s, Vary [46] suggested a frequency-warped transform filterbank obtained by simply replacing the unit-delay elements in the signal flow graph representation of a sliding window with general allpasses. This is illustrated in Figure 11. The window coefficients w0, . . . , wW−1 correspond to the impulse response of the prototype lowpass filter which is modulated by the transform T (e.g., a DFT or a DCT) to get bandpasses. The window length W does not necessarily have to be equal to the number of channels L (see [47] for more details). Thus, a longer FIR prototype filter can be designed to better approximate gammatone or roex frequency responses.

Figure 12: Phase function of the first-order allpass for four different values of the warping parameter λ.

We consider a nondecimated filterbank where the window advances by one sample at a time and, thus, the transform has to be calculated for every sample and nondecimated subband signals are obtained.

When the unit delays are replaced with general nonlinear-phase allpasses, the characteristics of the transform filterbank will be modified. Let the transfer function of a first-order allpass be denoted by

\[
A(z) = \frac{z^{-1} - \lambda}{1 - \lambda z^{-1}}, \tag{A.1}
\]

with the single so-called warping parameter λ. If we substitute z^{-1} by A(z), a bilinear transform is applied, resulting in warping the frequency axis corresponding to the phase function of the allpass

\[
\theta' = \arctan\!\left(\frac{\bigl(1 - \lambda^2\bigr)\sin\theta}{\bigl(1 + \lambda^2\bigr)\cos\theta - 2\lambda}\right), \tag{A.2}
\]

where θ and θ′ are the frequency variables (in radians relative to the sampling frequency) before and after warping, respectively. In Figure 12, this function is plotted for different warping parameters λ.
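For reference, (A.2) is straightforward to evaluate numerically. The sketch below computes the warped frequency for the four λ values shown in Figure 12; arctan2 is used only to select the branch that keeps θ′ in [0, π] for θ in [0, π], and λ = 0 reproduces the identity mapping.

```python
import numpy as np

def warp(theta, lam):
    """Allpass phase function (A.2): warped frequency theta' for warping parameter lam."""
    num = (1.0 - lam ** 2) * np.sin(theta)
    den = (1.0 + lam ** 2) * np.cos(theta) - 2.0 * lam
    return np.arctan2(num, den)   # branch choice: theta' in [0, pi] for theta in [0, pi]

theta = np.linspace(0.0, np.pi, 9)
for lam in (0.0, 0.4, 0.8, -0.4):            # the four values plotted in Figure 12
    print(lam, np.round(warp(theta, lam), 2))
```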

Smith and Abel [48] proposed expressions for choosing a proper λ to achieve a frequency warping nearly identical to that of the Bark or the ERB rate frequency scales for a given sampling frequency. In Figure 2, the warped frequency scale obtained using allpasses with λ = 0.5 at a sampling rate of 8 kHz is compared with the frequency-position function, the ERB rate scale, and the Bark (critical-band rate) scale. Therefore, warping a uniform filterbank with a chain of first-order allpasses yields a good approximation of auditory filterbanks for critical-band spectral analysis.

Figure 13: Normalized frequency responses of four channels of auditory filterbanks. Comparison between FIR gammatone filters, rounded exponentials, and a frequency-warped DCT-4 filterbank (λ = 0.5, fs = 8 kHz, 64-point Kaiser window with β = 10).

In Figure 13, the frequency responses of four effective analysis filters of a warped (λ = 0.5) 64-point-windowed DCT-4 filterbank are plotted. Here the window has been chosen without any special optimization (Kaiser window with β = 10). Therefore, the capability to approximate gammatone filter frequency responses or roex functions is limited (especially at higher frequencies). Nevertheless, we can observe that the responses fit relatively well at low center frequencies. Note that this behavior is contrary to what we have observed for the FIR gammatone filter design, where the necessary impulse response length increases with decreasing center frequency. Further optimization of the window will improve the frequency responses.

A window length of only 64 samples yields reasonable frequency responses at a sampling rate of 8 kHz. Therefore, the usage of a frequency-warped transform filterbank constitutes a computationally highly efficient and memory-saving option for an auditory filterbank implementation on a DSP for real-time applications.
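A direct, if naive, way to realize the structure of Figure 11 in software is to run the input through a chain of identical first-order allpass sections and apply the windowed transform at every sample. The sketch below does exactly that, with a DFT in place of the DCT-4 of Figure 13 and without any window optimization, so it only illustrates the signal flow rather than a tuned design.

```python
import numpy as np

def warped_analysis(x, window, lam):
    """Nondecimated frequency-warped DFT filterbank analysis (cf. Figure 11): the
    unit-delay chain of a sliding-window transform is replaced by a chain of
    identical first-order allpass sections A(z) = (z^-1 - lam)/(1 - lam z^-1)."""
    W = len(window)
    taps = np.zeros(W)          # current outputs along the allpass chain
    x_prev = np.zeros(W - 1)    # previous input of each allpass section
    y_prev = np.zeros(W - 1)    # previous output of each allpass section
    out = np.zeros((len(x), W), dtype=complex)
    for n, xn in enumerate(x):
        taps[0] = xn
        for i in range(W - 1):
            # first-order allpass difference equation: y[n] = -lam*x[n] + x[n-1] + lam*y[n-1]
            y = -lam * taps[i] + x_prev[i] + lam * y_prev[i]
            x_prev[i], y_prev[i] = taps[i], y
            taps[i + 1] = y
        out[n] = np.fft.fft(window * taps)   # transform T applied at every sample
    return out

# Example: 64-point Kaiser window (beta = 10), lam = 0.5, a short noise input.
window = np.kaiser(64, 10.0)
x = np.random.randn(800)
subbands = warped_analysis(x, window, lam=0.5)
print(subbands.shape)          # (800, 64) nondecimated subband samples
```

Swapping np.fft.fft for a DCT and choosing λ according to [48] would move this sketch closer to the configuration used for Figure 13.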

Inversion

Figure 14: Nondecimated frequency-warped phase-distorted analysis/synthesis filterbank.

A synthesis filterbank can be obtained by generalizing the overlap-and-add procedure, which is well known from the inverse short-term Fourier transform, in the same way as for the sliding window, that is, by replacing the unit-delay chain with a general allpass chain (see Figure 14). While the uniform frequency resolution analysis/synthesis filterbank achieves perfect reconstruction, the frequency-warped version does not. For the simple case, when the window length W equals the number of channels L, we can choose the window coefficients such that \(\sum_{i=0}^{W-1} w_i^2 = 1\) and we obtain for the output signal

\[
\hat{X}(z) = X(z)\, A^{W-1}(z), \tag{A.3}
\]

and, therefore, a phase distortion is introduced. In [47], an FIR filter is used to compensate for this phase distortion to get a near-perfect-reconstruction filterbank. However, this introduces an additional delay, which increases with decreasing compensation error. In any case, it is not necessary to equalize for a perfectly linear phase since small phase distortions are inaudible. The case where a longer prototype filter is used without a higher number of auditory channels, that is, W > L, is also considered in [47].

In a recent development [49], we have shown that an FIR synthesis filterbank exists for a critically sampled frequency-warped transform filterbank which achieves perfect reconstruction. However, these synthesis filters amplify any quantization noise introduced in the subband signals and do not exhibit bandpass characteristics. Thus, they are not recommended for coding applications.

ACKNOWLEDGMENT

This paper is an extended version of a plenary lecture presented at the second IEEE Benelux Signal Processing Symposium (SPS-2000) in Hilvarenbeek, The Netherlands, March 2000.

REFERENCES

[1] K. Brandenburg and G. Stoll, “ISO-MPEG-1 Audio: a generic standard for coding of high-quality digital audio,” Journal of the Audio Engineering Society, vol. 42, no. 10, pp. 780–792, 1994.

[2] B. Tang, A. Shen, A. Alwan, and G. Pottie, “A perceptually based embedded subband coder,” IEEE Trans. Speech Audio Processing, vol. 5, no. 2, pp. 131–140, 1997.

[3] R. Veldhuis and A. Kohlrausch, “Waveform coding and auditory masking,” in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., pp. 427–428, Elsevier Science, Amsterdam, The Netherlands, 1995.

[4] T. Dau, D. Puschel, and A. Kohlrausch, “A quantitative model of the ‘effective’ signal processing in the auditory system. I. Model structure,” Journal of the Acoustical Society of America, vol. 99, no. 6, pp. 3615–3622, 1996.

[5] E. Zwicker, “Dependence of post-masking on masker duration and its relation to temporal effects in loudness,” Journal of the Acoustical Society of America, vol. 75, no. 1, pp. 219–223, 1984.

[6] R. Geiger, A. Herre, G. Schuller, and T. Sporer, “Fine grain scalable perceptual and lossless audio coding based on IntMDCT,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’03), vol. 5, pp. 445–448, Hong Kong, China, April 2003.

[7] M. Hansen and B. Kollmeier, “Using a quantitative psychoacoustical signal representation for objective speech quality measurement,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’97), vol. 2, pp. 1387–1390, Munich, Germany, April 1997.

[8] H. Su and P. Mermelstein, “Delayed decision coding of pitch and innovation signals in code-excited linear prediction coding of speech,” in Speech and Audio Coding for Wireless and Network Applications, B. S. Atal, V. Cuperman, and A. Gersho, Eds., pp. 69–76, Kluwer Academic Publishers, Boston, Mass, USA, 1993.

[9] R. Fandos Marin, Delayed decision CELP speech coding using squared and perceptual error criteria, M.S. thesis, Department of Signals, Sensors and Systems, KTH (Royal Institute of Technology), Stockholm, Sweden, 2003.

[10] G. Kubin and W. B. Kleijn, “On speech coding in a perceptual domain,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’99), vol. 1, pp. 205–208, Phoenix, Ariz, USA, March 1999.

[11] J. B. Allen, “Cochlear modeling,” IEEE ASSP Mag., vol. 2, no. 1, pp. 3–29, 1985.

[12] S. Greenberg, “Acoustic transduction in the auditory periphery,” Journal of Phonetics, vol. 16, pp. 3–17, 1988.

[13] M. A. Ruggero, “Physiology and coding of sound in the auditory nerve,” in The Mammalian Auditory Pathway: Neurophysiology, A. Popper and R. Fay, Eds., pp. 34–93, Springer-Verlag, New York, NY, USA, 1992.

[14] E. Zwicker and H. Fastl, Psychoacoustics. Facts and Models, Springer-Verlag, Berlin, Germany, 2nd edition, 1999.

[15] B. R. Glasberg and B. C. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hearing Research, vol. 47, no. 1-2, pp. 103–138, 1990.

[16] B. C. Moore, An Introduction to the Psychology of Hearing, Academic Press, London, UK, 4th edition, 1997.

[17] D. D. Greenwood, “A cochlear frequency-position function for several species—29 years later,” Journal of the Acoustical Society of America, vol. 87, no. 6, pp. 2592–2605, 1990.

[18] A. Harma, M. Karjalainen, L. Savioja, V. Valimaki, U. K. Laine, and J. Huopaniemi, “Frequency-warped signal processing for audio applications,” Journal of the Audio Engineering Society, vol. 48, no. 11, pp. 1011–1029, 2000.

[19] R. D. Patterson, I. Nimmo-Smith, D. L. Weber, and R. Milroy, “The deterioration of hearing with age: Frequency selectivity, the critical ratio, the audiogram, and speech threshold,” Journal of the Acoustical Society of America, vol. 72, no. 6, pp. 1788–1803, 1982.


[20] R. D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand, “Complex sounds and auditory images,” in Auditory Physiology and Perception, Y. Cazals, L. Demany, and K. Horner, Eds., pp. 429–446, Pergamon Press, Oxford, UK, 1992.

[21] P. Dallos, “Overview: Cochlear neurobiology,” in The Cochlea, P. Dallos, A. Popper, and R. Fay, Eds., vol. 8, pp. 1–43, Springer-Verlag, New York, NY, USA, 1996.

[22] T. Irino and R. D. Patterson, “A time-domain, level-dependent auditory filter: The gammachirp,” Journal of the Acoustical Society of America, vol. 101, no. 1, pp. 412–419, 1997.

[23] R. F. Lyon, “A computational model of filtering, detection, and compression in the cochlea,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’82), vol. 7, pp. 1282–1285, Paris, France, May 1982.

[24] S. Seneff, “A joint synchrony/mean-rate model of auditory speech processing,” Journal of Phonetics, vol. 16, pp. 55–76, 1988.

[25] W. Maass and C. M. Bishop, Eds., Pulsed Neural Networks, MIT Press, Cambridge, Mass, USA, 1999.

[26] M. Weintraub, “The GRASP sound separation system,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’84), vol. 9, pp. 69–72, San Diego, Calif, USA, March 1984.

[27] R. D. Patterson, “A pulse ribbon model of monaural phase perception,” Journal of the Acoustical Society of America, vol. 82, no. 5, pp. 1560–1586, 1987.

[28] M. Weintraub, “A computational model for separating two simultaneous talkers,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’86), vol. 11, pp. 81–86, Tokyo, Japan, April 1986.

[29] M. Slaney, “Pattern playback from 1950 to 1995,” in Proc. IEEE Systems Man Cybern. Conf., vol. 4, pp. 3519–3524, Vancouver, BC, Canada, October 1995.

[30] F. S. Cooper, “Acoustics in human communication: Evolving ideas about the nature of speech,” Journal of the Acoustical Society of America, vol. 68, no. 1, pp. 18–21, 1980.

[31] T. Irino and H. Kawahara, “Signal reconstruction from modified auditory wavelet transform,” IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3549–3554, 1993.

[32] M. Slaney, D. Naar, and R. F. Lyon, “Auditory model inversion for sound separation,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’94), vol. 2, pp. 77–80, Adelaide, Australia, April 1994.

[33] R. W. Hukin and R. I. Damper, “Testing an auditory model by resynthesis,” in Proc. 8th European Conference on Speech Communication and Technology (EUROSPEECH ’89), vol. 1, pp. 243–246, Paris, France, September 1989.

[34] X. Yang, K. Wang, and S. A. Shamma, “Auditory representations of acoustic signals,” IEEE Trans. Inform. Theory, vol. 38, no. 2, pp. 824–839, 1992.

[35] G. Kubin and W. B. Kleijn, “Multiple-description coding (MDC) of speech with an invertible auditory model,” in Proc. IEEE Speech Coding Workshop, pp. 81–83, Porvoo, Finland, June 1999.

[36] H. Bolcskei, F. Hlawatsch, and H. Feichtinger, “Frame-theoretic analysis of oversampled filter banks,” IEEE Trans. Signal Processing, vol. 46, no. 12, pp. 3256–3268, 1998.

[37] H. N. Razafinjatovo, “Iterative reconstructions in irregular sampling with derivatives,” J. Fourier Anal. Appl., vol. 1, no. 3, pp. 281–295, 1995.

[38] B. Foster and C. Herley, “Exact reconstruction from periodic nonuniform samples,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’95), vol. 2, pp. 1452–1455, Detroit, Mich, USA, May 1995.

[39] E. Ambikairajah, J. Epps, and L. Lin, “Wideband speech and audio coding using gammatone filter banks,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’01), vol. 2, pp. 773–776, Salt Lake City, Utah, USA, May 2001.

[40] M. Stocker, Efficient coding methods for a perceptual speech coder, M.S. thesis, Institute of Communications and Wave Propagation, Graz University of Technology, Graz, Austria, 2003.

[41] C. Feldbauer and G. Kubin, “How sparse can we make the auditory representation of speech?” in Proc. 8th International Conference on Spoken Language Processing (ICSLP ’04), Jeju Island, Korea, October 2004.

[42] Y. Wang, “Multiple description coding using non-hierarchical signal decomposition,” in Proc. European Conference Signal Processing (EUSIPCO ’98), pp. 233–236, Rhodes, Greece, September 1998.

[43] M. Slaney, “An efficient implementation of the Patterson-Holdsworth auditory filter bank,” Tech. Rep. 35, Apple Computer, New York, NY, USA, 1993.

[44] L. Lin, W. Holmes, and E. Ambikairajah, “Auditory filter bank inversion,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS ’01), vol. 2, pp. 537–540, Sydney, Australia, May 2001.

[45] A. Oppenheim, D. Johnson, and K. Steiglitz, “Computation of spectra with unequal resolution using the fast Fourier transform,” Proc. IEEE, vol. 59, no. 2, pp. 299–301, 1971.

[46] P. Vary, “Ein Beitrag zur Kurzzeitspektralanalyse mit digitalen Systemen,” Ausgewahlte Arbeiten uber Nachrichtensysteme 32, Universitat Erlangen, Erlangen, Germany, 1978.

[47] E. Galijasevic, “Design of allpass-based non-uniform oversampled DFT filter banks,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’02), vol. 2, pp. 1181–1184, Orlando, Fla, USA, May 2002.

[48] J. Smith and J. Abel, “Bark and ERB bilinear transforms,” IEEE Trans. Speech Audio Processing, vol. 7, no. 6, pp. 697–708, 1999.

[49] C. Feldbauer and G. Kubin, “Critically sampled frequency-warped perfect reconstruction filterbank,” in Proc. European Conference on Circuit Theory and Design (ECCTD ’03), vol. 3, pp. 109–112, Krakow, Poland, September 2003.

[50] E. Zwicker and E. Terhardt, “Analytical expressions for critical-band rate and critical bandwidth as a function of frequency,” Journal of the Acoustical Society of America, vol. 68, pp. 1523–1525, 1980.

Christian Feldbauer was born in Grieskirchen, Austria, on May 31, 1976. He received the Dipl.-Ing. degree in electrical engineering/sound engineering from the Graz University of Technology (TUG), Austria, in 2000. The work presented here is part of his Ph.D. research performed at the TUG. Since 2001, he has been a Research and Teaching Assistant at the Signal Processing and Speech Communication Laboratory at the TUG. He was a Guest Researcher at the KTH, Stockholm, in summer 2003 and at the University of Sherbrooke, Canada, in summer 2004. His research interests are in anthropomorphic coding and perception mechanisms for speech and audio, general speech and signal processing, as well as applications and theory of adaptive filters.


Gernot Kubin was born in Vienna, Austria, on June 24, 1960. He received his Dipl.-Ing. (1982) and Dr. Techn. (1990, sub auspiciis praesidentis) degrees in electrical engineering from TU Vienna. He is a Professor of nonlinear signal processing and Head of the Signal Processing and Speech Communication Laboratory (SPSC), TU Graz, Austria, since September 2000. Earlier international appointments include CERN, Geneva, Switzerland (1980); TU Vienna (1983–2000); Erwin Schroedinger Fellow at Philips Natuurkundig Laboratorium, Eindhoven, The Netherlands (1985); AT&T Bell Labs, Murray Hill, USA (1992–1993 and 1995); KTH, Stockholm, Sweden (1998); Vienna Telecommunications Research Centre FTW (Key Researcher and Member of the Board, 1999–now); Global IP Sound, Sweden and USA (Scientific Consultant, 2000–2001); Christian Doppler Laboratory for Nonlinear Signal Processing (Founding Director, 2002–now). He is a Member of the Board of the Austrian Acoustics Association and Vice Chair for the European COST Action 277, Nonlinear Speech Processing. He has authored or coauthored over ninety peer-reviewed publications and three patents.

W. Bastiaan Kleijn holds a Ph.D. degree in electrical engineering from Delft University of Technology, the Netherlands, a Ph.D. in soil science and an M.S. degree in physics, both from the University of California, and an M.S. degree in electrical engineering from Stanford University. He worked on speech processing at AT&T Bell Laboratories from 1984 to 1996, first in development and later in research. Between 1996 and 1998, he held guest professorships at Delft University of Technology, the Netherlands, Vienna University of Technology, and KTH (Royal Institute of Technology), Stockholm. He is now a Professor at KTH and heads the Sound and Image Processing Laboratory in the School of Electrical Engineering. He is also a founder and former Chairman of Global IP Sound where he remains a Chief Scientist. He is on the Editorial Boards of IEEE Signal Processing Letters and IEEE Signal Processing Magazine and held similar positions at the IEEE Transactions on Speech and Audio Processing and the EURASIP Journal on Applied Signal Processing. He has been a Member of several IEEE technical committees, and a Technical Chair of ICASSP-99, the 1997 and 1999 IEEE Speech Coding Workshops, and a General Chair of the 1999 IEEE Signal Processing for Multimedia Workshop. He is a Fellow of the IEEE.


EURASIP Journal on Applied Signal Processing 2005:9, 1350–1364
© 2005 Hindawi Publishing Corporation

Neuromimetic Sound Representation for Percept Detection and Manipulation

Dmitry N. Zotkin
Perceptual Interfaces and Reality Laboratory, Institute for Advanced Computer Studies (UMIACS), University of Maryland, College Park, MD 20742, USA
Email: [email protected]

Taishih Chi
Neural Systems Laboratory, The Institute for Systems Research, University of Maryland, College Park, MD 20742, USA
Email: [email protected]

Shihab A. Shamma
Neural Systems Laboratory, The Institute for Systems Research, University of Maryland, College Park, MD 20742, USA
Email: [email protected]

Ramani Duraiswami
Perceptual Interfaces and Reality Laboratory, Institute for Advanced Computer Studies (UMIACS), University of Maryland, College Park, MD 20742, USA
Email: [email protected]

Received 2 November 2003; Revised 4 August 2004

The acoustic wave received at the ears is processed by the human auditory system to separate different sounds along the intensity, pitch, and timbre dimensions. Conventional Fourier-based signal processing, while endowed with fast algorithms, is unable to easily represent a signal along these attributes. In this paper, we discuss the creation of maximally separable sounds in auditory user interfaces and use a recently proposed cortical sound representation, which performs a biomimetic decomposition of an acoustic signal, to represent and manipulate sound for this purpose. We briefly overview algorithms for obtaining, manipulating, and inverting a cortical representation of a sound and describe algorithms for manipulating signal pitch and timbre separately. The algorithms are also used to create sound of an instrument between a “guitar” and a “trumpet.” Excellent sound quality can be achieved if processing time is not a concern, and intelligible signals can be reconstructed in reasonable processing time (about ten seconds of computational time for a one-second signal sampled at 8 kHz). Work on bringing the algorithms into the real-time processing domain is ongoing.

Keywords and phrases: anthropomorphic algorithms, pitch detection, human sound perception.

1. INTRODUCTION

When a natural sound source such as a human voice or a musical instrument produces a sound, the resulting acoustic wave is generated by a time-varying excitation pattern of a possibly time-varying acoustical system, and the sound characteristics depend both on the excitation signal and on the production system. The production system (e.g., human vocal tract, the guitar box, or the flute tube) has its own characteristic response. Varying the excitation parameters produces a sound signal that has different frequency components, but still retains perceptual characteristics that uniquely identify the production instrument (identity of the person, type of instrument: piano, violin, etc.), and even the specific type of piano on which it was produced. When one is asked to characterize this sound source using descriptions based on Fourier analysis, one discovers that concepts such as frequency and amplitude are insufficient to explain such perceptual characteristics of the sound source. Human linguistic descriptions that characterize the sound are expressed in terms of pitch and timbre. The goal of anthropomorphic algorithms is to reproduce these percepts quantitatively.

The perceived sound pitch is closely coupled with its harmonic structure and frequency of the first harmonic, or F0. On the other hand, the timbre of the sound is defined broadly as everything other than the pitch, the loudness, and the spatial location of the sound. For example, two musical instruments might have the same pitch if they play the same note, but it is their differing timbre that allows us to distinguish between them. Specifically, the spectral envelope and the spectral envelope variations in time that include, in particular, onset and offset properties of the sound are related to the timbre percept.

Most conventional techniques of sound manipulation result in simultaneous changes in both the pitch and the timbre and cannot be used to control or assess the effects in pitch and timbre dimensions independently. A goal of this paper is the development of controls for independent manipulation of pitch and timbre of a sound source using a cortical sound representation introduced in [2], where it was used for assessment of speech intelligibility and for prediction of the cortical response to an arbitrary stimulus, and later extended in [3] providing fuller mathematical details as well as addressing invertibility issues. We simulate the multiscale audio representation and processing believed to occur in the primate brain [4], and while our sound decomposition is partially similar to existing pitch and timbre separation and sound morphing algorithms (in particular, MFCC decomposition algorithm in [5], sinusoid-plus-noise model and effects generated with it in [6], and parametric source models using LPC and physics-based synthesis in [7]), the neuromorphic framework provides a view of processing from a different perspective, supplies supporting evidence to justify the procedure performed, tailors it to the way the human nervous system processes auditory information, and extends the approach to include decomposition in the time domain in addition to frequency. We anticipate our algorithms to be applicable in several areas, including musical synthesis, audio user interfaces, and sonification.

In Section 2, we discuss the potential applications for the developed framework. In Sections 3 and 4, we describe the processing of the audio information through the cortical model [3] in forward and backward directions, respectively, and in Section 5, we propose an alternative, faster implementation of the most time-consuming cortical processing stage. We discuss the quality of audio signal reconstruction in Section 6 and show examples of timbre-preserving pitch manipulation of speech and timbre interpolation of musical notes in Sections 7 and 8, respectively. Finally, Section 9 concludes the paper.

2. APPLICATIONS

The direct application that motivated us to undertake the research described (and the area it is currently being used in) is the development of advanced auditory user interfaces. Auditory user interfaces can be broadly divided into two groups, based on whether speech or nonspeech audio signals are used in the interface. The field of sonification [8] (“. . . use of nonspeech audio to convey information”) presents multiple challenges to researchers in that they must both identify and manipulate different percepts of sound to represent different parameters in a data stream while at the same time creating efficient and intuitive mappings of the data from the numerical domain to the acoustical domain. An extensive resource describing sonification work is the International Community for Auditory Display (ICAD) web page (see http://www.icad.org/), which includes past conference proceedings. While there are some isolated examples of useful sonifications and attempts at creating multidimensional audio interfaces (e.g., the Geiger counter or the pulse oximeter [9]), the field of sonification, and as a consequence audio user interfaces, is still in its infancy due to the lack of a comprehensive theory of sonification [10].

What is needed for advancement in this area is the identification of perceptually valid attributes (“dimensions”) of sound that can be controlled; theory and algorithms for sound manipulation that allow control of these dimensions; psychophysical proof that these control dimensions convey information to a human observer; methods for easy-to-understand mapping of data to the auditory domain; technology to create user interfaces using these manipulations; and refinement of acoustic user interfaces to perform some specific example tasks. Our research addresses some of these issues and creates the basic technology for manipulation of existing sounds and synthesis of new sounds achieving specified attributes along the perceptual dimensions. We focus on neuromorphic-inspired processing of the pitch and timbre percepts; the location and ambience percepts were described earlier in [11]. Our real-time pitch-timbre modification and scene rendering algorithms are capable of generating stable virtual acoustic objects whose attributes can be manipulated in these perceptual dimensions.

The same set of percepts may be modified in the case when speech signals are used in audio user interfaces. However, the purpose of percept modification in this case is not to convey information directly but rather to allow for maximally distinguishable and intelligible perception of (possibly several simultaneous) speech streams under stress conditions using the natural neural auditory dimensions. Applications in this area might include, for example, an audio user interface for a soldier where multiple sound streams are to be attended to simultaneously. To our knowledge, much research has been devoted to selective attention to one signal from a group [12, 13, 14, 15, 16] (the well-known “cocktail party effect” [17]), and there have only been a limited number of studies (e.g., [18, 19]) on how well a person can simultaneously perceive and understand multiple concurrent speech streams. The general results obtained in these two papers suggest that increasing separation along most of the perceptual characteristics leads to improvement in the recognition rate for several competing messages. The characteristic that provides the most improvement is the spatial separation of the sounds, which is beyond the scope of this paper; these spatialization techniques are well described in [11]. Pitch was a close second, and in Section 7 of this paper, we present a cortical-representation-based pitch manipulation algorithm that can be used to achieve the desired perceptual separation of the sounds. Timbre manipulations did not result in significant improvements in recognition rate in [18, 19], though.

Another area where we anticipate our algorithms to be applicable is musical synthesis. Synthesizers often use sampled sound that has to be pitch shifted to produce different notes [7]. Simple resampling that was widely used in the past in commercial-grade music synthesizers preserves neither the spectral nor the temporal envelope (onset and decay ratios) of an instrument. More recent wavetable synthesizers can impose the correct temporal envelope on the sound but may still distort the spectral envelope. The spectral and the temporal envelopes are parts of the timbre percept, and their incorrect manipulation can lead to poor perceptual quality of the resulting sound samples.

The timbre of the instrument usually depends on the size and the shape of the resonator; it is interesting that for some instruments (piano, guitar), the resonator shape (which determines the spectral envelope of the produced sound) does not change when different notes are played, and for others (flute, trumpet), the length of the resonating air column changes as the player opens different holes in the tube to produce different notes. The timbre-preserving pitch modification algorithm described in Section 7 provides a physically correct pitch manipulation technique for instruments with the resonator shape independent of the note played. It is also possible to perform timbre interpolation between sound samples; in Section 8, we describe the synthesis of a new musical instrument with the perceptual timbre lying in between two known instruments, the guitar and the trumpet. The synthesis is performed in the timbre domain, and then a timbre-preserving pitch shift described in Section 7 is applied to form different notes of the new instrument. Both operations use the cortical representation, which turned out to be extremely useful for separate manipulations of percepts.

3. THE CORTICAL MODEL

In a complex acoustic environment, sources may simultaneously change their loudness, location, timbre, and pitch. Yet, humans are able to integrate effortlessly the multitude of cues arriving at their ears and derive coherent percepts and judgments about each source [20]. The cortical model is a computational model for how the brain is able to obtain these features from the acoustic input it receives. Physiological experiments have revealed the elegant multiscale strategy developed in the mammalian auditory system for coding of spectro-temporal characteristics of the sound [4, 21]. The primary auditory cortex (AI), which receives its input from the thalamus, employs a multiscale representation in which the dynamic spectrum is repeatedly represented in AI at various degrees of spectral and temporal resolutions. This is accomplished by cells whose responses are selective to a range of spectro-temporal parameters such as the local bandwidth, the symmetry, and onset and offset transition rates of the spectral peaks. Similarly, psychoacoustical investigations have shed considerable light on the way we form and label sound images based on relationships among their physical parameters [20]. A mathematical model of the early and the central stages of auditory processing in mammals was recently developed and described in [2, 3]. It is a basis for our work and is briefly summarized here; a full formulation of the model is available in [3] and analysis code in the form of a Matlab toolbox (“NSL toolbox”) can be downloaded from http://www.isr.umd.edu/CAAR/pubs.html.

Figure 1: Tuning curves for cochlear filter bank filters tuned at 180 Hz, 510 Hz, and 1440 Hz (channels 24, 60, and 96), respectively.

The model consists of two basic stages. The first stage of the model is an early auditory stage, which models the transformation of an acoustic signal into an internal neural representation, called the “auditory spectrogram.” The second is a central stage, which analyzes the spectrogram to estimate its spectro-temporal features, specifically its spectral and temporal modulations, using a bank of modulation selective filters mimicking those described in the mammalian primary auditory cortex.

The first processing stage converts the audio signal s(t) into an auditory spectrogram representation y(t, x) (where x is the frequency on a logarithmic frequency axis) and consists of a sequence of three steps described below.

(i) In the analysis step, the acoustic wave creates a complex pattern of mechanical vibrations on the basilar membrane in the mammalian cochlea. For an acoustic tone of a given frequency, the amplitude of the traveling wave induced in the membrane slowly increases along it up to a certain point x and then sharply decreases. The position of the point x depends on the frequency, with different frequencies resonating at different points along the membrane. These maximum response points create a tonotopical frequency axis with frequencies approximately logarithmically decreasing from the base of the cochlea. This process is simulated by a cochlear filter bank, that is, a bank of highly asymmetric constant-Q bandpass filters (also called channels) spaced equally over the log-frequency axis; we denote the impulse response of each filter by h(t; x). In the implementation of the model that we use, there are 128 channels with 24 channels per octave, covering a total of 5 1/3 octaves with the lowest channel frequency of 90 Hz and an equivalent rectangular bandwidth (ERB) filter quality QERB ≈ 4. Figure 1 shows the frequency-response curves of a few cochlear filters.


(ii) In the transduction step, the mechanical vibrations of the membrane are transduced into the intracellular potential of the inner hair cells. Membrane displacements cause the flow of liquid in the cochlea to bend the cilia (tiny hair-like formations) that are attached to the inner hair cells. This bending opens the cell channels and enables ionic current to flow into the cell and to change its electric potential, which is later transmitted by auditory nerve fibers to the cochlear nucleus. In the model, these steps are simulated by a highpass filter (equivalent to taking a time-derivative operation), nonlinear compression g(z), and the lowpass filter w(t) with cutoff frequency of 2 kHz, representing the fluid-cilia coupling, ionic channel current, and hair cell membrane leakage, respectively.

(iii) Finally, in the reduction step, the input to the anteroventral cochlear nucleus undergoes lateral inhibition operation followed by envelope detection. Lateral inhibition effectively enhances the frequency selectivity of the cochlear filters from Q ≈ 4 to Q ≈ 12 and is modeled by a spatial derivative across the channel array. Then, the nonnegative response of the lateral inhibitory network neurons is modeled by a half-wave rectifier, and an integration over a short window, µ(t; τ) = e^{−t/τ}, with τ = 8 milliseconds, is performed to model the slow adaptation of the central auditory neurons.

In mathematical form, the three steps described above can be expressed as

\[
\begin{aligned}
y_1(t, x) &= s(t) \oplus h(t; x),\\
y_2(t, x) &= g\bigl(\partial_t y_1(t, x)\bigr) \oplus w(t),\\
y(t, x) &= \max\bigl(\partial_x y_2(t, x), 0\bigr) \oplus \mu(t; \tau),
\end{aligned} \tag{1}
\]

where ⊕ denotes a convolution with respect to t.

The above sequence of operations essentially consists of a bank of constant-Q filters with some additional operations and efficiently computes the time-frequency representation of the acoustic signal, which is called the auditory spectrogram (Figure 2). The auditory spectrogram is invertible through an iterative process (described in the next section); perceptually perfect inversion can be achieved, albeit at a very significant computational expense. A time slice of the spectrogram is called the auditory spectrum.
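A schematic rendering of (1) is given below. It keeps the structure of the three steps but substitutes simple stand-ins for the kernels (an IIR gammatone bank for h(t; x), tanh for the compressive nonlinearity g, and first-order lowpass filters for w(t) and µ(t; τ)); none of these stand-ins is claimed to match the kernels of [3], and the tonotopic axis simply follows the 90 Hz, 24-channels-per-octave spacing quoted above.

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def auditory_spectrogram(s, fs, centre_freqs, tau=0.008):
    """Schematic version of the three steps in (1), with stand-in kernels:
    gammatone filters for h(t; x), tanh for g(.), and first-order lowpass
    filters for the hair-cell lowpass w(t) and the window mu(t; tau)."""
    # (i)   analysis: cochlear filterbank
    y1 = np.stack([lfilter(*gammatone(fc, 'iir', fs=fs), s) for fc in centre_freqs])
    # (ii)  transduction: time derivative, compression, ~2 kHz lowpass
    aw = np.exp(-2.0 * np.pi * 2000.0 / fs)
    y2 = lfilter([1.0 - aw], [1.0, -aw], np.tanh(np.diff(y1, axis=1)), axis=1)
    # (iii) reduction: derivative across channels, half-wave rectification,
    #       leaky integration with time constant tau (8 ms in the model)
    am = np.exp(-1.0 / (tau * fs))
    y = lfilter([1.0 - am], [1.0, -am], np.maximum(np.diff(y2, axis=0), 0.0), axis=1)
    return y

fs = 8000
# Tonotopic axis: 128 channels, 24 per octave, lowest channel at 90 Hz, so that
# channels 24, 60, and 96 fall near 180, 510, and 1440 Hz as in Figure 1.
centre_freqs = 90.0 * 2.0 ** (np.arange(128) / 24.0)
y = auditory_spectrogram(np.random.randn(fs), fs, centre_freqs)
print(y.shape)   # (127, 7999): one fewer channel and sample due to the derivatives
```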

The second processing stage mimics the action of the higher central auditory stages (especially the primary auditory cortex). We provide a mathematical derivation (as presented in [3]) of the cortical representation below, as well as qualitatively describe the processing.

The existence of a wide variety of neuron spectro-temporal response fields (SRTF) covering a range of frequency and temporal characteristics [21] suggests that they may, as a population, perform a multiscale analysis of their input spectral profile. Specifically, the cortical stage estimates the spectral and temporal modulation content of the auditory spectrogram using a bank of modulation selective filters h(t, x; ω, Ω, ϕ, θ). Each filter is tuned (Q = 1) to a combination of a particular spectral modulation and a particular temporal modulation of the incoming signal, and filters are centered at different frequencies along the tonotopical axis.

Figure 2: Example auditory spectrogram for the sentence “This movie is provided . . . ”.

These two types of modulations are defined as follows.

(i) Temporal modulation, which defines how fast the signal energy is increasing or decreasing along the time axis at a given time and frequency. It is characterized by the parameter ω, which is referred to as rate or velocity and is measured in Hz, and by characteristic temporal modulation phase ϕ.

(ii) Spectral modulation, which defines how fast the signal energy varies along the frequency axis at a given time and frequency. It is characterized by the parameter Ω, which is referred to as density or scale and is measured in cycles per octave (CPO), and by characteristic spectral modulation phase θ.

The filters are designed for a range of rates from 2 to 32 Hz and scales from 0.25 to 8 CPO, which corresponds to the ranges of neuron spectro-temporal response fields found in the primate brain. The impulse response function for the filter h(t, x; ω, Ω, ϕ, θ) can be factored into a spectral part h_s(x; Ω, θ) and a temporal part h_t(t; ω, ϕ). The spectral impulse response function h_s(x; Ω, θ) is defined through a phase interpolation of the spectral filter seed function u(x; Ω) with its Hilbert transform û(x; Ω), and the temporal impulse response function is similarly defined via the temporal filter seed function v(t; ω):

\[
\begin{aligned}
h_s(x; \Omega, \theta) &= u(x; \Omega)\cos\theta + \hat{u}(x; \Omega)\sin\theta,\\
h_t(t; \omega, \varphi) &= v(t; \omega)\cos\varphi + \hat{v}(t; \omega)\sin\varphi.
\end{aligned} \tag{2}
\]

The Hilbert transform is defined as

\[
\hat{f}(x) = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{f(z)}{z - x}\, dz. \tag{3}
\]

We choose

\[
u(x) = \bigl(1 - x^2\bigr)\, e^{-x^2/2}, \qquad v(t) = e^{-t}\sin(2\pi t) \tag{4}
\]

as the functions that produce the basic seed filter tuned to a scale of 1 CPO and a rate of 1 Hz. Figure 3 shows its spectral and temporal responses generated by functions u(x) and v(t), respectively.

Figure 3: Tuning curves for the basis (seed) filter for the rate-scale decomposition. The seed filter is tuned to the rate of 1 Hz and the scale of 1 CPO. (a) Spectral response. (b) Temporal response.

Differently tuned filters are obtained by dilation or compression of the filter (4) along the spectral and temporal axes:

\[
u(x; \Omega) = \Omega\, u(\Omega x), \qquad v(t; \omega) = \omega\, v(\omega t). \tag{5}
\]
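The seed functions (4), their Hilbert transforms, the phase interpolation (2), and the dilations (5) are straightforward to evaluate numerically. The sketch below uses scipy.signal.hilbert for the discrete Hilbert transform (whose sign convention may differ from (3) by an overall sign); the sampling grids for x and t and the parameter values are our own choices.

```python
import numpy as np
from scipy.signal import hilbert

def u(x):                       # spectral seed (4), tuned to 1 cycle/octave
    return (1.0 - x ** 2) * np.exp(-x ** 2 / 2.0)

def v(t):                       # temporal seed (4), tuned to 1 Hz
    return np.exp(-t) * np.sin(2.0 * np.pi * t)

def dilate_u(x, Omega):         # (5): scale-tuned spectral seed
    return Omega * u(Omega * x)

def dilate_v(t, omega):         # (5): rate-tuned temporal seed
    return omega * v(omega * t)

# Sampled seeds and the phase-interpolated impulse responses of (2).
x = np.linspace(-4.0, 4.0, 257)          # octaves around the centre frequency
t = np.linspace(0.0, 8.0, 801)           # seconds
Omega, omega, theta, phi = 2.0, 4.0, 0.3, 0.0

us, vs = dilate_u(x, Omega), dilate_v(t, omega)
u_hat = np.imag(hilbert(us))             # discrete Hilbert transform of the seed
v_hat = np.imag(hilbert(vs))
h_s = us * np.cos(theta) + u_hat * np.sin(theta)      # (2), spectral part
h_t = vs * np.cos(phi) + v_hat * np.sin(phi)          # (2), temporal part
print(h_s.shape, h_t.shape)
```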

The response r_c(t, x) of a cell c with parameters ω_c, Ω_c, ϕ_c, θ_c to the signal producing an auditory spectrogram y(t, x) can therefore be obtained as

\[
r_c(t, x; \omega_c, \Omega_c, \varphi_c, \theta_c) = y(t, x) \otimes h(t, x; \omega_c, \Omega_c, \varphi_c, \theta_c), \tag{6}
\]

where ⊗ denotes a convolution both on x and on t.

An alternative representation of the filter can be derived in the complex domain. Denote

\[
\begin{aligned}
h_s(x; \Omega) &= u(x; \Omega) + j\,\hat{u}(x; \Omega),\\
h_t(t; \omega) &= v(t; \omega) + j\,\hat{v}(t; \omega),
\end{aligned} \tag{7}
\]

where j = √−1. Convolution of y(t, x) with a downward-moving SRTF obtained as h_s(x; Ω) h_t(t; ω) and an upward-moving SRTF obtained as h_s(x; Ω) h_t*(t; ω) (where the asterisk denotes complex conjugation) results in two complex response functions:

\[
\begin{aligned}
z_d(t, x; \omega_c, \Omega_c) &= y(t, x) \otimes \bigl[h_s(x; \Omega_c)\, h_t(t; \omega_c)\bigr] = \bigl|z_d(t, x; \omega_c, \Omega_c)\bigr|\, e^{j\psi_d(t, x; \omega_c, \Omega_c)},\\
z_u(t, x; \omega_c, \Omega_c) &= y(t, x) \otimes \bigl[h_s(x; \Omega_c)\, h_t^*(t; \omega_c)\bigr] = \bigl|z_u(t, x; \omega_c, \Omega_c)\bigr|\, e^{j\psi_u(t, x; \omega_c, \Omega_c)},
\end{aligned} \tag{8}
\]

and it can be shown [3] that

\[
r_c(t, x; \omega_c, \Omega_c, \varphi_c, \theta_c) = \tfrac{1}{2}\Bigl[\bigl|z_d\bigr|\cos\bigl(\psi_d - \varphi_c - \theta_c\bigr) + \bigl|z_u\bigr|\cos\bigl(\psi_u + \varphi_c - \theta_c\bigr)\Bigr] \tag{9}
\]

(the arguments of z_d, z_u, ψ_d, and ψ_u are omitted here for clarity). Thus, the complex wavelet transform (8) uniquely determines the response of a cell with parameters ω_c, Ω_c, ϕ_c, θ_c to the stimulus, resulting in a dimensionality reduction effect in the cortical representation. In other words, knowledge of the complex-valued functions z_d(t, x; ω_c, Ω_c) and z_u(t, x; ω_c, Ω_c) fully specifies the six-dimensional cortical representation r_c(t, x; ω_c, Ω_c, ϕ_c, θ_c). The cortical representation thus can be obtained by performing (8), which results in a four-dimensional (time, frequency, rate, and scale) hypercube of (complex) filter coefficients that can be manipulated as desired and inverted back into the audio signal domain.
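The forward transform (8) can be sketched as a set of 2D convolutions of the auditory spectrogram with complex filters built from the analytic seeds of (7). Everything below that is not stated in the paper (the spectrogram frame rate, the toy random spectrogram, and the exact sampling grids) is an assumption made for illustration only.

```python
import numpy as np
from scipy.signal import hilbert, fftconvolve

def cortical_filter(rate, scale, t, x):
    """Complex downward/upward filters h_s(x; scale) h_t(t; rate), cf. (7)-(8)."""
    u = scale * (1.0 - (scale * x) ** 2) * np.exp(-(scale * x) ** 2 / 2.0)
    v = rate * np.exp(-rate * t) * np.sin(2.0 * np.pi * rate * t)
    hs, ht = hilbert(u), hilbert(v)        # analytic seeds: seed + j * Hilbert{seed}
    down = np.outer(hs, ht)                # frequency axis first, time axis second
    up = np.outer(hs, np.conj(ht))
    return down, up

# Toy auditory spectrogram y(t, x): 128 channels, 1 s at an assumed 125 frames/s.
frame_rate, n_chan, n_frames = 125.0, 128, 125
y = np.abs(np.random.randn(n_chan, n_frames))
t = np.arange(n_frames) / frame_rate                  # seconds
x = (np.arange(n_chan) - n_chan // 2) / 24.0          # octaves (24 channels/octave)

zd_all, zu_all = [], []
for rate in (2.0, 4.0, 8.0, 16.0, 32.0):              # Hz
    for scale in (0.25, 0.5, 1.0, 2.0, 4.0, 8.0):     # cycles/octave
        down, up = cortical_filter(rate, scale, t, x)
        zd_all.append(fftconvolve(y, down, mode='same'))   # z_d(t, x; rate, scale)
        zu_all.append(fftconvolve(y, up, mode='same'))     # z_u(t, x; rate, scale)

Z = np.stack([zd_all, zu_all])   # (2, rate-scale combinations, channels, frames)
print(Z.shape)
```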

Essentially, the filter output is computed by a convolution of its spectro-temporal impulse response (STIR) with the input auditory spectrogram, producing a modified spectrogram. Since the spectral and temporal cross-sections of an STIR are typical of a bandpass impulse response in having alternating excitatory and inhibitory fields, the output at a given time-frequency position of the spectrogram is large only if the spectrogram modulations at that position are tuned to the rate, scale, and direction of the STIR. A map of the responses across the filter bank provides a unique characterization of the spectrogram that is sensitive to the spectral shape and dynamics over the entire stimulus.

To emphasize the features of the model that are important for the current work, note that every filter in the rate-scale analysis responds well to the auditory spectrogram features that have high correlation with the filter shape. The filter shown in Figure 3 is tuned to the scale of 1 CPO and essentially extracts features that are of about this particular width on the log-frequency axis. A scale analysis performed with filters of different tuning (different width) will thus decompose the spectrogram into sets of decomposition coefficients for different scales, separating the “wide” features of the spectrogram from the “narrow” features. Some manipulations can then be performed on parts of the decomposed spectrogram, and a modified auditory spectrogram can be obtained by inverse filtering. Similarly, rate decomposition allows for segregation of “fast” and “slow” dynamic events along the temporal axis. A sample scale analysis of the auditory spectrogram is presented in Figure 4 (Figure 4h is the auditory spectrum, Figure 4a is the DC level of the signal which is necessary for the reconstruction, and the remaining 6 plots show the results of processing of the given auditory spectrum with filters of scales ranging from 0.25 to 8 CPO), and the rate analysis is similar.

Figure 4: Sample scale decomposition of (h) the auditory spectrum using different scales: (a) DC, (b) 0.25, (c) 0.5, (d) 1.0, (e) 2.0, (f) 4.0, and (g) 8.0.

Additional useful insights into the rate-scale analysis can be obtained if we consider it as a two-dimensional wavelet decomposition of an auditory spectrogram using a set of basis functions, which are called sound ripples. The sound ripple is simply a spectral ripple that drifts upwards or downwards in time at a constant velocity and is characterized by the same two parameters: scale (density of peaks per octave) and rate (number of peaks drifting past any fixed point on the log-frequency axis per 1-second time frame). Thus, an upward ripple with scale 1 CPO and rate 1 Hz has alternating peaks and valleys in its spectrum with 1 CPO periodicity, and the spectrum shifts up along the time axis, repeating itself with 1 Hz periodicity (Figure 5). If this ripple is used as an input audio signal for the cortical model, strong localized response is seen at the filter with the corresponding selectivity of ω = 1 Hz, Ω = 1 CPO. All other basis functions are obtained by dilation (compression) of the seed-sound ripple (Figure 5) in both time and frequency. (The difference between the ripples and the filters used in the cortical model is that the seed spectro-temporal response used in the cortical model (4) and shown in Figure 3 is local; the seed-sound ripple can be obtained from it by reproducing the spatial response at every octave and removing the time decay from the time response, and multiscale decomposition can then be viewed as overlapping the auditory spectrogram with different sound ripples and performing local cross-correlations at various places over the spectrogram.) In Figure 6, we show the result of filtering of the sample spectrogram shown earlier using two particular differently tuned filters, one with ω = 8 Hz, Ω = 0.5 CPO, and the other with ω = −2 Hz, Ω = 2 CPO. It can be seen that the filter output is the highest when the spectrogram features match the tuning of the filter both in rate and scale.

Figure 5: Sound ripple at the scale of 1 CPO and the rate of 1 Hz.

As such, to obtain a multiscale representation of the auditory spectrogram, complex filters having the “local” sound ripples (5) of different rates, scales, and central frequencies as their real parts and Hilbert transforms of these ripples as their imaginary parts are applied to the input audio signal as a wavelet transform (8). The result of this decomposition is a four-dimensional hypercube of complex filter coefficients that can be modified and inverted back to the acoustic signal. The phase of the coefficient shows the best-fitting direction of the filter over a particular location of the auditory spectrogram. This four-dimensional hypercube is called the cortical representation of the sound. It can be manipulated to produce desired effects on the sound, and in the following sections, we show some of the possible sound modifications.

In the cortical representation, two-dimensional rate-scale slices of the hypercube reveal the features of the signal that are most prominent at a given time. The rate-scale plot evolves in time to reflect the changing ripple content of the spectrogram. Examples of rate-scale plots are shown in Figure 7, where the brightness of the pixel located at the intersection of particular rate and scale values corresponds to the magnitude of response of the filter tuned to these rate and scale values. For simplification of data presentation, these plots are obtained by integration of the response magnitude over the tonotopical axis. The first plot is a response of the cortical model to a single downward-moving sound ripple with ω = 3 Hz, Ω = 2 CPO; the best-matching filter (or, in other words, the “neuron” with the corresponding SRTF) responds best. The responses of the 2 Hz and 4 Hz units are not equal here because of the cochlear filter bank asymmetry in the early processing stage. The other three plots show the evolution of the rate-scale response at different time instants of the sample auditory spectrogram shown in Figure 2 (at approximately 600, 720, and 1100 milliseconds, respectively); one can indeed trace the plot time stamps back to the spectrogram and see that the spectrogram has mostly sparse downward-moving and mostly dense upward-moving features appearing before the 720- and 1100-millisecond marks, respectively. The peaks in the test sentence plots are sharper in rate than in scale, which can be explained by the integration performed over the tonotopical axis in these plots (the speech signal is unlikely to elicit significantly different rate-scale maps at different frequencies anyway because it consists mostly of equispaced harmonics that can rise or fall only in unison, so the rate at which the highest response is seen is not likely to differ at different points on the tonotopical axis; the prevalent scale does change somewhat though due to the higher number of harmonics per octave at higher frequencies).

4. RECONSTRUCTING THE AUDIO FROM THE MODEL

After altering the cortical representation, it is necessary to reconstruct the modified audio signal. Just as with the forward path, the reconstruction consists of two steps, corresponding to the central processing stage and the early processing stage. The first step is the inversion of the cortical multiscale representation back to a spectrogram. It is a one-step inverse wavelet transform operation because of the linear nature of the transform (8), which in the Fourier domain can be written as

Z_d(\omega, \Omega; \omega_c, \Omega_c) = Y(\omega, \Omega)\, H_s(\Omega; \Omega_c)\, H_t(\omega; \omega_c),
Z_u(\omega, \Omega; \omega_c, \Omega_c) = Y(\omega, \Omega)\, H_s(\Omega; \Omega_c)\, H_t^*(-\omega; \omega_c),   (10)

where capital letters signify the Fourier transforms of the functions denoted by the corresponding lowercase letters. From (10), similar to the usual Fourier transform case, one can write the formula for the Fourier transform of the reconstructed auditory spectrogram y_r(t, x) in terms of its decomposition coefficients Z_d, Z_u as

Y_r(\omega, \Omega) = \frac{ \sum_{\omega_c, \Omega_c} Z_d(\omega, \Omega; \omega_c, \Omega_c)\, H_t^*(\omega; \omega_c)\, H_s^*(\Omega; \Omega_c) + \sum_{\omega_c, \Omega_c} Z_u(\omega, \Omega; \omega_c, \Omega_c)\, H_t(-\omega; \omega_c)\, H_s^*(\Omega; \Omega_c) }{ \sum_{\omega_c, \Omega_c} \big| H_t(\omega; \omega_c) H_s(\Omega; \Omega_c) \big|^2 + \sum_{\omega_c, \Omega_c} \big| H_t^*(-\omega; \omega_c) H_s(\Omega; \Omega_c) \big|^2 }.   (11)
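
As a concrete reading of (10)-(11), the following Python sketch performs the decomposition and the normalized reconstruction on the 2-D Fourier transform of a spectrogram. The array shapes, the precomputed filter responses Ht and Hs, and the reversal of the rate-frequency axis used as a stand-in for frequency negation are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def cortical_roundtrip(Y, Ht, Hs, eps=1e-12):
    """Sketch of (10)-(11).  Y is the 2-D Fourier transform of an auditory
    spectrogram, Ht[k] the rate-filter responses over the rate frequency,
    Hs[l] the scale-filter responses over the ripple frequency.
    Assumed shapes: Y -> (Nw, NW), Ht -> (K, Nw), Hs -> (L, NW)."""
    Ht4 = Ht[:, None, :, None]
    Hs4 = Hs[None, :, None, :]
    # Reversing the rate-frequency axis approximates Ht(-w) in this sketch.
    Ht_neg4 = Ht[:, ::-1][:, None, :, None]
    # Eq. (10): downward (Zd) and upward (Zu) decompositions.
    Zd = Y * Ht4 * Hs4
    Zu = Y * np.conj(Ht_neg4) * Hs4
    # Eq. (11): normalized summation over all filters (a pseudo-inverse).
    num = (Zd * np.conj(Ht4) * np.conj(Hs4)).sum(axis=(0, 1)) \
        + (Zu * Ht_neg4 * np.conj(Hs4)).sum(axis=(0, 1))
    den = (np.abs(Ht4 * Hs4) ** 2).sum(axis=(0, 1)) \
        + (np.abs(np.conj(Ht_neg4) * Hs4) ** 2).sum(axis=(0, 1))
    Yr = num / (den + eps)
    yr = np.fft.ifft2(Yr).real
    return np.maximum(yr, 0.0)   # rectify, as described in the text below
```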


Figure 6: Wavelet transform of a sample auditory spectrogram (shown in Figure 2) using two sound ripples: 0.5 CPO, 8 Hz upward and 2 CPO, 2 Hz downward (panels show frequency in Hz versus time in ms).

Then, y_r(t, x) is obtained by an inverse Fourier transform of Y_r(\omega, \Omega) and is rectified to ensure that the resulting spectrogram is positive. The subscript r here and below refers to the reconstructed version of the signal. Excellent reconstruction quality is obtained within the effective band because of the linear nature of the involved transformations.

The second step (going from the auditory spectrogram to the acoustic wave) is a complicated task due to the nonlinearity of the early auditory processing stage (nonlinear compression and half-wave rectification), which leads to the loss of the component phase information (because the auditory spectrogram contains only the magnitude of each frequency component), and a direct reconstruction cannot be performed. Therefore, the early auditory stage is inverted iteratively using a convex projection algorithm adapted from [22], which takes the spectrogram as an input and reconstructs the acoustic signal that produces the spectrogram closest to the given one.

Assume that an auditory spectrogram y_r(t, x) is obtained using (11) after performing some manipulations in the cortical representation, and it is now necessary to invert it back to the acoustic signal s_r(t). Observe that the analysis (first) step of the early auditory processing stage is linear and thus invertible. If an output of the analysis step y_{1r}(t, x) is known, the acoustic signal s_r(t) can be obtained as

s_r(t) = \sum_x y_{1r}(t, x) \ast h(-t; x).   (12)

The challenge is to proceed back from y_r(t, x) to y_{1r}(t, x). In the convex projection method, an iterative adaptation of the estimate y_{1r}(t, x) is performed based on the difference between y_r(t, x) and the result of processing y_{1r}(t, x) through the second and third steps of the early auditory processing stage. The processing steps are listed below.

(i) Initialize the reconstructed signal s_r^{(1)}(t) with Gaussian-distributed white noise with zero mean and unit variance. Set the iteration counter k = 1.

(ii) Compute y_{1r}^{(k)}(t, x), y_{2r}^{(k)}(t, x), and y_r^{(k)}(t, x) from s_r^{(k)}(t) using (1).

(iii) Compute the ratio r^{(k)}(t, x) = y_r(t, x) / y_r^{(k)}(t, x).

(iv) Adjust y_{1r}^{(k+1)}(t, x) = r^{(k)}(t, x) y_{1r}^{(k)}(t, x).

(v) Compute s_r^{(k+1)}(t) using (12). Increase k by 1.

(vi) Repeat from step (ii) unless the preset number of iterations is reached or a certain quality criterion is met (e.g., the ratio r^{(k)}(t, x) is sufficiently close to unity everywhere).
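
A minimal Python sketch of steps (i)-(vi) follows. The callables early_analysis, early_rest, and invert_analysis are hypothetical stand-ins for the linear cochlear analysis step of (1), its remaining nonlinear steps, and the resynthesis of (12); they are not part of the paper.

```python
import numpy as np

def convex_projection_invert(y_target, n_samples, early_analysis, early_rest,
                             invert_analysis, n_iter=200, eps=1e-12, seed=0):
    """Sketch of steps (i)-(vi).  The three callables are placeholders:
    early_analysis(s) -> y1 (linear cochlear analysis, first step of (1)),
    early_rest(y1)    -> y  (remaining nonlinear steps of (1)),
    invert_analysis(y1) -> s (resynthesis, eq. (12))."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(n_samples)          # (i) zero-mean, unit-variance noise
    for _ in range(n_iter):
        y1 = early_analysis(s)                  # (ii) forward pass
        y = early_rest(y1)
        r = y_target / (y + eps)                # (iii) ratio of target to estimate
        y1 = r * y1                             # (iv) rescale the linear-stage output
        s = invert_analysis(y1)                 # (v) back to the waveform
    return s                                    # (vi) stop after a preset iteration count
```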

Sample auditory spectrograms of the original and the reconstructed signals are shown later, and the reconstruction quality for the speech signal after a sufficient number of iterations is very good.

5. ALTERNATIVE IMPLEMENTATION OF THE EARLY AUDITORY PROCESSING STAGE

An alternative, much faster implementation of the early auditory processing stage was developed and can best be used for a fixed-pitch signal (e.g., a musical instrument tone).


Figure 7: Rate-scale plots of the response of the cortical model to different stimuli (scale in CPO versus rate in Hz). (a) Response to a 2 CPO, 3 Hz downward sound ripple. (b)–(d) Response at different temporal positions within the sample auditory spectrogram presented in Figure 2 (at 600, 720, and 1100 milliseconds, respectively).

In this implementation, which we will refer to as a log-Fourier transform early stage, a simple Fourier transform is used in place of the processing described by (1). We take a short segment of the waveform s(t) at some time t(j) and perform a Fourier transform of it to obtain S(f). The S(f) is obviously discrete with a total of L/2 points on the linear frequency axis, where L is the length of the Fourier transform buffer. Some mapping must be established from the linear frequency axis f to the logarithmically growing tonotopical axis x. We divide the tonotopical axis into segments corresponding to channels. Assume that the cochlear filter bank has N channels per octave and that the lowest frequency of interest is f_0. Then, the lower x_l^{(i)} and the upper x_h^{(i)} frequency boundaries of the ith segment are set to be

x_l^{(i)} = f_0 2^{i/N}, \qquad x_h^{(i)} = f_0 2^{(i+1)/N}.   (13)

S(f) is then remapped onto the tonotopical axis. A point f on the linear frequency axis is said to fall into the ith segment on the tonotopical frequency axis if x_l^{(i)} < f \le x_h^{(i)}. The number of points that fall into a segment obviously depends on the segment length, which becomes bigger for higher frequencies (therefore the Fourier transform of s(t) must be performed with very high resolution and s(t) padded appropriately to ensure that at least a few points on the f-axis fall into the shortest segment on the x-axis). Spectral magnitudes are then averaged over all points on the f-axis that fall into the same segment i:

y_{\mathrm{alt}}(t(j), x^{(i)}) = \frac{1}{B^{(i)}} \sum_{x_l^{(i)} < f \le x_h^{(i)}} \big| S(f) \big|,   (14)

where B^{(i)} is the total number of points on the f-axis that fall into the ith segment on the x-axis (the number of terms in the summation), and the averaging is performed for all i, generating a time slice y_{\mathrm{alt}}(t(j), x). The process is carried out for all time segments of s(t), producing y_{\mathrm{alt}}(t, x), which can be substituted for the y(t, x) computed using (1) in all further processing.
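
The forward remapping of (13)-(14) can be sketched in a few lines of Python; the sampling rate, f_0, channel count, and padding factor below are illustrative values, not necessarily those used by the authors.

```python
import numpy as np

def log_fourier_slice(frame, fs, f0=100.0, n_per_octave=24, n_channels=128, pad_factor=8):
    """Sketch of (13)-(14): FFT one short segment of s(t) and average the
    magnitude spectrum within each logarithmically spaced channel band."""
    n_fft = pad_factor * len(frame)                 # heavy zero-padding, see the text
    S = np.fft.rfft(frame, n=n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    y = np.zeros(n_channels)
    for i in range(n_channels):
        lo = f0 * 2.0 ** (i / n_per_octave)         # x_l^(i), eq. (13)
        hi = f0 * 2.0 ** ((i + 1) / n_per_octave)   # x_h^(i), eq. (13)
        in_band = (freqs > lo) & (freqs <= hi)
        if np.any(in_band):
            y[i] = np.abs(S[in_band]).mean()        # eq. (14)
    return y
```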

The reconstruction proceeds in an inverse manner. At every time slice t(j), the set of y(t(j), x) is remapped to the magnitude spectrum S(f) on the linear frequency axis f so that

S(f) = \begin{cases} y(t(j), x^{(i)}) & \text{if for some } i, \; x_l^{(i)} < f \le x_h^{(i)}, \\ 0 & \text{otherwise}. \end{cases}   (15)

At this point, the magnitude information is set correctly in S(f) to perform an inverse Fourier transform, but the phase information is lost. Direct one-step reconstruction from S(f) is much faster than the iterative convex projection method described above but produces unacceptable results with clicks and strong interfering noise at the frequency corresponding to the processing window length. Processing the signal in heavily overlapping segments with gradual fade-in and fade-out windowing functions somewhat improves the results, but the reconstruction quality is still significantly below the quality achieved using the iterative projection algorithm described in Section 4.

One way to recover the phase information and to use one-step reconstruction of s(t) from the magnitude spectrum S(f) is to save the bin phases of the forward-pass Fourier transform and later impose them on S(f) after it is reconstructed from the (altered) cortical representation.


Significantly better continuity of the signal is obtained in this manner. However, it seems that the saved phases carry the imprint of the original pitch of the signal, which produces undesirable effects if the processing goal is to perform a pitch shift.

However, the negative effect of the phase set carrying the pitch imprint can be reversed and turned to advantage simply by generating the set of bin phases that corresponds to a desired pitch and imposing them on S(f). Of course, knowledge of the signal pitch is required in this case, which is not always easy to obtain. We have used this technique in performing timbre-preserving pitch shifts of musical instrument notes, where the exact original pitch F_0 (and therefore the exact shifted pitch F'_0) is known. To obtain the set of phases corresponding to the pitch F'_0, we generate, in the time domain, a pulse train of frequency F'_0 and take its Fourier transform with the same window length as used in the processing of S(f). The bin phases of the Fourier transform of the pulse train are then imposed on the magnitude spectrum S(f) obtained in (15). In this manner, very good results are obtained at reconstructing musical tones of a fixed frequency; it should be noted that such reconstruction is not handled well by the iterative convex projection method described above: the reconstructed signal is not a pure tone but rather constantly jitters up and down, preventing any musical perception, presumably because the time slices of s(t) are treated independently by the convex projection algorithm, which does not attempt to match signal features from adjacent time frames.
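
A sketch of the pulse-train phase imprinting might look as follows; the function name and the use of an rfft-length magnitude array (S_mag holding the bins of (15) for one window) are assumptions for illustration.

```python
import numpy as np

def imprint_pulse_train_phase(S_mag, f0_new, fs, win_len):
    """Sketch: impose the bin phases of a pulse train at the desired pitch F0'
    on the magnitude spectrum of one window (len(S_mag) == win_len // 2 + 1),
    then invert to the time domain in one step."""
    pulses = np.zeros(win_len)
    period = max(1, int(round(fs / f0_new)))
    pulses[::period] = 1.0                            # time-domain pulse train at F0'
    phases = np.angle(np.fft.rfft(pulses))            # phase set consistent with F0'
    return np.fft.irfft(S_mag * np.exp(1j * phases), n=win_len)
```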

Nevertheless, speech reconstruction is handled better by the significantly slower convex projection algorithm, because it is not clear how to select F'_0 to generate the phase set. If the log-Fourier transform early stage can be applied to speech signals, a significant processing speed-up can be achieved. A promising idea is to employ a pitch detection mechanism at each frame of s(t) to detect F_0, to compute F'_0, and to impose F'_0-consistent phases on S(f) to enable one-step recovery of s(t); this is the subject of ongoing work.

6. RECONSTRUCTION QUALITY

It is important to perform an objective evaluation of the reconstructed sound quality. The second (central) stage of the cortical model processing is perfectly invertible because of the linear nature of the wavelet transformations involved; it is the first (early) stage that presents difficulties for the inversion because of the phase information loss in the processing. Given the modified auditory spectrogram y_r(t, x), the convex projection algorithm described above tries to synthesize the intermediate result y_{1r}(t, x) that, when processed through the two remaining steps of the early auditory stage, yields a spectrogram \hat{y}_r(t, x) that is as close as possible to y_r(t, x). The waveform s_r(t) can then be directly reconstructed from y_{1r}(t, x). The reconstruction error measure E is defined as the average relative magnitude difference between the target y_r(t, x) and the candidate \hat{y}_r(t, x):

E = \frac{1}{B} \sum_{i,j} \frac{\big| y_r(t(j), x^{(i)}) - \hat{y}_r(t(j), x^{(i)}) \big|}{y_r(t(j), x^{(i)})},   (16)

where B is the total number of terms in the summation. During the iterative update of y_{1r}(t, x), the error E does not drop monotonically; instead, the lower the error, the higher the chance that the next iteration actually increases the error, in which case the newly computed y_{1r}(t, x) should be discarded and a new iteration should be started from the best previously found y_{1r}(t, x).
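
The error measure (16) and the keep-best bookkeeping described above can be written compactly; this is a sketch, with a small eps added to avoid division by zero in empty spectrogram bins.

```python
import numpy as np

def relative_error(y_target, y_candidate, eps=1e-12):
    """Eq. (16): average relative magnitude difference between spectrograms."""
    return np.mean(np.abs(y_target - y_candidate) / (y_target + eps))

def keep_best(y1, y1_best, err, err_best):
    """Non-monotone iteration guard described in the text: discard an update
    that increases the error and restart from the best estimate found so far."""
    if err < err_best:
        return y1, err          # accept the new estimate
    return y1_best, err_best    # roll back to the previous best
```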

In practical tests, it was found that the error E drops quickly to a few percent, and any further improvement requires very significant computational expense. For the purposes of illustration, we took the 1200-millisecond auditory spectrogram (Figure 2) and ran the convex projection algorithm on it. It takes about 2 seconds to execute one algorithm iteration on a 1.7 GHz Pentium computer. In this sample run, the error after 20, 200, and 2000 iterations was found to be 4.73%, 1.60%, and 1.08%, respectively, which is representative of the general behavior observed in many experiments.

In Figure 8a, the original waveform s(t) and its corresponding auditory spectrogram y(t, x) from Figure 2 are plotted. The auditory spectrogram y(t, x) is used as the input y_r(t, x) to the convex projection algorithm, which, in 200 iterations, reconstructs the waveform s_r(t) shown in the top plot of Figure 8b. The spectrogram \hat{y}_r(t, x) corresponding to the reconstructed waveform is also shown in the bottom plot of Figure 8b. Because the reconstruction algorithm attempts to synthesize a waveform s_r(t) such that its spectrogram \hat{y}_r(t, x) is equal to y_r(t, x), it can be expected that the spectrograms of the original and of the reconstructed waveforms would match. This is indeed the case in Figure 8, but the fine waveform structure is different in the original (Figure 8a) and in the reconstruction (Figure 8b), with noticeably less periodicity in some segments. However, it can be argued that because the original and the reconstructed waveforms produce the same results when processed through the early auditory processing stage, their perception should be nearly identical, which is indeed the case when the sounds are played to the human ear. Slight distortions are heard in the reconstructed waveform, but the sound is clear and intelligible. Increasing the number of iterations further decreases distortions; when the error drops to about 0.5% (tens of thousands of iterations), the signal is almost indistinguishable from the original.

We also compared the quality of the reconstructed signal with the quality of sound produced by existing pitch modification and sound morphing techniques. In [5], spectrogram modeling with MFCC coefficients plus a residue spectrogram and an iterative reconstruction process are used for sound morphing, and short morphing examples for voiced sounds are available for listening in the online version of the same paper. Book [7] also contains (among many other examples) some audio samples derived using algorithms that are relevant to our work and are targeted at the same application areas as we are considering, in particular, samples of cross-synthesis between a musical tone and the voice using a channel vocoder and resynthesis of speech and musical tones using LPC with the residual as an excitation signal and LPC with a pulse train as an excitation signal. In our opinion, the signal quality we achieve is comparable to the quality of the relevant samples presented in these references, although the sound processing through a cortical representation is significantly slower than the algorithms presented in [5, 6, 7].


Figure 8: (a) Original waveform and corresponding spectrogram. (b) Reconstructed waveform and corresponding spectrogram after 200 iterations. (Spectrogram axes: frequency in Hz versus time in ms.)

In summary, it can be concluded that a reasonable quality of the reconstructed signal can be achieved in reasonable time, such as ten seconds or so of computational time per one second of a signal sampled at 8 kHz (although the iterative algorithm is not suitable for real-time processing). If unlimited time (a few hours) is allowed for processing, very good signal quality is achieved. The possibility of iterative signal reconstruction in real time is an open question, and work in this area is continuing.

7. TIMBRE-PRESERVING PITCH MANIPULATIONS

For speech and musical instruments, timbre is conveyed by the spectral envelope, whereas pitch is mostly conveyed by the harmonic structure, or harmonic peaks. This biologically based analysis is in the spirit of the cepstral analysis used in speech [23], except that the Fourier-like transformation in the auditory system is carried out in a local fashion using kernels of different scales. The cortical decomposition is expressed in the complex domain, with the coefficient magnitude being the measure of the local bandwidth of the spectrum and the coefficient phase being the measure of the local symmetry at each bandwidth. Finally, just as is the case with cepstral coefficients, the spectral envelope varies slowly. In contrast, the harmonic peaks are only visible at high resolution. Consequently, timbre and pitch occupy different regions in the multiscale representation. If X is the auditory spectrum of a given data frame, with length N equal to the number of filters in the cochlear filter bank, and the decomposition is performed over M scales, then the matrix S of the scale decomposition of X has M rows, one per scale value, and N columns. If the first (top) row of S contains the decomposition over the finest scale and the Mth (bottom) row is the coarsest one, then the components of S in the upper left triangle can be associated with pitch, whereas the rest of the components can be associated with timbre information [24]. In Figure 9, a sample plot of the scale decomposition of the auditory spectrum is shown. (Please note that this is a scale versus tonotopical frequency plot rather than a scale-rate plot; all rate decomposition coefficients carry timbre information.) The brightness of a pixel corresponds to the magnitude of the decomposition coefficient, whereas the relative length and the direction of the arrow at the pixel show the coefficient phase. The white solid diagonal line shown in Figure 9 roughly separates timbre and pitch information in the cortical representation. The coefficients that lie above this line carry primarily pitch information, and the rest can be associated with timbre.

To control pitch and timbre separately, we apply modifications at appropriate locations in the cortical representation matrix and invert the cortical representation back to the spectrogram. Thus, to shift the pitch while holding the timbre fixed, we compute the cortical multiscale representation of the entire sound, shift (along the frequency axis) the triangular part of every time slice of the hypercube that holds the pitch information while keeping the timbre information intact, and invert the result. To modify the timbre keeping the pitch intact, we do the opposite. It is also possible to splice the pitch and the timbre information from two speakers, or from a speaker and a musical instrument. The result after inversion back to the sound is a "musical" voice that sings the utterance (or a "talking" musical instrument).

We express the timbre-preserving pitch shift algorithm in mathematical terms. The cortical representation consists of a set of complex coefficients z_u(t, x; \omega_c, \Omega_c) and z_d(t, x; \omega_c, \Omega_c). In the actual decomposition, the values of t, x, \omega_c, and \Omega_c are discrete, and the cortical representation of a sound is just a four-dimensional hypercube of complex coefficients Z_{i,j,k,l}.


Figure 9: Plot of the sample auditory spectrum scale decomposition matrix (scale in CPO versus frequency in Hz). The brightness of a pixel corresponds to the magnitude of the decomposition coefficient, whereas the relative length and the direction of the arrow at the pixel show the coefficient phase. The upper triangle of the matrix of coefficients (above the solid white line) contains information about the pitch of the signal. The lower triangle contains information about the timbre.

By convention, the first index i corresponds to the time axis, the second index j to the frequency axis, the third index k to the scale axis, and the fourth index l to the rate axis. Index j varies from 1 to N, where N is the number of filters in the cochlear filter bank; index k varies from 1 to M (in order of scale increase), where M is the number of scales; and, finally, index l varies from 1 to 2L, where L is the number of rates (z_d and z_u are juxtaposed in the Z_{i,j,k,l} matrix as pictured on the horizontal axis in Figure 7: l = 1 corresponds to z_d with the highest rate, l = 2 to z_d with the next lower rate, l = L to z_d with the lowest rate, l = L+1 to z_u with the lowest rate, l = L+2 to z_u with the next higher rate, and l = 2L to z_u with the highest rate; this particular order is not critical for the pitch modifications described below as they do not depend on it). Then, a coefficient is assumed to carry pitch information if it lies above the diagonal shown in Figure 9 (i.e., if (M − k)/j > (M − 1)/N), and to shift the pitch up by q channels, we fill the matrix Z^*_{i,j,k,l} with the coefficients of the matrix Z_{i,j,k,l} as follows:

Z^*_{i,j,k,l} = Z_{i,j,k,l}, \quad j < j_b,
Z^*_{i,j,k,l} = Z_{i,j_b,k,l}, \quad j_b \le j < j_b + q,
Z^*_{i,j,k,l} = Z_{i,j-q,k,l}, \quad j_b + q \le j,   (17)

where j_b = (M − k)N/(M − 1) rounded to the nearest positive integer (note that j_b depends on k and is therefore different in different hyperslices of the matrix that have different values of k). A similar procedure shifts the pitch down by q channels:

Z^*_{i,j,k,l} = Z_{i,j,k,l}, \quad j < j_b,
Z^*_{i,j,k,l} = Z_{i,j+q,k,l}, \quad j_b \le j < N - q,
Z^*_{i,j,k,l} = Z_{i,N,k,l}, \quad j_b \le j, \; N - q \le j.   (18)
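
A Python sketch of the upward shift (17) follows. The 0-based array layout (time, channel, scale, rate) and the handling of the boundary channel are assumptions made for illustration; the downward shift (18) would be analogous.

```python
import numpy as np

def shift_pitch_up(Z, q):
    """Sketch of eq. (17): shift the coefficients above the boundary channel
    j_b up by q cochlear channels while leaving the region below j_b intact.
    Assumed layout: Z[i, j, k, l] with shape (T, N, M, 2L) -- time, channel,
    scale, rate."""
    T, N, M, _ = Z.shape
    Z_out = Z.copy()
    for k in range(M):                       # paper's scale index is k + 1
        jb = max(1, int(round((M - 1 - k) * N / (M - 1))))   # boundary channel j_b
        lo = jb - 1                          # 0-based position of j_b
        # channels below j_b are already copied unchanged by Z.copy()
        Z_out[:, lo:lo + q, k, :] = Z[:, lo:lo + 1, k, :]    # fill with the boundary value
        Z_out[:, lo + q:, k, :] = Z[:, lo:N - q, k, :]       # shift the rest up by q
    return Z_out
```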

Figure 10: Spectrum of a speech signal (a) before and (b) after the pitch shift (magnitude in dB versus frequency in Hz). Note that the spectral envelope is filled with a new set of harmonics.

Finally, to splice the pitch of the signal S1 with the timbre of the signal S2, we compose Z^* from the two corresponding cortical decompositions Z_1 and Z_2, taking the elements for which (M − k)/j > (M − 1)/N from Z_1 and all other ones from Z_2. Inversion of Z^* back to the waveform gives us the desired result.
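
A possible sketch of this splicing rule, under the same assumed hypercube layout as in the pitch-shift sketch above:

```python
import numpy as np

def splice_pitch_timbre(Z1, Z2):
    """Take the coefficients satisfying (M - k)/j > (M - 1)/N from Z1 (pitch
    donor) and the rest from Z2 (timbre donor).  Both hypercubes are assumed
    to share the shape (T, N, M, 2L)."""
    T, N, M, _ = Z1.shape
    j = np.arange(1, N + 1)[None, :, None, None]      # 1-based channel index
    k = np.arange(1, M + 1)[None, None, :, None]      # 1-based scale index
    pitch_mask = (M - k) / j > (M - 1) / N
    return np.where(pitch_mask, Z1, Z2)
```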

We show one pitch-shift example here and refer the interested reader to http://www.isr.umd.edu/CAAR/ and http://www.umiacs.umd.edu/labs/pirl/NPDM/ for the actual sounds used in this example and for more samples. We use the above-described algorithm to perform a timbre-preserving pitch shift of a speech signal. The cochlear model has 128 filters with 24 filters per octave, covering 5 1/3 octaves along the frequency axis. The cortical representation is modified using (18) to achieve the desired pitch modification and then inverted using the reconstruction procedure described in Section 4, resulting in a pitch-scaled version of the original signal. In Figure 10, we show plots of the spectrum of the original signal and of the signal having the pitch shifted down by 8 channels (one third of an octave) at a fixed point in time. The pitches of the original and of the modified signals are 140 Hz and 111 Hz, respectively. It can be seen from the plots that the signal spectral envelope is preserved and that the speech formants are kept at their original locations, but a new set of harmonics is introduced.

The algorithm is sufficiently fast to be used in real time if a log-Fourier transform early stage (described in Section 5) is substituted for the cochlear filter bank to eliminate the need for an iterative inversion process. Additionally, it is not necessary to compute the full cortical representation of the sound to do timbre-preserving pitch shifts. It is enough to perform only the scale decomposition for every time frame of the auditory spectrogram because the shifts are done along the frequency axis and can be performed in each time slice of the hypercube independently; thus, the rate decomposition is unnecessary.


Figure 11: (Left column) Waveform plots and (right column) spectrum plots for the guitar (top plots), the trumpet (middle plots), and the new instrument (bottom plots). (Waveforms: amplitude versus time in s; spectra: magnitude in dB versus frequency in Hz.)

We have used the pitch-shift algorithm in a small-scale study in an attempt to generate maximally separable sounds to improve the simultaneous intelligibility of multiple competing messages [19]; it was found that the pitch separation does improve the perceptual separability of sounds and the recognition rate. Also, we have used the algorithm to generate, from one note of a given frequency, the other notes of a newly created musical instrument that has the timbre characteristics of two existing instruments. This application is described in more detail in the following section.

8. TIMBRE MANIPULATIONS

Timbre of the audio signal is conveyed both by the spectral envelope and by the signal dynamics. The spectral envelope is represented in the cortical representation by the lower right triangle of the scale decomposition coefficients and can be manipulated by modifying these. Sound dynamics is captured by the rate decomposition. Selective modifications to enhance or diminish the contributions of components of a certain rate can change the dynamic properties of the sound. As an illustration, and as an example of information separation across the cells of different rates, we synthesize a few sound samples using simple modifications to make the sound either abrupt or slurred. One such simple modification is to zero out the cortical representation decomposition coefficients that correspond to the "fast" cells, creating the impression of a low-intelligibility sound in an extremely reverberant environment; the other one is to remove the "slow" cells, obtaining an abrupt sound in an anechoic environment (see http://www.isr.umd.edu/CAAR/ and http://www.umiacs.umd.edu/labs/pirl/NPDM/ for the actual sound samples, where the decomposition was performed over the rates of 2, 4, 8, and 16 Hz; of these, the "slow" rates are 2 and 4 Hz and the "fast" rates are 8 and 16 Hz). It might be possible to use such modifications in sonification (e.g., by mapping some physical parameter to the amount of simulated reverberation and by manipulating the perceived reverberation time by a gradual decrease or increase of the contribution of the "slow" components) or in audio user interfaces in general. Similarly, in musical synthesis, the playback rate and the onset and decay ratio can be modified with shifts along the rate axis while preserving the pitch.
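
As an illustration of such rate-selective modifications, the following sketch zeroes either the "slow" or the "fast" rate coefficients; the rate layout along the last axis and the 8 Hz split between slow and fast are taken from the example above and are otherwise assumptions.

```python
import numpy as np

def suppress_rates(Z, rates, remove="slow", split_hz=8.0):
    """Sketch: zero the cortical coefficients of 'slow' or 'fast' rate filters.
    `rates` lists the absolute rate (Hz) of each index along the last axis of Z,
    e.g. [16, 8, 4, 2, 2, 4, 8, 16] for a downward/upward ordering."""
    rates = np.asarray(rates, dtype=float)
    slow = rates < split_hz
    kill = slow if remove == "slow" else ~slow
    Z_out = Z.copy()
    # remove="slow" -> abrupt, "anechoic" result; remove="fast" -> slurred,
    # "reverberant"-sounding result, as described in the text.
    Z_out[..., kill] = 0.0
    return Z_out
```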


Figure 12: Spectrum of the new instrument playing (a) D#3, (b) C3, and (c) G2 (magnitude in dB versus frequency in Hz).

To show the ease with which timbre manipulation can be done using the cortical representation, we performed a timbre interpolation between two musical instruments to obtain a new, in-between synthetic instrument, which has both a spectral shape and temporal spectral modulations (onset and decay ratio) that lie between the two original instruments. The two instruments selected were the guitar, W_gC#3, and the trumpet, W_tC#3, playing the same note (C#3). The rate-scale decomposition of a short (1.5 seconds) instrument sample was performed, and the geometric average of the complex coefficients in the cortical representations of these two instrument samples was computed and converted back to the sound wave to obtain the new instrument sound sample W_nC#3. The behavior of the new instrument along the time axis is intermediate between the two original ones, and the spectrum shape is also an average between the two original instruments (Figure 11).
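
One plausible reading of the geometric averaging of complex coefficients is sketched below; the paper does not spell out how the phases are combined, so the circular phase averaging used here is an assumption.

```python
import numpy as np

def interpolate_timbre(Z1, Z2):
    """Sketch: element-wise geometric average of two cortical decompositions.
    Magnitudes are averaged geometrically; phases are combined circularly."""
    mag = np.sqrt(np.abs(Z1) * np.abs(Z2))
    phase = np.angle(Z1 * Z2) / 2.0          # half the summed phase, wrapped
    return mag * np.exp(1j * phase)
```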

After the timbre interpolation, the synthesized instrument can only play the same note as the original ones. To synthesize other notes, we use the timbre-preserving pitch shift algorithm (Section 7) with the waveform W_nC#3 obtained by the timbre interpolation (third waveform in Figure 11) as an input. Figure 12 shows the spectrum of the new instrument for three different newly generated notes: D#3, C3, and G2. It can be seen that the spectral envelope is the same in all three plots (and is the same as the spectral envelope of W_nC#3), but this envelope is filled with different sets of harmonics for different notes. For this synthesis, a log-Fourier transform early stage with pulse-train phase imprinting (Section 5) was used, as it is ideally suited for the task. A few samples of music made with the new instrument are available at http://www.umiacs.umd.edu/labs/pirl/NPDM/.

9. SUMMARY AND CONCLUSIONS

We developed and tested simple yet powerful algorithms for performing independent modifications of the pitch and the timbre of an audio signal and for performing interpolation between sound samples. These algorithms constitute a new application of the cortical representation of the sound [3], which extracts the perceptually important audio features, simulating the processing believed to occur in the auditory pathways in primates, and thus can be used for making sound modifications tuned for and targeted to the way the human nervous system processes information. We obtained promising results and are using these algorithms in the ongoing development of auditory user interfaces.

ACKNOWLEDGMENTS

Partial support of ONR Grant N000140110571, NSF Grant IBN-0097975, and NSF Award IIS-0205271 is gratefully acknowledged. This paper is an extended version of paper [1].

REFERENCES

[1] D. N. Zotkin, S. A. Shamma, P. Ru, R. Duraiswami, and L. S. Davis, "Pitch and timbre manipulations using cortical representation of sound," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '03), vol. 5, pp. 517–520, Hong Kong, China, April 2003; reprinted in Proc. ICME '03, vol. 3, pp. 381–384, Baltimore, Md, USA, July 2003, because of the cancellation of the ICASSP '03 conference meeting.

[2] M. Elhilali, T. Chi, and S. A. Shamma, "A spectro-temporal modulation index (STMI) for assessment of speech intelligibility," Speech Communication, vol. 41, no. 2, pp. 331–348, 2003.

[3] T. Chi, P. Ru, and S. A. Shamma, "Multiresolution spectrotemporal analysis of complex sounds," to appear in Journal of the Acoustical Society of America.

[4] T. Chi, Y. Gao, M. C. Guyton, P. Ru, and S. A. Shamma, "Spectro-temporal modulation transfer functions and speech intelligibility," Journal of the Acoustical Society of America, vol. 106, no. 5, pp. 2719–2732, 1999.

[5] M. Slaney, M. Covell, and B. Lassiter, "Automatic audio morphing," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '96), vol. 2, pp. 1001–1004, Atlanta, Ga, USA, May 1996.

[6] X. Serra, "Musical sound modeling with sinusoids plus noise," in Musical Signal Processing, G. D. Poli, A. Picialli, S. T. Pope, and C. Roads, Eds., Swets & Zeitlinger Publishers, Lisse, The Netherlands, 1997.

[7] P. R. Cook, Real Sound Synthesis for Interactive Applications, A. K. Peters, Natick, Mass, USA, 2002.

[8] S. Barrass, Sculpting a Sound Space with Information Properties: Organized Sound, Cambridge University Press, Cambridge, UK, 1996.

[9] G. Kramer, B. Walker, T. Bonebright, et al., "Sonification report: Status of the field and research agenda," prepared for NSF by members of the ICAD, 1997, http://www.icad.org/websiteV2.0/References/nsf.html.

[10] S. Bly, "Multivariate data mapping," in Proc. Auditory Display: Sonification, Audification, and Auditory Interfaces, G. Kramer, Ed., vol. 18 of Santa Fe Institute Studies in the Sciences of Complexity, pp. 405–416, Addison Wesley, Reading, Mass, USA, 1994.

[11] D. N. Zotkin, R. Duraiswami, and L. S. Davis, "Rendering localized spatial audio in a virtual auditory space," IEEE Trans. Multimedia, vol. 6, no. 4, pp. 553–564, 2004.

[12] D. S. Brungart, "Informational and energetic masking effects in the perception of two simultaneous talkers," Journal of the Acoustical Society of America, vol. 109, no. 3, pp. 1101–1109, 2001.

[13] C. J. Darwin and R. W. Hukin, "Effects of reverberation on spatial, prosodic, and vocal-tract size cues to selective attention," Journal of the Acoustical Society of America, vol. 108, no. 1, pp. 335–342, 2000.

[14] C. J. Darwin and R. W. Hukin, "Effectiveness of spatial cues, prosody, and talker characteristics in selective attention," Journal of the Acoustical Society of America, vol. 107, no. 2, pp. 970–977, 2000.

[15] M. L. Hawley, R. Y. Litovsky, and H. S. Colburn, "Speech intelligibility and localization in a multi-source environment," Journal of the Acoustical Society of America, vol. 105, no. 6, pp. 3436–3448, 1999.

[16] W. A. Yost, R. H. Dye Jr., and S. Sheft, "A simulated cocktail party with up to three sound sources," Perception and Psychophysics, vol. 58, no. 7, pp. 1026–1036, 1996.

[17] B. Arons, "A review of the cocktail party effect," Journal of the American Voice I/O Society, vol. 12, pp. 35–50, 1992.

[18] P. F. Assmann, "Fundamental frequency and the intelligibility of competing voices," in Proc. 14th International Congress of Phonetic Sciences, pp. 179–182, San Francisco, Calif, USA, August 1999.

[19] N. Mesgarani, S. A. Shamma, K. W. Grant, and R. Duraiswami, "Augmented intelligibility in simultaneous multi-talker environments," in Proc. International Conference on Auditory Display (ICAD '03), pp. 71–74, Boston, Mass, USA, July 2003.

[20] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press, Cambridge, Mass, USA, 1991.

[21] N. Kowalski, D. Depireux, and S. A. Shamma, "Analysis of dynamic spectra in ferret primary auditory cortex. I. Characteristics of single-unit responses to moving ripple spectra," Journal of Neurophysiology, vol. 76, no. 5, pp. 3503–3523, 1996.

[22] X. Yang, K. Wang, and S. A. Shamma, "Auditory representations of acoustic signals," IEEE Trans. Inform. Theory, vol. 38, no. 2, pp. 824–839, 1992.

[23] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, Cambridge, Mass, USA, 1998.

[24] R. Lyon and S. A. Shamma, "Auditory representations of timbre and pitch," in Auditory Computations, vol. 6 of Springer Handbook of Auditory Research, pp. 221–270, Springer-Verlag, New York, NY, USA, 1996.

Dmitry N. Zotkin was born in Moscow, Russia, in 1973. He received a combined B.S./M.S. degree in applied mathematics and physics from the Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia, in 1996, and received the M.S. and Ph.D. degrees in computer science from the University of Maryland, College Park, USA, in 1999 and 2002, respectively. Dr. Zotkin is currently an Assistant Research Scientist at the Perceptual Interfaces and Reality Laboratory, Institute for Advanced Computer Studies (UMIACS), University of Maryland, College Park. His current research interests are in multichannel signal processing for tracking and multimedia. He is also working in the general area of spatial audio, including virtual auditory scene synthesis, customizable virtual auditory displays, perceptual processing interfaces, and associated problems.

Taishih Chi received the B.S. degree from National Taiwan University, Taiwan, in 1992, and the M.S. and Ph.D. degrees from the University of Maryland, College Park, in 1997 and 2003, respectively, all in electrical engineering. From 1994 to 1996, he was a Graduate School Fellow at the University of Maryland. From 1996 to 2003, he was a Research Assistant at the Institute for Systems Research, University of Maryland. Since June 2003, he has been a Research Associate at the University of Maryland. His research interests are in neuromorphic auditory modeling, soft computing, and speech analysis.

Shihab A. Shamma obtained his Ph.D. degree in electrical engineering from Stanford University in 1980. He joined the Department of Electrical Engineering, the University of Maryland, in 1984, where his research has dealt with issues in computational neuroscience and the development of microsensor systems for experimental research and neural prostheses. His primary focus has been on uncovering the computational principles underlying the processing and recognition of complex sounds (speech and music) in the auditory system, and the relationship between auditory and visual processing. Other research interests include the development of photolithographic microelectrode arrays for recording and stimulation of neural signals, VLSI implementations of auditory processing algorithms, and the development of algorithms for the detection, classification, and analysis of neural activity from multiple simultaneous sources.

Ramani Duraiswami is a member of the faculty in the Department of Computer Science and in the Institute for Advanced Computer Studies (UMIACS), the University of Maryland, College Park. He is the Director of the Perceptual Interfaces and Reality Laboratory there. Dr. Duraiswami obtained the B.Tech. degree in mechanical engineering from IIT Bombay in 1985, and a Ph.D. degree in mechanical engineering and applied mathematics from the Johns Hopkins University in 1991. His research interests are broad and currently include spatial audio, virtual environments, microphone arrays, computer vision, statistical machine learning, fast multipole methods, and integral equations.


EURASIP Journal on Applied Signal Processing 2005:9, 1365–1373
© 2005 Hindawi Publishing Corporation

Source Separation with One Ear: Proposition for an Anthropomorphic Approach

Jean Rouat

Departement de Genie Electrique et de Genie Informatique, Universite Sherbrooke, 2500 boulevard de l'Universite, Sherbrooke, QC, Canada J1K 2R1

Equipe de Recherche en Micro-electronique et Traitement Informatique des Signaux (ETMETIS), Departement de Sciences Appliques, Universite du Quebec a Chicoutimi, 555 boulevard de l'Universite, Chicoutimi, Quebec, Canada G7H 2B1
Email: [email protected]

Ramin Pichevar

Departement de Genie Electrique et de Genie Informatique, Universite Sherbrooke, 2500 boulevard de l'Universite, Sherbrooke, QC, Canada J1K 2R1
Email: [email protected]

Equipe de Recherche en Micro-electronique et Traitement Informatique des Signaux (ETMETIS), Departement de Sciences Appliques, Universite du Quebec a Chicoutimi, 555 boulevard de l'Universite, Chicoutimi, Quebec, Canada G7H 2B1

Received 9 December 2003; Revised 23 August 2004

We present an example of an anthropomorphic approach, in which auditory-based cues are combined with temporal correlation to implement a source separation system. The auditory features are based on spectral amplitude modulation and energy information obtained through 256 cochlear filters. Segmentation and binding of auditory objects are performed with a two-layered spiking neural network. The first layer performs the segmentation of the auditory images into objects, while the second layer binds the auditory objects belonging to the same source. The binding is further used to generate a mask (binary gain) to suppress the undesired sources from the original signal. Results are presented for a double-voiced (2 speakers) speech segment and for sentences corrupted with different noise sources. Comparative results are also given using PESQ (perceptual evaluation of speech quality) scores. The spiking neural network is fully adaptive and unsupervised.

Keywords and phrases: auditory modeling, source separation, amplitude modulation, auditory scene analysis, spiking neurons, temporal correlation.

1. INTRODUCTION

1.1. Source separation

Source separation of mixed signals is an important problem with many applications in the context of audio processing. It can be used to assist robots in segregating multiple speakers, to ease the automatic transcription of videos via the audio tracks, to segregate musical instruments before automatic transcription, to clean up a signal before performing speech recognition, and so forth. The ideal instrumental setup is based on the use of arrays of microphones during recording to obtain many audio channels.

In many situations, only one channel is available to the audio engineer, who still has to solve the separation problem. Most monophonic source separation systems require a priori knowledge, that is, expert systems (explicit knowledge) or statistical approaches (implicit knowledge) [1]. Most of these systems perform reasonably well only on specific signals (generally voiced speech or harmonic music) and fail to efficiently segregate a broad range of signals. Sameti [2] uses hidden Markov models, while Roweis [3, 4] and Royes-Gomez [5] use factorial hidden Markov models. Jang and Lee [6] use maximum a posteriori (MAP) estimation. They all require training on huge signal databases to estimate probability models. Wang and Brown [7] first proposed an original bio-inspired approach that uses features obtained from correlograms and F0 (pitch frequency) in combination with an oscillatory neural network. Hu and Wang use a pitch tracking technique [8] to segregate harmonic sources. Both systems are limited to harmonic signals.

We propose here to extend the bio-inspired approach to more general situations without training or prior knowledge of the underlying signal properties.

1.2. System overview

Physiology, psychoacoustics, and signal processing are integrated to design a multiple-source separation system when only one audio channel is available (Figure 1).


Figure 1: Source separation system (blocks: analysis filter bank, envelope detection, CAM/CSM generation, spiking neural network, mask generation, and synthesis filter bank; all paths carry 256 channels). Depending on the sources' auditory images (CAM or CSM), the spiking neural network generates the mask (binary gain) to switch the synthesis filter bank channels on/off, in time and across channels, before the final summation.

The system combines a spiking neural network with a reconstruction analysis/synthesis cochlear filter bank along with auditory image representations of audible signals. The segregation and binding of the auditory objects (coming from different sound sources) are performed by the spiking neural network (implementing the temporal correlation [9, 10]), which also generates a mask1 to be used in conjunction with the synthesis filter bank to generate the separated sound sources.

The neural network uses third-generation neural networks, where neurons are usually called spiking neurons [11]. In our implementation, neurons firing at the same instants (same firing phase) are characteristic of similar stimuli or comparable input signals.2 Usually spiking neurons, in opposition to formal neurons, have a constant firing amplitude. This coding yields noise and interference robustness while facilitating adaptive and dynamic synapses (links between neurons) for unsupervised and autonomous system design. Numerous spike timing coding schemes are possible (and observable in physiology) [12]. Among them, we decided to use synchronization and oscillatory coding schemes in combination with a competitive unsupervised framework (obtained with dynamic synapses), where groups of synchronous neurons are observed. This choice has the advantage of allowing the design of unsupervised systems with no training (or learning) phase. To some extent, the neural network can be viewed as a map where links between neurons are dynamic. In our implementation of the temporal correlation, two neurons with similar inputs on their dendrites will increase their soma-to-soma synaptic weights (dynamic synapses), forcing a synchronous response. On the other hand, neurons with dissimilar dendritic inputs will have reduced soma-to-soma synaptic weights, yielding reduced coupling and asynchronous neural responses.

1Mask and masking refer here to a binary gain and should not be confused with the conventional definition of masking in psychoacoustics.

2The information is coded in the firing instants.

Figure 2: Dynamic temporal correlation for two simultaneous sources: time evolution of the electrical output potential (action potential versus time) for four neurons from the second layer (output layer). T is the oscillatory period. Two sets of synchronous neurons appear (neurons 1 and 3 for source 1; neurons 2 and 4 for source 2). Plot degradations are due to JPEG coding.

Figure 2 illustrates the oscillatory response behavior of the output layer of the proposed neural network for two sources.

Compared to conventional approaches, our system does not require a priori knowledge, is not limited to harmonic signals, does not require training, and does not need pitch extraction. The architecture is also designed to handle continuous input signals (there is no need to segment the signal into time frames) and is based on the availability of simultaneous auditory representations of signals. Our approach is inspired by knowledge of anthropomorphic systems but is not an attempt to reproduce physiology or psychoacoustics.

The next two sections motivate the anthropomorphic approach, Section 4 describes the system in detail, Section 5 describes the experiments, Section 6 gives the results, and Section 7 is the discussion and conclusion.


2. ANTHROPOMORPHIC APPROACH

2.1. Physiology: multiple features

Schreiner and Langner in [13, 14] have shown that the inferior colliculus of the cat contains a highly systematic topographic representation of AM parameters. Maps showing the best modulation frequency have been determined. The pioneering work by Robles et al. in [15, 16, 17] reveals the importance of AM-FM3 coding in the peripheral auditory system along with the role of the efferent system in relation to adaptive tuning of the cochlea. In this paper, we use energy-based features (Cochleotopic/Spectrotopic Map) and AM features (Cochleotopic/AMtopic Map) as signal representations. The proposed architecture is not limited by the number of representations. For now, we use two representations to illustrate the relevance of multiple representations of the signal available along the auditory pathway. In fact, it is clear from physiology that multiple and simultaneous representations of the same input signal are observed in the cochlear nucleus [18, 19]. In the remaining parts of the paper, we call these representations auditory images.

2.2. Cocktail-party effect and CASA

Humans are able to segregate a desired source in a mixture of sounds (the cocktail-party effect). Psychoacoustical experiments have shown that although binaural audition may help to improve segregation performance, human beings are capable of doing the segregation even with one ear or when all the sources come from the same spatial location (e.g., when someone listens to a radio broadcast) [20]. Using the knowledge acquired in visual scene analysis and by making an analogy between vision and audition, Bregman developed the key notions of auditory scene analysis (ASA) [20]. Two of the most important aspects in ASA are the segregation and grouping (or integration) of sound sources. The segregation step partitions the auditory scene into fundamental auditory elements, and the grouping is the binding of these elements in order to reproduce the initial sound sources. These two stages are influenced by top-down processing (schema-driven). The aim in computational auditory scene analysis (CASA) is to develop computerized methods for solving the sound segregation problem by using psychoacoustical and physiological characteristics [7, 21]. For a review, see [1].

2.3. Binding of auditory sources

We assume here that sound segregation is a generalized classification problem in which we want to bind features extracted from the auditory image representations in different regions of our neural network map. We use the temporal correlation approach as suggested by Milner [9] and Malsburg in [22, 23], who observed that synchrony is a crucial feature to bind neurons associated with similar characteristics. Objects belonging to the same entity are bound together in time. In this framework, synchronization between different neurons and desynchronization among different regions perform the binding. In the present work, we implement the temporal correlation to bind auditory image objects. The binding merges the segmented auditory objects belonging to the same source.

3Other features like transients, on-, off-responses are observed, but are not implemented here.

3. PROPOSED SYSTEM STRATEGY

Two representations are simultaneously generated: an amplitude modulation map, which we call the Cochleotopic/AMtopic (CAM) Map,4 and the Cochleotopic/Spectrotopic Map (CSM), which encodes the averaged spectral energies of the cochlear filter bank output. The first representation somewhat reproduces the AM processing performed by multipolar cells (Chopper-S) from the anteroventral cochlear nucleus [19], while the second representation could be closer to the spherical bushy cell processing from the ventral cochlear nucleus areas [18].

We assume that different sources are disjoint in the auditory image representation space and that masking (binary gain) of the undesired sources is feasible. Speech has a specific structure that is different from that of most noises and perturbations [26]. Also, when dealing with simultaneous speakers, separation is possible when preserving the time structure (the probability at a given instant t of observing overlap in pitch and timbre is relatively low). Therefore, a binary gain can be used to suppress the interference (or to separate all sources with adaptive masks).

4. DETAILED DESCRIPTION

4.1. Signal analysis

Our CAM/CSM generation algorithm is as follows (a code sketch of these steps is given below).

(1) Down-sample to 8000 samples/s.
(2) Filter the sound source using a 256-filter Bark-scaled cochlear filter bank ranging from 100 Hz to 3.6 kHz.
(3) (i) For the CAM, extract the envelope (AM demodulation) for channels 30–256; for the other, low-frequency channels (1–29), use the raw outputs.5
    (ii) For the CSM, nothing is done in this step.
(4) Compute the STFT of the envelopes (CAM) or of the filter bank outputs (CSM) using a Hamming window.6
(5) To increase the spectro-temporal resolution of the STFT, find the reassigned spectrum of the STFT [28] (this consists of applying an affine transform to the points to relocate the spectrum).
(6) Compute the logarithm of the magnitude of the STFT. The logarithm enhances the presence of the stronger source in a given 2D frequency bin of the CAM/CSM.7

4To some extent, it is related to modulation spectrograms. See, for example, the work in [24, 25].

5Low-frequency channels are said to resolve the harmonics while others do not, suggesting a different strategy for low-frequency channels [27].

6Nonoverlapping adjacent windows with 4-millisecond or 32-millisecond length have been tested.

7log(e1 + e2) ≈ max(log e1, log e2) (unless e1 and e2 are both large and almost equal) [4].
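
A rough Python sketch of steps (3)-(6) for one analysis window is given below. The cochlear filter bank of step (2), the down-sampling of step (1), and the spectral reassignment of step (5) are omitted, and the Hilbert-envelope detector is an assumed implementation of the AM demodulation.

```python
import numpy as np
from scipy.signal import hilbert

def cam_frame(channel_outputs, fs=8000, win_ms=32, resolved=29):
    """Sketch of steps (3)-(6) for one analysis window.  `channel_outputs` is
    the (256, n_samples) output of the Bark-scaled cochlear filter bank."""
    env = np.abs(hilbert(channel_outputs, axis=1))          # step (3i): AM envelopes
    env[:resolved] = channel_outputs[:resolved]             # keep raw low channels
    win = int(fs * win_ms / 1000)
    frame = env[:, :win] * np.hamming(win)                  # step (4): Hamming window
    spec = np.abs(np.fft.rfft(frame, axis=1))
    return np.log(spec + 1e-12)                             # step (6): log magnitude
```

For the CSM, one would window the raw filter-bank outputs directly instead of their envelopes, following step (3ii).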


Figure 3: Example of a 24-channel CAM for a mixture of /di/ and /da/ pronounced by two speakers; mixture at SNR = 0 dB and frame center at t = 166 milliseconds (frequency in Hz versus cochlear channel number; the S1 and S2 source objects are outlined).

It is observed that the efferent loop between the medial olivocochlear system (MOC) and the outer hair cells modifies the cochlear response in such a way that speech is enhanced from the background noise [29]. To a certain extent, one can imagine that envelope detection and selection between the CAM and the CSM, in the auditory pathway, could be associated with the efferent system in combination with cochlear nucleus processing [30, 31]. For now, in the present experimental setup, selection between the two auditory images is done manually. Figure 3 is an example of a CAM computed through a 24-cochlear-channel filter bank for a /di/ and /da/ mixture pronounced by a female and a male speaker. Ellipses outline the auditory objects.

4.2. The neural network

4.2.1. First layer: image segmentation

The dynamics of the neurons we use is governed by a modified version of the Van der Pol relaxation oscillator (Wang-Terman oscillators [7]). The state-space equations for these dynamics are as follows:

\frac{dx}{dt} = 3x - x^3 + 2 - y + \rho + p + S,   (1)

\frac{dy}{dt} = \varepsilon \left[ \gamma \left( 1 + \tanh \frac{x}{\beta} \right) - y \right],   (2)

where x is the membrane potential (output) of the neuron and y is the state for channel activation or inactivation. ρ denotes the amplitude of a Gaussian noise, p is the external input to the neuron, and S is the coupling from other neurons (connections through synaptic weights). ε, γ, and β are constants.8 The Euler integration method is used to solve the equations.

8In our simulation, ε = 0.02, γ = 4, β = 0.1, and ρ = 0.02.
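
A single Euler step of (1)-(2) can be sketched as follows; the time step dt is an assumption, while the constants are the footnoted values.

```python
import numpy as np

def wang_terman_step(x, y, p, S, dt=0.05, eps=0.02, gamma=4.0, beta=0.1,
                     rho=0.02, rng=None):
    """One Euler step of (1)-(2) for arrays of oscillators; x, y, p, S share a
    common shape."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rho * rng.standard_normal(x.shape)      # Gaussian noise of amplitude rho
    dx = 3.0 * x - x ** 3 + 2.0 - y + noise + p + S
    dy = eps * (gamma * (1.0 + np.tanh(x / beta)) - y)
    return x + dt * dx, y + dt * dy
```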

Figure 4: Architecture of the two-layer bio-inspired neural network. G stands for the global controller (the global controller for the first layer is not shown in the figure). One long-range connection is shown. Parameters of the controller and of the input layer are also illustrated in the zoomed areas.

The first layer is a partially connected network of relaxation oscillators [7]. Each neuron is connected to its four neighbors. The CAM (or the CSM) is applied to the input of the neurons. Since the map is sparse, the original 256 points computed for the FFT are down-sampled to 50 points. Therefore, the first layer consists of 256 × 50 neurons. The geometric interpretation of pitch (ray distance criterion) is less clear for the first 29 channels, where harmonics are usually resolved.9 For this reason, we have also established long-range connections from the clear (high-frequency) zones to the confusion (low-frequency) zones. These connections exist only across the cochlear channel number axis of the CAM.

The weight w_{i,j,k,m}(t) (Figure 4) between neuron(i, j) and neuron(k, m) of the first layer is

w_{i,j,k,m}(t) = \frac{1}{\mathrm{Card}\{N(i,j)\}} \, \frac{0.25}{e^{\lambda |p(i,j;t) - p(k,m;t)|}},   (3)

where p(i, j) and p(k, m) are, respectively, the external inputs to neuron(i, j) and neuron(k, m) ∈ N(i, j). Card{N(i, j)} is a normalization factor equal to the cardinal number (number of elements) of the set N(i, j) containing the neighbors connected to neuron(i, j) (it can be equal to 4, 3, or 2 depending on the location of the neuron on the map, i.e., center, corner, etc.). The external input values are normalized. The value of λ depends on the dynamic range of the inputs and is set to λ = 1 in our case. The same weight adaptation is used for the long-range clear-to-confusion zone connections (6) in the CAM processing case.

9. Envelopes of resolved harmonics are nearly constant.
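For concreteness, here is a small sketch of eq. (3) over a 4-connected grid; the grid size, the dictionary return format, and the random input map are assumptions made only for the example.

```python
import numpy as np

def first_layer_weights(p, lam=1.0):
    """Connection weights of eq. (3) for a 4-neighbour grid.

    p is the (rows x cols) map of normalized external inputs;
    returns a dict {((i, j), (k, m)): weight}."""
    rows, cols = p.shape
    weights = {}
    for i in range(rows):
        for j in range(cols):
            # 4-connected neighbourhood N(i, j), clipped at the borders
            nbrs = [(i + di, j + dj) for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= i + di < rows and 0 <= j + dj < cols]
            card = len(nbrs)                          # Card N(i, j): 2, 3 or 4
            for (k, m) in nbrs:
                weights[((i, j), (k, m))] = 0.25 / (card * np.exp(lam * abs(p[i, j] - p[k, m])))
    return weights

# Example with a small random (normalized) input map.
w = first_layer_weights(np.random.rand(6, 8))
```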


The coupling S_{i,j} defined in (1) is
\[
S_{i,j}(t) = \sum_{k,m \in N(i,j)} w_{i,j,k,m}(t)\, H\bigl(x(k,m;t)\bigr) - \eta\,G(t) + \kappa\,L_{i,j}(t), \tag{4}
\]

where H(·) is the Heaviside function. The dynamics of G(t) (the global controller) is as follows:
\[
G(t) = \alpha H(z - \theta), \qquad \frac{dz}{dt} = \sigma - \xi z, \tag{5}
\]

where σ is equal to 1 if the global activity of the network is greater than a predefined ζ and is zero otherwise (Figure 4). α and ξ are constants.10

L_{i,j}(t) is the long-range coupling as follows:
\[
L_{i,j}(t) =
\begin{cases}
0, & j \ge 30,\\[3pt]
\displaystyle\sum_{k=225}^{256} w_{i,j,i,k}(t)\, H\bigl(x(i,k;t)\bigr), & j < 30.
\end{cases} \tag{6}
\]

κ is a binary variable defined as follows:
\[
\kappa =
\begin{cases}
1 & \text{for CAM},\\
0 & \text{for CSM}.
\end{cases} \tag{7}
\]

4.2.2. Second layer: temporal correlation and multiplicative synapses

The second layer is an array of 256 neurons (one for each channel). Each neuron receives the weighted product of the outputs of the first-layer neurons along the frequency axis of the CAM/CSM. The weights between layer one and layer two are defined as w_ll(i) = α/i, where i can be related to the frequency bins of the STFT and α is a constant, for the CAM case, since we are looking for structured patterns. For the CSM, w_ll(i) = α is constant along the frequency bins, as we are looking for energy bursts.11 Therefore, the input stimulus to neuron(j) in the second layer is defined as follows:

\[
\theta(j;t) = \prod_{i} w_{ll}(i)\,\Xi\bigl\{\overline{x(i,j;t)}\bigr\}. \tag{8}
\]

The operator Ξ is defined as

\[
\Xi\bigl\{\overline{x(i,j;t)}\bigr\} =
\begin{cases}
1 & \text{for } \overline{x(i,j;t)} = 0,\\[2pt]
\overline{x(i,j;t)} & \text{elsewhere},
\end{cases} \tag{9}
\]

where the bar denotes the averaging-over-a-time-window operator (the duration of the window is on the order of the discharge period). The multiplication is done only for nonzero outputs

10. ζ = 0.2, α = −0.1, ξ = 0.4, η = 0.05, and θ = 0.9.
11. In our simulation, α = 1.

(in which a spike is present) [32, 33]. This behavior has been observed in the integration of ITD (interaural time difference) and ILD (interaural level difference) information in the barn owl's auditory system [32], or in the monkey's posterior parietal lobe neurons that show receptive fields that can be explained by a multiplication of retinal and eye or head position signals [34].

The synaptic weights inside the second layer are adjusted through the following rule:

\[
w'_{ij}(t) = \frac{0.2}{e^{\mu\,|p(j;t) - p(k;t)|}}, \tag{10}
\]

where µ is chosen to be equal to 2. The binding of these features is done via this second layer. In fact, the second layer is an array of fully connected neurons along with a global controller. The dynamics of the second layer is given by an equation similar to (4) (without long-range coupling). The global controller desynchronizes the synchronized neurons for the first and second sources by emitting inhibitory activities whenever there is activity (spiking) in the network [7].
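A minimal sketch of the second-layer input of eqs. (8)-(9) is given below, following the reading that bins with zero time-averaged output contribute a neutral factor to the product; the array layout, the averaging window (whatever `x_hist` covers), and the default α = 1 are assumptions for illustration.

```python
import numpy as np

def second_layer_input(x_hist, mode="CAM", alpha=1.0):
    """Input stimulus theta(j) of eq. (8) for each of the channel neurons j.

    x_hist: (time, freq_bins, channels) recent first-layer outputs; the mean over
    the first axis plays the role of the time-averaging bar in eq. (9).
    mode selects the CAM weighting w_ll(i) = alpha / i or the flat CSM weighting.
    """
    xbar = x_hist.mean(axis=0)                               # (freq_bins, channels)
    i = np.arange(1, xbar.shape[0] + 1, dtype=float)         # frequency-bin index (1-based)
    w_ll = alpha / i if mode == "CAM" else np.full_like(i, alpha)
    # Xi of eq. (9): silent bins become a neutral factor of one in the product
    factors = np.where(xbar == 0.0, 1.0, w_ll[:, None] * xbar)
    return factors.prod(axis=0)                              # product over frequency bins, eq. (8)
```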

The selection strategy at the output of the second layer is based on temporal correlation: neurons belonging to the same source synchronize (same spiking phase) and neurons belonging to other sources desynchronize (different spiking phase).

4.3. Masking and synthesis

Time-reversed outputs of the analysis filter bank are passed through the synthesis filter bank, giving birth to z_i(t). Based on the phase synchronization described in the previous section, a mask is generated by associating zeros and ones to different channels:

\[
s(t) = \sum_{i=1}^{256} m_i(t)\, z_i(t), \tag{11}
\]

where s(N − t) is the recovered signal (N is the length of the signal in discrete mode), z_i(t) is the synthesis filter bank output for channel i, and m_i(t) is the mask value. Energy is normalized in order to have the same SPL for all frames. Note that two-source mixtures are considered throughout this article, but the technique can potentially be used for more sources. In that case, for each time frame n, labeling of individual channels is equivalent to the use of multiple masks (one for each source).
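The masking-and-synthesis step of eq. (11) amounts to a channel-wise gating and sum; the sketch below shows this, with the global (rather than per-frame) energy normalization being a simplifying assumption.

```python
import numpy as np

def apply_mask_and_sum(z, mask):
    """Recover one source from the synthesis filter-bank outputs, eq. (11).

    z    : (256, N) array, z[i] is the synthesis output of channel i
    mask : (256, N) binary array derived from the second-layer synchronization
    Returns the recovered signal; the final flip undoes the time reversal
    applied before the synthesis filter bank (s(N - t) in the text).
    """
    s = (mask * z).sum(axis=0)                # channel-wise gating and summation
    s /= np.sqrt((s**2).mean() + 1e-12)       # crude energy normalization (assumed form)
    return s[::-1]
```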

5. EXPERIMENTS

We first illustrate the separation of two simultaneous speakers (double-voiced speech segregation) and the separation of a speech sentence from an interfering siren, and then compare with other approaches.

The magnitude of the CAM's STFT is a structured image whose characteristics depend heavily on pitch and formants. Therefore, in that representation, harmonic signals are separable. On the other hand, the CSM representation is more suitable for inharmonic signals with bursts of energy.


Figure 5: (a) Spectrogram of the /di/ and /da/ mixture. (b) Spectrogram of the sentence "I willingly marry Marilyn" plus siren mixture.

5.1. Double-speech segregation case

Two speakers have simultaneously and respectively pronounced a /di/ and a /da/ (spectrogram in Figure 5a). We observed that the CSM does not generate a very discriminative representation while, from the CAM, the two speakers are well separable (see Figure 6). After binding, two sets of synchronized neurons are obtained: one for each speaker. Separation is performed by using (11), where m_i(t) = 0 for one speaker and m_i(t) = 1 for the other speaker (target speaker).

5.2. Sentence plus siren

A modified version of the siren used in Cooke's database [7] (http://www.dcs.shef.ac.uk/∼martin/) is mixed with the sentence "I willingly marry Marilyn." The spectrogram of the mixed sound is shown in Figure 5b.

In that situation, we look at short but high-energy bursts. The CSM representation generates a very discriminative representation of the speech and siren signals, while, on the other hand, the CAM fades the image as the envelopes of the interfering siren are not highly modulated. After binding,

Figure 6: (a) The spectrogram of the extracted /di/. (b) The spectrogram of the extracted /da/.

two sets of synchronized neurons are obtained: one for each source. Separation is performed by using (11), where m_i(t) = 0 for the siren and m_i(t) = 1 for the speech sentence, and vice versa.

5.3. Comparisons

Three approaches are used for comparison: the methods proposed by Wang and Brown [7] (W-B), by Hu and Wang [8] (H-W), and by Jang and Lee [35] (J-L). W-B uses an oscillatory neural network but relies on pitch information through correlation, H-W uses a multipitch tracking system, and J-L needs statistical estimation to perform the MAP-based separation.

6. RESULTS

Results can be heard and evaluated at http://www-edu.gel.usherbrooke.ca/pichevar/, http://www.gel.usherb.ca/rouat/.

6.1. Siren plus sentence

The CSM is presented to the spiking neural network. The weighted product of the outputs of the first layer along the


Figure 7: (a) The spectrogram of the extracted siren. (b) The spectrogram of the extracted utterance.

frequency axis is different when the siren is present. The binding of channels on the two sides of the noise-intruding zone is done via the long-range synaptic connections of the second layer. The spectrogram of the result is shown in Figure 7. A CSM is extracted every 10 milliseconds and the selection is made in 10-millisecond intervals. In future work, we will use much smaller selection intervals and shorter STFT windows to prevent the discontinuities observed in Figure 7.

6.2. Double-voiced speech

Perceptual tests have shown that, although the process reduces sound quality, the vowels are separated and are clearly recognizable.

6.3. Evaluation and comparisons

Table 1 reports the perceptual evaluation of speech quality (PESQ) criterion on sentences corrupted with various noises. The first column is the intruding noise, the second column gives the initial SNR of the mixtures, and the other columns are the PESQ scores for the reference methods.

Table 1: PESQ for three different methods: P-R (our proposed approach), W-B [7], and H-W [8]. The intrusion noises are (a) 1 kHz pure tone, (b) FM siren, (c) telephone ring, (d) white noise, (e) male-speaker intrusion (/di/) for the French /di/-/da/ mixture, and (f) female-speaker intrusion (/da/) for the French /di/-/da/ mixture. Except for the last two tests, the intrusions are mixed with a sentence taken from Martin Cooke's database.

Intrusion (noise)   Initial SNR of mixture   P-R (PESQ)   W-B (PESQ)   H-W (PESQ)
Tone                −2 dB                    0.403        0.223        0.361
Siren               −5 dB                    2.140        1.640        1.240
Telephone ring       3 dB                    0.860        0.700        0.900
White               −5 dB                    0.880        0.223        0.336
Male (da)            0 dB                    2.089        N/A          N/A
Female (di)          0 dB                    0.723        N/A          N/A

Table 2: PESQ for two different methods: P-R (our proposed approach) and J-L [35]. The mixture comprises a female voice with musical background (rock music).

Mixture               Separated sources   P-R (PESQ)   J-L (PESQ)
Music & female (AF)   Music               1.724        0.346
                      Voice               0.550        0.630

Table 2 gives the comparison for a female speech sentence corrupted with rock music (http://home.bawi.org/∼jangbal/research/demos/rbss1/sepres.html).

Many criteria are used in the literature to compare sound source separation performance. Some of the most important are SNR, segmental SNR, PEL (percentage of energy loss), PNR (percentage of noise residue), and LSD (log-spectral distortion). As they do not take perception into account, we propose to use another criterion, the PESQ, to better reflect human perception. The PESQ (perceptual evaluation of speech quality) is an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. The key to this process is the transformation of both the original and degraded signals into an internal representation that is similar to the psychophysical representation of audio signals in the human auditory system, taking into account perceptual frequency (Bark scale) and loudness (sone). This allows a small number of quality indicators to be used to model all subjective effects. These perceptual parameters are combined to create an objective listening quality MOS. The final score is given on a range of −0.5 to 4.5.12
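PESQ scores such as those in Tables 1 and 2 come from the ITU-T P.862 algorithm itself and are not reimplemented here. As a small, self-contained illustration of two of the signal-level criteria listed above, the sketch below computes segmental SNR and log-spectral distortion with numpy; the frame length, clipping limits, and floor constant are illustrative assumptions.

```python
import numpy as np

def segmental_snr(ref, est, frame=256, eps=1e-12):
    """Mean per-frame SNR (dB) between a reference and a separated signal."""
    n = min(len(ref), len(est)) // frame * frame
    r = ref[:n].reshape(-1, frame)
    e = (ref[:n] - est[:n]).reshape(-1, frame)
    snr = 10.0 * np.log10(((r**2).sum(1) + eps) / ((e**2).sum(1) + eps))
    return float(np.mean(np.clip(snr, -10.0, 35.0)))   # clip extreme frames, a common convention

def log_spectral_distortion(ref, est, frame=256, eps=1e-12):
    """RMS distance (dB) between frame-wise log-magnitude spectra."""
    n = min(len(ref), len(est)) // frame * frame
    R = np.abs(np.fft.rfft(ref[:n].reshape(-1, frame), axis=1)) + eps
    E = np.abs(np.fft.rfft(est[:n].reshape(-1, frame), axis=1)) + eps
    d = 20.0 * (np.log10(R) - np.log10(E))
    return float(np.mean(np.sqrt((d**2).mean(axis=1))))
```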

In all cases, the system performs better than W-B [7] and H-W [8], except for the telephone ring intrusion, where H-W is slightly better. For the double-voiced speech, the male speaker is relatively well extracted. Other evaluations we made are based on LSD and SNR and also converge to similar results.

12. 0 corresponds to the worst quality and 4.5 corresponds to the best quality (no degradation).


7. CONCLUSION AND FURTHER WORK

Based on evidence regarding the dynamics of the efferent loops and on the richness of the representations observed in the cochlear nucleus, we proposed a technique to explore the monophonic source separation problem using a multirepresentation (CAM/CSM) bio-inspired preprocessing stage and a bio-inspired neural network that does not require any a priori knowledge of the signal.

For the time being, the CSM/CAM selection is made manually. In the near future, we will include a top-down module based on the local SNR gain to selectively find the suitable auditory image representation, also depending on the neural network synchronization.

In the reported experiments, we segregate two sources to illustrate the work, but the approach is not restricted to that number of sources.

Results obtained from signal synthesis are encouraging, and we believe that spiking neural networks in combination with suitable signal representations have a strong potential in speech and audio processing. The evaluation scores show that our system yields fairly comparable (and most of the time better) performance than other methods, even though it does not need a priori knowledge and is not limited to harmonic signals.

ACKNOWLEDGMENTS

This work has been funded by NSERC, MRST of the Quebec Government, Universite de Sherbrooke, and Universite du Quebec a Chicoutimi. Many thanks to DeLiang Wang for fruitful discussions on oscillatory neurons, to Wolfgang Maass for pointing out the work by Milner, to Christian Giguere for discussions on auditory pathways, and to the anonymous reviewers for constructive comments.

REFERENCES

[1] M. Cooke and D. Ellis, "The auditory organization of speech and other sources in listeners and computational models," Speech Communication, vol. 35, no. 3-4, pp. 141–177, 2001.

[2] H. Sameti, H. Sheikhzadeh, L. Deng, and R. L. Brennan, "HMM based strategies for enhancement of speech signals embedded in nonstationary noise," IEEE Trans. Speech Audio Processing, vol. 6, no. 5, pp. 445–455, 1998.

[3] S. T. Roweis, "One microphone source separation," in Proc. Neural Information Processing Systems (NIPS '00), pp. 793–799, Denver, Colo, USA, 2000.

[4] S. T. Roweis, "Factorial models and refiltering for speech separation and denoising," in Proc. 8th European Conference on Speech Communication and Technology (EUROSPEECH '03), pp. 1009–1012, Geneva, Switzerland, September 2003.

[5] M. J. Reyes-Gomez, B. Raj, and D. R. W. Ellis, "Multi-channel source separation by factorial HMMs," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '03), vol. 1, pp. 664–667, Hong Kong, China, April 2003.

[6] G.-J. Jang and T.-W. Lee, "A maximum likelihood approach to single-channel source separation," Journal of Machine Learning Research, vol. 4, pp. 1365–1392, 2003.

[7] D. L. Wang and G. J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Trans. Neural Networks, vol. 10, no. 3, pp. 684–697, 1999.

[8] G. Hu and D. Wang, "Separation of stop consonants," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '03), vol. 2, pp. 749–752, Hong Kong, China, April 2003.

[9] P. Milner, "A model for visual shape recognition," Psychological Review, vol. 81, no. 6, pp. 521–535, 1974.

[10] C. von der Malsburg, "The correlation theory of brain function," Internal Rep. 81-2, Max-Planck Institute for Biophysical Chemistry, Gottingen, Germany, 1981.

[11] W. Maass, "Networks of spiking neurons: the third generation of neural network models," Neural Networks, vol. 10, no. 9, pp. 1659–1671, 1997.

[12] D. E. Haines, Ed., Fundamental Neuroscience, Churchill Livingstone, San Diego, Calif, USA, 1997.

[13] C. E. Schreiner and J. V. Urbas, "Representation of amplitude modulation in the auditory cortex of the cat. I. The anterior auditory field (AAF)," Hearing Research, vol. 21, no. 3, pp. 227–241, 1986.

[14] C. Schreiner and G. Langner, "Periodicity coding in the inferior colliculus of the cat. II. Topographical organization," Journal of Neurophysiology, vol. 60, no. 6, pp. 1823–1840, 1988.

[15] L. Robles, M. A. Ruggero, and N. C. Rich, "Two-tone distortion in the basilar membrane of the cochlea," Nature, vol. 349, pp. 413–414, 1991.

[16] E. F. Evans, "Auditory processing of complex sounds: an overview," in Phil. Trans. Royal Society of London, pp. 1–12, Oxford Press, Oxford, UK, 1992.

[17] M. A. Ruggero, L. Robles, N. C. Rich, and A. Recio, "Basilar membrane responses to two-tone and broadband stimuli," in Phil. Trans. Royal Society of London, pp. 13–21, Oxford Press, Oxford, UK, 1992.

[18] C. K. Henkel, "The auditory system," in Fundamental Neuroscience, D. E. Haines, Ed., Churchill Livingstone, New York, NY, USA, 1997.

[19] P. Tang and J. Rouat, "Modeling neurons in the anteroventral cochlear nucleus for amplitude modulation (AM) processing: application to speech sound," in Proc. 4th IEEE International Conf. on Spoken Language Processing (ICSLP '96), vol. 1, pp. 562–565, Philadelphia, Pa, USA, October 1996.

[20] A. Bregman, Auditory Scene Analysis, MIT Press, Cambridge, Mass, USA, 1994.

[21] M. W. Beauvois and R. Meddis, "A computer model of auditory stream segregation," The Quarterly Journal of Experimental Psychology, vol. 43, no. 3, pp. 517–541, 1991.

[22] C. von der Malsburg and W. Schneider, "A neural cocktail-party processor," Biological Cybernetics, vol. 54, pp. 29–40, 1986.

[23] C. von der Malsburg, "The what and why of binding: the modeler's perspective," Neuron, vol. 24, no. 1, pp. 95–104, 1999.

[24] L. Atlas and S. A. Shamma, "Joint acoustic and modulation frequency," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 7, pp. 668–675, 2003.

[25] G. Meyer, D. Yang, and W. Ainsworth, "Applying a model of concurrent vowel segregation to real speech," in Computational Models of Auditory Function, S. Greenberg and M. Slaney, Eds., pp. 297–310, IOS Press, Amsterdam, The Netherlands, 2001.


[26] J. Rouat, "Spatio-temporal pattern recognition with neural networks: application to speech," in Proc. International Conference on Artificial Neural Networks (ICANN '97), vol. 1327 of Lecture Notes in Computer Science, pp. 43–48, Springer, Lausanne, Switzerland, October 1997.

[27] J. Rouat, Y. C. Liu, and D. Morissette, "A pitch determination and voiced/unvoiced decision algorithm for noisy speech," Speech Communication, vol. 21, no. 3, pp. 191–207, 1997.

[28] F. Plante, G. Meyer, and W. A. Ainsworth, "Improvement of speech spectrogram accuracy by the method of reassignment," IEEE Trans. Speech Audio Processing, vol. 6, no. 3, pp. 282–287, 1998.

[29] S. Kim, D. R. Frisina, and R. D. Frisina, "Effects of age on contralateral suppression of distortion product otoacoustic emissions in human listeners with normal hearing," Audiology Neuro Otology, vol. 7, pp. 348–357, 2002.

[30] C. Giguere and P. C. Woodland, "A computational model of the auditory periphery for speech and hearing research," Journal of the Acoustical Society of America, vol. 95, pp. 331–349, 1994.

[31] M. Liberman, S. Puria, and J. J. Guinan, "The ipsilaterally evoked olivocochlear reflex causes rapid adaptation of the 2f1−f2 distortion product otoacoustic emission," Journal of the Acoustical Society of America, vol. 99, pp. 2572–3584, 1996.

[32] F. Gabbiani, H. Krapp, C. Koch, and G. Laurent, "Multiplicative computation in a visual neuron sensitive to looming," Nature, vol. 420, pp. 320–324, 2002.

[33] J. Pena and M. Konishi, "Auditory spatial receptive fields created by multiplication," Science, vol. 292, pp. 249–252, 2001.

[34] R. Andersen, L. Snyder, D. Bradley, and J. Xing, "Multimodal representation of space in the posterior parietal cortex and its use in planning movements," Annual Review of Neuroscience, vol. 20, pp. 303–330, 1997.

[35] G.-J. Jang, T.-W. Lee, and Y.-H. Oh, "Single-channel signal separation using time-domain basis functions," IEEE Signal Processing Letters, vol. 10, no. 6, pp. 168–171, 2003.

Jean Rouat holds an M.S. degree in physics from Universite de Bretagne, France (1981), an E. & E. M.S. degree in speech coding and speech recognition from Universite de Sherbrooke (1984), and an E. & E. Ph.D. degree in cognitive and statistical speech recognition jointly from Universite de Sherbrooke and McGill University (1988). From 1988 to 2001 he was with the Universite du Quebec a Chicoutimi (UQAC). In 1995 and 1996, he was on a sabbatical leave with the Medical Research Council, Applied Psychological Unit, Cambridge, UK, and the Institute of Physiology, Lausanne, Switzerland. In 1990 he founded the ERMETIS, Microelectronics and Signal Processing Research Group, UQAC. He is now with Universite de Sherbrooke, where he founded the Computational Neuroscience and Signal Processing Research Group. He regularly acts as a reviewer for speech, neural networks, and signal processing journals. He is an active member of scientific associations (Acoustical Society of America, International Speech Communication, IEEE, International Neural Networks Society, Association for Research in Otolaryngology, ACM, etc.). He is a Member of the IEEE Technical Committee on Machine Learning for Signal Processing.

Ramin Pichevar was born in March 1974 in Paris, France. He received his B.S. degree in electrical engineering (electronics) in 1996 and the M.S. degree in electrical engineering (telecommunication systems) in 1999, both in Tehran, Iran. He received his Ph.D. degree in electrical and computer engineering from Universite de Sherbrooke, Quebec, Canada, in 2004. During his Ph.D., he gave courses on signal processing and computer hardware as a Lecturer. In 2001 and 2002 he did two summer internships at Ohio State University, USA, and at the University of Grenoble, France, respectively. He is now a Postdoctoral Fellow and Research Associate in the Computational Neuroscience and Signal Processing Laboratory at the University of Sherbrooke under an NSERC (Natural Sciences and Engineering Research Council of Canada) Ideas to Innovation (I2I) grant. His domains of interest are signal processing, computational auditory scene analysis (CASA), neural networks with emphasis on bio-inspired neurons, speech recognition, digital communications, discrete-event simulation, and image processing.


EURASIP Journal on Applied Signal Processing 2005:9, 1374–1381
© 2005 Hindawi Publishing Corporation

A Physiologically Inspired Method for Audio Classification

Sourabh Ravindran
School of Electrical and Computer Engineering, College of Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA
Email: [email protected]

Kristopher Schlemmer
School of Electrical and Computer Engineering, College of Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA
Email: [email protected]

David V. Anderson
School of Electrical and Computer Engineering, College of Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA
Email: [email protected]

Received 2 November 2003; Revised 9 August 2004

We explore the use of physiologically inspired auditory features with both physiologically motivated and statistical audio classification methods. We use features derived from a biophysically defensible model of the early auditory system for audio classification using a neural network classifier. We also use a Gaussian-mixture-model (GMM)-based classifier for the purpose of comparison and show that the neural-network-based approach works better. Further, we use features from a more advanced model of the auditory system and show that the features extracted from this model of the primary auditory cortex perform better than the features from the early auditory stage. The features give good classification performance with only one-second data segments used for training and testing.

Keywords and phrases: auditory model, feature extraction, neural nets, audio classification, Gaussian mixture models.

1. INTRODUCTION

Human-like performance by machines in tasks of speech and audio processing has remained an elusive goal. In an attempt to bridge the gap in performance between humans and machines, there has been an increased effort to study and model physiological processes. However, the widespread use of biologically inspired features proposed in the past has been hampered mainly by either the lack of robustness to noise or the formidable computational costs.

In physiological systems, sensor processing occurs in several stages. It is likely the case that signal features and biological processing techniques evolved together and are complementary or well matched. It is precisely because of this reason that modeling the feature extraction processes should go hand in hand with the modeling of the processes that use these features.

We present features extracted from a model of the early auditory system that have been shown to be robust to noise [1, 2]. The feature extraction can be implemented in low-power analog VLSI circuitry, which, apart from providing substantial power gains, also enables us to achieve feature extraction in real time. We specifically study a four-class audio classification problem and use a neural-network-based classifier for the classification. The method used herein is similar to that used by Teolis and Shamma [3] for classifying transient signals. The primary difference in our approach is in the additional processing of the auditory features before feeding them to the neural network. The rest of the paper is organized as follows. Section 2 introduces the early auditory system. Section 3 discusses models of the early auditory system and the primary auditory cortex. Section 4 explains the feature extraction process and Section 5 introduces the two methods used to evaluate the features in a four-class audio classification problem. Section 6 presents the experiments, followed by the results and the conclusion.

2. EARLY AUDITORY SYSTEM

As sounds enter the ear, a small amount of signal conditioning and spectral shaping occurs in the outer ear, but the


Figure 1: A cut-away view of the human ear. This shows the three stages of the ear. The outer ear includes the pinna, the ear canal, and the tympanum (ear drum). The middle ear is composed of three small bones, or ossicles. Simply put, these three bones work together for gain control and for impedance matching between the outer and the inner ear. The inner ear is the snail-shaped bone called the cochlea. This is where the incoming sounds are decomposed into their respective frequency components.

signals remain relatively unscathed until they contact the ear drum, which is the pathway to the middle ear. The middle ear is composed of three small bones, or ossicles. Simply put, these three bones work together for gain control and for impedance matching between the outer and the inner ear (matching the low impedance of the auditory canal with the high impedance of the cochlear fluid). The middle ear couples the sound energy in the auditory canal to the inner ear, or the cochlea, which is a snail-shaped bone. The placement of the cochlea with respect to the rest of the ear is shown in Figure 1.

Figure 2 shows a cross-sectional view of the cochlea. The input to the cochlea is through the oval window, and barring a scale factor resulting from the gain control, the signal that enters the oval window of the cochlea is largely the same as that which enters the ear. The oval window leads to one of three fluid-filled compartments within the cochlea. These chambers, called scala vestibuli, scala media, and scala tympani, are separated by flexible membranes. Reissner's membrane separates the scala vestibuli from the scala media, and the basilar membrane separates the scala tympani from the scala media [4, 5].

As the oval window is pushed in and out as a result of incident sound waves, pressure waves enter the cochlea in the scala vestibuli and then propagate down the length of the cochlea. Since the scala vestibuli and the scala tympani are connected, the increased pressure propagates back down the length of the cochlea through the scala tympani to the front end, also called the basal end. When the pressure wave hits the basal end, it causes a small window, called the round window, that is similar in composition to the oval window, to bow outwards to absorb the increased pressure. During this process, the two membrane dividers bend and bow in response to the changes in pressure [6], giving rise to a traveling wave in the basilar membrane.

Figure 2: A cross-section of the human cochlea. Within the bone are three fluid-filled chambers that are separated by two membranes. The input to the cochlea is in the scala vestibuli, which is connected at the apical end to the scala tympani. Pressure differences between these two chambers lead to movement in the basilar membrane. The scala media is isolated from the other two chambers.

At the basal end, the basilar membrane is very narrow but gets wider towards the apical end. Further, the stiffness of the basilar membrane decreases down its length from the base to the apex. Due to these variations along its length, different parts of the basilar membrane resonate at different frequencies, and the frequencies at which they resonate are highly dependent upon the location within the cochlea. The traveling wave that develops inside the cochlea propagates down the length of the cochlea until it reaches the point where the basilar membrane resonates at the same frequency as the input signal. The wave will essentially die out after the point where resonance occurs because the basilar membrane will no longer support the propagation. It has been observed that the lower frequencies travel further than the higher frequencies. Also, the basilar membrane has exponential changes in the resonant frequency for linear distances down the length of the cochlea.

The basilar membrane is also attached to what is known as the organ of Corti. One important feature of the organ of Corti is that it has sensory cells called inner hair cells (IHC) that sense the motion of the basilar membrane. As the basilar membrane moves up and down in response to the pressure waves, it causes local movement of the cochlear fluid. The viscous drag of the fluid bends the cilia attached to the IHC. The bending of the cilia controls the ionic flow into the hair cells through a nonlinear channel. Due to this ionic current flow, charge builds up across the hair-cell membrane.


[Block diagram: Input → h(t; s) → ∂t → g(·) → w(t) → ∂s → v(s) → HWR → ∫T → Auditory spectrum; stages labeled Cochlea, Hair cell stage, and Cochlear nucleus.]
Figure 3: Mathematical model of the early auditory system consisting of filtering in the cochlea (analysis stage), conversion of mechanical displacement into electrical activity in the IHC (transduction stage), and the lateral inhibitory network in the cochlear nucleus (reduction stage) [1].

This mechanism converts the mechanical displacement of the basilar membrane into electrical activity. Once the potential builds up above a certain threshold, the hair cell fires. This neural spike is carried to the cochlear nucleus by the auditory nerve fibre. The neurons in the cochlear nucleus (CN) exhibit inhibition characteristics and it is believed that lateral inhibition exists in the cochlear nucleus. The lateral interaction of the neurons is spatially limited, that is, as the distance between the neurons increases, the interaction decreases [7].

3. MATHEMATICAL MODEL OF THE AUDITORY SYSTEM

3.1. Model of the early auditory system

Yang et al. [8] have presented a biophysically defensible mathematical model of the early auditory system. The model is shown in Figure 3 and described below.

When viewing the way the cochlea acts on signals of different frequencies from an engineering perspective, it can be seen that the cochlea has bandpass frequency responses for each location. An accurate but computationally prohibitive model would have a bank of bandpass filters with center frequencies corresponding to the resonant frequency of every point along the cochlea (the cochlea has about 3000 inner hair cells acting as transduction points). In practice, 10–20 filters per octave are considered an adequate approximation. The cochlear filters h(t; s) typically have 20 dB/decade rolloffs on the low-frequency side and a very sharp rolloff on the high-frequency side.
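To make this engineering view concrete, here is a minimal filter-bank sketch with logarithmically spaced center frequencies; the Butterworth order, channel count, and half-octave bandwidths are illustrative assumptions and are not the cochlear filters h(t; s) used by the authors.

```python
import numpy as np
from scipy.signal import butter, lfilter

def cochlear_like_filterbank(x, fs, n_channels=32, f_lo=180.0, f_hi=7000.0):
    """Filter x through a bank of bandpass filters with log-spaced center frequencies."""
    centers = np.geomspace(f_lo, f_hi, n_channels)      # exponential spacing along the bank
    outputs = []
    for fc in centers:
        lo, hi = fc / 2**0.25, fc * 2**0.25              # roughly half-octave band (assumed width)
        b, a = butter(2, [lo / (fs / 2), min(hi / (fs / 2), 0.99)], btype="band")
        outputs.append(lfilter(b, a, x))
    return centers, np.vstack(outputs)                   # (n_channels, len(x)) cochleagram-like output

# Example: a 1 kHz tone mostly excites the channels centered near 1 kHz.
fs = 16000
t = np.arange(fs) / fs
centers, y = cochlear_like_filterbank(np.sin(2 * np.pi * 1000 * t), fs)
```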

The coupling of the cochlear fluid and the inner hair cells is modeled by a time derivative (∂t). This can be justified since the extent of IHC cilia deflection depends on the viscous drag of the cochlear fluid and the drag is directly dependent on the velocity of motion. The nonlinearity of the ionic channel is modeled by a sigmoid-like function g(·) and the leakiness of the cell membrane is modeled by a lowpass filter w(t).

Lateral inhibition in the cochlear nucleus is modeled by a spatial derivative (∂s). The spatial derivative is leaky in the sense that it is accompanied by a local smoothing that reflects the limited spatial extent of the interactions of the CN neurons. Thus, the spatial derivative is often modeled along with a spatial lowpass filter v(s). The nonlinearity of the CN neurons is modeled by a half-wave rectifier (HWR) and the inability of the central auditory neurons to react to rapid temporal changes is modeled by temporal integration (∫T). The output of this model is referred to as the auditory spectrum and it has been shown that this representation is more robust to noise as compared to the normal power spectrum [1].

Figure 4: Schematic of the cortical model. It is proposed in [9] that the response fields of neurons in the primary auditory cortex are arranged along three mutually perpendicular axes: the tonotopic axis, the bandwidth or scale axis, and the symmetry or phase axis.

3.2. Cortical model

Wang and Shamma [9] have proposed a model of the spectral shape analysis in the primary auditory cortex. The schematic of the model is shown in Figure 4. According to this model, neurons in the primary auditory cortex (A1) are organized along three mutually perpendicular axes. The response fields of neurons lined along the tonotopic axis are tuned to different center frequencies. The bandwidth of the response field of neurons lined along the scale axis monotonically decreases along that axis. Along the symmetry axis, the response field of the neurons displays a systematic change in symmetry. At the center of A1, the response field has an excitatory center, surrounded by inhibitory sidebands. The response field tends to be more asymmetrical with increasing distance from the center of A1. It has been argued that the tonotopic axis is akin to a Fourier transform, and the presence of different scales over which this transform is performed leads to a multiscale Fourier transform. It has been shown that performing such


an operation on the auditory spectrum leads to the extraction of spatial and temporal modulation information [10]. This model is used to extract the cortical features explained in Section 4.

4. FEATURE EXTRACTION

4.1. Simplified model of the early auditory system

The cochlear filters h(t; s) are implemented through a bandpass filter bank (BPF), with 40 dB/decade rolloff on the low-frequency side. This models the 20 dB/decade cochlear filter rolloff and also provides a time differentiation of the input signal. The nonlinearity of the ionic channel g(·) is implemented by a sigmoid-like function. The temporal lowpass filter w(t) is ignored, and it has been shown that at moderate sound intensities this is a valid approximation [1]. The spatial derivative ∂s is approximated by a difference operation between adjacent frequency channels and the spatial lowpass filter v(s) is ignored (this corresponds to limiting the spatial extent of lateral inhibition of the CN neurons to adjacent channels). The half-wave rectification stage is retained and the temporal averaging is implemented by a lowpass filter (LPF).
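A rough sketch of this simplified chain (BPF → sigmoid → adjacent-channel difference → HWR → LPF) is given below. The sigmoid slope, filter orders, and smoothing cutoff are assumptions; only the band edges follow the 180 Hz to 7246 Hz range quoted later in Section 4.2.1.

```python
import numpy as np
from scipy.signal import butter, lfilter

def auditory_spectrum(x, fs, n_channels=128, f_lo=180.0, f_hi=7246.0):
    """Simplified early-auditory chain: BPF -> g(.) -> channel difference -> HWR -> LPF."""
    centers = np.geomspace(f_lo, f_hi, n_channels)
    chans = []
    for fc in centers:                                    # cochlear-like bandpass stage
        lo, hi = fc / 2**0.2, min(fc * 2**0.2, 0.49 * fs)
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        chans.append(lfilter(b, a, x))
    y = np.tanh(5.0 * np.vstack(chans))                   # g(.): sigmoid-like ionic-channel nonlinearity
    y = np.diff(y, axis=0)                                # spatial derivative: adjacent-channel difference
    y = np.maximum(y, 0.0)                                # half-wave rectification
    b, a = butter(2, (1.0 / 0.008) / (fs / 2))            # ~8 ms temporal smoothing (lowpass)
    return lfilter(b, a, y, axis=1)                       # (n_channels - 1, len(x)) auditory spectrum
```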

4.2. Features

4.2.1. Auditory model features

The auditory spectrum derived from the simplified model of the early auditory system is a two-dimensional time-frequency representation. The filter bank consists of 128 channels tuned from 180 Hz to 7246 Hz and the temporal averaging (lowpass filtering) is done over 8-millisecond "frames"; thus the auditory spectrum for one second of data is a 128 × 125 two-dimensional matrix. The neural response over time is modeled by a mean activity level (temporal average) and by the variation of activity over time (temporal variance). Thus, taking the temporal average and temporal variance of the auditory spectrum, we end up with a 256-dimensional feature vector for each one-second segment. We refer to these as the AM short features.
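The AM short features reduce to a temporal mean and variance per channel, as in this minimal sketch (the random matrix stands in for a real 128 × 125 auditory spectrum):

```python
import numpy as np

def am_short_features(aud_spec):
    """Temporal mean and variance of a (128 x 125) one-second auditory spectrum,
    concatenated into the 256-dimensional 'AM short' feature vector."""
    return np.concatenate([aud_spec.mean(axis=1), aud_spec.var(axis=1)])

features = am_short_features(np.random.rand(128, 125))   # illustrative input only
assert features.shape == (256,)
```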

4.2.2. Noise-robust auditory features

We modified the early auditory model to incorporate the log compression due to the outer hair cells [11] and also introduced a decorrelation stage. The decorrelation stage is important for practical reasons; while the neural networks naturally perform this operation, doing so explicitly reduces the training requirements for the networks. We refer to this representation as noise-robust auditory features (NRAF). The noise robustness of these features is shown elsewhere [2]. The NRAF extraction can be implemented in low-power analog VLSI circuitry as shown in Figure 5. The auditory spectrum is log compressed and transformed using a discrete cosine transform (DCT) which effectively decorrelates the channels. The temporal average and temporal variance of this representation yield a 256-dimensional feature vector for every one-second segment. We refer to these as the NRAF short features. The NRAF feature extraction is similar to

Figure 5: The bandpass-filtered version of the input is nonlinearly compressed and fed back to the input. The difference operation between lower and higher channels approximates a spatial derivative. The half-wave rectification followed by the smoothing filter picks out the peak. Log compression is performed followed by the DCT to decorrelate the signal.

extracting the continuous-time mel-frequency cepstral coefficients (MFCCs) [12]; however, the NRAF features are more faithful to the processes in the auditory system.
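The extra log-compression and DCT decorrelation stages can be sketched directly on top of an auditory spectrum; the floor constant and the use of an orthonormal DCT-II are assumptions made for the example.

```python
import numpy as np
from scipy.fftpack import dct

def nraf_short_features(aud_spec, eps=1e-8):
    """Log-compress the auditory spectrum, decorrelate channels with a DCT,
    then take temporal mean and variance (256-dimensional 'NRAF short' vector)."""
    logspec = np.log(aud_spec + eps)                        # outer-hair-cell-style log compression
    decorr = dct(logspec, type=2, norm="ortho", axis=0)     # DCT across the channel axis
    return np.concatenate([decorr.mean(axis=1), decorr.var(axis=1)])

nraf = nraf_short_features(np.abs(np.random.rand(128, 125)))   # illustrative input only
```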

4.2.3. Rate-scale-frequency features

A multiscale transform (performed using the cortical model) on the auditory spectrum leads to a four-dimensional representation referred to as rate-scale-frequency-time (RSFT) [9]. The processing done by the cortical model on the auditory spectrum is similar to a two-dimensional wavelet transform. Frequency represents the tonotopic axis in the basilar membrane and in our implementation is the center frequency of the bandpass filters of the early auditory model. Each unit along the time axis corresponds to 8 milliseconds. This is the duration over which the temporal integration is performed in the early auditory model. Rate corresponds to the center frequency of the temporal filters used in the transform and yields temporal modulation information. Scale corresponds to the center frequency of the spatial (frequency) filters used in the transform and yields spatial modulation information. The RSFT representation is collapsed across the time dimension to obtain the RSF features. Principal component analysis is performed to reduce the RSF features to a dimension of 256. These features are referred to as the RSF short features.

4.2.4. Mel-frequency cepstral coefficients

For the purpose of comparison, we also extracted the MFCCs. Each one-second training sample is divided into 32-millisecond frames with 50% overlap and 13 MFCC


[Feature-extraction pipelines: (a) early auditory model → auditory spectrum → temporal average and variance → AM short features; (b) early auditory model → log and DCT (NRAF) → temporal average and variance → NRAF short features; (c) early auditory model → cortical model → RSFT → temporal average → RSF → PCA → RSF short features.]
Figure 6: Extraction of the different features used in the classification.

coefficients are computed from each frame. The mean and variance of the 13 features over the one-second segment were calculated to give a 26-dimensional feature vector.
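A short sketch of this baseline feature extraction follows, using librosa as a stand-in MFCC front end (the authors' exact MFCC implementation is not specified, so the library choice and its default mel settings are assumptions):

```python
import numpy as np
import librosa

def mfcc_mean_var(y, sr):
    """13 MFCCs from 32 ms frames with 50% overlap; mean and variance over the
    one-second clip give the 26-dimensional baseline feature vector."""
    n_fft = int(0.032 * sr)                                  # 32 ms analysis frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=n_fft // 2)
    return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])   # shape (26,)
```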

Figure 6 gives a graphic representation of how the different feature sets are obtained. The NRAF short features differ from the AM short features in that they incorporate an additional log compression and decorrelation stage. The RSF short features are obtained by multiresolution processing on the auditory spectrum, followed by dimensionality reduction.

5. CLASSIFICATION METHODS

We used two different methods for classification: a Gaussian-mixture-model (GMM)-based classifier and a neural-net (NN)-based classifier. The GMM-based classifier is used as the nonanthropomorphic control case and was chosen because of its successful use in audio classification and speech recognition. It is easily trained and can match feature space morphologies.

5.1. GMM-based classifier

The feature vectors from each class were used to train the GMM models for those classes. During testing, the likelihood of a test sample belonging to each model is computed and the sample is assigned to the class whose model produces the highest likelihood. Diagonal covariance was assumed, with a separate covariance matrix for each of the mixtures. The priors are set based on the number of data samples in each mixture. To implement the GMM, we used the Netlab software provided by Nabney and Bishop [13].
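The paper uses Netlab; a rough scikit-learn equivalent of the same scheme (one diagonal-covariance GMM per class, decision by maximum log-likelihood) is sketched below as an assumption, not the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(features_by_class, n_components=4):
    """One diagonal-covariance GMM per class; features_by_class maps class label
    to an (n_samples, n_features) array."""
    return {c: GaussianMixture(n_components=n_components, covariance_type="diag",
                               random_state=0).fit(X)
            for c, X in features_by_class.items()}

def classify(models, x):
    """Assign a test vector to the class whose model gives the highest log-likelihood."""
    scores = {c: m.score_samples(x.reshape(1, -1))[0] for c, m in models.items()}
    return max(scores, key=scores.get)
```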

[Decision flow: input → "4 vs. rest" classifier → class 4, or → the "1 vs. 2", "1 vs. 3", and "2 vs. 3" classifiers → majority voting → class 1, 2, or 3.]
Figure 7: Flow chart showing the decision rights of the collaborative committee for the NN classifier.

5.2. NN-based classifier

Through iterative development, the optimal classification system was found to be a collaborative committee of four independently trained binary perceptrons. Three of the perceptrons were trained solely on two of the four classes (class 1 (noise) versus class 2 (animal sounds), class 1 versus class 3 (music), and class 2 versus class 3). Because of the linear separability of class 4 (speech) within the feature space, the fourth perceptron was trained on all four classes, learning to distinguish speech from the other three classes. All four perceptrons employed the tanh(·) decision function and conjugate gradient descent training with momentum learning. All training converged within 500 epochs, and consistent performance and generalization results were realized. With the four perceptrons independently trained, a collaborative committee was established with the decision rights shown in Figure 7. The binary classifier for classifying speech versus the rest was allowed to make the decisions on the speech class due to its ability to learn this class extremely well. A collaborative committee using majority voting was instituted to arbitrate amongst the other three binary classifiers. The performance curves for training and testing with the three different features are shown in Figures 8, 9, and 10. We see that while the NRAF short and RSF short features generalize well, the AM short features do not perform as well.
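The committee logic itself is small; the sketch below shows the decision rights of Figure 7, with the binary classifiers passed in as hypothetical stand-ins for the trained tanh perceptrons.

```python
from collections import Counter

def committee_decision(x, speech_vs_rest, pairwise):
    """Decision rights of the collaborative committee (Figure 7).

    speech_vs_rest(x) -> True if x is speech (class 4); pairwise is a dict of
    binary classifiers {(1, 2): f, (1, 3): f, (2, 3): f}, each returning the
    winning class label. Both are hypothetical stand-ins for the trained perceptrons.
    """
    if speech_vs_rest(x):                        # the speech gate decides class 4 on its own
        return 4
    votes = [clf(x) for clf in pairwise.values()]
    return Counter(votes).most_common(1)[0][0]   # majority vote among classes 1-3
```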

6. EXPERIMENTS

MFCCs are the most popular features in state-of-the-art audio classification and speech recognition. Peltonen et al. [14] showed that MFCCs used in conjunction with GMM-based classifiers performed very well for an auditory scene recognition experiment involving identifying 17 different auditory scenes from amongst 26 scenes. They reported near-human-like performance when using 30 seconds of data to perform the scene recognition. We use a similar approach using MFCCs and a GMM-based classifier as the baseline system.


Figure 8: Performance curves during training and testing of the NN classifier with the NRAF short features (training 96.07%; generalization 91.99%).

Figure 9: Performance curves during training and testing of the NN classifier with the AM short features (training 94.87%; generalization 82.56%).

The database consisted of four classes: noise, animal sounds, music, and speech. Each of the sound samples was a second long. The noise class was comprised of nine different types of noises from the NOISEX database, which included babble noise. The animal class was comprised of a random selection of animal sounds from the BBC Sound Effects audio CD collection. The music class was formulated using the RWC music database [15] and included different genres of music. The speech class was made up of spoken digits from the TIDIGITS and AURORA databases. The training set consisted of a total of 4325 samples with 1144 noise, 732 animal, 1460 music, and 989 speech samples, and the test set consisted

Figure 10: Performance curves during training and testing of the NN classifier with the RSF short features (training 99.28%; generalization 95.28%).

of 1124 samples with 344 noise, 180 animal, 354 music, and 246 speech samples.

Dimensionality reduction was necessitated by the inability of the GMM to handle large-dimensional feature vectors. For the AM short features, it was empirically found that reducing to a 64-dimensional vector by using principal component analysis (PCA) provided the best result. Since the PCA helps decorrelate the features, a diagonal covariance matrix was used in the GMMs. Performing linear discriminant analysis [16] for dimensionality reduction and decorrelation did not provide better results compared to PCA. The NRAF short features were also reduced to 64 dimensions similarly. For these two feature sets, a 4-mixture GMM was used to perform the classification. The RSF short features were further reduced to a dimension of 55 using PCA, and a 6-mixture GMM was used to perform the classification. For MFCCs, a 4-mixture GMM was used. The GMMs were optimized to give the best results on the database. However, the improvement in accuracy comes at the cost of reduced generalization ability.
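The PCA step used here is standard; a minimal numpy version (the 64-component default matches the reduction quoted above, everything else is generic) could look as follows:

```python
import numpy as np

def pca_reduce(X, n_components=64):
    """Project feature vectors onto their top principal components.

    X is (n_samples, n_features); returns the reduced data together with the
    mean and projection matrix so test data can be transformed consistently.
    """
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:n_components].T                     # (n_features, n_components)
    return (X - mean) @ W, mean, W
```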

Further experiments were performed to incorporate temporal information into the baseline classifier (the GMM-based classifier using MFCCs). The MFCCs per frame were used to generate models for each class. It was determined that using 9 mixtures gave the best result. In the test phase, the log likelihood of each frame in the one-second segment belonging to each of the classes was computed and summed over one second. The class with the highest summed log likelihood was declared the winner.

7. RESULTS

As can be seen from Table 1, the NRAF short features outperform MFCCs using the GMM-based classifier. The NRAF short features also outperform the AM short features using both


Table 1: Performance (% correct) for different features and different classifiers.

Classifier   MFCC     NRAF short   AM short   RSF short
GMM          85.85%   90.21%       71.79%     70.99%
NN           —        91.99%       82.56%     95.28%

Table 2: Confusion matrix (rows give the decision and columns give the true class) for MFCC with GMM. This method gave an accuracy of 85.85%.

         Noise   Animal   Music   Speech
Noise    310     18       30      0
Animal   0       140      55      0
Music    34      22       269     0
Speech   0       0        0       246

Table 3: Confusion matrix (rows give the decision and columns give the true class) for NRAF short with GMM. This gave an accuracy of 90.21%.

         Noise   Animal   Music   Speech
Noise    294     19       1       0
Animal   50      140      12      3
Music    0       9        339     2
Speech   0       12       2       241

Table 4: Confusion matrix (rows give the decision and columns give the true class) for NRAF short with NN. This gave an accuracy of 91.99%.

         Noise   Animal   Music   Speech
Noise    340     34       6       0
Animal   0       133      31      2
Music    4       10       317     0
Speech   0       3        0       244

the GMM- and the NN-based classifiers. The RSF short features outperform the NRAF short features when using the NN classifier. The results for all the features (except MFCCs, which were not tested with the NN classifier) were better with the NN classifier than with the GMM classifier. Tables 2, 3, 4, 5, 6, 7, and 8 give the confusion matrices of the various features for the two different classifiers.

It is seen from the confusion matrices that MFCCs do a very good job of learning the speech class. All the other features used are also able to separate out the speech class with reasonable accuracy, indicating the separability of the speech class in the feature space. It is interesting to note that most of the mistakes by MFCCs in the noise class are misclassifications as music (Table 2), but NRAF makes most of its mistakes in this class as misclassifications into the animal class (Table 6), which is more acceptable as some of the animal sounds are very close to noise. The animal class seems to be the most difficult to learn, but RSF short features in

Table 5: Confusion matrix (rows give the decision and columns give the true class) for AM short with GMM. This gave an accuracy of 71.79%.

         Noise   Animal   Music   Speech
Noise    181     23       10      2
Animal   19      121      33      10
Music    52      36       285     14
Speech   92      0        26      220

Table 6: Confusion matrix (rows give the decision and columns give the true class) for AM short with NN. This gave an accuracy of 82.56%.

         Noise   Animal   Music   Speech
Noise    297     33       27      0
Animal   6       106      34      4
Music    39      40       284     1
Speech   2       1        9       241

Table 7: Confusion matrix (rows give the decision and columns give the true class) for RSF short with GMM. This gave an accuracy of 70.99%.

         Noise   Animal   Music   Speech
Noise    136     6        6       0
Animal   4       142      48      1
Music    54      30       280     5
Speech   150     2        20      240

Table 8: Confusion matrix (rows give the decision and columns give the true class) for RSF short with NN. This gave an accuracy of 95.28%.

         Noise   Animal   Music   Speech
Noise    337     28       2       0
Animal   0       143      5       0
Music    7       9        347     2
Speech   0       0        0       244

Table 9: Performance (% correct) for MFCCs. The one-second segment was divided into 30-millisecond frames, and the final decision was made by combining the frame decisions.

Classifier            Performance
GMM + majority rule   81.49%

conjunction with the NN classifier do a good job of learning this class. Most of the mistakes in this case are misclassifications as noise.

The result of incorporating temporal information into the GMM-based classifier is shown in Table 9. It is seen that the performance decreases in comparison with using the mean and variance of the MFCCs over one second. This could be attributed to the fact that there is too much variability in each of the classes. Performing temporal smoothing over one second makes the features more robust.


8. CONCLUSIONS

We have shown that, for the given four-class audio classification problem, features derived from a model of the auditory system combine better with an NN classifier than with a GMM-based classifier. The GMM-based classifier was optimized to give the best results for the database, while the NN classifier was trained with generalization in mind. The accuracy of the NN classifier can be increased, but at the cost of reducing its generalization ability. It could be argued that the small number of classes considered, combined with the high dimensionality of the feature space, might render the classes linearly separable and hence aid the NN approach. The performance of the GMM- and neural-network-based classifiers was not tested for a large number of classes, and the scalability of the NN classifier to a large number of classes is an open question. Neural networks, however, provide an efficient and natural way of handling the large-dimensional feature vectors obtained from models of the human auditory system.

ACKNOWLEDGMENTS

The authors are grateful to Shihab Shamma for the very useful insights he provided on the working of the auditory model. They would also like to thank Malcolm Slaney for all his help and the Telluride Neuromorphic Engineering Workshop for having motivated this work.

REFERENCES

[1] K. Wang and S. Shamma, "Self-normalization and noise-robustness in early auditory representations," IEEE Trans. Speech Audio Processing, vol. 2, no. 3, pp. 421–435, 1994.

[2] S. Ravindran, D. Anderson, and M. Slaney, "Low-power audio classification for ubiquitous sensor networks," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '04), Montreal, Canada, May 2004.

[3] A. Teolis and S. Shamma, "Classification of transient signals via auditory representations," Tech. Rep. TR 91-99, Systems Research Center, University of Maryland, College Park, Md, USA, 1991.

[4] E. Kandel, J. Schwartz, and T. Jessel, Principles of Neural Science, McGraw-Hill, New York, NY, USA, 4th edition, 2000.

[5] R. Berne and M. Levy, Physiology, Mosby, New York, NY, USA, 4th edition, 1998.

[6] P. Denes and E. Pinson, The Speech Chain, Freeman, New York, NY, USA, 2nd edition, 1993.

[7] C. Koch and I. Segev, Eds., Methods in Neural Modelling, MIT Press, Cambridge, Mass, USA, 1989.

[8] X. Yang, K. Wang, and S. Shamma, "Auditory representations of acoustic signals," IEEE Trans. Inform. Theory, vol. 38, no. 2, pp. 824–839, 1992.

[9] K. Wang and S. Shamma, "Spectral shape analysis in the central auditory system," IEEE Trans. Speech Audio Processing, vol. 3, no. 5, pp. 382–395, 1995.

[10] L. Atlas and S. Shamma, "Joint acoustic and modulation frequency," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 7, pp. 668–675, 2003.

[11] J. O. Pickles, An Introduction to the Physiology of Hearing, Academic Press, New York, NY, USA, 2nd edition, 2000.

[12] P. D. Smith, M. Kucic, R. Ellis, P. Hasler, and D. V. Anderson, "Mel-frequency cepstrum encoding in analog floating-gate circuitry," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS '02), vol. 4, pp. 671–674, Scottsdale, Ariz, USA, May 2002.

[13] I. Nabney and C. Bishop, "Netlab neural network software," http://www.ncrg.aston.ac.uk/netlab/.

[14] V. Peltonen, J. Tuomi, A. Klapuri, J. Huopaniemi, and T. Sorsa, "Computational auditory scene recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '02), Orlando, Fla, USA, May 2002.

[15] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Music genre database and musical instrument sound database," in Proc. 4th International Conference on Music Information Retrieval (ISMIR '03), pp. 229–230, Baltimore, Md, USA, October 2003.

[16] J. Duchene and S. Leclercq, "An optimal transformation for discriminant and principal component analysis," IEEE Trans. Pattern Anal. Machine Intell., vol. 10, no. 6, pp. 978–983, 1988.

Sourabh Ravindran received the B.E. degree in electronics and communication engineering from Bangalore University, Bangalore, India, in October 2000, and the M.S. degree in electrical and computer engineering from the Georgia Institute of Technology (Georgia Tech), Atlanta, Ga, in August 2003. He is currently pursuing the Ph.D. degree at Georgia Tech. His research interests include audio classification, auditory modeling, and speech recognition. He is a Student Member of IEEE.

Kristopher Schlemmer graduated summa cum laude as a Commonwealth Scholar from the University of Massachusetts Dartmouth in 2000, earning his B.S.E.E. degree, and from the Georgia Institute of Technology in 2004, earning his M.S.E.C.E. degree. Mr. Schlemmer is currently employed at Raytheon Integrated Defense Systems in Portsmouth, Rhode Island, and his interests include analog design, digital signal processing, artificial intelligence, and neurocomputing.

David V. Anderson was born and raised inLa Grande, Ore. He received the B.S. de-gree in electrical engineering (magna cumlaude) and the M.S. degree from BrighamYoung University in August 1993 and April1994, respectively, where he worked on thedevelopment of a digital hearing aid. He re-ceived the Ph.D. degree from the GeorgiaInstitute of Technology (Georgia Tech), At-lanta, in March 1999. He is currently on thefaculty at Georgia Tech. His research interests include audition andpsychoacoustics, signal processing in the context of human au-ditory characteristics, and the real-time application of such tech-niques. He is also actively involved in the development and promo-tion of computer-enhanced education. Dr. Anderson is a Memberof the Acoustical Society of America, Tau Beta Pi, and the AmericanSociety for Engineering Education.

EURASIP Journal on Applied Signal Processing 2005:9, 1382–1399
© 2005 Hindawi Publishing Corporation

A Two-Channel Training Algorithm for Hidden Markov Model and Its Application to Lip Reading

Liang Dong
Department of Electrical and Computer Engineering, National University of Singapore, Singapore 119260
Email: [email protected]

Say Wei Foo
School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
Email: [email protected]

Yong Lian
Department of Electrical and Computer Engineering, National University of Singapore, Singapore 119260
Email: [email protected]

Received 1 November 2003; Revised 12 May 2004

Hidden Markov model (HMM) has been a popular mathematical approach for sequence classification such as speech recognition since the 1980s. In this paper, a novel two-channel training strategy is proposed for discriminative training of HMM. For the proposed training strategy, a novel separable-distance function that measures the difference between a pair of training samples is adopted as the criterion function. The symbol emission matrix of an HMM is split into two channels: a static channel to maintain the validity of the HMM and a dynamic channel that is modified to maximize the separable distance. The parameters of the two-channel HMM are estimated by iterative application of expectation-maximization (EM) operations. As an example of the application of the novel approach, a hierarchical speaker-dependent visual speech recognition system is trained using the two-channel HMMs. Results of experiments on identifying a group of confusable visemes indicate that the proposed approach is able to increase the recognition accuracy by an average of 20% compared with the conventional HMMs that are trained with the Baum-Welch estimation.

Keywords and phrases: viseme recognition, two-channel hidden Markov model, discriminative training, separable-distance function.

1. INTRODUCTION

The focus of most automatic speech recognition techniques is on the spoken sounds alone. If the speaking environment is noise free and the recognition engine is well configured, a high recognition rate is attainable for most speakers. However, in real-world environments such as offices, bus stations, shops, and factories, the captured speech may be greatly polluted by background noise and cross-speaker noise. When such a signal is presented to a sound-based speech recognition system, the recognition accuracy may drop dramatically. One solution for enhancing speech recognition accuracy under noisy conditions is to jointly process information from multiple modalities of speech. Automatic lip reading is one mode in which the visual aspect of speech is considered for speech recognition.

It has long been observed that the presence of visual cues such as the movement of the lips, facial muscles, teeth, and tongue may enhance human speech perception. Systematic studies on lip reading have been carried out since the 1950s [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Sumby and Pollack [1] showed that the incorporation of visual information added an equivalent 12 dB gain in signal-to-noise ratio.

Among the various techniques for visual speech recognition, hidden Markov model (HMM) holds the greatest promise due to its capabilities in modeling and analyzing temporal processes, as reported in [9, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]. Most of the reported HMM-based visual speech processing systems take an individual word as the basic recognition unit, and an HMM is trained to model it. Such an approach works well with a limited vocabulary such as a digit set [15, 30], a small number of AVletters [31], and isolated words or nonsense words [32], but it is difficult to extend these methods to large-vocabulary recognition tasks, as a great number of word models has to be trained. One solution to this problem is to build subword models such as phoneme models. Any word that is presented to the recognition system is broken down into subwords. In this way, even if a word was not included in training the system, the system can still make a good guess at its identity.


The smallest visibly distinguishable unit of visual speech is commonly referred to as a viseme [33]. Like phonemes, which are the basic building blocks of the sound of a language, visemes are the basic constituents of the visual representation of words. The time variation of the mouth shape in speech is small compared with the corresponding variation of the acoustic waveform. Some previous experiments indicate that the traditional HMM classifiers, which are trained with the Baum-Welch algorithm, are sometimes unable to separate mouth shapes with small differences [34]. Such small differences have prompted some researchers to regard the relationship between phonemes and visemes as a many-to-one mapping. For example, although the phonemes /b/, /m/, /p/ are acoustically distinguishable, the sequences of mouth shapes for the three sounds are not readily distinguishable; hence the three phonemes are grouped into one viseme category. An early viseme grouping was suggested by Binnie et al. [35]. The MPEG-4 multimedia standard adopted the same viseme grouping strategy for face animation, in which fourteen viseme groups are included [36]. However, different groupings are adopted by different researchers to fulfill specific requirements [37, 38].

Motivated by the need to find an approach for differentiating visemes that are only slightly different, we propose a novel approach to improve the discriminative power of the HMM classifiers. The approach aims at amplifying the separable distance between a pair of training samples. A two-channel HMM is developed: one channel, called the static channel, is kept fixed to maintain the validity of the probabilistic framework, and the other channel, called the dynamic channel, is modified to amplify the difference between the training pair.

A hierarchical classifier is also proposed based on the two-channel training strategy. At the top level, broad identification is performed, and fine identification is subsequently carried out within the broad category identified. Experimental results indicate that the proposed classifier outperforms the traditional ML HMM classifier in identifying the mouth shapes.

Although the proposed method is developed for the recognition of visemes, it can also be applied to any sequence classification problem. As such, the theoretical background and the training strategy of the two-channel discriminative training method are introduced first in Sections 2, 3, and 4. This is followed by a discussion of the general properties and extensions of the training strategy in Sections 5 and 6, respectively. Details of the application of the method to viseme recognition and the experimental results obtained are given in Section 7. The concluding remarks are presented in Section 8.

2. REVIEW OF HIDDEN MARKOV MODEL

Hidden Markov model is also referred to as hidden Markov process (HMP), as the latter emphasizes the stochastic process rather than the model itself. HMP was first introduced by Baum and Petrie [39] in 1966. The basic theories and properties of HMP were introduced in full generality in a series of papers by Baum and his colleagues [40, 41, 42, 43], which include the convergence of the entropy function of an HMP, the computation of the conditional probability, and the local convergence of the maximum likelihood (ML) parameter estimation of HMM. Application of HMM to speech processing took place in the mid-1970s. A phonetic speech recognition system that adopts an HMM-based classifier was first developed at IBM [44, 45]. Applications of HMM to speech processing were further explored by Rabiner and Juang [46, 47].

The beauty of HMM is that it is able to reveal the underlying process of signal generation even though the properties of the signal source remain largely unknown. Assume that $O^M = \{O_1, O_2, \ldots, O_M\}$ is the discrete set of observed symbols and $S^N = \{S_1, S_2, \ldots, S_N\}$ is the set of states; an $N$-state-$M$-symbol discrete HMM $\theta(\pi, A, B)$ consists of the following three components.

(1) The probability array of the initial state: $\pi = [\pi_i] = [P(s_1 = S_i)]_{1\times N}$, where $s_1$ is the first state in the state chain.

(2) The state-transition matrix: $A = [a_{ij}] = [P(s_{t+1} = S_j \mid s_t = S_i)]_{N\times N}$, where $s_{t+1}$ and $s_t$ denote the $(t+1)$th state and the $t$th state in the state chain.

(3) The symbol emission probability matrix: $B = [b_{ij}] = [P(o_t = O_j \mid s_t = S_i)]_{N\times M}$, where $o_t$ is the $t$th observed symbol in the observation sequence.

In a $K$-class identification problem, assume that $x^T = (x_1, x_2, \ldots, x_T)$ is a sample of a particular class, say class $d_i$. The probability of occurrence of the sample $x^T$ given the HMM $\theta(\pi, A, B)$, denoted by $P(x^T \mid \theta)$, is computed using either the forward or the backward process, and the optimal hidden-state chain is revealed using Viterbi matching [46]. Training of the HMM is the process of determining the parameter set $\theta(\pi, A, B)$ to fulfill a certain criterion function such as $P(x^T \mid \theta)$ or the mutual information [46, 48]. For training of the HMM, the Baum-Welch training algorithm is popularly adopted. The Baum-Welch algorithm is an ML estimation; thus the HMM so obtained, $\theta_{ML}$, is the one that maximizes the probability $P(x^T \mid \theta)$. Mathematically,
$$\theta_{ML} = \arg\max_{\theta}\big[P(x^T \mid \theta)\big]. \quad (1)$$
The Baum-Welch training can be realized at a relatively high speed, as expectation-maximization (EM) estimation is adopted in the training process.
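For concreteness, the following minimal sketch (not from the paper; plain NumPy with illustrative names) shows how a score such as $P(x^T \mid \theta)$ in (1) is typically computed with the scaled forward recursion for a discrete HMM.

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Log P(obs | theta) for a discrete HMM via the scaled forward recursion.

    pi  : (N,)   initial state probabilities
    A   : (N, N) state-transition matrix, A[i, j] = P(s_{t+1}=j | s_t=i)
    B   : (N, M) symbol emission matrix,  B[i, k] = P(o_t=k | s_t=i)
    obs : sequence of integer symbol indices in {0, ..., M-1}
    """
    alpha = pi * B[:, obs[0]]            # forward variables at t = 1
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()              # per-frame scaling avoids underflow
        log_prob += np.log(scale)
        alpha /= scale
    return log_prob

# Toy usage: a 2-state, 3-symbol model and a short observation sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_log_likelihood(pi, A, B, [0, 1, 2, 2]))
```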

However, the parameters of the HMM are solely determined by the correct samples, while the relationship between the correct samples and incorrect ones is not taken into consideration. The method, in its original form, is thus not developed for fine recognition. If another sample $y^T$ of class $d_j$ ($j \neq i$) is similar to $x^T$, the scored probability $P(y^T \mid \theta)$ may be close to $P(x^T \mid \theta)$, and $\theta_{ML}$ may not be able to distinguish $x^T$ and $y^T$. One solution to this problem is to adopt a training strategy that maximizes the mutual information $I_M(\theta, x^T)$ defined as
$$I_M(\theta, x^T) = \log P(x^T \mid \theta) - \log \sum_{\theta' \neq \theta} P(x^T \mid \theta')\, P(\theta'). \quad (2)$$

Figure 1: The block diagram of a two-channel HMM (a static-channel HMM and a dynamic-channel HMM combined state by state).

This method is referred to as maximum mutual information (MMI) estimation [48]. It increases the a posteriori probability of the model corresponding to the training data, and thus the overall discriminative power of the HMM obtained is guaranteed. However, analytical solutions to (2) are difficult to realize and the implementation of MMI estimation is tedious. A computationally less intensive approach is desirable.

3. PRINCIPLES OF TWO-CHANNEL HMM

To improve the discriminative ability of HMM and, at the same time, to facilitate the process of parameter tuning, the following two-channel training method is proposed, in which the HMM is specially tailored to amplify the difference between two similar samples.

The block diagram of the two-channel HMM is given in Figure 1. It consists of a static-channel HMM and a dynamic-channel HMM. For the static channel, a normal HMM derived from a parameter-smoothed ML approach is used. A new HMM for the dynamic channel is to be derived. Details of the derivation of the dynamic-channel HMM are described in the following paragraphs.

Assume that in a two-class identification problem, $x^T : d_1$ and $y^T : d_2$ are a pair of training samples, where $x^T = (x_1^T, x_2^T, \ldots, x_T^T)$ and $y^T = (y_1^T, y_2^T, \ldots, y_T^T)$ are observation sequences of length $T$ and $d_1$ and $d_2$ are the class labels. The observed symbols in $x^T$ and $y^T$ are from the symbol set $O^M$. $P(x^T \mid \theta)$ and $P(y^T \mid \theta)$ are the scored probabilities for $x^T$ and $y^T$ given HMM $\theta$, respectively. The pair of training samples $x^T$ and $y^T$ must be of the same length so that their probabilities $P(x^T \mid \theta)$ and $P(y^T \mid \theta)$ can be suitably compared. Such a comparison is meaningless if the samples are of different lengths; the shorter sequence may give a larger probability than the longer one even if it is not the true sample of $\theta$.

Define a new function $I(x^T, y^T, \theta)$, called the separable-distance function, as follows:
$$I(x^T, y^T, \theta) = \log P(x^T \mid \theta) - \log P(y^T \mid \theta). \quad (3)$$
A large value of $I(x^T, y^T, \theta)$ would mean that $x^T$ and $y^T$ are more distinct and separable. The strategy then is to determine the HMM $\theta_{MSD}$ (MSD for maximum separable distance) that maximizes $I(x^T, y^T, \theta)$. Mathematically,
$$\theta_{MSD} = \arg\max_{\theta}\big[I(x^T, y^T, \theta)\big]. \quad (4)$$

For the proposed training strategy, the parameter set for the static-channel HMM is determined in the normal way, such as by the ML approach. For the dynamic-channel HMM, to maintain synchronization of the duration and transition of states, the same set of values for $\pi$ and $A$ as derived for the static-channel HMM is used; only the parameters of matrix $B$ are adjusted.

As a first step towards the maximization of the separable-distance function $I(x^T, y^T, \theta)$, an auxiliary function $F(x^T, y^T, \theta, \lambda)$ involving $I(x^T, y^T, \theta)$ and the parameters of $B$ is defined as
$$F(x^T, y^T, \theta, \lambda) = I(x^T, y^T, \theta) + \sum_{i=1}^{N} \lambda_i \Big(1 - \sum_{j=1}^{M} b_{ij}\Big), \quad (5)$$
where $\lambda_i$ is the Lagrange multiplier for the $i$th state and $\sum_{j=1}^{M} b_{ij} = 1$ ($i = 1, 2, \ldots, N$). By maximizing $F(x^T, y^T, \theta, \lambda)$, $I(x^T, y^T, \theta)$ is also maximized. Differentiating $F(x^T, y^T, \theta, \lambda)$ with respect to $b_{ij}$ and setting the result to 0, we have
$$\frac{\partial \log P(x^T \mid \theta)}{\partial b_{ij}} - \frac{\partial \log P(y^T \mid \theta)}{\partial b_{ij}} = \lambda_i. \quad (6)$$
Since $\lambda_i$ is positive, the optimum value obtained for $I(x^T, y^T, \theta)$ is a maximum, as the solutions for $b_{ij}$ must be positive. In (6), $\log P(x^T \mid \theta)$ and $\log P(y^T \mid \theta)$ may be computed by summing up all the probabilities over time $T$:
$$\log P(x^T \mid \theta) = \sum_{\tau=1}^{T} \log \sum_{i=1}^{N} P(s_\tau^T = S_i)\, b_i(x_\tau^T). \quad (7)$$
Note that the state-transition coefficients $a_{ij}$ do not appear explicitly in (7); they are included in the term $P(s_\tau^T = S_i)$.


The two partial derivatives in (6) may be evaluated separately as follows:
$$\frac{\partial \log P(x^T \mid \theta)}{\partial b_{ij}} = b_{ij}^{-1} \sum_{\substack{\tau=1 \\ x_\tau^T = O_j}}^{T} P(s_\tau^T = S_i \mid \theta, x^T) = b_{ij}^{-1} \sum_{\tau=1}^{T} P(s_\tau^T = S_i, x_\tau^T = O_j \mid \theta, x^T),$$
$$\frac{\partial \log P(y^T \mid \theta)}{\partial b_{ij}} = b_{ij}^{-1} \sum_{\substack{\tau=1 \\ y_\tau^T = O_j}}^{T} P(s_\tau^T = S_i \mid \theta, y^T) = b_{ij}^{-1} \sum_{\tau=1}^{T} P(s_\tau^T = S_i, y_\tau^T = O_j \mid \theta, y^T). \quad (8)$$
By defining
$$E(S_i, O_j \mid \theta, x^T) = \sum_{\tau=1}^{T} P(s_\tau^T = S_i, x_\tau^T = O_j \mid \theta, x^T),$$
$$E(S_i, O_j \mid \theta, y^T) = \sum_{\tau=1}^{T} P(s_\tau^T = S_i, y_\tau^T = O_j \mid \theta, y^T),$$
$$D_{ij}(x^T, y^T, \theta) = E(S_i, O_j \mid \theta, x^T) - E(S_i, O_j \mid \theta, y^T), \quad (9)$$
equation (6) can be written as
$$\frac{E(S_i, O_j \mid \theta, x^T) - E(S_i, O_j \mid \theta, y^T)}{b_{ij}} = \frac{D_{ij}(x^T, y^T, \theta)}{b_{ij}} = \lambda_i, \quad 1 \le j \le M. \quad (10)$$
By making use of the fact that $\sum_{j=1}^{M} b_{ij} = 1$, it can be shown that
$$b_{ij} = \frac{D_{ij}(x^T, y^T, \theta)}{\sum_{j=1}^{M} D_{ij}(x^T, y^T, \theta)}, \quad i = 1, 2, \ldots, N,\ j = 1, 2, \ldots, M. \quad (11)$$
The set $\{b_{ij}\}$ ($i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M$) so obtained gives the maximum value of $I(x^T, y^T, \theta)$.

An algorithm for the computation of these values may be developed by using the standard expectation-maximization (EM) technique. By considering $x^T$ and $y^T$ as the observed data and the state sequence $s^T = (s_1^T, s_2^T, \ldots, s_T^T)$ as the hidden or unobserved data, the estimation of $E_\theta(I) = E[I(x^T, y^T, s^T \mid \bar\theta) \mid x^T, y^T, \theta]$ from the incomplete data $x^T$ and $y^T$ is then given by [49]
$$E_\theta(I) = \sum_{s^T \in S} I(x^T, y^T, s^T \mid \bar\theta)\, P(x^T, y^T, s^T \mid \theta) = \sum_{s^T \in S} \big[\log P(x^T, s^T \mid \bar\theta) - \log P(y^T, s^T \mid \bar\theta)\big]\, P(x^T, y^T, s^T \mid \theta), \quad (12)$$
where $\theta$ and $\bar\theta$ are the HMM before training and the HMM after training, respectively, and $S$ denotes all the state combinations with length $T$. The purpose of the E-step of the EM estimation is to calculate $E_\theta(I)$. By using the auxiliary function $Q_x(\theta, \bar\theta)$ proposed in [48] and defined as follows:
$$Q_x(\theta, \bar\theta) = \sum_{s^T \in S} \log P(x^T, s^T \mid \bar\theta)\, P(x^T, s^T \mid \theta), \quad (13)$$
equation (12) can be written as
$$E_\theta(I) = Q_x(\theta, \bar\theta)\, P(y^T \mid s^T, \theta) - Q_y(\theta, \bar\theta)\, P(x^T \mid s^T, \theta). \quad (14)$$

$Q_x(\theta, \bar\theta)$ and $Q_y(\theta, \bar\theta)$ may be further analyzed by breaking up the probability $P(x^T, s^T \mid \bar\theta)$ as follows:
$$P(x^T, s^T \mid \bar\theta) = \bar\pi(s_0) \prod_{\tau=1}^{T} \bar a_{s_{\tau-1}, s_\tau}\, \bar b_{s_\tau}(x_\tau), \quad (15)$$
where $\bar\pi$, $\bar a$, and $\bar b$ are the parameters of $\bar\theta$. Here, we assume that the initial distribution starts at $\tau = 0$ instead of $\tau = 1$ for notational convenience. The Q function then becomes
$$Q_x(\theta, \bar\theta) = \sum_{s^T \in S} \log \bar\pi(s_0)\, P(x^T, s^T \mid \theta) + \sum_{s^T \in S} \Big(\sum_{\tau=1}^{T} \log \bar a_{\tau-1,\tau}\Big) P(x^T, s^T \mid \theta) + \sum_{s^T \in S} \Big(\sum_{\tau=1}^{T} \log \bar b_\tau(x_\tau)\Big) P(x^T, s^T \mid \theta). \quad (16)$$
The parameters to be optimized are now separated into three independent terms.

From (14) and (16), $E_\theta(I)$ can also be divided into the following three terms:
$$E_\theta(I) = E_\theta(\pi, I) + E_\theta(a, I) + E_\theta(b, I), \quad (17)$$
where
$$E_\theta(\pi, I) = \sum_{s^T \in S} \log \bar\pi(s_0)\, \big[P(x^T, y^T, s^T \mid \theta) - P(x^T, y^T, s^T \mid \theta)\big] = 0,$$
$$E_\theta(a, I) = \sum_{s^T \in S} \sum_{\tau=1}^{T} \log \bar a_{\tau-1,\tau}\, \big[P(x^T, y^T, s^T \mid \theta) - P(x^T, y^T, s^T \mid \theta)\big] = 0,$$
$$E_\theta(b, I) = \sum_{s^T \in S} \Big[\sum_{\tau=1}^{T} \log \bar b_\tau(x_\tau) - \sum_{\tau=1}^{T} \log \bar b_\tau(y_\tau)\Big]\, P(x^T, y^T, s^T \mid \theta). \quad (18)$$

Figure 2: The two-channel structure of the ith state of a left-right HMM (static channel $[b_{ij}^{s}]$ with weightage $1-\omega_i$; dynamic channel $[b_{ij}^{d}]$ with weightage $\omega_i$).

$E_\theta(\pi, I)$ and $E_\theta(a, I)$ are associated with the hidden-state sequence $s^T$. It is assumed that $x^T$ and $y^T$ are drawn independently and emitted from the same state sequence $s^T$; hence both $E_\theta(\pi, I)$ and $E_\theta(a, I)$ become 0. $E_\theta(b, I)$, on the other hand, is related to the symbols that appear in $x^T$ and $y^T$ and contributes to $E_\theta(I)$. By enumerating all the state combinations, we have
$$E_\theta(b, I) = \sum_{i=1}^{N} \sum_{\tau=1}^{T} \big[\log \bar b_i(x_\tau^T) - \log \bar b_i(y_\tau^T)\big]\, P(x^T, y^T, s_\tau^T = S_i \mid \theta). \quad (19)$$

If $\sum_{\tau=1}^{T} [\log \bar b_i(x_\tau^T) - \log \bar b_i(y_\tau^T)]$ is arranged according to the order of appearance of the symbols ($O_j$) within $x^T$ and $y^T$, we have
$$E_\theta(b, I) = \sum_{i=1}^{N} \sum_{j=1}^{M} \log \bar b_{ij} \sum_{\tau=1}^{T} \big[P(x_\tau^T = O_j, s_\tau^T = S_i \mid \theta, x^T) - P(y_\tau^T = O_j, s_\tau^T = S_i \mid \theta, y^T)\big] \times P(x^T, y^T \mid \theta) \quad (20)$$
or
$$E_\theta(b, I) = \sum_{i=1}^{N} \sum_{j=1}^{M} \log \bar b_{ij}\, \big[E(S_i, O_j \mid \theta, x^T) - E(S_i, O_j \mid \theta, y^T)\big] \times P(x^T, y^T \mid \theta). \quad (21)$$

In the M-step of the EM estimation, $\bar b_{ij}$ is adjusted to maximize $E_\theta(b, I)$, or equivalently $E_\theta(I)$. Since $\sum_{j=1}^{M} \bar b_{ij} = 1$ and (21) has the form $K \sum_{j=1}^{M} w_j \log v_j$, which attains a global maximum at the point $v_j = w_j / \sum_{j=1}^{M} w_j$ ($j = 1, 2, \ldots, M$), the reestimated value $\bar b_{ij}$ of $\bar\theta$ that leads to the maximum $E_\theta(I)$ is given by
$$\bar b_{ij} = \frac{E(S_i, O_j \mid \theta, x^T) - E(S_i, O_j \mid \theta, y^T)}{\sum_{j=1}^{M} \big[E(S_i, O_j \mid \theta, x^T) - E(S_i, O_j \mid \theta, y^T)\big]} = \frac{D_{ij}(x^T, y^T, \theta)}{\sum_{j=1}^{M} D_{ij}(x^T, y^T, \theta)}. \quad (22)$$
This equation, compared with (11), enables the reestimation of the symbol emission coefficients $\bar b_{ij}$ from expectations of the existing HMM. The above derivations strictly observe the standard optimization strategy [49], where the expectation of the value of the separable-distance function, $E_\theta(I)$, is computed in the E-step and the coefficients $\bar b_{ij}$ are adjusted to maximize $E_\theta(I)$ in the M-step. The convergence of the method is therefore guaranteed. However, $\bar b_{ij}$ may not be estimated by applying (22) alone; other considerations must be taken into account, such as when $D_{ij}(x^T, y^T, \theta)$ is less than or equal to 0. Further discussion on the determination of the values of $\bar b_{ij}$ is given in the subsequent sections.
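As a rough illustration of the M-step in (22), the sketch below assumes the expectations $E(S_i, O_j \mid \theta, x^T)$ and $E(S_i, O_j \mid \theta, y^T)$ have already been accumulated; the uniform fallback for rows with non-positive differences is a simplification of the V/U partition described in Section 4.2, and all names are illustrative, not the paper's code.

```python
import numpy as np

def reestimate_emissions(E_x, E_y, eps=1e-12):
    """Re-estimate b_ij proportional to D_ij = E_x - E_y, as in (11)/(22).

    E_x, E_y : (N, M) arrays of expected symbol counts per state for the
               correct sample x and the competing sample y.
    """
    D = E_x - E_y                              # D_ij(x, y, theta)
    b = np.full_like(D, 1.0 / D.shape[1])      # fallback: uniform row
    row_sum = D.sum(axis=1)
    for i in range(D.shape[0]):
        if row_sum[i] > eps and np.all(D[i] > 0):
            b[i] = D[i] / row_sum[i]           # eq. (22): b_ij = D_ij / sum_j D_ij
    return b

# Toy usage with two states and four symbols.
E_x = np.array([[4.0, 3.0, 2.0, 1.0], [1.0, 2.0, 3.0, 4.0]])
E_y = np.array([[1.0, 1.0, 1.0, 0.5], [0.5, 1.0, 1.0, 1.0]])
print(reestimate_emissions(E_x, E_y))
```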

To modify the parameters according to (22) and simultaneously ensure the validity of the model, a two-channel structure as depicted in Figure 2 is proposed. The elements $b_{ij}$ of matrix $B$ of the two-channel HMM are decomposed into two parts as
$$b_{ij} = b_{ij}^{s} + b_{ij}^{d} \quad (\forall i = 1, 2, \ldots, N,\ j = 1, 2, \ldots, M), \quad (23)$$
with $b_{ij}^{s}$ for the static channel and $b_{ij}^{d}$ for the dynamic channel. The dynamic-channel coefficients $b_{ij}^{d}$ are the key source of the discriminative power. The $b_{ij}^{s}$ are computed from the parameter-smoothed ML HMM and weighted. As long as the $b_{ij}$ computed from (22) is greater than $b_{ij}^{s}$, $b_{ij}^{d}$ is determined as the difference between $b_{ij}$ and $b_{ij}^{s}$ according to (23); otherwise $b_{ij}^{d}$ is set to 0.

To avoid the occurrence of zero or negative probability, $b_{ij}^{s}$ ($\forall i = 1, 2, \ldots, N$, $\forall j = 1, 2, \ldots, M$) should be kept greater than 0 in the training procedure, and at the same time the dynamic-channel coefficient $b_{ij}^{d}$ ($\forall i = 1, 2, \ldots, N$, $\forall j = 1, 2, \ldots, M$) should be nonnegative. Thus the probability constraint $b_{ij} = b_{ij}^{s} + b_{ij}^{d} \ge b_{ij}^{s} > 0$ is met.

In addition, the relative weightage of the static channel and the dynamic channel may be controlled by the credibility weighing factor $\omega_i$ ($i = 1, 2, \ldots, N$) (different states may have different values). If the weightage of the dynamic channel is set to $\omega_i$ by scaling of the coefficients,
$$\sum_{j=1}^{M} b_{ij}^{d} = \omega_i, \quad 0 \le \omega_i < 1,\ \forall i = 1, 2, \ldots, N, \quad (24)$$
then the weightage of the static channel has to be set as follows:
$$\sum_{j=1}^{M} b_{ij}^{s} = 1 - \omega_i, \quad 0 \le \omega_i < 1,\ \forall i = 1, 2, \ldots, N. \quad (25)$$

4. TWO-CHANNEL TRAINING STRATEGY

4.1. Parameter initialization

The parameter-smoothed ML HMM of $x^T$, $\theta_{ML}^{x}$, which is trained using the Baum-Welch estimation, is referred to as the base HMM. The static-channel HMM is derived from the base HMM after applying the scaling factor. Parameter smoothing is carried out for $\theta_{ML}^{x}$ to prevent the occurrence of zero probability. Parameter smoothing is the simple adjustment whereby $b_{ij}$ is set to some minimum value, for example, $\epsilon = 10^{-3}$, if the estimated conditional probability $b_{ij} = 0$ [46]. As a result, even though symbol $O_j$ never appears in the training set, there is still a nonzero probability of its occurrence in $\theta_{ML}^{x}$. Parameter smoothing is a posttraining adjustment to decrease the error rate because the training set, which is usually limited in size, may not cover erratic samples.

Before carrying out discriminative training, $\omega_i$ (the credibility weighing factor of the $i$th state), $b_{ij}^{s}$ (the static-channel coefficients), and $b_{ij}^{d}$ (the dynamic-channel coefficients) are initialized.

The static-channel coefficients $b_{ij}^{s}$ are given by
$$\big[b_{i1}^{s}\ b_{i2}^{s}\ \cdots\ b_{iM}^{s}\big] = (1 - \omega_i)\big[b_{i1}\ b_{i2}\ \cdots\ b_{iM}\big], \quad 1 \le i \le N,\ 0 \le \omega_i < 1, \quad (26)$$
where $b_{ij}$ is the symbol emission probability of $\theta_{ML}^{x}$. As for the dynamic-channel coefficients $b_{ij}^{d}$, a random or uniform initial distribution usually works well. In the experiments conducted in this paper, uniform values equal to $\omega_i / M$ are assigned to the $b_{ij}^{d}$'s as initial values.
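A minimal sketch of this initialization, assuming a parameter-smoothed base emission matrix and per-state credibility factors are given; the function name and variables are illustrative, not the paper's code.

```python
import numpy as np

def init_two_channel_emissions(B_ml, omega, eps=1e-3):
    """Initialize static and dynamic emission channels from a base ML HMM.

    B_ml  : (N, M) emission matrix of the parameter-smoothed ML HMM
    omega : (N,)   credibility weighing factor of each state, 0 <= omega_i < 1
    """
    B_ml = np.maximum(B_ml, eps)                    # parameter smoothing: no zeros
    B_ml = B_ml / B_ml.sum(axis=1, keepdims=True)   # renormalize each row
    B_static = (1.0 - omega)[:, None] * B_ml        # eq. (26)
    B_dynamic = np.ones_like(B_ml) * (omega[:, None] / B_ml.shape[1])  # uniform omega_i / M
    return B_static, B_dynamic

# Toy usage: the combined emission matrix is the sum of the two channels.
B_ml = np.array([[0.7, 0.2, 0.1], [0.1, 0.0, 0.9]])
omega = np.array([0.5, 0.4])
Bs, Bd = init_two_channel_emissions(B_ml, omega)
print(Bs + Bd, (Bs + Bd).sum(axis=1))   # rows still sum to 1
```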

The selection of $\omega_i$ is flexible and largely problem dependent. A large value of $\omega_i$ means that a large weightage is assigned to the dynamic channel and the discriminative power is enhanced. However, as we adjust $b_{ij}^{d}$ in the direction of increasing $I(x^T, y^T, \theta)$, the probability of the correct observation, $P(x^T \mid \theta)$, will normally decrease. This situation is undesirable because the two-channel HMM obtained is then unlikely to generate even the correct samples.

A guideline for the determination of the value of $\omega_i$ is as follows. If the training pairs are very similar to each other such that $P(x^T \mid \theta_{ML}^{x}) \approx P(y^T \mid \theta_{ML}^{x})$, $\omega_i$ should be set to a large value to guarantee good discrimination; on the other hand, if $P(x^T \mid \theta_{ML}^{x}) \gg P(y^T \mid \theta_{ML}^{x})$, $\omega_i$ should be set to a small value to make $P(x^T \mid \theta)$ reasonably large. In addition, different values will be used for different states because they contribute differently to the scored probabilities. However, the values of $\omega_i$ for the different states should not differ greatly.

Based on the above considerations, the following procedure is taken to determine $\omega_i$. Given the base HMM $\theta_{ML}^{x}$ and the training pair $x^T$ and $y^T$, the optimal state chains are searched using the Viterbi algorithm. If $\theta_{ML}^{x}$ is a left-right model and the expected (optimal) duration of the $i$th state ($i = 1, 2, \ldots, N$) of $x^T$ is from $t_i$ to $t_i + \tau_i$, $P(x^T \mid \theta_{ML}^{x})$ is then written as follows:
$$P(x^T \mid \theta_{ML}^{x}) = P(x_{t_1}^T, \ldots, x_{t_1+\tau_1}^T \mid \theta_{ML}^{x})\, P(x_{t_2}^T, \ldots, x_{t_2+\tau_2}^T \mid \theta_{ML}^{x}) \cdots P(x_{t_N}^T, \ldots, x_{t_N+\tau_N}^T \mid \theta_{ML}^{x}); \quad (27)$$
$P(y^T \mid \theta_{ML}^{x})$ is decomposed in the same way.

Let $P_{dur}(x^T, S_i \mid \theta_{ML}^{x}) = P(x_{t_i}^T, \ldots, x_{t_i+\tau_i}^T \mid \theta_{ML}^{x})$. This probability may be computed as follows:
$$P_{dur}(x^T, S_i \mid \theta_{ML}^{x}) = \prod_{t=t_i}^{t_i+\tau_i} \Big[\sum_{j=1}^{N} P(s_t^T = S_j)\, b_j(x_t^T)\Big]. \quad (28)$$
$P_{dur}(x^T, S_i \mid \theta_{ML}^{x})$ may also be computed using the forward variables $\alpha_t^x(i) = P(x_{t_i+1}^T, \ldots, x_{t_i+t}^T, s_{t_i+t}^T = S_i \mid \theta_{ML}^{x})$ and/or the backward variables $\beta_t^x(i) = P(x_{t_i+t+1}^T, \ldots, x_{t_i+\tau_i+1}^T \mid s_{t_i+t}^T = S_i, \theta_{ML}^{x})$ [46].

However, if $\theta_{ML}^{x}$ is not a left-right model but an ergodic model, the expected duration of a state will consist of a number of separated time slices, for example, $k$ slices such as $t_{i1}$ to $t_{i1} + \tau_{i1}$, $t_{i2}$ to $t_{i2} + \tau_{i2}$, ..., and $t_{ik}$ to $t_{ik} + \tau_{ik}$. $P_{dur}(x^T, S_i \mid \theta_{ML}^{x})$ is then computed by multiplying them together as shown:
$$P_{dur}(x^T, S_i \mid \theta_{ML}^{x}) = P(x_{t_{i1}}^T, \ldots, x_{t_{i1}+\tau_{i1}}^T \mid \theta_{ML}^{x})\, P(x_{t_{i2}}^T, \ldots, x_{t_{i2}+\tau_{i2}}^T \mid \theta_{ML}^{x}) \cdots P(x_{t_{ik}}^T, \ldots, x_{t_{ik}+\tau_{ik}}^T \mid \theta_{ML}^{x}). \quad (29)$$

The value of $\omega_i$ is derived by comparing the corresponding $P_{dur}(x^T, S_i \mid \theta_{ML}^{x})$ and $P_{dur}(y^T, S_i \mid \theta_{ML}^{x})$. If $P_{dur}(x^T, S_i \mid \theta_{ML}^{x}) \gg P_{dur}(y^T, S_i \mid \theta_{ML}^{x})$, this indicates that the coefficients of the $i$th state of the base model are good enough for discrimination, and $\omega_i$ should be set to a small value to preserve the original ML configuration. If $P_{dur}(x^T, S_i \mid \theta_{ML}^{x}) < P_{dur}(y^T, S_i \mid \theta_{ML}^{x})$ or $P_{dur}(x^T, S_i \mid \theta_{ML}^{x}) \approx P_{dur}(y^T, S_i \mid \theta_{ML}^{x})$, this indicates that state $S_i$ is not able to distinguish between $x^T$ and $y^T$; thus $\omega_i$ must be set to a value large enough to ensure $P_{dur}(x^T, S_i \mid \theta) > P_{dur}(y^T, S_i \mid \theta)$, where $\theta$ is the two-channel HMM. In practice, $\omega_i$ can be manually selected according to the conditions mentioned above (which is preferred), or it can be computed using the following expression:
$$\omega_i = \frac{1}{1 + C v^{D}}, \quad (30)$$
where $v = P_{dur}(x^T, S_i \mid \theta_{ML}^{x}) / P_{dur}(y^T, S_i \mid \theta_{ML}^{x})$, and $C$ ($C > 0$) and $D$ are constants that jointly control the smoothness of $\omega_i$ with respect to $v$. Since $C > 0$ and $v > 0$, $\omega_i < 1$; by using suitable values of $C$ and $D$, a set of credibility factors $\omega_i$ is computed for the states of the target HMM. For example, if the range of $v$ is $10^{-3} \sim 10^{5}$, a typical setting is $C = 1.0$ and $D = 0.1$.

Once the values of $\omega_i$ ($i = 1, 2, \ldots, N$) are determined, they are not changed during the training process.
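For completeness, (30) can be realized in a few lines; the duration probabilities are assumed to come from (27)-(29), and the names are illustrative.

```python
import numpy as np

def credibility_factor(p_dur_x, p_dur_y, C=1.0, D=0.1):
    """omega_i = 1 / (1 + C * v**D), with v the ratio of the state-duration
    probabilities of the correct and incorrect samples (eq. (30))."""
    v = np.asarray(p_dur_x, dtype=float) / np.asarray(p_dur_y, dtype=float)
    return 1.0 / (1.0 + C * np.power(v, D))

# When x scores much better than y in a state (large v), omega shrinks and the
# ML configuration is preserved; when v <= 1, omega grows toward 1 / (1 + C).
print(credibility_factor([1e5, 1.0, 1e-3], [1.0, 1.0, 1.0]))
```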

4.2. Partition of the observation symbol set

Let $\theta$ denote the HMM with the above initial configuration. The coefficients of the dynamic channel are adjusted according to the following procedure. First, $E(S_i, O_j \mid \theta, x^T)$ and $E(S_i, O_j \mid \theta, y^T)$ are computed through the counting process. Using the forward variables $\alpha_\tau^x(i) = P(x_1^T, \ldots, x_\tau^T, s_\tau^T = S_i \mid \theta)$ and backward variables $\beta_\tau^x(i) = P(x_{\tau+1}^T, \ldots, x_T^T \mid s_\tau^T = S_i, \theta)$ [46], the following two probabilities are computed:
$$\xi_\tau^x(i, j) = P(s_\tau^T = S_i, s_{\tau+1}^T = S_j \mid x^T, \theta) = \frac{\alpha_\tau^x(i)\, a_{ij}\, b_j(x_{\tau+1}^T)\, \beta_{\tau+1}^x(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_\tau^x(i)\, a_{ij}\, b_j(x_{\tau+1}^T)\, \beta_{\tau+1}^x(j)},$$
$$\gamma_\tau^x(i) = P(s_\tau^T = S_i \mid x^T, \theta) = \sum_{j=1}^{N} \xi_\tau^x(i, j); \quad (31)$$
$\xi_\tau^y(i, j)$ and $\gamma_\tau^y(i)$ are obtained in the same manner. By counting over the states, we have
$$E(S_i, O_j \mid \theta, x^T) = \sum_{\substack{\tau=1 \\ x_\tau^T = O_j}}^{T} \gamma_\tau^x(i), \qquad E(S_i, O_j \mid \theta, y^T) = \sum_{\substack{\tau=1 \\ y_\tau^T = O_j}}^{T} \gamma_\tau^y(i). \quad (32)$$
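The counting in (31)-(32) is the standard forward-backward E-step restricted to the emission statistics. A compact sketch with per-frame scaling (illustrative names; not the paper's code):

```python
import numpy as np

def expected_symbol_counts(pi, A, B, obs):
    """E(S_i, O_j | theta, obs): expected number of times state i emits symbol j."""
    N, M = B.shape
    T = len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    scale = np.zeros(T)

    # Scaled forward pass.
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Scaled backward pass (reusing the same scale factors).
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # state posteriors gamma_tau(i)

    E = np.zeros((N, M))
    for t, o in enumerate(obs):
        E[:, o] += gamma[t]                     # eq. (32)
    return E
```

Running this routine once on $x^T$ and once on $y^T$ yields the two expectation arrays used in (22) and in the partition criterion below.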

It is shown in (22) that to maximize $I(x^T, y^T, \theta)$, $\bar b_{ij}$ should be set proportional to $D_{ij}(x^T, y^T, \theta)$. However, for certain symbols, for example, $O_p$, the expectation difference $D_{ip}(x^T, y^T, \theta)$ may be less than 0. Since the symbol emission coefficients cannot take negative values, these symbols have to be specially treated. For this reason, the symbol set $O^M = \{O_1, O_2, \ldots, O_M\}$ is partitioned into the subset $V = \{V_1, V_2, \ldots, V_K\}$ and its complement set $U = \{U_1, U_2, \ldots, U_{M-K}\}$ ($O^M = U \cup V$) according to the following criterion:
$$\{V_1, V_2, \ldots, V_K\} = \arg_{O_j}\Big[\frac{E(S_i, O_j \mid \theta, x^T)}{E(S_i, O_j \mid \theta, y^T)} > \eta\Big] \quad (\eta \ge 1), \quad (33)$$

Figure 3: (a) Distributions of E(Si, Oj | θ, xT) and E(Si, Oj | θ, yT) for various symbols. (b) Distribution of E(Si, Oj | θ, xT) for the symbols in V.

where $\eta$ is the threshold with a typical value of 1. $\eta$ is set to a larger value if it is required that the set $V$ contain fewer, more dominant symbols. With $\eta \ge 1$, $E(S_i, V_j \mid \theta, x^T) - E(S_i, V_j \mid \theta, y^T) > 0$. As an illustration, the distributions of the values of $E(S_i, O_j \mid \theta, x^T)$ and $E(S_i, O_j \mid \theta, y^T)$ for different symbol labels are shown in Figure 3a. The filtered symbols in set $V$ when $\eta$ is set to 1 are shown in Figure 3b.

4.3. Modification to the dynamic channel

For each state, the symbol set is partitioned according to the procedure described in Section 4.2. As an example, consider the $i$th state. For symbols in the set $U$, the symbol emission coefficient $b_i(U_j)$ ($U_j \in U$) should be set as small as possible. Let $b_i^d(U_j) = 0$, so that $b_i(U_j) = b_i^s(U_j)$. For symbols in the set $V$, the corresponding dynamic-channel coefficient $b_i^d(V_k)$ is computed according to (34), which is derived from (22):
$$b_i^d(V_k) = P_D(S_i, V_k, x^T, y^T) \times \Big(\omega_i + \sum_{j=1}^{K} b_i^s(V_j)\Big) - b_i^s(V_k), \quad k = 1, 2, \ldots, K, \quad (34)$$
where
$$P_D(S_i, V_k, x^T, y^T) = \frac{E(S_i, V_k \mid \theta, x^T) - E(S_i, V_k \mid \theta, y^T)}{\sum_{j=1}^{K} \big[E(S_i, V_j \mid \theta, x^T) - E(S_i, V_j \mid \theta, y^T)\big]}. \quad (35)$$

Page 108: Anthropomorphic Processing of Audio and Speechdownloads.hindawi.com/journals/specialissues/807173.pdf · Anthropomorphic Processing of Audio and Speech Guest Editors: Werner Verhelst,

A Two-Channel HMM Training Algorithm for Lip Reading 1389

bi2 = bsi2bi1

θt

θt+1

bdi1 + bdi2 = ωi

I′ = I(xT , yT , θt+1)

I = I(xT , yT , θt)

bil = bsi1bi2

(a)

bi3 = bsi3bi1

θtθt+1

d′

d

I′ = I(xT , yT , θt+1)

I = I(xT , yT , θt)bil = bsi1bi3

(b)

Figure 4: The surface of I and the direction of parameter adjustment.

However, some coefficients so obtained may still be negative, for example, $b_i^d(V_l) < 0$ because of a large value of $b_i^s(V_l)$. In such a case, this indicates that $b_i^s(V_l)$ alone is large enough for separation. To prevent negative values from appearing in the dynamic channel, the symbol $V_l$ is transferred from $V$ to $U$ and $b_i^d(V_l)$ is set to 0. The coefficients of the remaining symbols in $V$ are reestimated using (34) until all the $b_i^d(V_k)$'s are greater than 0. This situation (some $b_i^d(V_l) < 0$) usually happens in the first few epochs of training and is not conducive to convergence because it produces a steep jump in the surface of $I(x^T, y^T, \theta)$. To relieve this problem, a larger value of $\eta$ in (33) can be used.
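A sketch of one dynamic-channel update for a single state, following (33)-(35) together with the V-to-U transfer described above; the expectation vectors, threshold, and all names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def update_dynamic_channel(E_x, E_y, b_static, omega, eta=1.0):
    """Return the dynamic-channel coefficients b^d for one state.

    E_x, E_y : (M,) expected symbol counts for the correct / incorrect sample
    b_static : (M,) static-channel coefficients of this state
    omega    : credibility factor of this state
    eta      : threshold of the partition criterion (33)
    """
    M = len(b_static)
    b_dyn = np.zeros(M)
    # Partition: V holds the symbols where x clearly dominates y (eq. (33)).
    V = [j for j in range(M) if E_y[j] > 0 and E_x[j] / E_y[j] > eta]
    while V:
        D = np.array([E_x[j] - E_y[j] for j in V])
        PD = D / D.sum()                                        # eq. (35)
        cand = PD * (omega + b_static[V].sum()) - b_static[V]   # eq. (34)
        if np.all(cand >= 0):
            b_dyn[V] = cand                                     # sums to omega
            break
        V.pop(int(np.argmin(cand)))   # transfer the worst symbol from V to U
    return b_dyn                      # zeros if V emptied: rely on the static channel

# Toy usage for a 4-symbol state with omega = 0.5.
E_x = np.array([6.0, 3.0, 1.0, 0.5])
E_y = np.array([1.0, 1.0, 2.0, 0.5])
b_s = np.array([0.25, 0.15, 0.05, 0.05])   # static channel, sums to 1 - omega
print(update_dynamic_channel(E_x, E_y, b_s, omega=0.5))
```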

4.4. Termination

Optimization is done by iteratively calling the training epoch described in Sections 4.2 and 4.3. After each epoch, the separable distance $I(x^T, y^T, \theta)$ of the HMM $\theta$ obtained is calculated and compared with that obtained in the previous epoch. If $I(x^T, y^T, \theta)$ changes by less than a predefined value, training is terminated and the target two-channel HMM is established.
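Schematically, the outer loop can be organized as below; the helper callables stand in for the E-step, M-step, and scoring routines sketched earlier and are hypothetical, and the state-duration check anticipates the alignment condition of Section 5.1.

```python
def train_two_channel(e_step, m_step, separable_distance, state_durations,
                      max_epochs=50, tol=1e-3, band=(0.8, 1.2)):
    """Outer loop of the two-channel training strategy (schematic).

    e_step()             -> expectation statistics for the current model
    m_step(stats)        -> in-place update of the dynamic-channel coefficients
    separable_distance() -> I(x, y, theta) of the current model
    state_durations()    -> (dur_x, dur_y), per-state durations of x and y
    """
    lo, hi = band
    prev_I = cur_I = separable_distance()
    for _ in range(max_epochs):
        m_step(e_step())                       # one epoch (Sections 4.2 and 4.3)
        cur_I = separable_distance()
        dur_x, dur_y = state_durations()
        aligned = all(lo * dy < dx < hi * dy for dx, dy in zip(dur_x, dur_y))
        if abs(cur_I - prev_I) < tol or not aligned:
            break                              # converged, or state alignment lost
        prev_I = cur_I
    return cur_I
```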

5. PROPERTIES OF THE TWO-CHANNEL TRAINING STRATEGY

5.1. State alignment

One of the requirements for the proposed training strategy is that the state durations of the training pair, say $x^T$ and $y^T$, are comparable. This is a requirement for (22). If the state durations, for example, $E(S_i \mid \theta, x^T)$ and $E(S_i \mid \theta, y^T)$, differ too much, $D_{ij}(x^T, y^T, \theta)$ becomes meaningless. For example, if $E(S_i \mid \theta, x^T) \ll E(S_i \mid \theta, y^T)$, then even if the symbol $O_j$ takes a much greater portion of $E(S_i \mid \theta, x^T)$ than of $E(S_i \mid \theta, y^T)$, the computed $D_{ij}(x^T, y^T, \theta)$ may still be less than 0. The outcome is that $b_{ij}$ is always set to $b_{ij}^{s}$ rather than adjusted to increase $I(x^T, y^T, \theta)$. Fortunately, if the corresponding state durations of the training pair are very different, the normal ML HMMs are usually adequate to distinguish the classes.

The following state-duration validation procedure is added to make the training strategy complete. After each training epoch, $E(S_i \mid \theta, x^T)$ and $E(S_i \mid \theta, y^T)$ are computed and compared with each other. Using the forward and backward variables, the state duration of $x^T$ is obtained as follows:
$$E(S_i \mid \theta, x^T) = \sum_{\tau=1}^{T} \frac{\alpha_\tau^x(i)\, \beta_\tau^x(i)}{\sum_{i=1}^{N} \alpha_\tau^x(i)\, \beta_\tau^x(i)}, \quad i = 1, 2, \ldots, N, \quad (36)$$
and $E(S_i \mid \theta, y^T)$ is computed in the same way. If $E(S_i \mid \theta, x^T) \approx E(S_i \mid \theta, y^T)$ (they need not be identical), for example, $1.2\, E(S_i \mid \theta, y^T) > E(S_i \mid \theta, x^T) > 0.8\, E(S_i \mid \theta, y^T)$, training continues; otherwise, training stops even if $I(x^T, y^T, \theta)$ keeps on increasing.

If the $I(x^T, y^T, \theta)$ of the final HMM $\theta$ does not meet a certain discriminative requirement, for example, if $I(x^T, y^T, \theta)$ is less than a desired value, a new base HMM or a smaller $\omega_i$ should be used instead.

5.2. Speed of convergence

As discussed in Section 3, the convergence of the parameter-estimation strategy proposed in (22) is guaranteed according to the EM optimization principles. In the implementation of discriminative training, only some of the symbol emission coefficients in the dynamic channel are modified according to (22) while the others remain unchanged. However, convergence is still assured because, firstly, the surface of $I(x^T, y^T, \theta)$ with respect to $b_{ij}$ is continuous and, secondly, adjusting the dynamic-channel elements according to the two-channel training strategy leads to an increased $E_\theta(I)$. A conceptual illustration is given in Figure 4 of how $b_{ij}$ is modified when the symbol set is divided into subsets $V$ and $U$. For ease of explanation, we assume that the symbol set contains only three symbols $O_1$, $O_2$, and $O_3$, with $O_1, O_2 \in V$ and $O_3 \in U$ for state $S_i$. Let $\theta_t$ denote the HMM trained at the $t$th round and let $\theta_{t+1}$ denote the HMM obtained at the $(t+1)$th round. The surface of the separable distance (the $I$ surface) is denoted as $I' = I(x^T, y^T, \theta_{t+1})$ for $\theta_{t+1}$ and $I = I(x^T, y^T, \theta_t)$ for $\theta_t$. Clearly $I' > I$. The $I$ surface is mapped to the $b_{i1}$-$b_{i2}$ plane (Figure 4a) and the $b_{i1}$-$b_{i3}$ plane (Figure 4b). In the training phase, $b_{i1}$ and $b_{i2}$ are modified along the line $b_{i1}^d + b_{i2}^d = \omega_i$ to reach a better estimate $\theta_{t+1}$, as shown in Figure 4a. In the $b_{i1}$-$b_{i3}$ plane, $b_{i3}$ is set to the constant $b_{i3}^s$ while $b_{i1}$ is modified along the line $b_{i3} = b_{i3}^s$ in the direction $d$ shown in Figure 4b. The direction of parameter adjustment given by (22) is denoted by $d'$. In the two-channel approach, since only $b_{i1}$ and $b_{i2}$ are modified according to (22) while $b_{i3}$ remains unchanged, $d$ may lead to a lower speed of convergence than $d'$ does.

5.3. Improvement to the discriminative power

The improvement to the discriminative power is estimated as follows. Assume that $\theta$ is the two-channel HMM obtained. The lower bound of the probability $P(y^T \mid \theta)$ is given by
$$P(y^T \mid \theta) \ge (1 - \omega_{max})^{T}\, P(y^T \mid \theta_{ML}^{x}), \quad (37)$$
where $\omega_{max} = \max(\omega_1, \omega_2, \ldots, \omega_N)$. Because the base HMM is the parameter-smoothed ML HMM of $x^T$, it is natural to assume that $P(x^T \mid \theta_{ML}^{x}) \ge P(x^T \mid \theta)$. The upper bound of the separable distance is then given by the following expression:
$$I(x^T, y^T, \theta) \le \log \frac{P(x^T \mid \theta_{ML}^{x})}{(1 - \omega_{max})^{T}\, P(y^T \mid \theta_{ML}^{x})} = -T \log(1 - \omega_{max}) + I(x^T, y^T, \theta_{ML}^{x}). \quad (38)$$
In practice, the gain in $I(x^T, y^T, \theta)$ is much smaller than the theoretical upper bound. It depends on the resemblance between $x^T$ and $y^T$ and on the setting of $\omega_i$.

6. EXTENSIONS OF THE TWO-CHANNEL TRAINING ALGORITHM

6.1. Training samples with different lengths

Up to this point, the training sequences have been assumed to be of equal length. This is necessary, as we cannot properly compare the probability scores of two sequences of different lengths. To extend the training strategy to sequences of different lengths, a linear adjustment is first carried out as follows. Given the training pair $x^{T_x}$ of length $T_x$ and $y^{T_y}$ of length $T_y$, the objective function (10) is modified as follows:
$$\lambda_i = \frac{\sum_{\tau=1}^{T_x} P(s_\tau^{T_x} = S_i, x_\tau^{T_x} = O_j \mid \theta, x^{T_x}) - (T_x / T_y) \sum_{\tau=1}^{T_y} P(s_\tau^{T_y} = S_i, y_\tau^{T_y} = O_j \mid \theta, y^{T_y})}{b_{ij}}, \quad \forall j = 1, 2, \ldots, M. \quad (39)$$
Parameter estimation is then carried out as follows:
$$\bar b_{ij} = \frac{E(S_i, O_j \mid \theta, x^{T_x}) - (T_x / T_y)\, E(S_i, O_j \mid \theta, y^{T_y})}{\sum_{j=1}^{M} \big[E(S_i, O_j \mid \theta, x^{T_x}) - (T_x / T_y)\, E(S_i, O_j \mid \theta, y^{T_y})\big]}. \quad (40)$$
The expectations of the different states of $y^{T_y}$ are normalized using the scale factor $T_x / T_y$. This approach is easy to implement; however, it does not account for nonlinear variation of the signal such as local stretching or compression. If the training sequences exhibit obvious nonlinear variation, some nonlinear processing such as sequence truncation or symbol pruning may be carried out to adjust the training sequences to the same length [50].
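A sketch of the length-scaled re-estimation in (40), assuming the per-state expected counts of each sequence are available; names are illustrative.

```python
import numpy as np

def reestimate_with_length_scaling(E_x, E_y, T_x, T_y, eps=1e-12):
    """Eq. (40): scale the incorrect-sample expectations by T_x / T_y before
    forming the difference used to re-estimate each emission row."""
    D = E_x - (T_x / T_y) * E_y
    b = np.full_like(D, 1.0 / D.shape[1])     # uniform fallback for unusable rows
    for i, row in enumerate(D):
        s = row.sum()
        if s > eps and np.all(row > 0):
            b[i] = row / s
    return b

# Toy usage: x is 25 frames long, y is 30 frames long.
E_x = np.array([[5.0, 3.0, 2.0]])
E_y = np.array([[2.0, 2.0, 1.0]])
print(reestimate_with_length_scaling(E_x, E_y, T_x=25, T_y=30))
```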

6.2. Multiple training samples

In order to obtain a reliable model, multiple observations must be used to train the HMM. The extension of the proposed method to multiple training samples may be carried out as follows. Consider two labeled sets of samples, $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(k)}\} : d_1$ and $Y = \{y^{(1)}, y^{(2)}, \ldots, y^{(l)}\} : d_2$, where $X$ has $k$ samples and $Y$ has $l$ samples. The separable-distance function that takes all these samples into account is given by
$$I(X, Y, \theta) = \frac{1}{k} \sum_{m=1}^{k} \log P(x^{(m)} \mid \theta) - \frac{1}{l} \sum_{n=1}^{l} \log P(y^{(n)} \mid \theta). \quad (41)$$
For simplicity, if we assume that the observation sequences in $X$ and $Y$ have the same length $T$, then (10) may be rewritten as
$$\frac{(1/k) \sum_{m=1}^{k} E(S_i, O_j \mid \theta, x^{(m)}) - (1/l) \sum_{n=1}^{l} E(S_i, O_j \mid \theta, y^{(n)})}{b_{ij}} = \lambda_i, \quad 1 \le j \le M. \quad (42)$$
The probability coefficients are then estimated using the following:
$$\bar b_{ij} = \frac{(1/k) \sum_{m=1}^{k} E(S_i, O_j \mid \theta, x^{(m)}) - (1/l) \sum_{n=1}^{l} E(S_i, O_j \mid \theta, y^{(n)})}{\sum_{j=1}^{M} \big[(1/k) \sum_{m=1}^{k} E(S_i, O_j \mid \theta, x^{(m)}) - (1/l) \sum_{n=1}^{l} E(S_i, O_j \mid \theta, y^{(n)})\big]}. \quad (43)$$
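In code, the extension to multiple samples only changes how the expectations are pooled before the single-pair update; a minimal sketch under that assumption (illustrative names):

```python
import numpy as np

def averaged_expectations(E_list_x, E_list_y):
    """Pool expectations over multiple samples before re-estimation.

    E_list_x : list of (N, M) expectation arrays for the k correct samples
    E_list_y : list of (N, M) expectation arrays for the l incorrect samples
    """
    E_x = np.mean(E_list_x, axis=0)   # (1/k) sum_m E(S_i, O_j | theta, x^(m))
    E_y = np.mean(E_list_y, axis=0)   # (1/l) sum_n E(S_i, O_j | theta, y^(n))
    return E_x, E_y                   # plug into the single-pair update (22)/(43)
```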

Figure 5: Block diagram of the viseme recognition system (preliminary recognition by R ML HMM classifiers, followed by fine recognition with pairwise two-channel HMM classifiers within the identified coarse class).

Table 1: The 18 visemes selected for the experiments.

/a:/, /ai/, /æ/, /ei/, /i/, /j/, /ie/, /o/, /oi/, /th/, /sh/, /tZ/, /dZ/, /eu/, /au/, /p/, /m/, /b/

7. APPLICATION TO LIP READING

The proposed two-channel HMM method is applied to speaker-dependent lip reading for modeling and recognizing the basic visual speech elements of the English language. For the experiments reported in this paper, the visemes are treated as having a one-to-one mapping with the phonemes in order to test the discriminative power of the proposed method. As there are 48 phonemes in the English language [47], 48 visemes are considered.

The block diagram of the viseme recognition system is given in Figure 5. The lip movement is captured with a video camera, and the sequence of images is processed to extract the essential features relevant to the lip movement. For each frame of the image sequence, a feature vector is extracted. The sequence of feature vectors thus represents the movement of the lips during viseme production. This vector sequence is then presented as input to the proposed classifier. A hierarchical structure is adopted such that for a system with K visemes to be recognized, R (usually R < K) ML HMM classifiers are employed for preliminary recognition. The output of the preliminary recognition is a coarse identity, which may include L (usually 1 < L < K) viseme classes. Fine recognition is then performed using a bank of two-channel HMMs. The most probable viseme is then chosen as the identity of the input. Details of the various steps involved are given in the following sections.

7.1. Data acquisition

For our experiments, a professional English speaker is engaged. The speaker is asked to articulate each of the 18 phonemes in Table 1 one hundred times. These 18 visemes are chosen because some of them bear close similarity to others. The lip movements of the speaker are captured at 50 frames per second. Each pronunciation starts from a closed mouth and ends with a closed mouth. This type of sample is referred to as a text-independent viseme sample, which is different from samples extracted from various contexts, for example, from different words. The video clips that capture the productions of context-independent visemes are normalized such that all the visemes have a uniform duration of 0.5 second, or equivalently 25 frames.

7.2. Feature extraction

Each frame of the video clip reveals the lip area of the speaker during articulation (Figure 6a). To eliminate the effect caused by changes in brightness, the RGB (red, green, blue) factors of the image are converted into HSV (hue, saturation, value) factors. The RGB-to-HSV conversion algorithm proposed in [51, 52] is adopted in our experiments. As illustrated in the histograms of the distribution of the hue component shown in Figure 7, the hue factors of the lip region and of the remaining lip-excluded image occupy different regions of the histogram. A threshold may be manually selected to segment the lip region from the entire image, as shown in Figure 6b. This threshold usually corresponds to a local minimum point (valley) in the histogram, as shown in Figure 7a. Note that for different speakers and lighting conditions, the threshold may be different.

The boundaries of the lips are tracked using a geometric template with dynamic contours to fit an elastic object [53, 54, 55]. As the contours of the lips are simple, the requirement on the selection of the dynamic contours that build the template is not stringent. Results of lip-tracking experiments show that Bezier curves can fit the shape of the lips well [34]. In our experiments, the parameterized template consists of ten Bezier curves, with eight of them characterizing the lip contours and two of them describing the tongue when it is visible (Figure 6c). The template is controlled by points marked as small circles in Figure 6c. Lip tracking is carried out by fitting the template so as to minimize a certain energy function.

Figure 6: (a) Original image. (b) Segmented lip area. (c) Parameterized lip template. (d) Geometric measures extracted from the lip template. (1) Thickness of the upper bow. (2) Thickness of the lower bow. (3) Thickness of the lip corner. (4) Position of the lip corner. (5) Position of the upper lip. (6) Position of the lower bow. (7) Curvature of the upper-exterior boundary. (8) Curvature of the lower-exterior boundary. (9) Curvature of the upper-interior boundary. (10) Curvature of the lower-interior boundary. (11) Width of the tongue (when it is visible).

The energy function comprises the following four terms:
$$E_{lip} = -\frac{1}{R_1} \int_{R_1} H(x)\, dx,$$
$$E_{edge} = -\frac{1}{C_1 + C_2} \int_{C_1 + C_2} \big|H^{+}(x) - H(x)\big| + \big|H^{-}(x) - H(x)\big|\, dx,$$
$$E_{hole} = -\frac{1}{R_2 - R_3} \int_{R_2 - R_3} H(x)\, dx,$$
$$E_{inertia} = \big\|\Gamma_{t+1} - \Gamma_t\big\|^2, \quad (44)$$
where $R_1$, $R_2$, $R_3$, $C_1$, and $C_2$ are areas and contours as illustrated in Figure 6c. $H(x)$ is a function of the hue of a given pixel; $H^{+}(x)$ is the hue function of the closest right-hand side pixel and $H^{-}(x)$ is that of the closest left-hand side pixel. $\Gamma_{t+1}$ and $\Gamma_t$ are the matched templates at time $t+1$ and $t$, and $\|\Gamma_{t+1} - \Gamma_t\|$ indicates the Euclidean distance between the two templates (further details may be found in [55]). The overall energy $E$ of the template is the linear combination of the components defined as
$$E = c_1 E_{lip} + c_2 E_{edge} + c_3 E_{hole} + c_4 E_{inertia}. \quad (45)$$
Similarly, the energy terms for the tongue template include
$$E_{tongue\text{-}area} = -\frac{1}{R_3} \int_{R_3} H(x)\, dx \quad \text{if } R_3 > 0,$$
$$E_{tongue\text{-}edge} = -\frac{1}{C_3} \int_{C_3} \big|H^{+}(x) - H(x)\big| + \big|H^{-}(x) - H(x)\big|\, dx \quad \text{if } C_3 > 0,$$
$$E_{tongue\text{-}inertia} = \big\|T_{tongue,t+1} - T_{tongue,t}\big\|^2, \quad (46)$$
and the overall energy is
$$E_{tongue} = c_5 E_{tongue\text{-}area} + c_6 E_{tongue\text{-}edge} + c_7 E_{tongue\text{-}inertia}. \quad (47)$$

Initially, the dynamic contours are configured to provide a crude match to the lips. This can be done by comparing the enclosed region of the template and the segmented lip region as depicted in Figure 6b. Following that, the template is matched to the image sequence by adopting different values of the parameters $c_i$ ($i = 1, 2, \ldots, 7$) in a number of searching epochs (a detailed discussion is given in [53, 54, 55]). The matched template is pictured in Figure 6d. It can be seen that the matched template is symmetric and smooth, and is therefore easy to process.

Eleven geometric parameters, as shown in Figure 6d, are extracted from the matched template to form a feature vector. These features indicate the thickness of various parts of the lips, the positions of some key points, and the curvatures of the bows. They are chosen because they uniquely determine the shape of the lips and best characterize the movement of the lips.

Principal component analysis (PCA) is carried out to reduce the dimension of the feature vectors from eleven to seven. The resulting feature vectors are clustered into groups using the K-means algorithm. In the experiments conducted, 128 clusters are created for the vector database. The means of the 128 clusters form the symbol set $O^{128} = (O_1, O_2, \ldots, O_{128})$ of the HMM. They are used to encode the vector sequences presented to the system.
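The front end thus maps each frame's eleven geometric measures to one of 128 discrete symbols. A minimal sketch of this vector-quantization step using scikit-learn (an implementation assumption; the paper does not specify the software used), with synthetic data standing in for the real frame features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy stand-in for the real data: one 11-dimensional geometric feature
# vector per video frame, gathered over the whole training set.
rng = np.random.default_rng(0)
frames = rng.normal(size=(2000, 11))

pca = PCA(n_components=7).fit(frames)            # reduce 11 -> 7 dimensions
codebook = KMeans(n_clusters=128, n_init=10,
                  random_state=0).fit(pca.transform(frames))

def encode_sequence(feature_vectors):
    """Map a (T, 11) sequence of frame features to T discrete HMM symbols."""
    return codebook.predict(pca.transform(feature_vectors))

print(encode_sequence(frames[:25]))              # a 25-frame viseme as symbol indices
```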

7.3. Configuration of the viseme model

Investigation of the lip dynamics reveals that the movement of the lips can be partitioned into three phases during the production of a text-independent viseme. The initial phase begins with a closed mouth and ends with the start of sound production. The intermediate phase is the articulation phase, which is the period when sound is produced. The third phase is the end phase, when the mouth returns to the relaxed state. Figure 8 illustrates the change of the lips in the three phases and the corresponding acoustic waveform when the phoneme /u/ is uttered.

To associate the HMM with the physical process of viseme production, a three-state left-right HMM structure as shown in Figure 9 is adopted.

Figure 7: Isolation of the lip region from the entire image using hue distribution. (a) Histogram of the hue component for the entire image. (b) Histogram of the hue component for the actual lip region. (c) Histogram of the hue component for the actual lip-excluded image.

Using this structure, the state-transition matrix $A$ has the form
$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & 0 & 0 \\ 0 & a_{2,2} & a_{2,3} & 0 \\ 0 & 0 & a_{3,3} & a_{3,4} \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad (48)$$
where the 4th state is a null state that indicates the end of viseme production. The initial values of the coefficients in matrices $A$ and $B$ are set according to the statistics of the three phases. Given a viseme sample, the approximate initial phase, articulation phase, and end phase are segmented from the image sequence and the acoustic signal (an illustration is given in Figure 8), and the duration of each phase is counted. The coefficients $a_{i,i}$ and $a_{i,i+1}$ are initialized with these durations. For example, if the duration of state $S_i$ is $T_i$, the initial value of $a_{i,i}$ is set to $T_i/(T_i+1)$ and the initial value of $a_{i,i+1}$ is set to $1/(T_i+1)$, as these values maximize $a_{i,i}^{T_i} a_{i,i+1}$. Matrix $B$ is initialized in a similar manner: if symbol $O_j$ appears $T(O_j)$ times in state $S_i$, the initial value of $b_{ij}$ is set to $T(O_j)/T_i$. With this arrangement, the states of the HMM are aligned with the three phases of viseme production and hence are referred to as the initial state, articulation state, and end state.

Figure 8: The three phases of viseme production. (a) Initial phase. (b) Articulation phase. (c) End phase.

Figure 9: Three-state left-right viseme model (initial, articulation, and end states).

7.4. Viseme classifier

The block diagram of the proposed hierarchical viseme classifier is given in Figure 10. Visemes that are too similar to be separated by the normal ML HMMs are clustered into one macro class. In the figure, $\theta_{Mac1}, \theta_{Mac2}, \ldots, \theta_{MacR}$ are the $R$ ML HMMs for the $R$ macro classes. The similarity between the visemes is measured as follows.

Assume that $X_i = \{x_{i1}, x_{i2}, \ldots, x_{il_i}\} : d_i$ is the set of training samples of viseme $d_i$ ($i = 1, 2, \ldots, 18$, as 18 visemes are involved), where $x_{ij}$ is the $j$th training sample and $l_i$ is the number of samples. An ML HMM is trained for each of the 18 visemes using the Baum-Welch estimation. Let $\theta_1, \theta_2, \ldots, \theta_{18}$ denote the 18 ML HMMs. For $\{x_{i1}, x_{i2}, \ldots, x_{il_i}\} : d_i$, the joint probability scored by $\theta_j$ is computed as follows:
$$P(X_i \mid \theta_j) = \prod_{n=1}^{l_i} P(x_{in} \mid \theta_j). \quad (49)$$
A viseme model $\theta_i$ is able to separate visemes $d_i$ and $d_j$ if the following condition applies:
$$\log P(X_i \mid \theta_i) - \log P(X_i \mid \theta_j) \ge K l_i \quad \forall j = 1, 2, \ldots, 18,\ j \neq i, \quad (50)$$
where $K$ is a positive constant that is set according to the length of the training samples. For long training samples, a large value of $K$ is desired. For the 25-length samples adopted in our experiments, $K$ is set equal to 2. If the condition stated in (50) is not met, visemes $d_i$ and $d_j$ are categorized into the same macro class. The training samples of $d_i$ and $d_j$ are then jointly used to train the ML HMM of the macro class. $\theta_{Mac1}, \theta_{Mac2}, \ldots, \theta_{MacR}$ are obtained in this way.
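A sketch of the macro-class grouping implied by (49)-(50), assuming the matrix of joint log probabilities has been computed; the merging scheme and all names are illustrative choices, not the paper's code.

```python
import numpy as np

def macro_classes(logP, sample_counts, K=2.0):
    """Group visemes whose ML models fail the separation test (50).

    logP          : (C, C) array, logP[i, j] = log P(X_i | theta_j)
    sample_counts : (C,) number of training samples l_i per viseme
    """
    C = logP.shape[0]
    parent = list(range(C))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(C):
        for j in range(C):
            if i != j and logP[i, i] - logP[i, j] < K * sample_counts[i]:
                parent[find(i)] = find(j)          # merge into one macro class
    groups = {}
    for i in range(C):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Toy usage with three visemes, 50 samples each; visemes 0 and 1 are confusable.
logP = np.array([[-700., -720., -900.],
                 [-710., -690., -950.],
                 [-920., -940., -680.]])
print(macro_classes(logP, sample_counts=[50, 50, 50]))  # -> [[0, 1], [2]]
```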

For an input viseme $z^T$ to be identified, the probabilities $P(z^T \mid \theta_{Mac1}), P(z^T \mid \theta_{Mac2}), \ldots, P(z^T \mid \theta_{MacR})$ are computed and compared with one another. The macro identity of $z^T$ is determined by the HMM that gives the largest probability.

A macro class may consist of several similar visemes. Fine recognition within a macro class is carried out at the second layer. Assume that Macro Class $i$ comprises $L$ visemes: $V_1, V_2, \ldots, V_L$. A number of two-channel HMMs are trained with the proposed discriminative training strategy. For $V_1$, $L-1$ HMMs, $\theta_{1\wedge 2}, \theta_{1\wedge 3}, \ldots, \theta_{1\wedge L}$, are trained to separate the samples of $V_1$ from those of $V_2, V_3, \ldots, V_L$, respectively. Taking $\theta_{1\wedge 2}$ as an example, the parameter-smoothed ML HMM of $V_1$, $\theta_{ML}^{1}$, is adopted as the base HMM. The samples of $V_1$ are used as the correct samples ($x^T$ in (3)) and the samples of $V_2$ are used as the incorrect samples ($y^T$ in (3)) while training $\theta_{1\wedge 2}$. There is a total of $L(L-1)$ two-channel HMMs in Macro Class $i$.

For an input viseme $z^T$ to be identified, the following hypothesis is made:
$$H_{i\wedge j} = \begin{cases} i & \text{if } \log P(z^T \mid \theta_{i\wedge j}) - \log P(z^T \mid \theta_{j\wedge i}) > K, \\ 0 & \text{otherwise}, \end{cases} \quad (51)$$
where $K$ is the positive constant as defined in (50). For the 25-frame sequences input to the system, $K$ is chosen equal to 2. $H_{i\wedge j} = i$ indicates a vote for $V_i$. The decision about the identity of $z^T$ is made by a majority vote over all the two-channel HMMs. The viseme class that has the maximum number of votes is chosen as the identity of $z^T$, denoted by $ID(z^T)$. Mathematically,
$$ID(z^T) = \max_i \big[\text{Number of } H_{i\wedge j} = i\big] \quad \forall i, j = 1, 2, \ldots, L,\ i \neq j. \quad (52)$$

Figure 10: Flow chart of the hierarchical viseme classifier.

If two viseme classes, say $V_i$ and $V_j$, receive the same number of votes, the decision about the identity of $z^T$ is made by comparing $P(z^T \mid \theta_{i\wedge j})$ and $P(z^T \mid \theta_{j\wedge i})$. Mathematically,
$$ID(z^T) = \begin{cases} i & \text{if } \log P(z^T \mid \theta_{i\wedge j}) > \log P(z^T \mid \theta_{j\wedge i}), \\ j & \text{otherwise}. \end{cases} \quad (53)$$

The decision is based on pairwise comparisons of the hypotheses. The proposed hierarchical structure greatly reduces the computational load and increases the accuracy of recognition because pairwise comparisons are carried out within each macro class, which comprises far fewer candidate classes than the entire set. If coarse identification is not performed, the number of classes increases and the number of pairwise comparisons goes up rapidly.

The two-channel HMMs act as boundary functions for the viseme they represent. Each of them serves to separate the correct samples from the samples of another viseme. A conceptual illustration is given in Figure 11, where the macro class comprises five visemes $V_1, V_2, \ldots, V_5$. $\theta_{1\wedge 2}, \theta_{1\wedge 3}, \ldots, \theta_{1\wedge 5}$ build the decision boundaries for $V_1$ to delimit it from the similar visemes.

The proposed two-channel HMM model is specially tailored for the target viseme and its "surroundings". As a result, it is more accurate than the traditional modeling method that uses a single ML HMM.
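A sketch of the second-layer decision rule (51)-(53) for one macro class, assuming the pairwise two-channel scores log P(z | θ_{i∧j}) are available; names are illustrative.

```python
import numpy as np

def fine_recognition(pair_log_scores, K=2.0):
    """Majority vote over the pairwise two-channel HMMs, eqs. (51)-(53).

    pair_log_scores : (L, L) array with pair_log_scores[i, j] = log P(z | theta_{i^j})
                      for i != j (the diagonal is ignored).
    """
    L = pair_log_scores.shape[0]
    votes = np.zeros(L, dtype=int)
    for i in range(L):
        for j in range(L):
            if i != j and pair_log_scores[i, j] - pair_log_scores[j, i] > K:
                votes[i] += 1                     # hypothesis H_{i^j} = i
    winners = np.flatnonzero(votes == votes.max())
    if len(winners) == 1:
        return int(winners[0])
    # Tie-break between the top two candidates by direct comparison, eq. (53).
    i, j = winners[:2]
    return int(i if pair_log_scores[i, j] > pair_log_scores[j, i] else j)

# Toy usage for a macro class of three visemes.
scores = np.array([[0.0, -40.0, -38.0],
                   [-47.0, 0.0, -39.0],
                   [-46.0, -48.0, 0.0]])
print(fine_recognition(scores))   # viseme 0 wins the vote
```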

Figure 11: Viseme boundaries formed by the two-channel HMMs.

7.5. Performance of the system

Experiments are carried out to assess the performance of theproposed system. For the experiments conducted in this pa-per, 100 samples are drawn for each viseme with 50 for train-ing and the remaining 50 for testing. By computing and com-paring the probabilities scored by different viseme modelsusing (49) and (50), the 18 visemes are clustered into 6 macroclasses as illustrated in Table 2.

The results of fine recognition of some confusablevisemes are listed in Table 3. Each row in Table 3 shows thetwo similar visemes that belong to the same macro class. Thefirst viseme label (in boldface) is the target viseme and is de-noted by x. The second viseme is the incorrect viseme and

is denoted by y. θML denotes the parameter-smoothed ML

Page 115: Anthropomorphic Processing of Audio and Speechdownloads.hindawi.com/journals/specialissues/807173.pdf · Anthropomorphic Processing of Audio and Speech Guest Editors: Werner Verhelst,

1396 EURASIP Journal on Applied Signal Processing

Table 2: The macro classes for coarse identification.

Macro classes Visemes Macro classes Visemes

1 /a:/, /ai/, /æ/ 4 /o/, /oi/

2 /ei/, /i/, /j/, /ie/ 5 /th/, /sh/, /tZ/, /dZ/

3 /eu/, /au/ 6 /p/, /m/, /b/

Table 3: The average values of probability and separable-distance function of the ML HMMs and two-channel HMMs.

Viseme pair        θML               θ1*               θ2**
x      y           P       I         P       I         P       I         ω1    ω2    ω3
/a:/   /ai/       −14.1   1.196     −17.1   5.571     −18.3   6.589     0.5   0.5   0.5
/ei/   /i/        −14.7   2.162     −19.3   5.977     −20.9   7.008     0.6   0.8   0.6
/au/   /eu/       −15.6   2.990     −18.1   5.872     −18.5   6.555     0.6   0.5   0.6
/o/    /oi/       −13.9   0.830     −17.5   2.508     −18.7   3.296     0.5   0.5   0.5
/th/   /sh/       −15.7   0.602     −19.0   2.809     −18.5   2.732     0.4   0.4   0.4
/p/    /m/        −16.3   1.144     −19.0   3.102     −17.1   2.233     0.4   0.5   0.4

Configuration of the two-channel HMMs:
* For θ1, ω1, ω2, and ω3 are set according to (30), with C = 1.0 and D = 0.1.
** For θ2, ω1, ω2, and ω3 are manually selected.

θML denotes the parameter-smoothed ML HMMs that are trained with the samples of x. With θML being the base HMM, two two-channel HMMs, θ1 and θ2, are trained with the samples of x being the target training samples and the samples of y being the incorrect training samples. Different sets of the credibility factors (ω1, ω2, and ω3 for the three states) are used for θ1 and θ2. P is the average log probability scored for the testing samples and is computed as P = (1/l) Σ_{i=1}^{l} log P(xi|θ), where xi is the ith testing sample of viseme x and l is the number of testing samples. I = (1/l²) Σ_{i=1}^{l} Σ_{j=1}^{l} I(xi, yj, θ) is the average separable distance. The value of I gives an indication of the discriminative power: the larger the value of I, the higher the discriminative power.
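For concreteness, the averages reported in Table 3 could be computed as in the short sketch below; the HMM evaluation log_prob and the separable-distance function sep_dist of the paper are treated as black-box callables here, so the snippet only illustrates the averaging, not the underlying model.

```python
import numpy as np

def table3_averages(test_x, test_y, theta, log_prob, sep_dist):
    """Average log probability P and average separable distance I for one
    viseme pair (x, y); both test sets contain l samples, as in the paper."""
    l = len(test_x)
    P = np.mean([log_prob(x, theta) for x in test_x])          # (1/l) sum log P(x_i | theta)
    I = sum(sep_dist(x, y, theta) for x in test_x for y in test_y) / l ** 2
    return P, I
```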

For all settings of (ω1, ω2, ω3), the two-channel HMMs give a much larger separable distance than the ML HMMs. It shows that better discrimination capabilities are attained using the two-channel viseme classifiers than using the ML HMM classifiers. In addition, different levels of capabilities can be attained by adjusting the credibility factors. However, the two-channel HMM gives a smaller average probability for the target samples than the normal ML HMM. It indicates that the two-channel HMMs perform well at discriminating confusable visemes but are not good at modeling the visemes.

The change of I(x, y, θ) with respect to the training epochs in the two-channel training is depicted in Figure 12. For the three-state left-right HMMs and 25-length training samples adopted in the experiment, the separable distance becomes stable after ten to twenty epochs. Such speed of convergence shows that the two-channel training is not computationally intensive for viseme recognition. It is also observed that I(x, y, θ) may drop during the first few training epochs. This phenomenon can be attributed to the fact that some symbols in subset V are transferred to U while training the dynamic-channel coefficients, as explained in Section 4.3. Figure 12d illustrates the situation of early termination. The training process stops even though I(x, y, θ) still shows a tendency to increase. As explained in Section 5.1, if the state durations of the target training samples and incorrect training samples differ greatly, that is, if the state alignment condition is violated, the two-channel training should terminate immediately.

The performance of the proposed hierarchical system is compared with that of the traditional recognition system where ML HMMs (parameter-smoothed) are used as the viseme classifiers. The ML HMMs and the two-channel HMMs involved are trained with the same set of training samples. The credibility factors of the two-channel HMMs are set according to (30), with C = 0.1 and D = 0.1. The decision about the identity of an input testing sample is made according to (47), (49), (50), and (51), where K = 2. The false rejection error rates (FRRs), or Type-II errors, of the two types of viseme classifiers are computed for the 50 testing samples of each of the 18 visemes. Note that since some of the 18 visemes can be accurately identified by the ML HMMs with FRRs of less than 10% [34], the improvement resulting from the two-channel training approach is not prominent for these visemes. In Table 4, only the FRRs of 12 confusable visemes are listed.

Compared with the conventional ML HMM classifier, the classification error of the proposed hierarchical viseme classifier is reduced by about 20%. The two-channel training algorithm is thus able to significantly increase the discriminative ability of the HMM for identifying visemes.

8. CONCLUSION

In this paper, a novel two-channel training strategy for hidden Markov models is proposed. A separable-distance function, which measures the difference between a pair of training samples, is applied as the objective function. To maximize the separable distance and maintain the validity of the probabilistic framework of the HMM at the same time, a two-channel HMM structure is used. Parameters in one channel, named the dynamic channel, are optimized in a series of expectation-maximization (EM) estimations where feasible, while parameters in the other channel, the static channel, are kept fixed. The HMM trained in this way amplifies the difference between the training samples. This strategy is especially suited to increasing the discriminative ability of the HMM over confusable observations.

The proposed training strategy is applied to viseme recognition. A hierarchical system is developed in which a normal ML HMM classifier implements coarse recognition and the two-channel HMMs carry out fine recognition. To extend the classification from binary-class to multiple-class, a decision rule based on a majority vote is adopted.



Figure 12: Change of I(x, y, θ) during the training process. (Panels (a)-(d) plot I(x, y, θ) against the training epochs; panel (d) illustrates early termination.)

Table 4: Classification error ε1 of the conventional classifier and classification error ε2 of the two-channel classifier.

Viseme   ε1     ε2      Viseme   ε1     ε2
/a:/     64%    12%     /o/      46%    28%
/ai/     60%    40%     /oi/     36%    8%
/ei/     46%    22%     /th/     18%    16%
/i/      52%    32%     /sh/     20%    12%
/au/     30%    18%     /p/      36%    12%
/eu/     26%    16%     /m/      32%    32%

Experimental results show that the classification error of the proposed viseme classifier is on average 20% less than that of the popular ML HMM classifier, while only 10 to 20 training epochs are required in the training process.

The two-channel training strategy thus provides a significant improvement over the traditional Baum-Welch estimation in fine recognition. However, the proposed method requires state alignment among the training samples; in other words, the samples should be sufficiently similar that the durations of the corresponding states are comparable.

Although the two-channel HMM is illustrated for viseme classification in this paper, the method is applicable to any sequence classification problem where the sequences to be recognized are of comparable length. Such applications include speech recognition, speaker identification, and handwriting recognition.

REFERENCES

[1] W. H. Sumby and I. Pollack, "Visual contributions to speech intelligibility in noise," Journal of the Acoustical Society of America, vol. 26, pp. 212–215, 1954.
[2] K. K. Neely, "Effect of visual factors on the intelligibility of speech," Journal of the Acoustical Society of America, vol. 28, no. 6, pp. 1275–1277, 1956.
[3] C. A. Binnie, A. A. Montgomery, and P. L. Jackson, "Auditory and visual contributions to the perception of consonants," Journal of Speech and Hearing Research, vol. 17, pp. 619–630, 1974.
[4] D. Reisberg, J. McLean, and A. Goldfield, "Easy to hear but hard to understand: A lipreading advantage with intact auditory stimuli," in Hearing by Eye: The Psychology of Lipreading, B. Dodd and R. Campbell, Eds., pp. 97–113, Lawrence Erlbaum Associates, Hillsdale, NJ, USA, 1987.
[5] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, pp. 746–748, 1976.



[6] D. W. Massaro, Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry, Lawrence Erlbaum Associates, Hillsdale, NJ, USA, 1987.
[7] R. Campbell and B. Dodd, "Hearing by eye," Quarterly Journal of Experimental Psychology, vol. 32, pp. 85–99, 1980.
[8] E. D. Petajan, Automatic lipreading to enhance speech recognition, Ph.D. dissertation, University of Illinois at Urbana-Champaign, Urbana, Ill, USA, 1984.
[9] A. J. Goldschen, Continuous automatic speech recognition by lipreading, Ph.D. dissertation, George Washington University, Washington, DC, USA, 1993.
[10] B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski, "Integration of acoustic and visual speech signals using neural networks," IEEE Commun. Mag., vol. 27, no. 11, pp. 65–71, 1989.
[11] D. G. Stork, G. Wolff, and E. Levine, "Neural network lipreading system for improved speech recognition," in Proc. IEEE International Joint Conference on Neural Networks (IJCNN '92), pp. 285–295, Baltimore, Md, USA, June 1992.
[12] C. Bregler and S. Omohundro, "Nonlinear manifold learning for visual speech recognition," in Proc. IEEE 5th International Conference on Computer Vision (ICCV '95), pp. 494–499, Cambridge, Mass, USA, June 1995.
[13] P. Silsbee and A. Bovik, "Computer lipreading for improved accuracy in automatic speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, no. 5, pp. 337–351, 1996.
[14] D. G. Stork and H. L. Lu, "Speechreading by Boltzmann zippers," in Machines that Learn Workshop, Snowbird, Utah, USA, April 1996.
[15] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Mag., vol. 18, no. 1, pp. 9–21, 2001.
[16] D. G. Stork and M. E. Hennecke, "Speechreading: An overview of image processing, feature extraction, sensory integration and pattern recognition techniques," in Proc. 2nd International Conference on Automatic Face and Gesture Recognition, pp. 16–26, Killington, Vt, USA, October 1996.
[17] S. W. Foo, Y. Lian, and L. Dong, "Recognition of visual speech elements using adaptively boosted hidden Markov models," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 5, pp. 693–705, 2004.
[18] A. J. Goldschen, O. N. Garcia, and E. D. Petajan, "Continuous optical automatic speech recognition by lipreading," in Proc. 28th Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 572–577, Pacific Grove, Calif, USA, October–November 1994.
[19] W. J. Welsh, A. D. Simon, R. A. Hutchinson, and S. Searby, "A speech-driven 'talking-head' in real time," in Proc. Picture Coding Symposium, pp. 7.6-1–7.6-2, Cambridge, Mass, USA, March 1990.
[20] P. L. Silsbee and A. C. Bovik, "Visual lipreading by computer to improve automatic speech recognition accuracy," Tech. Rep. TR-93-02-90, University of Texas Computer and Vision Research Center, Austin, Tex, USA, 1993.
[21] P. L. Silsbee and A. C. Bovik, "Medium vocabulary audiovisual speech recognition," in NATO ASI New Advances and Trends in Speech Recognition and Coding, pp. 13–16, Bubion, Granada, Spain, June–July 1993.
[22] M. J. Tomlinson, M. J. Russell, and N. M. Brooke, "Integrating audio and visual information to provide highly robust speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP '96), vol. 2, pp. 821–824, Atlanta, Ga, USA, May 1996.
[23] J. Luettin, N. A. Thacker, and S. W. Beet, "Speechreading using shape and intensity information," in Proc. 4th International Conference on Spoken Language Processing (ICSLP '96), pp. 58–61, Philadelphia, Pa, USA, October 1996.
[24] X. Z. Zhang, R. M. Mersereau, and M. A. Clements, "Audio-visual speech recognition by speechreading," in Proc. 14th International Conference on Digital Signal Processing (DSP '02), vol. 2, pp. 1069–1072, Santorini, Greece, July 2002.
[25] G. Gravier, G. Potamianos, and C. Neti, "Asynchrony modeling for audio-visual speech recognition," in Proc. International Conference on Human Language Technology (HLT '02), San Diego, Calif, USA, March 2002, available on proceedings CD.
[26] A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1274–1288, 2002.
[27] S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition," IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141–151, 2000.
[28] S. W. Foo and L. Dong, "A boosted multi-HMM classifier for recognition of visual speech elements," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 2, pp. 285–288, Hong Kong, China, April 2003.
[29] J. J. Williams and A. K. Katsaggelos, "An HMM-based speech-to-video synthesizer," IEEE Trans. Neural Networks, vol. 13, no. 4, pp. 900–915, 2002.
[30] J. Luettin, N. A. Thacker, and S. W. Beet, "Speechreading using shape and intensity information," in Proc. 4th International Conference on Spoken Language Processing (ICSLP '96), vol. 1, pp. 58–61, Philadelphia, Pa, USA, October 1996.
[31] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey, "Extraction of visual features for lipreading," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 2, pp. 198–213, 2002.
[32] A. Adjoudani and C. Benoît, "On the integration of auditory and visual parameters in an HMM-based ASR," in Speech Reading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds., NATO ASI Series, pp. 461–471, Springer Verlag, Berlin, Germany, 1996.
[33] E. Owens and B. Blazek, "Visemes observed by hearing impaired and normal hearing adult viewers," Journal of Speech and Hearing Research, vol. 28, pp. 381–393, 1985.
[34] S. W. Foo and L. Dong, "Recognition of visual speech elements using hidden Markov models," in Proc. 3rd IEEE Pacific Rim Conference on Multimedia (PCM '02), pp. 607–614, December 2002.
[35] C. Binnie, A. Montgomery, and P. Jackson, "Auditory and visual contributions to the perception of consonants," Journal of Speech and Hearing Research, vol. 17, pp. 619–630, 1974.
[36] A. M. Tekalp and J. Ostermann, "Face and 2-D mesh animation in MPEG-4," Signal Processing: Image Communication, vol. 15, no. 4-5, pp. 387–421, 2000, special issue on MPEG-4.
[37] S. Morishima, S. Ogata, K. Murai, and S. Nakamura, "Audio-visual speech translation with automatic lip synchronization and face tracking based on 3-D head model," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 2, pp. 2117–2120, Orlando, Fla, USA, May 2002.
[38] S. W. Foo, Y. Lian, and L. Dong, "A two-channel training algorithm for hidden Markov model to identify visual speech elements," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS '03), vol. 2, pp. 572–575, Bangkok, Thailand, May 2003.
[39] L. E. Baum and T. Petrie, "Statistical inference for probabilistic functions of finite state Markov chains," Annals of Mathematical Statistics, vol. 37, pp. 1554–1563, 1966.
[40] L. E. Baum and G. R. Sell, "Growth functions for transformations on manifolds," Pacific Journal of Mathematics, vol. 27, no. 2, pp. 211–227, 1968.
[41] T. Petrie, "Probabilistic functions of finite state Markov chains," Annals of Mathematical Statistics, vol. 40, no. 1, pp. 97–115, 1969.



[42] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Annals of Mathematical Statistics, vol. 41, pp. 164–171, 1970.
[43] L. E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," in Inequalities, vol. 3, pp. 1–8, Academic Press, New York, NY, USA, 1972.
[44] J. K. Baker, "The DRAGON system—An overview," IEEE Trans. Acoust., Speech, Signal Processing, vol. 23, no. 1, pp. 24–29, 1975.
[45] F. Jelinek, "Continuous speech recognition by statistical methods," Proc. IEEE, vol. 64, no. 4, pp. 532–556, 1976.
[46] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[47] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, USA, 1993.
[48] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '86), pp. 49–52, Tokyo, Japan, April 1986.
[49] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, Wiley Series in Probability and Statistics, John Wiley & Sons, New York, NY, USA, 1997.
[50] R. J. Schalkoff, Pattern Recognition: Statistical, Structural and Neural Approaches, John Wiley & Sons, New York, NY, USA, 1992.
[51] X. Z. Zhang, C. Broun, R. M. Mersereau, and M. A. Clements, "Automatic speechreading with applications to human-computer interfaces," EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1228–1247, 2002, special issue on Joint Audio-Visual Speech Processing.
[52] T. W. Lewis and D. M. W. Powers, "Lip feature extraction using red exclusion," Selected Papers from the Pan-Sydney Workshop on Visualization, vol. 2, pp. 61–67, Sydney, Australia, 2000.
[53] A. Yuille and P. Hallinan, "Deformable templates," in Active Vision, A. Blake and A. Yuille, Eds., pp. 21–38, MIT Press, Cambridge, Mass, USA, 1992.
[54] T. Coianiz, L. Torresani, and B. Caprile, "2D deformable models for visual speech analysis," in Speech Reading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds., NATO ASI Series, pp. 391–398, Springer Verlag, New York, NY, USA, 1996.
[55] M. E. Hennecke, K. V. Prasad, and D. G. Stork, "Using deformable templates to infer visual speech dynamics," in Proc. 28th Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 578–582, Pacific Grove, Calif, USA, October–November 1994.

Liang Dong received the B.Eng. degree in electronic engineering from Beijing University of Aeronautics and Astronautics, China, in 1997, and the M.Eng. degree in electrical engineering from the Second Academy of China Aerospace in 2000. Currently, he is a Ph.D. candidate at the National University of Singapore and is working in the Institute for Infocomm Research, Singapore. His research interests include speech processing, image processing, and video processing.

Say Wei Foo received the B.Eng. degree in electrical engineering from the University of Newcastle, Australia, in 1972, the M.S. degree in industrial and systems engineering from the University of Singapore in 1979, and the Ph.D. degree in electrical engineering from Imperial College, University of London, in 1983. From 1972 to 1973, he was with the Electrical Branch, Lands and Estates Department, Ministry of Defense, Singapore. From 1973 to 1992, he worked in the Electronics Division of the Defense Science Organization, Singapore, where he conducted research and carried out development work on security equipment. From 1992 to 2001, he was an Associate Professor with the Department of Electrical and Computer Engineering, National University of Singapore. In 2002, he joined the School of Electrical and Electronic Engineering, Nanyang Technological University. He has authored and coauthored over one hundred published articles. His research interests include speech signal processing, speaker recognition, and musical note recognition.

Yong Lian received the B.S. degree from the School of Management, Shanghai Jiao Tong University, China, in 1984, and the Ph.D. degree from the Department of Electrical Engineering, National University of Singapore, Singapore, in 1994. He was with the Institute of Microcomputer Research, Shanghai Jiao Tong University, Brighten Information Technology Ltd., SyQuest Technology International, and Xyplex Inc. from 1984 to 1996. He joined the National University of Singapore in 1996, where he is currently an Associate Professor in the Department of Electrical and Computer Engineering. His research interests include digital filter design, VLSI implementation of high-speed digital systems, biomedical instrumentation, and RF IC design. Dr. Lian received the 1996 IEEE Circuits and Systems Society's Guillemin-Cauer Award for the best paper published in IEEE Transactions on Circuits and Systems Part II. He currently serves as an Associate Editor for the IEEE Transactions on Circuits and Systems Part II and has been an Associate Editor for Circuits, Systems and Signal Processing since 2000. Dr. Lian serves as the Secretary and Member of the IEEE Circuits and Systems Society's Biomedical Circuits and Systems Technical Committee and Digital Signal Processing Technical Committee, respectively.


EURASIP Journal on Applied Signal Processing 2005:9, 1400–1409
© 2005 Hindawi Publishing Corporation

Disordered Speech Assessment Using Automatic Methods Based on Quantitative Measures

Lingyun Gu
Computational NeuroEngineering Laboratory, Department of Electrical & Computer Engineering, University of Florida, Gainesville, FL 32611-6200, USA
Email: [email protected]

John G. Harris
Computational NeuroEngineering Laboratory, Department of Electrical & Computer Engineering, University of Florida, Gainesville, FL 32611-6200, USA
Email: [email protected]

Rahul Shrivastav
Department of Communication Sciences & Disorders, University of Florida, Gainesville, FL 32611, USA
Email: [email protected]

Christine Sapienza
Department of Communication Sciences & Disorders, University of Florida, Gainesville, FL 32611, USA
Email: [email protected]

Received 2 November 2003; Revised 6 August 2004

Speech quality assessment methods are necessary for evaluating and documenting treatment outcomes of patients suffering from degraded speech due to Parkinson's disease, stroke, or other disease processes. Subjective methods of speech quality assessment are more accurate and more robust than objective methods but are time-consuming and costly. We propose a novel objective measure of speech quality assessment that builds on traditional speech processing techniques such as dynamic time warping (DTW) and the Itakura-Saito (IS) distortion measure. Initial results show that our objective measure correlates well with the more expensive subjective methods.

Keywords and phrases: objective speech quality measures, subjective speech quality measures, pathology, anthropomorphic.

1. INTRODUCTION

The accurate assessment of speech quality is a major research problem that has attracted attention in the field of speech communications for many years. The two major classes of methods employed in the assessment of speech quality are subjective and objective speech quality measures. Subjective quality measures are more accurate and robust since they are given by professional personnel who have received special assessment training, but they are necessarily time-consuming and costly. In contrast, objective quality measures, inspired by speech signal processing techniques, provide an efficient, economical alternative to subjective measures. Although it is not suggested that objective quality measures completely replace subjective measures, objective measures do show a strong ability to predict subjective quality measures, and their results correlate very well with those produced by subjective quality measures [1]. Traditionally, objective measures have been used to evaluate speech after decoding and in the presence of noise. More recently, several groups have developed system protocols and algorithms to apply objective speech quality assessment to disordered speech analysis.

Any meaningful quality assessment should be consistent with human responses and perception. Therefore, subjective measures naturally became the first choice for evaluating speech quality. Methods using subjective measures are based on a group of listeners' opinions of the quality of an utterance. Subjective measures usually focus on speech intelligibility and the overall quality. Subjective measures can also be broadly grouped into two categories: utilitarian and analytic. Utilitarian methods have three goals: (1) they should be reasonably efficient in test administration and data analysis; (2) they evaluate speech quality on a unidimensional scale; (3) they must be reliable and robust in their test method. The key aspect of utilitarian approaches is that the results are summarized by a single number. On the other hand, analytic methods try to identify the underlying psychological components that determine perceived quality and to discover the acoustic correlates of these components. Therefore, the results from analytic methods are summarized on a multidimensional scale [1].

The modified rhyme test (MRT) by House and the diagnostic rhyme test (DRT) by Voiers are both intelligibility measures. The mean opinion score (MOS) test and the diagnostic acceptability measure (DAM) are overall quality measures, even though MOS is also commonly categorized as utilitarian and DAM is classified as analytic. It is understandable that subjective quality measures are the preferable means of quality assessment, but subjective measures do have several major drawbacks: (1) subjective measures require significant time and personnel resources, making it difficult to evaluate the range of potential speech/voice distortion; (2) subjective measures do not work very well when the tested speech database is large [2]; (3) some rating score protocols are not suitable for measurement of speech/voice [3]; (4) some literature suggests that listeners cannot agree on specific speech/voice ratings [4].

Compared with the subjective measures mentioned above, objective measures have several outstanding advantages: (1) they are less expensive to administer, saving money, time, and human resources; (2) they produce more consistent results and are not affected by human error; (3) most importantly, the form of the objective measure itself can give valuable insight into the nature of the human speech perception process, helping researchers understand the speech production mechanism more deeply [1]. Generally speaking, objective speech quality measures are usually evaluated in the time, spectral, or cepstral domains.

This paper is organized as follows. In Section 2, disordered speech background will be introduced. Then, in Section 3, the DTW method is discussed. Specific speech features for disordered speech will be proposed in Section 4. Section 5 deals with one subjective measure. All experimental results are discussed in Section 6. Finally, conclusions are drawn in Section 7.

2. DISORDERED SPEECH BACKGROUND

Usually, patients with Parkinson's disease or people who have suffered a stroke have difficulty producing clear speech, resulting in a loss of intelligibility. Hence, it is important to develop a means to help them produce clearer speech or to develop algorithms that automatically clarify their unclear speech. These efforts require an efficient method to evaluate disordered speech as the first step.

Attempts to develop algorithms to evaluate disordered speech require us to understand how disordered speech is produced, the factors that affect disordered speech, and the explicit phenomena related to these factors.

Figure 1: Objective patients' speech quality assessment block diagram. (Within the automatic assessment procedure, the patients' speech and healthy speech are aligned by DTW, passed to the objective quality assessment, and mapped by a score scaling system to a speech quality score.)

The term "dysarthria" is used to describe changes in speech production characterized by an impairment in one or more of the systems involved in speech [5]. The three major systems involved in speech production are respiration, voice production, and articulation. Voice is produced by the larynx, and the oral structures articulate to modify the sound source produced by the larynx. The dysarthria associated with Parkinson's disease is referred to as a hypokinetic dysarthria [6, 7]. Common symptoms of hypokinetic dysarthria include reduced loudness of speech and/or monoloudness (lack of loudness variation) and reduced speaking rate with intermittent rapid bursts of speech. For instance, speakers may show a slow rate of speech, but particular words or phrases within an utterance may be produced at a rapid rate. The oral structures such as the tongue and lips are "rigid," resulting in a reduced range of movement. This effectively dampens the speech signal and distorts the accuracy of the sound (consonant or vowel) production. There may be some instances of hypernasality as the condition worsens, resulting from inadequate velar closure. This may also result in dampening of the sound produced. Voice quality in these patients is often described as hoarse or harsh.

In this paper, we test several well-known speech processing parameters that can quantify the severity of disordered speech. These are the Itakura-Saito (IS) measure, the log-likelihood ratio (LLR) measure, and the log-area-ratio (LAR) measure, which evaluate the spectral envelope of the given disordered speech. Figure 1 shows the objective disordered speech quality assessment block diagram.

3. DYNAMIC TIME WARPING

Conventional objective speech quality measures are used to evaluate speech quality after speech is coded and decoded or transmitted with noise and channel degradation. In these scenarios, the original high-quality speech and the degraded speech have exactly the same length, which leads to a simple one-to-one comparison of windows from each speech utterance. However, in this project, we use the speech produced by healthy people as the gold standard to compare with disordered speech. In this case, aligning the two different speech segments to the same, reasonably comparable length is crucial. Dynamic time warping (DTW) is the most straightforward solution and is used to solve exactly this problem in speech recognition applications.

Given two speech patterns X and Y, these patterns can be represented by sequences (x1, x2, ..., xTx) and (y1, y2, ..., yTy), where xi and yi are feature vectors. As we have noted, in general the sequence of xi's will not have the same length as the sequence of yi's. In order to determine the distance between X and Y, given that some distance function d(x, y) exists, we need a meaningful way to determine how to properly align the vectors for the comparison. DTW is one way that such an alignment can be made [8]. We define two warping functions, φx and φy, which transform the indices of the vector sequences to a normalized time axis k. Thus we have

ix = φx(k), k = 1, 2, ..., T,
iy = φy(k), k = 1, 2, ..., T. (1)

This gives us a mapping from (x1, x2, ..., xTx) to (x1, x2, ..., xT) and from (y1, y2, ..., yTy) to (y1, y2, ..., yT). With such a mapping, we are able to compute dφ(x, y) using these warping functions, giving us the total distance between the two patterns as

dφ(x, y) = (1/Mφ) Σ_{k=1}^{T} d(φx(k), φy(k)) m(k), (2)

where m(k) is a path weight and Mφ is a normalization factor. Thus, all that remains is the specification of the path φ indicated in the above equation. The most common technique is to specify that φ minimizes the distance over all possible paths, subject to certain constraints:

d(X, Y) = min_φ dφ(x, y). (3)

For time normalization, the optimal path based on DTW has fixed beginning and ending points. Some other constraints may also apply. For example, the path should be monotonic, which requires a positive slope. This constraint eliminates the possibility of reverse warping. Therefore, we choose to enforce the Type III local constraint [8]. In addition, our numerous experimental results show that the eight local constraints will not significantly change the final results. Because of the local continuity constraints, certain portions are excluded from the region the optimal warping path can traverse. By using the maximum and minimum possible path expansion, we can define global path constraints as follows:

1 + (φx(k) − 1)/Qmax ≤ φy(k) ≤ 1 + Qmax(φx(k) − 1),
Ty + Qmax(φx(k) − Tx) ≤ φy(k) ≤ Ty + (φx(k) − Tx)/Qmax. (4)

In this aspect, slope weighting along the path adds yet another dimension of control in the search for the optimal warping path. There are four types of slope weighting. The type chosen in this paper is

m(k) = φx(k)− φx(k − 1) + φy(k)− φy(k − 1). (5)

If we take the notation d(ix, iy) as the distance between xix and yiy, which are the elements of (x1, x2, ..., xTx) and (y1, y2, ..., yTy), respectively, and D(ix, iy) as the accumulative optimal value, then we can apply the exact local constraint as well as the slope weight to get

D(ix, iy) = min{ D(ix − 2, iy − 1) + 3d(ix, iy),
                 D(ix − 1, iy − 1) + 2d(ix, iy),
                 D(ix − 1, iy − 2) + 3d(ix, iy) }. (6)
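To make the recursion in (6) concrete, the sketch below accumulates D(ix, iy) with the slope weights 3, 2, 3 of the chosen local constraint. The Euclidean frame distance, the omission of the global path constraints of (4), and the lack of path normalization are simplifying assumptions of this sketch rather than the authors' exact implementation.

```python
import numpy as np

def dtw_cost(X, Y, frame_dist=lambda a, b: float(np.sum((a - b) ** 2))):
    """Accumulated DTW cost following the recursion in (6).

    X, Y : feature sequences of shape (Tx, dim) and (Ty, dim).
    """
    Tx, Ty = len(X), len(Y)
    D = np.full((Tx + 1, Ty + 1), np.inf)   # 1-based indexing; row/column 0 act as barriers
    D[1, 1] = frame_dist(X[0], Y[0])        # fixed starting point
    for ix in range(1, Tx + 1):
        for iy in range(1, Ty + 1):
            if ix == 1 and iy == 1:
                continue
            d = frame_dist(X[ix - 1], Y[iy - 1])
            # Predecessors allowed by the local constraint, with weights 3, 2, 3.
            D[ix, iy] = min(
                D[ix - 2, iy - 1] + 3 * d if ix >= 2 else np.inf,
                D[ix - 1, iy - 1] + 2 * d,
                D[ix - 1, iy - 2] + 3 * d if iy >= 2 else np.inf,
            )
    return D[Tx, Ty]                        # fixed ending point
```

For example, dtw_cost(np.random.randn(180, 12), np.random.randn(150, 12)) aligns two feature sequences of unequal length, which is exactly the situation when a patient's sentence is compared against the healthy reference.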

4. OBJECTIVE QUALITY MEASURES

From an anthropomorphic perspective, speech production is very complex, but a simple view is that vowels are produced by the lungs, the larynx excitation, and the resonance of the vocal tract. The laryngeal configuration and the tongue's position dramatically change an individual speaker's speech intonation, pitch, or quality. For example, due to differences in tongue positions during pronunciation, nonnative speakers of English may use tongue movements characteristic of their native language, thereby producing a noticeable accent. Similarly, the rigid tongue movement of the Parkinson's patient causes their pronunciation to become distorted. We attempt to develop objective speech quality measures using knowledge of human speech production. However, we first need to define a few terms commonly used in speech processing. A formant is defined as a peak in the speech power spectrum. The pitch of speech is usually determined by the frequency of the excitation signal, which is produced by the vibration of the vocal folds. The vocal tract resonance is usually represented by the spectral envelope.

Some contemporary research has already made progress on objective analyses of disordered speech. For instance, the Computerized Speech Lab (CSL) produced by Kay Elemetrics Corporation is a commercially available hardware and software package for the analysis of disordered speech. The CSL allows a clinician to calculate several measures related to the intelligibility and quality of disordered speech. Another commercial product is the EVA system, made by SQ-Lab, Marseille, France. This system allows simultaneous measurement of acoustic and aerodynamic parameters related to speech production. Acoustic signals are recorded using the microphone built into the pneumotachograph, which is used to measure oral airflow. Intraoral pressure may be calculated using a built-in pressure sensor [9]. The majority of such analysis packages allow the calculation of acoustic and aerodynamic parameters such as jitter, shimmer, signal-to-noise ratio, oral airflow, and voice onset time. However, the concordance between these objective measures and perceptual ratings of quality and intelligibility remains at a relatively low percentage [10, 11] and is often unsuitable for clinical purposes. Many of these measures can only be calculated from relatively steady portions of the speech signal. However, numerous studies have stressed that the unsteady parts of the signal, such as onsets, could provide valuable information for objective evaluation of speech and allow finer discrimination of the severity of dysphonia. In addition, many of these measures are calculated from a single vowel that patients are required to produce for a relatively long period of time [12, 13]. In reality, a natural continuous sentence may provide a more accurate picture of the patient's speech disorder.

To overcome some of these shortcomings of the existing speech analysis techniques, we propose a new algorithm originally inspired by speech coding-decoding and speech telecommunications techniques. The first meaningful measure which can be obtained to compare speech differences is to compute the differences of the logarithms of the power spectrum at each frequency range [4]. We use the following equation to represent the difference:

d(w) = ln|X(w)|² − ln|Y(w)|², (7)

where X(w) and Y(w) are the magnitudes in the frequency domain of the two compared speech signals. The simplest and most straightforward way to express the overall spectral distortion is then

d(X, Y) = ( ∫_{−π}^{π} |d(w)|^k dw )^{1/k}, (8)

where, again, X and Y represent the two speech signals to be compared.
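As a small illustration of (7)-(8) on sampled data, the integral can be approximated by a sum over FFT bins; the frame windowing, the floor constant eps, and the choice k = 2 below are arbitrary assumptions of the sketch, not part of the paper.

```python
import numpy as np

def spectral_log_distortion(x, y, k=2, eps=1e-12):
    """Discrete approximation of (7)-(8) for two equal-length frames x and y."""
    X = np.fft.rfft(x * np.hanning(len(x)))
    Y = np.fft.rfft(y * np.hanning(len(y)))
    # d(w) = ln|X(w)|^2 - ln|Y(w)|^2, evaluated on the FFT grid.
    d_w = np.log(np.abs(X) ** 2 + eps) - np.log(np.abs(Y) ** 2 + eps)
    # Discrete L_k norm of the log-spectral difference, approximating (8)
    # up to a constant factor.
    return np.mean(np.abs(d_w) ** k) ** (1.0 / k)
```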

Although the above method is easy to implement, good results are not guaranteed. Many different types of modified standard objective quality measures have been proposed. These include measures such as the Itakura-Saito (IS) distortion measure, the log-likelihood ratio (LLR) measure, the log-area-ratio (LAR) measure, the segmental SNR measure, and the weighted spectral slope (WSS) measure. In this paper, we chose to investigate the first three measures: IS, LLR, and LAR [14, 15, 16].

The IS distortion measure is calculated based on the following equation:

dIS(ad, aφ) = (σφ²/σd²) (ad Rφ ad^T)/(aφ Rφ aφ^T) + log(σφ²/σd²) − 1, (9)

where σφ² and σd² represent the all-pole gains for the standard healthy people's speech and the test patients' speech. aφ and ad are the healthy-speech and patient-speech LPC coefficient vectors, respectively. Rφ is the autocorrelation matrix for xφ(n), where xφ(n) is the sampled speech of healthy people. The elements of Rφ are defined as

rφ(|i − j|) = Σ_{n=1}^{N−|i−j|} xφ(n) xφ(n + |i − j|), |i − j| = 0, 1, ..., p, (10)

Table 1: MOS subjective measure evaluation table.

Rating   Speech quality   Level of distortion
5        Excellent        Imperceptible
4        Good             Perceptible, but not annoying
3        Fair             Perceptible, and slightly annoying
2        Poor             Annoying, but not objectionable
1        Unsatisfied      Very annoying and objectionable

where N is the length of the speech frame and p is the order of the LPC coefficients.

LLR is similar to the IS measure. However, while the IS measure incorporates the gain factor by using variance terms, LLR only considers the difference between the general spectral shapes. The following equation provides the details for computing the LLR:

dLLR(ad, aφ) = log( (ad Rφ ad^T)/(aφ Rφ aφ^T) ). (11)

LAR is another speech quality assessment measure based on the dissimilarity of LPC coefficients between healthy speech and the patient's speech. Different from LLR, LAR uses the reflection coefficients to calculate the difference and is expressed by the equation

dLAR = | (1/p) Σ_{i=1}^{p} ( log[(1 + rφ(i))/(1 − rφ(i))] − log[(1 + rd(i))/(1 − rd(i))] )² |^{1/2}, (12)

where p is the order of the LPC coefficients, and rφ(i) and rd(i) are the ith reflection coefficients of the healthy and the patient's speech signals.
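For illustration, the sketch below computes per-frame IS, LLR, and LAR values from LPC analyses of a healthy frame and a patient frame. The LPC order of 10, the Levinson-Durbin recursion, and the helper function names are assumptions of this sketch and not the authors' code; the final prediction-error energy is treated as the all-pole gain σ².

```python
import numpy as np
from scipy.linalg import toeplitz

def lpc(frame, p=10):
    """Levinson-Durbin LPC analysis: returns a = [1, a1, ..., ap], the
    prediction-error energy (used as the gain), the reflection coefficients,
    and the (p+1)x(p+1) autocorrelation matrix of the frame, cf. (10)."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    a, E, refl = np.array([1.0]), r[0], []
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:], r[i - 1:0:-1])) / E
        refl.append(k)
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
        E *= (1 - k * k)
    return a, E, np.array(refl), toeplitz(r)

def is_llr_lar(healthy, patient, p=10):
    a_phi, sig2_phi, r_phi, R_phi = lpc(healthy, p)
    a_d, sig2_d, r_d, _ = lpc(patient, p)
    num = a_d @ R_phi @ a_d
    den = a_phi @ R_phi @ a_phi
    d_is = (sig2_phi / sig2_d) * (num / den) + np.log(sig2_phi / sig2_d) - 1   # (9)
    d_llr = np.log(num / den)                                                  # (11)
    lar = np.log((1 + r_phi) / (1 - r_phi)) - np.log((1 + r_d) / (1 - r_d))
    d_lar = np.sqrt(np.abs(np.mean(lar ** 2)))                                 # (12)
    return d_is, d_llr, d_lar
```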

In the following section describing the experiment and results, we will compare the performances of each of these measures applied to our database. The correlation between these objective quality assessment measures and one subjective quality assessment will also be discussed.

5. SUBJECTIVE QUALITY MEASURES

No matter how speech quality is defined, it must be based on human response and perception. So designing a suitable subjective measure of quality is very important in the assessment of speech quality. Correspondingly, the most important criterion to evaluate the accuracy of an objective measure of quality is to determine its correlation with subjective quality measures.

As discussed in Section 1, subjective measures can be broadly divided into utilitarian and analytic categories. Without loss of generality, we will use two of the utilitarian methods for our investigation. One reliable and easily implemented subjective utilitarian measure is the mean opinion score (MOS) [1, 4]. In this method, human listeners rate the speech under test on the five-point scale shown in Table 1.



Table 2: Moderate-severe subjective measure evaluation table.

Rating   Level of distortion
3        Moderate
2        Moderate to severe
1        Severe

Related research shows that as few as five but no more than nine categories are enough for the assessment of quality. The final speech quality assessment value can be calculated as the average of the responses of several listeners. The MOS test is widely used in the telecommunications area to compare the original signal quality with that of the distorted signal. For disordered speech analysis, however, it may not be feasible to categorize sentences as "perceptible, but not annoying" or "annoying, but not objectionable." Therefore, a different commonly used subjective utilitarian measure was obtained. In this test, listeners rated the sentences into three categories: mild, moderate, or severe [5, 6, 7]. A similar 4-point rating scale, called the GRBAS method, has been presented for the evaluation of disordered voice quality [17]. In these subjective tests, each test sentence was assigned a score based on whether the disordered sentence quality was perceived to be mild, moderate, or severe. Based on our database of Parkinson's patients tested in this experiment, we modified the mild-moderate-severe rating scale to have three new levels: moderate, moderate to severe, and severe. The details and criteria for these ratings are listed in Table 2. The following procedures were followed when obtaining perceptual judgments in the present experiment: Listeners were asked to listen carefully to each test sentence. Listeners were allowed to hear the test sentence as many times as needed to ensure that they assigned the most appropriate score to each sentence. Listeners were asked to read the criteria tables (Tables 1 and 2) carefully and were required to assign a score to each sentence based on the level of distortion described in the tables.

6. EXPERIMENTAL RESULTS

The speech database used in this experiment was collected by the experimenters at the Motor Movement Disorders Clinic, University of Florida. Ten patients with Parkinson's disease were recorded reading a standard passage ("Grandfather Passage"). Additionally, the same passage was also recorded from four healthy adult speakers. Although speakers vary in their rate of speech, this passage takes approximately 1 minute to read. Three successive sentences (around 15 seconds in duration) were selected from this passage for acoustic and perceptual analyses. The sentences include "You wish to know all about my grandfather. Well, he is nearly ninety three years old. He dresses himself in an ancient black frock coat, usually minus several buttons." The fourteen speakers were divided into two groups: males and females. In the first listening test, six listeners evaluated the speech of four Parkinson's patients and one healthy speaker. In the second listening test, we tested twelve listeners who rated the speech of seven Parkinson's patients and one healthy speaker. Of the 18 participants in the listening tests, six were from the USA, five from China, five from India, one from Korea, and one from Turkey. Seven of them were male and the rest were female. All listeners spoke fluent English.

The first listening test was used to obtain ratings using the MOS criteria listed in Table 1. Listeners gave an individual score to each sentence. In this study, two different methods were used to compare the objective and the subjective measures. In the first method, all MOS scores given by the listeners were correlated with the distance measures calculated by the various algorithms. In the second approach, the order of the MOS scores (rather than the actual value of the MOS scores) was correlated with the distance measures. In this approach, listeners simply ordered each sentence from the best to the worst quality. If two or more sentences were given the same rank, listeners were asked to listen carefully and choose different ranks for each sentence. In contrast, in the first method, listeners may end up giving identical integer scores to two speech segments even though one may sound noticeably better than the other. Table 3 gives the details of all the sentences scored using the MOS scale for male speakers only. Sentences labelled P1, P2, P3, and P4 were spoken by the Parkinson's patients and H1 is the sentence spoken by the healthy speaker. The six listeners are labelled List1 to List6.

One sentence from a healthy speaker was used as the standard sentence for calculating the objective measures of quality. DTW was first applied to align this standard sentence with each patient's sentence. Figure 2 shows the optimal frame match path between the standard healthy speech and the patient's speech. For the second method, every procedure is the same except for replacing the exact score by the relative order. Therefore, in Table 4, each listener column is the order given by that listener.

Finally, the three distortion measures (IS, LLR, and LAR) were calculated. The last three columns in Table 3 show the exact values of IS, LLR, and LAR, respectively. In Table 4, the last three columns show the relative order of the distortion scores obtained for each speaker. Figure 3 shows the healthy speech waveform (upper panel), the patient speech waveform (middle panel), and their distortion curve calculated by the IS measure (lower panel). Figure 4 shows a similar comparison based on LLR, and Figure 5 shows the same comparison based on LAR. Figure 6 exhibits the histogram of the distortion values, which may give us deeper insight into the differences between the healthy speaker's and the patient's speech. This may provide more information than the use of a single number obtained by averaging the distortion measures across a number of frames.

As discussed earlier, the quality of an objective measure is determined by how well it predicts the subjective measure. The following formula is widely used to evaluate the performance of objective measures:

ρ = Σ_d (Sd − S̄d)(Od − Ōd) / [ Σ_d (Sd − S̄d)² Σ_d (Od − Ōd)² ]^{1/2}, (13)

where Sd and Od are the subjective and objective results, and S̄d and Ōd are their corresponding average values.



Table 3: Subjective test results and their correlation with the objective test using method 1 in the first round.

Subject   List1  List2  List3  List4  List5  List6  Avg.   IS        LLR     LAR
P1        2      3      2      2      3      2      2.33   71,035    197.5   1441.5
P2        2      1      2      1      2      1      1.50   769,990   175.6   1054.2
P3        3      1      1      1      2      2      1.67   572,200   152.3   1014.9
P4        3      2      3      2      3      3      2.67   304,150   218.8   1025.4
H1        5      5      5      5      5      5      5      24,155    96.2    752.5
Corr.     —      —      —      —      —      —      —      0.7638    0.6419  0.5729

Figure 2: Dynamic time warping (DTW) optimal path between the recorded speech of a healthy person (horizontal axis, frame number) and a Parkinson's patient (vertical axis, frame number).

Table 4: Subjective test results and their correlation with the objective test using method 2 in the first round.

Subject   List1  List2  List3  List4  List5  List6  Avg.   IS      LLR     LAR
P1        5      2      3      3      3      3      3.2    2       4       5
P2        4      4      4      5      4      5      4.3    5       3       4
P3        2      5      5      4      5      4      4.2    4       2       2
P4        3      3      2      2      2      2      2.3    3       5       3
H1        1      1      1      1      1      1      1      1       1       1
Corr.     —      —      —      —      —      —      —      0.8684  0.1828  0.5142

Table 3 shows all three objective measures and their correlation values based on method 1. The IS measure, with a correlation of 0.7638, showed the best performance. Table 4 lists the correlation values based on method 2, and once again the IS measure showed the highest correlation of 0.8684. In analyzing (9), (11), and (12), we can see that the good performance of the IS measure might be partially due to the fact that it not only considers the general spectral difference, but also uses the variance term to take into account the gain factor of the all-pole filter model.
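The correlation in (13) is the sample Pearson correlation between the listener averages and the objective distances; a minimal sketch is given below, with made-up numbers rather than the paper's data.

```python
import numpy as np

def pearson_rho(subjective, objective):
    """Sample correlation of (13) between subjective and objective scores."""
    S, O = np.asarray(subjective, float), np.asarray(objective, float)
    Sc, Oc = S - S.mean(), O - O.mean()
    return np.sum(Sc * Oc) / np.sqrt(np.sum(Sc ** 2) * np.sum(Oc ** 2))

# Example with placeholder scores (equivalently: np.corrcoef(S, O)[0, 1]).
print(pearson_rho([2.3, 1.5, 1.7, 2.7, 5.0], [3.1, 4.0, 3.8, 2.9, 1.2]))
```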

After completing the preliminary test, a second test was conducted to validate our conclusion that IS is a good measure of disordered speech quality. In this test, speech samples from a larger number of patients with Parkinson's disease (seven instead of four) were rated by more listeners (twelve instead of six). In addition to the MOS scores, listeners were also asked to categorize the speech samples as normal, moderate, moderate to severe, or severe. To highlight the validity of the IS measure, only this measure was calculated for the speech samples used in the second test.



Figure 3: IS value (lower) versus healthy speech waveform (upper) and patient speech waveform (middle).

Figure 4: LLR value (lower) versus healthy speech waveform (upper) and patient speech waveform (middle).

Table 5 shows the MOS from individual listeners, the average MOS, and the correlation between the IS measure and the MOS values based on method 1 described earlier. This correlation was found to be 0.8032 and is comparable with the 0.7638 obtained in the first-round test. Table 6 shows the moderate-severe test scores from each listener, the average moderate-severe test scores, and the correlation between the IS measure and the subjective ratings. Once again, a correlation of 0.7417 was obtained, which is comparable to that obtained in the first-round test.

Figure 5: LAR value (lower) versus healthy speech waveform (upper) and patient speech waveform (middle).

Figure 6: The histogram of the distortion values based on the IS method.

All of the objective speech quality assessment criteria (IS, LLR, LAR, etc.) proposed above mainly focus on the speech spectral envelope. From the perceptual point of view, we are mainly interested in how to efficiently evaluate speech intelligibility and quality. However, intelligibility and quality are not the only aspects of the overall speech quality evaluation. Many other factors that affect speech quality may also need to be considered. For instance, Hansen and Nandkumar proposed that pitch turbulence (PT) may be used to evaluate the monotone or pitch variation, which is directly related to the laryngeal excitation signal.



Table 5: Subjective test results and their correlation with the objective test using method 1 in the second round, based on the MOS test.

Subject  List1 List2 List3 List4 List5 List6 List7 List8 List9 List10 List11 List12  Avg.   IS
P1       3     2     4     3     2     2     3     3     3     3      1      1       2.50   41,500
P2       2     3     3     2     3     1     2     2     2     3      2      1       2.17   84,200
P3       1     2     2     1     2     1     2     1     1     2      1      1       1.42   264,000
P4       4     4     4     4     4     3     5     4     5     4      4      4       4.08   10,300
P5       4     4     3     3     3     4     5     4     4     5      4      4       3.92   29,800
P6       1     3     2     1     2     2     3     1     2     3      2      1       1.92   205,000
P7       2     3     2     2     2     3     4     3     3     3      3      2       2.67   103,000
H1       5     5     5     5     5     3     5     5     5     5      5      5       4.83   6010
Corr.    —     —     —     —     —     —     —     —     —     —      —      —       —      0.8032

Table 6: Subjective test results and their correlation with the objective test using method 1 in the second round, based on the moderate-severe test.

Subject  List1 List2 List3 List4 List5 List6 List7 List8 List9 List10 List11 List12  Avg.   IS
P1       1     2     2     1     2     1     2     1     1     1      2      1       1.42   205,000
P2       2     2     1     2     2     2     2     2     3     2      2      2       2      103,000
P3       3     3     3     3     3     3     3     3     3     3      3      3       3      10,300
P4       1     1     1     1     1     1     1     2     1     1      1      1       1.08   264,000
P5       2     2     2     1     2     2     2     1     2     2      1      1       1.67   84,200
P6       2     3     1     2     2     2     3     2     2     2      2      2       2.08   41,500
P7       3     3     3     3     3     3     3     3     3     3      3      3       3      29,800
Corr.    —     —     —     —     —     —     —     —     —     —      —      —       —      0.7417

Similarly, energy turbulence (ET) is another important factor used to evaluate the monoloudness or energy variation [2]. The following equations give the exact mathematical expressions for these measures:

PT = (1/(N − 1)) Σ_{i=1}^{N−1} |P(i + 1) − P(i)|,
ET = (1/(N − 1)) Σ_{i=1}^{N−1} |E(i + 1) − E(i)|, (14)

where N is the total number of frames of the given sentence, and P(i) and E(i) represent the pitch and energy of frame i. We used the data obtained in the first round of evaluation to test the correlation between these measures (PT and ET) and the subjective ratings. Table 7 shows the pitch turbulence (PT) and energy turbulence (ET) values calculated from (14) as well as their correlation based on method 1. Table 8 shows the similar results based on method 2. Figures 7 and 8 show the pitch turbulence and energy turbulence for a given speech signal. Based on Tables 7 and 8, it appears that PT and ET are poorly correlated with the subjective assessments, using either method 1 or method 2. This suggests that during subjective assessment, humans put most of their emphasis on intelligibility, which, from a signal processing view, is related primarily to the spectral envelope. The excitation (pitch) and energy variation are not as important as spectral envelope variation in the perception of overall speech quality. Even in our current algorithm, pitch and energy turbulence were not very efficient in predicting the

Table 7: Subjective test results and their correlation with the PT and ET tests using method 1.

Subject   Avg.   PT        ET
P1        2.33   13.2777   3.9040
P2        1.50   8.0712    8.0712
P3        1.67   4.5775    11.9782
P4        2.67   16.3607   5.8966
H1        5      4.2815    8.3446
Corr.     —      0.1264    0.1137

Table 8: Subjective test results and their correlation with the PT and ET tests using method 2.

Subject   Avg.   PT       ET
P1        3.2    4        5
P2        4.3    3        3
P3        4.2    2        1
P4        2.3    5        4
H1        1      1        2
Corr.     —      0.1828   0.0800

overall speech quality. Potentially, even though the correlation performance with a one-dimensional evaluation (such as MOS) is poor, these two parameters may correlate well with a multidimensional evaluation (such as DAM).
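As a small illustration of (14), the sketch below computes the mean absolute frame-to-frame change used for both PT and ET; how the per-frame pitch and energy contours are obtained (e.g., a pitch tracker and a log-energy computation) is an assumption of the sketch, not specified here by the paper.

```python
import numpy as np

def turbulence(track):
    """Mean absolute frame-to-frame change, as used for PT and ET in (14)."""
    track = np.asarray(track, float)
    return np.mean(np.abs(np.diff(track)))

# Usage sketch: given a per-frame pitch contour P and per-frame energies E
# (both hypothetical inputs), PT = turbulence(P) and ET = turbulence(E).
```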



Figure 7: The pitch turbulence (lower) from a given speech signal (upper).

Figure 8: The energy turbulence (lower) from a given speech signal (upper).

7. CONCLUSION

Objective evaluation of disordered speech quality is not an easy task. In this paper, we discuss three objective quality assessment measures and one subjective measure. In evaluating our speech database, the IS measure showed a strong correlation with the MOS tests. Therefore, the IS measure is suggested to be more suitable than LLR and LAR for use as a reliable tool to evaluate the overall quality of disordered speech. The IS measure could also be used to predict the subjective MOS score given by human listeners.

ACKNOWLEDGMENT

The authors are grateful to three reviewers who provided us with a large number of detailed suggestions for improving the submitted manuscript.

REFERENCES

[1] S. Quackenbush, T. Barnwell, and M. Clements, Objective Measures of Speech Quality, Prentice Hall, New York, NY, USA, 1988.

[2] J. Hansen and S. Nandkumar, "Objective quality assessment and the RPE-LTP vocoder in different noise and language conditions," Journal of the Acoustical Society of America, vol. 97, no. 1, pp. 609–627, 1995.

[3] J. Hansen and L. Arslan, "Robust feature-estimation and objective quality assessment for noisy speech recognition using the credit card corpus," IEEE Trans. Speech Audio Processing, vol. 3, no. 3, pp. 169–184, 1995.

[4] S. Dimolitsas, "Objective speech distortion measures and their relevance to speech quality assessments," IEE Proceedings Part I: Communications, Speech and Vision, vol. 136, no. 5, pp. 317–324, 1989.

[5] L. Ramig, C. Bonitati, J. Lemke, and Y. Horii, "Voice treatment for patients with Parkinson disease: Development of an approach and preliminary efficacy data," Journal of Medical Speech-Language Pathology, vol. 2, no. 3, pp. 191–209, 1994.

[6] S. Countryman, L. Ramig, and A. Pawlas, "Speech and voice deficits in Parkinsonian Plus syndromes: Can they be treated?" Journal of Medical Speech-Language Pathology, vol. 2, no. 3, pp. 211–225, 1994.

[7] S. Countryman and L. Ramig, "Effects of intensive voice therapy on voice deficits associated with bilateral thalamotomy in Parkinson disease: A case study," Journal of Medical Speech-Language Pathology, vol. 1, no. 4, pp. 233–250, 1993.

[8] L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, New York, NY, USA, 1984.

[9] P. Yu, M. Ouaknine, J. Revis, and A. Giovanni, "Objective voice analysis for dysphonic patients: a multiparametric protocol including acoustic and aerodynamic measurements," Journal of Voice, vol. 15, no. 4, pp. 529–542, 2001.

[10] A. Giovanni, D. Robert, N. Estublier, B. Teston, M. Zanaret, and M. Cannoni, "Objective evaluation of dysphonia: preliminary results of a device allowing simultaneous acoustic and aerodynamic measurements," Folia Phoniatr Logop, vol. 48, no. 4, pp. 175–185, 1996.

[11] J. Revis, A. Giovanni, F. Wuyts, and J. Triglia, "Comparison of different voice samples for perceptual analysis," Folia Phoniatr Logop, vol. 51, no. 3, pp. 108–116, 1999.

[12] D. Berry, K. Verdolini, D. Montequin, M. Hess, R. Chan, and I. Titze, "A quantitative output-cost ratio in voice production," Journal of Speech, Language and Hearing Research, vol. 44, no. 1, pp. 29–37, 2001.

[13] P. Dejonckere, C. Obbens, G. De Moor, and G. Wieneke, "Perceptual evaluation of dysphonia: Reliability and relevance," Folia Phoniatr Logop, vol. 45, no. 2, pp. 76–83, 1993.

[14] E. Wallen and J. Hansen, "A screening test for speech pathology assessment using objective quality measures," in Proc. 4th International Conference on Spoken Language Processing (ICSLP '96), vol. 2, pp. 776–779, Philadelphia, Pa, USA, October 1996.

[15] L. Thorpe and W. Yang, "Performance of current perceptual objective speech quality measures," in Proc. IEEE Workshop on Speech Coding (SCW '99), pp. 144–146, Porvoo, Finland, June 1999.

[16] J. Hansen and B. Pellom, "An effective quality evaluation protocol for speech enhancement algorithms," On-line Technical Report.

[17] M. Hirano, Psycho-Acoustic Evaluation of Voice: GRBAS Scale for Evaluating the Hoarse Voice, Springer-Verlag, New York, NY, USA, 1981.


Lingyun Gu received his B.S. and M.S. degrees in electrical engineering from the University of Electronic Science and Technology of China (UESTC) and Old Dominion University in 1998 and 2002, respectively. He is currently pursuing his Ph.D. in the Computational NeuroEngineering Laboratory, Electrical and Computer Engineering Department, University of Florida (UF). His main research interests are in robust speech recognition, speech signal processing, and auditory production and perception.

John G. Harris received his B.S. and M.S. degrees in electrical engineering from MIT in 1983 and 1986. He earned his Ph.D. degree from Caltech in the interdisciplinary Computation and Neural Systems Program in 1991. After a two-year postdoc at the MIT AI Lab, Dr. Harris joined the Electrical and Computer Engineering Department, University of Florida (UF). He is currently an Associate Professor and leads the Hybrid Signal Processing Group in researching biologically inspired circuits, architectures, and algorithms for signal processing. Dr. Harris has published over 100 research papers and patents in this area. He codirects the Computational NeuroEngineering Laboratory and has a joint appointment in the Biomedical Engineering Department at UF.

Rahul Shrivastav earned his B.S. degree in 1995 and M.S. degree in 1997 in speech and hearing sciences from the University of Mysore, India. He completed his Ph.D. degree in speech and hearing science from Indiana University, Bloomington, in 2001. Currently he is on the faculty at the Department of Communication Sciences and Disorders, University of Florida. His research studies the factors that affect the perception of voice quality and speech intelligibility in patients with a variety of speech disorders.

Christine Sapienza received her Ph.D. degree in speech science from The State University of New York at Buffalo in 1993. Currently, she is a Professor in the Department of Communication Sciences and Disorders, University of Florida. Her most recent work has focused on the use of strength training paradigms in multiple populations including voice disorders, Parkinson's disease, spinal cord injury, and multiple sclerosis. She maintains an active research laboratory with 7 current Ph.D. students. Her clinical work takes place at Ayers Outpatient Voice Clinic and the Motor Movement Disorders Clinic at the University of Florida. She is also a Research Health Scientist at the Brain Rehabilitation Research Center, Malcom Randall VA, Gainesville, Florida.


EURASIP Journal on Applied Signal Processing 2005:9, 1410–1424
© 2005 Hindawi Publishing Corporation

Objective Speech Quality Measurement Using Statistical Data Mining

Wei Zha
Power, Acquisition and Telemetry Group, Schlumberger Technology Corporation, 150 Gillingham Lane, MD 1, Sugar Land, TX 77478, USA
Email: [email protected]

Wai-Yip Chan
Department of Electrical & Computer Engineering, Queen's University, Kingston, ON, Canada K7L 3N6
Email: [email protected]

Received 7 November 2003; Revised 3 September 2004

Measuring speech quality by machines overcomes two major drawbacks of subjective listening tests, their low speed and high cost. Real-time, accurate, and economical objective measurement of speech quality opens up a wide range of applications that cannot be supported with subjective listening tests. In this paper, we propose a statistical data mining approach to design objective speech quality measurement algorithms. A large pool of perceptual distortion features is extracted from the speech signal. We examine using classification and regression trees (CART) and multivariate adaptive regression splines (MARS), separately and jointly, to select the most salient features from the pool, and to construct good estimators of subjective listening quality based on the selected features. We show designs that use perceptually significant features and outperform the state-of-the-art objective measurement algorithm. The designed algorithms are computationally simple, making them suitable for real-time implementation. The proposed design method is scalable with the amount of learning data; thus, performance can be improved with more offline or online training.

Keywords and phrases: speech quality, speech perception, mean opinion scores, data mining, classification trees, regression.

1. INTRODUCTION

“Plain old telephone service,” as traditionally provided using dedicated circuit-switched networks, is reliable and economical. A contemporary challenge is to provide high-quality, reliable, and low-cost voice telephone services over nondedicated and heterogeneous networks. Good voice quality is a key factor in garnering customer satisfaction. In a dynamic network, voice quality can be maintained through a combination of measures: design planning, online quality monitoring, and call control. Underlying these measures is the need to measure user opinion of voice quality. Traditionally, user opinion is measured offline using subjective listening tests. Such tests are slow and costly. In contrast, machine computation (“objective measurement”), which involves no human subjects, provides a rapid and economical means to estimate user opinion. Objective measurement enables network service providers to rapidly provision new network connectivity and voice services. Online objective measurement is the only viable means of measuring voice quality, for the purpose of real-time call monitoring and control, on a network-wide scale. Other applications of voice quality measurement

include evaluation of disordered speech [1] and synthesized speech [2].

Algorithms for objective measurement of speech quality can be divided into two types: single-ended and double-ended (see Figure 1). Double-ended algorithms need to input both the original (“clean”) and degraded speech signals, whereas single-ended algorithms need only to input the degraded speech signal. Single-ended algorithms can be used for “passive” monitoring, that is, nonintrusively tapping into a voice connection. Double-ended algorithms are sometimes called “intrusive” because a voice signal known to the algorithm has to be injected into the transmit end. Nevertheless, Conway in [3] proposes a method that employs double-ended algorithms without intruding on an ongoing call. The method is based on measuring packet degradations at the receive end. The measured degradations are applied to a typical speech signal to produce a degraded signal. A double-ended algorithm is used to map the speech signal and degraded signal to speech quality.

The performance of objective measurement algorithms is primarily characterized by the accuracy of the user opinion scores produced by the algorithm, using the opinion scores


Figure 1: Single-ended and double-ended speech quality measurements. (The original speech passes through the device or network under test to produce degraded speech; single-ended measurement uses only the degraded speech, while double-ended measurement uses both signals.)

obtained from subjective tests as accuracy benchmarks. The mean opinion score (MOS) [4], obtained by averaging the absolute categorical ratings (ACRs) produced by a group of listeners, is the most commonly used measure of user opinion. Subjective listening tests are generally performed with a limited number of listeners, so that the MOS varies with the listener sample and its size. In such a case, the degree of accuracy of objective scores can be assessed up to the degree of accuracy of the subjective scores used as benchmarks.

The International Telecommunications Union (ITU) standard [5, P.862], also called Perceptual Evaluation of Speech Quality (PESQ), is a double-ended algorithm that exemplifies the “state of the art.” An ITU standard for single-ended quality measurement [6, P.563] has recently reached a “prepublished” status. Objective measurement has the advantage of being consistent. While subjective tests can be used to estimate the MOS very accurately by using a large listening panel, objective measurement can provide a more accurate MOS estimate than a small listener panel (see [7] for a simple model of measurement variance). Hence, objective measurement, which can be automated and performed in real time, provides a very attractive alternative to subjective tests.

The process of human judgment of speech quality can be modeled in two parts. The first part, auditory perception, entails transduction of the received speech acoustic signal into auditory nerve excitations. Auditory models are well studied in the literature [8] and have been applied to the design of PESQ and other objective measurement algorithms [9, 10, 11]. Essential elements of auditory processing include bark-scale frequency warping and spectral power to subjective loudness conversion. The second part of the human judgment process entails cognitive processing in the brain, where compact features related to normative and anomalous behaviors in speech are extracted from auditory excitations and integrated to form a final impression of the perceived speech signal quality. Cognitive models of speech distortions are less well developed. Nevertheless, for the goal of accurate prediction of subjective opinion of speech quality, anthropomorphic modeling of cognitive processing is not strictly necessary.

In place of cognitive modeling, we pursue a statistical data mining approach to design novel double-ended algorithms. The success of statistical techniques in advancing speech recognition performance lends promise to the approach. Our algorithms are designed based on classifying perceptual distortions under a variety of contexts.

A large pool of context-dependent feature measurements is created. Statistical data mining tools are used to find good features in the pool. Features are selected to produce the best estimator of the subjective MOS value. The algorithms demonstrate significant performance improvement over PESQ, at a comparable computational complexity. In effect, the statistical classifier-estimators serve as utilitarian models of human cognitive judgment of speech quality.

This paper is organized as follows. Section 2 provides the background by introducing existing double-ended speech quality measurement schemes and two statistical data mining algorithms. Section 3 describes our speech quality measurement algorithm architecture, its basic elements and design framework, and feature design and mining. Lastly, in Section 4, various design methods and designed algorithms are examined and their performance is assessed experimentally.

2. BACKGROUND

In this section, we review briefly existing objective speech quality measurement methods and the statistical data mining techniques we have used.

2.1. Current objective methods

Early speech quality measures were used for assessing the quality of waveform speech coders. These measures calculate the difference between the waveform of the nondegraded speech and that of the degraded speech, in effect using waveform matching as a criterion of quality. Representative measures include the signal-to-noise ratio (SNR) and segmental SNR [12]. Measures of distortions in the short-time spectral envelopes of speech [13] were later introduced. These measures do not require the waveforms to match in order to produce zero distortion. They are suitable for low-bit-rate speech coders that may not preserve the original speech waveform, for example, linear-prediction-based analysis-by-synthesis (LPAS) coders [14]. For a comprehensive review of objective methods known till the late 1980s, the reader can consult [15].

Measurement algorithms that exploit human auditory perception rather than just the acoustic features of speech provide more accurate prediction of subjective quality. Representative algorithms include BSD (Bark spectral distortion) [11], MNB (measuring normalizing block) [9, 10], and PESQ (perceptual evaluation of speech quality) [5, 16]. A major difference among algorithms of this kind is in the postprocessing of the auditory error surface. Hollier et al. [17] use an entropy measure of the error surface. MNB uses a hierarchical structure of integration over a range of time and frequency intervals. PESQ (Figure 2) furnishes the current state-of-the-art performance. PESQ performs integration in three steps, first over frequency, then over short-time utterance intervals, and finally over the whole speech signal. Different p values are used in the Lp norm integrations of the three steps. (PESQ also provides a delay compensation algorithm that is essential for quality measurement of


Figure 2: Schematic diagram of the PESQ method [5]. (The original and degraded speech are time-aligned and passed through perceptual models; the difference in their internal representations determines the audible difference, which a cognitive model maps to quality.)

voice packets that are subject to delay variation in the network.) The different methods of integration, though they may not resemble cognitive processes, achieve their respective degrees of effectiveness through using subjectively scored speech data to calibrate the mapping to estimated speech quality.
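The three-stage Lp-norm integration described above can be sketched roughly as follows; the disturbance array, the 20-frame interval length, and the p values used here are placeholders for illustration, not PESQ's actual parameters.

```python
# A minimal sketch of multi-stage Lp-norm integration of a disturbance surface
# (frames x bands); p values and interval length are assumed, not PESQ's.
import numpy as np

def lp_mean(x, p, axis=None):
    """Lp-norm style average: (mean(|x|^p))^(1/p)."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.mean(x ** p, axis=axis) ** (1.0 / p)

rng = np.random.default_rng(0)
disturbance = rng.random((200, 42))                    # stand-in audible differences

per_frame = lp_mean(disturbance, p=2, axis=1)          # 1) integrate over frequency
n_split = (len(per_frame) // 20) * 20                  # group into 20-frame intervals
per_interval = lp_mean(per_frame[:n_split].reshape(-1, 20), p=6, axis=1)  # 2) over intervals
overall = lp_mean(per_interval, p=2)                   # 3) over the whole signal
print(f"aggregated disturbance = {overall:.4f}")
```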

Subjects in MOS tests rate speech quality on the integer ACR scale of 1 to 5, with 5 representing excellent quality, and 1 representing the worst quality. The MOS is a continuous value based on averaging the listeners' ACR scores. Ideally, the MOS obtained using a large and well-formed listener panel reflects the “true” mean opinion of the listener population. In practice, the measured MOS varies across tests, countries, and cultures. In subjective tests that use a different measure called DMOS [4], or degradation MOS, the subject listens to the original speech before scoring the degree of degradation of the degraded speech relative to the original. In MOS tests, a subject listens to a speech sample and chooses his/her opinion of its quality in a “categorical” sense, without first listening to a “reference” speech sample. The subject relies on his/her experience of speech quality to decide on the quality of the sample. Hence, single-ended algorithms are akin to MOS tests, while double-ended algorithms are akin to DMOS tests. Though most existing double-ended algorithms are designed to predict MOS, they may actually predict DMOS with better accuracy than MOS [18]. Relying on differences or distortions with respect to a “clean” signal alleviates the need to model “clean” speech in a normative sense. Nevertheless, distortions that are measurable on psychoacoustical scales do not necessarily contribute to perceived quality degradation. Speech signals can be modified in ways such that the modified signal can be distinguished from its original in a comparison test, but the modified signal would not be judged as degraded in a MOS test. Any “cognitive” processing ought to give no weight to differences that are measurable but do not affect the type of quality judgment that is predicted by the objective measurement. Existing double-ended algorithms do not have the intelligence to disregard such types of differences. The algorithms will predict a poorer quality for speech that has been transformed

but not degraded. Consider the contrived example where an utterance is replaced with a different utterance of the same duration; the quality stays the same but the measured difference may be huge.

2.2. Statistical data mining

A major aim of this work is to use statistical data mining methods to find psychoacoustic features which most significantly correlate with quality judgment. Statistical data mining involves using statistical analysis tools to find underlying patterns or relationships in large data sets. Statistical data mining techniques have been applied with much success to solve diverse problems in manufacturing quality control, market analysis, medical diagnosis, financial services, and so forth. We consider two techniques in this paper: classification and regression trees (CART) [19] and multivariate adaptive regression splines (MARS) [20].

Suppose we have a response variable y and n predictor variables x_1, . . . , x_n. Suppose we observe N joint realizations of the response and predictor variables. Our observations can be modeled as

\[
y = f\left(x_1, \ldots, x_n\right) + \delta,
\tag{1}
\]

where δ represents a noise term. Our aim is to find a subset of predictor variables x_{i_1}, . . . , x_{i_m}, with i_j ∈ {1, . . . , n}, j = 1, . . . , m, and m ≤ n, and a mapping f(x_{i_1}, . . . , x_{i_m}), such that f yields a good estimate of the response variable y.

2.2.1. CART

CART (classification and regression trees) [19] is a recursive partitioning algorithm. The domain D of the desired mapping is partitioned into M disjoint regions {R_m}, m = 1, . . . , M. The partitioning process is recursive; new regions are generated by splitting regions that have been found so far. In conventional application of CART, the splitting is restricted to be perpendicular to the axis of the predictor variable chosen at each step of the recursion. This enables the splitting to be effected by answering a simple “yes” or “no” question on the predictor variable. The variable is chosen amongst x_1, . . . , x_n


Figure 3: CART regression tree. (The example tree first asks whether the speech frame is active, then whether the frame is voiced or how large the frame distortion is, and assigns a constant regression value, f = 1.7, 2.5, 3.4, 3.9, or 4.5, at each of its five leaf nodes.)

to minimize a splitting cost criterion. CART results are easy to interpret due to its simple binary tree representation. In Figure 3, a simplistic CART tree is shown, where circles represent internal nodes and rectangles represent leaf nodes. Each internal node in the tree is always split into two child nodes.

CART trees are designed in a two-stage process. First, an oversize tree is grown. The tree is then pruned based on performance validation, until the best-size tree is found. During tree growing, the next split is found by an exhaustive search through all possible single-variable splits at all the current leaf nodes. In CART regression, each region is approximated by a constant function

\[
f(x) = a_m \quad \text{if } x \in R_m.
\tag{2}
\]

The splitting cost criterion is the decrease in regression error resulting from the split. The regions generated by CART are disjoint, and the piecewise constant regression function f is discontinuous at region boundaries. This can lead to poor regression performance, unless the dataset is sufficiently large to support a large tree. Nevertheless, CART has been successfully used in classifying high-risk patients [19], quality control [21], and image vector quantization [22].

2.2.2. MARS

Multivariate adaptive regression spline (MARS) [20] was proposed as an improvement over recursive partitioning algorithms such as CART. Unlike CART, MARS produces a continuous regression function f, and the regions of MARS may overlap. In MARS, f is constructed as a sum of M basis functions:

\[
f(x) = \sum_{m=1}^{M} a_m B_m(x),
\tag{3}
\]

Figure 4: A MARS regression function with three knots (marked by ×). (The plot compares the MARS regression and a linear regression fit of y against x.)

where the basis function B_m(x) takes the form of a truncated spline function. In Figure 4, a single-variable f with three “knots” is shown, where each knot marks the end of one region of data and the beginning of another. Compared to the linear regression function, the MARS regression function better fits the data.

Like CART, the MARS regression model is also built in two stages. First, an oversize model is built by progressively adding more basis functions. In the second stage, basis functions that contribute the least to modeling accuracy are progressively pruned. At each step in the model's growing phase, the best pair of basis functions to add is found


Figure 5: Algorithm architecture. (The clean and degraded speech each pass through auditory processing, and a cognitive mapping produces the estimated MOS.)

by an exhaustive search, similar to finding the best split during CART tree growing. MARS has been applied to predict customer spending and forecast recessions [23], and to predict mobile radio channels [24].
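The truncated-spline construction of (3) can be illustrated with a minimal single-variable sketch; here the knot locations are fixed by hand for clarity rather than found by the MARS forward search, and the data are synthetic.

```python
# A minimal sketch of regression with truncated linear spline ("hinge") basis
# functions of the kind MARS uses; knots are assumed, not searched for.
import numpy as np

def hinge_basis(x, knots):
    """Columns: intercept, then max(0, x - t) and max(0, t - x) per knot t."""
    cols = [np.ones_like(x)]
    for t in knots:
        cols.append(np.maximum(0.0, x - t))
        cols.append(np.maximum(0.0, t - x))
    return np.column_stack(cols)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 40, 200))
y = np.piecewise(x, [x < 10, (x >= 10) & (x < 25), x >= 25],
                 [lambda v: v, lambda v: 10 + 2 * (v - 10), lambda v: 40 - (v - 25)])
y += rng.normal(0, 1.0, x.size)

knots = [10.0, 25.0, 33.0]                     # assumed knot locations
B = hinge_basis(x, knots)
a, *_ = np.linalg.lstsq(B, y, rcond=None)      # linear-combination weights a_m
y_hat = B @ a                                  # f(x) = sum_m a_m B_m(x), as in (3)
print("RMSE of spline fit:", float(np.sqrt(np.mean((y - y_hat) ** 2))))
```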

3. PROPOSED DESIGN METHOD

In the proposed method, double-ended measurement algorithms are designed based on the architecture depicted in Figure 5. Auditory processing (Figure 6) is first applied to both the clean speech and the degraded speech, to produce a subband decomposition for each signal. The subband decomposed signals and the clean speech signal are input to the cognitive mapping module (Figure 7), where a distortion surface is produced by taking the difference of the two subband decompositions. A large pool of candidate feature variables is extracted from the distortion surface. MARS and/or CART is applied to sift out a small set of predictor variables from the pool of candidate variables, while progressively constructing and optimizing the regression mapping f. This mapping replaces the statistical mining block in Figure 7 upon completion of the design.

The auditory processing modules decompose the input speech signals into power distributions over time frequency and then convert them to auditory excitations on a loudness scale. The cognitive mapping module interprets the differences (distortions) between the auditory excitations of the clean and the degraded speech signals. In effect, the cognitive module “integrates” the distortions over time and frequency to arrive at a predicted quality score. We make the simple observation that “distortions are not created equal.” An isolated large distortion event is likely to be cognitively distinct from small distortions that are widely diffused over time frequency, though the small distortions may integrate to a substantial amount. The latter kind of distortion may be less annoying than the former kind. We take an agnostic view of how human cognition weighs the contributions from different types of distortions. The approach we take is to create a plethora of “contexts” under which distortion events occur. Distortions with the same context are integrated to a value which we call a “feature.” Straightforward root-mean-square (RMS) integration is used to compute the feature value. Each context gives rise to one candidate feature, so that there are as many candidate features as the number of contexts. From the pool of candidate features, data mining techniques are used to find a small subset of features and the best way to

Figure 6: The processing steps in an auditory processing module. (The speech is transformed by FFT to a power spectrum, summed over bands, and converted to loudness to yield the subband decomposed signal.)

combine them to estimate the speech quality. The modules are described next in Sections 3.1 and 3.2. Detailed design considerations and justifications of the modules then follow in Section 3.3, and finally computational complexity is considered in Section 3.4.

3.1. Auditory processing

A block diagram of auditory processing is depicted in Figure 6. Human auditory processing of acoustic signals is commonly modeled by signal decomposition through a bank of filters whose bandwidths increase with filter center frequency according to the bark or critical-band scale [8]. A typical realization of this model employs roughly 17 filters or spectral bands to cover the telephone voice channel. In our experiments, we found that 7 bands, each with a bandwidth of about 2.4 bark, strike a good balance between prediction performance and sensitivity to irrelevant variations in the input data (for further elaboration, see Section 3.3.1). In our scheme, the speech signal is partitioned into 10-millisecond frames. For each frame, a 128-point power spectrum is calculated by applying the FFT to a 128-point Hanning-windowed signal segment centered on each frame. The spectral power coefficients are grouped into 7 bands. The coefficients in each band are summed, to produce altogether 7 subband power samples. The samples are converted to the subjective loudness scale using Zwicker's power law [8]:

\[
L(f) = L_0 \left(\frac{E_{TQ}(f)}{s(f)\,E_0}\right)^{k}
\left[\left(1 - s(f) + \frac{s(f)\,E(f)}{E_{TQ}(f)}\right)^{k} - 1\right],
\tag{4}
\]

where the exponent k = 0.23, L_0 = 0.068, E_0 is the reference excitation power level, E_TQ(f) is the excitation threshold at frequency f, E(f) is the input excitation at frequency f, and s(f) is the threshold ratio.
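The auditory processing chain of Figure 6 can be sketched roughly as below; the 8-kHz sampling rate, the band edges, and the frequency-independent E_TQ and s values are simplifying assumptions for illustration, not the values used in the paper.

```python
# A minimal sketch of the auditory front end: 10-ms framing, 128-point
# Hanning-windowed power spectra, summation into 7 bands, and Zwicker's
# loudness law (4). Sampling rate, band edges, and constants are assumed.
import numpy as np

FS = 8000
FRAME = 80                       # 10 ms at 8 kHz
NFFT = 128
BAND_EDGES_HZ = [0, 300, 600, 1000, 1500, 2200, 3000, 4000]   # 7 bands (assumed)
K, L0, E0, ETQ, S = 0.23, 0.068, 1.0, 1e-6, 0.5               # assumed constants

def subband_loudness(speech):
    bins = np.fft.rfftfreq(NFFT, 1.0 / FS)
    band_idx = np.digitize(bins, BAND_EDGES_HZ[1:-1])          # bin -> band 0..6
    win = np.hanning(NFFT)
    frames = []
    for start in range(0, len(speech) - NFFT + 1, FRAME):
        power = np.abs(np.fft.rfft(speech[start:start + NFFT] * win)) ** 2
        bands = np.array([power[band_idx == b].sum() for b in range(7)])
        # Zwicker's power law, (4), with frequency-independent E_TQ and s:
        loud = L0 * (ETQ / (S * E0)) ** K * ((1 - S + S * bands / ETQ) ** K - 1)
        frames.append(loud)
    return np.array(frames)      # shape: (num_frames, 7)

x = np.random.default_rng(3).normal(0, 0.1, FS)   # 1 s of stand-in "speech"
print(subband_loudness(x).shape)
```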

3.2. Cognitive mapping

The “cognitive mapping” module comprises functional blocks as depicted in Figure 7. The decomposed clean and degraded speech signals from the auditory processing modules are first subtracted to obtain their absolute difference, which is called the “distortion.” The distortion over the whole speech signal can be organized into a two-dimensional array, representing a distortion surface over time frequency. A goal of the cognitive mapping is to aggregate cognitively similar distortions through segmentation and classification (elaborated below). The perceptually significant aggregated distortions are found using data mining. The statistical data mining block in Figure 7 is present during the design phase of the cognitive mapping block. Once the design is completed,


Figure 7: Cognitive mapping. (The subband decomposed clean and degraded speech are differenced to give the distortion; the clean speech drives the VAD and voicing decision; severity classification and contextual distortion integration produce features, which statistical data mining maps to the estimated MOS.)

the block is replaced by a simple mapping block. The mapping (the aforementioned f) is computationally simple, as can be seen from the example presented in Appendix B.

3.2.1. Time segmentation

The clean speech signal is processed through a voice activity detector (VAD) and then a voicing detector. Each 10-millisecond speech frame is thereby labeled as either “background,” “active-voiced,” or “active-unvoiced.” We use the VAD algorithm from ITU-T G.729B [25], omitting the comfort noise generation part of the algorithm. More recent VAD algorithms such as that in the AMR codec [26] may also be used to advantage. The purpose of the segmentation is to separate the different types of speech frames so that they can exert separate influence on the speech quality estimate. The advantage of such segmentation is suggested in [27], where performance was improved using clustering-based segmentation.

3.2.2. Severity classification

The total distortion of each frame is classified into different severity levels. The aim is to sift out the significant distortion events. Different forms of classifiers can be used. We have experimented with simple thresholding, CART classification [19], and Gaussian mixture density modeling. Based on our simulation results, we have found that a simple classification scheme, thresholding the average frame distortion, suffices to produce most of the benefit. Results presented below are based on thresholding to 3 severity levels, which we call low, medium, and high distortion severity. In [28], fixed thresholding of frame energy is shown to provide performance gain. Gains obtained from classification and segmentation are discussed in Section 3.3.2.

3.2.3. Context and aggregation

The speech signal now has a time-frequency representation, with a distortion sample in each time-frequency bin. Each sample is labeled according to its frequency band index, time-segmentation type, and severity level. Contexts are created by combining label values. For instance, the above segmentation and classification creates 7 × 3 × 3 = 63 distinct values. The distortion samples that have the same composite-label value belong to the same context, which is named after the composite-label value. By associating a context with each distinct composite-label value, we form 63 distinct contexts. Each context contributes one feature variable to the candidate feature pool to be mined. The value of a feature is obtained via root-mean-square integration of the distortion samples in the context, normalized by the number of frames in the speech signal. Thus, each context establishes a specific class of distortion, and contributes to data mining a feature variable which captures the level of the distortion in that class. The feature variables are defined in Appendix A. As an example, the variable U B 2 0 captures the integrated distortion of the context: unvoiced frame, subband 2, and low severity (level 0). We assume that the lengths of the speech signals are no more than several seconds so that recency effects can be ignored. Recency effects can be accounted for by introducing forgetting factors.

3.2.4. Feature pool

Additional contexts are defined in order to create a “rich” pool of candidate features for mining. Besides labeling each frequency subband with its natural subband index, each subband is also labeled with the rank order obtained by ranking the 7 distortions in a frame in order of decreasing magnitude. Thus, a candidate feature has either a natural or an ordered subband index. Rank ordering the subband distortions as well as classifying frame-level distortions based on severity create contexts that capture distortions independent of specific time-frequency locations, but dependent on the absolute or relative level of distortion severity. This is hypothetically justifiable by the nature of the quality judgment process, and helps the data mining algorithm to pick out cognitively significant events.

Additional contexts are also created by omitting some labels such as the severity level. These contexts are the 7 subbands, in natural or ordered index, for each of the 3 time-segmented frame classes, without severity classification; altogether there are 7 × 3 = 21 such contexts (whose feature variables are listed in Appendix A as T B b and T O b). We also include weighted mean and root-mean distortions, probability of each frame type, and the lowest-frequency-band


and the highest-frequency-band energy of the clean speech frames, to produce a pool totaling 209 candidate features, as listed in Appendix A. The weighted mean of the 7 subband distortions is calculated using the weights [29]

\[
w_i =
\begin{cases}
1.0 & \text{for } 0 \le i \le 4, \\
0.8 & \text{for } i = 5, \\
0.4 & \text{for } i = 6.
\end{cases}
\tag{5}
\]

The pool of candidate features is redundant for the purpose of quality estimation. A brute force approach to finding the best subset of features to use would entail examining 2^209 − 1 possible subsets, a clearly impossible task. Yet the success of our approach crucially depends on finding a small subset of features that are good for quality estimation. We resort to data mining techniques to perform this task. The effectiveness of the techniques and the performance of their designs are assessed experimentally in Section 4.

3.3. Feature design and selection

In this section, we present some design justifications.

3.3.1. Number of subbands

We first experimented with using 22 subbands, with each band roughly three-quarters of a bark wide. Using CART for regression, we found that roughly one out of every three bands was selected. Therefore, we conjectured that we could group the distortions over 22 subbands into a smaller set of 7 subband distortions, to achieve a better tradeoff between retaining relevant spectral information and easy generalization. In a similar vein, reduced spectral resolution was found to improve the accuracy of speaker-independent speech recognition [30]. The 2.4-bark bandwidth in our frequency decomposition can also be compared with the 3–3.5 bark critical distance between vowel formant peaks [31].

3.3.2. Design of segmentation and severity classification

In this section, we show the improvements in speech quality estimation due to using segmentation and severity classification. Estimation performance is assessed using the correlation R and root-mean-square error (RMSE) ε between the subjective MOS x_i and objective MOS y_i. Pearson's formula gives

\[
R = \frac{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}
{\sqrt{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 \sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}},
\tag{6}
\]

where x̄ is the average of the x_i, and ȳ is the average of the y_i. RMSE is calculated using

\[
\varepsilon = \sqrt{\frac{\sum_{i=1}^{N}\left(x_i - y_i\right)^2}{N}}.
\tag{7}
\]

The performance results exhibited in Table 1 are based on designing a MARS model for a speech database. As we can see, time segmentation alone provides some improvement.

Table 1: Performance with different combinations of segmentation and severity classification.

                                           Correlation R   RMSE ε
No segmentation or classification          0.906           0.306
Segmentation only                          0.927           0.273
Severity classification only               0.896           0.339
Segmentation and severity classification   0.977           0.155

An interesting phenomenon is that distortion severity classification alone does not result in any improvement. However, a large improvement is obtained by combining segmentation and classification. We attribute this phenomenon to the different significance of a given distortion level across the three types of speech frames: inactive (background noise), voiced, and unvoiced. The signal contents of the three frame types are perceptually very distinct. We expect each type of content to condition the perception of distortion in a certain characteristic fashion. Separating the distortions according to the frame types allows the distortions to be weighed differently for each type.

For feature definition, we also compared (i) the number of distortion samples in a severity class, normalized by the number of frames in a speech file, with (ii) RMS integration of the distortion samples in a severity class. The latter was found to provide better performance.

3.3.3. Feature selection

In this section, we acquire a sense of the features selected by MARS by perturbing a MARS-designed model. The “Original” column in Table 2 lists the variables of the model being perturbed, in order of decreasing importance. Variable importance is determined by the amount of reduction in prediction error provided by the variable, relative to the greatest reduction amount achieved amongst all variables. Hence, the variable that results in the largest prediction error reduction has importance 100%, and its amount of error reduction is used as reference. The importance of other variables is calculated as the percentage of their prediction error reduction relative to the reference.

An inspection of the “Original” list naturally raises the question of why some of the variables are important. For example, I P VUV, the ratio of the number of inactive frames to the number of active frames, is rated most important. Moreover, low-rank subband distortion variables U O 4, U O 3, U O 5, and U O 6 1 are included in the model, and yet the high-rank subband distortions of the same unvoiced frame type are not included. To address these questions, we removed the above feature variables from the candidate pool and redesigned the model. The resultant features are listed in the “Modified” column of Table 2. We see from this list that I P, the fraction of inactive frames, is rated more important than before. I P and I P VUV provide different encodings of the same information, but I P and I P VUV are not linearly related. Also, in lieu of the omitted low-rank subband variables for the unvoiced frames, the high-rank subband variable U O 0 is brought into the modified model. Thus, we see


Table 2: Variable importance list for feature selection investigation. Original: list generated using the full feature pool. Modified: list generated after trimming the feature pool.

Rank   Original variable   Import.   Modified variable   Import.
1      I P VUV             100.000   V B 2               100.000
2      V B 5               68.859    I P                 68.556
3      V B 2               68.051    V B 2 2             58.165
4      V B 2 2             47.966    V O 0               49.106
5      U P VUV             47.214    REF 1               41.957
6      V O 0               42.583    I B 0               39.036
7      I B 0               42.382    V P                 37.517
8      REF 1               41.220    I B 2               36.124
9      I P                 41.014    V B 5               35.681
10     U O 4               36.489    V P VUV             35.255
11     V P VUV             34.706    I B 1 0             29.661
12     I WM 1              33.749    I WM 1              24.958
13     I B 1 0             33.033    V B 0 1             22.584
14     V B 3               32.877    V WM 0              15.747
15     U B 2               24.440    V B 3               15.665
16     V B 0 1             23.568    V RM 0              15.339
17     U O 3               21.882    V B 5 1             11.970
18     V O 4               18.121    U O 0               10.877
19     U O 5               15.959    I O 0               9.818
20     U O 6 1             14.487    —                   —

that both the information captured in a variable and the manner of encoding of the information in the variable affect its importance. A rich candidate pool should convey a variety of information as well as information encodings. MARS consistently picks out, from the available feature variables, the ones with the most relevant information and the best encoding. The original model, drawn from a richer pool, is preferred over the modified model. The original model provides a root-mean-square prediction error (RMSE) of 0.3902 and 0.3844 on the 90% training database and 10% test database, respectively. (Databases and performance assessment are discussed in the next section.) For the same databases, the modified model achieves RMSE of 0.3968 and 0.4318, respectively.

3.4. Complexity

The computational complexity of the algorithms designed using the proposed approach is mainly attributable to the auditory processing modules and to feature extraction processing in the cognitive module. While the design of the mapping from features to the MOS estimate is somewhat involved, the actual processing needed to realize the mapping once it is designed is simple. As the purpose of this paper is to study the application of data mining techniques to design speech quality measurement algorithms, we offer below a rough guide to the algorithm complexity. The actual complexity in specific applications will vary with the details of the features selected. Moreover, as with other measurement algorithms (see, for example, [32]), algorithm complexity may be reducible without seriously degrading the estimation accuracy. Such pursuit of complexity reduction is left to future study.

The complexity of auditory processing in the designed algorithms is no greater than that of the auditory processing component in PESQ. A somewhat lower complexity is obtained in our case by using fewer subbands. RMS integration of distortion samples to compute the values of the features employed in the data-mining-designed mapping has a roughly similar complexity to the Lp integrations performed in PESQ. Our use of squared integration throughout, as opposed to using several different values of p as in PESQ, lowers the integration complexity. Computation of the mapping function (see Appendix B for an example), done only once for the whole speech file, has relatively negligible complexity.

Severity classification also has negligible complexity. The segmentation functionalities, VAD and voicing decision, are commonly found in speech coders and other speech processing applications. We have used the VAD algorithm in ITU-T G.729B [25], omitting its comfort noise generation functionality. We estimate that the segmentation functionalities require no more than 20% of the processing time of the ITU-T G.729 speech codec. The processing time of PESQ is roughly 2.8 times that of G.729. We note that PESQ provides additional functionalities such as variable delay compensation. Hence, a speech quality estimator using an algorithm designed with the proposed approach, while providing a similar suite of functionalities as PESQ, would incur a 7% higher complexity than PESQ. As this is a conservative upper bound, we believe implementations with complexity lower than PESQ are readily achievable.

4. EXPERIMENT RESULTS

The effectiveness of the data mining approach is demonstrated experimentally with actual designs. We compare the performance of the algorithms designed using our method to the current state-of-the-art algorithm in voice quality estimation, PESQ. Below, we first introduce the speech databases used for the experiments. Then we compare the designs obtained using different data mining techniques, namely CART, hybrid CART-MARS, and MARS. We finally focus on the method that offers the best performance: MARS design using cross-validation. The greatest difference between our designed algorithms and PESQ is in the cognitive mapping part; thus, the comparisons below can be regarded as evaluating different cognitive mappings.

4.1. Speech databases

The speech databases used in our experiments are listed in Table 3. They include the 7 multilingual databases in ITU-T P-series Supplement 23 [33], two wireless databases (IS-96A and IS-127 EVRC), and a mixed wireline-wireless database [18]. We combine the 10 databases into a global database for algorithm design. There are altogether 1760 degraded-speech files in the global database.


Table 3: Properties of the speech databases used for experiments.

Database               Language   No. of files   Minimum MOS   Maximum MOS   Average MOS   MOS spread   MOS std. error
ITU-T Supp23 Exp1A     French     176            1.000         4.583         3.106         0.781        0.148
ITU-T Supp23 Exp1D     Japanese   176            1.000         4.208         3.666         0.701        0.158
ITU-T Supp23 Exp1O     English    176            1.208         4.542         3.050         0.822        0.155
ITU-T Supp23 Exp3A     French     200            1.292         4.833         3.226         0.732        0.152
ITU-T Supp23 Exp3C     Italian    200            1.083         4.833         2.950         0.896        0.152
ITU-T Supp23 Exp3D     Japanese   200            1.042         4.417         2.331         0.737        0.155
ITU-T Supp23 Exp3O     English    200            1.167         4.542         2.782         0.772        0.187
Wireless IS-127 EVRC   English    96             2.250         4.500         3.427         0.500        0.340
Wireless IS-96A        English    96             1.625         3.875         2.760         0.451        0.341
Mixed                  English    240            1.090         4.610         3.200         0.728        n.a.

The three Exp1x databases in ITU-T Supp23 contain speech coded using the G.729 codec, singly or in tandem with one or two other wireline or wireless standard codecs, under the clean channel condition. Also included is single-encoded speech using these standard codecs. The four Exp3x databases contain single- and multiple-encoded G.729 speech under various channel error conditions (BER 0%–10%; burst and random frame erasure 0%–5%) and input noise conditions (clean, street, vehicle, and hoth noises at 20 dB SNR).

The wireless IS-96A and IS-127 EVRC (Enhanced Variable Rate Codec) databases contain speech coded using the IS-96A and IS-127 codecs, respectively, under various clean and degraded channel conditions (forward FER 3%, reverse FER 3%), with or without the G.728 codec in tandem, and MNRU (modulated noise reference unit) conditions of 5–25 dB. The mixed database [18] contains speech coded with a variety of wireline and wireless codecs, under a wide range of degradation conditions: tandeming, channel errors (BER 1%–3%), and clipping (see [18] for more details). All databases include reference conditions such as speech degraded by various levels of MNRU.

The range of the MOSs in each database is determined by its mix of test conditions. The range is characterized in Table 3 by the maximum, minimum, average, and “spread,” which is the standard deviation of the MOSs around the average. The imprecision of the subjective MOS is characterized by its standard error (“MOS std. error” in Table 3), which is determined by the number of listeners who participated in the subjective test. The RMSE of the objective scores can be assessed no better than the standard error of the subjective scores used to benchmark the accuracy. Moreover, the measurement accuracy of algorithms trained using a database is also limited by the imprecision of its subjective scores. Note that “No. of files” in Table 3 refers to the number of speech files that are subjectively scored; the “clean original” speech files are not counted.

The designs presented in this paper are based on the above databases, which cover a range of waveform codecs, wireline and wireless LPAS [14] codecs, a range of codec tandeming and channel error conditions, and input background noise conditions. Additional impairments that can be found in telephone connections but are not currently covered by our databases include echo, variable delay, tones, distortions due to harmonic or sinusoidal coders and due to music and artificial speech, and so forth; the reader can also consult [5] for its list of transmission impairments. The proposed design method is highly automated and should scale well with the amount of database material available for design (see Section 4.6).

4.2. CART results

We first experimented with using CART for mining, motivated by the fact that CART results are easier to interpret than MARS results, and CART can be regarded as a special case of MARS. For CART mining, we randomly assigned 90% of the global database to a training data set and the rest to a test data set. The tree-growing phase uses the training set, and the tree-pruning phase uses the test set to select the best-size tree, that is, the one that gives the lowest regression error on the test data. The CART-designed tree has 38 leaf nodes. The performance scores are R = 0.8861 and ε = 0.3734 on the training set, and R = 0.7627 and ε = 0.5098 on the test set. The large difference in RMSE values between training and testing indicates that the designed CART tree does not generalize well. For PESQ, we use the PESQ-LQ mapping suggested in [34] to obtain R = 0.8170 and ε = 0.4705 on the global undivided set, R = 0.8198 and ε = 0.4700 on the training set, and R = 0.7939 and ε = 0.4744 on the test set. It appears that CART regression trees cannot outperform PESQ.

4.3. Hybrid CART-MARS results

By inspecting the variables mined using CART, we expect them to be perceptually important. The poor performance might be due more to the aforementioned limitations of CART in regression than to the feature selection. Thus, we experimented with using MARS to circumvent the limitations of CART. Below, we present the results from two hybrid CART-MARS schemes.

The first hybrid CART-MARS method uses CART to prescreen features from the feature candidate pool. The feature variables selected by CART are used as a smaller feature candidate pool for MARS model building. In this method, CART is used only during model design; the final model is constructed solely by MARS. The performance obtained, R = 0.8501 and ε = 0.4242 on the training set, and R = 0.8233 and ε = 0.4379 on the test set, is better than PESQ and CART regression.


Table 4: MARS model selection as a function of DS using 10-fold cross-validation.

                    Training                    Testing
DS    N     M       R        ε        %         R        ε        %
3     78    125     0.9261   0.3025   35.6      0.8403   0.4409   7.1
6     47    66      0.9055   0.3402   27.6      0.8527   0.4200   11.5
10    21    39      0.8880   0.3685   21.6      0.8550   0.4164   12.2
15    20    25      0.8756   0.3872   17.6      0.8530   0.4182   11.8
20    19    21      0.8707   0.3941   16.1      0.8546   0.4156   12.4
25    16    18      0.8652   0.4019   14.5      0.8502   0.4223   11.0

The second hybrid CART-MARS method is similar to the method used in [35]. In [35], the feature candidate pool for MARS mining is augmented by the “leaf-node index” obtained from a CART tree. We improve on the method by adding the CART regression output variable, instead of the node index variable, to the candidate feature pool. The augmented candidate pool is used for MARS model building. In this method, if the CART output were incorporated into the MARS model, feature extraction for the model would also include computation as prescribed by the CART tree. Indeed, an inspection of the variable importance list found that the CART tree output is the most important feature variable selected. The performance obtained, R = 0.9108 and ε = 0.3326 on the training set, and R = 0.8231 and ε = 0.4423 on the test set, is also better than PESQ and CART regression. The larger difference between training and testing RMSE in this “augmentation” method, in comparison with the earlier “prescreening” method, suggests that the “prescreening” method is more robust.

Although both hybrid CART-MARS methods outperform PESQ and CART, they are inferior to the MARS model of Section 3.3.3 on the test set. In the rest of this paper, we present detailed results based on using MARS alone, as MARS tends to offer the best performance. For the application in [35], a hybrid CART-MARS scheme provides better performance than CART or MARS alone. Thus, we should not eliminate the possibility of some hybrid schemes outperforming MARS-only schemes.

4.4. MARS model selection via cross-validation

Picking the size of the regression model is a crucial step in the design. The size of the model designed using MARS is a function of M, the number of basis functions in (3). For linear spline basis functions, two real parameters are associated with each function, the “knot” and the linear combination weight. (Please refer to the example in Appendix B.) Thus, the number of optimized parameters, 2M, is a useful measure of model size. A large model yields low regression error, but the model is highly biased towards the training data and exhibits large variance over unseen data. On the other hand, a small model might omit some important features necessary for high measurement accuracy. In Friedman's original MARS design [20], a penalty term controlled by a “degree of smoothness” (DS) parameter is used in the criterion function to penalize the increased variance due to large model size. A larger DS results in more basis functions being taken out during the pruning phase. Friedman's design method does not incorporate validation of the model through testing with data not used in model building. We improve on Friedman's design by using cross-validation to select the model size.

In conventional model design, available data are split into a training set and a test set. The model is built on the former and validated on the latter. However, when the amount of available data is small, as in our case, we ought to use all the data for model building. Using a small sample to design and validate can be achieved by n-fold cross-validation [36]. The results presented below are based on n = 10-fold cross-validation. The global database is randomly divided into 10 data sets of almost equal size. Training and testing are performed 10 times. Each time, one of the data sets serves as the test set, and the remaining 9 data sets combined serve as the training set. Each data set serves as a test set only once. For each training-test set combination, a series of MARS models corresponding to various DS values is constructed using the training set. The 10 R and ε values obtained for each DS value are averaged to obtain the cross-validation R and ε values; separate averages are obtained from the training and test sets. Finally, the DS value corresponding to the best cross-validation performance is used to build the desired MARS model using the entire global database.
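The 10-fold cross-validation loop for selecting the model-size parameter can be sketched as below; a scikit-learn tree regressor with a pruning parameter stands in for the MARS model and its DS penalty, and the data are synthetic placeholders.

```python
# A minimal sketch of 10-fold cross-validation for choosing a model-size
# (complexity) parameter; the regressor and data are stand-ins, not MARS.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.random((1760, 21))                          # stand-in features per file
y = 1 + 3 * X[:, 0] + rng.normal(0, 0.3, len(X))    # stand-in subjective MOS

def cv_rmse(param, n_splits=10):
    errs = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        model = DecisionTreeRegressor(ccp_alpha=param, random_state=0).fit(X[tr], y[tr])
        errs.append(np.sqrt(np.mean((model.predict(X[te]) - y[te]) ** 2)))
    return float(np.mean(errs))

grid = [0.0001, 0.001, 0.01, 0.05]                  # plays the role of the DS grid
best = min(grid, key=cv_rmse)
print("selected complexity parameter:", best)
# The final model is then rebuilt on the entire database with this setting.
```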

Table 4 shows the cross-validation performance results for a series of MARS models obtained using different values of DS. Both training and test results are shown, with N denoting the average number of distinct feature variables used in the cross-validation models, M the average number of basis functions, and % the average percentage reduction in ε compared to PESQ. From Table 4, we pick the best DS value for designing our final model. We see that for DS = 20, the RMSE reduction is the largest and the discrepancy between the training and test performance is the smallest. Thus, the final model is built using the global database, with DS = 20.

The resultant "global model" has N = 21 feature variables and M = 24 basis functions. The variables and their importance are listed in Table 5, and the MARS regression function and its basis functions are given in Appendix B. We see from Table 5 that the most important variables, and most of the variables, are related to voiced frames. The overall trend is that features from voiced frames are treated as more important than those from unvoiced and inactive frames. This is consistent with the fact that the great majority of active speech frames are voiced and that human perception is more sensitive to distortion of the spectral envelopes of voiced frames than unvoiced frames. The most important variable, V RM, the root-mean distortion of voiced frames, is akin to the logarithmic spectral distortion that speech spectral quantizers are generally designed to minimize [14].



Table 5: Variable importance ranking for the global model.

Rank   Variable    Importance
1      V RM        100.00
2      I P          76.280
3      V B 2 2      56.375
4      REF 1        50.151
5      V P          44.425
6      V O 0 2      41.897
7      V P VUV      40.472
8      I B 0        39.782
9      I B 1 0      37.531
10     V O 0        37.143
11     V B 2        36.379
12     V O 5        33.179
13     I WM 1       31.807
14     V P 2        25.562
15     U O 4        24.530
16     U O 5        22.092
17     V P 1        21.172
18     U B 4        19.639
19     V B 0 1      17.459
20     U O 6 1      17.048
21     I B 5        15.592

Table 6: MARS model performance on the 10 speech databases: variation over samples.

Database               Language    Correlation R          RMSE ε                Percentage
                                   Proposed    PESQ       Proposed    PESQ      reduction in ε (%)
ITU-T Supp23 Exp1A     French      0.8753      0.8498     0.3909      0.4507    13.3
ITU-T Supp23 Exp1D     Japanese    0.9141      0.8725     0.3988      0.5893    32.3
ITU-T Supp23 Exp1O     English     0.8998      0.9164     0.3581      0.3616     1.0
ITU-T Supp23 Exp3A     French      0.8480      0.8199     0.4327      0.5482    21.1
ITU-T Supp23 Exp3C     Italian     0.9099      0.8935     0.4048      0.4499    10.0
ITU-T Supp23 Exp3D     Japanese    0.8728      0.8965     0.4127      0.5366    23.1
ITU-T Supp23 Exp3O     English     0.8757      0.8857     0.3749      0.4222    11.2
Wireless EVRC          English     0.6364      0.5522     0.3952      0.4359     9.3
Wireless IS-96A        English     0.5786      0.4562     0.3845      0.4282    10.2
Mixed                  English     0.8771      0.8732     0.3496      0.4083    14.4
Average                —           —           —          —           —         14.6

V B 2 2 is the RMS distortion in subband 2 of the voiced frames that have the highest severity of frame distortion. Subband 2 covers the frequency region where the long-term power spectrum of speech peaks. V O 0 2 is the RMS distortion in the highest-distortion subband of the voiced frames that have the highest severity of frame distortion; in effect, V O 0 2 measures the intensity of peak distortions. The selection of V B 2 2 and V O 0 2 suggests that speech quality perception is strongly dependent on prominent spectral regions and distortion events. The variables I P, V P, and V P VUV, which measure the relative amount of specific frame types, and REF 1, which measures the level of high-frequency loudness in the reference signal, serve to adjust the regression mapping. For instance, in Appendix B, we see that the predicted quality value is raised when the fraction of inactive frames is above 0.27, and is decreased when the fraction drops below 0.27.

4.5. Database results

We apply the global model to the individual databases listed in Section 4.1. We report performance results in two formats: variation over samples (VOS) in Table 6, and variation over conditions (VOC) in Table 7. In VOS, the correlation and RMSE between the objective and subjective MOS of each sample are reported. A "sample" refers to a pair of speech files used for quality calculation: the speech file that was played to the listener panel, and the "clean" original version of that speech. For VOC, the subjective MOSs for the speech files within the same test condition are first averaged together.

The objective MOSs are likewise grouped and averaged. Then, R and ε are calculated between the per-condition averaged subjective and objective MOSs, over all conditions in the database. The VOS results better reflect performance in voice quality monitoring applications [3]. The VOC results are more appropriate for codec or transmission equipment evaluation. To the best of our knowledge, all the performance results that have been reported in the literature for PESQ by its inventors use the VOC format. The results for PESQ are based on using the PESQ-LQ 3rd-order regression polynomial specified in [34]. The results in Tables 6 and 7 show that the global model provides an average reduction in RMSE ε of 14.6% and 21.4%, for VOS and VOC averaging, respectively.

We adopt the simple model proposed in [7] to help us interpret the relationship between the R and ε values in Tables 6 and 7; the model is modified with the addition of a bias term. Accordingly, R and ε satisfy the following relationship:

\varepsilon^2 = \sigma^2\left(1 - R^2\right) + \sigma_{\mathrm{MOS}}^2 + b^2, \qquad (8)

where σ² and σ²_MOS are the "MOS spread" and "MOS std. error" in Table 3, respectively, and b is a systematic bias. The equation states that ε² is the sum of the unexplained variance in the estimation model, the MOS estimation error due to the limited number of listeners, and the bias error between subjective and objective MOSs. In comparing estimation algorithms on the same databases, σ²_MOS is an irreducible noise term affecting all the algorithms equally. Tables 6 and 7 show that PESQ produces large ε values on databases Exp1D, Exp3A, and Exp3D, even though R is quite high for databases Exp1D and Exp3D.
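As a quick numerical illustration of (8), with made-up values rather than figures from Table 3, a modest systematic bias inflates ε noticeably even when R is high:

```python
import math

# Hypothetical values for illustration only: sigma is the MOS spread of a database,
# R the correlation, sigma_mos the MOS standard error, and b a systematic bias.
sigma, R, sigma_mos, b = 0.8, 0.90, 0.10, 0.25
eps_with_bias = math.sqrt(sigma**2 * (1 - R**2) + sigma_mos**2 + b**2)
eps_no_bias = math.sqrt(sigma**2 * (1 - R**2) + sigma_mos**2)
print(f"epsilon with bias: {eps_with_bias:.2f}, without bias: {eps_no_bias:.2f}")
# -> roughly 0.44 versus 0.36: a per-database bias alone can account for a large epsilon.
```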



Table 7: MARS model performance on the 10 speech databases: variation over conditions.

Database               Language    Correlation R          RMSE ε                Percentage
                                   Proposed    PESQ       Proposed    PESQ      reduction in ε (%)
ITU-T Supp23 Exp1A     French      0.9381      0.9343     0.2769      0.3609    23.3
ITU-T Supp23 Exp1D     Japanese    0.9391      0.9539     0.2595      0.5136    49.5
ITU-T Supp23 Exp1O     English     0.9644      0.9566     0.2441      0.2705     9.8
ITU-T Supp23 Exp3A     French      0.9400      0.8776     0.3109      0.4743    34.5
ITU-T Supp23 Exp3C     Italian     0.9508      0.9455     0.3243      0.3441     5.8
ITU-T Supp23 Exp3D     Japanese    0.9455      0.9452     0.2888      0.4785    39.6
ITU-T Supp23 Exp3O     English     0.9459      0.9254     0.2551      0.3522    27.6
Wireless EVRC          English     0.8224      0.8116     0.2139      0.2176     1.3
Wireless IS-96A        English     0.6323      0.6203     0.2371      0.2250    −5.4
Mixed                  English     0.9364      0.9188     0.2438      0.3366    27.6
Average                —           —           —          —           —         21.4


Figure 8: Comparison of MARS model performance between training on the 7 ITU-T databases and on all 10 speech databases. (a) Correlation and (b) RMSE results are shown for variation over conditions.

According to (8), the large ε values can be due to bias errors, which we attribute to biases between individual databases and the global database. The MARS model is able to adjust for individual databases, thus reducing the bias component.

4.6. Scalability

It is highly desirable to be able to design models that can scale with the amount of data available for learning. Also, new forms of speech degradations arise as a result of new transmission environments, new speech codecs, and so forth. The data mining approach enables designing best-size models for a given amount of learning data, and adapting to new learning data. To demonstrate the scalability of the proposed method, we created a smaller global database comprising only the seven ITU-T databases. New MARS models with different DS values were designed using the new global database. In Figure 8, we compare the performance of the new model, with DS = 20, N = 13, and M = 15, to that of the larger global model designed earlier. The results are for VOC; the results for VOS are similar. One might expect the MARS model designed for the global database to be "diluted" and hence less effective than the new model designed for the seven ITU-T databases. However, we see that the two models provide about the same level of performance. In fact, it is somewhat surprising that the global model furnishes 14% lower RMSE than the more tuned seven-database MARS model. Thus, the proposed method appears to scale well with the amount of learning data, and suggests favorably the possibility of large-scale (semi-)automated, online model (re-)training.



5. CONCLUSION

We have proposed an approach to design objective speech quality measurement algorithms using statistical data mining methods. We have examined various methods of using CART and MARS to design novel objective speech quality measurement algorithms. The methods select feature variables from a large pool to form speech quality estimation models. We have obtained designs that outperform the state-of-the-art standard PESQ algorithm in our databases. The variables forming the models are found to be perceptually significant, and the methods offer some insights into the relative importance of the variables. The designed algorithms are computationally simple, making them suitable for real-time implementation. The best performing algorithm was designed using MARS.

We also showed that the proposed design method can scale with the amount of learning data. The experience learned from building training-based systems such as speech recognizers suggests that the performance of the algorithms designed using our approach can be substantially improved with large-scale training, offline or online. The algorithms also show promise for further optimization and complexity reduction. The design approach can be extended to other media modalities such as video.

APPENDICES

A. FEATURE VARIABLE DEFINITIONS

The feature variables are defined below. The first letter, denoted by T in a variable name, gives the frame type: T = I for Inactive, T = V for Voiced, and T = U for Unvoiced. The subband index is denoted by b, with b ∈ {0, . . . , 6} indexing from the lowest to the highest frequency band if the index is natural, or from the highest to the lowest distortion if the index is rank-ordered. The frame distortion severity class is denoted by d, with d ∈ {0, 1, 2} indexing from lowest to highest severity. With the above notations, the feature variables are as follows.

(i) T P d: fraction of T frames in severity class d frames.
(ii) T P: fraction of T frames in the speech file.
(iii) T P VUV: ratio of the number of T frames to the total number of active (V and U) speech frames.
(iv) T B b: distortion for subband b of T frames, without distortion severity classification; for example, I B 1 represents subband 1 distortion for inactive frames.
(v) T B b d: distortion for severity class d of subband b of T frames; for example, V B 3 2 represents distortion for subband 3, severity class 2, of voiced frames.
(vi) T O b: distortion for ordered subband b of T frames, without severity classification; for example, U O 3 represents ordered-subband 3 distortion for unvoiced frames, without distortion severity classification.
(vii) T O b d: distortion for distortion class d of ordered subband b of T frames; for example, U O 6 1 represents distortion for severity class 1 of ordered-subband 6 of unvoiced frames.
(viii) T WM d: weighted mean distortion for severity class d of T frames.
(ix) T WM: weighted mean distortion for T frames.
(x) T RM d: root-mean distortion for severity class d of T frames.
(xi) T RM: root-mean distortion for T frames.
(xii) REF 0: the loudness of the lower 3.5 subbands of the reference signal.
(xiii) REF 1: the loudness of the upper 3.5 subbands of the reference signal.

B. GLOBAL MARS MODEL

The basis functions BFn, where n is an integer, and the regression equation of the global model are listed below:

BF3 = max(0, I P − 0.270);
BF4 = max(0, 0.270 − I P);
BF6 = max(0, 33.581 − REF 1);
BF8 = max(0, 0.725 − V B 2);
BF10 = max(0, 0.131 − I B 0);
BF12 = max(0, 1.731 − V B 2 2);
BF13 = max(0, V P 2 − 0.710);
BF17 = max(0, I WM 1 − 0.177);
BF20 = max(0, 0.758 − V P VUV);
BF23 = max(0, V P − 0.422);
BF24 = max(0, 0.422 − V P);
BF25 = max(0, V O 0 − 2.284);
BF28 = max(0, 0.031 − U O 6 1);
BF30 = max(0, 0.134 − I B 5);
BF41 = max(0, V RM − 0.786);
BF42 = max(0, 0.786 − V RM);
BF44 = max(0, 0.070 − I B 1 0);
BF50 = max(0, 0.390 − U O 4);
BF52 = max(0, 1.657 − U B 4);
BF62 = max(0, 0.132 − U O 5);
BF68 = max(0, 0.331 − V P 1);
BF75 = max(0, V B 0 1 − 0.337061E-08);
BF154 = max(0, V O 0 2 − 2.036);
BF169 = max(0, V O 5 − 0.548);

Objective MOS = 2.534 + 6.738 ∗ BF3 − 1.833 ∗ BF4 − 0.040 ∗ BF6 − 1.331 ∗ BF8 − 2.616 ∗ BF10 + 0.600 ∗ BF12 + 1.981 ∗ BF13 + 4.820 ∗ BF17 + 3.847 ∗ BF20 + 3.481 ∗ BF23 − 6.184 ∗ BF24 − 0.629 ∗ BF25 − 5.552 ∗ BF28 + 2.977 ∗ BF30 − 1.296 ∗ BF41 + 2.655 ∗ BF42 − 3.328 ∗ BF44 + 1.833 ∗ BF50 − 0.320 ∗ BF52 − 4.596 ∗ BF62 − 1.257 ∗ BF68 − 0.476 ∗ BF75 + 0.577 ∗ BF154 + 1.585 ∗ BF169.
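For illustration only, the sketch below shows how such a model is evaluated; the feature keys are simply the Appendix A variable names written with underscores, and only four of the 24 basis functions are reproduced, the rest following the same hinge pattern with the coefficients listed above. This is not code from the paper.

```python
import numpy as np

def hinge(x):
    """Linear spline basis max(0, x), as used by MARS."""
    return np.maximum(0.0, x)

def global_mars_mos(f):
    """Partial evaluation of the global MARS model of Appendix B on a feature
    dictionary f (keys such as f["I_P"] and f["V_RM"] are assumed names).
    Only four of the 24 basis functions are shown; the omitted terms have the
    same max(0, feature - knot) form with the coefficients listed above."""
    bf3 = hinge(f["I_P"] - 0.270)
    bf4 = hinge(0.270 - f["I_P"])
    bf41 = hinge(f["V_RM"] - 0.786)
    bf42 = hinge(0.786 - f["V_RM"])
    partial = (2.534 + 6.738 * bf3 - 1.833 * bf4
               - 1.296 * bf41 + 2.655 * bf42)
    return partial  # plus the remaining 20 weighted basis functions
```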



ACKNOWLEDGMENTS

We thank Nortel Networks for their financial support. We also thank the reviewers for helping to substantially improve the presentation in this paper.

REFERENCES
[1] D. G. Jamieson, V. Parsa, M. Price, and J. Till, "Interaction of speech coders and atypical speech, II: effects on speech quality," Journal of Speech, Language, and Hearing Research, vol. 45, pp. 689–699, 2002.
[2] N. Kitawaki and H. Nagabuchi, "Quality assessment of speech coding and speech synthesis systems," IEEE Commun. Mag., vol. 26, no. 10, pp. 36–44, 1988.
[3] A. E. Conway, "A passive method for monitoring voice-over-IP call quality with ITU-T objective speech quality measurement methods," in Proc. IEEE International Conference on Communications (ICC '02), vol. 4, pp. 2583–2586, New York, NY, USA, April–May 2002.
[4] ITU-T Rec. P.800, "Methods for subjective determination of transmission quality," International Telecommunication Union, Geneva, Switzerland, August 1996.
[5] ITU-T Rec. P.862, "Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," International Telecommunication Union, Geneva, Switzerland, February 2001.
[6] ITU-T Rec. P.563, "Single ended method for objective speech quality assessment in narrow-band telephony applications," International Telecommunication Union, Geneva, Switzerland, May 2004.
[7] R. F. Kubichek, D. Atkinson, and A. Webster, "Advances in objective voice quality assessment," in Proc. IEEE Global Telecommunications Conference (GLOBECOM '91), vol. 3, pp. 1765–1770, Phoenix, Ariz, USA, December 1991.
[8] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models, Springer-Verlag, New York, NY, USA, 2nd edition, 1990.
[9] S. Voran, "Objective estimation of perceived speech quality. I. Development of the measuring normalizing block technique," IEEE Trans. Speech Audio Processing, vol. 7, no. 4, pp. 371–382, 1999.
[10] S. Voran, "Objective estimation of perceived speech quality. II. Evaluation of the measuring normalizing block technique," IEEE Trans. Speech Audio Processing, vol. 7, no. 4, pp. 383–390, 1999.
[11] S. Wang, A. Sekey, and A. Gersho, "An objective measure for predicting subjective quality of speech coders," IEEE J. Select. Areas Commun., vol. 10, no. 5, pp. 819–829, 1992.
[12] N. S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice-Hall, Englewood Cliffs, NJ, USA, 1984.
[13] J. E. Schroeder and R. F. Kubichek, "L1 and L2 normed cepstral distance controlled distortion performance," in Proc. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM '91), vol. 1, pp. 41–44, Victoria, BC, Canada, May 1991.
[14] W. B. Kleijn and K. K. Paliwal, Eds., Speech Coding and Synthesis, Elsevier Science, Amsterdam, The Netherlands, 1995.
[15] S. R. Quackenbush, T. P. Barnwell III, and M. A. Clements, Objective Measures of Speech Quality, Prentice-Hall, Englewood Cliffs, NJ, USA, 1988.
[16] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '01), vol. 2, pp. 749–752, Salt Lake City, Utah, USA, May 2001.
[17] M. P. Hollier, M. O. Hawksford, and D. R. Guard, "Error activity and error entropy as a measure of psychoacoustic significance in the perceptual domain," IEE Proceedings of Vision, Image and Signal Processing, vol. 141, no. 3, pp. 203–208, 1994.
[18] L. Thorpe and W. Yang, "Performance of current perceptual objective speech quality measures," in Proc. IEEE Workshop on Speech Coding, pp. 144–146, Porvoo, Finland, June 1999.
[19] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[20] J. H. Friedman, "Multivariate adaptive regression splines," The Annals of Statistics, vol. 19, no. 1, pp. 1–141, 1991.
[21] N. Suzuki, S. Kirihara, A. Ootaki, M. Kitajima, and S. Nakamura, "Statistical process analysis of medical incidents," Asian Journal on Quality, vol. 2, no. 2, pp. 127–135, 2001.
[22] K. O. Perlmutter, S. M. Perlmutter, R. M. Gray, R. A. Olshen, and K. L. Oehler, "Bayes risk vector quantization with posterior estimation for image compression and classification," IEEE Trans. Image Processing, vol. 5, no. 2, pp. 347–360, 1996.
[23] P. Sephton, "Forecasting recession: can we do better on MARS?" Federal Reserve Bank of St. Louis Review, vol. 83, no. 2, pp. 39–49, 2001.
[24] T. Ekman and G. Kubin, "Nonlinear prediction of mobile radio channels: measurements and MARS model designs," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '99), vol. 5, pp. 2667–2670, Phoenix, Ariz, USA, March 1999.
[25] ITU-T Rec. G.729 Annex B, "A silence compression scheme for G.729 optimized for terminals conforming to recommendation V.70," International Telecommunication Union, Geneva, Switzerland, November 1996.
[26] ETSI EN 301 708 V7.1.1, "Digital Cellular Telecommunications System (Phase 2+); Voice Activity Detector (VAD) for Adaptive Multi-Rate (AMR) Speech Traffic Channels," European Telecommunications Standards Institute, December 1999.
[27] R. F. Kubichek, E. A. Quincy, and K. L. Kiser, "Speech quality assessment using expert pattern recognition techniques," in Proc. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM '91), pp. 208–211, Victoria, BC, Canada, June 1989.
[28] S. Voran, "Advances in objective estimation of received speech quality," in Proc. IEEE Workshop on Speech Coding for Telecommunications, Porvoo, Finland, June 1999.
[29] K. K. Paliwal and B. S. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame," IEEE Trans. Speech and Audio Processing, vol. 1, no. 1, pp. 3–14, 1993.
[30] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[31] L. Chistovich and V. V. Lublinskaya, "The 'center of gravity' effect in vowel spectra and critical distance between the formants: psychoacoustical study of the perception of vowel-like stimuli," Hearing Research, vol. 1, no. 3, pp. 185–195, 1979.
[32] S. Voran, "A simplified version of the ITU algorithm for objective measurement of speech codec quality," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '98), vol. 1, pp. 537–540, Seattle, Wash, USA, May 1998.
[33] ITU-T Rec. P. Supplement 23, "ITU-T coded-speech database," International Telecommunication Union, Geneva, Switzerland, February 1998.
[34] A. W. Rix, "A new PESQ scale to assist comparison between P.862 PESQ score and subjective MOS," ITU-T SG12 COM 12-D86, May 2002.



[35] A. Abraham, "Analysis of hybrid soft and hard computing techniques for forex monitoring systems," in Proc. IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '02), vol. 2, pp. 1616–1622, Honolulu, Hawaii, USA, May 2002.
[36] M. Stone, "Cross-validation choice and assessment of statistical predictions," Journal of the Royal Statistical Society: Series B, vol. 36, pp. 111–147, 1974.

Wei Zha received his B.S. and M.S. degrees from Shanghai Jiao Tong University, Shanghai, China, both in electronics engineering. He worked in the Department of Electronics Engineering, Shanghai Jiao Tong University, Shanghai, China. He received his Ph.D. degree in electrical and computer engineering from Queen's University, Kingston, Ontario, Canada, in 2002. From 2002 to 2003, he worked on speech quality measurement at Queen's University. From 2003 till the end of 2004, he was with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, holding an NSERC Fellowship. Since January 2005, he has been with Schlumberger Houston Tech Center.

Wai-Yip Chan (usually known as Geoffrey Chan) received his B.Eng. and M.Eng. degrees from Carleton University, Ottawa, Canada, and his Ph.D. degree from the University of California at Santa Barbara, all in electrical engineering. He is currently an Associate Professor of electrical and computer engineering at Queen's University, Kingston, Canada. Previously, he was on the faculty of Illinois Institute of Technology, Chicago, and McGill University, Montreal. He also worked at the Communications Research Centre and Bell Northern Research (now Nortel Networks), Ottawa, where he acquired industrial experience ranging from embedded DSP algorithms to VLSI circuit design for speech processing. His current research interests are in the area of multimedia signal compression and communications. He served as a Technical Program Cochair of the 2000 IEEE Workshop on Speech Coding, and received a CAREER award from the US National Science Foundation.


EURASIP Journal on Applied Signal Processing 2005:9, 1425–1434
© 2005 Hindawi Publishing Corporation

Fourier-Lapped Multilayer Perceptron Method for Speech Quality Assessment

Moises Vidal Ribeiro
Departamento de Comunicacoes (DECOM), Faculdade de Engenharia Eletrica e de Computacao (FEEC), Universidade Estadual de Campinas (UNICAMP), Caixa Postal 6101, 13083-852 Campinas SP, Brazil
Email: [email protected]

Jayme Garcia Arnal Barbedo
Departamento de Comunicacoes (DECOM), Faculdade de Engenharia Eletrica e de Computacao (FEEC), Universidade Estadual de Campinas (UNICAMP), Caixa Postal 6101, 13083-852 Campinas SP, Brazil
Email: [email protected]

Joao Marcos Travassos Romano
Departamento de Comunicacoes (DECOM), Faculdade de Engenharia Eletrica e de Computacao (FEEC), Universidade Estadual de Campinas (UNICAMP), Caixa Postal 6101, 13083-852 Campinas SP, Brazil
Email: [email protected]

Amauri Lopes
Departamento de Comunicacoes (DECOM), Faculdade de Engenharia Eletrica e de Computacao (FEEC), Universidade Estadual de Campinas (UNICAMP), Caixa Postal 6101, 13083-852 Campinas SP, Brazil
Email: [email protected]

Received 1 November 2003; Revised 31 August 2004

The paper introduces a new objective method for speech quality assessment called Fourier-lapped multilayer perceptron (FLMLP). This method uses an overcomplete transform based on the discrete Fourier transform (DFT) and the modulated lapped transform (MLT). This transform generates the DFT and MLT speech spectral domains, from which several relevant perceptual parameters are extracted. The proposed method also employs a multilayer perceptron neural network trained by a modified version of the scaled conjugate gradient method. This neural network maps the perceptual parameters into a subjective score. The numerical results show that FLMLP is an effective alternative to previous methods. As a result, it is worth stating that the techniques described here may be useful to other researchers facing the same kind of problem.

Keywords and phrases: fast Fourier transform, modulated lapped transform, neural network, objective speech quality assessment, perceptual feature, scaled conjugate gradient optimization method.

1. INTRODUCTION

The continuous search for efficient and reliable speech transmission through communication channels has produced a great number of speech devices (especially codecs), which often include highly sophisticated features, making their quality assessment a tricky task.

For many years, the assessment of speech devices has been mostly carried out using subjective tests, in which human listeners perform the evaluation. This kind of test, although very accurate, is quite expensive and time-consuming. Such a situation has motivated the search for objective methods able to suitably replace the subjective tests.

Several objective methods have been proposed so far [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Among them, PESQ (perceptual evaluation of speech quality) [7], which is currently adopted as a standard by the International Telecommunication Union (ITU), aggregates some of the best features of its predecessors. The Fourier-lapped multilayer perceptron (FLMLP) method proposed here, on the other hand, assembles the best features of MOQV (objective measure for speech quality) [8] and MOQV-KSOM (MOQV using Kohonen self-organizing maps) [9, 10] together with two new techniques.

(a) An overcomplete transform [11, 12, 13] based on the discrete Fourier transform (DFT) [14, 15] and the modulated lapped transform (MLT) [16] to generate a redundant spectral representation of speech signals, from which various pertinent perceptual parameters are extracted. The discussion of this kind of transform is taken up again in Section 5.

[Figure 1: Basic scheme of objective methods for speech quality assessment: speech source → system under test → objective quality measure → mapping from the objective to the subjective scale.]

(b) A multilayer perceptron neural network (MLPNN) [17] to implement a nonlinear multidimensional mapping between the perceptual parameters and the subjective score. The MLPNN is trained by a second-order optimization method, a modified version of the scaled conjugate gradient (SCG) method [18]. The motivations for using this modified SCG are the following: (i) the modified version of the SCG method is one of the most powerful second-order optimization techniques for searching a multidimensional surface; (ii) the use of the differential operator defined in [19] in the modified SCG formulation [18, 19] provides a fast and exact implementation. In fact, as will be briefly highlighted in Section 6, the explicit evaluation of the Hessian matrix is not needed when the differential operator is used. As a result, the computational complexity of the training procedure is reduced from O(N²) to O(N), where N is the total number of MLPNN weights. Therefore, the training procedure can be implemented for periodic online updating of the MLPNN weights. The FLMLP has been assessed using the S-23 ITU-T database [20]. The computational results show that the FLMLP outperforms PESQ, MOQV, and MOQV-KSOM for the set of speech signals used in the tests.

The paper is organized as follows. Section 2 presents a brief discussion of earlier objective assessment methods. Section 3 presents a general description of the FLMLP. Section 4 details the most important steps of the FLMLP algorithm. Section 5 presents the basic theory underlying overcomplete transforms. Section 6 presents the mathematical formulation of the multilayer perceptron neural network (MLPNN). Section 7 reports some results attained by the FLMLP. Finally, Section 8 states some concluding remarks.

2. EARLIER OBJECTIVE SPEECH QUALITY ASSESSMENT METHODS

Most of the objective quality assessment methods developed in the last decade have been based on psychoacoustic modeling of the human ear. Figure 1 shows the basic scheme followed by such methods.

The processing denoted by the last block in Figure 1 is not always included in the method itself. Sometimes, the mapping is carried out as an independent procedure, as in PSQM [4].

In the following, some of the most important objective methods for speech quality assessment are briefly described.

(i) MNB (measuring normalizing blocks) [1]. MNB uses a very simple hearing model; only a psychoacoustic frequency scale and a model for nonlinear loudness behaviour are included. On the other hand, it uses a sophisticated judgement model. The technique consists in measuring and removing spectral deviations at multiple scales using the so-called time and frequency measuring normalizing blocks. The behaviour of listeners is modeled by successive combinations of such blocks.

(ii) PAMS (perceptual analysis measurement system) [2]. PAMS uses an auditory model that combines a mathematical description of the psychophysical properties of human hearing with a technique that performs a perceptually relevant analysis taking into account the subjectivity of the errors in the degraded signal. It was the first method capable of aligning signals with variable delay. Some PAMS techniques were included in PESQ [7].

(iii) TOSQA (telecommunication objective speech quality assessment) [3]. The speech quality calculated in TOSQA is based on a similarity measurement between the reference and degraded signals. The procedure is based on modified short-term loudness spectra, where the influence of signal parts with low loudness is reduced. The program is able to take into account quality effects such as background noise, frequency response, and nonlinearity of the system under test.

(iv) PSQM (perceptual speech quality measure) [4]. It is the former ITU standard for objective speech assessment [5]. PSQM converts the physical domain into a perceptually meaningful psychoacoustic domain through a series of nonlinear processing steps (time-frequency mapping, frequency warping, intensity warping, loudness scaling, etc.). After such transformation, the original and degraded signals are compared, and a measure of the signal quality is extracted. A slightly modified version of PSQM, PSQM+ [6], was later released in order to improve the performance for signals with loud distortions and/or temporal clipping.

(v) PESQ (perceptual evaluation of speech quality) [7]. This method is the ITU's current standard. It combines the best features of PSQM and PAMS, with an improved psychoacoustical model of human hearing. PESQ takes into account a wide range of conditions, like coding distortions, errors, packet loss, delay and variable delay, and filtering in analogue networks.

(vi) MOQV (objective measure for speech quality) [8]. The psychoacoustical model of MOQV was inspired by the one used in PSQM. Its novel features include some additional processing in the cognitive model and a polynomial mapping strategy between objective and subjective scores. Later, the polynomial mapping was replaced by Kohonen self-organizing maps, originating the so-called MOQV-KSOM [9, 10].

[Figure 2: Scheme of the FLMLP method: original and degraded signals → preprocessing → overcomplete transform → perceptual measure → cognitive processing → nonlinear mapping → estimated subjective value.]

[Figure 3: Scheme for time-frequency decomposition and mapping into subbands: signals → division into frames → overcomplete transform → spectral energies → grouping into subbands → energy-level adjustment → rest of routine, with the difference between short-term energies extracted along the way.]

3. GENERAL DESCRIPTION OF FLMLP

The basic scheme of FLMLP is illustrated in Figure 2. Each block of this scheme is described in the following.

(1) Preprocessing: defines the beginning and the end of the speech signals, performs a time alignment between the original and degraded signals, and adjusts their energy level.

(2) Overcomplete transformation: divides both signals into frames and computes the proposed overcomplete transform, which is, basically, made up of a number of basis vectors greater than the dimensionality of the analysed signal.

(3) Perceptual measure: extracts 10 perceptual parameters from the DFT and MLT spectral domains. These parameters are

(i) the difference between the short-term energies of the reference and degraded signals; these parameters are obtained after dividing both signals into frames and mapping the frequency components of such frames into subbands [8];

(ii) the perceptual spectral distance (PSD) [14], given by

\mathrm{PSD} = \sqrt{\sum_{b=1}^{B}\bigl[L_x(b) - L_y(b)\bigr]^2}, \qquad (1)

where L_x and L_y represent the perceptual spectral density functions of the original and degraded signals at each subband, respectively, and B is the number of subbands;

(iii) the perceptual cepstral distance (PCD) [14], given by (2); it is a modified version of the PSD:

\mathrm{PCD} = 10\sqrt{\sum_{b=1}^{B}\Bigl(\log_{10}\bigl[L_x(b)\bigr] - \log_{10}\bigl[L_y(b)\bigr]\Bigr)^2}; \qquad (2)

(iv) the MOQV1 and MOQV2 measures [8], which are similar to those of PSQM [4] and PSQM+ [6], respectively.

(4) Nonlinear mapping: applies the MLPNN, trained by a modified version of the SCG method, to perform a mapping from the perceptual parameters to the target speech quality measure.

(5) Estimated subjective value: stores the estimated subjective quality.

4. DETAILS OF FLMLP

The techniques summarized in this section are inspired by those used in other methods, particularly the PSQM [4, 5].

4.1. Preprocessing

The detection of the effective beginning and end of the original and degraded signals is performed by procedures standardized by Recommendation P.861 [5]. The samples outside the actual active speech interval are discarded.

FLMLP processing can be applied only to time-aligned signals. If the shift between them is not known, a temporal alignment is performed using cross-correlation implemented through an FFT algorithm. The index of the maximum cross-correlation value represents the shift between both signals, and the alignment is automatically performed.

The energy level of the degraded signal is adjusted by multiplying this signal by the square root of the ratio between the average energies of the original and degraded signals [5].
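A minimal sketch of these two preprocessing steps (not the standardized P.861 procedure itself) is given below; it assumes the signals have already been trimmed to the active speech interval and uses a simple circular shift for the alignment.

```python
import numpy as np

def align_and_level(ref, deg):
    """Delay estimation via FFT cross-correlation, followed by the RMS level
    adjustment described above.  A sketch only: real systems trim the signals
    to the active interval first and align more carefully than a circular shift."""
    n = len(ref) + len(deg) - 1
    nfft = 1 << (n - 1).bit_length()                       # next power of two
    xcorr = np.fft.irfft(np.fft.rfft(ref, nfft) * np.conj(np.fft.rfft(deg, nfft)), nfft)
    lag = int(np.argmax(xcorr))
    if lag > nfft // 2:                                    # negative lags wrap around
        lag -= nfft
    deg = np.roll(deg, lag)                                # crude alignment by the estimated shift
    gain = np.sqrt(np.sum(ref ** 2) / np.sum(deg ** 2))    # square root of the energy ratio
    return ref, deg * gain, lag
```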

4.2. Time-frequency decomposition and mapping into subbands

Figure 3 shows the procedures used in this stage. In the first block, a Hanning windowing divides the preprocessed signals into frames of 256 or 512 samples, for sampling frequencies of 8 kHz or 16 kHz, respectively. There is an overlap of 50% between consecutive frames. After that, the overcomplete transform (which is detailed in Section 5) is evaluated for each frame, and the energy spectral density (ESD) of the MLT and DFT domains is determined.
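The framing and the DFT branch of this stage can be sketched as follows; the MLT branch of the overcomplete transform is omitted here.

```python
import numpy as np

def framed_esd(signal, frame_len=256):
    """Split a (preprocessed) signal into 50%-overlapping Hanning-windowed frames
    and return the DFT energy spectral density of each frame.
    frame_len = 256 for 8 kHz input, 512 for 16 kHz, as described above.
    Assumes the signal is at least one frame long."""
    hop = frame_len // 2
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2   # per-frame ESD (DFT branch only)
```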



The frequency lines of the resulting ESDs are equally spaced on a linear spectral scale. However, the spectral resolution of human hearing is not linear. According to the definition of critical bands, the spectral resolution drops as the frequency increases. In response to this fact, the frequency lines of each ESD are grouped into 56 subbands [5]. The width of each subband increases as the central frequency increases. The perceptual parameter "difference between the short-term energies" is extracted at this point.

The last task in this stage of the processing is another adjustment performed in the DFT and MLT subband domains, aiming to equalize the respective energies of the degraded and original signals. The procedure is applied only to the degraded signal, according to

E_y(n, k) = \frac{\sum_{n=1}^{B} S_x(n, k)}{\sum_{n=1}^{B} S_y(n, k)} \cdot S_y(n, k), \qquad (3)

where n and k are the indexes of the samples in the time and frequency domains, respectively, and S_x(n, k) and S_y(n, k) are, respectively, the ESDs of the original and degraded signals after the grouping into subbands.

4.3. Perceptual measure

The main objective of this stage is to simulate both the transmission of the sound from the outer to the inner ear and the subjective loudness generation.

The subband spectral components are compressed using the nonlinear compression function

L[k] = \left(\frac{S_0(k)}{0.5}\right)^{0.23} \cdot \left[\left(0.5 + 0.5\,\frac{E(n, k)}{S_0(k)}\right)^{0.23} - 1\right], \qquad (4)

proposed by Zwicker [21], where S_0(k) is the absolute hearing threshold [22] given by

S_0(k) = 3.64 \cdot f^{-0.8} - 6.5 \cdot e^{-0.6\,(f - 3.3)^2} + 10^{-3} \cdot f^4, \qquad (5)

where f is the frequency given in kHz. This is the point at which the PSD and PCD parameters are extracted.
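A direct transcription of (4) and (5) is sketched below; the inputs are the subband energies E(n, k) and the subband centre frequencies in kHz, following the definitions above, and in practice negative loudness values would be clipped to zero.

```python
import numpy as np

def hearing_threshold(f_khz):
    """Absolute hearing threshold S0(f) of (5), with f in kHz."""
    f = np.asarray(f_khz, dtype=float)
    return 3.64 * f ** -0.8 - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4

def zwicker_loudness(E, f_khz):
    """Nonlinear loudness compression of (4), applied to an array of subband
    energies E with one threshold value per subband centre frequency.
    A sketch that mirrors the formula above; scaling follows the text."""
    E = np.asarray(E, dtype=float)
    S0 = hearing_threshold(f_khz)
    return (S0 / 0.5) ** 0.23 * ((0.5 + 0.5 * E / S0) ** 0.23 - 1.0)
```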

4.4. Cognitive modeling

This stage aims to model the speech signal processing at the brain cortex level. The cognitive modeling adopted here is divided into two major blocks, the so-called cognitive processing and cognitive combination, which are described next.

Cognitive processing

This step is composed of some procedures that include the calculation of the difference signal between the patterns resulting from the perceptual measure stage, the calculation of asymmetry factors, and the weighting of silent intervals [5].

The difference signal is simply the absolute value of the difference between the degraded and original signals. In the calculation of the energy of the difference signal for each frame n, possible asymmetries between the signals must be taken into account. The asymmetry is defined as the difference in degradation perceived by listeners when the system under test has the two main characteristics: (a) it introduces strange components, producing a major impact, and (b) it suppresses components, causing a minor impact. In order to take into account the asymmetry of the degradation impressions, an asymmetry factor is calculated according to

A(n, k) = \left(\frac{E_y(n, k) + 1}{E_x(n, k) + 1}\right)^{0.2}. \qquad (6)

A(n, k) is used as a weighting factor in the calculation of the frame energies:

F(n) = \sum_{k=1}^{56} N(n, k) \cdot A(n, k) \cdot \Delta c, \qquad (7)

where N(n, k) is the difference signal and Δc is the width of a subband related to the critical band (in this case, Δc = 0.312).
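The asymmetry weighting of (6) and (7) can be sketched as follows, operating on (frames × subbands) arrays of loudness and energy patterns; this covers only the cognitive-processing step before the silent-frame weighting.

```python
import numpy as np

def frame_distortions(Lx, Ly, Ex, Ey, delta_c=0.312):
    """Per-frame distortion F(n) of (7): the absolute loudness difference
    N(n, k) = |Ly - Lx| weighted by the asymmetry factor A(n, k) of (6).
    Lx, Ly, Ex, Ey are (frames x subbands) arrays of perceptual patterns."""
    N = np.abs(np.asarray(Ly) - np.asarray(Lx))          # difference signal
    A = ((np.asarray(Ey) + 1.0) / (np.asarray(Ex) + 1.0)) ** 0.2   # asymmetry factor, eq. (6)
    return np.sum(N * A * delta_c, axis=1)               # eq. (7), summed over the 56 subbands
```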

After that, silent frames are identified and properly weighted in order to reduce their influence on the final score. These procedures result in the last parameters, the MOQV1 and MOQV2 measures [8], which, like the other ones, are extracted from the patterns resulting from the FFT and MLT time-frequency decompositions.

Cognitive combination

This step consists in using an artificial neural network to model the way a listener combines different features into a single impression for the quality evaluation of a given signal. Obviously, the processing performed by the brain is much more complex than that performed by an artificial neural network. However, this approach is often enough to solve some of the problems involved in modeling human behaviour. Section 6 details some aspects of the neural network used here.

5. OVERCOMPLETE TRANSFORM BASED ON DFT AND MLT

Regarding the estimation of the subjective quality of speech signals, it has been observed that few representative perceptual features of speech signals are obtained from the DFT domain. As a result, the mapping technique sometimes attains low performance. This drawback seems to be due to the following problems: (i) two contradictory subjective measures can produce two perceptual feature vectors that are very close to each other; (ii) two very close subjective measures can be associated with two very distant feature vectors. The distance measure considered here is the Euclidean norm.

To overcome or diminish the occurrence of both problems, it is proposed to use an overcomplete basis for the extraction of more representative perceptual features from the speech signals. For simplicity, it is stated that the so-called overcomplete bases or frames [11, 12, 13] are typically constructed by merging a set of complete bases, such as Fourier, wavelet, and so forth, or by adding basis functions to a complete basis. Although not unique, overcomplete bases can offer some advantages, such as [12] (a) great flexibility to capture relevant information from the analyzed signal, due to the use of a large set of specialized basis functions, and (b) enhanced stability of the representation in response to small perturbations.

Based upon the knowledge about the use of the DFT [14] for perceptual feature extraction, an overcomplete basis made up of basis functions from the DFT and the MLT [16] is presented.

The transposes of the analysis and synthesis transforms are expressed by

\mathbf{T}_a^T = \begin{bmatrix} \mathbf{Q}_a^T \\ \mathbf{P}_a^T \end{bmatrix} = \begin{bmatrix} \mathbf{0}^T & \mathbf{D}_N^T \\ \mathbf{P}_{a,0}^T & \mathbf{P}_{a,1}^T \end{bmatrix},
\qquad
\mathbf{T}_s^T = \begin{bmatrix} \mathbf{Q}_s^T \\ \mathbf{P}_s^T \end{bmatrix} = \begin{bmatrix} \mathbf{0}^T & \bigl(\mathbf{D}_N^{-1}\bigr)^T \\ \mathbf{P}_{s,0}^T & \mathbf{P}_{s,1}^T \end{bmatrix},
\qquad (8)

respectively. Note that P_s = P_a^T. As a result, the coefficients in the overcomplete domain are represented by

\begin{bmatrix} X[0] \\ \vdots \\ X[N-1] \\ X[N] \\ \vdots \\ X[2N-1] \end{bmatrix}
= \begin{bmatrix} \mathbf{0}^T & \mathbf{D}_N^T \\ \mathbf{P}_{a,0}^T & \mathbf{P}_{a,1}^T \end{bmatrix}
\begin{bmatrix} x_w(0) \\ \vdots \\ x_w(N-1) \\ x_w(N) \\ \vdots \\ x_w(2N-1) \end{bmatrix}, \qquad (9)

where x_w = [x_w(0) · · · x_w(2N − 1)]^T is the input vector formed by cascading the previous and current frames, which were previously submitted to a Hanning window with an overlap of 50%, and X = [X[0] · · · X[2N − 1]]^T contains the coefficients in the overcomplete domain. Note that the former N coefficients are the DFT coefficients, while the latter are the MLT ones; 0 is an N × N matrix of zeros; D_N is an N × N Vandermonde matrix whose columns are the DFT basis vectors; P_a is a 2N × N orthonormal matrix whose columns are the MLT basis vectors. For the proposed overcomplete transform T_a and its inverse T_s, the following relations can be expressed:

\bigl\langle \varphi_k(n), \varphi_l(n) \bigr\rangle = \delta(k - l), \quad k, l = 0, \ldots, N-1,
\bigl\langle \psi_k(n), \psi_l(n) \bigr\rangle = \delta(k - l), \quad k, l = 0, \ldots, N-1,
\bigl\langle \varphi_k(n), \psi_l(n) \bigr\rangle = \delta(k - l), \quad k, l = 0, \ldots, N-1,
\qquad (10)

where \{\varphi_k(n)\}_{k=0,\ldots,N-1} and \{\psi_l(n)\}_{l=0,\ldots,N-1} are the basis functions of Q_a and P_a, respectively.

It is worth stating that the use of the MLT along with the DFT was decided upon because both transforms provide different spectral representations of the analyzed signal. This is a remarkable consideration, because the DFT-based procedure for perceptual feature extraction, applied so far, can be straightforwardly used in the MLT domain. As a result, all the theoretical justification for the DFT-based perceptual feature extraction applies equally to the MLT-based procedure. Another advantage of the MLT is the availability of a fast algorithm for its implementation [16].
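A small sketch of the analysis step in (9) is given below. The DFT block is applied to the current N samples, as the [0 D_N] block row suggests, and the MLT is written in one common convention (an MDCT basis with a sine window), whose normalisation may differ from the authors' exact definition.

```python
import numpy as np

def overcomplete_dft_mlt(x_w):
    """Overcomplete analysis of a 2N-sample windowed frame x_w: N DFT coefficients
    of the current N samples, concatenated with N MLT coefficients of the whole
    frame.  A sketch under the stated assumptions, not the paper's exact matrices."""
    x_w = np.asarray(x_w, dtype=float)
    N = len(x_w) // 2
    dft_part = np.fft.fft(x_w[N:])                      # D_N acting on the current N samples
    n = np.arange(2 * N)
    k = np.arange(N)
    h = np.sin((n + 0.5) * np.pi / (2 * N))             # sine (MLT) window
    P = np.sqrt(2.0 / N) * h[:, None] * np.cos(
        np.pi / N * (n[:, None] + (N + 1) / 2) * (k[None, :] + 0.5))
    mlt_part = P.T @ x_w                                # P_a^T x_w
    return np.concatenate([dft_part, mlt_part])         # 2N overcomplete coefficients
```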

6. THE MLPNN TRAINED BY THE MODIFIED VERSION OF THE SCG METHOD

The search for a good mapping technique lies in the choice of an appropriate technique with generalization properties, a suitable minimization criterion, and an efficient and low-complexity training procedure. Among many mapping techniques available, the MLPNN trained by a second-order optimization technique was chosen to perform the last task of the FLMLP method. The following two reasons support such choice.

First, little knowledge has been acquired about the cognitive mechanism of the human brain. Therefore, it is quite difficult to develop a suitable model for the signal processing in the brain cortex. As a consequence, the search for newer solutions is an open research field.

Second, but not least, the nature of the subjective analysis of speech signals is highly fuzzy. As a result, a fuzzy system should be appropriate to solve this problem. However, the equivalence between feedforward neural networks, like the MLPNN, and fuzzy logic systems [23, 24] is well known. Moreover, due to the characteristics of the posed problem, the use of a regular network [24, 25] is recommended to solve the problem associated with the assessment of speech quality when a reduced and representative set of perceptual features is available.

In this regard, it is well established that the state-space formulation of an MLPNN with one hidden layer is given by [17]

\mathbf{z}(n) = \mathbf{A}^T(n) \begin{bmatrix} \mathbf{x}(n) \\ 1 \end{bmatrix},
\quad
\mathbf{u}(n) = f\bigl(\mathbf{z}(n)\bigr) = \bigl[f\bigl(z_0(n)\bigr) \cdots f\bigl(z_{I-1}(n)\bigr)\bigr]^T,
\quad
y(n) = \mathbf{b}^T(n) \begin{bmatrix} \mathbf{u}(n) \\ 1 \end{bmatrix},
\quad
f\bigl(z_i(n)\bigr) = \tanh\bigl(z_i(n)\bigr), \; i = 1, \ldots, I,
\qquad (11)

where x(n) = [x(n) · · · x(n − K + 1) 1]^T is the (K+1) × 1 input vector, which is constituted by elements of the perceptual feature vector and the bias of the MLPNN; z(n) = [z_0(n) · · · z_{I−1}(n)]^T is the neuron output vector in the hidden layer; I is the number of neurons in the hidden layer; y(n) is the MLPNN output; A(n) is the (K+1) × I matrix of weights between the input and the hidden layers; and b(n) is the (I+1) × 1 vector of weights between the hidden and the output layers.
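A minimal numpy sketch of the forward pass in (11) is shown below; the bias elements are appended explicitly here, so the shapes match the conventions above.

```python
import numpy as np

def mlp_forward(features, A, b):
    """Forward pass of the one-hidden-layer MLPNN in (11).  `features` holds the
    perceptual parameters, A is the input-to-hidden weight matrix and b the
    hidden-to-output weight vector; the trailing 1s are the bias terms."""
    x = np.concatenate([features, [1.0]])   # input vector with bias element
    z = A.T @ x                             # hidden pre-activations
    u = np.tanh(z)                          # f(z) = tanh(z)
    y = b @ np.concatenate([u, [1.0]])      # scalar output, with hidden-layer bias
    return y
```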

Let a(n) be a column vector formed by the columns of the matrix A(n). Then, the vector w(n) containing all weights of the MLPNN, the total error measure E_T(w(n)) for a set of training data, and its corresponding gradient vector ∇E_T(w(n)) are given by

\mathbf{w}(n) = \bigl[\mathbf{a}^T(n) \;\; \mathbf{b}^T(n)\bigr]^T, \qquad (12)

E_T\bigl(\mathbf{w}(n)\bigr) = \sum_n e(n) = \sum_n \tfrac{1}{2}\bigl(y(n) - y_d(n)\bigr)^2, \qquad (13)

\nabla E_T(n) = \nabla E_T\bigl(\mathbf{w}(n)\bigr) = \bigl[\nabla E_a^T(n) \;\; \nabla E_b^T(n)\bigr]^T, \qquad (14)

respectively. Here, y_d(n) is the desired output, e(n) is the output error, and ∇E_a(n) and ∇E_b(n) are the gradients of the error measure with respect to a(n) and b(n), respectively. From the definition of the error measure in (13), it can be seen that the MLPNN tries to make its output as close as possible to the subjective measure y_d(n) in a least-squares sense. Note that

\nabla E_A(n) = \frac{\partial e(n)}{\partial \mathbf{A}(n)} =
\begin{bmatrix}
\dfrac{\partial e(n)}{\partial a_{1,1}(n)} & \cdots & \dfrac{\partial e(n)}{\partial a_{1,I}(n)} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial e(n)}{\partial a_{(K+1),1}(n)} & \cdots & \dfrac{\partial e(n)}{\partial a_{(K+1),I}(n)}
\end{bmatrix},
\qquad
\nabla E_A(n) = \begin{bmatrix} \mathbf{x}(n) \\ 1 \end{bmatrix} \frac{\partial e(n)^T}{\partial \mathbf{z}(n)},
\qquad
\frac{\partial e(n)}{\partial \mathbf{z}(n)} = \left[\frac{\partial e(n)}{\partial z_1(n)} \cdots \frac{\partial e(n)}{\partial z_I(n)}\right]^T,
\qquad
\frac{\partial \mathbf{f}(n)}{\partial \mathbf{s}(n)} = \mathbf{f}'(n) = \left[\frac{\partial f_1(n)}{\partial s_1(n)} \cdots \frac{\partial f_I(n)}{\partial s_I(n)}\right]^T,
\qquad
\frac{\partial e(n)}{\partial \mathbf{z}(n)} = \bigl(\mathbf{b}(n) \bullet \mathbf{f}'(n)\bigr)\, e(n),
\qquad
\nabla E_b(n) = \begin{bmatrix} \dfrac{\partial e(n)}{\partial b_1(n)} \\ \vdots \\ \dfrac{\partial e(n)}{\partial b_{I+1}(n)} \end{bmatrix} = \begin{bmatrix} \mathbf{z}(n) \\ 1 \end{bmatrix} e(n),
\qquad (15)

where • denotes the Hadamard product [26]. The use of the modified version of the SCG method [18] in the training procedure of the MLPNN demands the computation of the total gradient vector ∇E_T(w(n)) and the Hessian matrix H(w(n)) [18]. However, it is well established that the evaluation of the Hessian matrix demands a huge computational effort. In order to avoid this problem, this contribution proposes the straightforward computation of the expression H(w(n))d(n) [19], where d(n) is a directional vector that appears in the modified SCG formulation. As a result, the modified version of the SCG does not require the explicit computation of the Hessian matrix. In this regard, let the differential operator [19] be expressed by

d_g\bigl(\mathbf{w}(n)\bigr) \equiv \frac{\partial}{\partial \alpha}\, g\bigl(\mathbf{w}(n) + \alpha \mathbf{d}(n)\bigr)\Big|_{\alpha=0}, \qquad (16)

where g(·) is a function, α is an increment, d(n) is a directional vector, and w(n) contains the parameters of g(·).

Then, H(w(n))d(n) is given by

\mathbf{H}\bigl(\mathbf{w}(n)\bigr)\,\mathbf{d}(n) = d_{\nabla E_T}\bigl(\mathbf{w}(n)\bigr) = \begin{bmatrix} \sum_n d_{\nabla E_a}(n) \\ \sum_n d_{\nabla E_b}(n) \end{bmatrix}. \qquad (17)
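The product H(w)d can also be obtained, or checked, without forming the Hessian by differencing the gradient along d, as sketched below; the paper instead applies the exact differential operator of [19], which has the same O(N) cost per product. `grad_fn` is a placeholder for a routine returning the total gradient of E_T.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, d, alpha=1e-6):
    """Directional second derivative H(w) d in the spirit of (16)-(17),
    approximated here by a finite difference of the gradient along d.
    `grad_fn(w)` must return the gradient of the total error at w."""
    d = np.asarray(d, dtype=float)
    step = alpha / (np.linalg.norm(d) + 1e-12)        # keep the perturbation well conditioned
    return (grad_fn(w + step * d) - grad_fn(w)) / step
```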

7. SOME RESULTS

The tests were performed using the S-23 database [21], which is composed of speech files in English, French, Japanese, and Italian. Each file corresponds to a determined test condition, involving some speech codecs, and has an associated mean opinion score (MOS) or comparative mean opinion score (CMOS) subjective quality measure. The FLMLP method should estimate those subjective values. The S-23 database is divided into three main groups.

(i) First experiment: the speech files were submitted to a number of ITU and mobile-telephony standard codecs.
(ii) Second experiment: the speech files were submitted to a number of environment noise types.
(iii) Third experiment: the coded signals were transmitted through a communication channel that introduces random and burst frame errors.

The training of the MLPNN took into account all languages and experiments found in the S-23 database, as shown in Figure 4. The test set has been assembled in such a way that all the different conditions found in the S-23 database are represented. In other words, the method is tested for all kinds of distortions present in the S-23 database. Table 1 shows the number of test files used for each language and for each experiment.

The performance of the FLMLP method during the training and tests was evaluated according to the correlation, ρ, and the variance of the error, σ_e², given by (18) and (19), respectively:

\rho = \frac{\sum_{i=0}^{N-1}\bigl(x_i(n) - \bar{x}(n)\bigr)\bigl(y_i(n) - \bar{y}(n)\bigr)}{\sqrt{\sum_{i=0}^{N-1}\bigl(x_i(n) - \bar{x}(n)\bigr)^2 \sum_{i=0}^{N-1}\bigl(y_i(n) - \bar{y}(n)\bigr)^2}}, \qquad (18)

\sigma_e^2 = \frac{1}{N}\sum_{i=1}^{N}\Bigl[\bigl(x_i(n) - y_i(n)\bigr) - \bigl(\bar{x}(n) - \bar{y}(n)\bigr)\Bigr]^2, \qquad (19)

where x_i(n) represents the ith objective measure, y_i(n) represents its corresponding subjective measure, \bar{x}(n) and \bar{y}(n) represent the means of the estimated and subjective measures, respectively, and N is the number of measures.
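A direct implementation of (18) and (19) is sketched below.

```python
import numpy as np

def correlation_and_error_variance(objective, subjective):
    """Correlation rho of (18) and error variance sigma_e^2 of (19) between
    objective estimates x_i and subjective scores y_i."""
    x = np.asarray(objective, dtype=float)
    y = np.asarray(subjective, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    rho = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    sigma_e2 = np.mean(((x - y) - (x.mean() - y.mean())) ** 2)
    return rho, sigma_e2
```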

In order to train the MLPNN with 12 neurons in the hidden layer, training sets with the perceptual parameters were randomly generated for each language as well as for all languages together. After that, each training set was used in the learning procedure of one MLPNN. About 3000 epochs were heuristically specified for each training procedure.



[Figure 4: The training files. The figure lists, per group and language, the numbers of training files with MOS-scored subjective measures (1st group: 116 files each for French, Japanese, and English; 3rd group: 132 files each for French, Japanese, English, and Italian; 876 files in total) and with CMOS-scored subjective measures (2nd group: 84 French, 88 Japanese, and 88 English files; 260 in total).]

After the training phase, the correlation and error variance obtained for all training files were higher than 0.99 and lower than 0.005, respectively. After that, each trained MLPNN was used to measure the subjective quality of the testing speech signals. The correlation achieved during the test procedure is reported in Table 2, while the attained error variance is displayed in Table 3. From Tables 2 and 3, the following remarks can be stressed.

(i) The fast, modified version of the SCG method applied to train the MLPNN yields good results, even with only 12 neurons in the hidden layer. A greater number of neurons does not exhibit noteworthy improvement.

(ii) The worst results occurred with the generic language set. This is due to the low robustness of the FLMLP for quality assessment of several languages with only one trained MLPNN. But, even in this case, the new method has achieved better results than the other ones [2, 8, 9, 10].

(iii) The behavior of the error variance reveals that the FLMLP yields estimates with low variability, which is a very desirable property for this kind of application.

Another performance measure, not shown in the paper, is the mean difference between the actual and estimated subjective values. This mean is less than 0.2 for all cases.

Table 1: Number of speech files used in the tests.

Language   1st exp.   2nd exp.   3rd exp.
French     60         44         68
Japanese   60         48         68
English    60         48         68
Italian    60         —          68
Total      240        140        272

As can be seen, the FLMLP attains a notable performance in the presence of hard conditions, such as errors, various codecs, and environmental noises. However, caution must be taken before stating its superiority over the other methods.

As commented before, the main shortcoming of neural networks is their lack of flexibility under untrained conditions quite different from those used for training. Therefore, it is difficult to estimate how the FLMLP would behave when facing untrained conditions without using additional speech databases. However, it is worth pointing out that if the untrained conditions show some kind of similarity with the training ones, a good speech quality assessment should be accomplished.



Table 2: Performance of FLMLP in terms of ρ.

Language   Measure   MOQV    PESQ∗   MOQV-KSOM   FLMLP
French     MOS       0.93    0.92    0.96        0.97
French     CMOS      0.93    0.94    0.98        0.99
Japanese   MOS       0.91    0.94    0.95        0.96
Japanese   CMOS      0.95    0.93    0.98        0.99
English    MOS       0.92    0.94    0.94        0.95
English    CMOS      0.95    0.93    0.93        0.99
Italian    MOS       0.90    0.93    0.93        0.94
Generic    MOS       0.87    0.90    0.92        0.92
Generic    CMOS      0.94    0.93    0.92        0.94

∗ The correlation values of PESQ were obtained in tests performed by the authors of this paper using the original ITU PESQ routine, because the currently available literature does not provide that information in such a detailed way.

Table 3: Performance of FLMLP in terms of σ_e².

Language   Measure   FLMLP
French     MOS       0.030
French     CMOS      0.001
Japanese   MOS       0.030
Japanese   CMOS      0.004
English    MOS       0.055
English    CMOS      0.001
Italian    MOS       0.060
Generic    MOS       0.090
Generic    CMOS      0.060

On the other hand, it has been shown that PESQ, which does not use neural networks, often achieves good results when faced with unknown conditions. Additionally, the optimization of PESQ has been carried out using a larger number of databases, making it more difficult to achieve high correlation with a particular database.

The version of the FLMLP described here has not been optimized to assess signals whose distortions vary significantly in time. Studies have been carried out focusing on this task. Initial efforts have been addressed toward the assessment of small segments of the signal, and then the combination of the scores into a single estimate of the signal quality. Another topic that has been investigated is the use of a "forgetting factor" to model the phenomenon whereby listeners tend to forget the distortions that occurred at the beginning of long signals. Both studies are still in the early stages, but the first results are promising.

8. CONCLUSIONS

This contribution has introduced the FLMLP method for speech quality assessment. As reported by the numerical results, the new method not only provides good results, but also outperforms previous methods for the tested conditions. Therefore, the obtained results validate the underlying techniques of the FLMLP as potential tools to solve some of the main problems that still prevent the use of objective speech quality assessment in a number of conditions.

The improvement achieved by the FLMLP is due to the introduction of two original techniques: (a) an overcomplete transform based on the DFT and the MLT that leads to a new set of perceptual parameters related to speech quality; (b) a multilayer perceptron neural network, trained by a modified version of the SCG method, to map the perceptual parameters into a subjective quality measure. Compared to existing solutions, the new perceptual parameters contain more information about the differences between the degraded and original speech signals, whereas the neural network yields a more precise mapping from these parameters to an estimate of the subjective quality measure. Additionally, it can be pointed out that the adjustment of the MLPNN weights to take into account new conditions can be performed online because of the low complexity of the training procedure.

Further research should be carried out to address other kinds of overcomplete transforms, aiming to improve the quality of the perceptual parameters.

Also, other nonlinear techniques can further enhance the speech quality estimation. From the authors' point of view, good candidates can emerge from hybrid techniques grounded on type-2 fuzzy systems and hierarchical neural networks.

Finally, further investigation about the performance ofFLMLP when facing untrained conditions should be con-ducted.

ACKNOWLEDGMENT

Special thanks are extended to CAPES (BEX2418/03-7), CNPq (Grant 552371/01-7), and FAPESP (Grants 01/08513-0, 01/04144-0, and 02/12216-3).

REFERENCES

[1] D. J. Atkinson, "Proposed annex to recommendation P.861," NTIA, ITU Study Group 12 - Contribution COM 12-24-E, 1997.

[2] A. W. Rix and M. P. Hollier, "The perceptual analysis measurement system for robust end-to-end speech quality assessment," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '00), vol. 3, pp. 1515-1518, Istanbul, Turkey, June 2000.

[3] ETSI EG 201 377-1, Specification and measurement of speech transmission quality; Part 1: Introduction to objective comparison measurement methods for one-way speech quality across networks, 1999.

[4] J. G. Beerends and J. A. Stemerdink, "A perceptual speech-quality measure based on a psychoacoustic sound representation," Journal of the Audio Engineering Society, vol. 42, no. 3, pp. 115-123, 1994.

[5] ITU-T Recommendation P.861, Objective quality measurement of telephone-band (300-3400 Hz) speech codecs, 1996.

[6] ITU-T Contribution COM 12-20, Improvement of the P.861 perceptual speech quality measure, Geneva, Switzerland, 1997, http://portal.etsi.org/docbox/zArchive/TIPHON/TIPHON/ARCHIVES/1998/05-9810-Tel Aviv/.

[7] ITU-T Recommendation P.862, Perceptual Evaluation of Speech Quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2001.

[8] J. G. A. Barbedo and A. Lopes, "Proposal and validation of an objective method for quality assessment of speech codecs and communication systems," Revista Tecnologia, vol. 23, pp. 96-112, 2002.

[9] J. G. A. Barbedo, M. V. Ribeiro, F. J. von Zuben, A. Lopes, and J. M. T. Romano, "Application of Kohonen self-organizing maps to improve the performance of objective methods for speech quality assessment," in Proc. European Signal Processing Conference (EUSIPCO '02), vol. 1, pp. 519-522, Toulouse, France, September 2002.

[10] J. G. A. Barbedo, M. V. Ribeiro, A. Lopes, and J. M. T. Romano, "Estimation of the subjective quality of speech signals using the Kohonen self-organizing maps," in Proc. IEEE International Telecommunications Symposium (ITS '02), pp. 834-839, Natal, Brazil, September 2002.

[11] M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice Hall, Englewood Cliffs, NJ, USA, 1995.

[12] M. S. Lewicki and T. J. Sejnowski, "Learning overcomplete representations," Neural Computation, vol. 12, no. 2, pp. 337-365, 2000.

[13] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, San Diego, Calif, USA, 2nd edition, 2001.

[14] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Prentice Hall, Englewood Cliffs, NJ, USA, 1989.

[15] P. Duhamel, "Implementation of 'split-radix' FFT algorithms for complex, real, and real-symmetric data," IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, no. 2, pp. 285-295, 1986.

[16] H. S. Malvar, Signal Processing with Lapped Transforms, Artech House, Norwood, Mass, USA, 1992.

[17] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Englewood Cliffs, NJ, USA, 1999.

[18] E. P. Santos and F. J. Von Zuben, "Efficient second-order learning algorithm for discrete-time recurrent neural networks," in Recurrent Neural Networks: Design and Applications, L. R. Medsker and L. C. Jain, Eds., pp. 47-75, CRC Press, Boca Raton, Fla, USA, 2000.

[19] B. A. Pearlmutter, "Fast exact multiplication by the Hessian," Neural Computation, vol. 6, no. 1, pp. 147-160, 1994.

[20] Speech Quality Experts Group, Subjective test plan for characterization of an 8 kbit/s speech codec, ITU-T Study Group 12, Issue 2.0, 1995.

[21] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models, Springer-Verlag, Berlin, Germany, 1990.

[22] E. Terhardt, "Calculating virtual pitch," Hearing Research, vol. 1, no. 2, pp. 155-182, 1979.

[23] H.-X. Li and C. L. P. Chen, "The equivalence between fuzzy logic systems and feedforward neural networks," IEEE Trans. Neural Networks, vol. 11, no. 2, pp. 356-365, 2000.

[24] L. M. Reyneri, "Unification of neural and wavelet networks and fuzzy systems," IEEE Trans. Neural Networks, vol. 10, no. 4, pp. 801-814, 1999.

[25] L. M. Reyneri, "Implementation issues of neuro-fuzzy hardware: going toward HW/SW codesign," IEEE Trans. Neural Networks, vol. 14, no. 1, pp. 176-194, 2003.

[26] A. Graham, Kronecker Products and Matrix Calculus: with Applications, Ellis Horwood, Chichester, UK, 1981.

Moises Vidal Ribeiro was born in Tres Rios, Brazil, in 1974. He received the B.S. degree from the Federal University of Juiz de Fora, in 1999, and the M.S. and Ph.D. degrees from the State University of Campinas (UNICAMP) in 2001 and 2005, respectively, both in electrical engineering. Since 2005, he has been a Postdoctoral Researcher at the University of Campinas. He was a Visiting Researcher in the Image and Signal Processing Laboratory, the University of California, Santa Barbara, from January 2004 to June 2004. He holds one patent. His fields of interest include filter banks, computational intelligence, digital and adaptive signal processing applied to power quality evaluation, power-line communication, and DSL technology. He has been the recipient of 7 scholarships from the Brazilian Government, and is the author of 9 journal papers and 22 conference papers. He was granted student awards by IECON'01 and ISIE'03.

Jayme Garcia Arnal Barbedo received his B.S. degree in electrical engineering from the Federal University of Mato Grosso do Sul, Brazil, in 1998. He received the M.S. and Ph.D. degrees from the State University of Campinas, in 2001 and 2004, respectively, for research concerning objective assessment of speech and audio quality. From 2004 to 2005, he worked with the Source Signals Encoding Group of the Digital Television Division at the CPqD Telecom & IT Solutions, Campinas, Brazil. He is currently conducting postdoctoral research in content-based audio classification at the State University of Campinas. His current research also includes audio and video encoding applied to digital television broadcasting and code vectorization.

Joao Marcos Travassos Romano was born in Rio de Janeiro, Brazil, in 1960. He received his B.S. and M.S. degrees in electrical engineering from the State University of Campinas (UNICAMP), Campinas, Brazil, in 1981 and 1984, respectively. In 1987, he received his Ph.D. degree from the University of Paris-XI, Paris, France. In 1988, he joined the Communications Department, the Faculty of Electrical and Computer Engineering, UNICAMP, where he is now a Professor. He served as an Invited Professor at the University of Rene Descartes, Paris, during the winter of 1999 and in the Communications and Electronic Laboratory in CNAM, Paris, during the winter of 2002. He is responsible for the Signal Processing for Communications Laboratory, and his research interests concern adaptive and intelligent signal processing and its applications in telecommunications problems like channel equalization and smart antennas. Since 1988, he has been a recipient of the Research Fellowship of CNPq, Brazil. He is a Member of the IEEE Electronics and Signal Processing Technical Committee. Since April 2000, he has been the President of the Brazilian Communications Society (SBrT), a sister society of ComSoc-IEEE, and since April 2003, he has been the Vice Director of the Faculty of Electrical and Computer Engineering, UNICAMP.


Amauri Lopes received his B.S., M.S., and Ph.D. degrees in electrical engineering from the State University of Campinas, Sao Paulo, Brazil, in 1972, 1974, and 1982, respectively. Since 1973 he has been with the Faculty of Electrical and Computer Engineering (FEEC), State University of Campinas, where he is currently a Professor. His teaching and research interests are in analog and digital signal processing, circuit theory, digital communications, and stochastic processes. He has published over 70 refereed papers in some of these areas and over 30 technical reports. He served as the Chairman of the Department of Communications and Vice Dean of the Faculty of Electrical and Computer Engineering, University of Campinas.


EURASIP Journal on Applied Signal Processing 2005:9, 1435-1448
© 2005 Hindawi Publishing Corporation

Simulation of Human Speech Production Applied to the Study and Synthesis of European Portuguese

António J. S. Teixeira
Instituto de Engenharia Electrónica e Telemática de Aveiro (IEETA), 3810-193 Aveiro, Portugal
Departamento de Electrónica e Telecomunicações, Universidade de Aveiro, 3810-193 Aveiro, Portugal
Email: [email protected]

Roberto Martinez
Instituto de Engenharia Electrónica e Telemática de Aveiro (IEETA), 3810-193 Aveiro, Portugal
Email: [email protected]

Luís Nuno Silva
Instituto de Engenharia Electrónica e Telemática de Aveiro (IEETA), 3810-193 Aveiro, Portugal
Email: [email protected]

Luis M. T. Jesus
Instituto de Engenharia Electrónica e Telemática de Aveiro (IEETA), 3810-193 Aveiro, Portugal
Escola Superior de Saúde, Universidade de Aveiro, 3810-193 Aveiro, Portugal
Email: [email protected]

Jose C. Príncipe
Computational Neuroengineering Laboratory (CNEL), University of Florida, Gainesville, FL 32611, USA
Email: [email protected]

Francisco A. C. Vaz
Instituto de Engenharia Electrónica e Telemática de Aveiro (IEETA), 3810-193 Aveiro, Portugal
Departamento de Electrónica e Telecomunicações, Universidade de Aveiro, 3810-193 Aveiro, Portugal
Email: [email protected]

Received 29 October 2003; Revised 31 August 2004

A new articulatory synthesizer (SAPWindows), with a modular and flexible design, is described. A comprehensive acoustic model and a new interactive glottal source were implemented. Perceptual tests and simulations made possible by the synthesizer contributed to deepening our knowledge of one of the most important characteristics of European Portuguese, the nasal vowels. First attempts at incorporating models of frication into the articulatory synthesizer are presented, demonstrating the potential of performing fricative synthesis based on broad articulatory configurations. Synthesis of nonsense words and Portuguese words with vowels and nasal consonants is also shown. Despite not being capable of competing with mainstream concatenative speech synthesis, the anthropomorphic approach to speech synthesis, known as articulatory synthesis, proved to be a valuable tool for phonetics research and teaching. This was particularly true for the European Portuguese nasal vowels.

Keywords and phrases: articulatory synthesis, speech production, European Portuguese, nasal vowels, fricatives.

1. INTRODUCTION

Recent technological developments are characterized by increasing physical and psychological similarity to humans. Well-known humanlike robots are one example. Being one of the distinctive characteristics of humans, speech is a natural candidate for imitation by machines. Moreover, information can be transmitted very quickly by voice, and speech frees the hands and eyes for other tasks.

Various designs of machines that produce and understand human speech have been available for a long time [1, 2]. The use of voice in computer system interfaces will


be an added advantage, allowing, for example, the use of information systems by people with different disabilities and access by telephone to new information services. However, our current knowledge of the production and perception of voice is still incomplete. The quality (or lack of it) of the synthetic voice in currently available systems is a clear indication of the need to improve this knowledge [2].

There are two types of motivation for research in the vast domain of voice production and perception [3]. The first aims at a deep understanding of its diverse aspects and functions; the second is the design and development of artificial systems. When artificial systems are closely related to the way humans do things, these two motivations can be merged: such systems contribute to an increased knowledge of the process, and this knowledge can be used to improve current systems.

We have been developing an articulatory synthesizer since 1995 that will hopefully produce high-quality synthetic European Portuguese (EP) speech. We aim simultaneously to improve synthesis quality (the technological motivation) and to expand our knowledge of Portuguese production and perception.

2. ARTICULATORY SYNTHESIS

Articulatory synthesis generates the speech signal through modeling of the physical, anatomical, and physiological characteristics of the organs involved in human voice production. This is a different approach when compared with other techniques, such as formant synthesis [5]. In the articulatory approach, the system is modeled instead of the signal or its acoustic characteristics. Approaches based on the signal try to reproduce the signal of a natural voice as faithfully as possible, with little or no concern about how it is produced. In contrast, a model based on the production system uses physical laws to describe sound propagation in the vocal tract and models mechanical and aeroacoustic phenomena to describe the oscillation of the vocal folds.

2.1. Basic components of an articulatory synthesizer

To implement an articulatory synthesizer in a digital computer, a mathematical model of the vocal system is needed. Synthesizers usually include two subsystems: an anatomic-physiological model of the structures involved in voice production and a model of the production and propagation of sound in these structures.

The first model transforms the positions of the articulators, like the jaw, tongue body, and velum, into cross-sectional areas of the vocal tract. The second model consists of a set of equations that describe the acoustic properties of the vocal tract system. Generally it is divided into submodels that simulate different phenomena: the creation of a source of periodic excitation (vocal fold oscillation), sound sources caused by turbulent flow where constriction zones exist (areas sufficiently reduced along the vocal tract), propagation of the sound above and below the vocal folds, and radiation at the lips and/or nostrils.

The parameters for the models can be produced by different methods: they can be obtained directly from the voice signal by an inversion process with optimization, defined manually by the researcher, or generated by the linguistic processing part of a TTS (text-to-speech) system.

2.2. Motivations

Articulatory synthesis has not received as much attention in recent years as it could have, because it is not yet an alternative to the synthesis methods currently used in TTS systems. This is due to several factors: the difficulty of obtaining information about the vocal tract and the vocal folds during human voice production; the fact that measurement techniques generally provide information about static configurations, while information concerning the dynamics of the articulators is incomplete; the lack of a full and reliable inversion process for obtaining the articulatory parameters from natural voice; and the complex calculations involved, which raise problems of stability in the numerical resolution.

Despite these limitations, articulatory synthesis presents some important advantages: the parameters of the synthesizer are directly related to the human articulatory mechanisms, being very useful in studies of voice production and perception [6]; the method can produce high-quality nasal consonants and nasal vowels [7]; source-tract interaction, essential for a natural sound, can be conveniently modeled when simulating the vocal folds and the tract as one system [8]; the parameters vary slowly in time, so they can be used in efficient coding schemes; the parameters are easier to interpolate than LPC and formant synthesizer parameters [9]; and small errors in the control signals do not generally produce low-quality speech sounds, because the interpolated values will always be physically possible.

According to Shadle and Damper [10], articulatory synthesis is clearly the best way to reproduce some attributes of speech we are interested in, such as being able to sound like an extraordinary speaker (e.g., a singer, someone with disordered speech, or an alien with extra sinuses), or being able to change to another speaker type, or alter the voice quality of a given speaker, without having to go through as much effort as required for the first voice. Articulatory synthesizers have parameters that can be conceptualized, so that if a speech sample sounds wrong, intuition is useful in fixing it, always teaching us something and providing opportunities to learn more as we work to produce a commercially usable system.

"Articulatory synthesis holds promise for overcoming some of the limitations and for sharpening our understanding of the production/perception link" [11]. There is only partial knowledge about the dynamics of the speech signal, so continued research in this area is needed. The systematic study of coarticulation effects is of special importance for the development of experimental phonetics and the sciences related to voice processing [12]. An articulatory synthesizer can be used as a versatile speaker and therefore contribute to such studies. Articulatory synthesizers can generate


speech using carefully controlled conditions. This can be useful, for example, to test pitch-tracking algorithms [13].

The articulatory synthesizer can also be combined with a speech production evaluation tool to develop a system that produces real-time audio-visual feedback to help people with specific articulatory disorders. For example, computer-based speech therapy [14] for speakers with dysarthria tries to stabilize their production at the syllable or word level, to improve the consistency of production. For severely hearing-impaired persons, the aim is to teach them new speech patterns and increase the intelligibility of their speech. For children with cleft lip and palate and velopharyngeal incompetence, the aim is to eliminate misarticulated speech patterns so that most of these speakers can achieve highly intelligible, normal speech patterns.

Also, "the use of such a [articulatory] synthesizer has much to commend it in phonetic studies" [15]. The audio-visual feedback could be used as an assistant for teaching phonetics to foreign students, to improve their speech quality. The synthesizer can be used to help teach characteristic features of a given language, such as pitch level and vowel space [16].

Recent developments presented at the ICPhS [11] show that articulatory synthesis is worth revisiting as a research tool and as a part of TTS systems. Better ways of measuring vocal tract configurations, an increased research interest in the visual representation of speech, and the use of simpler control structures have renewed the interest in this research area [11]. Current articulatory approaches to synthesis include an open-source infrastructure that can be used to combine different models [17], recent developments in the Haskins configurable articulatory synthesizer (CASY) [18], the characterization of lip movements [19], the ICP virtual talking head that includes articulatory, aerodynamic, and acoustic models of speech [20], and the quasiarticulatory approach (articulatory parameters controlling a formant synthesizer) of Stevens and Hanson [21].

3. SAPWINDOWS ARTICULATORY SYNTHESIZER

Object-oriented programming was used to implement the synthesizer. The model-view-controller concept was adopted to separate models from their controls and viewers.

The application, developed using Microsoft Visual C++, can synthesize speech segments from parameter sequences. These sequences can be defined in a data file or edited by the user. The synthesis process is presented step by step on a graphical interface.

Presently, the implemented models allow quality synthesis only of vowels (oral or nasal), nasal consonants, and fricatives.

The next sections briefly present the currently implemented models.

3.1. Anatomic models

For nonnasal sounds, we only have to consider the vocal tract, that is, a variable-area tube between the glottis and the lips. For nasal sounds, we also have to consider the nasal tract. The nasal tract area is essentially constant, with the exception of the soft palate region. The vocal tract varies continually, and its form must be specified at intervals shorter than a few milliseconds [23].

Figure 1: Vocal tract model, based on Mermelstein's model [22].

3.1.1. Vocal tract model

The proposed anatomic model, shown in Figure 1, assumes midsagittal plane symmetry to estimate the vocal tract cross-sectional area. The model articulators are the tongue body, tongue tip, jaw, lips, velum, and hyoid. Our model is an improved version of the University of Florida MMIRC model [24], which in turn was a modified version of Mermelstein's model [22]. It uses a nonregular grid to estimate section areas and lengths.
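A common way of turning the grid's midsagittal distances into areas in Mermelstein-type models is a region-dependent power law; the short sketch below illustrates that idea. The coefficient values and region names are illustrative assumptions and are not taken from the SAPWindows model.

// Sketch of a common midsagittal-distance-to-area rule used in
// Mermelstein-type models: A = alpha * d^beta with region-dependent
// coefficients. All numerical values below are illustrative assumptions,
// not those of the SAPWindows model.
#include <cmath>
#include <cstdio>
#include <vector>

struct GridSection {
    double d_cm;    // midsagittal distance measured along a grid line (cm)
    double len_cm;  // section length along the tract midline (cm)
    double alpha;   // region-dependent coefficient
    double beta;    // region-dependent exponent
};

int main() {
    // Hypothetical nonregular grid: pharynx -> palatal region -> alveolar region -> lips.
    std::vector<GridSection> grid = {
        {1.2, 1.0, 1.6, 1.5},
        {0.9, 0.8, 1.8, 1.5},
        {0.6, 0.7, 2.0, 1.4},
        {0.8, 0.5, 1.0, 1.0},
    };
    for (const GridSection& s : grid) {
        double area = s.alpha * std::pow(s.d_cm, s.beta);  // cross-sectional area (cm^2)
        std::printf("length = %.2f cm, area = %.3f cm^2\n", s.len_cm, area);
    }
    return 0;
}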

3.1.2. Nasal tract model

The model of the nasal tract allows the inclusion of different nasal tract shapes and several paranasal sinuses.

The nasal cavity is modeled in a similar way to the oral tract and can be considered as a side branch of the vocal tract. The major difference is that the area function is fixed for most of the nasal tract, for a particular speaker. The variable region, the soft palate, changes with the degree of nasal coupling; the velum parameter of the articulatory model controls this coupling. RLC shunt circuits, representing Helmholtz resonators, simulate the paranasal sinuses [7].

Our synthesizer allows the definition of different tract shapes and the inclusion of the needed sinus at any position


Figure 2: Default nasal model based on [26].

by simply editing an ASCII file. Also, blocking of the nasal passages at any position can be simulated by defining a null-area section at the point of occlusion. Implementation details were reported in [25].

In most of our studies, we use the nasal tract dimensions from [26], as shown in Figure 2, which were based on studies by Dang and Honda [27] and Stevens [28].

3.2. Interactive glottal source model

We designed a glottal excitation model that included source-tract interaction for oral and nasal sounds [29], allowed direct control of source parameters, such as fundamental frequency, and was not too demanding computationally.

The interactive source model we developed was based on [30]. The model was extended to include a two-mass parametric model of the glottal area, jitter, shimmer, aspiration, and the ability to synthesize dynamic configurations.

To calculate the glottal excitation, ug(t), it became necessary to model the subsystems involved: the lungs, the subglottal cavities, the glottis, and the supraglottal tract.

The role of the lungs is the production of a quasiconstant pressure source, modeled as a pressure source pl in series with the resistance Rl. To represent the subglottal region, including the trachea, we used three RLC resonant circuits [31].

Several approaches have been used for vocal fold modeling: self-oscillating models, parametric glottal area models, and so forth. We wanted a physiological model, like the two-mass model, that resulted in high-quality synthesis, but at the same time a model not too demanding computationally. Direct control of parameters such as F0 was also required. We therefore chose the model proposed by Prado [24], which directly parameterizes the two glottal areas. In the model, Rg and Lg, which depend on the glottal aperture, represent the vocal folds.

Systems above the glottis were modeled by the tract input impedance zin(t), obtained from the acoustic model. This approach results in accurate modeling of frequency-dependent losses.

The various subsystems can be represented by the equivalent circuit shown in Figure 3.

Pressure variation along the circuit can be represented by
\[
p_l - R_l u_g(t) - \sum_{i=1}^{3} p_{sg_i} - \frac{d\bigl(L_g u_g(t)\bigr)}{dt} - R_g u_g(t) - p_s(t) = 0.
\tag{1}
\]
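The fragment below is a minimal numerical illustration of the balance expressed in (1). It assumes that the subglottal pressures p_sgi and the supraglottal pressure p_s are zero, drives the circuit with a simple parametric glottal area pulse, and takes textbook-style expressions for R_g and L_g; none of this reproduces the actual SAPWindows source model.

// Minimal time-stepping illustration of equation (1) under strong
// simplifying assumptions (p_sgi = p_s = 0, parametric area pulse,
// textbook-style R_g and L_g). Not the SAPWindows implementation.
#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
    const double PI    = 3.141592653589793;
    const double rho   = 1.14e-3;  // air density (g/cm^3)
    const double pl    = 10000.0;  // lung pressure (dyne/cm^2), cf. Table 1
    const double Rl    = 10.0;     // lung/bronchi resistance, assumed
    const double dg    = 0.3;      // glottal channel thickness (cm), assumed
    const double F0    = 100.0;    // fundamental frequency (Hz)
    const double Agmax = 0.3;      // maximum glottal area (cm^2), cf. Table 1
    const double fs    = 44100.0, dt = 1.0 / fs;

    double q  = 0.0;               // state variable q = Lg * ug
    double ug = 0.0;               // glottal volume velocity (cm^3/s)
    for (int n = 0; n < 1000; ++n) {
        double t = n * dt;
        double phase = std::fmod(t * F0, 1.0);
        // Sinusoidal glottal area pulse, open 60% of the cycle (OQ = 60%).
        double Ag = (phase < 0.6) ? Agmax * std::sin(PI * phase / 0.6) : 0.0;
        Ag = std::max(Ag, 1e-3);                            // small leak avoids division by zero
        double Lg = rho * dg / Ag;                          // glottal acoustic mass
        double Rg = rho * std::fabs(ug) / (2.0 * Ag * Ag);  // kinetic (Bernoulli) resistance
        // Semi-implicit step of d(Lg*ug)/dt = pl - (Rl + Rg)*ug.
        q  = (q + dt * pl) / (1.0 + dt * (Rl + Rg) / Lg);
        ug = q / Lg;
        if (n % 100 == 0) std::printf("t = %.4f s   ug = %8.1f cm^3/s\n", t, ug);
    }
    return 0;
}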

Figure 3: Electrical analogue of the implemented glottal source. Adapted from [32].

Table 1: Glottal source time-varying parameters.

Parameter   Description               Typical value   Unit
pl          Lungs pressure            10000           dyne/cm2
F0          Fundamental frequency     100-200         Hz
OQ          Open quotient             60              % of T0
SQ          Speed quotient            2               —
Ag0         Minimum glottal area      0               cm2
Ag max      Maximum glottal area      0.3             cm2
A2 − A1     Slope                     0.03            cm2
Jitter      F0 perturbation           2               %
Shimmer     Ag max perturbation       5               %
Asp         Aspiration                —               —

The glottal source model includes parameters needed to model F0 and glottal aperture perturbations, known as jitter and shimmer. The model also takes into account the aspiration noise generation proposed by Sondhi and Schroeter [23]. Our source model is controlled by two kinds of parameters. The first type can vary in time, playing a role similar to that of the tract parameters; in the synthesis process, these parameters can be used to control intonation, voice quality, and related phenomena. They are presented in Table 1. The second type of source parameters (including lung resistance, glottis dimensions, etc.) does not vary in time; their values can be altered by editing a configuration file.

3.3. Acoustic model

Several techniques have been proposed for the simulation of sound propagation in the oral and nasal tracts [33]: direct numeric solution of the equations; time-domain simulation using wave digital filters (WDF), also known as the Kelly-Lochbaum model; and frequency-domain simulation. After analyzing the pros and cons of these three approaches, we chose the frequency-domain technique for our first implementation of the acoustic model. The main reason for this choice was the possibility of easily including the frequency-dependent losses.

In our acoustic model, we made the following approximations: propagation is assumed planar; the tract is straight; and the tube is approximated by the concatenation of elementary acoustic tubes of constant area. An equivalent circuit, represented by a transmission matrix, models each one of these elementary tubes. Analysis of the circuit is performed in the frequency domain [9].
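The sketch below illustrates the transmission-matrix idea under strong simplifications: lossless cylindrical sections, a uniform 17.5 cm tube, and a crude low-frequency radiation load. It is meant only to show how the chain (ABCD) matrices are multiplied and how a volume-velocity transfer function is read off; it is not the lossy model used in the synthesizer.

// Chain (ABCD) matrices of lossless constant-area sections multiplied from
// glottis to lips, with a volume-velocity transfer function evaluated against
// a crude radiation load. Uniform 17.5 cm tube (roughly a neutral vowel).
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;
struct ABCD { cd A, B, C, D; };

ABCD chain(const ABCD& m, const ABCD& n) {               // 2x2 matrix product m*n
    return { m.A * n.A + m.B * n.C, m.A * n.B + m.B * n.D,
             m.C * n.A + m.D * n.C, m.C * n.B + m.D * n.D };
}

int main() {
    const double PI = 3.141592653589793;
    const double rho = 1.14e-3, c = 3.5e4;               // CGS units
    std::vector<double> area(35, 3.0), len(35, 0.5);     // 35 sections of 0.5 cm, 3 cm^2 each

    for (double f = 100.0; f <= 4000.0; f += 100.0) {
        double k = 2.0 * PI * f / c;                     // wavenumber
        ABCD T{1.0, 0.0, 0.0, 1.0};                      // identity matrix
        for (size_t i = 0; i < area.size(); ++i) {
            double Z0 = rho * c / area[i];               // characteristic impedance
            double kl = k * len[i];
            ABCD sec{ std::cos(kl), cd(0.0, Z0 * std::sin(kl)),
                      cd(0.0, std::sin(kl) / Z0), std::cos(kl) };
            T = chain(T, sec);                           // accumulate glottis -> lips
        }
        double a = std::sqrt(area.back() / PI);          // lip opening radius
        cd Zrad = (rho * c / area.back()) *
                  cd(0.25 * k * a * k * a, 0.6 * k * a); // unflanged-pipe approximation (assumed)
        cd H = 1.0 / (T.C * Zrad + T.D);                 // U_lips / U_glottis
        std::printf("f = %4.0f Hz   |H| = %7.2f\n", f, std::abs(H));
    }
    return 0;
}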


Figure 4: Matrices and impedances involved in the calculation of the transfer function Hgn, between the glottis and a constriction point, which in turn is used in the calculation of the flow at the noise source location.

Speech is generated by the acoustic model. We use a frequency-domain analysis and time-domain synthesis method, usually designated as the hybrid method [9]. The use of the convolution method avoids the problem of continuity of resonance in the faster method proposed by Lin [34]. The use of a fast implementation of the IFFT (the MIT FFTW [35]) minimizes the convolution calculation time.

A similar procedure is applied to the input impedance Zin(ω), in order to obtain zin(n), needed for the source-tract interaction modeling by the glottal source model.
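The following fragment sketches the hybrid step in isolation: a transfer function sampled on a frequency grid (however obtained) is converted to an impulse response by an inverse DFT and then applied to the excitation by time-domain convolution. The naive O(N²) inverse DFT stands in for the FFTW call, and the toy single-resonance spectrum and crude pulse are placeholders, not data from the model.

// Hybrid frequency/time-domain step: one-sided spectrum -> impulse response
// (inverse DFT) -> time-domain convolution with the excitation.
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

// Inverse DFT of the one-sided spectrum (DC..Nyquist) of a real impulse response.
std::vector<double> impulseResponse(const std::vector<cd>& Hhalf, int N) {
    const double PI = 3.141592653589793;
    std::vector<double> h(N, 0.0);
    for (int n = 0; n < N; ++n) {
        double acc = Hhalf.front().real();                        // DC bin
        for (int k = 1; k + 1 < (int)Hhalf.size(); ++k)           // bins k and N-k combined
            acc += 2.0 * std::real(Hhalf[k] * std::polar(1.0, 2.0 * PI * k * n / N));
        acc += std::real(Hhalf.back() * std::polar(1.0, PI * n)); // Nyquist bin
        h[n] = acc / N;
    }
    return h;
}

int main() {
    const int N = 128;
    std::vector<cd> Hhalf(N / 2 + 1);                    // placeholder single-resonance spectrum
    for (int k = 0; k < (int)Hhalf.size(); ++k)
        Hhalf[k] = 1.0 / cd(1.0 + 0.02 * (k - 10.0) * (k - 10.0), 0.3);

    std::vector<double> h = impulseResponse(Hhalf, N);

    std::vector<double> ug(N, 0.0);                      // crude excitation pulse
    for (int n = 0; n < 8; ++n) ug[n] = n * (8 - n);

    std::vector<double> out(2 * N, 0.0);                 // time-domain convolution h * ug
    for (int n = 0; n < 2 * N; ++n)
        for (int m = 0; m < N; ++m)
            if (n - m >= 0 && n - m < N) out[n] += h[m] * ug[n - m];

    std::printf("h[0] = %f   out[10] = %f\n", h[0], out[10]);
    return 0;
}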

3.4. Acoustic model for fricatives

The volume velocity at a constriction is obtained by the convolution of the glottal flow with the impulse response calculated, using an IFFT, from the transfer function between the glottis and the constriction point, Hgn (see Figure 4).

3.4.1. Noise sources

Fluctuations in the velocity of airflow emerging from a constriction (at an abrupt termination of a tube) create monopole sources, and fluctuations of the forces exerted by an obstacle (e.g., teeth, lips) or surface (e.g., palate) oriented normal to the flow generate dipole sources. Since dipole sources have been shown to be the most influential in fricative spectra [36], the noise sources of the fricatives have only been approximated by equivalent pressure voltage (dipole) sources in the transmission-line model. Nevertheless, it is also possible to insert the appropriate monopole sources, which contribute to the low-frequency amplitude and can be modeled by an equivalent current volume velocity source.

Frication noise is generated at the vocal tract according to the suggestions of Flanagan [37] and Sondhi and Schroeter [9]. A noise source can be introduced automatically at any T-section of the vocal tract network between the velum and the lips. The synthesizer's articulatory module registers which vocal tract tube cross-sectional areas are below a certain threshold (A < 0.2 cm2), producing a list of tube sections that might be part of an oral constriction that generates turbulence.

The acoustic module calculates the Reynolds number (Re) at the sections selected by the articulatory module and activates noise sources at tube sections where the Reynolds number is above a critical value (Recrit = 2000 according to [9]). Noise sources can also be inserted at any location in the vocal tract, based on additional information about the distribution and characteristics of sources [36, 38]. This is a different source placement strategy from that usually used in articulatory synthesis [9], where the sources are primarily located in the vicinity of the constriction. The distributed nature of some noise sources can be modeled by inserting several sources in consecutive vocal tract sections. This will allow us to try combinations of the canonical source types (monopole, dipole, and quadrupole).

A pressure source with amplitude proportional to the squared Reynolds number,
\[
P_{\text{noise}} =
\begin{cases}
2\times 10^{-6}\times \operatorname{rand}\bigl(\mathrm{Re}^{2}-\mathrm{Re}_{\text{crit}}^{2}\bigr), & \mathrm{Re} > \mathrm{Re}_{\text{crit}},\\
0, & \mathrm{Re} \le \mathrm{Re}_{\text{crit}},
\end{cases}
\tag{2}
\]
is activated at the correct place in the tract [9, 37]. The internal resistance of the noise pressure source is proportional to the volume velocity at the constriction, \(R_{\text{noise}} = \rho\lvert U_c\rvert/(2A_c^{2})\), where ρ is the density of the air, Uc is the flow at the constriction, and Ac is the constriction cross-sectional area. The turbulent flow can be calculated by dividing the noise pressure by the source resistance. This noise flow could also be filtered in the time domain to shape the noise spectrum [36] and test various experimentally derived dipole spectra.
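The sketch below walks through the noise-source decision at one candidate constriction: the area threshold, the Re_crit test, the pressure amplitude of equation (2), and the internal resistance R_noise. The equivalent-diameter form of the Reynolds number and the air viscosity value are assumptions; the paper does not spell them out here.

// Noise-source activation at one candidate constriction, following eq. (2)
// and R_noise = rho*|Uc|/(2*Ac^2). The Reynolds-number formula is an assumed
// equivalent-diameter form.
#include <cmath>
#include <cstdio>
#include <cstdlib>

int main() {
    const double PI     = 3.141592653589793;
    const double rho    = 1.14e-3;   // air density (g/cm^3)
    const double mu     = 1.86e-4;   // dynamic viscosity of air (g/(cm*s)), assumed
    const double ReCrit = 2000.0;    // critical Reynolds number [9]
    const double Athr   = 0.2;       // constriction candidate threshold (cm^2)

    double Ac = 0.12;                // constriction area (cm^2), example value
    double Uc = 320.0;               // volume velocity at the constriction (cm^3/s), example value

    if (Ac < Athr) {
        double d  = std::sqrt(4.0 * Ac / PI);            // equivalent diameter (assumed form)
        double Re = rho * (Uc / Ac) * d / mu;            // Reynolds number (assumed form)
        if (Re > ReCrit) {
            double rnd    = std::rand() / (double)RAND_MAX;             // random factor in eq. (2)
            double Pnoise = 2e-6 * rnd * (Re * Re - ReCrit * ReCrit);   // source pressure (dyne/cm^2)
            double Rnoise = rho * std::fabs(Uc) / (2.0 * Ac * Ac);      // internal source resistance
            double Unoise = Pnoise / Rnoise;                            // turbulent flow
            std::printf("Re = %.0f  Pnoise = %.2f  Rnoise = %.2f  Unoise = %.2f\n",
                        Re, Pnoise, Rnoise, Unoise);
        } else {
            std::printf("Re = %.0f is below the critical value; no noise source.\n", Re);
        }
    }
    return 0;
}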

3.4.2. Propagation and radiation

The general problem associated with having N noise sources is decomposed into N simpler problems by using the superposition principle. In order to calculate the radiated pressure at the lips due to each noise source, the vocal tract is divided into three sections: the pharyngeal section, the region between the velum coupling point and the noise source, and the region after the source. Data structures based on the area function of each section are defined and ABCD matrices calculated [9]. The ABCD matrices were then used to calculate the downstream (Z1) and upstream (Z2) input impedances, as well as the transfer function H, given by
\[
H = \frac{Z_1}{Z_1 + Z_2}\,\frac{1}{C Z_{\text{rad}} + D},
\tag{3}
\]
where C and D are parameters of the ABCD matrix (from the noise source to the lips), and Zrad is the lip radiation impedance. The radiated pressure at the lips due to a specific source is given by pradiated(n) = h(n) ∗ unoise(n), where h(n) = IFFT(H). The output sound pressures due to the different noise sources are added together. The output sound pressure resulting from the excitation of the vocal tract by a glottal source is also added when there is voicing.

4. RESULTS

In this section, we present examples of simulation experiments performed with the synthesizer and two perceptual studies regarding European Portuguese nasal vowels.


Figure 5: Movement of the velum and oral articulators for a nasal vowel between two stop consonants (CVC context). The three phases of a nasal vowel in this context are shown.

We start with the description of the perceptual tests; then recent results in fricative synthesis; finally, examples of produced words and quality tests are presented.

4.1. Nasal vowel studies

The synthesizer was used to produce stimuli for several perceptual tests, most of them for studies of nasal vowels. Next, we present two representative studies: the first investigates the effect of the variation of the velum and other oral articulators over time; the second addresses source-tract interaction effects in nasal vowels.

Experiment 1. Study of the influence of velum variation on the perception of nasal vowels in CVC contexts [39].

Several studies point to the need to regard speech as a dynamic phenomenon. The influence of dynamic information on oral vowel perception has been a subject of study for many years. In addition, some researchers also see nasal vowels as dynamic. To produce high-quality synthetic nasal vowels, it would be useful to know to what extent we need to include dynamic information.

We investigated whether, to produce a good-quality Portuguese nasal vowel, it is enough to couple the nasal tract, or whether varying the degree of coupling over time improves quality. The null hypothesis is that static and dynamic velum movements will produce stimuli of similar quality.

Our first tests addressed the CVC context, nasal vowels between stops, the most common context for nasal vowels in Portuguese.

Velum and oral passage aperture variation for a nasal vowel produced between stop consonants is represented schematically in Figure 5. During the first stop consonant, the nasal and oral passages are closed. The beginning of the nasal vowel coincides with the release of the oral occlusion. To produce the nasal vowel, both the oral passage and the velum must be open. Possibly due to the slow speed of velum movements, in European Portuguese there is a period of time where the oral passage is open and the velum is in a closed, or almost closed, position, producing a sound with oral vowel characteristics, represented in Figure 5 by V. The velum continues its opening movement, creating simultaneous sound propagation in the oral and nasal tracts; this zone is represented by Vn. The oral passage must close for the following stop consonant, so the early oral closure (before the velar closure) creates a zone with only nasal radiation, represented by N. The place of articulation of this nasal consonant, created by coarticulation, is the same as that of the following stop.

Stimuli

For this experiment, 3 variants of each of the 5 EP nasal vowels were produced, differing in the way velum movement was modeled. For the first variant, called static, the velum was open at a fixed value during the whole vowel production. The other two variants used time-varying velum opening: in the first 100 milliseconds, the velum stayed closed, making an opening transition in 60 milliseconds to the maximum aperture, and then remaining open. In one of these variants, a final bilabial stop consonant, [m], was created at the end by lip closure at 250 milliseconds. All stimuli had a fixed duration of 300 milliseconds.

Listeners

A total of 11 European Portuguese native speakers, 9 male and 2 female, participated in the test. They had no history of speech, hearing, or language impairments.

Procedure

We used a paired comparison test [40, page 361], because we were analysing synthesis quality, despite the demand for more decisions by each listener, which also increases test duration. The question answered by the listeners was as follows: which of the two stimuli do you prefer as a European Portuguese nasal vowel? In preparing the test, we noticed that listeners had, in some cases, difficulty in choosing the preferred stimulus. The causes were traced to either good or poor quality of both stimuli. To handle this situation, we added two new possibilities, for a total of four possible answers: first, second, both, and none.

The test was divided into two parts. In the first part, we compared static versus dynamic velum stimuli. In the second part, the comparison was made between dynamic stimuli with and without a final bilabial nasal consonant. Stimuli were presented 5 times in both AB and BA order. The interstimulus interval was 600 milliseconds.

The results for each possible pair of stimuli in the test were checked for listener consistency. They were retained if the listener preferred one stimulus in more than 70% of the presentations. Only clear choices of one stimulus over the others were analyzed.

Results

Variable velum preferred to static velum. Preference scores (percentage of presentations in which the designated stimulus was chosen as the preferred one) for fixed velum aperture, variable velum aperture, and the difference between the two are presented in the boxplots of Figure 6.

Clearly, listeners preferred stimuli with time-variable velum aperture. The average preference, including all vowels and listeners, was as high as 71.8%. The confidence interval (CI p = 0.95) for the difference in preference score was between 24.2% and 65.6%, in favour of the variable velum case.

Repeated-measures ANOVA showed a significant velum variation effect [F(1, 10) = 5.67, p < 0.05] and nonsignificant (p > 0.05) effects of vowel and of the interaction between the two main factors (vowel and velum variation).


Figure 6: Boxplots of the preference scores for the first part of the perceptual test for nasal vowels in CVC context, comparing stimuli with fixed and variable velum apertures, showing the effect of the velum aperture variation.

Figure 7: Boxplots of the preference scores for the second part of the perceptual test for nasal vowels in CVC context, comparing stimuli with and without a final nasal consonant, showing the effect of the final nasal consonant.

Nasal consonant at nasal vowel end was preferred. In general, listeners preferred stimuli ending in a nasal consonant. Looking at the preference scores represented graphically in Figure 7, stimuli with a final nasal consonant were preferred over stimuli without the final consonant. The confidence interval (CI p = 0.95) for the difference in preference score was between 36.1% and 87.0%, in favour of the stimuli with a final nasal consonant.

Figure 8: Glottal wave of 3 variants of vowel [ı]: (a) without tract load (no interaction); (b) with total tract load; (c) with tract input impedance calculated discarding the nasal tract input impedance.

Figure 9: Input impedance for vowel [ı], with and without the nasal tract input impedance.

ANOVA results, with two main factors, confirmed a significant effect of the final nasal consonant [F(1, 8) = 9.5, p < 0.05] and nonsignificant (p > 0.05) vowel effect and interaction between the main factors.

Experiment 2. Study of source-tract interaction for nasal vowels [29].

We investigated whether the extra coupling of the nasal tract in nasal vowels produced identifiable alterations in the glottal source due to source-tract interaction, and whether modeling such effects resulted in more natural-quality synthetic speech.

Figure 8 depicts the effect of the 3 different input impedances on nasal vowel [ı]. The nasal tract load has a great influence on the glottal source wave because of the noticeable difference in the input impedance calculated with or without the nasal tract input impedance, shown in Figure 9. This difference is due to the fact that for high vowels such as [ı] the impedance load of the pharyngeal region, which is equal to the parallel combination of the oral cavity and nasal tract input impedances, is almost equal to the nasal input impedance (see Figure 10). The effect is less noticeable in low vowels, such as [5].

Stimuli

Stimuli were produced for the EP nasal vowels varying only one factor: the input impedance of the tract used by the interactive source model. This factor had 3 values: (1) input impedance including the effect of all supraglottal cavities; (2) input impedance calculated without taking into account the nasal tract coupling; or (3) no tract load. Only 3 vowels, [5], [ı], and [u], were considered, to reduce test realization time.

Figure 10: Input impedances in the velum region for nasal vowel [ı]. The figure presents the oral input impedance (Zin oral), the nasal tract input impedance (Zin nasal), and the equivalent parallel impedance (parallel). The parallel impedance is, for this vowel, approximately equal to the nasal tract input impedance.

The same timing was used for all vowels. In the first 100 milliseconds, the velum stayed closed, making an opening transition in 60 milliseconds to the maximum value. The velum remained at this maximum until the end of the vowel. The stimuli ended with a nasal consonant, a bilabial [m], produced by closing the lips. The closing movement of the lips started at 200 milliseconds and ended 50 milliseconds later. Stimulus duration was fixed at 300 milliseconds for all vowels. These choices were based on the results of Experiment 1, where dynamic velum stimuli were preferred.

The interactive source model was used with variable F0. F0 starts around 100 Hz, rises to 120 Hz in the first 100 milliseconds, and then gradually goes back down to 100 Hz. The open quotient was 60% and the speed quotient 2. Jitter and shimmer were added to improve naturalness.

Listeners

A total of 14 European Portuguese native speakers, 11 male and 3 female, participated in the test. They had no history of speech, hearing, or language impairments.

Procedure

A 4IAX (four-interval forced-choice) discrimination test was performed to investigate whether listeners were able to perceive changes in the glottal excitation caused by the additional coupling of the nasal tract.

The 4IAX test was chosen, instead of the more commonly used ABX test, because better discrimination results have been reported with this type of perceptual test [4].

In the 4IAX paradigm, listeners hear two pairs of stimuli, with a small interval in between. The members of one pair are the same (AA); the members of the other pair are different (AB). Listeners have to decide which of the two pairs contains different stimuli.

Table 2: Results of the 4IAX test.

Listener Sex [5] [ı] [u] Average

1 M 50.0 33.3 41.7 41.7

2 M 58.3 100.0 50.0 69.4

3 F 50.0 41.7 50.0 47.2

4 F 33.3 83.3 66.7 61.0

5 M 16.7 58.3 33.3 36.1

6 M 66.7 66.7 66.7 66.7

7 M 50.0 50.0 41.7 47.2

8 M 58.3 58.3 41.7 52.8

9 F 41.7 50.0 66.7 52.7

10 M 58.3 50.0 58.3 55.6

11 M 33.3 83.3 58.3 58.3

12 M 75.0 58.3 58.3 63.9

13 M 50.0 41.7 33.3 41.4

14 M 83.3 50.0 58.3 63.8

Average — 51.8 58.9 51.8 54.1

Std. — 17.3 18.6 11.9 10.3

Signals were presented over headphones in rooms with low ambient noise. Each of the 4 combinations (ABAA, ABBB, AAAB, and BBAB) was presented 3 times in a random order. With this arrangement, each pair to be tested appears 12 times. The order was different for each listener. The interstimulus interval was 400 milliseconds and the interpair interval was 700 milliseconds.

Results

Table 2 shows the percentage of correct answers for the 4IAX test. The table presents results for each listener and vowel. The statistics (mean and standard deviation) for each vowel, and for the 3 vowels together, are presented at the bottom of the table. Results are condensed, in graphical form, in Figure 11.

From the table and the boxplots, it is clear that the listeners' correct answers were close to 50%, being a little higher for the nasal vowel [ı]. These results indicate that the differences between stimuli are difficult for listeners to perceive.

Statistical tests, with null hypothesis H0 : µ = 50 and alternative H1 : µ > 50, were only significant, at a 5% level of significance, for [ı]. For this vowel, the 95% confidence interval for the mean was between 50.1 and 67.7. For [5], we obtained p = 0.36 and for [u], p = 0.29. For the 3 vowels considered together, the average was also not significantly above 50% (p = 0.08).

Discussion

Simulations showed some small effects of the nasal tract load on the time- and frequency-domain properties of the glottal wave. Results of perceptual tests, conducted to study to what extent these alterations were perceived by listeners, supported the idea that these changes are hardly perceptible. These results agree with results reported in [41]. In that work, Titze and Story reported that "An open nasal port . . . showed no measurable effect on oscillation threshold pressure or glottal flow."


Figure 11: Boxplot of the 4IAX discrimination test results for evaluation of the listeners' ability to perceive the effects of source-tract interaction on nasal vowels.

There is, however, a tendency for the effect of interaction to be more perceptible for the high vowel [ı], produced with a reduced vocal cavity. Our simulation results suggest as an explanation for this difference the relation between the nasal tract input impedance and the impedance of the vocal cavity at the nasal tract coupling point.

4.2. Fricatives

In a first experiment, the synthesizer was used to produce sustained unvoiced fricatives [42]. The vocal tract configuration derived from a natural high vowel was adjusted by raising the tongue tip in order to produce a sequence of reduced vocal tract cross-sectional areas. The lung pressure was linearly increased and decreased at the beginning and end of the utterance, to produce a gradual onset and offset of the glottal flow.

The second goal was to synthesize fricatives in VCV sequences [42]. Articulatory configurations for the vowels were obtained by inversion [43]. The fricative segment was obtained by manual adjustment of articulatory parameters. For example, to define a palato-alveolar fricative configuration for the fricative in [iSi], we used the configuration of vowel [i] and only changed the tongue tip articulator to a raised position, ensuring a cross-sectional area small enough to activate noise sources.

For [ifi], besides raising the tongue tip, as described for [iSi], we used lip opening to create the necessary small-area passage at the lips. Synthesis results for the nonsense word /ifi/ are shown in Figure 12.

An F0 value of 100 Hz and a maximum glottal opening of 0.3 cm2 were used to synthesize the vowels. The time trajectory of the glottal source parameter Ag max rises to 2 cm2 at the fricative middle point and, at the end of the fricative, returns to the value used during vowel production.

Figure 12: Synthetic [ifi], showing speech signal and spectrogram.

Figure 13: Synthetic [ivi], showing speech signal and spectrogram.

Nonsense words with voiced fricatives were also produced, keeping the vocal folds vibrating throughout the fricative. Results for the [ivi] sequence are presented in Figure 13.

4.3. Words

The synthesizer is also capable of producing words containing vowels (oral or nasal), nasal consonants, and (lower-quality) stops.

To produce such words, and since the synthesizer is not connected to the linguistic and prosodic components of a text-to-speech system, we used the following manual process (a schematic sketch of such a specification is given after the list):

(1) obtaining durations for each phonetic segment entering the word composition (presently by direct analysis of natural speech, although an automatic process, such as a CART tree, can be used in the future);

(2) obtaining oral articulators' configurations for each of the phones. For vowels, we used configurations obtained by an inversion process based on the natural vowels' first four formants [43, 44]. These configurations were already available from previous work [39, 43]. For the consonants, for which we do not yet have an inversion process, configurations were obtained manually, based on articulatory phonetics descriptions and published X-ray and MRI images;

(3) velum trajectory definition, using adequate values for each vowel and consonant;

(4) setting glottal source parameters, in particular, the fundamental frequency (F0).
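As referenced above, the sketch below shows one way such a manual specification could be organized as data: one record per phonetic segment, carrying the duration, a named articulatory configuration, a velum target, and an F0 target. The field names, configuration names, and the split of the diphthong duration are hypothetical and do not reproduce the synthesizer's parameter files.

// Hypothetical data layout for the manual specification steps listed above
// (durations, articulatory targets, velum trajectory, F0 targets).
#include <cstdio>
#include <string>
#include <vector>

struct SegmentSpec {
    std::string phone;     // phonetic segment label
    double duration_ms;    // step (1): duration from natural-speech analysis
    std::string tractCfg;  // step (2): name of a stored articulatory configuration
    double velum;          // step (3): velum aperture target (0 = closed, 1 = open)
    double f0_hz;          // step (4): F0 target at segment end
};

int main() {
    // Rough specification in the spirit of the word "mao" example presented below.
    std::vector<SegmentSpec> word = {
        {"m",  100.0, "bilabial_nasal", 0.6, 130.0},
        {"5~", 230.0, "open_central",   1.0, 150.0},
        {"u~", 235.0, "close_back",     0.4,  80.0},
    };
    double t = 0.0;
    for (const SegmentSpec& s : word) {
        std::printf("%6.1f-%6.1f ms  %-3s  cfg=%-15s  velum=%.1f  F0=%.0f Hz\n",
                    t, t + s.duration_ms, s.phone.c_str(), s.tractCfg.c_str(),
                    s.velum, s.f0_hz);
        t += s.duration_ms;
    }
    return 0;
}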

We first attempted to synthesize words containing nasal sounds, due to their relevance in the Portuguese language [45]. We now present three examples of synthetic words: mão, mãe, and António.

Example 1 (word mão (hand)). First, from natural speech analysis, we measured durations of 100 milliseconds for the [m] and 465 milliseconds for the nasal diphthong.

In this case, the [m] configuration was obtained manually, and the configurations for [a] and [u] were obtained by an inversion process [43, 46]. The three configurations are presented in Figure 14.

Figure 14: Tract configurations used to synthesize the word mão (hand): (a) [m], (b) [a], and (c) [u].

A velum trajectory was defined, based on articulatory descriptions of the intervening sounds. As shown in Figure 15, the velum starts closed, in a preproduction position, opens for the nasal consonant, opens more during the first vowel of the diphthong, and finally rises towards closure in the second part of the diphthong.

Fundamental frequency, F0, and other source parameters were also defined. F0 starts at 120 Hz, increases to 130 Hz at the end of the nasal consonant, then to 150 Hz to stress the initial part of the diphthong, and finally decreases to 80 Hz at the end of the word. This variation in time was based, partially, on the F0 contour of natural speech. Values of 60% for the open quotient (OQ) and 2 for the speed quotient (SQ) were used. Jitter, shimmer, and source-tract interaction were also used.

Figure 15: Velum trajectory used to synthesize the word mão (hand).

Figure 16: Spectrogram of the word mão produced by the articulatory synthesizer.

Two versions were produced: with and without lip closure at the end of the word. Due to the open state of the velum, this final oral closure results in a final nasal consonant [m]. The spectrogram of this last version is presented in Figure 16.

Example 2 (word mãe (mother)). A possible phonetic transcription for the word mãe (mother) is ['m5ıñ], including a palatal nasal consonant at the end [45, page 292]. Keeping the oral passage open at the end of the word produced a variant. Due to the lack of precise information regarding the oral tract configuration during production of [5ı], we produced variants differing in the configuration used for the nasal vowel [5]. One version was produced using the configuration of the oral vowel [a]; another, with a higher tongue position, used the configuration of vowel [5]. Another parameter varied was F0: versions with values obtained by analysis of natural speech, and versions with synthetic F0. For the synthetic case, a further variation was used: the inclusion or not of source-tract interaction. Figure 17 shows the speech signal and respective spectrogram for nonnatural F0, source-tract interaction, the configuration of [a] for nasal vowel [5], and final palatal occlusion.

Example 3 (word António). The first name of the first author, António [5'tOnju], was also synthesized using the same process as in the two previous examples. This word has a nasal vowel at the beginning, a stop, an oral vowel, a nasal


Figure 17: Speech signal and spectrogram of one of the versions of the word mãe synthesized using an [a] configuration at the beginning of the nasal diphthong, oral occlusion at the end, source-tract interaction, and synthetic values for F0.

consonant, and a final oral diphthong. Two versions were produced: one with natural F0 and another with synthetic F0. The signal and its spectrogram obtained for the first version are presented in Figure 18. The stop consonant [t] was obtained by closing and opening the oral passage, without modeling phenomena that are important for the perception of a natural-quality stop, such as the voice onset time (VOT) and the aspiration at the release of the closure.

Figure 18: Speech signal and spectrogram for the synthetic word António [5'tOnju] produced using F0 extracted from a natural pronunciation.

As part of a mean opinion score (MOS) quality test, this and many other stimuli produced by our synthesizer were evaluated. To document the quality level achieved by our models, Table 3 shows the ratings of the various versions of the 3 examples presented above. The normalized (to 5) results varied between 3 and 4 (from fair to good). The top-rated word obtained 3.7 (3.4 without normalization).

5. CONCLUSION

From the experience with simulations and perceptual testsusing stimuli generated by our articulatory synthesizer, webelieve that articulatory synthesis is a powerful approach tospeech synthesis because of its anthropomorphic origin andand it allows us to address questions regarding human speechproduction and perception.

We developed a modular articulatory synthesizer archi-tecture for Portuguese, using object-oriented programming.Separation of control, model, and viewer allows the additionof new models without major changes to the user interface.Implemented models comprise a glottal interactive sourcemodel, a flexible nasal tract area model, and a hybrid acousticmodel capable of dealing with asymmetric nasal tract config-


Figure 18: Speech signal and spectrogram for the synthetic word António [ɐ̃ˈtɔnju] produced using F0 extracted from a natural pronunciation.


The synthesizer has been used mainly in the production of stimuli for perceptual tests of Portuguese nasal vowels (e.g., [39, 47, 48]). The two studies on nasal vowels reported in this paper were only possible with the articulatory approach to speech synthesis, which allows the creation of stimuli by direct and precise control of the articulators and the glottal source. They illustrate the potential of articulatory synthesis in production and perception studies, as well as the flexibility of our synthesizer.

Perceptual tests and simulations contributed to improving our knowledge of EP nasal sounds, namely the following.

(1) It is necessary to include the time variation of velum aperture, combined with the time variation of the articulators controlling the oral passage, in order to synthesize high-quality nasal vowels.

(2) Nasality is not controlled solely by the velum movement. Oral passage reduction, or occlusion, can also be used to improve nasal vowel quality. When nasal vowels were word-final, lip or tongue movement, even without occlusion, improved the quality of the synthesized nasal vowel by increasing the predominance of nasal radiation. Oral occlusion before stops, due to coarticulation, also contributes to the improvement of nasal quality.

(3) The source-tract interaction effect due to the extra coupling of the nasal tract is not easily perceived. Discrimination was significantly above chance level only for the high vowel [ĩ], which can possibly be explained by the relation between the nasal and oral input impedances at the nasal tract coupling point (see the standard relation recalled after this list).
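For reference, the standard parallel-branch relation behind this explanation (textbook acoustics, stated here under the simplifying assumption of linear branch impedances, not a result derived in this paper) is that the oral and nasal branches share the sound pressure at the velopharyngeal coupling point, so the volume velocity divides according to the branch admittances:

\[
\frac{U_n}{U} \;=\; \frac{1/Z_n}{\,1/Z_n + 1/Z_o\,} \;=\; \frac{Z_o}{Z_o + Z_n},
\]

where U is the total volume velocity reaching the coupling point and Z_o, Z_n are the oral and nasal input impedances seen from that point. The narrow oral constriction of a high vowel such as [ĩ] tends to raise Z_o, diverting a larger fraction of the flow into the nasal branch and making the change of load seen by the glottal source, and hence the interaction effect, more likely to be audible, which is consistent with the perceptual result above.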


Table 3: Quality ratings for several words produced by the synthesizer. For each word, the table includes the mean opinion score (MOS), its respective 95% confidence interval, and the normalized value resulting from scaling natural speech scores to 5.

Word            F0         Interac.  Observ.            MOS   CI 95%     Norm.
mão             Synthetic  Yes       no [m] at end      3.4   [3.0–3.7]  3.7
mão             Synthetic  Yes       [m] at end         3.0   [2.7–3.4]  3.3
mãe             Natural    Yes       [a], [ɲ] at end    2.9   [2.6–3.3]  3.2
mãe             Synthetic  Yes       [a], [ɲ] at end    3.1   [2.7–3.4]  3.3
mãe             Synthetic  Yes       [ɐ], [ɲ] at end    2.9   [2.6–3.3]  3.2
mãe             Synthetic  Yes       [a], no [ɲ]        3.0   [2.6–3.4]  3.3
mãe             Synthetic  No        [ɐ], [ɲ] at end    2.9   [2.5–3.3]  3.1
mãe             Synthetic  No        [a], [ɲ] at end    2.8   [2.5–3.2]  3.1
António         Natural    Yes       —                  3.0   [2.8–3.2]  3.3
António         Synthetic  Yes       —                  2.7   [2.4–2.9]  2.9
Natural speech  —          —         —                  4.6   —          5.0
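The normalization used in Table 3 is a straightforward rescaling so that the natural-speech reference maps to 5. The snippet below simply reproduces that arithmetic; the function name is ours, and only the reference score of 4.6 is taken from the table.

def normalize_mos(mos: float, natural_mos: float = 4.6, top: float = 5.0) -> float:
    """Rescale a raw MOS so that the natural-speech reference maps to `top`."""
    return round(mos / natural_mos * top, 1)

# Example: the top-rated version of "mão" (raw MOS 3.4) normalizes to 3.7,
# and the natural-speech reference maps to 5.0 by construction.
assert normalize_mos(3.4) == 3.7
assert normalize_mos(4.6) == 5.0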

A nasal vowel, at least in European Portuguese, is not a sound obtained only by lowering the velum. The way this aperture and the other articulators vary in time is important; in particular, modeling how the velum and the oral articulators vary in the various contexts improves quality.

With the addition of noise source models and modifications to the acoustic model, our articulatory synthesizer is capable of producing sustained fricatives and fricatives in VCV sequences. First results were presented and judged highly intelligible in informal listening tests. Our model of fricatives is comprehensive and flexible, making the new version of SAPWindows a valuable tool for trying out new or improved source models and for running production and perceptual studies of European Portuguese fricatives [49]. The possibility of automatically inserting and removing noise sources along the oral tract is a feature we regard as having great potential.
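To illustrate what such automatic placement of a noise source could look like in code, the sketch below positions a frication source just downstream of the narrowest tract section when the constriction is tight enough for turbulence. This is a generic illustration of the idea under assumed names and an assumed gain law; it is not the SAPWindows implementation.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NoiseSource:
    section: int   # index of the tract section where noise is injected
    gain: float    # source strength (illustrative units)

def place_frication_source(areas: List[float], threshold_cm2: float = 0.2) -> Optional[NoiseSource]:
    """Insert a noise source just downstream of the tightest constriction,
    but only if the constriction is narrow enough to generate turbulence."""
    min_idx = min(range(len(areas)), key=lambda i: areas[i])
    if areas[min_idx] > threshold_cm2:
        return None  # no constriction narrow enough: remove/skip the source
    downstream = min(min_idx + 1, len(areas) - 1)
    # Simple illustrative gain law: stronger noise for tighter constrictions.
    gain = 1.0 - areas[min_idx] / threshold_cm2
    return NoiseSource(section=downstream, gain=gain)

# Example: an area function (cm^2) with a narrow constriction at index 5.
areas = [3.0, 2.5, 2.0, 1.0, 0.4, 0.1, 0.8, 2.0]
print(place_frication_source(areas))   # NoiseSource(section=6, gain=0.5)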

The SAPWindows articulatory synthesizer is useful in phonetics research and teaching. We explored the first area for several years with very interesting results, as shown in this paper. Recently, we started exploring the second area, aiming at using the synthesizer in phonetics teaching at our University's Languages and Cultures Department. Articulatory synthesis is also of interest in the field of speech therapy because of its potential to model different speech pathologies.

Development of this synthesizer is an unfinished task. The addition of new models for other Portuguese sounds, the use of combined data (MRI, EMA, EPG, etc.) for a detailed description of the vocal tract configurations and an optimal match between the synthesized and the natural Portuguese spectra [49], and the integration of the synthesizer into a text-to-speech system are planned as future work.

ACKNOWLEDGMENTS

This work was partially funded by the first author's Ph.D. Scholarship BD/3495/94 and the project “Articulatory Synthesis of Portuguese” P/PLP/11222/1998, both from the Portuguese Research Foundation (FCT) PRAXIS XXI program. We also thank the University of Florida's MMIRC, headed by Professor D. G. Childers, where this work started.

REFERENCES

[1] R. Linggard, Electronic Synthesis of Speech, Cambridge University Press, Cambridge, UK, 1985.

[2] M. R. Schroeder, Computer Speech: Recognition, Compression, Synthesis, vol. 35 of Springer Series in Information Sciences, Springer-Verlag, New York, NY, USA, 1999.

[3] J.-P. Tubach, “Présentation générale,” in Fondements et Perspectives en Traitement Automatique de la Parole, H. Meloni, Ed., Universités Francophones, 1996.

[4] G. J. Borden, K. S. Harris, and L. J. Raphael, Speech Science Primer—Physiology, Acoustics, and Perception of Speech, Lippincott Williams & Wilkins, 4th edition, 2003.

[5] D. Klatt, “Software for a cascade/parallel formant synthesizer,” Journal of the Acoustical Society of America, vol. 67, no. 3, pp. 971–995, 1980.

[6] P. Rubin, T. Baer, and P. Mermelstein, “An articulatory synthesizer for perceptual research,” Journal of the Acoustical Society of America, vol. 70, no. 2, pp. 321–328, 1981.

[7] S. Maeda, “The role of the sinus cavities in the production of nasal vowels,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’82), vol. 2, pp. 911–914, Paris, France, May 1982.

[8] T. Koizumi, S. Taniguchi, and S. Hiromitsu, “Glottal source-vocal tract interaction,” Journal of the Acoustical Society of America, vol. 78, no. 5, pp. 1541–1547, 1985.

[9] M. M. Sondhi and J. Schroeter, “A hybrid time-frequency domain articulatory speech synthesizer,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, no. 7, pp. 955–967, 1987.

[10] C. H. Shadle and R. Damper, “Prospects for articulatory synthesis: A position paper,” in Proc. 4th ISCA Tutorial and Research Workshop (ITRW ’01), Perthshire, Scotland, August–September 2001.

[11] D. H. Whalen, “Articulatory synthesis: Advances and prospects,” in Proc. 15th International Congress of Phonetic Sciences (ICPhS ’03), pp. 175–177, Barcelona, Spain, August 2003.

[12] B. Kühnert and F. Nolan, “The origin of coarticulation,” Forschungsberichte des Instituts für Phonetik und Sprachliche Kommunikation der Universität München (FIPKM), vol. 35, pp. 61–75, 1997, and also in [50]. Online. Available: http://www.phonetik.uni-muenchen.de/FIPKM/index.html.

[13] A. Pinto and A. M. Tomé, “Automatic pitch detection and MIDI conversion for the singing voice,” in Proc. WSES International Conferences: AITA ’01, AMTA ’01, MCBE ’01, MCBC ’01, pp. 312–317, Greece, 2001.

[14] A. M. Öster, D. House, A. Protopapas, and A. Hatzis, “Presentation of a new EU project for speech therapy: OLP (Ortho-Logo-Paedia),” in Proc. TMH-QPSR, Fonetik 2002, vol. 44, pp. 45–48, Stockholm, Sweden, May 2002.

[15] F. S. Cooper, “Speech synthesizers,” in Proc. 4th International Congress of Phonetic Sciences (ICPhS ’61), A. Sovijärvi and P. Aalto, Eds., pp. 3–13, Mouton, The Hague, Helsinki, Finland, September 1961.

[16] M. Wrembel, “Innovative approaches to the teaching of practical phonetics,” in Proc. Phonetics Teaching & Learning Conference (PTLC ’01), London, UK, April 2002.

[17] S. S. Fels, F. Vogt, B. Gick, C. Jaeger, and I. Wilson, “User-centred design for an open source 3-D articulatory synthesizer,” in Proc. 15th International Congress of Phonetic Sciences (ICPhS ’03), vol. 1, pp. 179–183, Barcelona, Spain, August 2003.

[18] K. Iskarous, L. Goldstein, D. H. Whalen, M. K. Tiede, and P. E. Rubin, “CASY: The Haskins configurable articulatory synthesizer,” in Proc. 15th International Congress of Phonetic Sciences (ICPhS ’03), vol. 1, pp. 185–188, Barcelona, Spain, August 2003.

[19] S. Maeda and M. Toda, “Mechanical properties of lip movements: How to characterize different speaking styles?” in Proc. 15th International Congress of Phonetic Sciences (ICPhS ’03), vol. 1, pp. 189–192, Barcelona, Spain, August 2003.

[20] P. Badin, G. Bailly, F. Elisei, and M. Odisio, “Virtual talking heads and audiovisual articulatory synthesis,” in Proc. 15th International Congress of Phonetic Sciences (ICPhS ’03), vol. 1, pp. 193–197, Barcelona, Spain, August 2003.

[21] K. N. Stevens and H. M. Hanson, “Production of consonants with a quasi-articulatory synthesizer,” in Proc. 15th International Congress of Phonetic Sciences (ICPhS ’03), vol. 1, pp. 199–202, Barcelona, Spain, August 2003.

[22] P. Mermelstein, “Articulatory model for the study of speech production,” Journal of the Acoustical Society of America, vol. 53, no. 4, pp. 1070–1082, 1973.

[23] J. Schroeter and M. M. Sondhi, “Speech coding based on physiological models of speech production,” in Advances in Speech Signal Processing, Marcel Dekker, New York, NY, USA, 1992.

[24] P. P. L. Prado, A target-based articulatory synthesizer, Ph.D. dissertation, University of Florida, Gainesville, Fla, USA, 1991.

[25] A. Teixeira, F. Vaz, and J. C. Príncipe, “A comprehensive nasal model for a frequency domain articulatory synthesizer,” in Proc. 10th Portuguese Conference on Pattern Recognition (RecPad ’98), Lisbon, Portugal, March 1998.

[26] M. Chen, “Acoustic correlates of English and French nasalized vowels,” Journal of the Acoustical Society of America, vol. 102, no. 4, pp. 2360–2370, 1997.

[27] J. Dang and K. Honda, “MRI measurements and acoustic investigation of the nasal and paranasal cavities,” Journal of the Acoustical Society of America, vol. 94, no. 3, pp. 1765–1765, 1993.

[28] K. N. Stevens, Acoustic Phonetics, Current Studies in Linguistics, MIT Press, Cambridge, Mass, USA, 1998.

[29] A. Teixeira, F. Vaz, and J. C. Príncipe, “Effects of source-tract interaction in perception of nasality,” in Proc. 6th European Conference on Speech Communication and Technology (EUROSPEECH ’99), vol. 1, pp. 161–164, Budapest, Hungary, September 1999.

[30] D. Allen and W. Strong, “A model for the synthesis of natural sounding vowels,” Journal of the Acoustical Society of America, vol. 78, no. 1, pp. 58–69, 1985.

[31] T. V. Ananthapadmanabha and G. Fant, “Calculation of true glottal flow and its components,” Speech Communication, vol. 1, no. 3-4, pp. 167–184, 1982.

[32] L. Silva, A. Teixeira, and F. Vaz, “An object oriented articulatory synthesizer for Windows,” Revista do Departamento de Electrónica e Telecomunicações, Universidade de Aveiro, vol. 3, no. 5, pp. 483–492, 2002.

[33] E. L. Riegelsberger, The acoustic-to-articulatory mapping of voiced and fricated speech, Ph.D. dissertation, The Ohio State University, Columbus, Ohio, USA, 1997.

[34] Q. Lin, “A fast algorithm for computing the vocal-tract impulse response from the transfer function,” IEEE Trans. Speech Audio Processing, vol. 3, no. 6, pp. 449–457, 1995.

[35] M. Frigo and S. Johnson, “FFTW: an adaptive software architecture for the FFT,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’98), vol. 3, pp. 1381–1384, Seattle, Wash, USA, 1998.

[36] S. S. Narayanan and A. A. H. Alwan, “Noise source models for fricative consonants,” IEEE Trans. Speech Audio Processing, vol. 8, no. 3, pp. 328–344, 2000.

[37] J. L. Flanagan, Speech Analysis, Synthesis and Perception, Springer-Verlag, New York, NY, USA, 2nd edition, 1972.

[38] C. H. Shadle, “Articulatory-acoustic relationships in fricative consonants,” in Speech Production and Speech Modelling, W. J. Hardcastle and A. Marchal, Eds., pp. 187–209, Kluwer Academic, Dordrecht, The Netherlands, 1990.

[39] A. Teixeira, F. Vaz, and J. C. Príncipe, “Influence of dynamics in the perceived naturalness of Portuguese nasal vowels,” in Proc. 14th International Congress of Phonetic Sciences (ICPhS ’99), San Francisco, Calif, USA, August 1999.

[40] N. Kitawaki, “Quality assessment of coded speech,” in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds., chapter 12, pp. 357–385, Marcel Dekker, New York, NY, USA, 1992.

[41] I. R. Titze and B. H. Story, “Acoustic interactions of the voice source with the lower vocal tract,” Journal of the Acoustical Society of America, vol. 101, no. 4, pp. 2234–2243, 1997.

[42] A. Teixeira, L. M. T. Jesus, and R. Martinez, “Adding fricatives to the Portuguese articulatory synthesizer,” in Proc. 8th European Conference on Speech Communication and Technology (EUROSPEECH ’03), pp. 2949–2952, Geneva, Switzerland, September 2003.

[43] A. Teixeira, F. Vaz, and J. C. Príncipe, “A software tool to study Portuguese vowels,” in Proc. 5th European Conference on Speech Communication and Technology (EUROSPEECH ’97), G. Kokkinakis, N. Fakotakis, and E. Dermatas, Eds., vol. 5, pp. 2543–2546, Rhodes, Greece, September 1997.

[44] A. Teixeira, F. Vaz, and J. C. Príncipe, “Some studies of European Portuguese nasal vowels using an articulatory synthesizer,” in Proc. 5th IEEE International Conference on Electronics, Circuits and Systems (ICECS ’98), vol. 3, pp. 507–510, Lisbon, Portugal, September 1998.

[45] J. Laver, Principles of Phonetics, Cambridge Textbooks in Linguistics, Cambridge University Press, Cambridge, UK, 1st edition, 1994.

[46] D. G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, New York, NY, USA, 2000.

[47] A. Teixeira, L. C. Moutinho, and R. L. Coimbra, “Production, acoustic and perceptual studies on European Portuguese nasal vowels height,” in Proc. 15th International Congress of Phonetic Sciences (ICPhS ’03), Barcelona, Spain, August 2003.

[48] A. Teixeira, F. Vaz, and J. C. Príncipe, “Nasal vowels following a nasal consonant,” in Proc. 5th Seminar on Speech Production: Models and Data, pp. 285–288, Bavaria, Germany, May 2000.

[49] L. M. T. Jesus and C. H. Shadle, “A parametric study of the spectral characteristics of European Portuguese fricatives,” Journal of Phonetics, vol. 30, no. 3, pp. 437–464, 2002.

[50] W. J. Hardcastle and N. Hewlett, Eds., Coarticulation: Theoretical and Empirical Perspectives, Cambridge University Press, Cambridge, UK, 1999.

António J. S. Teixeira was born in Paredes, Portugal, in 1968. He received his first degree in electronic and telecommunications engineering in 1991, the M.S. degree in electronic and telecommunications engineering in 1993, and the Ph.D. degree in electrical engineering in 2000, all from the University of Aveiro, Aveiro, Portugal. His Ph.D. dissertation was on articulatory synthesis of the Portuguese nasals. Since 1997, he has been teaching in the Department of Electronics and Telecommunications Engineering at the University of Aveiro, as a “Professor Auxiliar” since 2000, and has been a Researcher, since its creation in 1999, in the Signal Processing Laboratory at the Institute of Electronics and Telematics Engineering of Aveiro (IEETA), Aveiro, Portugal. His research interests include digital processing of speech signals, particularly (articulatory) speech synthesis; Portuguese phonetics; speaker verification; spoken language understanding; dialogue systems; and man-machine interaction. He is also involved, as the Coordinator, in a new Master's program in the area of speech sciences and hearing. He is a Member of The Institute of Electrical and Electronics Engineers, the International Speech Communication Association, and the International Phonetic Association.

Roberto Martinez was born in Cuba in 1961. He received his first degree in physics in 1986 from the Moscow State University M. V. Lomonosov, former USSR. He has been a Microsoft Certified Engineer since 1998. From 1986 to 1994, he was an Assistant Professor of mathematics and physics at the Havana University, Cuba, doing research in computer-aided molecular design. From 1996 to 1998, he was with SIME Ltd., Cuba, as an Intranet Developer and System Administrator. From 1999 to 2001, he was with DISAIC Consulting Services, Cuba, training Network Administrators and consulting in Microsoft BackOffice Systems Integration and Network Security. He is currently working toward the Doctoral degree in articulatory synthesis of Portuguese at the University of Aveiro, Portugal.

Luís Nuno Silva received his first degree in electronics and telecommunications engineering in 1997 and the M.S. degree in electronics and telecommunications engineering in 2001, both from the Universidade de Aveiro, Aveiro, Portugal. From 1997 till 2002, he worked in research and development at the Instituto de Engenharia Electrónica e Telemática de Aveiro, Aveiro, Portugal (former Instituto de Engenharia de Sistemas e Computadores of Aveiro, Aveiro, Portugal) as a Research Associate. Since 2002, he has been working as a Software Engineer at the Research and Development Department of NEC Portugal, Aveiro, Portugal. His (research) interests include digital processing of speech signals and speech synthesis. He is a Member of The Institute of Electrical and Electronics Engineers.

Luís M. T. Jesus received his first degree in electronic and telecommunications engineering in 1996 from the Universidade de Aveiro, Aveiro, Portugal, the M.S. degree in electronics in 1997 from the University of East Anglia, Norwich, UK, and the Ph.D. degree in electronics in 2001 from the University of Southampton, UK. Since 2001, he has been a Reader in the Escola Superior de Saúde da Universidade de Aveiro, Aveiro, Portugal, and has been a member of the Signal Processing Laboratory at the Instituto de Engenharia Electrónica e Telemática de Aveiro, Aveiro, Portugal. His research interests include acoustic phonetics, digital processing of speech signals, and speech synthesis. He is a Member of The Acoustical Society of America, Associação Portuguesa de Linguística, the International Phonetic Association, the International Speech Communication Association, and The Institute of Electrical and Electronics Engineers.

José C. Príncipe is a Distinguished Professor of electrical and biomedical engineering at the University of Florida, Gainesville, where he teaches advanced signal processing and artificial neural networks (ANNs) modeling. He is a BellSouth Professor and Founder and Director of the University of Florida Computational NeuroEngineering Laboratory (CNEL). He has been involved in biomedical signal processing, brain-machine interfaces, nonlinear dynamics, and adaptive systems theory (information theoretic learning). He is the Editor-in-Chief of IEEE Transactions on Biomedical Engineering, President of the International Neural Network Society, and former Secretary of the Technical Committee on Neural Networks of the IEEE Signal Processing Society. He is also a Member of the Scientific Board of the Food and Drug Administration, and a Member of the Advisory Board of the University of Florida Brain Institute. He has more than 100 publications in refereed journals, 10 book chapters, and over 200 conference papers. He has directed 42 Ph.D. degree dissertations and 57 M.S. degree theses.

Francisco A. C. Vaz was born in Oporto, Portugal, in 1945. He received the Electrical Engineering degree from the University of Oporto, Portugal, in 1968, and the Ph.D. degree in electrical engineering from the University of Aveiro, Portugal, in 1987. His Ph.D. dissertation was on automatic EEG processing. From 1969 to 1973, he worked for the Portuguese Nuclear Committee. After several years working in industry, he joined, in 1978, the staff of the Department of Electronics Engineering and Telecommunications at the University of Aveiro, where he is currently a Full Professor. His research interests have centred on the digital processing of biological signals and, since 1995, on digital speech processing.