
Digital Signal Processing for Hearing Instruments

Guest Editors: Heinz G. Göckler, Henning Puder, Hugo Fastl, Sven Erik Nordholm, Torsten Dau, and Walter Kellermann

EURASIP Journal on Advances in Signal Processing


Copyright © 2009 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2009 of “EURASIP Journal on Advances in Signal Processing.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editor-in-Chief
Phillip Regalia, Institut National des Télécommunications, France

Associate Editors

Adel M. Alimi, Tunisia; Kenneth Barner, USA; Yasar Becerikli, Turkey; Kostas Berberidis, Greece; Jose Carlos Bermudez, Brazil; Enrico Capobianco, Italy; A. Enis Cetin, Turkey; Jonathon Chambers, UK; Mei-Juan Chen, Taiwan; Liang-Gee Chen, Taiwan; Huaiyu Dai, USA; Satya Dharanipragada, USA; Kutluyil Dogancay, Australia; Florent Dupont, France; Frank Ehlers, Italy; Sharon Gannot, Israel; M. Greco, Italy; Irene Y. H. Gu, Sweden; Fredrik Gustafsson, Sweden; Ulrich Heute, Germany; Sangjin Hong, USA; Jiri Jan, Czech Republic; Magnus Jansson, Sweden; Sudharman K. Jayaweera, USA; Søren Holdt Jensen, Denmark; Mark Kahrs, USA; Moon Gi Kang, South Korea; Walter Kellermann, Germany; Lisimachos P. Kondi, Greece; Alex Chichung Kot, Singapore; C.-C. Jay Kuo, USA; Ercan E. Kuruoglu, Italy; Tan Lee, China; Geert Leus, The Netherlands; T.-H. Li, USA; Husheng Li, USA; Mark Liao, Taiwan; Y.-P. Lin, Taiwan; Shoji Makino, Japan; Stephen Marshall, UK; C. Mecklenbräuker, Austria; Gloria Menegaz, Italy; Ricardo Merched, Brazil; Marc Moonen, Belgium; Vitor Heloiz Nascimento, Brazil; Christophoros Nikou, Greece; Sven Nordholm, Australia; Patrick Oonincx, The Netherlands; Douglas O’Shaughnessy, Canada; Björn Ottersten, Sweden; Jacques Palicot, France; Ana Perez-Neira, Spain; Wilfried Philips, Belgium; Aggelos Pikrakis, Greece; Ioannis Psaromiligkos, Canada; Athanasios Rontogiannis, Greece; Gregor Rozinaj, Slovakia; Markus Rupp, Austria; William Allan Sandham, UK; Bulent Sankur, Turkey; Ling Shao, UK; Dirk Slock, France; Y.-P. Tan, Singapore; João Manuel R. S. Tavares, Portugal; George S. Tombras, Greece; Dimitrios Tzovaras, Greece; Bernhard Wess, Austria; Jar-Ferr Yang, Taiwan; Azzedine Zerguine, Saudi Arabia; Abdelhak M. Zoubir, Germany


Contents

Digital Signal Processing for Hearing Instruments, Heinz G. Göckler, Henning Puder, Hugo Fastl, Sven Erik Nordholm, Torsten Dau, and Walter Kellermann
Volume 2009, Article ID 898576, 3 pages

Database of Multichannel In-Ear and Behind-the-Ear Head-Related and Binaural Room Impulse Responses, H. Kayser, S. D. Ewert, J. Anemüller, T. Rohdenburg, V. Hohmann, and B. Kollmeier
Volume 2009, Article ID 298605, 10 pages

A Novel Approach to the Design of Oversampling Low-Delay Complex-Modulated Filter Bank Pairs, Thomas Kurbiel, Heinz G. Göckler, and Daniel Alfsmann
Volume 2009, Article ID 692861, 13 pages

Incorporating the Conditional Speech Presence Probability in Multi-Channel Wiener Filter Based Noise Reduction in Hearing Aids, Kim Ngo, Ann Spriet, Marc Moonen, Jan Wouters, and Søren Holdt Jensen
Volume 2009, Article ID 930625, 11 pages

Improved Reproduction of Stops in Noise Reduction Systems with Adaptive Windows and Nonstationarity Detection, Dirk Mauler and Rainer Martin
Volume 2009, Article ID 469480, 17 pages

Robust Distributed Noise Reduction in Hearing Aids with External Acoustic Sensor Nodes, Alexander Bertrand and Marc Moonen
Volume 2009, Article ID 530435, 14 pages

Synthetic Stimuli for the Steady-State Verification of Modulation-Based Noise Reduction Systems, Jesko G. Lamm, Anna K. Berg, and Christian G. Gluck
Volume 2009, Article ID 876371, 8 pages

The Acoustic and Perceptual Effects of Series and Parallel Processing, Melinda C. Anderson, Kathryn H. Arehart, and James M. Kates
Volume 2009, Article ID 619805, 20 pages

Low Delay Noise Reduction and Dereverberation for Hearing Aids, Heinrich W. Löllmann and Peter Vary
Volume 2009, Article ID 437807, 9 pages

A Computational Auditory Scene Analysis-Enhanced Beamforming Approach for Sound Source Separation, L. A. Drake, J. C. Rutledge, J. Zhang, and A. Katsaggelos
Volume 2009, Article ID 403681, 17 pages

Rate-Constrained Beamforming in Binaural Hearing Aids, Sriram Srinivasan and Albertus C. den Brinker
Volume 2009, Article ID 257197, 9 pages

Combination of Adaptive Feedback Cancellation and Binaural Adaptive Filtering in Hearing Aids, Anthony Lombard, Klaus Reindl, and Walter Kellermann
Volume 2009, Article ID 968345, 15 pages


Influence of Acoustic Feedback on the Learning Strategies of Neural Network-Based Sound Classifiers in Digital Hearing Aids, Lucas Cuadra, Enrique Alexandre, Roberto Gil-Pita, Raul Vicen-Bueno, and Lorena Alvarez
Volume 2009, Article ID 465189, 10 pages

Analysis of the Effects of Finite Precision in Neural Network-Based Sound Classifiers for Digital Hearing Aids, Roberto Gil-Pita, Enrique Alexandre, Lucas Cuadra, Raul Vicen, and Manuel Rosa-Zurera
Volume 2009, Article ID 456945, 12 pages

Signal Processing Strategies for Cochlear Implants Using Current Steering, Waldo Nogueira, Leonid Litvak, Bernd Edler, Jörn Ostermann, and Andreas Büchner
Volume 2009, Article ID 531213, 20 pages

Prediction of Speech Recognition in Cochlear Implant Users by Adapting Auditory Models to Psychophysical Data, Svante Stadler and Arne Leijon
Volume 2009, Article ID 175243, 9 pages

The Personal Hearing System–A Software Hearing Aid for a Personal Communication System, Giso Grimm, Gwenael Guilmin, Frank Poppen, Marcel S. M. G. Vlaming, and Volker Hohmann
Volume 2009, Article ID 591921, 9 pages


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 898576, 3 pages
doi:10.1155/2009/898576

Editorial

Digital Signal Processing for Hearing Instruments

Heinz G. Göckler (EURASIP Member),1 Henning Puder,2 Hugo Fastl,3 Sven Erik Nordholm,4

Torsten Dau,5 and Walter Kellermann6

1 Digital Signal Processing Group, Ruhr-Universität Bochum, 44780 Bochum, Germany
2 Siemens Audiologische Technik GmbH, 91058 Erlangen, Germany
3 Institute for Human-Machine Communication, Technische Universität München, 80333 München, Germany
4 Signal Processing Laboratory, Western Australian Telecommunications Research Institute (WATRI), Crawley, WA 6009, Australia

5 Centre for Applied Hearing Research, Acoustic Technology, Department of Electrical Engineering, Technical University of Denmark, 2800 Kongens Lyngby, Denmark

6 Multimedia Communications and Signal Processing, Universität Erlangen-Nürnberg, 91058 Erlangen, Germany

Correspondence should be addressed to Heinz G. Göckler, [email protected]

Received 26 October 2009; Accepted 26 October 2009

Copyright © 2009 Heinz G. Göckler et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Hearing, as a prerequisite of listening, presumably represents the most important pillar of humans’ ability to communicate with each other. Hence, engineers of all denominations, physicists, and physicians have always been creative both to improve the environmental conditions of hearing and to ameliorate the individual hearing capability.

Digital signal processing for hearing instruments has been an active field of research and industrial development for more than 25 years. As a result, these efforts have eventually paid off and, thus, opened big markets for digital hearing aids and cochlear implants which, in turn, promote and accelerate related research and development. Certainly, the present state of the art of hearing instruments has highly profited from efficient small-size technology with very low power consumption mainly developed for portable communication equipment, advanced digital filtering and filter banks, as well as speech processing and enhancement devised for modern speech transmission and recognition. Moreover, these examples of cross-fertilisation exploiting synergies are continuing and expanding on a large scale.

To promote the aforementioned cross-fertilisation, the goal of this special issue on Digital Signal Processing for Hearing Instruments (DSPHI) has been to collect and present the latest state-of-the-art research in signal processing methods and algorithms used in or suitable for hearing instruments, as well as in related design approaches and implementations. Just four years ago, a first special issue on this topic was published in this journal, underlining the importance and activity of this challenging field of research: EURASIP Journal on Applied Signal Processing 2005:18.

This special issue on DSP for Hearing Instruments gathers 16 articles. Again, it reflects various aspects of multiple disciplines needed for the treatment of hearing impairment. In fact, the included papers address a variety of approaches, algorithms, and designs applicable to hearing instruments that are all amenable to DSP implementation.

The papers in this issue are organized according to their focus of research, since some of these contributions encompass multiple topics. To this end, we have combined the contributions into the following classes: (i) head-related transfer functions and filter banks (2 articles) embracing all DSP inherent to hearing instruments, (ii) noise reduction and speech enhancement (8 articles) encompassing combinations with other algorithms, beamforming, and related measurement procedures, (iii) feedback cancellation (1 article), (iv) sound classification (2 articles), (v) cochlear implants (2 articles), and (vi) personal hearing system (1 article).

The paper entitled “Database of multichannel in-ear and behind-the-ear head-related and binaural room impulse responses” (H. Kayser et al.) offers a database of head-related impulse responses (HRIRs) and binaural room impulse responses (BRIRs). These data are measured with three-channel behind-the-ear (BTE) hearing aids and an in-ear microphone at both ears of a human head-and-torso simulator. The database provides a tool for the evaluation of multichannel hearing aid algorithms in hearing aid research.


The presented database allows for a realistic construction of simulated sound fields for the investigation of hearing instruments.

Another paper entitled “A novel approach to the design of oversampling low-delay complex-modulated filter bank pairs” (T. Kurbiel et al.) proposes a noniterative two-step procedure for the design of the prototype filters of oversampling uniform complex-modulated FIR filter bank pairs that applies two convex objective functions: (i) design of the analysis prototype filter for minimum group delay, and (ii) design of the synthesis prototype filter such that the filter bank pair approximates a linear-phase allpass function. Both aliasing and imaging distortion are controlled by refined frequency-dependent stopband constraints tailored to the actual oversampling factor. The derivation of the constraints, however, will be published elsewhere.

In the paper entitled “Incorporating the conditional speech presence probability in multi-channel Wiener filter based noise reduction in hearing aids” (K. Ngo et al.), the conditional speech presence probability (SPP) is applied to the speech distortion weighted multi-channel Wiener filter (SDW-MWF) with the aim of improving the noise reduction performance of the traditional SDW-MWF. This approach exploits the sparseness of speech signals, that is, the fact that speech may not be present at all times and at all frequencies, which is not taken into account by a typical SDW-MWF. The result is less noise reduction, and hence less speech distortion, in speech-dominant segments, and higher noise reduction during speech absence. The noise reduction performance of the proposed method is evaluated and compared with a typical SDW-MWF.

In the paper entitled “Improved reproduction of stops in noise reduction systems with adaptive windows and nonstationarity detection” (D. Mauler et al.), a novel block-based noise reduction system is proposed which focuses on the preservation of transient sounds like stops or speech onsets. The key idea of the proposed system is the detection of nonstationary input data. Depending on that decision, a pair of analysis-synthesis windows is adaptively selected which provides either high temporal or high spectral resolution. Furthermore, the decision-directed approach to estimate the a priori SNR is modified such that speech onsets are tracked more instantaneously without sacrificing performance in stationary signal regions.

In the paper entitled “Robust distributed noise reduction in hearing aids with external acoustic sensor nodes” (A. Bertrand et al.), the benefit of using external acoustic sensor nodes for noise reduction in hearing aids is demonstrated in a simulated acoustic scenario with multiple sound sources. A distributed adaptive node-specific signal estimation (DANSE) algorithm with reduced bandwidth and diminished computational load is evaluated and compared with the noise reduction performance of centralized multi-channel Wiener filtering (MWF).

The paper entitled “Synthetic stimuli for the steady-state verification of modulation-based noise reduction systems” (J. G. Lamm et al.) provides partial verification of the performance of hearing instruments by measuring the properties of their noise reduction systems. Synthetic stimuli, with their potential of adaptability to the parameter space of the noise reduction system, are proposed as test signals. The article presents stimuli for steady-state measurements of modulation-based noise reduction systems.

The paper entitled “The acoustic and perceptual effects of series and parallel processing” (M. C. Anderson et al.) explores how spectral subtraction and dynamic-range compression gain modifications affect temporal envelope fluctuations in parallel and serial system configurations. In parallel processing, both algorithms derive their gains from the same input signal. In serial processing, the output from the first algorithm forms the input to the second algorithm. Acoustic measurements show that the parallel arrangement produces higher gain fluctuations and, thus, gives rise to higher variations of the temporal envelope than the serial configuration. Intelligibility tests both for normal-hearing and for hearing-impaired listeners demonstrate that (i) parallel processing leads to significantly poorer speech understanding than an unprocessed (UNP) signal and the serial arrangement, and (ii) serial processing and UNP yield similar results.

The paper entitled “Low delay noise reduction and dereverberation for hearing aids” (H. W. Löllmann et al.) presents a new, low signal delay algorithm for single-channel speech enhancement, which concurrently achieves suppression of late reverberant speech and background noise. It is based on a generalized spectral subtraction rule which depends on the variances both of the late reverberant speech and of the background noise. Compared to the commonly used postfiltering, which solely performs noise reduction, both the subjective and the objective speech quality are significantly improved by the proposed algorithm.

The paper entitled “A computational auditory scene analysis-enhanced beamforming approach for sound source separation” (L. A. Drake et al.) describes a novel approach to sound source separation, which achieves better signal separation performance by combining complementary techniques. The method concurrently exploits features of the sound sources and, by applying multimicrophone beamforming, the local separation of the sources, where both attributes are independent of each other.

The paper entitled “Rate-constrained beamforming in binaural hearing aids” (S. Srinivasan et al.) addresses hearing aid systems where the left and right ear devices collaborate with each other. Binaural beamforming for hearing aids requires an exchange of microphone signals between the two devices over a wireless link. In this contribution, two issues are investigated: which of the multiple microphone signals to transmit from one ear to the other, and at what bit-rate. Obviously, the second problem is relevant as the capacity of the wireless link is limited by stringent power consumption constraints.

The paper entitled “Combination of adaptive feedback cancellation and binaural adaptive filtering in hearing aids” (A. Lombard et al.) studies a system combining adaptive feedback cancellation and adaptive filtering connecting inputs from both ears for signal enhancement in hearing aids. Such a binaural system is analysed in terms of system stability, convergence of the algorithms, and possible interaction effects. As major outcomes of this study, a new stability condition adapted to the considered binaural scenario is presented, some commonly used feedback cancellation performance measures for the unilateral case are adapted to the binaural case, and possible interaction effects between the algorithms are identified.

The paper entitled “Influence of acoustic feedback on the learning strategies of neural network-based sound classifiers in digital hearing aids” (L. Cuadra et al.) investigates the feasibility of using different feedback-affected sound databases, and a variety of learning strategies of neural networks, to reduce the probability of erroneous classification. The experimental work basically shows that the proposed methods help the neural network-based sound classifiers reduce their error probability by more than 18%.

The paper entitled “Analysis of the effects of finite precision in neural network-based sound classifiers for digital hearing aids” (R. Gil-Pita et al.) deals with the implementation of sound classifiers with restricted resources such as finite arithmetic and a limited number of instructions per second. This contribution focuses on exploring (i) the most appropriate quantisation scheme for digitally implemented neural network-based classifiers, and (ii) the most adequate approximation for its activation function. Experimental work shows that the optimized finite precision neural network-based classifier achieves the same performance as the corresponding full precision approach and, hence, considerably reduces the computational load.

The paper entitled “Signal processing strategies for cochlear implants using current steering” (W. Nogueira et al.) presents new concepts of sound processing strategies in cochlear implants. In contemporary systems, the audio signal is decomposed into different frequency bands, each assigned to one electrode. Thus, pitch perception is limited by the number of physical electrodes implanted into the cochlea and by the wide bandwidth assigned to each electrode. Two new sound processing strategies are presented here, where current is simultaneously delivered to pairs of adjacent electrodes. By steering the locus of stimulation to sites between the electrodes, additional pitch percepts can be generated. Results from measures of speech intelligibility, pitch perception, and subjective appreciation of sound obtained with the two current steering strategies and a standard strategy are compared in 9 adult cochlear implant users. The benefit is shown to vary considerably, while the mean results show similar performance with all three strategies.

The paper entitled “Prediction of speech recognition in cochlear implant users by adapting auditory models to psychophysical data” (S. Stadler et al.) addresses the large variability of speech recognition by users of cochlear implants under noisy conditions. Of the many factors involved, the discrimination of spectral shapes is studied in detail and incorporated in a model. Speech recognition performance is significantly correlated with the model data for spectral discrimination. Hence, it is suggested to use the presented framework for simulating the effects of changing the encoding strategy implemented in cochlear implants.

Finally, the paper entitled “The Personal Hearing System—a software hearing aid for a personal communication system” (G. Grimm et al.) describes an architecture of a personal communication system (PCS) that integrates audio communication and hearing support for elderly and hearing-impaired persons. The concept applies a central processing unit connected to audio headsets via a wireless body area network (WBAN). To demonstrate the feasibility of the concept, a prototype PCS is implemented on a small notebook computer with a dedicated audio interface in combination with a mobile phone. The prototype PCS integrates a hearing aid applying binaural coherence-based speech enhancement, telephony, public announcement systems, and home entertainment.

Heinz G. Göckler
Henning Puder
Hugo Fastl
Sven Erik Nordholm
Torsten Dau
Walter Kellermann


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 298605, 10 pages
doi:10.1155/2009/298605

Research Article

Database of Multichannel In-Ear and Behind-the-Ear Head-Related and Binaural Room Impulse Responses

H. Kayser, S. D. Ewert, J. Anemüller, T. Rohdenburg, V. Hohmann, and B. Kollmeier

Medizinische Physik, Universität Oldenburg, 26111 Oldenburg, Germany

Correspondence should be addressed to H. Kayser, [email protected]

Received 15 December 2008; Accepted 4 June 2009

Recommended by Hugo Fastl

An eight-channel database of head-related impulse responses (HRIRs) and binaural room impulse responses (BRIRs) is introduced. The impulse responses (IRs) were measured with three-channel behind-the-ear (BTE) hearing aids and an in-ear microphone at both ears of a human head-and-torso simulator. The database aims at providing a tool for the evaluation of multichannel hearing aid algorithms in hearing aid research. In addition to the HRIRs derived from measurements in an anechoic chamber, sets of BRIRs for multiple, realistic head and sound-source positions in four natural environments reflecting daily-life communication situations with different reverberation times are provided. For comparison, analytically derived IRs for a rigid acoustic sphere were computed at the multichannel microphone positions of the BTEs and differences to real HRIRs were examined. The scenes’ natural acoustic background was also recorded in each of the real-world environments for all eight channels. Overall, the present database allows for a realistic construction of simulated sound fields for hearing instrument research and, consequently, for a realistic evaluation of hearing instrument algorithms.

Copyright © 2009 H. Kayser et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Performance evaluation is an important part of hearing instrument algorithm research since only a careful evaluation of accomplished effects can identify truly promising and successful signal enhancement methods. The gold standard for evaluation will always be the unconstrained real-world environment, which comes, however, at a relatively high cost in terms of time and effort for performance comparisons.

Simulation approaches to the evaluation task are the first steps in identifying good signal processing algorithms. It is therefore important to utilize simulated input signals that represent real-world signals as faithfully as possible, especially if multimicrophone arrays and binaural hearing instrument algorithms are considered that expect input from both sides of a listener’s head. The simplest approach to model the input signals to a multichannel or binaural hearing instrument is the free-field model. More elaborate models are based on analytical formulations of the effect that a rigid sphere has on the acoustic field [1, 2].

Finally, the synthetic generation of multichannel input signals by means of convolving recorded (single-channel) sound signals with impulse responses (IRs) corresponding to the respective spatial sound source positions, and also depending on the spatial microphone locations, represents a good approximation to the expected recordings from a real-world sound field. It comes at a fraction of the cost and with virtually unlimited flexibility in arranging different acoustic objects at various locations in virtual acoustic space if the appropriate room-, head-, and microphone-related impulse responses are available.

In addition, when recordings from multichannel hearing aids and in-ear microphones in a real acoustic background sound field are available, even more realistic situations can be produced by superimposing convolved contributions from localized sound sources with the approximately omnidirectional real sound field recording at a predefined mixing ratio. By this means, the level of disturbing background noise can be controlled independently from the localized sound sources.
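As an illustration of this mixing procedure, the following sketch convolves a dry source signal with a multichannel impulse response and adds a recorded ambient background at a prescribed broadband SNR. It is a minimal sketch assuming the signals are already loaded as NumPy arrays at a common sampling rate; the function and variable names are illustrative and not part of the database itself.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize_and_mix(dry, hrir, ambience, target_snr_db):
    """Convolve a dry (single-channel) signal with a multichannel IR and
    mix it with a multichannel ambience recording at a prescribed SNR.

    dry:      (n_samples,)            dry source signal
    hrir:     (ir_len, n_channels)    impulse responses, one column per microphone
    ambience: (m_samples, n_channels) recorded acoustic background
    """
    # Convolve the dry signal with every channel of the impulse response.
    wet = np.stack([fftconvolve(dry, hrir[:, ch]) for ch in range(hrir.shape[1])], axis=1)

    # Loop/trim the ambience to the length of the convolved signal.
    n = wet.shape[0]
    reps = int(np.ceil(n / ambience.shape[0]))
    noise = np.tile(ambience, (reps, 1))[:n, :]

    # Scale the ambience so that the broadband SNR matches the target.
    p_sig = np.mean(wet ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_sig / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return wet + gain * noise
```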

Under the assumption of a linear and time-invariant propagation of sound from a fixed source to a receiver, the impulse response completely describes the system. All transmission characteristics of the environment and objects in the surrounding area are included. The transmission of sound from a source to human ears is also described in this way. Under anechoic conditions, the impulse response contains only the influence of the human head (and torso) and is therefore referred to as a head-related impulse response (HRIR). Its Fourier transform is correspondingly referred to as a head-related transfer function (HRTF). Binaural head-related IRs recorded in rooms are typically referred to as binaural room impulse responses (BRIRs).

There are several existing freely available databases containing HRIRs or HRTFs measured on individual subjects and different artificial head-and-torso simulators (HATS) [3–6]. However, these databases are not suitable for simulating sound impinging on hearing aids located behind the ears (BTEs), as they are limited to two-channel information recorded near the entrance of the ear canal. Additionally, the databases do not reflect the influence of the room acoustics.

For the evaluation of modern hearing aids, which typically process 2 or 3 microphone signals per ear, multichannel input data are required corresponding to the real microphone locations (in the case of BTE devices, behind the ear and outside the pinna) and characterizing the respective room acoustics.

The database presented here therefore improves over existing publicly available data in two respects: in contrast to other HRIR and BRIR databases, it provides a dummy-head recording as well as an appropriate number of microphone channel locations at realistic spatial positions behind the ear. In addition, several room acoustical conditions are included.

Especially for the application in hearing aids, a broad set of test situations is important for developing and testing algorithms performing audio processing. The availability of multichannel measurements of HRIRs and BRIRs captured by hearing aids enables the use of signal processing techniques which benefit from multichannel input, for example, blind source separation, sound source localization and beamforming. Real-world problems, such as head shading and microphone mismatch [7], can be considered by this means.

A comparison between the HRTFs derived from the recorded HRIRs at the in-ear and behind-the-ear positions and the respective modeled HRTFs based on a rigid spherical head is presented to analyze deviations between simulations and real measurements. Particularly at high frequencies, deviations are expected owing to the geometric differences between the real head, including the pinnae, and the model’s spherical head.

The new database of head-, room- and microphone-related impulse responses, for convenience consistently referred to as HRIRs in the following, contains six-channel hearing aid measurements (three per side) and additionally the in-ear HRIRs measured on a Brüel & Kjær HATS [8] in different environments.

After a short overview of the measurement method and setup, the acoustic situations contained in the database are summarized, followed by a description of the analytical head model and the methods used to analyze the data. Finally, the results obtained under anechoic conditions are compared to synthetically generated HRTFs based on the model of a rigid sphere. The database is available under http://medi.uni-oldenburg.de/hrir/.

Figure 1: Right ear of the artificial head with a hearing aid dummy. The distances between the microphones of the hearing aids and the entrance to the ear canal on the artificial head are given in mm.

2. Methods

2.1. Acoustic Setup. Data was recorded using the head-and-torso simulator Brüel & Kjær Type 4128C, onto which the BTE hearing aids were mounted (see Figure 1). The use of a HATS has the advantage of a fixed geometry and thereby provides highly reproducible acoustic parameters. In addition to the microphones in the BTEs mounted on the HATS, it also provides internal microphones to record the sound pressure near the location corresponding to the place of the human eardrum.

The head-and-torso simulator was used with artificial ears Brüel & Kjær Type 4158C (right) and Type 4159C (left), including preamplifiers Type 2669. Recordings were carried out with the in-ear microphones and two three-channel BTE hearing aid dummies of type Acuris provided by Siemens Audiologische Technik GmbH, one behind each artificial ear, resulting in a total of 8 recording channels. The term “hearing aid dummy” refers to the microphone array of a hearing aid, housed in its original casing but without any of the integrated amplifiers, speakers or signal processors commonly used in hearing aids.


The recorded analog signals were preamplified using a G.R.A.S. Power Module Type 12AA, with the amplification set to +20 dB, for the in-ear microphones, and a Siemens custom-made preamplifier, with an amplification of +26 dB, for the hearing aid microphones. Signals were converted using a 24-bit multichannel AD/DA converter (RME Hammerfall DSP Multiface) connected to a laptop (DELL Latitude 610D, Pentium M [email protected], 1 GB RAM) via a PCMCIA card, and the digital data was stored either on the internal or an external hard disk. The software used for the recordings was MATLAB (MathWorks, Versions 7.1/7.2, R14/R2006a) with a professional tool for multichannel I/O and real-time processing of audio signals (SoundMex2 [9]).

The measurement stimuli for measuring an HRIR were generated digitally on the computer using MATLAB scripts (developed in-house) and presented via the AD/DA converter to a loudspeaker. The measurement stimuli were emitted by an active 2-channel coaxial broadband loudspeaker (Tannoy 800A LH). All data was recorded at a sampling rate of 48 kHz and stored at a resolution of 32 bit.

2.2. HRIR Measurement. The HRIR measurements were carried out for a variety of natural recording situations. Some of the scenarios suffered from relatively high levels of ambient noise during the recording. Additionally, at some recording sites, special care had to be taken of the public (e.g., cafeteria). The measurement procedure was therefore required to be of low annoyance, while the measurement stimuli had to be played back at a sufficient level and duration to satisfy the demand of a high signal-to-noise ratio imposed by the use of the recorded HRIRs for development and high-quality auralization purposes. To meet all requirements, the recently developed modified inverse-repeated sequence (MIRS) method [10] was used. The method is based on maximum length sequences (MLS), which are highly robust against transient noise since their energy is distributed uniformly in the form of noise over the whole impulse response [11]. Furthermore, the broadband noise characteristics of MLS stimuli made them better suited for presentation in public than, for example, sine-sweep-based methods [12]. However, MLSs are known to be relatively sensitive to (even weak) nonlinearities in the measurement setup. Since the recordings at public sites partially required high levels reproduced by small-scale and portable equipment, the risk of nonlinear distortions was present. Inverse repeated sequences (IRS) are a modification of MLSs which show high robustness against even-order nonlinear distortions [13]. An IRS consists of two concatenated MLSs, s(n) and its inverse:

\[
\operatorname{IRS}(n) =
\begin{cases}
s(n), & n \text{ even},\\
-s(n), & n \text{ odd},
\end{cases}
\qquad 0 \le n \le 2L, \tag{1}
\]

where L is the period of the generating MLS. The IRS therefore has a period of 2L. In the MIRS method employed here, IRSs of different orders are used in one measurement process and the resulting impulse responses of different lengths are median-filtered to further suppress the effect of uneven-order nonlinear distortions according to the following scheme: a MIRS consists of several successive IRSs of different orders. In the evaluation step, the resulting periodic IRs of the same order were averaged, yielding a set of IRs of different orders. The median of these IRs was calculated, and the final IR was shortened to the length corresponding to the lowest order. The highest IRS order in the measurements was 19, which is equal to a length of 10.92 seconds at the used sampling rate of 48 kHz. The overall MIRS was 32.77 seconds in duration, and the calculated raw IRs were 2.73 seconds long, corresponding to 131072 samples.

The MIRS method combines the advantages of MLS measurements with high immunity against nonlinear distortions. A comparison of the measurement results to an efficient method proposed by Farina [12] showed that the MIRS technique achieves competitive results in anechoic conditions with regard to signal-to-noise ratio and was better suited to public conditions (for details see [10]).
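For reference, the following sketch builds an IRS as defined in Eq. (1) from a maximum length sequence. It assumes the common convention of a ±1-valued MLS and an IRS period of 2L samples; it uses SciPy's max_len_seq generator and only illustrates the construction of the excitation, not the full MIRS measurement and median-filtering chain.

```python
import numpy as np
from scipy.signal import max_len_seq

def make_irs(nbits):
    """Inverse repeated sequence (IRS) of order `nbits`, following Eq. (1):
    IRS(n) = s(n) for even n and -s(n) for odd n, with s an MLS of period
    L = 2**nbits - 1 and the IRS periodic with period 2L."""
    seq, _ = max_len_seq(nbits)         # 0/1-valued MLS of length L
    s = 2.0 * seq.astype(float) - 1.0   # map to +/-1
    L = s.size
    n = np.arange(2 * L)
    return s[n % L] * np.where(n % 2 == 0, 1.0, -1.0)

# Example: for nbits = 19 the underlying MLS period is 2**19 - 1 = 524287 samples
# (about 10.92 s at 48 kHz); one IRS period is twice that length.
irs19 = make_irs(19)
```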

The transfer characteristics of the measurement system were not compensated for in the HRIRs presented here, since they do not affect the interaural and microphone array differences. The impulse response of the loudspeaker, measured by a probe microphone at the HATS position in the anechoic chamber, is provided as part of the database.

2.3. Content of the Database. A summary of HRIR measurements and recordings of ambient acoustic backgrounds (noise) is found in Table 1.

2.3.1. Anechoic Chamber. To simulate a nonreverberant situation, measurements were conducted in the anechoic chamber of the University of Oldenburg. The HATS was fixed on a computer-controlled turntable (Brüel & Kjær Type 5960C with Controller Type 5997) and placed opposite the speaker in the room, as shown in Figure 2. Impulse responses were measured for distances of 0.8 m and 3 m between speaker and HATS. The larger distance corresponds to a far-field situation (which is, e.g., commonly required by beamforming algorithms), whereas for the smaller distance near-field effects may occur. For each distance, 4 angles of elevation were measured, ranging from −10° to 20° in steps of 10°. For each elevation, the azimuth angle of the source relative to the HATS was varied from 0° (front) to −180° (left turn) in steps of 5° (cf. Figure 3). Hence, a total of 296 (= 37 × 4 × 2) sets of impulse responses were measured.

2.3.2. Office I. In an office room at the University of Oldenburg, similar measurements were conducted, covering a systematic variation of the sources’ spatial positions. The HATS was placed on a desk and the speaker was moved in the front hemisphere (from −90° to +90°) at a distance of 1 m with an elevation angle of 0°. The azimuth angle was varied in steps of 5°, as in the anechoic chamber.

For this environment only the BTE channels were measured.

A detailed sketch of the recording setup for this and the other environments is provided as part of the database.


Table 1: Summary of all measurements of head-related impulse responses and recordings of ambient noise. In the Office I environment (marked by the asterisk) only the BTE channels were measured.

Environment HRIR sets measured Sounds recorded

Anechoic chamber 296 —

Office I 37∗ —

Office II 8 12 recordings of ambient noise, total duration 19 min

Cafeteria 12 2 recordings of ambient noise, total duration 14 min

Courtyard 12 1 recording of ambient noise, total duration 24 min

Total 365 57 min of different ambient noises

Figure 2: Setup for the impulse response measurement in the anechoic room. Additional damping material was used to cover the equipment in the room in order to avoid undesired reflections.

Figure 3: Coordinate systems for elevation angles (left-hand sketch) and azimuth angles (right-hand sketch).

2.3.3. Office II. Further measurements and recordings were carried out in a different office room of similar size. The head-and-torso simulator was positioned on a chair behind a desk with two head orientations of 0° (looking straight ahead) and 90° (looking over the shoulder). Impulse responses were measured for four different speaker positions (at the entrance to the room, two different desk positions, and one with a speaker standing at the window) to allow for the simulation of sound sources at typical communication positions. For the measurements with the speaker positioned at the entrance the door was open, and for the measurement at the window the window was open. For the remaining measurements, door and window were closed to reduce disturbing background noise from the corridor and from outdoors. In total, this results in 8 sets of impulse responses.

Separate recordings of real office ambient sound sources were performed: a telephone ringing (30 seconds recorded for each head orientation) and keyboard typing at the other office desks (3 minutes recorded for each head orientation). The noise emitted by the ventilation, which is installed in the ceiling, was recorded for 5 minutes (both head orientations). Additionally, the sound of opening and closing the door was recorded 15 times.

2.3.4. Cafeteria. 12 sets of impulse responses were measured in the fully occupied cafeteria of the natural sciences campus of the University of Oldenburg. The HATS was used to measure the impulse responses from different positions and to collect ambient sound signals from the cafeteria. The busy lunchtime hour was chosen to obtain realistic conditions.

The ambient sounds consisted mainly of unintelligible babble of voices from simultaneous conversations all over the place, occasional parts of intelligible speech from nearby speakers, and the clanking of dishes and chairs scratching on the stone floor.

2.3.5. Courtyard. Measurements in the courtyard of the natural sciences campus of the University of Oldenburg were conducted analogously to the Office II and Cafeteria recordings described above. A path for pedestrians and bicycles crosses this yard. The ambient sounds consist of snippets of conversation between people passing by, footsteps, and mechanical sounds from bicycles, including sudden events such as ringing and the squeaking of brakes. Continuous noise from trees and birds in the surroundings was also present.

2.4. Analytical Model and Data Analysis Methods. The characteristics of HRIRs and the corresponding HRTFs originate from diffraction, shading and resonances on the head and on the pinnae [14]. Reflections and diffractions of the sound from the torso also influence the HRTFs.

An analytical approximative model of the sound propagation around the head is the scattering of sound by a rigid sphere whose radius a equals the radius of a human head. This is a simplification, as the shoulders and the pinnae are neglected and the head is regarded as spherically symmetric.

The solution in the frequency domain for the diffraction of sound waves on a sphere traces back to Lord Rayleigh [15] in 1904. He derived the transfer function H(∞, θ, μ), dependent on the normalized frequency μ = ka = 2πfa/c (c: sound velocity), for an infinitely distant source impinging at the angle θ between the surface normal at the observation point and the source:

\[
H(\infty, \theta, \mu) = \frac{1}{\mu^{2}} \sum_{m=0}^{\infty} \frac{(-i)^{m-1}(2m+1)\,P_{m}(\cos\theta)}{h'_{m}(\mu)}, \tag{2}
\]

where $P_m$ denotes the Legendre polynomials, $h_m$ the mth-order spherical Hankel function and $h'_m$ its derivative. Rabinowitz et al. [16] presented a solution for a point source at distance r from the center of the sphere:

\[
H(r, \theta, \mu) = -\frac{r}{a\mu}\, e^{-i\mu r/a}\, \Psi, \tag{3}
\]

with

\[
\Psi = \sum_{m=0}^{\infty} (2m+1)\,P_{m}(\cos\theta)\,\frac{h_{m}(\mu r/a)}{h'_{m}(\mu)}, \qquad r > a. \tag{4}
\]

2.4.1. Calculation of Binaural Cues. The binaural cues, namely the interaural level difference (ILD), the interaural phase difference (IPD) and, derived therefrom, the interaural time difference (ITD), can be calculated in the frequency domain from a measured or simulated HRTF [17]. If $H_l(\alpha, \varphi, f)$ denotes the HRTF from the source to the left ear and $H_r(\alpha, \varphi, f)$ the transmission to the right ear, the interaural transfer function (ITF) is given by

\[
\operatorname{ITF}(\alpha, \varphi, f) = \frac{H_{l}(\alpha, \varphi, f)}{H_{r}(\alpha, \varphi, f)}, \tag{5}
\]

with α and ϕ the azimuth and elevation angles, respectively, as shown in Figure 3, and f representing the frequency in Hz.

The ILD is determined by

\[
\operatorname{ILD}(\alpha, \varphi, f) = 20 \cdot \log_{10}\bigl(\bigl|\operatorname{ITF}(\alpha, \varphi, f)\bigr|\bigr). \tag{6}
\]

The IPD can also be calculated from the ITF. Differentiation with respect to the frequency f yields the ITD, which equals the group delay between both ears:

\[
\operatorname{IPD}(\alpha, \varphi, f) = \arg\bigl(\operatorname{ITF}(\alpha, \varphi, f)\bigr),
\qquad
\operatorname{ITD}(\alpha, \varphi, f) = -\frac{1}{2\pi}\,\frac{d}{df}\operatorname{IPD}(\alpha, \varphi, f). \tag{7}
\]

Kuhn presented the limiting cases of (2) in [2]. For low frequencies, corresponding to the case ka ≪ 1, the transfer function of the spherical head model simplifies to

\[
H_{\mathrm{lf}}(\infty, \theta, \mu) \approx 1 - i\,\tfrac{3}{2}\,\mu \cos\theta. \tag{8}
\]

This yields an ILD of 0 dB, independent of the angle of incidence, and an angle-dependent IPD. In the coordinate system given in Figure 3, the IPD amounts to

\[
\operatorname{IPD}_{\mathrm{lf}}(\alpha) \approx 3\,ka \sin\alpha, \tag{9}
\]

which results in

\[
\operatorname{ITD}_{\mathrm{lf}}(\alpha) \approx \frac{3a}{c} \sin\alpha. \tag{10}
\]

For high frequencies, the propagation of the waves is described as “creeping waves” traveling around the sphere with approximately the speed of sound. In this case, the ITD can be derived geometrically from the difference between the distances from the source to the left ear and to the right ear, considering the path along the surface of the sphere [18]:

\[
\operatorname{ITD}_{\mathrm{hf}} \approx \frac{a}{c}\,\bigl(\sin\alpha + \alpha\bigr). \tag{11}
\]

With the approximation α ≈ sin α (tolerating an error of 5.5% for α = 135° and an error of 11% for α = 150° [2]), (11) yields

\[
\operatorname{ITD}_{\mathrm{hf}}(\alpha) \approx \frac{2a}{c} \sin\alpha, \tag{12}
\]

which equals 2/3 times the result of (10).

In practice, the measured IPD is contaminated by noise. Hence, the data were preprocessed before the ITD was determined. First, the amplitude of the ITF was equalized to unity by calculating the sign of the complex-valued ITF:

\[
\widetilde{\operatorname{ITF}}(\alpha, \varphi, f) = \operatorname{sign}\bigl(\operatorname{ITF}(\alpha, \varphi, f)\bigr) = \frac{\operatorname{ITF}(\alpha, \varphi, f)}{\bigl|\operatorname{ITF}(\alpha, \varphi, f)\bigr|}. \tag{13}
\]

The result was then smoothed by applying a sliding average with a 20-sample window. The ITD was obtained for a specific frequency by calculating the weighted mean of the ITD (derived from the smoothed IPD) over a chosen range around this frequency. As the weighting function, the coherence function γ was used, or rather a measure of the coherence, γ_n, which is obtained from

\[
\gamma_{n} = \Bigl|\,\underbrace{\widetilde{\operatorname{ITF}}(\alpha, \varphi, f)}_{\text{smoothed}}\,\Bigr|^{\,n}. \tag{14}
\]

The function was raised to the power of n to control the strength of suppression of data with a weak coherence. In the analysis, n = 6 turned out to be a suitable choice.
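The steps of Eqs. (5)-(7) and the preprocessing of Eqs. (13)-(14) can be put together roughly as follows. This is a sketch under simplifying assumptions (FFT-based HRTFs, a boxcar smoother, numerical differentiation for the group delay); the exact smoothing window, frequency ranges, and weighted averaging used in the paper are not reproduced, and all names are illustrative.

```python
import numpy as np

def binaural_cues(h_left, h_right, fs, smooth_len=20, n_exp=6):
    """ILD, ITD and coherence weight from a pair of HRIRs (one source position).

    Implements Eq. (5) (ITF), Eq. (6) (ILD), Eq. (7) (ITD as group delay of the
    IPD), Eq. (13) (unit-magnitude ITF) and Eq. (14) (coherence weight gamma_n).
    """
    n_fft = max(len(h_left), len(h_right))
    f = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    itf = np.fft.rfft(h_left, n_fft) / np.fft.rfft(h_right, n_fft)  # Eq. (5)
    ild = 20.0 * np.log10(np.abs(itf))                              # Eq. (6)

    itf_unit = itf / np.abs(itf)                                    # Eq. (13)
    kernel = np.ones(smooth_len) / smooth_len
    itf_smooth = np.convolve(itf_unit, kernel, mode="same")         # sliding average

    ipd = np.unwrap(np.angle(itf_smooth))                           # Eq. (7), IPD
    itd = -np.gradient(ipd, f) / (2.0 * np.pi)                      # Eq. (7), ITD
    gamma_n = np.abs(itf_smooth) ** n_exp                           # Eq. (14)
    return f, ild, itd, gamma_n

def itd_at(fc, bandwidth, f, itd, gamma_n):
    """Coherence-weighted mean ITD in a band around the center frequency fc."""
    sel = np.abs(f - fc) <= bandwidth
    return np.sum(itd[sel] * gamma_n[sel]) / np.sum(gamma_n[sel])
```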

3. Results

3.1. Quality of the Measurements. As an evaluation of the quality, the signal-to-noise ratio (SNR) of the measured impulse responses was calculated for each environment. The average noise power was estimated from the noise floor ir_noise(t) over the interval T_end at the end of the measured IR, where the IR has declined below the noise level. The duration of the measured IRs was sufficient to assume that only noise was present in this part of the measured IR. With the average power estimated for the entire duration T = 2.73 s of the measured IR, ir(t), the SNR was calculated as

\[
\operatorname{SNR} = 10 \log_{10} \frac{\bigl\langle \operatorname{ir}^{2}(t) \bigr\rangle_{T}}{\bigl\langle \operatorname{ir}^{2}_{\text{noise}}(t) \bigr\rangle_{T_{\text{end}}}}, \tag{15}
\]

where ⟨·⟩ denotes the temporal average. The results are given in Table 2.
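A straightforward implementation of Eq. (15) is sketched below; the length of the noise-only tail is an assumed parameter, since the exact interval T_end used for each measurement is not restated here.

```python
import numpy as np

def impulse_response_snr(ir, fs, noise_tail_s=0.25):
    """SNR of a measured impulse response according to Eq. (15).

    The total average power over the full duration T is compared with the
    average power of the noise floor, estimated from the last `noise_tail_s`
    seconds of the IR, where the response is assumed to have decayed.
    """
    tail = ir[-int(noise_tail_s * fs):]
    p_total = np.mean(ir ** 2)     # <ir^2(t)>_T
    p_noise = np.mean(tail ** 2)   # <ir_noise^2(t)>_{T_end}
    return 10.0 * np.log10(p_total / p_noise)
```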

3.2. Reverberation Time of the Different Environments. The reverberation time T60 denotes the time that it takes for the signal energy to decay by 60 dB after the playback of the signal is stopped. It was estimated from a room impulse response of duration T employing the method of Schroeder integration [19]. In the Schroeder integration, the energy decay curve (EDC) is obtained by reverse-time integration of the squared impulse response:

\[
\operatorname{EDC}(t) = 10 \log_{10} \frac{\int_{t}^{T} \operatorname{ir}^{2}(\tau)\,d\tau}{\int_{0}^{T} \operatorname{ir}^{2}(\tau)\,d\tau}. \tag{16}
\]

The noise contained in the measured IR is assumed to spread equally over the whole measured IR and thus leads to a linearly decreasing offset in the EDC. A correction for the noise is introduced by fitting a linear curve to the pure noise energy part at the end of the EDC, where the IR has vanished. Subsequently, the linear curve, representing the effect of noise, is subtracted from the EDC, yielding the pure IR component.

Generally, an exponential decay in time is expected, and the decay rate was found by fitting an exponential curve to the computed decay of energy [20]. An example EDC is shown in Figure 4. The first, steeply sloped part of the curve results from the decay of the energy of the direct sound (early decay), fading at about 0.1 seconds into the part resulting from the diffuse reverberation tail of the IR. An exponential curve is fitted (linear in the semilogarithmic presentation) to the part of the EDC corresponding to the reverberation tail. The T60 time is then determined from the fitted decay curve. The estimated T60 times of the different environments are given in Table 3.
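The sketch below computes the Schroeder energy decay curve of Eq. (16) and estimates T60 from a line fitted to the reverberation tail. It omits the noise-floor correction described above and uses an assumed fit range in dB, so it illustrates the principle rather than reproducing the exact procedure.

```python
import numpy as np

def schroeder_edc_db(ir):
    """Energy decay curve, Eq. (16): reverse-time integration of ir^2,
    normalized to the total energy, expressed in dB."""
    e = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(e / e[0])

def estimate_t60(ir, fs, fit_range_db=(-5.0, -25.0)):
    """Fit a line to the EDC between two decay levels and extrapolate to -60 dB."""
    edc = schroeder_edc_db(ir)
    t = np.arange(len(ir)) / fs
    hi, lo = fit_range_db
    mask = (edc <= hi) & (edc >= lo)
    slope, intercept = np.polyfit(t[mask], edc[mask], 1)  # decay rate in dB/s
    return -60.0 / slope
```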

3.3. Comparison to the Analytical Model of a Rigid Sphere. Duda and Martens provide pseudocode for the evaluation of (3) for the calculation of angle- and range-dependent transfer functions of a sphere in [1]. The behavior of the theoretical solution was also explored in detail within their work and compared to measurements carried out on a bowling ball. The pseudocode was implemented in MATLAB, and 8-channel HRTFs were calculated for the microphone positions corresponding to the entrances of the ear canals of the HATS and the positions of the BTE hearing aid microphones on the artificial head.

In the following analysis, the measured HRTFs (obtained from the measured HRIRs) are compared to the data modeled for a rigid sphere, and differences between the in-ear HRTFs and the BTE hearing aid HRTFs are also considered. It is analyzed to which extent a spherical head model is suitable to describe the sound incidence at the BTE hearing aids regarding binaural cues and properties in the time domain. The HRTFs from the anechoic room for the distance of 3 m and an elevation angle of 0° are compared to the predictions of the model for a rigid sphere. The measured results displayed in the figures were smoothed to obtain a more articulate presentation. For this purpose, the HRTFs were filtered using a sliding rectangular window with a 1/12-octave width.

Figure 4: Energy decay curve calculated using the method of Schroeder integration from an impulse response of the cafeteria (solid) and linear fit (dashed) to estimate the reverberation time T60.

Table 2: Mean SNR values of the impulse response measurements in the different environments.

Environment SNR (dB)

Anechoic chamber 104.8

Office II 94.7

Cafeteria 75.6

Courtyard 86.1

Table 3: Reverberation time of the different environments.

Environment T60 (ms)

Anechoic chamber <50(1)

Office II 300

Cafeteria 1250

Courtyard 900

(1) The reverberation time estimate is limited by the decay of the impulse response of the vented loudspeaker system with a cut-off frequency of about 50 Hz.

Figure 5 shows exemplary transfer functions obtained for an azimuth angle of −45°. On the left side, the measured HRTFs are shown; on the right side, the theoretical curves for a spherical head without torso. These were calculated for the microphone positions corresponding to the measurement setup as shown in Figure 1, whereby only the azimuth angles were taken into account and the slight differences in elevation were neglected. In the low-frequency range up to 1 kHz, the dotted curves on the left and the right side have a similar course, except for a stronger ripple of the measured data. Level differences due to the transmission characteristics of the small hearing aid microphones (solid lines), which strongly deviate from a flat course, are observed.

Figure 5: Measured HRTFs (a) (log-magnitude) from the in-ear (dashed) and the hearing aid microphones (solid) and corresponding log-magnitude transfer functions calculated by the model for an ideal rigid sphere (b). The angle of incidence is −45°. The set of the upper four curves displays the HRTFs from the left side of the artificial head; the lower set is obtained from the right side. The light colored lines represent the front hearing aid microphones and the dark lines the rearmost ones. A level of 0 dB corresponds to the absence of head effects.

In the middle frequency range, both sides are still correlated, but the characteristic notches and maxima are much more prominent in the measurements. The intersection points of the separate curves remain similar, but the variation of the level and the level differences between the microphones are much stronger. The results of the in-ear measurements show a rise of 10 dB to 15 dB in comparison to the theoretical levels, due to resonances in the ear canal.

Above 7 kHz, effects like shadowing and resonance from the structure of the head, which are not present in the head model, have a strong influence.

In the following, the ITDs and ILDs obtained from the measurements are examined in more detail.

3.3.1. ILD. The ILDs from the inner ear microphones and one pair of the hearing aid microphones are shown in Figure 6 for a subset of azimuth angles (solid lines), along with the corresponding curves obtained from the model (dashed lines).

As indicated in the previous figure, the measurements and the model show a similar behavior up to a frequency of about 3 kHz. Above this value, the influence of the head and the torso becomes obvious, resulting in a strong ripple, especially for the inner ear measurements, which also include the effects of the pinnae and the ear canals.

Above a frequency of 9 kHz, alignment errors and microphone mismatch become obvious. This is indicated by the deviation of the ILD from the 0 dB line for sound incidence from 0° and −180°.

For the ILDs of the in-ear measurements, it is obvious that the measured ILD is much larger than the model ILD for sound incidence from the front left side (−30° to −90°) in the frequency range above 4 kHz. If the sound impinges from behind, notches are observable in the measured ILD, when compared to the model ILD, at 3 kHz for −120° and at nearly 4 kHz for −150°. This effect is not present in the ILDs between the hearing aids and therefore must originate from the pinnae.

3.3.2. ITD. The ITDs between the in-ear microphones and a microphone pair of the hearing aids were calculated as described in Section 2.4.1 within a range of ±100 Hz around the center frequency. The results are shown in Figure 7, where the modeled data are also displayed.

For center frequencies of 125 Hz and 250 Hz, the curves obtained from the measurements and the model are in good accordance. Above that, for 0.5 kHz and 1 kHz, deviations occur. Here, the ITDs calculated from the measurements are slightly higher than the theoretical values for the sphere. The determination of the azimuth angle is always ambiguous for a sound coming from the back or the front hemisphere. For the 2-kHz curve, the ITD also becomes ambiguous for sound waves coming from the same hemisphere.

Another difference between the ITDs for low and high frequencies is observable. For the lower frequencies, the time differences are larger than for higher frequencies at the same angle of incidence, corresponding to a larger effective head radius for low frequencies. This is in accordance with the findings of Kuhn [2] for an infinitely distant source, described by (10) and (12).

Figure 6: ILDs calculated from the measurements (solid lines) and the modeled HRTFs (dashed lines) for the in-ear microphones (a) and the front microphone pair of the hearing aids (b). One tick on the right ordinate corresponds to a 6 dB level difference. The dashed straight lines mark the ILD of 0 dB.

Figure 7: ITDs calculated from the measurements (solid lines) and the modeled HRTFs (dashed lines) for the in-ear microphones (a) and the front microphone pair of the hearing aids (b). The ITDs for the mid frequencies in octaves from 125 Hz to 2 kHz are shown as indicated on the right-hand ordinate axis. An offset of 0.5 milliseconds is added to separate the curves from each other for a better overview. One tick on the left-hand ordinate is 0.25 milliseconds.

3.3.3. Analysis in the Time Domain. Figure 8 shows HRIRs for a sound source impinging on the left side of the HATS. The angle of incidence ranges from 0° to 360° and, in this representation, is related to the angle of incidence at the microphones on the left side of the head for a better overview. This means that, for an angle of 0°, the sound impinges perpendicularly on the hearing aid. The set of HRIRs is shown for the head model (a), the corresponding foremost hearing aid microphone on the left side (b) and the left in-ear microphone (c).

The data from the head model show a decreasing magnitude of the main peak with increasing angle of incidence up to 170°. For sound incidence from the opposite direction, a peak is visible, the so-called “bright spot”, which was also described by Duda and Martens [1].

The impulse responses of the hearing aid microphone also show a bright spot for sound incidence from 180°. The shape of the maximum peak formation is similar to that of the modeled data, but after the main peak additional delayed reflections occur. Early reflections come from the rim of the pinna, as the delay lies within the range of travel times corresponding to a distance of a few centimeters. A later dominant peak is attributed to strong reflections from the shoulders, as it occurs 0.3 milliseconds to 0.5 milliseconds after the main peak, which corresponds to a distance of about 13 cm to 20 cm.

For the in-ear microphones these reflections are much more pronounced and have a finer structure. A bright spot is not apparent due to the asymmetry caused by the pinnae.


Figure 8: Head-related impulse responses for sound incidence to the left side of the artificial head. Data are shown for the head model (a), a hearing aid microphone (b), and the left in-ear microphone (c).

4. Discussion and Conclusion

An HRIR database was introduced which is suited to simulate different acoustic environments for digital sound signal processing in hearing aids. A high SNR of the impulse responses was achieved even under challenging real-world recording conditions. In contrast to existing freely available databases, six-channel measurements of BTE hearing aids are included in addition to the in-ear HRIRs, for a variety of source positions in a free-field condition and in different realistic reverberant environments. Recordings of the ambient sounds characteristic of the scenes are available separately. This allows for a highly authentic simulation of the underlying acoustic scenes.

The outcome of the analysis of the HRTFs from the anechoic room is in agreement with previous publications on HRTFs (e.g., [2]) and shows noticeable differences between the in-ear measurements and the data from the hearing aids. As expected, the ILDs derived from the spherical head model match the data from the hearing aids better than the data from the in-ear measurements. The modeled ILD fits the ILD between the hearing aids reasonably up to a frequency of 6 kHz. For the in-ear ILD, the limit is about 4 kHz.

In the frequency region above 4 to 6 kHz, significant deviations between the simulated data and the measurements occur. This shows that modeling a head by a rigid sphere does not provide a suitable estimate of the sound transmission to the microphone arrays of a BTE hearing aid, and it motivates the use of this database in hearing aid research, particularly for future hearing aids with an extended frequency range.

It is expected that the data presented here will predominantly be used for the evaluation of signal processing algorithms with multi-microphone input, such as beamformers or binaural algorithms. In such cases, very detailed knowledge about the magnitude and phase behavior of the HRTFs might have to be provided as a priori knowledge to the signal processing algorithms. Even though the current HRTF data represent a "snapshot" of a single geometric head arrangement that would need to be adjusted to subjects on an individual basis, they can nevertheless be used as one specific realization to be accounted for in certain algorithms.

It is impossible to determine a priori whether the detailed acoustic properties captured by realistic HRIRs/HRTFs are indeed significant for either evaluation or algorithm construction. However, the availability of the current database makes it possible to answer this question for each specific algorithm, acoustic situation, and performance measure individually. Results from work based on our data [21] demonstrate that, even for identical algorithms and spatial arrangements, some measures can show a significant performance increase (e.g., SNR enhancement) when realistic HRTFs are taken into account. Conversely, other measures (such as the speech reception threshold under binaural conditions) have been found to be largely invariant to the details captured by realistic models. In any case, the availability of the HRIR database presented here makes it possible to identify the range of realistic conditions


under which an arbitrary hearing instrument algorithm performs well.

This "test-bed" environment also permits a detailed comparison between different algorithms and may lead to a realistic de facto standard benchmark dataset for the hearing aid research community. The database is available at http://medi.uni-oldenburg.de/hrir/.

Acknowledgment

The authors would like to thank Siemens Audiologische Technik for providing the hearing aids and the appropriate equipment. This work was supported by the DFG (SFB/TR 31) and the European Commission under the integrated project DIRAC (Detection and Identification of Rare Audio-visual Cues, IST-027787).

References

[1] R. O. Duda and W. L. Martens, "Range dependence of the response of a spherical head model," The Journal of the Acoustical Society of America, vol. 104, no. 5, pp. 3048–3058, 1998.

[2] G. F. Kuhn, "Model for the interaural time differences in the azimuthal plane," The Journal of the Acoustical Society of America, vol. 62, no. 1, pp. 157–167, 1977.

[3] V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano, "The CIPIC HRTF database," in IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 99–102, October 2001.

[4] B. Gardner, K. Martin, et al., "HRTF measurements of a KEMAR dummy-head microphone," Tech. Rep. 280, MIT Media Lab Perceptual Computing, May 1994.

[5] S. Takane, D. Arai, T. Miyajima, K. Watanabe, Y. Suzuki, and T. Sone, "A database of head-related transfer functions in whole directions on upper hemisphere," Acoustical Science and Technology, vol. 23, no. 3, pp. 160–162, 2002.

[6] H. Sutou, "Shimada laboratory HRTF database," Tech. Rep., Shimada Laboratory, Nagaoka University of Technology, Nagaoka, Japan, May 2002, http://audio.nagaokaut.ac.jp/hrtf.

[7] H. Puder, "Adaptive signal processing for interference cancellation in hearing aids," Signal Processing, vol. 86, no. 6, pp. 1239–1253, 2006.

[8] "Head and Torso Simulator (HATS), Type 4128," Brüel & Kjær, Nærum, Denmark.

[9] D. Berg, SoundMex2, HörTech gGmbH, Oldenburg, Germany, 2001.

[10] S. D. Ewert and H. Kayser, "Modified inverse repeated sequence," in preparation.

[11] D. D. Rife and J. Vanderkooy, "Transfer-function measurement with maximum-length sequences," Journal of the Audio Engineering Society, vol. 37, no. 6, pp. 419–444, 1989.

[12] A. Farina, "Simultaneous measurement of impulse response and distortion with a swept-sine technique," in AES 108th Convention, Paris, France, February 2000.

[13] C. Dunn and M. Hawksford, "Distortion immunity of MLS-derived impulse response measurements," Journal of the Audio Engineering Society, vol. 41, no. 5, pp. 314–335, 1993.

[14] J. Blauert, Räumliches Hören, Hirzel Verlag, 1974.

[15] L. Rayleigh and A. Lodge, "On the acoustic shadow of a sphere," Proceedings of the Royal Society of London, vol. 73, pp. 65–66, 1904.

[16] W. M. Rabinowitz, J. Maxwell, Y. Shao, and M. Wei, "Sound localization cues for a magnified head: implications from sound diffraction about a rigid sphere," Presence, vol. 2, no. 2, pp. 125–129, 1993.

[17] J. Nix and V. Hohmann, "Sound source localization in real sound fields based on empirical statistics of interaural parameters," The Journal of the Acoustical Society of America, vol. 119, no. 1, pp. 463–479, 2006.

[18] R. S. Woodworth and H. Schlosberg, Woodworth and Schlosberg's Experimental Psychology, Holt, Rinehart and Winston, New York, NY, USA, 1971.

[19] M. R. Schroeder, "New method of measuring reverberation time," The Journal of the Acoustical Society of America, vol. 36, no. 3, pp. 409–413, 1964.

[20] M. Karjalainen and P. Antsalo, "Estimation of modal decay parameters from noisy response measurements," in AES 110th Convention, Amsterdam, The Netherlands, May 2001.

[21] T. Rohdenburg, S. Goetze, V. Hohmann, K.-D. Kammeyer, and B. Kollmeier, "Objective perceptual quality assessment for self-steering binaural hearing aid microphone arrays," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '08), pp. 2449–2452, Las Vegas, Nev, USA, March–April 2008.


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2009, Article ID 692861, 13 pages, doi:10.1155/2009/692861

Research Article

A Novel Approach to the Design of Oversampling Low-Delay Complex-Modulated Filter Bank Pairs

Thomas Kurbiel (EURASIP Member), Heinz G. Göckler (EURASIP Member), and Daniel Alfsmann (EURASIP Member)

Digital Signal Processing Group, Ruhr-Universität Bochum, 44780 Bochum, Germany

Correspondence should be addressed to Thomas Kurbiel, [email protected]

Received 15 December 2008; Revised 5 June 2009; Accepted 9 September 2009

Recommended by Sven Nordholm

In this contribution we present a method to design the prototype filters of oversampling uniform complex-modulated FIR filter bank pairs. Specifically, we present a noniterative two-step procedure: (i) design of the analysis prototype filter with minimum group delay and an approximately linear-phase frequency response in the passband and the transition band, and (ii) design of the synthesis prototype filter such that the filter bank pair's distortion function approximates a linear-phase allpass function. Both aliasing and imaging are controlled by introducing sophisticated stopband constraints in both steps. Moreover, we investigate the delay properties of oversampling uniform complex-modulated FIR filter bank pairs in order to achieve the lowest possible filter bank delay. An illustrative design example demonstrates the potential of the design approach.

Copyright © 2009 Thomas Kurbiel et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

A digital filter bank pair (FBP) is represented by a cascade connection of an analysis filter bank (AFB) for signal decomposition and a synthesis filter bank (SFB) for signal reconstruction. In this contribution, we are exclusively interested in FBP that are most efficient in terms of (i) low power consumption, calling for minimum computational load and a modular structure (minimum control overhead), and (ii) low overall AFB-SFB signal (group) delay. The former property is most important if the FBP is part of mobile equipment with tight energy constraints, most pronounced in hearing aids (HAs) [1], while the latter requirement must, for instance, be considered in two-way communication systems or HA, where the total group delay of the FBP shall not exceed 5–8 milliseconds [2, 3] to allow for sufficient margin for extensive subband signal processing.

The above computational constraints are best accounted for by using uniform, maximally decimating (critically sampling), complex-modulated (DFT) polyphase FB applying FIR filters, where all individual frequency responses of the AFB or SFB are frequency-shifted versions of that of the corresponding prototype lowpass filter, respectively [4–6].

As is well known, critical sampling in FBP gives rise to severe aliasing in case of low-order (prototype) filters with overlapping frequency responses, which, in general, can be compensated for by proper matching of the AFB and SFB (prototype) filters [5, 6]. In contrast, we are interested in FBP that call for extensive subband signal manipulation, where aliasing compensation approaches cannot be used. As a result, the design of FBP considered in this contribution calls for moderately oversampled subband signals. Thus, nonlinear distortion due to aliasing and imaging can exclusively be controlled by adequate stopband rejection of the respective (prototype) filters. As an example, stopband magnitude constraints for FBP in HA are derived in [7].

When considering linear-phase (LP) FIR filters for the prototype design, the above stringent group delay requirements are best met if the filter lengths are as small as possible [8]. Essentially the same applies to low-delay FIR filter designs [9].

In the past, many attempts to design FIR lowpass filters with low group delay have been made [9–12]. In [13], a procedure for the design of FIR Nyquist filters with low group delay was proposed that is based on the Remez exchange algorithm. With the mentioned approaches a filter group


delay can always be obtained that ranges below the group delay of a corresponding LP FIR filter. However, the absolute minimum value of the passband group delay was of no concern. In contrast, for instance, Lang [14] has shown that his algorithms for the constrained design of digital filters with arbitrary magnitude and phase responses have the potential to achieve a considerable reduction of group delay as compared to LP FIR filters, even for high-order FIR filters.

Moreover, several solutions to the problem of low-delay filter banks have been suggested in the literature. In [15], an iterative method for the design of oversampling DFT filter banks has been proposed, which allows for controlling the distortion function for each frequency and jointly minimizes aliasing and imaging. A low group delay, particularly of the AFB prototype filter, was not explicitly required there. Based on the algorithm of [15], the approach of [16] introduces additional constraints on the delay and phase responses. Noncritical decimation has also been suggested in [17], where neither filter bank delay aspects nor magnitude deviations of the distortion function have especially been taken into consideration. In [18] the problems of aliasing and amplitude distortion are studied; prototype filters which are optimised with respect to these properties are designed and their performances are compared. Moreover, the effects of the number of subbands, the oversampling factors, and the length of the prototype filter are also studied. Using a multicriteria formulation, all Pareto optima are sought via a nonlinear programming technique. In [19] a hybrid optimization method is proposed to find the Pareto optima of this highly nonlinear problem. Furthermore, it is shown that Kaiser and Dolph-Chebyshev windows give the best overall performance with or without oversampling.

From a filter bank system-theoretic point of view, we pursue three objectives, representing steps towards the design of oversampling uniform complex-modulated (FFT) FIR filter bank pairs (FBPs) allowing for extensive subband signal manipulation. We restrict ourselves to integer oversampling factors as defined by

\[
O = \frac{I}{M} \in \mathbb{N}, \quad O > 1 \tag{1}
\]

in order not to constrain the applicability of polyphase prototype filters in any form [4, 6], where I represents the number of FB channels and M ∈ ℕ the common decimation or interpolation factor of the FBP, respectively.

(1) In Section 2, being related to the first objective of this paper, we begin with a system-theoretic description and analysis of oversampling I-channel complex-modulated FIR filter bank pairs without subband signal manipulation, which supplements and extends the results reported in [16]. In particular, we investigate the properties of the distortion function [5, 6], the overall single-input single-output (SISO) transfer function of the filter bank pair, which ideally approximates a linear-phase allpass function. We show that both the magnitude and the group delay of the FBP distortion function are periodic versus frequency. Furthermore, the group delay of the FBP is investigated in detail.

(2) For a first design step (cf. Section 3), we introduce a novel procedure for the constrained design of low-delay narrow-band FIR prototype filters for oversampling complex-modulated filter banks with an approximately linear-phase frequency response in the passband and the transition bands. As the objective function to be minimised we adopt a particular representation of the group delay [20], while the stopband magnitude specifications of the prototype filter, as derived in [7], serve as constraints to control subband signal aliasing or imaging, respectively. In the first design step this procedure is used either for the design of the SFB or the AFB prototype filter.

(3) For the second and final design step (cf. Section 4), we use the deviation of the FBP distortion function from unity as the objective function. To this end, the AFB (or SFB) FIR prototype filter is designed subject to the stopband magnitude constraints, as given by [7], while the SFB (or AFB) prototype filter is fixed to the design obtained in the previous step. By this procedure the AFB and SFB prototype filters' magnitude responses are matched in the passbands and the transition bands without further consideration of the overall FBP group delay, aiming at minimum, possibly differing AFB and SFB prototype filter orders.

To illustrate the results and the potential of the design procedure, we present a design example in Section 5. Finally, in Section 6 we draw some conclusions.

2. Oversampling Complex-Modulated FIR Filter Bank Pairs

We begin with the introduction of our filter bank notation and a system-theoretic description and analysis of uniform oversampling I-channel complex-modulated FIR filter bank pairs (FBP) without subband signal manipulation, the principle of which is shown in Figure 1. In particular, we investigate the properties of the distortion function [5, 6], which is required to approximate a linear-phase allpass function.

2.1. Distortion Function. For a uniform oversampling complex-modulated I-channel FBP the AFB and SFB filters, respectively, are derived from common prototype filters [5, 6]:

\[
H_l(z_i) = H\!\left(z_i W_I^{l}\right), \quad l = 0, \ldots, I-1,
\]
\[
G_l(z_i) = G\!\left(z_i W_I^{l}\right), \quad l = 0, \ldots, I-1. \tag{2}
\]


Figure 1: Uniform oversampling filter bank pair, oversampling factor O = I/M ∈ ℕ.

Both prototype filters are real-valued FIR filters represented by their causal transfer functions:

\[
H(z_i) = \sum_{n=0}^{N_h-1} h(n)\, z_i^{-n} = \mathbf{c}_h^{T}(z_i)\cdot\mathbf{h}, \tag{3}
\]
\[
G(z_i) = \sum_{n=0}^{N_g-1} g(n)\, z_i^{-n} = \mathbf{c}_g^{T}(z_i)\cdot\mathbf{g}, \tag{4}
\]

where the vectors

\[
\mathbf{h} = [h(0), h(1), \ldots, h(N_h-1)]^{T} \in \mathbb{R}^{N_h},
\]
\[
\mathbf{g} = [g(0), g(1), \ldots, g(N_g-1)]^{T} \in \mathbb{R}^{N_g} \tag{5}
\]

contain the N_h or N_g coefficients of the impulse response, respectively, and the vectors

\[
\mathbf{c}_h(z_i) = \left[1,\, z_i^{-1},\, z_i^{-2},\, \ldots,\, z_i^{-(N_h-1)}\right]^{T},
\]
\[
\mathbf{c}_g(z_i) = \left[1,\, z_i^{-1},\, z_i^{-2},\, \ldots,\, z_i^{-(N_g-1)}\right]^{T} \tag{6}
\]

contain the associated delays.

The input signal x(n), with z-transform X(z_i), in Figure 1 is simultaneously passed through all AFB channel filters H_l(z_i), l = 0, ..., I − 1, and subsequently downsampled by a factor of M, yielding

\[
X_l(z_{sn}) = \frac{1}{M}\sum_{k=0}^{M-1} H_l\!\left(z_{sn}^{1/M} W_M^{k}\right) X\!\left(z_{sn}^{1/M} W_M^{k}\right),\quad l = 0, 1, \ldots, I-1, \tag{7}
\]

using the alias component representation [5, 6], with W_M = e^(−j2π/M). In the SFB, the M-fold upsampled subband signals X_l(z_i^M) = X_l(z_{sn}) are combined to form the z-domain output signal representation [6]:

\[
Y(z_i) = \sum_{l=0}^{I-1} G_l(z_i)\, X_l\!\left(z_i^{M}\right). \tag{8}
\]

Inserting the upsampled form of (7) into (8), with z_{sn} = z_i^M, we obtain

\[
\begin{aligned}
Y(z_i) &= \sum_{l=0}^{I-1} G_l(z_i)\left[\frac{1}{M}\sum_{k=0}^{M-1} H_l\!\left(z_i W_M^{k}\right) X\!\left(z_i W_M^{k}\right)\right]\\
&= \frac{1}{M}\sum_{k=0}^{M-1} X\!\left(z_i W_M^{k}\right)\left[\sum_{l=0}^{I-1} H_l\!\left(z_i W_M^{k}\right) G_l(z_i)\right].
\end{aligned}\tag{9}
\]

Obviously, the output signal representation Y(z_i) depends on all M modulation components X(z_i W_M^k), k = 0, ..., M − 1, of the input signal. All these components are filtered by the compound term \(\sum_{l=0}^{I-1} H_l(z_i W_M^{k})\, G_l(z_i)\) and eventually combined. The transfer function of the zeroth (k = 0) modulation component is generally denoted as the distortion function [5, 6]:

\[
F_{\text{dist}}(z_i) = \frac{1}{M}\left[\sum_{l=0}^{I-1} H_l(z_i)\, G_l(z_i)\right]. \tag{10}
\]

In our approach this distortion function determines the properties of the FBP almost exclusively, since aliasing and imaging are assumed to be negligible as a result of sufficiently high AFB and SFB prototype filter stopband attenuation. Inserting (3) and (4) into (10), we obtain

\[
F_{\text{dist}}(z_i) = \frac{1}{M}\left[\sum_{l=0}^{I-1} H\!\left(z_i W_I^{l}\right) G\!\left(z_i W_I^{l}\right)\right]. \tag{11}
\]

Next, the properties of the distortion function (11) are investigated in the time domain.

2.2. Time-Domain Analysis. In this section we present a novel time-domain interpretation of the distortion function. We begin with introducing the FBP impulse response

\[
s(n) = h(n) * g(n) \;\xleftrightarrow{\;\mathcal{Z}\;}\; S(z_i) = H(z_i)\cdot G(z_i). \tag{12}
\]

The length of s(n) is given by [8, 20]

\[
N_s = N_h + N_g - 1. \tag{13}
\]

The distortion function (11) is reformulated by introducing (12) and by using (1):

\[
F_{\text{dist}}(z_i) = \frac{O}{I}\sum_{l=0}^{I-1} S\!\left(z_i W_I^{l}\right) = O \cdot S_0^{(I)}\!\left(z_i^{I}\right). \tag{14}
\]


In time-domain this relation is expressed as

\[
f_{\text{dist}}(n) = O \cdot s_0^{(I)}(n) = \begin{cases} O \cdot s(mI), & \forall\, n = mI,\ m \in \mathbb{N},\\ 0, & \forall\, n \neq mI,\ m \in \mathbb{N}. \end{cases}\tag{15}
\]

As a result, the distortion function represents the z-transform of the zeroth polyphase component of s(n) [5, 6]. Considering (13) and (15), the distortion function in the time domain, f_dist(n), has only ⌊(N_s − 1)/I⌋ + 1 nonzero terms, which leads to the z-domain representation of the distortion function:

\[
F_{\text{dist}}(z_i) = O\sum_{m=0}^{\lfloor (N_s-1)/I \rfloor} s(mI)\, z_i^{-mI}, \tag{16}
\]

where all zero values of the FBP impulse response s(n) according to (15) have been discarded. Hence, the corresponding discrete-time Fourier transform of (16) is 2π/I-periodic, where I is the number of FBP channels. As a consequence, both the magnitude response and the group delay response of the FBP distortion function possess this 2π/I-periodicity property.
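To make the time-domain relations (12)–(16) concrete, the following short Python sketch (an illustration added here, not part of the original contribution; the prototype coefficients are arbitrary placeholders) forms the FBP impulse response s(n), keeps its zeroth I-fold polyphase component, and scales it by O according to (15):

```python
import numpy as np

def distortion_response(h, g, I, M):
    """Time-domain distortion function f_dist(n) of an oversampling
    complex-modulated FBP according to (12)-(16)."""
    O = I // M                     # integer oversampling factor, cf. (1)
    s = np.convolve(h, g)          # FBP impulse response s(n) = h(n) * g(n), (12)
    f_dist = np.zeros_like(s)
    f_dist[::I] = O * s[::I]       # zeroth I-fold polyphase component, (15)
    return f_dist

# Placeholder prototypes of the lengths used later in Section 5 (I = 64, M = 16)
rng = np.random.default_rng(0)
h = rng.standard_normal(90)        # stands in for the AFB prototype filter
g = rng.standard_normal(152)       # stands in for the SFB prototype filter
f_dist = distortion_response(h, g, I=64, M=16)

# At most floor((N_s - 1)/I) + 1 samples are nonzero; here N_s = 241, i.e. four.
print(np.count_nonzero(f_dist))
```

The four nonzero samples obtained with these lengths correspond to the situation of the design example in Section 5 (Figure 6).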

2.3. Potential Delays. Next, we show that the mean group delay of a uniform I-channel complex-modulated oversampling FBP is restricted to integer multiples of the number I of FBP channels. This is an inherent characteristic of a complex-modulated filter bank and was first shown in [16]. Exploiting the above 2π/I-periodicity, we define the mean group delay:

\[
\bar{\tau}_g^{\text{dist}} = \frac{I}{2\pi}\int_{-\pi/I}^{\pi/I} \tau_g^{\text{dist}}\!\left(\Omega^{(i)}\right) d\Omega^{(i)}. \tag{17}
\]

Since the distortion function (16) is conjugate symmetric:

\[
\left[F_{\text{dist}}\!\left(e^{-j\Omega^{(i)}}\right)\right]^{*} = \left[O\sum_{m=0}^{\lfloor (N_s-1)/I \rfloor} s(mI)\, e^{jmI\Omega^{(i)}}\right]^{*} = F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right), \tag{18}
\]

the group delay of the filter bank is an even function, allowing for the modification of (17):

\[
\bar{\tau}_g^{\text{dist}} = \frac{I}{\pi}\int_{0}^{\pi/I} \tau_g^{\text{dist}}\!\left(\Omega^{(i)}\right) d\Omega^{(i)}. \tag{19}
\]

Using the definition of the group delay as the negative derivative of the phase φ(Ω^(i)):

\[
\tau_g\!\left(\Omega^{(i)}\right) = -\frac{d\varphi\!\left(\Omega^{(i)}\right)}{d\Omega^{(i)}}, \tag{20}
\]

it follows from (19) that

\[
\bar{\tau}_g^{\text{dist}} = -\frac{I}{\pi}\left[\varphi\!\left(\Omega^{(i)}\right)\right]_{0}^{\pi/I} = \frac{I}{\pi}\left[\varphi(0) - \varphi\!\left(\frac{\pi}{I}\right)\right]. \tag{21}
\]

To determine the phase response at Ω^(i) = 0 and Ω^(i) = π/I we use (16) for z_i = e^{j0} = 1:

\[
F_{\text{dist}}\!\left(e^{j0}\right) = O\sum_{m=0}^{\lfloor (N_s-1)/I \rfloor} s(mI) \in \mathbb{R}, \tag{22}
\]

which is real-valued according to (12), since we only consider real analysis and synthesis prototype filters. Moreover, the magnitude of the distortion function is supposed to be approximately unity, F_dist(e^{j0}) ≈ 1 · e^{jφ_dist(0)} ∈ ℝ; therefore the phase response at zero frequency is

\[
\varphi_{\text{dist}}(0) = \kappa_1 \cdot \pi, \quad \kappa_1 \in \mathbb{Z}. \tag{23}
\]

With the same considerations for z_i^I = e^{j(π/I)I} = −1, we get

\[
F_{\text{dist}}\!\left(e^{j\pi/I}\right) = O\sum_{m=0}^{\lfloor (N_s-1)/I \rfloor} s(mI)\cdot(-1)^{m} \in \mathbb{R}. \tag{24}
\]

Since at this frequency the distortion function is again real-valued and approximately unity, F_dist(e^{j(π/I)}) ≈ 1 · e^{jφ_dist(π/I)} ∈ ℝ, we conclude that

\[
\varphi_{\text{dist}}\!\left(\frac{\pi}{I}\right) = \kappa_2 \cdot \pi, \quad \kappa_2 \in \mathbb{Z}. \tag{25}
\]

Combining the results (23) and (25) with (21) yields

\[
\bar{\tau}_g^{\text{dist}} = \frac{I}{\pi}\left[\kappa_1\pi - \kappa_2\pi\right] = (\kappa_1 - \kappa_2)\cdot I. \tag{26}
\]

This result states that a complex-modulated filter bank can only approximate delays of the form τ_g = κ · I, κ ∈ ℤ.

Finally, we present a system-theoretic interpretation of the fact that the overall group delay is restricted to τ_g = κ · I. In the following, all examinations of the distortion function are performed in the time domain using f_dist(n). According to (15) the distortion function represents the zeroth polyphase component of s(n) = h(n) ∗ g(n) with the prototype filters (3) and (4). Therefore f_dist(n) has only ⌊(N_s − 1)/I⌋ + 1 nonzero terms, which are located at indices that are integral multiples of I.

We begin with an ideal distortion function with constant magnitude response and linear phase. Since the distortion function in the time domain can be seen as the impulse response of an FIR filter, the above demand is equivalent to the demand for an FIR allpass. According to the theory of FIR filters this can only be achieved by a simple delay [8, 20, 21]. Therefore all nonzero terms of f_dist(n) have to be zero except for one. The resulting distortion function is

\[
F_{\text{dist}}(z_i) = z_i^{-dI}, \quad d \in \mathbb{N}. \tag{27}
\]

Hence, under ideal allpass conditions, the delay of a uniform complex-modulated FBP is restricted to integer multiples of the number I of channels.

Next we relax the demand for an exactly constant magnitude response and ask only for an exactly linear phase. The nonzero terms of f_dist(n) must exhibit a symmetry in order to impose a linear-phase distortion function. For illustration,


we start with a simple example assuming an odd length: ⌊(N_s − 1)/I⌋ + 1 = 3. To gain a better overview, the nonzero terms of f_dist(n) are put into a vector:

\[
\mathbf{f}_{\text{dist}} = [\varepsilon,\, 1,\, \varepsilon]^{T}, \tag{28}
\]

where ε ≪ 1 is provided. The distortion function is the discrete-time Fourier transform of the above expression:

\[
\begin{aligned}
F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right) &= \varepsilon + e^{-j\Omega^{(i)} I} + \varepsilon\, e^{-j2\Omega^{(i)} I}\\
&= e^{-j\Omega^{(i)} I}\left[1 + 2\varepsilon\cos\!\left(\Omega^{(i)} I\right)\right].
\end{aligned}\tag{29}
\]

As a result, the constant group delay of

\[
\tau_g^{\min} = I\cdot\frac{\left(\lfloor (N_s-1)/I \rfloor + 1\right) - 1}{2} = I \tag{30}
\]

is obtained, while the magnitude response of (16) varies in the vicinity of unity:

\[
1 - 2\varepsilon \le \left|F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right)\right| \le 1 + 2\varepsilon. \tag{31}
\]

Since ε ≪ 1, the distortion function approximates a linear-phase allpass function sufficiently well. Similar results can be obtained with any even order ⌊(N_s − 1)/I⌋ (odd length) of the downsampled distortion function (16). From the theory of linear-phase FIR filters it is well known [8, 20, 21] that the zero-phase frequency responses of even-length symmetric FIR filters always possess at least a single zero at f = f_i/I (z_i^I = −1). All antimetric linear-phase FIR filters are likewise unusable, since they have zero transfer at zero frequency (z_i^I = 1). Hence, in case of exactly linear-phase distortion functions, the impulse response is restricted to even order and to positive symmetry, and the only possible group delay is given by (30).

Finally we relax the demand for an exactly linear phase and ask only for an approximately constant magnitude response and an approximately linear phase. Thus f_dist(n) is no longer restricted to be symmetric. As a result, the position d · I of the dominating coefficient of the distortion function in the time domain can again take on any value according to d ∈ {1, 2, . . . , ⌊(N_s − 1)/I⌋}, while all other coefficients at positions m ≠ d, m ∈ {1, 2, . . . , ⌊(N_s − 1)/I⌋}, must be kept close to zero by optimisation. Hence, the overall mean delay of a uniform oversampling complex-modulated FBP results in

\[
\bar{\tau}_g = d \cdot I. \tag{32}
\]

Note that the above considerations on linear-phase FIR filters likewise apply approximately.

3. Design of Low-Delay FIR Prototype Filter

In this section, we develop a procedure for the design of real-valued narrowband FIR lowpass prototype filters for the AFB. We are aiming at (i) minimum group delay both in the passband and in the transition band and (ii) meeting tight magnitude frequency response constraints for the stopband. The requirements concerning the stopband attenuation can vary with frequency. In particular, we look for a unique solution that yields the globally optimum design. To this end, we introduce for the first time a convex objective function for group delay minimisation, whereas the magnitude requirements are used as design constraints.

3.1. Objective Function. Subsequently, a convex objective function for the group delay minimisation of narrowband FIR filters is developed that delivers the desired globally optimum design result. To begin with, let us use the polar coordinate representation of (3):

\[
H\!\left(e^{j\Omega^{(i)}}\right) = \left|H\!\left(e^{j\Omega^{(i)}}\right)\right|\cdot e^{j\varphi(\Omega^{(i)})}, \tag{33}
\]

where φ(Ω^(i)) describes the phase of the FIR filter frequency response [20].

By calculating the first derivative of the frequency response, as given by both (33) and (3), with respect to the normalised frequency Ω^(i), we obtain a relation that contains the group delay in one of its summands:

\[
j\,\frac{dH\!\left(e^{j\Omega^{(i)}}\right)}{d\Omega^{(i)}} = \tau_g\!\left(\Omega^{(i)}\right) H\!\left(e^{j\Omega^{(i)}}\right) + j\,\frac{d\left|H\!\left(e^{j\Omega^{(i)}}\right)\right|}{d\Omega^{(i)}}\, e^{j\varphi(\Omega^{(i)})}. \tag{34}
\]

Note that (34) is equivalently represented in the time domain according to the differentiation-in-frequency property of the discrete-time Fourier transform:

\[
h_{\text{deriv}}(n) = n\cdot h(n) \;\xleftrightarrow{\;\text{DTFT}\;}\; j\,\frac{dH\!\left(e^{j\Omega^{(i)}}\right)}{d\Omega^{(i)}}. \tag{35}
\]

Next we apply the generalized Parseval’s theorem which is [8]

\[
\sum_{n=-\infty}^{\infty} x(n)\, y^{*}(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X\!\left(e^{j\Omega^{(i)}}\right) Y^{*}\!\left(e^{j\Omega^{(i)}}\right) d\Omega^{(i)}. \tag{36}
\]

On the left side of (36) we substitute x(n) = h_deriv(n) = n · h(n) and y(n) = h(n). On the right side the corresponding terms in the frequency domain are inserted, with X(e^{jΩ^(i)}) given by (34) and Y(e^{jΩ^(i)}) by (33). We get

\[
\sum_{n=0}^{N_h-1} n\, h(n)^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left(\tau_g\!\left(\Omega^{(i)}\right)\left|H\!\left(e^{j\Omega^{(i)}}\right)\right|^2 + j\,\frac{d\left|H\!\left(e^{j\Omega^{(i)}}\right)\right|}{d\Omega^{(i)}}\left|H\!\left(e^{j\Omega^{(i)}}\right)\right|\right)d\Omega^{(i)}. \tag{37}
\]

Using the fact that the derivative of an even function is an odd function, the integral over the imaginary part of the


integrand in (37) is zero, since the integration interval is symmetric:

\[
\sum_{n=0}^{N_h-1} n\, h(n)^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} \tau_g\!\left(\Omega^{(i)}\right)\left|H\!\left(e^{j\Omega^{(i)}}\right)\right|^2 d\Omega^{(i)}. \tag{38}
\]

Obviously, the rather sophisticated integral corresponds in the time domain to a simple sum. This formula was first introduced in [20].

Next, we prove that (38) possesses all the characteristics the objective function was asked for in the last section. This is best shown by examining the following theoretical constrained optimization problem:

\[
\min_{\mathbf{h}}\; \frac{1}{2\pi}\int_{-\pi}^{\pi} \tau_g\!\left(\Omega^{(i)}, \mathbf{h}\right)\left|H\!\left(e^{j\Omega^{(i)}}, \mathbf{h}\right)\right|^2 d\Omega^{(i)}, \quad \text{s.t. } \mathbf{h}\in X, \tag{39}
\]

where X is supposed to be the set of all lowpass filters of length N_h with a distinctive passband (i.e., negligible ripple) and a very narrow transition band. Moreover, lowpass filters in X are supposed to have a high stopband attenuation. The set X allows us to simplify the right side of (38) and makes it possible to explain its functionality. Due to the second power of the magnitude frequency response and the assumed high stopband attenuation of the filters in X, the integrand τ_g(Ω^(i)) · |H(e^{jΩ^(i)})|² is nearly zero in the stopband. The magnitude frequency response is, as a consequence of the negligible ripple, nearly one throughout the passband. Finally, due to the assumed very narrow transition band, (39) can be simplified in the following way:

\[
\min_{\mathbf{h}}\; \frac{1}{2\pi}\int_{0}^{\Omega_d^{(i)}} \tau_g\!\left(\Omega^{(i)}\right) d\Omega^{(i)}, \quad \text{s.t. } \mathbf{h}\in X. \tag{40}
\]

It is evident from (40) that, by minimizing the objective function, the area bounded by the group delay in the passband is minimized. Minimizing this area results in minimizing the group delay itself in the passband, which is our main purpose. Moreover, minimizing the area beneath the group delay yields a smoothing effect. In the stopband the group delay is apparently not minimized at all. Therefore the stopband can be regarded as a "do not care" region, thus increasing the available degrees of freedom. Next we look at more realistic filters which do not exhibit negligible transition bands. In this case the second power of the magnitude frequency response in (39) acts in the transition band as a real-valued weighting function for the group delay. This guarantees that, in the transition band, the group delay is minimized most strongly close to the passband edge and least strongly close to the stopband edge.

One of the objective function's strongest points is its simple formulation in the time domain, as seen on the left-hand side of (38). This sum can readily be expressed by a quadratic form:

\[
\sum_{n=0}^{N_h-1} n\, h(n)^2 = \mathbf{h}^{T}\,\mathbf{D}_{N_h}\,\mathbf{h}. \tag{41}
\]

The N_h × N_h diagonal matrix D_{N_h} has the following form:

\[
\mathbf{D}_{N_h} = \operatorname{diag}(0, 1, \ldots, N_h - 1) = \begin{pmatrix} 0 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & N_h - 1 \end{pmatrix}. \tag{42}
\]

This matrix is positive semidefinite, which implies the convexity of the objective function. Hence the gradient and the Hessian matrix, both important for search methods, can be obtained very easily.
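As a quick numerical cross-check of (38), (41), and (42) (a small sketch added for illustration, not taken from the paper; an arbitrary test filter is used), the weighted group-delay integral can be compared with the time-domain sum and the quadratic form:

```python
import numpy as np
from scipy.signal import freqz, group_delay

rng = np.random.default_rng(1)
h = rng.standard_normal(32)              # arbitrary real-valued test filter

# Left-hand side of (38) and the quadratic form (41) with D from (42)
n = np.arange(len(h))
lhs = np.sum(n * h**2)
D = np.diag(n)
quad = h @ D @ h

# Right-hand side of (38): (1/2pi) * integral of tau_g * |H|^2 over [-pi, pi].
# tau_g and |H| are even for a real h, so it suffices to integrate over [0, pi).
w, H = freqz(h, worN=1 << 14)
_, tau_g = group_delay((h, [1.0]), w=w)
rhs = np.sum(tau_g * np.abs(H)**2) * (w[1] - w[0]) / np.pi

print(lhs, quad, rhs)                    # agree up to the discretisation error
```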

3.2. Constraints. In this section we present functions to set up constraints for the optimization problem. These functions enable us to meet the given magnitude frequency response specifications during the optimization. We show that all functions are convex and, in combination with the introduced convex objective function, yield a convex optimization problem.

3.2.1. Passband. Narrow-band lowpass filters usually do not exhibit a distinctive passband. In order to obtain a narrow-band lowpass filter it is sufficient to ask for

\[
\left|H\!\left(e^{j\Omega^{(i)}}, \mathbf{h}\right)\right|_{\Omega^{(i)}=0} = 1, \tag{43}
\]

which is accomplished by formulating an equality constraint. Using the relation H(e^{j0}, h) = ±|H(e^{j0}, h)|, where the minus sign can be understood as a special case only [8], we reformulate the above constraint function by using (3) evaluated at Ω^(i) = 0:

\[
H\!\left(e^{j0}\right) = \mathbf{c}_h^{T}\!\left(e^{j0}\right)\cdot\mathbf{h}. \tag{44}
\]

The vector c_h(e^{j0}) is equivalent to the one-vector, which is defined as 1 := (1, . . . , 1)^T. The linear (in h) equality constraint for the passband can thus be stated as follows:

\[
\mathbf{1}^{T}\cdot\mathbf{h} = 1. \tag{45}
\]

Since the term (44) is a linear function of h, the convexity of the search space defined by the constraints is ensured.

3.2.2. Stopband. The magnitude frequency response specifications in the stopband are defined by a tolerance mask. To this end a nonnegative tolerance value function Δ(Ω^(i)) ≥ 0 is defined, which determines the allowed maximum deviation:

\[
\left|H\!\left(e^{j\Omega^{(i)}}\right)\right| \le \Delta\!\left(\Omega^{(i)}\right) \quad \forall\, \Omega^{(i)} \in B_s. \tag{46}
\]

The tolerance mask is defined on the region of support B_s, the union of all stopbands, which is a subset of the bounded interval [0, π]. This accounts for the symmetry of the frequency response of real-valued filters [8]. Regions of the bounded interval [0, π] where no tolerance


mask is defined are called "do not care" regions. The definition of the tolerance value function Δ(Ω^(i)) according to (46) can be used to formulate the remaining constraints. By using the following relation between the magnitude and the real part of a complex number z_i, also known as the real rotation theorem [15, 16],

\[
|z_i| = \max_{\theta\in[0,2\pi)}\left[\operatorname{Re}\!\left\{z_i\, e^{-j\theta}\right\}\right] \ge \operatorname{Re}\!\left\{z_i\, e^{-j\theta}\right\}, \quad \forall\, \theta\in[0,2\pi), \tag{47}
\]

and applying it to the magnitude frequency response, we obtain

\[
\left|H\!\left(e^{j\Omega^{(i)}}, \mathbf{h}\right)\right| \ge \operatorname{Re}\!\left\{H\!\left(e^{j\Omega^{(i)}}, \mathbf{h}\right) e^{-j\theta}\right\} = \sum_{n=0}^{N_h-1} h(n)\cos\!\left(\Omega^{(i)} n + \theta\right), \quad \forall\, \theta\in[0,2\pi). \tag{48}
\]

This term again is a linear function of h and can be written down using the vector representation as follows:

\[
\left|H\!\left(e^{j\Omega^{(i)}}, \mathbf{h}\right)\right| \ge \mathbf{c}^{T}\!\left(\Omega^{(i)}, \theta\right)\cdot\mathbf{h}, \quad \forall\, \theta\in[0,2\pi), \tag{49}
\]

where

\[
\mathbf{c}\!\left(\Omega^{(i)}, \theta\right) = \left[\cos(\theta),\, \cos\!\left(\Omega^{(i)} + \theta\right),\, \ldots,\, \cos\!\left([N_h-1]\,\Omega^{(i)} + \theta\right)\right]^{T}, \tag{50}
\]

depends not only on the frequency Ω^(i) but also on the additional parameter θ. Using (49), the inequality constraint can be stated as follows:

\[
\mathbf{c}^{T}\!\left(\Omega^{(i)}, \theta\right)\cdot\mathbf{h} \le \Delta\!\left(\Omega^{(i)}\right), \quad \forall\, \Omega^{(i)}\in B_s,\ \theta\in[0,2\pi). \tag{51}
\]

We see that the region defined by the above inequality constraint is convex due to the linearity of the left-hand term in h. Please note that the number of constraints in the stopband in the original formulation is infinite with respect to the frequency Ω^(i). In the linearized version according to (51) a second infinite parameter θ appears, which is induced by the real rotation theorem. Thus the constraints are now infinite with respect to both Ω^(i) and θ.

3.3. Constrained Optimization Problem. In this section the convex objective function (41) and the convex constraints (45) and (51) are used to build up a convex constrained optimization problem. Since all constraint functions used are linear in h, the so-called constraint qualification is always satisfied. The problem can readily be formulated in the following way:

\[
\begin{aligned}
\min_{\mathbf{h}}\quad & \mathbf{h}^{T}\,\mathbf{D}_{N_h}\,\mathbf{h}\\
\text{s.t.}\quad & \mathbf{1}^{T}\mathbf{h} = 1 \quad (\Omega^{(i)} = 0),\\
& \mathbf{c}^{T}\!\left(\Omega^{(i)},\theta\right)\mathbf{h} \le \Delta\!\left(\Omega^{(i)}\right),\quad \forall\,\Omega^{(i)}\in B_s,\ \forall\,\theta\in[0,2\pi).
\end{aligned}\tag{52}
\]

Due to the fact that the objective function is a quadratic function and the number of constraints is infinite, the overall optimization problem is called a convex quadratic semi-infinite optimization problem. The term semi-infinite implies a finite number of unknowns h yet an infinite number of constraints.

To obtain a computable algorithm, the number of constraints has to be reduced to a finite number. The mere discretization of Ω^(i) in the following way:

\[
\Omega_k^{(i)} = \frac{\pi}{N_{\text{FFT}}}\, k, \quad k = 0, 1, \ldots, N_{\text{FFT}} \tag{53}
\]

is not sufficient for obtaining a finite optimization problem, since the additional parameter θ still remains continuous. Therefore θ has to be discretized as well:

\[
\theta_i = \frac{\pi}{p}\, i, \quad i = 0, 1, \ldots, 2p-1,\ p \ge 2. \tag{54}
\]

The number of discretization points of θ is restricted to even values.

With these discretizations the infinite problem becomes a finite one and can be stated as follows:

\[
\begin{aligned}
\min_{\mathbf{h}}\quad & \mathbf{h}^{T}\,\mathbf{D}_{N_h}\,\mathbf{h}\\
\text{s.t.}\quad & \mathbf{1}^{T}\mathbf{h} = 1,\\
& \mathbf{c}^{T}\!\left(\Omega_k^{(i)},\theta_i\right)\mathbf{h} \le \Delta\!\left(\Omega_k^{(i)}\right),\quad \Omega_k^{(i)}\in B_s,\ k = 0, \ldots, L,\ i = 0, \ldots, 2p-1,
\end{aligned}\tag{55}
\]

where the inequality constraints are imposed for every discretized stopband frequency Ω_0^(i), …, Ω_L^(i) and every rotation angle θ_0, …, θ_{2p−1}.


Table 1: Maximum error over p.

p                                2        4        8        16       32
−20·log10(cos(π/(2p))) / dB      3.0103   0.6877   0.1685   0.0419   0.0105

The price one has to pay for the linearization is the large number of inequality constraints in the stopband, as pointed out in (55). The overall number of inequality constraints amounts to 2 · p · L.

The maximum error depends on the factor p: the larger p is, the smaller the maximum error becomes. Table 1 shows the worst deviation from the constraints for some common values of p.
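For orientation, a compact sketch of how the discretised problem (55) can be set up and solved is given below. It is written in Python with SciPy as a generic stand-in for the fmincon-based implementation mentioned in Section 5; the filter length, the stopband edge, and the flat 60 dB tolerance mask are placeholder assumptions and not the specifications of [7].

```python
import numpy as np
from scipy.optimize import minimize

def design_low_delay_prototype(Nh, omega_stop, delta, n_freq=128, p=4):
    """Sketch of the convex QP (55): minimise h^T D h subject to the
    passband constraint (45) and the linearised stopband constraints (51),
    discretised according to (53) and (54)."""
    n = np.arange(Nh)
    D = np.diag(n)                                    # objective matrix, (42)

    omegas = np.linspace(omega_stop, np.pi, n_freq)   # stopband grid B_s (one stopband assumed)
    thetas = np.pi / p * np.arange(2 * p)             # rotation angles, (54)

    # Stacked rows c^T(omega, theta) of the inequality constraints (51)
    C = np.array([np.cos(om * n + th) for om in omegas for th in thetas])
    d = np.repeat([delta(om) for om in omegas], 2 * p)

    cons = [
        {"type": "eq",   "fun": lambda h: np.sum(h) - 1.0},   # (45): H(1) = 1
        {"type": "ineq", "fun": lambda h: d - C @ h},         # (51): C h <= d
    ]
    res = minimize(lambda h: h @ D @ h, np.full(Nh, 1.0 / Nh),
                   jac=lambda h: 2 * D @ h, constraints=cons,
                   method="SLSQP", options={"maxiter": 500, "ftol": 1e-12})
    return res.x

# Hypothetical specification: length 48, stopband from 0.2*pi with 60 dB attenuation
h = design_low_delay_prototype(Nh=48, omega_stop=0.2 * np.pi,
                               delta=lambda om: 10 ** (-60 / 20))
```

Because the objective is a positive semidefinite quadratic form and all constraints are linear in h, any local solution returned by such a solver is a global optimum of (55).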

4. Design of Low-Delay FIR Filter Bank Pair

In this section a method to design a prototype filter for the SFB is introduced. The main objective lies in obtaining a distortion function of an oversampling I-channel complex-modulated filter bank according to (11) which, independently of frequency, nearly equals a constant delay. At the same time the constant delay is supposed to be the smallest possible one, as derived in Section 2. Please note that all requirements regarding the distortion function are met on the synthesis filter bank side only. We use the deviation of the distortion function from a suitable desired distortion function as the objective function, instead of minimizing the group delay of the distortion function in a way similar to minimizing the group delay of an FIR prototype filter in Section 3. The real-valued SFB prototype filter has to meet given magnitude frequency response specifications for the stopband. Since these constraints agree with those of the previous algorithm, the convex formulation in (51) can be used. Therefore only the objective function in (55) has to be modified.

4.1. Objective Function. In this section we present a convex objective function which minimizes the error between the distortion function and the desired distortion function during the optimization. In combination with the convex constraints in (51) it guarantees unique solutions.

The distortion function (16) depends on both the AFB and the SFB prototype filter, as shown in Section 2. However, the coefficients of the AFB prototype filter are regarded as constants in this design step, since they are fixed to the design result obtained with the first algorithm. Therefore the distortion function depends only on the SFB prototype filter: F_dist(e^{jΩ^(i)}, g). Below, the dependence of the distortion function on g is pointed out only if required; otherwise we write F_dist(e^{jΩ^(i)}).

As discussed in Section 2, the group delay of the distortion function of oversampling complex-modulated filter banks is restricted to integral multiples of the number of channels I only. For this reason the desired distortion function can be defined as follows:

\[
F_{\text{dist,desired}}\!\left(e^{j\Omega^{(i)}}\right) = e^{-j\kappa I\Omega^{(i)}}, \tag{56}
\]

where κ ∈ ℕ, κ > 0. We exclude the trivial case κ = 0, since it is not realisable for causality reasons [6]. By using the L2-norm, the objective function can be formulated as follows:

\[
\int_{-\pi}^{\pi}\left|F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}, \mathbf{g}\right) - e^{-j\kappa I\Omega^{(i)}}\right|^2 d\Omega^{(i)}. \tag{57}
\]

In order to obtain the lowest possible group delay, first the smallest possible κ is selected, namely, κ = 1. In case of unsatisfactory results, κ is increased gradually until the desired result is achieved.

4.2. Practical Implementation. Next we want to set up an objective function which can directly be implemented in numerical analysis programs like Matlab or Mathematica. To this end the integrand in (57) is reformulated in the following way:

\[
\begin{aligned}
\left|F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right) - e^{-j\kappa I\Omega^{(i)}}\right|^2
&= \left(F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right) - e^{-j\kappa I\Omega^{(i)}}\right)\left(F_{\text{dist}}^{*}\!\left(e^{j\Omega^{(i)}}\right) - e^{j\kappa I\Omega^{(i)}}\right)\\
&= \left|F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right)\right|^2 - 2\operatorname{Re}\!\left\{e^{j\kappa I\Omega^{(i)}}\, F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right)\right\} + 1.
\end{aligned}\tag{58}
\]

Reinserted in (57), we get an expression consisting of three separate integrals:

\[
\int_{-\pi}^{\pi}\left|F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right)\right|^2 d\Omega^{(i)}
- 2\int_{-\pi}^{\pi}\operatorname{Re}\!\left\{e^{j\kappa I\Omega^{(i)}}\, F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right)\right\} d\Omega^{(i)}
+ \underbrace{\int_{-\pi}^{\pi} d\Omega^{(i)}}_{2\pi}. \tag{59}
\]

By applying Parseval's theorem to the left integral in (59), we get a formula which allows us to determine the value of the integral in the time domain [8]:

\[
\int_{-\pi}^{\pi}\left|F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}, \mathbf{g}\right)\right|^2 d\Omega^{(i)} = 2\pi\sum_{k=0}^{N_s-1} f_{\text{dist}}^{2}(k). \tag{60}
\]

By inserting (15) into the right side of the above expression, the sum can be stated as follows:

\[
\int_{-\pi}^{\pi}\left|F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}, \mathbf{g}\right)\right|^2 d\Omega^{(i)} = 2\pi O^2\sum_{k=0}^{N_s-1}\left[s_0^{(I)}(k)\right]^2. \tag{61}
\]

Furthermore, we omit all indices k ≠ mI, since the corresponding terms are zero according to (15). The remaining sum is replaced by a


weighted scalar product of two vectors:

\[
\int_{-\pi}^{\pi}\left|F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}, \mathbf{g}\right)\right|^2 d\Omega^{(i)} = 2\pi O^2\sum_{m=0}^{\lfloor (N_s-1)/I \rfloor} s^2(mI) = 2\pi O^2\left(\mathbf{s}^{T}\cdot\mathbf{s}\right). \tag{62}
\]

The components of the vector s consist of the convolution s(k) = h(k) ∗ g(k) evaluated at the indices mI, m = 0, . . . , ⌊(N_s − 1)/I⌋, as shown below:

\[
\mathbf{s} = \left(s(0),\, s(I),\, s(2I),\, \ldots,\, s\!\left(\left\lfloor\frac{N_s-1}{I}\right\rfloor I\right)\right)^{T}. \tag{63}
\]

Let us have a closer look at s(mI), which according to (12) is

\[
s(mI) = \sum_{k=0}^{N_g-1} h(mI - k)\, g(k). \tag{64}
\]

Remember that the coefficients of the AFB prototype filter h(k) are considered to be constants in the current step. Besides, h(k) is a causal FIR filter (finite length); therefore h(mI − k) in (64) is nonzero only if the following two conditions are fulfilled:

\[
mI - k \ge 0 \;\Longrightarrow\; k \le mI, \qquad mI - k \le N_h - 1 \;\Longrightarrow\; k \ge mI - N_h + 1. \tag{65}
\]

Therefore all redundant zero-multiplications in s(mI) are left out by taking the above inequalities into account:

\[
s(mI) = \sum_{n=\max\{0,\, mI-N_h+1\}}^{\min\{N_g-1,\, mI\}} h(mI - n)\, g(n), \tag{66}
\]

and, by applying the vector/matrix representation, can be stated as follows:

\[
s(mI) = \mathbf{k}_h^{T}(m)\cdot\mathbf{g}. \tag{67}
\]

The vector k_h(m) ∈ ℝ^{N_g} depends on the index m and has dimension N_g. Its components are made up of h(mI − k) for all indices k which are included in the sum (66). The components which correspond to the remaining indices are simply set to zero, as shown below:

\[
\left[\mathbf{k}_h(m)\right]_k = \begin{cases} h(mI - k), & \max\{0,\, mI - N_h + 1\} \le k \le \min\{N_g - 1,\, mI\},\\ 0, & \text{otherwise.} \end{cases}\tag{68}
\]

Now the components s(mI) in (63) are replaced by using (67):

\[
\mathbf{s} = \begin{pmatrix} s(0)\\ s(I)\\ s(2I)\\ \vdots\\ s\!\left(\left\lfloor\frac{N_s-1}{I}\right\rfloor I\right) \end{pmatrix}
= \underbrace{\begin{pmatrix} \mathbf{k}_h^{T}(0)\\ \mathbf{k}_h^{T}(1)\\ \mathbf{k}_h^{T}(2)\\ \vdots\\ \mathbf{k}_h^{T}\!\left(\left\lfloor\frac{N_s-1}{I}\right\rfloor\right) \end{pmatrix}}_{\mathbf{K}\,\in\,\mathbb{R}^{(\lfloor (N_h+N_g-2)/I \rfloor + 1)\times N_g}}\cdot\;\mathbf{g}. \tag{69}
\]

The vector g is pulled out as indicated above, and the remaining entries are combined into the matrix K of dimension (⌊(N_h + N_g − 2)/I⌋ + 1) × N_g. Please note that K consists only of time-reversed and shifted AFB coefficients h(k). Its dimension depends on both the lengths of the AFB and SFB prototype filters (i.e., N_h and N_g) and the number of channels I.

Now the vector s in (62) is replaced by (69). We obtain a quadratic form in g:

\[
\int_{-\pi}^{\pi}\left|F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}, \mathbf{g}\right)\right|^2 d\Omega^{(i)} = 2\pi O^2\, \mathbf{g}^{T}\left(\mathbf{K}^{T}\mathbf{K}\right)\mathbf{g}. \tag{70}
\]

The second integral in (59) is

\[
2\int_{-\pi}^{\pi}\operatorname{Re}\!\left\{e^{j\kappa I\Omega^{(i)}}\, F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right)\right\} d\Omega^{(i)}. \tag{71}
\]

According to the inverse discrete-time Fourier transform of a discrete signal x(k) evaluated explicitly for the zeroth coefficient [8],

\[
x(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X\!\left(e^{j\Omega^{(i)}}\right) e^{\,j\cdot 0\cdot\Omega^{(i)}}\, d\Omega^{(i)}, \tag{72}
\]

the integral in (71) formally corresponds to

\[
4\pi\, x(0) = 2\int_{-\pi}^{\pi} X\!\left(e^{j\Omega^{(i)}}\right) d\Omega^{(i)}. \tag{73}
\]

Therefore the evaluation of the integral in (71) is reduced to the determination of the zeroth coefficient of the inverse discrete-time Fourier transform of the following expression:

\[
\operatorname{Re}\!\left\{e^{j\kappa I\Omega^{(i)}}\, F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}\right)\right\}. \tag{74}
\]

The inverse discrete-time Fourier transform of (74) can be obtained by first applying the time-shift property of the discrete-time Fourier transform [8]:

\[
x(n - n_0) \;\xleftrightarrow{\;\text{DTFT}\;}\; e^{-jn_0\Omega^{(i)}}\, X\!\left(e^{j\Omega^{(i)}}\right), \tag{75}
\]


and secondly using the fact that, in case of real-valued signals, the real part in the frequency domain corresponds to the even part in the time domain [8]:

\[
\frac{1}{2}\left[x(n) + x(-n)\right] \;\xleftrightarrow{\;\text{DTFT}\;}\; \operatorname{Re}\!\left\{X\!\left(e^{j\Omega^{(i)}}\right)\right\}. \tag{76}
\]

When applied to (74), we get

\[
\frac{1}{2}\left[f_{\text{dist}}(n + \kappa I) + f_{\text{dist}}(-n + \kappa I)\right] \;\xleftrightarrow{\;\text{DTFT}\;}\; \operatorname{Re}\!\left\{e^{j\kappa I\Omega^{(i)}}\, F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}, \mathbf{g}\right)\right\}. \tag{77}
\]

Next we use (15) to express the distortion function in the time domain as a function of the SFB prototype filter coefficients. Therefore the zeroth coefficient of the inverse DTFT of (74) is

\[
f_{\text{dist}}(\kappa I) = O \cdot s_0^{(I)}(\kappa I). \tag{78}
\]

The above term can be simplified according to (15). When (78) is inserted in (73), we get an expression for the second integral:

\[
4\pi\, O\, s(\kappa I) = 2\int_{-\pi}^{\pi}\operatorname{Re}\!\left\{e^{j\kappa I\Omega^{(i)}}\, F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}, \mathbf{g}\right)\right\} d\Omega^{(i)}. \tag{79}
\]

Finally, the second integral according to (79) is written by using (67):

\[
4\pi O\, \mathbf{k}_h^{T}(\kappa)\cdot\mathbf{g} = 2\int_{-\pi}^{\pi}\operatorname{Re}\!\left\{e^{j\kappa I\Omega^{(i)}}\, F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}, \mathbf{g}\right)\right\} d\Omega^{(i)}. \tag{80}
\]

The convex objective function in (57) is readily formulated as a quadratic function in g:

\[
\int_{-\pi}^{\pi}\left|F_{\text{dist}}\!\left(e^{j\Omega^{(i)}}, \mathbf{g}\right) - e^{-j\kappa I\Omega^{(i)}}\right|^2 d\Omega^{(i)}
= 2\pi O^2\, \mathbf{g}^{T}\left(\mathbf{K}^{T}\mathbf{K}\right)\mathbf{g} - 4\pi O\, \mathbf{k}_h^{T}(\kappa)\cdot\mathbf{g} + 2\pi. \tag{81}
\]

Please note that, since the matrix K depends only on the coefficients h, it has to be computed only once. It remains unchanged during the iterations.
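The following sketch (an added Python illustration with placeholder inputs, not the authors' code) assembles the matrix K of (69) and the vector k_h(κ) of (67)/(68) from a fixed AFB prototype h and returns the objective (81) together with its gradient; the stopband constraints on g would be added exactly as in Section 3.

```python
import numpy as np

def sfb_objective(h, Ng, I, M, kappa):
    """Quadratic SFB objective (81):
    J(g) = 2*pi*O^2 g^T (K^T K) g - 4*pi*O k_h(kappa)^T g + 2*pi."""
    Nh, O = len(h), I // M
    Ns = Nh + Ng - 1                                 # (13)
    n_rows = (Ns - 1) // I + 1                       # nonzero samples of f_dist

    # Row m of K holds h(mI - k) for the admissible k of (68), zeros elsewhere
    K = np.zeros((n_rows, Ng))
    for m in range(n_rows):
        for k in range(max(0, m * I - Nh + 1), min(Ng - 1, m * I) + 1):
            K[m, k] = h[m * I - k]

    k_h = K[kappa]                                   # k_h(kappa), cf. (67)
    Q = 2 * np.pi * O**2 * (K.T @ K)                 # computed once, then reused

    J = lambda g: g @ Q @ g - 4 * np.pi * O * (k_h @ g) + 2 * np.pi
    grad_J = lambda g: 2 * Q @ g - 4 * np.pi * O * k_h
    return J, grad_J

# Placeholder AFB prototype; I = 64, M = 16, desired overall delay kappa*I = 128
h = np.random.default_rng(2).standard_normal(90)
J, grad_J = sfb_objective(h, Ng=152, I=64, M=16, kappa=2)
```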

5. Design Example

Subsequently, we present an example for the design of a uniform oversampling complex-modulated I-channel FBP, where I = 64. The decimation factor is M = 16, resulting in an oversampling factor of O = 4.

5.1. AFB Prototype Filter. We start with the design of a narrow-band FIR low-pass AFB prototype filter with low group delay, using the algorithm described in Section 3. For the implementation of the design algorithm we used the built-in function fmincon of the Optimization Toolbox for Matlab.

Figure 2: Narrow-band FIR low-pass AFB prototype filter: (a) logarithmic magnitude frequency response; (b) group delay.

The magnitude frequency response specifications for the stopband are chosen according to the considerations made in [7]. The minimum possible filter length fulfilling the given magnitude specifications turned out to be N_h = 90. The number of frequency points L in (55) was chosen to be 1024. The number p of discretization points of the rotation angle θ in (54) was chosen to be 32, thus, according to Table 1, producing a maximum error of 0.0105 dB.

The logarithmic magnitude frequency response, along with the tolerance mask for the stopband defined in [7], is depicted in Figure 2(a). We notice that the tolerance mask is not always touched by the magnitude response. In some regions the magnitude response stays far below the allowed bound, which can be traced back to the fact that the tolerance mask is not continuous but increases and decreases stepwise.

Figure 2(b) depicts the group delay both in the passband and in the transition band.


Figure 3: Zero plot of H(z).

The group delay in the passband, which ranges from the normalized frequency Ω/Ω_S = 0 up to Ω/Ω_S = 0.25, is almost constant with a mean value of τ_g,AFB = 43.9. In the transition band, which starts at the normalized frequency Ω/Ω_S = 0.25, the group delay features a decay similar to that of the magnitude response. However, the group delay does not fall below a value of 43 up to the normalized frequency Ω/Ω_S = 0.8, such that the group delay can be regarded as constant in the region where the magnitude frequency response has not yet dropped by more than 40 dB.

As can be seen from Figure 3, there are no zeros forming the passband, which is common for narrow-band filters, since no distinct passband exists here. All zeros shape the stopband. They are distributed along the periphery of the unit circle; however, they are located slightly inside the z-plane unit circle, thus yielding a minimum-phase filter. Moreover, zeros within the unit circle contribute to the reduction of the overall group delay of the passband [8].
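The quantities discussed above (magnitude response, group delay, and zero locations as in Figures 2 and 3) can be reproduced for any designed coefficient vector with a few lines of SciPy; the snippet below is a generic evaluation sketch, not the authors' code, and the passband edge used in the comments is taken from the description of Figure 2(b).

```python
import numpy as np
from scipy.signal import freqz, group_delay

def evaluate_prototype(h, n_points=4096):
    """Magnitude (dB), group delay (samples), and zeros of an FIR prototype."""
    w, H = freqz(h, worN=n_points)                       # frequency grid on [0, pi)
    mag_db = 20 * np.log10(np.maximum(np.abs(H), 1e-12))
    tau_g = group_delay((h, [1.0]), w=w)[1]
    zeros = np.roots(h)                                  # zero plot as in Figure 3
    return w, mag_db, tau_g, zeros

# Usage with the designed AFB prototype h (not reproduced here):
# w, mag_db, tau_g, zeros = evaluate_prototype(h)
# passband = w <= 0.25 * omega_S       # passband edge of Figure 2(b); omega_S assumed known
# print(tau_g[passband].mean())        # approx. 44 samples for the reported design
# print(np.abs(zeros).max() <= 1.0)    # True for a minimum-phase prototype
```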

5.2. SFB Prototype Filter. To complete the FBP design, we present an example of the narrow-band FIR low-pass SFB prototype filter designed by the procedure in Section 4. This SFB prototype filter is matched to the AFB prototype filter of Section 5.1 such that the distortion function of the oversampling I-channel complex-modulated FIR filter bank according to (11) approximates a constant delay (LP allpass function). At the same time the constant delay is supposed to be the smallest possible one, as derived in Section 2.

The magnitude response specifications for the stopband are chosen according to the considerations made in [7]. They are stricter than those of the AFB, for reasons stated in [7]. For the implementation of the design algorithm, the built-in function fmincon of the Optimization Toolbox of Matlab is used again. The given stopband magnitude specifications are met for the filter length N_g = 152. The number of frequency points is again L = 1024.

Figure 4(a) shows the logarithmic magnitude frequency response, together with the tolerance mask for the stopband depicted in red.

Figure 4: Narrow-band SFB prototype filter: (a) logarithmic magnitude response; (b) group delay.

We notice that this time the tolerance mask is always touched by the magnitude response. This can be traced back to the fact that, for the SFB, the steps of the tolerance mask are much smaller, not exceeding 20 dB, except for the region next to the transition band.

The group delay in the passband and the transition band is depicted in Figure 4(b). The group delay is again approximately constant with the mean value τ_g,SFB = 74. This time the group delay does not decay below a value of 70 roughly up to the frequency where the magnitude response has dropped by 80 dB. The obtained SFB prototype filter is again minimum-phase, even though no demand is made in this regard according to Section 4.

Figure 5(a) illustrates the logarithmic magnitude response of the distortion function. It approximates the constant value of zero dB with a peak-to-peak deviation amounting to 1.2 dB, which lies within the tolerance limits defined in [7]. Moreover, the magnitude response exhibits the periodicity described in Section 2.


Figure 5: Distortion function: (a) logarithmic magnitude response; (b) group delay.

The group delay of the distortion function is depicted in Figure 5(b). It varies around the feasible value of 2I = 128 derived in Section 2.3 with a deviation of ±4, corresponding to ±3%.

The downsampled FBP impulse response f_dist(mI) in compliance with (15) is depicted in Figure 6 (here all zero coefficients have been omitted). The number ⌊(N_s − 1)/I⌋ + 1 of nonzero coefficients is four, since N_s according to (13) results in N_s = 241. Here the dominating coefficient is the second one, thus explaining the overall delay of 128 as shown above. Note that, due to the even number of nonzero coefficients, no exactly linear-phase distortion function is possible. The only way to approximate an allpass according to the considerations in Section 2 is by bringing the coefficients as close as possible to an ideal delay.

Figure 6: Distortion function in the time domain, f_dist(mI).

6. Conclusion

In this contribution, a new two-step approach for the design of analysis and synthesis prototype filters of oversampling uniform complex-modulated FIR filter bank pairs is proposed, such that the overall FBP signal delay is minimized and the subband signals experience short delay.

We have shown that in the time domain the distortion function can be expressed as a simple FIR filter impulse response. Based on these results, we have shown that the mean value of the group delay is restricted to integer multiples of the number of channels.

For the first design step we have introduced a novel procedure for the design of low-delay AFB FIR prototype filters with an approximately linear-phase response in the passband and the transition band. This procedure is based on convex constrained optimization, which guarantees unique solutions. To this purpose we have introduced a convex objective function for group delay minimisation, which is based on a particular representation of the group delay according to [20]. The magnitude requirements are used as design constraints: the magnitude specifications (e.g., for hearing aids those derived in [7]) serve as stopband constraints to control aliasing.

For the second and final design step, we have presented a procedure for the design of the SFB FIR prototype filter such that the overall signal delay of the FBP is minimized. This procedure is again based on convex constrained optimization, where, in analogy to the first design step, the magnitude specifications serve as stopband constraints to control imaging. Based on the theoretical results regarding the minimal feasible delays, the objective function is chosen as the deviation of the FBP distortion function from a prescribed I-fold delay. In this step, the AFB prototype filter is fixed to the design result obtained in the previous step. Furthermore, we have presented an efficient implementation of the objective function to cope with the high computational load.


Finally, we have discussed the properties of the design algorithm with reference to an example. The example shows that the prototype filters obtained using the presented procedures exhibit an almost constant group delay not only in the passband but also in the transition band. The mean value of the group delay ranges below that of linear-phase filters of the same length. The observed overall signal delay lies within the tolerances defined in [7] and approximates the feasible delay described in Section 2.

A comparison of the proposed design method with several other approaches to the design of oversampling complex-modulated filter banks is not performed here, since [22] treats this topic thoroughly, comparing the approaches by Dam et al. [15], by Stöcker et al. [23], and by Bäuml and Sörgel [24].

Future investigations will be devoted to the application of the prototype filter pair presented in this contribution to uniform, complex-modulated filter banks with additional subband signal manipulation. We will particularly investigate whether or not amplification of subband signals has an impact on the group delay of the distortion function.

References

[1] V. Hamacher, J. Chalupper, J. Eggers, et al., “Signal processing in high-end hearing aids: state of the art, challenges, and future trends,” EURASIP Journal on Applied Signal Processing, vol. 2005, no. 18, pp. 2915–2929, 2005.

[2] M. A. Stone and B. C. J. Moore, “Tolerable hearing aid delays. I. Estimation of limits imposed by the auditory path alone using simulated hearing losses,” Ear & Hearing, vol. 20, no. 3, pp. 182–192, 1999.

[3] M. A. Stone and B. C. J. Moore, “Tolerable hearing aid delays. III. Effects on speech production and perception of across-frequency variation in delay,” Ear & Hearing, vol. 24, no. 2, pp. 175–183, 2003.

[4] P. Vary, Ein Beitrag zur Kurzzeitspektralanalyse mit digitalen Systemen, Ph.D. dissertation, Universitat Erlangen-Nurnberg, Erlangen, Germany, 1978.

[5] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.

[6] H. G. Gockler and A. Groth, Multiratensysteme, Schlembach Fachverlag, Willburgstetten, Germany, 2004.

[7] D. Alfsmann, H. G. Gockler, and T. Kurbiel, “Frequency-domain magnitude constraints for oversampling complex-modulated NPR filter bank system design ensuring a prescribed signal-to-distortion ratio,” to be published.

[8] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, London, UK, 1989.

[9] T. Kurbiel, D. Alfsmann, and H. G. Gockler, “Design of highly selective quasi-equiripple FIR lowpass filters with approximately linear phase and very low group delay,” in Proceedings of the European Signal Processing Conference (EUSIPCO ’08), EURASIP, Lausanne, Switzerland, 2008.

[10] T. F. Liau, M. A. Razzak, and L. G. Cuthbert, “Phase constraints on FIR digital filters,” Electronics Letters, vol. 17, no. 24, pp. 910–911, 1981.

[11] H. Baher, “FIR digital filters with simultaneous conditions on amplitude and delay,” Electronics Letters, vol. 18, no. 7, pp. 296–297, 1982.

[12] E. Sharestani and L. G. Cuthbert, “FIR digital filters designed to a group delay characteristic,” Electronics Letters, vol. 21, no. 12, pp. 542–544, 1985.

[13] X. Zhang and T. Yoshikawa, “Design of FIR Nyquist filters with low group delay,” IEEE Transactions on Signal Processing, vol. 47, no. 5, pp. 1454–1458, 1999.

[14] M. Lang, Algorithms for the constrained design of digital filters with arbitrary magnitude and phase responses, Ph.D. dissertation, Vienna University of Technology, Vienna, Austria, June 1999.

[15] H. H. Dam, S. Nordholm, A. Cantoni, and J. M. de Haan, “Iterative method for the design of DFT filter bank,” IEEE Transactions on Circuits and Systems II, vol. 51, no. 11, pp. 581–586, 2004.

[16] H. H. Dam, S. Nordholm, and A. Cantoni, “Uniform FIR filterbank optimization with group delay specifications,” IEEE Transactions on Signal Processing, vol. 53, no. 11, pp. 4249–4260, 2005.

[17] W. Kellermann, “Analysis and design of multirate systems for cancellation of acoustic echoes,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’88), pp. 2570–2573, 1988.

[18] K. F. C. Yiu, N. Grbic, S. Nordholm, and K. L. Teo, “Multicriteria design of oversampled uniform DFT filter banks,” IEEE Signal Processing Letters, vol. 11, no. 6, pp. 541–544, 2004.

[19] K. F. C. Yiu, N. Grbic, S. Nordholm, and K. L. Teo, “A hybrid method for the design of oversampled uniform DFT filter banks,” Signal Processing, vol. 86, no. 7, pp. 1355–1364, 2006.

[20] H. W. Schußler, Digitale Signalverarbeitung 1, Springer, Heidelberg, Germany, 5th edition, 2008.

[21] T. W. Parks and C. S. Burrus, Digital Filter Design, John Wiley & Sons, New York, NY, USA, 1987.

[22] D. Alfsmann, H. G. Gockler, and T. Kurbiel, “Filter banks for hearing aids applying subband amplification: a comparison of different specification and design approaches,” in Proceedings of the 17th European Signal Processing Conference (EUSIPCO ’09), Glasgow, Scotland, August 2009.

[23] C. Stocker, T. Kurbiel, D. Alfsmann, and H. G. Gockler, “A novel approach to the design of oversampling complex-modulated digital filter banks,” in Proceedings of the 17th European Signal Processing Conference (EUSIPCO ’09), Glasgow, Scotland, August 2009.

[24] R. W. Bauml and W. Sorgel, “Uniform polyphase filter banks for use in hearing aids: design and constraints,” in Proceedings of the European Signal Processing Conference (EUSIPCO ’08), EURASIP, Lausanne, Switzerland, 2008.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 930625, 11 pages
doi:10.1155/2009/930625

Research Article

Incorporating the Conditional Speech Presence Probability in Multi-Channel Wiener Filter Based Noise Reduction in Hearing Aids

Kim Ngo (EURASIP Member),1 Ann Spriet,1,2 Marc Moonen (EURASIP Member),1 Jan Wouters,2 and Søren Holdt Jensen (EURASIP Member)3

1 Department of Electrical Engineering, Katholieke Universiteit Leuven, ESAT-SCD, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
2 Division of Experimental Otorhinolaryngology, Katholieke Universiteit Leuven, ExpORL, O. & N2, Herestraat 49/721, B-3000 Leuven, Belgium

3 Department of Electronic Systems, Aalborg University, Niels Jernes Vej 12, DK-9220 Aalborg, Denmark

Correspondence should be addressed to Kim Ngo, [email protected]

Received 15 December 2008; Revised 30 March 2009; Accepted 2 June 2009

Recommended by Walter Kellermann

A multi-channel noise reduction technique is presented based on a Speech Distortion-Weighted Multi-channel Wiener Filter (SDW-MWF) approach that incorporates the conditional Speech Presence Probability (SPP). A traditional SDW-MWF uses a fixed parameter to trade off between noise reduction and speech distortion without taking speech presence into account. Consequently, the improvement in noise reduction comes at the cost of a higher speech distortion, since the speech dominant segments and the noise dominant segments are weighted equally. Incorporating the conditional SPP in the SDW-MWF makes it possible to exploit the fact that speech may not be present at all frequencies and at all times, while the noise can indeed be continuously present. In speech dominant segments it is then desirable to have less noise reduction to avoid speech distortion, while in noise dominant segments it is desirable to have as much noise reduction as possible. Experimental results with hearing aid scenarios demonstrate that the proposed SDW-MWF incorporating the conditional SPP improves the signal-to-noise ratio compared to a traditional SDW-MWF.

Copyright © 2009 Kim Ngo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Background noise (multiple speakers, traffic, etc.) is a significant problem for hearing aid users and is especially damaging to speech intelligibility. Hearing-impaired people have more difficulty understanding speech in noise and in general need a higher signal-to-noise ratio (SNR) than people with normal hearing to communicate effectively [1]. To overcome this problem both single-channel and multi-channel noise reduction algorithms have been proposed. The objective of these noise reduction algorithms is to maximally reduce the noise while minimizing speech distortion.

One of the first proposed single-channel noise reduction algorithms is spectral subtraction [2], which is based on the assumption that the noise is additive, and that the clean speech spectrum can be obtained by subtracting an estimate of the noise spectrum from the noisy speech spectrum. The noise spectrum is updated during periods where the speech is absent, as detected by a Voice Activity Detector (VAD). Another well-known single-channel noise reduction technique is the Ephraim and Malah noise suppressor [3, 4], which estimates the amplitude of the clean speech spectrum in the spectral or in the log-spectral domain based on a Minimum Mean Square Error (MMSE) criterion. These techniques commonly produce noticeable artifacts known as musical noise [5], mainly caused by the short-time spectral attenuation, the nonlinear filtering, and an inaccurate estimate of the noise characteristics. A limitation of single-channel noise reduction is that only differences in temporal and spectral signal characteristics can be exploited. In a multiple speaker scenario, also known as the cocktail party problem, the speech and the noise considerably


overlap in time and frequency. This makes it difficult for single-channel noise reduction schemes to suppress the noise without reducing speech intelligibility and introducing speech distortion or musical noise.

However, in most scenarios, the desired speaker and the disturbing noise sources are physically located at different positions. Multi-channel noise reduction can then exploit the spatial diversity, that is, exploit both spectral and spatial characteristics of the speech and the noise sources. The Frost beamformer and the Generalized Sidelobe Canceler [6–8] are well-known multi-channel noise reduction techniques. The basic idea is to steer a beam toward the desired speaker while reducing the background noise coming from other directions. Another known multi-channel noise reduction technique is the Multi-channel Wiener filter (MWF) that provides an MMSE estimate of the speech component in one of the microphone signals. The extension from MWF to Speech Distortion-Weighted MWF (SDW-MWF) [9, 10] allows for a trade-off between noise reduction and speech distortion.

Traditionally, these multi-channel noise reduction algorithms adopt a (short-time) fixed filtering under the implicit hypothesis that the clean speech is present at all times. However, the speech signal typically contains many pauses while the noise can indeed be continuously present. Furthermore, the speech may not be present at all frequencies even during voiced speech segments. It has been shown for single-channel noise reduction schemes that incorporating the conditional Speech Presence Probability (SPP) in the gain function or in the noise spectrum estimation achieves better performance than traditional methods [4, 11–13]. In these approaches the conditional SPP is estimated for each frequency bin and each frame by a soft-decision approach, which exploits the strong correlation of speech presence in neighboring frequency bins of consecutive frames.

A traditional SDW-MWF uses a fixed parameter to trade off between noise reduction and speech distortion without taking speech presence or speech absence into account. This means that the speech dominant segments and the noise dominant segments are weighted equally in the noise reduction process. Consequently, the improvement in noise reduction comes at the cost of a higher speech distortion. A variable SDW-MWF was introduced in [14] based on soft output voice activity detection to trade off between speech dominant segments and noise dominant segments. This paper presents an SDW-MWF approach that incorporates the conditional SPP in the trade-off between noise reduction and speech distortion. In speech dominant segments it is then desirable to have less noise reduction to avoid speech distortion, while in noise dominant segments it is desirable to have as much noise reduction as possible. Furthermore, a combined solution is introduced that in one extreme case corresponds to an SDW-MWF incorporating the conditional SPP and in the other extreme case corresponds to a traditional SDW-MWF solution. Experimental results with hearing aid scenarios demonstrate that the proposed SDW-MWF incorporating the conditional SPP improves the SNR compared to a traditional SDW-MWF.

The paper is organized as follows. Section 2 describes the system model and the general set-up of a multi-channel noise reduction algorithm. The motivation is given in Section 3. Section 4 explains the estimation of the conditional SPP. Section 5 explains the derivation of the SDW-MWF incorporating the conditional SPP. In Section 6 experimental results are presented. The work is summarized in Section 7.

2. System Model

A general set-up of a multi-channel noise reduction is shown in Figure 1 with M microphones in an environment with one or more noise sources and a desired speaker. Let Xi(k, l), i = 1, ..., M, denote the frequency-domain microphone signals

$$X_i(k,l) = X_i^s(k,l) + X_i^n(k,l), \qquad (1)$$

where k is the frequency bin index, l the frame index, and the superscripts s and n are used to refer to the speech and the noise contribution in a signal, respectively. Let X(k, l) ∈ C^{M×1} be defined as the stacked vector

$$\mathbf{X}(k,l) = \left[X_1(k,l)\;\, X_2(k,l)\,\cdots\, X_M(k,l)\right]^T = \mathbf{X}^s(k,l) + \mathbf{X}^n(k,l), \qquad (2)$$

where the superscript T denotes the transpose. In addition, we define the noise and the speech correlation matrices as

$$\mathbf{R}_n(k,l) = \varepsilon\left\{\mathbf{X}^n(k,l)\,\mathbf{X}^{n,H}(k,l)\right\}, \qquad \mathbf{R}_s(k,l) = \varepsilon\left\{\mathbf{X}^s(k,l)\,\mathbf{X}^{s,H}(k,l)\right\}, \qquad (3)$$

where ε{·} denotes the expectation operator, and H denotes the Hermitian transpose.

2.1. Multi-channel Wiener Filter (MWF and SDW-MWF). The MWF optimally estimates a desired signal, based on a Minimum Mean Squared Error (MMSE) criterion, that is,

$$\mathbf{W}^{*}(k,l) = \arg\min_{\mathbf{W}}\, \varepsilon\left\{\left|X_1^s(k,l) - \mathbf{W}^H\mathbf{X}(k,l)\right|^2\right\}, \qquad (4)$$

where the desired signal in this case is the speech component X1^s(k, l) in the first microphone. The MWF has been extended to the SDW-MWF that allows for a trade-off between noise reduction and speech distortion using a trade-off parameter μ [9, 10]. The design criterion of the SDW-MWF is given by

$$\mathbf{W}^{*}(k,l) = \arg\min_{\mathbf{W}}\, \varepsilon\left\{\left|X_1^s(k,l) - \mathbf{W}^H\mathbf{X}^s(k,l)\right|^2\right\} + \mu\,\varepsilon\left\{\left|\mathbf{W}^H\mathbf{X}^n(k,l)\right|^2\right\}. \qquad (5)$$

If the speech and the noise signals are statistically independent, then the optimal SDW-MWF that provides an estimate of the speech component in the first microphone is given by

$$\mathbf{W}^{*}(k,l) = \left(\mathbf{R}_s(k,l) + \mu\,\mathbf{R}_n(k,l)\right)^{-1}\mathbf{R}_s(k,l)\,\mathbf{e}_1, \qquad (6)$$


Figure 1: Multi-channel noise reduction set-up in an environment with one or more noise sources and a desired speaker.

Figure 2: Illustration of a concatenated noisy speech signal with noise-only periods, which is a typical input signal for multi-microphone noise reduction (the annotations mark where the speech + noise and the noise-only correlation matrices are updated; time axis in seconds).

where the M × 1 vector e1 equals the first canonical vector defined as e1 = [1 0 · · · 0]^T. The second-order statistics of the noise are assumed to be stationary, which means that Rs(k, l) can be estimated as Rs(k, l) = Rx(k, l) − Rn(k, l), where Rx(k, l) and Rn(k, l) are estimated during periods of speech + noise and periods of noise-only, respectively. For μ = 1 the SDW-MWF solution reduces to the MWF solution, while for μ > 1 the residual noise level will be reduced at the cost of a higher speech distortion. The output Z(k, l) of the SDW-MWF can then be written as

$$Z(k,l) = \mathbf{W}^{*,H}(k,l)\,\mathbf{X}(k,l). \qquad (7)$$
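The filter computation in (6) and (7) reduces to a small linear-algebra step per frequency bin. The following minimal Python sketch (our own illustration, not code from the paper; it assumes NumPy and that Rs and Rn have already been estimated as described in Section 2.2) computes the SDW-MWF and applies it to a stacked microphone vector.

```python
import numpy as np

def sdw_mwf(Rs, Rn, mu=1.0):
    """SDW-MWF for one frequency bin, cf. (6): W = (Rs + mu*Rn)^{-1} Rs e1."""
    M = Rs.shape[0]
    e1 = np.zeros(M); e1[0] = 1.0
    return np.linalg.solve(Rs + mu * Rn, Rs @ e1)

def apply_filter(W, X):
    """Filter output for one bin, cf. (7): Z = W^H X."""
    return np.vdot(W, X)  # vdot conjugates its first argument

# Toy example with M = 2 microphones and synthetic statistics.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
Rs = A @ A.conj().T          # Hermitian positive semidefinite "speech" matrix
Rn = 0.1 * np.eye(2)         # spatially white "noise" matrix
W = sdw_mwf(Rs, Rn, mu=2.0)
X = rng.standard_normal(2) + 1j * rng.standard_normal(2)
Z = apply_filter(W, X)
```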

2.2. MWF in Practice. A typical input signal for a multi-channel noise reduction is shown in Figure 2, where several speech sentences are concatenated with sufficient noise-only periods. By using a VAD the speech+noise and noise-only periods can be detected, and the corresponding correlation matrices can be estimated/updated. MWF is uniquely based on the second-order statistics, and in the estimation of the speech+noise and the noise-only correlation matrices an averaging time window of 2-3 seconds is typically used to achieve a reliable estimate. This suggests that the noise reduction performance of the MWF depends on the long-term average of the spectral and the spatial characteristics of the speech and the noise sources. In practice, this means that the MWF can only work well if the long-term spectral and/or spatial characteristics of the speech and the noise are slowly time-varying.

3. Motivation

The success of any NR algorithm is based on how much information is available about the speech and the noise [1, 15, 16]. In general, speech and noise can be nonstationary temporally, spectrally, and spatially. Speech is a spectrally nonstationary signal and can be considered stationary only in a short time window of 20–30 milliseconds. Background noise such as multitalker babble is also considered to be spectrally non-stationary. Furthermore, the speech characteristic contains many pauses while the noise can be continuously present. These properties are usually not taken into consideration in multi-channel noise reduction algorithms since the spatial characteristics are assumed to be more or less stationary, which then indeed justifies the long-term averaging of the correlation matrices. This long-term averaging basically eliminates any short-time effects, such as musical noise, that typically occur in single-channel noise reduction.

The motivation behind introducing the conditional SPP in the SDW-MWF is to allow for a faster tracking of the non-stationarity of the speech and the noise as well as for exploiting the fact that speech may not be present at all times. This then makes it possible to apply a different weight to speech dominant segments and to noise dominant segments in the noise reduction process. Furthermore, incorporating the conditional SPP in the SDW-MWF also allows the NR to be applied in a narrow frequency band since the conditional SPP is estimated for each frequency bin; see Section 4.

4. Speech Presence Probability Estimation

The conditional SPP is estimated for each frequency bin and each frame by a soft-decision approach [12, 15, 17], which exploits the strong correlation of speech presence in neighboring frequency bins of consecutive frames.

4.1. Two-State Speech Model. A two-state model for speech events can be expressed by two hypotheses H0(k, l)


and H1(k, l), which represent speech absence and speech presence in each frequency bin, respectively, that is,

$$H_0(k,l):\; X_i(k,l) = X_i^n(k,l), \qquad H_1(k,l):\; X_i(k,l) = X_i^s(k,l) + X_i^n(k,l). \qquad (8)$$

Assuming a complex Gaussian distribution of the Short-Time Fourier Transform (STFT) coefficients for both the speech and the noise, the conditional Probability Density Functions (PDFs) of the observed signals are given by

$$p\!\left(X_i(k,l) \mid H_0(k,l)\right) = \frac{1}{\pi\lambda_i^n(k,l)}\exp\left\{-\frac{|X_i(k,l)|^2}{\lambda_i^n(k,l)}\right\},$$
$$p\!\left(X_i(k,l) \mid H_1(k,l)\right) = \frac{1}{\pi\!\left(\lambda_i^s(k,l)+\lambda_i^n(k,l)\right)}\exp\left\{-\frac{|X_i(k,l)|^2}{\lambda_i^s(k,l)+\lambda_i^n(k,l)}\right\}, \qquad (9)$$

where λ_i^s(k, l) ≜ ε{|X_i^s(k, l)|² | H1(k, l)} and λ_i^n ≜ ε{|X_i^n(k, l)|²} denote the power spectrum of the speech and the noise, respectively. Applying Bayes rule, the conditional SPP p(k, l) ≜ P(H1(k, l) | Xi(k, l)) can be written as [4]

$$p(k,l) = \left\{1 + \frac{q(k,l)}{1 - q(k,l)}\left(1 + \xi(k,l)\right)\exp\!\left(-\upsilon(k,l)\right)\right\}^{-1}, \qquad (10)$$

where q(k, l) ≜ P(H0(k, l)) is the a priori Speech Absence Probability (SAP); ξ(k, l) and γ(k, l) denote the a priori SNR and the a posteriori SNR, respectively,

$$\xi(k,l) \triangleq \frac{\lambda_i^s(k,l)}{\lambda_i^n(k,l)}, \qquad \gamma(k,l) \triangleq \frac{|X_i(k,l)|^2}{\lambda_i^n(k,l)}, \qquad \upsilon(k,l) \triangleq \frac{\gamma(k,l)\,\xi(k,l)}{1 + \xi(k,l)}. \qquad (11)$$

The noise power spectrum λ_i^n is estimated using recursive averaging during periods where the speech is absent, that is,

$$H_0'(k,l):\; \lambda_i^n(k,l+1) = \rho\,\lambda_i^n(k,l) + (1-\rho)\,|X_i(k,l)|^2, \qquad H_1'(k,l):\; \lambda_i^n(k,l+1) = \lambda_i^n(k,l), \qquad (12)$$

where ρ is an averaging parameter, and H0'(k, l) and H1'(k, l) represent speech absence and speech presence, respectively. The noise power spectrum is updated using a perfect VAD such that the noise power is updated at the same time as the noise correlation matrix; see Figure 2. The noise spectrum can also be estimated by using the Minima Controlled Recursive Averaging approach presented in [13]. The main issue in estimating the conditional SPP p(k, l) is to have reliable estimates of the a priori SNR and the a priori SAP used in (10). Since speech has a non-stationary characteristic, the a priori SNR and the a priori SAP are estimated for each frequency bin of the noisy speech.
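As a small illustration of the VAD-gated recursive averaging in (12) (a sketch in our own notation, not taken from the paper), the noise power spectrum is only updated in frames that the VAD flags as speech-absent:

```python
import numpy as np

def update_noise_psd(lambda_n, X_frame, speech_absent, rho=0.95):
    """Recursive noise PSD update per (12).

    lambda_n      : current noise PSD estimate, one value per frequency bin
    X_frame       : complex STFT coefficients of the current frame
    speech_absent : boolean VAD decision for the frame (True -> H0')
    rho           : averaging parameter (Table 1 uses rho = 0.95)
    """
    if speech_absent:
        return rho * lambda_n + (1.0 - rho) * np.abs(X_frame) ** 2
    return lambda_n  # under H1' the estimate is kept unchanged
```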

4.2. A Priori SNR Estimation. The decision-directed approach of Ephraim and Malah [4, 12, 17] is widely used for estimating the a priori SNR and is given by

$$\xi(k,l) = \kappa\,\frac{|X_i(k,l-1)|^2}{\lambda_i^n(k,l-1)} + (1-\kappa)\max\!\left\{\gamma(k,l) - 1,\, 0\right\}, \qquad (13)$$

where |Xi(k, l − 1)|² represents an estimate of the clean speech spectrum, and κ is a weighting factor that controls the trade-off between noise reduction and speech distortion [4, 5]. The first term corresponds to the SNR from the previous enhanced frame, and the second term is the estimated SNR for the current frame.
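The decision-directed estimate in (13) is a one-line recursion per frequency bin. A minimal sketch (our own notation; the previous-frame clean speech estimate and noise PSD are assumed to be available) could look as follows:

```python
import numpy as np

def decision_directed_snr(X_prev_clean_mag2, lambda_n_prev, gamma, kappa=0.98):
    """Decision-directed a priori SNR estimate per (13).

    X_prev_clean_mag2 : squared magnitude of the previous enhanced
                        (clean-speech) spectrum estimate, |X_i(k, l-1)|^2
    lambda_n_prev     : noise PSD estimate of the previous frame
    gamma             : a posteriori SNR of the current frame
    kappa             : weighting factor (Table 1 uses kappa = 0.98)
    """
    return (kappa * X_prev_clean_mag2 / lambda_n_prev
            + (1.0 - kappa) * np.maximum(gamma - 1.0, 0.0))
```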

4.3. A Priori SAP Estimation. Reliable estimation of the a priori SNR is important since it is used in the estimation of the a priori SAP. In [12, 17] an a priori SAP estimator is proposed based on the time-frequency distribution of the estimated a priori SNR ξ(k, l). The estimation is based on three parameters that each exploit the strong correlation of speech presence in neighboring frequency bins of consecutive frames. The first step is to apply a recursive averaging to the a priori SNR, that is,

$$\zeta(k,l) = \beta\,\zeta(k,l-1) + (1-\beta)\,\xi(k,l-1), \qquad (14)$$

where β is the averaging parameter. In the second step a global and a local averaging is applied to ζ(k, l) in the frequency domain. Local means that the a priori SNR is averaged over a small number of frequency bins (small bandwidth), and global means that the a priori SNR is averaged over a larger number of frequency bins (larger bandwidth). The local and global averaging of the a priori SNR is given by

$$\zeta_\eta(k,l) = \sum_{i=-\omega_\eta}^{\omega_\eta} h_\eta(i)\,\zeta(k-i,l), \qquad (15)$$

where the subscript η represents either local or global averaging, and h_η is a normalized Hanning window of size 2ω_η + 1. The local and global averaging of the a priori SNR is then normalized to values between 0 and 1 before it is mapped into the following threshold function:

$$P_\eta(k,l) = \begin{cases} 0, & \text{if } \zeta_\eta(k,l) \le \zeta_{\min},\\[2pt] 1, & \text{if } \zeta_\eta(k,l) \ge \zeta_{\max},\\[2pt] \dfrac{\log\!\left(\zeta_\eta(k,l)/\zeta_{\min}\right)}{\log\!\left(\zeta_{\max}/\zeta_{\min}\right)}, & \text{otherwise,} \end{cases} \qquad (16)$$

where Plocal(k, l) is the likelihood of speech presence when the a priori SNR is averaged over a small number of frequency bins, and Pglobal(k, l) is the likelihood of speech presence when the a priori SNR is averaged over a larger number of frequency bins. ζmin and ζmax are empirical constants that decide the threshold for speech or noise. The last term


Pframe(l) represents the likelihood of speech presence in a given frame based on the a priori SNR averaged over all frequency bins, that is,

$$\zeta_{\mathrm{frame}}(l) = \underset{1 \le k \le N/2+1}{\mathrm{mean}}\left\{\zeta(k,l)\right\}, \qquad (17)$$

where N is the FFT size. A pseudocode for the computation of Pframe(l) is given by

    if ζ_frame(l) > ζ_min then
        if ζ_frame(l) > ζ_frame(l − 1) then
            P_frame(l) = 1
            ζ_peak(l) = min{max[ζ_frame(l), ζ_p_min], ζ_p_max}
        else
            P_frame(l) = δ(l)
        end if
    else
        P_frame(l) = 0
    end if                                                            (18)

where

$$\delta(l) = \begin{cases} 0, & \text{if } \zeta_{\mathrm{frame}}(l) \le \zeta_{\mathrm{peak}}(l)\cdot\zeta_{\min},\\[2pt] 1, & \text{if } \zeta_{\mathrm{frame}}(l) \ge \zeta_{\mathrm{peak}}(l)\cdot\zeta_{\max},\\[2pt] \dfrac{\log\!\left(\zeta_{\mathrm{frame}}(l)/\zeta_{\mathrm{peak}}(l)/\zeta_{\min}\right)}{\log\!\left(\zeta_{\max}/\zeta_{\min}\right)}, & \text{otherwise,} \end{cases} \qquad (19)$$

represents a soft transition from speech to noise, ζpeak is a confined peak value of ζframe, and ζp min and ζp max are empirical constants that determine the delay of the transition. The proposed a priori SAP estimate is then obtained by

$$q(k,l) = 1 - P_{\mathrm{local}}(k,l)\cdot P_{\mathrm{global}}(k,l)\cdot P_{\mathrm{frame}}(l). \qquad (20)$$

This means that if either of the previous frames or recent frequency bins does not contain speech, that is, if the three likelihood terms are small, then q(k, l) becomes larger and the conditional SPP p(k, l) in (10) becomes smaller.
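To tie the steps (14)–(20) and (10) together, the following sketch (our own condensed illustration, not the authors' code) computes the a priori SAP and the conditional SPP for one frame. The explicit normalization of the smoothed a priori SNR to the interval [0, 1] mentioned in the text is omitted for brevity, and the default constants follow Table 1.

```python
import numpy as np

def a_priori_sap(xi_prev, zeta_prev, zeta_peak, zeta_frame_prev,
                 beta=0.7, w_local=1, w_global=10,
                 z_min=10**(-10/10), z_max=10**(-5/10),
                 zp_min=10**(4/10), zp_max=10**(10/10)):
    """One-frame a priori SAP estimate following (14)-(20).

    xi_prev         : a priori SNR of the previous frame, one value per bin
    zeta_prev       : smoothed a priori SNR of the previous frame, per (14)
    zeta_peak       : confined peak value of zeta_frame from earlier frames
    zeta_frame_prev : zeta_frame of the previous frame
    """
    # (14) recursive averaging of the a priori SNR
    zeta = beta * zeta_prev + (1.0 - beta) * xi_prev

    def smooth(z, w):
        # (15) normalized Hann window of size 2w+1, applied along frequency
        h = np.hanning(2 * w + 3)[1:-1]
        return np.convolve(z, h / h.sum(), mode="same")

    def threshold(z):
        # (16)/(19) soft mapping between z_min and z_max
        z = np.maximum(z, 1e-12)
        return np.clip(np.log10(z / z_min) / np.log10(z_max / z_min), 0.0, 1.0)

    P_local = threshold(smooth(zeta, w_local))
    P_global = threshold(smooth(zeta, w_global))

    # (17)-(19) frame-based likelihood, following pseudocode (18)
    zeta_frame = zeta.mean()
    if zeta_frame > z_min:
        if zeta_frame > zeta_frame_prev:
            P_frame = 1.0
            zeta_peak = min(max(zeta_frame, zp_min), zp_max)
        else:
            P_frame = float(np.clip(
                np.log10(zeta_frame / zeta_peak / z_min)
                / np.log10(z_max / z_min), 0.0, 1.0))
    else:
        P_frame = 0.0

    # (20) a priori speech absence probability
    q = 1.0 - P_local * P_global * P_frame
    return q, zeta, zeta_frame, zeta_peak


def conditional_spp(q, xi, gamma):
    """Conditional SPP per (10) and (11)."""
    upsilon = gamma * xi / (1.0 + xi)
    q = np.clip(q, 1e-6, 1.0 - 1e-6)   # avoid division by zero at the extremes
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * np.exp(-upsilon))
```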

Two examples of the normalized a priori SNR for different frames are shown in Figures 3 and 4. If the lower threshold ζmin is set too high, then there is a greater chance for noise classification, and at the same time weaker frequency components might also be ignored. If ζmin in Figure 3 is increased, then the weak high-frequency component will be classified as noise. On the other hand, if ζmax is increased in Figure 4, the weaker low-frequency component will not be classified as a speech dominant segment. The estimated conditional SPPs for the two examples given above are shown in Figures 5 and 6. As mentioned above, the weak

Figure 3: Local and global averaging of the a priori SNR (ζlocal, ζglobal, with thresholds ζmin and ζmax) for a given frame, plotted over frequency. Example of a high a priori SNR at low frequency.

Figure 4: Local and global averaging of the a priori SNR (ζlocal, ζglobal, with thresholds ζmin and ζmax) for a given frame, plotted over frequency. Example of a high a priori SNR at high frequency.

high-frequency component in Figure 5 will be ignored if ζmin is increased, and the speech dominant segment at low frequency in Figure 6 will not be as significant if ζmax is increased. In general, classifying noise when speech is present is more harmful than classifying speech when noise is present. By setting ζmin and ζmax low, more speech will be detected, and the same goes for the setting of ζp min and ζp max. The goal is then to incorporate this conditional SPP into the SDW-MWF such that speech dominant segments will be attenuated less compared to noise dominant segments. By exploiting the conditional SPP shown in Figures 5 and 6 the noise can be reduced in a narrow frequency band, that is, when the conditional SPP is low.


Figure 5: Conditional SPP with high speech presence at low frequency, shown for ζmin = 0.1 with ζmax = 0.3162 and ζmax = 0.6.

Figure 6: Conditional SPP with two distinct speech dominant segments, shown for ζmin = 0.1 with ζmax = 0.3162 and ζmax = 0.6.

The frequency bin index k and frame index l are omitted in the sequel for the sake of conciseness.

5. SDW-MWF Incorporating the Conditional Speech Presence Probability

In this section, we derive a modified SDW-MWF, which incorporates the conditional SPP in the filter estimation and which is referred to as SDW-MWF_SPP from now on. Traditionally, the trade-off parameter μ of the SDW-MWF in (5) is set to a fixed value, and any improvement in noise reduction comes at the cost of a higher speech distortion. Furthermore, the speech + noise segments and the noise-only segments are weighted equally, whereas it is desirable to have more noise reduction in the noise-only segments compared to the speech+noise segments. With an SDW-MWF_SPP it is possible to distinguish between the speech+noise segments and noise-only segments. The conditional SPP in (10) and the two-state model in (8) for speech events can be incorporated into the optimization criterion of the SDW-MWF, leading to a weighted average where the first term corresponds to H1 and is weighted by the probability that speech is present, while the second term corresponds to H0 and is weighted by the probability that speech is absent, that is,

$$\mathbf{W}^{*} = \arg\min_{\mathbf{W}}\; p\,\varepsilon\left\{\left|X_1^s - \mathbf{W}^H\mathbf{X}\right|^2 \,\Big|\, H_1\right\} + \left(1-p\right)\varepsilon\left\{\left|\mathbf{W}^H\mathbf{X}^n\right|^2\right\}, \qquad (21)$$

where p = P(H1 | Xi) is the conditional probability that speech is present when observing Xi, and (1 − p) = P(H0 | Xi) is the probability that speech is absent when observing Xi. The solution is then given by

$$\begin{aligned}
\mathbf{W}^{*} &= \left(p\,\varepsilon\{\mathbf{X}\mathbf{X}^{H} \mid H_1\} + (1-p)\,\varepsilon\{\mathbf{X}^n\mathbf{X}^{n,H}\}\right)^{-1} p\,\varepsilon\{\mathbf{X}^s\mathbf{X}^{s,H} \mid H_1\}\,\mathbf{e}_1\\
&= \left(p\,\varepsilon\{\mathbf{X}^s\mathbf{X}^{s,H} \mid H_1\} + p\,\varepsilon\{\mathbf{X}^n\mathbf{X}^{n,H}\} + (1-p)\,\varepsilon\{\mathbf{X}^n\mathbf{X}^{n,H}\}\right)^{-1} p\,\varepsilon\{\mathbf{X}^s\mathbf{X}^{s,H} \mid H_1\}\,\mathbf{e}_1\\
&= \left(p\,\varepsilon\{\mathbf{X}^s\mathbf{X}^{s,H} \mid H_1\} + \varepsilon\{\mathbf{X}^n\mathbf{X}^{n,H}\}\right)^{-1} p\,\varepsilon\{\mathbf{X}^s\mathbf{X}^{s,H} \mid H_1\}\,\mathbf{e}_1. \qquad (22)
\end{aligned}$$

The SDW-MWF incorporating the conditional SPP can then be written as

$$\mathbf{W}^{*}_{\mathrm{SPP}} = \left(\mathbf{R}_s + \frac{1}{p}\,\mathbf{R}_n\right)^{-1}\mathbf{R}_s\,\mathbf{e}_1. \qquad (23)$$
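In code, the change from (6) to (23) amounts to replacing the fixed μ by the SPP-dependent weight 1/p. A per-bin sketch (our own, with a small floor on p added purely for numerical safety):

```python
import numpy as np

def sdw_mwf_spp(Rs, Rn, p, p_floor=1e-3):
    """SDW-MWF with SPP-dependent weighting per (23):
    W = (Rs + (1/p) * Rn)^{-1} Rs e1, evaluated per frequency bin and frame.
    As p -> 0 the filter tends toward W -> 0; the floor only avoids a
    division by zero in that limit."""
    M = Rs.shape[0]
    e1 = np.zeros(M); e1[0] = 1.0
    weight = 1.0 / max(p, p_floor)
    return np.linalg.solve(Rs + weight * Rn, Rs @ e1)
```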

Compared to (6) with the fixed μ, the term 1/p, which is defined as the weighting factor, is now adjusted for each frequency bin and for each frame, so that the SDW-MWF_SPP changes with faster dynamics. Figure 7 presents a block diagram of the proposed SDW-MWF_SPP. First an FFT is performed on each frame of the noisy speech. Then on the left-hand side the conditional SPP is estimated, which includes the estimation of the a posteriori SNR, the a priori SNR, and the a priori SAP. On the right-hand side the frequency domain correlation matrices are estimated, which are used to estimate the filter coefficients after weighting with the conditional SPP. Notice that the updates of the frequency domain correlation matrices are still based on a longer time window; see Section 2.2. The difference is that the weights applied in the filter estimation are now changing for each frequency bin and each frame based on the conditional SPP. The last steps include the filtering operation and the IFFT. The conditional SPP weighting factor 1/p offers more


Figure 7: Block diagram of the proposed SDW-MWF_SPP incorporating the conditional SPP (FFT analysis of the input signal; estimation of the a posteriori SNR, the a priori SNR, and the a priori SAP q = P(H0); conditional SPP p = P(H1|X); frequency domain correlation matrices; filtering Z = W*,H_SPP X; IFFT synthesis of the output signal).

noise reduction when p is small, that is, for noise dominant segments, and less noise reduction when p is high, that is, for speech dominant segments, as shown in Figure 8 (solid line). This concept is compared to a fixed weighting factor μ used in a traditional SDW-MWF_μ that does not take speech presence or absence into account as follows.

(i) If p = 0, that is, when the probability that speech is present is zero, the SDW-MWF_SPP attenuates the noise by applying W* ← 0.

(ii) If p = 1, that is, when the probability that speech is present is one, the SDW-MWF_SPP solution corresponds to the MWF solution (μ = 1).

(iii) If 0 < p < 1, there is a trade-off between noise reduction and speech distortion based on the conditional SPP.

5.1. Undesired Noise Modelling. The problem with the SDW-MWF_SPP derived in (23) is that the inverse of the conditional SPP is used, which can cause large fluctuations in different frequency bands, especially if the weighting factor 1/p is used, as shown in Figure 7. For example, if the conditional SPP shown in Figure 5 is used, the SDW-MWF_SPP will apply a NR corresponding to μ = 1 below 2000 Hz, and between 2000 Hz and 4500 Hz the NR

Figure 8: Speech presence probability-based weighting factor 1/p compared to the fixed weighting factors μ = 1, 2, 3, 4, plotted as a function of the conditional SPP.

will be much larger. This transition between low and high NR in different frequency bands can cause speech distortion or musical noise.

It is also worth noting that in the derivation of the SDW-MWF_SPP the term (1 − p) = P(H0 | Xi) is not present in (22) anymore. This can be explained by the fact that the SDW-MWF estimates the speech component in one of the microphones under hypothesis H1, while under hypothesis H0 the noise reduction filter is set to zero. In [18] the gain function is similarly derived under hypothesis H1, which is due to the fact that the method aims to provide an estimate of the clean speech spectrum, so that when the speech is absent the gain is set to zero. This property negatively affects the processing of the noise-only bins, which results in undesired modelling of the noise, making the residual noise sound unnatural.

5.2. Combined Solution. In [12, 17] a lower threshold is introduced for the gain under hypothesis H0. This lower threshold is based on subjective criteria for the noise naturalness. Applying a constant attenuation when the speech is absent results in a uniform noise level, and therefore any undesired noise modelling can be avoided so that the naturalness of the residual noise can be retained.

Following the concept with the lower threshold, a solution is proposed that in one extreme case corresponds to the SDW-MWF_SPP and in the other extreme case corresponds to a traditional SDW-MWF_μ. The combined solution can then be written as

$$\mathbf{W}^{*}_{\mathrm{SPP}} = \left(\mathbf{R}_s + \left(\frac{1}{\alpha(1/\mu) + (1-\alpha)\,p}\right)\mathbf{R}_n\right)^{-1}\mathbf{R}_s\,\mathbf{e}_1, \qquad (24)$$

where μ in this case is the constant attenuation factor, and α is a trade-off factor between the SDW-MWF_μ and the SDW-MWF_SPP.
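The noise weighting term of the combined solution (24) and its limiting cases can be captured in a few lines (our own sketch):

```python
def combined_weight(p, mu=2.0, alpha=0.5):
    """Noise weighting term of the combined solution (24):
    1 / (alpha * (1/mu) + (1 - alpha) * p).
    alpha = 1 recovers the fixed SDW-MWF weight mu;
    alpha = 0 recovers the SPP-based weight 1/p;
    p = 0 gives the constant lower-threshold weight mu/alpha."""
    return 1.0 / (alpha / mu + (1.0 - alpha) * p)
```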


The weighting factor for the combined solution is shown in Figure 9 for α = 0.5 and for different values of μ. The concept then goes as follows.

(i) If α = 1, the solution corresponds to a traditional SDW-MWF_μ given in (6).

(ii) If α = 0, the solution corresponds to the SDW-MWF_SPP given in (23).

(iii) If 0 < α < 1, there is a trade-off between the two solutions based on μ, α, and p as given in (24).

(iv) If p = 0, that is, when the probability that speech is present is zero, the SDW-MWF_SPP attenuates the noise by applying a constant weighting, that is, μ/α, corresponding to the desired lower threshold.

The conditional SPPs for ζmin = 0.1 and ζmax = 0.3162 in Figures 5 and 6 are shown for the combined solution in Figures 10 and 11. When α is increased, the solution gets closer to the standard SDW-MWF_μ (μ = 2). The importance of the SDW-MWF_SPP is that different amounts of NR can be applied to the speech dominant segments and to the noise dominant segments. With the combined solution the overall amount of NR might not exceed that of the SDW-MWF_μ, but the distinction between speech and noise is the important part in order to enhance speech dominant segments and further suppress the noise dominant segments. Increasing α limits the distortion but at the same time also limits the NR in a narrow frequency band; that is, the ratio between the speech dominant segments and the noise dominant segments is reduced. Furthermore, the weak high-frequency component might also be less emphasized since less NR is applied to frequencies prior to the weak high-frequency component; see Figures 10 and 11. This combined solution does not only offer flexibility between the SDW-MWF_SPP and a traditional SDW-MWF_μ; α also effectively determines the dynamics of the SDW-MWF_SPP and the degree of nonlinearity in the weighting factor.

6. Experimental Results

In this section, experimental results for the proposed SDW-MWF_SPP (α = 0) are presented and compared to a traditional SDW-MWF_μ (α = 1). In-between solutions of these two approaches are also presented.

6.1. Experimental Set-up. Simulations have been performed with a 2-microphone behind-the-ear hearing aid mounted on a CORTEX MK2 manikin. The loudspeakers (FOSTEX 6301B) are positioned at 1 meter from the center of the head. The reverberation time is T60 = 0.21 seconds. The speech is located at 0°, and the two multitalker babble noise sources are located at 120° and 180°. The speech signals consist of male sentences from the HINT database [19], and the noise signals consist of multi-talker babble from Auditec [20]. The speech signals are sampled at 16 kHz and are concatenated as shown in Figure 2. For the estimation of the second-order statistics, access to a perfect VAD was assumed. An FFT length of 128 with 50% overlap was used. Table 1 shows the parameters used in the estimation of the conditional SPP.

Figure 9: Speech presence probability-based weighting factor with a lower threshold, shown for α = 0.5 and μ = 1, 2, 3, 4 as a function of the conditional SPP.

Figure 10: Conditional SPP for the combined solution with high speech presence at low frequency (α = 0, and α = 0.25, 0.5, 0.75, 1 with μ = 2).

6.2. Performance Measures. To assess the noise reduction performance, the intelligibility-weighted signal-to-noise ratio (SNR) [21] is used, which is defined as

$$\Delta\mathrm{SNR}_{\mathrm{intellig}} = \sum_{i} I_i\left(\mathrm{SNR}_{i,\mathrm{out}} - \mathrm{SNR}_{i,\mathrm{in}}\right), \qquad (25)$$

where Ii is the band importance function defined in [22], and where SNR_{i,out} and SNR_{i,in} represent the output SNR and the input SNR (in dB) of the ith band, respectively. For measuring the signal distortion a frequency-weighted


Figure 11: Conditional SPP for the combined solution with two distinct speech dominant segments (α = 0, and α = 0.25, 0.5, 0.75, 1 with μ = 2).

Table 1: Parameters used for the estimation of the conditional SPP.

β = 0.7         ρ = 0.95                  κ = 0.98
ω_local = 1     ζ_min = −10 dB (0.1)      ζ_p min = 4 dB
ω_global = 10   ζ_max = −5 dB (0.3162)    ζ_p max = 10 dB

log-spectral signal distortion (SD) is used, defined as

$$\mathrm{SD} = \frac{1}{K}\sum_{k=1}^{K}\sqrt{\int_{f_l}^{f_u} w_{\mathrm{ERB}}(f)\left(10\log_{10}\frac{P^{s}_{\mathrm{out},k}(f)}{P^{s}_{\mathrm{in},k}(f)}\right)^{2}\,df}, \qquad (26)$$

where K is the number of frames, P^s_{out,k}(f) is the output power spectrum of the kth frame, P^s_{in,k}(f) is the input power spectrum of the kth frame, and f is the frequency index. The SD measure is calculated with a frequency-weighting factor w_ERB(f) giving equal weight to each auditory critical band, as defined by the equivalent rectangular bandwidth (ERB) of the auditory filter [23].
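For reference, (25) can be evaluated directly from per-band input and output SNRs; the following sketch (our own, with the band importance weights assumed to be given, e.g., taken from [22], and assumed to be normalized to sum to one) is one possible implementation:

```python
import numpy as np

def delta_snr_intellig(snr_out_bands_db, snr_in_bands_db, band_importance):
    """Intelligibility-weighted SNR improvement per (25).

    snr_out_bands_db, snr_in_bands_db : per-band output/input SNR in dB
    band_importance                   : band importance weights I_i
    """
    I = np.asarray(band_importance, dtype=float)
    return float(np.sum(I * (np.asarray(snr_out_bands_db)
                             - np.asarray(snr_in_bands_db))))
```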

6.3. SDW-MWF_SPP versus SDW-MWF_μ. The performance of the SDW-MWF_SPP (α = 0) and the SDW-MWF_μ (μ = 1 and 2) is evaluated for different input SNRs ranging from 0 dB to 25 dB. The combined solution is evaluated for different values of α = 0.25, 0.50, and 0.75, since this provides a trade-off between a traditional SDW-MWF_μ and the proposed SDW-MWF_SPP.

The SNR improvement and SD for different input SNRs are shown in Figures 12–15. It is clear that when α → 0 the SNR improvement is larger, but at the same time the SD also increases. When α is increased, the SNR improvement decreases, and at the same time the SD also decreases. It was found that an α value around 0.25 to 0.5 reduces the signal distortion significantly, but this obviously comes at the cost of less improvement in SNR. As mentioned,

Figure 12: SNR improvement ΔSNR_intellig (dB) for the SDW-MWF_SPP (α = 0) and the SDW-MWF_μ (α = 1) with μ = 1 at different input SNRs.

Figure 13: Signal distortion SD (dB) for the SDW-MWF_SPP (α = 0) and the SDW-MWF_μ (α = 1) with μ = 1 at different input SNRs.

the goal of the SDW-MWF_SPP is not necessarily to outperform the SDW-MWF_μ in terms of SNR or SD. The motivation behind the SDW-MWF_SPP is to apply less NR to speech dominant segments and more NR to noise dominant segments. Therefore the overall weighting factor in the combined solution might not be higher than that of the SDW-MWF_μ. Actually, when μ = 2 and α = 0.5, the NR applied when the conditional SPP is larger than 0.5 is lower than for μ = 2; see Figure 9 (solid line).

6.4. Residual Noise. The increased SD can be caused by the sharp transition between speech dominant


Figure 14: SNR improvement ΔSNR_intellig (dB) for the SDW-MWF_SPP (α = 0) and the SDW-MWF_μ (α = 1) with μ = 2 at different input SNRs.

Figure 15: Signal distortion SD (dB) for the SDW-MWF_SPP (α = 0) and the SDW-MWF_μ (α = 1) with μ = 2 at different input SNRs.

segments and noise dominant segments; see Figures 5 and 6 and Section 5.1. A softer transition in this case will probably be desired, for example, by applying smoothing to the conditional SPP or by modifying the threshold functions in (16) and (19).

One way of interpreting the results from the SD measure is to look at the residual noise. When α → 0, the musical noise phenomenon occurs, while it is less significant when α → 1, which can partly be supported by the SD measure shown in Figure 13. Using an α value around 0.25 to 0.5 reduces the musical noise and makes the noise sound more natural. It is also observed that the noise modelling of the residual noise is more significant in the noise-only periods where the update of the SDW-MWF_SPP occurs; see Figure 2. The goal of the SDW-MWF_SPP is to attenuate the noise dominant segments more compared to speech dominant segments. The question is still whether this SD measure has any effect on the speech intelligibility. This may not be the case if only the noise dominant segments are attenuated more compared to the speech dominant segments. If the conditional SPP is accurate, the speech dominant segments can be made more significant compared to the noise dominant segments, especially if the NR is able to reduce the noise in a narrow frequency band. The benefit of this concept is still something that needs to be analyzed.

Musical noise is not an effect normally encountered in multi-channel noise reduction. It typically appears in single-channel noise reduction that is based on short-time spectral attenuation. Increasing α reduces the musical noise, which basically means that the fast tracking of speech presence in each frequency bin and each frame is constrained. The function of α is to trade off between a traditional SDW-MWF_μ, that is, a linear slowly time-varying system, and an SDW-MWF_SPP, that is, a nonlinear fast time-varying system.

7. Conclusion

In this paper an SDW-MWF_SPP procedure has been presented that incorporates the conditional SPP. A traditional SDW-MWF_μ uses a fixed parameter to trade off between noise reduction and speech distortion without taking speech presence into account. Incorporating the conditional SPP in the SDW-MWF makes it possible to exploit the fact that speech may not be present at all frequencies and at all times, while the noise can indeed be continuously present. This concept allows the noise to be reduced in a narrow frequency band based on the conditional SPP. In speech dominant segments it is then desirable to have less noise reduction to avoid speech distortion, while in noise dominant segments it is desirable to have as much noise reduction as possible. A combined solution is also proposed that in one extreme case corresponds to an SDW-MWF_SPP and in the other extreme case corresponds to a traditional SDW-MWF_μ solution. In-between solutions correspond to a trade-off between the two extreme cases.

The SDW-MWF_SPP is found to significantly improve the SNR compared to a traditional SDW-MWF_μ. The SNR improvement however comes at the cost of audible musical noise, and here the in-between solutions offer a way to reduce the musical noise while still maintaining an SNR improvement that is larger than that of the SDW-MWF_μ. The explanation is that a traditional SDW-MWF_μ implementation is a linear filter and is based on a long-term average of the spectral and spatial signal characteristics, whereas the SDW-MWF_SPP has a weighting factor changing with faster dynamics for each frequency bin and each frame, which corresponds better to the nonstationarity of the speech and the noise characteristics.


Acknowledgments

This research work was carried out at the ESAT laboratory of Katholieke Universiteit Leuven, in the frame of the EST-SIGNAL Marie-Curie Fellowship program (http://est-signal.i3s.unice.fr/) under contract no. MEST-CT-2005-021175, and the Concerted Research Action GOA-AMBioRICS. Ann Spriet is a postdoctoral researcher funded by F.W.O.-Vlaanderen. The scientific responsibility is assumed by the authors.

References

[1] H. Dillon, Hearing Aids, Boomerang Press, Turramurra, Australia, 2001.

[2] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.

[3] Y. Ephraim and D. Malah, “Speech enhancement using optimal non-linear spectral amplitude estimation,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’83), vol. 8, pp. 1118–1121, April 1983.

[4] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.

[5] O. Cappe, “Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 345–349, 1994.

[6] O. L. Frost III, “An algorithm for linearly constrained adaptive array processing,” Proceedings of the IEEE, vol. 60, no. 8, pp. 926–935, 1972.

[7] L. J. Griffiths and C. W. Jim, “An alternative approach to linearly constrained adaptive beamforming,” IEEE Transactions on Antennas and Propagation, vol. 30, no. 1, pp. 27–34, 1982.

[8] B. D. Van Veen and K. M. Buckley, “Beamforming: a versatile approach to spatial filtering,” IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, 1988.

[9] S. Doclo, A. Spriet, J. Wouters, and M. Moonen, “Frequency-domain criterion for the speech distortion weighted multichannel Wiener filter for robust noise reduction,” Speech Communication, vol. 49, no. 7-8, pp. 636–656, 2007.

[10] A. Spriet, M. Moonen, and J. Wouters, “Stochastic gradient based implementation of spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction in hearing aids,” IEEE Transactions on Signal Processing, vol. 53, no. 3, pp. 911–925, 2005.

[11] R. J. McAulay and M. L. Malpass, “Speech enhancement using a soft-decision noise suppression filter,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 2, pp. 137–145, 1980.

[12] I. Cohen, “Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator,” IEEE Signal Processing Letters, vol. 9, no. 4, pp. 113–116, 2002.

[13] I. Cohen, “Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5, pp. 466–475, 2003.

[14] K. Ngo, A. Spriet, M. Moonen, J. Wouters, and S. Jensen, “Variable speech distortion weighted multichannel Wiener filter based on soft output voice activity detection for noise reduction in hearing aids,” in Proceedings of the 11th International Workshop on Acoustic Echo and Noise Control (IWAENC ’08), Seattle, Wash, USA, 2008.

[15] P. Loizou, Speech Enhancement: Theory and Practice, CRC Press, Boca Raton, Fla, USA, 2007.

[16] H. Levitt, “Noise reduction in hearing aids: a review,” Journal of Rehabilitation Research and Development, vol. 38, no. 1, pp. 111–121, 2001.

[17] I. Cohen and B. Berdugo, “Speech enhancement for non-stationary noise environments,” Signal Processing, vol. 81, no. 11, pp. 2403–2418, 2001.

[18] D. Malah, R. V. Cox, and A. J. Accardi, “Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’99), vol. 2, pp. 789–792, Phoenix, Ariz, USA, March 1999.

[19] M. Nilsson, S. D. Soli, and J. A. Sullivan, “Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise,” Journal of the Acoustical Society of America, vol. 95, no. 2, pp. 1085–1099, 1994.

[20] Auditec, “Auditory Tests (Revised), Compact Disc,” Auditec, St. Louis, Mo, USA, 1997.

[21] J. E. Greenberg, P. M. Peterson, and P. M. Zurek, “Intelligibility-weighted measures of speech-to-interference ratio and speech system performance,” Journal of the Acoustical Society of America, vol. 94, no. 5, pp. 3009–3010, 1993.

[22] Acoustical Society of America, “ANSI S3.5-1997 American National Standard Methods for calculation of the speech intelligibility index,” June 1997.

[23] B. Moore, An Introduction to the Psychology of Hearing, Academic Press, New York, NY, USA, 5th edition, 2003.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 469480, 17 pages
doi:10.1155/2009/469480

Research Article

Improved Reproduction of Stops in Noise Reduction Systems with Adaptive Windows and Nonstationarity Detection

Dirk Mauler and Rainer Martin (EURASIP Member)

Department of Electrical Engineering and Information Sciences, Ruhr-Universitat Bochum, 44801 Bochum, Germany

Correspondence should be addressed to Dirk Mauler, [email protected]

Received 12 December 2008; Accepted 17 March 2009

Recommended by Sven Nordholm

A new block-based noise reduction system is proposed which focuses on the preservation of transient sounds like stops or speech onsets. The power level of consonants has been shown to be important for speech intelligibility. In single-channel noise reduction systems, however, these sounds are frequently severely attenuated. The main reasons for this are an insufficient temporal resolution of transient sounds and a delayed tracking of important control parameters. The key idea of the proposed system is the detection of non-stationary input data. Depending on that decision, a pair of spectral analysis-synthesis windows is selected which either provides high temporal or high spectral resolution. Furthermore, the decision-directed approach for the estimation of the a priori SNR is modified so that speech onsets are tracked more quickly without sacrificing performance in stationary signal regions. The proposed solution shows significant improvements in the preservation of stops with an overall system delay (input-output, excluding group delay of noise reduction filter) of only 10 milliseconds.

Copyright © 2009 D. Mauler and R. Martin. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

A large class of speech enhancement algorithms is realized in the spectral domain. Since their performance depends on the quality of the spectral representation of the noisy data, systems for a reliable and precise spectral analysis are required. Apart from filter bank implementations, a common approach is to compute the discrete Fourier Transform (DFT) on short overlapping time domain segments [1]. Short-time DFT systems with frame overlap are attractive because of their aliasing robustness and ease of implementation [2]. The data length of a short-time segment is on the one hand connected to the frequency resolution which is achieved after transformation. The longer the time domain segment, the higher the spectral resolution. A short data length, on the other hand, is required for a good temporal resolution. In noise reduction systems usually a fixed data length is used for the short-time spectral analysis, thus making a compromise between the required spectral resolution and the minimal admissible temporal resolution [3, page 469]. This concept, however, has a major drawback: in order to achieve a sufficiently high frequency resolution, in many noise reduction systems the data length of the short-time segments is longer than the duration of stationarity of the time domain signal, making short-time segments span over nonstationary signal sections. An example of this are segments that contain speech pause and speech active samples. As a consequence, the short-time spectrum results in an average spectrum over the different statistics of the current time domain signal section. Since the spectral representation is less pronounced, a suboptimal noise reduction performance results. Using shorter data segments for the DFT would solve this problem only at the cost of a reduced spectral resolution. The resolution in this case might yet be sufficient to represent spectra that are relatively flat like those of burst-like signals. However, spectra that convey many details would not be sufficiently well resolved when short data segments are used for the DFT.

This trade-off between spectral and temporal resolution has been addressed in recent algorithm developments. In [4] the data length that contributes to the spectral representation is adaptively grown or shrunk according to the stationarity range of the current signal section. In another approach [5] that focuses on audio restoration, the frequency resolution is improved using an extrapolation of


the time domain data prior to the computation of the short-time DFT. The disadvantages of this method are its high computational demands and the fact that the extrapolation requires perfect modeling of the signal, which is in general difficult to achieve. Furthermore, random noise cannot be properly extrapolated. In audio coding, analysis windows of different lengths and shapes are switched in a signal-dependent fashion [6, 7] in order to reduce pre-echo effects that may appear after decoding.

In many application fields like telecommunications or hearing instruments the system delay is of great importance. The group delay of a hearing instrument can produce a noticeable or even objectionable coloration of the hearing aid wearer’s own voice. In [8] it is reported that a delay of 3 to 5 milliseconds was noticeable to most of a group of normal hearing listeners while a delay of longer than 10 milliseconds was objectionable. In [9] asymmetric windows are presented as a way to reduce the delay in spectral analysis. However, spectral synthesis is not discussed and would become difficult with the proposed asymmetric windows if perfect or nearly perfect reconstruction is required. The delay issue has also been addressed recently in [10] where a warped analysis-synthesis filter bank for speech enhancement is presented which achieves a very low system delay. In a DFT-based analysis-synthesis system using overlap-add for signal synthesis, the delay is given by the frame length of the synthesis window, the frame advance and the group delay of a possible spectral modification filter.
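As a rough plausibility check (our own arithmetic, using the window-set parameters given later in Section 2.2: length-128 synthesis windows, frame advance R = 32 samples, 16 kHz sampling rate, and neglecting the group delay of the spectral modification filter), the input-output delay of such an overlap-add system is

$$\tau \approx \frac{L_{\mathrm{syn}} + R}{f_s} = \frac{128 + 32}{16000\ \mathrm{Hz}} = 10\ \mathrm{ms},$$

which is consistent with the overall system delay quoted in the abstract.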

In this contribution we propose an analysis-synthesis overlap-add framework that uses different analysis-synthesis window pairs. They differ in their length (before zero-padding) and their shape. Depending on the stationarity of the current time domain signal a proper window pair is selected for the analysis and synthesis. Data that is stationary over a relatively long span is analyzed using a long window in order to allow for a high spectral resolution, while short-time stationary data like bursts of stops or speech onsets are analyzed with a short data window so that the energy burst is well preserved in the spectral representation. The reduced spectral resolution that results from using a short analysis window is not considered as a limitation, since for the latter class of short burst-like signals we expect relatively broadband spectra with few spectral details. The proposed system achieves perfect reconstruction and produces the same low delay irrespective of the analysis-synthesis window pair that is currently in use.

The signal dependent selection of an analysis-synthesis window pair to be used requires the knowledge of the span of stationarity in the signal. In order to find the boundaries of signal stationarity, in [4] an iterative window growing algorithm is proposed that is based on a probabilistic framework. Since a necessary condition for stationarity is an invariant power of the random process, the temporal evolution of the mean power of consecutive frames is observed. Based on a likelihood ratio test a decision is made whether a neighboring frame contains data that originates from the same statistical process or not. The method requires a look-ahead over several frames of data in order to be able to determine the parameters of parameterized probability density functions (pdfs). It is thus not suited for very low delay applications. In an alternative approach [11] the detection of stationarity changes is based on an autoregressive signal model. For the reliable estimation of the model parameters a look-ahead of 20 ms is required, which again is not permissible for the very low delay applications that this paper focuses on. The approach presented here allows the detection of stationarity changes with a very low delay of about 2 ms.

Eventually, we propose and evaluate a noise reduction system that integrates the switching of the analysis-synthesis window pair based on the detection of stationarity changes in the time domain signal. The information on stationarity boundaries can be used to additionally improve the preservation of stops and speech onsets: we propose a change of the decision-directed a priori SNR estimator [12] and the amplification of plosive-like sounds. The latter is motivated by the fact that the improvement of the consonant-vowel intensity ratio was shown to be important for improving speech intelligibility [13–15].

In Section 2 we introduce the concept of switchable analysis-synthesis window pairs and estimate the benefit and the computational cost of the approach. Then, in Section 3, we introduce a detector for stationarity changes in the time domain signal that is based on a likelihood ratio test (LRT). The analysis of the properties of the likelihood ratio helps setting a proper threshold for the LRT. In Section 4 a noise reduction system is proposed and analyzed that makes use of the nonstationarity detection. Apart from switching the analysis and synthesis windows we propose two measures that aim at improving speech intelligibility by a preservation or amplification of speech onsets and burst-like sounds. Finally, in Section 5 we present experimental results.

2. Analysis-Synthesis Window Sets

In this section we define the spectral analysis-synthesis system that provides spectral data to the frequency domain noise reduction algorithm and synthesizes the time domain signal after possible spectral modifications.

The main idea in this section is to provide an analysis system with long and short analysis windows that are arbitrarily switchable. This allows a signal dependent selection of the appropriate analysis window. Each analysis window is matched with a particular synthesis window that guarantees perfect reconstruction for each window pair.

2.1. DFT-Based Analysis-Synthesis System. We assume a sampled noisy signal that is the sum of a speech signal, s(i), and uncorrelated noise, n(i):

y(i) = s(i) + n(i). (1)

The index i denotes the discrete time index of the data, sampled with sampling frequency fs.

We consider a block-based analysis system with K frequency bins and a frame advance of R samples. If we restrict the system to uniform frequency resolution, the discrete Fourier transform (DFT) can be used and efficiently implemented by means of a fast Fourier transform (FFT) algorithm. Then, the spectral coefficients, Yk(m), of the sampled time domain data y(i) are obtained as

Yk(m) = Σ_{i=0}^{K−1} y(mR + i) h(i) e^(−j2πki/K),   (2)

where h(i) denotes an analysis window, m is the subsampled (frame) index, k = 0, ..., K − 1 is the discrete frequency bin index, and K is the length of the DFT.

The spectral coefficients, Yk(m), may then be weighted with a spectral gain, Gk(m), before the signal synthesis is performed via IDFT, multiplication with a synthesis window, f(i), and a subsequent overlap-add operation [1].
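For illustration, the following is a minimal Python/NumPy sketch of this analysis-modification-synthesis chain: windowing, DFT as in (2), optional spectral weighting, inverse DFT, synthesis windowing, and overlap-add. The plain square-root Hann window pair and the helper names are placeholders of ours and not the switchable window set of Section 2.2.

```python
import numpy as np

def analysis_synthesis(y, K=512, R=32, h=None, f=None, gain_fn=None):
    """Block-based DFT analysis, optional spectral weighting G_k(m), and
    weighted overlap-add synthesis, cf. (2).  h, f: analysis and synthesis
    windows of length K (placeholder: square-root Hann pair)."""
    if h is None:
        h = np.sqrt(np.hanning(K))
    if f is None:
        f = np.sqrt(np.hanning(K))
    out = np.zeros(len(y))
    wsum = np.zeros(len(y))
    for m in range((len(y) - K) // R + 1):
        seg = y[m * R:m * R + K]
        Y = np.fft.rfft(seg * h)                  # spectral coefficients Y_k(m)
        G = gain_fn(Y) if gain_fn is not None else 1.0
        s = np.fft.irfft(G * Y, K) * f            # IDFT and synthesis window
        out[m * R:m * R + K] += s                 # overlap-add
        wsum[m * R:m * R + K] += h * f            # sum of shifted product windows
    wsum[wsum < 1e-12] = 1.0
    # normalizing by the overlap-added product window yields perfect
    # reconstruction whenever no spectral gain is applied
    return out / wsum
```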

2.2. Switchable Analysis-Synthesis Window Sets. In [16] a system with switchable analysis-synthesis window pairs is proposed which achieves perfect reconstruction and can provide spectral or temporal resolution in a flexible manner while always realizing the same small delay. The main ideas underlying the window design are the following.

(i) Since the spectral and temporal resolution of an analysis system is governed by the length of the analysis window, analysis windows of different lengths have to be provided for a system with maximum flexibility.

(ii) The delay in an overlap-add system is basically determined by the length of the synthesis window. Therefore, in order to realize the same short delay for all window pairs in a switchable analysis-synthesis system, the synthesis windows have to be of the same length regardless of the length of the associated analysis window.

(iii) In order to allow for an arbitrary frame-by-frame switching between different analysis-synthesis window pairs, in an overlap-add system the product of analysis and associated synthesis window has to be the same for all window pairs.

(iv) The analysis-synthesis system should be perfectly reconstructing whenever no processing is applied.

(v) The windows shall have reasonable frequency responses to avoid aliasing and imaging distortions.

For the subsequent investigations we use the window set example of Figure 1 [16]. It is designed for a K = 512 point DFT with frame advance R = 32 samples at 16 kHz sampling frequency and consists of two analysis-synthesis window pairs. The first window pair consists of a zero-padded square-root Hann window of length 128 (M = 64 in Figure 1) for both the analysis window, hI(i), and the synthesis window, fI(i). The product of analysis and synthesis window is a length-128 Hann window. The second window pair provides an asymmetric analysis window, hII(i), with square-root Hann slopes. The long asymmetric analysis window is padded with d = 64 zeros to alleviate spectral aliasing. The respective short synthesis window, fII(i), is designed in such a way that the product of analysis and synthesis window again results in the same length-128 Hann window as for the short window pair. Therefore, an arbitrary switching between either of the window pairs is possible without violating perfect reconstruction, of course assuming that the signal is not modified otherwise.
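The key design rule (iii) can be illustrated with a short NumPy sketch: fix the common product window (a length-2M Hann located in the most recent 2M samples of the K-point frame) and derive the synthesis window of any pair as the product window divided by its analysis window over that support. The asymmetric long window below is only a rough stand-in of our own making, not the published design of Figure 1.

```python
import numpy as np

K, M, d = 512, 64, 64
target = np.zeros(K)
target[K - 2 * M:] = np.hanning(2 * M)        # common product window (length-128 Hann)

# pair I: zero-padded square-root Hann for analysis and synthesis
hI = np.sqrt(target)
fI = np.sqrt(target)

# pair II: illustrative long asymmetric analysis window (d leading zeros,
# long rising slope, short falling slope over the last M samples) -- a stand-in only
hII = np.zeros(K)
hII[d:K - M] = np.sqrt(np.hanning(2 * (K - d - M)))[:K - d - M]
hII[K - M:] = np.sqrt(np.hanning(2 * M))[M:]

# matched synthesis window: product of analysis and synthesis equals the target
fII = np.zeros(K)
supp = target > 0
fII[supp] = target[supp] / hII[supp]

assert np.allclose(hI * fI, hII * fII)        # identical products -> pairs are switchable
```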

2.3. Analysis of Energy Gain Using Switched Windows. As mentioned before, short analysis windows provide a high temporal resolution. This implies that the energy of nonstationary signal sections, like bursts of plosive speech sounds, is better captured with a short analysis window than with a long one. In the following, we quantify this effect.

A gain Gswitch can be defined as the ratio of the signal power captured under the short analysis window, hI(i), to the power that would be captured under the long analysis window, hII(i):

Gswitch = [ Σ_{i=0}^{K−1} ( y(i) hI(i) / √( Σ_{j=0}^{K−1} (hI(j))² ) )² ] / [ Σ_{i=0}^{K−1} ( y(i) hII(i) / √( Σ_{j=0}^{K−1} (hII(j))² ) )² ].   (3)

The windows in the numerator and denominator of the above formula are normalized to unit power. Since K − 2M zeros are padded in window hI(i), the outer sum in the numerator can start at i = K − 2M.

In the best case the nonstationarity (e.g., a speech onset) occurs like a step function and coincides exactly with the limits of the short window, therefore maximizing the power that can be captured under the short window. This scenario is illustrated in Figure 2, where σ²_y(i), σ²_s, and σ²_n denote the power of the noisy signal, the speech power, and the noise power, respectively.

Speech is assumed to be statistically independent of the noise process and appears only during the 2M most recent samples. If additionally the windows hI(i) and hII(i) are assumed to be rectangular (Figure 2(a)), (3) simplifies so that an estimate for the maximally achievable gain is

Gswitch = (K/(2M)) / ( (K/(2M) − 1) · σ²_n/(σ²_s + σ²_n) + 1 ).   (4)

In Figure 3 this expression is evaluated as a function of the a priori SNR ξ = σ²_s/σ²_n and for several ratios of the length of the short window to the length of the long window. The solid lines show the result for the assumed rectangular windows, the dashed lines show the expected gain if the proposed tapered windows of Figure 2(b) are used instead. The gain of the tapered windows is always smaller than that obtained with rectangular windows. We find that using the proposed short window instead of the proposed long window during a burst-like speech sound at 15 dB a priori SNR improves the spectral representation by roughly 4 dB. Rectangular windows would yield a gain of about 5.6 dB under these conditions. Due to their unfavorable spectral properties, rectangular analysis windows do not represent an alternative but serve here as an upper bound for the possible gains, Gswitch. Note that an increase in the spectral representation of only a few dB may already help to change the filter behavior in a noise reduction application in a way that stops will be better preserved.
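For reference, the rectangular-window bound (4) is straightforward to evaluate numerically. The small helper below (our own, with the a priori SNR given in dB and the window-length ratio 2M/K as arguments) reproduces, for example, the value of roughly 5.6 dB quoted above for 15 dB SNR and the ratio 128/512.

```python
import numpy as np

def g_switch_rect_db(xi_db, ratio):
    """Idealized gain (4) in dB for rectangular windows.
    xi_db: a priori SNR sigma_s^2 / sigma_n^2 in dB; ratio: 2M/K."""
    xi = 10.0 ** (np.asarray(xi_db, dtype=float) / 10.0)
    k_over_2m = 1.0 / ratio                      # K / (2M)
    g = k_over_2m / ((k_over_2m - 1.0) / (xi + 1.0) + 1.0)
    return 10.0 * np.log10(g)

print(g_switch_rect_db(15.0, 128 / 512))   # about 5.6 dB
```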


[Figure 1: Example of a low-delay switchable analysis-synthesis window set guaranteeing perfect reconstruction (panels: short analysis window hI(i), long analysis window hII(i), short synthesis window fI(i), long synthesis window fII(i), plotted over sample index 0 ... 512). The window pair in the left column has good temporal resolution while the pair in the right column provides good spectral resolution. The asymmetry of the long analysis window emphasizes the most recent samples.]

In the preceding analysis we assumed that the nonstationarity coincides exactly with the limits of the short window. If this is not the case, the gain Gswitch decreases. Therefore it is advisable to operate the analysis system with a small frame advance R. For the proposed window set in Figure 1 a choice of R = M/2 turned out to be a good compromise between computational complexity and a sufficiently high temporal resolution. In terms of the filter-bank interpretation of a DFT analysis system a small frame advance corresponds to an oversampled system, which is also frequently used to reduce aliasing effects [1, page 339].

2.4. Computational Complexity. For the ease of use of a flexible spectral analysis-synthesis system it is desirable that the system behaves transparently to the spectral domain application when switching from one window pair to another. The system is therefore required to always provide the same number of spectral components no matter which window set is active. For this reason all analysis windows are zero-padded to the same length so that a DFT of one and the same length can be computed. Zero-padding the short window does not increase the spectral resolution but corresponds to an interpolation of the spectral data of the short window. This increases the computational complexity as compared to a standard system with a short analysis window without zero-padding, but allows for a more flexible allocation of temporal and spectral resolution. Compared to a standard long window for analysis and synthesis, the proposed solution is less complex, more flexible in terms of spectrotemporal resolution, and has a lower delay.

With these considerations, the total number of multiplications can be estimated for the following three cases.

(a) Standard system with symmetric short analysis and short synthesis windows, for example, square-root Hann windows.

(b) Proposed flexible analysis-synthesis system with a set of two window pairs:

(1) short analysis window and short synthesis window,

(2) long (asymmetric) analysis window and long (asymmetric) synthesis window.

(c) Standard system with symmetric long analysis and long synthesis windows, for example, square-root Hann windows.

Complexity is determined here in terms of the number of real-valued multiplications. These have been determined in [17] for the calculation of an N-point FFT or IFFT:

C(N) = 0.5 N log2(N) − 1.5 N + 2.   (5)

[Figure 2: Idealized scenario of a speech onset of power σ²_s in noise of power σ²_n. The speech onset coincides with the short spectral analysis window (the 2M most recent samples). The analysis window is either (a) rectangular or (b) tapered.]

If, as in the classical analysis-synthesis system, only a short window of length 2M is used without zero-padding, the complexity would amount to C(2M). Padding this window to the length K would increase the number of multiplications to C(K). However, as most of the input data to the FFT is zero, advanced techniques for pruned FFTs may reduce the complexity by a factor of [18]

r = 1 − l/q,   (6)

where 2^q = K and 2^l = 2M. Further multiplications are required when weighting the input data with the analysis window and the processed data with the synthesis window. Here, only the nonzero samples of every window have to be multiplied with the data.
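The per-frame counts behind Table 1 can be approximated with a few lines of Python based on (5), the pruning factor (6), and the number of nonzero window samples; the bookkeeping below is a simplified estimate of ours and may deviate slightly from the exact counts in [17, 18].

```python
import numpy as np

def fft_mults(N):
    """Real multiplications of an N-point FFT/IFFT, cf. (5)."""
    return 0.5 * N * np.log2(N) - 1.5 * N + 2

def frame_mults(K, nz_analysis, nz_synthesis, prune_to=None):
    """Multiplications per frame: analysis windowing + (possibly pruned)
    forward FFT + IFFT + synthesis windowing.  prune_to = 2M applies the
    pruning factor of (6) to the forward transform of a zero-padded frame."""
    fwd = fft_mults(K)
    if prune_to is not None:
        fwd *= np.log2(prune_to) / np.log2(K)    # C(K) reduced by r = 1 - l/q
    return nz_analysis + fwd + fft_mults(K) + nz_synthesis

ref = frame_mults(128, 128, 128)                       # case (a)
print(frame_mults(512, 128, 128, prune_to=128) / ref)  # short pair of (b): about 3.9
print(frame_mults(512, 448, 128) / ref)                # long pair of (b): about 4.7
print(frame_mults(512, 512, 512) / ref)                # case (c): about 5.3
```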

Table 1 reports the computational complexity relative to the complexity of case (a). Furthermore, the temporal resolution, the frequency bin spacing and the system delays are indicated for a sampling frequency of fs = 16 kHz and for a frame advance of R = 32 samples, which corresponds to 2 milliseconds. The relative computational complexity of the proposed solution varies between 3.9 and 4.7, depending on how frequently the short analysis window or the long asymmetric analysis window is used. In case (a), the system delay is only 10 milliseconds. However, when applying the short window set A to a noise reduction system, the denoised speech and the residual noise sound harsh and unnatural. In case (b), the system delay is also low, but during longer stationary signal sections like vowels or speech pauses the long asymmetric window can be used, resulting in more natural sounding speech and residual noise. The short window pair in case (b) should be applied during transitions or during bursts of a stop. Since in speech, and in particular in speech pauses, stationary signal sections dominate transient sections, the long analysis-synthesis window set will be used more frequently than the short window set, so that an effective relative computational complexity close to 4.7 can be expected. While the computational complexity increases when using the proposed solution B instead of A, it provides a considerably improved frequency bin spacing (about 36 Hz/bin), which in principle allows pitch harmonics to be resolved. A similarly high resolution is obtained in case (c) only at the price of a much greater system delay and an even slightly higher computational complexity.

[Figure 3: Expected spectral power gain Gswitch versus a priori SNR ξ = σ²_s/σ²_n for using a short analysis window instead of a long window during speech onsets. The curves are labeled with the ratio of the lengths of short and long windows, 2M/K (32/512, 64/512, 128/512, 256/512). Solid lines: rectangular analysis windows; dashed lines: the proposed window set. The envelope of the time domain signal is idealized as a step function whose section of increased power coincides with the boundaries of the short window.]

3. Detecting Stationarity Boundaries

In this section we develop a detector for stationarity boundaries of data which controls the selection of the windows to be used for the spectral analysis of the current segment. Since for a real-time application this decision has to be made frame-by-frame, the detector is optimized for decisions with very low latency. This is an important aspect in which our solution differs from other approaches which use statistical models whose free parameters need to be estimated over several frames [4], or, as in [11], require a sufficient number of samples corresponding to at least 20 ms. The algorithm presented in the following operates on the time domain sampled data. It also gives information on how reliable the stationarity decision is.

Table 1: Comparison of the proposed window set (B) with standard analysis-synthesis systems using short (A) or long (C) standard analysis and synthesis windows. The values are indicated for a sampling frequency fs = 16 kHz and a frame advance of R = 32 samples (2 ms). The effective complexity of the proposed solution varies between 3.9 and 4.7 depending on the rate of use of either the short or the long analysis window. Typically, the long analysis window will be used more frequently than the short analysis window.

                             (A) standard,       (B) proposed switchable set            (C) standard,
                             short symm.         short pair          long pair          long symm.
                             analysis/synthesis  (short symm.        (long asymm.       analysis/synthesis
                             windows             analysis/synthesis) analysis)          windows
    DFT length               128                 512                 512                512
    Zero-padding             0                   384                 64                 0
    Effective window length  128                 128                 448                512
    Complexity               1.0                 3.9 ... 4.7                            5.3
    Temporal resolution      8 ms                8 ms                32 ms              32 ms
    Frequency bin spacing    125 Hz/bin          125 Hz/bin          36 Hz/bin          31.25 Hz/bin
    System delay             (8+2) ms            (8+2) ms                               (32+2) ms

[Figure 4: Definition of blocks 1 and 2, consisting of K1 and K2 samples, respectively.]

3.1. Task and Hypotheses. Given a stream of time domain sampled data (see Figure 4), we want to decide whether the latest K2 samples (block 2) are likely to originate from the same statistical process as the K1 preceding samples (block 1). Thus, we have the following hypotheses:

H0: the samples in block 2 originate from the same statistics as those in the preceding block 1, that is, the data is stationary over both blocks;

H1: the samples in block 2 are supposed to follow different statistics than the samples in block 1 (detection of a stationarity boundary between the two blocks of data).

The lengths K1 and K2 can be arbitrarily set. A necessary condition for stationarity is that the process mean power must be constant over time. We inherently assume ergodicity of the random processes in the respective blocks of data since we replace the ensemble mean by the mean over the consecutive observations (e.g., squared time domain samples) within the respective blocks.

Furthermore, it is assumed that the samples within each of the two blocks are independent identically distributed (i.i.d.) and are wide-sense stationary within each block. This assumption may be violated, for example, during voiced speech or when the boundary between block 1 and block 2 does not coincide with the stationarity boundary. In practice, however, it turns out that with a proper parameter setting of the detector the stationarity detection works well even in these cases.

3.2. Likelihood-based Hypothesis Test. The hypothesis is tested with a likelihood ratio test (LRT). This requires the knowledge of the probability density function (pdf) which describes the distribution of the squared samples in blocks 1 and 2 under hypothesis H1 or H0.

Assuming that the observed time domain samples, y(i), are realizations of a zero-mean Gaussian random variable with variance σ²_Y, the squared observations, y²(i), are χ² distributed with N = 1 degree of freedom [19, Equations (5.33), (5.65)]:

p_W(ω) = 1/√(2π μ_W ω) · exp( −ω/(2 μ_W) ),   ω > 0,   (7)

and p_W(ω) = 0 for ω ≤ 0. The mean of the squared random variable, μ_W, is the variance of the noisy time-domain samples, μ_W = σ²_Y.

Given hypothesis H0, the data in both blocks originate from the same statistical process, so that the pdf describing the distribution of the squared samples in block 2 can be formulated using the variance of the noisy time-domain samples in block 1, σ²_Y1:

p_{W|H0}(ω | H0) = 1/√(2π σ²_Y1 ω) · exp( −ω/(2 σ²_Y1) ),   ω > 0.   (8)

If, on the other hand, the data in block 2 originate from different statistics than the data in block 1 (hypothesis H1), the mean power has to be defined using only data of block 2:

p_{W|H1}(ω | H1) = 1/√(2π σ²_Y2 ω) · exp( −ω/(2 σ²_Y2) ),   ω > 0.   (9)

Both conditional pdfs are zero for ω ≤ 0.


The variances of the noisy observations in blocks 1 and 2 constitute random variables, which may be approximated by their respective maximum likelihood estimates

σ²_Y1 = (1/K1) Σ_{k=0}^{K1−1} y²(i − k − K2),   (10)

σ²_Y2 = (1/K2) Σ_{k=0}^{K2−1} y²(i − k).   (11)

Given the squared observations in block 2, y²(i), a likelihood ratio (LR) test is defined by

[ Π_{k=0}^{K2−1} p_{W|H0}(ω | H0)|_{ω=y²(i−k)} ] / [ Π_{k=0}^{K2−1} p_{W|H1}(ω | H1)|_{ω=y²(i−k)} ]   ≷_{H1}^{H0}   λ′,   (12)

⟹   LR = (σ_Y2/σ_Y1) · exp( −(1/2) (σ²_Y2/σ²_Y1 − 1) )   ≷_{H1}^{H0}   λ,   (13)

with λ = λ′^(1/K2) being the LR decision threshold, to be set to a reasonable value from the interval of possible LR values, [0, 1].

The LR value gives an indication whether the observed mean energy σ²_Y2 could have originated from both distributions with equal or similar likelihood (LR > λ) or whether the statistics are significantly different. In the latter case we reject H0 and decide that a stationarity boundary has been detected; in the former case we accept H0 as we have no sufficient evidence that stationarity has been violated. The LR value itself gives information on the reliability of the decision. The more it approaches zero, the more reliable is the decision for H1. Accordingly, values close to one indicate a highly reliable decision for H0.

The value of the decision threshold λ controls the trade-off between detection and false alarm rates. The higher λ, the more stationarity boundaries are detected at the cost of an increased false alarm rate. In the next section we analyze the LR expression and investigate the relation between the threshold λ and the probabilities of detection and false alarm.
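A direct implementation of the test is compact. The sketch below (helper names are ours) computes the maximum likelihood variance estimates (10), (11) from the latest K1 + K2 samples and evaluates the likelihood ratio (13) against a threshold λ.

```python
import numpy as np

def lr_test(y, K1=32, K2=32, lam=0.25, eps=1e-12):
    """Likelihood ratio test (13) applied to the latest K1 + K2 samples of y.
    Returns (LR, detected): LR close to 0 indicates a reliable detection of a
    stationarity boundary, LR close to 1 a reliable decision for H0."""
    block1 = np.asarray(y[-(K1 + K2):-K2], dtype=float)
    block2 = np.asarray(y[-K2:], dtype=float)
    var1 = np.mean(block1 ** 2) + eps          # ML estimate (10)
    var2 = np.mean(block2 ** 2) + eps          # ML estimate (11)
    q = var2 / var1
    lr = np.sqrt(q) * np.exp(-0.5 * (q - 1.0))
    return lr, lr < lam

# toy usage: stationary noise with a strong onset in the last 2 ms (32 samples at 16 kHz)
rng = np.random.default_rng(0)
x = rng.normal(size=160)
x[-32:] += rng.normal(scale=3.0, size=32)
print(lr_test(x))
```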

3.3. Analysis of the Likelihood Ratio. In the sequel an expression will be derived for the likelihood ratio as a function of the SNR in the first block of data and the change of the SNR at the transition from block 1 to block 2. The analysis of the expected LR values and their variance helps to properly set the detection threshold λ.

3.3.1. Expected LR Value. Assuming speech and noise to be statistically independent random variables with σ²_S and σ²_N being their respective variances, the observed signal y(i) = s(i) + n(i) has variance σ²_Y = σ²_S + σ²_N. Therefore, the variances in blocks 1 and 2 may be written as

σ²_Y1 = σ²_S1 + σ²_N1 = (ξ1 + 1) σ²_N1,   σ²_Y2 = σ²_S2 + σ²_N2 = (ξ2 + 1) σ²_N2,   (14)

with ξi = σ²_Si/σ²_Ni (i = 1, 2) being the a priori SNR. With these relations the LR (13) can be written as a function of the a priori SNR in block 1, ξ1, and the change of the SNR at the transition from block 1 to block 2, ξ1→2 = ξ2/ξ1:

LR = √( (ξ1→2 ξ1 + 1)/(ξ1 + 1) ) · (σ_N2/σ_N1) · exp( −(1/2) [ ((ξ1→2 ξ1 + 1)/(ξ1 + 1)) · (σ²_N2/σ²_N1) − 1 ] ).   (15)

If the additive noise is stationary over both blocks 1 and 2, that is, σ²_N1 = σ²_N2, the likelihood ratio simplifies to

LR = √( (ξ1→2 ξ1 + 1)/(ξ1 + 1) ) · exp( −(1/2) [ (ξ1→2 ξ1 + 1)/(ξ1 + 1) − 1 ] ).   (16)

[Figure 5: Contour plot of the simplified likelihood ratio (16). The noise is assumed to be stationary (σ²_N2 = σ²_N1). ξ1 denotes the a priori SNR in block 1, ξ1→2 the step of the a priori SNR at the transition from block 1 to 2.]

Figure 5 illustrates the LR (16). The following conclusions can be drawn.

(i) Detection of an SNR increase (ξ1→2 > 0 dB):

(1) the further the SNR ξ1 lies below 0 dB, the higher the SNR step has to be in order to produce noticeably small LR values. However, the steepness of the LR function, that is, the decrease of the LR value as a function of the SNR step at a constant SNR ξ1, is similar for all SNR ξ1;

(2) at SNR ξ1 > 5 dB, the LR shows similar sensitivity for SNR increases.

(ii) Detection of an SNR decrease (ξ1→2 < 0 dB):

(1) below 0 dB SNR ξ1, a detection of an SNR decrease is impossible. This is plausible, as an SNR decrease of a signal that is already severely disturbed (ξ1 < 0 dB) does not result in a considerably lower power of the disturbed signal, which the detection is based on;

(2) for all SNR ξ1 > 10 dB the LR values decrease in a similar manner over ξ1→2, but less steeply than in the case of the detection of an SNR increase;

(3) we observe a saturation of the LR values at a level that increases with decreasing SNR ξ1. For example, at an SNR of ξ1 = 10 dB an expected LR value of less than 0.48 is not possible, irrespective of the magnitude of the SNR drop.

[Figure 6: Contour plot of the likelihood ratio (15) assuming a noise power rise from block 1 to block 2 (σ²_N2 = 4σ²_N1). ξ1 denotes the a priori SNR in block 1, ξ1→2 the step of the a priori SNR at the transition from block 1 to 2.]

In (16) (cf. Figure 5) the noise is assumed to be stationary over blocks 1 and 2, which is not always the case, for example for babble or cafeteria noise. Figure 6 shows the LR function for an assumed noise power increase by 6 dB at the transition from block 1 to block 2. During a speech pause the SNR is already very low (e.g., ξ1 < −10 dB) and a noise burst further degrades the SNR (ξ1→2 < 0 dB). In this case the LR function returns smaller values than in the case of stationary noise (cf. Figure 5). Therefore, depending on the level of the decision threshold λ, the detector might trigger on the noise burst. This example illustrates that the detector detects any instationarity and cannot distinguish between speech and noise.
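For reference, (16) can be evaluated directly; the small helper below (with both SNR arguments in dB) reproduces, for instance, the saturation value of about 0.48 at ξ1 = 10 dB mentioned above.

```python
import numpy as np

def lr_expected(xi1_db, step_db):
    """Simplified likelihood ratio (16) for stationary noise as a function of
    the a priori SNR xi_1 in block 1 and the SNR step xi_{1->2} (both in dB)."""
    xi1 = 10.0 ** (xi1_db / 10.0)
    step = 10.0 ** (step_db / 10.0)
    q = (step * xi1 + 1.0) / (xi1 + 1.0)       # variance ratio sigma_Y2^2 / sigma_Y1^2
    return np.sqrt(q) * np.exp(-0.5 * (q - 1.0))

print(lr_expected(10.0, -40.0))   # approaches roughly 0.48 for large SNR drops
```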

3.3.2. Variance of the LR Values. The variances of the modelled random processes, σ²_Y1 and σ²_Y2, have to be estimated from the given data (cf. (10), (11)). As a consequence the LR is a random variable with mean and variance. Since the LR is a transcendental function of the random variables σ²_Y1 and σ²_Y2, an analytic expression for the pdf of the LR is difficult to derive. In the following we therefore simulate the LR values for normally distributed input data y(i) and determine the histograms of the LR for a given SNR ξ1 in block 1 and for a given SNR step ξ1→2 at the transition from block 1 to block 2. In Figure 7 five histograms are plotted for five SNR steps, ξ1→2, and constant SNR ξ1. We observe that the variance of the data is particularly large for ξ1→2 = 6 dB (light blue) and is small for the cases ξ1→2 = 0 dB (green) and ξ1→2 = 12 dB (dark blue).

[Figure 7: Histograms of the LR values for constant SNR ξ1 = 5 dB and SNR steps ξ1→2 of −12, −6, 0, 6, and 12 dB. The amplitudes of the time-domain signal are Gaussian i.i.d. The distribution is broadest at an SNR increase between roughly 0 and 10 dB.]

We measured the variance of the distributions of the LR values not only for the five exemplary distributions in Figure 7, but for each pair of ξ1 ∈ [−40, 40] dB and ξ1→2 ∈ [−40, 40] dB with a resolution of 1 dB. The result is presented as a contour plot in Figure 8. The crosses indicate the five SNR combinations for which the distributions in Figure 7 have been shown. We notice that the variance is highest (about 0.045) for SNR increases and LR values close to 0.5 (compare with Figure 5), while for an SNR decrease the variance of the estimated LR is about one order of magnitude smaller. Therefore, the distributions of the LR values that are associated with the upper right quadrant in Figure 5 are relatively broad as compared to those associated with the lower right quadrant (see also Figure 7). The impact of this observation on detection and false alarm probabilities will be discussed in Section 3.4.

3.3.3. Optimal Block Lengths K1 and K2. A result from the preceding section is that for robust decisions the block lengths K1 and K2 should be as large as possible in order to reduce the variance of the estimates (10) and (11), therefore reducing the variance of the LR (13). At the same time block 2 should be short enough to span (in the majority of cases) data from only one statistical process. If block 2 contains data from more than one statistical process, the power measurement via (11) would be misleading, resulting in a wrong estimate of the SNR change.

For the low-delay detection of stops, for example, the duration of block 2 should not exceed a few ms. This is the typical duration of the brief burst that is produced after release of the vocal tract occlusion [20, Section 3.4.7]. Therefore, we set the duration of block 2 to 2 milliseconds.

In a frame-based implementation with a frame shift of R samples we extend the length K1 by the latest R samples whenever H0 (stationarity) was accepted in the preceding frame. By this, the variance of the maximum likelihood estimate σ²_Y1 (10) that is required in (13) can be reduced, leading to more robust decisions. Whenever in the preceding frame shift a nonstationarity boundary has been detected, this extension is stopped and the data on which σ²_Y1 is based is reset to only the latest K1 samples.

3.4. Detection Probability and False Alarm Probability. The proposed detector can also be characterized by its detection and false alarm probabilities. Using the probability density function (pdf) of the LR values, for a given SNR ξ1 and a given change of the SNR, ξ1→2, we define the following.

(i) False alarm: a nonstationarity is detected although the signals in blocks 1 and 2 originate from the same statistical process, that is, the expected SNR difference is ξ1→2 = 0 dB. We denote the probability associated with this event by

P_fa = ∫_0^λ p_LR(x | ξ1, ξ1→2 = 0 dB) dx.   (17)

(ii) Missed detection: although the data in block 1 and block 2 originate from different statistics, that is, the expected SNR difference is ξ1→2 ≠ 0 dB, a nonstationarity is not detected. The associated probability is denoted by

P_md = ∫_λ^1 p_LR(x | ξ1, ξ1→2) dx.   (18)

The detection probability is defined as Pd = 1 − Pmd. In the sequel we determine the detection probability and the false alarm probability of the proposed detector. The pdf is again approximated with histograms.
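These probabilities can be approximated by Monte Carlo simulation under the Gaussian signal model, mirroring the histogram-based evaluation. The sketch below draws blocks with the prescribed SNRs, evaluates (13), and counts decisions below λ; the function and parameter names are ours.

```python
import numpy as np

def prob_lr_below(xi1_db, step_db, lam, K1=32, K2=32, trials=20000, seed=1):
    """Empirical probability that LR (13) falls below lam, for unit-variance
    stationary noise, a priori SNR xi1 in block 1 and an SNR change step_db at
    the block boundary.  step_db = 0 approximates P_fa (17); otherwise the
    detection probability P_d = 1 - P_md, cf. (18)."""
    rng = np.random.default_rng(seed)
    xi1 = 10.0 ** (xi1_db / 10.0)
    xi2 = xi1 * 10.0 ** (step_db / 10.0)
    y1 = rng.normal(scale=np.sqrt(1.0 + xi1), size=(trials, K1))
    y2 = rng.normal(scale=np.sqrt(1.0 + xi2), size=(trials, K2))
    q = np.mean(y2 ** 2, axis=1) / np.mean(y1 ** 2, axis=1)
    lr = np.sqrt(q) * np.exp(-0.5 * (q - 1.0))
    return float(np.mean(lr < lam))

print(prob_lr_below(10.0, 0.0, lam=0.84))   # false alarm rate, a few percent
print(prob_lr_below(10.0, 6.0, lam=0.84))   # detection rate for a 6 dB onset
```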

As an example, let us first consider the detection of SNR increases (e.g., bursts or speech onsets) of ξ1→2 = 6 dB at ξ1 = 10 dB SNR. We ask for the decision threshold that is necessary to detect 95% of these SNR rises. The top plot in Figure 9 shows, for every SNR change ξ1→2 (1 dB resolution), the distribution of the LR values for the given SNR ξ1 = 10 dB. The natural logarithm of the relative frequencies is mapped to gray levels. The dashed lines show the 5%- and the 95%-percentiles of the distributions. The distributions are broadest for ξ1→2 ∈ [5, 8] dB. For ξ1→2 > 15 dB the variances of the distributions are very small.

The lower plot in Figure 9 shows the detection probabilities as a function of the SNR step for three thresholds λ. With a threshold of λ = 0.93 (thin line) almost 100% of the SNR steps greater than 6 dB are detectable. However, the false alarm rate, which is found at ξ1→2 = 0 dB, is 16%, which is unacceptably high.

[Figure 8: Contour plot of the variance of the LR estimates. The variance has been determined empirically (K1 = 32, K2 = 32, R = 32). ξ1 denotes the SNR in block 1, ξ1→2 the change in SNR at the transition from block 1 to 2. The crosses illustrate the points for which the LR histograms are plotted in Figure 7.]

With a threshold of λ = 0.84 about 95% of the SNR steps ξ1→2 = 6 dB can be detected, while detections at ξ1→2 = 0 dB are expected with probability Pfa = 2.8%. Although this false alarm probability is relatively small, we see that even for small SNR steps in the interval ξ1→2 ∈ ]0, 6[ dB detections occur at a considerable rate. In order to detect mainly those SNR steps that exceed a certain SNR threshold, the decision threshold λ has to be decreased. The thick solid line shows the detection probability for λ = 0.25. In this case SNR increments between 0 and 5 dB do not result in a significant detection rate. Only if the SNR rise is larger than 5 dB does the detection rate increase, attaining 95% for ξ1→2 = 10 dB. The low threshold λ = 0.25 is thus advantageous if only considerable changes in the SNR of at least five to ten dB should be detected.

While Figure 9 shows the detection probability for an exemplary SNR ξ1 = 10 dB, in the same way the detection probabilities for a given threshold λ can be determined for all ξ1 ∈ [−40, 40] dB. The result for λ = 0.25 and 1 dB resolution is shown in Figure 10 as a contour plot. The dotted red line indicates those cases where the detection probability equals 0.95. Additionally, SNR decreases are detected in the same fashion and can be distinguished from SNR rises by comparison of the estimated variances (10), (11) in blocks 1 and 2.

[Figure 9: (a) Distributions of the LR for ξ1 = 10 dB. The gray scale represents the natural logarithm of the measured relative frequencies of the LR values. The variance of the histograms is high during the decrease of the LR mean (dotted line). (b) Detection probabilities for three decision thresholds λ (0.93, 0.84, 0.25) as a function of the observed SNR change ξ1→2.]

[Figure 10: Detection probability, Pd, as a function of the SNR in block 1 and the SNR change from block 1 to block 2 (λ = 0.25). The dotted red line highlights the cases where Pd = 0.95.]

3.5. Example. In Figure 11 we show the use of the detector for the detection of strong phoneme onsets in continuously spoken disturbed speech. The assumed Gaussianity of the pdfs of speech and noise is approximately fulfilled, in particular during unvoiced speech, like stops. The clean speech [21] was mixed with speech-shaped noise to an SNR of 10 dB (bottom plot). The phonetic labels are printed on the plot. In the upper part of the figure the LR values are given. The duration of block 2 was set to 2 milliseconds. Whenever the LR falls below λ (dashed line) the detector fires. In this example bursts of stops are detected robustly and in time. The phoneme [k] shortly before 0.6 seconds is not detected. An analysis of the SNR reveals that ξ1 = −9 dB and the SNR increase is ξ1→2 = 11.4 dB during the burst of the phoneme. Regarding the preceding analysis of the detector it is clear that a detection under these severe conditions is not possible with the given threshold λ (cf. Figure 5). The decisions are obtained within only 2 ms delay (K2 = 32, sampling rate fs = 16 kHz).

3.6. Evaluation of Detection Performance. The proposed detector was used in a framework to verify its performance. A total number of 4200 clean speech sentences from the TIMIT database [21] have been disturbed with stationary speech-shaped noise, each at a mean segmental SNR of 10 dB.

Then, using the phonetic labels of the TIMIT database, the number of occurrences of each phoneme was counted. For each occurrence of a phoneme it was recorded whether it was detected by the proposed detector or not. In case of a detection, the SNR increase, ξ1→2, and the SNR ξ1 during the detection have been recorded. If the detector did not fire, the maximum SNR increase within the boundaries of the phoneme, ξ1→2, and the respective SNR, ξ1, have been recorded in order to document at which SNR increase the detector failed to fire. The detection threshold is set to λ = 0.1.

Given these data, histograms of the occurrences of a phoneme in the plane spanned by ξ1→2 and ξ1 can be created. This is illustrated for the stop “t” in Figure 12. In the same manner the detection counts and the missed detections are illustrated in Figures 13 and 14, respectively.

It can be concluded from Figure 12 that under the given measurement conditions, during the closure of the stop “t”, the SNR ξ1 is roughly −30 dB and the SNR rise during the burst is around 40 dB. If the SNR increase leads to an SNR close to or less than 0 dB, the stop cannot be detected (Figure 14). In this case a multichannel spatial preprocessing (e.g., [22]) can help to improve the SNR prior to the detection.

Occurrences of the stop “t” whose (ξ1, ξ1→2) coordinates correspond to a small LR value can be detected robustly (Figure 13). The histogram thus confirms the theoretical considerations of the preceding subsections.

The experiment could be repeated for a higher or a lower input noise level. This would shift the histograms towards lower or higher SNR ξ1, respectively, so that fewer or more phonemes would be detected.

Since the detector is sensitive to any transient, in nonstationary environments, like cafeteria noise, we also expect detections of noise bursts. If this is to be prevented, the detection threshold λ could be lowered in nonstationary environments. In a hearing instrument this can be triggered by manually selecting a situation-specific hearing-aid program, or could be controlled by an automatic classification system as used in state-of-the-art hearing instruments [23].


[Figure 11: Example usage of the detector for low-delay detection of instationarities in continuously spoken disturbed speech (bottom, “complexity of complete marketing planning”, [21]), with phonetic labels printed above the waveform. The LR values are plotted on top; the decision threshold λ = 0.25 is represented by the dash-dotted line.]

[Figure 12: Occurrences of phoneme “t” in sentences disturbed at 10 dB mean segmental SNR, plotted over the SNR ξ1 (dB) and the SNR increase ξ1→2 (dB).]

4. Modifications to the Noise Reduction System

With the ability to detect instationarities in disturbed speech, the classical noise reduction system is extended as illustrated in Figure 15. The detection of instationarities is based on the highpass-filtered input signal, y(i). As many noise types show a lowpass characteristic, highpass filtering improves the SNR prior to the detection and hence helps to improve the detection rate. Given the likelihood ratio, LR, at the output of the detector, in the following paragraphs we discuss three possible measures that can be applied on their own or in combination.

In short, we propose to

(i) switch the analysis (and synthesis) window of the spectral analysis system for a better temporal resolution during transitional segments;

(ii) adapt the decision-directed estimator for the a priori SNR [12] to allow for a faster and more precise tracking of the a priori SNR during transitions;

(iii) amplify a segment that has been classified as transitional to improve speech intelligibility [14, 15].

[Figure 13: Detected occurrences of phoneme “t” (λ = 0.1).]

4.1. Window Switching. Figure 16 illustrates how the nonstationarity detection is applied to the spectral analysis-synthesis window sets presented in Section 2. Block 2 has a length of K2 = 32 samples (2 ms), centered on the short analysis window, hI(i). Block 1 is initially also of length K1 = 32 samples but grows by R samples per frame shift as long as no nonstationarity is detected and a maximum length of 5R samples is not exceeded. As argued before, this strategy effectively reduces the variance of the LR estimate. If a nonstationarity is detected, block 1 is reset to the last K1 = 32 samples.
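The control logic can be sketched as follows (the loop structure and names are our own): each frame, the LR is evaluated for block 2 (the latest K2 = 32 samples) against a block 1 that grows by R samples per stationary frame up to 5R, and the short window pair is selected whenever the detector fires.

```python
import numpy as np

def window_decisions(y, R=32, K2=32, K1_max=5 * 32, lam=0.25, eps=1e-12):
    """Frame-by-frame window selection as in Section 4.1: 'short' when a
    nonstationarity is detected, 'long' otherwise.  Block 1 grows by R samples
    per stationary frame (up to K1_max) and is reset after a detection."""
    decisions = []
    K1 = K2
    for end in range(K1_max + K2, len(y) + 1, R):
        v1 = np.mean(y[end - K2 - K1:end - K2] ** 2) + eps
        v2 = np.mean(y[end - K2:end] ** 2) + eps
        q = v2 / v1
        lr = np.sqrt(q) * np.exp(-0.5 * (q - 1.0))
        if lr < lam:                      # stationarity boundary detected
            decisions.append('short')
            K1 = K2                       # reset block 1
        else:
            decisions.append('long')
            K1 = min(K1 + R, K1_max)      # grow block 1
    return decisions
```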

[Figure 14: Missed detections of phoneme “t” (λ = 0.1).]

4.2. Modified Decision-Directed Approach. In [12] the decision-directed estimator for the a priori SNR is proposed. It estimates the a priori SNR via a weighted sum of the current maximum a posteriori (MAP) estimate of the a priori SNR and an estimate which is built from the speech spectrum estimated in the preceding frame, Ŝk(m − 1):

ξk(m) = α · |Ŝk(m − 1)|²/σ²_N(m − 1) + (1 − α) · max( |Yk(m)|²/σ²_N(m) − 1, 0 ),   α ∈ [0, 1[.   (19)

The first estimate ξk(m) after a speech onset is ruled by the a posteriori SNR, |Yk(m)|²/σ²_N(m), in the second term of (19), since the feedback term |Ŝk(m − 1)|²/σ²_N(m − 1) is small due to the speech pause in the preceding frame. Since the second term in (19) is weighted with 1 − α, which is typically of the order of −12 dB to −17 dB (α = 0.94 ... 0.98), the a priori SNR estimate considerably underestimates the true a priori SNR during speech onsets [24]. As a consequence, stops, which are normally of low intensity, are often severely attenuated by noise reduction filters based on the decision-directed approach.

By lowering the parameter α, the response time to fast changes of the SNR can be improved, however, only at the price of an increased distortion of the residual noise (musical noise). Therefore it was proposed to make the parameter α of the decision-directed approach time-dependent [25] or time- and frequency-dependent [26]. In [27] the response time of the a priori SNR estimator to SNR increases is improved with a recursion step in which, per frame advance, a preestimate of the clean speech spectrum is computed which is then used to determine the decision-directed estimate of the a priori SNR.

While in [25, 26] the parameter α is modified frame-by-frame, we propose to change it only if a speech onset is detected. Whenever a significant power increase is reliably detected (LR less than a threshold, LR_thresh1), αk(m) is reduced for those frequency bins k where speech activity is likely. The latter is important, as a broadband reduction of αk(m) leads to audible musical noise in those frequency bands that are not masked by the speech.

To realize the desired behavior of αk(m), the maximum likelihood estimate of the a priori SNR is smoothed along frequency and is then linearly mapped to the range of values α ∈ [0, αmax], where αmax < 1 is typically 0.94 ... 0.98. Estimates of the a priori SNR greater than or equal to 15 dB very likely indicate the presence of speech and are therefore mapped to αk(m) = 0 to preserve the speech presence in those frequency bins. Estimates less than or equal to 0 dB are very likely dominated by noise and are therefore mapped to αk(m) = αmax.

This procedure is applied for three consecutive frames after the onset detection. After this time the feedback of the estimated clean speech spectra Ŝk(m − 1) in (19) will have established more robust estimates, ξk(m), so that αk(m) can be increased again to αmax until the next onset is detected.
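A possible implementation of this modification is sketched below. The 5-bin smoothing along frequency, the parameter names, and the use of a single noise PSD estimate for both terms of (19) are simplifications of ours.

```python
import numpy as np

def a_priori_snr(Y, S_prev, noise_psd, frames_since_onset, alpha_max=0.96):
    """Decision-directed a priori SNR estimate (19) with the onset-dependent,
    frequency-dependent alpha_k(m) of Section 4.2.  Y: noisy spectrum of the
    current frame, S_prev: clean-speech estimate of the previous frame,
    frames_since_onset: 0, 1, 2 for the three frames after a detected onset."""
    ml = np.maximum(np.abs(Y) ** 2 / noise_psd - 1.0, 0.0)   # ML a priori SNR estimate
    if frames_since_onset < 3:
        ml_db = 10.0 * np.log10(np.convolve(ml, np.ones(5) / 5.0, 'same') + 1e-12)
        # 15 dB and above -> alpha = 0 (preserve speech onset), 0 dB and below -> alpha_max
        alpha = alpha_max * (1.0 - np.clip(ml_db, 0.0, 15.0) / 15.0)
    else:
        alpha = alpha_max
    return alpha * np.abs(S_prev) ** 2 / noise_psd + (1.0 - alpha) * ml
```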

4.3. Amplification of Transients. In [13] the effect of adjusting the consonant-vowel intensity ratio on consonant recognition by hearing impaired subjects was investigated. The recognition of stops was significantly improved when the release burst of the stop was amplified. The improvement reached a maximum when the consonant-vowel intensity ratio was amplified by roughly 8 to 14 dB (depending on the stop, the vowel environment, the audiogram configuration, etc.). While the results of this study relate to the undisturbed case, in [14] speech material was used that was disturbed to 6 dB SNR with 12-talker babble. The effects of three modifications are compared: (1) increasing the duration of consonants, (2) increasing the consonant-vowel intensity ratio by 10 dB, and (3) a combination of (1) and (2). The most significant improvements are obtained from increasing the consonant-vowel intensity ratio. Similar results are obtained in [15], where bursts of plosives are amplified by 12 dB. As opposed to the studies presented before, in [15] also sentence material was used as stimulus. In this case smaller improvements from the amplification of the consonantal region were observed compared to the case where consonant-vowel-consonant stimuli were used. The clean speech was disturbed with speech-shaped noise at −5, 0 and 5 dB SNR.

[Figure 15: Overview of a noise reduction system with nonstationarity detection. The stationarity decision controls the choice of the spectral analysis-synthesis window and the estimation of the clean speech spectral coefficients.]

[Figure 16: Stationarity detection applied to the analysis system with frame advance R and short windows of length 2M = 4R. In the example, the analysis window is switched from long asymmetric to short symmetric. Block 2 consists of K2 = R samples. Block 1 has initially length K1 = K2 but grows by R samples as long as no nonstationarity is detected and until an upper limit for the length is reached.]

[Figure 17: Mapping the likelihood ratio LR to a frame amplification Gtrans.]

Based on these findings, in our proposed system, in addition to the window switching, we amplify the samples of those frames that most probably contain a speech onset. To this end, the frame data is amplified with a gain Gtrans whenever the LR (13) falls below a threshold LR_thresh2. In the cited works, the point in time and the duration of a consonant are perfectly known, as annotated speech was used in the investigations. In our case, speech onsets have to be detected in the disturbed signal. To account for the uncertainty of the detection, we let the gain increase linearly with increasing reliability of the nonstationarity decision, that is, the smaller the value of LR, the higher is Gtrans, cf. Figure 17. As soon as the LR exceeds the threshold LR_thresh2, we let the gain Gtrans decay exponentially to Gtrans = 1 with a time constant of roughly 20 ms. This was found to be perceptually advantageous over an abrupt decrease of the gain. Strong consonant amplification as proposed in the previously cited works results in unnaturally sounding speech. A limitation of the maximum gain to 3.5 dB results in a clearly perceptible amplification of transient sounds like bursts of stops, but preserves the naturalness of the speech. It is important to note that the proposed increase of the consonant-vowel intensity ratio becomes feasible only with short analysis windows. The amplification of the data captured under a classical long analysis window can produce audible noise prior to the amplified speech onset if the onset occurs only in the most recent samples of the frame. With the concept of switched windows, however, the short analysis and synthesis windows will be used whenever a speech onset is detected, hence preventing audible prenoise.
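The mapping of Figure 17 together with the exponential release can be sketched as follows; the knee value, the 3.5 dB ceiling, and the roughly 20 ms time constant follow the text, while the names and the per-frame recursion are our own choices.

```python
import numpy as np

def transient_gain(lr, prev_gain, lr_knee=0.25, g_max_db=3.5,
                   R=32, fs=16000, tau=0.02):
    """Frame amplification G_trans: rises linearly towards G_max as the LR
    approaches 0 (cf. Figure 17); once the LR exceeds the knee, the gain
    decays exponentially back to 1 with a time constant of about 20 ms."""
    g_max = 10.0 ** (g_max_db / 20.0)
    if lr < lr_knee:
        return 1.0 + (g_max - 1.0) * (1.0 - lr / lr_knee)
    decay = np.exp(-R / (fs * tau))        # per-frame decay factor
    return 1.0 + (prev_gain - 1.0) * decay
```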

5. Results

5.1. Example of Estimated Speech. To illustrate the consequences of the measures proposed in Section 4, speech disturbed with speech-shaped noise has been denoised using a frequency domain Wiener filter and decision-directed estimation of the a priori SNR. The spectral analysis is realized using either the asymmetric long window, hII(i), permanently, or the short and long analysis window set, hI(i), hII(i), switched according to the nonstationarity decisions taken by the detector presented in Section 3. In a further case, not only the window set is switched, but also the parameter α of the decision-directed approach is modified as proposed in Section 4.2.

Time-domain signals of the utterance “Poach the apples in ...” are given in Figure 18. Figure 18(a) shows the clean and the noisy signal at 10 dB SNR (speech-shaped stationary noise). Figure 18(b) contains the output of a Wiener filter single-channel noise reduction. Since only the long spectral analysis window is used, the stops at 0.05 seconds and at 0.75 seconds are considerably distorted. In Figure 18(c) the result obtained with a signal dependent switching between long and short analysis windows is shown. At the bottom, the window decision is plotted. By using the short analysis window during transient sounds, the distortion of these sounds in the filtered output can be reduced. Finally, in Figure 18(d) the result with the additionally modified decision-directed approach is plotted. It shows considerable improvements for the transients. In particular the two stops at 0.05 s and at 0.75 s are very well preserved in the filtered output.

In Figure 19 the spectrograms of the same example are given. The spectra are obtained using a 128-point DFT of the data weighted with a Hann window and 75 percent overlap. As before, we observe a better preservation of the phoneme [p]. Additionally, the speech onset at frame index m = 50 is better preserved when the analysis window is switched to the short window (Figures 19(d) and 19(e)) and is even better preserved when the modified decision-directed approach is used (Figure 19(e)).


[Figure 18: Time-domain signals of the input and of the results obtained after noise reduction with a Wiener filter and the methods proposed in Sections 4.1 and 4.2. (a) Clean and noisy speech (“Poach the apples in ...”) at 10 dB SNR, speech-shaped stationary noise. (b) Filter output using only the long window set (hII(i), fII(i)). (c) Filter output using either the short or the long window set, switched according to the window decision plotted at the bottom. (d) As before, but additionally with the modified decision-directed approach.]

[Figure 19: Spectrograms (dB) of the input signals and of the results obtained after noise reduction with a Wiener filter and the methods proposed in Sections 4.1 and 4.2. To create the spectrograms a 128-point DFT with 32 samples frame advance at 16 kHz sampling frequency and a Hann data window was used. (a) Clean speech (“Poach the apples in ...”). (b) Speech plus speech-shaped noise at 10 dB SNR. (c) Filter output using only the long window set (hII(i), fII(i)). (d) Filter output using either the short or the long window set, switched according to the window decision plotted at the top (a high level means the short window set (I) is used, else the long window set (II)). (e) As before, but additionally with the modified decision-directed approach.]


[Figure 20: Spectra of the clean speech, Sk(m), the noisy observation (speech-shaped additive noise), Yk(m), and the estimated clean speech, Ŝk(m), of the phoneme [p] in “poach.” The estimated clean speech is obtained by Wiener filtering using a spectral analysis based on either the asymmetric long window only, hII(i), or on the window set hI(i), hII(i) from Figure 1. If the window set is used, the window decision is based on the nonstationarity decision. The thin solid black line shows the result if, in addition to the switched window set, the proposed modification of the decision-directed approach is applied. Frame amplification (cf. Section 4.3) is not shown here. The spectra have been created using a 128-point DFT and Hann windowed data (sampling frequency fs = 16 kHz).]

[Figure 21: Relative frequencies of the improvements of log-spectral distortion in 2792 occurrences of the phoneme [d], for the three configurations hII(i) only, hI(i)/hII(i) switched, and hI(i)/hII(i) switched with modified decision-directed approach. Values less than zero signify less distortion of the proposed system as compared to a system using a square-root Hann window for spectral analysis-synthesis that produces the same small system delay.]

In Figure 20 we show sample spectra of the denoised speech during articulation of the phoneme [p] in the word “poach.” For comparison, the spectra of the clean speech, Sk(m) (thick solid green line), and of the noisy observation, Yk(m) (dotted black line), are also plotted.

At frequency bins k = 22 ... 30 and k = 40 ... 50 the speech spectrum is better preserved when using the switched window set (dashed blue line) as compared to the results obtained with the long asymmetric window only (red solid line). The maximum gain observed in this example is about 4 dB. If additionally the proposed modification of the decision-directed estimator of the a priori SNR is applied (thin solid black line), the estimated speech spectrum, on average, preserves the actual speech spectrum much better. As a consequence, the phoneme sounds sharper than without modification of the decision-directed SNR estimator and without window switching.

5.2. Instrumental Evaluation. In our experiment 4132 clean speech utterances [21], disturbed with additive speech-shaped noise at 10 dB SNR, have been processed with a Wiener filter single-channel noise reduction using either square-root Hann windows (length 8 ms) for spectral analysis and synthesis or the proposed system. In terms of delay the square-root Hann window is comparable with the proposed system (cf. Table 1, A versus B). Then, for every occurrence of a phoneme, the intelligibility-weighted [28] mean log-spectral distortion has been determined [29]. The mean is computed over frames with a segmental SNR greater than 5 dB. A measurement frame is only 2 ms long in order to be able to resolve the short bursts of stops. Finally, the differences between the spectral distortion produced by the proposed system and the distortion produced by the square-root Hann windows were determined. Figure 21 shows the histogram of the differences for the example of the phoneme [d]. A negative value signifies that the distortion obtained with the proposed system is less than in the case of the square-root Hann windows. Below the histograms the mean and the 5%- and 95%-significance levels of the three distributions are indicated. Using the long window without switching to the short window (thick solid red line) produces on average a similar distortion as obtained with the reference window. We observe slightly less distortion when the window is switched to the short window (thick blue dashed line). When additionally the decision-directed approach is modified (thin solid black line), the average distortion is considerably reduced (about 2.8 dB less than the reference). The distribution becomes bimodal because not all occurring phonemes [d] are detected.
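For orientation, a simplified version of such a distortion measure is sketched below; the intelligibility weighting of [28] and the exact segmental-SNR gating are omitted, so the numbers it produces are not directly comparable to those reported here.

```python
import numpy as np

def mean_log_spectral_distortion(S_ref, S_est, active):
    """Mean log-spectral distortion in dB between reference and estimated
    spectra (shape: frames x bins), averaged over the frames flagged in
    'active' (e.g., frames with segmental SNR > 5 dB).  Simplified sketch:
    no intelligibility weighting."""
    eps = 1e-12
    d = 20.0 * np.log10((np.abs(S_ref) + eps) / (np.abs(S_est) + eps))
    per_frame = np.sqrt(np.mean(d ** 2, axis=1))   # RMS log-spectral distance
    return float(np.mean(per_frame[np.asarray(active)]))
```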

5.3. Listening Tests. Informal experiments conducted with four expert listeners confirmed the improved reproduction of stops with the proposed modification of the noise reduction system. Stationary speech-shaped noise and cafeteria babble were used at 5 and 10 dB SNR. The amplification Gtrans of transient frames (see Section 4.3) was limited to G^max_trans = 3.5 dB because this resulted in natural sounding speech. Note that in [13–15] stronger amplifications of about 10 dB are proposed to achieve a higher speech intelligibility.

6. Conclusion

In this paper a new system for block-based speech enhancement is proposed. The focus is on the preservation of stops, since their clarity is crucial for the preservation of speech intelligibility. The main idea is to detect nonstationary data in the signal segment under investigation. Given this information, a signal adapted spectral analysis and synthesis is performed. A short analysis window is used during plosive sounds. It ensures a high temporal resolution and thus helps to keep the impulsive energy of burst-like sounds concentrated in their spectrotemporal representation. A long analysis window is used when the signal is stationary. The high spectral resolution obtained with that window allows performing noise reduction in between spectral pitch harmonics.

In addition to switching the window set for spectral analysis and synthesis, the decision-directed SNR estimator [12] is modified to yield less distortion of speech onsets and stops. With the nonstationarity decision at hand, the amplification of stops also becomes possible, which has been shown to improve intelligibility [13–15].

To control the switching of the spectral analysis and synthesis windows, a low-latency likelihood-based detector for instationarities has been derived. Its properties have been analyzed and the detection performance was verified experimentally. The examples of the time-domain and spectral representations of signals denoised with the proposed system demonstrate that the signal dependent selection of the spectral analysis-synthesis window set allows stops and speech onsets to be better preserved. Similarly, a considerably improved reproduction of stops has been shown for the proposed modification of the decision-directed SNR estimator. This is also confirmed by informal listening tests. For the future, formal listening tests are planned to check the proposed approach for intelligibility and quality improvements.

Acknowledgment

This work was sponsored in part by grants from the German High-Tech Initiative, 03FPB00097.

References

[1] R. E. Crochiere, Multirate Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, USA, 1st edition, 1983.

[2] K. Eneman and M. Moonen, “DFT modulated filter bank design for oversampled subband systems,” Signal Processing, vol. 81, no. 9, pp. 1947–1973, 2001.

[3] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice-Hall, Englewood Cliffs, NJ, USA, 1st edition, 1993.

[4] R. C. Hendriks, R. Heusdens, and J. Jensen, “Adaptive time segmentation for improved speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 2064–2074, 2006.

[5] I. Kauppinen and K. Roth, “Improved noise reduction in audio signals using spectral resolution enhancement with time-domain signal extrapolation,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 6, pp. 1210–1216, 2005.

[6] B. Edler, “Coding of audio signals with overlapping block transform and adaptive window functions,” Frequenz, vol. 43, no. 9, pp. 252–256, 1989 (German).

[7] MPEG.ORG, “MPEG-1 Layer 3,” 1991, http://www.mpeg.org.

[8] J. Agnew and J. M. Thornton, “Just noticeable and objectionable group delays in digital hearing aids,” Journal of the American Academy of Audiology, vol. 11, no. 6, pp. 330–336, 2000.

[9] D. A. F. Florencio, “On the use of asymmetric windows for reducing the time delay in real-time spectral analysis,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’91), vol. 5, pp. 3261–3264, Toronto, Canada, April 1991.

[10] H. W. Lollmann and P. Vary, “A warped low delay filter for speech enhancement,” in Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC ’06), Paris, France, September 2006.

[11] V. Tyagi, H. Bourlard, and C. Wellekens, “On variable-scale piecewise stationary spectral analysis of speech signals for ASR,” Speech Communication, vol. 48, no. 9, pp. 1182–1191, 2006.

[12] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.

[13] E. Kennedy, H. Levitt, A. C. Neuman, and M. Weiss, “Consonant-vowel intensity ratios for maximizing consonant recognition by hearing-impaired listeners,” The Journal of the Acoustical Society of America, vol. 103, no. 2, pp. 1098–1114, 1998.

[14] S. Gordon-Salant, “Recognition of natural and time/intensity altered CVs by young and elderly subjects with normal hearing,” Journal of the Acoustical Society of America, vol. 80, no. 6, pp. 1599–1607, 1986.

[15] V. Hazan and A. Simpson, “The effect of cue-enhancement on the intelligibility of nonsense word and sentence materials presented in noise,” Speech Communication, vol. 24, no. 3, pp. 211–226, 1998.

[16] D. Mauler and R. Martin, “A low delay, variable resolution, perfect reconstruction spectral analysis-synthesis system for speech enhancement,” in Proceedings of European Signal Processing Conference (EUSIPCO ’07), pp. 222–227, Poznan, Poland, September 2007.

[17] H. V. Sorensen, D. L. Jones, M. T. Heideman, and C. S. Burrus, “Real-valued fast Fourier transform algorithms,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 6, pp. 849–863, 1987.

[18] D. P. Skinner, “Pruning the decimation-in-time FFT algorithm,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 2, pp. 193–194, 1976.

[19] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment, John Wiley & Sons, New York, NY, USA, 1st edition, 2006.

[20] D. O’Shaughnessy, Speech Communications: Human and Machine, IEEE Press, Piscataway, NJ, USA, 2nd edition, 2000.

[21] J. S. Garofolo, L. F. Lamel, W. M. Fisher, et al., “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium, Philadelphia, Pa, USA, 1993.

[22] R. Martin, “Small microphone arrays with postfilters for noise and acoustic echo reduction,” in Microphone Arrays: Signal Processing Techniques and Applications, M. Brandstein and D. Ward, Eds., pp. 255–279, Springer, New York, NY, USA, 2001.

[23] V. Hamacher, J. Chalupper, J. Eggers, et al., “Signal processing in high-end hearing aids: state of the art, challenges, and future trends,” EURASIP Journal on Applied Signal Processing, vol. 2005, no. 18, pp. 2915–2929, 2005.

[24] O. Cappe, “Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 345–349, 1994.

[25] I.-Y. Soon and S. N. Koh, “Low distortion speech enhancement,” IEE Proceedings: Vision, Image and Signal Processing, vol. 147, no. 3, pp. 247–253, 2000.

[26] M. K. Hasan, S. Salahuddin, and M. R. Khan, “A modified a priori SNR for speech enhancement using spectral subtraction rules,” IEEE Signal Processing Letters, vol. 11, no. 4, pp. 450–453, 2004.

[27] C. Plapous, C. Marro, L. Mauuary, and P. Scalart, “A two-step noise reduction technique,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’04), vol. 1, pp. 289–292, Montreal, Canada, May 2004.

[28] ANSI S3.5-1997, “Methods for the calculation of the speech intelligibility index,” American National Standards Institute, New York, NY, USA, 1997.

[29] S. R. Quackenbush, T. P. Barnwell III, and M. A. Clements, Objective Measures of Speech Quality, Prentice-Hall, Englewood Cliffs, NJ, USA, 1998.

Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 530435, 14 pages
doi:10.1155/2009/530435

Research Article

Robust Distributed Noise Reduction in Hearing Aids with External Acoustic Sensor Nodes

Alexander Bertrand and Marc Moonen (EURASIP Member)

Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium

Correspondence should be addressed to Alexander Bertrand, [email protected]

Received 15 December 2008; Revised 17 June 2009; Accepted 24 August 2009

Recommended by Walter Kellermann

The benefit of using external acoustic sensor nodes for noise reduction in hearing aids is demonstrated in a simulated acoustic scenario with multiple sound sources. A distributed adaptive node-specific signal estimation (DANSE) algorithm, which has a reduced communication bandwidth and computational load, is evaluated. Batch-mode simulations compare the noise reduction performance of a centralized multi-channel Wiener filter (MWF) with DANSE. In the simulated scenario, DANSE is observed not to be able to achieve the same performance as its centralized MWF equivalent, although in theory both should generate the same set of filters. A modification to DANSE is proposed to increase its robustness, yielding a smaller discrepancy between the performance of DANSE and the centralized MWF. Furthermore, the influence of several parameters, such as the DFT size used for frequency-domain processing and possible delays in the communication link between nodes, is investigated.

Copyright © 2009 A. Bertrand and M. Moonen. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Noise reduction algorithms are crucial in hearing aids to improve speech understanding in background noise. For every increase of 1 dB in signal-to-noise ratio (SNR), speech understanding increases by roughly 10% [1]. By using an array of microphones, it is possible to exploit spatial characteristics of the acoustic scenario. However, in many classical beamforming applications, the acoustic field is sampled only locally because the microphones are placed close to each other. The noise reduction performance can often be increased when extra microphones are used at significantly different positions in the acoustic field. For example, an exchange of microphone signals between a pair of hearing aids in a binaural configuration, that is, one at each ear, can significantly improve the noise reduction performance [2–11]. The distribution of extra acoustic sensor nodes in the acoustic environment, each having a signal processing unit and a wireless link, allows further performance improvement. For instance, small sensor nodes can be incorporated into clothing, or placed strategically either close to desired sources to obtain high SNR signals, or close to noise sources to collect noise references. In a scenario with multiple hearing aid users, the different hearing aids can exchange signals to improve their performance through cooperation.

The setup envisaged here requires a wireless link between the hearing aid and the supporting external acoustic sensor nodes. A distributed approach using compressed signals is needed, since collecting and processing all available microphone signals at the hearing aid itself would require a large communication bandwidth and computational power. Furthermore, since the positions of the external nodes are unknown, the algorithm should be adaptive and able to cope with unknown microphone positions. Therefore, a multi-channel Wiener filter (MWF) approach is considered, since an MWF estimates the clean speech signal without relying on prior knowledge on the microphone positions [12]. In [13, 14], a distributed adaptive node-specific signal estimation (DANSE) algorithm is introduced for linear MMSE signal

estimation in a sensor network, which significantly reduces the communication bandwidth while still obtaining the optimal linear estimators, that is, the Wiener filters, as if each node has access to all signals in the network. The term “node-specific” refers to the scenario in which each node acts as a data sink and estimates a different desired signal. This situation is particularly interesting in the context of noise reduction in binaural hearing aids, where the two hearing aids estimate differently filtered versions of the same desired speech source signal, which is indeed important to preserve the auditory cues for directional hearing [15–18]. In [19], a pruned version of the DANSE algorithm, referred to as distributed multichannel Wiener filtering (db-MWF), has been used for binaural noise reduction. In the case of a single desired source signal, it was proven that db-MWF converges to the optimal all-microphone Wiener filter settings in both hearing aids. The more general DANSE algorithm allows the incorporation of multiple desired sources and more than two nodes. Furthermore, it allows for uncoordinated updating where each node decides independently in which iteration steps it updates its parameters, possibly simultaneously with other nodes [20]. This in particular avoids the need for a network-wide protocol that coordinates the updates between nodes.

In this paper, batch-mode simulation results are described to demonstrate the benefit of using additional external sensor nodes for noise reduction in hearing aids. Furthermore, the DANSE algorithm is reformulated in a noise reduction context, and a batch-mode analysis of the noise reduction performance of DANSE is provided. The results are compared to those obtained with the centralized MWF algorithm that has access to all signals in the network to compute the optimal Wiener filters. Although in theory the DANSE algorithm converges to the same filters as the centralized MWF algorithm, this is not the case in the simulated scenario. The resulting decrease in performance is explained and a modified algorithm is then proposed to increase robustness and to allow the algorithm to converge to the same filters as in the centralized MWF algorithm. Furthermore, the effectiveness of relaxation is shown when nodes update their filters simultaneously, as well as the influence of several parameters such as the DFT size used for frequency-domain processing, and possible delays within the communication link. The simulations in this paper show the potential of DANSE for noise reduction, as suggested in [13, 14], and provide a proof-of-concept for applying the algorithm in cooperative acoustic sensor networks for distributed noise reduction applications, such as hearing aids.

The outline of this paper is as follows. In Section 2, the data model is introduced and the multi-channel Wiener filtering process is reviewed. In Section 3, a description of the simulated acoustic scenario is provided. Moreover, an analysis of the benefits achieved using external acoustic sensor nodes is given. In Section 4, the DANSE algorithm is reviewed in the context of noise reduction. A modification to DANSE increasing robustness is introduced in Section 5. Batch-mode simulation results are given in Section 6. Since some practical aspects are disregarded in the simulations, some remarks and open problems concerning a practical implementation of the algorithm are given in Section 7.

2. Data Model and Multichannel Wiener Filtering

2.1. Data Model and Notation. A general fully connected broadcasting sensor network with J nodes is considered, in which each node k has direct access to a specific set of M_k microphones, with M = Σ_{k=1}^{J} M_k (see Figure 1). Nodes can be either a hearing aid or a supporting external acoustic sensor node. Each microphone signal m of node k can be described in the frequency domain as

y_{km}(ω) = x_{km}(ω) + v_{km}(ω),   m = 1, . . . , M_k,   (1)

where x_{km}(ω) is a desired speech component and v_{km}(ω) an undesired noise component. Although x_{km}(ω) is referred to as the desired speech component, v_{km}(ω) is not necessarily nonspeech, that is, undesired speech sources may be included in v_{km}(ω). All subsequent algorithms will be implemented in the frequency domain, where (1) is approximated based on finite-length time-to-frequency domain transformations. For conciseness, the frequency-domain variable ω will be omitted. All signals y_{km} of node k are stacked in an M_k-dimensional vector y_k, and all vectors y_k are stacked in an M-dimensional vector y. The vectors x_k, v_k and x, v are similarly constructed. The network-wide data model can now be written as y = x + v. Notice that the desired speech component x may consist of multiple desired source signals, for example when a hearing aid user is listening to a conversation between multiple speakers, possibly talking simultaneously. If there are Q desired speech sources, then

x = As, (2)

where A is an M × Q-dimensional steering matrix and s a Q-dimensional vector containing the Q desired sources. Matrix A contains the acoustic transfer functions (evaluated at frequency ω) from each of the speech sources to all microphones, incorporating room acoustics and microphone characteristics.
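
As a minimal illustration of the data model (1)-(2) for a single frequency bin, the following Python/NumPy sketch stacks per-node microphone signals into the network-wide vectors; the steering matrix, source, and noise values are random placeholders, and the node sizes follow the scenario of Section 3 (one six-microphone array and three three-microphone hearing aids).

import numpy as np

rng = np.random.default_rng(0)
M_k = [6, 3, 3, 3]                      # microphones per node (node 1 = external array)
M, Q = sum(M_k), 1                      # total microphone count and number of desired sources

A = rng.standard_normal((M, Q)) + 1j * rng.standard_normal((M, Q))   # steering matrix at one bin
s = rng.standard_normal(Q) + 1j * rng.standard_normal(Q)             # desired source signal(s)
v = 0.1 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))     # noise components

x = A @ s                               # desired speech components x = A s
y = x + v                               # network-wide data model y = x + v

offsets = np.cumsum([0] + M_k)          # split the stacked vector back into per-node vectors y_k
y_nodes = [y[offsets[k]:offsets[k + 1]] for k in range(len(M_k))]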

2.2. Centralized Multichannel Wiener Filtering. The goal of each node k is to estimate the desired speech component x_{km} in its mth microphone, selected to be the reference microphone. Without loss of generality, it is assumed that the reference microphone always corresponds to m = 1. For the time being, it is assumed that each node has access to all microphone signals in the network. Node k then performs a filter-and-sum operation on the microphone signals with filter coefficients w_k that minimize the following MSE cost function:

J_k(w_k) = E{ | x_{k1} − w_k^H y |^2 },   (3)

where E{·} denotes the expected value operator, and where the superscript H denotes the conjugate transpose operator.

Figure 1: Data model for a sensor network with J sensor nodes, in which node k collects M_k noisy observations of the Q source signals in s.

Notice that at each node k, one such MSE problem is to be solved for each frequency bin. The minimum of (3) corresponds to the well-known Wiener filter solution:

w_k = R_{yy}^{−1} R_{yx} e_{k1},   (4)

with R_{yy} = E{y y^H}, R_{yx} = E{y x^H}, and e_{k1} being an M-dimensional vector with only one entry equal to 1 and all other entries equal to 0, which selects the column of R_{yx} corresponding to the reference microphone of node k. This procedure is referred to as multi-channel Wiener filtering (MWF). If the desired speech sources are uncorrelated with the noise, then R_{yx} = R_{xx} = E{x x^H}. In the remainder of this paper, it is implicitly assumed that all Q desired sources may be active at the same time, yielding a rank-Q speech correlation matrix R_{xx}. In practice, R_{xx} is unknown, but can be estimated from

R_{xx} = R_{yy} − R_{vv}   (5)

with R_{vv} = E{v v^H}. The noise correlation matrix R_{vv} can be (re-)estimated during noise-only periods and R_{yy} can be (re-)estimated during speech-and-noise periods, requiring a voice activity detection (VAD) mechanism. Even when the noise sources and the speech source are not stationary, these practical estimators are found to yield good noise reduction performance [15, 19].
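
The per-bin computation of (4)-(5) can be sketched in Python/NumPy as follows. This is only a batch-mode illustration under the stated VAD assumption: STFT frames of all microphones and a boolean speech-activity flag per frame are assumed to be given, and all names are illustrative. In a complete system this computation is repeated independently for every frequency bin.

import numpy as np

def mwf_per_bin(Y, vad, ref=0):
    # Y: (n_frames, M) complex STFT coefficients of all M microphones at one frequency bin
    # vad: (n_frames,) boolean, True during speech-and-noise frames
    Ryy = Y[vad].T @ Y[vad].conj() / max(vad.sum(), 1)        # speech-and-noise correlation E{y y^H}
    Rvv = Y[~vad].T @ Y[~vad].conj() / max((~vad).sum(), 1)   # noise-only correlation E{v v^H}
    Rxx = Ryy - Rvv                                           # speech correlation estimate, cf. (5)
    e_ref = np.zeros(Y.shape[1]); e_ref[ref] = 1.0            # selects the reference-microphone column
    return np.linalg.solve(Ryy, Rxx @ e_ref)                  # Wiener filter w_k, cf. (4)

def filter_and_sum(Y, w):
    return Y @ w.conj()                                       # output w^H y for every frame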

3. Simulation Scenario and the Benefit of External Acoustic Sensor Nodes

The performance of microphone array based noise reduction typically increases with the number of microphones. However, the number of microphones that can be placed on a hearing aid is limited, and the acoustic field is only sampled locally, that is, at the hearing aid itself. Therefore, there is often a large distance between the location of the desired source and the microphone array, which results in signals with low SNR. In fact, the SNR decreases by 6 dB for every doubling of the distance between a source and a microphone. The noise reduction performance can therefore be greatly increased by using supporting external acoustic sensor nodes that are connected to the hearing aid through a wireless link.

To assess the potential improvement that can be obtained by adding external sensor nodes, a multi-source scenario is simulated using the image method [21]. Figure 2 shows a schematic illustration of the scenario. The room is cubical (5 m × 5 m × 5 m) with a reflection coefficient of 0.4 at the floor, the ceiling, and at every wall. According to Sabine's formula this corresponds to a reverberation time of T60 = 0.222 s. There are two hearing aid users listening to speaker C, who produces a desired speech signal. One hearing aid user has 2 hearing aids (nodes 2 and 3) and the other has one hearing aid at the right ear (node 4). All hearing aids have three omnidirectional microphones with a spacing of 1 cm. Head shadow effects are not taken into account. Node 1 is an external microphone array containing six omnidirectional microphones placed 2 cm from each other. Speakers A and B both produce speech signals interfering with speaker C. All speech signals are sentences from the HINT (Hearing in Noise Test) database [22]. The upper left loudspeaker produces multi-talker babble noise (Auditec) with a power normalized to obtain an input broadband SNR of 0 dB in the first microphone of node 4, which is used as the reference node. In addition to the localized noise sources, all microphone signals have an uncorrelated noise component which consists of white noise with power that is 10% of the power of the desired signal in the first microphone of node 4. All nodes and all sound sources are in the same horizontal plane, 2 m above ground level.
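
As a quick sanity check of the quoted reverberation time, Sabine's formula can be evaluated for this room; the sketch below assumes the absorption coefficient is taken as one minus the reflection coefficient, which reproduces the quoted value up to rounding.

# Sabine's formula: T60 = 0.161 * V / (alpha * S), with alpha assumed to be 1 - reflection coefficient
V = 5.0 * 5.0 * 5.0            # room volume in m^3
S = 6 * 5.0 * 5.0              # total surface area of the cubical room in m^2
alpha = 1.0 - 0.4              # absorption coefficient derived from the reflection coefficient 0.4
T60 = 0.161 * V / (alpha * S)  # approximately 0.224 s, close to the quoted T60 = 0.222 s
print(round(T60, 3))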

Notice that this is a difficult scenario, with many sources and highly non-stationary (speech) noise. This kind of scenario brings many practical issues, especially with respect to reliable VAD decisions (cf. Section 7). Throughout this paper, many of these practical aspects are disregarded. The aim here is to demonstrate the benefit that can be achieved

Figure 2: The acoustic scenario used in the simulations throughout this paper. Two persons with hearing aids are listening to speaker C. The other sources produce interference noise.

with external sensor nodes, in particular in multi-source scenarios. Furthermore, the theoretical performance of the DANSE algorithm, introduced in Section 4, will be assessed with respect to the centralized MWF algorithm. To isolate the effects of VAD errors and estimation errors on the correlation matrices, all experiments are performed in batch mode with ideal VADs.

Two performance measures are used to assess the quality of the noise reduction algorithms, namely the broadband signal-to-noise ratio (SNR) and the signal-to-distortion ratio (SDR). The SNR and SDR at node k are defined as

SNR = 10 log_{10} ( E{x_k[t]^2} / E{n_k[t]^2} ),   (6)

SDR = 10 log_{10} ( E{x_{k1}[t]^2} / E{(x_{k1}[t] − x_k[t])^2} )   (7)

with n_k[t] and x_k[t] the time-domain noise component and the desired speech component, respectively, at the output at node k, and x_{k1}[t] the desired time-domain speech component in the reference microphone of node k.
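
A direct NumPy transcription of (6) and (7) might look as follows; it assumes that the time-domain speech and noise components at the output, as well as the reference speech component, are available separately, which is possible in the batch simulations considered here.

import numpy as np

def broadband_snr_db(x_out, n_out):
    # (6): ratio of output speech power to output noise power at node k
    return 10.0 * np.log10(np.mean(x_out ** 2) / np.mean(n_out ** 2))

def sdr_db(x_ref, x_out):
    # (7): reference speech power over the power of the speech distortion
    return 10.0 * np.log10(np.mean(x_ref ** 2) / np.mean((x_ref - x_out) ** 2))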

The sampling frequency is 32 kHz in all experiments. The frequency-domain noise reduction is based on DFTs with size equal to L = 512 if not specified otherwise. Notice that L is equivalent to the filter length of the time-domain filters that are implicitly applied to the microphone signals. The DFT size L = 512 is relatively large, which is due to the fact that microphones are far apart from each other, leading to higher time differences of arrival (TDOA) demanding longer filters to exploit spatial information. If the filter lengths are too short to allow a sufficient alignment between the signals, then the noise reduction performance degrades. This is evaluated in Section 6.4. To allow small DFT sizes, yet large distances between microphones, delay compensation should be introduced in the local microphone signals or the received signals at each node. However, since hearing aids typically have hard constraints on the processing delay to maintain lip synchronization, this delay compensation is restricted. This, in effect, introduces a trade-off between input-output delay and noise reduction performance.

Figure 3(a) shows the output SNR and SDR of the centralized MWF procedure at node 4 when five different subsets of microphones are used for the noise reduction:

(1) the microphone signals of node 4 itself;

(2) the microphone signals of node 1 in addition to the microphone signals of node 4 itself;

(3) the microphone signals of node 2 in addition to the microphone signals of node 4 itself;

(4) the first microphone signal at every node in addition to all microphone signals of node 4 itself; this is equivalent to a scenario where the network supporting node 4 consists of single-microphone nodes, that is, M_k = 1, for k = 1, . . . , 3;

(5) all microphone signals in the network.

The benefit of adding external microphones is very clear in this graph. It also shows that microphones with a significantly different position contribute more than microphones that are closely spaced. Indeed, Cases 3 and 4 both add three extra microphone signals, but the benefit is largest in Case 4, in which the additional microphones are set relatively far apart. However, using multi-microphone nodes (Case 5) still produces a significant benefit of about 25% (2 dB) in comparison to single-microphone nodes (Case 4). Notice that the benefit of placing external microphones, and the benefit of using multi-microphone nodes in comparison to single-microphone nodes, is of course very scenario specific. For instance, if the vertical position of node 1 is reduced by 0.5 m in Figure 2, then the difference between single-microphone nodes (Case 4) and multi-microphone nodes (Case 5) is more than 3 dB, as shown in Figure 3(b), which corresponds to an improvement of almost 50%.

4. The DANSE Algorithm

In Section 3, simulations showed that adding external microphones in addition to the microphones available in a hearing aid may yield a great benefit in terms of both noise suppression and speech distortion. Not surprisingly, adding external nodes with multiple microphones boosts the performance even more. However, the latter introduces a significant increase in communication bandwidth, depending on the number of microphones in each node. Furthermore, the dimensions of the correlation matrix to be inverted in formula (4) may grow significantly. However, if each node has its own signal processor unit, this extra communication bandwidth can be reduced and the computation can be distributed by using the distributed adaptive node-specific

Figure 3: Comparison of output SNR and SDR of MWF at node 4 for five different microphone subsets. (a) Scenario of Figure 2. (b) Scenario of Figure 2 with vertical position of node 1 reduced by 0.5 m.

signal estimation (DANSE) algorithm, as proposed in [13, 14]. The DANSE algorithm computes the optimal network-wide Wiener filter in a distributed, iterative fashion. In this section this algorithm is briefly reviewed and reformulated in a noise reduction context.

4.1. The DANSE_K Algorithm. In the DANSE_K algorithm, each node k estimates K different desired signals, corresponding to the desired speech components in K of its microphones (assuming that K ≤ M_k, ∀ k ∈ {1, . . . , J}). Without loss of generality, it is assumed that the first K microphones are selected, that is, the signal to be estimated is the K-channel signal x_k = [x_{k1} · · · x_{kK}]^T. The first entry in this vector corresponds to the reference microphone, whereas the other K − 1 entries should be viewed as auxiliary channels. They are required to fully capture the signal subspace spanned by the desired source signals. Indeed, if K is chosen equal to Q, the K channels of x_k define the same signal subspace as defined by the channels in s, that is,

x_k = A_k s,   (8)

where A_k denotes a K × K submatrix of the steering matrix A in formula (2). K being equal to Q is a requirement for DANSE_K to be equivalent to the centralized MWF solution (see Theorem 1). The case in which K ≠ Q is not considered here. For a more detailed discussion of why these auxiliary channels are introduced, we refer to [13].

Each node k estimates its desired signal x_k with respect to a corresponding MSE cost function

J_k(W_k) = E{ ‖ x_k − W_k^H y ‖^2 }   (9)

with W_k an M × K matrix, defining a multiple-input multiple-output (MIMO) filter. Notice that this corresponds to K independent estimation problems in which the same M-channel input signal y is used. Similarly to (3), the Wiener solution of (9) is given by

W_k = R_{yy}^{−1} R_{xx} E_k   (10)

with

E_k = [ I_K ; O_{(M−K)×K} ]   (11)

with I_K denoting the K × K identity matrix and O_{U×V} denoting an all-zero U × V matrix. The matrix E_k selects the first K columns of R_{xx}, corresponding to the K-channel signal x_k. The DANSE_K algorithm will compute (10) in an iterative, distributed fashion. Notice that only the first column of W_k is of actual interest, since this is the filter that estimates the desired speech component in the reference microphone. The auxiliary columns of W_k are by-products of the DANSE_K algorithm.

A partitioning of the matrix W_k is defined as W_k = [W_{k1}^T · · · W_{kJ}^T]^T, where W_{kq} denotes the M_k × K submatrix of W_k that is applied to y_q in (9). Since node k only has access to y_k, it can only apply the partial filter W_{kk}. The K-channel output signal of this filter, defined by z_k = W_{kk}^H y_k, is then broadcast to the other nodes. Another node q can filter this K-channel signal z_k that it receives from node k by a MIMO filter defined by the K × K matrix G_{qk}. This is illustrated in

Figure 4: The DANSE_K scheme with 3 nodes (J = 3). Each node k estimates the desired signal x_k using its own M_k-channel microphone signal, and 2 K-channel signals broadcast by the other two nodes.

Figure 4 for a three-node network (J = 3). Notice that the actual W_k that is applied by node k is now parametrized as

W_k = [ W_{11} G_{k1} ; W_{22} G_{k2} ; · · · ; W_{JJ} G_{kJ} ].   (12)

In what follows, the matrices G_{kk}, ∀ k ∈ {1, . . . , J}, are assumed to be K × K identity matrices I_K to minimize the degrees of freedom (they are omitted in Figure 4). Node k can only manipulate the parameters W_{kk} and G_{k1} · · · G_{kJ}. If (8) holds, it is shown in [13] that the solution space defined by the parametrization (12) contains the centralized solution W_k.

Notice that each node k broadcasts a K-channel signal z_k (here it is assumed without loss of generality that K ≤ M_k, ∀ k ∈ {1, . . . , J}; if this does not hold at a certain node k, this node will transmit its unfiltered microphone signals), which is the output of the M_k × K MIMO filter W_{kk}, acting both as a compressor and an estimator at the same time. The subscript K thus refers to the (maximum) number of channels of the broadcast signal. DANSE_K compresses the data to be sent by node k by a factor of max{M_k/K, 1}. Further compression is possible, since the channels of the broadcast signal z_k are highly correlated, but this is not taken into consideration throughout this paper.

The DANSE_K algorithm will iteratively update the elements at the right-hand side of (12) to optimally estimate the desired signals x_k, ∀ k ∈ {1, . . . , J}. To describe this updating procedure, the following notation is used. The matrix G_k = [G_{k1}^T · · · G_{kJ}^T]^T stacks all transformation matrices of node k. The matrix G_{k,−q} defines the matrix G_k in which G_{kq} is omitted. The K(J − 1)-channel signal z_{−k} is defined as z_{−k} = [z_1^T · · · z_{k−1}^T z_{k+1}^T · · · z_J^T]^T. In what follows, a superscript i refers to the value of the variable at iteration step i. Using this notation, the DANSE_K algorithm consists of the following iteration steps:

(1) Initialize:

i ← 0

k ← 1

∀ q ∈ {1, . . . , J}: W_{qq} ← W_{qq}^0, G_{q,−q} ← G_{q,−q}^0, G_{qq} ← I_K, where W_{qq}^0 and G_{q,−q}^0 are random matrices of appropriate dimension.

(2) Node k updates its local parameters W_{kk} and G_{k,−k} by solving a local estimation problem based on its own local microphone signals y_k together with the compressed signals z_q^i = W_{qq}^{iH} y_q that it receives from the other nodes q ≠ k, that is, it minimizes

J_k^i(W_{kk}, G_{k,−k}) = E{ ‖ x_k − [W_{kk}^H | G_{k,−k}^H] y_k^i ‖^2 },   (13)

where

y_k^i = [ y_k ; z_{−k}^i ].   (14)

Define x_k^i similarly to (14), but now only containing the desired speech components in the considered signals. The update performed by node k is then

[ W_{kk}^{i+1} ; G_{k,−k}^{i+1} ] = (R_{yy,k}^i)^{−1} R_{xx,k}^i E_k   (15)

with

E_k = [ I_K ; O_{(M_k−K+K(J−1))×K} ],   (16)

R_{yy,k}^i = E{ y_k^i y_k^{iH} },   (17)

R_{xx,k}^i = E{ x_k^i x_k^{iH} }.   (18)

The parameters of the other nodes do not change, that is,

∀ q ∈ {1, . . . , J} \ {k} : W_{qq}^{i+1} = W_{qq}^i,  G_{q,−q}^{i+1} = G_{q,−q}^i.   (19)

(3) W_{kk} ← W_{kk}^{i+1}, G_{k,−k} ← G_{k,−k}^{i+1}

k ← (k mod J) + 1

i ← i + 1

(4) Return to Step 2.

Notice that node k updates its parameters W_{kk} and G_{k,−k} according to a local multi-channel Wiener filtering problem with respect to its M_k + (J − 1)K input channels. This MWF

problem is solved in the same way as the MWF problem given in (3) or (9).
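
A minimal batch-mode sketch of this local update, for a single node and a single frequency bin, is given below in Python/NumPy. It assumes that the compressed signals received from the other nodes are already stacked next to the local microphone signals, and that the speech correlation matrix of (18) is estimated via the subtraction (5) with an ideal VAD; all names are illustrative and this is not the authors' implementation.

import numpy as np

def danse_local_update(Yk_ext, vad, Mk, K):
    # Yk_ext: (n_frames, Mk + K*(J-1)) complex array: local microphone signals followed by
    #         the compressed signals z_q^i received from the other nodes (cf. (14))
    # vad:    (n_frames,) boolean speech-activity flags
    Ryy = Yk_ext[vad].T @ Yk_ext[vad].conj() / max(vad.sum(), 1)       # (17)
    Rvv = Yk_ext[~vad].T @ Yk_ext[~vad].conj() / max((~vad).sum(), 1)
    Rxx = Ryy - Rvv                                                    # estimate of (18) via (5)
    Ek = np.zeros((Yk_ext.shape[1], K)); Ek[:K, :K] = np.eye(K)        # selection matrix (16)
    W = np.linalg.solve(Ryy, Rxx @ Ek)                                 # stacked update (15)
    return W[:Mk, :], W[Mk:, :]                                        # W_kk and G_{k,-k}

The node would then broadcast the K-channel compressed signal z_k = W_{kk}^H y_k, computed per frame as Y_local @ Wkk.conj().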

Theorem 1. Assume that K = Q. If x_k = A_k s, ∀ k ∈ {1, . . . , J}, with A_k a full-rank K × K matrix, then the DANSE_K algorithm converges for any k to the optimal filters (10) for any initialization of the parameters.

Proof. See [13].

Notice that DANSE_K theoretically provides the same output as the centralized MWF algorithm if K = Q. The requirement that x_k = A_k s, ∀ k ∈ {1, . . . , J}, is satisfied because of (2). However, notice that the data model (2) is only approximately fulfilled in practice due to a finite-length DFT size. Consequently, the rank of the speech correlation matrix R_{xx} is not Q, but it has Q dominant eigenvalues instead. Therefore, the theoretical claims of convergence and optimality of DANSE_K, with K = Q, are only approximately true in practice due to frequency-domain processing.

4.2. Simultaneous Updating. The DANSE_K algorithm as described in Section 4.1 performs sequential updating in a round-robin fashion, that is, nodes update their parameters one at a time. In [20], it is observed that convergence of DANSE is no longer guaranteed when nodes update simultaneously, or in an uncoordinated fashion where each node decides independently in which iteration steps it updates its parameters. This is however an interesting case, since a simultaneous updating procedure allows for parallel computation, and uncoordinated updating removes the need for a network-wide protocol that coordinates the updates between nodes.

Let W = [W_{11}^T W_{22}^T · · · W_{JJ}^T]^T, and let F(W) be the function that defines the simultaneous DANSE_K update of all parameters in W, that is, F applies (15) ∀ k ∈ {1, . . . , J} simultaneously. Experiments in [20] show that the update W^{i+1} = F(W^i) may lead to limit cycle behavior. To avoid these limit cycles, the following relaxed version of DANSE is suggested in [20]:

W^{i+1} = (1 − α^i) W^i + α^i F(W^i)   (20)

with stepsizes α^i satisfying

α^i ∈ (0, 1],   (21)

lim_{i→∞} α^i = 0,   (22)

Σ_{i=0}^{∞} α^i = ∞.   (23)

The suggested conditions on the stepsize α^i are however quite conservative and may result in slow convergence. In most cases, the simultaneous update procedure already converges when a constant, sufficiently small value is chosen for α^i, ∀ i ∈ N. In all simulations performed for the scenario in Section 3, a value of α^i = 0.5, ∀ i ∈ N was found to eliminate limit cycles in every setup.
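
The relaxed update (20) with a constant step size amounts to a single interpolation per iteration; a minimal sketch, with placeholder matrices standing in for the stacked DANSE parameters:

import numpy as np

def relaxed_update(W_old, W_new, alpha=0.5):
    # Relaxed simultaneous update (20): convex combination of W^i and F(W^i)
    return (1.0 - alpha) * W_old + alpha * W_new

rng = np.random.default_rng(1)
W_i = rng.standard_normal((12, 2))       # current stacked parameters W^i (placeholder values)
F_W_i = rng.standard_normal((12, 2))     # result of the simultaneous update F(W^i) (placeholder)
W_next = relaxed_update(W_i, F_W_i)      # W^{i+1}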

5. Robust DANSE

5.1. Robustness Issues in DANSE. In Section 6, simulation results will show that the DANSE algorithm does not achieve the optimal noise reduction performance as predicted by Theorem 1. There are two important reasons for this suboptimal performance.

The first reason is the fact that the DANSE_K algorithm assumes that the signal space spanned by the channels of x_k is well-conditioned, ∀ k ∈ {1, . . . , J}. This assumption is reflected in Theorem 1 by the condition that A_k be full rank for all k. Although this is mostly satisfied in practice, the A_k's are often ill-conditioned. For instance, the distance between microphones in a single node is mostly small, yielding a steering matrix with several columns that are almost identical, that is, an ill-conditioned matrix A_k in the formulation of Theorem 1.

The second reason is related to low SNR nodes. The microphones of nodes that are close to a noise source typically collect low SNR signals. Despite the low SNR, these signals can boost the performance of the MWF algorithm, since they can act as noise references to cancel out noise in the signals recorded by other nodes. However, the DANSE algorithm cannot fully exploit this since the local estimation problem at such low SNR nodes is ill-conditioned. If node k has low SNR microphone signals y_k, the correlation matrix R_{xx,k} = E{x_k x_k^H} has large estimation errors, since the corresponding noise correlation matrix R_{vv,k} and the speech-plus-noise correlation matrix R_{yy,k} are very similar, that is, R_{vv,k} ≈ R_{yy,k}. Notice that R_{xx,k} is a submatrix of the extended speech correlation matrix defined in (18), which is used in the DANSE_K algorithm. From another point of view, this also relates to an ill-conditioned steering matrix A, since the submatrix A_k is close to an all-zero matrix compared to the submatrices corresponding to nodes with higher SNR signals.

5.2. Robust DANSE (R-DANSE). In this section, a modification to the DANSE algorithm is proposed to achieve a better noise reduction performance in the case of low SNR nodes or ill-conditioned steering matrices. The main idea is to replace an ill-conditioned A_k matrix by a better conditioned matrix by changing the estimation problem at node k. The new algorithm is referred to as “robust DANSE” or R-DANSE. In what follows, the notation v(p) is used to denote the p-th entry in a vector v, and m(p) is used to denote the p-th column in the matrix M.

For each node k, the channels in x_k that cause ill-conditioned steering matrices, or that correspond to low SNR signals, are discarded and replaced by the desired speech components in the signal(s) z_q^i received from other (high SNR) nodes q ≠ k, that is,

x_k^i(p) = w_{qq}^i(l)^H x_q,   q ∈ {1, . . . , J} \ {k},  l ∈ {1, . . . , K},   (24)

if x_{kp} causes an ill-conditioned steering matrix or if x_{kp} corresponds to a low SNR microphone, and

x_k^i(p) = x_{kp}   (25)

otherwise. Notice that the desired signal x_k^i may now change at every iteration, which is reflected by the superscript i denoting the iteration index.

To decide whether to use (24) or (25), the condition number of the matrix A_k does not necessarily have to be known. In principle, it is always better to replace the K − 1 auxiliary channels in x_k as in formula (24), where a different q should be chosen for every p. Indeed, since microphones of different nodes are typically far apart from each other, better conditioned steering matrices are then obtained. Also, since the correlation matrix R_{xx,k} is better estimated when high SNR signals are available, the chosen q's preferably correspond to high SNR nodes. Therefore, the decision procedure requires knowledge of the SNR at the different nodes. For a low SNR node k, one can also replace all K channels in x_k as in (24), including the reference microphone. In this case, there is no estimation of the speech component that is collected by the microphones of node k itself. However, since the network-wide problem is now better conditioned, the other nodes in the network will benefit from this.

The R-DANSE_K algorithm performs the same steps as explained in Section 4.1 for the DANSE_K algorithm, but now x_k^i replaces x_k in (13)–(18). This means that in R-DANSE, the E_k matrix in (16) may now contain ones at row indices that are higher than M_k. To guarantee convergence of R-DANSE, the placement of ones in (16), or equivalently the choices for q and l in (24), is not completely free, as explained in the next section.

5.3. Convergence of R-DANSE. To provide convergence results, the dependencies of each individual estimation problem are described by means of a directed graph G with KJ vertices, where each vertex corresponds to one of the locally computed filters, that is, a specific column of W_{kk} for k = 1 · · · J. (Readers that are not familiar with the jargon of graph theory might want to consult [23], although in principle no prior knowledge on graph theory is assumed.) The graph contains an arc from filter a to b, described by the ordered pair (a, b), if the output of filter b contains the desired speech component that is estimated by filter a. For example, formula (24) defines the arc (w_{kk}(p), w_{qq}(l)). A vertex v that has no departing arc is referred to as a direct estimation filter (DEF), that is, the signal to be estimated is the desired speech component in one of the node's own microphone signals, as in formula (25).

To illustrate this, a possible graph is shown in Figure 5 for DANSE_2 applied to the scenario described in Section 3, where the hearing aid users are now listening to two speakers, that is, speakers B and C. Since the microphone signals of node 1 have a low SNR, the two desired signals in x_1 that are used in the computation of W_{11} are replaced by the filtered desired speech component in the received signals from higher SNR nodes 2 and 4, that is, w_{22}(1)^H x_2 and w_{44}(1)^H x_4, respectively. This corresponds to the arcs (w_{11}(1), w_{22}(1)) and (w_{11}(2), w_{44}(1)). To calculate w_{22}(1), w_{33}(1), and w_{44}(1), the desired speech components x_{21}, x_{31} and x_{41} in the respective reference microphones are used.

Figure 5: Possible graph describing the dependencies of the estimation problems for DANSE_2 applied to the acoustic scenario described in Section 3.

These filters are DEFs, and are shaded in Figure 5. The microphones at node 2 are very close to each other. Therefore, to avoid an ill-conditioned matrix A_2 at node 2, the signals to be estimated by w_{22}(2) should be provided by another node, and not by another microphone signal of node 2 itself. Therefore, the arc (w_{22}(2), w_{44}(1)) is added. For similar reasons, the arcs (w_{33}(2), w_{44}(1)) and (w_{44}(2), w_{22}(1)) are also added.
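
Since the convergence result below requires the dependency graph to be acyclic, a construction such as the one in Figure 5 can be verified with a few lines. The sketch lists exactly the arcs named above and applies a standard depth-first-search cycle test; the dictionary representation is, of course, only illustrative.

# Arcs of the Figure 5 graph: (a, b) means filter a estimates the output of filter b;
# vertices without outgoing arcs are direct estimation filters (DEFs).
arcs = {
    "w11(1)": ["w22(1)"], "w11(2)": ["w44(1)"],
    "w22(2)": ["w44(1)"], "w33(2)": ["w44(1)"], "w44(2)": ["w22(1)"],
    "w22(1)": [], "w33(1)": [], "w44(1)": [],   # DEFs
}

def is_acyclic(graph):
    # Depth-first search with three-color marking; an arc back to a "grey" vertex indicates a cycle.
    color = {v: 0 for v in graph}           # 0 = unvisited, 1 = on the DFS stack, 2 = finished
    def visit(v):
        color[v] = 1
        for w in graph.get(v, []):
            if color.get(w, 0) == 1 or (color.get(w, 0) == 0 and not visit(w)):
                return False
        color[v] = 2
        return True
    return all(visit(v) for v in graph if color[v] == 0)

print(is_acyclic(arcs))                     # True: the graph of Figure 5 is acyclic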

Theorem 2. Let all assumptions of Theorem 1 be satisfied. Let G be the directed graph describing the dependencies of the estimation problems in the R-DANSE_K algorithm as described above. If G is acyclic, then the R-DANSE_K algorithm converges to the optimal filters to estimate the desired signals defined by G.

Proof. The proof of Theorem 1 in [13] on convergence of DANSE_K is based on the assumption that the desired K-channel signals x_k, ∀ k ∈ {1, . . . , J}, are all in the same K-dimensional signal subspace spanned by the K sources in s, that is,

x_k = A_k s.   (26)

This assumption remains valid in R-DANSE_K. Indeed, since x_q contains M_q linear combinations of the Q sources in s, the signal x_k^i(p) given by (24) is again a linear combination of the source signals. However, the coefficients of this linear combination may change at every iteration as the signal x_k^i(p) is an output of the adaptive filter w_{qq}^i(l) in another node q. This then leads to a modified version of Theorem 1 for DANSE_K in which the matrix A_k in (26) is not fixed, but may change at every iteration, that is,

x_k^i = A_k^i s.   (27)

Define

W_{kq}^i = arg min_{W_{kq}} ( min_{G_{k,−q}} E{ ‖ x_k − [W_{kq}^H | G_{k,−q}^H] y_q^i ‖^2 } ).   (28)

This corresponds to the hypothetical case in which node k would optimise W_{kq}^i directly, without the constraint W_{kq}^i = W_{qq}^i G_{kq}^i where node k depends on the parameter choice of node q.

In [13] it is proven that for DANSE_K, under the assumptions of Theorem 1, the following holds:

∀ q, k ∈ {1, . . . , J} : W_{kq}^i = W_{qq}^i A_{kq}   (29)

with A_{kq} = A_q^{−H} A_k^H. This means that the columns of W_{qq}^i span a K-dimensional subspace that also contains the columns of W_{kq}^i, which is the optimal update with respect to the cost function J_k^i of node k, as if there were no constraints on W_{kq}^i. Or in other words, an update by node q automatically optimizes the cost function of any other node k with respect to W_{kq}, if node k performs a responding optimization of G_{kq}, yielding G_{kq}^{opt} = A_{kq}. Therefore, the following expression holds:

∀ k ∈ {1, . . . , J}, ∀ i ∈ N : min_{G_{k,−k}} J_k^{i+1}(W_{kk}^{i+1}, G_{k,−k}) ≤ min_{G_{k,−k}} J_k^i(W_{kk}^i, G_{k,−k}).   (30)

Notice that this holds at every iteration for every node. In the case of R-DANSE_K, the A_{kq} matrix of expression (29) changes at every iteration. At first sight, expression (30) remains valid, since changes in the matrix A_{kq} are compensated by the minimization over G_{kq} in (30). However, this is not true since the desired signals x_k^i also change at every iteration, and therefore the cost functions at different iterations cannot be compared.

Expression (30) can be partitioned into K sub-expressions:

∀ p ∈ {1, . . . , K}, ∀ k ∈ {1, . . . , J}, ∀ i ∈ N :   (31)

min_{g_{k,−k}(p)} J_{kp}^{i+1}(w_{kk}^{i+1}(p), g_{k,−k}(p)) ≤ min_{g_{k,−k}(p)} J_{kp}^i(w_{kk}^i(p), g_{k,−k}(p))   (32)

with

J_{kp}^i(w_{kk}, g_{k,−k}) = E{ | x_k(p) − [w_{kk}^H | g_{k,−k}^H] y_k^i |^2 }.   (33)

For the R-DANSE_K case, (33) remains the same, except that x_k(p) has to be replaced with x_k^i(p). As explained above, due to this modification, expression (32) does not hold anymore. However, it does hold for the cost functions J_{kp}^i corresponding to a DEF w_{kk}(p), that is, a filter for which the desired signal is directly obtained from one of the microphone signals of node k. Indeed, every DEF w_{kk}(p) has a well-defined cost function J_{kp}^i, since the signal x_k^i(p) is fixed over different iteration steps. Because J_{kp}^i has a lower bound, (32) shows that the sequence {min_{g_{k,−k}(p)} J_{kp}^i}_{i∈N} converges. The convergence of this sequence implies convergence of the sequence {w_{kk}^i(p)}_{i∈N}, as shown in [13].

After convergence of all w_{kk}(p) parameters corresponding to a DEF, all vertices in the graph G that are directly connected to this DEF have a stable desired signal, and their corresponding cost functions become well-defined. The above argument shows that these filters then also converge.

Continuing this line of thought, the convergence properties of the DEFs will diffuse through the graph. Since the graph is acyclic, all vertices converge. Convergence of all W_{kk} parameters for k = 1 · · · J automatically yields convergence of all G_k parameters, and therefore convergence of all W_k filters for k = 1 · · · J. Optimality of the resulting filters can be proven using the same arguments as in the optimality proof of Theorem 1 for DANSE_K in [13].

6. Performance of DANSE and R-DANSE

In this section, the batch-mode performance of DANSE and R-DANSE is compared for the acoustic scenario of Section 3. In this batch version of the algorithms, all iterations of DANSE and R-DANSE are on the full signal length of about 20 seconds. In real-life applications, however, iterations will of course be spread over time, that is, subsequent iterations are performed on different signal segments. To isolate the influence of VAD errors, an ideal VAD is used in all experiments. Correlation matrices are estimated by time averaging over the complete length of the signal. The sampling frequency is 32 kHz and the DFT size is equal to L = 512 if not specified otherwise.

6.1. Experimental Validation of DANSE and R-DANSE. Three different measures are used to assess the quality of the outputs at the hearing aids: the signal-to-noise ratio (6), the signal-to-distortion ratio (7), and the mean squared error (MSE) between the coefficients of the centralized multichannel Wiener filter w_k and the filter obtained by the DANSE algorithm, that is,

MSE = (1/L) Σ ‖ w_k − w_k(1) ‖^2   (34)

where the summation is performed over all DFT bins, with L the DFT size, w_k defined by (4), and w_k(1) denoting the first column of W_k in (12), that is, the filter that estimates the speech component x_{k1} in the reference microphone at node k.
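
Measure (34) is straightforward to evaluate once both filters are available for every DFT bin; a short sketch, assuming w_central and w_danse are (L, M) arrays holding one M-dimensional filter per bin:

import numpy as np

def filter_mse(w_central, w_danse):
    # (34): squared filter-coefficient error, summed over microphones, averaged over the L DFT bins
    return np.mean(np.sum(np.abs(w_central - w_danse) ** 2, axis=1))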

Two different scenarios are tested. In scenario 1 the dimension Q of the desired signal space is Q = 1, that is, both hearing aid users are listening to speaker C, whereas speakers A and B and the babble-noise loudspeaker are considered to be background noise. In Figure 6, the three quality measures are plotted (for node 4) versus the iteration index for DANSE_1 and R-DANSE_1, with either sequential updating or simultaneous updating (without relaxation). Also an upper bound is plotted, which corresponds to the centralized MWF solution defined in (4). The R-DANSE_1

Figure 6: Scenario 1: SNR, SDR, and MSE on filter coefficients versus iterations for DANSE_1 and R-DANSE_1 at node 4, for both sequential and simultaneous updates. Speaker C is the only target speaker.

graph consists of only DEF nodes, except for w_{11}, which has an arc (w_{11}, w_{44}) to avoid performance loss due to low SNR. Since there is only one desired source, DANSE_1 theoretically should converge to the upper bound performance, but this is not the case. The R-DANSE_1 algorithm performs better than the DANSE_1 algorithm, yielding an SNR increase of 1.5 to 2 dB, which is an increase of about 20% to 25%. The same holds for the other two hearing aids, that is, nodes 2 and 3, which are not shown here. The parallel update typically converges faster but it converges to a suboptimal limit cycle, since no relaxation is used. Although this limit cycle is not very clear in these plots, a loss in SNR of roughly 1 dB is observed in every hearing aid. This can be avoided by using relaxation, which will be illustrated in Section 6.2.

In scenario 2, the case in which Q = 2 is considered, that is, there are two desired sources: both hearing aid users are listening to speakers B and C, who talk simultaneously, yielding a speech correlation matrix R_{xx} of approximately rank 2. The R-DANSE_2 graph is illustrated in Figure 5. For this 2-speaker case, both DANSE_1 and DANSE_2 are evaluated, where the latter should theoretically converge to the upper bound performance. The results for node 4 are plotted in Figure 7. While the MSE is lower for DANSE_2 compared to DANSE_1, it is observed that DANSE_2 does not reach the optimal noise reduction performance. R-DANSE_2

Figure 7: Scenario 2: SNR, SDR, and MSE on filter coefficients versus iterations for DANSE_1, R-DANSE_1, DANSE_2, and R-DANSE_2 at node 4. Speakers B and C are target speakers.

is however able to reach the upper bound performance at every hearing aid. The SNR improvement of R-DANSE_2 in comparison with DANSE_2 is between 2 and 3 dB at every hearing aid, which is again an increase of about 20% to 25%. Notice that R-DANSE_2 even slightly outperforms the centralized algorithm. This may be because R-DANSE_2 performs its matrix inversions on correlation matrices with smaller dimensions than the all-microphone correlation matrix R_{yy} in the centralized algorithm, which is more favorable in a numerical sense.

6.2. Simultaneous Updating with Relaxation. Simulations on different acoustic scenarios show that in most cases, DANSE_K with simultaneous updating results in a limit cycle oscillation. The occurrence of limit cycles appears to depend on the position of the nodes and sound sources, the reverberation time, as well as on the DFT size, but no clear rule was found to predict the occurrence of a limit cycle.

To illustrate the effect of relaxation, the simulation results of R-DANSE_1 in the scenario of Section 3 are given in Figure 8(a), where now the DFT size is L = 1024, which results in clearly visible limit cycle oscillations when no relaxation is used. This causes an overall loss in SNR of 2 or 3 dB at every hearing aid.

Figure 8(b) shows the same experiment where relaxation is used as in formula (20) with α^i = 0.5, ∀ i ∈ N.

Figure 8: SNR and SDR for R-DANSE_1 versus iterations at node 4 with sequential and simultaneous updating. (a) Without relaxation. (b) With relaxation (α^i = 0.5, ∀ i ∈ N).

In this case, the limit cycle does not appear and the simultaneous updating algorithm indeed converges to the same values as the sequential updating algorithm. Notice that the simultaneous updating algorithm converges faster than the sequential updating algorithm.

6.3. DFT Size. In Figure 9, the SNR and SDR of the output signal of R-DANSE_1 at nodes 3 and 4 are plotted as a function of the DFT size L, which is equivalent to the length of the time-domain filters that are implicitly applied to the signals at the nodes. 28 iterations were performed with sequential updating for L = 256, L = 512, L = 1024, and L = 2048. The outputs of the centralized version and of the scenario in which nodes do not share any signals are also given as a reference.

As expected, the performance increases with increasing DFT size. However, the discrepancy between the centralized algorithm and R-DANSE_1 grows for increasing DFT size. One reason for this observation is that, for large DFT sizes, R-DANSE often converges slowly once the filters at all nodes are close to the optimal filters.

The scenario with isolated nodes is less sensitive to the DFT size. This is because the tested DFT sizes are quite large, yielding long filters. As explained in the next section, shorter filter lengths are sufficient in the case of isolated nodes since the microphones are very close to each other, yielding small time differences of arrival (TDOA).

6.4. Communication Delays or Time Differences of Arrival. To exploit the spatial coherence between microphone signals, the noise reduction filters attempt to align the signal components resulting from the same source in the different microphone signals. However, alignment of the direct components of the source signals is only possible when the filter lengths are at least twice the maximum time difference of arrival (TDOA) between all the microphones. This means that in general, the noise reduction performance degrades with increasing TDOAs and fixed filter lengths. Large TDOAs require longer filters, or appropriate delay compensation. As already mentioned in Section 3, delay compensation is restricted in hearing aids due to lip synchronization constraints.

The TDOA depends on the distance between the microphones, the position of the sources, and the delay introduced by the communication link. Figure 10 shows the performance degradation of R-DANSE at nodes 3 and 4 when the TDOA increases, in this case modelled by an increasing communication delay between the nodes. There is no delay compensation, that is, none of the signals are delayed before filtering. DFT sizes L = 512 and L = 1024 are evaluated. The outputs of the centralized MWF procedure are also given as a reference, as well as the procedure where every node broadcasts its first microphone signal, which corresponds to the scenario in which all supporting nodes are single-microphone nodes. The lower bound is defined by the scenario where all nodes are isolated, that is, each node only uses its own microphones in the estimation process.

As expected, when the communication delay increases, the performance degrades due to increasing time lags between signals. At node 3, the R-DANSE algorithm is slightly more sensitive to the communication delay than the centralized MWF. The behavior at node 2 is very similar, and is omitted here. Furthermore, for large communication delays, R-DANSE is outperformed by the single-microphone nodes scenario. At node 4, both the centralized MWF and

Figure 9: Output SNR and SDR after 28 iterations of R-DANSE_1 with sequential updating versus DFT size L at nodes 3 and 4. (a) Node 3. (b) Node 4.

Figure 10: Output SNR and SDR at nodes 3 and 4 after 12 iterations of R-DANSE_1 with sequential updating versus the delay of the communication link. (a) Node 3. (b) Node 4.


At node 4, both the centralized MWF and the single-microphone nodes scenario even benefit from communication delays. Apparently, the additional delay allows the estimation process to align the signals more effectively.

The reason why R-DANSE is more sensitive to a communication delay than the centralized MWF is that the latter involves independent estimation processes, whereas in R-DANSE, the estimation at any node k depends on the quality of the estimation at every other node q ≠ k. Notice, however, that the influence of the communication delay is of course very dependent on the scenario and its resulting TDOAs. The above results only give an indication of this influence.

7. Practical Issues and Open Problems

In the batch-mode simulations provided in this paper, some practical aspects have been disregarded. Therefore, the actual performance of the MWF and the DANSEK algorithm may be worse than what is shown in the simulations. In this section, some of these practical aspects are briefly discussed.

The VAD is a crucial ingredient in MWF-based noise reduction applications. A simple VAD may not behave well in the simulated scenario as described in Figure 2 due to the fact that the noise component also contains competing speech signals. Especially the VADs at nodes that are close to an interfering speech source (e.g., node 1 in Figure 2) are bound to make many wrong decisions, which will then severely deteriorate the output of the DANSE algorithm. To solve this, a speaker-selective VAD should be used, for example, [24]. Also, low SNR nodes should be able to use VAD information from high SNR nodes. By sharing VAD information, better VAD decisions can be made [25]. How to organize this, and how a consensus decision can be found between different nodes, is still an open research problem.

A related problem is the actual selection of the desired source, versus the noise sources. A possible strategy is that the speech source with the highest power at a certain reference node is selected as the desired source. In hearing aid applications, it is often assumed that the desired source is in front of the listener. Since the actual positions of the hearing aid microphones are known (to a certain accuracy), the VAD can be combined with a source localization algorithm or a fixed beamformer to distinguish between a target speaker and an interfering speaker. Again, this information should be shared between nodes so that all nodes can eventually make consistent selections.

A practical aspect that needs special attention is the adaptive estimation of the correlation matrices in the DANSEK algorithm. In many MWF implementations, correlation matrices are updated with the instantaneous sample correlation matrix and by using a forgetting factor 0 < λ < 1, that is,

R_{yy}[t] = \lambda R_{yy}[t-1] + (1 - \lambda)\, y[t]\, y^H[t],    (35)

where y[t] denotes the sample of the multi-channel signal y at time t. The forgetting factor λ is chosen close to 1 to obtain long-term estimates that mainly capture the spatial coherence between the microphone signals. In the DANSEK algorithm, however, the statistics of the input signal yk in node k, defined by (14), change whenever a node q ≠ k updates its filters, since some of the channels in yk are indeed outputs from a filter in node q. Therefore, when node q updates its filters, parts of the estimated correlation matrices R_{yy,k} and R_{xx,k}, for all k ∈ {1, . . . , J} \ {q}, may become invalid. As a consequence, strategy (35) may not work well, since every new estimate of the correlation matrix then relies on previous estimates. Instead, either downdating strategies should be considered, or the correlation matrices have to be completely recomputed.
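As an illustration only, and not the authors' implementation, the recursive update (35) and the full recomputation mentioned above can be sketched in MATLAB; the dimensions, data, and forgetting factor below are placeholder assumptions:

    % Sketch of the recursive estimate (35) with a forgetting factor, followed by
    % a batch recomputation after a filter update at another node. All data and
    % dimensions are placeholders for illustration.
    M = 4;                                      % number of channels in the local signal
    Ryy = eye(M);                               % some previous correlation estimate
    y = randn(M,1) + 1i*randn(M,1);             % one new multichannel sample y[t]
    lambda = 0.999;                             % forgetting factor close to 1
    Ryy = lambda*Ryy + (1 - lambda)*(y*y');     % recursive update as in (35)
    % After another node updates its filters, the recursion mixes outdated and
    % current statistics; one remedy mentioned above is a full recomputation
    % from samples collected after that update:
    Ynew = randn(M,1000) + 1i*randn(M,1000);    % placeholder block of fresh samples
    Ryy = (Ynew*Ynew')/size(Ynew,2);            % batch sample correlation matrix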

8. Conclusions

The simulation results described in this paper demonstrate that noise reduction performance in hearing aids may be significantly improved when external acoustic sensor nodes are added to the estimation process. Moreover, these simulation results provide a proof-of-concept for applying DANSEK in cooperative acoustic sensor networks for distributed noise reduction applications, such as in hearing aids. A more robust version of DANSEK, referred to as R-DANSEK, has been introduced and its convergence has been proven. Batch-mode experiments showed that R-DANSEK significantly outperforms DANSEK. The occurrence of limit cycles and the effectiveness of relaxation in the simultaneous updating procedure have been illustrated. Additional tests have been performed to quantify the influence of several parameters, such as the DFT size and TDOAs or delays within the communication link.

Acknowledgments

This research work was carried out at the ESAT laboratory of Katholieke Universiteit Leuven, in the frame of the Belgian Programme on Interuniversity Attraction Poles, initiated by the Belgian Federal Science Policy Office IUAP P6/04 (DYSCO, “Dynamical systems, control and optimization”, 2007–2011), the Concerted Research Action GOA-AMBioRICS, and Research Project FWO no. G.0600.08 (“Signal processing and network design for wireless acoustic sensor networks”). The scientific responsibility is assumed by its authors. The authors would like to thank the anonymous reviewers for their helpful comments.

References

[1] H. Dillon, Hearing Aids, Boomerang Press, Turramurra, Australia, 2001.

[2] B. Kollmeier, J. Peissig, and V. Hohmann, “Real-time multiband dynamic compression and noise reduction for binaural hearing aids,” Journal of Rehabilitation Research and Development, vol. 30, no. 1, pp. 82–94, 1993.

[3] J. G. Desloge, W. M. Rabinowitz, and P. M. Zurek, “Microphone-array hearing aids with binaural output. I. Fixed-processing systems,” IEEE Transactions on Speech and Audio Processing, vol. 5, no. 6, pp. 529–542, 1997.

[4] D. P. Welker, J. E. Greenberg, J. G. Desloge, and P. M. Zurek, “Microphone-array hearing aids with binaural output. II. A two-microphone adaptive system,” IEEE Transactions on Speech and Audio Processing, vol. 5, no. 6, pp. 543–551, 1997.

[5] I. L. D. M. Merks, M. M. Boone, and A. J. Berkhout, “Design of a broadside array for a binaural hearing aid,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA ’97), October 1997.

[6] V. Hamacher, “Comparison of advanced monaural and binaural noise reduction algorithms for hearing aids,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’02), vol. 4, pp. 4008–4011, May 2002.

[7] R. Nishimura, Y. Suzuki, and F. Asano, “A new adaptive binaural microphone array system using a weighted least squares algorithm,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’02), vol. 2, pp. 1925–1928, May 2002.

[8] T. Wittkop and V. Hohmann, “Strategy-selective noise reduction for binaural digital hearing aids,” Speech Communication, vol. 39, no. 1-2, pp. 111–138, 2003.

[9] M. E. Lockwood, D. L. Jones, R. C. Bilger, et al., “Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms,” The Journal of the Acoustical Society of America, vol. 115, no. 1, pp. 379–391, 2004.

[10] T. Lotter and P. Vary, “Dual-channel speech enhancement by superdirective beamforming,” EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 63297, 14 pages, 2006.

[11] O. Roy and M. Vetterli, “Rate-constrained beamforming for collaborating hearing aids,” in Proceedings of IEEE International Symposium on Information Theory (ISIT ’06), pp. 2809–2813, July 2006.

[12] S. Doclo and M. Moonen, “GSVD-based optimal filtering for single and multimicrophone speech enhancement,” IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2230–2244, 2002.

[13] A. Bertrand and M. Moonen, “Distributed adaptive node-specific signal estimation in fully connected sensor networks—Part I: sequential node updating,” Internal Report, Katholieke Universiteit Leuven, ESAT/SCD, Leuven-Heverlee, Belgium, 2009.

[14] A. Bertrand and M. Moonen, “Distributed adaptive estimation of correlated node-specific signals in a fully connected sensor network,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’09), pp. 2053–2056, April 2009.

[15] T. J. Klasen, T. Van den Bogaert, M. Moonen, and J. Wouters, “Binaural noise reduction algorithms for hearing aids that preserve interaural time delay cues,” IEEE Transactions on Signal Processing, vol. 55, no. 4, pp. 1579–1585, 2007.

[16] S. Doclo, R. Dong, T. J. Klasen, J. Wouters, S. Haykin, and M. Moonen, “Extension of the multi-channel Wiener filter with ITD cues for noise reduction in binaural hearing aids,” in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC ’05), pp. 221–224, September 2005.

[17] S. Doclo, T. J. Klasen, T. Van den Bogaert, J. Wouters, and M. Moonen, “Theoretical analysis of binaural cue preservation using multi-channel Wiener filtering and interaural transfer functions,” in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC ’06), September 2006.

[18] T. Van den Bogaert, J. Wouters, S. Doclo, and M. Moonen, “Binaural cue preservation for hearing aids using an interaural transfer function multichannel Wiener filter,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’07), vol. 4, pp. 565–568, April 2007.

[19] S. Doclo, M. Moonen, T. Van den Bogaert, and J. Wouters, “Reduced-bandwidth and distributed MWF-based noise reduction algorithms for binaural hearing aids,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 1, pp. 38–51, 2009.

[20] A. Bertrand and M. Moonen, “Distributed adaptive node-specific signal estimation in fully connected sensor networks—Part II: simultaneous & asynchronous node updating,” Internal Report, Katholieke Universiteit Leuven, ESAT/SCD, Leuven-Heverlee, Belgium, 2009.

[21] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.

[22] M. Nilsson, S. D. Soli, and J. A. Sullivan, “Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise,” The Journal of the Acoustical Society of America, vol. 95, no. 2, pp. 1085–1099, 1994.

[23] J. A. Bondy and U. S. R. Murty, Graph Theory with Applications, American Elsevier, New York, NY, USA.

[24] S. Maraboina, D. Kolossa, P. K. Bora, and R. Orglmeister, “Multi-speaker voice activity detection using ICA and beampattern analysis,” in Proceedings of the European Signal Processing Conference (EUSIPCO ’06), 2006.

[25] V. Berisha, H. Kwon, and A. Spanias, “Real-time implementation of a distributed voice activity detector,” in Proceedings of IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM ’06), pp. 659–662, July 2006.


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2009, Article ID 876371, 8 pages, doi:10.1155/2009/876371

Research Article

Synthetic Stimuli for the Steady-State Verification of Modulation-Based Noise Reduction Systems

Jesko G. Lamm (EURASIP Member), Anna K. Berg, and Christian G. Gluck

Bernafon AG, Morgenstrasse 131, 3018 Bern, Switzerland

Correspondence should be addressed to Jesko G. Lamm, [email protected]

Received 28 November 2008; Accepted 12 March 2009

Recommended by Heinz G. Goeckler

Hearing instrument verification involves measuring the performance of noise reduction systems. Synthetic stimuli are proposed as test signals, because they can be tailored to the parameter space of the noise reduction system under test. The article presents stimuli targeted at steady-state measurements in modulation-based noise reduction systems. It shows possible applications of these stimuli and measurement results obtained with an exemplary hearing instrument.

Copyright © 2009 Jesko G. Lamm et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Noise reduction systems provide users of hearing instruments with increased listening comfort [1]. The aim of such systems is to suppress uncomfortable sounds on the one hand, but to preserve speech on the other hand. Among various available noise reduction methods, modulation-based processing is a common one [2].

Modulation-based noise reduction systems apply different amounts of attenuation in different frequency ranges, depending on the likelihood of speech presence in each of them. Based on the observation that speech has a characteristic modulation spectrum [3], such systems measure modulation, which is the fluctuation of the signal's envelope over time.

The measurements treat modulation in different subbands separately, such that the signal processing can react by applying different amounts of attenuation in different frequency ranges. The idea is to attenuate signals that lack the characteristic modulation of speech.

Testing a hearing instrument regarding noise reduction performance has to ensure that two conditions are met.

(i) The noise reduction system meets its requirements.

(ii) The noise reduction system satisfies its user.

Assessing each of these conditions requires an individual test philosophy: while verification shows if the noise reduction system meets its requirements, validation assesses the system's capability of meeting customer needs “in the most realistic environment achievable” [4].

In the following, we present a measurement-based test procedure for modulation-based noise reduction systems in hearing instruments. Our focus is on verification and not on validation, because the validation of noise reduction systems in hearing instruments has been discussed well in the literature (e.g., [5]).

Numerous stimulus-based verification procedures for different aspects of hearing instrument functionality have been presented so far. Here are two recent examples.

(i) The International Speech Test Signal [6, 7] is a stimulus for measuring the hearing instrument performance in a speech-like environment. It is based on a combination of numerous real-world speech signals.

(ii) Bentler and Chiou have discussed the verification of noise reduction systems in hearing instruments and presented measurements based on real-world speech in noise [8].

The above examples are both based on real-world speech signals. This makes sense, because suitable performance in speech is essential for hearing instruments. In this article, however, noise plays an important role, because noise is the reason why noise reduction systems are needed.

Real-world noise signals lack certain properties desirable in test signal design. These properties are mainly [9–11]


(i) the possibility of configuring certain signal characteristics (like, e.g., modulation) systematically in order to force the system under test into a desired state;

(ii) freedom in changing the signal's temporal characteristics, selecting its power-spectrum, and making its spectral components sufficiently constant over the frequency range of interest.

We have therefore recently proposed synthetic test signals [9], because these can be synthesized with regard to the temporal characteristics of interest, the systematic estimation of noise reduction parameters, and accurate measurement results.

Using synthetic test signals in the audio processing domain is not new. In telephony applications, a so-called Composite Source signal [12] based on synthetic speech is available for verification of transfer characteristics of telephony equipment. In that case, again the speech performance of the system under test is of major interest, whereas the noise reduction stimuli which we describe in the following mainly focus on noise attenuation by the noise reduction system under test.

This article shows some new results and applications with the synthetic test signals from [9] in measuring the noise reduction attenuation as a function of different input parameters. We first summarize and explain the signal synthesis procedure from [9]. Then we show different applications of the synthetic test signals. These are finally illustrated with measurement results, which we obtained with an exemplary hearing instrument.

2. Synthetic Test Stimuli

2.1. Requirements Towards the Stimuli. A noise reduction system should attenuate noise, which makes the attenuation a parameter of major interest in testing. Since a typical noise reduction system operates in multiple subbands [1], a systematic test procedure should measure the attenuation in each of them separately. It is therefore required that the test stimuli can excite each subband of the noise reduction individually so that the impact on the system's frequency response can be measured.

As a consequence, the stimuli have to meet the following demands: they should not only perform well in frequency response measurements, but they should also allow signal parameters to be set individually for different frequency ranges. For example, the stimuli for verifying modulation-based noise reduction should have a constant magnitude in the frequency range of interest (see [11]) and furthermore different well-defined modulation depths in different frequency ranges.

The peak factor [13] of the signals should be as low as possible, because a high peak factor implies that a signal of given power has high amplitude peaks, resulting in distortion by nonlinearities not only in the measurement equipment but also in the hearing instrument under test itself.

The signals should also be periodic, since periodicity brings the following advantages.

(i) Periodic stimuli avoid leakage errors [14] in processing based on Discrete Fourier Transforms (DFT).

(ii) Measuring the magnitude of a system's frequency response becomes independent of the system's throughput delay when using a periodic stimulus, because within a given time frame whose length is an integer number of periods, the throughput delay of the system only produces a phase shift and thus has no impact on the measured magnitude.

(iii) Only one period of the desired stimulus needs to be computed, which limits synthesis time.

(iv) The stimulus can be described by means of its complex Fourier coefficients c_k via a complex Fourier series. If the stimulus is σ, its period is T, and j is the imaginary unit, then the Fourier series representation of the stimulus is given by the following equation:

\sigma(t) = \sum_{k=-\infty}^{\infty} c_k \, e^{j 2\pi k t / T}.    (1)

Note that a disadvantage of a periodic signal is its discrete power-spectrum: measuring frequency responses with periodic stimuli will only cover discrete frequencies. See, for example, [14] for nonperiodic alternatives.

2.2. Signal Synthesis

2.2.1. Simple Subband Signals Based on Sinusoids. The most trivial subband signal is a sine wave. Rated against the requirements from Section 2.1, a sine wave performs well regarding peak factor and periodicity, but does not have the required constant magnitude over the frequency range of interest. This can be addressed by using multiple sines: a test stimulus obtained by summing sine waves of different frequencies will indeed cover a certain frequency range; however, summing sine waves requires special care, as will be explained below.

We call a sum of sine signals a multisine. Summing sine signals with carelessly chosen phase angles typically yields a multisine with a high peak factor [13] as opposed to the low peak factor required in Section 2.1. By choosing the right phase angles, the peak factor can be reduced: there are various algorithms for determining combinations of phase angles that yield a low peak factor in summing sine waves [13–16].

An exemplary multisine synthesis algorithm [15] has recently been evaluated in synthesizing stimuli for the verification of a noise reduction system in an exemplary hearing instrument [9]. Some poor results during this evaluation made us focus on noise-based stimuli, which will be discussed in the following. Note that it is still an open question whether multisines are a suitable basis for noise reduction stimuli, but we would like to exclude that question from this article's scope.


[Figure 1 panels: amplitude versus sample number; (a) original noise signal (peak factor: 3.5), (b) filtered signal (peak factor: 6.3).]

Figure 1: A noise signal (a) and the result of filtering it through a bandpass filter whose passband corresponds to a typical noise reduction subband (b).

2.2.2. Simple Subband Signals Based on Noise. A simple way of synthesizing a band-limited noise signal in the frequency range of a noise reduction subband is

(i) generating a noise signal,

(ii) filtering the noise signal with a bandpass filter whose pass-band corresponds to the noise reduction subband of interest.

Figure 1 shows an example: MATLAB's “rand” function was used to generate a uniformly distributed digital noise signal (Figure 1(a)). This signal was filtered with a bandpass filter corresponding to a typical noise reduction subband (Figure 1(b)).
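As a minimal MATLAB sketch of this kind of example (the band edges, filter order, and signal length are our assumptions, not the values used for Figure 1), band-limiting uniform noise and inspecting its peak factor could look as follows:

    % Generate uniformly distributed noise, band-limit it to one hypothetical
    % noise reduction subband, and compute the peak factor of the result.
    fs = 22050;                            % sampling rate used later in the article
    x  = 2*rand(4096,1) - 1;               % zero-mean, uniformly distributed noise
    [b,a] = butter(4, [700 900]/(fs/2));   % bandpass for an assumed 700-900 Hz subband
    y  = filter(b, a, x);                  % band-limited noise signal
    pf = max(abs(y)) / sqrt(mean(y.^2));   % peak factor typically rises after filtering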

However, the example in Figure 1 illustrates two reasons why bandpass-filtering a noise signal will in general not produce a stimulus that meets the requirements from Section 2.1, as follows.

(i) The stimulus has a high peak factor (e.g., a peak factor of 6.3 in the case of the signal shown in Figure 1(b)).

(ii) It is not possible to determine the modulation depth of the stimulus (see, e.g., the signal shown in Figure 1(b), which has some fluctuations in its envelope although it has not explicitly been modulated).

Obviously, subband signals that have been obtained by filtering a broadband signal are not well-suited as noise reduction stimuli. Therefore we propose to choose a synthesis procedure capable of constructing band-limited signals that have a low peak factor.

2.2.3. Band-Limited Discrete-Interval Binary Sequences. A signal whose amplitude has only two discrete values is called a binary signal. Obviously, binary signals have a minimum peak factor and therefore perfectly satisfy the peak factor requirement from Section 2.1. However, binary signals also tend to have a wide bandwidth, which disqualifies most of them from subband measurements.

The theory of discrete-interval binary signals [17, 18] provides algorithms that search, among binary signals with amplitude changes at multiples of a certain time interval, for those signals whose power spectrum approximates a desired one. This theory can be used here to find binary signals whose power is concentrated in one subband and whose power spectrum is sufficiently constant for frequency response measurements.

Although binary signals have a minimum peak factor, being binary is not a necessary property of a noise reduction stimulus. We expect various kinds of signals to be suitable stimuli, like, for example, the multisines mentioned in Section 2.2.1. For our further considerations, however, we limit our scope to discrete-interval binary signals, because these performed well in our evaluation of different stimuli, as shown exemplarily in [9]. Since signal synthesis should work in discrete time and discrete-interval binary signals were originally defined in continuous time [17], we define that a discrete-interval binary sequence is the discrete-time representation of a discrete-interval binary signal.

The Frequency Domain Identification Toolbox (FDIDENT) for MATLAB [19] offers readily accessible functions for synthesizing discrete-interval binary sequences [16]. The “dibs” function in this toolbox takes absolute values of desired Fourier coefficients as an input and returns one period of a periodic discrete-interval binary sequence whose Fourier series approximates the given input.

When used as a stimulus for subband measurements, a periodic discrete-interval binary sequence needs to have its power concentrated in the frequency range of interest, ideally like band-limited white noise. Therefore, if the frequency range of interest is from f1 to f2 (where f1 > 0 and f2 ≥ f1 + T^{-1}) and the desired RMS of the synthesized signal is r, then the target values for the synthesis algorithm are given by the following absolute values of Fourier coefficients c_k (derived from [9]):

|c_k| =
\begin{cases}
\dfrac{r}{\sqrt{2\,(\lceil T f_2 \rceil - \lfloor T f_1 \rfloor + 1)}}, & \lfloor T f_1 \rfloor \le |k| \le \lceil T f_2 \rceil, \\
0, & \text{else}.
\end{cases}    (2)
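A small MATLAB sketch of the target values in (2) is given below; the period length, band edges, and RMS value are illustrative assumptions, and the resulting vector would then be handed to a synthesis routine such as the FDIDENT “dibs” function (whose exact call signature is not reproduced here):

    % Build the target Fourier-coefficient magnitudes of (2) for one subband.
    N  = 4096; fs = 22050; T = N/fs;       % one stimulus period, as in Section 5.3
    f1 = 700;  f2 = 900;  r = 1;           % hypothetical subband limits and target RMS
    k  = 0:N/2;                            % nonnegative harmonic indices
    ck = zeros(size(k));
    in_band = (k >= floor(T*f1)) & (k <= ceil(T*f2));
    ck(in_band) = r / sqrt(2*(ceil(T*f2) - floor(T*f1) + 1));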

3. Frequency Response Measurements

3.1. General Procedure. In this section, we describe a procedure for stimulus-based measurements of a linear system's frequency response. It is based on digital signal processing and thus assumes that test stimuli and the output signal of the system under test are available as digital waveforms. All further considerations will be made with regard to the following measurement procedure.

DFT-based processing can approximate the frequency response function of a system whose input stimulus is a periodic digital test signal: the system's output is digitized with the clock of the input signal. Then, the DFT is applied to both the input signal and the digitized output. The frequency response is calculated at each DFT frequency by dividing the absolute value of the output-related DFT bin by the corresponding input-related value [11, 20]. Leakage errors can be avoided if the stimulus is periodic, and if the DFT window contains an integer number of its periods [14].

The described procedure requires a steady-state condition of the system under test. Therefore all measurements described here are steady-state measurements.
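The DFT-based estimate described above can be sketched in MATLAB as follows; the placeholder stimulus and “system under test” are ours, and only the bins actually excited by the stimulus are evaluated:

    % Estimate the magnitude response from one window of the periodic input x
    % and the digitized output y; N equals an integer number of stimulus
    % periods, so leakage is avoided for the excited bins.
    N = 4096;
    x = repmat(sign(randn(N/4,1)), 4, 1);  % placeholder periodic stimulus (4 periods)
    y = filter([0.8 0.1], 1, x);           % placeholder system under test
    X = fft(x);  Y = fft(y);
    excited = abs(X) > 1e-6*max(abs(X));   % bins carrying stimulus energy
    H = NaN(N,1);
    H(excited) = abs(Y(excited)) ./ abs(X(excited));   % magnitude response estimate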

3.2. Subband Measurements with Narrow-Band Stimuli. A very simple test case is the measurement of the frequency response in one noise reduction subband. Let the number of subbands be M. We assign an index i ∈ {1, 2, 3, . . . , M} to each of them. Let the lower and upper band limit of band number i be fc,i and fc,i+1, respectively. Furthermore, let ui be a discrete-interval binary sequence that has been synthesized according to a target by (2), where f1 = fc,i and f2 = fc,i+1.

We can now construct a signal with modulation frequency fm,i and a configurable modulation depth: let mi be another discrete-interval binary sequence that has been synthesized in the same way as ui, but with f1 = fc,i + fm,i and f2 = fc,i+1 − fm,i, where the modification of the band limits by ±fm,i should compensate for the broadened spectrum that results when modulating mi with modulation frequency fm,i. Note that mi and ui are completely unmodulated. Now a signal si of modulation frequency fm,i and configurable modulation depth is given by

s_i(t) = \alpha \sqrt{\tfrac{2}{3}}\, m_i(t) \left[1 + \cos\left(2\pi f_{m,i}\, t\right)\right] + (1 - \alpha)\, u_i(t).    (3)

In (3), parameter α ∈ [0, 1] configures the modulation depth. The factor \sqrt{2/3} ensures that the si signals resulting from α = 0 and α = 1 have approximately the same RMS, such that the signal power of si is almost independent of the modulation depth configured by parameter α. In theory, the approximate factor \sqrt{2/3} could be replaced by the precise factor that is needed to make the signal power completely independent of α. This precise factor could be computed from the known signals mi and ui. However, in the applications we present here, this is in our opinion not necessary: in the examples we show in this article based on the approximate factor \sqrt{2/3}, we computed the signal level of si for the cases α = 0 and α = 1 and found that it differs by less than 0.05 dB from one case to the other.

Note that the stimuli we present here are defined in continuous time, but targeted at a measurement procedure using discrete-time processing. The link between the continuous-time domain and the discrete-time domain is in our case given by the earlier-mentioned FDIDENT toolbox: it takes Fourier coefficients from the continuous-time domain as an input and returns a discrete-interval binary sequence as a discrete-time signal. Therefore, the following discrete-time version of (3) is needed (where n is the sample index, fs is the sampling frequency, and mi(n), ui(n), si(n) are the discrete-time signals resulting from sampling mi(t), ui(t), and si(t), resp.):

s_i(n) = \alpha \sqrt{\tfrac{2}{3}}\, m_i(n) \left[1 + \cos\left(2\pi \frac{f_{m,i}}{f_s}\, n\right)\right] + (1 - \alpha)\, u_i(n).    (4)
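A brief MATLAB sketch of (4) is shown below; the two unmodulated sequences are stand-ins (random ±1 sequences) for the discrete-interval binary sequences that the article obtains from the “dibs” synthesis, and the modulation depth value is arbitrary:

    % Build the modulated subband stimulus s_i from two unmodulated sequences
    % m_i and u_i according to (4).
    N  = 4096; fs = 22050;
    fm = fs/N;                             % about 5.4 Hz, as in Section 5.3
    n  = (0:N-1).';
    mi = sign(randn(N,1));                 % placeholder for the band-limited sequence m_i
    ui = sign(randn(N,1));                 % placeholder for the band-limited sequence u_i
    alpha = 0.5;                           % configurable modulation depth
    si = alpha*sqrt(2/3)*mi.*(1 + cos(2*pi*fm/fs*n)) + (1 - alpha)*ui;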

The signal from (4) can be used for measuring the frequency response of a hearing instrument in the subband of interest with the procedure from Section 3.1. The frequency response of the noise reduction system in the hearing instrument can be obtained by a differential measurement; this means that the frequency response is first measured with the noise reduction turned on and then with the noise reduction switched off. Frequency-by-frequency division of the obtained responses yields the transfer function of the noise reduction.
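In MATLAB terms, the differential measurement amounts to a bin-wise division of the two measured responses; H_on and H_off below are placeholders for the magnitude responses obtained with the noise reduction switched on and off:

    % Differential measurement of the noise reduction transfer function.
    H_off = ones(4096,1);                      % placeholder: response with NR off
    H_on  = [ones(2048,1); 0.5*ones(2048,1)];  % placeholder: 6 dB attenuation in upper half
    H_nr  = H_on ./ H_off;                     % transfer function of the noise reduction
    nr_gain_dB = 20*log10(abs(H_nr));          % negative values indicate attenuation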

Note that the signal presented in this section is only targeted at a single subband. Therefore all measurement samples at frequencies outside the given subband have to be ignored. The next section describes stimuli that measure the frequency response of the noise reduction system over the whole bandwidth of the hearing instrument.

3.3. Full Bandwidth Measurements with Broadband Stimuli. This section describes the synthesis of a signal that allows measuring the frequency response over the whole bandwidth of the hearing instrument under test. The idea is to measure the effect of the noise reduction system in a certain subband, while the noise reduction does not act on any other frequency range. Let the subband of interest be b. To obtain a test signal θb that evokes attenuation of a modulation-based noise reduction system in subband number b only, we add fully modulated signals corresponding to the other subbands to a signal that has configurable modulation and most of its power in subband number b:

\theta_b(n) = s_b(n) + \sqrt{\tfrac{2}{3}} \sum_{i \in \{1,2,\ldots,M\}\setminus\{b\}} m_i(n) \left[1 + \cos\left(2\pi \frac{f_{m,i}}{f_s}\, n\right)\right].    (5)

Here again, M is the number of subbands. The signal sb in (5) is the same as in (4). This means that the modulation depth of signal sb can be configured via parameter α according to (4).

Note the following: if the value of α is close to 1, some segments of signal sb are close to zero (those segments in which the cosine in (4) is close to −1). As the mi signals in (5) are discrete-interval binary sequences, they will not be perfectly band-limited and therefore produce side lobes in subband number b. This means that the stimulus in the subband of interest is contaminated by side lobes from other bands for values of α close to 1.

As a consequence, the test signal θb from (5) is not well-suited for measuring noise reduction performance as a function of modulation depth parameter α in its full range.


However, for α = 0, signal θb can be used for measuring the frequency response of the noise reduction system while one subband is stimulated to apply its maximum attenuation. We show an example of this application in Section 5.4.1.

3.4. Subband Measurements with Broadband Stimuli. So far we have presented the synthesis of noise reduction stimuli for different subbands and a way of mixing these stimuli in order to obtain a test signal θb for broadband measurements. We argued that test signal θb causes problems with high modulation depths due to side-lobe influences from other subbands. In this section we show a way of eliminating these side-lobe influences in one subband of interest.

If subband number b is the subband of interest, then we can eliminate the influences from other subbands by filtering out their side lobes from this particular subband. This can be done using a band-stop filter whose band limits are the crossover frequencies of subband number b. Let h be the impulse response of such a band-stop filter, and let “∗” be the convolution operator. Furthermore, let θ̃b be a modified version of θb in which side-lobe influences from other subbands will be eliminated. We construct θ̃b by a modified version of (5):

\tilde{\theta}_b(n) = s_b(n) + \sqrt{\tfrac{2}{3}}\; h(n) * \sum_{i \in \{1,2,\ldots,M\}\setminus\{b\}} m_i(n) \left[1 + \cos\left(2\pi \frac{f_{m,i}}{f_s}\, n\right)\right].    (6)

In practice, we do not implement the band-stop filter by a convolution with h. We rather implement the filtering in the DFT domain: we put zeros into the stop-band's DFT bins of the signal to filter, exploiting the periodicity of the mi signals and of the cosine term in (6).
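A minimal MATLAB sketch of this DFT-domain band-stop is given below, with a placeholder signal and assumed stop-band edges; the negative-frequency bins are zeroed as well so that the result stays real-valued:

    % Zero the stop-band DFT bins of one stimulus period z.
    fs = 22050; N = 4096;
    z  = sign(randn(N,1));                 % placeholder: one period of the signal to filter
    fstop = [700 900];                     % assumed stop band in Hz
    f = (0:N-1).' * fs/N;
    f = min(f, fs - f);                    % frequency of each bin, folded to 0..fs/2
    Z = fft(z);
    Z(f >= fstop(1) & f <= fstop(2)) = 0;  % removes positive and negative frequency bins
    z_bs = real(ifft(Z));                  % filtered period (real up to round-off)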

Note that θ̃b can have a higher peak factor than θb due to the filtering. In the measurements we describe in Section 5.4.2, this was however not a problem. If only one subband of interest is within the scope of the test, then the narrow-band stimuli from Section 3.2 can be used. The test signal θ̃b is useful when all subbands of the noise reduction system are relevant in the test case, but modulation will only be varied in one of them.

4. Attenuation Function Measurements

Modulation-based noise reduction systems apply attenuation as a function of the signal's modulation depth [8]. Therefore, the dependency between modulation depth and attenuation is of interest in noise reduction testing. For systems that operate in multiple subbands, this dependency can be assessed per subband, by varying modulation parameter α according to (4) and then using each resulting signal si either as a stimulus for measurement or as a basis for synthesizing stimulus θ̃b according to (6).

The resulting stimuli can be used in measuring the frequency response of a subband of interest for different modulation depths. In order to obtain a simple modulation/attenuation dependency function, one needs to compute a single attenuation value from a transfer function defining gain at multiple frequencies. Inspired by the way in which median and averaging operations work, we here propose sorting a certain set of gain values within the subband of interest by their magnitude and then averaging those values that remain after eliminating the first and the last quarter of the resulting sorted list. Typically, one would only choose frequencies close to the center frequency of the current subband in order to avoid taking the slopes at the band limits into the averaging process.
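The single attenuation value proposed here can be sketched in MATLAB as follows; the gain curve and centre frequency are placeholders, and the ±50 Hz range is the one used later in Section 5.4.2:

    % Trimmed-mean attenuation value from a measured noise reduction gain curve.
    fs = 22050; N = 4096;
    f_bins  = (0:N/2-1).' * fs/N;               % DFT bin frequencies
    gain_dB = -8*exp(-((f_bins - 800)/120).^2); % placeholder measured NR gain in dB
    fc = 800;                                   % centre frequency of the subband of interest
    g  = sort(gain_dB(abs(f_bins - fc) <= 50)); % gains near the centre, sorted
    q  = floor(numel(g)/4);
    att_dB = -mean(g(q+1:end-q));               % discard first/last quarter, then average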

5. Examples

5.1. System under Test. An exemplary digital hearing instrument with a modulation-based noise reduction system was the system under test for the measurements whose results are presented further below. The noise reduction system in this hearing instrument works in the time domain according to the following scheme.

(i) Determine the amount of typical modulation in different subbands of the hearing instrument's input signal by passing subband signals through running maximum and minimum filters and comparing the different filters' outputs [1].

(ii) For each subband, compute attenuation as a function of modulation, where low modulation maps to high attenuation and vice versa.

(iii) Use a controllable filter to adjust the frequency response of the hearing instrument as it is given by the computed frequency-dependent attenuation values.

More details on the underlying concept of implementing modulation-based noise reduction in the time domain can be found in [1].

5.2. Measurement Setup. A test system was set up for making measurements with synthetic test signals. Figure 2 illustrates the setup: the hearing instrument under test is located in an off-the-shelf acoustic measurement box with a loudspeaker (L1) for presenting test stimuli to be picked up by the hearing instrument's input transducer (M2). The hearing instrument's output transducer (L2) is coupled with a measurement microphone (M1) so tightly that environment sounds can be neglected in comparison to the hearing instrument's output. The coupler is a cavity that is similar to the human ear canal. Here, we used a so-called 2cc-coupler.

A digital playback and recording system can play MATLAB-created stimuli via a digital-to-analogue converter (D/A) and the loudspeaker of the measurement box (L1), while recording the hearing instrument's output via the measurement microphone and an analogue-to-digital converter (A/D). The recorded digital data is stored in a MATLAB-readable file on a hard disk. The sampling rate for both playing and recording signals is 22050 Hz. The test system ensures synchronous playback and recording.


5.3. Measurement Procedure. The gain in the hearing instrument under test was set 20 dB below the maximum offered value to reduce nonlinearities. All adaptive features of the hearing instrument, apart from noise reduction, were turned off for all test runs. The hearing instrument was furthermore configured for linear amplification; this means that there was no dynamic range compression.

Measurements were performed with different θb and θ̃b according to (5) and (6), respectively. In synthesizing these θb and θ̃b, the required mi and ui were computed by the function “dibs” of the earlier-mentioned FDIDENT toolbox, and synthesis parameter r in (2) was adjusted to yield a 70 dB SPL level in each subband. Our measurement method foresees the use of different values of the band index b. However, for simplicity, one constant b was exemplarily chosen for all measurements we present here.

The DFT-based processing according to Section 3.1 was used for frequency response measurements. As this processing needs an integer number of stimulus periods to fit into a DFT window, we chose the stimulus period to be equal to the DFT window length: a window length of 4096 samples allowed us both the use of the Fast Fourier Transform (FFT) and the choice of about 5.4 Hz modulation frequency (fm = fs/window length in samples). This frequency is typical for speech, whose modulation spectrum is significant in the range from 1 to 12 Hz [3].

Two experiments were performed per stimulus: first with the noise reduction system of the hearing instrument switched off, and second while having it switched on. This allowed us to achieve the differential measurement that has been mentioned in Section 3.2: instead of comparing the output and input signal of the system, we compared the output signal from the second experiment with the one from the first experiment.

This method made the measurement procedure independent of the throughput delay in the system under test, especially because the throughput delay of the system was much smaller than the stimulus duration and therefore negligible for test timing; this means that we did not need to delay the recording of output signals compared to the playback of input stimuli. Note that even variations in throughput delay between the first and second experiment could not influence the result, because magnitude computations were independent of the throughput delay due to the periodicity of the used stimuli (see Section 2.1).

5.4. Measurement Results

5.4.1. Frequency Response of an Exemplary Noise Reduction System. We measured the frequency response of the noise reduction system under test while high attenuation was required in one subband and no attenuation was required in the other ones. In order to trigger this noise reduction behavior on the one hand and to allow a measurement over the whole bandwidth of the system on the other hand, we used signal θb according to (5) as a stimulus, where parameter α in the synthesis of signal sb via (4) was set to zero.

[Figure 2 block diagram: test stimulus, D/A, L1, M2, hearing instrument with noise reduction system, L2, M1, A/D, hard disk.]

Figure 2: Measurement setup.

[Figure 3 axes: gain (dB) versus frequency (Hz).]

Figure 3: Measured frequency response of a noise reduction system that is stimulated for attenuating one subband.

For each measurement, the test stimulus was presented for 15 seconds in order to allow the system under test to reach steady state. Bin-by-bin division of FFT absolute values from the second measurement by corresponding values of the first measurement delivered the frequency response of the noise reduction system. Five FFT windows were averaged for spectral smoothing [21]. These windows were taken from the last five seconds of the test run in order to observe the steady-state condition.

The result of this measurement is shown in Figure 3: the shown frequency response indicates that the noise reduction system under test provides attenuation in a subband around 800 Hz while not attenuating any other subband.

5.4.2. Attenuation Function of an Exemplary Noise Reduction System. We measured the dependency of attenuation on modulation depth, where we defined the attenuation as the average over the transfer function samples at frequencies in a band of 100 Hz around the center frequency of the subband of interest. One can argue that the narrow-band signal si from (4) is the suitable stimulus for this kind of measurement.


[Figure 4 axes: attenuation (dB) versus 20 log10(1/[1 − α]).]

Figure 4: Measured dependency of noise reduction attenuation on modulation depth parameter 20 · log10([1 − α]^{−1}).

However, in our case, we based our stimulus on θ̃b from (6) in order to have a broadband stimulus with constant signal properties in most of the subbands. This is an advantage when testing hearing instruments with environment-dependent automatics, because the constant subband signals we have in θ̃b represent a defined environment, whereas si does not impose a defined environment on any subband but the one of interest.

Measurements were performed with a modified version of θ̃b. The modification was to choose a sum index range of {1, 2, . . . , M} \ {b − 1, b, b + 1} instead of {1, 2, . . . , M} \ {b} from (6). This modification ensured that the nominal frequency range of the band of interest spanned more than one subband of the noise reduction system under test, thus making the procedure invariant to clock differences between the device under test and the test equipment and to a nonideal subband split.

For the same reason, the band-stop filtering that is part of (6) was made with a band-stop filter whose band limits were set far enough outside the subband of interest to reach further than the filter slopes of the corresponding subband split filter. We had to find a compromise between setting the band limits of the band-stop filtering far outside the subband of interest to make the measurement procedure robust and setting them as close as possible to these band limits in order to keep the signal changes caused by the filtering as small as possible. One reason to be careful with the choice of the stop-band filter design is that the filtering can change the peak factor of the synthesized stimuli (see also Section 3.4). As a result, we chose band-stop corner frequencies that were half a subband width away from the band limits of the subband of interest.

We used different test stimuli that resulted from varying modulation parameter α according to (4). Since parameter α is in the amplitude domain, whereas usual hearing instrument specifications use decibels as unit, we used 20 · log10([1 − α]^{−1}) rather than α as the modulation depth parameter.
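For example (our values, for illustration only), α = 0.5 corresponds to 20 · log10(1/0.5) ≈ 6 dB, and α = 0.9 corresponds to 20 dB.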

We varied 20 · log10([1 − α]^{−1}) in steps of 2 dB and measured the noise reduction attenuation as a function of this varied parameter. The measured frequency response that was used as a basis for the computations was smoothed by averaging over five FFT windows [21].

To obtain one single attenuation value from the frequency response of interest, the proposed procedure from Section 4 was used; this means that the frequency response was searched for gain values corresponding to frequencies in a ±50 Hz range around the center frequency of the subband of interest, and the corresponding gain values were then sorted and finally averaged after eliminating the first and last quarter of the sorted list.

The obtained result is shown in Figure 4. We see that the system under test behaves as one would expect of a modulation-based noise reduction system (e.g., [8]): unmodulated signals are attenuated strongly, whereas modulated signals are not attenuated or attenuated less strongly.

6. Conclusion

Synthetic test signals have been proposed for verification in the domain of digital hearing instruments. Discrete-interval binary sequences have been used to synthesize stimuli targeted at systematic verification of a modulation-based noise reduction system.

Measurements with an exemplary hearing instrument showed that the synthetic signals succeeded in both stimulating the noise reduction in the subband of interest and measuring the system's frequency response and attenuation function. With the given stimuli, it is possible to test against specifications that require noise reduction attenuation as a function of frequency and modulation.

References

[1] A. Schaub, Digital Hearing Aids, Thieme Medical, New York, NY, USA, 2008.

[2] V. Hamacher, J. Chalupper, J. Eggers, et al., “Signal processing in high-end hearing aids: state of the art, challenges, and future trends,” EURASIP Journal on Applied Signal Processing, vol. 2005, no. 18, pp. 2915–2929, 2005.

[3] I. Holube, V. Hamacher, and M. Wesselkamp, “Hearing instruments—noise reduction strategies,” in Proceedings of the 18th Danavox Symposium on Auditory Models and Non-linear Hearing Instruments, pp. 359–377, Kolding, Denmark, September 1999.

[4] A. Kossiakoff and W. N. Sweet, Systems Engineering Principles and Practice, John Wiley & Sons, Hoboken, NJ, USA, 2003.

[5] M. Marzinzik and B. Kollmeier, “Predicting the subjective quality of noise reduction algorithms for hearing aids,” Acta Acustica united with Acustica, vol. 89, no. 3, pp. 521–529, 2003.

[6] I. Holube, “Development and analysis of an international speech test signal (ISTS),” in International Hearing Aid Research Conference (IHCON ’08), Lake Tahoe, CA, USA, August 2008.

[7] I. Holube and EHIMA-ISMADHA Working Group, “Short description of the international speech test signal (ISTS),” EHIMA—European Hearing Instrument Manufacturers Association, (document contained in a download package with the ISTS signal, April 2008), June 2007, http://www.ehima.com.

[8] R. Bentler and L.-K. Chiou, “Digital noise reduction: an overview,” Trends in Amplification, vol. 10, no. 2, pp. 67–82, 2006.

[9] J. G. Lamm, A. K. Berg, and C. G. Gluck, “Synthetic signals for verifying noise reduction systems in digital hearing instruments,” in Proceedings of the 16th European Signal Processing Conference (EUSIPCO ’08), Lausanne, Switzerland, August 2008.

[10] P. E. Wellstead, “Pseudonoise test signals and the fast Fourier transform,” Electronics Letters, vol. 11, no. 10, pp. 202–203, 1975.

[11] H. A. Barker and R. W. Davy, “System identification using pseudorandom signals and the discrete Fourier transform,” Proceedings of the IEE, vol. 122, no. 3, pp. 305–311, 1975.

[12] H. W. Gierlich, “New measurement methods for determining the transfer characteristics of telephone terminal equipment,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’92), pp. 2069–2072, San Diego, CA, USA, May 1992.

[13] M. R. Schroeder, “Synthesis of low-peak-factor signals and binary sequences with low autocorrelation,” IEEE Transactions on Information Theory, vol. 16, no. 1, pp. 85–89, 1970.

[14] R. Pintelon and J. Schoukens, System Identification: A Frequency Domain Approach, IEEE Press, New York, NY, USA, 2001.

[15] E. Van der Ouderaa, J. Schoukens, and J. Renneboog, “Peak factor minimization using a time-frequency domain swapping algorithm,” IEEE Transactions on Instrumentation and Measurement, vol. 37, no. 1, pp. 145–147, 1988.

[16] K. R. Godfrey, A. H. Tan, H. A. Barker, and B. Chong, “A survey of readily accessible perturbation signals for system identification in the frequency domain,” Control Engineering Practice, vol. 13, no. 11, pp. 1391–1402, 2005.

[17] A. van den Bos and R. G. Krol, “Synthesis of discrete-interval binary signals with specified Fourier amplitude spectra,” International Journal of Control, vol. 30, no. 5, pp. 871–884, 1979.

[18] K.-D. Paehlike and H. Rake, “Binary multifrequency signals—synthesis and application,” in Proceedings of the 5th IFAC Symposium on Identification and System Parameter Estimation, vol. 1, pp. 589–596, Darmstadt, Germany, September 1979.

[19] I. Kollar, Frequency Domain System Identification Toolbox V3.3 for Matlab, Gamax Ltd, Budapest, Hungary, 2004.

[20] S. T. Nichols and L. P. Dennis, “Estimating frequency-response function using periodic signals and the f.f.t.,” Electronics Letters, vol. 7, no. 22, pp. 662–663, 1971.

[21] J. S. Bendat and A. G. Piersol, Random Data: Analysis and Measurement Procedures, John Wiley & Sons, New York, NY, USA, 2000.


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2009, Article ID 619805, 20 pages, doi:10.1155/2009/619805

Research Article

The Acoustic and Perceptual Effects of Series and Parallel Processing

Melinda C. Anderson,1 Kathryn H. Arehart,1 and James M. Kates1, 2

1 SLHS-Speech, Department of Speech, Language, and Hearing Science, University of Colorado at Boulder, 409 UCB, Boulder, CO 80309, USA

2 GN Resound, 2601 Patriot Blvd., Glenview, IL 60026, USA

Correspondence should be addressed to Melinda C. Anderson, [email protected]

Received 8 December 2008; Revised 5 June 2009; Accepted 5 August 2009

Recommended by Torsten Dau

Temporal envelope (TE) cues provide a great deal of speech information. This paper explores how spectral subtraction and dynamic-range compression gain modifications affect TE fluctuations for parallel and series configurations. In parallel processing, algorithms compute gains based on the same input signal, and the gains in dB are summed. In series processing, output from the first algorithm forms the input to the second algorithm. Acoustic measurements show that the parallel arrangement produces more gain fluctuations, introducing more changes to the TE than the series configurations. Intelligibility tests for normal-hearing (NH) and hearing-impaired (HI) listeners show (1) parallel processing gives significantly poorer speech understanding than an unprocessed (UNP) signal and the series arrangement and (2) series processing and UNP yield similar results. Speech quality tests show that UNP is preferred to both parallel and series arrangements, although spectral subtraction is the most preferred. No significant differences exist in sound quality between the series and parallel arrangements, or between the NH group and the HI group. These results indicate that gain modifications affect intelligibility and sound quality differently. Listeners appear to have a higher tolerance for gain modifications with regard to intelligibility, while judgments for sound quality appear to be more affected by smaller amounts of gain modification.

Copyright © 2009 Melinda C. Anderson et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Many of the 31.5 million Americans with hearing loss are candidates for hearing aids [1]. While recent clinical trials document the benefit of hearing aids (e.g., [2]), only 20–40% of individuals who are candidates actually own them [1, 3]. Approximately 65–75% of those who wear hearing aids are satisfied with their instruments [1, 3]. Results of Kochkin [1, 4] indicate that much of the potential market for hearing aids is not being reached, and that those who are users of hearing aids are not completely satisfied with their instruments.

Numerous factors contribute to the lack of satisfaction, some related to speech intelligibility and sound quality. Of the several factors that Kochkin [4] identified as being strongly correlated to overall customer satisfaction, three are related to sound quality: “clarity of sound,” “natural sounding,” and “richness or fidelity of sound.” Four others are related to speech intelligibility, including “overall benefit,” “ability to hear in small groups,” “one-on-one conversation,” and “listening to TV.” Many of these factors relate to listening in the presence of background noise. Hearing aid signal processing algorithms attempt to improve intelligibility and/or quality in noisy speech by reducing the amount of noise or increasing the audibility of low level speech sounds through modifications to the temporal envelope. Often, however, listeners fail to benefit from these signal processing algorithms when compared to simple linear processing, where the only signal modification is amplification (e.g., [2, 5–8]).

Increasingly, commercial hearing aid manufacturers are implementing complex signal processing designs, such as dynamic range compression and noise reduction. These algorithms modify the temporal envelope of an acoustic signal, particularly the low-level portions of the signal.


[Figure 1 blocks: (a) series: input signal → first signal processing algorithm (gain modified) → modified signal to second processing algorithm (gain modified again) → output signal; (b) parallel: input signal feeds the first and second signal processing algorithms (each computes a gain in dB), and the gains are summed in dB to give the output gain.]

Figure 1: Block diagrams depicting (a) series and (b) parallel processing.

These low-level signals are typically either speech valleys or undesirable background noise. Unexpected modifications to the temporal envelope may result from interactions of multiple concurrent nonlinear signal processing algorithms, and may have unintended perceptual consequences. These interactions may play a determining role in an individual's satisfaction with hearing aids. When implementing different signal processing algorithms in commercial hearing aids, manufacturers strive to provide the best speech intelligibility and sound quality. Although there is conflicting evidence regarding the benefit of many of these signal processing algorithms, they are often included in commercial hearing aids because they appear to benefit at least some listeners with hearing loss in some environments (e.g., [5, 8–13]).

Dynamic range compression is used in most commercial hearing aids in an attempt to restore the audibility to a range of sounds by providing more amplification for low-level portions of the signal than high-level portions, and to normalize the loudness for hearing-impaired listeners [9]. Compression, however, does not always result in benefit for speech intelligibility or for speech quality (e.g., [2, 6, 7, 13]). In a review of the compression literature, Souza [9] cites evidence showing that dynamic range compression algorithms improve intelligibility for low-level speech sounds in quiet, but intelligibility is similar to a linear system in the presence of background noise. For example, Shi and Doherty [12] found compression improved speech intelligibility for soft levels (60 dB SPL) in quiet and in reverberation although judgments of clarity were similar between compressed and linear speech. The compression advantage appears to be found when audibility is a factor for low-level speech sounds. When in background noise, both low-level speech signals, as well as low-level noise, are affected by the compression algorithm. The lack of benefit in background noise may be due to a limited effective compression ratio and an increase in signal-to-noise ratio (SNR) when background noise is present [14].

In contrast to compression, spectral subtraction has the goal of decreasing the amplitude of low-level portions of the signal that are presumably noise in an attempt to provide a cleaner speech signal. Spectral subtraction involves estimation of the power spectrum of the assumed noise, followed by the removal of the noise spectrum from the noisy speech. This process is implemented in multiple channels and will adaptively reduce the gain in each channel in response to the noise estimate.

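To make the basic mechanism concrete, the following minimal sketch (Python with numpy; a generic power spectral subtraction rule with an assumed 12 dB attenuation floor, not the specific algorithm evaluated later in this paper) estimates the per-channel gains from a noise power estimate and the noisy-speech power in each channel.

import numpy as np

def spectral_subtraction_gains(noisy_power, noise_power, floor_db=-12.0):
    # Per-channel gains for a generic power spectral subtraction rule.
    # noisy_power / noise_power: per-band short-time power estimates.
    # floor_db: maximum attenuation allowed in any band (assumed value).
    clean_power = np.maximum(noisy_power - noise_power, 1e-12)   # estimated speech power
    gain = np.sqrt(clean_power / np.maximum(noisy_power, 1e-12)) # amplitude gain per band
    return np.maximum(gain, 10.0 ** (floor_db / 20.0))           # limit the attenuation

# Example: three bands with good, moderate, and poor SNR.
noisy = np.array([1.0, 0.2, 0.11])
noise = np.array([0.1, 0.1, 0.10])
print(20 * np.log10(spectral_subtraction_gains(noisy, noise)))

Bands dominated by noise receive attenuation up to the assumed floor, while bands with strong speech energy are left nearly unchanged, which is the behavior described above.
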
Both of these signal processing routines operate by altering the temporal envelope in an attempt to enhance audibility of the speech signal and improve speech intelligibility and quality. The temporal envelope is important for speech understanding because it carries cues for voicing, manner of articulation, and prosody [15]. A consequence of dynamic range compression is a reduction of the peak-to-valley ratio of the temporal envelope, which may result in a degradation of speech cues due to smoothing of the envelope. Spectral subtraction, in contrast, may reduce the amplitude of low-level speech sounds, which also has the potential to disrupt speech identification cues by reducing audibility of the low-level portions of speech that may be contained in the valleys of the signal. In addition to envelope manipulations, the processing may also introduce unintended distortion and spectral modifications (e.g., [14, 16–18]). These signal manipulations, which are intended to improve audibility, may actually reduce or alter the speech information available to the listener.

The interaction of competing signal processing algorithms may affect the acoustic output of a hearing aid, and subsequently affect speech intelligibility and quality. The signal processing interaction, and the associated acoustic and perceptual consequences, may differ for different orders of implementation (parallel or series). In both parallel and series digital processing, an incoming acoustic signal is converted from analog to digital and analyzed using a frequency filter bank. In series construction (Figure 1(a)), the filtered digital signal is processed by the first signal processing algorithm, where gain modifications are made. The output from the first forms the input to the second algorithm, where additional gain modifications take place. In a parallel construction (Figure 1(b)), the filtered digital signal is used as the common input to the two signal processing algorithms. The gains in dB calculated from each signal processing algorithm are added.

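The consequence of the two arrangements for the applied gain can be illustrated with a small sketch (Python; the two gain rules are hypothetical placeholders, not the algorithms described in Section 2). In the series case the second algorithm computes its gain from the level already modified by the first; in the parallel case both gains are computed from the same input level and summed in dB.

def noise_reduction(level_db):
    # Placeholder noise-reduction rule: a constant 10 dB attenuation.
    return -10.0

def compression(level_db):
    # Placeholder compression-like rule: more gain for lower levels, capped at 12 dB.
    return max(0.0, min(12.0, (65.0 - level_db) * 0.5))

def series_gain_db(level_db):
    # Series: the compressor sees the level after the noise-reduction gain.
    g1 = noise_reduction(level_db)
    g2 = compression(level_db + g1)
    return g1 + g2

def parallel_gain_db(level_db):
    # Parallel: both rules see the original level; their dB gains are summed.
    return noise_reduction(level_db) + compression(level_db)

for level in (50.0, 70.0):
    print(level, series_gain_db(level), parallel_gain_db(level))

With these placeholder rules the compressor adds back more gain in the series arrangement, since it sees an attenuated input, partially offsetting the noise reduction; in the parallel arrangement the attenuation is preserved. This is the interaction examined in the next paragraph.
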
Parallel processing is expected to produce greater gain fluctuations than series processing when dynamic range compression and spectral subtraction are implemented together. These fluctuations would have an increased effect on the slowly varying portion of the temporal envelope. In series processing, the spectral subtraction routine reduces gain in noisy bands, providing an input to the compression routine that has been significantly reduced. This, in turn, causes the compression routine to increase the amount of gain given to the signal. The increased gain given by the compression algorithm can neutralize the spectral subtraction routine and provide minimal gain modification to the final output. By comparison, parallel processing for the same input will allow for greater gain modifications because the compression routine is acting on the more intense bands of the original signal that have not been affected by spectral subtraction. As a result, there are smaller gain increases from the compression algorithm, allowing the spectral subtraction algorithm to have more influence. The increased effects of the noise reduction would cause greater fluctuations in the gain prescription for parallel processing. Although the algorithms used to calculate gain modifications are the same for both arrangements, the combinations may differ in their final acoustic characteristics and perceptual consequences. This paper focuses on the differences in gain variation over the short-time spectrum of the signal, controlling for possible differences in the long-term spectrum caused by the signal processing algorithms and their interactions.

Given the potential for differences between parallel and series arrangements, there is surprisingly little published literature regarding this topic. Franck et al. [19] reported results from a study that investigated signal processing interactions using two types of signal processing strategies, dynamic range compression and spectral enhancement. When used by itself, the spectral enhancement scheme was found to improve vowel perception, but not consonant perception. The addition of single-channel compression did not alter speech intelligibility. When multichannel compression was combined with spectral enhancement, speech intelligibility decreased significantly. The authors speculated that this may be because the addition of multichannel compression alters the temporal components beyond a level that is beneficial for the listeners. A threshold may exist where signal processing modifications to the acoustic signal are minimal enough to not degrade the signal and still provide benefit, but that point is likely to differ for each individual. The authors point to the fact that different results may be found if multichannel compression was implemented before spectral enhancement, as this order difference would alter the acoustic content at the output.

Chung [20] compared the acoustic and perceptual differences between series and parallel algorithm construction in commercial hearing aids using measures of noise reduction effectiveness, speech intelligibility, and sound quality. Manufacturer-specific forms of dynamic range compression and noise reduction were used in the experiments, although the noise reduction schemes were all based on analysis of the modulation of the signal. Even though Chung did not control for the type of compression and noise reduction used, the results showed perceptual and acoustic effects for different algorithm arrangements. The acoustic differences were evaluated by recording the outputs of the hearing aids on KEMAR at SNRs of 10, 5, 0, and −5 dB using lists from the Hearing-In-Noise Test (HINT) [21] with speech spectrum noise and white noise. The aids were set to linear or 3:1 compression, both with and without noise reduction. The noise reduction algorithms were set using the fitting software associated with each hearing aid manufacturer. Noise levels were measured with the noise reduction on and the noise reduction off at predetermined points between sentences, after gain modifications were thought to have settled. The metric of noise reduction effectiveness was then quantified in terms of the difference in the level of the noise between the noise reduction off and the noise reduction on conditions. Analyses of the acoustic measures for the series constructions indicate that when compression was activated after noise reduction, the effects of noise reduction were reduced. The parallel constructions were found to have equivalent or increased noise reduction when the compression scheme was activated. Because Chung used manufacturer-specific forms of signal processing, some of the algorithm arrangements were speculated to be either serial or parallel, but were not definitely known. Even with this limitation, the acoustic measures were consistent with the idea that series and parallel constructions will result in different acoustical outcomes.

In addition to acoustic measures, Chung also performed perceptual measures of speech intelligibility and sound quality for two different hearing aids. The hearing aids were from different manufacturers, and one utilized a parallel construction while the second utilized a series construction. The stimuli were recorded on KEMAR using HINT lists in four conditions: linear with noise reduction on and noise reduction off, and 3:1 compression with noise reduction on and noise reduction off. The prerecorded stimuli were played monaurally to normal-hearing listeners. Intelligibility was measured in terms of percent of words correctly identified in one sentence list for each of the eight conditions (2 hearing aids × 4 conditions). Sound quality was measured by a paired comparison task of nine conditions using six sentences from the intelligibility experiment. The conditions included comparisons of compression, noise reduction, and combinations of compression and noise reduction. No significant differences in intelligibility were observed between parallel and series constructions. In contrast, listeners gave higher sound quality ratings to the parallel construction when both noise reduction and compression were implemented. It remains unclear, however, if the perceptual differences are due to the order of implementation (parallel or series) or because different brands of hearing aids were used.

Chung’s [20] work highlights the possible acoustic and perceptual effects of specific signal processing algorithm arrangements, and provides an impetus for further research. The present study represents one step in this direction by directly controlling the types and amounts of signal processing in series and parallel arrangements and including both listeners with normal hearing and listeners with hearing loss.

This paper considers the gain fluctuations, and their subsequent envelope effects, for dynamic range compression and spectral subtraction in both parallel and series constructions. These algorithms were chosen because of their potential for divergent gain modifications, especially for low-level portions of the signal. It is known a priori that the parallel construction will have greater effects on the envelope of the signal, due to greater fluctuations in the gain prescriptions. In order to ensure maximum differences in the resulting gains, the compression and spectral subtraction parameters were the same in both parallel and series implementations. In addition, the long-term average speech spectrum was matched between the unprocessed signal and the processed signals, ensuring that the only audible difference between the configurations is the short-time envelope fluctuations due to the gain functions. Differences between the two constructions were quantified using both acoustic and perceptual measures for listeners with normal hearing and listeners with hearing loss. The purpose of the study is to focus on one specific element of the differences between parallel and series (envelope effects resulting from gain fluctuations), which represents an initial step in examining the implementations currently used in hearing aid processing.

2. General Methods

2.1. Warped Filter Bank. Frequency warping has two key characteristics: it provides a frequency resolution very similar to that of the human auditory system, and it reduces overall time delay compared to a fast Fourier transform (FFT) or linear-phase filter bank having comparable frequency resolution. Increasing the frequency resolution of the output requires a longer time to process the signal. Frequency warping has a frequency-dependent variable delay, with lower frequencies having longer delays. These across-frequency differences in time delay have been shown to be inaudible for most listeners [22]. Each processing condition was implemented using the same system design. The sampling rate was 22.05 kHz. The filter bank was an extension of the warped compression filter bank [22]. This design utilized an FFT-based side-branch system for the frequency analysis, after which the desired gain versus frequency response was transferred to an equivalent warped FIR filter. The input signal was then convolved with the filter to produce the output. The resolution of the frequency analysis performed in the side branch is limited by the size of the warped FFT and its associated buffer. In this study, the warped filter cascade had 33 first-order all-pass filter sections, and a 34-point FFT was used to give 18 frequency analysis bands from 0 to 22.05 kHz. The compression gains were determined in the frequency domain, transformed into an equivalent warped time-domain filter, and the input signal was then convolved with the time-varying gain. The gain filter was updated every 24 samples (1.09 ms). Forcing the filter coefficients to have even symmetry yields a fixed frequency-dependent group delay, in which the filter delay is independent of the coefficients as long as the symmetry is preserved. The warped filter coefficient symmetry works just like the symmetry in a linear-phase FIR filter. A linear-phase filter has a constant time delay at all frequencies no matter what filter coefficients are chosen. The symmetric warped filter has a group delay that is invariant with the selection of symmetric filter coefficients. The phase relationship between the filter output and its input at every frequency remains constant over time no matter how the magnitude frequency response varies, which ensures that no phase modulation occurs as the gain changes in response to the incoming signal. The warped frequency scale provides frequency resolution similar to that of the human auditory system, with each warped FFT bin approximately 1.33 critical bands wide. The center frequencies of the eighteen frequency bands are 0, 140, 282, 429, 583, 748, 929, 1129, 1358, 1625, 1946, 2345, 2860, 3555, 4541, 6001, 8168, and 11025 Hz.

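A minimal sketch of this style of warped frequency analysis is given below (Python with numpy). The warping coefficient, frame length, and windowing are illustrative assumptions and are not the values from [22]; the sketch only shows the structural idea: unit delays are replaced by a cascade of first-order all-pass sections, and an FFT taken across the cascade outputs at one instant yields analysis bands spaced approximately on an auditory frequency scale. With 33 sections the tap vector has 34 points, so a real FFT yields 18 bands, consistent with the description above.

import numpy as np

def warped_taps(x, n_sections=33, lam=0.65):
    # Pass x through a cascade of first-order all-pass sections
    # (H(z) = (-lam + z**-1) / (1 - lam*z**-1)) and return the input plus
    # each section output at the final sample.  lam is an assumed warping
    # coefficient; larger values stretch the low-frequency region more.
    taps = [x]
    current = x
    for _ in range(n_sections):
        y = np.zeros_like(current)
        prev_x = 0.0
        prev_y = 0.0
        for n, xn in enumerate(current):
            yn = -lam * xn + prev_x + lam * prev_y
            y[n] = yn
            prev_x, prev_y = xn, yn
        taps.append(y)
        current = y
    return np.array([t[-1] for t in taps])   # length n_sections + 1

def warped_spectrum(x, n_sections=33, lam=0.65):
    # Windowed FFT across the tap vector: 34 points -> 18 real-FFT bands.
    segment = warped_taps(x, n_sections, lam)
    return np.abs(np.fft.rfft(segment * np.hanning(len(segment))))

# Example: warped analysis of a 1 kHz tone at fs = 22.05 kHz.
fs = 22050.0
t = np.arange(512) / fs
print(warped_spectrum(np.sin(2 * np.pi * 1000.0 * t)).round(2))
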
2.2. Dynamic Range Compression. This study employs a dynamic range compression system using the frequency warping filter bank described earlier [22]. The specific characteristics of the compression algorithm include a compression ratio of 2:1 in all bands, a lower compression kneepoint set at 45 dB SPL, infinite compression for sounds over 100 dB SPL, attack times of 5 ms, and release times of 70 ms [23]. The choice of compression ratio and kneepoint results in a maximum of 12 dB of compression gain. The compressor has a fast response to increases in signal level, due to the need to limit the intensity of the signal. The slower response to a decrease in signal level is an attempt to smooth gain perturbations. An envelope detector determines the intensity of the signal and forms the input to the compressor [24]. The envelope detector, in this case a peak detector, uses the fast attack time (5 ms) for tracking increases in signal level and the slower release time (70 ms) for tracking decreases in signal level. When the signal level increases above the previous peak detector output, the new peak detector output mixes in a large amount of the input absolute value and rises rapidly toward the new level. When the input is smaller than the previous peak detector output, the new peak detector output decays at a constant rate in dB/sec.

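A sketch of one band of such a compressor is shown below (Python with numpy). The attack behavior, the 2:1 ratio above a 45 dB SPL kneepoint, and the 12 dB gain ceiling follow the description above; the dB-per-second release rate and the exact shape of the static gain rule are illustrative assumptions, and the limiting stage above 100 dB SPL is omitted.

import numpy as np

FS = 22050.0
ATTACK_COEF = np.exp(-1.0 / (0.005 * FS))   # ~5 ms rise toward a new peak
RELEASE_DB_PER_SEC = 100.0                  # assumed release decay rate

def peak_detect(x):
    # Peak detector for one band: fast attack, constant dB/s release.
    decay = 10.0 ** (-RELEASE_DB_PER_SEC / (20.0 * FS))   # per-sample factor
    env = np.zeros(len(x))
    level = 1e-6
    for n, xn in enumerate(x):
        a = abs(xn)
        if a > level:
            level = ATTACK_COEF * level + (1.0 - ATTACK_COEF) * a
        else:
            level = max(level * decay, 1e-6)
        env[n] = level
    return env

def compression_gain_db(level_db, kneepoint=45.0, ratio=2.0, max_gain=12.0):
    # One way to realize a 2:1 rule above a 45 dB SPL kneepoint with a
    # 12 dB gain ceiling (the limiter above 100 dB SPL is omitted here).
    gain = max_gain - (level_db - kneepoint) * (1.0 - 1.0 / ratio)
    return float(np.clip(gain, 0.0, max_gain))

# Example: envelope of a short burst and the static gain at a few band levels.
x = np.concatenate([np.zeros(200), 0.5 * np.ones(400), np.zeros(400)])
print(peak_detect(x)[::200].round(3))
for level in (40.0, 55.0, 70.0):
    print(level, compression_gain_db(level))
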
2.3. Spectral Subtraction. The spectral subtraction routine used here is based on the method originally developed by Tsoukalas et al. [25, Eq. 26] and bases gain modifications on estimates of the SNR, taking into account the threshold for audibility of noise by the listener. Spectral subtraction begins with the assumption that the noise is stationary. Spectral subtraction is most effective for noisy signals that show little fluctuation, such as machine or highway noise [24]. When the background noise is composed of speech, spectral subtraction will perform more poorly, due to the fluctuating nature of the noise [24]. Our implementation of the Tsoukalas et al. [25] algorithm employs the same 18-band filter bank used in the compression algorithm and described above. The processing parameter ν = 1. The reader is referred to Tsoukalas et al. [25] for details on processing. Bands with poor SNRs receive large amounts of gain reduction, while bands with good SNRs receive little to no gain reduction. The maximum amount of attenuation provided by the spectral subtraction algorithm is 12 dB. In order to optimize the spectral subtraction algorithm, the noise estimate was computed from the noise signal alone before the speech and noise were combined.

Figure 2: Block diagrams depicting the five processing conditions, showing each stage of processing.

The spectral subtraction algorithm used here was chosen because it is less likely to introduce unwanted distortions, such as musical noise and gain-modulation effects: it only affects perceptually audible portions of the spectrum rather than the entire signal. Furthermore, the gain remains constant at poor SNRs, rather than continuously decreasing. Signals that are judged to be noise will receive a constant attenuation, with little fluctuation. Tsoukalas et al. [25] showed improved speech intelligibility and sound quality for normal-hearing listeners in stationary noise. Employing a similar algorithm, Arehart et al. [10] reported significant intelligibility benefit in communication-channel noise, and sound quality benefit in both communication-channel noise and highway noise, for listeners with normal hearing and listeners with hearing loss.

2.4. Experimental Configurations. This study included five processing conditions. (An additional condition, series backward, was included in the data collection. In this condition, the compression output formed the input to the spectral subtraction routine. This condition would be unlikely to be used in a real-world setting because the use of compression prior to spectral subtraction will alter the physical waveform of the input and decrease the effectiveness of the noise estimation. Because of this limitation, the series backward condition was not included in discussion of the acoustic or perceptual consequences.) As shown in Figure 2, all conditions, including the “unprocessed (UNP)” condition, were processed through the warped filter bank, ensuring that all stimuli were subjected to the same frequency-dependent group delay. The UNP condition was included as a control. The compression and spectral subtraction algorithms were tested in isolation, allowing for direct examination of the effects of each algorithm by itself. Two conditions combined the signal processing algorithms: parallel processing and series processing. In the parallel condition the compression and spectral subtraction were computed independently on the same input signal, and the gain prescriptions in dB were summed. In the series condition, the spectral subtraction output formed the input to the compression algorithm. In order to compensate for the high-frequency boost provided by the compression scheme, all stimuli were processed individually through a spectrum equalization scheme that matched the processed signal’s long-term average speech spectrum to that of the original speech using a 1024-point linear-phase digital filter. The long-term spectrum equalization scheme applied a consistent amount of gain which was dependent on the processing condition. The purpose of the spectrum equalization scheme was to ensure that differences in signal loudness were not available as perceptual cues [26]. The only audible differences in intensity for the stimuli are based on short-time envelope fluctuations over the course of the signal. The compression and spectral subtraction algorithms were not individually optimized for each experimental condition, but were rather kept constant so that direct comparisons between the processing conditions would be possible.

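The long-term spectrum matching can be sketched as follows (Python with numpy and scipy; the Welch-based spectrum estimates and the odd FIR length are assumptions made so the example is self-contained, the text itself specifying only a 1024-point linear-phase filter). The ratio of the long-term spectra of the original and processed signals defines a target magnitude response, which is realized as a linear-phase FIR filter and applied to the processed signal.

import numpy as np
from scipy import signal

FS = 22050

def match_long_term_spectrum(processed, reference, numtaps=1025):
    # Design a linear-phase FIR whose magnitude response is the square root
    # of the ratio of the two long-term (Welch) power spectra, then filter
    # the processed signal with it.  numtaps is odd (close to the 1024-point
    # filter in the text) so a nonzero Nyquist response is allowed.
    f, p_proc = signal.welch(processed, fs=FS, nperseg=1024)
    _, p_ref = signal.welch(reference, fs=FS, nperseg=1024)
    target = np.sqrt(p_ref / np.maximum(p_proc, 1e-12))
    target = np.clip(target, 10.0 ** (-1.0), 10.0 ** (1.0))  # keep within +/-20 dB
    eq = signal.firwin2(numtaps, f, target, fs=FS)
    return signal.lfilter(eq, 1.0, processed)

# Example with synthetic signals standing in for original and processed speech.
rng = np.random.default_rng(0)
reference = rng.standard_normal(FS)
processed = signal.lfilter([1.0], [1.0, -0.5], reference)   # spectrally tilted copy
equalized = match_long_term_spectrum(processed, reference)
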
All signals were amplified for the listeners with hearing loss to compensate for reduced audibility of the signal. Custom linear amplification using NAL-R [27] took place after all other signal processing was completed. The use of a linear amplification scheme avoided the signal distortion that can result from nonlinear amplification schemes, while still allowing for frequency-specific amplification.

2.5. Stimuli. Two different corpora were used in this study: two concatenated sentences from the HINT [21] were used in the acoustical analysis (Section 3) and the speech quality experiment (Section 5), and IEEE sentences [28] were used in the speech intelligibility experiment (Section 4). (The two HINT sentences were “The yellow pears taste good. The boy got into trouble.”) All stimuli were digitized at 44.1 kHz and were downsampled to 22.05 kHz to reduce computation time. The background noise was a speech-shaped stationary noise, presented at SNRs ranging from 6 to −6 dB. Since the overall intensity of the noisy speech was kept at a constant level, the speech intensity was incrementally reduced as the noise level increased.

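The level handling described above can be sketched as follows (Python with numpy; RMS is used as the level measure, which is an assumption since the exact metric is not stated). The noise is scaled relative to the speech to reach the requested SNR, and the mixture is then rescaled to a fixed overall level, so the speech level necessarily drops as the SNR worsens.

import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def mix_at_snr(speech, noise, snr_db, target_rms):
    # Scale the noise to sit snr_db below the speech (RMS-based), then
    # rescale the mixture to a fixed overall level, so the speech level
    # drops as the noise level rises.
    noise = noise[: len(speech)]
    noise = noise * (rms(speech) / rms(noise)) * 10.0 ** (-snr_db / 20.0)
    mix = speech + noise
    return mix * (target_rms / rms(mix))

# Example with white noise standing in for speech and speech-shaped noise.
rng = np.random.default_rng(1)
speech = rng.standard_normal(22050)
noise = rng.standard_normal(22050)
for snr in (6, 0, -6):
    print(snr, round(float(rms(mix_at_snr(speech, noise, snr, 0.05))), 4))
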
3. Acoustic Analysis

The purpose of this section is to describe and understand the acoustic differences among the four processing conditions used for the gain modifications. We were primarily interested in the changes to the temporal envelope caused by the amount of gain fluctuation in each condition. We used three measures to do this: an audibility analysis, difference spectrograms, and gain versus time analysis.

Given the potential for the spectral subtraction routine to decrease the signal level below the level of audibility for listeners with hearing loss, an audibility analysis was conducted. Each of the four processing conditions with gain modifications was analyzed to look for differences in audibility. Figure 3 displays the total output at 6 dB SNR for each of the four processing conditions from all stages of processing (see Figure 2), including the original signal level, gain modifications from the signal processing algorithms, long-term spectrum equalization, and NAL-R amplification. The hearing loss depicted is the average threshold at that frequency for all listeners in the HI group. Three bands were included in this analysis: a low frequency (band 5 with center frequency 583 Hz), a middle frequency (band 10 with center frequency 1625 Hz), and a high frequency (band 15 with a center frequency of 4541 Hz). The graphs in the left panels illustrate that compression has the most audibility, consistent with its intent. Also consistent with expectations, spectral subtraction has the least audibility, reducing the low-level portions of the signal. The right panels of Figure 3, depicting the series and parallel implementations, show very little difference in audibility. Some of the low-level sounds in each condition were below the average impaired auditory threshold, especially in band 15, but are comparable between processing conditions.

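In essence, the audibility analysis asks, for each band and processing condition, how much of the output remains above the listeners' average threshold. A minimal version of that check is sketched below (Python with numpy; the band levels and the threshold value are placeholders, not the measured data shown in Figure 3).

import numpy as np

def fraction_audible(band_level_db_spl, threshold_db_spl):
    # Fraction of analysis frames in which a band's output level exceeds
    # the (average) hearing threshold for that band, both in dB SPL.
    levels = np.asarray(band_level_db_spl, dtype=float)
    return float(np.mean(levels > threshold_db_spl))

# Placeholder example: a band fluctuating around 55 dB SPL compared with
# a hypothetical 50 dB SPL threshold.
rng = np.random.default_rng(2)
band_levels = 55.0 + 8.0 * rng.standard_normal(2000)
print(fraction_audible(band_levels, 50.0))
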
To explore the output variations among the processing conditions, difference spectrograms were derived to provide a visual representation of the changes caused by the signal processing conditions. Difference spectrograms, shown in Figure 4, subtract one processed condition from another, giving an absolute representation in dB of the difference between the outputs of the two processing conditions. In other words, the difference spectrograms depict the final envelope differences in the output that may be audible to a listener. The spectrograms were derived by normalizing the bin with the maximum difference to 0 dB (black shading) and then setting all smaller differences on a gray scale extending to −30 dB below the maximum difference (white shading). The left panel in Figure 4 shows compression minus spectral subtraction, while the right panel shows parallel minus series, all at 6 dB SNR. The primary energy difference in both panels is below 2000 Hz, which is consistent with both the speech and the noise having a low-frequency concentration. The difference in the compression minus spectral subtraction spectrogram is due to the compression algorithm, since compression adds gain and spectral subtraction reduces gain. The difference in energy seen in the parallel minus series spectrogram may also be due to differences in the influence of the compression routine. The series condition has a less intense input to the compression algorithm because the signal has been reduced by the spectral subtraction algorithm. This less intense signal input means that more gain would be applied by the compression algorithm, effectively neutralizing the impact of the spectral subtraction algorithm in the series condition.

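A difference spectrogram of this kind can be computed with a short sketch (Python with numpy and scipy; the STFT parameters are assumptions, since the analysis settings behind Figure 4 are not listed). The two outputs are converted to dB-magnitude spectrograms, subtracted, and normalized so that the largest difference maps to 0 dB with a 30 dB display range.

import numpy as np
from scipy import signal

FS = 22050

def difference_spectrogram(x, y, nperseg=256):
    # dB-magnitude spectrograms of the two outputs, subtracted, then
    # normalized so the largest difference is 0 dB with a 30 dB range.
    f, t, sx = signal.stft(x, fs=FS, nperseg=nperseg)
    _, _, sy = signal.stft(y, fs=FS, nperseg=nperseg)
    dx = 20.0 * np.log10(np.maximum(np.abs(sx), 1e-8))
    dy = 20.0 * np.log10(np.maximum(np.abs(sy), 1e-8))
    diff_db = np.abs(dx - dy)
    diff_db -= diff_db.max()                    # maximum difference -> 0 dB
    return f, t, np.clip(diff_db, -30.0, 0.0)   # 30 dB display range

# Example with two slightly different noises standing in for two outputs.
rng = np.random.default_rng(3)
a = rng.standard_normal(FS)
b = a + 0.1 * rng.standard_normal(FS)
f, t, d = difference_spectrogram(a, b)
print(d.shape, float(d.max()), float(d.min()))
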
Gain versus time functions were used to quantify the disparities in energy in the processed signals. The stimuli used in the audibility and spectrogram analyses were further scrutinized in order to determine the total amount of gain produced by each signal processing routine at 6 dB SNR. Figure 5 shows the final amount of gain prescribed, including the long-term spectrum equalization gain and NAL-R for the listeners with hearing loss. Again, the fifth band, the tenth band, and the fifteenth band for the compression, spectral subtraction, series, and parallel arrangements were analyzed.

The left panels in Figure 5 show the total amount of gain for the compression and spectral subtraction algorithms when processed in isolation. Compression provides high levels of gain, and shows a significant amount of gain fluctuation over time. The spectral subtraction algorithm, however, shows considerably less gain, with a more constant attenuation over time. The right panels, depicting series and parallel processing, show gain modifications that fall between the two algorithms in isolation. The series and parallel conditions provide similar amounts of overall gain. However, the parallel implementation shows significantly greater amounts of gain fluctuation over time. The compression and parallel processing appear to have a greater impact on the relative structure of the temporal envelope over time.

Figure 3: Audibility analysis: the left panels show the level of the signal output at 6 dB SNR for compression (black) and spectral subtraction (gray) for bands 5 (cf = 583 Hz), 10 (cf = 1625 Hz), and 15 (cf = 4541 Hz). The right panels show the level of the output signal at 6 dB SNR for parallel processing (black) and series processing (gray) for the same bands. The average HI threshold (in dB SPL) is indicated by the dashed line.

Figure 4: Difference spectrograms: (a) shows compression minus spectral subtraction, and (b) shows parallel minus series. The spectrograms depict the difference in energy between the two contrasted processing conditions. For both (a) and (b), the primary region of difference is in the low frequencies, consistent with the fact that both the speech and noise have a low-frequency concentration. The spectrograms were derived by normalizing the bin with the maximum difference to 0 dB (black shading) and then setting all smaller differences on a gray scale extending to −30 dB below the maximum difference (white shading).

4. Speech Intelligibility

The effects on speech intelligibility of different signal processing arrangements were quantified in both a group of listeners with normal hearing and a group of listeners with hearing loss.

4.1. Subjects. The speech intelligibility experiment included 14 listeners with normal hearing (NH) with an age range of 23–72 years (average = 45 years) and 14 listeners with hearing loss (HI) with an age range of 55–79 years (average = 66 years). All listeners underwent an audiometric evaluation at their initial visit. Listeners in the NH group had air conduction thresholds of 20 dB HL or better at octave frequencies from 250 to 8000 Hz, inclusive [29]. The listeners in the HI group had audiologic test results consistent with cochlear impairment: normal tympanometry, acoustic reflexes consistent with the level of cochlear loss, and absence of an air-bone gap exceeding 10 dB at any frequency. Listeners with hearing loss were required to have at least a mild loss. Some of the listeners in the HI group were hearing aid users, although there was no requirement that they be so. Table 1 shows audiometric air conduction thresholds for the listeners in the HI group. All participants were recruited from the Boulder/Denver metro area and were native speakers of American English. For the perceptual listening tasks, subjects were tested monaurally and individually in a double-walled sound-treated booth. Listeners were compensated $10/hour for their participation.

4.2. Experimental Protocol. IEEE sentences (cf. Section 2) were mixed with background noise at 5 levels (6, 3, 0, −3, and −6 dB SNR). The same five conditions presented in Figure 2, as well as a clean speech token (UNP speech with no background noise), were tested. Listeners participated in one hour of intelligibility testing. Each listener heard a random selection of 156 sentences of the possible 720 sentences that are included in the IEEE corpus. (Listeners were tested on an additional 30 sentences in the experimental intelligibility session. These sentences were processed using the series backward condition described in footnote 1. This condition was not included in the final data analysis.) Instructions given to listeners are included in the appendix.

The 156 sentences were divided into four blocks. The first block of sentences was a practice block consisting of one token from each condition, plus a clean speech token. Each subject then listened to the test sentences divided into three blocks. The processing conditions and SNRs were randomized, and each block of trials contained all conditions of signal processing and SNRs. No feedback was given during testing. Percent correct scores were calculated by dividing the number of words correctly repeated by the total number of target words presented (25; 5 words per sentence × 5 repetitions).

The average playout level was 70 dB SPL, plus the additional linear amplification (NAL-R) for the listeners with hearing loss. The digitally stored speech stimuli were processed through a digital-to-analog converter (TDT RX8), an attenuator (TDT PA5), and a headphone buffer amplifier (TDT HB7), and then presented monaurally to the listener’s test ear through a Sennheiser HD 25-1 earphone.

4.3. Results. Figure 6 shows intelligibility scores (in percent correct) for all SNRs, grouped by processing condition, with each panel displaying a different processing condition. This figure shows a consistent monotonic increase in intelligibility as SNR increases for all five processing conditions. Figure 7 displays the same data, grouped by SNR for each processing condition. Intelligibility varies for each processing condition, with compression typically showing the lowest scores and UNP the highest scores. Overall, listeners in the NH group performed better than listeners in the HI group by an average of 19 percentage points (range: 17.5 percentage points to 23.1 percentage points). The variance for each group is similar.

Figure 5: Gain versus time plots depicting the total gain modifications at 6 dB SNR for bands 5 (cf = 583 Hz), 10 (cf = 1625 Hz), and 15 (cf = 4541 Hz) for the average HI listener. The gain modifications include the gain prescribed by the signal processing algorithm(s), the long-term spectrum equalization, and NAL-R. Compression (black) and spectral subtraction (gray) are shown in the left panels. Parallel processing (black) and series processing (gray) are shown in the right panels.

Table 1: Listener thresholds in dB HL at the audiometric frequencies (Hz) listed. Test ear is marked with an asterisk. I = intelligibility participant, Q = quality participant, NR = no response.

Subject Ear Sex Age 250 500 1000 2000 3000 4000 6000 8000 Experiment

HI 1 R F 79 35 40 50 45 45 55 50 60 I, Q

L∗ 30 40 45 45 45 45 45 75

HI 2 R∗ F 63 25 30 25 40 40 45 65 65 I, Q

L 25 30 25 35 30 45 60 60

HI 3 R∗ M 66 40 25 20 40 75 90 85 75 I, Q

L 35 25 15 25 70 85 80 75

HI 4 R F 69 20 25 30 35 40 50 40 35 I, Q

L∗ 20 30 30 35 35 50 40 50

HI 5 R∗ F 53 15 10 10 40 45 60 50 30 I, Q

L 15 5 15 45 — 35 30 25

HI 6 R∗ F 62 30 35 40 40 — 40 — 30 I

L 20 30 20 15 — 20 — 25

HI 7 R F 74 50 60 70 65 60 65 70 70 I

L∗ 50 55 55 55 55 60 65 70

HI 8 R F 70 50 35 30 35 40 35 40 45 I

L∗ 55 45 35 35 40 40 45 50

HI 9 R∗ F 55 20 30 50 60 70 75 — 65 I

L 20 25 60 70 65 70 — 55

HI 10 R∗ F 72 25 25 30 40 — 45 60 65 I

L 20 25 25 45 — 40 — 35

HI 11 R M 64 25 30 45 45 40 45 40 35 I

L∗ 15 25 40 45 40 50 50 45

HI 12 R∗ M 62 20 15 20 30 30 45 45 50 I

L NR NR NR NR NR NR NR NR

HI 13 R∗ M 60 15 20 35 50 50 50 30 10 I

L 30 15 20 40 55 50 40 20

HI 14 R∗ F 65 45 35 40 55 65 70 65 65 I

L 25 35 35 50 60 65 70 80

HI 15 R M 78 30 25 15 25 45 45 55 75 Q

L∗ 35 30 15 35 45 60 65 70

HI 16 R∗ F 75 25 25 35 40 25 45 65 60 Q

L 30 15 20 20 65 65 60 90

HI 17 R F 68 10 10 20 30 35 30 15 15 Q

L∗ 10 15 20 35 40 45 25 25

HI 18 R∗ M 74 30 35 75 85 75 80 75 70 Q

L 25 30 80 NR NR 110 NR NR

HI 19 R∗ F 57 10 15 20 20 35 50 55 60 Q

L 15 15 15 29 25 40 45 55

HI 20 R F 36 30 45 50 60 — 70 80 85 Q

L∗ 45 45 60 65 — 75 80 85

HI 21 R M 57 15 10 5 30 60 65 55 55 Q

L∗ 10 10 10 40 55 55 50 65

HI 22 R∗ F 82 45 35 30 25 30 45 50 55 Q

L 35 25 25 20 35 40 50 55

HI 23 R M 79 20 20 25 50 70 90 90 80 Q

L∗ 15 15 15 40 55 55 45 65

Figure 6: Intelligibility scores (in percent correct) for listeners with normal hearing (open triangles) and listeners with hearing loss (closed circles) grouped by processing condition. Error bars represent the standard error.

For statistical analysis, the percent correct scores were subjected to an arcsine transform [30] and then submitted to a repeated measures analysis of variance (ANOVA) (Table 2). This analysis revealed that the factors of processing condition, SNR, and group were all statistically significant. None of the interaction terms were significant.

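For reference, a minimal version of the arcsine (angular) transform applied to the proportion-correct scores is sketched below (Python with numpy); note that reference [30] may prescribe a rationalized variant with slightly different scaling, so this is only the textbook form.

import numpy as np

def arcsine_transform(p):
    # Variance-stabilizing arcsine transform of a proportion p in [0, 1].
    p = np.clip(np.asarray(p, dtype=float), 0.0, 1.0)
    return 2.0 * np.arcsin(np.sqrt(p))

# Example: proportion-correct scores transformed before analysis.
print(arcsine_transform([0.10, 0.50, 0.90]))
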
In order to explore the data more fully, post-hoc t-tests were completed using Bonferroni correction (P < .01). The t-tests made individual comparisons between the UNP condition and each of the other four conditions. In addition, the series and parallel data were compared to each other to see if the differences seen in the acoustic analysis resulted in differences in speech intelligibility. Examination of the raw data indicates that both groups of listeners showed the best performance in the UNP condition. Table 2 displays the results of the t-tests for UNP × compression, UNP × spectral subtraction, UNP × parallel, and UNP × series. All tests were significant for the factors of SNR and group, with the NH group outperforming the HI group and all scores improving as SNR improved. The t-tests showed that the factor of processing condition was significant in three of the four comparisons, with intelligibility significantly better in the UNP condition than in the compression, spectral subtraction, and parallel conditions. Interestingly, the factor of processing condition in the UNP × series comparison was not significant, indicating that there was no difference in intelligibility scores between these two processing conditions.

Finally, the parallel × series comparison was also found to be significant for the factors of processing condition, SNR, and group. In this case, listeners had better speech intelligibility with series processing. Consistent with the other comparisons, scores were better overall in the NH group and improved as SNR improved.

4.4. Discussion. The results from the intelligibility experiment show that speech understanding significantly differed based on processing condition. The UNP condition provided the best intelligibility, although series was not significantly worse than UNP. The listeners in the NH group consistently outperformed listeners in the HI group, although the performance trend across processing conditions is similar for both groups.

These findings are consistent with previous research showing that listeners do not perform better with increased signal manipulation (e.g., [5–7, 19]). The acoustic analysis shows that, out of the processing conditions with gain modifications, series provided the least amount of gain fluctuation while maintaining audibility. The increase in temporal envelope manipulation through gain fluctuation may be one factor in the decline in scores in the compression and parallel conditions. The natural speech fluctuations in the temporal envelope were more dramatically altered in the compression and parallel conditions because of the increased amount of gain fluctuation over time. This has the potential for disrupting speech identification cues by modifying the peak-to-valley ratio for individual speech sounds.

Figure 7: Intelligibility scores (in percent correct) for listeners with normal hearing (open triangles) and listeners with hearing loss (closed circles) grouped by SNR. Error bars represent standard error. Conditions are UNP: unprocessed, comp: dynamic range compression, ss: spectral subtraction, par: parallel, and ser: series.

The decrease in speech intelligibility seen in the spectral subtraction condition may be due to decreased audibility. The acoustic analysis revealed that the spectral subtraction condition, while having a relatively flat amount of gain fluctuation, reduced the gain of the signal significantly. This gain reduction may have placed some speech information out of reach of the listeners, to the detriment of speech understanding.

Results from earlier research regarding the benefit of multiple algorithm implementations have been inconclusive. Franck et al. [19] showed improvements in vowel identification with spectral enhancement only, but that benefit was removed when multichannel compression was added to the signal processing. The authors concluded that the increase in envelope distortion that occurred with the addition of compression may have negated any speech intelligibility benefit. Our findings are similar in that listeners’ intelligibility scores were degraded compared to the unprocessed signal when multiple algorithms were implemented in parallel. However, intelligibility scores were also degraded when each algorithm was implemented in isolation. When compared to the series condition, an increased amount of gain fluctuation for both compression and parallel processing may be responsible for the reduced intelligibility scores. It is possible that the amount of manipulation present in series is below a threshold of performance degradation, while the other processing conditions are above that threshold. Chung [20] did not find significant differences in intelligibility between the parallel and series constructions, although the fact that the signal processing algorithms were not controlled limits the comparisons that can be made.

In examination of the spectral subtraction routine, the present findings differ from those of Arehart et al. [10], who showed that listeners with normal hearing and listeners with hearing loss had improved speech intelligibility with spectral subtraction for nonsense syllable stimuli in a communication-channel background noise. In contrast, we found that sentence understanding decreased with a similar spectral subtraction algorithm. The lack of benefit seen in the current study may be due to the fact that a different form of spectral subtraction was implemented, along with differences in the speech materials for the speech-in-noise test.

An additional factor, audibility, may have more negatively affected listeners in the spectral subtraction condition. The audibility analysis revealed that spectral subtraction had the lowest levels of audibility among the processing conditions. Although the decrease in noise is consistent with the intent of the algorithm, it may be that the reduction in the low-level portions of the signal removed too much speech information, in addition to removing the noise.

Table 2: Results of repeated measures ANOVA for intelligibility scores. Statistical significance (marked with an ∗) is P < .01 due to Bonferroni correction.

Intelligibility statistics

Factor F Df P

Processing condition (pc): omnibus 9.079 4, 104 <.000∗

SNR 382.139 4, 104 <.000∗

group 24.821 1, 26 <.000∗

pc × SNR 1.489 16, 416 .15

pc × group .758 4, 104 .545

SNR × group 3.903 4, 104 .018

pc × SNR × group 1.315 16, 416 .227

pc: UNP × compression 22.4 1, 26 <.000∗

SNR 229.752 4, 104 <.000∗

group 22.152 1, 26 <.000∗

pc × SNR .523 4, 104 .719

pc × group .001 1, 26 .973

SNR × group 1.758 4, 104 .172

pc × SNR × group 2.469 4, 104 .061

pc: UNP × spectral subtraction 27.440 1, 26 <.000∗

SNR 194.649 4, 104 <.000∗

group 23.276 1, 26 <.000∗

pc × SNR 3.073 4, 104 .039

pc × group 2.148 1, 26 .155

SNR × group 2 4, 104 .113

pc × SNR × group 2.595 4, 104 .066

pc: UNP × parallel 29.453 1, 26 <.000∗

SNR 173.615 4, 104 <.000∗

group 19.224 1, 26 <.000∗

pc × SNR 1.119 4, 104 .346

pc × group .320 1, 26 .576

SNR × group 1.72 4, 104 .165

pc × SNR × group 2.246 4, 104 .091

pc: UNP × series 3.595 1, 26 .069

SNR 205.037 4, 104 <.000∗

group 25.121 1, 26 <.000∗

pc × SNR .431 4, 104 .751

pc × group 1.493 1, 26 .233

SNR × group 4.714 4, 104 .003∗

pc × SNR × group 1.393 4, 104 .248

pc: parallel × series 7.974 1, 26 .009∗

SNR 207.505 4, 104 <.000∗

group 22.096 1, 26 <.000∗

pc × SNR 1.402 4, 104 .245

pc × group .476 1, 26 .496

SNR × group 3.827 4, 104 .013

pc × SNR × group .778 4, 104 .522

Dynamic range compression has a history of reducing intelligibility (e.g., [9]), and our findings are no different. The most beneficial environment for compression is soft speech with no background noise, where there is an increase in audibility of the low-level portions of speech. Our results are consistent with other studies that show that compression of noisy speech does not yield intelligibility benefits (e.g., [5–7]). One factor may be that any increase in gain for low-level speech also increases the gain applied to the noise in the signal. Stone and Moore [31] showed similar results, and suggested that when the speech and noise signals are processed concurrently with multichannel compression, the envelopes become more similar due to common gain fluctuations, resulting in Across Source Modulation Correlation (ASMC). The ASMC makes it more difficult for a listener to perceptually segregate the two auditory objects, especially for listeners who rely on envelope cues for speech intelligibility.

A second factor, alterations to the temporal envelope through gain fluctuations, may have also contributed to the decrease in speech intelligibility. In the acoustic analysis, it was found that the compression algorithm and parameters used here caused the gain to fluctuate over time. This fluctuation may have altered the speech envelope beyond a threshold of benefit, and instead caused a decrease in speech intelligibility.

An important acknowledgement is that while the intelligibility scores between series and parallel processing are significantly different, the magnitude of the difference is small, on average 5 percentage points (range: 0–11 percentage points). This experiment was designed to maximize the potential benefit of the spectral subtraction routine by estimating the noise from the noise signal alone before the speech was added, ensuring an optimized noise estimate. In addition, the use of a stationary speech-shaped noise is most suited to a spectral subtraction scheme because the statistics of the noise do not change over time. By controlling these parameters, we were in the best possible situation for determining the differences between series and parallel processing. In a real-world environment, where there is an imperfect noise estimate and a fluctuating background, it is possible that these differences would be smaller.

5. Speech Quality

The effects on sound quality of different signal processing arrangements were quantified in both an NH group and an HI group.

5.1. Subjects. The participants in the speech quality experiment included 12 listeners with normal hearing (average age = 33 years; range = 19–64 years) and 14 listeners with hearing loss (average age = 66 years; range = 36–79 years). Of these, five were participants in the intelligibility testing. Not all participants in the intelligibility study were available for participation in the quality experiment, so additional subjects were recruited. Audiometric thresholds for the HI group are displayed in Table 1. The same audiometric standards required for participation in the intelligibility experiment were required for enrollment in the quality experiment, and some, but not all, of the HI listeners were hearing aid users. All subjects were tested monaurally and individually in a double-walled sound-treated booth. Listeners were compensated $10/hour for their participation.

5.2. Experimental Procedures. Sound quality judgments were made using a paired comparison task. Instructions for the listeners are included in the appendix. Two HINT sentences (cf. Section 2) were mixed with the background noise at 6, 3, and 0 dB SNR. Because poor intelligibility can dominate sound quality judgments, the SNR was restricted to regions where intelligibility was expected to be above 75% for listeners with normal hearing and above 40% for listeners with hearing loss [32]. Listeners judged 225 pairs of stimuli from the final processing conditions over two one-hour visits. Fifteen conditions (3 SNRs × 5 processing conditions) were all paired together (15 × 15 matrix). (Listeners were tested on an additional 99 trials in the experimental quality sessions. These sentences were processed using the series backward condition described in footnote 1. This condition was not included in the final data analysis.) Listeners heard each possible combination twice. Each visit entailed four blocks of paired comparison testing. The first block was 30 practice trials, not scored. The 225 final test trials were divided over 6 blocks, 3 each visit. Each trial was randomized with regard to conditions tested, but no limitation was placed on order of presentation. The playout method was the same as for intelligibility.

5.3. Results. The paired comparison judgments were reduced to a preference score [33–35] ranging from 0 to 1. Preference scores were calculated by summing the total number of times a given condition was preferred over the other conditions, and then dividing by the total number of comparisons for the given condition. Table 3 presents the number of times each condition was preferred, with the NH group displayed in the top panel and the HI group displayed in the bottom panel. The number of comparisons for the preference score calculations was 360 for the NH group (12 subjects × 15 conditions × 2 repetitions) and 420 for the HI group (14 subjects × 15 conditions × 2 repetitions). The preference scores are listed in the bottom rows of the two panels of Table 3.

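Given a count matrix laid out like Table 3, the preference scores in the bottom row of each panel can be recomputed as sketched below (Python with numpy; the 3-condition count matrix is made up for illustration, and the assumed layout is that entry (i, j) counts how often the column condition j was chosen when paired with row condition i).

import numpy as np

def preference_scores(counts, n_subjects, n_reps=2):
    # counts[i, j] = number of times column condition j was chosen when
    # paired with row condition i (the assumed layout of Table 3).  Each
    # condition is compared with every condition n_reps times per subject.
    n_conditions = counts.shape[0]
    return counts.sum(axis=0) / (n_subjects * n_conditions * n_reps)

# Made-up 3-condition example with 12 subjects and 2 repetitions.
counts = np.array([[12, 20, 16],
                   [4, 12, 8],
                   [8, 16, 12]])
print(preference_scores(counts, n_subjects=12))
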
Figure 8 displays the results of the preference scores, with each panel representing a different processing condition. As with intelligibility, the preference scores increased as SNR improved. Figure 9 shows preference scores with each panel showing a different SNR. Spectral subtraction is the most preferred processing condition, with compression the least preferred.

The preference scores were arcsine transformed [30] and analyzed using repeated measures ANOVA (Table 4). The factors of processing condition and SNR were significant. The interaction between processing condition and SNR was also significant, indicating that the change in preference for a given processing condition was dependent on SNR. No significant difference was found between listener groups.

Post-hoc t-tests with Bonferroni correction (P < .01) were completed in order to further examine the data. The t-tests made individual comparisons between the UNP condition and each of the other four conditions. In addition, the series and parallel data were analyzed against each other. Table 4 displays the results of the t-tests for UNP × compression, UNP × spectral subtraction, UNP × parallel, and UNP × series. Examination of the raw data indicates that both groups of listeners showed the highest preference for the spectral subtraction condition. The statistical analysis of UNP × spectral subtraction showed, in addition to statistical significance for the factors of processing condition and SNR, a significant interaction between processing condition and SNR.

Table 3: Preferences for processing conditions. The NH group is represented in the top panel, with the HI group represented in the bottom panel. UNP = unprocessed, com = compression, ss = spectral subtraction, par = parallel, and ser = series.

(a)

NH Grp UNP 6 com 6 ss 6 par 6 ser 6 UNP 3 com 3 ss 3 par 3 ser 3 UNP 0 com 0 ss 0 par 0 ser 0

UNP 6 12 4 18 11 10 1 1 9 2 1 0 0 0 1 0

com 6 20 12 21 18 22 14 3 18 10 13 5 2 2 2 1

ss 6 6 3 12 2 1 3 2 4 0 0 1 2 0 1 1

par 6 13 6 22 12 11 6 1 7 4 4 3 0 1 2 0

ser 6 14 2 23 13 12 4 3 12 2 0 2 0 1 0 2

UNP 3 23 10 21 18 20 12 3 16 8 9 5 0 6 2 5

com 3 23 21 22 23 21 21 12 20 17 17 15 6 13 11 8

ss 3 15 6 20 17 12 8 4 12 9 4 3 2 4 1 1

par 3 22 14 24 20 22 16 7 15 12 12 7 4 4 3 3

ser 3 23 11 24 20 24 15 7 20 12 12 3 3 6 0 4

UNP 0 24 19 23 21 22 19 9 21 17 21 12 3 16 8 8

com 0 24 22 22 24 24 24 18 22 20 21 21 12 20 17 19

ss 0 24 22 24 23 23 18 11 20 20 18 8 4 12 7 10

par 0 23 22 23 22 24 22 13 23 21 24 16 7 17 12 16

ser 0 24 23 23 24 22 19 16 23 21 20 16 5 14 8 12

total 290 197 322 268 270 202 110 242 175 176 117 50 116 75 90

pref 0.806 0.547 0.894 0.744 0.750 0.561 0.306 0.672 0.486 0.489 0.325 0.139 0.322 0.208 0.250

(b)

HI Grp UNP 6 com 6 ss 6 par 6 ser 6 UNP 3 com 3 ss 3 par 3 ser 3 UNP 0 com 0 ss 0 par 0 ser 0

UNP 6 14 5 22 17 17 7 3 10 3 0 3 3 4 2 1

com 6 23 14 24 21 24 17 6 16 12 7 3 1 4 2 4

ss 6 6 4 14 9 8 1 2 3 3 2 2 1 1 0 1

par 6 11 7 19 14 12 7 2 13 6 5 3 3 6 1 2

ser 6 11 4 20 16 14 10 3 8 0 5 4 4 2 1 3

UNP 3 21 11 27 21 18 14 2 19 7 13 7 3 5 4 3

com 3 25 22 26 26 25 26 14 24 21 20 10 3 16 7 8

ss 3 18 12 25 15 20 9 4 14 7 4 2 2 3 0 3

par 3 25 16 25 22 28 21 7 21 14 14 7 2 7 1 1

ser 3 28 21 26 23 23 15 8 24 14 14 4 1 5 3 4

UNP 0 25 25 26 25 24 21 18 26 21 24 14 6 22 11 8

com 0 25 27 27 25 24 25 25 26 26 27 22 14 22 20 17

ss 0 24 24 27 22 26 23 12 25 21 23 6 6 14 7 9

par 0 26 26 28 27 27 24 21 28 27 25 17 8 21 14 17

ser 0 27 24 27 26 25 25 20 25 27 24 20 11 19 11 14

total 309 242 363 309 315 245 147 282 209 207 124 68 151 84 95

pref 0.736 0.576 0.864 0.736 0.750 0.583 0.350 0.671 0.498 0.493 0.295 0.162 0.360 0.200 0.226

The interaction data reveal that preference for spectral subtraction processing increased as SNR decreased. For each of the other post-hoc comparisons, the UNP condition was the significantly preferred condition. In the analysis comparing parallel to series processing, there was no significant difference for the factor of processing condition. No between-group differences were found for any of the post-hoc comparisons, indicating that both the NH and HI groups gave similar paired comparison judgments.

5.4. Discussion. In contrast to the intelligibility results, the sound quality results show that listeners have a definite preference for spectral subtraction when listening in noise, with UNP ranked second. For all processing conditions, preference scores decreased monotonically as SNR decreased. There was no between-group difference, indicating that both the normal-hearing group and the hearing-impaired group showed similar preference judgments.

These results are consistent with past research that has shown that while noise reduction schemes may not improve intelligibility, they may improve sound quality (e.g., [11]). The reduction in speech information from spectral subtraction is not as perceptible as the decrease in noise, potentially leading listeners to choose a processing condition that is harmful to speech intelligibility but is less noisy.

Figure 8: Subject preference scores (proportion of times a processing condition was preferred over all other processing conditions) for listeners with normal hearing (open triangles) and listeners with hearing loss (closed circles) grouped by processing condition. Error bars show the standard error of the preference scores for each group and condition. Symbols and error bars are offset slightly for clarity.

Figure 9: Subject preference scores for sound quality judgments grouped by SNR for listeners with normal hearing (closed triangles) and listeners with hearing loss (open circles). Error bars show the standard error of the preference scores for each group and condition. Symbols and error bars are offset slightly for clarity. Conditions are UNP: unprocessed, comp: dynamic range compression, ss: spectral subtraction, par: parallel, and ser: series.

Although it is likely that intelligibility is a primary factor in the sound quality judgments [32], the fact that both groups of listeners preferred the spectral subtraction routine may indicate that intelligibility is not the only factor. The sound quality results are also consistent with past research that shows that compression does not improve sound quality for noisy speech (e.g., [7–9]). The interaction of spectral subtraction and compression may have reduced the benefit


Table 4: Results of repeated-measures ANOVA for quality scores. Statistical significance (marked with an ∗) is P < .01 due to Bonferroni correction.

Quality statistics
Factor                                    F         df       P
processing condition (pc): omnibus        58.653    4, 96    <.000∗
snr                                       274.568   2, 48    <.000∗
group                                     .571      1, 24    .457
pc × snr                                  12.211    8, 192   <.000∗
pc × group                                .380      4, 96    .669
snr × group                               .343      2, 48    .586
pc × snr × group                          .955      8, 192   .459

pc: UNP × compression                     161.218   1, 24    <.000∗
snr                                       195.952   2, 48    <.000∗
group                                     .012      1, 24    .914
pc × snr                                  5.448     2, 48    .008∗
pc × group                                2.492     1, 24    .128
snr × group                               .562      2, 48    .574
pc × snr × group                          .828      2, 48    .443

pc: UNP × spectral subtraction            13.105    1, 24    .001∗
snr                                       190.315   2, 48    <.000∗
group                                     .497      1, 24    .488
pc × snr                                  21.895    2, 48    <.000∗
pc × group                                .241      1, 24    .628
snr × group                               1.164     2, 48    .3
pc × snr × group                          .304      2, 48    .727

pc: UNP × parallel                        13.137    1, 24    .001∗
snr                                       226.11    2, 48    <.000∗
group                                     .711      1, 24    .408
pc × snr                                  2.733     2, 48    .077
pc × group                                .77       1, 24    .389
snr × group                               .608      2, 48    .489
pc × snr × group                          .764      2, 48    .467

pc: UNP × series                          9.494     1, 24    .005∗
snr                                       214.211   2, 48    <.000∗
group                                     1.553     1, 24    .225
pc × snr                                  1.182     2, 48    .315
pc × group                                .437      1, 24    .515
snr × group                               .216      2, 48    .702
pc × snr × group                          1.462     2, 48    .242

pc: parallel × series                     4.294     1, 24    .049
snr                                       271.273   2, 48    <.000∗
group                                     .010      1, 24    .922
pc × snr                                  .433      2, 48    .635
pc × group                                .437      1, 24    .515
snr × group                               .036      2, 48    .89
pc × snr × group                          .347      2, 48    .691

of noise reduction that was seen when spectral subtraction was implemented by itself, by introducing more noticeable effects on the speech and removing the quality benefit of spectral subtraction alone. This lack of benefit may be due to an increased amount of noise in the signal. Additionally, the mixing of the compression and spectral subtraction algorithms may have altered important speech cues through spectral modifications (e.g., [14, 16–18]).

Previously reported work from our laboratory has found significant differences in sound quality judgments between listener groups for different signal processing manipulations


(e.g., [35, 36]). For example, Arehart et al. [35] found significant differences between a group of normal-hearing listeners and a group of hearing-impaired listeners for several different distortion conditions including additive noise, peak clipping, and center clipping. Generally, the listeners with hearing loss showed lower preference scores for additive noise and peak clipping, but increased preference for center clipping, which relates to noise reduction processing by removing low-intensity portions of the signal. The lack of difference between groups reported here may in part be due to the fact that background noise was present in each of our processing conditions, which may mask some of the degradations to the speech signal by the signal processing.

6. General Discussion and Summary

In these experiments, several different signal processing configurations were tested using both acoustic and perceptual measures in order to quantify the impact that gain fluctuations due to different signal processing arrangements may have on a listener's perception of speech intelligibility and speech quality. Of particular interest were possible differences between series and parallel algorithm arrangements. Both types of arrangements are used in commercially available hearing aids, but there is a lack of literature regarding the acoustic and perceptual differences between the arrangements. Given the increased complexity of signal processing in today's hearing aids, it is important to understand how these different arrangements may affect the acoustic output of a hearing aid, and subsequently a listener's perception. A fundamental difference between the two algorithm arrangements is the difference in modification made to the signal. In order to isolate this factor, the parameters for compression and spectral subtraction were kept the same for both arrangements. In addition, the long-term spectrum of the processed signals was matched to that of the original speech to remove loudness as a potential confound. By controlling these factors, this study presents an initial step in identifying the acoustic and perceptual consequences of series versus parallel processing.

The acoustic analyses showed that parallel processing has greater gain fluctuations than series processing, although audibility is comparable. This difference is due to the influence of the spectral subtraction algorithm, which has more of an effect in the parallel condition. The overall impact on the acoustic output by the signal processing constructions is to alter the amplitude of low-level portions of the signal and change the temporal envelope of the signal, with the parallel processing typically showing more gain reductions than series processing.

Speech understanding was best with no processing to the signal other than NAL-R, although series processing was not significantly worse. Speech intelligibility was significantly poorer with parallel processing compared to series processing. The worst speech intelligibility scores were found with the compression condition. Listeners with normal hearing consistently had higher speech intelligibility scores than listeners with hearing loss, although the pattern across processing conditions was similar. The acoustic changes resulting from the increased signal processing, and increased gain fluctuations, may alter important speech understanding cues [37], and thus decrease speech intelligibility.

Paired-comparison sound quality judgments revealed that listeners significantly preferred spectral subtraction processing to the unprocessed condition. Parallel and series processing were both significantly less preferred than the unprocessed signal. Compression was the least preferred signal processing condition, possibly due to the fact that the intensity of the background noise was also increased when low-level portions of the speech were amplified. There was no difference between listener groups, indicating that both listeners with normal hearing and listeners with hearing loss respond to these signal processing arrangements in similar ways.

The compression condition consistently yielded the lowest intelligibility scores and the lowest quality scores. The compression algorithm implemented here did not match the compression parameters to each subject's audiogram. Such matching may have yielded different perceptual results. However, the trends observed here for the compression condition are consistent with other reports of the effects of compression on speech intelligibility and quality for noisy speech at average conversational levels (e.g., [5–9]). If a different number of bands were used for the compression and spectral subtraction algorithms, the potential remains that there would be different acoustic and perceptual consequences. Additionally, the compression and spectral subtraction algorithm parameters were the same for both series and parallel processing. In a real-world setting, the parameter settings would likely be optimized for the particular algorithm arrangement chosen. The results from this study indicate that the optimization should take the form of reducing the amount of spectral modifications to the signal due to fluctuating gain.

The magnitude of the difference in both speech intelligibility and sound quality found between series and parallel processing is slight, although significant. The increased gain manipulation found in parallel processing yielded better noise reduction (as demonstrated by the gain versus time function), but poorer speech intelligibility and sound quality. We consistently found a trade-off between gain manipulations on the one hand and speech understanding and sound quality on the other. Overall, the series condition yielded better speech intelligibility. These results are on par with Chung [20], who also found that parallel processing has more signal modifications (through a more influential noise reduction) than series processing. The results presented here provide a rationale for future research that explores whether parameter optimization for parallel and series processing would lessen the acoustic and perceptual differences between the arrangements. However, with the goal of maximizing speech understanding and sound quality, it should be noted that the best speech intelligibility was found for the unprocessed condition and the best sound quality with spectral subtraction.


Appendices

A. Intelligibility Instructions

Instructions read to the listener for the intelligibility task were as follows: "In this experiment you will be listening to sentences that have been digitally processed and are in the presence of background noise. To begin sentence playout please click on the button marked PLAY. When the speaker has finished talking, please repeat back as much of the sentence as you understood. If you did not understand any of the words please say "I understood nothing". To begin the next speech sample click PLAY. You will be given a break at the end of each block of trials. If you would like a break before the end of the block, do not click PLAY."

B. Quality Instructions

Instructions read to the listener at each sound quality visit were as follows: "In this experiment you will be comparing speech samples that have been digitally processed. Your task is to decide which sample you think sounds better. In each trial you will listen to Sample A and Sample B. After listening to both samples, you will select which sample you think sounds better. At times you may find it relatively easy to decide which sample sounds better. Other times you may find it more difficult. In all cases we encourage you to take your best guess at whether Sample A or Sample B sounds better. To begin speech playout click on the button marked "Click to Continue". The first speech sample (Sample A) will play followed by the second speech sample (Sample B). After the second sample has finished please click on the speech sample you feel has the better sound quality, either "Sample A sounds better" or "Sample B sounds better". When you have made your selection and are ready for the next set of speech samples please click on the button marked "Proceed"."

Acknowledgments

The work was supported by a research grant from GN Resound Corporation. The recording of the male talker speaking the IEEE corpus was kindly provided by Dr. Peggy Nelson of the University of Minnesota. We sincerely thank Ramesh Kumar Muralimanohar for his assistance in development of the software used for the listener tests and for the data analyses, and Naomi Croghan for her assistance in data collection.

References

[1] S. Kochkin, "Hearing loss population tops 31 million," Hearing Review, vol. 12, pp. 16–29, 2005.

[2] V. D. Larson, D. W. Williams, W. G. Henderson, et al., "Efficacy of 3 commonly used hearing aid circuits: a crossover trial. NIDCD/VA Hearing Aid Clinical Trial Group," The American Journal of Medicine, vol. 284, pp. 1806–1813, 2000.

[3] J. Dubno, L. Matthews, L. Fu-Shing, J. Ahlstrom, and A. Horwitz, "Predictors of hearing-aid ownership and success by older adults," in Proceedings of the International Hearing Aid Research Conference (IHCON '08), Lake Tahoe, Calif, USA, August 2008.

[4] S. Kochkin, "Customer satisfaction with hearing instruments in the digital age," Hearing Journal, vol. 58, pp. 30–37, 2005.

[5] I. Nabelek, "Performance of hearing-impaired listeners under various types of amplitude compression," Journal of the Acoustical Society of America, vol. 74, pp. 776–791, 1983.

[6] R. Lippmann, L. Braida, and N. Durlach, "Study of multichannel amplitude compression and linear amplification for persons with sensorineural hearing loss," Journal of the Acoustical Society of America, vol. 69, pp. 524–534, 1981.

[7] R. van Buuren, J. Festen, and T. Houtgast, "Compression and expansion of the temporal envelope: evaluation of speech intelligibility and sound quality," Journal of the Acoustical Society of America, vol. 105, pp. 2903–2913, 1999.

[8] E. Davies-Venn, P. Souza, and D. Fabry, "Speech and music quality ratings for linear and nonlinear hearing aid circuitry," Journal of the American Academy of Audiology, vol. 18, pp. 688–699, 2007.

[9] P. Souza, "Effects of compression on speech acoustics, intelligibility, and sound quality," Trends in Amplification, vol. 6, pp. 131–165, 2002.

[10] K. Arehart, J. Hansen, S. Gallant, and L. Kalstein, "Evaluation of an auditory masked threshold noise suppression algorithm in normal-hearing and hearing-impaired listeners," Speech Communication, vol. 40, pp. 575–592, 2003.

[11] T. Ricketts and B. Hornsby, "Sound quality measures for speech in noise through a commercial hearing aid implementing "digital noise reduction"," Journal of the American Academy of Audiology, vol. 16, no. 5, pp. 270–277, 2005.

[12] L.-F. Shi and K. Doherty, "Subjective and objective effects of fast and slow compression on the perception of reverberant speech in listeners with hearing loss," Journal of Speech, Language, and Hearing Research, vol. 51, pp. 1328–1340, 2008.

[13] G. Walker, D. Byrne, and H. Dillon, "The effects of multichannel compression/expansion on the intelligibility of nonsense syllables in noise," Journal of the Acoustical Society of America, vol. 76, pp. 746–757, 1984.

[14] P. Souza, L. Jenstad, and K. Boike, "Measuring the effects of compression and amplification on speech in noise (L)," Journal of the Acoustical Society of America, vol. 119, pp. 41–44, 2006.

[15] S. Rosen, "Temporal information in speech: acoustic, auditory, and linguistic aspects," Philosophical Transactions of the Royal Society: Biological Sciences, vol. 336, pp. 367–373, 1992.

[16] J. Kates, "Hearing-aid design criteria," Journal of Speech Language Pathology and Audiology Monograph, supplement 1, pp. 15–23, 1993.

[17] J. H. L. Hansen, Speech Enhancement, Encyclopedia of Electrical and Electronics Engineering, vol. 20, John Wiley & Sons, New York, NY, USA, 1999.

[18] P. G. Stelmachowicz, D. E. Lewis, B. Hoover, and D. H. Keefe, "Subjective effects of peak clipping and compression limiting in normal and hearing-impaired children and adults," Journal of the Acoustical Society of America, vol. 105, pp. 412–422, 1999.

[19] B. Franck, C. van Kreveld-Bos, W. Dreschler, and H. Verschuure, "Evaluation of spectral enhancement in hearing aids, combined with phonemic compression," Journal of the Acoustical Society of America, vol. 106, pp. 1452–1464, 1999.

[20] K. Chung, "Effective compression and noise reduction configurations for hearing protectors," Journal of the Acoustical Society of America, vol. 121, pp. 1090–1101, 2007.


[21] M. Nilsson, S. Soli, and J. Sullivan, "Development of a hearing in noise test for the measurement of speech reception thresholds in quiet and in noise," Journal of the Acoustical Society of America, vol. 95, pp. 1085–1109, 1994.

[22] J. Kates and K. Arehart, "Multichannel dynamic-range compression using digital frequency warping," EURASIP Journal on Applied Signal Processing, vol. 2005, no. 18, pp. 3003–3014, 2005.

[23] ANSI S3.22, Specification of Hearing Aid Characteristics, American National Standards Institute, New York, NY, USA, 1996.

[24] J. Kates, Digital Hearing Aids, Plural, San Diego, Calif, USA, 2008.

[25] D. Tsoukalas, J. Mourjopoulos, and G. Kokkinakis, "Speech enhancement based on audible noise suppression," IEEE Transactions on Speech and Audio Processing, vol. 5, pp. 497–513, 1997.

[26] ANSI S3.4, Procedure for the Computation of Loudness of Steady Sounds, American National Standards Institute, New York, NY, USA, 2007.

[27] D. Byrne and H. Dillon, "The National Acoustics Laboratories' (NAL) new procedure for selecting the gain and frequency response of a hearing aid," Ear and Hearing, vol. 7, pp. 257–265, 1986.

[28] S. Rosenthal, "IEEE: recommended practices for speech quality measurements," IEEE Transactions on Audio and Electroacoustics, vol. 17, pp. 227–246, 1969.

[29] ANSI S3.6, Specifications for Audiometers, American National Standards Institute, New York, NY, USA, 2004.

[30] G. Studebaker, "A "rationalized" arcsine transform," Journal of Speech and Hearing Research, vol. 28, pp. 424–434, 1985.

[31] M. Stone and B. C. J. Moore, "Effects of spectro-temporal modulation changes produced by multi-channel compression on intelligibility in a competing-speech task," Journal of the Acoustical Society of America, vol. 123, pp. 1063–1076, 2008.

[32] J. Preminger and D. Van Tassell, "Quantifying the relation between speech quality and speech intelligibility," Journal of Speech and Hearing Research, vol. 38, pp. 714–725, 1995.

[33] H. David, The Method of Paired Comparisons, Hafner, New York, NY, USA, 1963.

[34] L. Rabiner, H. Levitt, and A. Rosenberg, "Investigation of stress patterns for speech synthesis by rule," Journal of the Acoustical Society of America, vol. 45, pp. 92–101, 1969.

[35] K. Arehart, J. Kates, M. Anderson, and L. Harvey, "Effects of noise and distortion on speech quality judgments in normal-hearing and hearing-impaired listeners," Journal of the Acoustical Society of America, vol. 122, pp. 1150–1164, 2007.

[36] K. Arehart, J. Kates, and M. Anderson, "Effects of linear, nonlinear, and combined linear and nonlinear distortion on perceived speech quality," in Proceedings of the International Hearing Aid Research Conference (IHCON '08), Lake Tahoe, Calif, USA, August 2008.

[37] L. Jenstad and P. Souza, "Temporal envelope changes of compression and speech rate: combined effects on recognition for older adults," Journal of Speech, Language, and Hearing Research, vol. 50, pp. 1123–1138, 2007.


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2009, Article ID 437807, 9 pages, doi:10.1155/2009/437807

Research Article

Low Delay Noise Reduction and Dereverberation for Hearing Aids

Heinrich W. Lollmann (EURASIP Member) and Peter Vary

Institute of Communication Systems and Data Processing, RWTH Aachen University, 52056 Aachen, Germany

Correspondence should be addressed to Heinrich W. Lollmann, [email protected]

Received 11 December 2008; Accepted 16 March 2009

Recommended by Heinz G. Goeckler

A new system for single-channel speech enhancement is proposed which achieves a joint suppression of late reverberant speech and background noise with a low signal delay and low computational complexity. It is based on a generalized spectral subtraction rule which depends on the variances of the late reverberant speech and background noise. The calculation of the spectral variances of the late reverberant speech requires an estimate of the reverberation time (RT), which is accomplished by a maximum likelihood (ML) approach. The enhancement with this blind RT estimation achieves almost the same speech quality as by using the actual RT. In comparison to commonly used post-filters in hearing aids which only perform a noise reduction, a significantly better objective and subjective speech quality is achieved. The proposed system performs time-domain filtering with coefficients adapted in the non-uniform (Bark-scaled) frequency-domain. This allows a high speech quality to be achieved with low signal delay, which is important for speech enhancement in hearing aids or related applications such as hands-free communication systems.

Copyright © 2009 H. W. Lollmann and P. Vary. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Algorithms for the enhancement of acoustically disturbed speech signals have been the subject of intensive research over the last decades, cf., [1–3]. The wide-spread use of mobile communication devices and, not least, the introduction of digital hearing aids have contributed significantly to the interest in this field. For hearing impaired people, it is especially difficult to communicate with other persons in noisy environments. Therefore, speech enhancement systems have become an integral component of modern hearing aids. However, despite significant progress, the development of speech enhancement systems for hearing aids is still a very challenging problem due to the demanding requirements regarding computational complexity, signal delay and speech quality.

A common approach is to use a beamformer with two or three closely spaced microphones followed by a post-filter, e.g., [4, 5]. An adaptive beamformer is often used, implemented by first- or second-order differential microphone arrays or a generalized sidelobe canceller (GSC), respectively, e.g., [5]. Due to the use of small microphone arrays, only a limited noise suppression can be achieved by this, especially for diffuse noise fields. Therefore, the output signal of the beamformer is further processed by a (Wiener) post-filter to achieve an improved noise suppression, e.g., [4–7]. A related approach is to use an extension of the GSC structure termed the speech distortion weighted multi-channel Wiener filter [8, 9]. This approach allows the tradeoff between speech distortions and noise reduction to be balanced and is more robust towards reverberation than a common GSC.

So far, such systems achieve only a very limited suppression of speech distortions due to room reverberation. Such impairments are caused by the multiple reflections and diffraction of the sound on walls and objects of a room. These multiple echoes add to the direct sound at the receiver and blur its temporal and spectral characteristics. As a consequence, reverberation and background noise reduce listening comfort and speech intelligibility, especially for hearing impaired persons [10, 11]. Therefore, algorithms for a joint suppression of background noise and reverberation effects are of special interest for speech enhancement in hearing instruments. However, many proposals are less suitable for this application.

For example, dereverberation algorithms based on linear prediction such as [12] achieve mainly a reduction of early reflections and do not consider additive noise,


while algorithms based on a time-averaging [13] exhibit a high signal delay. Coherence-based speech enhancement algorithms such as [14] or [15] can suppress background noise and reverberation, but they are rather ineffective if only two closely spaced microphones can be used. This problem can be alleviated to some extent by a noise classification and binaural processing [16] which, however, requires two hearing aid devices connected by a wireless data link. A single-channel algorithm for speech dereverberation and noise reduction has been proposed recently in [17]. However, this algorithm is less suitable for hearing aids due to its high computational complexity and signal delay as well as its strong speech distortions.

A more powerful approach for noise reduction and dereverberation is to use blind source separation (BSS), e.g., [18]. Such algorithms do not require a priori knowledge about the microphone positions or source locations. However, they depend on a full data link between the hearing aid devices and possess a high computational complexity. Therefore, further work remains to be done to integrate such algorithms into common hearing instruments [19].

In this contribution, a single-channel speech enhancement algorithm is proposed, which is more suitable for current hearing aid devices. It performs a suppression of background noise and late reverberant speech by means of a generalized spectral subtraction. The devised (post-)filter exhibits a low signal delay, which is important in hearing aids, e.g., to avoid comb filter effects. The calculation of the late reverberant speech energy requires (only) an estimate of the reverberation time (RT), which is accomplished by a maximum likelihood (ML) approach. Thus, no explicit speech modeling is involved in the dereverberation process as, e.g., in [20], such that an estimation of speech model parameters is not needed here.

The paper is organized as follows. In Section 2, the underlying signal model is introduced. The overall system for low delay speech enhancement is outlined in Section 3. The calculation of the spectral weights for noise reduction and dereverberation is treated in Section 4. An important issue is the determination of the spectral variances of the late reverberant speech, which in turn is based on an estimation of the RT. These issues are treated in Sections 4.2 and 4.3. The performance of the new system is analyzed in Section 5, and the main results are summarized in Section 6.

2. Signal Model

The distorted speech signal x(k) is assumed to be given by a superposition of the reverberant speech signal z(k) and additive noise v(k), where k marks the discrete time index. The received signal x(k) and the original (undisturbed) speech signal s(k) are related by

x(k) = z(k) + v(k) = \sum_{n=0}^{L_R - 1} s(k - n) h_r(n, k) + v(k)    (1)

with h_r(n, k) representing the time-varying room impulse response (RIR) of (possibly infinite) length L_R between source and receiver. The reverberant speech signal can be decomposed into its early and late reverberant components

z(k) = \underbrace{\sum_{n=0}^{L_e - 1} s(k - n) h_r(n, k)}_{= z_e(k)} + \underbrace{\sum_{n=L_e}^{L_R - 1} s(k - n) h_r(n, k)}_{= z_l(k)}.    (2)

The late reverberation causes mainly overlap-masking effects which are usually more detrimental for the speech quality than the "coloration" effects of early reflections.

Here, the early reverberant speech z_e(k) (and not s(k)) constitutes the target signal of our speech enhancement algorithm. This allows the late reverberant speech z_l(k) and additive noise v(k) to be suppressed by modeling them both as uncorrelated noise processes and by applying known speech enhancement techniques, such as Wiener filtering or spectral subtraction, respectively. This concept, which has been introduced by Lebart et al. [21] and further improved by Habets [22], forms the basis of our speech enhancement algorithm. It is more practical for hearing aids as it avoids the high computational complexity and/or signal delay required by algorithms which strive for an (almost) complete cancellation of background noise and reverberation as, e.g., BSS.
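The signal model of (1) and (2) can be simulated directly. The following Python sketch (with placeholder signals; a measured RIR and recorded speech would be used in practice) splits an exponentially decaying model RIR at an assumed early/late boundary of L_e samples and forms the early reverberant target, the late reverberant interference, and the noisy observation.

    import numpy as np
    from scipy.signal import fftconvolve

    fs = 16000
    Le = int(0.05 * fs)                        # assumed early/late boundary (50 ms)

    # Placeholder signals: white noise stands in for the anechoic speech s(k), and an
    # exponentially decaying noise sequence models the RIR (T60 = 0.79 s as in Section 5).
    rng = np.random.default_rng(0)
    s = rng.standard_normal(3 * fs)
    t = np.arange(int(0.5 * fs)) / fs
    h_r = rng.standard_normal(t.size) * np.exp(-6.908 * t / 0.79)

    h_early, h_late = h_r.copy(), h_r.copy()   # split of the RIR, cf. (2)
    h_early[Le:] = 0.0
    h_late[:Le] = 0.0

    z_e = fftconvolve(s, h_early)[: s.size]    # early reverberant speech (target)
    z_l = fftconvolve(s, h_late)[: s.size]     # late reverberant speech (interference)
    v = 0.05 * rng.standard_normal(s.size)     # additive noise v(k)
    x = z_e + z_l + v                          # observed signal x(k), cf. (1)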

3. Low Delay Filtering

A common approach for (single-channel) speech enhancement is to perform spectral weighting in the short-term frequency-domain. The DFT coefficients of the disturbed speech X(i, λ) are multiplied with spectral weights W_i(λ) to obtain M enhanced speech coefficients

\hat{S}(i, λ) = X(i, λ) · W_i(λ);  i ∈ {0, 1, . . . , M − 1},    (3)

where i denotes the frequency (channel) index and λ the subsampled time index λ = ⌊k/R⌋. (The operation ⌊·⌋ returns the greatest integer value which is lower than or equal to the argument.) For block-wise processing, the downsampling rate R ∈ ℕ corresponds to the frame shift and λ to the frame index.

An efficient and common method to realize the short-term spectral weighting of (3) is to use a polyphase network DFT analysis-synthesis filter-bank (AS FB) with subsampling which comprises the common overlap-add method as a special case, [2, 23]. A drawback of this method is that subband filters of high filter degrees are needed to achieve a sufficient stopband attenuation in order to avoid aliasing distortions, which results in a high signal delay. For hearing aids, however, an overall processing delay of less than 10 milliseconds is desirable to avoid comb filter effects, cf., [24]. Such distortions are caused by the superposition of a processed, delayed signal with an unprocessed signal which bypasses the hearing aid, e.g., through the hearing aid vent. This is especially problematic for devices with an "open fitting." Therefore, the algorithmic signal delay of the AS FB should


be significantly below 10 ms. One approach to achieve a reduced delay is to design the prototype lowpass filter of the DFT filter-bank by numerical optimization with the design target to reduce the aliasing distortions with constrained signal delay, [25, 26].

A significantly lower signal delay can be achieved by the concept of the filter-bank equalizer proposed in [27, 28]. The adaptation of the coefficients is performed in the (uniform or non-uniform) short-term frequency-domain while the actual filtering is performed in the time-domain. A related approach has been presented independently in [29] for dynamic range compression in hearing aids. The concept of the filter-bank equalizer has been further improved and generalized in [30, 31]. This filter(-bank) approach is considered here as it avoids aliasing distortions for the processed signal. In addition, the use of the warped filter-bank equalizer causes a significantly lower computational complexity and signal delay than the use of a non-uniform (Bark-scaled) AS FB for speech enhancement as proposed, e.g., in [32–34].

A general representation of the proposed speech enhancement system is provided by Figure 1. The subband signals X(i, λ) are calculated either by a uniform or warped DFT analysis filter-bank with downsampling by R, which can be efficiently implemented by a polyphase network. The choice of the downsampling rate R is here not governed by restrictions for aliasing cancellation as for AS FBs, since the filtering is performed in the time-domain with coefficients adapted in the frequency-domain. The influence of aliasing effects for the calculation of the spectral weights is negligible for the considered application.

The frequency warped version is obtained by replacing the delay elements of the system by allpass filters of first order

z^{−1} → A(z) = \frac{1 − αz}{z − α};  α ∈ ℝ;  |α| < 1.    (4)

This allpass transformation makes it possible to design a filter-bank whose frequency bands approximate the Bark frequency bands (which model the frequency resolution of the human auditory system) with great accuracy [35]. This can be exploited for speech enhancement to achieve a high (subjective) speech quality with a low number of frequency channels, cf., [30].
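To illustrate the effect of the allpass transformation (4), the sketch below evaluates the phase-derived frequency mapping of a first-order allpass for one common sign convention and inverts it numerically to obtain the physical band edges of a warped DFT filter-bank. The values α = 0.5 and M = 32 follow Section 5; the mapping function itself is a standard bilinear-warping expression and is an assumption here, not a formula taken from this paper.

    import numpy as np

    def warped_frequency(omega, alpha):
        # Frequency mapping induced by a first-order allpass as in (4); for positive
        # alpha, low physical frequencies are stretched onto a wider warped range,
        # which approximates the Bark scale for alpha ~ 0.5 at 16 kHz.
        return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

    fs = 16000
    alpha, M = 0.5, 32                                 # values used in Section 5
    warped_edges = np.arange(M // 2 + 1) * 2.0 * np.pi / M
    omega = np.linspace(0.0, np.pi, 10000)
    # Invert the mapping numerically to find the physical band edges of the warped bank.
    physical_edges = np.interp(warped_edges, warped_frequency(omega, alpha), omega)
    print(np.round(physical_edges * fs / (2.0 * np.pi)))   # band edges in Hz, narrow at low f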

The short-term spectral coefficients of the disturbed speech X(i, λ) are used to calculate the spectral weights for speech enhancement W_i(λ) as well as the weights W̃_i(λ) for speech denoising prior to the RT estimation, see Figure 1. These spectral weights are converted to the time-domain filter coefficients w_n(λ) and w̃_n(λ) by means of a generalized discrete Fourier transform (GDFT)

w_n(λ) = \frac{h(n)}{M} \sum_{i=0}^{M-1} W_i(λ) e^{−j(2π/M) i (n − n_0)};  n, n_0 ∈ {0, 1, . . . , L},    (5)

and accordingly for the weights W̃_i(λ). The sequence h(n) denotes the real, finite impulse response (FIR) of the prototype lowpass filter of the analysis filter-bank. For the common case of a prototype filter with linear phase response and even filter degree L, (5) applies with n_0 = L/2. The GDFT of (5) can be efficiently calculated by the fast Fourier transform (FFT). It is also possible to approximate the (uniform or warped) time-domain filters by FIR or IIR filters of lower degree to further reduce the overall signal delay and complexity. A more comprehensive treatment can be found in [30, 31].
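A direct implementation of the GDFT in (5) is straightforward. The following sketch converts a set of spectral weights into time-domain filter coefficients for a linear-phase Hann prototype; it assumes real weights that are symmetric about the Nyquist channel (W_i = W_{M−i} for i = 1, ..., M − 1) so that the resulting impulse response is real.

    import numpy as np

    def gdft_filter_coeffs(W, h, n0=None):
        # Convert M spectral weights W[i] into time-domain FIR coefficients, cf. (5):
        #   w_n = h(n)/M * sum_i W_i * exp(-j*2*pi/M * i * (n - n0)),  n = 0, ..., L.
        # A real prototype h of length L+1 is assumed; for symmetric real weights the
        # imaginary part vanishes and is discarded below.
        M = len(W)
        L = len(h) - 1
        if n0 is None:
            n0 = L // 2                                  # linear-phase prototype, even degree L
        n = np.arange(L + 1)
        kernel = np.exp(-2j * np.pi / M * np.outer(n - n0, np.arange(M)))
        w = h / M * (kernel @ np.asarray(W, dtype=complex))
        return w.real

    # Example: unity weights with a Hann prototype of degree L = M reduce to a pure
    # delay of n0 samples, as expected for an all-pass setting.
    M = 32
    h = np.hanning(M + 1)
    w = gdft_filter_coeffs(np.ones(M), h)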

4. Spectral Weights for Noise Reduction and Dereverberation

Two essential components of Figure 1 are the calculation of the spectral weights and the RT estimation, which are treated in this section.

4.1. Concept. The weights are calculated by the spectral subtraction rule

W_i^{(ss)}(λ) = 1 − \frac{1}{γ(i, λ)};  i ∈ {0, 1, . . . , M − 1}.    (6)

This method achieves a good speech quality with low computational complexity, but other, more sophisticated estimators such as the spectral amplitude estimators of Ephraim and Malah [36] or even psychoacoustic weighting rules [37] can be employed as well, cf., [22].

The spectral weights of (6) depend on an estimation of the a posteriori signal-to-interference ratio (SIR)

γ(i, λ) = \frac{|X(i, λ)|^2}{σ_{z_l}^2(i, λ) + σ_v^2(i, λ)}.    (7)

The spectral variances of the late reverberant speech and noise are given by σ_{z_l}^2(i, λ) and σ_v^2(i, λ), cf., (1) and (2). Equation (6) can be seen as a generalized spectral subtraction rule. If no reverberation is present, that is, z(k) = s(k), (7) reduces to the well-known a posteriori signal-to-noise ratio (SNR) and (6) to a "common" spectral magnitude subtraction for noise reduction.

The problem of musical tones can be alleviated by expressing the a posteriori SIR by the a priori SIR

ξ(i, λ) = \frac{E\{|Z_e(i, λ)|^2\}}{σ_{z_l}^2(i, λ) + σ_v^2(i, λ)} = γ(i, λ) − 1,    (8)

which can be estimated by the decision-directed approach of [36]

\hat{ξ}(i, λ) = η · \frac{|\hat{Z}_e(i, λ − 1)|^2}{σ_{z_l}^2(i, λ − 1) + σ_v^2(i, λ − 1)} + (1 − η) · \max\{γ(i, λ) − 1, 0\}    (9)

with 0.8 < η < 1. This recursive estimation of the a priori SIR causes a significant reduction of musical tones, cf., [38]. The spectral weights are finally confined by a lower threshold

W_i(λ) = \max\{W_i^{(ss)}(λ), δ_w(i, λ)\}.    (10)


[Figure 1 appears here: block diagram of the overall system. The input x(k) = z(k) + v(k) is analyzed by a DFT analysis filter-bank yielding X(i, λ); a weight calculation stage produces the spectral weights W_i(λ) and W̃_i(λ), which are converted by GDFTs into the coefficients w_n(λ) of the main time-domain filter and w̃_n(λ) of an auxiliary time-domain filter; the auxiliary filter output z̃(k) feeds the RT estimation, which provides T̂_60(λ′) to the weight calculation; the main time-domain filter produces the enhanced output y(k) = ŝ(k).]

Figure 1: Overall system for low delay noise reduction and dereverberation. The frequency warped system is obtained by replacing the delay elements of the analysis filter-bank and both time-domain filters by allpass filters of first order.

This makes it possible to balance the tradeoff between the amount of interference suppression on the one hand, and musical tones and speech distortions on the other hand. Alternatively, it is also possible to bound the spectral weights implicitly by imposing a lower threshold on the estimated a priori SIR. The adaptation of the thresholds and other parameters can be done similarly as for "common" noise reduction algorithms based on spectral weighting.
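The weight computation of (6)–(10) can be summarized per frame as in the sketch below. The default values of η and the weight floor, and the use of the current-frame variances in the decision-directed term, are simplifying assumptions chosen within the ranges quoted in the text rather than settings prescribed by the paper.

    import numpy as np

    def update_weights(X, var_zl, var_v, Ze_prev, eta=0.9, floor=0.2):
        # One frame of the weight computation, cf. (6)-(10).
        #   X       : complex subband coefficients X(i, lambda)
        #   var_zl  : late-reverberant variances sigma_zl^2(i, lambda)
        #   var_v   : noise variances sigma_v^2(i, lambda)
        #   Ze_prev : enhanced coefficients of the previous frame
        # Strictly, (9) uses the previous frame's variances; slowly varying variances
        # are assumed here for brevity.
        interference = var_zl + var_v + 1e-12
        gamma = np.abs(X) ** 2 / interference                 # a posteriori SIR (7)
        xi = eta * np.abs(Ze_prev) ** 2 / interference \
             + (1.0 - eta) * np.maximum(gamma - 1.0, 0.0)     # decision-directed estimate (9)
        W = 1.0 - 1.0 / (xi + 1.0)                            # subtraction rule (6) via (8)
        W = np.maximum(W, floor)                              # lower threshold (10)
        Ze = W * X                                            # enhanced coefficients
        return W, Ze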

4.2. Interference Power Estimation. A crucial issue is the estimation of the variances of the interfering noise and late reverberant speech to determine the a priori SIR. The spectral noise variances σ_v^2(i, λ) can be estimated by common techniques such as minimum statistics [39].

An estimator for the variances σ_{z_l}^2(i, λ) of the late reverberant speech can be obtained by means of a simple statistical model for the RIR of (1) [21]

h_m(k) = n(k) e^{−ρ k T_s} ε(k)    (11)

with ε(k) representing the unit step sequence. The parameter T_s = 1/f_s denotes the sampling period and n(k) is a sequence of i.i.d. random variables with zero mean and normal distribution.

The reverberation time (RT) is defined as the time span in which the energy of a steady-state sound field in a room decays 60 dB below its initial level after switching-off the excitation source, [40]. It is linked to the decay rate ρ of (11) by the relation

T_{60} = \frac{3}{ρ \log_{10}(e)} ≈ \frac{6.908}{ρ}.    (12)

Due to this dependency, the terms decay rate and reverberation time are used interchangeably in the following. The RIR model of (11) is rather coarse, but it allows a simple relation to be derived between the spectral variances of the late reverberant speech σ_{z_l}^2(i, λ) and the reverberant speech σ_z^2(i, λ) according to [21]

σ_{z_l}^2(i, λ) = e^{−2 ν(i, λ) T_l} · σ_z^2(i, λ − N_l).    (13)

The value ν(i, λ) denotes the frequency and time dependent decay rate of the RIR in the subband-domain, whose blind estimation is treated in Section 4.3. The integer value N_l = ⌊T_l f_s/R⌋ marks the number of frames corresponding to the chosen time span T_l, where f_s denotes the sampling frequency. The value for T_l is typically in a range of 20 to 100 ms and is related to the time span after which the late reverberation (presumably) begins.

The variances of the reverberant speech can be estimated from the spectral coefficients Z̃(i, λ) by recursive averaging

\hat{σ}_z^2(i, λ) = κ · \hat{σ}_z^2(i, λ − 1) + (1 − κ) · |Z̃(i, λ)|^2    (14)

with 0 < κ < 1. The spectral coefficients of the reverberant speech are obtained by spectral weighting

Z̃(i, λ) = X(i, λ) · W̃_i(λ)    (15)

using, for instance, the spectral subtraction rule of (6) based on an estimation of the a posteriori SNR. It should be noted that the spectral weights W̃_i(λ) are also needed for the denoising prior to the RT estimation (see Figure 1).
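Equations (13)–(15) amount to a delayed, recursively averaged power estimate scaled by an exponential decay factor. The class below is a minimal sketch of this bookkeeping, assuming a single frequency-independent decay rate as in (16); the default values of T_l and κ are assumptions within the ranges mentioned in the text.

    import numpy as np

    class LateReverbEstimator:
        # Running estimate of the late-reverberant spectral variances, cf. (13)-(14),
        # with a single frequency-independent decay rate nu ~ rho as in (16).
        def __init__(self, M, fs=16000, R=32, Tl=0.05, kappa=0.8):
            self.Tl = Tl
            self.Nl = int(round(Tl * fs / R))            # frame delay corresponding to Tl
            self.kappa = kappa
            self.var_z = np.zeros(M)                     # recursive estimate of sigma_z^2
            self.delay_line = [np.zeros(M)] * self.Nl    # past variance frames

        def update(self, Z_denoised, rho):
            # Recursive averaging of the denoised reverberant speech power, cf. (14).
            self.var_z = self.kappa * self.var_z + (1.0 - self.kappa) * np.abs(Z_denoised) ** 2
            self.delay_line.append(self.var_z.copy())
            var_z_delayed = self.delay_line.pop(0)       # sigma_z^2(i, lambda - Nl)
            # Late reverberant variance, cf. (13).
            return np.exp(-2.0 * rho * self.Tl) * var_z_delayed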

A more sophisticated (and complex) estimation of the late reverberant speech energy is proposed in [22]. It takes model inaccuracies into account if the source-receiver distance is lower than the critical distance, and requires an estimation of the direct-to-reverberation ratio for this.

4.3. Decay Rate Estimation. The estimation of the frequency dependent decay rates ν(i, λ) of (13) requires non-subsampled subband signals, which causes a high computational complexity. To avoid this, we estimate the decay rate in the time-domain at decimated time instants λ′ = ⌊k/R′⌋ from the (partly) denoised, reverberant speech signal z̃(k) as sketched by Figure 1. The prime indicates that the update rate for this estimation R′ is not necessarily identical to that for the spectral weights W_i(λ) and W̃_i(λ). In general, the update intervals for the RT estimation can be longer than for the calculation of the spectral weights as the room acoustics usually changes rather slowly.

The filter coefficients w̃_n(λ) for the "auxiliary" time-domain filter which provides z̃(k) are obtained by a GDFT of the spectral weights W̃_i(λ) used in (15), see Figure 1. The frequency dependent decay rates ν(i, λ′), needed to evaluate


(13), are obtained from the time-domain estimate of the decay rate \hat{ρ}(λ′) according to

ν(i, λ′) ≈ \hat{ρ}(λ′)  ∀ i ∈ {0, 1, . . . , M − 1}.    (16)

This approximation is rather coarse, but it yields good results in practice with a low computational complexity.

A blind estimation of the decay rate (or RT) can be performed by a maximum likelihood (ML) approach first proposed in [41, 42]. A generalization of this approach to estimate the RT in noisy environments has been presented in [43]. The ML estimators are also based on the statistical RIR model of (11).

For a blind determination of the RT, an ML estimation for the decay rate ρ is performed at decimated time instants λ′ on a frame with N samples z̃(λ′R′ − N + 1), z̃(λ′R′ − N + 2), . . . , z̃(λ′R′) according to

\hat{ρ}(λ′) = \arg\max_{ρ} \{ L(λ′) \}    (17)

with the log-likelihood function given by

L(λ′) = −\frac{N}{2} \left( (N − 1) \ln(a) + \ln\left( \frac{2π}{N} \sum_{i=0}^{N−1} a^{−2i} z̃^2(λ′R′ − N + 1 + i) \right) + 1 \right),    (18)

where a = \exp\{−ρ T_s\}, cf., [43]. The corresponding RT is obtained by (12).
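The ML estimation of (17) and (18) can be realized with a simple grid search over candidate decay rates; the sketch below does exactly that and converts the result to an RT via (12). The search range and resolution are assumptions, and no selection of speech-offset frames is included here.

    import numpy as np

    def ml_decay_rate(frame, fs=16000, n_candidates=200):
        # ML estimate of the decay rate from one frame of the (denoised) reverberant
        # signal, cf. (17)-(18), using a grid search over T60 between 0.1 s and 2 s.
        frame = np.asarray(frame, dtype=float)
        N = len(frame)
        i = np.arange(N)
        rho_grid = 6.908 / np.linspace(0.1, 2.0, n_candidates)   # candidate decay rates, cf. (12)
        best_L, best_rho = -np.inf, rho_grid[0]
        for rho in rho_grid:
            a = np.exp(-rho / fs)                                # a = exp(-rho * Ts)
            energy = np.sum(a ** (-2.0 * i) * frame ** 2) + 1e-300
            L = -0.5 * N * ((N - 1) * np.log(a) + np.log(2.0 * np.pi / N * energy) + 1.0)
            if L > best_L:                                       # maximize the log-likelihood (18)
                best_L, best_rho = L, rho
        return best_rho, 6.908 / best_rho                        # (decay rate, T60), cf. (12)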

A correct RT estimate can be expected if the current frame captures a free decay period following the sharp offset of a speech sound. Otherwise, an incorrect RT is obtained, e.g., for segments with ongoing speech, speech onsets or gradually declining speech offsets. Such estimates can be expected to overestimate the RT, since the damping of sound cannot occur at a rate faster than the free decay. However, taking the minimum of the last K_l ML estimates is likely to underestimate the RT, since the ML estimate also constitutes a random variable. This bias can be reduced by "order-statistics" as known from image processing [44]. In the process, the histogram of the K_l most recent ML estimates is built and its first local maximum is taken as the RT estimate T̂_{60}^{(peak)}(λ′), excluding maxima at the boundaries. The effects of "outliers" can be efficiently reduced by recursive smoothing

T̂_{60}(λ′) = β · T̂_{60}(λ′ − 1) + (1 − β) · T̂_{60}^{(peak)}(λ′)    (19)

with 0.9 < β < 1. A strong smoothing can be applied as the RT usually changes rather slowly over time.
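The order-statistics post-processing and the recursive smoothing of (19) can be sketched as follows. The histogram bin width, the fallback used when no interior local maximum exists, and the initialization are assumptions not specified in the paper; K_l and β default to the values quoted in Section 5.

    import numpy as np

    class RTTracker:
        # Tracks the RT from a stream of ML estimates: histogram over the last Kl
        # values, first interior local maximum as peak, recursive smoothing cf. (19).
        def __init__(self, Kl=400, beta=0.995, bin_width=0.05, t60_max=2.0):
            self.Kl, self.beta = Kl, beta
            self.bins = np.arange(0.0, t60_max + bin_width, bin_width)
            self.history = []
            self.t60_smoothed = None

        def update(self, t60_ml):
            self.history.append(t60_ml)
            self.history = self.history[-self.Kl:]
            counts, _ = np.histogram(self.history, bins=self.bins)
            peak = None
            for k in range(1, len(counts) - 1):          # exclude boundary bins
                if counts[k] >= counts[k - 1] and counts[k] > counts[k + 1]:
                    peak = 0.5 * (self.bins[k] + self.bins[k + 1])
                    break
            if peak is None:                             # fallback: global maximum (assumption)
                k = int(np.argmax(counts))
                peak = 0.5 * (self.bins[k] + self.bins[k + 1])
            if self.t60_smoothed is None:
                self.t60_smoothed = peak
            else:                                        # recursive smoothing, cf. (19)
                self.t60_smoothed = self.beta * self.t60_smoothed + (1.0 - self.beta) * peak
            return self.t60_smoothed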

The devised RT estimation relies only on the fact that speech signals occasionally contain distinctive speech offsets, but it requires no explicit speech offset detection [21] or a calibration period [45]. Another important advantage of this RT estimation is that it is developed for noisy signals, as the prior denoising can only achieve a partial noise suppression.

[Figure 2 appears here: plot of the measured room impulse response h_r(k) over time k · T_s from 0 to 0.5 s.]

Figure 2: Measured RIR with T_60 = 0.79 seconds.

[Figure 3 appears here: plot of the group delay in samples (0 to 50) over the normalized frequency Ω/π.]

Figure 3: Group delay of the warped filter-bank equalizer with filter degree L = 32 and allpass coefficient α = 0.5.

In principle, it is also conceivable to use other methods for the continuous RT estimation, such as the Schroeder method [46] or a non-linear regression approach [47]. However, the use of such estimators has led to inferior results as the obtained histograms showed a higher spread and less distinctive local maxima. This resulted in a much higher error rate in comparison to the ML approach.

5. Evaluation

The new system has been evaluated by means of instrumental quality measures as well as informal listening tests. The distorted speech signals are generated according to (1) for a sampling frequency of f_s = 16 kHz. A speech signal of 6 minutes duration is convolved with the RIR shown in Figure 2. The RIR has been measured in a highly reverberant room and possesses an RT of 0.79 s. (This value for T_60 has been determined from the measured RIR by a modified Schroeder method as described in [43].) The reverberant speech signal z(k) is distorted by additive babble noise from the NOISEX-92 database with varying global input SNRs for anechoic speech s(k) and additive noise v(k).

For the processing according to Figure 1, a warped filter-bank equalizer is used with allpass coefficient α = 0.5, M = 32 frequency channels, a downsampling rate of R = 32 and a Hann prototype lowpass filter of degree L = M. This processing with non-uniform frequency resolution allows a good subjective speech quality to be achieved with low signal delay, cf., [30]. The time-invariant group delay of


both warped time-domain filters is shown in Figure 3. The group delay varies only between 0.5 ms and 3.125 ms for f_s = 16 kHz. Such variations do not cause audible phase distortions so that a phase equalizer is not needed here. In contrast, the use of a corresponding warped AS FB yields not only a significantly higher signal delay but also requires a phase equalization, see [31].

The spectral weights are calculated by the spectral subtraction rule of (6) using the thresholding of (10) with δ_w(i, λ) ≡ 0.2 for the weights W_i(λ) and δ_w(i, λ) ≡ 0.1 for the weights W̃_i(λ). The spectral noise variances are estimated by minimum statistics [39] and the variances of the late reverberant speech by (13). For the blind estimation of the RT according to Section 4.3, a histogram size of K_l = 400 values and an adaptation rate of R′ = 256 are used. A smoothing factor of β = 0.995 is employed for (19).

The quality of the enhanced speech is evaluated in the time-domain by means of the segmental signal-to-interference ratio (SSIR) (cf., [48]). The difference between the anechoic speech signal of the direct path s_d(k) and the processed speech y(k) = ŝ(k) (after group delay equalization) is expressed by

SSIR_{dB} = \frac{10}{C(F_s)} \sum_{l ∈ F_s} \log_{10}\left( \frac{\sum_{n=0}^{N_f−1} s_d^2(l − n)}{\sum_{n=0}^{N_f−1} (s_d(l − n) − y(l − n))^2} \right).    (20)

The set F_s contains all frame indices corresponding to frames with speech activity and C(F_s) represents its total number of elements.
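A minimal implementation of the SSIR of (20) is given below. A simple frame-energy threshold stands in for the speech-activity set F_s used in the paper, and a small constant guards against division by zero; both are assumptions of this sketch.

    import numpy as np

    def segmental_sir(s_d, y, frame_len=256, active_threshold=1e-6):
        # Segmental signal-to-interference ratio in dB, cf. (20), over half-overlapping
        # frames; frames with negligible direct-path energy are skipped as a crude
        # stand-in for the speech-activity set F_s.
        s_d, y = np.asarray(s_d, dtype=float), np.asarray(y, dtype=float)
        hop = frame_len // 2
        n_frames = (min(len(s_d), len(y)) - frame_len) // hop + 1
        ratios = []
        for l in range(n_frames):
            sd_f = s_d[l * hop : l * hop + frame_len]
            y_f = y[l * hop : l * hop + frame_len]
            sig = np.sum(sd_f ** 2)
            if sig < active_threshold:
                continue
            err = np.sum((sd_f - y_f) ** 2) + 1e-12
            ratios.append(10.0 * np.log10(sig / err))
        return float(np.mean(ratios))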

The speech quality is also evaluated in the frequency-domain by means of the mean log-spectral distance (LSD) between the anechoic speech of the direct path and the processed speech according to

LSD_{dB} = \frac{1}{C(F_s)} \sum_{l ∈ F_s} \sqrt{ \frac{1}{N_f} \sum_{i=0}^{N_f−1} \left| S_{s_d}(i, l) − S_y(i, l) \right|^2 }    (21)

with

S_{s_d}(i, l) = \max\{ 20 \log_{10}(|S_d(i, l)|), δ_{LSD} \},
S_y(i, l) = \max\{ 20 \log_{10}(|Y(i, l)|), δ_{LSD} \},    (22)

where S_d(i, l) and Y(i, l) denote the short-term DFT coefficients of the anechoic and processed speech for frequency index i and frame l. The lower threshold δ_{LSD} confines the dynamic range of the log-spectrum and is set here to −50 dB. Half-overlapping frames with N_f = 256 samples are used for the evaluations.
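The LSD of (21) and (22) can be computed as in the following sketch. The Hann analysis window, the use of the one-sided DFT, and the evaluation over all frames (instead of the speech-active set F_s) are simplifying assumptions of this implementation.

    import numpy as np

    def log_spectral_distance(s_d, y, frame_len=256, floor_db=-50.0):
        # Mean log-spectral distance in dB between clean and processed speech,
        # cf. (21)-(22), over half-overlapping frames of Nf samples.
        s_d, y = np.asarray(s_d, dtype=float), np.asarray(y, dtype=float)
        hop = frame_len // 2
        n_frames = (min(len(s_d), len(y)) - frame_len) // hop + 1
        win = np.hanning(frame_len)
        dists = []
        for l in range(n_frames):
            Sd = np.fft.rfft(win * s_d[l * hop : l * hop + frame_len])
            Y = np.fft.rfft(win * y[l * hop : l * hop + frame_len])
            Sd_db = np.maximum(20.0 * np.log10(np.abs(Sd) + 1e-12), floor_db)   # clipping, cf. (22)
            Y_db = np.maximum(20.0 * np.log10(np.abs(Y) + 1e-12), floor_db)
            dists.append(np.sqrt(np.mean((Sd_db - Y_db) ** 2)))                 # per-frame LSD, cf. (21)
        return float(np.mean(dists))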

A perceptually motivated spectral distance measure is given by the Bark spectral distortion (BSD) [49]. The Bark spectrum is calculated by three main steps: critical band filtering, equal loudness pre-emphasis and a phone-to-sone conversion. The BSD is obtained by the mean difference

between the Bark spectra of the undistorted speech B_{s_d}(i, l) and the enhanced speech B_y(i, l) according to

BSD = \frac{\sum_{l ∈ F_s} \sum_{i=0}^{N_f−1} \left( B_{s_d}(i, l) − B_y(i, l) \right)^2}{\sum_{l ∈ F_s} \sum_{i=0}^{N_f−1} B_{s_d}(i, l)^2}.    (23)

A modification of this measure is given by the modified Bark spectral distortion (MBSD) which also takes into account the noise masking threshold of the human auditory system [50]. The (M)BSD has been originally proposed for the evaluation of speech codecs, but it can also be used as an (additional) quality measure for speech enhancement systems, cf., [22].

The curves for the different measures are plotted in Figure 4. The joint suppression of late reverberant speech and noise yields a significantly better speech quality, in terms of a lower LSD and MBSD as well as a higher SSIR, in comparison to the noise reduction without dereverberation where σ_{z_l}(i, λ) = 0 for (8) and (9), respectively. (Using the cepstral distance (CD) measure led to almost identical results as for the LSD measure.) For low SNRs, the dereverberation effect becomes less significant due to the high noise energy, cf., (8). This is a desirable effect as the impact of reverberation is (partially) masked by the noise in such cases. For high SNRs, the noise reduction alone still achieves a slight improvement as the noise power estimation does not yield zero values. The estimation errors of the blind RT estimation are small enough to avoid noteworthy impairments; the curves for speech enhancement with blind RT estimation are almost identical to those obtained by using the actual RT. (Using other RIRs and noise sequences led to the same results.) Therefore, the new speech enhancement system achieves the same speech quality as the comparable approach of [22] which, however, assumes that a reliable estimate of the RT is given (and considers a common DFT AS FB).

The results of the instrumental measurements comply with our informal listening tests. The new speech enhancement system achieves a significant reduction of background noise and reverberation, but still preserves a natural sound impression. The speech signals enhanced with blind RT estimation and known RT have revealed no audible differences. The noise reduction alone achieves only a slightly audible reduction of reverberation.

6. Conclusions

A new speech enhancement algorithm for the joint suppression of late reverberant speech and background noise is proposed which addresses the special requirements of hearing aids. The enhancement is performed by a generalized spectral subtraction which depends on estimates for the spectral variances of background noise and late reverberant speech. The spectral variances of the late reverberant speech are calculated by a simple rule in dependence of the RT. The time-varying RT is estimated blindly (without dedicated excitation signals) from a noisy and reverberant speech signal by means of an ML estimation and order-statistics filtering.

In reverberant and noisy environments, the devised single-channel speech enhancement system achieves a significant reduction of interferences due to late reverberation


[Figure 4 appears here: three panels plotting (a) LSD in dB, (b) MBSD in dB, and (c) SSIR in dB against the global input SNR from −20 to 40 dB for the following signals: enhancement with known RT, enhancement with blind RT estimation, noise suppression only, reverberant speech, and reverberant and noisy speech.]

Figure 4: Log-spectral distance (LSD), modified Bark spectral distortion (MBSD) and segmental signal-to-interference ratio (SSIR) for varying global input SNRs and different signals.

and additive noise. The enhancement with the blind RT estimation achieves practically the same speech quality as by using the actual RT.

In contrast to existing algorithms for dereverberation and noise reduction, the proposed algorithm has a low signal delay and a reasonable computational complexity, and it requires no (large) microphone array, which is of particular importance for speech enhancement in hearing aids. In comparison to commonly used post-filters in hearing aids which only perform noise reduction, a significantly better subjective and objective speech quality is achieved by the devised system.

Although the use for hearing instruments has been considered primarily here, the proposed algorithm is also suitable for other applications such as speech enhancement in hands-free devices, mobile phones or speech recognition systems.

Acknowledgments

The authors are grateful for the support of GN ReSound, Eindhoven, The Netherlands. They would also like to thank the reviewers for their helpful comments as well as the Institute of Technical Acoustics of RWTH Aachen University for providing the measured RIRs.

References

[1] J. Benesty, S. Makino, and J. Chen, Eds., Speech Enhancement, Springer, Berlin, Germany, 2005.

[2] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment, John Wiley & Sons, Chichester, UK, 2006.

[3] E. Hansler and G. Schmidt, Eds., Speech and Audio Processing in Adverse Environments, Springer, Berlin, Germany, 2008.

[4] R. A. J. de Vries and B. de Vries, "Towards SNR-loss restoration in digital hearing aids," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 4, pp. 4004–4007, Orlando, Fla, USA, May 2002.

[5] V. Hamacher, J. Chalupper, J. Eggers, et al., "Signal processing in high-end hearing aids: state of the art, challenges, and future trends," EURASIP Journal on Applied Signal Processing, vol. 2005, no. 18, pp. 2915–2929, 2005.

[6] K. U. Simmer, J. Bitzer, and C. Marro, "Post-filtering techniques," in Microphone Arrays, M. S. Brandstein and D. B. Ward, Eds., chapter 3, pp. 39–60, Springer, Berlin, Germany, 2001.

[7] H. W. Lollmann and P. Vary, "Post-filter design for superdirective beamformers with closely spaced microphones," in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '07), pp. 291–294, New Paltz, NY, USA, October 2007.

[8] S. Doclo and M. Moonen, "GSVD-based optimal filtering for multi-microphone speech enhancement," in Microphone Arrays, M. S. Brandstein and D. B. Ward, Eds., chapter 6, pp. 111–132, Springer, Berlin, Germany, 2001.

[9] A. Spriet, M. Moonen, and J. Wouters, "Stochastic gradient-based implementation of spatially preprocessed speech distortion weighted multichannel Wiener filtering for noise reduction in hearing aids," IEEE Transactions on Signal Processing, vol. 53, no. 3, pp. 911–925, 2005.


[10] A. K. Nabelek and D. Mason, "Effect of noise and reverberation on binaural and monaural word identification by subjects with various audiograms," Journal of Speech and Hearing Research, vol. 24, no. 3, pp. 375–383, 1981.

[11] A. K. Nabelek, T. R. Letowski, and F. M. Tucker, "Reverberant overlap- and self-masking in consonant identification," The Journal of the Acoustical Society of America, vol. 86, no. 4, pp. 1259–1265, 1989.

[12] N. D. Gaubitch, P. Naylor, and D. B. Ward, "On the use of linear prediction for dereverberation of speech," in Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC '03), pp. 99–102, Kyoto, Japan, September 2003.

[13] T. Nakatani and M. Miyoshi, "Blind dereverberation of single channel speech signal based on harmonic structure," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '03), vol. 1, pp. 92–95, Hong Kong, April 2003.

[14] J. B. Allen, D. A. Berkley, and J. Blauert, "Multimicrophone signal-processing technique to remove room reverberation from speech signals," The Journal of the Acoustical Society of America, vol. 62, no. 4, pp. 912–915, 1977.

[15] R. Martin, "Small microphone arrays with postfilters for noise and acoustic echo reduction," in Microphone Arrays, M. S. Brandstein and D. B. Ward, Eds., chapter 12, pp. 255–279, Springer, Berlin, Germany, 2001.

[16] T. Wittkop and V. Hohmann, "Strategy-selective noise reduction for binaural digital hearing aids," Speech Communication, vol. 39, no. 1-2, pp. 111–138, 2003.

[17] T. Yoshioka, T. Nakatani, T. Hikichi, and M. Miyoshi, "Maximum likelihood approach to speech enhancement for noisy reverberant signals," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), pp. 4585–4588, Las Vegas, Nev, USA, April 2008.

[18] H. Buchner, R. Aichner, and W. Kellermann, "A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 1, pp. 120–134, 2005.

[19] V. Hamacher, U. Kornagel, T. Lotter, and H. Puder, "Binaural signal processing in hearing aids: technologies and algorithms," in Advances in Digital Speech Transmission, R. Martin, U. Heute, and C. Antweiler, Eds., chapter 14, pp. 401–429, John Wiley & Sons, Chichester, UK, 2008.

[20] M. S. Brandstein, "On the use of explicit speech modeling in microphone array applications," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), vol. 6, pp. 3613–3616, Seattle, Wash, USA, May 1998.

[21] K. Lebart, J. M. Boucher, and P. N. Denbigh, "A new method based on spectral subtraction for speech dereverberation," Acta Acustica United with Acustica, vol. 87, no. 3, pp. 359–366, 2001.

[22] E. A. P. Habets, Single- and multi-microphone speech dereverberation using spectral enhancement, Ph.D. dissertation, Eindhoven University, Eindhoven, The Netherlands, June 2007.

[23] R. E. Crochiere, "A weighted overlap-add method of short-time Fourier analysis/synthesis," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 1, pp. 99–102, 1980.

[24] M. A. Stone and B. C. J. Moore, "Tolerable hearing aid delays—II: estimation of limits imposed during speech production," Ear and Hearing, vol. 23, no. 4, pp. 325–338, 2002.

[25] J. M. de Haan, N. Grbic, I. Claesson, and S. Nordholm, "Design of oversampled uniform DFT filter banks with delay specification using quadratic optimization," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '01), vol. 6, pp. 3633–3636, Salt Lake City, Utah, USA, May 2001.

[26] R. W. Bauml and W. Sorgel, "Uniform polyphase filter banks for use in hearing aids: design and constraints," in Proceedings of the 16th European Signal Processing Conference (EUSIPCO '08), Lausanne, Switzerland, August 2008.

[27] H. W. Lollmann and P. Vary, "Efficient non-uniform filter-bank equalizer," in Proceedings of European Signal Processing Conference (EUSIPCO '05), Antalya, Turkey, September 2005.

[28] P. Vary, "An adaptive filter-bank equalizer for speech enhancement," Signal Processing, vol. 86, no. 6, pp. 1206–1214, 2006.

[29] J. M. Kates and K. H. Arehart, "Multichannel dynamic-range compression using digital frequency warping," EURASIP Journal on Applied Signal Processing, vol. 2005, no. 18, pp. 3003–3014, 2005.

[30] H. W. Lollmann and P. Vary, "Uniform and warped low delay filter-banks for speech enhancement," Speech Communication, vol. 49, no. 7-8, pp. 574–587, 2007.

[31] H. W. Lollmann and P. Vary, "Low delay filter-banks for speech and audio processing," in Speech and Audio Processing in Adverse Environments, E. Hansler and G. Schmidt, Eds., chapter 2, pp. 13–61, Springer, Berlin, Germany, 2008.

[32] T. Gulzow, A. Engelsberg, and U. Heute, "Comparison of a discrete wavelet transformation and a nonuniform polyphase filterbank applied to spectral-subtraction speech enhancement," Signal Processing, vol. 64, no. 1, pp. 5–19, 1998.

[33] I. Cohen, "Enhancement of speech using Bark-scaled wavelet packet decomposition," in Proceedings of the 7th European Conference on Speech Communication and Technology (EUROSPEECH '01), pp. 1933–1936, Aalborg, Denmark, September 2001.

[34] T. Fillon and J. Prado, "Evaluation of an ERB frequency scale noise reduction for hearing aids: a comparative study," Speech Communication, vol. 39, no. 1-2, pp. 23–32, 2003.

[35] J. O. Smith III and J. S. Abel, "Bark and ERB bilinear transforms," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 6, pp. 697–708, 1999.

[36] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.

[37] N. Virag, "Single channel speech enhancement based on masking properties of the human auditory system," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 2, pp. 126–137, 1999.

[38] O. Cappe, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 345–349, 1994.

[39] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504–512, 2001.

[40] H. Kuttruff, Room Acoustics, Taylor & Francis, London, UK, 4th edition, 2000.

[41] R. Ratnam, D. L. Jones, B. C. Wheeler, W. D. O'Brien Jr., C. R. Lansing, and A. S. Feng, "Blind estimation of reverberation time," The Journal of the Acoustical Society of America, vol. 114, no. 5, pp. 2877–2892, 2003.

[42] R. Ratnam, D. L. Jones, and W. D. O'Brien Jr., "Fast algorithms for blind estimation of reverberation time," IEEE Signal Processing Letters, vol. 11, no. 6, pp. 537–540, 2004.

Page 112: Digital Signal Processing for Hearing Instruments

EURASIP Journal on Advances in Signal Processing 9

[43] H. W. Lollmann and P. Vary, “Estimation of the reverberationtime in noisy environments,” in Proceedings of InternationalWorkshop on Acoustic Echo and Noise Control (IWAENC ’08),pp. 1–4, Seattle, Wash, USA, September 2008.

[44] I. Pitas and A. N. Venetsanopoulos, “Order statistics in digitalimage processing,” Proceedings of the IEEE, vol. 80, no. 12, pp.1893–1921, 1992.

[45] J. Y. C. Wen, E. A. P. Habets, and P. A. Naylor, “Blind estima-tion of reverberation time based on the distribution of signaldecay rates,” in Proceedings of IEEE International Conference onAcoustic, Speech, and Signal Processing (ICASSP ’08), pp. 329–332, Las Vegas, Nev, USA, March-April 2008.

[46] M. R. Schroeder, “New method of measuring reverberationtime,” The Journal of the Acoustical Society of America, vol. 37,no. 3, pp. 409–412, 1965.

[47] N. Xiang, “Evaluation of reverberation times using a nonlinearregression approach,” The Journal of the Acoustical Society ofAmerica, vol. 98, no. 4, pp. 2112–2121, 1995.

[48] P. A. Naylor and N. D. Gaubitch, “Speech dereverberation,”in Proceedings of International Workshop on Acoustic Echoand Noise Control (IWAENC ’05), pp. 89–92, Eindhoven, TheNetherlands, September 2005.

[49] S. Wang, A. Sekey, and A. Gersho, “An objective measure forpredicting subjective quality of speech coders,” IEEE Journalon Selected Areas in Communications, vol. 10, no. 5, pp. 819–829, 1992.

[50] W. Yang, M. Benbouchta, and R. Yantorno, “Performance ofthe modified Bark spectral distortion as an objective speechquality measure,” in Proceedings of IEEE International Confer-ence on Acoustics, Speech, and Signal Processing (ICASSP ’98),vol. 1, pp. 541–544, Seattle, Wash, USA, May 1998.

Page 113: Digital Signal Processing for Hearing Instruments

Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 403681, 17 pages
doi:10.1155/2009/403681

Research Article

A Computational Auditory Scene Analysis-Enhanced Beamforming Approach for Sound Source Separation

L. A. Drake,1 J. C. Rutledge,2 J. Zhang,3 and A. Katsaggelos (EURASIP Member)4

1 JunTech Inc., 2314 E. Stratford Ct, Shorewood, WI 53211, USA
2 Computer Science and Electrical Engineering Department, University of Maryland, Baltimore County, Baltimore, MD 21250, USA
3 Electrical Engineering and Computer Science Department, University of Wisconsin-Milwaukee, Milwaukee, WI 53201, USA
4 Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA

Correspondence should be addressed to L. A. Drake, [email protected]

Received 1 December 2008; Revised 18 May 2009; Accepted 12 August 2009

Recommended by Henning Puder

Hearing aid users have difficulty hearing target signals, such as speech, in the presence of competing signals or noise. Most solutions proposed to date enhance or extract target signals from background noise and interference based on either location attributes or source attributes. Location attributes typically involve arrival angles at a microphone array. Source attributes include characteristics that are specific to a signal, such as fundamental frequency, or statistical properties that differentiate signals. This paper describes a novel approach to sound source separation, called computational auditory scene analysis-enhanced beamforming (CASA-EB), that achieves increased separation performance by combining the complementary techniques of CASA (a source attribute technique) with beamforming (a location attribute technique), complementary in the sense that they use independent attributes for signal separation. CASA-EB performs sound source separation by temporally and spatially filtering a multichannel input signal, and then grouping the resulting signal components into separated signals, based on source and location attributes. Experimental results show increased signal-to-interference ratio with CASA-EB over beamforming or CASA alone.

Copyright © 2009 L. A. Drake et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

People often find themselves in cluttered acoustic environments, where what they want to listen to is mixed with noise, interference, and other acoustic signals of no interest. The problem of extracting an acoustic signal of interest from background clutter is called sound source separation and, in psychoacoustics, is also known as the "cocktail party problem." Such "hearing out" of a desired signal can be particularly challenging for hearing aid users, who often have reduced localization abilities. Sound source separation could allow them to distinguish better between multiple speakers, and thus, hear a chosen speaker more clearly. Separated signals from a sound source separation system can be further enhanced through techniques such as amplitude compression for listeners with sensorineural hearing loss and are also suitable for further processing in other applications, such as teleconferencing, automatic speech recognition, automatic transcription of ensemble music, and modeling the human auditory system.

There are three main approaches to the general sound source separation problem: blind source separation methods, those that use location attributes, and those that use source attributes. Blind source separation techniques separate sound signals based on the assumption that the signals are "independent," that is, that their nth-order joint moments are equal to zero. When 2nd-order statistics are used, the method is called principal component analysis (PCA); when higher-order statistics are used, it is called independent component analysis (ICA). Blind source separation methods can achieve good performance. However, they require the observation data to satisfy some strict assumptions that may not be compatible with a natural listening environment. Besides the "independence" requirement, they can also require one or more of the following: a constant mixing process, a known and fixed number of sources, and an equal number of sources and observations [1]. Location and source attribute-based methods do not require any of these, and thus, are effective for a wider range of listening environments.


Location attributes describe the physical location of a sound source at the time it produced the sound. For example, a sound passes across a microphone array from some direction, and this direction, called "arrival angle," is a location attribute. One location-attribute-based technique is binaural CASA [2–4]. Based on a model of the human auditory system, binaural sound source separation uses binaural data (sound "heard" at two "ears") to estimate the arrival angle of "dominant" single-source sounds. It does this by comparing the binaural data's interaural time delays and interaural intensity differences to a look-up table and selecting the closest match. While binaural CASA performance is impressive for a two-microphone array (two ears), improved performance may be achieved by using larger arrays, as in beamforming. In addition to lifting the two-microphone restriction of binaural CASA, microphone array processing is also more amenable to quantitative performance analysis since it is a mathematically derived approach.

Beamforming uses spatially sampled data from an array of two or more sensors to estimate arrival angles and waveforms of "dominant" signals in the wavefield. Generally, the idea is to combine the sensor measurements in some way so that desirable signals add constructively, while noise and interference are reduced. Various beamforming methods (taken and adapted from traditional array processing for applications such as radar and sonar) have been developed for and applied to speech and other acoustic signals. A review of these "microphone array processing" methods can be found in [5]. Regardless of which specific location method is chosen, however, and how well it works, it still cannot separate signals from the same (or from a close) location, since location is its cue for separation [6].

In this paper, we present a novel technique combining beamforming with a source attribute technique, monaural CASA. This category of source attribute methods models how human hearing separates multispeaker input to "hear out" each speaker individually. Source attributes describe the state of the sound source at the time it produces the sound. For example, in the case of a voiced speech sound, fundamental frequency (F0) is a source attribute that indicates the rate at which the speaker's glottis opens and closes. Monaural CASA [3, 7–12] is based on a model of how the human auditory system performs monaural sound source separation. It groups "time-frequency" signal components with similar source attributes, such as fundamental frequency (F0), amplitude modulation (AM), onset/offset times, and timbre. Such signal component groups then give the separated sounds.

Location-attribute techniques can separate signals better in some situations than source-attribute techniques can. For example, since location attributes are independent of signal spectral characteristics, they can group harmonic and inharmonic signals equally well. Source-attribute techniques such as monaural CASA, on the other hand, have trouble with inharmonic signals. Similarly, when a signal changes its spectral characteristics abruptly, for example, from a fricative to a vowel in a speech signal, the performance of location-attribute techniques will not be affected. Source-attribute techniques, on the other hand, may mistakenly separate the fricative and the vowel, assigning them to different sound sources.

Source-attribute techniques can also perform better than location-attribute methods in some situations. Specifically, they can separate sound mixtures in which the single-source signals have close or equal arrival angles. Their complementary strengths suggest that combining these two techniques may provide better sound source separation performance than using either method individually. Indeed, previously published work combining monaural and binaural CASA shows that this is a promising idea [3, 13].

In this paper, we exploit the idea of combining location and source attributes further by combining beamforming with monaural CASA into a novel approach called CASA-Enhanced Beamforming (CASA-EB). The main reason for using beamforming rather than binaural CASA as the location-attribute technique here is that beamforming may provide higher arrival angle resolution through the use of larger microphone arrays and adaptive processing. In addition, beamforming is more subject to quantitative analysis.

2. CASA-EB Overview

We begin by introducing some notation and giving a more precise definition of sound source separation. Suppose a multisource sound field is observed by an array of M acoustic sensors (microphones). This produces M observed mixture signals:

$y[m,n] = \sum_{k=1}^{K} x_k[m,n] + w[m,n], \quad m = 1, 2, \ldots, M,$  (1)

where n is the time index, m is the microphone index, xk[m,n] is the kth source signal as observed at the mth microphone, and w[m,n] is the noise in the observation (background and measurement noise). The goal of sound source separation, then, is to make an estimate of each of the K single-source signals in the observed mixture signals

$\hat{x}_k[n], \quad k \in \{1, 2, \ldots, K\},$  (2)

where the hat is used to indicate estimation, and the estimate may differ from the source signal by a delay and/or scale factor.

In our CASA-EB approach, sound source separation is achieved in two steps. As shown in Figure 1, these are signal analysis and grouping. In the signal analysis step, the array observations, y[m,n], are transformed into signal components in a 3D representation space with dimensions: time frame ρ, frequency band ω, and arrival angle band φ (see illustration in Figure 2). This is accomplished in two substeps: temporal filtering of y[m,n] through a bandpass filterbank, followed by spatial filtering of the resulting bandpass signals. In the grouping step, selected signal components from this 3D CASA-EB representation space are grouped to form the separated single-source signals (see the illustration in Figure 3). Grouping consists of three substeps: selecting signal components to group, estimating their attributes, and finally grouping selected signal components that share common attribute values.


Figure 1: Block diagram of CASA-EB. The signal analysis stage comprises temporal filtering and spatial filtering; the grouping stage comprises signal component selection (arrival angle (φ) detection and frequency band (ω) selection), attribute estimation (source attribute F0 and location attribute φ), and signal component grouping (short-time sequential, simultaneous, and linking of short-time groups), followed by waveform resynthesis.

Figure 2: CASA-EB representations of a siren (a) and a simple harmonic signal (b). The projections on the time-frequency plane (the signal's spectrogram) and the time-arrival angle plane (the signal's arrival angle path) are also shown.


Figure 3: Separated signals from a two-signal mixture. This figure shows separated signal component groups for an example mixture signal, the sum of the two signals shown in Figure 2. The signal component groups are formed by collecting together signal components with similar location and source attributes (details in Section 4).

A summary of the CASA-EB processing steps and the methods used to implement them is given in Table 1. The details of these are described below: signal analysis in Section 3 and grouping in Section 4. Then, Section 5 discusses how waveforms of the separated single-source signals can be synthesized from their signal component groups. Finally, after this presentation of the CASA-EB method, experimental results are presented in Section 6.

3. CASA-EB Representation Space

As just described, the first step in our approach is signal analysis. The array observations y[m,n] are filtered along both the temporal and spatial dimensions to produce "frequency components"

$Y[\phi,\omega,n] = T_\phi\{y_\omega[m,n]\}, \quad \text{with } y_\omega[m,n] = y[m,n] * h_\omega[n],$  (3)


Table 1: Summary of CASA-EB methods.

Signal analysis
  Temporal filtering: Gammatone filterbank
  Spatial filtering: Delay-and-sum beamformer
Grouping
  Signal component selection (φ): STMV beamforming
  Signal component selection (ω): Signal detection using the MDL criterion
  Attribute estimation (F0): Autocorrelogram
  Attribute estimation (φ): From P[φ,ω,ρ]
  Signal component grouping (short-time sequential): Kalman filtering with Munkres' optimal data association algorithm
  Signal component grouping (simultaneous): Clustering via a hierarchical partitioning algorithm
  Signal component grouping (linking short-time groups): Munkres' optimal data association algorithm
Waveform resynthesis
  Over frequency: Adding together grouped signal components
  Over time: Overlap-add

where hω[n] is a bandpass temporal filter associated with the frequency band indexed by ω, and Tφ is a spatial transform associated with the arrival angle band indexed by φ (details of these signal analyses follow below).

The "frequency components" Y[φ,ω,n] are used later in the processing (Section 4.2) for estimation of a grouping attribute, fundamental frequency, and also for waveform resynthesis. The signal components to be grouped in CASA-EB are those of its 3D representation shown in Figure 2; these are the power spectral components of the Y[φ,ω,n], obtained in the usual way as the time-average of their magnitudes squared

$P[\phi,\omega,\rho] = \frac{1}{N_\omega} \sum_{n'=\rho T-(N_\omega-1)/2}^{\rho T+(N_\omega-1)/2} \left|Y[\phi,\omega,n']\right|^2,$  (4)

where the P[φ,ω,ρ] are downsampled from the Y[φ,ω,n] with downsampling rate T, that is, ρ = n/T, and Nω is the number of samples of Y[φ,ω,n] in frequency band ω that are used to compute one sample of P[φ,ω,ρ].
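For illustration only, a minimal NumPy sketch of (4) is given below; the array shapes, the per-band window lengths Nω, and the frame spacing T are assumptions of this sketch, not the authors' implementation.

import numpy as np

def power_components(Y, T, N_omega):
    """Compute P[phi, omega, rho] per Eq. (4): the time-average of
    |Y[phi, omega, n]|^2 over N_omega[w] samples centred on rho*T.
    Y has shape (n_angles, n_bands, n_samples)."""
    n_angles, n_bands, n_samples = Y.shape
    n_frames = n_samples // T
    P = np.zeros((n_angles, n_bands, n_frames))
    for rho in range(n_frames):
        for w in range(n_bands):
            N = N_omega[w]
            lo = max(rho * T - (N - 1) // 2, 0)
            hi = min(rho * T + (N - 1) // 2 + 1, n_samples)
            P[:, w, rho] = np.mean(np.abs(Y[:, w, lo:hi]) ** 2, axis=-1)
    return P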

3.1. Temporal Filtering. For the temporal filterbank, hω[n], ω ∈ {1, 2, ..., Ω}, we have used a modified gammatone filterbank. It consists of constant-Q filters in high frequency bands (200 to 8000 Hz) and constant-bandwidth filters in lower frequency bands (below 200 Hz). (Constant-Q filters are a set of filters that all have the same quotient (Q), or ratio of center frequency to bandwidth.) Specifically, the constant-Q filters are the fourth-order gammatone functions,

$h_\omega[n] = \alpha^{\omega} \cdot e^{-\beta(\alpha^{\omega} n T_s)}\, (\alpha^{\omega} n T_s)^3\, e^{j2\pi (f_s/2)(\alpha^{\omega} n T_s)}\, u[n],$  (5)

where the frequency band indices (ω = 1, 2, ..., 75) are in reverse order, that is, the lower indices denote higher frequencies, fs and Ts are the sampling frequency and sampling period, u[n] is the unit step function, and α and β are parameters that can be used to adjust filter characteristics such as bandwidths and spacing on the frequency axis. For CASA-EB, α = 0.95 and β = 2000 work well. The constant-bandwidth filters are derived by downshifting the lowest frequency constant-Q filter (ω = 75) by integer multiples of its bandwidth

$h_\omega[n] = h_{75}[n]\, e^{-j2\pi(\omega-75)B_{75} n},$  (6)

where ω = 76, 77, ..., 90, and B75 is the bandwidth of the lowest frequency constant-Q filter.

The modified gammatone filterbank is used for temporal filtering because it divides the frequency axis efficiently for CASA-EB. Specifically, for CASA, the frequency bands are just narrow enough that the important spectral features of a signal (such as harmonics in low frequencies and formants in high frequencies) can be easily distinguished from each other. For beamforming, the bands are narrow enough to limit spatial filtering errors to an acceptable level.
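A sketch of how such a filterbank could be built from (5) and (6) follows; the reading of the time-scale factor as α to the power ω, the filter length, and the numeric value of B75 are our assumptions and are not taken from the article.

import numpy as np

def gammatone_filterbank(fs=16000, n_const_q=75, n_const_bw=15,
                         alpha=0.95, beta=2000, length=1024, B75=10.0):
    """Modified gammatone filterbank sketch per Eqs. (5)-(6).
    Bands 1..75: constant-Q fourth-order gammatones (lower index = higher
    frequency); bands 76..90: constant-bandwidth downshifts of band 75.
    B75 (Hz) is an assumed placeholder for the bandwidth of band 75.
    Returns a (90, length) array of complex impulse responses."""
    Ts = 1.0 / fs
    n = np.arange(length)
    filters = []
    for w in range(1, n_const_q + 1):
        t = (alpha ** w) * n * Ts                      # scaled time axis
        h = (alpha ** w) * np.exp(-beta * t) * t ** 3 \
            * np.exp(1j * 2 * np.pi * (fs / 2) * t)    # Eq. (5)
        filters.append(h)
    h75 = filters[-1]
    for k in range(1, n_const_bw + 1):
        # Eq. (6): downshift band 75 by k multiples of its bandwidth
        filters.append(h75 * np.exp(-1j * 2 * np.pi * k * B75 * n * Ts))
    return np.array(filters)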

3.2. Spatial Filtering. The spatial transform, Tφ, that we are using is the well-known delay-and-sum beamformer

$T_\phi\{y_\omega[m,n]\} = \frac{1}{M}\sum_{m=1}^{M} y_\omega[m,n]\cdot e^{j2\pi(m-1) f_\phi}, \quad \text{with } f_\phi = \frac{f_\omega d}{C}\sin\phi, \quad \phi \in \left[-\frac{\pi}{2}, +\frac{\pi}{2}\right],$  (7)

where fω is the center frequency of frequency band ω, d is the distance between adjacent microphones in a uniform linear array, and C is the speed of sound at standard temperature and pressure.

Delay-and-sum beamforming is used here for the signal analysis in our general solution to the sound source separation problem because it does not cancel correlated signals, for example, echoes (as MV beamforming can), and does not require a priori information or explicit modeling of target signals, interferers, or noise (as other data adaptive beamforming can). Its drawback is that, since it has relatively low arrival angle resolution, each signal component will contain more interference from neighboring arrival angle bands. In CASA-EB, this is ameliorated somewhat by the additional separation power provided by monaural CASA. For specific applications, CASA-EB performance may be improved by defining signal and/or noise models and using a data adaptive beamformer.

In summary, the 3D CASA-EB representation space consists of signal components P[φ,ω,ρ] generated by filtering a temporally and spatially sampled input signal along both of these dimensions (to produce frequency components Y[φ,ω,n]), and then taking the average magnitude squared of these.
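As a concrete illustration of (7), a minimal delay-and-sum sketch for one frequency band is shown below; the speed-of-sound constant and the array layout (uniform linear, spacing d) are assumptions of the sketch.

import numpy as np

C_SOUND = 343.0  # assumed speed of sound (m/s)

def delay_and_sum(y_w, f_w, phi, d):
    """Delay-and-sum beamformer of Eq. (7) for one frequency band.
    y_w: (M, N) bandpass array signals; f_w: band centre frequency (Hz);
    phi: steering angle (rad); d: microphone spacing (m).
    Returns the (N,) beamformed band signal Y[phi, omega, n]."""
    M = y_w.shape[0]
    f_phi = f_w * d * np.sin(phi) / C_SOUND
    phases = np.exp(1j * 2 * np.pi * np.arange(M) * f_phi)   # e^{j 2 pi (m-1) f_phi}
    return (phases[:, None] * y_w).mean(axis=0)              # 1/M sum over microphones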

4. CASA-EB Grouping to Separate Single-Source Signals

As described previously, the second step in CASA-EB is to group signal components from the time-frequency-arrival angle space into separated single-source signal estimates. Grouping consists of three steps: selecting the signal components for grouping, estimating their location and source attributes, and finally, grouping those with similarly valued attributes to form the separated single-source signal estimates. The details of these three steps are given in the following three subsections.

4.1. Signal Component Selection. In this step, the set of all signal components (P[φ,ω,ρ]) is pruned to produce a subset of "significant" signal components, which are more likely to have come from actual sound sources of interest and to constitute the main part of their signals. Grouping is then performed using only this subset of signals. Experience and experimental results indicate that this type of before-grouping pruning does not adversely affect performance and has the following two benefits. First, it reduces the computational complexity of grouping, and second, it increases grouping robustness (since there are fewer spurious signal components to throw the grouping operation "off-track"). Now, we describe the signal component selection process in more detail.

4.1.1. Arrival Angle Detection. This process begins with pruning away signal components from arrival angles in which it is unlikely there is any audible target sound, that is, from angles within which the signal power is low. There are a variety of ways to detect such low-power arrival angles. For example, a simple way is, for a given time frame ρ, to add up the power spectral components P[φ,ω,ρ] in each arrival angle band φ

$P[\phi] = \sum_{\omega} P[\phi,\omega,\rho].$  (8)

In this work, we are using a wideband adaptive beamformer by Krolik, the steered minimum variance (STMV) beamformer [14]. This wideband method is an adaptation of Capon's [15] narrowband minimum variance (MV) beamformer. The MV beamformer is a constrained optimization method that produces a spatial spectral estimate in which power is minimized subject to the constraint of unity gain in the look direction, that is,

$\min_{\mathbf{w}} \left[\mathbf{w}^{+} \cdot R_f \cdot \mathbf{w}\right] \quad \text{subject to } \mathbf{a}_f^{+}(\phi)\cdot\mathbf{w} = 1,$  (9)

where w is the beamformer weight vector, Rf is the covariance matrix of a narrowband array observation vector with frequency f, + indicates conjugate transpose, and af(φ) = [1 e^{−j2π f t_1(φ)} · · · e^{−j2π f t_{M−1}(φ)}]^T is the "steering vector."

The solution to (9) gives the MV beamformer spatial spectral estimate:

$P_f[\phi] = \left[\mathbf{a}_f^{+}(\phi)\cdot R_f^{-1}\cdot \mathbf{a}_f(\phi)\right]^{-1}.$  (10)

To apply this narrowband method to a wideband signal, one could just filter the wideband array observations, apply the narrowband method individually in each band, and then sum up the results across frequency. This "incoherent" wideband method, however, does not take full advantage of the greater statistical stability of the wideband signal, a goal of wideband methods such as STMV beamforming. To achieve this goal, a wideband method must use a statistic computed across frequency bands.

In light of the above, STMV beamforming is an adaptation of MV beamforming in which a wideband composite covariance matrix (Rst[φ], defined below) is used in place of the narrowband one, and the steering vector in the constraint is adjusted appropriately (more on this below):

$\min_{\mathbf{w}} \left[\mathbf{w}^{+} \cdot R_{st}[\phi] \cdot \mathbf{w}\right] \quad \text{subject to } \mathbf{1}^T \cdot \mathbf{w} = 1,$  (11)

where 1 is an M × 1 vector of ones. The STMV beamformer solution is

$P[\phi] = \left[\mathbf{1}^T \cdot R_{st}[\phi]^{-1} \cdot \mathbf{1}\right]^{-1}.$  (12)

To compute the wideband composite covariance matrix Rst[φ] from the array observation vectors, some preprocessing is performed first. The y[m,n] are bandpass filtered (as in (3)), and then the resulting narrowband signals are "presteered" as follows:

$y_\omega^{st}[m,n] = T_{st}[f_\omega,\phi]\cdot y_\omega[m,n],$  (13)

where fω is the center frequency of frequency band ω, the steering matrix Tst[fω,φ] is a diagonal matrix with diagonal elements [1 e^{j2π fω t_1(φ)} · · · e^{j2π fω t_{M−1}(φ)}], and tm(φ) is the time delay between the mth sensor and a reference sensor (sensor 1) for a narrowband signal e^{−j2π fω t} from angle φ. Such presteering has the effect of zeroing out inter-sensor time delays tm(φ) in narrowband signals from angle φ. For example, for the narrowband signal s(t) = [1 e^{−j2π fω t_1(φ)} · · · e^{−j2π fω t_{M−1}(φ)}],

$T_{st}[f_\omega,\phi]\cdot s(t) = \mathbf{1}.$  (14)

Thus, the effect of preprocessing the wideband array observations is to make the steering vectors equal for all frequency bands (afω(φ) = 1), and this provides a frequency-independent steering vector to use in the STMV beamformer's unity-gain constraint.

Now, given the presteered array observations, the wideband composite covariance matrix is simply

$R_{st}[\phi] = \sum_{\omega=l}^{h}\; \sum_{n=n_0}^{n_0+(N-1)} y_\omega^{st}[m,n]\cdot y_\omega^{st+}[m,n] = \sum_{\omega=l}^{h} T_{st}[f_\omega,\phi]\cdot R_\omega\cdot T_{st}^{+}[f_\omega,\phi],$  (15)

where Rω is the covariance matrix of yω[m,n], and the summations run from frequency band l to h and from time index n0 to n0 + (N − 1).

The advantage of Krolik's technique over that of (8) and other similar data-independent beamforming techniques is that it provides higher arrival angle resolution. Compared to other data adaptive methods, it does not require a priori information about the source signals and/or interference, does not cancel correlated signals (as MV beamforming is known to do), and is not vulnerable to source location bias (as other wideband adaptive methods, such as the coherent signal-subspace methods, are [16]).
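To make the presteering and composite-covariance steps concrete, a sketch following (12)-(15) is given below; the diagonal loading term, the input layout, and the delay model tm(φ) = m·d·sin(φ)/c for a uniform linear array are assumptions of this sketch rather than details taken from [14].

import numpy as np

def stmv_spectrum(y_bands, f_centers, phis, d, c=343.0):
    """Steered minimum variance (STMV) spatial spectrum, Eqs. (13)-(15) and (12).
    y_bands: (n_bands, M, N) narrowband array signals y_omega[m, n];
    f_centers: band centre frequencies (Hz); phis: candidate steering angles (rad).
    Returns P[phi] for each candidate angle."""
    n_bands, M, N = y_bands.shape
    mics = np.arange(M)
    ones = np.ones(M)
    P = np.zeros(len(phis))
    for i, phi in enumerate(phis):
        R_st = np.zeros((M, M), dtype=complex)
        tau = mics * d * np.sin(phi) / c                      # inter-sensor delays t_m(phi)
        for w in range(n_bands):
            T_st = np.diag(np.exp(1j * 2 * np.pi * f_centers[w] * tau))
            y_st = T_st @ y_bands[w]                          # presteering, Eq. (13)
            R_st += y_st @ y_st.conj().T / N                  # composite covariance, Eq. (15)
        R_inv = np.linalg.inv(R_st + 1e-6 * np.eye(M))        # small loading for stability (assumed)
        P[i] = 1.0 / np.real(ones @ R_inv @ ones)             # Eq. (12)
    return P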

4.1.2. Frequency Band Selection. Now, for each detected arrival angle band, φ0, the next step is to select the significant signal components from that arrival angle band. This is done in two steps. First, high-power signal components are detected, and low-power ones pruned. Then, the high-power components are further divided into peaks (i.e., local maxima) and their neighboring non-peak components. Although all the high-power components will be included in the separated signals, only the peak components need to be explicitly grouped. Due to the nature of the gammatone filterbank we are using, the non-peak components can be added back into the separated signal estimates later at signal reconstruction time, based on their relationship with a peak. Consider the following. Since the filterbank's neighboring frequency bands overlap, a high-power frequency component sufficient to generate a peak in a given band is also likely to contribute significant related signal power in neighboring bands (producing non-peak components). Thus, these non-peak components are likely to be generated by the same signal feature as their neighboring peak, and it is reasonable to associate them.

Low-power signal components are detected and pruned using a technique by Wax and Kailath [17]. In their work, a covariance matrix is computed from multichannel input data, and its eigenvalues are sorted into a low-power set (from background noise) and a high-power set (from signals). The sorting is accomplished by minimizing an information theoretic criterion, such as Akaike's Information Criterion (AIC) [18, 19] or the Minimum Description Length (MDL) criterion [20, 21]. The MDL is discussed here since it is the one used in CASA-EB. From [17], it is defined as

$\mathrm{MDL} = -\log\left(\frac{\prod_{i=\lambda+1}^{L} l_i^{1/(L-\lambda)}}{\frac{1}{L-\lambda}\sum_{i=\lambda+1}^{L} l_i}\right)^{(L-\lambda)N_t} + \frac{1}{2}\lambda(2L-\lambda)\log N_t,$  (16)

where λ ∈ {0, 1, ..., L − 1} is the number of possible signal eigenvalues and the parameter over which the MDL is minimized, L is the total number of eigenvalues, li is the ith largest eigenvalue, and Nt is the number of time samples of the observation vectors used to estimate the covariance matrix. The λ that minimizes the MDL (λmin) is the estimated number of signal eigenvalues, and the remaining (L − λmin) smallest eigenvalues are the detected noise eigenvalues. Notice that this MDL criterion is entirely a function of the (L − λ) smallest eigenvalues, and not the larger ones. Thus, in practice, it distinguishes between signal and noise eigenvalues based on the characteristics of the background noise. Specifically, it detects a set of noise eigenvalues with relatively low and approximately equal power. Wax and Kailath use this method to estimate the number of signals in multichannel input data. We use it to detect and remove the (L − λmin) low-power noise components P[φ,ω,ρ], by treating the P[φ,ω,ρ] as the eigenvalues in their method. We chose this method for noise detection because it works based on characteristics of the noise, rather than relying on arbitrary threshold setting.
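The sketch below evaluates (16) over λ = 0, ..., L − 1 with the power components standing in for eigenvalues, as the text describes; it is an illustrative rendering of the criterion, not the authors' code, and it assumes strictly positive input values.

import numpy as np

def mdl_signal_count(values, Nt):
    """Wax-Kailath MDL criterion, Eq. (16), with the power components
    P[phi, omega, rho] of one arrival-angle band playing the role of
    eigenvalues. Returns lambda_min, the estimated number of 'signal'
    components; the remaining (L - lambda_min) smallest values are noise."""
    l = np.sort(np.asarray(values, dtype=float))[::-1]        # descending order
    L = len(l)
    mdl = np.empty(L)
    for lam in range(L):
        tail = l[lam:]                                        # the (L - lam) smallest values
        geo = np.exp(np.mean(np.log(tail)))                   # geometric mean
        ari = np.mean(tail)                                   # arithmetic mean
        mdl[lam] = -(L - lam) * Nt * np.log(geo / ari) \
                   + 0.5 * lam * (2 * L - lam) * np.log(Nt)
    return int(np.argmin(mdl))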

In summary, signal component selection/pruning is accomplished in two steps. For each fixed time frame ρ, high-power arrival angle bands are detected, and signal components from low-power arrival angle bands are removed. Then, in high-power arrival angle bands, low-power signal components are removed and high-power signal components are divided into peaks (for grouping) and non-peaks (to be added back into the separated signal estimates after grouping, at signal reconstruction time).

4.2. Attribute Estimation. In the previous section, we described how signal components in the CASA-EB representation can be pruned and selected for grouping. In this section, we describe how to estimate the selected signal components' attributes that will be used to group them. In this work, we estimate two types of signal attributes, location attributes and source attributes. As described in the introduction, these are complementary. Used together, they may allow more types of sound mixtures to be separated and produce more completely separated source signals.

4.2.1. Location Attribute. For a selected signal component, P[φ,ω,ρ], the location attribute used in CASA-EB is its arrival angle band, or simply its φ index. This is the delay-and-sum beamformer steering angle from the spatial filtering step in Section 3.

4.2.2. Source Attribute. Source attributes are features embedded in a signal that describe the state of the signal's source at the time it produced the signal. In the previous work, several different source attributes have been used, including F0 [2, 3, 8–11, 22, 23], amplitude modulation [8], onset time [9, 23], offset time [9], and timbre [24]. In this work, we use an F0 attribute. Since F0 is the most commonly used, its use here will allow our results to be compared to those of others more easily. Next, we discuss F0 estimation in more detail.

There are two main approaches to F0 estimation: spectral peak-based and autocorrelation-based methods. The spectral peak-based approach is straightforward when there is only one harmonic group in the sound signal. In this case, it detects peaks in the signal's spectrum and estimates F0 by finding the greatest common divisor of their frequencies. However, complications arise when the signal contains more than one harmonic group. Specifically, there is the added "data association problem," that is, the problem of determining the number of harmonic groups and which spectral peaks belong to which harmonic groups. The autocorrelation-based approach handles the data association problem more effectively and furthermore, as indicated in [25], also provides more robust F0 estimation performance. Hence, an autocorrelation-based method is used in this work.

The basic idea behind the autocorrelation method is that a periodic signal will produce peaks in its autocorrelation function at integer multiples of its fundamental period, and these can be used to estimate F0. To use F0 as an attribute for grouping signal components, however, it is also necessary to be able to associate the signal components P[φ,ω,ρ] with the F0 estimates. This can be done using an extension of the autocorrelation method, the autocorrelogram method.

Detailed descriptions of the autocorrelogram method can be found in [9–11, 25–30]. To summarize here, the steps of this method are the following. First, an input signal X[n] is filtered either by a set of equal-bandwidth bandpass filters covering the audible range of frequencies, or more often, by a filtering system based more closely on the human auditory system, such as a gammatone filterbank. This filtering produces the bandpass signals Xω[n]. Then, to form the autocorrelogram, an autocorrelation of the filtered signal is computed in each band and optionally normalized by the signal power in the band:

$\mathrm{acm}[\omega,\tau] = \frac{R_{X_\omega}[\tau]}{R_{X_\omega}[0]}.$  (17)

For an illustration, see Figure 4. Next, a summary autocorrelogram is computed by combining the narrowband autocorrelations over frequency and optionally applying a weighting function to emphasize low-frequency peaks:

$\mathrm{sacm}[\tau] = \frac{1}{\Omega}\sum_{\omega=1}^{\Omega} \mathrm{acm}[\omega,\tau]\cdot w[\tau],$  (18)

where

$w[\tau] = \exp\left[\frac{-\tau}{N_\tau}\right]$  (19)

is a low-frequency emphasis function, and Nτ is the number of time lags at which the autocorrelogram is computed.
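A minimal sketch of (17)-(19) follows; it assumes real-valued bandpass signals as input and is intended only to illustrate the computation, not to reproduce the authors' implementation.

import numpy as np

def autocorrelogram(X_bands, n_lags):
    """Power-normalised autocorrelogram, Eq. (17), and summary
    autocorrelogram with low-frequency emphasis, Eqs. (18)-(19).
    X_bands: (n_bands, N) bandpass signals X_omega[n]; n_lags: N_tau."""
    n_bands, N = X_bands.shape
    acm = np.zeros((n_bands, n_lags))
    for w in range(n_bands):
        x = X_bands[w]
        r0 = np.dot(x, x) + 1e-12                           # R[0], guarded against zero
        for tau in range(n_lags):
            acm[w, tau] = np.dot(x[: N - tau], x[tau:]) / r0  # R[tau] / R[0]
    weight = np.exp(-np.arange(n_lags) / n_lags)            # Eq. (19)
    sacm = acm.mean(axis=0) * weight                        # Eq. (18)
    return acm, sacm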

Figure 4: Autocorrelogram representation of a sum of sinusoids. The signal X[n] = Σ_{r=1}^{5} sin(2π · 300r · nTs), with Ts = 1/16 000 s/sample, is shown in (a). Panel (b) shows the power-normalized autocorrelogram, acm[ω,τ] = RXω[τ]/RXω[0], where RXω[τ] is the autocorrelation of the filtered signal Xω[n] = X[n] * hω[n]; the maximum value is displayed in white, the minimum in black. Finally, the summary autocorrelogram, sacm[τ] = ((1/Ω) Σ_{ω=1}^{Ω} acm[ω,τ]) · w[τ], is shown in (c).

For an example of the summary autocorrelogram, see Figure 4. Finally, F0 estimates are made based on peaks in the summary autocorrelogram, and overtones of these are identified by associating peaks in the autocorrelogram with the F0-estimate peaks in the summary autocorrelogram.

For CASA-EB, we are using the following implementation of the autocorrelogram method. In each time frame ρ, an autocorrelogram and summary autocorrelogram are computed for each detected arrival angle band φ0 (from Section 4.1), and a single F0 analysis is made from each such autocorrelogram/summary autocorrelogram pair. That is, for each φ0, an autocorrelogram and summary autocorrelogram are computed from the temporally and spatially filtered signal, Y[φ0,ω,n], ω ∈ {1, 2, ..., Ω} and n ∈ {ρT − Nτ/2 + 1, ..., ρT + Nτ/2}, where we used Nτ = 320 (equivalent to 20 milliseconds). Then, for this arrival angle band and time frame, the F0 estimation method of Wang and Brown [11] is applied, producing a single F0 estimate made from the highest peak in the summary autocorrelogram

$F_0[\phi_0,\rho],$  (20)

and a set of flags, indicating for each P[φ0,ω,ρ] whether it contains a harmonic of F0[φ0,ρ] or not,

$FN[\phi_0,\omega,\rho], \quad \omega \in \{1, 2, \ldots, \Omega\}.$  (21)

Here, FN[φ0,ω,ρ] = 1 when band ω contains a harmonic, and 0 otherwise. Details of the implementation are the following.

Temporal filtering is done with a gammatone filterbank because its constant-Q filters can resolve important low-frequency features of harmonic signals (the fundamental and its lower frequency harmonics) better than equal-bandwidth filterbanks with the same number of bands. (Low-frequency harmonics are important since, in speech for example, they account for much of the signal power in vowels.) These better-resolved, less-mixed low-frequency harmonics can give better F0 estimation results (F0 estimates and related harmonic flags, FN's), since they produce sharper peaks in the autocorrelogram, and these sharper peaks are easier for the F0 estimation algorithm to interpret. Spatial filtering (new to autocorrelogram analysis) is used here because it provides the advantage of reducing interference in the autocorrelogram when multiple signals from different spatial locations are present in the input.

The autocorrelogram is computed as described previously, including the optional power normalization in each frequency band. For the summary autocorrelogram, however, we have found that F0 estimation is improved by using just the lower frequency bands that contain the strongest harmonic features. Thus,

$\mathrm{sacm}[\tau] = \frac{1}{74}\sum_{\omega=17}^{90} \mathrm{acm}[\omega,\tau]\cdot w[\tau],$  (22)

where the bands 90 to 17 cover the frequency range 0 to 3500 Hz, the frequency range of a vowel's fundamental and its lower harmonics.

Finally, an F0 analysis is performed using the autocorrelogram/summary autocorrelogram pair, according to the method of Wang and Brown [11]. Their method is used in CASA-EB to facilitate comparison testing of CASA-EB's monaural CASA against their monaural CASA (described in Section 6). The details of the method are the following. First, a single F0 is estimated based on the highest peak in the summary autocorrelogram:

$F_0[\phi_0,\rho] = \frac{f_s}{\tau_m},$  (23)

where fs is the temporal sampling frequency of the input signal y[m,n], and τm is the time lag of the highest peak in the summary autocorrelogram. Then, the associated overtones of this F0 are identified by finding frequency bands in the autocorrelogram with peaks at, or near, τm. Specifically, this is done as follows. A band ω is determined to contain an overtone, that is, FN[φ0,ω,ρ] = 1, when

$\frac{R_{X_\omega}[\tau_m]}{R_{X_\omega}[0]} > \Theta_d,$  (24)

and Θd = 0.90 is a detection threshold. Wang and Brown used Θd = 0.95. For CASA-EB, experiments show that values of Θd in the range of 0.875 to 0.95 detect overtones well [31]. This F0 estimation method amounts to estimating F0 and detecting its overtones for a single "foreground signal," and treating the rest of the input mixture signal as background noise and interference. Although this limits the number of signals for which an F0 estimate is made (one per autocorrelogram), it also helps by eliminating the need to estimate the number of harmonic signals. Further, it provides more robust F0 estimation since, from each autocorrelogram, an F0 estimate is only made from the signal with the strongest harmonic evidence (the highest peak in the summary autocorrelogram).
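A short sketch of this single-F0 analysis, following (23) and (24), is shown below; the minimum-lag guard that skips the trivial peak at τ = 0 is an assumption of the sketch.

import numpy as np

def f0_and_overtones(acm, sacm, fs=16000, theta_d=0.90, min_lag=20):
    """Single-F0 analysis per Eqs. (23)-(24): F0 from the highest summary-
    autocorrelogram peak, plus per-band overtone flags FN.
    min_lag (assumed) avoids the trivial peak at tau = 0."""
    tau_m = min_lag + int(np.argmax(sacm[min_lag:]))   # lag of the highest peak
    f0 = fs / tau_m                                    # Eq. (23)
    FN = (acm[:, tau_m] > theta_d).astype(int)         # Eq. (24)
    return f0, FN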

Notice that in our application, the number of signals for which F0 estimates can be made is less limited, since we have more than one autocorrelogram per time frame (one for each detected arrival angle). Additionally, our F0 estimates may be better since they are made from autocorrelograms with less interharmonic group interference. Such interference is reduced since the autocorrelograms are computed from the spatially filtered signals, Y[φ0,ω,n], ω ∈ {1, 2, ..., Ω}, which are generally "less mixed" than the original input mixture signal y[m,n] because they contain a smaller number of harmonic groups with significant power.

4.3. Signal Component Grouping. Recall that sound source separation consists of two steps: signal analysis (to break the signal into components such as P[φ,ω,ρ]) and signal component grouping (to collect the components into single-source signal estimates). Grouping collects together signal components according to their attributes (estimated in Section 4.2), and ideally, each group only contains pieces from a single source signal.

Grouping is typically done in two stages: simultaneous grouping clusters together signal components in each time frame ρ that share common attribute values, and sequential grouping tracks these simultaneous groups across time. In the previous work, many researchers perform simultaneous grouping first and then track the resulting clusters [2, 3, 10, 22, 32]. For signals grouped by the F0 source attribute, for example, the simultaneous grouping step consists of identifying groups of harmonics, and the sequential grouping step consists of tracking their fundamental frequencies. A primary advantage of simultaneous-first grouping is that it can be real-time amenable when the target signals' models are known a priori. However, when they are not known, it can be computationally complex to determine the correct signal models [10], or error-prone if wrong signal models are used.

Some researchers have experimented with sequential-first grouping [8, 9]. In this case, the sequential grouping step consists of tracking individual signal components, and the simultaneous grouping step consists of clustering together the tracks that have similar source attribute values in the time frames in which they overlap. Although this approach is not real-time amenable, since tracking is performed on the full length of the input mixture signal before the resulting tracks are clustered, it has the advantage that it controls error propagation. It does this by putting off the more error-prone decisions (simultaneous grouping's signal modeling decisions) until later in the grouping process.

In this work, we strike a balance between the two with a short-time sequential-first grouping approach. This is a three-step approach (illustrated in Figure 5). First, to enjoy the benefits of sequential-first grouping (reduced error propagation) without suffering long time delays, we start by tracking individual signal components over a few frames. Then, these short-time frequency component tracks are clustered together into short-time single-source signal estimates. Finally, since signals are typically longer than a few frames, it is necessary to connect the short-time signal estimates together (i.e., to track them). The details of these three steps are given next.

4.3.1. Short-Time Sequential Grouping. In this step, signal components are tracked for a few frames (six for the results presented in this paper). Recall from Section 4.1 that the signal components that are tracked are the perceptually significant ones (peak, high-power components from arrival angle bands in which signals have been detected). Limiting tracking to these select signal components reduces computational complexity and improves tracking performance.

Technically, tracking amounts to estimating the state of a target (e.g., its position and velocity) over time from related observation data. A target could be an object, a system, or a signal, and a sequence of states over time is called a track. In our application, a target is a signal component of a single sound source's signal (e.g., the nth harmonic of a harmonic signal), its state consists of parameters (e.g., its frequency) that characterize the signal component, and the observation data in each frame ρ consists of the (multisource) signal components P[φ,ω,ρ].

Although we are tracking multiple targets (signal component sequences), for the sake of simplicity, we first consider the tracking of a single target. In this case, a widely used approach for tracking is the Kalman filter [33]. This approach uses a linear system model to describe the dynamics of the target's internal state and observable output, that is,

$x[\rho+1] = A[\rho]\cdot x[\rho] + v[\rho],$
$z[\rho+1] = C[\rho+1]\cdot x[\rho+1] + w[\rho+1].$  (25)

Here, x[ρ+1] is the target's state and z[ρ+1] is its observable output in time frame (ρ + 1), A[ρ] is the state transition matrix, C[ρ+1] is the matrix that transforms the current state of the track to the output, and v[ρ] and w[ρ] are zero-mean white Gaussian noise with covariance matrices Q[ρ] and R[ρ], respectively. Based on this model, the Kalman filter is a set of time-recursive equations that provides optimal state estimates. At each time (ρ + 1), it does this in two steps. First, it computes an optimal prediction of the state x[ρ+1] from an estimate of the state x[ρ]. Then, this prediction is updated/corrected using the current output z[ρ+1], generating the final estimate of x[ρ+1].

Figure 5: Illustration of short-time sequential-first grouping. Here the input signal is a mixture of the two single-source signals shown in Figure 2. (a) Short-time tracks in time segment (η + 1), with completed signal estimate groups through time segment η; time segment η consists of time frames ρ ∈ {ηT′, ..., (η + 1)T′ − 1}, and T′ = 6. (b) Simultaneous groups of the short-time tracks shown in (a). (c) Completed signal estimate groups through time segment (η + 1).

Since the formulas for Kalman prediction and update are well known [33], the main task for a specific application is reduced to that of constructing the linear model, that is, defining the dynamic equations (see (25)). For CASA-EB, a target's output vector, z[ρ], is composed of its frequency and arrival angle bands, and its internal state, x[ρ], consists of its frequency and arrival angle bands, along with their rates of change:

$z[\rho] = \begin{bmatrix}\phi & \omega\end{bmatrix}^T, \qquad x[\rho] = \begin{bmatrix}\phi & \dfrac{d}{dt}\phi & \omega & \dfrac{d}{dt}\omega\end{bmatrix}^T.$  (26)


The transition matrices of the state and output equations are defined as follows:

$A[\rho] = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad C[\rho] = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix},$  (27)

where this choice of A[ρ] reflects our expectation that the state changes slowly, and this C[ρ] simply picks the output vector ([φ ω]^T) from the state vector.
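One prediction/update cycle for this model, with A and C as in (27), could look like the sketch below; the noise covariances Q and R are left as inputs, and the code is a generic textbook Kalman step rather than the authors' implementation.

import numpy as np

A = np.eye(4)                                   # state transition, Eq. (27)
C = np.array([[1., 0., 0., 0.],
              [0., 0., 1., 0.]])                # output matrix, Eq. (27)

def kalman_step(x, P, z, Q, R):
    """One Kalman prediction/update cycle for the track model of
    Eqs. (25)-(27); x is [phi, dphi/dt, omega, domega/dt], z is [phi, omega]."""
    # Predict the next state and its covariance
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Correct the prediction with the associated observation z
    S = C @ P_pred @ C.T + R                    # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ (z - C @ x_pred)
    P_new = (np.eye(4) - K @ C) @ P_pred
    return x_new, P_new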

When there is more than one target, the tracking problem becomes more complicated. Specifically, at each time instant, multiple targets can produce multiple observations, and generally, it is not known which target produced which observation. To solve this problem, a data association process is usually used to assign each observation to a target. Then, Kalman filtering can be applied to each target as in the single target case.

While a number of data association algorithms have been proposed in the literature, most of them are based on the same intuition: that an observation should be associated with the target most likely to have produced it (e.g., the "closest" one). In this work, we use an extension of Munkres' optimal data association algorithm (by Burgeois and Lassalle [34]). A description of this algorithm can be found in [35]. To summarize briefly here, the extended Munkres algorithm finds the best (lowest cost) associations of observations to established tracks. It does this using a cost matrix with H columns (one per observation) and J + H rows (one per track plus one per observation), where the (j, h)th element is the cost of associating observation h to track j, the (J + h, h)th element is the cost of initiating a new track with observation h, and the remaining off-diagonal elements in the final H rows are set to a large number such that they will not affect the result.

The cost of associating an observation with a track is a function of the distance between the track's predicted next output and the observation. Specifically, we are using the following distance measure:

$\mathrm{cost}_{j,h} = \begin{cases} \left|\hat{\omega}_j - \omega_h\right|, & \text{when } \left|\hat{\omega}_j - \omega_h\right| \le 1 \text{ and } \phi_h = \phi_j, \\ 2\gamma, & \text{otherwise,} \end{cases}$  (28)

where ω̂j is the prediction of track j's next frequency (as computed by the Kalman filter), ωh and φh are the frequency and arrival angle of observation h, respectively, and track j's arrival angle band φj is constant. Finally, γ is an arbitrary large number used here so that if observation h is outside track j's validation region (|ω̂j − ωh| > 1 or φh ≠ φj), then observation h will not be associated with track j. Note that this cost function means that frequency tracks change their frequency slowly (at most one frequency band per time frame), and sound sources do not move (since φj is held constant). In subsequent work, the assumption of unmoving sources could be lifted by revising the cost matrix and making adjustments to the simultaneous grouping step (described next in Section 4.3.2).

Finally, the cost of initiating a new track is simply set to be larger than the size of the validation region,

$\mathrm{cost}_{J+h,h} = \gamma,$  (29)

and the remaining costs in the last H rows are set equal to 2γ so that they will never be the low-cost choice.
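For illustration, the sketch below builds the (J + H) × H cost matrix of (28)-(29) and solves the assignment; SciPy's Hungarian solver is used here as a stand-in for the extended Munkres algorithm of [34], and the value of γ is an arbitrary large constant.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(pred_omega, track_phi, obs_omega, obs_phi, gamma=1e3):
    """Associate observations with tracks using the cost matrix of
    Section 4.3.1. Returns (track index or None, observation index) pairs,
    where None marks a newly initiated track."""
    J, H = len(pred_omega), len(obs_omega)
    cost = np.full((J + H, H), 2 * gamma)                        # default: never chosen
    for j in range(J):
        for h in range(H):
            if abs(pred_omega[j] - obs_omega[h]) <= 1 and obs_phi[h] == track_phi[j]:
                cost[j, h] = abs(pred_omega[j] - obs_omega[h])   # Eq. (28)
    for h in range(H):
        cost[J + h, h] = gamma                                   # new-track cost, Eq. (29)
    rows, cols = linear_sum_assignment(cost)
    return [(r if r < J else None, c) for r, c in zip(rows, cols)]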

4.3.2. Simultaneous Grouping. In this step, the short-time tracks from the previous step are clustered into short-time signal estimates based on the similarity of their source and location attribute values. There are a variety of clustering methods in the literature (refer to pattern recognition texts, such as [36–40]). In CASA-EB, we use the hierarchical partitioning algorithm that is summarized next.

Partitioning is an iterative approach that divides a measurement space into k disjoint regions, where k is a predefined input to the partitioning algorithm. In general, however, it is difficult to know k a priori. Hierarchical partitioning addresses this issue by generating a hierarchy of partitions, over a range of different k values, from which to choose the "best" partition. The specific steps are the following. (1) Initialize k to be the minimum number of clusters to be considered. (2) Partition the signal component tracks into k clusters. (3) Compute a performance measure to quantify the quality of the partition. (4) Increment k by 1 and repeat steps 2–4 until a stopping criterion is met or k reaches a maximum value. (5) Select the best partition based on the performance measure computed in step 3.

To implement the hierarchical partitioning algorithm, some details remain to be determined: the minimum and maximum number of clusters to be considered, the partitioning algorithm, the performance measure, and a selection criterion to select the best partition based on the performance measure. For CASA-EB, we have made the following choices. For the minimum and maximum numbers of clusters, we use the number of arrival angle bands in which signals have been detected, and the total number of arrival angle bands, respectively.

For partitioning algorithms, we experimented with a deterministic one, partitioning around medoids (PAM), and a probabilistic one, fuzzy analysis (FANNY), both from a statistics shareware package called R [41, 42]. (R is a reimplementation of S [43, 44] using Scheme semantics. S is a very high level language and an environment for data analysis and graphics. S was written by Richard Becker, John M. Chambers, and Allan R. Wilks of AT&T Bell Laboratories Statistics Research Department.) The difference between the two is in how measurements are assigned to clusters. PAM makes hard clustering assignments; that is, each measurement is assigned to a single cluster. FANNY, on the other hand, allows measurements to be spread across multiple clusters during partitioning. Then, if needed, these fuzzy assignments can be hardened at the end (after the last iteration). For more information on PAM and FANNY, refer to [37]. For CASA-EB, we use FANNY since it produces better clusters in our experiments.

Finally, it remains to discuss performance measures and selection criteria. Recall that the performance measure's purpose in hierarchical partitioning is to quantify the quality of each partition in the hierarchy. Common methods for doing this are based on "intracluster dissimilarities" between the members of each cluster in a given partition (small is good), and/or on "intercluster dissimilarities" between the members of different clusters in the partition (large is good). As it turns out, our data produces clusters that are close together. Thus, it is not practical to seek clusters with large intercluster dissimilarities. Rather, we have selected a performance measure based on intracluster dissimilarities. Two intracluster performance measures were considered: the maximum intracluster dissimilarity in any single cluster in the partition, and the mean intracluster dissimilarity (averaged over all clusters in the partition). The maximum intracluster dissimilarity produced the best partitions for our data and is the one we used. The details of the dissimilarity measure are discussed next.

Dissimilarity is a measure of how similar or different two measurements are from each other. It can be computed in a variety of ways depending on the measurements being clustered. The measurements we are clustering are the source and location attribute vectors of signal component tracks. Specifically, for each short-time track j in time segment η, this vector is composed of the track's arrival angle band φj, and its F0 attribute in each time frame ρ of time segment η in which the track is active. Recall (from Section 4.2) that this F0 attribute is the flag FN[φj, ωj[ρ], ρ] that indicates whether the track is part of the foreground harmonic signal or not in time frame ρ. Here, ρ ∈ {ηT′, ..., (η + 1)T′ − 1}, T′ is the number of time frames in short-time segment η, and ωj[ρ] is track j's frequency band in time frame ρ.

Given this measurement vector, dissimilarity is computed as follows. First, since we do not want to cluster tracks from different arrival angles, if two tracks (j1 and j2) have different arrival angles, their dissimilarity is set to a very large number. Otherwise, their dissimilarity depends on the difference in their F0 attributes in the time frames in which they are both active:

$d_{j_1,j_2} = \frac{\sum_{\rho=\eta T'}^{(\eta+1)T'-1} D \cdot w_{j_1,j_2}[\rho]}{\sum_{\rho=\eta T'}^{(\eta+1)T'-1} w_{j_1,j_2}[\rho]},$  (30)

where D denotes |FNj1[φj1, ωj1[ρ], ρ] − FNj2[φj2, ωj2[ρ], ρ]| and wj1,j2[ρ] is a flag indicating whether tracks j1 and j2 are both active in time frame ρ, or not:

$w_{j_1,j_2}[\rho] = \begin{cases} 1, & \text{if tracks } j_1 \text{ and } j_2 \text{ are both active in time frame } \rho, \\ 0, & \text{otherwise.} \end{cases}$  (31)

If there are no time frames in which the pair of tracks are both active, it is not possible to compute their dissimilarity. In this case, dj1,j2 is set to a neutral value such that their (dis)similarity will not be a factor in the clustering. Since the maximum dissimilarity between tracks is 1 and the minimum is 0, the neutral value is 1/2. For such a pair of tracks to be clustered together, they must each be close to the same set of other tracks. Otherwise, they will be assigned to different clusters.
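The dissimilarity of one track pair, following (30)-(31) and the neutral-value rule above, could be computed as in the sketch below; the "very large number" used for tracks from different arrival angles is an arbitrary placeholder.

import numpy as np

def track_dissimilarity(FN1, FN2, active1, active2, phi1, phi2, big=1e6):
    """Pairwise dissimilarity of two short-time tracks, Eqs. (30)-(31).
    FN1/FN2: per-frame harmonic flags over the time segment;
    active1/active2: boolean activity flags; phi1/phi2: arrival-angle bands."""
    if phi1 != phi2:
        return big                      # never cluster across arrival angles
    both = np.asarray(active1, bool) & np.asarray(active2, bool)   # w_{j1,j2}[rho]
    if not both.any():
        return 0.5                      # neutral value when the tracks never overlap
    D = np.abs(np.asarray(FN1, float) - np.asarray(FN2, float))
    return float(D[both].sum() / both.sum())                       # Eq. (30)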

Now that we have a performance measure (maximum intracluster dissimilarity), how should we use it to select a partition? It may seem reasonable to select the partition that optimizes (minimizes) the performance measure. This selection criterion fails, however; it selects a partition in which each measurement is isolated in a separate cluster. A popular strategy used in hierarchical clustering is to pick a partition based on changes in the performance measure, rather than on the performance measure itself [37, 38, 40]. For CASA-EB, we are using such a selection criterion. Specifically, in keeping with the nature of our data (which contains a few, loosely connected clusters), we have chosen the following selection criterion. Starting with the minimum number of clusters, we select the first partition (the one with the smallest number of clusters, k) for which there is a significant change in performance from the previous partition (with (k − 1) clusters).

4.3.3. Linking Short-Time Signal Estimate Groups. This is the final grouping step. In the previous steps, we have generated short-time estimates of the separated source signals (clusters of short-time signal component tracks). In this step, these short-time signal estimates will be linked together to form full-duration signal estimates. This is a data association problem. The short-time signal estimates in each time segment η must be associated with the previously established signal estimates through time segment (η − 1). For an illustration, see Figure 5. To make this association, we rely on the fact that signals usually contain some long signal component tracks that continue across multiple time segments. Thus, these long tracks can be used to associate short-time signal estimates across segments. The idea is that a signal estimate's signal component tracks in time segment (η − 1) will continue to be in the same signal in time segment η, and similarly, signal component tracks in a short-time signal estimate in time segment η will have their origins in the same signal in preceding time segments. The details of our processing are described next.

For this data association problem, we use the extended Munkres algorithm (as described in Section 4.3.1) with a cost function based on the idea described previously. Specifically, the cost function is the following:

\[ \text{cost}_{g_k[\rho],\, c_\ell[\eta]} = \frac{A_{k,\ell} - B_{k,\ell}}{A_{k,\ell}}, \qquad (32) \]

where g_k[ρ] is the kth signal estimate through the (η − 1)st time segment (i.e., ρ < ηT′), c_ℓ[η] is the ℓth short-time signal estimate in time segment η, and A_{k,ℓ} is the power in the union of all their frequency component tracks,

\[ A_{k,\ell} = \sum_{j \in \{g_k[\rho] \,\cup\, c_\ell[\eta]\}} P_j, \qquad (33) \]


P_j is the power in track j (defined below), and B_{k,ℓ} is the power in all the frequency component tracks that are in both g_k[ρ] and c_ℓ[η],

\[ B_{k,\ell} = \sum_{j \in \{g_k[\rho] \,\cap\, c_\ell[\eta]\}} P_j, \qquad (34) \]

and P_j is computed by summing all the power spectral density components along the length of track j,

\[ P_j = \sum_{\rho = j_{\text{start}}}^{\min\left((\eta+1)T'-1,\; j_{\text{stop}}\right)} P\big[\phi_j, \omega_j[\rho], \rho\big]. \qquad (35) \]

This cost function takes on values in the range of 0 to 1. The cost is 0 when all the tracks in cluster c_ℓ[η] that have their beginning in an earlier time segment are also in cluster track g_k[ρ], and vice versa. The cost is 1 when c_ℓ[η] and g_k[ρ] do not share any of the same signal component tracks.

Finally, notice that this cost function does not treat all tracks equally; it gives more weight to longer and more powerful tracks. To see this, consider two clusters, c_{ℓ1}[η] and c_{ℓ2}[η], each containing one track shared with g_k[ρ]. Let the shared track in c_{ℓ1}[η] be long and have high power, and let the shared track in c_{ℓ2}[η] be short and have low power. Then B_{k,1} will be larger than B_{k,2}, and thus cost_{k,1}[η] < cost_{k,2}[η]. Although both c_{ℓ1}[η] and c_{ℓ2}[η] have one continuing track segment from g_k[ρ], the one with the longer, stronger shared track is grouped with it. In this way, the cost function favors signal estimates that keep important spectral structures intact.
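A minimal sketch of this linking step is given below. It uses SciPy's linear_sum_assignment as a stand-in for the extended Munkres algorithm (both handle rectangular cost matrices); the data layout (sets of track indices and a per-track power table) is an assumption made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def association_cost(track_power, g_k, c_l):
    """Cost (32): 1 minus (power shared by both groups) / (power of their union)."""
    union = g_k | c_l
    inter = g_k & c_l
    A = sum(track_power[j] for j in union)            # (33)
    B = sum(track_power[j] for j in inter)            # (34)
    return (A - B) / A if A > 0 else 1.0

def link_groups(track_power, signal_estimates, short_time_groups):
    """Associate each short-time group c_l[eta] with an established estimate g_k."""
    cost = np.array([[association_cost(track_power, g, c)
                      for c in short_time_groups]
                     for g in signal_estimates])
    rows, cols = linear_sum_assignment(cost)          # stand-in for extended Munkres
    return list(zip(rows, cols))                      # (k, l) pairs of linked groups
```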

5. CASA-EB Waveform Synthesis

The preceding processing steps complete the separation of the mixture signal into the single-source signal estimates g_k[ρ]. However, the signal estimates are still simply groups of signal components. In some applications, it may be desirable to have waveforms (e.g., to listen to the signal estimates, or to process them further in another signal processing application such as an automatic speech recognizer).

Waveform reconstruction is done in two steps. First, in time frame ρ, a short-time waveform is generated for each group g_k[ρ] that is active (i.e., nonempty) in the time frame. Then, full-length waveforms are generated from these by connecting them together across time frames. The implementation details are described next.

In the first step, for each currently active group, its short-time waveform is generated by summing its short-time narrowband waveforms Y[φ, ω, n] over frequency:

\[ x_k^{\rho}[n] = \sum_{\substack{\phi,\,\omega \ \text{s.t.} \\ P[\phi,\omega,\rho] \,\in\, g_k[\rho]}} Y[\phi,\omega,n], \qquad (36) \]

where n ∈ {ρ − (T − 1)/2, ..., ρ + (T − 1)/2}. In the second step, these short-time waveforms are connected together across time into full-length waveforms by the standard overlap-add algorithm,

\[ x_k[n] = \sum_{\rho} \;\sum_{r=-(T-1)/2}^{(T-1)/2} v[r] \cdot x_k^{\rho}[r], \qquad (37) \]

where we have chosen to use a Hanning window, v[·], because of its low sidelobes and reasonably narrow main lobe width.
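The overlap-add reconstruction of (37) can be sketched as follows, assuming the short-time waveforms of (36) have already been computed and that every frame lies fully inside the output buffer; the names and the frame-placement convention are illustrative assumptions.

```python
import numpy as np

def overlap_add(short_frames, frame_centres, T, n_total):
    """Reassemble a full-length waveform from short-time frames, cf. (37).

    short_frames  : list of length-T arrays x_k^rho[.] (one per active frame)
    frame_centres : list of centre sample indices rho for those frames
    T             : odd frame length
    n_total       : length of the output signal
    """
    x = np.zeros(n_total)
    v = np.hanning(T)                       # Hanning window v[.]
    half = (T - 1) // 2
    for frame, centre in zip(short_frames, frame_centres):
        start = centre - half
        x[start:start + T] += v * frame     # windowed overlap-add
    return x
```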

6. Experimental Results

For a sound source separation method such as CASA-EB, it is important both that it separate mixture signals completely and that the separated signals have good quality. The experiments described in Section 6.2 assess CASA-EB's ability to do both. Specifically, they test our hypothesis that combining monaural CASA and beamforming, as in CASA-EB, provides more complete signal separation than either CASA or beamforming alone, and that the separated signals have low spectral distortion.

Before conducting these experiments, a preliminary experiment is performed. In particular, to make the comparison of CASA-EB to monaural CASA meaningful, we first need to verify that the performance of the monaural CASA in CASA-EB is in line with other previously published CASA methods. Since it is not practical to compare our CASA technique to every previously proposed technique (there are too many, and there is no generally accepted standard), we selected a representative technique for comparison: that of van der Kouwe, Wang, and Brown [1]. We chose their method for three reasons. First, a clear comparison can be made since their testing method is easily reproducible with readily available test data. Second, comparison to their technique provides a good check for ours since the two methods are similar; both use the same grouping cue and a similar temporal analysis filter, hω[n]. The main differences are that our technique contains spatial filtering (which theirs does not), and it uses tracking/clustering for grouping (while their technique uses neural networks for grouping). Finally, they (Roman, Wang, and Brown) have also done work separating signals based on location cues (binaural CASA) [4], and some preliminary work combining source attributes (F0 attribute) and location attributes (binaural CASA cues); see [13] by Wrigley and Brown.

6.1. Preliminary Signal Separation Experiments: Monaural CASA. To compare our monaural CASA technique to that of [1], we tested our technique using the same test data and performance measure that they used to test theirs. In this way, our results can be compared directly to their published results. The test data consists of 10 mixture signals from the data set of [8]. Each mixture consists of a speech signal (v8) and one of ten interference signals (see Table 2).

The performance measure is the SIR (signal-to-interference ratio) gain. (This SIR gain is the same as the SNR gain in [1]; we prefer the name SIR gain since it is a more accurate description of what is computed.) That is, it is the difference between the SIRs before and after signal separation:

\[ \Delta\text{SIR} = \text{SIR}_{\text{after}} - \text{SIR}_{\text{before}}, \qquad (38) \]


Figure 6: SIR gains of v8 estimates from beamforming (a), CASA (b), and CASA-EB (c). The horizontal axes specify the test mixture by the index of the interferer (n0–n9). The three bars shown for each indicate the SIR of v8 in the mixture (black), the SIR of the separated v8 (gray), and the SIR gain (white). To summarize these results, the mean SIR gains are 16.9 dB (for beamforming on mixtures with π/2 radians of source separation), 17.2 dB (for monaural CASA), or 8.4 dB (for monaural CASA without the n0 and n5 results), and 24.2 dB (for CASA-EB on mixtures with π/2 radians of source separation).

where

\[ \text{SIR}_{\text{after}} = 10 \log\!\left(\frac{P_{v8 \in \hat{v}8}}{P_{nx \in \hat{v}8}}\right), \qquad \text{SIR}_{\text{before}} = 10 \log\!\left(\frac{P_{v8 \in v8+nx}}{P_{nx \in v8+nx}}\right). \qquad (39) \]

Here, P_{v8∈\hat{v}8} is the power (or amount) of the speech signal (v8) in its estimate (i.e., the separated signal \hat{v}8), P_{nx∈\hat{v}8} is the power (or amount) of interference (nx) in \hat{v}8, P_{v8∈v8+nx} is the power of v8 in the test mixture (v8 + nx), and P_{nx∈v8+nx} is the power of nx in (v8 + nx), where nx is one of {n0, n1, ..., n9}.

SIR is a useful measure in the sense that it tells us how well interference has been removed by signal separation: the higher the SIR, the more interference-free the separated signal.
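The SIR gain of (38) and (39) can be computed as in the following sketch, assuming the contributions of the target (v8) and the interferer (nx) to the mixture and to the separated signal are available separately, as in the resynthesis-based evaluation of [1]; the function names are ours.

```python
import numpy as np

def sir_db(target_part, interferer_part):
    """10*log10 of target power over interferer power."""
    return 10.0 * np.log10(np.sum(target_part ** 2) / np.sum(interferer_part ** 2))

def sir_gain(v8_in_mix, nx_in_mix, v8_in_est, nx_in_est):
    """Delta SIR = SIR_after - SIR_before, cf. (38)-(39)."""
    sir_before = sir_db(v8_in_mix, nx_in_mix)
    sir_after = sir_db(v8_in_est, nx_in_est)
    return sir_after - sir_before
```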

In a typical experiment, we ran our monaural CASA algorithm on each of the ten mixture signals, and the resultant SIRs (before and after), along with the SIR gains, are shown in the upper panel of Figure 6. Specifically, this figure contains 10 groups of lines (black, gray, and white), indexed from 1 to 10 on a horizontal axis, one for each mixture signal in the test data. For example, the results at index 5 are for mixture (v8 + n5). In each group (i.e., for each mixture


Table 2: Voiced speech signal v8 and the interference signals (n0–n9) from Cooke’s 100 mixtures [8].

ID Description Characterization

v8 Why were you all weary?

n0 1 kHz tone Narrowband, continuous, structured

n1 White noise Wideband, continuous, unstructured

n2 Series of brief noise bursts Wideband, interrupted, unstructured

n3 Teaching laboratory noise Wideband, continuous, partly structured

n4 New wave music Wideband, continuous, structured

n5 FM signal (siren) Locally narrowband, continuous, structured

n6 Telephone ring Wideband, interrupted, structured

n7 Female TIMIT utterance Wideband, continuous, structured

n8 Male TIMIT utterance Wideband, continuous, structured

n9 Female utterance Wideband, continuous, structured

Figure 7: SIR gains of v8 estimates from Wang, Brown, and van der Kouwe's monaural CASA [1]. The horizontal axis specifies the test mixture by its interferer (n0–n9). The two lines shown for each indicate the SIR of v8 in the mixture (black) and the SIR of the separated v8 (gray).

signal), the height of the black line is the SIR of the original mixture signal, the height of the gray line is the SIR of the signal estimate after CASA separation (\hat{v}8), and the height of the white line is their difference, that is, the SIR gain achieved by CASA separation.

For comparison's sake, Wang, Brown, and van der Kouwe's results on the mixture signals of Table 2 are shown in Figure 7, organized in the same way as in Figure 6. From these figures, we can see that the performance of our CASA technique is similar to theirs. The main differences are for the n6 and n9 mixture signals; their method performed better for n6, CASA-EB for n9. Thus, our CASA technique can be considered comparable to this published CASA technique.

6.2. Main Signal Separation Experiments: CASA-EB. To test our hypothesis that the combined approach, CASA-EB, separates mixture signals more completely than the individual techniques (CASA and beamforming) used alone, we ran all three on mixture signals of the same speech (v8) and interference (n0–n9) signals and compared the resulting SIR gains. To assess the quality of the separated signals, we also computed their LPC cepstral distortions.

For monaural CASA, the test data was exactly the same as that used in Section 6.1. For beamforming and CASA-EB, however, array data was simulated from the speech and interference signals, and the mixture signals were made from these. We chose to simulate the array data rather than record the speech-interference mixtures through a microphone array because recorded data would be specific to the room in which it was recorded. The disadvantage of this approach is that the simulated array data may not be entirely realistic (e.g., it does not include room reverberation). For the array data simulation, we used the method described in [31] on a uniform linear array of 30 microphones. Each of the ten mixture signals, as measured at the array, is composed of the speech (v8) and one interference signal (n0–n9), where v8's arrival angle is +π/4 and the interference signal's is −π/4 radians from broadside.

6.2.1. Signal Separation Completeness. The SIR gains of the separated signals from beamforming, monaural CASA, and CASA-EB are shown in Figures 6(a), 6(b), and 6(c), respectively. The results show a definite advantage for CASA-EB over either beamforming or monaural CASA alone, for all but two exceptions (the narrowband interferers, n0 and n5) addressed below. Specifically, the mean SIR gains for beamforming, monaural CASA, and CASA-EB are 16.9, 17.2, and 24.2 dB, respectively. Note that the mean SIR gain for monaural CASA would be 8.4 dB if the results from the mixtures made with the narrowband interferers, n0 and n5, are left out.

Now, we consider the two exceptions, that is, the mixtures (v8 + n0) and (v8 + n5), for which CASA alone achieves near-perfect performance and CASA-EB does not. Why does CASA remove n0 and n5 so well? To find an answer, we first notice that, unlike the other interferers, n0 and n5


Figure 8: LPC cepstral distortions of v8 estimates from beamforming (a), CASA (b), and CASA-EB (c). As in Figures 6 and 7, the horizontal axes specify the test mixture by the index of the interferer (n0–n9). The value plotted is the mean LPC cepstral distortion over the duration of the input mixture, v8 + nx, nx ∈ {n0, n1, ..., n9}; the error bars show the standard deviations.

are narrowband and, in any short period of time, each has its power concentrated in a single frequency or a very narrow frequency band. Now, recall that our CASA approach separates a signal from interference by grouping harmonic signal components of a common fundamental and rejecting other signal components. It does this by first passing the signal-interference mixture through a filter bank (the hω[n] defined in Section 3), that is, decomposing it into a set of subband signals. Then, the autocorrelation of each subband is computed, forming an autocorrelogram (see Figure 4(b)), and a harmonic group (a fundamental frequency and its overtones) is identified (as described in Section 4.2). After such harmonics are identified, the remaining signal components (interferers) are rejected.

When an interferer is narrowband (such as n0 and n5), it is almost certain to be contained entirely in a single subband. Furthermore, if the interferer has a lot of power (as in v8 + n0 and v8 + n5), it will affect the location of the autocorrelogram peak for that subband. Either the peak in the subband will correspond to the period of the interferer, if it is strong relative to the other signal content in the subband, or the peak will at least be pulled towards the interferer. When we use CASA, this causes the subband to be rejected from the signal estimate, and as a result the interferer is completely rejected. This is why CASA works so well in rejecting narrowband interferers.

When CASA-EB is used, the CASA operation is preceded by spatial filtering (beamforming). When the interferer and the signal come from different directions (as is the case in v8 + n0 and v8 + n5), this has the effect of reducing the power of the interferer in the subband it occupies. As a result, the autocorrelogram peak in that subband is much less affected by the interferer than in the CASA-alone case, and consequently the subband may not be rejected in the signal reconstruction, leading to a smaller SIR improvement than when CASA is used alone. However, we would like to point out that CASA-EB's performance in this case (on mixtures with narrowband interferers), although not as good as CASA-alone's dramatic performance, is still quite decent thanks to the spatial filtering that reduces the interferers' power.

6.2.2. Perceptual Quality of Separated Signals. The mean LPC cepstral distortions of the separated signals (\hat{v}8) from beamforming, monaural CASA, and CASA-EB are shown in Figures 8(a), 8(b), and 8(c), respectively. Here, the LPC cepstral distortion is computed as

\[ d[r] = \sqrt{\frac{1}{F+1} \sum_{f=0}^{F} \Big( \ln\big(P_{v8}[f]\big) - \ln\big(P_{\hat{v}8}[f]\big) \Big)^2 }, \qquad (40) \]

where r = n/Td is the time index, Td = 160 is the length of the signal segment used to compute d[r], P_{v8}[f] is the LPC power spectral component of v8 at frequency f (computed by the Yule-Walker method), P_{\hat{v}8}[f] is the corresponding component of the estimate \hat{v}8, and F = 60 corresponds to frequency fs/2.
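A minimal sketch of the distortion measure (40) is given below. It fits the AR models with the Yule-Walker equations solved directly from the sample autocorrelation; the model order and the mean removal are our own assumptions, and F + 1 = 61 spectral samples are used as in the text.

```python
import numpy as np

def lpc_psd(frame, order=12, n_bins=61):
    """AR (Yule-Walker) power spectrum of one frame on n_bins frequencies in [0, pi]."""
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])             # AR coefficients
    sigma2 = r[0] - a @ r[1:order + 1]                 # prediction-error power
    w = np.linspace(0, np.pi, n_bins)
    A = 1 - np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
    return sigma2 / np.abs(A) ** 2

def lpc_spectral_distortion(ref_frame, est_frame, order=12, F=60):
    """Distortion d[r] of (40) for one length-Td frame pair."""
    P_ref = lpc_psd(ref_frame, order, F + 1)
    P_est = lpc_psd(est_frame, order, F + 1)
    return np.sqrt(np.mean((np.log(P_ref) - np.log(P_est)) ** 2))
```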

The results show that beamforming produces low distortion (1.24 dB averaged over the duration of the separated


signal \hat{v}8 and over all 10 test mixtures), CASA introduces somewhat higher distortion (2.17 dB), and CASA-EB is similar to monaural CASA (1.98 dB). The fact that beamforming produces lower distortion than CASA may be because distortion in beamforming comes primarily from incomplete removal of interferers and noise, while in CASA, additional distortion comes from the removal of target signal components when the target signal has frequency content in bands that are dominated by interferer(s). Thus, beamforming generally passes the entire target signal with some residual interference (generating low distortion), while CASA produces signal estimates that can also be missing pieces of the target signal (producing more distortion).

6.2.3. Summary. In summary, CASA-EB separates mixture signals more completely than either individual method alone and produces separated signals with rather low spectral distortion (∼2 dB LPC cepstral distortion). Lower spectral distortion can be had by using beamforming alone; however, beamforming generally provides less signal separation than CASA-EB and cannot separate signals arriving from close angles.

7. Conclusion

In this paper, we proposed a novel approach to acoustic signal separation. Compared to most previously proposed approaches, which use either location or source attributes alone, this approach, called CASA-EB, exploits both location and source attributes by combining beamforming and auditory scene analysis. Another novel aspect of our work is the signal component grouping step, which uses clustering and Kalman filtering to group signal components over time and frequency.

Experimental results have demonstrated the efficacy of our proposed approach; overall, CASA-EB provides better signal separation performance than beamforming or CASA alone, and while the quality of the separated signals suffers some degradation, their spectral distortions are rather low (∼2 dB LPC cepstral distortion). Although beyond the scope of the current work, whose goal was to demonstrate the advantage of combining location and source attributes for acoustic signal separation, further performance improvements may be achieved by tuning CASA-EB's component parts. For example, using a higher-resolution beamformer may allow CASA-EB to produce separated signals with lower residual interference from neighboring arrival angles, and using a larger set of source attributes could improve performance for harmonic target signals and accommodate target signals with nonharmonic structures.

References

[1] A. J. W. van der Kouwe, D. Wang, and G. J. Brown, "A comparison of auditory and blind separation techniques for speech segregation," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 189–195, 2001.
[2] P. N. Denbigh and J. Zhao, "Pitch extraction and separation of overlapping speech," Speech Communication, vol. 11, no. 2-3, pp. 119–125, 1992.
[3] T. Nakatani and H. G. Okuno, "Harmonic sound stream segregation using localization and its application to speech stream segregation," Speech Communication, vol. 27, no. 3, pp. 209–222, 1999.
[4] N. Roman, D. Wang, and G. J. Brown, "Speech segregation based on sound localization," The Journal of the Acoustical Society of America, vol. 114, no. 4, pp. 2236–2252, 2003.
[5] M. Brandstein and D. Ward, Eds., Microphone Arrays, Springer, New York, NY, USA, 2001.
[6] L. Drake, A. K. Katsaggelos, J. C. Rutledge, and J. Zhang, "Sound source separation via computational auditory scene analysis-enhanced beamforming," in Proceedings of the 2nd IEEE Sensor Array and Multichannel Signal Processing Workshop, Rosslyn, Va, USA, August 2002.
[7] M. Cooke and D. P. W. Ellis, "The auditory organization of speech and other sources in listeners and computational models," Speech Communication, vol. 35, no. 3-4, pp. 141–177, 2001.
[8] M. Cooke, Modelling auditory processing and organisation, Ph.D. dissertation, The University of Sheffield, Sheffield, UK, 1991.
[9] G. Brown, Computational auditory scene analysis: a representational approach, Ph.D. dissertation, The University of Sheffield, Sheffield, UK, 1992.
[10] D. P. W. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. dissertation, MIT, Cambridge, Mass, USA, April 1996.
[11] D. L. Wang and G. J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 684–697, 1999.
[12] G. Hu and D. L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Transactions on Neural Networks, vol. 15, no. 5, pp. 1135–1150, 2004.
[13] S. N. Wrigley and G. J. Brown, "Recurrent timing neural networks for joint F0-localisation based speech separation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07), vol. 1, pp. 157–160, Honolulu, Hawaii, USA, April 2007.
[14] J. Krolik, "Focused wide-band array processing for spatial spectral estimation," in Advances in Spectrum Analysis and Array Processing, S. Haykin, Ed., vol. 2 of Prentice Hall Signal Processing Series and Prentice Hall Advanced Reference Series, chapter 6, pp. 221–261, Prentice-Hall, Englewood Cliffs, NJ, USA, 1991.
[15] J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969.
[16] D. N. Swingler and J. Krolik, "Source location bias in the coherently focused high-resolution broad-band beamformer," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 1, pp. 143–145, 1989.
[17] M. Wax and T. Kailath, "Detection of signals by information theoretic criteria," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 387–392, 1985.
[18] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proceedings of the 2nd International Symposium on Information Theory, pp. 267–281, 1973.
[19] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, 1974.
[20] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics, vol. 6, pp. 461–464, 1978.
[21] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[22] T. W. Parsons, "Separation of speech from interfering speech by means of harmonic selection," The Journal of the Acoustical Society of America, vol. 60, no. 4, pp. 911–918, 1976.
[23] U. Baumann, "Pitch and onset as cues for segregation of musical voices," in Proceedings of the 2nd International Conference on Music Perception and Cognition, February 1992.
[24] G. Brown and M. Cooke, "Perceptual grouping of musical sounds: a computational model," The Journal of New Music Research, vol. 23, no. 2, pp. 107–132, 1994.
[25] Y. Gu, "A robust pseudo perceptual pitch estimator," in Proceedings of the 2nd European Conference on Speech Communication and Technology (EUROSPEECH '91), pp. 453–456, 1991.
[26] M. Weintraub, A theory and computational model of auditory sound separation, Ph.D. dissertation, Stanford University, Stanford, Calif, USA, 1985.
[27] R. Gardner, "An algorithm for separating simultaneous vowels," British Journal of Audiology, vol. 23, pp. 170–171, 1989.
[28] M. Slaney and R. F. Lyon, "A perceptual pitch detector," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '90), vol. 1, pp. 357–360, Albuquerque, NM, USA, April 1990.
[29] P. F. Assmann and Q. Summerfield, "Modeling the perception of concurrent vowels: vowels with different fundamental frequencies," The Journal of the Acoustical Society of America, vol. 88, no. 2, pp. 680–697, 1990.
[30] R. Meddis and M. J. Hewitt, "Virtual pitch and phase sensitivity of a computer model of the auditory periphery—I: pitch identification," The Journal of the Acoustical Society of America, vol. 89, no. 6, pp. 2866–2882, 1991.
[31] L. Drake, Sound source separation via computational auditory scene analysis (CASA)-enhanced beamforming, Ph.D. dissertation, Northwestern University, December 2001.
[32] U. Baumann, "A procedure for identification and segregation of multiple auditory objects," in Proceedings of the NATO Advanced Study Institute on Computational Hearing, S. Greenberg and M. Slaney, Eds., pp. 211–215, International Computer Science Institute, Berkeley, Calif, USA, 1998.
[33] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association, vol. 179 of Mathematics in Science and Engineering, Academic Press, New York, NY, USA, 1988.
[34] F. Burgeois and J. C. Lassalle, "Extension of the Munkres algorithm for the assignment problems to rectangular matrices," Communications of the ACM, vol. 14, no. 12, pp. 802–804, 1971.
[35] S. S. Blackman, Multiple-Target Tracking with Radar Applications, Artech House, Boston, Mass, USA, 1986.
[36] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, Mass, USA, 1974.
[37] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, NY, USA, 1990.
[38] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, New York, NY, USA, 1999.
[39] B. D. Ripley and N. Hjort, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1995.
[40] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, John Wiley & Sons, New York, NY, USA, 2nd edition, 2001.
[41] R. Ihaka and R. Gentleman, "R: a language for data analysis and graphics," Journal of Computational and Graphical Statistics, vol. 5, no. 3, pp. 299–314, 1996.
[42] R. Gentleman, R. Ihaka, and R Core Team, "R version 0.63.1," December 1998, a statistical computation and graphics system; a re-implementation of the S language using Scheme semantics, http://www.stat.auckland.ac.nz/r/r.html.
[43] R. A. Becker, J. M. Chambers, and A. R. Wilks, The New S Language, Chapman & Hall, London, UK, 1988.
[44] J. M. Chambers and T. J. Hastie, Statistical Models in S, Chapman & Hall, London, UK, 1992.


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2009, Article ID 257197, 9 pages, doi:10.1155/2009/257197

Research Article

Rate-Constrained Beamforming in Binaural Hearing Aids

Sriram Srinivasan and Albertus C. den Brinker

Digital Signal Processing Group, Philips Research, High Tech Campus 36, 5656 AE Eindhoven, The Netherlands

Correspondence should be addressed to Sriram Srinivasan, [email protected]

Received 1 December 2008; Accepted 6 July 2009

Recommended by Henning Puder

Recently, hearing aid systems where the left and right ear devices collaborate with one another have received much attention. Apart from supporting natural binaural hearing, such systems hold great potential for improving the intelligibility of speech in the presence of noise through beamforming algorithms. Binaural beamforming for hearing aids requires an exchange of microphone signals between the two devices over a wireless link. This paper studies two problems: which signal to transmit from one ear to the other, and at what bit-rate. The first problem is relevant as modern hearing aids usually contain multiple microphones, and the optimal choice for the signal to be transmitted is not obvious. The second problem is relevant as the capacity of the wireless link is limited by stringent power consumption constraints imposed by the limited battery life of hearing aids.

Copyright © 2009 S. Srinivasan and A. C. den Brinker. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Modern hearing aids are capable of performing a variety of advanced digital processing tasks such as amplification and dynamic compression, sound environment classification, feedback cancellation, beamforming, and single-channel noise reduction. Improving the intelligibility and quality of speech in noise through beamforming is arguably the most sought-after feature among hearing aid users [1]. Modern hearing aids typically have multiple microphones and have been proven to provide reasonable improvements in intelligibility and listening comfort.

As human hearing is binaural by nature, it is intuitive to expect an improved experience by using one hearing aid for each ear, and the number of such fittings has increased significantly [2]. These fittings, however, have mostly been bilateral, that is, two independently operating hearing aids on the left and right ears. To experience the benefits of binaural hearing, the two hearing aids need to collaborate with one another to ensure that binaural cues are presented in a consistent manner to the user. Furthermore, the larger spacing between microphones in binaural systems compared to monaural ones provides more flexibility for tasks such as beamforming. Such binaural systems introduce new challenges, for example, preserving binaural localization cues such as interaural time and level differences, and the exchange of signals between the hearing aids to enable binaural processing. The former has been addressed in [3, 4]. The latter is the subject of this paper.

Binaural beamforming requires an exchange of signals between the left and right hearing aids. A wired link between the two devices is cumbersome and unacceptable from an aesthetic point of view, thus necessitating a wireless link. Wireless transmission of data is power intensive, and to preserve battery life, it becomes important to limit the number of bits exchanged over the link. A reduction in the bit-rate affects the performance of the beamformer. This paper investigates the relation between the transmission bit-rate and beamformer performance.

In the absence of bit-rate constraints, a simple practical scheme is to transmit all observed microphone signals from one ear to the other, where they are fed into a beamformer together with the locally observed signals to obtain an estimate of the desired signal. In the presence of a limited-capacity link, however, an intelligent decision on what signal to transmit is necessary to effectively utilize the available bandwidth. Reduced-bandwidth binaural beamforming algorithms have been discussed in [5], but the relation between bit-rate and performance was not studied.

An elegant, theoretically optimal (in an information-theoretic sense) transmission scheme is presented in [6],


where the hearing aid problem is viewed as a remote Wyner-Ziv problem, and the transmitting device encodes its signals such that the receiving device obtains the MMSE estimate of the desired signal, with the locally observed signals at the left device as side information. However, it requires knowledge of the (joint) statistics of the signals observed at both the receiving and transmitting ears, which are not available in a hearing aid setup. A suboptimal but practical choice presented in [7], and shown in Figure 1, is to first obtain an estimate of the desired signal at the transmitting (right ear in this example) device, and then transmit this estimate. This choice is asymptotically (at infinite bit-rate) optimal if the desired signal is observed in the presence of spatially uncorrelated noise, but not when a localized interferer is present.

From an information point of view, transmitting only an estimate of the desired signal does not convey all the information that could be used in reconstructing the desired signal at the receiving (left ear) device. Specifically, the lack of information about the interferer in the transmitted signal results in an inability to exploit the larger left-right microphone spacing (provided by the binaural setup) to cancel the interferer. This paper proposes and investigates two practical alternatives to circumvent this problem. The first approach is to obtain an estimate of the interference-plus-noise at the right hearing aid using the right ear microphone signals, and to transmit this estimate to the left device. This scheme is similar to the one in Figure 1, except that the signal being estimated at the right ear is the undesired signal. Intuitively, this should enable better performance in the presence of a localized interferer, as both the locally available microphone signals and the received signal can be used in the interference cancellation process; this is indeed observed for certain situations in the simulations described later in the paper.

Following the information point of view one step further leads to the second scheme proposed in this paper, which is to just transmit one or more of the right ear microphone signals at rate R, as shown in Figure 2. The unprocessed signal conveys more information about both the desired and the undesired signal, although it potentially requires a higher bit-rate. What remains to be seen is the trade-off between rate and performance.

This paper provides a framework to quantify the performance of the two above-mentioned beamforming schemes in terms of the rate R, the location of the desired source and interferer, and the signal-to-interference ratio (SIR). The performance is then compared to both the optimal scheme discussed in [6] and the suboptimal scheme of Figure 1 at different bit-rates.

For the two-microphone system in Figure 2, given a bit-rate R, another possibility is to transmit each of the two right ear microphone signals at a rate R/2. This is, however, not considered in this paper for the following reason. In terms of interference cancellation, even at infinite rate, transmitting both signals results in only a marginal improvement in SIR compared to transmitting one signal. The reason is that the left ear already has an endfire array.

Figure 1: The scheme of [7]. The desired signal is first estimated from the right ear microphone signals and then transmitted at a rate R. At the left ear, the desired signal is estimated from the received signal and the local microphone signals.

Figure 2: One right ear microphone signal is transmitted at a rate R. At the left ear, the desired signal is estimated from the received signal and the local microphone signals.

A large gain results from the new broadside array that is created by transmitting the signal observed at one right ear microphone. Additionally transmitting the signal from the second right ear microphone, which is located close to the first microphone, provides only a marginal gain. Thus, in the remainder of this paper, we only consider transmitting one microphone signal from the right ear.

Another aspect of the discussion on the signal to be transmitted is related to the frequency dependence of the performance of the beamformer, especially in systems where the individual hearing aids have multiple closely spaced microphones. These small microphone arrays on the individual devices are capable of interference cancellation at high frequencies but not at low frequencies, where they have a large beamwidth. As shown in [8], the benefit provided by a binaural beamformer in terms of interference cancellation is thus limited to the low-frequency part of the signal, where the monaural array performs poorly. Moreover, due to the larger size of the binaural array, spatial aliasing affects performance at high frequencies. Motivated by these reasons, this paper also investigates the effect of transmitting only the low frequencies on the required bit-rate and the resulting performance of the beamformer.

The remainder of this paper is organized as follows. Section 2 introduces the signal model and the relevant notation. The two rate-constrained transmission schemes introduced in this paper are presented in Section 3. The performance of the proposed and reference systems is compared for different scenarios in Section 4. Concluding remarks are presented in Section 5.


2. Signal Model

Consider a desired source S(n) and an interfering source I(n), in the presence of noise. The left ear signal model can be written as

\[ X_l^k(n) = h_l^k(n) * S(n) + g_l^k(n) * I(n) + U_l^k(n), \quad k = 1, \dots, K, \qquad (1) \]

where h_l^k(n) and g_l^k(n) are the transfer functions between the kth microphone on the left hearing aid and the desired and interfering sources, respectively, U_l^k(n) is uncorrelated zero-mean white Gaussian noise at the kth microphone on the left hearing aid, K is the number of microphones on the left hearing aid, n is the time index, and the operator * denotes convolution. The different sources are assumed to be zero-mean, independent, jointly Gaussian stationary random processes with power spectral densities (PSDs) Φ_S(ω), Φ_I(ω), and Φ_{U_l}^k(ω), respectively. The above signal model allows the consideration of different scenarios, for example, a desired signal in the presence of uncorrelated noise, or a desired signal in the presence of a localized interferer, and so forth. Let

\[ S_l^k(n) = h_l^k(n) * S(n) \qquad (2) \]

denote the desired signal at the kth microphone on the left device and let

\[ W_l^k(n) = g_l^k(n) * I(n) + U_l^k(n) \qquad (3) \]

denote the undesired signal. A similar right ear model follows:

\[ X_r^k(n) = \underbrace{h_r^k(n) * S(n)}_{S_r^k(n)} + \underbrace{g_r^k(n) * I(n) + U_r^k(n)}_{W_r^k(n)}, \quad k = 1, \dots, K, \qquad (4) \]

where the relevant terms are defined analogously to the left ear. The following assumptions are made for simplicity:

\[ \Phi_{U_l}^k(\omega) = \Phi_{U_r}^k(\omega) = \Phi_U(\omega), \quad k = 1, \dots, K. \qquad (5) \]

Let X_l(n) = [X_l^1(n), ..., X_l^K(n)]^T and X_r(n) = [X_r^1(n), ..., X_r^K(n)]^T. The vectors U_l(n) and U_r(n) are defined analogously. For any X(n) and Y(n), define Φ_{XY}(ω) to be their cross PSD. As U_l^k(n) and U_r^k(n) correspond to spatially uncorrelated noise, the following holds:

\[ \Phi_{U_l^j U_l^k}(\omega) = \Phi_{U_r^j U_r^k}(\omega) = 0, \quad j, k = 1, \dots, K,\; j \neq k, \]
\[ \Phi_{U_l^j U_r^k}(\omega) = \Phi_{U_r^j U_l^k}(\omega) = 0, \quad j, k = 1, \dots, K. \qquad (6) \]

The PSD of the microphone signal X_l^k(n) is given by

\[ \Phi_{X_l^k}(\omega) = \big|H_l^k(\omega)\big|^2 \Phi_s(\omega) + \big|G_l^k(\omega)\big|^2 \Phi_i(\omega) + \Phi_U(\omega), \qquad (7) \]

where H_l^k(ω) is the frequency domain transfer function corresponding to h_l^k(n), and G_l^k(ω) corresponds to g_l^k(n). An analogous expression follows for Φ_{X_r^k}(ω). The (j, k)th entry of the matrix Φ_{X_l}, which is the PSD matrix corresponding to the vector X_l, is given by

\[ \Phi_{X_l}^{jk} = H_l^j(\omega)\, H_l^{k\dagger}(\omega)\, \Phi_s(\omega) + G_l^j(\omega)\, G_l^{k\dagger}(\omega)\, \Phi_i(\omega) + \delta(j - k)\, \Phi_U(\omega), \qquad (8) \]

where the superscript † denotes complex conjugate transpose.
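For a single frequency, the PSD matrix of (8) can be assembled as in the following sketch; the variable names and the single-frequency evaluation are illustrative assumptions.

```python
import numpy as np

def psd_matrix(H_l, G_l, phi_s, phi_i, phi_u):
    """K x K PSD matrix of the left-ear microphone signals at one frequency,
    cf. (8): Phi_Xl = H H^dagger Phi_s + G G^dagger Phi_i + Phi_U I.

    H_l, G_l : length-K complex vectors of transfer functions H_l^k, G_l^k
    phi_s, phi_i, phi_u : source, interferer, and sensor-noise PSD values
    """
    H = np.asarray(H_l).reshape(-1, 1)
    G = np.asarray(G_l).reshape(-1, 1)
    return (phi_s * (H @ H.conj().T)
            + phi_i * (G @ G.conj().T)
            + phi_u * np.eye(len(H_l)))
```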

3. What to Transmit

The problem is treated from the perspective of estimating the desired signal at the left hearing aid. Assume that the right hearing aid transmits some function of its observed microphone signals to the left hearing aid. The left device uses its locally observed microphone signals together with the signal received from the right device to obtain an estimate \hat{S}_l of the desired signal S_l = S_l^1(n) (the choice k = 1 is arbitrary) at the left device. (The processing is symmetric, and the right device similarly obtains an estimate of S_r^1(n).)

Xt(n), and its PSD by Φt(ω). The signal Xt(n) is transmittedat a rate R to the left ear. Under the assumptions in the signalmodel presented in Section 2, the following parametric rate-distortion relation holds [9]:

R(λ) = 14π

∫∞

−∞max

(

0, log2Φt(ω)λ

)

dω,

D(λ) = 12π

∫∞

−∞min(λ,Φt(ω))dω,

(9)

where the rate is expressed in bits per sample. The distortionhere is the mean-squared error (MSE) between Xt(n) and itsreconstruction. Equation (9) provides the relation betweenthe number of bits R used to represent the signal and theresulting distortion D in the reconstructed signal. As therelation between R and D cannot be obtained in closedform, it is expressed in terms of a parameter λ. Inserting aparticular value of λ in (9) results in certain rate R and acorresponding distortion D. An R-D curve is obtained as λtraverses the interval [0, ess sup Φt(ω)], where ess sup is theessential supremum.
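The parametric relation (9) can be evaluated numerically as in the sketch below, which assumes the PSD is sampled on a uniform grid of normalized frequencies covering [−π, π) and is strictly positive; sweeping λ over (0, ess sup Φ_t] then traces out the R-D curve. The discretization is our own choice made for illustration.

```python
import numpy as np

def rate_distortion(phi_t, lam):
    """Evaluate the rate/distortion pair of (9) for one water level lambda.

    phi_t : sampled PSD Phi_t on a uniform grid of normalized frequencies [-pi, pi)
    lam   : parameter lambda, 0 < lam <= max(phi_t)
    Returns (rate in bits per sample, MSE distortion).
    """
    domega = 2 * np.pi / len(phi_t)
    rate = np.sum(np.maximum(0.0, np.log2(phi_t / lam))) * domega / (4 * np.pi)
    dist = np.sum(np.minimum(lam, phi_t)) * domega / (2 * np.pi)
    return rate, dist

# Example: trace an R-D curve for an assumed PSD shape
phi_t = 1.0 + 0.5 * np.cos(np.linspace(-np.pi, np.pi, 512, endpoint=False))
curve = [rate_distortion(phi_t, lam) for lam in np.geomspace(1e-3, phi_t.max(), 50)]
```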

Note that X_t(n) is quantized without regard to the final processing steps at the left ear, and without considering the presence of the left microphone signals. Incorporating such knowledge can lead to more efficient quantization schemes, for example, by allocating bits to only those frequency components of Φ_t(ω) that contribute to the estimation of S_l(n). Such schemes, however, as mentioned earlier, are not amenable to practical implementations, as the required statistics are not available under the nonstationary conditions encountered in hearing aid applications.

Let the right device compress X_t(n) at a rate of R bits per sample, which corresponds to a certain λ and a distortion D. The signal received at the left ear is depicted in


Figure 3: The forward channel representation.

Figure 3 using the forward channel representation [9]. The signal \hat{X}_t(n) is obtained by first bandlimiting X_t(n) with a filter with frequency response

\[ B(\omega) = \max\!\left(0, \frac{\Phi_t(\omega) - \lambda}{\Phi_t(\omega)}\right), \qquad (10) \]

and then adding Gaussian noise with PSD given by

\[ \Phi_Z(\omega) = \max\!\left(0, \frac{\lambda\big(\Phi_t(\omega) - \lambda\big)}{\Phi_t(\omega)}\right). \qquad (11) \]
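The forward-channel model of Figure 3 can be simulated per frequency bin as in the following sketch, using the filter (10) and the noise PSD (11). The frame-spectrum interface and the treatment of each PSD sample as a per-bin noise variance are simplifying assumptions of ours, not part of the original analysis.

```python
import numpy as np

def forward_channel(Xt_spec, phi_t, lam, rng=None):
    """Simulate the forward channel of Figure 3 per frequency bin.

    Xt_spec : complex spectrum of one transmitted frame
    phi_t   : PSD Phi_t sampled on the same bins
    lam     : rate-distortion parameter lambda
    Returns the spectrum of the reconstructed (received) signal.
    """
    rng = np.random.default_rng() if rng is None else rng
    B = np.maximum(0.0, (phi_t - lam) / phi_t)              # filter (10)
    phi_z = np.maximum(0.0, lam * (phi_t - lam) / phi_t)    # noise PSD (11)
    # Circularly symmetric complex Gaussian noise, per-bin variance phi_z
    Z = rng.standard_normal(len(phi_t)) + 1j * rng.standard_normal(len(phi_t))
    Z *= np.sqrt(phi_z / 2)
    return B * Xt_spec + Z
```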

Note that using such a representation for \hat{X}_t(n) in the analysis provides an upper bound on the achievable performance at rate R. Define

\[ X(n) = \Big[ X_l^T(n),\, \hat{X}_t^T(n) \Big]^T. \qquad (12) \]

The MMSE estimate of the desired signal S_l is given by

\[ \hat{S}_l = E\{S_l \mid X\}, \qquad (13) \]

and the corresponding MSE by

\[ \xi(R) = \frac{1}{2\pi} \int_0^{2\pi} \Big[ \Phi_{S_l}(\omega) - \Phi_{S_l X}(\omega)\, \Phi_X^{-1}(\omega)\, \Phi_{X S_l}(\omega) \Big]\, d\omega, \qquad (14) \]

where Φ_{S_l}(ω) = |H_l^1(ω)|^2 Φ_s, Φ_{S_l X}(ω) = [Φ_{S_l X_l}(ω), Φ_{S_l \hat{X}_t}(ω)], and Φ_X is the PSD matrix corresponding to the vector X. The MSE ξ(R) can be rewritten in an intuitively appealing form in terms of the MSE resulting when estimation is performed using only X_l and a reduction term due to the availability of the innovation process X_i = \hat{X}_t − E{\hat{X}_t | X_l}. The following theorem follows by applying results from linear estimation theory [10, Chapter 4].

Theorem 1. Let X_i = \hat{X}_t − E{\hat{X}_t | X_l}. X_i represents the innovation or the "new" information at the left ear provided by the wireless link. Then, the MSE ξ can be written as

\[ \xi(R) = \xi_l - \frac{1}{2\pi} \int_0^{2\pi} \big[ \Phi_{S_l}(\omega) - \xi_{lr}(\omega) \big]\, d\omega, \qquad (15) \]

where

\[ \xi_l = \frac{1}{2\pi} \int_0^{2\pi} \Big[ \Phi_{S_l}(\omega) - \Phi_{S_l X_l}(\omega)\, \Phi_{X_l}^{-1}(\omega)\, \Phi_{S_l X_l}^{\dagger}(\omega) \Big]\, d\omega \qquad (16) \]

is the error in estimating S_l from X_l alone, and

\[ \xi_{lr}(\omega) = \Phi_{S_l}(\omega) - \Phi_{S_l X_i}(\omega)\, \Phi_{X_i}^{-1}(\omega)\, \Phi_{S_l X_i}^{\dagger}(\omega) \qquad (17) \]

is the error in estimating S_l from X_i.
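Theorem 1 can be verified numerically for a single frequency bin, as in the sketch below, which draws a random joint PSD matrix for [S_l, X_l, \hat{X}_t] and checks that the direct MMSE of (14) matches the decomposition of (15)-(17). The dimensions and the random construction are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2
# Random positive-definite joint covariance of [S_l, X_l (K entries), X_t-hat]
A = rng.standard_normal((K + 2, K + 2)) + 1j * rng.standard_normal((K + 2, K + 2))
C = A @ A.conj().T
phi_S = C[0, 0].real
c_SX = C[0, 1:]                  # cross-PSD of S_l with [X_l, X_t-hat]
Phi_X = C[1:, 1:]                # PSD matrix of [X_l, X_t-hat]

# Direct MMSE, cf. (14), at this frequency bin
xi_direct = phi_S - (c_SX @ np.linalg.inv(Phi_X) @ c_SX.conj()).real

# Decomposition of Theorem 1
Phi_Xl = Phi_X[:K, :K]
c_SXl = c_SX[:K]
xi_l = phi_S - (c_SXl @ np.linalg.inv(Phi_Xl) @ c_SXl.conj()).real

# Innovation X_i = X_t-hat - E{X_t-hat | X_l}
w = np.linalg.inv(Phi_Xl) @ Phi_X[:K, K]        # linear predictor coefficients
var_Xi = (Phi_X[K, K] - Phi_X[K, :K] @ w).real  # innovation power
c_SXi = c_SX[K] - c_SXl @ w                     # cross-PSD of S_l with X_i
xi_lr = phi_S - (abs(c_SXi) ** 2) / var_Xi

assert np.isclose(xi_direct, xi_l - (phi_S - xi_lr))
```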

The MSE resulting from the two proposed schemes discussed in Section 1 can be computed by setting X_t(n) appropriately and then using (14). In the first scheme, an estimate of the undesired signal obtained using X_r(n) is transmitted to aid in better interference cancellation using the larger left-right microphone spacing. The resulting MSE is given by

\[ \xi_{\text{int}}(R) = \xi(R)\big|_{X_t = E\{W_r^1 \mid X_r\}}. \qquad (18) \]

In the second scheme, one of the raw microphone signals at the right ear, without loss of generality X_r^1(n), is transmitted. The resulting MSE is given by

\[ \xi_{\text{raw}}(R) = \xi(R)\big|_{X_t = X_r^1}. \qquad (19) \]

As an example, the relevant entities required in this case (X_t = X_r^1) to compute the MSE using (14) are given as follows:

\[ B(\omega) = \max\!\left(0, \frac{\Phi_{X_r^1}(\omega) - \lambda}{\Phi_{X_r^1}(\omega)}\right), \qquad \Phi_Z(\omega) = \max\!\left(0, \frac{\lambda\big(\Phi_{X_r^1}(\omega) - \lambda\big)}{\Phi_{X_r^1}(\omega)}\right), \]
\[ \Phi_{S_l X}(\omega) = \Big[ H_l^1(\omega)\, H_l^{\dagger}(\omega)\, \Phi_s(\omega),\; B(\omega)\, H_l^1(\omega)\, H_r^{1\dagger}(\omega)\, \Phi_s(\omega) \Big], \]
\[ \Phi_X = \begin{pmatrix} \Phi_{X_l} & \Phi_{X_l \hat{X}_t} \\ \Phi_{\hat{X}_t X_l} & \Phi_{\hat{X}_t} \end{pmatrix}, \qquad (20) \]

where the submatrix Φ_{X_l} in (20) is given by (8), the PSD Φ_{\hat{X}_t}(ω) of the received signal \hat{X}_t(n) is given by

\[ \Phi_{\hat{X}_t}(\omega) = |B(\omega)|^2\, \Phi_{X_r^1}(\omega) + \Phi_Z(\omega), \qquad (21) \]

and the cross PSD Φ_{X_l \hat{X}_t} is given by

\[ \Phi_{X_l \hat{X}_t} = B(\omega) \Big( H_l(\omega)\, H_r^{1\dagger}(\omega)\, \Phi_s(\omega) + G_l(\omega)\, G_r^{1\dagger}(\omega)\, \Phi_i(\omega) \Big), \qquad (22) \]

with H_l(ω) = [H_l^1(ω), ..., H_l^K(ω)]^T and G_l(ω) = [G_l^1(ω), ..., G_l^K(ω)]^T. The MSEs ξ_int(R) and ξ_raw(R) are evaluated for different bit-rates and compared to ξ_opt(R), which is the MSE resulting from the optimal scheme of [6], and ξ_sig(R), which is the MSE resulting from the reference scheme [7], where a local estimate of the desired signal is transmitted.

Before analyzing the performance as a function of the bit-rate, it is instructive to examine the asymptotic performance (at infinite bit-rate) of the different schemes.


An interesting contrast results from studying the case when only one microphone is present on each hearing aid versus the case with multiple microphones on each device. It is convenient to formulate the analysis in the frequency domain. In the single-microphone case, the transmitted signal can be expressed as X_t(ω) = A(ω) X_r^1(ω), where X_t(ω) and X_r^1(ω) are obtained by applying the discrete Fourier transform (DFT) to the respective time domain entities X_t(n) and X_r^1(n), and A(ω) is a nonzero scalar. In the method of [7], where a local estimate of the desired signal is transmitted, A(ω) = Φ_{S_l X_r}(ω) Φ_{X_r}^{-1}(ω). Note that in the single-microphone case, X_r(ω) = X_r^1(ω). In the first proposed scheme, where an estimate of the undesired signal is transmitted, A(ω) = Φ_{W_r X_r}(ω) Φ_{X_r}^{-1}(ω), and in the second proposed scheme, where the first microphone signal is transmitted, A(ω) = 1.

For the estimation at the left ear using both the locally observed signal and the transmitted signal, it can readily be seen that

\[ E\big\{ S_l(\omega) \mid X_l^1(\omega),\, A(\omega) X_r^1(\omega) \big\} = E\big\{ S_l(\omega) \mid X_l^1(\omega),\, X_r^1(\omega) \big\}, \qquad (23) \]

where S_l(ω) and X_l^1(ω) are obtained from their respective time domain entities S_l(n) and X_l^1(n) by applying the DFT. Thus, at infinite bit-rate in the single-microphone case, all three schemes reach the performance of the optimal scheme where both microphone signals are available at the left ear, regardless of the correlation properties of the undesired signals at the left and right ears.

The case when each hearing aid has multiple microphones, however, offers a contrasting result. In this case, the transmitted signal is given by X_t(ω) = A(ω) X_r(ω), where A(ω) is a 1 × K vector and assumes different values depending on the transmission scheme. Here, X_t(ω) is a down-mix of K different signals into a single signal, resulting in a potential loss of information, since in a practical scheme the down-mix at the right ear is performed without knowledge of the left ear signals. In this case, even at infinite bit-rate, the three schemes may not achieve optimal performance. One exception is when the undesired signal at the different microphones is uncorrelated; then transmitting a local estimate of the desired signal provides optimal performance, asymptotically.

4. Performance Analysis

In this section, the performance of the different schemes discussed above is compared for different locations of the interferer, different SIRs, and as a function of the bit-rate. All the involved PSDs are assumed to be known in order to establish theoretical upper bounds on performance. First, the performance measure used to evaluate the different schemes is introduced. The experimental setup used for the performance analysis is then described. Two cases are then considered: one where the desired signal is observed in the presence of uncorrelated (e.g., sensor) noise, and a second where the desired signal is observed in the presence of a localized interferer in addition to uncorrelated noise.

4.1. Performance Measure. As in [6, 7], the performance gain is defined as the ratio between the MSE at rate 0 and the MSE at rate R:

\[ G(R) = 10 \log_{10} \frac{\xi(0)}{\xi(R)}, \qquad (24) \]

which represents the gain in dB due to the availability of the wireless link. The quantities G_opt(R), G_sig(R), G_int(R), and G_raw(R), corresponding to the four different transmission schemes, are computed according to (24) from their respective average MSE values ξ_opt(R), ξ_sig(R), ξ_int(R), and ξ_raw(R). ξ(0) remains the same in all four cases, as it corresponds to the average MSE at rate zero, which is the MSE in estimating the desired signal using only the microphone signals on the left ear.

4.2. Experimental Setup. In the analysis, the number of microphones on each hearing aid was set to a maximum of two, that is, K = 2. Simulations were performed both for K = 1 and K = 2. The spherical head shadow model described in [11] was used to obtain the transfer functions H_l^k(ω), H_r^k(ω), G_l^k(ω), and G_r^k(ω), for k = 1, 2. The distance between microphones on a single hearing aid was assumed to be 0.01 m. The radius of the sphere was set to 0.0875 m. The desired, interfering, and noise sources were assumed to have flat PSDs Φ_s, Φ_i, and Φ_u, respectively, in the band [−Ω, Ω], where Ω = 2πF and F = 8000 Hz. Note that Φ_t is not flat due to the nonflat transfer functions.

4.3. Desired Source in Uncorrelated Noise. The desired source is assumed to be located at 0◦ in front of the hearing aid user. This is a common assumption in hearing aids [1]. The signal-to-noise ratio (SNR), computed as 10 log10 Φs/Φu, is assumed to be 20 dB. The SIR, computed as 10 log10 Φs/Φi, is assumed to be infinite, that is, Φi = 0. Thus, the only undesired signal in the system is uncorrelated noise.

Figure 4(a) plots the gain due to the availability of the wireless link for K = 1, that is, when each hearing aid has only one microphone. G_sig(R), G_int(R), and G_raw(R) are almost identical and reach G_opt(R) at R = ∞, as expected from (23). At low rates, the theoretically optimal scheme G_opt(R) performs better than the three suboptimal schemes, as it uses information about signal statistics at the remote device. The gain is 3 dB, corresponding to the familiar gain in uncorrelated noise resulting from the doubling of microphones from one to two. Clearly, in this case, transmitting the raw microphone signal is a good choice, as the computational load and delay in first obtaining a local estimate can be avoided.

Figure 4(b) plots the gain for K = 2, and as discussed at the end of Section 3, the contrast with K = 1 is evident. At high rates, both G_opt(R) and G_sig(R) approach 3 dB, again due to a doubling of microphones, now from two to four. G_raw(R) saturates at a lower value, as there are only three microphone signals available for the estimation. Finally, transmitting an estimate of the undesired signal leads to zero gain in this case, as the noise is spatially uncorrelated, and thus the transmitted signal does not contribute to the estimation of


Figure 4: Performance gain for the three schemes when a desired signal is observed in the presence of uncorrelated noise (i.e., SIR = ∞). (a) K = 1, (b) K = 2.

the desired signal at the left ear. It is interesting to note that for K = 1, transmitting an estimate of the undesired signal led to an improvement, which can be explained by (23).

When comparing the results for K = 1 with K = 2, it should be noted that the figures plot the improvement compared to rate R = 0 in each case, and not the absolute SNR gain. This applies to the results shown in the subsequent sections as well. For uncorrelated noise, the absolute SNR gain at infinite bit-rate of the four-microphone system compared to the SNR at a single microphone is 6 dB with K = 2 and 3 dB with K = 1.

Figure 5: Location of desired and interfering sources. For an interferer located at 30◦, the SIR at the left ear is lower than at the right ear due to head shadow.

4.4. Desired and Interfering Sources in Uncorrelated Noise. The behavior of the different schemes in the presence of a localized interferer is of interest in the hearing aid scenario. As before, a desired source is assumed to be located at 0◦ (in front of the user), and the SNR is set to 20 dB. In addition, an interferer is assumed to be located at −30◦ (i.e., in front, 30◦ to the right; see Figure 5), and the SIR is set to 0 dB. Figure 6 compares the four schemes for this case. For K = 1, Figure 6(a) shows that the different schemes exhibit similar performance.

For K = 2, Figure 6(b) provides useful insights. It is evident from the dotted curve that transmitting an estimate of the desired signal leads to poor performance. Transmitting an estimate of the interferer, interestingly, results in a higher gain, as seen from the dash-dot curve, which can be explained as follows. At high rates, the interferer is well preserved in the transmitted signal. Better interference suppression is now possible using the binaural array (larger spacing) than with the closely spaced monaural array, and thus the improved performance. Transmitting the unprocessed signal results in an even higher gain G_raw(R) that approaches the gain resulting from the optimal scheme. In this case, not only is better interference rejection possible but also better estimation of the desired signal, as the transmitted signal contains both the desired and undesired signals. Again, it is important to note that the figure only shows the gain due to the presence of the wireless link, and not the absolute SNR gain, which is higher for K = 2 than for K = 1 due to the higher number of microphones.

Figure 7 considers the case when the interferer is located at 30◦ instead of −30◦, which leads to an interesting result. Again, we focus on K = 2. The behavior of G_opt(R) and G_raw(R) in Figure 7(b) is similar to Figure 6(b), but the curves G_sig(R) and G_int(R) appear to be almost interchanged with respect to Figure 6(b). This reversal in performance can be intuitively explained by the head shadow effect. Note that the performance gain is measured at the left ear. When the interferer is located at 30◦, the SIR at the left ear is lower than the SIR at the right ear, as the interferer is closer to the left ear and shadowed by the head at the right ear; see Figure 5. Thus, at the right ear, it is possible to obtain a


Figure 6: Performance gain for the different schemes when a desired signal is observed in the presence of uncorrelated noise at 20 dB SNR and an interfering source at 0 dB SIR located at −30◦. (a) K = 1, (b) K = 2.

good estimate of the desired signal but not of the interferer. So, transmitting an estimate of the desired signal leads to better performance than transmitting an estimate of the interferer. For an interferer located at −30◦, the interference-to-signal ratio is higher at the right ear, and thus it is possible to obtain a better estimate of the interferer than is possible at the left ear. Transmitting this estimate to the left ear provides information that can be exploited for interference cancellation.

From the above analysis, it can be concluded that a decision on which signal to transmit needs to be made depending on the SIR. At high SIRs (SIR = ∞ in the limit,

Figure 7: Performance gain for the different schemes when a desired signal is observed in the presence of uncorrelated noise at 20 dB SNR and an interfering source at 0 dB SIR located at 30◦. (a) K = 1, (b) K = 2.

thus only uncorrelated noise), transmitting an estimate of the desired signal is better than transmitting the raw microphone signal. At low SIRs, the converse holds. A simple rule of thumb is to always transmit the unprocessed microphone signal, as the penalty at high SIRs is negligible (see Figure 4) compared to the potential gains at low SIRs (see Figures 6 and 7). In addition, such a scheme results in a lower computational load and reduced delay.

It may be noted that this paper considers theoretical upper bounds on the performance of the different schemes. In a practical scheme where the unprocessed signal is coded and transmitted, only the PSD of the noisy signal is required.


Figure 8: Performance gain for the different schemes when a desired signal is observed in the presence of uncorrelated noise at 20 dB SNR, K = 2, and only the low-frequency portion (below 4 kHz) is transmitted. (a) Interfering source at 0 dB SIR located at 30◦. (b) Interfering source at 0 dB SIR located at 120◦.

On the other hand, the values of Gsig(R) and Gint(R) could be lower in practice than the presented theoretical upper bounds as they depend on knowledge of the PSD of the desired and interfering sources, respectively, which need to be estimated from the noisy signal. This makes transmitting the unprocessed signal an attractive choice.

4.5. Transmitting Only Low Frequencies. It is well known that a closely spaced microphone array offers good performance at high frequencies, and an array with a larger microphone spacing performs well at low frequencies. This observation can be exploited in binaural beamforming [8]. Figure 8 depicts the performance when only the low-frequency content of one microphone signal (up to 4 kHz) is transmitted from the right ear. As Graw(R) provided the best performance in the analysis so far, only this scheme is considered in this experiment. Each hearing aid is assumed to have two microphones. At the left ear, the low-frequency portion of the desired signal is estimated using the two locally available microphone signals and the transmitted signal. The high-frequency portion is estimated using only the local signals.

The gain achieved in this setup at a rate R is denoted Glpraw(R).

When an interferer is in the front half plane at, for example, 30◦ as in Figure 8(a), transmitting the low-frequency part alone results in poor performance. This is because the small microphone array at the left ear cannot distinguish between desired and interfering sources that are located close together. In this case, the binaural array is useful even at high frequencies. When the interferer is located in the rear half plane at, for example, 120◦ as in Figure 8(b), transmitting just the low-frequency part results in good performance. Glpraw(R) reaches its limit at a lower bit-rate than Graw(R) as the high-frequency content need not be transmitted. At infinite bit-rate, Glpraw(R) is lower than Graw(R), but such a scheme allows a trade-off between bit-rate and performance.
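To make the bit-rate saving concrete, the following sketch (not taken from the paper) band-limits the right-ear microphone signal to 4 kHz and decimates it before coding; the sampling rate, filter length, and function name are illustrative assumptions.

```python
# Hypothetical sketch: keep only the low-frequency content (below 4 kHz) of the
# right-ear signal and decimate it, so the codec has half as many samples to
# represent. Sampling rate, filter length, and names are assumptions.
import numpy as np
from scipy.signal import firwin, lfilter, resample_poly

fs = 16000       # assumed sampling rate in Hz
cutoff = 4000    # transmitted bandwidth in Hz

def lowband_for_transmission(x_right):
    lp = firwin(129, cutoff, fs=fs)            # linear-phase low-pass at 4 kHz
    x_lp = lfilter(lp, 1.0, x_right)           # remove the high-frequency content
    return resample_poly(x_lp, 1, fs // (2 * cutoff))   # 16 kHz -> 8 kHz

x_right = np.random.randn(fs)                  # one second of dummy signal
x_tx = lowband_for_transmission(x_right)
print(len(x_right), len(x_tx))                 # 16000 -> 8000 samples
```

At the receiving ear, only the band below 4 kHz would then be combined binaurally, while the high band is processed with the local microphones alone, mirroring the Glpraw(R) scheme discussed above.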

5. Conclusions

In a wireless binaural hearing aid system where each hearing aid has multiple microphones, the choice of the signal that one hearing aid needs to transmit to the other is not obvious. Optimally, the right hearing aid needs to transmit the part of the desired signal that can be predicted by the right ear signals but not the left ear signals, and vice versa for the left device [6]. At the receiving end, an estimate of the desired signal is obtained using the signal received over the wireless link and the locally observed signals. However, such an optimal scheme requires that the right device is aware of the joint statistics of the signals at both the left and right devices, which is impractical in the nonstationary conditions encountered in hearing aid applications. Suboptimal practical schemes are thus required.

Transmitting an estimate of the desired signal obtained at one hearing aid to the other is asymptotically optimal when the only undesired signal in the system is spatially uncorrelated noise [7]. In the presence of a localized interfering sound source the undesired signal is correlated across the different microphones and it has been seen that such a scheme is no longer optimal. Two alternative schemes have been proposed and investigated in this paper. The first is transmitting an estimate of the undesired signal, which performs better than transmitting an estimate of the desired signal depending on the location of the interfering sound source. The second is to simply transmit one of the unprocessed microphone signals from one device to the other. In the presence of a localized interferer or equivalently at low SIRs, the second scheme provides a significant gain compared to the other suboptimal schemes. In the presence of uncorrelated noise or equivalently at high SIRs, however, there is approximately a 1 dB loss in performance compared to the method of [7]. While it is possible to change the transmission scheme depending on the SIR, a simple rule of thumb is to always transmit the unprocessed microphone signal as the penalty at high SIRs is negligible (see Figure 4) compared to the potential gains at low SIRs (see Figures 6 and 7). Furthermore, not having to obtain an estimate before transmission results in a lower computational load and reduced delay, both of which are critical in hearing aid applications. It is to be noted that the results discussed in this paper apply when only a single interferer is present. Performance in the presence of multiple interferers is a topic for further study.

As a microphone array with a large interelement spacing, as in a binaural hearing aid system, performs well only at low frequencies, the effect of transmitting only the low-frequency content from the right hearing aid was also investigated. For interferers located in the rear half plane, a lower bit-rate is sufficient to maintain a similar level of performance as when the whole frequency range is transmitted. As the entire frequency range is not transmitted, the asymptotic performance is lower than with full-band transmission. Such a scheme, however, provides a trade-off between the required bit-rate and achievable beamforming gain.

References

[1] V. Hamacher, J. Chalupper, J. Eggers, et al., “Signal processing in high-end hearing aids: state of the art, challenges, and future trends,” EURASIP Journal on Applied Signal Processing, vol. 2005, no. 18, pp. 2915–2929, 2005.

[2] S. Kochkin, “MarkeTrak VII: customer satisfaction with hearing instruments in the digital age,” The Hearing Journal, vol. 58, no. 9, pp. 30–43, 2005.

[3] T. J. Klasen, T. van den Bogaert, M. Moonen, and J. Wouters, “Binaural noise reduction algorithms for hearing aids that preserve interaural time delay cues,” IEEE Transactions on Signal Processing, vol. 55, no. 4, pp. 1579–1585, 2007.

[4] S. Doclo, T. J. Klasen, T. van den Bogaert, J. Wouters, and M. Moonen, “Theoretical analysis of binaural cue preservation using multi-channel Wiener filtering and interaural transfer functions,” in Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC ’06), Paris, France, September 2006.

[5] S. Doclo, T. van den Bogaert, J. Wouters, and M. Moonen, “Comparison of reduced-bandwidth MWF-based noise reduction algorithms for binaural hearing aids,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA ’07), pp. 223–226, October 2007.

[6] O. Roy and M. Vetterli, “Rate-constrained beamforming for collaborating hearing aids,” in Proceedings of IEEE International Symposium on Information Theory (ISIT ’06), pp. 2809–2813, 2006.

[7] O. Roy and M. Vetterli, “Collaborating hearing aids,” in Proceedings of MSRI Workshop on Mathematics of Relaying and Cooperation in Communication Networks, April 2006.

[8] S. Srinivasan, “Low-bandwidth binaural beamforming,” Electronics Letters, vol. 44, no. 22, pp. 1292–1294, 2008.

[9] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Information and System Sciences Series, T. Kailath, Ed., Prentice Hall, Upper Saddle River, NJ, USA, 1971.

[10] T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation, Prentice Hall, Upper Saddle River, NJ, USA, 2000.

[11] R. O. Duda and W. L. Martens, “Range dependence of the response of a spherical head model,” The Journal of the Acoustical Society of America, vol. 104, no. 5, pp. 3048–3058, 1998.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 968345, 15 pages
doi:10.1155/2009/968345

Research Article

Combination of Adaptive Feedback Cancellation and Binaural Adaptive Filtering in Hearing Aids

Anthony Lombard, Klaus Reindl, and Walter Kellermann

Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg, Cauerstr. 7, 91058 Erlangen, Germany

Correspondence should be addressed to Anthony Lombard, [email protected]

Received 12 December 2008; Accepted 17 March 2009

Recommended by Sven Nordholm

We study a system combining adaptive feedback cancellation and adaptive filtering connecting inputs from both ears for signal enhancement in hearing aids. For the first time, such a binaural system is analyzed in terms of system stability, convergence of the algorithms, and possible interaction effects. As major outcomes of this study, a new stability condition adapted to the considered binaural scenario is presented, some already existing and commonly used feedback cancellation performance measures for the unilateral case are adapted to the binaural case, and possible interaction effects between the algorithms are identified. For illustration purposes, a blind source separation algorithm has been chosen as an example for adaptive binaural spatial filtering. Experimental results for binaural hearing aids confirm the theoretical findings and the validity of the new measures.

Copyright © 2009 Anthony Lombard et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Traditionally, signal enhancement techniques for hearing aids (HAs) were mainly developed independently for each ear [1–4]. However, since the human auditory system is a binaural system combining the signals received from both ears for audio perception, providing merely bilateral systems (that operate independently for each ear) to the hearing-aid user may distort crucial binaural information needed to localize sound sources correctly and to improve speech perception in noise. Foreseeing the availability of wireless technologies for connecting the two ears, several binaural processing strategies have therefore been presented in the last decade [5–10]. In [5], a binaural adaptive noise reduction algorithm exploiting one microphone signal from each ear has been proposed. Interaural time difference cues of speech signals were preserved by processing only the high-frequency components while leaving the low frequencies unchanged. Binaural spectral subtraction is proposed in [6]. It utilizes cross-correlation analysis of the two microphone signals for a more reliable estimation of the common noise power spectrum, without requiring stationarity for the interfering noise as the single-microphone versions do. Binaural multi-channel Wiener filtering approaches preserving binaural cues were also proposed, for example, in [7–9], and signal enhancement techniques based on blind source separation (BSS) were presented in [10].

Research on feedback suppression and control system theory in general has also given rise to numerous hearing-aid specific publications in recent years. The behavior of unilateral closed-loop systems and the ability of adaptive feedback cancellation algorithms to compensate for the feedback has been extensively studied in the literature (see, e.g., [11–15]). But despite the progress in binaural signal enhancement, binaural systems have not been considered in this context. In this paper, we therefore present a theoretical analysis of a binaural system combining adaptive feedback cancellation (AFC) and binaural adaptive filtering (BAF) techniques for signal enhancement in hearing aids.

The paper is organized as follows. An efficient binaural configuration combining AFC and BAF is described in Section 2. Generic vector/matrix notations are introduced for each part of the processing chain. Interaction effects concerning the AFC are then presented in Section 3. It includes a derivation of the ideal binaural AFC solution, a convergence analysis of the AFC filters based on the binaural Wiener solution, and a stability analysis of the binaural system. Interaction effects concerning the BAF are discussed in Section 4. Here, to illustrate our argumentation, a BSS scheme has been chosen as an example for adaptive binaural filtering. Experimental conditions and results are finally presented in Sections 5 and 6 before providing concluding remarks in Section 7.

2. Signal Model

AFC and BAF techniques can be combined in two different ways. The feedback cancellation can be performed directly on the microphone inputs, or it can be applied at a later stage, to the BAF outputs. The second variant requires in general fewer filters but it also has several drawbacks. Actually, when the AFC comes after the BAF in the processing chain, the feedback cancellation task is complicated by the necessity to follow the continuously time-varying BAF filters. It may also significantly increase the necessary length of the AFC filters. Moreover, the BAF cannot benefit from the feedback cancellation effectuated by the AFC in this case. Especially at high HA amplification levels, the presence of strong feedback components in the sensor inputs may, therefore, seriously disturb the functioning of the BAF. These are structurally the same effects as those encountered when combining adaptive beamforming with acoustic echo cancellation (AEC) [16].

In this paper, we will therefore concentrate on the “AFC-first” alternative, where AFC is followed by the BAF. Figure 1 depicts the signal model adopted in this study. Each component of the signal model will be described separately in the following, and generic vector/matrix notations will be introduced to carry out a general analysis of the overall system in Sections 3 and 4.

2.1. Notations. In this paper, lower-case boldface characters represent (row) vectors capturing signals or the filters of single-input-multiple-output (SIMO) systems. Accordingly, multiple-input-single-output (MISO) systems are described by transposed vectors. Matrices denoting multiple-input-multiple-output (MIMO) systems are represented by upper-case boldface characters. The transposition of a vector or a matrix will be denoted by the superscript {·}T.

2.2. The Microphone Signals. We consider here multi-sensor hearing aid devices with P microphones at each ear (see Figure 1), where P typically ranges between one and three. Because of the reverberation in the acoustical environment, Q point source signals sq (q = 1, . . . , Q) are filtered by a MIMO mixing system (one Q × P MIMO system for each ear in the figure) modeled by finite impulse response (FIR) filters. This can be expressed in the z-domain as:

$$x^{s}_{I_p}(z) = \sum_{q=1}^{Q} s_q(z)\, h_{qI_p}(z), \quad I \in \{L, R\}, \qquad (1)$$

where $x^{s}_{I_p}(z)$ is the z-domain representation of the received source signal mixture at the pth sensor of the left (I = L) and right (I = R) hearing aid, respectively. $h_{qL_p}(z)$ and $h_{qR_p}(z)$ denote the transfer functions (polynomials of order up to several thousands, typically) between the qth source and the pth sensor at the left and right ears, respectively. One of the point sources may be seen as the target source to be extracted, the remaining Q − 1 being considered as interfering point sources. For the sake of simplicity, the z-transform dependency (z) will be omitted in the rest of this paper, as long as the notation is not ambiguous.

The acoustic feedback originating from the loudspeakers (LS) uL and uR at the left and right ears, respectively, is modeled by four 1 × P SIMO systems of FIR filters. $f_{LL_p}$ and $f_{RL_p}$ represent the (z-domain) transfer functions (polynomials of order up to several hundreds, typically) from the loudspeakers to the pth sensor on the left side, and $f_{LR_p}$ and $f_{RR_p}$ represent the transfer functions from the loudspeakers to the pth sensor on the right side. The feedback components captured by the pth microphone of each ear can therefore be expressed in the z-domain as

$$x^{u}_{I_p} = u_L\, f_{LI_p} + u_R\, f_{RI_p}, \quad I \in \{L, R\}. \qquad (2)$$

Note that as long as the energies of the two LS signals are comparable, the “cross” feedback signals (traveling from one ear to the other) are negligible compared to the “direct” feedback signals (occurring on each side independently). With the feedback paths (FBP) used in this study (see the description of the evaluation data in Section 5.3), an energy difference ranging from 15 to 30 dB has been observed between the “direct” and “cross” FBP impulse responses. When the HA gains are set at similar levels in both ears, the “cross” FBPs can then be neglected. But the impact of the “cross” feedback signals becomes more significant when a large difference exists between the two HA gains. Here, therefore, we explicitly account for the two types of feedback by modelling both the “direct” paths (with transfer functions $f_{LL_p}$ and $f_{RR_p}$, p = 1, . . . , P) and the “cross” paths (with transfer functions $f_{RL_p}$ and $f_{LR_p}$, p = 1, . . . , P) by FIR filters.

Diffuse noise signals $n_{L_p}$ and $n_{R_p}$, p = 1, . . . , P, constitute the last microphone signal components at the left and right ears, respectively. The z-domain representation of the pth sensor signal at each ear is finally given by:

$$x_{I_p} = x^{s}_{I_p} + x^{n}_{I_p} + x^{u}_{I_p}, \quad I \in \{L, R\}. \qquad (3)$$

This can be reformulated in a compact matrix form jointly capturing the P microphone signals of each HA:

$$\mathbf{x} = \mathbf{x}^s + \mathbf{x}^n + \mathbf{x}^u = \mathbf{s}\,\mathbf{H} + \mathbf{x}^n + \mathbf{u}\,\mathbf{F}, \qquad (4)$$

where we have used the z-domain signal vectors

$$\mathbf{s} = [s_1, \ldots, s_Q], \qquad (5)$$
$$\mathbf{x}^s_L = \big[x^s_{L_1}, \ldots, x^s_{L_P}\big], \qquad (6)$$
$$\mathbf{x}^s_R = \big[x^s_{R_1}, \ldots, x^s_{R_P}\big], \qquad (7)$$
$$\mathbf{x}^s = \big[\mathbf{x}^s_L \;\; \mathbf{x}^s_R\big], \qquad (8)$$
$$\mathbf{u} = [u_L \;\; u_R], \qquad (9)$$



Figure 1: Signal model of the AFC-BAF combination. (The block diagram shows the acoustical mixing systems HL and HR, the acoustic feedback paths fLL, fRL, fLR, fRR, the adaptive feedback canceler bL, bR, the binaural adaptive filtering wLL, wRL, wLR, wRR, and the hearing-aid processing gL, gR.)

as well as the z-domain matrices

$$\mathbf{H}_L = \begin{bmatrix} h_{1L_1} & \cdots & h_{1L_P} \\ \vdots & \ddots & \vdots \\ h_{QL_1} & \cdots & h_{QL_P} \end{bmatrix}, \qquad (10)$$
$$\mathbf{H}_R = \begin{bmatrix} h_{1R_1} & \cdots & h_{1R_P} \\ \vdots & \ddots & \vdots \\ h_{QR_1} & \cdots & h_{QR_P} \end{bmatrix}, \qquad (11)$$
$$\mathbf{H} = [\mathbf{H}_L \;\; \mathbf{H}_R], \qquad (12)$$
$$\mathbf{f}_{LL} = \big[f_{LL_1}, \ldots, f_{LL_P}\big], \qquad (13)$$
$$\mathbf{f}_{RL} = \big[f_{RL_1}, \ldots, f_{RL_P}\big], \qquad (14)$$
$$\mathbf{F}_L = \big[\mathbf{f}_{LL}^T \;\; \mathbf{f}_{RL}^T\big]^T, \qquad (15)$$
$$\mathbf{f}_{LR} = \big[f_{LR_1}, \ldots, f_{LR_P}\big], \qquad (16)$$
$$\mathbf{f}_{RR} = \big[f_{RR_1}, \ldots, f_{RR_P}\big], \qquad (17)$$
$$\mathbf{F}_R = \big[\mathbf{f}_{LR}^T \;\; \mathbf{f}_{RR}^T\big]^T, \qquad (18)$$
$$\mathbf{F} = [\mathbf{F}_L \;\; \mathbf{F}_R] = \begin{bmatrix} \mathbf{f}_{LL} & \mathbf{f}_{LR} \\ \mathbf{f}_{RL} & \mathbf{f}_{RR} \end{bmatrix}. \qquad (19)$$

Furthermore, $\mathbf{x}^n$ and $\mathbf{x}^u$, capturing the noise and feedback components present in the microphone signals, are defined in a similar way to $\mathbf{x}^s$. The sensor signal decomposition (4) can be further refined by distinguishing between target and interfering sources:

$$\mathbf{x}^s = \mathbf{x}^{s_{\mathrm{tar}}} + \mathbf{x}^{s_{\mathrm{int}}} = s_{\mathrm{tar}}\,\mathbf{h}_{\mathrm{tar}} + \mathbf{s}_{\mathrm{int}}\,\mathbf{H}_{\mathrm{int}}. \qquad (20)$$

$s_{\mathrm{tar}}$ refers to the target source and $\mathbf{s}_{\mathrm{int}}$ is a subset of $\mathbf{s}$ capturing the Q − 1 remaining interfering sources. $\mathbf{h}_{\mathrm{tar}}$ is a row of $\mathbf{H}$ which captures the transfer functions from the target source to the sensors, and $\mathbf{H}_{\mathrm{int}}$ is a matrix containing the remaining Q − 1 rows of $\mathbf{H}$. Like the other vectors and matrices defined above, these four entities can be further decomposed into their left and right subsets, labeled with the indices L and R, respectively.
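As a concrete illustration of the signal model above, the following sketch (not from the paper) synthesizes sensor signals according to (4), with randomly drawn FIR filters standing in for the measured mixing and feedback systems and with white noise standing in for the sources, loudspeaker signals, and diffuse noise.

```python
# Minimal sketch of the sensor model x = s H + x^n + u F of (4), assuming
# Q = 2 point sources, P = 1 microphone per ear, and randomly drawn FIR
# impulse responses in place of measured HRIRs and feedback paths.
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
N, Q, P = 16000, 2, 1                           # samples, sources, mics per ear
s = rng.standard_normal((Q, N))                 # source signals s_1 ... s_Q
u = rng.standard_normal((2, N))                 # loudspeaker signals u_L, u_R (stand-ins)
H = rng.standard_normal((Q, 2 * P, 64))         # mixing filters h_{qI_p} (length 64)
F = 0.01 * rng.standard_normal((2, 2 * P, 32))  # feedback paths f (length 32)
noise = 0.01 * rng.standard_normal((2 * P, N))  # diffuse noise x^n

x = np.zeros((2 * P, N))
for m in range(2 * P):                          # m indexes [left mics, right mics]
    for q in range(Q):                          # source contribution x^s, eq. (1)
        x[m] += lfilter(H[q, m], 1.0, s[q])
    for i in range(2):                          # feedback contribution x^u, eq. (2)
        x[m] += lfilter(F[i, m], 1.0, u[i])
    x[m] += noise[m]                            # diffuse noise contribution x^n
```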

2.3. The AFC Processing. As can be seen from Figure 1, we apply here AFC to remove the feedback components present in the sensor signals, before passing them to the BAF. Feedback cancellation is achieved by trying to produce replicas of these undesired components, using a set of adaptive filters. The solution adopted here consists of two 1 × P SIMO systems of adaptive FIR filters, with transfer functions $b_{L_p}$ and $b_{R_p}$ between the left (resp. right) loudspeaker and the pth sensor on the left (resp. right) side. The output

$$y_{I_p} = u_I\, b_{I_p}, \quad I \in \{L, R\}, \qquad (21)$$

of the pth filter on the left (resp. right) side is then subtracted from the pth sensor signal on the left (resp. right) side, producing a residual signal

$$e_{I_p} = x_{I_p} - y_{I_p}, \quad I \in \{L, R\}, \qquad (22)$$

which is, ideally, free of any feedback components. (21) and (22) can be reformulated in matrix form as follows:

$$\mathbf{e} = \mathbf{x} - \mathbf{y} = \mathbf{x} - \mathbf{u}\,\mathbf{B}, \qquad (23)$$

with the block-diagonal constraint

$$\mathbf{B} \stackrel{!}{=} \mathbf{B}^c = \begin{bmatrix} \mathbf{b}_L & \mathbf{0} \\ \mathbf{0} & \mathbf{b}_R \end{bmatrix} \qquad (24)$$



put on the AFC system. The vectors $\mathbf{e}$ and $\mathbf{y}$, capturing the z-domain representations of the residual and AFC output signals, respectively, are defined in an analogous way to $\mathbf{x}^s$ in (8). As can be seen from (21) and (22), we perform here bilateral feedback cancellation (as opposed to binaural operations) since AFC is performed for each ear separately. This is reflected in (24), where we force the off-diagonal terms to be zero instead of reproducing the acoustic feedback system $\mathbf{F}$ with its set of four SIMO systems. The reason for this will become clear in Section 3.1. Guidelines regarding an arbitrary (i.e., unconstrained) AFC system $\mathbf{B}$ (defined similarly to $\mathbf{F}$ in this case) will also be provided at some points in the paper. The superscript $\{\cdot\}^c$ is used to distinguish constrained systems $\mathbf{B}^c$ defined by (24) from arbitrary (unconstrained) systems $\mathbf{B}$ (with possibly non-zero off-diagonal terms).
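The bilateral structure of (21)–(24) can be sketched as follows (illustrative code, not the authors' implementation): each ear filters only its own loudspeaker signal and subtracts the result from its own microphone signals.

```python
# Sketch of the block-diagonal (bilateral) AFC of (21)-(24): each ear uses
# only its own loudspeaker signal to cancel feedback in its own microphones.
# Filter lengths and signals are placeholders.
import numpy as np
from scipy.signal import lfilter

def afc_residuals(x_left, x_right, u_left, u_right, b_left, b_right):
    """e_{I_p} = x_{I_p} - u_I * b_{I_p} for I in {L, R}; inputs are (P, N)
    microphone arrays, (N,) loudspeaker signals, and (P, L) filter arrays."""
    e_left = np.stack([xp - lfilter(bp, 1.0, u_left)
                       for xp, bp in zip(x_left, b_left)])
    e_right = np.stack([xp - lfilter(bp, 1.0, u_right)
                        for xp, bp in zip(x_right, b_right)])
    return e_left, e_right
```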

2.4. The BAF Processing. The BAF filters perform spatial filtering to enhance the signal coming from one of the Q external point sources. This is performed here binaurally, that is, by combining signals from both ears (see Figure 1). The binaural filtering operations can be described by a set of four P × 1 MISO systems of adaptive FIR filters. This can be expressed in the z-domain as follows:

$$v_I = \sum_{p=1}^{P}\big(e_{L_p}\, w_{L_p I} + e_{R_p}\, w_{R_p I}\big), \quad I \in \{L, R\}, \qquad (25)$$

where $w_{L_p I}$ and $w_{R_p I}$, p = 1, . . . , P, I ∈ {L, R}, are the transfer functions applied to the pth sensor of the left and right hearing aids, respectively. To reformulate (25) in matrix form, we define the vector

$$\mathbf{v} = [v_L \;\; v_R], \qquad (26)$$

which jointly captures the z-domain representations of the two BAF outputs, and the vector and matrices

$$\mathbf{w}_{LL} = \big[w_{L_1 L}, \ldots, w_{L_P L}\big], \qquad (27)$$
$$\mathbf{w}_{RL} = \big[w_{R_1 L}, \ldots, w_{R_P L}\big], \qquad (28)$$
$$\mathbf{w}_{L} = [\mathbf{w}_{LL} \;\; \mathbf{w}_{RL}], \qquad (29)$$
$$\mathbf{w}_{LR} = \big[w_{L_1 R}, \ldots, w_{L_P R}\big], \qquad (30)$$
$$\mathbf{w}_{RR} = \big[w_{R_1 R}, \ldots, w_{R_P R}\big], \qquad (31)$$
$$\mathbf{w}_{R} = [\mathbf{w}_{LR} \;\; \mathbf{w}_{RR}], \qquad (32)$$
$$\mathbf{W} = \big[\mathbf{w}_L^T \;\; \mathbf{w}_R^T\big] = \begin{bmatrix} \mathbf{w}_{LL}^T & \mathbf{w}_{LR}^T \\ \mathbf{w}_{RL}^T & \mathbf{w}_{RR}^T \end{bmatrix}, \qquad (33)$$

related to the transfer functions of the MIMO BAF system. We can finally express (25) as:

$$\mathbf{v} = \mathbf{e}\,\mathbf{W}. \qquad (34)$$

2.5. The Forward Paths. Conventional HA processing (mainly a gain correction) is performed on the output of the AFC-BAF combination, before being played back by the loudspeakers:

$$u_I = v_I\, g_I, \quad I \in \{L, R\}, \qquad (35)$$

where gL and gR model the HA processing in the z-domain at the left and right ears, respectively. In the literature, this part of the processing chain is often referred to as the forward path (in opposition to the acoustic feedback path). To facilitate the analysis, we will assume that the HA processing is linear and time-invariant (at least between two adaptation steps) in this study. (35) can be conveniently written in matrix form as:

$$\mathbf{u} = \mathbf{v}\,\mathrm{Diag}\{\mathbf{g}\}, \qquad (36)$$

with

$$\mathbf{g} = [g_L \;\; g_R]. \qquad (37)$$

The Diag{·} operator applied to a vector builds a diagonal matrix with the vector entries placed on the main diagonal.

Note that, for simplicity, we assumed that the number of sensors P used on each device for digital signal processing was equal. The above notations as well as the following analysis are, however, readily applicable to asymmetrical configurations also, simply by resizing the above-defined vectors and matrices, or by setting the corresponding microphone signals and all the associated transfer functions to zero. In particular, the unilateral case can be seen as a special case of the binaural structure discussed in this paper, with one or more microphones used on one side, but none on the other side.
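To tie the components of this section together, the following sketch (an assumption-laden toy simulation, not the authors' system) runs the closed loop of Figure 1 sample by sample for P = 1, with fixed (non-adaptive) AFC and BAF filters, a frequency-flat forward gain, and randomly drawn short FIR paths.

```python
# Toy closed-loop simulation for P = 1: microphones pick up one source plus
# feedback, the bilateral AFC subtracts its feedback estimate, a trivial BAF
# averages both residuals, and the delayed, amplified output re-enters the
# acoustic feedback paths. All filters, gains, and lengths are assumptions.
import numpy as np

rng = np.random.default_rng(1)
N, L = 8000, 16
s = np.concatenate([np.zeros(L), rng.standard_normal(N)])  # source with zero pre-history
h = rng.standard_normal((2, L)) / L              # acoustic paths: source -> left/right mic
f = 0.01 * rng.standard_normal((2, 2, L))        # f[j, i]: loudspeaker j -> ear i (weak FBPs)
b = f[[0, 1], [0, 1]].copy()                     # fixed AFC filters = the "direct" FBPs only
w = np.array([0.5, 0.5])                         # trivial BAF: average of both residuals
g, D = 4.0, 8                                    # forward gain and decorrelation delay

u = np.zeros((2, N + L))                         # loudspeaker samples, with zero pre-history
e = np.zeros((2, N))                             # AFC residuals
v = np.zeros(N)                                  # BAF output
for n in range(N):
    for i in range(2):                           # i = 0: left ear, 1: right ear
        x = h[i] @ s[n + 1:n + 1 + L][::-1]      # source component at ear i
        for j in range(2):                       # add feedback from both loudspeakers
            x += f[j, i] @ u[j, n + 1:n + 1 + L][::-1]
        e[i, n] = x - b[i] @ u[i, n + 1:n + 1 + L][::-1]   # bilateral AFC subtraction
    v[n] = w @ e[:, n]                           # binaural combination
    if n + D < N:                                # same delayed, amplified signal to both ears
        u[0, L + n + D] = u[1, L + n + D] = g * v[n]
```

With these fixed filters the "direct" feedback is cancelled exactly while the weak "cross" feedback remains in the residuals; an adaptive scheme converging to the constrained ideal solution derived in Section 3.1 would also absorb the cross paths.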

3. Interaction Effects on the Feedback Cancellation

The structure depicted in Figure 1 for binaural HAs mainly deviates from the well-known unilateral case by the presence of binaural spatial filtering. The binaural structure is characterized by a significantly more complex closed-loop system, possibly with multiple microphone inputs, but most importantly with two connected LS outputs, which considerably complicates the analysis of the system. However, we will see in the following how, under certain conditions, we can exploit the compact matrix notations introduced in the previous section to describe the behavior of the closed-loop system. We will draw some interesting conclusions on the present binaural system, emphasizing its deviation from the standard unilateral case in terms of the ideal cancellation solution, convergence of the AFC filters, and system stability.

3.1. The Ideal Binaural AFC Solution. In the unilateral and single-channel case, the adaptation of the (single) AFC filter tries to adjust the compensation signal (the filter output) to the (single-channel) acoustic feedback signal. Under ideal conditions, this approach guarantees perfect removal of the undesired feedback components and simultaneously prevents the occurrence of howling caused by system instabilities [11] (the stability of the binaural closed-loop system will be discussed in Section 3.3). The adaptation of the filter coefficients towards the desired solution is usually achieved using a gradient-descent-like learning rule, in its simplest form using the least mean square (LMS) algorithm [17]. The functioning of the AFC in the binaural configuration shown in Figure 1 is similar.

Figure 2: Equivalent signal model of the AFC-BAF combination under the assumption (40). (Compared to Figure 1, the linear operator cLR is placed in the right forward path.)

The residual signal vector (23) can be decomposed into its source, noise, and feedback components using (4):

$$\mathbf{e} = \mathbf{x}^s + \mathbf{x}^n + \underbrace{\mathbf{u}(\mathbf{F} - \mathbf{B})}_{\mathbf{e}^{FB}}, \qquad (38)$$

where $\mathbf{B}$ denotes an arbitrary (unconstrained) AFC system matrix (Section 2.3). $\mathbf{e}^{FB} = [\mathbf{e}^{FB}_L \;\; \mathbf{e}^{FB}_R] = [e^{FB}_{L_1}, \ldots, e^{FB}_{L_P}, e^{FB}_{R_1}, \ldots, e^{FB}_{R_P}]$ captures the z-domain representations of the residual feedback components to be removed by the AFC. The only way to perfectly remove the feedback components from the residual signals (i.e., $\mathbf{e}^{FB} = \mathbf{0}$), for arbitrary output signal vectors $\mathbf{u}$, is to have

$$\mathbf{B} = \mathbf{F} = \breve{\mathbf{B}}. \qquad (39)$$

$\breve{\mathbf{B}}$ denotes the ideal AFC solution in the unconstrained case. This is the binaural analogon to the ideal AFC solution in the unilateral case, where perfect cancellation is achieved by reproducing an exact replica of the acoustical FBP. In practice, this solution is however very difficult to reach adaptively because it requires the two signals uL and uR to be uncorrelated, which is obviously not fulfilled in our binaural HA scenario since the two HAs are connected (the correlation is actually highly desirable since the HAs should form a spatial image of the acoustic scene, which implies that the two LS signals must be correlated to reflect interaural time and level differences). This problem has been extensively described in the literature on multi-channel AEC, where it is referred to as the “non-uniqueness problem”. Several attempts have been reported in the literature to partly alleviate this issue (see, e.g., [18–20]). These techniques may be useful in the HA case also, but this is beyond the scope of the present work.

In this paper, instead of trying to solve the problem mentioned above, we explicitly account for the correlation of the two LS output signals. The relation between the HA outputs can be traced back to the relation existing between the BAF outputs vL and vR (Figure 1), which are generated from the same set of sensors and aim at reproducing a binaural impression of the same acoustical scene. The relation between vL and vR can be described by a linear operator cLR(z) transforming vL(z) into vR(z) such that:

$$v_R = v_L\, c_{LR} \quad \forall v_L, \qquad (40)$$

which is actually perfectly true if and only if cLR transforms $\mathbf{w}_L$ into $\mathbf{w}_R$:

$$\mathbf{w}_R = \mathbf{w}_L\, c_{LR}. \qquad (41)$$

Therefore, the assumption (40) will only be an approximation in general, except for a specific class of BAF systems satisfying (41). The BSS algorithm discussed in Section 4 belongs to this class. Figure 2 shows the equivalent signal model resulting from (40). As can be seen from the figure, cLR can be equivalently considered as being part of the right forward path to further simplify the analysis. Accordingly, we then define the new vector

$$\tilde{\mathbf{g}} = [\tilde{g}_L \;\; \tilde{g}_R] = [g_L \;\; c_{LR}\, g_R] \qquad (42)$$

jointly capturing cLR and the HA processing. Provided that gL and gR are linear, (41) (and hence (40)) is equivalent to assuming the existence of a linear dependency between the LS outputs, which we can express as follows:

$$\mathbf{u} = v_L\, \tilde{\mathbf{g}} = \frac{u_L}{\tilde{g}_L}\,\tilde{\mathbf{g}} = \frac{u_R}{\tilde{g}_R}\,\tilde{\mathbf{g}}. \qquad (43)$$



This assumption implies that only one filter (instead of two, one for each LS signal) suffices to cancel the feedback components in each sensor channel. It corresponds to the constraint (24) mentioned in Section 2.3, which forces the AFC system matrix $\mathbf{B}$ to be block-diagonal ($\mathbf{B} \stackrel{!}{=} \mathbf{B}^c$). The required number of AFC filters reduces accordingly from 2 × 2P to 2P.

Using the constraint (24) and the assumption (43) in (38), we can derive the constrained ideal AFC solution minimizing $\mathbf{e}^{FB}_I$, $I \in \{L, R\}$, considering each side separately:

$$\mathbf{e}^{FB}_I = \mathbf{u}\,\mathbf{F}_I - u_I\,\mathbf{b}_I = \frac{u_I}{\tilde{g}_I}\,\tilde{\mathbf{g}}\,\mathbf{F}_I - u_I\,\mathbf{b}_I = u_I\Big[\underbrace{\tilde{\mathbf{g}}\,\mathbf{F}_I\,\tilde{g}_I^{-1}}_{\breve{\mathbf{b}}_I} - \mathbf{b}_I\Big], \quad I \in \{L, R\}. \qquad (44)$$

Here, $\breve{\mathbf{b}}_I$ denotes the ideal AFC solution for the left or right HA. It can easily be verified that inserting (44) into (23) leads to the following residual signal decomposition:

$$\mathbf{e} = \mathbf{x}^s + \mathbf{x}^n + \underbrace{\mathbf{u}\big(\breve{\mathbf{B}}^c - \mathbf{B}^c\big)}_{\mathbf{e}^{FB}}, \qquad (45)$$

where

$$\breve{\mathbf{B}}^c = \mathrm{Bdiag}\big\{\breve{\mathbf{b}}_L, \breve{\mathbf{b}}_R\big\} \qquad (46)$$

denotes the ideal AFC solution when $\mathbf{B}$ is constrained to be block-diagonal ($\mathbf{B} \stackrel{!}{=} \mathbf{B}^c$) and under the assumption (43). The Bdiag{·} operator is the block-wise counterpart of the Diag{·} operator. Applied to a list of vectors, it builds a block-diagonal matrix with the listed vectors placed on the main diagonal of the block-matrix, respectively.

To illustrate these results, we expand the ideal AFC solution (46) using (15) and (18):

$$\breve{\mathbf{b}}_L = \big(\tilde{g}_L\,\mathbf{f}_{LL} + \tilde{g}_R\,\mathbf{f}_{RL}\big)\,\tilde{g}_L^{-1} = \underbrace{\mathbf{f}_{LL}}_{\text{direct}} + \underbrace{\tilde{g}_R/\tilde{g}_L\,\mathbf{f}_{RL}}_{\text{cross}},$$
$$\breve{\mathbf{b}}_R = \big(\tilde{g}_R\,\mathbf{f}_{RR} + \tilde{g}_L\,\mathbf{f}_{LR}\big)\,\tilde{g}_R^{-1} = \underbrace{\mathbf{f}_{RR}}_{\text{direct}} + \underbrace{\tilde{g}_L/\tilde{g}_R\,\mathbf{f}_{LR}}_{\text{cross}}. \qquad (47)$$

For each filter, we can clearly identify two terms due to, respectively, the “direct” and “cross” FBPs (see Section 2.2). Contrary to the “direct” terms, the “cross” terms are identifiable only under the assumption (43) that the LS outputs are linearly dependent. Should this assumption not hold because of, for example, some non-linearities in the forward paths, the “cross” FBPs would not be completely identifiable. The feedback signals propagating from one ear to the other would then act as a disturbance to the AFC adaptation process. Note, however, that since the amplitude of the “cross” FBPs is negligible compared to the amplitude of the “direct” FBPs (Section 2.2), the consequences would be very limited as long as the HA gains are set to similar amplification levels, as can be seen from (47). It should also be noted that the forward path generally includes some (small) decorrelation delays DL and DR to help the AFC filters to converge to their desired solution (see Section 3.2). If those delays are set differently for each ear, causality of the “cross” terms in (47) will not always be guaranteed, in which case the ideal solution will not be achievable with the present scheme. This situation can be easily avoided by either setting the decorrelation delays DL = DR equal for each ear (which appears to be the most reasonable choice to avoid artificial interaural time differences), or by delaying the LS signals (but using the non-delayed signals as AFC filter inputs). However, since it would further increase the overall delay from the microphone inputs to the LS outputs, the latter choice appears unattractive in the HA scenario.
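A small numerical sketch of (47) (illustrative values, not the measured FBPs of Section 5.3): the constrained ideal AFC response is the "direct" FBP plus the "cross" FBP weighted by the ratio of the combined forward-path gains, here assumed to be scalar and frequency-flat.

```python
# Sketch of the constrained ideal AFC solution (47) in the DFT domain for the
# left ear, with scalar hearing-aid gains and random FIR stand-ins for the
# feedback paths. With c_LR = 1, the gain ratio reduces to g_R / g_L.
import numpy as np

rng = np.random.default_rng(2)
L_fbp, nfft = 64, 256
f_LL = 0.05 * rng.standard_normal(L_fbp)    # "direct" FBP, left LS -> left mic
f_RL = 0.005 * rng.standard_normal(L_fbp)   # "cross" FBP, right LS -> left mic
g_L, g_R = 10.0, 10.0                       # flat, linear HA gains (assumption)

B_L_ideal = np.fft.rfft(f_LL, nfft) + (g_R / g_L) * np.fft.rfft(f_RL, nfft)  # eq. (47)
b_L_ideal = np.fft.irfft(B_L_ideal, nfft)[:L_fbp]   # equivalent FIR filter

# With equal gains the cross term enters with weight one; with strongly
# unequal gains (g_R >> g_L) the cross FBP dominates the required response.
```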

3.2. The Binaural Wiener AFC Solution. In the configuration depicted in Figure 2, similar to the standard unilateral case (see, e.g., [12]), conventional gradient-descent-based learning rules do not lead to the ideal solution discussed in Section 3.1 but to the so-called Wiener solution [17]. Actually, instead of minimizing the feedback components $\mathbf{e}^{FB}$ in the residual signals, the AFC filters are optimized by minimizing the mean-squared error of the overall residual signals (38).

In the following, we therefore conduct a convergence analysis of the binaural system depicted in Figure 2, by deriving the Wiener solution of the system in the frequency domain:

$$\mathbf{b}^{\mathrm{Wiener}}_I\big(z = e^{j\omega}\big) = \mathbf{r}_{\mathbf{x}_I u_I}\big(e^{j\omega}\big)\, r^{-1}_{u_I u_I}\big(e^{j\omega}\big) = \big(\mathbf{r}_{\mathbf{u} u_I}\,\mathbf{F}_I + \mathbf{r}_{\mathbf{x}^s_I u_I} + \mathbf{r}_{\mathbf{x}^n_I u_I}\big)\, r^{-1}_{u_I u_I} \qquad (48)$$
$$= \underbrace{\tilde{\mathbf{g}}\,\mathbf{F}_I\,\tilde{g}_I^{-1}}_{\breve{\mathbf{b}}_I(z = e^{j\omega})} + \underbrace{\mathbf{r}_{\mathbf{x}^s_I u_I}\, r^{-1}_{u_I u_I} + \mathbf{r}_{\mathbf{x}^n_I u_I}\, r^{-1}_{u_I u_I}}_{\tilde{\mathbf{b}}_I(z = e^{j\omega})}, \quad I \in \{L, R\}, \qquad (49)$$

where the frequency dependency $(e^{j\omega})$ was omitted in (48) and (49) for the sake of simplicity, as in the rest of this section. $\breve{\mathbf{b}}_I(z = e^{j\omega})$ is recognized as the (frequency-domain) ideal AFC solution discussed in Section 3.1, and $\tilde{\mathbf{b}}_I(z = e^{j\omega})$ denotes a (frequency-domain) bias term. The assumption (43) has been exploited in (48) to obtain the above final result. $r_{u_I u_I}$ represents the (auto-) power spectral density of $u_I$, $I \in \{L, R\}$, and $\mathbf{r}_{\mathbf{x}_I u_I} = [r_{x_{I_1} u_I}, \ldots, r_{x_{I_P} u_I}]$, $I \in \{L, R\}$, is a vector capturing cross-power spectral densities. The cross-power spectral density vectors $\mathbf{r}_{\mathbf{x}^s_I u_I}$ and $\mathbf{r}_{\mathbf{x}^n_I u_I}$ are defined in a similar way.

The Wiener solution (49) shows that the optimal solution is biased due to the correlation of the different source contributions $\mathbf{x}^s$ and $\mathbf{x}^n$ with the reference inputs $u_I$, $I \in \{L, R\}$ (i.e., the LS outputs), of the AFC filters. The bias term $\tilde{\mathbf{b}}_I$ in (49) can be further decomposed as in (20), distinguishing between desired (target source) and undesired (interfering point sources and diffuse noise) sound sources:

$$\tilde{\mathbf{b}}_I\big(e^{j\omega}\big) = \underbrace{\mathbf{r}_{\mathbf{x}^{s_{\mathrm{tar}}}_I u_I}\, r^{-1}_{u_I u_I}}_{\text{due to target source}} + \underbrace{\mathbf{r}_{\mathbf{x}^{s_{\mathrm{int}}}_I u_I}\, r^{-1}_{u_I u_I} + \mathbf{r}_{\mathbf{x}^n_I u_I}\, r^{-1}_{u_I u_I}}_{\text{due to undesired sources}}, \quad I \in \{L, R\}. \qquad (50)$$

By nature, the spatially uncorrelated diffuse noise components $\mathbf{x}^n$ will be only weakly correlated with the LS outputs. The third bias term will therefore have only a limited impact on the convergence of the AFC filters. The diffuse noise sources will mainly act as a disturbance. Depending on the signal enhancement technique used, they might even be partly removed. But above all, the (multi-channel) BAF performs spatial filtering, which mainly affects the interfering point sources. Ideally, the interfering sources may even vanish from the LS outputs, in which case the second bias term would simply disappear. In practice, the interference sources will never be completely removed. Hence the amount of bias introduced by the interfering sources will largely depend on the interference rejection performance of the BAF. However, as in unilateral hearing aids, the main source of estimation errors comes from the target source. Actually, since the BAF aims at producing outputs which are as close as possible to the original target source signal, the first bias term due to the (spectrally colored) target source will be much more problematic.

One simple way to reduce the correlation between the target source and the LS outputs is to insert some delays DL and DR in the forward paths [12]. The benefit of this method is, however, very limited in the HA scenario, where only tiny processing delays (5 to 10 ms for moderate hearing losses) are allowed to avoid noticeable effects due to unprocessed signals leaking into the ear canal and interfering with the processed signals. Other more complicated approaches applying a prewhitening of the AFC inputs have been proposed for the unilateral case [21, 22], which could also help in the binaural case. We may also recall a well-known result from the feedback cancellation literature: the bias of the AFC solution decreases when the HA gain increases, that is, when the signal-to-feedback ratio (SFR) at the AFC inputs (the microphones) decreases. This statement also applies to the binaural case. This can easily be seen from (50), where the inverse auto-power spectral density $r^{-1}_{u_I u_I}$ decreases quadratically whereas the cross-power spectral densities increase only linearly with increasing LS signal levels.

Note that the above derivation of the Wiener solution has been performed under the assumption (43) that the LS outputs are linearly dependent. When this assumption does not hold, an additional term appears in the Wiener solution. We may illustrate this exemplarily for the left side, starting from (48):

$$\mathbf{b}^{\mathrm{Wiener}}_L\big(e^{j\omega}\big) = \underbrace{\mathbf{f}_{LL} + r_{u_R u_L}\, r^{-1}_{u_L u_L}\,\mathbf{f}_{RL}}_{\text{desired solution}} + \underbrace{\mathbf{r}_{\mathbf{x}^s_L u_L}\, r^{-1}_{u_L u_L} + \mathbf{r}_{\mathbf{x}^n_L u_L}\, r^{-1}_{u_L u_L}}_{\text{bias}}. \qquad (51)$$

The bias term is identical to the one already obtained in (50), while the desired term is now split into two parts. The first one is related to the “direct” FBPs. The second term involves the “cross” FBPs and shows that gradient-based optimization algorithms will try to exploit the correlation of the LS outputs (when existing) to remove the feedback signal components traveling from one ear to the other. In the extreme case that the two LS signals are totally decorrelated (i.e., $r_{u_R u_L} = 0$), this term disappears and the “cross” feedback signals cannot be compensated. Note, however, that this would only have a very limited impact as long as the HA gains are set to similar amplification levels, as we saw in Section 3.1.
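The frequency-domain Wiener solution (48) can be estimated from signals via Welch periodograms, as in the following sketch. The signals are synthetic and the setup is open-loop, so the external component is uncorrelated with the loudspeaker signal and the bias terms of (49)–(50) do not appear; in the real closed loop they would.

```python
# Sketch: estimate the Wiener AFC response of (48) as the cross-PSD between the
# loudspeaker signal and the microphone signal, divided by the loudspeaker
# auto-PSD. All signals and filter values below are stand-ins.
import numpy as np
from scipy.signal import csd, welch, lfilter

rng = np.random.default_rng(3)
fs, N = 16000, 8 * 16000
u_L = rng.standard_normal(N)                 # left loudspeaker signal
f_LL = 0.05 * rng.standard_normal(64)        # "true" direct FBP (unknown in practice)
x_L = lfilter(f_LL, 1.0, u_L) + 0.5 * rng.standard_normal(N)   # feedback + external signal

freqs, r_ux = csd(u_L, x_L, fs=fs, nperseg=1024)   # cross-PSD between u_L and x_L
_, r_uu = welch(u_L, fs=fs, nperseg=1024)          # auto-PSD of u_L
b_wiener = r_ux / r_uu                             # frequency-response estimate of b_L

# Here the external component is uncorrelated with u_L, so b_wiener stays close
# to the FBP; in the closed loop the target source is correlated with u_L and
# introduces the bias described by (49)-(50).
```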

3.3. The Binaural Stability Condition. In this section, we formulate the stability condition of the binaural closed-loop system, starting from the general case before applying the block-diagonal constraint (24). We first need to express the responses uL and uR of the binaural system (Figure 1) at the left and right side, respectively, to an external excitation $\mathbf{x}^s + \mathbf{x}^n$. This can be done in the z-domain as follows:

$$u_L = \big[\mathbf{x}^s + \mathbf{x}^n + \mathbf{u}(\mathbf{F} - \mathbf{B})\big]\mathbf{w}_L^T g_L = \underbrace{(\mathbf{x}^s + \mathbf{x}^n)\,\mathbf{w}_L^T g_L}_{\bar{u}_L} + u_L\underbrace{(\mathbf{F}_{L:} - \mathbf{B}_{L:})\,\mathbf{w}_L^T g_L}_{k_{LL}} + u_R\underbrace{(\mathbf{F}_{R:} - \mathbf{B}_{R:})\,\mathbf{w}_L^T g_L}_{k_{RL}} = \frac{\bar{u}_L + u_R\, k_{RL}}{1 - k_{LL}}, \qquad (52)$$

$$u_R = \big[\mathbf{x}^s + \mathbf{x}^n + \mathbf{u}(\mathbf{F} - \mathbf{B})\big]\mathbf{w}_R^T g_R = \underbrace{(\mathbf{x}^s + \mathbf{x}^n)\,\mathbf{w}_R^T g_R}_{\bar{u}_R} + u_L\underbrace{(\mathbf{F}_{L:} - \mathbf{B}_{L:})\,\mathbf{w}_R^T g_R}_{k_{LR}} + u_R\underbrace{(\mathbf{F}_{R:} - \mathbf{B}_{R:})\,\mathbf{w}_R^T g_R}_{k_{RR}} = \frac{\bar{u}_R + u_L\, k_{LR}}{1 - k_{RR}}, \qquad (53)$$

where $\mathbf{F}_{L:}$ and $\mathbf{B}_{L:}$ denote the first row of $\mathbf{F}$ and $\mathbf{B}$, respectively, that is, the transfer functions applied to the left LS signal. $\mathbf{F}_{R:}$ and $\mathbf{B}_{R:}$ denote the second row of $\mathbf{F}$ and $\mathbf{B}$, respectively, that is, the transfer functions applied to the right LS signal. $\bar{u}_L$ and $\bar{u}_R$ represent the z-domain representations of the ideal system responses, once the feedback signals have been completely removed:

$$\bar{\mathbf{u}} = [\bar{u}_L \;\; \bar{u}_R] = (\mathbf{x}^s + \mathbf{x}^n)\,\mathbf{W}\,\mathrm{Diag}\{\mathbf{g}\}. \qquad (54)$$

$k_{LL}$, $k_{RL}$, $k_{LR}$, and $k_{RR}$ can be interpreted as the open-loop transfer functions (OLTFs) of the system. They can be seen as the entries of the OLTF matrix $\mathbf{K}$ defined as follows:

$$\mathbf{K} = \begin{bmatrix} k_{LL} & k_{LR} \\ k_{RL} & k_{RR} \end{bmatrix} = (\mathbf{F} - \mathbf{B})\,\mathbf{W}\,\mathrm{Diag}\{\mathbf{g}\}. \qquad (55)$$



Combining (52) and (53) finally yields the relations:

$$u_L = \frac{(1 - k_{RR})\,\bar{u}_L + k_{RL}\,\bar{u}_R}{1 - k}, \qquad u_R = \frac{(1 - k_{LL})\,\bar{u}_R + k_{LR}\,\bar{u}_L}{1 - k}, \qquad (56)$$

with

$$k = k_{LL} + k_{RR} + k_{LR}\,k_{RL} - k_{LL}\,k_{RR} = \mathrm{tr}\{\mathbf{K}\} - \det\{\mathbf{K}\}, \qquad (57)$$

where the operators tr{·} and det{·} denote the trace and determinant of a matrix, respectively.

Similar to the unilateral case [11], (56) indicates that the binaural closed-loop system is stable as long as the magnitude of $k(z = e^{j\omega})$ does not exceed one for any angular frequency ω:

$$\big|k\big(z = e^{j\omega}\big)\big| < 1, \quad \forall \omega. \qquad (58)$$

Here, the phase condition has been ignored, as is usual in the literature on AFC [14]. Note that the function k in (57), and hence the stability of the binaural system, depends on the current state of the BAF filters.

The above derivations are valid in the general case. No particular assumption has been made and the AFC system has not been constrained to be block-diagonal. In the following, we will consider the class of algorithms satisfying the assumption (41), implying that the two BAF outputs are linearly dependent. In this case, the ideal system output vector (54) becomes

$$\bar{\mathbf{u}} = (\mathbf{x}^s + \mathbf{x}^n)\,\mathbf{w}_L^T\,\tilde{\mathbf{g}}. \qquad (59)$$

Furthermore, it can easily be verified that the following relations are satisfied in this case:

$$k_{RL}\,\bar{u}_R = k_{RR}\,\bar{u}_L, \qquad (60)$$
$$k_{LR}\,\bar{u}_L = k_{LL}\,\bar{u}_R, \qquad (61)$$
$$\det\{\mathbf{K}\} = 0. \qquad (62)$$

The closed-loop response (56) of the binaural system therefore simplifies in this case to

$$\mathbf{u} = \frac{1}{1 - k}\,\bar{\mathbf{u}}, \qquad (63)$$

where k, defined in (57), reduces to

$$k = \mathrm{tr}\{\mathbf{K}\}. \qquad (64)$$

Finally, when additionally applying the block-diagonal constraint (24) on the AFC system, (64) further simplifies to

$$k = \tilde{\mathbf{g}}\,\big(\breve{\mathbf{B}}^c - \mathbf{B}^c\big)\,\mathbf{w}_L^T. \qquad (65)$$

The stability condition (58) formulated on k for the general case still applies here.

The above results show that in the unconstrained (constrained, resp.) case, when the AFC filters reach their ideal solution $\mathbf{B} = \mathbf{F}$ ($\mathbf{B}^c = \breve{\mathbf{B}}^c$, resp.), the function k in (57) ((65), resp.) is equal to zero. Hence the stability condition (58) is always fulfilled, regardless of the HA amplification levels used, and the LS outputs become ideal, with $\mathbf{u} = \bar{\mathbf{u}}$ as expected.
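The magnitude condition (58) can be checked numerically on a DFT grid, as in the following sketch for P = 1. All frequency responses are random stand-ins, so the condition may or may not be satisfied for these particular values; the point is only the per-bin computation of k via (55) and (57).

```python
# Sketch of the binaural stability check (55)-(58) for P = 1: build the 2x2
# open-loop matrix K(e^{jw}) per frequency bin and test |tr K - det K| < 1.
import numpy as np

rng = np.random.default_rng(4)
nfft, L = 512, 64
def rfft_filt(scale):                       # random FIR -> frequency response
    return np.fft.rfft(scale * rng.standard_normal(L), nfft)

F = np.array([[rfft_filt(0.05), rfft_filt(0.005)],    # acoustic FBPs (stand-ins)
              [rfft_filt(0.005), rfft_filt(0.05)]])
B = np.zeros_like(F)                                  # block-diagonal AFC filters
B[0, 0], B[1, 1] = rfft_filt(0.04), rfft_filt(0.04)
W = np.array([[rfft_filt(0.1), rfft_filt(0.1)],       # BAF filters (stand-ins)
              [rfft_filt(0.1), rfft_filt(0.1)]])
g = np.array([20.0, 20.0])                            # frequency-flat HA gains

nbins = F.shape[-1]
k = np.empty(nbins, dtype=complex)
for m in range(nbins):                                # per-bin 2x2 matrix algebra
    K = (F[:, :, m] - B[:, :, m]) @ W[:, :, m] @ np.diag(g)
    k[m] = np.trace(K) - np.linalg.det(K)             # eq. (57)
stable = np.all(np.abs(k) < 1.0)                      # magnitude condition (58)
margin_db = -20 * np.log10(np.max(np.abs(k)))         # stability margin in dB
print(stable, margin_db)
```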

4. Interaction Effects on the Binaural Adaptive Filtering

The presence of feedback in the microphone signals is usually not taken into account when developing signal enhancement techniques for hearing aids. In this section, we consider the configuration depicted in Figure 1 and focus exemplarily on BSS techniques as possible candidates to implement the BAF, thereby analyzing the impact of feedback on BSS and discussing possible interaction effects with an AFC algorithm.

4.1. Overview on Blind Source Separation. The aim of blind source separation is to recover the original source signals from an observed set of signal mixtures. The term “blind” implies that the mixing process and the original source signals are unknown. In acoustical scenarios, like in the hearing-aid application, the source signals are mixed in a convolutive manner. The (convolutive) acoustical mixing system can be modeled as a MIMO system H of FIR filters (see Section 2.2). The case where the number Q of (simultaneously active) sources is equal to the number 2 × P of microphones (assuming P channels for each ear (see Section 2.2)) is referred to as the determined case. The case where Q < 2 × P is called overdetermined, while Q > 2 × P is denoted as underdetermined.

The underdetermined BSS problem can be handled based on time-frequency masking techniques, which rely on the sparseness of the sound sources (see, e.g., [23, 24]). In this paper, we assume that the number of sources does not exceed the number of microphones. Separation can then be performed using independent component analysis (ICA) methods, merely under the assumption of statistical independence of the original source signals [25]. ICA achieves separation by applying a demixing MIMO system A of FIR filters on the microphone signals, hence providing an estimate of each source at the outputs of the demixing system. This is achieved by adapting the weights of the demixing filters to force the output signals to become statistically independent. Because the adaptation criterion exploits the independence of the sources, a distinction between desired and undesired sources is unnecessary. Adaptation of the BSS filters is therefore possible even when all sources are simultaneously active, in contrast to more conventional techniques based on Wiener filtering [8] or adaptive beamforming [26].

One way to solve the BSS problem is to transform the mixtures to the frequency domain using the discrete Fourier transform (DFT) and apply ICA techniques in each DFT bin independently (see, e.g., [27, 28]). This approach is referred to as the narrowband approach, in contrast with broadband approaches which process all frequency bins simultaneously. Narrowband approaches are conceptually simpler but they suffer from a permutation and scaling ambiguity in each frequency bin, which must be tackled by additional heuristic mechanisms. Note however that, to solve the permutation problem, information on the sensor positions is usually required and free-field sound wave propagation is assumed (see, e.g., [29, 30]). Unfortunately, in the binaural HA application, the distance between the microphones on each side of the head will generally not be known exactly and head shadowing effects will cause a disturbance of the wavefront. In this paper, we consider a broadband ICA approach [31, 32] based on the TRINICON framework [33]. Separation is performed exploiting second-order statistics, under the assumption that the (mutually independent) source signals are non-white and non-stationary (like speech). Since this broadband approach does not rely on accurate knowledge of the sensor placement, it is robust against unknown microphone array deformations or disturbance of the wavefront. It has already been used for binaural HAs in [10, 34].

Since BSS allows the reconstruction of the original source signals only up to an unknown permutation, we cannot know a priori which output contains the target source. Here, it is assumed that the target source is located approximately in front of the HA user, which is a standard assumption in state-of-the-art HAs. Based on the approach presented in [35], the output containing the most frontal source is then selected after estimating the time-difference-of-arrival (TDOA) of each separated source. This is done by exploiting the ability of the broadband BSS algorithm [31, 32] to perform blind system identification of the acoustical mixing system. Figure 3 illustrates the resulting AFC-BSS combination. Note that the BSS algorithm can be embedded into the general binaural configuration depicted in Figure 1, with the BAF filters $\mathbf{w}_L$ and $\mathbf{w}_R$ set identically to the BSS filters producing the selected (monaural) BSS output:

$$\mathbf{w}_L = \mathbf{w}_R = [\mathbf{a}_{LL} \;\; \mathbf{a}_{RL}] \quad \text{if the left output is selected}, \qquad (66)$$
$$\mathbf{w}_L = \mathbf{w}_R = [\mathbf{a}_{LR} \;\; \mathbf{a}_{RR}] \quad \text{if the right output is selected}. \qquad (67)$$

The BSS algorithm satisfies, therefore, the assumption (41), and the AFC-BSS combination can be equivalently described by Figure 2, with cLR = 1. In the following, v = vL = vR refers to the selected BSS output presented (after amplification in the forward paths) to the HA user at both ears, and w = wL = wR denotes the transfer functions of the selected BSS filters (common to both LS outputs). Note finally that post-processing filters may be used to recover spatial cues [10]. They can be modelled as being part of the forward paths gL and gR.
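The output selection of (66)–(67) amounts to picking the demixing filter pair whose separated source is most frontal. A minimal sketch, with the TDOA estimation itself omitted and all names hypothetical, could look as follows.

```python
# Sketch of the output selection (66)-(67): given the 2x2 BSS demixing filters
# a_{JI} and a TDOA estimate for the source in each BSS output, both BAF filter
# sets are pointed at the output whose source is closest to the front
# (TDOA closest to zero). How the TDOAs are obtained from the blindly
# identified mixing system is not shown here.
import numpy as np

def select_bss_output(a_LL, a_RL, a_LR, a_RR, tdoa_left_out, tdoa_right_out):
    """Return (w_L, w_R); each a_* is a 1-D FIR filter, TDOAs are in seconds."""
    if abs(tdoa_left_out) <= abs(tdoa_right_out):   # left output is more frontal
        w = np.stack([a_LL, a_RL])                  # eq. (66)
    else:                                           # right output is more frontal
        w = np.stack([a_LR, a_RR])                  # eq. (67)
    return w, w                                     # w_L = w_R (c_LR = 1)
```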

4.2. Discussion. In the HA scenario, since the LS output signals feed back into the microphones, the closed-loop system formed by the HAs participates in the source mixing process, together with the acoustical mixing system. Therefore, the BSS inputs result from a mixture of the external sources and the feedback signals coming from the loudspeakers. But because of the closed-loop system bringing the HA inputs to the two LS outputs, the feedback signals are correlated with the original external source signals. To understand the impact of feedback on the separation performance of a BSS algorithm, we describe below the overall mixing process.

The closed-loop transfer function from the external sources (the point sources and the diffuse noise sources) to the BSS inputs (i.e., the residual signals after AFC) can be expressed in the z-domain by inserting (59) and (63) into (45):

$$\mathbf{e} = (\mathbf{x}^s + \mathbf{x}^n) + \frac{1}{1 - k}\,(\mathbf{x}^s + \mathbf{x}^n)\,\mathbf{w}^T\tilde{\mathbf{g}}\,\big(\breve{\mathbf{B}}^c - \mathbf{B}^c\big) = \underbrace{\mathbf{s}\Big[\mathbf{H} + \frac{1}{1 - k}\,\mathbf{H}\,\mathbf{w}^T\tilde{\mathbf{g}}\,\big(\breve{\mathbf{B}}^c - \mathbf{B}^c\big)\Big]}_{\mathbf{e}^s} + \underbrace{\mathbf{x}^n\Big[\mathbf{I} + \frac{1}{1 - k}\,\mathbf{w}^T\tilde{\mathbf{g}}\,\big(\breve{\mathbf{B}}^c - \mathbf{B}^c\big)\Big]}_{\mathbf{e}^n}, \qquad (68)$$

where $\mathbf{B}^c$ and $\breve{\mathbf{B}}^c$ refer to the AFC system and its ideal solution (46), respectively, under the block-diagonal constraint (24). k characterizes the stability of the binaural closed-loop system and is defined by (65). From (68), we can identify two independent components $\mathbf{e}^s$ and $\mathbf{e}^n$ present in the BSS inputs and originating from the external point sources and from the diffuse noise, respectively. As mentioned in Section 4.1, the BSS algorithm allows the separation of point sources, additional diffuse noise having only a limited impact on the separation performance [32]. We therefore concentrate on the first term in (68):

$$\mathbf{e}^s = \mathbf{s}\,\mathbf{H} + \mathbf{s}\,\underbrace{\frac{1}{1 - k}\,\mathbf{H}\,\mathbf{w}^T\tilde{\mathbf{g}}\,\big(\breve{\mathbf{B}}^c - \mathbf{B}^c\big)}_{\widetilde{\mathbf{H}}}, \qquad (69)$$

which produces an additional mixing system $\widetilde{\mathbf{H}}$ introduced by the acoustical feedback (and the required AFC filters). Ideally, the BSS filters should converge to a solution which minimizes the contribution $v^{s_{\mathrm{int}}}$ of the interfering point sources $\mathbf{s}_{\mathrm{int}}$ at the BSS output v, that is,

$$v^{s_{\mathrm{int}}} = \underbrace{\mathbf{s}_{\mathrm{int}}\,\mathbf{H}_{\mathrm{int}}\,\mathbf{w}^T}_{\text{acoustical mixing}} + \underbrace{\mathbf{s}_{\mathrm{int}}\,\widetilde{\mathbf{H}}_{\mathrm{int}}\,\mathbf{w}^T}_{\text{feedback loop}} \stackrel{!}{=} 0. \qquad (70)$$

$\mathbf{H}_{\mathrm{int}}$ refers to the acoustical mixing of the interfering sources $\mathbf{s}_{\mathrm{int}}$, as defined in Section 2.2. $\widetilde{\mathbf{H}}_{\mathrm{int}}$ can be defined in a similar way and describes the mixing of the interfering sources introduced by the feedback loop.

In the absence of feedback (and of AFC filters), the second term in (70) disappears and BSS can extract the target source by unraveling the acoustical mixing system $\mathbf{H}$, which is the desired solution. Note that this solution also allows the position of each source to be estimated, which is necessary to select the output of interest, as discussed in Section 4.1. However, when strong feedback signal components are present at the BSS inputs, the BSS solution becomes biased since the algorithm will try to unravel the feedback loop $\widetilde{\mathbf{H}}$ instead of targeting the acoustical mixing system $\mathbf{H}$ only. The importance of the bias depends on the magnitude response of the filters captured by $\widetilde{\mathbf{H}}$ in (70), relative to the magnitude response of the filters captured by $\mathbf{H}$. Contrary to the AFC bias encountered in Section 3.2, the BSS bias therefore decreases with increasing SFR.

Figure 3: Signal model of the AFC-BSS combination. (Compared to Figure 1, the binaural adaptive filtering block is realized by blind source separation followed by a TDOA-based output selection producing the single output v.)

The above discussion concerning BSS algorithms can be generalized to any signal enhancement technique involving adaptive filters. The presence of feedback at the algorithm's inputs will always cause some adaptation problems. Fortunately, placing an AFC in front of the BAF as in Figure 1 can help increase the SFR at the BAF inputs. In particular, when the AFC filters reach their ideal solution (i.e., $\mathbf{B}^c = \breve{\mathbf{B}}^c$), then $\widetilde{\mathbf{H}}$ becomes zero and the bias term due to the feedback loop in (70) disappears, regardless of the amount of sound amplification applied in the forward paths.

5. Evaluation Setup

To validate the theoretical analysis conducted in Sections 3 and 4, the binaural configuration depicted in Figure 3 was experimentally evaluated for the combination of a feedback canceler and the blind source separation algorithm introduced in Section 4.1.

5.1. Algorithms. The BSS processing was performed using a two-channel version of the algorithm introduced in Section 4.1, picking up the front microphone at each ear (i.e., P = 1). Four adaptive BSS filters needed to be computed at each adaptation step. The output containing the target source (the most frontal one) was selected based on BSS-internal source localization (see Section 4.1, and [35]). To obtain meaningful results which are, as far as possible, independent of the AFC implementation used, the AFC filter update was performed based on the frequency-domain adaptive filtering (FDAF) algorithm [36]. The FDAF algorithm allows for an individual step-size control for each DFT bin and a bin-wise optimum control mechanism of the step-size parameter, derived from [13, 37]. In practice, this optimum step-size control mechanism is inappropriate since it requires the knowledge of signals which are not available under real conditions, but it allows us to minimize the impact of a particular AFC implementation by providing useful information on the achievable AFC performance. Since we used two microphones, the (block-diagonal constrained) AFC consisted of two adaptive filters (see Figure 3).

Finally, to avoid other sources of interaction effects and concentrate on the AFC-BSS combination, we considered a simple linear time-invariant, frequency-independent hearing-aid processing in the forward paths (i.e., gL(z) = gL and gR(z) = gR). Furthermore, in all the results presented in Section 6, the same HA gains gL = gR = g and decorrelation delays (see Section 3.2) DL = DR = D were applied at both ears. The selected BSS output was therefore amplified by a factor g, delayed by D, and played back at the two LS outputs.

5.2. Performance Measures. We saw in the previous sections that our binaural configuration significantly differs from what can usually be found in the literature on unilateral HAs. To be able to objectively evaluate the algorithms' performance in this context, especially concerning the AFC, we need to adapt some of the already existing and commonly used performance measures to the new binaural configuration. This issue is discussed in the following, based on the outcomes of the theoretical analysis presented in Sections 3 and 4.



Figure 4: Illustration of the stability margin. (The plot shows $20\log_{10}|k(e^{j2\pi f})|$ in dB versus frequency f in kHz; the stability margin is the distance to 0 dB.)

5.2.1. Feedback Cancellation Performance Measures. In the conventional unilateral case, the feedback cancellation performance is usually measured in terms of misalignment between the (single) FBP estimate and the true (single) FBP (which is the ideal solution in the unilateral case), as well as in terms of Added Stable Gain (ASG) reflecting the benefit of AFC for the user [14].

In the binaural configuration considered in this study, the misalignment should measure the mismatch between each AFC filter and its corresponding ideal solution. This can be computed in the frequency domain as follows:

$$\Delta b_{I_p} = 10\log_{10}\frac{\sum_{\omega}\big|b_{I_p}(e^{j\omega}) - \breve{b}_{I_p}(e^{j\omega})\big|^{2}}{\sum_{\omega}\big|\breve{b}_{I_p}(e^{j\omega})\big|^{2}}, \quad I \in \{L, R\}. \qquad (71)$$

The ideal binaural AFC solution has been defined in (39) for the general case, and in (44) under the block-diagonal constraint (24) and assumption (43). In the results presented in Section 6, the misalignment has been averaged over all AFC filters (two filters in our case).
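A direct implementation of the misalignment measure (71) for a single AFC filter could look as follows (FFT length and the test filters are arbitrary choices, not values from the paper).

```python
# Sketch of the normalized frequency-domain misalignment (71) for one AFC
# filter, comparing its current coefficients with the corresponding ideal
# solution of (44).
import numpy as np

def misalignment_db(b_hat, b_ideal, nfft=512):
    """10 log10( sum_w |B_hat - B_ideal|^2 / sum_w |B_ideal|^2 )."""
    B_hat = np.fft.rfft(b_hat, nfft)
    B_ideal = np.fft.rfft(b_ideal, nfft)
    return 10 * np.log10(np.sum(np.abs(B_hat - B_ideal) ** 2)
                         / np.sum(np.abs(B_ideal) ** 2))

# Example: roughly -20 dB misalignment for a 10% coefficient perturbation.
rng = np.random.default_rng(5)
b_ideal = rng.standard_normal(64)
print(misalignment_db(b_ideal + 0.1 * rng.standard_normal(64), b_ideal))
```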

In general, it is not possible to calculate an ASG in the binaural case since the function $k(e^{j\omega})$ characterizing the stability of the system depends on both gains gL and gR (Section 3.3). It is however possible to calculate an added stability margin (ASM) measuring the additional gain margin (the distance of $20\log_{10}|k(e^{j\omega})|$ to 0 dB, see Figure 4) obtained by the AFC:

$$\mathrm{ASM} = 20\log_{10}\Big(\min_{\omega}\frac{1}{\big|k(e^{j\omega})\big|}\Big) - 20\log_{10}\Big(\min_{\omega}\frac{1}{\big|k(e^{j\omega})\big|_{\mathbf{B}=\mathbf{0}}}\Big), \qquad (72)$$

where k(z) has been defined in (57) and $|k(e^{j\omega})|_{\mathbf{B}=\mathbf{0}}$ is the initial magnitude of $k(z = e^{j\omega})$, without AFC. Since the assumption (41) is valid in our case (with cLR = 1) and since we force our AFC system to be block-diagonal, we can alternatively use the simplified expression of k given by (65). Note that the initial stability margin $20\log_{10}(\min_{\omega} 1/|k(e^{j\omega})|_{\mathbf{B}=\mathbf{0}})$, as well as the margin with AFC, $20\log_{10}(\min_{\omega} 1/|k(e^{j\omega})|)$, and hence the ASM, depend not only on the acoustical (binaural) FBPs, but also on the current state of the BAF filters. Also, when gL = gR = g, k becomes directly proportional to g and the ASM can be interpreted as an ASG.

Additionally, the SFR measured at the BSS and AFC inputs should be taken into account when assessing the AFC-BSS combination since it directly influences the performance of the algorithms. The SFR is defined in the following as the signal power ratio between the components coming from the external sources (without distinction between desired and interfering sources), and the components coming from the loudspeakers (i.e., the feedback signals).
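Given the open-loop responses $k(e^{j\omega})$ evaluated with and without the AFC filters (e.g., via (65), and with B = 0, respectively), the ASM of (72) is a short computation; the sketch below assumes both responses are already available as arrays on the same frequency grid.

```python
# Sketch of the added stability margin (72): the gain margin with the current
# AFC filters minus the gain margin without AFC (B = 0).
import numpy as np

def added_stability_margin_db(k_with_afc, k_without_afc):
    margin_with = 20 * np.log10(np.min(1.0 / np.abs(k_with_afc)))
    margin_without = 20 * np.log10(np.min(1.0 / np.abs(k_without_afc)))
    return margin_with - margin_without
```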

5.2.2. Signal Enhancement Performance Measures. The separation performance of the BSS algorithm is evaluated in terms of signal-to-interference ratio (SIR), that is, the signal power ratio between the components coming from the target source and the components coming from the interfering source(s). Although the feedback components $\mathbf{x}^u$ and the AFC filter outputs $\mathbf{y}$ (i.e., the compensation signals) contain some signal coming from the external sources $\mathbf{s}$ (which causes a bias of the BSS solution, as discussed in Section 4), we will ignore them in the SIR calculation since these components are undesired. An SIR gain can then be obtained as the difference between the SIR at the BSS inputs and the SIR at the BSS outputs. It reflects the ability of BSS to extract the desired components from the signal mixture $\mathbf{x}^s$, regardless of the amount of feedback (or background noise) present. Since only one BSS output is presented to the HA user (Section 4.1), we average the input SIR over all BSS input channels (here two), but we consider only the selected BSS output for the output SIR calculations.
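In a simulation where the target and interferer contributions of every signal are available separately (as they are when the mixtures are generated synthetically), the SIR gain described above can be computed as in this sketch; the helper names are hypothetical.

```python
# Sketch of the SIR-gain computation of Section 5.2.2: the SIR of the selected
# BSS output compared with the SIR averaged over the BSS input channels.
import numpy as np

def sir_db(target_part, interferer_part):
    return 10 * np.log10(np.sum(target_part ** 2) / np.sum(interferer_part ** 2))

def bss_sir_gain_db(in_targets, in_interferers, out_target, out_interferer):
    """SIR gain = output SIR minus the input SIR averaged over the BSS inputs."""
    sir_in = np.mean([sir_db(t, i) for t, i in zip(in_targets, in_interferers)])
    sir_out = sir_db(out_target, out_interferer)
    return sir_out - sir_in
```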

5.3. Experimental Conditions. Since a two-channel ICA-based BSS algorithm can only separate two point sources (Section 4.1), no diffuse noise has been added to the sensor signal mixture (i.e., xn = 0) and only two point sources were considered (one target source and one interfering source).

Head-related impulse responses (HRIR) were measured using a pair of Siemens Life (BTE) hearing aid cases with two microphones and a single receiver (loudspeaker) inside each device (no processor). The cases were mounted on a real person and connected, via a pre-amplifier box, to a (laptop) PC equipped with a multi-channel RME Multiface sound card. Measurements were made in the following environments:

(i) a low-reverberation chamber (T60 ≈ 50 ms),

(ii) a living-room-like environment (T60 ≈ 300 ms).

The source signal components xs were then generated by convolving speech signals with the recorded HRIRs, with the target and interfering sources placed at azimuths 0° (in front of the HA user) and 90° (facing the right ear), respectively. The target and interfering sources were of approximately equal (long-time) signal power.
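The generation of the external-source components can be sketched as follows, assuming the measured HRIRs are available as arrays of shape (number of microphones, impulse-response length); scipy.signal.fftconvolve is used here purely for illustration.

import numpy as np
from scipy.signal import fftconvolve

def make_sensor_mixture(target, interferer, hrir_target, hrir_interf):
    """Convolve each point source with its per-microphone HRIR and sum,
    yielding the external-source components x_s at the microphones.
    target and interferer are 1-D arrays of equal length (16 kHz here);
    hrir_* have shape (n_mics, hrir_len)."""
    n = len(target)
    mics = []
    for m in range(hrir_target.shape[0]):
        xt = fftconvolve(target, hrir_target[m])[:n]
        xi = fftconvolve(interferer, hrir_interf[m])[:n]
        mics.append(xt + xi)
    return np.stack(mics)        # shape (n_mics, n_samples)

Before mixing, the interferer would typically be scaled so that its long-time power matches that of the target, as stated above.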


Figure 5: BSS performance for increasing HA gain, in a low-reverberation chamber. Panels (a)–(c) correspond to a vent size of 2 mm, a vent size of 3 mm, and an open vent; each panel plots the reference SIR gain, the SIR gain, and the SFR at the BSS inputs and outputs (in dB) versus the hearing-aid gain (dB), with the critical gain without FBC marked.

To generate the feedback components xu, binaural FBPs ("direct" and "cross" FBPs, see Section 2.2) measured from Siemens BTE hearing aids were used. These recordings have been made for different vent sizes (2 mm, 3 mm, and open) and in the following scenario:

(i) left HA mounted on a manikin without obstruction,

(ii) right HA mounted on a manikin with a telephone as obstruction.

The digital signal processing was performed at a sampling frequency of 16 kHz, picking up the front microphone at each ear (i.e., P = 1).

6. Experimental Results

In the following, experimental results involving the combination of AFC and BSS are shown and discussed. BSS filters of 1024 coefficients each were applied, the AFC filter length was set to 256, and decorrelation delays of 5 ms were included in the forward paths.

6.1. Impact of Feedback on BSS. The discussion of Section 4 indicates that a deterioration of the BSS performance is expected at low input SFR, due to a bias introduced by the feedback loop. To determine to what extent the amount of feedback deteriorates the achievable source separation, the performance of the (adaptive) BSS algorithm was experimentally evaluated for different amounts of feedback by varying the amplification level g. Preliminary tests in the absence of AFC showed that the feedback had almost no impact on the BSS performance as long as the system was stable (i.e., as long as |k(e^jω)|_{B=0} < 1, ∀ω), because the SFR at the BSS inputs remained high (greater than 20 dB). This basically validates the traditional way signal enhancement techniques for hearing aids have been developed, ignoring the presence of feedback.

Signal enhancement algorithms, however, can be subject to higher input SFR levels when an AFC is used to stabilize the system. To be able to further increase the gains and the amount of feedback signal in the microphone inputs while preventing system instability, the feedback components present at the BSS output v were artificially suppressed. This is equivalent to performing AFC on the BSS output under ideal conditions. It guarantees the stability of the system (with ASM = +∞), regardless of the HA amplification level, but does not reduce the SFR at the BSS inputs. The results after convergence of the BSS algorithm are presented in Figures 5 and 6 for different rooms and vent sizes. The reference lines show the gain in SIR achieved by BSS in the absence of feedback (and hence in the absence of AFC). The critical gain, depicted by vertical dashed lines in the figures, corresponds to the maximum stable gain without AFC, that is, the gain for which the initial stability margin 20 log10(min_ω 1/|k(e^jω)|_{B=0}) becomes zero.
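Under the simplification mentioned in Section 5.2.1 (equal left/right gains, so that k is directly proportional to g), the critical gain can be estimated from a single measurement of k at a known reference gain, as in the following hedged sketch (all inputs are hypothetical):

import numpy as np

def critical_gain_db(k_ref, g_ref_db):
    """Estimate of the maximum stable broadband gain without AFC, assuming
    k(e^jw) scales linearly with the (equal) left/right gain g.
    k_ref: samples of the open-loop characteristic measured at gain g_ref_db."""
    margin_db = 20.0 * np.log10(np.min(1.0 / np.abs(k_ref)))
    return g_ref_db + margin_db   # gain at which the margin reaches 0 dB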

At low gains, the feedback has very little impact on the SIR gain because the input SFR is sufficiently high in all tested scenarios. We also see that the interference rejection causes a decrease in SFR (from the BSS inputs to the output), since parts of the external source components are attenuated. This should be beneficial to an AFC algorithm, since it reduces the bias of the AFC Wiener solution due to the interfering source, as discussed in Section 3.2. However, at high gains, where the input SFR is low (less than 10 dB), the large amount of feedback causes a significant deterioration of the interference rejection performance. Moreover, it should be noted that at low gains the input SFR decreases proportionally to the gain, as expected. We see, however, from the figures that the input SFR suddenly drops at higher gains, when the amount of feedback becomes significant (see, e.g., the transition from g = 20 dB to g = 25 dB in Figure 6 for an open vent). Since BSS has no influence on the signal power of the external sources (the "S" component in the SFR), this means that BSS amplifies the LS signals (and hence the feedback signals at the microphones, that is, the "F" component in the SFR). This undesired effect is due to the bias introduced by the feedback loop (Section 4.2) and can be interpreted as follows: two mechanisms enter into play. The first one unravels the acoustical mixing system. It produces LS signals which are dominated by the target source (see the positive SIR gains in the figures), as desired. The second mechanism consists in amplifying the sensor signals. As long as the feedback level is small, this second mechanism is almost invisible, since it would amplify signals coming from both sources. But at higher gains, where the amount of feedback in the BSS inputs becomes more significant, this second mechanism becomes more important, since it acts mainly in favor of the target source. This second mechanism illustrates the impact of the feedback loop on the BSS algorithm at high feedback levels. It shows the necessity of placing the AFC before BSS, so that BSS can benefit from a higher input SFR.


Figure 6: BSS performance for increasing HA gain, in a living-room environment. Panels (a)–(c) correspond to a vent size of 2 mm, a vent size of 3 mm, and an open vent; each panel plots the reference SIR gain, the SIR gain, and the SFR at the BSS inputs and outputs (in dB) versus the hearing-aid gain (dB), with the critical gain without FBC marked.

Figure 7: Performance of the AFC-BSS combination in two acoustical environments (low-reverberation chamber and living-room environment). The measured FBPs for a vent of size 2 mm were used. Each panel plots the reference SIR gain, the SIR gain, the SFR at the BSS and FBC inputs, the ASM, and the misalignment (in dB) versus the hearing-aid gain (dB).

6.2. Overall Behavior of the AFC-BSS Combination. The full AFC-BSS combination has been evaluated for a vent size of 2 mm, in the low-reverberation chamber as well as in the living-room-like environment (Section 5.3). Figure 7 depicts the BSS and AFC performance obtained after convergence. As in Figures 5 and 6, the reference lines show the gain in SIR achieved by BSS in the absence of feedback (and hence in the absence of AFC).

The results confirm the observations made in the previous section. With the AFC applied directly to the sensor signals, the BSS algorithm could indeed benefit from the ability of the AFC to keep the SFR at the BSS inputs at high levels for every considered HA gain. Therefore, BSS always provided SIR gains which were very close to the reference SIR gain obtained without feedback, even at high gains. This contrasts with the results obtained in Figures 5 and 6, where an ideal AFC was applied at the BSS output instead of being applied first.

Note that the SFR at the AFC outputs corresponds here to the SFR at the BSS inputs. The gain in SFR (SFR at the BSS inputs minus SFR at the AFC inputs, i.e., the feedback attenuation) achieved by the AFC algorithm can therefore be read directly from Figure 7. As expected from the discussion presented in Section 3.1, the two AFC filters used were sufficient to efficiently compensate both the "direct" and "cross" feedback signals, and hence avoid instability of the binaural closed-loop system. As in the unilateral case, and as expected from the convergence analysis conducted in Section 3.2, the best AFC results were obtained at low input SFR levels, that is, at high gains. The AFC performance was also better in the low-reverberation chamber than in the living-room-like environment, as can be seen from the higher SFR levels at the BSS inputs, the higher ASM values, and the lower misalignments. This result seems surprising at first sight, since the FBPs were identical in both environments. It can, however, be easily justified by the analytical results presented in Section 3.2. We actually saw that the correlation between the external source signals and the LS signals introduces a bias of the AFC Wiener solution. The bias due to the target source is barely influenced by the BSS results, since BSS left the target signal (almost) unchanged in both environments. But the BSS performance directly influences the amount of residual interfering signal present at the LS outputs, and hence the bias of the AFC Wiener solution due to the interfering source. In general, since reverberation increases the length of the acoustical mixing filters (and hence, typically, the necessary BSS filter length), the more reverberant the environment, the lower the achieved separation results (for a given BSS filter length).


This is confirmed here by the SIR results shown in the figures. The difference in AFC performance therefore comes from the higher amount of residual interfering signal present at the LS outputs in the living-room-like environment, which increases the bias of the AFC Wiener solution.

The AFC does not suffer from any particular negative interaction with the BSS algorithm, since it comes first in the processing chain; rather, it benefits from BSS, especially in the low-reverberation chamber, as we just saw. Note that the situation is very different when the AFC is applied after BSS. In this case, the AFC filters need to quickly follow the continuously time-varying BSS filters, which prevents proper convergence of the AFC filters, even with time-invariant FBPs.

7. Conclusions

An analysis of a system combining adaptive feedback cancellation and adaptive binaural filtering for signal enhancement in hearing aids was presented. To illustrate our study, a blind source separation algorithm was chosen as an example of adaptive binaural filtering. A number of interaction effects could be identified. Moreover, to correctly understand the behavior of the AFC, the system was described and analyzed in detail. A new stability condition adapted to the binaural configuration could be derived, and adequate performance measures were proposed which account for the specificities of the binaural system. Experimental evaluations confirmed and illustrated the theoretical findings.

The ideal AFC solution in the new binaural configuration could be identified, but a steady-state analysis showed that the AFC suffers from a bias in its optimum (Wiener) solution. This bias, similar to the unilateral case, is due to the correlation between feedback and external source signals. It was also demonstrated, theoretically as well as experimentally, that a signal enhancement algorithm can help reduce this bias. The correlation between feedback and external source signals also causes a bias of the BAF solution. But contrary to the bias encountered by the AFC, the BAF bias increases with increasing HA amplification levels. Fortunately, this bias can be reduced by applying AFC directly on the sensor signals, instead of applying it on the BAF outputs.

The analysis also showed that two SIMO AFC systems of adaptive filters can effectively compensate for the four SIMO FBP systems when the outputs are sufficiently correlated (see Section 3.1). Should this condition not be fulfilled, because of, for example, some nonlinearities in the forward paths, the "cross" feedback signals travelling from one ear to the other would not be completely identifiable. But we saw that, since the amplitude of the "cross" FBPs is usually negligible compared to the amplitude of the "direct" FBPs, the consequences would be very limited as long as the HA gains are set to similar amplification levels.

Acknowledgments

This work has been supported by a grant from the European Union FP6 Project 004171 HEARCOM (http://hearcom.eu/main_de.html). We would also like to thank Siemens Audiologische Technik, Erlangen, for providing some of the hearing-aid recordings for evaluation.

References

[1] S. Doclo and M. Moonen, "GSVD-based optimal filtering for single and multimicrophone speech enhancement," IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2230–2244, 2002.

[2] A. Spriet, M. Moonen, and J. Wouters, "Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction," Signal Processing, vol. 84, no. 12, pp. 2367–2387, 2004.

[3] S. Doclo, A. Spriet, M. Moonen, and J. Wouters, "Design of a robust multi-microphone noise reduction algorithm for hearing instruments," in Proceedings of the International Symposium on Mathematical Theory of Networks and Systems (MTNS '04), pp. 1–9, Leuven, Belgium, July 2004.

[4] J. Vanden Berghe and J. Wouters, "An adaptive noise canceller for hearing aids using two nearby microphones," Journal of the Acoustical Society of America, vol. 103, no. 6, pp. 3621–3626, 1998.

[5] D. P. Welker, J. E. Greenberg, J. G. Desloge, and P. M. Zurek, "Microphone-array hearing aids with binaural output—part II: a two-microphone adaptive system," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 6, pp. 543–551, 1997.

[6] M. Doerbecker and S. Ernst, "Combination of two-channel spectral subtraction and adaptive Wiener post-filtering for noise reduction and dereverberation," in Proceedings of the 8th European Signal Processing Conference (EUSIPCO '96), pp. 995–998, Trieste, Italy, September 1996.

[7] T. J. Klasen, T. van den Bogaert, M. Moonen, and J. Wouters, "Binaural noise reduction algorithms for hearing aids that preserve interaural time delay cues," IEEE Transactions on Signal Processing, vol. 55, no. 4, pp. 1579–1585, 2007.

[8] S. Doclo, R. Dong, T. J. Klasen, J. Wouters, S. Haykin, and M. Moonen, "Extension of the multi-channel Wiener filter with ITD cues for noise reduction in binaural hearing aids," in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '05), pp. 70–73, New Paltz, NY, USA, October 2005.

[9] T. J. Klasen, S. Doclo, T. van den Bogaert, M. Moonen, and J. Wouters, "Binaural multi-channel Wiener filtering for hearing aids: preserving interaural time and level differences," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 5, pp. 145–148, Toulouse, France, May 2006.

[10] R. Aichner, H. Buchner, M. Zourub, and W. Kellermann, "Multi-channel source separation preserving spatial information," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07), vol. 1, pp. 5–8, Honolulu, Hawaii, USA, May 2007.

[11] D. P. Egolf, "Review of the acoustic feedback literature from a control systems point of view," in The Vanderbilt Hearing-Aid Report, pp. 94–103, York Press, London, UK, 1982.

[12] M. G. Siqueira and A. Alwan, "Steady-state analysis of continuous adaptation in acoustic feedback reduction systems for hearing-aids," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 443–453, 2000.

[13] H. Puder and B. Beimel, "Controlling the adaptation of feedback cancellation filters—problem analysis and solution approaches," in Proceedings of the 12th European Signal Processing Conference (EUSIPCO '04), pp. 25–28, Vienna, Austria, September 2004.

[14] A. Spriet, Adaptive filtering techniques for noise reduction and acoustic feedback cancellation in hearing aids, Ph.D. thesis, Katholieke Universiteit Leuven, Leuven, Belgium, 2004.

[15] A. Spriet, G. Rombouts, M. Moonen, and J. Wouters, "Combined feedback and noise suppression in hearing aids," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 6, pp. 1777–1790, 2007.

[16] W. Kellermann, "Strategies for combining acoustic echo cancellation and adaptive beamforming microphone arrays," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), vol. 1, pp. 219–222, Munich, Germany, April 1997.

[17] S. Haykin, Adaptive Filter Theory, Prentice Hall Information and System Sciences Series, Prentice Hall, Upper Saddle River, NJ, USA, 4th edition, 2002.

[18] M. M. Sondhi, D. R. Morgan, and J. L. Hall, "Stereophonic acoustic echo cancellation—an overview of the fundamental problem," IEEE Signal Processing Letters, vol. 2, no. 8, pp. 148–151, 1995.

[19] J. Benesty, D. R. Morgan, and M. M. Sondhi, "A better understanding and an improved solution to the specific problems of stereophonic acoustic echo cancellation," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 156–165, 1998.

[20] D. R. Morgan, J. L. Hall, and J. Benesty, "Investigation of several types of nonlinearities for use in stereo acoustic echo cancellation," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 686–696, 2001.

[21] A. Spriet, I. Proudler, M. Moonen, and J. Wouters, "An instrumental variable method for adaptive feedback cancellation in hearing aids," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '05), vol. 3, pp. 129–132, Philadelphia, Pa, USA, March 2005.

[22] T. van Waterschoot and M. Moonen, "Adaptive feedback cancellation for audio signals using a warped all-pole near-end signal model," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), pp. 269–272, Las Vegas, Nev, USA, April 2008.

[23] S. Araki, H. Sawada, R. Mukai, and S. Makino, "A novel blind source separation method with observation vector clustering," in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC '05), pp. 117–120, Eindhoven, The Netherlands, September 2005.

[24] J. Cermak, S. Araki, H. Sawada, and S. Makino, "Blind source separation based on a beamformer array and time frequency binary masking," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), vol. 1, pp. 145–148, Honolulu, Hawaii, USA, May 2007.

[25] A. Hyvaerinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.

[26] W. Herbordt, Sound Capture for Human/Machine Interfaces, Springer, Berlin, Germany, 2005.

[27] L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 3, pp. 320–327, 2000.

[28] H. Sawada, R. Mukai, S. Araki, and S. Makino, "Polar coordinate based nonlinear function for frequency-domain blind source separation," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '02), vol. 1, pp. 1001–1004, Orlando, Fla, USA, May 2002.

[29] S. Kurita, H. Saruwatari, S. Kajita, K. Takeda, and F. Itakura, "Evaluation of blind signal separation method using directivity pattern under reverberant conditions," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '00), vol. 5, pp. 3140–3143, Istanbul, Turkey, June 2000.

[30] H. Sawada, R. Mukai, S. Araki, and S. Makino, "A robust and precise method for solving the permutation problem of frequency-domain blind source separation," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 5, pp. 530–538, 2004.

[31] H. Buchner, R. Aichner, and W. Kellermann, "A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 1, pp. 120–134, 2005.

[32] R. Aichner, H. Buchner, F. Yan, and W. Kellermann, "A real-time blind source separation scheme and its application to reverberant and noisy acoustic environments," Signal Processing, vol. 86, no. 6, pp. 1260–1277, 2006.

[33] H. Buchner, R. Aichner, and W. Kellermann, "TRINICON: a versatile framework for multichannel blind signal processing," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), vol. 3, pp. 889–892, Montreal, Canada, May 2004.

[34] K. Eneman, H. Luts, J. Wouters, et al., "Evaluation of signal enhancement algorithms for hearing instruments," in Proceedings of the 16th European Signal Processing Conference (EUSIPCO '08), pp. 1–5, Lausanne, Switzerland, August 2008.

[35] H. Buchner, R. Aichner, J. Stenglein, H. Teutsch, and W. Kellermann, "Simultaneous localization of multiple sound sources using blind adaptive MIMO filtering," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '05), vol. 3, pp. 97–100, Philadelphia, Pa, USA, March 2005.

[36] J. J. Shynk, "Frequency-domain and multirate adaptive filtering," IEEE Signal Processing Magazine, vol. 9, no. 1, pp. 14–37, 1992.

[37] A. Mader, H. Puder, and G. U. Schmidt, "Step-size control for acoustic echo cancellation filters—an overview," Signal Processing, vol. 80, no. 9, pp. 1697–1719, 2000.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 465189, 10 pages
doi:10.1155/2009/465189

Research Article

Influence of Acoustic Feedback on the Learning Strategies of Neural Network-Based Sound Classifiers in Digital Hearing Aids

Lucas Cuadra (EURASIP Member), Enrique Alexandre, Roberto Gil-Pita (EURASIP Member), Raul Vicen-Bueno, and Lorena Alvarez

Departamento de Teoria de la Senal y Comunicaciones, Escuela Politecnica Superior, Universidad de Alcala, 28805 Alcala de Henares, Spain

Correspondence should be addressed to Lucas Cuadra, [email protected]

Received 1 December 2008; Revised 4 May 2009; Accepted 9 September 2009

Recommended by Hugo Fastl

Sound classifiers embedded in digital hearing aids are usually designed by using sound databases that do not include the distortions associated with the feedback that often occurs when these devices have to work at high gain and low gain margin to oscillation. The consequence is that the classifier learns inappropriate sound patterns. In this paper we explore the feasibility of using different sound databases (generated according to 18 configurations of real patients), and a variety of learning strategies for neural networks, in the effort of reducing the probability of erroneous classification. The experimental work basically points out that the proposed methods assist the neural network-based classifier in reducing its error probability by more than 18%. This helps enhance the elderly user's comfort: the hearing aid automatically selects, with higher success probability, the program that is best adapted to the changing acoustic environment the user is facing.

Copyright © 2009 Lucas Cuadra et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Acoustic feedback appears when part of the conveniently amplified output signal produced by a digital hearing aid returns through the auditory canal and re-enters the device, where it is amplified anew [1–6]. Sometimes feedback may cause the hearing aid to become unstable, producing very unpleasant and irritating howls. Preventing the hearing aid from such instability forces designers to limit the maximum gain that can be used to compensate the patient's acoustic loss. In this regard, along with noise reduction [1, 7, 8], the topic of controlling acoustic feedback plays a key role in the design of hearing devices [1, 3–5, 9–17]. In particular, a very extensive and clear review by A. Spriet et al. on the topic of adaptive feedback cancellation in hearing aids may be found in [3–5] for further details.

However, even without reaching the limit of instability, feedback often negatively affects the performance of those hearing aids that operate with high levels of gain, causing, for instance, distortions [1, 3–5]. In this situation, a relevant application, whose performance may presumably be affected, and on which this paper focuses, is the one in which the hearing aid itself classifies [1, 18–23] the acoustic environment that surrounds the user, and automatically selects the amplification "program" that is best adapted to such environment ("self-adaptation") [20–23].

Within the more general and highly relevant topic of sound classification in hearing aids [1, 18, 19], self-adaptation is currently deemed very valuable by hearing aid users, especially by the elderly, because the "manual" approach (in which the user has to identify the acoustic surroundings and choose the most adequate program) is extremely uncomfortable and very often exceeds the abilities of many hearing aid users [24, 25]. Only about 25% of hearing aid recipients (a scarce 20% of those that could benefit from hearing aids) wear it because of the unpleasant whistles and/or other amplified noises the hearing instrument often produces, in particular when moving from one acoustic environment (e.g., speech-in-quiet) to a different one (say, for instance, a crowded restaurant) for which the active program is not suitable (the user thus hears a sudden, uncomfortable amplified noise).


More figures confirming these facts are, for instance, that about one-third of Americans between the ages of 65 and 74 suffer from hearing loss, and about half the people who are 85 and older have hearing loss [26]. Likewise, about sixteen percent of adult Europeans have hearing problems severe enough to adversely affect their daily life. The Royal National Institute for Deaf People (RNID) has reported that there are 8.7 million deaf and hard of hearing people in the UK, and that just one in four hearing-impaired Britons has received a hearing aid [27]. Furthermore, the number of people with hearing loss is increasing at an alarming rate, not only because of the aging of the world's population, but also because of the growing exposure to noise in daily life. These facts illustrate the necessity for hearing aids to automatically classify the acoustic environment the user is in.

In addition, most elderly people regrettably suffer from a pernicious presbyacusis with deep loss at some frequencies. The device then has to work with a high level of gain, and usually with a very short gain margin to oscillation [3]. As a result, the sound processed in the device is not only a "clean" version of the "external" acoustic environment but also contains irritating distortions. Thus, the sound classifier will not work properly if the classification algorithm is designed simply on the basis of clean (feedback-free) sounds.

In this paper we explore whether or not it is worthwhile to include the effects of this distortion (caused by feedback) in the training process of a neural network-based classifier. The way the feedback should be included in the learning process, and to what extent this has an effect on the classifier efficiency, constitute the general purpose of this paper. Regarding this, it is worth noting that the training process of the classifier is always performed off-line on a desktop computer, and never on the hearing aid itself. Only once the neural network-based classifier has been properly trained, validated, and tested in the laboratory is it uploaded onto the hearing aid.

Note that a question related to the general topic proposed in this paper could be as follows: what is the effect of feedback reduction algorithms on the classification process? These algorithms may increase the overall gain significantly before feedback effects occur. Would the classifier implemented on the device have to select different feature sets based on the presence or absence of feedback? Obviously, addressing this complex question in detail would require a new study, which is clearly out of the main scope of this paper.

After having a look at some key design aspects in Section 2, we will focus on two tasks, intimately linked, that will guide our research. The first one consists in creating a variety of adequate sound databases that incorporate the effects of feedback. The "original" database, from which the others will be derived, consists of samples of different environmental sounds (Section 3), and will be modified according to various configurations of real patients (the hearing loss and his/her type of hearing instrument). We will focus on 18 real patients (study cases) whose hearing loss is especially problematic, since they require high gain at some frequencies and, consequently, have a low gain margin to oscillation (Section 4.1). The databases including sounds with feedback, as will be explained in Section 4.2, will be created by filtering "normal sounds" (without feedback) by means of systems [10, 28] that simulate the dissimilar interactions (feedback paths) between the hearing aid and the user's auditory system for each of the real study cases. The second step, based on making different uses of the so-generated databases containing sounds with feedback, consists in exploring a variety of learning-and-test strategies inspired by a leave-one-out strategy, and determining which is the most appropriate (Section 5).

Figure 1: Simplified architecture diagram. For the sake of clarity, it only shows the two coprocessors that operate concurrently. The WOLA filterbank coprocessor computes |χi[k]|², χi[k] being the kth frequency bin of the spectrum of a given sound frame Xi(t), while the "core processor" deals with the remaining signal processing tasks. X(t) labels the sound signal entering the microphone, while IS and OS represent, respectively, the input and output stages.

2. Framework, Design Limitations, and Problem Statement

Implementing complex algorithms on digital hearing aids is limited by some design restrictions. These devices basically consist of a microphone, a loudspeaker, a digital signal processor (DSP), and a battery. Among the constraints, battery life certainly remains a big problem. Although memory and computational power (number of instructions) are currently less critical, the optimization of resources remains a key issue for more demanding algorithms such as, for instance, neural network (NN) classifiers, noise reduction, or dereverberation systems [1, 3].

In particular, the DSP [29, 30] used by our platform to carry out the experiments is basically composed of two coprocessors operating concurrently, as schematically illustrated in Figure 1. The first one is the weighted overlap-add (WOLA) filterbank coprocessor, which performs the time/frequency decomposition with NB = 64 frequency bands: it provides |χi[k]|², χi[k] being the kth frequency bin of the spectrum of a given sound frame Xi(t). The second coprocessor is the "core processor", which deals with the remaining tasks, such as compensating the hearing loss, reducing noise, classifying the acoustic environment, and so on.


Figure 2: Conceptual description of the way the complete classifying system works. This consists of a preprocessing stage, a feature extraction stage (which computes a number of features arranged in the vector F), and a classifying algorithm, which categorizes input sounds from the database into three classes (speech, music, and noise). X(t) is the sound signal contained in each file in the database. χi[k] is the spectrum of any frame, Xi(t), into which the signal is segmented. S is the set of available features.

Regarding the sound classifier, as shown in Figure 2, it basically involves the following: (1) a preprocessing block (Section 2.1); (2) a feature extraction stage, which computes the main characteristics of the sound (Section 2.2); and (3) some type of classifying algorithm, a neural network, that should presumably be able to learn from sounds with feedback (Section 2.3).

2.1. The Preprocessing Stage. Each of the input sounds to be classified, X(t), assumed to be a stochastic process, is segmented into frames, Xi(t), i = 1, ..., r, r being the number of frames into which the signal is divided. To compute the features used to characterize any frame, Xi(t), it is necessary to sample the frame: Xi(tk), k = 1, ..., p, p being the number of samples per frame. Since each frame, Xi(t), is a windowed stochastic process, any one of its samples, Xi(tk), is a random variable that, for simplicity, is labeled Xik. Thus, for each audio frame, Xi(t), the following random vector is obtained: Xi = [Xi1, ..., Xip]. As sketched in Figure 1, the WOLA filterbank coprocessor computes its discrete Fourier transform (DFT), leading to χi = [χi1, ..., χiNB]. This is just the initial information the second stage in Figure 2 uses to calculate all the features aiming at describing frame Xi(t).

The experiments that will be described below have been carried out by using frames of 20 ms length.
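The framing and spectral analysis can be approximated off-line as in the sketch below; note that this plain windowed-FFT implementation with band averaging is only a rough stand-in for the WOLA filterbank coprocessor, and the constants are taken from the text (16 kHz sampling, 20-ms frames, NB = 64 bands).

import numpy as np

FS = 16_000                       # sampling rate (Hz)
FRAME_LEN = int(0.02 * FS)        # 20-ms frames -> 320 samples
N_BANDS = 64                      # NB, as in the WOLA coprocessor

def frame_spectra(x):
    """Split x into 20-ms frames, window them, and reduce each power
    spectrum to N_BANDS bands by averaging adjacent FFT bins.
    Returns band magnitudes |chi_i[k]| with shape (r, N_BANDS)."""
    n_frames = len(x) // FRAME_LEN
    win = np.hanning(FRAME_LEN)
    spectra = []
    for i in range(n_frames):
        frame = x[i * FRAME_LEN:(i + 1) * FRAME_LEN] * win
        power = np.abs(np.fft.rfft(frame)) ** 2       # 161 FFT bins
        bands = np.array_split(power, N_BANDS)        # group into 64 bands
        spectra.append(np.sqrt([b.mean() for b in bands]))
    return np.asarray(spectra)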

2.2. The Feature Extraction Stage. This functional block plays the role of processing the signal in order to extract some kind of valuable information that characterizes it and helps the second stage (the NN-based classifier) to work properly. The features, which must be computed on the DSP, are based on a set S of well-selected, widely used, sound-describing features that exhibit good performance along with low-to-medium computational load. They have been deliberately chosen to be simple in order to place the emphasis on the design of a more complex neural network-based classifier (Section 2.3) that is capable of learning sounds with feedback (Section 4.2). The features fk ∈ S, which take as initial information the matrix χi = [χi1, ..., χiNB] computed by the WOLA coprocessor for any sound frame, Xi(t), are as follows.

(1) The spectral centroid of the sound frame Xi(t), which can be associated with a measure of the sound brightness, is

\[
\mathrm{SC}_{i}=\frac{\sum_{k=1}^{N_B}\left|\chi_{ik}\right|\cdot k}{\sum_{k=1}^{N_B}\left|\chi_{ik}\right|}, \tag{1}
\]

where χik represents the kth frequency bin of the Fourier transform at frame i, and NB is the number of frequency bands (NB = 64 in this particular platform). Please note that χik merely represents each of the elements in the matrix χi = [χi1, ..., χiNB] computed by the WOLA coprocessor for the sound frame Xi(t).

(2) The "Voice2White" ratio of frame Xi(t), proposed in [31], is a measure of the energy inside the typical speech band (300–4000 Hz) with respect to the whole energy of frame Xi(t). It is computed by using the following:

\[
\mathrm{V2W}_{i}=\frac{\sum_{k=M_1}^{M_2}\left|\chi_{ik}\right|^{2}}{\sum_{k=1}^{N_B}\left|\chi_{ik}\right|^{2}}, \tag{2}
\]

with M1 and M2 being the indexes that limit the speech band (300–4000 Hz).

(3) The "Spectral Flux" of frame Xi(t) is associated with the amount of local spectral change when comparing this frame with the previous one:

\[
\mathrm{SF}_{i}=\sum_{k=1}^{N_B}\left(\left|\chi_{ik}\right|-\left|\chi_{(i-1)k}\right|\right)^{2}. \tag{3}
\]

(4) The short-time energy of frame Xi(t) is defined as

\[
\mathrm{STE}_{i}=\frac{1}{N_B}\sum_{k=1}^{N_B}\left|\chi_{ik}\right|^{2}. \tag{4}
\]

Note that, since χi = [χi1, ..., χiNB] is a vector of random variables, any feature fk ∈ S applied to it, fk(χi), is thus a function of NB random variables, fk(χi1, ..., χiNB), and, consequently, a random variable [32]. In order to simplify the notation, the random variable fk(χi1, ..., χiNB) will be labeled fki. Finally, to complete the characterization of the audio input signal, the above-mentioned sequence of processes has to be applied to all the r frames into which the entering sound has been segmented.

For the sake of simplicity, we have labeled [fk1, ..., fkr] ≡ Fk as the feature vector that contains those elements obtained when feature fk is applied to each of the r frames into which the sound signal has been segmented. We have completed the statistical characterization of the random vector Fk by estimating its mean value, E[Fk], and its variance, Var[Fk] [32].


Finally, this characterization must be done for all the available features fk ∈ S. The feature extraction algorithm ends in generating the following feature vector: F = [E[F1], Var[F1], ..., E[Fnf], Var[Fnf]], its dimension being dim(F) = 2nf = NF. This is just the signal-describing vector that feeds the classifier. For the sake of clarity, it is written formally as F = [F1, ..., FNF], which is the vector represented in Figure 2 as entering the classifying algorithm.
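Putting the four features and the mean/variance aggregation together, a minimal NumPy sketch could look as follows; chi is assumed to be the r × NB array of band magnitudes produced per decision interval, and m1, m2 are the (hypothetical, zero-based) band indices delimiting the 300–4000 Hz speech band.

import numpy as np

def frame_features(chi, m1, m2):
    """Per-frame features (1)-(4): spectral centroid, Voice2White ratio
    over bands m1..m2, spectral flux (0 for the first frame), and
    short-time energy. chi has shape (r, NB)."""
    mag = np.abs(chi)
    k = np.arange(1, mag.shape[1] + 1)
    sc = (mag * k).sum(axis=1) / mag.sum(axis=1)                          # (1)
    v2w = (mag[:, m1:m2 + 1] ** 2).sum(axis=1) / (mag ** 2).sum(axis=1)   # (2)
    sf = np.zeros(mag.shape[0])
    sf[1:] = ((mag[1:] - mag[:-1]) ** 2).sum(axis=1)                      # (3)
    ste = (mag ** 2).mean(axis=1)                                         # (4)
    return np.stack([sc, v2w, sf, ste], axis=1)       # shape (r, 4)

def feature_vector(chi, m1, m2):
    """Aggregate into F = [E[F1], Var[F1], E[F2], Var[F2], ...]."""
    f = frame_features(chi, m1, m2)
    return np.column_stack([f.mean(axis=0), f.var(axis=0)]).ravel()

With the 20-ms frames and 2.5-second decision intervals mentioned in the text, chi would contain roughly 125 frames per decision.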

2.3. The Classifying Algorithm: The Neural Network Approach. In order to make the sound classifier work better with sounds that contain feedback, we have chosen, among a variety of previously explored algorithms, a particular kind of neural network (NN) [33–35], which has to be properly designed to meet the hardware requirements of the DSP the hearing aid is based on (see [20] for further details). The key reason that has compelled us to choose the NN approach is that neural networks are able to learn from appropriate training pattern sets, and properly classify other patterns that have never been encountered before [20, 22, 25]. This ultimately leads to very good results in terms of smaller error probability when compared to those from other popular algorithms such as the rule-based classifier [35], the Fisher linear discriminant [23, 34, 36], the minimum distance classifier, or the k-nearest neighbour algorithm [35]. Despite its presumably high computational cost, its implementation has been proven to be feasible on digital hearing aids: it requires some tradeoffs involving a balance between reducing the computational demands (that is, the number of neurons) and not degrading the quality perceived by the user [20].

The basic architecture of the "original" NN, which is the cornerstone of the learning schemes to be explored in Section 5, consists of three layers of neurons (input, hidden, and output layers), interconnected by links with adjustable weights [33, 35]. As will be better understood later on, we have named it the "original neural network" in the effort of emphasizing that, depending on the learning strategy adopted, a number of different neural network configurations will finally be obtained. That is, although with different details that will be explained in Section 5, all the classifiers have evolved from a basic configuration whose main aspects are as follows.

In this original NN architecture, the number of input neurons corresponds to that of the features used to characterize the sound, the number of output neurons is related to the three classes we are interested in, and, finally, the number of hidden neurons depends on the adjustment of the complexity of the network [35]. We have explored how many hidden neurons are required by means of batches of experiments that have ranged from 1 to 40 neurons. A higher number of hidden neurons has been found to be unfeasible because of the greater associated computational cost. Each of these experiments, which aim at estimating the number of hidden neurons, has been repeated 10 times. In this design process, the "proper" NN configuration is precisely the one that exhibits the lowest error probability computed over the validation set. The Levenberg-Marquardt algorithm [33, 37] with Bayesian regularization [35, 38] has been found to be the most appropriate method for training the neural network. The main advantage of using regularization techniques relies on the fact that the generalization capabilities of the NN end up being improved, making it capable of reaching good results even with smaller networks, since the regularization algorithm itself prunes those neurons that are not strictly necessary [20].
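The hidden-layer-size selection described above can be sketched with scikit-learn as follows; note that MLPClassifier trains with gradient-based optimizers and a simple weight penalty (alpha), not with the Levenberg-Marquardt algorithm and Bayesian regularization used by the authors, so this is only an illustrative stand-in.

import numpy as np
from sklearn.neural_network import MLPClassifier

def select_hidden_size(X_train, y_train, X_val, y_val, max_hidden=40):
    """Sweep the number of hidden neurons (1..40, as in the text) and keep
    the network with the lowest validation error probability."""
    best_err, best_net = np.inf, None
    for h in range(1, max_hidden + 1):
        net = MLPClassifier(hidden_layer_sizes=(h,), alpha=1e-2,
                            max_iter=2000, random_state=0)
        net.fit(X_train, y_train)
        err = 1.0 - net.score(X_val, y_val)   # validation error probability
        if err < best_err:
            best_err, best_net = err, net
    return best_net, best_err

In the authors' procedure each size is additionally repeated 10 times; that outer repetition loop is omitted here for brevity.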

Please note that, as mentioned, in our platform the features are computed every 20 ms and, at the end, the classifier makes a decision every 2.5 seconds, based on the mean value and the variance of the features during that time.

3. The Original "Feedback-Free" Sound Database

As mentioned in Section 1, our research aims to take the distortions due to feedback into account when training the neural network-based classification algorithm. This requires making use of a number of databases that should include different feedback effects. These databases, as will be explained in Section 4.2, will be created by filtering "normal sounds" (with no feedback) using systems that simulate the dissimilar interaction between a hearing aid and its user's auditory system [10]. We have grouped such normal, feedback-free sounds in a database labeled D0 that, for the sake of clarity, we have called the "original database" or the "unfiltered" database.

This "feedback-free" database D0 contains a total of 7340 seconds of audio, including speech-in-quiet, speech-in-noise, speech-in-music, vocal music, instrumental music, and noise. The database has been manually labeled, leading to a total of 1272.5 seconds of "speech-in-quiet", 3637.5 seconds of "speech-in-noise", and 2430 seconds of "noise". These classes have been considered by our patients as the most practical ones in their daily life. Note that, within the context of the application at hand, music files (both vocal and instrumental) have also been categorized as "noise" sources, since emphasis is placed here on improving speech intelligibility, the first priority for the patients. All sound files are monophonic, have been sampled at fS = 16 kHz, and have been coded with 16 bits per sample. Both speech and music files were provided by D. Ellis, and were recorded by Scheirer and Slaney [39]. This database [40] has already been used by different authors in a number of works [39, 41–43].

The designers made a strong attempt to collect a data set that represents as much of the breadth of available input signals as possible, as follows.

(i) Speech was recorded by digitally sampling FM radio stations, using a variety of stations, content styles, and levels. This variety of sounds makes it possible to test the robustness of the classification system as a function of different sound input levels. Additionally, the speech sounds were recorded from a uniformly distributed set of male and female speakers with the aim of making classification as robust as possible.

(ii) Music sounds include samples of jazz, pop, country, salsa, reggae, classical, various non-Western styles, various sorts of rock, and new age music, both with and without vocals.


(iii) Finally, noise samples contain sounds from the following diverse environments: aircraft, bus, cafe, car, kindergarten, living room, nature, school, shop, sports, traffic, train, and train station. These sources of noise have been artificially mixed with the speech files (with varying grades of reverberation) at different signal-to-noise ratios (SNRs) ranging from 0 to 10 dB. In a number of experiments, these values have been found to be representative enough of the following fact related to perceptual criteria: lower SNRs could be treated by the hearing aid as noise, and higher SNRs could be considered as clean speech.

The sound files exhibit different input levels, with a range of 30 dB between the lowest and the highest.

For proper training, validation, and testing, it is necessary for the database D0 to be divided into three different subsets. We formally write this as D0 = P0 ∪ V0 ∪ T0, where P0, V0, and T0 represent, respectively, the "training", "validation", and "test" subsets. These sets contain 2685 seconds (≈36%) for training, 1012.5 seconds (≈14%) for validation, and 3642.5 seconds (≈50%) for testing. This division has been done randomly, ensuring that the relative proportion of files of each category is preserved for each set. Since the number of patterns is high enough, no leave-m-out process has been used, and only one repetition has been made. However, as will be shown when designing the modified databases of sounds with the feedback of the different study cases (Section 4), a number of strategies based on the leave-one-out principle have been adopted in the effort of enhancing the generalization ability of the corresponding neural networks (Section 5).

4. Study Cases: Patients and Modified Databases

4.1. The Real Patients. Figure 3 illustrates the problem of acoustic feedback that often occurs in the system formed by a hearing aid and the outer ear of its user. G represents the effective gain corresponding to the forward path of the "normal" (no-feedback) signal processing of the hearing aid. On the other hand, F labels the equivalent feedback path between the loudspeaker and the microphone [2, 3, 10]. The closed-loop system so formed is stable if and only if the open-loop gain fulfills |G(ω)F(ω)| < 1, ∀ω ∈ [0, π], with positive feedback, that is, with phase lag ϕ[(G(ω)F(ω))] = n2π, ∀n ∈ Z [3]. If the gain is increased beyond this limit, the system begins to oscillate. Furthermore, at low gain margin to oscillation, GMO[dB] = −20 log|G(ω)F(ω)|, acoustic feedback degrades the quality of sound by producing howling or ringing. In the effort of avoiding significant distortion, a gain margin of at least 6 dB is recommended [3].
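A small sketch of the GMO computation, assuming the forward-path and feedback-path frequency responses G and F are available on a common frequency grid; for brevity the phase (positive-feedback) condition is ignored and the margin is evaluated at all frequencies, which is conservative.

import numpy as np

def gain_margin_to_oscillation_db(G, F):
    """GMO[dB] = -20*log10(|G(w)F(w)|) per frequency bin, for forward-path
    and feedback-path frequency responses sampled on the same grid."""
    return -20.0 * np.log10(np.abs(G * F))

def is_safely_stable(G, F, margin_db=6.0):
    """True if the open-loop gain stays at least margin_db below 0 dB at
    every frequency (the recommended 6-dB margin)."""
    return np.min(gain_margin_to_oscillation_db(G, F)) >= margin_db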

Thus the distortion caused by feedback often occurs in hearing instruments that operate with high levels of amplification and, as a result, with low GMO. To what extent the sounds are affected by feedback depends basically on the following: (1) the gain needed to compensate the loss at each frequency, (2) the type of hearing aid (in the ear (ITE), in the canal (ITC), or behind the ear (BTE)), and (3) the way this is coupled with the user's outer ear (which, at the end, determines the GMO).

Figure 3: Representation of the problem of acoustic feedback in hearing aids. G represents the effective gain corresponding to the forward path of the "normal" (no-feedback) signal processing of the hearing aid. F labels the equivalent feedback path between the loudspeaker and the microphone.

This is precisely the case for our 18 study examples: Table 1 lists, for each patient {P1, ..., P18} (and the corresponding hearing aid {HA1, ..., HA18}), the GMO (dB) as a function of frequency (Hz). A more detailed description of the real patients from whom the data were collected can be found in [10].

4.2. Creating Feedback-Affected Training Pattern Sets. The more realistic sounds (with feedback) that each patient Pi hears have been generated as explained in [10]. The equivalent feedback path, F, which contains the combination of all the feedback factors, has been modeled from a variety of empirical studies [2]. The gain curves for each patient (and his/her corresponding hearing aid) have been designed to fulfill the gain requirements specified by the FIG6 prescription method [6]. For further details, reference [10] contains an extensive description of the compression and the gain curves. Without going into details, and aiming at explaining it as simply as possible, the simulator works as a "filter" that takes "feedback-free" sounds and generates "feedback-affected" sounds like those the user Pi usually hears. As illustrated in Figure 4, for any patient Pi, his/her particular feedback-affected sound patterns are generated by filtering each of the sound samples belonging to the original database, D0, by means of the filter Fi. This provides, at the end, the modified database with feedback, Di, the subscript i meaning that it contains the feedback distortions corresponding to patient Pi. Bear in mind that any modified database Di may be formally written as Di = Pi ∪ Vi ∪ Ti, where Pi, Vi, and Ti are, respectively, the modified training, validation, and test subsets.
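Conceptually, the "filtering" can be sketched as a sample-by-sample closed loop; the sketch below is a strong simplification (a single broadband gain g and an FIR feedback path f_ir) of the simulator of [10], which additionally models FIG6 gain curves and compression.

import numpy as np

def simulate_feedback(x, f_ir, g):
    """Simplified closed loop: mic[n] = x[n] + feedback[n], y[n] = g * mic[n],
    where the feedback is the FIR path f_ir applied to past loudspeaker
    samples (at least one sample of delay is assumed). For gains above the
    critical gain the loop diverges, which mimics the onset of howling."""
    L = len(f_ir)
    y_hist = np.zeros(L)          # most recent loudspeaker samples, newest first
    mic = np.zeros(len(x))
    for n in range(len(x)):
        fb = float(np.dot(f_ir, y_hist))   # acoustic feedback component
        mic[n] = x[n] + fb                 # feedback-affected microphone signal
        y_hist = np.roll(y_hist, 1)
        y_hist[0] = g * mic[n]             # loudspeaker output fed back later
    return mic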

Once the initial, "unfiltered" database has been conveniently modified, the next step consists in training and validating the NN-based classifier by using the subsets Pi and Vi. From a conceptual point of view, it is very important to remark that, when the original classifier is trained by using a variety of different feedback-affected training sets, Pi, it is possible to finally obtain dissimilar classifiers. This is because they have learnt different patterns (since the training sets Pi are different from each other, i = 1, ..., NP) and, finally, they could work in a different manner.


Table 1: Gain margin to oscillation (GMO[dB]) as a function of frequency (Hz) for each of the 18 patients ({P1, ..., P18}) and the corresponding hearing aid ({HA1, ..., HA18}). ITE, ITC, and BTE mean, respectively, "in the ear", "in the canal", and "behind the ear". The columns labeled "125, ..., 8 k" list the GMO[dB] in each of these bands (from 125 Hz to 8 kHz).

GMO(dB) as a function of frequency (Hz)

Pi HAi 125 250 500 1 k 2 k 3 k 4 k 6 k 8 k

P1 ITE 46 45 44 41 31 24 16 10 6

P2 ITE 43 41 40 39 31 22 16 12 7

P3 ITE 33 32 38 37 34 22 9 10 6

P4 ITC 43 42 42 39 26 18 13 6 6

P5 ITC 41 41 39 39 34 21 15 6 6

P6 ITC 43 42 41 41 36 28 23 6 7

P7 ITE 45 44 44 41 31 29 13 11 6

P8 BTE 46 45 40 27 24 24 12 12 14

P9 BTE 48 46 46 40 40 38 26 30 26

P10 BTE 50 49 47 29 29 24 21 30 27

P11 ITC 42 42 40 39 30 21 18 6 11

P12 ITC 41 39 40 40 20 15 12 6 7

P13 ITE 41 39 41 41 37 33 22 16 12

P14 ITE 44 43 43 42 36 24 9 9 6

P15 ITE 43 41 38 35 25 13 9 11 6

P16 ITE 41 39 43 39 17 12 9 11 7

P17 ITC 43 43 41 34 30 23 18 6 10

P18 ITC 43 42 41 40 25 15 13 6 8

5. Learning-and-Testing Strategies

All the strategies that will be explained in this section, although conceptually different, will have their results statistically characterized by using the mean value (μ) and the standard deviation (σ), or equivalently the variance (σ²), of their error probability (Pe) computed over properly designed test sets. For the sake of clarity, and to avoid unnecessarily repeating the same concept particularized to the different approaches, the error probability estimated for each method will be assumed to be a random variable. Thus, if X labels any of these random variables that, for instance, takes M values, xi, i = 1, ..., M, its mean value and variance can be computed by using the estimators [33] that follow:

\[
E[X]\equiv\mu=\frac{1}{M}\sum_{i=1}^{M}x_{i},\qquad
\mathrm{Var}[X]\equiv\sigma^{2}=\frac{1}{M}\sum_{i=1}^{M}\left(x_{i}-\mu\right)^{2}, \tag{5}
\]

where E[·] denotes the mathematical expectation of X, and Var[·] its variance.

Once we have the tools for statistically characterizing the learning-and-testing strategies, we can proceed further in studying them more deeply. In this respect, although with some details that will be explained later on, the methods we have explored can be conceptually categorized into two different groups of learning strategies: training without feedback (the "conventional approach", which will be outlined in Section 5.1) and the novel feedback-including learning strategies this paper centers on ("training with feedback", which will be described in Section 5.2).

Figure 4: Representation of the way any feedback-affected database (Dj, j = 1, ..., NP) is created from the original one, D0, by means of the filter Fj.

5.1. "Conventional Approach": Training without Feedback. The purpose of exploring this procedure is to clearly discern and quantify to what extent the learning strategies to be explored, which aim at properly introducing the feedback effects, are effective or not. The "conventional approach" refers to the one, often used, in which the NN is designed by using the unfiltered (feedback-free) training and validation subsets, P0, V0 ⊂ D0. The NN-based classifier so trained, the one that minimizes the error probability over the validation set V0, has been labeled NN0, where the subscript "0" means "feedback-free". Since D0 has been modified according to 18 configurations of real patients, the classifier can now be tested by using the generated test sets Tj, j = 1, ..., 18, corresponding to these study cases. Using the estimators (5), the mean value and the standard deviation of the error probabilities have been found to be, respectively, μconv[Pe] = 8.16% and σconv[Pe] = 0.31%, the subscript "conv" labeling the "conventional approach".


5.2. Novel Approaches: Learning by Using Patterns with Feedback. Please note that, as pointed out in Section 4, and as occurs in other biomedical problems, the drawback here consists in the fact that there is a large amount of data for a relatively small number of samples (patients). This "high-dimensional problem" causes difficulties for most of the long-established classifiers. This is precisely one of the motivations that has compelled us to explore novel strategies with the aim of improving the classifier's ability to generalize.

These approaches are inspired by a leave-one-out strategy that aims to enhance the generalization ability of the so-designed NN. Of course, a more general m-fold cross-validation technique could potentially be used by randomly dividing the corresponding data into m disjoint sets of equal size n/m, n being the total number of sound patterns [35]. The problem is that splitting the data into m equal parts is very difficult because of the low number of patients. (Note that the number of patients is much smaller than the number of sound samples in the database.) On the other hand, a bootstrap set [35] could be created by randomly selecting n points from the training set with replacement. This process could be independently repeated B times to yield B bootstrap sets, which may be treated as independent sets. If B < n, this will result in some saving in computation time. But, as mentioned, we do not need to save these computational resources because the design phase is carried out off-line on a desktop computer.

Therefore, the learning strategies we have explored, which aim to include sounds with feedback based on a leave-one-out strategy, can be categorized into the three strategies that follow.

5.2.1. Training with an Average Feedback-Affected Set. This strategy works as follows (a minimal sketch of the underlying leave-one-out loop is given after this list).

(1) Take the test set Tj and leave it out to be used later for testing.

(2) Create the "average" design set involving feedback-affected data from the remaining NP − 1 patients (Pi, i = 1, ..., NP; i ≠ j). For doing so, the following is required: (a) computing the feedback-modified design sets (Pi, Vi, i = 1, ..., NP; i ≠ j), (b) estimating the features that properly describe the sounds belonging to such modified sets, as explained in Section 2.2, and (c) statistically characterizing these features by estimating their mean value and standard deviation.

(3) Train the NN-based classifier with the "average" features, leading to an NN that we have labeled NNav,j. In this notation, the subscript "av" stands for "average", while j means that its performance will be tested with the feedback-modified test set, Tj, corresponding to the patient Pj that has been left out during the design stage.

(4) Estimate the error probability over Tj.

(5) Repeat step 1 in order for index j to sweep all possiblevalues ( j = 1, . . . ,NP).

Please note that the aforementioned approach exhibitstwo interesting aspects as follows.

(a) This strategy creates NP different NN-based classifiers:
$$\{\mathrm{NN}_{\mathrm{av},1}, \mathrm{NN}_{\mathrm{av},2}, \ldots, \mathrm{NN}_{\mathrm{av},N_P}\}. \qquad (6)$$

(b) According to the leave-one-out strategy adopted, each of the learning machines NNav,j is tested with the feedback-affected data that have been kept out during the design process, Tj. The use of the estimators (5) leads to μav[Pe] = 7.63% and σav[Pe] = 0.79%.
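The following Python sketch summarizes the loop above. All the helper names (`feedback_filter`, `extract_features`, `train_nn`, `test_nn`) are hypothetical placeholders for the patient-specific feedback model, the feature extraction of Section 2.2, and the NN design and evaluation; they are not identifiers from the paper.

import numpy as np

def average_feedback_strategy(patients, train_sounds, val_sounds, test_sets,
                              feedback_filter, extract_features, train_nn, test_nn):
    errors = []
    for j, _ in enumerate(patients):
        # (1) leave the test set T_j of patient j out
        # (2) build the "average" design set from the remaining N_P - 1 patients
        train_feats = [extract_features(feedback_filter(train_sounds, p))
                       for i, p in enumerate(patients) if i != j]
        val_feats = [extract_features(feedback_filter(val_sounds, p))
                     for i, p in enumerate(patients) if i != j]
        avg_train, avg_val = np.mean(train_feats, axis=0), np.mean(val_feats, axis=0)
        # (3) train NN_{av,j} on the averaged features
        nn_av_j = train_nn(avg_train, avg_val)
        # (4) estimate the error probability over the left-out, feedback-affected T_j
        errors.append(test_nn(nn_av_j, test_sets[j]))
    # (5) after sweeping j = 1..N_P, report mean and standard deviation
    return float(np.mean(errors)), float(np.std(errors, ddof=1))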

5.2.2. Training with “All” (Massive Database). What motivates this strategy is to answer the question of whether or not the creation of a large training set, with information involving the different feedback phenomena of all the NP patients, could enrich the learning process. This creation process can be summarized as follows.

(1) Take a representative patient Pj and leave his/her test set (Tj) out, to be used in the subsequent test.

(2) In the double effort of (a) designing a database characterized by containing very rich information encompassing all the feedback phenomena, and (b) preventing the NN from overfitting, we have created the training set
$$P_{\forall i \neq j} = \bigcup_{\forall i \neq j}^{N_P} P_i,$$
which we have labeled P∀i≠j because it contains information from all the patients' databases except that of the one (Pj) that has been left out for testing.

(3) Create in a similar way the validation set $V_{\forall i \neq j} = \bigcup_{\forall i \neq j}^{N_P} V_i$.

(4) Train and validate with P∀i≠j and V∀i≠j, respectively.

(5) The NN so designed, NN∀i≠j,j, is the one that minimizes the error probability over the validation set V∀i≠j.

(6) Test with the feedback-affected test set that has been left out for this purpose, Tj, thus obtaining p∀i≠j,j.

(7) Repeat steps 1–6 for all the remaining patients.

By making use of arguments along the same line of reasoning as those in the previous methods, the error probability achieved by the classifier trained and tested with this approach can be characterized by its mean value, which has been found to be μall[Pe] = 7.88%, and its standard deviation, σall[Pe] = 0.40%.

As the reader has probably noted, the feedback-affected learning methods explored so far reduce the mean error probability when compared to that achieved by the conventional approach (8.16%); however, some of them, for instance the “average” training, reduce the mean value at the expense of increasing the variance. The method described below has been designed in the effort of reducing this error probability even further, both in mean value and in standard deviation.


5.2.3. Improved Training Based on Selecting the Best NN Among All the Patients. In this approach, inspired again by a leave-one-out strategy, we have performed the following sequence of operations (a schematic sketch is given after the list).

(1) Take one of the case-study patients, say Pj, and leave his/her test set Tj out, to be used later for testing.

(2) Train NP − 1 different NNs by making use of the remaining training and validation sets, Pi and Vi, i = 1, ..., NP, i ≠ j.

(3) Among these NNs ({NNbest,i, i ≠ j}), select the best one in terms of validation error.

(4) This NN, now labeled NNbest,i≠j, is tested with the feedback-affected sounds belonging to the test set Tj that has been kept out during the design stage. Let the so-calculated error probability be labeled pbest,j.

(5) Repeat this process for all the remaining patients.

When extending this method to all the patients in Table 1, the process finally provides NP values, {pbest,1, pbest,2, ..., pbest,NP}, whose mean value and standard deviation have been found to be, respectively, μbest[Pe] = 6.77% and σbest[Pe] = 0.18%.
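The sketch below schematizes this “best NN” selection. The helper names (`train_nn`, `validation_error`, `test_error`) and the per-patient design sets P[i], V[i], T[i] are hypothetical placeholders, not identifiers from the paper.

import numpy as np

def best_nn_strategy(P, V, T, train_nn, validation_error, test_error):
    n_patients = len(T)
    p_best = []
    for j in range(n_patients):                          # (1) leave T[j] out
        candidates = []
        for i in range(n_patients):
            if i == j:
                continue
            nn_i = train_nn(P[i], V[i])                   # (2) train N_P - 1 NNs
            candidates.append((validation_error(nn_i, V[i]), nn_i))
        _, nn_best = min(candidates, key=lambda c: c[0])  # (3) best validation error
        p_best.append(test_error(nn_best, T[j]))          # (4) test on the left-out T[j]
    # (5) repeating for all patients yields {p_best,1, ..., p_best,N_P}
    return float(np.mean(p_best)), float(np.std(p_best, ddof=1))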

6. Discussion

Figure 5 summarizes the results we have obtained in the effort of validating the proposed learning-and-testing techniques. We have chosen a representation that illustrates the results as a function of the mean value and the standard deviation of the error probability estimated for each of the different procedures. The key points to note in Figure 5 are as follows.

(1) The conventional method performs worse than the novel strategies: the neural network-based classifier designed with the conventional approach (NN0) has a mean error probability (μconv[Pe] = 8.16%) which is higher than those achieved by the classifiers trained with the novel methods we have proposed (μall[Pe] = 7.88%, μav[Pe] = 7.63%, μbest[Pe] = 6.77%). This is because NN0 has learnt patterns that are not representative enough of sounds with feedback.

(2) Although the novel approaches labeled “all” and “average” assist the different classifiers in reducing the mean error probability, they achieve this at the expense of increasing the standard deviation when compared to that of the conventional approach.

(3) On the contrary, the novel strategy called “best” makes the corresponding sound classifier achieve the best results, since it helps reduce the mean error probability (μbest[Pe] = 6.77%) while maintaining the standard deviation at a low value (≈0.18%).

(4) One important question that could arise when comparing the mean error probabilities estimated by using the different approaches is whether these different values are statistically significant. To answer this question, it is important to analyze the relative error associated with the estimates.

[Figure 5: Two-dimensional plot illustrating the mean value (horizontal axis) and the standard deviation (vertical axis, ×10⁻²) of the error probability corresponding to the different learning-and-testing strategies explored in this work: conventional, “average”, massive database (“all”), and “best” feedback-affected neural network.]

The relative error εPe of the error probability estimator Pe, for a given confidence level α, is given by [44]:
$$\varepsilon_{P_e}^{2} \leq \frac{(1 - P_e)\,\big[Q^{-1}(\alpha/2)\big]^{2}}{M \cdot P_e}, \qquad (7)$$
where Pe is the probability to be estimated, M is the number of elements in the test set, and Q⁻¹(x) is the inverse of the Gaussian tail function
$$Q(x) = \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{t^{2}}{2}\right) dt. \qquad (8)$$
For our application, and considering a confidence level of α = 0.99, the relative error of the estimation has been found to fulfill εPe < 0.0095. That is, differences among error probabilities greater than this value must be considered significant.
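As a side note, bound (7) is straightforward to evaluate numerically. The short Python transcription below follows the formula and the interpretation of the confidence parameter exactly as given in the text; the values of Pe and M one would plug in are application-dependent and are not reproduced here.

from math import sqrt
from scipy.stats import norm

def relative_error_bound(pe, m, alpha):
    # upper bound on the relative error of the error-probability estimator, Eq. (7)
    q_inv = norm.isf(alpha / 2.0)   # Q^{-1}(alpha/2), Q being the Gaussian tail function
    return sqrt((1.0 - pe) * q_inv ** 2 / (m * pe))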

The reason why the learning-and-testing strategy called “best” reaches the lowest error probability in both mean and standard deviation (μbest[Pe] = 6.77% and σbest[Pe] = 0.18%) is that, in this approach, the classifier finally created is the neural network selected among the neural networks that, as pointed out in Section 5.2.3, have been trained with feedback-affected sets in a leave-one-out strategy that prevents them from overfitting. Although not clear at first sight, the key point to understand this is that, as emphasized in Section 5.2.3, for any patient Pj kept out for later testing, the algorithm trains NP − 1 different NNs (by making use of the remaining training and validation sets, Pi and Vi, i = 1, ..., NP, i ≠ j) and selects the best network in terms of validation error. By repeating this procedure for the remaining NP − 1 patients (study cases), this process finally includes the information from all the patients and picks out the “best” neural network, that is, the one that has learnt better:


the lowest validation error is equivalent to saying that the neural network has acquired the knowledge of the realistic sounds the patients usually hear, and that it is not excessively specialized (overfitted), since it is able to properly classify novel sounds that it has never heard before.

This feedback-including learning strategy helps the embedded classifier improve its performance because it works more robustly, in the sense that it reaches the lowest error probability, which in turn enhances the user's comfort: the hearing aid itself selects (with higher success probability) the program best adapted to the varying acoustic environment in which the hearing aid user is listening.

7. Conclusions

Feedback appears when part of the amplified output signal produced by a digital hearing aid returns through the auditory canal and enters the device again. Sometimes feedback may cause the hearing aid to become unstable, producing irritating howls. Avoiding instability requires limiting the maximum gain that can be used to compensate the patient's hearing loss, and the use of algorithms for controlling acoustic feedback.

However, even without reaching the limit of instability, feedback often negatively affects the performance of hearing aids that operate with high levels of gain (and, as a result, with a low gain margin to oscillation), so that the sound processed in the device is not only a “clean” version of the external acoustic environment but also contains annoying distortions. As a result, the sound classifier does not work properly if the classification algorithm is designed (as usual) by using clean (feedback-free) sounds. Our research aims to take these distortions caused by feedback into account when training the classification algorithm. The original sound database, which consists of samples of different environmental sounds, has been modified according to various configurations (NP = 18) of real patients (hearing loss and hearing instrument). Such properly feedback-affected databases, Di = Pi ∪ Vi ∪ Ti, composed of training, validation, and test sets, Pi, Vi, and Ti, respectively, have been created by filtering “normal sounds” (without feedback) by means of systems that model the dissimilar interaction between the hearing aid and the user's auditory system.

Making use of these modified databases {Di, i = 1, ..., 18}, we have explored three feedback-enriched learning-and-testing strategies.

The first one has evaluated whether or not training and validating with a set that contains an “average” feedback would lead to a significant reduction in the classification error probability.

The second one has centered on answering the question of whether or not the creation of a large training set (with information involving the different feedback phenomena of all the real patients) could enrich the learning process.

In the third method, for any test set Tj reserved for future testing (in a leave-one-out strategy aiming at preventing overfitting), the algorithm has trained NP − 1 different neural networks (by making use of the remaining training and validation sets, Pi and Vi, i = 1, ..., NP, i ≠ j) and selected the best one in terms of validation error. By repeating this for the remaining patients, this method has been found to be able to include the information from all the patients, and to pick out the “best” neural network. “Best” means here the one that has learnt better: it has acquired the knowledge of the realistic sounds the patients usually hear, and it is not excessively specialized (overfitted) in learning such sounds, so that it is able to properly classify “novel” sounds that it has never heard before. This is the reason why this approach assists the classifier in reducing its mean error probability from 8.16% (in the conventional approach) down to 6.77%. This finally leads to enhanced user comfort: the hearing aid itself selects (with higher success probability) the program best adapted to the varying acoustic environment in which the patient is listening.

Acknowledgments

The authors gratefully acknowledge the valuable comments kindly suggested by the anonymous reviewers and the editor, which undoubtedly have greatly improved the quality of this paper. This work has been partially funded by the Comunidad de Madrid/Universidad de Alcalá (projects CCG08-UAH/TIC-4270, CCG07-UAH/TIC-1572, and CCG06-UAH/TIC-0378), the Spanish Ministry of Education and Science (TEC2006-13883-C04-04/TCM), and the Ministry of Science and Innovation (TEC2009-14414-C03-03).

References

[1] J. Kates, Digital Hearing Aids, Plural Publishing, San Diego, Calif, USA, 2008.

[2] J. Hellgren, T. Lunner, and S. Arlinger, “System identification of feedback in hearing aids,” The Journal of the Acoustical Society of America, vol. 105, no. 6, pp. 3481–3496, 1999.

[3] A. Spriet, Adaptive filtering techniques for noise reduction and acoustic feedback cancellation in hearing aids, Ph.D. thesis, Katholieke Universiteit Leuven, Leuven, Belgium, 2004.

[4] A. Spriet, G. Rombouts, M. Moonen, and J. Wouters, “Adaptive feedback cancellation in hearing aids,” Journal of the Franklin Institute, vol. 343, pp. 545–573, 2007.

[5] A. Spriet, K. Eneman, M. Moonen, and J. Wouters, “Objective measures for real-time evaluation of adaptive feedback cancellation algorithms in hearing aids,” in Proceedings of the 16th European Signal Processing Conference (EUSIPCO ’08), Lausanne, Switzerland, August 2008.

[6] H. Dillon, Hearing Aids, Boomerang Press, Turramurra, Australia, 2001.

[7] T. Fillon and J. Prado, “Evaluation of an ERB frequency scale noise reduction for hearing aids: a comparative study,” Speech Communication, vol. 39, no. 1-2, pp. 23–32, 2003.

[8] J. Alcantara, B. Moore, V. Kuhnel, and S. Launer, “Evaluation of the noise reduction system in a commercial digital hearing aid,” International Journal of Audiology, vol. 42, no. 1, pp. 34–42, 2003.

[9] J. Kates, “Feedback cancellation in hearing aids: results from a computer simulation,” IEEE Transactions on Signal Processing, vol. 39, no. 3, pp. 553–562, 1991.


[10] R. Vicen-Bueno, R. Gil-Pita, M. Utrilla-Manso, and L. Alvarez-Perez, “A hearing aid simulator to test adaptive signal processing algorithms,” in Proceedings of the IEEE International Symposium on Intelligent Signal Processing (WISP ’07), pp. 1–6, Xiamen, China, December 2007.

[11] R. Vicen-Bueno, A. Martinez-Leira, R. Gil-Pita, and M. Rosa-Zurera, “Acoustic feedback reduction based on filtered-x LMS and normalized filtered-x LMS algorithms in digital hearing aids based on WOLA filterbank,” in Proceedings of the IEEE International Symposium on Intelligent Signal Processing (WISP ’07), pp. 1–6, Xiamen, China, December 2007.

[12] J. Kates, “A computer simulation of hearing aid response and the effects of ear canal size,” The Journal of the Acoustical Society of America, vol. 83, no. 5, pp. 1952–1963, 1988.

[13] J. Maxwell and P. Zurek, “Reducing acoustic feedback in hearing aids,” IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 304–313, 1995.

[14] J. E. Greenberg, P. M. Zurek, and M. Brantley, “Evaluation of feedback-reduction algorithms for hearing aids,” The Journal of the Acoustical Society of America, vol. 108, no. 5, pp. 2366–2376, 2000.

[15] N. Abinwi and B. Rafaely, “Unbiased adaptive feedback cancellation in hearing aids by closed-loop identification,” IEEE Transactions on Speech and Audio Processing, vol. 14, no. 2, pp. 658–665, 2006.

[16] D. Freed and S. Soli, “An objective procedure for evaluation of adaptive antifeedback algorithms in hearing aids,” Ear and Hearing, vol. 27, no. 4, pp. 382–398, 2006.

[17] D. Freed, “Adaptive feedback cancellation in hearing aids with clipping in the feedback path,” The Journal of the Acoustical Society of America, vol. 123, no. 3, pp. 1618–1626, 2008.

[18] P. Nordqvist and A. Leijon, “An efficient robust sound classification algorithm for hearing aids,” The Journal of the Acoustical Society of America, vol. 115, no. 6, pp. 3033–3041, 2004.

[19] P. Nordqvist, Sound classification in hearing instruments, Ph.D. thesis, Royal Institute of Technology (KTH), Stockholm, Sweden, 2004.

[20] E. Alexandre, L. Cuadra, M. Rosa, and F. Lopez, “Speech/non-speech classification in hearing aids driven by tailored neural networks,” in Speech, Audio, Image and Biomedical Signal Processing Using Neural Networks, B. Prasad and S. M. Prassana, Eds., pp. 145–167, Springer, Berlin, Germany, 2008.

[21] E. Alexandre, L. Alvarez, L. Cuadra, M. Rosa, and F. Lopez, “Automatic sound classification algorithm for hearing aids using a hierarchical taxonomy,” in New Trends in Audio and Video, A. Dobrucki, A. Petrovsky, and W. Skarbek, Eds., vol. 1, Politechnika Bialostocka, 2006.

[22] E. Alexandre, L. Cuadra, L. Perez, M. Rosa-Zurera, and F. Lopez-Ferreras, “Automatic sound classification for improving speech intelligibility in hearing aids using a layered structure,” Lecture Notes in Computer Science, Springer, Berlin, Germany, 2006.

[23] E. Alexandre, M. Rosa, L. Cuadra, and R. Gil-Pita, “Application of Fisher linear discriminant analysis to speech/music classification,” in Proceedings of the 120th Audio Engineering Society Convention, Paris, France, May 2006.

[24] M. Buchler, S. Allegro, S. Launer, and N. Dillier, “Sound classification in hearing aids inspired by auditory scene analysis,” EURASIP Journal on Applied Signal Processing, vol. 2005, no. 18, pp. 2991–3002, 2005.

[25] M. Buchler, Algorithms for sound classification in hearing instruments, Ph.D. thesis, Swiss Federal Institute of Technology, Zurich, Switzerland, 2002.

[26] http://www.nia.nih.gov.

[27] http://www.rnid.org.uk.

[28] R. Vicen-Bueno, A. Martinez-Leira, R. Gil-Pita, and M. Rosa-Zurera, “Modified LMS-based feedback-reduction subsystems in digital hearing aids based on WOLA filter bank,” IEEE Transactions on Instrumentation and Measurement, vol. 58, no. 9, pp. 3177–3190, 2009.

[29] Dspfactory, Toccata Plus EDK 2.3.1 Getting Started Guide, Dspfactory, 2002.

[30] Dspfactory, Toccata Plus Evaluation and Development Board Manual, Dspfactory, 2002.

[31] E. Guaus and E. Batlle, “A non-linear rhythm-based style classification for broadcast speech-music discrimination,” in Proceedings of the 116th Audio Engineering Society Convention, Berlin, Germany, May 2004.

[32] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, NY, USA, 3rd edition, 1991.

[33] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1995.

[34] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. Muller, “Fisher discriminant analysis with kernels,” in Proceedings of the IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing, pp. 41–48, Madison, Wis, USA, August 1999.

[35] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley-Interscience, New York, NY, USA, 2001.

[36] S. A. Billings and K. L. Lee, “Nonlinear Fisher discriminant analysis using a minimum squared error cost function and the orthogonal least squares algorithm,” Neural Networks, vol. 15, no. 2, pp. 263–270, 2002.

[37] M. Hagan and M. Menhaj, “Training feedforward networks with the Marquardt algorithm,” IEEE Transactions on Neural Networks, vol. 5, no. 6, pp. 989–993, 1994.

[38] F. D. Foresee and M. T. Hagan, “Gauss-Newton approximation to Bayesian learning,” in Proceedings of the International Joint Conference on Neural Networks, vol. 1, pp. 1930–1935, Houston, Tex, USA, 1997.

[39] E. Scheirer and M. Slaney, “Construction and evaluation of a robust multifeature speech/music discriminator,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’97), pp. 1331–1334, Munich, Germany, September 1997.

[40] http://labrosa.ee.columbia.edu/sounds/musp/scheislan.html.

[41] B. Thoshkahna, V. Sudha, and K. R. Ramakrishnan, “A speech-music discriminator using HILN model based features,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’06), pp. 425–428, Toulouse, France, May 2006.

[42] A. L. Berenzweig and D. P. W. Ellis, “Locating singing voice segments within music signals,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA ’01), pp. 31–34, New Paltz, NY, USA, October 2001.

[43] G. Williams and D. P. W. Ellis, “Speech/music discrimination based on posteriori probability features,” in Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech ’99), Budapest, Hungary, September 1999.

[44] K. Shanmugam and P. Balaban, “A modified Monte Carlo simulation technique for the evaluation of error rate in digital communication systems,” IEEE Transactions on Communications, vol. 28, no. 11, pp. 1916–1924, 1980.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 456945, 12 pages
doi:10.1155/2009/456945

Research Article

Analysis of the Effects of Finite Precision in Neural Network-Based Sound Classifiers for Digital Hearing Aids

Roberto Gil-Pita (EURASIP Member), Enrique Alexandre, Lucas Cuadra (EURASIP Member), Raul Vicen, and Manuel Rosa-Zurera (EURASIP Member)

Departamento de Teoría de la Señal y Comunicaciones, Escuela Politécnica Superior, Universidad de Alcalá, 28805 Alcalá de Henares, Spain

Correspondence should be addressed to Roberto Gil-Pita, [email protected]

Received 1 December 2008; Revised 4 May 2009; Accepted 9 September 2009

Recommended by Hugo Fastl

The feasible implementation of signal processing techniques on hearing aids is constrained by the finite precision required to represent numbers and by the limited number of instructions per second available to implement the algorithms on the digital signal processor the hearing aid is based on. This adversely limits the design of a neural network-based classifier embedded in the hearing aid. Aiming at helping the processor achieve accurate enough results, and in the effort of reducing the number of instructions per second, this paper focuses on exploring (1) the most appropriate quantization scheme and (2) the most adequate approximations for the activation function. The experimental work proves that the quantized, approximated, neural network-based classifier achieves the same efficiency as that reached by “exact” networks (without these approximations) but, and this is the crucial point, with the added advantage of drastically reducing the computational cost on the digital signal processor.

Copyright © 2009 Roberto Gil-Pita et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

This paper focuses on exploring to what extent the use of a quantized, approximated neural network- (NN-) based classifier embedded in a digital hearing aid could appreciably affect the performance of this device. This statement probably makes the reader not directly involved in hearing aid design wonder:

(1) Why do the authors propose a hearing aid capable of classifying sounds?

(2) Why do they propose a neural network for classifying (if there are simpler solutions)?

(3) Why do they study the effects associated with quantizing and approximating it? Are these effects so important?

The first question is related to the fact that hearing aid users usually face a variety of sound environments. A hearing aid capable of automatically classifying the acoustic environment that surrounds its user, and of selecting the amplification “program” best adapted to such an environment (“self-adaptation”), would improve the user's comfort [1]. The “manual” approach, in which the user has to identify the acoustic surroundings and choose the adequate program, is very uncomfortable and frequently exceeds the abilities of many hearing aid users [2]. This illustrates the necessity for hearing aids to automatically classify the acoustic environment the user is in [3].

Furthermore, sound classification is also used in modern hearing aids as a support for the noise reduction and source separation stages, for example in voice activity detection (VAD) [4–6]. In this case, the objective is to extract information from the sound in order to improve the performance of these systems. This second kind of classifier differs from the first one in how often the classification is carried out. In the first case, a time scale of seconds should be enough, since it typically takes approximately 5–10 seconds for the hearing aid user to move from one listening environment to another [7], whereas in

Page 165: Digital Signal Processing for Hearing Instruments

2 EURASIP Journal on Advances in Signal Processing

the second case the information is required in shorter time slots.

The second question, related to the use of neural networks as the classifier of choice, is based on the fact that neural networks exhibit very good performance when compared to other classifiers [3, 8], but at the expense of consuming a significantly high percentage of the available computational resources. Although difficult, the implementation of a neural network-based classifier on a hearing aid has been proven to be feasible and convenient for improving classification results [9].

Finally, regarding the last question, the very core of our paper is motivated by the fact that the way numbers are represented is of crucial importance. The numbers of bits used to represent the integer and the fractional parts of a number have a strong influence on the final performance of the algorithms implemented on the hearing aid, and an improper selection of these values can lead to saturations or to a lack of precision in the operations of the DSP. This is just one of the topics, along with the limited precision, that this paper focuses on.

The problem with implementing a neural network-based sound classifier in a hearing aid is that DSP-based hearing aids have constraints in terms of computational capability and memory. The hearing aid has to work at low clock rates in order to minimize the power consumption and thus maximize the battery life. Additionally, the restrictions become stronger because a considerable part of the DSP computational capabilities is already being used for running the algorithms aiming to compensate the hearing losses. Therefore, the design of any automatic sound classifier is strongly constrained to the use of the remaining resources of the DSP. This restriction in the number of operations per second forces us to put special emphasis on signal processing techniques and algorithms tailored for properly classifying while using a reduced number of operations.

Related to the aforementioned problem is the search for the most appropriate way to implement an NN on a DSP. Most of the NNs we will be exploring consist of two layers of neurons interconnected by links with adjustable weights [10]. The way we represent such weights and the activation function of the neurons [10] may lead the classifier to fail.

Therefore, the purpose of this paper is to clearly quantify the effects of the finite-precision limitations on the performance of an automatic sound classification system for hearing aids, with special emphasis on the two aforementioned phenomena: the effects of the finite word length for the weights of the NN used for the classification, and the effects of the simplification of the activation functions of the NN.

With these ideas in mind, the paper has been structured as follows. Section 2 will introduce the implemented classification system, describing the input features (Section 2.1) and the neural network (Section 2.2). Section 3 will define the considered problems: the quantization of the weights of the neural network, and the use of approximations for the activation functions. Finally, Section 4 will describe the database and the protocol used for the experiments and will show the results obtained, which will be discussed in Section 5.

2. The System

It basically consists of a feature extraction block and the aforementioned classifier based on a neural network.

2.1. Feature Extraction. There is a number of interesting features that could potentially exhibit different behavior for speech, music, and noise, and thus may help the system classify the sound signal. In order to carry out the experiments of this paper we have selected a subset of them that provides a high discriminating capability for the problem of speech/nonspeech classification along with a considerably low associated computational cost [11]. This will assist us in testing the methods proposed in this paper. Note that the priority of the paper is not to propose these features as the best ones for all the problems considered in the paper, but to establish a set of strategies and techniques for efficiently implementing a neural network classifier in a hearing aid. We briefly describe the features below to make the paper self-contained. The features used to characterize any sound frame are as follows.

Spectral Centroid. The spectral centroid of the ith frame can be associated with a measure of the brightness of the sound, and is obtained by evaluating the center of gravity of the spectrum. The centroid can be calculated by making use of the formula [12, 13]:
$$\mathrm{Centroid}_i = \frac{\sum_{k=1}^{K} |\chi_i(k)| \cdot k}{\sum_{k=1}^{K} |\chi_i(k)|}, \qquad (1)$$
where χi(k) represents the kth frequency bin of the spectrum at frame i, and K is the number of samples.

Voice2White. This parameter, proposed in [14], is a measure of the energy inside the typical speech band (300–4000 Hz) with respect to the whole energy of the signal:
$$\mathrm{V2W}_i = \frac{\sum_{k=M_1}^{M_2} |\chi_i(k)|^{2}}{\sum_{k=1}^{K} |\chi_i(k)|^{2}}, \qquad (2)$$
where M1 and M2 are the first and the last indices of the bins encompassed in the considered speech band.

Spectral Flux. It is associated with the amount of spectral change over time and is defined as follows [13]:
$$\mathrm{Flux}_i = \sum_{k=1}^{K} \big( |\chi_i(k)| - |\chi_{i-1}(k)| \big)^{2}. \qquad (3)$$

Short-Time Energy (STE). It is defined as the mean energy of the signal within each analysis frame (K samples):
$$\mathrm{STE}_i = \frac{1}{K} \sum_{k=1}^{K} |\chi_i(k)|^{2}. \qquad (4)$$


Finally, the features are calculated by estimating the mean value and the standard deviation of these measurements over M different time frames:
$$\mathbf{x} = \begin{pmatrix} E\{\mathrm{Centroid}_i\} \\ E\{\mathrm{V2W}_i\} \\ E\{\mathrm{Flux}_i\} \\ E\{\mathrm{STE}_i\} \\ \big(E\{\mathrm{Centroid}_i^{2}\} - E\{\mathrm{Centroid}_i\}^{2}\big)^{1/2} \\ \big(E\{\mathrm{V2W}_i^{2}\} - E\{\mathrm{V2W}_i\}^{2}\big)^{1/2} \\ \big(E\{\mathrm{Flux}_i^{2}\} - E\{\mathrm{Flux}_i\}^{2}\big)^{1/2} \\ \big(E\{\mathrm{STE}_i^{2}\} - E\{\mathrm{STE}_i\}^{2}\big)^{1/2} \end{pmatrix}, \qquad (5)$$
where, for the sake of simplicity, we label $E\{\cdot\} \equiv (1/M)\sum_{i=1}^{M}(\cdot)$.

It is interesting to note that some of the features depend on the square amplitude of the input signal. As will be shown, the sound database includes sounds at different levels, in order to make the classification system more robust against these variations.
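For illustration, the following NumPy sketch computes the frame-level features (1)–(4) and the summary vector (5). It is a generic approximation: the paper obtains the magnitude spectra from a WOLA filterbank, whereas here they are assumed to be given, and the conversion of the 300–4000 Hz band edges to bin indices is an assumption of this sketch.

import numpy as np

def frame_features(frames, fs):
    # frames: 2-D array (num_frames, K) of per-frame magnitude spectra |chi_i(k)|
    num_frames, K = frames.shape
    k = np.arange(1, K + 1)
    m1, m2 = int(300 * 2 * K / fs), int(4000 * 2 * K / fs)  # assumed band-edge mapping
    centroid = (frames * k).sum(axis=1) / frames.sum(axis=1)               # Eq. (1)
    v2w = (frames[:, m1:m2] ** 2).sum(axis=1) / (frames ** 2).sum(axis=1)  # Eq. (2)
    diff = np.vstack([np.zeros(K), np.diff(frames, axis=0)])  # |chi_i| - |chi_{i-1}|
    flux = (diff ** 2).sum(axis=1)                                         # Eq. (3)
    ste = (frames ** 2).mean(axis=1)                                       # Eq. (4)
    return np.column_stack([centroid, v2w, flux, ste])

def feature_vector(frames, fs):
    # eight-dimensional vector x of Eq. (5): means and standard deviations over M frames
    f = frame_features(frames, fs)
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])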

2.2. Classification Algorithm

2.2.1. Structure of a Neural Network. Figure 1 shows a simple Multilayer Perceptron (MLP) with L = 8 inputs, N = 2 hidden neurons, and C = 3 outputs, interconnected by links with adjustable weights. Each neuron applies a linear combination of its inputs to a nonlinear function called the activation function. In our case, the model of each neuron includes a nonlinear activation function (the hyperbolic tangent function), which can be calculated using the following expression:

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \qquad (6)$$

From the expression above it is straightforward to see that implementing this function on the hearing aid DSP is not an easy task, since an exponential and a division need to be computed. This motivates the need for exploring simplifications of this activation function that could provide similar results in terms of probability of error.

The number of neurons in the input and output layers is clear: the input neurons (L) represent the components of the feature vector, and thus their number depends on the number of features used in each experiment. On the other hand, the number of neurons in the output layer (C) is determined by the number of audio classes to classify: speech, music, or noise.

The network also contains one layer of N hidden neurons that is not part of the input or output of the network. These N hidden neurons enable the network to learn complex tasks by extracting progressively more meaningful features from the input vectors. But what is the optimum number of hidden neurons N? The answer to this question is related to the adjustment of the complexity of the network [10]. If too many free weights are used, the capability to generalize will be poor; on the contrary, if too few parameters are considered, the training data cannot be learned satisfactorily.

One important fact that must be considered in the implementation of an MLP is that a scale factor in one of the inputs (x′n = xn·k) can be compensated by a change in the corresponding weights of the hidden layer (v′nm = vnm/k, for m = 1, ..., L), so that the outputs of the linear combinations (am) are not affected (v′nm·x′n = vnm·xn). This fact is important, since it allows scaling each feature so that it uses the entire dynamic range of the numerical representation, minimizing the effects of the finite precision on the features without affecting the final performance of the neural network.

Another important property of the MLP is related to the output of the network. Considering that the activation function is a monotonically increasing function, if zi > zj, then bi > bj. Therefore, since the final decision is taken by comparing the outputs of the neural network and looking for the greatest value, once the network is trained there is no need to determine the complete output of the network (zi); it is enough to determine the linear combinations of the output layer (bi). Furthermore, a scale factor applied to the output weights (w′nc = k·wnc, for n = 0, ..., N and c = 1, ..., C) does not affect the final performance of the network, since if bi > bj, then k·bi > k·bj. This property allows scaling the output weights so that the maximum value of wnc uses the entire dynamic range, minimizing the effects of the limited precision on the quantization of the output weights.

In this paper, all the experiments have been carried out using MATLAB's Neural Network Toolbox [15], and the MLPs have been trained using the Levenberg-Marquardt algorithm with Bayesian regularization. The main advantage of using regularization techniques is that the generalization capabilities of the classifier are improved, and that it is possible to obtain better results with smaller networks, since the regularization algorithm itself prunes those neurons that are not strictly necessary.
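The short Python sketch below illustrates the forward pass of the two-layer MLP of Figure 1 and the property just discussed: since the output activation is monotonic, the class decision can be taken directly on the output-layer linear combinations. Weight shapes and names are illustrative, not taken from the toolbox.

import numpy as np

def mlp_decide(x, V, v0, W, w0, activation=np.tanh):
    # x: (L,) features; V: (L, N) hidden weights, v0: (N,) biases
    # W: (N, C) output weights, w0: (C,) biases; returns the winning class index
    a = x @ V + v0            # hidden-layer linear combinations a_m
    y = activation(a)         # hidden activations
    b = y @ W + w0            # output-layer linear combinations b_c
    return int(np.argmax(b))  # argmax of b equals argmax of f(b), f being monotonic

# toy usage with L = 8 inputs, N = 2 hidden neurons, C = 3 classes
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
V, v0 = rng.standard_normal((8, 2)), rng.standard_normal(2)
W, w0 = rng.standard_normal((2, 3)), rng.standard_normal(3)
print(mlp_decide(x, V, v0, W, w0))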

3. Definition of the Problem

As mentioned in the introduction, there are two different (although strongly linked) topics that play a key role in the performance of the NN-based sound classifier and that constitute the core of this paper. The first one, the quantization of the NN weights, will be described in Section 3.1, while the second issue, the feasibility of simplifying the NN activation function, will be stated in Section 3.2.

3.1. The Quantization Problem. Most current DSPs for hearing aids make use of a 16-bit word-length Harvard architecture, and only modern hearing instruments have a larger internal bit range for number representation (22–24 bits). In some cases, the use of larger numerical representations is reserved for the filterbank analysis and synthesis stages, or for the Multiplier/ACcumulator (MAC) that multiplies 16-bit registers and stores the result in a 40-bit accumulator. In this paper we have focused on this last case.


[Figure 1: Multilayer Perceptron (MLP) diagram, showing the inputs x1–x8, the hidden-layer weights vnm, the hidden-layer linear combinations a1, a2 with activation function f(·), the outputs y1, y2 of the hidden layer, the output-layer weights wnc, the output-layer linear combinations b1–b3, and the network outputs z1–z3.]

We thus have 16 bits to represent numbers and, as a consequence, there are several possible 16-bit fixed-point quantization formats. It is important to highlight that in those modern DSPs that use larger numerical representations the quantization problem is minimized, since there are several configurations that yield very good results. The purpose of our study is to demonstrate that a 16-bit numerical representation configured in a proper way can produce considerably good results in the implementation of a neural classifier.

The way numbers are represented on a DSP is of crucial importance. Fixed-point numbers are usually represented by using the so-called “Q number format.” Within the application at hand, the most commonly used notation is “Qx.y”, where

(i) Q labels that the signed fixed-point number is in the “Q format notation,”

(ii) x symbolizes the number of bits used to represent the 2's complement of the integer portion of the number,

(iii) y designates the number of bits used to represent the 2's complement of the fractional part of such a number.

For example, using a numerical representation of 16 bits, we could decide to use the quantization Q16.0, which is used for representing 16-bit 2's complement integers. Or we could use Q8.8 quantization, which, in turn, means that 8 bits are used to represent the 2's complement of the integer part of the number, and 8 bits are used to represent the 2's complement of the fractional portion; or Q4.12, which assigns 4 bits to the integer part and 12 bits to the fractional portion; and so forth. The question arising here is: what is the most adequate quantization configuration for the hearing aid performance?

Apart from this question, to be answered later on, there is also a crucial problem related to the small number of bits available to represent the integer and the fractional parts of numbers: the limited precision. Although not clear at first glance, it is worth noting that a low number of bits for the integer part may cause the register to saturate, while a low number of bits in the fractional portion may cause a loss of precision in the number representation.
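The following generic Python sketch of a Qx.y quantizer (with x + y = 16) illustrates the two failure modes just mentioned: saturation when the integer part has too few bits, and loss of precision when the fractional part has too few bits. It is an illustration, not the DSP's arithmetic.

import numpy as np

def quantize_q(value, int_bits, frac_bits):
    # round `value` to the nearest representable Qx.y number, saturating at the ends
    assert int_bits + frac_bits == 16, "16-bit word assumed"
    scale = 2 ** frac_bits
    q = np.round(np.asarray(value, dtype=float) * scale)
    q = np.clip(q, -2 ** 15, 2 ** 15 - 1)   # 2's-complement saturation
    return q / scale

print(quantize_q(3.14159, 5, 11))   # Q5.11: value represented accurately
print(quantize_q(3.14159, 1, 15))   # Q1.15: saturates near +1
print(quantize_q(3.14159, 16, 0))   # Q16.0: integer only, fractional part lost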

3.2. The Problem of Approximating the Activation Function. As previously mentioned, the activation function in our NN is the hyperbolic tangent function which, in order to be implemented on a DSP, requires a proper approximation. To what extent an approximation of f is adequate is a balance between how well it “fits” f and the number of instructions the DSP requires to compute it. In the effort of finding a suitable approximation, in this work we have explored two different approximations of the hyperbolic tangent function f. In general, the way an approximation fits f depends on a design parameter φ, whose optimum value has to be computed by minimizing some kind of error function. In this paper we have decided to minimize the root mean square error (RMSE) for input values uniformly distributed from −5 to +5:

$$\mathrm{RMSE}\big(f, \hat{f}\big) = +\sqrt{E\Big\{\big(f(x) - \hat{f}(x)\big)^{2}\Big\}}. \qquad (7)$$

The first practical implementation for approximating f(x) is, with some corrections that will be explained below, based on a table containing 2^n = 256 values of f(x) = tanh(x). Such an approximation, which makes use of


256 tabulated values, has been labeled fT256(x) and, for reasons that will be explained below, has been defined as

$$f_{T256}(x) = \begin{cases} +1, & x > 2^{\,n-1-b}, \\ \tanh\!\big(\lfloor x \cdot 2^{b} \rfloor\, 2^{-b}\big), & -2^{\,n-1-b} \le x \le 2^{\,n-1-b}, \\ -1, & x < -2^{\,n-1-b}, \end{cases} \qquad (8)$$
with b being a design parameter to be optimized by minimizing the root mean square error RMSE(f, fT256), making use of the proper particularization of Expression (7).

The “structure” that the fT256 approximation exhibits in (8) requires some comments.

(1) Expression (8) assigns a +1 output to those input values greater than 2^(n−1−b), and a −1 output to those input values lower than −2^(n−1−b). With respect to the remaining input values, belonging to the interval −2^(n−1−b) ≤ x ≤ 2^(n−1−b), fT256 divides such an interval into 2^n possible values, whose corresponding output values have been tabulated and stored in RAM memory.

(2) We have included in (8), for reasons that will appear clearer later on, the scale factor 2^b, aiming at determining which bits of x lead to the best approximation of the function f.

(3) The parameter b in the aforementioned scale factor determines the way fT256 approaches f. Its optimum value is the one that minimizes the root mean square error RMSE(f, fT256). In this respect, Figure 2 represents RMSE(f, fT256) as a function of the parameter b, and shows that the minimum value of the RMSE (RMSEmin = 0.0025) is obtained when b = bopt = 5.4.

(4) Since, for a practical implementation, b must be an integer, we take b = 5 as the closest integer to bopt = 5.4. This leads to RMSE = 0.0035.

(5) The scale factor 2^5 in Expression (8) (multiplying by 2^5) is equivalent to binary-shifting x by 5 bits to the left, which can be implemented using only one assembler instruction!

As a consequence, implementing the fT256 approximation requires storing 256 memory words and the following 6 assembler instructions (a floating-point sketch of the procedure is given after the list):

(1) shifting 5 bits to the left,

(2) a saturation operation,

(3) an 8-bit right shift,

(4) the addition of the starting point of the table in memory,

(5) copying this value to an addressing register,

(6) reading the value in the table.
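The Python sketch below mirrors these steps (shift, saturate, index into the stored table) for n = 8 and b = 5. It is written in floating point for readability; on the DSP, the same operations would be the assembler instructions listed above.

import numpy as np

N_BITS, B = 8, 5
# table covers the 2^n quantized inputs in [-2^(n-1-b), 2^(n-1-b))
TABLE = np.tanh((np.arange(2 ** N_BITS) - 2 ** (N_BITS - 1)) * 2.0 ** (-B))

def f_t256(x):
    idx = int(np.floor(x * 2 ** B))                                   # shift left by b bits
    idx = max(-2 ** (N_BITS - 1), min(2 ** (N_BITS - 1) - 1, idx))    # saturation
    return TABLE[idx + 2 ** (N_BITS - 1)]                             # offset and table read

print(f_t256(0.5), np.tanh(0.5))   # close agreement inside the tabulated range
print(f_t256(10.0))                # saturates at the table edge (close to +1)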

However, in some cases (basically, when the number of neurons is high), this number of instructions is too high.

[Figure 2: RMSE(f, fT256), the root mean square error of the table-based approximation, as a function of the parameter b, the exponent of the scale factor in its defining Expression (8).]

In order to simplify the calculation of this approximated function, or in other words, to reduce the number of instructions, we have tested a second approach based on a piecewise approximation. Taking into account that a typical DSP is able to implement a saturation using one cycle, we have evaluated the feasibility of fitting the original activation function f by using a 3-piece linear approximation, labeled f3PLA, which exhibits the expression:

$$f_{3\mathrm{PLA}}(x) = \begin{cases} 1, & x > \dfrac{1}{a}, \\ a \cdot x, & -\dfrac{1}{a} \le x \le \dfrac{1}{a}, \\ -1, & x < -\dfrac{1}{a}, \end{cases} \qquad (9)$$

where the subscript “3PLA” stands for “3-piece linear approximation,” and a is the corresponding design parameter, whose optimum value is the one that minimizes RMSE(f, f3PLA). Regarding this optimization process, Figure 3 shows RMSE(f, f3PLA) as a function of the parameter a. Note that the value of a that makes RMSE(f, f3PLA) minimum (0.0445) is aopt = 0.769.

The practical point to note regarding this approximation is that it requires multiplying the input of the activation function by a, which, in a typical DSP, requires at least the following 4 instructions:

(1) copying x into one of the input registers of the MAC unit,

(2) copying the constant value of a into the other input register,

(3) copying the result of this multiplication into the accumulator,

(4) a saturation operation.

As a consequence, the minimum number of instructions required a priori for implementing this approximation is 4, since the saturation operation requires an additional assembler instruction.

[Figure 3: Root mean square error corresponding to the 3-piece linear approximation, RMSE(f, f3PLA), as a function of the parameter a, the slope in its defining Expression (9).]

Furthermore, a possible way of reducing the number of instructions even further consists in folding the term a into the corresponding weights of the neuron, so that f3PLA(x, a = 0.769) = f3PLA(0.769x, a = 1). The additional bonus is that the number of instructions is drastically reduced to only 1 assembler instruction.
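A minimal Python sketch of f3PLA and of this weight-folding trick follows; the array names are illustrative only.

import numpy as np

A_OPT = 0.769

def f_3pla(x, a=A_OPT):
    return np.clip(a * x, -1.0, 1.0)          # multiply by a, then saturate

def fold_slope_into_weights(V, v0, a=A_OPT):
    # pre-scale hidden-layer weights/biases so the activation becomes a bare saturation
    return a * V, a * v0

x = np.linspace(-3, 3, 7)
print(f_3pla(x))                              # explicit multiply + saturation
V, v0 = np.ones((8, 2)), np.zeros(2)
Vf, v0f = fold_slope_into_weights(V, v0)      # folded: activation reduces to clip(x, -1, 1)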

For illustrative purposes, we complete this section by having a look at Figure 4, which represents the two approximations considered in the paper: the “tabulated function-based” approximation (fT256, with b = 5) and the 3-piece linear approximation (f3PLA, with a = 0.769).

4. Experimental Work

Prior to the description of the different experiments we have carried out, it is worth having a look at the sound database we have used. It consists of a total of 7340 seconds of audio, including speech in quiet, speech in noise, speech in music, vocal music, instrumental music, and noise. The database was manually labelled, obtaining a total of 1272.5 seconds of speech in quiet, 3637.5 seconds of speech in music or noise, and 2430 seconds of music and noise. All audio files are monophonic and were sampled with a sampling frequency of 16 kHz and 16 bits per sample. Speech and music files were provided by D. Ellis, and recorded by E. Scheirer and M. Slaney [16]. This database [17] has already been used in a number of different works [16, 18–20]. Speech was recorded by digitally sampling FM radio stations, using a variety of stations, content styles, and levels, and contains samples from both male and female speakers. The sound files present different input levels, with a range of 30 dB between the lowest and the highest, which allows us to test the robustness of the classification system against different sound input levels.

[Figure 4: Representation of the considered activation functions: the tabulated hyperbolic tangent and the line-based (3-piece linear) approximation.]

Music includes samples of jazz, pop, country, salsa, reggae, classical, various non-Western styles, various sorts of rock, and new age music, both with and without vocals. Finally, noise files include sounds from the following environments: aircraft, bus, cafe, car, kindergarten, living room, nature, school, shop, sports, traffic, train, and train station. These noise sources have been artificially mixed with those of the speech files (with varying degrees of reverberation) at different Signal-to-Noise Ratios (SNRs) ranging from 0 to 10 dB. In a number of experiments, these values have been found to be representative enough regarding the following perceptual criteria: lower SNRs could be treated by the hearing aid as noise, and higher SNRs could be considered as clean speech.

For training, validation, and testing, the database has to be divided into three different sets: 2685 seconds (≈36%) for training, 1012.5 seconds (≈14%) for validation, and 3642.5 seconds (≈50%) for testing. This division has been done randomly, ensuring that the relative proportion of files of each category is preserved for each set. The training set is used to determine the weights of the MLP in the training process, the validation set helps evaluate progress during training and determine when to stop training, and the test set is used to assess the classifier's quality after training. The test set has remained unaltered for all the experiments described in this paper.

Each file was processed using the hearing aid simulator described in [21] without feedback. The features were computed from the output of the Weighted Overlap-Add (WOLA) filterbank with 128 DFT points and analysis and synthesis window lengths of 256 samples. The time/frequency decomposition is thus performed with 64 frequency bands.


Table 1: Mean error probability (%) of different classifiers returning a decision with time slots of 2.5 seconds, using 9 quantization schemes. Qx.y denotes the quantization scheme with x bits for the integer part and y bits for the fractional part. Regarding the classifiers, MLP K means a Multi-Layer Perceptron with K neurons in the hidden layer. The column labeled “Double” corresponds to the mean error probability (%) when no quantization (double floating-point precision) is used. The most relevant result: Q5.11 provides results very similar to those of double precision.

Classifier  Double  Q1.15  Q2.14  Q3.13  Q4.12  Q5.11  Q6.10  Q7.9  Q8.8  Q9.7

MLP 1 15.15 55.63 55.30 20.94 15.16 15.21 15.30 15.79 23.33 36.28

MLP 2 10.46 73.43 37.46 15.76 10.47 10.47 10.48 10.88 15.55 36.63

MLP 3 9.84 71.90 38.16 12.25 9.88 9.85 9.86 10.21 16.76 44.69

MLP 4 9.16 74.60 42.41 14.04 9.26 9.17 9.20 9.67 16.95 46.71

MLP 5 8.86 69.08 42.11 13.76 8.92 8.86 8.92 9.58 17.75 40.56

MLP 6 8.55 65.08 35.32 11.07 8.58 8.54 8.58 9.27 17.13 41.99

MLP 7 8.39 65.91 38.18 10.57 8.40 8.40 8.46 9.41 18.84 42.45

MLP 8 8.33 62.37 33.43 9.51 8.33 8.34 8.41 8.98 17.31 44.01

MLP 9 8.34 61.17 34.76 10.45 8.53 8.34 8.35 9.11 17.76 43.88

MLP 10 8.17 62.19 34.27 9.30 8.18 8.19 8.26 8.96 17.76 43.06

MLP 15 8.10 62.03 32.79 9.22 8.11 8.11 8.18 8.96 17.36 40.41

MLP 20 7.93 51.67 29.03 9.42 7.92 7.92 7.97 8.85 18.17 44.11

MLP 25 7.94 61.27 32.75 9.91 7.94 7.94 8.01 8.98 17.96 41.69

MLP 30 7.86 59.31 35.45 10.13 7.92 7.87 7.91 8.73 17.46 42.52

MLP 35 7.95 59.84 32.12 10.02 7.99 7.95 8.01 8.85 17.81 43.47

MLP 40 7.78 59.71 30.78 10.15 7.77 7.74 7.82 8.74 17.72 41.27

Concerning the architecture, the simulator has been configured for a 16-bit word-length Harvard architecture with a Multiplier/ACcumulator (MAC) that multiplies 16-bit registers and stores the result in a 40-bit accumulator.

In order to study the effects of the limited precision, two different scenarios were considered in the experiments. First, the classifiers were configured to return a decision every 2.5 seconds. The aim of this study is to determine the effects of the limited precision on the classifiers for applications like automatic program switching, in which a large time scale is used. Second, the classifiers were configured to take a decision in time slots of 20 milliseconds. In this case, the objective is to study the effects of the limited precision in a classification scenario in which a small time scale is required, like, for example, in noise reduction or sound source separation applications.

In the batches of experiments we have put into practice, the experiments have been repeated 100 times. The results illustrated below show the average probability of classification error for the test set and the computational complexity, in number of assembler operations, needed to obtain the output of the classifier. The probability of classification error represents the average proportion of time slots that are misclassified in the test set.

It is important to highlight that in a real classification system the classification evidence can be accumulated across time to achieve lower error rates. This fact makes necessary a study of the tradeoff between the selected time scale, the integration of decisions over consecutive time slots, the performance of the final system, and the required computational complexity. This analysis is out of the scope of this paper, since our aim is not to propose a particular classification system, which must be tuned for the considered hearing aid application, but to illustrate a set of tools and strategies that can be used for determining the way a neural network can be efficiently implemented in real time for sound environment classification tasks with limited computational capabilities.

4.1. Comparing the Quantization Schemes. The objective of this first set of experiments is to study the effects of the quantization format, Qx.y, used for representing both the signal-describing features and the weights of the neural network. In this experimental work, aiming at clearly distinguishing the different phenomena involved, the activation function used in the neural network is the original hyperbolic tangent function, f. The influence of using some of the aforementioned approximations of f has also been explored in a further sequence of experiments, whose results will be explained in Section 4.2.

Tables 1 and 2 show the average probability of error (%) obtained in the 100 runs of the training process for a variety of multilayer perceptrons (MLPs) with different numbers of hidden neurons, for time slots of 2.5 seconds and 20 milliseconds, respectively. In these tables, MLP K denotes an MLP with K neurons in the hidden layer. These batches of experiments have explored numbers of hidden neurons ranging from 1 to 40. Aiming at clearly understanding the effect of the different quantization schemes, we have also listed the average probability of error computed with no quantization (double floating-point precision). These values have been labeled in Tables 1 and 2 by using the header “Double”.


Table 2: Mean error probability (%) of different classifiers returning a decision with time slots of 20 milliseconds, using 9 quantization schemes. Qx.y denotes the quantization scheme with x bits for the integer part and y bits for the fractional part. Regarding the classifiers, MLP K means a Multi-Layer Perceptron with K neurons in the hidden layer. The column labeled “Double” corresponds to the mean error probability (%) when no quantization (double floating-point precision) is used. The most relevant result: Q5.11 provides results very similar to those of double precision.

Classifier  Double  Q1.15  Q2.14  Q3.13  Q4.12  Q5.11  Q6.10  Q7.9  Q8.8  Q9.7

MLP 1 36.36 44.05 41.25 37.24 36.36 36.36 36.42 37.11 41.79 60.16

MLP 2 27.44 42.88 33.10 28.28 27.45 27.46 27.88 32.21 46.19 60.96

MLP 3 26.11 45.86 44.42 37.05 31.23 26.60 27.43 36.97 49.26 61.56

MLP 4 24.61 50.66 51.47 41.38 30.18 24.93 26.52 36.60 54.79 62.17

MLP 5 23.07 50.91 46.39 39.42 28.25 23.45 27.07 41.88 57.32 65.41

MLP 6 22.18 55.34 51.77 45.29 30.43 23.43 27.17 39.41 54.45 62.82

MLP 7 21.50 53.69 49.61 44.22 28.74 22.35 26.53 39.00 54.37 63.40

MLP 8 21.07 54.80 52.90 47.81 26.42 21.95 25.54 36.53 53.47 61.41

MLP 9 20.55 56.32 50.24 47.41 26.81 21.75 23.44 36.77 53.16 60.83

MLP 10 20.80 58.96 52.28 49.60 28.18 22.30 23.71 36.65 52.84 61.20

MLP 15 19.74 61.13 56.33 52.93 30.14 20.83 21.48 32.83 51.28 63.11

MLP 20 19.54 62.85 57.45 53.50 29.36 20.19 20.94 30.47 49.57 61.71

MLP 25 19.49 62.54 57.30 53.40 30.97 20.36 20.90 30.20 49.88 63.60

MLP 30 19.47 63.99 57.14 51.93 31.53 20.25 20.61 28.93 48.82 61.23

MLP 35 19.44 64.87 56.70 52.14 32.19 20.94 20.41 26.69 45.07 60.02

MLP 40 19.49 62.67 55.06 49.96 29.78 20.29 20.37 27.67 46.32 61.19

Tables 1 and 2 supply some important pieces of useful information:

(i) Those quantization formats with a low number of bits for representing the integer part, such as, for example, Q2.14, lead to an increase in the error probability when compared to the values computed with double precision. This increase is caused by saturations of the features and weights of the neural networks.

(ii) On the other hand, the use of a low number of bits for the fractional portion causes an increase in the error probability, basically arising from the loss of precision in the numerical representation.

These facts illustrate the need for a tradeoff between integer and fractional bits. For the sake of clarity, Figure 5 shows the average relative increase in the error probability with respect to the use of double precision, as a function of the number of bits of the fractional portion. Computing this relative increase has required the results obtained with all the classifiers listed in Tables 1 and 2, the average being computed from

$$I = 100 \cdot E\left\{\frac{P_{Qx.y} - P_{\mathrm{double}}}{P_{\mathrm{double}}}\right\} \ (\%), \qquad (10)$$

where E{·} represents the mean value of the probabilities over all the numbers of neurons considered. Note that the lowest relative increase is achieved by the Q5.11 quantization scheme, for both time-slot configurations.

[Figure 5: Average relative increase (%) in the probability of error, with respect to double precision, as a function of the number of bits of the fractional portion, for the classifiers studied in this paper (files of 2.5 s and of 20 ms).]

This is the reason why the Q5.11 quantization format has been selected for the remaining experiments of the paper.
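For reference, Eq. (10) reduces to a one-line computation; the following sketch averages the relative increase over the available network sizes. The numbers used in the example call are made-up placeholders, not values from Tables 1 or 2.

import numpy as np

def relative_increase(p_quantized, p_double):
    # p_quantized, p_double: error probabilities, one entry per classifier size
    p_q, p_d = np.asarray(p_quantized, float), np.asarray(p_double, float)
    return 100.0 * np.mean((p_q - p_d) / p_d)

print(relative_increase([10.2, 10.5], [10.0, 10.3]))   # average relative increase, in %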

4.2. Comparing the Approximations of the Activation Function. The purpose of this second batch of experiments is to quantitatively evaluate the fitness of the approximations explored in the paper: the “tabulated function-based” approximation (fT256, with b = 5) and the 3-piece linear approximation (f3PLA, with a = 0.769).


Table 3: Mean error probability (%) and number of simple operations required for computing the activation function approximations when using neural networks with different activation functions: the "tabulated function-based" function (fT256, with b = 5) and the 3-piece linear approximation (f3PLA, with a = 0.769). MLP X means that the multilayer perceptron under study contains X hidden neurons.

             Mean error probability (%)                         Assembler instructions
             Files of 2.5 s          Files of 20 ms
NN           fT256      f3PLA        fT256      f3PLA           fT256      f3PLA
MLP 1        15.22      24.14        36.36      44.32           24         19
MLP 2        10.49      11.61        27.57      30.46           42         32
MLP 3        9.86       11.16        26.84      29.15           60         45
MLP 4        9.18       11.37        25.48      32.19           78         58
MLP 5        8.91       10.79        24.30      35.17           96         71
MLP 6        8.56       10.78        24.49      40.16           114        84
MLP 7        8.45       11.20        23.38      40.35           132        97
MLP 8        8.38       10.92        22.90      41.59           150        110
MLP 9        8.33       10.82        22.34      40.23           168        123
MLP 10       8.23       11.10        22.84      39.25           186        136
MLP 15       8.13       10.63        21.12      35.70           276        201
MLP 20       7.94       10.58        20.45      33.20           366        266
MLP 25       7.97       10.44        20.57      31.99           456        331
MLP 30       7.85       10.39        20.46      31.82           546        396
MLP 35       8.00       10.17        21.14      31.48           636        461
MLP 40       7.76       10.55        20.46      31.53           726        526

The purpose of this second batch of experiments is to quantitatively evaluate the fitness of the two approximations explored in the paper: the "tabulated function-based" function (fT256, with b = 5) and the 3-piece linear approximation (f3PLA, with a = 0.769). The quantization scheme used in this experimental work is Q5.11 because, as stated in Section 4.1, it is the one that makes the different classifiers achieve results very similar to those obtained when no quantization (double precision) is used.

Table 3 shows the error probability corresponding to MLPs (ranging from 1 to 40 hidden neurons) that make use of the aforementioned approximations, for files of 2.5 seconds and 20 milliseconds, respectively. A detailed observation of Table 3 leads to the following conclusions.

(i) The "tabulated function-based" approximation, fT256, makes the NNs achieve very similar results to those obtained when using the original hyperbolic tangent function, f: an average relative increase of 0.30% for files of 2.5 seconds, and of 5.91% for files of 20 milliseconds. This can be seen by comparing the mean error probabilities listed in column Q5.11 of Tables 1 and 2 (in which the activation function has not yet been approximated) with those of the "fT256" columns in Table 3.

(ii) The use of the 3-piece linear approximation, f3PLA, leads to an average relative increase in the probability of error of 29.88% and 61.27% for files of 2.5 seconds and 20 milliseconds, respectively.

As a conclusion, we can say that the "tabulated function-based" approximation, fT256, is a suitable way to approximate the original hyperbolic tangent function, f, mainly for the case of files of 2.5 seconds.

Another extremely important point to note is that both the considered approximations of the activation function and the number of neurons determine the number of assembler instructions needed to implement the classification system in the hearing aid. In this respect, Table 3 also shows the number of instructions for the different MLP K classifiers (K being the number of hidden neurons) as a function of the approximation of the hyperbolic tangent function (fT256 and f3PLA).

4.3. Improving the Results by Retraining the Output Weights. As can be seen from the results obtained in the previous section, the use of approximated activation functions reduces the number of assembler instructions needed to implement the classifier. Even though this is a positive fact, the use of approximations for the activation functions may slightly reduce the efficiency of the classifier. Aiming at overcoming this, we have carried out a novel sequence of experiments, which consists of the following steps.

(1) Train the NN.

(2) Introduce the aforementioned quantization schemes and the approximations for the activation function.

(3) Recompute the output weights of the network by taking into account the studied effects related to the quantization schemes and the approximations for the activation function.



Table 4: Mean error probability (%) and number of simple operations required for computing the activation function approximations when using neural networks with different activation functions, when the output weights are retrained once the activation function approximation is applied.

             Mean error probability (%)                         Assembler instructions
             Files of 2.5 s          Files of 20 ms
NN           fT256      f3PLA        fT256      f3PLA           fT256      f3PLA
MLP 1        15.20      20.27        36.29      38.76           24         19
MLP 2        10.46      10.81        27.50      27.85           42         32
MLP 3        9.85       10.23        26.59      27.25           60         45
MLP 4        9.16       9.92         25.24      26.36           78         58
MLP 5        8.90       9.37         23.71      26.42           96         71
MLP 6        8.56       9.23         23.04      26.85           114        84
MLP 7        8.44       9.11         22.37      26.27           132        97
MLP 8        8.36       8.97         21.82      26.49           150        110
MLP 9        8.31       8.97         21.16      25.55           168        123
MLP 10       8.19       9.01         21.42      25.59           186        136
MLP 15       8.11       8.95         20.09      23.82           276        201
MLP 20       7.92       8.91         19.82      22.96           366        266
MLP 25       7.96       8.77         19.70      22.39           456        331
MLP 30       7.83       8.84         19.67      22.21           546        396
MLP 35       7.99       8.75         19.64      21.92           636        461
MLP 40       7.75       8.77         19.52      21.64           726        526

Note that training the MLP directly with the quantization schemes and the approximations for the activation function is not straightforward, since the approximations used for the activation functions are not differentiable at some points, or their slope is zero. The solution proposed here overcomes these problems and makes the process much easier.
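The paper does not spell out how the output weights are recomputed. The sketch below is therefore only an assumption: it freezes the quantized hidden layer, evaluates it with a 256-entry tabulated tanh (the [-8, 8] input range is assumed, not taken from the paper), and refits the output layer by ordinary linear least squares.

# Hypothetical sketch of the output-weight recomputation step; not the authors' exact method.
import numpy as np

def tabulated_tanh(x, table=np.tanh(np.linspace(-8, 8, 256)), lo=-8.0, hi=8.0):
    # 256-sample lookup approximation of tanh; the [-8, 8] range is an assumption.
    idx = np.clip(((x - lo) / (hi - lo) * 255).astype(int), 0, 255)
    return table[idx]

def retrain_output_weights(X, targets, W_hidden_q, b_hidden_q):
    """X: (samples, features); targets: (samples, classes); *_q: quantized hidden-layer parameters."""
    H = tabulated_tanh(X @ W_hidden_q + b_hidden_q)           # hidden activations with the approximation
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])             # append bias column
    W_out, *_ = np.linalg.lstsq(H1, targets, rcond=None)      # least-squares refit of the output layer
    return W_out

# Toy usage with random data (shapes only; not the paper's features):
rng = np.random.default_rng(0)
X, T = rng.standard_normal((100, 4)), rng.standard_normal((100, 2))
Wq, bq = rng.standard_normal((4, 6)), rng.standard_normal(6)
print(retrain_output_weights(X, T, Wq, bq).shape)             # -> (7, 2)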

Table 4 shows the mean error probability obtained by the different neural networks once the output weights have been recomputed. Understanding Table 4 requires comparing it to Table 3 (in which the output weights have not been recalculated). From this comparison, we would like to emphasize the following.

(i) The retraining strategy slightly reduces the error when the tabulated approximation is used. Now, fT256 leads to an average relative increase in the probability of error of 0.13% and 1.94% for files of 2.5 seconds and 20 milliseconds, respectively, compared to the results obtained when no quantization (double precision) is used.

(ii) In the case of the 3-piece linear approximation, the retraining strategy leads to an average relative increase in the probability of error of 10.36% and 15.08% for files of 2.5 seconds and 20 milliseconds, respectively, compared to the results obtained when double precision is used.

To complete this paper, and in order to compare the benefits of the proposed retraining strategy with the results presented in the previous section, Figures 6 and 7 show the relationship between the error rate and the number of operations for the tabulated-based implementation and for the line-based implementation, with and without retrained output weights, for files of 2.5 seconds and 20 milliseconds, respectively.

[Figure 6 appears here: probability of error (%), from 7 to 12, plotted against the number of operations (10^1 to 10^3) for MLP T256, MLP lined, MLP T256 optimized, and MLP lined optimized.]

Figure 6: Comparative analysis of the relationship between the error rate and the number of operations for the best methods studied in the paper.

Taking into account the limited number of operations per second (low clock rates in order to minimize power consumption), the results in Figures 6 and 7 demonstrate the effectiveness of the proposed strategy, especially in the case of time slots of 20 milliseconds, because it achieves lower error rates with comparable computational complexity.


[Figure 7 appears here: probability of error (%), from 18 to 34, plotted against the number of operations (10^1 to 10^3) for MLP T256, MLP lined, MLP T256 optimized, and MLP lined optimized.]

Figure 7: Comparative analysis of the relationship between the error rate and the number of operations for the best methods studied in the paper.

Furthermore, the use of the line-based approximation is recommended mainly when very low computational complexity (fewer than 50 instructions) is required.

5. Conclusions

This paper has been motivated by the fact that the implementation of signal processing techniques on hearing aids is strongly constrained by (1) the finite precision used for representing numbers and (2) the small number of instructions per second available to implement the algorithms on the digital signal processor the hearing aid is based on. In this respect, the objective of this paper has been to quantitatively analyze these effects on the performance of neural network-based sound classifiers in digital hearing aids. Such performance must be a delicate balance between keeping the classification error probability low (in order not to disturb the user's comfort) and achieving this with a small number of instructions per second. The reason underlying this latter restriction is that hearing aids have to work at low clock rates in order to minimize the power consumption and maximize battery life.

Within this framework, the paper has particularly centered on exploring the following:

(1) the effects of using quantized weights and an approximated activation function for the neurons that compose the classifier. In particular, we have evaluated 2 different approximations: (a) the "tabulated function-based" function, based on 256 samples of the analytical activation function, and (b) the 3-piece linear approximation;

(2) how to improve the performance by making use of the information extracted from point 1.

The different batches of experiments lead to the following conclusions.

(i) The Q5.11 quantization scheme has been found to exhibit very similar results to those obtained when no quantization (double precision) is used, mainly for the case of files of 2.5 seconds.

(ii) The "tabulated function-based" approximation makes the NNs achieve very similar results to those obtained when using the original hyperbolic tangent function for the case of files of 2.5 seconds, and an average relative increase of 5.91% for the case of files of 20 milliseconds.

(iii) The use of the 3-piece linear approximation leads to a considerably larger average relative increase in the probability of error.

(iv) The retraining strategy reduces the error on average for all the experiments considered in the paper.

The final, global conclusion is that the quantized, approximated, neural network-based classifier achieves perceptually the same efficiency as that reached by "exact" networks (that is, without these approximations) but, and this is the key point, with the added bonus of greatly reducing the computational cost on the digital signal processor the hearing aid is based on.

Acknowledgments

This work has been partially funded by the Comunidad de Madrid/Universidad de Alcala (CCG06-UAH/TIC-0378, CCG07-UAH/TIC-1572, CCG08-UAH/TIC-4270) and the Spanish Ministry of Education and Science (TEC2006-13883-C04-04/TCM, TEC2009-14414-C03-03). The authors would like to thank the anonymous reviewers and the editor for their many helpful suggestions, which have greatly improved the presentation of this paper.



Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 531213, 20 pages
doi:10.1155/2009/531213

Research Article

Signal Processing Strategies for Cochlear Implants Using Current Steering

Waldo Nogueira, Leonid Litvak, Bernd Edler, Jörn Ostermann, and Andreas Büchner

Laboratorium für Informationstechnologie, Leibniz Universität Hannover, Schneiderberg 32, 30167 Hannover, Germany

Correspondence should be addressed to Waldo Nogueira, [email protected]

Received 29 November 2008; Revised 19 April 2009; Accepted 22 September 2009

Recommended by Torsten Dau

In contemporary cochlear implant systems, the audio signal is decomposed into different frequency bands, each assigned to one electrode. Thus, pitch perception is limited by the number of physical electrodes implanted into the cochlea and by the wide bandwidth assigned to each electrode. The Harmony HiResolution bionic ear (Advanced Bionics LLC, Valencia, CA, USA) has the capability of creating virtual spectral channels through simultaneous delivery of current to pairs of adjacent electrodes. By steering the locus of stimulation to sites between the electrodes, additional pitch percepts can be generated. Two new sound processing strategies based on current steering have been designed, SpecRes and SineEx. In a chronic trial, speech intelligibility, pitch perception, and subjective appreciation of sound were compared between the two current steering strategies and the standard HiRes strategy in 9 adult Harmony users. There was considerable variability in benefit, and the mean results show similar performance with all three strategies.

Copyright © 2009 Waldo Nogueira et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Cochlear implants are an accepted and effective treatment for restoring hearing sensation to people with severe-to-profound hearing loss. Contemporary cochlear implants consist of a microphone, a sound processor, a transmitter, a receiver, and an electrode array that is positioned inside the cochlea. The sound processor is responsible for decomposing the input audio signal into different frequency bands and delivering information about each frequency band to the appropriate electrode in a base-to-apex tonotopic pattern. The bandwidths of the frequency bands are approximately equal to the critical bands, where low-frequency bands have higher frequency resolution than high-frequency bands. The actual stimulation of each electrode consists of nonoverlapping biphasic charge-balanced pulses that are modulated by the lowpass-filtered output of each analysis filter.

Most contemporary cochlear implants deliver interleaved pulses to the electrodes so that no electrodes are stimulated simultaneously. If electrodes are stimulated simultaneously, thereby overlapping in time, their electrical fields add and create undesirable interactions. Interleaved stimulation partially eliminates these undesired interactions. Research shows that strategies using nonsimultaneous stimulation achieve better performance than strategies using simultaneous stimulation of all electrodes [1].

Most cochlear implant users have limited pitch resolution. There are two mechanisms that can underlie pitch perception in cochlear implant recipients, temporal/rate pitch and place pitch [2]. Rate pitch is related to the temporal pattern of stimulation. The higher the frequency of the stimulating pulses, the higher the perceived pitch. Typically, most patients do not perceive pitch changes when the stimulation rate exceeds 300 pulses per second [3]. Nonetheless, temporal pitch cues have been shown to provide some fundamental frequency discrimination [4] and limited melody recognition [2]. The fundamental frequency is important for speaker recognition and speech intelligibility. For speakers of tone languages (e.g., Cantonese or Mandarin), differences in fundamental frequency within a phonemic segment determine the lexical meaning of a word. It is not surprising, then, that cochlear implant users in countries with tone languages may not derive the same benefit as individuals who speak nontonal languages [5].


Speech intelligibility in noisy environments might be limited for cochlear implant users because of the poor perception of temporal cues. It has been shown that normal hearing listeners benefit from temporal cues to improve speech intelligibility in noisy environments [6].

The place pitch mechanism is related to the spatial pattern of stimulation. Stimulation of electrodes located towards the base of the cochlea produces higher pitch sensations than stimulation of electrodes located towards the apex. The resolution of pitch derived from a place mechanism is limited by the small number of electrodes and by the current spread produced in the cochlea when each electrode is activated. Pitch or spectral resolution is important when the listening environment becomes challenging, in order to separate speech from noise or to distinguish multiple talkers [7]. The ability to differentiate place-pitch information also contributes to the perception of the fundamental frequency [4]. Increased spectral resolution also is required to perceive fundamental pitch and to identify melodies and instruments [8]. As many as 100 bands of spectral resolution are required for music perception in normal hearing subjects [7].

Newer sound-processing strategies like HiRes are designed to increase the spectral and temporal resolution provided by a cochlear implant in order to improve the hearing abilities of cochlear implant recipients. HiRes analyzes the acoustic signal with high temporal resolution and delivers high stimulation rates [9]. However, spectral resolution is still not optimal because of the limited number of electrodes. Therefore, a challenge for new signal processing strategies is to improve the representation of frequency information given the limited number of fixed electrodes. Recently, researchers have demonstrated a way to enhance place pitch perception through simultaneous stimulation of electrode pairs [3, 10–12]. This causes a summation of the electrical fields, producing a peak of the overall field located in the middle of both electrodes. It has been reported that additional pitch sensations can be created by adjusting the proportion of current delivered simultaneously to two electrodes [13]. This technique is known as current steering [7]. As the implant can represent information with finer spectral resolution, it becomes necessary to improve the spectral analysis of the audio signal performed by classical strategies like HiRes.

In addition to simultaneous stimulation of electrodes, multiple intermediate pitch percepts can also be created by sequential stimulation of adjacent electrodes in quick succession [14]. Electrical models of the human cochlea and psychoacoustic experiments have shown that simultaneous stimulation generally is able to produce a single, gradually shifting intermediate pitch. On the other hand, sequential stimulation often produces two regions of excitation. Thus, sequential stimulation often requires an increase in the total amount of current needed to reach comfortable loudness, and may lead to the perception of two pitches or a broader pitch as the electrical field separates into two regions [15].

The main goal of this work was to improve speech and music perception in cochlear implant recipients through the development of new signal processing strategies that take advantage of the current-steering capabilities of the Advanced Bionics device. These new strategies were designed to improve the spectral analysis of the audio signal and to deliver the signal with greater place precision using current steering. The challenge was to implement the experimental strategies in commercial speech processors so that they could be evaluated by actual implanted subjects. Thus a significant effort was put into executing the real-time applications in commercial low power processors. After implementation, the strategies were assessed using standardized tests of pitch perception and speech intelligibility and through subjective ratings of music appreciation and speech quality.

The paper is organized as follows. Section 2 describes the commercial HiRes and two research strategies using current steering. Section 3 details the methods for evaluating speech intelligibility and frequency discrimination in cochlear implant recipients using the new strategies. Sections 4, 5, and 6 present the results, discussion, and conclusions.

2. Methods

2.1. The High Resolution Strategy (HiRes). The HiRes strategy is implemented in the Auria and Harmony sound processors from Advanced Bionics LLC. These devices can be used with the Harmony implant (CII and the HiRes90k). In HiRes, an audio signal sampled at 17400 Hz is pre-emphasized by the microphone and then digitized. Adaptive gain control (AGC) is performed digitally using a dual-loop AGC [16]. Afterwards the signal is broken up into frequency bands using infinite impulse response (IIR) sixth-order Butterworth filters. The center frequencies of the filters are logarithmically spaced between 350 Hz and 5500 Hz. The last filter is a high-pass filter whose bandwidth extends up to the Nyquist frequency. The bands covered by the filters will be referred to as subbands or frequency bands. In HiRes, each frequency band is associated with one electrode.

In HiRes, the subband outputs of the filter bank are used to derive the information that is sent to the electrodes. Specifically, the filter outputs are half-wave rectified and averaged. Half-wave rectification is accomplished by setting to 0 the negative amplitudes at the output of each filter band. The outputs of the half-wave rectifier are averaged for the duration Ts of a stimulation cycle. Finally, the "Mapping" block maps the acoustic values obtained for each frequency band into current amplitudes that are used to modulate biphasic pulses. A logarithmic compression function is used to ensure that the envelope outputs fit the patient's dynamic range. This function is defined for each frequency band or electrode z (z = 1, ..., M) and is of the form presented in the following equation:

$$Y_z\left(X_{\mathrm{Filt}_z}\right) = \frac{\mathrm{MCL}(z) - \mathrm{THL}(z)}{\mathrm{IDR}} \times \left(X_{\mathrm{Filt}_z} - m_{\mathrm{sat}_{\mathrm{dB}}} + 12 + \mathrm{IDR}\right) + \mathrm{THL}(z), \quad z = 1, \ldots, M, \qquad (1)$$

where $Y_z$ is the (compressed) electrical amplitude, $X_{\mathrm{Filt}_z}$ is the acoustic amplitude (output of the averager) in dB, and IDR is the input dynamic range set by the clinician.


A typical value for the IDR is 60 dB. The mapping function used in HiRes maps the MCL to 12 dB below the saturation level $m_{\mathrm{sat}_{\mathrm{dB}}}$. The saturation level in HiRes is set to $20\log_{10}(2^{15} - 1)$.

In each stimulation cycle, HiRes stimulates all M implant electrodes sequentially to partially avoid channel interactions. The number of electrodes for the HiRes90k implant is M = 16, and all electrodes are stimulated at the same fixed rate. The maximum channel stimulation rate (CSR) used in the HiRes90k is 2899 Hz.
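A minimal Python sketch of the mapping of (1); the THL and MCL values in the example are illustrative placeholders, not clinical values.

# Sketch of the HiRes logarithmic mapping of (1).
import math

MSAT_DB = 20 * math.log10(2**15 - 1)   # saturation level quoted in the text (about 90.3 dB)

def hires_map(x_filt_db, thl, mcl, idr=60.0):
    """Map an acoustic amplitude in dB (averager output) to an electrical amplitude."""
    return (mcl - thl) / idr * (x_filt_db - MSAT_DB + 12.0 + idr) + thl

# Example: an input 12 dB below saturation maps to MCL (placeholder THL/MCL values).
print(hires_map(MSAT_DB - 12.0, thl=100.0, mcl=300.0))   # -> 300.0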

2.2. The Spectral Resolution Strategy (SpecRes). The spectral resolution (SpecRes) strategy is a research version of the commercial HiRes with Fidelity 120 strategy and, like HiRes, can be used with the Harmony implant. This strategy was designed to increase the frequency resolution so as to optimize use of the current steering technique. In [10], it was shown that cochlear implant subjects are able to perceive several distinct pitches between two electrodes when they are stimulated simultaneously. In HiRes, each center frequency and bandwidth of a filter band is associated with one electrode.

However, when more stimulation sites are created using current steering, a more accurate spectral analysis of the incoming sound is required. For this reason, the filter bank used in HiRes is not adequate, and a new signal processing strategy that enables higher spectral resolution analysis is needed. Figure 1 shows the main processing blocks of the new strategy designed by Advanced Bionics LLC.

In SpecRes, the signal from the microphone is first pre-emphasized and digitized at Fs = 17400 Hz as in HiRes. Next the front-end implements the same adaptive-gain control (AGC) as used in HiRes. The resulting signal is sent through a filter bank based on a Fast Fourier Transform (FFT). The length of the FFT is set to L = 256 samples; this value gives a good compromise between spectral resolution (related to place pitch) and temporal resolution (related to temporal pitch). The longer the FFT, the higher the frequency resolution and thus the lower the temporal resolution.

The linearly spaced FFT bins are then grouped into analysis bands. An analysis band is defined as the spectral information contained in the range allocated to two electrodes. For each analysis band, the Hilbert envelope is computed from the FFT bins. In order to improve the spectral resolution of the audio signal analysis, an interpolation based on a spectral peak locator [17] inside each analysis band is performed. The spectral peaks are an estimation of the most important frequencies. The frequency estimated by the spectral peak locator is used by the frequency weight map and the carrier synthesis. The carrier synthesis generates a pulse train with the frequency determined by the spectral peak locator in order to deliver temporal pitch information. The frequency weight map converts the frequency determined by the spectral peak locator into a current weighting proportion that is applied to the electrode pair associated with the analysis band.

All this information is combined and nonlinearly mapped to convert the acoustical amplitudes into electrical current amplitudes. For each stimulation cycle, the pairs of electrodes associated with one analysis band are stimulated simultaneously, but the pairs of channels are stimulated sequentially in order to reduce undesired channel interaction. Furthermore, the order of stimulation is selected to maximize the distance between consecutive analysis bands being stimulated. This approach further reduces channel interaction between stimulation sites. The next section presents each block of SpecRes in detail.

2.2.1. FFT and Hilbert Envelope. The FFT is performed on input blocks of L = 256 samples of the previously windowed audio signal:

$$x_w(l) = x(l)\,w(l), \quad l = 0, \ldots, L-1, \qquad (2)$$

where x(l) is the input signal and w(l) is a 256-point Blackman-Hanning window:

$$w(l) = \frac{1}{2}\left(0.42 - 0.5\cos\left(\frac{2\pi l}{L}\right) + 0.08\cos\left(\frac{4\pi l}{L}\right)\right) + \frac{1}{2}\left(0.5 - 0.5\cos\left(\frac{2\pi l}{L}\right)\right), \quad l = 0, \ldots, L-1. \qquad (3)$$

The FFT of the windowed input signal can be decomposed into its real and imaginary components as follows:

$$X(n) = \mathrm{FFT}(x_w(l)) = \mathrm{Re}\{X(n)\} + j\,\mathrm{Im}\{X(n)\}, \quad n = 0, \ldots, L-1, \qquad (4)$$

where

$$\mathrm{Re}\{X(n)\} \triangleq X_r(n) = \frac{1}{L}\sum_{l=0}^{L-1} x_w(l)\cos\left(\frac{2\pi n}{L}l\right), \qquad \mathrm{Im}\{X(n)\} \triangleq X_i(n) = \frac{1}{L}\sum_{l=0}^{L-1} x_w(l)\sin\left(\frac{2\pi n}{L}l\right). \qquad (5)$$

The linearly spaced FFT bins are then combined to provide the required number of analysis bands N. Because the number of electrodes in the Harmony implant is M = 16, the total number of analysis bands is N = M − 1 = 15. Table 1 presents the number of FFT bins assigned to each analysis band and its associated center frequency.

The Hilbert envelope is computed for each analysis band. The Hilbert envelope for the analysis band z is denoted by $HE_z$ and is computed from the FFT bins as follows:

$$H_{r_z}(\tau) = \sum_{n=n_{\mathrm{start}_z}}^{n_{\mathrm{end}_z}-1} \left[ X_r(n)\cos\left(\frac{2\pi n\tau}{L}\right) - X_i(n)\sin\left(\frac{2\pi n\tau}{L}\right) \right], \qquad H_{i_z}(\tau) = \sum_{n=n_{\mathrm{start}_z}}^{n_{\mathrm{end}_z}-1} \left[ X_r(n)\sin\left(\frac{2\pi n\tau}{L}\right) - X_i(n)\cos\left(\frac{2\pi n\tau}{L}\right) \right], \qquad (6)$$

where $H_{r_z}$ and $H_{i_z}$ are the real and imaginary parts of the Hilbert transform, $\tau$ is the delay within the window, and $n_{\mathrm{end}_z} = n_{\mathrm{start}_z} + N_z$.


[Figure 1 appears here: block diagram of SpecRes. The audio input passes through the front end and A/D converter and an L-point FFT; the L/2 bins are grouped into analysis bands 1 to N, each feeding an envelope detection block and a spectral peak locator, which in turn drive a frequency weight map and a carrier synthesis block; the mapping blocks convert the result into the currents for the electrode pairs (E1, E2), (E2, E3), ..., (EM−1, EM) every stimulation cycle Ts.]

Figure 1: Block diagram illustrating SpecRes.

Table 1: Number of FFT bins related to each analysis band and its associated center frequency in Hz. The FFT bins have been grouped in order to match the center frequencies of the standard HiRes filterbank used in routine clinical practice.

Analysis band z:             1    2    3    4    5    6     7     8     9     10    11    12    13    14    15
Number of bins Nz:           2    2    1    2    2    2     3     4     4     5     6     7     8     10    55
Start bin nstart_z:          5    7    9    10   12   14    16    19    23    27    32    38    45    53    63
Center freq. fcenter (Hz):   408  544  646  748  884  1020  1190  1427  1700  2005  2379  2821  3330  3942  6491

Specifically, for $\tau = L/2$, the Hilbert transform is calculated in the middle of the analysis window:

$$H_{r_z} = \sum_{n=n_{\mathrm{start}_z}}^{n_{\mathrm{end}_z}} X_r(n)(-1)^n, \qquad H_{i_z} = \sum_{n=n_{\mathrm{start}_z}}^{n_{\mathrm{end}_z}} X_i(n)(-1)^n. \qquad (7)$$

The Hilbert envelope $HE(\tau)$ is obtained from the Hilbert transform as follows:

$$HE(\tau) = \sqrt{H_{r_z}(\tau)^2 + H_{i_z}(\tau)^2}. \qquad (8)$$

To implement stimulation at different positions between two electrodes, each analysis channel can create multiple virtual channels by varying the proportion of current delivered to adjacent electrodes simultaneously. The weighting applied to each electrode is controlled by the spectral peak locator and the frequency weight map.
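The following sketch computes the per-band Hilbert envelope of (7)-(8) at tau = L/2 from the FFT of one (already windowed) 256-sample frame, using the band limits of Table 1; N_z bins per band are summed, even though the summation limits quoted in (6) and (7) differ by one.

# Sketch of the per-band Hilbert envelope of (7)-(8) evaluated at tau = L/2.
import numpy as np

L = 256
N_START = [5, 7, 9, 10, 12, 14, 16, 19, 23, 27, 32, 38, 45, 53, 63]   # Table 1
N_BINS  = [2, 2, 1, 2, 2, 2, 3, 4, 4, 5, 6, 7, 8, 10, 55]             # Table 1

def hilbert_envelopes(windowed_frame):
    X = np.fft.fft(windowed_frame) / L            # X_r + j X_i, as in (5)
    signs = (-1.0) ** np.arange(L)                # cos(pi*n) = (-1)^n at tau = L/2
    envelopes = []
    for start, nbins in zip(N_START, N_BINS):
        idx = np.arange(start, start + nbins)
        hr = np.sum(X.real[idx] * signs[idx])
        hi = np.sum(X.imag[idx] * signs[idx])
        envelopes.append(np.hypot(hr, hi))        # sqrt(hr^2 + hi^2), equation (8)
    return np.array(envelopes)

print(hilbert_envelopes(np.hanning(L) * np.random.randn(L)).shape)    # -> (15,)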

2.2.2. Spectral Peak Locator. Peak location is determined within each analysis band z. For a pure tone within a channel, the spectral peak location should estimate the frequency of the tone. The frequency resolution obtained with the FFT is half a bin. A bin represents a frequency interval of Fs/L Hz. The maximum resolution that can be achieved is therefore 67.96 Hz. However, it has been shown in [12] that patients are able to perceive a maximum of around 30 distinct pitch percepts between pairs of the most apical electrodes. Because the bandwidth associated with the most apical electrode pair is around 300 Hz and the maximum resolution is 30 pitch percepts, the spectral resolution required for the analysis should be around 10 Hz. This resolution is accomplished by using a spectral peak locator. Spectral peak location is computed in two steps. The first step is to determine the FFT bin within an analysis band with the most energy. The power e(n) in each bin equals the sum of the squared real and imaginary parts of that bin:

$$e(n) = X_r^2(n) + X_i^2(n). \qquad (9)$$

The second step consists of fitting a parabola around the bin $n_{\max_z}$ containing maximum energy in an analysis band z, that is, $e(n_{\max_z}) \geq e(n)$ for all $n \neq n_{\max_z}$ in that analysis band.


[Figure 2 appears here: spectral magnitude versus frequency (in bins) for the three points A1, A2, and A3 around the peak bin nmax, with the interpolated peak at offset c.]

Figure 2: Parabolic fitting between three FFT bins.

To describe the parabolic interpolation strategy, a coordinate system centered at $n_{\max}$ is defined; $e(n_{\max}-1)$ and $e(n_{\max}+1)$ represent the energies of the two adjacent bins. Taking the energies in dB, we have

$$A_1 = 20\log_{10}\left(e\left(n_{\max_z} - 1\right)\right), \qquad A_2 = 20\log_{10}\left(e\left(n_{\max_z}\right)\right), \qquad A_3 = 20\log_{10}\left(e\left(n_{\max_z} + 1\right)\right). \qquad (10)$$

The optimal location is computed by fitting a generic parabola

$$y(f) = a(f - c)^2 + b \qquad (11)$$

to the amplitude of the bin $n_{\max}$ and the amplitudes of the two adjacent bins and taking its maximum; a, b, and c are variables and f indicates frequency in Hz.

Figure 2 illustrates the parabolic interpolation [17, 18]. The center point or vertex c gives the interpolated peak location (in bins). The parabola is evaluated at the three bins nearest to the center point c:

$$y(-1) = A_1, \qquad y(0) = A_2, \qquad y(1) = A_3. \qquad (12)$$

The three samples can be substituted in the parabola defined in (11). This yields the frequency difference in FFT bins:

$$c = \frac{1}{2}\,\frac{A_1 - A_3}{A_1 - 2A_2 + A_3} \in \left[-\frac{1}{2}, \frac{1}{2}\right], \qquad (13)$$

and the estimate of the peak location (in bins) is

$$n^*_{\max_z} = n_{\max_z} + c. \qquad (14)$$

If the maximum bin within the channel is not a local maximum, which can only occur near the boundary of the channel, the spectral peak locator is placed at the boundary of the channel.
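A sketch of the two-step peak locator of (9)-(14); the boundary rule described above is reduced here to clipping the vertex offset to [-1/2, 1/2] and falling back to the coarse bin in degenerate cases.

# Sketch of the spectral peak locator of (9)-(14) within one analysis band.
import numpy as np

def spectral_peak(Xr, Xi, n_start, n_bins):
    """Return the interpolated peak position n*_max (in FFT bins) for one analysis band."""
    e = Xr**2 + Xi**2                              # bin power, equation (9)
    band = slice(n_start, n_start + n_bins)
    n_max = n_start + int(np.argmax(e[band]))      # coarse peak: maximum-energy bin in the band
    if n_max == 0 or n_max == len(e) - 1:          # sketch-level guard: no neighbour on one side
        return float(n_max)
    if min(e[n_max - 1], e[n_max], e[n_max + 1]) <= 0:
        return float(n_max)
    A1 = 20 * np.log10(e[n_max - 1])
    A2 = 20 * np.log10(e[n_max])
    A3 = 20 * np.log10(e[n_max + 1])
    denom = A1 - 2 * A2 + A3
    if denom == 0:
        return float(n_max)
    c = np.clip(0.5 * (A1 - A3) / denom, -0.5, 0.5)   # parabola vertex, equation (13)
    return n_max + float(c)                            # equation (14)

# Example: a 1 kHz tone analysed with a 256-point FFT at 17400 Hz (band 6 of Table 1).
fs, L = 17400.0, 256
x = np.hanning(L) * np.cos(2 * np.pi * 1000.0 * np.arange(L) / fs)
X = np.fft.fft(x) / L
print(spectral_peak(X.real, X.imag, n_start=14, n_bins=2) * fs / L)   # roughly 1000 Hz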

2.2.3. Frequency Weight Map. The purpose of the frequency weight map is to translate the spectral peak into a cochlear location. For each analysis band z, two weights $w_{z1}$ and $w_{z2}$ are calculated that will be applied to the two electrodes forming that analysis band. This can be achieved using the cochlear frequency-position function [19]

$$f = A\left(10^{ax}\right), \qquad (15)$$

where f represents the frequency in Hz and x the position in mm along the cochlea. A and a were set to 350 Hz and 0.07, respectively, considering the known dimensions of the CII and HiRes90k [20]. The locations associated with the electrodes were calculated by substituting their corresponding frequencies in the above equation. The location of each electrode is denoted by $x_z$ (z = 1, ..., M).

The peak frequencies are also translated to positions using (15). The location corresponding to a peak frequency in the analysis band z is denoted by $x_{z_p}$. To translate a cochlear location into weights that will be applied to the individual currents of each electrode, the peak location is subtracted from the location of the first electrode $x_z$ in a pair $(x_z, x_{z+1})$. The weight applied to the second electrode $x_{z+1}$ (higher frequency) of the pair is calculated using the following equation:

$$w_{z2} = \frac{x_{z_p} - x_z}{d_z}, \qquad (16)$$

and the weight applied to the first electrode $x_z$ of the pair is

$$w_{z1} = \frac{x_{z+1} - x_{z_p}}{d_z}, \qquad (17)$$

where $d_z$ is the distance in mm between the two electrodes forming an analysis band, that is,

$$d_z = |x_{z+1} - x_z|. \qquad (18)$$
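The weight map of (15)-(18) can be sketched as follows; the electrode frequencies in the example reuse the first two center frequencies of Table 1 purely for illustration.

# Sketch of the frequency weight map of (15)-(18), with A = 350 Hz and a = 0.07.
import math

A_HZ, SLOPE = 350.0, 0.07

def freq_to_place(f_hz):
    """Invert f = A*10^(a*x): cochlear place (mm) for a frequency in Hz."""
    return math.log10(f_hz / A_HZ) / SLOPE

def steering_weights(f_peak_hz, f_elec_lo_hz, f_elec_hi_hz):
    """Current weights (w_z1, w_z2) for the low/high electrode of an analysis band."""
    x_lo, x_hi, x_p = map(freq_to_place, (f_elec_lo_hz, f_elec_hi_hz, f_peak_hz))
    d = abs(x_hi - x_lo)                           # electrode spacing, equation (18)
    w2 = (x_p - x_lo) / d                          # weight on the higher electrode, (16)
    w1 = (x_hi - x_p) / d                          # weight on the lower electrode, (17)
    return w1, w2

# Example: a peak at 500 Hz between electrodes assumed at 408 Hz and 544 Hz.
print(steering_weights(500.0, 408.0, 544.0))       # weights sum to 1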

2.2.4. Carrier Synthesis. The carrier synthesis attempts to compensate for the low temporal resolution given by the FFT-based approach. The goal is to enhance temporal pitch perception by representing the temporal structure of the frequency corresponding to the spectral peak in each analysis band. Note that the electrodes are stimulated with a current determined by the HE at a constant rate determined by the CSR. The carrier synthesis modulates the Hilbert envelope of each analysis band with a frequency coinciding with the frequency of the spectral peak.

Furthermore, the modulation depth (relative amount of oscillation from peak to valley) is reduced with increasing frequency, as shown in Figure 3.

The carrier synthesis defines the phase variable $ph_{h,z}$ for each analysis band z and frame h, where $0 \leq ph_{h,z} \leq \mathrm{CSR} - 1$. During each frame h, $ph_{h,z}$ is increased by the minimum of the estimated frequency $f_{\max_z}$ and CSR:

$$ph_{h,z} = \left(ph_{h-1,z} + \min\left(f_{\max_z}, \mathrm{CSR}\right)\right) \bmod \mathrm{CSR}, \qquad (19)$$

where $f_{\max_z} = n^*_{\max_z}(F_s/L)$, h indicates the current frame, and mod indicates the modulo operator.


[Figure 3 appears here: modulation depth MD(f), from 0 to 1, plotted against frequency f from 0 to FR.]

Figure 3: Modulation depth as a function of frequency. FR is a constant of the algorithm equal to 2320 Hz, which is the maximum channel stimulation rate that can be delivered with the implant using the current steering technique.

The parameter s is defined for each analysis band z as follows:

$$s_z = \begin{cases} 1, & ph_{h,z} \leq \dfrac{\mathrm{CSR}}{2}, \\ 0, & \text{otherwise}. \end{cases} \qquad (20)$$

Then, the final carrier for each analysis band z is defined as

$$c_z = 1 - s_z\,\mathrm{MD}\left(f_{\max_z}\right), \qquad (21)$$

where $\mathrm{MD}(f_{\max_z})$ is the modulation depth function defined in Figure 3.
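A sketch of the carrier synthesis of (19)-(21). The exact shape of the modulation-depth curve is only shown graphically in Figure 3, so a linear roll-off from 1 at 0 Hz to 0 at FR = 2320 Hz is assumed here.

# Sketch of the carrier synthesis of (19)-(21) for one analysis band.
CSR = 2899.0   # channel stimulation rate quoted for the HiRes90k
FR  = 2320.0   # constant from Figure 3

def modulation_depth(f_hz):
    return max(0.0, 1.0 - f_hz / FR)              # assumed linear roll-off to 0 at FR

def carrier(phase_prev, f_max_hz):
    """Advance the phase accumulator of (19) and return (new_phase, carrier value c_z)."""
    phase = (phase_prev + min(f_max_hz, CSR)) % CSR           # equation (19)
    s = 1.0 if phase <= CSR / 2 else 0.0                      # equation (20)
    return phase, 1.0 - s * modulation_depth(f_max_hz)        # equation (21)

ph, c = carrier(0.0, 440.0)
print(ph, c)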

2.2.5. Mapping. The final step of the SpecRes strategy is to convert the envelope, weight, and carrier into the current magnitude to apply to each electrode pair associated with each analysis band. The mapping function is defined as in HiRes (1). For the two electrodes in the pair that comprise the analysis band, the currents delivered are given by

$$I_z = Y_z(\max(HE_z))\,w_{z1}\,c_z, \qquad (22)$$

$$I_{z+1} = Y_{z+1}(\max(HE_z))\,w_{z2}\,c_z, \qquad (23)$$

where z = 1, ..., M − 1. In the above equations, $Y_z$ and $Y_{z+1}$ are the mapping functions for the two electrodes forming an analysis band, $w_{z1}$ and $w_{z2}$ are the weights, $\max(HE_z)$ is the largest Hilbert envelope value computed since the previous mapping operation for the analysis band z, and $c_z$ is the carrier.
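The mapping step of (22)-(23) then reduces to a few multiplications per analysis band, as in the sketch below; the toy linear mapping in the example stands in for the compression function of (1).

# Sketch of the SpecRes mapping step of (22)-(23).
def band_currents(hilbert_env_max_db, w1, w2, c, map_lo, map_hi):
    """map_lo/map_hi are the per-electrode mapping functions Y_z of (1)."""
    I_lo = map_lo(hilbert_env_max_db) * w1 * c     # current on electrode z, equation (22)
    I_hi = map_hi(hilbert_env_max_db) * w2 * c     # current on electrode z+1, equation (23)
    return I_lo, I_hi

# Example with a toy linear mapping in place of (1):
toy_map = lambda x_db: 2.0 * x_db
print(band_currents(60.0, 0.3, 0.7, 1.0, toy_map, toy_map))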

2.3. The Sinusoid Extraction Strategy (SineEx). The new sinusoid extraction (SineEx) strategy is based on the general structure of the SpecRes strategy but incorporates a robust method for estimating spectral components of audio signals with high accuracy. A block diagram illustrating SineEx is shown in Figure 4.

The front-end, the filterbank, the envelope detector, and the mapping are identical to those used in the SpecRes strategy. However, in contrast to the spectral-peak-picking algorithm performed by SpecRes, a frequency estimator that uses an iterative analysis/synthesis algorithm selects the most important spectral components in a given frame of the audio signal. The analysis/synthesis algorithm models the frequency spectrum as a sum of sinusoids. Only the perceptually most important sinusoids are selected, using a psychoacoustic masking model.

The analysis/synthesis loop first defines a source model to represent the audio signal. The model's parameters are adjusted to best match the audio signal. Because of the small number of analysis bands in the Harmony system (N = 15), only a small number of parameters of the source model can be estimated. Therefore, the most complex task in SineEx is determining the few parameters that describe the input signal. The selection of the most relevant components is controlled by a psychoacoustic masking model in the analysis/synthesis loop. The model simulates the effect of simultaneous masking that occurs at the level of the basilar membrane in normal hearing.

The model estimates which sinusoids are masked the least in order to drive the stimulation of the electrodes. The idea behind this model is to deliver to the cochlear implant only those signal components that are most clearly perceived by normal-hearing listeners. A psychoacoustic masking model used to control the selection of sinusoids in an analysis/synthesis loop has been shown to provide improved sound quality with respect to other methods in normal hearing [21].

For example, other applications of this technique, where stimulation was restricted to the number of physical electrodes, demonstrated that the interaction between channels could be reduced by selecting fewer electrodes for stimulation. Therefore, because current steering will allow stimulation of significantly more cochlear sites compared to nonsimultaneous stimulation strategies, the masking model may contribute even further to the reduction of channel interaction and therefore improve sound perception. In [22] a psychoacoustic masking model was also used to select the perceptually most important components for cochlear implants. One aspect assumed in [22] was that the negative effects of channel interaction on speech understanding could be reduced by selecting fewer bands for stimulation.

The parameters extracted for the source model are then used by the frequency weight map and the carrier synthesis to code place pitch through current steering and to code temporal pitch by modulating the Hilbert envelopes, just as in SpecRes. Note that a high-accuracy estimation of frequency components is required in order to take advantage of the potential frequency resolution that can be delivered using current steering.

For parametric representations of sound signals, as in SineEx, the definition of the source model, the method used to select the model's parameters, and the accuracy in the extraction of these parameters play a very important role in increasing sound perception performance [21].


[Figure 4 appears here: block diagram of SineEx. As in Figure 1, the audio input passes through the front end, A/D converter, and L-point FFT, and the bins X(n) are grouped into analysis bands with envelope detection; in place of the per-band spectral peak locator, a frequency estimator built around an analysis/synthesis loop with a psychoacoustic masking model feeds the frequency weight map and the carrier synthesis, whose outputs are nonlinearly mapped onto the electrode pairs (E1, E2), ..., (EM−1, EM) every stimulation cycle Ts.]

Figure 4: Block diagram illustrating SineEx.

The next sections present the source model and the algorithm used to estimate the model's parameters based on an analysis/synthesis procedure.

2.3.1. Source Model. Advanced models of the audio source are advantageous for modeling audio signals with the fewest parameters. To develop the SineEx strategy, the source model had to be related to the current-steering capabilities of the implant. In SineEx, the source model decomposes the input signal into sinusoidal components. A source model based on sinusoids provides an accurate estimation of the spectral components that can be delivered through current steering. Individual sinusoids are described by their frequencies, amplitudes, and phases. The incoming sound x(l) is modeled as a summation of N sinusoids as follows:

$$x(l) \approx \hat{x}(l) = \sum_{i=1}^{N} c_i e^{j(2\pi m_i l/L + \phi_i)}, \qquad (24)$$

where x(l) is the input signal, $\hat{x}(l)$ is the model of the signal, $c_i$ is the amplitude, $m_i$ is the frequency, and $\phi_i$ is the phase of the ith sinusoid.

2.3.2. Parameter Estimation for the Source Model. The parameters of the individual sinusoids are extracted iteratively in an analysis/synthesis loop [23]. The algorithm uses as source model a dictionary of complex exponentials $s_m(l) = e^{j(2\pi m/L)(l-(L-1)/2)}$ ($l = 1, \ldots, L$) with P elements ($m = 1, \ldots, P$) [24]. The analysis/synthesis loop is started with the windowed segment of the input signal x(l) as the first residual $r_1(l)$:

$$r_1(l) = x(l)\,w(l), \quad l = 0, \ldots, L-1, \qquad (25)$$

where x(l) is the input audio signal and w(l) is the same Blackman-Hanning window as in SpecRes (3).

The window w(l) is also applied to the dictionary elements:

$$g_m(l) = w(l)\,s_m(l) = w(l)\,e^{j(2\pi m/L)(l-(L-1)/2)}. \qquad (26)$$

It is assumed that $g_m(l)$ has unity norm, that is, $\|g_m(l)\| = 1$ for $l = 0, \ldots, L-1$.

For the next stage, since x(l) and $r_i(l)$ are real valued, the next residual can be calculated as follows:

$$r_{i+1}(l) = r_i(l) - c_i g_{m_i}(l) - c_i^* g_{m_i}^*(l). \qquad (27)$$

The estimation consists of determining the optimal element $g_{m_i}(l)$ and a corresponding weight $c_i$ that minimize the norm of the residual:

$$\min \|r_{i+1}(l)\|. \qquad (28)$$


For a given m, the optimal real and imaginary components of $c_i$ ($c_i = a_i + jb_i$) according to (28) can be found by setting the partial derivatives of $\|r_{i+1}(l)\|$ with respect to $a_i$ and $b_i$ to 0:

$$\frac{\partial \|r_{i+1}(l)\|}{\partial a_i} = 0, \qquad \frac{\partial \|r_{i+1}(l)\|}{\partial b_i} = 0. \qquad (29)$$

This leads to the following equation system:

$$\begin{pmatrix} \sum_l \mathrm{Re}\{g_m(l)\}\,\mathrm{Re}\{g_m(l)\} & \sum_l \mathrm{Re}\{g_m(l)\}\,\mathrm{Im}\{g_m(l)\} \\ \sum_l \mathrm{Re}\{g_m(l)\}\,\mathrm{Im}\{g_m(l)\} & \sum_l \mathrm{Im}\{g_m(l)\}\,\mathrm{Im}\{g_m(l)\} \end{pmatrix} \begin{pmatrix} 2a \\ -2b \end{pmatrix} = \begin{pmatrix} \sum_l \mathrm{Re}\{g_m(l)\}\,r_i(l) \\ \sum_l \mathrm{Im}\{g_m(l)\}\,r_i(l) \end{pmatrix}. \qquad (30)$$

As the window used is symmetric, w(l) = w(−l), $\mathrm{Re}\{g_m(l)\}$ and $\mathrm{Im}\{g_m(l)\}$ are orthogonal, that is, the scalar product between them is 0:

$$\sum_l \mathrm{Re}\{g_m(l)\}\,\mathrm{Im}\{g_m(l)\} = 0, \qquad (31)$$

and the previous equations can be simplified as follows:

$$a = \frac{1}{2}\,\frac{\sum_l \mathrm{Re}\{g_m(l)\}\,r_i(l)}{\sum_l \mathrm{Re}\{g_m(l)\}\,\mathrm{Re}\{g_m(l)\}}, \qquad b = -\frac{1}{2}\,\frac{\sum_l \mathrm{Im}\{g_m(l)\}\,r_i(l)}{\sum_l \mathrm{Im}\{g_m(l)\}\,\mathrm{Im}\{g_m(l)\}}. \qquad (32)$$

The element $g_{m_i}$ of the dictionary selected for the ith iteration is obtained by minimizing $\|r_{i+1}(l)\|$. This is equivalent to maximizing $c_i$, as can be observed in (27). Therefore, the element selected, $g_{m_i}$, corresponds to the one having the largest scalar product with the signal $r_i(l)$ for $l = 0, \ldots, L-1$.

Finally, the amplitude $c_i$, frequency $f_{\max_i}$, and phase $\phi_i$ of the ith sinusoid are

$$c_i = \sqrt{a_i^2 + b_i^2}, \qquad f_{\max_i} = n_{\max_i}\frac{2\pi}{L}, \qquad \phi_i = \arctan\left(\frac{b_i}{a_i}\right). \qquad (33)$$
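A simplified, time-domain rendering of one analysis/synthesis pass (the strategy itself uses the frequency-domain implementation of Section 2.3.3, and the psychoacoustic weighting of Section 2.3.4 is omitted here); a plain Hanning window is used instead of the Blackman-Hanning window of (3).

# Time-domain sketch of the sinusoid extraction of Section 2.3.2.
import numpy as np

L = 256

def extract_sinusoids(x, num_sinusoids=3):
    n = np.arange(L)
    w = np.hanning(L)                              # stand-in window; the strategy uses (3)
    residual = x * w                               # r_1(l), equation (25)
    params = []
    for _ in range(num_sinusoids):
        best = None
        for m in range(1, L // 2):                 # dictionary of windowed exponentials, (26)
            g = w * np.exp(1j * 2 * np.pi * m / L * (n - (L - 1) / 2))
            gr, gi = g.real, g.imag
            a = 0.5 * np.dot(gr, residual) / np.dot(gr, gr)      # equation (32)
            b = -0.5 * np.dot(gi, residual) / np.dot(gi, gi)
            c = complex(a, b)
            if best is None or abs(c) > abs(best[1]):
                best = (m, c, g)
        m, c, g = best
        residual = residual - (c * g + np.conj(c) * np.conj(g)).real   # equation (27)
        params.append((m * 17400.0 / L, abs(c), np.arctan2(c.imag, c.real)))  # freq (Hz), amp, phase
    return params

x = np.cos(2 * np.pi * 1000.0 / 17400.0 * np.arange(L))
print(extract_sinusoids(x, 1))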

2.3.3. Analysis/Synthesis Loop Implementation. The analysis/synthesis algorithm can be efficiently implemented in the frequency domain [25]. The frequency domain implementation was used to incorporate the algorithm into the Harmony system. A block diagram illustrating the implementation is presented in Figure 5.

The iterative procedure uses as input the FFT spectrum of an audio signal, X(n). The magnitude spectrum |X(n)| is then calculated. It is assumed that at the ith iteration, i − 1 sinusoids have already been extracted and a signal $S_{i-1}(n)$ containing all of these sinusoids has been synthesized. The magnitude spectrum $|S_{i-1}(n)|$ is calculated.

The synthesized spectrum is subtracted from the original spectrum and then weighted by the magnitude masking threshold $I_{w_{i-1}}(n)$ caused by the sinusoids already synthesized. The detection of the maximum ratio $E_{n_{\max}}$ is calculated as follows:

$$E_{n_{\max_i}} = \max_n\left(0, \frac{|X(n)| - |S_{i-1}(n)|}{\left|I_{w_{i-1}}(n)\right|}\right), \qquad n_{\max_i} = \arg\max_n\left(0, \frac{|X(n)| - |S_{i-1}(n)|}{\left|I_{w_{i-1}}(n)\right|}\right), \quad n = 0, \ldots, L-1, \qquad (34)$$

where $I_{w_i}(n)$ is the psychoacoustic masking model at the ith iteration of the analysis/synthesis loop. The frequency $n_{\max_i}$ is used as a coarse frequency estimate of each sinusoid. Its accuracy corresponds to the FFT frequency resolution.

The spectral resolution of the estimated frequency is improved using a high-accuracy parameter estimation on the neighboring frequencies of $n_{\max_i}$. The high-accuracy estimator implements (30) iteratively in the frequency domain. The algorithm first takes the positive part of the spectrum X(n), that is, the analytic signal of x(l). As the algorithm is implemented in the frequency domain, the dictionary elements $g_m(l)$ are transformed into the frequency domain. If $G_0(n)$ denotes the Fast Fourier Transform of $g_0(l) = w(l)$, the frequency domain representation of the other dictionary elements can be derived by a simple displacement of the frequency axis, $G_m(n) = G_0(n - m)$. For this reason, $G_0(n)$ is also referred to as the "prototype." Note that, as the window w(l) is known (3), the frequency resolution of the prototype can be increased just by increasing the length of the FFT used to transform $g_0(l)$. Because most of the energy of the prototype $G_0(n)$ concentrates in a small number of samples around the frequency n = 0, only a small section of the prototype is stored. By reducing the length of the prototype, the complexity of the algorithm drops significantly in comparison to the time domain implementation presented in Section 2.3.2.

Equation (30) is solved iteratively as follows. In the first iteration (r = 1), the prototype is centered on the coarse frequency $n_{\max_{i,r}} = n_{\max_i}$. A displacement variable $\delta_r$ is set to $1/2^r$, where r indicates the iteration index. The correlation is calculated at $n_{\max_{i,r}} - \delta_r$, $n_{\max_{i,r}}$, and $n_{\max_{i,r}} + \delta_r$. The position leading to the maximum correlation at these three locations is denoted by $n_{\max_{i,r+1}}$. For the next iteration (r + 1), the value $\delta_{r+1}$ is halved ($\delta_{r+1} = 1/2^{r+1}$) and the prototype is centered on $n_{\max_{i,r+1}}$. The correlation is calculated at $n_{\max_{i,r+1}} - \delta_{r+1}$, $n_{\max_{i,r+1}}$, and $n_{\max_{i,r+1}} + \delta_{r+1}$, and the maximum correlation is picked. This procedure is repeated several times, and the final iteration gives the estimated frequency, denoted by $n^*_{\max_i}$.
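The delta-halving refinement can be sketched as follows; a direct time-domain correlation replaces the stored frequency-domain prototype G_0(n) of the text, so this is an illustration of the search, not of the efficient implementation.

# Sketch of the delta-halving frequency refinement around a coarse peak bin.
import numpy as np

L = 256

def refine_peak(residual, n_coarse, iterations=6):
    n = np.arange(L)
    w = np.hanning(L)                                   # stand-in for the window of (3)

    def corr(m):                                        # |<residual, g_m>| at fractional bin m
        g = w * np.exp(1j * 2 * np.pi * m / L * (n - (L - 1) / 2))
        return abs(np.vdot(g, residual))

    m_best = float(n_coarse)
    for r in range(1, iterations + 1):
        delta = 1.0 / 2**r                              # delta_r = 1 / 2^r
        candidates = (m_best - delta, m_best, m_best + delta)
        m_best = max(candidates, key=corr)              # keep the position with maximum correlation
    return m_best                                       # refined peak position (in bins)

tone = np.hanning(L) * np.cos(2 * np.pi * 20.3 / L * np.arange(L))   # a tone between bins 20 and 21
print(refine_peak(tone, 20))                                         # -> close to 20.3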

2.3.4. Psychoacoustic Masking Model. The analysis/synthesis loop of [25] is extended by a simple psychoacoustic model for the selection of the most relevant sinusoids.


[Figure 5 appears here: the magnitude spectrum |X(n)| is compared with the magnitude |S_{i-1}(n)| of the sinusoids synthesized so far, the difference is divided by the masking threshold, and the arg max yields n_max_i; the frequency, amplitude, and phase estimation then delivers (f_i, c_i, phi_i) to the synthesis and psychoacoustic masking model blocks, which close the loop.]

Figure 5: Frequency domain implementation of the analysis/synthesis loop including a psychoacoustic masking model for extraction and parameter estimation of individual sinusoids.

The model is a simplified implementation of the masking model used in [22]. The effect of masking is modeled using a spreading masking function L(z). This function has been modeled using a triangular shape with left slope $s_l$, right slope $s_r$, and peak offset $a_v$ as follows:

$$L_i(z) = \begin{cases} HE_{\mathrm{dB}_i} - a_v - s_l\,(z_i - z), & z < z_i, \\ HE_{\mathrm{dB}_i} - a_v - s_r\,(z - z_i), & z \geq z_i. \end{cases} \qquad (35)$$

The amplitude of the spreading function is derived from the Hilbert envelope in decibels, $HE_{\mathrm{dB}_i} = 20\log_{10}(HE(z))$, associated with the analysis band containing the sinusoid extracted at iteration i of the analysis/synthesis loop. The sound intensity $I_i(z)$ is calculated as

$$I_i(z) = 10^{L_i(z)/20}, \quad z = 1, \ldots, M. \qquad (36)$$

The superposition of thresholds is simplified as a linear addition of thresholds (37) in order to reduce the number of calculations:

$$I_{T_i}(z) = \sum_{k=0}^{i} I_k(z), \quad z = 1, \ldots, M. \qquad (37)$$

The spreading function has been defined in the nonlinear frequency domain, that is, in the analysis band domain z. As the sinusoids are extracted in the uniformly spaced frequency domain of the L-FFT, the masking threshold must be unwarped from the analysis band domain into the uniformly spaced frequency domain. The unwarping is accomplished by linearly interpolating the spreading function, without considering that the two scales have different energy densities, as follows:

$$I_{w_i}(n) = I_{T_i}(z-1) + \left(n - n_{\mathrm{center}}(z-1)\right) \times \frac{I_{T_i}(z) - I_{T_i}(z-1)}{n_{\mathrm{center}}(z) - n_{\mathrm{center}}(z-1)}, \quad z = 1, \ldots, M, \; i = 1, \ldots, N, \qquad (38)$$

where M denotes the number of analysis bands, N gives the number of sinusoids selected, and $n_{\mathrm{center}}(z)$ is the center frequency of the analysis band z in bins (see Table 1):

$$n_{\mathrm{center}}(z) = \frac{n_{\mathrm{start}_{z+1}} - n_{\mathrm{start}_z}}{2}. \qquad (39)$$

In normal hearing, simultaneous masking occurs at the level of the basilar membrane. The parameters that define the spread of masking can be estimated empirically with normal hearing listeners. Simultaneous masking effects can be used in cochlear implant processing to reduce the amount of data that is sent through the electrode-nerve interface [22]. However, because simultaneous masking data is not readily available from cochlear implant users, the data from normal hearing listeners were incorporated into SineEx. The choice of the parameters that define the spread of masking requires more investigation, and they should probably be adapted in the future based upon the electrical spread of masking for each individual.

The parameters that define the spreading function were configured to match the masking effect produced by tonal components [26, 27] in normal hearing listeners, since the maskers are the sinusoids extracted by the analysis/synthesis loop. The left slope was set to $s_l$ = 40 dB/band, the right slope to $s_r$ = 30 dB/band, and the attenuation to $a_v$ = 15 dB.
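With these values, the spreading function of (35) and the superposition of (36)-(37) can be sketched directly:

# Sketch of the triangular spreading function (35) and linear superposition (36)-(37).
import numpy as np

SL, SR, AV = 40.0, 30.0, 15.0     # left slope, right slope (dB/band), attenuation (dB)
M = 15                            # number of analysis bands

def spreading_db(he_db, z_i):
    """L_i(z) of (35) for a masker of level he_db (dB) in band z_i (1-based)."""
    z = np.arange(1, M + 1, dtype=float)
    return np.where(z < z_i, he_db - AV - SL * (z_i - z), he_db - AV - SR * (z - z_i))

def masking_threshold(maskers):
    """Linear superposition (37) of the intensities (36) of a list of (he_db, band) maskers."""
    total = np.zeros(M)
    for he_db, z_i in maskers:
        total += 10.0 ** (spreading_db(he_db, z_i) / 20.0)     # equation (36)
    return total

print(masking_threshold([(60.0, 4), (50.0, 10)]))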

SineEx is an N-of-M strategy because only those bands containing a sinusoid are selected for stimulation. The analysis/synthesis loop chooses N sinusoids iteratively in order of their "significance." The number of virtual channels activated in a stimulation cycle is controlled by increasing or decreasing the number of extracted sinusoids N. It should be noted that the sinusoids are extracted over the entire spectrum and are not restricted to each analysis band as in SpecRes. Therefore, in some cases, more than one sinusoid may be assigned to the same analysis band and electrode pair. In those situations, only the most significant sinusoid is selected for stimulation, because only one virtual channel can be created in each analysis band during one stimulation cycle.

2.4. Objective Analysis: HiRes, SpecRes, and SineEx. Objective experiments have been performed to test the three strategies: HiRes, SpecRes, and SineEx. The strategies have been evaluated by analyzing the stimulation patterns produced by each strategy for synthetic and natural signals. The stimulation patterns represent the current level applied to each location $l_{\mathrm{exc}}$ along the electrode array in each time interval or frame h. The total number of locations $L_{\mathrm{sect}}$ is set to 16000 in this analysis.


The number of locations associated with each electrode, $n_{\mathrm{loc}}$, is

$$n_{\mathrm{loc}} = \frac{L_{\mathrm{sect}}}{M}, \qquad (40)$$

where M indicates the number of electrodes. The location of each electrode is $l_{\mathrm{el}_z} = (z-1)\,n_{\mathrm{loc}}$, $z = 1, \ldots, M$. The stimulation pattern is obtained as follows. First, the total current produced by the two electrodes of an analysis channel at frame h is calculated:

$$Y_{T_z}(h) = Y_z(h) + Y_{z+1}(h), \quad z = 1, \ldots, M-1, \qquad (41)$$

where $Y_z(h)$ and $Y_{z+1}(h)$ denote the currents applied to the first and second electrodes forming an analysis channel (22). Then, the location of excitation is obtained as follows:

$$l_{\mathrm{exc}} = l_{\mathrm{el}_z}\,\frac{Y_z(h)}{Y_{T_z}(h)} + l_{\mathrm{el}_{z+1}}\,\frac{Y_{z+1}(h)}{Y_{T_z}(h)}, \qquad (42)$$

where $l_{\mathrm{el}_z}$ and $l_{\mathrm{el}_{z+1}}$ denote the locations of the first and the second electrode in the pair forming an analysis channel. Note that for sequential, nonsimultaneous stimulation strategies, $Y_{z+1}(h)$ is set to 0 and, therefore, the location of excitation $l_{\mathrm{exc}}$ coincides with the location of the electrode $l_{\mathrm{el}_z}$; for sequential stimulation strategies, $z = 1, \ldots, M$. Finally, $l_{\mathrm{exc}}$ is rounded to the nearest integer, that is, $l_{\mathrm{exc}} = [l_{\mathrm{exc}}]$, and the excitation pattern $S_{\mathrm{exc}}$ at frame h and location $l_{\mathrm{exc}}$ is expressed as

$$S_{\mathrm{exc}}(l_{\mathrm{exc}}, h) = Y_{T_z}(h). \qquad (43)$$
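A sketch of the excitation-pattern construction of (40)-(43) for a single frame of per-electrode currents:

# Sketch of the excitation pattern of (40)-(43) for one frame.
import numpy as np

M, L_SECT = 16, 16000
N_LOC = L_SECT // M                                  # locations per electrode, equation (40)
L_EL = [(z - 1) * N_LOC for z in range(1, M + 1)]    # electrode locations l_el_z

def excitation_pattern(currents):
    """currents: per-electrode currents Y_z(h) for one frame (length M)."""
    pattern = np.zeros(L_SECT)
    for z in range(M - 1):                           # analysis channels (0-based here)
        y_total = currents[z] + currents[z + 1]                               # equation (41)
        if y_total == 0:
            continue
        l_exc = (L_EL[z] * currents[z] + L_EL[z + 1] * currents[z + 1]) / y_total   # equation (42)
        pattern[int(round(l_exc))] = y_total                                  # equation (43)
    return pattern

frame = np.zeros(M); frame[5], frame[6] = 0.3, 0.7   # current steered between two adjacent electrodes
print(np.nonzero(excitation_pattern(frame))[0])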

The first signal used to analyze the strategies was a sweeptone of constant amplitude and varying frequency from300 Hz to 8700 Hz during 1 second. The spectrogram ofthis signal is shown in Figure 6(a). The sweep tone hasbeen processed with HiRes, SpecRes, and SineEx and thestimulation patterns produced by each strategy are presentedin Figures 6(b), 6(c), and 6(d), respectively.

In HiRes, the location of excitation always coincides with the position of the electrodes. However, in SpecRes and SineEx, the location of excitation can be steered between two electrodes using simultaneous stimulation.

Moreover, it should be remarked that the frequency estimation performed by SineEx is more precise than with SpecRes. It can be observed from Figure 6(d) that throughout the whole signal almost exclusively two neighboring electrodes (one virtual channel) are selected for stimulation. As a result, only one virtual channel is used to represent the single frequency present at the input. In the case of SpecRes (Figure 6(c)), more than one virtual channel is generated to represent a single sinusoid in the input signal. This is caused by the simpler modeling approach used by SpecRes to represent sinusoids and should cause smearing in pitch perception, because different virtual channels are combined to represent a single frequency. White Gaussian noise was then added to the same sweep signal at a total SNR of 10 dB. The stimulation patterns obtained in noise are presented in Figures 7(b), 7(c), and 7(d). Figure 7(b) shows the stimulation pattern generated by HiRes for the noisy sweep tone. It can be observed that HiRes mixes the noise and the sweep tone in terms of place of excitation, as the location of excitation coincides with the electrodes. This should make it difficult to separate the tone from the noise. Figures 7(c) and 7(d) present the stimulation patterns when processing the noisy sweep tone with SpecRes and SineEx, respectively. It can be observed that when noise is added, SpecRes stimulates the electrodes more often than SineEx. As white Gaussian noise is added, frequency components are distributed across the whole frequency domain. SpecRes selects peaks of the spectrum without making any model assumption about the input signal, so noise components are treated as if they were pure-tone components. This should lead to the perception of a tonal signal when in reality the signal is noisy. SineEx, however, is able to estimate and track the frequency of the sweep tone because it matches the sinusoidal model. In contrast, the added white Gaussian noise does not match the sinusoidal model, and those parts of the spectrum containing noise components are not selected for stimulation. On the one hand, this test demonstrates the potential robustness of SineEx in noisy situations for representing tonal or sine-like components. On the other hand, the experiment shows the limitations of SineEx in modeling noise-like signals such as some consonants.
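The two test signals described above can be reproduced with a few lines of code. The sketch below assumes a linear sweep and a 17.4 kHz sampling rate (the rate quoted later for the psychoacoustic stimuli); both choices, and the simple RMS-based scaling to a 10 dB SNR, are our assumptions.

```python
import numpy as np

fs = 17400                                   # assumed sampling rate (Hz)
t = np.arange(int(1.0 * fs)) / fs            # 1 second of samples

# Linear sweep from 300 Hz to 8700 Hz with constant amplitude.
f0, f1 = 300.0, 8700.0
phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) * t**2 / t[-1])
sweep = np.sin(phase)

# Add white Gaussian noise at a total SNR of 10 dB.
snr_db = 10.0
noise = np.random.randn(len(sweep))
noise *= np.sqrt(np.mean(sweep**2) / (10**(snr_db / 10) * np.mean(noise**2)))
noisy_sweep = sweep + noise
print(10 * np.log10(np.mean(sweep**2) / np.mean(noise**2)))   # ~10 dB by construction
```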

A natural speech signal, the token "asa" uttered by a male voice, was also processed with HiRes, SineEx, and SpecRes. Figures 8(b), 8(c), and 8(d) present the stimulation patterns obtained with each strategy.

In HiRes, the location of excitation coincides with the position of the electrodes. This limits how accurately formant frequencies can be coded, because the spectral resolution of HiRes is bounded by the number of implanted electrodes. Formants are known to play a key role in speech recognition. The poor representation of formants with HiRes can be observed by comparing the stimulation pattern generated by HiRes (Figure 8(b)) with the spectrogram presented in Figure 8(a). Using SpecRes, the formants can be represented with improved spectral resolution compared to HiRes, since the location of excitation can be varied between two electrodes (Figure 8(c)). However, the lower accuracy of the peak-detector-based method used by SpecRes to extract the most meaningful frequencies makes the formants less distinguishable than with SineEx (Figure 8(d)). SpecRes selects frequency components without making a model assumption about the incoming sound; noise and tonal components are therefore mixed, causing possible confusions between them. In SineEx, both "a" vowels can be properly represented as a sum of sinusoids. However, the consonant "s", which is a noise-like component, is not properly represented using a purely sinusoidal model.

SineEx and SpecRes combine the current steering technique with a method to improve temporal coding, by adding the temporal structure of the frequency extracted in each analysis band. This temporal enhancement was incorporated into SineEx and SpecRes in order to compensate for the lower temporal resolution of the 256-point FFT used by these strategies in comparison to the IIR filterbank used by HiRes.


Figure 6: Stimulation patterns obtained with (b) HiRes, (c) SpecRes, and (d) SineEx in quiet when the input signal is the 1-second sweep tone of constant amplitude (70 dB) and frequency varying from 300 Hz to 8700 Hz shown in (a). The horizontal axis represents time in seconds, and the vertical axis the electrode location. The applied current level (CL) is coded by the colors given in the color bars. The location of excitation is obtained as presented in Section 2.4.

For this reason, we assume that any hypothetical improvement of pitch perception provided by SineEx or SpecRes would be caused by the current steering technique rather than by the temporal enhancement technique.

As a final comment on the objective analysis, since SineEx generally selects fewer frequencies than SpecRes, this strategy has the potential to reduce interaction between channels and to significantly reduce power consumption in comparison to SpecRes. This can be confirmed by counting the number of channels stimulated by HiRes, SpecRes, and SineEx during the presentation of 50 sentences from a standardized sentence test [28]. The CSR was set to 2320 stimulations/second for all three strategies. Table 2 presents the total number of channels stimulated by each strategy.

As can be observed from Table 2, the number of stimulations performed by SpecRes is roughly double the number performed by HiRes.

Table 2: Number of stimulations for 50 sentences of the HSM sentence test [28] with HiRes, SpecRes, and SineEx.

                       | HiRes   | SpecRes   | SineEx
Number of stimulations | 464,895 | 1,087,790 | 536,878

However, as SpecRes divides the current between two electrodes, both strategies would lead to a similar power consumption. In SineEx, however, fewer channels are stimulated, which could lead to an improvement in power consumption.
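The power comparison sketched above can be made explicit with the totals from Table 2. The sketch below weights each current-steered stimulation by 0.5 to reflect that its current is divided across two electrodes; this weighting, and the assumption that stimulations carry comparable average current, are crude simplifications introduced here for illustration only.

```python
# Back-of-the-envelope comparison of delivered charge per strategy.
# Assumption: every stimulation delivers a comparable total current, and a
# current-steered stimulation splits that current across two electrodes (0.5 weight).
counts = {"HiRes": 464895, "SpecRes": 1087790, "SineEx": 536878}   # from Table 2
weights = {"HiRes": 1.0, "SpecRes": 0.5, "SineEx": 0.5}

relative_charge = {k: counts[k] * weights[k] for k in counts}
for name, q in relative_charge.items():
    print(f"{name}: {q / relative_charge['HiRes']:.2f} x HiRes")
# -> HiRes: 1.00, SpecRes: ~1.17 (similar), SineEx: ~0.58 (lower)
```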

3. Study Design

HiRes, SpecRes, and SineEx were incorporated into the research platform Speech Processor Application Framework (SPAF) designed by Advanced Bionics.


Figure 7: Stimulation patterns obtained with (b) HiRes, (c) SpecRes, and (d) SineEx in noise when the input signal is the 1-second sweep tone of constant amplitude (70 dB) and frequency varying from 300 Hz to 8700 Hz, mixed with white Gaussian noise (SNR = 10 dB), shown in (a). The horizontal axis represents time in seconds, and the vertical axis the electrode location. The applied current level (CL) is coded by the colors given in the color bars. The location of excitation is obtained as presented in Section 2.4.

Using this platform, a chronic trial was conducted at the hearing center of the Medical University of Hannover with 9 Harmony implant users. The SPAF and the three strategies were implemented in the Advanced Bionics bodyworn Platinum Series Processor (PSP). The aim of the study was to further investigate the benefits of virtual channels or current steering after a familiarization period. Subjects were tested with all three strategies (HiRes, SpecRes, and SineEx). The study was divided into two symmetrical phases. In the first phase, each strategy was given to each study participant for four weeks and then evaluated. The order in which the strategies were given to each patient was randomized. In the second phase of the study, the strategies were given in reverse order with respect to the first phase, and each strategy was again evaluated after 4 weeks. The total length of the study for each subject was therefore 24 weeks. The study participants were selected because of their good hearing abilities in quiet and noisy environments and for their motivation to listen to music with their own clinical program. The participants were not informed about the strategy they were using.

Frequency Discrimination. The aim of this task was to determine whether current steering strategies could deliver better pitch perception than classical sequential stimulation strategies. Frequency discrimination was evaluated with a three-alternative forced-choice (3AFC) task using an adaptive method [29]. Audio signals were delivered to the cochlear implant recipient via the direct audio input of the PSP. Stimuli were generated and controlled by the Psycho-Acoustic Test Suite (PACTS) software developed by Advanced Bionics. The stimuli consisted of 500-millisecond pure tones sampled at 17.4 kHz and ramped on and off over 10 milliseconds with a raised cosine. The reference frequencies were 1280 Hz and 2904 Hz. Each subject was presented with three stimuli in each trial.


Figure 8: (a) Speech token "asa" uttered by a male voice and its spectrogram. (b) Stimulation pattern obtained with HiRes. (c) Stimulation pattern obtained with SpecRes. (d) Stimulation pattern obtained with SineEx. The horizontal axis represents time in seconds, and the vertical axis the electrode location. The applied current level (CL) is coded by the colors given in the color bars. The location of excitation is obtained by linearly interpolating the electrical amplitude applied to the pairs of simultaneously stimulated electrodes.

Two of the stimuli consisted of a tone burst at the reference frequency, which was fixed during the whole run. The third stimulus consisted of a tone burst at twice the reference frequency (the probe frequency). The presentation order of the stimuli across the three intervals was randomized. The subject was asked to identify the interval containing the stimulus that was higher in pitch. After two consecutive correct answers, the frequency of the probe stimulus was decreased by a factor of 2^{1/12}. After each incorrect answer, the frequency of the probe stimulus was increased by two times this factor, leading to an asymptotic average of 71% correct responses [29]. The procedure was continued until 8 reversals were obtained, and the mean probe frequency over the last four reversals was taken as the result for that particular run. This result is termed the frequency difference limen (FDL). Intensity was roved by randomly varying the electrical output gain from 85% to 110% of the dynamic range to minimize loudness cues. The experiment was performed twice for each subject and the mean value of both runs was calculated.
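The adaptive rule just described can be sketched as below. The code interprets "increased by two times this factor" as applying the step factor twice (2^{2/12}); that reading, the simulated listener, and the naming are our assumptions, not details of the PACTS implementation.

```python
import numpy as np

def track_fdl(respond, f_ref, down=2**(1/12), up=2**(2/12), n_reversals=8,
              max_trials=1000):
    """Weighted up-down tracking of the frequency difference limen (FDL).

    respond(f_ref, f_probe) -> True if the listener answers correctly.
    The probe starts one octave above the reference, moves down after two
    consecutive correct answers and up after each error; the run stops after
    n_reversals reversals and the mean probe frequency over the last four
    reversals is returned as a percentage of the reference frequency.
    """
    f_probe, n_correct, last_dir, reversals = 2.0 * f_ref, 0, 0, []
    for _ in range(max_trials):
        if len(reversals) >= n_reversals:
            break
        if respond(f_ref, f_probe):
            n_correct += 1
            if n_correct < 2:
                continue                       # no step until two correct in a row
            n_correct, direction = 0, -1
            f_probe /= down
        else:
            n_correct, direction = 0, +1
            f_probe *= up
        if last_dir and direction != last_dir:
            reversals.append(f_probe)          # record the probe at each reversal
        last_dir = direction
    return 100.0 * (np.mean(reversals[-4:]) - f_ref) / f_ref

# Placeholder listener: reliable above a 5% frequency difference, guessing below.
listener = lambda f_ref, f_probe: f_probe > 1.05 * f_ref or np.random.rand() < 1/3
print(track_fdl(listener, 1280.0))             # FDL estimate, in percent
```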

Speech Recognition Tests. Speech recognition was evaluated using the HSM sentence test [28]. The HSM test was administered in quiet, in noise, and with background speech interference (competing talker).

The aim of the speech-in-noise condition was to evaluate whether current steering strategies could improve speech intelligibility in noisy situations. For the noise condition, telephone noise generated according to CCITT Recommendation G.227 [30] was added to the HSM test. The signal-to-noise ratio was 10 dB. The aim of the speech-in-competing-speech condition was to evaluate whether current steering strategies could provide better speech intelligibility in the presence of multiple talkers. For the evaluation of speech recognition with background speech interference, a second German voice was added to the HSM sentence test. This was accomplished by mixing the HSM test with the Oldenburg sentence test (OLSA) [31]. Every word of the HSM sentence test was overlapped in time by at least one word of the OLSA test.


The signal-to-interference ratio was 5 dB. The patients were asked to repeat only those sentences belonging to the HSM sentence test, and the number of correct words was counted.

For each condition (quiet, noise, and competing talker), two lists of 20 sentences were presented in each stage of the study. The subjects had to repeat each sentence, and results were based on the number of words repeated correctly. All tests were conducted by connecting a CD player directly to the audio input of the speech processor.

Music and Speech Subjective Appreciation Tests. Subjective sound perception with each strategy was evaluated using questionnaires that assessed the overall benefits of the implant in daily life. The questionnaires supplemented the data available from conventional tests of speech perception [32].

The questionnaires asked subjects to rate music and speech quality [33]. At each stage of the study, the questionnaire was completed by the patient, so that for each strategy the same questionnaire was filled out twice.

The music questionnaire asked subjects to rate the pleasantness, distinctness, naturalness, and overall perception of music on a scale from 0 (extremely unpleasant, extremely indistinct, extremely unnatural, extremely bad) to 10 (extremely pleasant, extremely distinct, extremely natural, extremely good).

The speech questionnaire asked subjects to rate different characteristics of speech on a 10-point scale. Characteristics included background interference; naturalness of female voices, male voices, and the subject's own voice; clarity of speech; pleasantness of speech; and overall quality of speech. Listeners were provided with definitions of each dimension for each scale.

3.1. Subjects. All subjects had clinical experience with the HiRes strategy and were users of the Harmony (HiRes 90k or CII) implant. This strategy can only be configured in monopolar stimulation mode. Demographic information for all test subjects is presented in Table 3. P7 completed only the first phase of the study protocol. For all strategies, the stimulation rate was derived from the HiRes clinical program and was kept constant across the conditions. Threshold and most comfortable levels were also kept constant. Only global modifications of these levels were allowed to accommodate loudness requests of the subjects, meaning that the THL or MCL levels were changed by the same amount for all electrodes.

4. Results

All subjects immediately reported that speech heard with SpecRes and SineEx was understandable. For some users, however, the sound perceived with the new strategies was markedly different from HiRes; for example, the sound with SineEx was immediately described as brighter than with any other strategy. Despite these sound quality differences, all subjects were willing to take part in the chronic phase of the study, even though they were not allowed to change the strategy during the study period.

Frequency Discrimination Results. Frequency discrimination results are presented for the two reference frequencies (1280 Hz and 2904 Hz) in Figures 9(a) and 9(b). On average, both current steering strategies obtained an improvement in frequency discrimination with respect to HiRes. SpecRes produced slightly better frequency discrimination than SineEx for the 1280 Hz reference frequency.

Results were subjected to paired t-tests. No significant difference was found between HiRes, SpecRes, and SineEx for either reference frequency, owing to the large inter- and intrasubject variability. In particular, the results were dominated by the large variability observed in P2 for both reference frequencies.

4.1. Speech Intelligibility Results

Speech Intelligibility in Quiet. Figure 10(a) presents the averaged scores for each subject for the HSM sentence test in quiet. All subjects, except P2, scored 90% or higher for all three strategies, demonstrating that they were good performers with all strategies.

Speech Intelligibility in Noise. Figure 10(b) presents the averaged results for each subject for the HSM sentence test in noise (SNR = 10 dB). The mean results show that, in general, SpecRes produced the highest scores. Three patients out of 9 (P2, P7, and P8) attained better speech recognition scores with HiRes than with SpecRes.

Speech Intelligibility with Competing Talker. The mean scores for each patient on the HSM sentence test with a competing talker are presented in Figure 10(c). SpecRes produced the highest word recognition scores in this condition. P7 was not able to understand speech in this condition with any strategy.

Although SineEx produced the best frequency discrimination scores, this strategy was not able to improve word recognition with a competing talker for most of the patients. Only P4 obtained better speech understanding with SineEx.

All results were subjected to paired-samples t-tests. No significant differences were found between HiRes, SpecRes, and SineEx.

Subjective Music Appreciation Questionnaire. Figure 11 presents the results for the items clarity, naturalness, pleasantness, and overall music perception. No significant differences were found between the three strategies (paired t-tests), and the overall perception of music was rated similarly for all three strategies. However, music was rated as more clear, natural, and pleasant with SineEx compared to HiRes and SpecRes.


Table 3: Subject demographics for the current steering study.

Patient id | Age | Duration of deafness (years) | Cause of deafness     | Implant experience (years) | Electrode type | Usual strategy
P1         | 54  | 0                            | Sudden hearing loss   | 3                          | HiRes90k       | HiRes, 1080 pps
P2         | 70  | 0                            | Unknown               | 8                          | HiRes90k       | HiRes, 1080 pps
P3         | 51  | 4                            | After heart operation | 6                          | Clarion CII    | HiRes, 900 pps
P4         | 26  | 1                            | Unknown               | 7                          | Clarion CII    | HiRes, 500 pps
P5         | 43  | 0                            | Sudden hearing loss   | 7                          | Clarion CII    | HiRes, 900 pps
P6         | 62  | 0                            | Genetic               | 4                          | Clarion CII    | HiRes, 900 pps
P7         | 53  | 0                            | Unknown               | 9                          | HiRes90k       | HiRes, 1200 pps
P8         | 49  | 16                           | Unknown               | 5                          | Clarion CII    | HiRes, 900 pps
P9         | 60  | 0                            | Ototoxic medication   | 7                          | Clarion CII    | HiRes, 1200 pps


Figure 9: Frequency discrimination limen by subject (average and standard deviation) for the frequency discrimination test at the reference frequencies (a) 1280 Hz and (b) 2904 Hz with HiRes, SpecRes, and SineEx. The frequency discrimination limen is presented as a percentage of the reference frequency.

Subjective Speech Appreciation Questionnaire. Figure 12 shows the results for the items speech quality with background interference, naturalness of female voices, male voices, and own voice, clarity, pleasantness, and overall speech quality.

There were no significant differences in the ratings of these speech characteristics. However, SpecRes produced better scores for perceived pleasantness, voice in background interference, and overall quality than HiRes and SineEx.

On the other hand, the naturalness of the subjects' own voice and of male voices was rated higher with HiRes. Many patients reported that low-frequency sounds were better perceived with HiRes, while the sound with the current steering strategies (especially with SineEx) was described as brighter. This is probably why the perception of female voices was rated higher with SineEx and that of male voices higher with HiRes.

SineEx was rated higher for the perception of voice in background interference. These scores do not correlate with the speech intelligibility scores obtained with SineEx in the competing talker condition, which were lower than with SpecRes and equal to HiRes.

Clarity of voice was rated highly for all strategies; however, SpecRes was rated slightly higher than SineEx and HiRes.

5. Discussion

This study designed and evaluated new sound processing strategies that use current steering to improve the spectral and temporal information delivered to an Advanced Bionics Harmony cochlear implant.


Figure 10: Percentage of correct words by subject (average and standard deviation) for the HSM sentence test (a) in quiet, (b) in noise at 10 dB SNR, and (c) with a competing talker at 5 dB.

Current steering stimulates pairs of electrodes simultaneously so that virtual channels are created intermediate to the physical electrodes. In SpecRes, an FFT is used together with a spectral peak locator to extract the most dominant frequencies. In SineEx, the audio signal is modeled with sinusoids, and the frequencies of those sinusoids are delivered to the implant using current steering and a psychoacoustic masking model.

SpecRes and SineEx are the first signal processing strategies implemented in a commercial device using the current steering technique. These strategies were evaluated in a chronic study comparing them to the standard HiRes strategy, which does not use current steering. All patients were able to use the new strategies during a long-term test protocol. In the chronic study, frequency discrimination, speech intelligibility, and subjective ratings of music and speech were evaluated. Overall, the sound performance achieved with the three strategies evaluated in this study was similar, and the results exhibited large inter- and intrasubject variability.

For frequency discrimination, there were nonsignificant improvements for the SineEx strategy over the HiRes strategy, from 12% to 10.3% and from 20.5% to 16.7%, for the reference frequencies of 1280 Hz and 2904 Hz, respectively. These improvements are assumed to be the result of using the current steering technique. A 1.8 percentage point improvement in frequency discrimination for SineEx compared to the SpecRes strategy was observed only for the 2904 Hz reference frequency. That improvement can be attributed to the robust method for modeling the audio signal with sinusoids and the use of a perceptual model to select the sinusoidal components.

There are several reasons why we could not demonstrate a significant improvement in frequency difference limen for current steering strategies with respect to sequential stimulation strategies. First, pure tones were used to estimate the FDL. A pure tone presented to a cochlear implant may activate several bands (especially for HiRes and SpecRes). Therefore, pure tones cannot provide as precise an assessment of specific locations along the electrode array as stimulating the array with just a pair of simultaneously presented pulses. Second, a large intra- and intersubject variability was observed. For example, study participant P2, who obtained worse speech intelligibility in quiet than the rest of the patients, also showed very poor FDL results with large variability. Third, it has to be remarked that all study participants used HiRes in daily life and therefore had more experience listening to sequential stimulation strategies than to current steering strategies.


Figure 11: Results of the music quality questionnaire using HiRes, SpecRes, and SineEx. Different items were rated on a 10-point scale. The cochlear implant subjects were asked to rate different aspects of music perception: (a) clarity of music, (b) naturalness of music, (c) pleasantness of listening to music, and (d) overall music perception.

The frequency difference limen results obtained in this study are in the same range as those observed in the literature [34] using a similar experiment. In our study we observed a trend towards mean frequency difference limens increasing with increasing frequency; this trend has also been reported in the literature for cochlear implant users [34] as well as for normal hearing listeners [35].

Speech intelligibility was evaluated using the HSM sentence test. The results for the HSM test in noise showed a nonsignificant improvement in speech recognition for the SpecRes strategy over HiRes and SineEx of 4% and 8%, respectively. For the HSM test with a competing talker (speech background interference), SpecRes achieved a nonsignificant improvement of 5% with respect to both HiRes and SineEx.

In SineEx, appropriate source modeling and the selection of the components that describe this model are obviously of great importance to sound and music quality. In addition, SineEx stimulates, on average, half as many electrodes as SpecRes. Thus SineEx has the potential to reduce power consumption in actual devices because fewer electrodes are stimulated per frame. Because a direct relationship exists between the number of channels selected and the amount of channel interaction, stimulating fewer channels can lead to a reduction of interaction between channels.

Current steering strategies stimulate pairs of electrodes simultaneously. Therefore, the interaction between channels and the spread of the electrical fields produced in the cochlea increase with respect to nonsimultaneous stimulation strategies.


Figure 12: Results of the speech quality questionnaire using HiRes, SpecRes, and SineEx. Different items were rated on a 10-point scale. The cochlear implant subjects were asked to rate features of speech in different situations: (a) speech in background interference, (b) naturalness of female voices, (c) naturalness of male voices, (d) naturalness of the subject's own voice, (e) clarity, (f) pleasantness, and (g) overall speech quality.

In order to limit channel interaction in SineEx, the frequency selection, which determines the channels to be stimulated, also incorporates a simple version of a perceptual model based on masking effects.

With the version of SineEx evaluated in this study, sound perception is similar to that obtained with HiRes and SpecRes. However, this version of SineEx models the audio signal only with sinusoids, without considering noise components. Improvements to SineEx should include incorporating source models that process noise components as well as the sinusoidal components of the audio input signal, because noise components carry important cues for speech perception as well as for auditory scene analysis. An important aspect in designing new signal processing strategies for cochlear implants is the complexity of the algorithms. SineEx was implemented in the real-time Advanced Bionics bodyworn Platinum Series Processor, which uses a low-power DSP. Therefore, implementation on a commercial behind-the-ear Harmony processor should not be a major problem. The Harmony processor can store up to three strategies at the same time. Given the large intersubject variability in speech and music perception with HiRes, SpecRes, and SineEx, the Harmony allows all three strategies to be placed on the same processor. This flexibility gives the user the opportunity to select between strategies depending upon the situation.

6. Conclusions

The SineEx strategy models the input audio signal with sinusoids and uses a psychophysical masking model and current steering to increase the spectral resolution in a cochlear implant while at the same time reducing channel interaction.


The SpecRes strategy uses an FFT and a spectral peak locator to extract the most dominant frequencies. SpecRes also uses current steering to improve the place-frequency accuracy of stimulation. The results show large variability among cochlear implant users in pitch perception, speech intelligibility, and music perception when comparing these two current steering strategies against HiRes, a sequential stimulation strategy. All patients tested so far performed as well with the current steering technique as with conventional strategies. New modifications of the signal processing algorithms, together with further investigation of simultaneous stimulation of the electrodes, hold great promise for improving the hearing capabilities of cochlear implant users.

References

[1] B. S. Wilson, C. C. Finley, D. T. Lawson, et al., "Better speech recognition with cochlear implants," Nature, vol. 352, no. 6332, pp. 236-238, 1991.

[2] J. Laneau, When the deaf listen to music: pitch perception with cochlear implants, Ph.D. dissertation, Katholieke Universiteit Leuven, Faculteit Toegepaste Wetenschappen, Leuven, Belgium, 2005.

[3] B. S. Wilson, R. Schatzer, E. A. Lopez-Poveda, X. Sun, D. T. Lawson, and R. D. Wolford, "Two new directions in speech processor design for cochlear implants," Ear and Hearing, vol. 26, no. 4, pp. 73S-81S, 2005.

[4] L. Geurts and J. Wouters, "Coding of the fundamental frequency in continuous interleaved sampling processors for cochlear implants," Journal of the Acoustical Society of America, vol. 109, no. 2, pp. 713-726, 2001.

[5] Q.-J. Fu, F.-G. Zeng, R. V. Shannon, and S. D. Soli, "Importance of tonal envelope cues in Chinese speech recognition," Journal of the Acoustical Society of America, vol. 104, no. 1, pp. 505-510, 1998.

[6] K. Hopkins and B. C. J. Moore, "The contribution of temporal fine structure to the intelligibility of speech in steady and modulated noise," Journal of the Acoustical Society of America, vol. 125, no. 1, pp. 442-446, 2009.

[7] Boston Scientific, "HiRes with Fidelity 120 sound processing," A Report from Advanced Bionics, The Auditory Business of Boston Scientific, 2006.

[8] Z. M. Smith, B. Delgutte, and A. J. Oxenham, "Chimaeric sounds reveal dichotomies in auditory perception," Nature, vol. 416, pp. 87-90, 2002.

[9] A. Buchner, C. Frohne-Buechner, L. Gaertner, A. Lesinski-Schiedat, R.-D. Battmer, and T. Lenarz, "Evaluation of Advanced Bionics high resolution mode," International Journal of Audiology, vol. 45, no. 7, pp. 407-416, 2006.

[10] G. S. Donaldson and H. A. Kreft, "Place-pitch discrimination of single- versus dual-electrode stimuli by cochlear implant users (L)," Journal of the Acoustical Society of America, vol. 118, no. 2, p. 623, 2005.

[11] B. S. Wilson, R. Schatzer, and E. A. Lopez-Poveda, "Possibilities for a closer mimicking of normal auditory functions with cochlear implants," in Cochlear Implants, pp. 48-56, Thieme Medical, New York, NY, USA, 2nd edition, 2006.

[12] D. B. Koch, M. Downing, M. J. Osberger, and L. Litvak, "Using current steering to increase spectral resolution in CII and HiRes 90K users," Ear and Hearing, vol. 28, no. 2, supplement, pp. 38S-41S, 2007.

[13] B. Townshend, N. Cotter, D. van Compernolle, and R. L. White, "Pitch perception by cochlear implant subjects," Journal of the Acoustical Society of America, vol. 82, no. 1, pp. 106-115, 1987.

[14] H. J. McDermott and C. M. McKay, "Pitch ranking with nonsimultaneous dual-electrode electrical stimulation of the cochlea," Journal of the Acoustical Society of America, vol. 96, pp. 155-162, 1994.

[15] J. H. M. Frijns, R. K. Kalkman, F. J. Vanpoucke, et al., "Simultaneous and non-simultaneous dual electrode stimulation in cochlear implants: evidence for two neural response modalities," Acta Oto-Laryngologica, vol. 129, pp. 433-439, 2009.

[16] M. A. Stone, B. C. J. Moore, J. I. Alcantara, and B. R. Glasberg, "Comparison of different forms of compression using wearable digital hearing aids," Journal of the Acoustical Society of America, vol. 106, no. 6, pp. 3603-3619, 1999.

[17] A. J. C. Wilson, "The location of peaks," British Journal of Applied Physics, vol. 16, no. 5, pp. 665-674, 1965.

[18] J. O. Smith III and X. Serra, "PARSHL: an analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation," in Proceedings of the International Computer Music Conference (ICMC '87), pp. 290-297, Urbana-Champaign, Ill, USA, August 1987.

[19] D. D. Greenwood, "A cochlear frequency-position function for several species—29 years later," Journal of the Acoustical Society of America, vol. 87, no. 6, pp. 2592-2605, 1990.

[20] G. S. Stickney, P. C. Loizou, L. N. Mishra, P. F. Assmann, R. V. Shannon, and J. M. Opie, "Effects of electrode design and configuration on channel interactions," Hearing Research, vol. 211, no. 1-2, pp. 33-45, 2006.

[21] H. Purnhagen, N. Meine, and B. Edler, "Sinusoidal coding using loudness-based selection," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '02), vol. 2, pp. 1817-1820, Orlando, Fla, USA, May 2002.

[22] W. Nogueira, A. Buchner, Th. Lenarz, and B. Edler, "A psychoacoustic 'NofM'-type speech coding strategy for cochlear implants," EURASIP Journal on Applied Signal Processing, vol. 2005, no. 18, pp. 3044-3059, 2005.

[23] B. Edler, H. Purnhagen, and C. Ferekidis, "ASAC—analysis/synthesis codec for very low bit rates," in Proceedings of the 100th Audio Engineering Society Convention (AES '96), Copenhagen, Denmark, May 1996, Preprint 4179.

[24] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397-3415, 1993.

[25] H. Purnhagen and N. Meine, "HILN—the MPEG-4 parametric audio coding tools," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '00), vol. 3, pp. 23-31, Geneva, Switzerland, May 2000.

[26] R. A. Lutfi, "A power-law transformation predicting masking by sounds with complex spectra," Journal of the Acoustical Society of America, vol. 77, no. 6, pp. 2128-2136, 1985.

[27] F. Baumgarte, C. Ferekidis, and H. Fuchs, "A nonlinear psychoacoustic model applied to the ISO MPEG Layer 3 coder," in Proceedings of the 99th Audio Engineering Society Convention (AES '95), New York, NY, USA, October 1995, Preprint 4087.

[28] I. Hochmair-Desoyer, E. Schulz, L. Moser, and M. Schmidt, "The HSM sentence test as a tool for evaluating the speech understanding in noise of cochlear implant users," American Journal of Otology, vol. 18, no. 6, supplement, p. 83, 1997.

[29] H. Levitt, "Transformed up-down methods in psychoacoustics," Journal of the Acoustical Society of America, vol. 49, no. 2, part 2, pp. 467-477, 1971.

[30] International Telecommunication Union, ITU-T Recommendation G.227: "International analogue carrier systems—General characteristics common to all analogue carrier-transmission systems—Conventional telephone signal," ITU, 1988, 1993.

[31] K. Wagener, V. Kühnel, and B. Kollmeier, "Entwicklung und Evaluation eines Satztests für die deutsche Sprache I: Design des Oldenburger Satztests," Zeitschrift für Audiologie, vol. 38, pp. 4-15, 1999.

[32] G. Clark, Cochlear Implants: Fundamentals and Applications, Modern Acoustics and Signal Processing, Springer/AIP Press, Berlin, Germany, 2003.

[33] A. Gabrielsson and H. Sjögren, "Perceived sound quality of sound-reproducing systems," Journal of the Acoustical Society of America, vol. 65, pp. 1019-1726, 1979.

[34] N. B. Spetner and L. W. Olsho, "Auditory frequency resolution in human infancy," Child Development, vol. 61, no. 3, pp. 632-652, 1990.

[35] E. J. Propst, K. A. Gordon, R. V. Harrison, S. M. Abel, and B. C. Papsin, "Sound frequency discrimination in normal-hearing listeners and cochlear implantees," The University of Toronto Medical Journal, vol. 79, no. 2, 2002.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 175243, 9 pages
doi:10.1155/2009/175243

Research Article

Prediction of Speech Recognition in Cochlear Implant Users by Adapting Auditory Models to Psychophysical Data

Svante Stadler and Arne Leijon (EURASIP Member)

Sound and Image Processing Lab, KTH, 10044 Stockholm, Sweden

Correspondence should be addressed to Svante Stadler, [email protected]

Received 15 December 2008; Revised 6 May 2009; Accepted 18 June 2009

Recommended by Hugo Fastl

Users of cochlear implants (CIs) vary widely in their ability to recognize speech in noisy conditions. Many factors may influence their performance. We have investigated to what degree it can be explained by the users' ability to discriminate spectral shapes. A speech recognition task has been simulated using both a simple and a complex model of CI hearing. The models were individualized by adapting their parameters to fit the results of a spectral discrimination test. The predicted speech recognition performance was compared to experimental results, and the two were significantly correlated. The presented framework may be used to simulate the effects of changing the CI encoding strategy.

Copyright © 2009 S. Stadler and A. Leijon. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Early cochlear implants, using only a single channel, were useful for identifying environmental sounds and improving lip reading performance. However, speech recognition with such implants was very limited. Since then, the number of channels in implants has steadily increased as the technology has matured, and modern implants make use of up to 22 separate channels (the Cochlear Nucleus implant), or even up to 120 "virtual" channels (Advanced Bionics HiRes120) [1]. The theoretical basis behind this development has been that each channel stimulates mainly a local region of the cochlea, which, along with the tonotopic organization of the auditory nerve, corresponds to a frequency-specific sensation [2]. According to this principle, each channel can be used to encode the signal components in the corresponding frequency band. However, it has been shown repeatedly that the spectral resolution of CI users is limited to about 4-8 "effective" channels, even if the actual number of channels is much larger [3-5]. In an experiment using normal hearing listeners and speech signals with modified spectral content, Fu et al. [6] showed that increasing the number of effective channels increases speech recognition in noise, up to and beyond 16 channels. Therefore, it seems logical to conclude that spectral resolution is a limiting factor for CI users' speech recognition. However, it is also clear that cognitive factors such as short-term memory and lexical knowledge have an impact [7], as well as patchy or incomplete nerve survival.

In this paper, we examine the role of spectral resolution in speech recognition ability. To this end, a sentence recognition task has been simulated to predict the signal-to-noise ratio at which words are barely recognized. The simulation includes a model of CI hearing, which has been individualized using data from psychophysical experiments. This approach is not entirely novel; several earlier attempts have been made to predict speech recognition performance for hearing impaired listeners, using either heuristics [8, 9] or auditory models [10-12]. All of these studies use the audiogram as the main psychophysical data source to adapt the model. However, in the case of CI hearing, the audiogram is irrelevant, since sound levels can be arbitrarily mapped to stimulation levels. Therefore, other experimental data must be used. In [5], the results of a spectral resolution experiment were found to have a limited but significant correlation with the speech recognition threshold (SRT) in noise. The data from that study were used to evaluate the presented system.

It is important to stress that the goal of this paper is not simply to predict the speech recognition results; that is done more effectively using regression techniques such as support vector machines or neural networks.


Figure 1: Results of the spectral discrimination test. Each box shows the median, quartiles, and 10th and 90th percentiles among the CI users for a specific stimulus type. The test and retest thresholds have been averaged.

The goal is rather to explain the results in terms of equivalent signal processing, which can then be modified in order to simulate the effects of different signal processing techniques. The end goal is to be able to predict how an individual's speech recognition performance will be affected by a change in the speech processing algorithm, allowing for automatic optimization of speech recognition performance. Although such simulations may not be reliable enough for clinical use, they may give indications of the types of strategies that should be tested clinically. The general technique of fitting hearing instrument parameters by optimizing an objective measure of speech recognition ability has proven useful in hearing aid fitting, for example, NAL-NL1 [13]. In fact, the approach dates back as far as the 1940s [14]. It is reasonable to believe that similar methods could be derived for cochlear implants in the future. Another goal is to show that a large fraction of the variance in speech recognition ability among CI users can be explained by spectral resolution properties.

2. Experimental Data

The presented framework was designed to simulate two psychophysical experiments: a spectral discrimination test and a sentence recognition test known as Hagerman's sentences [15]. These experiments are described below. 32 CI users performed the spectral discrimination task on two separate occasions (referred to as the test and retest) and the speech recognition task on one occasion. All listening tests were performed monaurally; any hearing aid or CI on the other ear was turned off or removed. All participants were above 15 years old and had no known neurological disorders. Five participants had severe prelingual hearing loss. The implant types were Med-El (17 users), Nucleus (12 users), and Clarion (3 users). The stimuli were presented in free field in a low-reverberation room at a fixed sound level of 70 dB SPL.

2.1. Spectral Discrimination Test. In this test, the listener's task is to determine which of three consecutive stimuli differs from the other two (a 3-interval, 3-alternative forced choice task, or 3I3AFC). The signals are 200-millisecond noise bursts that have been filtered so that the spectral density matches the long-term average of speech. The signals are then filtered into K bands between 200 and 8000 Hz that are evenly spaced on the ERBN scale [16]. For each interval, either the odd (1, 3, 5, ...) or the even (2, 4, 6, ...) bands are attenuated, and the attenuation is called the peak-to-valley ratio (PVR). The listener chooses the stimulus that was filtered differently from the other two. A modified up-down procedure (2-down, 1-up) was used to estimate the threshold PVR, which is defined as the 70.7% point on the psychometric function [17]. If the required number of correct responses is not obtained at the highest allowed PVR (20 dB), the fraction of correct responses at this level is recorded instead. The process is repeated for K = 1, 2, 4, 8, 16, and 32 bands. The single-band case is simply an intensity discrimination task. Figure 1 shows the distribution of results in the test group as percentiles. The experiment is described in more detail in [5].
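A sketch of this stimulus construction is given below. It assumes the Glasberg and Moore ERB-number formula and uses a simple FFT-domain mask on a flat (not speech-shaped) noise burst; the filterbank actually used in the study is not specified here.

```python
import numpy as np

def erb_rate(f):                      # ERB-number scale (Glasberg & Moore)
    return 21.4 * np.log10(1 + 0.00437 * f)

def erb_rate_inv(e):
    return (10**(e / 21.4) - 1) / 0.00437

def ripple_stimulus(noise, fs, K, pvr_db, attenuate_odd=True,
                    f_lo=200.0, f_hi=8000.0):
    """Attenuate every other of K ERB-spaced bands by pvr_db (FFT-domain mask)."""
    edges = erb_rate_inv(np.linspace(erb_rate(f_lo), erb_rate(f_hi), K + 1))
    spec = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(len(noise), 1 / fs)
    gain = np.ones_like(freqs)
    for k in range(K):
        in_band = (freqs >= edges[k]) & (freqs < edges[k + 1])
        if (k % 2 == 0) == attenuate_odd:        # bands 1,3,5,... correspond to even k
            gain[in_band] = 10**(-pvr_db / 20)
    return np.fft.irfft(spec * gain, n=len(noise))

fs = 16000
burst = np.random.randn(int(0.2 * fs))           # 200 ms noise burst
odd_att = ripple_stimulus(burst, fs, K=8, pvr_db=10.0, attenuate_odd=True)
even_att = ripple_stimulus(burst, fs, K=8, pvr_db=10.0, attenuate_odd=False)
```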

2.2. Hagerman's Sentences. Hagerman's sentences is a Swedish sentence test used to determine speech recognition thresholds in noise; similar tests exist in several languages. The test material consists of 50 phonetically balanced words, organized into 5 sentence positions with ten possible words for each position, so that choosing one of the ten words for each position generates a grammatically correct but semantically meaningless sentence. After a sentence is presented along with speech-shaped masking noise, the listener's task is to repeat the sentence, and the experimenter counts the number of correctly repeated words. The noise level is then shifted adaptively to converge to the level where 2 out of 5 words are correct, that is, 40% correct [18]. The corresponding SNR is noted as the result. These results are shown in Figure 7 along with the model predictions.

3. Modelling Methods

To predict speech recognition performance, an entire speech communication chain is simulated, going from a speaker, via a medium, to a listener. The speaker tries to convey a sequence of words W through articulation, thereby generating the acoustic signal X. When the signal reaches the listener, it may be contaminated by additive noise and reverberation. In the CI case, the acoustic signal is transduced into electrical impulses, which give rise to a neural pattern sequence R. The listener's brain then acts as a classifier, finding an estimate \hat{W} of the word sequence. The speech recognition performance is defined as the fraction of words correctly identified. The following sections describe the processing steps in more detail. Similar approaches are used in [11, 19] to estimate speech recognition performance for listeners with normal and impaired hearing.


3.1. Model of Speech Production and Recognition. Speech is inherently random; a single word will never be uttered exactly the same way twice. Therefore, probabilistic models are appropriate for representing a speech source. Hidden Markov Models (HMMs) are used extensively in automatic speech recognition systems (and also, to some degree, in speech synthesis) due to their ability to model signals with variability in both duration and spectrum [20]. Here, each word in the speech recognition test is modelled by an HMM. The speech features used to train the models are calculated as follows: the time-domain signal is divided into 20-millisecond frames with 50% overlap. The spectrum of each frame is computed in 32 equally spaced bands on the ERBN scale from 300 to 8000 Hz, which is the frequency range encoded by the CIs used in the listening tests. The spectral magnitudes are then converted to dB. This representation is relevant because Euclidean distance in this space is approximately proportional to perceptual difference for normal hearing listeners [21]. It is also quite similar to the signal encoding used in most CIs, which is useful in the individual adaptation procedure (see Section 3.4).

The speech features of the entire corpus are used to train a Gaussian Mixture Model (GMM) [22] with 40 components, which should correspond to approximately one component per phoneme of the Swedish language. Then, a 7-state HMM is trained on each single word, approximately one state per phoneme. The GMM is used as the output distribution for all word HMMs, and only the mixture weights are adapted for each state. This is known as a tied-mixture HMM [23], which reduces the degrees of freedom in the model dramatically compared to a standard continuous HMM. This approach reduces the risk of overfitting the model, which is useful here as only one utterance of each word is available in the Hagerman test. Finally, the single-word HMMs can be concatenated to form a sentence HMM.
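The front-end described above (20 ms frames, 50% overlap, 32 ERB-spaced log band energies between 300 and 8000 Hz) can be sketched as follows. The Glasberg and Moore ERB-number formula, the Hann window, and the rectangular band summation are our assumptions.

```python
import numpy as np

def erb_rate(f):
    return 21.4 * np.log10(1 + 0.00437 * f)

def erb_rate_inv(e):
    return (10**(e / 21.4) - 1) / 0.00437

def speech_features(x, fs, n_bands=32, f_lo=300.0, f_hi=8000.0,
                    frame_ms=20.0, overlap=0.5):
    """Log band-energy features: array of shape (n_frames, n_bands), in dB."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    edges = erb_rate_inv(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_bands + 1))
    freqs = np.fft.rfftfreq(frame_len, 1 / fs)
    band_of_bin = np.searchsorted(edges, freqs) - 1   # bins outside the range get -1 or n_bands
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        power = np.abs(np.fft.rfft(frame))**2
        bands = np.array([power[band_of_bin == b].sum() for b in range(n_bands)])
        feats.append(10 * np.log10(bands + 1e-12))    # dB, with a small floor
    return np.array(feats)

fs = 16000
x = np.random.randn(fs)                    # 1 s of noise as a stand-in for speech
print(speech_features(x, fs).shape)        # (99, 32)
```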

3.2. Word Classification. The inner workings of human speech perception are largely unknown, even after decades of research. The best automatic speech recognizers are still not close to human performance, which suggests that human speech recognition is, if not optimal, then at least not far from it. Therefore, the human brain is approximated by an optimal classifier. Here, "optimal" should be interpreted as classifying with minimal probability of error, which is achieved by choosing the most probable word sequence. After observing the neural pattern r, the listener finds the word sequence \hat{w} that maximizes the conditional probability,

\hat{w} = \arg\max_w P_{W|R}(w | r).  (1)

In the case of Hagerman's sentences, every word sequence has equal prior probability. Therefore, the maximum a posteriori choice in (1) is identical to the maximum likelihood choice,

\hat{w} = \arg\max_w f_{R|W}(r | w),  (2)

where f_{R|W} expresses the probability density function (pdf) of the neural patterns elicited by a given word sequence. It is defined by the sentence HMM, and the search over all possible word sequences is efficiently computed using the Viterbi algorithm. The probability of correctly recognizing a word, P_c, can then be estimated as an average, by drawing samples w_i from the random variable W, giving

P_c = \frac{1}{N_w} E\left[\delta\left(W, \hat{W}\right)\right] \approx \frac{1}{N_w M} \sum_{i=1}^{M} \delta(w_i, \hat{w}_i).  (3)

Here, E[\cdot] denotes the expectation over all possible sentences, N_w is the number of words in each sequence, and \delta(\cdot, \cdot) expresses the number of identical words in the two sequences. This procedure is quite similar to what would be done in a psychophysical experiment; it simply averages the number of correct answers. Unfortunately, a fairly large number of iterations M may be needed to obtain a good estimate, which can be quite time consuming if elaborate models are used. A more efficient approach is to consider the mutual information (MI) between the source W and the observation R, which expresses the amount of information available to the classifier:

I(W; R) = E_{R,W}\left[ \log \frac{f_{R|W}(R | W)}{f_R(R)} \right].  (4)

Like (3), this quantity must also be estimated using a Monte Carlo approach; however, it converges faster, since each iteration yields a continuous estimate, compared to the binary result of (3). The MI is estimated using the following expression, where M is set to achieve the required accuracy:

I(W; R) \approx \frac{1}{M} \sum_{i=1}^{M} \left[ \log f_{R|W}(r_i | w_i) - \log f_R(r_i) \right],  (5)

where w_i is a word sequence drawn randomly from the distribution of W and r_i is the corresponding observed neural pattern.

The relation between MI and minimal classification error is given by rate-distortion theory and is calculated using the Blahut algorithm [24]. In the case of Hagerman's sentences, 0.45 bits/word is needed to achieve the threshold of 40% word recognition.
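The Monte Carlo estimate in (5) can be illustrated on a toy discrete channel where the conditional distribution f_{R|W} is known exactly, so the estimate can be checked against the true mutual information. The channel and its parameters are placeholders; in the real system the log-likelihoods would come from the sentence HMM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete "channel": W uniform over 10 words, R a noisy observation of W.
n_words, p_correct = 10, 0.6
P_R_given_W = np.full((n_words, n_words), (1 - p_correct) / (n_words - 1))
np.fill_diagonal(P_R_given_W, p_correct)
P_W = np.full(n_words, 1.0 / n_words)
P_R = P_W @ P_R_given_W                                 # marginal over observations

# Monte Carlo estimate of I(W;R) as in (5): average of log P(r|w) - log P(r).
M = 50000
w = rng.integers(n_words, size=M)
other = (w + rng.integers(1, n_words, size=M)) % n_words   # a word different from w
r = np.where(rng.random(M) < p_correct, w, other)          # sample r ~ P(.|w)
I_hat = np.mean(np.log2(P_R_given_W[w, r]) - np.log2(P_R[r]))

# Exact value for comparison.
I_exact = np.sum(P_W[:, None] * P_R_given_W * np.log2(P_R_given_W / P_R[None, :]))
print(I_hat, I_exact)      # both close to 1.08 bits
```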

3.3. Models of CI Hearing. For the purposes of this paper, two models of CI hearing have been implemented. Model A is the simplest possible model that can account for the results of the spectral resolution test, while Model B is an attempt to model the actual signal transformations, including the CI speech processor, the electrical transduction from implant to auditory nerve, and the response of the auditory nerve cells. Model A has the advantage that its structure is very flexible and can mimic any result of the spectral discrimination task. Model B is less flexible, but since it is modelled after a physical cochlea, it can capture the effects of, for example, moving the electrodes. The following sections describe the models in detail.

3.3.1. Model A: Functional Signal Processing Model. Model A is an attempt to construct the simplest possible explanation of the observed psychophysical results.


Figure 2: Block diagram of Model A. The processing is applied to each time frame in the sequence.

According to Occam's razor, one should always pick the simplest model when all other things are equal (a good discussion of this preference is provided in [25]). In the case of cochlear implants, we do have some knowledge of how a particular acoustic input gets encoded in the auditory nerve, but since that knowledge is limited, any model of the entire process is bound to be speculative. Therefore, Model A is a good complement to the more realistic Model B.

To create a simple functional model, it is first observed that the spectral shapes to be discriminated in the spectral discrimination test are periodic modulations on the ERBN scale. These are quite similar to the basis functions of mel-frequency cepstral coefficients (MFCCs), which are a ubiquitous signal representation in automatic speech recognition (this is of course not a coincidence; the experiment was designed with that in mind). As previously discussed, the incoming acoustic signal X is represented by a sequence of short-time auditory spectra in log units (i.e., dB). By applying the Discrete Cosine Transform (DCT) to each frame, the resulting features are effectively MFCCs. Each coefficient is then multiplied by a weight c_i, which controls the sensitivity to the corresponding spectral shape. Thereafter the feature vectors are inversely transformed, and Gaussian noise with unit variance is added to each coefficient. A block diagram of this process is shown in Figure 2.

This model is able to simulate any result of the spectral discrimination experiment by adapting the weights c = {c_1, ..., c_N}. For example, by attenuating the higher cepstral coefficients, the spectrum is smoothed, simulating a loss in frequency selectivity. The noise amplitude is adapted indirectly, by allowing all signal weights to scale.
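A minimal sketch of Model A, following the description above and Figure 2: DCT of each log-spectrum frame, cepstral weighting, inverse DCT, and additive unit-variance Gaussian noise. The example weights (keeping only the 8 lowest cepstral coefficients) are illustrative, not fitted values.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (rows are the cosine basis functions)."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * j + 1) / (2 * n))
    D[0, :] = np.sqrt(1.0 / n)
    return D

def model_a(frames_db, c, rng=None):
    """Model A: weight the cepstrum of each log-spectrum frame, transform back,
    and add unit-variance Gaussian noise (cf. Figure 2).

    frames_db : (n_frames, n_bands) log-spectra in dB.
    c         : length-n_bands vector of cepstral weights.
    """
    rng = rng or np.random.default_rng()
    D = dct_matrix(frames_db.shape[1])
    cepstra = frames_db @ D.T                 # DCT of each frame
    degraded = (cepstra * c) @ D              # weight and inverse-transform
    return degraded + rng.standard_normal(frames_db.shape)

# Example: keep only the 8 lowest cepstral coefficients of a 32-band spectrum,
# simulating reduced spectral resolution (illustrative weights only).
c = np.zeros(32); c[:8] = 1.0
frames = 20 * np.log10(np.abs(np.random.randn(100, 32)) + 1e-3)
print(model_a(frames, c).shape)   # (100, 32)
```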

3.3.2. Model B: Biologically Inspired Model. In this model, each user's implant settings were simulated, including frequency bands, "map-law", and (approximate) electrode positions. However, any additional signal processing being performed by the speech processor could not be modelled, as these algorithms are proprietary. Instead, it has been assumed that the gain in each channel is mapped directly from the signal energy in the corresponding frequency band. Also, the signal preprocessing is implicitly modelled by assuming that the settings are ideally fitted, so that the signals utilize the full dynamic range in each channel

(explained in more detail below). This mirrors the purpose of a front-end slow-acting compressor and a pre-emphasis filter. Thereafter, the spread of electric current in the cochlea is simulated in a 3D finite element model, implemented according to the specifications supplied by Rattay et al. [26]. The simulation was performed using the COMSOL Multiphysics finite element software [27].

A healthy cochlea contains approximately 32,000 auditory nerve fibers [28]. To model each nerve fiber individually is neither feasible nor necessary to capture the statistics of the neural activity. Instead, the auditory nerve is divided into tonotopic groups, and the expected spike rate from each group is computed. It is assumed that the perception of spectral shapes is based solely on the number of neural spikes from each group of neurons, that is, spike timing is neglected and the spike count is summed for each group and time frame. Furthermore, it is assumed that the summed response from each neural group is a monotonic function of the sum of the "activation" (defined below) in the nerve tissue elicited by each electrical impulse. These simplifications, although somewhat crude, are needed to make the speech recognition simulation possible.

The 3D model comprises one and a half turns of the cochlea. The electric field in the cochlea is examined in 30 degree steps, making for a total of 19 modelled fibers. Each modelled fiber is then assumed to be typical for the nerve fibers surrounding it, forming a "neural group" with homogeneous properties. Each nerve fiber model consists of ten straight line sections, as illustrated in Figure 3. Unit monopolar impulses are simulated from electrodes at each of the corresponding 30 degree positions, and the activation as defined in [29] (the second spatial derivative of the electric potential) is recorded for each nerve segment. These results are stored in the activation matrices A_j, where A_j(k, i) represents the activation in the kth section of nerve fiber i due to a unit impulse from electrode position j. From these data, responses to stimulation from electrodes at arbitrary positions could be found using linear interpolation.

In this model, it is assumed that all variation in frequency selectivity that is not explained by known factors (such as CI properties and settings) is due to degeneration of the auditory nerve. To model different stages of neural degeneration, the parameter c_k represents the fraction of nerve fibers that have the kth section, as well as all sections on the central side, intact. In other words, it is the fraction of fibers that are able to transmit an action potential when stimulated at section k. It is assumed that the degeneration is uniform across all neural groups. Studies on cochlear nerve degeneration show that this is generally not the case [30], but it may be an acceptable approximation, given that the spectral discrimination test used in this study does not give any frequency specific information. The weights c are used to compute the transfer matrix T, where each element T_ij represents the activity in neural group i due to a unit impulse from electrode position j, and is computed as

T_{ij} = \sum_{k=1}^{10} c_k A_j(k, i). \qquad (6)
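Equation (6) amounts to a weighted sum of the activation matrices over the nerve-fiber sections; a minimal sketch, assuming the activation data are stored in a single array indexed as A_j(k, i):

```python
import numpy as np

def transfer_matrix(activation, c):
    """Compute T from Eq. (6): T_ij = sum_k c_k * A_j(k, i).

    activation : array of shape (n_electrodes, n_sections, n_groups) holding
                 the activation matrices, activation[j, k, i] = A_j(k, i).
    c          : length-n_sections vector of intact-fiber fractions c_k.
    Returns T with shape (n_groups, n_electrodes).
    """
    # Weighted sum over the nerve-fiber sections (index k).
    return np.einsum('jki,k->ij', activation, c)
```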


Figure 3: Visualization of the 3D cochlear model. (a): Cross-section view of the model. Nerve fibers are shown as gray lines, with black dots at the boundaries between sections. One electrode positioned in the Scala Tympani is shown as a circle. (b): External view of the model cochlea.

Thus, the T matrix is used to model the spread of excitation in the cochlea. Examples of A and T matrices are shown in Figure 4.

The random variable R_t representing the neural feature vector at time t is computed as

R_t = g\bigl(T X'_t + N_t\bigr), \qquad (7)

where X'_t is the output from the CI electrode, N_t is isotropic Gaussian noise, and g(·) is a sigmoidal function applied to every element in the vector, defined as

g(y) = \frac{1}{1 + e^{(m-y)/s}}, \qquad (8)

where m is the midpoint of the sigmoid and s is its slope parameter. The sigmoid function is used to map the activation to neural firing rate, which has been shown to have a sigmoidal relation in single fiber measurements [31]. The values of the sigmoid parameters are not known; instead, it is assumed that the CI output range has been fitted to match these parameters, so that the neural "input" T X'_t + N_t has a mean and standard deviation that are approximately m and s, respectively, for each neural group. For simplicity, all noise sources (noise due to stochastic neural firing, decision noise, etc.) are modelled by the additive term N_t. A schematic view of the signal processing in the model is shown in Figure 5.
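A compact sketch of the Model B mapping in (7) and (8) is given below; the parameter names (midpoint m, slope s, noise standard deviation) follow the text, while the array shapes are assumptions for illustration.

```python
import numpy as np

def model_b_response(T, x_ci, m, s, noise_std, rng):
    """Neural feature vector of Model B, Eqs. (7)-(8): R_t = g(T x'_t + N_t).

    T        : transfer matrix (neural groups x electrode channels).
    x_ci     : CI electrode output x'_t for one time frame.
    m, s     : midpoint and slope parameter of the sigmoid g.
    noise_std: standard deviation of the isotropic Gaussian noise N_t.
    """
    drive = T @ x_ci + rng.normal(scale=noise_std, size=T.shape[0])
    return 1.0 / (1.0 + np.exp((m - drive) / s))   # sigmoidal rate mapping
```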

The model is adapted to individual data by varying the weights c and scaling the variance of the additive noise term N_t. With these free parameters, it is in general not possible to adapt the model to match a set of spectral discrimination results exactly. Therefore it is adapted using a maximum likelihood approach that is described in Section 3.4.1.

3.4. Adaptation to Psychophysical Data. The models of CI hearing have a number of individual (free) parameters, which vary between subjects. Some parameters can be observed directly (such as the number of active channels), while others cannot (such as the degeneration of the auditory nerve). For each subject, the latter set of parameters is estimated by matching the results from the spectral

discrimination experiment described in Section 2.1 in a simulation using a Maximum Likelihood criterion, as described below. The estimated model parameters are then applied in the speech recognition simulation to predict the speech recognition threshold. The full procedure is illustrated in Figure 6.

3.4.1. Maximum Likelihood Adaptation. The models of hearing have limited degrees of freedom, and it is generally not possible to match the experimental thresholds exactly. However, the measured thresholds should not be interpreted as deterministic; all psychophysical results have an inherent uncertainty due to the probabilistic nature of perception. In the spectral discrimination test, the experimenters calculated a threshold estimate \hat{m}_i and associated error variance \sigma_i^2 for each condition. From these data, a probability distribution of the true thresholds m_i can be formed (assuming estimation errors are independent and Gaussian):

f(m_1 \cdots m_N) = \prod_i f(m_i) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-(m_i - \hat{m}_i)^2 / (2\sigma_i^2)}, \qquad (9)

where f(·) denotes a probability density (the likelihood function). To find the optimal parameter set, an iterative optimization scheme is employed (the Nelder-Mead simplex method, a popular "hill-climbing" algorithm [32]). The algorithm converges to the set of parameters that (locally) maximizes the likelihood of the simulated thresholds.
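The maximum likelihood adaptation can be sketched as follows, assuming a hypothetical simulate_thresholds(params) routine that runs the spectral discrimination simulation for a candidate parameter set; minimizing the negative Gaussian log-likelihood (constant terms dropped) with the Nelder-Mead method is equivalent to maximizing (9).

```python
import numpy as np
from scipy.optimize import minimize

def fit_model(simulate_thresholds, measured, sigma, x0):
    """Adapt the free model parameters by maximum likelihood (Eq. (9)).

    simulate_thresholds(params): hypothetical routine returning the model's
    simulated discrimination thresholds for a parameter vector.
    measured, sigma: experimental threshold estimates and standard errors.
    x0: initial guess for the free parameters (weights, noise scale).
    """
    measured = np.asarray(measured, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    def neg_log_likelihood(params):
        m_sim = np.asarray(simulate_thresholds(params), dtype=float)
        return np.sum((m_sim - measured) ** 2 / (2.0 * sigma ** 2))

    # Nelder-Mead simplex search, as referenced in [32].
    result = minimize(neg_log_likelihood, x0, method='Nelder-Mead')
    return result.x
```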

4. Results

In order to predict the speech recognition threshold in noise (SRT) of a specific user, synthesized speech signals were mixed with noise and run through the user-adapted CI model. The signal-to-noise ratio (SNR) was initialized at 20 dB and the information rate was estimated according to (5). The estimate was refined until the threshold rate of 0.45 bits/word was outside the 99% confidence interval. The SNR was shifted in 2 dB steps until two consecutive SNR levels were found to be on either side of the threshold rate.


Figure 4: Results from the electric field simulation. The activation over the cochlear nerve tissue during a unit impulse from electrode position 7 (a) is mapped to the activation matrix A_7 (b). A matrices are combined to form the matrix T (c), here with c_k = 1 for all k. The electrodes and neural groups are indexed in basal to apical direction and the neural sections in peripheral to central direction.

Figure 5: Block diagram of Model B. The CI unit provides a filterbank and sound level to electric current mapping. The current from each channel is spread to the neural groups according to the transfer matrix T. Additive Gaussian noise is used to simulate the variability of the neural response.

Figure 6: Adaptation and prediction procedure. The parameters of the CI model are adapted to match a set of experimental results, in this case the spectral discrimination thresholds. The adapted parameters are then used in a simulated speech recognition test.

The SRT estimate was then found by linear interpolation between the two levels. The process was repeated for test and retest data, for both CI models, and for each individual CI user. The results are displayed in Figure 7. Correlations between predictions and experimental data are tabulated in Table 1, along with median absolute prediction errors.
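The SRT search described above (2 dB bracketing followed by linear interpolation) can be sketched as follows; info_rate_at_snr is a hypothetical callable wrapping the information rate estimate of (5), and the sketch assumes the rate at the starting SNR exceeds the threshold and decreases monotonically as the SNR is lowered.

```python
def estimate_srt(info_rate_at_snr, threshold=0.45, start_snr=20.0, step=2.0):
    """Bracket the threshold rate in 2 dB steps and interpolate linearly."""
    snr_hi, rate_hi = start_snr, info_rate_at_snr(start_snr)
    while True:
        snr_lo = snr_hi - step
        rate_lo = info_rate_at_snr(snr_lo)
        if rate_lo <= threshold <= rate_hi:
            # Linear interpolation between the two bracketing SNR levels.
            frac = (threshold - rate_lo) / (rate_hi - rate_lo)
            return snr_lo + frac * step
        snr_hi, rate_hi = snr_lo, rate_lo
```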

Table 1: SRT prediction accuracy. The median absolute errors and Spearman correlation coefficients with corresponding P-values for SRT predictions using all combinations of models and data sets, including the average of the test and retest predictions (denoted "avg"). The table also includes prediction results from a jackknife linear regression analysis (noted "R").

Model/data   Error (dB)   Correlation   P-value
A/test       5.0          0.57          6 · 10^−4
A/retest     4.0          0.73          3 · 10^−6
A/avg        4.4          0.72          4 · 10^−6
B/test       4.3          0.52          2 · 10^−3
B/retest     4.4          0.72          4 · 10^−6
B/avg        4.8          0.68          2 · 10^−5
R/test       3.3          0.54          2 · 10^−3
R/retest     4.0          0.62          2 · 10^−4
R/avg        2.6          0.66          4 · 10^−5

As a reference, prediction using jackknife (leave-one-out cross-validation) linear least squares estimation (LLSE), commonly known as linear regression, has been performed, noted in the table as model "R". This method finds the linear dependency between the spectral discrimination and speech recognition results that minimizes the mean square prediction error. Statistical analysis using Spearman's rank correlation test showed that all the measured correlations are significant at the 99% confidence level. It can be noted that the proportion of the variance in the data that is "explained" by the prediction is equal to the square of the correlation coefficient, which means that models A and B explain more than half of the variance when fitted to the retest data.
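The reference predictor "R" can be sketched as follows, assuming a single spectral discrimination summary measure per subject; each subject's SRT is predicted from a least-squares line fitted to all remaining subjects.

```python
import numpy as np

def jackknife_regression_predictions(x, y):
    """Leave-one-out linear least squares prediction (model "R").

    x : spectral discrimination summary measure per subject (assumed scalar),
    y : measured SRTs.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    predictions = np.empty_like(y)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i            # leave subject i out
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        predictions[i] = slope * x[i] + intercept
    return predictions
```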

5. Discussion

The results in Figure 7 and Table 1 show that the two CI models give predictions that are approximately equally accurate, but not identical.


Figure 7: Speech recognition threshold for 32 CI users, as measured experimentally and as predicted by models A (a) and B (b). Each line connects the SRT predicted from the test (circles) and retest (squares) data. The SRT is expressed in dB and a lower value means better speech recognition.

Both models also overestimate performance (i.e., underestimate the SRT), which most likely is a result of using an optimal decision unit as a model of the cognitive process. For many subjects there is a large discrepancy between the test and retest predictions, which indicates that the spectral discrimination test results are not very reliable. The tabulated correlation coefficients suggest that the retest measurements are more accurate, which is reasonable; some users may be confused by the fairly abstract discrimination task on the first occasion. The two models give different predictions, but are approximately equally accurate. The main advantage of model A is its ability to simulate any spectral resolution exactly, which model B cannot. However, model B is more useful because it is able to simulate different types of CI signal processing and coding strategies, as discussed below. The regression predictor gives smaller errors on average, as there is no bias present; looking at the correlation coefficients, however, the regression predictor is comparable to models A and B.

One major weakness of the spectral discrimination experiment is its inability to assess frequency specific information. This is because the signals used in the experiment are modulated across the entire frequency range. Thus, any frequency specific deficiency, for example, hearing only a small range of frequencies, cannot be detected. However, this does not imply that the resulting prediction will be inaccurate; as mentioned earlier, the stimuli are similar to MFCCs, which in some sense are the "basis functions" of speech, and one may argue that it is a more relevant domain than the frequency domain when modelling speech recognition. Of course, the optimal case would be to have access to data in both domains. Another possible weakness in the data is that the experiments were not done using roving sound levels, which means that discrimination can

potentially be performed by monitoring the intensity in a narrow frequency range, a problem that was illuminated in a recent study [33]. In this particular case, it should be a minor concern; since the data include a 1-band case (i.e., intensity discrimination), the models can adapt to situations where listeners are using mainly intensity as a cue for discrimination, and this will be reflected in the speech recognition simulation.

Although many simplifications have been made even in our advanced model B, it is fairly unlikely that using more accurate modelling would improve the prediction to any significant degree; there are simply too many unmeasurable factors influencing the outcome. However, the cause of an individual discrepancy may be unveiled in further psychophysical experiments. An interesting next step would be to construct a ladder of ecological validity: a series of experiments ranging from simple spectral discrimination up to sentence recognition, having intermediate steps that are more controlled than speech but more realistic than filtered noise. In this way, one is able to check which of the underlying assumptions in the presented framework hold and which do not.

A quite useful application of this framework would be to modify the signal processing and coding strategy of a modelled implant and observe the impact on the predicted speech recognition threshold. In this way, novel schemes can be evaluated by simulation. Although such results will never be a replacement for human trials, they can be useful at the development stage, as human trials tend to be very lengthy. Since the CI models are individualized, it would be possible to estimate which strategy is most suited for that particular user.


6. Conclusion

The two presented models of CI hearing are both capable of explaining as much as 50% of the variance in speech recognition capability among CI users. The framework for simulating speech communication is useful for evaluating novel signal processing strategies for CIs, both for CI users in general and for finding optimal settings for individual users. The spectral discrimination test used to assess users' hearing capabilities has proven to be useful, even though additional listening tests might enable more accurate modelling and prediction.

References

[1] F.-G. Zeng, A. N. Popper, and R. R. Fay, Eds., Cochlear Implants: Auditory Prostheses and Electric Hearing, Springer, New York, NY, USA, 2004.

[2] R. V. Shannon, "Multichannel electrical stimulation of the auditory nerve in man—I: basic psychophysics," Hearing Research, vol. 11, no. 2, pp. 157–189, 1983.

[3] L. M. Friesen, R. V. Shannon, D. Baskent, and X. Wang, "Speech recognition in noise as a function of the number of spectral channels: comparison of acoustic hearing and cochlear implants," Journal of the Acoustical Society of America, vol. 110, no. 2, pp. 1150–1163, 2001.

[4] B. A. Henry and C. W. Turner, "The resolution of complex spectral patterns by cochlear implant and normal-hearing listeners," Journal of the Acoustical Society of America, vol. 113, no. 5, pp. 2861–2873, 2003.

[5] E. Molin, A. Leijon, and H. Wallsten, "Spectro-temporal discrimination in cochlear implant users," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), vol. 3, pp. 25–28, Philadelphia, Pa, USA, 2005.

[6] Q.-J. Fu, R. V. Shannon, and X. Wang, "Effects of noise and spectral resolution on vowel and consonant recognition: acoustic and electric hearing," Journal of the Acoustical Society of America, vol. 104, no. 6, pp. 3586–3596, 1998.

[7] B. Lyxell, B. Sahlen, M. Wass, et al., "Cognitive development in children with cochlear implants: relations to reading and communication," International Journal of Audiology, vol. 47, supplement 2, pp. S47–S52, 2008.

[8] C. V. Pavlovic, G. A. Studebaker, and R. L. Sherbecoe, "An articulation index based procedure for predicting the speech recognition performance of hearing-impaired individuals," Journal of the Acoustical Society of America, vol. 80, no. 1, pp. 50–57, 1986.

[9] T. Y. C. Ching, H. Dillon, and D. Byrne, "Speech recognition of hearing-impaired listeners: predictions from audibility and the limited role of high-frequency amplification," Journal of the Acoustical Society of America, vol. 103, no. 2, pp. 1128–1140, 1998.

[10] I. Holube and B. Kollmeier, "Speech intelligibility prediction in hearing-impaired listeners based on a psychoacoustically motivated perception model," Journal of the Acoustical Society of America, vol. 100, no. 3, pp. 1703–1716, 1996.

[11] S. Stadler, A. Leijon, and B. Hagerman, "An information theoretic approach to estimate speech intelligibility for normal and impaired hearing," in Proceedings of the 8th Annual Conference of the International Speech Communication Association (Interspeech '07), Antwerpen, Belgium, 2007.

[12] M. S. A. Zilany and I. C. Bruce, "Predictions of speech intelligibility with a model of the normal and impaired auditory-periphery," in Proceedings of the 3rd International IEEE EMBS Conference on Neural Engineering, pp. 481–485, Kohala Coast, Hawaii, USA, 2007.

[13] D. Byrne, H. Dillon, T. Ching, R. Katsch, and G. Keidser, "NAL-NL1 procedure for fitting nonlinear hearing aids: characteristics and comparisons with other procedures," Journal of the American Academy of Audiology, vol. 12, no. 1, pp. 37–51, 2001.

[14] W. Radley, W. Bragg, R. Dadson, et al., Hearing Aids and Audiometers, Medical Research Council Special Report Series no. 261, Her Majesty's Stationery Office, London, UK, 1947.

[15] B. Hagerman, "Sentences for testing speech intelligibility in noise," Scandinavian Audiology, vol. 11, no. 2, pp. 79–87, 1982.

[16] B. R. Glasberg and B. C. J. Moore, "Derivation of auditory filter shapes from notched-noise data," Hearing Research, vol. 47, no. 1-2, pp. 103–138, 1990.

[17] H. Levitt, "Transformed up-down methods in psychoacoustics," Journal of the Acoustical Society of America, vol. 49, no. 2, part 2, pp. 467–477, 1971.

[18] B. Hagerman and C. Kinnefors, "Efficient adaptive methods for measuring speech reception threshold in quiet and in noise," Scandinavian Audiology, vol. 24, no. 1, pp. 71–77, 1995.

[19] A. Leijon, "Estimation of secondary information transmission using a hidden Markov model of speech stimuli," Acta Acustica united with Acustica, vol. 88, no. 3, pp. 423–432, 2002.

[20] B. H. Juang and L. R. Rabiner, "Hidden Markov models for speech recognition," Technometrics, vol. 33, no. 3, pp. 251–272, 1991.

[21] R. Plomp, Aspects of Tone Sensation, Academic Press, London, UK, 1976.

[22] G. J. McLachlan and D. Peel, Finite Mixture Models, John Wiley & Sons, New York, NY, USA, 2000.

[23] J. R. Bellegarda and D. Nahamoo, "Tied mixture continuous parameter modeling for speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 12, pp. 2033–2045, 1990.

[24] T. Cover and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.

[25] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, Cambridge, UK, 2006.

[26] F. Rattay, R. N. Leao, and H. Felix, "A model of the electrically excited human cochlear neuron—II: influence of the three-dimensional cochlear structure on neural excitability," Hearing Research, vol. 153, no. 1-2, pp. 64–79, 2001.

[27] COMSOL Multiphysics, http://www.comsol.com.

[28] H. Spoendlin and A. Schrott, "Analysis of the human auditory nerve," Hearing Research, vol. 43, no. 1, pp. 25–38, 1989.

[29] F. Rattay, "Analysis of models for external stimulation of axons," IEEE Transactions on Biomedical Engineering, vol. 33, no. 10, pp. 974–977, 1986.

[30] C. E. Zimmermann, B. J. Burgess, and J. B. Nadol Jr., "Patterns of degeneration in the human cochlear nerve," Hearing Research, vol. 90, no. 1-2, pp. 192–201, 1995.

[31] C. A. Miller, P. J. Abbas, B. K. Robinson, J. T. Rubinstein, and A. J. Matsuoka, "Electrically evoked single-fiber action potentials from cat: responses to monopolar, monophasic stimulation," Hearing Research, vol. 130, no. 1-2, pp. 197–218, 1999.


[32] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, "Convergence properties of the Nelder-Mead simplex method in low dimensions," SIAM Journal on Optimization, vol. 9, no. 1, pp. 112–147, 1998.

[33] M. J. Goupell, B. Laback, P. Majdak, and W.-D. Baumgartner, "Current-level discrimination and spectral profile analysis in multi-channel electrical stimulation," Journal of the Acoustical Society of America, vol. 124, no. 5, pp. 3142–3157, 2008.


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2009, Article ID 591921, 9 pages, doi:10.1155/2009/591921

Research Article

The Personal Hearing System—A Software Hearing Aid for a Personal Communication System

Giso Grimm,1 Gwenael Guilmin,2 Frank Poppen,3 Marcel S. M. G. Vlaming,4 and Volker Hohmann1,5

1 Medizinische Physik, Carl-von-Ossietzky Universitat Oldenburg, 26111 Oldenburg, Germany
2 THALES Communications, 92704 Colombes Cedex, France
3 OFFIS e.V., 26121 Oldenburg, Germany
4 ENT/Audiology, EMGO Institute, VU University Medical Center, 1007 MB Amsterdam, The Netherlands
5 HorTech gGmbH, 26129 Oldenburg, Germany

Correspondence should be addressed to Giso Grimm, [email protected]

Received 15 December 2008; Revised 27 March 2009; Accepted 6 July 2009

Recommended by Henning Puder

A concept and architecture of a personal communication system (PCS) is introduced that integrates audio communication and hearing support for the elderly and hearing-impaired through a personal hearing system (PHS). The concept envisions a central processor connected to audio headsets via a wireless body area network (WBAN). To demonstrate the concept, a prototype PCS is presented that is implemented on a netbook computer with a dedicated audio interface in combination with a mobile phone. The prototype can be used for field-testing possible applications and to reveal possibilities and limitations of the concept of integrating hearing support in consumer audio communication devices. It is shown that the prototype PCS can integrate hearing aid functionality, telephony, public announcement systems, and home entertainment. An exemplary binaural speech enhancement scheme that represents a large class of possible PHS processing schemes is shown to be compatible with the general concept. However, an analysis of hardware and software architectures shows that the implementation of a PCS on future advanced cell phone-like devices is challenging. Because of limitations in processing power, recoding of prototype implementations into fixed point arithmetic will be required, and WBAN performance is still a limiting factor in terms of data rate and delay.

Copyright © 2009 Giso Grimm et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The motivation for this study is to investigate the perspectives of improving hearing support and its acceptance by integrating communication services and hearing support systems. Hearing aids are the standard solution to provide hearing support for hearing-impaired persons. In many adverse conditions, however, current hearing aids are insufficient to alleviate the limitations in personal communication and social activities of the hearing-impaired. The most challenging problems are howling due to acoustic feedback from the hearing aid receiver to the microphones [1], interference of cell phone radio frequency components with hearing aids [2], low signal-to-noise ratios (SNRs) in public locations caused by competing noise sources, and reverberation [3]. A number of partial solutions addressing

these problems are available in current hearing aids. Signal processing solutions comprise noise reduction algorithms like spectral subtraction and directional microphones [3]. Other assistive solutions comprise direct signal transmission by telecoils, infrared, and radio systems [4, 5]. Recent technological progress opens up possibilities of improving these solutions. New bridging systems, currently intended mainly for connection to communication and home entertainment devices, are based on the digital BlueTooth protocol, for example, the ELI system [6]. New scalable algorithms can be adapted to different listening situations and communication environments and are expected to be beneficial for the end-user, either in terms of improved speech intelligibility or by enhancing speech quality and reducing listening effort [7]. A combination of the new signal processing schemes and communication options has not widely been explored


yet, and the final user benefit remains to be investigated. Dedicated prototype systems as investigated in this study might facilitate this type of research.

Contrary to a hearing-impaired person with moderate or strong hearing loss, a person with mild hearing loss, or, more generally, any person with light-to-moderate problems in hearing under adverse circumstances, will not wear a hearing aid or another hearing support system. Hearing support systems which are add-ons to existing communication devices might be beneficial for those users, and their acceptance is expected to be higher than that of conventional hearing aids.

Another factor that influences and might facilitate the further development of hearing support systems is the availability of standard hardware and open software for mobile devices, for example, the iPhone [8] or the GooglePhone [9]. These devices can act as a central processor for hearing aids with access to binaural audio information and the advantage of increased processing performance [10]. Based on such scalable systems, the integration of hearing support systems for slight-to-moderate hearing losses with communication applications seems to be feasible in principle, but is yet to be assessed in more detail. One step toward that direction is low-delay real-time signal processing systems based on standard hard- and software, such as the Master Hearing Aid (MHA) [11], a development framework for hearing aid algorithms. Another hardware-related factor is the development of Wireless Body Area Networks (WBANs), which can be seen as an enabling technology for mobile health care [12] and which could mediate the communication between a central processor and audio headsets attached to the ear like hearing aids.

In summary, recent developments open up the possibility of merging the functionality of traditional hearing aids and other hearing support systems for slight-to-moderate hearing losses on scalable hardware. This combination will be defined as a Personal Hearing System (PHS). Furthermore, the integration of this PHS with general and new communication applications of mobile phones and PDAs to define a Personal Communication System (PCS) may lead to new applications and improved hearing support. User inquiries regarding the acceptance of such a PCS have been carried out within the EU project HearCom [13], and its general acceptance was demonstrated, provided that the device is not larger than a mobile phone and includes its functionality. Some specific solutions to this already exist, but audio applications with scalable listening support for different types of hearing losses and having a connection to personal communication and multimedia devices are not yet available. The aim of this study is, therefore, to establish a basis for further research and development along these lines. In Section 2, the proposed architecture of a PCS is outlined. Section 3 describes the implementation of a prototype PCS which runs on a netbook computer and hosts four representative signal enhancement algorithms. A first evaluation of the hardware requirements (e.g., processing power, wireless link requirements), of the software requirements (scalable signal processing), and of the expected benefit for the end users is performed using this PCS prototype.

2. PCS Architecture

The PCS is a hand-held concentrator of information to facilitate personal communication. Figure 1 shows a block diagram of the projected PCS and its applications. The PCS is a development based on new advanced mobile telephones and Personal Digital Assistants (PDAs). The reason for selecting a mobile phone as a PCS platform is the availability of audio and data networking channels, like GSM, UMTS, BlueTooth, and WiFi. A global positioning system—if available—can be utilized by public announcement services.

Audio is played to the user via a pair of audio headsets. These audio headsets house loudspeakers/receivers for audio playback. Each audio headset also has two or three microphones, which can be configured to form a directional microphone for picking up environmental sounds, and to pick up the user's own voice for the phone application. As an option, the audio headsets provide audio processing capabilities similar to hearing aids.

A short-range wireless link (Wireless Body Area Network, WBAN) provides the connection between the PCS and the audio headsets, and optionally between the two audio headsets at the left and right ears. Mid-range and wide-range links are used to establish connections to telecommunication network providers and to local information services. All links are part of the wireless Personal Communication Link (PCL), which supplies information to the PCS and between the PCS and the audio headsets, as a successor for the inductive link (telecoil) of current hearing aids.

A key application on the PCS is the PHS: the audio communication channels of the PCS, for example, telephony, public announcement, and home entertainment, are processed in the PHS with personalized signal enhancement schemes and played back through the audio headsets. In addition to the PCS audio communication channels, the PHS can process environmental sounds picked up by the headset microphones near the user's ears. Processing methods may differ depending on the input, that is, acoustic input or input through the PCS communication channels. The functionality of the PHS covers that of a conventional hearing aid and adds some additional features. (i) Increased connectivity: the PCS provides services which can connect external sources with the PHS. (ii) Advanced audio signal processing schemes: the computational power and battery size of the central processing device allow for algorithms which otherwise would not run on conventional hearing aids. (iii) Potential of production cost reduction: usage of standard hardware may reduce production, marketing, distribution, and service costs if consumer headsets with slight modifications, for example, the addition of microphones for processing of environmental sounds, can be used (which is limited to subjects with mild to moderate hearing loss).

2.1. Distributed Processing. For processing the PCS audio communication channels, a unidirectional link from the central processor to the headsets is sufficient, and the link delay is not critical as long as it remains below 50–100 ms. Processing environmental sounds in the central processor, however, requires a bidirectional link which needs further consideration.


Figure 1: Architecture of a Personal Communication System (PCS) hosting the personal hearing system. The PCS (large shaded box) is hosted on an advanced mobile phone. The Personal Hearing System (PHS) is a software component for signal enhancement, processing audio output of the PCS communication channels and processing environmental signals. The Personal Communication Link (PCL) transfers environmental sounds to the PHS and the processed sounds or control information back to the audio headsets.

In general, all processing blocks can be run either on the audio headsets or on the central processor. The optimal choice for each processing block depends on several issues: (i) The computational performance and battery capacity of the audio headsets are typically low and do not allow complex algorithms. (ii) The central processor or the PCL might not be available continuously because of wireless link breakdowns. Therefore, at least basic processing like amplification for hearing loss correction is required to run on the audio headsets. (iii) Depending on the properties of the PCL, the delay might exceed the tolerable delay for processing of environmental sounds [14], and will constrain the algorithms on the central processor. Link delays smaller than 10 ms would allow routing the signal through the central processor. In typical hearing aid applications, signal enhancement schemes precede the processing blocks for hearing loss correction (e.g., amplification and compression). To avoid the transmission of several signal streams, only one set of successive processing blocks can be run on the central processor. Whether emerging WBAN technology will be powerful enough to achieve the delay limit is still unclear. If the total link delay is longer than

about 10 ms, the signal path needs to remain completely on the audio headsets. Then, processing on the central processor is restricted to signal analysis schemes that control processing parameters of the signal path, for example, classification of the acoustical environment, direction of arrival estimation, and parameter extraction for blind source separation. In general, it seems feasible that these complex signal analysis schemes and upcoming complex, processing-performance-demanding algorithms for Auditory Scene Analysis [15] might not necessarily be part of the signal path. The projected architecture might, therefore, be suited for these algorithms, which could benefit from the high signal processing and battery power of the central processor. Other requirements for the link are bandwidth and low power consumption: to allow for multichannel audio processing, several (typically two or three) microphone signals from each ear are required, asking for sufficient link bandwidth. Additionally, if signals are transmitted in compressed form, the link signal encoder should not modify the signal, to avoid artifacts and performance decreases in multichannel processing. To ensure long battery life, the link should use low power. To reduce the link power consumption, the PHS could provide advanced processing only on demand. Switching on advanced processing and the link might be controlled either manually or by an automatic audio analysis in the headsets.

The architecture of the PHS with a central processor gives the ability to process binaural information in the central processor and unilateral information either in the central processor or in the audio headsets. Considering typical processing schemes in hearing aids, unilateral processing comprises dynamic compression, single channel noise reduction, and feedback cancellation. Typical applications of the central processor are binaural and multimicrophone methods, for example, binaural ambient noise reduction, beamformer, and blind source separation [3]. If the link delay is not sufficiently small to route the signal path through the central processor, binaural processing can still be achieved, assuming a signal analysis algorithm running on the central processor processes signals from the left and right side and controls the signal path on both sides.

3. Implementation of a Prototype System

To assess the PCS architecture and applications experimentally, a prototype PCS has been implemented on a small notebook computer ("netbook") in combination with a Smartphone. To demonstrate PHS applications, several signal enhancement algorithms have been realized on the PCS prototype using the MHA algorithm development environment, see Section 3.2.1. A dedicated audio interface that was developed within the EU HearCom project to connect audio headsets to the netbook is described in Section 3.2.2. A phone service as a prototype application of the PCS, implemented on a Smartphone, is described in Section 3.3.2. One signal enhancement algorithm (coherence-based dereverberation [7]) was taken as an example and has been tested on its conformity with the concept of the PHS.


3.1. Architecture. See Figure 2 for a schematic signal flow of the prototype system. The PHS is implemented using a separate notebook computer. Notebook computers deliver sufficient performance for audio signal processing. Selecting signal processing algorithms carefully and using performance optimization techniques allows for stripping down the PC platform. However, a floating point processor is required for the prototype algorithms, which prevents using fixed point processor-based PDAs or Smartphones. Using floating point algorithms enables fast prototyping and very early field testing. In a later step, it is necessary to recode positively evaluated algorithms to a fixed point representation and install these on PDAs or Smartphones. PCS services are implemented on a Smartphone with networking capabilities. The PCS-PHS link is realized as a WiFi network connection. The audio headsets are hearing aid shells with microphones and receiver, without signal processing capabilities. The audio headsets are connected to the PHS via cables and a dedicated audio interface. The audio headset signal processing capabilities are simulated on the central processor. Figure 3 shows a netbook-based PHS prototype.

3.2. Hardware Components. In the following sections, the hardware components of the prototype implementation are described.

3.2.1. Netbook: Asus Eee PC. For the prototype system, a miniature notebook has been used as a hardware-accelerated floating point processor for the PHS: the Asus Eee PC is a small and lightweight notebook PC, about 15 × 22 cm in size and weighing 990 grams. It provides an Intel Celeron M processor, running at a clock rate of 630 MHz. To achieve low delay signal processing in a standard operating system environment, a Linux operating system (UbuntuStudio 8.04) with a low-delay real-time patched kernel (2.6.24-22-rt) has been installed. For comparison, the system was also installed on an Acer Aspire one netbook PC and a standard desktop PC.

3.2.2. Dedicated Audio Interface. A detailed market survey showed that commercially available audio interfaces cannot satisfy all requirements for the mobile PHS prototype. High-quality devices as used in recording studios offer the required signal quality and low latency but are not portable because of size, weight, and external power supply. Portable consumer products do not offer the required quality, low latency, and number of capture and playback channels. Therefore, a dedicated USB audio interface has been developed which fulfills the requirements of the PHS prototype. The audio interface has been developed in two variants: a device with four inputs and two outputs to drive two audio headsets with two microphones in each headset (USBSC4/2), and a device with six inputs and two outputs for two audio headsets with three microphones each (USBSC6/2). The basis for both devices is a printed circuit board (PCB). The USBSC4/2 contains one PCB, shown as PCB1 in Figure 4. Assembled are two stereo ADs (four channels) and one stereo DA (two channels). A microcontroller (μC) implements the USB2.0 interface to PC hardware. A complex programmable logic

Figure 2: Prototype implementation of the PCS. The PCS services are hosted in a Smartphone, the PHS (mainly signal processing) is hosted in a portable PC. The PC connects to the hearing aid shells via a dedicated audio interface.

Figure 3: PHS prototype based on the Asus Eee PC, with a dimension of 22 × 15 cm and a weight of 1.2 kg, including the sound card and audio headsets.

device (CPLD) serves as glue logic between the μC and AD/DA. The advantage of CPLDs is the possibility to reconfigure their interconnecting structure in system. This feature is used to connect two PCBs and build one device with more channels (USBSC6/2).


Figure 4: Architecture of the dedicated audio interface, with four inputs and two outputs (PCB1 only), or six inputs and two outputs (PCB1 and PCB2).

A simple reconfiguration of the CPLDs and a software exchange on the μC (exchange of firmware) enable the configuration of other devices in the shortest time, as, for example, a device with four inputs and outputs, or eight inputs and no outputs. The hardware is also applicable outside the scope of hearing aid research: with minor modifications, it can be used as a mobile recording device or as a consumer sound card for multimedia PCs.

For future usage of the developed hardware, device variations in the number and type of channels, depending on user requirements, are quickly achievable. The architecture is extendable by a hardware signal processing unit for user-defined audio preprocessing by exchanging the CPLD with more complex components like field programmable gate arrays (FPGAs). This extension would decrease the CPU load of the host PC, or would allow for a higher computational complexity of the algorithm. It has to be stated, though, that the implementation of algorithms in FPGAs using a fixed point hardware description language (HDL) like VHDL or Verilog is even more elaborate than transferring floating point software to fixed point. Thus, this approach is only adequate for well evaluated and often used algorithms like, for example, the FFT, due to high nonrecurring engineering costs.

The developed audio interface is a generic USB2.0 audio device that does not require dedicated software drivers for PCs/Notebooks running under the Linux operating system. The device utilizes the USB2.0 isochronous data transfer connection for low latency, and, therefore, does not work with USB1.

The audio interface is equipped with connectors to directly connect two hearing aid shells housing a receiver and up to three microphones. The device provides a microphone power supply. The USB audio interface is powered via the USB connection. RC filters and ferrite beads are used to suppress noise introduced by the USB power supply. One RC filter is placed directly at the supply input. Furthermore, at each AD- and DA-converter, one filter is placed close to the analog and the digital supply, respectively. Additionally, noise is suppressed by the use of ferrite beads in each supply line of each converter.

3.3. Software Components. In the following sections, the major software components used in the PCS prototype implementation are described.

3.3.1. PHS Algorithms. In the PHS prototype, four representative signal enhancement schemes have been implemented: single-channel noise suppression based on perceptually optimized spectral subtraction (SC1), Wiener-filter-based single-channel noise suppression (SC2), spatially preprocessed speech-distortion-weighted multichannel Wiener filtering (MWF), and a binaural coherence dereverberation filter (COH) [7]. Individually fitted dynamic compression and frequency-dependent amplification was placed after the signal enhancement algorithm to compensate for the user's hearing loss. Hearing loss compensation without any specific signal enhancement algorithm is labelled REF. The MHA was used as a basis of the implementation [11].

The prototype algorithms are processed at a sampling rate of 16 kHz in blocks of 32 samples, that is, 2 ms. Audio samples are processed as 32-bit floating point values, that is, four bytes per sample.

As an example, we look at the coherence-based dereverberation filter in more detail: the microphone signal is transformed into the frequency domain by a short-time fast Fourier transform (FFT) with overlapping windows [16]. At both ears, the algorithm splits the microphone signals X_l and X_r into nine overlapping frequency bands. In each frequency band k, the average phase ϕ across the FFT bins ν belonging to the frequency band k is calculated, ϕ(k) = ∠ Σ_ν W(k, ν) X(ν). The weighting function W(k, ν) defines the filter shape of the frequency band, see [11] for details. Also, ϕ is implicitly averaged across time over the length of one analysis window. Comparing the phase with the phase of the contralateral side results in the interaural phase difference (IPD) within a frequency band. The phase difference ϕ_l − ϕ_r is represented as a complex number on the unit circle, z = e^{j(ϕ_l − ϕ_r)}. The estimated coherence is the gliding vector strength c of z, c = |⟨z⟩_τ|, with the averaging time constant τ. The estimated coherence is directly transformed into a gain by applying an exponent α, G = c^α. This gain is applied to the corresponding frequency band prior to its transformation back to the time domain. A detailed description of the algorithm and its relation to a cross-correlation based dereverberation algorithm can be found in [17].
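A minimal per-frame sketch of the COH gain computation is given below; the band weighting matrix W, the exponent α, and the first-order recursive average used for the gliding vector strength are assumptions chosen for illustration, not the exact values or smoothing used in [7, 11, 17].

```python
import numpy as np

class CoherenceFilter:
    """Per-band gains of the binaural coherence dereverberation scheme.

    W     : band weighting matrix (n_bands x n_fft_bins) defining the nine
            overlapping frequency bands.
    alpha : exponent mapping estimated coherence to gain, G = c**alpha.
    tau   : averaging time constant of the gliding vector strength, in frames.
    """

    def __init__(self, W, alpha=2.0, tau=8.0):
        self.W = W
        self.alpha = alpha
        self.beta = 1.0 / tau                       # recursive averaging weight
        self.z_avg = np.zeros(W.shape[0], dtype=complex)

    def band_gains(self, X_left, X_right):
        """X_left, X_right: complex STFT spectra of the current frame."""
        phi_l = np.angle(self.W @ X_left)           # average phase per band, left ear
        phi_r = np.angle(self.W @ X_right)          # average phase per band, right ear
        z = np.exp(1j * (phi_l - phi_r))            # IPD on the unit circle
        self.z_avg += self.beta * (z - self.z_avg)  # gliding average over time
        coherence = np.abs(self.z_avg)              # vector strength c
        return coherence ** self.alpha              # gain G = c**alpha per band
```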

3.3.2. PCS-PHS Link for Testing Multimedia Applications. To provide ways to connect PCS audio communication streams to the PHS, a specific network audio interface has been implemented in the PHS. Together with a sender application, this interface forms the PCS-PHS link, which uses the Internet Protocol Suite (TCP/IP). The link can be established on demand, and contains a protocol to select appropriate mixing strategies for different signal sources: the level of the source signal can be matched with the environmental sound level, and environmental sounds can be suppressed for better speech intelligibility or alarm signal recognition. The mixing configuration is followed by the audio stream.


Whenever a phone connection is established, the sender application in the PCS connects to the PCS-PHS link and recodes the phone's receiver output for transmission to the PHS. To avoid drop-outs in the audio stream, the signal from the phone has to be buffered, introducing a delay between input and output. To reduce the delay caused by the WiFi connection, the packet size was reduced to a minimum. The total delay varies between 360 and 500 ms. The long delay is specific to the prototype implementation with a WiFi link; the final application will not include the WiFi link between the PCS phone service and the PHS, since both services are then hosted on the same machine. Via a cable-bound network connection, delays on the order of 5 ms can be reached, for example, by using the "NetJack" system [18] or with "soundjack" [19]. An alternative approach is an analog connection. However, this would not allow for sending control parameters to the PHS.

3.4. Evaluation Results

3.4.1. Computational Complexity and Power Consumption. The computational complexity of the PHS prototype system is estimated by measuring the CPU time needed to process one block of audio data, divided by the duration of one block. For real-time systems, this relative CPU time needs to be below one. For most operating systems, the maximum relative CPU time depends also on the maximum system latency and the absolute block length. A detailed discussion of relative CPU time and real-time performance can be found in [11]. The relative CPU time of the PHS running the four respective signal enhancement algorithms is shown in Table 1.
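The relative CPU time measure can be sketched as follows; process_block stands for a hypothetical per-block processing callback, and for the prototype parameters one block is 32 samples at 16 kHz, i.e., 2 ms.

```python
import time

def relative_cpu_time(process_block, block, block_duration_s=0.002, n_blocks=1000):
    """CPU time per processed block divided by the block duration.

    Values below 1 are required for real-time operation.
    """
    start = time.process_time()
    for _ in range(n_blocks):
        process_block(block)
    elapsed = time.process_time() - start
    return (elapsed / n_blocks) / block_duration_s
```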

For a portable PHS, a long battery runtime is desirable. The battery runtime of the PHS prototype has been measured by continuously measuring the battery voltage while running the PHS with the respective signal enhancement algorithms. The battery was fully charged before each measurement. The time until automatic power-down is given in Table 1. During the test, the netbook lid was closed and the display illumination was turned off. The measurement was performed once. To check that the battery age did not significantly influence the results, the first measurement was repeated. No significant differences have been observed. However, slight variations might have been caused by additional CPU load from background processes of the operating system, and by differences in the access to other hardware, for example, memory. The correlation between total CPU load and battery time is very high for the Asus Eee PC and low for the Acer Aspire one. The low correlation between CPU usage and battery runtime for the Acer Aspire one might be an indication of less efficient hardware.

3.4.2. Benefit for the End User. The algorithm performance in terms of speech recognition thresholds (SRTs) and preference rating has been assessed in the HearCom project in a large multicenter study [7]. As an example, speech recognition threshold improvement data from [7] is given in Table 2.

Figure 5: Preference histogram. COH is preferred against the reference condition "REF" by 80.6% of the hearing impaired and by 61.1% of the normal hearing subjects. The categories 1–5 are "very slightly better," "slightly better," "better," "much better," and "very much better."

The standard deviation of the results across the four different test sites is marginal, which demonstrates the reliability of the PHS prototype as a research and field testing hearing system. While the speech intelligibility could not be improved by the algorithm COH, it was preferred by most subjects over "REF" processing (i.e., only hearing loss correction), see Figure 5. Hearing-impaired subjects show a clearer preference for COH than normal hearing subjects do. The listening effort can be reduced by COH if the SNR is near 0 dB [20]. Even if the SRT cannot be improved by the algorithm, the reduction of listening effort is a major benefit for the user. Furthermore, a combination with the MWF algorithm is possible and indicated, since both methods exploit different signal properties (directional versus coherence properties). An improvement of the beamformer performance is likely if the coherence filter precedes the beamformer [21].

3.4.3. Requirements towards the PCL. The requirements toward the wireless link between headsets and central processor vary with the algorithms. Estimated data rates for 4-byte sample formats without any further data compression are given in this section as a worst-case scenario. The link bandwidth required to transmit all six microphone channels is 768 bytes per block in the direction from the headsets to the central processor and 256 bytes per block in the other direction (two receiver signals). With two headsets and 500 blocks per second, this leads to a required (uncompressed) bandwidth of 3 MBit/s from the headsets to the central processor and 1 MBit/s back to the headsets. The requirements of the coherence filter "COH" toward the link bandwidth in three scenarios are presented. The trivial scenario is the condition where the algorithm is running on the central processor, where the full audio signal of both


Table 1: Performance of the PHS prototype system. In addition to the algorithm CPU time, the CPU time used for signal routing, resampling, overlap-add, spectral analysis, and hearing loss correction was measured (labelled MHA), as well as the CPU time used by the jackd sound server and the sound card interrupt handler.

Algorithm   Asus Eee PC 4G (Celeron M, 630 MHz)   Acer Aspire one (Atom N270, 1.6 GHz)   Desktop PC (Pentium 4, 3 GHz)
            % CPU    batt.                        % CPU    batt.                         % CPU
SC1         23.3%    3h14′                        20.5%    2h28′                         8.5%
SC2         12.0%    3h19′                        11.0%    2h27′                         4.0%
MWF         16.7%    3h16′                        17.0%    2h09′                         4.5%
COH         3.5%     3h22′                        4.5%     2h27′                         1.0%
MHA         33.1%    —                            30.5%    —                             12.0%
jackd       6.0%     —                            4.8%     —                             2.2%
IRQ         4.5%     —                            4.2%     —                             1.0%

Table 2: Speech recognition threshold (SRT) improvement (i.e., difference to identity processing with hearing loss correction "REF") in dB SNR for the four algorithms, measured at four test sites. The standard deviation across test sites is marginal, which demonstrates the reliability of the PHS prototype as a research and field testing hearing system; data from [7].

Test site            SRT improvement/dB
                     SC1     SC2     MWF     COH
1                    0.0    −0.2     7.1     0.2
2                    0.0    −0.1     6.7    −0.2
3                   −0.1    −0.1     6.2    −0.5
4                    0.2    −0.3     6.7     0.1
Average              0.1    −0.1     6.7    −0.1
Standard deviation   0.14    0.09    0.36    0.32

sides is required, that is, 1 MBit/s in each direction. A lower bandwidth is required if only signal analysis is performed in the central processor: the phase information in nine frequency channels of each side and for each signal block is required, leading to a headset-to-processor bandwidth requirement of 281.25 kBit/s. For the other direction, nine gains are transmitted, with identical gains for both sides. This results in a bandwidth requirement of 140.625 kBit/s. The third scenario is a situation where signal analysis and filtering are processed in the audio headsets, and the link is only used for data exchange between the audio headsets. Then only the phase information is exchanged, that is, 140.625 kBit/s are required in each direction. These bandwidth requirements do not include data compression. With special signal coding strategies, the bandwidth requirements can be further reduced. The bandwidth of current hearing aid wireless systems is in the range of 0.1–100 kBit/s, with a range of approximately one meter. The power consumption is below 2 mW. BlueTooth technology provides data rates between 10 kBit/s and 1 MBit/s, at a power consumption of 25–150 mW, and a range between 3 and 10 m. Low-delay codecs can achieve transmission delays below 1 ms at a bandwidth of 32 kBit/s [22].
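The quoted uncompressed link bandwidths follow directly from the block format (32 samples of 4 bytes per channel at 500 blocks per second); a small sketch reproducing the approximate figures:

```python
def link_bandwidth_kbit_s(channels, samples_per_block=32, bytes_per_sample=4,
                          blocks_per_second=500):
    """Uncompressed PCL bandwidth for a given number of audio channels."""
    bytes_per_block = channels * samples_per_block * bytes_per_sample
    return bytes_per_block * blocks_per_second * 8 / 1000.0

print(link_bandwidth_kbit_s(6))   # six microphone channels: 3072 kBit/s, about 3 MBit/s
print(link_bandwidth_kbit_s(2))   # two receiver channels: 1024 kBit/s, about 1 MBit/s
```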


Table 3: Evaluation results of the dedicated audio interface. As a reference device for the dynamic range measurements, an RME ADI8 Pro converter has been used. The minimal delay depends not only on the sampling rate (and thus on the length of the anti-aliasing filters) but also on the achievable minimal block lengths.

Sampling rates: 16, 32, 44.1, 48, 96 kHz

Input
  Input sensitivity 0 dBFS: −17.5 dBu (0.3 Vpp)
  Impedance (1 kHz): 10 kΩ
  Dynamic range (S/N): 92 dB unweighted

Output
  Output level 0 dBFS: 2.1 dBu (2.8 Vpp)
  Impedance (1 kHz): 7.4 Ω
  Dynamic range (S/N): 94 dB unweighted

Round trip
  Frequency response, −1.5 dB: 8 Hz–6.7 kHz @ 16 kHz; 9 Hz–13.2 kHz @ 32 kHz; 10 Hz–17.9 kHz @ 44.1 kHz; 11 Hz–19.1 kHz @ 48 kHz; 14 Hz–33.5 kHz @ 96 kHz
  Frequency response, −0.5 dB: 12 Hz–5.4 kHz @ 16 kHz; 14 Hz–10.5 kHz @ 32 kHz; 15 Hz–14.1 kHz @ 44.1 kHz; 15 Hz–15.1 kHz @ 48 kHz; 17 Hz–27.8 kHz @ 96 kHz
  THD (1 kHz, −32 dBu): −75.1 dB
  Minimal total delay (excluding algorithmic delay, e.g., overlap-add): 9.81 ms @ 16 kHz; 7.38 ms @ 32 kHz; 8.44 ms @ 44.1 kHz; 7.96 ms @ 48 kHz; 5.97 ms @ 96 kHz

approximately 10 ms is acceptable. For larger delays, signal analysis on the central processor remains possible. However, when the sequence of filter coefficients is delayed relative to the signal to which it is applied, signal distortion will arise.

Informal listening tests revealed that a link delay of 25 ms is acceptable for the COH algorithm. Because this algorithm represents the class of speech envelope filters, this margin might apply to the more general case, too. The total delay of the prototype system with the "COH" algorithm is 11.4 ms using the dedicated USB audio interface, and 10.1 ms using an RME HDSP 9632 audio interface with RME ADI-8 Pro converters.
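To indicate what such an envelope-type gain rule involves, the following sketch computes per-band gains from a recursively smoothed interaural coherence estimate and applies them with a configurable block delay. It is a simplified illustration in the spirit of the coherence-based scheme [17, 20, 21]; the smoothing constant, the direct use of the coherence as the gain, and the nine-band layout are illustrative assumptions, not the published parameterization.

```python
import numpy as np

def coherence_gains(left, right, alpha=0.9):
    """Per-band gains from short-time interaural coherence.

    left, right: complex subband (STFT) coefficients, shape (n_blocks, n_bands).
    alpha: recursive smoothing constant (illustrative value).
    Returns real-valued gains in [0, 1], shape (n_blocks, n_bands).
    """
    n_blocks, n_bands = left.shape
    cross = np.zeros(n_bands, dtype=complex)
    p_l = np.full(n_bands, 1e-10)
    p_r = np.full(n_bands, 1e-10)
    gains = np.empty((n_blocks, n_bands))
    for k in range(n_blocks):
        cross = alpha * cross + (1 - alpha) * left[k] * np.conj(right[k])
        p_l = alpha * p_l + (1 - alpha) * np.abs(left[k]) ** 2
        p_r = alpha * p_r + (1 - alpha) * np.abs(right[k]) ** 2
        # Magnitude-squared coherence is high for a coherent (e.g., frontal)
        # source and low for diffuse noise; here it is used directly as the gain.
        gains[k] = np.abs(cross) ** 2 / (p_l * p_r)
    return np.clip(gains, 0.0, 1.0)

def apply_with_delay(stft, gains, delay_blocks):
    """Apply per-band gains that arrive delay_blocks later than the signal."""
    shifted = np.roll(gains, delay_blocks, axis=0)
    shifted[:delay_blocks] = 1.0          # no gain available for the first blocks
    return stft * shifted

# Toy usage: 9 bands at 500 blocks/s; a 25 ms link delay is roughly 12 blocks.
rng = np.random.default_rng(0)
L = rng.standard_normal((1000, 9)) + 1j * rng.standard_normal((1000, 9))
R = 0.8 * L + 0.2 * (rng.standard_normal((1000, 9)) + 1j * rng.standard_normal((1000, 9)))
g = coherence_gains(L, R)
out = apply_with_delay(L, g, delay_blocks=12)
```

Because such gains track the slowly varying speech envelope, shifting them by a few blocks changes the output only gradually, which is consistent with the 25 ms delay tolerance found in the listening tests.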

3.4.4. Technical Performance of the Dedicated Audio Interface. The technical performance of the dedicated audio interface is given in Table 3. During the evaluation of the dedicated audio interface, the following factors affecting the audio quality of the device were identified. (i) A notebook should be disconnected from the power supply and used in battery mode, to avoid a 50 Hz distortion caused by the mains power supply. (ii) Other USB devices should not be connected to the same USB controller/hub, since the data transmissions of these devices can interfere with the USB power supply (crosstalk between data and power wires) and thereby degrade signal quality. (iii) Front-panel PC USB ports are often attached to the mainboard by long ribbon cables; running alongside a gigahertz processor, this arrangement introduces a large amount of interference and noise.

4. Discussion

New technological developments make the development of a communication and hearing device with advanced and personalized signal processing of audio communication channels feasible. User inquiries underline that such a development would be accepted by the end users. Such a device has the potential of being accepted as an assistive listening device and as a "beginner" hearing aid. However, its introduction depends on the availability of audio headsets with microphones and a low-power, low-delay link that is bidirectional or at least unidirectional. If the link to the audio headsets is not low-delay, or is only unidirectional, then environmental sounds cannot be processed on the central processor, and the benefit of the PCS would be reduced to personalized postprocessing of audio streams from telephony, multimedia applications, and public announcement systems. This processing is usually not as computationally demanding as algorithms for environmental audio processing, for example, auditory scene analysis. Whether the central processor can be used for computationally demanding algorithms depends on whether the data exchanged between the central processor and the headsets can be restricted to preprocessed signal parameters and time-dependent gain values. Whether the full audio signal can be transmitted at very low delays remains unclear to date.

The implementation of a prototype system revealed barriers and solutions in the development of a PCS as a concentrator of communication channels. The advantage of using high-level programming languages in algorithm development is partly offset by the need for a floating-point processor; current smartphones and PDAs offer only fixed-point processing. The continuing convergence of miniature notebook PCs and mobile phones might lead to a new generation of mobile phones providing floating-point processing, but this is unclear at present. The solution for the prototype was to choose a separate small notebook PC as the hardware for the PHS, with a network connection to the PCS. The processor performance of such a netbook is sufficient to host most recent advanced signal enhancement algorithms. The battery of a netbook computer provides a runtime that is sufficient for field testing; however, final realizations of the PHS for everyday use must provide a significantly longer runtime before recharging. The audio quality of the dedicated audio interface is sufficiently high for field testing.
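To give an impression of the recoding effort mentioned here, the fragment below contrasts a floating-point gain stage with an equivalent Q15 fixed-point version, as would typically be needed when porting to fixed-point mobile or DSP hardware. The helper functions and the choice of the Q15 format are illustrative assumptions, not code from the prototype.

```python
# Illustrative only: a floating-point gain stage and a Q15 fixed-point
# equivalent. The Q15 format and helper names are assumptions, not
# prototype code.
Q15_ONE = 1 << 15           # 1.0 in Q15

def float_to_q15(x: float) -> int:
    """Quantize a value in [-1, 1) to Q15 with saturation."""
    return max(-Q15_ONE, min(Q15_ONE - 1, int(round(x * Q15_ONE))))

def q15_mul(a: int, b: int) -> int:
    """Fixed-point multiply: 32-bit product, rounded, shifted back to Q15."""
    return (a * b + (1 << 14)) >> 15

# Floating-point reference: apply a per-band gain to a sample.
gain, sample = 0.37, -0.52
print(gain * sample)                          # -0.1924

# Fixed-point equivalent: operands and result all live in Q15.
g_q, s_q = float_to_q15(gain), float_to_q15(sample)
print(q15_mul(g_q, s_q) / Q15_ONE)            # ~-0.1924, within Q15 precision
```

On fixed-point processors this multiply maps onto a single fractional multiply-accumulate instruction; the main porting effort typically lies in choosing scalings that avoid overflow throughout the processing chain.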

5. Conclusions

An architecture of a personal communication system with a central processor and wireless audio headsets seems feasible with the expected WBAN developments. However, algorithms have to be tailored to match WBAN limitations, and the audio headsets need microphones and processing capabilities of their own. The presented binaural noise reduction scheme "COH" is one example of an algorithm that might match these constraints.

The use of scalable hardware and software is feasible, but direct reuse of the prototype software in products cannot be expected: because floating-point processing is not available in mobile hardware, floating-point implementations have to be recoded to a fixed-point representation. This is not expected to change in the near future.

The prototype system is helpful for algorithm evaluation and for testing possible PCS applications, but the gap towards real systems is still large.

Future work should investigate the concept further by implementing and field-testing additional algorithms for hearing support and communication options using the prototype system.

Acknowledgments

The authors thank the collaborating partners within the HearCom project on Hearing in the communication society, especially the partners of WP5 and WP7 for providing the subjective evaluation data, and Siemens Audiologische Technik for providing the headsets. This work was supported by grants from the European Union FP6, Project 004171 HEARCOM.

References

[1] J. M. Kates, "Feedback cancellation in hearing aids: results from a computer simulation," IEEE Transactions on Signal Processing, vol. 39, no. 3, pp. 553–562, 1991.

[2] T. Victorian and D. Preves, "Progress achieved in setting standards for hearing aid/digital cell phone compatibility," The Hearing Journal, vol. 57, no. 9, pp. 25–29, 2004.

[3] J. M. Kates, "Signal processing for hearing aids," in Applications of Digital Signal Processing to Audio and Acoustics, Kluwer Academic Publishers, 1998.

[4] H. Dillon, Hearing Aids, Boomerang Press, Turramurra, Australia, 2001.

[5] B. Marshall, "Advances in technology offer promise of an expanding role for telecoils," The Hearing Journal, vol. 55, no. 9, pp. 40–41, 2002.

[6] J. L. Yanz, "Phones and hearing aids: issues, resolutions, and a new approach," The Hearing Journal, vol. 58, no. 10, pp. 41–48, 2005.

[7] K. Eneman, H. Luts, J. Wouters, et al., "Evaluation of signal enhancement algorithms for hearing instruments," in Proceedings of the 16th European Signal Processing Conference, Lausanne, Switzerland, 2008.

[8] N. Kerris, "Apple reinvents the phone with iPhone," 2007, http://www.apple.com/pr/library/2007/01/09iphone.html.

[9] V. Nguyen, Google phone, 2008, http://www.google-phone.com.

[10] J. C. Anderson, "Hearing aid with wireless remote processor," US patent 5721783, February 1998.

[11] G. Grimm, T. Herzke, D. Berg, and V. Hohmann, "The master hearing aid: a PC-based platform for algorithm development and evaluation," Acta Acustica united with Acustica, vol. 92, no. 4, pp. 618–628, 2006.

[12] C. Otto, A. Milenkovic, C. Sanders, and E. Jovanov, "System architecture of a wireless body area sensor network for ubiquitous health monitoring," Journal of Mobile Multimedia, vol. 1, no. 4, pp. 307–326, 2006.

[13] J. Kewley, N. Thomas, M. Henley, et al., "Requirement specification of user needs for assistive applications on a common platform," October 2005, http://hearcom.eu/about/DisseminationandExploitation/deliverables.html.

[14] M. A. Stone and B. C. J. Moore, "Tolerable hearing-aid delays: IV. Effects on subjective disturbance during speech production by hearing-impaired subjects," Ear and Hearing, vol. 26, no. 2, pp. 225–235, 2005.

[15] D. Wang and G. J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, Wiley/IEEE Press, New York, NY, USA, 2006.

[16] J. B. Allen, "Short term spectral analysis, synthesis, and modification by discrete Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, no. 3, pp. 235–238, 1977.

[17] G. Grimm, B. Kollmeier, and V. Hohmann, "Increase and subjective evaluation of feedback stability in hearing aids by a binaural coherence based noise reduction scheme," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 7, pp. 1408–1419, 2009.

[18] A. Carot, T. Hohn, and C. Werner, "Netjack—remote music collaboration with electronic sequencers on the internet," in Proceedings of the Linux Audio Conference, Parma, Italy, 2009.

[19] A. Carot, U. Kramer, and G. Schuller, "Network music performance (NMP) in narrow band networks," in Proceedings of the 120th AES Convention, Audio Engineering Society, May 2006.

[20] T. Wittkop and V. Hohmann, "Strategy-selective noise reduction for binaural digital hearing aids," Speech Communication, vol. 39, pp. 111–138, 2003.

[21] C. Faller and J. Merimaa, "Source localization in complex listening situations: selection of binaural cues based on interaural coherence," Journal of the Acoustical Society of America, vol. 116, no. 5, pp. 3075–3089, 2004.

[22] W. Bastiaan Kleijn and M. Li, "Optimal codec for wireless communication links," in Proceedings of the HearCom Open Workshop, Brussels, Belgium, January 2009.