SPEAKER TRACKING WITH SPHERICAL MICROPHONE ARRAYS

John McDonough 1, Kenichi Kumatani 1,2, Takayuki Arakawa 3, Kazumasa Yamamoto 4, Bhiksha Raj 1

1 Carnegie Mellon University, Pittsburgh, PA, USA; 2 Spansion, Inc., Sunnyvale, CA, USA

3 NEC Corporation, Kawasaki-shi, Japan; 4 Toyohashi University of Technology, Toyohashi-shi, Japan

ABSTRACT

In prior work, we investigated the application of a spherical microphone array to a distant speech recognition task. In that work, the relative positions of a fixed loudspeaker and the spherical array required for beamforming were measured with an optical tracking device. In the present work, we investigate how these relative positions can be determined automatically for real, human speakers based solely on acoustic evidence. We first derive an expression for the complex pressure field of a plane wave scattering from a rigid sphere. We then use this theoretical field as the predicted observation in an extended Kalman filter whose state is the speaker's current position, i.e., the direction of arrival of the plane wave. By minimizing the squared error between the predicted pressure field and that actually recorded, we are able to infer the position of the speaker.

Index Terms— Microphone arrays, speech recognition, Kalman filters, spherical harmonics

1. INTRODUCTION

The state-of-the-art theory of beamforming with spherical microphone arrays explicitly takes into account two phenomena of sound propagation, namely, diffraction and scattering; see [1, §2] and [2, §6.10]. While these phenomena are present in all acoustic array processing applications, no particular attempt is typically made to incorporate them into conventional beamforming algorithms; rather, they are simply assumed to contribute to the room impulse response.

In prior work [3, 4, 5, 6], we investigated the application of a spherical microphone array, the 32-channel Eigenmike®, to a distant speech recognition task. In that work, the relative positions of a fixed loudspeaker and the spherical array required for beamforming were measured with an optical tracking device. In the present work we investigate how these relative positions can be determined automatically for real, human speakers based solely on acoustic evidence. For conventional microphone arrays, speaker tracking is typically performed by estimating time delays of arrival (TDOAs) between pairs of microphones using the phase transform [7] or adaptive eigenvalue decomposition [8]; the TDOAs can then be used as observations for a Kalman filter whose state corresponds to the speaker's position [9]. This approach works well for conventional arrays of modest dimensions because the signals arriving at any pair of microphones are, to a first approximation, time-shifted versions of one another, which is equivalent to a phase shift in the frequency or subband domain. As we will discover in Section 2, such an approach is not suitable for rigid spherical arrays, inasmuch as the acoustics of such arrays introduce more complicated transformations of the signals arriving at pairs of sensors [10].

Meyer and Elko [10], along with Abhayapala and Ward [11], were among the first authors to propose the use of spherical microphone arrays for beamforming. Initial work in source localization with spherical arrays used beamforming techniques to determine the three-dimensional power spectrum in a room and then applied peak search techniques to locate the dominant sources [10, 12]. Teutsch and Kellermann [13, 14] proposed to use eigenbeam ESPRIT to perform source localization with cylindrical and spherical arrays; their approach was extended in [15] and more recently in [16]. An approach to localization based on frequency smoothing was proposed by Khaykin and Rafaely [17].

In this work, we develop an algorithm for speaker tracking as opposed to simple localization; this implies we will incorporate both past and present acoustic observations into the estimate of the speaker's current position, as opposed to using merely the most recent observation. This is done to obtain a robust and smooth estimate of the speaker's time trajectory. To accomplish this objective, we first derive an expression for the complex pressure field of a plane wave scattering from a rigid spherical surface [2, §6.10.3]; this expansion is an infinite series of spherical harmonics appropriately weighted by the modal coefficients for scattering from a rigid sphere. We then use this theoretical field as the predicted observation in an extended Kalman filter whose state is the speaker's current position, which corresponds to the direction of arrival of the plane wave. By minimizing the squared error between the predicted pressure field and that actually recorded at the sensors of the array, we are able to infer the position of the speaker. The Kalman filter provides for robust position estimates in that past observations are efficiently combined with the most recent one during the recursive correction stage.

We applied the proposed tracking algorithm to speech data spoken by real, human speakers standing in front of a spherical microphone array. As the true speakers' positions are unknown, we evaluated the algorithm's effectiveness by performing beamforming using the estimated positions, then automatic speech recognition on the output of the beamformer. We found that our technique was able to reduce the final word error rate of the system from 50.9% using a single channel of the spherical microphone array to 45.7% using the beamformed array output for speech recognition.

The balance of this contribution is organized as follows. Section 2 reviews the derivation of an expression for the complex pressure field of a plane wave impinging on a rigid sphere; the final expression will involve an infinite series of spherical harmonics. Section 3 presents a speaker tracking system based on an extended Kalman filter that estimates the speaker's position by matching the actual, observed sound field impinging on a spherical array with that predicted by the theory of the preceding section. Empirical results are presented in Section 4 demonstrating the effectiveness


of the proposed algorithm; the position estimates obtained with the proposed algorithm are used for beamforming, and thereafter the enhanced speech signal from the beamformer is used for automatic recognition. In the final section, we briefly discuss the conclusions drawn from this work and our plans for future work.

2. ANALYSIS OF A PLANE WAVE IMPINGING ON A RIGID SPHERE

In this section, we develop a theoretical expression for the complex pressure field of a plane wave impinging on a rigid, spherical surface. We will also develop expressions for the partial derivatives of this field with respect to the direction of arrival $\Omega = (\theta, \phi)$, where $\theta$ and $\phi$ denote the polar angle and azimuth, respectively. Let us express a plane wave impinging with a polar angle of $\theta$ on an array of microphones as [2, §6.10.1]

$$G_{\mathrm{pw}}(kr, \theta, t) = e^{i(\omega t + kr\cos\theta)} = \sum_{n=0}^{\infty} i^n\,(2n+1)\, j_n(kr)\, P_n(\cos\theta)\, e^{i\omega t}, \qquad (1)$$

where $j_n$ and $P_n$ are, respectively, the spherical Bessel function of the first kind and the Legendre polynomial, both of order $n$, $k \triangleq 2\pi/\lambda$ is the wavenumber, and $i \triangleq \sqrt{-1}$. Fisher and Rafaely [18] provide a similar expansion for spherical waves, such as would be required for near-field analysis. If the plane wave encounters a rigid sphere with a radius of $a$, it is scattered [2, §6.10.3] to produce a wave with the pressure field

$$G_s(kr, ka, \theta, t) = -\sum_{n=0}^{\infty} i^n\,(2n+1)\,\frac{j_n'(ka)}{h_n'(ka)}\, h_n(kr)\, P_n(\cos\theta)\, e^{i\omega t}, \qquad (2)$$

where $h_n = h_n^{(1)}$ denotes the Hankel function of the first kind [19, §10.47], while the prime indicates the derivative of a function with respect to its argument. Combining (1) and (2) yields the total sound pressure field [2, §6.10.3]

$$G(kr, ka, \theta) = \sum_{n=0}^{\infty} i^n\,(2n+1)\, b_n(ka, kr)\, P_n(\cos\theta), \qquad (3)$$

where the nth modal coefficient is defined as

$$b_n(ka, kr) \triangleq j_n(kr) - \frac{j_n'(ka)}{h_n'(ka)}\, h_n(kr). \qquad (4)$$

Note that the time dependence of (3) through the term $e^{i\omega t}$ has been suppressed for convenience. Plots of $|b_n(ka, ka)|$ for $n = 0, \ldots, 8$ are shown in Figure 1.
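As a concrete illustration of (4), the following short Python sketch evaluates the modal coefficients with SciPy's spherical Bessel routines; the spherical Hankel function of the first kind is assembled from $j_n$ and $y_n$. The function names are our own, and the snippet is intended only as a numerical check, not as code from the paper.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def spherical_hn1(n, x, derivative=False):
    """Spherical Hankel function of the first kind, h_n = j_n + i*y_n."""
    return spherical_jn(n, x, derivative) + 1j * spherical_yn(n, x, derivative)

def modal_coefficient(n, ka, kr):
    """b_n(ka, kr) of Eq. (4): modal coefficient for scattering from a rigid sphere."""
    return (spherical_jn(n, kr)
            - spherical_jn(n, ka, derivative=True)
            / spherical_hn1(n, ka, derivative=True) * spherical_hn1(n, kr))

# Magnitudes |b_n(ka, ka)| in dB on the surface of the sphere (cf. Figure 1).
ka = 2.0
for n in range(9):
    print(n, 20.0 * np.log10(abs(modal_coefficient(n, ka, ka))))
```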

Let us now define the spherical harmonic of order $n$ and degree $m$ as [20]

$$Y_n^m(\theta, \phi) \triangleq \sqrt{\frac{(2n+1)}{4\pi}\,\frac{(n-m)!}{(n+m)!}}\; P_n^m(\cos\theta)\, e^{im\phi}, \qquad (5)$$

where $P_n^m$ is the associated Legendre function of order $n$ and degree $m$ [21, §14.3]. The spherical harmonics fulfill the same role in the decomposition of square-integrable functions defined on the surface of a sphere as that played by the complex exponentials $e^{i\omega_n t}$ in the decomposition of periodic functions defined on the real line.

Fig. 1. Magnitudes of the modal coefficients $b_n(ka, ka)$ in dB as a function of $ka$, for $n = 0, 1, \ldots, 8$, where $a$ is the radius of the sphere and $k$ is the wavenumber.

Fig. 2. The spherical harmonics Y0, Y1, Y2 and Y3.

Let $\gamma$ represent the angle between the points $(\theta, \phi)$ and $(\theta_s, \phi_s)$ lying on a sphere, such that

$$\cos\gamma = \cos\theta_s \cos\theta + \sin\theta_s \sin\theta \cos(\phi_s - \phi). \qquad (6)$$

Then the addition theorem for spherical harmonics [22, §12.8] can be expressed as

$$P_n(\cos\gamma) = \frac{4\pi}{2n+1} \sum_{m=-n}^{n} Y_n^m(\theta_s, \phi_s)\, \bar{Y}_n^m(\theta, \phi), \qquad (7)$$

where $\bar{Y}$ denotes the complex conjugate of $Y$. Upon substituting (7) into (3), we find

$$G(kr_s, \theta_s, \phi_s, ka, \theta, \phi) = 4\pi \sum_{n=0}^{\infty} i^n\, b_n(ka, kr_s) \sum_{m=-n}^{n} Y_n^m(\theta_s, \phi_s)\, \bar{Y}_n^m(\theta, \phi), \qquad (8)$$

where $(\theta, \phi)$ denotes the direction of arrival of the plane wave and $(r_s, \theta_s, \phi_s)$ denotes the position at which the sound field is measured. The spherical harmonics $Y_0 \triangleq Y_0^0$, $Y_1 \triangleq Y_1^0$, $Y_2 \triangleq Y_2^0$ and $Y_3 \triangleq Y_3^0$ are shown in Figure 2.
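To make (8) concrete, the sketch below evaluates the truncated spherical-harmonic sum at one sensor direction $(\theta_s, \phi_s)$ for a plane wave arriving from $(\theta, \phi)$, and cross-checks it against the Legendre-polynomial form (3) combined with (6), which amounts to a numerical test of the addition theorem (7). It is a hedged sketch with our own helper names; SciPy's `sph_harm(m, n, azimuth, polar)` is assumed for the spherical harmonics of (5).

```python
import numpy as np
from scipy.special import sph_harm, spherical_jn, spherical_yn, eval_legendre

def spherical_hn1(n, x, derivative=False):
    return spherical_jn(n, x, derivative) + 1j * spherical_yn(n, x, derivative)

def b_n(n, ka):
    """Eq. (4) with kr = ka, i.e., sensors on the surface of the sphere."""
    return (spherical_jn(n, ka)
            - spherical_jn(n, ka, derivative=True)
            / spherical_hn1(n, ka, derivative=True) * spherical_hn1(n, ka))

def G_field(ka, theta_s, phi_s, theta, phi, order=30):
    """Total field of Eq. (8), truncated at `order`:
    4*pi * sum_n i^n b_n(ka) sum_m Y_n^m(theta_s, phi_s) conj(Y_n^m(theta, phi))."""
    total = 0j
    for n in range(order + 1):
        m = np.arange(-n, n + 1)
        # SciPy's sph_harm takes (m, n, azimuth, polar).
        inner = np.sum(sph_harm(m, n, phi_s, theta_s) * np.conj(sph_harm(m, n, phi, theta)))
        total += (1j ** n) * b_n(n, ka) * inner
    return 4.0 * np.pi * total

# Cross-check against the Legendre form (3), with cos(gamma) taken from Eq. (6).
ka, theta_s, phi_s, theta, phi = 2.0, 1.0, 0.3, 0.7, 2.1
cos_g = np.cos(theta_s) * np.cos(theta) + np.sin(theta_s) * np.sin(theta) * np.cos(phi_s - phi)
G_ref = sum((1j ** n) * (2 * n + 1) * b_n(n, ka) * eval_legendre(n, cos_g) for n in range(31))
print(np.allclose(G_field(ka, theta_s, phi_s, theta, phi), G_ref))  # expect True
```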

In all that follows, we will assume that $a = r_s$, so that $ka$ and $kr_s$ need not be shown as separate arguments. Based on the definition (5), we can write

$$\frac{\partial \bar{Y}_n^m(\theta, \phi)}{\partial \theta} = -\sqrt{\frac{(2n+1)}{4\pi}\,\frac{(n-m)!}{(n+m)!}}\; \left.\frac{dP_n^m(x)}{dx}\right|_{x=\cos\theta}\, \sin\theta\; e^{-im\phi}, \qquad (9)$$

$$\frac{\partial \bar{Y}_n^m(\theta, \phi)}{\partial \phi} = -\,i m\, \bar{Y}_n^m(\theta, \phi). \qquad (10)$$

Page 3: SPEAKER TRACKING WITH SPHERICAL MICROPHONE ARRAYSmlsp.cs.cmu.edu/people/johnmcd/docs/mcdonough-icassp-2013.pdf · SPEAKER TRACKING WITH SPHERICAL MICROPHONE ARRAYS John McDonough

It remains to evaluate $dP_n^m(x)/dx$, which can be accomplished through the identity [21, §14.10]

$$(1 - x^2)\,\frac{dP_n^m(x)}{dx} = (m - n - 1)\, P_{n+1}^m(x) + (n+1)\, x\, P_n^m(x). \qquad (11)$$

These partial derivative expressions will be required for the linearization inherent in the extended Kalman filter (EKF).
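The identity (11) and the derivatives (9) and (10) translate directly into code. The sketch below (our own helper names, valid for $|x| < 1$) uses SciPy's `lpmv` for the associated Legendre function and `sph_harm` for $Y_n^m$; it is an illustrative implementation under SciPy's sign conventions, not the authors' code.

```python
import numpy as np
from scipy.special import lpmv, factorial, sph_harm

def dPmn_dx(m, n, x):
    """dP_n^m(x)/dx via the identity (11); valid for |x| < 1."""
    return ((m - n - 1) * lpmv(m, n + 1, x) + (n + 1) * x * lpmv(m, n, x)) / (1.0 - x**2)

def norm_const(m, n):
    """Normalization constant of Eq. (5)."""
    return np.sqrt((2 * n + 1) / (4 * np.pi) * factorial(n - m) / factorial(n + m))

def dYbar_dtheta(m, n, theta, phi):
    """Eq. (9): derivative of the conjugated spherical harmonic w.r.t. the polar angle."""
    return (-norm_const(m, n) * dPmn_dx(m, n, np.cos(theta))
            * np.sin(theta) * np.exp(-1j * m * phi))

def dYbar_dphi(m, n, theta, phi):
    """Eq. (10): derivative of the conjugated spherical harmonic w.r.t. the azimuth."""
    # lpmv and sph_harm share SciPy's Condon-Shortley convention, so (9) and (10)
    # remain mutually consistent here.
    return -1j * m * np.conj(sph_harm(m, n, phi, theta))
```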

3. SPEAKER TRACKING SYSTEM

Here we use the development of the preceding section to formulate a complete tracking system based on the EKF. Let $\mathbf{y}_{k,l}$ denote a vector of stacked sensor outputs for the $k$th time step and the $l$th subband. Similarly, let $\mathbf{g}_{k,l}(\theta, \phi)$ denote the model of the stacked sensor outputs

$$\mathbf{g}_{k,l}(\theta, \phi) \triangleq \begin{bmatrix} G(ka, \theta_0, \phi_0, ka, \theta, \phi) \\ G(ka, \theta_1, \phi_1, ka, \theta, \phi) \\ \vdots \\ G(ka, \theta_{S-1}, \phi_{S-1}, ka, \theta, \phi) \end{bmatrix}, \qquad (12)$$

where $G(ka, \theta_s, \phi_s, ka, \theta, \phi)$ is given by (8). The linearization required to apply the EKF can then be expressed as

$$\frac{\partial G}{\partial \theta} = 4\pi \sum_{n=0}^{\infty} i^n\, b_n(ka) \sum_{m=-n}^{n} Y_n^m(\theta_s, \phi_s)\, \frac{\partial \bar{Y}_n^m(\theta, \phi)}{\partial \theta},$$

$$\frac{\partial G}{\partial \phi} = -4\pi \sum_{n=0}^{\infty} i^{n+1}\, b_n(ka) \sum_{m=-n}^{n} m\, Y_n^m(\theta_s, \phi_s)\, \bar{Y}_n^m(\theta, \phi).$$

The predicted observation inherent in the covariance form of the (extended) Kalman filter can then be formed from several components:

1. The individual sensor outputs given in (8); these are stacked as in (12).

2. A complex, time-varying, frequency-dependent scale factor $B_{k,l}$, which is intended to model the unknown magnitude and phase variation of the subband components.

3. A complex exponential $e^{i\omega_l D k}$, where $\omega_l$ is the center frequency of the $l$th subband and $D$ is the decimation factor of the filter bank.

Given these definitions, the squared-error metric at time step $k$ can be expressed as

$$\epsilon(\theta, \phi, k) \triangleq \sum_{l=0}^{L-1} \left\| \mathbf{y}_{k,l} - \mathbf{g}_{k,l}(\theta, \phi)\, B_{k,l}\, e^{i\omega_l D k} \right\|^2, \qquad (13)$$

where $\mathbf{y}_{k,l}$ denotes the subband sensor outputs from a spherical array. Now note that if $B_{k,l}$ were known and $(\theta, \phi)$ were treated as the state of a state-space system, then this time-varying state could be estimated with an extended Kalman filter; obviously the necessity of using an extended Kalman filter follows from the nonlinearities in $\theta$ and $\phi$ evident in (5). It is readily shown that the maximum likelihood estimate of $B_{k,l}$ in (13) is given by

$$B_{k,l} = \frac{\mathbf{g}_{k,l}^H(\theta, \phi)\, \mathbf{y}_{k,l}}{\left\| \mathbf{g}_{k,l}(\theta, \phi) \right\|^2}\; e^{-i\omega_l D k}. \qquad (14)$$

Note that if $B_{k,l}$ from (14) is substituted into (13), the term $e^{i\omega_l D k}$ cancels out of the latter. Hence, these exponential terms can just as well be omitted from both (13) and (14).
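In code, the closed-form estimate (14), with the exponential term omitted as just noted, is a one-liner; `y` and `g` below stand for the stacked subband snapshot and model vector of a single subband (our own names).

```python
import numpy as np

def estimate_scale_factor(y, g):
    """ML estimate of B_{k,l} per Eq. (14), exponential term omitted."""
    return np.vdot(g, y) / np.vdot(g, g).real   # g^H y / ||g||^2
```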

Given the simplicity of (14), we might plausibly modify the standard extended Kalman filter as follows (see the sketch after the list):

1. Estimate the scale factors $B_{k,l}$ as in (14).

2. Use this estimate to update the state estimate $(\theta_k, \phi_k)$ of the Kalman filter.

3. (Possibly) perform an iterative update for each time step, as in the iterated extended Kalman filter (IEKF) [23, §4.3.3], by repeating Steps 1 and 2.
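A skeleton of that modified update is sketched below in Python. The helpers `predicted_model` (returning the stacked model vector of (12) together with its Jacobian with respect to $(\theta, \phi)$) and `ekf_correct` (the correction of Eqs. (17)-(23)) are hypothetical placeholders, so this is a structural sketch under those assumptions rather than the authors' implementation.

```python
import numpy as np

def track_step(x_pred, K_pred, Y, V, predicted_model, ekf_correct, num_iterations=2):
    """One (iterated) EKF time step for the spherical-array tracker.

    x_pred          : predicted state [theta, phi]                          (2,)
    K_pred          : predicted state estimation error covariance           (2, 2)
    Y               : list of per-subband snapshot vectors y_{k,l}, each    (S,) complex
    V               : noise covariance of the stacked observation
    predicted_model : callable (theta, phi, l) -> (g, J), the model vector of
                      Eq. (12) and its Jacobian with respect to (theta, phi)
    ekf_correct     : callable implementing the correction of Eqs. (17)-(23)
    """
    x, K = np.array(x_pred, dtype=float), np.array(K_pred, dtype=float)
    for _ in range(num_iterations):            # Step 3: IEKF-style re-linearization
        obs, pred, jac = [], [], []
        for l, y in enumerate(Y):
            g, J = predicted_model(x[0], x[1], l)
            B = np.vdot(g, y) / np.vdot(g, g).real     # Step 1: Eq. (14)
            obs.append(y)
            pred.append(B * g)                 # predicted observation for subband l
            jac.append(B * J)                  # linearized observation functional
        # Step 2: correct the state using all subbands stacked into one vector;
        # ekf_correct is assumed to handle the complex observations (e.g., by
        # stacking real and imaginary parts internally).
        x, K = ekf_correct(x_pred, K_pred, np.concatenate(obs),
                           np.concatenate(pred), np.vstack(jac), V)
    return x, K
```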

We now briefly summarize the operation of the EKF. Let us state the state and observation equations, respectively, as

$$\mathbf{x}_k = \mathbf{x}_{k-1} + \mathbf{u}_{k-1}, \qquad (15)$$
$$\mathbf{y}_k = \mathbf{H}_k(\mathbf{x}_k) + \mathbf{v}_k, \qquad (16)$$

where $\mathbf{H}_k$ is the known, nonlinear observation functional. The noise terms $\mathbf{u}_k$ and $\mathbf{v}_k$ in (15)–(16) are by assumption zero-mean, white Gaussian random vector processes with covariance matrices

$$\mathbf{U}_k = E\{\mathbf{u}_k \mathbf{u}_k^H\}, \qquad \mathbf{V}_k = E\{\mathbf{v}_k \mathbf{v}_k^H\},$$

respectively. Moreover, by assumption $\mathbf{u}_k$ and $\mathbf{v}_k$ are statistically independent. Let $\mathbf{y}_{1:k-1}$ denote all past observations up to time $k-1$, and let $\hat{\mathbf{y}}_{k|k-1}$ denote the minimum mean square error estimate of the next observation $\mathbf{y}_k$ given all prior observations, such that

$$\hat{\mathbf{y}}_{k|k-1} = E\{\mathbf{y}_k \mid \mathbf{y}_{1:k-1}\}.$$

By definition, the innovation is the difference $\mathbf{s}_k \triangleq \mathbf{y}_k - \hat{\mathbf{y}}_{k|k-1}$ between the actual and the predicted observations. This quantity is given the name innovation because it contains all the "new information" required for sequentially updating the filtering density $p(\mathbf{x}_{0:k} \mid \mathbf{y}_{1:k-1})$ [23, §4]; i.e., the innovation contains that information about the time evolution of the system that cannot be predicted from the state-space model.

We will now present the principal quantities and relations in the operation of the EKF; the details can be found in Haykin [24, §10], for example. Let us begin by stating how the predicted observation may be calculated based on the current state estimate, according to $\hat{\mathbf{y}}_{k|k-1} = \mathbf{H}_k(\hat{\mathbf{x}}_{k|k-1})$. Hence, we may write $\mathbf{s}_k = \mathbf{y}_k - \mathbf{H}_k(\hat{\mathbf{x}}_{k|k-1})$, which implies

$$\mathbf{s}_k = \mathbf{H}_k(\hat{\mathbf{x}}_{k|k-1})\, \boldsymbol{\epsilon}_{k|k-1} + \mathbf{v}_k, \qquad (17)$$

where $\boldsymbol{\epsilon}_{k|k-1} = \mathbf{x}_k - \hat{\mathbf{x}}_{k|k-1}$ is the predicted state estimation error at time $k$ using all data up to time $k-1$, and $\mathbf{H}_k(\hat{\mathbf{x}}_{k|k-1})$ is the linearization of $\mathbf{H}_k(\mathbf{x})$ about $\mathbf{x} = \hat{\mathbf{x}}_{k|k-1}$. It can readily be shown that $\boldsymbol{\epsilon}_{k|k-1}$ is orthogonal to $\mathbf{u}_k$ and $\mathbf{v}_k$ [24, §10.1]. Using (17) and exploiting the statistical independence of $\mathbf{u}_k$ and $\mathbf{v}_k$, the covariance matrix of the innovations sequence can be expressed as

$$\mathbf{S}_k \triangleq E\{\mathbf{s}_k \mathbf{s}_k^H\} = \mathbf{H}_k(\hat{\mathbf{x}}_{k|k-1})\, \mathbf{K}_{k|k-1}\, \mathbf{H}_k^H(\hat{\mathbf{x}}_{k|k-1}) + \mathbf{V}_k, \qquad (18)$$

where the predicted state estimation error covariance matrix is defined as

$$\mathbf{K}_{k|k-1} \triangleq E\{\boldsymbol{\epsilon}_{k|k-1}\, \boldsymbol{\epsilon}_{k|k-1}^H\}. \qquad (19)$$

The Kalman gain $\mathbf{G}_k$ can be calculated as

$$\mathbf{G}_k = \mathbf{K}_{k|k-1}\, \mathbf{H}_k^H(\hat{\mathbf{x}}_{k|k-1})\, \mathbf{S}_k^{-1}, \qquad (20)$$

where the covariance matrix $\mathbf{S}_k$ of the innovations sequence is defined in (18). The Riccati equation then specifies how $\mathbf{K}_{k|k-1}$ can be sequentially updated, namely as

$$\mathbf{K}_{k|k-1} = \mathbf{F}_{k|k-1}\, \mathbf{K}_{k-1}\, \mathbf{F}_{k|k-1}^H + \mathbf{U}_{k-1}. \qquad (21)$$


Fig. 3. Sensor configuration for data capture with the Eigenmike: the spherical array, two Kinect sensors, two camcorders, and a desk in a 9.3 m × 7.2 m room with a height of 2.6 m and a reverberation time of approximately 550 ms; the speaker locations are marked at increasing distances from the array.

The matrix $\mathbf{K}_k$ in (21) is, in turn, obtained through the recursion

$$\mathbf{K}_k = \mathbf{K}_{k|k-1} - \mathbf{G}_k \mathbf{H}_k \mathbf{K}_{k|k-1} = (\mathbf{I} - \mathbf{G}_k \mathbf{H}_k)\, \mathbf{K}_{k|k-1}. \qquad (22)$$

This matrix $\mathbf{K}_k$ can be interpreted as the covariance matrix of the filtered state estimation error [24, §10], such that

$$\mathbf{K}_k \triangleq E\{\boldsymbol{\epsilon}_k\, \boldsymbol{\epsilon}_k^H\},$$

where $\boldsymbol{\epsilon}_k \triangleq \mathbf{x}_k - \hat{\mathbf{x}}_{k|k}$. Finally, the new, filtered state estimate is calculated according to

$$\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + \mathbf{G}_k\, \mathbf{s}_k. \qquad (23)$$
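Taken together, Eqs. (18)-(23) reduce to a few lines of linear algebra; for the random-walk state model (15), the transition matrix in (21) is simply the identity. The sketch below, in our own notation and not taken from the paper, is one plausible realization of the `ekf_correct` helper assumed in the earlier tracking sketch.

```python
import numpy as np

def ekf_correct(x_pred, K_pred, y, y_pred, H, V):
    """EKF correction per Eqs. (17)-(23).

    With a real-valued state (theta, phi) and complex subband observations,
    y, y_pred, H, and V are most cleanly passed as real/imaginary-stacked
    quantities; the algebra below is identical in either case.
    """
    s = y - y_pred                                         # innovation, Eq. (17)
    S = H @ K_pred @ H.conj().T + V                        # innovation covariance, Eq. (18)
    G = K_pred @ H.conj().T @ np.linalg.inv(S)             # Kalman gain, Eq. (20)
    x = x_pred + np.real(G @ s)                            # filtered state, Eq. (23)
    K = (np.eye(len(x_pred)) - np.real(G @ H)) @ K_pred    # filtered covariance, Eq. (22)
    return x, K

def ekf_predict(x, K, U):
    """Prediction for the random-walk model (15): F_{k|k-1} = I, so Eq. (21) gives K + U."""
    return x.copy(), K + U
```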

4. EXPERIMENTAL RESULTS

Figure 3 shows the sensor configuration used for our data capture. In the recording sessions, eleven human subjects were asked to read 25 sentences from the Wall Street Journal corpus at each of the marked positions in order to investigate the sensitivity of recognition performance to the distance between speaker and array; as shown in the figure, the positions where the speaker was to stand were marked on the floor at 1 m, 2 m, and 4 m from the array, measured parallel to the floor. The test data consisted of 6,948 words in total. The reverberation time $T_{60}$ of the room was approximately 550 ms. No noise was artificially added to the captured data, as natural noise from air conditioners, computers, and other speakers was already present. The data was sampled at a rate of 44.1 kHz with a depth of three bytes per sample by the Eigenmike® hardware. Subband analysis was performed with the filter bank described in [23, §11] with M = 512 subbands.

The inter-sensor noise covariance matrix $\mathbf{V}_k$ for each subband required in (18) was estimated by analyzing segments of each session wherein the speaker was inactive with the filter bank [23, §11], summing the outer products of the subband snapshots, and then scaling by the total number of frames analyzed. For tracking speakers at all distances from the array, the EKF was initialized to approximately the middle of the room. After the speakers' positions were obtained with the speaker tracking system [9], beamforming was performed; the beamformed signal was further enhanced with Zelinski post-filtering [25, 26]. We then ran the multi-pass speech recognizer described in [27] on the enhanced speech data.
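A minimal sketch of that covariance estimate, assuming `snapshots` holds the subband snapshot vectors collected while the speaker is inactive (our own variable names), is:

```python
import numpy as np

def estimate_noise_covariance(snapshots):
    """Average outer product of speech-free subband snapshots for one subband.

    snapshots : complex array of shape (num_frames, num_sensors).
    Returns the (num_sensors, num_sensors) inter-sensor noise covariance V.
    """
    X = np.asarray(snapshots)
    return X.T @ X.conj() / X.shape[0]   # (1/K) * sum_k y_k y_k^H
```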

Algorithm   Distance   Pass 1   Pass 2   Pass 3   Pass 4
SAC         1 m        75.6     43.6     31.6     28.8
            2 m        84.7     61.5     44.5     39.2
            4 m        89.4     72.5     57.1     50.9
SH SD BF    1 m        76.5     47.8     36.4     33.2
            2 m        83.9     58.7     44.3     39.9
            4 m        86.8     66.4     51.1     45.7
CTM         Avg.       31.7     20.9     16.4     15.6

Table 1. Word error rates (%WER) as a function of the distance between the speakers and the Eigenmike.

Table 1 shows word error rates (WERs) obtained with each beamforming algorithm as a function of distance between the speaker and the Eigenmike. As a reference, the word error rates obtained with the single array channel (SAC) and close-talking microphone (CTM) are also shown.

Results are given in Table 1 for the super-directive beamformer. Beamforming was performed in the spherical harmonics domain using the inter-harmonic covariance matrix derived by Yan et al. [28] for diffuse noise (SH SD BF). The results reported in the table indicate that beamforming was ineffective at 1 m and 2 m, but provided a significant reduction in WER at 4 m. We attribute this to the far-field assumption used in deriving (3); this assumption is largely valid at a distance of 4 m, but does not hold at the smaller distances. In the future, we plan to investigate the near-field pressure field derived by Fisher and Rafaely [18] for speaker tracking and beamforming.

In the large-vocabulary continuous speech recognition task, the distant speech recognizer with beamforming still lags behind the close-talking microphone. However, the recognition performance may still be acceptable in applications that do not require recognizing every word precisely, such as dialogue systems.

5. CONCLUSIONS

Our results demonstrated that the combination of speaker tracking and beamforming enhanced the signal sufficiently to produce a significant reduction in the error rate of a distant speech recognition system when the speaker was located 4 m from the spherical array. For distances of 1 m and 2 m, however, a degradation in system performance was observed after tracking and beamforming. We attribute these contradictory results to the plane wave assumption we used in formulating our algorithm; such an assumption is valid for the greater distance, but not for the smaller ones. In future work, we will investigate the use of a near-field assumption for tracking and beamforming as in [18]. We also plan to compare our proposed method to other algorithms extant in the literature [12, 16, 17].


6. REFERENCES

[1] Heinrich Kuttruff, Room Acoustics, Spon Press, New York, NY, fifth edition, 2009.

[2] Earl G. Williams, Fourier Acoustics, Academic Press, San Diego, CA, USA, 1999.

[3] Kenichi Kumatani, John McDonough, and Bhiksha Raj, "Microphone array processing for distant speech recognition: From close-talking microphones to far-field sensors," IEEE Signal Processing Magazine, pp. 127–140, November 2012.

[4] Kenichi Kumatani, Takayuki Arakawa, Kazumasa Yamamoto, John McDonough, Bhiksha Raj, Rita Singh, and Ivan Tashev, "Microphone array processing for distant speech recognition: Towards real-world deployment," in Proc. APSIPA Conference, Hollywood, CA, December 2012.

[5] John McDonough and Kenichi Kumatani, "Microphone arrays," in Techniques for Noise Robustness in Automatic Speech Recognition, Tuomas Virtanen, Rita Singh, and Bhiksha Raj, Eds. Wiley, New York, NY, 2012.

[6] John McDonough, Kenichi Kumatani, and Bhiksha Raj, "Microphone arrays for distant speech recognition: Spherical arrays," in Proc. APSIPA Conference, Hollywood, CA, December 2012.

[7] G. C. Carter, "Time delay estimation for passive sonar signal processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 463–469, 1981.

[8] J. Benesty, "Adaptive eigenvalue decomposition algorithm for passive acoustic source localization," J. Acoust. Soc. Am., vol. 107, no. 1, pp. 384–391, January 2000.

[9] U. Klee, G. Gehrig, and J. W. McDonough, "Kalman filters for time delay of arrival–based source localization," in Proc. Eurospeech, 2005.

[10] Jens Meyer and Gary W. Elko, "A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield," in Proc. ICASSP, Orlando, FL, May 2002.

[11] Thushara D. Abhayapala and Darren B. Ward, "Theory and design of high order sound field microphones using spherical microphone array," in Proc. ICASSP, Orlando, FL, May 2002.

[12] B. Rafaely, Y. Peled, M. Agmon, D. Khaykin, and E. Fisher, "Spherical microphone array beamforming," in Speech Processing in Modern Communication: Challenges and Perspectives, I. Cohen, J. Benesty, and S. Gannot, Eds. Springer, Berlin, 2010.

[13] Heinz Teutsch and Walter Kellermann, "Acoustic source detection and localization based on wavefield decomposition using circular microphone arrays," J. Acoust. Soc. Am., vol. 120, no. 5, pp. 2724–2736, 2006.

[14] Heinz Teutsch and Walter Kellermann, "Detection and localization of multiple wideband acoustic sources based on wavefield decomposition using spherical apertures," in Proc. ICASSP, Las Vegas, NV, USA, March 2008.

[15] Heinz Teutsch, Modal Array Signal Processing: Principles and Applications of Acoustic Wavefield Decomposition, Springer, Heidelberg, 2007.

[16] Haohai Sun, Heinz Teutsch, Edwin Mabande, and Walter Kellermann, "Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays," in Proc. ICASSP, Prague, Czech Republic, May 2011.

[17] D. Khaykin and B. Rafaely, "Coherent signals direction-of-arrival estimation using a spherical microphone array: Frequency smoothing approach," in Proc. WASPAA, New Paltz, NY, USA, October 2009.

[18] Etan Fisher and Boaz Rafaely, "Near-field spherical microphone array processing with radial filtering," IEEE Transactions on Audio, Speech and Language Processing, pp. 256–265, November 2011.

[19] Frank W. J. Olver and L. C. Maximon, "Bessel functions," in NIST Handbook of Mathematical Functions, Frank W. J. Olver, Daniel W. Lozier, Ronald F. Boisvert, and Charles W. Clark, Eds. Cambridge University Press, New York, NY, 2010.

[20] James R. Driscoll and Dennis M. Healy, Jr., "Computing Fourier transforms and convolutions on the 2-sphere," Advances in Applied Mathematics, vol. 15, pp. 202–250, 1994.

[21] T. M. Dunster, "Legendre and related functions," in NIST Handbook of Mathematical Functions, Frank W. J. Olver, Daniel W. Lozier, Ronald F. Boisvert, and Charles W. Clark, Eds. Cambridge University Press, New York, NY, 2010.

[22] George B. Arfken and Hans J. Weber, Mathematical Methods for Physicists, Elsevier, Boston, sixth edition, 2005.

[23] Matthias Wölfel and John McDonough, Distant Speech Recognition, Wiley, London, 2009.

[24] S. Haykin, Adaptive Filter Theory, Prentice Hall, New York, fourth edition, 2002.

[25] R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms," in Proc. ICASSP, New York, NY, USA, April 1988.

[26] K. Uwe Simmer, Joerg Bitzer, and Claude Marro, "Post-filtering techniques," in Microphone Arrays, M. Brandstein and D. Ward, Eds., pp. 39–60. Springer, Heidelberg, 2001.

[27] Kenichi Kumatani, John McDonough, Dietrich Klakow, Philip N. Garner, and Weifeng Li, "Adaptive beamforming with a maximum negentropy criterion," IEEE Trans. ASLP, August 2008.

[28] Shefeng Yan, Haohai Sun, U. Peter Svensson, Xiaochuan Ma, and J. M. Hovem, "Optimal modal beamforming for spherical microphone arrays," IEEE Trans. Audio, Speech and Language Processing, vol. 19, no. 2, pp. 361–371, 2011.