Speech Communication 83 (2016) 42–53
Near-field signal acquisition for smartglasses using two acoustic
vector-sensors
Dovid Y. Levin a,∗, Emanuël A.P. Habets b,1, Sharon Gannot a
a Bar-Ilan University, Faculty of Engineering, Building 1103, Ramat-Gan, 5290002, Israel
b International Audio Laboratories Erlangen, Am Wolfsmantel 33, 91058 Erlangen, Germany
Article info
Article history:
Received 21 February 2016
Accepted 12 July 2016
Available online 18 July 2016
PACS:
43.60.Fg
43.60.Mn
43.60.Hj
Keywords:
Beamforming
Acoustic vector-sensors
Smartglasses
Adaptive signal processing
Abstract
Smartglasses, in addition to their visual-output capabilities, often contain acoustic sensors for receiving
the user’s voice. However, operation in noisy environments may lead to significant degradation of the
received signal. To address this issue, we propose employing an acoustic sensor array which is mounted
on the eyeglasses frames. The signals from the array are processed by an algorithm with the purpose
of acquiring the desired near-field speech signal produced by the wearer while suppressing noise signals
originating from the environment. The array is comprised of two acoustic vector-sensors (AVSs) which are
located at the fore of the glasses’ temples. Each AVS consists of four collocated subsensors: one pressure
sensor (with an omnidirectional response) and three particle-velocity sensors (with dipole responses) ori-
ented in mutually orthogonal directions. The array configuration is designed to boost the input power of
the desired signal, and to ensure that the characteristics of the noise at the different channels are suffi-
ciently diverse (allowing for more effective noise suppression). Since changes in the array’s position
correspond to the desired speaker’s movement, the relative source-receiver position remains unchanged;
hence, the need to track fluctuations of the steering vector is avoided. Conversely, the spatial statistics of
the noise are subject to rapid and abrupt changes due to sudden movement and rotation of the user’s
head. Consequently, the algorithm must be capable of rapid adaptation toward such changes. We propose
an algorithm which incorporates detection of the desired speech in the time-frequency domain, and em-
ploys this information to adaptively update estimates of the noise statistics. The speech detection plays a
key role in ensuring the quality of the output signal. We conduct controlled measurements of the array
in noisy scenarios. The proposed algorithm performs favorably with respect to conventional algorithms.
Fig. 8. STOI levels resulting from processing with five different algorithms for varying SNR levels in two scenarios. (Note that for the static scenario (a), the fixed-MVDR and the proposed algorithm are nearly identical.)
1000 dB is also examined (appearing at the right edge of the horizontal axis). This exceptionally high SNR is useful for checking robustness in extreme cases.
The results shown in the figures indicate that although both
MPDR based algorithms perform reasonably well for low SNRs,
there is a rapid degradation in performance as SNR increases.
This can be explained by the contamination of the estimated
covariance-matrix by desired speech, which is inherent in these
methods. For very low SNRs the contamination is negligible, but
at higher SNRs the contamination becomes significant. Due to this
issue, the MPDR based algorithms cannot be regarded as viable.
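The contamination effect described above can be illustrated numerically. The following sketch is not the paper's implementation: it uses an arbitrary white-noise signal model, a small assumed steering-vector mismatch, and minimum-variance weights computed once from a total (MPDR-style, speech-contaminated) and once from a noise-only (MVDR-style) sample covariance. At low input SNR the two behave similarly; at high SNR the contaminated covariance causes severe desired-signal cancellation.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): minimum-variance
# beamforming with a noise-only covariance (MVDR-style) versus a covariance
# contaminated by the desired signal (MPDR-style), under a small assumed
# steering-vector mismatch. All parameter values are arbitrary.
rng = np.random.default_rng(0)
M, T = 4, 2000                                   # channels, snapshots

d_true = np.ones(M) / np.sqrt(M)                 # actual steering vector
d_used = d_true + 0.05 * rng.standard_normal(M)  # slightly mismatched model

def mv_weights(R, d):
    """w = R^{-1} d / (d^T R^{-1} d): distortionless response toward d."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d @ Rinv_d)

def output_sinr_db(w, sigma_s):
    """Output SINR for spatially white, unit-variance sensor noise."""
    sig = (w @ d_true) ** 2 * sigma_s ** 2
    return 10 * np.log10(sig / (w @ w))

sinr_db = {}
for snr in (-10, 10, 30):
    sigma_s = 10.0 ** (snr / 20.0)
    s = sigma_s * rng.standard_normal(T)          # desired signal
    n = rng.standard_normal((M, T))               # sensor noise
    x = np.outer(d_true, s) + n                   # received mixture
    R_mpdr = x @ x.T / T                          # contaminated by desired speech
    R_mvdr = n @ n.T / T                          # noise-only estimate
    sinr_db[("MPDR", snr)] = output_sinr_db(mv_weights(R_mpdr, d_used), sigma_s)
    sinr_db[("MVDR", snr)] = output_sinr_db(mv_weights(R_mvdr, d_used), sigma_s)
    print(snr, round(sinr_db[("MPDR", snr)], 1), round(sinr_db[("MVDR", snr)], 1))
```

The gap between the two widens dramatically with input SNR, mirroring the rapid degradation reported for the MPDR-based algorithms.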
With respect to distortion, the other algorithms (i.e., fixed-
MVDR and proposed) score fairly well with levels between −20
dB and −18 dB. However, they differ with regards to noise reduc-
tion. For the static scenario, the fixed-MVDR attains a noise reduc-
tion of 21.8 dB. The proposed algorithm does slightly better at low
SNRs ( −10 dB and lower). At an SNR of −5 dB, the fixed-MVDR is
slightly better and as the SNR increases the proposed algorithm’s
noise reduction drops by several decibels (reaching 17.6 dB for an SNR
of 10 dB). This is not decidedly troublesome since at high SNRs the
issue of noise reduction is of lesser consequence.
For the case of moving interference, the proposed algorithm
significantly outperforms fixed-MVDR. The fixed-MVDR algorithm
reduces noise by 16.1 dB, whereas the proposed algorithm yields
a reduction of 29.3 dB at −20 dB. As the SNR increases, the noise re-
duction gradually decreases but typically remains higher than the
fixed-MVDR. For example, at SNRs of −10, 0, and 10 dB the re-
spective noise reductions are 26.2, 22, and 18.5 dB. Due to the
changing nature of the interference, the initial covariance estimate
of the fixed-MVDR algorithm is deficient. In contrast, the pro-
posed algorithm constantly adapts and consequently manages to
effectively reduce noise. We note that the proposed algorithm is
more successful in this dynamic case than in the case of 3 static in-
terferers. This can be explained by the increased challenge of sup-
pressing multiple sources.
The proposed algorithm significantly outperforms the fixed-
MVDR algorithm in the scenario of a moving interferer with re-
spect to STOI. Interestingly, in the static scenario the two algo-
rithms have virtually indistinguishable STOI scores. This is despite
the fact that there are differences in their noise reduction.
We now discuss the performance of the two algorithms tested with a reduced array (RA). These are labeled ‘oracle adaptation (RA)’ and ‘proposed algorithm (RA)’ in Figs. 6, 7, and 8. The noise reduction attainable with the reduced array with the proposed algorithm is roughly 6 dB, which is close to the limit set by oracle adaptation with a reduced array. Full use of all channels from the AVSs provided an improvement of approximately 15 to 25 dB. The performance with respect to STOI with a reduced array is only slightly better than the unprocessed signal. The distortion levels of the reduced array are in the vicinity of −30 dB, which is an improvement over the full array with distortion of approximately −18 to −20 dB. This improvement is apparently due to the fact that the unprocessed signal was defined as the average of the two omnidirectional channels used by the reduced array. In any case, the full array performs satisfactorily in terms of distortion and this improvement is of negligible significance; utilization of all channels does provide significant improvements with respect to noise reduction and STOI.
5.4. Threshold parameter sensitivity

In this subsection, we examine the impact of the threshold parameter η on the performance of the proposed algorithm. If η is set too low, then too many bins are mistakenly labeled as containing desired speech. This may lead to poor noise estimation since fewer bins are used in the estimation process. Conversely, if η is set too high, bins which do contain desired speech will not be detected as such, which may lead to contamination of the noise estimation (as seen in the MPDR based algorithms). Presumably, a certain region of values in between these extremes will yield desirable results with respect to these conflicting goals.
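The role the threshold plays can be sketched generically. The detection statistic below (normalized power along the steering vector) and the forgetting factor lam are illustrative assumptions, not the paper's exact quantities; η acts as the decision threshold, and the noise covariance is updated only in bins labeled as noise.

```python
import numpy as np

# Generic sketch of threshold-based speech detection driving a recursive
# noise-covariance update. The statistic and forgetting factor are assumptions.
rng = np.random.default_rng(1)
M = 4
d = np.ones(M, dtype=complex) / np.sqrt(M)   # fixed steering vector (glasses-mounted array)
eta = 0.9                                    # speech-detection threshold
lam = 0.95                                   # forgetting factor (assumed)
R_noise = np.eye(M, dtype=complex)           # running noise-covariance estimate

def process_bin(x):
    """Label a TF bin; update the noise covariance only for noise-labeled bins."""
    global R_noise
    stat = np.abs(d.conj() @ x) ** 2 / (x.conj() @ x).real  # in [0, 1]
    is_speech = stat > eta
    if not is_speech:
        R_noise = lam * R_noise + (1 - lam) * np.outer(x, x.conj())
    return is_speech

# Noise-only bins rarely align with d; a strong bin along d is flagged as speech.
noise_flags = sum(
    process_bin(rng.standard_normal(M) + 1j * rng.standard_normal(M))
    for _ in range(200)
)
speech_flag = process_bin(5.0 * d)           # bin dominated by the desired source
```

Lowering η causes more noise bins to be (wrongly) flagged as speech and excluded from the update; raising it lets speech-dominated bins leak into R_noise, reproducing the trade-off described above.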
We repeatedly executed the algorithm with η taking on different values (the other parameters in Table 1 remain unchanged). This was done for SNRs ranging from −20 dB to 10 dB. The noise reduction results are plotted in Fig. 9, the distortion results in Fig. 10, and the STOI measure in Fig. 11. The STOI measures peak in the vicinity of η = 0.9 and the curve is fairly flat, indicating robustness. Similarly, η = 0.9 is a fairly good choice with respect to noise reduction, although Fig. 9 indicates that a slight increase in η is beneficial for low SNRs and, conversely, a slight decrease in η is beneficial for high SNRs.

The distortion levels are somewhat better in the vicinity of η = 0.6. However, since the distortion is minor, this slight improvement does not justify the accompanying degradation in noise reduction and STOI, which is of notable magnitude.
5.5. Post-processing results

In this subsection, we examine the effects of post-processing. Three parameters influence the post-processing: β, SNR_min, and W_min.
Fig. 9. Noise reduction [dB] attained with the proposed algorithm using different values for the speech detection threshold (η) for a number of SNR levels (−20 dB to 10 dB): (a) 3 static interferers; (b) 1 moving interferer.
Fig. 10. Distortion levels [dB] resulting from applying the proposed algorithm using different values for the speech detection threshold (η) for a number of SNR levels: (a) 3 static interferers; (b) 1 moving interferer.
Fig. 11. STOI levels attained with the proposed algorithm using different values for the speech detection threshold (η) for a number of SNR levels: (a) 3 static interferers; (b) 1 moving interferer.
Fig. 12. Effects of post-processing on (a) noise reduction, (b) distortion, and (c) STOI, for SNR levels from −20 dB to 1000 dB, comparing ‘no post-processing’, ‘post 1’, and ‘post 2’.
Table 2
Parameter values in post-processing.

Parameter   post 1    post 2
β           0.9       0.9
SNR_min     −10 dB    −24 dB
W_min       −8 dB     −20 dB
Setting the latter two parameters at lower values corresponds to a more aggressive suppression of noise, whereas higher values correspond to a more conservative approach regarding signal distortion. It should be noted that SNR_min describes a ratio of powers whereas W_min describes a filter’s amplitude; consequently, the former is converted to decibel units via 10 log10(·), and the latter by 20 log10(·). To illustrate this trade-off, we test the two sets of parameters whose values are given in Table 2. These two sets
are referred to as ‘post 1’ and ‘post 2’, respectively. In general, post-processing parameters are determined empirically; the designer tests which values yield results which are satisfactory for a particular application.

Fig. 12 portrays the effects of post-processing (using the three-speaker scenario as a test case) on the performance. Post-processing reduces noise but increases distortion and adversely affects intelligibility as measured by STOI (this degradation is very minor for ‘post 1’ and more prominent for ‘post 2’). The parameters of ‘post 1’ are more conservative and the parameters of ‘post 2’ are more aggressive with respect to noise reduction. The former do not reduce as much noise, but introduce less distortion and only a minor degradation of the STOI score. The latter reduce more noise at the expense of greater distortion and lower STOI. With the latter, audio artifacts have a stronger presence than with the former. In general, the parameters may be adjusted to attain a desirable balance.
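One plausible reading of how such parameters interact (illustrative only, not the paper's exact post-processor) is a Wiener-style gain per time-frequency bin: the SNR estimate is smoothed with β, floored at SNR_min, and the resulting gain is floored at W_min, with the unit conversions noted above.

```python
# Illustrative sketch, not the paper's exact post-processor: a Wiener-style
# gain computed from a smoothed SNR estimate floored at SNR_min, with the
# final gain floored at W_min. Note the conversions: SNR_min is a power
# ratio (10*log10), W_min a filter amplitude (20*log10).
def smoothed_snr(snr_prev, snr_inst, beta=0.9):
    """Recursive SNR smoothing; using beta this way is an assumption."""
    return beta * snr_prev + (1.0 - beta) * snr_inst

def postfilter_gain(snr_est, snr_min_db=-10.0, w_min_db=-8.0):
    snr_min = 10.0 ** (snr_min_db / 10.0)   # power ratio -> 10*log10 scale
    w_min = 10.0 ** (w_min_db / 20.0)       # amplitude   -> 20*log10 scale
    snr = max(snr_est, snr_min)
    gain = snr / (1.0 + snr)                # Wiener gain from the floored SNR
    return max(gain, w_min)

# 'post 2' floors (-24 dB, -20 dB) allow deeper attenuation than 'post 1'.
g1_low = postfilter_gain(0.0, -10.0, -8.0)    # 'post 1' at very low SNR
g2_low = postfilter_gain(0.0, -24.0, -20.0)   # 'post 2' at very low SNR
```

Under this reading, lower floors (‘post 2’) attenuate noise-dominated bins more strongly, at the cost of more aggressive gain trajectories and hence more distortion and audible artifacts.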
6. Conclusion

We proposed an array which consists of two AVSs mounted on an eyeglasses frame. This array configuration provides high input SNR and removes the need for tracking changes in the steering vector. An algorithm for suppressing undesired components was also proposed. This algorithm adapts to changes of the noise characteristics by continuously estimating the noise covariance matrix. A speech detection scheme is used to identify time-frequency bins containing desired speech and to prevent them from corrupting the estimation of the noise covariance matrix. The speech detection plays a pivotal role in ensuring the quality of the output signal; in the absence of a speech detector, the higher levels of noise and distortion which are typical of MPDR processing are present. Experiments confirm that the proposed system performs well in both static and changing scenarios. The proposed system may be used to improve the quality of speech acquisition in smartglasses.
Acknowledgments

We wish to acknowledge technical assistance from Microflown Technologies (Arnhem, Netherlands) related to calibration and reduction of sensor noise. This contribution was essential for attaining quality audio results.
Appendix A. Background on AVSs

A sound field can be described as a combination of two coupled fields: a pressure field and a particle-velocity field. The former is a scalar field and the latter is a vector field consisting of three Cartesian components.

Conventional sensors which are typically used in acoustic signal-processing measure the pressure field. Acoustic vector sensors (AVSs) also measure the particle-velocity field, and thus provide more information: each sensor provides four components rather than one.
Fig. A.13. The magnitudes of the directivity patterns of an AVS are plotted. They consist of a monopole and three mutually orthogonal dipoles.
An AVS consists of four collocated subsensors: one monopole and three orthogonally oriented dipoles. For a plane wave, each subsensor has a distinct directivity response. The response of a monopole element is

D_mon = 1,    (A.1)

and the response of a dipole element is

D_dip = q^T u,    (A.2)

where u is a unit-vector denoting the wave’s direction of arrival (DOA), and q is a unit-vector denoting the subsensor’s orientation. From the definition of the scalar (dot) product, it follows that D_dip corresponds to the cosine of the angle between the signal’s DOA and the subsensor’s orientation. The orientations of the three subsensors are q_x = [1 0 0]^T, q_y = [0 1 0]^T, and q_z = [0 0 1]^T. The monopole response, which is independent of DOA, corresponds to the pressure field, and the three dipole responses correspond to a scaled version of the Cartesian particle-velocity components. Fig. A.13 portrays the magnitude of the four spatial responses.
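Equations (A.1)–(A.2) can be evaluated directly. The sketch below assumes a standard spherical-coordinate parameterization of the DOA (the text itself only specifies the Cartesian unit-vectors u and q):

```python
import numpy as np

# Four AVS directivity responses of (A.1)-(A.2) for a unit plane wave.
# The azimuth/elevation parameterization of the DOA is an assumption.
def avs_response(azimuth, elevation):
    """Return [D_mon, D_dip_x, D_dip_y, D_dip_z] for a unit plane wave."""
    u = np.array([
        np.cos(elevation) * np.cos(azimuth),   # Cartesian DOA unit-vector
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    q = np.eye(3)                              # rows: q_x, q_y, q_z
    return np.concatenate(([1.0], q @ u))      # monopole = 1, dipoles = q^T u
```

For example, a wave arriving along the x-axis excites only the monopole and the x-oriented dipole, consistent with the cosine interpretation of D_dip.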
For a spherical wave, the acoustical impedance is frequency-dependent. It can be shown that the dipole elements undergo a relative gain of 1 + c/(jωr) over an omnidirectional sensor (as discussed in Section 2), where r is the source-receiver distance and c is the speed of sound. This phenomenon is manifested particularly at lower frequencies, for which the wavelength is significantly longer than the source-receiver distance.
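The magnitude of this relative gain is easy to check numerically; the distance r = 0.1 m below, roughly mouth-to-temple scale, is an assumed example value:

```python
import numpy as np

# Magnitude (dB) of the near-field relative gain 1 + c/(j*omega*r) of a
# dipole over an omnidirectional sensor for a spherical wave at distance r.
# The gain is large when omega*r/c << 1, i.e., at low frequencies and short
# source-receiver distances.
def dipole_relative_gain_db(f_hz, r_m, c=343.0):
    omega = 2.0 * np.pi * f_hz
    return 20.0 * np.log10(np.abs(1.0 + c / (1j * omega * r_m)))

g_low = dipole_relative_gain_db(100.0, 0.1)    # strong near-field boost
g_high = dipole_relative_gain_db(5000.0, 0.1)  # boost nearly vanishes
```

At 100 Hz and 0.1 m the boost exceeds 10 dB, while at 5 kHz it is a fraction of a decibel; this is the mechanism by which the array boosts the input power of the wearer's near-field speech.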
A standard omnidirectional microphone functions as a monopole element. Several approaches are available for constructing the dipole components of an AVS. One approach applies differential processing of closely-spaced omnidirectional sensors (Derkx and Janse, 2009; Elko and Pong, 1995; 1997; Olson, 1946). An alternative approach employs acoustical sensors with inherent sensitivity to particle velocity (de Bree, 2003; Shujau et al., 2009).
References

Ackerman, E., 2013. Google gets in your face [2013 Tech To Watch]. IEEE Spectrum 50 (1), 26–29. http://spectrum.ieee.org/consumer-electronics/gadgets/google-gets-in-your-face
Barfield, W. (Ed.), 2016. Fundamentals of Wearable Computers and Augmented Reality. CRC Press.
Bitzer, J., Simmer, K., 2001. Superdirective microphone arrays. In: Brandstein, M., Ward, D. (Eds.), Microphone Arrays: Signal Processing Techniques and Applications. Springer-Verlag, pp. 18–38 (chapter 2).
de Bree, H.-E., 2003. An overview of Microflown technologies. Acta Acust. United Acust. 89 (1), 163–172.
Capon, J., 1969. High-resolution frequency-wavenumber spectrum analysis. Proc. IEEE 57 (8), 1408–1418.
Cass, S., Choi, C.Q., 2015. Google Glass, HoloLens, and the real future of augmented reality.
Cox, H., Zeskind, R., Owen, M., 1987. Robust adaptive beamforming. IEEE Trans. Acoust. Speech Signal Process. 35 (10), 1365–1376.
Derkx, R.M.M., 2010. First-order adaptive azimuthal null-steering for the suppression of two directional interferers. EURASIP J. Adv. Signal Process. 2010, 1.
Derkx, R.M.M., Janse, K., 2009. Theoretical analysis of a first-order azimuth-steerable superdirective microphone array.
Elko, G., Pong, A.-T.N., 1995. A simple adaptive first-order differential microphone. In: IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 169–172.
Elko, G., Pong, A.-T.N., 1997. A steerable and variable first-order differential microphone array. In: IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 1, pp. 223–226.
Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32 (6), 1109–1121.
Frost, O., 1972. An algorithm for linearly constrained adaptive array processing. Proc. IEEE 60 (8), 926–935.
Gannot, S., Burshtein, D., Weinstein, E., 2001. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process. 49 (8), 1614–1626.
Gannot, S., Cohen, I., 2004. Speech enhancement based on the general transfer function GSC and postfiltering. IEEE Trans. Speech Audio Process. 12 (6), 561–571.
Gilbert, E., Morgan, S., 1955. Optimum design of directive antenna arrays subject to random variations. Bell Syst. Tech. J. 34, 637–663.
Harmanci, K., Tabrikian, J., Krolik, J., 2000. Relationships between adaptive minimum variance beamforming and optimal source localization. IEEE Trans. Signal Process. 48 (1).
Heute, U., 2008. Speech-transmission quality: aspects and assessment for wideband vs. narrowband signals. In: Martin, R., Heute, U., Antweiler, C. (Eds.), Advances in Digital Speech Transmission. John Wiley & Sons.
Lefkimmiatis, S., Dimitriadis, D., Maragos, P., 2006. An optimum microphone array post-filter for speech applications. In: Interspeech – Int. Conf. on Spoken Lang. Proc.
Levin, D., Habets, E., Gannot, S., et al., 2013. Robust beamforming using sensors with nonidentical directivity patterns. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 91–95.
McCowan, I.A., Bourlard, H., 2003. Microphone array post-filter based on noise field coherence. IEEE Trans. Speech Audio Process. 11 (6).
Olson, H.F., 1946. Gradient microphones. J. Acoust. Soc. Am. 17 (3).
Pierce, A.D., 1991. Acoustics: An Introduction to its Physical Principles and Applications. Acoustical Society of America.
Randell, C., 2005. Wearable computing: a review. Technical Report CSTR-06-004, University of Bristol.
Shujau, M., Ritz, C.H., Burnett, I.S., 2009. Designing acoustic vector sensors for localisation of sound sources in air. In: 17th European Signal Processing Conference, pp. 849–853.
Simmer, K.U., Bitzer, J., Marro, C., 2001. Post-filtering techniques. In: Microphone Arrays. Springer, pp. 39–60.
Spriet, A., Moonen, M., Wouters, J., 2004. Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction. Signal Processing 84 (12), 2367–2387.
Taal, C., Hendriks, R., Heusdens, R., Jensen, J., 2011. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19 (7), 2125–2136.
Tung, D., 2014. What’s wrong with Google Glass?: the improvements the Big G needs to make before Glass hits the masses. URL www.wareable.com/google-glass/google-glass-improvements-needed