Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 7 / 33
Outline
1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
4 Extracting spatial sounds
Binaural perception
path length difference
path length difference
head shadow (high freq)
source
LR
What is the information in the two ear signals?
- the sound of the source(s) (L+R)
- the position of the source(s) (L−R)
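As a minimal sketch of this sum/difference view (the helper name `mid_side` is made up, not from the slides):

```python
import numpy as np

def mid_side(left, right):
    """Decompose a binaural pair into a sum channel (the sound of the
    sources) and a difference channel (which carries interaural cues)."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    mid = 0.5 * (left + right)    # L+R: source content
    side = 0.5 * (left - right)   # L-R: position-bearing differences
    return mid, side

# Identical ear signals (source dead ahead) leave nothing in the side channel:
m, s = mid_side([1.0, 0.5], [1.0, 0.5])
```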
[Figure: example left- and right-ear waveforms from the ShATR database (shatr78m3), t = 2.2–2.235 s]
Main cues to spatial hearing
Interaural time difference (ITD)
- from different path lengths around the head
- dominates at low frequencies (< 1.5 kHz)
- max ∼750 µs → ambiguous for frequencies > 600 Hz
Interaural intensity difference (IID)
- from head shadowing of the far ear
- negligible at low frequencies; increases with frequency
Spectral detail (from pinna reflections) useful for elevation & range
Direct-to-reverberant ratio useful for range
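The two main cues can be estimated from a stereo pair: ITD from the cross-correlation peak, IID from the level ratio. A broadband sketch only (real systems estimate per frequency band, and use ITD mainly below ~1.5 kHz); `itd_iid` is a hypothetical helper:

```python
import numpy as np

def itd_iid(left, right, fs):
    """Estimate ITD (s) from the cross-correlation peak and IID (dB)
    from the RMS level ratio between the ear signals."""
    xc = np.correlate(right, left, mode="full")
    lags = np.arange(-(len(left) - 1), len(right))
    itd = lags[np.argmax(xc)] / fs          # positive: right ear lags
    iid = 20 * np.log10(np.sqrt(np.mean(np.square(left)))
                        / np.sqrt(np.mean(np.square(right))))
    return itd, iid

# Source off to the left: right ear delayed 5 samples, attenuated 6 dB
fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
itd, iid = itd_iid(x, 0.5 * np.roll(x, 5), fs)
```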
[Figure: spectrogram of claps 33 and 34 from 627M:nf90; frequency 0–20 kHz vs. time 0–1.8 s]
Head-Related Transfer Functions (HRTFs)
Capture source coupling as impulse responses
{ℓθ,φ,R(t), rθ,φ,R(t)}
Collection: (http://interface.cipic.ucdavis.edu/)
[Figure: CIPIC HRIR_021 left- and right-ear impulse responses at 0° elevation, plotted vs. time (0–1.5 ms) and azimuth (−45° to 45°), with the single pair at 0° elevation, 0° azimuth shown separately]
Highly individual!
Synthetic binaural audio
Source convolved with {L, R} HRTFs gives precise positioning... for headphone presentation
- can combine multiple sources (by adding)
Where to get HRTFs?
- measured set, but: specific to individual, discrete
- interpolate by linear crossfade, PCA basis set
- or: parametric model: delay, shadow, pinna (Brown and Duda, 1998)
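The convolution approach can be sketched as follows (`render_binaural` and `mix` are made-up helpers, and the HRIRs here are toy delay-and-attenuate stand-ins, not measured responses such as CIPIC's):

```python
import numpy as np

def render_binaural(source, hrir_l, hrir_r):
    """Convolve a mono source with a left/right HRIR pair to place it
    at that pair's measured direction (for headphone presentation)."""
    return np.convolve(source, hrir_l), np.convolve(source, hrir_r)

def mix(sources_and_hrirs):
    """Multiple sources combine by simple addition of rendered pairs."""
    rendered = [render_binaural(s, hl, hr) for s, hl, hr in sources_and_hrirs]
    n = max(len(l) for l, _ in rendered)
    left = sum(np.pad(l, (0, n - len(l))) for l, _ in rendered)
    right = sum(np.pad(r, (0, n - len(r))) for _, r in rendered)
    return left, right

# Toy stand-in HRIRs: source to the left, so the right ear gets the
# signal delayed and attenuated (a real HRIR also encodes pinna detail).
hrir_l = np.array([1.0, 0.0, 0.0, 0.0])
hrir_r = np.array([0.0, 0.0, 0.0, 0.5])
left, right = render_binaural(np.array([1.0, -1.0]), hrir_l, hrir_r)
lmix, rmix = mix([(np.array([1.0, -1.0]), hrir_l, hrir_r)] * 2)
```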
[Block diagram, after Brown and Duda (1998): the source feeds, per ear, a delay z^(-t_D(θ)), a head-shadow filter (1 − a z^(-1)) / (1 − b(θ) z^(-1)), and a pinna-echo stage Σ_k p_k(θ,φ) · z^(-t_Pk(θ,φ)); a room echo K_E · z^(-t_E) is added to both outputs]
Head motion cues?
- head tracking + fast updates
Beamforming: drive interference to zero
- cancel energy during nontarget intervals
ICA: maximize mutual independence of outputs
- from higher-order moments during overlap
[Diagram: sources s1, s2 mixed through gains a11, a12, a21, a22 into microphones m1, m2; unmixing parameters adapted by the gradient −∂MutInfo/∂a]
Limited by separation model parameter space
- only N × N?
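The "drive interference to zero" idea can be sketched with two microphones as a plain delay-and-subtract null steerer, assuming the interferer's inter-mic delay is known (helper names are made up):

```python
import numpy as np

def delay(x, d):
    """Shift x right by d samples, zero-padding the front."""
    return np.concatenate([np.zeros(d), x[:len(x) - d]]) if d else x.copy()

def null_steer(m1, m2, d):
    """Delay mic 1 by the interferer's inter-mic delay d and subtract:
    the interferer cancels exactly; other directions pass (filtered)."""
    return delay(m1, d) - m2

rng = np.random.default_rng(1)
target = rng.standard_normal(256)   # reaches both mics simultaneously
interf = rng.standard_normal(256)   # reaches mic 2 three samples late
m1, m2 = target + interf, target + delay(interf, 3)
out = null_steer(m1, m2, 3)                          # target survives
residual = null_steer(interf, delay(interf, 3), 3)   # interferer alone: zero
```

Note the limitation the slide mentions: each such null spends one degree of freedom, so N microphones can cancel only about N − 1 point interferers.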
Binaural models
Human listeners do better?
- certainly given only 2 channels
Extract ITD and IID cues?
- cross-correlation finds timing differences
- 'consume' counter-moving pulses
- how to achieve IID, trading
- vertical cues...
Time-frequency masking
How to separate sounds based on direction?
- assume one source dominates each time-frequency point
- assign regions of spectrogram to sources based on probabilistic models
- re-estimate model parameters based on regions selected
Model-based EM Source Separation and Localization
- Mandel and Ellis (2007)
- models include IID as |Lω / Rω| and IPD as arg(Lω / Rω)
- independent of source, but can model it separately
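The masking step can be sketched under an IPD-only model (function names and the fixed candidate delays are illustrative; MESSL itself also models IID and re-estimates its parameters with EM rather than using fixed delays):

```python
import numpy as np

def stft(x, nfft=256, hop=128):
    """Windowed short-time Fourier transform, frequency x time."""
    win = np.hanning(nfft)
    frames = [win * x[i:i + nfft] for i in range(0, len(x) - nfft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1).T

def ipd_masks(L, R, taus, fs, nfft=256):
    """One binary mask per candidate delay: each T-F point goes to the
    delay whose predicted IPD (omega * tau) is circularly closest to
    the observed IPD, arg(L / R)."""
    omega = 2 * np.pi * np.fft.rfftfreq(nfft, 1.0 / fs)[:, None]  # rad/s
    ipd = np.angle(L * np.conj(R))                                # observed IPD
    err = [np.abs(np.angle(np.exp(1j * (ipd - omega * tau)))) for tau in taus]
    best = np.argmin(err, axis=0)
    return [(best == i).astype(float) for i in range(len(taus))]

# Single source delayed 5 samples between the ears: the tau = 5/fs mask
# should claim almost all of the T-F points.
fs, d = 16000, 5
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
L = stft(x)
R = stft(np.concatenate([np.zeros(d), x[:-d]]))
m0, m1 = ipd_masks(L, R, [0.0, d / fs], fs)
```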
Summary
Spatial sound
- sampling at more than one point gives information on origin direction
Binaural perception
- time & intensity cues used between/within ears
Sound rendering
- conventional stereo
- HRTF-based
Spatial analysis
- optimal linear techniques
- elusive auditory models
References
Elizabeth M. Wenzel, Marianne Arruda, Doris J. Kistler, and Frederic L. Wightman. Localization using nonindividualized head-related transfer functions. The Journal of the Acoustical Society of America, 94(1):111–123, 1993.
William G. Gardner. A real-time multichannel room simulator. The Journal of the Acoustical Society of America, 92(4):2395–2395, 1992.
C. P. Brown and R. O. Duda. A structural model for binaural sound synthesis. IEEE Transactions on Speech and Audio Processing, 6(5):476–488, 1998.
Michael I. Mandel and Daniel P. Ellis. EM localization and separation using interaural level and phase cues. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 275–278, 2007.
J. C. Middlebrooks and D. M. Green. Sound localization by human listeners. Annual Review of Psychology, 42:135–159, 1991.
Brian C. J. Moore. An Introduction to the Psychology of Hearing. Academic Press, fifth edition, April 2003. ISBN 0125056281.
Jens Blauert. Spatial Hearing - Revised Edition: The Psychophysics of Human Sound Localization. The MIT Press, October 1996.
V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano. The CIPIC HRTF database. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 99–102, 2001.