Reverberant Speech Processing for Human Communication
and Automatic Speech Recognition
Tomohiro Nakatani, Armin Sehr, Walter Kellermann
NTT Communication Science Laboratories
LMS, University of Erlangen-Nuremberg
March 26, 2012
Generic Scenario: Natural Interactive Human/Machine Interface
Mobile users, distant microphones/loudspeakers
[Figure: mobile users, loudspeakers, and microphones connected to a digital signal processing unit]
Tasks:
• Rendering - reproduce desired signals at distant ears
• Acquisition - localize sources and capture clean signals from a distance
Challenges:
• Feedback of loudspeaker signals
• Noise and interferers
• Reverberation
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 2
Applications
Hands-free equipment
• for telecommunication and natural human/machine interaction
• for mobile phones/smartphones, mobile computing devices, PDAs
• in car interiors ('command & control', telecommunication, in-car communication, ...)
• for desktop computers, info-/edutainment terminals, interactive TV, game stations, simulators
• for telepresence systems (offices, ..., classrooms, ..., auditoria)
• for ambient communication (smart meeting rooms, smart homes, information kiosks, museums and exhibitions, ...)
• for voice-driven navigation systems in cars, operating rooms, ...
Applications (cont'd)
Professional audio
• equipment for stages and recording studios
• virtual acoustic environments (virtual concert halls, telepresence studios, ...)
Safety and surveillance
• acoustic displays in control centers, cockpits
• monitoring in health-care environments (advanced 'babyphones')
• acoustic scene analysis (train stations, ...)
Another Scenario: 'Listening devices'
[Figure: listening device with microphones, a DSP unit, and a loudspeaker at the ear]
Tasks:
• Rendering - reproduce undistorted signals with binaural cues
• Acquisition - localize desired source(s) and enhance desired signal(s)
Challenges:
• Loudspeaker feedback (howling)
• Noise and interferers
• Reverberation
Applications
• Hearing aids, of course
• Headsets, e.g., for mobile phones, mobile computing devices, personal digital assistants
• Hearing protection in noisy environments (construction work, mining, ...)
• Active noise cancellation systems
• ...
Example 1: DICIT - an Interactive TV system
Voice-controlled home entertainment system (EU project DICIT 2005-2009; see, e.g., Marquardt et al., 2009; YouTube), featuring
• Multichannel AEC (GFDAF; Buchner/Benesty et al., 2003ff)
• Multi-beamforming (Mabande et al., 2009; Kellermann, 1997)
• Source localization (GCF; Brutti et al., 2007)
• Speech/non-speech classification (Omologo, 2009)
• Noise-robust automatic speech recognition (ViaVoice, IBM 2009)
Challenge: Reverberation for large source distances in more reverberant rooms
Page 23
Who Spoke When, What, Who Spoke When, What, ToTo--whom and How?whom and How?
RealReal--time Meeting Browsertime Meeting Browser
Recognize speech andRecognize speech andother audio eventsother audio events
Example 2: Meeting recognition system
Example 3: Audio post-production system [Movies/TV creation]
Step 1: Sound & video recording on location (actor/actress, microphone(s))
Step 2: Audio post-production (de-noising, de-reverberation, sound effects)
Overview
Part I: Introduction
• Fundamentals
• Approaches
Part II: Multichannel blind inverse filtering
• Example applications
  ⊲ Professional audio post-production
  ⊲ Meeting speech recognition with microphone arrays
• Fundamentals: dereverberation with inverse filtering
  ⊲ What is 'inverse' filtering?
  ⊲ Robust 'approximate' inverse filtering
• Blind inverse filtering
  ⊲ Overview of basic approaches
  ⊲ Closer look: multichannel linear prediction with time-varying source model
• Integration with blind source separation
Part III: Robust ASR in reverberant environments
• Feature-based approaches
  ⊲ Cepstral mean normalization
  ⊲ Model-based feature enhancement
• Model-based approaches
  ⊲ Matched training, multi-style training, adaptive training, MAP and MLLR adaptation, parametric adaptation tailored to reverberation, frame-wise adaptation
• Decoder-based approaches
  ⊲ Missing feature techniques
  ⊲ Uncertainty decoding
• A generic approach: REMOS
Part IV: Summary, Conclusions, and Outlook
Fundamental Signal Processing Problems - Formulation
[Block diagram: digital signal processing system W with far-end inputs u, loudspeaker signals v, microphone signals x, and outputs y; sources s1, ..., sM and noise n reach the microphones via Hxs and Hxv; the listeners' signals z1, ..., z2M are reached via Hzv]
Linear MIMO system W ('multiple input/multiple output'):

  [v; y] = W ∗ [u; x] = [Wvu, Wvx; Wyu, Wyx] ∗ [u; x]

Listeners' signals:
  z = Hzv ∗ v + nz
Microphone signals:
  x = Hxs ∗ s + Hxv ∗ v + nx
Fundamental Problems for Signal Acquisition
[Block diagram: sources s1, ..., sM reach the microphones via Hxs, the loudspeaker signals v = Wvu ∗ u feed back via Hxv, and noise n is added; outputs y = Wyu ∗ u + Wyx ∗ x]
Goal: undistorted source signals, i.e.,
  y = Wyu ∗ u + Wyx ∗ x shall equal s ∗ δ(k − k0),
where x = Hxs ∗ s + Hxv ∗ v + nx.
3 subproblems:
• Echo cancellation: (Wyu + Wyx ∗ Hxv ∗ Wvu) ∗ u = 0
• Source separation and dereverberation: Wyx ∗ Hxs ∗ s = s ∗ δ(k − k0)
• Noise and interference suppression: Wyx ∗ nx = 0
The components of x, i.e., Hxs ∗ s, Hxv ∗ v, and nx, must be separated by W!
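The echo-cancellation condition above can be checked numerically: if Wyu is chosen to cancel the feedback path Wyx ∗ Hxv ∗ Wvu, the far-end signal u vanishes from the output regardless of what u is. A minimal sketch with arbitrary random FIR systems (all filter lengths are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary FIR systems standing in for Wyx, Hxv, and Wvu (lengths are illustrative)
w_yx = rng.standard_normal(16)
h_xv = rng.standard_normal(64)   # loudspeaker-to-microphone feedback path
w_vu = rng.standard_normal(8)

# Choose Wyu to cancel the feedback path: Wyu = -(Wyx * Hxv * Wvu)
w_yu = -np.convolve(np.convolve(w_yx, h_xv), w_vu)

# Echo component at the output for an arbitrary far-end signal u
u = rng.standard_normal(1000)
echo = (np.convolve(w_yu, u)
        + np.convolve(np.convolve(np.convolve(w_yx, h_xv), w_vu), u))

print(np.max(np.abs(echo)))  # ~0: the echo path is cancelled exactly
```

In practice Hxv is unknown and time-varying, which is why adaptive echo cancellation is needed; the sketch only verifies the algebraic condition.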
Fundamentals - Room Impulse Response (RIR) properties
The elements of Hzv, Hxv, and Hxs are room impulse responses (RIRs).
Typical structure of an RIR h(t): the direct sound, followed by early reflections, followed by late reverberation.
Main characteristic parameters:
• T60: time for the exponential decay of the envelope by 60 dB
• DRR: Direct-to-Reverberant (Energy) Ratio
Fundamentals - Room Impulse Response (RIR) properties
• Reverberation time T60
  ⊲ car ≈ 50 ms
  ⊲ concert halls ≈ 1 ... 2 s
• FIR models
  ⊲ typically LH ≈ T60 · fs/3 coefficients
  ⊲ nonminimum-phase
  ⊲ many zeros close to the unit circle
Example: office of 5.5 m × 3 m × 2.8 m, T60 ≈ 300 ms, fs = 12 kHz.
[Figure: measured RIR over 5000 taps (left); its zeros in the complex plane, clustered near the unit circle (right)]
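The rule of thumb LH ≈ T60 · fs/3 can be illustrated with a synthetic RIR modeled as exponentially decaying white noise, a common simplified statistical model (the model itself is an assumption here, not the slide's measured office RIR): over LH taps the energy envelope drops by about 20 dB.

```python
import numpy as np

def synthetic_rir(t60, fs, length=None, rng=None):
    """Exponentially decaying white noise: a simple statistical RIR model."""
    rng = rng or np.random.default_rng(0)
    n = length or int(t60 * fs)                     # model the tail out to one T60
    t = np.arange(n) / fs
    envelope = np.exp(-3 * np.log(10) * t / t60)    # -60 dB energy decay per T60
    return rng.standard_normal(n) * envelope

t60, fs = 0.3, 12000                 # office example from the slide
h = synthetic_rir(t60, fs)
L_H = int(t60 * fs / 3)              # rule-of-thumb FIR model length
print(L_H)                           # 1200 taps

# Energy drop over the first L_H taps: on the order of 20 dB
e_head = np.mean(h[: L_H // 10] ** 2)
e_tail = np.mean(h[L_H - L_H // 10 : L_H] ** 2)
drop_db = 10 * np.log10(e_head / e_tail)
```

The -60 dB-per-T60 envelope makes the arithmetic explicit: T60 · fs/3 taps correspond to a 20 dB decay, the level at which the model truncation error is often considered acceptable.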
Fundamentals - RIR properties (cont'd)
RIRs for varying source-microphone distance (d1 = 1 m vs. d2 = 4 m, T60 ≈ 900 ms)
[Figure: RIRs h(n) for 1 m and 4 m distance; energy decay curves (EDC [Schroeder 1965]) for both distances]
DRR (Direct-to-Reverberant Energy Ratio): 4.9 dB at 1 m vs. -4.0 dB at 4 m
The RIR and the DRR change with the source-microphone distance, whereas the reverberation time T60 does not.
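The EDC and DRR mentioned above are straightforward to compute from an RIR. A sketch of Schroeder's backward integration [Schroeder 1965] and a simple DRR estimate; the split point between direct sound and reverberation (`n_direct`) is an assumption of the sketch, typically a few milliseconds after the direct path:

```python
import numpy as np

def edc_db(h):
    """Schroeder backward integration: remaining energy after each tap, in dB."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]
    return 10 * np.log10(energy / energy[0])

def drr_db(h, n_direct):
    """Direct-to-reverberant energy ratio; n_direct = taps counted as direct sound."""
    e_direct = np.sum(h[:n_direct] ** 2)
    e_reverb = np.sum(h[n_direct:] ** 2)
    return 10 * np.log10(e_direct / e_reverb)

def t60_from_edc(h, fs):
    """Estimate T60 from the -5 dB .. -25 dB portion of the EDC (x3 extrapolation)."""
    edc = edc_db(h)
    n5 = np.argmax(edc <= -5.0)
    n25 = np.argmax(edc <= -25.0)
    return 3.0 * (n25 - n5) / fs

# Check with a synthetic exponentially decaying RIR of known T60
fs, t60 = 8000, 0.5
t = np.arange(int(1.5 * t60 * fs)) / fs
h = np.random.default_rng(1).standard_normal(t.size) * np.exp(-3 * np.log(10) * t / t60)
print(t60_from_edc(h, fs))  # close to 0.5
```

Backward integration smooths the noisy squared RIR, which is why T60 is read off the EDC rather than the raw decay.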
Fundamentals - RIR properties (cont'd)
Variability with displacements:
• Microphone displacement of 4.2 cm (source distance d = 4 m):
  [Figure: RIR 1, RIR 2, and the difference between RIR 1 and RIR 2]
  System error norm: 0.23 dB
• Shift of the RIR by 1 sample:
  [Figure: RIR 1 and the difference between RIR 1 and RIR 1 shifted by 1 sample]
  System error norm: 2.56 dB
Fundamentals - Reverberation in signal representations
Clean vs. reverberated with T60 ≈ 900 ms, d1 = 4 m and T60 ≈ 3.1 s, d2 = 5 m
Time domain:
[Figure: clean signal s_t and the two reverberated signals x_t over t in s - the speech pauses are filled!]
STFT domain:
[Figure: spectrograms (f in Hz vs. t in s, magnitude in dB) of the clean and the two reverberated signals - the speech pauses are filled!]
Fundamentals - Reverberation in ASR features
Clean vs. reverberated with T60 ≈ 900 ms, d1 = 4 m and T60 ≈ 3.1 s, d2 = 5 m
Logmelspec domain:
[Figure: log-mel spectrograms (mel channel vs. t in s) of the clean and the two reverberated signals - the speech pauses are filled!]
MFCC domain:
[Figure: MFCC trajectories (cepstral coefficient vs. t in s) of the clean and the two reverberated signals - the pauses of c0 are filled!]
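The 'pauses filled' effect seen in all of these domains can be reproduced in a few lines: convolving a signal that contains a silent gap with a decaying RIR leaves substantial energy in the gap. The RIR model and all durations below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000

# Speech-like burst followed by a pause
s = np.zeros(fs)                                  # 1 s of signal
s[: int(0.4 * fs)] = rng.standard_normal(int(0.4 * fs))

# Synthetic RIR: exponentially decaying noise, T60 = 0.9 s as on the slide
t60 = 0.9
t = np.arange(int(0.5 * fs)) / fs
h = rng.standard_normal(t.size) * np.exp(-3 * np.log(10) * t / t60)
h /= np.sqrt(np.sum(h ** 2))                      # unit-energy RIR

x = np.convolve(s, h)[: s.size]                   # reverberant signal

# Energy in the pause (0.45 s - 0.55 s) relative to the burst
gap = slice(int(0.45 * fs), int(0.55 * fs))
burst = slice(0, int(0.4 * fs))

def level_db(a, b):
    return 10 * np.log10(np.mean(a ** 2) / np.mean(b ** 2))

print(level_db(s[gap] + 1e-12, s[burst]))  # clean: pause is essentially silent
print(level_db(x[gap], x[burst]))          # reverberated: pause only a few dB down
```

This is exactly why reverberation is hard for ASR: frame-wise features in the pause no longer look like silence, so models trained on clean speech mismatch.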
Dereverberation for Speech Enhancement
Basic idea: separate speech production from the RIR and equalize the latter.
[Figure: glottal excitation → vocal tract (to be preserved) → room (to be equalized)]
A 'blind' problem! (no reference signal for the RIR input)
Distinction:
• Partial deconvolution: removes reverberation by RIR inversion, ideally without speech distortion
• Reverberation suppression: a compromise between dereverberation and signal distortion is necessary
Dereverberation for Signal Enhancement (cont'd)
Dealing with 'blindness' by exploiting prior knowledge on
A) speech production models (e.g., source-filter model, HMM) and signal properties (nonwhiteness, nonstationarity, non-Gaussianity)
B) room acoustics parameters (e.g., T60)
C) location and radiation characteristics of the speech source
and some useful assumptions:
D) joint moments (e.g., correlation) of signal samples: small lags characterize speech ⇔ large lags characterize reverberation
E) speech signal statistics change faster than RIRs
F) multichannel recordings: the speech component is the same ⇔ the RIRs are different
Dereverberation for Signal Enhancement - Approaches
Signal dereverberation splits into:
• Partial deconvolution (single-channel or multichannel)
• Reverberation suppression (single-channel or multichannel)
Dereverberation - Single-Channel Partial Deconvolution
• Can exploit speech models and properties (A) and correlation and stationarity assumptions (D, E) for identifying an RIR estimate
• Inversion of a single RIR involves [Neely 1979]:
  ⊲ removing the allpass component of the nonminimum-phase RIR → approximated by a delay
  ⊲ inverting zeros close to, or on, the unit circle → approximated by 'channel shortening'
• For realization problems see, e.g., [Mourjopoulos 1994], [Naylor 2010]
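The minimum-phase/allpass decomposition behind [Neely 1979] can be sketched via the real cepstrum: the minimum-phase part shares the RIR's magnitude spectrum and is stably invertible, while the allpass remainder can only be approximated by a delay. The following is the textbook cepstral construction, not a method from the slide's references, and the FFT size and filter length are illustrative:

```python
import numpy as np

def minimum_phase(h, nfft=8192):
    """Minimum-phase counterpart of h (same magnitude spectrum), via the cepstrum."""
    mag = np.abs(np.fft.fft(h, nfft))
    cep = np.fft.ifft(np.log(np.maximum(mag, 1e-12))).real   # real cepstrum
    # Fold the anti-causal part of the cepstrum onto the causal part
    cep[1 : nfft // 2] *= 2.0
    cep[nfft // 2 + 1 :] = 0.0
    h_min = np.fft.ifft(np.exp(np.fft.fft(cep))).real
    return h_min[: h.size]

rng = np.random.default_rng(0)
h = rng.standard_normal(64)       # stand-in for a (nonminimum-phase) RIR
h_min = minimum_phase(h)

# Same magnitude spectrum => same energy (Parseval), even though the phases differ
print(np.sum(h ** 2), np.sum(h_min ** 2))
```

Inverting `h_min` (e.g., by long division of 1/H_min) then equalizes the magnitude response; the residual allpass is what limits single-channel deconvolution.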
Dereverberation - Multichannel Partial Deconvolution
• Can additionally exploit spatial diversity (incl. assumption F) and prior knowledge of the source location and radiation characteristic (C) for identifying the RIRs
• Spatial diversity facilitates RIR identification
• Perfect inversion with FIR filters is possible (MINT [Miyoshi 1988]):
  ⊲ exact knowledge of the RIR lengths is required
  ⊲ no common zeros of the RIRs are allowed
• Indirect approaches often invert in subbands for robustness (e.g., [Naylor 2005])
• Direct approaches to identify a robust inverse exist (e.g., [Buchner 2004], [Buchner 2010], and below!)
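MINT [Miyoshi 1988] can be illustrated directly: with M = 2 channels and RIRs of length Lh without common zeros (which holds almost surely for random coefficients), FIR inverse filters of length Lg = Lh − 1 achieve exact deconvolution by solving one linear system built from convolution matrices. A minimal sketch with illustrative sizes:

```python
import numpy as np
from scipy.linalg import toeplitz

def conv_matrix(h, n):
    """(len(h)+n-1) x n convolution matrix: conv_matrix(h, n) @ g == np.convolve(h, g)."""
    col = np.concatenate([h, np.zeros(n - 1)])
    row = np.concatenate([[h[0]], np.zeros(n - 1)])
    return toeplitz(col, row)

rng = np.random.default_rng(0)
Lh = 32                                    # RIR length (illustrative)
h1, h2 = rng.standard_normal(Lh), rng.standard_normal(Lh)

Lg = Lh - 1                                # MINT filter length for M = 2 channels
H = np.hstack([conv_matrix(h1, Lg), conv_matrix(h2, Lg)])
d = np.zeros(Lh + Lg - 1)
d[0] = 1.0                                 # target: perfect impulse (delay k0 = 0)

g = np.linalg.solve(H, d)                  # H is square here: 2*Lg = Lh + Lg - 1
g1, g2 = g[:Lg], g[Lg:]

out = np.convolve(h1, g1) + np.convolve(h2, g2)
print(np.max(np.abs(out - d)))             # ~0: exact multichannel deconvolution
```

The same construction makes the slide's caveats concrete: a wrong Lg makes H non-square/rank-deficient, and common zeros of h1 and h2 make it singular.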
Dereverberation - Single-Channel Reverberation Suppression
• Can exploit speech models and properties (A) and correlation and stationarity assumptions (D, E), e.g., for equalizing the vocal tract impulse response and suppressing reverberation in the LPC residual (e.g., [Yegnanarayana 2000], [Gaubitch 2006])
• Can exploit prior knowledge on room acoustics (e.g., T60) to estimate the PSD of the reverberation and use spectral subtraction methods as common for additive noise (e.g., [Lebart 2001])
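A minimal sketch of the second idea, in the spirit of [Lebart 2001]: the late-reverberation PSD in each STFT bin is predicted from the reverberant PSD a delay Δ earlier, attenuated according to the exponential decay implied by T60, and removed by a spectral-subtraction gain. The parameter values (frame size, Δ, gain floor) are illustrative assumptions, not those of the original paper:

```python
import numpy as np
from scipy.signal import stft, istft

def suppress_late_reverb(x, fs, t60=0.9, delay_s=0.05, floor=0.1, nperseg=512):
    f, t, X = stft(x, fs=fs, nperseg=nperseg)
    P = np.abs(X) ** 2

    hop = nperseg // 2                       # default 50% frame overlap
    D = max(1, int(round(delay_s * fs / hop)))  # delay in frames

    # Exponential energy decay over the delay: exp(-2*rho*Delta), rho = 3 ln10 / T60
    rho = 3.0 * np.log(10) / t60
    P_late = np.zeros_like(P)
    P_late[:, D:] = np.exp(-2.0 * rho * delay_s) * P[:, :-D]

    # Spectral-subtraction gain with a floor to limit musical noise
    gain = np.maximum(1.0 - P_late / np.maximum(P, 1e-12), floor)
    _, y = istft(gain * X, fs=fs, nperseg=nperseg)
    return y

fs = 16000
x = np.random.default_rng(0).standard_normal(fs)  # stand-in for a reverberant signal
y = suppress_late_reverb(x, fs)
```

Because only the PSD is modified and the reverberant phase is kept, this is suppression rather than deconvolution: the compromise between dereverberation and distortion is set by the gain floor.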
Dereverberation - Multichannel Reverberation Suppression
• Can additionally exploit spatial diversity (incl. assumption F) and prior knowledge on the source location and radiation characteristic (C), e.g., for
  ⊲ beamforming using only prior knowledge of the source location and radiation characteristic (C) (e.g., [Griebel 2001])
  ⊲ multichannel spectral subtraction (e.g., [Allen 1977]) or subspace methods (e.g., [Gannot 2003])
  ⊲ spatial diversity complemented by prior knowledge on room acoustics parameters (e.g., [Habets 2005])
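The simplest instance of the first item is a delay-and-sum beamformer: given the source position (prior knowledge C), each microphone signal is advanced by its known direct-path delay and the channels are averaged, so the direct sound adds coherently while reverberation adds incoherently. A sketch restricted to integer-sample delays (fractional delays would need interpolation):

```python
import numpy as np

def delay_and_sum(X, delays):
    """X: (M, T) microphone signals; delays[m]: direct-path delay of mic m in samples."""
    M, T = X.shape
    y = np.zeros(T)
    for m in range(M):
        y[: T - delays[m]] += X[m, delays[m]:]   # advance each channel into alignment
    return y / M

# Toy check: the same source signal arriving with different direct-path delays
rng = np.random.default_rng(0)
T = 1000
s = rng.standard_normal(T)
delays = [0, 3, 7]
X = np.zeros((3, T))
for m, d in enumerate(delays):
    X[m, d:] = s[: T - d]

y = delay_and_sum(X, delays)
print(np.allclose(y[: T - 7], s[: T - 7]))   # True: direct sound adds coherently
```

With M microphones the coherent direct sound gains a factor M in amplitude over incoherent reverberation components, i.e., roughly 10·log10(M) dB of DRR improvement under idealized assumptions.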
Page 90
Handling Reverberation for Automatic Speech Recognition
Block diagram of ASR system
pre−
training
speechsignal
transcription
transcriptionrecog−nition
processing extractionfeature acoustic
modellanguagemodel
A
B
C
D
REMOS
Strategies
A) signal-based approaches
B) feature-based approaches
C) model-based approaches
D) decoder-based approaches
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 26
Page 91
Part II. Multichannel blind inverse filtering
Page 92
Two approaches for signal dereverberation: partial deconvolution and reverberation suppression
Reverberation suppression: [Lebart 2001], [Habets 2005], [Löllmann 2009], [Erkelens 2010], [Kameoka 2009], [Jeub 2010] and others
“Robust” blind inverse filtering is the main topic of part II
Page 93
Multichannel inverse filtering
(Figure: clean speech s_t → RIRs h_t^{(1)}, …, h_t^{(M)} → reverberant speech x_t^{(1)}, …, x_t^{(M)} → inverse filters w_t^{(1)}, …, w_t^{(M)} → sum → dereverberated signal y_t)
Linear filtering:
  y_t = ∑_{m=1}^{M} ∑_{k=0}^{K} w_k^{(m)} x_{t-k}^{(m)}
Goal: estimate the w_t^{(m)} s.t. y_t = s_t
m: mic. index, t: time index, {·}: a set of variables for all t and m
Page 94
Part II. Multichannel blind inverse filtering
- Example applications
  - Professional audio post production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering
  - What is an inverse filter
  - Robust ‘approximate’ inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with time-varying source model
- Integration with blind source separation
Page 95
Application to audio post-production [Movies/TV creation]
Step 1: sound & video recording on location (actor/actress, microphone(s))
Step 2: audio post-production (de-noising, de-reverb, sound effects)
Page 96
Dereverberation system for audio post production [Kinoshita 2008]
Dereverberation plug-in for Pro Tools: NML RevCon-RR (sold by TAC System, Inc.)
Page 97
Online meeting recognition [Hori 2012]
Who spoke when, what, to-whom and how? Recognize speech and other audio events in the presence of reverberation, simultaneous speech, and background noise
Show & Tell ST-3.2: Thursday, March 29, 10:30-12:30, Real-time Meeting Browser
Page 98
Online/offline processing flow of meeting recognition
Mic signals → Dereverberation → dereverberated microphone signals → Voice activity detection → Speech separation → separated signals → Noise suppression → cleaned signals → ASR → word sequence
Dereverberation serves as preprocessing for all following signal processing units
Page 99
ASR performance w/ and w/o dereverberation
Test data: meeting by 4 speakers (15 min x 8 sessions)
Recording: 8 mics (T60: about 350 ms, speaker-mic distance: 100 cm)
Acoustic model: trained on CSJ (Corpus of Spontaneous Japanese), headset recording
Language model: vocabulary size 156K (LVCSR)
Conditions: Baseline: distant microphone (w/o enhancement); w/o derev: BSS + denoise; w/ derev: derev. + BSS + denoise; Headset: close microphone (w/o enhancement)
Word error rate (%), read from the bar chart:
Online processing (latency = 1 s for preprocessing, w/o speaker adaptation): Baseline 86.5%, w/o derev 72.1%, w/ derev 56.6%, Headset 30.6%
Offline processing (w/ unsupervised speaker adaptation): Baseline 78.9%, w/o derev 38.0%, w/ derev 35.9%, Headset 27.4%
Page 100
Questions to be answered
• What is inverse filtering?
• Is the inverse filter robust against interferences?
• Can we estimate the inverse filter with blind processing?
(Figure: s_t → RIRs h_t^{(1)}, …, h_t^{(M)} → x_t^{(1)}, …, x_t^{(M)} → inverse filters w_t^{(1)}, …, w_t^{(M)} → sum → dereverberated signal y_t)
Page 101
Answers at a glance
• What is inverse filtering? Inversion of room impulse responses (RIRs).
• Is the inverse filter robust against interferences? Unfortunately no, but there is a robust ‘approximate’ inverse filter.
• Can we estimate the inverse filter with blind processing? Yes, we can, by using cues for distinguishing speech from RIRs.
Page 102
Part II. Multichannel blind inverse filtering
- Example applications
  - Professional audio post production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering
  - What is an inverse filter
  - Robust ‘approximate’ inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with time-varying source model
- Integration with blind source separation
Assume non-blind processing for analysis purposes
Page 103
Inversion of RIRs = inversion of matrix transformation
Clean speech s_t → RIRs (viewed as a matrix transformation) → reverberant speech x_t^{(1)}, …, x_t^{(M)} → inverse filtering (viewed as a matrix inversion) → dereverberated speech y_t
Page 104
Matrix/vector representations of RIR convolution/filtering

Single channel filtering:
  ∑_{k=0}^{K} w_k^{(m)} x_{t-k}^{(m)} = w^{(m)T} x_t^{(m)}
  w^{(m)} = [w_0^{(m)}, …, w_K^{(m)}]^T ,  x_t^{(m)} = [x_t^{(m)}, x_{t-1}^{(m)}, …, x_{t-K}^{(m)}]^T

Multichannel filtering:
  y_t = ∑_{m=1}^{M} w^{(m)T} x_t^{(m)} = w^T x_t
  w = [w^{(1)T}, …, w^{(M)T}]^T ,  x_t = [x_t^{(1)T}, …, x_t^{(M)T}]^T

Single channel RIR convolution:
  x_t^{(m)} = H^{(m)} s_t
  with s_t = [s_t, s_{t-1}, …, s_{t-(K+L_h-1)}]^T and the (K+1) x (K+L_h) convolution (Toeplitz) matrix
  H^{(m)} =
    [ h_0^{(m)} h_1^{(m)} … h_{L_h-1}^{(m)}   0   …   0
      0   h_0^{(m)} h_1^{(m)} … h_{L_h-1}^{(m)}   …   0
      …
      0   …   0   h_0^{(m)} h_1^{(m)} … h_{L_h-1}^{(m)} ]
  (L_h: RIR length)

Multichannel RIR convolution:
  x_t = H s_t ,  H = [H^{(1)T}, …, H^{(M)T}]^T (channel-wise stacking)
Page 105
Existence of inverse filter
• A column vector w is an inverse filter when it satisfies
    y_t = w^T x_t = w^T H s_t = s_t  for any s_t = [s_t, s_{t-1}, …]^T ,
  i.e., w^T H = e^T with e = [1, 0, …, 0]^T
• An inverse filter exists when H is invertible, i.e., it has full column rank, and is then obtained as
    w^T = e^T H^+ ,  where H^+ = (H^T H)^{-1} H^T
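These conditions can be checked numerically. Below is a minimal numpy sketch (the sizes and the random RIRs are illustrative choices, not data from the tutorial): it stacks the per-channel convolution matrices for M = 2 known RIRs and solves w^T H = e^T.

```python
import numpy as np

rng = np.random.default_rng(0)
L_h, K, M = 8, 8, 2                      # RIR length, filter order, # mics
h = rng.standard_normal((M, L_h))        # random RIRs (known, non-blind setting)

def conv_matrix(hm, K, L_h):
    """(K+1) x (K+L_h) Toeplitz matrix: row k holds hm shifted by k."""
    H = np.zeros((K + 1, K + L_h))
    for k in range(K + 1):
        H[k, k:k + L_h] = hm
    return H

# stack the per-channel convolution matrices: M(K+1) x (K+L_h)
H = np.vstack([conv_matrix(h[m], K, L_h) for m in range(M)])
e = np.zeros(K + L_h); e[0] = 1.0
# w^T H = e^T  <=>  H^T w = e, solvable because H has full column rank
w, *_ = np.linalg.lstsq(H.T, e, rcond=None)
print(np.max(np.abs(w @ H - e)))         # ~0: an exact inverse filter exists
```

With M = 1 the same construction fails: H would have K+1 rows but K+L_h columns, so it cannot have full column rank, matching the condition on the next slide.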
Page 106
M (#mics) > 1 is required for the single source case
• H is invertible, i.e., has full column rank, if and only if
    (#rows of H) ≥ (#columns of H)
  and all columns are linearly independent
• In the case of a single source, (#rows of H) ≥ (#columns of H) is satisfied if and only if M (#mics) > 1
(Figure: H composed of the vertically stacked blocks H^{(1)}, H^{(2)}, …, H^{(M)})
Page 107
Generalization to the case of N sources and M microphones
Multiple-input/output inverse theorem (MINT) [Miyoshi 1988]
(Figure: sources s_t^{(1)}, …, s_t^{(N)} pass through the RIR matrix H to the mics x_t^{(1)}, …, x_t^{(M)}; an equivalent inverse filter matrix W yields y_t^{(n)} = s_t^{(n)} for n = 1, …, N)
An inverse filter exists when H is full column rank, i.e.,
• M (#mics) > N (#sources)
• H(z) does not contain common zeros
Page 108
Part II. Multichannel blind inverse filtering
- Example applications
  - Professional audio post production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering
  - What is an inverse filter
  - Robust ‘approximate’ inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with time-varying source model
- Integration with blind source separation
Page 109
Problem of inverse filtering
Assumptions for inverse filtering:
• invertible RIRs
• no additive noise
• time-invariant RIRs
Not realistic! The inverse filter is too sensitive to modeling errors (noise or RIR change)
Page 110
Inverse filter greatly amplifies noise
Noise-free reverberant case (audio examples):
• clean speech
• reverberant speech, synthesized using a fixed RIR (RT60 = 0.5 s)
• dereverberated speech using an inverse filter for known RIRs (2-channel)
Noisy reverberant case (audio examples):
• noisy reverberant speech (SNR = 30 dB)
• speech processed using the same inverse filter (2-channel)
Page 111
Why the inverse filter is so sensitive to additive noise
With noise, x_t = H s_t + n_t, and the inverse filter output becomes
  y_t = w^T x_t = s_t + ñ_t ,  where ñ_t = e^T H^+ n_t
The minimum singular value σ_min of H is often extremely small (compared to the maximum singular value), so the maximum singular value of H^+,
  σ_max(H^+) = 1 / σ_min(H) ,
is often extremely large: the inverse filter extremely amplifies the noise
Page 112
Standard numerical approach for robustness [Engl 1996]
• Regularization: a general technique for robust matrix inversion
• Add a very small positive constant δ to the diagonal of H^T H when calculating the pseudo-inverse of H:
    H̃^+ = (H^T H + δI)^{-1} H^T   (I: identity matrix)
  instead of H^+ = (H^T H)^{-1} H^T
• It reduces the maximum singular value of H̃^+
Audio examples (noisy reverberant vs. processed): noise amplification is greatly mitigated
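The effect of the added constant on the worst-case gain can be illustrated with an arbitrary ill-conditioned matrix (a sketch; δ and the matrix are made up here, not room data):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((20, 16))
H[:, 0] *= 1e-4                            # force bad conditioning
smax = lambda A: np.linalg.svd(A, compute_uv=False)[0]

H_pinv = np.linalg.pinv(H)                 # plain pseudo-inverse
delta = 1e-2                               # small positive constant
H_reg = np.linalg.inv(H.T @ H + delta * np.eye(16)) @ H.T
print(smax(H_pinv), smax(H_reg))           # regularization shrinks the max gain
```

The singular values of H̃^+ are σ/(σ² + δ), which is bounded by 1/(2√δ) regardless of how small σ_min(H) is; this is why the noise amplification is mitigated.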
Page 113
Room acoustics motivated approach for robustness
• Channel shortening: set “direct signal + early reflections” as the target signal, and reduce only the late reverberation
(Illustration of an RIR over time t: direct path, early reflections (about the first 30 ms), late reverberation)
Inverse filtering: w^T = e^T H̃^+
Channel shortening: w^T = h_e^T H̃^+, where the RIR is split as h = h_e + h_l into an early part h_e (the target) and a late part h_l
The target-to-reverberation ratio (TRR) w/ channel shortening (e.g., 8 dB) is much higher than the TRR w/ inverse filtering (e.g., -3 dB)
Audio examples: noisy reverberant vs. processed
Page 114
Intermediate summary II-1
• Dereverberation: inversion of RIRs, assuming the RIRs to be a time-invariant linear system
• An inverse filter exists when we have more microphones than sources, but it may be very sensitive to additive noise
• An ‘approximate’ inverse filter based on regularization and channel shortening is robust against noise
Page 115
Part II. Multichannel blind inverse filtering
- Example applications
  - Professional audio post production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering
  - What is an inverse filter
  - Robust ‘approximate’ inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with time-varying source model
- Integration with blind source separation
Page 116
Blind inverse filtering based dereverberation
(Figure: an unknown speech production system generates clean speech s_t; unknown RIRs and mics produce the reverberant speech x_t^{(m)}; inverse filtering yields the dereverberated speech y_t. The inverse filter w must be estimated from x_t^{(m)} alone.)
Two approaches:
• RIR estimation + RIR inversion
• direct estimation of the inverse filter
Page 117
Conventional decorrelation approaches for a stationary white signal
Approach: estimate w that decorrelates x_t^{(m)}.
Reverberation increases the correlation of the observations (E[x_t^{(m)} x_{t'}^{(m)}] ≠ 0 for t ≠ t'); the inverse filter is estimated so that the output is again stationary white, E[y_t y_{t'}] = 0 for t ≠ t'.
• SOS approach assumes s_t to be stationary white Gaussian (E[s_t s_{t'}] = 0 for t ≠ t'):
  multichannel linear prediction (MCLP) [Slock 1994], [Abed-Meraim 1997]
• HOS approach assumes s_t to be an i.i.d. sequence:
  higher order decorrelation [Sato 1975], [Bellini 1994]
Page 118
Multichannel linear prediction (MCLP)
Predict the reverberation in the current observation from the past observations of mic 1 … mic M:
  x_t^{(1)} = s_t + r_t^{(1)} ,  r_t^{(1)} = ∑_{m=1}^{M} ∑_{k=1}^{L} c_k^{(m)} x_{t-k}^{(m)}
r_t^{(1)}: reverberation
Page 119
MCLP based decorrelation [Slock 1994], [Abed-Meraim 1997]
• x_t^{(1)} is modeled by
    x_t^{(1)} = ∑_{m=1}^{M} ∑_{k=1}^{K} c_k^{(m)} x_{t-k}^{(m)} + s_t = c^T x_{t-1} + s_t
  where c = [c_1^{(1)}, …, c_K^{(1)}, …, c_1^{(M)}, …, c_K^{(M)}]^T are the prediction coeffs.
  c^T x_{t-1}: predicted signal (= reverberation);  s_t = x_t^{(1)} - c^T x_{t-1}: prediction error (= direct signal)
  c is equivalent to the inverse filter w
• c can be estimated by minimizing the prediction error when the sources are stationary and uncorrelated in time:
    ĉ = argmin_c ∑_t |x_t^{(1)} - c^T x_{t-1}|²
  Quadratic form: optimized using a closed-form solution
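A minimal numpy sketch of this closed-form estimation on synthetic data (the white source, the random RIRs, and all sizes are my assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
T, M, K = 4000, 2, 20
s = rng.standard_normal(T)                         # stationary white source
h = [np.r_[1.0, 0.5 * rng.standard_normal(9)],     # h0 = 1 on channel 1
     rng.standard_normal(10)]
x = np.stack([np.convolve(s, hm)[:T] for hm in h]) # reverberant mic signals

# data matrix of delayed samples x_{t-1} ... x_{t-K} for every channel
X = np.stack([np.r_[np.zeros(k), x[m, :T - k]]
              for m in range(M) for k in range(1, K + 1)])
# closed-form solution of min_c sum_t |x_t^(1) - c^T x_{t-1}|^2
c, *_ = np.linalg.lstsq(X.T, x[0], rcond=None)
y = x[0] - c @ X                                   # prediction error = direct signal
print(np.corrcoef(y[K:], s[K:])[0, 1])
```

Because the source is white, the prediction error recovers the source almost perfectly; the correlation printed at the end is close to 1.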
Page 120
Why dereverberation can be achieved by MCLP
h_0^{(1)} = 1 is usually assumed for MCLP without loss of generality, so x_t^{(1)} = s_t + (true reverberation), where the true reverberation, written h̄^T s̄_{t-1}, depends only on past source samples. Assume the source is uncorrelated in time, E[s_t s_{t'}] = 0 for t ≠ t'. Then
  ∑_t |x_t^{(1)} - c^T x_{t-1}|²
    = ∑_t |s_t|² + ∑_t |h̄^T s̄_{t-1} - c^T x_{t-1}|² + 2 ∑_t s_t (h̄^T s̄_{t-1} - c^T x_{t-1})
The cross term vanishes because ∑_t s_t x_{t-1}^T = 0 (and thus ∑_t s_t s̄_{t-1}^T = 0) under this assumption. Minimization is therefore achieved only when
  c^T x_{t-1} = h̄^T s̄_{t-1}   (predicted reverberation = true reverberation)
Page 121
Robustness of MCLP against noise
Let z_t^{(m)} = x_t^{(m)} + n_t^{(m)} be the noisy reverberant observation, where n_t^{(m)} is additive noise (or can be viewed as modeling error).
Assume x_t^{(m)} and n_t^{(m)} to be uncorrelated; then the cost function becomes
  ∑_t |z_t^{(1)} - c^T z_{t-1}|² = ∑_t |x_t^{(1)} - c^T x_{t-1}|² + ∑_t |n_t^{(1)} - c^T n_{t-1}|²
The first term is the cost function for dereverberation; the second penalizes noise amplification: regularization is inherently included.
Page 122
Problem of the decorrelation approach for speech dereverberation
(Figure: unknown speech production system → clean speech s_t → unknown RIRs → x_t^{(m)} → inverse filtering → y_t)
Problem: the speech production system also correlates s_t, so the estimated filter does not only dereverberate but also decorrelates s_t; both speech and RIRs are decorrelated
Key to the solution: use cues to separate speech and RIRs
Page 123
Cues for separating speech and RIRs

Cue: Nonstationarity
  Speech: stationary only within short time periods of the order of 30 ms
  RIRs: stationary over long time periods of the order of 1000 ms or larger
Cue: Auto-correlation duration
  Speech: correlated only within short time intervals of the order of 30 ms
  RIRs: correlated within long time intervals over 100 ms
Cue: Inter-channel difference
  Speech: common to all the microphone signals
  RIRs: different for each microphone
Page 124
Approaches to blind inverse filtering (with the cue each one exploits)
• Subspace method (RIR estimation + inversion): inter-channel difference
  [Furuya 1997], [Gannot 2003], [Gaubitch 2006]
• Pre-whitening + decorrelation: auto-correlation duration
  second-order statistics (SOS): [Gaubitch 2003], [Furuya 2007], [Triki 2007]
  higher-order statistics (HOS): [Gillespie 2001]
• Channel shortening: auto-correlation duration
  [Gillespie 2003], [Kinoshita 2009]
• Joint speech and reverberation modeling: auto-correlation duration and nonstationarity
  [Hopgood 2003], [Buchner (TRINICON) 2010], [Yoshioka 2007], [Nakatani 2008]
Page 125
Pre-whitening + decorrelation
Assumption: pre-whitening decorrelates the observation only with respect to the speech, so that the reverberant speech x_t = H s_t becomes
  x̃_t = H s̃_t
where s̃_t is an unknown decorrelated (whitened) speech signal
Procedure: pre-whiten x_t^{(m)} (reduce the correlation within short time intervals), estimate w that decorrelates x̃_t^{(m)}, then apply the inverse filtering w to the original observation to obtain y_t^{(m)}
• A typical method for pre-whitening: low-dimensional (e.g., 12-dim) single channel linear prediction is often used
Page 126
Channel shortening
• Introduce constraints so that dereverberation reduces only the late reverberation (the direct path and early reflections are kept): this makes derev. robust and does not decorrelate the speech
• Techniques:
  correlation shaping [Gillespie 2003]
  multistep MCLP [Kinoshita 2009]
Page 127
Multistep MCLP [Gesbert 1997], [Kinoshita 2009]
Predict only the late reverberation in the current observation from delayed past observations of mic 1 … mic M:
  x_t^{(1)} = s_t + r_t^{(1)} ,  r_t^{(1)} = ∑_{m=1}^{M} ∑_{k=D}^{K} c_k^{(m)} x_{t-k}^{(m)}
with prediction delay D (= 30-50 ms); s_t: direct signal + early reflections, r_t^{(1)}: late reverberation
Page 129
Joint speech and reverberation modeling for derev.
Unknown true generative system: source process → reverberation process → reverberant observation x_t
Model of the generative system: a parametric source model (parameters θ_s) and a parametric reverberation model (parameters θ_h), giving p̃(x_t; θ_s, θ_h)
Parameter estimation by
• likelihood maximization [Hopgood 2003], [Yoshioka 2007], [Nakatani 2008]
• Kullback-Leibler divergence minimization [Buchner (TRINICON) 2010]
Are the two models distinguishable?
Page 130
Models for the source process and the reverberation process
Source model (SOS or HOS): time-varying & correlated only within a short interval
Reverberation model: stationary & correlated over a long interval
→ Distinguishable
Page 131
Multichannel blind partial deconvolution (MCBPD) by TRINICON
Cost function for SOS-TRINICON [Buchner 2010]:
  J_SOS = ∑_t ( log det R̂_{s,t} - log det R̂_{y,t} )
with R̂_{y,t} the autocorrelation matrix of the output y_t and R̂_{s,t} the source-model autocorrelation matrix; the cost drives the output statistics toward the source model, so that MCBPD performs decorrelation (deconvolution) only of the reverberation
Page 132
Part II. Multichannel blind inverse filtering
- Example applications
  - Professional audio post production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering
  - What is an inverse filter
  - Robust ‘approximate’ inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with time-varying source model
- Integration with blind source separation
Page 133
MCLP with time-varying source model for dereverberation [Yoshioka 2007], [Nakatani 2008, 2011]
Source process (SOS): time-varying short-time Gaussian
Reverberation process: MCLP
→ Distinguishable by likelihood maximization
Page 134
Reformulation of MCLP based on likelihood maximization
  L(c) = log p(x; c)
       = ∑_t log p(x_t^{(1)} | x_{1:t-1}; c) + const.   (conditional probability rule)
       = ∑_t log p_s(s_t) + const. ,  with s_t = x_t^{(1)} - c^T x_{t-1}   (source model)
Assume p_s(s_t) = N(s_t; 0, 1) (stationary white Gaussian); then
  L(c) = -(1/2) ∑_t |x_t^{(1)} - c^T x_{t-1}|² + const.
Maximizing the likelihood is then equivalent to minimizing the prediction error:
  max_c L(c)  ⇔  min_c ∑_t |x_t^{(1)} - c^T x_{t-1}|²
Page 135
Time-varying Gaussian source model (TVGSM)
1. Each short time segment (of the order of 30 ms) is stationary multivariate Gaussian, characterized by its autocorrelation matrix:
     p_s(s̄_t; R_t) = N(s̄_t; 0, R_t) ,  s̄_t = [s_t, s_{t-1}, …, s_{t-N+1}]^T ,  R_t = E[s̄_t s̄_t^T]
2. R_t varies over different time segments
The R_t are the parameters to be estimated
Page 136
MCLP with multivariate source model
• The prediction error s_t = x_t^{(1)} - c^T x_{t-1} is assumed to follow the TVGSM. Stacking N successive samples (N of the order of 30 ms):
    s̄_t = x̄_t^{(1)} - X_{t-1} c
  where s̄_t = [s_t, …, s_{t-N+1}]^T, x̄_t^{(1)} = [x_t^{(1)}, …, x_{t-N+1}^{(1)}]^T, and X_{t-1} = [x_{t-1}, x_{t-2}, …, x_{t-N}]^T
Page 137
Likelihood function of MCLP with TVGSM
  L(c, R) = ∑_t log p_s(s̄_t; c, R_t) ,  where p_s(s̄_t; R_t) = N(s̄_t; 0, R_t) and s̄_t = x̄_t^{(1)} - X_{t-1} c
  L(c, R) = -∑_t ( ||x̄_t^{(1)} - X_{t-1} c||²_{R_t^{-1}} + log |R_t| ) + const.
where ||s̄||²_{R^{-1}} = s̄^T R^{-1} s̄ (quadratic form): the prediction error weighted by R_t^{-1}, plus a normalization term
Page 138
Iterative optimization procedure
Initialize the source model from the observation: R̂_t = E[x̄_t x̄_t^T]
Iterate:
1. Dereverberate: obtain the stacked prediction error s̄_t = x̄_t^{(1)} - X_{t-1} ĉ
2. Update the source model: R̂_t = E[s̄_t s̄_t^T] (autocorrelation matrix of the estimated s̄_t)
3. Update the prediction coeffs. (closed form): ĉ = argmin_c ∑_t ||x̄_t^{(1)} - X_{t-1} c||²_{R̂_t^{-1}}
A few iterations are sufficient for convergence
Page 139
Importance of the time-varying source model
Spectrograms (frequency in kHz vs. time): source signal, observation, processed A, processed B
(A) MCLP with stationary white Gaussian source model
(B) MCLP with TVGSM
Conditions: T60: 0.5 s, recording: 2.5 s, source-mic distance: 1.5 m, # mics: 2
A few seconds of observation are sufficient for dereverberation
Page 140
Blind inverse filtering works in noisy environments
Setup: # mics: 8, source-mic distance: 2 m; noise: additive white noise (reproduced and recorded by 8 mics); dereverberation with multistep MCLP w/ TVGSM
*TRR: target-to-reverberation ratio (target = direct signal + early reflections)
TRR of the noisy reverberant speech vs. the processed signal (read from the table):
  T60 = 0.65 s (TRR 0.1 dB): processed TRR 10.3 dB (SNR 10 dB) and 11.4 dB (SNR 15 dB)
  T60 = 0.39 s (TRR 5.8 dB): processed TRR 13.8 dB (SNR 10 dB) and 15.2 dB (SNR 15 dB)
Reverberation is clearly reduced; noise may slightly increase, but not significantly
Page 141
Computationally efficient implementation
• Subband decomposition approach [Nakatani 2010], [Yoshioka 2009b]: subband analysis of x_t^{(m)} into x_{n,f}^{(m)}, MCLP with TVGSM in each subband, then subband synthesis of y_{n,f} into y_t
• Computational efficiency largely improves:
  real-time factor (RTF) using MATLAB (RT60: 0.5 s, # mics: 2): time-domain 170, subband 0.8
Page 142
Processing flow with subband decomposition [Nakatani 2010]
1. Set analysis parameters: D: prediction delay (# of subband samples corresponding to 30 ms, or larger), L: length of the prediction filter, M: # of mics, m0: index of the target channel to be dereverberated, η: a coeff. for the flooring constant (e.g., η = 10^-4)
2. Decompose the multichannel observed signal into a set of subband signals x_{n,f}^{(m)} (e.g., [Weiss 2000]; the STFT can also be used), with m: channel index, n: sample index, f: subband index; e.g., the # of subbands is 512 (including negative frequencies) for 16 kHz sampling
3. In each subband f, set initial estimates of the source variance as
     λ_{n,f} = max( |x_{n,f}^{(m0)}|², ε_f ) ,  where ε_f = η max_n |x_{n,f}^{(m0)}|² is a flooring constant for subband f
4. Obtain the vector representation of x_{n,f}^{(m)} over all channels as
     x̄_{n,f} = [x_{n,f}^{(1)T}, x_{n,f}^{(2)T}, …, x_{n,f}^{(M)T}]^T ,  x_{n,f}^{(m)} = [x_{n,f}^{(m)}, x_{n-1,f}^{(m)}, …, x_{n-L+1,f}^{(m)}]^T
   where T is the non-conjugate transposition
5. In each subband f, iterate the following until convergence is achieved:
   i. Obtain the prediction filter as
        c_f = ( ∑_n x̄_{n-D,f} x̄_{n-D,f}^{*T} / λ_{n,f} )^+ ( ∑_n x̄_{n-D,f} x_{n,f}^{(m0)*} / λ_{n,f} )
      where ^+ and * are the Moore-Penrose pseudo-inverse and complex conjugate operations (see [Yoshioka 2009b] for efficient calculation)
   ii. Obtain the dereverberated subband signal as
        y_{n,f} = x_{n,f}^{(m0)} - c_f^{*T} x̄_{n-D,f}
   iii. Update the source variance estimates as
        λ_{n,f} = max( |y_{n,f}|², ε_f )
6. Compose the dereverberated signal from the set of dereverberated subband signals y_{n,f}
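The per-subband iteration of step 5 can be sketched as follows (an STFT-style complex-valued sketch under simplified assumptions; the function and the toy signal are mine, not code from [Nakatani 2010]):

```python
import numpy as np

def wpe_bin(Xf, D=3, L=10, iters=3, eta=1e-4):
    """Dereverberate one subband. Xf: (M, N) complex mic signals, target channel 0."""
    M, N = Xf.shape
    x0 = Xf[0]
    eps = eta * np.max(np.abs(x0) ** 2)              # flooring constant
    lam = np.maximum(np.abs(x0) ** 2, eps)           # initial source variances
    # delayed data matrix: taps n-D ... n-D-L+1 for every channel, (M*L, N)
    Xd = np.stack([np.r_[np.zeros(D + k, complex), Xf[m, :N - D - k]]
                   for m in range(M) for k in range(L)])
    y = x0.copy()
    for _ in range(iters):
        Xw = Xd / lam
        A = Xw @ Xd.conj().T                         # sum_n xbar xbar^H / lam_n
        b = Xw @ x0.conj()                           # sum_n xbar x0* / lam_n
        c = np.linalg.solve(A, b)                    # prediction filter
        y = x0 - c.conj() @ Xd                       # dereverberated subband signal
        lam = np.maximum(np.abs(y) ** 2, eps)        # update source variances
    return y

# toy check: complex white source, late reverberation added beyond the delay D
rng = np.random.default_rng(4)
N = 600
s = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
g = np.zeros((2, 9), complex)
g[0, 0] = 1.0
g[0, 4:] = 0.5 * (rng.standard_normal(5) + 1j * rng.standard_normal(5))
g[1] = 0.5 * (rng.standard_normal(9) + 1j * rng.standard_normal(9))
Xf = np.stack([np.convolve(s, gm)[:N] for gm in g])
y = wpe_bin(Xf)
```

In a full system this function would be applied independently in every subband f, with the filter length L and delay D chosen as in step 1.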
Page 143
Part II. Multichannel blind inverse filtering
- Example applications
  - Professional audio post production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering
  - What is an inverse filter
  - Robust ‘approximate’ inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with time-varying source model
- Integration with blind source separation
Page 144
BSS + dereverberation
(Figure: mixed reverberant speech contains the direct signals and the reverberation of several speakers; integrated BSS + dereverberation extracts clean speech estimates from the mixture)
Approaches:
• MCLP based approach [Yoshioka 2009b, 2011]
• TRINICON [Buchner 2010]
Page 145
Generative model for reverberant sound mixture
Source processes 1, 2, …: time-varying Gaussian → s_t^{(1)}, s_t^{(2)}, …
Mixture process: instantaneous mixing → non-reverberant mixture
Reverberation process: multi-input multi-output MCLP → reverberant mixture x_t^{(m)}
All parts are jointly optimized by a maximum likelihood estimation approach [Yoshioka 2009b, 2011]
Page 146
Optimization procedure (subband-based implementation)
Initialization, then iterate until converged:
1. Compute source estimates
2. Update source model parameters
3. Update de-mixing matrices
4. Update prediction coefficients
Each update is a closed-form optimization
Page 147
Improvement in signal-to-interference ratio (SIR)
Conditions: # sources: 2, # mics: 4, source-mic distance: 1.5 m, recording: 1 to 8 s (average: 3.5 s); BSS: [Sawada 2007]; results averaged over 672 pairs of utterances (TIMIT test set)
Bar chart (SIR in dB, 0-16 scale) for T60 = 0.3 s and T60 = 0.5 s: Unprocessed < BSS < Dereverb + BSS
Page 148
Live demo
Page 149
TRINICON: a general framework for blind MIMO signal processing
(Figure: P sources s^{(1)}, …, s^{(P)} → unknown mixing system → x^{(1)}, …, x^{(M)} → unmixing system W_b → outputs y^{(1)}, …, y^{(P)}; b: index of signal blocks)
Cost function [Buchner 2010], with PD-variate pdfs (P: source number, D: filter length):
  J(b, W) = ∑_i β(i, b) ∑_j { log p̂_{y,PD}(y(i, j)) - log p̂_{s,PD}(y(i, j)) }
• p̂_{s,PD}: pdf model for the sources (assumed or estimated)
• p̂_{y,PD}: pdf of the output
Page 150
Comparison of SOS and HOS by TRINICON [Buchner 2010]
Conditions: # mics: 4, # sources: 2, T60: 700 ms, source-mic distance: 1.65 m, recording: 30 sec
Plots over the number of iterations: SIR improvement (dB) and signal-to-reverberation ratio (SRR) improvement (dB) for SOS, SOS+HOS, and BSS (w/o derev.)
Page 151
Summary II-2
• Robust blind inverse filtering is possible
  - using joint speech and reverberation modeling
    • based only on a few seconds of observation (e.g., 2.5 s)
    • with a relatively small computational cost (e.g., RTF < 1)
    • in an online processing manner (e.g., latency = 1 s)
  - under low SNR conditions (e.g., 10 dB SNR)
• Future challenges
  - real-time adaptation of the inverse filter [Yoshioka 2009a], [Evers 2011]
  - single channel inverse filtering [Gillespie 2001]
  - processing under more adverse noise conditions such as nonstationary diffuse noise
  - optimal integration of inverse filtering and spectral enhancement based dereverberation
Page 152
Part III:
Robust Automatic Speech Recognition (ASR)
in Reverberant Environments
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 88
Page 153
Part III: Robust ASR in Reverberant Environments
Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 89
Page 155
ASR System
Block Diagram (figure): speech signal → pre-processing → feature extraction → recognition (with acoustic model and language model) → transcription; training path from transcription to acoustic model; intervention points A-D and REMOS marked in the diagram
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 90
Page 156
Feature Extraction: Calculation of MFCCs
Processing chain: s_t → Hamming window → DFT → |(·)|² → mel filtering → melspec coefficients s_n^MEL → log → logmelspec coefficients s_n → DCT → MFCCs s_n^MFCC
Goal: dimensionality reduction
MFCCs: Mel Frequency Cepstral Coefficients
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 91
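The chain above can be sketched for a single frame (a toy implementation; the triangular filterbank construction and all sizes are illustrative rather than a standard-conformant MFCC front end):

```python
import numpy as np

def mfcc_frame(frame, sr=16000, n_mel=24, n_ceps=13):
    """Toy single-frame MFCC: window -> |DFT|^2 -> mel filterbank -> log -> DCT-II."""
    N = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(N))) ** 2   # power spectrum

    # triangular mel filterbank (simplified construction)
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel2hz(np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mel + 2))
    bins = np.floor((N + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mel, len(spec)))
    for i in range(n_mel):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fb[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    logmel = np.log(fb @ spec + 1e-10)                       # logmelspec coefficients
    k = np.arange(n_mel)
    dct2 = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_mel))
    return dct2 @ logmel                                     # MFCCs
```

The DCT at the end compresses the 24 logmelspec values into 13 coefficients, which is the dimensionality reduction mentioned on the slide.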
Page 161
Acoustic Modeling
Hidden Markov Model (HMM) λ: left-to-right chain of states 1 … 6 with self-transitions a22, a33, a44, a55, forward transitions a12, a23, a34, a45, a56, and state-conditional emission pdfs p(sn|qn = j) (e.g., p(sn|qn = 2), p(sn|qn = 5))
Powerful model for:
temporal variation
spectral variation
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 92
Page 163
Dispersive Effect of Reverberation
Logmelspec features (dB scale, mel channel l vs. frame n) of a clean and a reverberant utterance “four, two, seven”
Dispersive effect of reverberation:
features smeared along time axis
Time-frequency pattern is changed
Inter-frame correlation is increased
Different statistical properties to be captured by acoustic model
Contradiction to conditional independence assumption of HMMs
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 93
Page 166
Explanation of Dispersive Effect
(Figure: TD and FD representations of the initial RIR segment h_t; the RIR extends over several analysis frames)
Time-domain (TD) description of reverberant speech x_t:
  x_t = h_t ∗ s_t
RIR typically much longer than analysis window
Feature-domain (FD) description of x_n^MEL: melspec convolution
  x_n^MEL = ∑_{τ=0}^{T_H - 1} h_τ^MEL ⊙ s_{n-τ}^MEL
s_n^MEL: clean-speech feature vector
x_n^MEL: reverberant feature vector
h_n^MEL: melspec RIR representation
⊙: element-wise multiplication
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 94
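The melspec convolution is straightforward to implement directly from the formula above (a minimal sketch; the array shapes are my choice):

```python
import numpy as np

def melspec_convolve(S, H):
    """x_n = sum_tau H[:, tau] * S[:, n - tau] (element-wise per mel channel).
    S: (n_mel, N) clean melspec frames; H: (n_mel, T_H) melspec RIR."""
    n_mel, N = S.shape
    T_H = H.shape[1]
    X = np.zeros((n_mel, N))
    for n in range(N):
        for tau in range(min(T_H, n + 1)):
            X[:, n] += H[:, tau] * S[:, n - tau]
    return X
```

With a one-frame RIR of all ones (tau = 0 only) the output reproduces the clean features; longer melspec RIRs smear each frame's energy into the following frames, which is exactly the dispersive effect shown on the previous slides.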
Page 168
Illustration of Melspec Convolution
x_n^MEL = h_n^MEL ∗ s_n^MEL, expanded frame by frame:
  x_n^MEL = h_0^MEL ⊙ s_n^MEL + h_1^MEL ⊙ s_{n-1}^MEL + h_2^MEL ⊙ s_{n-2}^MEL + …
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 95
Page 171
Accuracy of Melspec Convolution
Logmelspec features (dB scale, channel l vs. frame n) of:
a) Clean utterance
b) Reverberant utterance
c) Melspec convolution
d) Simple multiplication
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 96
Page 172
Statistical Properties of Reverberant Speech Features
Example: digit “seven”
(Figures: logmelspec of the clean and of the reverberant utterance (mel channel l vs. frame n); means of the clean logmelspec HMM (mel channel l vs. state j); logmelspec RIR representation (mel channel l vs. frame delay τ), dB scale)
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 97
Page 173
Statistical Properties of Reverberant Speech Features
Histograms (estimated pdfs) of clean vs. reverberant features, for state j = 1, channel l = 3 and for state j = 5, channel l = 21
Auto-CoVariances (ACVs) of clean and of reverberant speech for state j = 9 (mel channel l vs. frame τ)
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 98
Page 175
Recognition Results
Word Accuracy as Function of Reverberation Time
(Plot: word accuracy in % vs. reverberation time T60 in ms)
Task: read sentences from the Wall Street Journal (WSJ 5K task)
Features: MFCCs + ∆ + ∆∆ coefficients
Recognizer: cross-word triphones, 3 states per triphone, 16 Gaussians per state
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 99
Page 176
Which Part of Reverberation is Harmful for ASR?
Word Accuracy as Function of Dereverberation Start Time [Sehr 2010a]
(Plot: word accuracy in % vs. TDEREV in ms, for SNRs of 0, 5, 10, 15, 20, 30, and ∞ dB)
Task: connected digits (TI digits)
Features: MFCCs + ∆ coefficients
Recognizer: word-level HMMs, 16 states per digit, 3 Gaussians per state
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 100
Page 182
Strategies for Reverberation-Robust ASR
Block diagram of ASR system
[Block diagram: speech signal → pre-processing (A) → feature extraction (B) → recognition → transcription; the recognition stage (D) draws on an acoustic model (C), trained from transcribed training data, and a language model; REMOS spans the acoustic model and the decoder]
Strategies
A) signal-based approaches
B) feature-based approaches
C) model-based approaches
D) decoder-based approaches
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 101
Page 183
Part III: Robust ASR in Reverberant Environments
Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 102
Page 186
Key Ideas of Feature-based Approaches
Three Different Approaches
Feature compensation
⇒ Example: Cepstral mean normalization (CMN)
Features insensitive to reverberation
⇒ Example: RASTA features
Features facilitating the capture of statistical properties
⇒ Example: Dynamic features
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 103
Page 193
Cepstral Mean Normalization [Atal 1974]
If the impulse response is much shorter than the STFT analysis window:
x_t = h_t ∗ s_t
|X^STFT_{n,k}|² ≈ |H^STFT_k|² · |S^STFT_{n,k}|²
x^MFCC_{n,c} ≈ h^MFCC_c + s^MFCC_{n,c}
Cepstral Mean Normalization
x^CMN_{n,c} = x^MFCC_{n,c} − x̄^MFCC_c
x̄^MFCC_c = (1/N) Σ_{n=1}^{N} x^MFCC_{n,c} ≈ h^MFCC_c + s̄^MFCC_c
x^CMN_{n,c} ≈ h^MFCC_c + s^MFCC_{n,c} − (h^MFCC_c + s̄^MFCC_c) = s^MFCC_{n,c} − s̄^MFCC_c = s^CMN_{n,c}
⇒ convolution compensated
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 104
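The CMN derivation above can be checked numerically. A minimal numpy sketch (the feature matrix s and the constant channel cepstrum h are illustrative values, not from the tutorial):

```python
import numpy as np

def cepstral_mean_normalization(x_mfcc):
    """Subtract the per-coefficient mean over all N frames.

    x_mfcc: array of shape (N, C) -- N frames, C cepstral coefficients.
    A short channel impulse response appears as a constant offset h_c per
    coefficient, so subtracting the utterance mean removes it.
    """
    return x_mfcc - x_mfcc.mean(axis=0, keepdims=True)

# A constant channel offset h is removed exactly:
rng = np.random.default_rng(0)
s = rng.standard_normal((100, 13))   # "clean" cepstra (illustrative)
h = np.full(13, 2.5)                 # channel cepstrum, constant over frames
x = s + h                            # short-IR channel is additive in cepstrum
print(np.allclose(cepstral_mean_normalization(x),
                  s - s.mean(axis=0, keepdims=True)))  # True
```

As on the slide, the result equals the CMN-normalized clean features; only the (unknown) clean-speech mean is lost.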
Page 195
CMN - Illustration 1st-order Highpass Filter
Clean vs. Highpass Filtered Logmel Features
No CMN
[Figure: logmel features over t in s (0.2–1.6) and mel channel (1–20), clean vs. highpass filtered, value range −16 to 4]
With CMN
[Figure: the same features after CMN, value range −12 to 8]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 105
Page 197
CMN - Illustration Reverberation
Clean vs. Reverberant (T60 = 900 ms) Logmel Features
No CMN
[Figure: logmel features over t in s (0.2–1.6) and mel channel (1–20), clean vs. reverberant, value range −16 to 4]
With CMN
[Figure: the same features after CMN, value range −12 to 6]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 106
Page 201
CMN - Discussion
Approach
Apply CMN to both training and test data
⇒ Short impulse responses can be compensated
+ Good for compensating different microphone characteristics or
different telephone channels
+ Good for compensating coloration due to early reflections
− But: not suitable for compensating late reverberation
Further considerations
Reliable only if utterance is long enough (>4 s [Droppo 2008])
Extensions necessary for different speech activity rates of
training and test data [Droppo 2008]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 107
Page 205
RASTA (RelAtive SpecTrA) Features [Hermansky 1994]
Background
Speed of spectral changes of speech:
⇒ limited by movements of articulators in vocal tract
Many non-speech effects:
⇒ characterized by short time-invariant impulse responses
Examples: microphone characteristics, telephone channels
Analysis artifacts:
⇒ very fast spectral changes
Idea
Remove very slow and fast spectral changes from features:
⇒ bandpass filtering in each channel
+ Insensitivity to slow and fast spectral changes
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 108
Page 206
RASTA Features: Block Diagram
Calculation of RASTA Features
[Block diagram: input x_t → filterbank channels 0…L → log(·) per channel → bandpass filters H_0(e^jΩ) … H_L(e^jΩ) → exp(·) → form vectors → x^RASTA_n]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 109
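The per-channel bandpass filtering can be sketched as follows. The filter coefficients below are the ones commonly quoted for the RASTA filter of [Hermansky 1994] — treat them as an assumption, not as values given in this tutorial:

```python
import numpy as np
from scipy.signal import lfilter

# Commonly quoted RASTA bandpass (assumed coefficients):
# H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
A = np.array([1.0, -0.98])

def rasta_filter(logmel):
    """Bandpass-filter each log-mel channel along the frame axis.
    logmel: shape (N_frames, L_channels)."""
    return lfilter(B, A, logmel, axis=0)

# The numerator coefficients sum to zero, i.e. the filter has a zero at DC,
# so a constant log-spectral offset (short time-invariant channel) decays
# toward zero:
const = np.ones((400, 1))
y = rasta_filter(const)
print(abs(y[-1, 0]) < 0.01)  # True
```

This is exactly the property the slide motivates: very slow spectral changes (and, via the filter's upper band edge, very fast ones) are suppressed in each channel.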
Page 208
RASTA Features: Discussion
RASTA Features
Effective for short time-invariant impulse responses (like CMN)
+ Good for compensating different microphone characteristics or
different telephone channels
+ Good for compensating coloration due to early reflections
Reverberation described by long RIRs
− Therefore: not suitable for compensating late reverberation
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 110
Page 212
Dynamic Features [Furui 1986]
Idea
Temporal changes of short-time spectra:
⇒ important for discriminating phonemes
First and second derivatives of static features (∆ and ∆∆ features):
⇒ capture these changes
∆ Feature Calculation
Simple difference: ∆s_n = s_{n+κ} − s_{n−κ}, typical: κ ∈ {1, 2}
Regression: ∆s_n = ( Σ_{κ=1}^{N∆} κ · (s_{n+κ} − s_{n−κ}) ) / ( 2 · Σ_{κ=1}^{N∆} κ² ), typical: N∆ ∈ {2, 3, 4}
∆∆ features: calculated in a similar way from the ∆ features
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 111
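The regression formula above can be sketched directly (edge padding at the utterance boundaries is an implementation choice, not specified on the slide):

```python
import numpy as np

def delta(s, n_delta=2):
    """Regression-based Δ features (second formula on the slide).

    s: feature matrix of shape (N, C); n_delta: window half-width N_Δ.
    Frames are edge-padded so the output has the same length as s.
    """
    s_pad = np.pad(s, ((n_delta, n_delta), (0, 0)), mode="edge")
    num = sum(k * (s_pad[n_delta + k:len(s) + n_delta + k]
                   - s_pad[n_delta - k:len(s) + n_delta - k])
              for k in range(1, n_delta + 1))
    return num / (2 * sum(k**2 for k in range(1, n_delta + 1)))

# For a linear ramp s_n = n the regression recovers the slope exactly
# (away from the padded edges):
s = np.arange(20, dtype=float).reshape(-1, 1)
print(np.allclose(delta(s)[2:-2], 1.0))  # True
```

∆∆ features are then obtained by applying the same function to the ∆ features.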
Page 216
Why are Dynamic Features interesting for Reverberant ASR?
Reverberant Speech
Long-term relations between feature vectors
Cannot be captured by HMMs
⇒ Mitigation by feature vectors with long temporal reach
Temporal reach of features
Static features: typically 10 ms – 40 ms
∆ features: typically 20 ms – 120 ms
∆∆ features: typically 30 ms – 200 ms
Dynamic features can partly capture long-term relations
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 112
Page 217
Model-based Feature Enhancement [Krueger 2010]
[Block diagram: reverberant speech x_t → feature extraction → reverberant logmelspec coefficients x_n; RIR parameter estimation yields T60; inference combines the a priori model p(s_n|s_{n−1}) and the observation model p(x_n|s_{n−T_H:n}) into the posterior p(s_n|x_{1:n}); enhanced logmelspec coefficients ŝ_n → DCT → enhanced MFCCs ŝ^MFCC_n → ASR → estimated transcription]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 113
Page 222
Model-based Feature Enhancement [Krueger 2010]
A Priori Model: Clean Speech Model
Switching linear dynamic model with hidden states q_{n−3}, q_{n−2}, q_{n−1}, q_n driving the feature states s_{n−3}, s_{n−2}, s_{n−1}, s_n:
s_n = A(q_n) s_{n−1} + b(q_n) + u_n
p(s_n | s_{n−1}, q_n) = N( s_n ; A(q_n) s_{n−1} + b(q_n), Σ_u(q_n) )
Model for non-stationary feature vector sequences of clean speech
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 114
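The switching linear dynamic model above can be sampled with a few lines of numpy; all regime parameters below are toy values for illustration, not parameters from [Krueger 2010]:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy switching linear dynamic model with M = 2 regimes (illustrative values):
A = [0.9 * np.eye(2), 0.5 * np.eye(2)]    # per-regime transition matrices A(q)
b = [np.zeros(2), np.array([1.0, -1.0])]  # per-regime offsets b(q)
sigma_u = 0.1                             # std of the driving noise u_n

def sample_sldm(q_seq, s0):
    """Draw s_1..s_N via s_n = A(q_n) s_{n-1} + b(q_n) + u_n,
    given a hidden regime sequence q_1..q_N."""
    s, out = s0, []
    for q in q_seq:
        s = A[q] @ s + b[q] + sigma_u * rng.standard_normal(2)
        out.append(s)
    return np.array(out)

traj = sample_sldm(q_seq=[0] * 10 + [1] * 10, s0=np.zeros(2))
print(traj.shape)  # (20, 2)
```

Each regime q_n picks its own linear dynamics, which is what lets the model follow non-stationary clean-speech trajectories.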
Page 227
Model-based Feature Enhancement [Krueger 2010]
Observation Model: Reverberation Model
Based on melspec convolution:
x_n = log( Σ_{τ=0}^{T_H} exp(h_τ + s_{n−τ}) ) + v_n = f(s_{n−T_H:n}, h_{0:T_H}) + v_n
p(v_n) = N(v_n; μ_v, Σ_v)
p(x_n | s_{n−T_H:n}) = N( x_n ; f(s_{n−T_H:n}, h_{0:T_H}) + μ_v, Σ_v )
v_n: captures the approximation error
h_{0:T_H}: based on a strictly exponentially decaying RIR model
⇒ Only T60 needs to be estimated
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 115
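The deterministic part f(·) of the observation model — melspec convolution in the log domain — can be sketched as a log-sum-exp over the RIR taps (the noise term v_n is omitted; handling of the first frames is an implementation choice):

```python
import numpy as np

def melspec_convolution(s, h):
    """x_n = log( sum_{tau=0}^{T_H} exp(h_tau + s_{n-tau}) ),
    the slide's f(s_{n-T_H:n}, h_{0:T_H}) without the error term v_n.

    s: clean log-mel features, shape (N, L); h: log-mel RIR taps,
    shape (T_H + 1, L).  Terms with n - tau < 0 are skipped.
    """
    N, L = s.shape
    x = np.full((N, L), -np.inf)
    for tau in range(h.shape[0]):
        if tau < N:
            # log-sum-exp accumulation of exp(h_tau + s_{n-tau})
            x[tau:] = np.logaddexp(x[tau:], h[tau] + s[:N - tau])
    return x

# With a single tap h_0 = 0 the model reduces to x_n = s_n:
s = np.log(np.arange(1, 7, dtype=float)).reshape(3, 2)
print(np.allclose(melspec_convolution(s, np.zeros((1, 2))), s))  # True
```

Adding further taps can only add energy, so the reverberant log-mel features are never below the clean ones — the smearing effect seen in the earlier illustrations.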
Page 232
Model-based Feature Enhancement [Krueger 2010]
Bayesian Inference
MMSE estimate: ŝ_n = E{ s_n | x_{1:n} }
p(s_n | x_{1:n}) = p(x_n | s_n, x_{1:n−1}) p(s_n | x_{1:n−1}) / ∫ p(x_n | s_n, x_{1:n−1}) p(s_n | x_{1:n−1}) ds_n
≈ p(x_n | s_{n−T_H:n}) Σ_{i=1}^{M} p(s_n | s_{n−1}, q_n = i) p(q_n = i) / ∫ p(x_n | s_n, x_{1:n−1}) Σ_{i=1}^{M} p(s_n | s_{n−1}, q_n = i) p(q_n = i) ds_n
⇒ Inference performed by a bank of iterated extended Kalman filters
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 116
Page 235
Model-based Feature Enhancement [Krueger 2010]
Discussion
Approach tailored to reverberant feature vector sequences
Long-term relations explicitly captured by the observation model
+ Promising results reported on the AURORA 5 task (connected digits)
+ Moderate computational complexity
+ Latency of only a few frames
Suitable for online recognition in reverberant environments
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 117
Page 236
Further Feature-based Approaches
[Petrick 2008] Harmonicity-based feature analysis
[Thomas 2008] Frequency-domain linear prediction
[Wölfel 2009] Particle filter-based feature enhancement
[Kumar 2010] Cepstral inverse filtering
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 118
Page 237
Part III: Robust ASR in Reverberant Environments
Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 119
Page 238
Key Idea of Model-based Approaches
Mismatch between clean HMM and reverberant data
[Diagram: the statistical properties of the clean HMM do not match those of the reverberant test data]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 120
Page 239
Key Idea of Model-based Approaches
Feature-based: “dereverberate” data
[Diagram: the test data are dereverberated so that their statistical properties match the clean HMM]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 120
Page 242
Key Idea of Model-based Approaches
Model-based: “reverberate” acoustic model
Adjust the acoustic model to the statistical properties of the reverberant data
[Diagram: the clean HMM is transformed into a reverberant HMM whose statistical properties match the reverberant test data]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 120
Page 248
Training with Reverberant Data
Matched Training
Record training data in target environment
+ Training data perfectly capture statistical properties
− Extremely high effort
Generate training data by convolution with RIRs [Giuliani 1999, Stahl 2001, Matassoni 2002]
+ Significantly reduced effort
+ Only slight degradation in recognition performance [Stahl 2001]
Multi-Style Training
Use training data from many different rooms
+ Robust HMMs
+ Very flexible
− Discrimination capability reduced compared to matched training
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 121
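Generating reverberant training data by RIR convolution can be sketched in a few lines; the exponentially decaying synthetic RIR below is purely illustrative (in practice one would use measured RIRs from the target rooms):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean, rir):
    """Create a reverberant training utterance by convolving a clean-speech
    signal with a room impulse response (output truncated to input length)."""
    return fftconvolve(clean, rir)[:len(clean)]

# Synthetic exponentially decaying RIR (illustrative decay constant):
fs = 16000
t = np.arange(int(0.3 * fs)) / fs
rir = np.exp(-t / 0.05) * np.random.default_rng(2).standard_normal(len(t))

# Sanity check: convolving a unit impulse returns the (truncated) RIR itself:
impulse = np.zeros(fs)
impulse[0] = 1.0
print(np.allclose(reverberate(impulse, rir)[:len(rir)], rir))  # True
```

Looping this over a clean corpus and a set of RIRs yields matched (one room) or multi-style (many rooms) training material at far lower cost than re-recording.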
Page 249
Matched Training
[Diagram: a matched HMM is trained so that its statistical properties coincide with those of the reverberant test data, replacing the mismatched clean HMM]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 122
Page 251
Matched Training: Modeling Accuracy
Histograms
[Figure: histograms of reverberant features with the output pdfs of the clean HMM and of the reverberant HMM, for state j = 1, channel l = 3 and for state j = 5, channel l = 21]
Auto-Covariances (ACVs)
[Figure: ACVs of reverberant speech vs. ACVs captured by the HMM for state j = 9, plotted over frame delay τ and mel channel l]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 123
Page 258
Multi-Style Training
[Diagram: training data reverberated with several different rooms yield a multi-style HMM whose statistical properties are matched against the reverberant test data]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 124
Page 263
Reverberation-Adaptive Training
Idea
Capture only linguistic variabilities by the acoustic model
Remove acoustic variabilities by appropriate transforms
Approach
Multi-style training with dereverberated data
Similar to noise-adaptive training [Deng 2000] or model-independent adaptive training [Gales 2001]
+ Long-term relations partly removed by dereverberation
+ Room dependency reduced
⇒ Discrimination capability increased compared to multi-style training
Successfully applied, e.g., in [Kinoshita 2006]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 125
Page 270
Reverberation-Adaptive Training
[Diagram: the reverberant training data are dereverberated to train an adaptive HMM; at recognition time the reverberant test data are likewise dereverberated, so the adaptive HMM matches the statistical properties of the dereverberated test data]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 126
Page 272
Data-driven Adaptation
Approaches
Maximum A Posteriori adaptation (MAP) [Gauvain 1994]
Maximum Likelihood Linear Regression (MLLR) [Leggetter 1995, Gales 1998]
Successfully used for speaker and noise adaptation
Can also be used for reducing mismatch due to reverberation
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 127
Page 277
MLLR
MLLR
Adaptation of the HMM mean vectors and covariance matrices:
μ_X = D μ_S + d
Σ_XX = E Σ_SS Eᵀ
Transformation parameters D, d, E estimated by the EM algorithm
Supervised MLLR: known transcription
Unsupervised MLLR: during recognition
CMLLR (Constrained MLLR)
Same transformation matrix for mean vector and covariance matrix:
μ_X = D μ_S + d
Σ_XX = D Σ_SS Dᵀ
+ Fewer adaptation parameters
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 128
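Applying a constrained MLLR transform to one Gaussian's parameters can be sketched directly from the equations above; the transform D, d below is an arbitrary illustrative value (in practice it is estimated by EM on adaptation data):

```python
import numpy as np

def cmllr_adapt(mu_s, sigma_s, D, d):
    """Constrained MLLR: apply one shared affine transform to a Gaussian,
    mu_X = D mu_S + d  and  Sigma_XX = D Sigma_SS D^T."""
    return D @ mu_s + d, D @ sigma_s @ D.T

# Illustrative transform (in practice estimated by the EM algorithm):
D = np.array([[1.1, 0.2],
              [0.0, 0.9]])
d = np.array([0.5, -0.3])

mu, sigma = cmllr_adapt(np.zeros(2), np.eye(2), D, d)
print(np.allclose(sigma, D @ D.T))  # True
```

For unconstrained MLLR one would simply pass a separate matrix E for the covariance term; the constrained variant trades that freedom for fewer parameters, as noted on the slide.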
Page 278
Data-driven Approaches
Illustration: Example Matched Training on Reverberated Data
[Diagram: clean-speech training data are combined with a description of the acoustic environment (e.g., a set of RIRs) to produce reverberated training data, which yield a reverberantly-trained HMM]
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 129
Page 282
Data-driven Approaches
Discussion
Very accurate description of statistical properties by reverberant training/adaptation data
Loss of accuracy only when turning the data into a model
Reverberant training: requires a large amount of reverberant data
Data-driven adaptation: moderate amount of reverberant data (but more than model-based adaptation)
Main Limitation
Conventional HMMs cannot accurately capture long-term relations
Page 283
Parametric Model-Based Approaches
Illustration
Block diagram: clean-speech training data → clean-speech HMM; description of the acoustic environment (e.g., set of RIRs) → reverberation representation; adaptation combines both into the adapted HMM
Page 288
Parametric Model-based Adaptation
Block diagram: clean-speech HMMs + reverberation model → adaptation algorithm → adapted HMMs
Discussion
proposed in [Raut 2006, Hirsch 2008, Sehr 2009]
based on melspec convolution
+ long-term relations considered for HMM parameter estimation
+ no adaptation utterances necessary
− reduced accuracy due to approximation errors
− additional loss of accuracy when mapping the combination to an HMM
Main Limitation
Conventional HMMs cannot accurately capture long-term relations
Page 290
Parametric Model-based Adaptation
Mean Adaptation Approach [Raut 2006, Hirsch 2008, Sehr 2009]
Processing chain: calculate cepstral averages µ_S^MFCC → transform to the melspec domain (µ_S^MEL) → perform adaptation using β (µ_X^MEL) → transform back to the cepstral domain (µ_X^MFCC)
Adaptation Equation
µ_X^MEL(l, j) = Σ_p β(l, j, j−p) · µ_S^MEL(l, j−p)
β(l, j, i)  state-level reverberation representation: describes energy dispersion from state i to state j in channel l
i, j  state indices
l  mel channel index
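The adaptation equation above is a weighted sum over earlier states. A minimal numpy sketch, with toy sizes and an assumed exponentially decaying dispersion pattern for β (illustrative values, not estimated from an RIR):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_channels, max_disp = 4, 5, 3   # toy sizes (illustrative)

# Hypothetical clean-speech melspec state means µ_S^MEL(l, j).
mu_s_mel = rng.uniform(1.0, 2.0, size=(n_channels, n_states))

# Hypothetical reverberation representation β(l, j, i): energy
# dispersed from state i into state j in mel channel l.
beta = np.zeros((n_channels, n_states, n_states))
for j in range(n_states):
    for p in range(min(j + 1, max_disp)):
        beta[:, j, j - p] = 0.5 ** p       # decaying dispersion weights

# µ_X^MEL(l, j) = Σ_p β(l, j, j−p) · µ_S^MEL(l, j−p)
mu_x_mel = np.einsum('lji,li->lj', beta, mu_s_mel)
```

The `einsum` contracts over the source-state index, so each adapted mean mixes the current state's clean mean with energy dispersed from its predecessors.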
Page 293
Estimation of Reverberation Representation [Hirsch 2008]
Illustration: HMM states 1–3 along the time axis; the squared RIR envelope h_t^2 is integrated over the interval [t_start(2,1), t_end(2,1)]
h_t^2 = (6 log(10) / T_60) · exp(−(6 log(10) / T_60) · t),  for t ≥ 0
β(j, i) = ∫_{t_start(j,i)}^{t_end(j,i)} h_t^2 dt
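With this exponential envelope the integral for β has a closed form, so no numerical integration is needed. A small sketch (the T_60 value and interval bounds are assumed for illustration):

```python
import math

# Assumed reverberation time for illustration.
T60 = 0.5                      # seconds
a = 6 * math.log(10) / T60     # decay rate: h_t^2 drops by 60 dB at t = T60

def h2(t):
    """Squared RIR envelope h_t^2 = a · exp(−a·t) for t ≥ 0."""
    return a * math.exp(-a * t)

def beta(t_start, t_end):
    """β(j, i) = ∫ h_t^2 dt over [t_start, t_end], in closed form."""
    return math.exp(-a * t_start) - math.exp(-a * t_end)

# The envelope integrates to 1 over [0, ∞), so the β values are the
# fractions of RIR energy dispersed into each state interval.
total = beta(0.0, float('inf'))
```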
Page 296
Limitation of Conventional HMMs
Emission pdf of conventional HMMs: p(x_n | j)
⇒ conditional independence assumption
Conditional emission pdf: capture long-term relationships by p(x_n | j, x_{1:n−1})
Approximation by context-aware methods:
Frame-wise HMM adaptation
REMOS
Page 299
Conventional Adaptation versus Context-Aware Methods
(a) Conventional HMM adaptation: HMM adaptation → Viterbi initialization → Viterbi score calculation (looped until finished)
(b) Frame-wise adaptation: Viterbi initialization → HMM adaptation → Viterbi score calculation (adaptation repeated inside the loop until finished)
(c) REMOS: Viterbi initialization → inner optimization → Viterbi score calculation (looped until finished)
Page 302
Frame-wise Adaptation
x_n ≈ log(exp(h_0 + s_n) + exp(r_n))
µ_{x_n}(j) = log(exp(h_0 + µ_s(j)) + exp(r_n))
r_n  late reverberation
j  state index
Autoregressive Modeling [Takiguchi 2006]
r_n = a + x_{n−1},  a: prediction coefficient
Moving-Average Modeling [Sehr 2011]
r_n = log( Σ_{τ=1}^{T_H} exp(µ_{h_τ} + s_{n−τ}) )
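The moving-average adaptation of one state mean can be sketched directly from the two formulas above. All logmelspec quantities below are random toy values standing in for a real front-end:

```python
import numpy as np

rng = np.random.default_rng(2)
n_mel, T_H = 4, 3   # toy sizes (illustrative)

# Hypothetical logmelspec quantities for one frame n.
h0 = rng.normal(size=n_mel)                  # direct-path RIR component
mu_h = rng.normal(size=(T_H, n_mel)) - 3.0   # late-RIR means µ_{h_τ}
s_past = rng.normal(size=(T_H, n_mel))       # clean frames s_{n−1}..s_{n−T_H}
mu_s_j = rng.normal(size=n_mel)              # clean state mean µ_s(j)

# Moving-average late-reverberation estimate:
# r_n = log( Σ_{τ=1}^{T_H} exp(µ_{h_τ} + s_{n−τ}) )
r_n = np.log(np.sum(np.exp(mu_h + s_past), axis=0))

# Frame-wise adapted state mean:
# µ_{x_n}(j) = log( exp(h0 + µ_s(j)) + exp(r_n) )
mu_x_j = np.log(np.exp(h0 + mu_s_j) + np.exp(r_n))
```

Because the combination is a log-sum-exp, the adapted mean never falls below either the direct-path term or the late-reverberation term, matching the intuition that reverberant energy only adds to a frame.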
Page 306
Frame-wise Adaptation
Discussion
+ Overcomes conditional independence assumption
+ Accurate modeling of long-term relations
− Increased computational complexity
− Increased effort for integration into ASR systems
Full potential not yet demonstrated
Promising direction for future research
Page 307
Further Model-based Approaches
[Couvreur 2001] Reverberant training of several HMMs
+ model selection
[Sehr 2010b] Training of reverberant HMMs on stereo data
[Gales 2011] Extension of MLLR and VTS to reverberant
environments
Page 308
Part III: Robust ASR in Reverberant Environments
Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS
Page 312
Overview of Decoder-based Approaches
Key Idea
Modify the decoding algorithm to increase reverberation robustness
Two Approaches
Missing feature techniques
⇒ Distinguish between reliable and unreliable observations
⇒ Estimate or discard the unreliable parts
Uncertainty decoding
⇒ Combined with signal or feature enhancement techniques
⇒ Exploit reliability information about enhanced data
Decoder-based approaches bridge the gap between
feature-based and model-based approaches
Page 318
Missing Feature Techniques
For overviews see [Cooke 2001, Raj 2005, Kolossa 2011]
Key Ideas
Partition the observations into reliable and missing components
Use only the reliable components for recognition
Main Steps
Mask estimation: mark observations as either reliable or missing
Handle missing data appropriately
How to handle missing data?
Marginalization: eliminate unreliable data by integrating over the corresponding dimensions
Bounded marginalization: exploit known bounds of the missing data for the integration
Data imputation: determine state-dependent estimates for the unreliable data, given the reliable data
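For a diagonal Gaussian, marginalization is particularly simple: integrating out a missing dimension just drops that dimension's factor from the likelihood. A toy sketch with an assumed binary mask (all numbers illustrative):

```python
import numpy as np

def log_gauss_diag(x, mu, var):
    """Per-dimension log N(x; mu, diag(var))."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Hypothetical observation, state Gaussian, and reliability mask;
# dimension 1 is assumed corrupted by reverberation.
x = np.array([1.0, 5.0, 0.5])
mu = np.array([1.1, 0.0, 0.4])
var = np.array([0.5, 0.5, 0.5])
reliable = np.array([True, False, True])

# Full-band score vs. marginalized score: the missing dimension's
# factor is removed entirely from the product of Gaussians.
full_score = log_gauss_diag(x, mu, var).sum()
marg_score = log_gauss_diag(x, mu, var)[reliable].sum()
```

Since the corrupted dimension scores very poorly under the clean model, the marginalized score is the one that still reflects how well the reliable evidence fits the state.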
Page 320
Missing Feature Techniques for Reverberation Robustness
[Palomaki 2004]
Modulation filtering for the mask estimation
Bounded marginalization for handling missing data
[Gemmeke 2011]
Oracle masks based on clean and reverberant features
Semi-Oracle masks based on clean features and estimated RIRs
Gaussian-dependent bounded imputation
Page 323
Uncertainty Decoding
Conventional Feature Enhancement Methods
Block diagram: x_n → feature enhancement algorithm → ŝ_n → decoding (with acoustic model) → transcription
Use only the point estimate ŝ_n of the clean features
Contribution of each Gaussian component m:
p(ŝ_n | m) = N(ŝ_n; µ_s^(m), Σ_s^(m))
Page 326
Uncertainty Decoding
[Droppo 2002, Deng 2005, Liao 2008, Haeb-Umbach 2011]
Feature Enhancement Combined with Uncertainty Decoding
Block diagram: x_n → feature enhancement algorithm → p(s_n | ŝ_n) → decoding (with acoustic model) → transcription
Signal/feature enhancement inevitably introduces distortions
Use reliability information in addition to the point estimate
⇒ Use p(s_n | ŝ_n) instead of ŝ_n
Page 331
Uncertainty Decoding
[Droppo 2002, Deng 2005, Liao 2008, Haeb-Umbach 2011]
Mismatch Model
ŝ_n = s_n + b_n
p(b_n) = N(b_n; 0, Σ_{b_n})
p(ŝ_n | s_n, m) ≈ p(ŝ_n | s_n) = p(b_n)
Contribution of Gaussian Component m
p(ŝ_n | m) = ∫ p(ŝ_n, s_n | m) ds_n = ∫ p(ŝ_n | s_n, m) p(s_n | m) ds_n
           ≈ N(ŝ_n; µ_s^(m), Σ_s^(m) + Σ_{b_n})
Unreliable features ⇒ large Σ_{b_n} ⇒ little effect on the Viterbi score
Main challenge: estimation of the time-variant feature covariance Σ_{b_n}
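The effect of adding Σ_{b_n} to the model covariance can be checked numerically for a diagonal Gaussian. A toy sketch (feature values and variances are illustrative, not from a real enhancement algorithm):

```python
import numpy as np

def log_gauss_diag(x, mu, var):
    """log N(x; mu, diag(var)) for a diagonal covariance."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (x - mu) ** 2 / var)))

# Hypothetical enhanced feature and clean-speech Gaussian component m.
s_hat = np.array([2.0, -1.0])
mu_s = np.array([0.0, 0.0])
var_s = np.array([1.0, 1.0])

# Mismatch variances Σ_{b_n}: small for a reliable frame, large for an
# unreliable one; uncertainty decoding adds them to the model variance.
var_b_small = np.array([0.01, 0.01])
var_b_large = np.array([100.0, 100.0])

score_reliable = log_gauss_diag(s_hat, mu_s, var_s + var_b_small)
score_unreliable = log_gauss_diag(s_hat, mu_s, var_s + var_b_large)
```

With a large Σ_{b_n} the Gaussian flattens, so an unreliable frame discriminates far less between competing components, i.e. it has little effect on the Viterbi score, exactly as stated above.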
Page 335
Uncertainty Decoding for Reverberation-Robust ASR
[Delcroix 2009, Delcroix 2011a]
Key Idea
Strong reverberation
⇒ large effect of speech enhancement
⇒ large mismatch between clean and enhanced features
Effect of speech enhancement captured by b_n = x_n − ŝ_n
⇒ mismatch covariance assumed proportional to the difference between observed and enhanced features
Model the elements of the time-variant diagonal mismatch covariance matrix Σ_{b_n} as
(Σ_{b_n})_ii = α_i · b_{n,i}^2
α is estimated by the EM algorithm using adaptation data
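The variance model is a one-liner once b_n is available. A minimal sketch; the feature values are random stand-ins and the α_i are fixed here, whereas [Delcroix 2009] estimates them with EM on adaptation data:

```python
import numpy as np

rng = np.random.default_rng(3)
n_dim = 4

# Hypothetical observed (reverberant) and enhanced feature vectors.
x_n = rng.normal(size=n_dim) + 2.0
s_hat_n = x_n - rng.normal(scale=0.8, size=n_dim)  # enhancement output

# Assumed per-dimension scaling factors α_i (illustrative constants).
alpha = np.full(n_dim, 0.3)

# (Σ_{b_n})_ii = α_i · b_{n,i}^2  with  b_n = x_n − ŝ_n
b_n = x_n - s_hat_n
sigma_b_diag = alpha * b_n ** 2
```

The squared difference guarantees a non-negative variance, and frames where enhancement changed the features strongly automatically receive a large uncertainty.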
Page 336
Uncertainty Decoding for Reverberation-Robust ASR
[Delcroix 2009, Delcroix 2011a]
Block diagram: reverberant speech x_t → dereverberation → dereverberated speech ŝ_t → feature extraction → features x_n, ŝ_n; variance compensation computes the feature covariance Σ_{b_n} and combines it with the Gaussian covariances Σ_s^(m) of the acoustic model into the compensated covariance sequence Σ_s^(m) + Σ_{b_n}; recognition with the Gaussian means µ_s^(m) and the compensated covariances yields the word sequence w
Page 339
Uncertainty Decoding for Reverberation-Robust ASR
[Delcroix 2009, Delcroix 2011a]
Discussion
− Accounting for the time-variant covariance matrix Σ_{b_n} increases the computational complexity
+ Can be combined with static variance compensation and mean adaptation by MLLR
+ Independent of the enhancement algorithm ⇒ highly flexible
+ Has also been used successfully for non-stationary interferences [Delcroix 2011b]
Promising approach for interconnecting signal/feature-based methods and ASR systems
Page 340
Part III: Robust ASR in Reverberant Environments
Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS
Page 342
REMOS: REverberation MOdeling for Speech Recognition
Online Model Combination
Block diagram: the current observation is generated by applying a combination operator to the CSM output and the RVM output acting on the previous observations
CSM: clean-speech model ⇒ HMM network
RVM: reverberation model
Combination of CSM and RVM ⇒ context-aware acoustic model
Advantages [Sehr 2010c]
CSM and RVM are trained independently
Changing environment: adjust only the RVM
Changing task: adjust only the CSM
High degree of flexibility
Page 344
REMOS Decoding [Sehr 2010c]
Block diagram: x_t → feature extraction → x_n → extended Viterbi algorithm (using CSM and RVM) → transcription
Extended Viterbi Algorithm: finds the most likely path through the CSM
Inner Optimization: accounts for the RVM and the previous observations; determines the most likely contributions of CSM and RVM to the current observation
Page 347
REMOS [Sehr 2010c]
Online combination of model outputs from the clean-speech HMM and the reverberation model, capturing long-term relations:
Combination Operator
x_n = f(s_n, s_{n−T_H:n−1}, h_n, a_n)
    = log(exp(h_n + s_n) + exp(r_n + a_n))
Late Reverberation Estimate
r_n = log( Σ_{τ=1}^{T_H} exp(µ_{H_τ} + s_{n−τ}) )
r_n: logmelspec late-reverberation estimate
a_n: captures the approximation error of r_n
h_n: logmelspec representation of the direct-sound component of the RIR
µ_{H_1:T_H}: mean vectors of the logmelspec representation of the late reverberation
Page 348
REMOS [Sehr 2010c]
Illustration of Generative Model
Generative model: the clean-speech model emits s_n according to p(s_n | j); the reverberation model draws h_n ∼ p(h_n) and a_n ∼ p(a_n); the combination operator f maps s_n, s_{n−1}, …, s_{n−T_H}, h_n, a_n to the observation x_n
Page 350
REMOS [Sehr 2010c]
The conditional emission pdf is decomposed into the reverberation model and the clean HMM:
p(x_n | j, x_{1:n−1}) = ∫ p(x_n | s_n, x_{1:n−1}) p(s_n | j) ds_n
Reverberation Model:
p(x_n | s_n, x_{1:n−1}) = ∫∫ p(h_n) p(a_n) δ(x_n − f(s_n, s_{n−T_H:n−1}, h_n, a_n)) dh_n da_n
Page 352
REMOS [Sehr 2010c]
Approximation of the conditional emission pdf by the maximum value of the integrand:
p(x_n | j, x_{1:n−1}) ≈ p(ĥ_n) p(â_n) p(ŝ_n | j)
The maximizing values ĥ_n, â_n, ŝ_n are determined by the inner optimization
(ĥ_n, â_n, ŝ_n) = argmax_{(h_n, a_n, s_n)} p(h_n) p(a_n) p(s_n | j)
subject to  x_n = f(s_n, s_{n−T_H:n−1}, h_n, a_n)
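For a single mel channel the inner optimization can be sketched with a brute-force grid search: the constraint fixes a_n once h_n and s_n are chosen, so only two variables remain. All distribution parameters, the observation, and r_n below are assumed toy values, and the grid search merely illustrates the constrained maximization (practical REMOS solves it far more efficiently):

```python
import numpy as np

def log_gauss(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

# Scalar (one mel channel) toy problem with hypothetical parameters.
mu_h, var_h = 0.0, 0.1       # prior p(h_n)
mu_a, var_a = 0.0, 0.5       # prior p(a_n)
mu_s, var_s = 1.0, 1.0       # clean-speech emission p(s_n | j)
r_n = 0.5                    # late-reverberation estimate (from the past)
x_n = 1.8                    # current observation

# Maximize p(h) p(a) p(s|j) subject to
# x_n = log(exp(h + s) + exp(r_n + a)); for each candidate (h, s) the
# constraint yields a = log(exp(x_n) − exp(h + s)) − r_n.
best = (-np.inf, None)
for h in np.linspace(mu_h - 2, mu_h + 2, 201):
    for s in np.linspace(mu_s - 3, mu_s + 3, 301):
        residual = np.exp(x_n) - np.exp(h + s)
        if residual <= 0:            # constraint cannot be satisfied
            continue
        a = np.log(residual) - r_n
        score = (log_gauss(h, mu_h, var_h) + log_gauss(a, mu_a, var_a)
                 + log_gauss(s, mu_s, var_s))
        if score > best[0]:
            best = (score, (h, a, s))

score_hat, (h_hat, a_hat, s_hat) = best
```

Eliminating a_n through the constraint is what makes the search two-dimensional; the resulting (ĥ_n, â_n, ŝ_n) satisfy the combination-operator equation exactly.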
Page 356
Detailed Illustration of REMOS Decoding
Block diagram: for each frame n and state j of the network of clean-speech HMMs, the inner optimization combines the reverberation model (p(h_n), p(a_n)), the emission pdf p(s_n | j), the observation x_n, and the previous clean-speech vectors ŝ_{n−1}, …, ŝ_{n−T_H} (found via the backtracking matrix ψ_n(j)) to calculate the score p(x_n | j, x_{1:n−1}); the Viterbi scores γ_n(j) are accumulated in the Viterbi score matrix using the transition probabilities α_ij, and the clean-speech estimates ŝ_n(j) are stored in a 3D tensor over n, j, and mel channel l
Page 357
Modeling Accuracy of REMOS
Example: digit “seven”
Four panels (mel channel l over frame n, state j, and frame delay τ, respectively): logmelspec clean utterance, logmelspec reverberant utterance, means of the clean logmelspec HMM, and logmelspec RIR representation
Page 359
Modeling Accuracy of REMOS
Histograms (estimated pdf over x) for state j = 1, channel l = 3 and for state j = 5, channel l = 21: each panel compares the histogram of reverberant speech, the output pdf of the clean HMM, and the prior and posterior REMOS histograms
Auto-CoVariances (ACVs) over mel channel l and frame lag τ (state j = 9): ACVs of reverberant speech versus ACVs of the posterior REMOS output
Page 360
Recognition Results [Sehr 2010c]
Bar chart: word accuracy in % (range 30–100) for rooms A, B, and C, comparing clean HMM, clean HMM + MLLR, adaptation [Sehr 2009], multi-style HMM, multi-style HMM + MLLR, matched HMM, and REMOS
Setup
Task: connected digits (TI digits)
Features: logmelspec coefficients
Recognizer: word-level HMMs, 16 states/digit, 1 Gaussian/state
Rooms (T60, DRR):
A: 300 ms, 4.0 dB
B: 700 ms, −4.0 dB
C: 900 ms, −4.0 dB
Page 363
REMOS [Sehr 2010c]
Discussion
+ Approach tailored to reverberant feature vector sequences
+ Long-term relations explicitly captured by the reverberation model
+ Reverberation exploited for discrimination
+ Very promising results in the logmelspec domain
− Inner optimization increases decoding complexity
− Implementation requires changes in the decoding routines
Promising direction for future research
Page 365
IV. Summary, Conclusions, and Outlook
Dereverberation for Signal Enhancement
State-of-the-art
Close to 12 dB DRR gain at T60 ≈ 0.7 s (offline) with
4 mics, d = 1.65 m, no noise (TRINICON, 2 sources)
8 mics, d = 2 m, SNR = 10 dB (MCLP)
Challenges
Larger distances, more reverberant rooms
Robustness to speech-like interferers, nonstationary/diffuse noise, transient echo cancellation residuals
Robust tracking of time-varying acoustics
Low-latency (≪ 1 s) and efficient real-time implementations
Joint optimization with spectral subtraction techniques
Page 368
IV. Summary, Conclusions, and Outlook (cont’d)
Dereverberation as preprocessing for ASR
Example: 20k WSJ task convolved with RIRs (T60 = 0.78 s, d = 2 m), NTT ASR system

WER [%]   Preproc.      Acoustic model
85.5      none          clean speech
43.4      none          multi-condition training
26.1      1-ch derev.   multi-condition training
14.2      2-ch derev.   clean w/ unsupervised speaker adaptation by MLLR

Challenges for approaching close-talk performance
• Transition from reverberated signals to real recordings
• Self-adaptation to changing acoustics and front-ends, including
  − variable number and changing, unconstrained positions of talkers
  − different nodes in distributed microphone arrays
• Joint optimization with ASR methods to handle reverberation and noise
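For reference, the WER values in the table above are the word-level Levenshtein distance (substitutions + deletions + insertions) normalized by the reference length. A minimal sketch (the function name is ours):

```python
def wer(ref, hyp):
    """Word error rate in percent between a reference and a hypothesis
    transcription, via dynamic-programming edit distance over words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j]: edit distance between first i ref words, first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(r)][len(h)] / len(r)
```

For example, `wer("a b c d", "a x c")` counts one substitution and one deletion against four reference words, i.e. 50%.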
Page 373
IV. Summary, Conclusions, and Outlook (cont’d)
Reverberation-specific ASR Techniques
State-of-the-art
• Feature-based techniques account for the inter-frame relations caused by dispersion and efficiently exploit the predictability of reverberation
• Model-based techniques could not yet show their full potential, as framewise adaptation and optimization is computationally complex
• Decoder-based techniques compromise between the above regarding complexity
Outlook: Integration into state-of-the-art ASR systems
• expected soon for signal enhancement- and feature-based methods
• model-based methods must become more efficient for widespread use
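As a toy example of the predictability exploited by feature-based techniques: the late-reverberation power spectrum is commonly modeled as a delayed, exponentially attenuated copy of the observed power spectrum and then subtracted per frame (a Lebart-style sketch; delay, floor, and function names are our assumptions, not a specific system from the course):

```python
import numpy as np

def late_reverb_psd(power_spec, t60, frame_shift_s, delay_frames=4):
    """Estimate the late-reverberation power spectrum with the
    exponential-decay model.

    power_spec: (frames, bins) short-time power spectrogram.
    Energy decays by 60 dB over T60, i.e. by exp(-13.8 * t / T60).
    """
    decay = np.exp(-13.8 * delay_frames * frame_shift_s / t60)
    late = np.zeros_like(power_spec)
    late[delay_frames:] = decay * power_spec[:-delay_frames]
    return late

def enhance(power_spec, t60, frame_shift_s, floor=0.01):
    """Spectral subtraction of the estimated late reverberation,
    with a relative floor to avoid negative power estimates."""
    late = late_reverb_psd(power_spec, t60, frame_shift_s)
    return np.maximum(power_spec - late, floor * power_spec)
```

The enhanced power spectrum would then be passed through the usual mel/log/cepstral pipeline before recognition.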
Page 377
Concluding remarks
Dereverberation, the ’Holy Grail’ of Acoustic Signal Processing?
Blind deconvolution of the acoustic paths seems to come closer
Less ambitious algorithms are also effective, and their progress follows the typical DSP objectives:
• increase algorithmic performance and robustness
• reduce computational load
• integrate with other functionalities
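To make the prediction-based route towards blind dereverberation concrete, here is a toy single-channel, time-domain variant of delayed linear prediction (the idea behind MCLP/WPE-style methods mentioned earlier). Real systems operate per STFT bin, use multiple channels, and iteratively reweight the least-squares problem; all names and parameters below are illustrative:

```python
import numpy as np

def dlp_dereverb(x, order=20, delay=3):
    """Delayed linear prediction: predict x[n] from
    x[n-delay], ..., x[n-delay-order+1] and subtract the prediction.
    The delay keeps the direct path and early reflections intact."""
    n = len(x)
    X = np.zeros((n, order))              # delayed data matrix
    for k in range(order):
        shift = delay + k
        X[shift:, k] = x[:n - shift]
    # Regularized normal equations for the prediction coefficients
    g = np.linalg.solve(X.T @ X + 1e-6 * np.eye(order), X.T @ x)
    return x - X @ g
```

Because only delayed samples are used for prediction, the correlated late tail is removed while the (unpredictable) direct sound is preserved.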
As a follow-up to the CHiME Challenge 2011:
⇒ Next Challenge for reverberation-robust speech processing is underway!
Page 379
Acknowledgements
We are especially grateful to
Dr. Keisuke Kinoshita, Dr. Marc Delcroix, Dr. Shoko Araki, Dr. Mehrez Souden, and Dr. Takaaki Hori (NTT)
Dr. Herbert Buchner, Edwin Mabande, and Lutz Marquardt (formerly LMS)
Roland Maas and Christian Hofmann (LMS)
for their contributions to the course material, and we wish to acknowledge the support of parts of the LMS work by
Deutsche Forschungsgemeinschaft (DFG) under contract number KE890/4-1
Page 380
Thank you for your attention!