Page 1
Analysis-by-synthesis for source separation and speech recognition
Michael I Mandel
Brooklyn College (CUNY)
Joint work with Young Suk Cho and Arun Narayanan (Ohio State)
Columbia Neural Network Seminar Series
September 8, 2015
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 1 / 73
Page 2
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognition
4 Summary
Page 3
Outline
1 Motivation: need for noise robustness
    Need for better mobile voice quality
    Need for noise robust automatic speech recognition (ASR)
    Main challenge
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognition
4 Summary
Page 5
Need for better mobile voice quality
There are now more mobile devices than humans on earth [1]
But recording conditions for these devices leave much to be desired
Can we recover high quality speech from noisy & degraded recordings?
[1] http://www.independent.co.uk/life-style/gadgets-and-tech/news/there-are-officially-more-mobile-devices-than-people-in-the-world-9780518.html
Page 6
Why mobile voice quality stinks [2]
[2] Jeff Hecht. Why mobile voice quality still stinks—and how to fix it. IEEE Spectrum, September 2014
Page 9
Conversational mobile software agents
Source: Tom Vanleenhove
Page 10
Conversational mobile software agents need to work in [photos of three example environments]
Sources: Flickr users rickihuang, retorta net, and Brian Indrelunas
Page 13
But automatic speech recognition doesn’t work there [3]
[3] Amit Juneja. A comparison of automatic and human speech recognition in null grammar. The Journal of the Acoustical Society of America, 131(3):EL256–EL261, February 2012
Page 15
Main challenge
Speech is a rich signal, so it requires rich models
Synthesis models are rich enough to represent almost all speech
Non-parametric synthesis models for high quality
    DNN as non-linear distance function
Parametric synthesis models for efficient representation
    efficient gradient-based optimization of input (not model)
Page 17
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancement
    Overview
    Deep neural network as nonlinear distance function
    Using this DNN for speech enhancement
    Noise suppression experiments
    Audio super-resolution experiments
    Summary
3 Parametric synthesis for speech recognition
4 Summary
Page 19
Concatenative resynthesis for speech enhancement [4,5]
Standard approaches try to modify noisy recordings
We instead resynthesize a clean version of the same speech
Because the output is built only from clean recordings, it should give infinite noise suppression and high speech quality
[4] Michael I Mandel, Young-Suk Cho, and Yuxuan Wang. Learning a concatenative resynthesis system for noise suppression. In Proc. IEEE GlobalSIP, 2014
[5] Michael I Mandel and Young Suk Cho. Audio super-resolution using concatenative resynthesis. In Proc. IEEE WASPAA, 2015. To appear
Page 20
Motivating example
Your phone records your voice in quiet, close-talk conditions
Uses those recordings to replace your voice in noisy, far-talk conditions
Resynthesizes your speech from previous high-quality recordings
Page 21
Concatenative resynthesis
Use a large dictionary of ~200 ms “chunks” of audio
Learn DNN-based affinity between dictionary & mixture chunks
Perform concatenative synthesis of signal from dictionary
General robust supervised nonlinear signal mapping framework
Task                    Map from                  To
Noise suppression       Noisy                     Clean
Audio super-resolution  Reverberated, compressed  Clean
Page 23
Deep neural network as nonlinear distance function [6]
Generative               Discriminative          Dictionary-based
Data-intensive training  Moderate training data  Data-efficient training
Hard to adapt            Hard to adapt           Very adaptable
[6] Michael I Mandel, Young-Suk Cho, and Yuxuan Wang. Learning a concatenative resynthesis system for noise suppression. In Proc. IEEE GlobalSIP, 2014
Page 24
Train DNN on correctly and incorrectly paired chunks
Noise suppression
Page 25
Train DNN on correctly and incorrectly paired chunks
Audio super-resolution
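A minimal sketch of how correctly and incorrectly paired training examples might be assembled, assuming parallel lists of clean and degraded chunks. The function name `make_training_pairs`, the one-mismatch-per-match ratio, and the uniform sampling of mismatches are illustrative assumptions, not details given in the talk.

```python
import random

def make_training_pairs(clean_chunks, degraded_chunks, negatives_per_positive=1, seed=0):
    """Build (clean, degraded, label) triples for a pair-classifying DNN.

    clean_chunks[i] and degraded_chunks[i] are assumed to be the same
    stretch of speech before and after degradation (hypothetical setup).
    Label 1 marks a correct pairing, label 0 an incorrect one.
    """
    if len(clean_chunks) != len(degraded_chunks):
        raise ValueError("need one degraded chunk per clean chunk")
    if len(clean_chunks) < 2:
        raise ValueError("need at least two chunks to form incorrect pairs")
    rng = random.Random(seed)
    n = len(clean_chunks)
    pairs = []
    for i in range(n):
        pairs.append((clean_chunks[i], degraded_chunks[i], 1))
        for _ in range(negatives_per_positive):
            # Draw a clean chunk from a different position as a mismatch.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            pairs.append((clean_chunks[j], degraded_chunks[i], 0))
    rng.shuffle(pairs)
    return pairs

# Toy stand-ins for audio chunks: the digit identifies the underlying speech.
clean = ["c0", "c1", "c2", "c3"]
degraded = ["d0", "d1", "d2", "d3"]
pairs = make_training_pairs(clean, degraded)
```

The same construction would apply to either task in the table of mappings: only the degradation that produces `degraded_chunks` changes.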
Page 27
Find optimal sequence of clean chunks
$x = \{x_t\}_{t=0}^{T}$: input sequence of noisy chunks
$\hat{z} = \{\hat{z}_t\}_{t=0}^{T}$: best sequence of corresponding dictionary chunks
\[
\begin{aligned}
\hat{z} &= \arg\max_{z} \prod_{t} p(z_t = j \mid x_t)\, p(z_t = j \mid z_{t-1} = i) \\
        &= \arg\max_{z} \prod_{t} g(z_t, x_t)\, T_{ij}, \qquad i = z_{t-1},\ j = z_t
\end{aligned}
\]
$g(z_t, x_t)$: affinity between clean and noisy chunks
$T_{ij}$: transition affinity between clean chunks
Page 30
Compare all pairs of noisy and clean chunks
[Figure: each chunk of the observed mixture (M1, M2, ..., MN) is paired in turn with every chunk of the clean dictionary (D1, D2, D3, ...); the DNN outputs a similarity for each pair]
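The all-pairs comparison can be sketched as filling a matrix with one row per mixture chunk and one column per dictionary chunk. In this sketch `toy_affinity` is only a hypothetical stand-in for the trained DNN, and chunks are represented as short feature lists for illustration.

```python
import math

def toy_affinity(dict_chunk, mix_chunk):
    """Stand-in for the trained DNN: maps a pair of chunks to a score in (0, 1].

    This is NOT the network from the talk, just a placeholder that scores
    closer feature vectors higher so the loop below can run.
    """
    sq_dist = sum((d - m) ** 2 for d, m in zip(dict_chunk, mix_chunk))
    return math.exp(-sq_dist)

def affinity_matrix(dictionary, mixture, affinity=toy_affinity):
    """Return G with G[t][j] = affinity(dictionary[j], mixture[t])."""
    return [[affinity(d, m) for d in dictionary] for m in mixture]

dictionary = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # D1, D2, D3
mixture = [[0.9, 0.1], [0.1, 0.8]]                  # M1, M2
G = affinity_matrix(dictionary, mixture)

# Best dictionary index for each mixture chunk when chunks are scored in isolation.
best = [max(range(len(row)), key=row.__getitem__) for row in G]
```

Scoring every pair costs one affinity evaluation per (mixture chunk, dictionary chunk) combination, which is why efficient search matters for large dictionaries.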
Page 36
Standard Viterbi algorithm to find optimal sequence
[Figure: same diagram of clean dictionary chunks (D1, D2, D3, ...), observed mixture chunks, DNN, and similarity]
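A minimal Viterbi sketch over the affinities g and transition affinities T, written in the log domain. It assumes all scores are strictly positive and that every starting chunk is equally likely; neither assumption comes from the talk, and the function name `viterbi` and the toy numbers are illustrative.

```python
import math

def viterbi(affinity, transition):
    """Most likely dictionary-chunk sequence.

    affinity[t][j]   : score g for dictionary chunk j at time t (> 0)
    transition[i][j] : score T_ij for following chunk i with chunk j (> 0)
    Works in the log domain so long products do not underflow.
    """
    n_steps, n_states = len(affinity), len(affinity[0])
    log_t = [[math.log(v) for v in row] for row in transition]
    score = [math.log(v) for v in affinity[0]]
    backptr = []
    for t in range(1, n_steps):
        new_score, ptr = [], []
        for j in range(n_states):
            # Best predecessor i for landing in chunk j at time t.
            i_best = max(range(n_states), key=lambda i: score[i] + log_t[i][j])
            ptr.append(i_best)
            new_score.append(score[i_best] + log_t[i_best][j] + math.log(affinity[t][j]))
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]

affinity = [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]
transition = [[0.8, 0.2], [0.2, 0.8]]
path = viterbi(affinity, transition)
framewise = [max(range(2), key=row.__getitem__) for row in affinity]
```

In this toy case the frame-by-frame choice is `[0, 1, 0]`, while the Viterbi path is `[0, 0, 0]`: the transition term overrules a weak affinity preference in the middle frame.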
Page 39
Original “clean” speech
Page 40
Noisy speech
Page 41
Traditional mask-based separation
Page 42
Concatenative resynthesis output
Page 43
Original “clean” speech
Page 44
Subjective quality is high
[Figure: quality ratings (higher is better, axis 20 to 100) for Noisy, IRM NN, Concat, and Clean, rated separately for Speech, Noise Sup, and Overall]
Page 46
Subjective intelligibility is ok
[Figure: words correctly identified (%), axis 60 to 100, for Noisy, IRM NN, Concat, and Clean, shown for keywords and for all words]
Page 48
Original clean speech
Page 49
Reverberated, compressed, 20% packet loss
Page 50
NMF-based bandwidth expansion output
Page 51
Concatenative resynthesis output
Page 52
Original clean speech
Page 53
Subjective quality is high
[Figure: MUSHRA score (0 to 100) for Input, NMF, and Concat; condition labels: Clean, Clean (hid), Rev, Rev 8kHz, RevOpusL20, RevOpusL20 (hid); CleanAmr, RevAmr, RevOpusL20]
Page 55
Subjective intelligibility is good
[Figure: correct words (%), axis 70 to 100, for Input, NMF, and Concat; condition labels: Clean, Rev, Rev 8kHz, RevOpusL20; CleanAmr, RevAmr, RevOpusL20]
Page 57
Summary
Concatenative synthesizer, DNN as noise-robust selection function
Instead of modifying noisy speech, replace it
    completely eliminates noise, except for synthesis errors
    produces high quality, natural-sounding speech
General robust supervised nonlinear signal mapping framework
Data-efficient to train and adaptable to new talkers
Page 58
Future applications
Generalize to audio-visual speech recognition
Label dictionary elements ahead of time to enable
    noise-robust non-parametric speech recognition
    noise-robust pitch tracking
    noise-robust speaker identification
Incorporate language model into transition cost
Develop efficient search mechanisms for large-vocabulary dictionaries
Page 59
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognition
    Overview
    Algorithm
    Results
    Summary
4 Summary
Page 61
Mask-based source separation: Noisy
Page 62
Mask-based source separation: Masked
Page 63
Disrupts speech features: Noisy MFCCs
“He said such products would be marketed by other companies with experience him at this month.”
Page 64
Disrupts speech features: Masked MFCCs
“He said such products would be marketed by other companies with experience him at this month.”
Page 65
Disrupts speech features: Clean MFCCs
“He said such products would be marketed by other companies with experience in that business.”
Page 66
Estimate better features using a strong prior model
“He said such products would be marketed by other companies with experience in that business.”
Page 67
Our approach: Analysis-by-synthesis
Synthesize speech signal so that it
    looks like the observation
    looks like speech
Itakura-Saito divergence compares prediction with noisy observation
Recognizer gives likelihood of speech-ness
Both easy to optimize using gradient descent
Page 68
Speech recognizer includes lots of information
Large vocabulary continuous speech recognizer captures:
Acoustics of speech sounds
The effect of neighboring speech sounds
Pronunciation of words
Order of words
Page 70
Optimization over speech features
$x$: optimization state: MFCCs, ~10,000 dimensions
$y(x)$: ASR features derived from $x$
$M$: mask provided a priori by another source separator
\[
\min_x L(x; M) = \min_x \left\{ (1 - \alpha)\, L_I(x; M) + \alpha\, L_H(y(x)) \right\}
\]
$L(x; M)$: total cost
$L_I(x; M)$: distance to noisy observation
$L_H(y(x))$: negative log likelihood under recognizer
Page 73
Analysis of audio meets resynthesis of MFCCs at mask
Page 74
$L_I(x; M)$: Distance to noisy observation
Resynthesize MFCCs to power spectrum, where mask was computed
Do mask-aware comparison in that domain: weighted Itakura-Saito
    between resynthesis, $\hat{S}_{\omega t}(x)$, and noisy observation, $S$
    weighted by mask, $M$
\[
L_I(x; M) = D_M(S \,\|\, \hat{S}) = \sum_{\omega, t} M_{\omega t} \left( \frac{S_{\omega t}}{\hat{S}_{\omega t}(x)} - \log \frac{S_{\omega t}}{\hat{S}_{\omega t}(x)} - 1 \right)
\]
Does not require modeling speech excitation
Numerically differentiable with respect to $x$
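The mask-weighted Itakura-Saito divergence can be computed directly from its definition. The sketch below assumes the observed spectrum, the resynthesized spectrum, and the mask are already available as equally sized lists indexed by frequency and time; the resynthesis from MFCCs to a power spectrum is not shown, and the function name and toy values are illustrative.

```python
import math

def weighted_is_divergence(observed, resynth, mask):
    """Mask-weighted Itakura-Saito divergence D_M(S || S_hat).

    All three arguments are equally sized 2-D lists indexed [frequency][time];
    observed and resynth hold strictly positive power values, mask holds
    weights (for example in [0, 1]).
    """
    total = 0.0
    for s_row, r_row, m_row in zip(observed, resynth, mask):
        for s, r, m in zip(s_row, r_row, m_row):
            ratio = s / r
            total += m * (ratio - math.log(ratio) - 1.0)
    return total

S = [[1.0, 4.0], [2.0, 0.5]]        # noisy observation
S_hat = [[1.0, 2.0], [2.0, 5.0]]    # resynthesis
M = [[1.0, 1.0], [1.0, 0.0]]        # mask: last cell judged unreliable
d = weighted_is_divergence(S, S_hat, M)
```

Only the cell where the two spectra differ and the mask is nonzero contributes, so `d` equals 1 - log 2 here; the large mismatch in the masked-out cell is ignored, which is the intent of weighting by the mask.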
Page 75
$L_H(y(x))$: Likelihood under recognizer
Large vocabulary continuous speech recognizer
    big hidden Markov model (HMM)
    approximated by the lattice of likely paths
Closed form gradient with respect to x
Serves as a model of clean MFCC sequences
Page 77
Optimization
State space of approximately 13 × 800 ≈ 10,000 dimensions
Quasi-Newton optimization, BFGS
    gradient plus approximate second-order information
Closed form gradient of HMM likelihood
    using a forward-backward algorithm
Numerical gradient of IS divergence
    independent costs and gradients for each frame
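One way to read "independent costs and gradients for each frame" is that a numerical gradient only has to perturb one frame's coefficients and re-evaluate that frame's cost. The sketch below uses central differences, which is an assumption (the talk does not name the differencing scheme), and `toy_cost` is a hypothetical stand-in for the per-frame divergence.

```python
def frame_gradient(frame_cost, frame, eps=1e-5):
    """Central-difference gradient of a scalar cost of one frame's coefficients."""
    grad = []
    for k in range(len(frame)):
        up = list(frame)
        down = list(frame)
        up[k] += eps
        down[k] -= eps
        grad.append((frame_cost(up) - frame_cost(down)) / (2.0 * eps))
    return grad

def numerical_gradient(frame_cost, frames, eps=1e-5):
    """Gradient of sum_t frame_cost(frames[t]); frames are treated independently."""
    return [frame_gradient(frame_cost, f, eps) for f in frames]

def toy_cost(frame):
    # Hypothetical smooth per-frame cost, standing in for the IS term.
    return sum((c - 1.0) ** 2 for c in frame)

frames = [[0.0, 2.0, 1.0], [3.0, 1.0, -1.0]]
grad = numerical_gradient(toy_cost, frames)
analytic = [[2.0 * (c - 1.0) for c in f] for f in frames]
```

Under this scheme a frame with 13 coefficients needs 2 × 13 = 26 evaluations of its own cost, independent of how many frames the utterance has. The resulting gradient, combined with the closed-form recognizer gradient, is what a quasi-Newton optimizer such as BFGS would consume.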
Page 79
Experiment
AURORA4 corpus
    read Wall Street Journal sentences (5000 word vocabulary)
    six environmental noise types
    SNRs between 5 and 15 dB
Masks from ideal binary mask and estimated ratio mask [7]
[7] Arun Narayanan and DeLiang Wang. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 7092–7096. IEEE, May 2013
Page 80
Recognition results
Word error rate (%) averaged across noise type
Mask       Direct  A-by-S
Noisy      30.94
Estimated  16.18   15.31
Oracle     14.38   13.62
Clean       9.54
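To put the table in relative terms, the short calculation below derives the absolute and relative word error rate reductions of A-by-S over direct use of each mask. The numbers are taken from the table; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

# (Direct, A-by-S) word error rates in percent, from the table.
wer = {"Estimated": (16.18, 15.31), "Oracle": (14.38, 13.62)}
for mask, (direct, a_by_s) in wer.items():
    absolute = direct - a_by_s
    relative = relative_reduction(direct, a_by_s)
    print(f"{mask}: {absolute:.2f} absolute, {relative:.1f}% relative")
```

This works out to 0.87 points (about 5.4% relative) for the estimated mask and 0.76 points (about 5.3% relative) for the oracle mask.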
Page 81
Reconstruction results
Itakura-Saito divergence between resynthesized speech and original
Mask       Direct  A-by-S  Δ
Noisy      272301
Estimated  276497  275224  −1273
Oracle     273006  272506  −500
Page 82
Resynthesis gets closer to reliable regions
Page 87
Summary
Use a full recognizer as a prior model for clean speech
Synthesize from MFCCs to the domain of the mask
Adjust synthesis of speech signal so that it
    looks like the observation
    looks like speech
Reduces both recognition errors and distance to the clean utterance
Page 88
Future directions
Apply to DNN-based acoustic models
Model speech excitation for full resynthesis of clean speech
Model multiple simultaneous speakers and estimate masks jointly
Combine with similar binaural model to include spatial clustering
Page 90
Summary
Synthesizers provide strong prior information
Non-parametric synthesis models for high quality
    learned nonlinear matching function for perceptually motivated features
Parametric synthesis models for efficient representation
    strong, differentiable prior model of speech
Thanks!
Any questions?
Page 93
Outline
5 Parametric synthesis for separation
Page 94
Re-estimate mask using resynthesis: Original
Page 95
Re-estimate mask using resynthesis: Re-estimate