LabROSA: Laboratory for the Recognition and Organization of Speech and Audio

Estimating Single-Channel Source Separation Masks: Relevance Vector Machine Classifiers vs. Pitch-Based Masking

Ron J. Weiss, Daniel P. W. Ellis
{ronw,dpwe}@ee.columbia.edu
LabROSA, Columbia University

Estimating Single-Channel Source Separation Masks – p. 1
Single Channel Source Separation
[Figure: spectrograms (0–4000 Hz, 0–3 s) showing speech + babble noise = mixture (10 dB SNR)]
• Given a monaural signal composed of multiple sources
• e.g. multiple speakers, speech + music, speech + background noise
• Want to separate the constituent sources
• Applications: noise-robust speech recognition, hearing aids
Missing Data Masks
[Figure: spectrogram of the mixture (0–4000 Hz, 0–3 s) alongside the binary mask marking regions where speech energy dominates]
• Leverage the sparsity of audio sources: only one source is likely to have a significant amount of energy in any given time-frequency cell
• If we can decide which cells are dominated by the source of interest (i.e. have local SNR greater than some threshold), we can filter out noise-dominated cells (“refiltering” [3])
• Create a binary mask that labels each cell of the spectrogram as missing or reliable
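The masking idea above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: `ideal_binary_mask` assumes the clean speech and noise power spectrograms are known (the oracle case used to define ground-truth masks), and `refilter` simply zeroes out noise-dominated cells.

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, snr_threshold_db=0.0):
    """Mark a time-frequency cell reliable (True) when the local
    speech-to-noise ratio exceeds the threshold in dB."""
    eps = np.finfo(float).eps
    local_snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return local_snr_db > snr_threshold_db

def refilter(mixture_power, mask):
    """Keep reliable cells, zero out noise-dominated ones ("refiltering")."""
    return mixture_power * mask

# Toy 2x2 spectrograms: speech dominates the diagonal cells.
speech = np.array([[4.0, 1.0], [0.5, 9.0]])
noise = np.array([[1.0, 4.0], [2.0, 1.0]])
mask = ideal_binary_mask(speech, noise)   # [[True, False], [False, True]]
cleaned = refilter(speech + noise, mask)  # off-diagonal cells zeroed
```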
Mask Estimation As Classification [4]
• Goal is to classify each spectrogram cell as being reliable (dominated by the speech signal) or not
• Separate classifier for each frequency band
• Train on speech mixed with a variety of different noise signals (babble noise, white noise, speech-shaped noise, etc.) at a variety of different levels (−5 to 10 dB SNR)
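The per-band classification setup can be sketched as follows. As a stand-in for the Relevance Vector Machines discussed in the talk, this sketch trains a plain logistic-regression classifier per frequency band with gradient descent; the feature layout (`n_frames × n_bands × n_dims`) and helper names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def train_band_classifier(x, y, lr=0.5, n_iters=1000):
    """Fit logistic regression for one frequency band by gradient descent.
    x: (n_frames, n_dims) features; y: (n_frames,) 0/1 reliability labels."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(reliable)
        grad = p - y                            # gradient of log loss
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def train_per_band(features, masks):
    """One classifier per band. features: (n_frames, n_bands, n_dims);
    masks: (n_frames, n_bands) boolean ground-truth labels."""
    return [train_band_classifier(features[:, k, :], masks[:, k].astype(float))
            for k in range(features.shape[1])]

def predict_mask(models, features):
    """Estimate a binary mask, band by band."""
    return np.stack([(features[:, k, :] @ w + b) > 0
                     for k, (w, b) in enumerate(models)], axis=1)

# Synthetic sanity check: label is just "feature above zero" in each band.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3, 1))
masks = features[:, :, 0] > 0
models = train_per_band(features, masks)
predicted = predict_mask(models, features)
```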