Scene Analysis for Speech and Audio Recognition

Dan Ellis <[email protected]>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/

2003-04-16

Outline:
1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
3. Recognizing Speech in Noise
4. Using Models in Parallel
5. The Listening Machine
Sound, Mixtures & Learning
• Sound
- carries useful information about the world
- complements vision
• Mixtures
- ... are the rule, not the exception
- the acoustic medium is ‘transparent’: many sources combine
- mixtures must be handled!
• Learning
- the speech recognition lesson: let the data do the work
- ... like listeners do
[Figure: spectrogram of an everyday sound mixture; time 0–12 s, frequency 0–4000 Hz, level −60 to 0 dB]
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman’90)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects natural scene properties = constraints
- subjective, not absolute
Auditory Scene Analysis
(Bregman 1990)
• How do people analyze sound mixtures?
- break the mixture into small elements (in time-frequency)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram: frequency analysis feeds onset, harmonicity, and position maps into a grouping mechanism that yields source properties (after Darwin, 1996)]
Cues to simultaneous grouping
• Elements + attributes
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
[Figure: spectrogram illustrating grouping cues; time 0–9 s, frequency 0–8000 Hz]
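Both cues can be made computational. The sketch below is my own minimal illustration, not code from the talk: onset strength as half-wave-rectified spectral flux (energy that appears simultaneously across bands), and periodicity as a shared autocorrelation peak within a frame. All function names and defaults are illustrative choices.

```python
import numpy as np

def onset_strength(spec):
    """Per-frame onset strength: half-wave-rectified frame-to-frame
    increase in log energy (spectral flux), summed across frequency.

    spec: magnitude spectrogram, shape (n_freq, n_frames)."""
    logspec = np.log(spec + 1e-10)
    diff = np.diff(logspec, axis=1)            # change per bin, per frame step
    flux = np.maximum(diff, 0.0).sum(axis=0)   # keep only energy increases
    return np.concatenate([[0.0], flux])       # align to input frames

def common_period(frame, sr, fmin=80.0, fmax=400.0):
    """Shared fundamental via the autocorrelation peak of one signal frame."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)    # lag range for the f0 search
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag                            # f0 estimate in Hz
```

Time-frequency cells that share an onset frame, or bands whose energy repeats at the period found here, become candidates for grouping into a single source.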
The effect of context
• Context can create an ‘expectation’: i.e. a bias towards a particular interpretation
• Bregman’s old-plus-new principle:
- a change is preferentially interpreted as the addition of a new source
• E.g. the continuity illusion
[Figure: continuity illusion stimulus — an interrupted tone plus noise bursts filling the gaps; frequency up to 4 kHz, time 0–1.4 s]
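The stimulus behind the illusion is easy to synthesize; here is a hedged sketch (sampling rate, durations, and levels are my illustrative choices, not the talk's): a tone interrupted by a gap that is either silent, or filled with louder broadband noise. With the noise filler, listeners report the tone continuing through the interruption.

```python
import numpy as np

def continuity_stimulus(sr=8000, f0=1000.0, fill_gap=True):
    """Tone - gap - tone; optionally fill the gap with loud noise.

    fill_gap=True:  gap filled with broadband noise -> tone heard as continuous
    fill_gap=False: silent gap -> tone heard as interrupted."""
    t = lambda dur: np.arange(int(dur * sr)) / sr
    tone = lambda dur: 0.3 * np.sin(2 * np.pi * f0 * t(dur))
    gap_len = int(0.2 * sr)
    rng = np.random.default_rng(0)
    if fill_gap:
        gap = 0.8 * rng.standard_normal(gap_len)  # noise louder than the tone
    else:
        gap = np.zeros(gap_len)
    return np.concatenate([tone(0.4), gap, tone(0.4)])
```

The old-plus-new principle predicts the effect: since the noise could be masking a continuing tone, the auditory system prefers that interpretation over a stopped-and-restarted tone.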
Approaches to sound mixture recognition
• Separate signals, then recognize
- e.g. CASA, ICA
- nice, if you can do it
• Recognize combined signal
- ‘multicondition training’
- but the combinatorics of possible noise conditions explode
• Recognize with parallel models
- full joint-state space?
- or: divide the signal into fragments, then use missing-data recognition
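The missing-data idea can be made concrete for a diagonal-Gaussian acoustic model: score each state using only the time-frequency dimensions judged reliable (target-dominated), marginalizing out the rest — which for a diagonal Gaussian simply means dropping those dimensions from the sum. A minimal sketch under that assumption (not the talk's code):

```python
import numpy as np

def masked_loglik(x, mask, mean, var):
    """Log-likelihood of a diagonal Gaussian using only reliable dims.

    x, mean, var: feature vector and per-dimension Gaussian parameters.
    mask: boolean, True where the feature is target-dominated (reliable);
    unreliable dimensions are marginalized out, i.e. omitted."""
    xr, mr, vr = x[mask], mean[mask], var[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * vr) + (xr - mr) ** 2 / vr)
```

A correct model then scores well even when some dimensions are swamped by noise, because those dimensions never enter the likelihood.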
Independent Component Analysis (ICA)
(Bell & Sejnowski 1995 etc.)
• Drive a parameterized separation algorithm to maximize independence of outputs
• Advantages:
- mathematically rigorous, minimal assumptions
- does not rely on prior information from models
• Disadvantages:
- may converge to local optima...
- separation, not recognition
- does not exploit prior information from models
[Diagram: sources s1, s2 mixed through coefficients a11, a12, a21, a22 into microphones m1, m2; unmixing parameters adapted by the gradient −δMutInfo/δa]
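One way to see the "maximize independence" idea concretely: after whitening, a two-channel instantaneous mixture differs from the sources only by a rotation (up to sign and permutation), so scanning rotation angles for the most non-Gaussian outputs recovers the sources. This grid-search sketch is my own simplified construction — it illustrates the principle rather than Bell & Sejnowski's gradient rule:

```python
import numpy as np

def whiten(x):
    """Decorrelate channels and scale to unit variance (PCA whitening)."""
    x = x - x.mean(axis=1, keepdims=True)
    cov = x @ x.T / x.shape[1]
    vals, vecs = np.linalg.eigh(cov)
    return (vecs / np.sqrt(vals)).T @ x        # D^{-1/2} V^T x

def ica_2ch(x, n_angles=180):
    """Two-channel ICA by exhaustive rotation search after whitening:
    keep the rotation maximizing mean absolute excess kurtosis
    (a simple non-Gaussianity / independence proxy)."""
    z = whiten(x)
    best, best_score = None, -np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_angles):
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        y = R @ z
        kurt = np.mean(y ** 4, axis=1) - 3.0   # excess kurtosis per output
        score = np.abs(kurt).mean()
        if score > best_score:
            best, best_score = y, score
    return best
```

Gradient methods replace the grid search with an update driven by −δMutInfo/δa, but the objective — statistically independent outputs — is the same.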
Outline

1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
   - Data-driven
   - Top-down constraints
3. Recognizing Speech in Noise
4. Using Models in Parallel
5. The Listening Machine
Computational Auditory Scene Analysis: The Representational Approach
(Cooke & Brown 1993)
• Direct implementation of psych. theory
- ‘bottom-up’ processing
- uses common onset & periodicity cues
• Able to extract voiced speech:
[Diagram: input mixture → front end (signal features / maps: onset, period, freq. modulation) → object formation (discrete objects) → grouping rules → source groups]
[Figure: spectrograms of the input mixture (brn1h.aif) and the extracted voiced speech (brn1h.fi.aif); 100–3000 Hz, 0.2–1.0 s]
Adding top-down constraints
Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up)...
- objects irresistibly appear
vs. Prediction-driven (top-down)
- match observations with parameters of a world-model
- need world-model constraints...
[Diagram, data-driven: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups]
[Diagram, prediction-driven: input mixture → front end → compare & reconcile (prediction errors) → hypothesis management (hypotheses) → predict & combine (predicted features, from periodic and noise components)]
Prediction-Driven CASA
(Ellis 1996)
• Explain a complex sound with basic elements
[Figure: prediction-driven analysis of a 9-second city-sound mixture — spectrogram panels for the City input, noise elements (Noise1, Noise2, Click1), and periodic elements (Wefts 1–12), with labeled events and listener identification rates: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10); levels −70 to −40 dB]
Aside: Evaluation
• Evaluation is a big problem for CASA
- what is the goal, really?
- what is a good test domain?
- how do you measure performance?
• SNR improvement
- tricky to derive from before/after signals: the correspondence problem
- possible with a fixed filtering mask, but this rewards removing signal as well as noise
• Speech Recognition (ASR) improvement
- recognizers typically very sensitive to artefacts
• ‘Real’ task?
- mixture corpus with specific sound events...
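The fixed-mask measurement can be sketched as follows (a toy construction of mine, not the talk's evaluation code): apply the same binary time-frequency mask separately to the clean and noise components, so "signal" and "noise" energies after processing are unambiguous and the correspondence problem never arises. STFT parameters here are arbitrary illustrative choices.

```python
import numpy as np

def stft_mag2(x, n_fft=256, hop=128):
    """Power spectrogram via a Hann-windowed short-time Fourier transform."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2  # (frames, bins)

def masked_snr_gain_db(clean, noise):
    """SNR improvement from an oracle binary mask keeping only
    target-dominated time-frequency cells. The same mask is applied to
    clean and noise separately, sidestepping the correspondence problem."""
    C, N = stft_mag2(clean), stft_mag2(noise)
    mask = C > N                                  # oracle: target-dominated cells
    before = 10 * np.log10(C.sum() / N.sum())
    after = 10 * np.log10((C * mask).sum() / (N * mask).sum())
    return after - before
```

Note the caveat from the slide: an aggressive mask that deletes most cells still scores well, because discarding signal energy is not penalized by this measure.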
Outline

1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
3. Recognizing Speech in Noise
   - Conventional ASR
   - Tandem modeling
4. Using Models in Parallel
5. The Listening Machine
Recognizing Speech in Noise
• Standard speech recognition structure:
• How to handle additive noise?
- just train on noisy data: ‘multicondition training’
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters) → phone probabilities → HMM decoder (word models, e.g. ‘sat’ = s ah t; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone / word sequence → understanding / application]
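The HMM decoder's core is a Viterbi search over the classifier's per-frame phone probabilities, constrained by the word and language models' transition structure. A minimal generic sketch (textbook Viterbi, not this system's decoder):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most-likely HMM state path.

    log_obs:   (T, S) log p(frame_t | state_s) from the acoustic classifier
    log_trans: (S, S) log transition probabilities (word/phone model)
    log_init:  (S,)   log initial-state probabilities
    Returns the best state sequence as a list of state indices."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]                 # best score ending in each state
    back = np.zeros((T, S), dtype=int)            # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # scores[prev, next]
        back[t] = np.argmax(scores, axis=0)       # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(delta))]                # trace back from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In a real recognizer the states are phone-model states strung together by word models, and the language model reweights transitions between words; the dynamic-programming recursion is the same.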
References

A. Berenzweig, D. Ellis, S. Lawrence, "Using Voice Segments to Improve Artist Classification of Music", Proc. AES-22 Intl. Conf. on Virtual, Synthetic and Entertainment Audio, Espoo, Finland, June 2002.
http://www.ee.columbia.edu/~dpwe/pubs/aes02-aclass.pdf

A. Berenzweig, D. Ellis, S. Lawrence, "Anchor Space for Classification and Similarity Measurement of Music", Proc. ICME-03, Baltimore, July 2003.
http://www.ee.columbia.edu/~dpwe/pubs/icme03-anchor.pdf

M. Cooke and G. Brown, "Computational auditory scene analysis: Exploiting principles of perceived continuity", Speech Communication 13, 391-399, 1993.

D. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. dissertation, MIT, 1996.
http://www.ee.columbia.edu/~dpwe/pubs/pdcasa.pdf

D. Ellis, "Detecting Alarm Sounds", Proc. Workshop on Consistent & Reliable Acoustic Cues CRAC-01, Denmark, Sept. 2001.
http://www.ee.columbia.edu/~dpwe/pubs/crac01-alarms.pdf

M. Gales and S. Young, "Robust continuous speech recognition using parallel model combination", IEEE Tr. Speech and Audio Proc., 4(5):352-359, Sept. 1996.
http://citeseer.nj.nec.com/gales96robust.html

H. Hermansky, D. Ellis and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems", Proc. ICASSP, Istanbul, June 2000.
http://citeseer.nj.nec.com/hermansky00tandem.html