Sound, Mixtures, and Learning: LabROSA overview

Dan Ellis <[email protected]>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/
2003-09-26

1. Sound Content Analysis
2. Recognizing sounds
3. Organizing mixtures
4. Accessing large datasets
Sound Content Analysis
• Sound understanding: the key challenge
  - what listeners do
  - understanding = abstraction
• Applications
  - indexing/retrieval
  - robots
  - prostheses

[Figure: spectrogram, 0-4000 Hz over 12 s, level -60 to 0 dB, with labeled events: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant)]
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman’90)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
  - reflects natural scene properties = constraints
  - subjective, not absolute
Approaches to handling sound mixtures
• Separate signals, then recognize
  - e.g. CASA, ICA
  - nice, if you can do it
• Recognize combined signal
  - ‘multicondition training’
  - combinatorics...
• Recognize with parallel models
  - full joint-state space?
  - or: divide signal into fragments, then use missing-data recognition
Outline
1. Sound Content Analysis
2. Recognizing sounds
   - Speech recognition
   - Nonspeech
3. Organizing mixtures
4. Accessing large datasets
Recognizing Sounds: Speech
• Standard speech recognition structure:

[Diagram: sound → Feature calculation → feature vectors → Acoustic classifier (acoustic model parameters) → phone probabilities → HMM decoder (word models, e.g. "sat" = /s ah t/; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → Understanding/application...]

• How to handle additive noise?
  - just train on noisy data: ‘multicondition training’
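The HMM decoder stage of this pipeline can be sketched with a toy Viterbi search over phone states. This is a generic illustration with made-up probabilities for three states standing in for /s ah t/, not the actual LabROSA recognizer:

```python
import numpy as np

def viterbi(log_trans, log_obs, log_init):
    """Most-likely state path through an HMM.
    log_trans[i, j]: log P(state j | state i)
    log_obs[t, j]:   log P(observation t | state j), e.g. classifier phone posteriors
    log_init[j]:     log P(state j at t=0)
    """
    T, S = log_obs.shape
    delta = log_init + log_obs[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: come from i, land in j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy left-to-right model: 3 states (0=/s/, 1=/ah/, 2=/t/), made-up numbers
eps = 1e-10
trans = np.log(np.array([[0.6, 0.4, eps],
                         [eps, 0.6, 0.4],
                         [eps, eps, 1.0]]))
obs = np.log(np.array([[0.8, 0.1, 0.1],    # frame-wise phone probabilities
                       [0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.1, 0.7, 0.2],
                       [0.1, 0.1, 0.8]]))
init = np.log(np.array([1.0, eps, eps]))
print(viterbi(trans, obs, init))  # [0, 0, 1, 1, 2]
```

A real decoder searches word models and the language model jointly; the dynamic-programming recursion is the same.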
Novel speech signal representations
(with Marios Athineos)
• Common sound models use 10ms frames
- but: sub-10ms envelope is perceptible
• Use a parametric (LPC) model on spectrum
• Convert to features for ASR
- improvements esp. for stops
[Figure: spectrograms (0-4 kHz) of "Fireplace" original, TDLP resynthesis, CTFLP resynthesis, and CTFLP resynthesis with 4x time-scale modification; below, the waveform and its 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz, and 2-4 kHz subbands]
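The parametric fit is the standard Levinson-Durbin LPC recursion; here is a minimal numpy sketch on a time-domain example (a generic illustration, not the TDLP/CTFLP code; in the spectrum-domain variant the same recursion is applied to the autocorrelation of the spectrum, so the poles track the temporal envelope):

```python
import numpy as np

def levinson_durbin(r, order):
    """Fit an order-p all-pole (LPC) model to an autocorrelation sequence r.
    Returns A(z) coefficients a (with a[0] == 1) and the residual error power."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# Demo: recover a known 2nd-order all-pole process from its autocorrelation
rng = np.random.default_rng(0)
x = np.zeros(4000)
e = rng.standard_normal(4000)
for n in range(2, 4000):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + e[n]
r = np.array([np.dot(x[:len(x) - k], x[k:]) / len(x) for k in range(3)])
a, err = levinson_durbin(r, 2)
print(np.round(a, 1))  # close to [1, -1.3, 0.6]
```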
Finding the Information in Speech
(with Patricia Scanlon)
• Mutual Information in time-frequency:
• Use to select classifier input features
[Figure: mutual information (bits) between time-frequency cells (freq/Bark vs. time/ms) and labels, for all phones, stops, vowels, and speaker identity (vowels); feature-selection masks (IRREG, IRREG+CHKBD) at 133/95/57/19 and 85/47/28/9 features; frame accuracy (45-70%) vs. number of features for IRREG, RECT, RECT (all frames), IRREG+CHKBD, RECT+CHKBD]
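The MI maps can be illustrated with a toy computation: mutual information between a discretized feature and a class label, estimated from their joint histogram (synthetic data here; the real maps are computed per Bark-scaled time-frequency cell against phone labels):

```python
import numpy as np

def mutual_information(x_bins, y_labels):
    """I(X;Y) in bits from paired discrete observations."""
    xs, x_idx = np.unique(x_bins, return_inverse=True)
    ys, y_idx = np.unique(y_labels, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1)       # joint histogram
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)         # marginals
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

# A perfectly informative feature carries 1 bit about a balanced binary label
labels = np.array([0, 0, 1, 1] * 100)
feature = labels.copy()
print(mutual_information(feature, labels))  # 1.0
```

Features whose MI with the phone label is highest are the natural candidates for the classifier input, which is the selection principle the slide describes.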
Alarm sound detection
(Ellis 2001)
• Alarm sounds have particular structure
  - people ‘know them when they hear them’
  - clear even at low SNRs
• Why investigate alarm sounds?
  - they’re supposed to be easy
  - potential applications...
• Contrast two systems:
  - standard, global features: P(X|M)
  - sinusoidal model, fragments: P(M,S|Y)

[Figure: spectrogram (0-4 kHz over 25 s, level -40 to 0 dB) of alarm sounds hrn01, bfr02, buz01 in noise (s0n6a8+20)]
Alarms: Results
• Both systems commit many insertions at 0 dB SNR, but in different circumstances:

[Figure: spectrograms of Restaurant + alarms (snr 0, ns 6, al 8), with MLP classifier output and sound object classifier output]

            Neural net system       Sinusoid model system
  Noise     Del      Ins    Tot     Del      Ins    Tot
  1 (amb)    7/25      2    36%     14/25      1    60%
  2 (bab)    5/25     63   272%     15/25      2    68%
  3 (spe)    2/25     68   280%     12/25      9    84%
  4 (mus)    8/25     37   180%      9/25    135   576%
  Overall   22/100   170   192%     50/100   147   197%
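The Tot column is deletions plus insertions as a fraction of the true alarms (25 per noise condition, 100 overall). A quick check against the table values:

```python
def total_error(deletions, insertions, n_true=25):
    """Total error rate: (Del + Ins) / number of true events, as a percentage."""
    return round(100 * (deletions + insertions) / n_true)

# Neural-net system, values from the table
assert total_error(7, 2) == 36        # noise 1 (ambient)
assert total_error(5, 63) == 272      # noise 2 (babble)
assert total_error(8, 37) == 180      # noise 4 (music)
# Sinusoid-model system
assert total_error(9, 135) == 576     # noise 4 (music)
print(total_error(22, 170, n_true=100))  # 192, neural-net overall
```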
Outline
1. Sound Content Analysis
2. Recognizing sounds
3. Organizing mixtures
   - Auditory Scene Analysis
   - Missing data recognition
   - Parallel model inference
4. Accessing large datasets
Auditory Scene Analysis
(Bregman 1990)
• How do people analyze sound mixtures?
  - break mixture into small elements (in time-freq)
  - elements are grouped into sources using cues
  - sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
  - cues: common onset/offset/modulation, harmonicity, spatial location, ...

[Diagram, after Darwin 1996: Frequency analysis → onset map, harmonicity map, position map → Grouping mechanism → Source properties]
Cues to simultaneous grouping
• Elements + attributes
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
• But: Context ...
[Figure: spectrogram, 0-8000 Hz over 9 s]
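Two of these cues can be sketched in a few lines of numpy (a toy illustration on synthetic signals, not the grouping model itself): periodicity via the autocorrelation peak, and onsets via jumps in an energy envelope:

```python
import numpy as np

def periodicity(x, sr, fmin=80.0, fmax=400.0):
    """Fundamental-frequency estimate via the autocorrelation peak
    (the periodicity cue: bands sharing this cycle group together)."""
    x = x - x.mean()
    ac = np.correlate(x, x, 'full')[len(x) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return sr / lag

def onsets(env, threshold):
    """Onset cue: frames where the energy envelope jumps by > threshold."""
    rise = np.diff(env)
    return np.flatnonzero(rise > threshold) + 1

sr = 8000
t = np.arange(2000) / sr
# Two harmonics of 200 Hz: energy in different bands, same cycle
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
print(round(periodicity(x, sr)))  # 200
```

Simultaneous energy that shares an onset frame or a common period is a candidate for grouping into one source, per the cues listed above.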
Computational Auditory Scene Analysis: The Representational Approach
(Cooke & Brown 1993)
• Direct implementation of psych. theory
  - ‘bottom-up’ processing
  - uses common onset & periodicity cues
• Able to extract voiced speech:
[Diagram: input mixture → Front end (signal features/maps: onset, period, freq. modulation) → Object formation (discrete objects) → Grouping rules → Source groups]

[Figure: spectrograms (100-3000 Hz, 0-1.0 s) of the input mixture brn1h.aif and the extracted voiced speech brn1h.fi.aif]
Adding top-down constraints
Perception is not direct, but a search for plausible hypotheses

• Data-driven (bottom-up)...
  - objects irresistibly appear

vs. Prediction-driven (top-down)
  - match observations with parameters of a world-model
  - need world-model constraints...
[Diagrams: data-driven: input mixture → Front end (signal features) → Object formation (discrete objects) → Grouping rules → Source groups. Prediction-driven: input mixture → Front end → Compare & reconcile (prediction errors) → Hypothesis management (hypotheses) → Predict & combine (predicted features, from periodic components and noise components)]
Prediction-Driven CASA
[Figure: prediction-driven analysis of a 10 s city-street recording (f/Hz, level -70 to -40 dB): noise elements Noise1, Noise2, Click1; weft (pitched) elements Wefts1-4, Weft5, Wefts6,7, Weft8, Wefts9-12; identified events: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)]
Segregation vs. Inference
• Source separation requires attribute separation
  - sources are characterized by attributes (pitch, loudness, timbre + finer details)
  - need to identify & gather different attributes for different sources ...
• Need representation that segregates attributes
  - spectral decomposition
  - periodicity decomposition
• Sometimes values can’t be separated
  - e.g. unvoiced speech
  - maybe infer factors from probabilistic model?
  - or: just skip those values, infer from higher-level context
  - do both: missing-data recognition

p(O, x, y) → p(x, y | O)
Missing Data Recognition
• Speech models p(x|m) are multidimensional...
  - i.e. means, variances for every freq. channel
  - need values for all dimensions to get p(•)
• But: can evaluate over a subset of dimensions x_k
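For a diagonal-covariance Gaussian the subset evaluation is simple: sum the per-dimension log-likelihoods over just the reliable channels. A minimal sketch with made-up numbers:

```python
import numpy as np

def log_gauss_present(x, mask, mean, var):
    """Log-likelihood of a diagonal Gaussian over only the 'present' dimensions.
    x, mean, var: per-frequency-channel values; mask: True where data is reliable."""
    xk, mk, vk = x[mask], mean[mask], var[mask]
    return float(-0.5 * np.sum(np.log(2 * np.pi * vk) + (xk - mk) ** 2 / vk))

# Toy 4-channel model; channels 1 and 3 are corrupted by an interfering source
x = np.array([1.0, 9.0, 2.0, -5.0])
mask = np.array([True, False, True, False])
mean = np.array([1.0, 0.0, 2.0, 0.0])
var = np.ones(4)
print(log_gauss_present(x, mask, mean, var))  # ~ -1.84: only the 2 clean channels count
```

Scoring only the reliable channels keeps the corrupted ones from dragging down the likelihood of the correct model, which is the point of missing-data recognition.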