LabROSA
ROSA overview - Dan Ellis 2001-09-28 - 1
Recognition & Organization of Speech & Audio
Dan Ellis
http://labrosa.ee.columbia.edu/
Outline
1 Introducing LabROSA
2 Projects in speech, music & audio
3 Summary
Sound organization
• Central operation:
- continuous sound mixture → distinct objects & events
• Perceptual impression is very strong
- but hard to ‘see’ in signal
[Figure: spectrogram of a sound mixture, frq/Hz (0-4000) vs. time/s (0-12), with an ‘Analysis’ step labeling the component events: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant)]
Bregman’s lake
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman’90)
• Received waveform is a mixture
- two sensors, N signals ...
• Disentangling mixtures as primary goal
- perfect solution is not possible
- need knowledge-based constraints
The information in sound
• A sense of hearing is evolutionarily useful
- gives organisms ‘relevant’ information
• Auditory perception is ecologically grounded
- scene analysis is preconscious (→ illusions)
- special-purpose processing reflects ‘natural scene’ properties
- subjective, not canonical (ambiguity)
[Figure: two spectrograms, ‘Steps 1’ and ‘Steps 2’, freq / Hz (0-4000) vs. time / s (0-4)]
Key themes for LabROSA
http://labrosa.ee.columbia.edu/
• Sound organization: construct hierarchy
- at an instant (sources)
- along time (segmentation)
• Scene analysis
- find attributes according to objects
- use attributes to form objects
- ... plus constraints of knowledge
• Exploiting large data sets (the ASR lesson)
- supervised/labeled: pattern recognition
- unsupervised: structure discovery, clustering
• Special cases:
- speech recognition
- other source-specific recognizers
• ... within a ‘complete explanation’
Outline
1 Introducing LabROSA
2 Projects in speech, music & audio
- Tandem speech recognition
- ‘Meeting recorder’ speech analysis
- Musical information extraction
- Alarm sound detection
3 Summary
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
[Figure: block diagram. sound → Feature calculation → feature vectors → Acoustic classifier → phone probabilities → HMM decoder → phone / word sequence → Understanding/application... The acoustic classifier uses acoustic model parameters; the HMM decoder uses word models (e.g. "s ah t") and a language model (e.g. p("sat"|"the","cat") vs. p("saw"|"the","cat")). Additional label: DATA.]
• ‘State of the art’ word-error rates (WERs):
- 2% (dictation)
- 30% (telephone conversations)
• Can use multiple streams...
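The word-error rates quoted above are conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that calculation; the function name and the example sentences are illustrative assumptions, not taken from any LabROSA tool.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one substitution (sat -> saw) and one deletion (on).
wer = word_error_rate("the cat sat on the mat", "the cat saw the mat")
print(f"WER = {wer:.1%}")
```

Because insertions count as errors, a WER computed this way can exceed 100% on very noisy input.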
Tandem speech recognition
(with Manuel Reyes, ICSI, OGI, CMU)
• Neural net estimates phone posteriors; but Gaussian mixtures model finer detail
• Combine them!
• Train net, then train GMM on net output
- GMM is ignorant of net output ‘meaning’
[Figure: three block diagrams comparing the systems. In each, the net takes a context window of feature frames (t_n ... t_n+w, coefficients C0 ... Ck) and outputs probabilities over phone classes (h#, pcl, bcl, tcl, dcl, ...).]
- Hybrid Connectionist-HMM ASR: Input sound → Feature calculation → Speech features → Neural net classifier → Phone probabilities → Noway decoder → Words ("s ah t")
- Conventional ASR (HTK): Input sound → Feature calculation → Speech features → Gauss mix models → Subword likelihoods → HTK decoder → Words ("s ah t")
- Tandem modeling: Input sound → Feature calculation → Speech features → Neural net classifier → Phone probabilities → Gauss mix models → Subword likelihoods → HTK decoder → Words ("s ah t")
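The tandem idea (train the net, then train Gaussian models on the net's outputs as if they were ordinary features) can be sketched in a few lines. Everything specific here is an assumption for illustration: the posterior values are invented, the log transform is one common way to make posteriors look more Gaussian, and a single diagonal-covariance Gaussian per subword unit stands in for a full mixture model.

```python
import math


def log_posteriors(posteriors, floor=1e-6):
    """Map net outputs (one probability vector per frame) to the log domain."""
    return [[math.log(max(p, floor)) for p in frame] for frame in posteriors]


def fit_diag_gaussian(frames):
    """Fit one diagonal-covariance Gaussian (a 1-component GMM) to feature frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-3)
           for d in range(dim)]
    return mean, var


def log_likelihood(frame, model):
    """Log density of one frame under a diagonal-covariance Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, mean, var))


# Hypothetical net outputs (3 phone classes) for frames of two subword units.
net_out = {
    "s":  [[0.80, 0.15, 0.05], [0.70, 0.20, 0.10], [0.90, 0.05, 0.05]],
    "ah": [[0.10, 0.80, 0.10], [0.20, 0.70, 0.10], [0.05, 0.90, 0.05]],
}

# The Gaussian stage sees only feature vectors; it never uses the fact
# that each dimension was trained to mean "probability of phone k".
models = {unit: fit_diag_gaussian(log_posteriors(frames))
          for unit, frames in net_out.items()}

test_frame = log_posteriors([[0.75, 0.15, 0.10]])[0]
scores = {unit: log_likelihood(test_frame, m) for unit, m in models.items()}
best = max(scores, key=scores.get)
print(best)
```

In a full system the per-unit likelihoods would be passed to the HMM decoder rather than compared frame by frame.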
Tandem system results: Aurora digits
[Figure: WER as a function of SNR for various Aurora99 systems. x-axis: SNR / dB (averaged over 4 noises): clean, 20, 15, 10, 5, 0, -5; y-axis: WER / % (log scale, 1 to 100). Curves: HTK GMM baseline, Hybrid connectionist, Tandem, Tandem + PC.]
Average WER ratio to baseline:
- HTK GMM: 100%
- Hybrid: 84.6%
- Tandem: 64.5%
- Tandem + PC: 47.2%
[Figure: Aurora 2 Eurospeech 2001 Evaluation. Bar chart of avg. rel. improvement % (multicondition, axis -10 to 60) by site: Columbia, Philips, UPC Barcelona, Bell Labs, IBM, Motorola 1, Motorola 2, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, Granada.]
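The average WER ratios listed for the Aurora99 systems can be restated as relative error reductions. This short sketch does the arithmetic, assuming each ratio is a system's WER divided by the baseline's; the helper name is hypothetical.

```python
# Average WER ratio to the HTK GMM baseline, in percent (values from the slide).
wer_ratio_pct = {
    "HTK GMM": 100.0,
    "Hybrid": 84.6,
    "Tandem": 64.5,
    "Tandem + PC": 47.2,
}


def relative_improvement(ratio_pct):
    """Relative WER reduction (%) implied by a WER ratio to baseline (%)."""
    return round(100.0 - ratio_pct, 1)


improvement = {name: relative_improvement(r) for name, r in wer_ratio_pct.items()}
for name, gain in improvement.items():
    print(f"{name:12s} {gain:5.1f}% fewer errors than baseline")
```

On this reading, the Tandem + PC system removes a little over half of the baseline's word errors on average.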
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, ...):
- 100 hours collected, ongoing transcription
- headsets + tabletop + ‘PDA’
Crosstalk cancellation
• Baseline speaker activity detection is hard:
[Figure: ‘speaker active’ tracks and level/dB (0-40) for Spkr A, Spkr B, Spkr C, Spkr D, Spkr E and Tabletop over time / secs (120-155), annotated with breath noise, crosstalk, interruptions, and ‘speaker B cedes floor’]
• Noisy crosstalk model: $m = C \cdot s + n$
• Estimate subband $C_{Aa}$ from A’s peak energy
- ... including pure delay (10 ms frames)
- ... then linear inversion
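The linear-inversion step for the crosstalk model m = C·s + n can be illustrated with two channels in a single subband. This sketch makes several assumptions: m is read as the per-microphone levels, s as the per-speaker levels and n as noise (ignored here), and the coupling values are invented rather than estimated from peak energies.

```python
def invert_2x2(c):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (d, e) = c
    det = a * e - b * d
    if abs(det) < 1e-12:
        raise ValueError("coupling matrix is singular")
    return [[e / det, -b / det], [-d / det, a / det]]


def cancel_crosstalk(m, c):
    """Estimate source levels s from mic levels m, ignoring the noise term n."""
    inv = invert_2x2(c)
    return [inv[0][0] * m[0] + inv[0][1] * m[1],
            inv[1][0] * m[0] + inv[1][1] * m[1]]


# Hypothetical coupling in one subband: each headset picks up
# 20-30% of the other talker's level.
C = [[1.0, 0.3],
     [0.2, 1.0]]

s_true = [4.0, 0.0]  # only speaker A is active
m = [C[0][0] * s_true[0] + C[0][1] * s_true[1],
     C[1][0] * s_true[0] + C[1][1] * s_true[1]]

s_hat = cancel_crosstalk(m, C)
print(m, s_hat)
```

Before inversion, speaker B's microphone shows a nonzero level even though B is silent; after inversion that level returns to (numerically) zero, which is what makes per-speaker activity detection easier.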
PDA-based speaker change detection
• Goal: small conference-tabletop device
• Speaker turns from PDA mock-up signals?
• SCD algo on spectral + interaural features
- average spectral + per-channel ITD, ∆φ
[Figure: pda.aif excerpt (31-40 s) with 512-pt xcorr, 80% max thresh. Panels vs. time/s: a panel with axis 0-8000 (presumably frequency in Hz), cross-correlation lag/ms (-1.5 to 1.5), and a third panel (-0.5 to 0.5). Speaker turns are labeled Morgan, Adam and Dan, with the transcript of their overlapping remarks about the dummy PDA and its two electret microphones.]
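An interaural time difference (ITD) of the kind used as a speaker-change feature can be estimated from the peak of the cross-correlation between the two channels. The sketch below works on one 512-sample frame, matching the 512-pt xcorr mentioned for the figure; the sample rate, the lag search range and the synthetic signal are assumptions, and the 80% threshold from the figure is not modeled.

```python
import random


def best_lag(left, right, max_lag):
    """Return the lag (in samples) at which `right` best matches `left`.

    A positive result means `right` is a delayed copy of `left`.
    """
    def xcorr(lag):
        return sum(left[i] * right[i + lag]
                   for i in range(len(left))
                   if 0 <= i + lag < len(right))

    return max(range(-max_lag, max_lag + 1), key=xcorr)


FS_HZ = 16000  # assumed sample rate, not stated in the source

# Synthetic two-channel frame: channel 2 is channel 1 delayed by 3 samples.
random.seed(0)
ch1 = [random.uniform(-1.0, 1.0) for _ in range(512)]
delay = 3
ch2 = [0.0] * delay + ch1[:-delay]

lag = best_lag(ch1, ch2, max_lag=16)
lag_ms = 1000.0 * lag / FS_HZ
print(lag, lag_ms)
```

Tracking how this lag moves from frame to frame gives a cue for which direction the current talker is in, and a jump in the lag suggests a speaker change.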
Music analysis: Lyrics extraction
(with Adam Berenzweig)
• Vocal content is highly salient, useful for retrieval
• Can we find the singing? Use an ASR classifier:
• Frame error rate ~20% for segmentation based on posterior-feature statistics
• Lyric segmentation + transcribed lyrics → training data for lyrics ASR...
[Figure: spectrogram (freq / kHz, 0-4) and posteriogram (phone class) panels vs. time / sec (0-2) for three examples: speech (trnset #58), music (no vocals #1), singing (vocals #17 + 10.5s)]
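One statistic that could be computed from a posteriogram for this kind of segmentation is the per-frame entropy of the phone posteriors. This is a sketch under an assumption, not the method confirmed by the slide: it supposes that the ASR classifier produces more peaked posteriors on voice-like frames than on purely instrumental ones. The example values are invented.

```python
import math


def frame_entropy(posteriors):
    """Entropy (in nats) of one frame's phone-posterior distribution."""
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)


def mean_entropy(posteriogram):
    """Average per-frame entropy over a segment."""
    return sum(frame_entropy(f) for f in posteriogram) / len(posteriogram)


# Hypothetical 3-class posteriograms: peaked (voice-like) vs. diffuse.
peaked = [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05]]
diffuse = [[0.40, 0.30, 0.30], [0.35, 0.35, 0.30]]

print(mean_entropy(peaked), mean_entropy(diffuse))
```

A segment-level classifier would combine statistics like this over a window of frames before deciding between singing and no singing.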
Music analysis: Structure recovery
(with Rob Turetsky)
• Structure recovery by similarity matrices (after Foote)
- similarity distance measure?
- segmentation & repetition structure
- interpretation at different scales: notes, phrases, movements
- incorporating musical knowledge: ‘theme similarity’
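A similarity matrix compares every analysis frame of a piece with every other frame, so repeated sections show up as off-diagonal stripes of high similarity. The sketch below uses cosine similarity, which is only one possible answer to the open question of which distance measure to use; the 2-D feature vectors are invented to mimic an A-B-A form.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def similarity_matrix(frames):
    """S[i][j] is the similarity between frame i and frame j."""
    return [[cosine_similarity(u, v) for v in frames] for u in frames]


# Hypothetical features for an A-B-A structure: frames 0-1 and 4-5 are alike.
frames = [[1.0, 0.1], [0.9, 0.2],
          [0.1, 1.0], [0.2, 0.9],
          [1.0, 0.1], [0.9, 0.2]]

S = similarity_matrix(frames)
for row in S:
    print(" ".join(f"{x:4.2f}" for x in row))
```

The high value at S[0][4] marks the return of the A section, while the low value at S[0][2] marks the boundary between A and B.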
Alarm sound detection
• Alarm sounds have particular structure
- people ‘know them when they hear them’
• Isolate alarms in sound mixtures
- representation of energy in time-frequency
- formation of atomic elements
- grouping by common properties (onset &c.)
- classify by attributes...
• Key: recognize despite background
[Figure: three spectrogram panels, freq / Hz (0-5000) vs. time / sec (1-2.5)]
The ‘Machine listener’
• Goal: An auditory system for machines
- use same environmental information as people
• Aspects:
- recognize spoken commands (but not others)
- track ‘acoustic channel’ quality (for responses)
- categorize environment (conversation, crowd...)
• Scenarios
- personal listener → summary of your day
- autonomous robots: need awareness
Outline
1 Introducing LabROSA
2 Projects in speech, music & audio
3 Summary
LabROSA Summary
DOMAINS
• Broadcast
• Movies
• Lectures
• Meetings
• Personal recordings
• Location monitoring
ROSA
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Object-based structure discovery & learning
• Scene analysis
• Audio-visual integration
• Music analysis
APPLICATIONS
• Structuring
• Search
• Summarization
• Awareness
• Understanding