Scene Analysis for Speech and Audio Recognition

Dan Ellis <[email protected]>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/

2003-04-16

Outline:
1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
3. Recognizing Speech in Noise
4. Using Models in Parallel
5. The Listening Machine
Sound, Mixtures & Learning
• Sound
- carries useful information about the world
- complements vision
• Mixtures
- ... are the rule, not the exception
- the acoustic medium is ‘transparent’: many sources combine
- mixtures must be handled!
• Learning
- the speech recognition lesson: let the data do the work
- ... like listeners do
[Figure: spectrogram of an everyday sound mixture; time 0–12 s, frequency 0–4000 Hz, level −60 to 0 dB]
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman’90)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects natural scene properties = constraints
- subjective, not absolute
Auditory Scene Analysis
(Bregman 1990)
• How do people analyze sound mixtures?
- break the mixture into small elements (in time-frequency)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram: frequency analysis feeds onset, harmonicity, and position maps into a grouping mechanism that yields source properties (after Darwin, 1996)]
Cues to simultaneous grouping
• Elements + attributes
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
[Figure: spectrogram illustrating grouping cues; time 0–9 s, frequency 0–8000 Hz]
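Both cues can be made computational. The sketch below is my own minimal illustration, not code from the talk: onset strength as half-wave-rectified spectral flux (energy that appears simultaneously across bands), and periodicity as a shared autocorrelation peak within a frame. All function names and defaults are illustrative choices.

```python
import numpy as np

def onset_strength(spec):
    """Per-frame onset strength: half-wave-rectified frame-to-frame
    increase in log energy (spectral flux), summed across frequency.

    spec: magnitude spectrogram, shape (n_freq, n_frames)."""
    logspec = np.log(spec + 1e-10)
    diff = np.diff(logspec, axis=1)            # change per bin, per frame step
    flux = np.maximum(diff, 0.0).sum(axis=0)   # keep only energy increases
    return np.concatenate([[0.0], flux])       # align to input frames

def common_period(frame, sr, fmin=80.0, fmax=400.0):
    """Shared fundamental via the autocorrelation peak of one signal frame."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)    # lag range for the f0 search
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag                            # f0 estimate in Hz
```

Time-frequency cells that share an onset frame, or bands whose energy repeats at the period found here, become candidates for grouping into a single source.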
The effect of context
• Context can create an ‘expectation’: i.e. a bias towards a particular interpretation
• Bregman’s old-plus-new principle:
- a change is preferentially interpreted as the addition of a new source
• E.g. the continuity illusion
[Figure: continuity illusion stimulus — an interrupted tone plus noise bursts filling the gaps; frequency up to 4 kHz, time 0–1.4 s]
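The stimulus behind the illusion is easy to synthesize; here is a hedged sketch (sampling rate, durations, and levels are my illustrative choices, not the talk's): a tone interrupted by a gap that is either silent, or filled with louder broadband noise. With the noise filler, listeners report the tone continuing through the interruption.

```python
import numpy as np

def continuity_stimulus(sr=8000, f0=1000.0, fill_gap=True):
    """Tone - gap - tone; optionally fill the gap with loud noise.

    fill_gap=True:  gap filled with broadband noise -> tone heard as continuous
    fill_gap=False: silent gap -> tone heard as interrupted."""
    t = lambda dur: np.arange(int(dur * sr)) / sr
    tone = lambda dur: 0.3 * np.sin(2 * np.pi * f0 * t(dur))
    gap_len = int(0.2 * sr)
    rng = np.random.default_rng(0)
    if fill_gap:
        gap = 0.8 * rng.standard_normal(gap_len)  # noise louder than the tone
    else:
        gap = np.zeros(gap_len)
    return np.concatenate([tone(0.4), gap, tone(0.4)])
```

The old-plus-new principle predicts the effect: since the noise could be masking a continuing tone, the auditory system prefers that interpretation over a stopped-and-restarted tone.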
Approaches to sound mixture recognition
• Separate signals, then recognize
- e.g. CASA, ICA
- nice, if you can do it
• Recognize combined signal
- ‘multicondition training’
- but the combinatorics of possible noise conditions explode
• Recognize with parallel models
- full joint-state space?
- or: divide the signal into fragments, then use missing-data recognition
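The missing-data idea can be made concrete for a diagonal-Gaussian acoustic model: score each state using only the time-frequency dimensions judged reliable (target-dominated), marginalizing out the rest — which for a diagonal Gaussian simply means dropping those dimensions from the sum. A minimal sketch under that assumption (not the talk's code):

```python
import numpy as np

def masked_loglik(x, mask, mean, var):
    """Log-likelihood of a diagonal Gaussian using only reliable dims.

    x, mean, var: feature vector and per-dimension Gaussian parameters.
    mask: boolean, True where the feature is target-dominated (reliable);
    unreliable dimensions are marginalized out, i.e. omitted."""
    xr, mr, vr = x[mask], mean[mask], var[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * vr) + (xr - mr) ** 2 / vr)
```

A correct model then scores well even when some dimensions are swamped by noise, because those dimensions never enter the likelihood.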
Independent Component Analysis (ICA)
(Bell & Sejnowski 1995 etc.)
• Drive a parameterized separation algorithm to maximize independence of outputs
• Advantages:
- mathematically rigorous, minimal assumptions
- does not rely on prior information from models
• Disadvantages:
- may converge to local optima...
- separation, not recognition
- does not exploit prior information from models
[Diagram: sources s1, s2 mixed through coefficients a11, a12, a21, a22 into microphones m1, m2; unmixing parameters adapted by the gradient −δMutInfo/δa]
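One way to see the "maximize independence" idea concretely: after whitening, a two-channel instantaneous mixture differs from the sources only by a rotation (up to sign and permutation), so scanning rotation angles for the most non-Gaussian outputs recovers the sources. This grid-search sketch is my own simplified construction — it illustrates the principle rather than Bell & Sejnowski's gradient rule:

```python
import numpy as np

def whiten(x):
    """Decorrelate channels and scale to unit variance (PCA whitening)."""
    x = x - x.mean(axis=1, keepdims=True)
    cov = x @ x.T / x.shape[1]
    vals, vecs = np.linalg.eigh(cov)
    return (vecs / np.sqrt(vals)).T @ x        # D^{-1/2} V^T x

def ica_2ch(x, n_angles=180):
    """Two-channel ICA by exhaustive rotation search after whitening:
    keep the rotation maximizing mean absolute excess kurtosis
    (a simple non-Gaussianity / independence proxy)."""
    z = whiten(x)
    best, best_score = None, -np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_angles):
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        y = R @ z
        kurt = np.mean(y ** 4, axis=1) - 3.0   # excess kurtosis per output
        score = np.abs(kurt).mean()
        if score > best_score:
            best, best_score = y, score
    return best
```

Gradient methods replace the grid search with an update driven by −δMutInfo/δa, but the objective — statistically independent outputs — is the same.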
Outline

1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
   - Data-driven
   - Top-down constraints
3. Recognizing Speech in Noise
4. Using Models in Parallel
5. The Listening Machine
Computational Auditory Scene Analysis: The Representational Approach
(Cooke & Brown 1993)
• Direct implementation of psych. theory
- ‘bottom-up’ processing
- uses common onset & periodicity cues
• Able to extract voiced speech:
[Diagram: input mixture → front end (signal features / maps: onset, period, freq. modulation) → object formation (discrete objects) → grouping rules → source groups]
[Figure: spectrograms of the input mixture (brn1h.aif) and the extracted voiced speech (brn1h.fi.aif); 100–3000 Hz, 0.2–1.0 s]
Adding top-down constraints
Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up)...
- objects irresistibly appear
vs. Prediction-driven (top-down)
- match observations with parameters of a world-model
- need world-model constraints...
[Diagram, data-driven: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups]
[Diagram, prediction-driven: input mixture → front end → compare & reconcile (prediction errors) → hypothesis management (hypotheses) → predict & combine (predicted features, from periodic and noise components)]
Prediction-Driven CASA
(Ellis 1996)
• Explain a complex sound with basic elements
[Figure: prediction-driven analysis of a 9-second city-sound mixture — spectrogram panels for the City input, noise elements (Noise1, Noise2, Click1), and periodic elements (Wefts 1–12), with labeled events and listener identification rates: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10); levels −70 to −40 dB]
Aside: Evaluation
• Evaluation is a big problem for CASA
- what is the goal, really?
- what is a good test domain?
- how do you measure performance?
• SNR improvement
- tricky to derive from before/after signals: the correspondence problem
- possible with a fixed filtering mask, but this rewards removing signal as well as noise
• Speech Recognition (ASR) improvement
- recognizers typically very sensitive to artefacts
• ‘Real’ task?
- mixture corpus with specific sound events...
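The fixed-mask measurement can be sketched as follows (a toy construction of mine, not the talk's evaluation code): apply the same binary time-frequency mask separately to the clean and noise components, so "signal" and "noise" energies after processing are unambiguous and the correspondence problem never arises. STFT parameters here are arbitrary illustrative choices.

```python
import numpy as np

def stft_mag2(x, n_fft=256, hop=128):
    """Power spectrogram via a Hann-windowed short-time Fourier transform."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2  # (frames, bins)

def masked_snr_gain_db(clean, noise):
    """SNR improvement from an oracle binary mask keeping only
    target-dominated time-frequency cells. The same mask is applied to
    clean and noise separately, sidestepping the correspondence problem."""
    C, N = stft_mag2(clean), stft_mag2(noise)
    mask = C > N                                  # oracle: target-dominated cells
    before = 10 * np.log10(C.sum() / N.sum())
    after = 10 * np.log10((C * mask).sum() / (N * mask).sum())
    return after - before
```

Note the caveat from the slide: an aggressive mask that deletes most cells still scores well, because discarding signal energy is not penalized by this measure.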
Outline

1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
3. Recognizing Speech in Noise
   - Conventional ASR
   - Tandem modeling
4. Using Models in Parallel
5. The Listening Machine
Recognizing Speech in Noise
• Standard speech recognition structure:
• How to handle additive noise?
- just train on noisy data: ‘multicondition training’
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters) → phone probabilities → HMM decoder (word models, e.g. ‘sat’ = s ah t; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone / word sequence → understanding / application]
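The HMM decoder's core is a Viterbi search over the classifier's per-frame phone probabilities, constrained by the word and language models' transition structure. A minimal generic sketch (textbook Viterbi, not this system's decoder):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most-likely HMM state path.

    log_obs:   (T, S) log p(frame_t | state_s) from the acoustic classifier
    log_trans: (S, S) log transition probabilities (word/phone model)
    log_init:  (S,)   log initial-state probabilities
    Returns the best state sequence as a list of state indices."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]                 # best score ending in each state
    back = np.zeros((T, S), dtype=int)            # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # scores[prev, next]
        back[t] = np.argmax(scores, axis=0)       # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(delta))]                # trace back from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In a real recognizer the states are phone-model states strung together by word models, and the language model reweights transitions between words; the dynamic-programming recursion is the same.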
References

A. Berenzweig, D. Ellis, S. Lawrence, "Using Voice Segments to Improve Artist Classification of Music", Proc. AES-22 Intl. Conf. on Virtual, Synthetic and Entertainment Audio, Espoo, Finland, June 2002.
http://www.ee.columbia.edu/~dpwe/pubs/aes02-aclass.pdf

A. Berenzweig, D. Ellis, S. Lawrence, "Anchor Space for Classification and Similarity Measurement of Music", Proc. ICME-03, Baltimore, July 2003.
http://www.ee.columbia.edu/~dpwe/pubs/icme03-anchor.pdf

M. Cooke and G. Brown, "Computational auditory scene analysis: Exploiting principles of perceived continuity", Speech Communication 13, 391-399, 1993.

D. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. dissertation, MIT, 1996.
http://www.ee.columbia.edu/~dpwe/pubs/pdcasa.pdf

D. Ellis, "Detecting Alarm Sounds", Proc. Workshop on Consistent & Reliable Acoustic Cues CRAC-01, Denmark, Sept. 2001.
http://www.ee.columbia.edu/~dpwe/pubs/crac01-alarms.pdf

M. Gales and S. Young, "Robust continuous speech recognition using parallel model combination", IEEE Tr. Speech and Audio Proc., 4(5):352-359, Sept. 1996.
http://citeseer.nj.nec.com/gales96robust.html

H. Hermansky, D. Ellis and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems", Proc. ICASSP, Istanbul, June 2000.
http://citeseer.nj.nec.com/hermansky00tandem.html