One Microphone Source Separation

Sam T. Roweis
Gatsby Unit, University College London
[email protected]

Abstract

Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording, and argue for the application of statistical algorithms to learning this masking function. I present results of a simple factorial HMM system which learns on recordings of single speakers and can then separate mixtures using only one observation signal by computing the masking function and then refiltering.

1 Learning from data in computational auditory scene analysis

Imagine listening to many pianos being played simultaneously. If each pianist were striking keys randomly it would be very difficult to tell which note came from which piano. But if each were playing a coherent song, separation would be much easier because of the structure of music. Now imagine teaching a computer to do the separation by showing it many musical scores as "training data". Typical auditory perceptual input contains a mixture of sounds from different sources, altered by the acoustic environment. Any biological or artificial hearing system must extract individual acoustic objects or streams in order to do successful localization, denoising and recognition. Bregman [1] called this process auditory scene analysis in analogy to vision. Source separation, or computational auditory scene analysis (CASA), is the practical realization of this problem via computer analysis of microphone recordings and is very similar to the musical task described above. It has been investigated by research groups with different emphases. The CASA community have focused on both multiple and single microphone source separation problems under highly realistic acoustic conditions, but have used almost exclusively hand designed systems which include substantial knowledge of the human auditory system and its psychophysical characteristics (e.g. [2,3]). Unfortunately, it is difficult to incorporate large amounts of detailed statistical knowledge about the problem into such an approach. On the other hand, machine learning researchers, especially those working on independent components analysis (ICA) and related algorithms, have focused on the case of multiple microphones in simplified mixing environments and have used powerful "blind" statistical techniques. These "unmixing" algorithms (even those which attempt to recover more sources than signals) cannot operate on single recordings. Furthermore, since they often depend only on the joint amplitude histogram of the observations they can be very sensitive to the details of filtering and reverberation in the environment. The goal of this paper is to bring together the robust representations of CASA and methods which learn from data to solve a restricted version of the source separation problem: isolating acoustic objects from only a single microphone recording.

2 Refiltering vs. unmixing

Unmixing algorithms reweight multiple simultaneous recordings $x_k(t)$ (generically called microphones) to form a new source object $\hat{s}(t)$:

$$\underbrace{\hat{s}(t)}_{\text{estimated source}} \;=\; \underbrace{\alpha_1 x_1(t)}_{\text{mic 1}} \;+\; \underbrace{\alpha_2 x_2(t)}_{\text{mic 2}} \;+\; \cdots \;+\; \underbrace{\alpha_K x_K(t)}_{\text{mic K}} \qquad (1)$$

The unmixing coefficients $\alpha_k$ are constant over time and are chosen to optimize some property of the set of recovered sources, which often translates into a kurtosis measure on the joint amplitude histogram of the microphones. The intuition is that unmixing algorithms are finding spikes (or dents for low kurtosis sources) in the marginal amplitude histogram. The time ordering of the data points is often irrelevant.
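In code, equation (1) is nothing more than a fixed weighted sum of the microphone channels. A minimal numpy sketch; the coefficients alpha here are placeholders, and a real unmixing algorithm such as ICA would choose them by optimizing a statistic of the recovered sources:

```python
import numpy as np

def unmix(X, alpha):
    """Equation (1): constant reweighting of K microphone signals.

    X     : array of shape (K, T), one row per microphone x_k(t).
    alpha : array of shape (K,), time-invariant unmixing coefficients
            (in ICA these would be one row of a learned unmixing matrix).
    """
    return alpha @ X  # shape (T,): s_hat(t) = sum_k alpha_k * x_k(t)
```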

Unmixing depends on a fine timescale, sample-by-sample comparison of several observation signals. Humans, on the other hand, cannot hear histogram spikes (footnote 1) and perform well on many monaural separation tasks. We are doing structural analysis, or a kind of perceptual grouping, on the incoming sound. But what is being grouped? There is substantial evidence that the energy across time in different frequency bands can carry relatively independent information. This suggests that the appropriate subparts of an audio signal may be narrow frequency bands over short times. To generate these parts, one can perform multiband analysis: break the original signal $y(t)$ into many subband signals $h_i(t)$, each filtered to contain only energy from a small portion of the spectrum. The results of such an analysis are often displayed as a spectrogram, which shows energy (using colour or grayscale) as a function of time (abscissa) and frequency (ordinate). (For example, one is shown on the top left of figure 5.) In the musical analogy, a spectrogram is like a musical score in which the colour or grey level of each note tells you how hard to hit the piano key.

The basic idea of refiltering is to construct new sources by selectively reweighting the multiband signals $h_i(t)$. Crucially, however, the mixing coefficients are no longer constant over time; they are now called masking signals. Given a set of masking signals, denoted $m_i(t)$, a source $\hat{s}(t)$ can be recovered by modulating the corresponding subband signals from the original input and summing:

$$\underbrace{\hat{s}(t)}_{\text{estimated source}} \;=\; \underbrace{m_1(t)\,h_1(t)}_{\text{sub-band 1}} \;+\; \underbrace{m_2(t)\,h_2(t)}_{\text{sub-band 2}} \;+\; \cdots \;+\; \underbrace{m_K(t)\,h_K(t)}_{\text{sub-band K}} \qquad (2)$$

The $m_i(t)$ are gain knobs on each subband that we can twist over time to bring bands in and out of the source as needed. This performs masking on the original spectrogram. (An equivalent operation can be performed in the frequency domain; see footnote 2.) This approach, illustrated in figure 1, forms the basis of many CASA approaches (e.g. [2,3,4]).

For any specific choice of masking signals $m_i(t)$, refiltering attempts to isolate a single source from the input signal and suppress all other sources and background noises. Different sources can be isolated by choosing different masking signals. Henceforth, I will make a strong simplifying assumption that the $m_i(t)$ are binary and constant over a timescale $\tau$ of roughly 30 ms. This is physically unrealistic, because the energy in each small region of time-frequency never comes entirely from a single source. However in practice, for small numbers of sources, this approximation works quite well (figure 3). (Think of ignoring collisions by assuming separate piano players do not often hit the same note at the same time.)
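Equation (2) is straightforward to implement. A minimal sketch in Python, assuming a Butterworth bandpass filterbank for the multiband analysis (an illustrative choice, not the only possibility) and binary masks held constant over roughly 30 ms blocks:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def subband_signals(y, fs, edges):
    """Split y(t) into len(edges)-1 band-limited signals h_i(t)."""
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        bands.append(sosfiltfilt(sos, y))
    return np.stack(bands)                        # shape (K, T)

def refilter(h, masks, block):
    """Equation (2): s_hat(t) = sum_i m_i(t) h_i(t), with m_i(t) held
    constant over blocks of `block` samples (masks has shape (K, n_blocks))."""
    K, T = h.shape
    m = np.repeat(masks, block, axis=1)[:, :T]    # expand block masks to sample rate
    return (m * h).sum(axis=0)

# Toy usage: two tones, keep only the low band.
fs = 16000
t = np.arange(2 * fs) / fs
y = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 2000 * t)
h = subband_signals(y, fs, edges=[100, 1000, 4000])   # two bands
block = int(0.030 * fs)                               # ~30 ms blocks
n_blocks = int(np.ceil(h.shape[1] / block))
masks = np.zeros((2, n_blocks))
masks[0, :] = 1.0                                     # pass the 100-1000 Hz band only
s_hat = refilter(h, masks, block)                     # recovers mostly the 300 Hz tone
```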

Footnote 1: Try randomly permuting the time order of samples in a stereo mixture containing several sources and see if you still hear distinct streams when you play it back.

Footnote 2: Make a conventional spectrogram of the original signal $y(t)$ and modulate the magnitude of each short time DFT while preserving its phase: $\hat{s}_w(t) = \mathcal{F}^{-1}\!\big[\, m^i_w \,\big|\mathcal{F}[y_w(t)]\big|\, e^{\,i\angle\mathcal{F}[y_w(t)]} \,\big]$, where $\hat{s}_w(t)$ and $y_w(t)$ are the $w$th windows (blocks) of the recovered and original signals, $m^i_w$ is the masking signal for subband $i$ in window $w$, and $\mathcal{F}[\cdot]$ is the DFT.
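A minimal sketch of this frequency-domain version using scipy's STFT; the window length and hop below are illustrative choices rather than the analysis parameters used later in the paper:

```python
import numpy as np
from scipy.signal import stft, istft

def refilter_stft(y, masks, fs=16000, nperseg=512, noverlap=256):
    """Mask the magnitude of each short-time DFT while preserving its phase.

    masks : array of shape (n_freqs, n_frames) with entries in [0, 1],
            matching the STFT grid produced below.
    """
    f, t, Y = stft(y, fs=fs, nperseg=nperseg, noverlap=noverlap)
    S = masks * np.abs(Y) * np.exp(1j * np.angle(Y))   # masked magnitude, original phase
    _, s_hat = istft(S, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return s_hat[: len(y)]
```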


Figure 1: The refiltering approach to one microphone source separation. Multiband analysis of the original signal $y(t)$ gives sub-band signals $h_i(t)$ which are modulated by masking signals $m_i(t)$ (binary or real valued between 0 and 1) and recombined to give the estimated source or object $\hat{s}(t)$. Refiltering can also be thought of as a highly nonstationary Wiener filter in which both the signal and noise spectra are re-estimated at a rate $1/\tau$; the binary assumption is equivalent to assuming that over a timescale $\tau$ the signal and noise spectra are nonoverlapping.

It is a fortunate empirical fact that refiltering, even with binary masking signals, can cleanly separate sources from a single mixed recording. This can be demonstrated by taking several isolated sources or noises and mixing them in a controlled way. Since the original components are known, an "optimal" set of masking signals can be computed. For example, we might set $m_i(t)$ equal to the ratio of energy from one source in band $i$ around time $t \pm \tau$ to the sum of energies from all sources in the same band at that time (as recommended by the Wiener filter), or to a binary version which thresholds this ratio. Constructing masks in this way is also useful for generating labeled training data, as discussed below.
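Computing such "optimal" masks is simple when the isolated components are available. A minimal sketch on an STFT grid, assuming the same analysis parameters as the refiltering sketch above; the 0.5 threshold for the binary version is an illustrative choice:

```python
import numpy as np
from scipy.signal import stft

def oracle_masks(sources, fs=16000, nperseg=512, noverlap=256, eps=1e-12):
    """Ratio (Wiener-style) and binary masks for each known isolated source.

    sources : array of shape (n_sources, T) holding the unmixed signals.
    Returns (ratio, binary), each of shape (n_sources, n_freqs, n_frames).
    """
    energies = []
    for s in sources:
        _, _, S = stft(s, fs=fs, nperseg=nperseg, noverlap=noverlap)
        energies.append(np.abs(S) ** 2)               # energy per band and frame
    E = np.stack(energies)
    ratio = E / (E.sum(axis=0, keepdims=True) + eps)  # fraction of energy from each source
    binary = (ratio > 0.5).astype(float)              # thresholded version
    return ratio, binary
```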

3 Multiband grouping as a statistical pattern recognition problem

Since one-microphone source separation using refiltering is possible if the masking signals are well chosen, the essential problem becomes: how can the $m_i(t)$ be computed automatically from a single mixed recording? The goal is to group or "tag" together regions of the spectrogram that belong to the same auditory object. Fortunately, in audition (as in vision), natural signals, especially speech, exhibit a lot of regularity in the way energy is distributed across the time-frequency plane. Grouping cues based on these regularities have been studied for many years by psychophysicists and are hand built into many CASA systems. Cues are based on the idea of suspicious coincidences: roughly, "things that move together likely belong together". Thus, frequencies which exhibit common onsets, offsets, or upward/downward sweeps are more likely to be grouped into the same stream (figure 2). Also, many real world sounds have harmonic spectra; so frequencies which lie exactly on a harmonic "stack" are often perceptually grouped together. (Musically, piano players do not hit keys randomly, but instead use chords and repeated melodies.)

Figure 2: Examples of three common grouping cues for energy which often comes from a single source. (left) Harmonic stacking: frequencies which lie exactly on harmonic multiples of a single base frequency. (middle) Common onset: frequencies which suddenly increase or decrease their energy together. (right) Frequency co-modulation: energy which moves up or down in frequency at the same time.

There are several ways that statistical pattern recognition might be applied to take advantage of these cues. Methods may be roughly grouped into unsupervised ones, which learn models of isolated sources and then try to explain mixed input as being caused by the interaction of individual source models; and supervised methods, which explicitly model grouping in mixed acoustic input but require labeled data consisting of mixed input as well as masking signals. Luckily it is very easy to generate such data by mixing isolated sources in a controlled way, although the subsequent supervised learning can be difficult (footnote 3).

Figure 3: Each point represents the energy from one source versus another ($E(s_1)$ vs. $E(s_2)$) in a narrow frequency band over a 32 ms window. The plot shows all frequencies over a 2 second period from a speech mixture. Typically when one source has large energy the other does not. The binary assumption on the masking signals $m_i(t)$ is equivalent to projecting the points shown onto either the horizontal or vertical axis.

4 Results using factorial-max HMMs

Here, I will describe one (purely unsupervised) method I have pursued for automatically generating masking signals from a single microphone. The approach first trains speaker dependent hidden Markov models (HMMs) on isolated data from single talkers. These pre-trained models are then combined in a particular way to build a separation system.

First, for each speaker, a simple HMM is fit using patches of narrowband spectrograms as the pattern vectors (footnote 4). The emission densities model the typical spectral patterns produced by each talker, while the transition probabilities encourage spectral continuity. HMM training was initialized by first training a mixture of Gaussians on each speaker's data (with a single shared covariance matrix) independent of time order. Each mixture had 8192 components of dimension $1026 = 2 \times 513$; thus each HMM had 8192 states. To avoid overfitting, the transition matrices were regularized after training so that each transition (even those unobserved in the training set) had a small finite probability.
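A minimal sketch of the pattern-vector construction (the details are given in footnote 4 below; the exact form of the energy normalization is not specified there, so the version used here is an assumption):

```python
import numpy as np

def pattern_vectors(x, fs=16000, win_ms=32, hop_ms=16, nfft=1024,
                    norm_frames=8, eps=1e-10):
    """32 ms Hamming windows every 16 ms, zero-padded 1024-point DFT (513 bins),
    energy-normalized over the most recent 8 frames, 2 adjacent log-magnitude
    columns concatenated into each 1026-dimensional observation vector."""
    win = int(win_ms * fs / 1000)                 # 512 samples
    hop = int(hop_ms * fs / 1000)                 # 256 samples
    taper = np.hamming(win)
    frames = np.array([x[i:i + win] for i in range(0, len(x) - win + 1, hop)],
                      dtype=float)
    # Assumed normalization: divide each frame by the rms of the average
    # energy of the most recent 8 frames (including the current one).
    energy = (frames ** 2).mean(axis=1)
    for i in range(len(frames)):
        recent = energy[max(0, i - norm_frames + 1): i + 1].mean()
        frames[i] = frames[i] / np.sqrt(recent + eps)
    spec = np.abs(np.fft.rfft(frames * taper, n=nfft, axis=1))   # 513 bins per frame
    logmag = np.log(spec + eps)
    return np.hstack([logmag[:-1], logmag[1:]])   # concatenate 2 adjacent columns
```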

Next, to separate a new single recording which is a mixture of known speakers, these pre-trained models are combined into a factorial hidden Markov model (FHMM) architecture [5]. A FHMM consists of two or more underlying Markov chains (the hidden states) which evolve independently. The observation $\mathbf{y}_t$ at any time depends on the states of all the chains. A simple way to model this dependence is to have each chain $m$ independently propose an output $\mathbf{y}_t^{(m)}$ and then combine them to generate the observation according to some rule $\mathbf{y}_t = f(\mathbf{y}_t^{(1)}, \mathbf{y}_t^{(2)}, \ldots, \mathbf{y}_t^{(M)})$. Below, I use a model with only two chains, whose states are denoted $x_t$ and $z_t$. At each time, one chain proposes an output vector $\mathbf{a}_{x_t}$ and the other proposes $\mathbf{b}_{z_t}$. The key part of the model is the function $f$: observations are generated by taking the elementwise maximum of the proposals and adding noise. This maximum operation reflects the observation that the log magnitude spectrogram of a mixture of sources is very nearly the elementwise maximum of the individual spectrograms. The full generative model for this "factorial-max HMM" can be written simply as:

$$p(x_t = i \mid x_{t-1} = j) = T_{ij} \qquad (3)$$
$$p(z_t = k \mid z_{t-1} = \ell) = U_{k\ell} \qquad (4)$$
$$p(\mathbf{y}_t \mid x_t, z_t) = \mathcal{N}\big(\max(\mathbf{a}_{x_t}, \mathbf{b}_{z_t}),\, R\big) \qquad (5)$$

Footnote 3: Recall that refiltering can only isolate one auditory stream at a time from the scene (we are always separating "a source" from "the background"). This makes learning the masking signals an unusual problem because for any input (spectrogram) there are as many correct answers as objects in the scene. Such a highly multimodal distribution on outputs given inputs means that the mapping from auditory input to masking signals cannot be learned using backprop or other single-valued function approximators which take the average of the possible maskings present in the training data.

Footnote 4: The observations are created by concatenating the values of 2 adjacent columns of the log magnitude periodogram into a single vector. The original waveforms were sampled at 16 kHz. Periodogram windows of 32 ms at a frame rate of 16 ms were analyzed using a Hamming tapered DFT zero padded to length 1024. This gave 513 frequency samples from DC to Nyquist. Average signal energy was normalized across the most recent 8 frames before computing each DFT.


where $\mathcal{N}(\mu, R)$ denotes a Gaussian distribution with mean $\mu$ and covariance $R$, and $\max(\cdot, \cdot)$ is the elementwise maximum operation on two vectors. (There are also densities on the initial states $x_1$ and $z_1$.) This model is illustrated in figure 4. It ignores two aspects of the spectrogram data: first, Gaussian noise is used although the observations are nonnegative; second, the probability factor requiring the non-maximum output proposal to be less than the maximum proposal is missing. However, in practice these approximations are not too severe and making them allows an efficient inference procedure (see below).
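A minimal sketch of sampling from the generative model of equations (3)-(5), assuming a diagonal shared noise covariance $R$ and uniform initial state distributions for simplicity:

```python
import numpy as np

def sample_factorial_max_hmm(T_x, T_z, A, B, r, n_steps, rng=None):
    """Sample from the factorial-max HMM of equations (3)-(5).

    T_x, T_z : transition matrices (rows sum to 1) for the two chains.
    A, B     : emission mean vectors, shapes (n_states_x, D) and (n_states_z, D).
    r        : diagonal of the shared output noise covariance R, shape (D,).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.integers(T_x.shape[0])                  # initial states (uniform here)
    z = rng.integers(T_z.shape[0])
    ys, xs, zs = [], [], []
    for _ in range(n_steps):
        mu = np.maximum(A[x], B[z])                 # elementwise max of the proposals
        ys.append(mu + rng.normal(0.0, np.sqrt(r))) # eq. (5): add Gaussian noise
        xs.append(x); zs.append(z)
        x = rng.choice(T_x.shape[0], p=T_x[x])      # eq. (3)
        z = rng.choice(T_z.shape[0], p=T_z[z])      # eq. (4)
    return np.array(ys), np.array(xs), np.array(zs)
```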

Figure 4: Factorial HMM with max output semantics. Two Markov chains $x_t$ and $z_t$ evolve independently. Observations $\mathbf{y}_t$ are the elementwise max of the individual emission vectors $\mathbf{a}_{x_t}$ and $\mathbf{b}_{z_t}$ plus Gaussian noise.

In the experiment presented below, each chain represents a speaker dependent HMM (one male and one female). The emission and transition probabilities from each speaker's pre-trained HMM were used as the parameters for the combined FHMM. (The output noise covariance $R$ is shared between the two HMMs.)

Given an input waveform, the observation sequence $Y = \mathbf{y}_1, \ldots, \mathbf{y}_T$ is created from the spectrogram as before (footnote 4). Separation is done by first inferring a joint underlying state sequence $\{\hat{x}_t, \hat{z}_t\}$ of the two Markov chains in the model and then using the difference of their individual output predictions to compute a binary masking signal:

$$m_i(t) = 1 \ \text{ if } \ \mathbf{a}_{\hat{x}_t}(i) \geq \mathbf{b}_{\hat{z}_t}(i), \qquad m_i(t) = 0 \ \text{ if } \ \mathbf{a}_{\hat{x}_t}(i) < \mathbf{b}_{\hat{z}_t}(i) \qquad (6)$$
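In code, equation (6) is just an elementwise comparison of the two chains' predicted log magnitude spectrograms; the complementary mask isolates the other speaker:

```python
import numpy as np

def masks_from_predictions(A_pred, B_pred):
    """A_pred, B_pred: predicted log-magnitude spectrograms a_{x_t} and b_{z_t},
    each of shape (n_freqs, n_frames). Returns the two binary masks."""
    m = (A_pred >= B_pred).astype(float)   # eq. (6): 1 where chain x dominates
    return m, 1.0 - m                      # the second speaker gets the complement
```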

Ideally, the inferred state sequences $\{\hat{x}_t, \hat{z}_t\}$ should be the mode of the posterior distribution $p(x_t, z_t \mid Y)$. Since the hidden chains share a single visible output variable, naive inference in the FHMM graphical model yields an intractable amount of work exponential in the size of the state space of each submodel. However, because all of the observations are nonnegative and the max operation is used to combine output proposals, there is an efficient trick for computing the best joint state trajectory. At each time, we can upper bound the log-probability of generating the observation vector if one chain is in state $i$, no matter what state the other chain is in. Computing these bounds for each state setting of each chain requires only a linear amount of work in the size of the state spaces. With these bounds in hand, each time we evaluate the probability of a specific pair of states we can eliminate from consideration all state settings of either chain whose bounds are worse than the achieved probability. If pairs of states are evaluated in a sensible heuristic order (for example by ranking the bounds), this results in practice in almost all possible configurations being quickly eliminated. (This trick turns out to be equivalent to alpha-beta search in game trees.)
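The bounding idea can be illustrated one frame at a time. The sketch below is a simplification under stated assumptions: a diagonal shared noise covariance, Markov dynamics ignored (the full system applies the same pruning inside the search over joint state sequences), and a particular form for the bound, namely that with state $i$ fixed for one chain, only the bins where that chain's mean already exceeds the observation contribute unavoidable error, since the elementwise max can never fall below either proposal.

```python
import numpy as np

def best_joint_state(y, A, B, r):
    """Exact argmax over pairs (i, j) of log N(y; max(A[i], B[j]), diag(r)),
    using per-chain upper bounds to prune pairs (constant terms omitted,
    since they are identical for every pair)."""

    def bound(M):
        # Upper bound for each state of one chain, regardless of the other:
        # max(a, b) >= a, so bins where the mean exceeds y give error at least
        # (mean - y)^2; all other bins could in principle be matched exactly.
        over = np.maximum(M - y, 0.0)
        return -0.5 * (over ** 2 / r).sum(axis=1)

    def exact(i, j):
        d = y - np.maximum(A[i], B[j])
        return -0.5 * (d ** 2 / r).sum()

    bx, bz = bound(A), bound(B)          # linear work in the number of states
    order_x = np.argsort(-bx)            # visit the most promising states first
    order_z = np.argsort(-bz)
    best, best_pair = -np.inf, (None, None)
    for i in order_x:
        if bx[i] <= best:                # no pair involving state i can win
            break
        for j in order_z:
            if bz[j] <= best:            # remaining z states cannot win either
                break
            ll = exact(i, j)
            if ll > best:
                best, best_pair = ll, (i, j)
    return best_pair, best
```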

The training data for the model consists only of spectrograms of isolated examples of each speaker, but inference can be done on test data which is a spectrogram of a single mixture of known speakers. The results of separating a simple two speaker mixture are shown below. The test utterance was formed by linearly mixing two out-of-sample utterances (one male and one female) from the same speakers as the models were trained on. Figure 5 shows the original mixed spectrogram (top left) as well as the sequence of outputs $\mathbf{a}_{\hat{x}_t}$ (bottom left) and $\mathbf{b}_{\hat{z}_t}$ (bottom right) from each chain. The chain with the maximum output in any sub-band at any time has $m_i(t) = 1$, otherwise $m_i(t) = 0$ (top right). The FHMM system achieves good separation from only a single microphone (see figure 6).


Figure 5: (top left) Original spectrogram $Y = \mathbf{y}_1 \ldots \mathbf{y}_T$ of the mixed utterance. (bottom) Male and female spectrograms $\mathbf{a}_{\hat{x}_t}$ and $\mathbf{b}_{\hat{z}_t}$ predicted by the factorial HMM and used to compute the refiltering masks. (top right) Masking signals $m_i(t)$, computed by comparing the magnitudes of each model's predictions.

5 Conclusions

In this paper I have argued for the marriage of learning algorithms with the refiltering approach to CASA. I have presented results from a simple factorial HMM system on a speaker dependent separation problem which indicate that automatically learned one-microphone separation systems may be possible. In the machine learning community, the one-microphone separation problem has received much less attention than unmixing problems, while CASA researchers have not employed automatic learning techniques to full effect. Scene analysis is an interesting and challenging learning problem with exciting and practical applications, and the refiltering setup has many nice properties. First, it can work if the masking signals are chosen properly. Second, it is easy to generate lots of training data, both supervised and unsupervised. Third, a good learning algorithm, when presented with enough data, should automatically discover the sorts of grouping cues which have been built into existing systems by hand.

Furthermore, in the refiltering paradigm there is no need to make a hard decision about the number of sources present in an input. Each proposed masking has an associated score or probability; groupings with high scores can be considered "sources", while ones with low scores might be parts of the background or mixtures of other faint sources. CASA returns a collection of candidate maskings and their associated scores, and then it is up to the user to decide, based on the range of scores, the number of sources in the scene.

Many existing approaches to speech and audio processing have the potential to be applied to the monaural source separation problem. The unsupervised factorial HMM system presented in this paper is very similar to the work in the speech recognition community on parallel model combination [6,7]; however, rather than using the combined models to evaluate the likelihood of speech in noise, the efficiently inferred states are being used to generate a masking signal for refiltering. Wan and Nelson have developed dual EKF methods [8] and applied them to speech denoising, but have also informally demonstrated their potential application to monaural source separation. Attias and colleagues [9] developed a fully probabilistic model of speech in noise and used variational Bayesian techniques to perform inference and learning, allowing denoising and dereverberation; their approach clearly has the potential to be applied to the separation problem as well. Cauwenberghs [10] has a very promising approach to the problem for purely harmonic signals that takes advantage of powerful phase constraints which are ignored by other algorithms.

Unsupervised and supervised approaches can be combined to various degrees. Learning models of isolated sounds may be useful for developing feature detectors; conjunctions of such feature detectors can then be trained in a supervised fashion using labeled data.


Figure 6: Test separation results, using a 2-chain speaker dependent factorial-max HMM, followed by refiltering. (See figure 4 and text for details.) (A) Original waveform of the mixed utterance. (B) Original isolated male and female waveforms. (C) Estimated male and female waveforms.

The oscillatory correlation algorithm of Brown and Wang [4] has a low level module to detect features in the correlogram and a high level module to do grouping. Related ideas in machine vision, such as Markov networks [11] and minimum normalized cut [12], use low level operations to define weights between pixels and then higher level computations to group pixels together.

Acknowledgements

Thanks to Hagai Attias, Guy Brown, Geoff Hinton and Lawrence Saul for many insightful discussions about the CASA problem, and to three anonymous referees and many visitors to my poster for helpful comments, criticisms and references to work I had overlooked.

References

[1] A.S. Bregman (1994) Auditory Scene Analysis. MIT Press.
[2] G. Brown & M. Cooke (1994) Computational auditory scene analysis. Computer Speech and Language 8.
[3] D. Ellis (1994) A computer implementation of psychoacoustic grouping rules. Proc. 12th Intl. Conf. on Pattern Recognition, Jerusalem.
[4] G. Brown & D.L. Wang (2000) An oscillatory correlation framework for computational auditory scene analysis. NIPS 12.
[5] Z. Ghahramani & M.I. Jordan (1997) Factorial hidden Markov models. Machine Learning 29.
[6] A.P. Varga & R.K. Moore (1990) Hidden Markov model decomposition of speech and noise. IEEE Conf. Acoustics, Speech & Signal Processing (ICASSP'90).
[7] M.J.F. Gales & S.J. Young (1996) Robust continuous speech recognition using parallel model combination. IEEE Trans. Speech & Audio Processing 4.
[8] E.A. Wan & A.T. Nelson (1998) Removal of noise from speech using the dual EKF algorithm. IEEE Conf. Acoustics, Speech & Signal Processing (ICASSP'98).
[9] H. Attias, J.C. Platt & A. Acero (2001) Speech denoising and dereverberation using probabilistic models. This volume.
[10] G. Cauwenberghs (1999) Monaural separation of independent acoustical components. IEEE Symp. Circuits & Systems (ISCAS'99).
[11] W. Freeman & E. Pasztor (1999) Markov networks for low-level vision. Mitsubishi Electric Research Laboratory Technical Report TR99-08.
[12] J. Shi & J. Malik (1997) Normalized cuts and image segmentation. IEEE Conf. Computer Vision and Pattern Recognition, Puerto Rico (CVPR'97).