Page 1
Deep Neural Networks for Acoustic Modeling in Speech
Recognition
Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29.6 (2012): 82-97.
Presented by Peidong Wang, 04/04/2016
1
Page 2
Content
• Speech Recognition System
• GMM-HMM Model
• Training Deep Neural Networks
• Generative Pretraining
• Experiments
• Discussion
2
Page 3
Content
• Speech Recognition System
• GMM-HMM Model
• Training Deep Neural Networks
• Generative Pretraining
• Experiments
• Discussion
3
Page 4
Speech Recognition System
• Goal
  • Converting speech to text
• A Mathematical Perspective

  w = \arg\max_w P(w \mid Y)
  or
  w = \arg\max_w P(Y \mid w) P(w)
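As a toy illustration of the decoding rule, the sketch below picks the word sequence maximizing log P(Y|w) + log P(w). The candidate sequences and scores are invented for the example, not from the paper.

```python
# Toy decoder: argmax over w of log P(Y|w) + log P(w).
# The scores below are made-up values for two hypothetical hypotheses.
acoustic_ll = {"hello world": -12.0, "yellow whirled": -11.5}  # log P(Y|w)
lm_logprob = {"hello world": -2.0, "yellow whirled": -8.0}     # log P(w)

def decode(acoustic_ll, lm_logprob):
    # Working in the log domain turns the product P(Y|w)P(w) into a sum.
    return max(acoustic_ll, key=lambda w: acoustic_ll[w] + lm_logprob[w])

best = decode(acoustic_ll, lm_logprob)
```

Here the language model prior overrides the slightly better acoustic score of the implausible hypothesis, which is exactly why the P(w) term matters.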
4
Page 5
Content
• Speech Recognition System
• GMM-HMM Model
• Training Deep Neural Networks
• Generative Pretraining
• Experiments
• Discussion
5
Page 6
GMM-HMM Model
• GMM and HMM
  • GMM is short for Gaussian Mixture Model, and HMM is short for Hidden Markov Model.
• Predecessor of DNNs
  • Before Deep Neural Networks (DNNs), the most commonly used speech recognition systems consisted of GMMs and HMMs.
6
Page 7
GMM-HMM Model
• HMM
  • The HMM is used to deal with the temporal variability of speech.
• GMM
  • The GMM is used to represent the relationship between HMM states and the acoustic input.
7
Page 8
GMM-HMM Model
• Features
  • The features are typically formed by concatenating Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) computed from the raw waveform with their first- and second-order temporal differences.
8
Page 9
GMM-HMM Model
• Shortcoming
  • GMM-HMM models are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space.
  • For example, modeling the set of points that lie very close to the surface of a sphere.
9
Page 10
Content
• Speech Recognition System
• GMM-HMM Model
• Training Deep Neural Networks
• Generative Pretraining
• Experiments
• Discussion
10
Page 11
Training Deep Neural Networks
• Deep Neural Network (DNN)
  • A DNN is a feed-forward artificial neural network that has more than one layer of hidden units between its inputs and its outputs.
  • With nonlinear activation functions, a DNN is able to model an arbitrary nonlinear function (a projection from inputs to outputs). [*]

[*] Added by the presenter.
11
Page 12
Training Deep Neural Networks
• Activation Function of the Output Units
  • The activation function of the output units is the "softmax" function.
  • The mathematical expression is as follows.

  p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}
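The softmax can be sketched in a few lines of Python; subtracting the maximum before exponentiating is the standard trick to avoid overflow and does not change the result.

```python
import math

def softmax(x):
    # p_j = exp(x_j) / sum_k exp(x_k), with max subtraction for stability.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

p = softmax([1.0, 2.0, 3.0])  # probabilities over three output classes
```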
12
Page 13
Training Deep Neural Networks
• Objective Function
  • When using the softmax output function, the natural objective function (cost function) C is the cross-entropy between the target probabilities d and the outputs of the softmax, p.
  • The mathematical expression is as follows.

  C = -\sum_j d_j \log p_j
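A minimal sketch of the cross-entropy cost; with a one-hot target, as in frame classification, the cost reduces to minus the log probability assigned to the correct class.

```python
import math

def cross_entropy(d, p, eps=1e-12):
    # C = -sum_j d_j * log p_j; eps guards against log(0).
    return -sum(dj * math.log(pj + eps) for dj, pj in zip(d, p))

# One-hot target: only the probability of the correct class matters.
c = cross_entropy([0.0, 1.0, 0.0], [0.2, 0.7, 0.1])
```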
13
Page 14
Training Deep Neural Networks
• Weight Penalties and Early Stopping
  • To reduce overfitting, large weights can be penalized in proportion to their squared magnitude, or the learning can simply be terminated at the point at which performance on a held-out validation set starts getting worse.
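Both ideas can be sketched in Python; the learning rate, penalty strength, and patience values below are arbitrary illustrations, not values from the paper.

```python
def sgd_step(weights, grads, lr=0.1, l2=0.01):
    # Gradient step with an L2 penalty: each weight is pulled toward zero
    # in proportion to its magnitude (the gradient of 0.5*l2*w^2 is l2*w).
    return [w - lr * (g + l2 * w) for w, g in zip(weights, grads)]

def early_stop(val_errors, patience=1):
    # Stop once validation error has worsened `patience` checks in a row.
    worse = 0
    for i in range(1, len(val_errors)):
        worse = worse + 1 if val_errors[i] > val_errors[i - 1] else 0
        if worse >= patience:
            return i  # index at which training would be stopped
    return len(val_errors) - 1

w = sgd_step([1.0, -2.0], [0.0, 0.0])          # pure weight decay: shrink
stop = early_stop([0.5, 0.4, 0.35, 0.37, 0.4])  # stops when error rises
```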
14
Page 15
Training Deep Neural Networks
• Overfitting Reduction
  • Generally speaking, there are three methods.
  • Weight penalties and early stopping can reduce overfitting, but only by removing much of the modeling power.
  • Very large training sets can reduce overfitting, but only by making training very computationally expensive.
  • Generative pretraining.
15
Page 16
Content
• Speech Recognition System
• GMM-HMM Model
• Training Deep Neural Networks
• Generative Pretraining
• Experiments
• Discussion
16
Page 17
Generative Pretraining
• Purpose
  • The multiple layers of feature detectors (the result of this step) can be used as a good starting point for a discriminative "fine-tuning" phase, during which backpropagation through the DNN slightly adjusts the weights and improves the performance.
  • In addition, this step can significantly reduce overfitting.
17
Page 18
Generative Pretraining
• Restricted Boltzmann Machine (RBM)
  • An RBM consists of a layer of stochastic binary "visible" units that represent binary input data, connected to a layer of stochastic binary hidden (latent) units that learn to model significant nonindependencies between the visible units.
  • There are undirected connections between visible and hidden units but no visible-visible or hidden-hidden connections.
18
Page 19
Generative Pretraining
• Restricted Boltzmann Machine (RBM) (Cont'd)
  • The framework of an RBM is shown below.

From: Slides in CSE 5526 Neural Networks
Page 20
Generative Pretraining
• Restricted Boltzmann Machine (RBM) (Cont'd)
  • An RBM uses a single set of parameters, W, to define the joint probability of a vector of values of the observable variables, v, and a vector of values of the latent variables, h, via an energy function, E.
20
p(v,h;W) = \frac{1}{Z} e^{-E(v,h;W)}, \quad Z = \sum_{v',h'} e^{-E(v',h';W)}

E(v,h) = -\sum_{i \in \mathrm{visible}} a_i v_i - \sum_{j \in \mathrm{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}
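The binary-RBM energy can be computed directly from its definition; the parameter values below are hand-picked for illustration.

```python
def rbm_energy(v, h, a, b, W):
    # E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i h_j w_ij
    vis = sum(ai * vi for ai, vi in zip(a, v))        # visible bias term
    hid = sum(bj * hj for bj, hj in zip(b, h))        # hidden bias term
    pair = sum(v[i] * h[j] * W[i][j]
               for i in range(len(v)) for j in range(len(h)))
    return -(vis + hid + pair)

# Two visible units, one hidden unit, toy parameters.
E = rbm_energy([1, 0], [1], a=[0.5, -0.5], b=[0.2], W=[[1.0], [0.3]])
```

Lower energy means higher joint probability, since p(v,h) ∝ e^{-E(v,h)}.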
Page 21
Generative Pretraining
• Restricted Boltzmann Machine (RBM) (Cont'd)
  • The probability that the network assigns to a visible vector, v, is given by summing over all possible hidden vectors.

  p(v) = \frac{1}{Z} \sum_h e^{-E(v,h)}

  • The derivative of the log probability of a training set with respect to a weight is surprisingly simple. The angle brackets denote expectations under the corresponding distribution.

  \frac{1}{N} \sum_{n=1}^{N} \frac{\partial \log p(v_n)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}
21
Page 22
Generative Pretraining
• Restricted Boltzmann Machine (RBM) (Cont'd)
  • The learning rule is thus as follows.

  \Delta w_{ij} = \varepsilon (\langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}})

  • A better learning procedure is contrastive divergence (CD), shown below. The subscript "recon" denotes a step in CD when the states of the visible units are set to 0 or 1 according to the current states of the hidden units.

  \Delta w_{ij} = \varepsilon (\langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{recon}})
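A minimal CD-1 sketch for a tiny binary RBM, written as an illustration of the update rule rather than the authors' implementation; biases are held fixed and sampling uses Python's random module.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1, rng=random):
    # One CD-1 step: up-pass, sample h, reconstruct v, up-pass again,
    # then Delta w_ij = lr * (<v_i h_j>_data - <v_i h_j>_recon).
    nv, nh = len(v0), len(b)
    ph0 = [sigmoid(b[j] + sum(v0[i] * W[i][j] for i in range(nv)))
           for j in range(nh)]
    h0 = [1 if rng.random() < p else 0 for p in ph0]
    # Reconstruction: visible states set to 0/1 from the hidden states.
    pv1 = [sigmoid(a[i] + sum(h0[j] * W[i][j] for j in range(nh)))
           for i in range(nv)]
    v1 = [1 if rng.random() < p else 0 for p in pv1]
    ph1 = [sigmoid(b[j] + sum(v1[i] * W[i][j] for i in range(nv)))
           for j in range(nh)]
    for i in range(nv):
        for j in range(nh):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    return W

random.seed(0)
W = cd1_update([1, 0, 1], [[0.0] * 2 for _ in range(3)],
               a=[0.0] * 3, b=[0.0] * 2)
```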
22
Page 23
Generative Pretraining
• Modeling Real-Valued Data
  • Real-valued data, such as MFCCs, are more naturally modeled by linear variables with Gaussian noise, and the RBM energy function can be modified to accommodate such variables, giving a Gaussian-Bernoulli RBM (GRBM).

  E(v,h) = \sum_{i \in \mathrm{vis}} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j \in \mathrm{hid}} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}
23
Page 24
Generative Pretraining
• Stacking RBMs to Make a Deep Belief Network
  • After training an RBM on the data, the inferred states of the hidden units can be used as data for training another RBM that learns to model the significant dependencies between the hidden units of the first RBM.
  • This can be repeated as many times as desired to produce many layers of nonlinear feature detectors that represent progressively more complex statistical structure in the data.
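The greedy layer-wise stacking can be sketched as a loop in which each layer's hidden activations become the "data" for the next layer. The train_rbm function here is a placeholder returning fixed weights, standing in for real CD training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def up(v, W):
    # Deterministic up-pass: hidden probabilities given visible values.
    return [sigmoid(sum(vi * W[i][j] for i, vi in enumerate(v)))
            for j in range(len(W[0]))]

def train_rbm(data, n_hidden):
    # Placeholder: pretend CD training produced small constant weights.
    return [[0.1] * n_hidden for _ in range(len(data[0]))]

data = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # toy "training set"
layers = []
for n_hidden in (4, 2):                    # two stacked RBMs
    W = train_rbm(data, n_hidden)
    layers.append(W)
    data = [up(v, W) for v in data]        # activations feed the next RBM
```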
24
Page 25
Generative Pretraining
• Stacking RBMs to Make a Deep Belief Network (Cont'd)

From: The paper
Page 26
Generative Pretraining
• Interfacing a DNN with an HMM
  • In an HMM framework, the hidden variables denote the states of the phone sequence, and the "visible" variables denote the feature vectors. [*]

[*] Added by the presenter.

From: Gales, Mark, and Steve Young. "The application of hidden Markov models in speech recognition." Foundations and Trends in Signal Processing 1.3 (2008): 195-304.
Page 27
Generative Pretraining
• Interfacing a DNN with an HMM (Cont'd)
  • To compute a Viterbi alignment or to run the forward-backward algorithm within the HMM framework, we require the likelihood p(AcousticInput | HMM state).
  • A DNN, however, outputs probabilities of the form p(HMM state | AcousticInput).
27
Page 28
Generative Pretraining
• Interfacing a DNN with an HMM (Cont'd)
  • The posterior probabilities that the DNN outputs can be converted into scaled likelihoods by dividing them by the frequencies of the HMM states in the forced alignment that is used for fine-tuning the DNN.
  • Forced alignment is a procedure used to generate labels for the training process. [*]

[*] Added by the presenter.
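In the log domain, dividing the posteriors by the state frequencies becomes a subtraction of log priors; a minimal sketch with made-up numbers for two states.

```python
import math

def scaled_log_likelihoods(log_posteriors, state_priors, eps=1e-12):
    # log p(acoustics|state) + const = log p(state|acoustics) - log p(state);
    # the priors are state frequencies counted from the forced alignment.
    return [lp - math.log(prior + eps)
            for lp, prior in zip(log_posteriors, state_priors)]

# A frequent state (prior 0.9) and a rare state (prior 0.1), toy posteriors.
ll = scaled_log_likelihoods([math.log(0.6), math.log(0.4)], [0.9, 0.1])
```

Note how the rare state's score is boosted relative to its raw posterior, which is the point of the correction when training labels are unbalanced.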
28
Page 29
Generative Pretraining
• Interfacing a DNN with an HMM (Cont'd)
  • All of the likelihoods produced in this way are scaled by the same unknown factor of p(AcousticInput).
  • Although this appears to have little effect on some recognition tasks, it can be important for tasks where training labels are highly unbalanced.
29
Page 30
Content
• Speech Recognition System
• GMM-HMM Model
• Training Deep Neural Networks
• Generative Pretraining
• Experiments
• Discussion
30
Page 31
Experiments
• Phonetic Classification and Recognition on TIMIT
  • The TIMIT dataset is a relatively small dataset that provides a simple and convenient way of testing new approaches to speech recognition.
31
Page 32
Experiments
• Phonetic Classification and Recognition on TIMIT (Cont'd)

From: The paper
Page 33
Experiments
• Bing-Voice-Search Speech Recognition Task
  • This task used 24 h of training data with a high degree of acoustic variability caused by noise, music, side-speech, accents, sloppy pronunciation, etc.
  • The best DNN-HMM acoustic model achieved a sentence accuracy of 69.6% on the test set, compared with 63.8% for a strong, minimum phone error (MPE)-trained GMM-HMM baseline.
33
Page 34
Experiments
• Bing-Voice-Search Speech Recognition Task (Cont'd)

From: The paper
Page 35
Experiments
• Other Large Vocabulary Tasks
  • Switchboard Speech Recognition Task (a corpus containing over 300 h of training data)
  • Google Voice Input Speech Recognition Task
  • YouTube Speech Recognition Task
  • English Broadcast News Speech Recognition Task
35
Page 36
Experiments
• Other Large Vocabulary Tasks (Cont'd)

From: The paper
Page 37
Content
• Speech Recognition System
• GMM-HMM Model
• Training Deep Neural Networks
• Generative Pretraining
• Experiments
• Discussion
37
Page 38
Discussion
• Convolutional DNNs for Phone Classification and Recognition
  • Although convolutional models along the temporal dimension achieved good classification results on the TIMIT corpus, applying them to phone recognition is not straightforward.
  • This is because temporal variations in speech can already be partially handled by the dynamic programming procedure in the HMM component and by hidden trajectory models.
38
Page 39
Discussion
• Speeding Up DNNs at Recognition Time
  • The time that a DNN-HMM system requires to recognize 1 s of speech can be reduced from 1.6 s to 210 ms, without decreasing recognition accuracy, by quantizing the weights down to 8 bits on a CPU.
  • Alternatively, it can be reduced to 66 ms by using a graphics processing unit (GPU).
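One simple uniform 8-bit quantization scheme is sketched below; the paper does not specify this exact recipe, so the range-based scale/offset encoding here is an illustrative assumption.

```python
def quantize8(weights):
    # Map floats to 256 integer levels over the observed range; keep the
    # offset and scale so the weights can be approximately recovered.
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0  # guard against a constant weight list
    q = [round((w - lo) / scale) for w in weights]
    return q, lo, scale

def dequantize8(q, lo, scale):
    return [lo + qi * scale for qi in q]

q, lo, scale = quantize8([-1.0, 0.0, 0.5, 1.0])
approx = dequantize8(q, lo, scale)
```

Each weight now needs one byte instead of four, and the rounding error is at most half a quantization step.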
39
Page 40
Discussion
• Alternative Pretraining Methods for DNNs
  • It is possible to learn a DNN by starting with a shallow neural net with a single hidden layer. Once this net has been trained discriminatively, a second hidden layer is interposed between the first hidden layer and the softmax output units, and the whole network is again discriminatively trained. This can be continued until the desired number of hidden layers is reached, after which full backpropagation fine-tuning is applied.
40
Page 41
Discussion
• Alternative Pretraining Methods for DNNs (Cont'd)
  • Purely discriminative training of the whole DNN from random initial weights works well, too.
  • Various types of autoencoder with one hidden layer can also be used in the layer-by-layer generative pretraining process.
41
Page 42
Discussion
• Alternative Fine-Tuning Methods for DNNs
  • Most DBN-DNN acoustic models are fine-tuned by applying stochastic gradient descent with momentum to small minibatches of training cases.
  • More sophisticated optimization methods can be used, but it is not clear that they are worthwhile, since the fine-tuning process is typically stopped early to prevent overfitting.
42
Page 43
Discussion
• Using DBN-DNNs to Provide Input Features for GMM-HMM Systems
  • This class of methods uses neural networks to provide the feature vectors for the training process of the GMM in a GMM-HMM system.
  • The most common approach is to train a randomly initialized neural net with a narrow bottleneck middle layer and to use the activations of the bottleneck hidden units as features.
43
Page 44
Discussion
• Using DNNs to Estimate Articulatory Features for Detection-Based Speech Recognition
  • DBN-DNNs are effective for detecting subphonetic speech attributes (also known as phonological or articulatory features).
44
Page 45
Discussion
• Summary
  • Most of the gain comes from using DNNs to exploit information in neighboring frames and from modeling tied context-dependent states.
  • There is no reason to believe that the optimal types of hidden units or the optimal network architectures are being used, and it is highly likely that both the pretraining and fine-tuning algorithms can be modified to reduce the amount of overfitting and the amount of computation.
45
Page 47
Investigation of Speech Separation as a Front-End for
Noise Robust Speech Recognition
Narayanan, Arun, and DeLiang Wang. "Investigation of speech separation as a front-end for noise robust speech recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 22.4 (2014): 826-835.
Presented by Peidong Wang, 04/04/2016
47
Page 48
Content
• Introduction
• System Description
• Evaluation Results
• Discussion
48
Page 49
Content
• Introduction
• System Description
• Evaluation Results
• Discussion
49
Page 50
Introduction
• Background
  • Although automatic speech recognition (ASR) systems have become fairly powerful, the inherent variability of speech can still pose challenges.
  • Typically, ASR systems that work well in clean conditions suffer a drastic loss of performance in the presence of noise.
50
Page 51
Introduction
• Feature-Based Methods
  • This class of methods focuses on feature extraction or feature normalization.
  • Feature-based techniques have the potential to generalize well, but do not always produce the best results.
51
Page 52
Introduction
• Two Groups of Feature-Based Methods
  • When stereo [*] data is unavailable, prior knowledge about speech and/or noise is used, as in spectral-reconstruction-based missing feature methods, direct masking methods, and feature enhancement methods.
  • When stereo data is available, feature mapping methods and recurrent neural networks have been used.

[*] By stereo we mean noisy and the corresponding clean signals.
52
Page 53
Introduction
• Model-Based Methods
  • The ASR model parameters are adapted to match the distribution of noisy or enhanced features.
  • Model-based methods work well when the underlying assumptions are met, but typically involve significant computational overhead.
  • The best performances are usually obtained by combining feature-based and model-based methods.
53
Page 54
Introduction
• Supervised Classification Based Speech Separation
  • Stereo training data is also used by supervised classification based speech separation algorithms.
  • Such algorithms typically estimate the ideal binary mask (IBM), a binary mask defined in the time-frequency (T-F) domain that identifies speech-dominant and noise-dominant T-F units.
  • The above method can be extended to the ideal ratio mask (IRM), which represents the ratio of speech to mixture energy.
54
Page 55
Content
• Introduction
• System Description
• Evaluation Results
• Discussion
55
Page 56
System Description
• Block Diagram of the Proposed System

From: The paper
Page 57
System Description
• Addressing Additive Noise and Convolutional Distortion
  • The additive noise and the convolutional distortion are dealt with in two separate stages: noise removal followed by channel compensation.
  • Noise is removed via T-F masking using the IRM. To compensate for channel mismatch and the errors introduced by masking, a non-linear mapping function that undoes these distortions is learned.
57
Page 58
System Description
• Time-Frequency Masking
58
Page 59
System Description
• Time-Frequency Masking (Cont'd)
  • Here the authors perform T-F masking in the mel-frequency domain, unlike some of the other systems that operate in the gammatone feature domain.
  • To obtain the mel-spectrogram of a signal, it is first pre-emphasized and transformed to the linear frequency domain using a 320-point fast Fourier transform (FFT) with a 20 ms Hamming window. The 161-dimensional spectrogram is then converted to a 26-channel mel-spectrogram.
59
Page 60
System Description
• Time-Frequency Masking (Cont'd)
  • The authors use DNNs to estimate the IRM, as DNNs show good performance and training using stochastic gradient descent scales well compared to other nonlinear discriminative classifiers.
60
Page 61
System Description
• Time-Frequency Masking (Cont'd)
  • Target Signal
    • The ideal ratio mask is defined as the ratio of the clean signal energy to the mixture energy at each time-frequency unit.
    • The mathematical expression is shown below.

  IRM(t,f) = \frac{10^{SNR(t,f)/10}}{10^{SNR(t,f)/10} + 1}

  SNR(t,f) = 10 \log_{10}(X(t,f) / N(t,f))
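The IRM definition translates directly into code; the speech and noise energy values below are made up for the example.

```python
import math

def snr_db(speech_energy, noise_energy, eps=1e-12):
    # SNR(t,f) = 10 log10(X(t,f) / N(t,f)); eps avoids division by zero.
    return 10.0 * math.log10(speech_energy / (noise_energy + eps))

def irm(snr_db_value):
    # IRM(t,f) = 10^(SNR/10) / (10^(SNR/10) + 1)
    r = 10.0 ** (snr_db_value / 10.0)
    return r / (r + 1.0)

# Speech energy 4x the noise energy in one T-F unit -> mask value 0.8.
m = irm(snr_db(4.0, 1.0))
```

Note that the IRM is simply the fraction of the unit's energy attributed to speech, so it always lies in (0, 1).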
61
Page 62
System Description
• Time-Frequency Masking (Cont'd)
  • Target Signal
    • Rather than estimating the IRM directly, the authors estimate a transformed version of the SNR.
    • The mathematical expression of the sigmoidal transformation is shown below.

  d(t,f) = \frac{1}{1 + \exp(-\alpha (SNR(t,f) - \beta))}
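A sketch of the sigmoidal transform and its inverse, which maps values back from the (0, 1) range; the α and β values here are placeholders, not the paper's settings.

```python
import math

def snr_to_target(snr_db, alpha=1.0, beta=0.0):
    # d(t,f) = 1 / (1 + exp(-alpha * (SNR - beta)))
    return 1.0 / (1.0 + math.exp(-alpha * (snr_db - beta)))

def target_to_snr(d, alpha=1.0, beta=0.0):
    # Inverse sigmoid: recovers the SNR (and hence the IRM) at test time.
    return beta - math.log(1.0 / d - 1.0) / alpha

t = snr_to_target(2.0)     # 2 dB SNR squashed into (0, 1)
back = target_to_snr(t)    # round-trips to the original SNR
```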
62
Page 63
System Description
• Time-Frequency Masking (Cont'd)
  • Target Signal
    • During testing, the values output by the DNN are mapped back to their corresponding IRM values.
63
Page 64
System Description
• Time-Frequency Masking (Cont'd)
  • Features
    • Feature extraction is performed both at the fullband and the subband level.
    • A combination of features is used: 31-dimensional MFCCs, 13-dimensional RASTA-filtered PLPs, and 15-dimensional amplitude modulation spectrogram (AMS) features.
64
Page 65
System Description
• Time-Frequency Masking (Cont'd)
  • Features
    • The fullband features are derived by splicing together fullband MFCCs and RASTA-PLPs, along with their delta and acceleration components, and subband AMS features.
    • The subband features are derived by splicing together subband MFCCs, RASTA-PLPs, and AMS features. Some auxiliary components are also added.
65
Page 66
System Description
• Time-Frequency Masking (Cont'd)
  • Supervised Learning
    • IRM estimation is performed in two stages. In the first stage, multiple DNNs are trained using fullband and subband features. The final estimate is obtained using an MLP that combines the outputs of the fullband and the subband DNNs.
66
Page 67
System Description
• Time-Frequency Masking (Cont'd)
  • Supervised Learning
    • The fullband DNNs would be cognizant of the overall spectral shape of the IRM and the information conveyed by the fullband features, whereas the subband DNNs are expected to be more robust to noise occurring at frequencies outside their passband.
67
Page 68
System Description
• Time-Frequency Masking (Cont'd)

From: The paper
Page 69
System Description
• Feature Mapping
69
Page 70
System Description
• Feature Mapping (Cont'd)
  • Even after T-F masking, channel mismatch can still significantly impact performance.
  • This happens for two reasons. Firstly, the algorithm learns to estimate the ratio mask using mixtures of speech and noise recorded with a single microphone. Secondly, because channel mismatch is convolutional, speech and noise, which now includes both background noise and convolutive noise, are clearly not uncorrelated.
70
Page 71
System Description
• Feature Mapping (Cont'd)
  • The goal of feature mapping in this work is to learn the spectro-temporal correlations that exist in speech in order to undo the distortions introduced by unseen microphones and by the first stage of the algorithm.
71
Page 72
System Description
• Feature Mapping (Cont'd)
  • Target Signal
    • The target is the clean log-mel spectrogram (LMS). The "clean" LMS here corresponds to that obtained from the clean signals recorded using a single microphone in a single filter setting.
72
Page 73
System Description
• Feature Mapping (Cont'd)
  • Target Signal
    • Instead of using the LMS directly as the target, the authors apply a linear transform to limit the target values to the range [0,1], so that the sigmoidal transfer function can be used for the output layer of the DNN.
    • The mathematical expression is as follows.

  X_d(t,f) = \frac{\ln(X(t,f)) - \min(\ln(X(\cdot,f)))}{\max(\ln(X(\cdot,f))) - \min(\ln(X(\cdot,f)))}
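The per-channel min-max normalization and its inverse (used at test time to restore the training-set dynamic range) can be sketched as follows; the input here is assumed to be already-log values for one frequency channel.

```python
def normalize_lms(lms_column):
    # X_d = (lnX - min lnX(.,f)) / (max lnX(.,f) - min lnX(.,f)),
    # applied per frequency channel over time.
    lo, hi = min(lms_column), max(lms_column)
    return [(x - lo) / (hi - lo) for x in lms_column], lo, hi

def denormalize_lms(norm, lo, hi):
    # Maps DNN outputs back to the training-set dynamic range.
    return [lo + n * (hi - lo) for n in norm]

norm, lo, hi = normalize_lms([-3.0, 0.0, 2.0])  # toy log-mel values
back = denormalize_lms(norm, lo, hi)            # round-trips the inputs
```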
73
Page 74
System Description
• Feature Mapping (Cont'd)
  • Target Signal
    • During testing, the output of the DNN is mapped back to the dynamic range of the utterances in the training set.
74
Page 75
System Description
• Feature Mapping (Cont'd)
  • Features
    • The authors use both the noisy and the masked LMS.
  • Supervised Learning
    • Unlike the DNNs used for IRM estimation, the hidden layers of the DNN for this task use rectified linear units (ReLUs). In addition, the output layer uses sigmoid activations.
75
Page 76
System Description
• Feature Mapping (Cont'd)

From: The paper
Page 77
System Description
• Acoustic Modeling
77
Page 78
System Description
• Acoustic Modeling (Cont'd)
  • The acoustic models are trained using the Aurora-4 dataset.
  • Aurora-4 is a 5000-word closed vocabulary recognition task based on the Wall Street Journal database. The corpus has two training sets, clean and multi-condition, both with 7138 utterances.
78
Page 79
System Description
• Acoustic Modeling (Cont'd)
  • Gaussian Mixture Models
    • The HMMs and the GMMs are initially trained using the clean training set. The clean models are then used to initialize the multi-condition models; both clean and multi-condition models have the same structure and differ only in transition and observation probability densities.
79
Page 80
System Description
• Acoustic Modeling (Cont'd)
  • Deep Neural Networks
    • The authors first align the clean training set to obtain senone labels at each time frame for all utterances in the training set. DNNs are then trained to predict the posterior probability of senones using either clean features or features extracted from the multi-condition set.
80
Page 81
System Description
• Diagonal Feature Discriminant Linear Regression
81
Page 82
System Description
• Diagonal Feature Discriminant Linear Regression (Cont'd)
  • dFDLR is a semi-supervised feature adaptation technique.
  • The motivation for developing dFDLR is to address the problem of generalization to unseen microphone conditions in the dataset, which is where the DNN-HMM systems perform the worst.
82
Page 83
System Description
• Diagonal Feature Discriminant Linear Regression (Cont'd)
  • To apply dFDLR, we first obtain an initial senone-level labeling for the test utterances using the unadapted models. Features are then transformed to minimize the cross-entropy error in predicting these labels.
  • The mathematical expressions are as follows.

  \hat{O}_t(f) = w_f \cdot O_t(f) + b_f

  \min \sum_t E(s_t, D_{out}(\hat{O}_{t-5} \ldots \hat{O}_{t+5}))
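The dFDLR transform itself is just a per-dimension affine map of each feature frame; a minimal sketch (starting from w = 1, b = 0, i.e. the identity, is a natural initialization before backpropagation adapts the parameters).

```python
def dfdlr_transform(frame, w, b):
    # Per-dimension affine transform: O_hat(f) = w_f * O(f) + b_f.
    return [wf * of + bf for wf, of, bf in zip(w, frame, b)]

frame = [0.5, -1.0, 2.0]                                   # one toy frame
identity = dfdlr_transform(frame, [1.0] * 3, [0.0] * 3)    # unadapted
adapted = dfdlr_transform(frame, [0.9, 1.1, 1.0], [0.1, 0.0, -0.2])
```

Because each dimension has only one scale and one offset, very few parameters need to be learned per utterance, which is what makes the adaptation practical at test time.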
83
Page 84
System Description
• Diagonal Feature Discriminant Linear Regression (Cont'd)
  • The parameters can easily be learned within the DNN framework by adding a layer between the input layer and the first hidden layer of the original DNN. After initialization, the standard backpropagation algorithm is run for 10 epochs to learn the parameters of the dFDLR model. During backpropagation, the weights of the original hidden layers are kept unchanged and only the parameters of the dFDLR layer are updated.
84
Page 85
Content
• Introduction
• System Description
• Evaluation Results
• Discussion
85
Page 86
Evaluation Results

From: The paper
Page 87
Evaluation Results

From: The paper
Page 88
Content
• Introduction
• System Description
• Evaluation Results
• Discussion
88
Page 89
Discussion
• Several interesting observations can be made from the results presented in the previous section.
• Firstly, the results clearly show that the speech separation front-end does a good job of removing noise and handling channel mismatch.
• Secondly, with no channel mismatch, T-F masking alone works well in removing noise.
89
Page 90
Discussion
• Finally, directly performing feature mapping from noisy features to clean features performs reasonably, but not as well as the proposed front-end.
90