
Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition

Yanmin Qian, et al. “Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition.” IEEE Transactions on Audio, Speech, and Language Processing. Accepted for publication in a future issue.

Presented by Peidong Wang, 09/09/2016


Content

• Abstract
• Review of Convolutional Neural Networks
• Model Description
• Experiments
• Conclusion


Abstract

• ASR: Previous attempts to increase the number of CNN layers from 2 to 3 led to a degradation.
• CV: Recent work on images shows that the accuracy of image classification can be improved by increasing the number of convolutional layers with a carefully tuned architecture.
• ASR: Very Deep Convolutional Neural Networks use up to 10 convolutional layers and achieve a WER of 8.81% on Aurora 4, which is the best published result.


Review of Convolutional Neural Networks

• A Conventional Convolutional Neural Network (CNN)

From: Slides in CSE 5526 Neural Networks

Review of Convolutional Neural Networks

• Convolution and Pooling (Subsampling)
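
As a reminder of what the two operations do, here is a minimal NumPy sketch. It is only an illustration: the 11x40 toy input, the averaging filter, and the helper names `conv2d_valid` and `max_pool2d` are assumptions made for this example and are not taken from the slides.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Unpadded, stride-1 2-D cross-correlation, as computed by a CNN layer."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, ph, pw):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = (x.shape[0] // ph) * ph
    w = (x.shape[1] // pw) * pw
    blocks = x[:h, :w].reshape(h // ph, ph, w // pw, pw)
    return blocks.max(axis=(1, 3))

x = np.arange(11 * 40, dtype=float).reshape(11, 40)  # toy 11x40 input (time x frequency)
k = np.ones((3, 3)) / 9.0                            # hypothetical 3x3 averaging filter
y = conv2d_valid(x, k)    # each axis shrinks by 2: shape (9, 38)
z = max_pool2d(y, 1, 2)   # pool in frequency only: shape (9, 19)
print(y.shape, z.shape)
```

Each unpadded 3x3 convolution removes two rows and two columns, and each pooling step divides the pooled axis by the pooling size.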


Model Description

• Context Window Extension
  • A typical size of input features in speech recognition is 11x40, where 11 is the number of frames in the context window and 40 is the dimension of the FBank features. [*]
  • With this context window size, a convolution with a filter size of 3 can be performed in time 5 times, as in the following figure (vd6).

[*] added by the presenter

Model Description

• Context Window Extension (cont’d)

Model Description

• Context Window Extension (cont’d)
  • In Very Deep Convolutional Neural Networks (VDCNNs), the context window size is extended to 17 (and further to 21), which allows 8 (and 10) convolutions to be performed in time, respectively.
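
The frame counts above follow from simple arithmetic: an unpadded, stride-1 convolution with a filter of size 3 shortens the time axis by 2 frames. A small sketch of that count, assuming no padding and no pooling in time (the helper name `max_valid_convs` is made up for this example):

```python
def max_valid_convs(size, kernel=3):
    """Count unpadded, stride-1 convolutions that fit along an axis of length `size`."""
    count = 0
    while size >= kernel:
        size -= kernel - 1  # a size-3 filter removes 2 positions
        count += 1
    return count

for frames in (11, 17, 21):
    print(frames, "frames ->", max_valid_convs(frames), "convolutions in time")
```

For 11, 17, and 21 frames this gives 5, 8, and 10 convolutions, matching the numbers quoted on the slides.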


Model Description

• Feature Dimension Extension
  • Based on 40-dim FBank features, at most 6 convolutions and 2 poolings can be performed in frequency, leading to the vd6 model.
  • In VDCNN, the FBank features are extended to 64-dim, so that 4 more convolutions can be performed in frequency.
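
To see how convolutions and poolings consume the frequency axis, the size after each operation can be traced. This is a sketch under stated assumptions: the layer orders below are hypothetical (chosen only to match the operation counts quoted above, not read from the paper's figures), pooling is assumed to be non-overlapping with size 2 and floor division, convolutions are assumed unpadded with a filter size of 3, and the number of poolings for the 64-dim case (2) is also an assumption.

```python
def trace_sizes(size, schedule, kernel=3, pool=2):
    """Return the axis length after each operation in `schedule`.

    'C' is an unpadded, stride-1 convolution; 'P' is non-overlapping pooling
    with floor division (both are assumptions for this illustration).
    """
    sizes = [size]
    for op in schedule:
        if op == "C":
            size = size - kernel + 1
        elif op == "P":
            size = size // pool
        else:
            raise ValueError(f"unknown operation: {op!r}")
        if size < 1:
            raise ValueError("feature map vanished; schedule is too deep")
        sizes.append(size)
    return sizes

# Hypothetical layer orders, chosen only to match the counts quoted above.
freq_40 = trace_sizes(40, "CCPCCPCC")      # 6 convolutions, 2 poolings
freq_64 = trace_sizes(64, "CCPCCPCCCCCC")  # 10 convolutions, 2 poolings (assumed)
print(freq_40)
print(freq_64)
```

Under these assumptions the 40-dim axis is nearly exhausted after 6 convolutions, while the 64-dim axis leaves room for 4 more.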

Model Description

• Feature Dimension Extension (cont’d)

Model Description

• Feature Dimension Extension (cont’d)
  • Finally, the input extension is performed in both time and frequency, leading to a 17x64 input. The resulting model is named vd10.


Model Description

• Feature Dimension Extension (cont’d)
  • The full-ext model further extends the number of time frames to 21 so that 2 more convolution operations can be performed in time, giving 10 convolution operations in both time and frequency.


Model Description

• Feature Dimension Extension (cont’d)
  • To confirm that the performance gain does not come from the extended input features alone, a model with the same wider input features (17x64) but shallow convolutional layers is developed.


Model Description

• Pooling in Time
  • You may have noticed that the VDCNN models so far all use pooling in frequency and no pooling in time.
  • To investigate whether pooling in time is helpful, the vd10-tpool model is designed.

Model Description

• Pooling in Time (cont’d)

Model Description

• Padding in Feature Maps
  • In most work on CNNs for speech recognition, the convolutions are performed without padding.
  • Padding preserves the size of the feature maps and makes better use of the border information.
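
The effect of padding on the feature map size can be checked in one dimension with NumPy's `np.convolve`, whose `"valid"` mode corresponds to no padding and whose `"same"` mode corresponds to zero padding that keeps the input length. This is only an illustration: the 17-frame random input and the smoothing filter are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(17)        # toy signal with 17 frames
k = np.array([0.25, 0.5, 0.25])    # hypothetical symmetric filter of size 3

valid = np.convolve(x, k, mode="valid")  # no padding: length 17 - 3 + 1 = 15
same = np.convolve(x, k, mode="same")    # zero padding: length stays 17

print(len(valid), len(same))
# The interior outputs agree; padding only adds outputs centered on the border frames.
print(np.allclose(same[1:-1], valid))
```

Without padding the two border frames never sit at the center of a filter, so they contribute less to the output; with padding the map keeps its size, which leaves room for more convolution or pooling layers.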

Model Description

• Padding in Feature Maps (cont’d)

Model Description

• Padding in Feature Maps (cont’d)
  • Model vd10-fpad pads only in frequency, allowing more pooling operations in frequency.


Model Description

• Padding in Feature Maps (cont’d)
  • Padding in both dimensions is also applied; this model is denoted vd10-fpad-tpad.
  • In this model, since pooling is a necessary way to reduce the feature map size, pooling in time is also applied.


Model Description

• Complete Figure

Model Description

• Complete Figure (cont’d)

Model Description

• 1 Channel vs. 3 Channels Based Input Feature Maps
  • VDCNNs use a one-channel feature map as input, i.e. the static FBank feature.
  • Most work in speech recognition, however, uses three-channel features (static, ∆, and ∆∆).
  • The two choices for the number of input channels are compared for VDCNN.

Model Description

• 1 Channel vs. 3 Channels Based Input Feature Maps (cont’d)

Model Description

• 1 Channel vs. 3 Channels Based Input Feature Maps (cont’d)
  • Interestingly, the 1-channel VDCNNs are better than the models using 3 channels.
  • One possible explanation is that the information in the dynamic features may be better extracted directly from the raw static features by the VDCNN.
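
One way to see why this explanation is plausible: ∆ features are commonly computed as a fixed linear regression over neighboring static frames, which is the kind of operation a convolution in time could learn by itself. The sketch below uses the widely used regression formula with a window of ±2 frames and replicated edge frames. These choices and the helper name `delta` are assumptions for illustration; the slides do not say how ∆ and ∆∆ were computed.

```python
import numpy as np

def delta(static, n=2):
    """Regression-based delta features along the time axis (axis 0).

    d_t = sum_{i=1..n} i * (c_{t+i} - c_{t-i}) / (2 * sum_{i=1..n} i^2),
    with edge frames replicated (an assumption for this sketch).
    """
    t = static.shape[0]
    padded = np.pad(static, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, n + 1))
    out = np.zeros_like(static, dtype=float)
    for i in range(1, n + 1):
        out += i * (padded[n + i:n + i + t] - padded[n - i:n - i + t])
    return out / denom

rng = np.random.default_rng(0)
static = rng.standard_normal((17, 40))  # toy 17x40 static FBank window
d = delta(static)                       # delta channel
dd = delta(d)                           # delta-delta channel
features = np.stack([static, d, dd])    # 3-channel input, shape (3, 17, 40)
print(features.shape)
```

For n = 2 this is the same as filtering each frequency bin in time with the fixed weights [-2, -1, 0, 1, 2] / 10, so the ∆ and ∆∆ channels carry no information that is not already present in a sufficiently wide window of static features.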

Model Description

• 1 Channel vs. 3 Channels Based Input Feature Maps (cont’d)
  • Another explanation may be as follows.

Model Description

• Model Parameter Size
  • Although the number of convolutional layers is increased significantly in the proposed VDCNN, its total parameter size is observed to be smaller than that of the baseline CNN and DNN.
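
A rough intuition for this observation is that small convolutional filters are cheap in parameters compared with fully connected layers. The sketch below counts weights and biases for one layer of each kind. The layer sizes (a 3x3 convolution with 64 input and 64 output channels, and a 2048x2048 fully connected layer) are hypothetical values picked for illustration, not the configurations used in the paper.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases of a 2-D convolutional layer."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units

small_conv = conv_params(3, 3, 64, 64)  # hypothetical 3x3 conv layer
dense = dense_params(2048, 2048)        # hypothetical fully connected layer
print(small_conv, dense)
```

With these illustrative sizes, one fully connected layer holds roughly a hundred times as many parameters as one 3x3 convolutional layer, so adding convolutional layers does not have to increase the total model size much.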

Model Description

• Model Parameter Size (cont’d)

Model Description

• Convergence of Very Deep CNNs
  • The VDCNN converges faster than other model types in terms of the number of epochs [*].
  • Accordingly, although VDCNNs need more computation in each iteration (9.5 times more than the baseline CNN), they take comparable time for model training.

[*] added by the presenter

Model Description

• Convergence of Very Deep CNNs (cont’d)

Model Description

• Noise Robustness of Very Deep CNNs

Model Description

• Noise Robustness of Very Deep CNNs (cont’d)
  • To better understand how VDCNN processes noisy speech, the same frame under each condition (A, B, C or D) is propagated through the best performing model, vd10-fpad-tpad.
  • The outputs of the 1st convolutional layer and the 6th convolutional layer for A, B, C and D are plotted in the next figures.

Model Description

• Noise Robustness of Very Deep CNNs (cont’d)

Model Description

• Noise Robustness of Very Deep CNNs (cont’d)
  • To further verify the observation, the differences between noisy feature maps and clean feature maps are measured for all convolutional layers.
  • Using data in the test set, we compute the averaged mean square error (MSE) to evaluate the differences between the three noisy conditions and the clean condition.
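
The measurement described above can be sketched as follows, assuming the clean and noisy feature maps of one layer are available as arrays of identical shape (frames, channels, time, frequency) computed from time-aligned clean and noisy versions of the same frames. The array layout, the helper name `averaged_mse`, and the random toy data are assumptions for this example; the slides do not specify the exact averaging.

```python
import numpy as np

def averaged_mse(clean_maps, noisy_maps):
    """Per-frame MSE between paired feature maps, averaged over all frames."""
    clean = np.asarray(clean_maps, dtype=float)
    noisy = np.asarray(noisy_maps, dtype=float)
    if clean.shape != noisy.shape:
        raise ValueError("clean and noisy feature maps must have the same shape")
    per_frame = ((noisy - clean) ** 2).reshape(clean.shape[0], -1).mean(axis=1)
    return float(per_frame.mean())

rng = np.random.default_rng(0)
clean = rng.standard_normal((8, 4, 9, 19))               # toy maps for 8 frames
noisy = clean + 0.1 * rng.standard_normal(clean.shape)   # toy "noisy" counterpart
print(averaged_mse(clean, noisy))
```

Computing this value layer by layer for each noisy condition shows whether the distance to the clean feature maps shrinks as the signal moves deeper into the network.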

Model Description

• Noise Robustness of Very Deep CNNs (cont’d)
  • The MSE values after all operations are shown below.

Model Description

• Noise Robustness of Very Deep CNNs (cont’d)
  • The MSE values for different CNN models.


Experiments

• Experimental Setup
  • The GMM-HMM system is built with Kaldi.
  • All neural network models, including DNN/CNN/LSTM, are trained using CNTK.
  • The standard testing pipeline in the Kaldi recipes is used for decoding and scoring.
  • A similar structure (IBM-VGG), designed by researchers at IBM and NYU, is also constructed for comparison.

Experiments

• Evaluation on Aurora 4
  • Aurora 4 is a medium vocabulary task based on the Wall Street Journal (WSJ0) corpus.
  • The training sets contain 14276 utterances.
  • There are four conditions, A, B, C and D, as mentioned before.
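
Results on Aurora 4 are reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference, divided by the number of reference words. The sketch below is a minimal, self-contained version of that metric. It is not the Kaldi scoring pipeline mentioned above, and the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.2f}%")
```

In this toy example one substitution and one deletion against six reference words give a WER of 2/6.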

Experiments

• Evaluation on Aurora 4 (cont’d)

Experiments

• Evaluation on AMI
  • The AMI corpus contains around 100 hours of meeting recordings.
  • The signal was captured and synchronized with multiple microphones, such as individual head microphones (IHM, close-talk) and microphone arrays (single distant microphone (SDM) and multiple distant microphones (MDM)).
  • MDM was processed by a standard beamforming algorithm to generate a single-channel dataset.

Experiments

• Evaluation on AMI (cont’d)
  • The size of input features is investigated.

Experiments

• Evaluation on AMI (cont’d)
  • The effects of other designs are also investigated.

Experiments

• Evaluation on AMI (cont’d)
  • To better explain the superiority of VDCNNs, we use some related feature maps.

Experiments

• Evaluation on AMI (cont’d)
  • The same single synchronized frame is propagated.

Experiments

• Evaluation on AMI (cont’d)


Conclusion

• Features of VDCNN
  • The sizes of filters and pooling templates are small.
  • The input feature maps are large.
  • Other design choices, such as pooling in time, padding, and input feature map selection, are adjusted.
  • On Aurora 4, it achieves a WER of 8.81% (state of the art).
  • On AMI, its accuracy is competitive with an LSTM.


Thank You！
