Page 1
Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition
Yanmin Qian, et al. “Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition.” IEEE Transactions on Audio, Speech, and Language Processing. Accepted for publication in a future issue.
Presented by Peidong Wang, 09/09/2016
Page 2
Content
• Abstract
• Review of Convolutional Neural Networks
• Model Description
• Experiments
• Conclusion
Page 4
Abstract
• ASR: Previous attempts at increasing the number of CNN layers from 2 to 3 led to a degradation.
• CV: Recent work on images shows that image classification accuracy can be improved by increasing the number of convolutional layers with a carefully tuned architecture.
• ASR: Very Deep Convolutional Neural Networks use up to 10 convolutional layers and reach a WER of 8.81% on Aurora 4, which the paper reports as the best published result.
Page 6
Review of Convolutional Neural Networks
• A Conventional Convolutional Neural Network (CNN)
From: Slides in CSE 5526 Neural Networks
Page 7
Review of Convolutional Neural Networks
• Convolution and Pooling (Subsampling)
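The two operations on this slide can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function names, the 3x3 averaging filter, and the toy 11x40 input are assumptions chosen for the example.

```python
import numpy as np

def conv2d_valid(x, w):
    """Unpadded ("valid") 2-D cross-correlation of input x with filter w."""
    kh, kw = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the patch under the filter.
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def max_pool2d(x, ph, pw):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h, out_w = x.shape[0] // ph, x.shape[1] // pw
    x = x[:out_h * ph, :out_w * pw]
    return x.reshape(out_h, ph, out_w, pw).max(axis=(1, 3))

# Toy input: 11 frames (time) x 40 filterbank bins (frequency).
x = np.arange(11 * 40, dtype=float).reshape(11, 40)
w = np.ones((3, 3)) / 9.0                  # hypothetical averaging filter
feature_map = conv2d_valid(x, w)           # shape (9, 38)
pooled = max_pool2d(feature_map, 1, 2)     # pool in frequency only: shape (9, 19)
```

Each unpadded 3x3 convolution removes 2 rows and 2 columns, and pooling with a 1x2 template halves only the frequency axis.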
Page 9
Model Description
• Context Window Extension
  • A typical input feature size in speech recognition is 11x40, where 11 is the number of frames in the window and 40 is the dimension of the FBank features. [*]
  • With this context window size, convolutions with a filter size of 3 can be performed in time 5 times, as in the following figure (vd6).
[*] added by the presenter
Page 10
Model Description
• Context Window Extension (cont’d)
Page 11
Model Description
• Context Window Extension (cont’d)
  • In Very Deep Convolutional Neural Networks (VDCNNs), the context window size is extended to 17 (and further to 21), which allows 8 (and 10) convolutions to be performed in time, respectively.
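The frame counts above follow from simple arithmetic: an unpadded convolution with filter size 3 shortens the time axis by 2 frames. A minimal sketch, assuming stride 1 and no padding (the helper name is hypothetical):

```python
def max_valid_convs(size, kernel=3):
    """Count the unpadded, stride-1 convolutions that fit before the axis shrinks to 1."""
    count = 0
    while size - (kernel - 1) >= 1:
        size -= kernel - 1
        count += 1
    return count

for frames in (11, 17, 21):
    print(frames, "frames ->", max_valid_convs(frames), "convolutions in time")
```

For window sizes 11, 17, and 21 this gives 5, 8, and 10 convolutions, matching the numbers on the slides.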
Page 12
Model Description
• Context Window Extension (cont’d)
Page 13
Model Description
• Context Window Extension (cont’d)
Page 14
Model Description
• Feature Dimension Extension
  • With 40-dim FBank features, at most 6 convolutions and 2 poolings can be performed in frequency, leading to the vd6 model.
  • In VDCNN, the FBank features are extended to 64-dim, so that 4 more convolutions can be performed in frequency.
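How the frequency axis shrinks can be traced with a few lines of Python. The sketch below assumes filter size 3, pooling size 2, and one hypothetical ordering of 6 convolutions and 2 poolings; the actual layer arrangement is given in the paper's figures and is not reproduced here.

```python
def trace_sizes(size, ops, kernel=3, pool=2):
    """Return the axis length after each unpadded conv or non-overlapping pooling step."""
    sizes = [size]
    for op in ops:
        if op == "conv":
            size = size - kernel + 1
        elif op == "pool":
            size = size // pool
        else:
            raise ValueError(f"unknown op: {op}")
        if size < 1:
            raise ValueError("axis exhausted")
        sizes.append(size)
    return sizes

# Hypothetical ordering: 6 convolutions and 2 poolings in frequency.
ops = ["conv", "conv", "pool", "conv", "conv", "pool", "conv", "conv"]
print(trace_sizes(40, ops))  # 40-dim FBank input
print(trace_sizes(64, ops))  # 64-dim FBank input
```

Under this assumed ordering the 40-dim input is reduced to 3 bins, while the 64-dim input still has 9 bins left, which is the headroom that additional convolutions can use.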
Page 15
Model Description
• Feature Dimension Extension (cont’d)
Page 16
Model Description
• Feature Dimension Extension (cont’d)
  • Finally, the input is extended in both time and frequency, giving a 17x64 input. The resulting model is named vd10.
Page 17
Model Description
• Feature Dimension Extension (cont’d)
Page 18
Model Description
• Feature Dimension Extension (cont’d)
  • The full-ext model further extends the number of time frames to 21 so that 2 more convolutions can be performed in time, giving 10 convolutions in both time and frequency.
Page 19
Model Description
• Feature Dimension Extension (cont’d)
Page 20
Model Description
• Feature Dimension Extension (cont’d)
  • To confirm that the performance gain does not come from the extended input features alone, a model with the same wider input (17x64) but shallow convolutional layers is developed.
Page 21
Model Description
• Feature Dimension Extension (cont’d)
Page 22
Model Description
• Pooling in Time
  • You may have noticed that the VDCNN models all use pooling in frequency and no pooling in time.
  • To investigate whether pooling in time is helpful, vd10-tpool is designed.
Page 23
Model Description
• Pooling in Time (cont’d)
Page 24
Model Description
• Pooling in Time (cont’d)
Page 25
Model Description
• Padding in Feature Maps
  • In most work on CNNs for speech recognition, convolutions are performed without padding.
  • Padding preserves the size of the feature maps and makes better use of the border information.
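The size-preserving effect of padding can be checked numerically. A minimal sketch, assuming stride-1 convolutions with filter size 3 and zero padding (the slides do not state which padding values are used); the helper name and the 17x64 example map are illustrative.

```python
import numpy as np

def conv_out_size(size, kernel=3, pad=0):
    """Output length of a stride-1 convolution with `pad` values added on each side."""
    return size + 2 * pad - kernel + 1

unpadded = padded = 64
for _ in range(4):
    unpadded = conv_out_size(unpadded)       # loses 2 bins per layer
    padded = conv_out_size(padded, pad=1)    # size is preserved
print(unpadded, padded)  # 56 64

# Padding only the frequency axis of a (time, frequency) feature map.
feature_map = np.zeros((17, 64))
freq_padded = np.pad(feature_map, ((0, 0), (1, 1)), mode="constant")
print(freq_padded.shape)  # (17, 66)
```

With padding, convolutions no longer consume the axis, so reducing the feature map size is left to the pooling layers.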
Page 26
Model Description
• Padding in Feature Maps (cont’d)
Page 27
Model Description
• Padding in Feature Maps (cont’d)
  • Model vd10-fpad pads only in frequency, allowing more pooling operations in frequency.
Page 28
Model Description
• Padding in Feature Maps (cont’d)
Page 29
Model Description
• Padding in Feature Maps (cont’d)
  • Padding in both dimensions is also applied; this model is denoted vd10-fpad-tpad.
  • In this model, because pooling is needed to reduce the feature map size, pooling in time is also applied.
Page 30
Model Description
• Padding in Feature Maps (cont’d)
Page 31
Model Description
• Padding in Feature Maps (cont’d)
Page 32
Model Description
• Complete Figure
Page 33
Model Description
• Complete Figure (cont’d)
Page 34
Model Description
• 1 Channel vs. 3 Channels Based Input Feature Maps
  • VDCNNs use a one-channel feature map as input, i.e. the static FBank feature.
  • Most work in speech recognition, however, uses three-channel features (static, ∆, and ∆∆).
  • The number of input channels is compared for VDCNN.
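For reference, the ∆ and ∆∆ channels are commonly computed from the static features with a linear regression over neighbouring frames. The sketch below uses that common formula with a window of 2 and edge replication; whether the paper uses exactly this recipe is an assumption, and the function name and random input are illustrative.

```python
import numpy as np

def delta(features, window=2):
    """Regression-based deltas of a (num_frames, num_bins) array, edges replicated."""
    num_frames = features.shape[0]
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = np.zeros(features.shape, dtype=float)
    for n in range(1, window + 1):
        ahead = padded[window + n: window + n + num_frames]    # c[t + n]
        behind = padded[window - n: window - n + num_frames]   # c[t - n]
        out += n * (ahead - behind)
    return out / denom

rng = np.random.default_rng(0)
static = rng.standard_normal((17, 40))         # toy 17x40 static FBank window
d1 = delta(static)                             # ∆
d2 = delta(d1)                                 # ∆∆
three_channel = np.stack([static, d1, d2])     # shape (3, 17, 40)
```

Because each delta is a fixed linear filter along the time axis, a network that convolves over time on the static features can in principle represent the same operation.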
Page 35
Model Description
• 1 Channel vs. 3 Channels Based Input Feature Maps (cont’d)
Page 36
Model Description
• 1 Channel vs. 3 Channels Based Input Feature Maps (cont’d)
  • Interestingly, 1-channel-based VDCNNs are better than the models using 3 channels.
  • One possible explanation is that the VDCNN may extract the information carried by the dynamic features better when working directly on the raw static features.
Page 37
Model Description
• 1 Channel vs. 3 Channels Based Input Feature Maps (cont’d)
  • Another explanation may be as follows.
Page 38
Model Description
• Model Parameter Size
  • Although the number of convolutional layers is increased significantly in the proposed VDCNN, its total parameter size is observed to be smaller than that of the baseline CNN and DNN.
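One way to see why depth need not inflate the parameter count is to compare a small-filter convolutional layer with a fully connected layer. The layer sizes below are hypothetical and are not taken from the paper; the sketch only shows the counting.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases of one convolutional layer."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

def dense_params(in_units, out_units):
    """Weights plus biases of one fully connected layer."""
    return in_units * out_units + out_units

# Hypothetical sizes chosen for illustration only.
print(conv_params(3, 3, 64, 64))   # 36928
print(dense_params(2048, 2048))    # 4196352
```

With these illustrative sizes, one fully connected layer holds more than 100 times the parameters of one 3x3 convolutional layer, so stacking many small-filter layers adds comparatively little.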
Page 39
ModelDescription
•ModelParameterSize(cont’d)
39
Page 40
Model Description
• Convergence of Very Deep CNNs
  • The VDCNN converges faster than other model types, in terms of the number of epochs [*].
  • Accordingly, although VDCNNs need more computation in each iteration (9.5 times more computation compared to the baseline CNN), they take comparable time for model training.
[*] added by the presenter
Page 41
Model Description
• Convergence of Very Deep CNNs (cont’d)
Page 42
Model Description
• Noise Robustness of Very Deep CNNs
Page 43
Model Description
• Noise Robustness of Very Deep CNNs (cont’d)
  • To better understand how the VDCNN processes noisy speech, each condition (A, B, C or D) of this frame is propagated through the best-performing model, vd10-fpad-tpad.
  • The outputs of the 1st and the 6th convolutional layers for A, B, C and D are plotted in the next figures.
Page 44
Model Description
• Noise Robustness of Very Deep CNNs (cont’d)
Page 45
Model Description
• Noise Robustness of Very Deep CNNs (cont’d)
  • To further verify the observation, the differences between noisy feature maps and clean feature maps are measured for all convolutional layers.
  • Using data in the test set, we compute the averaged mean square error (MSE) to evaluate the differences between the three noisy conditions and the clean condition.
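The comparison described above can be sketched as follows. The array layout (frames, channels, time, frequency) and the function name are assumptions; the paper may normalize or aggregate differently.

```python
import numpy as np

def averaged_mse(clean_maps, noisy_maps):
    """Mean over frames of the per-frame mean squared difference between feature maps.

    Both inputs have shape (num_frames, channels, time, freq) and hold the output
    of the same layer for time-aligned clean and noisy versions of the frames.
    """
    clean = np.asarray(clean_maps, dtype=float)
    noisy = np.asarray(noisy_maps, dtype=float)
    if clean.shape != noisy.shape:
        raise ValueError("clean and noisy feature maps must have the same shape")
    per_frame = ((noisy - clean) ** 2).reshape(clean.shape[0], -1).mean(axis=1)
    return float(per_frame.mean())

# Toy example: every noisy activation differs from the clean one by 2.
clean = np.zeros((2, 1, 2, 2))
noisy = np.full((2, 1, 2, 2), 2.0)
print(averaged_mse(clean, noisy))  # 4.0
```

Computing this value layer by layer, once per noisy condition, gives a curve that shows whether deeper layers bring noisy and clean representations closer together.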
Page 46
Model Description
• Noise Robustness of Very Deep CNNs (cont’d)
  • The MSE values after all operations are shown below.
Page 47
Model Description
• Noise Robustness of Very Deep CNNs (cont’d)
  • The MSE values for different CNN models.
Page 49
Experiments
• Experimental Setup
  • The GMM-HMM system is built with Kaldi.
  • All neural network models, including DNN/CNN/LSTM, are trained using CNTK.
  • The standard testing pipeline in the Kaldi recipes is used for decoding and scoring.
  • A similar structure (IBM-VGG), designed by researchers at IBM and NYU, is also constructed for comparison.
Page 50
Experiments
• Evaluation on Aurora 4
  • Aurora 4 is a medium-vocabulary task based on the Wall Street Journal corpus (WSJ0).
  • The training sets contain 14276 utterances.
  • There are four conditions, A, B, C and D, as mentioned before.
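Results in this section are reported as word error rate (WER). As a reminder of what the metric measures, here is a minimal sketch based on word-level edit distance. It is not the Kaldi scoring script, and the function name and example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{100 * wer:.2f}%")  # one deletion out of six words
```

A WER of 8.81% therefore means fewer than 9 word errors per 100 reference words, counted over the whole test set.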
Page 51
Experiments
• Evaluation on Aurora 4 (cont’d)
Page 52
Experiments
• Evaluation on AMI
  • The AMI corpus contains around 100 hours of meeting recordings.
  • The signal was captured and synchronized with multiple microphones, including individual head microphones (IHM, close-talk) and microphone arrays (single distant microphone (SDM) and multiple distant microphones (MDM)).
  • MDM was processed by a standard beamforming algorithm to generate a single-channel dataset.
Page 53
Experiments
• Evaluation on AMI (cont’d)
  • The size of the input features is investigated.
Page 54
Experiments
• Evaluation on AMI (cont’d)
  • The effects of other designs are also investigated.
Page 55
Experiments
• Evaluation on AMI (cont’d)
  • To better explain the superiority of VDCNNs, we use some related feature maps.
Page 56
Experiments
• Evaluation on AMI (cont’d)
  • The same single synchronized frame is propagated.
Page 57
Experiments
• Evaluation on AMI (cont’d)
Page 59
Conclusion
• Features of VDCNN
  • The filters and pooling templates are small.
  • The input feature maps are large.
  • Other design choices, such as pooling in time, padding, and input feature map selection, are adjusted.
  • On Aurora 4, it achieves a WER of 8.81% (state of the art).
  • On AMI, its accuracy is competitive with that of an LSTM.