Adapting Wavenet for Speech Enhancement
Dario Rethage | July 12, 2017
I am
- Master student
- 6 months @ Music Technology Group, Universitat Pompeu Fabra
- Deep learning for acoustic source separation
- With Jordi Pons, Audio Signal Processing Lab
Learning from raw audio
- High dimensionality
- Many levels of structure
- No hand-crafted feature extraction
- No discarding of information (e.g. phase)
- Until recently, computationally intractable

[Figure: levels of structure in raw audio (timbre, phonemes, phonetic transitions)]
Wavenet: A Generative Model for Raw Audio
- Speech synthesis at the waveform level using an auto-regressive, generative model
- Generates an 8-bit (256-value) probability distribution per output sample
- Samples from the output distribution (probabilistic task)
- Considerable parameter savings
  § Small filters
  § Large dilations
- 16 kHz sampling rate (wide-band)
- Very slow
- Not strictly end-to-end
Wavenet: Key Features
- Causality
- Gated units
- Softmax output
- μ-law quantization
- Dilation
- Stacks
Causality
- Only the previous and current samples inform the prediction of sample t+1
- Asymmetric padding
- 2x1 filters
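A minimal numpy sketch (illustrative, not the authors' code) of what asymmetric padding buys: padding a single zero on the left makes a 2x1 filter causal, so the output at time t never sees a future sample.

```python
import numpy as np

def causal_conv_2x1(x, w):
    """Causal 2x1 convolution: one zero of left-only (asymmetric) padding,
    so y[t] depends only on x[t-1] and x[t], never on the future."""
    x_padded = np.concatenate(([0.0], x))  # asymmetric padding
    return w[0] * x_padded[:-1] + w[1] * x_padded[1:]

x = np.array([1.0, 2.0, 3.0, 4.0])
y = causal_conv_2x1(x, w=np.array([0.5, 0.5]))
# y[0] mixes only the zero pad and x[0]; output length equals input length
```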
Gated Units
- Control the contribution of each layer
μ-law Quantization
- Non-linear companding
- Better use of the 8-bit quantization space
Softmax
- No assumptions about the output distribution
- Well suited for multi-modal distributions
- Requires discretization of the output
Stacks
- Repeat the dilation pattern
- More depth, less width
Dilation
- Larger receptive field, same number of parameters
- Dilations grow by powers of 2
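The parameter savings follow from simple arithmetic: each dilated layer widens the receptive field by (filter length − 1) × dilation, so doubling dilations gives exponential context growth from a linear number of layers. A small sketch (illustrative helper, not from the talk):

```python
def receptive_field(filter_len, dilations):
    """Receptive field (in samples) of a stack of dilated convolutions:
    each layer adds (filter_len - 1) * dilation samples of context."""
    return 1 + sum((filter_len - 1) * d for d in dilations)

dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
rf = receptive_field(2, dilations)       # 2x1 filters, one stack -> 1024
# The same 10 layers without dilation would cover only 11 samples.
```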
Wavenet: Reimplementation
- Many open questions
  § Filter depths
  § Number of layers
- Trained on VCTK: 109 native speakers of English, good phonetic coverage
- Proof of concept
- ~600k parameters
Speech Enhancement
- A task within acoustic source separation
- Deterministic
- Goal: improve the intelligibility and/or overall perceptual quality of a speech signal
- Until recently, the greatest successes were in the frequency domain, e.g. estimating a spectral mask
m_t = s_t + b_t    (m: mixture, s: speech, b: background)

Either estimate ŝ given m directly, or estimate b̂ given m, since s = m − b
A Wavenet for Source Separation
- Generic architecture, suitable for any acoustic source separation task
- Blind two-source separation
- Discriminative
- End-to-end
  § Time-domain input/output
  § No pre-/post-filtering
  § No quantization
- 16 kHz sampling rate (wide-band)
- Flexible
Key Contributions
- Non-causality
- Real-valued predictions
- Non-autoregressive
- Target fields
- Enforced time continuity
- Energy-conserving loss
Non-causality
- Equal context in the past and future
- Symmetric padding
- 3x1 filters
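In contrast to the causal case, symmetric padding centers a 3x1 filter on each output sample, giving one sample of context on each side. A minimal numpy sketch (illustrative, not the authors' code):

```python
import numpy as np

def noncausal_conv_3x1(x, w):
    """Non-causal 3x1 convolution with symmetric zero padding:
    y[t] sees x[t-1], x[t], x[t+1] -- equal context past and future."""
    xp = np.concatenate(([0.0], x, [0.0]))  # symmetric padding
    return w[0] * xp[:-2] + w[1] * xp[1:-1] + w[2] * xp[2:]

x = np.array([1.0, 2.0, 3.0, 4.0])
y = noncausal_conv_3x1(x, w=np.array([0.25, 0.5, 0.25]))
# Interior outputs average each sample with both neighbors
```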
Real-valued Predictions
- Assumes a Gaussian output distribution
- No quantization error
- One output unit per output sample
- μ-law companding is disadvantageous

[Figure: output stage of the original Wavenet vs. the proposed model]
Target Fields

[Figure: build-up from predicting a single target sample to predicting a full target field]

- Autoregression requires sequential, sample-by-sample inference → slow
- Parallel prediction of a target field benefits inference AND training
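The speed-up can be sketched with simple arithmetic, assuming 16 kHz audio and the 100 ms target field mentioned later in the talk:

```python
SAMPLE_RATE = 16_000
TARGET_FIELD_MS = 100  # target-field length used in the best configuration

field_len = SAMPLE_RATE * TARGET_FIELD_MS // 1000  # 1600 samples per field

# Forward passes needed to denoise 1 second of audio:
autoregressive_passes = SAMPLE_RATE                # one pass per sample
target_field_passes = SAMPLE_RATE // field_len     # one pass per field -> 10
```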
Enforcing Time Continuity
- Without autoregression, the original Wavenet produces point discontinuities
- These sound very unpleasant
- 3x1 filters in the final (non-dilated) layers allow time continuity to be reflected in the loss

[Figure: point discontinuity removed by 3x1 filters in the final layers]
Energy-Conserving Loss
- Goal: the energy of the input mixture is conserved in the estimated sources (E_in ≡ E_out)
- Inspired by dissimilarity losses
- Empirically, reduces algorithmic artifacts
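One way to realize this (a sketch consistent with the bullets above, not necessarily the exact formulation used in the talk): predict only the speech, define the noise estimate as the residual of the mixture, and penalize both with an L1 term. The two estimates then sum to the mixture by construction, so nothing is created or destroyed.

```python
import numpy as np

def energy_conserving_l1_loss(m, s, s_hat):
    """Sketch of an energy-conserving loss: the noise estimate is
    defined as b_hat = m - s_hat, so estimated speech and noise always
    sum exactly to the input mixture. Both estimates are penalized."""
    b = m - s          # true background, since m = s + b
    b_hat = m - s_hat  # implied background estimate
    return np.mean(np.abs(s - s_hat)) + np.mean(np.abs(b - b_hat))

# Toy check: the two terms are equal by construction,
# because s - s_hat = -(b - b_hat).
m = np.array([1.0, 0.5, -0.5])
s = np.array([0.6, 0.1, -0.2])
s_hat = np.array([0.5, 0.2, -0.2])
loss = energy_conserving_l1_loss(m, s, s_hat)
```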
Flexibility in the Temporal Dimension
- The same model can be deployed on reduced computational resources
- Audio input of arbitrary length → one-shot denoising
- Reduces redundant computations
- 25 s of audio in a single forward pass (Titan X Pascal)
- ~0.56 s per 1 second of noisy audio
- Fully convolutional
Experiments

Setup
- 33 layers
  § Dilations: 1, 2, ..., 256, 512
  § Stacks: 3
- 384 ms receptive field
- 6.3M parameters
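The 384 ms figure can be reproduced as a back-of-the-envelope check (not the authors' code), assuming the 30 dilated 3x1 layers (3 stacks of dilations 1 through 512) dominate the receptive field and each adds 2 × dilation samples of context:

```python
SAMPLE_RATE = 16_000

dilations = [2 ** i for i in range(10)]  # 1, 2, ..., 512
stacks = 3
filter_len = 3                           # 3x1 filters

# Each dilated layer widens the receptive field by (filter_len - 1) * d.
rf_samples = 1 + stacks * sum((filter_len - 1) * d for d in dilations)
rf_ms = 1000 * rf_samples / SAMPLE_RATE  # about 384 ms at 16 kHz
```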
Data
- VCTK for voice
- DEMAND for environmental sounds
- Unseen speakers in unseen noise conditions
- Training SNR: 0 dB to 18 dB
- Test SNR: 2.5 dB to 17.5 dB
Evaluation Metrics
- Should be perceptually meaningful
- MOS = mean opinion score (predicted), in the range [1, 5]
- Weighted combination of objective speech quality measures
- SIG: MOS rating of signal distortion, attending only to the speech signal
- BAK: MOS rating of the intrusiveness of background noise
- OVL: MOS rating of the overall effect
Results

Best Configuration
- Energy-conserving loss
- 10% noise-only augmentation
- 100 ms target field
- Conditioning
[Figure: spectrograms of mixed, speech, background, and Wiener-filtered signals at 12.5 dB, 7.5 dB, and 2.5 dB SNR]
Perceptual Evaluation
- 33 participants
- 20 samples, 5 at each SNR
- 1-5 quality rating
- Instruction: "give an overall quality score, taking into consideration both speech quality and background-noise suppression"

Mean rating: Wiener Filtering 2.92 vs. Proposed Model 3.60
Takeaway
- A discriminative adaptation of Wavenet for speech enhancement
- Reduced time complexity without sacrificing expressive capability
- Noise-only augmentation is necessary for generating silence
- No speech-specific constraints
- Energy conservation
- Perceptual trials: preferred over Wiener filtering
- It is possible to learn multi-scale hierarchical representations from raw audio
- Audio samples online, source on GitHub
Future Work
- Continue exploring energy-conserving losses in neural audio processing models
- Better handling of short-time, high-energy events, e.g. honking in city traffic
- Apply to other audio domains
  § Music, multi-track separation
Thank you