Adapting Wavenet for Speech Enhancement DARIO RETHAGE | JULY 12, 2017

Source: Jordi Pons, jordipons.me (wavenet_denoising_dario.pdf, 2017-08-23); transcript posted May 28, 2020.

Page 1

Adapting Wavenet for Speech Enhancement
Dario Rethage | July 12, 2017

Page 2

I am
• Master student
• 6 months @ Music Technology Group, Universitat Pompeu Fabra
• Deep learning for acoustic source separation
• With Jordi Pons, Audio Signal Processing Lab

Page 3

Learning from raw audio
• High dimensionality
• Many levels of structure
• No handcrafted feature extraction
• No discarding of information (phase)
• Until recently, computationally intractable

[Figure labels: timbre, phoneme, phonetic transition]

Page 4

Wavenet: A Generative Model for Raw Audio
• Speech synthesis at the waveform level using an autoregressive, generative model
• Outputs a probability distribution over 8-bit (256) values
• Samples from the output distribution (probabilistic task)
• Considerable parameter savings
  ▪ Small filters
  ▪ Large dilations
• 16 kHz sampling rate (wide-band)
• Very slow
• Not strictly end-to-end

Page 5

Wavenet: Key Features
• Causality
• Gated units
• Softmax output
• μ-law quantization
• Dilation
• Stacks

Page 6

Causality
• Only the previous and current samples inform the prediction of sample t+1
• Asymmetric padding
• 2x1 filters
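To make the causality constraint concrete, here is a minimal pure-Python sketch of a causal dilated 1-D convolution with asymmetric (left-only) zero padding and a 2x1 filter. The helper name `causal_dilated_conv` and the tap ordering (`w[0]` weights the current sample) are illustrative assumptions, not taken from the original implementation.

```python
from typing import List, Sequence


def causal_dilated_conv(x: Sequence[float], w: Sequence[float], dilation: int = 1) -> List[float]:
    """Causal dilated 1-D convolution (hypothetical helper).

    Output y[t] depends only on x[t], x[t - d], x[t - 2d], ... and never on future samples.
    """
    k = len(w)
    pad = (k - 1) * dilation          # asymmetric padding: all zeros go on the left
    xp = [0.0] * pad + list(x)
    y = []
    for t in range(len(x)):
        # Tap i reaches i * dilation samples into the past; w[0] weights the current sample.
        y.append(sum(w[i] * xp[pad + t - i * dilation] for i in range(k)))
    return y


x = [1.0, 2.0, 3.0, 4.0]
print(causal_dilated_conv(x, [1.0, 1.0], dilation=1))  # [1.0, 3.0, 5.0, 7.0]
print(causal_dilated_conv(x, [1.0, 1.0], dilation=2))  # [1.0, 2.0, 4.0, 6.0]
```

Changing the last input sample leaves all earlier outputs unchanged, which is the property that lets the original Wavenet predict sample t+1 from past samples only.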

Gated Units
• Control the contribution of each layer
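In the original Wavenet paper the gated unit is z = tanh(W_f * x) ⊙ σ(W_g * x): a tanh "filter" path multiplied element-wise by a sigmoid "gate" path. The sketch below takes the outputs of the two convolutions as given (an assumption to keep it short) and shows only the gating; the name `gated_unit` is hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(v: float) -> float:
    """Numerically stable logistic sigmoid."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    z = math.exp(v)
    return z / (1.0 + z)


def gated_unit(filter_out: Sequence[float], gate_out: Sequence[float]) -> List[float]:
    """Element-wise tanh(filter) * sigmoid(gate) (hypothetical helper).

    filter_out and gate_out stand in for the outputs of two separate
    dilated convolutions applied to the same layer input.
    """
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]


# A strongly negative gate nearly silences the layer; a strongly positive one passes tanh(f).
print(gated_unit([0.5, 0.5, 0.5], [-20.0, 0.0, 20.0]))
```

Because the gate lies in (0, 1), each layer can learn how much of its filtered signal to pass on, which is what "control the contribution of each layer" refers to.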

Page 7

μ-law quantization
• Non-linear companding
• Better use of the 8-bit quantization space
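A minimal sketch of μ-law companding followed by 8-bit quantization, using the standard μ-law curve F(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255 (256 levels). Inputs are assumed to be floats in [-1, 1]; all helper names are hypothetical, and the uniform quantizer is included only for comparison.

```python
import math

MU = 255  # 8-bit quantization: 256 levels


def mu_law_encode(x: float, mu: int = MU) -> int:
    """Compand x in [-1, 1] with the mu-law curve, then quantize to an integer in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1) / 2 * mu + 0.5)


def mu_law_decode(q: int, mu: int = MU) -> float:
    """Map a level back to [-1, 1] and invert the companding curve."""
    y = 2 * q / mu - 1
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


def linear_encode(x: float, levels: int = 256) -> int:
    """Uniform 8-bit quantization, for comparison."""
    return int((max(-1.0, min(1.0, x)) + 1) / 2 * (levels - 1) + 0.5)


def linear_decode(q: int, levels: int = 256) -> float:
    return 2 * q / (levels - 1) - 1


quiet = 0.01  # a low-amplitude sample, typical of much of a speech waveform
print("mu-law error:", abs(mu_law_decode(mu_law_encode(quiet)) - quiet))
print("linear error:", abs(linear_decode(linear_encode(quiet)) - quiet))
```

For the quiet sample, the μ-law round-trip error is several times smaller than with uniform quantization, because the companding curve spends more of the 256 levels near zero.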

Softmax
• No assumptions about the output distribution
• Well suited for multi-modal distributions
• Requires discretization of the output

Page 8

Stacks
• Repeat the dilation pattern
• More depth, less width

Dilation
• Larger receptive field, same number of parameters
• Dilations grow by powers of 2
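The receptive-field arithmetic behind "larger receptive field, same parameters" can be checked in a few lines. This sketch assumes stride-1 convolutions without pooling; the function name is hypothetical. The undilated and dilated configurations below have the same number of layers and filter taps, hence the same parameter count, yet dilation grows the receptive field exponentially instead of linearly.

```python
from typing import Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input samples that influence one output of stacked dilated 1-D convolutions.

    A layer with dilation d and kernel size k widens the field by (k - 1) * d samples.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


one_stack = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512

print(receptive_field(2, [1] * 10))       # 10 undilated 2x1 layers: 11 samples
print(receptive_field(2, one_stack))      # same 10 layers, dilated: 1024 samples
print(receptive_field(2, one_stack * 3))  # pattern repeated in 3 stacks: 3070 samples
```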

Page 9

Wavenet: Reimplementation
• Many open questions
  ▪ Filter depths
  ▪ Number of layers
• Trained on VCTK: 109 native speakers of English, good phonetic coverage
• Proof of concept
• ~600k parameters

Page 10

Speech Enhancement
• A task within acoustic source separation
• Deterministic
• Goal: improve the intelligibility and/or overall perceptual quality of a speech signal
• Until recently, the greatest successes were in the frequency domain
  ▪ e.g. estimating a spectral mask

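$m_t = s_t + b_t$, where $m$: mixture, $s$: speech, $b$: background

Either estimate $\hat{\mathbf{s}}$ given $\mathbf{m}$ directly, or estimate $\hat{\mathbf{b}}$ given $\mathbf{m}$, since $\mathbf{s} = \mathbf{m} - \mathbf{b}$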

Page 11

A Wavenet for Source Separation
• Generic architecture, suitable for any acoustic source separation task
• Blind two-source separation
• Discriminative
• End-to-end
  ▪ Time-domain input/output
  ▪ No pre/post-filtering
  ▪ No quantization
• 16 kHz sampling rate (wide-band)
• Flexible

Page 12

Key Contributions
• Non-causality
• Real-valued predictions
• Non-autoregressive
• Target fields
• Enforces time continuity
• Energy-conserving loss

Page 13

Non-causality
• Equal context in the past and future
• Symmetric padding
• 3x1 filters
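For contrast with the causal case, here is a minimal sketch of a non-causal dilated convolution with symmetric zero padding and a 3x1 filter. The helper name is hypothetical; the point is that the centre tap sees the current sample while the outer taps see equal amounts of past and future, and the output keeps the input length.

```python
from typing import List, Sequence


def noncausal_dilated_conv(x: Sequence[float], w: Sequence[float], dilation: int = 1) -> List[float]:
    """Non-causal dilated 1-D convolution with symmetric zero padding (hypothetical helper)."""
    k = len(w)
    if k % 2 == 0:
        raise ValueError("symmetric padding needs an odd filter length")
    half = (k // 2) * dilation
    xp = [0.0] * half + list(x) + [0.0] * half   # equal padding on both sides
    # Tap i reads x[t + (i - k // 2) * dilation]: past for i < k // 2, future for i > k // 2.
    return [sum(w[i] * xp[t + i * dilation] for i in range(k)) for t in range(len(x))]


x = [1.0, 2.0, 3.0, 4.0]
print(noncausal_dilated_conv(x, [1.0, 1.0, 1.0], dilation=1))  # [3.0, 6.0, 9.0, 7.0]
print(noncausal_dilated_conv(x, [1.0, 1.0, 1.0], dilation=2))  # [4.0, 6.0, 4.0, 6.0]
```

Unlike the causal version, changing a later sample now changes earlier outputs, which is acceptable for denoising because the whole noisy recording is available at once.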

Page 14

Real-valued Predictions
• Assumes a Gaussian output distribution
• No quantization error
• One output unit per output sample
• μ-law companding is disadvantageous

[Figure: Wavenet vs. Proposed Model]

Page 15

Target Fields

[Figure: target sample]

Page 21

Target Fields

[Figure: target field]

Page 22

Target Fields
• Autoregression requires sequential, sample-by-sample inference → slow
• Parallel prediction of the target field benefits inference AND training
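A back-of-the-envelope sketch of why predicting a whole target field per forward pass is cheaper than one sample per pass. It assumes a receptive field of 6139 samples (what 3 stacks of dilations 1 to 512 with 3x1 filters give) and a 100 ms target field at 16 kHz; the function names are hypothetical. Adjacent outputs in one target field share almost all of their input, so the overlapping computation is done once rather than once per sample.

```python
import math


def input_length(receptive_field: int, target_field: int) -> int:
    """Input samples needed to predict `target_field` consecutive outputs in one forward pass."""
    return receptive_field + target_field - 1


def forward_passes(num_samples: int, target_field: int) -> int:
    """Forward passes needed to produce `num_samples` outputs, `target_field` at a time."""
    return math.ceil(num_samples / target_field)


SAMPLE_RATE = 16_000
RECEPTIVE_FIELD = 6139      # assumed: 3 stacks of dilations 1..512 with 3x1 filters
TARGET_FIELD = 1600         # 100 ms at 16 kHz

print(input_length(RECEPTIVE_FIELD, TARGET_FIELD))   # 7738 input samples per pass
print(forward_passes(SAMPLE_RATE, 1))                # 16000 passes, one output per pass
print(forward_passes(SAMPLE_RATE, TARGET_FIELD))     # 10 passes for one second of audio
```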

Page 23

Enforcing Time Continuity
• Without autoregression, the original Wavenet produces point discontinuities
• Very unpleasant sound
• 3x1 filters in the final (non-dilated) layers allow time continuity to be reflected in the loss

[Figure: point discontinuity vs. 3x1 filters]

Page 24

Energy-Conserving Loss
• Goal: 𝐸/0 ≡ 𝐸/20
• Inspired by dissimilarity losses
• Empirically, reduces algorithmic artifacts
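The slide does not spell out the loss, so the following is only one plausible reading, under stated assumptions: the network predicts speech ŝ, the background estimate is derived as b̂ = m − ŝ (so the two estimates always add back up to the mixture), and both estimates are scored against their references with a mean absolute (L1) error. The function names are hypothetical.

```python
from typing import Sequence


def mean_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def energy_conserving_loss(
    mixture: Sequence[float],
    speech: Sequence[float],
    background: Sequence[float],
    speech_hat: Sequence[float],
) -> float:
    """Sketch: L1 on the speech estimate plus L1 on the implied background estimate."""
    # The background estimate is not predicted separately; it is whatever the
    # speech estimate leaves of the mixture, so speech_hat + background_hat == mixture.
    background_hat = [m - s for m, s in zip(mixture, speech_hat)]
    return mean_abs_error(speech, speech_hat) + mean_abs_error(background, background_hat)


speech = [0.5, -0.2]
background = [0.1, 0.3]
mixture = [s + b for s, b in zip(speech, background)]   # m_t = s_t + b_t
print(energy_conserving_loss(mixture, speech, background, [0.4, -0.2]))  # approximately 0.1
```

Under this reading, any error e in ŝ reappears as −e in b̂, so the loss penalizes it in both terms and discourages estimates that add or remove energy relative to the mixture.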

Page 25

Flexibility in the Temporal Dimension
• The same model can be deployed on reduced computational resources
• Audio input of arbitrary length → one-shot denoising
• Reduces redundant computations
• 25 s of audio in a single forward pass (Titan X Pascal)
• ~0.56 s per 1 s of noisy audio
• Fully convolutional
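Because the model is fully convolutional, the input length is not fixed: an input of N + R − 1 samples yields N outputs, where R is the receptive field. The sketch below (hypothetical helper names; R = 6139 samples is an assumed value consistent with the 384 ms receptive field on the Setup slide; 16 kHz assumed) computes the symmetric padding for one-shot denoising of a whole file, and the overlapping input windows for processing it in smaller chunks when memory is limited.

```python
from typing import List, Tuple


def one_shot_padding(num_samples: int, receptive_field: int) -> Tuple[int, int]:
    """Zeros to add on each side, and the resulting input length, to denoise
    `num_samples` samples in a single forward pass of a non-causal model."""
    if receptive_field % 2 == 0:
        raise ValueError("a symmetric (non-causal) receptive field has odd length")
    half = (receptive_field - 1) // 2
    return half, num_samples + 2 * half


def chunk_windows(num_samples: int, receptive_field: int, chunk: int) -> List[Tuple[int, int]]:
    """[start, end) input windows, in padded coordinates, for denoising `chunk`
    outputs per pass; consecutive windows overlap by receptive_field - 1 samples."""
    windows = []
    for start in range(0, num_samples, chunk):
        out_len = min(chunk, num_samples - start)
        windows.append((start, start + out_len + receptive_field - 1))
    return windows


SAMPLE_RATE = 16_000
RECEPTIVE_FIELD = 6139   # assumed value, see lead-in

print(one_shot_padding(25 * SAMPLE_RATE, RECEPTIVE_FIELD))   # (3069, 406138): 25 s in one pass
print(len(chunk_windows(25 * SAMPLE_RATE, RECEPTIVE_FIELD, SAMPLE_RATE)))  # 25 one-second passes
```

Smaller chunks trade memory for repeated work on the R − 1 overlapping samples, which is why a single large pass "reduces redundant computations".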

Page 26

Experiments

Setup
• 33 layers
  ▪ Dilations: 1, 2, ..., 256, 512
  ▪ Stacks: 3
• 384 ms receptive field
• 6.3M parameters
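As a sanity check, the 384 ms receptive field is consistent with the following arithmetic, assuming 3x1 filters (kernel size k = 3) in every dilated layer, 10 dilated layers per stack (dilations 2^0 to 2^9), a 16 kHz sampling rate, and that any further non-dilated layers are not counted in the receptive field.

```latex
R = 1 + \sum_{n=1}^{3} \sum_{j=0}^{9} (k-1)\,2^{j}
  = 1 + 3 \cdot 2 \cdot (2^{10} - 1)
  = 1 + 6 \cdot 1023
  = 6139 \text{ samples}

\frac{6139 \text{ samples}}{16\,000 \text{ samples/s}} \approx 0.384 \text{ s} = 384 \text{ ms}
```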

Data
• VCTK for voice
• DEMAND for environmental sounds

Unseen speakers in unseen noise conditions

Training SNR: 0 dB – 18 dB
Test SNR: 2.5 dB – 17.5 dB
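Mixtures at a prescribed SNR can be produced by scaling the background before adding it to the speech, following m_t = s_t + b_t. The sketch below assumes SNR is the plain ratio of mean signal powers, SNR = 10 log10(P_s / P_b); the procedure actually used to build the VCTK/DEMAND mixtures may differ (for example, in how speech level is measured). The function names are hypothetical.

```python
import math
from typing import List, Sequence


def mean_power(x: Sequence[float]) -> float:
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech: Sequence[float], background: Sequence[float], snr_db: float) -> List[float]:
    """Return m = s + g * b, with gain g chosen so that 10*log10(P_s / P_{g*b}) == snr_db."""
    gain = math.sqrt(mean_power(speech) / (mean_power(background) * 10 ** (snr_db / 10)))
    return [s + gain * b for s, b in zip(speech, background)]


def measured_snr_db(speech: Sequence[float], mixture: Sequence[float]) -> float:
    noise = [m - s for m, s in zip(mixture, speech)]   # b_t = m_t - s_t
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


speech = [1.0, -1.0, 1.0, -1.0]
background = [0.5, 0.5, -0.5, -0.5]
for snr in (2.5, 7.5, 12.5, 17.5):   # the four test SNRs
    print(snr, round(measured_snr_db(speech, mix_at_snr(speech, background, snr)), 6))
```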

Page 27

Evaluation Metrics
• Should be perceptually meaningful
• MOS = mean opinion score (predicted), in the range [1, 5]
• Weighted combination of objective speech quality measures
• SIG: MOS rating of the signal distortion, attending only to the speech signal
• BAK: MOS rating of the intrusiveness of the background noise
• OVL: MOS rating of the overall effect
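The slide does not name the underlying objective measures. Assuming SIG, BAK and OVL are the composite measures of Hu and Loizou (CSIG, CBAK, COVL), each is a linear combination of PESQ, log-likelihood ratio (LLR), weighted-slope spectral distance (WSS) and segmental SNR, clipped to the MOS range [1, 5]. The sketch uses the coefficients as they commonly appear in public speech-enhancement evaluation scripts; computing the four input measures is out of scope, so they are taken as given, and the function names are hypothetical.

```python
from typing import Dict


def clip_mos(v: float) -> float:
    """Clip a predicted score to the MOS range [1, 5]."""
    return min(5.0, max(1.0, v))


def composite_scores(pesq: float, llr: float, wss: float, segsnr: float) -> Dict[str, float]:
    """Composite MOS predictions from pre-computed objective measures (assumed coefficients)."""
    sig = 3.093 - 1.029 * llr + 0.603 * pesq - 0.009 * wss
    bak = 1.634 + 0.478 * pesq - 0.007 * wss + 0.063 * segsnr
    ovl = 1.594 + 0.805 * pesq - 0.512 * llr - 0.007 * wss
    return {"SIG": clip_mos(sig), "BAK": clip_mos(bak), "OVL": clip_mos(ovl)}


# Hypothetical measure values, for illustration only (not results from the slides).
print(composite_scores(pesq=2.5, llr=0.5, wss=30.0, segsnr=5.0))
```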

Page 28

Results

Page 29

Best Configuration
• Energy-conserving loss
• 10% noise-only augmentation
• 100 ms target field
• Conditioning

[Figure: Mixed, Speech, Background and Wiener panels at 12.5 dB, 7.5 dB and 2.5 dB SNR]

Page 30

Perceptual Evaluation
• 33 participants
• 20 samples, 5 at each SNR
• 1–5 quality rating

“Give an overall quality score, taking into consideration both speech quality and background-noise suppression.”

Average rating:
Wiener Filtering: 2.92
Proposed Model: 3.60

Page 31

Takeaway
• A discriminative adaptation of Wavenet for speech enhancement
• Reduction in time complexity, without sacrificing expressive capability
• Noise-only augmentation is necessary for generating silence
• No speech-specific constraints
• Energy conservation
• Perceptual trials: preferred over Wiener filtering
• It is possible to learn multi-scale hierarchical representations from raw audio
• Audio samples online, source on GitHub

Page 32

Future Work
• Continue exploring the idea of energy-conserving losses in neural audio processing models
• Better handling of short, high-energy events, e.g. a honk in city traffic
• Apply to other audio domains
  ▪ Music, multi-track separation

Page 33

Thank you