Page 1: Title

Active Learning for Speech Emotion Recognition using Deep Neural Network

Mohammed Abdelwahab and Carlos Busso

Page 2: Generalization of Models

§ Mismatch between train and test conditions is one of the main barriers in speech emotion recognition

§ Under ideal classification conditions
  § The training and testing sets come from the same domain

§ Under real application conditions
  § The training and testing sets come from different domains
  § This leads to a performance drop [Shami and Verhelst 2007, Parthasarathy and Busso 2017]

[Shami and Verhelst 2007]
Training   Testing   Accuracy
Danish     Danish    64.90%
Berlin     Berlin    80.70%
Berlin     Danish    22.90%
Danish     Berlin    52.60%

[Parthasarathy and Busso 2017]
Training       Arousal [CCC]   Valence [CCC]   Dominance [CCC]
In-corpus      0.764           0.289           0.713
Cross-corpus   0.464           0.184           0.451

Page 3: The Problem

§ The performance of a classifier degrades if there is a mismatch between training and testing conditions
  § Speaker variations, channels (environments, noise), language, and microphone settings

§ How to build a classifier that generalizes well?
  § Minimize the discrepancy between the source and target domains

§ We explore this problem relying on active learning

[Figure: source domain used for training, target domain used for testing]

Page 4: Motivation

§ Active learning has been widely used to iteratively select training samples that maximize the model's performance
  § Not all the samples are equal

§ DNNs push state-of-the-art performance
  § They require vast amounts of labeled data

§ There is a need for a scalable active learning approach for DNNs
  § Explore approaches to identify the most useful N samples

[Figure: selected sentences from the unlabeled target domain are labeled and added to the source-domain training set (before/after)]

Page 5: Related Work

§ Speech emotion recognition
  § Use labelers' agreement to build uncertainty models [Zhang et al. 2013]
  § Multi-view uncertainty sampling to minimize the amount of labeled data [Zhang et al. 2015]
  § Minimize annotations per sample using an agreement threshold [Zhang et al. 2015]
  § Minimize noise accumulation in self-training [Zhang et al. 2016]
  § Adapt the model with correctly classified, low-confidence samples [Abdelwahab & Busso 2017]
  § Combine ensembles and active learning to mitigate performance loss in a new domain [Abdelwahab & Busso 2017]
  § Greedy sampling for multi-task speech emotion linear regression [Wu & Huang 2018]

§ None of those approaches used deep neural networks

Page 6: Data Acquisition Functions

§ There is no data acquisition function that works well in all scenarios
  § Heuristic approaches were shown to work in practice

§ Greedy sampling (diversity sampling)
  § Label space
  § Feature space
  § Combination

§ Uncertainty sampling (a short sketch of these scores follows this slide)
  § Least confident samples
  § Margin
  § Entropy
  § Vote entropy (ensembles)
  § Dropout

§ Random sampling (baseline)
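
The uncertainty criteria above are standard heuristics. Below is a minimal sketch, not the authors' code, of how the least-confident, margin, and entropy scores could be computed from a classifier's posterior probabilities; for the regression task in this work, the dropout-based variance on Page 8 plays the analogous role, and all names here are illustrative.

```python
import numpy as np

def least_confident(probs: np.ndarray) -> np.ndarray:
    """Higher score = the model is less sure about its top prediction."""
    return 1.0 - probs.max(axis=1)

def margin(probs: np.ndarray) -> np.ndarray:
    """Small gap between the two most likely classes = uncertain (negated so larger = more uncertain)."""
    part = np.sort(probs, axis=1)
    return -(part[:, -1] - part[:, -2])

def entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of the predictive distribution."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# Example: pick the k most uncertain samples to send for annotation
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25]])
k = 1
query_idx = np.argsort(entropy(probs))[-k:]   # -> index of the second (more uncertain) sample
```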

Page 7: Greedy Sampling Approach

§ Greedy sampling for regression [Wu et al., 2019]
§ Maximize the diversity in the train set (a code sketch follows this slide)

1. Select initial samples
   § Previously selected samples
2. Compute distances
   § Feature space
   § Label space
   § Combination
3. Select k samples to annotate
4. Update the model and repeat

Distances between candidate sample i and selected sample j:

$d^{x}_{i,j} = \lVert x_i - x_j \rVert_2$ (feature space: source and target domains)
$d^{y}_{i,j} = |y_i - y_j|$ (label space: labels of the source domain versus predicted labels from the target domain)
$d^{xy}_{i,j} = d^{x}_{i,j}\, d^{y}_{i,j}$ (combination)
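
As referenced above, this is a minimal sketch of greedy sampling in the feature space, assuming a pool of unlabeled target-domain feature vectors (or autoencoder embeddings) and the features of the samples already in the training set; it is my reading of the slide, not the authors' implementation, and all names are illustrative.

```python
import numpy as np

def greedy_feature_sampling(pool: np.ndarray, selected: np.ndarray, k: int) -> list:
    """Pick k pool samples that maximize diversity in the feature space.

    pool:     (N, D) unlabeled target-domain features (or embeddings)
    selected: (M, D) features of samples already in the training set
    Returns the indices of the k chosen pool samples.
    """
    chosen = []
    selected = selected.copy()
    for _ in range(k):
        # distance from every pool sample to its closest already-selected sample
        d = np.linalg.norm(pool[:, None, :] - selected[None, :, :], axis=-1)
        min_d = d.min(axis=1)
        min_d[chosen] = -np.inf          # never pick the same sample twice
        idx = int(np.argmax(min_d))      # farthest-from-the-set sample = most diverse
        chosen.append(idx)
        selected = np.vstack([selected, pool[idx]])
    return chosen

# Example: select 4 diverse samples from a random pool
# idx = greedy_feature_sampling(np.random.rand(100, 8), np.random.rand(5, 8), k=4)
```

The label-space and combined variants follow the same loop, replacing the Euclidean distance with $|y_i - y_j|$ (using predicted labels for the target domain) or with the product $d^{x}_{i,j}\, d^{y}_{i,j}$.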

Page 8: Uncertainty Sampling: Dropout

§ Dropout can approximate Bayesian inference [Gal et al., 2016]
  § We can represent the model's uncertainty

§ Use different configurations of dropout, analyzing the predictions per sample

§ Goal: select the samples about which the existing model is the most uncertain across several dropout iterations (see the sketch after this slide)

[Figure: unlabeled sentences from the target domain are passed through the model with dropout active]
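
A minimal sketch, not the authors' code, of Monte Carlo dropout for ranking unlabeled target-domain sentences: it assumes a PyTorch regressor whose dropout layers stay active while the module is in train() mode, and the number of passes and all names are illustrative.

```python
import torch

@torch.no_grad()
def dropout_uncertainty(model: torch.nn.Module, x: torch.Tensor, passes: int = 20) -> torch.Tensor:
    """Per-sample variance of the emotion prediction across dropout passes."""
    model.train()                       # keep dropout active at inference time
    preds = torch.stack([model(x).squeeze(-1) for _ in range(passes)])  # (passes, N)
    model.eval()
    return preds.var(dim=0)             # high variance = high uncertainty

# Example: query the k most uncertain unlabeled sentences
# scores = dropout_uncertainty(model, unlabeled_features)
# query_idx = scores.topk(k).indices
```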

Page 9: The MSP-Podcast Database

§ Use existing podcast recordings
§ Divide into speaker turns
§ Emotion retrieval to balance the emotional content
§ Annotate using a crowdsourcing framework

Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, vol. to appear, 2018.

Page 10: The MSP-Podcast Database

§ MSP-Podcast
  § Collection of publicly available podcasts (naturalness and diversity of emotions)
  § Interviews, talk shows, news, discussions, education, storytelling, comedy, science, technology, politics, etc.
  § Creative Commons copyright licenses
  § Single-speaker segments, high SNR, no music, no telephone quality
  § Developing and optimizing different machine learning frameworks using existing databases
  § Balance the emotional content
  § Emotional annotation using a crowdsourcing platform

[Figure: data collection pipeline. Stages include podcast audio (16 kHz, 16-bit PCM, mono), diarization, duration filter (2.75 s < segment < 11 s), SNR filter, removal of segments with music, removal of telephone-quality audio, emotion retrieval, manual screening, and emotional annotation]

Page 11: MSP-Podcast corpus version 1.1

§ Segmented turns: 152,975 sentences over 1,000 podcasts
§ With emotion labels: 22,630 sentences (38 h, 57 m)

[Figure: label distributions for arousal, valence, and dominance]

§ Test set
  • 7,181 segments from 50 speakers (25 males, 25 females)
§ Development set
  • 2,614 segments from 15 speakers (10 males, 5 females)
§ Train set
  • remaining 12,830 segments

Page 12: Acoustic Features

§ Interspeech 2013 feature set
  § 65 low-level descriptors (LLDs)
  § Functionals are calculated on the LLDs, resulting in a total of 6,373 features

§ Functionals include (a toy sketch follows this slide):
  § Quartile ranges
  § Arithmetic mean
  § Root quadratic mean
  § Moments
  § Mean/std. of rising/falling slopes
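
To make the LLD/functional distinction concrete, here is a toy sketch that applies a few of the functionals listed above to a matrix of frame-level LLDs; it is not the actual Interspeech 2013/openSMILE extraction (which produces the full 6,373-dimensional vector), and all names are illustrative.

```python
import numpy as np

def functionals(lld: np.ndarray) -> np.ndarray:
    """lld: (frames, n_lld) low-level descriptors; returns one vector per utterance."""
    q1, q3 = np.percentile(lld, [25, 75], axis=0)
    feats = [
        lld.mean(axis=0),                              # arithmetic mean
        np.sqrt((lld ** 2).mean(axis=0)),              # root quadratic mean
        q3 - q1,                                       # interquartile range
        ((lld - lld.mean(axis=0)) ** 3).mean(axis=0),  # third central moment
    ]
    return np.concatenate(feats)                       # utterance-level feature vector

# e.g., 300 frames x 65 LLDs -> 4 * 65 = 260 functionals here (vs. 6,373 in the IS-2013 set)
x = functionals(np.random.randn(300, 65))
```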

Page 13: Proposed Architecture

§ Multitask learning network
  § Primary task: emotion regression
    § Concordance correlation coefficient (CCC)
  § Secondary task: feature reconstruction
    § Mean square error (MSE)
  § The secondary task helps to generalize the model, especially with limited data

[Figure: autoencoder over the IS-2013 input (6,373 features, input nodes); the embedding feeds the emotion regressor, and the decoder produces the reconstructed input]

$$\mathcal{L} = \alpha_1 \frac{1}{N}\sum_{i=1}^{N}\lVert x_i - \hat{x}_i \rVert^2 + \alpha_2 \left[ 1 - \frac{2\rho\,\sigma_y \sigma_{\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2} \right]$$

The first term is the reconstruction MSE; the bracketed term is 1 − CCC between the predicted and reference emotional attributes (a code sketch follows this slide).
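
A minimal PyTorch sketch, not the authors' code, of the two loss terms above; alpha1 and alpha2 stand in for the task weights, and all names are illustrative.

```python
import torch

def ccc(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Concordance correlation coefficient between predictions and references."""
    mu_hat, mu = y_hat.mean(), y.mean()
    var_hat, var = y_hat.var(unbiased=False), y.var(unbiased=False)
    cov = ((y_hat - mu_hat) * (y - mu)).mean()        # cov = rho * sigma_y * sigma_yhat
    return 2 * cov / (var_hat + var + (mu_hat - mu) ** 2)

def multitask_loss(x, x_rec, y_hat, y, alpha1=1.0, alpha2=1.0):
    """alpha1 * reconstruction MSE + alpha2 * (1 - CCC)."""
    mse = ((x - x_rec) ** 2).sum(dim=1).mean()        # (1/N) * sum_i ||x_i - x_hat_i||^2
    return alpha1 * mse + alpha2 * (1 - ccc(y_hat, y))
```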

Page 14: Experimental Settings

§ We consider 50, 100, 200, 400, 800, and 1,200 samples (target numbers of samples drawn from the unlabeled target domain)
§ Samples are selected based on the latest model
§ We consider two starting points
  § From scratch
  § Autoencoder pretrained with the reconstruction loss only for 20 epochs
§ Results are the average of 20 trials
§ Greedy sampling (feature space):
  § Use the embedding of the autoencoder to reduce the search space

A sketch of the overall loop follows this slide.
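
A minimal sketch, not the authors' code, of the active learning protocol described on this slide; train_model, acquisition, annotate, and evaluate are hypothetical callables standing in for model (re)training, a data acquisition function (e.g., greedy feature-space sampling on the autoencoder embedding), human labeling, and CCC evaluation.

```python
def active_learning(train_model, acquisition, annotate, evaluate,
                    labeled, unlabeled,
                    budgets=(50, 100, 200, 400, 800, 1200), k=10):
    """Iteratively select, label, and retrain until each target budget is reached."""
    model = train_model(labeled)                      # or warm-start from the pretrained autoencoder
    n_selected = 0
    for budget in budgets:
        while n_selected < budget:
            batch = acquisition(model, unlabeled, k)  # pick k samples using the latest model
            labeled = labeled + annotate(batch)       # send the selected sentences for labeling
            unlabeled = [u for u in unlabeled if u not in batch]
            n_selected += len(batch)
            model = train_model(labeled)              # update the model and repeat
        evaluate(model, budget)                       # report CCC at each target number of samples
    return model
```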

Page 15: Results for Arousal

§ Observations
  § Greedy feature leads to better performance
  § Dropout is not as effective
  § Random sampling approaches the best methods as we add more data
  § The pretrained encoder helps with limited samples

[Figure: arousal CCC versus number of selected samples, pretrained encoder versus from scratch, with the within-corpus performance shown for reference]

Page 16: Results for Valence

§ Observations
  § The pretrained autoencoder helps to achieve better performance
  § We approach within-corpus performance with only 10% of the training data
  § Random sampling is less effective

[Figure: valence CCC versus number of selected samples, pretrained encoder versus from scratch]

Page 17: Statistical Significance

                      Arousal                    Valence
#samples              100   200   400   800     100   200   400   800
Random Sampling       0.57  0.61  0.66  0.69    0.07  0.10  0.13  0.17
Greedy Feature        0.58  0.64  0.68  0.71    0.10  0.12  0.16  0.21
Greedy Label          0.50  0.53  0.66  0.69    0.09  0.12  0.17  0.21
Greedy Combination    0.52  0.55  0.68  0.70    0.09  0.12  0.17  0.20
Dropout               0.56  0.59  0.63  0.67    0.11  0.12  0.15  0.18

Bold: statistically significant improvements over random sampling

§ Observations
  § Greedy sampling in the feature space is almost always better than random sampling
  § Dropout was not as effective
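
The slide does not state which statistical test was used, so the following is a hypothetical illustration only: one common choice would be a paired t-test over the 20 matched trials of a given method versus random sampling at each budget.

```python
import numpy as np
from scipy import stats

ccc_greedy = np.random.rand(20)   # placeholder: CCC per trial for greedy feature sampling
ccc_random = np.random.rand(20)   # placeholder: CCC per trial for random sampling

t_stat, p_value = stats.ttest_rel(ccc_greedy, ccc_random)
significant = p_value < 0.05      # such entries would be the bold ones in the table above
```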

Page 18: Sensitivity to k (how often we update the model)

§ Multitask autoencoder framework with greedy methods
  § No statistical difference between k=1 and k=10
  § The method is not sensitive to this parameter (reduces complexity)

[Figure: arousal and valence results for 100, 200, and 400 samples]

Page 19: Consistency of the Results

§ 20 results starting with different initializations
§ Greedy sampling on the feature space versus random sampling
  § The standard deviation of the CCC values achieved by the greedy sampling method decreases faster as the sampling size increases
  § More consistent than random sampling

[Figure: standard deviation across the 20 trials for arousal and valence]

Page 20: Conclusions

§ Greedy sampling achieves higher performance with lower variance compared to random sampling
§ Greedy sampling in the label space depends on the model's performance
§ As we introduce more data, the differences in performance across data acquisition functions shrink
§ Reduce computation cost:
  § Calculate the distances in an embedding with lower dimensions
  § Setting an adequate value of k reduces the frequency of model updates
§ Future work
  § Combine active learning with curriculum learning
  § Consider new acquisition functions that scale well

Page 21: Thank You

§ This work was funded by NSF CAREER award IIS-1453781

Thank you

Interested in our research?
msp.utdallas.edu