Active Learning for Speech Emotion Recognition using Deep Neural Network
Mohammed Abdelwahab and Carlos Busso
Generalization of Models

§ Mismatch between train and test conditions is one of the main barriers in speech emotion recognition
§ Under ideal classification conditions
  § The training and testing sets come from the same domain
§ Under real application conditions
  § The training and testing sets come from different domains
§ This leads to a performance drop [Shami and Verhelst 2007, Parthasarathy and Busso 2017]

Cross-corpus classification accuracy [Shami and Verhelst 2007]:
Training  Testing  Accuracy
Danish    Danish   64.90%
Berlin    Berlin   80.70%
Berlin    Danish   22.90%
Danish    Berlin   52.60%

Within- versus cross-corpus regression [Parthasarathy and Busso 2017]:
Training      Arousal [CCC]  Valence [CCC]  Dominance [CCC]
In-corpus     0.764          0.289          0.713
Cross-corpus  0.464          0.184          0.451
The Problem

§ The performance of a classifier degrades if there is a mismatch between training and testing conditions
  § Speaker variations, channels (environments, noise), language, and microphone settings
§ How to build a classifier that generalizes well?
  § Minimize the discrepancy between the source and target domains

[Figure: source domain (training) versus target domain (testing); we explore this problem relying on active learning]
Motivation

§ Active learning has been widely used to iteratively select training samples that maximize the model's performance
  § Not all samples are equal
§ DNNs push state-of-the-art performance
  § They require vast amounts of labeled data
§ There is a need for a scalable active learning approach for DNNs
  § Explore approaches to identify the most useful N samples

[Figure: unlabeled sentences from the target domain are selected and labeled, then added to the source-domain training set (before/after)]
Related Work

§ Speech emotion recognition
  § Use labelers' agreement to build uncertainty models [Zhang et al. 2013]
  § Multi-view uncertainty sampling to minimize the amount of labeled data [Zhang et al. 2015]
  § Minimize annotations per sample using an agreement threshold [Zhang et al. 2015]
  § Minimize noise accumulation in self-training [Zhang et al. 2016]
  § Adapt the model with low-confidence, correctly classified samples [Abdelwahab & Busso 2017]
  § Combine ensembles and active learning to mitigate performance loss in a new domain [Abdelwahab & Busso 2017]
  § Greedy sampling for multi-task speech emotion linear regression [Wu & Huang 2018]
§ None of these approaches used deep neural networks
Data Acquisition Functions

§ There is no data acquisition function that works well in all scenarios
  § Heuristic approaches were shown to work in practice
§ Greedy sampling
  § Label space
  § Feature space
  § Combination
§ Uncertainty sampling
  § Least confident samples
  § Margin
  § Entropy
  § Vote entropy (ensembles)
  § Dropout
§ Random sampling (baseline)
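The uncertainty criteria listed above (least confident, margin, entropy) can all be computed directly from a model's class posteriors. A minimal numpy sketch; the posterior matrix `probs` is an illustrative example, not data from this work:

```python
import numpy as np

def least_confident(probs):
    """Score = 1 - max class probability; higher means more uncertain."""
    return 1.0 - probs.max(axis=1)

def margin(probs):
    """Negated gap between the two largest probabilities; higher = more uncertain."""
    part = np.sort(probs, axis=1)
    return -(part[:, -1] - part[:, -2])

def entropy(probs):
    """Shannon entropy of the class posterior."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# Example: three samples over three emotion classes; the last, near-uniform
# posterior is ranked most uncertain by all three criteria.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.34, 0.33, 0.33]])
for score in (least_confident, margin, entropy):
    assert np.argmax(score(probs)) == 2
```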
Greedy Sampling Approach

§ Greedy sampling for regression [Wu et al., 2019]
  § Maximize the diversity in the train set
1. Select initial samples
  § Previously selected samples
2. Compute distances
  § Feature space: $d^{x}_{i,j} = \|x_i - x_j\|_2$
  § Label space: $d^{y}_{i,j} = |y_i - y_j|$
  § Combination: $d^{xy}_{i,j} = d^{x}_{i,j}\, d^{y}_{i,j}$
3. Select k samples to annotate
4. Update model and repeat

[Figure: the feature space spans the source and target domains; the label space uses predicted labels from the target domain and the labels of the source domain]
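Steps 1-4 can be sketched as follows: at each iteration, pick the candidate whose distance to the closest already-selected sample is largest, under the $d^x$, $d^y$, or $d^{xy}$ definitions. Here `features` stands in for the autoencoder embedding and `labels` for the predicted/known labels; the toy data at the end is mine, not from the paper:

```python
import numpy as np

def greedy_sample(features, labels, selected_idx, k, mode="xy"):
    """Greedily pick k diverse samples to annotate.

    mode: 'x' (feature space), 'y' (label space), 'xy' (their product).
    """
    selected = list(selected_idx)
    pool = [i for i in range(len(features)) if i not in selected]
    picked = []
    for _ in range(k):
        best, best_score = None, -1.0
        for i in pool:
            dx = np.linalg.norm(features[i] - features[selected], axis=1)
            dy = np.abs(labels[i] - labels[selected])
            d = {"x": dx, "y": dy, "xy": dx * dy}[mode]
            score = d.min()  # distance to the nearest already-selected sample
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        picked.append(best)
        pool.remove(best)
    return picked

# Toy example: two tight clusters. Starting from sample 0 (first cluster),
# the first greedy pick jumps to the far cluster to maximize diversity.
feats = np.array([[0.0], [0.1], [5.0], [5.1]])
labs = np.array([0.0, 0.1, 1.0, 1.1])
assert greedy_sample(feats, labs, selected_idx=[0], k=1, mode="x") == [3]
```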
Uncertainty Sampling: Dropout

§ Dropout can approximate Bayesian inference [Gal et al., 2016]
  § We can represent the model's uncertainty
§ Use different configurations of dropout, analyzing the predictions per sample
§ Goal: select the samples for which the existing model is most uncertain across several dropout iterations

[Figure: unlabeled sentences from the target domain are passed through multiple dropout configurations of the model]
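A framework-free sketch of the idea: keep dropout active at test time, run several stochastic forward passes per sample, and rank unlabeled samples by the variance of their predictions. The two-layer network below is a random stand-in for the trained emotion regressor, chosen only to make the selection logic runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer regressor standing in for the trained DNN; random weights
# are enough to illustrate the selection mechanism.
W1 = rng.normal(size=(16, 32))
W2 = rng.normal(size=(32, 1))

def forward(x, p=0.5):
    """One stochastic forward pass with dropout kept ON at inference."""
    h = np.maximum(x @ W1, 0.0)        # ReLU hidden layer
    mask = rng.random(h.shape) > p     # Bernoulli dropout mask
    return ((h * mask / (1.0 - p)) @ W2).ravel()

def mc_dropout_select(X, k, T=50):
    """Pick the k samples with the highest predictive variance over T passes."""
    preds = np.stack([forward(X) for _ in range(T)])  # shape (T, n_samples)
    return np.argsort(preds.var(axis=0))[::-1][:k]

X = rng.normal(size=(100, 16))
chosen = mc_dropout_select(X, k=5)
assert len(chosen) == 5
```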
The MSP-Podcast Database

§ Use existing podcast recordings
§ Divide into speaker turns
§ Emotion retrieval to balance the emotional content
§ Annotate using a crowdsourcing framework

[Figure: a podcast recording segmented into speaker turns]

Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, to appear, 2018.
The MSP-Podcast Database

§ MSP-Podcast
  § Collection of publicly available podcasts (naturalness and diversity of emotions)
  § Interviews, talk shows, news, discussions, education, storytelling, comedy, science, technology, politics, etc.
  § Creative Commons copyright licenses
  § Single-speaker segments, high SNR, no music, no telephone quality
§ Develop and optimize different machine learning frameworks using existing databases
§ Balance the emotional content
§ Emotional annotation using a crowdsourcing platform

Pipeline: podcast audio (16 kHz, 16-bit PCM, mono) → diarization → duration filter (2.75 s < … < 11 s) → SNR filter → remove telephone quality → remove segments with music → emotion retrieval → manual screening → emotional annotation
MSP-Podcast corpus version 1.1

§ Segmented turns: 152,975 sentences over 1,000 podcasts
§ With emotion labels (arousal, valence, dominance): 22,630 sentences (38 h, 57 m)
§ Test set
  • 7,181 segments from 50 speakers (25 males, 25 females)
§ Development set
  • 2,614 segments from 15 speakers (10 males, 5 females)
§ Train set
  • Remaining 12,830 segments
Acoustic Features

§ Interspeech 2013 feature set
  § 65 low-level descriptors (LLDs)
  § Functionals are calculated on the LLDs, resulting in a total of 6,373 features
§ Functionals include:
  § Quartile ranges
  § Arithmetic mean
  § Root quadratic mean
  § Moments
  § Mean/std. of rising/falling slopes
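The idea of collapsing a frame-level LLD contour into utterance-level functionals can be illustrated on a single contour. The functional names follow the list above, but the implementation is a simplified sketch; the actual IS-2013 set is extracted with openSMILE:

```python
import numpy as np

def functionals(lld):
    """Collapse one frame-level LLD contour into utterance-level statistics."""
    q1, q3 = np.percentile(lld, [25, 75])
    slopes = np.diff(lld)
    rising = slopes[slopes > 0]
    return {
        "quartile_range": q3 - q1,
        "arithmetic_mean": lld.mean(),
        "root_quadratic_mean": np.sqrt(np.mean(lld ** 2)),
        "skewness": ((lld - lld.mean()) ** 3).mean() / lld.std() ** 3,  # 3rd moment
        "rising_slope_mean": rising.mean() if rising.size else 0.0,
    }

# Toy F0-like contour: stacking such statistics over 65 LLDs (and their deltas)
# is what yields the 6,373-dimensional IS-2013 vector.
f0 = np.array([100.0, 105.0, 110.0, 108.0, 112.0, 120.0, 118.0])
stats = functionals(f0)
assert stats["quartile_range"] > 0
```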
Proposed Architecture

§ Multitask learning network:
  § Primary task: emotion regression, trained with the concordance correlation coefficient (CCC)
  § Secondary task: feature reconstruction, trained with the mean square error (MSE)
§ The secondary task helps to generalize the model, especially with limited data

[Figure: autoencoder over the IS-2013 input (6,373 features, input nodes) with a bottleneck embedding; the decoder reconstructs the input and an emotion regressor operates on the embedding]

$$\mathcal{L} = \alpha_1 \underbrace{\frac{1}{N}\sum_{i=1}^{N} \| x_i - \hat{x}_i \|^2}_{\text{MSE}} + \alpha_2 \underbrace{\left[ 1 - \frac{2\rho\,\sigma_y \sigma_{\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2} \right]}_{1 - \text{CCC}}$$
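The loss above combines the reconstruction MSE (weight α₁) with 1 − CCC (weight α₂). A numpy sketch; the example arrays at the end are arbitrary, not from the experiments:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

def multitask_loss(x, x_rec, y, y_pred, alpha1=1.0, alpha2=1.0):
    """alpha1 * mean reconstruction error + alpha2 * (1 - CCC)."""
    mse = np.mean(np.sum((x - x_rec) ** 2, axis=1))
    return alpha1 * mse + alpha2 * (1.0 - ccc(y, y_pred))

# Perfect reconstruction and perfect predictions drive both terms to zero
x = np.random.default_rng(1).normal(size=(8, 4))
y = np.arange(8, dtype=float)
assert abs(multitask_loss(x, x, y, y)) < 1e-12
```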
Experimental Settings

§ We consider 50, 100, 200, 400, 800, and 1,200 samples
§ Samples are selected based on the latest model
§ We consider two starting points
  § From scratch
  § Autoencoder trained on the reconstruction loss only for 20 epochs
§ Results are the average of 20 trials
§ Greedy sampling (feature space):
  § Use the embedding of the autoencoder to reduce the search space

[Figure: unlabeled sentences from the target domain; target numbers of samples: 50, 100, 200, 400, 800, 1,200]
Results for Arousal

§ Observations
  § Greedy sampling in the feature space leads to better performance
  § Dropout is not as effective
  § The random approach matches the best methods as we add more data
  § The pretrained encoder helps with limited samples

[Figure: CCC curves for the pretrained encoder and the model trained from scratch, with the within-corpus performance as a reference]
Results for Valence

§ Observations
  § The pretrained autoencoder helps to achieve better performance
  § We approach the within-corpus performance with only 10% of the training data
  § Random sampling is less effective

[Figure: CCC curves for the pretrained encoder and the model trained from scratch]
Statistical Significance

                    Arousal                 Valence
#samples            100   200   400   800   100   200   400   800
Random sampling     0.57  0.61  0.66  0.69  0.07  0.10  0.13  0.17
Greedy feature      0.58  0.64  0.68  0.71  0.10  0.12  0.16  0.21
Greedy label        0.50  0.53  0.66  0.69  0.09  0.12  0.17  0.21
Greedy combination  0.52  0.55  0.68  0.70  0.09  0.12  0.17  0.20
Dropout             0.56  0.59  0.63  0.67  0.11  0.12  0.15  0.18

Bold: statistically significant improvements over random sampling

§ Observations
  § Greedy sampling in the feature space is almost always better than random sampling
  § Dropout was not as effective
18
Sensitivity to k (how often we update the model)
Valence
Arousal
100samples 200samples 400samples
§ Multitaskautoencoderframeworkwithgreedymethods§ Nostatisticaldifferencewithk=1andk=10§ Methodisnotsensitivetothisparameter(reducecomplexity)
Consistency of the Results

§ 20 results starting with different initializations
§ Greedy sampling in the feature space versus random sampling
  § The standard deviation of the CCC values achieved by the greedy sampling method decreases faster as the sampling size increases
  § More consistent than random sampling

[Figure: standard deviation curves for arousal and valence]
Conclusions

§ Greedy sampling achieves higher performance with lower variance compared to random sampling
§ Greedy sampling in the label space depends on the model's performance
§ As we introduce more data, the differences in performance across data acquisition functions shrink
§ Reduce computation cost:
  § Calculate the distances in an embedding with lower dimensionality
  § Setting an adequate value of k reduces the frequency of model updates
§ Future work
  § Combine active learning with curriculum learning
  § Consider new acquisition functions that scale well