Hindawi Publishing Corporation, Mathematical Problems in Engineering, Volume 2013, Article ID 265819, 9 pages, http://dx.doi.org/10.1155/2013/265819
Research Article: Practical Speech Emotion Recognition Based on Online Learning: From Acted Data to Elicited Data
Chengwei Huang, Ruiyu Liang, Qingyun Wang, Ji Xi, Cheng Zha, and Li Zhao

School of Information Science and Engineering, Southeast University, Nanjing 210096, China

Correspondence should be addressed to Chengwei Huang; huangcwx@126.com

Received 7 March 2013; Revised 26 May 2013; Accepted 4 June 2013

Academic Editor: Saeed Balochian
Copyright © 2013 Chengwei Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
We study cross-database speech emotion recognition based on online learning. How to apply a classifier trained on acted data to naturalistic data, such as elicited data, remains a major challenge in today's speech emotion recognition systems. We introduce three different types of data sources: first, a basic speech emotion dataset collected from acted speech by professional actors and actresses; second, a speaker-independent dataset that contains a large number of speakers; third, an elicited speech dataset collected from a cognitive task. Acoustic features are extracted from the emotional utterances and evaluated using the maximal information coefficient (MIC). A baseline valence and arousal classifier is designed based on Gaussian mixture models. The online training module is implemented using AdaBoost. While the offline recognizer is trained on the acted data, the online testing data includes the speaker-independent data and the elicited data. Experimental results show that, by introducing the online learning module, our speech emotion recognition system can be better adapted to new data, which is an important characteristic for real-world applications.
1. Introduction
The state-of-the-art speech emotion recognition (SER) system is largely dependent on its training data. Emotional vocal behavior is personality dependent, situation dependent, and language dependent. Therefore, emotional models trained on a specific database may not fit other databases. To solve this problem, we introduce an online learning framework to the SER system. Online speech data is used to retrain and improve the classifier. By adopting the online learning framework, we may better adapt our SER system to different speakers and different data sources.
Many achievements have been reported on acted speech emotion databases [1–3]. Tawari and Trivedi [4] considered the role of context and detected seven emotions on the Berlin Emotional Database [5]. Ververidis and Kotropoulos [6] studied a gender-based speech emotion recognition system for five different emotional states. A number of machine learning algorithms have been studied in SER using acted emotional data. Only recently has the need for naturalistic data been pointed out. Several naturalistic speech emotion databases have been developed, such as the AIBO emotional speech database [7] and the VAM database [8]. Many researchers have noticed that real-world data plays a key role in the SER system [9] and that a model trained on acted data does not fit naturalistic data very well.
Incremental learning may provide a good solution to this problem under an online learning framework. The models pretrained on the acted data may be updated using very little online data. Since naturalistic emotion data is very difficult to collect, acted speech data still plays an important role, especially in studying rare emotion types such as fear-type emotion [1], confidence, and anxiety [10]. By using incremental learning, we can use the available acted databases to build a baseline recognizer and then retrain the classifier online for specific purposes.
Many successful algorithms have been proposed for incremental learning, such as Learn++ [11] and Bagging++ [12]. Incremental learning algorithms may be classified into two categories. In the first category, a single classifier is updated by reestimating its parameters. This type of learning algorithm depends on the specific classifier, such as the incremental learning algorithm for support vector machines
proposed by Xiao et al. [13]. The techniques used in such parameter estimation may not generalize. In the second category, the incremental learning algorithm does not depend on a specific type of classifier. Multiple classifiers are created and combined by a certain fusion rule, such as majority vote. Boosting is a typical type of algorithm that falls into the second category. By creating weak classifiers using selected data, we may add new training data to the learning procedure and gradually adapt the SER system in an online environment.
In this paper, we explore the possibility of transferring a pretrained SER system from acted data to more naturalistic data in an online learning framework. Section 2 describes our acted data and elicited data. Section 3 provides acoustic analysis of emotional features. In Section 4, we introduce our speech emotion recognizer and the online learning methodology. Finally, in Section 5, we provide the experimental results, which show that combining the acted data and the elicited data using online learning brings the best result.
2. Three Types of Data Sources
In this paper, we use three types of data sources to validate our SER system: (i) an acted basic emotion database, (ii) a speaker-independent emotion database, and (iii) an elicited emotion database.
The first database contains the basic emotions, including happiness, anger, surprise, sadness, fear, and neutrality. The emotional speech was recorded by professional actors and actresses, six males and six females. This acted database may be used as a standard training dataset for our baseline recognizer. However, in real-world applications, naturalistic emotional speech differs from acted speech.
The second database is designed for speaker-independent tests and includes fifty-one different speakers. Besides the large number of speakers, a special type of emotion is considered, namely fidgetiness. Fidgetiness is an important emotion in cognition-related tasks; it may be induced by repetitive work, environmental noise, and stress. The second database contains five emotions, as shown in Table 1. This database may be used for testing the ability of speaker adaptation. When using training data from the first database, testing our SER system on the second database is challenging due to the many unknown speakers.
The third database contains speech elicited in a cognitive task, as shown in Table 2. The first row shows the emotion types collected in our experiments, such as fidgetiness, confidence, and tiredness. The second row is the number of speakers for each type of emotion. The third row is the proportion of males and females in the emotion data. The last row is the number of utterances in each emotion class. The data was collected locally in our lab. We carried out a cognitive experiment and collected the emotional speech related to cognitive performance. Subjects were required to work on a set of math calculations and to report the results orally. During the cognitive task, the speech signals were recorded and annotated with emotional labels.
In the third database, "correct answer" or "false answer" labels are marked on each utterance in the oral report by listeners who did not participate in the eliciting experiment. Therefore, we may calculate the percentage of false answers among the negative emotion samples and the percentage of negative emotion among the "false answer" samples. Results show that the proportion of mistakes made in the math calculation is higher in the presence of negative emotions, as shown in Figures 1 and 2. The purpose of this database is to study cognition-related emotions in speech. The analysis shows the dependency between the mistakes made in the math calculation and the negative emotions in the speech.
3. Feature Analysis
3.1. Acoustic Feature Extraction. Emotional information is hidden in the speech signal. Unlike linguistic information, the related acoustic features are difficult to identify. Therefore, feature analysis and selection are very important steps in building an SER system.
We selected typical utterances to study the feature variance caused by emotional change, as shown in Figures 3 through 11. To better reflect the change caused by emotional information, we fix the linguistic context of these utterances.
The utterances shown in the figures were recorded from the same speaker. By comparing utterances under different emotional states from the same speaker, we can exclude the influence of different speaking habits and personalities. This reveals the changes in the acoustic features caused only by the emotional information.
We induced three types of practical emotions from a cognitive task, namely fidgetiness, confidence, and tiredness. We also studied the basic emotions, such as happiness, anger, surprise, sadness, and fear. The intensity feature and the pitch contour are extracted and demonstrated in Figures 3 through 11.
Under the fear emotional state, the first syllable is not normal speech: the pitch feature is missing, and the speech is whispered. Under the tiredness emotional state, the pitch contour is low and flat, which is quite distinguishable from the other emotional states.
Figure 1: The percentage of negative emotions when a mistake occurs in the cognitive task (mistakes split into negative versus positive emotions).

Figure 2: The percentage of correct answers and false answers when negative emotion occurs in the cognitive task.
In neutral speech, the pitch contour is also flat, but at the end of the sentence the pitch frequency increases. Comparing speaking styles, the pitch frequency at the end of the sentence is not consistent. Under the sadness emotional state, the pitch contour is smooth and decreases at the end of the sentence. Furthermore, in the happiness sample, the variance of the pitch frequency is higher. The pitch frequency also increases in the confidence and surprise samples.
We also notice that, under the angry emotional state, the variance of the intensity is lower and the intensity contour is smooth, whereas in the sadness sample the variance of the intensity is higher. Sadness and tiredness may cause a longer duration and a lower speech rate, while fidgetiness and anger may cause a higher speech rate.
A quantitative statistical analysis is shown in Figure 12, where pitch and formant features are compared under various emotional states.
For modeling and recognition purposes, 481 dimensions of acoustic features are constructed. Statistic functions over
Figure 3: Intensity and pitch contour of happiness.
Figure 4: Intensity and pitch contour of sadness.
the entire utterance, such as maximum, minimum, mean, and range, are applied to the basic speech features listed below; "d" stands for the first-order difference and "d2" for the second-order difference.

Feature 1–6: mean, maximum, minimum, median, range, and variance of Short-Time Energy (SE).
Feature 7–18: mean, maximum, minimum, median, range, and variance of dSE and d2SE.
Figure 5: Intensity and pitch contour of fidgetiness.
Figure 6: Intensity and pitch contour of surprise.
Feature 19–24: mean, maximum, minimum, median, range, and variance of pitch frequency (F0).
Feature 25–36: mean, maximum, minimum, median, range, and variance of dF0 and d2F0.
Feature 37–42: mean, maximum, minimum, median, range, and variance of Zero-Crossing Rate (ZCR).
Figure 7: Intensity and pitch contour of fear.
Figure 8: Intensity and pitch contour of tiredness.
Feature 43–54: mean, maximum, minimum, median, range, and variance of dZCR and d2ZCR.
Feature 70-71: Maximum Voiced Duration (MVD) and Maximum Unvoiced Duration (MUD).
Figure 11: Intensity and pitch contour of confidence.
Feature 72–77: mean, maximum, minimum, median, range, and variance of the Harmonic-to-Noise Ratio (HNR).
Feature 78–95: mean, maximum, minimum, median, range, and variance of HNR in the 0–400 Hz, 400–2000 Hz, and 2000–5000 Hz bands.
Feature 96–119: mean, maximum, minimum, median, range, and variance of the 1st formant frequency (F1), 2nd formant frequency (F2), 3rd formant frequency (F3), and 4th formant frequency (F4).
Feature 120–143: mean, maximum, minimum, median, range, and variance of dF1, dF2, dF3, and dF4.
Feature 144–167: mean, maximum, minimum, median, range, and variance of d2F1, d2F2, d2F3, and d2F4.
Feature 168–171: Jitter1 of F1, F2, F3, and F4.
Feature 172–175: Jitter2 of F1, F2, F3, and F4.
Feature 176–199: mean, maximum, minimum, median, range, and variance of the F1, F2, F3, and F4 bandwidths.
Feature 200–223: mean, maximum, minimum, median, range, and variance of the dF1, dF2, dF3, and dF4 bandwidths.
Feature 224–247: mean, maximum, minimum, median, range, and variance of the d2F1, d2F2, d2F3, and d2F4 bandwidths.
Feature 248–325: mean, maximum, minimum, median, range, and variance of MFCC (0–12th order).
Figure 12: Feature distribution over various emotional states.
Feature 326–403: mean, maximum, minimum, median, range, and variance of dMFCC (0–12th order).
Feature 404–481: mean, maximum, minimum, median, range, and variance of d2MFCC (0–12th order).
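As a concrete illustration of how the 481-dimensional feature vector is assembled, the six statistic functions applied to each base contour, together with its "d" and "d2" difference contours, might be computed as follows. This is our own minimal Python sketch; the function names are illustrative, not from the original implementation:

```python
import numpy as np

def contour_statistics(contour):
    """The six statistic functions applied over an utterance-level
    feature contour (e.g., short-time energy, F0, ZCR, an MFCC track)."""
    c = np.asarray(contour, dtype=float)
    return {
        "mean": float(c.mean()),
        "max": float(c.max()),
        "min": float(c.min()),
        "median": float(np.median(c)),
        "range": float(c.max() - c.min()),
        "variance": float(c.var()),
    }

def with_deltas(contour):
    """Base contour plus first ("d") and second ("d2") order differences,
    each summarized by the six statistics -> 18 values per base feature."""
    c = np.asarray(contour, dtype=float)
    feats = {}
    for prefix, x in [("", c), ("d", np.diff(c)), ("d2", np.diff(c, n=2))]:
        for stat, value in contour_statistics(x).items():
            feats[prefix + stat] = value
    return feats
```

Applying `with_deltas` to each base contour (energy, pitch, ZCR, formants, bandwidths, MFCCs) and concatenating the results, alongside the scalar features such as MVD, MUD, Jitter1, and Jitter2, yields the 481-dimensional vector described above.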
3.2. Feature Selection Based on MIC. In this section, we introduce the feature selection algorithm used in our speech emotion classifier. Feature selection algorithms may be roughly classified into two groups, namely "wrapper" and "filter" methods. Algorithms in the former group depend on specific classifiers, such as sequential forward selection (SFS): the final selection result depends on the chosen classifier, and replacing that classifier changes the result. In the second group, feature selection is done by a certain evaluation criterion, such as the Fisher Discriminant Ratio (FDR). The feature
Figure 13: The arousal and the valence dimensions of emotions.
selection result achieved by this type of method does not depend on specific classifiers and generalizes better across different databases.
The maximal information coefficient (MIC) based feature selection algorithm falls into the second group. MIC is a new statistical tool, invented by Reshef et al. [14], that measures linear and nonlinear relationships between paired variables.
MIC is based on the idea that if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship [14]. We may calculate the MIC between a certain acoustic feature and the emotional state by exploring all possible grids on the two variables. First, we compute, for every pair of integers (x, y), the largest possible mutual information achieved by any x-by-y grid [14]. Second, for a fair comparison, we normalize these MIC values across all acoustic features and the emotional state. A detailed study of MIC may be found in [14].
Since MIC can capture linear and nonlinear associations at the same time, we do not need to make any assumption about the distribution of the original features. It is therefore especially suitable for evaluating a large number of emotional features. Starting from the large set of basic features described in Section 3.1, we apply MIC to measure the contribution of each feature in correlation with the emotional states. Finally, a subset of features is selected for our emotion classifier.
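A rough sketch of the grid-search idea behind MIC follows, under our own simplifying assumptions: only equal-frequency grids are tried, the grid size is capped by a fixed `max_bins`, and the dynamic-programming search over uneven partitions used in Reshef et al.'s full algorithm is omitted. It is an illustration of the normalized-mutual-information principle, not the exact MIC statistic:

```python
import numpy as np

def grid_mutual_information(x, y, nx, ny):
    """Mutual information (bits) of an nx-by-ny equal-frequency grid
    drawn on the scatterplot of x and y."""
    qx = np.quantile(x, np.linspace(0, 1, nx + 1)[1:-1])  # interior cut points
    qy = np.quantile(y, np.linspace(0, 1, ny + 1)[1:-1])
    joint, _, _ = np.histogram2d(np.digitize(x, qx), np.digitize(y, qy),
                                 bins=(nx, ny))
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

def mic_score(x, y, max_bins=8):
    """Simplified MIC: the best mutual information over all small x-by-y
    grids, normalized by log2(min(nx, ny)) so that scores lie in [0, 1]."""
    best = 0.0
    for nx in range(2, max_bins + 1):
        for ny in range(2, max_bins + 1):
            mi = grid_mutual_information(x, y, nx, ny)
            best = max(best, mi / np.log2(min(nx, ny)))
    return best
```

Ranking all 481 features by such a score against the emotion label, and keeping the top-scoring subset, corresponds to the filter-style selection described in this section.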
4. Recognition Methodology
4.1. Baseline GMM Classifier. The Gaussian mixture model (GMM) based classifier is the state-of-the-art recognition method in speaker and language identification. In this paper, we build the baseline classifier using a Gaussian mixture model, so that we may compare the baseline classifier with the online learning method.
A GMM is defined as a weighted sum of Gaussian distributions:

\[ p(\mathbf{X}_t \mid \lambda) = \sum_{i=1}^{M} a_i\, b_i(\mathbf{X}_t), \tag{1} \]

where \(\mathbf{X}_t\) is a \(D\)-dimensional random vector, \(b_i(\mathbf{X}_t)\) is the \(i\)th Gaussian component, \(t\) is the index of the utterance sample, \(a_i\) is the mixture weight, and \(M\) is the number of Gaussian mixture components. Each component is a \(D\)-dimensional Gaussian with mean \(\mathbf{U}_i\) and covariance \(\Sigma_i\):

\[ b_i(\mathbf{X}_t) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{X}_t - \mathbf{U}_i)^{T} \Sigma_i^{-1} (\mathbf{X}_t - \mathbf{U}_i) \right). \tag{2} \]
Note that

\[ \sum_{i=1}^{M} a_i = 1. \tag{3} \]
Emotion classification is done by maximizing the posterior probability:

\[ \text{EmotionLabel} = \arg\max_{k} \; p(\mathbf{X}_t \mid \lambda_k). \tag{4} \]
Expectation-Maximization (EM) is adopted for GMM parameter estimation [15].
Due to the different types of emotions among the datasets, we unify the emotional datasets by categorizing them into positive and negative regions in the valence and arousal dimensions, as shown in Figure 13. We may verify the ability of the emotion classifier by classifying the emotional utterances into different regions of the valence-arousal space.
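A minimal sketch of such a per-class GMM classifier, with diagonal covariances and a plain EM loop, might look like the following. This is our own illustrative code, not the authors' implementation; one GMM \(\lambda_k\) is fitted per emotion class, and Eq. (4) is applied by picking the class with the highest log-likelihood:

```python
import numpy as np

def fit_diag_gmm(X, M=2, iters=50, seed=0):
    """EM for a diagonal-covariance GMM; returns weights a, means U, variances V."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    a = np.full(M, 1.0 / M)
    U = X[rng.choice(N, M, replace=False)].astype(float)
    V = np.tile(X.var(axis=0) + 1e-6, (M, 1))
    for _ in range(iters):
        # E-step: responsibilities r[t, i] proportional to a_i * b_i(X_t), Eqs. (1)-(2)
        logp = (-0.5 * (((X[:, None, :] - U) ** 2 / V)
                        + np.log(2 * np.pi * V)).sum(axis=2) + np.log(a))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reestimate weights, means, and variances
        Nk = r.sum(axis=0)
        a = Nk / N
        U = (r.T @ X) / Nk[:, None]
        V = (r.T @ (X ** 2)) / Nk[:, None] - U ** 2 + 1e-6
    return a, U, V

def gmm_loglik(X, params):
    """Log-likelihood log p(X_t | lambda) for each row of X."""
    a, U, V = params
    logp = (-0.5 * (((X[:, None, :] - U) ** 2 / V)
                    + np.log(2 * np.pi * V)).sum(axis=2) + np.log(a))
    m = logp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(logp - m).sum(axis=1, keepdims=True))).ravel()

def classify(X, class_params):
    """Eq. (4): pick the emotion whose GMM gives the highest likelihood."""
    labels = list(class_params)
    scores = np.column_stack([gmm_loglik(X, class_params[k]) for k in labels])
    return [labels[i] for i in scores.argmax(axis=1)]
```

For the valence-arousal setup described above, `class_params` would hold one fitted GMM per region (e.g., positive versus negative valence), each trained on the MIC-selected feature vectors of that region.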
4.2. Online Learning Using AdaBoost. While the offline GMM classifier is trained using the EM algorithm, this section introduces the online training algorithm based on AdaBoost. AdaBoost is a powerful algorithm in ensemble learning [16]. The idea behind AdaBoost is that weak classifiers may be combined into a powerful classifier. Multiple classifiers trained on randomly selected datasets perform quite differently from each other on the same testing dataset; therefore,
we may reduce the misclassification rate by a proper decisionfusion rule
The AdaBoost algorithm consists of several iterations. In each iteration, a new training set is selected for a new weak classifier, and a weight is assigned to that classifier. Based on the testing results of the new weak classifier, the weights of all data samples are modified for the next iteration. In the final step, the ensemble classifier is obtained by combining the multiple weak classifiers through a weighted voting rule.
Let us suppose the current training set is [17]

\[ T = \{s_1, s_2, \ldots, s_N\}, \tag{6} \]

where the weights of the samples are

\[ W = \{w_1, w_2, \ldots, w_N\}, \qquad \sum_{i=1}^{N} w_i = 1. \tag{7} \]
The error rate of the new weak classifier is

\[ e = \sum_{i:\; c(s_i) \neq y_i} w_i, \tag{8} \]

where \(c(s_i)\) is the classification result and \(y_i\) is the class label.
The fusion weight assigned to each classifier is determined by its error rate:

\[ \alpha = \ln\left( \frac{1 - e}{e} \right). \tag{9} \]
At the beginning of the algorithm, each sample is assigned an equal weight. During the iterations, the sample weights are updated as

\[ w_{i+1} = \begin{cases} w_i \times \beta, & c(s_i) = y_i, \\ w_i, & c(s_i) \neq y_i, \end{cases} \tag{10} \]

where \(\beta = e/(1 - e)\) in the standard AdaBoost formulation.
When new data arrives, assuming we know the label of each sample, the classifiers pretrained on the offline data are used as the initial weak classifiers. The AdaBoost algorithm is applied to the new online data, and fusion weights are reassigned to the offline-trained classifiers.
In the first m initial iterations, the m pretrained classifiers are used as the weak classifiers and added to the final ensemble classifier, instead of training new weak classifiers from randomly selected data. After the m initial iterations, new weak classifiers are trained from the new online data and added to the final ensemble classifier in the AdaBoost algorithm.
The major difference between online and offline training is the data used for learning. Offline training uses a large amount of acted data, while online training uses a small amount of natural data. Offline training is independent of online training and ready to use, while online training depends on the offline training and only retrains the existing model for specific purposes, such as tuning to a large number of speakers. The purpose of online training is to quickly adapt the existing offline model to a small amount of new data.
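The adaptation scheme above might be sketched as follows; the interfaces (`pretrained` as a list of prediction callables, `train_weak` as a routine that trains a weak classifier on weighted data) are our own assumptions, and the weight updates follow Eqs. (8)–(10). Note one simplification: the paper resamples data by weight for each new weak classifier, whereas this sketch passes the weights to `train_weak` directly:

```python
import numpy as np

def online_adaboost(pretrained, train_weak, X, y, rounds=10):
    """AdaBoost-style online adaptation: the m offline-trained classifiers
    serve as the first m weak learners, then new weak learners are trained
    on the online data."""
    y = np.asarray(y)
    N = len(y)
    w = np.full(N, 1.0 / N)                  # equal initial weights, Eq. (7)
    ensemble, alphas = [], []
    for clf in list(pretrained) + [None] * rounds:
        if clf is None:                      # iterations after the initial m
            clf = train_weak(X, y, w)
        wrong = np.asarray(clf(X)) != y
        e = max(float(w[wrong].sum()), 1e-10)   # weighted error rate, Eq. (8)
        if e >= 0.5:                         # no better than chance: discard
            continue
        alphas.append(np.log((1 - e) / e))   # fusion weight, Eq. (9)
        ensemble.append(clf)
        beta = e / (1 - e)
        w = np.where(wrong, w, w * beta)     # shrink correct samples, Eq. (10)
        w /= w.sum()                         # renormalize to sum to 1
    def predict(Xq):
        """Weighted majority vote over the ensemble."""
        preds = [np.asarray(clf(Xq)) for clf in ensemble]
        out = []
        for i in range(len(preds[0])):
            votes = {}
            for p, a in zip(preds, alphas):
                votes[p[i]] = votes.get(p[i], 0.0) + a
            out.append(max(votes, key=votes.get))
        return out
    return predict
```

In the paper's setting, each element of `pretrained` would be one of the offline GMM classifiers, and `train_weak` would fit a new GMM-based weak classifier on the small batch of online data.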
5. Experimental Results
In our experiments, the offline training is carried out on the acted basic emotion dataset. The speaker-independent dataset and the elicited practical emotion dataset are used for online training and online testing. Although the datasets used in the online testing are preprocessed utterances rather than real-time online data, our experiments still provide a simulated online situation. We divide dataset 2 and dataset 3 into smaller sets (dataset 2a/2b and dataset 3a/3b), which are used for the simulated online initialization.
Speech utterances from the different sources are organized into several datasets, as shown in Table 3.
The online learning algorithm is verified both on the speaker-independent data and on the elicited data. The results are shown in Table 4. A large number of speakers brings difficulties in modeling emotional behavior, since emotion expression is highly dependent on individual habits and personality. By extending the offline-trained classifier to the online data that contains a large number of speakers, we improved the generality of our SER system. The elicited data was collected in a cognitive experiment that is closer to the real-world situation. During the cognitive task, emotional speech is induced. We observed that the different nature of the acted data and the speech induced during a cognitive task caused a significant decrease in the recognition rate. By using the online training technique, we may transfer the offline-trained SER system to the elicited data. Extending our SER system to different data sources may bring emotion recognition closer to real-world applications.
The major challenge in our online learning algorithm is how to combine the existing offline classifiers and efficiently adapt the model parameters to a small amount of new online data. We adopted the incremental learning idea and solved this problem by modifying the initial stage of the AdaBoost framework. One contribution of our online learning algorithm is that we may reuse the existing offline training data and make the online learning stage more efficient. We make use of a large amount of available offline training data and require only a small amount of data for online training, as shown in Table 3. The weight of each weak classifier is an important parameter; the proposed method may be further improved by using a fuzzy membership function to evaluate the confidence of the GMM classifiers and to reestimate the weight of each weak classifier.
6. Discussions
Acted data is often considered unsuitable for real-world applications. However, traditional research has focused on acted emotional speech, and many acted databases are available. How to transfer an SER system trained on acted data to new naturalistic data in the real world is an unsolved challenge.
Many feature selection algorithms may be applied to an SER system. MIC is a newly proposed and powerful algorithm for exploring nonlinear relationships between variables.
AdaBoost is a popular algorithm for assembling multiple weak classifiers into a strong classifier. By applying
Table 3: Selected datasets for online and offline experiments.

Dataset index | Data source         | Number of utterances | Purpose of use
Dataset 1     | Acted speech        | 12000                | Offline training
Dataset 2a    | Speaker independent | 1000                 | Online training

Table 4: Recognition results of the online and offline experiments.

Experiment   | Offline training data | Online training data | Testing data | Recognition result (%)
Experiment 1 | Dataset 1             | N/A                  | Dataset 2b   | 63.3
Experiment 2 | Dataset 1             | Dataset 2a           | Dataset 2b   | 75.6
Experiment 5 | Dataset 2a            | N/A                  | Dataset 2b   | 70.0
Experiment 3 | Dataset 1             | N/A                  | Dataset 3b   | 61.2
Experiment 4 | Dataset 1             | Dataset 3a           | Dataset 3b   | 73.1
Experiment 6 | Dataset 3a            | N/A                  | Dataset 3b   | 68.5
AdaBoost in the online setting, we train multiple weak classifiers based on the newly arrived online data, with the offline pretrained classifiers used for initialization. We may explore other incremental learning algorithms in future work.
Acknowledgments
This work was partially supported by the China Postdoctoral Science Foundation (no. 2012M520973), the National Natural Science Foundation of China (nos. 61231002, 61273266, and 51075068), and the Doctoral Fund of the Ministry of Education of China (no. 20110092130004). The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions.
References
[1] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette, "Fear-type emotion recognition for future audio-based surveillance systems," Speech Communication, vol. 50, no. 6, pp. 487–503, 2008.
[2] C. Huang, Y. Jin, Y. Zhao, Y. Yu, and L. Zhao, "Speech emotion recognition based on re-composition of two-class classifiers," in Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII '09), Amsterdam, The Netherlands, September 2009.
[3] K. R. Scherer, "Vocal communication of emotion: a review of research paradigms," Speech Communication, vol. 40, no. 1-2, pp. 227–256, 2003.
[4] A. Tawari and M. M. Trivedi, "Speech emotion analysis: exploring the role of context," IEEE Transactions on Multimedia, vol. 12, no. 6, pp. 502–509, 2010.
[5] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proceedings of the 9th European Conference on Speech Communication and Technology, pp. 1517–1520, Lisbon, Portugal, September 2005.
[6] D. Ververidis and C. Kotropoulos, "Automatic speech classification to five emotional states based on gender information," in Proceedings of the 12th European Signal Processing Conference, pp. 341–344, Vienna, Austria, 2004.
[7] S. Steidl, Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech, Department of Computer Science, Friedrich-Alexander-Universitaet Erlangen-Nuernberg, Berlin, Germany, 2008.
[8] M. Grimm, K. Kroschel, and S. Narayanan, "The Vera am Mittag German audio-visual emotional speech database," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '08), pp. 865–868, Hannover, Germany, June 2008.
[9] K. P. Truong, How Does Real Affect Affect Affect Recognition in Speech, Center for Telematics and Information Technology, University of Twente, Enschede, The Netherlands, 2009.
[10] C. Huang, Y. Jin, Y. Zhao, Y. Yu, and L. Zhao, "Recognition of practical emotion from elicited speech," in Proceedings of the 1st International Conference on Information Science and Engineering (ICISE '09), pp. 639–642, Nanjing, China, December 2009.
[11] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar, "Learn++: an incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man, and Cybernetics C, vol. 31, no. 4, pp. 497–508, 2001.
[12] Q. L. Zhao, Y. H. Jiang, and M. Xu, "Incremental learning by heterogeneous Bagging ensemble," Lecture Notes in Computer Science, vol. 6441, no. 2, pp. 1–12, 2010.
[13] R. Xiao, J. Wang, and F. Zhang, "An approach to incremental SVM learning algorithm," in Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, pp. 268–273, 2000.
[14] D. N. Reshef, Y. A. Reshef, H. K. Finucane, et al., "Detecting novel associations in large data sets," Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
[15] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[16] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119–139, 1997.
[17] Q. Zhao, The research on ensemble pruning and its application in on-line machine learning [Ph.D. thesis], National University of Defense Technology, Changsha, China, 2010.
proposed by Xiao et al [13] The techniques used in suchparameter estimation may not be generalized In the secondcategory the incremental learning algorithm is not depen-dent on a specific type of classifiers Multiple classifiers arecreated and combined by a certain fusion rule such as major-ity vote Boosting is a typical type of algorithms that fall intothe second category By creating weak classifiers using se-lected data we may add new training data to the learningprocedure and gradually adapt the SER system in an onlineenvironment
In this paper we explore the possibility of transferringpretrained SER system from acted data to more naturalisticdata in an online learning framework Section 2 describesour acted data and elicited data Section 3 provides acousticanalysis of emotional features In Section 4 we introduce ourspeech emotion recognizer and the online learning method-ology Finally in Section 5 we provide the experimentalresults which show that combining the acted data and theelicited data using online learning brings us the best result
2 Three Types of Data Sources
In this paper we use three types of data sources to validate ourSER system (i) acted basic emotion database (ii) speaker-independent emotion database and (iii) elicited emotiondatabase
The first database contains the basic emotions includinghappiness anger surprise sadness fear and neutrality Theemotional speech is recorded by professional actors andactress six males and six femalesThis acted database may beused as a standard training dataset for our baseline recog-nizer However in real world applications the naturalisticemotional speech is different from the acted speech
The second database is designed for speaker-independenttest which includes fifty-one different speakers Other thana large number of speakers a special type of emotion isconsidered namely fidgetiness Fidgetiness is an importanttype of emotion in cognitive related tasks It may be inducedby repeated work environmental noise and stress The sec-ond database contains five emotions as shown in Table 1This database may be used for testing the ability of speakeradaptationWhen using training data from the first databaseit is challenging to test our SER system on the second data-base due to many unknown speakers
The third database contains elicited speech in a cognitivetask as shown in Table 2 The first row shows the emotiontypes collected in our experiments such as fidgetiness con-fidence and tirednessThe second row is the speaker numberrelated to each type of emotion The third row is the maleand female proportion in the emotion data The last row isthe number of utterances in each emotion class The datais collected locally in our lab We carried out a cognitiveexperiment and collected the emotional speech related tocognitive performance Subject was required to work on aset of math calculations and to report the results orallyDuring the cognitive task the speech signals were recordedand annotated with emotional labels
In the third database ldquocorrect answerrdquo or ldquofalse answerrdquolabels are marked on each utterance in the oral report by
the listeners who have not participated in the eliciting exper-iment Therefore we may calculate the percentage of falseanswers in the negative emotion samples and the percentageof negative emotion in the ldquofalse answerrdquo samples Resultsshow that the proportion of the mistake made in the mathcalculation is higher with the presence of negative emotionsas shown in Figures 1 and 2The purpose of this database is tostudy the cognitive related emotions in speech The analysisshows the dependency between the mistakes made in themath calculation and the negative emotions in the speech
3. Feature Analysis
3.1. Acoustic Feature Extraction. Emotional information is hidden in the speech signal. Unlike linguistic information, the related acoustic features are difficult to identify. Therefore, feature analysis and selection are very important steps in building an SER system.
We selected typical utterances to study the feature variance caused by emotional change, as shown in Figures 3 through 11. To better reflect the changes caused by emotional information, we fix the linguistic context of these utterances.
The utterances shown in the figures are recorded from the same speaker. By comparing utterances under different emotional states from the same speaker, we can exclude the influence of individual speaking habits and personality, revealing the changes in the acoustic features caused only by the emotional information.
We induced three types of practical emotions from a cognitive task, namely fidgetiness, confidence, and tiredness. We also studied basic emotions such as happiness, anger, surprise, sadness, and fear. The intensity feature and the pitch contour are extracted and demonstrated in Figures 3 through 11.
Under the fear emotional state, the first syllable is not normal speech: the pitch feature is missing, and the speech is whispered. Under the tiredness emotional state, the pitch contour is low and flat, which is quite distinguishable from the other emotional states.
Figure 1: The percentage of negative emotions when a mistake occurs in the cognitive task.

Figure 2: The percentage of correct answers and false answers when negative emotion occurs in the cognitive task.
In the neutral speech, the pitch contour is also flat, but the pitch frequency increases at the end of the sentence. Across emotional speech, the pitch behavior at the end of the sentence is not consistent. Under the sadness emotional state, the pitch contour is smooth and decreases at the end of the sentence. Furthermore, in the happiness sample, the variance of the pitch frequency is higher. The pitch frequency also increases in the confidence and surprise samples.
We also notice that under the angry emotional state the variance of the intensity is lower and the intensity contour is smooth, whereas in the sadness sample the variance of the intensity is higher. Sadness and tiredness may cause a longer duration and a lower speech rate, while fidgetiness and anger may cause a higher speech rate.
A quantitative statistical analysis is shown in Figure 12, where pitch and formant features are compared under various emotional states.
For modeling and recognition purposes, 481 dimensions of acoustic features are constructed. Statistic functions over
Figure 3: Intensity and pitch contour of happiness.

Figure 4: Intensity and pitch contour of sadness.
the entire utterance, such as the maximum, minimum, mean, and range, are applied to the basic speech features listed below; "d" stands for difference and "d2" for the second-order difference.
Feature 1–6: mean, maximum, minimum, median, range, and variance of Short-time Energy (SE).

Feature 7–18: mean, maximum, minimum, median, range, and variance of dSE and d2SE.
Figure 5: Intensity and pitch contour of fidgetiness.
Figure 6: Intensity and pitch contour of surprise.
Feature 19–24: mean, maximum, minimum, median, range, and variance of pitch frequency (F0).

Feature 25–36: mean, maximum, minimum, median, range, and variance of dF0 and d2F0.

Feature 37–42: mean, maximum, minimum, median, range, and variance of Zero-Crossing Rate (ZCR).
Figure 7: Intensity and pitch contour of fear.
Figure 8: Intensity and pitch contour of tiredness.
Feature 43–54: mean, maximum, minimum, median, range, and variance of dZCR and d2ZCR.

Feature 70–71: Maximum Voiced Duration (MVD) and Maximum Unvoiced Duration (MUD).
Figure 11: Intensity and pitch contour of confidence.
Feature 72–77: mean, maximum, minimum, median, range, and variance of Harmonic-to-Noise Ratio (HNR).

Feature 78–95: mean, maximum, minimum, median, range, and variance of HNR in the 0–400 Hz, 400–2000 Hz, and 2000–5000 Hz bands.

Feature 96–119: mean, maximum, minimum, median, range, and variance of the 1st formant frequency (F1), 2nd formant frequency (F2), 3rd formant frequency (F3), and 4th formant frequency (F4).

Feature 120–143: mean, maximum, minimum, median, range, and variance of dF1, dF2, dF3, and dF4.

Feature 144–167: mean, maximum, minimum, median, range, and variance of d2F1, d2F2, d2F3, and d2F4.

Feature 168–171: Jitter1 of F1, F2, F3, and F4.

Feature 172–175: Jitter2 of F1, F2, F3, and F4.

Feature 176–199: mean, maximum, minimum, median, range, and variance of the F1, F2, F3, and F4 bandwidths.

Feature 200–223: mean, maximum, minimum, median, range, and variance of the dF1, dF2, dF3, and dF4 bandwidths.

Feature 224–247: mean, maximum, minimum, median, range, and variance of the d2F1, d2F2, d2F3, and d2F4 bandwidths.

Feature 248–325: mean, maximum, minimum, median, range, and variance of MFCC (0–12th order).
Figure 12: Feature distribution over various emotional states.
Feature 326–403: mean, maximum, minimum, median, range, and variance of dMFCC (0–12th order).

Feature 404–481: mean, maximum, minimum, median, range, and variance of d2MFCC (0–12th order).
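As a sketch of how such statistics are derived, the following Python/NumPy snippet applies the six functionals to a frame-level feature track and to its first- and second-order differences, yielding 18 statistics per base feature (matching, e.g., Features 1–18 for short-time energy). The function names are ours, not the paper's.

```python
import numpy as np

def delta(x):
    """First-order difference of a frame-level feature track."""
    return np.diff(np.asarray(x, dtype=float))

def functionals(x):
    """The six statistic functionals used above:
    mean, maximum, minimum, median, range, and variance."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": float(x.mean()),
        "max": float(x.max()),
        "min": float(x.min()),
        "median": float(np.median(x)),
        "range": float(x.max() - x.min()),
        "var": float(x.var()),
    }

def utterance_features(track):
    """Apply the functionals to a track, its difference (d), and its
    second-order difference (d2): 18 statistics per base feature."""
    feats = []
    for t in (np.asarray(track, dtype=float), delta(track), delta(delta(track))):
        f = functionals(t)
        feats.extend([f["mean"], f["max"], f["min"],
                      f["median"], f["range"], f["var"]])
    return np.array(feats)
```

Stacking such blocks for energy, pitch, ZCR, HNR, formants, bandwidths, and MFCCs yields the 481-dimensional utterance-level vector described above.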
3.2. Feature Selection Based on MIC. In this section we introduce the feature selection algorithm of our speech emotion classifier. Feature selection algorithms may be roughly classified into two groups, namely "wrapper" and "filter" methods. Algorithms in the former group depend on a specific classifier; an example is sequential forward selection (SFS), where the final selection result is tied to the chosen classifier and changes if the classifier is replaced. In the second group, feature selection is done by an evaluation criterion, such as the Fisher Discriminant Ratio (FDR). The feature
Figure 13: The arousal and the valence dimensions of emotions.
selection result achieved by this type of method does not depend on a specific classifier and generalizes better across different databases.
The maximal information coefficient (MIC) based feature selection algorithm falls into the second group. MIC is a recent statistical tool, invented by Reshef et al. [14], that measures linear and nonlinear relationships between paired variables.
MIC is based on the idea that if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship [14]. We may calculate the MIC between a certain acoustic feature and the emotional state by exploring all possible grids over the two variables. First, for every pair of integers (x, y), we compute the largest mutual information achievable by any x-by-y grid [14]. Second, for a fair comparison, we normalize these values so that the MIC scores of all acoustic features against the emotional state are comparable. A detailed study of MIC may be found in [14].
Since MIC can treat linear and nonlinear associations at the same time, we do not need to make any assumption on the distribution of the original features. It is therefore especially suitable for evaluating a large number of emotional features. Starting from the basic features described in Section 3.1, we apply MIC to measure the contribution of each feature in correlation with the emotion states. Finally, a subset of features is selected for our emotion classifier.
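To illustrate the grid-search idea, the sketch below estimates a simplified MIC for one feature/label pair. It searches only equal-width grids with at most n^0.6 cells, whereas the true statistic optimizes over all grid partitions [14], so this is an approximation; all names are ours.

```python
import numpy as np
from math import log

def mutual_info(x, y, nx, ny):
    """Mutual information (in nats) of x and y discretized
    on an nx-by-ny equal-width grid."""
    counts, _, _ = np.histogram2d(x, y, bins=(nx, ny))
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (nx, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, ny)
    nz = pxy > 0                          # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mic(x, y, max_cells=None):
    """Simplified MIC: best mutual information over equal-width grids
    with nx * ny <= n**0.6 cells, normalized by log(min(nx, ny))."""
    n = len(x)
    if max_cells is None:
        max_cells = int(n ** 0.6)
    best = 0.0
    for nx in range(2, max_cells + 1):
        for ny in range(2, max_cells // nx + 1):
            best = max(best, mutual_info(x, y, nx, ny) / log(min(nx, ny)))
    return best
```

A strongly associated pair scores near 1, an independent pair near 0, regardless of whether the association is linear; this is the property exploited for ranking the 481 acoustic features.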
4. Recognition Methodology
4.1. Baseline GMM Classifier. The Gaussian mixture model (GMM) based classifier is a state-of-the-art recognition method in speaker and language identification. In this paper we build the baseline classifier on a Gaussian mixture model, so that the baseline may be compared with the online learning method.
A GMM is defined as a weighted sum of Gaussian densities:

\[ p(\mathbf{X}_t \mid \lambda) = \sum_{i=1}^{M} a_i\, b_i(\mathbf{X}_t), \tag{1} \]

where \(\mathbf{X}_t\) is a \(D\)-dimensional random vector, \(b_i(\mathbf{X}_t)\) is the \(i\)th Gaussian component, \(t\) is the index of the utterance sample, \(a_i\) is the mixture weight, and \(M\) is the number of Gaussian mixture components. Each component is a \(D\)-dimensional Gaussian density with mean \(\mathbf{U}_i\) and covariance \(\boldsymbol{\Sigma}_i\):

\[ b_i(\mathbf{X}_t) = \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{X}_t - \mathbf{U}_i)^{T}\, \boldsymbol{\Sigma}_i^{-1}\, (\mathbf{X}_t - \mathbf{U}_i) \right\}. \tag{2} \]

Note that

\[ \sum_{i=1}^{M} a_i = 1. \tag{3} \]
Emotion classification is done by maximizing the posterior probability (which, under equal class priors, reduces to the class-conditional likelihood):

\[ \text{EmotionLabel} = \arg\max_{k}\, p(\mathbf{X}_t \mid \lambda_k). \tag{4} \]
Expectation-Maximization (EM) is adopted for GMM parameter estimation [15].
Due to the different emotion inventories of the datasets, we unify them by categorizing the emotions into positive and negative regions of the valence and arousal dimensions, as shown in Figure 13. We may then verify the ability of the emotion classifier by classifying emotional utterances into the different regions of the valence-arousal space.
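A minimal sketch of such a baseline, assuming scikit-learn's `GaussianMixture`: one GMM is trained per emotion class (or per valence/arousal region) and Eq. (4) is applied at test time. The class name, diagonal covariances, and the component count are our illustrative choices, not the paper's exact configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMEmotionClassifier:
    """One GMM per emotion class; predict by maximum log-likelihood, Eq. (4)."""

    def __init__(self, n_components=4):
        self.n_components = n_components
        self.models = {}

    def fit(self, X, y):
        # Fit a separate mixture model lambda_k on each class's feature vectors.
        for label in np.unique(y):
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type="diag", random_state=0)
            gmm.fit(X[y == label])
            self.models[label] = gmm
        return self

    def predict(self, X):
        # score_samples returns log p(x | lambda_k) per sample; take the argmax.
        labels = list(self.models)
        scores = np.stack([self.models[k].score_samples(X) for k in labels])
        return np.array(labels)[scores.argmax(axis=0)]
```

With the classes collapsed to positive/negative valence and arousal regions as above, two such binary classifiers reproduce the baseline setup.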
4.2. Online Learning Using AdaBoost. While the offline GMM classifier is trained with the EM algorithm, this section introduces the online training algorithm based on AdaBoost. AdaBoost is a powerful ensemble learning algorithm [16]. The idea behind AdaBoost is that weak classifiers may be combined into a powerful classifier: multiple classifiers trained on randomly selected datasets perform quite differently from each other on the same testing dataset, and therefore we may reduce the misclassification rate by a proper decision fusion rule.
The AdaBoost algorithm consists of several iterations. In each iteration, a new training set is selected for a new weak classifier, and a weight is assigned to that classifier. Based on the testing results of the new weak classifier, the weights of all the data samples are modified for the next iteration. At the final step, the ensemble classifier is obtained by combining the multiple weak classifiers through a weighted voting rule.
Let us suppose that the current training set is [17]

\[ T = \{s_1, s_2, \ldots, s_N\}, \tag{6} \]
where the weights of the samples are

\[ W = \{w_1, w_2, \ldots, w_N\}, \qquad \sum_{i=1}^{N} w_i = 1. \tag{7} \]
The error rate of the new weak classifier is

\[ e = \sum_{i\,:\, c(s_i) \neq y_i} w_i, \tag{8} \]

where \(c(s_i)\) is the classification result for sample \(s_i\) and \(y_i\) is its class label.
The fusion weight assigned to each classifier is determined by its error rate:

\[ \alpha = \ln\left(\frac{1 - e}{e}\right). \tag{9} \]
At the beginning of the algorithm, each sample is assigned an equal weight. During the iterations, the sample weights are updated as

\[ w_i \leftarrow \begin{cases} w_i \times \beta, & c(s_i) = y_i, \\ w_i, & c(s_i) \neq y_i, \end{cases} \tag{10} \]

where \(\beta = e/(1-e)\) shrinks the weights of correctly classified samples [16], and the weights are renormalized to sum to one after each iteration.
When new data arrives, and assuming that the label of each new sample is known, the classifiers pretrained on the offline data are used as the initial weak classifiers. The AdaBoost algorithm is then applied to the new online data, and fusion weights are reassigned to the offline-trained classifiers.
In the first m initial iterations, the m pretrained classifiers are used as the weak classifiers and added to the final ensemble, instead of training new weak classifiers on randomly selected data. After the m initial iterations, new weak classifiers are trained on the new online data and added to the final ensemble by the AdaBoost algorithm.
The major difference between online and offline training is the data used for learning. Offline training uses a large amount of acted data, while online training uses a small amount of naturalistic data. Offline training is independent of online training and ready to use, while online training depends on the offline-trained model and only retrains it for specific purposes, such as tuning to a large number of speakers. The purpose of online training is to quickly adapt the existing offline model to a small amount of new data.
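The adaptation scheme of this section can be sketched as follows, with weak classifiers represented as plain callables: the first m boosting rounds reuse the m offline-pretrained classifiers, and later rounds train new weak classifiers on the online data. The helper `train_new` (a weighted weak-learner trainer), the β = e/(1−e) update, and the renormalization follow standard AdaBoost [16]; all names are our illustrative assumptions.

```python
import numpy as np

def adaboost_online(pretrained, train_new, X, y, n_new_rounds=5):
    """Seed the ensemble with pretrained weak classifiers (first m rounds),
    then train n_new_rounds new weak learners on the online data.
    train_new(X, y, w) must return a callable predicting labels for X."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # equal initial sample weights
    ensemble = []                           # list of (alpha, classifier)
    learners = list(pretrained) + [None] * n_new_rounds
    for clf in learners:
        if clf is None:                     # after the m initial rounds
            clf = train_new(X, y, w)
        pred = clf(X)
        e = w[pred != y].sum()              # Eq. (8): weighted error
        e = min(max(e, 1e-10), 1.0 - 1e-10) # guard against e = 0 or 1
        alpha = np.log((1.0 - e) / e)       # Eq. (9): fusion weight
        beta = e / (1.0 - e)
        w[pred == y] *= beta                # Eq. (10): shrink correct samples
        w /= w.sum()                        # renormalize the weights
        ensemble.append((alpha, clf))
    return ensemble

def ensemble_predict(ensemble, X, classes=(0, 1)):
    """Weighted-vote fusion of the ensemble's decisions."""
    votes = {c: np.zeros(len(X)) for c in classes}
    for alpha, clf in ensemble:
        pred = clf(X)
        for c in classes:
            votes[c] += alpha * (pred == c)
    stacked = np.stack([votes[c] for c in classes])
    return np.array(classes)[stacked.argmax(axis=0)]
```

In our setting, the pretrained callables would wrap the offline GMM classifiers, and `train_new` would fit new classifiers on the reweighted online samples.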
5. Experimental Results
In our experiment, the offline training is carried out on the acted basic emotion dataset, while the speaker-independent dataset and the elicited practical emotion dataset are used for online training and online testing. Although the datasets used in online testing consist of preprocessed utterances rather than real-time online data, our experiments still provide a simulated online situation. We divide dataset 2 and dataset 3 into smaller sets (2a/2b and 3a/3b); the "a" subsets are used for the simulated online initialization and the "b" subsets for testing.
Speech utterances from the different sources are organized into several datasets, as shown in Table 3.
The online learning algorithm is verified both on the speaker-independent data and on the elicited data; the results are shown in Table 4. A large number of speakers makes modeling emotional behavior difficult, since emotion expression is highly dependent on individual habits and personality. By extending the offline-trained classifier to online data containing a large number of speakers, we improve the generality of our SER system. The elicited data is collected in a cognitive experiment and is closer to the real-world situation; during the cognitive task, emotional speech is induced. We observed that the different nature of the acted data and the speech induced during a cognitive task caused a significant decrease in the recognition rate. By using the online training technique, we may transfer the offline-trained SER system to the elicited data. Extending our SER system to different data sources may bring emotion recognition closer to real-world applications.
The major challenge in our online learning algorithm is how to combine the existing offline classifier with an efficient adaptation of the model parameters to a small amount of new online data. We adopted the incremental learning idea and solved this problem by modifying the initial stage of the AdaBoost framework. One contribution of our online learning algorithm is that we may reuse the existing offline training data and make the online learning stage more efficient: we make use of a large amount of available offline training data and require only a small amount of data for online training, as shown in Table 3. The weight of each weak classifier is an important parameter; the proposed method may be further improved by using a fuzzy membership function to evaluate the confidence of the GMM classifiers and reestimate the weight of each weak classifier.
6. Discussions
Acted data is often considered unsuitable for real-world applications. However, traditional research has focused on acted emotional speech, and many acted databases are available. How to transfer an SER system trained on acted data to new naturalistic data in the real world remains an unsolved challenge.
Many feature selection algorithms may be applied to an SER system. MIC is a newly proposed and powerful algorithm for exploring nonlinear relationships between variables.
AdaBoost is a popular algorithm for combining multiple weak classifiers into a strong classifier. By applying
Table 3: Selected datasets for online and offline experiments.

Dataset index | Data source | Number of utterances | Purpose of use
Dataset 1 | Acted speech | 12000 | Offline training
Dataset 2a | Speaker-independent speech | 1000 | Online training

Table 4: Recognition results of the online and offline experiments.

Experiment | Offline training data | Online training data | Testing data | Recognition rate (%)
Experiment 1 | Dataset 1 | N/A | Dataset 2b | 63.3
Experiment 2 | Dataset 1 | Dataset 2a | Dataset 2b | 75.6
Experiment 5 | Dataset 2a | N/A | Dataset 2b | 70.0
Experiment 3 | Dataset 1 | N/A | Dataset 3b | 61.2
Experiment 4 | Dataset 1 | Dataset 3a | Dataset 3b | 73.1
Experiment 6 | Dataset 3a | N/A | Dataset 3b | 68.5
AdaBoost in the online setting, we train multiple weak classifiers on the newly arrived online data, with the offline pretrained classifiers used for initialization. We may explore other incremental learning algorithms in future work.
Acknowledgments
This work was partially supported by the China Postdoctoral Science Foundation (no. 2012M520973), the National Nature Science Foundation (no. 61231002, no. 61273266, no. 51075068), and the Doctoral Fund of the Ministry of Education of China (no. 20110092130004). The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions.
References
[1] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette, "Fear-type emotion recognition for future audio-based surveillance systems," Speech Communication, vol. 50, no. 6, pp. 487–503, 2008.

[2] C. Huang, Y. Jin, Y. Zhao, Y. Yu, and L. Zhao, "Speech emotion recognition based on re-composition of two-class classifiers," in Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII '09), Amsterdam, The Netherlands, September 2009.

[3] K. R. Scherer, "Vocal communication of emotion: a review of research paradigms," Speech Communication, vol. 40, no. 1-2, pp. 227–256, 2003.

[4] A. Tawari and M. M. Trivedi, "Speech emotion analysis: exploring the role of context," IEEE Transactions on Multimedia, vol. 12, no. 6, pp. 502–509, 2010.

[5] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proceedings of the 9th European Conference on Speech Communication and Technology, pp. 1517–1520, Lisbon, Portugal, September 2005.

[6] D. Ververidis and C. Kotropoulos, "Automatic speech classification to five emotional states based on gender information," in Proceedings of the 12th European Signal Processing Conference, pp. 341–344, Vienna, Austria, 2004.

[7] S. Steidl, Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech, Department of Computer Science, Friedrich-Alexander-Universitaet Erlangen-Nuernberg, Berlin, Germany, 2008.

[8] M. Grimm, K. Kroschel, and S. Narayanan, "The Vera am Mittag German audio-visual emotional speech database," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '08), pp. 865–868, Hannover, Germany, June 2008.

[9] K. P. Truong, How Does Real Affect Affect Affect Recognition in Speech?, Center for Telematics and Information Technology, University of Twente, Enschede, The Netherlands, 2009.

[10] C. Huang, Y. Jin, Y. Zhao, Y. Yu, and L. Zhao, "Recognition of practical emotion from elicited speech," in Proceedings of the 1st International Conference on Information Science and Engineering (ICISE '09), pp. 639–642, Nanjing, China, December 2009.

[11] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar, "Learn++: an incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man, and Cybernetics C, vol. 31, no. 4, pp. 497–508, 2001.

[12] Q. L. Zhao, Y. H. Jiang, and M. Xu, "Incremental learning by heterogeneous Bagging ensemble," Lecture Notes in Computer Science, vol. 6441, no. 2, pp. 1–12, 2010.

[13] R. Xiao, J. Wang, and F. Zhang, "An approach to incremental SVM learning algorithm," in Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, pp. 268–273, 2000.

[14] D. N. Reshef, Y. A. Reshef, H. K. Finucane, et al., "Detecting novel associations in large data sets," Science, vol. 334, no. 6062, pp. 1518–1524, 2011.

[15] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.

[16] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119–139, 1997.

[17] Q. Zhao, The research on ensemble pruning and its application in on-line machine learning [Ph.D. thesis], National University of Defense Technology, Changsha, China, 2010.
Figure 1The percentage of negative emotions whenmistake occursin the cognitive task
Negative emotions
Correct answersFalse answers
Figure 2The percentage of correct answers and false answers whennegative emotion occurs in the cognitive task
In the neutral speech the pitch contour is also flat but atthe end of the sentence the pitch frequency increases Com-paring speaking the pitch frequency is not consistent at theend of the sentence Under the sadness emotion state thepitch contour is smooth and decreases at the end of the sen-tence Furthermore in the happiness sample the varianceof the pitch frequency is higher The pith frequency also in-creases in the confidence and surprise samples
We also notice that under the angry emotion state thevariance of the intensity is lower and the intensity contouris smooth However in the sadness sample the varianceof the intensity is higher Sadness and tiredness may havecaused longer time duration and a lower speech rate whilefidgetiness and anger may have caused a higher speech rate
Quantitative statistical analysis is shown in Figure 12Pitch and formants features are compared under variousemotional states
For modeling and recognition purposes 481 dimensionsof acoustic features are constructed Statistic functions over
Time (s)0 2337
0
600
Freq
uenc
y (H
z)
Time (s)0 2337
minus06864
06402
0
Time (s)0 2337
5374
8552
Inte
nsity
(dB)
Figure 3 Intensity and pitch contour of happiness
Time (s)0 3587
minus0708
07079
0
Time (s)0 3587
0
600
Freq
uenc
y (H
z)
Time (s)0 3587
1393
8779
Inte
nsity
(dB)
Figure 4 Intensity and pitch contour of sadness
the entire utterance such as maximum minimum meanrange are applied to the basic speech features as listed belowldquodrdquo stands for difference and ldquod2rdquo stands for the second orderof difference
Feature 1ndash6 mean maximum minimum medianrange and variance of Short-time Energy (SE)Feature 7ndash18 mean maximum minimum medianrange and variance of dSE and d2SE
4 Mathematical Problems in Engineering
Time (s)0 2716
minus07079
07079
0
Time (s)0 2716
0
600
Freq
uenc
y (H
z)
Time (s)0 2716
minus2878
8675
Inte
nsity
(dB)
Figure 5 Intensity and pitch contour of fidgetiness
Time (s)0 286
minus0708
06984
0
Time (s)0 286
0
600
Freq
uenc
y (H
z)
Time (s)0 286
1434
8812
Inte
nsity
(dB)
Figure 6 Intensity and pitch contour of surprise
Feature 19ndash24 mean maximum minimum medianrange and variance of pitch frequency (F
0)
Feature 25ndash36 mean maximum minimum medianrange and variance of dF
0and d2F
0
Feature 37ndash42 mean maximum minimum medianrange and variance of Zero-Crossing Rate (ZCR)
Time (s)0 2575
-06931
07073
0
Time (s)0 2575
0
600
Freq
uenc
y (H
z)Time (s)
0 25751474
8628
Inte
nsity
(dB)
Figure 7 Intensity and pitch contour of fear
Time (s)0 4061
minus0708
07079
0
Time (s)0 4061
0
600
Freq
uenc
y (H
z)
Time (s)0 4061
minus300
8682
Inte
nsity
(dB)
Figure 8 Intensity and pitch contour of tiredness
Feature 43ndash54 mean maximum minimum medianrange and variance of dZCR and d2ZCR
Feature 70-71 Maximum Voiced Duration (MVD)Maximum Unvoiced Duration (MUD)
Time (s)0 2354
minus07079
07079
0
Time (s)0 2354
0
600
Freq
uenc
y (H
z)
Time (s)0 2354
minus09446
8677
Inte
nsity
(dB)
Figure 11 Intensity and pitch contour of confidence
Feature 72ndash77 mean maximum minimum medianrange and variance of Harmonic-to-Noise Ratio(HNR)
Feature 78ndash95 mean maximum minimum medianrange and variance of HNR (0ndash400Hz 400ndash2000Hz and 2000ndash5000Hz)
Feature 96ndash119 meanmaximumminimummedianrange and variance of 1st formant frequency (F1) 2ndformant frequency (F2) 3rd formant frequency (F3)and 4th formant frequency (F4)
Feature 120ndash143 mean maximum minimum me-dian range and variance of dF1 dF2 dF3 and dF4
Feature 144ndash167 mean maximum minimum me-dian range and variance of d2F1 d2F2 d2F3 andd2F4
Feature 168ndash171 Jitter1 of F1 F2 F3 and F4
Feature 172ndash175 Jitter2 of F1 F2 F3 and F4
Feature 176ndash199 mean maximum minimum me-dian range and variance of F1 F2 F3 and F4 Band-width
Feature 200ndash223 mean maximum minimum me-dian range and variance of dF1 Bandwidth dF2Bandwidth dF3 Bandwidth and dF4 Bandwidth
Feature 224ndash247 mean maximum minimum me-dian range and variance of d2F1 Bandwidth d2F2Bandwidth d2F3 Bandwidth and d2F4 Bandwidth
Feature 248ndash325 mean maximum minimum me-dian range and variance of MFCC (0ndash12th-order)
Figure 12 Feature distribution over various emotional states
Feature 326ndash403 mean maximum minimum me-dian range and variance of dMFCC (0ndash12th-order)Feature 404ndash481 mean maximum minimum me-dian range and variance of d2MFCC (0ndash12th-order)
32 Feature Selection Based onMIC In this section we intro-duce the feature selection algorithm in our speech emotionclassifier Feature selection algorithms may be roughly clas-sified into two groups namely ldquowrapperrdquo and ldquofilterrdquo Algo-rithms in the former group are dependent on the specific clas-sifiers such as sequential forward selection (SFS) The finalselection result is dependent on a specific classifier If we re-place the specific classifier the results will change In thesecond group feature selection is done by a certain evaluationcriteria such as FisherDiscriminant Ratio (FDR)The feature
Figure 13 The arousal and the valence dimensions of emotions
selection result achieved in this type of method is not de-pendent on specific classifiers and bears a better generalityacross different databases
Maximal information coefficient (MIC) based feature se-lection algorithm falls into the second group MIC is a newstatistic tool that measures linear and nonlinear relationshipsbetween paired variables invented by Reshef et al [14]
MIC is based on the idea that if a relationship existsbetween two variables then a grid can be drawn on the scat-terplot of the two variables that partitions the data to encap-sulate that relationship [14] We may calculate the MIC of acertain acoustic feature and the emotional state by exploringall possible grids on the two variables First we computefor every pair of integers (119909 119910) that largest possible mutualinformation achieved by any 119909-by-119910 grid [14] Second for afair comparison we normalize these MIC values between allacoustic features and the emotional state Detailed study ofMIC may be found in [14]
Since MIC can treat linear and nonlinear associations atthe same time we do not need tomake any assumption on thedistribution of our original features Therefore it is especiallysuitable for evaluating a large number of emotional featuresBased on a large number of basic features as described inSection 31 we apply MIC to measure the contribution ofthese features in correlation with emotion states Finally asubset of features is selected for our emotion classifier
4 Recognition Methodology
41 Baseline GMM Classifier The Gaussian mixture model(GMM) based classifier is the state-of-the-art recognitionmethod in speaker and language identification In this paperwe built the baseline classifier using Gaussianmixturemodeland we may compare the baseline classifier with the onlinelearning method
Mathematical Problems in Engineering 7
GMM may be defined by the sum of several Gaussiandistributions
119901 (X119905| 120582) =
119872
sum
119894=1
119886119894119887119894(X119905) (1)
where X119905is a 119863-dimension random vector 119887
119894(X119905) is the 119894th
member of Gaussian distribution 119905 is the index of utterancesample 119886
119894is the mixture weight and 119872 is the number of
Gaussian mixture members Each member is a119863-dimensionvariable which follows the Gaussian distribution with themean U
119894and the covariance Σ
119894
119887119894(X119905) =
1
(2120587)119863210038161003816
10038161003816Σ119894
1003816100381610038161003816
12exp minus1
2
(X119905minus U119894)119879
Σminus1
119894(X119905minus U119894)
(2)
Note that119872
sum
119894=1
119886119894= 1 (3)
Emotion classification can be done by maximizing theposterior probability
EmotionLable = argmax119896
(119901 (X119905| 120582119896)) (4)
ExpectationMaximization (EM) is adopted forGMMparam-eter estimation [15]
Due to the different types of emotions among the datasetswe unify the emotional datasets by categorizing them intopositive and negative regions in the valence and arousal di-mensions as shown in Figure 13 We may verify the ability ofthe emotion classifier by classifying the emotional utterancesinto different regions in the valence and arousal space
42 Online LearningUsingAdaBoost While the offlineGMMclassifier is trained using EM algorithm the online trainingalgorithmusingAdabBoost will be introduced in this sectionAdaBoost is a powerful algorithm in assemble learning [16]The belief in this AdaBoost is that weak classifiers may becombined into a powerful classifier Multiple classifierstrained on randomly selected datasets perform quiet differ-ently from each other on the same testing dataset therefore
we may reduce the misclassification rate by a proper decisionfusion rule
AdaBoost algorithm consists of several iterations In eachiteration a new training set is selected for a new weak clas-sifier A weight is assigned to the new weak classifier Basedon the testing results of the newweak classifier the weights ofall the data samples are modified for the next iteration At thefinal step the assembled classifier is achieved by combinationof themultipleweak classified through aweighted voting rule
Let us suppose the current training set is [17]
119879 = 1199041 1199042 119904
119873 (6)
where the weights of the samples are
119882 = 1199081 1199082 119908
119873
119873
sum
119894=0
119908119894= 1
(7)
The error rate of the new weak classifier is
119890 = sum
119894119888(119904119894) = 119910119894
119908119894 (8)
where 119888(119904119894) is the classification result and 119910
119894is the class label
The fusion weight assigned to each classifier is defined by theerror rate
120572 = ln((1 minus 119890)119890
) (9)
At the beginning of the algorithm each sample is assignedby equal weight During the iteration the sample weights areupdated
119908119894+1
=
119908119894times 120573 119888 (119904
119894) = 119910119894
119908119894 119888 (119904
119894) = 119910119894
(10)
At the arrival of the new data assuming that we knowthe label information for each sample pretrained classifiersfrom the offline data are used as initial weak classifiers Ada-Boost algorithm is applied to the new online data and fusionweights are reassigned to the offline trained classifiers
At the first119898 initial iterations119898 pretrained classifiers areused as the weak classifiers and added to the final ensembleclassifier instead of training new weak classifiers from therandomly selected dataset After the119898 initial iterations newweak classifiers are trained from the new online data andadded to the final ensemble classifier in the AdaBoostalgorithm
The major difference between the online training and theoffline training is the data used for learning Offline train-ing uses large acted data while online training uses small andnatural data Offline training is independent of the onlinetraining and ready to use while the online training is depen-dent on the offline training and only retrains the existingmodel to fit specific purposes such as to tune on a largenumber of speakers The purpose of online training is toquickly adapt the existing offline model to a small amountof new data
8 Mathematical Problems in Engineering
5 Experimental Results
In our experiment, the offline training is carried out on the acted basic emotion dataset. The speaker-independent dataset and the elicited practical emotion dataset are used for the online training and the online testing. Although the datasets used in online testing are preprocessed utterances rather than real-time online data, our experiments still provide a simulated online situation. We divide dataset 2 and dataset 3 into smaller sets (datasets 2a/2b and 3a/3b), which are used for the simulated online initialization and testing.
Speech utterances from different sources are organized into several datasets, as shown in Table 2.
The online learning algorithm is verified both on the speaker-independent data and on the elicited data. The results are shown in Table 4. A large number of speakers brings difficulties in modeling emotional behavior, since emotion expression is highly dependent on individual habit and personality. By extending the offline-trained classifier to the online data that contains a large number of speakers, we improved the generality of our SER system. The elicited data is collected in a cognitive experiment, which is closer to the real-world situation; during the cognitive task, emotional speech is induced. We observed that the difference in nature between the acted data and the speech induced during a cognitive task caused a significant decrease in the recognition rate. By using the online training technique, we may transfer the offline-trained SER system to the elicited data. Extending our SER system to different data sources may bring emotion recognition closer to real-world applications.
The major challenge in our online learning algorithm is how to combine the existing offline classifier and efficiently adapt the model parameters to a small amount of new online data. We adopted the incremental learning idea and solved this problem by modifying the initial stage in the AdaBoost framework. One contribution of our online learning algorithm is that we may reuse the existing offline training data and make the online learning stage more efficient. We make use of a large amount of available offline training data and require only a small amount of data for online training, as shown in Table 3. The weight of each weak classifier is an important parameter. The proposed method may be further improved by using a fuzzy membership function to evaluate the confidence of the GMM classifiers and reestimate the weight of each weak classifier.
6 Discussions
Acted data is often considered unsuitable for real-world applications. However, traditional research has focused on acted emotional speech, and many acted databases are available. How to transfer an SER system trained on acted data to new naturalistic data in the real world remains an unsolved challenge.
Many feature selection algorithms may be applied to an SER system. MIC is a newly proposed and powerful algorithm for exploring nonlinear relationships between variables.
AdaBoost is a popular algorithm that combines multiple weak classifiers to establish a strong classifier. By applying
Table 3: Selected datasets for online and offline experiments.

Dataset index | Data source         | Number of utterances | Purpose of use
Dataset 1     | Acted speech        | 12000                | Offline training
Dataset 2a    | Speaker-independent | 1000                 | Online training

Table 4: Recognition results of the online and offline experiments.

Experiment   | Offline training data | Online training data | Testing data | Recognition result (%)
Experiment 1 | Dataset 1             | N/A                  | Dataset 2b   | 63.3
Experiment 2 | Dataset 1             | Dataset 2a           | Dataset 2b   | 75.6
Experiment 5 | Dataset 2a            | N/A                  | Dataset 2b   | 70.0
Experiment 3 | Dataset 1             | N/A                  | Dataset 3b   | 61.2
Experiment 4 | Dataset 1             | Dataset 3a           | Dataset 3b   | 73.1
Experiment 6 | Dataset 3a            | N/A                  | Dataset 3b   | 68.5
AdaBoost in the online setting, we train multiple weak classifiers based on the newly arrived online data. The offline pretrained classifiers are used for initialization. We may explore other incremental learning algorithms in future work.
Acknowledgments
This work was partially supported by the China Postdoctoral Science Foundation (No. 2012M520973), the National Natural Science Foundation of China (Nos. 61231002, 61273266, and 51075068), and the Doctoral Fund of the Ministry of Education of China (No. 20110092130004). The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions.
References
[1] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette, "Fear-type emotion recognition for future audio-based surveillance systems," Speech Communication, vol. 50, no. 6, pp. 487–503, 2008.
[2] C. Huang, Y. Jin, Y. Zhao, Y. Yu, and L. Zhao, "Speech emotion recognition based on re-composition of two-class classifiers," in Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII '09), Amsterdam, The Netherlands, September 2009.
[3] K. R. Scherer, "Vocal communication of emotion: a review of research paradigms," Speech Communication, vol. 40, no. 1-2, pp. 227–256, 2003.
[4] A. Tawari and M. M. Trivedi, "Speech emotion analysis: exploring the role of context," IEEE Transactions on Multimedia, vol. 12, no. 6, pp. 502–509, 2010.
[5] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proceedings of the 9th European Conference on Speech Communication and Technology, pp. 1517–1520, Lisbon, Portugal, September 2005.
[6] D. Ververidis and C. Kotropoulos, "Automatic speech classification to five emotional states based on gender information," in Proceedings of the 12th European Signal Processing Conference, pp. 341–344, Vienna, Austria, 2004.
[7] S. Steidl, Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech, Department of Computer Science, Friedrich-Alexander-Universitaet Erlangen-Nuernberg, Berlin, Germany, 2008.
[8] M. Grimm, K. Kroschel, and S. Narayanan, "The Vera am Mittag German audio-visual emotional speech database," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '08), pp. 865–868, Hannover, Germany, June 2008.
[9] K. P. Truong, How Does Real Affect Affect Affect Recognition in Speech?, Center for Telematics and Information Technology, University of Twente, Enschede, The Netherlands, 2009.
[10] C. Huang, Y. Jin, Y. Zhao, Y. Yu, and L. Zhao, "Recognition of practical emotion from elicited speech," in Proceedings of the 1st International Conference on Information Science and Engineering (ICISE '09), pp. 639–642, Nanjing, China, December 2009.
[11] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar, "Learn++: an incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man, and Cybernetics C, vol. 31, no. 4, pp. 497–508, 2001.
[12] Q. L. Zhao, Y. H. Jiang, and M. Xu, "Incremental learning by heterogeneous Bagging ensemble," Lecture Notes in Computer Science, vol. 6441, no. 2, pp. 1–12, 2010.
[13] R. Xiao, J. Wang, and F. Zhang, "An approach to incremental SVM learning algorithm," in Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, pp. 268–273, 2000.
[14] D. N. Reshef, Y. A. Reshef, H. K. Finucane, et al., "Detecting novel associations in large data sets," Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
[15] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[16] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119–139, 1997.
[17] Q. Zhao, The research on ensemble pruning and its application in on-line machine learning [Ph.D. thesis], National University of Defense Technology, Changsha, China, 2010.
Feature 70–71: Maximum Voiced Duration (MVD) and Maximum Unvoiced Duration (MUD).
[Figure 11: Intensity and pitch contour of confidence. Panels show the waveform, the pitch contour (0–600 Hz), and the intensity contour (dB) over 0–2.354 s.]
Feature 72–77: mean, maximum, minimum, median, range, and variance of the Harmonic-to-Noise Ratio (HNR).
Feature 78–95: mean, maximum, minimum, median, range, and variance of HNR in three bands (0–400 Hz, 400–2000 Hz, and 2000–5000 Hz).
Feature 96–119: mean, maximum, minimum, median, range, and variance of the 1st formant frequency (F1), 2nd formant frequency (F2), 3rd formant frequency (F3), and 4th formant frequency (F4).
Feature 120–143: mean, maximum, minimum, median, range, and variance of dF1, dF2, dF3, and dF4.
Feature 144–167: mean, maximum, minimum, median, range, and variance of d2F1, d2F2, d2F3, and d2F4.
Feature 168–171: Jitter1 of F1, F2, F3, and F4.
Feature 172–175: Jitter2 of F1, F2, F3, and F4.
Feature 176–199: mean, maximum, minimum, median, range, and variance of the F1, F2, F3, and F4 bandwidths.
Feature 200–223: mean, maximum, minimum, median, range, and variance of the dF1, dF2, dF3, and dF4 bandwidths.
Feature 224–247: mean, maximum, minimum, median, range, and variance of the d2F1, d2F2, d2F3, and d2F4 bandwidths.
Feature 248–325: mean, maximum, minimum, median, range, and variance of MFCC (0–12th order).
[Figure 12: Feature distribution over various emotional states.]
Feature 326–403: mean, maximum, minimum, median, range, and variance of dMFCC (0–12th order).
Feature 404–481: mean, maximum, minimum, median, range, and variance of d2MFCC (0–12th order).
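As a concrete illustration, the six statistics applied to each acoustic contour above, and the delta operator behind the "d" prefixes (dF1, dMFCC, etc.), can be sketched in a few lines; the function names and the sample F1 values are our own, not the paper's implementation.

```python
import statistics

def contour_stats(values):
    """The six statistics computed for each acoustic contour:
    mean, maximum, minimum, median, range, and variance."""
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),
    }

def delta(values):
    """First-order difference contour (the 'd' prefix, e.g. dF1, dMFCC);
    applying it twice gives the 'd2' contours."""
    return [b - a for a, b in zip(values, values[1:])]

f1 = [500.0, 520.0, 515.0, 530.0]  # hypothetical F1 contour in Hz
stats = contour_stats(f1)
d_f1 = delta(f1)
d2_f1 = delta(d_f1)
```

So each raw contour (pitch, formant, MFCC coefficient, ...) contributes six scalar features, and its delta and delta-delta contours contribute six more each, which is how the feature count grows to 481.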
3.2. Feature Selection Based on MIC. In this section we introduce the feature selection algorithm in our speech emotion classifier. Feature selection algorithms may be roughly classified into two groups, namely "wrapper" and "filter" methods. Algorithms in the former group depend on a specific classifier, such as sequential forward selection (SFS): the final selection result is tied to a particular classifier, and replacing that classifier changes the results. In the latter group, feature selection is done according to an evaluation criterion, such as the Fisher Discriminant Ratio (FDR). The feature selection result achieved by this type of method does not depend on a specific classifier and bears better generality across different databases.

[Figure 13: The arousal and the valence dimensions of emotions.]
The maximal information coefficient (MIC) based feature selection algorithm falls into the second group. MIC is a new statistical tool, introduced by Reshef et al. [14], that measures both linear and nonlinear relationships between paired variables.
MIC is based on the idea that if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship [14]. We may calculate the MIC between a certain acoustic feature and the emotional state by exploring all possible grids on the two variables. First, we compute, for every pair of integers (x, y), the largest possible mutual information achieved by any x-by-y grid [14]. Second, for a fair comparison, we normalize these values so that the MIC scores of all acoustic features with respect to the emotional state are comparable. A detailed study of MIC may be found in [14].
Since MIC can capture linear and nonlinear associations at the same time, we do not need to make any assumption about the distribution of our original features. Therefore, it is especially suitable for evaluating a large number of emotional features. Based on the large set of basic features described in Section 3.1, we apply MIC to measure the contribution of these features in correlation with emotional states. Finally, a subset of features is selected for our emotion classifier.
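To make the grid-search idea concrete, here is a deliberately simplified, MIC-flavored scorer of our own: it bins a continuous feature at several equal-width resolutions, computes the mutual information with the discrete emotion label, and normalizes each value. This is an illustration of the principle only, not Reshef et al.'s full estimator (which searches over all grid placements on both axes).

```python
import math
from collections import Counter

def mic_like_score(feature, labels, max_bins=6):
    """Simplified MIC-style score: max over bin counts b of the
    normalized mutual information between the binned feature and
    the class label."""
    n = len(feature)
    lo, hi = min(feature), max(feature)
    n_classes = len(set(labels))
    best = 0.0
    for b in range(2, max_bins + 1):
        width = (hi - lo) / b or 1.0               # guard constant features
        bins = [min(int((x - lo) / width), b - 1) for x in feature]
        joint = Counter(zip(bins, labels))
        px, py = Counter(bins), Counter(labels)
        # mutual information of the binned feature and the label
        mi = sum((c / n) * math.log2((c / n) / ((px[i] / n) * (py[y] / n)))
                 for (i, y), c in joint.items())
        # normalize as in MIC: divide by log of the smaller grid dimension
        best = max(best, mi / math.log2(min(b, n_classes)))
    return best

# a feature that separates the classes scores near 1; a noisy one scores lower
good = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]
poor = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
y = ["neg", "neg", "neg", "pos", "pos", "pos"]
```

Ranking all 481 features by such a score and keeping the top subset is the filter-style selection the section describes.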
4 Recognition Methodology
4.1. Baseline GMM Classifier. The Gaussian mixture model (GMM) based classifier is a state-of-the-art recognition method in speaker and language identification. In this paper we build the baseline classifier using a Gaussian mixture model, so that we may compare the baseline classifier with the online learning method.
A GMM may be defined as a weighted sum of Gaussian densities:

$$p(\mathbf{X}_t \mid \lambda) = \sum_{i=1}^{M} a_i\, b_i(\mathbf{X}_t), \quad (1)$$

where $\mathbf{X}_t$ is a $D$-dimensional random vector, $b_i(\mathbf{X}_t)$ is the $i$th Gaussian member density, $t$ is the index of the utterance sample, $a_i$ is the mixture weight, and $M$ is the number of Gaussian mixture members. Each member is a $D$-dimensional Gaussian with mean $\mathbf{U}_i$ and covariance $\Sigma_i$:

$$b_i(\mathbf{X}_t) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{X}_t - \mathbf{U}_i)^{T}\,\Sigma_i^{-1}\,(\mathbf{X}_t - \mathbf{U}_i)\right\}. \quad (2)$$

Note that

$$\sum_{i=1}^{M} a_i = 1. \quad (3)$$

Emotion classification can be done by maximizing the posterior probability:

$$\mathrm{EmotionLabel} = \arg\max_k\, p(\mathbf{X}_t \mid \lambda_k). \quad (4)$$

Expectation-Maximization (EM) is adopted for GMM parameter estimation [15].
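A toy sketch of Eqs. (1)–(4) restricted to diagonal covariances; the model parameters below are made up for illustration, where a real system would estimate them with EM on training utterances.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log of Eq. (2) for a diagonal covariance: a sum of
    per-dimension 1-D Gaussian log densities."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_likelihood(x, gmm):
    """log p(x | lambda), Eq. (1); gmm is a list of components
    (a_i, U_i, diagonal Sigma_i) whose weights a_i sum to 1 (Eq. (3))."""
    return math.log(sum(a * math.exp(log_gauss_diag(x, mu, var))
                        for a, mu, var in gmm))

def classify(x, models):
    """Eq. (4): pick the emotion whose GMM gives the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(x, models[label]))

# two hypothetical single-component 2-D models, one per class region
models = {
    "positive": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "negative": [(1.0, [3.0, 3.0], [1.0, 1.0])],
}
```

With more components per class, `log_likelihood` stays unchanged; only the parameter lists grow, which is what makes the GMM baseline easy to scale.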
Due to the different types of emotions among the datasets, we unify the emotional datasets by categorizing them into positive and negative regions along the valence and arousal dimensions, as shown in Figure 13. We may verify the ability of the emotion classifier by classifying the emotional utterances into different regions of the valence-arousal space.
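The unification step amounts to a fixed lookup from category labels to binary valence/arousal targets. In the sketch below, both the emotion names and their sign assignments are illustrative placeholders, since this excerpt does not list the paper's exact label set.

```python
# hypothetical category-to-quadrant table: (valence sign, arousal sign)
QUADRANT = {
    "happiness": (+1, +1),  # positive valence, high arousal
    "anger":     (-1, +1),  # negative valence, high arousal
    "sadness":   (-1, -1),  # negative valence, low arousal
    "calm":      (+1, -1),  # positive valence, low arousal
}

def to_binary_targets(emotion):
    """Unify a categorical label into the two binary classification
    targets used for cross-dataset evaluation."""
    valence, arousal = QUADRANT[emotion]
    return {"valence": "positive" if valence > 0 else "negative",
            "arousal": "high" if arousal > 0 else "low"}
```

Once every dataset's labels pass through such a table, a single valence classifier and a single arousal classifier can be evaluated across all three data sources.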
4.2. Online Learning Using AdaBoost. While the offline GMM classifier is trained using the EM algorithm, the online training algorithm using AdaBoost is introduced in this section. AdaBoost is a powerful algorithm in ensemble learning [16]. The idea behind AdaBoost is that weak classifiers may be combined into a powerful classifier. Multiple classifiers trained on randomly selected datasets perform quite differently from each other on the same testing dataset; therefore, we may reduce the misclassification rate by a proper decision fusion rule.
6 Discussions
Acted data is often considered not suitable for real worldapplications However traditional researches have been fo-cused on the acted emotion speech andmany acted databasesare available How to transfer an SER system that trained onthe acted data to the new naturalistic data in real world is anunsolved challenge
Many feature selection algorithms may be applied to SERsystem MIC is a newly proposed and powerful algorithm forexploring nonlinear relationship between variables
AdaBoost is a popular algorithm to ensemble multipleweak classifiers to establish a strong classifier By applying
Table 3 Selected datasets for online and offline experiments
Datasets index Data source Number ofutterances Purpose of use
Dataset 1 Acted speech 12000 Offline training
Dataset 2a Speakerindependent 1000 Online training
result Experiment 1 Dataset 1 NA Dataset 2b 633Experiment 2 Dataset 1 Dataset 2a Dataset 2b 756Experiment 5 Dataset 2a NA Dataset 2b 700Experiment 3 Dataset 1 NA Dataset 3b 612Experiment 4 Dataset 1 Dataset 3a Dataset 3b 731Experiment 6 Dataset 3a NA Dataset 3b 685
AdaBoost in the online occasion we train multiple weakclassifiers based on the newly arrived online data The offlinepretrained classifiers are used for initialization We may ex-plore other incremental learning algorithms in the futurework
Acknowledgments
This work was partially supported by China Postdoctoral Sci-ence Foundation (no 2012M520973) National Nature Sci-ence Foundation (no 61231002 no 61273266 no 51075068)and Doctoral Fund of Ministry of Education of China (no20110092130004)The authors would like to thank the anony-mous reviewers for their valuable comments and helpfulsuggestions
References
[1] C Clavel I Vasilescu L Devillers G Richard and T EhretteldquoFear-type emotion recognition for future audio-based surveil-lance systemsrdquo Speech Communication vol 50 no 6 pp 487ndash503 2008
[2] C Huang Y Jin Y Zhao Y Yu and L Zhao ldquoSpeech emotionrecognition based on re-composition of two-class classifiersrdquoin Proceedings of the 3rd International Conference on AffectiveComputing and Intelligent Interaction andWorkshops (ACII rsquo09)Amsterdam The Netherlands September 2009
[3] K R Scherer ldquoVocal communication of emotion a review ofresearch paradigmsrdquo SpeechCommunication vol 40 no 1-2 pp227ndash256 2003
[4] A Tawari andMM Trivedi ldquoSpeech emotion analysis explor-ing the role of contextrdquo IEEE Transactions on Multimedia vol12 no 6 pp 502ndash509 2010
[5] F Burkhardt A Paeschke M Rolfes W Sendlmeier and BWeiss ldquoA database of German emotional speechrdquo inProceedings
Mathematical Problems in Engineering 9
of the 9th European Conference on Speech Communication andTechnology pp 1517ndash1520 Lissabon Portugal September 2005
[6] D Ververidis and C Kotropoulos ldquoAutomatic speech classifi-cation to five emotional states based on gender informationrdquo inProceedings of the 12th European Signal Processing Conferencepp 341ndash344 Vienna Austria 2004
[7] S SteidlAutomatic Classification of Emotion-RelatedUser Statesin Spontaneous Childrenrsquos Speech Department of Computer Sci-ence Friedrich-Alexander-Universitaet Erlangen-NuermbergBerlin Germany 2008
[8] M Grimm K Kroschel and S Narayanan ldquoThe Vera am Mit-tag German audio-visual emotional speech databaserdquo in Pro-ceedings of the IEEE International Conference onMultimedia andExpo (ICME rsquo08) pp 865ndash868 Hannover Germany June 2008
[9] K P Truong How Does Real Affect Affect Affect Recognitionin Speech Center for Telematics and Information TechnologyUniversity of Twente Enschede The Netherlands 2009
[10] C Huang Y Jin Y Zhao Y Yu and L Zhao ldquoRecognition ofpractical emotion from elicited speechrdquo in Proceedings of the 1stInternational Conference on Information Science and Engineer-ing (ICISE rsquo09) pp 639ndash642 Nanjing China December 2009
[11] R Polikar L Udpa S S Udpa and V Honavar ldquoLearn++an incremental learning algorithm for supervised neural net-worksrdquo IEEE Transactions on Systems Man and Cybernetics Cvol 31 no 4 pp 497ndash508 2001
[12] Q L Zhao Y H Jiang and M Xu ldquoIncremental learning byheterogeneous Bagging ensemblerdquo Lecture Notes in ComputerScience vol 6441 no 2 pp 1ndash12 2010
[13] R Xiao J Wang and F Zhang ldquoAn approach to incrementalSVM learning algorithmrdquo in Proceedings of the IEEE Interna-tional Conference on Tools with Artificial Intelligence pp 268ndash273 2000
[14] D N Reshef Y A Reshef H K Finucane et al ldquoDetectingnovel associations in large data setsrdquo Science vol 334 no 6062pp 1518ndash1524 2011
[15] D A Reynolds and R C Rose ldquoRobust text-independentspeaker identification using Gaussian mixture speaker modelsrdquoIEEE Transactions on Speech and Audio Processing vol 3 no 1pp 72ndash83 1995
[16] Y Freund and R E Schapire ldquoA decision-theoretic generaliza-tion of on-line learning and an application to boostingrdquo Journalof Computer and System Sciences vol 55 no 1 part 2 pp 119ndash139 1997
[17] Q ZhaoThe research on ensemble pruning and its application inon-line machine learning [PhD thesis] National University ofDefense Technology Changsha China 2010
Figure 12: Feature distribution over various emotional states.

Features 326–403: mean, maximum, minimum, median, range, and variance of dMFCC (0th–12th order).
Features 404–481: mean, maximum, minimum, median, range, and variance of d2MFCC (0th–12th order).
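As a small illustration of how each six-statistic block above can be derived from a frame-level feature contour (e.g. one dMFCC coefficient over an utterance), the following pure-Python sketch computes the six functionals. The function name and example values are ours, not the paper's:

```python
# Compute the per-utterance statistical functionals used for features 326-481:
# mean, maximum, minimum, median, range, and variance of a frame-level contour.
from statistics import mean, median, variance

def functionals(contour):
    """Return the six statistics for one feature contour."""
    return {
        "mean": mean(contour),
        "max": max(contour),
        "min": min(contour),
        "median": median(contour),
        "range": max(contour) - min(contour),   # max minus min
        "variance": variance(contour),          # sample variance
    }

# Toy contour standing in for one dMFCC coefficient across five frames.
stats = functionals([0.1, 0.4, -0.2, 0.3, 0.0])
```

Applying this to each of the 0th–12th-order dMFCC and d2MFCC contours yields the 6 × 13 = 78 features of each block.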
3.2. Feature Selection Based on MIC. In this section we introduce the feature selection algorithm in our speech emotion classifier. Feature selection algorithms may be roughly classified into two groups, namely "wrapper" and "filter" methods. Algorithms in the former group depend on a specific classifier, such as sequential forward selection (SFS): the final selection depends on the chosen classifier, and if we replace that classifier the results change. In the second group, feature selection is done by an evaluation criterion, such as the Fisher Discriminant Ratio (FDR). The feature selection result achieved by this type of method does not depend on a specific classifier and bears a better generality across different databases.

Figure 13: The arousal and the valence dimensions of emotions.
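To make the "filter" idea concrete, here is a minimal sketch of the FDR criterion mentioned above for a two-class problem, scoring each feature by (m1 − m2)² / (v1 + v2); higher means better class separation. The function and example values are illustrative, not from the paper:

```python
# Fisher Discriminant Ratio (FDR): a classifier-independent "filter" criterion.
# A feature with well-separated class means and small within-class variances
# gets a high score and would be ranked higher for selection.
from statistics import mean, pvariance

def fdr(values_class1, values_class2):
    m1, m2 = mean(values_class1), mean(values_class2)
    v1, v2 = pvariance(values_class1), pvariance(values_class2)
    return (m1 - m2) ** 2 / (v1 + v2)

# Rank two hypothetical features: the second separates the classes better.
f1 = fdr([1.0, 1.2, 0.9], [1.1, 1.0, 1.3])    # heavy overlap -> low FDR
f2 = fdr([0.1, 0.2, 0.15], [0.9, 1.0, 0.95])  # well separated -> high FDR
```

Because no classifier is trained during the scoring, the ranking carries over unchanged whatever recognizer is used downstream.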
The maximal information coefficient (MIC) based feature selection algorithm falls into the second group. MIC is a new statistical tool, invented by Reshef et al. [14], that measures linear and nonlinear relationships between paired variables.

MIC is based on the idea that if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data so as to encapsulate that relationship [14]. We may calculate the MIC of a given acoustic feature and the emotional state by exploring all possible grids on the two variables. First, for every pair of integers (x, y), we compute the largest possible mutual information achieved by any x-by-y grid [14]. Second, for a fair comparison, we normalize these values across all acoustic features and the emotional state. A detailed study of MIC may be found in [14].

Since MIC can treat linear and nonlinear associations at the same time, we do not need to make any assumption on the distribution of our original features. It is therefore especially suitable for evaluating a large number of emotional features. Based on the large set of basic features described in Section 3.1, we apply MIC to measure the contribution of these features in correlation with the emotional states. Finally, a subset of features is selected for our emotion classifier.
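The two-step recipe above (maximize mutual information over grids, then normalize) can be sketched in a simplified form. This is not the full MINE algorithm of Reshef et al.: for brevity it uses equal-width bins and a fixed small grid range, whereas the real algorithm optimizes grid placement and bounds the grid size by the sample count. All names are ours:

```python
# Simplified MIC-style score: for each small bx-by-by grid, bin both variables,
# compute the mutual information of the binning, normalize by log(min(bx, by)),
# and keep the maximum over grids. A perfect (e.g. linear) relationship
# approaches 1; an unrelated pairing stays well below 1.
import math

def _bin(vals, k):
    # Equal-width binning into k bins; the top edge is inclusive.
    lo, hi = min(vals), max(vals)
    w = (hi - lo) / k or 1.0
    return [min(int((v - lo) / w), k - 1) for v in vals]

def mic_like(xs, ys, max_bins=4):
    n, best = len(xs), 0.0
    for bx in range(2, max_bins + 1):
        for by in range(2, max_bins + 1):
            gx, gy = _bin(xs, bx), _bin(ys, by)
            pxy, px, py = {}, {}, {}            # joint and marginal counts
            for a, b in zip(gx, gy):
                pxy[(a, b)] = pxy.get((a, b), 0) + 1
                px[a] = px.get(a, 0) + 1
                py[b] = py.get(b, 0) + 1
            mi = sum(c / n * math.log(c * n / (px[a] * py[b]))
                     for (a, b), c in pxy.items())
            best = max(best, mi / math.log(min(bx, by)))
    return best

xs = list(range(16))
strong = mic_like(xs, xs)   # deterministic relationship -> score 1.0
weak = mic_like(xs, [9, 2, 13, 6, 0, 11, 4, 15, 8, 1, 12, 5, 10, 3, 14, 7])
```

In the paper's setting, `xs` would be one acoustic feature across utterances and `ys` the emotional-state variable; features would then be ranked by this score.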
4. Recognition Methodology

4.1. Baseline GMM Classifier. The Gaussian mixture model (GMM) based classifier is a state-of-the-art recognition method in speaker and language identification. In this paper we build the baseline classifier using a Gaussian mixture model, so that we may compare the baseline classifier against the online learning method.
A GMM is defined as a weighted sum of Gaussian densities:

$$p(\mathbf{X}_t \mid \lambda) = \sum_{i=1}^{M} a_i\, b_i(\mathbf{X}_t), \qquad (1)$$

where $\mathbf{X}_t$ is a $D$-dimensional random vector, $b_i(\mathbf{X}_t)$ is the $i$th Gaussian component, $t$ is the index of the utterance sample, $a_i$ is the mixture weight, and $M$ is the number of Gaussian mixture components. Each component is a $D$-dimensional Gaussian distribution with mean $\mathbf{U}_i$ and covariance $\Sigma_i$:

$$b_i(\mathbf{X}_t) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}\,(\mathbf{X}_t - \mathbf{U}_i)^{T}\, \Sigma_i^{-1}\, (\mathbf{X}_t - \mathbf{U}_i)\right). \qquad (2)$$

Note that

$$\sum_{i=1}^{M} a_i = 1. \qquad (3)$$

Emotion classification is done by maximizing the posterior probability:

$$\text{EmotionLabel} = \arg\max_{k}\; p(\mathbf{X}_t \mid \lambda_k). \qquad (4)$$

Expectation-Maximization (EM) is adopted for GMM parameter estimation [15].
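The classification rule of Eqs. (1)–(4) can be sketched as follows. This is an illustrative toy implementation assuming diagonal covariances and hand-set parameters, not the paper's EM-trained models; all names are ours:

```python
# Evaluate p(X | lambda_k) for each emotion's GMM and pick the arg max.
import math

def gmm_density(x, weights, means, variances):
    """Eq. (1): weighted sum of diagonal-covariance Gaussians b_i (Eq. (2))."""
    total = 0.0
    for a, mu, var in zip(weights, means, variances):
        norm, expo = 1.0, 0.0
        for j in range(len(x)):
            norm *= 1.0 / math.sqrt(2 * math.pi * var[j])
            expo += (x[j] - mu[j]) ** 2 / var[j]
        total += a * norm * math.exp(-0.5 * expo)
    return total

def classify(x, models):
    """Eq. (4): arg max over emotion labels k of p(x | lambda_k)."""
    return max(models, key=lambda k: gmm_density(x, *models[k]))

# Two toy 2-component models in a 1-D feature space;
# each model's mixture weights sum to 1, as required by Eq. (3).
models = {
    "positive": ([0.5, 0.5], [[0.0], [1.0]], [[1.0], [1.0]]),
    "negative": ([0.5, 0.5], [[4.0], [5.0]], [[1.0], [1.0]]),
}
label = classify([4.5], models)
```

In practice the parameters `(weights, means, variances)` of each $\lambda_k$ would be fitted with EM on that emotion's training utterances.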
Due to the different types of emotions among the datasets, we unify the emotional datasets by categorizing them into positive and negative regions along the valence and arousal dimensions, as shown in Figure 13. We may verify the ability of the emotion classifier by classifying the emotional utterances into the different regions of the valence-arousal space.
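A unification of this kind can be expressed as a simple lookup from categorical emotions to signs on the two dimensions. The category placements below follow the usual dimensional-emotion convention and are our assumption, not a table from the paper:

```python
# Map categorical emotions onto positive/negative regions of the
# valence-arousal plane (cf. Figure 13). Placements are illustrative.
QUADRANT = {
    "happiness": ("positive", "positive"),  # (valence, arousal)
    "anger":     ("negative", "positive"),
    "sadness":   ("negative", "negative"),
    "calm":      ("positive", "negative"),
}

def valence_of(emotion):
    return QUADRANT[emotion][0]

v = valence_of("anger")
```

With such a mapping, heterogeneous label sets from the three data sources can all be scored by the same binary valence and arousal classifiers.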
4.2. Online Learning Using AdaBoost. While the offline GMM classifier is trained using the EM algorithm, the online training algorithm using AdaBoost is introduced in this section. AdaBoost is a powerful ensemble learning algorithm [16]. The belief behind AdaBoost is that weak classifiers may be combined into a powerful classifier. Multiple classifiers trained on randomly selected datasets perform quite differently from each other on the same testing dataset; therefore we may reduce the misclassification rate with a proper decision fusion rule.
The AdaBoost algorithm consists of several iterations. In each iteration a new training set is selected for a new weak classifier, and a weight is assigned to that weak classifier. Based on the testing results of the new weak classifier, the weights of all the data samples are modified for the next iteration. At the final step, the assembled classifier is obtained by combining the multiple weak classifiers through a weighted voting rule.
Let us suppose the current training set is [17]

$$T = \{s_1, s_2, \ldots, s_N\}, \qquad (6)$$

where the weights of the samples are

$$W = \{w_1, w_2, \ldots, w_N\}, \qquad \sum_{i=1}^{N} w_i = 1. \qquad (7)$$

The error rate of the new weak classifier is

$$e = \sum_{i\,:\,c(s_i) \neq y_i} w_i, \qquad (8)$$

where $c(s_i)$ is the classification result and $y_i$ is the class label. The fusion weight assigned to each classifier is defined by the error rate:

$$\alpha = \ln\!\left(\frac{1 - e}{e}\right). \qquad (9)$$

At the beginning of the algorithm each sample is assigned an equal weight. During the iterations the sample weights are updated:

$$w_i \leftarrow \begin{cases} w_i \times \beta, & c(s_i) = y_i, \\ w_i, & c(s_i) \neq y_i, \end{cases} \qquad (10)$$

where, in the standard AdaBoost formulation, $\beta = e/(1 - e)$, so that correctly classified samples are down-weighted.
At the arrival of new data, assuming that we know the label information for each sample, the classifiers pretrained on the offline data are used as the initial weak classifiers. The AdaBoost algorithm is then applied to the new online data, and fusion weights are reassigned to the offline-trained classifiers.
In the first m initial iterations, the m pretrained classifiers are used as the weak classifiers and added to the final ensemble classifier, instead of training new weak classifiers from randomly selected data. After the m initial iterations, new weak classifiers are trained from the new online data and added to the final ensemble classifier in the AdaBoost algorithm.
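The initialization scheme just described can be sketched as follows: the first m boosting iterations reuse the m offline-pretrained classifiers as the weak learners (only re-weighting them on the new data), and fresh weak classifiers are trained online afterwards. This is a simplified sketch: the sample-weight updates between rounds are omitted for brevity, and all names and the toy threshold learners are illustrative:

```python
# Build the online ensemble: reuse pretrained weak classifiers for the
# first m rounds, then train new ones on the online data.
import math

def fusion_weight(clf, data, labels, weights):
    e = sum(w for x, y, w in zip(data, labels, weights) if clf(x) != y)
    e = min(max(e, 1e-6), 1 - 1e-6)      # guard against e in {0, 1}
    return math.log((1 - e) / e)         # Eq. (9)

def online_ensemble(pretrained, train_new, data, labels, rounds):
    ensemble = []
    weights = [1 / len(data)] * len(data)
    for t in range(rounds):
        if t < len(pretrained):          # first m rounds: offline models
            clf = pretrained[t]
        else:                            # later rounds: train on online data
            clf = train_new(data, labels, weights)
        ensemble.append((fusion_weight(clf, data, labels, weights), clf))
    return ensemble

# Toy usage: two pretrained threshold classifiers, one new online learner.
pre = [lambda x: int(x > 0.5), lambda x: int(x > 0.8)]
train = lambda d, l, w: (lambda x: int(x > 0.6))
ens = online_ensemble(pre, train, [0.2, 0.7, 0.9], [0, 1, 1], rounds=3)
```

A pretrained classifier that fits the new data poorly simply receives a small fusion weight, so the ensemble adapts without discarding the offline models.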
The major difference between online training and offline training is the data used for learning. Offline training uses a large amount of acted data, while online training uses a small amount of natural data. Offline training is independent of the online training and ready to use, while online training depends on the offline training and only retrains the existing model to fit specific purposes, such as tuning for a large number of speakers. The purpose of online training is to quickly adapt the existing offline model to a small amount of new data.
5. Experimental Results
In our experiment, the offline training is carried out on the acted basic emotion dataset. The speaker-independent dataset and the elicited practical emotion dataset are used for the online training and the online testing. Although the datasets used in online testing are preprocessed utterances rather than real-time online data, our experiments still provide a simulated online situation. We divide dataset 2 and dataset 3 into smaller sets (datasets 2a and 2b, and datasets 3a and 3b); datasets 2a and 3a are used for the simulated online initialization.
Speech utterances from the different sources are organized into several datasets, as shown in Table 2.
The online learning algorithm is verified both on the speaker-independent data and on the elicited data. The results are shown in Table 4. A large number of speakers brings difficulties in modeling emotional behavior, since emotion expression is highly dependent on individual habit and personality. By extending the offline-trained classifier to online data containing a large number of speakers, we improved the generality of our SER system. The elicited data is collected in a cognitive experiment that is closer to the real-world situation: during the cognitive task, emotional speech is induced. We observed that the different nature of the acted data and the speech induced during a cognitive task caused a significant decrease in the recognition rate. By using the online training technique, we may transfer the offline-trained SER system to the elicited data. Extending our SER system to different data sources may bring emotion recognition closer to real-world applications.
The major challenge in our online learning algorithm is how to combine the existing offline classifier with efficient adaptation of the model parameters to a small amount of new online data. We adopted the incremental learning idea and solved this problem by modifying the initial stage of the AdaBoost framework. One contribution of our online learning algorithm is that we may reuse the existing offline training data and make the online learning stage more efficient: we exploit a large amount of available offline training data and require only a small amount of data for online training, as shown in Table 3. The weight of each weak classifier is an important parameter; the proposed method may be further improved by using a fuzzy membership function to evaluate the confidence of the GMM classifiers and to reestimate the weight of each weak classifier.
6. Discussions
Acted data is often considered unsuitable for real-world applications. However, traditional research has focused on acted emotional speech, and many acted databases are available. How to transfer an SER system trained on acted data to new naturalistic data in the real world remains an unsolved challenge.
Many feature selection algorithms may be applied to an SER system. MIC is a newly proposed and powerful tool for exploring nonlinear relationships between variables.
AdaBoost is a popular algorithm that ensembles multiple weak classifiers into a strong classifier. By applying AdaBoost in the online setting, we train multiple weak classifiers on the newly arrived online data, with the offline pretrained classifiers used for initialization. We may explore other incremental learning algorithms in future work.

Table 3: Selected datasets for the online and offline experiments.
Dataset index | Data source | Number of utterances | Purpose of use
Dataset 1 | Acted speech | 12000 | Offline training
Dataset 2a | Speaker-independent speech | 1000 | Online training

Table 4: Recognition results of the online learning experiments.
Experiment | Offline training | Online training | Testing | Recognition rate (%)
Experiment 1 | Dataset 1 | N/A | Dataset 2b | 63.3
Experiment 2 | Dataset 1 | Dataset 2a | Dataset 2b | 75.6
Experiment 5 | Dataset 2a | N/A | Dataset 2b | 70.0
Experiment 3 | Dataset 1 | N/A | Dataset 3b | 61.2
Experiment 4 | Dataset 1 | Dataset 3a | Dataset 3b | 73.1
Experiment 6 | Dataset 3a | N/A | Dataset 3b | 68.5
Acknowledgments
This work was partially supported by the China Postdoctoral Science Foundation (no. 2012M520973), the National Natural Science Foundation of China (nos. 61231002, 61273266, and 51075068), and the Doctoral Fund of the Ministry of Education of China (no. 20110092130004). The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions.
References
[1] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette, "Fear-type emotion recognition for future audio-based surveillance systems," Speech Communication, vol. 50, no. 6, pp. 487–503, 2008.
[2] C. Huang, Y. Jin, Y. Zhao, Y. Yu, and L. Zhao, "Speech emotion recognition based on re-composition of two-class classifiers," in Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII '09), Amsterdam, The Netherlands, September 2009.
[3] K. R. Scherer, "Vocal communication of emotion: a review of research paradigms," Speech Communication, vol. 40, no. 1-2, pp. 227–256, 2003.
[4] A. Tawari and M. M. Trivedi, "Speech emotion analysis: exploring the role of context," IEEE Transactions on Multimedia, vol. 12, no. 6, pp. 502–509, 2010.
[5] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proceedings of the 9th European Conference on Speech Communication and Technology, pp. 1517–1520, Lisbon, Portugal, September 2005.
[6] D. Ververidis and C. Kotropoulos, "Automatic speech classification to five emotional states based on gender information," in Proceedings of the 12th European Signal Processing Conference, pp. 341–344, Vienna, Austria, 2004.
[7] S. Steidl, Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech, Department of Computer Science, Friedrich-Alexander-Universitaet Erlangen-Nuernberg, Berlin, Germany, 2008.
[8] M. Grimm, K. Kroschel, and S. Narayanan, "The Vera am Mittag German audio-visual emotional speech database," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '08), pp. 865–868, Hannover, Germany, June 2008.
[9] K. P. Truong, How Does Real Affect Affect Affect Recognition in Speech?, Center for Telematics and Information Technology, University of Twente, Enschede, The Netherlands, 2009.
[10] C. Huang, Y. Jin, Y. Zhao, Y. Yu, and L. Zhao, "Recognition of practical emotion from elicited speech," in Proceedings of the 1st International Conference on Information Science and Engineering (ICISE '09), pp. 639–642, Nanjing, China, December 2009.
[11] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar, "Learn++: an incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man, and Cybernetics C, vol. 31, no. 4, pp. 497–508, 2001.
[12] Q. L. Zhao, Y. H. Jiang, and M. Xu, "Incremental learning by heterogeneous Bagging ensemble," Lecture Notes in Computer Science, vol. 6441, no. 2, pp. 1–12, 2010.
[13] R. Xiao, J. Wang, and F. Zhang, "An approach to incremental SVM learning algorithm," in Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, pp. 268–273, 2000.
[14] D. N. Reshef, Y. A. Reshef, H. K. Finucane et al., "Detecting novel associations in large data sets," Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
[15] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[16] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119–139, 1997.
[17] Q. Zhao, The Research on Ensemble Pruning and Its Application in On-Line Machine Learning [Ph.D. thesis], National University of Defense Technology, Changsha, China, 2010.
Due to the different types of emotions among the datasetswe unify the emotional datasets by categorizing them intopositive and negative regions in the valence and arousal di-mensions as shown in Figure 13 We may verify the ability ofthe emotion classifier by classifying the emotional utterancesinto different regions in the valence and arousal space
42 Online LearningUsingAdaBoost While the offlineGMMclassifier is trained using EM algorithm the online trainingalgorithmusingAdabBoost will be introduced in this sectionAdaBoost is a powerful algorithm in assemble learning [16]The belief in this AdaBoost is that weak classifiers may becombined into a powerful classifier Multiple classifierstrained on randomly selected datasets perform quiet differ-ently from each other on the same testing dataset therefore
we may reduce the misclassification rate by a proper decisionfusion rule
AdaBoost algorithm consists of several iterations In eachiteration a new training set is selected for a new weak clas-sifier A weight is assigned to the new weak classifier Basedon the testing results of the newweak classifier the weights ofall the data samples are modified for the next iteration At thefinal step the assembled classifier is achieved by combinationof themultipleweak classified through aweighted voting rule
Let us suppose the current training set is [17]
119879 = 1199041 1199042 119904
119873 (6)
where the weights of the samples are
119882 = 1199081 1199082 119908
119873
119873
sum
119894=0
119908119894= 1
(7)
The error rate of the new weak classifier is
119890 = sum
119894119888(119904119894) = 119910119894
119908119894 (8)
where 119888(119904119894) is the classification result and 119910
119894is the class label
The fusion weight assigned to each classifier is defined by theerror rate
120572 = ln((1 minus 119890)119890
) (9)
At the beginning of the algorithm each sample is assignedby equal weight During the iteration the sample weights areupdated
119908119894+1
=
119908119894times 120573 119888 (119904
119894) = 119910119894
119908119894 119888 (119904
119894) = 119910119894
(10)
At the arrival of the new data assuming that we knowthe label information for each sample pretrained classifiersfrom the offline data are used as initial weak classifiers Ada-Boost algorithm is applied to the new online data and fusionweights are reassigned to the offline trained classifiers
At the first119898 initial iterations119898 pretrained classifiers areused as the weak classifiers and added to the final ensembleclassifier instead of training new weak classifiers from therandomly selected dataset After the119898 initial iterations newweak classifiers are trained from the new online data andadded to the final ensemble classifier in the AdaBoostalgorithm
The major difference between the online training and theoffline training is the data used for learning Offline train-ing uses large acted data while online training uses small andnatural data Offline training is independent of the onlinetraining and ready to use while the online training is depen-dent on the offline training and only retrains the existingmodel to fit specific purposes such as to tune on a largenumber of speakers The purpose of online training is toquickly adapt the existing offline model to a small amountof new data
8 Mathematical Problems in Engineering
5 Experimental Results
In our experiment the offline training is carried out on theacted basic emotion dataset The speaker-independent data-set and the elicited practical emotion dataset are used for theonline training and the online testing Although the datasetsused in online testing are preprocessed utterances rather thanreal time online data our experiments still provide a simu-lated online situation We divide dataset 2 and dataset 3 intosmaller sets dataset 2a and dataset 2b which are used as thesimulated online initialization
Speech utterances from different sources are organizedinto several datasets as shown in Table 2
The online learning algorithm is verified both on thespeaker-independent data and the elicited data The resultsare shown in Table 4 A large number of speakers bring dif-ficulties in modeling emotional behavior since emotionexpression is highly dependent on individual habit and per-sonality By extending the offline trained classifier to theonline data that contains a large number of speakers weimproved the generality of our SER system The elicited datais collected in a cognitive experiment that is more close tothe real world situation During the cognitive task emotionalspeech is induced We observed that the different naturebetween the acted data and the induced speech during acognitive task caused a significant decrease of the recognitionrate By using the online training technique we may transferthe offline trained SER system to the elicited data Extendingour SER system to different data sources may bring emotionrecognition closer to real world applications
The major challenge in our online learning algorithm ishow to combine the existing offline classifier and efficientlyadapt the model parameters to a small number of new onlinedata We adopted the incremental learning idea and solvedthis problem by modifying the initial stage in the AdaBoostframework One of the contributions of our online learningalgorithm is that we may reuse the existing offline trainingdata and make the online learning stage more efficiently Wemake use of a large amount of available offline training dataand only require a small amount of data for online trainingas shown in Table 3 The weight of each weak classifier is animportant parameter The proposed method may be furtherimproved by using fuzzy membership function to evaluatethe confidence in GMM classifiers and reestimate the weightof each weak classifier
6 Discussions
Acted data is often considered not suitable for real worldapplications However traditional researches have been fo-cused on the acted emotion speech andmany acted databasesare available How to transfer an SER system that trained onthe acted data to the new naturalistic data in real world is anunsolved challenge
Many feature selection algorithms may be applied to SERsystem MIC is a newly proposed and powerful algorithm forexploring nonlinear relationship between variables
AdaBoost is a popular algorithm to ensemble multipleweak classifiers to establish a strong classifier By applying
Table 3 Selected datasets for online and offline experiments
Datasets index Data source Number ofutterances Purpose of use
Dataset 1 Acted speech 12000 Offline training
Dataset 2a Speakerindependent 1000 Online training
result Experiment 1 Dataset 1 NA Dataset 2b 633Experiment 2 Dataset 1 Dataset 2a Dataset 2b 756Experiment 5 Dataset 2a NA Dataset 2b 700Experiment 3 Dataset 1 NA Dataset 3b 612Experiment 4 Dataset 1 Dataset 3a Dataset 3b 731Experiment 6 Dataset 3a NA Dataset 3b 685
AdaBoost in the online occasion we train multiple weakclassifiers based on the newly arrived online data The offlinepretrained classifiers are used for initialization We may ex-plore other incremental learning algorithms in the futurework
Acknowledgments
This work was partially supported by China Postdoctoral Sci-ence Foundation (no 2012M520973) National Nature Sci-ence Foundation (no 61231002 no 61273266 no 51075068)and Doctoral Fund of Ministry of Education of China (no20110092130004)The authors would like to thank the anony-mous reviewers for their valuable comments and helpfulsuggestions
References
In our experiments, the offline training is carried out on the acted basic emotion dataset. The speaker-independent dataset and the elicited practical emotion dataset are used for the online training and the online testing. Although the datasets used in online testing consist of preprocessed utterances rather than real-time online data, our experiments still provide a simulated online situation. We divide dataset 2 and dataset 3 into smaller sets, such as dataset 2a and dataset 2b, which are used for the simulated online initialization.
Speech utterances from the different sources are organized into several datasets, as shown in Table 2.
The online learning algorithm is verified both on the speaker-independent data and on the elicited data. The results are shown in Table 4. A large number of speakers brings difficulties in modeling emotional behavior, since emotion expression is highly dependent on individual habit and personality. By extending the offline-trained classifier to the online data that contains a large number of speakers, we improved the generality of our SER system. The elicited data is collected in a cognitive experiment, which is closer to the real-world situation. During the cognitive task, emotional speech is induced. We observed that the different nature of the acted data and of the speech induced during a cognitive task caused a significant decrease in the recognition rate. By using the online training technique, we may transfer the offline-trained SER system to the elicited data. Extending our SER system to different data sources may bring emotion recognition closer to real-world applications.
The major challenge in our online learning algorithm is how to build on the existing offline classifier and efficiently adapt the model parameters to a small number of new online samples. We adopted the incremental learning idea and solved this problem by modifying the initialization stage of the AdaBoost framework. One contribution of our online learning algorithm is that we may reuse the existing offline training data, which makes the online learning stage more efficient: we exploit a large amount of available offline training data and require only a small amount of data for online training, as shown in Table 3. The weight of each weak classifier is an important parameter. The proposed method may be further improved by using a fuzzy membership function to evaluate the confidence of the GMM classifiers and reestimate the weight of each weak classifier.
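The fuzzy-membership idea mentioned above can be sketched as follows: normalized class log-likelihoods serve as a soft confidence in [0, 1], and the average confidence assigned to the true class on a small online batch re-estimates a weak classifier's weight. The single-component diagonal Gaussian (a 1-component GMM) and the softmax normalization below are illustrative assumptions for brevity, not the paper's exact formulation.

```python
import numpy as np

def fit_gaussian(X):
    # Single diagonal-covariance Gaussian per class; a full GMM would
    # use several such components with mixture weights.
    mu = X.mean(axis=0)
    var = X.var(axis=0) + 1e-6   # small floor for numerical stability
    return mu, var

def log_likelihood(X, mu, var):
    # Per-sample log density under the diagonal Gaussian.
    return -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)

def soft_membership(X, models):
    # Softmax over class log-likelihoods: a fuzzy membership for each
    # class, with rows summing to one.
    ll = np.column_stack([log_likelihood(X, mu, var) for mu, var in models])
    ll -= ll.max(axis=1, keepdims=True)
    p = np.exp(ll)
    return p / p.sum(axis=1, keepdims=True)

def classifier_weight(X, y, models):
    # Re-estimate a weak classifier's weight as its mean membership
    # assigned to the true class on a small online batch.
    m = soft_membership(X, models)
    return m[np.arange(len(y)), y].mean()
```

A classifier whose class models fit the online batch well receives a weight near 1; a mismatched classifier is down-weighted smoothly rather than discarded outright.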
6 Discussions
Acted data is often considered unsuitable for real-world applications. However, traditional research has focused on acted emotional speech, and many acted databases are available. How to transfer an SER system trained on acted data to new naturalistic data in the real world remains an unsolved challenge.
Many feature selection algorithms may be applied to an SER system. MIC is a newly proposed and powerful algorithm for exploring nonlinear relationships between variables.
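As an illustration of MIC-based feature ranking, the sketch below searches grid resolutions (a, b) with a·b bounded by n^0.6 and returns the maximum normalized mutual information. It uses simple equal-frequency bins instead of the dynamic-programming grid optimization of Reshef et al. [14], so it only approximates the true MIC statistic; for real use, a dedicated implementation should be preferred.

```python
import numpy as np

def equifreq_bins(x, k):
    # Assign each sample to one of k roughly equal-frequency bins.
    ranks = np.argsort(np.argsort(x))
    return (ranks * k) // len(x)

def mutual_info(ix, iy, a, b):
    # Mutual information (in bits) of two discrete label vectors.
    n = len(ix)
    joint = np.zeros((a, b))
    for i, j in zip(ix, iy):
        joint[i, j] += 1
    joint /= n
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    mi = 0.0
    for i in range(a):
        for j in range(b):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (px[i] * py[j]))
    return mi

def approx_mic(x, y, alpha=0.6):
    # Search all grid shapes (a, b) with a*b <= n**alpha and return the
    # maximum mutual information normalized by log2(min(a, b)).
    n = len(x)
    limit = int(n ** alpha)
    best = 0.0
    for a in range(2, limit + 1):
        for b in range(2, limit + 1):
            if a * b > limit:
                break
            mi = mutual_info(equifreq_bins(x, a), equifreq_bins(y, b), a, b)
            best = max(best, mi / np.log2(min(a, b)))
    return best
```

Ranking acoustic features by their approximate MIC against the emotion labels (or the valence/arousal dimensions) then reduces to sorting the scores, which lie in [0, 1], with values near 1 indicating a strong functional relationship.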
AdaBoost is a popular algorithm that ensembles multiple weak classifiers into a strong classifier. By applying AdaBoost in the online setting, we train multiple weak classifiers on the newly arrived online data; the offline pretrained classifiers are used for initialization. We may explore other incremental learning algorithms in future work.

Table 3: Selected datasets for the online and offline experiments.

Dataset index   Data source           Number of utterances   Purpose of use
Dataset 1       Acted speech          12000                  Offline training
Dataset 2a      Speaker independent   1000                   Online training

Table 4: Recognition results.

Experiment     Offline training   Online training   Testing data   Recognition result (%)
Experiment 1   Dataset 1          N/A               Dataset 2b     63.3
Experiment 2   Dataset 1          Dataset 2a        Dataset 2b     75.6
Experiment 5   Dataset 2a         N/A               Dataset 2b     70.0
Experiment 3   Dataset 1          N/A               Dataset 3b     61.2
Experiment 4   Dataset 1          Dataset 3a        Dataset 3b     73.1
Experiment 6   Dataset 3a         N/A               Dataset 3b     68.5
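The online boosting scheme described in this section can be sketched as follows: the offline-trained classifiers are absorbed first, which shapes the sample weights before any new weak classifiers are trained on the online batch. Decision stumps stand in here for the paper's GMM weak classifiers, the offline classifiers are represented by their precomputed predictions on the online batch, and all names are illustrative; this is a minimal binary-label (e.g., valence) sketch, not the authors' exact algorithm.

```python
import numpy as np

def stump_train(X, y, w):
    # Weighted decision stump: best (feature, threshold, polarity).
    best = (None, None, 1, np.inf)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] >= thr, pol, -pol)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (f, thr, pol, err)
    return best[:3]

def stump_predict(X, stump):
    f, thr, pol = stump
    return np.where(X[:, f] >= thr, pol, -pol)

def online_adaboost(X, y, offline_preds, rounds=5):
    # Initialize the boosting pool with the offline classifiers'
    # predictions on the online batch, then add stumps trained on the
    # reweighted online data; returns ensemble predictions on X.
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []                      # (alpha, predictions) pairs

    def absorb(pred):
        # Standard AdaBoost update: alpha from weighted error, then
        # exponential reweighting of the samples.
        eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, pred))
        w_new = w * np.exp(-alpha * y * pred)
        return w_new / w_new.sum()

    for pred in offline_preds:         # reuse offline-trained classifiers
        w = absorb(pred)
    for _ in range(rounds):            # new weak classifiers, online data
        stump = stump_train(X, y, w)
        w = absorb(stump_predict(X, stump))
    score = sum(a * p for a, p in ensemble)
    return np.sign(score)
```

Because the offline classifiers are absorbed before any online training, samples they misclassify carry extra weight, so the newly trained weak classifiers concentrate on exactly the regions where the acted-data model fails on the new data.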
Acknowledgments
This work was partially supported by the China Postdoctoral Science Foundation (no. 2012M520973), the National Natural Science Foundation of China (nos. 61231002, 61273266, and 51075068), and the Doctoral Fund of the Ministry of Education of China (no. 20110092130004). The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions.
References
[1] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette, "Fear-type emotion recognition for future audio-based surveillance systems," Speech Communication, vol. 50, no. 6, pp. 487-503, 2008.
[2] C. Huang, Y. Jin, Y. Zhao, Y. Yu, and L. Zhao, "Speech emotion recognition based on re-composition of two-class classifiers," in Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII '09), Amsterdam, The Netherlands, September 2009.
[3] K. R. Scherer, "Vocal communication of emotion: a review of research paradigms," Speech Communication, vol. 40, no. 1-2, pp. 227-256, 2003.
[4] A. Tawari and M. M. Trivedi, "Speech emotion analysis: exploring the role of context," IEEE Transactions on Multimedia, vol. 12, no. 6, pp. 502-509, 2010.
[5] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proceedings of the 9th European Conference on Speech Communication and Technology, pp. 1517-1520, Lisbon, Portugal, September 2005.
[6] D. Ververidis and C. Kotropoulos, "Automatic speech classification to five emotional states based on gender information," in Proceedings of the 12th European Signal Processing Conference, pp. 341-344, Vienna, Austria, 2004.
[7] S. Steidl, Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech, Department of Computer Science, Friedrich-Alexander-Universitaet Erlangen-Nuernberg, Berlin, Germany, 2008.
[8] M. Grimm, K. Kroschel, and S. Narayanan, "The Vera am Mittag German audio-visual emotional speech database," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '08), pp. 865-868, Hannover, Germany, June 2008.
[9] K. P. Truong, How Does Real Affect Affect Affect Recognition in Speech, Center for Telematics and Information Technology, University of Twente, Enschede, The Netherlands, 2009.
[10] C. Huang, Y. Jin, Y. Zhao, Y. Yu, and L. Zhao, "Recognition of practical emotion from elicited speech," in Proceedings of the 1st International Conference on Information Science and Engineering (ICISE '09), pp. 639-642, Nanjing, China, December 2009.
[11] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar, "Learn++: an incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man, and Cybernetics C, vol. 31, no. 4, pp. 497-508, 2001.
[12] Q. L. Zhao, Y. H. Jiang, and M. Xu, "Incremental learning by heterogeneous Bagging ensemble," Lecture Notes in Computer Science, vol. 6441, no. 2, pp. 1-12, 2010.
[13] R. Xiao, J. Wang, and F. Zhang, "An approach to incremental SVM learning algorithm," in Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, pp. 268-273, 2000.
[14] D. N. Reshef, Y. A. Reshef, H. K. Finucane, et al., "Detecting novel associations in large data sets," Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
[15] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.
[16] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, part 2, pp. 119-139, 1997.
[17] Q. Zhao, The Research on Ensemble Pruning and Its Application in On-Line Machine Learning [Ph.D. thesis], National University of Defense Technology, Changsha, China, 2010.