490 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 6, OCTOBER 2010
Feature Analysis and Evaluation for Automatic Emotion Identification in Speech
Iker Luengo, Eva Navas, and Inmaculada Hernáez
Abstract: The definition of parameters is a crucial step in the development of a system for identifying emotions in speech. Although there is no agreement on which are the best features for this task, it is generally accepted that prosody carries most of the emotional information. Most works in the field use some kind of prosodic features, often in combination with spectral and voice quality parametrizations. Nevertheless, no systematic study has been done comparing these features. This paper presents the analysis of the characteristics of features derived from prosody, spectral envelope, and voice quality, as well as their capability to discriminate emotions. In addition, early fusion and late fusion techniques for combining different information sources are evaluated. The results of this analysis are validated with experimental automatic emotion identification tests. Results suggest that spectral envelope features outperform the prosodic ones. Even when different parametrizations are combined, the late fusion of long-term spectral statistics with short-term spectral envelope parameters provides an accuracy comparable to that obtained when all parametrizations are combined.
Index Terms: Emotion identification, information fusion, parametrization.
I. INTRODUCTION
FEATURES extracted from the speech signal have a great
effect on the reliability of an emotion identification system. Depending on these features, the system will have a
certain capability to distinguish emotions and will be able to
deal with speakers not seen during the training. Many works
in the field of emotion recognition aim to find the most
appropriate parametrization, yet there is no clear agreement on
which feature set is best.
One of the major problems in determining the best features for
emotion identification is that there is no solid theoretical basis
relating the characteristics of the voice with the emotional state
of the speaker [1]. That is why most of the works in this field
are based on features obtained from direct comparison of speech
signals portraying different emotions. The comparison makes it possible to estimate the acoustic differences among them, identifying
features that could be useful for emotion identification.
Manuscript received December 11, 2009; revised March 22, 2010; accepted May 06, 2010. Date of current version September 15, 2010. This work was supported in part by the Spanish Government under the BUCEADOR project (TEC2009-14094-C04-02) and in part by the AVIVAVOZ project (TEC2006-13694-C03-02). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hamid K. Aghajan.
The authors are with the Department of Electronics and Telecommunications, University of the Basque Country, Bilbao 48013, Spain (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2010.2051872
The only widely known theory that describes the physiolog-
ical changes caused by an emotional state [2], [3] takes the Dar-
winian theory [4] as a reference, considering emotions as a re-
sult of evolutionary needs. According to this theory, each emo-
tion induces some physiological and psychological changes in
order to prepare us to do something, e.g., fear prepares us to
run from danger. These changes have a certain influence on the
speech characteristics, mainly on those related to intonation, in-
tensity, and speaking rate, i.e., on prosody.
For many years, automatic identification systems have used
prosodic features almost exclusively, mainly because of the
aforementioned theory. The relation between emotion and
prosody is reflected in many works in the literature [5]-[9].
A detailed summary of these works and their conclusions is
presented in [10]. Prosodic features are mostly used in the
form of long-term statistics, usually estimated over the whole
utterance. The most common ones are simple statistics such as
mean, variance, minimum, maximum, or range. But it is also
possible to use more complex representations in an attempt to
retain more emotional information.
The use of prosodic features produces a recurrent confu-
sion pattern among emotions. They seem to be able to discrimi-
nate high arousal emotions (anger, happiness) from low arousal
ones (sadness, boredom) easily. But the confusion level for emotions of the same arousal level is very large [1]. However, hu-
mans are able to distinguish anger from happiness and sadness
from boredom accurately. This reinforces the idea that there are
some other voice characteristics useful for emotion identifica-
tion. Some works in the literature use spectral measurements or
voice quality features, showing the importance of these param-
eters for emotion identification.
The vocal tract is also influenced by the emotional state, and
so are the spectral characteristics of the voice. The relationship
between emotion and spectral characteristics is empirically sup-
ported by several works [7], [11]-[14]. Furthermore, taking into
account that prosody is determined mainly by the vocal fold activity and that the spectrum envelope is mostly influenced by
the vocal tract, it is reasonable to assume that they are not very
correlated. Therefore, both types of features provide different
information about the emotional state of the speaker, and the
combination of prosodic and spectral parametrizations may be
interesting. In fact, most of the works using spectral measures
use them together with prosodic features, significantly reducing
the error rate with respect to using only one of the parametriza-
tions [12], [13], [15]-[18].
The effect that emotions have on the vocal folds causes
changes in the voice quality, too [19], [20], so in the last couple
of years, features related to voice quality have also been used
1520-9210/$26.00 2010 IEEE
to help emotion identification [19], [21]. Nevertheless, few
authors use them, due to the difficulty of extracting the glottal
signal from the speech [20]. The estimation of the glottal signal
can only be done with acceptable accuracy on very stable
speech segments, which makes it difficult to extract voice
quality features by automatic means.
Finally, it is also possible to use linguistic features such as the occurrence of sighs or certain emotive words [17], [18], [22].
However, only a few authors use them. On the one hand, because
in most cases an automatic speech recognition (ASR) system
is needed, which increases the complexity of the system. On
the other hand, because this kind of feature only makes sense
with spontaneous speech, and most of the works in the field
have been developed using read speech databases, often with
the same texts for all emotions.
Best results are obtained when parametrizations of different
nature are combined in order to increase the available informa-
tion. When all features have the same temporal structure, this
combination can be done by simply concatenating the feature
vectors. The problem arises when the temporal structure is different for the considered parametrizations. For example,
prosodic information is usually given in the form of long-term
statistics, while traditional spectral envelope parametrizations
like Mel frequency cepstral coefficients (MFCC) or linear
prediction cepstral coefficients (LPCC) are extracted for each
frame. A simple solution applied in many works is to calculate
long-term statistics of spectral features, making it possible
to concatenate them to the prosodic feature vector [15]-[17].
Another alternative is to do the opposite. Instead of calculating
statistics of spectral features, frame-wise F0 and intensity
values are used and appended directly to the MFCC or LPCC
vectors [14], [23]. In [15], both strategies are compared, obtaining very similar results in both cases. A more elaborate
solution uses a late fusion of classifiers [24], [25], i.e., training
a different classifier with each parametrization and combining
the results of these classifiers, as in [12] and [18].
A review of the literature shows disagreement on which fea-
tures are best for the identification of emotions in speech. It is
widely accepted that prosodic features carry most of the emo-
tional information, and that the combination of parametriza-
tions of different nature improves the results. But no systematic
study has been done comparing the effectiveness of each kind
of parametrization or their combinations. This work attempts to
fill this gap by analyzing the characteristics and appropriateness of
different sets of features used for the recognition of emotions. It
is focused on acoustic parameters that can be extracted directly
from the speech signal without using ASR systems: prosody,
spectral envelope, and voice quality. The effectiveness of dif-
ferent combinations of features is also studied, and the different
combination approaches are compared to see which one is more
appropriate: the simple feature concatenation (early fusion) or
the classifier result combination (late fusion).
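The late-fusion idea discussed above can be sketched as a weighted combination of per-classifier class posteriors. The function name, weights, and example probabilities below are illustrative assumptions, not the combination rule of any specific cited work:

```python
import numpy as np

def late_fusion(posteriors, weights=None):
    # Weighted sum of per-classifier posterior vectors, renormalized
    # so the fused scores again sum to one.
    P = np.asarray(posteriors, dtype=float)
    if weights is None:
        weights = np.ones(len(P))
    fused = np.average(P, axis=0, weights=weights)
    return fused / fused.sum()

# Hypothetical posteriors over [anger, happiness, sadness]: the
# prosodic classifier confuses anger and happiness, the spectral
# one does not, so the fused decision follows the spectral cue.
prosodic = np.array([0.45, 0.45, 0.10])
spectral = np.array([0.70, 0.20, 0.10])
fused = late_fusion([prosodic, spectral])
```

Unequal weights would let one parametrization dominate; uniform weights are the neutral default when no development data is available to tune them.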
Section II describes the emotional database
used in this work, as well as the acoustic processing carried
out to extract the data needed for the calculation of the fea-
tures. Section III describes the considered features, whereas
Section IV describes the analysis of these features. Some empirical experiments are carried out and described in Section V,
in order to confirm the results obtained in the analysis. Finally,
the results are discussed and some conclusions are drawn.
II. WORKING DATABASE
A. Description of the Database
Both the analysis of the features and the emotion identifi-
cation experiments described in this paper were carried out
using the Berlin emotional speech database [26]. This database
contains 535 recordings uttered by five male and five female
speakers, simulating seven emotional states: anger, boredom,
disgust, fear, happiness, neutral, and sadness. Although the
original recordings are available at 16 kHz and 16 bits per
sample, they were subsampled to 8 kHz for this work.
The database corpus contains ten sentences of neutral se-
mantic content that all speakers recorded portraying the seven
emotional styles. As described in [26], in order to get emotions
as natural as possible, the recordings were evaluated in a percep-
tual test where 20 listeners had to identify the intended emotion
and score its naturalness. All recordings with an identification rate lower than 80% and an overall naturalness score under 60%
were discarded, leaving the 535 sentences available in the final
database. Due to the perceptual selection process, some emo-
tions are more represented than others.
This database was chosen because it presents certain char-
acteristics that were of interest. It is a multispeaker database,
making it possible to perform speaker-independent tests. Fur-
thermore, the perceptual selection guarantees that the portrayed
emotions are highly natural. In addition, much research on
identification of emotions has been done using this database,
which makes it possible to compare the results with other pub-
lished results.
B. Processing of the Recordings
The recordings were processed in order to get the character-
istic curves and labelings needed for the feature extraction. The
processing included detection of the vocal activity, estimation of
the glottal source signal and of the intonation curve, voiced-un-
voiced labeling, pitch-period marking, and vowel detection. All
this processing was performed automatically without manual
corrections.
1) Vocal Activity Detection (VAD): Silences and pauses were
detected using the LTSE VAD algorithm described in [27]. The
labels obtained were further processed to discard silence labels
shorter than 100 ms that usually appear as a result of detection
errors, and which have no linguistic meaning.
2) Glottal Source Estimation: The glottal signal is needed
in order to compute various features related to voice quality.
However, estimating the glottal flow from the speech signal is
inherently difficult, and the results are acceptable only when it
is applied to very stationary speech segments. Iterative
adaptive inverse filtering (IAIF) [28] was used to perform the
inverse filtering and recover the glottal flow, since it is a fully
automatic method that gives acceptable results with stationary
signals. Nevertheless, it is assumed that the estimated glottal
signal is inaccurate for nonstationary segments.
3) Intonation Curve and Voiced-Unvoiced Labeling: The intonation curve was computed with the cepstrum dynamic pro-
gramming (CDP) algorithm described in [29], which uses the
cepstrum transform and dynamic programming. This algorithm
provides the voiced-unvoiced (VUV) labeling, too.
4) Pitch Period Marking: In addition to glottal flow estima-
tion, pitch synchronous marks are also needed for the extraction
of voice quality features. Once the intonation curves, the VUV
labeling and the glottal flow were extracted, pitch marks were
placed in the negative peaks of the inverse filtering residual,
which were located by simple peak picking using the estimated
F0 value as a clue to detect the next peak.
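A minimal sketch of this kind of F0-guided negative-peak picking is given below. The function name, search bounds, and synthetic residual are illustrative assumptions; a production pitch marker would need additional safeguards (octave-error checks, voicing gating):

```python
import numpy as np

def pitch_marks(residual, f0, sr=8000):
    # Place marks at negative peaks of the inverse-filter residual,
    # using the estimated F0 to bound where the next mark can appear.
    period = int(sr / f0)
    marks = [int(np.argmin(residual[:2 * period]))]
    while marks[-1] + period // 2 < len(residual) - 1:
        lo = marks[-1] + period // 2
        hi = min(marks[-1] + 2 * period, len(residual))
        if hi - lo < 2:
            break
        marks.append(lo + int(np.argmin(residual[lo:hi])))
    return np.array(marks)

# Synthetic residual: a negative spike every 80 samples,
# i.e. a 100-Hz voice at an 8-kHz sampling rate.
r = np.zeros(800)
r[40::80] = -1.0
marks = pitch_marks(r, 100.0)
```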
5) Vowel Detection: Vowels are some of the most stable
segments in a speech signal, making them very appropriate
for the computation of certain features, such as those derived
from the glottal source estimation. Furthermore, vowels are
always voiced and are strongly affected by intonation patterns,
providing a consistent point to calculate intonation-related fea-
tures such as risings or fallings. The vowels in the database
were automatically detected using the algorithm described in
[30], which is based on a phoneme recognizer working with
HMM models of clustered phonemes. Phoneme clustering was automatically performed according to their acoustical simi-
larity, keeping the vowels in their own cluster. This provides a
consistent and very robust set of models, capable of detecting
80% of vowel boundaries with less than 20-ms error.
III. FEATURE DEFINITION
A. Segmental Features
Segmental features are calculated once for every frame, al-
lowing the analysis of their temporal evolution. A 25-ms Ham-
ming windowing with 60% overlap is used, i.e., a new feature
vector is extracted every 10 ms.
1) Spectral Envelope: Reference [11] shows that log-filter
power coefficients (LFPC) outperform traditional MFCC or
LPCC parameters in emotion identification, so they were
chosen for the frame-wise spectral characterization. LFPC
features represent the spectral envelope in terms of the energy
in Mel-scaled frequency bands.
Eighteen LFPC coefficients were estimated for each frame,
together with their first and second derivatives, giving a total of
54 spectral features at the segment level. In order to minimize
microphone distance effects, LFPC features were normalized to
the mean value of the whole utterance.
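A rough sketch of an LFPC-style front end is shown below: log energy in Mel-spaced triangular bands of a windowed frame's power spectrum. The band-edge placement, triangular filter shape, and function names are assumptions for illustration; they follow common filterbank practice rather than the exact recipe of [11]:

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def lfpc(frame, sr=8000, n_bands=18, eps=1e-10):
    # Power spectrum of a Hamming-windowed frame, then log energy in
    # n_bands triangular filters spaced uniformly on the Mel scale.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    edges = inv_mel(np.linspace(0.0, mel(sr / 2), n_bands + 2))
    out = np.empty(n_bands)
    for b in range(n_bands):
        lo, ce, hi = edges[b], edges[b + 1], edges[b + 2]
        w = np.clip(np.minimum((freqs - lo) / (ce - lo + eps),
                               (hi - freqs) / (hi - ce + eps)), 0.0, None)
        out[b] = np.log((w * spec).sum() + eps)
    return out

# One 25-ms frame (200 samples at 8 kHz) of a 440-Hz tone.
t = np.arange(200) / 8000.0
coeffs = lfpc(np.sin(2 * np.pi * 440.0 * t))
```

Utterance-level mean normalization, as described above, would then subtract the per-coefficient mean over all frames of the utterance.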
2) Prosody Primitives: We refer to the intonation and intensity curves as prosody primitives, as the (suprasegmental)
prosodic features are estimated from these. A new sample of
intensity and F0 was estimated for every frame, together with their
first and second derivatives, providing six features per frame.
Intensity curves were normalized to the mean intensity value of
the whole utterance in order to minimize the effect of the mi-
crophone distance.
As F0 values are not defined for unvoiced frames, two feature
vector streams were extracted from each recording, the first one
corresponding to voiced frames (characterized by both intona-
tion and intensity features) and the second one corresponding to
unvoiced frames (characterized only by intensity values). Both
streams were treated as different parametrizations during this work.
TABLE I: SUPRASEGMENTAL PARAMETERS AND THE CORRESPONDING SYMBOLS USED THROUGHOUT THIS DOCUMENT
B. Suprasegmental Features
Suprasegmental features represent long-term information, estimated over time intervals longer than a frame. In this work, this
interval has been defined as the time between two consecutive
pauses. It is expected that speech pauses correspond roughly to
linguistic stops in the message, so this approach is very sim-
ilar to using a whole utterance as integration time. However, as
pauses can be detected automatically with the VAD algorithm,
this method makes it possible to adapt the parametrization algo-
rithm to work with direct audio input if necessary.
1) Spectral Statistics: For the suprasegmental characteriza-
tion of the spectrum, long-term statistics of LFPC were calcu-
lated. For each of the 18 LFPC coefficients and their first and
second derivatives, six statistics were computed, as shown in
Table I. In all, 324 suprasegmental spectral features were extracted.
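The statistic extraction can be sketched as follows. Table I lists the exact six statistics; here mean, variance, minimum, maximum, and range plus median as an assumed sixth are used for illustration, and the function name and random data are hypothetical:

```python
import numpy as np

def supra_stats(tracks):
    # tracks: (n_frames, n_feats) array, e.g. the 54 per-frame
    # LFPC + delta + delta-delta coefficients.
    # Six long-term statistics are computed per feature track.
    stats = [tracks.mean(axis=0), tracks.var(axis=0),
             tracks.min(axis=0), tracks.max(axis=0),
             np.ptp(tracks, axis=0), np.median(tracks, axis=0)]
    return np.concatenate(stats)

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 54))
vec = supra_stats(frames)  # 6 statistics x 54 tracks = 324 features
```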
2) Prosody: Prosodic features are divided into five cate-
gories, according to the nature of the information they represent.
Altogether 54 prosodic features were defined.
Intonation Statistics: Intonation features were calculated
as the same six statistics presented in Table I, applied to the
F0 values and their first and second derivatives, giving 18 param-
eters as a result. Only frames detected as voiced were used for the computation of the statistics.
Intensity Statistics: Following the same approach as with
the intonation, intensity features were composed of the same
statistics applied to the intensity values and their first and second
derivatives, providing 18 new parameters.
Speech Rate Features: They were defined as the mean and
variance of the vowel duration, as shown in Table I.
Regression Features: In each detected vowel, a linear re-
gression was estimated for the F0 and intensity values. Then
the absolute value of the regression line slopes was calculated,
and the six features listed in Table I were extracted. With these
features, we try to combine long integration intervals (with a
length comparable to a sentence) and short ones (with an approximate duration of a phoneme), as done by other authors
[31].
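A per-vowel regression slope of this kind can be sketched with an ordinary least-squares line fit; the function name, frame shift, and example track are illustrative assumptions:

```python
import numpy as np

def vowel_slope(track, frame_shift=0.010):
    # Absolute slope (units per second) of a straight line fitted to
    # a vowel's F0 or intensity samples, one sample per 10-ms frame.
    t = np.arange(len(track)) * frame_shift
    slope, _intercept = np.polyfit(t, track, 1)
    return abs(slope)

# A vowel whose F0 rises linearly at 50 Hz/s over ten 10-ms frames.
f0_vowel = 100.0 + 50.0 * (np.arange(10) * 0.010)
rise = vowel_slope(f0_vowel)
```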
Sentence-End Features: Prosodic values at the end of the
sentences may provide additional information about the emo-
tion. For example, an increase in F0 with respect to the rest of
the sentence can represent surprise, whereas a lower intensity
is usually related to sadness. Therefore, ten more features were
extracted from the last vowel detected in the integration time,
as shown in Table I. The normalized values are defined as the
corresponding non-normalized ones divided by the mean value
over all the vowels detected in the integration segment.
3) Voice Quality Features: Features related to voice quality are extracted from the glottal source signal and from the pitch
period marks. They were computed only for vocalic segments,
in order to consider only segments with reliable glottal source es-
timation.
The features specified in Table I were calculated for each
vowel, and the values corresponding to vowels in the same inte-
gration segment were averaged in order to obtain a single feature
vector for the whole integration time.
Jitter and shimmer were estimated using the five-point period
perturbation quotient (ppq5) and five-point amplitude perturba-
tion quotient (apq5) values as defined in Praat (http://www.praat.org). The normalized
amplitude quotient (NAQ) [32] is estimated for every glottal
pulse, so the NAQ value for a vowel was calculated by aver-
aging the NAQ obtained all along that vowel. Similarly, spec-
tral tilt and spectral balance [33] are calculated for every frame,
and the value for a vowel was calculated averaging the values
all along the vowel.
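Under one common reading of Praat's ppq5 definition, the quotient is the mean absolute deviation of each pitch period from its local five-point average, normalized by the mean period; apq5 applies the same formula to peak amplitudes. The sketch below follows that reading, with hypothetical names and synthetic period sequences:

```python
import numpy as np

def ppq5(periods):
    # Five-point period perturbation quotient: mean absolute deviation
    # of each period from the average of the 5-period window centered
    # on it, divided by the mean period (apq5 is the same formula
    # applied to peak amplitudes instead of periods).
    T = np.asarray(periods, dtype=float)
    local = np.array([T[i - 2:i + 3].mean() for i in range(2, len(T) - 2)])
    return float(np.mean(np.abs(T[2:-2] - local)) / T.mean())

steady = [0.010] * 20            # perfectly regular 100-Hz voicing
jittery = [0.010, 0.011] * 10    # alternating periods (strong jitter)
```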
IV. FEATURE ANALYSIS AND SELECTION
A. Inter-Emotion and Intra-Emotion Dispersion
The capability of a parameter set to retain the emotional
characteristics and to avoid the remaining attributes of the
speech can be measured in terms of inter-emotion dispersion
and intra-emotion dispersion. A large inter-emotion dispersion
TABLE II: DISCRIMINABILITY VALUES OF THE FEATURE FAMILIES MEASURED WITH THE J CRITERION
would mean that the features take very different values for
each emotion, separating the distributions and making the
classification task easier. A reduced intra-emotion dispersion reflects the consistency of the features within a given emotion.
The relation between intra-emotion and inter-emotion disper-
sions provides a measure of the overlapping of the class distributions. This relation can be estimated using the following criterion [34]:

$$J = \frac{\operatorname{tr}(S_b)}{\operatorname{tr}(S_w)} \qquad (1)$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix and $S_w$ and $S_b$ are the intra-class and inter-class dispersion matrices, respectively:

$$S_w = \frac{1}{N}\sum_{i=1}^{K}\sum_{x \in C_i}(x - \mu_i)(x - \mu_i)^{T} \qquad (2)$$

$$S_b = \frac{1}{N}\sum_{i=1}^{K} N_i (\mu_i - \mu)(\mu_i - \mu)^{T} \qquad (3)$$

being $N$ the total number of training samples, $K$ the number of emotions, $C_i$ the set of $N_i$ samples of class $i$, $\mu_i$ the centroid of the class, and $\mu$ the global mean:

$$\mu_i = \frac{1}{N_i}\sum_{x \in C_i} x \qquad (4)$$

$$\mu = \frac{1}{N}\sum_{i=1}^{K}\sum_{x \in C_i} x \qquad (5)$$

The $J$ criterion is often used in discriminant analysis, and it is a generalization of the well-known Fisher criterion (6) for the multiclass and multidimensional case [35]:

$$F = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} \qquad (6)$$
J values were computed for each feature family as a first es-
timation of their capability to discriminate emotions. The results
are presented in Table II.
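Reading criterion (1) as the ratio of the traces of the between- and within-class dispersion matrices, a minimal numpy sketch is shown below; the function name and the toy two-class data are assumptions for illustration:

```python
import numpy as np

def j_criterion(X, y):
    # J = tr(S_b) / tr(S_w); the 1/N normalization of (2)-(3)
    # cancels in the ratio, so it is omitted here.
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    return np.trace(Sb) / np.trace(Sw)

# Well-separated classes should yield a larger J than overlapping ones.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X_far = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
                   rng.normal(5.0, 1.0, (100, 2))])
X_near = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
                    rng.normal(0.5, 1.0, (100, 2))])
```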
If we focus on suprasegmental features, it is observed that
prosody is less discriminative than spectral envelope statistics. This result is especially significant, as prosodic features are by
Fig. 1. (a) Scatter plot of suprasegmental prosodic, (b) suprasegmental spectral, (c) segmental prosodic, and (d) segmental spectral features, projected over the two most discriminant directions by LDA. An: anger, Bo: boredom, Di: disgust, Fe: fear, Ha: happiness, Ne: neutral, Sa: sadness.
far the most used ones in the literature for emotion identification.
The difference shown in Table II could be partly justified by the
fact that 324 statistics are used in the spectral parametrization
and only 54 in the prosodic one. But when the calculation is re-
peated using only the best 54 spectral features (see the feature
selection procedure in Section IV-C), they still seem to discrim-
inate emotions better, yielding a higher J value than the prosodic set.
To confirm this result and to show the discriminability of each
set of features visually, an LDA transformation was applied to
both spectral and prosodic features. The two most discriminant
directions are represented graphically in Fig. 1. As can be seen,
emotions are less overlapped when using spectral features than when using prosodic ones.
It is remarkable that in both cases, the most discriminant
direction (horizontal axis) seems to be related to the activa-
tion level, shifting high activation emotions (anger, happiness,
and fear) to one side and low activation emotions (sadness and
boredom) to the other.
In Fig. 1(a), we can already identify the confusions usually
described in the literature when using prosodic features, for ex-
ample anger and happiness. Disgust is also frequently reported
to be hard to detect by prosodic features, and this is reflected
in the scatter plot, too. The distribution of disgust samples has a
large dispersion and is strongly overlapped with other emotions,
especially with the neutral style. On the contrary, it is clearly separated from the other emotions when using spectral features.
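Two-direction projections of the kind shown in Fig. 1 can be reproduced in outline by a plain LDA built from the dispersion matrices; the small ridge term, function name, and toy data below are illustrative assumptions:

```python
import numpy as np

def lda_project(X, y, k=2):
    # Eigenvectors of Sw^{-1} Sb with the k largest eigenvalues give
    # the most discriminant directions; a tiny ridge keeps Sw invertible.
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = 1e-6 * np.eye(d)
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]
    return X @ vecs[:, order[:k]].real

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, (50, 5)) for m in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 50)
Z = lda_project(X, y)   # 2-D scatter coordinates, as in Fig. 1
```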
Going back to the discriminability values in Table II, it can
be seen that features related to voice quality seem to provide
very little information, at least the ones considered in this work.
Nevertheless, when these features are concatenated with the
prosodic ones, they contribute to increasing the J value from 6.47
to 7.14. This is not so surprising if we take into account that fea-
tures without discrimination power by themselves may be useful
when combined with others [36], as in this case. The concatena-
tion of all suprasegmental features provides the highest discrim-
ination, suggesting that the information captured by the spec-
trum, the prosody, and the voice quality are complementary.
Regarding segmental features, it can be seen that their capability to separate emotions is almost nonexistent. This is clearly
observed in Fig. 1(c) and (d), where the two most discriminant
features given by LDA are represented for prosodic primitives
and LFPC features. All emotions are completely overlapped.
Due to the short-term nature of the segmental parametrization,
each LFPC feature vector reflects the spectral envelope of a
single frame, i.e., it represents the characteristic vocal tract filter
for the phonemes articulated in that frame. As the spectral en-
velope is much more different for different phonemes than for
different emotions, the intra-class dispersion is very large, in-
creasing the overlap among emotions. In the case of intonation
and intensity, a similar effect occurs, as the frame-wise sam-
ples have more variation due to the linguistic content than to the emotional content of the utterances.
TABLE III: UNSUPERVISED CLUSTERING RESULTS FOR SPECTRAL STATISTICS
This does not mean that the features are useless. Both the
J criterion and the LDA are optimal only if the classes have
normal homoscedastic distribution, i.e., they have the same co-
variance matrices [35]. They cannot make use of the subtle dif-
ferences in the shape of the distributions, which can be captured
if the right classification engine is applied. Usually Gaussian
mixture models (GMMs) are used for this task, as they can
capture small differences among distributions, assuming that enough training samples are provided. As segmental features are
extracted once for every 10 ms, there are indeed enough training
samples to train robust and accurate models. In fact, GMMs
are very popular in emotion identification when frame-wise fea-
tures are used. Unfortunately, this means that no conclusions can
be extracted from the J measures for segmental parametriza-
tions. The only way to get an estimation of their capability to
distinguish emotions is to perform empirical tests of automatic
identification of emotions. The results of these empirical tests
are presented in Section V.
B. Unsupervised Clustering
The J values given above provide a clue about the discrim-
ination power of each feature family, and the scatterplot of the
most discriminant directions estimated by LDA gives a visual
representation of it. But using only two directions is somehow
unrealistic, since the addition of more dimensions could provide
more separation among the classes. Unfortunately, it is not pos-
sible to give an understandable graphical representation of more
than two dimensions. However, it is possible to obtain descrip-
tive results that can provide some insight into what happens
all features are used.
For this purpose, a blind unsupervised clustering was per-
formed using the k-means algorithm. If the emotional classes are correctly separated, the resulting clusters should correspond
to each emotion. The outcome of this clustering is shown in
Tables III and IV for suprasegmental features. No clustering was
performed on segmental features because the distributions are so
overlapped that the algorithm would not be able to locate the
classes correctly.
The clustering is able to identify the emotions quite accu-
rately, with the spectral parametrization having a better perfor-
mance than the prosodic one, as predicted by the J values.
Using spectral statistics, almost all samples belonging to a given
emotion are assigned to the same cluster, whereas prosodic fea-
tures exhibit the typical confusion among emotions: anger with
happiness and neutral with boredom and sadness. If we consider these tables as classification confusion matrices, prosodic
TABLE IV: UNSUPERVISED CLUSTERING RESULTS FOR PROSODIC STATISTICS
features would achieve an overall accuracy of 75.87%, whereas
spectral statistics would reach 98.68%. Note that the expected
accuracy of an emotion identification system is much lower, be-
cause the test utterances will not be seen during the training.
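The clustering experiment can be approximated as below: run k-means, then score each cluster by its majority emotion, i.e. read the resulting contingency table as a confusion matrix. The initialization scheme, toy data, and function names are assumptions for illustration, not the exact setup behind Tables III and IV:

```python
import numpy as np

def kmeans(X, k, iters=50):
    # Plain Lloyd iterations with a simple deterministic init
    # (evenly spaced samples); returns final cluster assignments.
    C = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        lab = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[lab == j].mean(axis=0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return lab

def cluster_accuracy(lab, y):
    # Score each cluster by its majority class, i.e. read the
    # cluster-vs-emotion contingency table as a confusion matrix.
    correct = sum(int(np.bincount(y[lab == j]).max())
                  for j in np.unique(lab))
    return correct / len(y)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, (60, 2)),
               rng.normal(4.0, 0.3, (60, 2))])
y = np.repeat([0, 1], 60)
acc = cluster_accuracy(kmeans(X, 2), y)
```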
C. Feature Selection
The use of noninformative or redundant features may de-
crease the accuracy of a classifier, due to the confusion they
add to the system. A feature selection algorithm can help
identify the truly useful features, reducing the dimensionality
of the parametrization and making the classifier run faster and
more accurately. Furthermore, detecting the discriminative fea-
tures may provide a deeper understanding of the influence
of the emotions on the acoustic characteristics of the voice.
The minimal-redundancy-maximal-relevance (mRMR) algo-
rithm [37] has been used to get a ranking of the features, from
the most to the least significant one. This algorithm selects
the features that maximize the mutual information between
the training samples and their classes (maximal relevance) and
simultaneously minimize the dependency among the selected features (minimal redundancy). mRMR has been applied to
all five parametrization families defined in Section III, as well
as to their combinations. The ranking of these combinations,
presented in Table V, is very interesting as it shows which
parametrization is preferred. When all suprasegmental features
are concatenated, among the first ten features ranked, we find
four prosodic and six spectral features. Looking further, up
to position 30, we find ten prosodic, one voice quality, and
19 spectral features. These results are in line with the ones
obtained in [16] and [17], supporting the previous results in
Section IV-A that suggest that spectral suprasegmental features
provide more information about the emotion than prosodic ones. Voice quality features are in general ranked low in the
list, even though some studies have claimed their relation with
the emotional state of the speaker [19], [20]. Probably the low
ranking is due to the automatic nature of the parametrization. In
works dealing with voice quality, features are usually extracted
with human intervention, providing very accurate values. In the
case presented in this paper, all processing was fully automatic,
and even though voice quality features were extracted only
for vowels (which are supposedly very stable), the resulting
estimation errors may increase the confusability in the system,
making them not very suitable for identification. Errors during
the automatic vowel detection may further increase this confus-
ability. These results are also confirmed by the low J values obtained by voice quality features.
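A much-simplified greedy variant of the mRMR selection can be sketched with a histogram-based mutual information estimate; the discretization, data, and names below are illustrative assumptions, and the actual algorithm in [37] differs in several details:

```python
import numpy as np

def mi(a, b, bins=8):
    # Mutual information (nats) between two variables, estimated
    # from a joint histogram.
    h, _, _ = np.histogram2d(a, b, bins=bins)
    p = h / h.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def mrmr(X, y, n_select):
    # Greedy mRMR: pick, at each step, the feature maximizing
    # relevance MI(feature, class) minus the mean redundancy
    # MI(feature, already-selected features).
    d = X.shape[1]
    rel = np.array([mi(X[:, j], y) for j in range(d)])
    sel = [int(np.argmax(rel))]
    while len(sel) < n_select:
        best, best_score = -1, -np.inf
        for j in range(d):
            if j in sel:
                continue
            red = np.mean([mi(X[:, j], X[:, s]) for s in sel])
            if rel[j] - red > best_score:
                best, best_score = j, rel[j] - red
        sel.append(best)
    return sel

rng = np.random.default_rng(3)
y = np.repeat([0.0, 1.0], 200)
f0 = y + 0.05 * rng.normal(size=400)   # informative feature
f1 = f0.copy()                         # fully redundant twin of f0
f2 = rng.normal(size=400)              # uninformative noise
X = np.column_stack([f0, f1, f2])
sel = mrmr(X, y, 2)
```

The redundancy term is what keeps the exact copy of the first feature out of the selection, even though its relevance is identical.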
TABLE V: BEST TEN FEATURES RANKED FOR FEATURE COMBINATIONS
The ranking for segmental parametrizations was performed separately for the voiced and unvoiced streams. Following the
approach taken with prosodic primitives, LFPC features were
also divided into voiced and unvoiced streams, so that the frame-
wise spectral parametrization and the prosodic primitives can
be easily combined by concatenation. In the combination of
the voiced streams, the three intonation features (F0 and its first
and second derivatives) are ranked among the ten best ones, while
the intensity derivatives are between positions 10 and 20. The
frame intensity itself is far below in the ranking for the voiced
stream, but appears in second position for the unvoiced one.
Having the prosodic primitives ranked so high in the list corroborates the importance of these features for emotion identification.
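The stream separation and concatenation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name, the toy feature values, and the use of plain Python lists are all assumptions.

```python
def split_and_concat(lfpc, prosody, voiced):
    """Early fusion by concatenation with voiced/unvoiced stream separation.

    lfpc    : list of per-frame LFPC vectors
    prosody : list of per-frame prosodic primitive vectors
    voiced  : list of booleans, True where the frame is voiced
    Returns the (voiced, unvoiced) streams of concatenated frame vectors.
    """
    combined = [f + p for f, p in zip(lfpc, prosody)]   # frame-wise concatenation
    voiced_stream = [c for c, v in zip(combined, voiced) if v]
    unvoiced_stream = [c for c, v in zip(combined, voiced) if not v]
    return voiced_stream, unvoiced_stream

# toy example: 4 frames, 3 LFPC values and 2 prosodic primitives per frame
lfpc = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
pros = [[10, 11], [12, 13], [14, 15], [16, 17]]
voiced = [True, False, True, False]
v, u = split_and_concat(lfpc, pros, voiced)
print(len(v), len(u), len(v[0]))   # 2 2 5
```

Splitting before concatenation keeps each stream homogeneous, which is what allows the frame-wise spectral parametrization and the prosodic primitives to be ranked and modeled jointly per stream.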
V. EXPERIMENTAL EVALUATION
A. Experimental Framework
In order to validate the results of the feature analysis, emo-
tion identification tests were carried out on the Berlin database.
Suprasegmental features were modeled with SVMs using RBF
kernel, whereas GMMs were used for segmental features. This
way, the characteristics of each parametrization, as shown
in Section IV-A, can be exploited. On the one hand, GMMs
should be able to model the subtle differences in the distribution of the highly overlapped segmental features, thanks to the
large number of training samples provided by the frame-wise
parametrization. On the other hand, SVMs can take advantage
of the larger separability of suprasegmental features. Further-
more, the high generalization capability exhibited by SVMs
[38] will guarantee the creation of robust models, even though
the suprasegmental parametrization provides very few training
samples. In the case of SVMs, the one-vs-all approach [39] was
used for the multiclass classification.
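The two decision rules can be sketched as follows. This is an illustrative reduction under stated assumptions: it shows only how an utterance-level decision is derived from per-frame GMM log-likelihoods and from one-vs-all SVM decision values, with hypothetical score values; training of the actual GMMs and SVMs is not reproduced.

```python
def gmm_decide(frame_logliks):
    """frame_logliks[k] holds the per-frame log-likelihoods of an utterance
    under the GMM of emotion k; the utterance-level score is the summed
    frame log-likelihood, and the decision is its argmax over emotions."""
    totals = [sum(ll) for ll in frame_logliks]
    return max(range(len(totals)), key=totals.__getitem__)

def one_vs_all_decide(svm_scores):
    """svm_scores[k] is the decision value of the binary 'emotion k vs. rest'
    SVM for one utterance; the emotion whose one-vs-all SVM is most
    confident wins."""
    return max(range(len(svm_scores)), key=svm_scores.__getitem__)

# hypothetical scores for one utterance and four emotions
frame_logliks = [[-3.1, -2.9], [-2.0, -1.8], [-4.0, -4.2], [-2.5, -2.6]]
print(gmm_decide(frame_logliks))                   # 1
print(one_vs_all_decide([-0.4, 0.2, -1.1, 0.9]))   # 3
```

Summing frame log-likelihoods is what lets the GMM exploit the many samples of the frame-wise parametrization, while the one-vs-all SVM operates on a single suprasegmental vector per utterance.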
The experimental framework was designed as a nested
double cross-validation (see Fig. 2). The outer level ensures
speaker independent results, where the speakers of the test
recordings have not been seen during the training. The inner level provides development tests for the optimization of
Fig. 2. Nested double cross-validation. In the (a) outer level, five blocks are defined according to speakers. Four are used for training and the last one for speaker-independent testing. In the (b) inner level, the recordings in the training set are randomly rearranged to form five new sub-blocks for development purposes.
the classifiers. The speakers in the database are divided into five
blocks for the outer level. Each block con-
tains one male and one female, so that gender balance is kept
within the blocks. For the th loop, blocks
define the training set, leaving block for testing. For the
inner loop, the sentences available in are randomly
distributed into five sub-blocks , which are then used for the inner level cross-validation. Once the five
inner-level loops have ended, their results are gathered and
used to estimate optimal values for the number of mixtures in
the GMM, the RBF kernel spread, the SVM misclassification
cost, and the optimum number of features. Finally, the whole
set is used to train the system for the th loop in the
outer level and perform the testing on block .
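The nested double cross-validation can be sketched as follows. The fold generator, the toy corpus, and the gender-balanced block construction are illustrative assumptions; only the split logic (speaker-disjoint outer folds, random inner sub-blocks) follows the description above.

```python
import random

def nested_cv_folds(recordings, speaker_blocks, n_sub=5, seed=0):
    """Sketch of the nested double cross-validation.

    recordings     : list of (speaker_id, utterance_id) pairs
    speaker_blocks : five speaker sets (outer level, speaker-disjoint)
    For each outer loop, the test block contains only unseen speakers, and
    the training recordings are randomly rearranged into n_sub development
    sub-blocks for the inner-level cross-validation.
    """
    rng = random.Random(seed)
    for test_speakers in speaker_blocks:
        test = [r for r in recordings if r[0] in test_speakers]
        train = [r for r in recordings if r[0] not in test_speakers]
        shuffled = train[:]
        rng.shuffle(shuffled)                          # inner level: random split
        sub_blocks = [shuffled[i::n_sub] for i in range(n_sub)]
        yield train, test, sub_blocks

# toy corpus: 10 speakers (s0..s9), 3 utterances each, two speakers per block
recs = [(f"s{i}", j) for i in range(10) for j in range(3)]
blocks = [{f"s{2*i}", f"s{2*i+1}"} for i in range(5)]
for train, test, subs in nested_cv_folds(recs, blocks):
    assert not {s for s, _ in train} & {s for s, _ in test}   # speaker independence
    assert sorted(sum(subs, [])) == sorted(train)             # sub-blocks cover train
```

The inner sub-blocks are built from recordings rather than speakers, so the development folds are not speaker-independent; only the outer test folds are.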
B. Selection of the Number of Features
In order to estimate the optimum number of features, develop-
ment tests were repeated adding one feature at a time, according
to the ranking obtained in Section IV-C. Fig. 3 shows the resulting accuracy in the development tests using suprasegmental
features, as a function of the number of features.
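The selection loop can be sketched as follows. `dev_accuracy` is a hypothetical stand-in for a full train/evaluate cycle on the development folds, and the accuracy curve is invented for illustration; only the add-one-feature-at-a-time procedure follows the text.

```python
def select_feature_count(max_features, dev_accuracy):
    """Incremental selection of the number of features: following the
    ranking, features are added one at a time and the development accuracy
    is measured for each count k; the smallest k reaching the best accuracy
    is kept.

    dev_accuracy(k) must return the development accuracy obtained with the
    first k ranked features.
    """
    best_k, best_acc = 1, dev_accuracy(1)
    for k in range(2, max_features + 1):
        acc = dev_accuracy(k)
        if acc > best_acc:                  # strict '>' keeps the smallest k on ties
            best_k, best_acc = k, acc
    return best_k, best_acc

# hypothetical accuracy curve that saturates, like those in Fig. 3
curve = [40.0, 48.0, 55.0, 61.0, 64.0, 67.4, 67.4, 67.0]
k, acc = select_feature_count(len(curve), lambda k: curve[k - 1])
print(k, acc)   # 6 67.4
```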
According to these results, we can observe that, even though
voice quality parametrization showed a very low class separa-
tion (see Table II), it does not perform so badly considering the
low number of features it uses. With all five voice quality fea-
tures, the system gets 49.4% of correct classifications, whereas
with the best five prosodic features, it gets 50.7%. Furthermore,
the combination of prosody and voice quality seems to be bene-
ficial, as predicted by the estimated class separation values: the
best prosodic system reaches a maximum of 65.5% with 39 features, whereas the combination obtains 67.4% with 17 features. Not only does the accuracy increase, but it also reaches
its maximum with fewer features.
Suprasegmental spectrum statistics clearly outperform
prosody, at least when more than 15 features are used, which
again confirms the conclusions obtained with the class separa-
bility analysis. Spectrum statistics reach a maximum of 75.4%
accuracy with 96 features and keep steady from then on. The
combination of all suprasegmental parametrizations obtains
the best results if more than 25 features are used, reaching
an almost steady state at 152 features with 77.9% accuracy,
with a marginally better absolute maximum of 78.6% with 247
features.
The development results for segmental features are represented in Fig. 4. None of the curves reach a real saturation
Fig. 3. Development results for suprasegmental parametrizations as a function of the number of features. (a) is a zoom view to see the results with few features.
TABLE VI
ACCURACY ON DEVELOPMENT AND TEST RESULTS WITH SELECTED NUMBER OF FEATURES AND ALL FEATURES
Fig. 4. Development results for segmental parametrizations as a function of the number of features.
point, as the accuracy continues growing as new features are
added. But it can be seen that the improvement is very small
once 20 features have been used. For example, LFPC parame-
ters get 70.9% accuracy with 20 features, and when all 54 are
used, this number increases only to 72.9%. When voiced and unvoiced frames are treated separately, the accuracy of LFPC
decreases, which seems reasonable as there are approximately
half the training samples for each stream.
Results with prosodic primitives seem rather modest,
reaching 65.3% accuracy in the voiced stream and 50.8% in the
unvoiced one. But it should be kept in mind that only six and
three features are used, respectively. If for example, only the
best six features in the voiced LFPC stream are kept, the system
gets 58.8% accuracy. However, adding the prosodic primitives
to LFPC features does not improve the results significantly.
According to the ranking shown in Table V, prosodic primitives
are more informative than most LFPC parameters, as they are
among the first 10 or 20 features selected. However, this is true only if few features are selected, e.g., fewer than 15. When the
number of features increases, the LFPC features that are added
compensate for the information of the prosodic primitives, so
that at the end, the combination has no effect on the overall
accuracy.
Table VI summarizes the development results for each
parametrization family and their combinations, both with the
estimated optimal number of features and with the complete set. The final speaker-independent test results are also presented for each case. Looking at these final test results, it can
be concluded that the spectral statistics are the best isolated
parametrization, reaching a 70.5% accuracy with 96 features
(and almost the same accuracy with all 324 features). Among
the feature combinations, the best result is obtained combining
all suprasegmental features, providing 72.2% correct answers
with 152 features and 72.5% with all 383.
C. Late Fusion Combination
So far, early fusion schemes have been used, i.e., concatenating parametrizations. But it may be interesting to see if results improve with a late fusion system, i.e., combining not
the features themselves but the results of the classifiers trained
with them. Furthermore, the late fusion allows combining the
information captured by parametrizations of different temporal
structure, such as segmental and suprasegmental features or the
voiced and unvoiced streams.
An SVM-based fusion system [40], [41] has been used for this
task. Given an utterance to be classified and a set of classifiers,
a vector is formed with the scores provided by the classifiers for
each emotion. This score vector is then classified by the fusion
SVM to get the final decision. With an appropriate training, the
SVM is expected to learn the score patterns of the errors and
hits of the classifiers and improve the results. The scores of the development tests were used for this training.
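The score-vector assembly can be sketched as follows. The subsystem names and score values are invented for illustration; the fusion SVM itself and its training on the development scores are not reproduced here.

```python
def fusion_vector(subsystem_scores):
    """Late fusion input: for one utterance, concatenate the per-emotion
    scores of every subsystem into a single vector; this vector is what the
    fusion SVM classifies to produce the final decision."""
    vec = []
    for scores in subsystem_scores:
        vec.extend(scores)
    return vec

# hypothetical subsystem outputs for one utterance over four emotions
svm_scores = [0.7, -0.1, -0.9, 0.2]            # suprasegmental SVM decision values
gmm_scores = [-310.4, -295.8, -320.1, -301.2]  # segmental GMM log-likelihoods
x = fusion_vector([svm_scores, gmm_scores])
print(len(x))   # 8
```

Because the fusion classifier sees one fixed-length vector per utterance, it can combine subsystems of different temporal structure (segmental and suprasegmental, voiced and unvoiced), which is not possible with early fusion by concatenation. In practice the raw scores would also be normalized to comparable ranges before fusion.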
TABLE VII
ACCURACY ON LATE FUSION TEST RESULTS WITH SELECTED NUMBER OF FEATURES AND ALL FEATURES
Results for the late fusion system are shown in Table VII. The
fusion of suprasegmental features (column 1) achieves very sim-
ilar results with both early and late fusion systems. Segmental
features (column 2) on the other hand show a great improve-
ment. However, this improvement is partly due to the combi-
nation of both the voiced and unvoiced streams of LFPC and
prosodic primitives, which is not possible with the early fusion.
Modeling the voiced and unvoiced streams separately in seg-
mental features and combining them afterwards through late
fusion provides good results. While LFPC parametrization obtains 69.9% accuracy with 20 features and 72.2% with all 54, the fusion of its streams (column 3) gets 72.0% and 76.5%, respectively (an error reduction of 7% and 15%, respectively). A
noticeable improvement can also be seen when combining the
streams of prosodic primitives (column 4). Therefore, we de-
cided to keep voiced and unvoiced streams separated.
The late fusion system can also be used to combine seg-
mental and suprasegmental systems. Combining the results
from prosody and LFPC (column 5) yields a similar accuracy
as combining the results from long-term spectral statistics
and frame-wise prosodic primitives (column 6). In both cases,
the accuracy is higher than when using the spectral statistics
alone (the best isolated system) and slightly better than using the early fusion of all suprasegmental features (the best early
fusion system).
When combining LFPC with spectral statistics (column 7)
or prosody with prosodic primitives (column 8), there is also
a significant improvement. This means that fusing systems that
use features of the same acoustic origin but different time span
can also be helpful to reduce classification errors.
As a last experiment, all features were combined with the
late fusion system (column 9): suprasegmental prosody, voice
quality, and spectral statistics together with segmental LFPC
and prosody primitives, with separated voiced and unvoiced
streams. Altogether, seven classifiers were combined in this last
test. The obtained results are the best among all the systemstested: 78.3% for all features and 76.8% for selected ones,
which implies a 20% error reduction with respect to the best
early fusion system.
D. Analysis of the Results by Emotion
Altogether, it seems that spectral characteristics provide
higher identification accuracies than prosodic ones. Neverthe-
less, prosodic features may be more appropriate to identify
certain emotions, even though they are not the best parametriza-
tion overall. The identification rates of each emotion have been
examined separately, in order to verify whether or not this is the
case. Fig. 5 presents the identification rates obtained for each emotion with some representative parametrizations: prosodic
Fig. 5. Comparison of the identification accuracy for each emotion with prosodic and spectral parameters.
and spectral statistics in the case of suprasegmental features,
and the late fusion of voiced and unvoiced streams of LFPC
and prosody primitives in the case of segmental features.
It can be observed that fear and happiness are the worst iden-
tified emotions, with less than 50% of recordings correctly clas-
sified in most cases. Looking into the results, we observed that
happiness is mostly confused with anger, while fear is confused
with neutral with suprasegmental features, and with neutral and happiness with segmental parametrizations. This result suggests
that the considered parametrizations are not suitable to capture
the characteristics of these emotions, and that other features
should be explored.
Both anger and neutral get similar accuracies with prosodic
and spectral long-term statistics. On the other hand, LFPC
seems a more suitable short-term parametrization than prosody
primitives for these emotions. In the case of boredom and
sadness, it is the segmental features that obtain similar identi-
fication rates, whereas in the suprasegmental parametrizations,
the spectral statistics get better results. Finally, disgust seems to
be very difficult to detect with prosody-related features, but the
accuracy increases significantly with spectral characteristics,
both in segmental and suprasegmental parametrizations.
Spectral features provide similar or higher accuracies than
prosodic ones in all emotions. This suggests that, even when
each emotion is considered on its own, spectral characteristics
are more suitable for emotion identification.
VI. CONCLUSION
The main goal of the work presented here was to analyze fea-
tures related to prosody, spectral envelope, and voice quality
regarding their capability to separate emotions. This analysis
sheds light on an aspect of the automatic recognition of emotions in speech that is not fully covered in the literature. Although there are many studies that analyze how individual fea-
tures change according to the emotional state of the speaker [7],
[8], [42], [43], they do not evaluate the behavior of the whole
feature set. This can lead to inaccurate conclusions about the
usefulness of a given parametrization, as the features that are
most discriminant individually may not be the most discriminant
when taken together. Indeed, most of these studies show that,
when considered individually, changes in the prosodic features are more significant than changes in the spectral parameters.
However, the results presented in this work suggest that if the
whole feature set is considered, spectral envelope parametriza-
tions are more informative than prosodic ones.
The analysis presented in this paper has been performed by
means of discriminability measures, unsupervised clustering,
and feature ranking. The results have been confirmed with
empirical experiments of automatic identification of emotions.
The discriminability criterion that has been used provides a way
to estimate the performance of the whole feature set instead
of individual parameters. These measures have been comple-
mented with unsupervised clustering to see whether or not
the parametrizations provide enough class separation. Both methods reveal that long-term statistics of spectral envelope
provide larger separation among emotions than traditional
prosodic features. This explains the higher accuracy of the
former in the experimental tests. The measures also show
that combining suprasegmental parametrizations of spectral
envelope, prosody, and voice quality increases the separation
among emotions, confirming that the use of features extracted
from different information sources helps in reducing the identi-
fication error.
Unfortunately, these methods are not suitable for the anal-
ysis of segmental parametrizations. Instead, a feature ranking
algorithm has been applied to both segmental and suprasegmental parametrizations to detect the most discriminant fea-
tures. Although many papers in the field mention the use of
some kind of feature selection algorithm, only a few of them dis-
cuss the outcome from this selection. When this outcome is pro-
vided, long-term spectral statistics are usually selected first, over
prosodic ones [16], [17], suggesting that suprasegmental spec-
tral features provide more information about the emotional state
of the speaker. The feature selection results shown in the present
paper are in agreement with this conclusion.
Some works in the literature rely entirely on empirical exper-
iments to determine the capacity of a given parametrization to
identify emotions. Although this approach is suitable to find out
the most discriminant parametrization, it gives no further infor-
mation about the relation among these features. Furthermore,
most works that use a combination of different sets of features
(e.g., prosodic and spectral) provide only the accuracy results
for that combination, but not separately for each feature set, so
it is not possible to deduce which set is more informative. In
this work, we have analyzed each feature set independently and
in combination with the others. The experimental results have
also been given for each parametrization and for their combi-
nations. This way, it is possible to measure the accuracy gain
when different feature sets are combined, and see whether or
not this combination is advantageous. For example, even though
segmental prosodic primitives were ranked high by the feature selection algorithm, it has been shown that the combination of
frame-wise LFPC and prosodic primitives gets similar accuracy to that of the LFPC alone. Only with a reduced number of fea-
tures is this combination helpful (see Fig. 4). Also for supraseg-
mental parametrizations, prosodic features perform better than
the spectral ones only when few features are used [Fig. 3(a)].
According to these results, we can say that traditional prosodic
features seem to be the most appropriate for automatic identification of emotions only if they are considered individually or in
a very reduced feature set. But if large feature sets are consid-
ered, spectral features outperform them.
The analysis presented in this paper has been performed using
an acted speech emotional database. In order to see whether the
results are applicable to real-life systems, they should be vali-
dated using different databases, especially databases of natural
emotions. Nevertheless, the database used in this work is sup-
posed to contain highly natural emotion portrayals, so it is very
likely that these conclusions are valid to a great extent. A pre-
vious work using the Aibo database of natural emotions [44]
showed similar conclusions, with frame-wise MFCC features
outperforming long-term prosodic statistics [45]. Also in [12], [13], and [46], MFCC features get better results than supraseg-
mental ones.
The results from the late fusion tests suggest that the com-
bined use of parametrizations extracted from the same infor-
mation source but with different time-scale, i.e., segmental and
suprasegmental features, increases the accuracy of the system.
The difference in the time scales and in the classifiers makes
each subsystem retain different characteristics of the emo-
tions. In fact, one of the best results has been achieved with the
late fusion of long-term LFPC statistics with the voiced and un-
voiced streams of frame-wise LFPC features. This system has
only been outperformed by the late fusion of all segmental and suprasegmental features, including spectral envelope, prosody,
and voice quality. However, this last system is much more com-
plex and the obtained improvement is very small. The use of all
features requires estimating LFPC, intonation values, voiced-
unvoiced decision, pitch period marking, inverse filtering, and
vowel detection. This makes the parametrization step very com-
plex and time-consuming. Furthermore, it uses seven different
classifiers prior to the fusion. The spectral system on the other
hand needs only LFPCs, some simple statistics obtained from
them, and the voiced-unvoiced decisions in order to separate the
streams, reducing the number of classifiers to three. The differ-
ence in the accuracy may not justify increasing the complexity
of the system to such a degree.
We are not claiming that features extracted from prosody are
useless. Several papers show that humans are able to identify
emotions in prosodic copy-synthesis experiments [5], [47],
confirming that prosody does carry a great amount of emotional
information, at least for some emotions. But the traditional
prosodic representations may not be well suited to capture this
information. On the one hand, long-term statistics estimated
over the whole sentence lose the information of specific charac-
teristic prosodic events. On the other hand, short-term prosodic
primitives do not capture the prosodic structure correctly, which
is suprasegmental by definition. The results suggest that a new
more elaborate representation is needed to effectively extract the emotional information contained in the prosody.
REFERENCES
[1] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Commun., vol. 40, pp. 227–256, Apr. 2003.
[2] K. R. Scherer, "Psychological models of emotion," in The Neuropsychology of Emotion, J. Borod, Ed. Oxford, U.K.: Oxford Univ. Press, 2000, ch. 6, pp. 137–166.
[3] P. Ekman, "An argument for basic emotions," Cognit. Emotion, vol. 6, pp. 169–200, 1992.
[4] C. Darwin, The Expression of the Emotions in Man and Animals, 3rd ed. Oxford, U.K.: Oxford Univ. Press, 1998.
[5] F. Burkhardt and W. F. Sendlmeier, "Verification of acoustical correlates of emotional speech using formant-synthesis," in Proc. ISCA Tutorial and Research Workshop on Speech and Emotion, Belfast, Ireland, Sep. 2000, pp. 151–156.
[6] C. F. Huang and M. Akagi, "A three-layered model for expressive speech perception," Speech Commun., vol. 50, pp. 810–828, Oct. 2008.
[7] K. R. Scherer, R. Banse, H. G. Wallbott, and T. Goldbeck, "Vocal cues in emotion encoding and decoding," Motiv. Emotion, vol. 15, no. 2, pp. 123–148, 1991.
[8] M. Schröder, "Speech and emotion research," Ph.D. dissertation, Universität des Saarlandes, Saarbrücken, Germany, 2003.
[9] E. Navas, I. Hernáez, A. Castelruiz, J. Sánchez, and I. Luengo, "Acoustic analysis of emotional speech in standard Basque for emotion recognition," in Progress in Pattern Recognition, Image Analysis and Applications, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, Oct. 2004, vol. 3287, pp. 386–393.
[10] D. Erickson, "Expressive speech: Production, perception and application to speech synthesis," Acoust. Sci. Tech., vol. 26, pp. 317–325, 2005.
[11] T. L. Nwe, S. W. Foo, and L. C. de Silva, "Speech emotion recognition using hidden Markov models," Speech Commun., vol. 41, pp. 603–623, Jun. 2003.
[12] S. Kim, P. G. Georgiou, S. Lee, and S. Narayanan, "Real-time emotion detection system using speech: Multi-modal fusion of different timescale features," in Proc. Int. Workshop Multimedia Signal Processing, Crete, Greece, Oct. 2007, pp. 48–51.
[13] B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll, "Frame vs. turn-level: Emotion recognition from speech considering static and dynamic processing," in Affective Computing and Intelligent Interaction, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, 2007, vol. 4738, pp. 139–147.
[14] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Comput. Appl., vol. 9, pp. 290–296, Dec. 2000.
[15] O. W. Kwon, K. Chan, J. Hao, and T. W. Lee, "Emotion recognition by speech signals," in Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 125–128.
[16] T. Vogt and E. André, "Improving automatic emotion recognition from speech via gender differentiation," in Proc. LREC, Genoa, Italy, May 2006.
[17] B. Schuller, R. Müller, M. Lang, and G. Rigoll, "Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 805–808.
[18] R. López-Cózar, Z. Callejas, M. Kroul, J. Nouza, and J. Silovský, "Two-level fusion to improve emotion classification in spoken dialogue systems," in Graphics Recognition. Recent Advances and New Opportunities, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, 2008, vol. 5246, pp. 617–624.
[19] M. Lugger and B. Yang, "The relevance of voice quality features in speaker independent emotion recognition," in Proc. ICASSP, Honolulu, HI, Apr. 2007, vol. 4, pp. 17–20.
[20] C. Gobl and A. Ní Chasaide, "The role of voice quality in communicating emotion, mood and attitude," Speech Commun., vol. 40, pp. 189–212, Apr. 2003.
[21] R. Tato, R. Santos, R. Kompe, and J. Pardo, "Emotional space improves emotion recognition," in Proc. ICSLP, Sep. 2002, pp. 2029–2032.
[22] R. Müller, B. Schuller, and G. Rigoll, "Enhanced robustness in speech emotion recognition combining acoustic and semantic analyses," in Proc. From Signals to Signs of Emotion and Vice Versa, Santorini, Greece, Sep. 2004.
[23] A. Nogueiras, A. Moreno, A. Bonafonte, and J. B. Mariño, "Speech emotion recognition using hidden Markov models," in Proc. Eurospeech, Aalborg, Denmark, Sep. 2001, pp. 2679–2682.
[24] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[25] D. Ruta and B. Gabrys, "An overview of classifier fusion methods," Comput. Inf. Syst., vol. 7, no. 1, pp. 1–10, Feb. 2000.
[26] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 1517–1520.
[27] J. Ramírez, J. C. Segura, C. Benítez, A. de la Torre, and A. Rubio, "Efficient voice activity detection algorithms using long-term speech information," Speech Commun., vol. 42, pp. 271–287, Apr. 2004.
[28] P. Alku, H. Tiitinen, and R. Näätänen, "A method for generating natural-sounding speech stimuli for cognitive brain research," Clin. Neurophysiol., vol. 110, pp. 1329–1333, Aug. 1999.
[29] I. Luengo, I. Saratxaga, E. Navas, I. Hernáez, J. Sánchez, and I. Sainz, "Evaluation of pitch detection algorithms under real conditions," in Proc. ICASSP, Honolulu, HI, Apr. 2007, pp. 1057–1060.
[30] I. Luengo, E. Navas, J. Sánchez, and I. Hernáez, "Detección de vocales mediante modelado de clusters de fonemas," Procesado del Lenguaje Natural, vol. 43, pp. 121–128, Sep. 2009.
[31] F. Ringeval and M. Chetouani, "Exploiting a vowel based approach for acted emotion recognition," in Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, Oct. 2008, vol. 5042, pp. 243–254.
[32] T. Bäckström, P. Alku, and E. Vilkman, "Time-domain parameterization of the closing phase of glottal airflow waveform from voices over a large intensity range," IEEE Trans. Speech Audio Process., vol. 10, no. 3, pp. 186–192, Mar. 2002.
[33] R. van Son and L. Pols, "An acoustic description of consonant reduction," Speech Commun., vol. 28, pp. 125–140, Jun. 1999.
[34] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1990.
[35] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York: Wiley, 2001.
[36] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.
[37] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[38] C. J. Burges, "A tutorial on support vector machines for pattern recognition," Data Min. Knowl. Discov., vol. 2, pp. 121–167, 1998.
[39] C. W. Hsu and C. J. Lin, "A comparison of methods for multi-class support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.
[40] J. Fierrez-Aguilar, D. Garcia-Romero, J. Ortega-Garcia, and J. Gonzalez-Rodriguez, "Adapted user-dependent multimodal biometric authentication exploiting general information," Pattern Recognit. Lett., vol. 26, no. 16, pp. 2628–2639, Dec. 2005.
[41] B. Gutschoven and P. Verlinde, "Multi-modal identity verification using support vector machines (SVM)," in Proc. Int. Conf. Information Fusion, Paris, France, Jul. 2000, vol. 2, pp. 3–8.
[42] R. Banse and K. R. Scherer, "Acoustic profiles in vocal emotion expression," J. Pers. Soc. Psychol., vol. 70, no. 3, pp. 614–636, 1996.
[43] L. Devillers, I. Vasilescu, and L. Vidrascu, "F0 and pause features analysis for anger and fear detection in real-life spoken dialogs," in Proc. Speech Prosody, Nara, Japan, Mar. 2004, pp. 205–208.
[44] A. Batliner, S. Steidl, B. Schuller, D. Seppi, K. Laskowski, T. Vogt, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson, "Combining efforts for improving automatic classification of emotional user states," in Proc. Information Society–Language Technologies Conf. (IS-LTC), Ljubljana, Slovenia, Oct. 2006, pp. 240–245.
[45] I. Luengo, E. Navas, and I. Hernáez, "Combining spectral and prosodic information for emotion recognition in the Interspeech 2009 Emotion Challenge," in Proc. Interspeech, Brighton, U.K., Sep. 2009, pp. 332–335.
[46] I. Luengo, E. Navas, I. Hernáez, and J. Sánchez, "Automatic emotion recognition using prosodic parameters," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 493–496.
[47] E. Navas, I. Hernáez, and I. Luengo, "An objective and subjective study of the role of semantics in building corpora for TTS," IEEE Trans. Speech Audio Process., vol. 14, no. 4, pp. 1117–1127, Jul. 2006.
Iker Luengo received the telecommunication engineering degree from the University of the Basque Country, Bilbao, Spain, in 2003 and is currently pursuing the Ph.D. degree in telecommunications.
He has been a researcher in the AhoLab Signal Processing Group in the Electronics and Telecommunications Department since 2003. He has participated as a research engineer in government-funded R&D projects focused on emotional speech, speaker recognition, diarization of meetings, and speech prosody.
Mr. Luengo is a member of the International Speech Communication Association (ISCA) and the Spanish thematic network on Speech Technologies (RTTH).
Eva Navas received the telecommunication engineering degree and the Ph.D. degree from the Department of Electronics and Telecommunications of the University of the Basque Country, Bilbao, Spain.
Since 1999, she has been a researcher at the AhoLab Signal Processing Group. She is currently teaching at the Faculty of Industrial and Telecommunication Engineering in Bilbao. She has participated as a research engineer in government-funded R&D projects as well as in privately-funded research contracts. Her research is focused on expressive speech characterization, recognition, and generation.
Dr. Navas is a member of the International Speech Communication Association (ISCA), the Spanish thematic network on Speech Technologies (RTTH), and the European Center of Excellence on Speech Synthesis (ECESS).
Inmaculada Hernáez received the telecommunications engineering degree from the Universitat Politècnica de Catalunya, Barcelona, Spain, and the Ph.D. degree in telecommunications engineering from the University of the Basque Country, Bilbao, Spain, in 1987 and 1995, respectively.
She is a Full Professor in the Electronics and Telecommunication Department, Faculty of Engineering, University of the Basque Country, in the area of signal theory and communications. She is a founding member of the Aholab Signal Processing Research Group. Her research interests are signal processing and all aspects related to speech processing. She is also interested in the development of speech resources and technologies for the Basque language.
Dr. Hernáez is a member of the International Speech Communication Association (ISCA), the Spanish thematic network on Speech Technologies (RTTH), and the European Center of Excellence on Speech Synthesis (ECESS).