
Feature Analysis and Evaluation for Automatic Emotion Identification in Speech

Iker Luengo, Eva Navas, and Inmaculada Hernáez

Abstract—The definition of parameters is a crucial step in the development of a system for identifying emotions in speech. Although there is no agreement on which are the best features for this task, it is generally accepted that prosody carries most of the emotional information. Most works in the field use some kind of prosodic features, often in combination with spectral and voice quality parametrizations. Nevertheless, no systematic study has been done comparing these features. This paper presents the analysis of the characteristics of features derived from prosody, spectral envelope, and voice quality, as well as their capability to discriminate emotions. In addition, early fusion and late fusion techniques for combining different information sources are evaluated. The results of this analysis are validated with experimental automatic emotion identification tests. Results suggest that spectral envelope features outperform the prosodic ones. Even when different parametrizations are combined, the late fusion of long-term spectral statistics with short-term spectral envelope parameters provides an accuracy comparable to that obtained when all parametrizations are combined.

Index Terms—Emotion identification, information fusion, parametrization.

    I. INTRODUCTION

FEATURES extracted from the speech signal have a great effect on the reliability of an emotion identification system. Depending on these features, the system will have a certain capability to distinguish emotions and will be able to deal with speakers not seen during the training. Many works in the field of emotion recognition are aimed at finding the most appropriate parametrization, yet there is no clear agreement on which feature set is best.

One of the major problems in determining the best features for emotion identification is that there is no solid theoretical basis relating the characteristics of the voice with the emotional state of the speaker [1]. That is why most of the works in this field are based on features obtained from direct comparison of speech signals portraying different emotions. The comparison makes it possible to estimate the acoustic differences among them, identifying features that could be useful for emotion identification.

Manuscript received December 11, 2009; revised March 22, 2010; accepted May 06, 2010. Date of current version September 15, 2010. This work was supported in part by the Spanish Government under the BUCEADOR project (TEC2009-14094-C04-02) and in part by the AVIVAVOZ project (TEC2006-13694-C03-02). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hamid K. Aghajan.

The authors are with the Department of Electronics and Telecommunications, University of the Basque Country, Bilbao 48013, Spain (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2010.2051872

The only widely known theory that describes the physiological changes caused by an emotional state [2], [3] takes the Darwinian theory [4] as a reference, considering emotions as a result of evolutionary needs. According to this theory, each emotion induces some physiological and psychological changes in order to prepare us to do something, e.g., fear prepares us to run from danger. These changes have a certain influence on the speech characteristics, mainly on those related to intonation, intensity, and speaking rate, i.e., on prosody.

For many years, automatic identification systems have used prosodic features almost exclusively, mainly because of the aforementioned theory. The relation between emotion and prosody is reflected in many works in the literature [5]–[9]. A detailed summary of these works and their conclusions is presented in [10]. Prosodic features are mostly used in the form of long-term statistics, usually estimated over the whole utterance. The most common ones are simple statistics such as mean, variance, minimum, maximum, or range. But it is also possible to use more complex representations in an attempt to retain more emotional information.
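As an illustration of this kind of representation, the following sketch computes the simple long-term statistics mentioned above (mean, variance, minimum, maximum, and range) over a prosodic contour. The function name and interface are hypothetical and are not part of the parametrization described in this paper.

    import numpy as np

    def longterm_prosodic_stats(contour):
        # Long-term statistics of a prosodic contour (e.g., the F0 or
        # intensity values of one utterance), as commonly used in the
        # literature reviewed above.
        contour = np.asarray(contour, dtype=float)
        return {
            "mean": contour.mean(),
            "variance": contour.var(),
            "minimum": contour.min(),
            "maximum": contour.max(),
            "range": contour.max() - contour.min(),
        }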

The use of prosodic features gives a certain reiterative confusion pattern among emotions. They seem to be able to discriminate high arousal emotions (anger, happiness) from low arousal ones (sadness, boredom) easily. But the confusion level for emotions of the same arousal level is very large [1]. However, humans are able to distinguish anger from happiness and sadness from boredom accurately. This reinforces the idea that there are some other voice characteristics useful for emotion identification. Some works in the literature use spectral measurements or voice quality features, showing the importance of these parameters for emotion identification.

The vocal tract is also influenced by the emotional state, and so are the spectral characteristics of the voice. The relationship between emotion and spectral characteristics is empirically supported by several works [7], [11]–[14]. Furthermore, taking into account that prosody is determined mainly by the vocal fold activity and that the spectrum envelope is mostly influenced by the vocal tract, it is reasonable to assume that they are not very correlated. Therefore, both types of features provide different information about the emotional state of the speaker, and the combination of prosodic and spectral parametrizations may be interesting. In fact, most of the works using spectral measures use them together with prosodic features, significantly reducing the error rate with respect to using only one of the parametrizations [12], [13], [15]–[18].

The effect that emotions have on the vocal folds causes

    changes in the voice quality, too [19], [20], so in the last couple

    of years, features related to voice quality have also been used


    to help emotion identification [19], [21]. Nevertheless, few

    authors use them, due to the difficulty of extracting the glottal

    signal from the speech [20]. The estimation of the glottal signal

    can only be done with acceptable accuracy on very stable

    speech segments, which makes it difficult to extract voice

    quality features by automatic means.

Finally, it is also possible to use linguistic features such as the occurrence of sighs or certain emotive words [17], [18], [22]. However, only a few authors use them. On the one hand, in most cases an automatic speech recognition (ASR) system is needed, which increases the complexity of the system. On the other hand, this kind of feature only makes sense with spontaneous speech, and most of the works in the field have been developed using read speech databases, often with the same texts for all emotions.

Best results are obtained when parametrizations of different nature are combined in order to increase the available information. When all features have the same temporal structure, this combination can be done by simply concatenating the feature vectors. The problem arises when the temporal structure is different for the considered parametrizations. For example, prosodic information is usually given in the form of long-term statistics, while traditional spectral envelope parametrizations like Mel frequency cepstral coefficients (MFCC) or linear prediction cepstral coefficients (LPCC) are extracted for each frame. A simple solution applied in many works is to calculate long-term statistics of spectral features, making it possible to concatenate them to the prosodic feature vector [15]–[17]. Another alternative is to do the opposite. Instead of calculating statistics of spectral features, frame-wise F0 and intensity values are used and appended directly to the MFCC or LPCC vectors [14], [23]. In [15], both strategies are compared, obtaining very similar results in both cases. A more elaborate solution uses a late fusion of classifiers [24], [25], i.e., training a different classifier with each parametrization and combining the results of these classifiers, as in [12] and [18].

Going over the literature shows a disagreement on which features are best for the identification of emotions in speech. It is widely accepted that prosodic features carry most of the emotional information, and that the combination of parametrizations of different nature improves the results. But no systematic study has been done comparing the effectiveness of each kind of parametrization or their combinations. This work attempts to fill this gap by analyzing the characteristics and appropriateness of different sets of features used for the recognition of emotions. It is focused on acoustic parameters that can be extracted directly from the speech signal without using ASR systems: prosody, spectral envelope, and voice quality. The effectiveness of different combinations of features is also studied, and the different combination approaches are compared to see which one is more appropriate: the simple feature concatenation (early fusion) or the classifier result combination (late fusion).

Section II of the paper describes the emotional database used in this work, as well as the acoustic processing carried out to extract the data needed for the calculation of the features. Section III describes the considered features, whereas Section IV describes the analysis of these features. Some empirical experiments are carried out and described in Section V, in order to confirm the results obtained in the analysis. Finally, the results are discussed and some conclusions are drawn.

    II. WORKING DATABASE

    A. Description of the Database

Both the analysis of the features and the emotion identification experiments described in this paper were carried out

    using the Berlin emotional speech database [26]. This database

    contains 535 recordings uttered by five male and five female

    speakers, simulating seven emotional states: anger, boredom,

    disgust, fear, happiness, neutral, and sadness. Although the

    original recordings are available at 16 kHz and 16 bits per

    sample, they were subsampled to 8 kHz for this work.

The database corpus contains ten sentences of neutral semantic content that all speakers recorded portraying the seven

    emotional styles. As described in [26], in order to get emotions

as natural as possible, the recordings were evaluated in a perceptual test where 20 listeners had to identify the intended emotion and score its naturalness. All recordings with an identification rate lower than 80% and an overall naturalness score under 60% were discarded, leaving the 535 sentences available in the final database. Due to the perceptual selection process, some emotions are more represented than others.

This database was chosen because it presents certain characteristics that were of interest. It is a multispeaker database, making it possible to perform speaker-independent tests. Furthermore, the perceptual selection guarantees that the portrayed emotions are highly natural. In addition, many studies on the identification of emotions have been carried out using this database, which makes it possible to compare the results with other published results.

    B. Processing of the Recordings

The recordings were processed in order to get the characteristic curves and labelings needed for the feature extraction. The processing included detection of the vocal activity, estimation of the glottal source signal and of the intonation curve, voiced-unvoiced labeling, pitch-period marking, and vowel detection. All

    this processing was performed automatically without manual

    corrections.

    1) Vocal Activity Detection (VAD): Silences and pauses were

    detected using the LTSE VAD algorithm described in [27]. The

    labels obtained were further processed to discard silence labels

    shorter than 100 ms that usually appear as a result of detection

    errors, and which have no linguistic meaning.
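A minimal sketch of this post-processing step is given below. It assumes that the VAD output is already available as labeled segments; the segment format and the function name are hypothetical rather than part of the LTSE algorithm of [27].

    def drop_short_silences(segments, min_silence_ms=100):
        # segments: list of (start_ms, end_ms, tag) with tag "speech" or "silence".
        # Silences shorter than min_silence_ms are relabeled as speech and merged
        # with the neighbouring segments, as they usually come from detection errors.
        cleaned = []
        for start, end, tag in segments:
            if tag == "silence" and (end - start) < min_silence_ms:
                tag = "speech"
            if cleaned and cleaned[-1][2] == tag:
                cleaned[-1] = (cleaned[-1][0], end, tag)
            else:
                cleaned.append((start, end, tag))
        return cleaned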

    2) Glottal Source Estimation: The glottal signal is needed

    in order to compute various features related to voice quality.

However, estimating the glottal flow from the speech signal is inherently difficult, and the results may be acceptable only when the estimation is applied to very stationary speech segments. Iterative adaptive inverse filtering (IAIF) [28] was used to perform the inverse filtering and recover the glottal flow, since it is a fully

    automatic method that gives acceptable results with stationary

    signals. Nevertheless, it is assumed that the estimated glottal

    signal is inaccurate for nonstationary segments.

3) Intonation Curve and Voiced-Unvoiced Labeling: The intonation curve was computed with the cepstrum dynamic programming (CDP) algorithm described in [29], which uses the

    cepstrum transform and dynamic programming. This algorithm

    provides the voiced-unvoiced (VUV) labeling, too.

4) Pitch Period Marking: In addition to glottal flow estimation, pitch synchronous marks are also needed for the extraction of voice quality features. Once the intonation curves, the VUV labeling, and the glottal flow were extracted, pitch marks were placed at the negative peaks of the inverse filtering residual, which were located by simple peak picking using the estimated F0 value as a clue to detect the next peak.

    5) Vowel Detection: Vowels are some of the most stable

    segments in a speech signal, making them very appropriate

    for the computation of certain features, such as those derived

    from the glottal source estimation. Furthermore, vowels are

    always voiced and are strongly affected by intonation patterns,

providing a consistent point to calculate intonation-related features such as risings or fallings. The vowels in the database were automatically detected using the algorithm described in [30], which is based on a phoneme recognizer working with HMM models of clustered phonemes. Phoneme clustering was automatically performed according to their acoustical similarity, keeping the vowels in their own cluster. This provides a consistent and very robust set of models, capable of detecting 80% of vowel boundaries with less than 20-ms error.

    III. FEATURE DEFINITION

    A. Segmental Features

Segmental features are calculated once for every frame, allowing the analysis of their temporal evolution. A 25-ms Hamming window with 60% overlap is used, i.e., a new feature vector is extracted every 10 ms.

1) Spectral Envelope: Reference [11] shows that log-filter power coefficients (LFPC) outperform traditional MFCC or

    LPCC parameters in emotion identification, so they were

    chosen for the frame-wise spectral characterization. LFPC

    features represent the spectral envelope in terms of the energy

    in Mel-scaled frequency bands.

    Eighteen LFPC coefficients were estimated for each frame,

    together with their first and second derivatives, giving a total of

    54 spectral features at the segment level. In order to minimize

    microphone distance effects, LFPC features were normalized to

    the mean value of the whole utterance.
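A rough sketch of this frame-wise LFPC extraction is shown below. It follows the description above (25-ms Hamming window, 10-ms shift, 18 Mel-scaled bands, first and second derivatives, utterance-mean normalization), but the FFT size and the particular Mel filterbank (taken here from librosa) are assumptions of the sketch, not details given in the paper.

    import numpy as np
    import librosa  # assumed available; any Mel filterbank implementation would do

    def lfpc_features(signal, sr=8000, n_bands=18, frame_ms=25, shift_ms=10):
        frame = int(sr * frame_ms / 1000)           # 25-ms analysis window
        shift = int(sr * shift_ms / 1000)           # 10-ms shift (60% overlap)
        n_fft = 512                                 # assumed FFT size
        window = np.hamming(frame)
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_bands)

        frames = []
        for start in range(0, len(signal) - frame + 1, shift):
            spec = np.abs(np.fft.rfft(signal[start:start + frame] * window, n_fft)) ** 2
            band_energy = mel_fb @ spec             # energy in each Mel band
            frames.append(np.log(band_energy + 1e-10))
        lfpc = np.array(frames)                     # shape: (n_frames, 18)

        lfpc -= lfpc.mean(axis=0)                   # normalize to the utterance mean
        d1 = librosa.feature.delta(lfpc, axis=0)    # first derivative
        d2 = librosa.feature.delta(lfpc, order=2, axis=0)
        return np.hstack([lfpc, d1, d2])            # 54 features per frame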

2) Prosody Primitives: We refer to the intonation and intensity curves as prosody primitives, as the (suprasegmental) prosodic features are estimated from these. A new sample of F0 and intensity was estimated for every frame, together with their first and second derivatives, providing six features per frame.

    Intensity curves were normalized to the mean intensity value of

the whole utterance in order to minimize the effect of the microphone distance.

As F0 values are not defined for unvoiced frames, two feature vector streams were extracted from each recording, the first one corresponding to voiced frames (characterized by both intonation and intensity features) and the second one corresponding to unvoiced frames (characterized only by intensity values). Both streams were treated as different parametrizations during this work.
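The splitting into streams can be sketched as follows. The helper below is illustrative only; in particular, the delta computation is simplified to a plain gradient over each stream rather than over contiguous voiced regions.

    import numpy as np

    def prosody_primitive_streams(f0, energy, voiced):
        # f0, energy: per-frame values; voiced: boolean mask marking the frames
        # where F0 is defined. Returns the voiced stream (6 features per frame)
        # and the unvoiced stream (3 features per frame).
        def with_deltas(x):
            d1 = np.gradient(x)      # first derivative (simplified estimate)
            d2 = np.gradient(d1)     # second derivative
            return np.column_stack([x, d1, d2])

        energy = energy - energy.mean()   # normalize to the utterance mean intensity
        voiced_stream = np.hstack([with_deltas(f0[voiced]),
                                   with_deltas(energy[voiced])])
        unvoiced_stream = with_deltas(energy[~voiced])
        return voiced_stream, unvoiced_stream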

TABLE I: SUPRASEGMENTAL PARAMETERS AND THE CORRESPONDING SYMBOLS USED THROUGHOUT THIS DOCUMENT

    B. Suprasegmental Features

Suprasegmental features represent long-term information, estimated over time intervals longer than a frame. In this work, this interval has been defined as the time between two consecutive pauses. It is expected that speech pauses correspond roughly to linguistic stops in the message, so this approach is very similar to using a whole utterance as integration time. However, as pauses can be detected automatically with the VAD algorithm, this method makes it possible to adapt the parametrization algorithm to work with direct audio input if necessary.

1) Spectral Statistics: For the suprasegmental characterization of the spectrum, long-term statistics of LFPC were calculated. For each of the 18 LFPC coefficients and their first and second derivatives, six statistics were computed, as shown in Table I. At the end, 324 suprasegmental spectral features were extracted.
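A sketch of this suprasegmental spectral parametrization is given below. Since Table I is not reproduced in this copy, the particular set of six statistics (mean, variance, minimum, maximum, range, and median) is only a placeholder guess; what matters is the structure: six statistics over each of the 54 LFPC trajectories, giving 324 features per inter-pause segment.

    import numpy as np

    def spectral_statistics(lfpc_frames):
        # lfpc_frames: array of shape (n_frames, 54) with the frame-wise LFPC
        # coefficients and their first and second derivatives for one segment.
        stats = [
            lambda x: x.mean(axis=0),
            lambda x: x.var(axis=0),
            lambda x: x.min(axis=0),
            lambda x: x.max(axis=0),
            lambda x: x.max(axis=0) - x.min(axis=0),   # range
            lambda x: np.median(x, axis=0),            # placeholder sixth statistic
        ]
        return np.concatenate([f(lfpc_frames) for f in stats])   # 6 x 54 = 324 values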


2) Prosody: Prosodic features are divided into five categories, according to the nature of the information they represent. Altogether, 54 prosodic features were defined.

Intonation Statistics: Intonation features were calculated as the same six statistics presented in Table I, applied to the F0 values and their first and second derivatives, giving 18 parameters as a result. Only frames detected as voiced were used for the computation of the statistics.

    Intensity Statistics: Following the same approach as with

    the intonation, intensity features were composed of the same

    statistics applied to the intensity values and their first and second

    derivatives, providing 18 new parameters.

    Speech Rate Features: They were defined as the mean and

    variance of the vowel duration, as shown in Table I.

Regression Features: In each detected vowel, a linear regression was estimated for the F0 and intensity values. Then the absolute value of the regression line slopes was calculated, and the six features listed in Table I were extracted. With these features, we try to combine long integration intervals (with a length comparable to a sentence) and short ones (with an approximate duration of a phoneme), as done by other authors [31].
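The per-vowel slope extraction can be sketched as below. The function is illustrative and assumes that frame times, contour values, and vowel boundaries are already available from the processing described in Section II-B.

    import numpy as np

    def vowel_regression_slopes(times, values, vowel_segments):
        # Absolute slope of a linear regression of a prosodic contour (F0 or
        # intensity) inside each detected vowel.
        # vowel_segments: list of (start_time, end_time) pairs.
        slopes = []
        for start, end in vowel_segments:
            mask = (times >= start) & (times < end)
            if mask.sum() < 2:
                continue                     # not enough frames to fit a line
            slope, _ = np.polyfit(times[mask], values[mask], deg=1)
            slopes.append(abs(slope))
        return np.array(slopes)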

Sentence-End Features: Prosodic values at the end of the sentences may provide additional information about the emotion. For example, an increase in F0 with respect to the rest of the sentence can represent surprise, whereas a lower intensity is usually related to sadness. Therefore, ten more features were extracted from the last vowel detected in the integration time, as shown in Table I. The normalized values are defined as the corresponding non-normalized ones divided by the mean value over all the vowels detected in the integration segment.

3) Voice Quality Features: Features related to voice quality are extracted from the glottal source signal and from the pitch period marks. They were computed only for vocalic segments, in order to consider only segments with reliable glottal source estimation.

The features specified in Table I were calculated for each vowel, and the values corresponding to vowels in the same integration segment were averaged in order to obtain a single feature vector for the whole integration time.

Jitter and shimmer were estimated using the five-point period perturbation quotient (ppq5) and five-point amplitude perturbation quotient (apq5) values as defined in Praat (http://www.praat.org). The normalized amplitude quotient (NAQ) [32] is estimated for every glottal pulse, so the NAQ value for a vowel was calculated by averaging the NAQ values obtained all along that vowel. Similarly, spectral tilt and spectral balance [33] are calculated for every frame, and the value for a vowel was calculated by averaging the values all along the vowel.
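As an illustration, a five-point period perturbation quotient can be approximated as below, following the usual Praat-style definition: the mean absolute deviation of each period from the average of itself and its four nearest neighbours, normalized by the mean period. apq5 is analogous, computed on pulse amplitudes instead of periods. This is a sketch, not the exact Praat implementation.

    import numpy as np

    def ppq5_jitter(pitch_mark_times):
        # pitch_mark_times: times of consecutive glottal pulses within a vowel.
        periods = np.diff(np.asarray(pitch_mark_times, dtype=float))
        if len(periods) < 5:
            return np.nan                    # too few periods for a 5-point quotient
        deviations = [abs(periods[i] - periods[i - 2:i + 3].mean())
                      for i in range(2, len(periods) - 2)]
        return np.mean(deviations) / periods.mean()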

    IV. FEATURE ANALYSIS AND SELECTION

    A. Inter-Emotion and Intra-Emotion Dispersion

TABLE II: DISCRIMINABILITY VALUES OF THE FEATURE FAMILIES MEASURED WITH THE J CRITERION

The capability of a parameter set to retain the emotional characteristics and to avoid the remaining attributes of the speech can be measured in terms of inter-emotion dispersion and intra-emotion dispersion. A large inter-emotion dispersion would mean that the features take very different values for each emotion, separating the distributions and making the classification task easier. A reduced intra-emotion dispersion reflects the consistency of the features within a given emotion. The relation between intra-emotion and inter-emotion dispersions provides a measure of the overlapping of the class distributions. This relation can be estimated using the following criterion [34]:

J = \frac{\mathrm{tr}(S_b)}{\mathrm{tr}(S_w)}    (1)

where \mathrm{tr}(\cdot) denotes the trace of a matrix and S_w and S_b are the intra-class and inter-class dispersion matrices, respectively:

S_w = \frac{1}{N} \sum_{k=1}^{K} \sum_{x \in C_k} (x - \mu_k)(x - \mu_k)^T    (2)

S_b = \frac{1}{N} \sum_{k=1}^{K} |C_k| (\mu_k - \mu)(\mu_k - \mu)^T    (3)

N being the number of training samples, K the number of emotions, C_k the samples of class k, \mu_k the centroid of the class, and \mu the global mean:

\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x    (4)

\mu = \frac{1}{N} \sum_{k=1}^{K} \sum_{x \in C_k} x    (5)

The J criterion is often used in discriminant analysis, and it is a generalization of the well-known Fisher criterion (6) for the multiclass and multidimensional case [35]:

F = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}    (6)
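The computation of this criterion for a given feature family can be sketched as follows, assuming the trace-based form written above (the exact normalization used in [34] may differ slightly).

    import numpy as np

    def dispersion_criterion(X, y):
        # X: (n_samples, n_features) feature matrix; y: emotion label per sample.
        # Returns J = trace(S_b) / trace(S_w).
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        mu = X.mean(axis=0)                          # global mean
        n_feat = X.shape[1]
        Sw = np.zeros((n_feat, n_feat))              # intra-class dispersion
        Sb = np.zeros((n_feat, n_feat))              # inter-class dispersion
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)                   # class centroid
            Sw += (Xc - mu_c).T @ (Xc - mu_c)
            diff = (mu_c - mu)[:, None]
            Sb += len(Xc) * (diff @ diff.T)
        Sw /= len(X)
        Sb /= len(X)
        return np.trace(Sb) / np.trace(Sw)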

J values were computed for each feature family as a first estimation of their capability to discriminate emotions. The results are presented in Table II.

Fig. 1. (a) Scatter plot of suprasegmental prosodic, (b) suprasegmental spectral, (c) segmental prosodic, and (d) segmental spectral features, projected over the two most discriminant directions by LDA. An: anger, Bo: boredom, Di: disgust, Fe: fear, Ha: happiness, Ne: neutral, Sa: sadness.

If we focus on suprasegmental features, it is observed that prosody is less discriminative than spectral envelope statistics. This result is especially significant, as prosodic features are by far the most used ones in the literature for emotion identification.

    The difference shown in Table II could be partly justified by the

    fact that 324 statistics are used in the spectral parametrization

and only 54 in the prosodic one. But when the calculation is repeated using only the best 54 spectral features (see the feature selection procedure in Section IV-C), they still seem to discriminate emotions better, obtaining a higher J value.

    To confirm this result and to show the discriminability of each

    set of features visually, an LDA transformation was applied to

    both spectral and prosodic features. The two most discriminant

    directions are represented graphically in Fig. 1. As can be seen,

emotions are less overlapped when using spectral features than when using prosodic ones.

    It is remarkable that in both cases, the most discriminant

direction (horizontal axis) seems to be related to the activation level, shifting high activation emotions (anger, happiness,

    and fear) to one side and low activation emotions (sadness and

    boredom) to the other.

    In Fig. 1(a), we can already identify the confusions usually

described in the literature when using prosodic features, for example anger and happiness. Disgust is also frequently reported

    to be hard to detect by prosodic features, and this is reflected

    in the scatter plot, too. The distribution of disgust samples has a

    large dispersion and is strongly overlapped with other emotions,

especially with the neutral style. On the contrary, it is clearly separated from the other emotions when using spectral features.

    Going back to the discriminability values in Table II, it can

    be seen that features related to voice quality seem to provide

    very little information, at least the ones considered in this work.

    Nevertheless, when these features are concatenated with the

prosodic ones, they contribute to increasing the J value from 6.47 to 7.14. This is not so surprising if we take into account that features without discrimination power by themselves may be useful when combined with others [36], as in this case. The concatenation of all suprasegmental features provides the highest discrimination, suggesting that the information captured by the spectrum, the prosody, and the voice quality is complementary.

Regarding segmental features, it can be seen that their capability to separate emotions is almost nonexistent. This is clearly

    observed in Fig. 1(c) and (d), where the two most discriminant

    features given by LDA are represented for prosodic primitives

    and LFPC features. All emotions are completely overlapped.

    Due to the short-term nature of the segmental parametrization,

    each LFPC feature vector reflects the spectral envelope of a

    single frame, i.e., it represents the characteristic vocal tract filter

for the phonemes articulated in that frame. As the spectral envelope is much more different for different phonemes than for different emotions, the intra-class dispersion is very large, increasing the overlap among emotions. In the case of intonation

    and intensity, a similar effect occurs, as the frame-wise sam-

ples have more variation due to the linguistic content than to the emotional content of the utterances.


TABLE III: UNSUPERVISED CLUSTERING RESULTS FOR SPECTRAL STATISTICS

This does not mean that the features are useless. Both the J criterion and the LDA are optimal only if the classes have normal homoscedastic distributions, i.e., they have the same covariance matrices [35]. They cannot make use of the subtle differences in the shape of the distributions, which can be captured if the right classification engine is applied. Usually Gaussian mixture models (GMMs) are used for this task, as they can capture small differences among distributions, assuming that enough training samples are provided. As segmental features are extracted once every 10 ms, there are indeed enough training samples to train robust and accurate models. In fact, GMMs are very popular in emotion identification when frame-wise features are used. Unfortunately, this means that no conclusions can be extracted from the J measures for segmental parametrizations. The only way to get an estimation of their capability to distinguish emotions is to perform empirical tests of automatic identification of emotions. The results of these empirical tests are presented in Section V.

    B. Unsupervised Clustering

The J values given above provide a clue about the discrimination power of each feature family, and the scatter plot of the

    most discriminant directions estimated by LDA gives a visual

    representation of it. But using only two directions is somehow

    unrealistic, since the addition of more dimensions could provide

more separation among the classes. Unfortunately, it is not possible to give an understandable graphical representation of more than two dimensions. However, it is possible to obtain descriptive results that can provide some insight into what happens when

    all features are used.

For this purpose, a blind unsupervised clustering was performed using the k-means algorithm. If the emotional classes are correctly separated, the resulting clusters should correspond

    to each emotion. The outcome of this clustering is shown in

    Tables III and IV for suprasegmental features. No clustering was

performed on segmental features because the distributions are so

    overlapped that the algorithm would not be able to locate the

    classes correctly.

The clustering is able to identify the emotions quite accurately, with the spectral parametrization having a better performance than the prosodic one, as predicted by the J values.

    Using spectral statistics, almost all samples belonging to a given

emotion are assigned to the same cluster, whereas prosodic features exhibit the typical confusion among emotions: anger with

happiness and neutral with boredom and sadness. If we consider these tables as classification confusion matrices, prosodic features would achieve an overall accuracy of 75.87%, whereas spectral statistics would get a 98.68%. Note that the expected accuracy of an emotion identification system is much lower, because the test utterances will not be seen during the training.

TABLE IV: UNSUPERVISED CLUSTERING RESULTS FOR PROSODIC STATISTICS
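The clustering experiment can be reproduced approximately with the sketch below: k-means is run blindly on the feature vectors and each cluster is then assigned to its dominant emotion, which is how the tables are read as confusion matrices here. The majority-vote mapping is an interpretation of the text, not a detail given in the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def clustering_accuracy(X, labels, n_emotions=7, seed=0):
        # labels: integer emotion label per sample (0..n_emotions-1).
        labels = np.asarray(labels)
        clusters = KMeans(n_clusters=n_emotions, n_init=10,
                          random_state=seed).fit_predict(X)
        correct = 0
        for c in range(n_emotions):
            members = labels[clusters == c]
            if len(members):
                # samples of the dominant emotion in the cluster count as correct
                correct += np.bincount(members).max()
        return correct / len(labels)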

    C. Feature Selection

The use of noninformative or redundant features may decrease the accuracy of a classifier, due to the confusion they add to the system. A feature selection algorithm can help identify the truly useful features, reducing the dimensionality

    of the parametrization and making the classifier run faster and

more accurately. Furthermore, detecting the discriminative features may provide a deeper understanding of the influence of the emotions on the acoustic characteristics of the voice.

The minimal-redundancy-maximal-relevance (mRMR) algorithm [37] has been used to get a ranking of the features, from

    the most to the least significant one. This algorithm selects

    the features that maximize the mutual information between

    the training samples and their classes (maximal relevance) and

simultaneously minimize the dependency among the selected features (minimal redundancy). mRMR has been applied to

    all five parametrization families defined in Section III, as well

    as to their combinations. The ranking of these combinations,

    presented in Table V, is very interesting as it shows which

    parametrization is preferred. When all suprasegmental features

    are concatenated, among the first ten features ranked, we find

four prosodic and six spectral features. Looking further, up to position 30, we find ten prosodic, one voice quality, and

    19 spectral features. These results are in line with the ones

    obtained in [16] and [17], supporting the previous results in

    Section IV-A that suggest that spectral suprasegmental features

provide more information about the emotion than prosodic ones. Voice quality features are in general ranked low in the

    list, even though some studies have claimed their relation with

    the emotional state of the speaker [19], [20]. Probably the low

    ranking is due to the automatic nature of the parametrization. In

    works dealing with voice quality, features are usually extracted

    with human intervention, providing very accurate values. In the

    case presented in this paper, all processing was fully automatic,

    and even though voice quality features were extracted only

    for vowels (which are supposedly very stable), the resulting

    estimation errors may increase the confusability in the system,

    making them not very suitable for identification. Errors during

the automatic vowel detection may further increase this confusability. These results are also confirmed by the low J values obtained by voice quality features.
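A greedy version of the mRMR selection described above can be sketched as follows, using mutual information estimates from scikit-learn. This is a simplified illustration (and slow for feature sets as large as the ones used here), not the implementation of [37].

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

    def mrmr_ranking(X, y, n_select=30, seed=0):
        # Greedy mRMR: at each step pick the feature with the largest mutual
        # information with the class (relevance) minus its mean mutual
        # information with the features already selected (redundancy).
        relevance = mutual_info_classif(X, y, random_state=seed)
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < n_select:
            scores = []
            for f in remaining:
                if selected:
                    redundancy = np.mean([mutual_info_regression(
                        X[:, [f]], X[:, s], random_state=seed)[0] for s in selected])
                else:
                    redundancy = 0.0
                scores.append(relevance[f] - redundancy)
            best = remaining[int(np.argmax(scores))]
            selected.append(best)
            remaining.remove(best)
        return selected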


TABLE V: BEST TEN FEATURES RANKED FOR FEATURE COMBINATIONS

The ranking for segmental parametrizations was performed separately for the voiced and unvoiced streams. Following the

    approach taken with prosodic primitives, LFPC features were

    also divided into voiced and unvoiced streams, so that the frame-

    wise spectral parametrization and the prosodic primitives can

    be easily combined by concatenation. In the combination of

the voiced streams, the three intonation features (F0 and its first and second derivatives) are ranked among the ten best ones, while the first and second derivatives of the intensity are between positions 10 and 20. The frame intensity itself is far below in the ranking for the voiced stream, but appears in second position for the unvoiced one.

Having the prosodic primitives ranked so high in the list corroborates the importance of these features for emotion identification.

    V. EXPERIMENTAL EVALUATION

    A. Experimental Framework

In order to validate the results of the feature analysis, emotion identification tests were carried out on the Berlin database.

Suprasegmental features were modeled with SVMs using an RBF

    kernel, whereas GMMs were used for segmental features. This

    way, the characteristics of each parametrization, as shown

    in Section IV-A, can be exploited. On the one hand, GMMs

should be able to model the subtle differences in the distribution of the highly overlapped segmental features, thanks to the

    large number of training samples provided by the frame-wise

    parametrization. On the other hand, SVMs can take advantage

of the larger separability of suprasegmental features. Furthermore, the high generalization capability exhibited by SVMs

    [38] will guarantee the creation of robust models, even though

    the suprasegmental parametrization provides very few training

    samples. In the case of SVMs, the one-vs-all approach [39] was

    used for the multiclass classification.

    The experimental framework was designed as a nested

    double cross-validation (see Fig. 2). The outer level ensures

    speaker independent results, where the speakers of the test

recordings have not been seen during the training. The inner level provides development tests for the optimization of the classifiers.

Fig. 2. Nested double cross-validation. In the (a) outer level, five blocks are defined according to speakers. Four are used for training and the last one for speaker independent testing. In the (b) inner level, the recordings in the training set are randomly rearranged to form five new sub-blocks for development purposes.

The speakers in the database are divided into five blocks for the outer level. Each block contains one male and one female speaker, so that gender balance is kept within the blocks. For the i-th loop, four of the blocks define the training set, leaving the remaining block for testing. For the inner loop, the sentences available in the training set are randomly distributed into five sub-blocks, which are then used for the inner-level cross-validation. Once the five inner-level loops have ended, their results are gathered and used to estimate optimal values for the number of mixtures in the GMM, the RBF kernel spread, the SVM misclassification cost, and the optimum number of features. Finally, the whole training set is used to train the system for the i-th loop in the outer level and to perform the testing on the held-out block.
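The nested protocol can be sketched as below for one of the suprasegmental SVM systems. The speaker-grouped outer split and the randomly shuffled inner split follow the description above, but the hyper-parameter grid, the use of scikit-learn's GroupKFold (which does not enforce the gender balance mentioned in the text), and the one-vs-rest wrapper are assumptions of this sketch.

    import numpy as np
    from sklearn.model_selection import GroupKFold, KFold
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    def nested_cv_accuracy(X, y, speakers, grid):
        # grid: list of (C, gamma) pairs for the RBF-kernel SVM.
        test_scores = []
        for train, test in GroupKFold(n_splits=5).split(X, y, groups=speakers):
            # inner level: tune (C, gamma) on randomly re-split development folds
            best_params, best_score = None, -1.0
            for C, gamma in grid:
                inner = []
                for tr, dev in KFold(n_splits=5, shuffle=True,
                                     random_state=0).split(train):
                    clf = OneVsRestClassifier(SVC(C=C, gamma=gamma, kernel="rbf"))
                    clf.fit(X[train[tr]], y[train[tr]])
                    inner.append(clf.score(X[train[dev]], y[train[dev]]))
                if np.mean(inner) > best_score:
                    best_params, best_score = (C, gamma), np.mean(inner)
            # outer level: retrain on the whole training set, test on unseen speakers
            clf = OneVsRestClassifier(SVC(C=best_params[0], gamma=best_params[1],
                                          kernel="rbf"))
            clf.fit(X[train], y[train])
            test_scores.append(clf.score(X[test], y[test]))
        return float(np.mean(test_scores))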

    B. Selection of the Number of Features

In order to estimate the optimum number of features, development tests were repeated adding one feature at a time, according to the ranking obtained in Section IV-C. Fig. 3 shows the resulting accuracy in the development tests using suprasegmental

    features, as a function of the number of features.

    According to these results, we can observe that, even though

voice quality parametrization showed a very low class separation (see Table II), it does not perform so badly considering the low number of features it uses. With all five voice quality features, the system gets 49.4% of correct classifications, whereas with the best five prosodic features, it gets 50.7%. Furthermore, the combination of prosody and voice quality seems to be beneficial, as predicted by the estimated class separation values: the best prosodic system reaches a maximum of 65.5% with 39 features, whereas the combination obtains a 67.4% with 17 features. Not only does the accuracy increase, but it also reaches

    its maximum with fewer features.

    Suprasegmental spectrum statistics clearly outperform

    prosody, at least when more than 15 features are used, which

again confirms the conclusions obtained with the class separability analysis. Spectrum statistics reach a maximum of 75.4%

    accuracy with 96 features and keep steady from then on. The

    combination of all suprasegmental parametrizations obtains

    the best results if more than 25 features are used, reaching

    an almost steady state at 152 features with 77.9% accuracy,

    with a marginally better absolute maximum of 78.6% with 247

    features.

Fig. 3. Development results for suprasegmental parametrizations as a function of the number of features. (a) is a zoom view to see the results with few features.

TABLE VI: ACCURACY ON DEVELOPMENT AND TEST RESULTS WITH SELECTED NUMBER OF FEATURES AND ALL FEATURES

Fig. 4. Development results for segmental parametrizations as a function of the number of features.

The development results for segmental features are represented in Fig. 4. None of the curves reach a real saturation point, as the accuracy continues growing as new features are

    added. But it can be seen that the improvement is very small

once 20 features have been used. For example, LFPC parameters get 70.9% accuracy with 20 features, and when all 54 are used, this number increases only to 72.9%. When voiced and unvoiced frames are treated separately, the accuracy of LFPC

    decreases, which seems reasonable as there are approximately

    half the training samples for each stream.

    Results with prosodic primitives seem rather modest,

    reaching 65.3% accuracy in the voiced stream and 50.8% in the

    unvoiced one. But it should be kept in mind that only six and

three features are used, respectively. If, for example, only the

    best six features in the voiced LFPC stream are kept, the system

    gets 58.8% accuracy. However, adding the prosodic primitives

    to LFPC features does not improve the results significantly.

    According to the ranking shown in Table V, prosodic primitives

    are more informative than most LFPC parameters, as they are

among the first 10 or 20 features selected. However, this is true only if few features are selected, e.g., less than 15. When the

    number of features increases, the LFPC features that are added

    compensate for the information of the prosodic primitives, so

    that at the end, the combination has no effect on the overall

    accuracy.

    Table VI summarizes the development results for each

    parametrization family and their combinations, both with the

estimated optimal number of features and with the complete set. The final speaker independent test results are also presented for each case. Looking at these final test results, it can

    be concluded that the spectral statistics are the best isolated

    parametrization, reaching a 70.5% accuracy with 96 features

    (and almost the same accuracy with all 324 features). Among

    the feature combinations, the best result is obtained combining

    all suprasegmental features, providing 72.2% correct answers

    with 152 features and 72.5% with all 383.

    C. Late Fusion Combination

So far, early fusion schemes have been used, i.e., concatenating parametrizations. But it may be interesting to see if results improve with a late fusion system, i.e., combining not

    the features themselves but the results of the classifiers trained

    with them. Furthermore, the late fusion allows combining the

    information captured by parametrizations of different temporal

    structure, such as segmental and suprasegmental features or the

    voiced and unvoiced streams.

    An SVM-based fusion system [40], [41] has been used for this

    task. Given an utterance to be classified and a set of classifiers,

    a vector is formed with the scores provided by the classifiers for

    each emotion. This score vector is then classified by the fusion

    SVM to get the final decision. With an appropriate training, the

    SVM is expected to learn the score patterns of the errors and

hits of the classifiers and improve the results. The scores of the development tests were used for this training.
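A minimal sketch of such a fusion stage is given below: the per-emotion scores from the individual classifiers are concatenated into one vector per utterance, and a fusion SVM trained on the development scores produces the final decision. The kernel choice and any score normalization are assumptions of this sketch, not details taken from [40], [41].

    import numpy as np
    from sklearn.svm import SVC

    def train_fusion_svm(dev_scores, dev_labels):
        # dev_scores: list with one (n_utterances, n_emotions) score matrix per
        # classifier, obtained on the development data.
        fusion_input = np.hstack(dev_scores)          # one score vector per utterance
        return SVC(kernel="rbf").fit(fusion_input, dev_labels)

    def late_fusion_decision(fusion_svm, test_scores):
        # test_scores: same structure as dev_scores, for the test utterances.
        return fusion_svm.predict(np.hstack(test_scores))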


TABLE VII: ACCURACY ON LATE FUSION TEST RESULTS WITH SELECTED NUMBER OF FEATURES AND ALL FEATURES

    Results for the late fusion system are shown in Table VII. The

fusion of suprasegmental features (column 1) achieves very similar results with both early and late fusion systems. Segmental features (column 2), on the other hand, show a great improvement. However, this improvement is partly due to the combination of both the voiced and unvoiced streams of LFPC and

    prosodic primitives, which is not possible with the early fusion.

Modeling the voiced and unvoiced streams separately in segmental features and combining them afterwards through late fusion provides good results. While LFPC parametrization obtains 69.9% accuracy with 20 features and 72.2% with all 54, the fusion of its streams (column 3) gets 72.0% and 76.5%, respectively (an error reduction of 7% and 15%, respectively). A

    noticeable improvement can also be seen when combining the

streams of prosodic primitives (column 4). Therefore, we decided to keep voiced and unvoiced streams separated.

The late fusion system can also be used to combine segmental and suprasegmental systems. Combining the results from prosody and LFPC (column 5) yields a similar accuracy

    as combining the results from long-term spectral statistics

    and frame-wise prosodic primitives (column 6). In both cases,

    the accuracy is higher than when using the spectral statistics

alone (the best isolated system) and slightly better than using the early fusion of all suprasegmental features (the best early

    fusion system).

    When combining LFPC with spectral statistics (column 7)

    or prosody with prosodic primitives (column 8), there is also

    a significant improvement. This means that fusing systems that

    use features of the same acoustic origin but different time span

    can also be helpful to reduce classification errors.

    As a last experiment, all features were combined with the

    late fusion system (column 9): suprasegmental prosody, voice

    quality, and spectral statistics together with segmental LFPC

    and prosody primitives, with separated voiced and unvoiced

    streams. Altogether, seven classifiers were combined in this last

test. The obtained results are the best among all the systems tested: 78.3% for all features and 76.8% for selected ones,

    which implies a 20% error reduction with respect to the best

    early fusion system.

    D. Analysis of the Results by Emotion

    Altogether, it seems that spectral characteristics provide

higher identification accuracies than prosodic ones. Nevertheless, prosodic features may be more appropriate to identify certain emotions, even though they are not the best parametrization overall. The identification rates of each emotion have been

examined separately, in order to verify whether or not this is the case.

Fig. 5. Comparison of the identification accuracy for each emotion with prosodic and spectral parameters.

Fig. 5 presents the identification rates obtained for each emotion with some representative parametrizations: prosodic and spectral statistics in the case of suprasegmental features,

    and the late fusion of voiced and unvoiced streams of LFPC

    and prosody primitives in the case of segmental features.

It can be observed that fear and happiness are the worst identified emotions, with less than 50% of recordings correctly classified in most cases. Looking into the results, we observed that happiness is mostly confused with anger, while fear is confused with neutral with suprasegmental features and with neutral and happiness with segmental parametrizations. This result suggests

    that the considered parametrizations are not suitable to capture

the characteristics of these emotions, and that other features should be taken into account.

    Both anger and neutral get similar accuracies with prosodic

    and spectral long-term statistics. On the other hand, LFPC

    seems a more suitable short-term parametrization than prosody

    primitives for these emotions. In the case of boredom and

sadness, it is the segmental features that obtain similar identification rates, whereas in the suprasegmental parametrizations,

    the spectral statistics get better results. Finally, disgust seems to

    be very difficult to detect with prosody-related features, but the

    accuracy increases significantly with spectral characteristics,

    both in segmental and suprasegmental parametrizations.

    Spectral features provide similar or higher accuracies than

    prosodic ones in all emotions. This suggests that, even when

    each emotion is considered on its own, spectral characteristics

    are more suitable for emotion identification.

    VI. CONCLUSION

The main goal of the work presented here was to analyze features related to prosody, spectral envelope, and voice quality

    regarding their capability to separate emotions. This analysis

sheds light on an aspect of the automatic recognition of emotions in speech that is not fully covered in the literature. Although there are many studies that analyze how individual features change according to the emotional state of the speaker [7],

    [8], [42], [43], they do not evaluate the behavior of the whole

    feature set. This can lead to inaccurate conclusions about the

    usefulness of a given parametrization, as the features that are

    most discriminant individually may not be the most discriminant

    when taken together. Indeed, most of these studies show that,

when considered individually, changes in the prosodic features are more significant than changes in the spectral parameters. However, the results presented in this work suggest that if the whole feature set is considered, spectral envelope parametrizations are more informative than prosodic ones.

The analysis presented in this paper has been performed by means of discriminability measures, unsupervised clustering, and feature ranking. The results have been confirmed with empirical experiments of automatic identification of emotions. The discriminability criterion that has been used provides a way to estimate the performance of the whole feature set instead of individual parameters. These measures have been complemented with unsupervised clustering to see whether or not the parametrizations provide enough class separation. Both methods reveal that long-term statistics of spectral envelope

    provide larger separation among emotions than traditional

    prosodic features. This explains the higher accuracy of the

    former in the experimental tests. The measures also show

    that combining suprasegmental parametrizations of spectral

    envelope, prosody, and voice quality increases the separation

    among emotions, confirming that the use of features extracted

from different information sources helps in reducing the identification error.

Unfortunately, these methods are not suitable for the analysis of segmental parametrizations. Instead, a feature ranking algorithm has been applied to both segmental and suprasegmental parametrizations to detect the most discriminant features. Although many papers in the field mention the use of some kind of feature selection algorithm, only a few of them discuss the outcome of this selection. When this outcome is provided, long-term spectral statistics are usually selected first, over prosodic ones [16], [17], suggesting that suprasegmental spectral features provide more information about the emotional state of the speaker. The feature selection results shown in the present paper are in agreement with this conclusion.

Some works in the literature rely entirely on empirical experiments to determine the capacity of a given parametrization to identify emotions. Although this approach is suitable to find out the most discriminant parametrization, it gives no further information about the relation among these features. Furthermore,

    most works that use a combination of different sets of features

    (e.g., prosodic and spectral) provide only the accuracy results

    for that combination, but not separately for each feature set, so

    it is not possible to deduce which set is more informative. In

    this work, we have analyzed each feature set independently and

    in combination with the others. The experimental results have

also been given for each parametrization and for their combinations. This way, it is possible to measure the accuracy gain

    when different feature sets are combined, and see whether or

    not this combination is advantageous. For example, even though

segmental prosodic primitives were ranked high by the feature selection algorithm, it has been shown that the combination of frame-wise LFPC and prosodic primitives gets similar accuracy to the LFPC alone. Only with a reduced number of features is this combination helpful (see Fig. 4). Also for suprasegmental parametrizations, prosodic features perform better than

    the spectral ones only when few features are used [Fig. 3(a)].

    According to these results, we can say that traditional prosodic

features seem to be the most appropriate for automatic identification of emotions only if they are considered individually or in

a very reduced feature set. But if large feature sets are considered, spectral features outperform them.

    The analysis presented in this paper has been performed using

    an acted speech emotional database. In order to see whether the

results are applicable to real-life systems, they should be validated using different databases, especially databases of natural emotions. Nevertheless, the database used in this work is supposed to contain highly natural emotion portrayals, so it is very likely that these conclusions are valid to a great extent. A previous work using the Aibo database of natural emotions [44] showed similar conclusions, with frame-wise MFCC features outperforming long-term prosodic statistics [45]. Also in [12], [13], and [46], MFCC features get better results than suprasegmental ones.

The results from the late fusion tests suggest that the combined use of parametrizations extracted from the same information source but with different time scales, i.e., segmental and suprasegmental features, increases the accuracy of the system. The difference in the time scales and in the classifiers makes each subsystem retain different characteristics of the emotions. In fact, one of the best results has been achieved with the late fusion of long-term LFPC statistics with the voiced and unvoiced streams of frame-wise LFPC features. This system has

only been outperformed by the late fusion of all segmental and suprasegmental features, including spectral envelope, prosody, and voice quality. However, this last system is much more complex and the obtained improvement is very small. The use of all

    features requires estimating LFPC, intonation values, voiced-

    unvoiced decision, pitch period marking, inverse filtering, and

vowel detection. This makes the parametrization step very complex and time-consuming. Furthermore, it uses seven different

    classifiers prior to the fusion. The spectral system on the other

    hand needs only LFPCs, some simple statistics obtained from

    them, and the voiced-unvoiced decisions in order to separate the

streams, reducing the number of classifiers to three. The difference in the accuracy may not justify increasing the complexity

of the system to such a degree.

    We are not claiming that features extracted from prosody are

    useless. Several papers show that humans are able to identify

    emotions in prosodic copy-synthesis experiments [5], [47],

    confirming that prosody does carry a great amount of emotional

    information, at least for some emotions. But the traditional

    prosodic representations may not be well suited to capture this

    information. On the one hand, long-term statistics estimated

over the whole sentence lose the information of specific characteristic prosodic events. On the other hand, short-term prosodic

    primitives do not capture the prosodic structure correctly, which

    is suprasegmental by definition. The results suggest that a new

more elaborate representation is needed to effectively extract the emotional information contained in the prosody.


    REFERENCES

[1] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Commun., vol. 40, pp. 227–256, Apr. 2003.
[2] K. R. Scherer, "Psychological models of emotion," in The Neuropsychology of Emotion, J. Borod, Ed. Oxford, U.K.: Oxford Univ. Press, 2000, ch. 6, pp. 137–166.
[3] P. Ekman, "An argument for basic emotions," Cognit. Emotion, vol. 6, pp. 169–200, 1992.
[4] C. Darwin, The Expression of the Emotions in Man and Animals, 3rd ed. Oxford, U.K.: Oxford Univ. Press, 1998.
[5] F. Burkhardt and W. F. Sendlmeier, "Verification of acoustical correlates of emotional speech using formant-synthesis," in Proc. ISCA Tutorial and Research Workshop on Speech and Emotion, Belfast, Ireland, Sep. 2000, pp. 151–156.
[6] C. F. Huang and M. Akagi, "A three-layered model for expressive speech perception," Speech Commun., vol. 50, pp. 810–828, Oct. 2008.
[7] K. R. Scherer, R. Banse, H. G. Wallbott, and T. Goldbeck, "Vocal cues in emotion encoding and decoding," Motiv. Emotion, vol. 15, no. 2, pp. 123–148, 1991.
[8] M. Schröder, "Speech and emotion research," Ph.D. dissertation, Universität des Saarlandes, Saarbrücken, Germany, 2003.
[9] E. Navas, I. Hernáez, A. Castelruiz, J. Sánchez, and I. Luengo, "Acoustic analysis of emotional speech in standard Basque for emotion recognition," in Progress in Pattern Recognition, Image Analysis and Applications, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, Oct. 2004, vol. 3287, pp. 386–393.
[10] D. Erickson, "Expressive speech: Production, perception and application to speech synthesis," Acoust. Sci. Tech., vol. 26, pp. 317–325, 2005.
[11] T. L. Nwe, S. W. Foo, and L. C. de Silva, "Speech emotion recognition using hidden Markov models," Speech Commun., vol. 41, pp. 603–623, Jun. 2003.
[12] S. Kim, P. G. Georgiou, S. Lee, and S. Narayanan, "Real-time emotion detection system using speech: Multi-modal fusion of different timescale features," in Proc. Int. Workshop Multimedia Signal Processing, Crete, Greece, Oct. 2007, pp. 48–51.
[13] B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll, "Frame vs. turn-level: Emotion recognition from speech considering static and dynamic processing," in Affective Computing and Intelligent Interaction, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, 2007, vol. 4738, pp. 139–147.
[14] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Comput. Appl., vol. 9, pp. 290–296, Dec. 2000.
[15] O. W. Kwon, K. Chan, J. Hao, and T. W. Lee, "Emotion recognition by speech signals," in Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 125–128.
[16] T. Vogt and E. André, "Improving automatic emotion recognition from speech via gender differentiation," in Proc. LREC, Genoa, Italy, May 2006.
[17] B. Schuller, R. Müller, M. Lang, and G. Rigoll, "Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 805–808.
[18] R. López-Cózar, Z. Callejas, M. Kroul, J. Nouza, and J. Silovský, "Two-level fusion to improve emotion classification in spoken dialogue systems," in Text, Speech and Dialogue, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, 2008, vol. 5246, pp. 617–624.
[19] M. Lugger and B. Yang, "The relevance of voice quality features in speaker independent emotion recognition," in Proc. ICASSP, Honolulu, HI, Apr. 2007, vol. 4, pp. 17–20.
[20] C. Gobl and A. N. Chasaide, "The role of voice quality in communicating emotion, mood and attitude," Speech Commun., vol. 40, pp. 189–212, Apr. 2003.
[21] R. Tato, R. Santos, R. Kompe, and J. Pardo, "Emotional space improves emotion recognition," in Proc. ICSLP, Sep. 2002, pp. 2029–2032.
[22] R. Müller, B. Schuller, and G. Rigoll, "Enhanced robustness in speech emotion recognition combining acoustic and semantic analyses," in Proc. From Signals to Signs of Emotion and Vice Versa, Santorini, Greece, Sep. 2004.
[23] A. Nogueiras, A. Moreno, A. Bonafonte, and J. B. Mariño, "Speech emotion recognition using hidden Markov models," in Proc. Eurospeech, Aalborg, Denmark, Sep. 2001, pp. 2679–2682.
[24] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[25] D. Ruta and B. Gabrys, "An overview of classifier fusion methods," Comput. Inf. Syst., vol. 7, no. 1, pp. 1–10, Feb. 2000.
[26] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 1517–1520.
[27] J. Ramirez, J. C. Segura, C. Benitez, A. de la Torre, and A. Rubio, "Efficient voice activity detection algorithms using long-term speech information," Speech Commun., vol. 42, pp. 271–287, Apr. 2004.
[28] P. Alku, H. Tiitinen, and R. Näätänen, "A method for generating natural-sounding speech stimuli for cognitive brain research," Clin. Neurophysiol., vol. 110, pp. 1329–1333, Aug. 1999.
[29] I. Luengo, I. Saratxaga, E. Navas, I. Hernáez, J. Sánchez, and I. Sainz, "Evaluation of pitch detection algorithms under real conditions," in Proc. ICASSP, Honolulu, HI, Apr. 2007, pp. 1057–1060.
[30] I. Luengo, E. Navas, J. Sánchez, and I. Hernáez, "Detección de vocales mediante modelado de clusters de fonemas," Procesado del Lenguaje Natural, vol. 43, pp. 121–128, Sep. 2009.
[31] F. Ringeval and M. Chetouani, "Exploiting a vowel based approach for acted emotion recognition," in Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, Oct. 2008, vol. 5042, pp. 243–254.
[32] T. Bäckström, P. Alku, and E. Vilkman, "Time-domain parameterization of the closing phase of glottal airflow waveform from voices over a large intensity range," IEEE Trans. Speech Audio Process., vol. 10, no. 3, pp. 186–192, Mar. 2002.
[33] R. van Son and L. Pols, "An acoustic description of consonant reduction," Speech Commun., vol. 28, pp. 125–140, Jun. 1999.
[34] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1990.
[35] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York: Wiley, 2001.
[36] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.
[37] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[38] C. J. Burges, "A tutorial on support vector machines for pattern recognition," Data Min. Knowl. Discov., vol. 2, pp. 121–167, 1998.
[39] C. W. Hsu and C. J. Lin, "A comparison of methods for multi-class support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.
[40] J. Fierrez-Aguilar, D. Garcia-Romero, J. Ortega-Garcia, and J. Gonzalez-Rodriguez, "Adapted user-dependent multimodal biometric authentication exploiting general information," Pattern Recognit. Lett., vol. 26, no. 16, pp. 2628–2639, Dec. 2005.
[41] B. Gutschoven and P. Verlinde, "Multi-modal identity verification using support vector machines (SVM)," in Proc. Int. Conf. Information Fusion, Paris, France, Jul. 2000, vol. 2, pp. 3–8.
[42] R. Banse and K. R. Scherer, "Acoustic profiles in vocal emotion expression," J. Personal. Social Psychol., vol. 70, no. 3, pp. 614–636, 1996.
[43] L. Devillers, I. Vasilescu, and L. Vidrascu, "F0 and pause features analysis for anger and fear detection in real-life spoken dialogs," in Proc. Speech Prosody, Nara, Japan, Mar. 2004, pp. 205–208.
[44] A. Batliner, S. Steidl, B. Schuller, D. Seppi, K. Laskowski, T. Vogt, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson, "Combining efforts for improving automatic classification of emotional user states," in Proc. Information Society-Language Technologies Conf. (IS-LTC), Ljubljana, Slovenia, Oct. 2006, pp. 240–245.
[45] I. Luengo, E. Navas, and I. Hernáez, "Combining spectral and prosodic information for emotion recognition in the Interspeech 2009 Emotion Challenge," in Proc. Interspeech, Brighton, U.K., Sep. 2009, pp. 332–335.
[46] I. Luengo, E. Navas, I. Hernáez, and J. Sánchez, "Automatic emotion recognition using prosodic parameters," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 493–496.
[47] E. Navas, I. Hernáez, and I. Luengo, "An objective and subjective study of the role of semantics in building corpora for TTS," IEEE Trans. Speech Audio Process., vol. 14, no. 4, pp. 1117–1127, Jul. 2006.

Iker Luengo received the telecommunication engineering degree from the University of the Basque Country, Bilbao, Spain, in 2003 and is currently pursuing the Ph.D. degree in telecommunications.
He has been a researcher in the AhoLab Signal Processing Group in the Electronics and Telecommunications Department since 2003. He has participated as a research engineer in government-funded R&D projects focused on emotional speech, speaker recognition, diarization of meetings, and speech prosody.
Mr. Luengo is a member of the International Speech Communication Association (ISCA) and the Spanish thematic network on Speech Technologies (RTTH).

Eva Navas received the telecommunication engineering degree and the Ph.D. degree from the Department of Electronics and Telecommunications of the University of the Basque Country, Bilbao, Spain.
Since 1999, she has been a researcher at the AhoLab Signal Processing Group. She is currently teaching at the Faculty of Industrial and Telecommunication Engineering in Bilbao. She has participated as a research engineer in government-funded R&D projects as well as in privately-funded research contracts. Her research is focused on expressive speech characterization, recognition, and generation.
Dr. Navas is a member of the International Speech Communication Association (ISCA), the Spanish thematic network on Speech Technologies (RTTH), and the European Center of Excellence on Speech Synthesis (ECESS).

Inmaculada Hernáez received the telecommunications engineering degree from the Universitat Politècnica de Catalunya, Barcelona, Spain, and the Ph.D. degree in telecommunications engineering from the University of the Basque Country, Bilbao, Spain, in 1987 and 1995, respectively.
She is a Full Professor in the Electronics and Telecommunication Department, Faculty of Engineering, University of the Basque Country, in the area of signal theory and communications. She is a founding member of the AhoLab Signal Processing Research Group. Her research interests are signal processing and all aspects related to speech processing. She is also interested in the development of speech resources and technologies for the Basque language.
Dr. Hernáez is a member of the International Speech Communication Association (ISCA), the Spanish thematic network on Speech Technologies (RTTH), and the European Center of Excellence on Speech Synthesis (ECESS).