
Feature Analysis and Evaluation for Automatic Emotion Identification in Speech

Iker Luengo, Eva Navas, and Inmaculada Hernáez

Abstract—The definition of parameters is a crucial step in the development of a system for identifying emotions in speech. Although there is no agreement on which are the best features for this task, it is generally accepted that prosody carries most of the emotional information. Most works in the field use some kind of prosodic features, often in combination with spectral and voice quality parametrizations. Nevertheless, no systematic study has been done comparing these features. This paper presents the analysis of the characteristics of features derived from prosody, spectral envelope, and voice quality, as well as their capability to discriminate emotions. In addition, early fusion and late fusion techniques for combining different information sources are evaluated. The results of this analysis are validated with experimental automatic emotion identification tests. Results suggest that spectral envelope features outperform the prosodic ones. Even when different parametrizations are combined, the late fusion of long-term spectral statistics with short-term spectral envelope parameters provides an accuracy comparable to that obtained when all parametrizations are combined.

Index Terms—Emotion identification, information fusion, parametrization.

    I. INTRODUCTION

FEATURES extracted from the speech signal have a great effect on the reliability of an emotion identification system. Depending on these features, the system will have a certain capability to distinguish emotions and will be able to deal with speakers not seen during the training. Many works in the field of emotion recognition are aimed at finding the most appropriate parametrization, yet there is no clear agreement on which feature set is best.

One of the major problems in determining the best features for emotion identification is that there is no solid theoretical basis relating the characteristics of the voice with the emotional state of the speaker [1]. That is why most of the works in this field are based on features obtained from direct comparison of speech signals portraying different emotions. The comparison makes it possible to estimate the acoustic differences among them, identifying features that could be useful for emotion identification.

Manuscript received December 11, 2009; revised March 22, 2010; accepted May 06, 2010. Date of current version September 15, 2010. This work was supported in part by the Spanish Government under the BUCEADOR project (TEC2009-14094-C04-02) and in part by the AVIVAVOZ project (TEC2006-13694-C03-02). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hamid K. Aghajan.

The authors are with the Department of Electronics and Telecommunications, University of the Basque Country, Bilbao 48013, Spain (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2010.2051872

The only widely known theory that describes the physiological changes caused by an emotional state [2], [3] takes the Darwinian theory [4] as a reference, considering emotions as a result of evolutionary needs. According to this theory, each emotion induces some physiological and psychological changes in order to prepare us to do something, e.g., fear prepares us to run from danger. These changes have a certain influence on the speech characteristics, mainly on those related to intonation, intensity, and speaking rate, i.e., on prosody.

For many years, automatic identification systems have used prosodic features almost exclusively, mainly because of the aforementioned theory. The relation between emotion and prosody is reflected in many works in the literature [5]–[9]. A detailed summary of these works and their conclusions is presented in [10]. Prosodic features are mostly used in the form of long-term statistics, usually estimated over the whole utterance. The most common ones are simple statistics such as mean, variance, minimum, maximum, or range. But it is also possible to use more complex representations in an attempt to retain more emotional information.
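As an illustration of this kind of representation, the following sketch computes the simple long-term statistics mentioned above (mean, variance, minimum, maximum, and range) over a prosodic contour. The function name and interface are hypothetical and are not part of the parametrization described in this paper.

    import numpy as np

    def longterm_prosodic_stats(contour):
        # Long-term statistics of a prosodic contour (e.g., the F0 or
        # intensity values of one utterance), as commonly used in the
        # literature reviewed above.
        contour = np.asarray(contour, dtype=float)
        return {
            "mean": contour.mean(),
            "variance": contour.var(),
            "minimum": contour.min(),
            "maximum": contour.max(),
            "range": contour.max() - contour.min(),
        }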

The use of prosodic features gives a certain reiterative confusion pattern among emotions. They seem to be able to discriminate high arousal emotions (anger, happiness) from low arousal ones (sadness, boredom) easily. But the confusion level for emotions of the same arousal level is very large [1]. However, humans are able to distinguish anger from happiness and sadness from boredom accurately. This reinforces the idea that there are some other voice characteristics useful for emotion identification. Some works in the literature use spectral measurements or voice quality features, showing the importance of these parameters for emotion identification.

The vocal tract is also influenced by the emotional state, and so are the spectral characteristics of the voice. The relationship between emotion and spectral characteristics is empirically supported by several works [7], [11]–[14]. Furthermore, taking into account that prosody is determined mainly by the vocal fold activity and that the spectrum envelope is mostly influenced by the vocal tract, it is reasonable to assume that they are not very correlated. Therefore, both types of features provide different information about the emotional state of the speaker, and the combination of prosodic and spectral parametrizations may be interesting. In fact, most of the works using spectral measures use them together with prosodic features, significantly reducing the error rate with respect to using only one of the parametrizations [12], [13], [15]–[18].

The effect that emotions have on the vocal folds causes

    changes in the voice quality, too [19], [20], so in the last couple

    of years, features related to voice quality have also been used


    to help emotion identification [19], [21]. Nevertheless, few

    authors use them, due to the difficulty of extracting the glottal

    signal from the speech [20]. The estimation of the glottal signal

    can only be done with acceptable accuracy on very stable

    speech segments, which makes it difficult to extract voice

    quality features by automatic means.

Finally, it is also possible to use linguistic features such as the occurrence of sighs or certain emotive words [17], [18], [22]. However, only a few authors use them. On the one hand, in most cases an automatic speech recognition (ASR) system is needed, which increases the complexity of the system. On the other hand, this kind of feature only makes sense with spontaneous speech, and most of the works in the field have been developed using read speech databases, often with the same texts for all emotions.

Best results are obtained when parametrizations of different nature are combined in order to increase the available information. When all features have the same temporal structure, this combination can be done by simply concatenating the feature vectors. The problem arises when the temporal structure is different for the considered parametrizations. For example, prosodic information is usually given in the form of long-term statistics, while traditional spectral envelope parametrizations like Mel frequency cepstral coefficients (MFCC) or linear prediction cepstral coefficients (LPCC) are extracted for each frame. A simple solution applied in many works is to calculate long-term statistics of spectral features, making it possible to concatenate them to the prosodic feature vector [15]–[17]. Another alternative is to do the opposite. Instead of calculating statistics of spectral features, frame-wise F0 and intensity values are used and appended directly to the MFCC or LPCC vectors [14], [23]. In [15], both strategies are compared, obtaining very similar results in both cases. A more elaborate solution uses a late fusion of classifiers [24], [25], i.e., training a different classifier with each parametrization and combining the results of these classifiers, as in [12] and [18].

Going over the literature shows a disagreement on which features are best for the identification of emotions in speech. It is widely accepted that prosodic features carry most of the emotional information, and that the combination of parametrizations of different nature improves the results. But no systematic study has been done comparing the effectiveness of each kind of parametrization or their combinations. This work attempts to fill this gap by analyzing the characteristics and appropriateness of different sets of features used for the recognition of emotions. It is focused on acoustic parameters that can be extracted directly from the speech signal without using ASR systems: prosody, spectral envelope, and voice quality. The effectiveness of different combinations of features is also studied, and the different combination approaches are compared to see which one is more appropriate: the simple feature concatenation (early fusion) or the classifier result combination (late fusion).

Section II of the paper describes the emotional database used in this work, as well as the acoustic processing carried out to extract the data needed for the calculation of the features. Section III describes the considered features, whereas Section IV describes the analysis of these features. Some empirical experiments are carried out and described in Section V, in order to confirm the results obtained in the analysis. Finally, the results are discussed and some conclusions are drawn.

    II. WORKING DATABASE

    A. Description of the Database

Both the analysis of the features and the emotion identification experiments described in this paper were carried out

    using the Berlin emotional speech database [26]. This database

    contains 535 recordings uttered by five male and five female

    speakers, simulating seven emotional states: anger, boredom,

    disgust, fear, happiness, neutral, and sadness. Although the

    original recordings are available at 16 kHz and 16 bits per

    sample, they were subsampled to 8 kHz for this work.

The database corpus contains ten sentences of neutral semantic content that all speakers recorded portraying the seven

    emotional styles. As described in [26], in order to get emotions

as natural as possible, the recordings were evaluated in a perceptual test where 20 listeners had to identify the intended emotion and score its naturalness. All recordings with an identification rate lower than 80% and an overall naturalness score under 60% were discarded, leaving the 535 sentences available in the final database. Due to the perceptual selection process, some emotions are more represented than others.

This database was chosen because it presents certain characteristics that were of interest. It is a multispeaker database, making it possible to perform speaker-independent tests. Furthermore, the perceptual selection guarantees that the portrayed emotions are highly natural. In addition, many studies on the identification of emotions have been carried out using this database, which makes it possible to compare the results with other published results.

    B. Processing of the Recordings

The recordings were processed in order to get the characteristic curves and labelings needed for the feature extraction. The processing included detection of the vocal activity, estimation of the glottal source signal and of the intonation curve, voiced-unvoiced labeling, pitch-period marking, and vowel detection. All

    this processing was performed automatically without manual

    corrections.

    1) Vocal Activity Detection (VAD): Silences and pauses were

    detected using the LTSE VAD algorithm described in [27]. The

    labels obtained were further processed to discard silence labels

    shorter than 100 ms that usually appear as a result of detection

    errors, and which have no linguistic meaning.
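A minimal sketch of this post-processing step is given below. It assumes that the VAD output is already available as labeled segments; the segment format and the function name are hypothetical rather than part of the LTSE algorithm of [27].

    def drop_short_silences(segments, min_silence_ms=100):
        # segments: list of (start_ms, end_ms, tag) with tag "speech" or "silence".
        # Silences shorter than min_silence_ms are relabeled as speech and merged
        # with the neighbouring segments, as they usually come from detection errors.
        cleaned = []
        for start, end, tag in segments:
            if tag == "silence" and (end - start) < min_silence_ms:
                tag = "speech"
            if cleaned and cleaned[-1][2] == tag:
                cleaned[-1] = (cleaned[-1][0], end, tag)
            else:
                cleaned.append((start, end, tag))
        return cleaned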

    2) Glottal Source Estimation: The glottal signal is needed

    in order to compute various features related to voice quality.

However, estimating the glottal flow from the speech signal is inherently difficult, and the results may be acceptable only when the estimation is applied to very stationary speech segments. Iterative adaptive inverse filtering (IAIF) [28] was used to perform the inverse filtering and recover the glottal flow, since it is a fully

    automatic method that gives acceptable results with stationary

    signals. Nevertheless, it is assumed that the estimated glottal

    signal is inaccurate for nonstationary segments.

3) Intonation Curve and Voiced-Unvoiced Labeling: The intonation curve was computed with the cepstrum dynamic programming (CDP) algorithm described in [29], which uses the

    cepstrum transform and dynamic programming. This algorithm

    provides the voiced-unvoiced (VUV) labeling, too.

4) Pitch Period Marking: In addition to glottal flow estimation, pitch synchronous marks are also needed for the extraction of voice quality features. Once the intonation curves, the VUV labeling, and the glottal flow were extracted, pitch marks were placed at the negative peaks of the inverse filtering residual, which were located by simple peak picking using the estimated F0 value as a clue to detect the next peak.

    5) Vowel Detection: Vowels are some of the most stable

    segments in a speech signal, making them very appropriate

    for the computation of certain features, such as those derived

    from the glottal source estimation. Furthermore, vowels are

    always voiced and are strongly affected by intonation patterns,

providing a consistent point to calculate intonation-related features such as risings or fallings. The vowels in the database were automatically detected using the algorithm described in [30], which is based on a phoneme recognizer working with HMM models of clustered phonemes. Phoneme clustering was automatically performed according to their acoustical similarity, keeping the vowels in their own cluster. This provides a consistent and very robust set of models, capable of detecting 80% of vowel boundaries with less than 20-ms error.

    III. FEATURE DEFINITION

    A. Segmental Features

Segmental features are calculated once for every frame, allowing the analysis of their temporal evolution. A 25-ms Hamming window with 60% overlap is used, i.e., a new feature vector is extracted every 10 ms.

1) Spectral Envelope: Reference [11] shows that log-filter power coefficients (LFPC) outperform traditional MFCC or

    LPCC parameters in emotion identification, so they were

    chosen for the frame-wise spectral characterization. LFPC

    features represent the spectral envelope in terms of the energy

    in Mel-scaled frequency bands.

    Eighteen LFPC coefficients were estimated for each frame,

    together with their first and second derivatives, giving a total of

    54 spectral features at the segment level. In order to minimize

    microphone distance effects, LFPC features were normalized to

    the mean value of the whole utterance.
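A rough sketch of this frame-wise LFPC extraction is shown below. It follows the description above (25-ms Hamming window, 10-ms shift, 18 Mel-scaled bands, first and second derivatives, utterance-mean normalization), but the FFT size and the particular Mel filterbank (taken here from librosa) are assumptions of the sketch, not details given in the paper.

    import numpy as np
    import librosa  # assumed available; any Mel filterbank implementation would do

    def lfpc_features(signal, sr=8000, n_bands=18, frame_ms=25, shift_ms=10):
        frame = int(sr * frame_ms / 1000)           # 25-ms analysis window
        shift = int(sr * shift_ms / 1000)           # 10-ms shift (60% overlap)
        n_fft = 512                                 # assumed FFT size
        window = np.hamming(frame)
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_bands)

        frames = []
        for start in range(0, len(signal) - frame + 1, shift):
            spec = np.abs(np.fft.rfft(signal[start:start + frame] * window, n_fft)) ** 2
            band_energy = mel_fb @ spec             # energy in each Mel band
            frames.append(np.log(band_energy + 1e-10))
        lfpc = np.array(frames)                     # shape: (n_frames, 18)

        lfpc -= lfpc.mean(axis=0)                   # normalize to the utterance mean
        d1 = librosa.feature.delta(lfpc, axis=0)    # first derivative
        d2 = librosa.feature.delta(lfpc, order=2, axis=0)
        return np.hstack([lfpc, d1, d2])            # 54 features per frame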

2) Prosody Primitives: We refer to the intonation and intensity curves as prosody primitives, as the (suprasegmental) prosodic features are estimated from these. A new sample of F0 and intensity was estimated for every frame, together with their first and second derivatives, providing six features per frame.

    Intensity curves were normalized to the mean intensity value of

the whole utterance in order to minimize the effect of the microphone distance.

As F0 values are not defined for unvoiced frames, two feature vector streams were extracted from each recording, the first one corresponding to voiced frames (characterized by both intonation and intensity features) and the second one corresponding to unvoiced frames (characterized only by intensity values). Both streams were treated as different parametrizations during this work.
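The splitting into streams can be sketched as follows. The helper below is illustrative only; in particular, the delta computation is simplified to a plain gradient over each stream rather than over contiguous voiced regions.

    import numpy as np

    def prosody_primitive_streams(f0, energy, voiced):
        # f0, energy: per-frame values; voiced: boolean mask marking the frames
        # where F0 is defined. Returns the voiced stream (6 features per frame)
        # and the unvoiced stream (3 features per frame).
        def with_deltas(x):
            d1 = np.gradient(x)      # first derivative (simplified estimate)
            d2 = np.gradient(d1)     # second derivative
            return np.column_stack([x, d1, d2])

        energy = energy - energy.mean()   # normalize to the utterance mean intensity
        voiced_stream = np.hstack([with_deltas(f0[voiced]),
                                   with_deltas(energy[voiced])])
        unvoiced_stream = with_deltas(energy[~voiced])
        return voiced_stream, unvoiced_stream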

TABLE I: SUPRASEGMENTAL PARAMETERS AND THE CORRESPONDING SYMBOLS USED THROUGHOUT THIS DOCUMENT

    B. Suprasegmental Features

Suprasegmental features represent long-term information, estimated over time intervals longer than a frame. In this work, this interval has been defined as the time between two consecutive pauses. It is expected that speech pauses correspond roughly to linguistic stops in the message, so this approach is very similar to using a whole utterance as integration time. However, as pauses can be detected automatically with the VAD algorithm, this method makes it possible to adapt the parametrization algorithm to work with direct audio input if necessary.

1) Spectral Statistics: For the suprasegmental characterization of the spectrum, long-term statistics of LFPC were calculated. For each of the 18 LFPC coefficients and their first and second derivatives, six statistics were computed, as shown in Table I. At the end, 324 suprasegmental spectral features were extracted.
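A sketch of this suprasegmental spectral parametrization is given below. Since Table I is not reproduced in this copy, the particular set of six statistics (mean, variance, minimum, maximum, range, and median) is only a placeholder guess; what matters is the structure: six statistics over each of the 54 LFPC trajectories, giving 324 features per inter-pause segment.

    import numpy as np

    def spectral_statistics(lfpc_frames):
        # lfpc_frames: array of shape (n_frames, 54) with the frame-wise LFPC
        # coefficients and their first and second derivatives for one segment.
        stats = [
            lambda x: x.mean(axis=0),
            lambda x: x.var(axis=0),
            lambda x: x.min(axis=0),
            lambda x: x.max(axis=0),
            lambda x: x.max(axis=0) - x.min(axis=0),   # range
            lambda x: np.median(x, axis=0),            # placeholder sixth statistic
        ]
        return np.concatenate([f(lfpc_frames) for f in stats])   # 6 x 54 = 324 values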


2) Prosody: Prosodic features are divided into five categories, according to the nature of the information they represent. Altogether, 54 prosodic features were defined.

Intonation Statistics: Intonation features were calculated as the same six statistics presented in Table I, applied to the F0 values and their first and second derivatives, giving 18 parameters as a result. Only frames detected as voiced were used for the computation of the statistics.

    Intensity Statistics: Following the same approach as with

    the intonation, intensity features were composed of the same

    statistics applied to the intensity values and their first and second

    derivatives, providing 18 new parameters.

    Speech Rate Features: They were defined as the mean and

    variance of the vowel duration, as shown in Table I.

Regression Features: In each detected vowel, a linear regression was estimated for the F0 and intensity values. Then the absolute value of the regression line slopes was calculated, and the six features listed in Table I were extracted. With these features, we try to combine long integration intervals (with a length comparable to a sentence) and short ones (with an approximate duration of a phoneme), as done by other authors [31].
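The per-vowel slope extraction can be sketched as below. The function is illustrative and assumes that frame times, contour values, and vowel boundaries are already available from the processing described in Section II-B.

    import numpy as np

    def vowel_regression_slopes(times, values, vowel_segments):
        # Absolute slope of a linear regression of a prosodic contour (F0 or
        # intensity) inside each detected vowel.
        # vowel_segments: list of (start_time, end_time) pairs.
        slopes = []
        for start, end in vowel_segments:
            mask = (times >= start) & (times < end)
            if mask.sum() < 2:
                continue                     # not enough frames to fit a line
            slope, _ = np.polyfit(times[mask], values[mask], deg=1)
            slopes.append(abs(slope))
        return np.array(slopes)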

Sentence-End Features: Prosodic values at the end of the sentences may provide additional information about the emotion. For example, an increase in F0 with respect to the rest of the sentence can represent surprise, whereas a lower intensity is usually related to sadness. Therefore, ten more features were extracted from the last vowel detected in the integration time, as shown in Table I. The normalized values are defined as the corresponding non-normalized ones divided by the mean value over all the vowels detected in the integration segment.

3) Voice Quality Features: Features related to voice quality are extracted from the glottal source signal and from the pitch period marks. They were computed only for vocalic segments, in order to consider only segments with reliable glottal source estimation.

The features specified in Table I were calculated for each vowel, and the values corresponding to vowels in the same integration segment were averaged in order to obtain a single feature vector for the whole integration time.

Jitter and shimmer were estimated using the five-point period perturbation quotient (ppq5) and five-point amplitude perturbation quotient (apq5) values as defined in Praat (http://www.praat.org). The normalized amplitude quotient (NAQ) [32] is estimated for every glottal pulse, so the NAQ value for a vowel was calculated by averaging the NAQ values obtained all along that vowel. Similarly, spectral tilt and spectral balance [33] are calculated for every frame, and the value for a vowel was calculated by averaging the values all along the vowel.
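As an illustration, a five-point period perturbation quotient can be approximated as below, following the usual Praat-style definition: the mean absolute deviation of each period from the average of itself and its four nearest neighbours, normalized by the mean period. apq5 is analogous, computed on pulse amplitudes instead of periods. This is a sketch, not the exact Praat implementation.

    import numpy as np

    def ppq5_jitter(pitch_mark_times):
        # pitch_mark_times: times of consecutive glottal pulses within a vowel.
        periods = np.diff(np.asarray(pitch_mark_times, dtype=float))
        if len(periods) < 5:
            return np.nan                    # too few periods for a 5-point quotient
        deviations = [abs(periods[i] - periods[i - 2:i + 3].mean())
                      for i in range(2, len(periods) - 2)]
        return np.mean(deviations) / periods.mean()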

    IV. FEATURE ANALYSIS AND SELECTION

    A. Inter-Emotion and Intra-Emotion Dispersion

TABLE II: DISCRIMINABILITY VALUES OF THE FEATURE FAMILIES MEASURED WITH THE J CRITERION

The capability of a parameter set to retain the emotional characteristics and to avoid the remaining attributes of the speech can be measured in terms of inter-emotion dispersion and intra-emotion dispersion. A large inter-emotion dispersion would mean that the features take very different values for each emotion, separating the distributions and making the classification task easier. A reduced intra-emotion dispersion reflects the consistency of the features within a given emotion. The relation between intra-emotion and inter-emotion dispersions provides a measure of the overlapping of the class distributions. This relation can be estimated using the following criterion [34]:

J = \frac{\mathrm{tr}(S_b)}{\mathrm{tr}(S_w)}    (1)

where \mathrm{tr}(\cdot) denotes the trace of a matrix and S_w and S_b are the intra-class and inter-class dispersion matrices, respectively:

S_w = \frac{1}{N} \sum_{k=1}^{K} \sum_{x \in C_k} (x - \mu_k)(x - \mu_k)^T    (2)

S_b = \frac{1}{N} \sum_{k=1}^{K} |C_k| (\mu_k - \mu)(\mu_k - \mu)^T    (3)

N being the number of training samples, K the number of emotions, C_k the samples of class k, \mu_k the centroid of the class, and \mu the global mean:

\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x    (4)

\mu = \frac{1}{N} \sum_{k=1}^{K} \sum_{x \in C_k} x    (5)

The J criterion is often used in discriminant analysis, and it is a generalization of the well-known Fisher criterion (6) for the multiclass and multidimensional case [35]:

F = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}    (6)
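The computation of this criterion for a given feature family can be sketched as follows, assuming the trace-based form written above (the exact normalization used in [34] may differ slightly).

    import numpy as np

    def dispersion_criterion(X, y):
        # X: (n_samples, n_features) feature matrix; y: emotion label per sample.
        # Returns J = trace(S_b) / trace(S_w).
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        mu = X.mean(axis=0)                          # global mean
        n_feat = X.shape[1]
        Sw = np.zeros((n_feat, n_feat))              # intra-class dispersion
        Sb = np.zeros((n_feat, n_feat))              # inter-class dispersion
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)                   # class centroid
            Sw += (Xc - mu_c).T @ (Xc - mu_c)
            diff = (mu_c - mu)[:, None]
            Sb += len(Xc) * (diff @ diff.T)
        Sw /= len(X)
        Sb /= len(X)
        return np.trace(Sb) / np.trace(Sw)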

J values were computed for each feature family as a first estimation of their capability to discriminate emotions. The results are presented in Table II.

Fig. 1. (a) Scatter plot of suprasegmental prosodic, (b) suprasegmental spectral, (c) segmental prosodic, and (d) segmental spectral features, projected over the two most discriminant directions by LDA. An: anger, Bo: boredom, Di: disgust, Fe: fear, Ha: happiness, Ne: neutral, Sa: sadness.

If we focus on suprasegmental features, it is observed that prosody is less discriminative than spectral envelope statistics. This result is especially significant, as prosodic features are by far the most used ones in the literature for emotion identification.

    The difference shown in Table II could be partly justified by the

    fact that 324 statistics are used in the spectral parametrization

and only 54 in the prosodic one. But when the calculation is repeated using only the best 54 spectral features (see the feature selection procedure in Section IV-C), they still seem to discriminate emotions better, obtaining a higher J value.

    To confirm this result and to show the discriminability of each

    set of features visually, an LDA transformation was applied to

    both spectral and prosodic features. The two most discriminant

    directions are represented graphically in Fig. 1. As can be seen,

emotions are less overlapped when using spectral features than when using prosodic ones.

    It is remarkable that in both cases, the most discriminant

direction (horizontal axis) seems to be related to the activation level, shifting high activation emotions (anger, happiness,

    and fear) to one side and low activation emotions (sadness and

    boredom) to the other.

    In Fig. 1(a), we can already identify the confusions usually

described in the literature when using prosodic features, for example anger and happiness. Disgust is also frequently reported

    to be hard to detect by prosodic features, and this is reflected

    in the scatter plot, too. The distribution of disgust samples has a

    large dispersion and is strongly overlapped with other emotions,

especially with the neutral style. On the contrary, it is clearly separated from the other emotions when using spectral features.

    Going back to the discriminability values in Table II, it can

    be seen that features related to voice quality seem to provide

    very little information, at least the ones considered in this work.

    Nevertheless, when these features are concatenated with the

prosodic ones, they contribute to increasing the J value from 6.47 to 7.14. This is not so surprising if we take into account that features without discrimination power by themselves may be useful when combined with others [36], as in this case. The concatenation of all suprasegmental features provides the highest discrimination, suggesting that the information captured by the spectrum, the prosody, and the voice quality is complementary.

Regarding segmental features, it can be seen that their capability to separate emotions is almost nonexistent. This is clearly

    observed in Fig. 1(c) and (d), where the two most discriminant

    features given by LDA are represented for prosodic primitives

    and LFPC features. All emotions are completely overlapped.

    Due to the short-term nature of the segmental parametrization,

    each LFPC feature vector reflects the spectral envelope of a

    single frame, i.e., it represents the characteristic vocal tract filter

for the phonemes articulated in that frame. As the spectral envelope is much more different for different phonemes than for different emotions, the intra-class dispersion is very large, increasing the overlap among emotions. In the case of intonation

    and intensity, a similar effect occurs, as the frame-wise sam-

ples have more variation due to the linguistic content than to the emotional content of the utterances.


TABLE III: UNSUPERVISED CLUSTERING RESULTS FOR SPECTRAL STATISTICS

This does not mean that the features are useless. Both the J criterion and the LDA are optimal only if the classes have normal homoscedastic distributions, i.e., they have the same covariance matrices [35]. They cannot make use of the subtle differences in the shape of the distributions, which can be captured if the right classification engine is applied. Usually Gaussian mixture models (GMMs) are used for this task, as they can capture small differences among distributions, assuming that enough training samples are provided. As segmental features are extracted once every 10 ms, there are indeed enough training samples to train robust and accurate models. In fact, GMMs are very popular in emotion identification when frame-wise features are used. Unfortunately, this means that no conclusions can be extracted from the J measures for segmental parametrizations. The only way to get an estimation of their capability to distinguish emotions is to perform empirical tests of automatic identification of emotions. The results of these empirical tests are presented in Section V.

    B. Unsupervised Clustering

The J values given above provide a clue about the discrimination power of each feature family, and the scatter plot of the

    most discriminant directions estimated by LDA gives a visual

    representation of it. But using only two directions is somehow

    unrealistic, since the addition of more dimensions could provide

more separation among the classes. Unfortunately, it is not possible to give an understandable graphical representation of more than two dimensions. However, it is possible to obtain descriptive results that can provide some insight into what happens when

    all features are used.

For this purpose, a blind unsupervised clustering was performed using the k-means algorithm. If the emotional classes are correctly separated, the resulting clusters should correspond

    to each emotion. The outcome of this clustering is shown in

    Tables III and IV for suprasegmental features. No clustering was

performed on segmental features because the distributions are so

    overlapped that the algorithm would not be able to locate the

    classes correctly.

The clustering is able to identify the emotions quite accurately, with the spectral parametrization having a better performance than the prosodic one, as predicted by the J values.

    Using spectral statistics, almost all samples belonging to a given

emotion are assigned to the same cluster, whereas prosodic features exhibit the typical confusion among emotions: anger with

happiness and neutral with boredom and sadness. If we consider these tables as classification confusion matrices, prosodic features would achieve an overall accuracy of 75.87%, whereas spectral statistics would get a 98.68%. Note that the expected accuracy of an emotion identification system is much lower, because the test utterances will not be seen during the training.

TABLE IV: UNSUPERVISED CLUSTERING RESULTS FOR PROSODIC STATISTICS
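The clustering experiment can be reproduced approximately with the sketch below: k-means is run blindly on the feature vectors and each cluster is then assigned to its dominant emotion, which is how the tables are read as confusion matrices here. The majority-vote mapping is an interpretation of the text, not a detail given in the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def clustering_accuracy(X, labels, n_emotions=7, seed=0):
        # labels: integer emotion label per sample (0..n_emotions-1).
        labels = np.asarray(labels)
        clusters = KMeans(n_clusters=n_emotions, n_init=10,
                          random_state=seed).fit_predict(X)
        correct = 0
        for c in range(n_emotions):
            members = labels[clusters == c]
            if len(members):
                # samples of the dominant emotion in the cluster count as correct
                correct += np.bincount(members).max()
        return correct / len(labels)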

    C. Feature Selection

The use of noninformative or redundant features may decrease the accuracy of a classifier, due to the confusion they add to the system. A feature selection algorithm can help identify the truly useful features, reducing the dimensionality

    of the parametrization and making the classifier run faster and

more accurately. Furthermore, detecting the discriminative features may provide a deeper understanding of the influence of the emotions on the acoustic characteristics of the voice.

The minimal-redundancy-maximal-relevance (mRMR) algorithm [37] has been used to get a ranking of the features, from

    the most to the least significant one. This algorithm selects

    the features that maximize the mutual information between

    the training samples and their classes (maximal relevance) and

simultaneously minimize the dependency among the selected features (minimal redundancy). mRMR has been applied to

    all five parametrization families defined in Section III, as well

    as to their combinations. The ranking of these combinations,

    presented in Table V, is very interesting as it shows which

    parametrization is preferred. When all suprasegmental features

    are concatenated, among the first ten features ranked, we find

four prosodic and six spectral features. Looking further, up to position 30, we find ten prosodic, one voice quality, and

    19 spectral features. These results are in line with the ones

    obtained in [16] and [17], supporting the previous results in

    Section IV-A that suggest that spectral suprasegmental features

provide more information about the emotion than prosodic ones. Voice quality features are in general ranked low in the

    list, even though some studies have claimed their relation with

    the emotional state of the speaker [19], [20]. Probably the low

    ranking is due to the automatic nature of the parametrization. In

    works dealing with voice quality, features are usually extracted

    with human intervention, providing very accurate values. In the

    case presented in this paper, all processing was fully automatic,

    and even though voice quality features were extracted only

    for vowels (which are supposedly very stable), the resulting

    estimation errors may increase the confusability in the system,

    making them not very suitable for identification. Errors during

the automatic vowel detection may further increase this confusability. These results are also confirmed by the low J values obtained by voice quality features.
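A greedy version of the mRMR selection described above can be sketched as follows, using mutual information estimates from scikit-learn. This is a simplified illustration (and slow for feature sets as large as the ones used here), not the implementation of [37].

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

    def mrmr_ranking(X, y, n_select=30, seed=0):
        # Greedy mRMR: at each step pick the feature with the largest mutual
        # information with the class (relevance) minus its mean mutual
        # information with the features already selected (redundancy).
        relevance = mutual_info_classif(X, y, random_state=seed)
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < n_select:
            scores = []
            for f in remaining:
                if selected:
                    redundancy = np.mean([mutual_info_regression(
                        X[:, [f]], X[:, s], random_state=seed)[0] for s in selected])
                else:
                    redundancy = 0.0
                scores.append(relevance[f] - redundancy)
            best = remaining[int(np.argmax(scores))]
            selected.append(best)
            remaining.remove(best)
        return selected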


TABLE V: BEST TEN FEATURES RANKED FOR FEATURE COMBINATIONS

The ranking for segmental parametrizations was performed separately for the voiced and unvoiced streams. Following the

    approach taken with prosodic primitives, LFPC features were

    also divided into voiced and unvoiced streams, so that the frame-

    wise spectral parametrization and the prosodic primitives can

    be easily combined by concatenation. In the combination of

the voiced streams, the three intonation features (F0 and its first and second derivatives) are ranked among the ten best ones, while the first and second derivatives of the intensity are between positions 10 and 20. The frame intensity itself is far below in the ranking for the voiced stream, but appears in second position for the unvoiced one.

Having the prosodic primitives ranked so high in the list corroborates the importance of these features for emotion identification.

    V. EXPERIMENTAL EVALUATION

    A. Experimental Framework

In order to validate the results of the feature analysis, emotion identification tests were carried out on the Berlin database.

Suprasegmental features were modeled with SVMs using an RBF

    kernel, whereas GMMs were used for segmental features. This

    way, the characteristics of each parametrization, as shown

    in Section IV-A, can be exploited. On the one hand, GMMs

should be able to model the subtle differences in the distribution of the highly overlapped segmental features, thanks to the

    large number of training samples provided by the frame-wise

    parametrization. On the other hand, SVMs can take advantage

of the larger separability of suprasegmental features. Furthermore, the high generalization capability exhibited by SVMs

    [38] will guarantee the creation of robust models, even though

    the suprasegmental parametrization provides very few training

    samples. In the case of SVMs, the one-vs-all approach [39] was

    used for the multiclass classification.

    The experimental framework was designed as a nested

    double cross-validation (see Fig. 2). The outer level ensures

    speaker independent results, where the speakers of the test

recordings have not been seen during the training. The inner level provides development tests for the optimization of the classifiers.

Fig. 2. Nested double cross-validation. In the (a) outer level, five blocks are defined according to speakers. Four are used for training and the last one for speaker independent testing. In the (b) inner level, the recordings in the training set are randomly rearranged to form five new sub-blocks for development purposes.

The speakers in the database are divided into five blocks for the outer level. Each block contains one male and one female speaker, so that gender balance is kept within the blocks. For the i-th loop, four of the blocks define the training set, leaving the remaining block for testing. For the inner loop, the sentences available in the training set are randomly distributed into five sub-blocks, which are then used for the inner-level cross-validation. Once the five inner-level loops have ended, their results are gathered and used to estimate optimal values for the number of mixtures in the GMM, the RBF kernel spread, the SVM misclassification cost, and the optimum number of features. Finally, the whole training set is used to train the system for the i-th loop in the outer level and to perform the testing on the held-out block.
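The nested protocol can be sketched as below for one of the suprasegmental SVM systems. The speaker-grouped outer split and the randomly shuffled inner split follow the description above, but the hyper-parameter grid, the use of scikit-learn's GroupKFold (which does not enforce the gender balance mentioned in the text), and the one-vs-rest wrapper are assumptions of this sketch.

    import numpy as np
    from sklearn.model_selection import GroupKFold, KFold
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    def nested_cv_accuracy(X, y, speakers, grid):
        # grid: list of (C, gamma) pairs for the RBF-kernel SVM.
        test_scores = []
        for train, test in GroupKFold(n_splits=5).split(X, y, groups=speakers):
            # inner level: tune (C, gamma) on randomly re-split development folds
            best_params, best_score = None, -1.0
            for C, gamma in grid:
                inner = []
                for tr, dev in KFold(n_splits=5, shuffle=True,
                                     random_state=0).split(train):
                    clf = OneVsRestClassifier(SVC(C=C, gamma=gamma, kernel="rbf"))
                    clf.fit(X[train[tr]], y[train[tr]])
                    inner.append(clf.score(X[train[dev]], y[train[dev]]))
                if np.mean(inner) > best_score:
                    best_params, best_score = (C, gamma), np.mean(inner)
            # outer level: retrain on the whole training set, test on unseen speakers
            clf = OneVsRestClassifier(SVC(C=best_params[0], gamma=best_params[1],
                                          kernel="rbf"))
            clf.fit(X[train], y[train])
            test_scores.append(clf.score(X[test], y[test]))
        return float(np.mean(test_scores))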

    B. Selection of the Number of Features

In order to estimate the optimum number of features, development tests were repeated adding one feature at a time, according to the ranking obtained in Section IV-C. Fig. 3 shows the resulting accuracy in the development tests using suprasegmental

    features, as a function of the number of features.

    According to these results, we can observe that, even though

voice quality parametrization showed a very low class separation (see Table II), it does not perform so badly considering the low number of features it uses. With all five voice quality features, the system gets 49.4% of correct classifications, whereas with the best five prosodic features, it gets 50.7%. Furthermore, the combination of prosody and voice quality seems to be beneficial, as predicted by the estimated class separation values: the best prosodic system reaches a maximum of 65.5% with 39 features, whereas the combination obtains a 67.4% with 17 features. Not only does the accuracy increase, but it also reaches

    its maximum with fewer features.

    Suprasegmental spectrum statistics clearly outperform

    prosody, at least when more than 15 features are used, which

again confirms the conclusions obtained with the class separability analysis. Spectrum statistics reach a maximum of 75.4%

    accuracy with 96 features and keep steady from then on. The

    combination of all suprasegmental parametrizations obtains

    the best results if more than 25 features are used, reaching

    an almost steady state at 152 features with 77.9% accuracy,

    with a marginally better absolute maximum of 78.6% with 247

    features.

Fig. 3. Development results for suprasegmental parametrizations as a function of the number of features. (a) is a zoom view to see the results with few features.

TABLE VI: ACCURACY ON DEVELOPMENT AND TEST RESULTS WITH SELECTED NUMBER OF FEATURES AND ALL FEATURES

Fig. 4. Development results for segmental parametrizations as a function of the number of features.

The development results for segmental features are represented in Fig. 4. None of the curves reach a real saturation point, as the accuracy continues growing as new features are

    added. But it can be seen that the improvement is very small

once 20 features have been used. For example, LFPC parameters get 70.9% accuracy with 20 features, and when all 54 are used, this number increases only to 72.9%. When voiced and unvoiced frames are treated separately, the accuracy of LFPC

    decreases, which seems reasonable as there are approximately

    half the training samples for each stream.

    Results with prosodic primitives seem rather modest,

    reaching 65.3% accuracy in the voiced stream and 50.8% in the

    unvoiced one. But it should be kept in mind that only six and

three features are used, respectively. If, for example, only the

    best six features in the voiced LFPC stream are kept, the system

    gets 58.8% accuracy. However, adding the prosodic primitives

    to LFPC features does not improve the results significantly.

    According to the ranking shown in Table V, prosodic primitives

    are more informative than most LFPC parameters, as they are

among the first 10 or 20 features selected. However, this is true only if few features are selected, e.g., less than 15. When the

    number of features increases, the LFPC features that are added

    compensate for the information of the prosodic primitives, so

    that at the end, the combination has no effect on the overall

    accuracy.

    Table VI summarizes the development results for each

    parametrization family and their combinations, both with the

estimated optimal number of features and with the complete set. The final speaker independent test results are also presented for each case. Looking at these final test results, it can

    be concluded that the spectral statistics are the best isolated

    parametrization, reaching a 70.5% accuracy with 96 features

    (and almost the same accuracy with all 324 features). Among

    the feature combinations, the best result is obtained combining

    all suprasegmental features, providing 72.2% correct answers

    with 152 features and 72.5% with all 383.

    C. Late Fusion Combination

So far, early fusion schemes have been used, i.e., concatenating parametrizations. But it may be interesting to see if results improve with a late fusion system, i.e., combining not

    the features themselves but the results of the classifiers trained

    with them. Furthermore, the late fusion allows combining the

    information captured by parametrizations of different temporal

    structure, such as segmental and suprasegmental features or the

    voiced and unvoiced streams.

    An SVM-based fusion system [40], [41] has been used for this

    task. Given an utterance to be classified and a set of classifiers,

    a vector is formed with the scores provided by the classifiers for

    each emotion. This score vector is then classified by the fusion

    SVM to get the final decision. With an appropriate training, the

    SVM is expected to learn the score patterns of the errors and

hits of the classifiers and improve the results. The scores of the development tests were used for this training.
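A minimal sketch of such a fusion stage is given below: the per-emotion scores from the individual classifiers are concatenated into one vector per utterance, and a fusion SVM trained on the development scores produces the final decision. The kernel choice and any score normalization are assumptions of this sketch, not details taken from [40], [41].

    import numpy as np
    from sklearn.svm import SVC

    def train_fusion_svm(dev_scores, dev_labels):
        # dev_scores: list with one (n_utterances, n_emotions) score matrix per
        # classifier, obtained on the development data.
        fusion_input = np.hstack(dev_scores)          # one score vector per utterance
        return SVC(kernel="rbf").fit(fusion_input, dev_labels)

    def late_fusion_decision(fusion_svm, test_scores):
        # test_scores: same structure as dev_scores, for the test utterances.
        return fusion_svm.predict(np.hstack(test_scores))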


TABLE VII: ACCURACY ON LATE FUSION TEST RESULTS WITH SELECTED NUMBER OF FEATURES AND ALL FEATURES

    Results for the late fusion system are shown in Table VII. The

fusion of suprasegmental features (column 1) achieves very similar results with both early and late fusion systems. Segmental features (column 2), on the other hand, show a great improvement. However, this improvement is partly due to the combination of both the voiced and unvoiced streams of LFPC and

    prosodic primitives, which is not possible with the early fusion.

Modeling the voiced and unvoiced streams separately in segmental features and combining them afterwards through late fusion provides good results. While LFPC parametrization obtains 69.9% accuracy with 20 features and 72.2% with all 54, the fusion of its streams (column 3) gets 72.0% and 76.5%, respectively (an error reduction of 7% and 15%, respectively). A

    noticeable improvement can also be seen when combining the

streams of prosodic primitives (column 4). Therefore, we decided to keep voiced and unvoiced streams separated.

The late fusion system can also be used to combine segmental and suprasegmental systems. Combining the results from prosody and LFPC (column 5) yields a similar accuracy

    as combining the results from long-term spectral statistics

    and frame-wise prosodic primitives (column 6). In both cases,

    the accuracy is higher than when using the spectral statistics

alone (the best isolated system) and slightly better than using the early fusion of all suprasegmental features (the best early

    fusion system).

    When combining LFPC with spectral statistics (column 7)

    or prosody with prosodic primitives (column 8), there is also

    a significant improvement. This means that fusing systems that

    use features of the same acoustic origin but different time span

    can also be helpful to reduce classification errors.

    As a last experiment, all features were combined with the

    late fusion system (column 9): suprasegmental prosody, voice

    quality, and spectral statistics together with segmental LFPC

    and prosody primitives, with separated voiced and unvoiced

    streams. Altogether, seven classifiers were combined in this last

test. The obtained results are the best among all the systems tested: 78.3% for all features and 76.8% for selected ones,

    which implies a 20% error reduction with respect to the best

    early fusion system.

    D. Analysis of the Results by Emotion

    Altogether, it seems that spectral characteristics provide

higher identification accuracies than prosodic ones. Nevertheless, prosodic features may be more appropriate to identify certain emotions, even though they are not the best parametrization overall. The identification rates of each emotion have been

examined separately, in order to verify whether or not this is the case.

Fig. 5. Comparison of the identification accuracy for each emotion with prosodic and spectral parameters.

Fig. 5 presents the identification rates obtained for each emotion with some representative parametrizations: prosodic and spectral statistics in the case of suprasegmental features,

    and the late fusion of voiced and unvoiced streams of LFPC

    and prosody primitives in the case of segmental features.

It can be observed that fear and happiness are the worst identified emotions, with less than 50% of recordings correctly classified in most cases. Looking into the results, we observed that happiness is mostly confused with anger, while fear is confused with neutral with suprasegmental features and with neutral and happiness with segmental parametrizations. This result suggests

    that the considered parametrizations are not suitable to capture

the characteristics of these emotions, and that other features should be taken into account.

    Both anger and neutral get similar accuracies with prosodic

    and spectral long-term statistics. On the other hand, LFPC

    seems a more suitable short-term parametrization than prosody

    primitives for these emotions. In the case of boredom and

sadness, it is the segmental features that obtain similar identification rates, whereas in the suprasegmental parametrizations,

    the spectral statistics get better results. Finally, disgust seems to

    be very difficult to detect with prosody-related features, but the

    accuracy increases significantly with spectral characteristics,

    both in segmental and suprasegmental parametrizations.

    Spectral features provide similar or higher accuracies than

    prosodic ones in all emotions. This suggests that, even when

    each emotion is considered on its own, spectral characteristics

    are more suitable for emotion identification.

    VI. CONCLUSION

The main goal of the work presented here was to analyze features related to prosody, spectral envelope, and voice quality

    regarding their capability to separate emotions. This analysis

sheds light on an aspect of the automatic recognition of emotions in speech that is not fully covered in the literature. Although there are many studies that analyze how individual features change according to the emotional state of the speaker [7],

    [8], [42], [43], they do not evaluate the behavior of the whole

    feature set. This can lead to inaccurate conclusions about the

    usefulness of a given parametrization, as the features that are

    most discriminant individually may not be the most discriminant

    when taken together. Indeed, most of these studies show that,

when considered individually, changes in the prosodic features are more significant than changes in the spectral parameters. However, the results presented in this work suggest that if the whole feature set is considered, spectral envelope parametrizations are more informative than prosodic ones.

The analysis presented in this paper has been performed by means of discriminability measures, unsupervised clustering, and feature ranking. The results have been confirmed with empirical experiments of automatic identification of emotions. The discriminability criterion that has been used provides a way to estimate the performance of the whole feature set instead of individual parameters. These measures have been complemented with unsupervised clustering to see whether or not the parametrizations provide enough class separation. Both methods reveal that long-term statistics of spectral envelope

    provide larger separation among emotions than traditional

    prosodic features. This explains the higher accuracy of the

    former in the experimental tests. The measures also show

    that combining suprasegmental parametrizations of spectral

    envelope, prosody, and voice quality increases the separation

    among emotions, confirming that the use of features extracted

from different information sources helps in reducing the identification error.

Unfortunately, these methods are not suitable for the analysis of segmental parametrizations. Instead, a feature ranking algorithm has been applied to both segmental and suprasegmental parametrizations to detect the most discriminant features. Although many papers in the field mention the use of some kind of feature selection algorithm, only a few of them discuss the outcome of this selection. When this outcome is provided, long-term spectral statistics are usually selected first, over prosodic ones [16], [17], suggesting that suprasegmental spectral features provide more information about the emotional state of the speaker. The feature selection results shown in the present paper are in agreement with this conclusion.

Some works in the literature rely entirely on empirical experiments to determine the capacity of a given parametrization to identify emotions. Although this approach is suitable to find out the most discriminant parametrization, it gives no further information about the relation among these features. Furthermore,

    most works that use a combination of different sets of features

    (e.g., prosodic and spectral) provide only the accuracy results

    for that combination, but not separately for each feature set, so

    it is not possible to deduce which set is more informative. In

    this work, we have analyzed each feature set independently and

    in combination with the others. The experimental results have

also been given for each parametrization and for their combinations. This way, it is possible to measure the accuracy gain

    when different feature sets are combined, and see whether or

    not this combination is advantageous. For example, even though

segmental prosodic primitives were ranked high by the feature selection algorithm, it has been shown that the combination of frame-wise LFPC and prosodic primitives gets similar accuracy to the LFPC alone. Only with a reduced number of features is this combination helpful (see Fig. 4). Also for suprasegmental parametrizations, prosodic features perform better than

    the spectral ones only when few features are used [Fig. 3(a)].

    According to these results, we can say that traditional prosodic

features seem to be the most appropriate for automatic identification of emotions only if they are considered individually or in

a very reduced feature set. But if large feature sets are considered, spectral features outperform them.

    The analysis presented in this paper has been performed using

    an acted speech emotional database. In order to see whether the

results are applicable to real-life systems, they should be validated using different databases, especially databases of natural emotions. Nevertheless, the database used in this work is supposed to contain highly natural emotion portrayals, so it is very likely that these conclusions are valid to a great extent. A previous work using the Aibo database of natural emotions [44] showed similar conclusions, with frame-wise MFCC features outperforming long-term prosodic statistics [45]. Also in [12], [13], and [46], MFCC features get better results than suprasegmental ones.

The results from the late fusion tests suggest that the combined use of parametrizations extracted from the same information source but with different time scales, i.e., segmental and suprasegmental features, increases the accuracy of the system. The difference in the time scales and in the classifiers makes each subsystem retain different characteristics of the emotions. In fact, one of the best results has been achieved with the late fusion of long-term LFPC statistics with the voiced and unvoiced streams of frame-wise LFPC features. This system has

only been outperformed by the late fusion of all segmental and suprasegmental features, including spectral envelope, prosody, and voice quality. However, this last system is much more complex and the obtained improvement is very small. The use of all

    features requires estimating LFPC, intonation values, voiced-

    unvoiced decision, pitch period marking, inverse filtering, and

vowel detection. This makes the parametrization step very complex and time-consuming. Furthermore, it uses seven different

    classifiers prior to the fusion. The spectral system on the other

    hand needs only LFPCs, some simple statistics obtained from

    them, and the voiced-unvoiced decisions in order to separate the

streams, reducing the number of classifiers to three. The difference in the accuracy may not justify increasing the complexity

of the system to such a degree.

    We are not claiming that features extracted from prosody are

    useless. Several papers show that humans are able to identify

    emotions in prosodic copy-synthesis experiments [5], [47],

    confirming that prosody does carry a great amount of emotional

    information, at least for some emotions. But the traditional

    prosodic representations may not be well suited to capture this

    information. On the one hand, long-term statistics estimated

over the whole sentence lose the information of specific characteristic prosodic events. On the other hand, short-term prosodic

    primitives do not capture the prosodic structure correctly, which

    is suprasegmental by definition. The results suggest that a new

more elaborate representation is needed to effectively extract the emotional information contained in the prosody.


    REFERENCES

[1] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Commun., vol. 40, pp. 227–256, Apr. 2003.
[2] K. R. Scherer, "Psychological models of emotion," in The Neuropsychology of Emotion, J. Borod, Ed. Oxford, U.K.: Oxford Univ. Press, 2000, ch. 6, pp. 137–166.
[3] P. Ekman, "An argument for basic emotions," Cognit. Emotion, vol. 6, pp. 169–200, 1992.
[4] C. Darwin, The Expression of the Emotions in Man and Animals, 3rd ed. Oxford, U.K.: Oxford Univ. Press, 1998.
[5] F. Burkhardt and W. F. Sendlmeier, "Verification of acoustical correlates of emotional speech using formant-synthesis," in Proc. ISCA Tutorial and Research Workshop on Speech and Emotion, Belfast, Ireland, Sep. 2000, pp. 151–156.
[6] C. F. Huang and M. Akagi, "A three-layered model for expressive speech perception," Speech Commun., vol. 50, pp. 810–828, Oct. 2008.
[7] K. R. Scherer, R. Banse, H. G. Wallbott, and T. Goldbeck, "Vocal cues in emotion encoding and decoding," Motiv. Emotion, vol. 15, no. 2, pp. 123–148, 1991.
[8] M. Schröder, "Speech and emotion research," Ph.D. dissertation, Universität des Saarlandes, Saarbrücken, Germany, 2003.
[9] E. Navas, I. Hernáez, A. Castelruiz, J. Sánchez, and I. Luengo, "Acoustic analysis of emotional speech in standard Basque for emotion recognition," in Progress in Pattern Recognition, Image Analysis and Applications, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, Oct. 2004, vol. 3287, pp. 386–393.
[10] D. Erickson, "Expressive speech: Production, perception and application to speech synthesis," Acoust. Sci. Tech., vol. 26, pp. 317–325, 2005.
[11] T. L. Nwe, S. W. Foo, and L. C. de Silva, "Speech emotion recognition using hidden Markov models," Speech Commun., vol. 41, pp. 603–623, Jun. 2003.
[12] S. Kim, P. G. Georgiou, S. Lee, and S. Narayanan, "Real-time emotion detection system using speech: Multi-modal fusion of different timescale features," in Proc. Int. Workshop Multimedia Signal Processing, Crete, Greece, Oct. 2007, pp. 48–51.
[13] B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll, "Frame vs. turn-level: Emotion recognition from speech considering static and dynamic processing," in Affective Computing and Intelligent Interaction, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, 2007, vol. 4738, pp. 139–147.
[14] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Comput. Appl., vol. 9, pp. 290–296, Dec. 2000.
[15] O. W. Kwon, K. Chan, J. Hao, and T. W. Lee, "Emotion recognition by speech signals," in Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 125–128.
[16] T. Vogt and E. André, "Improving automatic emotion recognition from speech via gender differentiation," in Proc. LREC, Genoa, Italy, May 2006.
[17] B. Schuller, R. Müller, M. Lang, and G. Rigoll, "Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 805–808.
[18] R. López-Cózar, Z. Callejas, M. Kroul, J. Nouza, and J. Silovský, "Two-level fusion to improve emotion classification in spoken dialogue systems," in Text, Speech and Dialogue, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, 2008, vol. 5246, pp. 617–624.
[19] M. Lugger and B. Yang, "The relevance of voice quality features in speaker independent emotion recognition," in Proc. ICASSP, Honolulu, HI, Apr. 2007, vol. 4, pp. 17–20.
[20] C. Gobl and A. N. Chasaide, "The role of voice quality in communicating emotion, mood and attitude," Speech Commun., vol. 40, pp. 189–212, Apr. 2003.
[21] R. Tato, R. Santos, R. Kompe, and J. Pardo, "Emotional space improves emotion recognition," in Proc. ICSLP, Sep. 2002, pp. 2029–2032.
[22] R. Müller, B. Schuller, and G. Rigoll, "Enhanced robustness in speech emotion recognition combining acoustic and semantic analyses," in Proc. From Signals to Signs of Emotion and Vice Versa, Santorini, Greece, Sep. 2004.
[23] A. Nogueiras, A. Moreno, A. Bonafonte, and J. B. Mariño, "Speech emotion recognition using hidden Markov models," in Proc. Eurospeech, Aalborg, Denmark, Sep. 2001, pp. 2679–2682.
[24] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[25] D. Ruta and B. Gabrys, "An overview of classifier fusion methods," Comput. Inf. Syst., vol. 7, no. 1, pp. 1–10, Feb. 2000.
[26] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 1517–1520.
[27] J. Ramirez, J. C. Segura, C. Benitez, A. de la Torre, and A. Rubio, "Efficient voice activity detection algorithms using long-term speech information," Speech Commun., vol. 42, pp. 271–287, Apr. 2004.
[28] P. Alku, H. Tiitinen, and R. Näätänen, "A method for generating natural-sounding speech stimuli for cognitive brain research," Clin. Neurophysiol., vol. 110, pp. 1329–1333, Aug. 1999.
[29] I. Luengo, I. Saratxaga, E. Navas, I. Hernáez, J. Sánchez, and I. Sainz, "Evaluation of pitch detection algorithms under real conditions," in Proc. ICASSP, Honolulu, HI, Apr. 2007, pp. 1057–1060.
[30] I. Luengo, E. Navas, J. Sánchez, and I. Hernáez, "Detección de vocales mediante modelado de clusters de fonemas," Procesado del Lenguaje Natural, vol. 43, pp. 121–128, Sep. 2009.
[31] F. Ringeval and M. Chetouani, "Exploiting a vowel based approach for acted emotion recognition," in Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, Oct. 2008, vol. 5042, pp. 243–254.
[32] T. Bäckström, P. Alku, and E. Vilkman, "Time-domain parameterization of the closing phase of glottal airflow waveform from voices over a large intensity range," IEEE Trans. Speech Audio Process., vol. 10, no. 3, pp. 186–192, Mar. 2002.
[33] R. van Son and L. Pols, "An acoustic description of consonant reduction," Speech Commun., vol. 28, pp. 125–140, Jun. 1999.
[34] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1990.
[35] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York: Wiley, 2001.
[36] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.
[37] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[38] C. J. Burges, "A tutorial on support vector machines for pattern recognition," Data Min. Knowl. Discov., vol. 2, pp. 121–167, 1998.
[39] C. W. Hsu and C. J. Lin, "A comparison of methods for multi-class support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.
[40] J. Fierrez-Aguilar, D. Garcia-Romero, J. Ortega-Garcia, and J. Gonzalez-Rodriguez, "Adapted user-dependent multimodal biometric authentication exploiting general information," Pattern Recognit. Lett., vol. 26, no. 16, pp. 2628–2639, Dec. 2005.
[41] B. Gutschoven and P. Verlinde, "Multi-modal identity verification using support vector machines (SVM)," in Proc. Int. Conf. Information Fusion, Paris, France, Jul. 2000, vol. 2, pp. 3–8.
[42] R. Banse and K. R. Scherer, "Acoustic profiles in vocal emotion expression," J. Personal. Social Psychol., vol. 70, no. 3, pp. 614–636, 1996.
[43] L. Devillers, I. Vasilescu, and L. Vidrascu, "F0 and pause features analysis for anger and fear detection in real-life spoken dialogs," in Proc. Speech Prosody, Nara, Japan, Mar. 2004, pp. 205–208.
[44] A. Batliner, S. Steidl, B. Schuller, D. Seppi, K. Laskowski, T. Vogt, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson, "Combining efforts for improving automatic classification of emotional user states," in Proc. Information Society-Language Technologies Conf. (IS-LTC), Ljubljana, Slovenia, Oct. 2006, pp. 240–245.
[45] I. Luengo, E. Navas, and I. Hernáez, "Combining spectral and prosodic information for emotion recognition in the Interspeech 2009 Emotion Challenge," in Proc. Interspeech, Brighton, U.K., Sep. 2009, pp. 332–335.
[46] I. Luengo, E. Navas, I. Hernáez, and J. Sánchez, "Automatic emotion recognition using prosodic parameters," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 493–496.
[47] E. Navas, I. Hernáez, and I. Luengo, "An objective and subjective study of the role of semantics in building corpora for TTS," IEEE Trans. Speech Audio Process., vol. 14, no. 4, pp. 1117–1127, Jul. 2006.

Iker Luengo received the telecommunication engineering degree from the University of the Basque Country, Bilbao, Spain, in 2003 and is currently pursuing the Ph.D. degree in telecommunications.
He has been a researcher in the AhoLab Signal Processing Group in the Electronics and Telecommunications Department since 2003. He has participated as a research engineer in government-funded R&D projects focused on emotional speech, speaker recognition, diarization of meetings, and speech prosody.
Mr. Luengo is a member of the International Speech Communication Association (ISCA) and the Spanish thematic network on Speech Technologies (RTTH).

Eva Navas received the telecommunication engineering degree and the Ph.D. degree from the Department of Electronics and Telecommunications of the University of the Basque Country, Bilbao, Spain.
Since 1999, she has been a researcher at the AhoLab Signal Processing Group. She is currently teaching at the Faculty of Industrial and Telecommunication Engineering in Bilbao. She has participated as a research engineer in government-funded R&D projects as well as in privately-funded research contracts. Her research is focused on expressive speech characterization, recognition, and generation.
Dr. Navas is a member of the International Speech Communication Association (ISCA), the Spanish thematic network on Speech Technologies (RTTH), and the European Center of Excellence on Speech Synthesis (ECESS).

Inmaculada Hernáez received the telecommunications engineering degree from the Universitat Politècnica de Catalunya, Barcelona, Spain, and the Ph.D. degree in telecommunications engineering from the University of the Basque Country, Bilbao, Spain, in 1987 and 1995, respectively.
She is a Full Professor in the Electronics and Telecommunication Department, Faculty of Engineering, University of the Basque Country, in the area of signal theory and communications. She is a founding member of the AhoLab Signal Processing Research Group. Her research interests are signal processing and all aspects related to speech processing. She is also interested in the development of speech resources and technologies for the Basque language.
Dr. Hernáez is a member of the International Speech Communication Association (ISCA), the Spanish thematic network on Speech Technologies (RTTH), and the European Center of Excellence on Speech Synthesis (ECESS).