Deep Learning based Speech Emotion Recognition System
*P Jothi Thilaga1, S Kavipriya2, K Vijayalakshmi3
1Assistant Professor, 2Associate Professor, 3Professor & Head
1,3Department of Computer Science and Engineering, Ramco Institute of Technology, Rajapalayam, India
2Department of Computer Science and Engineering, Mepco Schlenk Engineering College, Sivakasi, India
[email protected], [email protected], [email protected]
Abstract
Emotions are elementary for humans, influencing perception and everyday activities such as
communication, learning and decision-making. Speech Emotion Recognition (SER) systems
aim to make interaction with machines more natural by using direct voice input rather than
traditional input devices, so that the machine understands the verbal content and can respond
in the way a human listener would. The SER system presented here is composed of two
sections: a feature extraction phase and a feature classification phase. SER can be implemented
on bots so that they can communicate with humans in a non-lexical manner. The speech
emotion recognition algorithm here is based on a Convolutional Neural Network (CNN) model,
which uses several modules for emotion recognition and classifiers to distinguish emotions
such as happiness, calm, anger, neutral state, sadness and fear. The quality of the classification
depends on the extracted features. Finally, the emotion of a speech signal is determined.
Keywords: Deep Learning, Speech Emotion Recognition, Spectrogram, Convolutional Neural
Network, Mel Frequency Cepstral Coefficient
1. Introduction
Speech signals are one of the most common ways of expressing human emotion.
Emotions give language color and are an essential component of normal two-way human
contact and communication. As listeners, we respond to the speaker's emotional state and
adjust our behavior according to the feelings the speaker conveys.
*Corresponding Author
Recent technological advances allow humans to interact with computers through
non-traditional modalities such as voice, gesture and facial expression. This interaction,
however, still lacks the element of emotion. It has been argued that to truly achieve emotionally
intelligent human-computer interaction, the computer must be able to interact naturally with
the user, much as human-human interaction takes place. Numerous studies have compared
classical human interaction with human-computer interaction and concluded that emotions are
a vital ingredient of intelligent interaction. This work therefore gives some basic analysis of
emotion recognition from speech. The task of identifying the emotional content of speech,
regardless of its semantic content, is known as Speech Emotion Recognition (SER). Whereas
humans can perform this task effectively as a natural part of voice communication, doing it
reliably with programmable devices is still a work in progress.
The characteristics of speech are essentially explained by a few chief features: speech
flow, loudness, intonation and the intensity of overtones. Speech flow describes the speed at
which utterances are produced, together with the number and length of temporary pauses in
speaking. Loudness represents the amount of energy associated with the articulation of the
utterances and, when treated as a time-varying quantity, captures the speaker's dynamics.
Intonation is the pattern of producing utterances with rises and falls in pitch, and results in
tonal shifts in either direction around the speaker's mean vocal pitch. Overtones are the higher
tones that faintly accompany a fundamental tone and are therefore responsible for the tonal
diversity of sounds. Continuous features, qualitative features and spectral features are among
the distinctive acoustic features commonly used to recognize speech emotion. Low-level
descriptors (LLDs), such as prosodic features like pitch and intensity, voice-quality features
like formants, and spectral features like Mel-frequency cepstral coefficients (MFCCs) [1], are
employed in SER.
Deep learning has revolutionized the processing of speech signals. For predicting
emotion from speech signals, several deep learning methods have been proposed and used in
recent years for automated feature extraction. However, to the best of our knowledge, no
comprehensive research analysis exists that objectively evaluates and summarizes current deep
learning strategies for SER, including their pros and cons.
The SER system is built using a convolutional neural network model [20]. The CNN has two
phases. In the first phase, unlabeled samples are used to learn candidate features with
contractive convolutional neural networks under a reconstruction penalty. In the second phase,
the candidate features are used as the input set to the CNN to learn affect-salient,
discriminative features using a novel objective function that encourages feature saliency,
orthogonality and discrimination.
A spectrogram is used to transform the raw speech signal into MFCC format [1]. The CNN
blocks operate on the MFCC format, and the trained model uses it to output the gender and
emotion for a given speech signal.
2. Related Work
The three major components of an emotion recognition system for digitized speech are
signal pre-processing, feature extraction and classification [13]. Most studies in early speech
emotion recognition research focused on selecting the proper acoustic features [12]. Spectral
features are the original acoustic features considered in many speech applications; they can
express typical emotion information from the viewpoint of human auditory perception. An
end-to-end discrete speech emotion recognition algorithm takes a speech spectrogram as input.
That means that, although such algorithms have an end-to-end structure,
they treat the speech emotion recognition task as an image classification problem. Departing
from this approach, one can choose, as far as possible, an end-to-end deep learning algorithm
that captures the information contained in the emotional speech itself [15]. Therefore, in the
proposed algorithm raw speech data is taken as input rather than a spectrogram. A
Convolutional Neural Network is a type of deep learning technique based purely on a
feed-forward architecture for classification [13]. CNNs are naturally used for pattern
recognition and afford better data classification [10]. These networks have small neurons on
every layer of the designed model architecture that process the input data in the manner of
receptive fields [14]. By including information about gender, the training and testing data
become more consistent, and the neural network gains an additional cue for treating the
emotional vocal structures of the two genders as an elementary grouping [15].
3. Methodology
3.1 Problem Statement
Every person is a unique individual, and everyone expresses their thoughts in different
ways, which can lead to confusion. One way of determining another person's feelings is to
look at their facial expressions and listen to their tone of voice. This system helps bots to
understand human emotions through speech. This is simple for people who have no
disabilities, but identifying the emotion of people who have hearing loss, or who have a speech
disorder, can be difficult. With such a system, people with hearing impairments can receive a
message together with its feeling, making communication more human-like. The aim of this
project is therefore to develop speech emotion recognition using deep learning models.
3.2 Dataset
This system uses the RAVDESS dataset for speech processing. The Ryerson Audio-
Visual Database of Emotional Speech and Song (RAVDESS) [7] is a validated multimodal
database of emotional speech and song. The database is gender balanced, consisting of
twenty-four professional actors (12 female, 12 male) vocalizing lexically-matched statements
in a neutral North American accent. The speech recordings include calm, happy, sad, angry,
surprised, disgusted and fearful expressions, and the song recordings contain calm, happy,
sad, angry and fearful emotions. Every expression is produced at two levels of emotional
intensity, with an additional neutral expression. All recordings are available in face-and-voice,
face-only, and voice-only formats. This system uses only the audio speech portion of
RAVDESS, which contains 1440 files: 60 trials per actor x 24 actors = 1440 trials.
File naming convention: Each of the 1440 files has a unique name. The name consists of a
7-part numerical identifier (e.g. 03-01-07-01-02-01-02.wav). These identifiers define the
stimulus characteristics:
Filename identifiers:
Modality [01 for full-AV, 02 for video-only, 03 for audio-only]
Vocal channel [01 for speech, 02 for song]
Emotion [01 to 08 for neutral, calm, happy, sad, angry, fearful, disgust, surprised]
Emotional intensity [01 for normal, 02 for strong]
Statement [01 for 'Kids are talking by the door', 02 for 'Dogs are sitting by the door']
Repetition [01 for 1st repetition, 02 for 2nd repetition]
Actor [01 to 24; males are represented by odd numbers and females by even numbers]
A small sketch of how these identifiers can be parsed into labels is given below.
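The following is a minimal, hypothetical Python sketch of parsing the 7-part identifier into an emotion and gender label; the function name and the dictionary are illustrative and not taken from the paper's implementation.

```python
# Hypothetical helper for mapping a RAVDESS file name to (emotion, gender).
# The identifier layout follows the convention listed above.
EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}

def parse_ravdess_name(filename):
    """Return (emotion, gender) for a name such as '03-01-07-01-02-01-02.wav'."""
    parts = filename.replace(".wav", "").split("-")
    modality, channel, emotion, intensity, statement, repetition, actor = map(int, parts)
    gender = "male" if actor % 2 == 1 else "female"   # odd = male, even = female
    return EMOTIONS[emotion], gender

print(parse_ravdess_name("03-01-07-01-02-01-02.wav"))  # ('disgust', 'female')
```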
3.3 Deep Learning Model
The system uses a Convolutional Neural Network (CNN) with RAVDESS as the
training dataset. The flowchart below depicts the flow of speech emotion recognition
using the CNN.
3.3.1 Importing Libraries & Loading the Input File: Importing Libraries: First, the required
Python libraries for this system are imported in this module. tensorflow, keras, librosa,
seaborn, scipy, and plotly are some of the basic Python libraries used.
Tensorflow – Provides a group of workflows to develop and train models using Python
Keras – Used to develop and evaluate deep learning models
Librosa – Package for music and audio analysis
Seaborn – Used to visualize statistical distributions
Scipy – Used for scientific and technical computing
Plotly – Allows users to import, copy and paste, or stream data to be analyzed and visualized
Load the input file using librosa: The system takes raw audio speech data from the
RAVDESS dataset as input. The audio dataset is then transformed into a data frame.
Figure 1 below shows an instance of the data frame. The labelling is then done based on the
identifiers in the file name. The audio data is loaded and supplied as input to the next module
using the librosa library.
Figure 1: Data Frame from Audio Dataset
3.3.2 Pre-Processing: Based on the audio data, this system converts the data in audio form
into a waveform, which is then plotted as a spectrogram [2]. Figure 2 below shows a sample
audio signal waveform and the spectrogram representation of that audio. After that, the excess
silence in the input audio data is trimmed, and the spectrogram representation is given as
input to the next module.
Figure 2: Sample Audio Signal Waveform and Spectrogram
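A minimal sketch of this pre-processing step, assuming a librosa-based pipeline (librosa >= 0.9 for waveshow); the file path and the trimming threshold are illustrative values, not taken from the paper.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load one RAVDESS speech file (path is illustrative).
y, sr = librosa.load("03-01-07-01-02-01-02.wav", sr=None)

# Trim leading/trailing silence; top_db is an assumed threshold.
y_trimmed, _ = librosa.effects.trim(y, top_db=30)

# Short-time Fourier transform -> log-magnitude spectrogram.
stft = librosa.stft(y_trimmed)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Plot the waveform and the spectrogram, as in Figure 2.
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
librosa.display.waveshow(y_trimmed, sr=sr, ax=ax1)
ax1.set_title("Waveform")
librosa.display.specshow(spectrogram_db, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
ax2.set_title("Spectrogram (dB)")
plt.tight_layout()
plt.show()
```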
3.3.3 Feature Extraction Using MFCC: Mel Frequency Cepstral Coefficients (MFCCs) [18]
are a commonly used feature in automatic speech and speaker recognition systems. The
coefficients that make up a Mel-frequency cepstrum (MFC) are known as MFCCs. They are
derived from a cepstral representation of the audio clip.
Calculation of MFCCs (a short sketch of these steps follows the list):
1. Frame the signal into short frames.
2. For each frame, compute the periodogram estimate of the power spectrum.
3. Apply the Mel filter bank to the power spectra and sum the energy in each filter.
4. Take the logarithm of all filter bank energies.
5. Compute the Discrete Cosine Transform (DCT) [8] of the log filter bank energies.
6. Retain DCT coefficients 2-13 and discard the rest.
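A hedged sketch of the six steps above using numpy, scipy and librosa's Mel filter bank; the frame length, hop size and number of Mel filters are assumed values.

```python
import numpy as np
import scipy.fftpack
import librosa

def mfcc_from_signal(y, sr, n_fft=2048, hop=512, n_mels=26, n_keep=12):
    # Steps 1-2: frame the signal and estimate the power spectrum per frame.
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    power = (np.abs(stft) ** 2) / n_fft
    # Step 3: apply the Mel filter bank and sum the energy in each filter.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energies = mel_fb @ power
    # Step 4: logarithm of the filter bank energies.
    log_mel = np.log(mel_energies + 1e-10)
    # Steps 5-6: DCT of the log energies, keep coefficients 2-13.
    cepstra = scipy.fftpack.dct(log_mel, axis=0, norm="ortho")
    return cepstra[1:1 + n_keep]
```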
Mel Scale: The Mel scale relates a pure tone's perceived frequency, or pitch, to its actual measured frequency.
At low frequencies, humans are much better at detecting small changes in pitch than they are
at high frequencies. Incorporating this scale makes the extracted features align more closely
with what humans hear.
Conversion from frequency to the Mel scale:
M(f) = 1125 ln(1 + f / 700)
To convert from Mels back to frequency:
M^-1(m) = 700 (exp(m / 1125) - 1)
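These two formulas translate directly into code; the sketch below is a direct implementation with illustrative test values.

```python
import math

def hz_to_mel(f_hz):
    """Frequency (Hz) to Mel: M(f) = 1125 ln(1 + f/700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Mel back to frequency (Hz): M^-1(m) = 700 (exp(m/1125) - 1)."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

print(round(hz_to_mel(1000.0), 1))          # ~998.2 Mel
print(round(mel_to_hz(hz_to_mel(440.0))))   # 440 Hz (round trip)
```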
Feature Extraction: From the spectrogram, the Mel power spectrogram is plotted; Figure 3
below shows the Mel power spectrum of a sample audio file. The Mel power spectrogram is
then used to compute the MFCCs. The MFCC features are collected and passed to the next
module as input; Figure 4 depicts the MFCC representation of the sample audio. The emotion
distribution is then plotted using the seaborn library.
Figure 3: Mel Power Spectrum for Sample Audio
Figure 4: MFCC format for Sample Audio
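A hedged sketch of this library-based feature extraction; the number of Mel bands and MFCC coefficients are assumed values, and the file path is illustrative.

```python
import numpy as np
import librosa

y, sr = librosa.load("03-01-07-01-02-01-02.wav", sr=None)           # illustrative path
mel_power = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # Mel power spectrogram
mel_db = librosa.power_to_db(mel_power, ref=np.max)                  # convert to dB
mfcc = librosa.feature.mfcc(S=mel_db, n_mfcc=13)                     # MFCCs from the Mel spectrogram
print(mfcc.shape)   # (13, number_of_frames)
```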
Plotting the Emotion Distribution: The file name identifiers are used to build the data frame,
and the emotion distribution is plotted from that data frame. Figure 5 below shows the emotion
distribution of the RAVDESS dataset.
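A small sketch of such a plot with seaborn, assuming a pandas data frame whose "emotion" column was derived from the file-name identifiers; the labels shown are placeholders.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder labels; in practice these come from the file-name identifiers.
labels = ["happy", "sad", "angry", "happy", "calm", "fearful", "sad"]
df = pd.DataFrame({"emotion": labels})

sns.countplot(x="emotion", data=df)
plt.title("Emotion distribution")
plt.show()
```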
Figure 5: Emotion Distribution
3.3.4 Data Augmentation [3]: Data augmentation [16] aids in the generation of synthetic data
from existing data sets, allowing the model's generalization capability to be increased.
This system uses two augmentation techniques:
Adding white noise
Pitch tuning
Figure 6 shows the original sample audio waveform without any added noise or pitch change,
Figure 7 shows the sample audio waveform with noise added, and Figure 8 shows the sample
audio waveform after pitch tuning.
Figure 6: Original Audio Wave
Figure 7: Noise Audio Wave
Figure 8: Pitch Audio Wave
The augmented audio data, which is simulated audio data, is combined with the original audio
data.
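A minimal sketch of the two augmentation techniques, assuming numpy and librosa; the noise factor and the pitch-shift step are illustrative values, not taken from the paper.

```python
import numpy as np
import librosa

def add_white_noise(y, noise_factor=0.005):
    """Add Gaussian white noise scaled by an assumed noise factor."""
    noise = np.random.randn(len(y))
    return y + noise_factor * noise

def pitch_tune(y, sr, n_steps=2):
    """Shift the pitch by n_steps semitones (illustrative value)."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

y, sr = librosa.load("03-01-07-01-02-01-02.wav", sr=None)   # illustrative path
augmented = [add_white_noise(y), pitch_tune(y, sr)]          # combined with the original data
```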
3.3.5 CNN Model: Stratified Shuffle Split [19]: This provides train and test indices to split the
data into train and test sets while preserving the class proportions. The stratified shuffle split is
performed after the augmentation step; a short sketch follows.
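A hedged sketch using scikit-learn's StratifiedShuffleSplit, assuming a feature matrix X (e.g. stacked MFCC vectors) and a label array y; the shapes and the test size are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.randn(100, 40)             # placeholder features (e.g. MFCC vectors)
y = np.random.randint(0, 8, size=100)    # placeholder emotion labels

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```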
Convolutional Neural Network: A ConvNet (CNN) is a deep learning algorithm that takes an
input image, assigns importance (learnable weights and biases) to various aspects or objects in
the image, and distinguishes between them.
Model Architecture: A convolutional neural network (CNN) model built with Keras is
used in this system. The system was built with 8 Conv1D layers, 2 MaxPooling1D layers, 2
Batch Normalization layers, 2 dropout layers, and a Dense layer. Table 1 summarizes the
CNN model architecture.
Layer                   | Kernel Size / Pool Size (P) | Activation Function | Dropout
CONV1D (256)            | 8                           | ReLU                | -
CONV1D (256)            | 8                           | ReLU                | -
BATCH NORMALIZATION (8) | -                           | ReLU                | 0.25
MAX POOLING1D (128)     | 8 (P)                       | -                   | -
CONV1D (256)            | 8                           | ReLU                | -
CONV1D (256)            | 8                           | ReLU                | -
CONV1D (256)            | 8                           | ReLU                | -
BATCH NORMALIZATION (8) | -                           | ReLU                | 0.25
MAX POOLING1D (128)     | 8 (P)                       | -                   | -
CONV1D (256)            | 8                           | ReLU                | -
CONV1D (256)            | 8                           | ReLU                | -
FLATTEN                 | -                           | -                   | -
DENSE (10)              | -                           | Softmax             | -
Table 1: Representation of the model architecture
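A hedged Keras sketch of a 1-D CNN of the kind summarized in Table 1; the input shape, padding, optimizer and exact placement of dropout are assumptions not specified in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input: sequences of MFCC frames; the shape (216, 1) is an assumed example.
model = keras.Sequential([
    layers.Input(shape=(216, 1)),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.MaxPooling1D(pool_size=8),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.MaxPooling1D(pool_size=8),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),   # 10 output classes as in Table 1
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```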
3.3.6 Testing and Training: The training accuracy of the CNN model, the train loss, and the
test loss are the outputs of this system. The model was trained for 700 epochs. This system
obtains a training accuracy of 91.04% with a train loss of 0.0358 and a test loss of 0.3443.
Figure 9 depicts a screenshot of the model output.
Figure 9: Screenshot of Output
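A minimal sketch of the training and evaluation step, reusing the model object sketched after Table 1; the placeholder data shapes, one-hot encoding and batch size are assumptions, while the 700 epochs follow the paper.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Placeholder data shaped to match the assumed model input (216 frames, 1 channel).
X_train = np.random.randn(80, 216, 1)
X_test = np.random.randn(20, 216, 1)
y_train = to_categorical(np.random.randint(0, 10, 80), num_classes=10)
y_test = to_categorical(np.random.randint(0, 10, 20), num_classes=10)

# 'model' is the CNN defined above; 700 epochs as reported in the paper.
history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
                    epochs=700, batch_size=32)

train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
test_loss, _ = model.evaluate(X_test, y_test, verbose=0)
print(f"train loss: {train_loss:.4f}, test loss: {test_loss:.4f}, train accuracy: {train_acc:.2%}")
```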
4. Generation of Results
This audio-based speech emotion recognition system detects the emotion in the input speech.
The system was trained using a convolutional neural network model for 700 epochs. Figure 10
depicts the model loss graph of loss against epochs, and the evaluation metrics of the model
are reported below.
Figure 10: Model Loss Graph
The model accuracy, train loss, and test loss are all outputs. The trained CNN model's
output screen is shown in Figure 11. The model was trained to an accuracy of 91.04 percent
with train and test losses of 0.0358 and 0.3443, respectively. Table 2 summarizes the model's
evaluation metrics.
Metric     | Value
Accuracy   | 91.04%
Train loss | 0.0358
Test loss  | 0.3443
Table 2: Evaluation metrics of the CNN model
Figure 11: Final Output Screenshot
5. Conclusions
The proposed model recognizes the emotion of a given input audio clip and outputs the
emotion of the speech together with the speaker's gender. Nowadays, speech emotion
recognition is used in robotics and in online retail services: robots are trained with speech
emotion recognition to interact with humans more easily, and many online retail services
deploy their own bots to interact with customers, so speech emotion recognition is becoming
an increasingly important application. With this in mind, the speech emotion recognition
system was built using deep learning with a CNN model, which is considered one of the best
models for emotion classification tasks. In this system the CNN model achieved a 91.04%
accuracy rate; with more data, the system would likely have performed even better. The
system also did exceptionally well in distinguishing between male and female voices. This
work can be extended by integrating it with a robot to give it a better understanding of human
mood and thus more natural speech communication, by integrating it with music applications
to suggest songs to users according to their emotions, and by employing it in online shopping
applications such as Amazon to improve product recommendations for users.
References
[1] A. Zabidi, I. M. Yassin, H. A. Hassan, N. Ismail, M. M. A. M. Hamzah, Z. I. Rizman, and H.
Z. Abidin, “Detection of asphyxia in infants using deep learning Convolutional Neural
Network (CNN) trained on Mel Frequency Cepstrum Coefficient (MFCC) features extracted
from cry sounds,” Journal of Fundamental and Applied Sciences, vol. 9, no. 3S, p. 768, 2018
[2] Hajarolasvadi, N.; Demirel, H. 3D CNN-Based Speech Emotion Recognition Using K-Means
Clustering and Spectrograms. Entropy 2019, 21, 479. https://doi.org/10.3390/e21050479
[3] J. Salamon and J. P. Bello, "Deep Convolutional Neural Networks and Data Augmentation
for Environmental Sound Classification," in IEEE Signal Processing Letters, vol. 24, no. 3,
pp. 279-283, March 2017, doi: 10.1109/LSP.2017.2657381.
[4] Jianfeng Zhao, Xia Mao, Lijiang Chen, "Speech emotion recognition using deep 1D & 2D
CNN LSTM networks", Biomedical Signal Processing and Control, Volume 47, 2019,
https://doi.org/10.1016/j.bspc.2018.08.035.
[5] K. Zvarevashe and O. O. Olugbara, “Recognition of speech emotion using custom 2d-
convolution neural network deep learning algorithm,” Intelligent Data Analysis, vol. 24, no.
5, pp. 1065–1086, 2020.
[6] Lech Margaret, Stolar Melissa, Best Christopher, Bolia Robert, "Real-Time Speech Emotion
Recognition Using a Pre-trained Image Classification Network: Effects of Bandwidth
Reduction and Companding," in Frontiers in Computer Science, vol. 2, 2020, doi:
10.3389/fcomp.2020.00014.
[7] Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech
and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North
American English. PLoS ONE 13(5): e0196391.
https://doi.org/10.1371/journal.pone.0196391
[8] L. Stanković and M. Brajović, "Analysis of the Reconstruction of Sparse Signals in the DCT
Domain Applied to Audio Signals," in IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 26, no. 7, pp. 1220-1235, July 2018, doi:
10.1109/TASLP.2018.2819819.
[9] NithyaRoopa S., Prabhakaran M, Betty P, "Speech Emotion Recognition using Deep
Learning" International Journal of Recent Technology and Engineering (IJRTE), vol. 7,
issue. 4S, November 2018.
[10] Murugan, Harini. (2020). Speech Emotion Recognition Using CNN. International Journal of
Psychosocial Rehabilitation. 24. 10.37200/IJPR/V24I8/PR280260.
[11] Mustaqeem, M. Sajjad and S. Kwon, "Clustering-Based Speech Emotion Recognition by
Incorporating Learned Features and Deep BiLSTM," in IEEE Access, vol. 8, pp. 79861-
79875, 2020, doi: 10.1109/ACCESS.2020.2990405.
[12] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, “Learning Salient Features for Speech Emotion
Recognition Using Convolutional Neural Networks,” IEEE Transactions on Multimedia, vol.
16, no. 8, pp. 2203–2213, Dec. 2014. doi:10.1109/tmm.2014.2360798.
[13] R. A. Khalil, E. Jones, M. I. Babar, T. Jan, M. H. Zafar and T. Alhussain, "Speech Emotion
Recognition Using Deep Learning Techniques: A Review," in IEEE Access, vol. 7, pp.
117327-117345, 2019, doi: 10.1109/ACCESS.2019.2936124.
[14] S. Zhang, A. Chen, W. Guo, Y. Cui, X. Zhao and L. Liu, "Learning Deep Binaural
Representations With Deep Convolutional Neural Networks for Spontaneous Speech Emotion
Recognition," in IEEE Access, vol. 8, pp. 23496-23505, 2020, doi:
10.1109/ACCESS.2020.2969032.
[15] T. -W. Sun, "End-to-End Speech Emotion Recognition With Gender Information," in IEEE
Access, vol. 8, pp. 152423-152438, 2020, doi: 10.1109/ACCESS.2020.3017462.
[16] X. Cui, V. Goel and B. Kingsbury, "Data Augmentation for Deep Neural Network Acoustic
Modeling," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23,
no. 9, pp. 1469-1477, Sept. 2015, doi: 10.1109/TASLP.2015.2438544.
[17] Y. Qin, S. Member, T. Lee, A. Pak, and H. Kong, “Automatic Assessment of Speech
Impairment in Cantonese-speaking People with Aphasia,” IEEE J. Sel. Top. Signal Process.,
vol. PP, no. c, p. 1, 2019.
[18] http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
[19] https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html