Deep Learning based Speech Emotion Recognition System
*P Jothi Thilaga1, S Kavipriya2, K Vijayalakshmi3
1Assistant Professor, 2Associate Professor, 3Professor & Head
1,3Department of Computer Science and Engineering, Ramco Institute of Technology, Rajapalayam, India
2Department of Computer Science and Engineering, Mepco Schlenk Engineering College, Sivakasi, India
[email protected], [email protected], [email protected]
Abstract
Emotions are elementary for humans, influencing perception and everyday activities such as
communication, learning and decision-making. Speech Emotion Recognition (SER) systems
aim to make interaction with machines more natural by using direct voice input rather than
traditional input devices, so that the machine understands the verbal content and can respond
in the way a human listener would. The SER system presented here is composed of two
sections: a feature extraction phase and a feature classification phase. SER can be implemented
on bots so that they can communicate with humans in a non-lexical manner. The speech
emotion recognition algorithm here is based on a Convolutional Neural Network (CNN) model,
which uses several modules for emotion recognition and classifiers to distinguish emotions
such as happiness, calm, anger, neutral state, sadness and fear. The quality of the classification
depends on the extracted features. Finally, the emotion of a speech signal is determined.
Keywords: Deep Learning, Speech Emotion Recognition, Spectrogram, Convolutional Neural
Network, Mel Frequency Cepstral Coefficient
1. Introduction
Speech signals are one of the most common ways of expressing human emotion.
Emotions give language color and are an essential component of normal two-way human
contact and communication. As listeners, we respond to the speaker's emotional state and
adjust our behavior according to the feelings the speaker conveys.
*Corresponding Author
Recent technological advances allow humans to interact with computers through
non-traditional modalities such as voice, gesture and facial expression. This interaction,
however, still lacks the element of emotion. It has been argued that to truly achieve emotionally
intelligent human-computer interaction, the computer must be able to interact naturally with
the user, much as human-human interaction takes place. Numerous studies have compared
classical human interaction with human-computer interaction and concluded that emotions are
a vital ingredient of intelligent interaction. This work therefore gives some basic analysis of
emotion recognition from speech. The task of identifying the emotional content of speech,
regardless of its semantic content, is known as Speech Emotion Recognition (SER). Whereas
humans can perform this task effectively as a natural part of voice communication, doing it
reliably with programmable devices is still a work in progress.
The characteristics of speech are essentially explained by a few chief features: speech
flow, loudness, intonation and the intensity of overtones. Speech flow describes the speed at
which utterances are produced, together with the number and length of temporary pauses in
speaking. Loudness represents the amount of energy associated with the articulation of the
utterances and, when treated as a time-varying quantity, captures the speaker's dynamics.
Intonation is the pattern of producing utterances with rises and falls in pitch, and results in
tonal shifts in either direction around the speaker's mean vocal pitch. Overtones are the higher
tones that faintly accompany a fundamental tone and are therefore responsible for the tonal
diversity of sounds. Continuous features, qualitative features and spectral features are among
the distinctive acoustic features commonly used to recognize speech emotion. Low-level
descriptors (LLDs), such as prosodic features like pitch and intensity, voice-quality features
like formants, and spectral features like Mel-frequency cepstral coefficients (MFCCs) [1], are
employed in SER.
Deep learning has revolutionized the processing of speech signals. For predicting
emotion from speech signals, several deep learning methods have been proposed and used in
recent years for automated feature extraction. However, to the best of our knowledge, no
comprehensive research analysis exists that objectively evaluates and summarizes current deep
learning strategies for SER, including their pros and cons.
The SER system is built using a convolutional neural network model [20]. The CNN has two
phases. In the first phase, unlabeled samples are used to learn candidate features with
contractive convolutional neural networks under a reconstruction penalty. In the second phase,
the candidate features are used as the input set to the CNN to learn affect-salient,
discriminative features using a novel objective function that encourages feature saliency,
orthogonality and discrimination.
A spectrogram is used to transform the raw speech signal into MFCC format [1]. The CNN
blocks operate on the MFCC format, and the trained model uses it to output the gender and
emotion for a given speech signal.
2. Related Work
The three major components of an emotion recognition system for digitized speech are
signal pre-processing, feature extraction and classification [13]. Most studies in early speech
emotion recognition research focused on selecting the proper acoustic features [12]. Spectral
features are the original acoustic features considered in many speech applications; they can
express typical emotion information from the viewpoint of human auditory perception. An
end-to-end discrete speech emotion recognition algorithm takes a speech spectrogram as input.
That means that, although such algorithms have an end-to-end structure,
they treat the speech emotion recognition task as an image classification problem. Departing
from this approach, one can choose, as far as possible, an end-to-end deep learning algorithm
that captures the information contained in the emotional speech itself [15]. Therefore, in the
proposed algorithm raw speech data is taken as input rather than a spectrogram. A
Convolutional Neural Network is a type of deep learning technique based purely on a
feed-forward architecture for classification [13]. CNNs are naturally used for pattern
recognition and afford better data classification [10]. These networks have small neurons on
every layer of the designed model architecture that process the input data in the manner of
receptive fields [14]. By including information about gender, the training and testing data
become more consistent, and the neural network gains an additional cue for treating the
emotional vocal structures of the two genders as an elementary grouping [15].
3. Methodology
3.1 Problem Statement
Every person is a unique individual, and everyone expresses their thoughts in different
ways, which can lead to confusion. One way of determining another person's feelings is to
look at their facial expressions and listen to their tone of voice. This system helps bots to
understand human emotions through speech. This is simple for people who have no
disabilities, but identifying the emotion of people who have hearing loss, or who have a speech
disorder, can be difficult. With such a system, people with hearing impairments can receive a
message together with its feeling, making communication more human-like. The aim of this
project is therefore to develop speech emotion recognition using deep learning models.
3.2 Dataset
This system uses the RAVDESS dataset for speech processing. The Ryerson Audio-
Visual Database of Emotional Speech and Song (RAVDESS) [7] is a validated multimodal
database of emotional speech and song. The database is gender balanced, consisting of
twenty-four professional actors (12 female, 12 male) vocalizing lexically-matched statements
in a neutral North American accent. The speech recordings include calm, happy, sad, angry,
surprised, disgusted and fearful expressions, and the song recordings contain calm, happy,
sad, angry and fearful emotions. Every expression is produced at two levels of emotional
intensity, with an additional neutral expression. All recordings are available in face-and-voice,
face-only, and voice-only formats. This system uses only the audio speech portion of
RAVDESS, which contains 1440 files: 60 trials per actor x 24 actors = 1440 trials.
File naming convention: Each of the 1440 files has a unique name. The name consists of a
7-part numerical identifier (e.g. 03-01-07-01-02-01-02.wav). These identifiers define the
stimulus characteristics:
Filename identifiers:
Modality [01 for full-AV, 02 for video-only, 03 for audio-only]
Vocal channel [01 for speech, 02 for song]
Emotion [01 to 08 for neutral, calm, happy, sad, angry, fearful, disgust, surprised]
Emotional intensity [01 for normal, 02 for strong]
Statement [01 for 'Kids are talking by the door', 02 for 'Dogs are sitting by the door']
Repetition [01 for 1st repetition, 02 for 2nd repetition]
Actor [01 to 24; males are represented by odd numbers and females by even numbers]
A small sketch of how these identifiers can be parsed into labels is given below.
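The following is a minimal, hypothetical Python sketch of parsing the 7-part identifier into an emotion and gender label; the function name and the dictionary are illustrative and not taken from the paper's implementation.

```python
# Hypothetical helper for mapping a RAVDESS file name to (emotion, gender).
# The identifier layout follows the convention listed above.
EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}

def parse_ravdess_name(filename):
    """Return (emotion, gender) for a name such as '03-01-07-01-02-01-02.wav'."""
    parts = filename.replace(".wav", "").split("-")
    modality, channel, emotion, intensity, statement, repetition, actor = map(int, parts)
    gender = "male" if actor % 2 == 1 else "female"   # odd = male, even = female
    return EMOTIONS[emotion], gender

print(parse_ravdess_name("03-01-07-01-02-01-02.wav"))  # ('disgust', 'female')
```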
3.3 Deep Learning Model
The system uses a Convolutional Neural Network (CNN) with RAVDESS as the
training dataset. The flowchart below depicts the flow of speech emotion recognition
using the CNN.
3.3.1 Importing Libraries & Loading the Input File: Importing Libraries: First, the required
Python libraries for this system are imported in this module. tensorflow, keras, librosa,
seaborn, scipy, and plotly are some of the basic Python libraries used.
Tensorflow – Provides a group of workflows to develop and train models using Python
Keras – Used to develop and evaluate deep learning models
Librosa – Package for music and audio analysis
Seaborn – Used to visualize statistical distributions
Scipy – Used for scientific and technical computing
Plotly – Allows users to import, copy and paste, or stream data to be analyzed and visualized
Load the input file using librosa: The system takes raw audio speech data from the
RAVDESS dataset as input. The audio dataset is then transformed into a data frame.
Figure 1 below shows an instance of the data frame. The labelling is then done based on the
identifiers in the file name. The audio data is loaded and supplied as input to the next module
using the librosa library.
Figure 1: Data Frame from Audio Dataset
3.3.2 Pre-Processing: Based on the audio data, this system converts the data in audio form
into a waveform, which is then plotted as a spectrogram [2]. Figure 2 below shows a sample
audio signal waveform and the spectrogram representation of that audio. After that, the excess
silence in the input audio data is trimmed, and the spectrogram representation is given as
input to the next module.
Figure 2: Sample Audio Signal Waveform and Spectrogram
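A minimal sketch of this pre-processing step, assuming a librosa-based pipeline (librosa >= 0.9 for waveshow); the file path and the trimming threshold are illustrative values, not taken from the paper.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load one RAVDESS speech file (path is illustrative).
y, sr = librosa.load("03-01-07-01-02-01-02.wav", sr=None)

# Trim leading/trailing silence; top_db is an assumed threshold.
y_trimmed, _ = librosa.effects.trim(y, top_db=30)

# Short-time Fourier transform -> log-magnitude spectrogram.
stft = librosa.stft(y_trimmed)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Plot the waveform and the spectrogram, as in Figure 2.
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
librosa.display.waveshow(y_trimmed, sr=sr, ax=ax1)
ax1.set_title("Waveform")
librosa.display.specshow(spectrogram_db, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
ax2.set_title("Spectrogram (dB)")
plt.tight_layout()
plt.show()
```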
3.3.3 Feature Extraction Using MFCC: Mel Frequency Cepstral Coefficients (MFCCs) [18]
are a commonly used feature in automatic speech and speaker recognition systems. The
coefficients that make up a Mel-frequency cepstrum (MFC) are known as MFCCs. They are
derived from a cepstral representation of the audio clip.
Calculation of MFCCs (a short sketch of these steps follows the list):
1. Frame the signal into short frames.
2. For each frame, compute the periodogram estimate of the power spectrum.
3. Apply the Mel filter bank to the power spectra and sum the energy in each filter.
4. Take the logarithm of all filter bank energies.
5. Compute the Discrete Cosine Transform (DCT) [8] of the log filter bank energies.
6. Retain DCT coefficients 2-13 and discard the rest.
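A hedged sketch of the six steps above using numpy, scipy and librosa's Mel filter bank; the frame length, hop size and number of Mel filters are assumed values.

```python
import numpy as np
import scipy.fftpack
import librosa

def mfcc_from_signal(y, sr, n_fft=2048, hop=512, n_mels=26, n_keep=12):
    # Steps 1-2: frame the signal and estimate the power spectrum per frame.
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    power = (np.abs(stft) ** 2) / n_fft
    # Step 3: apply the Mel filter bank and sum the energy in each filter.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energies = mel_fb @ power
    # Step 4: logarithm of the filter bank energies.
    log_mel = np.log(mel_energies + 1e-10)
    # Steps 5-6: DCT of the log energies, keep coefficients 2-13.
    cepstra = scipy.fftpack.dct(log_mel, axis=0, norm="ortho")
    return cepstra[1:1 + n_keep]
```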
Mel Scale: The Mel scale relates a pure tone's perceived frequency, or pitch, to its actual measured frequency.
At low frequencies, humans are much better at detecting small changes in pitch than they are
at high frequencies. Incorporating this scale makes the extracted features align more closely
with what humans hear.
Conversion from frequency to the Mel scale:
M(f) = 1125 ln(1 + f / 700)
To convert from Mels back to frequency:
M^-1(m) = 700 (exp(m / 1125) - 1)
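These two formulas translate directly into code; the sketch below is a direct implementation with illustrative test values.

```python
import math

def hz_to_mel(f_hz):
    """Frequency (Hz) to Mel: M(f) = 1125 ln(1 + f/700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Mel back to frequency (Hz): M^-1(m) = 700 (exp(m/1125) - 1)."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

print(round(hz_to_mel(1000.0), 1))          # ~998.2 Mel
print(round(mel_to_hz(hz_to_mel(440.0))))   # 440 Hz (round trip)
```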
Feature Extraction: From the spectrogram, the Mel power spectrogram is plotted; Figure 3
below shows the Mel power spectrum of a sample audio file. The Mel power spectrogram is
then used to compute the MFCCs. The MFCC features are collected and passed to the next
module as input; Figure 4 depicts the MFCC representation of the sample audio. The emotion
distribution is then plotted using the seaborn library.
Figure 3: Mel Power Spectrum for Sample Audio
Figure 4: MFCC format for Sample Audio
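A hedged sketch of this library-based feature extraction; the number of Mel bands and MFCC coefficients are assumed values, and the file path is illustrative.

```python
import numpy as np
import librosa

y, sr = librosa.load("03-01-07-01-02-01-02.wav", sr=None)           # illustrative path
mel_power = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # Mel power spectrogram
mel_db = librosa.power_to_db(mel_power, ref=np.max)                  # convert to dB
mfcc = librosa.feature.mfcc(S=mel_db, n_mfcc=13)                     # MFCCs from the Mel spectrogram
print(mfcc.shape)   # (13, number_of_frames)
```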
Plotting the Emotion Distribution: The file name identifiers are used to build the data frame,
and the emotion distribution is plotted from that data frame. Figure 5 below shows the emotion
distribution of the RAVDESS dataset.
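A small sketch of such a plot with seaborn, assuming a pandas data frame whose "emotion" column was derived from the file-name identifiers; the labels shown are placeholders.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder labels; in practice these come from the file-name identifiers.
labels = ["happy", "sad", "angry", "happy", "calm", "fearful", "sad"]
df = pd.DataFrame({"emotion": labels})

sns.countplot(x="emotion", data=df)
plt.title("Emotion distribution")
plt.show()
```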
Figure 5: Emotion Distribution
3.3.4 Data Augmentation [3]: Data augmentation [16] aids in the generation of synthetic data
from existing data sets, allowing the model's generalization capability to be increased.
This system uses two augmentation techniques:
Adding white noise
Pitch tuning
Figure 6 shows the original sample audio waveform without any added noise or pitch change,
Figure 7 shows the sample audio waveform with noise added, and Figure 8 shows the sample
audio waveform after pitch tuning.
Figure 6: Original Audio Wave
Figure 7: Noise Audio Wave
Figure 8: Pitch Audio Wave
The augmented audio data, which is simulated audio data, is combined with the original audio
data.
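A minimal sketch of the two augmentation techniques, assuming numpy and librosa; the noise factor and the pitch-shift step are illustrative values, not taken from the paper.

```python
import numpy as np
import librosa

def add_white_noise(y, noise_factor=0.005):
    """Add Gaussian white noise scaled by an assumed noise factor."""
    noise = np.random.randn(len(y))
    return y + noise_factor * noise

def pitch_tune(y, sr, n_steps=2):
    """Shift the pitch by n_steps semitones (illustrative value)."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

y, sr = librosa.load("03-01-07-01-02-01-02.wav", sr=None)   # illustrative path
augmented = [add_white_noise(y), pitch_tune(y, sr)]          # combined with the original data
```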
3.3.5 CNN Model: Stratified Shuffle Split [19]: This provides train and test indices to split the
data into train and test sets while preserving the class proportions. The stratified shuffle split is
performed after the augmentation step; a short sketch follows.
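A hedged sketch using scikit-learn's StratifiedShuffleSplit, assuming a feature matrix X (e.g. stacked MFCC vectors) and a label array y; the shapes and the test size are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.randn(100, 40)             # placeholder features (e.g. MFCC vectors)
y = np.random.randint(0, 8, size=100)    # placeholder emotion labels

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```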
Convolutional Neural Network: A ConvNet (CNN) is a deep learning algorithm that takes an
input image, assigns importance (learnable weights and biases) to various aspects or objects in
the image, and distinguishes between them.
Model Architecture: A convolutional neural network (CNN) model built with Keras is
used in this system. The system was built with 8 Conv1D layers, 2 MaxPooling1D layers, 2
Batch Normalization layers, 2 dropout layers, and a Dense layer. Table 1 summarizes the
CNN model architecture.
Layer                   | Kernel Size / Pool Size (P) | Activation Function | Dropout
CONV1D (256)            | 8                           | ReLU                | -
CONV1D (256)            | 8                           | ReLU                | -
BATCH NORMALIZATION (8) | -                           | ReLU                | 0.25
MAX POOLING1D (128)     | 8 (P)                       | -                   | -
CONV1D (256)            | 8                           | ReLU                | -
CONV1D (256)            | 8                           | ReLU                | -
CONV1D (256)            | 8                           | ReLU                | -
BATCH NORMALIZATION (8) | -                           | ReLU                | 0.25
MAX POOLING1D (128)     | 8 (P)                       | -                   | -
CONV1D (256)            | 8                           | ReLU                | -
CONV1D (256)            | 8                           | ReLU                | -
FLATTEN                 | -                           | -                   | -
DENSE (10)              | -                           | Softmax             | -
Table 1: Representation of the model architecture
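A hedged Keras sketch of a 1-D CNN of the kind summarized in Table 1; the input shape, padding, optimizer and exact placement of dropout are assumptions not specified in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input: sequences of MFCC frames; the shape (216, 1) is an assumed example.
model = keras.Sequential([
    layers.Input(shape=(216, 1)),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.MaxPooling1D(pool_size=8),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.MaxPooling1D(pool_size=8),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.Conv1D(256, kernel_size=8, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),   # 10 output classes as in Table 1
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```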
3.3.6 Testing and Training: The training accuracy of the CNN model, the train loss, and the
test loss are the outputs of this system. The model was trained for 700 epochs. This system
obtains a training accuracy of 91.04% with a train loss of 0.0358 and a test loss of 0.3443.
Figure 9 depicts a screenshot of the model output.
Figure 9: Screenshot of Output
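A minimal sketch of the training and evaluation step, reusing the model object sketched after Table 1; the placeholder data shapes, one-hot encoding and batch size are assumptions, while the 700 epochs follow the paper.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Placeholder data shaped to match the assumed model input (216 frames, 1 channel).
X_train = np.random.randn(80, 216, 1)
X_test = np.random.randn(20, 216, 1)
y_train = to_categorical(np.random.randint(0, 10, 80), num_classes=10)
y_test = to_categorical(np.random.randint(0, 10, 20), num_classes=10)

# 'model' is the CNN defined above; 700 epochs as reported in the paper.
history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
                    epochs=700, batch_size=32)

train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
test_loss, _ = model.evaluate(X_test, y_test, verbose=0)
print(f"train loss: {train_loss:.4f}, test loss: {test_loss:.4f}, train accuracy: {train_acc:.2%}")
```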
4. Generation of Results
This audio-based speech emotion recognition system detects the emotion in the input speech.
The system was trained using a convolutional neural network model for 700 epochs. Figure 10
depicts the model loss graph of loss against epochs, and the evaluation metrics of the model
are reported below.
Figure 10: Model Loss Graph
The model accuracy, train loss, and test loss are all outputs. The trained CNN model's
output screen is shown in Figure 11. The model was trained to an accuracy of 91.04 percent
with train and test losses of 0.0358 and 0.3443, respectively. Table 2 summarizes the model's
evaluation metrics.
Metric     | Value
Accuracy   | 91.04%
Train loss | 0.0358
Test loss  | 0.3443
Table 2: Evaluation metrics of the CNN model
Figure 11: Final Output Screenshot
5. Conclusions
The proposed model recognizes the emotion of a given input audio clip and outputs the
emotion of the speech together with the speaker's gender. Nowadays, speech emotion
recognition is used in robotics and in online retail services: robots are trained with speech
emotion recognition to interact with humans more easily, and many online retail services
deploy their own bots to interact with customers, so speech emotion recognition is becoming
an increasingly important application. With this in mind, the speech emotion recognition
system was built using deep learning with a CNN model, which is considered one of the best
models for emotion classification tasks. In this system the CNN model achieved a 91.04%
accuracy rate; with more data, the system would likely have performed even better. The
system also did exceptionally well in distinguishing between male and female voices. This
work can be extended by integrating it with a robot to give it a better understanding of human
mood and thus more natural speech communication, by integrating it with music applications
to suggest songs to users according to their emotions, and by employing it in online shopping
applications such as Amazon to improve product recommendations for users.
References
[1] A. Zabidi, I. M. Yassin, H. A. Hassan, N. Ismail, M. M. A. M. Hamzah, Z. I. Rizman, and H.
Z. Abidin, “Detection of asphyxia in infants using deep learning Convolutional Neural
Network (CNN) trained on Mel Frequency Cepstrum Coefficient (MFCC) features extracted
from cry sounds,” Journal of Fundamental and Applied Sciences, vol. 9, no. 3S, p. 768, 2018
[2] Hajarolasvadi, N.; Demirel, H. 3D CNN-Based Speech Emotion Recognition Using K-Means
Clustering and Spectrograms. Entropy 2019, 21, 479. https://doi.org/10.3390/e21050479
[3] J. Salamon and J. P. Bello, "Deep Convolutional Neural Networks and Data Augmentation
for Environmental Sound Classification," in IEEE Signal Processing Letters, vol. 24, no. 3,
pp. 279-283, March 2017, doi: 10.1109/LSP.2017.2657381.
[4] Jianfeng Zhao, Xia Mao, Lijiang Chen, "Speech emotion recognition using deep 1D & 2D
CNN LSTM networks", Biomedical Signal Processing and Control, Volume 47, 2019,
https://doi.org/10.1016/j.bspc.2018.08.035.
[5] K. Zvarevashe and O. O. Olugbara, “Recognition of speech emotion using custom 2d-
convolution neural network deep learning algorithm,” Intelligent Data Analysis, vol. 24, no.
5, pp. 1065–1086, 2020.
[6] Lech Margaret, Stolar Melissa, Best Christopher, Bolia Robert, "Real-Time Speech Emotion
Recognition Using a Pre-trained Image Classification Network: Effects of Bandwidth
Reduction and Companding," in Frontiers in Computer Science, vol. 2, 2020, doi:
10.3389/fcomp.2020.00014.
[7] Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech
and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North
American English. PLoS ONE 13(5): e0196391.
https://doi.org/10.1371/journal.pone.0196391
[8] L. Stanković and M. Brajović, "Analysis of the Reconstruction of Sparse Signals in the DCT
Domain Applied to Audio Signals," in IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 26, no. 7, pp. 1220-1235, July 2018, doi:
10.1109/TASLP.2018.2819819.
[9] NithyaRoopa S., Prabhakaran M, Betty P, "Speech Emotion Recognition using Deep
Learning" International Journal of Recent Technology and Engineering (IJRTE), vol. 7,
issue. 4S, November 2018.
[10] Murugan, Harini. (2020). Speech Emotion Recognition Using CNN. International Journal of
Psychosocial Rehabilitation. 24. 10.37200/IJPR/V24I8/PR280260.
[11] Mustaqeem, M. Sajjad and S. Kwon, "Clustering-Based Speech Emotion Recognition by
Incorporating Learned Features and Deep BiLSTM," in IEEE Access, vol. 8, pp. 79861-
79875, 2020, doi: 10.1109/ACCESS.2020.2990405.
[12] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, “Learning Salient Features for Speech Emotion
Recognition Using Convolutional Neural Networks,” IEEE Transactions on Multimedia, vol.
16, no. 8, pp. 2203–2213, Dec. 2014. doi:10.1109/tmm.2014.2360798.
[13] R. A. Khalil, E. Jones, M. I. Babar, T. Jan, M. H. Zafar and T. Alhussain, "Speech Emotion
Recognition Using Deep Learning Techniques: A Review," in IEEE Access, vol. 7, pp.
117327-117345, 2019, doi: 10.1109/ACCESS.2019.2936124.
[14] S. Zhang, A. Chen, W. Guo, Y. Cui, X. Zhao and L. Liu, "Learning Deep Binaural
Representations With Deep Convolutional Neural Networks for Spontaneous Speech Emotion
Recognition," in IEEE Access, vol. 8, pp. 23496-23505, 2020, doi:
10.1109/ACCESS.2020.2969032.
[15] T. -W. Sun, "End-to-End Speech Emotion Recognition With Gender Information," in IEEE
Access, vol. 8, pp. 152423-152438, 2020, doi: 10.1109/ACCESS.2020.3017462.
[16] X. Cui, V. Goel and B. Kingsbury, "Data Augmentation for Deep Neural Network Acoustic
Modeling," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23,
no. 9, pp. 1469-1477, Sept. 2015, doi: 10.1109/TASLP.2015.2438544.
[17] Y. Qin, S. Member, T. Lee, A. Pak, and H. Kong, “Automatic Assessment of Speech
Impairment in Cantonese-speaking People with Aphasia,” IEEE J. Sel. Top. Signal Process.,
vol. PP, no. c, p. 1, 2019.
[18] http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
[19] https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html