SPEECH ENHANCEMENT AND EMOTION RECOGNITION ON NN CLASSIFIER

1 S.LOKESH, 2 BANUPRIYA.V, 3 JAYA PRETHEENA.P
1 Faculty, 2,3 UG Scholar, Department of Electronics and Communication Engineering, Vel Tech, Avadi - Alamathi Road, Avadi, Tamil Nadu, India.
Email: [email protected], [email protected], [email protected]

International Journal of Pure and Applied Mathematics, Volume 119, No. 15, 2018, ISSN: 1314-3395 (on-line version), url: http://www.ijpam.eu, Special Issue

ABSTRACT

This work introduces a split-based enhancement applied to the loaded speech signal. Automatic speech emotion recognition (SER) plays an important role in human-computer interface (HCI) systems; measuring people's emotions has dominated psychology by linking expressions to a group of basic emotions (i.e., anger, disgust, fear, happiness, sadness, and surprise). Our work describes speech enhancement and speech emotion recognition from the speech signal based on feature analysis and an NN classifier. Speech enhancement is performed based on the phase and frequency values of the particular signal. The recognition system involves speech emotion detection, feature extraction and selection, and finally classification. These features are useful to distinguish the maximum number of samples accurately, and the NN classifier based on discriminant analysis is used to classify the six different expressions.

Keywords: speech emotion recognition (SER), human computer interface (HCI), spectral subtraction (SS)

INTRODUCTION
Historically the sounds of spoken language have been studied at two different levels: (1) phonetic components of spoken words, e.g., vowel and consonant sounds, and (2) acoustic wave patterns. A language can be broken down into a very small number of basic sounds, called phonemes (English has approximately forty). An acoustic wave is a sequence of changing vibration patterns (generally in air); however, we are more accustomed to "seeing" acoustic waves as their electrical analog on an oscilloscope (time presentation) or spectrum analyzer (frequency presentation) [1,2]. Also seen in sound analysis are two-dimensional patterns called spectrograms, which display frequency (vertical axis) vs. time (horizontal axis) and represent the
signal energy as the figure intensity or colour. Generally, restricting the flow of air (in the vocal tract) generates what we call consonants. On the other hand, modifying the shape of the passages through which the sound waves produced by the vocal cords travel generates vowels. The power source for consonants is airflow producing white noise, while the power for vowels is vibration (rich in overtones) from the vocal cords [3,4,8,9,10]. The difference in the sound of spoken vowels such as 'A' and 'E' is due to differences in the formant peaks caused by the difference in the shape of your mouth when you produce the sounds. Henry Sweet is generally credited with starting modern phonetics in 1877 with his publishing of A Handbook of Phonetics. It is said that Sweet was the model for Professor Henry Higgins in the 1916 play Pygmalion by George Bernard Shaw. You may remember the story of Professor Higgins and Eliza Doolittle from the musical (and movie) My Fair Lady. The telephone companies studied speech production and recognition in an effort to improve the accuracy of word recognition by humans. Remember "nine" (N AY N vs. N AY AH N), shown here in one of the "standard" phoneme sets [5,6,7].
PROPOSED SYSTEM
The proposed system performs speech emotion recognition on transform-domain features through textural analysis and an NN classifier. The system involves an enhancement stage and a classification stage. The purpose of this module is to convert the speech waveform into features suitable for classification.

FIG 1 BLOCK DIAGRAM OF PROPOSED METHOD (input voice signal)
In Fig 1, the input signal is one of the frequencies within the part of the audio range used for transmission. In telephony, the usable voice-frequency band ranges from approximately 300 Hz to 3400 Hz.
Pre-processing

Sampling

Sampling is performed on a function varying in space, time, or any other dimension, and similar outputs are obtained in two or more dimensions. The sampling frequency or sampling rate fs is defined as the number of samples obtained in one second (samples per second; thus fs = 1/Ts).
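The relation fs = 1/Ts can be illustrated with a short sketch (not from the paper; the 440 Hz tone and 8 kHz rate are arbitrary illustrative choices):

```python
import numpy as np

fs = 8000          # sampling rate fs, in samples per second
Ts = 1.0 / fs      # sampling period Ts, seconds between samples

# 80 sample instants n*Ts cover 10 ms of signal at 8 kHz.
t = np.arange(80) * Ts
x = np.sin(2 * np.pi * 440 * t)    # sampled 440 Hz sinusoid

print(len(x))      # 80 samples in 10 ms
```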
Quantization

Quantization, in mathematics and DSP, is the step of mapping a large set of input values to a smaller set, such as by rounding values. An algorithm that performs quantization is called a quantizer. The error made by quantization is known as quantization error or round-off error. Quantization is involved to some degree in nearly all digital signal processing, as representing a signal in digital form normally entails rounding. Quantization also forms the core of nearly all lossy compression algorithms. The first type, commonly called rounding quantization, is the one mainly used in many applications: an approximate representation is chosen for a quantity that is to be recorded and used in further steps. This is the simple method of approximation used in everyday arithmetic. It may also include analog-to-digital conversion of a signal. Hence it is mandatory to retain as much signal fidelity as possible, and the design always centres on managing the approximation error.
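A minimal sketch of the rounding quantizer described above, assuming a mid-tread uniform design over [-x_max, x_max] (the paper does not specify a quantizer; the function name and parameters are illustrative):

```python
import numpy as np

def uniform_quantize(x, n_bits=8, x_max=1.0):
    """Mid-tread uniform quantizer: round each sample to the nearest
    multiple of the step size spanning [-x_max, x_max]."""
    step = 2.0 * x_max / (2 ** n_bits)    # quantization step size
    return step * np.round(x / step)      # rounding quantization

x = np.linspace(-1.0, 1.0, 1000)
xq = uniform_quantize(x, n_bits=8)
err = x - xq                              # quantization (round-off) error
print(np.max(np.abs(err)))                # at most step/2 = 1/256
```

Managing the approximation error here means choosing `n_bits`: each extra bit halves the step size and therefore the worst-case round-off error.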
Wavelet Decomposition

A wave is an oscillating function of time or space that is periodic; it is an infinite-length continuous function in time or space. In contrast, wavelets are localized waves. A wavelet is a waveform of effectively small duration that has an average value of zero.
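As a concrete instance of such a localized, zero-average waveform, a one-level Haar wavelet decomposition can be sketched as follows (the paper does not name a wavelet family; Haar is assumed here purely for illustration):

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar wavelet transform.
    Returns (approximation, detail) coefficients; len(x) must be even."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass: local averages
    detail = (even - odd) / np.sqrt(2.0)   # high-pass: local differences
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: perfect reconstruction."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

x = np.array([4.0, 2.0, 5.0, 7.0])
a, d = haar_dwt(x)
print(np.allclose(haar_idwt(a, d), x))    # True: perfect reconstruction
```

Applying `haar_dwt` repeatedly to the approximation coefficients yields the multi-level decomposition used later in the feature-extraction stage.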
Feature Extraction

A speech signal contains a huge number of parameters that reflect its emotional content; any change in these parameters indicates a change in emotion. The proper choice of feature vector is therefore most important, and there are many approaches toward automatic recognition of emotion. Feature vectors may be divided into long-time and short-time feature vectors. Long-time features are estimated once over almost the whole length of the utterance, whereas short-time ones are determined over windows of usually less than 100 ms. The long-time approach identifies emotions more efficiently. Short-time features capture, for example, interrogative phrases, which have a large pitch contour and a larger pitch standard deviation.
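Since the results section relies on entropy and zero-crossing features, a short-time sketch of both might look like this (function names and frame settings are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

def spectral_entropy(frame, eps=1e-12):
    """Shannon entropy of the magnitude spectrum, normalized to a
    probability distribution: low for tonal frames, high for noisy ones."""
    mag = np.abs(np.fft.rfft(frame))
    p = mag / (mag.sum() + eps)
    return -np.sum(p * np.log2(p + eps))

# 25 ms frames at fs = 8000 Hz: a pure tone vs. white noise.
fs = 8000
t = np.arange(int(0.025 * fs)) / fs
tone = np.sin(2 * np.pi * 200 * t)
noise = np.random.default_rng(0).standard_normal(len(t))

print(zero_crossing_rate(tone) < zero_crossing_rate(noise))  # True
print(spectral_entropy(tone) < spectral_entropy(noise))      # True
```

Both values computed per short-time window, stacked over the utterance, form the kind of feature vector the classifier below consumes.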
NN Classifier

Neural networks are predictive models loosely based on the action of biological neurons. The selection of the name "neural network" was one of the great PR successes of the twentieth century; a more prosaic description is "a network of weighted, additive values with nonlinear transfer functions". Hence, despite the name, neural networks are far from "thinking machines" or "artificial brains". A typical artificial neural network might have a hundred neurons. In comparison, the human nervous system is believed to have about 3×10^10 neurons. Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. The system that we will describe is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said. At the highest level, all speaker recognition systems contain two main modules: feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers.
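A minimal sketch of such a network of weighted, additive values with nonlinear transfer functions, assuming scikit-learn's MLPClassifier and synthetic two-dimensional feature vectors in place of real extracted features (both are assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for two-dimensional feature vectors (e.g. entropy
# and zero-crossing rate) of two emotion classes; values are illustrative.
n = 200
class_a = rng.normal([0.2, 0.6], 0.05, size=(n, 2))
class_b = rng.normal([0.8, 0.2], 0.05, size=(n, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * n + [1] * n)

# A small feed-forward network: one hidden layer of 8 neurons.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy; near 1.0 on this separable toy data
```

With well-separated clusters like these, even a tiny hidden layer separates the classes; real emotional speech features overlap far more, which is why classifier choice matters.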
MODULE 1
FIG 2 BLOCK DIAGRAM OF SPEAKER IDENTIFICATION (input speech → feature extraction → similarity against reference models for Speaker #1 … Speaker #N → maximum selection → identification result: speaker ID)
In Fig 2, all speaker recognition systems serve two distinct phases. The first is referred to as the enrolment or training phase, while the second is referred to as the operational or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, in addition, a speaker-specific threshold is also computed from the training samples. In the testing phase, the input speech is matched with the stored reference model(s) and a recognition decision is made.
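The enrolment/testing flow of Fig 2 can be sketched as a maximum-similarity lookup; cosine similarity and the hand-made reference vectors below are illustrative assumptions, not the paper's matching procedure:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity score between two feature vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(test_features, reference_models):
    """Return the registered speaker whose reference model is most
    similar to the test utterance's features (maximum selection)."""
    return max(reference_models,
               key=lambda spk: cosine_similarity(test_features,
                                                 reference_models[spk]))

# Enrolment: each registered speaker provides a reference model
# (here a hand-made average feature vector, for illustration only).
reference_models = {
    "speaker_1": np.array([0.9, 0.1, 0.3]),
    "speaker_2": np.array([0.2, 0.8, 0.5]),
}

# Testing: match the input utterance against the stored models.
utterance = np.array([0.85, 0.15, 0.25])
print(identify(utterance, reference_models))   # speaker_1
```

For verification rather than identification, the same similarity score would instead be compared against the speaker-specific threshold computed during training.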
MODULE 2
FIG 3 BLOCK DIAGRAM OF SPEAKER VERIFICATION
In Fig 3, speaker recognition is shown to be a difficult task. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance is the speaker himself/herself. Speech signals in training and testing sessions can be greatly different due to many facts, such as people's voices changing with time, health conditions (e.g. the speaker has a cold), speaking rates, and so on. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples of these are acoustical noise and variations in recording environments (e.g. the speaker uses different telephone handsets).
MODULE 3
FIG 4 BLOCK DIAGRAM OF SPEECH ENHANCEMENT
The fundamental purpose of speech is communication, i.e., the transmission of messages. A message represented as a sequence of discrete symbols can be quantified by its information content in bits, and the rate of transmission of information is measured in bits/second (bps). In Fig 4, in speech production, as well as in many human-engineered electronic communication systems, the information to be transmitted is encoded in the form of a continuously varying (analog) waveform that can be transmitted, recorded, manipulated, and ultimately decoded by a human listener. In the case of speech, the fundamental analog form of the message is an acoustic waveform, which we call the speech signal. Speech signals can be converted to an electrical waveform by a microphone, further manipulated by both analog and digital signal processing, and then converted back to acoustic form by a loudspeaker, a telephone handset or headphone, as desired. Signals are usually corrupted by noise in the real world. To reduce the influence of noise, two research topics have arisen: speech enhancement and speech recognition in noisy environments. For speech enhancement, i.e., the extraction of a signal buried in noise, adaptive noise cancellation (ANC) provides a good solution. In contrast to other enhancement techniques, its great strength lies in the fact that no a priori knowledge of signal or noise is required. This advantage is gained with the aid of a secondary input used to measure the noise source. The cancellation operation is based on the following principle: since the desired signal is corrupted by the noise, if the noise can be estimated from the noise source, this estimated noise can then be subtracted from the primary channel, yielding the desired signal. Traditionally, this task is done by linear filtering. In real situations, the corrupting noise is a nonlinearly distorted version of the source noise, so a nonlinear filter should be a better choice. In typical speech enhancement methods based on the STFT, only the magnitude spectrum is modified and the phase spectrum is kept unchanged. It was believed that the magnitude spectrum includes most of the information of the speech, while the phase spectrum contains little of it; furthermore, the human auditory system is relatively insensitive to phase. For the above reasons, in typical speech enhancement algorithms, such as spectral subtraction (SS), MMSE-STSA or MAP algorithms, the speech enhancement process operates on the spectral magnitude component only and keeps the phase component unchanged.
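The magnitude-only enhancement described above can be sketched as classical spectral subtraction; this is a simplified illustration (non-overlapping rectangular frames, a noise estimate from leading frames), not the paper's exact algorithm:

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, noise_frames=5, floor=0.01):
    """Magnitude-domain spectral subtraction: modify the magnitude
    spectrum only, keep the phase spectrum unchanged. Non-overlapping
    frames for simplicity (real systems use windowed overlap-add)."""
    n_frames = len(noisy) // frame_len
    frames = noisy[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)

    # Estimate the noise magnitude from leading (assumed speech-free) frames.
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract the estimate; floor the result so magnitudes stay positive.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    # Recombine the modified magnitude with the ORIGINAL phase spectrum.
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return clean.reshape(-1)

fs = 8000
rng = np.random.default_rng(0)
lead = 0.3 * rng.standard_normal(fs // 4)             # noise-only lead-in
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 300 * t) + 0.3 * rng.standard_normal(fs)
noisy = np.concatenate([lead, speech])
enhanced = spectral_subtraction(noisy)
```

Because only `mag` is altered while `phase` is reused verbatim, the sketch matches the SS/MMSE-STSA family's treatment of the phase spectrum.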
RESULTS
FIG 5 ANGRY SIGNAL

In Fig 5, the classification response for the angry signal is shown: the browser takes the recorded voice signal, then pre-processes and filters it. The wavelet transform is used to divide a continuous-time function into wavelets. The entropy and zero-crossing features are compared with the database to classify the emotion as angry. The extracted peak value is 0.5543 and the mean of the signal is -0.9941.
FIG 6 HAPPY SIGNAL

In Fig 6, the classification response for the happy signal is shown: the browser takes the recorded voice signal, then pre-processes and filters it. The wavelet transform is used to divide a continuous-time function into wavelets. The entropy and zero-crossing features are compared with the database to classify the emotion as happy. The extracted peak value is 0.2268 and the mean of the signal is -0.9305.
FIG 7 SAD SIGNAL

In Fig 7, the classification response for the sad signal is shown: the browser takes the recorded voice signal, then pre-processes and filters it. The wavelet transform is used to divide a continuous-time function into wavelets. The entropy and zero-crossing features are compared with the database to classify the emotion as sad. The extracted peak value is 1.0781 and the mean of the signal is -1.1281.
FIG 8 NORMAL SIGNAL

In Fig 8, the classification response for the normal signal is shown: the browser takes the recorded voice signal, then pre-processes and filters it. The wavelet transform is used to divide a continuous-time function into wavelets. The entropy and zero-crossing features are compared with the database to classify the emotion as normal. The extracted peak value is 0.7076 and the mean of the signal is -1.5265.
COMPARISON BETWEEN DIFFERENT EMOTIONAL SIGNALS

For NN classifier training, a database has been created for the above-mentioned four emotions for 3 persons. Their peak, minimum, and average values are calculated, and an overall mean value for each emotion of each individual is stored in a table. Based on Table 1, the neural network classifier is trained for each emotion. Thus, when an input signal is given, the neural network classifier compares its mean value with the values in the table and estimates the correct emotion of the speaker.
TABLE 1:CLASSIFICATION OF EMOTIONAL SIGNALS.
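The table-lookup classification described above can be sketched as a nearest-mean match. The stored mean values below are the ones reported for Figs 5-8; the function name and the single-feature lookup are illustrative simplifications of the NN classifier's comparison step:

```python
# Per-emotion mean signal values reported in the results section
# (Figs 5-8); the full Table 1 also stores peak and minimum values.
emotion_means = {
    "angry":  -0.9941,
    "happy":  -0.9305,
    "sad":    -1.1281,
    "normal": -1.5265,
}

def classify_by_mean(signal_mean):
    """Estimate the emotion whose stored mean value is closest
    to the mean of the input signal."""
    return min(emotion_means,
               key=lambda e: abs(emotion_means[e] - signal_mean))

print(classify_by_mean(-1.10))   # sad
```

The trained network effectively learns a smoothed version of this comparison, using all stored statistics rather than the mean alone.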
CONCLUSION
In this work, the most recent work done in the field of speech emotion recognition is discussed. The most-used methods of feature extraction and the performance of several classifiers are reviewed. Success in emotion recognition depends on appropriate feature extraction as well as proper classifier selection from the sample emotional speech. It can be seen that integration of various features can give a better recognition rate. Classifier performance needs to be improved for the recognition of speaker-independent systems. The application area of emotion recognition from speech is expanding, as it opens new means of communication between human and machine. It is necessary to model an effective method of speech feature extraction so that it can even provide emotion recognition of real-time speech. Speech emotion recognition has various applications in robotics and the medical field. The sole purpose of this process is to help in man-machine interaction. In robotics, a robot can recognize the emotional information contained in human speech signals for friendly interaction with human beings, and eventually satisfactory performance and effect are realized. In call-center applications, the experimental results indicate that the proposed method provides very stable and successful emotional classification performance, and it promises the feasibility of the agent for mobile communication services.
REFERENCES
[1] Y. Li and Y. Zhao, “Recognizing Emotions in Speech Using Short- Term and Long-Term Features,”
Proc. Int’l Conf. Spoken Language Processing, pp. 2255-2258, 1998.
[2] N. Virag, “Single channel speech enhancement based on masking properties of the human auditory
system,” IEEE Trans. Speech Audio Process., vol. 7, no. 2, pp. 126–137, Mar. 1999.
[3] P. Loizou, “Speech enhancement based on perceptually motivated bayesian estimators of the