SPEECH ENHANCEMENT AND EMOTION RECOGNITION ON NN CLASSIFIER

1 S.LOKESH, 2 BANUPRIYA.V, 3 JAYA PRETHEENA.P
1 Faculty, 2,3 UG Scholar, Department of Electronics and Communication Engineering, Vel Tech, Avadi - Alamathi Road, Avadi, Tamil Nadu, India.
Email: [email protected], [email protected], [email protected]

International Journal of Pure and Applied Mathematics, Volume 119, No. 15, 2018, ISSN: 1314-3395 (on-line version), url: http://www.ijpam.eu, Special Issue

ABSTRACT

This work introduces a split-based enhancement applied to the loaded speech signal. Automatic speech emotion recognition (SER) plays an important role in human-computer interface (HCI) systems; measuring people's emotions has dominated psychology by linking expressions to a group of basic emotions (i.e., anger, disgust, fear, happiness, sadness, and surprise). Our work describes speech enhancement and speech emotion recognition from the speech signal based on feature analysis and an NN classifier. Speech enhancement is performed based on the phase and frequency values of the particular signal. The recognition system involves speech emotion detection, feature extraction and selection, and finally classification. These features are useful to distinguish the maximum number of samples accurately, and the NN classifier based on discriminant analysis is used to classify the six different expressions.

Keywords: speech emotion recognition (SER), human computer interface (HCI), spectral subtraction (SS)

INTRODUCTION
Historically the sounds of spoken language have been studied at two different levels: (1) phonetic components of spoken words, e.g., vowel and consonant sounds, and (2) acoustic wave patterns. A language can be broken down into a very small number of basic sounds, called phonemes (English has approximately forty). An acoustic wave is a sequence of changing vibration patterns (generally in air); however, we are more accustomed to "seeing" acoustic waves as their electrical analog on an oscilloscope (time presentation) or spectrum analyzer (frequency presentation) [1,2]. Also seen in sound analysis are two-dimensional patterns called spectrograms, which display frequency (vertical axis) vs. time (horizontal axis) and represent the
signal energy as the figure intensity or colour. Generally, restricting the flow of air (in the vocal tract) generates what we call consonants. On the other hand, modifying the shape of the passages through which the sound waves produced by the vocal cords travel generates vowels. The power source for consonants is airflow producing white noise, while the power for vowels is vibration (rich in overtones) from the vocal cords [3,4,8,9,10]. The difference in the sound of spoken vowels such as 'A' and 'E' is due to differences in the formant peaks caused by the difference in the shape of your mouth when you produce the sounds. Henry Sweet is generally credited with starting modern phonetics in 1877 with his publishing of A Handbook of Phonetics. It is said that Sweet was the model for Professor Henry Higgins in the 1916 play Pygmalion by George Bernard Shaw. You may remember the story of Professor Higgins and Eliza Doolittle from the musical (and movie) My Fair Lady. The telephone companies studied speech production and recognition in an effort to improve the accuracy of word recognition by humans. Remember "nine" (N AY N vs. N AY AH N), shown here in one of the "standard" phoneme sets [5,6,7].
PROPOSED SYSTEM
The proposed system performs speech emotion recognition on transform-domain features through textural analysis and an NN classifier. The system involves an enhancement stage and a classification stage. The purpose of this module is to convert the speech waveform into features suitable for classification.

FIG 1 BLOCK DIAGRAM OF PROPOSED METHOD (input voice signal)
In Fig 1, the input signal is one of the frequencies within the part of the audio range used for transmission. In telephony, the usable voice-frequency band ranges from approximately 300 Hz to 3400 Hz.
Pre-processing

Sampling

Sampling is performed on a function varying in space, time, or any other dimension, and similar outputs are obtained in two or more dimensions. The sampling frequency or sampling rate fs is defined as the number of samples obtained in one second (samples per second; thus fs = 1/Ts).
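The relation fs = 1/Ts can be illustrated with a short sketch (not from the paper; the 440 Hz tone and 8 kHz rate are arbitrary illustrative choices):

```python
import numpy as np

fs = 8000          # sampling rate fs, in samples per second
Ts = 1.0 / fs      # sampling period Ts, seconds between samples

# 80 sample instants n*Ts cover 10 ms of signal at 8 kHz.
t = np.arange(80) * Ts
x = np.sin(2 * np.pi * 440 * t)    # sampled 440 Hz sinusoid

print(len(x))      # 80 samples in 10 ms
```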
Quantization

Quantization, in mathematics and DSP, is the step of mapping a large set of input values to a smaller set, such as by rounding values. An algorithm that performs quantization is called a quantizer. The error made by quantization is known as quantization error or round-off error. Quantization is involved to some degree in nearly all digital signal processing, as representing a signal in digital form normally entails rounding. Quantization also forms the core of nearly all lossy compression algorithms. The first type, commonly called rounding quantization, is the one mainly used in many applications: an approximate representation is chosen for a quantity that is to be recorded and used in further steps. This is the simple method of approximation used in everyday arithmetic. It may also include analog-to-digital conversion of a signal. Hence it is mandatory to retain as much signal fidelity as possible, and the design always centres on managing the approximation error.
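A minimal sketch of the rounding quantizer described above, assuming a mid-tread uniform design over [-x_max, x_max] (the paper does not specify a quantizer; the function name and parameters are illustrative):

```python
import numpy as np

def uniform_quantize(x, n_bits=8, x_max=1.0):
    """Mid-tread uniform quantizer: round each sample to the nearest
    multiple of the step size spanning [-x_max, x_max]."""
    step = 2.0 * x_max / (2 ** n_bits)    # quantization step size
    return step * np.round(x / step)      # rounding quantization

x = np.linspace(-1.0, 1.0, 1000)
xq = uniform_quantize(x, n_bits=8)
err = x - xq                              # quantization (round-off) error
print(np.max(np.abs(err)))                # at most step/2 = 1/256
```

Managing the approximation error here means choosing `n_bits`: each extra bit halves the step size and therefore the worst-case round-off error.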
Wavelet Decomposition

A wave is an oscillating function of time or space that is periodic; it is an infinite-length continuous function in time or space. In contrast, wavelets are localized waves. A wavelet is a waveform of effectively small duration that has an average value of zero.
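As a concrete instance of such a localized, zero-average waveform, a one-level Haar wavelet decomposition can be sketched as follows (the paper does not name a wavelet family; Haar is assumed here purely for illustration):

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar wavelet transform.
    Returns (approximation, detail) coefficients; len(x) must be even."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass: local averages
    detail = (even - odd) / np.sqrt(2.0)   # high-pass: local differences
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: perfect reconstruction."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

x = np.array([4.0, 2.0, 5.0, 7.0])
a, d = haar_dwt(x)
print(np.allclose(haar_idwt(a, d), x))    # True: perfect reconstruction
```

Applying `haar_dwt` repeatedly to the approximation coefficients yields the multi-level decomposition used later in the feature-extraction stage.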
Feature Extraction

A speech signal contains a huge number of parameters that reflect its emotional content; any change in these parameters indicates a change in emotion. The proper choice of feature vector is therefore most important, and there are many approaches toward automatic recognition of emotion. Feature vectors may be divided into long-time and short-time feature vectors. Long-time features are estimated once over almost the whole length of the utterance, whereas short-time ones are determined over windows of usually less than 100 ms. The long-time approach identifies emotions more efficiently. Short-time features capture, for example, interrogative phrases, which have a large pitch contour and a larger pitch standard deviation.
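Since the results section relies on entropy and zero-crossing features, a short-time sketch of both might look like this (function names and frame settings are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

def spectral_entropy(frame, eps=1e-12):
    """Shannon entropy of the magnitude spectrum, normalized to a
    probability distribution: low for tonal frames, high for noisy ones."""
    mag = np.abs(np.fft.rfft(frame))
    p = mag / (mag.sum() + eps)
    return -np.sum(p * np.log2(p + eps))

# 25 ms frames at fs = 8000 Hz: a pure tone vs. white noise.
fs = 8000
t = np.arange(int(0.025 * fs)) / fs
tone = np.sin(2 * np.pi * 200 * t)
noise = np.random.default_rng(0).standard_normal(len(t))

print(zero_crossing_rate(tone) < zero_crossing_rate(noise))  # True
print(spectral_entropy(tone) < spectral_entropy(noise))      # True
```

Both values computed per short-time window, stacked over the utterance, form the kind of feature vector the classifier below consumes.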
NN Classifier

Neural networks are predictive models loosely based on the action of biological neurons. The selection of the name "neural network" was one of the great PR successes of the twentieth century; a more prosaic description is "a network of weighted, additive values with nonlinear transfer functions". Hence, despite the name, neural networks are far from "thinking machines" or "artificial brains". A typical artificial neural network might have a hundred neurons. In comparison, the human nervous system is believed to have about 3×10^10 neurons. Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. The system that we will describe is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said. At the highest level, all speaker recognition systems contain two main modules: feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers.
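A minimal sketch of such a network of weighted, additive values with nonlinear transfer functions, assuming scikit-learn's MLPClassifier and synthetic two-dimensional feature vectors in place of real extracted features (both are assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for two-dimensional feature vectors (e.g. entropy
# and zero-crossing rate) of two emotion classes; values are illustrative.
n = 200
class_a = rng.normal([0.2, 0.6], 0.05, size=(n, 2))
class_b = rng.normal([0.8, 0.2], 0.05, size=(n, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * n + [1] * n)

# A small feed-forward network: one hidden layer of 8 neurons.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy; near 1.0 on this separable toy data
```

With well-separated clusters like these, even a tiny hidden layer separates the classes; real emotional speech features overlap far more, which is why classifier choice matters.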
MODULE 1
FIG 2 BLOCK DIAGRAM OF SPEAKER IDENTIFICATION (input speech → feature extraction → similarity against reference models for Speaker #1 … Speaker #N → maximum selection → identification result: speaker ID)
In Fig 2, all speaker recognition systems serve two distinct phases. The first is referred to as the enrolment or training phase, while the second is referred to as the operational or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, in addition, a speaker-specific threshold is also computed from the training samples. In the testing phase, the input speech is matched with the stored reference model(s) and a recognition decision is made.
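The enrolment/testing flow of Fig 2 can be sketched as a maximum-similarity lookup; cosine similarity and the hand-made reference vectors below are illustrative assumptions, not the paper's matching procedure:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity score between two feature vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(test_features, reference_models):
    """Return the registered speaker whose reference model is most
    similar to the test utterance's features (maximum selection)."""
    return max(reference_models,
               key=lambda spk: cosine_similarity(test_features,
                                                 reference_models[spk]))

# Enrolment: each registered speaker provides a reference model
# (here a hand-made average feature vector, for illustration only).
reference_models = {
    "speaker_1": np.array([0.9, 0.1, 0.3]),
    "speaker_2": np.array([0.2, 0.8, 0.5]),
}

# Testing: match the input utterance against the stored models.
utterance = np.array([0.85, 0.15, 0.25])
print(identify(utterance, reference_models))   # speaker_1
```

For verification rather than identification, the same similarity score would instead be compared against the speaker-specific threshold computed during training.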
MODULE 2
FIG 3 BLOCK DIAGRAM OF SPEAKER VERIFICATION
In Fig 3, speaker recognition is shown to be a difficult task. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance is the speaker himself/herself. Speech signals in training and testing sessions can be greatly different due to many facts, such as people's voices changing with time, health conditions (e.g. the speaker has a cold), speaking rates, and so on. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples of these are acoustical noise and variations in recording environments (e.g. the speaker uses different telephone handsets).
MODULE 3
FIG 4 BLOCK DIAGRAM OF SPEECH ENHANCEMENT
The fundamental purpose of speech is communication, i.e., the transmission of messages. A message represented as a sequence of discrete symbols can be quantified by its information content in bits, and the rate of transmission of information is measured in bits/second (bps). In Fig 4, in speech production, as well as in many human-engineered electronic communication systems, the information to be transmitted is encoded in the form of a continuously varying (analog) waveform that can be transmitted, recorded, manipulated, and ultimately decoded by a human listener. In the case of speech, the fundamental analog form of the message is an acoustic waveform, which we call the speech signal. Speech signals can be converted to an electrical waveform by a microphone, further manipulated by both analog and digital signal processing, and then converted back to acoustic form by a loudspeaker, a telephone handset or headphone, as desired. Signals are usually corrupted by noise in the real world. To reduce the influence of noise, two research topics have arisen: speech enhancement and speech recognition in noisy environments. For speech enhancement, i.e., the extraction of a signal buried in noise, adaptive noise cancellation (ANC) provides a good solution. In contrast to other enhancement techniques, its great strength lies in the fact that no a priori knowledge of signal or noise is required. This advantage is gained with the aid of a secondary input used to measure the noise source. The cancellation operation is based on the following principle: since the desired signal is corrupted by the noise, if the noise can be estimated from the noise source, this estimated noise can then be subtracted from the primary channel, yielding the desired signal. Traditionally, this task is done by linear filtering. In real situations, the corrupting noise is a nonlinearly distorted version of the source noise, so a nonlinear filter should be a better choice. In typical speech enhancement methods based on the STFT, only the magnitude spectrum is modified and the phase spectrum is kept unchanged. It was believed that the magnitude spectrum includes most of the information of the speech, while the phase spectrum contains little of it; furthermore, the human auditory system is relatively insensitive to phase. For the above reasons, in typical speech enhancement algorithms, such as spectral subtraction (SS), MMSE-STSA or MAP algorithms, the speech enhancement process operates on the spectral magnitude component only and keeps the phase component unchanged.
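The magnitude-only enhancement described above can be sketched as classical spectral subtraction; this is a simplified illustration (non-overlapping rectangular frames, a noise estimate from leading frames), not the paper's exact algorithm:

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, noise_frames=5, floor=0.01):
    """Magnitude-domain spectral subtraction: modify the magnitude
    spectrum only, keep the phase spectrum unchanged. Non-overlapping
    frames for simplicity (real systems use windowed overlap-add)."""
    n_frames = len(noisy) // frame_len
    frames = noisy[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)

    # Estimate the noise magnitude from leading (assumed speech-free) frames.
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract the estimate; floor the result so magnitudes stay positive.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    # Recombine the modified magnitude with the ORIGINAL phase spectrum.
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return clean.reshape(-1)

fs = 8000
rng = np.random.default_rng(0)
lead = 0.3 * rng.standard_normal(fs // 4)             # noise-only lead-in
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 300 * t) + 0.3 * rng.standard_normal(fs)
noisy = np.concatenate([lead, speech])
enhanced = spectral_subtraction(noisy)
```

Because only `mag` is altered while `phase` is reused verbatim, the sketch matches the SS/MMSE-STSA family's treatment of the phase spectrum.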
RESULTS
FIG 5 ANGRY SIGNAL

In Fig 5, the classification response for the angry signal is shown: the browser takes the recorded voice signal, then pre-processes and filters it. The wavelet transform is used to divide a continuous-time function into wavelets. The entropy and zero-crossing features are compared with the database to classify the emotion as angry. The extracted peak value is 0.5543 and the mean of the signal is -0.9941.
FIG 6 HAPPY SIGNAL

In Fig 6, the classification response for the happy signal is shown: the browser takes the recorded voice signal, then pre-processes and filters it. The wavelet transform is used to divide a continuous-time function into wavelets. The entropy and zero-crossing features are compared with the database to classify the emotion as happy. The extracted peak value is 0.2268 and the mean of the signal is -0.9305.
FIG 7 SAD SIGNAL

In Fig 7, the classification response for the sad signal is shown: the browser takes the recorded voice signal, then pre-processes and filters it. The wavelet transform is used to divide a continuous-time function into wavelets. The entropy and zero-crossing features are compared with the database to classify the emotion as sad. The extracted peak value is 1.0781 and the mean of the signal is -1.1281.
FIG 8 NORMAL SIGNAL

In Fig 8, the classification response for the normal signal is shown: the browser takes the recorded voice signal, then pre-processes and filters it. The wavelet transform is used to divide a continuous-time function into wavelets. The entropy and zero-crossing features are compared with the database to classify the emotion as normal. The extracted peak value is 0.7076 and the mean of the signal is -1.5265.
COMPARISON BETWEEN DIFFERENT EMOTIONAL SIGNALS

For NN classifier training, a database has been created for the above-mentioned four emotions for 3 persons. Their peak, minimum, and average values are calculated, and an overall mean value for each emotion of each individual is stored in a table. Based on Table 1, the neural network classifier is trained for each emotion. Thus, when an input signal is given, the neural network classifier compares its mean value with the values in the table and estimates the correct emotion of the speaker.
TABLE 1:CLASSIFICATION OF EMOTIONAL SIGNALS.
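The table-lookup classification described above can be sketched as a nearest-mean match. The stored mean values below are the ones reported for Figs 5-8; the function name and the single-feature lookup are illustrative simplifications of the NN classifier's comparison step:

```python
# Per-emotion mean signal values reported in the results section
# (Figs 5-8); the full Table 1 also stores peak and minimum values.
emotion_means = {
    "angry":  -0.9941,
    "happy":  -0.9305,
    "sad":    -1.1281,
    "normal": -1.5265,
}

def classify_by_mean(signal_mean):
    """Estimate the emotion whose stored mean value is closest
    to the mean of the input signal."""
    return min(emotion_means,
               key=lambda e: abs(emotion_means[e] - signal_mean))

print(classify_by_mean(-1.10))   # sad
```

The trained network effectively learns a smoothed version of this comparison, using all stored statistics rather than the mean alone.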
CONCLUSION
In this work, the most recent work done in the field of speech emotion recognition is discussed. The most-used methods of feature extraction and the performance of several classifiers are reviewed. Success in emotion recognition depends on appropriate feature extraction as well as proper classifier selection from the sample emotional speech. It can be seen that integration of various features can give a better recognition rate. Classifier performance needs to be improved for the recognition of speaker-independent systems. The application area of emotion recognition from speech is expanding, as it opens new means of communication between human and machine. It is necessary to model an effective method of speech feature extraction so that it can even provide emotion recognition of real-time speech. Speech emotion recognition has various applications in robotics and the medical field. The sole purpose of this process is to help in man-machine interaction. In robotics, a robot can recognize the emotional information contained in human speech signals for friendly interaction with human beings, and eventually satisfactory performance and effect are realized. In call-center applications, the experimental results indicate that the proposed method provides very stable and successful emotional classification performance, and it promises the feasibility of the agent for mobile communication services.
REFERENCES
[1] Y. Li and Y. Zhao, “Recognizing Emotions in Speech Using Short- Term and Long-Term Features,”
Proc. Int’l Conf. Spoken Language Processing, pp. 2255-2258, 1998.
[2] N. Virag, “Single channel speech enhancement based on masking properties of the human auditory
system,” IEEE Trans. Speech Audio Process., vol. 7, no. 2, pp. 126–137, Mar. 1999.
[3] P. Loizou, “Speech enhancement based on perceptually motivated bayesian estimators of the