
UNIVERSITY OF GAZIANTEP

SPEECH RECOGNITION

EEE 499 GRADUATION PROJECT

IN

ELECTRICAL & ELECTRONICS ENGINEERING

SUBMITTED TO

Doç. Dr. ERGUN ERÇELEBİ

BY

HÜSEYİN ÇELİK

Fall 2008


ABSTRACT

In this study, the properties of the human voice and the problem of speech recognition were studied, and a speech recognition tool was developed in MATLAB.

The tool is command-based and speaker-dependent, and it has a small vocabulary of at most 50 voice commands.

The project is mainly concerned with deriving a mathematical equivalent model of a voice command that is unique to that command, followed by building a library (database) of the models of the voice commands to be used. The recognition process compares the model of an incoming voice command with the models in the library; the best acceptable match that is above a recognition threshold percentage is assigned as the recognized command.

In this project, two different methods were used to derive the equivalent model of a voice command:

• Linear Predictive Coding Method (LPC)

• Mel Frequency Cepstrum Coefficients Method (MFCC)


ÖZET (Turkish abstract)

In this study, the properties of the human voice and the basic principles of speech recognition were examined, and a speech recognition program was developed on a computer using MATLAB.

The program was designed to be command-based and speaker-dependent, with a small vocabulary of 50 voice commands or fewer.

The project is generally concerned with constructing a mathematical equivalence model of a voice command that is unique to that command. A library (database) is then built from the equivalence models of the commands to be used in the program. In the recognition stage, the equivalence model of the command to be recognized is compared with the models in the library, and the command giving the best match is displayed on screen as the recognized command.

In this project, two different methods were tried for constructing the mathematical equivalence model of a voice command:

• Linear Predictive Coding Analysis Method (LPC)

• Mel Frequency Cepstral Coefficients Analysis Method (MFCC)


TABLE OF CONTENTS

ABSTRACT
ÖZET
1 CHAPTER I : INTRODUCTION AND OBJECTIVES
2 CHAPTER II : BASIC ACOUSTICS AND SPEECH SIGNAL
  2.1 The Speech Signal
  2.2 Speech Production
  2.3 Properties of Human Voice
3 CHAPTER III : SPEECH RECOGNITION
  3.1 Speech Recognition Tool
  3.2 Main Block Diagram of Speech Recognition Tool
  3.3 Speech Processing
    3.3.1 Speech Representation in Computer Environment
    3.3.2 Symbolic Representation of a Speech Signal
      3.3.2.1 Pre-Works on the Recorded Sound
      3.3.2.2 Feature Extracting
      3.3.2.3 Fingerprint Calculation
  3.4 Fingerprint Processing
    3.4.1 Fingerprints Library
    3.4.2 Fingerprints Comparison
    3.4.3 Decision
4 CHAPTER IV : Project Demonstration with Code Explanations
  4.1 Training Part and Building Commands Database
    4.1.1 Matlab Functions ( .m files)
  4.2 Recognition Part
    4.2.1 Matlab Functions ( .m files)
5 CHAPTER V : CONCLUSION
6 APPENDICES
  6.1 Appendix 1 : References
  6.2 Appendix 2 : Test Results
  6.3 Appendix 3 : Matlab Codes


1 CHAPTER I : INTRODUCTION AND OBJECTIVES

Speech is the most natural way to communicate for humans. While this has been

true since the dawn of civilization, the invention and widespread use of the telephone,

audio-phonic storage media, radio, and television have given even further importance to

speech communication and speech processing.

Advances in digital signal processing technology have led to the use of speech processing in many different application areas, such as speech compression, enhancement, synthesis, and recognition. In this project, the problem of speech recognition is studied and a speech recognition system is developed in MATLAB.

Speech recognition can be defined simply as the representation of a speech signal by a limited number of symbols. The aim is to find the written equivalent of the signal, and each voice command must have a unique equivalent model; obtaining such a model is the main and most important part of this study.

Speech recognition offers great advantages for human-computer interaction. Speech data are easy to obtain, and speaking requires no special skills such as using a keyboard or entering data by clicking buttons in GUI programs. Transferring text into electronic media by speech is about 8-10 times faster than handwriting, and about 4-5 times faster than the most skilled typist using a keyboard. Moreover, the user can continue entering text while moving or doing any work that requires the use of her hands. Since only a microphone or a telephone is needed, data entry is more economical, and data can be entered from a remote point over the telephone.


In this project,

• A speaker-dependent, small-vocabulary, isolated-word speech recognition system for noise-free environments was developed.

• To make the program practical and closer to an industrial product, a Windows executable (.exe) with a graphical user interface (GUI) was built.

• One of the main advantages of the designed tool is that the user does not need to press a button each time a voice command is spoken; the tool provides a continual recording and recognition process.

• The tool is easy to use: new voice commands can easily be added to the system, and the tool shows the steps of the processes.

• The featuring method and its parameters (order, frame length, recognition threshold percentage) can also be changed easily.

The project is reported in the following chapters:

• Chapter I gives an introduction to the project and its objectives.

• Chapter II introduces the general view of a speech signal, speech production, and the properties of the human voice.

• Chapter III is devoted to speech processing and the feature extraction techniques used in speech recognition, and introduces the Speech Recognition Tool.

• Chapter IV presents the project demonstration: a sample training and recognition session using the developed GUI application, with several screenshots.

• Chapter V gives the conclusion, discusses the project, and mentions future work.

• Appendices : References, Test Results, Matlab Codes.


2 CHAPTER II : BASIC ACOUSTICS AND SPEECH SIGNAL

As relevant background to the field of speech recognition, this chapter intends to

discuss how the speech signal is produced and perceived by human beings. This is an

essential subject that must be considered before deciding which approach to use for speech recognition.

2.1 The Speech Signal

Human communication can be viewed as a chain of processes from speech production to speech perception between the talker and the listener; see Figure II.1.

Figure II.1 - Schematic Diagram of the Speech Production/Perception Process.

The figure comprises five elements: A. Speech formulation, B. Human vocal mechanism, C. Acoustic air, D. Perception of the ear, E. Speech comprehension.


The first element (A.Speech formulation) is associated with the formulation of the

speech signal in the talker’s mind. This formulation is used by the human vocal

mechanism (B.Human vocal mechanism) to produce the actual speech waveform. The

waveform is transferred via the air (C.Acoustic air) to the listener. During this transfer

the acoustic wave can be affected by external sources, for example noise, resulting in a

more complex waveform. When the wave reaches the listener’s hearing system (the

ears) the listener percepts the waveform (D.Perception of the ear) and the listener’s

mind (E.Speech comprehension) starts processing this waveform to comprehend its

content so the listener understands what the talker is trying to tell him.

One issue in speech recognition is to "simulate" how the listener processes the speech produced by the talker. Several actions take place in the listener's head and hearing system during the processing of speech signals. The perception process can be seen as the inverse of the speech production process.

The basic theoretical units for describing how linguistic meaning is brought to the speech formed in the mind are called phonemes. Phonemes can be grouped based on the properties of either the time waveform or the frequency characteristics, and classified into the different sounds produced by the human vocal tract.

Speech:

• is a time-varying signal,

• is a well-structured communication process,

• depends on known physical movements,

• is composed of known, distinct units (phonemes),

• is different for every speaker,

• may be fast, slow, or varying in speed,

• may have high pitch, low pitch, or be whispered,

• is subject to widely varying types of environmental noise,

• may not have distinct boundaries between units (phonemes),

• has an unlimited number of words.


2.2 Speech Production

To understand how speech is produced, one needs to know how the human vocal mechanism is constructed; see Figure II.2. The most important parts of the human vocal mechanism are the vocal tract together with the nasal cavity, which begins at the velum. The velum is a trapdoor-like mechanism that is used to form nasal sounds when needed. When the velum is lowered, the nasal cavity is coupled to the vocal tract to form the desired speech signal. The cross-sectional area of the vocal tract is limited by the tongue, lips, jaw, and velum and varies from 0 to 20 cm².

When humans produce speech, air is expelled from the lungs through the trachea. The air flowing from the lungs causes the vocal cords to vibrate, and by shaping the vocal tract with the lips, tongue, and jaw, and sometimes using the nasal cavity, different sounds can be produced.

Figure II.2 - Human Vocal Mechanism


2.3 Properties of Human Voice

One of the most important parameters of a sound is its frequency; sounds are discriminated from each other with the help of their frequencies. As the frequency of a sound increases, the sound becomes higher-pitched and more piercing; as it decreases, the sound becomes deeper.

Sound waves arise from the vibration of materials. The highest frequency a human can produce is about 10 kHz, and the lowest is about 70 Hz; this interval varies from person to person. The magnitude of a sound is expressed in decibels (dB).

Normal human speech occupies the frequency interval 100 Hz - 3200 Hz, with magnitudes in the range 30 dB - 90 dB.

The human ear can perceive sounds in the frequency range between 16 Hz and 20 kHz, and its sensitivity to frequency change is about 0.5 %.

Speaker characteristics:

• Due to differences in vocal tract length, male, female, and children's speech are different.

• Regional accents appear as differences in resonant frequencies, durations, and pitch.

• Individuals have resonant frequency patterns and duration patterns that are unique (allowing us to identify the speaker).

• Training on data from one type of speaker automatically "learns" that group's or person's characteristics, which makes recognition of other speaker types much worse.


3 CHAPTER III : SPEECH RECOGNITION

The main goal of a speech recognition system is to substitute for a human listener, although it is very difficult for an artificial system to achieve the flexibility offered by the human ear and brain. Speech recognition systems therefore need to operate under some constraints; for instance, the number of words is a constraint for a word-based recognition system. In order to increase recognition performance, the process is dealt with in parts, and research is concentrated on those parts.

Speech recognition is the process of extracting linguistic information from speech signals. This linguistic information, the most important element of speech, is called phonetic information.

Although this project is a command-based speech recognition system, it is very difficult to identify a voice command by investigating it as a whole unit. Here, a finer-grained approach was preferred, as is done in phoneme-based systems: a single word (voice command) consists of phonemes, which is why the voice commands are investigated in 30 ms intervals.

The working principle of speech recognition systems is roughly based on the comparison of input data to prerecorded patterns; these patterns are the equivalent models of the voice commands saved during the training process. Through this comparison, the pattern to which the input data is most similar is accepted as the symbolic representation of the data. The preprocessing step is called Feature Extraction: first, short-time feature vectors are obtained from the input speech data, and then these vectors are compared to the patterns classified prior to comparison. The feature vectors extracted from the speech signal are required to represent the speech data well, to be of a size that can be processed efficiently, and to have distinct characteristics. Thus, obtaining a very clear distinction between speech sounds is the main goal of feature vector extraction.


3.1 Speech Recognition Tool

The Speech Recognition Tool consists of 4 main parts; see Figure III.1.

Figure III.1 – 4 main parts of a speech recognition system

In the Training process, the voice commands that are going to be used in recognition are defined to the system.

The second part is creating a library in which the fingerprints of the trained voice commands are stored.

The third part is the Recognition process: the user says a command, and the fingerprint of this command is compared with the fingerprints of the commands in the library. The best acceptable match that is above a threshold percentage is assigned as the recognized command.

The last part is Processing: while training the system, the user can also assign functionalities to the commands, and Processing is the part where the function of the recognized command is executed.


3.2 Main Block Diagram of Speech Recognition Tool

[Block diagram: a microphone feeds the recorded sound into Pre-Works, which yields a pure speech signal; Feature Extracting then produces the current fingerprint (the Speech Processing stage). The current fingerprint is compared with the database fingerprints from the Fingerprints Library, the comparison distances go to the Decision stage, and the function of the best-matched command is executed if recognized, or the system returns if under-matched (the Fingerprint Processing stage).]

Figure III.10 - Main Block Diagram of Speech Recognition Tool


3.3 Speech Processing

3.3.1 Speech Representation in Computer Environment

Human speech is brought into the computer through a microphone; the microphone input of the computer is used as an analog input unit. The sound waves are captured by the microphone as an analog signal, which is then converted to a digital signal. In this project, a sampling frequency of 8000 Hz and 256 quantization levels (8 bits/sample) are used.
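As a minimal sketch of this digitization step (written in Python for illustration rather than the project's MATLAB; the function name `quantize_8bit` and the sample tone are hypothetical), samples in [-1, 1] can be mapped onto 256 levels as follows:

```python
import math

def quantize_8bit(samples):
    """Map samples in [-1.0, 1.0] to 256 integer levels (8 bits/sample)."""
    levels = []
    for s in samples:
        s = max(-1.0, min(1.0, s))                        # clip to the valid range
        levels.append(int(round((s + 1.0) / 2.0 * 255)))  # 0 .. 255
    return levels

# One period of a 1 kHz tone sampled at 8000 Hz (8 samples per period)
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(8)]
codes = quantize_8bit(tone)
```

Each second of recording at these settings therefore occupies 8000 one-byte samples.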

The speech signal and all its characteristics can be represented in two different

domains, the time and the frequency domain.

See Figure III.2,

III.2 a) Representation of a speech signal in time domain

III.2 b) Representation of a speech signal in frequency domain

III.2 c) Spectrogram of a speech signal


Figure III.2 - Representation of speech signal in computer environment


3.3.2 Symbolic Representation of a Speech Signal

This is the part where the unique equivalent model of a speech signal is obtained. First, the sound is recorded with the microphone; then the digitized recording is processed and modified. Finally, a unique equivalent model is obtained, which we call the "fingerprint".

Figure III.3 - Processes applied for the symbolic representation of a speech signal.

The first part applies some pre-processing to the recorded signal; the aim is to obtain a pure speech signal, purified of noise and with the silent parts removed from the whole recording. This is one of the most important and difficult parts of the project. The start and end points of the speech must be found correctly: there should not be any silence at the beginning or at the end, and the final speech signal (voice command) must consist only of the unvoiced and voiced parts of speech. See Figure III.4.

The second part obtains the fingerprint of the voice command. The signal is split into small parts (frames) of 30 ms length, and each frame is investigated separately through its time-domain, frequency-domain, and power properties. Some modifications are applied to these frames, and finally all the frames are combined to form the fingerprint of the voice command, which is expected to be unique to that command.


3.3.2.1 Pre-Works on the Recorded Sound

Figure III.4 - Block Diagram of Pre-Works on the recorded Speech Signal.


A) Band-Pass Filter

Because of the electrical wiring of the computer and its environment, a strong interference known as 50 Hz mains noise occurs, and it must be rejected. Since the frequency of the human voice lies in the range 100 Hz - 3200 Hz, a band-pass FIR digital filter with cutoff frequencies of 70 Hz and 3200 Hz is used.

Another method is also used here to remove background noise: the signal is written to a .wav file and reconstructed. This method is quite effective at rejecting background noise.

B) Pre-emphasis Filter

The digitized speech signal is then processed by a first-order digital network in order to spectrally flatten it. Unvoiced frames have high frequency but low energy; in order to investigate unvoiced frames, a pre-emphasis filter is used. This filter is easily implemented in the time domain by taking a difference:

Y(n) = S(n) − 0.97 · S(n − 1)

Now we have a pure voice command Y(n), and it is ready for the feature extraction process.
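The difference equation above can be sketched directly (in Python for illustration rather than the project's MATLAB; the function name `pre_emphasis` is hypothetical):

```python
def pre_emphasis(s, alpha=0.97):
    """First-order high-pass filter: y[n] = s[n] - alpha * s[n-1]."""
    return [s[0]] + [s[n] - alpha * s[n - 1] for n in range(1, len(s))]

# A constant (DC) segment is almost fully suppressed after the first sample,
# while rapid sample-to-sample changes pass through largely unchanged.
y = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```

This is why the filter boosts the weak high-frequency (unvoiced) content relative to the strong low-frequency content.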

C) Short-Time Energy

After filtering the signal, its short-time energy is computed. This step is required for defining the starting and ending points of the speech.

The signal is separated into frames of 40 samples each, and the energy of each frame is found by summing the squared values of those 40 consecutive samples. This process continues over the whole signal.


The energy of the signal is found with the equation below:

P(i) = Σ_{n=1}^{40} X( 40·(i − 1) + n )² ,   i = 1, 2, 3, …, length(X)/40

D) Start – End Point Detection

After finding the short-time energy of the signal, a threshold value is derived from the mean energy, and the frames below this value are rejected: the threshold is the level at which the signal passes from the silence part to the speech part. We are interested in the voiced part of the recorded signal; the rest of the signal does not contain any required information. Moreover, any silence at the beginning or at the end of the signal will cause recognition failures, so finding the correct start and end points is very important.

The following equation is used to find a good threshold level:

Threshold = (1/4) · ( 40 / length(X) ) · Σ_{l=1}^{length(X)/40} P(l)

S = [ X(i) for the frames where P > Threshold ] ,   i = 1, 2, 3, …

i.e. only the frames whose energy is above the threshold are kept in the trimmed signal S.
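The energy-based trimming described above can be sketched as follows (in Python for illustration rather than the project's MATLAB; `frame_energies`, `trim_silence`, and the quarter-of-mean threshold constant follow the equations in this section, and the function names are hypothetical):

```python
def frame_energies(x, frame_len=40):
    """Short-time energy: sum of squares over consecutive 40-sample frames."""
    n_frames = len(x) // frame_len
    return [sum(v * v for v in x[i * frame_len:(i + 1) * frame_len])
            for i in range(n_frames)]

def trim_silence(x, frame_len=40):
    """Keep only the frames whose energy exceeds 1/4 of the mean frame energy."""
    p = frame_energies(x, frame_len)
    threshold = sum(p) / len(p) / 4.0
    kept = []
    for i, e in enumerate(p):
        if e > threshold:
            kept.extend(x[i * frame_len:(i + 1) * frame_len])
    return kept

# Silence, a loud burst, silence: only the burst survives the trimming
x = [0.001] * 40 + [0.5] * 40 + [0.001] * 40
speech = trim_silence(x)
```

The surviving samples form the pure voice command that is passed on to feature extraction.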


3.3.2.2 Feature Extracting

This stage is often referred to as the speech processing front end. The main goal of feature extraction is to simplify recognition by summarizing the vast amount of speech data without losing the acoustic properties that define the speech. See Figure III.5.

Figure III.5 - The Block Diagram of Feature Extracting Process.


A) Frame Blocking

Investigations show that speech signal characteristics stay stationary over a sufficiently short time interval (the signal is called quasi-stationary). For this reason, speech signals are processed in short time intervals. This interval must be chosen very carefully: the properties of the sound should not change much within it, yet it should be long enough to give sufficient information about the frame. The signal is therefore divided into frames of about 30 ms length; with an 8000 Hz sampling frequency, 30 ms = 240 samples. Each frame overlaps its predecessor by a predefined size, here chosen as half of the frame length. The goal of the overlapping scheme is to smooth the transition from frame to frame. See Figure III.6.
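The blocking scheme can be sketched as follows (in Python for illustration rather than the project's MATLAB; the function name `frame_blocking` is hypothetical, while the 240-sample frame and 120-sample overlap are the defaults stated above):

```python
def frame_blocking(x, frame_len=240, overlap=120):
    """Split a signal into frames of frame_len samples; each frame overlaps
    its predecessor by `overlap` samples (half the frame length by default)."""
    step = frame_len - overlap
    frames = []
    start = 0
    while start + frame_len <= len(x):
        frames.append(x[start:start + frame_len])
        start += step
    return frames

x = list(range(8000))        # 1 s of signal at the 8000 Hz sampling rate
frames = frame_blocking(x)   # 240-sample (30 ms) frames, 120-sample hop
```

Because of the half-frame hop, the second half of every frame reappears as the first half of the next one, which is exactly what smooths the frame-to-frame transition.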

B) Frame Windowing

Each frame is multiplied by an N-sample window W(n); here a Hamming window is used. The Hamming window minimizes the adverse effects of chopping an N-sample section out of the running speech signal: while creating the frames, this chopping can distort the signal parameters, and windowing reduces that effect. Windowing also smooths the side lobes of the formant frequencies and eliminates discontinuities at the edges of the frames. See Figure III.7.

Each frame is multiplied point-by-point by the window function:

W(n) = 0.54 − 0.46 · cos( 2πn / (N − 1) ) ,   0 ≤ n ≤ N − 1

S(n) = X(n) · W(n)
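The window above can be generated and applied directly (sketched in Python rather than the project's MATLAB; the names `hamming` and `apply_window` are hypothetical):

```python
import math

def hamming(N):
    """Hamming window: W(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    """Point-by-point product of a frame with the window (not a convolution)."""
    w = hamming(len(frame))
    return [x * wn for x, wn in zip(frame, w)]

# A flat frame is tapered toward 0.08 at both edges
windowed = apply_window([1.0] * 240)
```

Note that the operation is an element-wise product in the time domain; multiplying by the window tapers each frame's edges so that the chopped boundaries contribute almost nothing.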


Figure III.6 - Frame Blocking: the speech signal is separated into 7 frames, each of T ms length.

Figure III.7 - Windowing: each frame is multiplied by the window function.


C) Featuring Method

This is the part where the framed and windowed speech signal is converted to its symbolic representation vector (fingerprint vector). Two different methods are available: the MFCC method and the LPC method.

1) MFCC – Mel Frequency Cepstrum

The mel-frequency cepstral coefficients are obtained from the log magnitude spectrum of the Fourier transform of the signal by taking the Discrete Cosine Transform.

Figure III.8 – Block Diagram of Obtaining Mel Frequency Cepstral Coefficients.


a) Fast Fourier Transform ( FFT )

The next important step in processing the signal is to obtain the frequency spectrum of each block. The information in the frequency spectrum is often enough to identify the frame. The purpose of the frequency spectrum is to identify the formants, which are the peaks in the spectrum. One method of obtaining a frequency spectrum is to apply an FFT to each block. The resulting information can be examined manually to find the peaks, but it is quite noisy, which makes the task difficult for a computer. The spectrum of each frame is obtained from the formula below:

X(ω) = (1/T) · Σ_{n=1}^{T} X(n) · e^{−jωn}

b) Mel-frequency Wrapping

The human ear perceives frequencies non-linearly. Research shows that the scaling is linear up to 1 kHz and logarithmic above that. The Mel-scale (melody scale) filter bank, which characterizes the human ear's perception of frequency, is shown in Figure III.9; it is used as a band-pass filter bank at this stage of the identification. The signal of each frame is passed through the Mel-scaled band-pass filters to mimic the human ear.

m = 2595 · log10( 1 + f / 700 ) ,   f : normal frequency (Hz) → m : Mel-scaled frequency
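The warping formula above is easy to evaluate (sketched in Python rather than the project's MATLAB; the name `hz_to_mel` is hypothetical):

```python
import math

def hz_to_mel(f):
    """Mel-scale warping of a frequency in Hz: m = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

# Roughly linear below 1 kHz, increasingly compressed (logarithmic) above it
mels = [hz_to_mel(f) for f in (100.0, 1000.0, 4000.0)]
```

Going from 1 kHz to 4 kHz quadruples the physical frequency but far less than quadruples the mel value, which is the compression the filter bank exploits.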


Figure III.9 - Mel-Scaled Filter Bank.

c) Mel Frequency Cepstral Coefficients

As the final step, each frame is transformed back to the time domain. Instead of the inverse FFT, the Discrete Cosine Transform is used, as it is more appropriate. The discrete form for a signal x(n) is defined as

X_k = Σ_{n=0}^{N−1} x_n · cos( (π/N) · (n + 1/2) · k ) ,   k = 0, 1, 2, 3, …, N − 1
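The discrete cosine transform above can be sketched directly (in Python for illustration rather than the project's MATLAB; the function name `dct` is hypothetical, and this is the plain unnormalized DCT-II matching the formula):

```python
import math

def dct(x):
    """DCT-II: X_k = sum_{n=0}^{N-1} x_n * cos(pi/N * (n + 1/2) * k)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

coeffs = dct([1.0, 2.0, 3.0, 4.0])  # coeffs[0] is the sum of the inputs
```

For a constant input, all coefficients except X_0 vanish, which is why the low-order cepstral coefficients capture the smooth envelope of the log mel spectrum.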

As a result of this process, the mel-frequency cepstral coefficients are obtained; these coefficients form the feature vectors. In this project, 12 mel-frequency cepstral coefficients and 12 delta cepstral coefficients are generated per frame and used as the feature matrix, so this configuration is called 12th-order MFCC. This is the default value for the MFCC order, and the order can easily be changed from the control panel (GUI). After the cepstral coefficients are generated, cepstral mean normalization (CMN) is applied to get rid of the bias present across the coefficients. The fingerprint matrix is an n-by-m matrix consisting of n frames with m coefficients in each frame; here m equals 2 times the order, so the default is m = 2 × 12 = 24.


2) LPC – Linear Predictive Coding

LPC models speech production as a linear sum of earlier samples, using a digital filter driven by an excitation signal. An alternative explanation is that linear prediction filters attempt to predict future values of the input signal from past samples.

Linear Predictive Coding is the method used to model the vocal tract filter, the H(z) in the diagram below. This filter is an all-pole filter, consisting only of poles.

X(n) → H(z) → X̃(n)

H(z) = 1 / A(z) ,   A(z) = 1 + Σ_{j=2}^{p+1} a(j) · z^{−(j−1)} ,   p → order

coefficient vector: [ 1, a(2), a(3), … , a(p+1) ]

X̃(n) = − a(2)·X(n−1) − a(3)·X(n−2) − … − a(p+1)·X(n−p)

With these equations, LPC estimates the current value of a sequence x[n] from its previous values. In this project an order of 11 is used, which is the order of the digital filter describing the feature vector; the order can easily be changed from the control panel (GUI). After finding the LPC filter coefficients for each frame, these coefficients are converted to a digital FIR-type filter, and the fingerprint matrix (an n-by-m matrix) is created, consisting of n frames with m coefficients in each frame. Here m equals the frame length (samples per frame) divided by 12; the default is m = 240 / 12 = 20.
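The prediction step can be sketched as follows (in Python for illustration rather than the project's MATLAB; the function name `lpc_predict` and the test signal are hypothetical, and here `a` holds the prediction weights, i.e. the negated filter coefficients a(2)…a(p+1)):

```python
def lpc_predict(x, a):
    """Predict x[n] from the p previous samples: xhat[n] = sum_j a[j]*x[n-1-j]."""
    p = len(a)
    return [sum(a[j] * x[n - 1 - j] for j in range(p)) for n in range(p, len(x))]

# With a single weight of 1.0, each sample is predicted to equal its
# predecessor, which is exact for a constant signal.
pred = lpc_predict([5.0, 5.0, 5.0, 5.0], [1.0])
```

In the real system the weights are not fixed by hand but estimated per frame so that the prediction error over that frame is minimized.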


3.3.2.3 Fingerprint Calculation

Fingerprint matrices consist of n frames and m coefficients in each frame.

1) MFCC – Mel Frequency Cepstral Coefficients

           | f11  f12  f13  …  f1m |
           | f21  f22   …       ⋮  |
F_MFCC  =  | f31        ⋱       ⋮  |   (n × m)
           |  ⋮              ⋱  ⋮  |
           | fn1   …    …   …  fnm |

2) LPC – Linear Predictive Coding Coefficients

           | f11  f12  f13  …  f1m |
           | f21  f22   …       ⋮  |
F_LPC   =  | f31        ⋱       ⋮  |   (n × m)
           |  ⋮              ⋱  ⋮  |
           | fn1   …    …   …  fnm |


3.4 Fingerprint Processing

After the feature vector of a voice command is obtained, the system is ready for the recognition process.

[Block diagram: the current fingerprint is compared with the database fingerprints from the Fingerprints Library; the comparison distances go to the Decision stage, which executes the function of the best-matched command if recognized, or returns if the match is below threshold.]

Figure III.10 - Fingerprint Processing and Recognition Process.


3.4.1 Fingerprints Library

As the system is a command-based speech recognition system, the voice commands that are going to be used must be determined before the recognition process starts. Each voice command is recorded 3 times, yielding 3 patterns for that command. These three patterns are saved to the library under the name of that command, numbered 1 to 3.

This process is repeated for each voice command to be used, so finally 3 × n fingerprints (for n voice commands) are saved to the library.
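The library structure can be sketched as a simple mapping (a hypothetical in-memory sketch in Python; the actual tool saves fingerprint files from MATLAB, and `library`, `add_command`, and the tiny example fingerprints are all illustrative assumptions):

```python
# Each command name maps to its 3 fingerprint matrices, one per training
# recording, mirroring the "name + numbers 1 to 3" file scheme.
library = {}

def add_command(name, fingerprints):
    """Store the 3 training patterns recorded for one voice command."""
    assert len(fingerprints) == 3, "each command is recorded 3 times"
    library[name] = fingerprints

add_command("open", [[[0.1, 0.2]], [[0.1, 0.3]], [[0.2, 0.2]]])
add_command("close", [[[0.9, 0.8]], [[0.8, 0.8]], [[0.9, 0.7]]])
```

With n commands stored this way, the library holds 3 × n fingerprints, exactly as described above.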

3.4.2 Fingerprints Comparison

This is the part where the fingerprint of the current voice command (the command to be recognized) is compared with the fingerprints in the library. Both the current and the library feature vectors are matrices, and the comparison is done by calculating the squared Euclidean distances between the current and the library fingerprint matrices, frame by frame. Each row (frame) of the current fingerprint is compared with every row (frame) of the library fingerprint, and after all these comparisons one comparison matrix is obtained for each current-library pair.

The fingerprints will generally have different numbers of frames, because the voice commands do not all have equal lengths in the time domain; as mentioned before, the number of frames represents the length of the voice command. Even patterns of the same command will have different lengths and hence different numbers of frames. What does not change is the number of coefficients in each frame, because all the fingerprints are obtained with the same method and the same order: by default, 24 coefficients per frame are generated in the MFCC method and 20 in the LPC method.


Here is an example of fingerprints comparison of current fingerprint compared with

one of the fingerprints in the library.

11 12 13 14

21 22 23 24

31 32 33 34 3 4

3

4

C

Current Fingerprint Matrix

frames

coeffs each frame

c c c c

F c c c c

c c c c×

=

11 12 13 14

21 22 23 24

31 32 33 34

41 42 43 44

51 52 53 54 5 4

5

4

L

Library Fingerprint Matrix

frames

coeffs each frame

l l l l

l l l l

F l l l l

l l l l

l l l l×

=

The comparison starts with the fingerprint that has the smaller size. Here the current fingerprint has a size of (3 × 4), which is smaller than the size of the library fingerprint (5 × 4), so we start with the current fingerprint. We then represent the fingerprints in the following form,

$$F_C = \begin{bmatrix} Frame_1 \\ Frame_2 \\ Frame_3 \end{bmatrix} = \begin{bmatrix} C_1 \\ C_2 \\ C_3 \end{bmatrix}, \qquad F_L = \begin{bmatrix} Frame_1 \\ Frame_2 \\ Frame_3 \\ Frame_4 \\ Frame_5 \end{bmatrix} = \begin{bmatrix} L_1 \\ L_2 \\ L_3 \\ L_4 \\ L_5 \end{bmatrix}$$

$$C_1 = [\,c_{11} \; c_{12} \; c_{13} \; c_{14}\,], \; \ldots, \; C_3 = [\,c_{31} \; c_{32} \; c_{33} \; c_{34}\,]$$

$$L_1 = [\,l_{11} \; l_{12} \; l_{13} \; l_{14}\,], \; \ldots, \; L_5 = [\,l_{51} \; l_{52} \; l_{53} \; l_{54}\,]$$

And the comparison matrix is ( $\|$ denotes the comparison operator ):

$$D = \begin{bmatrix} (C_1 \| L_1) & (C_1 \| L_2) & (C_1 \| L_3) & (C_1 \| L_4) & (C_1 \| L_5) \\ (C_2 \| L_1) & (C_2 \| L_2) & (C_2 \| L_3) & (C_2 \| L_4) & (C_2 \| L_5) \\ (C_3 \| L_1) & (C_3 \| L_2) & (C_3 \| L_3) & (C_3 \| L_4) & (C_3 \| L_5) \end{bmatrix}$$

where each entry is the squared Euclidean distance between the two frames:

$$(C_1 \| L_1) = \sum_{n=1}^{4} (c_{1n} - l_{1n})^2, \qquad (C_x \| L_y) = \sum_{n=1}^{4} (c_{xn} - l_{yn})^2$$

$$D_{3 \times 5} = \begin{bmatrix} d_{11} & d_{12} & d_{13} & d_{14} & d_{15} \\ d_{21} & d_{22} & d_{23} & d_{24} & d_{25} \\ d_{31} & d_{32} & d_{33} & d_{34} & d_{35} \end{bmatrix}, \qquad d_{xy} = (C_x \| L_y)$$

The first $3 \times 3$ block is the square matrix, whose diagonal holds the frame-to-frame comparisons ( $d_{nn}$ is the comparison distance of the $n$th frames of the current and library fingerprints ); the remaining columns form the extension matrix, which appears because the frame numbers are not equal.

After finding the comparison matrix in the above form, we apply an optimization. Fingerprints of the same voice command will have different numbers of frames but contain the same information overall, and the optimization is done in order to minimize this time-warping effect. The process also produces much higher distances when the frame numbers of the two fingerprints are not equal; generally, patterns of the same voice command have close frame numbers.

Optimization is done with the technique below,

$$D_{optimum} = [\,D_1 \; D_2 \; D_3 \; D_4 \; D_5\,]$$

For the square-matrix part:

$$D_1 = \min(d_{11}, d_{12})$$
$$D_2 = \min(d_{21}, d_{22}, d_{23})$$
$$D_3 = \min(d_{32}, d_{33}, d_{34})$$
$$D_n = \min(d_{n(n-1)},\, d_{nn},\, d_{n(n+1)}), \qquad n = 1, 2, \ldots, r$$

That is, each distance is the minimum of the diagonal parameter and the parameters to its left and right.

For the extension-matrix part ( $n = r+1, \ldots, c$ ):

$$D_4 = D_5 = 3 \times \mathrm{mean}(d_{34}, d_{35})$$
$$D_n = 3 \times \mathrm{mean}(d_{r(r+1)}, \ldots, d_{rc}), \qquad n = r+1, \ldots, c$$

That is, each of these distances is equal to 3 times the mean of the distances between the last frame of the current fingerprint and the extra frames of the library fingerprint, or vice versa.

The final distance is the square root of the sum of the optimum distance values:

$$D_{final} = \sqrt{\sum_{n=1}^{c} D_n}$$
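The frame-to-frame comparison and the optimization above can be sketched in Python with NumPy (an illustrative re-implementation of the procedure described in this section, not the thesis's actual MATLAB code; the function name is hypothetical):

```python
import numpy as np

def compare_fingerprints(current, library):
    """Distance between two fingerprint matrices (frames x coeffs),
    using the square-matrix / extension-matrix optimization."""
    # Start with the fingerprint that has fewer frames.
    if current.shape[0] > library.shape[0]:
        current, library = library, current
    r, c = current.shape[0], library.shape[0]

    # d[x, y] = squared Euclidean distance between frame x of the
    # shorter fingerprint and frame y of the longer one.
    d = ((current[:, None, :] - library[None, :, :]) ** 2).sum(axis=2)

    D = np.empty(c)
    # Square-matrix part: minimum of the diagonal entry and its
    # left and right neighbours in the same row.
    for n in range(r):
        lo, hi = max(n - 1, 0), min(n + 1, c - 1)
        D[n] = d[n, lo:hi + 1].min()
    # Extension part: 3 x the mean distance from the last frame of
    # the shorter fingerprint to the extra frames of the longer one.
    if c > r:
        D[r:] = 3.0 * d[r - 1, r:].mean()

    # Final distance: square root of the sum of the optimum values.
    return float(np.sqrt(D.sum()))
```

Identical fingerprints give a distance of zero, and extra frames in the longer fingerprint are penalized through the extension term, matching the behaviour described above.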

3.4.3 Decision

After comparing the current fingerprint with all the library fingerprints and finding the distances, we obtain 3 distance values for each voice command in the library. All the distances are then combined and we find their minimum, because the minimum distance gives the best match: the current fingerprint is most likely similar to that command. When we find the best fingerprint match, we have also found what was said in the current voice command, provided it exists in the library.

Here there seems to be a problem. Our system is a command-based system with a limited number of voice commands. If a command that exists in the library is said to the tool, there is no problem; the best match will most probably ( ~95% ) be what we expected. But what if a command that the system does not know is said to the tool? There will again be a minimum distance, the tool will determine it as the best match, and the name of the matched library command will be shown in the listbox; but the result will be wrong, because that new command does not exist in the library. In order not to fall into this mistake, we determine a matching threshold level in the form of a percentage. Before assigning the command as recognized or unrecognized, we first check this matching percentage: if the level is above the determined value, the command is assigned as recognized and its name is shown in the listbox with its matching percentage; if the level is below the determined value, the command is assigned as not recognized and “???” is shown in the listbox.

The advantage of determining a matching threshold percentage is that commands not in the library are not recognized, but the technique I used for this is only valid for small libraries of about 10 to 20 commands, because as the number of commands in the library increases, the probability of the current fingerprint matching one of them also increases. After this part, the tool returns to the beginning and listens for new voice commands automatically and continually.

The matching percentage is obtained from the final distance and the library fingerprint:

$$\text{Matching Percentage} = \frac{D_{final}}{\displaystyle\sum_{j=1}^{c} \sum_{i=1}^{r} \left[\text{Library FingerPrint}\right]_{r \times c}}$$
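The decision step above can be sketched in Python (an illustrative sketch only; the function name, the dictionary layout, and in particular the distance-to-percentage conversion are hypothetical stand-ins for the thesis's MATLAB logic):

```python
def decide(distances, threshold_percent):
    """Pick the best-matching library command or reject it.

    distances        : dict mapping command name -> list of final
                       distances (one per stored pattern, here 3)
    threshold_percent: recognition threshold, e.g. 20.0
    """
    # Best match: the command whose closest pattern is nearest overall.
    best_name = min(distances, key=lambda name: min(distances[name]))
    best_dist = min(distances[best_name])

    # Illustrative matching percentage: 100% at zero distance,
    # decaying as the distance grows (not the thesis's exact formula).
    matching = 100.0 / (1.0 + best_dist)

    if matching >= threshold_percent:
        return best_name, matching   # recognized
    return "???", matching           # not recognized
```

A far-away best match falls below the threshold and is reported as “???”, exactly as described for out-of-vocabulary commands.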

4 CHAPTER IV : Project Demonstration with Code Explanations

4.1 Training Part and Building the Commands Database

Matlab m-files used in this part:

• Training.m , Training.fig

• Record_Training.m

• Featuring.m

• Melcepst_met.m

• Lpc_met.m

• Plotts.m

• Save_Mat.m

• Lib_Commands.mat

We start the demonstration chapter with the training process: we introduce the voice commands to the tool and save them to the “Library” folder.

To start the training tool, we run the “Training.m” function. This function has a graphical user interface. See Figure IV.1.


Figure IV.1 - A screenshot of the Training GUI.

• Listbox : Previously recorded voice commands. There are 3 patterns from each of them

saved in the “Library” folder.

• Remove Button : Removes the selected command from the listbox and deletes 3 patterns

of it from the library.

• Record Button : Records a new voice command with the name entered in the Command textbox. When pressed, it runs three consecutive recording processes.

• Featuring Method Panel : Applies entered parameters to the recorded speech signal while

finding its fingerprint.

• Featuring Method Selection Box : 1. MFCC Method , 2. LPC Method

• Order Textbox : Order of the applied method.

• Frame Length Textbox : Length of the frames in ms. for the framing process.

• Okay Button : Saves the current data to the library.

• Reject Button : Rejects the current data.

• Play Button : Plays back the currently recorded speech.

• Featuring Matrix Size Textbox : Displays the size of the current fingerprint. (#frame x #Coeff.)


When the Record button is pressed, the recording loop starts:

[Flowchart: Record_Training.m → Featuring.m → (Method 1: Melcepst_met.m / Method 2: Lpc_met.m) → Plotts.m → Save_Mat.m → back to Record_Training.m]

Figure IV.2 - Running sequence of the Matlab functions used in the Training Process.


4.1.1 Matlab Functions (.m files)

Training.m

This is the main function of the Training process. All the sub-functions are called inside this function, arranged in a sequence. Configuring and initializing the tool are also done in this part.

Record_Training.m

The recording process is operated in this function. When the Record button is pressed, the system enters a recording loop, and recording continues until 3 separate patterns of a voice command have been saved to the library.

Recording sound from the microphone is realized by creating an analog input object, which delivers an output every 1000 samples of the recorded signal. If the energy of a frame is greater than the defined threshold value, the tool appends the frames end to end to a variable, until a frame arrives whose energy is smaller than this threshold. After obtaining the speech signal, it is filtered, normalized and sent to the “Featuring.m” function.
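The energy-gated capture just described can be sketched in Python (an illustrative sketch; the function name and the frame format are assumptions, and the thesis's actual implementation is the MATLAB analog input loop):

```python
import numpy as np

def capture_command(frames, energy_threshold):
    """Collect consecutive frames whose energy exceeds a threshold.

    'frames' is an iterable of 1-D sample arrays (e.g. 1000 samples
    each), standing in for the analog input object's output.
    """
    speech = []
    started = False
    for frame in frames:
        energy = float(np.sum(frame ** 2))
        if energy > energy_threshold:
            speech.append(frame)   # speech frame: append end to end
            started = True
        elif started:
            break                  # first quiet frame after speech: stop
    return np.concatenate(speech) if speech else np.array([])
```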

Featuring.m

In this part the recorded speech signal is first converted to a pure speech signal by finding the start point and end point of the speech; the silence parts are then removed from the recorded signal. This is realized by computing the energy of the speech and removing the frames below the defined threshold level.

After obtaining the pure voice signal, its fingerprint is found with the selected method and method specifications. The data and variables are then sent to the “Plotts.m” function.

In the training tool, the featuring process is executed just for inspecting the obtained fingerprint and certifying whether or not it is suitable.
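The start/end point detection described above can be sketched as follows (an illustrative Python sketch; the names and the exact energy measure are assumptions):

```python
import numpy as np

def trim_endpoints(signal, frame_len, threshold):
    """Return the signal between the first and last frame whose
    energy exceeds the threshold (start/end point detection)."""
    n_frames = len(signal) // frame_len
    energies = [float(np.sum(signal[i * frame_len:(i + 1) * frame_len] ** 2))
                for i in range(n_frames)]
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return np.array([])        # no frame above threshold: no speech
    start, end = active[0], active[-1]
    return signal[start * frame_len:(end + 1) * frame_len]
```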


Plotts.m

This is the function where all the results are shown on the GUI panel: the recorded signal, its energy, the start and end points, the extracted pure speech (command) signal, and the featuring matrix and its size are plotted. If the user certifies that the data are suitable, he presses the Okay button and the “Save_Mat.m” function is called, which saves the voice command to the library. If the user does not certify the data, he presses the Reject button and all the current data and variables are cleared. In both cases the recording process restarts automatically. It stops when the Okay button has been pressed 3 times, which means 3 separate patterns of the new voice command have been saved to the library, or automatically if nothing is said for 6.25 seconds.

Save_Mat.m

When it is called, it saves the current recorded command under the name entered in the command textbox. The signal is written to a .mat file with that name plus an appended pattern number (1, 2 or 3).
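The saving scheme can be sketched in Python (the thesis writes MATLAB .mat files; this sketch uses NumPy .npy files and mirrors only the “command name + pattern number” naming convention):

```python
from pathlib import Path
import numpy as np

def save_pattern(library_dir, command_name, pattern_no, signal):
    """Save one recorded pattern as '<command name><pattern no>'."""
    path = Path(library_dir) / f"{command_name}{pattern_no}.npy"
    np.save(path, signal)  # the thesis saves a .mat file here
    return path
```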

Melcepst_met.m - Lpc_met.m

If the selected Featuring Method on the GUI panel is “1- MFCC”, the function “Melcepst_met.m” is called; if it is “2- LPC”, the function “Lpc_met.m” is called for the feature extraction process in the “Featuring.m” function.

Lib_Commands.mat

The names of the saved voice commands are stored in this file and displayed in the listbox on the GUI.


4.2 Recognition Part

Matlab m-files used in this part:

• SpeechRecognition.m , SpeechRecognition.fig

• Record_Recognition.m

• Featuring.m

• Melcepst_met.m

• Lpc_met.m

• Compare.m

• Disteusq.m

• Library_Call.m

• Lib_Commands.mat

After building the fingerprint database, the tool is ready for the recognition process. To start the Speech Recognition Tool, we run the “SpeechRecognition.m” function. This function has a graphical user interface. See Figure IV.3.


Figure IV.3 - A screenshot of the Speech Recognition GUI.

• Listbox : Names of the voice commands that exist in the library. When a command is recognized, its name is highlighted in the listbox; if it is not recognized, the “???” line is highlighted.

• Mode Selection Box : Selection of the process, 1-Recognition , 2-Training.

• Featuring Method Panel : Applies entered parameters to the recorded speech signal while

finding its fingerprint. Parameters are same as they are in the Training Tool.

• Recognition Level Textbox : Defines the Recognition level threshold percentage.

• Start Button : Starts Recognition Process, and listening for voice commands to be recognized.

• Stop Button : Stops all the processes.

• Energy Textbox : Shows that the recording process is running and displays the energy of the recorded frames.

• Recognition Result and Matching level Textbox : Displays the current recognized command

and its matching percentage.


When the Start button is pressed, the recording loop and the recognition process start:

Figure IV.4 - Running sequence of the Matlab functions used in the Recognition Process.


4.2.1 Matlab Functions (.m files)

SpeechRecognition.m

This is the main function of the Recognition process. Configuring and initializing the tool are done in this part.

Record_Recognition.m

The recording process is operated in this function; all the sub-functions are called inside it, arranged in a sequence. If Training Mode is selected and the Start button is pressed, the “Training.m” function is called. If Recognition Mode is selected and the Start button is pressed, the system enters an infinite recording loop, and recording continues until the Stop button is pressed or nothing is said for 10 seconds. Before the analog input object is started, the “Library_Call.m” function is called to load the database voice commands.

Recording sound and the other processes are the same as in the Training process (“Record_Training.m”), as they should be, because the processes applied to a speech signal must be exactly the same in training and in recognition.

After obtaining the speech signal, it is filtered, normalized and sent to the “Featuring.m” function.

Library_Call.m

All the voice commands stored as .mat files in the “Library” folder are read, and their fingerprints are found by calling the “Featuring.m” function; these fingerprints are then assigned to variables to be used in the comparison (“Compare.m”) process.
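Loading the library can be sketched in Python (illustrative only; the thesis reads .mat files and calls “Featuring.m”, here stood in by .npy files and a featuring callback):

```python
from pathlib import Path
import numpy as np

def load_library(library_dir, featuring):
    """Read every saved pattern and compute its fingerprint.

    'featuring' stands in for the feature-extraction step; patterns
    are grouped under the command name by dropping the trailing
    pattern number (e.g. 'sol1' -> 'sol')."""
    library = {}
    for path in sorted(Path(library_dir).glob("*.npy")):
        name = path.stem[:-1]  # drop trailing pattern number 1-3
        library.setdefault(name, []).append(featuring(np.load(path)))
    return library
```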


Featuring.m

After obtaining the pure voice signal, its fingerprint is found with the selected method and method specifications. The data and variables are then sent to the “Compare.m” function.

Melcepst_met.m - Lpc_met.m

If the selected Featuring Method on the GUI panel is “1- MFCC”, the function “Melcepst_met.m” is called; if it is “2- LPC”, the function “Lpc_met.m” is called for the feature extraction process in the “Featuring.m” function.

Compare.m

This is the function where the current fingerprint is compared with the database fingerprints. For the comparison, the “Disteusq.m” function is executed for each pair. After obtaining the comparison results (the distances), the comparison giving the minimum distance is determined, which is the best match. If this match satisfies the required conditions, the name of that library command is displayed with its matching percentage on the GUI, and its name is highlighted in the listbox.

Disteusq.m

This function finds the Euclidean distance between two fingerprints. The output is a single number that represents the difference between the two input matrices.


5 CHAPTER V : CONCLUSION

In this project, the basic principles and properties of human speech were investigated and digital signal processing techniques for speech signals were studied. Finally, a speaker-dependent, small-vocabulary, isolated-word speech recognition system for noise-free environments was developed, and an executable Windows file (.exe) with a graphical user interface (GUI) was built to make the program practical and closer to industrial use.

One of the main advantages of the designed tool is that you do not need to press a button each time you are going to say a voice command: the tool provides the user with a continual recording and recognition process, which makes it easy to use. New voice commands can easily be added to the system, and the tool shows the steps of the processes. The featuring method and its parameters (order, frame length, recognition threshold percentage) can easily be changed on the panel.

The tool could also be used for dictation with some small changes in the algorithm; this is one of the future works of this project.

Two different methods were used for feature extraction: Mel Frequency Cepstral Coefficients (MFCC) and the Linear Predictive Coding method (LPC). These two methods were tried with different parameters, and tests were made with different characteristics. It was observed that the MFC coefficients are a more robust and reliable feature set for speech recognition than the LPC coefficients, given the sensitivity of the low-order MFC coefficients to overall spectral slope and of the high-order MFC coefficients to noise. It was also seen that LPC approximates speech linearly at all frequencies, whereas MFCC is more robust and takes into account the psychoacoustic properties of the human auditory system. In both methods, increasing the order increases the recognition efficiency. More training patterns per voice command give better results, and the speaker's talking style does not affect recognition very much.

Considering the recognition time and the optimization parameters, the orders were chosen as 12 (12 MFCC – 12 Delta coefficients) for the MFCC method and 11 for the LPC method. To keep the system from working slowly, 3 patterns were taken for each voice command.

The optimum frame length was found to be 30 ms: speech in less than 30 ms does not contain enough information, while more than 30 ms contains more information than needed. That is why 30 ms was found to be the optimum, containing exactly what is needed.

After the overall tests and observations (see Appendix 2), it was seen that environmental noise had a bad effect on recognition. The best results were obtained in test 6 (see Appendix 2, Table 6), which was run with one speaker, 50 library commands, 50 testing commands, 5 tries (5×50 = 250 commands), 3 patterns for each command, a 20% recognition threshold percentage and 24 MFC coefficients (12 Mel – 12 Delta). An overall efficiency of 96.8% was obtained: of 250 voice commands tried, only 8 were recognized wrongly.

Another good result was obtained in test 3 (see Appendix 2, Table 3). Two patterns were taken from each of 3 speakers for every voice command, and a library of 30 voice commands was created, with 6×30 = 180 patterns in total. The system was then tested by the 3 speakers on these 30 commands, with a 20% recognition threshold percentage and 24 MFC coefficients; the overall result was 95%, meaning that of 180 voice commands tried, only 9 were recognized wrongly. In this test it was observed that taking patterns from different speakers makes the system less dependent on speaker characteristics, moving the speaker-dependent system toward speaker independence.

These results seem good, but in fact they are not. When I first started this project, the aim was to obtain an efficiency of 99 percent, which is the minimum value for a speech recognition system to be used in commercial or industrial applications. I know that this difference occurred because of my own algorithms, such as the start- and end-point detection algorithm, the comparison algorithm and the others I used in this project. These algorithms are still at the testing stage; I will improve or replace some of them in the future works of this project.


6 APPENDICES


6.2 Appendix 2 : Test Results

The framing length is 30 ms in all tests.

Tests with one speaker, 20 library commands, 30 testing commands, 50% recognition threshold percentage:

• Test with 1 pattern, 24 MFCC coefficients (12 Mel – 12 Delta) - See Table 1
• Test with 3 patterns, 24 MFCC coefficients (12 Mel – 12 Delta) - See Table 2

Tests with three speakers, 30 library commands, 30 testing commands, 6 (3×2) patterns for each command, 20% recognition threshold percentage:

• Test with 24 MFCC coefficients (12 Mel – 12 Delta) - See Table 3
• Test with 11 LPC coefficients - See Table 4

Tests with one speaker, 50 library commands, 50 testing commands, 3 patterns for each command, 20% recognition threshold percentage:

• Test with 16 MFCC coefficients (8 Mel – 8 Delta) - See Table 5
• Test with 24 MFCC coefficients (12 Mel – 12 Delta) - See Table 6
• Test with 32 MFCC coefficients (16 Mel – 16 Delta) - See Table 7
• Test with 7 LPC coefficients - See Table 8
• Test with 11 LPC coefficients - See Table 9
• Test with 15 LPC coefficients - See Table 10


Table 1

Commands / Test No 1 2 3 4 5 Total

1 sol sol stop sol sol sol 80%

2 sağ sağ geri sağ stop sağ 60%

3 ileri geri ileri ileri geri ileri 60%

4 geri geri ileri geri geri geri 80%

5 dur ??? dur close dur dur 60%

6 left left left ??? left left 80%

7 right right right right right right 100%

8 go go go go go go 100%

9 back back ??? back back back 80%

10 stop stop stop stop stop stop 100%

11 aç aç aç aç aç aç 100%

12 kapat kapat kapat kapat kapat kapat 100%

13 open open ??? open open open 80%

14 close close close close close close 100%

15 forward close forward forward forward forward 80%

16 backward ??? samet backward kapat backward 40%

17 ali ali ali ali ali ali 100%

18 samet samet samet samet samet samet 100%

19 fatma ??? ??? fatma ??? fatma 40%

20 ayşe ayşe ayşe ayşe ayşe ayşe 100%

Total 80% 65% 90% 80% 100% 83%

21 aşağı ??? ??? ??? ??? ??? 100%

22 yukarı ??? ??? ??? ??? ??? 100%

23 hızlı ??? close ??? ??? ??? 80%

24 yavaş yavaş ??? ??? ??? left 60%

25 fast back back ??? ??? back 40%

26 slow close ??? close ??? ??? 60%

27 gel ??? left ??? left ??? 60%

28 git ??? ??? geri ??? geri 60%

29 merve ??? ??? ??? ??? ??? 100%

30 hüseyin ??? ileri ??? ??? ??? 80%

Total 76.7% 63.3% 86.7% 83.3% 90% 80%


Table 2

Test No / Commands

1 2 3 4 5 Total

1 sol sol sol sol sol sol 100%

2 sağ sağ sağ sağ sağ sağ 100%

3 ileri ileri ileri geri ileri ileri 80%

4 geri geri geri geri geri geri 100%

5 dur dur dur dur dur dur 100%

6 left left left left left left 100%

7 right right right right right right 100%

8 go go go go go go 100%

9 back back back back back back 100%

10 stop stop stop stop stop stop 100%

11 aç aç aç back aç aç 80%

12 kapat kapat kapat kapat kapat kapat 100%

13 open open open open open open 100%

14 close close close close close close 100%

15 forward forward forward forward forward forward 100%

16 backward backward backward backward backward backward 100%

17 ali ali ali ali ali ali 100%

18 samet samet samet samet samet samet 100%

19 fatma ??? fatma fatma fatma fatma 80%

20 ayşe ayşe ayşe ayşe ayşe ayşe 100%

Total 95% 100% 90% 100% 100% 97%

21 aşağı ??? ??? ayşe ??? ??? 80%

22 yukarı ??? ali ??? ??? ??? 80%

23 hızlı close ??? ??? ??? ??? 80%

24 yavaş ??? ??? ??? ??? ??? 100%

25 fast back ??? back back ??? 40%

26 slow close ??? close close ??? 40%

27 gel ??? left ??? left ??? 60%

28 git ??? ??? ??? ??? ??? 100%

29 merve ileri ??? ??? ileri ??? 60%

30 hüseyin ??? ??? ??? ??? ??? 100%

Total 83.3% 93.3% 83.3% 86.7% 100% 89.3%


Table 3

Speaker 1 Speaker 2 Speaker 3

Test No / Commands

1 2 1 2 1 2 Total

1 sol sol sol sol sol sol sol 100%

2 sağ sağ sağ sağ sağ sağ sağ 100%

3 ileri ileri ileri geri ileri ileri ileri 83.3%

4 geri geri geri geri geri geri geri 100%

5 dur dur dur dur dört dur dur 83.3%

6 left left left left left left left 100%

7 right right right right right right right 100%

8 go go go go go go go 100%

9 back dur back dört back back back 66.6%

10 stop stop stop stop stop stop stop 100%

11 aç aç aç aç aç stop aç 83.3%

12 kapat kapat kapat kapat kapat kapat kapat 100%

13 open open open open open open open 100%

14 close close close close close close close 100%

15 forward forward forward forward forward forward forward 100%

16 backward backward backward backward backward backward backward 100%

17 ali ali ali ali ali ali ali 100%

18 samet samet samet samet samet samet samet 100%

19 fatma fatma fatma fatma fatma fatma fatma 100%

20 ayşe ayşe ayşe ayşe ayşe ayşe ayşe 100%

21 bir bir geri bir bir bir geri 66.6%

22 iki iki iki iki iki iki iki 100%

23 üç üç üç üç üç üç üç 100%

24 dört dört dört dört dört dört dört 100%

25 beş üç beş beş beş beş beş 83.3%

26 altı altı altı altı altı altı altı 100%

27 yedi yedi yedi geri yedi yedi yedi 83.3%

28 sekiz sekiz sekiz sekiz sekiz sekiz sekiz 100%

29 dokuz dokuz dokuz dokuz dokuz dokuz dokuz 100%

30 on on on on on on on 100%

Total (per test) 93.3% 96.6% 90% 96.6% 96.6% 96.6%

Total (per speaker) 95% 93.3% 96.6%

Overall 95%


Table 4

Speaker 1 Speaker 2 Speaker 3

Test No / Commands

1 2 1 2 1 2 Total

1 sol forward sol sol forward ??? sol 50%

2 sağ sağ sağ sağ sağ sağ sağ 100%

3 ileri ileri geri ileri geri ileri ileri 66.6%

4 geri geri geri geri geri geri geri 100%

5 dur dur dur dur dur dur dur 100%

6 left left left left left left left 100%

7 right right right right right right right 100%

8 go go dört go dört right go 50%

9 back back back back back back back 100%

10 stop stop stop stop stop stop stop 100%

11 aç aç aç aç aç aç aç 100%

12 kapat kapat kapat kapat kapat kapat kapat 100%

13 open open open open open open open 100%

14 close close dur right close right close 50%

15 forward forward forward forward forward forward forward 100%

16 backward backward backward backward backward backward backward 100%

17 ali ali ali samet ali ali aç 66.6%

18 samet samet samet samet samet samet samet 100%

19 fatma open fatma fatma fatma fatma fatma 83.3%

20 ayşe ayşe ayşe ayşe ayşe ayşe ayşe 100%

21 bir bir bir bir bir bir bir 100%

22 iki iki iki iki iki iki iki 100%

23 üç üç üç üç üç üç üç 100%

24 dört dört dört dört dört dört dört 100%

25 beş beş beş beş beş beş beş 100%

26 altı altı samet altı altı altı altı 83.3%

27 yedi geri yedi geri geri yedi yedi 50%

28 sekiz sekiz sekiz sekiz sekiz sekiz sekiz 100%

29 dokuz ??? close dokuz dokuz dokuz close 50%

30 on sol sol on ??? on on 50%

Total (per test) 83.3% 80% 90% 83.3% 90% 93.3%

Total (per speaker) 81.6% 86.6% 91.6%

Overall 86.6%


Table 5

Test No / Commands

1 2 3 4 5 Total

1 sol sol sol sol sol sol 100%

2 sağ sağ sağ sağ sağ one 80%

3 ileri ileri geri geri ileri ileri 60%

4 geri geri geri geri geri geri 100%

5 dur dur dur dur two dur 80%

6 left left left left left left 100%

7 right right right ten right right 80%

8 go go go go go go 100%

9 back back back dört dört back 60%

10 stop stop stop stop stop stop 100%

11 aç aç five aç aç aç 80%

12 kapat kapat kapat kapat kapat kapat 100%

13 open open open open open open 100%

14 close close close close close close 100%

15 forward forward forward forward forward forward 100%

16 backward backward backward backward backward backward 100%

17 yukarı yukarı yukarı dört yukarı yukarı 80%

18 aşağı aşağı aşağı aşağı aşağı aşağı 100%

19 hızlı hızlı hızlı hızlı hızlı hızlı 100%

20 yavaş yavaş yavaş yavaş yavaş yavaş 100%

21 one on one close one one 60%

22 two two two two two two 100%

23 three three three three three three 100%

24 four sol four four close four 60%

25 five five five five five five 100%

26 six six six six six bir 100%

27 seven seven seven seven seven seven 100%

28 eight eight eight eight eight eight 100%

29 nine nine nine nine nine nine 100%

30 ten ten ten ten ten ten 100%

31 bir bir six bir bir bir 80%

32 iki iki hüseyin iki iki iki 80%

33 üç üç üç üç üç üç 100%

34 dört dört dört dört dört dört 100%

35 beş beş beş beş beş beş 100%

36 altı altı altı altı altı altı 100%

37 yedi yedi yedi yedi yedi yedi 100%

38 sekiz sekiz sekiz sekiz sekiz sekiz 100%

39 dokuz dokuz dokuz dokuz dokuz dokuz 100%

40 on on on on on on 100%

41 hüseyin six hüseyin hüseyin hüseyin hüseyin 80%

42 mustafa mustafa mustafa ??? sağ mustafa 60%

43 samet samet samet samet samet samet 100%

44 ali ali ali ali ali ali 100%

45 hasan hasan hasan hasan hasan hasan 100%

46 fatma fatma fatma altı fatma fatma 80%

47 merve merve merve merve merve merve 100%

48 eda eda eda eda eda eda 100%

49 ayşe ayşe ayşe ayşe ayşe ayşe 100%

50 hüsniye hüsniye bir hüsniye hüsniye hüsniye 80%

Total 94% 90% 86% 90% 96% 91.2%


Table 6

Test No / Commands

1 2 3 4 5 Total

1 sol sol sol sol sol sol 100%

2 sağ close sağ sağ sağ sağ 80%

3 ileri ileri ileri ileri ileri ileri 100%

4 geri geri geri geri geri geri 100%

5 dur dur dur dur dur dur 100%

6 left left left left left left 100%

7 right right right right right right 100%

8 go go go go go go 100%

9 back back back back dört back 80%

10 stop stop stop stop stop stop 100%

11 aç aç aç aç aç aç 100%

12 kapat kapat kapat kapat kapat kapat 100%

13 open open open open open open 100%

14 close close close close close close 100%

15 forward forward forward forward forward forward 100%

16 backward backward backward backward backward backward 100%

17 yukarı yukarı yukarı yukarı yukarı yukarı 100%

18 aşağı aşağı aşağı aşağı aşağı aşağı 100%

19 hızlı hızlı hızlı hızlı hızlı hızlı 100%

20 yavaş yavaş yavaş yavaş seven yavaş 80%

21 one one one one one one 100%

22 two two two two two two 100%

23 three three three three three three 100%

24 four four four four close four 80%

25 five five five five five five 100%

26 six six bir six six six 80%

27 seven seven seven seven seven seven 100%

28 eight eight eight eight eight eight 100%

29 nine nine nine nine nine nine 100%

30 ten ten ten ten ten ten 100%

31 bir bir bir bir bir bir 100%

32 iki iki iki iki iki iki 100%

33 üç üç üç üç üç üç 100%

34 dört dört dört dört dört dört 100%

35 beş beş beş beş beş beş 100%

36 altı altı altı altı altı altı 100%

37 yedi yedi yedi yedi yedi yedi 100%

38 sekiz sekiz sekiz sekiz sekiz sekiz 100%

39 dokuz dokuz dokuz dokuz dokuz dokuz 100%

40 on on on on on on 100%

41 hüseyin hüseyin six hüseyin hüseyin hüseyin 80%

42 mustafa mustafa sağ mustafa mustafa mustafa 80%

43 samet samet samet samet samet samet 100%

44 ali ali ali ali ali ali 100%

45 hasan hasan hasan seven hasan hasan 80%

46 fatma fatma fatma fatma fatma fatma 100%

47 merve merve merve merve merve merve 100%

48 eda eda eda eda eda eda 100%

49 ayşe ayşe ayşe ayşe ayşe ayşe 100%

50 hüsniye hüsniye hüsniye hüsniye hüsniye hüsniye 100%

Total 98% 94% 98% 94% 100% 96.8%


Table 7

No  Command  Test 1  Test 2  Test 3  Test 4  Test 5  Accuracy

1 sol sol on sol sol sol 80%

2 sağ close sağ sağ sağ sağ 80%

3 ileri ileri ileri ileri ileri ileri 100%

4 geri geri geri geri geri geri 100%

5 dur dur dur dur dur dur 100%

6 left left left left left left 100%

7 right right right right right right 100%

8 go go go go go go 100%

9 back back ten dört dört back 60%

10 stop stop stop stop stop stop 100%

11 aç aç aç aç aç aç 100%

12 kapat kapat kapat kapat kapat kapat 100%

13 open open open open open open 100%

14 close close close close close close 100%

15 forward forward forward forward forward forward 100%

16 backward backward backward backward backward backward 100%

17 yukarı yukarı yukarı yukarı dört yukarı 80%

18 aşağı aşağı aşağı aşağı aşağı aşağı 100%

19 hızlı hızlı hızlı hızlı hızlı hızlı 100%

20 yavaş seven yavaş yavaş yavaş yavaş 80%

21 one one one one one one 100%

22 two two two two left two 80%

23 three three three three three three 100%

24 four on four four four one 60%

25 five five five five five five 100%

26 six six six six six six 100%

27 seven seven seven seven seven seven 100%

28 eight eight eight eight eight eight 100%

29 nine nine nine nine nine nine 100%

30 ten ten ten ten ten ten 100%

31 bir six bir bir bir bir 80%

32 iki iki iki iki iki iki 100%

33 üç üç üç üç üç üç 100%

34 dört dört dört dört dört dört 100%

35 beş beş beş beş beş beş 100%

36 altı altı altı altı altı altı 100%

37 yedi yedi yedi yedi yedi yedi 100%

38 sekiz sekiz sekiz sekiz sekiz sekiz 100%

39 dokuz dokuz dokuz dokuz dokuz dokuz 100%

40 on on on on on on 100%

41 hüseyin hüseyin hüseyin hüseyin eight hüseyin 80%

42 mustafa hasan mustafa kapat mustafa mustafa 60%

43 samet samet samet samet samet samet 100%

44 ali ali ali ali ali ali 100%

45 hasan hasan hasan hasan hasan hasan 100%

46 fatma fatma fatma fatma fatma fatma 100%

47 merve merve merve merve merve merve 100%

48 eda eda eda eda eda eda 100%

49 ayşe ayşe ayşe ayşe ayşe ayşe 100%

50 hüsniye hüsniye hüsniye hüsniye bir hüsniye 80%

Total 90% 96% 96% 90% 98% 94%


Table 8

No  Command  Test 1  Test 2  Test 3  Test 4  Test 5  Accuracy

1 sol sol close close sol on 40%

2 sağ sağ stop sağ sağ sağ 80%

3 ileri ileri ileri geri ileri geri 60%

4 geri bir geri bir geri geri 60%

5 dur six yedi dur dur dur 60%

6 left left left nine left left 80%

7 right right ten ten right ten 40%

8 go close seven go ten go 40%

9 back back dört dört back back 60%

10 stop stop stop stop stop stop 100%

11 aç aç aç aç aç aç 100%

12 kapat kapat kapat kapat kapat kapat 100%

13 open open open open open open 100%

14 close close close close close close 100%

15 forward forward four forward forward four 60%

16 backward kapat backward backward kapat kapat 40%

17 yukarı ali yukarı two yukarı yukarı 60%

18 aşağı aşağı aşağı aşağı aşağı aşağı 100%

19 hızlı hızlı dokuz dokuz hızlı hızlı 60%

20 yavaş yavaş yavaş yavaş yavaş yavaş 100%

21 one close one close one one 60%

22 two two two two two two 100%

23 three three six three six three 60%

24 four four on four sol four 60%

25 five five five five five five 100%

26 six six six six six six 100%

27 seven dört seven seven seven seven 80%

28 eight eight six eight eight eight 80%

29 nine nine nine nine nine nine 100%

30 ten dört ten ten dört dört 40%

31 bir bir bir six geri bir 60%

32 iki bir sekiz iki bir iki 40%

33 üç üç üç üç eight üç 80%

34 dört dört dört dört dört dört 100%

35 beş beş back left beş back 40%

36 altı open altı open altı altı 60%

37 yedi yedi yedi yedi geri geri 60%

38 sekiz three sekiz sekiz sekiz three 60%

39 dokuz dokuz dur dokuz dokuz dokuz 80%

40 on on four four on on 60%

41 hüseyin hüseyin hüseyin hüseyin hüseyin hüseyin 100%

42 mustafa mustafa sağ mustafa ??? open 40%

43 samet samet altı samet samet samet 80%

44 ali ali ali ali samet ali 80%

45 hasan dört hasan hasan hasan stop 60%

46 fatma fatma fatma fatma fatma fatma 100%

47 merve merve merve merve merve merve 100%

48 eda eda eda eda eda eda 100%

49 ayşe ayşe altı ayşe ayşe ayşe 80%

50 hüsniye bir hüsniye six hüsniye six 40%

Total 74% 64% 72% 76% 78% 72.8%


Table 9

No  Command  Test 1  Test 2  Test 3  Test 4  Test 5  Accuracy

1 sol close sol sol sol on 60%

2 sağ sağ sağ sağ stop sağ 80%

3 ileri geri ileri ileri ileri geri 60%

4 geri geri geri geri bir geri 80%

5 dur dur yedi dur six dur 60%

6 left left left left left nine 80%

7 right right ten right ten right 60%

8 go go ten seven go go 60%

9 back back dört back dört back 60%

10 stop stop stop stop stop stop 100%

11 aç aç aç aç aç aç 100%

12 kapat kapat kapat kapat kapat kapat 100%

13 open open open seven open open 80%

14 close close close close close close 100%

15 forward four four forward forward forward 80%

16 backward backward kapat backward kapat backward 60%

17 yukarı ali yukarı yukarı two yukarı 60%

18 aşağı aşağı aşağı aşağı aşağı aşağı 100%

19 hızlı hızlı hızlı dokuz hızlı dokuz 60%

20 yavaş yavaş yavaş yavaş yavaş yavaş 100%

21 one close one one close one 60%

22 two two two two two two 100%

23 three three three six three three 80%

24 four four sol sol four four 60%

25 five five five five five five 100%

26 six six six six six six 100%

27 seven seven seven seven dört seven 80%

28 eight eight eight eight eight six 80%

29 nine nine nine nine nine nine 100%

30 ten ten dört ten dört ten 60%

31 bir bir geri bir bir bir 80%

32 iki iki bir iki bir iki 80%

33 üç üç üç sekiz üç üç 80%

34 dört dört dört dört dört dört 100%

35 beş beş left back beş beş 60%

36 altı altı altı open altı open 60%

37 yedi yedi yedi geri yedi geri 60%

38 sekiz three sekiz sekiz three sekiz 60%

39 dokuz dokuz dur dokuz dokuz dokuz 80%

40 on four on on four on 60%

41 hüseyin hüseyin ??? eight hüseyin hüseyin 60%

42 mustafa sağ mustafa sağ mustafa ??? 40%

43 samet samet samet altı samet altı 60%

44 ali ali ali samet ali ali 80%

45 hasan hasan dört stop hasan hasan 60%

46 fatma fatma fatma altı fatma fatma 80%

47 merve merve merve merve merve merve 100%

48 eda eda eda eda eda eda 100%

49 ayşe ayşe ayşe ayşe ayşe altı 80%

50 hüsniye hüsniye six hüsniye hüsniye hüsniye 80%

Total 80% 74% 70% 74% 80% 75.6%


Table 10

No  Command  Test 1  Test 2  Test 3  Test 4  Test 5  Accuracy

1 sol sol sol sol ??? sol 80%

2 sağ stop sağ samet close sağ 40%

3 ileri ileri geri ileri geri ileri 60%

4 geri six geri geri geri geri 80%

5 dur left dur dur left dur 60%

6 left ??? left left left left 80%

7 right ten ten right right right 60%

8 go go go go go ten 80%

9 back back back dört back back 80%

10 stop stop stop stop ??? stop 80%

11 aç aç aç back aç aç 80%

12 kapat kapat kapat kapat kapat close 80%

13 open open forward open open open 80%

14 close close close close close close 100%

15 forward four forward forward four forward 60%

16 backward backward backward kapat backward backward 80%

17 yukarı yukarı two two yukarı yukarı 60%

18 aşağı aşağı aşağı aşağı aşağı aşağı 100%

19 hızlı hızlı hızlı hızlı dokuz hızlı 80%

20 yavaş yavaş yavaş close yavaş yavaş 80%

21 one close one one one one 80%

22 two two two two two two 100%

23 three three six three three six 60%

24 four four on four four on 60%

25 five four five five five five 80%

26 six six six six six six 100%

27 seven back seven seven seven seven 80%

28 eight eight eight six six eight 60%

29 nine nine nine nine nine nine 100%

30 ten ten ten dört dört ten 60%

31 bir bir geri bir geri bir 60%

32 iki iki close sekiz iki iki 60%

33 üç üç üç üç üç üç 100%

34 dört dört dört dört dört dört 100%

35 beş beş left beş beş left 60%

36 altı open altı open altı altı 80%

37 yedi yedi yedi yedi yedi geri 80%

38 sekiz sekiz sekiz sekiz sekiz sekiz 100%

39 dokuz dokuz dokuz dur dur dokuz 60%

40 on four on on on four 60%

41 hüseyin hüseyin sekiz hüseyin close hüseyin 60%

42 mustafa mustafa mustafa sağ ??? ??? 40%

43 samet samet samet altı samet samet 80%

44 ali ali samet ali ali ali 80%

45 hasan hasan hasan stop hasan hasan 80%

46 fatma fatma fatma fatma fatma fatma 100%

47 merve merve merve merve merve merve 100%

48 eda eda eda eda eda eda 100%

49 ayşe ayşe ayşe ayşe ayşe altı 80%

50 hüsniye six hüsniye hüsniye six hüsniye 60%

Total 76% 78% 72% 72% 82% 76%
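Each per-command percentage in the tables above is the share of the 5 test utterances recognized as the intended command, and the bottom Total row averages those values per test column and overall. A minimal sketch of that bookkeeping (written in Python purely for illustration; the command names are example data, not results from the tables):

```python
def accuracy(reference, results):
    """Percentage of test utterances recognized as the reference command."""
    hits = sum(1 for r in results if r == reference)
    return 100.0 * hits / len(results)

# Two made-up table rows: 5 recognition outputs per command.
table = [
    accuracy("sol", ["sol", "sol", "sol", "sol", "sol"]),        # 100.0
    accuracy("sag", ["close", "sag", "sag", "sag", "sag"]),      # 80.0
]
overall = sum(table) / len(table)  # mean of the per-command rows -> 90.0
```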


6.3 Appendix 3: MATLAB Codes

Training.m

function varargout = Training(varargin)
gui_Singleton = 1;
gui_State = struct('gui_Name',       mfilename, ...
                   'gui_Singleton',  gui_Singleton, ...
                   'gui_OpeningFcn', @Training_OpeningFcn, ...
                   'gui_OutputFcn',  @Training_OutputFcn, ...
                   'gui_LayoutFcn',  [], ...
                   'gui_Callback',   []);
if nargin && ischar(varargin{1})
    gui_State.gui_Callback = str2func(varargin{1});
end
if nargout
    [varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
else
    gui_mainfcn(gui_State, varargin{:});
end

% --------------------------------------------------------------------
function Training_OpeningFcn(hObject, eventdata, handles, varargin)
handles.output = hObject;
guidata(hObject, handles);
% Configure and initialize the GUI objects.
set(handles.record_button,'Enable','on');
set(handles.text6,'Enable','on');
set(handles.okay_button,'Enable','off');
set(handles.reject_button,'Enable','off');
set(handles.sound_button,'Enable','off');
set(handles.text8,'Enable','off');
set(handles.text9,'Enable','off');
set(handles.text10,'Enable','off');
set(handles.text11,'Enable','off');
set(handles.energy_text,'Enable','off');
set(handles.text4,'Enable','on');
set(handles.text22,'Visible','off');
set(handles.size_text,'Visible','off');
handles.Fs = 8000;   % sampling frequency
handles.nBits = 8;   % bits/sample
% Create an analog input object from the sound card.
ai = analoginput('winsound', 0);
addchannel(ai, 1);
% Configure the analog input object.
set(ai, 'SampleRate', handles.Fs);
set(ai, 'BitsPerSample', 2*handles.nBits);
set(ai, 'BufferingConfig', [1000,30]);
set(ai, 'SamplesPerTrigger', 1000);
set(ai, 'TriggerRepeat', 1);
% Microphone setup: measure the background noise level.
start(ai);
test = getdata(ai);
test = test - mean(test);
st = sum(abs(test));
stop(ai);
handles.th = 3*st;   % threshold level for recording
clear ai;
fid = load('Lib_Commands.mat');        % names of the library commands
set(handles.listbox1,'String',fid.fid) % initialize the commands listbox
handles.cc = 0;
handles.aa = 1;
handles.method = 1;                    % initialize the selected featuring method
guidata(hObject, handles);

% --------------------------------------------------------------------
function varargout = Training_OutputFcn(hObject, eventdata, handles)
varargout{1} = handles.output;

% --------------------------------------------------------------------
% When the record button is pressed
function record_button_Callback(hObject, eventdata, handles)
handles.Method = get(handles.popupmenu2, 'Value');  % selected featuring method (1 = MFCC, 2 = LPC)
name = get(handles.edit3,'String');                 % name of the new command
handles.Framelen = str2double(get(handles.frame_text,'String')); % frame length (in milliseconds)
handles.Order = str2double(get(handles.order_text,'String'));    % order of the featuring method
handles.cc = 0;
handles.aa = 1;
set(handles.text10,'Enable','on');
set(handles.text11,'Enable','on');
set(handles.text10,'String',handles.aa);
set(handles.text11,'String',name);
set(handles.text22,'Visible','on');
set(handles.size_text,'Visible','on');
guidata(hObject, handles);
% Start the recording process for the training command; it returns the
% fingerprint of the current command and its properties.
[handles] = Record_Training(hObject, eventdata, handles);
% Assign the variables for use in the other functions.
handles.name = name;
Plotts(hObject, eventdata, handles)  % plot the data
guidata(hObject, handles);

% --------------------------------------------------------------------
% When the okay button is pressed
function okay_button_Callback(hObject, eventdata, handles)
handles.Method = get(handles.popupmenu2, 'Value');
name = get(handles.edit3,'String');
handles.Framelen = str2double(get(handles.frame_text,'String'));
handles.Order = str2double(get(handles.order_text,'String'));
guidata(hObject, handles);
cc = handles.cc;
namee = handles.name;
if cc <= 2
    cc = cc + 1;
    Save_Mat(handles.recorded_signal, namee, cc); % save the voice command to the library
    if cc == 3  % once 3 patterns are obtained for a voice command, the system stops
        set(handles.record_button,'Enable','on');
        set(handles.text6,'Enable','on');
        set(handles.text8,'Enable','off');
        set(handles.text9,'Enable','off');
        set(handles.energy_text,'Enable','off');
        set(handles.text10,'Enable','off');
        set(handles.text11,'Enable','off');
        set(handles.okay_button,'Enable','off');
        set(handles.reject_button,'Enable','off');
        set(handles.sound_button,'Enable','off');
        set(handles.text4,'Enable','on');
        cc = 0;
        name = get(handles.edit3,'String');
        % Add the new command to the listbox.
        fid = load('Lib_Commands.mat');
        lfid = length(fid.fid);
        fid.fid{lfid+1} = name;
        set(handles.listbox1,'String',fid.fid)
        fid = cellstr(fid.fid);
        save('Lib_Commands.mat','fid')
        handles.cc = cc;
        guidata(hObject, handles);
        return
    else
        handles.cc = cc;
        handles.aa = handles.aa + 1;
        set(handles.text10,'Enable','on');
        set(handles.text11,'Enable','on');
        set(handles.text10,'String',handles.aa);
        set(handles.text11,'String',handles.name);
        % Record the next pattern (until 3 patterns are obtained).
        [handles] = Record_Training(hObject, eventdata, handles);
        Plotts(hObject, eventdata, handles)  % plot the data
        guidata(hObject, handles);
    end
else  % finish recording
    set(handles.record_button,'Enable','on');
    set(handles.text6,'Enable','on');
    set(handles.text8,'Enable','off');
    set(handles.text9,'Enable','off');
    set(handles.energy_text,'Enable','off');
    set(handles.text10,'Enable','off');
    set(handles.text11,'Enable','off');
    set(handles.okay_button,'Enable','off');
    set(handles.reject_button,'Enable','off');
    set(handles.sound_button,'Enable','off');
    set(handles.text4,'Enable','on');
    cc = 0;
    handles.cc = cc;
    guidata(hObject, handles);
end

% --------------------------------------------------------------------
% When the reject button is pressed
function reject_button_Callback(hObject, eventdata, handles)
% Discard the current pattern and start re-recording.
handles.Method = get(handles.popupmenu2, 'Value');
name = get(handles.edit3,'String');
handles.Framelen = str2double(get(handles.frame_text,'String'));
handles.Order = str2double(get(handles.order_text,'String'));
guidata(hObject, handles);
[handles] = Record_Training(hObject, eventdata, handles);
Plotts(hObject, eventdata, handles)  % plot the data
guidata(hObject, handles);

% --------------------------------------------------------------------
function sound_button_Callback(hObject, eventdata, handles)
% Play back the current command.
wavplay(handles.Command_Signal,8000)
pause(0.3)

% --------------------------------------------------------------------
% When the remove button is pressed
function rmv_button_Callback(hObject, eventdata, handles)
% Remove the selected command from the listbox and the library,
% then re-edit the library and the commands listbox.
value = get(handles.listbox1,'Value');
value = int8(value);
fid = load('Lib_Commands.mat');
name = fid.fid{value};
[stat,mess] = fileattrib('Library');
lib_path = mess.Name;
delete([lib_path '\' name '1.mat'])
delete([lib_path '\' name '2.mat'])
delete([lib_path '\' name '3.mat'])
fid.fid(value) = '';
fid = fid.fid;
save('Lib_Commands.mat','fid')
set(handles.listbox1,'Value',(value-1))
set(handles.listbox1,'String',fid)

% --------------------------------------------------------------------
function popupmenu2_Callback(hObject, eventdata, handles)
selection = get(handles.popupmenu2, 'Value');
switch selection
    case 1
        set(handles.order_text,'String',12); % default order values for the featuring methods
    case 2
        set(handles.order_text,'String',9);
end

% --------------------------------------------------------------------
function popupmenu2_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function energy_text_Callback(hObject, eventdata, handles)

function energy_text_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function listbox1_Callback(hObject, eventdata, handles)

function listbox1_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function edit3_Callback(hObject, eventdata, handles)

function edit3_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function frame_text_Callback(hObject, eventdata, handles)

function frame_text_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function order_text_Callback(hObject, eventdata, handles)

function order_text_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
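Training.m calibrates its recording threshold from silence: the microphone setup records roughly 1000 samples of ambient sound, removes the DC offset, and sets the threshold to three times the summed magnitude of the silent frame (handles.th = 3*st). The same calculation is sketched below in Python, purely for illustration; the sample values are made up:

```python
def recording_threshold(noise_frame, factor=3):
    """Energy threshold from a frame of background noise, as in Training.m:
    remove the DC offset, then take factor * sum of absolute amplitudes."""
    mean = sum(noise_frame) / len(noise_frame)
    st = sum(abs(x - mean) for x in noise_frame)  # energy of the silent frame
    return factor * st

silence = [0.01, -0.02, 0.015, -0.005]  # made-up ambient samples
th = recording_threshold(silence)       # roughly 3 * 0.05 = 0.15
```

Any frame whose energy exceeds this value is treated as speech; scaling by 3 keeps ordinary background fluctuations below the trigger level.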

SpeechRecognition.m

function varargout = SpeechRecognition(varargin)
gui_Singleton = 1;
gui_State = struct('gui_Name',       mfilename, ...
                   'gui_Singleton',  gui_Singleton, ...
                   'gui_OpeningFcn', @SpeechRecognition_OpeningFcn, ...
                   'gui_OutputFcn',  @SpeechRecognition_OutputFcn, ...
                   'gui_LayoutFcn',  [], ...
                   'gui_Callback',   []);
if nargin && ischar(varargin{1})
    gui_State.gui_Callback = str2func(varargin{1});
end
if nargout
    [varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
else
    gui_mainfcn(gui_State, varargin{:});
end

% --------------------------------------------------------------------
function SpeechRecognition_OpeningFcn(hObject, eventdata, handles, varargin)
handles.output = hObject;
guidata(hObject, handles);
% Configure and initialize the GUI objects.
set(handles.popupmenu2,'Visible','on');
set(handles.start_button,'Visible','on');
set(handles.stop_button,'Enable','off');
set(handles.text4,'Visible','off');
set(handles.text5,'Visible','off');
set(handles.text23,'Visible','off');
set(handles.size_text,'Visible','off');
set(handles.text28,'Visible','off');
set(handles.energy_text,'Visible','off');
% Initialize the commands listbox.
fid = load('Lib_Commands.mat');
ldf = length(fid.fid)+1;
fid.fid(ldf) = {'???'};
fid = fid.fid;
set(handles.listbox1,'String',fid)
handles.output = hObject;
guidata(hObject, handles);

% --------------------------------------------------------------------
function varargout = SpeechRecognition_OutputFcn(hObject, eventdata, handles)
varargout{1} = handles.output;

% --------------------------------------------------------------------
function popupmenu2_Callback(hObject, eventdata, handles)
selection = get(handles.popupmenu2, 'Value');
switch selection
    case 2  % Training mode
        set(handles.stop_button,'Enable','off');
        set(handles.text4,'Visible','off');
        set(handles.text5,'Visible','off');
        set(handles.energy_text,'Visible','off');
        set(handles.text23,'Visible','off');
        set(handles.size_text,'Visible','off');
        set(handles.text28,'Visible','off');
    case 1  % Recognizing mode
        set(handles.stop_button,'Enable','off');
        set(handles.text4,'Visible','off');
        set(handles.text5,'Visible','off');
        set(handles.energy_text,'Visible','off');
        % Initialize the commands listbox.
        fid = load('Lib_Commands.mat');
        lf = length(fid.fid);
        fid = load('Lib_Commands.mat');
        ldf = length(fid.fid)+1;
        fid.fid(ldf) = {'???'};
        fid = fid.fid;
        set(handles.listbox1,'String',fid)
end

% --------------------------------------------------------------------
function popupmenu2_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
% When the start button is pressed
function start_button_Callback(hObject, eventdata, handles)
selection = get(handles.popupmenu2, 'Value');
switch selection
    case 2
        set(handles.stop_button,'Enable','off');
        set(handles.text4,'Visible','off');
        set(handles.text5,'Visible','off');
        set(handles.energy_text,'Visible','off');
        Training;  % run the Training GUI
    case 1
        set(handles.stop_button,'Enable','on');
        set(handles.start_button,'Enable','off');
        set(handles.text4,'Visible','on');
        set(handles.text5,'Visible','on');
        set(handles.text23,'Visible','on');
        set(handles.size_text,'Visible','on');
        set(handles.text28,'Visible','on');
        set(handles.energy_text,'Visible','on');
        handles.Fs = 8000;
        handles.nBits = 8;
        % Get the properties of the featuring process.
        handles.Framelen = str2double(get(handles.len_text,'String'));
        handles.Order = str2double(get(handles.order_text,'String'));
        handles.Method = get(handles.popupmenu3, 'Value');
        handles.level = str2double(get(handles.level_text,'String'));
        handles.output = hObject;
        guidata(hObject, handles);
        % Start listening for the voice commands.
        handles = Record_Recognition(hObject, eventdata, handles);
end

% --------------------------------------------------------------------
function stop_button_Callback(hObject, eventdata, handles)
% Stop recording.
handles.output = hObject;
stop(handles.ai);
guidata(hObject, handles);
set(handles.stop_button,'Enable','off');
set(handles.start_button,'Enable','on');

% --------------------------------------------------------------------
function close_button_Callback(hObject, eventdata, handles)
handles.output = hObject;
guidata(hObject, handles);
selection = questdlg(['Close ' get(handles.figure1,'Name') '?'],...
                     ['Close ' get(handles.figure1,'Name') '...'],...
                     'Yes','No','Yes');
if strcmp(selection,'No')
    return;
end
stop(handles.ai);
delete(handles.ai);
clear(handles.ai);
handles.output = hObject;
guidata(hObject, handles);
close(handles.figure1)

% --------------------------------------------------------------------
function popupmenu3_Callback(hObject, eventdata, handles)
selectionn = get(handles.popupmenu3, 'Value');
switch selectionn
    case 1
        set(handles.order_text,'String',12); % default order values for the featuring methods
    case 2
        set(handles.order_text,'String',9);
end

% --------------------------------------------------------------------
function popupmenu3_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function order_text_Callback(hObject, eventdata, handles)

function order_text_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function energy_text_Callback(hObject, eventdata, handles)

function energy_text_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function recog_text_Callback(hObject, eventdata, handles)

function recog_text_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function listbox1_Callback(hObject, eventdata, handles)

function listbox1_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function len_text_Callback(hObject, eventdata, handles)

function len_text_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end

% --------------------------------------------------------------------
function level_text_Callback(hObject, eventdata, handles)

function level_text_CreateFcn(hObject, eventdata, handles)
if ispc
    set(hObject,'BackgroundColor','white');
else
    set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------

Record_Training.m

function [handles] = Record_Training(hObject, eventdata, handles)
set(handles.record_button,'Enable','off');
set(handles.text6,'Enable','off');
set(handles.okay_button,'Enable','off');
set(handles.reject_button,'Enable','off');
set(handles.sound_button,'Enable','off');
set(handles.text8,'Enable','on');
set(handles.text9,'Enable','on');
set(handles.energy_text,'Enable','on');
set(handles.text4,'Enable','off');
guidata(hObject, handles);
% Set the initial values.
n = 0; z = 0; data = 0; audio_prev = 0; signal = 0;
FngrPrnts_pcov = 0; stop_rec = 0;
Fs = handles.Fs;
nbits = handles.nBits;
th = handles.th;
b1 = fir1(250,[2*70/Fs 2*3200/Fs]); % speech band-pass filter coefficients, 70 Hz - 3200 Hz
b2 = [1 -0.97];                     % pre-emphasis filter coefficients
% Create an analog input object from the sound card.
ai = analoginput('winsound', 0);
addchannel(ai, 1);
% Configure the analog input object.
set(ai, 'SampleRate', Fs);
set(ai, 'BitsPerSample', 2*nbits);
set(ai, 'BufferingConfig', [1000,30]);
set(ai, 'SamplesPerTrigger', 1000);
set(ai, 'TriggerRepeat', inf);
start(ai);  % start the recording process
while 1
    pause(0.01)  % Wait in each loop to give enough time for setting the GUI objects.
                 % This does not cause any sample loss, because the analog
                 % input object works asynchronously.
    handles.ai = ai;  % assign the analog input object for use in the other functions
    guidata(hObject, handles);
    audio = getdata(ai);  % get the recorded data (1000 samples)
    EE = sum(abs(audio - mean(audio)));  % energy of the frame, with the DC value removed
    set(handles.energy_text, 'String', EE);
    if EE >= th
        % If the energy of the frame is greater than th, the system senses
        % that the speaker is saying a command. Each frame whose energy
        % exceeds th is appended to the previously collected frames.
        n = n + 1;
        if n == 1
            data = [audio_prev ; audio];
        else
            data = [data ; audio];
        end
    else
        if n > 0
            % This is the point where the command ends: while n is greater
            % than zero, if the energy of the current frame is less than th,
            % the speaker has finished saying the command.
            stop_rec = stop_rec + 1;
            data = [data ; audio];
            if (stop_rec > 1)
                stop(ai);
                delete(ai)
                clear ai
                % Filtering process
                signal_filtered1 = filter(b1,1,data);             % band-pass, 70 Hz - 3200 Hz
                signal_filtered2 = filter(b2,1,signal_filtered1); % pre-emphasis
                signal = (20/th)*signal_filtered2(350:end);       % amplifying
                wavwrite(signal,8000,8,'command.wav');  % convert the signal from 16 to 8 bits/sample
                recorded_signal = wavread('command.wav');
                delete command.wav
                % Send the signal to the featuring process.
                [FngrPrnts,Command_Signal,Paudio,start_point,end_point] = Featuring(handles,recorded_signal);
                guidata(hObject, handles);
                break
            end
        else
            % The part where no command is being said.
            audio_prev = audio;
            z = z + 1;
        end
    end
    if (z >= 50)
        % If 50 consecutive frames all have energy less than th, the system
        % stops: 50*1000/8000 = 6.25 s, i.e. nothing was said within 6.25 seconds.
        set(handles.popupmenu2,'Visible','on');
        set(handles.start_button,'Visible','on');
        set(handles.stop_button,'Visible','off');
        set(handles.text1,'Visible','off');
        set(handles.text2,'Visible','off');
        set(handles.text4,'Visible','off');
        set(handles.text5,'Visible','off');
        set(handles.energy_text,'Visible','off');
        % Reset the variables to their initial values.
        n = 0; z = 0; data = 0; audio_prev = 0; signal = 0; FngrPrnts = 0;
        stop(ai)
        delete(ai)
        clear ai
        guidata(hObject, handles);
        break
    end
end
handles.recorded_signal = recorded_signal;
handles.Paudio = Paudio;
handles.Command_Signal = Command_Signal;
handles.FngrPrnts = FngrPrnts;
handles.start_point = start_point;
handles.end_point = end_point;
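Record_Training.m conditions the captured signal with two FIR filters before feature extraction: a 250th-order band-pass limiting it to the 70-3200 Hz speech band, and the pre-emphasis filter b2 = [1 -0.97], which boosts high frequencies to flatten the spectral tilt of voiced speech. The pre-emphasis step is just the first-order difference y[n] = x[n] - 0.97*x[n-1]; a minimal sketch in Python (pure Python, for illustration only):

```python
def preemphasis(x, alpha=0.97):
    """First-order FIR pre-emphasis, equivalent to filter([1 -alpha], 1, x)
    in MATLAB: y[0] = x[0], y[n] = x[n] - alpha * x[n-1]."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

# A constant (DC-like) input is almost entirely suppressed after the first sample,
# while rapid sample-to-sample changes pass through largely unchanged.
y = preemphasis([1.0, 1.0, 1.0])
```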

Page 71: Speech Recognition

67

Record_Recognition.m function handles = Record_Recognition(hObject, eventdata, handles) % Assinging the fingerprints of library commands to variables. [ lib_com1,lib_com2,lib_com3,lf ] = Library_Call(hObject, eventdata, handles) ; % setting the initial values n = 0 ; z = 0 ; data = 0 ; audio_prev = 0 ; signal = 0 ; FngrPrnts = 0 ; stop_rec = 0 ; Fs = handles.Fs; nBits = handles.nBits; b1 =fir1(250,[2*70/Fs 2*3200/Fs]); % Speech filter coefficients 70Hz-3200Hz b2 = [1 -0.97]; % Preemphasis filter coeff. % creating an analog input object from sound card ai = analoginput('winsound', 0); addchannel(ai, 1); % Configure the analog input object. set(ai, 'SampleRate', Fs); set(ai, 'BitsPerSample', 2*nBits); set(ai, 'BufferingConfig',[1000,30]); set(ai, 'SamplesPerTrigger', 1000); set(ai, 'TriggerRepeat', 1); % Microphone Setup start(ai); test = getdata(ai); test = test-mean(test); st = sum(abs(test)); stop(ai); th = 3*st; % recording threshold % Recording Process starts

set(ai, 'TriggerRepeat', inf); start(ai); while (1) pause(0.01) % Waiting time in each loop to give enough time for setting GUI objects. % Does not cause any sample loss in recording process. % Analog input object works asyncroniously. handles.ai=ai; % Assinging the analog object to use its properties in the other functions guidata(hObject, handles); audio = getdata(ai); % getting the recorded data (1000 samples) EE = (sum( abs( audio - mean(audio) ) )); % Energy of the frame and subtracting DC value set(handles.energy_text, 'String', EE); if EE >= th ; % if the energy of the frame is greater than th the system senses % that the speaker is saying a command. n = n+1 ; % for each frame that has an energy greater than th, these frames % are arrenged in a row and attached each other. if n==1 data = [audio_prev ; audio] ; else data = [data ; audio] ; end


    else
        if n > 0
            % This is the point where the command ends:
            % while n is greater than zero, if the energy of the current frame
            % is less than th, the speaker has finished saying the command.
            stop_rec = stop_rec + 1;
            data = [data ; audio];
            if (stop_rec > 1)  % one low-energy frame in the middle of the signal is acceptable
                % Filtering process
                signal_filtered1 = filter(b1, 1, data);              % band-pass, 70 Hz - 3200 Hz
                signal_filtered2 = filter(b2, 1, signal_filtered1);  % pre-emphasis
                signal = (20/th)*signal_filtered2(350:end);          % amplifying
                wavwrite(signal, 8000, 8, 'command.wav')  % converting the signal from 16 to 8 bits/sample
                recorded_signal = wavread('command.wav');
                delete command.wav
                % Sending the signal to the featuring process.
                [FngrPrnts, Command_Signal, Paudio, start_point, end_point] = Featuring(handles, recorded_signal);
                % Printing the size of the fingerprint.
                [r, c] = size(FngrPrnts);
                set(handles.size_text, 'String', [num2str(r) ' x ' num2str(c)]);
                % Comparing the current fingerprint with the library fingerprints.
                % The decision and resulting processes are realized in this function.
                Compare(FngrPrnts, lib_com1, lib_com2, lib_com3, lf, hObject, eventdata, handles);

                % Resetting the variables to their initial values.
                n = 0; z = 0; data = 0; audio_prev = 0; signal = 0;
                FngrPrnts = 0; audio = 0; stop_rec = 0;
            end
        else
            audio_prev = audio;  % the part where no command is said
            z = z + 1;
        end
    end
    if (z >= 80)
        % If 80 consecutive frames all have energy below th, the system stops.
        % 80 frames * 1000 samples / 8000 Hz = 10 seconds:
        % if nothing is said within 10 seconds, the system stops automatically.
        stop(ai)
        delete(ai)
        clear ai;
        set(handles.popupmenu2,'Visible','on');
        set(handles.start_button,'Enable','on');
        set(handles.stop_button,'Enable','off');
        set(handles.text4,'Visible','off');
        set(handles.text5,'Visible','off');
        set(handles.energy_text,'Visible','off');
        set(handles.text23,'Visible','off');
        set(handles.size_text,'Visible','off');
        set(handles.text28,'Visible','off');
        % Resetting the variables to their initial values.
        n = 0; z = 0; data = 0; audio_prev = 0; signal = 0; FngrPrnts = 0;
        break
    end
end
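The energy gate that drives the recording loop rests on two small formulas: the frame energy is the sum of absolute sample values after removing the DC offset, and the recording threshold is three times the energy of a background-noise frame captured during microphone setup (th = 3*st above). The Python sketch below restates just that logic; the function names are illustrative, not part of the project code.

```python
import numpy as np

def frame_energy(frame):
    """Frame 'energy' as used in the recording loop: the sum of
    absolute sample values after removing the DC offset."""
    frame = np.asarray(frame, dtype=float)
    return float(np.abs(frame - frame.mean()).sum())

def calibrate_threshold(noise_frame, factor=3.0):
    """Recording threshold: `factor` times the energy of a background
    noise frame (th = 3*st in the listing)."""
    return factor * frame_energy(noise_frame)
```

A frame is then treated as speech whenever `frame_energy(frame) >= threshold`, exactly as the `if EE >= th` branch does.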


Save_Mat.m

function Save_Mat(recorded_signal, name, cc)
% Saves the voice command with a pattern number.
[stat, mess] = fileattrib('Library');
lib_path = mess.Name;
number = int2str(cc);
name = char(name);
save([lib_path '\' name number], 'recorded_signal')

Plotts.m

function Plotts(hObject, eventdata, handles)
set(handles.record_button,'Enable','off');
set(handles.text6,'Enable','off');
set(handles.text8,'Enable','off');
set(handles.text9,'Enable','off');
set(handles.energy_text,'Enable','off');
set(handles.okay_button,'Enable','on');
set(handles.reject_button,'Enable','on');
set(handles.sound_button,'Enable','on');
set(handles.text4,'Enable','off');
lre = length(handles.recorded_signal);
lpe = length(handles.Paudio);
min_r = min(handles.recorded_signal);
max_r = max(handles.recorded_signal);
l_ff = length(handles.FngrPrnts');

% Printing the size of the fingerprint.
[r, c] = size(handles.FngrPrnts);
set(handles.size_text, 'String', [num2str(r) ' x ' num2str(c)]);
% Plot the recorded signal, its power, and the start and end points on the same graph.
axes(handles.axes3); cla;
plot(40*(0:lpe-1), handles.Paudio, 'r', 'LineWidth', 2), hold on
plot((0:lre-1), handles.recorded_signal), hold on
plot(handles.start_point*ones(1,length(min_r:1/1000:max_r)), (min_r:1/1000:max_r), 'k', 'LineWidth', 2), hold on
plot(handles.end_point*ones(1,length(min_r:1/1000:max_r)), (min_r:1/1000:max_r), 'k', 'LineWidth', 2), hold off
% Plot the pure command signal.
axes(handles.axes4); cla;
plot(handles.Command_Signal)
% Plot the fingerprint of the command signal.
axes(handles.axes2); cla;
plot(handles.FngrPrnts')
if handles.Method == 1
    axis([1 l_ff+0.1 -7.1 7.1])
else
    axis([1 l_ff+0.1 0 20.1])
end


Featuring.m

function [FngrPrnts, Command_Signal, Paudio, start_point, end_point] = Featuring(handles, recorded_signal)
signal = 0.5*recorded_signal/max(abs(recorded_signal));  % normalizing the signal for the power calculation
lx = length(signal);
nn = 0;
for c = 1:40:lx-39
    nn = nn + 1;
    Paudio(nn) = sum(abs(signal(c:c+39))).^2;
end
Paudio = sqrt(Paudio/40)/2;
mean_p = mean(Paudio)/3.5;
zz = 0; gg = 0;
g = nn + 1;
stopp1 = 0; stopp2 = 0;
startpoint = 1; endpoint = 15;
for f = 1:nn
    g = g - 1;
    if ((stopp1==0) && (f<=nn-3))
        % Start-point search: four consecutive frames whose energy exceeds the mean value.
        if ((Paudio(f)>=mean_p) && (Paudio(f+1)>=mean_p) && (Paudio(f+2)>=mean_p) && (Paudio(f+3)>=mean_p))
            startpoint = f;
            stopp1 = 1;
        end

    end
    if ((stopp2==0) && (g>=4))
        % End-point search: four consecutive frames whose energy exceeds the mean value.
        if ((Paudio(g-3)>=mean_p) && (Paudio(g-2)>=mean_p) && (Paudio(g-1)>=mean_p) && (Paudio(g)>=mean_p))
            endpoint = g;
            stopp2 = 1;
        end
    end
    if ((stopp1==1) && (stopp2==1))
        break
    end
end
Fs = handles.Fs;
Order = handles.Order;
Framelen = handles.Framelen;
if ((startpoint >= endpoint) || (stopp1 ~= 1) || (stopp2 ~= 1))
    % Error values
    start_point = 1;
    end_point = 1000 - 352;
    Command_Signal = recorded_signal(start_point:end_point);
    FngrPrnts = zeros(10, 2*Order);


else
    start_point = 40*startpoint - 20;
    end_point = 40*endpoint;
    Command_Signal = recorded_signal(start_point:end_point);
    if (handles.Method == 1)      % MFCC method
        len = Fs*Framelen/1000;
        FngrPrnts = Melcepst_met(Command_Signal, Fs, Order, len);
        ss = size(FngrPrnts);
    elseif (handles.Method == 2)  % LPC method
        len = Fs*Framelen/1000;
        FngrPrnts = Lpc_met(Command_Signal, Fs, Order, len);
        ss = size(FngrPrnts);
    end
end
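The endpoint-detection scheme in Featuring.m (per-frame RMS power, a threshold of mean power divided by 3.5, and a run of four consecutive loud frames searched from each end) can be sketched in Python. The function name `find_endpoints` is illustrative; the 40-sample frame length, the divisor 3.5, and the run length 4 are taken from the listing.

```python
import numpy as np

def find_endpoints(signal, frame_len=40, run=4, mean_div=3.5):
    """Return (start_frame, end_frame) of the spoken command: the first and
    last positions where `run` consecutive frames exceed mean(power)/mean_div,
    mirroring the start/end searches in Featuring.m."""
    sig = 0.5 * np.asarray(signal, dtype=float) / np.max(np.abs(signal))
    n_frames = len(sig) // frame_len
    frames = sig[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Paudio(nn) = sqrt( (sum|x|)^2 / 40 ) / 2 in the original listing
    p = np.sqrt(np.abs(frames).sum(axis=1) ** 2 / frame_len) / 2
    thr = p.mean() / mean_div
    above = p >= thr
    start = end = None
    for f in range(n_frames - run + 1):
        if start is None and above[f:f + run].all():
            start = f                      # forward scan for the start point
        g = n_frames - 1 - f
        if end is None and above[g - run + 1:g + 1].all():
            end = g                        # backward scan for the end point
        if start is not None and end is not None:
            break
    return start, end
```

On a signal that is silent except for samples 2000..4999, this finds the loud region as frames 50 through 124 (40 samples per frame), matching the `40*startpoint` / `40*endpoint` sample conversion in the listing.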

Library_Call.m

function [lib_com1, lib_com2, lib_com3, lf] = Library_Call(hObject, eventdata, handles)
[stat, mess] = fileattrib('Library');
lib_path = mess.Name;

fid = load('Lib_Commands.mat');
lf = length(fid.fid);
for f = 1:lf
    lib_wav11{f} = load([lib_path '\' char(fid.fid(f)) '1']);
    lib_wav22{f} = load([lib_path '\' char(fid.fid(f)) '2']);
    lib_wav33{f} = load([lib_path '\' char(fid.fid(f)) '3']);
end
for f = 1:lf
    lib_wav1{f} = lib_wav11{f}.recorded_signal;
    lib_wav2{f} = lib_wav22{f}.recorded_signal;
    lib_wav3{f} = lib_wav33{f}.recorded_signal;
end
for f = 1:lf
    lib_com1{f} = Featuring(handles, lib_wav1{f});
    lib_com2{f} = Featuring(handles, lib_wav2{f});
    lib_com3{f} = Featuring(handles, lib_wav3{f});
end


LPC_met.m

function FngrPrnts = Lpc_met(Command_Signal, Fs, nLPC, len)
data_frame = enframe(Command_Signal, hamming(len), floor(len/2));  % framing the input with a Hamming window
[r, c] = size(data_frame);  % r = number of frames, c = number of samples in each frame
for n = 1:r
    a_lpc = lpc(data_frame(n,:), nLPC);  % LPC filter coefficients
    FngrPrnts1(n,:) = abs(freqz(1, a_lpc, ceil(len/12)));  % magnitude of the LPC transfer function,
                                                           % ceil(len/12) samples per frame (20 for a 30 ms frame at 8 kHz)
end
FngrPrnts = FngrPrnts1;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function f = enframe(x, win, inc)
nx = length(x(:));
nwin = length(win);
if (nwin == 1)
    len = win;
else
    len = nwin;
end
nf = fix((nx-len+inc)/inc);
f = zeros(nf, len);
indf = inc*(0:(nf-1)).';

inds = (1:len);
f(:) = x(indf(:,ones(1,len)) + inds(ones(nf,1),:));
if (nwin > 1)
    w = win(:)';
    f = f .* w(ones(nf,1),:);
end
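As a cross-check on the index arithmetic above, the `enframe` helper translates almost directly to NumPy. This sketch (the function name is kept only for clarity) reproduces the same hop-based frame indexing and per-frame windowing:

```python
import numpy as np

def enframe(x, win, inc):
    """Split x into overlapping frames of len(win) samples with hop `inc`,
    multiplying each frame by the window (as in enframe.m)."""
    x = np.asarray(x, dtype=float)
    win = np.asarray(win, dtype=float)
    length = len(win)
    nf = (len(x) - length + inc) // inc                      # fix((nx-len+inc)/inc)
    idx = inc * np.arange(nf)[:, None] + np.arange(length)   # one row of indices per frame
    return x[idx] * win[None, :]
```

For example, framing 10 samples with a length-4 window and hop 2 yields 4 frames, each shifted by 2 samples, which matches the 50% overlap used throughout the project.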


Melcepst_met.m

function FngrPrnts = Melcepst_met(s, fs, nc, n)
fh = 0.5; fl = 0;
inc = floor(n/2);        % 50% overlap
p = floor(3*log(fs));    % number of mel filters
z = enframe(s, hamming(n), inc);         % framing the input with a Hamming window
f = rfft(z.');                           % discrete Fourier transform
[m, a, b] = melbankm(p, n, fs, fl, fh);  % mel filter bank
pw = f(a:b,:).*conj(f(a:b,:));
pth = max(pw(:))*1E-20;
ath = sqrt(pth);
y = log(max(m*abs(f(a:b,:)), ath) + eps);
c = rdct(y).';                           % discrete cosine transform
nf = size(c,1);
nc = nc + 1;
if p > nc
    c(:, nc+1:end) = [];
elseif p < nc
    c = [c zeros(nf, nc-p)];
end
c(:,1) = [];
nc = nc - 1;
% Append delta (velocity) and delta-delta (acceleration) coefficients.
vf = (4:-1:-4)/60;
af = (1:-1:-1)/2;
ww = ones(5,1);
cx = [c(ww,:); c; c(nf*ww,:)];

vx = reshape(filter(vf,1,cx(:)), nf+10, nc);
vx(1:8,:) = [];
ax = reshape(filter(af,1,vx(:)), nf+2, nc);
ax(1:2,:) = [];
c = [c ax];
FngrPrnts = c;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function f = enframe(x, win, inc)
nx = length(x(:));
nwin = length(win);
if (nwin == 1)
    len = win;
else
    len = nwin;
end
nf = fix((nx-len+inc)/inc);
f = zeros(nf, len);
indf = inc*(0:(nf-1)).';
inds = (1:len);
f(:) = x(indf(:,ones(1,len)) + inds(ones(nf,1),:));
if (nwin > 1)
    w = win(:)';
    f = f .* w(ones(nf,1),:);
end


function y = rfft(x, n, d)
% FFT of real data, keeping only the non-negative frequency half.
s = size(x);
if prod(s) == 1
    y = x;
else
    if nargin < 3
        d = find(s>1);
        d = d(1);
        if nargin < 2
            n = s(d);
        end
    end
    if isempty(n)
        n = s(d);
    end
    y = fft(x, n, d);
    y = reshape(y, prod(s(1:d-1)), n, prod(s(d+1:end)));
    s(d) = 1 + fix(n/2);
    y(:, s(d)+1:end, :) = [];
    y = reshape(y, s);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function y = rdct(x, n, a, b)
% Discrete cosine transform of real data.
fl = size(x,1) == 1;
if fl
    x = x(:);
end
[m, k] = size(x);
if nargin < 2
    n = m;
end
if nargin < 4
    b = 1;
    if nargin < 3
        a = sqrt(2*n);
    end

end
if n > m
    x = [x; zeros(n-m,k)];
elseif n < m
    x(n+1:m,:) = [];
end
x = [x(1:2:n,:); x(2*fix(n/2):-2:2,:)];
z = [sqrt(2) 2*exp((-0.5i*pi/n)*(1:n-1))].';
y = real(fft(x).*z(:,ones(1,k)))/a;
y(1,:) = y(1,:)*b;
if fl
    y = y.';
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [x, mn, mx] = melbankm(p, n, fs, fl, fh)
% Mel filter bank matrix.
f0 = 700/fs;
fn2 = floor(n/2);
lr = log((f0+fh)/(f0+fl))/(p+1);
bl = n*((f0+fl)*exp([0 1 p p+1]*lr) - f0);
b2 = ceil(bl(2));
b3 = floor(bl(3));
b1 = floor(bl(1)) + 1;
b4 = min(fn2, ceil(bl(4))) - 1;
pf = log((f0+(b1:b4)/n)/(f0+fl))/lr;
fp = floor(pf);
pm = pf - fp;
k2 = b2 - b1 + 1;
k3 = b3 - b1 + 1;
k4 = b4 - b1 + 1;
r = [fp(k2:k4) 1+fp(1:k3)];
c = [k2:k4 1:k3];
v = 2*[1-pm(k2:k4) pm(1:k3)];
mn = b1 + 1;
mx = b4 + 1;
if nargout > 1
    x = sparse(r, c, v);
else
    x = sparse(r, c+mn-1, v, p, 1+fn2);
end


Disteusq.m

function dif = Disteusq(x, y)
[rx, cx] = size(x);
[ry, cy] = size(y);
if ((rx*cx) > (ry*cy))
    % Swap the inputs so that the distance is always computed in the same form, rx <= ry.
    aa = x; bb = y;
    x = bb; y = aa;
end
[nx, p] = size(x);
ny = size(y, 1);
% Calculate the pairwise distances between the frames.
if p > 1
    z = permute(x(:,:,ones(1,ny)), [1 3 2]) - permute(y(:,:,ones(1,nx)), [3 1 2]);
    d = sum(z.*conj(z), 3);
else
    z = x(:,ones(1,ny)) - y(:,ones(1,nx)).';
    d = z.*conj(z);
end
[r, c] = size(d);
m = min([r c]);
ds = abs(r - c);
diff(1) = min(d(1,1:2));
for n = 2:m-1
    diff(n) = min(d(n, (n-1):(n+1)));  % frame-to-frame comparison with warping
end
diff(m) = min(d(m, (m-1):m));
if (r ~= c)
    diff(m) = min(d(m, (m-1):(m+1)));

    for n = (m+1):(m+ds)
        diff(n) = 3*mean(d(m, (m+1):(m+ds)));  % distances for the extra frames
    end
end
dif = sum(diff);
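The warped frame-to-frame comparison in Disteusq.m can be expressed compactly in Python. This is an illustrative sketch rather than a line-by-line port: it builds the pairwise squared-distance matrix, takes each frame's minimum over a ±1-frame band around the diagonal (the "warping"), and applies the same 3×mean penalty to unmatched extra frames when the fingerprints have different lengths.

```python
import numpy as np

def disteusq(x, y):
    """Warped distance between two fingerprints (rows = frames), following
    the banded minimum and extra-frame penalty of Disteusq.m."""
    if x.size > y.size:                 # keep rows(x) <= rows(y), as the listing does
        x, y = y, x
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)  # (nx, ny) squared distances
    r, c = d.shape
    m, ds = min(r, c), abs(r - c)
    best = [d[0, :2].min()]
    for n in range(1, m - 1):
        best.append(d[n, n - 1:n + 2].min())                # +-1 frame warping band
    hi = m + 1 if r != c else m
    best.append(d[m - 1, m - 2:hi].min())
    if r != c:                                              # penalize unmatched extra frames
        best += [3 * d[m - 1, m:m + ds].mean()] * ds
    return float(sum(best))
```

Identical fingerprints score 0, and every extra unmatched frame adds a fixed penalty, so longer spurious tails push the distance up rather than being ignored.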


Compare.m

function Compare(FngrPrnts, lib_com1, lib_com2, lib_com3, lf, hObject, eventdata, handles)
level = str2double(get(handles.level_text,'String'));  % get the recognition threshold percentage
for n = 1:lf
    % Calculate the distance between the current fingerprint and the library fingerprints.
    Ediff1(n) = Disteusq(FngrPrnts, lib_com1{n});
    Ediff2(n) = Disteusq(FngrPrnts, lib_com2{n});
    Ediff3(n) = Disteusq(FngrPrnts, lib_com3{n});
end
min1 = min(Ediff1);
min2 = min(Ediff2);
min3 = min(Ediff3);
min_all = min([min1 min2 min3]);  % minimum distance of all
if min_all == min1
    p = find(Ediff1 == min1);  % find the listbox index where the minimum distance occurs
    comm = lib_com1{p};
elseif min_all == min2
    p = find(Ediff2 == min2);
    comm = lib_com2{p};
elseif min_all == min3
    p = find(Ediff3 == min3);
    comm = lib_com3{p};
end
error1 = min_all / sum(sum(abs(comm)));        % squared error
error2 = sqrt(min_all) / sum(sum(abs(comm)));  % true error
error = (error1 + error2) / 2;                 % final error
matched = ceil(100 - (100 * error));           % matching percentage
if (matched < level)  % below the recognition threshold

    set(handles.listbox1,'Value',(lf+1));  % highlight "???"
    com_name = get(handles.listbox1,'String');
    set(handles.text28,'String',[ ''' ' com_name{lf+1} ' ''' ' , ' '% ' num2str(matched) ' Matched' ]);
else  % good match
    set(handles.listbox1,'Value',p);  % highlight the name of the recognized command
    com_name = get(handles.listbox1,'String');
    set(handles.text28,'String',[ ''' ' com_name{p} ' ''' ' , ' '% ' num2str(matched) ' Matched' ]);
    guidata(hObject, handles);
end
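The scoring formula at the heart of Compare.m is worth isolating: the minimum fingerprint distance is normalized by the magnitude sum of the best-matching library fingerprint, the "squared" and "true" (root) errors are averaged, and the result is mapped to a 0-100% match score. The sketch below restates just that arithmetic; the function name is illustrative.

```python
import math

def matching_percentage(min_dist, comm_abs_sum):
    """Matching score as computed in Compare.m: average of the squared and
    root errors, normalized by the library fingerprint's magnitude sum,
    mapped onto a percentage with ceil()."""
    error1 = min_dist / comm_abs_sum              # squared error
    error2 = math.sqrt(min_dist) / comm_abs_sum   # true (root) error
    error = (error1 + error2) / 2                 # final error
    return math.ceil(100 - 100 * error)           # matching percentage
```

A command is accepted only when this score reaches the recognition-level threshold entered in the GUI; otherwise the "???" entry is highlighted.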