Nov 21, 2014

This document describes the basics of speaker recognition algorithm.

Welcome message from author

This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript

1

BABU BANARSI DAS INSTITUTE OF TECHNOLOGY, GHAZIABAD

(AFFILIATED TO U.P. TECHNICAL UNIVERSITY, LUCKNOW)

(Established in 2000)

“REAL TIME SPEAKER RECOGNITION”

A project report submitted in partial fulfillment of the requirement for the

Award of Degree of

Bachelor of Technology

In

Electronics and Communication Engineering

Academic session 2008-2012

Submitted By: Project guides:

Rohit Singh (0803531072) Mr. Prashant Sharma

Harish Kumar (0803531409) Mr. Anil Kumar Pandey

Samreen Zehra (0803531074)

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

MAY 2012

2

CERTIFICATE

This is to certify that Rohit Singh, Harish Kumar and Samreen Zehra of Department of

Electronics and Communication Engineering of this institute have carried out together the

project work presented in this report entitled „Real Time Speaker Recognition‟ in partial

fulfillment of award of degree of Bachelor of Technology in Electronics and Communication

Engineering from UPTU, Lucknow under our supervision. The report embodies results of

their works and studies carried out by the students themselves and the content of report does

not form the part of any other degree to these candidates or to anybody else.

Project guides:

Mr. Prashant Sharma (AP, ECE)

Mr. Anil Kumar Pandey (AP, ECE)

3

ACKNOWLEDGEMENT

We have taken efforts in this project. However, it would not have been possible without the

kind support and help of many individuals. We would like to extend our sincere thanks to all

of them.

We are highly indebted to Mr. K. L. Pursnani, Head of Department Electronic and

Communication Engineering, BBDIT, Ghaziabad, for his guidance and constant supervision

as well as for providing necessary information regarding the project & for his support in

completing the project.

We would like to express our special gratitude and thanks to our project guides, Mr. Prashant

Sharma and Mr. Anil Kumar Pandey for giving us such attention and time. Both of our guides

had taken great pain and efforts to help us the best way without whom this project would ever

be realized.

We would like to express our gratitude towards our parents and friends for their kind co-

operation and encouragement which help us in completion of this project.

Our thanks and appreciations also go to our college in developing the project and people who

have willingly helped us out with their abilities.

Rohit Singh (0803531072)

Harish Kumar (0803531409)

Samreen Zehra (0803531074)

4

CONTENTS

1. Introduction…………………………………………………………………………...7

1.1 Project overview……………………………..…………………....………..……..7

1.2 Application…………………...………………………………...………………...10

2. Methodology……………...………………………………………….………………11

2.1 Algorithm………………………………………………...………………………11

2.2 Flow Chart……………………………………………...………………………...12

3. Identification background……………………………………………………….….13

3.1 DSP Fundamentals………………………………………………...……………..13

3.1.1 Basic Definitions………...………………………………….………...…..13

3.1.2 Convolution………………………………………………..……………...14

3.1.3 Discrete Fourier Transform………………………………...……………..14

3.2 Human Speech Production Model…………………………………...…………...15

3.2.1 Anatomy..………………………………………………….……………...15

3.2.2 Vocal Model……………………………………….…………………..….17

3.3 Speaker Identification……………………………………...……………………..18

4. Speech Feature Extraction……………………………………………………….…21

4.1 Introduction…………………………………………………………………...…..21

4.2 Short term analysis…..……………………………………………...…………....22

4.3 Cepstrum…………………………………………………………………...…….23

4.3.1 Delta Cepstrum……………..……………………………………….……24

4.4 Mel-Frequency Cepstrum Coefficients……………………………………….......25

4.4.1 Computing of mel cepstrum coefficient……………………………..……27

4.5 Framing and Windowing………..............................................................................29

5. Vector Quantization……………………………………………..………………......30

6. K-means………………………………………………………………………….…..33

6.1 Mean shift clustering…….…………………………………………...……...…...35

6.2 Bilateral filtering…….………………………………………………………........36

6.3 Speaker matching…………………..……………………………………..……….36

7. Euclidean distance………………………………………………………………...…...38

7.1 One dimension…………………………………………………………...……….40

7.2 Two dimension…………………….…………………………………..…………40

5

7.3Three dimension………….………………………………………………………..40

7.4 N dimensions………………………………………………………………….......40

7.5 Squared euclidean distance …………………....……………………...………….40

7.6 Decision making……………………………………………………...…………..41

8 Result and Conclusion…………………………………………………………………..42

8.1 Result…………………………………………………………………………......42

8.2 Pruning Basis…………………...…………………………………………...…...42

8.2.1 Static Pruning…………………………………………………………43

8.2.2 Adaptive Pruning……………………………………………………..44

8.3 Conclusion…………………………………………………………….....……….44

6

LIST OF FIGURES

Name of Figure Page No

1.1 Identification Taxonomy………………………………………………………………..….8

3.1 Vocal Tract Model…………………………………………………………………….….16

3.2 Multi tube Lossless Model…………………...…………………………………..……….17

3.3 Source Filter Model……………………………………….………………….…………..18

3.4 Enrollment Phase…………………………………………………………….…………...19

3.5 Identification Phase………………………………………………………..……………...20

4.1 Short Term Analysis……………………………………………………….……………..23

4.2 Speech Magnitude Spectrum…………………………………………….……………….25

4.3 Cepstrum………………………………………………………………………….………25

4.4 Computing of mel-cepstrum coefficient……………………………………….…………27

4.5 Triangular Filter used to compute mel-cepstrum ………………………………..……….28

4.6 Quasi Stationary………………………………………………………………..…………29

5.1 Vector Quantization of Two Speakers………………………………………..…………..31

7.1 Decision Process…………………………………………………………….……………41

8.1 Distance stabilization……………………………………………………….…………….42

8.2 Evaluation of static variant using different pruning intervals…………………..………...43

8.3 Examples of the matching score distribution………………………………..……………44

7

CHAPTER 1

INTRODUCTION

1.1 Project overview

The human speech conveys different types of information. The primary type is the meaning or

words, which speaker tries to pass to the listener. But the other types that are also included in

the speech are information about language being spoken, speaker emotions, gender and

identity of the speaker. The goal of automatic speaker recognition is to extract, characterize

and recognize the information about speaker identity. Speaker recognition is usually divided

into two different branches:

Speaker verification

Speaker identification.

Speaker verification task is to verify the claimed identity of person from his voice. This

process involves only binary decision about claimed identity. In speaker identification there is

no identity claim and the system decides who the speaking person is.

Speaker identification can be further divided into two branches. Open-set speaker

identification decides to whom of the registered speakers‟ unknown speech sample belongs or

makes a conclusion that the speech sample is unknown. In this work,

we deal with the closed-set speaker identification, which is a decision making process of who

of the registered speakers is most likely the author of the unknown speech sample. Depending

on the algorithm used for the identification, the task can also be divided into text-dependent

and text-independent identification. The difference is that in the first case the system knows

the text spoken by the person while in the second case the system must be able to recognize

the speaker from any text. This taxonomy is

Represented in Figure 1.1

8

Speaker Recognition

Speaker verification Speaker identification

Closed set Identification Open set Identification

Text independent identification Text dependent identification

Fig. 1.1 Identification Taxonomy

Speaker recognition is basically divided into two-classification: speaker recognition and

speaker identification and it is the method of automatically identify who is speaking on the

basis of individual information integrated in speech waves. Speaker recognition is widely

applicable in use of speaker‟s voice to verify their identity and control access to services such

as banking by telephone, database access services, voice dialing telephone shopping,

information services, voice mail, security control for secret information areas, and remote

access to computer AT and T and TI with Sprint have started field tests and actual

application of speaker recognition technology; many customers are already being used by

Sprint‟s Voice Phone Card.

9

Speaker recognition technology is the most potential technology to create new services that

will make our everyday lives more secured. Another important application of speaker

recognition technology is for forensic purposes. Speaker recognition has been seen an

appealing research field for the last decades which still yields a number of unsolved problems.

The main aim of this project is speaker identification, which consists of comparing a speech

signal from an unknown speaker to a database of known speaker. The system can recognize

the speaker, which has been trained with a number of speakers. Below figure shows the

fundamental formation of speaker identification and verification systems. Where the speaker

identification is the process of determining which registered speaker provides a given

speech. On the other hand, speaker verification is the process of rejecting or accepting the

identity claim of a speaker. In most of the applications, voice is use as the key to confirm the

identities of a speaker is classified as speaker verification.

Adding the open set identification case in which a reference model for an unknown speaker

may not exist can also modify above formation of speaker identification and verification

system. This is usually the case in forensic application. In this circumstances, an added

decision alternative, the unknown does not match any of the models, is required. Other

threshold examination can be used in both verification and identification process to decide

if the match is close enough to acknowledge the decision or if more speech data are

needed.

Speaker recognition can also divide into two methods, text-dependent and text independent

methods. In text dependent method the speaker has to say key words or sentences having the

same text for both training and recognition trials. Whereas in the text independent does not

rely on a specific text being speaks. Formerly text dependent methods were widely in

application, but later text independent is in use. Both text dependent and text independent

methods share a problem however.

By playing back the recorded voice of registered speakers this system can be easily deceived.

There are different technique is used to cope up with such problems. Such as a small set of

words or digits are used as input and each user is provoked to thorough a specified

sequence of key words that is randomly selected every time the system is used. Still this

method is not completely reliable. This method can be deceived with the highly developed

electronics recording system that can repeat secrete key words in a request order. Practical

10

applications for automatic speaker identification are obviously various kinds of security

systems. Human voice can serve as a key for any security objects, and it is not so easy in

general to lose or forget it. Another important property of speech is that it can be transmitted

by telephone channel, for example. This provides an ability to automatically identify speakers

and provide access to security objects by telephone. Nowadays, this approach begins to be

used for telephone credit card purchases and bank transactions. Human voice can also be used

to prove identity during access to any physical facilities by storing speaker model in a small

chip, which can be used as an access tag, and used instead of a pin code. Another important

application for speaker identification is to monitor people by their voices. For instance, it is

useful in information retrieval by speaker indexing of some recorded debates or news, and

then retrieving speech only for interesting speakers. It can also be used to monitor criminals in

common places by identifying them by voices. In fact, all these examples are actually

examples of real time systems. For any identification system to be useful in practice, the time

response, or time spent on the identification should be minimized. Growing size of speaker

database is its major limitation.

1.2 Application

Practical applications for automatic speaker identification are obviously various kinds of

security systems. Human voice can serve as a key for any security objects, and it is not so

easy in general to lose or forget it. Another important property of speech is that it can be

transmitted by telephone channel, for example. This provides an ability to automatically

identify speakers and provide access to security objects by telephone. Nowadays, this

approach begins to be used for telephone credit card purchases and bank transactions. Human

voice can also be used to prove identity during access to any physical facilities by storing

speaker model in a small chip, which can be used as an access tag, and used instead of a pin

code. Another important application for speaker identification is to monitor people by their

voices. For instance, it is useful in information retrieval by speaker indexing of some recorded

debates or news, and then retrieving speech only for interesting speakers. It can also be used

to monitor criminals in common places by identifying them by voices.

In fact, all these examples are actually examples of real time systems. For any identification

system to be useful in practice, the time response, or time spent on the identification should be

minimized. Growing size of speaker database is its major limitation.

11

CHAPTER 2

METHODOLOGY

2.1 Algorithm

1. Record the test sound through microphone.

2. Convert the recorded sound into .wav format.

3. Load recorded sound files from database.

4. Extract features from test file for recognition.

5. Extract features from all the files stored in the database.

6. Find out the centroids of the test file by any clustering algorithm.

7. Find out the centroids of the sample files stored in database so that codebooks can be

generated.

8. Calculate the Euclidean distance between the test file and individual samples of the

database.

9. Find out the sample having minimum distance with the test file.

10. The sample corresponding to the minimum distance is most likely the author of the test

sound.

12

2.2 Flowchart

13

CHAPTER 3

IDENTIFICATION BACKGROUND

In this chapter we discuss theoretical background for speaker identification. We start from the

digital signal processing theory. Then we move to the anatomy of human voice production

organs and discuss the basic properties of the human speech production mechanism and

techniques for its modeling. This model will be used in the next chapter when we will discuss

techniques for the extraction of the speaker characteristics from the speech signal.

3.1 DSP Fundamentals

According to its abbreviation, Digital Signal Processing (DSP) is a part of computer science,

which operates with special kind of data – signals. In most cases, these signals are obtained

from various sensors, such as microphone or camera. DSP is the mathematics, mixed with the

algorithms and special techniques used to manipulate with these signals, converted to the

digital form.

3.1.1 Basic Definitions

By signal we mean here a relation of how one parameter is related to another parameter. One

of these parameters is called independent parameter (usually it is time), and the other one is

called dependent, and represents what we are measuring. Since both of these parameters

belong to the continuous range of values, we call such signal continuous signal. When

continuous signal is passed through an Analog-To-Digital converter (ADC)it is said to be

discrete or digitized signal. Conversion works in the following way: every time period, which

occurs with frequency, called sampling frequency, signal value is taken and quantized, by

selecting an appropriate value from the range of7possible values. This range is called

quantization precision and usually represented as an amount of bits available to store signal

value. Based on the sampling theorem, proved by Nyquist in 1940, digital signal can contain

frequency components only up to one half of the sampling rate. Generally, continuous signals

are what we have in nature while discrete signals exist mostly inside computers. Signals that

use time as the independent parameter are said to be in the time domain, while signals that use

14

frequency as the independent parameter are said to be in the frequency domain. One of the

important definitions used in DSP is the definition of linear system. By system we mean here

any process that produces output signal in response on a given input signal. A system is called

linear if it satisfies the following three properties: homogeneity, additivity and shift

invariance.

3.1.2 Convolution

An impulse is a signal composed of all zeros except one non-zero point. Every signal can be

decomposed into a group of impulses, each of them then passed through a linear system and

the resulting output components are synthesized or added together. The resulting signal is

exactly the same as obtained by passing the original signal through the system. Every impulse

can be represented as a shifted and scaled delta function, which is a normalized impulse, that

is, sample number zero has a value of one and all other samples have a value of zero. When

the delta function is passed through a linear system, its output is called impulse response. If

two systems are different they will have different impulse responses. Scaled and shifted

impulse response and scaling and shifting of the input are identical to the scaling and shifting

of the output. It means that knowing systems impulse response we know everything about the

system. Convolution is a formal mathematical operation, which is used to describe

Relationship between three signals of interest: input and output signals, and the impulse

response of the system. It is usually said that the output signal is the input signal convolved

with the system‟s impulse response. Mathematics behind the convolution does not restrict

how long the impulse response is. It only says that the size of the output signal is the size of

the input signal plus the size of the impulse response minus one. Convolution is very

important concept in DSP. Based on the properties of linear systems it provides the way of

combining two signals to form a third signal. A lot of mathematics behind the DSP is based

on the convolution.

3.1.3 Discrete Fourier Transform

Fourier transform belongs to the family of linear transforms widely used in DSP based on

decomposing signal into sinusoids (sine and cosine waves).Usually in DSP we use the

Discrete Fourier Transform (DFT), a special kind of Fourier transform used to deal with a

periodic discrete signals. Actually there are an infinite number of ways how signal can be

15

decomposed but sinusoids are selected because of their sinusoidal fidelity that means that

sinusoidal input to the linear system will produce sinusoidal output, only the amplitude and

phase may change, frequency and shape remain the same.

3.2 Human Speech Production Model

Undoubtedly, ability to speak is the most important way for humans to communicate between

each other. Speech conveys various kind of information, which is essentially the meaning of

information speaking person wants to impart, individual information representing speaker and

also some emotional filling. Speech production begins with the initial formalization of the

idea which speaker wants to impart to the listener. Then speaker converts this idea into the

appropriate order of words and phrases according to the language. Finally, his brain produces

motor nerve commands, which move the vocal organs in an appropriate way. Understanding

of how human produce sounds forms the basis of speaker identification.

3.2.1 Anatomy

The sound is an acoustic pressure formed of compressions and rarefactions of air molecules

that originate from movements of human anatomical structures. Most important components

of the human speech production system are the lungs (source of air during speech), trachea

(windpipe), larynx or its most important part vocal cords (organ of voice production), nasal

cavity (nose), soft palate or velum (allows passage of air through the nasal cavity), hard

palate (enables consonant articulation), tongue, teeth and lips. All these components, called

articulators by speech scientists, move to different positions to produce various sounds. Based

on their production, speech sounds can also be divided into consonants and voiced and

unvoiced vowels.

From the technical point of view, it is more useful to think about speech production system in

terms of acoustic filtering operations that affect the air going from the lungs. There are three

main cavities that comprise the main acoustic filter. According to they are nasal, oral and

pharyngeal cavities.

The articulators are responsible for changing the properties of the system and form its output.

Combination of these cavities and articulators is called vocal tract. Its simplified acoustic

16

model is represented in Figure 3.1.

Fig. 3.1 Vocal Tract Model

Speech production can be divided into three stages: first stage is the sound source production,

second stage is the articulation by vocal tract, and the third stage is sound radiation or

propagation from the lips and/or nostrils. A voiced sound is generated by vibratory motion of

the vocal cords powered by the airflow generated by expiration. The frequency of oscillation

of vocal cords is called the fundamental frequency. Another type of sounds -unvoiced sound

is produced by turbulent airflow passing through a narrow constriction in the vocal tract.

In a speaker recognition task, we are interested in the physical properties of human vocal

tract. In general it is assumed that vocal tract carries most of the speaker related information.

However, all parts of human vocal tract described above can serve as speaker dependent

characteristics. Starting from the size and power of lungs, length and flexibility of trachea and

ending by the size, shape and other physical characteristics of tongue, teeth and lips. Such

characteristics are called physical distinguishing factors. Another aspects of speech

production that could be useful indiscriminating between speakers are called learned factors,

which includes peaking rate, dialect, and prosodic effects.

17

3.2.2 Vocal Model

In order to develop an automatic speaker identification system, we should construct

reasonable model of human speech production system. Having such a model, we can extract

its properties from the signal and, using them, we can decide whether or not two signals

belong to the same model and as a result to the same speaker.

Modeling process is usually divided into two parts: the excitation (or source) modeling and

the vocal tract modeling. This approach is based on the assumption of independence of the

source and the vocal tract models. Let us look first at the continuous-time vocal tract model

called multi tube lossless model, which is based on the fact that production of speech is

characterized by changing the vocal tract shape. Because the formalization of such a time-

varying vocal-tract shape model is quite complex, in practice it is simplified to the series of

concatenated lossless acoustic tubes with varying cross-sectional areas, as shown in Figure.

This model consists of a sequence of tubes with cross-sectional areas Ak and lengths Lk. In

practice the lengths of tubes assumed to be equal. If a large amount of short tubes is used,

then we can approach to the continuously varying cross-sectional area, but at the cost of more

complex model. Tract model serves as a transition to the more general discrete-time model,

also known as source-filter model, which is shown in Figure.

Fig. 3.2 Multi tube Lossless Model

18

In this model, the voice source is either a periodic pulse stream or uncorrelated white noise, or

a combination of these. This assumption is based on the evidence from human anatomy that

all types of sounds, which can be produced by humans, are divided into three general

categories: voiced, unvoiced and combination of these two. Voiced signals can be modeled as

a basic or fundamental frequency signal filtered by the vocal tract and unvoiced as a white

noise also filtered by the vocal tract. Here E(z) represents the excitation function, H(z)

represents the transfer function, ands(n) is the output of the whole speech production system

[8]. Finally, we can think about vocal tract as a digital filter, which affects source signal and

about produced sound output as a filter output. Then based on the digital filter theory we can

extract the parameters of the system from its output

Fig.3.3 Source Filter Model

3.3 Speaker Identification

The human speech conveys different types of information. The primary type is the meaning or

words, which speaker tries to pass to the listener. But the other types that are also included in

the speech are information about language being spoken, speaker emotions, gender and

identity of the speaker. The goal of automatic speaker recognition is to extract, characterize

and recognize the information about speaker identity. Speaker recognition is usually divided

into two different branches, speaker verification and speaker identification. Speaker

verification task is to verify the claimed identity of person from his voice. This process

involves only binary decision about claimed identity. In speaker identification there is no

identity claim and the system decides who the speaking person is Speaker identification can

19

be further divided into two branches. Open-set speaker identification decides to whom of the

registered speakers‟ unknown speech sample belongs or makes a conclusion that the speech

sample is unknown. In this work, we deal with the closed-set speaker identification, which is

a decision making process of whom of the registered speakers is most likely the author of the

unknown speech sample. Depending on the algorithm used for the identification, the task can

also be divided into text-dependent and text-independent identification. The difference is that

in the first case the system knows the text spoken by the person while in the second case the

system must be able to recognize the speaker from any text.

The process of speaker identification is divided into two main phases.

1) Speaker enrollment

2) Speaker Identification

During the first phase, speaker enrollment, speech samples are collected from

The speakers and they are used to train their models. The collection of enrolled models is also

called a speaker database. Then in the enrollment phase, these features are modeled and stored

in the speaker database. This process is represented in following figure.

Fig. 3.4 Enrollment Phase

The next phase of speech recognition is identification phase which is shown below in the

figure. In the second phase, identification phase, a test sample from an unknown speaker is

compared against the speaker database. Both phases include the same first step, Feature

extraction, which is used to extract speaker dependent characteristics from speech. The main

purpose of this step is to reduce the amount of test data while retaining speaker discriminative

information.

20

Fig. 3.5 Identification Phase

However, these two phases are closely related. For instance, identification algorithm usually

depends on the modeling algorithm used in the enrollment phase. This thesis mostly

concentrates on the algorithms in the identification phase and their optimization.

21

CHAPTER 4

SPEECH FEATURE EXTRACTION

4.1 Introduction

The acoustic speech signal contains different kind of information about speaker. This includes

“high-level” properties such as dialect, context, speaking style, emotional state of speaker and

many others. A great amount of work has been already done in trying to develop

identification algorithms based on the methods used by humans to identify speaker. But these

efforts are mostly impractical because of their complexity and difficulty in measuring the

speaker discriminative properties used by humans. More useful approach is based on the

“low-level” properties of the speech signal such as pitch (fundamental frequency of the vocal

cord vibrations), intensity, formant frequencies and their bandwidths, spectral correlations,

short-time spectrum and others. From the automatic speaker recognition task point of view, it

is useful to think about speech signal as a sequence of features that characterize both the

speaker as well as the speech. It is an important step in recognition process to extract

sufficient information for good discrimination in a form and size which is amenable for

effective modeling. The amount of data, generated during the speech production, is quite large

while the essential characteristics of the speech process change relatively slowly and

therefore, they require less data.

According to these matters feature extraction is a process of reducing data while retaining

speaker discriminative information.

When dealing with speech signals there are some criteria that the extracted features should

meet. Some of them are listed below:

discriminate between speakers while being tolerant of intra-speaker variability,

easy to measure,

stable over time,

occur naturally and frequently in speech,

change little from one speaking environment to another,

Not be susceptible to mimicry.

22

For MFCC feature extraction, we use the melcepst function from Voicebox. Because of its

nature, the speech signal is a slowly varying signal or quasi-stationary. It means that when

speech is examined over a sufficiently short period of time (20-30 milliseconds) it has quite

stable acoustic characteristics. It leads to the useful concept of describing human speech

signal, called “short-term analysis”, where only a portion of the signal is used to extract signal

features at one time. It works in the following way: predefined length window (usually 20-30

milliseconds) is moved along the signal with an overlapping (usually 30-50% of the window

length) between the adjacent frames. Overlapping is needed to avoid losing of information.

Parts of the signal formed in such a way are called frames. In order to prevent an abrupt

change at the end points of the frame, it is usually multiplied by a window function. The

operation of dividing signal into short intervals is called windowing and such segments are

called windowed frames (or sometime just frames). There are several window functions used

in speaker recognition area but the most popular is Hamming window function, which is

described by the following equation:

𝑊 𝑛 = 0.54− 0.46cos(2𝑛𝜋

𝑁 − 1)

Where N is the size of the window or frame. A set of features extracted from one frame is

called feature vector.

4.2 Short Term Analysis

Frame1, Frame2, Frame3, … Frame N the speech signal is slowly varying over time (quasi-

stationary), that is when the signal is examined over a short period of time (5-100msec), the

signal is fairly stationary. Therefore speech signals are often analyzed in short time segments,

which is referred to as short-time spectral analysis.

23

Fig. 4.1 Short Term Analysis

4.3 Cepstrum

The speech signal s(n) can be represented as a “quickly varying” source signal e(n)

convolved with the “slowly varying” impulse response h(n) of the vocal tract represented as a

linear filter. We have access only to the output (speech signal) and it is often desirable to

eliminate one of the components. Separation of the source and the filter parameters from the

mixed output is in general difficult problem when these components are combined using not

linear operation, but there are various techniques appropriate for components combined

linearly. The cepstrum is representation of the signal where these two components are

resolved into two additive parts. It is computed by taking the inverse DFT of the logarithm of

the magnitude spectrum of the frame. This is represented in the following equation:

24

𝐶𝑒𝑝𝑠𝑡𝑟𝑢𝑚 𝑓𝑟𝑎𝑚𝑒 = 𝐼𝐷𝐹𝑇 log 𝐷𝐹𝑇 𝑓𝑟𝑎𝑚𝑒

Some explanation of the algorithm is therefore needed. By moving to the frequency domain

we are changing from the convolution to the multiplication. Then by taking logarithm we are

moving from the multiplication to the addition. That is desired division into additive

components. Then we can apply linear operator inverse DFT, knowing that the transform will

operate individually on these two parts and knowing what Fourier transform will do with

quickly varying and slowly varying parts. Namely it will put them into different, hopefully

separate parts in new, also called quefrency axis.

4.3.1 Delta Cepstrum

The cepstral coefficients provide a good representation of the local spectral properties

of the framed speech. But, it is well known that a large amount of information resides

in the transitions from one segment of speech to another. An improved representation can

be obtained by extending the analysis to include information about the temporal cepstral

derivative. Delta Cepstrum is used to catch the changes between the different frames. Delta

Cepstrum defined as:

∆𝐶𝑠 𝑛;𝑚 =1

2{𝐶𝑠 𝑛;𝑚 + 1 − 𝐶𝑠(𝑛;𝑚 − 1)}

𝑖 = 1,… ,𝑄

The results of the feature extraction are a series of vectors characteristic of the time-

varying spectral properties of the speech signal.

25

Fig.4.2 Speech Magnitude Spectrum

We can see that the speech magnitude spectrum is combined from slow and quickly varying

parts. But there is still one problem: multiplication is not a linear operation. We can solve it

by taking logarithm from the multiplication as described earlier. Finally, let us look at the

result of the inverse DFT in Figure.

Fig.4.3Cepstrum

We can see that two components are clearly distinctive now.

4.4 Mel-Frequency Cepstrum Coefficients

In this project we are using Mel Frequency Cepstral Coefficient. Mel frequency Cepstral

Coefficients are coefficients that represent audio based on perception. This coefficient has a

great success in speaker recognition application. It is derived from the Fourier Transform of

the audio clip. In this technique the frequency bands are positioned logarithmically, whereas

in the Fourier Transform the frequency bands are not positioned logarithmically. As the

frequency bands are positioned logarithmically in MFCC, it approximates the human

26

system response more closely than any other system. These coefficients allow better

processing of data. In the Mel Frequency Cepstral Coefficients the calculation of the Mel

Cepstrum is same as the real Cepstrum except the Mel Cepstrum‟s frequency scale is warped

to keep up a correspondence to the Mel scale. The Mel scale was projected by Stevens,

Volkmann and Newman in 1937. The Mel scale is mainly based on the study of observing the

pitch or frequency Perceived by the human. The scale is divided into the units mel. In this test

the listener or test person started out hearing a frequency of 1000 Hz, and labeled it 1000 Mel

for reference. Then the listeners were asked to change the frequency till it reaches to the

frequency twice the reference frequency. Then this frequency labeled 2000 Mel. The same

procedure repeated for the half the frequency, then this frequency labeled as 500 Mel, and so

on. On this basis the normal frequency is mapped into the Mel frequency. The Mel scale is

normally a linear mapping below 1000 Hz and logarithmically spaced above 1000 Hz. Figure

below shows the example of normal frequency is mapped into the Mel frequency.

Mel-frequency cepstrum coefficients (MFCC) are well known features used to describe

speech signal. They are based on the known evidence that the information carried by low-

frequency components of the speech signal is phonetically more important for humans than

carried by high-frequency components. Technique of computing MFCC is based on the short-

term analysis, and thus from each frame a MFCC vector is computed. MFCC extraction is

similar to the cepstrum calculation except that one special step is inserted, namely the

frequency axis is warped according to the mel-scale. Summing up, the process of extracting

MFCC from continuous speech is illustrated in Figure.

A “mel” is a unit of special measure or scale of perceived pitch of a tone. It does not

correspond linearly to the normal frequency, indeed it is approximately linear below 1 kHz

and logarithmic above.

27

4.4.1Computing of melcepstrum coefficients

Fig. 4.4 Computing of mel-cepstrum coefficient

Figure above shows the calculation of the Mel Cepstrum Coefficients. Here we are using the

bank filter to warping the Mel frequency. Utilizing the bank filter is much more convenient to

do Mel frequency warping, with filters centered according to Mel frequency. According to the

Mel frequency the width of the triangular filters vary and so the log total energy in a critical

band around the center frequency is included. After warping are a number of coefficients.

Finally we are using the Inverse Discrete Fourier Transformer for the cepstral coefficients

calculation. In this step we are transforming the log of the quefrench domain coefficients to

the frequency domain. Where N is the length of the DFT we used in the cepstrum section.

To place more emphasize on the low frequencies one special step before inverse DFT in

calculation of cepstrum is inserted, namely mel-scaling. A “mel” is a unit of special measure

or scale of perceived pitch of a tone. It does not correspond linearly to the normal frequency,

indeed it is approximately linear below 1 kHz and logarithmic above. This approach is based

on the psychophysical studies of human perception of the frequency content of sounds. One

28

useful way to create mel-spectrum is to use a filter bank, one filter for each desired mel-

frequency component. Every filter in this bank has triangular band pass frequency response.

Such filters compute the average spectrum around each center frequency with increasing

bandwidths, as displayed in Figure.

Fig. 4.5 Triangular Filter used to compute mel-cepstrum

This filter bank is applied in frequency domain and therefore, it simply amounts to taking

these triangular filters on the spectrum. In practice the last step of taking inverse DFT is

replaced by taking discrete cosine transform(DCT) for computational efficiency.

The number of resulting mel-frequency cepstrum coefficients is practically chosen relatively

low, in the order of 12 to 20 coefficients. The zeroth coefficient is usually dropped out

because it represents the average log-energy of the frame and carries only a little speaker

specific information.

However, MFCC are not equally important in speaker identification and thus some

coefficients weighting might by applied to acquire more precise result. Different approach to

the computation of MFCC than described in this work is represented in that is simplified by

omitting filter bank analysis.

29

4.5 Framing and Windowing

The speech signal is slowly varying over time (quasi-stationary) that is when the signal is

examined over a short period of time (5-100msec), the signal is fairly stationary. Therefore

speech signals are often analyzed in short time segments, which are referred to as

short-time spectral analysis. This practically means that the signal is blocked in frames

of typically 20-30 msec. Adjacent frames typically overlap each other with 30-50%,

this is done in order not to lose any information due to the windowing.

Fig 4.6 Quasi Stationary

After the signal has been framed, each frame is multiplied with a window function w(n) with

length N, where N is the length of the frame. Typically the Hamming window is used:

𝑊 𝑛 = 0.54− 0.46cos(2𝜋𝑛

𝑁 − 1)

Where 0 ≤ 𝑛 ≤ 𝑁 − 1

The windowing is done to avoid problems due to truncation of the signal as windowing helps

in the smoothing of the signal.

30

CHAPTER 5

VECTOR QUANTIONZATION

A speaker recognition system must able to estimate probability distributions of the

computed feature vectors. Storing every single vector that generate from the training mode

is impossible, since these distributions are defined over a high-dimensional space. It is

often easier to start by quantizing each feature vector to one of a relatively small number of

template vectors, with a process called vector quantization. VQ is a process of taking a large

set of feature vectors and producing a smaller set of measure vectors that represents the

centroids of the distribution.

The technique of VQ consists of extracting a small number of representative feature vectors

as an efficient means of characterizing the speaker specific features. By means of VQ, storing

every single vector that we generate from the training is impossible.

Vector quantization (VQ) is a process of mapping vectors from a vector space to a finite

number of regions in that space. These regions are called clusters and represented by their

central vectors or centroids. A set of centroids, which represents the whole vector space, is

called a codebook. In speaker identification, VQ is applied on the set of feature vectors

extracted from the speech sample and as a result, the speaker codebook is generated. Such

codebook has a significantly smaller size than extracted vector set and referred as a speaker

model.

Actually, there is some disagreement in the literature about approach used in VQ. Some

authors consider it as a template matching approach because VQ ignores all temporal

variations and simply uses global averages (centroids). Other authors consider it as a

stochastic or probabilistic method, because VQ uses centroids to estimate the modes of a

probability distribution. Theoretically it is possible that every cluster, defined by its centroid,

models particular component of the speech. But practically, however, VQ creates

unrealistically clusters with rigid boundaries in a sense that every vector belongs to one and

only one cluster. Mathematically a VQ task is defined as follows: given a set of feature

vectors, find a partitioning of the feature vector space into the predefined number of regions,

which do not overlap with each other and added together form the whole feature vector space.

Every vector inside such region is represented by the corresponding centroid. The process of

VQ for two speakers is represented in Figure.

31

Fig. 5.1 Vector Quantization of Two Speakers

There are two important design issues in VQ: the method for generating the codebook and

codebook size. Known clustering algorithms for codebook generation are:

Generalized Lloyd algorithm (GLA),

Self-organizing maps (SOM),

Pair wise nearest neighbor (PNN),

Iterative splitting technique (SPLIT),

Randomized local search (RLS).

K means

According to, iterative splitting technique should be used when the running time is important

but RLS is simpler to implement and generates better codebooks in the case of speaker

32

identification task. Codebook size is a trade-off between running time and identification

accuracy. With large size, identification accuracy is high but at the cost of running time and

vice versa. Experimental result obtained in is that saturation point choice is 64 vectors in

codebook. The quantization distortion (quality of quantization) is usually computed as the

sum of squared distances between vector and its representative (centroid). The well-known

distance measures are Euclidean, city block distance.

33

CHAPTER 6

K MEANS

The K-means algorithm partitions the T feature vectors into M centroids. The algorithm first

chooses M cluster-centroids among the T feature vectors. Then each feature vector is assigned

to the nearest centroid, and the new centroids are calculated. This procedure is continued until

a stopping criterion is met, that is the mean square error between the feature vectors and the

cluster-centroids is below a certain threshold or there is no more change in the cluster-center

assignment.

In data mining, k-means clustering is a method of cluster analysis which aims

to partition n observations into k clusters in which each observation belongs to the cluster

with the nearest mean. This results into a partitioning of the data space into cells. The problem

is computationally difficult (NP-hard), however there are efficient heuristic algorithms that

are commonly employed and converge fast to a local optimum. These are usually similar to

the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative

refinement approach employed by both algorithms. Additionally, they both use cluster centers

to model the data, however k-means clustering tends to find clusters of comparable spatial

extent, while the expectation-maximization mechanism allows clusters to have different

shapes.

The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is

often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in

the computer science community.

Given an initial set of k means m1(1)

,…,mk(1)

(see below), the algorithm proceeds by

alternating between two steps:

34

Where each 𝑥𝑝 goes into exactly one 𝑆𝑖(𝑡)

, even if it could go in two of them. Update step:

Calculate the new means to be the centroid of the observations in the cluster.

The algorithm is deemed to have converged when the assignments no longer change.

Commonly used initialization methods are Forgy and Random Partition. The Forgy method

randomly chooses k observations from the data set and uses these as the initial means. The

Random Partition method first randomly assigns a cluster to each observation and then

proceeds to the Update step, thus computing the initial means to be the centroid of the

cluster's randomly assigned points. The Forgy method tends to spread the initial means out,

while Random Partition places all of them close to the center of the data set. According to

Hamerly et al., the Random Partition method is generally preferable for algorithms such as the

k-harmonic means and fuzzy k-means. For expectation maximization and standard k-means

algorithms, the Forgy method of initialization is preferable.

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real

vector, k-means clustering aims to partition the n observations into k sets

(k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS):

Where μi is the mean of points in Si.

The two key features of k-means which make it efficient are often regarded as its biggest

drawbacks:

Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.

The number of clusters k is an input parameter: an inappropriate choice of k may yield

poor results. That is why, when performing k-means, it is important to run diagnostic

checks for determining the number of clusters in the data set.

A key limitation of k-means is its cluster model. The concept is based on spherical clusters

that are separable in a way so that the mean value converges towards the cluster center. The

clusters are expected to be of similar size, so that the assignment to the nearest cluster center

35

is the correct assignment. When for example applying k-means with a value of onto

the well-known Iris flower data set, the result often fails to separate the three Iris species

contained in the data set. With , the two visible clusters (one containing two species)

will be discovered, whereas with one of the two clusters will be split into two even

parts. In fact, is more appropriate for this data set, despite the data set containing

3 classes. As with any other clustering algorithm, the k-means result relies on the data set to

satisfy the assumptions made by the clustering algorithms. It works very well on some data

sets, while failing miserably on others.

The result of k-means can also be seen as the Voronoi cells of the cluster means. Since data is

split halfway between cluster means, this can lead to suboptimal splits as can be seen in the

"mouse" example. The Gaussian models used by the Expectation-maximization

algorithm (which can be seen as a generalization of k-means) are more flexible here by having

both variances and covariance. The EM result is thus able to accommodate clusters of variable

size much better than k-means as well as correlated cluster.

k-means clustering in particular when using heuristics such as Lloyd's algorithm is rather easy

to implement and apply even on large data sets. As such, it has been successfully used in

various topics, ranging from market segmentation, computer vision, geostatistics

and astronomy to agriculture. It often is used as a preprocessing step for other algorithms, for

example to find a starting configuration.

k-means clustering, and its associated expectation-maximization algorithm, is a special case of

a Gaussian mixture model, specifically, the limit of taking all covariance as diagonal, equal,

and small. It is often easy to generalize a k-means problem into a Gaussian mixture model.

6.1Mean shift clustering

Basic mean shift clustering algorithms maintain a set of data points the same size as the input

data set. Initially, this set is copied from the input set. Then this set is iteratively replaced by

the mean of those points in the set that are within a given distance of that point. By

contrast, k-means restricts this updated set to k points usually much less than the number of

points in the input data set, and replaces each point in this set by the mean of all points in

the input set that are closer to that point than any other (e.g. within the Voronoi partition of

each updating point). A mean shift algorithm that is similar then to k-means, called likelihood

mean shift, replaces the set of points undergoing replacement by the mean of all points in the

36

input set that are within a given distance of the changing set. One of the advantages of mean

shift over k-means is that there is no need to choose the number of clusters, because mean

shift is likely to find only a few clusters if indeed only a small number exist. However, mean

shift can be much slower than k-means. Mean shift has soft variants much as k-means does.

6.2 Bilateral filtering

k-means implicitly assumes that the ordering of the input data set does not matter.

The bilateral filter is similar to K-means and mean shift in that it maintains a set of data points

that are iteratively replaced by means. However, the bilateral filter restricts the calculation of

the (kernel weighted) mean to include only points that are close in the ordering of the input

data. This makes it applicable to problems such as image denoising, where the spatial

arrangement of pixels in an image is of critical importance.

6.3 Speaker Matching

During the matching a matching score is computed between extracted feature vectors and

every speaker codebook enrolled in the system. Commonly it is done as a partitioning

extracted feature vectors, using centroids from speaker codebook, and calculating matching

score as a quantization distortion.

Another choice for matching score is mean squared error (MSE), which is computed as the

sum of the squared distances between the vector and nearest centroid divided by number of

vectors extracted from the speech sample.

The quantization distortion (quality of quantization) is usually computed as the sum of

squared distances between vector and its representative (centroid). The well-known distance

measures are Euclidean, city block distance, weighted Euclidean and Mahalanobis. They are

represented in the following equations:

37

Where x and y are multi-dimensional feature vectors and D is a weighting matrix. When D is

a covariance matrix weighted Euclidean distance also called Mahalanobis distance. Weighted

Euclidean distance where D is a diagonal matrix and consists of diagonal elements of

covariance matrix is more appropriate, in a sense that it provides more accurate identification

result. The reason for such result is that because of their nature not all components in feature

vectors are equally important and weighted distance might give more precise result.

38

CHAPTER 7

EUCLIDEAN DISTANCE

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between

two points that one would measure with a ruler, and is given by the Pythagorean formula. By

using this formula as distance, Euclidean space (or even any inner product space) becomes

a metric space. The associated norm is called the Euclidean norm. Older literature refers to the

metric as Pythagorean metric.

The Euclidean distance between points p and q is the length of the line segment connecting

them ( ).

In Cartesian coordinates, if p = (p1, p2,..., pn) and q = (q1, q2,..., qn) are two points

in Euclidean n-space, then the distance from p to q, or from q to p is given by:

The position of a point in a Euclidean n-space is a Euclidean vector. So, p and q are Euclidean

vectors, starting from the origin of the space, and their tips indicate two points. The Euclidean

norm, or Euclidean length, or magnitude of a vector measures the length of the vector:

Where the last equation involves the dot product.

A vector can be described as a directed line segment from the origin of the Euclidean space

(vector tail), to a point in that space (vector tip). If we consider that its length is actually the

distance from its tail to its tip, it becomes clear that the Euclidean norm of a vector is just a

special case of Euclidean distance: the Euclidean distance between its tail and its tip.

39

The distance between points p and q may have a direction (e.g. from p to q), so it may be

represented by another vector, given by:

In a three-dimensional space (n=3), this is an arrow from p to q, which can be also regarded as

the position of q relative to p. It may be also called a displacement vector if p and q represent

two positions of the same point at two successive instants of time.

The Euclidean distance between p and q is just the Euclidean length of this distance (or

displacement) vector:

Which is equivalent to equation 1, and also to:

7.1One dimension

In one dimension, the distance between two points on the real line is the absolute value of

their numerical difference. Thus if x and y are two points on the real line, then the distance

between them is given by:

In one dimension, there is a single homogeneous, translation-invariant metric (in other words,

a distance that is induced by a norm), up to a scale factor of length, which is the Euclidean

distance. In higher dimensions there are other possible norms.

40

7.2 Two dimensions

In the Euclidean plane, if p = (p1, p2) and q = (q1, q2) then the distance is given by

This is equivalent to the Pythagorean theorem.

Alternatively, it follows from (2) that if the polar coordinates of the point p are (r1, θ1) and

those of q are (r2, θ2), then the distance between the points is

7.3 Three dimensions

In three-dimensional Euclidean space, the distance is

7.4 N dimensions

In general, for an n-dimensional space, the distance is

7.5 Squared Euclidean Distance

The standard Euclidean distance can be squared in order to place progressively greater weight

on objects that are further apart. In this case, the equation becomes

41

Squared Euclidean Distance is not a metric as it does not satisfy the triangle inequality,

however it is frequently used in optimization problems in which distances only have to be

compared.

It is also referred to as quadrance within the field of rational trigonometry.

7.6 Decision making

The next step after computing of matching scores for every speaker model enrolled in the

system is the process of assigning the exact classification mark for the input speech. This

process depends on the selected matching and modeling algorithms. In template matching,

decision is based on the computed distances, whereas in stochastic matching it is based on the

computed probabilities.

Fig. 7.1 Decision Process

In the recognition phase an unknown speaker, represented by a sequence of feature vectors is

compared with the codebooks in the database. For each codebook a distortion measure is

computed, and the speaker with the lowest distortion is chosen as the most likely speaker.

42

CHAPTER 8

RESULT AND CONCLUSION

8.1 Result

We show the results of our experiments. Every chart is preceded by the short explanation

what we measured and why and followed by the short discussion about results.

8.2 Pruning Basis

First, we start from the basis for speaker pruning. We ran a few tests to see how the matching

function behaves during the identification and when it stabilizes. Chart in Figure 7.1

represents the variation of matching score depending on the available test vectors for 20

different speakers. One vector refers to the one analysis frame.

Fig: 8.1 Distance stabilization

In this figure, the bold line represents the owner of the test sample. It can be seen that at the

beginning the matching score for correct speaker is somewhere among the other scores. But

after large enough amount of new vectors extracted from the test speech, it becomes close

with only few score sand at the end it becomes the smallest score. Actually, this is the

43

underlying reason for speaker pruning: when we have more data we can drop some of the

models from identification.

8.2.1 Static pruning

In the next experiment, we consider the trade-off between identification error rate and average

time spent on the identification for static pruning. By varying the pruning interval or number

of pruned speakers we expect different error rates and different identification times. From

several runs with different parameter combination we can plot the error rate as a function of

average identification time. To obtain such dependency, we fixed three values for pruning

interval and were varying the number of pruned speakers. The results are shown in Figure 7.3.

Fig: 8.2Evaluation of static variant using different pruning intervals

From this figure we can see that all curves follow almost the same shape. It is because in

order to have fast identification we have to choose either small pruning interval or large

number of pruned speakers. On the other hand, in order to have low error rate we have to

choose large interval or small number of pruned speakers. The main conclusion from this

figure is that these two parameters compensate each other.

44

8.2.2 Adaptive pruning

The idea of adaptive pruning is based on the assumption that distribution of matching score

follows more or less the Gaussian curve. In Figure 7.4 we can see the distributions of

matching scores for two typical identifications.

Fig: 8.3 Examples of the matching score distribution

From this figure we can see that distribution is not exactly follows the Gaussian curve but its

shape is almost the same. In the next experiment, we consider the trade-off between

identification error rate and average time spenton the identification for adaptive pruning. By

fixing the parameter ŋ and varying the pruning interval we obtained desired dependency.

8.3 Conclusion

The goal of this project was to create a speaker recognition system, and apply it to a speech of

an unknown speaker. By investigating the extracted features of the unknown speech and

then compare them to the stored extracted features for each different speaker in order to

identify the unknown speaker.

45

The feature extraction is done by using MFCC (Mel Frequency Cepstral Coefficients). The

function „melcepst‟ is used to calculate the melcepstrum of a signal. The speaker was modeled

using Vector Quantization (VQ). A VQ codebook is generated by clustering the training

feature vectors of each speaker and then stored in the speaker database. In this method, the

K means algorithm is used to do the clustering. In the recognition stage, a distortion measure

which based on the minimizing the Euclidean distance was used when matching an unknown

speaker with the speaker database.

During this project, we have found out that the VQ based clustering approach provides

us with the faster speaker identification process.

Related Documents