Session T4J
San Juan, PR July 23 – 28, 2006
9th International Conference on Engineering Education
T4J-1

Design of Matlab®-Based Automatic Speaker Recognition Systems

Jamel Price and Ali Eydgahi
Department of Engineering and Aviation Sciences
University of Maryland Eastern Shore
Princess Anne, MD 21853
[email protected]

Abstract - This paper presents the design of an automatic speaker recognition system in the Matlab® environment, developed as part of a NASA-sponsored undergraduate research project. The project represents one of the many design and development activities that the University of Maryland Eastern Shore offers to undergraduate students as part of its research experience in Science, Technology, Engineering, and Mathematics. The goal of this project was to determine whether Matlab® can be used to construct a simple, yet complete and representative, automatic speaker recognition system. The issues considered were: 1) Can Matlab® be used effectively to complete the aforementioned task? 2) Which speech recognition algorithm, Dynamic Time Warping or Hidden Markov Modeling, produces the most representative speaker recognition system? And, if the first two issues yield promising results, 3) Can a Matlab®-based speaker recognition system be ported to a real-world environment, such as an airplane cockpit, for recording and performing complex voice commands?

Index Terms – Chebychev Distance, Digital Signal Processing, Dynamic Time Warping Algorithm, Hidden Markov Modeling Algorithm, Speaker Recognition System, Viterbi Algorithm.

INTRODUCTION

Digital signal processing (DSP) is the processing of signals by digital means. In many cases the signal is initially analog, i.e., an electrical voltage or current, and one of the processing techniques produces a discrete or digital output analogous to the analog input. Signals commonly need to be processed in a variety of ways.
Today, the filtering of signals to improve signal quality or to extract important information is done by digital signal processors using DSP techniques rather than by analog electronics [1]. Although the mathematical theory underlying DSP techniques such as the Fast Fourier Transform (FFT), digital filter design, and signal compression can be fairly complex, the numerical operations required to implement these techniques are in fact very simple [1]. Although the FFT is not optimal for many filtering applications, it is a very commonly used technique for analyzing and filtering digital signals. The digital signal processor is a programmable microprocessor with its own native instruction codes that is capable of carrying out millions of floating-point operations per second. By coding the various digital signal processes, a digital signal processor can be replicated in a data-manipulation and development environment such as Matlab®.

Speaker recognition is the process of automatically recognizing who is speaking based on unique characteristics contained in speech waves. This technique makes it possible to use a speaker's voice to verify their identity and control access to services such as voice dialing, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers [2]. At the highest level, all speaker recognition systems contain two modules: feature extraction and feature matching. Feature extraction is the process of extracting unique information from voice data that can later be used to identify the speaker. Feature matching is the procedure of identifying the speaker by comparing the extracted voice data with a database of known speakers and making a suitable decision based on this comparison. There are many techniques used to parametrically represent a voice signal for speaker recognition tasks.
These techniques include Linear Prediction Coding (LPC), Auditory Spectrum-Based Speech Feature (ASSF), and the Mel-Frequency Cepstrum Coefficients (MFCC). The MFCC technique was used in this project. It is based on the known variation of the human ear's critical bandwidth with frequency: its filters are spaced linearly at low frequencies and logarithmically at high frequencies to capture the important characteristics of speech [2]. The MFCC process is subdivided into five phases or blocks. In the frame blocking section, the speech waveform is divided into frames of approximately 30 milliseconds. The windowing block minimizes the discontinuities of the signal by tapering the beginning and end of each frame toward zero. The FFT block converts each frame from the time domain to the frequency domain. In the Mel-frequency wrapping block, the signal is mapped onto the Mel spectrum to mimic human hearing. Studies have shown
that human hearing does not follow a linear scale but rather the Mel scale, which is approximately linear below 1000 Hz and logarithmic above 1000 Hz. The Mel frequency corresponding to a linear frequency f (in Hz) is given by:

Mel(f) = 2595 * log10(1 + f / 700)

In the final step, the log Mel spectrum is converted back to the time domain using the discrete cosine transform.
The resultant matrices are referred to as the Mel-Frequency Cepstrum Coefficients [2]. This spectrum provides
a fairly simple but unique representation of the spectral
properties of the voice signal which is the key for representing
and recognizing the voice characteristics of the speaker.
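The five blocks above can be sketched end to end. The project itself used Matlab® M-files (mfcc.m); the NumPy version below is only an illustrative approximation, and the frame overlap, filter count, and FFT size are assumed values, not taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_sketch(signal, fs, frame_ms=30, n_filters=20, n_coeffs=13, nfft=512):
    """Five MFCC blocks: frame blocking, windowing, FFT, Mel wrapping, cepstrum."""
    frame_len = int(fs * frame_ms / 1000)      # ~30 ms frames, as in the text
    hop = frame_len // 2                       # assumed 50% overlap between frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)             # tapers frame edges toward zero

    # Triangular filterbank spaced evenly on the Mel scale from 0 Hz to fs/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        for k in range(bins[m - 1], bins[m]):          # rising edge
            fbank[m - 1, k] = (k - bins[m - 1]) / max(bins[m] - bins[m - 1], 1)
        for k in range(bins[m], bins[m + 1]):          # falling edge
            fbank[m - 1, k] = (bins[m + 1] - k) / max(bins[m + 1] - bins[m], 1)

    # DCT-II matrix converts log Mel energies back to the time (cepstral) domain
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))

    coeffs = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame, nfft)) ** 2  # frame power spectrum
        mel_energy = np.log(fbank @ power + 1e-10)     # Mel-warped log energies
        coeffs.append(dct @ mel_energy)
    return np.array(coeffs)                            # one row of MFCCs per frame
```

Calling mfcc_sketch on a recorded utterance yields the feature matrix, the "voice fingerprint," that the matching stage compares against the reference database.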
After the speech data has been transformed into MFCC
matrices, one must apply one of the widely used pattern
recognition techniques to build speaker recognition models
using the data attained in the feature extraction phase and then
subsequently identify any sequence uttered by an unknown
speaker. The techniques used to discern any similarities and
differences in the MFCC matrices include, but are not limited
to, Dynamic Time Warping (DTW) and Hidden Markov
Modeling (HMM).

The reference voiceprint with the lowest distance from the input pattern is deemed the identity of the unknown speaker. The best match, i.e., the smallest distance measured between a reference feature matrix and the unknown matrix, is found using DTW. There are two underlying concepts that comprise the
DTW procedure:
1. Features: the information in each signal has to be
represented in some manner (MFCC).
2. Distances: some form of distance metric has to be
used in order to obtain a match.
Within the distance metric concept, two additional notions exist:
1. Local: a computational difference between a feature of
one signal and a feature of another signal.
2. Global: an overall computational difference between
the entire signal and another signal.
The distance metric most commonly used within DTW is
the Chebychev distance measure. The Chebychev distance
between two points is the maximum distance between the
points in any single dimension. The distance between points
X = (X1, X2, ...) and Y = (Y1, Y2, ...) is computed
using the formula:

D(X, Y) = max over i of | Xi - Yi |

where Xi and Yi are the values of the ith variable at points X
and Y, respectively.

Speech is a time-dependent process. Hence, utterances
of the same word will have different durations, and utterances
of the same word with the same duration will differ in the
middle, due to different parts of the words being spoken at
different rates. To obtain a global distance between two
speech patterns a time alignment must be performed. The best
matching template is the one for which there is the lowest
distance path aligning the input pattern to the template. A
simple global distance score for a path is simply the sum of
local distances that make up the path [4].
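A minimal sketch of this matching step, assuming MFCC feature matrices as input: the project used the downloadable dtw.m, so the Python version below only illustrates the local/global distance idea with the Chebychev metric described above, and the function names are illustrative.

```python
import numpy as np

def chebychev(x, y):
    """Local distance: maximum difference in any single dimension."""
    return np.max(np.abs(x - y))

def dtw_distance(template, query):
    """Global distance: cheapest time-aligned path through the local-distance grid.

    template, query: arrays of shape (n_frames, n_features), e.g. MFCC matrices.
    """
    n, m = len(template), len(query)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = chebychev(template[i - 1], query[j - 1])
            # Extend the best of the three admissible predecessor cells
            D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def identify(references, query):
    """Return the tag of the reference voiceprint with the lowest DTW distance."""
    return min(references, key=lambda tag: dtw_distance(references[tag], query))
```

The global distance D[n, m] is exactly the sum of local distances along the cheapest alignment path, so the reference with the smallest value is returned as the best match.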
A speaker's voice patterns may exhibit a substantial degree
of variance: identical sentences, uttered by the same speaker
but at different times, result in similar, yet different, sequences
of MFCC matrices. The purpose of speaker modeling is to
build a model that can cope with speaker variation in feature
space and to create a fairly unique representation of the
speaker's characteristics. Stochastic modeling enables us tomodel the speaker's characteristics by describing the speech
production as a stochastic process. A stochastic approach for
modeling the speech and the speaker is the Hidden Markov
Model technique. A first order Markov model is a finite set of
states, where transitions between states are modeled by a
transition probability matrix A, assuming that the probability
of being in state Si at time t only depends on the state occupied
at time t -1. If the state probability vector π is known for t = 0,
the probability vector for the next observation moments can be
computed recursively by:
πt = A * πt-1

πt = A^t * π0
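The recursion can be verified numerically. The 2-state transition matrix below is an illustrative assumption, written column-stochastic so that πt = A * πt-1 holds as stated:

```python
import numpy as np

# Illustrative 2-state model. A is column-stochastic: A[i, j] = P(Si at t | Sj at t-1),
# so each column sums to one and the recursion reads pi_t = A * pi_{t-1}.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
pi0 = np.array([1.0, 0.0])      # the process is known to start in state S1

# Recursive form: pi_t = A * pi_{t-1}, applied for three observation moments
pi = pi0.copy()
for _ in range(3):
    pi = A @ pi

# Closed form: pi_t = A^t * pi_0 gives the same probability vector
assert np.allclose(pi, np.linalg.matrix_power(A, 3) @ pi0)
```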
When all possible transitions from Si to Sj are allowed
(full transition matrix) the model is called ergodic. When only
state sequences in chronological order are allowed (upper-
triangular transition matrix) the model is called left-to-right.
The model is called a hidden Markov model in the sense that the state sequence cannot be
observed directly (it is hidden); only the observation
sequence is known. A Probability Density Function (PDF)
describes the probability Pi of the observation vector given
that the process is in state Si.
Application of HMMs on acoustical observations is based
on the following assumptions:
1. Each observation belongs to a finite set of N states:
{S1, S2, S3, … , SN}
2. The probability of being in state Si at time t only
depends on the state occupied at time (t-1),
3. All observations are mutually independent,
4. The probability density functions are assumed to be
of a multivariate Gaussian distribution. In principle,
each possible distribution can be modeled by a
mixture of Gaussians, but the number of mixtures is
limited by the size of the available training set.
Although the third assumption in particular is not strictly
true when the observations are acoustic vectors, hidden
Markov modeling is one of the most popular and effective
modeling techniques for acoustic time series [6].
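Assumption 4 can be made concrete with a small sketch that evaluates the log-density of one observation vector under a mixture of Gaussians. This is illustrative only; the project's model.m is not reproduced here, and the function name is an assumption.

```python
import numpy as np

def gmm_logpdf(x, weights, means, covs):
    """Log-density of one observation vector x under a Gaussian mixture.

    weights: (M,)      mixture weights summing to one
    means:   (M, d)    one mean vector per mixture component
    covs:    (M, d, d) one covariance matrix per mixture component
    """
    d = len(x)
    comps = []
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)             # log |cov|, numerically safe
        quad = diff @ np.linalg.solve(cov, diff)       # Mahalanobis term
        comps.append(np.log(w) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + quad))
    return np.logaddexp.reduce(comps)                  # log of the weighted sum
```

Each state of the HMM would carry one such mixture, which is why the number of mixtures, and hence the number of parameters to train, is limited by the size of the training set.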
The process of hidden Markov speaker model training is defined as the determination of the optimal model parameters
for the identification of an unknown speaker, given a set of training
vectors from a particular speaker stored in a reference library.
Two approaches exist to train an HMM, the Baum-Welch
algorithm and the Viterbi algorithm. The latter approach is
used in this project. Both algorithms use the Maximum
Likelihood (ML) criterion. The Baum-Welch algorithm
evaluates all possible state sequences, which could have
produced the training observations, while the Viterbi
algorithm only considers the most probable state sequence.
The most general implementation of these algorithms updates
all parameters of the HMM, including covariances, means,
weighting coefficients, and transition probabilities [6].
The complexity of a speaker model is an important factor
which determines the performance of a model, where the
optimal complexity is dependent on the amount of training
data. The complexity of an HMM is described by the number of model parameters, which include one covariance matrix,
one mean vector, and one weighting coefficient per mixture
per state plus a transition matrix.
The HMM algorithm is a very involved process. In order
to implement a speaker recognition system using HMM the
following steps must be taken:
1. For each reference word, a Markov model must be
built using parameters that optimize the observations
of the word.
2. A calculation of model likelihoods for all possible
reference models against the unknown model must be
completed using the Viterbi algorithm followed by the
selection of the reference with the highest model likelihood value.
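The likelihood-comparison step can be sketched as follows, again in illustrative Python rather than the project's Matlab® (model.m); the model-scoring interface is an assumption.

```python
import numpy as np

def viterbi_loglik(log_A, log_pi, log_B):
    """Log-likelihood of the single most probable state sequence (Viterbi score).

    log_A:  (N, N) log transition probabilities, log_A[i, j] = log P(Sj | Si)
    log_pi: (N,)   log initial state probabilities
    log_B:  (T, N) per-frame log observation likelihoods log P(o_t | Sj)
    """
    T, N = log_B.shape
    delta = log_pi + log_B[0]      # best score ending in each state at t = 0
    for t in range(1, T):
        # Extend the best-scoring predecessor into each destination state
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
    return float(np.max(delta))

def identify(scored_models):
    """scored_models: tag -> (log_A, log_pi, log_B) evaluated on the unknown
    utterance; return the tag with the highest Viterbi likelihood."""
    return max(scored_models, key=lambda tag: viterbi_loglik(*scored_models[tag]))
```

Unlike Baum-Welch, which sums over all state sequences, the Viterbi score keeps only the single most probable path, which is why it is the cheaper of the two criteria.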
METHODOLOGY USED
During the feature extraction phase a database of feature
matrices or “voice fingerprints” was created in order to be
tested against the feature matching phase. Initially, a simple
speech database provided by the Swiss Federal Institute of
Technology [2] was used to design and test the system. A
voice fingerprint represents the most basic, yet unique,
features of a particular speaker's voice. As mentioned earlier,
the Mel-Frequency Cepstrum Coefficient algorithm was
implemented in order to complete the feature extraction task. By examining each block of the MFCC processor, a Matlab®
M-file was created. Some of the MFCC processes, such as
frame blocking, windowing, and the Fast Fourier Transform,
are available for download from the Matlab® program
repository [8]. The MFCC (mfcc.m) front-end used for this
project is provided by the Interval Research Corporation and is
available for download [9].
One of the primary concerns, when designing this system,
was to determine which feature matching algorithm, Dynamic
Time Warping or Hidden Markov Modeling, produces the
most representative speaker recognition system. The Dynamic
Time Warping (dtw.m) source code, like the various MFCC
components, is available for download from the Mathworks
repository. The Hidden Markov Modeling source code
(model.m) is available in [10]. Although the two major
components of a representative speaker system, feature
extraction (mfcc.m) and feature matching (model.m, dtw.m),
were available, local programming and other files were needed
to satisfy the feature matching portion and the remaining
requirements of the project. A Matlab®-based program
complete with graphical user interface was developed that
could record sound files in real-time and then store them in a
previously specified location under a unique identifier.
Another Matlab® file was developed that could take the
previously recorded voice files, stored as *.wav files, create
voice fingerprints from them using the mfcc code, and then
output each fingerprint to a common voice database under its
unique identifier. The aforementioned task was
accomplished using the built-in Matlab® function xlswrite,
which creates each voice fingerprint as a separate Excel
worksheet in a common Excel workbook. As a requirement for the feature matching portion of the system, another Matlab®-based file was designed that could record an unknown voice
file and after completing the feature extraction process it
would compare the newly attained voice fingerprint with the
references stored in the voice database.
In the case of the DTW algorithm, the distance metrics
are computed between the feature vector of the unknown
speaker and those of the speakers stored in the database and
the reference with the smallest distance metric is returned. In
the case of the HMM algorithm, Markov models are created
for both the feature matrices stored in the reference database
and that of the unknown speaker and the reference with the
highest likelihood value stored in the reference database is returned.
After a fairly representative speaker recognition system was
constructed using the speech data provided by the Swiss
Federal Institute, a DSP headset provided by Plantronics®
[11] was used so more complex voice commands could be
recorded locally.
PROGRAM STRUCTURE
Feature Extraction:

After opening Matlab® and setting the speaker recognition
folder as the active directory, typing “main” in the command
window initializes the prompt as represented in Figure I.
Selecting the training option from the Main Menu leads the user to the prompt depicted in Figure II. After choosing
either option of Creating a New Speaker Database or Add
Speaker to Existing Database, the user is then asked to enter a
name or “tag” to identify the reference voiceprint to be created
as is shown in Figure III.
FIGURE I:
MAIN MENU OF SPEAKER RECOGNITION SYSTEM
The recording status bar informs the user that the system
is currently recording the voice data. After the data has been
successfully created, the user receives a successful completion
prompt.
FIGURE II:
USER OPTIONS FOR SPEAKER DATABASE
Feature Matching:
Selecting the testing option from the Main Menu lets the
user choose between Dynamic Time Warping and Hidden
Markov Modeling, as shown in Figure IV. After choosing either
option, the user is then asked to press “OK” to initialize the
voice recording process.
FIGURE III:
USER PROMPT FOR VOICE PRINT IDENTIFICATION
After the unknown voice data has been successfully
recorded, the "tag" for the best match will be shown on the
computer display. Also, spectrograms of the unknown voice
data and the best matching voice data associated with the tag
will be displayed on screen. Figure V shows a sample result.
FIGURE IV:
USER OPTIONS WINDOW
Operation:
The Clear Database option on the Main Menu deletes the
*.wav, *.xls, and *.mat files created as a result of a voice
recognition process. The file entitled “acoustic_data.xls” is the
common voice database and “test_data.xls” is the voice print
of the voice file to be recognized. The Hidden Markov
Modeling and Dynamic Time Warping files each contain a
threshold parameter which controls the sensitivity of the
system.
FIGURE V:
RESULTS OF VOICE RECOGNITION PROCESS
CONCLUSION
A fairly representative speaker recognition system has been
constructed using the speech data provided by the Swiss
Federal Institute and simple utterances recorded locally using
Matlab® software. Through this project and trial-and-error
experience, the student was able to build a better and more
powerful system. The knowledge he gained through this
project will help him in future research and development projects.
ACKNOWLEDGEMENT
This work was supported by NASA Langley Research Center
Grant # NOC1-03033 under the Chesapeake Information
Based Aeronautic Consortium (CIBAC).
REFERENCES
[1] Ambardar, A., Analog and Digital Signal Processing, Brooks/Cole Publishing Company, 1999.

[2] Do, M. N., "An Automatic Speaker Recognition System," Digital Signal Processing Mini-Project, 14 June 2005. <http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project.html>

[3] "ECE341 So You Want to Try and do Speech Recognition," A Simple Speech Recognition Algorithm, 15 April 2003; accessed 1 July 2005.