Session T4J, 9th International Conference on Engineering Education, San Juan, PR, July 23-28, 2006

Design of Matlab®-Based Automatic Speaker Recognition Systems

Jamel Price and Ali Eydgahi
Department of Engineering and Aviation Sciences, University of Maryland Eastern Shore, Princess Anne, MD 21853
[email protected]

Abstract - This paper presents the design of an automatic speaker recognition system in the Matlab® environment, developed as part of a NASA-sponsored undergraduate research project. The project represents one of the many design and development activities that the University of Maryland Eastern Shore offers to undergraduate students as part of its research experience in Science, Technology, Engineering and Mathematics. The goal of this project was to determine whether Matlab® can be used to construct a simple, yet complete and representative, automatic speaker recognition system. The issues considered were: 1) Can Matlab® be used effectively to complete the aforementioned task? 2) Which speech recognition algorithm, Dynamic Time Warping or Hidden Markov Modeling, produces the most representative speaker recognition system? And, if the first two issues yielded promising results, 3) Can a Matlab®-based speaker recognition system be ported to a real-world environment, such as an airplane cockpit, for recording and performing complex voice commands?

Index Terms - Chebychev Distance, Digital Signal Processing, Dynamic Time Warping Algorithm, Hidden Markov Modeling Algorithm, Speaker Recognition System, Viterbi Algorithm.

INTRODUCTION

Digital signal processing (DSP) is the processing of signals by digital means. In many cases the signal is initially in analog form, i.e., an electrical voltage or current, and one of the processing techniques produces a discrete or digital output analogous to the analog signal. Signals commonly need to be processed in a variety of ways. Today, the filtering of signals to improve signal quality or to extract important information is done by digital signal processors using DSP techniques rather than by analog electronics [1]. Although the mathematical theory underlying DSP techniques such as the Fast Fourier Transform (FFT), digital filter design, and signal compression can be fairly complex, the numerical operations required to implement these techniques are in fact very simple [1]. Although the FFT is not optimal for many filtering applications, it is a very commonly used technique for analyzing and filtering digital signals. The digital signal processor is a programmable microprocessor device with its own native instruction codes that is capable of carrying out millions of floating point operations per second. By coding the various digital signal processes in software, a digital signal processor can be replicated in a data-manipulation and development environment such as Matlab®.

Speaker recognition is the process of automatically recognizing who is speaking based on unique characteristics contained in speech waves. This technique makes it possible to use a speaker's voice to verify their identity and to control access to services such as voice dialing, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers [2]. At the highest level, all speaker recognition systems contain two modules, feature extraction and feature matching.

Feature extraction is the process of extracting unique information from voice data that can later be used to identify the speaker. Feature matching is the actual procedure of identifying the speaker by comparing the extracted voice data with a database of known speakers, on the basis of which a suitable decision is made. There are many techniques used to parametrically represent a voice signal for speaker recognition tasks. These techniques include Linear Prediction Coding (LPC), Auditory Spectrum-Based Speech Feature (ASSF), and the Mel-Frequency Cepstrum Coefficients (MFCC). The MFCC technique was used in this project. The MFCC technique is based on the known variation of the human ear's critical bandwidth frequencies, with filters that are spaced linearly at low frequencies and logarithmically at high frequencies to capture the important characteristics of speech [2].

The MFCC process is subdivided into five phases or blocks. In the frame blocking section, the speech waveform is divided into frames of approximately 30 milliseconds. The windowing block minimizes the discontinuities of the signal by tapering the beginning and end of each frame to zero. The FFT block converts each frame from the time domain to the frequency domain. In the Mel-frequency wrapping block, the signal is plotted against the Mel-spectrum to mimic human hearing.
Studies have shown that human hearing does not follow a linear scale but rather the Mel-spectrum scale, which is approximately linear below 1 kHz and logarithmic above 1 kHz. The Mel scale is related to the linear frequency f (in Hz) by:

Mel(f) = 2595 * log10(1 + f/700)

In the final step, the log Mel-spectrum is converted back to the time domain, and the resultant matrices are referred to as the Mel-Frequency Cepstrum Coefficients [2]. This spectrum provides a fairly simple but unique representation of the spectral properties of the voice signal, which is the key to representing and recognizing the voice characteristics of the speaker.
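The actual front-end used in this project is the mfcc.m file from the Auditory Toolbox referenced later in this paper [9]. As a rough illustration of the five blocks described above, the following Matlab® sketch (not the project code; the frame length, filterbank size, and number of coefficients are assumed values) computes a simplified MFCC matrix from a speech vector:

% Simplified MFCC sketch (illustrative only; the project used mfcc.m from [9]).
% Assumed parameters: 30 ms non-overlapping frames, 20 mel filters, 12 coefficients.
function C = simple_mfcc(x, fs)
x  = x(:);                                     % speech samples as a column vector
N  = round(0.030*fs);                          % samples per 30 ms frame
M  = 20;  L = 12;                              % number of mel filters / cepstral coefficients
w  = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));     % Hamming window tapers frame edges toward zero
nFrames = floor(length(x)/N);
NFFT = 2^nextpow2(N);
% Triangular filters spaced evenly on the mel scale, Mel(f) = 2595*log10(1 + f/700)
mel  = linspace(0, 2595*log10(1 + (fs/2)/700), M+2);
fHz  = 700*(10.^(mel/2595) - 1);
bin  = round(fHz/(fs/2)*(NFFT/2)) + 1;
H = zeros(M, NFFT/2 + 1);
for m = 1:M
    H(m, bin(m):bin(m+1))   = linspace(0, 1, bin(m+1) - bin(m) + 1);   % rising edge
    H(m, bin(m+1):bin(m+2)) = linspace(1, 0, bin(m+2) - bin(m+1) + 1); % falling edge
end
D = cos(pi*(0:L-1)'*((0:M-1) + 0.5)/M);        % DCT matrix: log mel spectrum -> cepstrum
C = zeros(L, nFrames);
for k = 1:nFrames
    frame  = x((k-1)*N+1 : k*N) .* w;          % frame blocking + windowing
    S      = abs(fft(frame, NFFT));            % FFT block (magnitude spectrum)
    E      = H * S(1:NFFT/2+1);                % mel-frequency wrapping
    C(:,k) = D * log(E + eps);                 % final step: back to the (cepstral) time domain
end
end

With a recording loaded by, for example, [x, fs] = wavread('utterance.wav'), the call C = simple_mfcc(x, fs) returns one column of coefficients per 30 ms frame.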

After the speech data has been transformed into MFCC matrices, one must apply one of the widely used pattern recognition techniques to build speaker recognition models from the data attained in the feature extraction phase and subsequently identify any sequence uttered by an unknown speaker. The techniques used to discern the similarities and differences between MFCC matrices include, but are not limited to, Dynamic Time Warping (DTW) and Hidden Markov Modeling (HMM).

The reference voiceprint with the lowest distance measured from the input pattern is deemed the identity of the unknown speaker. The best match, i.e., the smallest distance measured between a reference feature matrix and the unknown matrix, is found using DTW. There are two underlying concepts which comprise the DTW procedure:

1. Features: the information in each signal has to be represented in some manner (here, MFCC).
2. Distances: some form of distance metric has to be used in order to obtain a match.

Within the distance metric, two additional concepts exist:

1. Local: a computational difference between a feature of one signal and a feature of another signal.
2. Global: an overall computational difference between an entire signal and another signal.

The distance metric most commonly used within DTW is the Chebychev distance measure. The Chebychev distance between two points is the maximum distance between the points in any single dimension. The distance between points X = (X1, X2, ...) and Y = (Y1, Y2, ...) is computed using the formula:

Max | Xi - Yi |

where Xi and Yi are the values of the ith variable at points X and Y, respectively.
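As a concrete example, the local Chebychev distance between two MFCC feature vectors, and the full matrix of local distances between two MFCC matrices, can be computed in Matlab® as follows (a sketch with illustrative variable names; A and B are assumed to hold one MFCC column per frame):

% Local Chebychev distance between two feature vectors x and y
d_local = max(abs(x - y));

% Matrix of local distances between every frame of A and every frame of B
D = zeros(size(A,2), size(B,2));
for i = 1:size(A,2)
    for j = 1:size(B,2)
        D(i,j) = max(abs(A(:,i) - B(:,j)));    % maximum difference over any single dimension
    end
end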

Speech is a time-dependent process. Hence, utterances of the same word will have different durations, and utterances of the same word with the same duration will differ in the middle, due to different parts of the word being spoken at different rates. To obtain a global distance between two speech patterns, a time alignment must be performed. The best matching template is the one for which there is the lowest-distance path aligning the input pattern to the template. A simple global distance score for a path is simply the sum of the local distances that make up the path [4].
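The project used the dtw.m file from the Mathworks repository [8] for this step; the following minimal dynamic programming sketch (not the actual dtw.m code) shows how a global distance can be accumulated from the local distance matrix D of the previous sketch:

% Global DTW distance: sum of local distances along the best warping path
[n, m] = size(D);
G = inf(n, m);                                  % accumulated (global) distance
G(1,1) = D(1,1);
for i = 1:n
    for j = 1:m
        if i == 1 && j == 1, continue; end
        best = inf;
        if i > 1,           best = min(best, G(i-1, j));   end
        if j > 1,           best = min(best, G(i,   j-1)); end
        if i > 1 && j > 1,  best = min(best, G(i-1, j-1)); end
        G(i,j) = D(i,j) + best;                 % extend the cheapest neighbouring path
    end
end
globalDist = G(n, m);                           % the template with the smallest value wins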

A speaker's voice patterns may exhibit a substantial degree of variance: identical sentences, uttered by the same speaker but at different times, result in similar, yet different, sequences of MFCC matrices. The purpose of speaker modeling is to build a model that can cope with speaker variation in feature space and to create a fairly unique representation of the speaker's characteristics. Stochastic modeling enables us to model the speaker's characteristics by describing the speech production as a stochastic process. A stochastic approach for modeling the speech and the speaker is the Hidden Markov Model technique. A first-order Markov model is a finite set of states, where transitions between states are modeled by a transition probability matrix A, assuming that the probability of being in state Si at time t only depends on the state occupied at time t-1. If the state probability vector π is known for t = 0, the probability vector for the next observation moments can be computed recursively by:

πt = A * πt-1, and hence πt = A^t * π0
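In Matlab® form, and using an assumed three-state model purely for illustration, this recursion can be written as:

% Propagating the state probability vector of a first-order Markov model
P   = [0.8 0.2 0.0;                 % assumed transition probabilities; row i holds the
       0.0 0.7 0.3;                 %  probabilities of moving from state Si to each state
       0.0 0.0 1.0];
A   = P';                           % transposed so the column vector pi propagates as pi_t = A*pi_{t-1}
pi0 = [1; 0; 0];                    % the process starts in state S1
pit = pi0;
for t = 1:10
    pit = A * pit;                  % pi_t = A * pi_{t-1}
end
% Equivalently, pit = A^10 * pi0.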

When all possible transitions from Si to Sj are allowed (full transition matrix), the model is called ergodic. When only state sequences in chronological order are allowed (upper-triangular transition matrix), the model is called left-to-right. The Markov model is hidden in the sense that the state sequence cannot be observed directly and only the observation sequence is known. A probability density function (PDF) describes the probability Pi of the observation vector given that the process is in state Si.

Application of HMMs to acoustical observations is based on the following assumptions:

1. Each observation belongs to a finite set of N states: {S1, S2, S3, ..., SN}.
2. The probability of being in state Si at time t only depends on the observed state at time (t-1).
3. All observations are mutually independent.
4. The probability density functions are assumed to be of a multivariate Gaussian distribution. In principle, each possible distribution can be modeled by a mixture of Gaussians, but the number of mixtures is limited by the size of the available training set.

Although the third assumption in particular is not strictly true when the observations are acoustic vectors, hidden Markov modeling is one of the most popular and effective modeling techniques for acoustic time series [6].
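As an illustration of assumption 4, the log-likelihood of a single observation vector for one state can be evaluated with a few lines of Matlab® (a sketch assuming one diagonal-covariance Gaussian per state; the observation, mean, and variance values shown are made up):

% Log-likelihood of an observation vector o under a state's multivariate Gaussian PDF
o      = [1.2; -0.3];               % example observation (e.g., part of an MFCC frame)
mu     = [1.0;  0.0];               % state mean vector
sigma2 = [0.5;  0.8];               % state variances (diagonal covariance)
d      = length(o);
logP   = -0.5*( d*log(2*pi) + sum(log(sigma2)) + sum(((o - mu).^2)./sigma2) );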

The process of hidden Markov speaker model training is defined as the determination of the optimal model parameters for the identification of an unknown speaker, given a set of training vectors from a particular speaker stored in the reference library. Two approaches exist to train an HMM, the Baum-Welch algorithm and the Viterbi algorithm; the latter approach is used in this project. Both algorithms use the Maximum Likelihood (ML) criterion. The Baum-Welch algorithm evaluates all possible state sequences which could have produced the training observations, while the Viterbi algorithm only considers the most probable state sequence. The most general implementation of these algorithms updates all parameters of the HMM, including covariances, means, weighting coefficients, and transition probabilities [6].
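The HMM code used in the project is the package referenced in [10]; the sketch below is only meant to illustrate the Viterbi scoring idea. It assumes a matrix logB of per-state observation log-likelihoods (such as those computed with the Gaussian sketch above) and returns the log-likelihood of the most probable state sequence:

% Viterbi score of an observation sequence against one HMM (log domain)
% prior : N x 1 initial state probabilities
% A     : N x N transition matrix, A(i,j) = P(next state Sj | current state Si)
% logB  : N x T observation log-likelihoods, logB(j,t) = log p(o_t | Sj)
function logLik = viterbi_loglik(prior, A, logB)
[N, T] = size(logB);
logA   = log(A + eps);                          % log domain avoids numerical underflow
delta  = log(prior(:) + eps) + logB(:,1);       % best partial score ending in each state at t = 1
for t = 2:T
    newDelta = zeros(N, 1);
    for j = 1:N
        newDelta(j) = max(delta + logA(:,j)) + logB(j,t);
    end
    delta = newDelta;
end
logLik = max(delta);                            % score of the most probable state sequence
end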

The complexity of a speaker model is an important factor which determines the performance of the model, where the optimal complexity is dependent on the amount of training data. The complexity of an HMM is described by the number of model parameters, which include one covariance matrix, one mean vector, and one weighting coefficient per mixture per state, plus a transition matrix.

The HMM algorithm is a very involved process. In order to implement a speaker recognition system using HMM, the following steps must be taken:

1. For each reference word, a Markov model must be built using parameters that optimize the observations of the word.
2. A calculation of model likelihoods for all possible reference models against the unknown model must be completed using the Viterbi algorithm, followed by the selection of the reference with the highest model likelihood value (a minimal sketch of this selection step is given after the list).
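Under the same illustrative assumptions, the selection in step 2 might look like the following sketch (refModels, refTags, stateLogLik, and unknownMFCC are hypothetical names for the stored reference models, their identifiers, a helper that evaluates the per-state observation log-likelihoods, and the feature matrix of the unknown utterance):

% Score the unknown utterance against every reference HMM and keep the best
scores = zeros(1, numel(refModels));
for k = 1:numel(refModels)
    m         = refModels{k};
    logB      = stateLogLik(m, unknownMFCC);    % hypothetical helper returning log p(o_t | Si)
    scores(k) = viterbi_loglik(m.prior, m.A, logB);
end
[bestScore, best] = max(scores);                % reference with the highest model likelihood
bestTag = refTags{best};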

METHODOLOGY USED

During the feature extraction phase, a database of feature matrices or “voice fingerprints” was created in order to be tested against during the feature matching phase. Initially, a simple speech database provided by the Swiss Federal Institute of Technology [2] was used to design and test the system. A voice fingerprint represents the most basic, yet unique, features of a particular speaker's voice. As mentioned earlier, the Mel-Frequency Cepstrum Coefficient algorithm was implemented in order to complete the feature extraction task. By examining each block of the MFCC processor, a Matlab® M-file was created. Some of the MFCC processes, such as frame blocking, windowing, and the Fast Fourier Transform, are available for download from the Matlab® program repository [8]. The MFCC (mfcc.m) front-end used for this project is provided by the Interval Research Corporation and is available for download [9].

One of the primary concerns when designing this system was to determine which feature matching algorithm, Dynamic Time Warping or Hidden Markov Modeling, produces the most representative speaker recognition system. The Dynamic Time Warping (dtw.m) source code, like the various MFCC components, is available for download from the Mathworks repository. The Hidden Markov Modeling source code (model.m) is available in [10]. Although the two major components of a representative speaker recognition system, feature extraction (mfcc.m) and feature matching (model.m, dtw.m), were available, local programming and other files were needed to satisfy the feature matching portion and the remaining requirements of the project. A Matlab®-based program complete with a graphical user interface was developed that could record sound files in real time and then store them in a previously specified location under a unique identifier. Another Matlab® file was also developed that could use the previously recorded voice files, stored as *.wav files, to create voice fingerprints using the mfcc code and then output each recorded file to a common voice database according to its unique identifier. The latter task was accomplished using the built-in Matlab® function xlswrite, which creates each voice fingerprint as a separate Excel worksheet in a common Excel workbook. As a requirement for the feature matching portion of the system, another Matlab®-based file was designed that could record an unknown voice file and, after completing the feature extraction process, compare the newly obtained voice fingerprint with the references stored in the voice database.
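A minimal sketch of how a recorded utterance can be turned into a fingerprint and appended to the common workbook is shown below (file names follow the Operation subsection; simple_mfcc is the illustrative front-end sketched earlier, standing in for the actual mfcc.m, and the tag value is made up):

% Create a voice fingerprint from a stored *.wav file and add it to the database
tag         = 'speaker01';                           % unique identifier entered by the user
[x, fs]     = wavread([tag '.wav']);                 % previously recorded voice file
fingerprint = simple_mfcc(x, fs);                    % MFCC feature matrix (voice fingerprint)
xlswrite('acoustic_data.xls', fingerprint, tag);     % one worksheet per identifier in the workbook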

In the case of the DTW algorithm, the distance metrics are computed between the feature vectors of the unknown speaker and those of the speakers stored in the database, and the reference with the smallest distance metric is returned. In the case of the HMM algorithm, Markov models are created both for the feature matrices stored in the reference database and for that of the unknown speaker, and the reference with the highest likelihood value stored in the reference database is returned.

After a fairly representative speaker recognition system was constructed using the speech data provided by the Swiss Federal Institute, a DSP headset provided by Plantronics® [11] was used so that more complex voice commands could be recorded locally.

PROGRAM STRUCTURE

Feature Extraction:

After opening Matlab® and setting the speaker recognition folder as the active directory, typing “main” in the command window initializes the prompt shown in Figure I. Selecting the training option from the Main Menu leads the user to the prompt depicted in Figure II. After choosing either the Create a New Speaker Database or the Add Speaker to Existing Database option, the user is then asked to enter a name or “tag” to identify the reference voiceprint to be created, as shown in Figure III.

FIGURE I: MAIN MENU OF SPEAKER RECOGNITION SYSTEM

The recording status bar informs the user that the system is currently recording the voice data. After the data has been successfully created, the user receives a successful completion prompt.

FIGURE II: USER OPTIONS FOR SPEAKER DATABASE

Feature Matching:

The testing option from the Main Menu gives the user the option of using Dynamic Time Warping or Hidden Markov Modeling, as shown in Figure IV. After choosing either option, the user is then asked to press “OK” to initialize the voice recording process.

FIGURE III: USER PROMPT FOR VOICE PRINT IDENTIFICATION

After the unknown voice data has been successfully recorded, the “tag” for the best match will be shown on the computer display. Also, spectrograms of the unknown voice data and the best matching voice data associated with that tag will be displayed on screen. Figure V shows a sample result.

FIGURE IV: USER OPTIONS WINDOW

Operation:

The Clear Database option on the Main Menu deletes the *.wav, *.xls, and *.mat files created as a result of a voice recognition process. The file entitled “acoustic_data.xls” is the common voice database and “test_data.xls” is the voice print of the voice file to be recognized. The Hidden Markov Modeling and Dynamic Time Warping files each contain a threshold parameter which controls the sensitivity of the system.
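As an illustration of how such a threshold might be applied in the DTW case (a sketch; the threshold value, variable names, and messages are illustrative rather than the project's actual code):

% Reject the best match if its global DTW distance exceeds the sensitivity threshold
threshold = 40;                                  % assumed value, tuned for the system
if bestGlobalDist <= threshold
    fprintf('Best match: %s\n', bestTag);
else
    fprintf('No sufficiently close match found.\n');
end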

FIGURE V: RESULTS OF VOICE RECOGNITION PROCESS

CONCLUSION

A fairly representative speaker recognition system has been constructed using the speech data provided by the Swiss Federal Institute and simple utterances recorded locally using Matlab® software. Through this project, and with experience gained by trial and error, the student was able to build a better and more powerful system. The knowledge he gained through this project will help him in future research and development projects.

ACKNOWLEDGEMENT

This work was supported by NASA Langley Research Center Grant # NOC1-03033 under the Chesapeake Information Based Aeronautic Consortium (CIBAC).

REFERENCES

[1] Ambardar, Ashok. Analog and Digital Signal Processing. Brooks/Cole Publishing Company, 1999.

[2] Do, Minh N. "An Automatic Speaker Recognition System." Digital Signal Processing Mini-Project. 14 June 2005. <http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project.html>

[3] "ECE341 So You Want to Try and do Speech Recognition." A Simple Speech Recognition Algorithm. 15 April 2003. 1 July 2005. <http://www.eecg.toronto.edu/~aamodt/ece341/speech-recognition>

[4] "Isolated Word, Speech Recognition using Dynamic Time Warping." Dynamic Time Warping. 14 June 2005. <http://www.cnel.ufl.edu/~kkale/dtw.html>

[5] Bilmes, J. "A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models." Technical Report, University of Berkeley, ICSI-TR-97-021, April 1998. <http://crow.ee.washington.edu/people/bulyko/papers/em.pdf>

[6] Koolwaaij, Johan. "iSpeak - Consultancy in Speech Technology," Fundamentals of HMM Based Speaker Verification, May 2001. <www.ispeak.nl/prfhtm/node13.html>

[7] "Speech Recognition by Dynamic Time Warping." 20 April 1998. 06 July 2005. <http://www.dcs.shef.ac.uk/~stu/com326/>

[8] The Mathworks - MATLAB and SIMULINK for Technical Computing. 10 June 2005. <http://www.mathworks.com>

[9] Slaney, Malcolm. "Auditory Toolbox Technical Report #1998-010." Auditory Toolbox. 14 June 2005. <http://rvl4.ecn.purdue.edu/~malcolm/interval/1998-010/>

[10] <http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html>

[11] Plantronics. Plantronics Inc., 2005-2006. <http://www.plantronics.com>