Session T4J
San Juan, PR July 23 – 28, 2006
9th International Conference on Engineering Education
T4J-1

Design of Matlab®-Based Automatic Speaker Recognition Systems

Jamel Price and Ali Eydgahi
Department of Engineering and Aviation Sciences
University of Maryland Eastern Shore
Princess Anne, MD 21853
[email protected]

Abstract - This paper presents the design of an automatic speaker recognition system in the Matlab® environment, developed as part of a NASA-sponsored undergraduate research project. The project represents one of the many design and development activities that the University of Maryland Eastern Shore offers to undergraduate students as part of its research experience in Science, Technology, Engineering, and Mathematics. The goal of this project was to determine whether Matlab® can be used to construct a simple, yet complete and representative, automatic speaker recognition system. The issues considered were: 1) Can Matlab® be used effectively to complete the aforementioned task? 2) Which speech recognition algorithm, Dynamic Time Warping or Hidden Markov Modeling, produces the most representative speaker recognition system? And, if the first two issues yield promising results, 3) Can a Matlab®-based speaker recognition system be ported to a real-world environment, such as an airplane cockpit, for recording and performing complex voice commands?

Index Terms – Chebychev Distance, Digital Signal Processing, Dynamic Time Warping Algorithm, Hidden Markov Modeling Algorithm, Speaker Recognition System, Viterbi Algorithm.

INTRODUCTION

Digital signal processing (DSP) is the processing of signals by digital means. In many cases the signal is initially analog, i.e., an electrical voltage or current, and one of the processing techniques produces a discrete or digital output analogous to the analog input. Signals commonly need to be processed in a variety of ways.
Today, the filtering of signals to improve signal quality or to extract important information is done by digital signal processors using DSP techniques rather than by analog electronics [1]. Although the mathematical theory underlying DSP techniques such as the Fast Fourier Transform (FFT), digital filter design, and signal compression can be fairly complex, the numerical operations required to implement these techniques are in fact very simple [1]. Although the FFT is not optimal for many filtering applications, it is a very commonly used technique for analyzing and filtering digital signals. The digital signal processor is a programmable microprocessor with its own native instruction codes that is capable of carrying out millions of floating-point operations per second. By coding the various digital signal processes, a digital signal processor can be replicated in a data-manipulation and development environment such as Matlab®.

Speaker recognition is the process of automatically recognizing who is speaking based on unique characteristics contained in speech waves. This technique makes it possible to use a speaker's voice to verify their identity and control access to services such as voice dialing, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers [2]. At the highest level, all speaker recognition systems contain two modules: feature extraction and feature matching. Feature extraction is the process of extracting unique information from voice data that can later be used to identify the speaker. Feature matching is the procedure of identifying the speaker by comparing the extracted voice data with a database of known speakers and making a suitable decision based on this comparison. There are many techniques used to parametrically represent a voice signal for speaker recognition tasks.
These techniques include Linear Prediction Coding (LPC), Auditory Spectrum-Based Speech Feature (ASSF), and the Mel-Frequency Cepstrum Coefficients (MFCC). The MFCC technique was used in this project. It is based on the known variation of the human ear's critical bandwidth with frequency: its filters are spaced linearly at low frequencies and logarithmically at high frequencies to capture the important characteristics of speech [2]. The MFCC process is subdivided into five phases or blocks. In the frame blocking section, the speech waveform is divided into frames of approximately 30 milliseconds. The windowing block minimizes the discontinuities of the signal by tapering the beginning and end of each frame toward zero. The FFT block converts each frame from the time domain to the frequency domain. In the Mel-frequency wrapping block, the signal is mapped onto the Mel spectrum to mimic human hearing. Studies have shown
that human hearing does not follow a linear scale but rather the Mel scale, which is approximately linear below 1000 Hz and logarithmic above 1000 Hz. The Mel frequency corresponding to a linear frequency f (in Hz) is given by:

Mel(f) = 2595 * log10(1 + f / 700)

In the final step, the log Mel spectrum is converted back to the time domain using the discrete cosine transform.
The resultant matrices are referred to as the Mel-Frequency Cepstrum Coefficients [2]. This spectrum provides
a fairly simple but unique representation of the spectral
properties of the voice signal which is the key for representing
and recognizing the voice characteristics of the speaker.
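The five blocks above can be sketched end to end. The project itself used Matlab® M-files (mfcc.m); the NumPy version below is only an illustrative approximation, and the frame overlap, filter count, and FFT size are assumed values, not taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_sketch(signal, fs, frame_ms=30, n_filters=20, n_coeffs=13, nfft=512):
    """Five MFCC blocks: frame blocking, windowing, FFT, Mel wrapping, cepstrum."""
    frame_len = int(fs * frame_ms / 1000)      # ~30 ms frames, as in the text
    hop = frame_len // 2                       # assumed 50% overlap between frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)             # tapers frame edges toward zero

    # Triangular filterbank spaced evenly on the Mel scale from 0 Hz to fs/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        for k in range(bins[m - 1], bins[m]):          # rising edge
            fbank[m - 1, k] = (k - bins[m - 1]) / max(bins[m] - bins[m - 1], 1)
        for k in range(bins[m], bins[m + 1]):          # falling edge
            fbank[m - 1, k] = (bins[m + 1] - k) / max(bins[m + 1] - bins[m], 1)

    # DCT-II matrix converts log Mel energies back to the time (cepstral) domain
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))

    coeffs = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame, nfft)) ** 2  # frame power spectrum
        mel_energy = np.log(fbank @ power + 1e-10)     # Mel-warped log energies
        coeffs.append(dct @ mel_energy)
    return np.array(coeffs)                            # one row of MFCCs per frame
```

Calling mfcc_sketch on a recorded utterance yields the feature matrix, the "voice fingerprint," that the matching stage compares against the reference database.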
After the speech data has been transformed into MFCC
matrices, one must apply one of the widely used pattern
recognition techniques to build speaker recognition models
using the data attained in the feature extraction phase and then
subsequently identify any sequence uttered by an unknown
speaker. The techniques used to discern any similarities and
differences in the MFCC matrices include, but are not limited
to, Dynamic Time Warping (DTW) and Hidden Markov
Modeling (HMM).

The reference voiceprint with the lowest distance from the input pattern is deemed the identity of the unknown speaker. The best match, i.e., the smallest distance measured between a reference feature matrix and the unknown matrix, is found using DTW. There are two underlying concepts that comprise the
DTW procedure:
1. Features: the information in each signal has to be
represented in some manner (MFCC).
2. Distances: some form of distance metric has to be
used in order to obtain a match.
Within the distance metric concept, two additional notions exist:
1. Local: a computational difference between a feature of
one signal and a feature of another signal.
2. Global: an overall computational difference between
the entire signal and another signal.
The distance metric most commonly used within DTW is
the Chebychev distance measure. The Chebychev distance
between two points is the maximum distance between the
points in any single dimension. The distance between points
X = (X1, X2, ...) and Y = (Y1, Y2, ...) is computed
using the formula:

D(X, Y) = max over i of | Xi - Yi |

where Xi and Yi are the values of the ith variable at points X
and Y, respectively.

Speech is a time-dependent process. Hence, utterances
of the same word will have different durations, and utterances
of the same word with the same duration will differ in the
middle, due to different parts of the words being spoken at
different rates. To obtain a global distance between two
speech patterns a time alignment must be performed. The best
matching template is the one for which there is the lowest
distance path aligning the input pattern to the template. A
simple global distance score for a path is simply the sum of
local distances that make up the path [4].
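A minimal sketch of this matching step, assuming MFCC feature matrices as input: the project used the downloadable dtw.m, so the Python version below only illustrates the local/global distance idea with the Chebychev metric described above, and the function names are illustrative.

```python
import numpy as np

def chebychev(x, y):
    """Local distance: maximum difference in any single dimension."""
    return np.max(np.abs(x - y))

def dtw_distance(template, query):
    """Global distance: cheapest time-aligned path through the local-distance grid.

    template, query: arrays of shape (n_frames, n_features), e.g. MFCC matrices.
    """
    n, m = len(template), len(query)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = chebychev(template[i - 1], query[j - 1])
            # Extend the best of the three admissible predecessor cells
            D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def identify(references, query):
    """Return the tag of the reference voiceprint with the lowest DTW distance."""
    return min(references, key=lambda tag: dtw_distance(references[tag], query))
```

The global distance D[n, m] is exactly the sum of local distances along the cheapest alignment path, so the reference with the smallest value is returned as the best match.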
A speaker's voice patterns may exhibit a substantial degree
of variance: identical sentences, uttered by the same speaker
but at different times, result in similar, yet different, sequences
of MFCC matrices. The purpose of speaker modeling is to
build a model that can cope with speaker variation in feature
space and to create a fairly unique representation of the
speaker's characteristics. Stochastic modeling enables us tomodel the speaker's characteristics by describing the speech
production as a stochastic process. A stochastic approach for
modeling the speech and the speaker is the Hidden Markov
Model technique. A first order Markov model is a finite set of
states, where transitions between states are modeled by a
transition probability matrix A, assuming that the probability
of being in state Si at time t only depends on the state occupied
at time t -1. If the state probability vector π is known for t = 0,
the probability vector for the next observation moments can be
computed recursively by:
πt = A * πt-1

πt = A^t * π0
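The recursion can be verified numerically. The 2-state transition matrix below is an illustrative assumption, written column-stochastic so that πt = A * πt-1 holds as stated:

```python
import numpy as np

# Illustrative 2-state model. A is column-stochastic: A[i, j] = P(Si at t | Sj at t-1),
# so each column sums to one and the recursion reads pi_t = A * pi_{t-1}.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
pi0 = np.array([1.0, 0.0])      # the process is known to start in state S1

# Recursive form: pi_t = A * pi_{t-1}, applied for three observation moments
pi = pi0.copy()
for _ in range(3):
    pi = A @ pi

# Closed form: pi_t = A^t * pi_0 gives the same probability vector
assert np.allclose(pi, np.linalg.matrix_power(A, 3) @ pi0)
```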
When all possible transitions from Si to Sj are allowed
(full transition matrix) the model is called ergodic. When only
state sequences in chronological order are allowed (upper-
triangular transition matrix) the model is called left-to-right.
The model is called a hidden Markov model in the sense that the state sequence cannot be
observed directly (it is hidden); only the observation
sequence is known. A Probability Density Function (PDF)
describes the probability Pi of the observation vector given
that the process is in state Si.
Application of HMMs on acoustical observations is based
on the following assumptions:
1. Each observation belongs to a finite set of N states:
{S1, S2, S3, … , SN}
2. The probability of being in state Si at time t only
depends on the state occupied at time (t-1),
3. All observations are mutually independent,
4. The probability density functions are assumed to be
of a multivariate Gaussian distribution. In principle,
each possible distribution can be modeled by a
mixture of Gaussians, but the number of mixtures is
limited by the size of the available training set.
Although the third assumption in particular is not strictly
true when the observations are acoustic vectors, hidden
Markov modeling is one of the most popular and effective
modeling techniques for acoustic time series [6].
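Assumption 4 can be made concrete with a small sketch that evaluates the log-density of one observation vector under a mixture of Gaussians. This is illustrative only; the project's model.m is not reproduced here, and the function name is an assumption.

```python
import numpy as np

def gmm_logpdf(x, weights, means, covs):
    """Log-density of one observation vector x under a Gaussian mixture.

    weights: (M,)      mixture weights summing to one
    means:   (M, d)    one mean vector per mixture component
    covs:    (M, d, d) one covariance matrix per mixture component
    """
    d = len(x)
    comps = []
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)             # log |cov|, numerically safe
        quad = diff @ np.linalg.solve(cov, diff)       # Mahalanobis term
        comps.append(np.log(w) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + quad))
    return np.logaddexp.reduce(comps)                  # log of the weighted sum
```

Each state of the HMM would carry one such mixture, which is why the number of mixtures, and hence the number of parameters to train, is limited by the size of the training set.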
The process of hidden Markov speaker model training is defined as the determination of the optimal model parameters
for the identification of an unknown speaker, given a set of training
vectors from a particular speaker stored in a reference library.
Two approaches exist to train an HMM, the Baum-Welch
algorithm and the Viterbi algorithm. The latter approach is
used in this project. Both algorithms use the Maximum
Likelihood (ML) criterion. The Baum-Welch algorithm
evaluates all possible state sequences, which could have
produced the training observations, while the Viterbi
algorithm only considers the most probable state sequence.
The most general implementation of these algorithms updates
all parameters of the HMM, including covariances, means,
weighting coefficients, and transition probabilities [6].
The complexity of a speaker model is an important factor
which determines the performance of a model, where the
optimal complexity is dependent on the amount of training
data. The complexity of an HMM is described by the number of model parameters, which include one covariance matrix,
one mean vector, and one weighting coefficient per mixture
per state plus a transition matrix.
The HMM algorithm is a very involved process. In order
to implement a speaker recognition system using HMM the
following steps must be taken:
1. For each reference word, a Markov model must be
built using parameters that optimize the observations
of the word.
2. A calculation of model likelihoods for all possible
reference models against the unknown model must be
completed using the Viterbi algorithm followed by the
selection of the reference with the highest model likelihood value.
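The likelihood-comparison step can be sketched as follows, again in illustrative Python rather than the project's Matlab® (model.m); the model-scoring interface is an assumption.

```python
import numpy as np

def viterbi_loglik(log_A, log_pi, log_B):
    """Log-likelihood of the single most probable state sequence (Viterbi score).

    log_A:  (N, N) log transition probabilities, log_A[i, j] = log P(Sj | Si)
    log_pi: (N,)   log initial state probabilities
    log_B:  (T, N) per-frame log observation likelihoods log P(o_t | Sj)
    """
    T, N = log_B.shape
    delta = log_pi + log_B[0]      # best score ending in each state at t = 0
    for t in range(1, T):
        # Extend the best-scoring predecessor into each destination state
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
    return float(np.max(delta))

def identify(scored_models):
    """scored_models: tag -> (log_A, log_pi, log_B) evaluated on the unknown
    utterance; return the tag with the highest Viterbi likelihood."""
    return max(scored_models, key=lambda tag: viterbi_loglik(*scored_models[tag]))
```

Unlike Baum-Welch, which sums over all state sequences, the Viterbi score keeps only the single most probable path, which is why it is the cheaper of the two criteria.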
METHODOLOGY USED
During the feature extraction phase a database of feature
matrices or “voice fingerprints” was created in order to be
tested against the feature matching phase. Initially, a simple
speech database provided by the Swiss Federal Institute of
Technology [2] was used to design and test the system. A
voice fingerprint represents the most basic, yet unique,
features of a particular speaker's voice. As mentioned earlier,
the Mel-Frequency Cepstrum Coefficient algorithm was
implemented in order to complete the feature extraction task. By examining each block of the MFCC processor, a Matlab®
M-file was created. Some of the MFCC processes, such as
frame blocking, windowing, and the Fast Fourier Transform,
are available for download from the Matlab® program
repository [8]. The MFCC (mfcc.m) front-end used for this
project is provided by the Interval Research Corporation and is
available for download [9].
One of the primary concerns, when designing this system,
was to determine which feature matching algorithm, Dynamic
Time Warping or Hidden Markov Modeling, produces the
most representative speaker recognition system. The Dynamic
Time Warping (dtw.m) source code, like the various MFCC
components, is available for download from the Mathworks
repository. The Hidden Markov Modeling source code
(model.m) is available in [10]. Although the two major
components of a representative speaker system, feature
extraction (mfcc.m) and feature matching (model.m, dtw.m),
were available, local programming and other files were needed
to satisfy the feature matching portion and the remaining
requirements of the project. A Matlab®-based program
complete with graphical user interface was developed that
could record sound files in real-time and then store them in a
previously specified location under a unique identifier.
Another Matlab® file was developed that could take the
previously recorded voice files, stored as *.wav files, create
voice fingerprints from them using the mfcc code, and then
output each fingerprint to a common voice database under its
unique identifier. The aforementioned task was
accomplished using the built-in Matlab® function xlswrite,
which creates each voice fingerprint as a separate Excel
worksheet in a common Excel workbook. As a requirement for the feature matching portion of the system, another Matlab®-based file was designed that could record an unknown voice
file and after completing the feature extraction process it
would compare the newly attained voice fingerprint with the
references stored in the voice database.
In the case of the DTW algorithm, the distance metrics
are computed between the feature vector of the unknown
speaker and those of the speakers stored in the database and
the reference with the smallest distance metric is returned. In
the case of the HMM algorithm, Markov models are created
for both the feature matrices stored in the reference database
and that of the unknown speaker and the reference with the
highest likelihood value stored in the reference database is returned.
After a fairly representative speaker recognition system was
constructed using the speech data provided by the Swiss
Federal Institute, a DSP headset provided by Plantronics®
[11] was used so more complex voice commands could be
recorded locally.
PROGRAM STRUCTURE
Feature Extraction:

After opening Matlab® and setting the speaker recognition
folder as the active directory, typing “main” in the command
window initializes the prompt as represented in Figure I.
Selecting the training option from the Main Menu leads the user to the prompt depicted in Figure II. After choosing
either option of Creating a New Speaker Database or Add
Speaker to Existing Database, the user is then asked to enter a
name or “tag” to identify the reference voiceprint to be created
as is shown in Figure III.
FIGURE I:
MAIN MENU OF SPEAKER RECOGNITION SYSTEM
The recording status bar informs the user that the system
is currently recording the voice data. After the data has been
successfully created, the user receives a successful completion
prompt.
FIGURE II:
USER OPTIONS FOR SPEAKER DATABASE
Feature Matching:
Selecting the testing option from the Main Menu lets the
user choose between Dynamic Time Warping and Hidden
Markov Modeling, as shown in Figure IV. After choosing either
option, the user is then asked to press “OK” to initialize the
voice recording process.
FIGURE III:
USER PROMPT FOR VOICE PRINT IDENTIFICATION
After the unknown voice data has been successfully
recorded, the "tag" for the best match will be shown on the
computer display. Also, spectrograms of the unknown voice
data and the best matching voice data associated with the tag
will be displayed on screen. Figure V shows a sample result.
FIGURE IV:
USER OPTIONS WINDOW
Operation:
The Clear Database option on the Main Menu deletes the
*.wav, *.xls, and *.mat files created as a result of a voice
recognition process. The file entitled “acoustic_data.xls” is the
common voice database and “test_data.xls” is the voice print
of the voice file to be recognized. The Hidden Markov
Modeling and Dynamic Time Warping files each contain a
threshold parameter which controls the sensitivity of the
system.
FIGURE V:
RESULTS OF VOICE RECOGNITION PROCESS
CONCLUSION
A fairly representative speaker recognition system has been
constructed using the speech data provided by the Swiss
Federal Institute and simple utterances recorded locally using
Matlab® software. Through this project and trial-and-error
experience, the student was able to build a better and more
powerful system. The knowledge he gained through this
project will help him in future research and development projects.
ACKNOWLEDGEMENT
This work was supported by NASA Langley Research Center
Grant # NOC1-03033 under the Chesapeake Information
Based Aeronautic Consortium (CIBAC).
REFERENCES
[1] Ambardar, A., Analog and Digital Signal Processing, Brooks/Cole Publishing Company, 1999.

[2] Do, M. N., "An Automatic Speaker Recognition System," Digital Signal Processing Mini-Project, 14 June 2005. <http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project.html>

[3] "ECE341 So You Want to Try and do Speech Recognition," A Simple Speech Recognition Algorithm, 15 April 2003; accessed 1 July 2005.