Speaker Identification using Orthogonal Transforms and Vector Quantization 2008-2012
2. Review of Literature
Speech is a natural means of communication for humans. It is not
surprising that humans can recognize the identity of a person by
hearing that person's voice. About 2-3 seconds of speech is sufficient
for a human to identify a voice. One review on human speech recognition
[33] states that many studies of 8-10 speakers report accuracy of
more than 97% if a sentence or more of the speech is heard.
Performance falls when the speech sample is short or the number of
speakers is large. Speaker Recognition is one area of artificial
intelligence where machine performance can exceed human performance:
with short test utterances and a large number of speakers, machine
accuracy often exceeds that of humans. Research on Speaker
Identification systems dates back more than fifty years [3]. This work
is surveyed briefly in the following sections.
2.1 Early Systems (1960-1980)
The first reported work on Speaker Recognition can be attributed
to Pruzansky at Bell Labs [34], as early as 1963, who initiated
research by using filter banks and correlating two digital
spectrograms as a similarity measure. The system used several
utterances of commonly spoken words by ten talkers and converted
them into time-frequency-energy patterns. Some of each talker's
utterances were used to form reference patterns and the remaining
utterances served as test patterns. The recognition procedure
consisted of cross-correlating the test patterns with the reference
patterns and selecting the talker whose reference pattern gave the
highest correlation as the talker of the test
utterance. The recognition score for three-dimensional patterns was
89%. Reducing the original patterns to time-energy patterns
resulted in a lower recognition score; however, when only spectral
information was retained, recognition results were the same as
those for three-dimensional patterns. The work was further
improved in [35], by using a small subset of features. Features
were formed as the average of the speech energy over certain
rectangular areas on the spectrograms. Results were computed as a
function of the number of features used and as a function of the
size of the areas used to form the features. The filter bank approach
used in these two earlier studies was replaced by formant analysis in
Doddington [36]. Doddington proposed a speaker-verification system,
evaluated with eight known speakers and 32 impostors. Formant frequencies,
voicing pitch period, and speech energy, all as functions of time,
were used in verification. Proper time normalization was shown to
be an important factor in reducing verification error.
Intra-speaker variation in speech was investigated by Endres et al.
[37] and Furui [38]. In [37], spectrograms of utterances produced
by seven speakers and recorded over periods of up to 29 years
showed that the frequency position of formants and pitch of voiced
sounds shift to lower frequencies with increasing age of test
persons. Speech spectrograms of texts spoken in a normal and a
disguised voice revealed strong variations in formant structure.
Speech spectrograms of utterances of well-known people were
compared with those of imitators. The imitators succeeded in
varying the formant structure and fundamental frequency of their
voices, but they were not able to adapt these parameters to match
or even be similar to those of imitated persons.
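The cross-correlation matching attributed to Pruzansky [34] earlier in this section can be illustrated with a minimal sketch. It is not the original implementation: the pattern size (17 bands by 40 frames), the synthetic random patterns, the zero-lag normalized correlation, and the function names are all assumptions made for illustration.

```python
import numpy as np

def normalized_correlation(a, b):
    """Zero-lag normalized correlation between two equally sized patterns."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def identify_talker(test_pattern, reference_patterns):
    """Pick the talker whose reference pattern correlates best with the test."""
    scores = {talker: normalized_correlation(test_pattern, ref)
              for talker, ref in reference_patterns.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Synthetic demonstration: ten talkers, one reference pattern each.
# Pattern dimensions are hypothetical, not taken from [34].
rng = np.random.default_rng(0)
n_bands, n_frames = 17, 40
references = {f"talker_{i}": rng.random((n_bands, n_frames)) for i in range(10)}

# A test utterance modeled as a noisy copy of one talker's reference pattern.
test = references["talker_3"] + 0.1 * rng.standard_normal((n_bands, n_frames))
best, scores = identify_talker(test, references)
print(best, round(scores[best], 3))
```

In practice each reference pattern would be formed from several training utterances of a talker, and the test and reference patterns would need time alignment before they could be correlated.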
In [39], B. S. Atal evaluated several different parametric
representations of speech derived from the linear prediction model
for their effectiveness in automatic recognition of speakers from their
voices. Twelve predictor coefficients were determined approximately
once every 50 msec from speech sampled at 10 kHz. The predictor
coefficients and other speech parameters derived from them, such
as the impulse response function, the autocorrelation function, the
area function, and the cepstrum function were used as input to an
automatic speaker recognition system. S. Furui [40] and A. E.
Rosenberg and M. R. Sambur [41] used cepstrum coefficients
extracted by means of LPC analysis successively throughout an
utterance to form time functions. In time-domain methods, with
adequate time alignment, one can make precise and reliable
comparisons between two utterances of the same text, in similar
phonetic environments. Hence, text-dependent methods have a
much higher level of performance than text-independent methods.
The Texas Instruments system based on filter banks and the Bell Labs
systems based on cepstral analysis were the first Speaker Recognition
systems to be tried commercially.
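The linear prediction analysis described above (twelve predictor coefficients about every 50 msec from speech sampled at 10 kHz, with the cepstrum derived from them) can be sketched as follows. This is a minimal illustration and not the procedure of [39]: the autocorrelation method with the Levinson-Durbin recursion, the Hamming window, the frame length equal to the hop, and the synthetic test signal are all assumptions.

```python
import numpy as np

def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame."""
    n = len(frame)
    return np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Predictor coefficients a[1..p] with s[n] ~= sum_k a[k] * s[n-k],
    plus the final prediction error energy."""
    a = np.zeros(order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k = acc / err                      # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        err *= (1.0 - k * k)
    return a[1:], err

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1 / (1 - sum_k a[k] z^-k)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

def lpc_cepstral_features(signal, fs=10_000, order=12, hop_s=0.050, n_ceps=12):
    """One LPC-derived cepstral vector per hop (frame length = hop, an assumption)."""
    hop = int(round(fs * hop_s))
    window = np.hamming(hop)
    feats = []
    for start in range(0, len(signal) - hop + 1, hop):
        frame = signal[start:start + hop] * window
        a, _ = levinson_durbin(autocorrelation(frame, order), order)
        feats.append(lpc_to_cepstrum(a, n_ceps))
    return np.array(feats)

# Synthetic check: a second-order autoregressive signal with known coefficients.
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
s = np.zeros(5000)
for n in range(5000):
    s[n] = e[n]
    if n >= 1:
        s[n] += 1.3 * s[n - 1]
    if n >= 2:
        s[n] -= 0.6 * s[n - 2]

a2, _ = levinson_durbin(autocorrelation(s, 2), 2)   # should be near [1.3, -0.6]
feats = lpc_cepstral_features(s)                    # half a second of "speech"
print(a2, feats.shape)
```

The sequence of rows in `feats` is one way to form the time functions of cepstrum coefficients mentioned for [40] and [41]; comparing two such sequences for the same text still requires time alignment.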
2.2 Medieval Systems (1980-2000)
In this period there was a lot of development in Speaker
Identification technology. These advances were in both
feature extraction and feature matching.
2.2.1 Feature Extraction
Voice pitch (F0) and formant frequencies (F1, F2, F3) extracted
from time-aligned, uncoded and coded speech samples were
compared to establish the statistical distribution of error attributed
to the coding system [42]. The mel-warped cepstrum is a very
popular feature domain. The mel warping transforms the frequency
scale to place less emphasis on high frequencies. It is based on the
nonlinear human perception of the frequency of sounds [43]. The
cepstrum can be considered as the spectrum of the log spectrum.
Removing its mean reduces the effects of linear time-invariant
filtering (e.g., channel distortion). Often, the time derivatives of the
mel cepstra (also known as delta cepstra) are used as additional
features to model trajectory information.
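The three ideas in this paragraph (mel warping, cepstral mean removal, and delta cepstra) can be sketched in a few lines. The sketch rests on assumptions not taken from the cited works: the mel formula is one commonly used approximation, the delta is a simple regression over two neighbouring frames on each side, the channel is modeled as a fixed additive offset in the cepstral domain, and the frames are random noise rather than speech. No mel filter bank is applied; `hz_to_mel` only illustrates the warping.

```python
import numpy as np

def hz_to_mel(f_hz):
    """A commonly used mel-scale approximation (assumed here, not from [43])."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def real_cepstrum(frame, n_fft=512):
    """Cepstrum as the inverse transform of the log magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft))
    log_spectrum = np.log(spectrum + 1e-10)   # small floor avoids log(0)
    return np.fft.irfft(log_spectrum, n_fft)

def cepstral_mean_subtraction(ceps):
    """Remove the per-coefficient mean over all frames (rows)."""
    return ceps - ceps.mean(axis=0, keepdims=True)

def delta(ceps, width=2):
    """Regression-based time derivative of a (frames x coefficients) matrix."""
    n_frames = len(ceps)
    padded = np.pad(ceps, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(ceps, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n: width + n + n_frames]
        behind = padded[width - n: width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom

# Mel warping compresses high frequencies: equal 1 kHz steps shrink in mels.
low_step = hz_to_mel(1000) - hz_to_mel(0)
high_step = hz_to_mel(8000) - hz_to_mel(7000)

# Synthetic frames; a fixed linear channel adds a constant cepstral offset.
rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 400))
ceps = np.array([real_cepstrum(f)[:13] for f in frames])
channel_offset = rng.standard_normal(13)      # hypothetical channel

clean = cepstral_mean_subtraction(ceps)
distorted = cepstral_mean_subtraction(ceps + channel_offset)
features = np.hstack([clean, delta(clean)])   # static + delta features
print(np.allclose(clean, distorted), features.shape, low_step > high_step)
```

Because a time-invariant channel multiplies the spectrum, it adds a constant to the log spectrum and hence to every frame's cepstrum, which is why subtracting the mean cancels it in this idealized setting.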
Studies on automatically extracting the speech periods of each
person separately from a dialogue/conversation/meeting involving
more than two people have appeared as an extension of speaker