AN UNSUPERVISED APPROACH FOR
AUTOMATIC LANGUAGE IDENTIFICATION
by
Tuba İslam
B.S. in Electrical and Electronics Eng., Boğaziçi University, 2000
Submitted to the Institute for Graduate Studies in
Science and Engineering in partial fulfillment of
the requirements for the degree of
Master of Science
Graduate Program in Electrical and Electronics Engineering
Boğaziçi University
2003
ABSTRACT
AN UNSUPERVISED APPROACH FOR AUTOMATIC LANGUAGE IDENTIFICATION
Today, the need for multi-language communication applications that can serve
people from different nations in their native languages has gained increasing
importance. Automatic Language Identification plays a significant role in the
pre-processing phase of multi-language systems. Conventional systems require a
difficult and time-consuming process of labeling the phoneme boundaries of the
utterances in the speech corpus. In our work, we propose an unsupervised method
for building an automatic language identification system that requires neither a
labeled speech database nor linguistic information about the target languages.
The method comprises two branches processed in parallel: a nearest neighbor
selection method using language-dependent Gaussian mixtures, and a mono-lingual
phoneme recognition method using language-dependent network files. The
performance of the system is compared with previous studies; a decrease of 24.3
per cent in the worst case and an increase of 13.9 per cent in the best case are
observed. With the proposed method, a robust system with a tolerable performance
is built that can easily integrate any language into the LID application.
ÖZET
OTOMATİK DİL TANIMADA GÖZETİMSİZ YAKLAŞIM
(AN UNSUPERVISED APPROACH IN AUTOMATIC LANGUAGE IDENTIFICATION)
Today, the need for systems that can communicate in different languages, in
order to serve people in their own languages, is steadily increasing. Automatic
Language Identification serves as a pre-processing step in multilingual systems.
In conventional systems, before the models are trained, the speech data passes
through a laborious and time-consuming labeling stage in which the phoneme
boundaries of the utterances are determined. The method followed in this thesis
is an unsupervised approach to an automatic language identification system
developed without the need for a labeled database or linguistic knowledge of the
target languages. The method consists of two separate branches operating in
parallel: a nearest neighbor mixture selection method using language-specific
Gaussian mixtures, and a phoneme recognition method trained in a single language
using language-specific network structures. When the results are compared with
previous studies, the performance is observed to decrease by 24.3 per cent in
the worst case and to increase by 13.9 per cent in the best case. Using the
proposed method, a robust language identification system with an acceptable
performance has been developed.
Figure 4.5. Mean and variance values of language ranks with phoneme recognizer ..........33
Figure 4.6. The superposition of two methods ....................................................................33
Figure 4.7. The algorithm for merging the two results........................................................34
LIST OF TABLES
Table 1.1. Phoneme categories of Turkish with examples of words .....................................4
Table 1.2. Phoneme categories of English with examples of words .....................................4
Table 3.1. Number of speakers in OGI database .................................................................23
Table 3.2. Effect of feature normalization on LID for baseline system ..............................25
Table 3.3. Mean values of language ranks obtained by plain NNMS .................................27
Table 4.1. Mean values of language ranks obtained by phoneme recognizer .....................32
Table 4.2. Performance of LID system with weights: 0.6, 0.8 and 1 ..................................34
Table 4.3. Performance of the LID system with weights: 0.8, 0.4 and 1 ............................35
Table 4.4. Confusion matrix of languages in LID system of weights: 0.8, 0.4 and 1 .........36
Table 4.5. Percentage of correct identification for six-language test ..................................36
Table 4.6. Percentage of correct identification for three-language test ...............................37
Table 4.7. Percentage of correctness for pair-wise test of English......................................37
LIST OF ABBREVIATIONS
FFT Fast Fourier Transform
GMM Gaussian Mixture Model
HMM Hidden Markov Model
Hz Hertz
LID Language Identification
LD Language Dependent
LI Language Independent
LPCC Linear Predictive Cepstrum Coefficients
MFCC Mel Frequency Cepstrum Coefficient
ML Mono Lingual
NNMS Nearest Neighbor Mixture Selection
OGI-TS Oregon Graduate Institute-Telephone Speech
PRLM-P Phoneme Recognition followed by Language Modeling - Parallel
TURTEL Turkish Telephone Speech Corpora
VQ Vector Quantization
1. INTRODUCTION
1.1. Motivation
The necessity for multilingual capabilities grows with the development of world
communication. Speaking different languages will remain an obstacle until either
multilingual large-vocabulary continuous speech recognition or automatic
language identification systems reach excellent performance and reliability.
Automatically identifying a language from the acoustics alone, without
understanding the language, is a challenging problem. A multilingual person has
no problem identifying the languages he understands; word spotting is the basic
method the brain follows during this process. In a human-made system, however,
it is not easy to model the words of different languages and to build a
successful language identification system, since doing so needs a large amount
of labeled speech data and linguistic information about the target languages.
In all speech processing applications, one major restriction on performance is
data limitation: the accuracy of the system relies on the quality and variety of
the database. Another restriction is the need for labeled speech corpora of the
languages under test; labeling raw speech material and other linguistic data
takes great effort and time. Systems using multiple large-vocabulary continuous
speech recognizers give the best results. These systems include a complete word
recognizer for each language and use word- and sentence-level language modeling.
To build such a system, a large amount of labeled speech is necessary to train
the recognizers, and large amounts of written text are needed to train word
n-gram language models. A simpler but successful approach is parallel
language-dependent phone recognition followed by language modeling; but since it
is based on multiple language-specific phone recognizers, it also requires
labeled speech to train those recognizers.
Our motivation in this thesis is to search for methods of building a language
identification system that does not require linguistic information or labeled
speech corpora of the target languages. The system will not depend heavily on
pre-processed data, and therefore there will be no difficulty in adapting the
application to a new language.
1.2. Applications
One purpose of a language identification application is automatically adapting a
speech-based tool, such as online banking or information retrieval, to the
native language of the user. With the growth of the Internet, we now live in a
worldwide society, communicating and doing business with people who use a wide
variety of languages, which makes language identification more important each
day. Multilingual environments may have a political, military, scientific,
commercial or tourist context (Adda-Decker, 2000).
A language identifier may be useful in natural language processing systems,
information retrieval systems, speech mining applications, speech file
filtering, software translation services, knowledge management systems, and
anywhere else one might need to work with more than one language.
1.3. Language Discrimination Basics
Humans and machines can use many different attributes to distinguish one language
from another. There are some essential cues for understanding a spoken language. The
most accurate way is to catch some of the words spoken. Detecting the phones that are not
common in most languages or focusing on the intonation and the stress also help but are
not enough. The basic perspective of a spoken language is presented in Figure 1.1
(Greenberg, 2001).
[Figure omitted: a layered view of spoken language, from the modulation spectrum
and acoustic features (voicing, rounding; place and manner of articulation)
through phonetic segments and syllables, up to prosody, the lexicon, morphology
and syntax, with typical time scales from about 40 ms for features to about
1000 ms for words.]
Figure 1.1. An experimental perspective of spoken language
One way of representing speech sounds is by using phonemes. A "phoneme" is an
abstract representation of a phonological unit in a language. Across the world's
languages, the sound inventory varies considerably; the size of the phoneme
inventory used for speech recognition can be 29 phonemes, as in Turkish, or 46
phonemes, as in Portuguese. Formally, we can define the phoneme as a linguistic
unit such that, if one phoneme is substituted for another in a word, the meaning
of that word could change. This is only true for a set of phonemes in one
language; therefore, in a single language, a finite set of phonemes exists. When
different languages are compared, however, there are differences; for example,
in Turkish, /l/ and /r/ (as in "laf" and "raf") are two different phonemes,
whereas in Japanese they are not (Ladefoged, 1962). Similarly, the presence of
individual sounds, such as the "clicks" found in some sub-Saharan African
languages or the velar fricatives found in Arabic, attracts the attention of
listeners fluent in languages that do not contain these phonemes. Still, as the
vocal apparatus used in the production of languages is universal, phoneme sets
mostly overlap and the total number of phonemes is finite (Ladefoged, 1962). The
Turkish phonemes, subdivided into groups based on the way they are produced, are
given in Table 1.1.
Table 1.1. Phoneme categories of Turkish with examples of words
Vowels: Semivowels: Fricatives: Nasals: Plosives: Affricates: kim rey sar mal bul can gül lale şal nal del çam kel yer far gir göl lala hep pul çal zor ter yıl dağ kaç kul ver gem bol jüri
Unlike Turkish, some other languages, such as English and German, contain many
diphthongs. The classification of English phonemes is given with example words
in Table 1.2.
Table 1.2. Phoneme categories of English with examples of words
Vowels: Diphthongs: Semivowels: Fricatives: Nasals: Plosives: Affricates: heed bay was sail am bat jaw hid by ran ship an disc chore head bow lot funnel sang goat had bough yacht thick pool hard beer hull tap hod doer zoo kite hoard boar azure hood boy that who'd bear valve hut heard the
Vowel systems also differ from one language to another. In the study of
Pellegrino et al. (1999), phonemic differences based on vowels were taken into
account in language identification. Five languages (Spanish, Japanese, Korean,
French and Vietnamese) were chosen for the evaluation of the LID system because
of their phonologically different vowel systems. The Spanish and Japanese vowel
systems are rather simple, as they include only five vowels; the Korean and
French systems, on the other hand, are quite complex and make use of secondary
articulations (long vs. short vowel opposition in Korean and nasalization in
French).
The phoneme error rate of a language correlates with the number of phonemes used
to model that language. In the study of Schultz (2001), the acoustic
confusability of languages, obtained with phoneme-based recognizers, is given in
Figure 1.2. The phoneme error rates range from 33.8 per cent to 46.4 per cent.
Turkish is an exception in this result because of the high substitution rate
between the vowels "e", "i" and "y".
Figure 1.2. Phoneme error rates of some languages as an example of acoustic
confusability
It is also possible to distinguish between speech sounds depending on the way
they are produced. The speech units in this case are known as phones. A "phone"
is a realization of an acoustic-phonetic unit or segment: the actual sound
produced when a speaker intends to speak a phoneme. Phone and phoneme sets
differ from one language to another, even though many languages share a common
subset of phones and phonemes (Schultz, 2001). Phone and phoneme frequencies of
occurrence may also differ from one language to another, and the phonotactics,
i.e. the rules governing the allowed sequences of phones and phonemes, are also
different in most cases.
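The role of phonotactics can be made concrete with a small scoring sketch. The
function below is illustrative only; the names, the add-one smoothing and the
toy bigram counts are assumptions, not part of this thesis. It scores a phone
sequence under smoothed bigram statistics, so that a sequence violating a
language's phonotactics receives a much lower probability.

```python
import math

def bigram_logprob(phones, bigram_counts, vocab, alpha=1.0):
    """Add-one-smoothed bigram log-probability of a phone sequence.
    bigram_counts maps (phone_a, phone_b) -> count from training data."""
    # unigram counts derived from the bigram table
    unigram = {}
    for (a, _), c in bigram_counts.items():
        unigram[a] = unigram.get(a, 0) + c
    logp = 0.0
    for a, b in zip(phones, phones[1:]):
        num = bigram_counts.get((a, b), 0) + alpha
        den = unigram.get(a, 0) + alpha * len(vocab)
        logp += math.log(num / den)
    return logp

# Toy illustration: a "language" in which /t/ is always followed by /a/.
counts = {("t", "a"): 10, ("a", "t"): 9}
vocab = {"t", "a", "k"}
```

A phonotactically legal sequence such as ["t", "a", "t", "a"] then scores far
higher than an illegal one such as ["t", "k", "t", "k"] under the same counts.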
There are more phones than phonemes, as some phonemes are produced in different
ways depending on the context. For example, the pronunciation of the phoneme /l/
differs slightly when it occurs before consonants and at the end of utterances,
as in "salı" (Tuesday) and in "kalk" (get up). As both are different forms of
the same phoneme, they form a set of allophones. Any machine-based speech
recognizer needs to be aware of the existence of allophone sets.
The morphology, i.e. the word roots and lexicon, also usually differs from one
language to another. Each language has its own vocabulary and its own way of
forming words.
Stress, rhythm and intonation are the prosodic features of speech. The duration
of phonemes, pitch characteristics and stress patterns differ from one language
to another. Stress is used at two different levels: it indicates the most
important words in a sentence and the prominent syllables in a word, which may
change the meaning entirely. As an example, the English word "object" can be
understood as either a noun or a verb, depending on whether the stress is placed
on the first or the second syllable. Intonation, or pitch movement, is very
important in indicating the meaning of an English sentence. In tonal languages,
such as Mandarin and Vietnamese, intonation determines the meaning of individual
words as well.
Syntax, i.e. the sentence patterns, also differs among languages. Although the
same word may be shared by two languages, such as "hat" in German and Turkish or
"ten" in English and Turkish, the neighboring words in the sentence and the
suffixes or prefixes attached to the word will be different.
The perceptual confusion of different languages was examined by Muthusamy in
1994; the responses of the subjects are shown in Figure 1.3.
[Figure omitted: a bar chart of average subject performance (per cent correct)
for the languages EN, FA, FR, GE, JA, KO, MA, SP, TA and VI. First quarter of
the test: 100.0, 22.4, 84.1, 79.5, 38.2, 13.5, 24.5, 70.5, 51.7, 19.0; last
quarter: 100.0, 30.2, 83.3, 77.8, 40.0, 14.7, 27.3, 81.4, 60.9, 34.5.]
Figure 1.3. Perceptual language identification results
The difference between the first quarter and the last quarter, which denote the
beginning and the end of the test respectively, shows the effect of the
subjects' learning during the test, caused by the feedback given after each
response. Muthusamy notes that Korean, which has the lowest score in human
perception, is confused most often with Farsi, Japanese, Mandarin, Tamil and
Vietnamese.
1.4. Previous Research
Since the 1970s, research has focused on automatic language identification from
speech. The systems implemented to date vary mainly in how they model languages.
Language identification has two phases: the "training phase" and the
"recognition phase".
During the "training phase", the system is presented with examples of speech
from a variety of languages. Each training utterance is converted into a stream
of feature vectors computed from short windows of the speech waveform (e.g.
20 ms), within which the waveform is assumed to be approximately stationary.
The feature vectors are recomputed with a pre-defined step size (e.g. 10 ms,
i.e. 50 per cent overlap) and contain cepstral information about the speech
signal. The training algorithm analyzes a sequence of such vectors and produces
a model for each language.
During the "recognition phase" of LID, feature vectors computed from the new
utterance are compared to each of the language-dependent models. The likelihood
that the new utterance was spoken in the same language as the speech used to
train each model is computed by a distance measure, and the model with the
maximum likelihood is found. The language of the speech used to train that model
is assigned as the language of the utterance.
More sophisticated language identification systems use phonemes to model speech.
During the training of the phoneme models, these systems need either a phonetic
transcription, i.e. the sequence of symbols representing the spoken sounds, or
an orthographic transcription, i.e. the text of the words spoken, together with
a phonemic transcription dictionary.
1.4.1. LID Using Spectral Content
In the earliest research on language identification, developers focused on the
differences in spectral content among languages. The basic idea was that
different languages contain different phonemes and phones. A set of short-term
spectra is obtained from the training utterances, and these prototypes are
compared to the ones obtained from the test speech.
There are many options in the implementation of this approach. The training and
test spectra can be used directly, or they can be used to obtain feature vectors
for the speech, such as cepstrum coefficients or formant-based vectors. The
training data can be taken directly from the training utterances or can be
synthesized by K-means clustering. The similarity between the sets of training
and test spectra can be calculated with the Euclidean, Mahalanobis, or any other
distance metric. Examples of such language identification systems were proposed
and developed by Cimarutsi and Ives in 1982, Goodman et al. in 1989, and
Sugiyama in 1991.
In order to compute the similarity between languages, most of the early systems
calculated the distance between the test vector and its closest train vector and accumulated
the result as an overall distance. In these systems, the language with the lowest distance is
assigned as the identified language. Later, Gaussian mixture modeling was
applied to this approach by Nakagawa et al. in 1992 and by Zissman in 1993. In
this case, each vector is assumed to be generated randomly according to a
probability density that is a weighted sum of multi-variate Gaussian densities
(Zissman, 2001). During training, a Gaussian mixture model of the feature
vectors is computed for each language; during recognition, the likelihood of the
test-utterance feature vectors is computed given each of the language models.
The language with the maximum likelihood is proposed as the identified language.
In this approach, instead of a single closest training vector, the whole set of
training feature vectors affects the scoring of each test vector; it may
therefore be called a soft version of vector quantization.
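The "soft vector quantization" view can be written out directly: under a
diagonal-covariance GMM, every mixture component contributes to every frame
through a log-sum-exp, instead of only the nearest codeword. The sketch below
assumes diagonal covariances and invented toy models; it is an illustration of
the technique, not the configuration used in the cited studies.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Total log-likelihood of frames x (T x D) under a diagonal-covariance
    GMM; all mixtures contribute to every frame (the 'soft VQ' view)."""
    diff = x[:, None, :] - means[None, :, :]                            # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)     # (M,)
    log_exp = -0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (T, M)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    # log-sum-exp over mixtures, then sum over frames
    m = log_comp.max(axis=1, keepdims=True)
    per_frame = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(per_frame.sum())

# Toy comparison: frames drawn near one of two 2-mixture models.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(200, 2))
w = np.array([0.5, 0.5])
means_a = np.array([[0.0, 0.0], [0.5, 0.5]])   # matches the data
means_b = np.array([[5.0, 5.0], [6.0, 6.0]])   # far from the data
var = np.ones((2, 2))
```

Scoring the same frames under both models and taking the maximum implements the
GMM-based recognition decision described in the text.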
Since vector quantization gives a static classification, language identification
systems using Hidden Markov Models were implemented in the late 1980s in order
to model the sequential characteristics of speech. HMM-based language
identification was first proposed by House and Neuburg in 1977 (Zissman, 2001).
In these systems, HMM training was performed on unlabeled training speech, and
the system performance was in some cases even worse than that of static
classifiers.
Later, a new approach was proposed by Li in 1994: the vowels of each speech
utterance are labeled automatically, and spectral vectors are computed in the
neighborhood of the vowels. Instead of modeling the feature vectors over all
training data, only the selected portions are used. During recognition, the
selected portions of the test data are processed and the language with the
maximum likelihood is assigned as the identified language.
1.4.2. LID Using Prosody
The pitch frequency (fundamental frequency) of speech is defined as the
frequency at which the vocal cords vibrate during a voiced sound (Hess, 1983).
It is difficult to make a reliable estimate of the pitch frequency from speech
data, since the harmonics and side frequencies cause distortion.
One of the simplest algorithms in use depends on multiple measures of
periodicity in the signal. The fundamental frequency (f0) is usually processed
on a logarithmic scale rather than a linear one, in order to match the
resolution of the human auditory system. Normally 50 Hz ≤ f0 ≤ 500 Hz for voiced
speech; for unvoiced speech, f0 is undefined and, by convention, is set to zero
on the log scale.
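The convention above translates into a tiny helper; the function name is a
hypothetical stand-in for whatever the pitch front end would actually expose.

```python
import math

def log_f0(f0_hz):
    """Map fundamental frequency to a log scale; by the convention in the
    text, unvoiced frames (f0 undefined or zero) are represented as 0."""
    if f0_hz is None or f0_hz <= 0:
        return 0.0
    return math.log(f0_hz)
```

For voiced frames in the 50-500 Hz range, the log value is always well above
zero, so the unvoiced marker does not collide with real pitch values.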
Since the fundamental frequency reflects the characteristics of the speaker, it
does not give global information about the language or the utterance. The slope
of the pitch contour, however, gives some clues about the prosody and stress of
the utterance, which may differ from language to language. Humans can also use
prosodic information to guess the spoken language (Muthusamy et al., 1994).
Language identification systems depending on prosody alone have also been
proposed by Itahashi et al. (1994, 1995), and pitch estimation is argued to be
more robust than spectral parameters, especially in noisy environments. Compared
to phonetic information, however, prosody carries little information about the
language (Hazen, 1993). Some research implies that systems using both prosodic
and phonetic parameters perform about the same as systems using only phonetic
parameters. Therefore, the prosodic information of speech is not considered in
this thesis.
1.4.3. LID Using Phone-Recognition
Different languages have different phone distributions, which has led many
researchers to build LID systems that extract the phone sequence of an utterance
and determine the language based on the statistics of that sequence. An example
of this approach was implemented by Lamel, who built two HMM-based phone
recognizers for English and French (Lamel and Gauvain, 1993). He found that the
likelihood scores obtained from language-dependent phone recognizers can be used
to distinguish between the two languages.
In a different approach by Schultz (2001), the language-specific phonemes of N
languages are unified into one global set. A target language for which we do not
have enough information is then modeled by adaptation from other well-known
languages, using the phoneme descriptions of the languages.
As building a system that depends on phone recognition necessitates
multi-language phonetically labeled corpora, it becomes more difficult to
include new languages in the language identification process. This difficulty
can be handled by using a phone recognizer for a single language and obtaining
the phonetic distributions of the other languages from it. Hazen and Zue (1993)
and Zissman and Singer (1994) developed LID systems that use a single-language
front-end phone recognizer, with successful performance. This work was extended
to multiple single-language front ends by Zissman and Singer (1994) and Yan and
Barnard (1995).
In this thesis, a phoneme recognizer based on Turkish phonemes is developed, and
the phonetic distributions of the languages in the OGI multi-language database
are evaluated.
1.4.4. LID Using Word-Recognition
Systems based on word recognition are more complicated than phone-level systems
and less complicated than large-vocabulary systems. They use the lexical
information of languages and score the occurrence of words for each language.
In the approach of Kadambe and Hieronymus (1995), which uses lexical modeling
for language identification, the incoming utterance is processed by parallel
language-dependent phone recognizers, and possible word sequences are identified
from the resulting phone sequences. Obtaining the lexical information of all
target languages is not an easy task, since each language-dependent lexicon
includes several thousand entries.
1.4.5. LID Using Continuous Speech Recognition
In order to obtain better LID performance, researchers add more and more
knowledge to their systems. Large-vocabulary continuous-speech recognition
systems are the most complicated ones devised for this purpose. During training,
one speech recognizer per language is created, and during the evaluations all
recognizers are run in parallel to select the most likely one as the recognized
language. Mendoza et al. (1996), Schultz and Waibel (1998) and Hieronymus and
Kadambe (1997) have worked on these systems.
As these systems use higher-level knowledge (words and word sequences) rather
than lower-level knowledge (phones and phone sequences), their identification
performance is better than that of simpler systems. On the other hand, they
require many hours of labeled training data for each language to be recognized,
and their algorithms are the most computationally complex (Zissman, 2001).
2. THEORETICAL BACKGROUND
2.1. Speech Representation
Since their introduction in the early 1970s, homomorphic signal processing
techniques have been of great interest in speech recognition. Homomorphic
systems are a class of nonlinear systems that obey a generalized principle of
superposition; linear systems are a special case of homomorphic systems
(Picone, 1993).
In speech processing, the homomorphic system should have the following property:
mea (description of most recent meal), stb (free speech before the tone), sta
(free speech after the tone). The recordings classified as "stories before the
tone" (stb files), each lasting 45 seconds, are used in our evaluations. The
numbers of speakers in the training and test sets of the OGI database are given
in Table 3.1.
Table 3.1. Number of speakers in OGI database

Language      Training Set   Evaluation Set
English            50             141
Farsi              49              51
French             50              57
German             50              59
Hindi             173*             52
Korean             50              40
Japanese           49              37
Mandarin           49              52
Spanish            50              60
Tamil              50              55
Vietnamese         50              50
* "stb" files only.
The speech files have the NIST SPHERE header format. All files are compressed
with the "shorten" speech compression method; they are decompressed and
byte-swapped before the feature extraction phase.
3.2. Feature Extraction
The speech files in the OGI multi-language corpora, sampled at 8 kHz with 16-bit
resolution, are parameterized using 20 ms frames with 10 ms overlap between
contiguous frames. For each frame a 24-dimensional feature vector is computed:
12 cepstrum coefficients and 12 delta-cepstrum coefficients. The speech is
windowed with a 160-point Hamming window to obtain the short-term energy
spectrum, which is then passed through a filter bank of 16 filters.
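The front end above can be sketched as follows. This is a simplified stand-in,
not the exact thesis front end: the band grouping is a crude linear substitute
for a mel filter bank, and the function names, FFT size and smoothing constant
are assumptions. It keeps the stated parameters: 160-point Hamming-windowed
frames, 16 filterbank channels, 12 cepstral coefficients.

```python
import numpy as np

def frame_signal(x, frame_len=160, hop=80):
    """Split a signal into 20 ms frames with 10 ms hop (8 kHz sampling)
    and apply a 160-point Hamming window, as described in the text."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])
    return frames * np.hamming(frame_len)

def cepstra(frames, n_filters=16, n_ceps=12):
    """Filterbank cepstra sketch: log band energies followed by a DCT-II.
    A crude linear band grouping stands in for the mel filter bank."""
    spec = np.abs(np.fft.rfft(frames, n=256)) ** 2        # short-term energy spectrum
    bands = np.array_split(spec.T, n_filters)             # group bins into 16 bands
    fbank = np.stack([b.sum(axis=0) for b in bands]).T    # (T, 16)
    logfb = np.log(fbank + 1e-10)                         # log band energies
    # DCT-II basis to decorrelate; keep 12 cepstral coefficients
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), (k + 0.5)) / n_filters)
    return logfb @ basis.T                                # (T, 12)
```

Delta-cepstrum coefficients would then be obtained by differencing these
vectors across frames, giving the 24-dimensional vector of the text.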
The energy coefficient is not included in the feature vector because of the
varying recording levels over the telephone line. In the study of Wong (2001),
it is shown that the static log-energy coefficient reduces the performance of
the language identification system; as an explanation, it is suggested that
static short-term features, in contrast to transient features, do not
encapsulate language-specific information.
Cepstral normalization is performed in order to minimize the channel effect: the
mean cepstrum of each file is calculated and then subtracted from each feature
vector. The effect of cepstrum normalization is examined through a plain system
test with normalized and unnormalized feature vectors.
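The per-file mean subtraction just described is a one-line operation; the
function name below is a hypothetical label for it.

```python
import numpy as np

def cepstral_mean_norm(features):
    """Per-file cepstral mean subtraction: computes the mean cepstrum over
    all frames of the file and removes it from every feature vector,
    canceling a stationary (convolutional) channel offset."""
    return features - features.mean(axis=0, keepdims=True)
```

After normalization, each cepstral dimension of the file has zero mean, so a
constant channel bias common to all frames is removed.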
Figure 3.1. Distributions of c1 vs. c0 coefficients of mixtures for some languages
Table 3.2. Effect of feature normalization on LID for baseline system

          Unnormalized   Normalized    Improvement
          Parameters     Parameters    in mean rank
EN            1.40           1.00          3.6%
FA            8.25           8.20          0.5%
FR            6.45           6.85          3.6%
GE            1.80           2.00          1.8%
HI            3.70           3.55          1.4%
JA            3.75           4.20         -4.1%
KO            8.70           7.35          4.1%
MA            7.25           6.90          3.2%
SP            4.30           3.80          4.5%
TA            4.20           3.60          5.4%
VI           10.60          10.80         -1.8%
Overall       5.49           5.29          3.6%
The results in Table 3.2 are given in terms of the ranks of the target
languages; they imply that the normalization process improves LID performance
for all languages except Japanese and Vietnamese. The unexpected result for
these two languages may be related to their voiced-phoneme distributions: the
ratio of high-energy frames might be corrupted during the normalization process.
3.3. Gaussian Mixture Modeling using Vector Quantization
In order to build an LID system, which does not dependent on the amount of labeled
speech data, we implement a method based on Gaussian mixtures generated by vector
quantization. The speech files in the training set of OGI corpora for 11 languages are used
in order to obtain the codebook of the system for each language.
A composite of mixtures is evaluated with an optimal codebook size of 32.
Mixture splitting is performed using an entropy-based distance measure defined
over the codewords of each language.
The algorithm can be summarized as follows:

1. Decide on the codebook size (N = 32).
2. Evaluate the initial mean value (codeword) using the input feature vectors.
3. Split the mean vector with the maximum weight in the codebook recursively
   until the total number of codewords is reached.
4. Using the sum of squared errors as the distortion measure, cluster the input
   vectors around the codewords: find the distance between the input vector and
   each codeword; the input vector belongs to the cluster of the codeword that
   yields the minimum distance.
5. Re-estimate the new set of codewords by averaging each cluster, i.e. summing
   each component over the vectors in the cluster and dividing by the number of
   vectors:

   c_i = (1/m) * sum_{j=1..m} x_{j,i}                                    (3.1)

   where i indexes the components of each vector (x, y, z, ... directions) and
   m is the number of vectors in the cluster.
6. Repeat steps 4 and 5 until either the codewords do not change or the change
   is quite small.
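The steps above can be sketched as a small LBG-style training routine. This is a
sketch under stated assumptions, not the thesis implementation: the
entropy-based split criterion of the text is replaced here by cluster occupancy,
and the perturbation size, iteration limits and function name are invented.

```python
import numpy as np

def train_codebook(x, size=32, iters=20, eps=1e-3, seed=0):
    """LBG-style codebook training: start from the global mean, repeatedly
    split the heaviest codeword, then refine the codewords by k-means-type
    re-estimation until they stop moving."""
    rng = np.random.default_rng(seed)
    codewords = [x.mean(axis=0)]            # step 2: initial mean
    counts = [len(x)]
    while len(codewords) < size:            # step 3: recursive splitting
        j = int(np.argmax(counts))          # heaviest codeword (occupancy proxy)
        pert = eps * rng.standard_normal(x.shape[1])
        codewords[j: j + 1] = [codewords[j] - pert, codewords[j] + pert]
        cw = np.stack(codewords)
        for _ in range(iters):              # steps 4-6: cluster and re-estimate
            d = ((x[:, None, :] - cw[None]) ** 2).sum(axis=2)
            assign = d.argmin(axis=1)       # nearest-codeword clustering
            new = np.stack([x[assign == k].mean(axis=0) if np.any(assign == k)
                            else cw[k] for k in range(len(cw))])
            if np.abs(new - cw).max() < 1e-6:
                cw = new
                break
            cw = new
        codewords = list(cw)
        counts = [int((assign == k).sum()) for k in range(len(cw))]
    return np.stack(codewords)
```

Each outer iteration adds exactly one codeword, so the loop always terminates
with the requested codebook size.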
The evaluation of this system is based on the Nearest Neighbor Selection
algorithm (Higgins, 1993), a non-parametric approach that uses the averaged
nearest-neighbor distance to classify features into mixtures. The Gaussian that
best fits the input vector of mel-cepstrum coefficients, c, is found by
evaluating the distance using the log-likelihood values. The distance search
algorithm can be expressed briefly as follows:
dmin = ∞
for m = 1..M
    d = 0
    for n = 1..N
        d = d + (c_n - µ_n^m) * (c_n - µ_n^m) / σ_n^m
    end
    if d < dmin
        dmin = d
        argmin = m
    end
end
return dmin

Figure 3.2. Distance search algorithm
The likelihood values obtained for the 11 languages are sorted in order to get
the rank of each language: the one with the maximum likelihood takes rank 1 and
the one with the minimum takes rank 11. The mean values of the language ranks
for each model, tested on each test set (20 files of 45-second utterances), are
given in Table 3.3. The mean and variance values of the correct matches are
plotted in Figure 3.3.
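The likelihood-to-rank conversion just described is straightforward; the
function name below is a hypothetical label, and ties are broken by dictionary
order (an assumption the text does not specify).

```python
def language_ranks(loglik):
    """Convert per-language likelihood scores to ranks: rank 1 is the
    language with the maximum likelihood, rank 11 the minimum."""
    ordered = sorted(loglik, key=loglik.get, reverse=True)
    return {lang: i + 1 for i, lang in enumerate(ordered)}
```

For example, scores of {"EN": -10.0, "TR": -2.0, "FR": -5.0} yield ranks
TR = 1, FR = 2, EN = 3.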
Table 3.3. Mean values of language ranks obtained by plain NNMS
Languages of Model Files Mean EN FA FR GE HI JA KO MA SP TA VI
Figure 4.5. Mean and variance values of language ranks with phoneme recognizer
4.3. Superposition of Two Methods
In our proposed LID system, the two separate methods explained above are processed in parallel and their outputs are merged at the final stage, as shown in Figure 4.6.
[Block diagram: the input Utterance goes through Feature Extraction and feeds two parallel branches — the Nearest Neighbor Search of LD-GMMs and the Monolingual Phoneme Recognizer with LD networks. Each branch outputs Ranks of Languages; the Merge Outputs stage combines them into the Detected Language.]

Figure 4.6. The superposition of two methods
Search for the language with the highest rank in output_1 and find index I.
Search for the language with the highest rank in output_2 and find index J.
Find rank_I, the rank in output_2 corresponding to index I.
Find rank_J, the rank in output_1 corresponding to index J.

if rank_I > thresh and rank_J > thresh
    if rank_I > rank_J : Language "I" detected
    else               : Language "J" detected
else if rank_I > thresh : Language "I" detected
else if rank_J > thresh : Language "J" detected
else                    : False detection

Figure 4.7. The algorithm for merging the two results
In both methods, the ranks of the languages for each file are obtained as output, and the results are merged through the algorithm described in Figure 4.7. The output of one method is checked against the output of the other, which leads the system to make more accurate decisions and to handle some of the false detections. Each language in the test domain also behaves as a garbage model for the target language.
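The merging rule of Figure 4.7 can be sketched as follows. The numeric convention is an assumption here: ranks are treated as scores in which a larger value means a more likely language, since this excerpt does not spell out the threshold semantics:

```c
#include <assert.h>

#define NLANG 11
#define NO_LANGUAGE -1   /* "false detection": neither output is trusted */

/* Merge the two per-language score arrays. out1/out2 hold one score
   per language (larger = more likely, an assumed convention);
   returns the index of the detected language or NO_LANGUAGE. */
int merge_outputs(const double out1[NLANG], const double out2[NLANG],
                  double thresh) {
    int I = 0, J = 0;
    for (int k = 1; k < NLANG; k++) {
        if (out1[k] > out1[I]) I = k;   /* top language of method 1 */
        if (out2[k] > out2[J]) J = k;   /* top language of method 2 */
    }
    double rank_I = out2[I];  /* method 1's winner, scored by method 2 */
    double rank_J = out1[J];  /* method 2's winner, scored by method 1 */
    if (rank_I > thresh && rank_J > thresh)
        return rank_I > rank_J ? I : J;
    if (rank_I > thresh) return I;
    if (rank_J > thresh) return J;
    return NO_LANGUAGE;      /* neither cross-check passed */
}
```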
The results of the LID system for 11 languages with different coefficient settings are listed in Table 4.2 and Table 4.3. The evaluation obtained with uni-gram, bi-gram and normalization coefficients of 0.8, 0.4 and 1, respectively, gives better overall performance than the system with weights of 0.6, 0.8 and 1. In an LID application, the weights of the system should be tuned depending on the results obtained for the languages under test.
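The role of the three coefficients can be sketched as a weighted combination of the recognizer scores. The linear form, the name `combine_scores` and its arguments are illustrative assumptions, not the thesis's exact formula:

```c
#include <assert.h>

/* Combine the uni-gram, bi-gram and normalization scores with the
   tunable weights w_uni, w_bi and w_norm (e.g. 0.8, 0.4 and 1).
   A linear combination is assumed here for illustration. */
double combine_scores(double unigram, double bigram, double norm_score,
                      double w_uni, double w_bi, double w_norm) {
    return w_uni * unigram + w_bi * bigram + w_norm * norm_score;
}
```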
Table 4.2. Performance of LID system with weights: 0.6, 0.8 and 1
%        LD-NNMS   LI-Phoneme Recognizer   Superposed System
EN       15.0      15.0                    30.0
FA       15.0      30.0                    15.0
FR       20.0      85.0                    55.0
GE       55.0      45.0                    75.0
HI       30.0      35.0                    45.0
JA       00.0      15.0                    15.0
KO       15.0      10.0                    15.0
MA       05.0      15.0                    15.0
SP       35.0      00.0                    35.0
TA       45.0      15.0                    35.0
VI       00.0      15.0                    05.0
Overall  21.4      25.5                    30.9
Table 4.3. Performance of the LID system with weights: 0.8, 0.4 and 1
%        LD-NNMS   LI-Phoneme Recognizer   Superposed System
EN       55.0      15.0                    50.0
FA       20.0      30.0                    25.0
FR       40.0      85.0                    65.0
GE       25.0      45.0                    50.0
HI       25.0      35.0                    40.0
JA       00.0      15.0                    10.0
KO       40.0      10.0                    40.0
MA       10.0      15.0                    20.0
SP       25.0      00.0                    25.0
TA       45.0      15.0                    40.0
VI       15.0      15.0                    15.0
Overall  25.2      25.5                    34.5
The confusion probabilities of the target languages are given in Table 4.4. It is observed that English test files are most often confused with French and Spanish, each with a probability of 0.10. French, German and Hindi are mostly confused with English, with probabilities of 0.15, 0.10 and 0.15, respectively.
Table 4.4. Confusion matrix of languages in LID system of weights: 0.8, 0.4 and 1
Languages of Model Files
      EN    FA    FR    GE    HI    JA    KO    MA    SP    TA    VI
EN   0.50  0.05  0.10  0.05  0.00  0.05  0.05  0.05  0.10  0.05  0.00
FA   0.15  0.25  0.20  0.10  0.00  0.00  0.00  0.00  0.10  0.00  0.00
FR   0.15  0.00  0.65  0.00  0.00  0.05  0.00  0.00  0.05  0.00  0.00
GE   0.10  0.05  0.05  0.50  0.00  0.00  0.00  0.00  0.00  0.05  0.00
HI   0.15  0.00  0.00  0.00  0.40  0.10  0.05  0.00  0.00  0.10  0.00
JA   0.00  0.05  0.10  0.10  0.10  0.10  0.05  0.10  0.15  0.05  0.00
KO   0.05  0.00  0.10  0.05  0.05  0.00  0.40  0.00  0.10  0.05  0.05
MA   0.10  0.05  0.10  0.10  0.15  0.00  0.05  0.20  0.05  0.05  0.00
SP   0.15  0.00  0.05  0.10  0.05  0.05  0.05  0.00  0.25  0.05  0.05
TA   0.00  0.00  0.00  0.05  0.00  0.05  0.10  0.00  0.20  0.40  0.00
APPENDIX B: MATLAB FILE FOR COMPUTING MFCC

function m_pFeature = mfcc8000(x);
% mfcc8000(x): converts one frame of raw speech samples x to MFCC coefficients.
fbank = 17;                        % m_nFilterBanks
mel = 14;                          % m_nFeatureVectorLength
m_DimNorm = sqrt(2.0/fbank);
fs = 8000;
% highest mel frequency, divided into fbank+1 equal mel steps
melmax = 2595*log10(1.0+((fs/2.0)/700.0));
melmax = melmax / (fbank+1);
% filter-bank center frequencies in FFT bins (512-point half spectrum)
for i = 1:fbank+2
    m_CenterFreqs(i) = floor(1.5 + (512/4000)*700.0*(10^((1.0/2595.0)*(i-1)*melmax)-1.0));
end
x = filter([1 -0.97], 1, x);       % pre-emphasis
x = x.*hanning(length(x));         % windowing
fx = abs(fft(x,1024));             % magnitude spectrum
first = 0;
last = fbank;
% log-compressed triangular filter-bank energies
for i = first+1:last
    m(i-first) = 0.0;
    len = m_CenterFreqs(i+2)-m_CenterFreqs(i)+1;
    wgt = triang(len)/sum(triang(len));
    m(i-first) = log(max(1,sum(fx(m_CenterFreqs(i):m_CenterFreqs(i+2)).*wgt)));
end
m = m(1:fbank-1);
fbank = length(m);
% DCT of the log filter-bank energies
for i = 1:mel
    m_pFeature(i) = 0.0;
    for j = 1:fbank
        m_pFeature(i) = m_pFeature(i) + m(j)*cos((((i-1)*pi)/fbank)*(j-0.5));
    end
    m_pFeature(i) = m_DimNorm*m_pFeature(i);
end
m_pFeature(1) = 0.1*m_pFeature(1); % de-weight the first (energy) coefficient
m_pFeature = m_pFeature(:);
m_pFeature = m_pFeature(1:mel);
return;
APPENDIX C: C FUNCTIONS FOR GMM TRAINING AND NNMS
// TRAINMODEL -----------------------------------------------------
// INPUT : script file of parametrized data used for training.
// OUTPUT: model file with mean, variance and weight values.
void CTrain::TrainModel(int numMixes)
{
    FILE *fid, *finput, *fout;
    char inputfilename[FNLEN];
    int argmax;
    int frmNo = 0;

    fid = FRead(scriptfile);
    printf("training started..\n");
    SetNumMixes(numMixes);
    weights = new float[numMixes];
    means = new float[numMixes][NUMMEL];
    vars = new float[numMixes][NUMMEL];
    memset(weights, 0, numMixes*sizeof(float));
    memset(means, 0, numMixes*NUMMEL*sizeof(float));
    memset(vars, 0, numMixes*NUMMEL*sizeof(float));
    /* initialization */
    while (!feof(fid)) {