Cairo University
Computer Engineering Department
Giza, 12613 EGYPT
Speech Lab
Graduation Project Report
Submitted by
Amr M. Medhat
Sameh M. Serag
Mostafa F. Mahmoud
In partial fulfillment of the B.Sc. Degree in Computer Engineering
Supervised by Dr. Nevin M. Darwish
July 2004
Speech has long been viewed as the future of computer interfaces, promising
significant improvements in ease of use and enabling the rise of a variety of speech-
recognition-based applications. With the recent advances in speech recognition
technology, computer-assisted pronunciation teaching (CAPT) has emerged as a
tempting means of supplementing or replacing direct student-teacher interaction.
Speech Lab is an Arabic pronunciation teaching system for teaching some of the Holy
Qur'an recitation rules. The objective is to detect the learner's pronunciation errors and
provide diagnostic feedback. The heart of the system is a phone-level HMM-based
speech recognizer. The idea of comparing the learner's pronunciation with the teacher's
correct one is based on identifying the phone insertions, deletions or substitutions
resulting from the recognition of the learner's speech. In this work we focus on some
of the recitation rules targeting pronunciation problems of Egyptian learners.
First and foremost, we would like to thank Dr. Salah Hamid from The Engineering
Company for the Development of Computer Systems (RDI) for his generous and
enthusiastic guidance. Without his insightful and constructive advice and support, this
project would not have been achieved. We are deeply grateful to him and also to
Waleed Nazeeh and Badr Mahmoud for their helpful support.
In this project we made use of a series of lessons for teaching the Holy Qur'an
recitation rules by Sheikh Ahmed Amer; we are so grateful to him for these wonderful
lessons. Besides using their content, the lessons were of great help in shaping the
methodology we followed in the project.
We are also grateful to all our friends who helped us by recording the data to build the
speaker-independent database. They were really cooperative and helpful. And special
thanks to the artists Mohammed Abdul-Mon'em, Mahmoud Emam and Mohammed
Nour for their wonderful work that added beauty and elegance to our project.
Special thanks must go also to Dr. Goh Kawai for providing us with his valuable
paper on pronunciation teaching.
Finally, we would like to thank our supervisor Dr. Nevin Darwish, our parents, and all
those who supported us.
2.1.1 The Need for Automatic Pronunciation Teaching
During the past two decades, the exercise of spoken language skills has received
increasing attention among educators. Foreign language curricula focus on productive
skills with special emphasis on communicative competence. Students' ability to
engage in meaningful conversational interaction in the target language is considered
an important, if not the most important, goal of second language education.
According to Eskenazi, the use of an automatic recognition system to help a user
improve his accent and pronunciation is appealing for at least two reasons: first, it
affords the user more practice time than a human teacher can provide, and second, the
user is not faced with the sometimes overwhelming problem of human judgment of
his production of “foreign” sounds [1]. To appreciate the value of such a system, it
helps to recognize the specific difficulties encountered in pronunciation teaching:
• Explicit pronunciation teaching requires the sole attention of the teacher to a
single student; this poses a problem in a classroom environment.
• Learning pronunciation can involve a large amount of monotonous repetition,
thus requiring a lot of patience and time from the teacher.
• With pronunciation being a psycho-motoric action, it is not only a mental task but also demands coordination and control over many muscles. Given the
social implications of the act of speaking it can also mean that students are
afraid to perform in the presence of others.
• In language tests the oral component is costly, time consuming and subjective;
therefore an automatic method of pronunciation assessment is highly
desirable.
Additionally, all arguments for the usefulness of CALL systems apply here as well,
such as being available at all times and being cheaper.
All these reasons indicate that computer-based pronunciation teaching is not only
desirable for self-study products but also for products which would complement the
teaching aids available to a language teacher [2].
2.1.2 Components of Pronunciation to Address
The accuracy of pronunciation is determined by both segmental and supra-segmental
features.
The segmental features are concerned with the distinguishable sound units of speech,
i.e. phonemes. A phoneme is also defined as "the smallest unit which can make a
difference in meaning". The set of phonemes of one language can be classified into
broad phonetic subclasses; for example, the most general classification, as we will see
in section 2.3, would be to separate vowels and consonants. Each language is
characterized by its distinctive set of phonemes. When learning a new language,
foreign students can divide the phonemes of the target language into two groups. The
first group contains those phonemes which are similar to ones in their source
language. The second group contains those phonemes which do not exist in the source
language [6].
Teaching the pronunciation of segmental or phonetic features includes teaching the
correct pronunciation of phonemes and the co-articulation of phonemes into higher
phonological units, i.e., teaching the pronunciation of phonemes first in isolation and
then in context with other phonemes within words or sentences.
The supra-segmental features of speech are the prosodic aspects, which comprise
intonation, pitch, rhythm and stress. Teaching the pronunciation of prosodic features
includes teaching the following [3]:
• the correct position of stress at word level;
• the alternation of stressed and unstressed syllables, compensation and vowel
reduction;
• the correct position of sentence accent;
• the generation of adequate rhythm from stress, accent, and phonological rules;
• the generation of an adequate intonational pattern for the utterance, related to its
communicative function.
For beginners, phonetic characteristics are of greater importance because these cause
most mispronunciations. With increasing fluency, more emphasis should be placed on
teaching prosody. The focus here, however, will be on teaching phonetics, since
teaching prosody usually requires a different teaching approach.
2.1.3 Previous Work
Over the last decade several research groups have started to develop interactive
language teaching systems incorporating pronunciation teaching based on speech
recognition techniques. The SPELL project [Hiller, 1993]
concentrated on teaching pronunciation of individual words or short phrases plus
additional exercises for intonation, stress and rhythm. However, this system
concentrated on one sound at a time, for instance the pair "thin-tin" is used to train the
'th' sound, but it did not check whether the remaining phonemes in the word were
pronounced correctly.
Another early approach based on dynamic programming and vector quantization by
[Hamada, 1993] is likewise limited to word-level comparisons between recordings of
native and non-native utterances of a word. Therefore, their system required new
recordings of native speech for each new word used in the teaching system. This
system is called a text-dependent system in contrast to a text-independent one, where
the teaching material can be adjusted without additional recordings.
The systems described by [Bernstein, 1990] and [Neumeyer, 1996] were capable of
scoring complete sentences but not smaller units of speech.
The system used by [Rogers, 1994] was originally designed to improve the speech
intelligibility of hearing-impaired people; it was also a text-dependent system.
The system described by [Eskenazi 1996] was also text dependent and compared the
log-likelihood scores produced by a speaker independent recognizer of native and
non-native speech for a given sentence [2].
The European funded project ISLE [1998] is another example that aims to develop a
system that improves the English pronunciation of Italian and German native
speakers.
There is also the LISTEN project which is an inter-disciplinary research project at
Carnegie Mellon University to develop a novel tool to improve literacy - an
automated Reading Tutor that displays stories on a computer screen, and listens to
children read aloud.
Besides all these systems, work has also been done to build tools that support
research in pronunciation assessment. EduSpeak [2000] by SRI International is an
example. It is a speech recognition toolkit that consists of a speech recognition
module and native and non-native acoustic models for adults and children. It also
includes scoring algorithms that make use of spectral matching and the duration of sounds.
2.1.4 Computer as a Teacher
Success of an automatic pronunciation training system depends on how well it acts
as a human teacher in a classroom. The following are some issues to be considered for
a CAPT system to be able to assist or even replace teachers:
1. Evaluation
In pronunciation exercises there exists no clearly right or wrong answer. A large
number of different factors contribute to the overall pronunciation quality and these
are also difficult to measure. Hence, the transition from poor to good pronunciation is
a gradual one, and any assessment must also be presented on a graduated scale using a
scoring technique [2].
2. Integration into a complete educational system
For practical applications, any scoring method will have to be embedded within an
interactive language teaching system containing modules for error analysis,
pronunciation lessons, feedback and assessment. These modules can take results from
the core algorithm to give the student detailed feedback about the type of errors which
occurred, using both visual and audio information. For instance, in those cases where
a phoneme gets rejected because of a too poor score, the results of the phoneme loop
indicate what has actually been recognized. This information can then be used for
error correction [2]. [Hiller 1996] presented a useful paradigm for a CALL
pronunciation teaching system called DELTA consisting of the four stages of
learning:
• Demonstrate the lesson audibly.
• Evaluate the student's listening ability with small tests.
The function of a perfect CAPT system is not just to tell the user blindly "well done" or
"wrong, repeat again!"; it should be more intelligent, like an actual teacher.
In natural conversations, a listener may interrupt the talker to provide a correction or
simply point out the error. But the talker might not understand this message and may
ask the listener for clarification. So, a correctly formed message usually results from an
ensuing dialogue in which meaning is negotiated.
Ideally teachers point out incorrect pronunciation at the right time, and refrain from
intervening too often in order to avoid discouraging the student from speaking. They
also intervene soon enough to prevent errors from being repeated several times and
from becoming hard-to-break habits [5].
So, a perfect system that acts as a real teacher should consider the following [4, 5]:
• Addressing the error precisely, so the part of the word that was mispronounced
should be precisely located within the word.
• The addressed error should be used to modify the native utterance so that the
mispronounced component is emphasized by being louder, longer and possibly
with higher pitch. The student then says the word again and the system
repeats.
• Correcting only when necessary, reinforcing good pronunciation, and avoiding
negative feedback to increase student's confidence.
• The pace of correction, that is, the maximum amount of interruptions per unit
of time that is tolerable, should be adapted to fit each student's personality;
since adaptive feedback is important to obtain better results from correction
and to avoid discouraging the student.
2.1.5 Components of an ASR-based Pronunciation Teaching System
The ideal ASR-based CAPT system can be described as a sequence of five phases, the
first four of which strictly concern ASR components that are not visible to the user,
while the fifth has to do with broader design and graphical user interface issues [7].
1. Speech recognition
The ASR engine translates the incoming speech signal into a sequence of words on
the basis of internal phonetic and syntactic models. This is the first and most
important phase, as the subsequent phases depend on the accuracy of this one. It is
worth mentioning that a speaker-dependent system is more appropriate in teaching
foreign language pronunciation [2]. Details of this phase will be presented later.
2. Scoring
This phase makes it possible to provide a first, global evaluation of pronunciation
quality in the form of a score. The ASR system analyzes the spoken utterance that has
been previously recognized. The analysis can be done on the basis of a comparison
between temporal properties (e.g. rate of speech) and/or acoustic properties of the
student’s utterance on one side, and natives’ reference properties on the other side; the
closer the student’s utterance comes to the native models used as reference, the higher
the score.
An isolated-word speech recognition system requires that the speaker pause briefly
between words, whereas a continuous speech recognition system does not.
Spontaneous, or extemporaneously generated, speech contains disfluencies, and it is
much more difficult to recognize than speech read from script. Some systems require
speaker enrollment where a user must provide samples of his or her speech before
using them, whereas other systems are said to be speaker-independent, in that no
enrollment is necessary. Some of the other parameters depend on the specific task.
Perplexity indicates the language’s branching power, with low-perplexity tasks
generally having a lower word error rate. Recognition is generally more difficult
when vocabularies are large or have many similar-sounding words. Finally, there are
some external parameters that can affect speech recognition system performance,
including the characteristics of the background noise (Signal to Noise Ratio) and the
type and the placement of the microphone [8].
2.2.2 Speech Recognition System Architecture
The process of speech recognition starts with a sampled speech signal. This signal has
a good deal of redundancy because the physical constraints on the articulators that
produce speech - the glottis, tongue, lips, and so on - prevent them from moving
quickly. Consequently, the ASR system can compress information by extracting a
sequence of acoustic feature vectors from the signal. Typically, the system extracts a
single multidimensional feature vector every 10 ms that consists of 39 parameters.
Researchers refer to these feature vectors, which contain information about the local
frequency content in the speech signal, as acoustic observations because they
represent the quantities the ASR system actually observes. The system seeks to infer
the spoken word sequence that could have produced the observed acoustic sequence [9].
It is assumed that the ASR system knows the speaker’s vocabulary in advance. This
restricts the search for possible word sequences to words listed in the lexicon, which
lists the vocabulary and provides the phonemes for the pronunciation of each word.
Language constraints are also used to indicate which word sequences are more
likely to occur [9]. Training data are used to determine the values of the language and
phone model parameters.
The dominant recognition paradigm is known as hidden Markov models (HMM). An
HMM is a doubly stochastic model, in which the generation of the underlying
phoneme string and the frame-by-frame, surface acoustic realizations are both
represented probabilistically as Markov processes. Neural networks have also been
used to estimate the frame-based scores; these scores are then integrated into HMM-based
system architectures, in what has come to be known as hybrid systems or hybrid HMMs [8].
An interesting feature of frame-based HMM systems is that speech segments are
identified during the search process rather than explicitly beforehand. An alternate approach is to
first identify speech segments, then classify the segments and use the segment scores
to recognize words. This approach has produced competitive recognition performance
in several tasks [8]. Our system will be an HMM-based one.
The speech recognition process as a whole can be seen as a system of five basic
components as in figure [2] below: (1) an acoustic signal analyzer which computes a
spectral representation of the incoming speech; (2) a set of phone models (HMMs)
trained on large amounts of actual speech data; (3) a lexicon for converting sub-word
phone sequences into words; (4) a statistical language model or grammar network that
defines the recognition task in terms of legitimate word combinations at the sentence
level; (5) a decoder, which is a search algorithm for computing the best match
between a spoken utterance and its corresponding word string [10].
Figure [2]: Components of a typical speech recognition system.
1. Signal Analysis
The first step which will be presented in detail later, consists of analyzing the
incoming speech signal. When a person speaks into an ASR device (usually through
a high-quality noise-canceling microphone), the computer samples the analog input
into a series of 8- or 16-bit values at a particular sampling frequency (usually 16 kHz).
These values are grouped together in predetermined overlapping temporal intervals
called "frames". These numbers provide a precise description of the speech signal's
amplitude. In a second step, a number of acoustically relevant parameters such as
energy, spectral features, and pitch information, are extracted from the speech signal.
During training, this information is used to model that particular portion of the speech
signal. During recognition, this information is matched against the pre-existing model
of the signal [10].
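For illustration, the following sketch groups a sampled 16 kHz signal into overlapping frames. The 10 ms shift follows the text; the 25 ms window length is an assumption made only for this example.

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a sampled speech signal into overlapping frames.

    frame_ms is an illustrative window length; the report only states
    that frames overlap and that a feature vector is produced every 10 ms.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    shift = int(sample_rate * shift_ms / 1000)       # samples between frame starts
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append(samples[start:start + frame_len])
    return np.array(frames)

# Example: one second of (silent) 16-bit audio gives 98 frames of 400 samples.
signal = np.zeros(16000, dtype=np.int16)
print(frame_signal(signal).shape)
```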
2. Phone Models
The second module is responsible for training the machine to recognize spoken
language by modeling the basic sounds of speech (phones). An HMM can
model either phones or other sub-word units, or it can model words or even whole
sentences. Phones are either modeled as individual sounds, so-called monophones, or
as phone combinations that model several phones and the transitions between them
(biphones or triphones). After comparing the incoming acoustic signal with the
HMMs representing the sounds of language, the system computes a hypothesis based
on the sequence of models that most closely resembles the incoming signal. The
HMM model for each linguistic unit (phone or word) contains a probabilistic
representation of all the possible pronunciations for that unit.
Building HMMs in the training process requires a large amount of speech data.
3. The Lexicon
The lexicon, or dictionary, contains the phonetic spelling for all the words that are
expected to be observed by the recognizer. It serves as a reference for converting the
phone sequence determined by the search algorithm into a word. It must be carefully
designed to cover the entire lexical domain in which the system is expected to perform. If the recognizer encounters a word it does not "know" (i.e., a word not
defined in the lexicon), it will either choose the closest match or return an out-of-
vocabulary recognition error. Whether a recognition error is registered as
misrecognition or an out-of-vocabulary error depends in part on the vocabulary size.
If, for example, the vocabulary is too small for an unrestricted dictation task (let's say
less than 3K) the out-of-vocabulary errors are likely to be very high. If the vocabulary
is too large, the chance of misrecognition errors increases because with more similar-
sounding words, the confusability increases. The vocabulary size in most commercial
dictation systems tends to vary between 5K and 60K [10].
4. The Language Model
The language model predicts the most likely continuation of an utterance on the basis
of statistical information about the frequency in which word sequences occur on
average in the language to be recognized. For example, the word sequence "A bare
attacked him" will have a very low probability in any language model based on
standard English usage, whereas the sequence "A bear attacked him" will have a
higher probability of occurring. Thus the language model helps constrain the
recognition hypothesis produced on the basis of the acoustic decoding just as the
context helps decipher an unintelligible word in a handwritten note. Like the HMMs,
an efficient language model must be trained on large amounts of data, in this case
texts collected from the target domain.
In ASR applications with constrained lexical domain and/or simple task definition, the
language model consists of a grammatical network that defines the possible word
sequences to be accepted by the system without providing any statistical information.
This type of design is suitable for pronunciation teaching applications in which the
possible word combinations and phrases are known in advance and can be easily
anticipated (e.g., based on user data collected with a system pre-prototype). Because
of the a priori constraining function of a grammar network, applications with clearly
defined task grammars tend to perform at much higher accuracy rates than the quality
of the acoustic recognition would suggest [10].
5. Decoder
The decoder is an algorithm that tries to find the utterance that maximizes the
probability that a given sequence of speech sounds corresponds to that utterance. This
is a search problem, and especially in large vocabulary systems careful consideration
must be given to questions of efficiency and optimization, for example to whether the
decoder should pursue only the most likely hypothesis or a number of them in parallel
(Young, 1996). An exhaustive search of all possible completions of an utterance
might ultimately be more accurate but of questionable value if one has to wait two
days to get a result. Therefore there are some trade-offs to maximize the search results
while at the same time minimizing the amount of CPU and recognition time.
2.3 Phonetics and Phonology
Phonetics studies all the sounds of speech, trying to describe how they are made, to
classify them and to give some idea of their nature. Phonetic investigation shows that
human beings are capable of producing an enormous number of speech sounds,
because the range of articulatory possibilities is vast, although each language uses
only some of the sounds that are available [11]. Even more importantly, each language
organizes and makes use of the sounds in its own particular way.
The study of the selection that each language makes from the vast range of possible
speech sounds and of how each language organizes and uses the selection it makes is
called Phonology. In other words, Phonetics describes and classifies the speech
sounds and their nature while Phonology studies how they work together and how
they are used in a certain language where differences among sounds serve to indicate
distinctions of meaning [11].
Obviously, not all the differences between speech sounds are significant, and not only
this, the difference between two speech sounds can be significant in one language but
not in another one. A list of sounds whose differences from one another are
significant can be built up by making a comparison between words of the same
language. These significant or distinctive sounds are the elements of the sound system
and are known as phonemes, whereas the different sounds that do not make any
difference are known as allophones [11].
In Arabic, there are 37 distinct phonemes [12], but when it comes to the Holy Qur'an,
the nature of its rules requires defining a new set of phonemes where distinguishing
between the correct and wrong way of reading in some rules cannot be detected using
only the standard set of Arabic phonemes. Below is the set of phonemes we defined
for some – not all – of the Qur'an phonemes that we needed for the rules we teach in our system.
These classifications will be useful later when grouping phonemes with similar
properties while building the recognizer.
2.4 Speech Signal Processing
2.4.1 Feature Extraction
Once a signal has been sampled, we have huge amounts of data, often 16,000 16 bit
numbers a second! We need to find ways to concisely capture the properties of the
signal that are important for speech recognition before we can do much else. Probably
the most important parametric representation of speech is the spectral representation
of the signal, as seen in a spectrogram1 which contains much of the information we
need. We can obtain the spectral information from a segment of the speech signal
using an algorithm called the Fast Fourier Transform. But even a spectrogram is far too complex a representation to base a speech recognizer on. This section describes
some methods for characterizing the spectra in more concise terms [15].
Filter Banks: One way to more concisely characterize the signal is by a filter bank.
We divide the frequency range of interest (say 100-8000Hz) into N bands and
measure the overall intensity in each band. This can be computed using spectral
analysis such as the Fast Fourier Transform. In a uniform filter bank, each
frequency band is of equal size. For instance, if we used 8 ranges, the bands might
cover the frequency ranges: 100Hz-1000Hz, 1000Hz-2000Hz, 2000Hz-3000Hz, ...,
7000Hz-8000Hz.
But, is it a good representation? We’d need to compare the representations of different
vowels, for example, and see whether the vector reflects differences in these vowels or
not. If we do this, we’ll see there are some problems with a uniform filter bank. So, a
better alternative is to organize the ranges using a logarithmic scale. Another
alternative is to design a non-uniform set of frequency bands that has no simple
mathematical characterization but better reflects the responses of the ear as
determined from experimentation. One very common design is based on perceptual
studies to define critical bands in the spectra. A commonly used critical band scale is
called the Mel scale which is essentially linear up to 1000 Hz and logarithmic after
that. For instance, we might start the ranges at 200 Hz, 400 Hz, 630 Hz, 920 Hz, 1270
Hz, 1720 Hz, 2320 Hz, and 3200 Hz.
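As an illustration of this idea, one commonly used mel-scale formula (not quoted in the text, but widely found in the literature) maps a frequency f to 2595·log10(1 + f/700); placing band edges at equal mel intervals then yields bands that are narrow at low frequencies and wide at high frequencies, the behaviour described above. The following sketch shows this under those assumptions:

```python
import math

def hz_to_mel(f_hz):
    # One common mel-scale formula: roughly linear below 1000 Hz,
    # logarithmic above it.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_low, f_high, n_bands):
    """Band edges equally spaced on the mel scale between f_low and f_high."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    return [mel_to_hz(lo + k * (hi - lo) / n_bands) for k in range(n_bands + 1)]

# Eight bands between 200 Hz and 8000 Hz: the edges are close together at low
# frequencies and spread out at high frequencies.
print([round(f) for f in mel_band_edges(200, 8000, 8)])
```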
LPC: A different method of encoding a speech signal is called Linear Predictive
Coding (LPC). The basic idea of LPC is to represent the value of the signal over some
window at time t, s(t), in terms of an equation of the past n samples, i.e., s(t) is
approximated as a weighted sum a1·s(t-1) + a2·s(t-2) + ... + an·s(t-n), where the
coefficients ai are chosen to minimize the prediction error over the window.
¹ A spectrogram is an image that represents the time-varying spectrum of a signal. The x-axis
represents time, the y-axis frequency, and the pixel intensity represents the amount of energy in
frequency band y at time x.
5717, 6644, and 7595). The MFCCs are then computed using the following formula:
ci = Σ j=1..N  mj · cos( i·π·(j − 0.5) / N )
where N is the number of filter banks, mj is the output of the j-th filter bank, and i
runs up to the desired number of coefficients. What this is doing is computing a
weighted sum over the filter banks based on a cosine curve. The first coefficient, c0, is
simply the sum of all the filter banks, since i = 0 makes the argument to the cosine
function 0 throughout, and cos(0)=1. In essence it is an estimate of the overall
intensity of the spectrum weighting all frequencies equally. The coefficient c1 uses a
weighting that is one half of a cosine cycle, so computes a value that compares the
low frequencies to the high frequencies. The function for c2 is one cycle of the cosine
function, while for c3 it is one and a half cycles, and so on.
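The following sketch implements the cosine-weighted sum just described; the filter-bank values and the number of coefficients are purely illustrative.

```python
import math

def cepstral_coefficients(fbank, n_coeffs):
    """Cosine-weighted sums over filter-bank outputs, as described in the text.

    fbank: (log) filter-bank outputs m_1 .. m_N.
    c_0 is simply the sum of all banks (cos(0) = 1 everywhere); higher
    coefficients weight the banks with half a cosine cycle, one cycle, etc.
    """
    N = len(fbank)
    coeffs = []
    for i in range(n_coeffs):
        c_i = sum(m * math.cos(math.pi * i * (j + 0.5) / N)
                  for j, m in enumerate(fbank))
        coeffs.append(c_i)
    return coeffs

# A toy 8-band filter-bank vector reduced to 4 coefficients.
print([round(c, 2) for c in cepstral_coefficients([5, 6, 7, 8, 7, 6, 5, 4], 4)])
```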
2.4.2 Building Effective Vector Representations of Speech
Whether we use the filter bank approach, the LPC approach or any other approach, we
end up with a small set of numbers that characterize the signal. For instance, if we
used the Mel-scale filter bank, dividing the spectrum into eight frequency ranges, we have reduced
the representation of the signal over the 20 ms segment to a vector consisting of eight
numbers. With a 10 ms shift in each segment, we are representing the signal by one of
these vectors every 10 ms. This is certainly a dramatic reduction in the space needed
to represent the signal. Rather than 16,000 numbers per second, we now represent the
signal by about 800 numbers a second!
Just using the eight spectral measures, however, is not sufficient for large-vocabulary
speech recognition tasks. Additional measurements are often taken that capture
aspects of the signal not adequately represented in the spectrum. Here are a few
additional measurements that are often used:
Power: It is a measure of the overall intensity. If the segment Sk contains N samples
of the signal, s(0), ..., s(N-1), then the power power(Sk) is computed as follows:
power(Sk) = Σ i=0..N-1  s(i)^2
An alternative that doesn’t create such a wide difference between loud and soft sounds
is to use the logarithm of the power.
One problem with direct power measurements is that the representation is very
sensitive to how loud the speaker is speaking. To adjust for this, the power can be
normalized by an estimate of the maximum power. For instance, if P is the maximum
power within the last 2 seconds, the normalized power of the new segment would be
power(Sk )/P. The power is an excellent indicator of the voiced/unvoiced distinction,
and if the signal is especially noise-free, can be used to separate silence from
low-intensity speech such as unvoiced fricatives. But we do not need it when using MFCCs,
since the power is estimated well by the c0 coefficient.
Power Difference: The spectral representation captures the static aspects of a signal
over the segment, but we have seen that there is much information in the transitions in
speech. One way to capture some of this is to add a measure to each segment that
reflects the change in power surrounding it. For instance, we could set:
PowerDiff(Sk) = power(Sk+1) − power(Sk-1)
Such a measure would be very useful for detecting stops.
Spectral Shifts: Besides shifts in overall intensity, we saw that frequency shifts in the
formants can be quite distinctive, especially in looking at the effects of consonants
next to vowels. We can capture some of this information by looking at the difference
in the spectral measures in each frequency band. For instance, if we have eight
frequency intensity measures for segment Sk, fk(1), ..., fk(8), then we can define the
spectral change for each segment as with the power difference, i.e.,
dfk(i) = fk-1(i) − fk+1(i)
With all these measurements, we would end up with an 18-number vector: the eight
spectral band measures, eight spectral band differences, the overall power and the
power difference. This is a reasonable approximation of the types of representations used in
current state-of-the-art speech recognition systems. Some systems add another set of
values that represent the “acceleration”, and would be computed by calculating the
differences between the df k values.
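Putting the pieces together, a per-segment vector of the kind described above could be assembled as in the following sketch; the filter-bank analysis that produces the band intensities is assumed to be available, and the names are illustrative only.

```python
def power(frame):
    # Overall intensity: sum of squared sample values in the frame.
    return sum(s * s for s in frame)

def feature_vector(bands, frames, k):
    """Build the 18-number vector described above for segment k:
    eight spectral band measures, eight band differences, the overall
    power and the power difference. bands[k] is assumed to hold the
    eight filter-bank intensities of frame k."""
    band_diff = [bands[k - 1][i] - bands[k + 1][i] for i in range(len(bands[k]))]
    power_diff = power(frames[k + 1]) - power(frames[k - 1])
    return list(bands[k]) + band_diff + [power(frames[k]), power_diff]
```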
2.5 HMM
2.5.1 Introduction
A hidden Markov model (HMM) is a stochastic generative process that is particularly
well suited to modeling time-varying patterns such as speech. HMMs represent
speech as a sequence of observation vectors derived from a probabilistic function of a
first-order Markov chain. Model ‘states’ are identified with an output probability
distribution that describes pronunciation variations, and states are connected by
probabilistic ‘transitions’ that capture durational structure. An HMM can thus be
used as a ‘maximum likelihood classifier’ to compute the probability of a sequence of
words given a sequence of acoustic observations using Viterbi search. The basics of
HMM will be discussed in the following sub-subsections; more information can be
found in the references.
2.5.2 Markov Model
In order to understand the HMM, we must first look at a Markov model and a
stochastic process in general. A stochastic process specifies certain probabilities of
some events and the relations between the probabilities of the events in the same
process at different times. A process is called Markovian if the probability at one time
is only conditioned on a finite history. Therefore, a Markov model is defined as a
finite state machine which changes state once every time unit. State is a concept used
to help understand the time evolution of a Markov process. Being in a certain state at
a certain time is then the basic event in a Markov process. A whole Markov process
thus produces a sequence of states S= s1, s2 … sT.
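As a toy illustration of this property (the transition matrix below is invented for the example), a state sequence can be sampled as follows:

```python
import random

def sample_state_sequence(A, start_state, length):
    """Generate a state sequence from a first-order Markov chain.

    A[i][j] is the probability of moving from state i to state j;
    the next state depends only on the current one (the Markov property)."""
    states = [start_state]
    for _ in range(length - 1):
        current = states[-1]
        states.append(random.choices(range(len(A)), weights=A[current])[0])
    return states

# A toy 3-state chain that tends to stay in its current state.
A = [[0.8, 0.2, 0.0],
     [0.0, 0.8, 0.2],
     [0.1, 0.0, 0.9]]
print(sample_state_sequence(A, 0, 10))
```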
2.5.3 Hidden Markov Model
The HMM is an extension of a Markov process. A hidden Markov model can be
viewed as a Markov chain, where each state generates a set of observations. You only
see the observations, and the goal is to infer the hidden state sequence. For example,
the hidden states may represent words or phonemes, and the observations represent
the acoustic signal. Figure [3] shows an example of such a process where the six state
model moves through the state sequence S = 1, 2, 2, 3, 4, 4, 5, 6 in order to generate
the sequence o1 to o6.
Figure [3] The Markov Generation Model
Each time t that a state j is entered, a speech vector ot is generated from the
probability density b j(ot ). Furthermore, the transition from state i to state j is also
probabilistic and is governed by the discrete probability aij .
Thus, we can see that the stochastic process of an HMM is characterized by two sets of
probabilities.
The first set is the transition probabilities, defined as:
aij = P( s(t+1) = j | s(t) = i )
This can also be written in matrix form as A = {aij}. For the Markov process itself,
when the previous state is known, there is a certain probability of transiting to each of the
other states.
The second set is the observation probabilities. The speech signal is converted into a
time sequence of observation vectors ot defined in an acoustic space. The sequence of
vectors is called an observation sequence O = o1, o2, ..., oT, with each ot a static
representation of speech at time t. The observation probability is defined as:
bj(ot) = P( ot | s(t) = j )
with its matrix form B = {bj}.
The composition of the parameters M = (A, B) defines an HMM. (In the HMM
literature there is another set of parameters, the probabilities that the HMM starts in
each state at the initial time, Π = {πj}.) The model then becomes λ = (A, B, Π),
depending on three sets of parameters. However, for cases like ours where the HMMs
always start at the first state (π1 = 1), Π can be included in A.
2.5.4 Speech recognition with HMM
The basic way of using HMM in speech recognition is to model different well defined
phonetic units wl (e.g., words or sub-word units or phonemes) in an inventory { wl }
for the recognition task, with a set of HMMs (each with parameter λl ). To recognize a
word wk from an unknown O is basically to find:
wk = argmax over wl of P( wl | O )
The probability P is usually calculated indirectly using Bayes' rule:
P( wl | O ) = P( O | wl ) · P( wl ) / P( O )
Here P(O) is constant for a given O over all possible wl. The a priori probability P(wl)
only concerns the language model of the given task, which we assume here to be
constant too. Then the problem of recognition is converted to the calculation of P(O | wl).
But we use λl to model wl, therefore we actually need to calculate P(O | λl).
We can see that the joint probability of O and S being generated by the model λ can
be calculated as follows:
P( O, S | λ ) = P( S | λ ) · P( O | S, λ )
where the transitions occurring at different times and in different states are
independent, and therefore:
P( S | λ ) = a s(0)s(1) · a s(1)s(2) · ... · a s(T-1)s(T)
and also, for a given state sequence S, the observation probability is:
P( O | S, λ ) = b s(1)(o1) · b s(2)(o2) · ... · b s(T)(oT)
However, in reality the state sequence S is unknown. Then one has to sum the
probability P(O, S | λ) over all S in order to get P(O | λ).
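For illustration, the joint probability of one particular state sequence can be computed as a simple product, as in the following sketch; the entry-state convention of figure [3] is assumed, and the emission function b is only a placeholder for the output densities.

```python
def joint_probability(A, b, state_seq, observations):
    """P(O, S | lambda) for one particular state sequence S.

    A[i][j]  : transition probability from state i to state j.
    b(j, o)  : observation probability (density) of vector o in state j.
    state_seq: s_0, s_1, ..., s_T with s_0 the non-emitting entry state,
               so s_t emits observations[t - 1] for t >= 1.
    """
    p = 1.0
    for t in range(1, len(state_seq)):
        i, j = state_seq[t - 1], state_seq[t]
        p *= A[i][j] * b(j, observations[t - 1])
    return p
```

Summing this quantity over every possible state sequence would give P(O | λ), which is exactly the impractical computation discussed next.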
In order to use HMM in ASR, a number of practical problems have to be solved.
1. The evaluation problem: One has to evaluate the value P(O | λ) given only O
and λ, but not S. Without an efficient algorithm, one has to sum over n^T possible
state sequences S, requiring on the order of 2T·n^T calculations, which is impractical.
2. The estimation problem: The values of all λl in a system have to be determined
from a set of sample data. This is called training . The problem is how to get an
optimal set of λl that leads to the best recognition result, given a training set.
3. The decoding problem: Given a set of well-trained λl and an O with an unknown
identity, one has to find P(O | λl) for all λl. In the recognition process, for each
single λl, one hopes, instead of summing over all S, to find a single SM that is most
likely associated with O. SM also provides the information about the boundaries between
the concatenated phonetic or linguistic units that are most likely associated with
O. The term decoding refers to finding the way that O is coded onto S. In both the
training and recognition processes of a recognition system, problem 1 is involved.
2.5.6 Two important algorithms
The two important algorithms that solve the essential problems are both named after
their inventors: the Baum-Welch algorithm (Baum et al., 1970) for parameter
estimation in training, and the Viterbi algorithm for decoding in recognition (in some
recognizers the Viterbi algorithm is also used for training).
The essential part of the Baum-Welch algorithm is a so-called expectation-
maximization (EM) procedure, used to overcome the difficulty of incomplete
information about the training data (the unknown state sequence). In the most
commonly used implementation of the EM procedure for speech recognition, a
maximum-likelihood (ML) criterion is used. The solutions of the ML equations give
the closed-form formulae for updating HMM parameters given their old values. In
order to obtain good parameters, a good initial set of parameters is essential, since the
Baum-Welch algorithm only gives a solution for a local optimum. However, for
speech recognition, such a solution often leads to sufficiently good performance.
The basic shortcoming of the ML training is that maximizing the likelihood that the
model parameters generate the training observations is not directly related to the
actual goal of reducing the recognition error, which is to maximize the discrimination
between the classes of patterns in speech.
The Viterbi algorithm essentially avoids searching through an unmanageably large
space of HMM state sequences to find the most likely sequence SM by using step-wise
optimal transitions. In most cases, the state sequence SM yields satisfactory results for
recognition. But in other cases, SM does not give rise to the state sequence corresponding
to the correct words.
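For illustration, a minimal sketch of the dynamic-programming recursion behind the Viterbi algorithm is given below for a small model; it is not HTK's implementation, and the start-state convention and the emission function b are assumptions of the example.

```python
import math

def viterbi(A, b, observations):
    """Step-wise optimal search for the most likely state sequence S_M.

    A[i][j] is the transition probability from state i to state j and
    b(j, o) the observation probability of o in state j. Log probabilities
    avoid underflow; state 0 is assumed to be the start state.
    """
    n = len(A)
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    # delta[j] = best log probability of any path ending in state j so far
    delta = [log(0)] * n
    delta[0] = log(b(0, observations[0]))
    back = []                                   # back-pointers per time step
    for o in observations[1:]:
        prev, delta, ptr = delta, [log(0)] * n, [0] * n
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            delta[j] = prev[best_i] + log(A[best_i][j]) + log(b(j, o))
            ptr[j] = best_i
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))
```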
2.6 HTK
One of the best-known tools for speech recognition research is the HMM Toolkit,
abbreviated as HTK. It is a well-known and free toolkit for use in research into
automatic speech recognition and other pattern recognition systems such as
handwriting recognition and facial recognition. It has been developed by the Speech,
Vision and Robotics Group at the Cambridge University Engineering Department and
Entropic Ltd [18].
The toolkit consists of a set of modules for building Hidden Markov Models (HMMs)
which can be called from both command line and script file(s). The following are their
main functions:
1. Receiving audio input from the user.
2. Coding the audio files.
3. Building the grammar and dictionary for the application.
4. Attaching the recorded utterances to their corresponding transcriptions.
5. Building the HMMs.
6. Adjusting the parameters of the HMMs using the training sets.
7. Recognizing the user's speech using the Viterbi algorithm.
8. Comparing the testing speech patterns with the reference speech patterns.
In the actual processing, HTK first parameterizes the speech data into features of
various forms such as Linear Predictive Coding (LPC) coefficients and the Mel-cepstrum.
Then it estimates the HMM parameters using the Baum-Welch algorithm for training.
Recognition tests are executed by estimating the best hypothesis from the given feature
vectors and from a language model using the Viterbi algorithm, which finds the
maximum likelihood state sequence. Results are given as a recognition percentage as
well as the numbers of deletion, substitution and insertion errors.
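As an illustration of how such error counts can be obtained, the following sketch (not HTK code) aligns a recognized phone string against a reference string by edit distance and derives the substitution, deletion and insertion counts, together with the conventional correctness and accuracy percentages; the phone names are only an example.

```python
def align_counts(ref, rec):
    """Count substitutions, deletions and insertions between a reference
    and a recognized phone sequence using edit-distance alignment."""
    # cost[i][j] = (total errors, subs, dels, ins) for ref[:i] vs rec[:j]
    rows, cols = len(ref) + 1, len(rec) + 1
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)                 # everything deleted
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)                 # everything inserted
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if ref[i - 1] == rec[j - 1] else 1
            candidates = [
                (cost[i - 1][j - 1][0] + sub, cost[i - 1][j - 1], (sub, 0, 0)),
                (cost[i - 1][j][0] + 1, cost[i - 1][j], (0, 1, 0)),      # deletion
                (cost[i][j - 1][0] + 1, cost[i][j - 1], (0, 0, 1)),      # insertion
            ]
            total, prev, (s, d, ins) = min(candidates, key=lambda c: c[0])
            cost[i][j] = (total, prev[1] + s, prev[2] + d, prev[3] + ins)
    return cost[-1][-1][1:]                       # (S, D, I)

ref = ["n", "a_l", "s:"]
rec = ["n", "a_h", "s:"]
S, D, I = align_counts(ref, rec)
N = len(ref)
print("%Correct =", 100.0 * (N - S - D) / N, " %Accuracy =", 100.0 * (N - S - D - I) / N)
```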
The approach we adopted in our system considers the systemic and structural
differences between the learner's utterance and the correct utterance as phone
insertions, deletions and substitutions [19].
This requires a phone recognizer trained on the correct phones and the wrong ones
that may be inserted or substituted by the learner. Knowledge of phonetics, phonology
and pedagogy is needed to know the different possible mispronunciations of each
phone. An example of the phone substitution problem in the word " " is shown in figure [4]:
learners usually encounter the problem of the emphatic pronunciation of the
first letter ( ) in this word, which appears in the vowel after it. So, the
correct phone /a_l/ may be replaced with /a_h/ (see the phonology table in section
2.3).
Figure [4]: Phone substitution in the word " " (the phone /a_l/ may be replaced by /a_h/).
Our handling of this rule ( ) considers that both cases of pronouncing
the letter ( ) are represented by the same phone, as the difference
usually appears in the vowel rather than the consonant, although there is a slight difference
in their acoustic properties, except in a few cases such as the letter " ", which,
when pronounced with emphasis ( ), becomes " ".
Building a suitable database covering all possible right and wrong phones is easy, as
most of the phones in the Holy Qur'an are not new to ordinary Arabic speakers.
With this approach we can detect pronunciation errors for various rules other than this
rule ( ), such as problems of pronouncing particular letters like " " and " ",
and the rule of ( ). But other rules like ( ) require a
different kind of handling, which we do not deal with in our system.
There is another approach that depends on assessing pronunciation quality, and it may
tolerate more recognition noise [Witt and Young, 2000; Neumeyer et al., 2000;
Franco et al., 2000]. The judgment in this approach is usually required to correlate
well with human judges, which makes it less objective and harder to implement than
our approach, which calls for accurate and precise phoneme recognition.
6- The output of the comparison (the pronunciation) is then passed to the User
Profile Analyzer.
7- The User Profile Analyzer checks the user profile and determines which
mistakes the user should receive feedback about, depending on the lessons he has
already passed.
8- The mistakes are then passed to the feedback generator.
9- The feedback generator generates the feedback and passes it to the GUI.
10- The GUI displays the feedback to the user.
As the figure shows, the system consists of six main modules other than the GUI; the
following is a brief description of each:
1- Recognizer
After the user’s utterance is captured through the microphone and saved in a .WAV
file, it is passed to an HMM-based phone-level recognizer along with a phone-level
grammar file containing the phones of both the reference word and the expected
mistaken word. The recognizer checks which of them the utterance is closer to and
outputs a text file containing the phones of the recognized word.
2- Recognizer Interface
The recognizer runs in a DOS shell, which is not a user-friendly interface, especially
for feedback-oriented applications. So, an interface was built between our GUI and the
recognizer to overcome this problem.
3- String Comparator
The recognizer passes the phones of the recognized word to the string comparator,
which holds the reference word of the current lesson, compares the two, and
passes the difference at every phone to the User Profile Analyzer.
4- User Profile Analyzer
After the user succeeds in a certain lesson, his profile is updated and this lesson is
added to it, as in the following lessons he is expected not to commit mistakes
related to this learned lesson. For example, if the user has learned only the lesson teaching
( ) and tries to recite ( ), he gets feedback only
related to his mistakes in ( ). If he then learns and passes another lesson
teaching ( ) and tries to recite ( ) again, he gets feedback related to his mistakes in both letters, as they are both saved in his profile now.
5- Auxiliary Database
On starting a training or testing session, some values must be initialized to control the
session lifetime. For example, which word(s) will appear to the user to utter, and what
the reference transcription of that word(s) is, etc. All of this information is stored in
this database.
6- Feedback Generator
After the mistakes are filtered according to the user’s knowledge, the feedback
generator analyzes the mistakes and determines the suitable method of guiding the
user to correct them.
As for the rule of ( ), we have chosen 8 letters which are present in
( ). A speech database was built covering the right and wrong pronunciations
of these letters to train the recognizer.
Following the methodology of Sheikh Ahmed Amer's lessons, we note that ordinary
Arabic speakers by default pronounce a letter correctly in some words but not in
others, despite their ignorance of the rules, as the nature of the word itself forces the
speaker to pronounce it correctly in some cases.
So, to cover both cases for each letter, the training database was chosen to contain
four words for each letter: two in which the letter is usually read with the mistaken
pronunciation, and two other words in which the letter is read with the correct one.
For example, for ( ), we have the words ( );
the first two are usually mistaken and the user emphasizes ( ), whereas
in the last two, the letter is always pronounced correctly. Each speaker reads these
four words per letter three times. A list of all the words used for training can be found
in Appendix B.
3.2.3 Constraints
The design of our system was based on a few assumptions:
Calm environment
The system performs better when used in a relatively calm environment. Noise above
a certain level can degrade the recognition accuracy.
Cooperative user
The word to be pronounced is displayed on the screen, and the user is expected to
either pronounce it correctly or mispronounce the letter being taught. The
system doesn’t deal with unexpected words.
Male user
This version of the system has been trained on young male voices only, so in
order to serve female users, new models have to be constructed and trained on female
voices.
3.3. Speech Recognition with HTK
In this project, several experiments were done using HTK v3.2.1 to build HMMs for
different recognizers: starting from a small English digit recognizer to learn and test
the tool, then a small Arabic word recognizer as an attempt to handle Arabic speech
recognition, until building up the prototype and the speaker-dependent and speaker-
independent versions of the project core.
In this section, we explain the steps followed to build such recognizers. Details of
using each tool can be found in the HTK manual [18].
3.3.1 Data preparation
Recording the data
The first stage in developing a recognizer is building a speech database for training
and testing. Although HTK supports a tool (HSLab) for recording and labeling data,
we used another, easier and more user-friendly program for that purpose: Cool Edit
Pro v2. Speech is recorded via a desktop microphone and sampled at 16 bits and 16
kHz. It is saved in the Windows PCM format as a WAV file. Cool Edit is then used to
segment the data word by word, giving each word a distinct label.
Creating the Transcription Files
To train a set of HMMs, every file of training data must have an associated phone-level
transcription. This process was done manually by writing the phone-level
transcription of each word, all in a single Master Label File (MLF) with the standard
HTK format.
Coding the data
Speech is then coded using the tool HCopy where the speech signal is processed first
by separating the signal into frames of 10ms length and then converting those frames
into feature vectors – MFCC coefficients in our case.
3.3.2 Creating Monophone HMMs
Creating Flat Start Monophones
The first step in HMM training is to create a prototype model defining the model
topology. In phone-level recognizers, the model usually consists of three emitting states plus
one entry and one exit state.
After that, an HMM seed is generated by the tool HCompV, which initializes the
prototype model with a global mean and variance computed over all the frames in every
feature file. This variance, scaled by a factor (typically 0.01), is also used as a variance
floor to limit the variances estimated in the subsequent steps.
A copy of the previous seed is set in a Master Macro File (MMF) called hmmdefs as
the initialization for every model defined in the HMM list. This list contains all the
models that will be used in the recognition task, namely a model for each phone used
in the training data plus a silence model /sil/ for the start and end of every utterance.
Another file called macros is created that contains the variance floor macro and
defines the HMM parameter kind and the vector size.
HMM Parameter Re-estimation
The flat start monophones are re-estimated using the embedded training version of the
Baum–Welch algorithm, which is performed by the HERest tool, whereby every model is
re-estimated using the frames labeled with its corresponding transcription.
This tool is used for re-estimation after every modification to the models, and usually
two or three iterations are performed each time.
Fixing silence models
Forward and backward skipping transitions are added to the silence model /sil/ to give
it a longer mean duration and make it more robust. This is performed by the HMM editor tool
HHEd.
3.3.3 Creating Tied-State Triphones
To provide some measure of context-dependency as a refinement for the models, we
create triphone HMMs, where each phone model is made dependent on both its left and right context.
First, the label editor HLEd is used to convert the monophone transcriptions to an
equivalent set of triphone transcriptions. Then, the HMM editor HHEd is used to
create the triphone HMMs from the triphone list generated by HLEd. This HHEd
command has the effect of tying all of the transition matrices within each triphone set,
so that the HMMs of a set share the same transition parameters.
The last step is using decision trees that are based on asking questions about the left
and right contexts of each triphone. Based on the acoustic differences between phones
according to the different classifications mentioned in section 2.3, phones are
clustered using these decision trees for more refinement. The decision tree attempts to
find those contexts which make the largest difference to the acoustics and which
should therefore distinguish clusters.
Decision tree state tying is performed by running HHEd using the QS command for
questions where the questions should progress from wide, general classifications
(such as consonant, vowel, nasal, diphthong, etc.) to specific instances of each phone.
3.3.4 Increasing the number of mixture components
The early stages of triphone construction, particularly state tying, are best done with
single Gaussian models, but it is preferable for the final system to consist of multiple
mixture component context-dependent HMMs instead of single Gaussian HMMs
especially for speaker-independent systems. The optimal number of mixture
components can be obtained only by experiment, gradually increasing the number
of mixture components per state and monitoring the performance after each
experiment.
The tool HHEd is used for increasing mixture components with the command MU.
3.3.5 Recognition and evaluation
After training HMMs the tool HVite is used for Viterbi search in a recognition lattice
of multiple paths of the words to be recognized.
Our grammar file is written at the phone level to provide multiple paths for phone
insertions, deletions or substitutions of the word to be recognized. It is written using the
BNF notation of HTK; the tool HParse is then used to generate a lattice from this
grammar as an input to HVite.
As we used phone-level grammar, our dictionary is just a list of the phones used like
the following:
sil   sil
a:_h  a:_h
a:_l  a:_l
... etc.
Note that if some paths in the grammar file produce context-dependent triphones
that have no corresponding models in the trained HMM set, we copy the HMMs of the
monophones from before tying and add them to the final models instead of re-training the
models on the new triphones.
To test the recognizer performance, we run HVite on the testing data, where the output
transcriptions are written in a Master Label File covering all the testing files. We then use the
tool HResults to compare this file with a reference file of the correct transcriptions.
In our application, the recognized phones are stored in an array-list
(dynamic array) to facilitate comparison with the reference phones stored in another
array-list.
As mentioned before, there are three kinds of pronunciation mistakes: insertion,
substitution, and deletion. So, the module was implemented with three methods,
CheckInsersion(), CheckSubstitution(), and CheckDeletion(). The core of the three
methods was implemented, but only CheckSubstitution() was completed and tested,
since the mistakes we currently handle all fall into this category.
The implementation of the CheckSubstitution() method is as follows: for every phone
named 'fat-ha' in the reference word, if it differs from the corresponding
phone in the recognized word(s) with respect to emphasis ( ), and the previous
phone belongs to a passed lesson or to the current lesson (in the case of training, not testing),
then the feedback generator is fired to report this mistake; otherwise the mistake
is ignored.
By doing so, we are able to detect all mistakes, but we filter feedback according to the
user's status.
As observed, we search for the specific vowel phone 'fat-ha', as we assumed
that the emphasized consonant phone is the same as the un-emphasized one and that the
difference only appears in the vowel following the consonant. This assumption is valid for most
consonants; consonants outside this assumption are not handled in our project,
but they can be covered simply by adding some additional phones.
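A sketch of this logic in Python is given below; the function and parameter names are illustrative and do not reproduce the project's actual code.

```python
def check_substitution(ref_phones, rec_phones, passed_lessons, current_lesson,
                       lesson_of, training=True):
    """Illustrative sketch of the substitution check described above.

    For every 'fat-ha' vowel in the reference word, report a mistake if the
    recognized phone differs in emphasis (e.g. /a_l/ recognized as /a_h/)
    and the preceding consonant belongs to a passed lesson (or to the
    current lesson when training). The recognized word is assumed to have
    the same number of phones, since only substitutions are expected."""
    mistakes = []
    allowed = set(passed_lessons) | ({current_lesson} if training else set())
    for i, (ref, rec) in enumerate(zip(ref_phones, rec_phones)):
        if ref in ("a_l", "a_h") and rec in ("a_l", "a_h") and ref != rec:
            previous_consonant = ref_phones[i - 1] if i > 0 else None
            if previous_consonant and lesson_of(previous_consonant) in allowed:
                mistakes.append((previous_consonant, ref, rec))
    return mistakes
```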
3.5.3 Auxiliary Database
For every lesson, the user’s utterance will be tested with two words. So, to decide
which word will appear to the user and to initialize the reference array-list (dynamic
array) for that word, the method TrainWhat() takes the lesson number and the word
number and returns the corresponding word so that the session can be started.
On generating the feedback, for each mistaken phone, we check its lesson to know
whether the user has passed this lesson or not and display the appropriate feedback.
The method GetLessonNo() implements this function using a simple switch case.
Also on generating feedback, a mapping between the phones produced by
the recognizer and the corresponding Arabic letters is needed, i.e. a de-transcriptor.
The method Corr_Arabic() implements this function by taking a
string representing the phone and passing it through a switch case to return the corresponding
Arabic letter.
As observed, the switch case is used frequently because of its implementation simplicity and the
small search space; if the search space grew larger, a different design choice would be made.
3.5.4 User Profile Analyzer
Two methods implement this analyzer. The first is ReadProfile(), which is called at the
beginning of the training or testing session so that feedback is given only
according to the lesson numbers stored in the profile (i.e. the lessons the user has passed).
In case of training (not testing), the current lesson will be taken into consideration
besides the succeeded lessons. The second method is UpdateProfile() which is called
after the user has passed a certain lesson so that it can be considered afterwards.
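The profile handling can be pictured as in the following sketch; the file format, class and method names are assumptions made for illustration, not the actual implementation.

```python
import json

class UserProfile:
    """Illustrative stand-in for the profile handling described above."""

    def __init__(self, path):
        self.path = path

    def read_profile(self):
        # Returns the numbers of the lessons the user has already passed.
        try:
            with open(self.path) as f:
                return set(json.load(f))
        except FileNotFoundError:
            return set()

    def update_profile(self, passed_lesson):
        # Add a newly passed lesson so later sessions give feedback on it.
        lessons = self.read_profile()
        lessons.add(passed_lesson)
        with open(self.path, "w") as f:
            json.dump(sorted(lessons), f)
```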
frmLessons: contains command buttons for choosing a lesson to listen to, and a
command button for performing a test.
frmListening: for playing the lesson chosen in frmLessons; it contains cassette-like
buttons for performing this function.
frmTraining: for training the user on the lesson he has heard and testing him.
The scenario mentioned above takes place on this form, so it is the most important
form in the project.
One last thing to mention here is that the layout of these forms was drawn using