ARABIC TEXT-TO-SPEECH SYNTHESIZER
AHMAD QASIM MOHAMMAD AL JAYOUSI
FACULTY OF COMPUTER SCIENCE AND
INFORMATION TECHNOLOGY
UNIVERSITY OF MALAYA
KUALA LUMPUR
DECEMBER 2007
ARABIC TEXT-TO-SPEECH SYNTHESIZER
AHMAD QASIM MOHAMMAD AL JAYOUSI
DISSERTATION SUBMITTED IN FULFILMENT OF THE REQUIREMENTS
FOR THE DEGREE OF MASTER OF SOFTWARE ENGINEERING
FACULTY OF COMPUTER SCIENCE AND
INFORMATION TECHNOLOGY
UNIVERSITY OF MALAYA
KUALA LUMPUR
DECEMBER 2007
ABSTRACT
Text-To-Speech technology has steadily grown over the years to support many languages
by utilizing a number of useful methods and techniques. Despite its overall steady
growth, Arabic TTS technology has gained little attention. Apart from a few commercial
products, Arabic TTS Synthesizer Systems have rarely moved beyond laboratory
boundaries. This dissertation examines the issues, requirements and methodologies
involved in developing a useful Arabic TTS Synthesizer System. Additionally, this
dissertation describes in detail the construction of an Arabic TTS Synthesizer System
using the Concatenation Synthesis method, which relies on prerecorded speech units that
the system joins to generate concatenated speech. Two types of speech units were used
independently: the first consists of 375 diphones covering the sounds of Arabic, and the
second of 178 allophones covering the same sounds. The system consists of several
programs and algorithms that are integrated in the form of modules to be easily modified
and developed individually in the future. The evaluation of a TTS Synthesizer System is
also important, but difficult, because speech quality is a highly multidimensional
attribute. This has led to a large number of different tests and methods to
evaluate different features in speech. Our survey results demonstrate that the majority of
the words and sentences are recognizable by the developed Arabic TTS Synthesizer
System. In fact, 80 to 85% of the words and 70% of the sentences were correctly and
completely recognized by the listeners in the survey. The synthesized Arabic speech is
thus considered intelligible; however, the system still requires further research and
development in many directions, including the handling of various Arabic texts such as
dates, abbreviations, foreign names and numbers.
DEDICATION
ACKNOWLEDGMENT
The effort involved in writing a dissertation is tremendous and anyone who has ever done
this can attest to it. I am fortunate to acknowledge a number of people who contributed
along the way in various ways by offering suggestions, corrections, guidance, ideas,
comments, and advice. The work on this dissertation has been an inspiring, often
exciting, sometimes challenging, but always interesting experience.
First of all, I thank Almighty Allah (SWT) for giving me the guidance, knowledge and
patience that enabled me to complete this dissertation successfully. I am very grateful to
You, my Lord.
I take great pride in expressing my sincere appreciation and gratitude to my supervisor,
Madam Zarinah Mohd Kasirun, for her invaluable guidance, support, encouragement and
help. Without her tireless efforts, patience and guidance, this dissertation would not have
been completed successfully.
I would like to thank my beloved family, especially my beloved grandparents, my mother,
my brothers, my fiancée, and all my dearly beloved uncles, their children and wives, for
their kindness, encouragement, and moral and financial support. Without them I would
not be what I am now; a million thanks to all of them.
Not forgetting, I would like to thank all my dear friends, for their ideas and support given
to me in completing this work.
TABLE OF CONTENTS
DECLARATION ii
ABSTRACT iii
DEDICATION iv
ACKNOWLEDGMENT v
TABLE OF CONTENTS vi
LIST OF FIGURES x
LIST OF TABLES xi
LIST OF SYMBOLS AND ABBREVIATIONS xii
CHAPTER 1: INTRODUCTION
1.1 OVERVIEW 1
1.2 PROBLEM JUSTIFICATION 4
1.3 SCOPE OF THE PROBLEM 9
1.4 SYSTEM AIMS AND OBJECTIVES 10
1.4.1 General Objective 10
1.4.2 Specific Objectives 10
1.5 POTENTIAL BENEFITS AND MOTIVATION OF ARABIC TTS 11
1.6 PROJECT METHODOLOGY 12
1.7 ORGANIZATION OF THE DISSERTATION 14
CHAPTER 2: LITERATURE REVIEW
2.1 OVERVIEW 16
2.2 WHAT IS SPEECH? 18
2.3 SPEECH PRODUCTION MECHANISM 19
2.4 WHAT IS TEXT-TO-SPEECH? 21
2.5 HISTORY OF TEXT-TO-SPEECH 22
2.6 TTS METHODS, TECHNIQUES AND ALGORITHMS 26
2.7 THE IMPORTANCE OF TEXT-TO-SPEECH 30
2.8 APPLICATIONS OF TEXT-TO-SPEECH SYNTHESIS 31
2.8.1 Example Applications 31
2.8.2 Other Applications and Future Directions 33
2.9 THE CHALLENGES BEHIND THE TEXT-TO-SPEECH 34
2.9.1 The Diacritization Problem 34
2.9.2 Dialects 35
2.9.3 Differences in Gender 36
2.10 EXISTING PRODUCTS 37
2.10.1 MBROLA Project 37
2.10.2 Acapela Group 38
2.10.3 ArabTalk 38
2.10.4 Sakhr TTS 40
2.10.5 Élan TTS 41
2.11 SUMMARY 43
CHAPTER 3: CONCATENATIVE SYNTHESIS
3.1 OVERVIEW 46
3.2 CONCATENATIVE SYNTHESIS 48
3.2.1 Phonemes 52
3.2.2 Diphones 52
3.2.3 Demi-syllables 53
3.2.4 Speech Signal Representation for Concatenative Synthesis 54
3.3 TYPES OF CONCATENATIVE SYNTHESIS 57
3.3.1 Concatenation of Stored Allophones 57
3.3.2 Diphone Concatenation Synthesis 58
3.3.3 Concatenation of Stored Syllables and Demi-syllables 59
3.3.4 Concatenation of Stored Waveforms 60
3.3.5 Concatenation of Stored Words 62
3.4 SUMMARY 64
CHAPTER 4: ARABIC LANGUAGE
4.1 OVERVIEW 65
4.1.1 English and Arabic 67
4.2 DEFINITION OF AN ARABIC WORD 69
4.3 ARABIC IS A DIACRITIZED LANGUAGE 71
4.4 INTRODUCTION TO VOWELS 75
4.4.1 Short Vowels 76
4.4.1.1 The First Short Vowel 76
4.4.1.2 The Second Short Vowel 77
4.4.1.3 The Third Short Vowel 77
4.4.1.4 The First Short Vowel Doubled 78
4.4.1.5 The Second Short Vowel Doubled 78
4.4.1.6 The Third Short Vowel Doubled 79
4.4.2 Long Vowels 79
4.4.2.1 The First Long Vowel 79
4.4.2.2 The Second Long Vowel 81
4.4.2.3 The Third Long Vowel 81
4.5 SYLLABLES OF ARABIC LANGUAGE 82
4.5.1 Syllable Structure 82
4.5.2 Syllable Patterns 82
4.5.2.1 Short and Long Syllables 83
4.5.2.2 Closed and Open Syllables 83
4.5.2.3 Ending a Syllable with a Consonant 83
4.6 SUMMARY 85
CHAPTER 5: SYSTEM ANALYSIS AND DESIGN
5.1 OVERVIEW 86
5.2 SOFTWARE AND HARDWARE REQUIREMENTS 91
5.2.1 Microsoft Visual Basic 6.0 91
5.2.1.1 Visual Basic and Arabic Support 92
5.2.1.2 Visual Basic Bi-directional Features 92
5.2.2 Microsoft Word 93
5.2.3 Sound Forge 8.0 93
5.3 ARABIC TEXT-TO-SPEECH ARCHITECTURE 95
5.3.1 Synthesizing Text Steps 96
5.3.2 Recording of the Sounds 98
5.3.3 Storing Sound Files Using Wave File Format 99
5.4 THE DESIGNING OF DATABASE FOR SOUND FILES 100
5.4.1 Representing Relational Database Using Microsoft Access 100
5.5 THE DESIGN OF USER INTERFACE 103
5.6 SUMMARY 106
CHAPTER 6: ARABIC TTS SYSTEM IMPLEMENTATION
6.1 OVERVIEW 107
6.2 CREATING THE ARABIC TTS SYSTEM 108
6.2.1 Understanding the Form and its Procedures 108
6.2.2 Creating the Command Button 109
6.2.3 Creating the Text Box 110
6.2.4 Creating the Button Associated with the TextBox 111
6.2.5 Creating the Drive to See All the Files Inside the Drive 111
6.2.6 Creating the Multimedia Controls and their Properties 112
6.2.7 Combining the Multimedia Controls with the Command Button 113
6.2.8 The User Interface of the TTS Synthesizer System 114
6.3 SUMMARY 124
CHAPTER 7: TESTING AND EVALUATING OF ARABIC TTS SYSTEM
7.1 OVERVIEW 125
7.2 TESTING AND EVALUATING TTS SYSTEMS 125
7.3 TESTING THE ARABIC VOICE 128
7.3.1 Test Group 128
7.3.2 Method 128
7.4 TEST AND EVALUATION RESULTS 129
7.4.1 Naturalness 129
7.4.2 Speed 130
7.4.3 Sound Quality 131
7.4.4 Pronunciation 132
7.4.5 Clearness 134
7.4.6 Stress/Intonation 135
7.4.7 Error 139
7.5 SUMMARY 140
CHAPTER 8: CONCLUSION AND FUTURE WORK
8.1 CONCLUSION 141
8.2 SUGGESTIONS AND FUTURE WORK 144
REFERENCES
REFERENCES AND RESOURCES 146
APPENDICES
APPENDIX A: Questionnaire 153
APPENDIX B: Glossary 156
LIST OF FIGURES
Figure 2.1 The human vocal organs. 19
Figure 2.2 Kratzenstein's resonators. 22
Figure 2.3 Wheatstone's reconstruction of von Kempelen's speaking machine. 23
Figure 2.4 The Voder electronic synthesizer. 24
Figure 2.5 Some milestones in speech synthesis. 25
Figure 2.6 Linear predictive synthesizer. 27
Figure 2.7 Block diagram of articulatory speech synthesizer. 28
Figure 2.8 Viterbi alignment. 40
Figure 3.1 Block diagram of a concatenative text-to-speech system. 50
Figure 3.2 Wave sound for Allophone Concatenation Synthesis. 57
Figure 3.3 Wave sound for Diphone Concatenation Synthesis. 58
Figure 4.1 The classification of the Arabic words. 70
Figure 5.1 Use-Case Diagram for TTS Synthesizer System. 86
Figure 5.2 IPO Schematic Architecture Design of Arabic TTS. 95
Figure 5.3 Sequence Diagram for Text Normalization. 96
Figure 5.4 Sequence Diagram for Text Segmentation. 97
Figure 5.5 Sequence Diagram for Text Concatenation. 98
Figure 5.6 The procedure taken when the user requests certain data. 101
Figure 5.7 User interface of Arabic and English TTS Synthesizer System. 106
Figure 6.1 The basic Form and its properties. 109
Figure 6.2 The Form with the command button "Speak". 110
Figure 6.3 The Form with the TextBox "Text1" written inside. 110
Figure 6.4 The Form with the TextBox and the Command Button. 111
Figure 6.5 The Drive C, its directory and all the files inside the drive. 112
Figure 6.6 The Form with the Multimedia Control commands. 113
Figure 6.7 The Form with the Multimedia Controls and the Command Buttons. 113
Figure 6.8 User interface of Arabic and English TTS Synthesizer System. 114
Figure 6.9 The word "…" to be pronounced by the system. 123
Figure 7.1 Naturalness of the voice. 129
Figure 7.2 The speed of the speech. 130
Figure 7.3 The sound quality of the voice. 131
Figure 7.4 The pronunciation's effect on understanding. 132
Figure 7.5 The concentration needed to hear the pronunciation. 133
Figure 7.6 The annoyance level of the pronunciation. 134
Figure 7.7 Understanding the voice. 135
Figure 7.8 The level of difficulty in understanding the voice. 136
Figure 7.9 The intonation of the system. 137
Figure 7.10 The stress of the system. 138
Figure 7.11 Pronunciation mistakes. 139
LIST OF TABLES
Table 2.1 Comparison between types of speech synthesis. 29
Table 2.2 Comparison between types of speech synthesis. 44
Table 3.1 Concatenative Synthesis: Pros and Cons. 56
Table 3.2 Advantages and Disadvantages of Units of Concatenative Synthesis. 63
Table 4.1 Arabic dotted and undotted alphabet characters. 68
Table 4.2 The Arabic diacritics and the significance of each one. 72
Table 4.3 Arabic Syllable Patterns. 83
Table 5.1 Description of Open File Use Case. 87
Table 5.2 Description of Input Text Use Case. 87
Table 5.3 Description of Clear Text Use Case. 87
Table 5.4 Description of Synthesize Text Use Case. 88
Table 5.5 Description of Update and Delete Use Case. 88
Table 5.6 Description of Record Sound Use Case. 88
Table 5.7 Description of Normalize Text Use Case. 89
Table 5.8 Description of Text Parser Use Case. 89
Table 5.9 Description of Concatenate Text Use Case. 90
Table 5.10 Description of Compare Sound Use Case. 90
Table 5.11 Example of sound segments in wave files. 99
Table 5.12 Arabic Word Table. 102
Table 5.13 Syllable Table. 102
Table 5.14 Abbreviation Table. 102
Table 7.1 Possible evaluating attributes. 127
Table 7.2 Naturalness/Clearness. 129
Table 7.3 Sound Speed. 130
Table 7.4 Sound Quality. 131
Table 7.5 Pronunciation Question 1. 132
Table 7.6 Pronunciation Question 2. 133
Table 7.7 Pronunciation Question 3. 134
Table 7.8 Clearness Question 1. 135
Table 7.9 Clearness Question 2. 136
Table 7.10 Stress/Intonation Question 1. 137
Table 7.11 Stress/Intonation Question 2. 138
Table 7.12 System Error. 139
LIST OF SYMBOLS AND ABBREVIATIONS
ANN Artificial Neural Network
ASCII American Standard Code for Information Interchange
BIDI Bi-Directional
C Consonant letter
DOS Disk Operating System
DSP Digital Signal Processing
GTP Grapheme-To-Phoneme
HMM Hidden Markov Model
ICT Information and Communication Technology
IDE Integrated Development Environment
IPO Input-Process-Output Schematic
IVR Interactive Voice Response
LPC Linear Predictive Coding
MCI Media Control Interface
MFCC Mel Frequency Cepstral Coefficient
MSA Modern Standard Arabic
OCR Optical Character Recognition
PTI Panasonic Technologies Inc.
RDI Research and Development International
SAPI Speech Application Programming Interface
SDK Software Development Kit
SDLC System Development Life Cycle
TTP Text-To-Phonetic
TTS Text-To-Speech
V Vowel Letter
VCR Video Cassette Recorder
CHAPTER ONE
INTRODUCTION
1.1 OVERVIEW
Language is among the most important features that differentiate humans from other
living creatures, and speech is the key medium of language (Edwards, 1991). With the
advent of digital electronic technology, the goal of developing machines that imitate
human sounds has come closer to being achieved. It has to be said that no one has yet
succeeded in synthesizing a voice that is identical to a human voice. Nevertheless,
speech synthesizers are now available that produce speech of a quality adequate for
many applications.
When we hear speech in our own language, we hear individual words and sounds, and we
can write the speech down using discrete letters with spaces between words. When we
hear speech in a language that we do not know, we cannot do this: all of the words and
sounds seem to run together in one continuous stream (Robert, 1999). Is speech discrete
or continuous? The answer is both. On a purely physical level, speech is continuous,
except where we pause to take breath. On a psychological level, speech is perceived as
composed of discrete sounds and groups of discrete sounds, the former corresponding
more or less to the letters of the alphabet and the latter to the words of the language.
Humans are capable of dividing the physically continuous speech signal into discrete
units because of their linguistic expertise. Mastering language is perhaps the most
exceptional single intellectual achievement of a person's life (Robert, 1999). An untold
amount of knowledge is required to speak and understand a language with native fluency.
This dual nature of speech, discrete yet continuous, is what makes computer processing
of speech so challenging (Pinker, 1993). Digital computers can only incompletely
represent a continuous signal, and they are devoid of linguistic knowledge except for the
small amount provided by us humans. This matters because speech and language are
unique to the human species: speech is the primary modality for communication between
humans, and reliable speech synthesis by machines would be very useful. Still more
useful would be speech understanding, that is, the identification of the meaning of an
utterance (Dix et al., 2003).
Speech provides our first contact with the raw, unwashed world of real sensor data. These
data are noisy, quite literally: there can be background noise as well as artifacts
introduced by the digitization process (Dix et al., 2003); there is variation in the way
words are pronounced, even by the same speaker; different words can sound the same;
and so on.
To many of us, the term speech synthesis evokes memories of mechanical, tedious, or
repetitive voices. A Text-To-Speech (TTS) Synthesizer System is simply defined as a
system that transforms written text into speech: a reading or dictating machine, the part
of speech technology concerned with automatically generating speech from a computer.
A TTS Synthesizer System is a computer-based system that has the ability to read any
text aloud, whether it is directly introduced into the computer by an operator or scanned
and submitted to an Optical Character Recognition (OCR) system. Speech synthesis itself
is the process that transforms a string of phonetic/syllabic and prosodic symbols into a
synthetic signal, i.e. the automatic production of speech through a Grapheme-To-
Phoneme transcription of the sentences to utter. A grapheme is a written letter or letter
combination, while a phoneme is the smallest unit of speech that differentiates one word
from another (Edward, 2003).
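To make the Grapheme-To-Phoneme idea concrete, the following minimal sketch (in
Python, for illustration only; the mapping table and phoneme symbols are hypothetical,
not the rules used by the system described in this dissertation) converts a written word to
phonemes by greedy longest-match lookup:

    # Toy grapheme-to-phoneme conversion; the table is a hypothetical example.
    G2P_TABLE = {"sh": "S", "ch": "tS", "a": "a", "b": "b", "t": "t"}

    def graphemes_to_phonemes(word):
        """Greedy longest-match conversion of a written word to phonemes."""
        phonemes, i = [], 0
        while i < len(word):
            for size in (2, 1):                 # try two-letter clusters first
                chunk = word[i:i + size]
                if chunk in G2P_TABLE:
                    phonemes.append(G2P_TABLE[chunk])
                    i += size
                    break
            else:
                i += 1                          # skip unmapped characters
        return phonemes

    print(graphemes_to_phonemes("shab"))        # ['S', 'a', 'b']

Real systems replace the toy table with a large set of rules and exception dictionaries, as
discussed below.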
TTS technology is becoming indispensable in businesses that need to provide their
customers with the latest, essential information in real time. These businesses usually
use Interactive Voice Response (IVR) systems and call centers to communicate this
information to their customers and prospects. Converting the data stored in web sites,
databases and files into human voice using traditional, expensive and time-consuming
studio recordings is a hard and long process, since the information is usually dynamic. In
some cases, it would be impossible to track these changes using human recordings at all.
Arabic is a complex language, unlike languages such as English, French, or Spanish.
Those languages, written in Latin script, mark their vowels with letters, whereas Arabic
uses special marks called "diacritics". These diacritics give Arabic words their correct
meaning within a sentence. For example, two Arabic words with different meanings can
be written identically, and only the diacritics help the reader distinguish them. The word
"سلم", for instance, is pronounced differently in the sentence "سلم الطالب الواجب",
meaning "the student submitted the homework", and in the sentence
"سلمت على صديقي", meaning "I greeted my friend".
1.2 PROBLEM JUSTIFICATION
Information and communication technology is rapidly evolving as an effective tool for
making information widespread and available online to several communities. The
industrial society is turning towards information society. The increased use of
information technology is enabling people across the world to participate in the
knowledge network; however, people in some developing countries are being deprived of
the benefits of ICT and computer systems. One of the main reasons for this is the lack of
suitable human-computer interfaces for disabled users and of software designed and
developed to meet their needs. Designing and developing a computer interface for a
person who cannot see what the computer displays is among the most challenging tasks
for software developers.
Even though many TTS systems exist for different languages, no two systems are the
same: the options and functions of the TTS process vary from one system to another. The
Arabic TTS Synthesizer System, like many other systems, was conceived with the
objective of assisting people in the task of gaining knowledge from texts.
A TTS Synthesizer System converts written input to spoken output by automatically
generating synthetic, computer-generated speech. Typed text is converted into speech
using various methods, such as formant synthesis, concatenative synthesis or others,
which will be explained briefly in the next chapter. Speech synthesis is often referred to
as Text-To-Speech (TTS) conversion; the term Text-To-Speech will be used from this
point onwards instead of speech synthesis.
No system can be developed without encountering problems. Primarily, these are the
problems actually faced by the people who develop the program in making it work
efficiently and fulfil the users' requirements. This part of the dissertation discusses the
major problems that arise during the development of the system, from the stage of
designing the system up to the stage where it is implemented and tested.
The problem area in speech synthesis is very extensive. There are quite a few problems
in text pre-processing, such as numerals, abbreviations and acronyms. Moreover, the
pronunciation of written text is still a major problem, for example for Arabic words that
cannot be rendered identically in other languages such as Malay. In a court session, for
instance, lawyers or judges may have to use certain Arabic terminologies, and these must
be produced with the correct pronunciation.
The problem of converting text into speech for a given language can naturally be broken
down into two sub-problems (Ronald et al., 1997). The first sub-problem involves the
conversion of linguistic parameters (for example, phoneme sequences, and accentual
parameters) into parameters (for example, formant parameters, concatenative unit indices,
and pitch time / value pairs) that can drive the actual synthesis of speech. The second
sub-problem involves the computation of these linguistic parameter specifications from
input text.
The first task faced by any text-to-speech synthesizer system is the conversion of input
text into a linguistic representation, usually called Text-To-Phonetic (TTP) or Grapheme-
To-Phoneme (GTP) conversion. The difficulty of this conversion is highly language
dependent and includes many problems. For Arabic, English and most other languages
the conversion is quite complicated: a very large set of different rules and their
exceptions is needed to produce the correct pronunciation for synthesized speech.
Conversion can be divided into three main phases: text preprocessing, determination of
correct pronunciation, and the analysis of prosodic features for correct intonation, stress
and duration. This dissertation touches on the first two phases; the last phase is outside
its scope.
Text preprocessing is usually a very complex task and includes several language
dependent problems. Digits and numerals must be expanded into full words. For example
in Arabic, the numeral 243 would be expanded as "مئتان وثلاثة وأربعون", meaning "two
hundred and forty-three", and the numeral 1750 as "ألف وسبعمئة وخمسون", meaning
"one thousand seven hundred and fifty". Fractions and dates are also problematic. The
figure 2/3 can be expanded as "ثلثان", meaning "two-thirds", if the figure is a fraction,
or as "الثاني من مارس", meaning "the second of March", if it is a date.
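A minimal numeral-expansion sketch follows (in Python; English number words are
used here purely for readability, whereas the actual system would substitute Arabic
number words and Arabic ordering rules):

    # Toy numeral expansion for 0..9999; illustrates the idea only.
    ONES = ["zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"]
    TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
             "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    TENS = ["", "", "twenty", "thirty", "forty",
            "fifty", "sixty", "seventy", "eighty", "ninety"]

    def expand_number(n):
        """Expand an integer 0..9999 into words (English word order)."""
        if n < 10:
            return ONES[n]
        if n < 20:
            return TEENS[n - 10]
        if n < 100:
            tens, ones = divmod(n, 10)
            return TENS[tens] + ("-" + ONES[ones] if ones else "")
        if n < 1000:
            hundreds, rest = divmod(n, 100)
            head = ONES[hundreds] + " hundred"
            return head + (" and " + expand_number(rest) if rest else "")
        thousands, rest = divmod(n, 1000)
        head = expand_number(thousands) + " thousand"
        return head + (" " + expand_number(rest) if rest else "")

    print(expand_number(243))    # two hundred and forty-three
    print(expand_number(1750))   # one thousand seven hundred and fifty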
Abbreviations may be expanded into full words, pronounced as written, or pronounced
letter-by-letter (Macon, 1996). There are also some contextual problems. For example,
"كغ" ("kg") can be pronounced either as "كيلوغرام", meaning "kilogram", or as
"كيلوغرامات", meaning "kilograms", depending on the preceding number; likewise,
"د." ("Dr.") is read as "دكتور", meaning "Doctor", and "الخ" ("etc.") as "إلى آخره",
meaning "et cetera". In some cases, the adjacent information may be enough to find the
correct conversion, but to avoid wrong conversions the best solution may sometimes be
letter-by-letter conversion. Innumerable abbreviations for company names and other
entities exist, and they may be pronounced in many ways. For example, NATO, read as
"ناتو", and RAM, read as "رام", are usually pronounced as written, whereas SAS may be
pronounced letter-by-letter as "إس أيه إس" and ADP as "أيه دي بي". Some
abbreviations, such as MPEG, read as "إم بيغ", are pronounced irregularly.
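The context dependence can be sketched as a table lookup keyed by the preceding
number (Python; the expansion table is a hypothetical fragment, not the system's actual
abbreviation database):

    import re

    # Hypothetical table: abbreviation -> (singular, plural) expansions.
    ABBREVIATIONS = {"kg": ("kilogram", "kilograms")}

    def expand_units(text):
        """Expand "1 kg" -> "1 kilogram" but "5 kg" -> "5 kilograms"."""
        def repl(match):
            number = int(match.group(1))
            singular, plural = ABBREVIATIONS[match.group(2)]
            return match.group(1) + " " + (singular if number == 1 else plural)
        return re.sub(r"(\d+)\s*(kg)\b", repl, text)

    print(expand_units("1 kg"))   # 1 kilogram
    print(expand_units("5 kg"))   # 5 kilograms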
Special characters and symbols, such as '@', '#', '%', '&', '*', '(', ')', '-', '/', '<', '>',
'[', and ']', which are generally spoken as "at sign", "pound sign", "percent",
"ampersand", "asterisk", "left parenthesis", "right parenthesis", "dash", "slash", "left
angle bracket", "right angle bracket", "left square bracket", and "right square bracket"
respectively, also cause their own kinds of problems. In some situations, the word order
must be changed. For example, $71.50 must be expanded as
"واحد وسبعون دولاراً وخمسون سنتاً", meaning "seventy-one dollars and fifty cents", and
$100 million as "مئة مليون دولار", meaning "one hundred million dollars", not as "one
hundred dollars million". Special characters and character strings, for example in web
addresses or e-mail messages, must also be expanded with special rules: the character
'@' is usually converted to "at", and e-mail messages may contain character strings, such
as header information, which may be omitted. Some languages also include special
non-ASCII characters, such as accent markers or other special symbols.
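The reordering rule for currency can be sketched as follows (Python; this reuses the
expand_number function from the earlier numeral sketch and ignores singular/plural
agreement for brevity):

    import re

    def expand_currency(text, expand_number):
        """Rewrite "$71.50" as "seventy-one dollars and fifty cents":
        the currency symbol is read AFTER the amount, so the order changes."""
        def repl(match):
            words = expand_number(int(match.group(1))) + " dollars"
            if match.group(2):
                words += " and " + expand_number(int(match.group(2))) + " cents"
            return words
        return re.sub(r"\$(\d+)(?:\.(\d{2}))?", repl, text)

    # expand_currency("$71.50", expand_number)
    # -> "seventy-one dollars and fifty cents"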
The second task faced by any text-to-speech synthesizer system is to find the correct
pronunciation for different contexts in the text; this is usually called the pronunciation,
or heteronym, problem. Some words, called homographs, cause perhaps the most
difficult problems in TTS systems. Homographs are spelled the same way but differ in
meaning and usually in pronunciation (e.g. "ذهب", "سلم"). The word "ذهب" is, for
example, pronounced differently in the sentence "ذهب الولد إلى المدرسة", meaning "the
boy went to the school", and in "اشترى صديقي الذهب", meaning "my friend bought
gold". With these kinds of words, some semantic information is necessary to achieve
correct pronunciation.
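A crude way to inject such semantic information is to score each candidate pronunciation
against cue words found in the sentence, as in this sketch (Python; the cue words and
romanized pronunciations are illustrative assumptions, not the disambiguation rules of
any existing system):

    # Toy homograph disambiguation by neighbouring-word context.
    HOMOGRAPHS = {
        "ذهب": [
            ("dhahaba", {"الولد", "المدرسة"}),   # verb: "went"
            ("dhahab",  {"اشترى", "خاتم"}),       # noun: "gold"
        ],
    }

    def choose_pronunciation(word, sentence_words):
        """Pick the pronunciation whose cue words overlap the sentence most."""
        candidates = HOMOGRAPHS.get(word)
        if not candidates:
            return word                      # not a homograph: pass through
        context = set(sentence_words)
        return max(candidates, key=lambda c: len(c[1] & context))[0]

    print(choose_pronunciation("ذهب", ["ذهب", "الولد", "إلى", "المدرسة"]))
    # -> dhahaba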
The pronunciation of a certain word may also differ due to contextual effects. This is
easy to see in English when comparing the phrases "the end" and "the beginning": the
pronunciation of "the" depends on the initial phoneme of the following word. Compound
and connected words are also problematic. For example, the letters "رب" in "مربع",
meaning "square", and in "من ربهم", meaning "from their lord", are pronounced
differently. Some sounds may also be realized differently in different contexts. For
example, the character "ص" in the word "صراط", meaning "path", may be pronounced
as "س", while the "س" in "مستقيم", meaning "straight", keeps its ordinary value.
1.3 SCOPE OF THE PROBLEM
The Arabic TTS Synthesizer System is a stand-alone, event-based system. The scope of
this dissertation covers a few dimensions, which can be divided as follows. The TTS
Synthesizer System is able to convert text to audio format in exactly two languages,
Arabic and English. Additionally, the Arabic TTS Synthesizer System is able to
pronounce the text input by the user either word by word or as a complete sentence.
The Arabic TTS Synthesizer System is designed to be used by beginners, those who have
no background in Arabic or do not speak it. That is, the system is intended mainly for
non-Arabic-speaking audiences; for teachers, who may use it to teach their students the
correct pronunciation; for disabled or impaired users; and for students. In addition, the
many people whose jobs require them to search for knowledge in documents, in fields
such as linguistics and engineering, can use this project. Students who study word
occurrences in text, lexicographers who compile dictionaries, and translators all must
examine large quantities of text, often in the millions of words, in order to find evidence
of word usage.
1.4 SYSTEM AIMS AND OBJECTIVES
The Arabic TTS Synthesizer System gives a clear way of pronouncing Arabic words, as
the user can listen to human-like speech as they enter the text. This system also benefits
students, especially in learning and improving their Arabic skills and vocabulary.
Interesting user-interface elements such as animations, graphics and colorful text may
further attract users, including students, to learn the Arabic language.
The aim of this dissertation is to develop an Arabic TTS Synthesizer System that can be
used for reading input text written in Arabic; an understanding of what speech is will
guide the development of the system. The system can be used for the following
purposes:
1.4.1 General Objective:
To design, implement and evaluate an Arabic TTS Synthesizer System for novice
non-Arabic speaking audiences.
1.4.2 Specific Objectives:
1. To analyze the current state-of-the-art on TTS Synthesizer System particularly for
the Arabic language.
2. To design and implement the Arabic TTS Synthesizer System.
3. To develop a suitable mechanism for converting textual information into audio
form for TTS Synthesizer System application.
4. To assess the implemented TTS Synthesizer System in terms of its functionalities
and quality of speech.
1.5 POTENTIAL BENEFITS AND MOTIVATION OF ARABIC TTS
The motivation behind building and developing such a system is that the TTS interface
can improve the user's experience on a desktop. It is more relaxing to listen to large
portions of text than to read them; this is good for the blind and for slow readers, and it is
less straining for the eyes. The Arabic TTS Synthesizer System brings benefits especially
in the educational field: it assists research, data collection and text analysis, and it is very
useful for students, educators and language researchers, providing them with an effective
way of knowing how to pronounce words. The following are benefits of the Arabic TTS
Synthesizer System:
1. Easy to use – intuitive: Arabic TTS Synthesizer System interface will be designed
to be intuitive and easy to use.
2. Efficient: Arabic TTS Synthesizer System reduces costs and increases efficiency.
3. Standard technology: Arabic TTS Synthesizer System relies on the use of
standard technologies such as Microsoft Visual Basic 6.0, Sound Forge 8.0 and
Microsoft Word.
4. Learning aid: Arabic TTS Synthesizer System will help users to learn the Arabic
language.
5. Provides accessibility: Over the Web and over the phone.
6. Offers adaptability and flexibility: Any time, anywhere.
7. Increases competitiveness: Increases depth of value-added services and develops
competitive advantage by creating differentiation.
1.6 PROJECT METHODOLOGY
The dissertation methodology builds on an already existing resource, the Arabic
language, together with additional computing tools. It draws on observation of similar
works and initiatives from different organizations and individuals in the field of word
databases, with a more specific focus on modern word-processing and database-
programming technology. In order for the Arabic TTS Synthesizer System to be
developed, the following activities are carried out throughout its development:
1. Literature Search and Collection of Relevant Information: This phase is a
process of gathering and compiling information relevant to the project work from
journals, books, articles, the Internet and other resources. Prior to the
development of the software, a study was made to understand what a text-to-
speech system is, as well as the available TTS systems. After studying the
existing TTS systems, the researcher compares them, identifying the strengths
and weaknesses of each system, its features, and the technology used in it. In
addition, in this phase the researcher looks into the different methods and
algorithms used to produce a TTS system, such as formant synthesis, linear
prediction, and others. The method used in this system is the Concatenative
Synthesis method, which uses prerecorded samples of different lengths derived
from natural speech.
2. System Analysis and Design: In this phase, various activities are carried out,
such as recording the sounds using special software known as Cool Edit, building
the database of these sounds using Microsoft Access, and designing the user
interface using Visual Basic 6.0. This is an important and critical stage, because
without the sounds the software cannot be developed; sounds are its basic
requirement. This phase also covers the system design, including the software
process model, here the System Development Life Cycle (SDLC), as well as
requirements analysis, development of the system model, Text-To-Speech
analysis, database resources, methods, techniques, and algorithms. Finally, the
design of the interfaces of the Arabic TTS Synthesizer System starts in this stage.
3. Development of the Text-To-Speech Synthesizer System: The implementation
phase starts here. The software is developed using Visual Basic 6.0. This is the
most time-consuming stage, where the coding activities take place, and a critical
and difficult phase in developing the software.
4. System Testing and Evaluation: The last phase involves the testing of the
software. This is the most important stage, where the system is tested,
information is collected and analyzed, and the results are discussed. It integrates
all of the previous phases. In this phase, several users participated in using and
testing the system.
1.7 ORGANIZATION OF THE DISSERTATION
This dissertation is divided into 8 chapters. The organization of the chapters is as follows:
Chapter 1: Introduction
The aim of this chapter is to introduce the field of speech synthesis. The chapter begins
with basic definitions of important terms in section 1.1. The problem description and
background are presented in section 1.2, the scope of the problem is discussed in section
1.3, and the project aims and objectives are presented in section 1.4. The motivation for
this dissertation is the application of TTS synthesizers to improving human-computer
interaction for people whose native language is Arabic.
Chapter 2: Literature Review
This chapter reviews the literature on existing TTS Synthesizer Systems similar to the
proposed system. It is divided into 8 sections: section 1, Overview; section 2, What is
Text-To-Speech; section 3, the History of Text-To-Speech; section 4, the Importance of
Text-To-Speech; section 5, the Speech Production Mechanism; section 6, Existing
Products; section 7, Applications of Text-To-Speech; and section 8, the Challenges
Behind Text-To-Speech.
Chapter 3: Concatenative Synthesis
This chapter explains in detail the methodology of the system. This chapter is divided
into 3 sections, which are: section 1, Overview; section 2, explains the Concatenative
Synthesis; and section 3, explains the Types of Concatenative Synthesis.
Chapter 4: Arabic Language and TTS
This chapter explains in detail the Arabic language. This chapter is divided into 6
sections, which are: section 1, Overview; section 2, Definition of an Arabic Word;
section 3, Arabic is a Diacritized Language; section 4, Introduction to Vowels; section 5,
Joining Letters; and section 6, Syllables and Doubled Letters.
Chapter 5: System Analysis Design and Architecture
This chapter discusses the System Analysis and Design and the Architecture of the
proposed system Arabic Text-To-Speech Synthesizer. This chapter has been divided into
4 sections which are: section 1, Overview; section 2, explains the Arabic Text-To-Speech
Architecture; section 3, The Designing of Database for Sound File; and section 4, The
Design of User Interface and Its Procedures.
Chapter 6: The System Implementation
This chapter discusses the system implementation by providing screen shots of the
system, and every screen shot is provided with an explanation. Section 1 gives an
overview of the chapter, and section 2 explains the process of creating the TTS
application.
Chapter 7: Testing and Evaluation
This chapter shows the result of the testing and evaluation of the system.
Chapter 8: Conclusion
This chapter outlines the findings, recommendations and future enhancements of the
current system.
CHAPTER TWO
LITERATURE REVIEW
2.1 OVERVIEW
Synthetic speech has been a vision of humanity for decades. To understand how current
systems function and how they developed into their present form, a chronological review
may be valuable. This chapter discusses the history of synthesized speech, from the first
mechanical efforts to the systems that form the basis of today's high-quality
synthesizers. Some separate milestones in synthesis-related methods and techniques are
also discussed briefly.
Electronic and computer technology developments are causing an explosive growth in the
use of machines for processing information (Holmes, 1988). In most cases this
information originates from a human being, and is finally to be used by a human being.
Therefore, there is a need for effective ways to transfer information, in both directions
between people and machines. One suitable way for this communication in many cases is
in the form of speech, because speech is the communication method that is most widely
used between humans; it thus seems extremely natural and requires no special training.
There are many situations where speech is not the best method for communicating with
machines. For example, large amounts of text are much more easily received by reading
from a screen, and positional control of features in a computer-aided design system is
easier by direct manual manipulation. However, for interactive dialogue and for input of
large amounts of text or numeric data speech offers great advantages (Holmes, 1988). For
all applications where the machine is only accessible from a standard telephone
instrument there is no practicable alternative.
Much research is being carried out in the area of speech synthesis. In the early days of
synthesis, research efforts were devoted mainly to simulating the human speech
production mechanism using basic articulatory models based on electro-acoustic
theories. Even though this kind of modeling remains one of the ultimate goals of
synthesis research, advances in computer science have widened the research field to
include text-to-speech processing: not only human speech generation but also text
processing is modeled. Such modeling is generally done by a set of rules derived, for
example, from phonetic theories and acoustic analysis.
2.2 WHAT IS SPEECH?
Speech is a complex signal from which we extract many types of information, from the
message content and meaning, to the nature of the transmission medium, to the identity
and condition of the speaker. How well we perform these tasks depends on the quality of
the speech signal, the efficacy of our hearing, the nature of the listening environment, and
our accumulated auditory experience. Because of its importance in human
communications, and since we do not have to consciously direct our attention (or even be
awake) to detect it, sound and in particular, speech, provides a channel for
communication which combines a degree of immediacy with the release of sight and
touch for other tasks. It could be argued that speech science is 2,500 years old (Robert,
1999).
Human speech is produced by a learned, coordinated process involving: drawing air
through the larynx, varying the tension on the vocal cords, positioning the articulators of
the vocal tract, and performing these actions under the conscious control of learned
language skills. The character and frequency content of the basic sound source are altered
by voluntary muscular control of the tension on the vocal cords. Thus are generated
voiced sounds (e.g., “oo”) having a periodic structure or unvoiced sounds (e.g., “sh”)
having an aperiodic structure. These basic sounds are then amplified by the
resonant features of the vocal tract and shaped by articulation. Articulation is the process
of interrupting and shaping the sound signal from the larynx to form the basic sounds of
speech (phonemes) and combining them to form words. The articulators used are the jaw,
lips, soft palate, teeth, and tongue.
2.3 SPEECH PRODUCTION MECHANISM
The human speech production system is an interesting and complex mechanism. To
begin with, every person has unique vocal characteristics, so that one can often recognize
an individual by voice alone. These characteristics relate directly to the physiology of
the talker (Robert, 1999). Features such as age, gender, height, weight, and the structure
of the vocal chords, nasal and oral cavities, teeth and lips all play a major role in the
speech production process.
Figure 2.1 The human vocal organs (Ntsourak's Home Page).
The vocal organs presented in Figure 2.1 above produce human speech. The core energy
source is the lungs with the diaphragm. When speaking, the airflow is forced through the
glottis between the vocal cords and the larynx to the three main cavities of the vocal tract,
the pharynx and the oral and nasal cavities (Owens, 1993). From the oral and nasal
cavities, the airflow exits through the nose and mouth, respectively. The V-shaped
opening between the vocal cords, called the glottis, is the most important sound source in
the vocal system. The vocal cords may act in several different ways during speech.
The most significant function is to adjust the airflow by rapidly opening and closing,
causing energetic sound from which vowels and voiced consonants are produced (Robert,
1999). The basic frequency of vibration depends on the mass and tension and is about
110 Hz for men, 200 Hz for women, and 300 Hz for children. For stop consonants, the
vocal cords may move suddenly from a completely closed position, in which they cut the
airflow entirely, to a very open position, producing a light cough or a
glottal stop. On the other hand, with unvoiced consonants, such as /s/ or /f/, they may be
completely open. An intermediate position may also occur with for example phonemes
like /h/.
The pharynx connects the larynx to the oral cavity. It has almost fixed dimensions, but its
length may be changed slightly by raising or lowering the larynx at one end and the soft
palate at the other end (Flanagan, 1972). The soft palate also isolates or connects the
route from the nasal cavity to the pharynx. At the bottom of the pharynx are the epiglottis
and false vocal cords to prevent food reaching the larynx and to isolate the esophagus
acoustically from the vocal tract. The epiglottis, the false vocal cords and the vocal cords
are closed during swallowing and open during normal breathing.
The oral cavity is one of the most important parts of the vocal tract. The movements of
the palate, the tongue, the lips, the cheeks and the teeth can vary its size, shape and
acoustics. The tongue in particular is very flexible: the tip and the edges can be moved
independently, and the entire tongue can move forward, backward, up and down. The lips
control the size and shape of the mouth opening through which speech sound is radiated.
2.4 WHAT IS TEXT-TO-SPEECH?
Speech is the key means of communication between people. TTS synthesis, the automatic
generation of speech waveforms, has been under development for several decades
(Santen et al., 1997). Recent progress in TTS synthesis has produced synthesizers with
very high intelligibility but the sound quality and naturalness remain a major problem.
According to Stuart et al. (2002), speech synthesis, or TTS synthesis, is complementary
to speech recognition. The idea of being able to converse naturally with a
computer is an attractive one for many users, especially those who do not consider
themselves as computer literate, since it reflects their natural, daily medium of expression
and communication. However, there are as many problems in Speech Synthesis as there
are in recognition. The most difficult problem is that we are highly sensitive to variations
and accent in speech, and are therefore intolerant of imperfections in synthesized speech.
We are so used to hearing natural speech that we find it difficult to adjust to the
monotonic, non-prosodic tones that synthesized speech can produce. In fact, most speech
synthesizers can deliver a degree of prosody, but in order to decide what tone to give a
word, the system must have an understanding of the domain. Therefore, an effective
automatic reader would also need to be able to understand natural language, which is
difficult. However, for ‘canned’ messages and responses, the prosody can be hand coded
yielding speech that is much more acceptable. The term Speech Synthesis also refers to
the technologies that enable computers or other electronic systems to output simulated
human speech (Weinschenk & Barker, 2000). They provide acoustic information that is
phonologically acceptable and meaningful to human listeners.
2.5 HISTORY OF TEXT-TO-SPEECH
Human fascination with talking machines is not new. For decades, people have tried to
endow machines with the capacity to speak; even before the machine age, humans hoped
to give speech to non-living objects. Early men attempted to show that their idols could
speak, usually by hiding a person behind the figure or channeling voices through air
tubes. Gerbert of Aurillac, Albertus Magnus, and Roger Bacon produced early examples
of "speaking heads".
The first efforts to create synthetic speech were made over two centuries ago (Flanagan,
1972). One of the earliest attempts at speech synthesis was made in 1779, when a Russian
scientist called Christian Kratzenstein constructed a set of five acoustic resonators (see
figure 2.2) which, when activated by a vibrating reed, produced imitations of the vowels.
He built models of the human vocal tract that could produce the five long vowel sounds (/
a /, / e /, / i /, / o / and / u /).
Figure 2.2 Kratzenstein’s resonators (Owens, 1993).
In 1791, Wolfgang Von Kempelen, a Hungarian, demonstrated a more successful
machine that could speak whole words and phrases. It consisted of a large bellows to
supply air to a reed that in turn excited a hand-held rubber tube (resonator). Extra tubes
and whistles were added to imitate the nasal and fricative sounds.
In 1837, Charles Wheatstone constructed his well-known version of Von Kempelen's
speaking machine. It was somewhat more complicated and was capable of producing
vowels and most of the consonant sounds; some sound combinations and even full words
were also possible. Vowels were produced with the vibrating reed and all passages
closed; resonances were varied by deforming the leather resonator, as in Von
Kempelen's machine. Consonants, including nasals, were produced with turbulent flow
through a suitable passage with the reed off. In addition, in 1857 M. Faber built the
"Euphonia", and Wheatstone's design was resurrected by Paget in 1923; see figure 2.3.
Figure 2.3 Wheatstone’s reconstruction of von Kempelen’s speaking machine (Flanagan, 1972).
One of the first electrical synthesizers was a device called the Voder (from "voice
demonstrator"); see figure 2.4. It was built in 1938 and had a voicing/noise source with a
foot pedal for fundamental frequency control. Signals were routed through ten band-pass
filters, and the device was played like a musical instrument.
Figure 2.4 The Voder electronic synthesizer (Owens, 1993)
The first formant synthesizer was constructed by Lawrence in 1953; it consisted of three
electronic formant resonators connected in parallel. The input signal was either a buzz or
noise. A moving glass slide was used to convert painted patterns into six time functions
controlling the three formant frequencies, voicing amplitude, fundamental frequency,
and noise amplitude. At about the same time that Lawrence's machine was introduced,
Gunnar Fant introduced the first cascade formant synthesizer, OVE (Orator Verbis
Electris), which consisted of formant resonators connected in cascade.
The first complete TTS synthesizer for English was developed at the Electrotechnical
Laboratory in Japan in 1968 by Noriko Umeda and colleagues (Klatt, 1987). It was based
on an articulatory model and included a syntactic analysis module with sophisticated
heuristics. The speech was quite intelligible but monotonous and far from the quality of
present synthesizers.
More details on the history of speech synthesis, in chronological order, are given in Klatt
(1987). Some milestones of speech synthesis development are shown in figure 2.5.
Figure 2.5 Some milestones in speech synthesis (Sami, 1999).
One of the more sophisticated methods applied recently in TTS Synthesizer Systems is
the hidden Markov model (HMM). HMMs have been applied to speech recognition since
the late 1970s; for TTS Synthesizer Systems, they have been used for about two decades
(Schroeder, 1993).
2.6 TTS METHODS, TECHNIQUES AND ALGORITHMS
Given the complexities and difficulties that researchers face in producing a high-quality
TTS synthesis system, many methods have been followed to accomplish this goal. These
methods are classified into two main categories.
The first category synthesizes speech without using human sound and is known as
Synthesized Speech. It can be divided into three parts:
Formant Synthesis specifies directly the formant frequencies and bandwidths as well as
the source parameters (Weinschenk & Barker, 2000). Formant synthesizers are also
referred to as rule-based synthesizers: generalized rules are extracted from the filtered
information, and the input is then tested against these rules. The vocal tract transfer
function can be modeled by simulating formant frequencies and formant amplitudes,
which makes the formants much more evident. The speech output is determined by
phonetic rules that specify the parameters necessary to synthesize a desired utterance
with a formant synthesizer.
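The resonator idea behind a parallel formant synthesizer can be sketched briefly (Python
with NumPy/SciPy; the formant frequencies and bandwidths below are rough,
illustrative values for an /a/-like vowel, not parameters of any particular synthesizer):

    import numpy as np
    from scipy.signal import lfilter

    def resonator(signal, freq, bandwidth, rate):
        """Second-order digital resonator centred on one formant."""
        r = np.exp(-np.pi * bandwidth / rate)
        theta = 2 * np.pi * freq / rate
        a = [1.0, -2 * r * np.cos(theta), r * r]   # pole pair at the formant
        return lfilter([1.0 - r], a, signal)

    rate = 16000
    n = np.arange(rate // 2)                         # half a second of samples
    source = (n % (rate // 100) == 0).astype(float)  # 100 Hz impulse train

    # Three parallel resonators with /a/-like formants (approximate values).
    vowel = sum(resonator(source, f, bw, rate)
                for f, bw in [(700, 130), (1220, 70), (2600, 160)])
    vowel /= np.abs(vowel).max()                     # normalize the output

A rule-based synthesizer would vary these frequencies and amplitudes over time
according to its phonetic rules rather than holding them fixed as this sketch does.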
Linear Prediction is a method initially designed for speech coding systems, but it is used
in speech synthesis as well; the first such synthesizers were derived from speech coders.
Like formant synthesis, basic LP is based on the source-filter model of speech, but the
digital filter coefficients are estimated automatically from a frame of natural speech. The
main deficiency of the ordinary LP method is that it represents an all-pole model, which
means that phonemes containing anti-formants, such as nasals and nasalized vowels, are
poorly modeled. The quality is also poor for short plosives, because the time-scale of
these events may be shorter than the frame size used for analysis. With these
deficiencies, speech synthesis quality with the standard LP method is generally
considered poor, but with some modifications and extensions to the basic model the
quality may be increased. Several variations of linear prediction have been developed to
increase the quality of the basic method (Donovan, 1996). Figure 2.6 shows the basic
structure of the linear predictive synthesizer.
Figure 2.6 Linear predictive synthesizer (Donovan, 1996).
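A minimal sketch of the all-pole analysis/synthesis idea follows (Python; it assumes
`frame` is a one-dimensional NumPy array holding one frame of natural speech, longer
than the predictor order):

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lpc_coefficients(frame, order=12):
        """Estimate all-pole coefficients by the autocorrelation method."""
        acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        # Solve the Toeplitz normal equations R a = r for the predictor a.
        predictor = solve_toeplitz(acf[:order], acf[1:order + 1])
        return np.concatenate(([1.0], -predictor))  # synthesis-filter denominator

    def lpc_resynthesize(frame, rate, pitch_hz=110, order=12):
        """Drive the estimated all-pole filter with an impulse train (voiced source)."""
        a = lpc_coefficients(frame, order)
        excitation = np.zeros(len(frame))
        excitation[::rate // pitch_hz] = 1.0
        return lfilter([1.0], a, excitation)

Because the filter has poles only, spectral zeros (anti-formants) cannot be represented,
which is exactly the weakness for nasals noted above.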
Articulatory Synthesis is a method of synthesizing speech by controlling the speech
articulators. This method determines the characteristics of the vocal tract filter by means
of a description of the vocal tract geometry and places the potential sound sources within
this geometry. Articulatory synthesis typically involves models of the human articulators
and vocal cords. The articulators are usually modeled with a set of area functions
between glottis and mouth. The first articulatory model was based on a table of vocal
tract area functions from larynx to lips for each phonetic segment (Klatt, 1987).
Figure 2.7 shows a block diagram of the main components of a typical articulatory
speech synthesizer. It consists of three main components: an articulatory model, an
acoustic-tube model and a vocal-cord model. The articulatory model transforms a set of
perhaps 6-10 articulatory parameters, representing the positions of the speech
articulators (lips, tongue, jaw, etc.), into a cross-sectional area function, A(x), of the
vocal tract. The acoustic-tube model is driven by an excitation model that simulates the
modulated airflow from the vocal cords and may include detailed modeling of cord
vibration.
Figure 2.7 Block diagram of articulatory speech synthesizer (Owens, 1993).
The second category synthesizes speech using recorded human sound and is known as
Concatenative Synthesis.
Concatenative Synthesis is the most used technique today: segments of speech are tied
together to form a complete speech chain, and the output is produced by coupling
segments from a database to create the required sequence. This technique requires some
manual preparation of the speech segments. There are three categories within this
method: unit selection, domain-specific synthesis, and diphone synthesis.
In addition, trainable concatenative speech synthesis uses computer assembly of recorded
voice sounds to create meaningful speech output. The basic process for developing a
concatenated synthesizer is to have a human reader read units of speech and to store the
recorded units. These units are then assembled on demand according to given business
rules; this works best for systems requiring a small vocabulary (Weinschenk & Barker,
2000). Concatenative synthesis using large speech databases has also become popular
due to its ability to produce high-quality, natural speech output; the large footprint of
such systems does not present a practical problem for applications where the synthesis
engine runs on a server with enough computational power and sufficient storage.
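At its simplest, the concatenation step just splices prerecorded unit files end to end, as in
this sketch (Python standard library, shown for illustration rather than the Visual Basic
used later in this dissertation; the diphone file names and directory layout are
hypothetical, and no smoothing is applied at the joins):

    import wave

    def concatenate_units(unit_names, unit_dir="units", out_path="output.wav"):
        """Splice prerecorded units (one wave file each) into one utterance.
        Assumes all unit files share the same sample rate and format."""
        frames, params = [], None
        for name in unit_names:
            with wave.open(f"{unit_dir}/{name}.wav", "rb") as unit:
                if params is None:
                    params = unit.getparams()
                frames.append(unit.readframes(unit.getnframes()))
        with wave.open(out_path, "wb") as out:
            out.setparams(params)
            for chunk in frames:
                out.writeframes(chunk)

    # e.g. concatenate_units(["s-a", "a-l", "l-a"])  # hypothetical diphone names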
It can be concluded that most existing commercial text-to-speech systems can be
classified as either formant synthesizers (Klatt, 1980) or concatenation synthesizers
(Donovan, 1996; Hamza, 2000). Formant synthesis was dominant for a long time, but
today the concatenative method is becoming more and more popular. The articulatory
method is still too complicated for high-quality implementations, but may arise as a
potential method in the future. Table 2.1 summarizes the main differences between the
concatenative and parametric types of speech synthesis.
Table 2.1 Comparison between Concatenative and Parametric types of speech synthesis

                                         Concatenative Synthesis   Parametric Synthesis
                                                                   (Formant and Articulatory)
Basis                                    Human-voiced fragments    Algorithm
Quality                                  More natural              More synthetic
Prosody                                  Lower                     Higher
Memory requirement                       Higher                    Lower
Algorithm                                Splice phonemes           Model vocal tract
Effort to develop a new voice/language   Higher                    Lower
2.7 THE IMPORTANCE OF TEXT-TO-SPEECH
TTS is emerging as a major feature in telecommunication systems. Several factors are
involved, including increased computer power, the deregulation of the telephony
networks, and a general acceptance of TTS as a practical tool for business and
consumers. The increase in computer power has greatly enhanced TTS in all applications
and has made TTS systems for telecommunications much less expensive, on a per-port
basis, by allowing host-based solutions to become a reality without a high-cost, dedicated
Digital Signal Processing (DSP) resource board.
Deregulation of the telephony industry has also been a principal driving force for TTS
technology. The telecommunication industry has gone from a series of monopolies to
competitive industries. This competition has created a great need for differentiation
among companies offering similar services, and TTS provides an opportunity to offer
real-value applications and enhanced services to their customers.
Consumers, including business users, now expect greater automation and access to a
variety of telecommunication services. For many tasks, automation is practical and
enjoyable: calling for train schedules, airline departures and arrivals, or entertainment is
more easily and conveniently done using automated systems that incorporate Text-To-
Speech. All this shows the importance of speech synthesizers in every area of application.
2.8 APPLICATIONS OF TEXT-TO-SPEECH SYNTHESIS
A Text-To-Speech Synthesizer System is a computer-based system that can convert text
into speech. Over the last few decades, extensive work has been done on text-to-speech
synthesis for the English language, while other languages such as Arabic have received
only limited attention, as mentioned earlier. Concatenative speech synthesis can be
achieved in two different ways: by concatenating units of a fixed number and size, such
as phones or diphones, or by concatenating units of variable size, which is called unit
selection. Compared to unit selection, diphone synthesis is more challenging in terms of
signal processing, since only one example of each unit exists in the database. On the
other hand, diphone synthesis is preferable when building applications for devices such
as mobile phones, where memory size is the main concern.
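The difference can be made concrete with a toy unit-selection sketch (Python;
`target_cost` and `join_cost` are hypothetical functions supplied by the caller, and a real
system would search over all candidate sequences, e.g. with a Viterbi search, rather than
choosing greedily):

    def select_units(targets, candidates, target_cost, join_cost):
        """Greedy unit selection: for each target spec, pick the candidate
        recording minimizing target cost plus join cost to the previous unit."""
        path = []
        for spec in targets:
            best = min(
                candidates[spec],
                key=lambda unit: target_cost(spec, unit)
                + (join_cost(path[-1], unit) if path else 0.0),
            )
            path.append(best)
        return path

In diphone synthesis, `candidates[spec]` would hold exactly one recording per unit, so
the selection step disappears and all of the effort moves into signal processing at the
joins.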
Today we have Text-To-Speech Synthesizer Systems with a very high level of
intelligibility and a quality adequate for numerous applications. Examples of such
applications are given below:
2.8.1 Example Applications
• Applications for the Blind
Blind people can benefit from TTS Synthesizer Systems, which give them access
to written information that would otherwise be unavailable to them without a
human reader.
• Applications for the Deafened and Vocally Handicapped
Voice handicaps originate in mental or motor/sensory disorders. Machines
can be an invaluable support in the latter case. With the help of an especially
designed keyboard and a fast sentence-assembling program, synthetic speech
can be produced in a few seconds to remedy these impediments.
• Educational Applications
Synthesized speech can also be used in many educational situations; for
example, it provides a helpful tool for learning a new language in what is
known as a computer-aided learning system. It can also be used in interactive
educational applications.
• Applications for Telecommunications
In these systems, textual information can be accessed over the telephone.
They are mostly used when little interactivity is required and the texts are
fairly simple, such as short messages. Queries can be given through the user's
voice (which requires speech recognition) or through the telephone keypad.
• Applications for Multimedia
TTS enables man-machine communication that can help people with their
work and other activities, for example voice-interaction systems in cars.
• Fundamental and Applied Research
TTS synthesizers possess a very peculiar feature that makes them wonderful
laboratory tools for linguists: they are completely under control, so that
repeated experiments provide identical results (as is hardly the case with
human beings). Consequently, they make it possible to investigate the
efficiency of intonation and rhythm models. A particular type of TTS
synthesizer, based on a description of the vocal tract through its resonant
frequencies (its formants) and denoted a formant synthesizer, has also been
extensively used by phoneticians to study speech in terms of acoustic rules.
• Government Services
Government offices receive many calls requesting information. These range
from tax information from the Internal Revenue Service to the time and place
of town meetings. Much of this information can be dispensed via speech
synthesis over phone lines. To name a few applications: tax information,
road-closing information, lottery results, opening and closing times of public
buildings, and unemployment claims processing.
2.8.2 Other Applications and Future Directions
Text-To-Speech Synthesizer Systems can be used in all kinds of human-machine
interaction. For instance, in warning and alarm systems, synthesized speech may be used
instead of warning lights or buzzers to give more accurate information about the current
situation. It may also be used to announce desktop messages from a computer, such as
printer activity or received e-mail.
In the future, synthesized speech may also be used in language interpreters or several
other communication systems, such as videophones, videoconferencing, or talking mobile
phones. With talking mobile phones it is possible to increase usability considerably, for
example for visually impaired users or in situations where it is difficult or even
dangerous to look at visual information. It is obviously less dangerous to listen to the
output of a mobile phone than to read it, for example while driving a car.
2.9 THE CHALLENGES BEHIND THE TEXT-TO-SPEECH
The task of developing a TTS Synthesizer System is very challenging, as it requires a
great deal of work and understanding. Compared to limited-vocabulary systems, which
must produce only a small, predefined set of possible utterances, TTS Synthesizer
Systems must be able to intelligently handle any input text. Consider, for example, the
number 1904: in order to produce this number correctly, the system must determine that
1904 is to be pronounced “ألف وتسعمائة وأربعة”, that is, “one thousand nine hundred and
four” (as opposed to “nineteen o four”).
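As a concrete illustration of this normalization step, the sketch below expands a
four-digit number into its cardinal reading. It is a minimal sketch only: the helper
name and its coverage are illustrative, and it spells out the English gloss rather than
the Arabic word forms, which additionally require gender and case agreement.

```python
# Minimal text-normalization sketch: spell out 0 <= n <= 9999 as
# cardinal words (English gloss; a real Arabic front-end would map to
# Arabic words and handle agreement and context such as date readings).
ONES = ["", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen",
        "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
        "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def expand_number(n: int) -> str:
    """Spell out a small integer as cardinal words."""
    if n == 0:
        return "zero"
    parts = []
    if n >= 1000:
        parts.append(ONES[n // 1000] + " thousand")
        n %= 1000
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n:
        if parts:
            parts.append("and")
        if n < 20:
            parts.append(ONES[n])
        else:
            tens, ones = divmod(n, 10)
            parts.append(TENS[tens] + (" " + ONES[ones] if ones else ""))
    return " ".join(parts)

print(expand_number(1904))  # -> "one thousand nine hundred and four"
```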
The task of analyzing the input text, however, is only half of the challenge. The next
challenge is that, once the system has determined the desired pronunciations of the input,
it must generate the actual sounds that accurately realize these pronunciations. This task
is complicated by the fact that perceptually identical sounds, or phonemes, are
acoustically quite different in different phonetic contexts.
The precise duration and frequencies of a sound depend on many factors such as which
segments precede and follow it, its position in the word, syllable, or phrase, whether the
syllable containing it is emphasized, whether the speech is fast or slow, whether the voice
is that of a male or a female, and so on. The challenges of Arabic text-to-speech can be
divided into the following problems:
2.9.1 The Diacritization Problem
Written Arabic text does not normally contain the short vowels and other markings that
make the orthography easy to understand. In their article, Mayfield Tomokiyo et al.
(2003) compare the Arabic language with English. They note that the correct
pronunciation of an English word is often not obvious from its spelling and that there are
many words with multiple pronunciations; this problem can be solved by relying on
electronic lexicons that provide the correct pronunciation for an orthographic string.
Mayfield Tomokiyo et al. point out that Arabic has no such electronic solution.
To be able to synthesize the language, the system must know what the correct vowels
are. The authors mention two approaches to solving the vowelling problem for spoken
language: either inferring the vowels or enumerating the lexicon. Other authors who have
approached this specific problem are Al-Muhtaseb et al. (2003). They describe in their
article an Arabic Text-to-Speech System (ATTS) for classical Arabic in which they
solved the vowelization problem by implementing a processor for automatic vowelization
of the text before applying the speech rules. To be able to generate vowels automatically,
the processor requires the integration of morphological, syntactical, and semantic
information (Al-Muhtaseb et al., 2003).
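A minimal sketch of the lexicon-lookup approach mentioned above is shown below.
The two entries are illustrative placeholders, and a real system needs the
morphological, syntactic, and semantic information just described in order to choose
among alternative vowelizations of the same spelling.

```python
# Sketch of lexicon-based diacritization: look each word up in a
# vowelized lexicon and pass unknown words through unchanged.
LEXICON = {
    "كتب": "كَتَبَ",   # one possible vowelization of the string k-t-b
    "قلم": "قَلَمٌ",   # illustrative entry: "pen"
}

def diacritize(text: str) -> str:
    return " ".join(LEXICON.get(word, word) for word in text.split())

print(diacritize("كتب قلم"))
```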
2.9.2 Dialects
Arabic is spoken in more than 23 countries and, as mentioned before, by 300 million
people. There are many varieties of the Arabic language, many dialects that reflect the
social diversity of its speakers. Mayfield Tomokiyo et al. (2003) regard these dialect
varieties as a problem for speech synthesis for several reasons. First, which variety is to
be generated: Modern Standard Arabic (MSA) or one of the dialects? The second
problem is that generating MSA limits the audience, because MSA is fully understood
only by people with a high level of education and/or a good level of reading and writing.
The third problem concerns the transcription of spoken Arabic: spoken Arabic has very
few occasions to be written down. News and newspapers are delivered in Modern
Standard Arabic, to some extent. For example, nunation is often not fully pronounced in
the news, which can be considered an influence of the dialect one speaks. Mayfield
Tomokiyo et al. note that speakers of the same dialect can differ significantly in their
choice of which vowel is used in the spoken language; the reason is that the vowelling of
Modern Standard Arabic is only learned in school.
2.9.3 Differences in Gender
Mayfield Tomokiyo et al. (2003) bring up the issue of gender differences in speech.
Arabic has inflectional components that reflect the gender of the speaker and of the
listener. For example, consider the word “تَكَلَّمَ” (“takallama”), which means “he spoke”
or “he has spoken”. If the listener is male, the imperative form “تَكَلَّمْ” (“takallam”) is
said, whereas if the listener is female, “تَكَلَّمِي” (“takallami”) is said, where the final long
vowel “i” indicates the female gender. Therefore, when speech is the final product in
systems such as a translation system or a synthesizer, appropriate gender marking
becomes necessary and should be done correctly.
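As a toy illustration, a generation component could select the inflected form from a
listener-gender feature. The transliterated forms are taken from the example above;
this is a sketch of the idea, not a general solution, since every inflected word would
need such handling.

```python
def speak_imperative(listener_gender: str) -> str:
    # Imperative of "takallama" (to speak), inflected for the listener.
    return "takallami" if listener_gender == "female" else "takallam"

print(speak_imperative("female"))  # takallami
print(speak_imperative("male"))    # takallam
```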
2.10 EXISTING PRODUCTS
This section introduces some of the commercial TTS Synthesis Systems available today.
More than 28 TTS Synthesizer Systems currently exist in the market. Some of the text in
this section is based on information collected from the Internet, mostly from the
manufacturers' and developers' official homepages.
The first commercial TTS Synthesis Systems were mostly hardware-based, and their
development was very time-consuming and expensive. As computers have become more
and more powerful, most current synthesizers are software-based systems. Software-based
systems are easy to configure and update, and they are usually much less expensive than
hardware systems. However, a stand-alone hardware device may still be the best solution
when a portable system is needed.
2.10.1 MBROLA Project
The MBROLA project is one of the main systems that offer an Arabic voice. The
MBROLA project was initiated by the TCTS Laboratory at the Faculté Polytechnique de
Mons, Belgium. The main goal of the project is to obtain speech synthesis for as many
languages as possible. MBROLA is distributed for non-commercial purposes; another
aim of the project is to stimulate academic research, especially in prosody generation.
The MBROLA speech synthesizer is based on diphone concatenation. Given a list of
phonemes together with prosodic information as input, MBROLA produces speech
samples on 16 bits (linear). MBROLA uses the PSOLA (Pitch Synchronous Overlap
Add) method that was originally developed at France Telecom (CNET). PSOLA is
actually not a synthesis method in itself, but it allows prerecorded speech samples to be
concatenated and provides good control over pitch and duration.
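The sketch below writes a minimal MBROLA .pho input file: each line carries a
phoneme symbol, a duration in milliseconds, and optional (position %, pitch Hz)
pairs. The phoneme symbols and values here are illustrative placeholders; the real
symbol set depends on the chosen diphone database.

```python
# Write a tiny MBROLA .pho file (phoneme, duration, pitch targets).
phones = [
    ("_", 100, []),             # leading silence
    ("s", 80, [(0, 110)]),      # pitch 110 Hz at the phoneme start
    ("a", 120, [(50, 115)]),    # pitch 115 Hz at the midpoint
    ("l", 70, []),
    ("a", 200, [(100, 100)]),   # fall to 100 Hz at the end
    ("_", 100, []),             # trailing silence
]

with open("utterance.pho", "w") as f:
    for phoneme, dur_ms, pitch_points in phones:
        targets = " ".join(f"{pos} {hz}" for pos, hz in pitch_points)
        f.write(f"{phoneme} {dur_ms} {targets}".rstrip() + "\n")
```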
The diphone databases are currently available for US English, UK English, Breton,
Brazilian Portuguese, Dutch, French, German, Romanian, Spanish, Greek, Welsh,
several Indian languages, Venezuelan Spanish, Hungarian, Turkish and Arabic. Some of
these languages exist with both male and female voices (MBROLA).
2.10.2 Acapela Group
The Acapela Group brings together speech technologies that have been developed over
the last 20 years; both speech synthesis and speech recognition have been created and
improved by Acapela. The Acapela Group evolved from the strategic combination of
three major European companies in vocal technologies: "Babel Technologies", created in
Mons, Belgium; "Infovox", created in Stockholm, Sweden; and "Elan Speech", created in
Toulouse, France. Acapela currently owns three technologies: TTS by diphone, TTS by
unit selection, and automatic speech recognition. Acapela is currently available for US
English, UK English, Arabic, Belgian Dutch, Dutch, French, German, Italian, Polish,
Spanish and Swedish.
2.10.3 ARABTALK
The ARABTALK TTS Synthesis System was developed at Research and Development
International (RDI) for the Arabic language. ARABTALK is a state-of-the-art, corpus-based
concatenative TTS System. The system employs Artificial Neural Network (ANN)
based statistical prosody models for duration, energy, and global pitch-contour prediction.
In addition, it has a real-time synthesis-by-selection algorithm to explore a large speech
corpus. ARABTALK has a Hidden Markov Model (HMM) based procedure to
automatically time-align new voice transcriptions to their acoustic phoneme boundaries.
The system is multi-user and thread-safe, making it suitable for server-based applications.
ARABTALK has a mature Arabic phonology framework; the current system has 41
phonetic letters, obtained by adding extra phonemes to account for the effect of the
pharyngealized phonemes.
The system has two databases, one for a male speaker at a 22 kHz sampling rate (one
hour) and the other for a female speaker at a 16 kHz sampling rate (four hours). The
speech is coded into 12-dimensional MFCCs plus log energy and their derivatives. The
EGG signal is recorded with each utterance to support pitch-synchronous analysis and, if
necessary, prosodic modification during the synthesis process. The system uses an
HMM-based Viterbi alignment procedure developed at RDI for this purpose (Wael et al.,
2000). The Viterbi alignment procedure can be summarized as the problem of searching
for the time boundaries of a known sequence of phoneme HMM models. Since the best
state sequence, known as the Viterbi path, is obtained during the decoding process, the
time boundaries can be read off directly. This process is illustrated in figure 2.8.
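A much-simplified, illustrative version of such an alignment search is sketched
below. It assumes a precomputed matrix of per-frame log-likelihoods under each
phoneme model (stand-ins for real HMM state likelihoods) and recovers the phoneme
start times by dynamic programming; it is not RDI's implementation.

```python
import numpy as np

def align(score: np.ndarray) -> list[int]:
    """score: (T frames, N phonemes) log-likelihoods; the phonemes must
    be visited in order. Returns the start frame of each phoneme."""
    T, N = score.shape
    best = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)   # 0 = stayed, 1 = advanced
    best[0, 0] = score[0, 0]
    for t in range(1, T):
        for i in range(N):
            stay = best[t - 1, i]
            move = best[t - 1, i - 1] if i > 0 else -np.inf
            if move > stay:
                best[t, i], back[t, i] = move + score[t, i], 1
            else:
                best[t, i], back[t, i] = stay + score[t, i], 0
    # Trace back from the last frame of the last phoneme.
    starts, i = [], N - 1
    for t in range(T - 1, 0, -1):
        if back[t, i]:
            starts.append(t)   # phoneme i began at frame t
            i -= 1
    starts.append(0)           # the first phoneme starts at frame 0
    return starts[::-1]
```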
The overall architecture and general features of the ARABTALK Text-To-Speech system
for the Arabic language have been presented above. An online demo is available at
http://www.rdi-eg.com/rdi/research/Arabtalk.asp. The system is corpus-based, includes
many statistical models, and performs real-time unit selection with different caching
methods.
Figure 2.8 Viterbi Alignment (Yasser, 2000).
2.10.4 Sakhr TTS
The Sakhr TTS engine converts any Arabic/English text into a human-sounding voice.
Sakhr has focused over the last five years on creating an Arabic TTS engine whose
quality can match the human voice. This technology gives businesses a competitive edge
by allowing them to provide their customers with the latest static and dynamic
information anytime, anywhere, using ordinary telephones and mobiles.
Sakhr also developed the Diacritizer engine, which can automatically insert the diacritics
needed in Arabic texts. The Diacritizer is the main component of Arabic TTS: without it,
the output quality of the TTS engine would be inaccurate and unclear. Since Arabic
native speakers write Arabic text without diacritics, the TTS engine must handle non-diacritized
text. The Diacritizer converts the non-diacritized text into a diacritized text,
and the TTS engine then converts it into a clear, human-sounding Arabic voice.
Moreover, the Text-To-Speech Software Development Kit (SDK) converts any
computer-readable text into human-sounding synthetic speech. Arabic is at least an order
of magnitude more difficult than other common languages due to the lack of diacritics,
i.e. the vowelization needed to properly utter any input text.
A major limitation in the development of Arabic TTS has been the constraints imposed
by handling undiacritized Arabic text. The Sakhr automatic Diacritizer is integrated with
a speech synthesizer engine to produce a real system that takes undiacritized Arabic as
input and delivers high-quality speech as output. Generating such output would be
impossible without an automatic Diacritizer, due to the abundance of different
pronunciations for the majority of words written without diacritics. Sakhr used its unique
diacritizer to provide the TTS synthesizer with adequate vowelization to produce natural
and intelligible sound.
The Sakhr TTS Engine is composed of three basic parts: the Linguistic Module converts
the input text into a phonetic transcription; the Phonetic Module calculates speech
parameters; and the Acoustic Module uses those parameters to generate the synthetic
speech signal.
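A skeletal rendering of this three-module decomposition is sketched below. The
function bodies are trivial stand-ins, not Sakhr's actual implementation; they only
show how the three stages chain together.

```python
# Three-stage TTS pipeline skeleton (illustrative stand-in bodies).
def linguistic_module(text: str) -> list[str]:
    # Real module: diacritization + grapheme-to-phoneme conversion.
    return list(text.replace(" ", ""))          # fake "phonemes"

def phonetic_module(phonemes: list[str]) -> list[tuple[str, int, float]]:
    # Real module: predict duration (ms) and pitch (Hz) per phoneme.
    return [(p, 80, 120.0) for p in phonemes]

def acoustic_module(params: list[tuple[str, int, float]]) -> bytes:
    # Real module: waveform generation from the parameter track.
    return b"".join(p.encode() for p, _, _ in params)   # placeholder

signal = acoustic_module(phonetic_module(linguistic_module("sample text")))
```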
2.10.5 Élan TTS
Élan TTS translates any text into speech. The software was developed by the Lernout &
Hauspie company, a major worldwide supplier of Text-To-Speech software and a leading
developer of this advanced technology. The software simply reads IT-generated texts out
loud, with the flexibility and richness of natural-sounding speech. The text can be spoken
in several languages, with both male and female voices, depending on the user's
requirements. For the telecom market, the company has developed a speech synthesizer
product named Speech Cube, as well as the ProVerbe speech platform and the ProVerbe
speech unit.
Speech Cube is a high-density, multilingual, multi-channel software Text-To-Speech
component for telecom. It has been designed to run in telephony applications and is
compatible with all market standards. The software can automatically convert any
written text into speech and read it aloud; the user can choose any one of the eight
languages provided. The software is available under Windows NT, Linux, SCO, QNX,
and Solaris. The main process, named synthesis, interacts with the user or with the
application via a standard input stream. The sound formats supported are 8 kHz, 11 kHz,
and 16 kHz, with or without a Sun header (16 bits per sample).
Several key features of the software are highlighted: it can run several languages
simultaneously; it is progressive, meaning server capacity is easy to upgrade; it offers a
large-volume text treatment capacity; an unlimited vocabulary; a high-quality voice with
smooth and natural intonation; voice speed and pitch control; and male and female
voices.
The software can be used in several ways. As an engine, it can translate text into speech
in real time for a voice board supporting a multimedia audio interface. As a process, it
can convert a text database into speech for later use. As a server, it is available to a group
of applications on a network via a client/server protocol. Élan Text-To-Speech also
offers benefits such as simplified access to value-added systems, with information
updated quickly and easily for improved efficiency.
Table 2.2 shows a summary of the existing Arabic Text-To-Speech Synthesizer Systems.
The table compares the five products in terms of manufacturer, platform, supported
languages, voices, controls, requirements, and, lastly, the synthesis method used in
building the system. From the table it is clear that all five systems support the Arabic
language; ARABTALK TTS supports only Arabic, while the rest are multilingual
systems. It is also apparent that four of the five systems use the same synthesis method,
namely concatenative synthesis, whereas ARABTALK TTS uses an Artificial Neural
Network (ANN) based synthesis method. Lastly, in terms of voices, four of the five
systems have both male and female voices; ACAPELA has a male voice only.
2.11 SUMMARY
In conclusion, speech is a complex signal from which we extract many types of
information: the message content and meaning, the nature of the transmission medium,
and the identity and condition of the speaker. A considerable amount of research has
been carried out in the area of TTS synthesis. This chapter has also described the
mechanism of speech production and given a brief history of TTS. The product range of
TTS synthesizers is very wide, and it is quite unreasonable to present all the possible
products or systems available.
Table 2.2 Summary of Text-To-Speech Existing Products

MBROLA
  Manufacturer/Developer: TCTS Laboratory, Faculté Polytechnique de Mons, Belgium (http://www.mbrola.com)
  Platforms: UNIX; Windows 95/98/XP
  Languages: English, French, Spanish, Italian, German, Hungarian, Romanian, Turkish, Arabic
  Voices: Male, Female
  Controls/Support: Speed, intonation contours, lexical stress, sentence accent, segmental durations, pitch and pitch range, gender, age, vocal tract scaling, glottal source parameters
  Requirements: 32 Mb memory, 15 Mb disk, Pentium 75 / 2 Mb memory, 10 Mb disk
  Method: Concatenative synthesis

ACAPELA
  Manufacturer/Developer: Acapela Group (http://www.acapela.com)
  Platforms: Windows 95/98/NT/XP; UNIX
  Languages: English, Polish, Spanish, Italian, Arabic, Swedish
  Voices: Male
  Controls/Support: -
  Requirements: Pentium 75 MHz, 160 Mb disk, 8 Mb memory (UNIX: 32 Mb)
  Method: Concatenative synthesis

Arabtalk TTS
  Manufacturer/Developer: Research and Development International (RDI) (http://www.rdi-eg.com/rdi/research/Arabtalk.asp)
  Platforms: Windows 98/NT/2000/XP
  Languages: Arabic
  Voices: Male, Female
  Controls/Support: -
  Requirements: -
  Method: Artificial Neural Networks (ANN), statistical prosody, Hidden Markov Models

Sakhr TTS
  Manufacturer/Developer: Sakhr Software (http://www.sakhr.com/TTS/TTS.asp)
  Platforms: Windows 98/NT/2000/XP
  Languages: Arabic, English
  Voices: Male, Female
  Controls/Support: -
  Requirements: -
  Method: Unit selection / diphone concatenative synthesis

Elan TTS
  Manufacturer/Developer: Élan Speech, developed by Lernout & Hauspie (http://www.tmaa.com/tts/Elan_profile.htm or http://www.elanspeech.com)
  Platforms: Windows NT; Linux; SCO; QNX; Solaris
  Languages: American English, British English, French, Dutch, German, Portuguese, Romanian, Spanish, Arabic
  Voices: Male, Female
  Controls/Support: Pitch and speed modification, timbre alteration; chip, embedded, and client/server deployments
  Requirements: -
  Method: Unit selection / diphone concatenative synthesis
CHAPTER THREE
CONCATENATIVE SYNTHESIS
3.1 OVERVIEW
One of the most popular methods of synthesizing speech from text is by stringing
together, or concatenating, prerecorded words, syllables, or other speech segments (Olive,
1996). This avoids many of the problems encountered in phoneme-to-phoneme synthesis,
such as the co-articulatory effects between neighboring speech sounds (Schroeter, 1996).
Still, even words do not usually occur in isolation: the words immediately preceding or
following a given word influence its articulation, its pitch, its duration and its stress,
often depending on the meaning of the utterance. This chapter describes the
Concatenative Synthesis Method and its types in detail.
Synthetic voices are made by concatenating units of sound that have been previously
stored in a reference database. The contents of these units and methods of concatenation
vary, but the principle of concatenation is universal for TTS involving all but the briefest
messages. Nowadays, the use of actual speech waveforms has become increasingly
popular, where stored waveforms of various sizes are fetched as needed, with adjustments
made mostly at unit boundaries, but sometimes more generally throughout the utterance
(Browman, 1980).
The Concatenative Synthesis Method uses a large database of source sounds, segmented
into units, and a unit selection algorithm that finds the sequence of units that best matches
the sound or phrase to be synthesized, called the target. The selection is performed
according to the descriptors of the units, which are characteristics extracted from the
source sounds, or higher-level descriptors attributed to them. The selected units can then
be transformed to fully match the target specification, and are concatenated. However, if
the database is sufficiently large, the probability is high that a matching unit will be
found, so the need to apply transformations is reduced. The units can be non-uniform, i.e.
they can comprise anything from a sound snippet or an instrument note up to a whole
phrase. Concatenative Synthesis can be more or less data-driven: instead of supplying
rules constructed by careful thinking, as in a rule-based approach, the rules are induced
from the data itself. The advantage of this approach is that the information contained in
the many sound examples in the database can be exploited.
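The following minimal sketch illustrates the selection idea. The descriptor names,
weights, and the greedy, purely local search are all illustrative simplifications;
practical systems search globally over the whole unit sequence.

```python
# Greedy unit-selection sketch: pick, for each target, the database
# unit minimizing target cost plus concatenation cost with the
# previously chosen unit. Assumes every target phone has candidates.
def target_cost(unit: dict, target: dict) -> float:
    return (abs(unit["pitch"] - target["pitch"])
            + 0.5 * abs(unit["duration"] - target["duration"]))

def concat_cost(prev: dict, unit: dict) -> float:
    # Penalize pitch discontinuity at the join.
    return abs(prev["pitch"] - unit["pitch"]) if prev else 0.0

def select(database: list[dict], targets: list[dict]) -> list[dict]:
    chosen, prev = [], None
    for tgt in targets:
        candidates = [u for u in database if u["phone"] == tgt["phone"]]
        best = min(candidates,
                   key=lambda u: target_cost(u, tgt) + concat_cost(prev, u))
        chosen.append(best)
        prev = best
    return chosen
```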
3.2 CONCATENATIVE SYNTHESIS
The term Speech Synthesis was originally used solely for the generation of speech
sounds entirely within a machine that, to some extent, modeled the human speaking
system; the applications were mainly research in speech production and perception. More
recently, particularly in an engineering environment, speech synthesis has come to mean
the provision of information in the form of speech from a machine, in which the
messages are structured dynamically to suit the particular circumstances. The
applications include simple information services, reading machines for the blind and
communication aids for people with speech disorders. Speech synthesis can also be an
important part of complex man-machine systems, where various types of structured
conversation can be conducted using voice output, with either automatic speech
recognition or key pressing for the man-to-machine direction of communication.
It is an indication of the times that parametric synthesizers are becoming old-fashioned.
Neither the allophone nor the prosody problem has yet been solved satisfactorily.
Progress is being made, but a new approach has arrived in the form of concatenative
synthesis. Today it is feasible for the units of synthesis to be digitized human speech.
The job of the "synthesizer" is then to arrange these units into the desired output, adjust
prosody, and smooth the boundaries between units to prevent infelicities in articulation,
all of which are still worthy challenges. TTS Synthesis is the process of converting a
written text into artificial speech; the system processes the text and reads it aloud in a
computer-based program.
Concatenative Synthesis has been around since the 1950s; however, the writer Jonathan
Swift had already anticipated the nature of concatenative synthesizers in Gulliver's
Travels, published in 1726:
These bits of wood were covered on every square with paper pasted on them, and on these papers were written all the words of their language, in their several moods, tenses, and declensions, but without any order. The professor then desired me to observe, for he was going to set his engine at work. The pupils, at his command, took each of them hold of an iron handle, whereof there were forty fixed round the edges of the frame, and giving them a sudden turn, the whole disposition of the words was entirely changed.
Concatenative Synthesis uses actual short segments of recorded speech that were cut
from recordings, and stored in an inventory voice database as waveforms, or encoded by
a suitable speech coding method.
Initially, Concatenative Synthesis consisted only of digitally recorded full utterances, any
one of which could be played back on request according to situational needs. If the
spoken output consists of a relatively small vocabulary of words, each one can be
recorded digitally and played back in the appropriate order on request.
Each unit (syllable, word, etc.) has the intelligibility and naturalness of human speech
within the limits of digitization. This applies mainly to the pronunciation of consonants,
and similarly to the vowel durations in stressed syllables, which parametric synthesizers
cannot get consistently right. Concatenative Synthesis at the word level solved the
allophone problem, and that part of the prosody problem concerned with the relative
durations of segments within the word.
Figure 3.1 below shows a block diagram of a typical concatenative TTS system. The
front-end on the left converts a particular input text string into a string of phonetic
symbols and prosody (fundamental frequency, duration, and amplitude) targets. The
front-end uses a set of rules and/or a pronunciation dictionary. With a string of phonetic
symbols, it constructs target values for fundamental frequency (i.e., pitch), phoneme
durations, and amplitudes. The center block in figure 3.1 gathers the units according to
the list of targets set by the front-end. These units are selected from a store that holds the
inventory of available sound units.
Different types of speech units may be stored in the inventory of a concatenative TTS
system. Storing whole-word units is unreasonable for general TTS because of the
tremendous demands on a voice talent, who would have to read a few hundred thousand
words in a consistent voice and manner. Even if they were recorded successfully in
multiple sessions spread over several weeks, the lack of co-articulation and phonetic
recoding at word boundaries may result in unnatural-sounding speech. On the other hand,
using phones is likewise unacceptable because of the large co-articulatory effects that
exist between adjacent phones.
Figure 3.1 Block diagram of a concatenative text-to-speech system (Olive, 1996).
As a result, transitions from one unit to the next may be audible as glitches that introduce
perceptually disruptive discontinuities. Naturally, longer units are more likely to result in
higher-quality synthesis, given that the rate of concatenations (how many unit-to-unit
transitions occur per second of speech) is lower than in the case of shorter units. On the
other hand, a larger set of longer units is needed to cover any application domain. Until
the mid-1990s, most practical TTS implementations compromised by using one of two
types of inventory units: the diphone and the demi-syllable.
Inter-unit effects, and prosody over and above the level of the recorded units, are two
problems of concatenative speech synthesis. For example, in naturally spoken speech,
words are run together; there are no white spaces except where we pause for breath or
effect. Natural between-word articulation still needs to be simulated. Likewise,
appropriate prosody has to be computed, because it depends heavily on the context of the
whole utterance, which is not known when the speech units are prerecorded. With little
computing power, early concatenative synthesizers could only speak in a monotone with
discontinuities between each word.
Parametric synthesis long ruled this area because it could handle a vocabulary of
unlimited size. Lately, however, more powerful computers have made it possible for
hundreds of thousands of words to be prerecorded digitally. The synthesizer reorganizes
them as needed, smooths the joins between words, and attempts to approximate the
correct prosody, all in real time. Still, an unlimited vocabulary is something only
synthesizers with sound-based units are capable of producing; one cannot use whole
words as units for this. Apart from the impracticality, word-based synthesis is unable to
handle the many applications that require a large, unlimited vocabulary, such as ones that
"read" texts not known in advance.
3.2.1 Phonemes
Phonemes are possibly the most commonly used units in speech synthesis, since they are
the standard linguistic representation of speech. The list of basic units is usually between
40 and 50, which is obviously the smallest compared to other units (Allen et al., 1987).
Using phonemes gives maximum flexibility with rule-based systems. However, some
phones, such as plosives, do not have a steady-state target position and are difficult to
synthesize, and the articulation must also be formulated as rules. Phonemes are
occasionally used as the input to a speech synthesizer, for example to drive a diphone-based
synthesizer.
3.2.2 Diphones
A diphone is the snippet of speech from the middle of one phone to the middle of the next
phone. The middle of a phone tends to be its acoustically most stable region; therefore,
diphones represent acoustic transitions from the stable midsection of one phone to the
next. In other words, diphones extend from the middle point of the steady-state part of
one phone to the middle point of the subsequent one, and thus include the transition
between contiguous phones. This means that the concatenation point falls in the most
steady-state region of the signal, which reduces the distortion introduced at concatenation
points. Another benefit of diphones is that the co-articulation effect no longer needs to be
formulated as rules. Theoretically, the number of diphones is the square of the number of
phonemes (plus allophones), but not all combinations of phonemes are needed. The
number of units is normally from 1500 to 2000, which increases the memory
requirements and makes data collection more difficult compared to phonemes. On the
other hand, the amount of data is still acceptable and, given its other advantages, the
diphone is an appropriate unit for sample-based text-to-speech synthesis. The number of
diphones may be reduced by inverting symmetric transitions, so that, for example, a
transition recorded at the beginning of a word can be mirrored to serve at the end.
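The inventory arithmetic quoted above can be checked in one line; P is an assumed
typical phoneme count, not a figure from any particular system.

```python
# With P phonemes, the theoretical upper bound on diphones is P * P,
# although many combinations never occur in the language.
P = 40
print(P * P)   # 1600, in line with the 1500-2000 range cited above
```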
3.2.3 Demi-syllables
A demi-syllable encompasses half a syllable: either the syllable-initial portion up to the
first half of the syllable nucleus, or the syllable-final portion starting from the second half
of the syllable nucleus. The number of demi-syllables in English is roughly the same as
the number of diphones. Because demi-syllable units are usually longer than diphones
and allow for better capture of co-articulation effects, they should pose fewer
concatenation problems.
One benefit of demi-syllables is that only about 1,000 of them are needed to construct the
10,000 syllables of English (Donovan, 1996). Using demi-syllables, rather than, for
example, phonemes or diphones, requires significantly fewer concatenation points. Demi-syllables
also capture most transitions, hence a large number of co-articulation effects,
and they cover a large number of allophonic variations due to the separation of initial and
final consonant clusters. The memory requirements are still somewhat high, but tolerable.
In contrast to phonemes and diphones, the exact number of demi-syllables in a language
cannot be defined, and with a purely demi-syllable based system not all possible words
can be synthesized correctly; this problem arises at least with some proper names.
However, demi-syllables and syllables may be successfully used in a system that uses
variable-length units and affixes, such as the HADIFIX system (Dettweiler et al., 1985).
Triphones and tetraphones are longer segmental units that are rather seldom used.
Triphones are like diphones, but contain one whole phoneme between the steady-state
points (half phoneme + phoneme + half phoneme); in other words, a triphone is a
phoneme with a specific left and right context. For English, more than 10,000 such units
are required (Huang et al., 1997).
Structuring the unit list consists of three main phases. First, natural speech must be
recorded so that all the units used (phonemes) within all possible contexts (allophones)
are included. After this, the units must be labeled or segmented from the spoken speech
data and, finally, the most appropriate units must be chosen. Gathering the samples from
natural speech is usually very time-consuming, although some of this work may be done
automatically by choosing the input text for the analysis phase carefully. The rules that
decide which samples are accurate enough for concatenation must also be formulated
very cautiously.
For many languages, demi-syllables minimize the co-articulation effects at syllable
boundaries because the demi-syllable is obtained from natural utterances by “cutting” in
the middle of a steady-state vowel. Thus, only relatively simple concatenation rules might
be required – in the best of the worlds. However, the reality of human speech is more
complex and a successful concatenation system may have to rely on a concatenation of
demi-syllables, diphones, and suffixes (postvocalic consonant clusters).
3.2.4 Speech Signal Representation for Concatenative Synthesis
A good speech signal representation for concatenative synthesis approximates the
following set of requirements (Schroeter, 1991):
1. The speech signal can be stored in a highly compressed (i.e., coded) form so that a
large voice database can be used even under tight memory limitations. Coder and
decoder are of low computational complexity.
2. Coding/decoding is perceptually transparent. Since there is a need to imitate all
the voice characteristics of a real person, subjecting the speech signal to vocoder-like
degradations will not lead to speech synthesis of high naturalness.
3. Coding algorithms have to allow for random access. Since most speech coders
contain some sort of autoregressive memory, all state variables of the coder have
to be made available at concatenation points since the decoder will have to switch
between units of speech that are very unlikely to have been recorded
consecutively in time.
4. An ideal speech representation must allow for natural-sounding modifications of
pitch, duration, and amplitude. This is particularly important for small inventories
with one, or just a few, typical examples for each unit.
5. For some advanced applications, it even might be desirable to allow for fine-
tuning of the voice, for example, to add more aspiration, maturity, or let the voice
scream when needed. Instead of recording different voice inventories for different
speaking styles, advanced voice conversion might be used to approximate an
angry voice using a happy or neutral voice as a starting point.
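Requirement 4 above is typically met with PSOLA-family methods. The sketch below
is a far cruder plain overlap-add (OLA) time-stretch, included only to make the
frame-based idea of duration modification concrete; the frame and hop sizes are
arbitrary illustrative values, and pitch is not preserved as carefully as PSOLA would.

```python
import numpy as np

def ola_stretch(x: np.ndarray, factor: float,
                frame: int = 440, hop: int = 110) -> np.ndarray:
    """Stretch mono signal x by `factor` (>1 = slower) via windowed OLA."""
    win = np.hanning(frame)
    out = np.zeros(int(len(x) * factor) + frame)
    norm = np.zeros_like(out)
    for out_pos in range(0, len(out) - frame, hop):
        in_pos = int(out_pos / factor)       # map output time to input time
        if in_pos + frame > len(x):
            break
        out[out_pos:out_pos + frame] += x[in_pos:in_pos + frame] * win
        norm[out_pos:out_pos + frame] += win
    norm[norm == 0] = 1.0                    # avoid division by zero
    return out / norm
```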
Table 3.2 Concatenative Synthesis: Pros and Cons

Pros:
• Units include difficult sounds and transitions.
• Units capture local co-articulation.

Cons:
• Long-distance co-articulation is not captured.
• DSP is required, at least for smoothing of concatenation points.

Neither pros nor cons:
• DSP: prosodic modifications are possible/required.
• A compromise must be made between coverage and inventory size.
3.3 TYPES OF CONCATENATIVE SYNTHESIS
Natural-sounding speech can be produced by concatenating (or stringing together)
segments from a database of recorded speech. In general, this approach yields the most
natural-sounding synthesized output. However, variations in speech and the automated
techniques used for segmenting and analyzing speech waveforms sometimes result in
audible glitches in the output. There are five basic types of concatenative synthesis:
3.3.1 Concatenation of Stored Allophones
Hypothetically, allophones appear to be the perfect unit for concatenation: a few hundred
of them would serve as the basic building blocks for all utterances. The problems only
appear upon implementation. It is impossible, for example, to articulate the consonant [k]
in English without also articulating a vowel, however short, as the onset (the beginning
part) or the offset (the ending part). Figure 3.2 shows the sound wave for allophone-based
concatenative synthesis.
Figure 3.2 Wave sound for Allophone Concatenation Synthesis (Olive, 1996).
3.3.2 Diphone Concatenation Synthesis
Diphones are speech units that begin in the middle of the stable state of a phone and end
in the middle of the following one. The number of diphones depends on the possible
combinations of phonemes in a language. In diphone synthesis, only one example of each
diphone is contained in the speech database. The quality of the resulting speech is
generally not as good as that from unit selection, but it is more natural-sounding than the
output of formant synthesizers, although it can still suffer from some of the robotic-sounding
quality associated with formant synthesis. In order to build a diphone database,
the following questions have to be answered: which diphone pairs exist in the language,
and which carrier words should be used? The answer to these questions is very language
dependent. Figure 3.3 shows the sound wave for diphone-based concatenative synthesis.
Figure 3.3 Wave sound for Diphone Concatenation Synthesis (Olive, 1996).
The rationale for diphones, whether used as units of speech recognition or speech
synthesis, is that they capture the articulation effects between allophones. Moreover, the
articulation effects between the diphones themselves are not as serious a problem,
because the diphone boundaries are the midpoints of the same sound, which tend to
match up fairly well. Considering allophonic variation gives us the new concept of
allodiphones: diphones made up of the combinations of the various allophones of the
language.
To achieve the most natural-sounding allodiphones, they must be isolated from naturally
spoken speech. Unfortunately, one must process enormous amounts of speech to build a
database of allodiphones sufficient for high-quality concatenative speech. Today's
diphone synthesizers compromise and use a modest number of allodiphones derived
from artificially constructed, free-flowing speech.
3.3.3 Concatenation of Stored Syllables and Demi-syllables
The syllable appears to be a convenient unit for storage, as it is large enough to reduce
the number of joins that are necessary and yet small enough to make each stored unit
reusable in a number of word contexts. There is often a choice as to where to separate
one syllable from the next; whenever possible, a stop consonant should be placed at the
start of a syllable, as this allows the join to be made at the silence preceding the stop. A
seamless join between the first and second syllables of "dis-crim-in-ate" would be easier
to achieve than it would be for "disc-rim-in-ate".
The use of demi-syllables reduces some of the problems that occur with syllables, and
several examples of the concatenation of demi-syllables have been reported (Browman,
1980).
Demi-syllables occur as two types: the syllable onset plus the first half of the nucleus
vowel sound, and the second half of the nucleus vowel sound plus the syllable coda.
There are several thousand demi-syllables, ignoring all but the most common allophones.
If we move to allodemisyllables, that is, demi-syllables composed of differing
allophones, that number becomes much larger.
Demi-syllables have two strengths as units of concatenative speech synthesis. One is in
consonant clusters, such as the skr in "scrap". While diphones are better than simple
allophones in such clusters, demi-syllables are best of all because the entire cluster is
taken from human speech.
The other advantage of the demi-syllable is in achieving natural-sounding segment
durations; a synthesizer that fails to capture these nuances will be perceived as unnatural.
Since the length of a syllable ending in one or more consonants is distributed over the
vowel and the final consonant(s), natural-sounding syllable lengths are more easily
achieved with demi-syllables than with other units.
A drawback of demi-syllables is that not all syllable boundaries fit smoothly together.
Recently, some speech engineers have proposed a mixed inventory of both diphones and
demi-syllables, taking advantage of their respective strengths and compensating for their
respective weaknesses.
3.3.4 Concatenation of Stored Waveforms
An obvious way of producing speech messages by machine is to have recordings of a
human being speaking all the various words, and to replay the recordings at the required
times to compose the messages. The first significant application for this technique was a
speaking clock, introduced into the UK telephone system in 1936, and now provided by
telephone administrations all over the world.
The original UK Speaking Clock used optical recording on glass disc for the various
phrases, words and part-words that were required to make up the full range of time
announcements. Some words can be split into parts for this application because, for
example, the same recording can be used for the second syllables of “twenty”, “thirty”, etc. The
next generation of equipment used analogue storage on magnetic drums.
The development of large, cheap computer memories has made it practicable to store
speech signals in digitally coded form for use with computer-controlled replay, and,
provided sufficiently fast memory access is available, this arrangement overcomes the
timing problems of analogue waveform storage. Digitally coded waveforms of speech
signals of adequate quality for announcing machines generally use digit rates of 16–32
kbit/s of message stored, so quite a large memory is needed if many different
elements are required to make up the messages.
The decoding of LPC speech is the least complex, though that is not its chief advantage.
The pitch of LPC speech can be adjusted via a single parameter, and the all-important
duration of a speech unit can be adjusted by the simple expedient of adding or
subtracting frames during decoding. Thus some, but not all, prosodic adjustments can be
effected at the decoding stage.
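The frame-level duration trick described above can be sketched as follows: an LPC
utterance is treated simply as a list of frames (coefficients, gain, pitch), and
lengthening amounts to repeating frames before decoding. The frame contents are
opaque to this illustrative helper, and shortening would instead drop frames.

```python
def stretch_frames(frames: list, factor: float) -> list:
    """Lengthen an LPC frame sequence by factor >= 1 via duplication."""
    out = []
    for i, frame in enumerate(frames):
        out.append(frame)
        # Insert duplicates whenever the running length lags the target.
        while len(out) < int((i + 1) * factor):
            out.append(frame)
    return out

print(len(stretch_frames(list(range(100)), 1.5)))   # -> 150 frames
```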
Waveform coders require two stages for synthesis. The first is the decoding process that
produces the reconstructed waveform; the second is the imposition of prosody on the
reconstructed waveform, in particular a smoothing of the transitions between
concatenated units. Until recently, computing power was insufficient to do this
effectively.
There is a tradeoff between LPC and the various methods of waveform coding. With
waveform coding the speech units themselves sound natural, but the transition between
units is stilted. With LPC coding the transitions are smoother, but the speech has an
inescapable mechanical quality that listeners find both unnatural and unpleasant.
3.3.5 Concatenation of Stored Words
It is possible to concatenate words if care is taken to ensure that the recorded words have
intonation and rhythm that are consistent with the eventual use. A potential difficulty
with this technique is that it is extremely difficult to add to the vocabulary at a later date.
Some reasons for this are that the original speaker may not be available or the voice
quality of the speaker may have changed; even the acoustics of the recording studio or the
choice of microphone can make the newly recorded words identifiably different when
heard in the context of previously recorded utterances.
Table 3.3 summarizes the advantages and pitfalls of the units of Concatenative Synthesis:
the sentence, the word, the syllable, the allophone, the diphone, and the demi-syllable.
Table 3.3 Advantages and Disadvantages of Units of Concatenative Synthesis

1. Sentence
Advantages: Naturalness throughout.
Disadvantages: Usually impractical to prerecord every sentence that might be needed.

2. Word
Advantages: Naturalness within the word.
Disadvantages: Inter-word articulation may sound mechanical; usually impractical to prerecord all the words that might be needed.

3. Syllable
Advantages: Naturalness within the syllable.
Disadvantages: Good inter-syllable articulation is difficult to achieve, making individual words unintelligible; usually impractical to prerecord all the syllables that might be needed, though not as seriously as with words or sentences.

4. Allophone
Advantages: Naturalness within the allophone, especially vowels; far fewer units need be prerecorded than with other choices.
Disadvantages: Good articulation between most allophones is difficult to achieve, making individual words unintelligible; stop consonantal allophones are difficult to isolate for prerecording.

5. Diphone
Advantages: Naturalness at the allophonic level; much better articulation between diphones than between allophones.
Disadvantages: Consonant clusters do not always sound natural; syllable lengths do not always sound natural; impractical to extract and prerecord all possible allodiphones from naturally spoken speech.

6. Demi-syllable
Advantages: Naturalness at the syllable level; natural-sounding consonant clusters; natural-sounding syllable lengths.
Disadvantages: Articulation at syllable boundaries may sound unnatural, making individual words unintelligible, though not as seriously as with syllables; impractical to extract and prerecord all possible allodemisyllables from naturally spoken speech.
Any Concatenative Sound Synthesis system must perform the following tasks, which
may sometimes be performed implicitly (a sketch of the final concatenation step follows
this list):
• Analysis: The source sound files are segmented into units and analyzed to express
their characteristics with sound descriptors.
• Database: Source file references, units and unit descriptors are stored in a
database. The subset of the database that is pre-selected for one particular
synthesis is called the corpus.
• Target: The target specification is generated from a symbolic score (expressed in
notes or descriptors), or analyzed from an audio score (using the same
segmentation and analysis methods as for the source sounds).
• Selection: Units that best match the given target descriptors are selected from the
database according to a distance function and a concatenation quality function.
The selection can be local (the best match for each target unit is found
individually) or global (the sequence with the least total distance is found).
• Synthesis: Synthesis is done by concatenating the selected units, possibly applying
transformations.
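As a minimal sketch of the final Synthesis step, the function below joins selected unit
waveforms with a short linear crossfade to smooth boundary discontinuities. It
assumes every unit is longer than the fade length; the fade length itself is an
illustrative value.

```python
import numpy as np

def concatenate(units: list[np.ndarray], fade: int = 80) -> np.ndarray:
    """Join unit waveforms with a linear crossfade at each boundary."""
    out = units[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for u in units[1:]:
        u = u.astype(float)
        # Blend the tail of the output with the head of the next unit.
        out[-fade:] = out[-fade:] * (1 - ramp) + u[:fade] * ramp
        out = np.concatenate([out, u[fade:]])
    return out
```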
3.4 SUMMARY
This chapter has explained in detail what the Concatenative Synthesis method is. It has
also presented the different types of speech units: phonemes, diphones, and demi-syllables.
The next chapter explains the Arabic language in detail.
CHAPTER FOUR
ARABIC LANGUAGE
4.1 OVERVIEW
In this chapter, the Arabic language is introduced: the alphabet is presented, as well as
the aspects of Arabic morphology and prosody that are relevant to this dissertation.
The Arabic language, or simply Arabic, is the largest member of the Semitic branch of
the Afro-Asiatic language family and is closely related to Hebrew and Aramaic (Michel,
1970). It is spoken throughout the Arab world and is widely studied and known
throughout the Islamic world. Arabic has been a literary language since at least the 6th
century and is the liturgical language of Islam. Because of its liturgical role, Arabic has
lent many words to other Islamic languages, akin to the role Latin has in Western
European languages. During the Middle Ages Arabic was also a major vehicle of culture,
especially in science, mathematics and philosophy, with the result that many European
languages have also borrowed numerous words from it. The Arabic script is written from
right to left (Nicholas and Putros, 1986). Arabic is ranked as number four among the
world’s major languages, with 300 million native speakers of all dialects.
As humans, our capability to receive any form of communication is based on our sensory
organs: the ears, eyes, skin, nose, and taste buds. The reception of the communication
through a single sensory organ is referred to as a modality. Each of us has five modalities,
directly related to our senses. They are auditory/vocal, visual, tactile, olfactory, and
gustatory. The study of how humans perceive communication through these sensory
organs is called the science of semiotics (Clement, 1993). Specifically, semiotics
provides a theoretical approach and related techniques for analyzing the structure of all
forms and meaning of signals and signs in human communication received via these
modalities.
Most Semitic languages in both ancient and contemporary times are usually written
without short vowels and other diacritic marks, often leading to potential ambiguity.
While such ambiguity only rarely impedes proficient speakers, it can certainly be a
source of confusion for beginning readers and people with learning disabilities (Abu-
Rabia, 1999). Diacritization is even more problematic for computational systems, adding
another level of ambiguity to both analysis and generation of text. Dialectal Arabic refers
to the dialects derived from Classical Arabic. These dialects vary considerably, which
means that it can be a real challenge for a Lebanese speaker to understand an Algerian; it
is worth mentioning that there are differences even within the same country.
As mentioned before, there are many varieties of the Arabic language (Shafi, 1978),
many dialects that reflect the social diversity of its speakers. Arabic can be sub-classified
as Classical Arabic, Eastern Arabic, Western Arabic and Maltese. Western Arabic
encompasses the Arabic spoken colloquially in the region of northern Africa often
referred to as the Maghreb, while Eastern Arabic includes the Arabic dialects spoken in
the Middle East and the rest of North Africa. Arabic speakers use Modern Standard
Arabic (MSA) to communicate across dialect groups; it is used in situations where the
native dialect would not be understood.
In the Arabic alphabet, there are 29 letters, three of which are long vowels and the rest are
consonants. Each letter is given a name which contains the letter itself (Nicholas and
Putros, 1986). The characters of the Arabic alphabet are neither capital nor small; they
have one form only. Moreover, some of these letters are very similar to English letter
sounds e.g. “baa” is very close to the letter “B” in the English language; this is a useful
way to remember the sounds. However, many Arabic letters have no equivalent sound in
English e.g. “ein”, and some letters have subtle but important differences in
pronunciation, e.g. “haa” which is pronounced with a lot more emphasis in the throat than
the letter “H” in English. Also, please note that the Arabic script is read from right to left.
4.1.1 English and Arabic
These two languages differ drastically in both family and writing system. English
belongs to the Indo-European family, while Arabic is a typical Semitic language (Michel,
1970). The sound systems of the two languages differ extensively in consonants, vowels,
stress placement and the dynamics that govern the function of those linguistic systems in
real speech. In the area of consonants, Arabic has a series of back consonants – uvular,
pharyngeal and emphatic – that have no counterpart in English whatsoever; these sounds
heavily color the phonetic setting of Arabic compared to that of English. Additionally,
the English vowel system is dominated by vowel quality/quantity reduction, as opposed
to only some vowel quantity (length) reduction in Arabic. Concerning stress placement,
the rules in Arabic are more systematic and well defined than in English. In Arabic, the
rule of the long syllable, primarily due to a long vowel, is very powerful and is, therefore,
a source of stress misplacement for Arab learners of English (Shafi, 1978). Concerning
the writing system, English has a Latin-based orthography, while Arabic has its own
Aramaic-based orthography that is very different in its design of sound representation.
English hardly has any diacritical marks, as opposed to Arabic, in which slightly more
than half of the consonants are distinguished by one or more dots in the form of
superscripts or subscripts that set them apart from the undotted consonants, as shown in
table 4.1 below:
Table 4.1 Arabic dotted and undotted alphabet characters

Letter   Name     English sound               Dotted/Undotted
ا        alif     a                           undotted
ب        baa      b                           dotted
ت        taa      t                           dotted
ث        thaa     th                          dotted
ج        jeem     j                           dotted
ح        haa      emphatic h                  undotted
خ        khaa     kh                          dotted
د        daal     d                           undotted
ذ        thaal    th (voiced)                 dotted
ر        raa      r                           undotted
ز        zaa      z                           dotted
س        seen     s                           undotted
ش        sheen    sh                          dotted
ص        saad     emphatic s                  undotted
ض        dhaad    emphatic d                  dotted
ط        taa      emphatic t                  undotted
ظ        zaa      emphatic z                  dotted
ع        ein      (no English equivalent)    undotted
غ        ghein    gh                          dotted
ف        faa      f                           dotted
ق        qaaf     q                           dotted
ك        kaaf     k                           undotted
ل        laam     l                           undotted
م        meem     m                           undotted
ن        noon     n                           dotted
هـ       haa      h                           undotted
و        waow     w or u                      undotted
ء        hamza    glottal stop                undotted
ي        yaa      y                           dotted
4.2 DEFINITION OF AN ARABIC WORD
As morphology is concerned with the analysis of words, it is essential first to define
the term word. As words appear in written text, a word can be defined the way any
text-editing program defines it: a word is the alphanumeric string between any two
non-alphanumeric characters (Clement, 1987). An Arabic word is a word, as defined
above, which meets the following two conditions:
• All its characters are bare or diacritized Arabic letters (recall that diacritized
words cannot be fully determined by their spelling characters only).
• It belongs to either of the following two categories:
o The original Arabic words.
o The Arabized words.
Original Arabic words are divided in turn into two sub-categories:
• Derivative Arabic words – These are the verbs and nouns that are built according
to the Arabic derivation rules. The vast majority of Arabic words belong to this
category.
• Fixed Arabic words – These are a set of words molded by the Arabs that do not
obey the Arabic derivation rules. Most of these fixed words are neither verbs nor
nouns; most of them are functional words like pronouns, prepositions,
conjunctions, question words, and the like, which tie the words of the Arabic
sentence together. The category of the fixed Arabic words contains a limited
number of members (according to one study, there are only approximately 260
significant fixed words).
The Arabized words are nouns borrowed from foreign languages (perhaps with some
phonetic adjustments to suit Arabic pronunciation) that have become common among
native Arabic speakers. To preserve the purity of the Arabic language, a word should not
be considered part of this category unless its meaning has no counterpart in the category
of the original Arabic words.
Figure 4.1 below summarizes the definition of Arabic words:
Figure 4.1 The classification of the Arabic words (Raja, 1979).
Although the number of derivative Arabic words is much larger than that of both the
fixed Arabic words and the Arabized words, the frequency of the latter ones, especially
the fixed words, is so considerable that any treatment of Arabic must treat all the above
categories with the same degree of care. Consider the following example of fixed words
in an Arabic paragraph:
[Arabic sample paragraph; 23 words, 8 of which are fixed words.]
Total number of words in the paragraph = 23; number of fixed words in the paragraph
= 8; frequency of fixed Arabic words in the paragraph = 8/23 ≈ 35%.
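As a toy illustration of this arithmetic, and of the word definition given above in
section 4.2, the snippet below tokenizes a transliterated sample and computes the
fixed-word ratio. The sample sentence and the fixed-word set are invented
placeholders, not real lexical data.

```python
import re

FIXED = {"fi", "min", "ila", "alladhi", "ma"}         # hypothetical fixed words
sample = "dhahaba al-waladu ila al-madrasati fi al-sabahi"

# "Alphanumeric strings between non-alphanumeric characters":
words = re.findall(r"[A-Za-z0-9]+", sample)
ratio = sum(w in FIXED for w in words) / len(words)
print(f"{ratio:.0%}")    # 2 fixed words out of 9 -> 22%
```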
4.3 ARABIC IS A DIACRITIZED LANGUAGE
The pronunciation of a word in some languages, like English, is almost always fully
determined by its constituent characters, whose vowels determine the correct
corresponding sound when the word is pronounced. Such languages are called non-diacritized
languages. On the other hand, there are languages, like Latin, where the
pronunciation of a word cannot be fully determined by its spelling characters alone. In
such languages, two different words may have identical spelling whereas their
pronunciations and meanings are totally different. To remove this ambiguity, special
marks are put above or below the spelling characters to determine the correct
pronunciation (Mitchell, 1993). These marks are called diacritics, and a language that
uses them is called a diacritized language.
Arabic is also a diacritized language. In fact, Arabic has the most elaborate diacritization
system. Table 4.2 shows the Arabic diacritics and the significance of each one. Each character in an Arabic word must be assigned two pieces of diacritic information:
• The shadda state of the character (with shadda / without shadda).
• The diacritic of the character.
These are called the diacritic information of the character.
Unfortunately, in today's Arabic writing, people do not explicitly write diacritics. They depend on their knowledge of the language and on context to supply the missing diacritics while reading a non-diacritized text. Diacritics are written only when a severe ambiguity is feared or for educational purposes.
Table 4.2 The Arabic diacritics and the significance of each one

Diacritic           Sounds like                     Comments
Fathha              a
Damma               u
Kasra               i
Sukoon              a non-vowelized consonant
Tanween fathha      fathha + n                      Only the last character of a word may be assigned this diacritic.
Tanween damma       damma + n                       Only the last character of a word may be assigned this diacritic.
Tanween kasra       kasra + n                       Only the last character of a word may be assigned this diacritic.
Long vowel          long (a), (i) or (u) vowel
Alif leyna          long (a) vowel                  Only a terminal alif may be assigned this diacritic.
Bypassed character  not pronounced
Hidden alif vowel   long (a)
Shadda              doubled consonant               In fact, the shadda is not a diacritic but a mark that doubles the character while it is pronounced. A character with a shadda needs another diacritic to determine its vowel.
An automatic morphological analyzer must consider diacritics in its model of Arabic
words and must have some mechanism of figuring out the missing diacritics of a given
Arabic word. The three diacritization states of Arabic word are:
i) Full diacritization: It is the assignment of all the diacritic information for
each character in the word including the last one. In Arabic, the
diacritization of the last character sometimes depends on the syntactic
analysis of the word within its sentence.
ii) Half diacritization: It is the same as full diacritization except that it does not provide the diacritic mark of the last character when that mark depends on the syntactic analysis of the word. As morphological analysis deals with words one by one and does not analyze the sentence as a whole, it can only be expected to provide half diacritization.
iii) Partial diacritization: Any other diacritization state of the word that
provides less diacritic information than half diacritization is called partial
diacritization.
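As one building block of such a mechanism, a character's diacritic status can be tested directly from its code point. The following is a minimal Visual Basic sketch, not taken from the system's code; it assumes the text is held as Unicode, which VB6 strings are internally.

Private Function IsDiacritic(ByVal ch As String) As Boolean
    Dim code As Long
    If Len(ch) = 0 Then Exit Function
    code = AscW(Left$(ch, 1))
    ' The Arabic harakat block U+064B..U+0652 covers the three tanweens,
    ' fathha, damma, kasra, shadda and sukoon.
    IsDiacritic = (code >= &H64B And code <= &H652)
End Function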
An Arabic word is in general complex; a single word may even express a complete sentence. In a study of a sufficiently large sample of Arabic text, the following simple structure of Arabic words was inferred:
• The main part, a noun or a verb, of the word occurs in the middle. It is called the
word's body.
• The body may be prefixed by something like the definite article, a preposition, a gender determiner, a tense determiner and so on, or some combination of them. When a prefix precedes a body, it may slightly modify the body's string and may itself be slightly modified. A prefix cannot stand alone as a word.
• The body may also be suffixed by something like a pronoun, a gender determiner, a tense determiner and so on, or some combination of them. When a suffix succeeds a body, it may slightly modify the body's string and may itself be slightly modified. A suffix cannot stand alone as a word.
If the absence of a prefix is treated as a null prefix and the absence of a suffix as a null suffix, the structure of an Arabic word can be generalized as (Salman and Jacob, 1980):
Any Arabic word = Prefix + Body + Suffix
The prefix or suffix can add an entity to the noun or the verb. For example, the prefix
may be a preposition and the suffix may be a pronoun.
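A minimal Visual Basic sketch of this decomposition follows; it is illustrative only, the affix inventories shown are small placeholder samples rather than the full Arabic lists, and it ignores the mutual string modifications between affix and body noted above.

Private Function StripAffixes(ByVal w As String) As String
    Dim prefixes As Variant, suffixes As Variant, p As Variant
    ' Placeholder samples only: definite article, conjunction "wa",
    ' preposition "bi"; pronoun suffixes "-haa", "-hum", "-hu".
    prefixes = Array("ال", "و", "ب")
    suffixes = Array("ها", "هم", "ه")
    For Each p In prefixes
        If Left$(w, Len(p)) = p Then w = Mid$(w, Len(p) + 1)
    Next p
    For Each p In suffixes
        If Right$(w, Len(p)) = p Then w = Left$(w, Len(w) - Len(p))
    Next p
    StripAffixes = w    ' what remains approximates the word's body
End Function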
4.4 INTRODUCTION TO VOWELS
There are two sets of vowels in Arabic (Salman and Jacob, 1980): short vowels and long vowels. It requires about twice as much time to produce a long vowel as to produce a short one. There is a tendency in English to obscure vowels that are in non-stressed syllables; this tendency has to be overcome when speaking Arabic. In general, the Arabic vowels are pronounced more crisply, more clearly, and more tensely than the English vowels. Syllables are the building blocks of speech, and they come in three types: consonant-vowel, vowel-consonant, and consonant-vowel-consonant.
That is to say, a syllable may be formed by a consonant followed by a vowel, as in the word "TO"; a vowel followed by a consonant, as in the word "OF"; or a vowel between two consonants, as in the word "FOR". In Arabic, only types one and three are used. Each vowel has a name, the vowels collectively have a name, the letters that carry a vowel have a name, letters are named specifically depending on which vowel they hold, and doubled vowels are given names. In the charts, the first row contains the examples, the second row shows the individual names of the vowels, the third row shows the adjective used to describe a letter carrying the given vowel, the fourth row shows the English equivalents, and the column to the right shows the collective name for the group.
As a corollary to the restricted vowel quality in Arabic, its diphthongs are limited in number. In fact, some linguists treat the so-called diphthongs as combinations of simple vowels [i.e., abutting vowels] rather than as blended clusters of vocalic elements (Raja, 1979). The five-parameter description of the system is as follows:
1. Place: Has three places [front, center, back]; each place yields one vowel; the system has no neutral vowel (schwa, [ə]).
2. Stricture: Has two tongue heights [high, low] or [close, open]; two vowels are high and one is low.
3. Tense/ lax: Is contrastive in combination with length.
4. Lip-position: Has three distinct lip positions [round, spread, neutral] with one
degree of rounding and spreading; rounding is restricted to back vowels and
spreading to front ones.
5. Oral/ nasal: No contrasts; has only oral vowels.
4.4.1 Short Vowels
Unlike English, the short vowels in the Arabic writing system are not actually written within the word. These vowels are indicated by "signs" rather than by letters: the fathha and the damma are placed over the consonant, and the kasra is placed below the consonant, according to the pattern of the word. These signs represent the sounds /a/, /u/ and /i/, respectively. When a sign is used with a consonant, the consonant is pronounced first and the vowel follows it. The vowel marks or signs are always placed over or below the first half of the consonant.
4.4.1.1 The First Short Vowel
/ u / A high back rounded short vowel which is similar to the English "o" in the words "to" and "do". This vowel mark is placed above the consonant and its shape resembles an English comma. It is called a "dhamma".
The first short vowel is known as the dhamma. The letter that holds the dhamma is known as madhmoom. The dhamma is one of the three diacritics, and a letter that holds one of these diacritics is known as mutaharrik. The dhamma is the English equivalent of the letters "o" and "u". For example, the letter "baa" with its dhamma (بُ) is pronounced "bu".
4.4.1.2 The Second Short Vowel
/ a / A short un-rounded low central vowel which is similar to the English "a" in the word "fat", though even shorter in duration. It is symbolized by a short diagonal stroke written above the consonant. It is called a "fathha".
Each letter of the alphabet can be given a fathha. This vowel is placed atop the letter, and the letter which holds the fathha is known as maftooh. This vowel is the English equivalent of the letter "a". For example, the letter "baa" with its fathha (بَ) sounds like "ba".
4.4.1.3 The Third Short Vowel
/ i / A high front un-rounded vowel which is similar to the English "i" in the words "sin" and "sit". This vowel, like the / a /, is symbolized by a short diagonal stroke, only it is placed below the consonant rather than above it. It is called a "kasra".
The kasra is one of the diacritics. The letter that holds the kasra under it is known as maksoor. This vowel is the English equivalent of the letters "e" and "i". For example, the letter "baa" with a kasra beneath it (بِ) sounds like "be". Please become familiar with the sounds associated with these letters when they are maksoor.
4.4.1.4 The First Short Vowel Doubled
Each diacritic may be doubled in order to add the sound of the letter "n" to the end of the word. For example, the letter "baa" with a doubled dhamma (بٌ) sounds like "bun" or "bon". When a vowel is doubled in this way, it is called a tanween, and the letter that holds the tanween is known as munawwan. Since all three harakat can be doubled, the doubled forms are given separate names: a doubled dhamma is called a dhammatein, a doubled fathha is called a fathhatein, and a doubled kasra is called a kasratein.
This is a feature in the Arabic language that few, if any, languages adopt. This is
primarily used to differentiate between nouns and other parts of speech. That is to say, the
noun in Arabic may or may not have a tanween, but verbs and particles will never have it.
Also, the tanween occurs on the last letter of the word; it may not come upon a letter in the middle or at the beginning. The dhamma, when doubled, is called a dhammatein.
4.4.1.5 The Second Short Vowel Doubled
A doubled fathha is known as a fathhatein. When a letter such as "raa" is given a fathhatein, it sounds like "run", as in the English word "run". There is an important observation to make with the fathhatein: when a letter is written with this vowel, it is always followed by the letter "alif", and the fathhatein is written atop this "alif" (i.e., باً). This is only for script purposes; when writing in Arabic, this is how a fathhatein is used.
4.4.1.6 The Third Short Vowel Doubled
The third vowel is the kasra, known as the kasratein when doubled. This doubled vowel is the English equivalent of the letters "e" and "i" with the addition of the letter "n". Taking the letter "baa" for example and placing a kasratein beneath it (بٍ) causes the letter to sound like "bin".
4.4.2 Long Vowels
In contrast to the short vowels, the long vowels in the Arabic writing system are
represented by letters of the alphabet and not by signs. Although their pronunciation is
similar to that of the short vowels, the sound is prolonged, that is, it is held longer. With
/ a / and / aa / there is not only a quantity difference but a substantial quality difference as
well. One must be careful to pronounce them as pure long vowels without diphthong
quality. In ordinary speech, when they occur at the end of a word, they are shortened. The
long vowels are explained below. Inasmuch as they are represented by letters and not by signs, the variations of their written forms will be described in detail later.
4.4.2.1 The First Long Vowel
/ aa / A long un-rounded low central back vowel which is similar in pronunciation to
the Arabic short vowel / a / but rather longer in duration. It is similar to the
English “a” of “father”. The / aa / tends to be lower and further back than the / a /.
Apart from the three normal vowels, the Arabic alphabet contains three letters which are
often considered to be long vowels. They are considered this because they elongate and
emphasize the short vowels on the letter before them. The three long vowels are the
“alif”, “waow”, and “yaa”.
As for the “alif”, it is always empty of vowels. Whenever we see the “alif” with a vowel,
this letter is not an “alif”; rather it is a “hamza”. This letter always necessitates a fathha
before it. That is to say that we will never see an “alif” before which there is a dhamma or
a kasra. The job of this letter is to lengthen the stretch of the fathha on the letter before it,
and it is for this reason that it is known as a long vowel. For example, the letter “baa”
with a fathha atop it sounds like “ba”. But adding the “alif” causes it to sound like “baa”.
The other two long vowels are the “waow” and the “yaa”. These letters are not always
considered to be long vowels. This is because they may have vowels themselves; in this
case, they are treated as consonants. However, if these letters are empty of all vowels,
they may be considered to be long vowels. When they are empty of vowels, there are two
situations; either they are preceded by their appropriate haraka (dhamma for “waow”, and
kasra for “yaa”), or they have a fathha before them. Never will the “waow” be preceded
by a kasra.
In the case of being preceded by a fathha, the letters are not considered long vowels; they
are called “waow” leen and “yaa” leen. But if they are preceded by their appropriate
vowels, they are considered long vowels and they are called “waow” maddah and “yaa”
maddah. They emphasize the sound of the preceding haraka. For example, the letter
“baa” with a dhamma on it sounds like “bu” and with a kasra on it sounds like “be”.
Adding a “waow” to the former and a “yaa” to the latter causes the letter to sound like
“buu” and “bee”. The “alif”, “waow” maddah, and “yaa” maddah may occur in the
middle of words. The same applies for the "waow" leen and "yaa" leen. The leen letters will not be discussed further, as they behave like any other letters.
4.4.2.2 The Second Long Vowel
The "waow" may have a vowel of its own; in this case, it is not a long vowel. It may also be empty of the three vowels but preceded by a fathha; in this case too, it is not a long vowel. The final situation is where the "waow" is empty of the three vowels and preceded by its appropriate vowel, the dhamma; this "waow" is a long vowel.
/ uu / A high back rounded long vowel which is similar to the English “oo” in the word
“tool”. However, it is much longer in duration.
4.4.2.3 The Third Long Vowel
The final long vowel is the letter “yaa”. This letter corresponds to the kasra and it
enhances and emphasizes the sound of this vowel just as the “alif” emphasized the fathha
on the preceding letter and the “waow” emphasized the dhamma on the preceding letter.
/ ii / A high front un-rounded vowel which is similar to the English “ee” in the words
“seen” and “bee”, but which is longer in duration and has no diphthongization.
As mentioned, the "yaa" may be mutaharrik itself and thus will not be considered a long vowel. It may also be without a vowel but preceded by a maftooh letter; in this situation, too, the "yaa" is not a long vowel. The "yaa" maddah is that "yaa" which has no vowel of its own and is preceded by a maksoor letter.
4.5 SYLLABLES OF ARABIC LANGUAGE
4.5.1 Syllable Structure
The sounds of Arabic are divided into syllabic and non-syllabic entities (Wright, 1974).
The three short vowels and their long counterparts always form the syllable nucleus. The
syllable nucleus is the segment that stands out and has more prominence in an utterance.
The vowels always form the syllable nucleus, and all of the consonants, including / y / and / w /, represent the marginal elements of the syllable structure.
Inasmuch as there is a clear-cut division between vowels and consonants, there is no need to mark syllabicity. In accordance with this clear-cut division, the number of syllables in a word is identical to the number of vowels. Every Arabic word begins with a single consonant, and every syllable forming a word structure begins with a single consonant. Therefore, whenever there are consonant clusters or double consonants in the middle of the structure of a word, the point of syllable division lies either between the consonants of the consonant cluster or between the double consonants. Here the syllable division is predictable and automatic.
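Because the division is automatic, it can be computed mechanically. The following minimal Visual Basic sketch is not part of the system's code; it assumes a word has already been transliterated into a consonant/vowel pattern string such as "CVCCVVC", with long vowels written as VV.

Private Function SplitSyllables(ByVal pattern As String) As String
    Dim i As Integer, result As String
    For i = 1 To Len(pattern)
        ' Every syllable begins with exactly one consonant, so a new
        ' syllable starts at each consonant that is followed by a vowel
        ' (except at the very beginning of the word).
        If i > 1 And i < Len(pattern) Then
            If Mid$(pattern, i, 1) = "C" And Mid$(pattern, i + 1, 1) = "V" Then
                result = result & "-"
            End If
        End If
        result = result & Mid$(pattern, i, 1)
    Next i
    SplitSyllables = result    ' e.g. "CVCCVVC" -> "CVC-CVVC"
End Function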
4.5.2 Syllable Patterns
There are five syllable patterns (Michel, 1970). In their representation, C stands for any consonant, V for a short vowel and VV for a long vowel. The long vowels (VV) are always considered monophthongs, since there are no vowel clusters in the Arabic sound stream. The five patterns are:
Table 4.3 Arabic Syllable Patterns
Pattern   Example in Arabic   Pronunciation   Meaning in English
CV        بِ                   bi              "in"
CVC       تُب                  tub             "repent"
CVV       يا                   yaa             "O" (vocative particle)
CVVC      باب                  baab            "door"
CVCC      وثب                  waθb            "jumping"
4.5.2.1 Short and Long Syllables
A short syllable can be defined as any short vowel immediately preceded by a single
consonant. The CV pattern, listed in the above classification, is to be considered as a
short syllable. The remaining four syllable patterns are considered as long syllables.
4.5.2.2 Closed and Open Syllables
An open syllable ends with a vowel and includes the patterns CV and CVV. A closed
syllable is any syllable that ends with a consonant(s). Closed syllables include the CVC,
CVVC and CVCC patterns.
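These two classifications can be read directly off a pattern string. A minimal Visual Basic sketch, illustrative only and restating Table 4.3 and the definitions above:

Private Function ClassifySyllable(ByVal p As String) As String
    Select Case p
        Case "CV"
            ClassifySyllable = "short, open"
        Case "CVV"
            ClassifySyllable = "long, open"      ' ends with a vowel
        Case "CVC", "CVVC", "CVCC"
            ClassifySyllable = "long, closed"    ' ends with consonant(s)
        Case Else
            ClassifySyllable = "not an Arabic syllable pattern"
    End Select
End Function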
4.5.2.3 Ending a Syllable with a Consonant
There are three ways to construct a syllable. They are (1) consonant vowel, (2) vowel
consonant, and (3) consonant vowel consonant. In the first instance, the syllable ends in a
vowel and this is represented in the Arabic language by the consonant letter of the
syllable having one of the three vowels. The second instance is not an option in the
Arabic language because vowels do not precede consonants; rather they follow them. As
for the third instance, we do not yet know how to construct that syllable whose last letter
is a consonant.
When a syllable ends in a consonant, the consonant is free of all vowels. In Arabic, this is
represented by means of a special symbol atop the letter. This symbol is called a sukoon
and the letter which holds this symbol is called saakin. Therefore, the letter ‘M’ in the
word “FROM” would be transliterated with a sukoon atop the letter “meem”. Only the
sound of the letter would be pronounced just as we pronounce the sound of the letter ‘M’
when we recite this word. Therefore, saakin letters can occur in the middle or at the end
of a word.
Notice that the word “from” is strange in that the first letter of this word is saakin. Only
the sound of the ‘F’ is pronounced and there are no vowels surrounding it. It is as if the
‘F’ is a syllable of its own. In Arabic, this is an intolerable situation; we cannot initiate
pronunciation with a saakin letter.
In English, the ‘f’ is incorporated into the other syllable. There is no problem in saying
that this letter combined with the ‘R’ together act as the first consonant of this syllable. In
Arabic, however, the ‘F’ would be preceded by a consonant-vowel pair. This consonant-
vowel pair would connect with the ‘f’ forming a new syllable. The word ‘FROM’ would
be changed to something like “IFROM” with a hamza maksoor preceding the “faa” in
order to complete the incomplete syllable (the 'f'); the vowel of the "hamza" varies. The broken syllable was 'f'; it has now been changed into a sound syllable of type three, because that is the only type of syllable in Arabic which ends in a consonant. It is type three because there is a consonant (the 'hamza'), a vowel (the kasra beneath the 'hamza'), and the "faa" of the original word.
This is a common problem when Arabs speak or write. Often, a syllable ends in a
consonant, and the next syllable starts with a consonant. This is not a problem for us,
however, as our goal is to learn how to read and scribe Arabic, not to speak it.
Below are a few examples of saakin letters in the middle of syllables, at the end, and
syllables which are created in order to complete incomplete syllables. Notice that the
third example in the first table has “type 3” listed twice. This is because there is a syllable
between the first two letters and the vowel in between and there is a syllable between the
last letter, its vowel, and the ‘N’ sound which is made by the tanween. Also notice that in
the last two examples, there was an incomplete syllable which was completed by adding a
‘hamza’ in the beginning of the word.
(The examples are Arabic words; their syllable analyses are, respectively: "type 3, type 3", "type 3", "type 3" for the saakin examples, and "incomplete, type 3", "incomplete, type 1", "type 3, type 1, type 1" for the examples completed with a 'hamza'.)
4.6 SUMMARY
In summary, the Arabic language, or simply Arabic, is the largest member of the Semitic branch of the Afro-Asiatic language family. There are three vowel qualities in the Arabic language, /a/, /i/ and /u/; each has a short form, a doubled (tanween) form, and a long counterpart. The next chapter will show the architectural design of the Arabic TTS Synthesizer System.
CHAPTER FIVE
SYSTEM ANALYSIS AND DESIGN
5.1 OVERVIEW
A system is always developed for a purpose; this purpose is to provide functionality, or
behavior, that will satisfy the needs and wishes of clients and users. In this chapter, the
researcher discusses the analysis and the design of the TTS system. Analysis is the first part of system development, where we begin to understand in depth the needs of the system; it involves a substantial amount of effort. The second part of system development is software design, which is discussed in detail in later sections of this
chapter. Figure 5.1 shows the use case diagram of the Arabic TTS Synthesizer System.
Figure 5.1 Use-Case Diagram for the TTS Synthesizer System (actors: Admin and User; use cases: Open File, Input Text, Synthesize Text, Clear Text, Update and Delete, Record Sounds, Normalize Text, Text Parser, Concatenate Text and Compare Sound, linked by <<extend>> relationships)
Tables 5.1 to 5.10 below present the specification of each use case:

Table 5.1 Description of Open File Use Case
Use case 1: Open File
Actors: Admin and User
Pre-conditions: None
Post-conditions: The open-file dialogue box appears
Basic flow: The user and/or the admin first open the TTS Synthesizer System and choose either Arabic or English
Alternative flows: None
Special requirements: The operating system that runs the TTS Synthesizer System must support the Arabic language, so that the user is able to type the text using Arabic letters
Use case relationships: None

Table 5.2 Description of Input Text Use Case
Use case 2: Input Text
Actors: Admin and User
Pre-conditions: None
Post-conditions: The text input by the user and/or the admin appears in the text box provided
Basic flow: The user and/or the admin type the text to be converted into speech in the provided text box
Alternative flows: None
Special requirements: The operating system that runs the TTS Synthesizer System must support the Arabic language, so that the user is able to type the text using Arabic letters
Use case relationships: None

Table 5.3 Description of Clear Text Use Case
Use case 3: Clear Text
Actors: Admin and User
Pre-conditions: None
Post-conditions: The text input by the user and/or the admin is cleared from the text box
Basic flow: The user clicks the clear button
Alternative flows: None
Special requirements: None
Use case relationships: None

Table 5.4 Description of Synthesize Text Use Case
Use case 4: Synthesize Text
Actors: Admin and User
Pre-conditions: None
Post-conditions: The sound of the text input by the user or admin is heard
Basic flow: After the user and/or the admin input a text in the text box, the user or admin clicks the speak button to hear the pronunciation of the input text
Alternative flows: None
Special requirements: None
Use case relationships: None

Table 5.5 Description of Update and Delete Use Case
Use case 5: Update and Delete
Actors: Admin
Pre-conditions: The admin must have access to the code of the system
Post-conditions: The system displays the updated version of the system
Basic flow: The admin updates the data, design, sound files, and other files
Alternative flows: None
Special requirements: None
Use case relationships: None

Table 5.6 Description of Record Sounds Use Case
Use case 6: Record Sounds
Actors: Admin
Pre-conditions: The sound format has been chosen
Post-conditions: The recorded sounds are added to the system's database
Basic flow: The admin records the necessary sounds and adds them to the database by accessing the database file
Alternative flows: None
Special requirements: None
Use case relationships: None

Table 5.7 Description of Normalize Text Use Case
Use case 7: Text Normalization
Actors: None
Pre-conditions: The normalized sound database must exist
Post-conditions: The sound to be spoken is retrieved
Basic flow: After the user inputs the text to be spoken, the system compares the input text with the sound files, matches it, and produces the output
Alternative flows: If the text does not match a sound file in the normalized database, it is compared against the other two databases (the parser and concatenation databases)
Special requirements: None
Use case relationships: Extends the Record Sounds use case

Table 5.8 Description of Text Parser Use Case
Use case 8: Text Parser
Actors: None
Pre-conditions: The parser sound database must exist
Post-conditions: The sound to be spoken is retrieved
Basic flow: After the user inputs the text to be spoken, the system compares the input text with the sound files, matches it, and produces the output
Alternative flows: If the text does not match a sound file in the parser database, it is compared against the other two databases (the normalized and concatenation databases)
Special requirements: None
Use case relationships: Extends the Record Sounds use case

Table 5.9 Description of Concatenate Text Use Case
Use case 9: Text Concatenation
Actors: None
Pre-conditions: The concatenation sound database must exist
Post-conditions: The sound to be spoken is retrieved
Basic flow: After the user inputs the text to be spoken, the system compares the input text with the sound files, matches it, and produces the output
Alternative flows: If the text does not match a sound file in the concatenation database, it is compared against the other two databases (the normalized and parser databases)
Special requirements: None
Use case relationships: Extends the Record Sounds use case

Table 5.10 Description of Compare Sound Use Case
Use case 10: Compare Sound
Actors: None
Pre-conditions: All the database tables must exist
Post-conditions: The targeted sound is compared and retrieved from the database
Basic flow: After the user inputs the text to be spoken, the system compares the input text with the sound files, matches it, and produces the output
Alternative flows: None
Special requirements: None
Use case relationships: Extend relationships with the Text Normalization, Text Parser and Text Concatenation use cases
5.2 SOFTWARE AND HARDWARE REQUIREMENTS
There are many tools available for database programming that could be used to develop a program. For this project, an appropriate tool had to be selected to support the program application. Different tools were compared in order to reach the one most compatible with the project. To choose the tool, the first consideration is to revisit the purpose of the project and the functions of the program application.
This step also establishes the objectives and the scope of Arabic TTS Synthesizer System
and the tasks that need to be undertaken.
All the criteria of the program are included to evaluate the available tools, and the most appropriate one is chosen. The decision is made based on several factors, for example the budget, the level of vendor support, compatibility with other software, and whether the product runs on particular hardware. This project used Microsoft Visual Basic 6 and
Microsoft WordPad as the development tools.
5.2.1 Microsoft Visual Basic 6.0
Visual Basic is a Microsoft Windows programming language. Visual Basic programs are created in an Integrated Development Environment (IDE). The IDE allows the programmer to create, run, and debug Visual Basic programs conveniently. IDEs allow a programmer to create working programs in a fraction of the time that it would normally take to code programs without using IDEs.
5.2.1.1 Visual Basic and Arabic Supports
Visual Basic is Bi-directional (also known as “BIDI”)-enabled. Bi-directional is a generic
term used to describe software products that support Arabic and other languages, which
are written right-to-left. More specifically, bi-directional refers to the product ability to
manipulate and display text for both left-to-right and right-to-left languages. For example,
displaying a sentence containing words written in both English and Arabic requires bi-
directional capability. Microsoft Visual Basic includes standard features to create and run
Windows applications with full bi-directional language functionality.
5.2.1.2 Visual Basic Bi-directional Features
Although the Microsoft Visual Basic 6.0 user interface (menus, dialog boxes, and Help) is in English, users will find the convenience and ease of use of its bi-directional features indispensable for bi-directional programming needs:
1. Create bi-directional applications quickly and easily:
Many bi-directional features appear as new properties in the Properties window
for easy program development.
2. Combine languages:
Mix right-to-left and left-to-right language text (for example, Arabic and English) in program code as the user builds applications for a bi-directional 32-bit Microsoft Windows environment.
3. Create right-to-left visual features:
Add right-to-left visual features to forms, menus, and more than 15 custom
controls such as grids, list boxes, and combo boxes.
4. Database support:
Develop database solutions with support for Arabic and other right-to-left
language sort orders.
5.2.2 Microsoft Word
Microsoft Word is a word processing application from Microsoft. Richard Brodie originally wrote it for IBM PCs running DOS in 1983. Later versions were created for the Apple Macintosh (1984), SCO UNIX, OS/2 and Microsoft Windows (1989). It later became part of the Microsoft Office suite, and Microsoft refers to Word as Microsoft Office Word in this context to indicate its inclusion in the suite, although it is still also sold as a standalone product or bundled with Microsoft Works. Microsoft Word is used to store a collection of Arabic text in rich text format. Three test files were created in WordPad; these files are provided with the Text-To-Speech System and can be used by the user to try and test the program functions.
5.2.3 Sound Forge 8.0
Sound Forge 8.0 is used to record the sounds in the development stage of the Arabic
speech synthesizer software. Sound Forge software provides the ultimate set of tools for
recording professional audio. Sound Forge 8.0 is a digital audio editor for Windows
98SE, Me, 2000, and XP.
Sound Forge can be used for audio editing, multitask background rendering, 32-bit/64-bit float/192 kHz file support, enhanced DirectX plug-in management, QuickTime and Windows Media format import, and user interface enhancements.
Sound Forge includes an extensive set of customizable processes, effects and tools for manipulating audio, and supports a wide range of audio and video file formats, including Windows Media, RealMedia, QuickTime and MPEG 1 & 2. Users can save time by continuing production through opening, playing, previewing, cutting, copying, pasting, and deleting files while other project files render in the background.
Sound Forge 8.0 supports full resolution 32-bit files for high audio quality. It enables the
user to import and save high-resolution 32-bit files, even to record 32-bit files if the
hardware supports 32-bit recording.
Tools provided in Sound Forge include track-at-once CD burning, CD Ripping, Spectrum
Analysis, Auto Region (using beats and measures, or peak detection), Crossfade Loop,
Extract Regions, Find Tool, Enhanced Preset Manager, Sampler Tool, Statistics Tool
(Max, RMS, DC offset, Zero Crossings), Simple Synthesis, FM Synthesis and DTMF/MF
Tone Synthesis. Recording tools provided are Auto calibration for DC Offset, Generate
SMPTE/MIDI Time Code, Glitch/Gap Detection, Punch In option, Real-time record
meters, and Remote record function. The figure below shows the tools for recording
voice in Sound Forge 8.0.
5.3 ARABIC TEXT-TO-SPEECH ARCHITECTURE
Figure 5.2 shows the architecture of the Arabic TTS Synthesizer System. It is composed of three major components, depicted by the rectangles in Figure 5.2. Raw text is the input of the system. Any typed text is first normalized in the Text Normalization module. The Syllable Parser then segments the normalized text into syllable units according to Arabic rules. Lastly, Syllable Concatenation combines the syllable-unit sound files to produce synthesized speech. All the sound files are stored in one folder named Sound, to be accessed by the system. The architecture is based on the Input, Processing and Output (IPO) schematic.
Figure 5.2 IPO – Schematic Architecture Design of the Arabic TTS (input layer, processing layer and output layer)
5.3.1 Synthesizing Text Steps
The Arabic TTS Synthesizer uses a concatenative approach with syllable units. There are three different stages in producing synthesized speech:
• Text Normalization
The input text may not contain only words. There can be other types of characters, such as numbers, abbreviations and acronyms, which this system treats as symbols. This module converts the symbol input into readable text. Figure 5.3 shows the sequence diagram of the Text Normalization component.
Figure 5.3 Sequence Diagram for Text Normalization
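As an illustration of the idea, the following minimal Visual Basic sketch (not the system's actual normalization code; the digit-name table is a small sample of the real rules) replaces each digit with its spoken Arabic name so that later stages only ever see letters:

Private Function NormalizeDigits(ByVal s As String) As String
    Dim names As Variant, i As Integer
    ' Spoken names of the digits 0-9.
    names = Array("صفر", "واحد", "اثنان", "ثلاثة", "أربعة", _
                  "خمسة", "ستة", "سبعة", "ثمانية", "تسعة")
    For i = 0 To 9
        s = Replace(s, CStr(i), " " & names(i) & " ")
    Next i
    NormalizeDigits = s
End Function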
• Text Segmentation
Figure 5.4 shows the sequence diagram of the Text Segmentation component. Input text may be in the form of paragraphs, sentences, or words. Thus, it is necessary to segment text in hierarchical order: higher-level structures into paragraphs, paragraphs into sentences, sentences into words, and words into manageable units. For this TTS Synthesizer System, the manageable units are in the form of phonemes.
In this research, we limited the input text to paragraph form. A paragraph is segmented into sentences by finding the sentence punctuation marks such as '.', '!' and '?'. However, there are exceptions. For example, if abbreviations were used in the input text such as [7], the system must be aware that the periods are not sentence punctuation marks but abbreviation marks. To solve this dilemma, the system checks whether the preceding letters are included in our abbreviation database. If not, the period is considered a mark that ends a sentence. To segment sentences into words, blank spaces are located in the text that has been classified as a sentence. From the text that has been identified as words, the phonemic representations equivalent to the set of letters of the retrieved word are generated. In our implementation, segmentation of words into phonemes only allows a maximum of 16 letters.
Figure 5.4 Sequence Diagram for Text Segmentation
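A minimal Visual Basic sketch of the sentence-splitting rule just described follows. It is illustrative only; AbbrevExists is a hypothetical stand-in for the lookup against the abbreviation database, stubbed out here so the sketch is self-contained.

Private Function SplitSentences(ByVal para As String) As Collection
    Dim sentences As New Collection
    Dim i As Long, start As Long, ch As String
    start = 1
    For i = 1 To Len(para)
        ch = Mid$(para, i, 1)
        ' '.' ends a sentence only if the word before it is not a
        ' known abbreviation; '!' and '?' always end one.
        If ch = "!" Or ch = "?" Or _
           (ch = "." And Not AbbrevExists(LastToken(para, i))) Then
            sentences.Add Trim$(Mid$(para, start, i - start + 1))
            start = i + 1
        End If
    Next i
    If start <= Len(para) Then sentences.Add Trim$(Mid$(para, start))
    Set SplitSentences = sentences
End Function

' Helper (also illustrative): the token immediately before position i.
Private Function LastToken(ByVal s As String, ByVal i As Long) As String
    Dim j As Long
    j = InStrRev(Left$(s, i - 1), " ")
    LastToken = Mid$(s, j + 1, i - 1 - j)
End Function

' Stand-in for the real lookup against the abbreviation database.
Private Function AbbrevExists(ByVal token As String) As Boolean
    AbbrevExists = False
End Function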
• Text Concatenation
This design process receives a list of syllable segments that has been properly arranged according to the raw text. Based on the list of syllables, the Syllable Concatenation module concatenates the sounds according to the sequence and finally plays the result, which we know as synthesized speech. The system is capable of converting Arabic language text into Arabic synthesized speech. Figure 5.5 shows the sequence diagram of the Syllable Concatenation component.
Figure 5.5 Sequence Diagram for Text Concatenation
5.3.2 Recording of the Sounds
The next step is the recording of the sounds, one of the most time-consuming tasks in the development of the software. Below are the activities that are carried out prior to the recording of sounds:
1. Determining the fields
Fields are small units of application data recognized by system software. In this
system, the fields are set as sounds.
2. Determining the data type
Data type is a detailed coding scheme, recognized by system software, for representing data. The data are represented in the files in the wave format.
3. File organization
A file organization is a technique for physically arranging the records of the file
on the secondary storage device. The sounds in the files are stored alphabetically
from أ to ي, which is known as sequential file organization.
5.3.3 Storing Sound Files Using Wave File Format
WAVE File Format is a file format for storing digital audio (waveform) data. It supports
a variety of bit resolutions, sample rates, and channels of audio. The format is very popular on IBM PC (clone) platforms and is widely used in professional programs that process digital audio waveforms. It takes into account some peculiarities of the Intel CPU, such as little-endian byte order (Craig, 2000).
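Because VB6's binary Get reads multi-byte values in this same little-endian order, a header field can be read directly. A minimal sketch, not part of the system's code and assuming the canonical 44-byte PCM header:

Private Function WaveSampleRate(ByVal path As String) As Long
    Dim f As Integer, rate As Long
    f = FreeFile
    Open path For Binary Access Read As #f
    ' In the canonical PCM header, bytes 25-28 hold the sample
    ' rate as a little-endian 32-bit integer.
    Get #f, 25, rate
    Close #f
    WaveSampleRate = rate
End Function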
WAVE (.wav) is the standard form for uncompressed audio on a PC. Since a wave file is
uncompressed data - as close a copy to the original analog data as possible - it is therefore
much larger than the same file would be in a compressed format such as mp3 or
RealAudio. Audio CDs store their audio in, essentially, the wave format. Any audio will
need to be in this format in order to be edited using a wave editor, or burned to an audio CD that will play in a stereo. Table 5.11 below shows some examples of sound segments in the wave format.
Table 5.11 Examples of sound segments stored as wave files
Each segment is a .wav file named after the Arabic letters it covers; one-, two- and three-letter segments are stored for each letter, for example segments for the letter ت (ta) such as ت.wav and segments for the letter ب (ba) such as ب.wav.
5.4 THE DESIGN OF THE DATABASE FOR SOUND FILES
In this dissertation, the type of database used is a relational database: a collection of data items organized as a set of formally described tables, from which data can be accessed or reassembled in many different ways without having to reorganize the database tables.
5.4.1 Representing Relational Database Using Microsoft Access
In this dissertation, Microsoft Access is used as the database to store the sound files. Microsoft Access is one of the most popular database programs used to store data and information for future retrieval. Using Microsoft Access, all of the data and information can be managed from a single database file. Without a database, all the wave files would have to be treated separately, and there are hundreds of wave files to be managed.
By having a database, all the sound files are treated as a single database file, and this saves time when searching for particular files. The sounds are first stored on a hard disk drive and then copied to a CD. Then the database is created for these sound files. The reason for creating the database is mainly to minimize the retrieval time. Retrieving any file or information directly from the hard disk takes longer than accessing it from memory. Once we activate the database, all the related information is loaded into memory from the hard disk.
By doing so, the time needed to retrieve the data and information is relatively short. Without the database, reading data would take a long time, as the system would first have to search the hard disk, then load the data into memory, and only then fetch the desired data from memory. This takes twice the time compared with the first case. Figure 5.6 shows the procedure taken by the system when the user requests a particular piece of data or information.
Figure 5.6 The procedure taken when the user requests certain data
Here is a simple description of the above process:
• Process A – when the user requests the sound ب.wav, the system searches for it in memory.
• Process B – if the sound is not in memory, the system goes to the hard disk and searches for the sound ب.wav there.
• Process C – once the sound is found, the targeted sound file ب.wav is brought into memory.
• Process D – the sound is fetched from memory and returned to the user as requested.
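A minimal Visual Basic sketch of this memory-first lookup follows. It is illustrative only: the cache is a simple Collection keyed by file name, and the folder path matches the one used later in the implementation chapter.

Private soundCache As New Collection

Private Function FetchSound(ByVal name As String) As String
    On Error Resume Next
    FetchSound = soundCache(name)    ' Process A: look in memory first
    If FetchSound = "" Then
        ' Process B/C: locate the file on disk and cache its path.
        FetchSound = "F:\Arabic TTS\Sound\" & name & ".wav"
        soundCache.Add FetchSound, name
    End If
    ' Process D: the (now cached) path is returned to the caller.
End Function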
Tables 5.12 to 5.14 below show the list of the database tables included in the Arabic TTS
Synthesizer System.
Table 5.12 Arabic Word Table
Field   Type          Attributes    Null   Default   Extra
Id      Int(11)       Primary key   No               Auto_increment
Sound   Varchar(60)                 No

Table 5.13 Syllable Table
Field   Type          Attributes    Null   Default   Extra
Id      Int(11)       Primary key   No               Auto_increment
Sound   Varchar(60)                 No

Table 5.14 Abbreviation Table
Field   Type          Attributes    Null   Default   Extra
Id      Int(11)       Primary key   No               Auto_increment
Sound   Varchar(60)                 No
All the sound files, with the extension .wav, are stored in one folder called the Sound folder, while the database tables contain only the file names; for example, the file ت.wav in the Sound folder exists in the Syllable table as ت only, without the extension.
5.5 THE DESIGN OF THE USER INTERFACE
Numerous methods and interfaces for making the implementation of synthesized speech in applications easier have been developed during this decade. It is quite clear that it is not possible to create a standard for speech synthesis methods, because most systems act as stand-alone devices, which means they are incompatible with each other and do not share common parts. However, it is possible to standardize the interface of the data flow between the application and the synthesizer.
Generally, the interface contains a set of control characters or variables for controlling the
synthesizer output and features. The output is usually controlled by normal play, stop,
pause, and resume type commands and the controllable features are usually pitch baseline
and range, speech rate, volume, and in some cases even different voices, ages, and
genders are available. In most frameworks, it is also possible to control other external
applications, such as a talking head or video.
Microsoft Visual Basic 6.0 is used as a programming tool in designing the user interface
for the Arabic Text-To-Speech Synthesizer System. Visual Basic provides us with a visual interface: a design window area in which to work, complete with toolbox, toolbars, and menus. The design of the user interface is carried out in many stages before the real features are designed. The reason behind this is to gain a full understanding of designing user interfaces with Visual Basic. Below are the stages of designing:
When we create a Visual Basic program, we will often need to make choices regarding how to organize our program code. Visual Basic recognizes three different types of modules, or files, of program code:
� Form modules
� Class module
� Modules, or code modules
Each type contains variable declarations, subs and functions. Both form modules and
class modules contain the code required to describe the contents and behavior of a
particular class of object. A form module also contains a physical description of a form or
screen window, indicating what controls appear on it, what they look like, and some
aspects of how the form and controls will behave at run time.
Code modules, or simply modules, are rather different, in that the statements that go into
one of these do not describe anything about a class of objects, but instead define an
individual set of subs, functions and data descriptions. These can be thought of as the
methods and properties of individual objects in the program – each code module defines
one object instead of a class of objects. Objects defined as classes or forms have to be
created or loaded when a program runs before we can access their methods or properties,
but the code in a code module is available immediately.
Any number of code modules can be added to a Visual Basic program. Each code module that we add to a project will define another set of program-wide data items, subs and functions. Each module must have a unique name. To summarize:
� Form modules contain the data items, methods and event-handlers for each
instance of a type of form. Every form of a particular class that is created and used
in a program will have its own set of data items, called instance variables, which
will describe the state of that form. Every form will also have the use of all of the
subs, functions and event-handlers (or methods) defined for the class.
� Class modules contain the data items, methods and event-handlers for each
instance of a type of object. Every object of a particular class that is created and
used in a program will have its own set of data items, called instance variables,
which will describe the state of that form. Every object will also have the use of
all of the subs, functions and event-handlers (or methods) defined for the class.
� Code modules contain the data items, subs and functions for use in a program.
Only a single instance of each data item will exist throughout the program, but
these may be made accessible to every object in the program.
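A minimal sketch of the distinction (file and procedure names are hypothetical): a code module's members are available program-wide as soon as the program runs, so a form can call them without creating anything.

' --- Module1.bas (a code module) ---
Public Const SoundFolder As String = "F:\Arabic TTS\Sound\"

Public Function WaveName(ByVal segment As String) As String
    ' Builds the full path of a segment's wave file.
    WaveName = SoundFolder & segment & ".wav"
End Function

' --- Form1.frm (a form module) ---
' The form can call the code module's function directly, for example:
'     MMControl1.FileName = WaveName("با")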
The following is a list of procedures used in implementing and designing the Arabic TTS
Synthesizer System Interfaces and the Conversion Engine. These procedures are as
follows: Understanding the form and its procedures, Creating the Command Button,
Creating the Text Box, Creating the Button associated with the TextBox, Creating the
Drive to see all the Files inside the Drive, Creating the Multimedia Controls and their
Properties, Combining Multimedia Controls with Command Button, and lastly The User
Interface of the TTS Synthesizer System. Each of the above-mentioned procedures will be described in detail in the next chapter, the Implementation Chapter.
The User Interface of the TTS Synthesizer System
The user interface for the TTS Synthesizer System is created by combining the functions and controls described in the previous sections. The interface contains a TextBox, CommandButtons, a DataControl, labels, a Multimedia Control, a DirListBox, and so on. Figure 5.7 below shows the user interface for the TTS Synthesizer System. The system contains two sub-systems: the first is the Arabic TTS and the second is the English TTS, both shown in the figure below.
Figure 5.7 User interface of the Arabic and English TTS Synthesizer System.
5.6 SUMMARY
The first part of this chapter presented the use case diagram of the Arabic TTS, and the next part described in detail the architectural design of the Arabic TTS. The chapter also listed the software and hardware requirements. The last part of the chapter explained the database construction and the user interface design. The next chapter will explain the implementation.
CHAPTER SIX
ARABIC TTS SYSTEM IMPLEMENTATION
6.1 OVERVIEW
This chapter is the embodiment of the theoretical ideas presented in the previous chapters
into a practical and working system. This chapter discusses the activities needed to
successfully build the Arabic TTS System. This chapter will explain the implementation
of the system step-by-step.
Programming is time-consuming and costly but, except in unusual circumstances, it is the simplest part for the systems analyst because it is well understood. After maintenance, the
implementation phase of the systems development life cycle is the most expensive and
time-consuming phase of the entire life cycle. Implementation is expensive because so
many people are involved in the process; it is time-consuming because of all the work
that has to be completed during implementation. Physical design specifications must be
turned into working computer code, the code must be tested until most of the errors have
been detected and corrected, the system must be installed, user sites must be prepared for the new system, and users must come to rely on the new system rather than the existing
one to get their work done. System implementation is made up of many activities, such as
coding, testing, installation, documentation, training and support. The purpose of these
steps is to convert the physical system specification into working and reliable software
and hardware, document the work that has been done, and provide help for current and
future users of the system.
6.2 CREATING THE ARABIC TTS SYSTEM
Text-To-Speech applications have recently become easier to develop with the introduction of the ActiveX Text-To-Speech and Direct Speech Synthesis controls found in the SAPI SDK, which stands for Speech Application Programming Interface Software Developer's Kit. The SAPI is a set of functions that enable us to incorporate TTS and
Speech Recognition into our applications. These controls provide us with a wide range of
flexibility in creating applications that can convert written text to speech.
Below is a list of procedures used in implementing the Arabic TTS Synthesizer System
Interfaces and the Conversion Engine. These procedures are as follows: Understanding
the form and its procedures, Creating the Command Button, Creating the Text Box,
Creating the Button associated with the TextBox, Creating the Drive to see all the Files
inside the Drive, Creating the Multimedia Controls and their Properties, Combining
Multimedia Controls with Command Button, and lastly The User Interface of the TTS
Synthesizer System. Below is the detailed description of each.
6.2.1 Understanding the form and its procedures
When we begin a Standard EXE project, a form appears in the design window; see Figure 6.1. During this process, we are able to use the tools provided on the menus, toolbar and toolbox. The objects that appear as icons in the toolbox are called "controls". These controls can be added to the form to interact with users. The form design window presents us with a gray background. We can change the color and size of the form.
Controls are placed on the form by selecting them from the toolbox. The Properties window displays the properties list either alphabetically or categorized; we can select one of these options by clicking the respective tab. The categories include Appearance, Behavior, Dynamic Data Exchange, Font, Position, Scale, and so on. The Appearance category, for example, contains the properties that deal with color, caption, pictures, borders and others.
Figure 6.1 The Basic Form and its properties
6.2.2 Creating the Command Button
An important feature of the software is the command button: when the user presses the button, the program responds accordingly. When the user clicks the button, an event occurs; in simple terms, an event is something happening. The command button is created on the form, and it has its own properties that we can manipulate to suit our requirements. The Caption property allows us to rename the button as we want, for example "Speak". Figure 6.2 below shows the form with the command button.
Figure 6.2 The Form with the command button "Speak"
6.2.3 Creating the Text Box
The TextBox control, see Figure 6.3, is typically used to collect user input, and it usually allows the data to be edited. By default, up to 2048 characters can be entered in a TextBox, but with the MultiLine property set to True, the user can enter up to 32 KB. The TextBox is the most important feature in the proposed system: inside the text box, users can type whatever they want. The properties of the text box can be manipulated as well; we can change its height and width based on our own preference, and the font type and font size of the text can be changed too.
Figure 6.3 The Form with the TextBox, with "Text1" written inside
6.2.4 Creating the Button associated with the TextBox and their procedures
The Form, the Command Button, and the TextBox all need to be integrated together, for example so that the TextBox responds when the user clicks the Command Button. Again, the TextBox and the Command Button are created, and the Command Button is given the caption "Clear". The user types a text inside the TextBox, and once the "Clear" button is clicked, the TextBox is cleared of any text written inside it by the user. Figure 6.4 below shows the Button associated with the TextBox.
Here is the procedure written in the “Clear” Button, inside the code page:
Private Sub Clear_Click()
    Text1.Text = ""    ' empty the text box
End Sub
Figure 6.4 The Form with the TextBox and the Command Button.
6.2.5 Creating the Drive to See All the Files Inside the Drive
The DriveListBox enables the user to select a valid drive at run time. Figure 6.5 shows the DriveListBox, which identifies the drives available to the user on a particular system; the box shows all mapped drives, and everything in My Computer is accessible to the user for selection. The DirListBox displays all the available directories on a selected drive, and the FileListBox further enables users to select files from the selected directory. If the Pattern property has been changed to a particular file extension, then only those files will appear in the FileListBox. Why do we need this tool in the proposed system? Because we are going to play the sound files, the Pattern property is changed to "*.wav".
Figure 6.5 The Drive C, its directory and all the files inside the drive.
6.2.6 Creating the Multimedia Controls and their Properties
The MCI control manages the recording and playback of multimedia files on MCI devices, as seen in Figure 6.6. The control looks like a set of VCR controls, containing a set of buttons that issues MCI commands to devices such as sound cards, MIDI sequencers, CD-ROM drives, audio CD players and so on. The Multimedia control uses a set of sophisticated commands, known as MCI commands, which control a range of multimedia devices. Many of these commands correspond directly to a button on the Multimedia control; the Play command carries out the same instruction as the Play button on a VCR panel.
Figure 6.6 The form with the Multimedia Control Commands.
6.2.7 Combining the Multimedia Controls with the Command Button
The Multimedia Controls are combined with Command Buttons such as "Stop", "Play", and "Beep", and we want to see how the Multimedia Controls respond when any of the buttons is pressed. Figure 6.7 shows the above procedure:
Figure 6.7 The form with the Multimedia Controls and the Command Buttons.
6.2.8 The User Interface of the TTS Synthesizer System
The user interface for the TTS Synthesizer System is created by combining the functions and controls described in the previous sections. The interface contains a TextBox, CommandButtons, a DataControl, labels, a Multimedia Control, a DirListBox, and so on. Figure 6.8 below shows the user interface for the TTS Synthesizer System. The system contains two sub-systems: the first sub-system is the Arabic TTS and the second sub-system is the English TTS, as shown in the figure below.
Figure 6.8 User interface of English and Arabic TTS Synthesizer System.
The user interface combines the functions as described below:
• Text Box
A text box is created to allow users to input their text. The maximum length of the text in the text box is also specified. Users are allowed to type in a multi-line text box; the MultiLine property is set to True.
• Command Button
There are five command buttons created for this software: the "Clear", "Exit", "Open", "Speak" ("Say It") and "Home Page" buttons. Each of those buttons has its own function, as in the coding procedures below:
• Clear Button
Private Sub Command1_Click()
    TxtFileName.Text = ""    ' empty the text box
    TxtFileName.SetFocus     ' return focus to it
End Sub
• Exit Button
Private Sub Command2_Click()
    End
End Sub
• Open Button
Private Sub cmdopen_Click()
    On Error Resume Next
    With CommonDialog1
        .CancelError = True        ' raise an error if the user cancels
        .DialogTitle = "Open File" ' update the dialog's title
        .Filter = "Text Files(*.txt)|*.txt|Batch Files(*.bat)|*.bat|Module Files(*.bas)|*.bas"
        .ShowOpen                  ' show the open dialog
    End With
    ' Load the chosen file into the text box.
    Open CommonDialog1.FileName For Input As #1
    Text1.Text = Input$(LOF(1), 1)
    Close #1
End Sub
• Speak Button
This is the most important part of the user interface: the program executes when a user clicks this button, and the text input in the text box by the user is read aloud.
• Home Page Button
Private Sub Command3_Click()
Unload Me
Load Home
Home.Show
End Sub
• The Data Control
The data control is a control created on the form to see all the sound files in the database and to check whether the sounds exist or not. The data control is connected to the database; its RecordSource property is set to "Sounds", which is the name of the table in the database. In the user interface, the data control is set to invisible.
• DirListBox, FileListBox, and DriveListBox
All three of these controls are also created on the form to check the existence of
the files being looked for on a selected drive. These controls are also made
invisible; a minimal existence check is sketched below.
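An equivalent check can also be written with the Dir function instead of the
list-box controls; the sketch below assumes the sound folder path used later in
this chapter, and segment is an illustrative variable holding the text of the
unit being looked up:

If Dir$("F:\Arabic TTS\Sound\" & segment & ".wav") <> "" Then
    ' the wave file for this segment exists and can be played
Else
    ' no recording for this segment; fall back to smaller units
End If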
• Multimedia Control
The Multimedia control is meant to play the sounds when the user clicks on the
'Say It' button; the system responds to the click by issuing commands to the
Multimedia control. The Multimedia control is set to invisible in its properties.
The procedure for converting text to speech is shown below.
• Specify the number of characters in an array

Dim A(20)      ' A(p) holds the length of the matched segment that starts at position p
Dim B(20)      ' B(n) marks character positions already covered by a segment
Cls
X = txtFilename.Text
X = X + " "    ' append a space so the last word is terminated

Cls: the Cls method, which stands for Clear Screen like the DOS command,
clears all the text written with text methods from a form or TextBox control.
• The length of the text

Len(X): for many operations, we need to know how many characters are in a
string. We might need this information to know whether the string we are
working with will fit in a fixed-length database field. Alternatively, if we are
working with big strings, we may want to make sure that the combined size of
two strings does not exceed the capacity of the string variable. To determine the
length of any string, we use the Len() function, as in the code:

L = Len(X)     ' total number of characters in the input text
p = 1
q = 1          ' q marks the start of the current word
For i = 1 To L
• Cut the text, word by word
Mid ( ) is a function that is used to retrieve a substring from a string, we can use
the Mid ( ) function to retrieve a letter, word, or phrase from the middle of a
string. The Mid ( ) function contains two required arguments and one optional
argument, as shown in the following syntax:
If Mid(X, i, 1) = " " Then
oneword = Mid(X, q, i - q)
q = i + 1
i represents the character position at which the retrieved string begins. If i is
greater than the length of the string, an empty string is returned. The optional
argument 1 represents the number of characters to be returned from X. if 1 is
omitted, the function returns all characters in the source string, from the starting
position, on to the end.
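For illustration, a call such as the following returns the three characters
beginning at position 4:

part = Mid("synthesis", 4, 3)     ' part now holds "the"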
• Specify the number of sounds in the database

Data1.Recordset.MoveFirst                     ' start from the first sound record
For k = 1 To Data1.Recordset.RecordCount      ' visit every record in the Sounds table
• Identify string in one word
InStr ( ): the function that enables us to search a string for a character or group of
characters is the InStr ( ) function. This function has two required and two
optional parameters. The required parameters are the string to be searched and the
text to search for. If the search text appears in the string being searched, InStr ( )
returns the index of the character where the search string starts. If the search text
is not present, InStr ( ) returns 0. as shown in the following syntax:
Pos = InStr(oneword, Data1.Recordset.Fields(0))
SegL = Len(Data1.Recordset.Fields(0))
• Go to the next record if the string is not there

If Pos = 0 Then                          ' the segment does not occur in this word
    Data1.Recordset.MoveNext
    GoTo 10
End If
If A(Pos) = 0 Then                       ' no segment recorded at this position yet
    For n = Pos To Pos + SegL - 1
        If B(n) = 1 Then                 ' position already covered by another segment
            Data1.Recordset.MoveNext
            GoTo 10
        End If
    Next n

• Specify the array of the segment

    A(Pos) = SegL                        ' remember the segment length at this position
    For m = Pos To Pos + SegL - 1
        B(m) = 1                         ' mark the characters this segment covers
    Next m
    'Print "A("; Pos; ")="; A(Pos)
End If
Data1.Recordset.MoveNext
p = p + 1
10 Next k
Data1.Recordset.MoveFirst is a method that moves the record pointer from the
current record to the first record in the opened recordset.
Data1.Recordset.MoveNext is a method that moves the record pointer from the
current record to the next record (the record following the current one) in the
opened recordset. If no such record exists (that is, if we are already at the last
record), the end-of-file (EOF) flag is set and there is no current record.
• Play the sounds of the word in an array

For t = 1 To 20
    If A(t) <> 0 Then      ' a matched segment starts at position t
        MMControl1.FileName = "F:\Arabic TTS\Sound\" + Mid(oneword, t, A(t)) + ".wav"
        MMControl1.Command = "Open": MMControl1.Command = "Play"   ' open and play the segment's wave file
    End If
Next t
Constructing the Database
A database may contain phonemes, diphones, syllables, or words, or, for better prosodic
synthesis, a mixture of these. As a consequence, however, a decision algorithm needs to
be implemented to decide which acoustical unit best suits a given text. Also, the use of
diphones and/or words requires larger memory space. Phonemes were used as the basic
acoustical unit owing to their simple implementation and low memory requirement.
• The first step in constructing a diphone database for Arabic is to determine all
possible diphone pairs of Arabic. In general, the typical number of diphones is the
square of the number of phones for any language.
• Arabic has 28 consonant phonemes, four of which are emphatic. Adding the two
semi-vowels (as /ay/ in (بيت) "bayt" (house) or /aw/ in (يوم) "yawm" (day)), six
vowels, and three additional consonants (/p/ 'پ', /g/ 'چ' and /v/ 'ڤ') results in 39
possible phonemes. Since we are interested in possible phoneme combinations,
i.e. diphones, we get 39 × 39 = 1521 diphone pairs, as the sketch below
illustrates.
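The size of the inventory can be verified mechanically. The following sketch (filling the
phoneme array with actual labels is assumed, not shown) pairs every phoneme with every
other and counts the ordered pairs:

Dim phonemes(1 To 39) As String     ' the 39 Arabic phonemes
Dim i As Integer, j As Integer
Dim total As Long
' ... fill phonemes(1) .. phonemes(39) with the phoneme labels ...
total = 0
For i = 1 To 39
    For j = 1 To 39
        ' each ordered pair (phonemes(i), phonemes(j)) is one candidate diphone
        total = total + 1
    Next j
Next i
Debug.Print total                   ' prints 1521 = 39 * 39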
Text Analysis
Text analysis is composed of three processes: text normalization, text segmentation
and text concatenation. Text normalization spells out numerical values and
abbreviations in the input text. Text segmentation segments the text into basic
acoustical units identical to the ones stored in the database. Finally, text concatenation
joins these acoustical units, phonemes and syllables, to produce the targeted sound.
Text Normalization
The process of reformatting the text, unwrapping the strict token sequence from the
visual presentation style, and encoding the useful parts of the style in a defined and
explicit way is called text normalization. Input text to a TTS system usually has no
constraints: it can contain abbreviations, symbols, foreign words and numbers.
Thus, a pre-processing unit is required to convert this unlimited text into a format
which can be processed; an abbreviation, for example, should be read as its fully
spelled-out form. In the TTS Synthesizer System, a dictionary of abbreviations was
constructed which the system can consult during text normalization; a lookup along
the lines of the sketch below suffices.
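As a minimal sketch, assuming (the dissertation does not specify this layout) that each
abbreviation record stores the short form in field 0 and its spoken expansion in field 1,
reached through a data control named datAbbrev (a hypothetical name):

Function ExpandToken(tok As String) As String
    datAbbrev.Recordset.MoveFirst
    Do While Not datAbbrev.Recordset.EOF
        If datAbbrev.Recordset.Fields(0) = tok Then
            ExpandToken = datAbbrev.Recordset.Fields(1)   ' substitute the full form
            Exit Function
        End If
        datAbbrev.Recordset.MoveNext
    Loop
    ExpandToken = tok                                     ' not an abbreviation; keep as-is
End Function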
Text Segmentation
Input text may be in the form of paragraphs, sentences, or words. Thus, it is necessary
to segment text in hierarchical order: higher-level structures into paragraphs,
paragraphs into sentences, sentences into words and words into manageable units. For
this TTS Synthesizer System, the manageable units are in the form of phonemes.
In this dissertation, the input text is limited to paragraph form. A paragraph is
segmented into sentences by finding the sentence punctuation marks such as '.', '!'
and '?'. However, there are exceptions: if abbreviations are used in the input text, the
system must be aware that their periods are not sentence punctuation marks but
abbreviation marks. To solve this dilemma, the system checks whether the letters
preceding the period are included in our abbreviation database. If not, the period is
considered a mark that ends a sentence; the sketch after this paragraph illustrates the
rule. To segment sentences into words, blank spaces are located in the text that has
been classified as a sentence. From the text that has been identified as words, the
phonemic representations equivalent to the set of letters of the retrieved word are
generated. In our implementation, segmentation of words into phonemes only allows
a maximum of 16 letters.
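The period-disambiguation rule can be sketched as follows; IsAbbreviation and
ProcessSentence are hypothetical helpers standing in for the abbreviation-database
lookup and the word-level segmentation described above:

Sub SplitSentences(para As String)
    Dim i As Long, start As Long
    Dim ch As String, lastWord As String
    start = 1: lastWord = ""
    For i = 1 To Len(para)
        ch = Mid(para, i, 1)
        If ch = " " Then
            lastWord = ""                       ' a space starts a new word
        ElseIf ch = "!" Or ch = "?" Or _
               (ch = "." And Not IsAbbreviation(lastWord)) Then
            ProcessSentence Mid(para, start, i - start)
            start = i + 1: lastWord = ""
        Else
            lastWord = lastWord & ch            ' accumulate the current word
        End If
    Next i
End Sub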
Examples
The word: Maghrib (مغرب) 'when the sun sets below the horizon'
• Text converter: CVCCVC (<gh> is bound as one; /ɣ/)
• Text segmenter: CVC.CVC
• Synthesizer: /maɣ/ + /rib/
The word: dhahab (ذهب) 'gold'
• Text converter: CVCVC (<dh> is bound as one; /ð/)
• Text segmenter: CV.CVC
• Synthesizer: /ða/ + /hab/
How the System Functions
This section provides the implementation details of the Arabic TTS Synthesizer. The
Arabic TTS Synthesizer works as follows:
Once the system's interface is displayed, the user writes or inputs any text to be converted
to audio format, then clicks the Speak button to hear the sound. The system reads the
targeted text backward (from left to right), then compares it against the database and
against the sound folder. If the targeted text is found, it converts the text to audio format;
otherwise, it cuts the text letter by letter and compares the pieces against the database and
the sound folder. Figure 6.9 shows a sample word, which is pronounced as described
below:
Figure 6.9 A sample word to be pronounced by the system.
o The system searches for the whole word in the database, then in the sound folder.
o If the word exists in the database and in the sound folder, the system generates or
converts the word from written form to audio form.
o Otherwise, the word is not in the database.
"��F#
o The system cuts the word letter by letter backward, such as "��F#, it becomes H
/ E��F#, then searches for the phone H and the syllable E��F# against the database
and the sound folder.
o If found it converts the word from written form to audio form.
o The system cuts the word letter by letter backward, such as "��F#, it becomes "�#
/ EEF#, then searches for the syllable E�#" and the syllable I EEF# against the
database and the sound folder.
o If found it converts the word from written form to audio form.
o The system cuts the word letter by letter backward, such as "E��F#, it becomes
"��EJ / EE#, then searches for the syllable "��EEJ and the phone EE# Iagainst the
database and the sound folder.
o If found it converts the word from written form to audio form.
o And so on for the rest of the written text.
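One way to condense this cutting strategy is as a longest-match loop, shown here
matching from one end of the stored string; LookupSound is a hypothetical helper that
returns True when a segment exists in both the database and the sound folder, and
PlaySegment plays its wave file:

Dim word As String, cut As Long
' word holds the next word of the input text
cut = Len(word)                       ' first try the whole word
Do While cut >= 1 And Len(word) > 0
    If LookupSound(Left$(word, cut)) Then
        PlaySegment Left$(word, cut)  ' play the longest matching piece
        word = Mid$(word, cut + 1)    ' continue with the remaining letters
        cut = Len(word)
    Else
        cut = cut - 1                 ' shorten by one letter and try again
    End If
Loop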
6.3 SUMMARY
This chapter has covered the implementation of the Arabic TTS Synthesizer System. It
explained the process of creating the TTS and described in detail the algorithms and the
coding of the system, and how the Arabic TTS Synthesizer System works. The next
chapter will explain the testing and the evaluation results of the Arabic TTS Synthesizer
System, which were obtained from a group of participants.
CHAPTER SEVEN
TESTING AND EVALUATION OF THE ARABIC TTS SYSTEM
7.1 OVERVIEW
To test the intelligibility, naturalness and overall quality of the Arabic Text-to-Speech
system developed in this dissertation, a test for the Arabic voice was designed. In this
chapter, test parameters plus design of the test and results are discussed.
7.2 TESTING AND EVALUATING TTS SYSTEMS
Once the system components have been coded, it is time to test them. Several testing
approaches exist that lead to delivering a quality system to the end users or customers
(clients). Testing is not the first place where fault finding occurs, but testing is focused on
finding faults. There are several steps in testing a system: the function test, performance
test, acceptance test, and installation test. Each step has a different focus, and a step's
success depends on its goal or objective.
A function test checks that the integrated system performs its functions as specified in
the requirements. For example, a function test of a TTS system verifies that the system
can correctly convert text to speech, and so on.
The performance test compares the integrated components with the nonfunctional
system requirements. These requirements, including accuracy, speed, and reliability,
constrain the way in which the system functions are performed. For instance, a
performance test of the TTS system evaluates the speed with which the conversion
process is made and the response time to the users.
An acceptance test assures the customers that the system is the system that was built for
them. The acceptance test is sometimes run in the system's actual environment but is
often run at a test facility different from the target location. For this reason, a final
installation test may be run to allow users to exercise system functions and document
additional problems that result from being at the actual site.
Users want systems that are easy to learn and use as well as effective, efficient, safe, and
satisfying. Evaluating is the process of determining the usability and acceptability of the
product or design, measured in terms of a variety of criteria including the number of
errors users make using it, how appealing it is, how well it matches the requirements,
and so on. Carrying out an effective evaluation of any TTS system is not always a simple
task. Some of the main contributing factors that Klatt (1987) believes affect the overall
quality of any TTS system are:
• Clearness, that is, how much of the spoken output the user grasps, as well as how
quickly a listener becomes fatigued by merely listening.
• Pleasantness / Naturalness: the most subjective evaluation criteria are the degree
of pleasantness and the degree of naturalness. The two are slightly different: it is
possible for a voice to sound natural yet unpleasant, and it is possible for a voice
to be judged pleasant and still have a machine-like quality, which is the case for
modern, top-of-the-line speech synthesizers.
• Suitability for the intended application: different applications have differing
needs of a TTS system. For example, a system for the blind places higher
demands on intelligibility than on naturalness.
Depending on what kind of information is needed, the evaluation can be made on several
levels: phoneme, word, or sentence level. Many tests help to address these three issues
and others. Several individual test methods for synthetic speech have been developed
during the last decades. Some researchers even complain that there are so many existing
methods that comparison and standardization procedures become difficult. At the same
time, it is clear that there is still no single test method that gives a final, correct result.
This chapter gives a short introduction to the most commonly used methods, which form
the foundation of the test and evaluation questionnaire introduced in a later section of
this chapter.
Table 7.1 below shows the possible evaluation attributes that can be used in testing and
evaluating the Arabic TTS Synthesizer System.
Table 7.1 Possible evaluation attributes

Attribute            Rating levels (+ ... -)
Naturalness          Very natural ... Very unnatural
Speed                Too fast ... Too slow
Sound Quality        Very good ... Very bad
Pronunciation        Not annoying ... Very annoying
Clearness            Very easy ... Very hard
Stress/Intonation    Not annoying ... Very annoying
7.3 TESTING THE ARABIC VOICE
To test the clearness, naturalness and overall quality of the Arabic TTS Synthesizer
System developed in this dissertation, a test for the Arabic voice was designed. In this
chapter, test parameters plus design of the test and results are discussed.
7.3.1 Test Group
The main concern when choosing the test group was that the participants should be
non-native speakers of Arabic; to ensure a good command of the language, it was decided
that the participants should have Arabic as their second language. The group consists of
27 people. The majority of the participants are students at the International Islamic
University Malaysia, in the Department of Arabic Linguistics. The level of fluency varies
among the participants: some are fairly fluent and some are not very fluent.
7.3.2 Method
The main goal of this evaluation test is to determine how much of the spoken output one
can understand. The test is divided into three parts. The first part evaluates the system
with respect to naturalness, speed, sound quality, pronunciation, clearness and
stress/intonation. The second part assesses the usage of the system. The last part gauges
the level of errors in the system.
The participants are asked a few questions about these aspects and are asked to mark how
well the voice performs. These simple exercises yield an overall assessment of this TTS
Synthesizer System.
7.4 TEST AND EVALUATION RESULTS
Now that the test is done, a summary of the results is presented in this section. The
results are presented in diagrams and tables with percentage values.
7.4.1 Naturalness
Regarding the question whether the voice is nice to listen to or not, 33.3 % (9
respondents out of 27) considered the voice natural, 40.7 % (11 respondents out of 27)
thought that the naturalness of the voice was acceptable and 25.9 % (7 respondents out of
27) considered the voice unnatural. Table 7.2 below shows the outcomes of the
questionnaire in detail.
Table 7.2 Naturalness / Clearness

                     Very natural   Natural    OK    Unnatural   Very unnatural   Total
No. of Respondents        0            9       11        7             0            27
% of Respondents        00.0         33.3     40.7     25.9          00.0         100 %
The results are shown in figure 7.1 below.
[Bar chart: rating scale vs. % of listeners.]
Figure 7.1 Naturalness of the voice.
7.4.2 Speed
The speed of a system is a major concern: if the system speaks too fast or too slow, this
may have a negative effect on the concentration of the subjects. They might give up
listening if it is too fast, and the speech would not sound as natural as possible if it is too
slow. 14.8 % (4 respondents out of 27) of the listeners considered that the system speaks
too fast and 18.5 % (5 respondents out of 27) thought it fast. 37.0 % (10 respondents out
of 27) considered the voice to have normal speech speed. 7.4 % (2 respondents out of 27)
thought it slow and another 22.2 % (6 respondents out of 27) thought it too slow. Table
7.3 below shows the outcomes of the questionnaire in detail.
Table 7.3 Sound speed

                     Too fast   Fast   Normal   Slow   Too slow   Total
No. of Respondents      4        5       10       2        6        27
% of Respondents      14.8     18.5     37.0     7.4     22.2     100 %
Figure 7.2 below shows the results for listening to the sound.
[Bar chart: "Does the system speak adequately fast (Normal)?"; rating scale vs. % of listeners.]
Figure 7.2 The speed of the speech.
7.4.3 Sound Quality
The question for this part is "Do you consider the system to be of good sound quality?"
After listening to the sound, 11.1 % (3 respondents out of 27) considered the voice to
have a very good sound quality and 44.4 % (12 respondents out of 27) considered it to
have a good sound quality. 33.3 % (9 respondents out of 27) thought the sound quality of
the voice is neither bad nor good, and the remaining 11.1 % (3 respondents out of 27)
considered the sound quality of the system bad. Table 7.4 below shows the outcomes of
the questionnaire in detail.
Table 7.4 Sound quality

                     Very good   Good   Neither good nor bad   Bad   Very bad   Total
No. of Respondents       3        12             9              3        0        27
% of Respondents       11.1      44.4          33.3           11.1     00.0     100 %
The results are shown in figure 7.3 below.
[Bar chart: "Do you consider the system has a good sound quality?"; rating scale vs. % of listeners.]
Figure 7.3 The sound quality of the voice.
7.4.4 Pronunciation
The pronunciation part consists of three questions addressed to the participants, intended
to give an idea of how difficult the speech uttered by the system is to grab/get, to
determine which sounds are the most difficult to catch, and to gradually process and
improve those sounds. The first question in this category is whether the listeners found it
very hard to grab/get some of the words; the aim was to gather information about which
words were considered hard to grab/get and which sounds those words contained. 7.4 %
(2 respondents out of 27) of the listeners thought it very hard to grab/get some of the
words and 18.5 % (5 respondents out of 27) thought it hard. 25.9 % (7 respondents out of
27) thought it neither hard nor easy, 40.7 % (11 respondents out of 27) thought it easy,
and 7.4 % (2 respondents out of 27) thought it very easy. Table 7.5 below shows the
outcomes of the questionnaire in detail.
Table 7.5 Pronunciation question 1

                     Very hard   Hard   Neither hard nor easy   Easy   Very easy   Total
No. of Respondents       2         5              7              11        2         27
% of Respondents        7.4      18.5           25.9            40.7      7.4      100 %
Figure 7.4 below shows the results.
[Bar chart: "Was it very hard to grab/get some of the words?"; rating scale vs. % of listeners.]
Figure 7.4 The pronunciation's effect on understanding.
The second question in the pronunciation part is intended to investigate whether the
participants had to concentrate hard to be able to grab/get the speech uttered by the
system. This question gives information about how difficult the voice is to grab/get and
how much the participants had to concentrate to do so. The results are summarized
according to the subjects' own estimations. The results after listening to the sound show
that 11.1 % (3 respondents out of 27) of the participants did not have to concentrate on
the sound, while 29.6 % (8 respondents out of 27) considered that the system requires
normal concentration. 37.0 % (10 respondents out of 27) of the participants had to
concentrate a little. For 18.5 % (5 respondents out of 27), some concentration was needed
for specific sounds. The remaining 3.7 % (1 respondent out of 27) had to concentrate a
lot. Table 7.6 below shows the outcomes of the questionnaire in detail.
Table 7.6 Pronunciation question 2

                     A lot of        Some            Normal          Little          No
                     concentration   concentration   concentration   concentration   concentration   Total
No. of Respondents        1               5               8               10               3           27
% of Respondents         3.7            18.5            29.6            37.0            11.1         100 %
The results are shown in figure 7.5.
[Bar chart: "Did you have to concentrate a lot to grab/get the speech told by the voice?"; rating scale vs. % of listeners.]
Figure 7.5 The concentration needed to hear the pronunciation.
The third question concerning pronunciation is how annoying the participants found the
voice. 33.3 % (9 respondents out of 27) of the participants found the voice slightly
annoying and another 33.3 % (9 respondents out of 27) thought the voice was not
annoying. 25.9 % (7 respondents out of 27) found it annoying, while 3.7 % (1 respondent
out of 27) found it very annoying and 3.7 % (1 respondent out of 27) found it too
annoying. Table 7.7 below shows the outcomes of the questionnaire in detail.
Table 7.7 Pronunciation question 3

                     Not annoying   Little annoying   Annoying   Very annoying   Too much annoying   Total
No. of Respondents        9               9               7            1                 1             27
% of Respondents        33.3            33.3            25.9          3.7               3.7          100 %
The results are shown in figure 7.6.
[Bar chart: "How did you find the pronunciation?"; rating scale vs. % of listeners.]
Figure 7.6 The annoyance level of the pronunciation.
7.4.5 Clearness
Two questions were asked concerning the intelligibility of the system. On the question of
how much of what the voice said the participants understood, 40.7 % (11 respondents out
of 27) of the participants understood much (well) and 25.9 % (7 respondents out of 27)
understood very much (very well). 14.8 % (4 respondents out of 27) understood neither
much nor little, another 14.8 % (4 respondents out of 27) understood a little, i.e. not very
well, and 3.7 % (1 respondent out of 27) understood very little. As mentioned before,
these are the subjects' own estimations. Table 7.8 below shows the outcomes of the
questionnaire in detail.
Table 7.8 Clearness question 1

                     Very much   Much   Neither much nor little   Little   Very little   Total
No. of Respondents       7        11               4                4           1          27
% of Respondents       25.9      40.7            14.8             14.8         3.7       100 %
The results are shown in figure 7.7.
[Bar chart: "How clear is the voice?"; rating scale vs. % of listeners.]
Figure 7.7 Understanding the voice.
The second question of this part is "Was the voice easy to grab/get?" The reason for this
question was to establish whether the difficulty in grabbing/getting lies in the voice or in
the listeners' lack of knowledge and vocabulary. After listening to the sound, 44.4 % (12
respondents out of 27) of the listeners found the voice easy to understand, while 25.9 %
(7 respondents out of 27) found it neither hard nor easy. 14.8 % (4 respondents out of 27)
considered it very easy to grab the sound. However, 11.1 % (3 respondents out of 27)
considered the sound hard to grab/get, and only 3.7 % (1 respondent out of 27)
considered it very hard to grab/get. Table 7.9 below shows the outcomes of the
questionnaire in detail.
Table 7.9 Clearness question 2

                     Very hard   Hard   Neither hard nor easy   Easy   Very easy   Total
No. of Respondents       1         3              7              12        4         27
% of Respondents        3.7      11.1           25.9            44.4     14.8      100 %
Figure 7.8 below shows the results.
[Bar chart: "Was the voice easy to grab/get?"; rating scale vs. % of listeners.]
Figure 7.8 The level of difficulty in understanding the voice.
7.4.6 Stress/Intonation
Although no processing concerning stress and intonation was undertaken in the system,
it was decided to survey the participants concerning these aspects of the voice. The first
question in the stress and intonation part asks what the participants think of the intonation
of the voice. The results after listening to the sound are as follows: 40.7 % (11
respondents out of 27) considered the intonation good and 25.9 % (7 respondents out of
27) thought it neither good nor bad. 18.5 % (5 respondents out of 27) considered it very
good; however, 7.4 % (2 respondents out of 27) thought it bad, and a further 7.4 % (2
respondents out of 27) considered it very bad. Table 7.10 below shows the outcomes of
the questionnaire in detail.
Table 7.10 Stress/intonation question 1

                     Very good   Good   Neither good nor bad   Bad   Very bad   Total
No. of Respondents       5        11              7             2        2        27
% of Respondents       18.5      40.7           25.9           7.4      7.4     100 %
The results of listening to the sound are shown in percentages in figure 7.9.
[Bar chart: "What do you think of the intonation of the voice?"; rating scale vs. % of listeners.]
Figure 7.9 The intonation of the system.
"How do you find the stress?" is the second question in this part of the evaluation
questionnaire. 40.7 % (11 respondents out of 27) found the stress slightly annoying,
while 29.6 % (8 respondents out of 27) found it not annoying at all. 18.5 % (5
respondents out of 27) found the stress annoying, 7.4 % (2 respondents out of 27) found
it very annoying, and only 3.7 % (1 respondent out of 27) found it too annoying. Table
7.11 below shows the outcomes of the questionnaire in detail.
Table 7.11 Stress/intonation question 2

                     Not annoying   Little annoying   Annoying   Very annoying   Too much annoying   Total
No. of Respondents        8              11               5            2                 1             27
% of Respondents        29.6            40.7            18.5          7.4               3.7          100 %
The results are shown in figure 7.10 as percentages of the number of subjects.
[Bar chart: "How did you find the stress?"; rating scale vs. % of listeners.]
Figure 7.10 The stress of the system.
These results are good considering that the stress and intonation components of the
system were not processed at all; one could even say that the voice has reasonably good
stress and intonation.
7.4.7 Errors
On the question whether the participants think that the voice makes many pronunciation
mistakes, 29.6 % (8 respondents out of 27) considered that the system makes neither
many nor few mistakes. 25.9 % (7 respondents out of 27) of the listeners believed that
the mistakes were few and another 25.9 % (7 respondents out of 27) believed that the
mistakes were very few. However, 11.1 % (3 respondents out of 27) considered that there
were many mistakes and, finally, only 7.4 % (2 respondents out of 27) thought that there
were too many mistakes. Table 7.12 below shows the outcomes of the questionnaire in
detail.
Table 7.12 System errors

                     Too many   Many   Neither many nor few   Few   Very few   Total
No. of Respondents       2        3              8              7        7        27
% of Respondents        7.4     11.1           29.6           25.9     25.9     100 %
Figure 7.11 shows the results.
[Bar chart: "Does the system make many pronunciation mistakes?"; rating scale vs. % of listeners.]
Figure 7.11 Pronunciation mistakes.
7.5 SUMMARY
In this chapter, the evaluation test for intelligibility, naturalness, and quality of the
Arabic TTS Synthesizer was presented. Testing the TTS helped in understanding how
variation in the input text can affect the intelligibility, naturalness, and quality of the
synthesized speech. The tests reveal that the synthesizer performs quite well and
satisfactorily.
The above results and analysis imply that, when it comes to intelligibility, the Arabic
TTS Synthesizer System is successful. The participants can hear what is being said and
recognize changes in the synthesized speech. However, there is a plan for improvements,
which will be described in the next chapter. The majority of both words and sentences
were correctly recognized and perceived by the majority of the listeners, and the
evaluation of the overall quality of the system is satisfying at this stage.
CHAPTER EIGHT
CONCLUSION AND FUTURE WORK
8.1 CONCLUSION
Text-to-Speech synthesis has developed gradually over the last few decades and has been
integrated into several new applications. For most applications, the intelligibility and
comprehensibility of TTS synthesizers have reached an acceptable level. Nevertheless, in
the prosodic, text preprocessing, and pronunciation fields there is still much work and
improvement to be done to achieve more natural-sounding speech. Natural speech has so
many dynamic changes that perfect naturalness may be impossible to achieve.
However, since the markets for TTS-related applications are growing gradually, the
attention given, in effort and funds, to this research area is increasing as well. Current
TTS Synthesizer Systems are so complicated that one researcher cannot handle the whole
system. With good modularity it is possible to divide the system into a number of
individual modules that can be developed independently if the communication between
the modules is designed carefully.
Three methods used in TTS synthesis technology were introduced in Chapter 2. The
most commonly used techniques in current systems are based on Formant and
Concatenative synthesis. The Concatenative synthesis method is becoming more widely
accepted, since the techniques used to reduce the discontinuity effects at concatenation
points are becoming more effective. The Concatenative method provides more natural
and individual-sounding speech, but the quality of some consonants may vary
considerably, and the control of pitch and duration may in some cases be difficult,
especially with longer units.
Naturally, some combinations and modifications of these basic methods have been used
with variable success. An interesting approach is a hybrid system where the Formant and
Concatenative methods are applied in parallel, each to the phonemes for which it is most
suitable. In general, combining the best parts of the basic methods is a good idea, but in
practice, controlling the synthesizer may become difficult.
This TTS Synthesizer System is based on the concatenation method. The challenges the
Arabic language poses when building TTS systems were addressed; examples of these
problems and challenges are the diacritization problem, the existence of many dialects of
Arabic, and the differences in gender. This mapping of the problems should be helpful
for others who wish to build a TTS Synthesizer System in Arabic or other languages that
have not been extensively studied and processed.
Apart from the aforementioned advantages, the developed TTS Synthesizer System is not
without limitations. To run this system, the computer must support the Arabic version of
Microsoft Office. Besides, the system contains a limited number of segments in the wave
sound files. There are thousands of Arabic words with different suffixes and prefixes; to
hold all of them, the database would grow too large, with thousands of segments.
Therefore, this system keeps only the frequently used segments in the database. Each
language has its own rules which should be considered to achieve better performance of
a TTS Synthesizer System; this is a major factor affecting the quality of the output and
the performance.
This dissertation shows that the creation of a TTS Synthesizer System covers a whole
range of processes and that extensive work has to be done in order to build a voice in
Festival. The availability of free and semi-free synthesis systems, such as the Festival
Speech Synthesis System, makes building a speech synthesizer easier and lowers the cost.
Lastly, this dissertation has fulfilled its purpose by creating a fully working Arabic TTS
Synthesizer System. The system provides satisfactory results in testing, but extensive,
continued work is required to develop it further into a high-quality TTS Synthesizer
System. The results of this system are very promising, with a high level of intelligibility.
Although the test and evaluation questionnaire is quite simple, it yields guidelines and
information on the intelligibility, naturalness, speed and overall quality of the system.
Observing a larger and more diverse group of participants would have enabled a better
evaluation of the system; the small number of participants affected the test results and
the evaluation negatively. In particular, the small number of subjects does not allow
comparing subjects and results by level of fluency, where it would have been interesting
to see whether any differences arise. Therefore a better division of the group and a larger
number of participants is recommended.
8.2 SUGGESTIONS AND FUTURE WORK
The generation of synthetic speech from text involves many stages of processing which
gradually build an acoustic description of a speech signal. The results can be judged by
whether the speech is:
• Effective in conveying information,
• Believable as a human voice, and
• Pleasant to listen to for long periods.
At the time of writing, it must be concluded that there is still work to be done in
improving each of these factors. Some of the possible improvements that can be made
are:
• Record more sounds in the sound database. More sounds can be recorded to
achieve better performance and a larger vocabulary, so users can learn more
words with fewer limitations.
• Build more user-friendly interfaces, such as a command to select different voices,
for example a man's voice and a woman's voice, as well as an interface that
allows users to click on Arabic words rather than typing them, which is useful for
users who do not have an Arabic keyboard.
• Adding an animated character (agent). An agent, or mouth-utterance character,
can be included to encourage the user to continue using this software. Humans are
more attracted to animated and attractive interfaces, which can create interest and
fun in learning. Such characters are able to speak the input text, accompanying
the output sound with mouth movements and gestures.
• This system also provides the opportunity to develop a new TTS Synthesizer for
different languages. The TTS Synthesizer itself was developed based on existing
online TTS Synthesizer Systems on the Internet. The possible languages that
could be added to the current TTS Synthesizer System include, for example,
French and Malay.
• Future work to improve the quality of the system should be addressed. An initial
task is to check the speech or diphone segmentation of the problematic sounds,
since they are considered hard to understand. Manual checking and correcting of
the labels is required: one has to trace back and check the entry in the diphone
index and compare it to the label for the fabricated word.
• Another important issue is signal processing to obtain the required prosody.
Speaker-specific intonation and speaker-specific duration have to be considered
when building a new voice. The major components of the prosody that can be
recognized are pitch, amplitude and the duration of the concatenated speech.
REFERENCES AND RESOURCES
AcuVoice Inc. (1998). Homepage. http://www.acuvoice.com.
Alan D. & Barbara H., (2003). System Analysis and Design. 2nd Edition, John Wiley and
Sons.
Alan D., Janet E., Gregory D., & Russell B. (2003). Human-Computer Interaction, 3rd
Edition. Prentice Hall.
Alan E. & Ryan M. (1999). Visual Basic 6.0: Environment, Programming, and
Applications. Que™ Education and Training. An Imprint of Macmillan Computer
Publishing.
Allen J., Hunnicutt S., & Klatt D. (1987). From Text to Speech: The MITalk System.
Cambridge University Press, Inc.
Amr Y. & Ossama E. (2004). An Arabic TTS system based on the IBM trainable speech
synthesizer. Le traitement automatique de l'arabe, JEP-TALN.
Bell Laboratories TTS (1998). Homepage. http://www.bell-labs.com/project/tts/.
Binu k. Mathew (2004). The Perception Processor. Ph.D. Dissertation.
Breen A., Bowers E., Welsh W. (1996). An Investigation into the Generation of Mouth
Shapes for a Talking Head. Proceedings of ICSLP 96, vol. 4.
Browman, C. (1980). Rules for demi-syllable synthesis using LINGUA language
interpreter, Proceedings of International Conference on Acoustics, and signal
Processing, IEEE.
Chris, R. (1992) (Editor). Speech Processing. McGraw-Hill, London.
Clement H., (1987). A History of Arabic Literature. Draf Publishers Limited. London.
Cornelius T. (2003). Intelligent Systems: Technology and Applications, Signal, Image,
and Speech Processing, vol. 3, pp. 1 – 48. CRC Press, New York.
Craig A. (2000). Digital Audio with JAVA, Prentice Hall PTR, USA.
David A., Guy F., (2003). Information System Development - Methodologies, techniques
and tools. 3rd Edition, Mc-Graw Hill.
Deitel H., Deitel P. & Nieto T. (1999). Visual Basic 6.0 How to Program. Prentice Hall,
Upper Saddle River. New Jersey.
Dettweiler H., Hess W. (1985). Concatenation Rules for Demisyllable Speech Synthesis.
Proceedings of ICASSP 85 (2): pp. 752-755.
Don J. Connexionx - sharing knowledge and building communities.
http://cnx.rice.edu/content/m0088/latest/.
Donovan R. (1996). Trainable Speech Synthesis. PhD. Thesis. Cambridge University
Engineering Department, England. ftp://svr-
ftp.eng.cam.ac.uk/pub/reports/donovan_thesis.ps.Z.
Dutoit T. & Leich H. (1993). MBR-PSOLA: Text-to-speech synthesis based on an
MBEre-synthesis of the segments database. Speech Communication, vol. 13, pp. 432 –
440.
Edward Y. (2003). Techniques of Teaching Pronunciation in ESL, Bilingual & Foreign
Language Classes. Lincom Europa.
Edwards A. (1991). Speech Synthesis: Technology for Disabled People. London: Paul
Chapman Ltd.
F. J. Owens. (1993). Signal Processing of Speech. The Macmillan Press Ltd.
Flanagan J. & Rabiner L. (1973) (Editors) Speech Synthesis. Dowden, Hutchinson &
Ross, Inc., Pennsylvania.
Flanagan J. (1972). Speech Analysis, Synthesis, and Perception. Springer-Verlag, Berlin-
Heidelberg-New York.
HADIFIX (1997). Speech Synthesis Homepage. University of Bonn. http://www.ikp.uni-
bonn.de/~tpo/Hadifix.en.html.
Hamza W. (2000). Arabic Speech Synthesis Using Large Speech Database. PhD. Thesis.
Cairo University, Electronics and Communications Engineering Department.
Haywood J. & Ahmad H. (2003). A new Arabic grammar. Lund Humphries, London.
Holmes J. (1988). Speech Synthesis and Recognition. Van Nostrand Reinhold (UK) Co.
Ltd. England.
http://svr-www.eng.cam.ac.uk/~ajr/SpeechAnalysis/node5.html
http://www.acoustics.hut.fi/.../lemmetty_mst/chap2.html
http://www.csun.edu/cod/conf/1991/proceedings/voice.htm.
http://www.enhance.phon.A.ac.uk/public/examples/copysyn/index.html.
Huang X., Acero A., Hon H., Ju Y., Liu J., Mederith S., & Plumpe M.(1997). Recent
Improvements on Microsoft’s Trainable Text-to-Speech System - Whistler. Proceedings
of ICASSP 97 (2): pp. 959-934.
Husni Al-Muhtaseb, Moustafa Elshafei, & Mansour Al-Gamdi. (2003). Techniques for
High Quality Arabic Speech Synthesis. College of Computer Science and Engineering,
King Fahd University of Petroleum and Minerals.
INC Speech Works International. http://www.tmaa.com/tts/.
IPA. The international phonetic association. http://www.arts.gla.ac.uk/IPA
Ishizaka K. & Flanagan J. (1972). Synthesis of voiced sounds from a two-mass model of
the vocal cords. Bell System Technology Journal, vol. 51, No. 6, pp. 133-1268.
Jeffery L., Lonnie D., Kevin C., (2004). System Analysis and Design Methods. 6th
Edition, Mc-Graw Hill.
Klatt D. (1987). Review of Text-to-Speech Conversion for English. Journal of the
Acoustical Society of America, JASA. vol. 82, No. 3, pp. 737-793.
Kleijn W. and Paliwal K. (1995) (Editors). Speech Coding and Synthesis. Amsterdam:
Elsevier.
Koay C. (2000). Learning Visual Basic 6.0: Step-by-Step. Venton Publishing. Kuala
Lumpur. Malaysia
Laura Mayfield Tomokiyo, Alan W Black, and Kevin A Lenzo. (2003). Arabic In My
Hand: Small-Footprint Synthesis of Egyptian Arabic. Cepstral LLC, Pittsburgh, USA.
Lernout & Hauspies (L&H) (1998). Speech Technologies Homepage.
http://www.lhs.com/speechtech/.
Li D. & Douglas N. (2003). Speech Processing a dynamic and Optimization Oriented
Approach. Marcel Dekker Inc. New York.
Lowell M. (1999). SAMS Teach Yourself Visual Basic 6 in 10 minutes. SAMS. A division
of Macmillan Computer Publishing, USA.
Maria M. (2004). A Prototype of an Arabic Diphone Speech Synthesizer in Festival.
Master Thesis in Computational Linguistics. Uppsala University
MBROLA. The MBROLA project towards a freely available multilingual speech
synthesizer. http://tcts.fpms.ac.be/synthesis/mbrola.html.
Michael E. & William N. (1999). Programming with Microsoft Visual Basic 6.0 An
Object-Oriented Approach. Course Technology, Inc. ITP.
Michel K., (1970). Arabic Phonology: Implications for Phonological Theory and
Historical Semitic. PhD thesis. Massachusetts Institute of Technology.
Mitchell T., (1993). Pronouncing Arabic. Clarendon Oxford.
Moulines E. & Charpentier F. (1990). Pitch-synchronous waveform processing
techniques for text-to-speech synthesis using diphones. Speech Communication, vol. 9,
No. 5-6, pp. 453-467.
Newell A., Barnett J., Forgie J., Green C., Klatt D., Licklieder J., Munson J., Reddy R., &
Woods W. (1973). Speech Understanding System. Final Report of a study Group. North
Holland, Amsterdam.
Nicholas A. and Putros S. (1986). The Arabic Alphabet: How to Read and Write. London.
Ntsourak's Home Page. http://www.telecom.tuc.gr/.../tutorial_acoustic.htm.
Olive, J.P.. (1996). Concatenative Syllables. Progress in Speech Synthesis. Springer, New
York. pp. 261 – 262.
Panasonic CyberTalk (1998). Homepage.
http://www.research.panasonic.com/pti/stl_web_demo/demo.html.
Peter R. (1992). Computing Linguistic and Phonetics: Introductory Readings, Academic
Press, London.
Pinker S. (1993). The Language Instinct: How the Mind Creates Language. New York:
W. Morrow and Company.
Quartieri T. & McAulay R. (1992). Shape Invariant time-scale and pitch modification of
speech. IEEE Trans. Signal Process, vol. 40, No. 3, pp. 497-510.
Raja T., (1979). The Structure of Arabic: From Sound to Sentence. Beirut.
Robert D. (1999). Computer Speech Technology. Boston, London: Artech House.
Ronald C., & Antonio Z. (1997). Survey of the state of the art in human language
technology. Giardini Editori e Stampatori in Pisa.
Sagisaka Y., Campbell N., & Higuchi N. (1997) (Editors). Computing Prosody -
Computational Models for Processing Spontaneous Speech, Berlin: Springer.
Salman H. & Jacob Y., (1980). Arabic Phonology and Script. International Book Center.
Michigan.
Sami L. (1999). Review of Speech Synthesis Technology. Master's thesis, Helsinki
University of Technology.
Santen J., Sproat R., Olive J., & Hirschberg J. (1997) (Editors). Progress in Speech
Synthesis. Springer-Verlag New York Inc.
Schroeter J. (1996). Articulatory Synthesis and Visual Speech. Progress in Speech
Synthesis. Springer, New York. pp. 179 – 184.
Schroeter M. (1993). A Brief History of Synthetic Speech. Speech Communication. vol.
13, pp. 231-237.
Shafi S., (1978). A Course in Spoken Arabic. Oxford University Press. Bombay.
Stuart R. & Peter N. (2002). Artificial Intelligence: A Modern Approach, 2nd Edition.
Pearson Education Inc. Upper Saddle River, New Jersey.
Stylianou Y. (2001). Applying the harmonic plus noise model in concatenative speech
synthesis. IEEE Transaction on Speech Audio Process, vol. 9, No. 1, pp. 21-29.
The DISC Best Practice Guide. A survey of existing methods and tools for developing and
evaluation of speech synthesis and of commercial speech synthesis systems.
http://www.disc2.dk/tools/SGsurvey.html.
Thierry D. (1996). An Introduction to Text-to-Speech Synthesis. Kluwer Academic
Publishers, Dordrecht.
Todd K. & Stephen C. (2000). Microsoft Visual Basic. South-western Educational
Publishing.
Wael H. & Mohsen R. (2000). Concatenative Arabic speech synthesis using large
database, In Proceedings of ICSLP2000, vol. 2, pp. 182-185, Beijing, China.
Weinschenk, S., Barker, D. (2000). Designing Effective Speech Interfaces, 1st Edition,
Wiley.
Witten I. (1982). Principles of Computer Speech. Academic Press Inc.
Wright W., (1974). A Grammar of Arabic Language. 3rd Edition. Beirut.
Yasser H., Shady Q., Salah H., & Mohsen R. (2000). ARABTALK® An Implementation
for Arabic Text To Speech System. www.nemlar.org/ARAB-TALK-RDI.doc.
APPENDIX A: QUESTIONNAIRE
The aim of this questionnaire is to help me test and evaluate the Arabic TTS System. I would appreciate it if you answer freely and as honestly as possible. Thank you for your help and participation.
PART ONE: ASSESSING THE QUALITY OF THE SYSTEM
• Clearness/Naturalness
• Is the voice nice to listen to?
[ ] Very natural [ ] Natural [ ] OK [ ] Unnatural [ ] Very unnatural
• Speed
• Does the system speak adequately fast?
[ ] Too fast [ ] Fast [ ] Normal [ ] Slow [ ] Too slow
• Sound Quality
• Do you consider the system has a good sound quality?
[ ] Very good [ ] Good [ ] Neither good nor bad [ ] Bad [ ] Very bad
• Pronunciation
• Was it very hard to grab/get some of the words?
[ ] Very hard [ ] Hard [ ] Neither hard nor easy [ ] Easy [ ] Very easy
• Did you have to concentrate a lot to grab/get the speech told by the voice?
[ ] A lot of concentration [ ] Some concentration at some words [ ] Normal concentration [ ] Little concentration [ ] No concentration was needed
• How did you find the pronunciation?
[ ] Not annoying [ ] Little annoying [ ] Annoying [ ] Very annoying [ ] Too much annoying
• Clearness
• How clear is the voice?
[ ] Very much [ ] Much [ ] Neither much nor little [ ] Little [ ] Very little
• Was the voice easy to grab/get?
[ ] Very hard [ ] Hard [ ] Neither hard nor easy [ ] Easy [ ] Very easy
• Stress/Intonation
• What do you think of the intonation of the voice?
[ ] Very good [ ] Good [ ] Neither good nor bad [ ] Bad [ ] Very bad
• How did you find the stress?
[ ] Not annoying [ ] Little annoying [ ] Annoying [ ] Very annoying [ ] Too much annoying
PART TWO: USAGE OF THE SYSTEM
• Occupation ________________________.
• Does the system help you in your job/work?
[ ] Yes [ ] No
• In what way does the system help you? Please specify: _______________________________________________________________
PART THREE: FINDING ERRORS
• Does the system make many pronunciation mistakes?
[ ] Too many [ ] Many [ ] Neither many nor few [ ] Few [ ] Very few
APPENDIX B: GLOSSARY
Allophone: An allophone is a phonetic variant of a phoneme in a particular language.
Alveolar: A phone produced when the tongue touches the tooth ridge behind the teeth
(alveolus). See the diagram of a head for the location of the tooth ridge. The “t sound” in
English is an alveolar stop, produced by stopping and then releasing the air flow out of
the mouth by closing the tongue onto the tooth ridge.
Band-pass filter: Filter with a single transmission band or pass-band with relatively low
attenuation extending from a lower band-edge frequency greater than zero to a finite
upper band- edge frequency.
Bilabial: A phone produced by the closure or partial closure of both lips. See the
diagram of a head. The English sounds represented by the letters p in pit and b in bad are
bilabial stops, produced by stopping and then releasing the air flow out of the mouth by
closing the lips. Bilabial and labio-dental phones are together classed as labial.
Consonant: A consonant is a sound made by a partial or complete closure of the vocal
tract.
Dialect: Generally, dialects of a language are more similar to one another than different
languages are. However, what counts as a dialect and what counts as a language is often
a political rather than a linguistic question. The division of Serbo-Croat, the common
language of former Yugoslavia, into two languages, Serbian and Croatian, shows this
rather sharply. A further example of very similar languages which might be called
dialects of the same language is Dutch (spoken in the Netherlands) and Flemish (spoken
in north-western Belgium). On the other hand, in China there are languages which are
mutually unintelligible when spoken but are often called dialects of one Chinese
language. It is important to note that although some dialects have more social prestige in
a country than others, this says nothing about their linguistic qualities.
Diphthong: A diphthong is a phonetic sequence, consisting of a vowel and a glide that is
interpreted as a single vowel.
Fricative: If during the production of a phone, air is made to pass through a narrow
passage, a “friction” sound or fricative is produced (i.e. a more-or-less “hissing” sound).
English examples are the “f sound” in fee or the “sh sound” in she.
Glottis: The glottis is the space between the vocal folds.
Grapheme: A grapheme is a “spelling unit”. For example, in Spanish the combination ll
represents a different sound from a single l. Thus these are two graphemes. In English,
graphemes may be quite complex. For example -tion behaves more-or-less as a single
grapheme in words like function.
IPA: The International Phonetic Alphabet or IPA is a set of symbols which can be used
to represent the phones and phonemes of natural languages. A subset which can be used
to represent “Standard English” (roughly the dialect of middle-class people from the
south east of England) is given in a separate table.
Intonation: Intonation is the system of levels (rising and falling) and variations in pitch
sequences within speech.
Labio-dentals: A phone produced by the partial closure of the lower lip on the upper
teeth. See the diagram of a head. The English sounds represented by the letters f in fit and
v in van are labio-dental fricatives, produced by restricting the air flow out of the mouth
by touching the lower lip on the upper teeth. Bilabial and labio-dental phones are together
classed as labial.
Nasal: A nasal is a phone made by allowing air to flow out of the nose while possibly
stopping it in the mouth. Allowing air to flow out of the nose is achieved by opening the
passage at the uvula (see the diagram of a head). English has three such phones: the nasal
stops which end the words rum, run and rung.
Onset: An onset is the part of the syllable that precedes the vowel of the syllable.
Palatal: A phone produced when the top of the tongue touches the hard palate. See the
diagram of a head for the location of the hard palate. The English sounds represented by
the letters sh in ship and s in measure are palatal fricatives, produced by partially
stopping the air flow out of the mouth by touching the top of the tongue on the hard
palate.
Phone: A phone is an unanalyzed sound of a language. It is the smallest identifiable unit
found in a stream of speech that is able to be transcribed with an IPA symbol.
Phoneme: A phoneme is the smallest contrastive unit in the sound system of a language.
Phonetics: Phonetics is the study of human speech sounds.
Pitch: Pitch is the rate of vibration of the vocal folds.
Stress: Stress is an increase in the activity of the vocal apparatus of a speaker.
Syllable: A syllable is a unit of sound composed of (1) a central peak of sonority (usually
a vowel), and (2) the consonants that cluster around this central peak.
Velar: A phone produced when the top of the tongue touches the soft palate or velum.
See the diagram of a head for the location of the soft palate. The English sounds
represented by the letters k in kit and g in got are velar stops, produced by stopping and
then releasing the airflow out of the mouth by touching the top of the tongue on the soft
palate.
Vowel: A vowel is a sound made when the impedance of the air through the vocal tract is
minimal and the vocal tract is completely open.