Syntactic Reordering for Arabic-English Phrase-Based Machine Translation
Arwa Hatem and Nazlia Omar
School of Computer Science, Faculty of Information Science and Technology
Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia
Abstract. Machine Translation (MT) refers to the use of a machine for performing a
translation task which converts text or speech in one Natural Language (Source
Language (SL)) into another Natural Language (Target Language (TL)). Translation
from Arabic to English is a difficult task because Arabic is highly inflectional, with
a rich morphology and relatively free word order. Word ordering plays an important
part in the translation process. The paper proposes a transfer-based approach to
Arabic-to-English MT to handle the word ordering problem. Preliminary tests indicate
that our system, AE-TBMT, is competitive when compared against other approaches
from the literature.
Keywords: Rule based, Arabic syntactic, word reordering.
1 Introduction
Translation is the transformation of text in one natural language into another. Machine
Translation (MT) refers to the use of a machine to perform a translation task which
converts text or speech in one natural language, the Source Language (SL), into another
natural language, the Target Language (TL). The first successful attempts at machine
translation began in the late 1940s, after the Second World War. In the early 1960s some
scientists began to lose hope in the development of MT because of its slow progress. At
the end of the 1970s the Commission of the European Communities (CEC) supported work on
the Eurotra system [2]. This project aimed at developing a multilingual interlingua
system for the most commonly used languages. From the 1970s to the 1980s such projects
got under way and proved to be practically successful. Recently, much translation
software has become available. Among the many challenges that Natural Language
Processing (NLP) presents, the biggest is the inherent ambiguity of natural language. In
addition, the linguistic diversity between the source and target languages makes MT a
bigger challenge. This is particularly true of languages that diverge widely in sentence
structure, such as Arabic and English. The major structural difference between Arabic
and English is that Arabic is highly
inflectional, with a rich morphology, relatively free word order, and default sentence
structure as Subject-Object-Verb or Subject-Verb-Object or Verb-Subject-Object or
S: [NP VP] == S: [NP: [Adj N] VP:[V Det N1 Adj N2]] English
Fig. 1 Implementation process
Fig. 1 illustrates the whole process, which includes the following stages:
Tokenization: This is an important step that lets a syntactic parser construct a phrase
structure tree from syntactic units. After the source sentence is entered into the
system, the tokenizer divides the text into tokens. A token can be a word, a part
of a word, or a punctuation mark. A tokenizer needs to recognize white space
and punctuation marks.
Morphological analysis: After tokenization, the morphological analyser provides
morphological information about the words. It provides the grammatical class of
each word (its part of speech) and creates the Arabic word in its right form
depending on the morphological features.
Lexicon: In this system the lexicon is responsible for inferring morphological
information and classifying verbs, nouns, adverbs and adjectives when needed. It is
the main translation lexicon; the source-language word is looked up in a dictionary
and its translation is then chosen.
A lexicon provides the specific details about every individual lexical entry
(i.e. word or phrase) in the vocabulary of the language concerned. The lexicon
contains grammatical information, usually in abbreviated form: 'n' for noun, 'v' for
verb, 'pron' for pronoun, 'det' for determiner, 'prep' for preposition, 'adj' for
adjective, 'adv' for adverb, and 'conj' for conjunction. The lexicon must contain
information about all the different words that can be used. If a word is ambiguous,
it is described by multiple entries in the lexicon, one for each different use.
[Fig. 1 box labels: Source Language Sentence, Morphological Analysis, SL parse tree,
Transfer, Arabic to English Transfer Rules, Structure Transfer, Lexicon transfer,
Dictionary, TL parse tree, Generation, Target Language Sentence]
Parsing: The parser divides the sentence into smaller units according to their
syntactic functions in the sentence. There are four types of phrases: Verb
Phrase (VP), Noun Phrase (NP), Adjective/Adverbial Phrase (AP), and
Prepositional Phrase (PP). After parsing, the sentence is represented as
a phrase structure tree. Figure 1 shows the phrase structure tree for the sentence
الطالب الذكي قرأ الكتاب (the clever student reads the book).
The above tree captures various grammatical relationships such as dominance and
precedence. Dominance means that some nodes dominate other nodes. In the
above example S dominates NP and VP. Precedence means that some nodes
precede other nodes. In the above example NP precedes VP. The root node is the
node that dominates all other nodes. The first NP is the root of the subtree
containing Det, Adj and N, which in turn are said to be children of the first NP
and siblings of one another.
Syntactic rules: A set of Arabic and English rules is fed into the system. In this
step the reordering takes place, based on the order of words in a sentence and on
how the words are grouped.
Agreement rules: After the syntactic rules, the agreement rules are applied; these
are responsible for adding prefixes and suffixes in the sentences.
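The stages above can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the AE-TBMT implementation: the regular-expression tokenizer, the four-entry toy lexicon, the English glosses and the single noun-adjective swap rule are all hypothetical stand-ins for the tokenizer, lexicon and transfer rules described in the text, and morphological analysis and agreement rules are left out.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Toy lexicon (hypothetical entries): surface form -> list of (category, gloss).
# An ambiguous word would have more than one entry in its list.
LEXICON = {
    "الطالب": [("n", "student")],
    "الذكي": [("adj", "clever")],
    "قرأ": [("v", "reads")],
    "الكتاب": [("n", "book")],
}


def lookup(token):
    """Return the first lexicon entry for a token, or an 'unk' entry."""
    entries = LEXICON.get(token)
    return entries[0] if entries else ("unk", token)


def reorder_noun_adjective(tagged):
    """Swap every noun immediately followed by an adjective (N Adj -> Adj N)."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][0] == "n" and out[i + 1][0] == "adj":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


def translate_glosses(sentence):
    """Tokenize, look up each token, reorder, and return the English glosses."""
    tagged = [lookup(tok) for tok in tokenize(sentence)]
    return [gloss for _, gloss in reorder_noun_adjective(tagged)]


print(translate_glosses("الطالب الذكي قرأ الكتاب"))
# ['clever', 'student', 'reads', 'book']
```

Keeping a list of entries per surface form mirrors the remark that an ambiguous word gets one lexicon entry per use; this sketch simply takes the first entry, whereas a real system would need a disambiguation step.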
In order to test the performance of our approach, experiments were carried out on
translating news titles from the Aljazeera website. We evaluate our results on 100
sentences (from different fields, i.e. political, sports and economic news) taken from
the Aljazeera news website, available at http://www.aljazeera.net/portal.
5 Results and Discussion
The aim of this experiment is to investigate whether the following machine
translation systems, namely Bing Translator, Google, Systran and AE-TBMT, are
suitable for handling word reordering in translation from Arabic to
English. We evaluate our results on the 100 sentences taken from the Aljazeera news
website, available at http://www.aljazeera.net/portal. The evaluation
involves counting the number of problems that appear in the output of each system.
The weight of every problem is 1 in all sentences; it can be interpreted as the
basic unit of penalty, since we count every problem that appears in the target
sentence as one weakness.
The percentage of the total score for each system is calculated by dividing
the total score by 700: we have 100 test examples and each is evaluated out of 7,
depending on the number of problems. The score is given by a human expert in
translation and reflects the differences between the human translation and the output
of the machine translation systems. Table 2 illustrates part of the results achieved
in this experiment. For the first example, i.e. “خطة عراقية لإنشاء محطات تكرير”, the
weaknesses of the Google system are in problems (4, 6), and so the system scores 5 out
of 7. The weakness of the Bing system was in problem (6) and the system scored 5 out
of 7. The weaknesses of the Systran system were in problems (4, 6) and the system
scored 5 out of 7. AE-TBMT scored 6 out of 7 because there is only one weakness in its
translation, in problem (4).
Tables 1, 2 and 3 show the comparison results on some different sentences when
applied to the above translation systems. Table 1 shows a comparison between the
manual translation and the results of the machine translation systems.
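The per-sentence scoring described above can be written down as a short Python sketch. It assumes that a sentence's score is 7 minus the number of distinct problems observed, which agrees with the worked cases in the text where two problems give 5 out of 7 and one problem gives 6 out of 7; the function name, the constant and the assumption that problems are numbered 1 to 7 are illustrative and not taken from the paper.

```python
MAX_SCORE = 7  # each sentence is evaluated out of 7 (assumed: problem types 1..7)


def sentence_score(problems):
    """Return (penalty, score) for one translated sentence.

    `problems` is a list of problem numbers observed in the output;
    every distinct problem costs exactly one penalty point.
    """
    distinct = set(problems)
    if not distinct <= set(range(1, MAX_SCORE + 1)):
        raise ValueError("problem numbers are assumed to lie in 1..7")
    penalty = len(distinct)
    return penalty, MAX_SCORE - penalty


print(sentence_score([4, 6]))  # (2, 5): two problems, score 5 out of 7
print(sentence_score([4]))     # (1, 6): one problem, score 6 out of 7
print(sentence_score([]))      # (0, 7): no problems, full score
```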
Table 1. Experiment results and comparison with other systems

Sentence: خطة عراقية لإنشاء محطات تكرير
  Manual translation: Iraqi plan to build refinaries
  Google: Plan for the establishment of an Iraqi refineries
  Bing: Iraqi plan for refineries
  Systran: Iraqi plan for establishment of stations repeating
  AE-TBMT: Iraqi plan to establishment refineries

Sentence: جرحى في انفجارات غزة
  Manual translation: Injuries in Gaza blasts
  Google: Wounded in the blasts Gaza
  Bing: Wounded in Gaza bombing
  Systran: Wound in explosions of Gaza
  AE-TBMT: Wounded in Gaza explosions

Sentence: مصر ترفض تدخلا تركيا بالمصالحة
  Manual translation: Egypt refuses interference of Turkeg in reconciliation
  Google: Egypt rejects interference Turkey reconciliation
  Bing: Egypt rejects Turkish intervention in reconciliation
  Systran: Egypt refuses Turkish interventions in the reconciliation
  AE-TBMT: Egypt refuses interference of Turkey in reconciliation

Sentence: انفجارا قويا هز وسط العاصمة الأفغانية
  Manual translation: Strong explosion shook the center of Afghanstan capital
  Google: A powerful explosion rocked the centre of Afghan capital
  Bing: Powerful explosion Central
  Systran: Strong explosions shaking in the middle of the capital Afghan
  AE-TBMT: a strong explosion shooke centre of Afghan capital

Sentence: تعتقل امريكا شبكة تجسس روسية
  Manual translation: America arrested Russian spy net
  Google: America spy network arrested Russia
  Bing: Arrest of Russian spy network America
  Systran: America arrests Russian net of spying
  AE-TBMT: America arrests Russia spy net

Sentence: تقترح ألمانيا موعد لمغادرة أفغانستان
  Manual translation: German proposes time to leave Afganistan
  Google: Germany is proposing a date to leave Afghanistan
  Bing: Germany proposes to leave Afghanistan
  Systran: Afghanistan proposes Germany appointment for departure
  AE-TBMT: Germany proposes a date to leave Afghanistan
Table 2 shows the problems that appeared after translation and how many penalties
each system received. For example, when Google translates the sentence
“خطة عراقية لإنشاء محطات تكرير”, problems 4 and 6 occur in the output, and as we give
1 penalty for each problem, Google has two penalties. Bing has only one penalty
because only problem 6 occurred, and Systran has two penalties, like Google. AE-TBMT
has one penalty, since only problem 4 occurred in its output.
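Table 2. Number of Problems and Penalties

Sentence: خطة عراقية لإنشاء محطات تكرير
  Google: problem no. 4,6; penalty 2
  Bing: problem no. 4,6; penalty 2
  Systran: problem no. 4,6; penalty 2
  AE-TBMT: problem no. 4; penalty 1

Sentence: جرحى في انفجارات غزة
  Google: problem no. 3,4,6; penalty 3
  Bing: problem no. 3,4,6; penalty 3
  Systran: problem no. 3,4; penalty 2
  AE-TBMT: problem no. 4; penalty 1

Sentence: مصر ترفض تدخلا تركيا بالمصالحة
  Google: problem no. 2,4,6; penalty 3
  Bing: problem no. 2,4; penalty 2
  Systran: problem no. 2,4,6,7; penalty 4
  AE-TBMT: problem no. 0; penalty 0

Sentence: انفجارا قويا هز وسط العاصمة الأفغانية
  Google: problem no. 4; penalty 1
  Bing: problem no. 1,3; penalty 2
  Systran: problem no. 4,6; penalty 2
  AE-TBMT: problem no. 6; penalty 1

Sentence: تعتقل امريكا شبكة تجسس روسية
  Google: problem no. 1,2,5,6; penalty 4
  Bing: problem no. 1,4,6; penalty 3
  Systran: problem no. 1,6; penalty 2
  AE-TBMT: problem no. 1; penalty 1

Sentence: تقترح ألمانيا موعد لمغادرة أفغانستان
  Google: problem no. 1,4,6; penalty 3
  Bing: problem no. 1,6; penalty 2
  Systran: problem no. 1,3,4,5,6; penalty 5
  AE-TBMT: problem no. 1,4; penalty 2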
Table 3. Total Number of Penalties
Translators Penalties
Google 122
Bing 161
Systran 149
AE-TBMT 113
It can be seen that, in general, the result for AE-TBMT is better than those of the
other approaches: as reported in Table 3, the system has only 113 problems after
translating 100 sentences. This suggests that, on this test set, AE-TBMT generates the
best translations from Arabic to English among the four systems compared.
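As a rough illustration of how the penalty totals relate to the percentage score described in the evaluation, the following Python sketch converts the Table 3 totals into percentages. It assumes that a system's total score equals 700 (100 sentences, each out of 7) minus its total penalties; that relation is inferred from the scoring description rather than stated in the paper, and the variable and function names are hypothetical.

```python
# Total penalties per system, copied from Table 3.
PENALTIES = {"Google": 122, "Bing": 161, "Systran": 149, "AE-TBMT": 113}

MAX_TOTAL = 100 * 7  # 100 test sentences, each evaluated out of 7


def percentage(total_penalty):
    """Percentage score, assuming total score = MAX_TOTAL - total penalties."""
    return 100.0 * (MAX_TOTAL - total_penalty) / MAX_TOTAL


# List the systems from fewest to most penalties.
for system, penalty in sorted(PENALTIES.items(), key=lambda item: item[1]):
    print(f"{system}: {MAX_TOTAL - penalty}/{MAX_TOTAL} = {percentage(penalty):.1f}%")
```

Under this assumption the ranking by percentage is simply the reverse of the ranking by penalties, so the sketch adds no new result; it only makes the arithmetic behind the comparison explicit.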
We believe that a good translation can be achieved with a systematic lexicon that
provides more correct translations. With more complicated sentence structures, the
features of the individual words are essential for accurate translation. In
addition, good reordering rules play an important role in the quality of translation.
6 Conclusion and Future Work
In this paper, we described a set of syntactic reordering rules that exploit systematic
differences between Arabic and English word order to transform Arabic sentences so
that they match English word order. The approach was tested on 100 titles from the
Aljazeera news website. Preliminary comparisons indicate that our approach is
competitive with other approaches in the literature. Our manual evaluation of the
reordering accuracy indicated that our approach helps to improve translation quality
despite relatively frequent reordering errors. Future research will be aimed at
testing our approach on different domains.
References
1. Abu Shquier, M., Sembok, T.: Handling agreement in machine translation from English to Arabic. In: 1st International Conference on Digital Communications and Computer Applications (DCCA 2007), pp. 385–379 (2007)
2. Attia, M.: Implications of the Agreement Features in Machine Translation. Master thesis, Al-Azhar University (2002)
3. Shirko, O., Omar, N., Arshad, H., Albared, M.: Machine Translation of Noun Phrases from Arabic to English Using Transfer-Based Approach. Journal of Computer Science 6(3), 350–356 (2010)
4. Yngve, V.H.: Early Research at M.I.T.: In Search of Adequate Theory. In: Hutchins, W.J. (ed.) Early Years in Machine Translation: Memoirs and Biographies of Pioneers, pp. 39–72 (2000)
5. Mankai, C., Mili, A.: Machine Translation from Arabic to English and French. Information Sciences, vol. 3, pp. 91–109. Elsevier Science Inc., New York (1995)
6. Salem, Y., Hensman, A., Nolan, B.: Implementing Arabic-to-English Machine Translation using the Role and Reference Grammar Linguistic Model. In: Proceedings of the Eighth Annual International Conference on Information Technology and Telecommunication (ITT 2008), Galway, Ireland (Runner-up for Best Paper Award) (2008)
7. Lavie, A., Probst, K., Peterson, E., Vogel, S., Levin, L., Font-Llitjos, A., Carbonell, J.: A Trainable Transfer-based Machine Translation Approach for Languages with Limited Resources. In: Proceedings of Workshop of the European Association for Machine Translation, Valletta, Malta, EAMT 2004 (2004)
8. Habash, N., Sadat, F.: Arabic preprocessing schemes for statistical machine translation. In: Proceedings of HLT-NAACL 2006, New York, NY, USA (2006)
9. Lee, Y.: Morphological Analysis for Statistical Machine Translation. In: Proceedings of HLT-NAACL 2004, Boston, MA, USA (2004)
10. Badr, I., Zbib, R., Glass, J.: Segmentation for English-to-Arabic statistical machine translation. In: Proceedings of ACL 2008, HLT: Short Papers, Columbus, OH, USA (2008)
11. Chanod, J.-P., Tapanainen, P.: A Non-Deterministic Tokenizer for Finite-State Parsing. In: The European Conference on Artificial Intelligence, Workshop on Extended Finite State Models of Language (ECAI 1996), Budapest, Hungary, pp. 10–12 (1996)
12. Elming, J., Habash, N.: Syntactic Reordering for English-Arabic Phrase-Based Machine Translation. In: Proceedings of the EACL, Workshop on Computational Approaches to Semitic Languages, Athens, Greece, pp. 69–77 (2009)