Syntactic Reordering for Arabic-English Phrase-Based Machine Translation

Arwa Hatem and Nazlia Omar

School of Computer Science, Faculty of Information Science and Technology

Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia

Abstract. Machine Translation (MT) refers to the use of a machine to perform a translation task that converts text or speech in one natural language (the Source Language, SL) into another natural language (the Target Language, TL). Translation from Arabic to English is difficult because Arabic is highly inflectional, with a rich morphology and relatively free word order. Word ordering plays an important part in the translation process. This paper proposes a transfer-based approach to Arabic-to-English MT that handles the word ordering problem. Preliminary tests indicate that our system, AE-TBMT, is competitive when compared against other approaches from the literature.

Keywords: Rule based, Arabic syntactic, word reordering.

1 Introduction

Translation is the transformation of text in one natural language into another. Machine Translation (MT) refers to the use of a machine to perform a translation task that converts text or speech in one natural language (the Source Language, SL) into another natural language (the Target Language, TL). The first successful attempts at machine translation began in the 1940s, after the Second World War. In the early 1960s some scientists began to lose hope in the development of MT because of its slow progress. At the end of the 1970s the Commission of the European Communities (CEC) supported work on the Eurotra system [2], a project that aimed to develop a multilingual interlingua system for the most commonly used languages. From the 1970s to the 1980s such projects got under way and proved to be practically successful. Recently, much translation software has become available. Among the many challenges that Natural Language Processing (NLP) presents, the biggest is the inherent ambiguity of natural language. In addition, the linguistic diversity between the source and target languages makes MT a bigger challenge. This is particularly true of languages that diverge widely in sentence structure, such as Arabic and English. The major structural difference between the two is that Arabic is highly inflectional, with a rich morphology, relatively free word order, and a default sentence structure of Subject-Object-Verb, Subject-Verb-Object, Verb-Subject-Object or Verb-Object-Subject, whereas English follows a Subject-Verb-Object structure. As is recognized the world over, with the current state of the art in MT it is not possible to have fully automatic, high-quality, general-purpose machine translation. The major need is to handle ambiguity and the other complexities of NLP in practical systems.

Y. Zhang et al. (Eds.): DTA/BSBT 2010, CCIS 118, pp. 198–206, 2010. © Springer-Verlag Berlin Heidelberg 2010

{arwa,no}@ftsm.ukm.my

2 Related Work

Machine translation, including translation from and into Arabic, has been attracting attention from researchers, and many approaches have been applied to enhance the quality of machine translations. Abu Shquier and Sembok [1] assert that Arabic differs greatly from other languages in its characters and morphology. The authors developed an English-to-Arabic MT system using a rule-based approach, with emphasis on handling word agreement and ordering. In addition, Attia [2] describes agreement as one of the features that greatly affect the output of MT. The author used the transfer approach, covering the analysis of English as a source language, problems related to the transfer of English into Arabic, and the generation of Arabic as a target language, focusing on the implications of agreement features in machine translation.

On the other hand, Shirko et al. [3] developed a machine translation system called Npae-Rbmt that translates Arabic noun phrases into English using a transfer-based approach. Yngve [4] reported that Arabic was one of the languages, besides English, German and French, that were subjects of the COMIT (operational grammar encoding project) in the late 1950s. Mankai and Mili [5] presented an attempt to carry out MT from Arabic to English and from Arabic to French. They showed that Arabic must be analysed and reordered according to Arabic rules to get good results. Salem et al. [6] introduced an Arabic-to-English MT system called UniArab, which is based on the Role and Reference Grammar model and supports a rule-based lexical framework. Our work differs from these efforts in that we use syntactic analysis to overcome the reordering challenges.

This work presents a new system that translates news titles from the Aljazeera¹ website using a transfer-based method. The motivation for this study is to develop an automated translator capable of translating Arabic phrases into English. We evaluate our system (AE-TBMT) by applying it to 100 titles from the Aljazeera news website and comparing its translations with those of Google's and Microsoft's translators. We chose the corpus from the Aljazeera website because Aljazeera has become the most popular news site in the Arab world.

¹ http://aljazeera.net/portal

3 Arabic Syntactic Matters

Arabic is a morphologically and syntactically complex language that differs from English in many ways. Arabic morphology has been well studied in the context of MT. Previous results all indicate that tokenization is helpful when translating from Arabic [8, 9]. When translating from a morphologically rich language, tokenization means that the translation process is broken into multiple steps [10]. Arabic can be segmented by simple punctuation tokenization, but this level of tokenization is not enough for syntactic analysis [12]. Chanod and Tapanainen [11] describe tokenization as a significant issue in natural language processing because it is closely related to morphological analysis. We use tokenization to obtain a more finely segmented input, and we expect that finer Arabic segmentation will yield better performance. In this work we focus on syntactic reordering for Arabic-English phrases. Below we describe seven major problems in translating from Arabic to English, which motivate a number of our decisions in this paper.

1. Word order: One of the main differences between Arabic and English is sentence word order. Arabic has four possible sentence structures, VSO, OVS, SVO and VOS, while English has only SVO order. A reordering rule should move the verb of an Arabic sentence to the right of the subject. For example, اعلن الأمين العام للأمم المتحدة بان كي مون [Announced The Secretary General UN Ban Ki-moon] translates to “The UN Secretary General Ban Ki-moon announced”.

2. Arabic adjectives usually follow their nouns, except for superlative adjectives, whereas English adjectives precede their nouns. For example, رجل غني [man rich] translates to “a rich man”. A reordering rule should move the noun to the right of the adjective; the superlative case translates without reordering, for example اغنى رجل “richest man”.

3. Idafa: Idafa is a syntactic construction in Arabic that indicates possession and compounding; English has three corresponding syntactic constructions. Idafa typically consists of one or more indefinite nouns followed by a definite noun [12]. For example, مفاتيح البيت ‘keys the house’ translates as “the house keys”, “the house’s keys” or “the keys of the house”. Two of the three English constructions require content word reordering.

4. Multiple meanings: This problem occurs because the same source expression can be rendered by target words with different meanings. For example, المانيا تطلب can translate to “Germany requests” or “Germany calls”.

5. Grammatical relations: In some cases, when translating phrases from Arabic to English, it becomes hard to determine which noun phrase is the subject and which is the object. For example, يزور الرئيس السوداني الرئيس الجزائري, “Sudanese President visit Algerian President”.

6. Addition and deletion: This problem arises from the limitations of the lexicon: the translation contains extra words that have no equivalent in the source-language input.

7. Determiner agreement: This problem appears in the target language when a noun phrase that should be preceded by “a(n)” is translated as if it were preceded by “the”.
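As an illustration of problems 1 and 2 only, the two reordering rules can be sketched in Python as follows. This is a minimal sketch under our own assumptions, not the AE-TBMT implementation: the input is assumed to be already glossed and tagged, and the tag names ("N", "ADJ", "ADJ_SUP") and role labels ("S", "V", "O") are hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]        # (English gloss, tag); tag names are hypothetical
Chunk = Tuple[str, List[str]]  # (role, glosses); role is "S", "V" or "O"

# Target order for English clauses.
SVO_RANK: Dict[str, int] = {"S": 0, "V": 1, "O": 2}


def to_svo(chunks: List[Chunk]) -> List[Chunk]:
    """Problem 1: put subject, verb and object chunks into SVO order."""
    return sorted(chunks, key=lambda chunk: SVO_RANK[chunk[0]])


def reorder_noun_adjective(tokens: List[Token]) -> List[Token]:
    """Problem 2: swap each N + ADJ pair into ADJ + N.

    Superlatives are assumed to carry the tag ADJ_SUP and to precede
    their noun already, so the rule never fires on them.
    """
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        is_pair = (
            i + 1 < len(tokens)
            and tokens[i][1] == "N"
            and tokens[i + 1][1] == "ADJ"
        )
        if is_pair:
            out.extend([tokens[i + 1], tokens[i]])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


clause = [("V", ["announced"]), ("S", ["Ban", "Ki-moon"])]
print(to_svo(clause))
print(reorder_noun_adjective([("man", "N"), ("rich", "ADJ")]))
```

The sketch leaves the other five problems untouched; in particular it does not choose among the three English renderings of Idafa.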

4 The Approach

In a rule-based machine translation system, the source sentence is first analysed morphologically and then parsed to obtain a syntactic representation. This representation is used to find a suitable rule for the source sentence and the equivalent pattern for the target sentence, in order to generate an acceptable sentence. The following steps summarize the translation process:

STEP 1: Input the source text (an Arabic sentence) and start the tokenization process. The task of the tokenizer is to divide the text into tokens.

STEP 2: Start morphological analysis. In this step the morphological analyzer provides morpho-syntactic information.

STEP 3: The syntactic parser builds a syntactic tree, which represents the relationships between the words of the phrase.

STEP 4: Lexical transfer maps Arabic lexical elements to their English equivalents. It also maps Arabic morphological features to the corresponding set of English features.

STEP 5: Structure transfer maps the Arabic dependency tree to the equivalent English syntactic structure.

STEP 6: The synthesiser synthesises the inflected English word-forms based on the morphological features and traverses the syntactic tree to produce the surface English phrase.

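انفجارا قويا هز وسط العاصمة الأفغانية

Strong explosion shook the center of Afghanistan capital

Arabic: S: [NP VP] == S: [NP: [N Adj] VP: [V N1 Det N2 Adj]]

English: S: [NP VP] == S: [NP: [Adj N] VP: [V Det N1 Adj N2]]

Fig. 1. Implementation process (components: Source Language Sentence, Morphological Analysis, SL parse tree, Transfer [Arabic to English Transfer Rules: Structure Transfer, Lexicon Transfer, Dictionary], TL parse tree, Generation, Target Language Sentence)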
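The pair of structure rules given for the example sentence can be read as a permutation of named slots. The following Python sketch shows that reading; it is an illustration under our own assumptions rather than the system's transfer module. The "NP." and "VP." prefixes are our naming, added only to tell the two Adj slots apart, and the input is assumed to be a list of English glosses in Arabic word order.

```python
from typing import Dict, List

# Slot sequences flattened from the Arabic and English rules for the example.
# The "NP."/"VP." prefixes are hypothetical names used to separate the two Adj slots.
ARABIC_SLOTS: List[str] = [
    "NP.N", "NP.Adj", "VP.V", "VP.N1", "VP.Det", "VP.N2", "VP.Adj",
]
ENGLISH_SLOTS: List[str] = [
    "NP.Adj", "NP.N", "VP.V", "VP.Det", "VP.N1", "VP.Adj", "VP.N2",
]


def structure_transfer(glosses: List[str]) -> List[str]:
    """Bind glosses to the Arabic slots, then read them out in English slot order."""
    if len(glosses) != len(ARABIC_SLOTS):
        raise ValueError("expected one gloss per slot of the Arabic rule")
    bound: Dict[str, str] = dict(zip(ARABIC_SLOTS, glosses))
    return [bound[slot] for slot in ENGLISH_SLOTS]


# Glosses in Arabic word order: N Adj V N1 Det N2 Adj
arabic_order = ["explosion", "strong", "shook", "center", "the", "capital", "Afghan"]
print(" ".join(structure_transfer(arabic_order)))
```

On this input the sketch prints "strong explosion shook the center Afghan capital". The permutation only reorders content already present; function words such as "of" in the reference translation would have to come from a later generation step.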

Fig. 1 illustrates the whole process, which includes the following stages:

Tokenization: This is an important step that lets the syntactic parser construct a phrase structure tree from syntactic units. After the source sentence is entered into the system, the tokenizer divides the text into tokens. A token can be a word, a part of a word, or a punctuation mark. The tokenizer needs to recognize white space and punctuation marks.
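A tokenizer of the simplest kind described here, one that relies only on white space and punctuation, can be sketched in Python as follows. This is an assumption-laden illustration, not the tokenizer used in AE-TBMT: it does not split a word into parts (for example clitics), and because it relies on the regular-expression class `\w`, diacritized Arabic text would need extra handling.

```python
import re
from typing import List

# A token is either a run of word characters or a single punctuation mark.
# White space separates tokens and is discarded.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Egypt refuses, again."))
print(tokenize("مصر ترفض."))
```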

Morphological analysis: After tokenization, the morphological analyser provides morphological information about the words. It provides the grammatical class of each word (its part of speech) and produces the Arabic word in its right form depending on the morphological features.

Lexicon: In this system the lexicon is responsible for inferring morphological information and for classifying verbs, nouns, adverbs and adjectives when needed. It is also the main translation lexicon: each source-language word is looked up in a dictionary and its translation is chosen. A lexicon provides the specific details about every individual lexical entry (i.e. word or phrase) in the vocabulary of the language concerned. It contains grammatical information, usually in abbreviated form: ‘n’ for noun, ‘v’ for verb, ‘pron’ for pronoun, ‘det’ for determiner, ‘prep’ for preposition, ‘adj’ for adjective, ‘adv’ for adverb, and ‘conj’ for conjunction. The lexicon must contain information about all the different words that can be used. If a word is ambiguous, it is described by multiple entries in the lexicon, one for each different use.
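One possible in-memory layout for such a lexicon, with one entry per use of an ambiguous word, is sketched below in Python. The class and function names are hypothetical and the two sample words are taken from the examples in Section 3; the actual AE-TBMT dictionary format is not described in this paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LexEntry:
    source: str  # Arabic word form
    pos: str     # abbreviated class: 'n', 'v', 'adj', ...
    target: str  # English equivalent


# Hypothetical lexicon: each source word maps to a list of entries,
# one per distinct use, so ambiguity is explicit.
LEXICON: Dict[str, List[LexEntry]] = {}


def add_entry(source: str, pos: str, target: str) -> None:
    LEXICON.setdefault(source, []).append(LexEntry(source, pos, target))


def lookup(source: str, pos: Optional[str] = None) -> List[LexEntry]:
    """Return all entries for a word, optionally filtered by part of speech."""
    entries = LEXICON.get(source, [])
    if pos is None:
        return list(entries)
    return [entry for entry in entries if entry.pos == pos]


add_entry("تطلب", "v", "requests")
add_entry("تطلب", "v", "calls")
add_entry("رجل", "n", "man")

print([entry.target for entry in lookup("تطلب")])
```

Returning every candidate leaves the choice among them (problem 4, multiple meanings) to a later step.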

Parsing: The parser divides the sentence into smaller units depending on their syntactic functions in the sentence. There are four types of phrases: Verb Phrase (VP), Noun Phrase (NP), Adjective/Adverbial Phrase (AP), and Prepositional Phrase (PP). After parsing, the sentence is represented as a phrase structure tree. Figure 1 shows the phrase structure tree for the sentence الطالب الذكي قرأ الكتاب (the clever student reads the book). This tree captures grammatical relationships such as dominance and precedence. Dominance means that some nodes dominate other nodes; in this example S dominates NP and VP. Precedence means that some nodes precede other nodes; in this example NP precedes VP. The root node is the node that dominates all other nodes. The first NP is the root of the subtree containing Det, Adj and N, which in turn are said to be children of the first NP and siblings of one another.
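The dominance and precedence relations can be made concrete with a small tree structure. The Python sketch below is illustrative only: the `Node` class and both functions are our own names, and the internal shape of the VP (a verb followed by an NP made of Det and N) is an assumption, since only the first NP is spelled out in the text.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass(eq=False)  # compare nodes by identity, not by label
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

    def descendants(self) -> Iterator["Node"]:
        """Yield every node below this one, in left-to-right preorder."""
        for child in self.children:
            yield child
            yield from child.descendants()


def dominates(a: Node, b: Node) -> bool:
    """a dominates b if b lies somewhere below a."""
    return any(node is b for node in a.descendants())


def precedes(root: Node, a: Node, b: Node) -> bool:
    """a precedes b if neither dominates the other and a lies to the left of b."""
    if a is b or dominates(a, b) or dominates(b, a):
        return False
    order = [root] + list(root.descendants())
    position = {id(node): index for index, node in enumerate(order)}
    return position[id(a)] < position[id(b)]


# Tree for "the clever student reads the book" (VP shape assumed).
det, adj, noun = Node("Det"), Node("Adj"), Node("N")
np1 = Node("NP", [det, adj, noun])
vp = Node("VP", [Node("V"), Node("NP", [Node("Det"), Node("N")])])
s = Node("S", [np1, vp])

print(dominates(s, np1), precedes(s, np1, vp))
```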

Syntactic rules: A set of Arabic and English rules is fed into the system. In this step the reordering is determined, based on the order of words in the sentence and on how the words are grouped.

Agreement rules: After the syntactic rules, the agreement rules are applied; these are responsible for adding prefixes and suffixes to the words of the sentence.

To test the performance of our approach, experiments were carried out on translating news titles from the Aljazeera website. We evaluate our results on 100 sentences (from different fields, i.e. political, sports and economic news) taken from the Aljazeera news website, available at http://www.aljazeera.net/portal.


5 Results and Discussion

The aim of this experiment is to investigate whether the following machine translation systems, namely Bing Translator, Google, Systran and AE-TBMT, are suitable for handling word reordering in translation from Arabic to English. We evaluate our results on the 100 sentences taken from the Aljazeera news website, available at http://www.aljazeera.net/portal. The evaluation involves counting the number of problems that appear in the output of each system. Every problem has a weight of 1 in all sentences; this is the basic unit of penalty, as we count every problem that appears in the target sentence as one weakness.

The percentage score for each system is calculated by dividing its total score by 700: we have 100 test examples, and each is scored out of 7 depending on the number of problems. The score is given by a human expert in translation, and it measures the differences between the human translation and the output of the machine translation systems. Table 2 illustrates part of the results of this experiment. For the first example, i.e. “خطة عراقية لإنشاء محطات تكرير”, the weaknesses of the Google system are problems 4 and 6, so the system scores 5 out of 7. The weakness of the Bing system was problem 6, and the system scored 5 out of 7. The weaknesses of the Systran system were problems 4 and 6, and the system scored 5 out of 7. AE-TBMT scored 6 out of 7 because its translation has only one weakness, problem 4.

Tables 1, 2 and 3 show the comparison results for a sample of sentences translated by the above systems. Table 1 compares the manual translation with the output of the machine translation systems.

Table 1. Experiment results and comparison with other systems

| Sentence | Manual Translation | Google | Bing | Systran | AE-TBMT |
|---|---|---|---|---|---|
| خطة عراقية لإنشاء محطات تكرير | Iraqi plan to build refineries | Plan for the establishment of an Iraqi refineries | Iraqi plan for refineries | Iraqi plan for establishment of stations repeating | Iraqi plan to establishment refineries |
| جرحى في انفجارات غزة | Injuries in Gaza blasts | Wounded in the blasts Gaza | Wounded in Gaza bombing | Wound in explosions of Gaza | Wounded in Gaza explosions |
| مصر ترفض تدخلا تركيا بالمصالحة | Egypt refuses interference of Turkey in reconciliation | Egypt rejects interference Turkey reconciliation | Egypt rejects Turkish intervention in reconciliation | Egypt refuses Turkish interventions in the reconciliation | Egypt refuses interference of Turkey in reconciliation |
| انفجارا قويا هز وسط العاصمة الأفغانية | Strong explosion shook the center of Afghanistan capital | A powerful explosion rocked the centre of Afghan capital | Powerful explosion Central | Strong explosions shaking in the middle of the capital Afghan | a strong explosion shooke centre of Afghan capital |
| تعتقل امريكا شبكة تجسس روسية | America arrested Russian spy net | America spy network arrested Russia | Arrest of Russian spy network America | America arrests Russian net of spying | America arrests Russia spy net |
| تقترح ألمانيا موعد لمغادرة أفغانستان | German proposes time to leave Afghanistan | Germany is proposing a date to leave Afghanistan | Germany proposes to leave Afghanistan | Afghanistan proposes Germany appointment for departure | Germany proposes a date to leave Afghanistan |

Table 2 shows the problems that appeared after translation and how many penalties each system received. For example, when Google translates the sentence “خطة عراقية لإنشاء محطات تكرير”, problems 4 and 6 occur in the output, and since we give one penalty for each problem, Google has two penalties. Bing has only one penalty because only problem 6 occurred, and Systran has two penalties, like Google. AE-TBMT has one penalty because problem 4 occurred in its output.

Table 2. Number of Problems and Penalties

| Sentence | Google: problem no. | Google: penalty | Bing: problem no. | Bing: penalty | Systran: problem no. | Systran: penalty | AE-TBMT: problem no. | AE-TBMT: penalty |
|---|---|---|---|---|---|---|---|---|
| خطة عراقية لإنشاء محطات تكرير | 4,6 | 2 | 4,6 | 2 | 4,6 | 2 | 4 | 1 |
| جرحى في انفجارات غزة | 3,4,6 | 3 | 3,4,6 | 3 | 3,4 | 2 | 4 | 1 |
| مصر ترفض تدخلا تركيا بالمصالحة | 2,4,6 | 3 | 2,4 | 2 | 2,4,6,7 | 4 | none | 0 |
| انفجارا قويا هز وسط العاصمة الأفغانية | 4 | 1 | 1,3 | 2 | 4,6 | 2 | 6 | 1 |
| تعتقل امريكا شبكة تجسس روسية | 1,2,5,6 | 4 | 1,4,6 | 3 | 1,6 | 2 | 1 | 1 |
| تقترح ألمانيا موعد لمغادرة أفغانستان | 1,4,6 | 3 | 1,6 | 2 | 1,3,4,5,6 | 5 | 1,4 | 2 |

Table 3. Total Number of Penalties

| Translator | Penalties |
|---|---|
| Google | 122 |
| Bing | 161 |
| Systran | 149 |
| AE-TBMT | 113 |
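The percentage scores implied by Table 3 can be worked out with a few lines of Python. The sketch assumes, as in the worked example for the first sentence, that each sentence starts at 7 and loses one point per problem, so a system's total score is 700 minus its total penalties. The percentages themselves are derived here and are not reported in the paper.

```python
from typing import Dict

# Total penalties per system, copied from Table 3.
PENALTIES: Dict[str, int] = {
    "Google": 122,
    "Bing": 161,
    "Systran": 149,
    "AE-TBMT": 113,
}


def score_percent(penalties: int, sentences: int = 100, max_per_sentence: int = 7) -> float:
    """Percentage score: (maximum score - penalties) / maximum score * 100."""
    max_score = sentences * max_per_sentence  # 700 for this experiment
    return 100.0 * (max_score - penalties) / max_score


for system, penalties in PENALTIES.items():
    print(f"{system}: {score_percent(penalties):.2f}%")
```

Under this assumption AE-TBMT scores 587 out of 700 (about 83.9%) and Google 578 out of 700 (about 82.6%).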

In general, AE-TBMT performs better than the other systems: as reported in Table 3, it has only 113 problems after translating the 100 sentences, the lowest total of the four systems. This suggests that AE-TBMT produces the best Arabic-to-English translations on this test set.

We believe that a good translation can be achieved with a systematic lexicon that provides more correct translations. With more complicated sentence structures, the features of the individual words are essential for accurate translation. In addition, good reordering rules play an important role in translation quality.


6 Conclusion and Future Work

In this paper, we described a set of syntactic reordering rules that exploit systematic differences between Arabic and English word order to transform Arabic sentences so that their word order matches that of English. The approach was tested on 100 titles from the Aljazeera news website. Preliminary comparisons indicate that our approach is competitive with other approaches in the literature. Our manual evaluation of reordering accuracy indicated that the approach helps improve translation quality despite relatively frequent reordering errors. Future research will be aimed at testing our approach on different domains.

References

1. Abu Shquier, M., Sembok, T.: Handling agreement in machine translation from English to Arabic. In: 1st International Conference on Digital Communications and Computer Applications (DCCA 2007), pp. 385–379 (2007)

2. Attia, M.: Implications of the Agreement Features in Machine Translation. Master thesis, Al-Azhar University (2002)

3. Shirko, O., Omar, N., Arshad, H., Albared, M.: Machine Translation of Noun Phrases from Arabic to English Using Transfer-Based Approach. Journal of Computer Science 6(3), 350–356 (2010)

4. Yngve, V.H.: Early Research at M.I.T.: In Search of Adequate Theory. In: Hutchins, W.J. (ed.) Early Years in Machine Translation: Memoirs and Biographies of Pioneers, pp. 39–72 (2000)

5. Mankai, C., Mili, A.: Machine Translation from Arabic to English and French. Information Sciences, vol. 3, pp. 91–109. Elsevier Science Inc., New York (1995)

6. Salem, Y., Hensman, A., Nolan, B.: Implementing Arabic-to-English Machine Translation using the Role and Reference Grammar Linguistic Model. In: Proceedings of the Eighth Annual International Conference on Information Technology and Telecommunication (ITT 2008), Galway, Ireland (Runner-up for Best Paper Award) (2008)

7. Lavie, A., Probst, K., Peterson, E., Vogel, S., Levin, L., Font-Llitjos, A., Carbonell, J.: A Trainable Transfer-based Machine Translation Approach for Languages with Limited Resources. In: Proceedings of Workshop of the European Association for Machine Translation, Valletta, Malta, EAMT 2004 (2004)

8. Habash, N., Sadat, F.: Arabic preprocessing schemes for statistical machine translation. In: Proceedings of HLT-NAACL 2006, New York, NY, USA (2006)

9. Lee, Y.: Morphological Analysis for Statistical Machine Translation. In: Proceedings of HLT-NAACL 2004, Boston, MA, USA (2004)

10. Badr, I., Zbib, R., Glass, J.: Segmentation for English-to-Arabic statistical machine translation. In: Proceedings of ACL 2008, HLT: Short Papers, Columbus, OH, USA (2008)

11. Chanod, J.-P., Tapanainen, P.: A Non-Deterministic Tokenizer for Finite-State Parsing. In: The European Conference on Artificial Intelligence, Workshop on Extended Finite State Models of Language (ECAI 1996), Budapest, Hungary, pp. 10–12 (1996)

12. Elming, J., Habash, N.: Syntactic Reordering for English-Arabic Phrase-Based Machine Translation. In: Proceedings of the EACL, Workshop on Computational Approaches to Semitic Languages, Athens, Greece, pp. 69–77 (2009)

