Automatic Extraction of Arabic Multiword Expressions *Mohammed Attia, Antonio Toral, Lamia Tounsi, Pavel Pecina and Josef van Genabith School of Computing, Dublin City University, Ireland
Page 1: Arabic mwe presentation 07

Automatic Extraction of Arabic Multiword Expressions

*Mohammed Attia, Antonio Toral, Lamia Tounsi, Pavel Pecina and Josef van Genabith

School of Computing, Dublin City University, Ireland

Page 2: Arabic mwe presentation 07

Outline

● Introduction

● Data Resources

● Methodology
  ● Crosslingual Correspondence Asymmetries
  ● Translation-Based Approach
  ● Corpus-Based Approach

● Discussion of experiments and results

● Conclusion

Page 3: Arabic mwe presentation 07

Introduction

● Criteria of MWEs
  ● Ubiquity
  ● Diversity
  ● Low polysemy
  ● Statistically significant co-occurrence

● Focus
  ● Arabic
  ● Nominal MWEs

● Purpose is building an MWE lexicon for Arabic

Page 4: Arabic mwe presentation 07

Data Resources

✔ Multilingual, bilingual and monolingual settings

✔ Availability of rich resources that have not been exploited in similar tasks before.

● Arabic Wikipedia (March 2010)
  ● 117,491 titles, of which 89,623 are multiword titles
  ● Arabic is ranked 27th by size (article count) and 17th by usage
  ● Information helpful for linguistic processing

Page 5: Arabic mwe presentation 07

Data Resources

● Princeton WordNet 3.0
  ● An electronic lexical database for English
  ● Arabic WordNet contains only 11,269 synsets (including 2,348 MWEs)

Page 6: Arabic mwe presentation 07

Data Resources

● Arabic Gigaword
  ● Unannotated corpus distributed by the Linguistic Data Consortium (LDC).
  ● Articles from news agencies and newspapers from different Arab regions, such as Al-Ahram in Egypt, An Nahar in Lebanon and Assabah in Tunisia.
  ● Largest publicly available corpus of Arabic to date.
  ● Contains 848 million words.

Page 7: Arabic mwe presentation 07

Methodology

3 different techniques for 3 different data sources

Motivation for using different techniques
  ● The extraction of MWEs is a problem more complex than can be dealt with by one simple solution.
  ● The choice of technique depends on the nature of the task and the type of the resources used.

Page 8: Arabic mwe presentation 07

Pipeline

Page 9: Arabic mwe presentation 07

Technique 1: Crosslingual Asymmetries

● Data: Titles of Wikipedia Articles in Arabic and corresponding titles in 21 languages.

● Definition: We rely on many-to-one correspondence relations

● The non-compositionality of MWEs makes it unlikely to have a mirrored representation in the other languages.

● Compositionality varies:
  ● highly compositional, "قاعدة عسكرية", "military base",
  ● with a degree of idiomaticity, such as "مدينة الملاهي", "amusement park", lit. "city of amusements",
  ● extremely opaque, "فرس النبي", "grasshopper", lit. "the horse of the Prophet".

Page 10: Arabic mwe presentation 07

Technique 1: Crosslingual Asymmetries

● Steps

(1) Candidate Selection. All Arabic Wikipedia multiword titles are taken as candidates.

(2) Filtering. We exclude titles of disambiguation and administrative pages.

(3) Validation. We check if there is a single-word translation in any of 21 selected languages.
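
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `excluded_markers` parameter, and the shape of the input (a dictionary from each Arabic title to its corresponding titles keyed by language code) are assumptions, and the toy data simply reuses two examples from the previous slide with their English glosses.

```python
from typing import Dict, Iterable, List


def is_multiword(title: str) -> bool:
    """A title counts as multiword if it has more than one whitespace-separated token."""
    return len(title.split()) > 1


def validate_candidates(
    links: Dict[str, Dict[str, str]],
    excluded_markers: Iterable[str] = (),
) -> List[str]:
    """Keep multiword Arabic titles that map to a single word in at least one language.

    `links` (hypothetical input format) maps an Arabic title to
    {language code: corresponding title}. `excluded_markers` is a
    hypothetical list of substrings that flag disambiguation or
    administrative pages.
    """
    markers = list(excluded_markers)
    accepted = []
    for arabic_title, translations in links.items():
        # (1) Candidate selection: only multiword titles.
        if not is_multiword(arabic_title):
            continue
        # (2) Filtering: skip disambiguation and administrative pages.
        if any(marker in arabic_title for marker in markers):
            continue
        # (3) Validation: many-to-one correspondence in any language.
        if any(t.strip() and not is_multiword(t) for t in translations.values()):
            accepted.append(arabic_title)
    return accepted


# Toy data built from the examples on the previous slide.
toy_links = {
    "فرس النبي": {"en": "grasshopper"},
    "قاعدة عسكرية": {"en": "military base"},
}
accepted_titles = validate_candidates(toy_links)
```

Under these assumptions only "فرس النبي" is accepted, because its English counterpart is a single word. Whitespace tokenisation is a simplification; hyphenated or compounded titles in some languages would need separate handling.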

Page 11: Arabic mwe presentation 07

Technique 1: Crosslingual Asymmetries

● Evaluation:
  ● 1100 multiword titles are randomly selected from Arabic Wikipedia and manually tagged as: MWEs, non-MWEs, or NEs.

● Baseline: all multi-word titles are considered as MWEs

● Results

Page 12: Arabic mwe presentation 07

Example

Page 13: Arabic mwe presentation 07

Language Ranking

How likely will each language give many-to-one correspondence?

Page 14: Arabic mwe presentation 07

Technique 2: Translation-Based

● Data: Princeton WordNet
● Assumption: MWEs in one language are likely to be translated as MWEs in another language.
● Ontological advantage

● Steps
  ● Extracting the list of nominal MWEs from PWN 3.0.
  ● Translating the list into Arabic using Google Translate.
  ● Validating the results using pure frequency counts from three search engines: Al-Jazeera, BBC Arabic and AWK.
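
The validation step can be pictured as a filter on combined hit counts. The sketch below is only an illustration under stated assumptions: the search engines are represented by stand-in functions that return a hit count for a phrase, the `min_hits` threshold is hypothetical, and the counts in the toy example are invented placeholders, not measured figures.

```python
from typing import Callable, Iterable, List

# A hit counter takes a phrase and returns how many results a search source reports.
HitCounter = Callable[[str], int]


def combined_hits(phrase: str, counters: Iterable[HitCounter]) -> int:
    """Sum the hit counts reported by every search source for one phrase."""
    return sum(counter(phrase) for counter in counters)


def filter_translations(
    candidates: Iterable[str],
    counters: List[HitCounter],
    min_hits: int = 1,  # hypothetical threshold, not taken from the slides
) -> List[str]:
    """Keep only the translated candidates attested at least `min_hits` times in total."""
    return [p for p in candidates if combined_hits(p, counters) >= min_hits]


# Stand-ins for the three search sources, with invented counts.
source_a = {"phrase one": 12, "phrase two": 0}
source_b = {"phrase one": 3}
source_c = {"phrase one": 0, "phrase two": 0}
toy_counters = [
    lambda p: source_a.get(p, 0),
    lambda p: source_b.get(p, 0),
    lambda p: source_c.get(p, 0),
]
kept = filter_translations(["phrase one", "phrase two"], toy_counters)
```

Here "phrase two" is dropped because no source attests it, which is the behaviour one would want for mistranslations such as wrongly ordered or untranslated output.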

Page 15: Arabic mwe presentation 07

Technique 2: Translation-Based

● Evaluation (automatic)
  ● Gold Standard: PWN MWEs that are found in English Wikipedia and have a correspondence in Arabic: 6322 expressions.

● We test the Google translation without any filtering, and consider this as the baseline.

● Then we filter the output based on the number of combined hits from the search engines.

● Results

Page 16: Arabic mwe presentation 07

Technique 2: Translation-Based

● Evaluation (Manual)
  ● On 200 MWE candidates
  ● Precision
    – Baseline (before validation): 45.5%
    – After validation: 83%

Page 17: Arabic mwe presentation 07

Technique 2: Translation-Based

● Notes on Google Translate
  ● Word Order
    – shark repellent => القرش طارد
    – accordion door => الكورديون الباب
  ● Transferring source word to target
    – acroclinium roseum => acroclinium roseum
    – actitis hypoleucos => actitis hypoleucos

Page 18: Arabic mwe presentation 07

Technique 3: Corpus-Based

● Data: Arabic Gigaword corpus

● Association Measures used:

● Pointwise Mutual Information (PMI)

● Pearson’s chi-square

● Steps

(1) Computing the frequency of all unigrams, bigrams, and trigrams.

(2) Computing the association measures for all bigrams and trigrams (threshold set to 50).

(3) Ranking bigrams and trigrams.

(4) Conducting lemmatization of Arabic words using MADA.

(5) Filtering the list using the MADA POS-tagger. The patterns included for bigrams are NN and NA, and for trigrams NNN, NNA and NAA.
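
For bigrams, steps (1) and (2) can be sketched as below. This is an illustrative sketch rather than the authors' code: it uses the standard definitions of PMI, log2(P(xy) / (P(x)P(y))), and of Pearson's chi-square on the 2x2 contingency table of a bigram, and it assumes that the threshold of 50 is a minimum bigram frequency (`min_freq`). Trigrams, lemmatization and POS filtering are left out.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bigram_scores(
    tokens: List[str], min_freq: int = 50
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return {(x, y): (pmi, chi_square)} for bigrams seen at least `min_freq` times."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())  # total number of bigram positions
    # Marginal counts: how often a word fills the first / second slot of a bigram.
    first = Counter()
    second = Counter()
    for (x, y), freq in bigrams.items():
        first[x] += freq
        second[y] += freq

    scores = {}
    for (x, y), o11 in bigrams.items():
        if o11 < min_freq:
            continue
        # Observed 2x2 contingency table.
        o12 = first[x] - o11         # x followed by something other than y
        o21 = second[y] - o11        # y preceded by something other than x
        o22 = n - o11 - o12 - o21    # neither x first nor y second
        pmi = math.log2(o11 * n / (first[x] * second[y]))
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        scores[(x, y)] = (pmi, chi2)
    return scores


# Tiny toy sequence; the threshold is lowered so that every bigram is scored.
toy_scores = bigram_scores("a b a b c d".split(), min_freq=1)
ranked_by_pmi = sorted(toy_scores, key=lambda bg: toy_scores[bg][0], reverse=True)
```

Step (3) then amounts to sorting the scored n-grams by the chosen measure, as in `ranked_by_pmi`. Note that PMI is known to favour rare pairs, which is one reason a frequency cut-off is commonly applied before ranking.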

Page 19: Arabic mwe presentation 07

Technique 3: Corpus-Based

● Why is lemmatization important?

● Al>mm AlmtHdp (the-nations united) “the United Nations”
  Al>mm@>um~ap_1@N@1#AlmtHdp@mut~aHid_1@AJ@2#

● ll>mm AlmtHdp (to-the-nations united) “to the United Nations”
  ll>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#

● wAl>mm AlmtHdp (and-the-nations united) “and the United Nations”
  wAl>mm@>um~ap_1@c-N@3#AlmtHdp@mut~aHid_1@AJ@3#

● bAl>mm AlmtHdp (by-the-nations united) “by the United Nations”
  bAl>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#
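
The effect of lemmatization on counting can be illustrated with the four analysis strings above. The sketch assumes, based only on how these examples look, that tokens are terminated by `#` and that each token has the form `surface@lemma@POS@number`; the function name is hypothetical and this is not a parser for actual MADA output files.

```python
from collections import Counter
from typing import Tuple


def lemma_key(analysis: str) -> Tuple[str, ...]:
    """Map an analysis string to the tuple of its lemmas (second '@'-separated field)."""
    lemmas = []
    for token in analysis.split("#"):
        if not token:  # skip the empty piece after the trailing '#'
            continue
        fields = token.split("@")
        lemmas.append(fields[1])
    return tuple(lemmas)


# The four surface variants from the slide, copied verbatim.
analyses = [
    "Al>mm@>um~ap_1@N@1#AlmtHdp@mut~aHid_1@AJ@2#",
    "ll>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#",
    "wAl>mm@>um~ap_1@c-N@3#AlmtHdp@mut~aHid_1@AJ@3#",
    "bAl>mm@>um~ap_1@N@3#AlmtHdp@mut~aHid_1@AJ@3#",
]
lemma_counts = Counter(lemma_key(a) for a in analyses)
```

Counted by surface form, these are four different bigrams with one occurrence each; counted by lemma they collapse into a single entry with frequency 4, which gives the association measures much more reliable counts for clitic-rich Arabic text.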

Page 20: Arabic mwe presentation 07

Technique 3: Corpus-Based

● Evaluation: 3600 expressions are randomly selected and classified into MWE or non-MWE by a human annotator.

● Results

Page 21: Arabic mwe presentation 07

Discussion of results

● Combination of yields

Page 22: Arabic mwe presentation 07

Discussion of results

● Similarities and dissimilarities of output

The set of collocations detected by the association measures may differ from those that capture the interest of lexicographers and Wikipedians

● مناحم مازوز “Menachem Mazuz”
● خضروات طازجة “fresh vegetables”
● سيداتي وسادتي “Ladies and gentlemen”

Page 23: Arabic mwe presentation 07

Conclusion

● Applicability to other languages

● The heterogeneity of the data sources helps to enrich the MWE lexicon.

● A lexical resource of:
  ● 33,000 MWEs
  ● 39,000 NEs

Page 24: Arabic mwe presentation 07

Thank you!