Lexical Profiling for Arabic Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, Josef van Genabith National Centre for Language Technology (NCLT), School of Computing, Dublin City University Funded by: Enterprise Ireland, the Irish Research Council for Science Engineering and Technology (IRCSET), and the EU projects PANACEA and META-NET
Page 1: E lex presentation_03

Lexical Profiling for Arabic

Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, Josef van Genabith

National Centre for Language Technology (NCLT),

School of Computing, Dublin City University

Funded by:

Enterprise Ireland, the Irish Research Council for Science

Engineering and Technology (IRCSET), and

the EU projects PANACEA and META-NET

Page 2: E lex presentation_03

Overview

• Introduction

• Building the lexical database for Arabic

– Corpus-based Selection of Entries

– Morphological Details: Inflectional Paradigms

– Syntactic Details: Subcategorization Frames

• Web Application

• Conclusion

Page 3: E lex presentation_03

Introduction

• Modern Standard Arabic vs. Classical Arabic

• Current State of Arabic Lexicography

– Lexicons are not corpus-based

– Buckwalter Electronic Dictionary and Arabic Morphological Analyser

– No lexica for subcategorization frames

• Importance of Lexical Resources

Page 4: E lex presentation_03

Introduction

• Arabic Morphotactics

Page 5: E lex presentation_03

Aim

• Constructing a lexical database of Modern Standard Arabic

• Constructing a database for Arabic subcategorization frames

Page 6: E lex presentation_03

Methodology

Lexical Details

• Using a medium-scale manually created lexicon of 10,799 lemmas

• Using statistics from a 1 billion word corpus (annotated by MADA)

– 90% from the LDC's Arabic Gigaword

– 10% collected from the Al-Jazeera website

Subcategorization Details

• Using a medium-scale manually created lexicon of 2,901 lemma-frame types

• Using the Penn Arabic Treebank of 22,524 sentences, and 587,665 words

Page 7: E lex presentation_03

Extending the Lexical Database

• Start off with a seed lexicon

– Three lexical databases, manually constructed

• 5,925 nominal lemmas, with details on:

– Gender and number

– Inflection paradigm (13 continuation classes)

– Humanness

• 1,529 verb lemmas, with details on:

– Transitivity

– Whether passive is allowed or not

– Whether the imperative is allowed or not

• 490 patterns (456 for nominals and 34 for verbs)

• Lemma-root lookup database

Page 8: E lex presentation_03

Methodology

Page 9: E lex presentation_03

Extending the Lexical Database

• Automatically Extending the Lexical Database: Lexical Enrichment

– Data-driven filtering technique

• 40,648 lemmas (in Buckwalter or SAMA 3.1)

• Statistics from three web search engines

• Statistics from the corpus annotated by MADA

• 29,627 lemmas (left after filtering)

Page 10: E lex presentation_03

Extending the Lexical Database

Automatically Extending the Lexical Database: Feature Enrichment

– Machine Learning

– Multilayer Perceptron classification algorithm

– Training Data: 4,816 nominals and 1,448 verbs

– Classes for nominals: continuation classes (or inflection paths), the semantico-grammatical feature of humanness, and POS (noun or adjective)

– Classes for verbs: transitivity, allowing the passive voice, and allowing the imperative mood

– We feed these datasets with frequency statistics from the corpus and build a vector grid.
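The slides do not show what the "vector grid" looks like, so the following is only a minimal sketch of one way to turn corpus statistics into per-lemma feature vectors. The feature names (`passive`, `imperative`, `has_object`), the token format, and the toy lemmas are all assumptions made for illustration; they are not taken from the MADA output or from the authors' setup.

```python
from collections import Counter

# Hypothetical feature names; the slides do not list the actual features.
VERB_FEATURES = ["passive", "imperative", "has_object"]

def build_vectors(tokens, features=VERB_FEATURES):
    """Turn (lemma, tags) observations into relative-frequency vectors.

    tokens: iterable of (lemma, set_of_tags) pairs, one per corpus token.
    Returns {lemma: [share of tokens carrying each feature]}.
    """
    totals = Counter()  # tokens seen per lemma
    hits = Counter()    # tokens per (lemma, feature)
    for lemma, tags in tokens:
        totals[lemma] += 1
        for feat in features:
            if feat in tags:
                hits[(lemma, feat)] += 1
    return {
        lemma: [hits[(lemma, feat)] / n for feat in features]
        for lemma, n in totals.items()
    }

# Invented toy observations, in Buckwalter transliteration.
toy = [
    ("kataba", {"has_object"}),
    ("kataba", {"passive"}),
    ("kataba", {"has_object"}),
    ("kataba", set()),
    ("nAma", set()),
    ("nAma", {"imperative"}),
]
vectors = build_vectors(toy)
print(vectors)
```

Vectors of this kind, paired with the labels from the seed lexicon, could then be passed to a classifier such as a multilayer perceptron; that training step is not shown here.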

Page 11: E lex presentation_03

Extending the Lexical Database

• Extending the Lexical Database

– Feature enrichment using Machine Learning

Page 12: E lex presentation_03

Extending the Lexical Database

• Extending the Lexical Database

– With Machine Learning we add about 18,000 new lemmas: 12,974 nominals and 5,034 verbs

Page 13: E lex presentation_03

Extending the Lexical Database

• Handling Broken Plurals

jAnib (side)

jawAnib (sides)

Poor handling of broken plural in Buckwalter

(4) <lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc> <pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>

(5) <lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc> <pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>

Two differences: voc and gloss

Page 14: E lex presentation_03

Extending the Lexical Database

• Extracting Broken Plurals

<gloss>side/aspect</gloss>

<gloss>sides/aspects</gloss>

We use Levenshtein Distance which measures the difference between two strings (here glosses having the same lemmaID).

distance of 2 / length of the first string = 0.15 (within the threshold 0.4)

We collect 2,266 candidates
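The gloss comparison above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and treating the 0.4 threshold as inclusive is an assumption. Note that "side/aspect" has 11 characters and "sides/aspects" has 13, so a distance of 2 gives roughly 0.18 when divided by 11 and roughly 0.15 when divided by 13; both values fall within the threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def is_plural_candidate(gloss_a: str, gloss_b: str, threshold: float = 0.4) -> bool:
    """True if the length-normalised distance between two glosses is within the threshold.

    Normalises by the length of the first gloss, as the slide describes.
    The inclusive comparison (<=) is an assumption.
    """
    if not gloss_a:
        return False
    return levenshtein(gloss_a, gloss_b) / len(gloss_a) <= threshold

distance = levenshtein("side/aspect", "sides/aspects")
print(distance, is_plural_candidate("side/aspect", "sides/aspects"))
```

In the setting described on the slide, the check would only be applied to pairs of entries that share the same lemmaID.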

Page 15: E lex presentation_03

Extending the Lexical Database

• Validating Broken Plurals

<voc>jAnib</voc> singular

pattern is: fAEil

regex is: .A.i.

<voc>jawAnib</voc> plural

pattern is: fawAEil

regex is: .awA.i.

Pattern database: 135 singular patterns that choose from a set of 82 broken plural patterns

2,266 candidates -> 1,965 are validated (87%)
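The validation step can be sketched as below, under stated assumptions: the root-consonant slots are written as f, E and l (Buckwalter transliteration), a candidate pair is accepted when the singular and the plural yield the same root consonants, and the pattern table holds only the single fAEil/fawAEil pair shown on the slide rather than the full database of 135 singular patterns. All function names are hypothetical.

```python
import re

ROOT_SLOTS = "fEl"  # placeholders for the three root consonants

def pattern_to_regex(pattern: str, capture: bool = False) -> str:
    """Replace each root slot with '.', or '(.)' when the root should be captured."""
    slot = "(.)" if capture else "."
    return "".join(slot if ch in ROOT_SLOTS else re.escape(ch) for ch in pattern)

def extract_root(word: str, pattern: str):
    """Return the root consonants of `word` if it fits `pattern`, else None."""
    match = re.fullmatch(pattern_to_regex(pattern, capture=True), word)
    return match.groups() if match else None

# Illustrative stand-in for the pattern database: singular pattern -> allowed plural patterns.
PLURAL_PATTERNS = {"fAEil": {"fawAEil"}}

def validate(singular: str, plural: str) -> bool:
    """Accept the pair if some licensed pattern pair yields the same root on both sides."""
    for sg_pattern, pl_patterns in PLURAL_PATTERNS.items():
        root = extract_root(singular, sg_pattern)
        if root is None:
            continue
        if any(extract_root(plural, pl) == root for pl in pl_patterns):
            return True
    return False

print(pattern_to_regex("fAEil"), pattern_to_regex("fawAEil"))
print(validate("jAnib", "jawAnib"))
```

This simple slot replacement would misfire on a pattern in which f, E or l also appears as a literal letter; a real pattern database would need to mark the slots explicitly.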

Page 16: E lex presentation_03

Extending the Lexical Database

• Interesting statistics on Arabic plurals

Insights from the corpus:

5,570 lemmas have a feminine plural suffix

1,942 lemmas have a masculine plural suffix

2,730 lemmas have broken plural forms

Page 17: E lex presentation_03

Extraction of Subcat Frames

• Importance of subcategorization frames

• Advantage of Automatic Extraction

• Available Resource on Arabic Subcat Frames:

– none except Arabic LFG Parser (Attia, 2008) - available as open source

Page 18: E lex presentation_03

Extraction of Subcat Frames

What are LFG subcat frames?

• Governable GFs (SUBJ, OBJ, OBJθ, OBLθ, COMP and XCOMP)

• Non-governable GFs (ADJ and XADJ)

π<gf1, gf2, …, gfn>

{iEotamada Al-Tifolu EalaY wAlidati-hi “The child relied on his mother”

{iEotamada<(↑SUBJ)(↑OBL_EalaY)>
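As a rough illustration of the frame notation, the sketch below builds a frame string from the grammatical functions found under a predicate, keeping only the governable ones. The function names and the `OBL_EalaY` spelling (an underscore joining the function to its preposition) are assumptions made for this example, not the authors' representation.

```python
# Base names of the governable grammatical functions; OBJ and OBL also cover
# their thematically restricted variants such as OBL_EalaY.
GOVERNABLE = {"SUBJ", "OBJ", "OBL", "COMP", "XCOMP"}

def is_governable(gf: str) -> bool:
    """'OBL_EalaY' has base 'OBL'; adjuncts (ADJ, XADJ) are not governable."""
    return gf.split("_", 1)[0] in GOVERNABLE

def build_frame(lemma: str, gfs: list[str]) -> str:
    """Format a predicate and its governable functions as lemma<(↑GF1)(↑GF2)...>."""
    args = [gf for gf in gfs if is_governable(gf)]
    return lemma + "<" + "".join(f"(↑{gf})" for gf in args) + ">"

# The adjunct is dropped because it is not part of the subcategorization frame.
frame = build_frame("{iEotamada", ["SUBJ", "OBL_EalaY", "ADJ"])
print(frame)
```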

Page 19: E lex presentation_03

Extraction of Subcat Frames

Automatic extraction of subcat frames

• The ATB contains 22,524 sentences

• LFG Annotation algorithm (DCU)

• Traversing trees and looking for dependencies

• Lemmatization

• We extract 7,746 lemma-frame types (for verbs, nouns and adjectives)

Page 20: E lex presentation_03

Extraction of Subcat Frames

Estimating the Subcategorization Probability
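Only the slide title survives in this transcript, so the estimate actually used is not recoverable here. A common choice in treebank-based extraction, assumed for this sketch, is the relative frequency P(frame | lemma) = count(lemma, frame) / count(lemma). The function name and the toy observations below are invented for illustration.

```python
from collections import Counter

def frame_probabilities(observations):
    """Relative-frequency estimate of P(frame | lemma).

    observations: iterable of (lemma, frame) pairs, one per extracted token.
    Returns {(lemma, frame): probability}.
    """
    observations = list(observations)
    pair_counts = Counter(observations)
    lemma_counts = Counter(lemma for lemma, _ in observations)
    return {
        (lemma, frame): count / lemma_counts[lemma]
        for (lemma, frame), count in pair_counts.items()
    }

# Invented toy data: the verb is seen three times with SUBJ + OBL and once with SUBJ only.
toy = [("{iEotamada", "SUBJ,OBL_EalaY")] * 3 + [("{iEotamada", "SUBJ")]
probs = frame_probabilities(toy)
print(probs)
```

With an estimate like this, low-probability lemma-frame pairs can be filtered out by a cut-off, although whether and where such a cut-off was applied is not stated in the transcript.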

Page 21: E lex presentation_03

Extraction of Subcat Frames

Evaluating the Subcategorization Extraction

Page 22: E lex presentation_03

Extraction of Subcat Frames

Evaluating the Subcategorization Extraction

Page 23: E lex presentation_03

Web Application

• AraComLex Lexicon Writing Application

www.cngl.ie/aracomlex

Page 24: E lex presentation_03

Byproducts of the Work

A number of open-source resources:

• Finite-state morphological transducer

• Arabic morphological patterns

• Subcategorization frames

• Arabic lemma frequency counts

Page 25: E lex presentation_03

Conclusion

• We successfully use machine learning to predict morpho-syntactic features for newly acquired words.

• We successfully extract subcategorization frames from the Penn Arabic Treebank.

• We build the specifications and an implementation for an Arabic lexicographic web application.