Lexical Profiling for Arabic
Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, Josef van Genabith
National Centre for Language Technology (NCLT), School of Computing, Dublin City University
Funded by: Enterprise Ireland, the Irish Research Council for Science, Engineering and Technology (IRCSET), and the EU projects PANACEA and META-NET
Overview
• Introduction
• Building the lexical database for Arabic
– Corpus-based Selection of Entries
– Morphological Details: Inflectional Paradigms
– Syntactic Details: Subcategorization Frames
• Web Application
• Conclusion
Introduction
• Modern Standard Arabic vs. Classical Arabic
• Current State of Arabic Lexicography
– Lexicons are not corpus-based
– Buckwalter Electronic Dictionary and Arabic Morphological Analyser
– No lexica for subcategorization frames
• Importance of Lexical Resources
Introduction
• Arabic Morphotactics
Aim
• Constructing a lexical database of Modern Standard Arabic
• Constructing a database for Arabic subcategorization frames
Methodology
Lexical Details
• Using a medium-scale manually created lexicon of 10,799 lemmas
• Using statistics from a 1 billion word corpus (annotated by MADA)
– 90% from the LDC's Arabic Gigaword
– 10% collected from the Al-Jazeera website
Subcategorization Details
• Using a medium-scale manually created lexicon of 2,901 lemma-frame types
• Using the Penn Arabic Treebank (22,524 sentences and 587,665 words)
Extending the Lexical Database
• Start off with a seed lexicon
– Three lexical databases, manually constructed
• 5,925 nominal lemmas, with details on:
– Gender and number
– Inflection paradigm (13 continuation classes)
– Humanness
• 1,529 verb lemmas, with details on:
– Transitivity
– Whether passive is allowed or not
– Whether the imperative is allowed or not
• 490 patterns (456 for nominals and 34 for verbs)
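The seed-lexicon details listed above could be held in records along the following lines. This is only a sketch: the class names, field names, value encodings and the example values are hypothetical and are not taken from the actual database; only the listed attributes and the count of 13 continuation classes come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NominalLemma:
    lemma: str               # Buckwalter-transliterated lemma
    gender: str              # hypothetical encoding, e.g. "masc" / "fem"
    number: str              # hypothetical encoding, e.g. "sg" / "pl"
    continuation_class: int  # one of the 13 inflection paradigms (numbering assumed)
    human: bool              # humanness

    def __post_init__(self):
        # The slide states 13 continuation classes; numbering them 1..13 is an assumption.
        if not 1 <= self.continuation_class <= 13:
            raise ValueError("continuation_class must be in 1..13")


@dataclass(frozen=True)
class VerbLemma:
    lemma: str
    transitive: bool         # transitivity
    allows_passive: bool     # whether the passive is allowed
    allows_imperative: bool  # whether the imperative is allowed


# Illustrative values only, not copied from the lexicon.
example_noun = NominalLemma("jAnib", "masc", "sg", 1, False)
```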
<voc>jawAnib</voc> plural
pattern is: fawAEil
regex is: .awA.i.
Pattern database: 135 singular patterns that choose from a set of 82 broken plural patterns
2,266 candidates -> 1,965 are validated (87%)
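The pattern-to-regex step in the jawAnib example can be sketched as follows. The sketch assumes that the letters f, E and l in a pattern template stand for the three root consonants and that every other character is literal; the function names are hypothetical and this is not the authors' implementation.

```python
import re

# Assumption: in a pattern template, f / E / l mark the root-consonant slots.
ROOT_SLOTS = {"f", "E", "l"}


def pattern_to_regex(pattern: str) -> str:
    """Replace each root slot with '.', escaping all literal characters."""
    return "".join("." if ch in ROOT_SLOTS else re.escape(ch) for ch in pattern)


def matches_pattern(word: str, pattern: str) -> bool:
    """True if the whole (Buckwalter-transliterated) word fits the pattern."""
    return re.fullmatch(pattern_to_regex(pattern), word) is not None


print(pattern_to_regex("fawAEil"))            # .awA.i.
print(matches_pattern("jawAnib", "fawAEil"))  # True
```

Escaping the literal characters matters because Buckwalter transliteration uses symbols such as `{`, `|`, `*` and `$`, which are regex metacharacters.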
Extending the Lexical Database
• Interesting statistics on Arabic plurals
Insights from the corpus:
5,570 lemmas have a feminine plural suffix
1,942 lemmas have a masculine plural suffix
2,730 lemmas have broken plural forms
Extraction of Subcat Frames
• Importance of subcategorization frames
• Advantage of Automatic Extraction
• Available Resource on Arabic Subcat Frames:
– none except Arabic LFG Parser (Attia, 2008) - available as open source
Extraction of Subcat Frames
What are LFG subcat frames?
• Governable GFs (SUBJ, OBJ, OBJθ, OBLθ, COMP and XCOMP)
• Non-governable GFs (ADJ and XADJ)
π<gf1,gf2,…gfn>
{iEotamada Al-Tifolu EalaY wAlidati-hi “The child relied on his mother”
{iEotamada<(↑SUBJ)(↑OBL_EalaY)>
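A frame of the form π<gf1,gf2,…gfn> can be modelled as a predicate plus an ordered list of governable grammatical functions. The sketch below uses hypothetical class names and assumes that the oblique in the example is marked by the preposition EalaY, written here as a `_EalaY` suffix.

```python
from dataclasses import dataclass

# Base names of the governable GFs; the theta / preposition index is kept in `marker`.
GOVERNABLE = {"SUBJ", "OBJ", "OBL", "COMP", "XCOMP"}


@dataclass(frozen=True)
class GF:
    name: str         # e.g. "SUBJ", "OBL"
    marker: str = ""  # e.g. the preposition that marks an oblique

    def __post_init__(self):
        # Non-governable functions (ADJ, XADJ) do not belong in a subcat frame.
        if self.name not in GOVERNABLE:
            raise ValueError(f"{self.name} is not a governable GF")

    def render(self) -> str:
        label = f"{self.name}_{self.marker}" if self.marker else self.name
        return f"(↑{label})"


@dataclass(frozen=True)
class Frame:
    pred: str
    args: tuple

    def render(self) -> str:
        return self.pred + "<" + "".join(gf.render() for gf in self.args) + ">"


frame = Frame("{iEotamada", (GF("SUBJ"), GF("OBL", "EalaY")))
print(frame.render())  # {iEotamada<(↑SUBJ)(↑OBL_EalaY)>
```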
Extraction of Subcat Frames
Automatic extraction of subcat frames
• The ATB contains 22,524 sentences
• LFG Annotation algorithm (DCU)
• Traversing trees and looking for dependencies
• Lemmatization
• We extract 7,746 lemma-frame types (for verbs, nouns and adjectives)
Extraction of Subcat Frames
Estimating the Subcategorization Probability
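One simple way to estimate such a probability is relative frequency, P(frame | lemma) = count(lemma, frame) / count(lemma). The sketch below assumes that estimator, which may differ from the one used in the talk; the function name and the toy observations are invented for illustration.

```python
from collections import Counter


def frame_probabilities(observations):
    """Relative-frequency estimate of P(frame | lemma).

    `observations` is an iterable of (lemma, frame) pairs, one per token
    extracted from the treebank.
    """
    observations = list(observations)
    pair_counts = Counter(observations)                         # count(lemma, frame)
    lemma_counts = Counter(lemma for lemma, _ in observations)  # count(lemma)
    return {
        (lemma, frame): n / lemma_counts[lemma]
        for (lemma, frame), n in pair_counts.items()
    }


# Toy data, invented for illustration only.
toy = [("{iEotamada", "SUBJ,OBL_EalaY")] * 3 + [("{iEotamada", "SUBJ")]
probs = frame_probabilities(toy)
print(probs[("{iEotamada", "SUBJ,OBL_EalaY")])  # 0.75
```

The number of keys in the returned dictionary is the number of distinct lemma-frame types in the input.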
Extraction of Subcat Frames
Evaluating the Subcategorization Extraction
Web Application
• AraComLex Lexicon Writing Application