Corpus resources for learning Arabic to understand the Quran Higher Education Academy workshop on "The Role of Corpora in LSP (Language for Specific Purposes) Learning and Teaching" Parkinson B08, University of Leeds, Monday 23rd July 2012. Eric Atwell I-AIBS Institute for Artificial Intelligence and Biological Systems School of Computing University of Leeds


Page 1

Corpus resources for learning Arabic to understand the Quran

Higher Education Academy workshop on "The Role of Corpora in LSP (Language for Specific Purposes) Learning and Teaching"

Parkinson B08, University of Leeds, Monday 23rd July 2012.

Eric Atwell

I-AIBS Institute for Artificial Intelligence and Biological Systems

School of Computing

University of Leeds

Page 2

An Artificial Intelligence interdisciplinary approach to understanding the Quran

Page 3

(1) What is the Quran?

Holy Book                Prophet            Text Dated
Suhuf Ibrahim (Scrolls)  Abraham            ?
Tawrat (Torah)           Moses              1500 BCE?
Zabur (Psalms)           David              1000 BCE?
Injil (Gospel)           Jesus              1 CE
The Quran                Muhammad (PBUH)    610-632 CE

In Islam, the Quran is the last in a series of 5 religious texts

Page 4

(1) What is the Quran?

- Classical Arabic, 1300+ years ago

- All believers should learn the text; translations are “interpretations”

- Islamic Law (legal logic)

- Divine guidance & direction

- Science and philosophy

- Has inspired Algebra, Linguistics

The central religious text of Islam

Page 5

(2) Traditional Arabic Linguistics

Originated with Arab scholars studying the language of the Quran (scientific analysis for at least 1000 years, far longer than for the English language):

- Orthography (diacritics and vowelization)
- Etymology (Semitic roots)
- Morphology (derivation and inflection)
- Syntax (origins of dependency grammar)
- Discourse Analysis & Rhetoric
- Semantics & Pragmatics

Page 6

(3) Computing

Quran is online, for keyword search

BUT verse-by-verse translations are interpretations

Muslims should access the “true” Classical Arabic source

Page 7

(3) Computing

Example question-answering dialog system:

Question: How long should I breastfeed my child for?

Answer: Mothers should suckle their offspring for two years, if the father wishes to complete the term (The Holy Quran, Verse 2:233).

- How far can we go?
- An Artificial Intelligence system which “understands” the Quran?

Page 8

An Artificial Intelligence approach to understanding the Quran

Central Hypothesis

Augmenting the text of the Quran with rich linguistic annotation will lead to more intelligent/accurate AI systems.

- Prepare the data by annotating the Quran.
- Use the data to build an AI system for concept search and question-answering.

Page 9

Corpus resources for learning Arabic to understand the Quran

Augmenting the Arabic text of the Quran with rich linguistic annotation will help learners to understand Quranic Arabic.

- Annotate the Quranic Arabic Corpus.
- Teachers and learners use the annotations for deeper understanding of Quranic Arabic.

Page 10

Straw Poll: LSP for religious texts?

• How many Muslims in the audience?
– How many read/recite Classical Arabic Quran?
– How many would like to?

• How many Jews in the audience?
– How many read/recite Classical Hebrew Tanakh?
– How many would like to?

• How many Christians in the audience?
– How many read/recite Classical Hebrew/Greek Bible?
– How many would like to?

• Have I left anyone out?

Page 11

Annotating the Quran: Challenges

Orthography - Complex non-standard script

Morphology (word structure) - Arabic is highly inflected, challenging to analyze

Grammar - Phrase structure, dependency

Semantics – Ontology of Entities and Concepts referred to by pronouns and nouns

Page 12

Annotating the Quran: Solutions

- Computing advances have made annotation possible, to high accuracy

- Leverage existing resources from Traditional Arabic Grammar

- Machine-Learning annotation followed by manual verification

- Community effort using online volunteers

Page 13

Recent Advances: Orthography

Google Search for verse (68:38) on Jan 21, 2008 shows many typos

An accurate digital copy of the Quran?

Encoding Issues

- Missing diacritics

- Simplified script (not Uthmani)

- Windows code page 1256, not Unicode
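
The encoding issues listed above can be illustrated with a minimal Python sketch that uses only the standard library. The sample word (bismi) and the helper names are illustrative assumptions, not taken from the slides. The sketch shows vowelized text surviving a round trip through Windows code page 1256, the "missing diacritics" form produced by stripping combining marks, and a letter (alef wasla, U+0671, which commonly appears in Uthmani-script text) that code page 1256 cannot store at all.

```python
import unicodedata

# Illustrative sample (an assumption, not from the slides): "bismi" with short vowels.
VOWELIZED = "\u0628\u0650\u0633\u0652\u0645\u0650"  # ba+kasra, sin+sukun, mim+kasra
ALEF_WASLA = "\u0671"  # a letter that commonly appears in Uthmani-script text


def strip_diacritics(text: str) -> str:
    """Remove combining marks such as fatha, kasra, damma and sukun."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def has_diacritics(text: str) -> bool:
    """Return True if any character in the text is a combining mark."""
    return any(unicodedata.combining(ch) for ch in text)


def fits_cp1256(text: str) -> bool:
    """Return True if every character can be stored in Windows code page 1256."""
    try:
        text.encode("cp1256")
        return True
    except UnicodeEncodeError:
        return False


legacy_bytes = VOWELIZED.encode("cp1256")          # as a legacy web page might store it
print(legacy_bytes.decode("cp1256") == VOWELIZED)  # round trip back to Unicode
print(has_diacritics(strip_diacritics(VOWELIZED)))
print(fits_cp1256(ALEF_WASLA))
```

A check like fits_cp1256 is one rough way to see why a legacy code page is not enough for the full Quranic script, while a check like has_diacritics can flag copies of a verse that have lost their vowel marks.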

Page 14

Recent Advances: Orthography

Tanzil Project (http://tanzil.info)

- Stable version released May 2008

- Uses Unicode XML encoding, including the special characters designed for the complex Arabic script of the Quran

- Manually verified to 100% accuracy by a group of experts who have memorized the entire text of the Quran
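
As a rough illustration of working with a Unicode XML copy of the text, the sketch below reads verses into a dictionary keyed by (chapter, verse). The element and attribute names (quran, sura, aya, index, text) are assumptions about the file layout and should be checked against the actual Tanzil download; the one-verse sample document is embedded only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Assumed layout: <quran><sura index=".."><aya index=".." text=".."/></sura></quran>
SAMPLE_XML = """<quran>
  <sura index="1" name="الفاتحة">
    <aya index="1" text="بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"/>
  </sura>
</quran>"""


def load_verses(xml_text: str) -> dict:
    """Map (chapter, verse) number pairs to the verse text."""
    root = ET.fromstring(xml_text)
    verses = {}
    for sura in root.iter("sura"):
        chapter = int(sura.get("index"))
        for aya in sura.iter("aya"):
            verses[(chapter, int(aya.get("index")))] = aya.get("text")
    return verses


verses = load_verses(SAMPLE_XML)
print(len(verses), "verse(s) loaded")
```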

Page 15

Recent Advances: Orthography

Java Quran API (http://jqurantree.org) (Dukes 2009)

- Java classes for querying the Tanzil XML of the Quran

- gives authentic script on web-pages

Page 16

Recent Advances: Morphology

- Buckwalter Arabic Morphological Analyzer (Tim Buckwalter, 2002)

- Morphological Analysis of the Quran at the University of Haifa (Shuly Wintner, 2004)

- Lexeme & feature based morphological representation of Arabic (Nizar Habash, 2006)

Page 17

The Haifa Corpus (2004)

Multiple analyses for each word (up to 5), e.g.:

rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg

rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen

Not manually verified. Authors report an F-measure of 86%

Non-standard annotation scheme, not familiar to Arabic linguists; e.g. extracting a list of all verbs is non-trivial

Arabic text is encoded only phonetically, in a form not familiar to Arabic linguists; e.g. searching for a specific root is not easy
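
To make the "+"-separated analysis strings on this slide easier to read, here is a minimal parsing sketch. It assumes, purely from the two examples shown, that the first field is a transliterated root, the second a pattern, and the remainder a flat list of features; this reading is a guess, not the documented Haifa scheme, and the function names are hypothetical.

```python
def parse_analysis(analysis: str) -> dict:
    """Split one analysis string into an assumed root, pattern and feature list."""
    root, pattern, *features = analysis.split("+")
    return {"root": root, "pattern": pattern, "features": features}


def analyses_with_feature(candidates: list, feature: str) -> list:
    """Keep only the candidate analyses that carry the given feature."""
    return [a for a in candidates if feature in parse_analysis(a)["features"]]


# The two candidate analyses shown on the slide for a single word.
candidates = [
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg",
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen",
]

for candidate in candidates:
    print(parse_analysis(candidate))
print(analyses_with_feature(candidates, "Gen"))
```

Even with such a parser, each word still carries several unverified candidate analyses, so a simple filter cannot say which candidate is the right one in context.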

Page 18

The Quranic Arabic Corpus http://corpus.quran.com/

Kais Dukes – PhD (part-time)

- Word structure: colour-coded morphological analysis
- Translation: verse and word-for-word English translations
- Grammar: dependency parse following Arabic tradition
- Semantics: ontology of entities and concepts
- Machine Learning: annotations used for A.I. training
- Impact: dozens of researchers have collaborated/cited, and over a million visitors use the website per year

Page 19

The Quranic Arabic Corpus: Verified Uthmani Script

- Unicode Uthmani Script
- Sourced from the verified Tanzil project

Page 20

The Quranic Arabic Corpus: Phonetics (faja'alnāhumu)

- Phonetic transcription generated algorithmically
- Guided by Arabic vowelized diacritics
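
A toy sketch of diacritic-guided transcription follows. It is not the corpus's actual algorithm: it covers only the letters of one word, and the simple-script spelling of that word is an assumption (the Uthmani spelling used by the corpus may differ). Each consonant maps to a Latin letter, each short-vowel mark to a vowel, sukun to nothing, and an alif after "a" lengthens the vowel.

```python
# Letter and mark tables cover only the sample word; a real system needs far more.
CONSONANTS = {
    "\u0641": "f",  # fa
    "\u062C": "j",  # jim
    "\u0639": "'",  # ayn
    "\u0644": "l",  # lam
    "\u0646": "n",  # nun
    "\u0647": "h",  # ha
    "\u0645": "m",  # mim
}
SHORT_VOWELS = {"\u064E": "a", "\u064F": "u", "\u0650": "i"}  # fatha, damma, kasra
SUKUN = "\u0652"  # marks a consonant with no following vowel
ALIF = "\u0627"   # lengthens a preceding "a"

# Assumed simple-script spelling of the slide's example word.
WORD = ("\u0641\u064E\u062C\u064E\u0639\u064E\u0644\u0652"
        "\u0646\u064E\u0627\u0647\u064F\u0645\u064F")


def transcribe(word: str) -> str:
    """Transcribe a fully vowelized word, one character at a time."""
    out = []
    for ch in word:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in SHORT_VOWELS:
            out.append(SHORT_VOWELS[ch])
        elif ch == SUKUN:
            continue
        elif ch == ALIF and out and out[-1] == "a":
            out[-1] = "\u0101"  # long a
        else:
            raise ValueError(f"character not covered by this sketch: {ch!r}")
    return "".join(out)


print(transcribe(WORD))
```

For the sample word this yields the transcription shown in the slide title, which suggests how far a simple mark-by-mark pass can go when the source text is fully vowelized.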

Page 21

The Quranic Arabic Corpus: Interlinear translation

- Word-for-word translation from accepted sources
- Interlinear translation scheme

Page 22

The Quranic Arabic Corpus: Location Reference (21:70:4)

- Common standard for verses (Chapter:Verse)
- Extended in the QAC corpus to include word numbers and segment numbers, e.g. (21:70:4:2)
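
The location scheme lends itself to a small parser. The sketch below is one possible reading of the notation on this slide: two to four colon-separated numbers, optionally wrapped in parentheses, standing for chapter, verse, word and segment. The class and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    """A reference of the form Chapter:Verse[:Word[:Segment]]."""
    chapter: int
    verse: int
    word: Optional[int] = None
    segment: Optional[int] = None


def parse_location(ref: str) -> Location:
    """Parse references such as '2:233', '(21:70:4)' or '(21:70:4:2)'."""
    parts = ref.strip().strip("()").split(":")
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2 to 4 numbers, got {ref!r}")
    return Location(*(int(p) for p in parts))


print(parse_location("(21:70:4:2)"))
print(parse_location("2:233"))
```

Keeping the word and segment fields optional lets the same type hold both the common verse-level reference and the extended corpus reference.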

Page 23

The Quranic Arabic Corpus: Morphological Segmentation

- Division of a single word into multiple segments
- Part-of-speech tag assigned to each segment
- Traditional Arabic Grammar rules used for division

Page 24

The Quranic Arabic Corpus: Morphological segment features

Page 25

The Quranic Arabic Corpus: Arabic Grammar Summary

Page 26

The Quranic Arabic Corpus: Syntactic Annotation

- Dependency Grammar based on إعراب (i'rāb)
- Syntactico-semantic roles for each word

Page 27

The Quranic Arabic Corpus: Ontology of entities and concepts

- linked to/from nouns and pronouns in the text

Page 28

The Quranic Arabic Corpus: Framework for collaboration

User Interaction via Message Board: “If you come across a word and you feel that a better analysis could be provided, you can suggest a correction online by clicking on an Arabic word” (5000+ resolved messages)

Resources: Publications, Citations, Reviews, FAQs, Feedback, Data Download, Software Download, Mailing list

Page 29

The Quranic Arabic Corpus: Users (researchers, public)

- Artificial Intelligence and Computational Linguistics
- Arabic linguistics
- Quranic and Islamic Studies
- Classical literature analysis
- Anyone who wants to appreciate the Quran

Page 30

The Quranic Arabic Corpus: new Computational Linguistics

- First Treebank of Classical Arabic

- Free Treebank of the Quran

- First formal representation of Traditional Arabic Grammar using constituency/dependency graphs

- Machine-Learning parser

Page 31

User Feedback (300+ comments)

“I would like to applaud you for your effort” - Prof Behnam Sadeghi, Stanford University

“We are big admirers of the work” - Prof Gregory Crane, Classics Dept, Tufts University

“I regularly use your work on the Qur'an and read it whenever I can.” - Prof Yousuf Islam, Director, Daffodil International University

“Congratulations to all concerned on this project” - Prof Michael Arthur, VC, Leeds Uni

Page 32

Most users are teachers and learners of Quranic Arabic

Over a million users already, and growing; many unforeseen social benefits, e.g.:

“I work as a chaplain in correctional centers in the State of Missouri, U.S.A. Thanks for your permission to use the Quranic Arabic Corpus in these correctional centers” Tadar Wazir.

Page 33

AI for understanding the Quran

Qurany: the first Quran "search for a concept" website, http://xyzqurany.appspot.com/

If you choose, from the tree of concepts on the left-hand side, the concept "Pillars of Islam", then "The Prayers", then "Performing the Prayers", then "Friday Prayers", you get Quran verses on this topic in the upper right frame and Hadith on this topic in the lower right frame.

Nora Abbas (2009). Qurany: A Tool to Search for Concepts in the Quran.
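
The concept-tree navigation described on this slide can be sketched as a nested dictionary walk. This is only an illustration of the idea, not Qurany's implementation or data: the path names come from the slide, while the single verse reference attached to the leaf (62:9, the verse on the call to Friday prayer) is a placeholder chosen for the example.

```python
# Nested dictionaries are concepts; lists at the leaves hold verse references.
# The leaf content is an illustrative placeholder, not taken from Qurany.
CONCEPT_TREE = {
    "Pillars of Islam": {
        "The Prayers": {
            "Performing the Prayers": {
                "Friday Prayers": ["62:9"],
            },
        },
    },
}


def collect(node) -> list:
    """Gather every verse reference at or below a node."""
    if isinstance(node, list):
        return list(node)
    verses = []
    for child in node.values():
        verses.extend(collect(child))
    return verses


def verses_for(path: list, tree: dict = CONCEPT_TREE) -> list:
    """Follow a path of concept names and return the verses found beneath it."""
    node = tree
    for name in path:
        node = node[name]  # raises KeyError for an unknown concept
    return collect(node)


print(verses_for(["Pillars of Islam", "The Prayers",
                  "Performing the Prayers", "Friday Prayers"]))
```

Selecting a higher-level concept returns everything beneath it, which mirrors browsing from a broad topic down to a specific one.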

Page 34

“Google Qurany”: HTML version at http://www.comp.leeds.ac.uk/nora/html

Store each Quran or Hadith verse as a separate web page, and annotate each web page with English translations and concept-tags.

Then search is enabled via Google, but “keywords” can be concept-tags and/or English words and/or Arabic words.

Google “Jesus site:http://www.comp.leeds.ac.uk/nora/html”

Google “Friday Prayers site:http://www.comp.leeds.ac.uk/nora/html”
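
A minimal sketch of the one-page-per-verse idea follows. The page layout, the file-naming rule and the concept tag "Breastfeeding" are assumptions made for illustration; the translation text is the one quoted earlier in these slides for verse 2:233, and the Arabic text is left out for brevity.

```python
from html import escape


def verse_page(ref: str, translation: str, concepts: list) -> str:
    """Build a small HTML page for one verse, tagged with its concepts."""
    keywords = ", ".join(concepts)
    return (
        "<html><head>"
        f"<title>Quran {escape(ref)}</title>"
        f'<meta name="keywords" content="{escape(keywords)}">'
        "</head><body>"
        f"<h1>Quran {escape(ref)}</h1>"
        f"<p>{escape(translation)}</p>"
        f"<p>Concepts: {escape(keywords)}</p>"
        "</body></html>"
    )


def page_filename(ref: str) -> str:
    """Derive a file name from a Chapter:Verse reference (hypothetical rule)."""
    return "quran_" + ref.replace(":", "_") + ".html"


page = verse_page(
    "2:233",
    "Mothers should suckle their offspring for two years, "
    "if the father wishes to complete the term",
    ["Breastfeeding"],  # hypothetical concept tag
)
print(page_filename("2:233"))
print(page)
```

Because the concept tags and the translation appear as ordinary page text, a general web search engine restricted to the site can match on either.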

Page 35

AI for understanding the Quran

http://www.textminingthequran.com/wiki - Tools and resources for text mining the Quran, including pronoun references, related verses, lemma concordance and collocation, and text mining the Hadeeth

Abdul-Baquee Sharaf and Eric Atwell (2012). QurAna: Corpus of the Quran annotated with Pronominal Anaphora. Proc LREC’2012, Istanbul

Abdul-Baquee Sharaf and Eric Atwell (2012). QurSim: A corpus for evaluation of relatedness in short texts. Proc LREC’2012, Istanbul

Page 36

www.textminingthequran.com/wiki

QurSim - 7,679 pairs of related verses, according to Ibn Kathir, respected Islamic Scholar

QurAna - 24,668 pronouns, each linked to its anaphoric referent entity or concept, and the location of the antecedent if available.

Concept list - a list of 1054 entities or concepts arising from Pronoun referents in the Quran – nominal entities in a Quran ontology

Page 37

AI for understanding the Quran

SALMA - Sawalha Atwell Leeds Morphological Analyser

SALMA Morphological analysis of Quran text

Majdi Sawalha, Eric Atwell (2010). Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text. Proc LREC’2010, Valletta, Malta

Majdi Sawalha, Eric Atwell (2010). Constructing and Using Broad-Coverage Lexical Resource for Enhancing Morphological Analysis of Arabic. Proc LREC’2010, Valletta, Malta

Page 38

AI for understanding the Quran

Boundary-Annotated Quran - Tagged with prosodic annotation scheme from Tajwīd (recitation) mark-up in the Qur'an

Claire Brierley, Majdi Sawalha and Eric Atwell (2012). Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing. Proc LREC’2012, Istanbul

Majdi Sawalha, Claire Brierley and Eric Atwell (2012). Predicting Phrase Breaks in Classical and Modern Standard Arabic Text. Proc LREC’2012, Istanbul

Page 39

AI for understanding the Quran

The Quranic Arabic Corpus - the first online annotated linguistic resource which shows the Arabic "irab" morphology and grammar for each word and verse in the Holy Quran, including word-by-word morphology and English gloss, and Ontology of Quranic concepts

Kais Dukes, Eric Atwell and Nizar Habash (2011). Supervised Collaboration for Syntactic Annotation of Quranic Arabic. Language Resources and Evaluation Journal (LREJ).

Kais Dukes and Eric Atwell (2012). LAMP: A Multimodal Web Platform for Collaborative Linguistic Analysis. Proc LREC’2012, Istanbul

Page 40

Conclusion

Augmenting the Arabic text of the Quran with rich linguistic annotation will help learners to understand Quranic Arabic.

http://corpus.quran.com/

Eric Atwell, Nora Abbas, Claire Brierley, Kais Dukes, Majdi Sawalha, Abdul-Baquee Sharaf

I-AIBS Institute for Artificial Intelligence and Biological Systems

School of Computing, University of Leeds

Page 41

Questions?