Imam Mohammad Ibn Saud Islamic University

College of Computing and Information Science

Computer Science Department

Prepared by: Al-Moammar, A., Al-Abdullah, H., and Al-Ajlan, N.

Arabic Tokenization and Stemming

Supervised by: Dr. Amal Al-Saif.

Outline

Introduction.

Tokenization:
• Arabic Characteristics.
• Methodology.
• Results.

Stemming:
• Arabic Characteristics.
• Methodology.
• Results.

Conclusion.

Introduction

Arabic language.

Tokenization.

Stemming.

Arabic Language Characteristics

• Writing a letter in an ambiguous form causes orthographic problems.

• Encliticization of a word ending with "ة" or "ى":

Word | Encliticization of word
جمعتهم | جمعة + هم ("their Friday") or جمعت + هم ("collect them")
مستواك | مستوى + ك ("your level")

• Ambiguity results from decliticization of "ل" (l), "ا" (A), and "ال" (Al, "the").
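The encliticization ambiguity above can be illustrated with a small sketch. It relies on two orthographic facts: a final "ة" is written as "ت" and a final "ى" is written as "ا" when an enclitic is attached. The clitic list and the function name `candidate_segmentations` are assumptions made for illustration, not part of the slides.

```python
# Illustrative subset of enclitics; a real tokenizer would use a fuller list.
ENCLITICS = ["هم", "ك"]

def candidate_segmentations(word):
    """Return every (stem, enclitic) split the simple rules allow."""
    candidates = []
    for clitic in ENCLITICS:
        if word.endswith(clitic) and len(word) > len(clitic):
            stem = word[:-len(clitic)]
            candidates.append((stem, clitic))
            # A stem-final "ت" may be an underlying "ة".
            if stem.endswith("ت"):
                candidates.append((stem[:-1] + "ة", clitic))
            # A stem-final "ا" may be an underlying "ى".
            if stem.endswith("ا"):
                candidates.append((stem[:-1] + "ى", clitic))
    return candidates

for word in ["جمعتهم", "مستواك"]:
    print(word, candidate_segmentations(word))
```

The sketch only enumerates candidates; choosing between them is the job of the statistical step described in the methodology.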

My Approach

Sample of Arabic tokenized text:

The bigram equation used relies on two probabilities:

P(w_i | s_j) is the probability of the i-th word given the j-th segmentation.

P(s_j | s_(i-1)) is the probability of the j-th segmentation given the previous segmentation.
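The equation itself is not preserved in this transcript. As one plausible reading, the sketch below scores each candidate segmentation by the product of the two probabilities defined above and keeps the highest-scoring one. The probability tables, their values, and the function names are hypothetical and invented purely for illustration.

```python
# Hypothetical probabilities (invented numbers, not from the slides).
# Keys: (word, segmentation) and (previous segmentation, segmentation).
P_WORD_GIVEN_SEG = {
    ("جمعتهم", ("جمعة", "هم")): 0.6,
    ("جمعتهم", ("جمعت", "هم")): 0.4,
}
P_SEG_GIVEN_PREV = {
    (("يوم",), ("جمعة", "هم")): 0.7,
    (("يوم",), ("جمعت", "هم")): 0.1,
}

def score(word, seg, prev_seg):
    """Product of P(word | seg) and P(seg | previous seg); 0 if unseen."""
    return (P_WORD_GIVEN_SEG.get((word, seg), 0.0)
            * P_SEG_GIVEN_PREV.get((prev_seg, seg), 0.0))

def best_segmentation(word, candidates, prev_seg):
    """Pick the candidate segmentation with the highest bigram score."""
    return max(candidates, key=lambda seg: score(word, seg, prev_seg))

CANDIDATES = [("جمعت", "هم"), ("جمعة", "هم")]
PREVIOUS = ("يوم",)
print(best_segmentation("جمعتهم", CANDIDATES, PREVIOUS))
```

With these made-up numbers the "Friday" reading wins after the word for "day"; a real system would estimate both tables from a segmented corpus and apply smoothing for unseen pairs.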

Results

The results of the My Approach algorithm:

• They used bigrams on 45 files with a total size of 29092 tokens.

• The final accuracy was 98.83%.

                                     Recall     Accuracy   Precision  F-measure
Result without statistical support   0.9877462  0.9802977  0.8617793  0.920473
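As a quick consistency check on the table, the reported F-measure matches the harmonic mean of the reported precision and recall. The sketch assumes the standard balanced F-measure; the slides do not state which variant was used.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f = f_measure(precision=0.8617793, recall=0.9877462)
print(round(f, 4))  # approximately 0.9205
```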

Arabic Language Characteristics

Methodology

• Root-based.
• Light Stemmer.
• N-Gram.
• Hybrid Method.

Light Stemmer

Morphemes removed by light stemmers.

Classification of the Light8 stemmer.
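The general idea of light stemming, stripping a common prefix and suffix without looking for the root, can be sketched as follows. The affix lists here are a small illustrative subset chosen for the example; they are an assumption and do not reproduce the Light8 affix table from the slides, which is not preserved in this transcript.

```python
# Illustrative affixes only (assumed, not the Light8 lists).
# Prefixes are ordered longest first so "وال" is tried before "ال".
PREFIXES = ["وال", "بال", "كال", "فال", "ال"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]

def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word

print(light_stem("الاستفسارات"))
print(light_stem("والكتاب"))
```

The minimum-length guard is a common safeguard against stripping letters that belong to a short stem; the threshold of three letters is also an assumption.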

N-gram

A statistical stemmer based on calculating a measure of similarity between a pair of words.

N-gram techniques, shown for the word "الاستفسارات" ("the inquiries"):

• Digram (N=2): "ال", "لا", "اس", "ست", "تف", "فس", "سا", "ار", "را", "ات"

• Trigram (N=3): "الا", "لاس", "است", "ستف", "تفس", "فسا", "سار", "ارا", "رات"
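The digram and trigram lists above can be generated with a sliding window over the characters of the word. This is a minimal sketch; the function name `char_ngrams` is chosen for illustration.

```python
def char_ngrams(word, n):
    """Return all overlapping character n-grams of word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

word = "الاستفسارات"
digrams = char_ngrams(word, 2)
trigrams = char_ngrams(word, 3)
print(len(digrams), digrams)
print(len(trigrams), trigrams)
```

A word of length L yields L - N + 1 n-grams, so this 11-letter word gives 10 digrams and 9 trigrams.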

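The string similarity measure is calculated using Dice's coefficient:

S = 2 C_wq / (A_w + B_q)

where A_w and B_q are the numbers of unique n-grams in the words w and q, and C_wq is the number of unique n-grams the two words share.

Example: for "الاستفسارات" and "استفسر", the digrams are

"ال", "لا", "اس", "ست", "تف", "فس", "سا", "ار", "را", "ات" (10 digrams)

"اس", "ست", "تف", "فس", "سر" (5 digrams)

The two words share 4 digrams, so the similarity would be:

S = (2 * 4) / (10 + 5) = 0.533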
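The worked example can be reproduced in a few lines. This is a sketch that assumes unique n-grams are counted, which is the usual convention for Dice's coefficient; the function names are chosen for illustration.

```python
def char_ngrams(word, n):
    """Set of unique overlapping character n-grams of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(w, q, n=2):
    """Dice's coefficient: 2 * shared n-grams / (n-grams in w + n-grams in q)."""
    a, b = char_ngrams(w, n), char_ngrams(q, n)
    return 2 * len(a & b) / (len(a) + len(b))

similarity = dice_similarity("الاستفسارات", "استفسر")
print(round(similarity, 3))  # 0.533
```

An n-gram stemmer would group words whose similarity exceeds a chosen threshold; the threshold value is not given in the slides.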

Hybrid Method

Incorporates three different techniques for Arabic stemming.

The hybrid algorithm starts by constructing a root file containing more than 9,000 valid Arabic roots.

Results

The hybrid algorithm was found to outperform the other stemming algorithms.

The obtained results illustrate that using the hybrid stemmer enhances the performance of some Arabic processing tasks.

In Arabic text categorization, the average accuracies are 74.41% for Khoja, 59.71% for light stemming, 48.17% for n-grams, and 82.33% for the hybrid stemmer.

Conclusion

Thanks
