Uncorrected Author Proof
Journal of Intelligent & Fuzzy Systems xx (20xx) x–xx, DOI: 10.3233/IFS-151923, IOS Press
Automatic keyphrase extraction for Arabic news documents based on KEA system

Rehab Duwairi a,∗ and Mona Hedaya b

a Department of Computer Information Systems, Jordan University of Science and Technology, Irbid, Jordan
b Department of Computer Science and Engineering, College of Engineering, Qatar University, Doha, Qatar
Abstract. A keyphrase is a sequence of words that play an important role in the identification of the topics that are embedded in a given document. Keyphrase extraction is a process which extracts such phrases. This has many important applications such as document indexing, document retrieval, search engines, and document summarization. This paper presents a framework for extracting keyphrases from Arabic news documents which is based on the KEA system. It relies on supervised learning, Naïve Bayes in particular, to extract keyphrases. Two probabilities are computed: the probability of being a keyphrase and the probability of not being a keyphrase. The final set of keyphrases is chosen from the set of phrases that have high probabilities of being keyphrases. The novel contributions of the current work are that it provides insights on keyphrase extraction for news documents written in Arabic. It also presents an annotated dataset that was used in the experimentation. Finally, it uses Naïve Bayes as a medium for extracting keyphrases.
Keywords: Keyphrase extraction, term indexing, document summarization, document classification, Arabic web content
1. Introduction
Keyphrase extraction is the process of assigning phrases that describe the main topic or important phrases of a document [8, 18, 31]. Keyphrase extraction is very important and has many applications in information retrieval, automatic indexing, text classification, text summarization and tagging, to name a few [7–10, 20]. Traditionally, this was done by a human annotator who would assign a set of keyphrases to a document. Manual annotation is tedious and time consuming and may not be practical these days with the huge volumes of online documents. Automatic or semiautomatic annotation of documents, on the other hand, employs a computer program to extract keyphrases that describe a document. In the latter case, a human may provide
∗Corresponding author. Rehab Duwairi, Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110, Jordan. Tel.: +962 2 7201000; Fax: +962 2 7201077; E-mail: [email protected].
certain guidelines or hints to the system. Keyphrases could be drawn from a fixed vocabulary (controlled indexing or term assignment); in this case keyphrases of a document may contain phrases that do not appear in the document. Free indexing, on the other hand, means that the annotators or systems are free to choose keyphrases that describe a document.
Keyphrase extraction has been seen as a classification problem for the English language [16, 27, 28, 31–33], and the Arabic language [9]. Other efforts view this problem as a ranking problem for English [14, 18, 34, 35] and for Arabic [7, 8]. Consequently, such efforts utilize ranking algorithms to extract features. The classification viewpoint for keyphrase extraction is a supervised machine learning method where classifiers such as Naïve Bayes classifiers [16, 33] or neural networks [32] are used. The classifiers should be trained first using annotated documents (i.e. documents whose keyphrases are known beforehand). The classifiers perform well when new documents have a similar domain
boundaries. In Arabic this is not an easy task, as Arabic does not support letter capitalization and does not follow strict punctuation rules, especially when dealing with informal text, where punctuation marks are usually absent. In English, however, a sentence begins with a capital letter and ends with a period [11].
Usually the first phase of keyphrase extraction algorithms deals with generating candidate keyphrases. This may mean generating n-grams, or linking to an ontology or a thesaurus. For a morphologically rich language like Arabic, the number of possible candidates may be huge and therefore pruning strategies must be employed. A common pruning strategy is to stem the candidate phrases. In Arabic, there is stemming and light stemming. In stemming, the words are reduced to their roots [2], while in light stemming [1] only common prefixes and suffixes are removed.
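As a rough illustration of the distinction, a light stemmer only peels off common affixes and never reduces a word to its root. The sketch below uses illustrative placeholder affix lists, not the actual lists from [1]:

```python
# A minimal light-stemming sketch in the spirit of [1]: strip at most one
# common prefix and one common suffix; the word is never reduced to a root.
# These affix lists are illustrative placeholders, not the lists from [1].
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ها", "ات", "ون", "ين", "ان", "ة", "ه"]

def light_stem(word: str, min_len: int = 3) -> str:
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]          # drop one prefix
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]         # drop one suffix
            break
    return word

print(light_stem("والمدرسة"))  # "and the school" -> "مدرس"
```

A full (root) stemmer would instead map the same word all the way down to a triliteral root, which is what the stemmers discussed in Section 4.2.1 do.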
Arabic also has three varieties: classical Arabic, modern standard Arabic (MSA), and dialectal Arabic. Arabic dialects vary from one Arab country to another. As we are dealing with news documents and not scientific papers, dialectal Arabic is present. This adversely affects the performance of the stemmers and consequently reduces the accuracy of keyphrase extraction.
Moreover, unlike English, which has many resources on the internet that contain formal articles on specific fields together with their keywords and keyphrases, Arabic content on the internet is modest and keyphrase-annotated Arabic text is almost non-existent. In fact, one major contribution of this work is to provide an annotated dataset suitable for keyphrase extraction.
4. A supervised learning framework for keyphrase extraction

4.1. KEA architecture
KEA is a supervised learning algorithm which consists of two stages, namely, a training phase and an extraction phase. In the training stage, KEA creates a model using the training data; these consist of documents with author-assigned keyphrases. During the extraction stage, by comparison, KEA uses the model created in the training phase and applies it to the testing data. The accuracy of KEA is calculated by comparing the author-assigned keyphrases with the KEA-assigned keyphrases for the testing data.
During the selection of candidate keyphrases phase, KEA first cleans the input documents. Secondly, KEA identifies candidate phrases, and lastly KEA case-folds and stems the candidate phrases. The following rules, which are used by KEA, were adapted to become suitable for Arabic:
– Punctuation marks, brackets and numbers are replaced with phrase boundaries.
– Apostrophes are removed from the documents.
– Hyphenated words are split into two words; i.e. hyphens are removed.
– Non-letter tokens are removed from the documents.
– Acronyms are handled as a single token.
After applying the above rules to documents, every document now consists of a sequence of words; each word consists of at least one letter.
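The cleaning rules above can be sketched as follows; the boundary marker "|" and the specific regular expression are our illustrative choices, not KEA's actual implementation:

```python
import re

def clean_document(text: str) -> list[str]:
    """Apply the adapted cleaning rules and return a token stream in
    which "|" marks a phrase boundary (illustrative sketch)."""
    text = text.replace("'", "").replace("’", "")   # remove apostrophes
    text = text.replace("-", " ")                    # split hyphenated words
    # punctuation marks, brackets and numbers become phrase boundaries
    text = re.sub(r"[0-9]+|[()\[\]{}]|[.,;:!?،؛؟]", " | ", text)
    # drop any remaining non-letter tokens
    return [t for t in text.split() if t == "|" or t.isalpha()]
```

After this step only letter-words and boundary markers remain, which is the input assumed by candidate phrase identification below.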
The following rules are used by KEA during candidate phrase identification and were modified to become suitable for Arabic:
– Candidate phrases cannot begin or end with a stopword.
– Candidate phrases can be proper names.
– Candidate phrases are limited to a maximum of three words.
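Under these rules, candidate identification can be sketched as below; the tiny stopword list is a placeholder for the full Arabic list described in Section 4.2.2:

```python
# Illustrative placeholder; the real stopword list is much larger.
STOPWORDS = {"في", "من", "على", "إلى", "و"}

def candidate_phrases(tokens: list[str], max_words: int = 3) -> list[str]:
    """All 1- to 3-word subsequences within a boundary-free stretch,
    excluding phrases that begin or end with a stopword."""
    # split the token stream on phrase boundaries first
    stretches, stretch = [], []
    for t in tokens:
        if t == "|":
            if stretch:
                stretches.append(stretch)
            stretch = []
        else:
            stretch.append(t)
    if stretch:
        stretches.append(stretch)
    candidates = []
    for words in stretches:
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                phrase = words[i:i + n]
                if phrase[0] in STOPWORDS or phrase[-1] in STOPWORDS:
                    continue  # rule 1: no stopword at either end
                candidates.append(" ".join(phrase))
    return candidates
```

Note that a stopword may still appear inside a three-word candidate, as only the first and last positions are constrained.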
In the case-folding and stemming task, candidate phrases are folded to lower-case letters so that all phrases will be case insensitive. Case-folding is applicable to English but not to Arabic. After that, all phrases are stemmed.
After candidate phrases are generated and preprocessed, KEA assigns weights to these candidates by calculating two values: TF×IDF and the First Occurrence of the phrase. TF×IDF combines the frequency of a given phrase in the current document (TF) with the frequency of the phrase in general use, i.e. in the global corpus of documents (IDF). TF×IDF for phrase P in document D is calculated using the formula shown in Equation (1):

TF × IDF = (freq(P, D) / size(D)) × -log2(df(P) / N)    (1) [33]
Where:
– freq(P, D) is the number of times P occurs in D.
– size(D) is the number of words in D.
– df(P) is the number of documents containing P in the global corpus.
– N is the size of the global corpus.
The First Occurrence weight is calculated as the number of words that precede the phrase's first appearance divided by the total number of words in that document.
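The two features follow Equation (1) and the first-occurrence definition directly; in this sketch the corpus statistics (df(P) and N) are assumed to be given as inputs:

```python
import math

def tfidf(freq_p_d: int, size_d: int, df_p: int, n: int) -> float:
    # Equation (1): (freq(P,D) / size(D)) x -log2(df(P) / N)
    return (freq_p_d / size_d) * -math.log2(df_p / n)

def first_occurrence(words_before: int, size_d: int) -> float:
    # words preceding the phrase's first appearance / total words in D
    return words_before / size_d
```

For example, a phrase occurring 3 times in a 100-word document, and appearing in 10 of 1000 corpus documents, gets TF×IDF = 0.03 × log2(100).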
KEA uses the Naïve Bayes classifier to build its classification model. This classifier is a probabilistic one that depends on Bayes' theorem with the assumption that features or attributes are independent. After calculating the weights of candidate phrases, KEA determines whether a given candidate phrase is qualified to be a keyphrase (P[yes]) or not qualified to be a keyphrase (P[no]). These two probabilities are calculated as shown in Equations (2) and (3), respectively:
P[yes] = Y / (Y + N) × P_TF×IDF[t | yes] × P_distance[d | yes]    (2) [33]

P[no] = N / (Y + N) × P_TF×IDF[t | no] × P_distance[d | no]    (3) [33]
Where:
– t is TF × IDF.
– d is the distance or First Occurrence value.
– Y is the number of positive phrases in the training documents.
– N is the number of negative phrases in the training documents.
The rank or importance of a candidate phrase is calculated using Equation (4):

Rank = P[yes] / (P[yes] + P[no])    (4) [33]
Candidate phrases are ranked according to the values calculated using Equation (4). If the ranks of two candidate phrases are equal, then their TF×IDF values are compared to break the tie; the candidate phrase with the higher TF×IDF is put first in the list. Finally, KEA prunes the candidate phrases that are subsets of other candidate phrases whose rank is higher.
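Equations (2)–(4) plus the tie-breaking rule can be sketched as follows; the discretized conditional probabilities are assumed to come from the trained model:

```python
def rank(y: int, n: int, p_t_yes: float, p_d_yes: float,
         p_t_no: float, p_d_no: float) -> float:
    """Equations (2)-(4): normalized score of being a keyphrase.
    y, n: counts of positive/negative training phrases."""
    p_yes = y / (y + n) * p_t_yes * p_d_yes   # Equation (2)
    p_no = n / (y + n) * p_t_no * p_d_no      # Equation (3)
    return p_yes / (p_yes + p_no)             # Equation (4)

def order_candidates(cands):
    """cands: (phrase, rank, tfidf) triples; ties on rank are
    broken by the higher TFxIDF value."""
    return sorted(cands, key=lambda c: (c[1], c[2]), reverse=True)
```

The subset pruning step (dropping candidates contained in higher-ranked candidates) would then run over this ordered list.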
4.2. Extending KEA to suit the extraction of Arabic keyphrases
The following subsections explain the extensions that we have applied to KEA to make it suitable for Arabic.
4.2.1. Replacing the KEA stemming algorithm
The original stemming algorithm of KEA was removed and replaced with a stemming algorithm suitable for extracting roots of Arabic words. The stemming algorithm reported in [2] was coded in Java and added to the code of KEA. This algorithm is a statistically based stemmer which extracts roots of words by assigning weights and orders to the letters of words. Table 1 shows the original weights assigned to the letters of the Arabic alphabet and Table 2, by comparison, shows the
Table 1: Arabic letters and their weights (adopted from [2])
Table 2: Arabic letters with their order (adopted from [2])

Position of letters (from right) | Order of letters (word length is odd) | Order of letters (word length is even)
orders assigned to the letters of the Arabic alphabet. N is the number of letters in a word, and 1 . . . N are the positions of letters in a word; 1 is the first letter and N is the last letter. The idea behind the weights, shown in Table 1, is that letters that appear as prefixes or suffixes are assigned weights higher than letters which do not appear as prefixes or suffixes. According to the work reported in [2], certain letters may appear as parts of prefixes or suffixes. These letters may also appear in the stem of the word. The original algorithm did not take this into consideration and therefore it produces errors in the generated roots. We have modified the weights and differentiated between the weights of a given letter when it appears as a prefix or suffix and when it appears as part of the stem. For example, the weight of a letter is set to 3.5 if it serves as part of the definite article and to 1 if it appears as part of the stem. After determining the orders and weights of letters, the algorithm then multiplies the orders by the
Table 5: Root extraction using stemmer 2 (adopted from [29])
weights to produce products that subsequently are used in extracting the root. The letters that correspond to the smallest three products constitute the root (read from right to left). Table 3 shows the algorithm in action by demonstrating how the root of an example word is extracted.
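Since the weight and order tables did not survive in our copy of the proof, the sketch below uses placeholder values; only the mechanism follows the text, namely, multiply each letter's weight by its positional order and keep the three letters with the smallest products as the root:

```python
# Placeholder weights: affix-capable letters are penalized, in the spirit
# of Table 1. Neither this letter set nor the weights are from [2].
AFFIX_LETTERS = set("اوينتمهل")
AFFIX_WEIGHT, STEM_WEIGHT = 3.5, 1.0

def extract_root(word: str) -> str:
    n = len(word)
    products = []
    for pos, letter in enumerate(word):
        weight = AFFIX_WEIGHT if letter in AFFIX_LETTERS else STEM_WEIGHT
        # placeholder order: letters near the word's edges (where affixes
        # live) get larger orders, standing in for Table 2
        order = n - min(pos, n - 1 - pos)
        products.append((weight * order, pos))
    # three smallest products give the root letters, kept in word order
    keep = sorted(pos for _, pos in sorted(products)[:3])
    return "".join(word[p] for p in keep)

print(extract_root("والمدرسة"))  # -> "درس" with these placeholder values
```

Even with made-up tables, the affix letters of a word such as والمدرسة receive large products and fall away, leaving the triliteral root.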
The stemmer reported in [29] extracts the roots of Arabic words by dividing the Arabic alphabet into six groups. This division is based on whether a letter can be part of the prefixes, the suffixes, the original stem, or any combination of these alternatives. These groups are described in Table 4.
As a few of the Arabic letters can appear as prefixes, suffixes and as part of the stems, the algorithm adds position information to the groups to differentiate between the case when a given letter is part of the original stem and when it is part of the suffixes or prefixes. The algorithm extracts the root of a given word by first encoding that word using the groups and position information. In several cases, the root is extracted directly after encoding. In a few cases, conflict resolution via transformation rules is required to extract the correct root. Table 5 demonstrates how stemmer 2 is applied to extract the root of an example word.
In Arabic KEA, we have alternated between the use of Stemmer 1 and Stemmer 2. Information about the performance of these two stemmers is provided in the experimentation and result analysis section.
4.2.2. Replacing the stopwords file

Stopwords are words that cannot be part of keyphrases. The stopwords list that is originally used with KEA was removed and a new list of Arabic stopwords was added. This list was compiled from free resources available on the internet.
Table 4: The division of the Arabic alphabet into groups for stemmer 2 [29]

– O: Original letters. These letters are surely part of the root.
– P: Prefix letters. These letters can be added only in the prefix part.
– S: Suffix letters. These letters can be added only in the suffix part (only Haa).
– PS: Prefix-Suffix letters. These letters can be added on either side of the word, i.e. in the suffix part or in the prefix part.
– U: Uncertain letters. These letters can be added anywhere in the word.
– A: Added letters. These letters are always considered additional letters (only Taa Marbuta).
4.2.3. Adjusting KEA features and their corresponding probabilities
As there are fundamental differences between the Arabic and English languages, we have also altered a few aspects of KEA that are related to keyphrase features and their corresponding weights. The following list describes these alterations:
– For a phrase to be a candidate phrase, it must occur in the document two or more times. This feature is called the number of occurrences. The weight of this feature is proportional to its value: the higher the number of occurrences, the higher the weight of this feature.
– The maximum number of words that may appear in a keyphrase was set, for the Arabic language and after extensive analysis and consultations with language experts, to three words. This is related to the structure of the sentence in Arabic.
– Arabic allows proper nouns to be part of keyphrases, especially when dealing with news articles.
– Case folding is not applicable to Arabic.
5. Experimentation and result analysis

5.1. Datasets
As KEA is a supervised learning algorithm, we had to prepare a dataset which consists of documents and their corresponding author/reader-assigned keywords. This is not an easy task, as Arabic content on the internet is modest. Furthermore, many documents which exist on the internet do not have author-assigned keyphrases. We have contacted the authors of KP-Miner [7, 8] and requested their dataset. They agreed to provide us with a copy of their dataset. This dataset is called the KP-Miner dataset.
In addition to the KP-Miner dataset, we decided to generate our own dataset by collecting articles published on the internet and manually assigning keyphrases to them. We focused on two topics, namely: leadership and management; and agriculture, environment and food. In total, we gathered 62 documents: 27 documents fall in the first category and 35 documents fall in the second category. Two raters were used to assign keyphrases to these documents. At the end of this phase, every document has two sets of keyphrases. The final set of keyphrases for a given document consists of the intersection of the keyphrase lists generated by the two raters.
Table 6: Training and testing dataset distributions

Dataset | Total Docs | Training Docs | Testing Docs | Avg. # of keyphrases
Dataset 1: Leadership and management | 27 | 18 | 9 | 7.8
Dataset 2: Agriculture, environment and food | 35 | 23 | 12 | 11.1
Dataset 3: KP-Miner | 100 | 70 | 30 | 7.9
5.2. Experimentation setup

In the current work, we have divided the documents into training and testing sets as shown in Table 6. The last column of Table 6 shows the average number of keyphrases assigned to documents.
5.2.1. Arabic-KEA overall performance when varying the number of extracted keyphrases
Arabic-KEA performance is measured using the average number of matched keyphrases between the KEA-generated phrases and the author-assigned phrases. For example, assume we have 10 documents, each assigned 5 author-assigned keyphrases, and Arabic-KEA extracts five keyphrases for each document. The average number of matches is calculated by summing the number of matched keyphrases over all documents and then dividing by the number of documents. Table 7 shows the average number of matches for the three datasets when varying the number of extracted keyphrases over 5, 7, 10, 15 and 20. As Table 7 demonstrates, the accuracy of Arabic-KEA increases as the number of extracted keyphrases is increased. This is understandable, as the likelihood of obtaining matches between author-assigned keyphrases and Arabic-KEA-generated keyphrases increases as the number of keyphrases increases. This behavior is common to the three datasets. The second value in each entry of Table 7 represents the standard deviation. The statistical stemmer, i.e. Stemmer 1, was used in this experiment.
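The measure described above can be sketched as follows, assuming exact matching between the extracted and author-assigned phrase sets:

```python
from statistics import mean, pstdev

def avg_matches(extracted_per_doc, gold_per_doc):
    """Average number of exact keyphrase matches per document, with the
    standard deviation reported alongside it as in Table 7."""
    matches = [len(set(e) & set(g))
               for e, g in zip(extracted_per_doc, gold_per_doc)]
    return mean(matches), pstdev(matches)

m, sd = avg_matches([["a", "b", "c"], ["a", "d"]],
                    [["a", "b"], ["d", "e"]])
print(m, sd)  # 1.5 0.5
```

In the actual system both phrase lists would be stemmed before comparison, so that surface variants of the same phrase count as a match.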
5.2.2. Arabic-KEA overall performance when varying the size of the training data
In this experiment, we aim to assess Arabic-KEA performance when focusing on the size of the training data. The theory here is that we can build more accurate classifiers when the number of training documents is large. Table 8 shows the accuracies of Arabic-KEA for Dataset 1 when varying the number of training documents between 1, 5, 10, and 18. As is clear from
Table 7: Performance of Arabic-KEA as the number of extracted keyphrases increases