Uncorrected Author Proof
Journal of Intelligent & Fuzzy Systems xx (20xx) x–xx, DOI: 10.3233/IFS-151923, IOS Press
Automatic keyphrase extraction for Arabic news documents based on KEA system

Rehab Duwairi a,∗ and Mona Hedaya b

a Department of Computer Information Systems, Jordan University of Science and Technology, Irbid, Jordan
b Department of Computer Science and Engineering, College of Engineering, Qatar University, Doha, Qatar
Abstract. A keyphrase is a sequence of words that play an important role in the identification of the topics that are embedded in a given document. Keyphrase extraction is a process which extracts such phrases. This has many important applications such as document indexing, document retrieval, search engines, and document summarization. This paper presents a framework for extracting keyphrases from Arabic news documents which is based on the KEA system. It relies on supervised learning, Naïve Bayes in particular, to extract keyphrases. Two probabilities are computed: the probability of being a keyphrase and the probability of not being a keyphrase. The final set of keyphrases is chosen from the set of phrases that have high probabilities of being keyphrases. The novel contributions of the current work are that it provides insights on keyphrase extraction for news documents written in Arabic. It also presents an annotated dataset that was used in the experimentation. Finally, it uses Naïve Bayes as a medium for extracting keyphrases.
Keywords: Keyphrase extraction, term indexing, document summarization, document classification, Arabic web content
1. Introduction
Keyphrase extraction is the process of assigning phrases that describe the main topic or important phrases of a document [8, 18, 31]. Keyphrase extraction is very important and has many applications in information retrieval, automatic indexing, text classification, text summarization and tagging, to name a few [7–10, 20]. Traditionally, this was done by a human annotator who would assign a set of keyphrases to a document. Manual annotation is tedious and time consuming and may not be practical these days with the huge volumes of online documents. Automatic or semiautomatic annotation of documents, on the other hand, employs a computer program to extract keyphrases that describe a document. In the latter case, a human may provide
∗Corresponding author. Rehab Duwairi, Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110, Jordan. Tel.: +962 2 7201000; Fax: +962 2 7201077; E-mail: [email protected].
certain guidelines or hints to the system. Keyphrases could be drawn from a fixed vocabulary (controlled indexing or term assignment); in this case keyphrases of a document may contain phrases that do not appear in the document. Free indexing, on the other hand, means that the annotators or systems are free to choose keyphrases that describe a document.
Keyphrase extraction has been seen as a classification problem for the English language [16, 27, 28, 31–33], and the Arabic language [9]. Other efforts view this problem as a ranking problem for English [14, 18, 34, 35] and for Arabic [7, 8]. Consequently, such efforts utilize ranking algorithms to extract features. The classification viewpoint for keyphrase extraction is a supervised machine learning method where classifiers such as Naïve Bayes classifiers [16, 33] or neural networks [32] are used. The classifiers should be trained first using annotated documents (i.e. documents whose keyphrases are known beforehand). The classifiers perform well when new documents have a similar domain
boundaries. In Arabic this is not an easy task, as Arabic does not support letter capitalization and does not follow strict punctuation rules, especially when dealing with informal text, where punctuation marks are usually absent. In English, however, a sentence begins with a capital letter and ends with a period [11].
Usually the first phase of keyphrase extraction algorithms deals with generating candidate keyphrases. This may mean generating n-grams, or linking to an ontology or a thesaurus. For a morphologically rich language like Arabic, the number of possible candidates may be huge and therefore pruning strategies must be employed. A common pruning strategy is to stem the candidate phrases. In Arabic, there is stemming and light stemming. In stemming, the words are reduced to their roots [2], while in light stemming [1] only common prefixes and suffixes are removed.
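As a rough illustration of the distinction, a light stemmer only peels off common affixes and never reduces a word to its root. The sketch below uses illustrative placeholder affix lists, not the actual lists from [1]:

```python
# A minimal light-stemming sketch in the spirit of [1]: strip at most one
# common prefix and one common suffix; the word is never reduced to a root.
# These affix lists are illustrative placeholders, not the lists from [1].
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ها", "ات", "ون", "ين", "ان", "ة", "ه"]

def light_stem(word: str, min_len: int = 3) -> str:
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]          # drop one prefix
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]         # drop one suffix
            break
    return word

print(light_stem("والمدرسة"))  # "and the school" -> "مدرس"
```

A full (root) stemmer would instead map the same word all the way down to a triliteral root, which is what the stemmers discussed in Section 4.2.1 do.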
Arabic also has three varieties: classical Arabic, modern standard Arabic (MSA), and dialectal Arabic. Arabic dialects vary from one Arab country to another. As we are dealing with news documents and not scientific papers, dialectal Arabic is present. This adversely affects the performance of the stemmers and consequently reduces the accuracy of keyphrase extraction.
Moreover, unlike English, which has many resources on the internet that contain formal articles on specific fields together with their keywords and keyphrases, Arabic content on the internet is modest and keyphrase-annotated Arabic text is almost non-existent. In fact, one major contribution of this work is to provide an annotated dataset suitable for keyphrase extraction.
4. A supervised learning framework for keyphrase extraction

4.1. KEA architecture
KEA is a supervised learning algorithm which consists of two stages, namely, a training phase and an extraction phase. In the training stage, KEA creates a model using the training data; these consist of documents with author-assigned keyphrases. During the extraction stage, by comparison, KEA uses the model created in the training phase and applies it to the testing data. The accuracy of KEA is calculated by comparing the author-assigned keyphrases with the KEA-assigned keyphrases for the testing data.
During the selection of candidate keyphrases phase, KEA first cleans the input documents. Secondly, KEA identifies candidate phrases, and lastly KEA case-folds and stems the candidate phrases. The following rules, which are used by KEA, were adapted to become suitable for Arabic:
– Punctuation marks, brackets and numbers are replaced with phrase boundaries.
– Apostrophes are removed from the documents.
– Hyphenated words are split into two words; i.e. hyphens are removed.
– Non-letter tokens are removed from the documents.
– Acronyms are handled as a single token.
After applying the above rules to documents, every document now consists of a sequence of words; each word consists of at least one letter.
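The cleaning rules above can be sketched as follows; the boundary marker "|" and the specific regular expression are our illustrative choices, not KEA's actual implementation:

```python
import re

def clean_document(text: str) -> list[str]:
    """Apply the adapted cleaning rules and return a token stream in
    which "|" marks a phrase boundary (illustrative sketch)."""
    text = text.replace("'", "").replace("’", "")   # remove apostrophes
    text = text.replace("-", " ")                    # split hyphenated words
    # punctuation marks, brackets and numbers become phrase boundaries
    text = re.sub(r"[0-9]+|[()\[\]{}]|[.,;:!?،؛؟]", " | ", text)
    # drop any remaining non-letter tokens
    return [t for t in text.split() if t == "|" or t.isalpha()]
```

After this step only letter-words and boundary markers remain, which is the input assumed by candidate phrase identification below.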
The following rules are used by KEA during candidate phrase identification and were modified to become suitable for Arabic:
– Candidate phrases cannot begin or end with a stopword.
– Candidate phrases can be proper names.
– Candidate phrases are limited to a maximum of three words.
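Under these rules, candidate identification can be sketched as below; the tiny stopword list is a placeholder for the full Arabic list described in Section 4.2.2:

```python
# Illustrative placeholder; the real stopword list is much larger.
STOPWORDS = {"في", "من", "على", "إلى", "و"}

def candidate_phrases(tokens: list[str], max_words: int = 3) -> list[str]:
    """All 1- to 3-word subsequences within a boundary-free stretch,
    excluding phrases that begin or end with a stopword."""
    # split the token stream on phrase boundaries first
    stretches, stretch = [], []
    for t in tokens:
        if t == "|":
            if stretch:
                stretches.append(stretch)
            stretch = []
        else:
            stretch.append(t)
    if stretch:
        stretches.append(stretch)
    candidates = []
    for words in stretches:
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                phrase = words[i:i + n]
                if phrase[0] in STOPWORDS or phrase[-1] in STOPWORDS:
                    continue  # rule 1: no stopword at either end
                candidates.append(" ".join(phrase))
    return candidates
```

Note that a stopword may still appear inside a three-word candidate, as only the first and last positions are constrained.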
In the case-folding and stemming task, candidate phrases are folded to lower-case letters so that all phrases will be case insensitive. Case-folding is applicable to English but not to Arabic. After that, all phrases are stemmed.
After candidate phrases are generated and preprocessed, KEA assigns weights to these candidates by calculating two values: TF×IDF and the First Occurrence of the phrase. TF×IDF combines the frequency of a given phrase in the current document (TF) with the frequency of the phrase in general use, i.e. in the global corpus of documents (IDF). TF×IDF for phrase P in document D is calculated using the formula shown in Equation (1):

TF × IDF = (freq(P, D) / size(D)) × -log2(df(P) / N)    (1) [33]
Where:
– freq(P, D) is the number of times P occurs in D.
– size(D) is the number of words in D.
– df(P) is the number of documents containing P in the global corpus.
– N is the size of the global corpus.
The First Occurrence weight is calculated as the number of words that precede the phrase's first appearance divided by the total number of words in that document.
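The two features follow Equation (1) and the first-occurrence definition directly; in this sketch the corpus statistics (df(P) and N) are assumed to be given as inputs:

```python
import math

def tfidf(freq_p_d: int, size_d: int, df_p: int, n: int) -> float:
    # Equation (1): (freq(P,D) / size(D)) x -log2(df(P) / N)
    return (freq_p_d / size_d) * -math.log2(df_p / n)

def first_occurrence(words_before: int, size_d: int) -> float:
    # words preceding the phrase's first appearance / total words in D
    return words_before / size_d
```

For example, a phrase occurring 3 times in a 100-word document, and appearing in 10 of 1000 corpus documents, gets TF×IDF = 0.03 × log2(100).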
KEA uses the Naïve Bayes classifier to build its classification model. This classifier is a probabilistic one that depends on Bayes' theorem with the assumption that features or attributes are independent. After calculating the weights of candidate phrases, KEA determines whether a given candidate phrase is qualified to be a keyphrase (P[yes]) or not qualified to be a keyphrase (P[no]). These two probabilities are calculated as shown in Equations (2) and (3), respectively:
P[yes] = Y / (Y + N) × P_TF×IDF[t | yes] × P_distance[d | yes]    (2) [33]

P[no] = N / (Y + N) × P_TF×IDF[t | no] × P_distance[d | no]    (3) [33]
Where:
– t is TF × IDF.
– d is the distance or First Occurrence value.
– Y is the number of positive phrases in the training documents.
– N is the number of negative phrases in the training documents.
The rank or importance of a candidate phrase is calculated using Equation (4):

Rank = P[yes] / (P[yes] + P[no])    (4) [33]
Candidate phrases are ranked according to the values calculated using Equation (4). If the ranks of two candidate phrases are equal, then their TF×IDF values are compared to break the tie; the candidate phrase with the higher TF×IDF is put first in the list. Finally, KEA prunes the candidate phrases that are subsets of other candidate phrases whose rank is higher.
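Equations (2)–(4) plus the tie-breaking rule can be sketched as follows; the discretized conditional probabilities are assumed to come from the trained model:

```python
def rank(y: int, n: int, p_t_yes: float, p_d_yes: float,
         p_t_no: float, p_d_no: float) -> float:
    """Equations (2)-(4): normalized score of being a keyphrase.
    y, n: counts of positive/negative training phrases."""
    p_yes = y / (y + n) * p_t_yes * p_d_yes   # Equation (2)
    p_no = n / (y + n) * p_t_no * p_d_no      # Equation (3)
    return p_yes / (p_yes + p_no)             # Equation (4)

def order_candidates(cands):
    """cands: (phrase, rank, tfidf) triples; ties on rank are
    broken by the higher TFxIDF value."""
    return sorted(cands, key=lambda c: (c[1], c[2]), reverse=True)
```

The subset pruning step (dropping candidates contained in higher-ranked candidates) would then run over this ordered list.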
4.2. Extending KEA to suit the extraction of Arabic keyphrases
The following subsections explain the extensions that we have applied to KEA to make it suitable for Arabic.
4.2.1. Replacing the KEA stemming algorithm
The original stemming algorithm of KEA was removed and replaced with a stemming algorithm suitable for extracting roots of Arabic words. The stemming algorithm reported in [2] was coded in Java and added to the code of KEA. This algorithm is a statistically based stemmer which extracts roots of words by assigning weights and orders to the letters of words. Table 1 shows the original weights assigned to the letters of the Arabic alphabet and Table 2, by comparison, shows the
Table 1: Arabic letters and their weights (adopted from [2])
Table 2: Arabic letters with their order (adopted from [2])

Position of letters (from right) | Order of letters (word length is odd) | Order of letters (word length is even)
orders assigned to the letters of the Arabic alphabet. N is the number of letters in a word, and 1 . . . N are the positions of letters in a word; 1 is the first letter and N is the last letter. The idea behind the weights, shown in Table 1, is that letters that appear as prefixes or suffixes are assigned weights higher than letters which do not appear as prefixes or suffixes. According to the work reported in [2], certain letters may appear as parts of prefixes or suffixes. These letters may also appear in the stem of the word. The original algorithm did not take this into consideration and therefore it produces errors in the generated roots. We have modified the weights and differentiated between the weights of a given letter when it appears as a prefix or suffix and when it appears as part of the stem. For example, the weight of a letter is set to 3.5 if it serves as part of the definite article and to 1 if it appears as part of the stem. After determining the orders and weights of letters, the algorithm then multiplies the orders by the
Table 5: Root extraction using stemmer 2 (adopted from [29])
weights to produce products that subsequently are used in extracting the root. The letters that correspond to the smallest three products constitute the root (read from right to left). Table 3 shows the algorithm in action by demonstrating how the root of an example word is extracted.
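Since the weight and order tables did not survive in our copy of the proof, the sketch below uses placeholder values; only the mechanism follows the text, namely, multiply each letter's weight by its positional order and keep the three letters with the smallest products as the root:

```python
# Placeholder weights: affix-capable letters are penalized, in the spirit
# of Table 1. Neither this letter set nor the weights are from [2].
AFFIX_LETTERS = set("اوينتمهل")
AFFIX_WEIGHT, STEM_WEIGHT = 3.5, 1.0

def extract_root(word: str) -> str:
    n = len(word)
    products = []
    for pos, letter in enumerate(word):
        weight = AFFIX_WEIGHT if letter in AFFIX_LETTERS else STEM_WEIGHT
        # placeholder order: letters near the word's edges (where affixes
        # live) get larger orders, standing in for Table 2
        order = n - min(pos, n - 1 - pos)
        products.append((weight * order, pos))
    # three smallest products give the root letters, kept in word order
    keep = sorted(pos for _, pos in sorted(products)[:3])
    return "".join(word[p] for p in keep)

print(extract_root("والمدرسة"))  # -> "درس" with these placeholder values
```

Even with made-up tables, the affix letters of a word such as والمدرسة receive large products and fall away, leaving the triliteral root.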
The stemmer reported in [29] extracts the roots of Arabic words by dividing the Arabic alphabet into six groups. This division is based on whether a letter can be part of the prefixes, the suffixes, the original stem, or any combination of these alternatives. These groups are described in Table 4.
As a few of the Arabic letters can appear as prefixes, suffixes and as part of the stems, the algorithm adds position information to the groups to differentiate between the case when a given letter is part of the original stem and when it is part of the suffixes or prefixes. The algorithm extracts the root of a given word by first encoding that word using the groups and position information. In several cases, the root is extracted directly after encoding. In a few cases, conflict resolution via transformation rules is required to extract the correct root. Table 5 demonstrates how stemmer 2 is applied to extract the root of an example word.
In Arabic KEA, we have alternated between the use of Stemmer 1 and Stemmer 2. Information about the performance of these two stemmers is provided in the experimentation and result analysis section.
4.2.2. Replacing the stopwords file

Stopwords are words that cannot be part of keyphrases. The stopwords list that is originally used with KEA was removed and a new list of Arabic stopwords was added. This list was compiled from free resources available on the internet.
Table 4: The division of the Arabic alphabet into groups for stemmer 2 [29]

– O: Original letters. These letters are surely part of the root.
– P: Prefix letters. These letters can be added only in the prefix part.
– S: Suffix letters. These letters can be added only in the suffix part (only Haa).
– PS: Prefix-Suffix letters. These letters can be added on either side of the word, i.e. in the suffix part or in the prefix part.
– U: Uncertain letters. These letters can be added anywhere in the word.
– A: Added letters. These letters are always considered additional letters (only Taa Marbuta).
4.2.3. Adjusting KEA features and their corresponding probabilities
As there are fundamental differences between the Arabic and English languages, we have also altered a few aspects of KEA that are related to keyphrase features and their corresponding weights. The following list describes these alterations:
– For a phrase to be a candidate phrase, it must occur in the document two or more times. This feature is called the number of occurrences. The weight of this feature is proportional to its value: the higher the number of occurrences, the higher the weight of this feature.
– The maximum number of words that may appear in a keyphrase was set, for the Arabic language and after extensive analysis and consultations with language experts, to three words. This is related to the structure of the sentence in Arabic.
– Arabic allows proper nouns to be part of keyphrases, especially when dealing with news articles.
– Case folding is not applicable to Arabic.
5. Experimentation and result analysis

5.1. Datasets
As KEA is a supervised learning algorithm, we had to prepare a dataset which consists of documents and their corresponding author/reader-assigned keywords. This is not an easy task, as Arabic content on the internet is modest. Furthermore, many documents which exist on the internet do not have author-assigned keyphrases. We have contacted the authors of KP-Miner [7, 8] and requested their dataset. They agreed to provide us with a copy of their dataset. This dataset is called the KP-Miner dataset.
In addition to the KP-Miner dataset, we decided to generate our own dataset by collecting articles published on the internet and manually assigning keyphrases to them. We focused on two topics, namely: leadership and management; and agriculture, environment and food. In total, we gathered 62 documents: 27 documents fall in the first category and 35 documents fall in the second category. Two raters were used to assign keyphrases to these documents. At the end of this phase, every document has two sets of keyphrases. The final set of keyphrases for a given document consists of the intersection of the keyphrase lists generated by the two raters.
Table 6: Training and testing dataset distributions

Dataset | Total Docs | Training Docs | Testing Docs | Avg. # of keyphrases
Dataset 1: Leadership and management | 27 | 18 | 9 | 7.8
Dataset 2: Agriculture, environment and food | 35 | 23 | 12 | 11.1
Dataset 3: KP-Miner | 100 | 70 | 30 | 7.9
5.2. Experimentation setup

In the current work, we have divided the documents into training and testing sets as shown in Table 6. The last column of Table 6 shows the average number of keyphrases assigned to documents.
5.2.1. Arabic-KEA overall performance when varying the number of extracted keyphrases
Arabic-KEA performance is measured using the average number of matched keyphrases between the KEA-generated phrases and the author-assigned phrases. For example, assume we have 10 documents, each assigned 5 author-assigned keyphrases, and Arabic-KEA extracts five keyphrases for each document. The average number of matches is calculated by summing the number of matched keyphrases over all documents and then dividing by the number of documents. Table 7 shows the average number of matches for the three datasets when varying the number of extracted keyphrases over 5, 7, 10, 15 and 20. As Table 7 demonstrates, the accuracy of Arabic-KEA increases as the number of extracted keyphrases is increased. This is understandable, as the likelihood of obtaining matches between author-assigned keyphrases and Arabic-KEA-generated keyphrases increases as the number of keyphrases increases. This behavior is common to the three datasets. The second value in each entry of Table 7 represents the standard deviation. The statistical stemmer, i.e. Stemmer 1, was used in this experiment.
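The measure described above can be sketched as follows, assuming exact matching between the extracted and author-assigned phrase sets:

```python
from statistics import mean, pstdev

def avg_matches(extracted_per_doc, gold_per_doc):
    """Average number of exact keyphrase matches per document, with the
    standard deviation reported alongside it as in Table 7."""
    matches = [len(set(e) & set(g))
               for e, g in zip(extracted_per_doc, gold_per_doc)]
    return mean(matches), pstdev(matches)

m, sd = avg_matches([["a", "b", "c"], ["a", "d"]],
                    [["a", "b"], ["d", "e"]])
print(m, sd)  # 1.5 0.5
```

In the actual system both phrase lists would be stemmed before comparison, so that surface variants of the same phrase count as a match.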
5.2.2. Arabic-KEA overall performance when varying the size of the training data
In this experiment, we aim to assess Arabic-KEA performance when focusing on the size of the training data. The theory here is that we can build more accurate classifiers when the number of training documents is large. Table 8 shows the accuracies of Arabic-KEA for Dataset 1 when varying the number of training documents between 1, 5, 10, and 18. As is clear from
Table 7: Performance of Arabic-KEA as the number of extracted keyphrases increases