A Repetition Based Measure for Verification of Text Collections and for Text Categorization

Dmitry V. Khmelev, Department of Mathematics, University of Toronto

William J. Teahan, School of Informatics, University of Wales, Bangor

Abstract

We suggest a way of locating duplicates and plagiarisms in a text collection using an R-measure.

The R-measure is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection.

We applied the technique to several standard text collections and found that they contained a significant number of duplicate and plagiarized documents.

A reformulation of the method leads to an algorithm that can be applied to supervised multi-class categorization.

On Reuters Corpus Volume 1 (RCV1), the results show that the method outperforms SVM at multi-class categorization.

1. Motivation

Text collections are used intensively in scientific research for many purposes, such as text categorization, text mining, natural language processing and information retrieval.

Every creator of a text collection is faced at some stage with the task of verifying its contents, e.g. are there duplicate documents?

The R-measure is defined as a number between 0 and 1 that characterizes the "repeatedness" of a document.

The R-measure can be computed effectively using the suffix array data structure.

The computation procedure can be extended to locate sets of duplicate or plagiarized documents, and to identify "non-typical" documents, such as those in a foreign language.

Another reformulation leads to an algorithm that can be applied to supervised classification.

The suggested techniques are character-based and do not require a priori knowledge about the representation of the documents.

2. R-measure

Suppose the collection consists of m documents, each document being a string $T_i = T_i[1 \dots |T_i|]$. The squared R-measure $R^2$ of a document $T$ of length $l = |T|$ is defined as:

$$R^2(T \mid T_1, \dots, T_m) = \frac{2}{l(l+1)} \sum_{i=1}^{l} Q(T[i \dots l] \mid T_1, \dots, T_m)$$

where $Q(S \mid T_1, \dots, T_m)$ is the length of the longest prefix of $S$ repeated in one of the documents $T_1, \dots, T_m$.

Example: let T = "catΔsatΔon", T_1 = "catΔsat" and T_2 = "theΔcatΔonΔaΔmat", where Δ stands for a space. Here $l = |T| = 10$, and

$$R^2(T \mid T_1, T_2) = \frac{2}{10 \times (10+1)} \bigl((7+6+5+4+3) + (5+4+3+2+1)\bigr) \approx 0.727$$

where the terms (7+6+5+4+3) come from "catΔsat" and the terms (5+4+3+2+1) from "atΔon".

An alternative is the L-measure:

$$L(T \mid T_1, \dots, T_m) = \frac{1}{l} \max_{i=1,\dots,l} Q(T[i \dots l] \mid T_1, \dots, T_m)$$

$$L(T \mid T_1, T_2) = \frac{1}{10} \max(7, 6, 5, 4, 3, 5, 4, 3, 2, 1) = 0.7$$
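The worked example can be checked numerically. The following is a minimal brute-force sketch, not the suffix-array procedure from the slides; the function names are illustrative assumptions, Δ is assumed to be an ordinary space, and T is assumed to be non-empty.

```python
def longest_repeated_prefix(s, docs):
    """Q(S | T_1..T_m): length of the longest prefix of s found in some doc."""
    k = 0
    # If a prefix of length k+1 occurs somewhere, every shorter prefix does too,
    # so growing k one character at a time finds the maximum.
    while k < len(s) and any(s[:k + 1] in d for d in docs):
        k += 1
    return k


def suffix_matches(t, docs):
    """Q value for every suffix T[i..l] of t, in order of increasing i."""
    return [longest_repeated_prefix(t[i:], docs) for i in range(len(t))]


def r_squared(t, docs):
    """Squared R-measure: normalized sum of the suffix match lengths."""
    l = len(t)
    return 2.0 * sum(suffix_matches(t, docs)) / (l * (l + 1))


def l_measure(t, docs):
    """L-measure: longest single match, normalized by the length of t."""
    return max(suffix_matches(t, docs)) / len(t)


t = "cat sat on"
docs = ["cat sat", "the cat on a mat"]
print(suffix_matches(t, docs))       # per-suffix match lengths
print(round(r_squared(t, docs), 3))  # squared R-measure
print(l_measure(t, docs))            # L-measure
```

Taking the square root of the squared value gives R itself, roughly 0.853 for this example.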

The R-measure seems the more "intuitive" measure, since substrings other than "catΔsat" are also repeated.

The R-measure can be computed effectively using a suffix array, a full-text indexing structure: let $S_C = T_0 \$ T_1 \$ \dots T_m \$$ and construct a suffix array for $S_C$.

The time complexity is $O(|S_C|)$ and the space complexity is $O(|S_C|) + O(m) + O(M)$, where $M = \max_{j=0,\dots,m} |T_j|$.
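To make the suffix-array idea concrete, here is a small sketch under stated assumptions. It builds the suffix array by plain sorting, so it is far slower than the linear bounds quoted above. It assumes that T_0 is the document being scored and that the separator "$" does not occur in any document. The function name and the nearest-neighbour scan are illustrative choices, not the authors' implementation. The sketch relies on a standard suffix-array property: the longest match with a suffix from another document is found at the nearest such suffix on either side in sorted order.

```python
def r_squared_suffix_array(t, docs, sep="$"):
    texts = [t] + list(docs)
    s = "".join(x + sep for x in texts)          # S_C = T_0 $ T_1 $ ... T_m $
    owner = []                                   # owner[p]: document holding position p
    for idx, x in enumerate(texts):
        owner.extend([idx] * (len(x) + 1))
    sa = sorted(range(len(s)), key=lambda p: s[p:])  # naive suffix array

    def lcp(p, q):
        """Length of the common prefix of the suffixes starting at p and q."""
        k = 0
        while p + k < len(s) and q + k < len(s) and s[p + k] == s[q + k]:
            k += 1
        return k

    n = len(sa)
    prev_other, next_other = [None] * n, [None] * n
    last = None
    for r in range(n):                           # nearest earlier suffix not in T_0
        prev_other[r] = last
        if owner[sa[r]] != 0:
            last = sa[r]
    last = None
    for r in range(n - 1, -1, -1):               # nearest later suffix not in T_0
        next_other[r] = last
        if owner[sa[r]] != 0:
            last = sa[r]

    l = len(t)
    total = 0
    for r in range(n):
        p = sa[r]
        if p >= l:                               # skip everything outside T_0
            continue
        best = 0
        for q in (prev_other[r], next_other[r]):
            if q is not None:
                best = max(best, lcp(p, q))
        total += min(best, l - p)                # a match may not run past the separator
    return 2.0 * total / (l * (l + 1))


print(round(r_squared_suffix_array("cat sat on", ["cat sat", "the cat on a mat"]), 3))
```

Capping each match at the remaining length of T_0 keeps it from running across a separator when two documents end with the same text.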

3. Applications of R-measure

Locating the duplicate sets

Supervised classification

Identifying foreign and/or non-typical documents

Supervised Classification

There exist two distinct types of classification: topic categorization (binary classifier) and multi-class categorization (multi-class classifier).

The R-measure can be used as follows. To select the correct class for the document $T$ among $m$ classes represented by texts $S_1, \dots, S_m$, the source is guessed using the following estimate:

$$\theta(T) = \operatorname{argmax}_i R(T \mid S_i)$$
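The estimate above can be sketched in a few lines. This is a brute-force illustration only: the helper names, the `classify` function and the two toy class texts are hypothetical and do not come from the slides.

```python
from math import sqrt


def longest_repeated_prefix(s, docs):
    """Length of the longest prefix of s that occurs in some document."""
    k = 0
    while k < len(s) and any(s[:k + 1] in d for d in docs):
        k += 1
    return k


def r_measure(t, docs):
    """R-measure of a non-empty text t against docs (brute force)."""
    l = len(t)
    total = sum(longest_repeated_prefix(t[i:], docs) for i in range(l))
    return sqrt(2.0 * total / (l * (l + 1)))


def classify(t, class_texts):
    """Return the label i maximizing R(T | S_i); ties go to the first label."""
    return max(class_texts, key=lambda label: r_measure(t, [class_texts[label]]))


# Toy class texts, made up for illustration.
class_texts = {
    "en": "the cat sat on the mat",
    "fr": "le chat est sur le tapis",
}
print(classify("the cat sat", class_texts))
```

Each class is represented by a single text, so in practice the training documents of a class would be concatenated into one S_i before scoring.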

Identifying foreign and/or non-typical documents

Non-typical documents can be located simply by examining those documents which have the lowest R-measures.

There is a predominant language associated with the collection as a whole, and we want to identify documents that are in a different language.

To do so, construct a sample text for each language, then proceed as in multi-class categorization.
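The first point, examining the documents with the lowest R-measures, might look like the sketch below. It scores each document against the rest of the collection by brute force. The function names and the four-document toy collection are assumptions for illustration, and the collection is assumed to be a list of non-empty strings.

```python
def longest_repeated_prefix(s, docs):
    """Length of the longest prefix of s that occurs in some document."""
    k = 0
    while k < len(s) and any(s[:k + 1] in d for d in docs):
        k += 1
    return k


def r_squared(t, docs):
    """Squared R-measure of t against docs (brute force)."""
    l = len(t)
    total = sum(longest_repeated_prefix(t[i:], docs) for i in range(l))
    return 2.0 * total / (l * (l + 1))


def rank_by_r(collection):
    """Sort (score, index) pairs so the least typical documents come first."""
    scores = []
    for i, doc in enumerate(collection):
        others = collection[:i] + collection[i + 1:]  # leave the document itself out
        scores.append((r_squared(doc, others), i))
    return sorted(scores)


# Toy collection, made up for illustration; the last entry shares almost nothing.
toy = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat sat on the mat",
    "zzzz qqqq xxxx",
]
for score, index in rank_by_r(toy):
    print(index, round(score, 3))
```

Ranking by the squared value gives the same order as ranking by R, since the square root is monotone.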

4. Experiments and Results

Analysis of various text collections

Multi-class categorization

4.1 Analysis of Various Text Collections

Reuters-21578

Contains 579 (2.7%) duplicate documents with R = 1.0.

Partitioned into a training/testing split called ModApte.

Two pairs of duplicates are shared between the training and testing splits (12495 and 18011, 14779 and 14913).

20Newsgroups

20news-19997 contains many duplicated messages.

20news-18828 was derived from 20news-19997 with the purpose of removing duplicates.

There are still 6 repeated documents (indistinguishable to classifiers that rely on word-based feature extraction).

Two documents differ only by an extra new-line character and are assigned to two different classes.
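A small sketch shows why such near-duplicates are invisible to word-based features. The two messages and the `word_features` helper are hypothetical. The helper stands in for a generic bag-of-words extractor, not for any particular classifier's preprocessing.

```python
import re


def word_features(text):
    """Sorted list of lowercase word tokens; whitespace and line breaks are discarded."""
    return sorted(re.findall(r"\w+", text.lower()))


# Made-up messages that differ by a single extra new-line character.
doc_a = "Subject: test\nhello world\n"
doc_b = "Subject: test\n\nhello world\n"

same_words = word_features(doc_a) == word_features(doc_b)
same_chars = doc_a == doc_b
print(same_words, same_chars)
```

A character-based measure sees the two strings as almost, but not exactly, identical, while the word-level view cannot tell them apart at all.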

Russian-416

Comprises 416 texts from 102 Russian writers of the 19th and 20th centuries.

Only two documents have R ≥ 0.1 and all other books have R < 0.01.

The level of R-measure values is much lower than for the other collections since the average document length is much larger (284800 characters).

Reuters Corpus Volume 1

A significant proportion of the articles in RCV1 are duplicated (3.4% or 27,754 articles) or extensively plagiarized (7.9% with R ≥ 0.5).

Percentage of matching fields (e.g. topics, headlines and dates) in duplicated documents: headlines 56.9% matched, dates 78.1%, countries 86.8%, industries 80.1%, topics 52.3%.

40% of the 50 lowest-scoring articles consist almost entirely of names and numbers (non-typical documents).

Identifying foreign-language documents in RCV1

Several class models were constructed from a small sampling (100-120 KB) of English, French, German, Dutch and Belgian text obtained from a popular search engine.

The method found 410 French articles, 6 Dutch, 5 Belgian and 1 German article.

This corresponds to 100% precision and an estimated 98% recall.

4.2 Multi-class Categorization

Authorship attribution using RCV1

Authorship attribution is an important application for IR, with benefits such as user modeling, determining context, efficient partitioning of the collection for distributed retrieval, and so on.

An experimental collection was formed from 1813 articles by the top 50 authors with respect to total size of articles.

A 10-fold split was performed on subsets satisfying the conditions R < 0.25 (873 articles), R < 0.5 (1161), R < 0.75 (1255), R < 1.0 (1316) and R ≤ 1.0 (1813).

Table 2: How well the R-measure performs at determining the top 50 authors in RCV1 compared to SVM and compression-based methods. Results are percentages of correct guesses at first rank.

5. Conclusion and Discussion

In this paper, we have highlighted the need for verifying a text collection, that is, ensuring the collection is both valid and consistent.

The R-measure is suggested for collection verification, and it can be computed effectively using the suffix array.

The implication for text categorization research is that a more careful approach than random selection is required to split the collection into training and testing sets.

There exists a class of problems which is more suitable for the R-measure and PPM approach than for SVM, such as the classification of texts coming from a single source, like the papers written by a single author or in a single language.

In cases where the classification depends on the presence of one or two words (as for Reuters-21578), SVM would be the preferred method.
