
A Communication Perspective on Automatic Text Categorization

    Marta Capdevila and Oscar W. Marquez Florez, Member, IEEE

Abstract—The basic concern of a Communication System is to transfer information from its source to a destination some distance away. Textual documents also deal with the transmission of information. In particular, from a text categorization system point of view, the information encoded by a document is the topic or category it belongs to. Following this initial intuition, a theoretical framework is developed in which Automatic Text Categorization (ATC) is studied under a Communication System perspective. Under this approach, the problematic reduction of the indexing feature space dimensionality has been tackled by a two-level supervised scheme, implemented by a noisy terms filtering and a subsequent redundant terms compression. Gaussian probabilistic categorizers have been revisited and adapted to the sparsity inherent in ATC. Experimental results on the 20 Newsgroups and Reuters-21578 collections validate the theoretical approaches. The noise filter and redundancy compressor allow an aggressive term vocabulary reduction (reduction factor greater than 0.99) with a minimal loss (lower than 3 percent) and, in some cases, a gain (greater than 4 percent) of final classification accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results similar to those obtained by state-of-the-art Multinomial Naive Bayes (MNB) and Support Vector Machines (SVMs).

Index Terms—Data communications, text processing, data compaction and compression, clustering, classifier design and evaluation, feature evaluation and selection.

    1 INTRODUCTION

A deep parallelism may be established between a Communication System and an Automatic Text Categorization (ATC) scheme, since both disciplines deal with the transmission of information and its reliable recovery. Establishing this novel simile allows us to tackle, from a founded communication-theoretic point of view, the over-dimensioned document representation space that is heavily redundant with respect to the classification task and typically turns problematic for many categorizers [1] in ATC.1 The main objective of our research has been to investigate how, and up to which extreme, the document representation space can be compressed, and what the effects of this compression are on final classification. The underlying idea is to take a first step toward an optimal encoding of the category, carried by the document vectorial representation, in view of both limiting the greedy use of resources caused by the high-dimensionality feature space and reducing the effects of overfitting.2

Additionally, our research also aims at showing how the document decoding (or classification task) can take advantage of common Gaussian assumptions made in the Communication System discipline but largely ignored in ATC.

This paper is structured as follows: In Section 2, ATC is briefly reviewed, and in Section 3, the communication system perspective is established. Sections 4 and 5 explain the theoretical basis of the proposed document sampling and document decoding. Sections 6, 7, 8, and 9 are dedicated to experimental results, and finally, Section 10 presents the conclusions.

    2 AUTOMATIC TEXT CATEGORIZATION

ATC is the task of assigning a text document to one or more predefined categories or classes,3 based on its textual content. It corresponds to a supervised (not fully automated) process, where categories are predefined by some external mechanism (normally human), which also provides a set of already labeled examples that form the training set. Classifiers are generated from those training examples, by induction, in the so-called learning phase. This forms the machine learning paradigm (as opposed to the knowledge-engineering approach) to ATC, which has been predominant since the exponential universalization of electronic textual information in the 1990s [1].

It is further generally assumed that categories are exclusive (also known as nonoverlapping), meaning that a document can only belong to a single category (single-label categorization), as this scenario has been shown [1] to be more general than the multilabel case.


The authors are with the Signal and Communications Processing Department, Telecommunication Engineering School, University of Vigo, Rua Maxwell s/n, Campus Universitario Lagoas-Marcosende, E-36310 Vigo, Spain. E-mail: {martacap, omarquez}@gts.tsc.uvigo.es.

Manuscript received 30 July 2008; revised 21 Nov. 2008; accepted 18 Dec. 2008; published online 8 Jan. 2009. Recommended for acceptance by S. Zhang. For information on obtaining reprints of this article, please send e-mail to [email protected], and reference IEEECS Log Number TKDE-2008-07-0394. Digital Object Identifier no. 10.1109/TKDE.2009.22.

1. A sound exception can be established for the state-of-the-art SVM categorizer, which arguably [2] is well adapted to the typical high-dimensionality representation space of ATC and which benefits from the improved performance of recent training algorithms [3].

2. In general terms, the problem of overfitting results from the characterization of reality with too many parameters, which makes the modeling too specific and poorly generalizable.

3. The categorization and classification designations are used indiscriminately in this text, both indicating the described supervised process.



    2.1 Document Vectorial Representation

The first step toward any compact document representation is the definition of the indexing features. The indexing features, also called terms, are the minimal meaningful constitutive units (a common choice is to use words). The set of different terms that appear in the collection of training documents forms the vocabulary or alphabet of terms. Once the alphabet is chosen, the text document can be represented in the term space. In this indexing process, the sequentiality or order of terms in the text is commonly lost. This is known as the bag-of-words approach [1].

The problem is that the indexing vocabulary typically reaches tens or hundreds of thousands of terms. Working in such a high-dimensionality space commonly turns problematic. This is why, before initiating any classification task, a filtering designed to reduce the term-space dimensionality is usually applied. There are basically two approaches to dimensionality reduction: 1) term selection, where a subset of features is selected out of the original set, and 2) term extraction, where the chosen features are obtained by combination of the original features. In the latter approach, Distributional Clustering [4], [5], [6], [7] is a supervised clustering technique that has been shown to be very effective at reducing the document indexing space with residual loss in categorization accuracy.

    2.2 Common Categorizers

In the following, we will shortly review two of the state-of-the-art classifiers used in ATC, which will be extensively referenced in our experiments.

    2.2.1 Multinomial Naive Bayes (MNB)

MNB is a probabilistic categorizer that assumes a document is a sequence of terms, each of them randomly chosen from the term vocabulary, independently of the rest of the term events in the document. Despite its oversimplified Naive Bayes basis, MNB achieves good performance in practice [8].

    2.2.2 Support Vector Machines (SVMs)

SVM is a binary classifier that attempts to find, among all the surfaces that separate positive from negative training examples, the decision surface that has the widest possible margin (the margin being defined as the smallest distance from the positive and negative examples to the decision surface). SVM is particularly well adapted to ATC [2] and stands as one of the best-performing categorizers [9].
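The experiments in this paper use the Weka implementations of these two categorizers (see Section 6.3). Purely as an illustration, and not the authors' setup, comparable bag-of-words baselines could be assembled with scikit-learn:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    train_docs = ["wheat prices rose", "the team won the final"]   # toy data
    train_labels = ["grain", "sport"]

    # Multinomial Naive Bayes over raw term counts (bag of words)
    mnb = make_pipeline(CountVectorizer(), MultinomialNB())
    mnb.fit(train_docs, train_labels)

    # Linear SVM; the binary margins are combined one-vs-rest for multiclass data
    svm = make_pipeline(CountVectorizer(), LinearSVC())
    svm.fit(train_docs, train_labels)

    print(mnb.predict(["wheat harvest"]), svm.predict(["cup final"]))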

    3 A COMMUNICATION INTERPRETATION ON ATC

A communication system [10] has the basic function of transferring information (i.e., a message) from a source to a destination. There are three essential parts in any communication system: the encoder/transmitter, the transmission channel, and the receiver/decoder. The encoder/transmitter processes the source message into the encoded and transmitted messages. The transmission channel is the medium that bridges the distance from source to destination. Every channel introduces some degree of undesirable effects such as attenuation, noise, interference, and distortion. The receiver/decoder processes the received message in order to deliver it to the destination. A simplified model of a classical digital communication system is represented in Fig. 1.

In its raw form, a text document is a string of characters. Typically in ATC, a bag-of-words approach is adopted, which assumes that the document is an order-ignored sequence of words4 that can be represented vectorially. It is further assumed that the vocabulary used by a given document depends on the category or topic it belongs to. The ATC scheme can be modeled by a communication system, as shown in Fig. 2.

    3.1 The Encoder/Transmitter Model

The generation of a document is determined by a Category encoder, which is a random selector of words, modulated by the category C (i.e., the selection of words is a random event, different from and characteristic of each category). For each category input $c_i$, the Category encoder is characterized by a distinct alphabet $T_i$ (the subset of T that contains the words used by the documents of $c_i$) and the conditional probabilities of each element of this alphabet, $\{p(t_1|c_i), p(t_2|c_i), \ldots, p(t_j|c_i), \ldots\}$. Fig. 3 illustrates different example alphabets $T_i$.

Actually, the Category encoder generates a sequence of outcomes5 that are the actual words that (partially) form the document.


    Fig. 1. A classical digital communication system simplified model.

    Fig. 2. ATC modeled by a communication system.

4. Words and terms are used indiscriminately in this text to designate the meaningful units of language.

5. The outcomes may be considered independent, as in the Naive Bayes approaches.


In the communication nomenclature, each word could be a symbol. Note that the length of the sequence is random (i.e., not fixed, since some documents are short and others long, randomly) but presumably category independent. And, finally, let us indicate that the input $c_i$ is itself the value of the outcome of another random event, the category C, characterized by an alphabet $C = \{c_1, c_2, \ldots, c_{|C|}\}$6 with probabilities $p_C = \{p(c_1), p(c_2), \ldots, p(c_{|C|})\}$.

The Document builder measures the degree of contribution7 of each word in the sequence generated by the Category encoder and establishes a $|T|$-dimensional vector or codeword with all the obtained weights. This process is commonly known in ATC as the indexing of the document. The final vectorial representation of the document, noted E in our scheme, constitutes the vector signal that is transmitted over the channel.

The miscouplings between the ideally generated documents and the actual documents (i.e., introduction of lexicon borrowed from the vocabulary of other categories, word repetitions, etc.) are modeled by the undesirable effects introduced by the channel, namely Noise, Intersymbol interference, and Channel distortion. Noise refers to random and unpredictable variations superimposed on the transmitted signal. Channel distortion is a perturbation of the signal due to the distorting response of the channel. And, finally, Intersymbol interference is a form of distortion caused by the previously transmitted symbols.

The received signal d is the actual document that we manipulate. The role of the receiver/decoder is to decode the category out of the received document d (i.e., to perform the document classification).

    3.2 The Receiver/Decoder Model

Now, the problem is that, typically in ATC, the alphabet of symbols T has an extremely high dimensionality. Many words are semantically equivalent or category related (i.e., the alphabets $T_i$ are extensive) and a large number of others are not discriminative of any category in particular (i.e., the intersection between the sets $T_i$ is large); see Fig. 3 for a visual illustration of this. The alphabet of symbols T may be said to carry a high degree of redundancy and noise.

From a communication perspective, working with such a redundant and noisy alphabet of symbols T implies a suboptimal encoding of the category C that generates over-dimensioned codewords. Apart from representing a waste of resources in terms of channel capacity8 and processing economy, the over-dimensionality problem is not insignificant, since it happens to affect the Category decisor task. From this point of view, we may wish to eliminate noise and redundancy by filtering and compressing (ideally, under a lossless compression) the alphabet of symbols T.

This is what the blocks Prefilter, Noisy terms filter, and Redundant terms compressor in Fig. 4 basically aim at. More precisely, the Prefilter is a low-level filter which typically includes removal of stopwords (i.e., articles, conjunctions, etc.), infrequent words, nonalphabetical words, etc. The Noisy terms filter eliminates words that are noninformative (i.e., nondiscriminative) of the category variable. And, finally, the Redundant terms compressor clusters terms that convey similar information over the category.

The resulting alphabet of symbols K is a lower dimensional set of less noisy and less redundant features (i.e., combinations of terms) that provides a more optimal sampling space for documents. In fact, the space of symbols is transformed so that the final documents, seen as codewords, are characterized by 1) being as short as possible, 2) having as little noise as possible, and 3) containing as much information as possible.

Back to Fig. 2, the document filter applies the prefiltering and noisy terms filtering just described to the received document d. The document is thus finally represented in a lower dimensional space T', resulting in d'.

The document sampler projects the filtered document d' into the new space of features K, producing the document representation d''. This projection process implies a new document quantization. In the case of a TF indexing, quantization simply means adding up the weights of the original words of a same cluster.

The category decisor has the task of decoding the document. It is the actual supervised classifier, which has previously undergone a learning phase that, for simplicity reasons, is not reflected in Fig. 2.

    3.3 Further Remarks

Several interesting issues can be drawn from the communication analysis performed upon ATC. The first one is that the document may be seen as a vector signal that encodes the category source information. The signal space is formed by the alphabet of terms T defined by the document collection.


    Fig. 3. Venn diagram of the alphabets of terms of different categories.

    Fig. 4. Supervised term alphabet T reduction process.

6. |C| denotes the size or cardinality of C.

7. Classically, the contribution of each word can be measured by either a binary, a term frequency (TF), or a term frequency-inverse document frequency (TF-IDF) weighting scheme, among others [1].

8. The concept of channel capacity can be assimilated to the concept of storage capacity.


The main issue of this encoding scheme is its high suboptimality, both because of the high dimensionality of the signal space ($|T| \gg |C|$) and because of the fact (directly related to the latter) that a same category may be encoded by many different codewords.

Our work directly tackles the document-space optimization. Inspired by the established communication simile, we ideally pursue the goal of extracting (to the extent it may be practically possible) an orthonormal basis for the signal space. Under this inspiration, we have designed a term alphabet transform (i.e., filtering and compression) that improves the optimality of the category encoding conveyed by the resampled documents. This should result in a better utilization of storage and processing resources as well as, hopefully, facilitate the document decoding or classification task.

Note that the framework of the problem, as initially established, is not to perform a coding design and then try to adapt the document representations to it. Instead, we have opted to optimize, to the extent possible, the signal space in the hope of improving classification.

The document resampling model is further developed in Section 4, while Section 5 deepens the document decoding aspects.

    4 DOCUMENT SAMPLING

    4.1 Document-Space Analysis

In their vectorial representation, documents are represented in the space of terms, and thus, following the communication simile established in Section 3, terms have to be thought of as the basis functions of the document space.

The crux of the document-space analysis is how to represent terms by means of a function. What information do we have about terms? How could we characterize them? The answer to these questions resides in the fact that, in a supervised scheme such as ATC, we are given a set of prelabeled documents. Based on the latter, terms can be expressed in terms of the information they convey on categories. Once terms have been properly characterized by a function, the notion of orthogonality and redundancy between them (as basis functions of the document space) can be pursued.

    4.2 Distributional Representation of Terms

As previously expressed in Section 3, the generation of a text document may be modeled as a random selection of terms dependent on the category C. In other words, the probability of appearance of a term in a document depends on the category the document belongs to. Terms can be understood as the outcomes of a random event (r.e.) T that is mutually dependent with the category r.e. C. The term is the observable data, while the category is the unknown parameter.

Thereby, any term $t_j$ can be characterized as a distributional function $f_{t_j}$ over the space of categories C:

$$f_{t_j} : C \to [0, 1], \qquad c \mapsto f_{t_j}(c). \qquad (1)$$

An intuitive alternative (but not the only one) for this distributional function is the conditional probability mass function (PMF) $p_{C|t_j}$ [4]. Note that the conditional probabilities $p_{C|t_j}(c_i|t_j)$ are not known, but in a supervised scheme such as ATC, they can be approximated from the training set of documents $D^{train}$:

$$p(c_i|t_j) \approx \frac{\#(t_j, D_i^{train})}{\#(t_j, D^{train})}, \qquad (2)$$

where $\#(t_j, D_i^{train})$ is the number of times the term $t_j$ appears in all training documents belonging to category $c_i$, and $\#(t_j, D^{train})$ is the number of times the term $t_j$ appears in the whole training collection.
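A small sketch of the estimate in (2), assuming the labeled training documents are already tokenized (the data structures and names are illustrative):

    from collections import defaultdict

    def term_distributions(labeled_docs):
        """Return, for each term t_j, the estimated PMF p(c_i | t_j) of Eq. (2)."""
        count = defaultdict(lambda: defaultdict(int))   # count[t][c] = #(t, D_i^train)
        total = defaultdict(int)                        # total[t]    = #(t, D^train)
        for words, category in labeled_docs:
            for t in words:
                count[t][category] += 1
                total[t] += 1
        return {t: {c: n / total[t] for c, n in per_cat.items()}
                for t, per_cat in count.items()}

    docs = [(["wheat", "crop", "wheat"], "grain"),
            (["match", "crop"], "sport")]
    print(term_distributions(docs)["crop"])   # a term split evenly over two categories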

4.3 Alphabet of Symbols T Filtering and Compression

As announced in Section 3, we can envisage reducing the symbol alphabet T in two distinct directions:

1. Noisy terms filtering. Terms that have a flat function $f_{t_j}$, that is,

$$f_{t_j}(c_r) \approx f_{t_j}(c_s), \quad \forall r, s, \qquad (3)$$

do not convey information on the target category. These terms, from a communication perspective, are noisy and should be eliminated. A noise filter should discriminate between informative terms and noninformative or noisy terms. A dispersion measure needs to be defined, as well as a selection threshold upon it. Note that the threshold will have to be set experimentally.

2. Redundant terms compression. Terms that convey similar information on the target category random event, that is,

$$f_{t_i}(c_r) \approx f_{t_j}(c_r), \quad \forall r, \qquad (4)$$

are redundant. An (ideally) lossless data compression scheme reduces redundancy by clustering terms with similar distributional representations. A similarity measure needs to be defined.

    4.4 Dispersion and Similarity Measures

As just seen, to perform the document compression, we need to establish measures of the information conveyed by a term and of the redundancy between terms. But which metrics should we use? The answer is not straightforward. We may use distinct dispersion and similarity measures depending on different interpretations of what the distributional term representation $f_{t_j}$ is.

    4.4.1 The PMF Interpretation

Distributional functions $f_{t_j}$ are commonly PMFs. Information Theory (IT) [11] provides useful measures to quantify the information conveyed by random events (e.g., entropy) and the similarity between PMFs (e.g., the Kullback-Leibler and Jensen-Shannon divergences).

The IT approach has commonly been adopted by other works on Distributional Clustering [4], [5], [6]. We have chosen to follow a new and unexplored direction, the discrete signal interpretation, which is rooted in Communication and Signal Processing concepts, coherently with the proposed general framework.



    4.4.2 The Discrete Signal Interpretation

A discrete signal is a set of |C| measurements of an unknown but latent random variable (r.v.). The rationale behind a discrete signal interpretation of $f_{t_j}$ is that we are interested in analyzing the general shape of the distributions. By modeling those distributions with a latent random variable, small differences between distributions are absorbed by the random nature of the signal.

• A dispersion measure: Sample variance. The variance is a measure of the statistical dispersion of an r.v. For a given discrete r.v. X with PMF $p_X$ defined on $\mathcal{X}$, it is expressed as:

$$\sigma_X^2 \triangleq E\big[(X - \mu_X)^2\big] = E[X^2] - \mu_X^2, \qquad (5)$$

where $E[g(X)] = \sum_{x \in \mathcal{X}} g(x)\,p_X(x)$ denotes the expectation operator and $\mu_X = E[X]$ the expected mean of X.

Now, a discrete signal is a set of measurements of an r.v. The underlying PMF is unknown, and thus, the expectation operator E cannot be computed. The variance of a discrete signal $f_{t_j}$, also called the sample variance, is thus obtained by substituting the expectation operator in (5) by the arithmetic mean as follows:9

$$s_{f_{t_j}}^2 \triangleq \frac{1}{|C|} \sum_{c_i \in C} \big(f_{t_j}(c_i) - m_{f_{t_j}}\big)^2, \qquad (6)$$

where $m_{f_{t_j}} = \frac{1}{|C|} \sum_{c_i \in C} f_{t_j}(c_i)$ is the arithmetic mean of $f_{t_j}$.

The sample variance is bounded in the interval $\big[0, \frac{|C|-1}{|C|^2}\big]$. Noninformative terms are those with low dispersion among categories (i.e., $f_{t_j}$ with a flat distribution). They are thus characterized by their low variance.

• A similarity measure: Sample correlation coefficient. Correlation refers to the departure of two variables from independence. Pearson's correlation coefficient in (7) is the most widely used measure of the relationship between two random variables X and Y. It evaluates the degree to which both functions are linearly associated (it equals 0 if they are statistically independent and, at the other extreme, 1 in absolute value if they are linearly dependent):

$$\rho_{X,Y} \triangleq \frac{1}{\sigma_X \sigma_Y} E\big[(X - \mu_X)(Y - \mu_Y)\big]. \qquad (7)$$

As with the variance, in the case of two discrete signals $f_{t_j}$ and $f_{t_k}$, the correlation coefficient is expressed by its sample version:

$$r_{f_{t_j} f_{t_k}} \triangleq \frac{\sum_{i=1}^{|C|} \big(f_{t_j}(c_i) - m_{f_{t_j}}\big)\big(f_{t_k}(c_i) - m_{f_{t_k}}\big)}{|C|\, s_{f_{t_j}} s_{f_{t_k}}}. \qquad (8)$$
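Both measures are straightforward to compute from the |C|-dimensional term signals; a small sketch, using the biased estimates adopted in this paper (function names are ours):

    import math

    def sample_variance(f):
        """Biased sample variance of a term's distributional signal, Eq. (6)."""
        m = sum(f) / len(f)
        return sum((x - m) ** 2 for x in f) / len(f)

    def sample_correlation(f_j, f_k):
        """Sample correlation coefficient between two term signals, Eq. (8)."""
        n = len(f_j)
        m_j, m_k = sum(f_j) / n, sum(f_k) / n
        s_j, s_k = math.sqrt(sample_variance(f_j)), math.sqrt(sample_variance(f_k))
        cov = sum((a - m_j) * (b - m_k) for a, b in zip(f_j, f_k)) / n
        return cov / (s_j * s_k)

    # A flat (noisy) term has zero variance; two redundant terms correlate highly.
    print(sample_variance([0.25, 0.25, 0.25, 0.25]))
    print(sample_correlation([0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]))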

    4.5 Clustering Algorithms

Now let us turn to the clustering algorithms used for the redundancy compression task. Our main approach in this research has been to adopt an agglomerative term clustering approach, disregarding the efficiency gains apparently offered by divisive clustering methods, as pointed out by Dhillon [6]. The reason is that our interest has been focused on studying the influence of the number of clusters built, and its optimal value, rather than on algorithmic efficiency aspects.

    4.5.1 Initial Agglomerative Approach

The first clustering implementation was inspired by the agglomerative hard clustering10 algorithm proposed by Baker [4]. The algorithm is simple and scales well to large vocabulary sizes since, instead of comparing the similarity of all pairs of words, it restricts the comparison to a smaller subset of size M (M being the final number of clusters desired). After the data set preprocessing and noisy terms filtering, the word vocabulary is sorted in decreasing variance order. The algorithm then initializes the M clusters to the first M words of the sorted list. It continues by iteratively comparing the M clusters and merging the closest ones. Empty clusters are filled with the next words in the sorted list.

When merging occurs, the distribution of the new cluster becomes the weighted average of the distributions of its constituent words. For instance, when merging terms $t_j$ and $t_k$ into a same cluster, the resulting distribution function is:

$$f_{t_j \vee t_k}(c) = \frac{p(t_j)}{p(t_j) + p(t_k)} f_{t_j}(c) + \frac{p(t_k)}{p(t_j) + p(t_k)} f_{t_k}(c). \qquad (9)$$

This algorithm has been named Static window Hard clustering (SH clustering). Static window refers to the fixed M-dimensional window it is based on, while Hard clustering denotes the nonoverlapping nature of the clustering.
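A condensed sketch of the SH procedure as described above; the data layout, the use of one minus the sample correlation as the intercluster distance, and the handling of ties are our assumptions, not the authors' implementation:

    def sh_clustering(sorted_terms, M, distance):
        """Static-window Hard clustering (sketch; assumes M >= 2).
        sorted_terms: (term, prior, pmf) triples in decreasing-variance order.
        distance(f, g): dissimilarity between two cluster distributions.
        Returns M clusters as (members, prior, pmf) triples."""

        def merge(a, b):
            # Eq. (9): the merged distribution is the prior-weighted average
            p = a[1] + b[1]
            pmf = [(a[1] * x + b[1] * y) / p for x, y in zip(a[2], b[2])]
            return (a[0] + b[0], p, pmf)

        clusters = [([t], p, f) for t, p, f in sorted_terms[:M]]
        rest = list(sorted_terms[M:])
        while rest:
            # only the M clusters of the window are compared (keeps the cost low)
            i, j = min(((i, j) for i in range(M) for j in range(i + 1, M)),
                       key=lambda ij: distance(clusters[ij[0]][2], clusters[ij[1]][2]))
            clusters[i] = merge(clusters[i], clusters[j])
            # the freed slot is refilled with the next word of the sorted list
            t, p, f = rest.pop(0)
            clusters[j] = ([t], p, f)
        return clusters

With distance defined, for instance, as 1 - sample_correlation(f, g) from the earlier sketch, the restriction of comparisons to the M-cluster window is what keeps the procedure scalable to large vocabularies.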

    4.5.2 Dynamic Window Approach

A further agglomerative clustering algorithm has been implemented, where the fixed M-dimensional Static window has been replaced by a Dynamic window scheme. The rationale of this procedure is to avoid forcing the merging of distant clusters due to the fixed M-dimensional size of the working window, especially when M is low.

The algorithm proceeds as the former Hard clustering procedure, except that the initial window size is set to an input value W, in general different from M. The window is iteratively expanded whenever no pair of clusters with an intercluster distance lower than a certain threshold exists. In a subsequent step, when all vocabulary terms have been assigned to a cluster, the window is progressively contracted until its dimension reaches the desired number M of clusters. Toward this objective, the intercluster distance threshold is progressively incremented following an arithmetic progression (whose common difference has to be set as an input parameter). At each step, the merging of close clusters is performed.


9. In this document, biased estimates are adopted. Alternatively, unbiased estimates could be used by substituting |C| with |C| - 1.

10. Hard clustering assumes that each term can only belong to one cluster. Clusters do not overlap; they produce a partition (disjoint subsets) of the terms.


    4.5.3 Soft Clustering Approach

A soft clustering11 approach has been designed in order to accommodate different semantic contexts for a same term. The implementation of a soft clustering model is notably more computationally expensive than a hard clustering scheme, since it demands an iterative procedure where the degree of proximity of each pair of clusters is analyzed.

    4.5.4 Clustering Algorithms Implemented

From the combination of the different approaches exposed, four agglomerative clustering algorithms have been implemented, namely: Static Window Hard clustering (SH), Static Window Soft clustering (SS), Dynamic Window Hard clustering (DH), and Dynamic Window Soft clustering (DS).

    4.6 Document Quantization

Once the clustering algorithm has ended, we assume each of the resulting term clusters to be a symbol of the new alphabet K (i.e., an indexing dimension for the document sampling). The indexing or quantization of the document can simply be done with a term frequency (TF) weighting scheme such as:

$$d_{ji} = TF(j, i) = \#(k_j, d_i), \qquad (10)$$

where $\#(k_j, d_i)$ is the number of times the terms of cluster $k_j$ appear in $d_i$. In this simple indexing, the classical inverse document frequency (IDF) factor is ignored because it has already been, to a certain extent, taken into account in the noise filtering phase (i.e., terms that appear uniformly in documents of all categories have been identified as noisy terms and, thus, eliminated).

We may ignore the document length, as we assume it is independent of the category. In order to normalize the resulting document vectors, whenever necessary, we have adopted a cosine normalization.
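A sketch of this quantization step, Eq. (10), followed by the cosine normalization (the data layout and names are assumptions):

    import math

    def quantize(term_counts, cluster_of):
        """Project a filtered document onto the cluster space, Eq. (10):
        sum the TF weights of all original terms belonging to a same cluster."""
        weights = {}
        for term, tf in term_counts.items():
            k = cluster_of.get(term)          # terms filtered out have no cluster
            if k is not None:
                weights[k] = weights.get(k, 0) + tf
        return weights

    def cosine_normalize(weights):
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {k: w / norm for k, w in weights.items()} if norm else weights

    doc = {"wheat": 3, "crop": 1, "the": 5}
    clusters = {"wheat": 0, "crop": 0}        # "the" was removed as a noisy/stop term
    print(cosine_normalize(quantize(doc, clusters)))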

    5 DOCUMENT DECODING

We cannot expect the final alphabet of symbols obtained to map exactly onto a purely orthogonal basis set, as would ideally be desired. Consequently, documents resampled in the new term-cluster space can be assumed to be (to a certain extent) corrupted codewords of the ideal category encoding. Adopting communication terminology, we may say they constitute the actually received messages, contaminated by the undesirable effects introduced by the transmission channel. To sum up, the decoding of a document sampled in the term-cluster space, which ideally would be a straightforward extraction of the category, cannot as such be directly implemented in practice due to the influence of channel noise, interference, and distortion. This brings us to an Optimum Detection problem.

    5.1 MAP Decoder

The optimization criterion can be formulated in terms of $p(c_i|\tilde{d}_k)$, that is, the conditional probability that $c_i$ was selected by the source given that the document $\tilde{d}_k$12 is received. If

$$p(c_i|\tilde{d}_k, H) > p(c_j|\tilde{d}_k, H) \quad \forall j \neq i \qquad (11)$$

(where H denotes the overall hypothesis space), then the decoder should decide that the transmitted symbol was the category $c_i$. This constitutes the basis of the maximum a posteriori (MAP) or probabilistic decoder, expressed as:

$$e(\tilde{d}_k) = \max_{c_i} \, p(c_i|\tilde{d}_k, H). \qquad (12)$$

Now, the posterior probability $p(c_i|\tilde{d}_k, H)$ can be straightforwardly estimated by Bayesian inference, $p(c_i|\tilde{d}_k, H) = \frac{p(\tilde{d}_k|c_i, H)\,p(c_i|H)}{p(\tilde{d}_k|H)}$. Given that the evidence $p(\tilde{d}_k|H)$ does not depend on the category $c_i$, the classification criterion simplifies to the following expression, which constitutes the discriminative function of MAP categorizers:

$$e_{MAP}(\tilde{d}_k) = \max_{c_i} \, p(\tilde{d}_k|c_i, H)\,p(c_i|H). \qquad (13)$$

    5.2 Gaussian Assumption (Discriminant Analysis)

The Gaussian assumption is a classical modeling assumption heavily used in areas such as Signal Processing and Communication Systems but rarely applied in the field of ATC (see Section 5.2.3 for a discussion of this assertion).

The Gaussian model assumes that each category encoding is characterized by a multivariate Gaussian or Normal Probability Density Function (PDF). A document $\tilde{d}$ is then assumed to be a realization of an n-dimensional random vector $\tilde{D}$ that is dependent on the category output $c_i$ with the following Gaussian PDF $N(\tilde{\mu}_i, \Sigma_i)$:

$$f_{N(\tilde{\mu}_i, \Sigma_i)}(\tilde{d}) = \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma_i|^{\frac{1}{2}}} \, e^{-\frac{1}{2}(\tilde{d} - \tilde{\mu}_i)^T \Sigma_i^{-1} (\tilde{d} - \tilde{\mu}_i)}, \qquad (14)$$

where the mean vector $\tilde{\mu}_i = E[\tilde{D}|c_i]$ is an n-dimensional vector and the covariance matrix $\Sigma_i = E\big[(\tilde{D}|c_i - \tilde{\mu}_i)(\tilde{D}|c_i - \tilde{\mu}_i)^T\big]$ is an $n \times n$ positive-definite13 matrix with positive determinant $|\Sigma_i|$.

Now, the likelihood $p(\tilde{d}_k|c_i)$ can be expressed in the following terms, where $F_{N(\tilde{\mu}_i, \Sigma_i)}$ denotes the probability distribution function:14

$$p(\tilde{d}_k|c_i) = \lim_{\tilde{\Delta} \to \tilde{0}} p(\tilde{d}_k \le \tilde{d} \le \tilde{d}_k + \tilde{\Delta} \mid c_i) = \lim_{\tilde{\Delta} \to \tilde{0}} \big[F_{N(\tilde{\mu}_i, \Sigma_i)}(\tilde{d}_k + \tilde{\Delta}) - F_{N(\tilde{\mu}_i, \Sigma_i)}(\tilde{d}_k)\big] \approx f_{N(\tilde{\mu}_i, \Sigma_i)}(\tilde{d}_k)\,\Delta, \quad \tilde{\Delta} \to \tilde{0}. \qquad (15)$$

The factor $\Delta$ appears in the numerator of (15) for each category. Consequently, it does not affect classification, and thus, the MAP criterion translates to the following discriminative function:


11. Soft or fuzzy clustering allows a term to belong to more than one cluster. Clusters may then overlap.

12. Note that $\tilde{d}_k$ corresponds to the document signal vector after the filtering and compression processes described in Section 3 (and thus, properly corresponds to $d''_k$).

13. Recall that a positive-definite matrix is a symmetric matrix with all its eigenvalues positive. A positive-definite matrix is always invertible or nonsingular. Its determinant is always positive.

14. By definition, the probability distribution $F_{\tilde{D}}$ and density $f_{\tilde{D}}$ functions are related by $f_{\tilde{D}}(\tilde{\xi}) \triangleq \frac{\partial^n F_{\tilde{D}}(\tilde{\xi})}{\partial \xi_1 \cdots \partial \xi_n}$.


$$e(\tilde{d}_k) \cong \max_{c_i} \left\{ \frac{1}{|\Sigma_i|^{\frac{1}{2}}} \, e^{-\frac{1}{2}(\tilde{d}_k - \tilde{\mu}_i)^T \Sigma_i^{-1} (\tilde{d}_k - \tilde{\mu}_i)} \, p(c_i|H) \right\}. \qquad (16)$$

Given that the logarithm is a monotonically increasing function, (16) is normally expressed as:

$$e(\tilde{d}_k) \cong \max_{c_i} \left\{ \ln p(c_i|H) - \frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(\tilde{d}_k - \tilde{\mu}_i)^T \Sigma_i^{-1} (\tilde{d}_k - \tilde{\mu}_i) \right\}. \qquad (17)$$

    5.2.1 Quadratic Discriminant Analysis (QDA)

ATC being a supervised classification scheme, both $\tilde{\mu}_i$ and $\Sigma_i$ can be estimated from the training set of documents $D_i^{train}$ that belong to category $c_i$. The discriminative functions in (16) and (17) describe a quadratic shape for each category, and the decision frontiers are also quadratic.

    5.2.2 Linear Discriminant Analysis (LDA)

Let us suppose that the covariance matrices for all categories are identical. This constitutes the homoscedastic simplifying assumption. The discriminant function (17) then simplifies to:

$$e(\tilde{d}_k) \cong \max_{c_i} \left\{ \ln p(c_i|H) - \frac{1}{2}\tilde{\mu}_i^T \Sigma^{-1} \tilde{\mu}_i + \tilde{\mu}_i^T \Sigma^{-1} \tilde{d}_k \right\}. \qquad (18)$$

This corresponds to a linear separative surface (i.e., a hyperplane).
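For concreteness, the log-domain discriminants (17) and (18) translate almost literally into code. This is a sketch under the stated Gaussian assumptions; parameter estimation and any regularization of the covariance matrices are left out, and all names are ours:

    import numpy as np

    def qda_score(d, mu_i, sigma_i, log_prior_i):
        """Quadratic discriminant of Eq. (17) for one category."""
        diff = d - mu_i
        _, logdet = np.linalg.slogdet(sigma_i)
        maha = diff @ np.linalg.solve(sigma_i, diff)   # (d - mu)^T Sigma^{-1} (d - mu)
        return log_prior_i - 0.5 * logdet - 0.5 * maha

    def lda_score(d, mu_i, sigma, log_prior_i):
        """Linear discriminant of Eq. (18); one covariance shared by all categories."""
        w = np.linalg.solve(sigma, mu_i)               # Sigma^{-1} mu_i
        return log_prior_i - 0.5 * mu_i @ w + w @ d

    def map_decision(d, params, score):
        """MAP decision: the category with the largest discriminant wins."""
        return max(params, key=lambda c: score(d, *params[c]))

    params = {"grain": (np.array([2.0, 0.1]), np.eye(2), np.log(0.5)),
              "sport": (np.array([0.1, 2.0]), np.eye(2), np.log(0.5))}
    print(map_decision(np.array([1.8, 0.3]), params, qda_score))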

    5.2.3 Applying Discriminant Analysis to ATC

The first problem that appears when trying to apply discriminant analysis (DA) to ATC is the so-called $n \gg N_i$ problem.15 The number of variables (indexing terms or symbols) is typically extremely high (tens or hundreds of thousands), while the number of sample documents is moderately small. Moreover, note that even if this problem did not exist, the computation of extremely high-dimensional covariance matrices is not feasible.

As a result of these limitations, DA can only be envisaged after a preliminary reduction phase of the indexing term space.

Among the pioneers in using DA in ATC were Schutze et al. [12], who applied an LDA classifier to the routing task in 1995.

More recently, in 2006, Li et al. [13] used discriminant analysis for multiclass classification. Their experimental investigation showed that LDA reaches accurate performance comparable to that offered by SVM, with a neat improvement in terms of simplicity and time efficiency in both the learning and the classification phases.16

    5.3 Independence Assumption

The independence or Naive Bayes assumption, which states that terms are stochastically independent, can be formulated over the Gaussian MAP categorizer. Under this statement, the covariance matrix is the diagonal matrix of variances:

$$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}. \qquad (19)$$

The covariance matrix determinant is simply $|\Sigma| = \sigma_1^2 \sigma_2^2 \cdots \sigma_n^2$, and the inverse covariance matrix is straightforwardly deduced. The estimation of $\Sigma$ is, computationally speaking, drastically simplified, since it reduces to the computation of the n variances.

It can be easily shown that, under the independence assumption, the discriminative function (16) simplifies to the product of the univariate Gaussian PDFs in each of the document indexing directions:

$$e(\tilde{d}_k) \cong \max_{c_i} \left\{ f_{N(\mu_{i1}, \sigma_{i1}^2)}(d_{k1}) \cdots f_{N(\mu_{in}, \sigma_{in}^2)}(d_{kn}) \, p(c_i|H) \right\}. \qquad (20)$$

5.3.1 Quadratic GNB

The quadratic GNB obeys (20). The category-dependent parameters can be estimated by the arithmetic mean, $\mu_{ij} \approx m_{ij} = \frac{1}{N_i} \sum_{\tilde{d}_k \in D_i^{train}} d_{kj}$, and the sample variance, $\sigma_{ij}^2 \approx s_{ij}^2 = \frac{1}{N_i} \sum_{\tilde{d}_k \in D_i^{train}} (d_{kj} - m_{ij})^2$.

    5.3.2 Linear GNB

Under the homoscedastic hypothesis, the variance is considered to be category independent. It can be estimated as before by substituting $N_i$ with N and $D_i^{train}$ with $D^{train}$. Under this simplification, (20) defines a linear GNB as expressed in (18).

    5.3.3 White Noise GNB

A further simplifying assumption may consider the variance to be one and the same for all categories and variables. This generates what we have baptized the White Noise GNB (or simply WN-GNB)17 categorizer. It can be easily shown that, after applying the logarithmic function, the discriminative function (20) simplifies to:

$$e_{MAP}(\tilde{d}_k) \cong \max_{c_i} \left\{ -\frac{1}{2\sigma^2} \sum_{j=1}^{n} (d_{kj} - \mu_{ij})^2 + \ln p(c_i|H) \right\}. \qquad (21)$$

In the case of equiprobable categories, the WN-GNB categorizer defined in (21) reduces to a (euclidean) minimum distance categorizer.

    5.3.4 A New Hybrid GNB Categorizers Family

Document data sets are characterized by extremely sparse matrices (even after the resampling envisaged in Section 4). The majority of the variables (i.e., indexing terms) are not representative of the target category.


15. When the number of dimensions n of the random vector is greater than the quantity of sample data N, the estimated covariance matrix does not have full rank and, thus, cannot be inverted.

16. The advantage of adopting an LDA strategy is that it is a natural multiclass classifier, while SVM is a binary classifier that has to be adapted to the multiclass problem by reducing it to a binary classification scenario, which is a nontrivial task.

17. In Communications, white noise is a Gaussian-distributed noise that has a flat spectral density. It is called white noise by analogy to white light. By similitude, we use the terminology white noise to designate a Gaussian noise that affects all document vectors uniformly.


This means that, for a large part of the indexing attributes, mean and variance are theoretically null.18 This fact negatively affects the computation of the discriminative function in (20): for those attributes, a small deviation from zero of an indexing weight $d_{kp}$ results in a close-to-zero value for the factor $f_{N(\mu_{ip}, \sigma_{ip}^2)}(d_{kp})$, which, as such factors accumulate, can end up setting the overall probability to nil.

With the idea of mitigating the effects of the sparsity existing in ATC, we have envisaged setting a variance lower bound.

    6 EXPERIMENTAL SCENARIO

Before entering this section, note that some aspects of the adopted experimental scenario, such as the treatment of the overlapping issue in Reuters-21578 and the effectiveness measure used, are reviewed in some detail, since they differ from commonly followed solutions [1].

    6.1 Standard Collections

Two of the most widely used standard collections, the 20 Newsgroups (NG) and the Reuters-21578 (RE) data sets, form our experimental scenario. In our experiments, both collections have been preprocessed by removing stopwords (from the Weka [14] stop list) and nonalphabetical words. Infrequent terms, occurring in fewer than four documents or appearing fewer than four times in the whole data set, have also been filtered.

    6.1.1 20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. We used the bydate version of the data set maintained by Jason Rennie,19 which is sorted by date into training (60 percent) and test (40 percent) sets.

    6.1.2 Reuters-21578

The Reuters-21578, Distribution 1.0 test collection20 is formed by 21,578 news stories that appeared in the Reuters newswire in 1987, classified (by human indexers) according to 135 thematic categories, mostly concerning business and economy. In our experiments, we have used the most common subset of Reuters-21578, the ModApte training/test partition, which only considers the set of 90 categories with at least one positive training example and one positive test example. It results in a partition of 7,769 training documents and 3,019 test documents.

Several factors characterize the Reuters-21578 data set [15], notably: categories are overlapping (i.e., a document may belong to more than one category) and the distribution across categories is highly skewed (i.e., some categories have very few labeled documents, even only one, while others have thousands).

    6.1.3 Tackling the Category Overlapping Issue

Reuters-21578 data set classification falls within a multilabel categorization frame, which is, by nature, outside the single-label categorization scheme assumed in most ATC research works, including this one. The category overlapping issue can be tackled in three different ways:

• By deploying the L nonexclusive categories into all the possible $2^L$ category combinations.
• By assuming that categories are independent and, thus, converting the L-category classification problem into L independent binary classification tasks.
• By ignoring multilabeled documents, which constitute approximately 29 percent of all documents in Reuters-21578.

Classically, the category independence alternative has been implicitly undertaken in most research [15]. A minority of authors, such as [16], have opted to ignore multilabeled documents. In our work, we have opted to deploy the L = 90 Reuters-21578 categories into all possible $2^L = 2^{90}$ combinations, which results in an impressive number of the order of $10^{27}$. The reasons for our decision are basically:

• Our conceptual framework, both for document sampling and decoding, is based on a multiclass (not binary) scheme.
• By deploying categories, we avoid assuming any independence hypothesis.
• 379 (out of more than $10^{27}$!) is the actual number of category combinations that have at least one document representative in the training set (out of these 379, only 126 category combinations are represented in the test set).

    6.2 Effectiveness Measures

    6.2.1 Confusion Matrix

In a single-label multiclass classification scheme, the classification process can be visualized by means of a confusion matrix. Each column of the matrix represents the number of documents in a predicted category, while each row represents the documents in an actual category.

In other words, referring to Table 1, the (1,1) entry would be the number of documents of category $c_1$ that are correctly classified under $c_1$, while the (1,2) entry corresponds to documents of $c_1$ incorrectly classified into $c_2$ and the (2,1) entry corresponds to documents of $c_2$ incorrectly classified into $c_1$.


TABLE 1. An Example of Confusion Matrix

18. Note that, in practice, the variance has to be set to a minimal value; otherwise, the univariate normal PDF expressions could not be computed.

19. http://people.csail.mit.edu/jrennie/20Newsgroups/.

20. The Reuters-21578 corpus is freely available for experimentation purposes from http://www.daviddlewis.com/resources/testcollections/reuters21578.



    6.2.2 Precision and Recall Microaverages

Precision and recall21 are common measures of effectiveness in ATC [1]. Overall measures for all categories are usually obtained by microaveraging:22

$$\pi = \frac{TP}{TP + FP} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}, \qquad (22)$$

$$\rho = \frac{TP}{TP + FN} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}. \qquad (23)$$

It can be easily seen that, in the single-label multiclass classification model, the former expressions are equivalent. They both result from the quotient of the sum of the diagonal components of the confusion matrix and the sum of all its elements, which happens to be the overall classification accuracy: the quotient of correct classifications (numerator) and total correct and incorrect classifications (denominator).

    6.3 Technological Solutions

The implementations undertaken in this research are based on the Weka 3 Data Mining Software [14]. In particular, we have used NaiveBayesMultinomial, which is the MNB categorizer implemented in Weka, and Weka LibSVM (WLSVM), which is the integration of LibSVM into the Weka environment.23

7 EXPERIMENTAL RESULTS ON NOISY TERMS FILTERING

This section provides empirical evidence that the novel noisy terms filtering ensures a beneficial feature space reduction.

    7.1 20 Newsgroups

    7.1.1 Experimental Scenario

In the noisy terms filtering scheme that was designed, the setting of the sample variance threshold (from this point on, simply referred to as the threshold) has to be heuristically tuned. The only thing we know a priori is that it is bounded in the interval $[0, s^2_{max}]$, with $s^2_{max} = \frac{|C|-1}{|C|^2}$, which results in $[0, 0.0475]$ for $|C| = 20$.

We have graphically represented in Fig. 5 the effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 20, 100, and 5,000 clusters, respectively. The classifiers used are the classic MNB and SVM. The clustering algorithm tested is static window hard clustering (SH clustering).24 The similarity measure used is the sample correlation coefficient in both cases. The rest of the clustering parameters are the default ones (see Section 8).
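As a quick sanity check of the bound quoted above (a sketch; the function name is ours):

    def s2_max(n_categories):
        """Upper bound of the sample variance of a PMF-like term signal over |C| categories."""
        return (n_categories - 1) / n_categories ** 2

    print(s2_max(20))    # 0.0475   (20 Newsgroups)
    print(s2_max(379))   # ~0.002631 (Reuters-21578, 379 deployed category combinations)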

    7.1.2 Analysis of Results

The curves corresponding to MNB and SVM classification in Fig. 5 have parallel evolutions. They present a mountain shape with a maximum in the range [0.005, 0.02]. Note that this range assures a classification accuracy decrease lower than 5 percent for the 20-clusters curve (which happens to be our target clustering, as we will see in Section 8). Outside this range, the classification accuracy decrease is considered to be too high.

    7.2 Reuters-21578

    7.2.1 Experimental Scenario

As in the case of 20 Newsgroups, the dispersion measure used has been the sample variance. A priori, we only know that the variance threshold is bounded in the interval $[0, s^2_{max}]$, with $s^2_{max} = \frac{|C|-1}{|C|^2}$, which results in $[0, 0.002631]$ for $|C| = 379$.

We have graphically represented in Fig. 6 the effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 200, 500, and 1,000 clusters, respectively. The classifiers used are the classic MNB and SVM, and the clustering algorithm tested is SH clustering. The similarity measure used is the sample correlation coefficient in both cases.


Fig. 5. 20 Newsgroups: Effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 20, 100, and 5,000 clusters, respectively. The clustering algorithm used is the static window hard clustering version with the sample correlation coefficient as similarity measure. (a) MNB classifier. (b) SVM classifier.

21. Precision ($\pi_i$) is the probability that if a random document $d_j$ is classified under $c_i$, this decision is correct. Recall ($\rho_i$) is the probability that if a random document $d_j$ ought to be classified under $c_i$, this decision is taken.

22. TP stands for True Positive, FP for False Positive, TN for True Negative, and FN for False Negative.

23. Weka LibSVM is publicly available from http://www.cs.iastate.edu/~yasser/wlsvm/.

24. Similar results are obtained with dynamic window soft clustering (DS clustering), but, due to space limitations, they are not shown here.


The rest of the clustering parameters are the default ones (see Section 8).

    7.2.2 Analysis of Results

The curves corresponding to MNB and SVM classification have slightly different evolutions. The MNB classifier seems to be more robust to the presence of noisy terms (low values of the variance threshold). In any case, both classifiers (or, more precisely, both clustering/classifier combinations) may be considered reasonably robust to the presence of noisy terms. The optimal range for the threshold is [0.00005, 0.001]. Note that this range assures a maximum decrease of accuracy of 5 percent for the 200-clusters curve (our target clustering, as we will see in Section 8). In both cases, the absolute maximum is obtained for a threshold value of 0.0003.

Similar results are obtained when the clustering is performed by the DS algorithm, with the difference that the optimal range for the threshold is restricted to [0.0002, 0.001].

    7.3 Interpreting and Extrapolating Results

Both in 20 Newsgroups and Reuters-21578, we have seen that a range of values exists for the sample variance threshold where a maximum of classification accuracy is reached. Outside these bounds, the classification accuracy decreases:

• When the threshold is below the lower bound, the selection is too unrestrictive and noisy terms negatively affect the term clustering, which results in classification accuracy deterioration.
• When the threshold is above the upper bound, the selection is too restrictive and informative terms are eliminated, which also results in accuracy deterioration.

While in some cases (e.g., SH clustering on the Reuters collection) classification accuracy seems not to be affected by any lower bound on the variance threshold (i.e., robustness to noisy terms), an upper bound always exists. This is reasonable, as one cannot limitlessly eliminate terms without reducing classification accuracy. The crux of the question here is to find an effective term reduction (one that speeds up the term clustering process) while preserving classification accuracy. In other words, we face the following compromise: the higher the threshold, the faster the clustering process, but also the lower the classification accuracy.

The problem is how to determine a priori (not experimentally) the optimal values for the variance threshold lower and upper bounds.

There is a clear correspondence of relative values in both cases. In the case of 20 Newsgroups, the optimum range for the threshold is $[0.1053\,s^2_{max}, 0.4211\,s^2_{max}]$, which corresponds to a range of 88-41 percent of terms selected. For Reuters, it is $[0.08\,s^2_{max}, 0.38\,s^2_{max}]$, which corresponds to a range of 94-33 percent of terms selected.

We may extrapolate these particular cases into a general scenario (that should, of course, be verified in future experiments with other collections). The optimal range for the threshold may be set to $[0.11\,s^2_{max}, 0.38\,s^2_{max}]$ (this would embrace the limits set by both collections).

8 EXPERIMENTAL RESULTS ON CLUSTERING OF TERMS

This section provides empirical evidence that our redundancy compression model drastically outperforms two of the most effective feature selection functions [17], namely, Information Gain (IG) and Chi-square (CHI), and allows an aggressive dimensionality reduction with minimal loss (in some cases, benefits) in final classification accuracy.

    8.1 20 Newsgroups

    8.1.1 Experimental Scenario

In the preliminary term selection phase of this experimental scenario, noisy terms have been eliminated using the sample variance measure and a threshold set to 0.02. As a result, out of the initial 32,631 preselected terms of the training set, 13,465 informative words have been selected and served as the basis of the clustering phase. We applied the four clustering variants with the sample correlation coefficient as similarity function. In the dynamic window approaches, the initial window size has been set to 100 and the contraction step was empirically set to 0.0001.25

    8.1.2 Discussion of Basic Performance

We have obtained the categorization accuracy of the MNB categorizer on the NG data set indexed in the resulting cluster space.


25. Experimentally, small values for this parameter (lower than 0.0001) have shown better results than higher ones.

Fig. 6. Reuters-21578: Effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 200, 500, and 1,000 clusters, respectively. The clustering algorithm used is the static window hard clustering version with the sample correlation coefficient as the similarity measure. (a) MNB classifier. (b) SVM classifier.


This has been referenced to the results issued from the classic IG and CHI selection functions applied to the raw data set with the same preprocessing (removal of stopwords and nonalphabetical words) and the same space reduction factor.

Fig. 7 shows results parallel to the ones obtained by previous research works on Distributional Clustering [4], [5], [6]. The cluster categorization accuracy curves are notably better than those of the classic IG and CHI term selection functions. They present an abrupt initial increase up to 20 clusters (accuracy in the range of 74 percent to 76 percent, maximum value 76.0489 percent), and from there, they asymptotically approach the maximum accuracy of 78.2528 percent obtained by a full-feature MNB classifier.26

We can say that clustering is good, with only a residual loss of classification accuracy of 2.82 percent, for 20 clusters or more, 20 being the number of categories defined in the 20 Newsgroups collection. This results in an indexing term-space reduction from the original 113,357 words to only 20 word clusters. The reduction factor is $(113{,}357 - 20)/113{,}357 = 0.9982$.

    8.1.3 NG Indexed by Clusters

Fig. 8a illustrates how the NG data set is indexed in the obtained cluster space.27 The graphic represents the document vectors (on the y-axis) indexed in the term-cluster space (x-axis). Documents have been normalized, and they have also been arranged according to the category they belong to. The graphic thus has to be read following the 20 horizontal bands that can be identified in the figure. Each band corresponds to all documents belonging to the same category. Each single document is a line inside this band. Basically, each category is mainly identified by a single and distinctive cluster.28

Finally, NG indexed by 100 term clusters, in Fig. 8b, gives a very graphical understanding of the power of clustering. Here again, basically 20 clusters are active, while a large part of the figure is void. This supports the idea that increasing the number of clusters beyond the number of categories is roughly unnecessary.

    8.2 Reuters-21578

    8.2.1 Experimental Scenario

We applied our four clustering variants to the Reuters-21578 preprocessed training set, upon which noisy terms have been filtered with a sample variance threshold of 0.0003 (the collection is finally indexed by 8,550 informative terms). We have used the sample correlation coefficient. The step in the contraction phase of the dynamic window approaches was empirically set to 0.00001 and the window size to 100. Finally, the whole collection of documents has been indexed in the resulting space of clusters.

    8.2.2 Discussion of Basic Performance

Fig. 9 shows results parallel to the ones obtained with the 20 Newsgroups data set. Curves corresponding to the Distributional clustering algorithms present an abrupt initial increase up to 200 clusters (accuracy 76-82 percent, depending on the clustering and the categorizer used), and from there, they asymptotically approach a maximum accuracy of the order of the maximum of 81.5953 percent obtained by classical selection functions (IG or CHI) with SVM. Note that, in similar conditions, MNB obtains a maximum accuracy of 78.453 percent.

In other words, the indexing space can be reduced from the original 32,539 words to 200 clusters (i.e., by two orders of magnitude) with no loss (even a gain) of categorization accuracy. The reduction factor is $(32{,}539 - 200)/32{,}539 = 0.9939$. With this reduction, there is a gain of classification accuracy of 4.53 percent when using the MNB classifier (from the 78.453 percent accuracy obtained when indexing with 10,000 terms selected with either the IG or CHI functions to 82.1478 percent). When using SVM, the loss of classification accuracy is only 2.24 percent (from 81.5953 percent to 79.7676 percent). Most notably, an overall maximum accuracy of 82.1478 percent is reached when RE is indexed by the 200 clusters obtained by DS clustering and classified with MNB.

Qualitatively, our clustering approach essentially improves on the deterministic annealing algorithm (a Distributional Clustering algorithm) proposed by Bekkerman et al. [7], who found that simple word indexing was a more efficient representation than word clusters for Reuters-21578 (10 largest categories).29 The authors argue that the bad performance of their algorithm is due to the structural low complexity of the Reuters corpora (compared to 20 Newsgroups, for instance): indexing documents with a great number of words was found not to significantly improve classification accuracy. Our results show that classification accuracy over Reuters-21578 is clearly improved by using word clusters instead of words when indexing with a small number of features.


26. Note that the maximum categorization accuracy values of Baker and McCallum [4] and Dhillon et al. [6] are (in absolute terms) slightly superior to ours because they used a bigger train split (a 2/3 train-1/3 test random split against our 60 percent train-40 percent test Rennie's publicly available bydate split). They report [6] achieving 78.05 percent accuracy with only 50 clusters, just 4.1 percent short of the accuracy achieved by a full-feature classifier. Our results with only 20 clusters reduce this loss to 2.82 percent, thus representing relatively improved results.

27. Produced by DS clustering using the sample correlation coefficient as similarity measure and a sample variance threshold of 0.015 (i.e., 18,046 terms selected).

28. This is the general idea. Nevertheless, when analyzed in more detail, Fig. 8 shows secondary clusters for some of the categories, which indicate subject relationships.

29. To our knowledge, the work of Bekkerman et al. is the only one that has applied Distributional Clustering to the Reuters-21578 data set.


Fig. 7. 20 Newsgroups: Classification accuracy of Distributional Clustering algorithms versus Information Gain and Chi-square term selection functions with the Multinomial Naive Bayes categorizer, using the sample correlation coefficient as similarity measure.


One of the basic differences between Bekkerman's approach and ours that may explain our results, apart from the noisy term filtering process, is that the former solves the overlapping issue in the Reuters collection by assuming the obviously violated hypothesis of category independence, while we avoid such a hypothesis (see Section 6.1.3).

The Reuters-21578 collection, with its 379 (nonoverlapped) categories,30 can thus be optimally indexed with only 200 clusters. It should be noted that, out of the total 379 categories of the training set, only 126 are effectively represented in the test set. This fact, together with the extremely insignificant representation of some categories, may explain why the optimal number of clusters needed for indexing is lower than the number of categories.

    8.2.3 RE Indexed by Clusters

Fig. 10 illustrates how the Reuters data set is indexed in the cluster space produced by DS clustering (the figure shows a detailed view of the most representative clusters). In order to ease the reading of the figure, the ten most populated categories have been labeled (horizontal grids delimit the corresponding bands). The figure clearly reflects the uneven category distribution of the data set. The clusters corresponding to the dominant categories earn and acq clearly mark two vertical lines in the figure. In general, these clusters are quite active in all documents.

A possible explanation for this is that these clusters are so big31 that they tend to be generalist. Another way to see it is that all categories are related to the general subject of business and economy that these clusters (identifiers of the categories earn and acq) globally represent. In any case, when examined in detail, a singular and discriminative cluster can be identified for each category, except for the category interest&moneyfx, which shows indexing similar to that of the category moneyfx. Note that both of these categories are mainly indexed by a fairly big cluster.

9 EXPERIMENTAL RESULTS ON DOCUMENT DECODING

    9.1 20 Newsgroups

In Fig. 11a, it can be observed32 that the hybrid Q-GNB presents good classification results, surpassing MNB and LDA in the range of 20-200 word clusters.


    Fig. 8. 20 Newsgroups training set indexed by clusters. (a) Training set indexed by 20 clusters. (b) Training set indexed by 100 clusters.

Fig. 9. Reuters-21578: Categorization accuracy of Distributional Clustering versus Information Gain and Chi-square. Dispersion and similarity measures used are the sample variance and the sample correlation coefficient. (a) MNB categorizer. (b) SVM categorizer.

30. Issued from the decomposition of the original 90 overlapped categories.

31. Uneven cluster size has a direct correspondence with the uneven category distribution. The most populated categories (i.e., earn and acq), which by themselves concentrate more than 57 percent of all documents, are related to two single clusters that together contain more than 52 percent of the total number of terms assigned to clusters.

32. The NG data set used in these experiments is indexed in the space of the word clusters produced by DS clustering using the sample correlation coefficient as similarity measure and a sample variance threshold of 0.015 (i.e., 18,046 preselected terms), with the rest of the parameters set to default values.


These results should be interpreted under the perspective of the sparsity of the data set representation. As extensively commented in Section 8.1.1, when indexing the NG collection with 20 word clusters, each word cluster (nearly) univocally identifies a distinct category. This means that the rest of the word-cluster indexes are set (on average) to zero with a very narrow (practically null) variance. The effect of such null-mean and null-variance indexing attributes turns out to be extremely negative. For instance, a residual nonnull weight in one of those null indexing attributes may easily generate a zero-probability term in (20) that will eventually override the other nonnull components of the expression. The idea pursued by our hybrid GNB is to set a lower-bound value for the variance.

To get a more precise idea of what a variance lower bound of 0.015 means, we can make use of the 68-95-99.7 (empirical) rule that characterizes a normal or Gaussian distribution. As stated by this rule, 95 percent of the values lie in the interval μ ± 2σ. Setting a lower-bound variance of 0.015 means that the narrowest Gaussian distribution allowed concentrates 95 percent of its values in the interval μ ± 0.25. This seems to be a reasonable superimposed restriction in a normalized scheme such as the one we work on, where documents are cosine-normalized and, thus, attribute weights vary in the interval [0, 1].
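The quoted interval follows directly from the threshold; assuming the lower bound is imposed on the variance σ², the narrowest allowed Gaussian satisfies

\begin{equation*}
\sigma = \sqrt{0.015} \approx 0.122,
\qquad
\Pr\left(|X-\mu| \le 2\sigma\right) \approx 0.95,
\qquad
2\sigma \approx 0.25 .
\end{equation*}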

The effect of variance lower-bound tuning on hybrid Q-GNB classification has been further studied. Optimal results were obtained for values greater than 0.01 (for which 2σ ≈ 0.2). Above this value, Q-GNB shows a slight maximum at 0.015 (for which 2σ ≈ 0.25). Surprisingly, classification results for a variance lower bound of 0.1 (for which 2σ ≈ 0.63) are stably good. This may be justified by the robustness of the word-cluster indexing of NG.
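To make the variance lower bound concrete, the following is a minimal sketch of a Gaussian Naive Bayes MAP classifier with a variance floor. It is our own illustration of the flooring idea (class and parameter names are hypothetical), not the authors' exact hybrid Q-GNB:

import numpy as np

class FlooredGaussianNB:
    """Gaussian Naive Bayes MAP classifier with a per-feature variance lower bound."""

    def __init__(self, var_floor=0.015):
        self.var_floor = var_floor

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Flooring the variance keeps near-constant (e.g. all-zero) attributes from
        # producing degenerate, overly peaked Gaussians that dominate the posterior.
        variances = np.array([X[y == c].var(axis=0) for c in self.classes_])
        self.vars_ = np.maximum(variances, self.var_floor)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Log-posterior (up to a constant): log prior + sum of per-feature Gaussian log-likelihoods.
        log_post = []
        for p, mu, var in zip(self.priors_, self.means_, self.vars_):
            ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
            log_post.append(np.log(p) + ll)
        return self.classes_[np.argmax(np.array(log_post), axis=0)]

With var_floor=0.015, this corresponds to the 20 Newsgroups configuration discussed above; the floor is the only departure from a textbook GNB in this sketch.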

    9.2 Reuters-21578

In Fig. 11b, it can be observed33 that the hybrid Q-GNB presents good classification accuracy, similar to that obtained by SVM and only surpassed by MNB. These results should be interpreted in the same sense as those for NG. Introducing a variance lower bound contains the negative effect of the null-mean and null-variance indexes caused by the sparsity of the data set indexing matrix. The effect of variance lower-bound tuning on hybrid Q-GNB classification has also been analyzed. As with NG, it has been seen that small values of the variance lower bound degrade classification accuracy, which explains the bad performance of the native GNB categorizers. Optimal results are obtained for values in the range of 0.005-0.01 (for which 0.14 ≤ 2σ ≤ 0.2). For values higher than this, hybrid Q-GNB suffers a drastic accuracy drop: forcing the Gaussian distributions to widen too much introduces errors and confusion into the classification. This fact, which is reasonable, surprisingly did not occur with NG, probably due to the neater separation between its categories.

    10 CONCLUSIONS

The theoretical model we have proposed has led to a well-performing two-level term-space reduction scheme, implemented by a noisy term filtering and a subsequent redundant term compression.

In particular, the elimination of noisy terms based on the sample variance of their category-conditional PMF has experimentally proved to be an innovative and correct procedure. In both the 20 Newsgroups and Reuters-21578 collections, we have seen that a range of values of the sample variance threshold exists where a maximum of classification accuracy is reached. Outside these bounds, the classification accuracy decreases due to noisy term interference (at the lower bound) and informative term over-elimination (at the upper bound).

On the other hand, our signal-processing-inspired redundancy compressors allow an indexing term-space reduction factor of 0.9982 (from the original 113,357 words to only 20 word clusters) with a residual loss of classification accuracy of 2.82 percent34 with the MNB classifier.

We have extended our research to the challenging Reuters-21578 data set, a highly nonuniformly distributed text collection where categories are related and overlapped. The results obtained are extremely satisfactory and clearly outperform those formerly published by Bekkerman et al. [7] with other Distributional Clustering procedures. Our clustering method, with the sample correlation coefficient as similarity measure, allows an indexing term-space reduction factor of 0.9939 (from the original 32,539 words to only 200 word clusters) with a gain of classification accuracy of 4.53 percent35 when using the MNB classifier. When adopting SVM, the loss of classification accuracy is only 2.24 percent. In any case, the overall maximum of classification accuracy is reached when the collection is indexed by 200 clusters with the MNB classifier. These results tend to indicate that MNB significantly benefits from the compression of the term space (and its intrinsic overfitting reduction). SVM is, arguably [2], more robust to overfitting and, thus, less prone to be positively affected by the compression.



Fig. 10. Reuters-21578 training set indexed by 200 clusters. View restricted to the 50 most representative clusters.

    33. Same experimental scenario as in Section 8.2.1.

34. With respect to an indexing with 5,000 word clusters.
35. With respect to an indexing with 10,000 terms selected with the classical IG or CHI functions.


By reducing redundancy, feature extraction by term clustering tends to produce an orthogonal basis for the document representation. Basically, each (main) category is identified by a singular and discriminative cluster, thus drawing the compressed documents toward an orthogonal coding of the category. In all, with both collections, there seems to be a relationship between the entropy of the category distribution and the actual optimal number of clusters. 20 Newsgroups is a practically uniformly distributed collection; its normalized entropy36 is 0.9981 (almost 1). Reuters-21578 is an extremely unevenly distributed collection; its normalized entropy is 0.4881 (almost 1/2). Experimentally, 20 Newsgroups needs as many clusters as categories, while Reuters-21578 needs half as many.
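The normalized entropies quoted above are simply H(P)/log2(K) for the category distribution P over K categories. A small helper illustrates the computation (the counts in the usage example are purely hypothetical):

import numpy as np

def normalized_entropy(category_counts):
    """Entropy of the category distribution divided by its maximum, log2(#categories)."""
    p = np.asarray(category_counts, dtype=float)
    p = p[p > 0] / p.sum()
    entropy = -(p * np.log2(p)).sum()
    return entropy / np.log2(len(category_counts))

# Hypothetical example: a perfectly uniform 20-category collection has normalized entropy 1,
# while a strongly skewed one (two dominant categories) falls well below it.
print(normalized_entropy([500] * 20))                 # ≈ 1.0
print(normalized_entropy([3000, 2000] + [50] * 18))   # clearly below 1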

In ATC, MNB is one of the most popular statistical categorizers because of its simplicity and good accuracy results. The Gaussian assumption [12], [13] has seldom been applied to ATC due to intrinsic problems related to the high dimensionality of the typical document vectorial representation. Our purpose has been, once the representation space has been optimally reduced, to experimentally test how Gaussian MAP categorizers, especially under the Naive Bayes assumption, may be adapted to the concomitance of sparsity in ATC. By establishing a variance lower bound in the Gaussian PDFs, we have rescued the use of Gaussian MAP classifiers in the ATC arena. Our hybrid approach reaches classification results comparable to those obtained by MNB and SVM and opens the door for further research.

We are currently pursuing our work with the design of a divisive clustering algorithm which, in view of the results obtained with the tested agglomerative clustering schemes, we think can yield interesting improvements in terms of both classification effectiveness and computational efficiency. We also plan a thorough comparison and analysis of similarity measures. Future work is also foreseen on the communication-theoretic modeling side, with special stress on the synthesis of prototype documents via the proposed generative model, as well as on the optimal design of the document coding (and subsequent decoding).

    ACKNOWLEDGMENTS

The authors wish to thank the anonymous reviewers for their fruitful comments and the Weka Machine Learning Project for making their software open source under a GPL license. For this research, Marta Capdevila was supported in part by a predoctoral grant from the R&D General Department of the Xunta de Galicia Regional Government (Spain), awarded on 19 July 2005.

REFERENCES
[1] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[2] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proc. 10th European Conf. Machine Learning (ECML), pp. 137-142, 1998.
[3] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms. Kluwer/Springer, 2002.
[4] L.D. Baker and A.K. McCallum, "Distributional Clustering of Words for Text Classification," Proc. 21st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '98), pp. 96-103, 1998.
[5] N. Slonim and N. Tishby, "The Power of Word Clusters for Text Classification," Proc. 23rd European Colloquium on Information Retrieval Research, 2001.
[6] I. Dhillon, S. Mallela, and R. Kumar, "A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification," J. Machine Learning Research, special issue on variable and feature selection, vol. 3, pp. 1265-1287, 2003.
[7] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "Distributional Word Clusters vs. Words for Text Categorization," J. Machine Learning Research, vol. 3, pp. 1183-1208, 2003.
[8] A. McCallum and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification," Proc. AAAI '98 Workshop on Learning for Text Categorization, 1998.
[9] Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods," Proc. 22nd Ann. Int'l ACM SIGIR Conf. (SIGIR '99), pp. 42-49, Aug. 1999.
[10] S. Haykin, Communication Systems. John Wiley & Sons, 2001.
[11] T.M. Cover and J.A. Thomas, Elements of Information Theory, second ed. John Wiley & Sons, 2006.
[12] H. Schütze, D. Hull, and J. Pedersen, "A Comparison of Classifiers and Document Representations for the Routing Problem," Proc. 18th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '95), pp. 229-237, 1995.
[13] T. Li, S. Zhu, and M. Ogihara, "Using Discriminant Analysis for Multi-Class Classification: An Experimental Investigation," Knowledge and Information Systems, vol. 10, no. 4, pp. 453-472, 2006.
[14] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, second ed. Morgan Kaufmann, 2005.
[15] F. Debole and F. Sebastiani, "An Analysis of the Relative Hardness of Reuters-21578 Subsets," Proc. Fourth Int'l Conf. Language Resources and Evaluation (LREC '04), pp. 971-974, 2004.
[16] K. Torkkola, "Linear Discriminant Analysis in Document Classification," Proc. IEEE Int'l Conf. Data Mining (ICDM 2001) Workshop on Text Mining (TextDM '01), 2001.


36. Normalized entropy is the ratio of the entropy to the maximum entropy.

Fig. 11. Comparison of the classification accuracy of Gaussian Naive Bayes categorizers against state-of-the-art SVM and MNB performance. (a) 20 Newsgroups: Hybrid GNB with a variance lower bound of 0.015. (b) Reuters-21578: Hybrid GNB with a variance lower bound of 0.008.


[17] Y. Yang and J.O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proc. Int'l Conf. Machine Learning, pp. 412-420, 1997.

Marta Capdevila received the engineering degree in telecommunications from the Polytechnic University of Catalonia (UPC), Barcelona, Spain, in 1992. She is currently working toward the PhD degree. During 1991, she studied image processing at the École Nationale Supérieure des Télécommunications, Paris, France. From 1992 to mid-1993, she was selected for a young graduate trainee contract at the European Space Agency (ESA), Frascati, Italy. After this stage, and until mid-1994, she held an application engineering post at the pan-European research networking company DANTE, Cambridge, United Kingdom. From 1995 to 2001, she was appointed to several positions in the Spanish industry. Since 2001, she has been involved in research activities at the Telecommunication Engineering School, University of Vigo, Spain. Her research interests include automatic text categorization and term-space compaction and compression.

Oscar W. Marquez Florez received the telecommunication engineering degree in 1985 from the Odessa Electrotechnical Institute of Communication, Ukraine, and the doctorate degree in telecommunications in 1991 from the Ruhr-University Bochum, Germany. In 1992, he joined the Telecommunication Engineering faculty of the University of Vigo, where he is currently an associate professor. In addition to teaching, he is involved in research in the areas of signal processing in digital communications, computer-based learning, statistical pattern recognition, and image-based biometrics. He is a member of the IEEE.

