
Cross-domain Text Classification through Iterative Refining of Target Categories Representations

Giacomo Domeniconi¹, Gianluca Moro¹, Roberto Pasolini¹ and Claudio Sartori²
¹DISI, Università degli Studi di Bologna, Via Venezia 52, Cesena, Italy
²DISI, Università degli Studi di Bologna, Viale del Risorgimento 2, Bologna, Italy
{giacomo.domeniconi, gianluca.moro, roberto.pasolini, claudio.sartori}@unibo.it

This work was partially supported by the project GenData, funded by the Italian MIUR.

Keywords: Text Mining, Text Classification, Transfer Learning, Cross-domain Classification

Abstract: Cross-domain text classification deals with predicting topic labels for documents in a target domain by leveraging knowledge from pre-labeled documents in a source domain, with different terms or different distributions thereof. Methods exist to address this problem by re-weighting documents from the source domain to transfer them to the target one or by finding a common feature space for documents of both domains; they often require the combination of complex techniques, leading to a number of parameters which must be tuned for each dataset to yield optimal performances. We present a simpler method based on creating explicit representations of topic categories, which can be compared for similarity to those of documents. Category representations are initially built from relevant source documents, then are iteratively refined by considering the most similar target documents, with relatedness being measured by a simple regression model based on cosine similarity, built once at the beginning. This is expected to lead to accurate representations for categories in the target domain, which are used to classify documents therein. Experiments on common benchmark text collections show that this approach obtains results better than or comparable to other methods, obtained with fixed empirical values for its few parameters.

1 INTRODUCTION

Text classification (or categorization) generally entails the automatic organization of text documents into a user-defined taxonomy of classes or categories, which typically correspond to topics discussed in the documents, such as science, arts, history and so on. This general task is useful to organize many types of documents like news stories, books and mail messages and may be applied within several contexts, including spam filtering, sentiment analysis, ad-hoc advertising, etc.

Classic text classification methods require a training set of documents, which must be already labeled with correct classes, to infer a knowledge model which is then able to classify further unseen documents under the same classes: this general approach is used by many different works, shown to be highly effective in organizing documents among several classes (Sebastiani, 2002).

However, a usable training set, besides being reasonably sized, should reflect quite precisely the characteristics of the documents to be classified: this generally means having documents classified under the very same categories of interest and basically containing equal or correlated terms. In other words, denoting as a domain a set of classes to be recognized together with the distribution of terms across them, we would need a training set of labeled documents falling within the very same domain. Such a training set, in some real contexts, may be unavailable or may require infeasible human effort or costs.

In some cases, though, we have at our disposal a set of labeled documents in a domain which is only slightly different from the one where we want to classify documents. For example, we may have a set of labeled documents with topics which are similar to those of interest, such that each topic on one side may be mapped to one on the other. On the other hand, we may have the same topics, but treated with somewhat different terms, as may happen if we want to leverage a training set of outdated documents to classify newer ones. From a theoretical point of view, we usually have the same class labels equally conditioned by the input data, but the data itself has different distributions.

To this end, methods for cross-domain text classification exist, which can be used to classify unlabeled documents of a target domain by exploiting the knowledge obtained from labeled documents of a source domain. These methods fall into the wider class of transfer learning approaches, as they generally involve the transfer of knowledge obtained from the source domain to the target one (Pan and Yang, 2010).

Different approaches exist for this task: some methods are generally based on adapting source data to the target domain, while others rely on bringing data of both domains into a common feature space to spot similarities. These methods are usually based on advanced statistical concepts and techniques, generally making their exact implementation difficult. Moreover, the outcome of these methods is often heavily influenced by their parameters: while for each possible dataset there are ranges of parameter values yielding optimal accuracy, these values are generally different for each dataset, thus requiring the discovery of a combination of parameter values that produces acceptable results, generally following a poor and unpredictable trial-and-error approach in a search space whose size grows exponentially with the number of parameters.

In other words, solutions that need a high number of parameters often achieve good results in controlled environments with known test sets, thanks to repeated trial-and-error cycles for parameter tuning, but in the real world they are sometimes not robust enough.

To alleviate the problem of parameter setting, we present in this work a simple novel method for cross-domain text classification based on building and iteratively improving structured representations for the categories in the target domain. In practice, the method starts from typical bag-of-words representations for single documents from the source and target domains and combines those from the source domain to build preliminary representations for the top-level categories shared between the two domains; these are then refined by iteratively making them “closer” to documents of the target domain, to finally obtain fairly accurate representations of the corresponding categories. This works by comparing the representations of documents and categories by means of a univariate logistic regression model, built once before the iterative phase and based on the standard cosine similarity measure: this model is used to pick the documents which are most similar to each category, from which new representations are built each time, until they become as consistent as possible with the target domain.

We performed experiments on text collections commonly used as benchmarks, showing that this approach can achieve the same performances of the best known methods with good efficiency, despite a simple and compact implementation. We also show that these results are obtained by always using the same values for the two parameters: this eliminates the need for tuning, thus making the method more practically usable in real scenarios.

The rest of the paper is organized as follows. Section 2 reports an overview of other works about cross-domain text classification. Section 3 exposes in detail the method used to classify documents. Section 4 describes the experiments performed and reports their results, compared with those of other works. Finally, Section 5 sums up the contribution and discusses possible future developments.

2 RELATED WORK

Supervised machine learning-based methods for text classification are largely diffused and have proven to be fairly effective in classifying documents across large taxonomies of categories, either flat or hierarchical, provided that suitable training sets of pre-labeled documents are given (Dumais et al., 1998; Joachims, 1998; Yang and Liu, 1999; Sebastiani, 2002). Unsupervised approaches also exist, which are able to some extent to isolate previously unknown groups (clusters) of correlated documents, but generally cannot reach the accuracy of supervised approaches (Merkl, 1998; Kohonen et al., 2000).

A common approach is to represent documents as vectors of numeric features, computed according to their content. Words are often used as features, with each document represented by the number of occurrences of each word or by some derived measure: this is known as the bag of words approach (Sebastiani, 2002). Some later methods make use of statistical techniques like Latent Semantic Indexing (Weigend et al., 1999) or Latent Dirichlet Allocation (Blei et al., 2003) to discover hidden correlations between words and consequently improve the representations of documents. More recent methods extract the semantic information carried by terms by leveraging external knowledge bases such as the WordNet database (Scott and Matwin, 1998) or Wikipedia (Gabrilovich and Markovitch, 2007).

While in most text classification methods standard machine learning algorithms are used on bags of words, a somewhat distinct approach is the Rocchio method, where the bags obtained from training documents are averaged to build similar representations for categories and each new document is assigned to the category having the representation which is most similar to it (Joachims, 1997): our method is similarly based on the idea of explicitly representing categories as averages of relevant documents.

Text categorization is one of the most relevant applications for cross-domain classification, also referred to as domain adaptation. According to the scheme proposed in (Pan and Yang, 2010), cross-domain classification is a case of transductive transfer learning, where knowledge must be transferred across two domains which are generally different while having the same labels Y for data. In many cases, including text classification, the two domains share (or are trivially represented in) a common feature space X.

It is also often assumed that labels in the source and target domains are equally conditioned by the input data, which though is distributed differently between the two; denoting with X_S and Y_S the data and labels for the source domain and with X_T and Y_T those for the target domain, we have P(Y_S|X_S) = P(Y_T|X_T), but P(X_S) ≠ P(X_T): this condition is known as covariate shift (Shimodaira, 2000).

Often, two major approaches to transductive transfer learning are distinguished: (Pan and Yang, 2010) and other works refer to them as instance-transfer and feature-representation-transfer.

Instance-transfer-based approaches generally work by re-weighting instances (data samples) from the source domain to adapt them to the target domain, in order to compensate for the discrepancy between P(X_S) and P(X_T): this generally involves estimating an importance weight P(x_T)/P(x_S) for each source instance x_S in order to reuse it as a training instance under the target domain.

Some works mainly address the related problem of sample selection bias, where a classifier must be learned from a training set with a biased data distribution. (Zadrozny, 2004) analyzes the bias impact on various learning methods and proposes a correction method using knowledge of selection probabilities.

The kernel mean matching method (Huang et al., 2007) learns re-weighting factors by matching the means of the two domains' data in a reproducing kernel Hilbert space (RKHS); this is done without estimating P(X_S) and P(X_T) from a possibly limited quantity of samples. Among other works operating under this restriction there is the Kullback-Leibler importance estimation procedure (Sugiyama et al., 2007), a model to estimate importance based on the minimization of the Kullback-Leibler divergence between the real and the expected P(X_T).

Among works specifically considering text classification, (Dai et al., 2007b) trains a Naïve Bayes classifier on the source domain and transfers it to the target domain through an iterative Expectation-Maximization algorithm. In (Gao et al., 2008) multiple classifiers are trained on possibly multiple source domains and combined in a locally weighted ensemble, based on similarity to a clustering of the target documents, to classify them.

On the other side, feature-representation-transfer-based approaches generally work by finding a new feature space to represent instances of both source and target domains, where their differences are reduced and standard learning methods can be applied.

The structural correspondence learning method (Blitzer et al., 2006) works by introducing pivot features and learning linear predictors for them, whose resulting weights are transformed through Singular Value Decomposition and then used to augment training data instances. The paper (Daume III, 2007) presents a simple method based on augmenting instances with features differentiating source and target domains, possibly improvable through nonlinear kernel mapping. In (Ling et al., 2008a) a spectral classification-based framework is introduced, using an objective function which balances the source domain supervision and the target domain structure. With the Maximum Mean Discrepancy (MMD) Embedding method (Pan et al., 2008), source and target instances are brought to a common low-dimensional space where differences between data distributions are reduced; transfer component analysis (Pan et al., 2011) improves this approach in terms of efficiency and generalization to unseen target data.

The following works are focused on text classification. In (Dai et al., 2007a) an approach based on co-clustering of words and documents is used, where labels are transferred across domains using word clusters as a bridge. The topic-bridged PLSA method (Xue et al., 2008) is instead based on Probabilistic Latent Semantic Analysis, which is extended to accept unlabeled data. In (Zhuang et al., 2011) a framework for joint non-negative matrix tri-factorization of both domains is proposed. Topic correlation analysis (Li et al., 2012) extracts both shared and domain-specific latent features and groups them, to support higher distribution gaps between domains.

Within the distinction between instance-transfer and feature-representation-transfer approaches, our method could be regarded as following the former, as no latent common space is learned. Instead, source documents are brought to the target domain, although in aggregated form and with no adaptation: they just serve to train a knowledge model and to bootstrap the iterative phase, as detailed in the next section.

As in traditional text classification, some methods leverage external knowledge bases: these can be helpful to link knowledge across domains. The method presented in (Wang et al., 2008) improves the cited co-clustering-based approach (Dai et al., 2007a) by representing documents with concepts extracted from Wikipedia. The bridging information gap method (Xiang et al., 2010) exploits instead an auxiliary domain acting as a bridge between source and target, using Wikipedia articles as a practical example. These methods usually offer very high performances, but need a suitable knowledge base for the context of the analyzed documents, which might not be easily available for overly specialized domains.

Beyond the presented works, where domains differ only in the distribution of terms, methods for cross-language text classification exist, where source and target documents are written in different languages, so that there are few or no common words between the two domains. This scenario generally requires either some labeled documents for the target domain or an external knowledge base to be available: a dictionary for the translation of single terms is often used. As examples, in (Ling et al., 2008b) an approach based on information bottleneck is presented, where Chinese texts are translated into English to be classified, while the method in (Prettenhofer and Stein, 2010) is based on the structural correspondence learning cited above (Blitzer et al., 2006).

Other than text classification by topic, another related task on which domain adaptation is frequently used is sentiment analysis, where positive and negative opinions about specific objects (products, brands, etc.) must be distinguished: a usual motivating example is the need to extract knowledge from labeled reviews for some products in order to classify reviews for products of a different type, with possibly different terminologies. Spectral feature alignment (Pan and Yang, 2010) works by clustering together words specific to different domains by leveraging the more general terms. In (Bollegala et al., 2013) a sentiment-sensitive thesaurus is built from possibly multiple source domains. In (Cheeti et al., 2013) a Naïve Bayes classifier on syntax tree-based features is used.

3 CROSS-DOMAIN LEARNING METHOD

This section describes in detail our method to classify documents in a target domain exploiting the knowledge of a source domain.

Inputs to the method are a set D_S of source or in-domain documents, which constitute the source domain, and a disjoint set D_T of target or out-of-domain documents, making up the target domain; we denote with D = D_S ∪ D_T their union. Each document in D is labeled with a single class from a set C, according to two functions C_S : D_S → C and C_T : D_T → C. As in any cross-domain classification method, we assume to have prior knowledge of C_S, while C_T is not known: the goal is to infer a function Ĉ_T : D_T → C with maximal similarity to C_T.

The following subsections give details about the steps of the method: pre-processing of documents and feature extraction according to standard procedures, creation of initial representations for categories, training of a function to predict the similarity between representations, and iterative refining of the category representations. A discussion about computational time complexity is given thereafter.

3.1 Text Pre-processing

The method initially performs typical pre-processing operations on documents to transform each unstructured text into a structured representation.

A common tokenization process extracts single words from each document d, discarding punctuation, words shorter than 3 letters and all those found in a predefined list of stopwords; then the common Porter stemming algorithm is applied to group words with common stems (Porter, 1980). In the end, a set W(d) of the processed words extracted from d is obtained, along with the number of occurrences f(d, t), also known as (raw) frequency, of each word (or term, equivalently) t.

The usual bag of words representation is used: each document d is reduced to a vector w_d of weights for each term t in a global feature set W. As in other papers, features are filtered by Document Frequency (DF) thresholding, discarding all terms appearing in less than 3 documents, to trivially reduce complexity with negligible effects on accuracy. The remaining terms constitute the set W of features considered in all bags of words.

The weight of each term in each document is based on its number of occurrences and determined by a defined weighting scheme. We use a slight variant of the common tf-idf (Term Frequency, Inverse Document Frequency) scheme (Salton and Buckley, 1988), computing the product between the relative frequency of a term in a document (instead of the raw frequency, to avoid overweighting terms in longer documents) and the logarithm of the inverse frequency of the term across all the documents (to give less bias to overly common terms).

w_{d,t} = \underbrace{\frac{f(d,t)}{\sum_{\tau \in W} f(d,\tau)}}_{tf} \cdot \underbrace{\log \frac{|D|}{|\{\delta \in D : t \in W(\delta)\}|}}_{idf}    (1)

Each document d ∈ D will then be represented by its weighted bag of words w_d.
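As an illustration of the weighting scheme of Eq. (1), the following is a minimal Python sketch (not the authors' Java/Weka implementation) that builds such bags from already tokenized and stemmed documents; the function and parameter names (build_bags, min_df) are our own.

```python
import math
from collections import Counter

def build_bags(docs, min_df=3):
    """Build tf-idf bags of words as in Eq. (1).

    docs: dict mapping document id -> list of stemmed tokens.
    Returns: dict mapping document id -> sparse bag {term: weight}.
    Terms appearing in fewer than min_df documents are discarded (DF thresholding).
    """
    # document frequency of each term
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    vocab = {t for t, n in df.items() if n >= min_df}
    n_docs = len(docs)

    bags = {}
    for doc_id, tokens in docs.items():
        counts = Counter(t for t in tokens if t in vocab)
        total = sum(counts.values())
        bag = {}
        if total > 0:
            for term, freq in counts.items():
                tf = freq / total                  # relative term frequency
                idf = math.log(n_docs / df[term])  # inverse document frequency
                bag[term] = tf * idf
        bags[doc_id] = bag
    return bags
```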


3.2 Initial Representation of Categories

As with single documents, whole categories are represented as bags of words.

For each category c ∈ C, a bag of words can intuitively be built by averaging those of the documents which are representative for it, i.e. those labeled with c. As no prior knowledge of how documents in D_T are labeled is available, documents in D_S are used instead, as the labeling function C_S is known. Each category c is then represented by the set R^0_c = {d ∈ D_S : C_S(d) = c} of in-domain documents labeled with it. It is then sufficient to compute the mean weight of each term in each category, thus obtaining a representation w^0_c for each category c ∈ C.

w^0_c = \frac{1}{|R^0_c|} \sum_{d \in R^0_c} w_d    (2)

The “0” indices denote that these are initial representations, which constitute the starting point for the subsequent iterative phase.
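A sketch of how the initial category representations of Eq. (2) could be computed from the source bags; average_bags and initial_category_bags are hypothetical names of ours, not taken from the paper.

```python
from collections import defaultdict

def average_bags(bags):
    """Average a list of sparse bags {term: weight} into a single mean bag."""
    centroid = defaultdict(float)
    for bag in bags:
        for term, weight in bag.items():
            centroid[term] += weight
    return {term: weight / len(bags) for term, weight in centroid.items()}

def initial_category_bags(source_bags, source_labels):
    """Eq. (2): w0_c is the mean of the bags of the source documents labeled with c.

    source_bags: dict doc id -> bag; source_labels: dict doc id -> category.
    """
    by_category = defaultdict(list)
    for doc_id, bag in source_bags.items():
        by_category[source_labels[doc_id]].append(bag)
    return {c: average_bags(bag_list) for c, bag_list in by_category.items()}
```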

3.3 Text Similarity Measure

We need a function Φ : R^n × R^n → [0,1] which, given two bags of words with n = |W| features each, computes a relatedness score between the two of them. Specifically, given a document d and a category c, we refer to Φ(w_d, w^0_c) as the absolute likelihood of d being labeled with c, which is independent from other documents and categories.

A basic function commonly used to determine the relatedness of two bags of words is the cosine similarity, defined for two generic vectors a, b ∈ R^n as:

\cos(a,b) = \frac{\sum_{i=1}^{n} a_i \cdot b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}    (3)

This measure is widely used to compare documents in the form of bags of words against each other, as it effectively spots similar distributions of terms in the documents. So, when computing the similarity cos(w_d, w^0_c) between the bags representing a document d and a category c, we expect it to be significantly higher if they are effectively related, i.e. if c is the correct label for d. Assuming that the values of the cosine similarity for couples of related bags are distributed according to a random variable Y+ and that the values for couples of unrelated bags are distributed in another random variable Y−, then we predict that E(Y+) > E(Y−) holds.
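On sparse bag-of-words vectors, the cosine of Eq. (3) only needs to touch the non-zero entries, which is what keeps the per-pair cost low (see Section 3.5). A minimal sketch using plain dicts as sparse vectors, a representation choice of ours rather than something prescribed by the paper:

```python
import math

def cosine(a, b):
    """Cosine similarity of Eq. (3) between two sparse bags {term: weight}."""
    if len(a) > len(b):      # iterate over the smaller bag for the dot product
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0
```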

However, no prior knowledge is available of how high or how low the cosine similarity should be for pairs of related and unrelated bags, respectively. More formally, the distributions of Y+ and Y− are unknown and we are not allowed to suppose that they are the same across different contexts.

To address this issue, suitable knowledge can be extracted from the source domain, whose labeling of documents is known: the values of cosine similarity between in-domain documents and categories can be measured by using the previously extracted bags of words. In practice, all the possible couples (d, c) ∈ D_S × C made of an in-domain document and a category are considered, computing for each the cosine similarity between the respective bags: these values are used as samples from the Y+ and Y− distributions.

To extract knowledge from these samples, we fit a univariate logistic regression model (Hosmer Jr and Lemeshow, 2004): this procedure returns a function π giving the absolute likelihood of two bags being related, given their cosine similarity.

\pi(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}    (4)

Considering, for each (d, c) ∈ D_S × C, x_{d,c} = cos(w_d, w^0_c) and y_{d,c} equal to 1 if C_S(d) = c and to 0 otherwise, logistic regression is used to find the values of β_0 and β_1 which maximize

\prod_{(d,c) \in D_S \times C} \pi(x_{d,c})^{y_{d,c}} \left(1 - \pi(x_{d,c})\right)^{1 - y_{d,c}}    (5)

The resulting function Φ(w_d, w^0_c) = π(cos(w_d, w^0_c)) indicates the absolute likelihood of c being the correct label for d.
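The paper fits this regression with Weka; as a self-contained illustration, the sketch below fits π by plain gradient ascent on the log-likelihood corresponding to Eq. (5) and wraps it into Φ, reusing the hypothetical cosine helper from the previous sketch. The names and the choice of optimizer are ours.

```python
import math

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit pi(x) = 1 / (1 + exp(-(b0 + b1*x))) by maximizing the likelihood of Eq. (5)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p              # gradient of the log-likelihood w.r.t. b0
            g1 += (y - p) * x        # gradient of the log-likelihood w.r.t. b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return lambda x: 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def train_phi(source_bags, source_labels, category_bags):
    """Build Phi(w_d, w_c) = pi(cos(w_d, w_c)) from all source document-category pairs."""
    xs, ys = [], []
    for doc_id, bag in source_bags.items():
        for c, category_bag in category_bags.items():
            xs.append(cosine(bag, category_bag))                    # x_{d,c}
            ys.append(1.0 if source_labels[doc_id] == c else 0.0)   # y_{d,c}
    pi = fit_logistic(xs, ys)
    return lambda a, b: pi(cosine(a, b))
```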

3.4 Iterative Refining of Target Categories

The Φ function can be used to classify out-of-domain documents by comparing their representations against those extracted from in-domain documents for the categories in C: simply, for each document d ∈ D_T, the predicted label is the category with the highest relatedness likelihood.

\hat{C}^0_T(d) = \arg\max_{c \in C} \Phi(w_d, w^0_c)    (6)

In the common case where the source and target domains are similar, yet somehow different, this does not yield optimal results. In fact, the representations used for the categories are extracted from the source domain and thus reflect the distributions of words measured in it, while out-of-domain documents may refer to the same categories with different distributions of terms and even with some different terms.


However, we expect that this approximate classification does still yield the correct labeling for some amount of out-of-domain documents. Moreover, as the used function returns a likelihood for each document-category couple, we can weight the confidence of the classification for each document, and we expect that documents classified with a very high degree of confidence almost surely are correctly labeled. For each document d ∈ D_T and each category c ∈ C, we define the relative confidence or probability p_0(d,c) of d being labeled with c as the normalization of the absolute likelihood across categories. In practice, for any d, the values of p_0(d,c) for each c ∈ C constitute a probability distribution (their sum is one).

p_0(d,c) = \frac{\Phi(w_d, w^0_c)}{\sum_{\gamma \in C} \Phi(w_d, w^0_\gamma)}    (7)

The formula implies that, in order for d to have a high probability p_0(d,c) of being labeled with c, its representation must be very similar to that of c, but it is also important that it is largely unrelated to the other categories: if a document seems highly related to more than one category, none of them can be assigned with high relative confidence.

Having computed the probability distributions among categories for all documents in D_T, those having high confidence of belonging to a specific category can be distinguished. Fixed a threshold ρ, we define for each category c ∈ C a set R^1_c of documents having a probability of belonging to c superior both to this threshold and to the probabilities for the other categories: these documents are considered to be “surely enough” labeled with c. We impose ρ ≥ 1/|C|, as any lower threshold would cause all documents to always be considered for their most probable category.

R^1_c = \{d \in D_T : p_0(d,c) > \rho \wedge \hat{C}^0_T(d) = c\}    (8)

As sets of out-of-domain documents representing each category have been created, they can be exploited to build new representations for the categories. For each category c, similarly to how w^0_c was built by averaging the bags of in-domain documents labeled with c, a new representation w^1_c is built by averaging the documents in R^1_c. Documents of the source domain are no longer used, as we experienced no significant accuracy improvement retaining them and because these bags should represent only the target domain.

w^1_c = \frac{1}{|R^1_c|} \sum_{d \in R^1_c} w_d    (9)

Having these new representations, the process can now return to the classification phase described at the beginning of this subsection and execute it again, substituting for each category c its initial representation w^0_c with the newly built one w^1_c. We expect that, as the new bags for categories better represent the target domain, the obtained classification gets closer to the real one. Moreover, we expect that documents which were classified with high confidence in the first run retain this distinction in the new run, as they contributed to build the new representation for their respective category, and even that new documents pass the confidence threshold for the respective categories.

So, from the new category representations, new probabilities p_1(d,c) for each (d,c) ∈ D_T × C can be computed and, still considering the ρ confidence threshold, new sets R^2_c of “sure” documents can be extracted for each category c, which in turn can be used to build further representations w^2_c for each c ∈ C. This cycle, where the bags of words for categories are progressively refined to better represent the target domain, could be run indefinitely: we expect that the classification of these documents gets more accurate after each iteration.

Operationally, the method continues this iterative refining process until either a limit N_I of iterations is reached or the representations for all categories in C are identical to those from the previous iteration. In fact, if in one iteration i the condition R^i_c = R^{i-1}_c holds for each category c ∈ C, then equal representations will be obtained through the subsequent iterations (w^i_c = w^{i-1}_c). As a generalization of the second condition, where representations must be identical to those of the previous iteration, we may arrest the algorithm when, for each category, the cosine similarity between the latest representation and the previous one reaches a fixed threshold β, which should be slightly less than one (the default condition is equivalent to setting β = 1). In this way, we may save some iterations where the representations have negligible variations.

Once a termination condition is met after a number n_I ≤ N_I of iterations, the final predicted label Ĉ_T(d) for each document d ∈ D_T is the category whose final representation w^{n_I}_c is most similar to its bag w_d. In this step, as all target documents must be labeled, the most probable category for each is considered, even if its relative probability is not above ρ. In the likely case where new documents within the target domain must be classified after this training process without repeating it, we can compare each of them with all categories and assign it to the most similar one.

The pseudo-code for the whole described process (excluding the text pre-processing phase) is given in Figure 1: the equations given above for the first iteration are rewritten with an iteration counter i. Apart from the univariate logistic regression routine, for which there exist a number of implementations (Minka, 2003), the given code is self-contained and can be easily implemented in many languages.

Input: a bag of words w_d for each document d ∈ D = D_S ∪ D_T, a set C of top categories, a labeling C_S : D_S → C for source documents, a confidence threshold ρ, a maximum number N_I of iterations
Output: predicted labeling Ĉ_T for the documents of the target domain

  for all c ∈ C do
    R^0_c ← {d ∈ D_S : C_S(d) = c}
    w^0_c ← (1 / |R^0_c|) · Σ_{d ∈ R^0_c} w_d
  end for
  for all (d, c) ∈ D_S × C do
    x_{d,c} ← cos(w_d, w^0_c)
    y_{d,c} ← 1 if C_S(d) = c, 0 otherwise
  end for
  π ← LOGISTICREGRESSION(x, y)
  Φ(a, b) := π(cos(a, b))
  i ← 0
  while i < N_I ∧ (i = 0 ∨ ∃c ∈ C : R^i_c ≠ R^{i-1}_c) do
    for all (d, c) ∈ D_T × C do
      p_i(d, c) ← Φ(w_d, w^i_c) / Σ_{γ ∈ C} Φ(w_d, w^i_γ)
    end for
    for all c ∈ C do
      A^i_c ← {d ∈ D_T : argmax_{γ ∈ C} p_i(d, γ) = c}
      R^{i+1}_c ← {d ∈ A^i_c : p_i(d, c) > ρ}
      w^{i+1}_c ← (1 / |R^{i+1}_c|) · Σ_{d ∈ R^{i+1}_c} w_d
    end for
    i ← i + 1
  end while
  for all d ∈ D_T do
    Ĉ_T(d) ← argmax_{c ∈ C} Φ(w_d, w^i_c)
  end for
  return Ĉ_T

Figure 1: Pseudo-code for the iterative refining algorithm.
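Putting the pieces together, the loop of Figure 1 could be sketched as follows; this is an illustrative Python version under the same assumptions as the previous sketches (it expects the hypothetical average_bags, cosine and train_phi helpers to be in scope) and not the authors' implementation.

```python
def refine_and_classify(target_bags, category_bags, phi, rho=0.54, max_iters=20):
    """Iteratively refine the target category representations and label D_T (Figure 1).

    target_bags: dict doc id -> bag; category_bags: initial w0_c per category;
    phi: relatedness function as returned by train_phi.
    """
    reps = dict(category_bags)
    prev_members = None
    for _ in range(max_iters):
        # relative confidence p_i(d, c): Phi normalized across categories (Eq. 7)
        members = {c: [] for c in reps}
        for doc_id, bag in target_bags.items():
            scores = {c: phi(bag, c_bag) for c, c_bag in reps.items()}
            total = sum(scores.values()) or 1.0
            probs = {c: s / total for c, s in scores.items()}
            best = max(probs, key=probs.get)
            if probs[best] > rho:            # "surely enough" labeled documents (Eq. 8)
                members[best].append(doc_id)
        if members == prev_members:          # representations would not change any more
            break
        prev_members = members
        # new representation of each category from its confident target documents (Eq. 9);
        # categories with no confident document keep their previous representation
        for c, doc_ids in members.items():
            if doc_ids:
                reps[c] = average_bags([target_bags[d] for d in doc_ids])
    # final labeling: most related category for every target document, even below rho
    return {doc_id: max(reps, key=lambda c: phi(bag, reps[c]))
            for doc_id, bag in target_bags.items()}
```

Under these assumptions, a complete run would simply chain build_bags, initial_category_bags, train_phi and refine_and_classify on the source and target bags.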

3.5 Computational Complexity

The process performs many operations on vectors of length |W|: while these operations would generally require a time linear in this length, given the prevalent sparsity of these vectors, we can use suitable data structures to bound both storage space and computation time linearly w.r.t. the mean number of non-zero elements. To this end, we denote with l_D and l_C the mean number of non-zero elements in the bags of words for documents and categories, respectively. By definition, we have l_D ≤ |W| and l_C ≤ |W|; in our experiments (described in the next section) we also generally observed l_D ≪ l_C < |W|.

The construction of the initial representations for categories is done in O(|D_S| · l_D) time, as all the values of all document representations must be summed up. Cosine similarities for vectors with l_D and l_C non-zero elements respectively can be computed in O(l_D + l_C) time, which can be written as O(l_C) given that l_D < l_C. To fit the logistic regression model, the cosine similarity for N_S = |D_S| · |C| pairs must be computed to acquire the input data, which requires O(l_C · N_S) time; then the model can be fit with one of various optimization methods which are generally linear in the number N_S of data samples (Minka, 2003).

In each iteration of the refining phase, the method computes the cosine similarity for N_T = |D_T| · |C| document-category pairs and normalizes them to obtain probability distributions in O(N_T · l_C) time; then, to build the new bags of words for categories, up to |D_T| document bags must be summed up, which is done in O(|D_T| · l_D) time. The sum of these two steps, always considering l_D < l_C, is O(|D_T| · |C| · l_C), which must be multiplied by the final number n_I of iterations.

Summing up, the overall complexity of the method is O(|D_S| · |C| · l_C + n_I · |D_T| · |C| · l_C), which can be simplified as O(n_I · |D| · |C| · l_C), with l_C ≤ |W|. The complexity is therefore linear in the number |D| of documents, the number |C| of top categories (usually very small), the mean number l_C of non-zero terms per category (having |W| as an upper bound) and the number n_I of iterations in the final phase, which in our experiments is always less than 20. This complexity is comparable to that of the other methods considered in the upcoming experiments section.

4 EXPERIMENTS

To assess the performances of the method described above, we performed some experiments on sets of documents already used as a test bed for other cross-domain text classification methods, to be able to compare our results with them.

The method has been implemented in a software framework written in Java. To fit the logistic regression models, we relied upon the Weka machine learning software (Hall et al., 2009).

4.1 Benchmark Datasets

For our experiments, we considered three text collections commonly used in cross-domain classification due to their class taxonomy, which exhibits a shallow hierarchical structure. This allows isolating a small set of top categories, each including a number of sub-categories in which documents are organized.

Each possible input dataset is set up by choosing a small set of top categories of a collection, constituting the set C, and splitting the documents of these categories into two groups: one contains documents of some branches of the top categories and is used as the source domain, the other one, containing documents of different sub-categories, is used as the target domain. By labeling each document in the two domains with its top category, we obtain suitable datasets.

The 20 Newsgroups collection¹ (or 20NG) is a set of posts from 20 different Usenet discussion groups, which are arranged in a hierarchy, each represented by almost 1,000 posts. We consider the 4 most frequent top categories comp, rec, sci and talk, each represented by 4 sub-categories (5 for comp). Each test involves two top categories: the source domain is composed of documents of 2 or 3 sub-categories for each of them, the target domain is composed of the remaining 2 or 3 sub-categories of each. We considered 6 different problems with different pairs of top categories, following the same sub-categories split used in other works (see e.g. (Dai et al., 2007a) for a table). We also considered four problems with documents drawn from three top categories, which are less commonly tested among other works.

The SRAA text collection² is also drawn from Usenet: it consists of 73,218 posts from discussion groups about simulated autos, simulated aviation, real autos and real aviation. With this setting, we can perform tests using two different sets of top categories: {real, simulated} and {auto, aviation}. In the first case, we used documents about aviation of both types for the source domain and about autos of both types for the target domain; the second case is similar, with simulated vehicles as the source domain and real vehicles as the target one. As the four groups are highly unbalanced in the collection as is, tests are performed on a selection of 16,000 documents, 4,000 for each group, as in other works.

The Reuters-21578 collection³ contains 21,578 newswire stories about economy and finance collected from Reuters in 1992. In this collection, documents are labeled with 5 types of labels, among which orgs, people and places are commonly used as top categories: we considered the three possible pairs of them, using the same split between source and target employed by other works, where sub-categories are evenly divided.

¹ http://qwone.com/~jason/20Newsgroups/
² http://people.cs.umass.edu/~mccallum/data/sraa.tar.gz
³ http://www.cse.ust.hk/TL/dataset/Reuters.zip

4.2 Setup and Evaluation

We performed tests on the datasets described above. The only parameters to be configured are the maximum number N_I of iterations, which we fixed at 20 and was rarely reached in our runs, and the confidence threshold ρ, for which we tested multiple values.

In each test run, to evaluate the goodness of the predicted labeling Ĉ_T with respect to the correct one C_T, as in other works, we measure the accuracy as the ratio of documents in the target domain for which the correct label was predicted: as almost all target domains have documents evenly distributed across categories, this is a fairly valid measure.

Acc(\hat{C}_T, C_T) = \frac{|\{d \in D_T : \hat{C}_T(d) = C_T(d)\}|}{|D_T|}    (10)

For each test, we also report two baseline results: the minimal accuracy, obtained by simply classifying out-of-domain documents using the category representations extracted from the source domain (we would obtain this by setting N_I = 0, i.e. suppressing the iterative phase), and the maximal accuracy, which would be reached by classifying the same documents using both the regression model and the category representations extracted from the target domain itself, assuming prior knowledge of its labeling (in other words, we set D_S = D_T and N_I = 0). We consider these baseline results as lower and upper bounds for the real accuracy.
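The accuracy of Eq. (10) is simply the fraction of target documents whose predicted label matches the true one; a trivial sketch with hypothetical names, consistent with the earlier ones:

```python
def accuracy(predicted_labels, true_labels):
    """Eq. (10): fraction of target documents for which the correct label was predicted."""
    correct = sum(1 for doc_id, label in true_labels.items()
                  if predicted_labels.get(doc_id) == label)
    return correct / len(true_labels)
```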

4.3 Results

Table 1 summarizes some relevant results for each considered dataset: the accuracy baselines, the results reported in other works and our results with the threshold ρ set to 0.54, including the number of iterations needed to terminate the refining phase. Specifically, we reported the available results from the following works, also cited in Section 2:

CoCC co-clustering (Dai et al., 2007a),

TPLSA topic-bridged PLSA (Xue et al., 2008),

CDSC spectral classification (Ling et al., 2008a),

MTrick matrix trifactorization (Zhuang et al., 2011),

TCA topic correlation analysis (Li et al., 2012).

We can see from the table that our approach performs better than reported methods in most cases.

In the table, we picked ρ = 0.54 as we determined empirically by our experiments that it generally yields optimal results. Being close to 0.5, in the common case with two top categories, few documents are generally ignored in the iterative refining phase.

                      Baselines        Other methods                               ρ = 0.54
Dataset               min     max      CoCC    TPLSA   CDSC    MTrick(a)  TCA      Acc.    Iters.
20 Newsgroups
comp vs sci           0.760   0.989    0.870   0.989   0.902   -          0.891    0.976   16
rec vs talk           0.641   0.998    0.965   0.977   0.908   0.950      0.962    0.992   9
rec vs sci            0.824   0.991    0.945   0.951   0.876   0.955      0.879    0.984   11
sci vs talk           0.796   0.990    0.946   0.962   0.956   0.937      0.940    0.974   11
comp vs rec           0.903   0.992    0.958   0.951   0.958   -          0.940    0.980   10
comp vs talk          0.966   0.995    0.980   0.977   0.976   -          0.967    0.990   8
comp vs rec vs sci    0.682   0.975    -       -       -       0.932      -        0.940   16
rec vs sci vs talk    0.486   0.991    -       -       -       0.936      -        0.977   15
comp vs sci vs talk   0.722   0.986    -       -       -       0.921      -        0.971   14
comp vs rec vs talk   0.917   0.991    -       -       -       0.955      -        0.980   9
SRAA
real vs simulated     0.570   0.976    0.880   0.889   0.812   -          -        0.936   13
auto vs aviation      0.809   0.983    0.932   0.947   0.880   -          -        0.962   18
Reuters-21578
orgs vs places        0.736   0.909    0.680   0.653   0.682   0.768      0.730    0.724   16
orgs vs people        0.779   0.918    0.764   0.763   0.768   0.808      0.792    0.820   13
people vs places      0.612   0.926    0.826   0.805   0.798   0.690      0.626    0.693   13

(a) Values for the 20 Newsgroups collection reported by “MTrick” are not computed on single runs, but are averages of multiple runs, each with an equal set of top categories, where a baseline document classifier trained on the source domain and tested on the target got an accuracy higher than 65%.

Table 1: Results of our method (on rightmost columns) on selected test datasets, compared with those reported by other works: the results in bold are the best for each dataset (excluding baselines).

Figure 2: Accuracy on the comp vs sci dataset for different values of the ρ threshold.

Figure 3: Accuracy on the two SRAA datasets (auto vs avi, real vs sim) for different values of the ρ threshold.


Figure 4: Accuracy on the comp vs rec vs sci dataset for different values of the ρ threshold.

However, we observed that in many cases the threshold parameter ρ has little influence on the final accuracy, as long as it stays within a reasonable range of values: we show some examples. Figure 2 reports the accuracy on the comp vs sci dataset with different threshold values: it can be noted that the result scarcely varies for threshold values between 0.5 and 0.8; the same trend holds for the other datasets with two top categories of 20 Newsgroups. Figure 3 shows the same plot for the two SRAA problems, which just exhibit a different range for real vs sim. Instead, tests on 20NG with three top categories, where the minimum value for ρ is 1/3, generally yield high accuracies for thresholds between 0.45 and 0.6, as shown for example in Figure 4 for comp vs rec vs sci. On the Reuters collection, accuracy has a more unpredictable behavior as the threshold varies: this is probably due to the higher difficulty of distinguishing its top categories, as also appears from the results of other works.

Figure 5: Intermediate accuracy at each iteration on different datasets (comp vs sci, comp vs rec vs sci, auto vs avi, orgs vs people).

In the results, we reported the number of iterations needed for the algorithm to reach the convergence condition, where the category representations stop changing between successive iterations. It is interesting to check what accuracy would be obtained with an anticipated termination of the algorithm, obtained by setting a lower value for the maximum number N_I of iterations. We report in Figure 5 the intermediate accuracy obtained on various datasets by limiting the N_I parameter. As stated above, for N_I = 0 the minimal accuracy is obtained. Accuracy generally grows faster in the first iterations and only has minor fluctuations in the successive ones: generally, 5 iterations grant an accuracy at most 3% below the convergence value, while with 10 iterations the optimal result is within 1%. The parameter can then be set as a tradeoff between accuracy and running time, while setting a high value (empirically, 20 or more) would still yield optimal accuracy with a reasonable running time.

Specifically, letting the algorithm reach the convergence condition, our running times for single tests, each run on two processing cores on virtualized hardware, have been within 2 minutes for the smaller Reuters-based datasets, between 5 and 7 minutes for problems with two top categories of 20NG and between 15 and 20 minutes for the remaining datasets with more documents. These times are roughly proportional to the number of iterations: the initial training phase on the source domain takes about the time of one iteration to compute the needed similarity values and a few seconds to fit the regression model.

As said above, an alternative termination condition to reduce the number of iterations, and consequently the running time, without compromising the accuracy, is to stop when the cosine similarities between the current and previous representations of all categories reach a given threshold β.

                    β = 1 (def.)    β = 0.9999     β = 0.999
Dataset             A      I        A      I       A      I
20 Newsgroups
comp vs sci         976    16       974    9       973    7
rec vs talk         992    9        992    8       990    4
rec vs sci          984    11       984    6       984    4
sci vs talk         974    11       974    6       970    4
comp vs rec         980    10       980    6       979    3
comp vs talk        990    8        990    4       990    2
comp rec sci        940    16       940    12      938    8
rec sci talk        977    15       976    10      976    9
comp sci talk       971    14       971    11      967    6
comp rec talk       980    9        979    4       979    3
SRAA
real vs sim         936    13       939    9       936    4
auto vs avi         962    18       961    8       957    3
Reuters-21578
orgs places         724    16       727    10      731    6
orgs people         820    13       820    12      812    7
people places       693    13       693    12      666    5

Table 2: Accuracy (A, in thousandths) and number of iterations (I) for all datasets with different settings for the β similarity threshold for termination (β = 1 corresponds to the default termination condition).

Results of this variant with two different values of the threshold, compared to the default results with β = 1, are given in Table 2. With the two picked values, we generally have a strong reduction of the number of iterations while maintaining the accuracy very close to the convergence value: in the tests with β = 0.999, the number of iterations is at least halved in most cases and always drops below 10, with an accuracy which, excluding the tests on Reuters, is at most 0.5% lower than the one obtained normally. A lower number of iterations directly impacts the running time, which with β = 0.999 stayed within about 10 minutes even for the larger datasets. The downside of this variant is the introduction of a new parameter, although the two values given in the table are shown to generally work fine on all tested datasets.
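The alternative stopping rule can be expressed as a simple check on the cosine similarity between each category's current and previous representation; a sketch under the same assumptions as the earlier ones (the hypothetical cosine helper from Section 3.3's sketch is reused, and β = 0.999 is one of the values tested in Table 2):

```python
def representations_converged(current_reps, previous_reps, beta=0.999):
    """Alternative termination condition: stop when every category's new representation
    is at least beta-similar (cosine) to the one from the previous iteration."""
    if previous_reps is None:
        return False
    return all(cosine(current_reps[c], previous_reps[c]) >= beta for c in current_reps)
```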

Summing up, the fixed values for the two settable parameters, ρ = 0.54 and N_I = 10, seem to yield good results on almost any dataset, with the possibility of reducing N_I to trade off some accuracy for a lower running time. The alternative termination condition is a further possibility to limit the number of iterations, with a parameter for which, as for the other ones, globally valid values seem to exist.

While up to now we assumed to know in advance all the documents of the target domain, in many real cases there is the need to classify documents which are not known while training the knowledge model. As stated before, in this case we can simply compare new documents with the final category representations and check which of them is the most similar for each document. To verify this, we performed additional tests on 20NG where in the iterative phase only a subset of the documents in the target domain is known, while the final accuracy is evaluated as before on all of them.

Figure 6: Average accuracy on the comp vs sci and comp vs talk datasets when only a given ratio of the target documents is known during the iterative phase; each point is an average over 5 tests, with error bars indicating standard deviation.

Figure 6 reports the results for tests performed on 20 Newsgroups with ρ = 0.54 and N_I = 20, where only a fixed ratio of out-of-domain documents is considered to be known in the iterative phase: each point gives the average accuracy over five runs with different subsets randomly drawn from the target domain. For ease of readability, we just report the results for the two datasets with respectively the lowest and the highest overall average accuracy: curves for the other two-category problems on 20NG follow the same trend and mostly lie between the two. Average accuracy is above 90% even when just 10% of the target domain (fewer than 500 documents) is known, while using 30% or more of the out-of-domain documents generally guarantees an accuracy of at least 96%.

5 CONCLUSION AND FUTURE WORK

We presented a conceptually simple and fairly efficient method for cross-domain text classification based on explicitly representing categories and computing their similarity to documents. The method works by initially creating representations from the source domain to start collecting sure classifications in the target domain, which are then used to progressively build better representations for it.

We tested the method on text collections commonly used as benchmarks for transfer learning tasks, obtaining fairly good performances with respect to other approaches. While the algorithm has a few parameters to be set, they are shown to often have little influence on the result, and some fixed empirical values often yield (nearly) optimal accuracy.

The method can be extended to consider semantics of terms, either leveraging external knowledge (as in (Wang et al., 2008)) or statistical techniques like (P)LSA. Regarding applications, with suitable adaptations, we are considering testing it on related problems such as cross-language classification and sentiment analysis.

REFERENCES

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

Blitzer, J., McDonald, R., and Pereira, F. (2006). Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128. Association for Computational Linguistics.

Bollegala, D., Weir, D., and Carroll, J. (2013). Cross-domain sentiment classification using a sentiment sensitive thesaurus. IEEE Transactions on Knowledge and Data Engineering, 25(8):1719–1731.

Cheeti, S., Stanescu, A., and Caragea, D. (2013). Cross-domain sentiment classification using an adapted naive Bayes approach and features derived from syntax trees. In Proceedings of KDIR 2013, 5th International Conference on Knowledge Discovery and Information Retrieval, pages 169–176.

Dai, W., Xue, G.-R., Yang, Q., and Yu, Y. (2007a). Co-clustering based classification for out-of-domain documents. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 210–219. ACM.

Dai, W., Xue, G.-R., Yang, Q., and Yu, Y. (2007b). Transferring naive Bayes classifiers for text classification. In Proceedings of the AAAI '07, 22nd National Conference on Artificial Intelligence, pages 540–545.

Daume III, H. (2007). Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263.

Dumais, S., Platt, J., Heckerman, D., and Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM '98, 7th International Conference on Information and Knowledge Management, pages 148–155. ACM.

Gabrilovich, E. and Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, volume 7, pages 1606–1611.

Gao, J., Fan, W., Jiang, J., and Han, J. (2008). Knowledge transfer via multiple model local structure mapping. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 283–291. ACM.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Hosmer Jr, D. W. and Lemeshow, S. (2004). Applied logistic regression. John Wiley & Sons.

Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., and Scholkopf, B. (2007). Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems, 19:601–608.

Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML '97, 14th International Conference on Machine Learning, pages 143–151.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning, 1398:137–142.

Kohonen, T., Kaski, S., Lagus, K., Salojarvi, J., Honkela, J., Paatero, V., and Saarela, A. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3):574–585.

Li, L., Jin, X., and Long, M. (2012). Topic correlation analysis for cross-domain text classification. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence.

Ling, X., Dai, W., Xue, G.-R., Yang, Q., and Yu, Y. (2008a). Spectral domain-transfer learning. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 488–496. ACM.

Ling, X., Xue, G.-R., Dai, W., Jiang, Y., Yang, Q., and Yu, Y. (2008b). Can Chinese web pages be classified with English data source? In Proceedings of the 17th International Conference on World Wide Web, pages 969–978. ACM.

Merkl, D. (1998). Text classification with self-organizing maps: Some lessons learned. Neurocomputing, 21(1):61–77.

Minka, T. P. (2003). A comparison of numerical optimizers for logistic regression. http://research.microsoft.com/en-us/um/people/minka/papers/logreg/.

Pan, S. J., Kwok, J. T., and Yang, Q. (2008). Transfer learning via dimensionality reduction. In Proceedings of the AAAI '08, 23rd National Conference on Artificial Intelligence, pages 677–682.

Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. (2011). Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210.

Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Porter, M. F. (1980). An algorithm for suffix stripping. Program: Electronic Library and Information Systems, 14(3):130–137.

Prettenhofer, P. and Stein, B. (2010). Cross-language text classification using structural correspondence learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1118–1127.

Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

Scott, S. and Matwin, S. (1998). Text classification using WordNet hypernyms. In Use of WordNet in Natural Language Processing Systems: Proceedings of the Conference, pages 38–44.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244.

Sugiyama, M., Nakajima, S., Kashima, H., Von Buenau, P., and Kawanabe, M. (2007). Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, volume 7, pages 1433–1440.

Wang, P., Domeniconi, C., and Hu, J. (2008). Using Wikipedia for co-clustering based cross-domain text classification. In ICDM '08, 8th IEEE International Conference on Data Mining, pages 1085–1090. IEEE.

Weigend, A. S., Wiener, E. D., and Pedersen, J. O. (1999). Exploiting hierarchy in text categorization. Information Retrieval, 1(3):193–216.

Xiang, E. W., Cao, B., Hu, D. H., and Yang, Q. (2010). Bridging domains using world wide knowledge for transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(6):770–783.

Xue, G.-R., Dai, W., Yang, Q., and Yu, Y. (2008). Topic-bridged PLSA for cross-domain text classification. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 627–634. ACM.

Yang, Y. and Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 42–49. ACM.

Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Proceedings of the 21st International Conference on Machine Learning, page 114. ACM.

Zhuang, F., Luo, P., Xiong, H., He, Q., Xiong, Y., and Shi, Z. (2011). Exploiting associations between word clusters and document classes for cross-domain text categorization. Statistical Analysis and Data Mining, 4(1):100–114.