
IJCNLP 2011

Proceedings of the 5th Workshop on Cross Lingual Information Access

November 13, 2011
Shangri-La Hotel
Chiang Mai, Thailand


IJCNLP 2011

The 5th International Joint Conference on Natural Language Processing

Proceedings of the 5th Workshop on Cross Lingual Information Access

November 13, 2011
Chiang Mai, Thailand


We wish to thank our sponsors

Gold Sponsors

www.google.com

www.baidu.com

The Office of Naval Research (ONR)

The Asian Office of Aerospace Research and Development (AOARD)

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong

Silver Sponsors

Microsoft Corporation

Bronze Sponsors

Chinese and Oriental Languages Information Processing Society (COLIPS)

Supporter

Thailand Convention and Exhibition Bureau (TCEB)


We wish to thank our sponsors

Organizers

Asian Federation of Natural Language Processing (AFNLP)

National Electronics and Computer Technology Center (NECTEC), Thailand

Sirindhorn International Institute of Technology (SIIT), Thailand

Rajamangala University of Technology Lanna (RMUTL), Thailand

Maejo University, Thailand

Chiang Mai University (CMU), Thailand


©2011 Asian Federation of Natural Language Processing


Introduction

Welcome to the IJCNLP-2011 Workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies.

The development of digital and online information repositories is creating many opportunities and also new challenges in information retrieval. The availability of online documents in many different languages makes it possible for users around the world to directly access previously unimagined sources of information. However, in conventional information retrieval systems the user must enter a search query in the language of the documents in order to retrieve them. This requires that users can express their queries in the languages in which the information is available and can understand the documents returned by the retrieval process. This restriction clearly limits the amount and type of information that an individual user really has access to.

Cross lingual information access (CLIA) is concerned with any technologies and applications that enable people to freely access information expressed in any language. With the rapid growth of globalization and of digital online information on the Internet, a huge demand for cross lingual information access has emerged: from ordinary netizens (polyglots or monoglots) who surf the Internet for specific information (e.g. travel, product descriptions) and communicate on rapidly growing social networks (e.g. Facebook, YouTube, Twitter, Myspace), to global companies that provide multilingual services to their multinational customers, and governments that aim to lower the barriers to international commerce, collaboration, and homeland security. This huge demand has triggered vigorous research and development in CLIA.

In recent times, research in Cross Lingual Information Access has been vigorously pursued through several international fora, such as the Cross-Language Evaluation Forum (CLEF), the NTCIR Asian Language Retrieval and Question-answering Workshop, and FIRE for cross language information retrieval in Indian languages. In addition to CLIR, significant results have been obtained in multilingual summarization workshops and cross-language named entity extraction challenges by the ACL (Association for Computational Linguistics) and in the Geographic Information Retrieval (GeoCLEF) track of CLEF.

This workshop continues the effort to address the need for cross-lingual information access, following its previous four editions held at IJCAI 2007 in Hyderabad, IJCNLP 2008 in Hyderabad, NAACL 2009 in Colorado, and COLING 2010 in Beijing. It aims to bring together researchers from a variety of fields, such as information retrieval, computational linguistics, machine translation, and digital libraries, and practitioners from government and industry, to address the information needs of multilingual societies.

This fifth international workshop on Cross Lingual Information Access aims to bring together various trends in multi-source, cross-lingual and multilingual information retrieval and access, and to provide a venue for researchers and practitioners from academia, government, and industry to interact and share a broad spectrum of ideas, views and applications. The workshop also aims to highlight and emphasize the contributions of Natural Language Processing (NLP) and Computational Linguistics to CLIA. The present workshop includes an invited keynote talk followed by presentations of technical papers selected after peer review.

The workshop opens with an invited keynote talk, Web-based Machine Translation, given by Haifeng Wang.

The technical paper presentations start from the second session of the workshop. The paper by Knoth et al. addresses explicit semantic analysis for cross-lingual link discovery, exploring how to automatically generate cross-language links between resources in large document collections. The paper presents new methods that are applicable to any multilingual document collection, reports a comparative study on the Wikipedia corpus, and provides new insights into the evaluation of link discovery systems. Siva Reddy and Serge Sharoff propose cross-language PoS taggers for Indian languages. They show how to build a cross-language PoS tagger for Kannada exploiting the resources of Telugu; in addition, they build large corpora and a morphological analyser for Kannada. They show that cross-language taggers are as efficient as monolingual taggers. The work by Duo Ding introduces ongoing work on leveraging a cross-lingual topic model (CLTM) to integrate multilingual search results. The CLTM detects the underlying topics of results in different languages and uses the topic distribution of each result to cluster the results into topic-based classes. In CLTM, distributions are unified at the topic level by direct translation, which distinguishes it from other multilingual topic models that mainly concern parallelism at the document or sentence level. The authors suggest that the CLTM clustering method is effective and outperforms several existing document clustering techniques. Faruqui et al. propose soundex-based translation correction in Urdu-English cross-language information retrieval. They discuss the challenges associated with a resource-poor language like Urdu and show the effectiveness of the proposed approach on a benchmark dataset. Li et al. adopt the contextualized hidden Markov model (CHMM) framework for unsupervised Russian PoS tagging. They propose a backoff smoothing method that incorporates left, right, and unambiguous context into the transition probability estimation during the expectation-maximization process, and show that the resulting model achieves overall and disambiguation accuracies comparable to a CHMM using the classic backoff smoothing method for HMM-based PoS tagging. Johannes Knopp addresses extending a multilingual lexical resource by bootstrapping named entity classification using Wikipedia's category system. The approach is able to classify more than two million named entities and improves the quality of an existing NER resource.

With this diverse set of topics, we look forward to a lively exchange of ideas at the workshop.

We thank Haifeng Wang for the invited keynote talk, all the members of the Program Committee for their excellent and insightful reviews, the authors who submitted contributions to the workshop, and the participants for making the workshop a success.

Organizing Committee
The 5th International Workshop on Cross Lingual Information Access
IJCNLP 2011
November 13, 2011


Organizers:

Asif Ekbal, IIT Patna, India (Co-chair)
Deyi Xiong, Institute for Infocomm Research, Singapore (Co-chair)
Prasenjit Majumder, DAIICT, India
Mitesh Khapra, IIT Bombay

Program Committee:

Eneko Agirre, University of the Basque Country
Rafael Banchs, Institute for Infocomm Research
Sivaji Bandyopadhyay, Jadavpur University
Pushpak Bhattacharyya, IIT Bombay
Nicola Cancedda, Xerox Research Center
Somnath Chandra, MIT, Govt. of India
Wenliang Chen, Institute for Infocomm Research
Patrick Saint Dizier, IRIT, Université Paul Sabatier
Xiangyu Duan, Institute for Infocomm Research
Nicola Ferro, University of Padua
Cyril Goutte, National Research Council of Canada
Gareth Jones, Dublin City University
Joemon Jose, University of Glasgow
A Kumaran, Microsoft Research India
Jun Lang, Institute for Infocomm Research
Swaran Lata, MIT, Govt. of India
Gina-Anne Levow, National Centre for Text Mining (UK)
Qun Liu, Institute of Computing Technology, CAS
Yang Liu, Institute of Computing Technology, CAS
Mandar Mitra, ISI Kolkata
Doug Oard, University of Maryland, College Park
Carol Peters, Istituto di Scienza e Tecnologie dell'Informazione and CLEF campaign
Paolo Rosso, Technical University of Valencia
Sudeshna Sarkar, IIT Kharagpur
Hendra Setiawan, University of Maryland
L Sobha, AU-KBC, Chennai
Rohini Srihari, University at Buffalo, SUNY
Ralf Steinberger, European Commission - Joint Research Centre, Italy
Le Sun, Institute of Software, CAS
Vasudeva Varma, IIIT Hyderabad
Thuy Vu, Institute for Infocomm Research
Haifeng Wang, Baidu
Yunqing Xia, Tsinghua University, China
Min Zhang, Institute for Infocomm Research
Guodong Zhou, Soochow University
Chengqing Zong, Institute of Automation, CAS
Raghavendra Udupa, Microsoft Research


Invited Speaker:

Haifeng Wang, Baidu


Table of Contents

Web-based Machine Translation
Haifeng Wang . . . . . . . . . . . . . . . . . . . . 1

Using Explicit Semantic Analysis for Cross-Lingual Link Discovery
Petr Knoth, Lukas Zilka and Zdenek Zdrahal . . . . . . . . . . . . . . . . . . . . 2

Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources
Siva Reddy and Serge Sharoff . . . . . . . . . . . . . . . . . . . . 11

Integrate Multilingual Web Search Results using Cross-Lingual Topic Models
Duo Ding . . . . . . . . . . . . . . . . . . . . 20

Soundex-based Translation Correction in Urdu–English Cross-Language Information Retrieval
Manaal Faruqui, Prasenjit Majumder and Sebastian Pado . . . . . . . . . . . . . . . . . . . . 25

Unsupervised Russian POS Tagging with Appropriate Context
Li Yang, Erik Peterson, John Chen, Yana Petrova and Rohini Srihari . . . . . . . . . . . . . . . . . . . . 30

Extending a multilingual Lexical Resource by bootstrapping Named Entity Classification using Wikipedia's Category System
Johannes Knopp . . . . . . . . . . . . . . . . . . . . 35


Conference Program

Saturday, November 13, 2011

8:35–8:45 Opening Remarks

8:45–10:00 Keynote Speech

Web-based Machine Translation
Haifeng Wang

10:00–10:30 Break

10:30–11:10 Using Explicit Semantic Analysis for Cross-Lingual Link Discovery
Petr Knoth, Lukas Zilka and Zdenek Zdrahal

11:10–11:50 Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources
Siva Reddy and Serge Sharoff

11:50–14:00 Lunch

14:00–14:30 Integrate Multilingual Web Search Results using Cross-Lingual Topic Models
Duo Ding

14:30–15:00 Soundex-based Translation Correction in Urdu–English Cross-Language Information Retrieval
Manaal Faruqui, Prasenjit Majumder and Sebastian Pado

15:00–15:30 Unsupervised Russian POS Tagging with Appropriate Context
Li Yang, Erik Peterson, John Chen, Yana Petrova and Rohini Srihari

15:30–16:00 Break

16:00–16:40 Extending a multilingual Lexical Resource by bootstrapping Named Entity Classification using Wikipedia's Category System
Johannes Knopp

16:40–17:00 Closing


Web-based Machine Translation

Haifeng Wang
Baidu
Beijing, 100085
[email protected]

1 Abstract

Machine translation (MT) has been studied for more than 60 years. The World Wide Web offers new opportunities for MT. We can try to crawl more web data to train the MT system, but we have to filter the very noisy web data. There are many potential web-based applications for MT, such as translation of web pages, instant messages, social network content, and e-commerce content, as well as mobile translation. To make better use of the web data, and to produce better web-based MT applications, we should also adapt the MT methods to the web scenario. In this talk, I will introduce our work on web-based machine translation.

2 Biography

Dr. WANG Haifeng is a senior scientist at Baidu and a visiting professor at Harbin Institute of Technology. At Baidu, he is the head of Baidu's NLP department, the advisor of its speech team, the technical leader of its recommendation & personalization team, and one of the core members of Baidu's technology committee. He received his PhD in computer science from Harbin Institute of Technology in 1999. He worked as an associate researcher at Microsoft Research China (1999–2000), a research scientist at iSilk.com, Hong Kong (2000–2002), and chief research scientist and deputy director at the Toshiba (China) R&D Center until January 2010. He has authored more than 70 NLP papers, including 13 full papers at ACL main conferences. His research interests span a wide range of topics including MT (SMT, RBMT, EBMT, TM and hybrid methods), parsing, generation, grammar induction, paraphrase, collocation extraction, SRL, WSD, LM, recommendation, personalization, speech and search. He has served as program chair, area chair, tutorial chair, workshop chair, industry track chair and PC member for numerous NLP conferences including ACL, SIGIR, NAACL, EMNLP, COLING and IJCNLP. He also serves as associate editor of ACM TALIP and guest editor of ACM TIST. He is the Vice-President-Elect of the ACL.


Using Explicit Semantic Analysis for Cross-Lingual Link Discovery

Petr Knoth
KMi, The Open University
[email protected]

Lukas Zilka
KMi, The Open University
[email protected]

Zdenek Zdrahal
KMi, The Open University
[email protected]

Abstract

This paper explores how to automatically generate cross-language links between resources in large document collections. The paper presents new methods for Cross-Lingual Link Discovery (CLLD) based on Explicit Semantic Analysis (ESA). The methods are applicable to any multilingual document collection. In this report, we present their comparative study on the Wikipedia corpus and provide new insights into the evaluation of link discovery systems. In particular, we measure the agreement of human annotators in linking articles in different language versions of Wikipedia, and compare it to the results achieved by the presented methods.

1 Introduction

Cross-referencing documents is an essential part of organising textual information. However, keeping links in large, quickly growing document collections up to date is problematic due to the number of possible connections. In multilingual document collections, interlinking semantically related information in a timely manner becomes even more challenging. Suitable software tools that could facilitate the link discovery process by automatically analysing the multilingual content are currently lacking. In this paper, we present new methods for Cross-Lingual Link Discovery (CLLD) applicable across different types of multilingual textual collections.

Our methods are based on Explicit Semantic Analysis (ESA), introduced by Gabrilovich and Markovitch (2007). ESA is a method that calculates the semantic relatedness of two texts by mapping their term vectors to a high dimensional space (typically, but not necessarily, the space of Wikipedia concepts) and by calculating the similarity between these vectors (instead of comparing them directly). The method has received much attention in recent years and has also been extended to a multilingual version called Cross-Lingual Explicit Semantic Analysis (CL-ESA) (Sorg and Cimiano, 2008). To the best of our knowledge, this method has not yet been applied in the context of automatic link discovery systems.
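To make the ESA step concrete, the following is a minimal sketch of the general technique under a standard tf-idf formulation; it is not the authors' implementation, and all names (build_esa_index, esa_vector, cosine) are illustrative:

    import math
    from collections import defaultdict

    def build_esa_index(concept_docs):
        """Build a term -> {concept id: tf-idf weight} inverted index from a
        background collection (e.g. Wikipedia articles), one entry per concept."""
        df = defaultdict(int)          # document frequency of each term
        tf = []                        # term frequencies per concept document
        for doc in concept_docs:
            counts = defaultdict(int)
            for term in doc.lower().split():
                counts[term] += 1
            tf.append(counts)
            for term in counts:
                df[term] += 1
        n = len(concept_docs)
        index = defaultdict(dict)
        for cid, counts in enumerate(tf):
            for term, c in counts.items():
                index[term][cid] = c * math.log(n / df[term])
        return index

    def esa_vector(text, index):
        """Map a text to a weighted vector of background concepts."""
        vec = defaultdict(float)
        for term in text.lower().split():
            for cid, w in index.get(term, {}).items():
                vec[cid] += w
        return vec

    def cosine(u, v):
        """Cosine similarity of two sparse vectors stored as dicts."""
        dot = sum(w * v.get(c, 0.0) for c, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def esa_relatedness(text1, text2, index):
        """Relatedness of two texts = cosine of their ESA concept vectors."""
        return cosine(esa_vector(text1, index), esa_vector(text2, index))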

Since the CLLD field is relatively young, it is also important to establish a constructive means for evaluating these systems. Our paper provides insight into this problem by investigating the agreement/reliability of man-made links and by presenting a possible approach for the definition of ground truth, i.e. a gold standard.

The paper makes the following contributions:

(a) It applies Explicit Semantic Analysis to the link discovery and CLLD tasks.

(b) It provides new insights into the evaluation of CLLD systems and into the way people link information in different languages, as measured by their agreement.

2 Related Work

CLLD Methods

Current approaches to link detection can be divided into three groups:

(1) link-based approaches discover new links by exploiting an existing link graph (Itakura and Clarke, 2008; Jenkinson et al., 2008; Lu et al., 2008).

(2) semi-structured approaches try to discover new links using semi-structured information, such as anchor texts or document titles (Geva, 2007; Dopichaj et al., 2008; Granitzer et al., 2008; Milne and Witten, 2008; Mihalcea and Csomai, 2007).


(3) purely content-based approaches use only plain text as input. They typically discover related resources by calculating semantic similarity based on document vectors (Allan, 1997; Green, 1998; Zeng and Bloniarz, 2004; Zhang and Kamps, 2008; He, 2008). Some of the mentioned approaches, such as (Lu et al., 2008), combine multiple approaches. To the best of our knowledge, no approach has so far been reported to use Explicit Semantic Analysis to address this task.

The main disadvantage of the link-based and semi-structured approaches is probably the difficulty associated with porting them across different types of document collections. The two well-known solutions to monolingual link detection, Geva's and Itakura's algorithms (Trotman et al., 2009), fit in these two categories. While these algorithms have been demonstrated to be effective on a specific Wikipedia set, their performance decreased significantly when they were applied to the slightly different task of interlinking two encyclopedia collections. Purely content-based methods have mostly been found to produce slightly worse results than the two previous classes of methods; however, their advantage is that their performance should remain stable across different document collections. As a result, they can always be used as part of any link discovery system and can even be combined with domain-specific methods that make use of the link graph or semi-structured information. In practice, domain-specific link discovery systems can achieve high precision and recall. For example, Wikify! (Mihalcea and Csomai, 2007) and the link detector presented by Milne and Witten (2008) can be used to identify suitable anchors in text and enrich it with links to Wikipedia by combining multiple approaches with domain knowledge.

In this paper, we present four methods (three purely content-based and one combining the link-based and content-based approach) for CLLD based on CL-ESA. Measuring semantic similarity using ESA has previously been shown to produce better results than calculating it directly on document vectors using cosine and other similarity measures, and it has also been found to outperform the results that can be obtained by measuring similarity on vectors produced by Latent Semantic Analysis (LSA) (Gabrilovich and Markovitch, 2007). Therefore, the cross-lingual extension of ESA seems a plausible choice.

Evaluation of link discovery systems

The evaluation of link discovery systems is currently problematic as there is no widely accepted gold standard. Manual development of such a standard would be costly, because: (a) the number of possible links is very high even for small collections, (b) the link generation task is subjective (Ellis et al., 1994) and (c) it is not entirely clear how the link generation task should be defined in terms of link granularity (for example, document-to-document links, anchor-to-document links, anchor-to-passage links, etc.). Developing such a CLLD corpus manually would be even more complicated.

As a result, Wikipedia links were extracted and taken as the gold standard (ground truth) in a comparative evaluation in (Huang et al., 2008). Since the authors admit that Wikipedia links are not perfect (the validity of existing links is sometimes questionable and useful links may be missing), the comparative evaluation of methods and systems should be considered informative only. For example, it would be naïve to expect that measuring precision/recall characteristics would be accurate.

In this paper we discuss the issues in automatically defining the ground truth for CLLD systems. We take into account the differences in the way people link content in different languages to assess the agreement between the different language versions, with the goal of finding out how well our system performs. Our experiments are conducted on the Wikipedia dataset; however, we use the articles only as a set of documents, abstracting from Wikipedia's encyclopedic nature.

3 The CLLD methods

This section describes the methods used in our experiments. The whole process of cross-language link detection is shown in Figure 1. The method takes as input a new "orphan" document (i.e. a document that is not linked to other documents) written in the source language and automatically generates a ranked list of documents written in the target language (the suitable link targets for the source document). The task involves two steps: the cross-language step and the link generation step. We have experimented with four different CLLD methods, CL-ESA2Links, CL-ESADirect, CL-ESA2ESA and CL-ESA2Similar, which will be described later on. The names of the methods are derived from the approach applied in the first and the second step. These methods have different characteristics and would be useful in different scenarios.

Figure 1: Cross-language link discovery process

In the first step, an ESA vector is calculated for each document in the document collection. This results in a weighted vector of Wikipedia concepts for each document in the target language. The cardinality of the vector is given by the number of concepts (pages) in the target language version of Wikipedia (i.e. about 3.8 million for English, 764,000 for Spanish, etc.). A similar procedure is applied to the orphan document; however, the source language version of ESA is used. The resulting ESA vector is then compared to the ESA vectors that represent documents in the target language collection (the CL-ESA approach). A set of candidate vectors representing documents in the target language is acquired as the output of the cross-language step; see Section 3.1.

In the second step, the candidate vectors are taken as a seed and are used to discover documents that are suitable link targets. The four different approaches used in this step distinguish the above-mentioned methods; see Section 3.2.

Figure 2: CLLD candidates

3.1 The cross-language step

The main rationale for the cross-language step is to find t suitable candidates in the target language that can later be exploited to identify link targets. Target language documents that are semantically similar to the source language document are considered by our methods as suitable candidates. To identify such documents, the ESA vector of the source document is compared to the ESA vectors of documents in the target document collection.

Each dimension in an ESA vector expresses the similarity of a document to the given language version of a Wikipedia concept/article. Therefore, the cardinality of the source document vector is different from the cardinality of the vectors representing the documents in the target language collection (Figure 2). In order to calculate the similarity of two vectors, we map the dimensions that correspond to the same Wikipedia concepts in different language versions. In most cases, if a Wikipedia concept is mapped to another language version, there is a one-to-one correspondence between the articles in those two languages. However, there are cases when one page in the source language is mapped to more than one page in the target language and vice versa.¹ For the purpose of similarity calculation, we use the 100 dimensions with the highest weight that are mappable from the source to the target language. The number of candidates to be extracted is controlled by the parameter t. We have experimentally found that its selection has a significant impact on the performance of our methods.

¹ These multiple mappings appear quite rarely, e.g. in 5,889 cases out of 550,134 for Spanish to English and in 2,528 cases out of 163,715 for Czech to English.
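The cross-language step can then be sketched as follows: project the source-language ESA vector into the target concept space through the interlanguage links, keep the 100 highest-weighted mappable dimensions, and rank target documents by cosine similarity, retaining the top t as candidates. This sketch reuses the cosine helper from the earlier listing; interlang and tgt_vectors are hypothetical stand-ins for the data structures the paper assumes:

    def map_esa_vector(src_vec, interlang, k=100):
        """Project a source-language ESA vector into the target concept space
        via interlanguage links (interlang: src concept id -> tgt concept id).
        As in the paper, only the k highest-weighted mappable dimensions are
        kept; rare one-to-many mappings are ignored in this sketch."""
        mappable = [(c, w) for c, w in src_vec.items() if c in interlang]
        mappable.sort(key=lambda cw: cw[1], reverse=True)
        return {interlang[c]: w for c, w in mappable[:k]}

    def cross_language_step(src_vec, tgt_vectors, interlang, t=10):
        """Return the t target-language documents most similar to the source
        document (the candidate set; t is the tuned parameter)."""
        mapped = map_esa_vector(src_vec, interlang)
        scored = [(doc_id, cosine(mapped, vec))
                  for doc_id, vec in tgt_vectors.items()]
        scored.sort(key=lambda ds: ds[1], reverse=True)
        return scored[:t]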


Figure 3: Schematic illustration of the four approaches used by the CLLD methods.

3.2 The link generation step

In the link generation step, the candidate documents are taken and used to produce a ranked list of targets for the original source document. The following approaches, schematically illustrated in Figure 3, are taken by our four methods:

• CL-ESA2Links - This method requires access to the link structure in the target collection. More precisely, the method takes the original orphan document in the source language and tries to link it to an already interlinked target language collection. After applying CL-ESA in the first step, existing links are extracted from the candidate documents. The link targets are then ranked according to their similarity to the source document, i.e. documents that are more similar are ranked higher. This list is then used as a collection of link targets.

• CL-ESADirect - This method applies CL-ESA to the source document and takes the list of candidates directly as link targets.

• CL-ESA2ESA - In this method, the application of CL-ESA is followed by another application of monolingual ESA, which measures the semantic similarity of the candidates with all documents in the document collection, to identify link targets.

• CL-ESA2Similar - Instead of generating the ranked list of link targets using monolingual ESA as in the previous method, which is computationally expensive, we calculate a vector sum from the candidate list of ESA document vectors. We then select strong Wiki concepts representing these dimensions as the set of targets. This is equivalent to calculating cosine similarity using tf-idf vectors. Though much quicker, the main disadvantage is that if we wanted to use this method on another set than Wikipedia, ESA would have to be used with a different background collection.

All of the methods have different properties. CL-ESA2Links requires knowledge of the link graph in the target document collection. CL-ESA2ESA and CL-ESADirect are two methods that are universal, i.e. can easily be applied to any document collection. The difference between them is that the former requires significantly fewer document vector comparisons than the latter. CL-ESA2Similar works almost as fast as CL-ESADirect, but it has the disadvantage that ESA has to be used with the specific document collection as a background.
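As one example of the link generation step, here is a sketch of the CL-ESA2Links variant under the assumptions above: existing links are pulled out of each candidate document and ranked by the similarity of the candidate they came from. tgt_link_graph is a hypothetical mapping from a document id to the ids it links to; this is an illustration, not the authors' code:

    def cl_esa2links(src_vec, tgt_vectors, tgt_link_graph, interlang, t=10):
        """Rank link targets extracted from the t candidate documents by the
        similarity of the candidate to the source document."""
        ranked = {}
        for doc_id, sim in cross_language_step(src_vec, tgt_vectors,
                                               interlang, t):
            for target in tgt_link_graph.get(doc_id, []):
                # keep the best similarity score seen for each link target
                ranked[target] = max(ranked.get(target, 0.0), sim)
        return sorted(ranked.items(), key=lambda ts: ts[1], reverse=True)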

4 The underlying data

Wikipedia has been used as a corpus for the evaluation of the methods. This decision has the following advantages, which make it possible for us to test and analyse the methods on a real use case:

• A very large multilingual text collection.

• The articles are well interlinked and the interlinking has been approved by a large community of users.

• A large proportion of articles contain explicit mappings between different language versions.

In our study, we have experimented with the English, Spanish and Czech language versions of Wikipedia. We consider the cases of linking from Spanish to English and from Czech to English, i.e. from a less resourced language to a more resourced one. We believe that this is the more interesting direction for CLLD methods, as the target language version is more likely to contain relevant information not available in the source language. The language selection has been motivated by the aim to test the methods in two very different environments. The Spanish version is relatively well resourced, containing 764,095 pages (about four times fewer than English); the Czech language is much less resourced, containing 196,494 pages (about four times fewer than Spanish).

5 Evaluation methodology

One of the main obstacles to systematically improving link discovery systems is the difficulty of evaluating the results. Reliable evaluation is problematic due to both technical and cognitive aspects. The difficulty in obtaining the "ground truth" for a sufficiently large dataset is caused both by the lack of human resources to manually annotate a very large number of document combinations, and by the inherent subjectivity of the task. As a result, we find it essential to estimate the agreement between annotators and to see to what extent the precision and recall characteristics can be measured with respect to interlinked document collections.

We claim that the decision to link two pieces of information is made at the level of semantics, i.e. the annotator has to understand the concepts/ideas described in two documents to decide if they should be connected by a link. We claim that this process should be language independent. Thus, an article about London will be related to an article about the United Kingdom regardless of the language the articles are written in.

Therefore, let us define the link generation task in the following way: Given a document² in the source language, find documents in the target language that are suitable link targets for the source document, i.e. there is a semantic relationship between the source document and the linked target documents.

Based on the definition, the ground truth for a topic document d is the set of documents that can be considered (semantically) suitable link targets. Though this set is typically unknown to us, we can in our experiment approximate it by taking the existing Wikipedia links as ground truth. Because the Wikipedia link structure has been agreed by a large number of contributing authors, it is likely to be relatively consistent in comparison to content that would be linked by just a single person. To establish the ground truth for the original source document, we can extract all links originating in the source document and pointing to other documents. Since the process of linking information is performed at the semantic level, and is thus language independent, we can enrich our ground truth with link graphs from different language versions of Wikipedia. This enlarges the ground truth, which has two consequences: (1) it increases the reliability of the evaluation, as many relevant links are often omitted (Knoth et al., 2010); (2) it makes it more difficult to achieve higher recall.

² The term topic is also sometimes used to refer to the document.

6 Results

6.1 Experimental setup

The experiment was carried out for two language pairs: Spanish to English and Czech to English. We denote the source language $L_{source}$ and the target language $L_{target}$. The input for the different CLLD methods is two document sets:

• Let $SOURCE_{L_{source}}$ be the set of topic documents, selected as pages that contain a Wikipedia link between different language versions. In our case, 100 pages were selected.

• Let $TARGET_{L_{target}}$ be the collection of documents in the target language from which the link targets are selected. In our case, this collection contains all (3.8 million) Wikipedia pages in English.

The output of the method is a set (ranked list) $LIST_{result} = \langle TARGET_{L_{target}}, score \rangle$. To establish the ground truth we define:

• Let $\rho$ be the mapping from documents in the source language to their target language versions, $\rho : D_{L_{source}} \to D_{L_{target}}$.

• Let $SOURCE_{L_{target}}$ be the set of topic documents mapped to the target language, $SOURCE_{L_{target}} = \rho(SOURCE_{L_{source}})$.

• Let $\alpha, \beta$ be the mappings from documents to the other documents they link to in the source and target language respectively, $\alpha : D_{L_{source}} \to D_{L_{source}}$ and $\beta : D_{L_{target}} \to D_{L_{target}}$.


Then we define the ground truth (GT) as the union of ground truths for different language versions; in this experiment, it is the union of the ground truths for the source and target language:

$$GT = \alpha(SOURCE_{L_{source}}) \cup \beta(SOURCE_{L_{target}})$$

A given generated item $\langle d, score \rangle \in LIST_{result}$ is evaluated as a hit if and only if $d \in GT$.
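A sketch of this evaluation protocol, assuming alpha, beta and rho are available as plain dictionaries (hypothetical names mirroring the definitions above):

    def ground_truth(alpha, beta, rho, src_doc):
        """GT = alpha(src_doc) union beta(rho(src_doc)): the links of the
        source document plus those of its target-language version."""
        return set(alpha.get(src_doc, ())) | set(beta.get(rho[src_doc], ()))

    def precision_recall(result_list, gt):
        """A generated item <d, score> counts as a hit iff d is in GT."""
        hits = sum(1 for d, _score in result_list if d in gt)
        precision = hits / len(result_list) if result_list else 0.0
        recall = hits / len(gt) if gt else 0.0
        return precision, recall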

6.2 Methods evaluation

To investigate the performance of the first part of CLLD, the cross-language step carried out by CL-ESA, we have analysed how well the system finds, for a given topic document in the source language, the duplicate document in the target language. In this step, the system takes a document in the source language and selects, from the 3.8 million document set in the target language, the documents with the highest similarity. We then check if the duplicate document ($d = \rho(d_{source})$) appears among the top k retrieved documents. The experiment is repeated for all examples in $SOURCE_{L_{source}}$ and the results are then averaged (Figure 4). The graph suggests that the method performs well, as the document often appears among the first few results. In about 65% of cases, the document is found among the first 50 retrieved items. We believe that if the set of candidates (controlled by the t parameter) contains this document, the CLLD method is likely to produce better results; this is especially true for the CL-ESA2Links method.

The overall results for all the methods are presented in Figure 5. We have experimentally set t = 10 for Spanish to English and t = 3 for Czech to English CLLD. CL-ESA2Links performed best in the experiments, achieving 0.2 precision at 0.3 recall. CL-ESA2Similar performed best of the purely content-based methods.

[Figure 4: The probability (y-axis) of finding the target language version of a given source language document using CL-ESA in the top k retrieved documents (x-axis). Drawn as a cumulative distribution function for Spanish (es) and Czech (cs).]

Though the precision/recall might seem quite low, a number of things should be taken into account:

• A significant number of potentially useful links is still missing from our ground truth, because people typically do not intend to link all relevant information. As a result, many potentially useful connections are not explicitly present in Wikipedia (Knoth et al., 2010). The problem can be partly mitigated by combining the ground truth from more language versions. Another approach is to measure the agreement instead of precision/recall characteristics (see Section 6.3).

• A significant number of links in Wikipedia are conceptual links. These links do not express a particularly strong relationship at the article level. This makes it very difficult for the purely content-based methods to find them, which results in low recall. It seems that CL-ESA2Links is the only method that does not suffer from this issue.

• The experiment settings make it hard for the methods to achieve high precision/recall performance. The $TARGET_{L_{target}}$ set contains 3.8 million articles, out of which the methods are supposed to identify on average just a small subset of target documents. More precisely, in Spanish to English CLLD our ground truth contains on average 341 target documents with standard deviation 293; in Czech to English, it contains on average 382 target documents with standard deviation 292.

6.3 Measuring the agreement

To assess the subjectivity of the link generation task and to investigate the reliability of the acquired ground truth, we have compared the link structures from different language versions of Wikipedia.


[Figure 5: The precision (y-axis)/recall (x-axis) graphs for Spanish to English (left) and Czech to English (right) CLLD methods (CL-ESA2Links, CL-ESA2Similar, CL-ESA2ESA, CL-ESADirect).]

Spanish vs English
            Y_en      N_en           N/A_en
  Y_es      5,563     10,201         3,934
  N_es      15,715    539,299,641    99,191,766
  N/A_es    5,781     321,326,145    0

Czech vs English
            Y_en      N_en           N/A_en
  Y_cz      4,308     8,738          2,194
  N_cz      12,961    392,411,445    7,501,806
  N/A_cz    9,790     356,532,740    0

Table 1: The agreement of Spanish and English Wikipedia and of Czech and English Wikipedia on their link structures, calculated and summed for all pages in $SOURCE_{es}$ and $SOURCE_{cz}$ respectively. Y indicates yes, N no, and N/A not available/no decision.

We have iterated over the set of topics from $SOURCE_{L_{source}}$ and recorded, for each document in $TARGET_{L_{target}}$ at each step, whether it is a valid link target (yes, Y) or not a valid link target (no, N) for the given source document in each language, thus measuring the agreement between the link structures in different languages. The results are presented in Table 1.

As demonstrated in Figure 6, a subset of Wikipedia pages cannot be mapped to other language versions: either the semantically equivalent page does not exist or the cross-language link is missing. These links were classified as no decision/not available (N/A). The mappable documents were classified in a standard way according to their appearance in the link graphs of the language versions. Only these links are taken into account while measuring the agreement.

[Figure 6: Individual cases of agreement/disagreement/no decision (not available) for two language versions of Wikipedia link graphs.]

A common way to assess inter-annotator agreement between two raters in Information Retrieval is using Cohen's Kappa, calculated as:

$$\kappa = \frac{Pr(a) - Pr(e)}{1 - Pr(e)},$$

where $Pr(a)$ is the relative observed frequency of agreement and $Pr(e)$ is the hypothetical probability of chance agreement. $Pr(a)$ is typically calculated as

$$Pr(a) = \frac{|Y,Y| + |N,N|}{|Y,Y| + |Y,N| + |N,Y| + |N,N|}.$$

Since there is a strong agreement on the negative decisions, this probability will be close to 1. If we ignore the $|N,N|$ cases, which do not carry any useful information, the formula becomes:

$$Pr(a) = \frac{|Y,Y|}{|Y,Y| + |Y,N| + |N,Y|}.$$

The probability of a random agreement is extremely low, because the probability of a link connecting any two pages is approximately:³

$$p_{link} = \frac{|links|}{|pages|^2} = \frac{78.3M}{3.2M^2} = 0.000007648.$$

[Figure 7: The agreements of the Spanish to English (left) and Czech to English (right) CLLD methods with $GT_{es,en}$ and $GT_{cz,en}$ respectively. The y-axis shows the agreement strength and the x-axis the number of generated examples as a fraction of the number of examples in the ground truth.]

Thus, the hypothetical number of items appearing in the $Y,Y$ class by chance is $p_{link}^2 \cdot (|Y,Y| + |Y,N| + |N,Y| + |N,N|)$. This formula estimates the number of agreements achieved by chance. In our case the value is much smaller than 1, hence $Pr(e)$ is close to 0. Therefore, we can calculate the agreement for English and Spanish as:

$$\kappa_{en,es} = \frac{5{,}563}{31{,}479} = 0.177.$$

The agreement for Czech and English is:

$$\kappa_{en,cz} = \frac{4{,}308}{26{,}007} = 0.166.$$
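These values can be reproduced directly from the counts in Table 1 with the reduced formula; the following is a quick sanity check, not part of the original paper:

    def kappa_reduced(yy, yn, ny):
        # Cohen's kappa with Pr(e) ~ 0 and the uninformative |N,N| cell ignored
        return yy / (yy + yn + ny)

    print(round(kappa_reduced(5563, 10201, 15715), 3))  # Spanish-English: 0.177
    print(round(kappa_reduced(4308, 8738, 12961), 3))   # Czech-English: 0.166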

These values indicate a relatively low inter-annotator agreement. We believe that the fact that such a low agreement has been measured is very interesting, particularly because the link structure in Wikipedia is the result of a collaborative effort of many contributors. Therefore, we would expect that even lower agreement might be observed in other types of text collections.

Motivated by the previous findings, we have calculated the agreement between the output of our methods and the link graphs present in different language versions of Wikipedia. We were especially interested to find out if this agreement is significantly different from the agreement measured between different language versions of Wikipedia. We have generated with our CLLD methods 100% of $|GT|$ links for every orphan document in $SOURCE_{L_{source}}$, i.e. if a particular document is linked in Wikipedia to 57 documents, we generate 57 links. We have then measured the agreement for each topic document and averaged the agreement values. The results of the experiment for Spanish to English and Czech to English CLLD are shown in Figure 7. They suggest that CL-ESA2Links achieved a level of agreement comparable to that of human annotators. A very reasonable level of agreement has also been measured for CL-ESA2Similar, especially for the first 10% of the generated links. CL-ESADirect and CL-ESA2ESA exhibit a lower level of agreement.

³ Following the official Wikipedia statistics. Though different language versions have different $p_{link}$, the differences do not affect the results.

7 Conclusion

In this paper, we have presented and evaluated four different methods for Cross-Language Link Discovery (CLLD). We have used Cross-Lingual Explicit Semantic Analysis as a key component in the development of the four presented methods. The results suggest that methods that are aware of the link graph in the target language achieve slightly better results than those that identify links in the target language only by calculating semantic similarity. However, the former methods cannot be applied in all document collections, and thus the latter methods are valuable. Though it might seem at first sight that CLLD methods do not provide very high precision and recall, we have shown that their performance can, in fact, reach the results achieved by human annotators.


References

James Allan. 1997. Building hypertext using information retrieval. Inf. Process. Manage., 33:145–159, March.

Philipp Dopichaj, Andre Skusa, and Andreas Heß. 2008. Stealing anchors to link the wiki. In Geva et al. (2009), pages 343–353.

David Ellis, Jonathan Furner-Hines, and Peter Willett. 1994. On the measurement of inter-linker consistency and retrieval effectiveness in hypertext databases. In SIGIR '94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 51–60, New York, NY, USA. Springer-Verlag New York, Inc.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611.

Shlomo Geva, Jaap Kamps, and Andrew Trotman, editors. 2009. Advances in Focused Retrieval, 7th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2008, Dagstuhl Castle, Germany, December 15-18, 2008. Revised and Selected Papers, Lecture Notes in Computer Science. Springer.

Shlomo Geva. 2007. GPX: Ad-hoc queries and automated link discovery in the Wikipedia. In Norbert Fuhr, Jaap Kamps, Mounia Lalmas, and Andrew Trotman, editors, INEX, Lecture Notes in Computer Science. Springer.

Michael Granitzer, Christin Seifert, and Mario Zechner. 2008. Context based Wikipedia linking. In Geva et al. (2009), pages 354–365.

Stephen J. Green. 1998. Automated link generation: can we do better than term repetition? Comput. Netw. ISDN Syst., 30(1-7):75–84.

Jiyin He. 2008. Link detection with Wikipedia. In Geva et al. (2009), pages 366–373.

Wei Che Huang, Andrew Trotman, and Shlomo Geva. 2008. Experiments and evaluation of link discovery in the Wikipedia.

Kelly Y. Itakura and Charles L. A. Clarke. 2008. University of Waterloo at INEX 2008: Adhoc, book, and link-the-wiki tracks. In Geva et al. (2009), pages 132–139.

Dylan Jenkinson, Kai-Cheung Leung, and Andrew Trotman. 2008. Wikisearching and wikilinking. In Geva et al. (2009), pages 374–388.

Petr Knoth, Jakub Novotny, and Zdenek Zdrahal. 2010. Automatic generation of inter-passage links based on semantic similarity. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 590–598, Beijing, China, August.

Wei Lu, Dan Liu, and Zhenzhen Fu. 2008. CSIR at INEX 2008 link-the-wiki track. In Geva et al. (2009), pages 389–394.

Rada Mihalcea and Andras Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM '07, pages 233–242, New York, NY, USA. ACM.

David Milne and Ian H. Witten. 2008. Learning to link with Wikipedia. In James G. Shanahan, Sihem Amer-Yahia, Ioana Manolescu, Yi Zhang, David A. Evans, Aleksander Kolcz, Key-Sun Choi, and Abdur Chowdhury, editors, CIKM, pages 509–518. ACM.

Philipp Sorg and Philipp Cimiano. 2008. Cross-lingual information retrieval with explicit semantic analysis. In Working Notes for the CLEF 2008 Workshop.

Andrew Trotman, David Alexander, and Shlomo Geva. 2009. Overview of the INEX 2010 link the wiki track.

Jihong Zeng and Peter A. Bloniarz. 2004. From keywords to links: an automatic approach. Information Technology: Coding and Computing, International Conference on, 1:283.

Junte Zhang and Jaap Kamps. 2008. A content-based link detection approach using the vector space model. In Geva et al. (2009), pages 395–400.


Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources

Siva Reddy
Lexical Computing Ltd, UK
[email protected]

Serge Sharoff
University of Leeds, UK
[email protected]

Abstract

Indian languages are known to have a large speaker base, yet some of these languages have minimal or inefficient linguistic resources. For example, Kannada is relatively resource-poor compared to Malayalam, Tamil and Telugu, which in turn are relatively poor compared to Hindi. Many Indian language pairs exhibit high similarities in morphology and syntactic behaviour; e.g., Kannada is highly similar to Telugu. In this paper, we show how to build a cross-language part-of-speech tagger for Kannada exploiting the resources of Telugu. We also build large corpora and a morphological analyser (including lemmatisation) for Kannada. Our experiments reveal that cross-language taggers are as efficient as monolingual taggers. We aim to extend our work to other Indian languages. Our tools are efficient and significantly faster than the existing monolingual tools.

1 Introduction

Part-of-speech (POS) taggers are among the basic tools for natural language processing in any language. For example, they are needed for terminology extraction using linguistic patterns or for selecting word lists in language teaching and lexicography. At the same time, many languages lack POS taggers. One reason for this is the lack of other basic resources like corpora, lexicons or morphological analysers. With the advent of the Web, collecting corpora is no longer a major problem (Kilgarriff et al., 2010). With technical advances in lexicography (Atkins and Rundell, 2008), building lexicons and morphological analysers is also possible to a considerable extent.

The other reason for the lack of POS taggers is partly the lack of researchers working on a particular language. Due to this, some languages do not have any annotated data from which to build efficient taggers.

Cross-language research mainly aims to build tools for a resource-poor language (the target language) using the resources of a resource-rich language (the source language). If the target language is typologically related to the source one, it is possible to rely on the resource-rich language.

In this work, we aim to find out whether cross-language tools for Indian languages are as efficient as existing monolingual tools. As a use case, we experiment with the resource-poor language Kannada, building various cross-language POS taggers using the resources of its typologically related and relatively resource-rich language Telugu. Our POS taggers can also be used as a morphological analyser, since our POS tags include morphological information. We also build a lemmatiser for Kannada which uses POS tag information to choose the relevant lemma from the set of plausible lemmas.

2 Related Work

There are several methods for building POS taggers for a target language using source language resources. Some researchers (Yarowsky et al., 2001; Yarowsky and Ngai, 2001; Das and Petrov, 2011) built POS taggers for a target language using parallel corpora. The source (cross) language is expected to have a POS tagger. First, the source language tools annotate the source side of the parallel corpora. These annotations are then projected to the target language side using the alignments in the parallel corpora, creating virtual annotated corpora for the target language. A POS tagger for the target language is then built from the virtual annotated corpora. Other methods which make use of parallel corpora are (Snyder et al., 2008; Naseem et al., 2009). These approaches are based on hierarchical Bayesian models and Markov Chain Monte Carlo sampling techniques. They aim to benefit from information shared across languages. The main disadvantage of all such methods is that they rely on parallel corpora, which are themselves a costly resource for resource-poor languages.

Hana et al. (2004) and Feldman et al. (2006) propose a method for developing a POS tagger for a target language using the resources of another typologically related language. Our method is motivated by theirs, but focuses on the resources available for Indian languages.

2.1 Hana et al. (2004)

Hana et al. aim to develop a tagger for Russian from Czech using TnT (Brants, 2000), a second-order Markov model. Though Czech and Russian are free word order languages, they argue that TnT is as efficient as other models.

The TnT tagger is based on two probabilities: the transition and emission probabilities. The tag sequence of a given word sequence is selected by calculating

$$\operatorname*{argmax}_{t_1 \ldots t_n} \left[ \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i) \right] \quad (1)$$

where $w_1 \ldots w_n$ is the word sequence and $t_1 \ldots t_n$ are their corresponding POS tags.

Transition probabilities, P(t_i | t_{i-1}, t_{i-2}), describe the conditional probability of a tag given the tags of the previous words. Based on the intuition that transition probabilities across typologically related languages remain the same, Hana et al. treat the transition probabilities of Russian as the same as those of Czech.

Emission probabilities, P(w_i | t_i), describe the conditional probability of a word given a tag. It is not straightforward to estimate emission probabilities from a cross-language. Instead, Hana et al. develop a light paradigm-based (a set of rules) lexicon for Russian which emits all the possible tags for a given word form. The distribution over all the tags of a word is treated as uniform. Using this assumption, surrogate emission probabilities for Russian are estimated without using Czech.
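To make Equation 1 concrete, here is a minimal brute-force sketch: it scores every candidate tag sequence and keeps the argmax. The tag inventory and probability tables are made-up toy values, and a real TnT model uses Viterbi decoding over smoothed probability estimates rather than this enumeration.

```python
import itertools

# A toy instantiation of Equation 1. All probabilities below are
# invented illustration values, not estimates from any corpus.
TAGS = ["NN", "VM"]

def p_trans(t, t1, t2):
    # P(t_i | t_{i-1}, t_{i-2}) with a crude fallback for unseen triples
    table = {("NN", "<s>", "<s>"): 0.8, ("VM", "NN", "<s>"): 0.7}
    return table.get((t, t1, t2), 0.1)

def p_emit(w, t):
    # P(w_i | t_i) with a small floor for unseen (word, tag) pairs
    table = {("mane", "NN"): 0.6, ("bandanu", "VM"): 0.5}
    return table.get((w, t), 0.01)

def best_tag_sequence(words):
    # argmax over all tag sequences of prod_i P(t_i|t_{i-1},t_{i-2}) P(w_i|t_i)
    best, best_score = None, -1.0
    for tags in itertools.product(TAGS, repeat=len(words)):
        score, prev1, prev2 = 1.0, "<s>", "<s>"
        for w, t in zip(words, tags):
            score *= p_trans(t, prev1, prev2) * p_emit(w, t)
            prev2, prev1 = prev1, t
        if score > best_score:
            best, best_score = tags, score
    return best

print(best_tag_sequence(["mane", "bandanu"]))  # ('NN', 'VM')
```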

The accuracy of this cross-language POS tagger, i.e. the tagger for Russian built using Czech, was found to be encouraging.

2.2 Existing Tools for Kannada

There is literature on Kannada morphological analysers (Vikram and Urs, 2007; Antony et al., 2010; Shambhavi et al., 2011) and POS taggers (Antony and Soman, 2010), but none of them has publicly downloadable resources. Murthy (2000) gives an overview of existing resources for Kannada and points out that most of these exist without public access. We are interested only in work whose tools are publicly available for download.

We found only one downloadable POS tagger for Kannada, developed by the Indian Language Machine Translation (ILMT) consortium1. The consortium publicly released tools for 9 Indian languages including Kannada and Telugu. The available tools are transliterators, morphological analysers, POS taggers and shallow parsers.

The POS taggers from the ILMT consortium are mono-lingual POS taggers, i.e. trained on the target language resources themselves. They were developed by Avinesh and Karthik (2007) by training a conditional random fields (CRF) model on the training data provided by the participating institutions in the consortium. In the public evaluation of POS taggers for Indian languages (Bharati and Mannem, 2007), this tagger was ranked best among all the existing taggers.

Indian languages are morphologically rich, with the Dravidian languages posing an extra challenge because of their agglutinative nature. Avinesh and Karthik (2007) noted that morphological information plays an important role in Indian language POS tagging. Their CRF model is trained on all the important morphological features to predict the output tag for a word in a given context. Their pipeline can be described as below:

1. Tokenise the Unicode input
2. Transliterate the tokenised input to ASCII format
3. Run the morph analyser to get all the possible morphological sets
4. Extract the relevant morphological features used by the CRF model
5. Given a word, let the CRF model annotate it with a relevant POS tag, based on the morphological features of the word and of its context
6. Transliterate the ASCII output to Unicode

1 Tools for 9 Indian languages: http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php

Field   Description           Number of Tags   Tags
        Full Tag              311              NN.n.f.pl.3.d, VM.v.n.sg.3., ...
1       Main POS Tag          25               CC, JJ, NN, VM, ...
2       Coarse POS Category   9                adj, n, num, unk, ...
3       Gender                6                any, f, m, n, punc, null
4       Number                4                any, pl, sg, null
5       Person                5                1, 2, 3, any, null
6       Case                  3                d, o, null

Table 1: Fields in each tag and the corresponding statistics. null denotes an empty value, e.g. in the tag VM.v.n..3., the number and case fields are null.

The major drawback of this tagging model is that it relies on a pipeline: if something breaks in the pipeline, the POS tagger does not work. We found that the tagger annotates only 78% of the input sentences. The tagger is also too slow to scale to large annotation tasks.

We aim to remove this pipeline, yet build an efficient tagger which also performs morphological analysis at the same time.

2.3 Kannada and Telugu Background

Kannada and Telugu are spoken by 35 and 75 million people respectively2. The majority of existing research on Indian languages has focused on a few languages like Hindi, Marathi, Bengali, Telugu and Tamil, as a result of which other languages like Kannada and Malayalam are relatively resource-poor.

Telugu is known to be highly influenced by Kannada, making the languages slightly mutually intelligible (Datta, 1998, pg. 1690). Until the 13th century both languages had the same script. The scripts later diverged, but close similarities can still be observed, and both belong to the same script family.

The similarities between Kannada and Telugu, and the relative resource abundance of Telugu, motivate us to develop a cross-language POS tagger for Kannada using Telugu.

3 Our Tagset

All the Indian languages have similarities in morphological properties and syntactic behaviour. The main difference is the agglutinative behaviour of the Dravidian languages. Observing these similarities and differences, Bharati et al. (2006) proposed a common POS tagset for all Indian languages. Avinesh and Karthik (2007) use this tagset.

2 Source: Wikipedia

We encode morphological information into the above tagset, creating a fine-grained POS tagset similar to the work of Schmid and Laws (2008) for German, which is morphologically rich like Kannada. Each tag consists of 6 fields; Table 1 describes each field and its statistics. For example, our tag NN.n.m.sg.3.o represents the main POS tag 'NN' for common noun as defined by Bharati et al. (2006), 'n' for the coarse-grained category noun, 'm' for masculine gender, 'sg' for singular number, '3' for 3rd person, and 'o' for oblique case. For more guidelines on the morphological labels, please refer to Bharati et al. (2007).

Since our POS tags encode morphological information, our tagger can also be used as a morphological analyser. A sample sentence POS tagged by our tagger is displayed in Figure 1.
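Since each fine-grained tag is six dot-separated fields, a tag can be unpacked into a morphological analysis by simple string splitting. A minimal sketch (the field names follow Table 1; the function itself is our own illustration):

```python
FIELDS = ["main_pos", "coarse_pos", "gender", "number", "person", "case"]

def parse_tag(tag):
    # Split a fine-grained tag into its six fields (cf. Table 1).
    # Empty strings correspond to null values, e.g. in VM.v.n..3.
    parts = tag.split(".")
    assert len(parts) == len(FIELDS), "expected six dot-separated fields"
    return {field: part or None for field, part in zip(FIELDS, parts)}

print(parse_tag("NN.n.m.sg.3.o"))
# {'main_pos': 'NN', 'coarse_pos': 'n', 'gender': 'm',
#  'number': 'sg', 'person': '3', 'case': 'o'}
```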

4 Our Method

We aim to build a Hidden Markov model (HMM) based Kannada POS tagger as described by Equation 1. We use TnT (Brants, 2000), a popular implementation of the second-order Markov model for POS tagging. We construct the TnT model by estimating the transition and emission probabilities of Kannada using the cross-language Telugu. Since our tagset has both POS and morphological information encoded in it, the HMM model has the advantage that morphological information helps predict the main POS tag, and inversely, the main POS tag helps predict the morphological information. Briefly, the steps involved in our method are:

1. Download large corpora of Kannada and Telugu
2. Determine the transition probabilities of Telugu by training TnT on the machine-annotated corpora of Telugu. Since Telugu and Kannada are typologically related, we assume the transition probabilities of Kannada to be the same as those of Telugu
3. Estimate the emission probabilities of Kannada from the machine-annotated Telugu corpus or the machine-annotated Kannada corpus
4. Use the probabilities from steps 2 and 3 to build a POS tagger for Kannada

[Figure 1: A sample POS tagging and lemmatisation for a Kannada sentence. The figure is a three-column table (Word, POS Tag, Lemma.Suffix); the Kannada script in it is not recoverable from this extraction.]

4.1 Step 1: Kannada and Telugu Corpus Creation

Corpus collection once used to be long, slow and expensive. But with the advent of the Web and the success of the Web-as-Corpus notion (Kilgarriff and Grefenstette, 2003), corpus collection can be highly automated, and thereby fast and inexpensive.

We used the Corpus Factory method (Kilgarriff et al., 2010) to collect Kannada and Telugu corpora from the Web. The method consists of the following steps.

Frequency List: The Corpus Factory method requires a frequency list of the language of interest to start corpus collection. The frequency list of the language is built from its Wikipedia dump3. The dump is processed to remove all the Wiki and HTML markup, extracting a raw corpus, the Wiki corpus. The frequency list is then built from the tokenised Wiki corpus.

Seed Word Collection: We treat the top 1000 words of the frequency list as the high-frequency words of the language and the next 5000 as the mid-frequency ones, which we use as our seed words.

Query Generation: 30,000 random two-word queries are generated such that no query is identical to another query or to a permutation of one.
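A minimal sketch of this step, assuming seeds is the list of 5000 mid-frequency words; storing each query as a sorted tuple rules out both duplicates and permutations:

```python
import random

def generate_queries(seeds, n_queries=30000, query_size=2):
    # Draw random two-word queries from the mid-frequency seed words,
    # rejecting any query already seen in some word order.
    seen, queries = set(), []
    while len(queries) < n_queries:
        pair = tuple(sorted(random.sample(seeds, query_size)))
        if pair not in seen:          # also catches permutations
            seen.add(pair)
            queries.append(" ".join(pair))
    return queries
```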

URL Collection: Each query is sent to the Bing4 search engine and the pages corresponding to the hits are downloaded. These pages are converted to UTF-8 encoding.

Filtering: The above pages are cleaned to remove boiler-plate text (i.e. HTML and irrelevant blocks like ads), extracting the plain text. Some of these pages are found to be in foreign languages and some are found to be spam. We applied a simple language-modelling-based filter to remove these pages. The filter validates only the pages in which the ratio of non-frequent words to high-frequency words is maintained. If a page does not meet this criterion, we discard it.

3 Wikipedia dumps: http://dumps.wikimedia.org
4 Bing: http://bing.com

Near-Duplicate Removal: The above filter is not sufficient to discard pages which are duplicates. In order to detect them, we used the near-duplicate detection algorithm of Broder et al. (1997), storing only one page among the duplicates.
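The sketch below illustrates the underlying idea with word 5-gram shingles and exact Jaccard overlap between two documents; Broder et al.'s actual algorithm makes this scale to millions of pages by comparing small min-hash sketches of the shingle sets instead of the full sets.

```python
def shingles(text, w=5):
    # The set of contiguous word w-grams ("shingles") of a document.
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def is_near_duplicate(doc1, doc2, threshold=0.9):
    # Near-duplicates have almost identical shingle sets; the
    # threshold here is an illustrative choice, not the paper's.
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold
```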

Finally, we collected cleaned corpora of 16 million words for Kannada and 4.6 million words for Telugu5.

4.2 Step 2: Estimating Kannada Transition Probabilities

Transition probabilities represent the probability of a transition to a state from the previous states. Here each state represents a tag, hence P(t_i | t_{i-1}, t_{i-2}). We estimate transition probabilities in two different ways.

4.2.1 From the source language

Across typologically related languages, it is likely that the transition probabilities among tags are the same. We assume the transition probabilities of Telugu to be approximately equal to the transition probabilities of Kannada.

One can estimate the transition probabilities of a language from its manually annotated corpora. Since manually annotated Telugu corpora are not publicly available, we used the tagger of Avinesh and Karthik (2007) to tag the Telugu corpus downloaded in Step 1. This tagged corpus captures an approximation of the true transition probabilities of manually annotated corpora.

The tagged corpus is converted to the format in Figure 1, and then we estimate the transition probabilities using TnT.
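As an illustration of what is being estimated here, a minimal maximum-likelihood sketch of trigram transition estimation from tagged sentences; TnT itself additionally smooths these counts by interpolating unigram, bigram and trigram models:

```python
from collections import Counter

def transition_probs(tagged_sentences):
    # MLE estimate of P(t_i | t_{i-1}, t_{i-2}) from tag trigram counts;
    # each sentence is a list of fine-grained tags.
    tri, bi = Counter(), Counter()
    for tags in tagged_sentences:
        padded = ["<s>", "<s>"] + tags
        for t2, t1, t in zip(padded, padded[1:], padded[2:]):
            tri[(t2, t1, t)] += 1
            bi[(t2, t1)] += 1
    return {k: count / bi[k[:2]] for k, count in tri.items()}

probs = transition_probs([["NN.n.n.sg..o", "VM.v.m.sg.3."]])
print(probs[("<s>", "NN.n.n.sg..o", "VM.v.m.sg.3.")])  # 1.0
```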

4.2.2 From the target language

Apart from using Telugu transition probabilities, we also experimented with the existing Kannada POS tagger. We annotated the Kannada corpus collected in Step 1 using the existing tagger. We then estimated the transition probabilities from the machine-annotated Kannada corpus. Note that if the Kannada POS tagger is used for estimating transition probabilities, our tagger can no longer be called a cross-language tagger, and is mono-lingual. This tagger is used to compare the performance of the cross-lingual and mono-lingual taggers.

5 The Telugu corpus was collected two years before the Kannada one, hence the difference in sizes.

Since we learn the transition probabilities of the fine-grained POS tags from large corpora, we can build a tagger that is more robust and efficient than the existing mono-lingual tagger: robust because we can predict POS and morphological information for unseen words, and efficient because the morphological information helps in better POS prediction and vice versa.

4.3 Step 3: Estimating Kannada Emission Probabilities

Emission probabilities represent the probabilities of an emission (output) in a given state. Here a state corresponds to a tag and an emission to a word, hence P(w_i | t_i). We tried various ways of estimating the emission probabilities of Kannada.

4.3.1 Approximate string matching

It is not easy to estimate the emission probabilities of a language from a cross-language without the help of parallel corpora, a bilingual dictionary or a translation system. Since Kannada and Telugu are slightly mutually intelligible (Datta, 1998, pg. 1690), we aimed to exploit the lexical similarities between them to the extent possible.

First, a Telugu lexicon is built by training TnT on the machine-annotated Telugu corpora (Step 1). The lexicon holds, for each Telugu word, its corresponding POS tags along with their frequencies. Then, a word list for Kannada is built from the Kannada corpus. For every Kannada word, the most probable similar Telugu word is determined using approximate string matching6. To measure similarity, we transliterated both Kannada and Telugu words to a common ASCII encoding. For example, the most similar Telugu words of the Kannada word xAswAnu are ('xAswAn', 0.545), ('xAswAru', 0.5), ('rAswAnu', 0.5), ('xAswAdu', 0.5), and the most similar Telugu words of the Kannada word viBAgavu are ('viBAgamu', 0.539), ('viBAga', 0.5), ('viBAgalanu', 0.467), ('viBAgamulu', 0.467).

We assume that for a Kannada word, its tags and their frequencies are equal to those of the most similar Telugu word. Based on this assumption, we build a lexicon for Kannada with each word having its plausible tags and frequencies derived from Telugu. This lexicon is used for estimating emission probabilities.

6 We used the Python n-gram package for approximate string matching: http://packages.python.org/ngram/
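A sketch of this projection step. A Dice coefficient over character bigrams stands in for the n-gram similarity of the package in the footnote, and the lexicon entries are toy values; real entries come from the TnT-trained Telugu lexicon, with all words already transliterated to the common ASCII encoding.

```python
def char_ngrams(word, n=2):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def similarity(a, b):
    # Dice coefficient over character bigrams, a simple stand-in
    # for the n-gram similarity used in the paper.
    ga, gb = char_ngrams(a), char_ngrams(b)
    if not ga or not gb:
        return 0.0
    overlap = sum(min(ga.count(g), gb.count(g)) for g in set(ga))
    return 2.0 * overlap / (len(ga) + len(gb))

# Toy Telugu lexicon: word -> {fine-grained tag: frequency}
telugu_lexicon = {
    "viBAgamu": {"NN.n.n.sg..d": 30},
    "xAswAnu": {"VM.v.any.sg.1.": 12},
}

def project_entry(kannada_word, min_sim=0.4):
    # Give a Kannada word the tags and frequencies of its most
    # similar Telugu word, if the similarity is high enough.
    best = max(telugu_lexicon, key=lambda tw: similarity(kannada_word, tw))
    return telugu_lexicon[best] if similarity(kannada_word, best) >= min_sim else None

print(project_entry("viBAgavu"))  # inherits the entry of viBAgamu
```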


4.3.2 Source tags and target morphology

For each morphological set from the machine-annotated Telugu corpora, we determine all its plausible fine-grained POS tags. For example, the morphological set n.n.sg..o is associated with all the tags which satisfy the regular expression *.n.n.sg..o. Then, for every word in Kannada, based on its morphology as determined by the morphological analyser, we uniformly assign all the applicable tags as learned from Telugu. The drawback of this approach is that the search space is large.
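A sketch of this mapping, reading a morphological set such as n.n.sg..o as the pattern *.n.n.sg..o over the fine-grained tagset (the tag inventory below is a toy sample):

```python
import re

# Toy sample of fine-grained tags observed in the Telugu corpus.
tagset = ["NN.n.n.sg..o", "NNP.n.n.sg..o", "VM.v.n.sg.3.", "PRP.n.m.sg.3.d"]

def tags_for_morph_set(morph_set):
    # A morphological set matches every tag "X.<morph_set>",
    # whatever the main POS tag X is.
    pattern = re.compile(r"^[^.]+\." + re.escape(morph_set) + r"$")
    return [t for t in tagset if pattern.match(t)]

print(tags_for_morph_set("n.n.sg..o"))  # ['NN.n.n.sg..o', 'NNP.n.n.sg..o']
```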

4.3.3 Target tags with uniform distribution

Instead of estimating emission probabilities from the cross-language, we learn the plausible fine-grained tags of a given Kannada word from the machine-annotated Kannada corpora (Step 1) and assume a uniform distribution over all its tags. Though we learn the tags using the existing POS tagger, we do not use the information about tag frequencies, and hence we are not using the emission probabilities of the existing tagger. The existing tagger is just used to build a lexicon for Kannada.

Since we run the tagger on a large Kannada corpus, our lexicon contains most of the Kannada word forms and their corresponding POS and morphological information. This lexicon helps in removing the pipeline of Avinesh and Karthik (2007), thus yielding a high-speed tagger. Even if some words are absent from the lexicon, TnT is well known to predict tags for unseen words based on the transition probabilities.

The advantage of this method over the previous one is that the search space is drastically reduced.

4.3.4 Target emission probabilities

In this method, we learn the Kannada emission probabilities directly from the machine-annotated Kannada corpora, i.e. we use the emission probabilities of the existing tagger. This helps us estimate the upper-bound performance of the cross-lingual tagger when the transition probabilities are taken from Telugu.

It also helps us estimate the upper-bound performance of the mono-lingual tagger when the transition probabilities are taken directly from Kannada. Our mono-lingual tagger will be robust, fast and as accurate as the existing mono-lingual tagger.

4.4 Step 4: Final Tagger

We experimented with various TnT tagging models by selecting transition and emission probabilities from Steps 2 and 3. Though one may question the performance of TnT for free-word-order languages like Kannada, Hana et al. (2004) found that TnT models are as good as other models for free-word-order languages. Additionally, Schmid and Laws (2008) observed that TnT models are also good at learning fine-grained transition probabilities. In our evaluation, we also found that our TnT models are competitive with the existing CRF model of Avinesh and Karthik (2007).

Apart from building POS tagging models, we also learned, from the machine-annotated Kannada corpus, the association of each word with its lemma and suffix given a POS tag. For example, the Kannada word aramaneVgalYannu is associated with the lemma aramaneV and suffix annu when it occurs with the tag NN.n.n.pl..o, and similarly the word aramaneVgeV is associated with the lemma aramaneV and suffix igeV when it occurs with the tag NN.n.n.sg..o.
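The resulting lemmatiser can be pictured as a lookup table keyed by (word, tag) and filled by counting (lemma, suffix) analyses in the machine-annotated corpus. A minimal sketch using the example above:

```python
from collections import Counter, defaultdict

# (word, tag) -> Counter over (lemma, suffix) pairs, filled in one
# pass over the machine-annotated Kannada corpus.
associations = defaultdict(Counter)

def observe(word, tag, lemma, suffix):
    associations[(word, tag)][(lemma, suffix)] += 1

def lemmatise(word, tag):
    # Choose the most frequent analysis of this word under the POS
    # tag selected by the tagger; fall back to the word form itself.
    candidates = associations.get((word, tag))
    if not candidates:
        return word, ""
    return candidates.most_common(1)[0][0]

observe("aramaneVgalYannu", "NN.n.n.pl..o", "aramaneV", "annu")
print(lemmatise("aramaneVgalYannu", "NN.n.n.pl..o"))  # ('aramaneV', 'annu')
```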

An example sentence tagged by our models, along with its lemmatisation, is displayed in Figure 1.

5 Evaluation Results

We evaluated all our models on the manually annotated Kannada corpora developed by the ILMT consortium7. The corpus consists of 201,373 words and is tagged with the Bharati et al. (2006) tagset, which forms the first field of our fine-grained POS tagset. Since we did not have manually annotated data for morphology, we evaluated only on the first field of our tags. For example, in the tag NST.n.n.pl..o, we evaluate only the NST part.

Table 2 displays the results for the various tagging models. Note that all our models are TnT models, whereas (Avinesh and Karthik, 2007) is a CRF model.

Model 1 uses the transition probabilities of Telugu (section 4.2.1) and emission probabilities estimated from Telugu using approximate string matching (section 4.3.1). This model achieves 50% accuracy while using almost no resources of the target language. This is encouraging, especially for languages which do not have any resources.

7 This corpus is not publicly available and is licensed. We did not use it for any training purposes, only for the evaluation.

Model   Transition Prob               Emission Prob                           Precision   Recall   F-measure
Cross-Language POS Tagger
1       From the source language      Approximate string matching             56.88       56.88    56.88
2       From the source language      Source tags and target morphology       28.65       28.65    28.65
3       From the source language      Target tags with uniform distribution   75.10       75.10    75.10
4       From the source language      Target emission probabilities           77.63       77.63    77.63
Mono-Lingual POS Tagger
5       From the target language      Target emission probabilities           77.66       77.66    77.66
6       (Avinesh and Karthik, 2007)                                           78.64       61.48    69.01

Table 2: Evaluation results of various tagging models

Model 2 uses the transition probabilities of Telugu (section 4.2.1) and the emission probabilities estimated by mapping Telugu tags to the Kannada morphology (section 4.3.2). The performance is poor due to the explosion in the search space of plausible tags. We optimise the search space using a Kannada lexicon in Model 3.

Model 3 uses the transition probabilities of Telugu (section 4.2.1) and emission probabilities estimated from the machine-built Kannada lexicon (section 4.3.3). The performance is competitive with the mono-lingual taggers, Models 5 and 6, and the tagger has a better F-measure than (Avinesh and Karthik, 2007). This model shows that transition probabilities carry over across typologically related Indian languages. To build an efficient cross-lingual tagger, it is good enough to use cross-language transitions along with a target lexicon, i.e. the list of all plausible tags for a given target word.

Model 4 uses the transition probabilities of Telugu (section 4.2.1) and the emission probabilities of Kannada estimated from the existing Kannada tagger (section 4.3.4). This gives us an idea of the upper-bound performance of cross-language POS taggers when source transition probabilities are used. The performance is almost equal to that of the mono-lingual tagger Model 5, showing that the transition probabilities of Kannada and Telugu are almost the same. We can thus build cross-language POS taggers as efficient as mono-lingual taggers, provided that we have a good target lexicon.

Model 5 is a mono-lingual tagger which uses target transition and emission probabilities estimated from the existing tagger (sections 4.2.2 and 4.3.4). The performance is highly competitive, with a better F-measure than (Avinesh and Karthik, 2007). This shows that an HMM-based tagger is as efficient as a CRF model (or any other model). While tagging the 16 million words of the Kannada corpus with (Avinesh and Karthik, 2007) took 5 days on a quad-core processor at 2.3 GHz per core, the TnT model took hardly a few minutes, with better recall. We therefore also aim to develop robust, fast and efficient mono-lingual taggers for Indian languages which already have POS taggers.

Table 3 displays the tag-wise results of our cross-language tagger Model 3, our mono-lingual tagger Model 5, and the existing mono-lingual tagger Model 6.

6 Conclusions

This is an attempt to build POS taggers and other tools for resource-poor Indian languages using relatively resource-rich languages. Our experimental results for Kannada using Telugu are highly encouraging for building cross-language tools. Cross-language POS taggers are found to be as accurate as mono-lingual POS taggers.

Future directions include building cross-language tools for other resource-poor Indian languages, such as Malayalam using Tamil, Marathi using Hindi, or Nepali using Hindi. For Indian languages which already have tools, we aim to build robust, fast and efficient tools using the existing ones.

Finally, all the tools developed in this work are available for download8. The tagged corpora developed for this work are accessible through the Sketch Engine9 or Intellitext10 interfaces.

Acknowledgements

This work has been supported by AHRC DEDEFI grant AH/H037306/1. Thanks to the anonymous reviewers for their useful feedback.

8 The tools developed in this work can be downloaded from http://sivareddy.in/downloads or http://corpus.leeds.ac.uk/serge/

9 Sketch Engine: http://sketchengine.co.uk
10 Intellitext: http://corpus.leeds.ac.uk/it/


Tag      Freq       Model 3                Model 5                Model 6
                    Prec   Recall  F1      Prec   Recall  F1      Prec   Recall  F1
NN       81289      74.32  84.89   79.25   81.58  80.79   81.19   84.91  62.59   72.06
VM       33421      84.56  88.21   86.35   83.94  89.39   86.58   86.79  71.78   78.57
SYM      30835      92.26  95.51   93.86   95.57  96.11   95.84   95.64  73.99   83.43
JJ       13429      54.92  27.59   36.73   55.54  39.70   46.30   56.38  32.76   41.44
PRP      9102       60.02  33.14   42.70   59.07  56.01   57.50   60.69  46.07   52.38
QC       7699       90.70  73.45   81.17   90.55  93.52   92.01   88.52  70.40   78.43
NNP      7221       43.66  45.41   44.52   60.87  61.82   61.34   62.20  61.72   61.96
CC       4003       87.11  92.03   89.50   88.62  94.33   91.38   88.69  75.39   81.50
RB       3957       27.03  26.26   26.64   33.48  37.30   35.29   34.31  29.52   31.73
NST      2139       49.26  62.51   55.10   38.72  79.34   52.04   40.27  67.27   50.39
QF       1385       67.17  80.36   73.18   54.95  80.51   65.32   58.18  70.61   63.80
NEG      889        68.00  3.82    7.24    89.93  42.18   57.43   86.50  35.32   50.16
QO       622        54.66  20.74   30.07   45.43  28.78   35.24   54.00  21.70   30.96
WQ       599        70.25  46.91   56.26   80.17  80.30   80.23   81.73  55.26   65.94
PSP      374        7.92   2.14    3.37    -      -       -       26.28  71.39   38.42
INTF     23         5.32   43.48   9.48    5.08   60.00   9.38    1.06   17.39   2.00
INJ      3          5.13   66.67   9.52    1.67   33.33   3.17    2.70   33.33   5.00
Overall  201,373    75.10  75.10   75.10   77.66  77.66   77.66   78.64  61.48   69.01

Table 3: Tag-wise results of Models 3, 5 and 6 described in Table 2

References

P.J. Antony and K.P. Soman. 2010. Kernel based part of speech tagger for Kannada. In Machine Learning and Cybernetics (ICMLC), 2010 International Conference on, volume 4, pages 2139–2144, July.

P.J. Antony, M. Anand Kumar, and K.P. Soman. 2010. Paradigm based morphological analyzer for Kannada language using machine learning approach. Advances in Computational Sciences and Technology (ACST), 3(4).

Sue B. T. Atkins and Michael Rundell. 2008. The Oxford Guide to Practical Lexicography. Oxford University Press, Oxford.

P. V. S. Avinesh and G. Karthik. 2007. Part-Of-Speech Tagging and Chunking using Conditional Random Fields and Transformation-Based Learning. In Proceedings of the IJCAI Workshop On Shallow Parsing for South Asian Languages (SPSAL), pages 21–24.

Akshar Bharati and Prashanth R. Mannem. 2007. Introduction to the shallow parsing contest on South Asian languages. In Proceedings of the IJCAI Workshop On Shallow Parsing for South Asian Languages (SPSAL), pages 1–8.

A. Bharati, R. Sangal, D. M. Sharma, and L. Bai. 2006. AnnCorra: Annotating corpora guidelines for POS and chunk annotation for Indian languages. Technical Report TR-LTRC-31, LTRC, IIIT-Hyderabad.

A. Bharati, R. Sangal, and D.M. Sharma. 2007. SSF: Shakti Standard Format guide. Technical Report TR-LTRC-33, Language Technologies Research Centre, IIIT-Hyderabad, India.

Thorsten Brants. 2000. TnT: a statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC '00, pages 224–231, Stroudsburg, PA, USA. Association for Computational Linguistics.

Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web. In Selected papers from the Sixth International Conference on World Wide Web, pages 1157–1166.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL 2011.

Amaresh Datta. 1998. The Encyclopaedia of Indian Literature, volume 2.

Anna Feldman, Jirka Hana, and Chris Brew. 2006. A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of LREC, pages 549–554.

Jiri Hana, Anna Feldman, and Chris Brew. 2004. A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources. In Proceedings of EMNLP 2004, Barcelona, Spain.

Adam Kilgarriff and Gregory Grefenstette. 2003. Introduction to the special issue on the web as corpus. Computational Linguistics, 29(3):333–348.

Adam Kilgarriff, Siva Reddy, Jan Pomikalek, and Avinesh PVS. 2010. A Corpus Factory for many languages. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

Kavi Narayana Murthy. 2000. Computer processing of Kannada language. Technical report, Computer and Kannada Development, Kannada University, Hampi.

Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay. 2009. Multilingual part-of-speech tagging: Two unsupervised approaches. Journal of Artificial Intelligence Research (JAIR), 36:341–385.

Helmut Schmid and Florian Laws. 2008. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING '08, pages 777–784, Stroudsburg, PA, USA. Association for Computational Linguistics.

B. R. Shambhavi, P. Ramakanth Kumar, K. Srividya, B. J. Jyothi, Spoorti Kundargi, and G. Varsha Shastri. 2011. Kannada morphological analyser and generator using trie. IJCSNS International Journal of Computer Science and Network Security, 11(1), January.

Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2008. Unsupervised multilingual learning for POS tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 1041–1050, Stroudsburg, PA, USA. Association for Computational Linguistics.

T. N. Vikram and Shalini R. Urs. 2007. Development of a prototype morphological analyzer for the South Indian language of Kannada. In Proceedings of the 10th International Conference on Asian Digital Libraries, ICADL'07, pages 109–116, Berlin, Heidelberg. Springer-Verlag.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, NAACL '01, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research, HLT '01, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.


Integrate Multilingual Web Search Results

using Cross-Lingual Topic Models

Duo Ding

Shanghai Jiao Tong University, Shanghai, 200240, P.R. China

[email protected]

Abstract

With the thriving of the Internet, web users today have access to resources around the world in more than 200 different languages. How to effectively manage multilingual web search results has emerged as an essential problem. In this paper, we introduce the ongoing work of leveraging a Cross-Lingual Topic Model (CLTM) to integrate multilingual search results. The CLTM detects the underlying topics of the results in different languages and uses the topic distribution of each result to cluster the results into topic-based classes. In CLTM, we unify the distributions at the topic level by direct translation, thus distinguishing it from other multilingual topic models, which mainly concern parallelism at the document or sentence level (Mimno, 2009; Ni, 2009). Experimental results suggest that our CLTM clustering method is effective and outperforms the 6 compared clustering approaches.

1 Introduction

The growth of the Internet has made the web multilingual. With the Internet, users can browse web pages written in any language, and search for results in any language in the world.

However, since users obtain a large set of search results edited in many languages after a multilingual search (shown in Figure 1), redundancy becomes a problem. Here the "redundancy issue" stands for two problems. The first is that we get duplicated results from the searches in different languages. This can be fixed by simply maintaining a set and throwing away the duplicated results. The second problem is that users get so many search results after a multilingual search that they cannot quickly find the results they want. To facilitate users' quick browsing, one effective solution is post-retrieval document clustering, which was shown by Hearst and Pedersen (1996) to produce superior results. We can thus employ the Cross-Lingual Topic Model to cluster the numerous results into topic classes, each containing the results related to one specific topic, to solve the redundancy problem.

Figure 1: Multilingual Search

Our approach works in two steps. First we translate the topic documents into a unified language. Then, by conducting a clustering method derived from the Cross-Lingual Topic Model (CLTM), we cluster all the results into topic classes. We assume that different "topics" exist among all the returned search results (Blei, 2003). Thus, by detecting the underlying topics of the search results, we give a topic distribution for each result and then cluster it into a particular class according to this distribution. In our experiments, the CLTM gives an impressive performance in clustering multilingual web search results.

2 Cross-Lingual Topic Models

Topic models have emerged as a very useful tool to detect the underlying topics of text collections. They are probabilistic models for uncovering the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis of the original texts (Blei et al., 2003). Having a method for assigning topic distributions to terms and documents, this analysis of context can be utilised in many applications. Meanwhile, the development of multilingual search is calling for useful cross-lingual tools to integrate the results in different languages. So we leverage Cross-Lingual Topic Models (CLTM) to accomplish the task of integrating multilingual web results.

Some similar methods have been proposed recently that define polylingual or multilingual topic models to find topics aligned across multiple languages (Mimno, 2009; Ni, 2009). The key difference is that the polylingual topic models assume that the documents in a tuple share an individual tuple-specific distribution over topics, while in the Cross-Lingual Topic Model, the distributions of tuples and different languages are identical. At the same time, our emphasis is on utilising the power of CLTM to solve the problem of clustering multilingual search results, which differs from other topic model tools.

2.1 Definition

First we give the statistical assumptions and terminology of Cross-Lingual Topic Models (CLTM). The thought behind CLTM is that, for results within a specific language's search result set, we model each result as arising from multiple topics, where a topic is defined to be a distribution over a fixed vocabulary of terms in this language. In every language Li, let K be a specified number of topics, V the size of the vocabulary, α a positive K-vector, and η a scalar. We let Dir_V(α) denote a V-dimensional Dirichlet with vector parameter α and Dir_K(η) denote a K-dimensional symmetric Dirichlet with scalar parameter η.

Several topics may underlie the collection. We draw a distribution over words β_k ∼ Dir_V(α) for each topic. And for each search result document, we draw a vector of topic proportions θ ∼ Dir_K(η). Finally, for each word, we first draw a topic assignment z_n ∼ Mult(θ), where z_n ranges from 1 to K; we then draw a word w_n ∼ Mult(β_{z_n}), where w_n ranges from 1 to V.

From the definition above we can see that the hidden topical structure of a collection is represented in the hidden random variables: the topics β, the per-document topic proportions θ, and the per-word topic assignments z. This is similar to another kind of topic model, latent Dirichlet allocation (LDA).

We make central use of the Dirichlet distribution in CLTM, the exponential family distribution over the simplex of positive vectors that sum to one. Since we use a distribution similar to latent Dirichlet allocation on each language's result set, we give the Dirichlet density:

$$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$

The parameter α is a positive K-vector, and Γ denotes the Gamma function, which can be thought of as a real-valued extension of the factorial function. Under the assumption that document collections (result sets) in different languages share the same topic distribution, we can describe the Cross-Lingual Topic Model as in Figure 2.

Figure 2: The graphical model presentation of the Cross-Lingual Topic Model (CLTM)

2.2 Clustering with CLTM

From the definition, we see that CLTM contains two kinds of Dirichlet random variables: the topic proportions θ are distributions over topic indices {1, ..., K}; the topics β are distributions over the vocabulary. We use these variables to formulate our topic-detecting method.

Detecting Topics

In CLTM, exploring a corpus through a topic model typically begins with visualising the posterior topics through their per-topic term probabilities β. In our method, we need to find several topics in the "Result Pool" of each query, thus making it possible to assign topic distributions to each result in the set. To do so, we detect the topics in a result set by visualising several posterior topics and use the following formula to calculate the word score:

$$\mathrm{score}_{k,v} = \hat{\beta}_{k,v} \log \frac{\hat{\beta}_{k,v}}{\left(\prod_{j=1}^{K} \hat{\beta}_{j,v}\right)^{1/K}} \qquad (2)$$

We can see that the above formula is based on the TFIDF term score of vocabulary terms used in information retrieval (Baeza-Yates and Ribeiro-Neto, 1999). We use this score to determine the salient topics in a query's result set. The first part of the formula is similar to the term frequency (TF); the second part is similar to the inverse document frequency (IDF).

Document Topic Distribution

When several topics have been found in a result set, we would like to know the underlying topics contained in each result document so that we can cluster it into a particular class according to its topics. Since a result document may contain multiple topics and what we need is the most salient one, we can plot the posterior topic proportions and examine the most likely topic assigned to each word in this query to find the most salient topic. In our method, we sum up the distribution of every term in the document to form the final distribution of the document:

$$\mathrm{sim}(v, k) = \sum_{i=1}^{N_v} \mathrm{score}_{k,\, w_{v,i}} \qquad (3)$$

This formula calculates the similarity of a document to the kth topic. N_v denotes the number of words that the vth result contains.

After this two-step processing, for each result document in a query's result list, we have K similarities which respectively denote the possibility of the document being clustered into the kth topic class. We then conduct clustering on the result set based on these possibilities to put the documents into different topic-based classes.
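A sketch of these two steps on top of a fitted topic–word matrix: the word score follows the TFIDF-style formula (2) and each document is assigned to its highest-scoring topic as in formula (3). The matrix beta below is a random stand-in for the fitted per-topic term probabilities.

```python
import numpy as np

K, V = 5, 1000                              # topics, vocabulary size (toy)
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(V), size=K)    # stand-in for fitted topics

# Formula (2): re-weight each per-topic term probability against its
# geometric mean across topics (TF-like and IDF-like parts).
geo_mean = np.exp(np.log(beta).mean(axis=0))
word_score = beta * np.log(beta / geo_mean)

def cluster(doc_word_ids):
    # Formula (3): sum the scores of every token in the snippet and
    # assign the document to the topic with the highest total.
    totals = word_score[:, doc_word_ids].sum(axis=1)
    return int(np.argmax(totals))

print(cluster([3, 17, 256, 17]))            # index of the chosen topic class
```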

3 Experiments

In this section, we give experimental results for the Cross-Lingual Topic Model clustering method, compared with 6 other clustering algorithms, to show that CLTM is a powerful tool for cross-lingual context analysis and multilingual topic-based clustering.

For this series of experiments we use the clustering results of two languages, English and Chinese, to show the performance of the different clustering methods (because this is convenient to evaluate). However, since the Cross-Lingual Topic Model is language independent, we believe that the method is also feasible for other languages.

3.1 Baseline Clustering Algorithms

First, we apply the 6 baseline clustering algorithms to the unified search results. We extract 20 frequently issued Chinese search queries and translate them into English (using Google Translate). Then, for each pair of queries, we search both the Chinese and the English version in the Google Search Engine, recording the top 40 returned results for each (including title, snippet and URL). We then regard English as the unified language and translate the 40 Chinese results into English, again using Google Translate, thus obtaining a total of 80 returned search results for each query.

In the next step, we convert the 80 snippets into vector-space format files. After that, we cluster these result documents (snippets) into classes. In our setup, the number of clusters is 5. A fixed, predefined cluster number makes clustering more effective for both the baseline methods and the CLTM method, and also makes the comparison clearer.

The 6 baseline clustering algorithms we use are: repeated bisection (rb), refined repeated bisection (rbr), direct clustering (direct), agglomerative clustering (agglo), graph partitioning (graph), and biased agglomerative clustering (bagglo). We use the clustering tool CLUTO to implement the baseline clustering.

The similarity function is chosen to be the cosine function, and the clustering criterion function for the rb, rbr, and direct methods is

$$\max \sum_{i=1}^{K} \frac{1}{n_i} \sum_{v, u \in S_i} \mathrm{sim}(v, u)$$

In this formula, K is the total number of clusters, S is the set of all objects to be clustered, S_i is the set of objects assigned to the ith cluster, n_i is the number of objects in the ith cluster, v and u represent two objects, and sim(v, u) is the similarity between the two objects.


Clustering Algorithm         Parameter   Algorithm Description
Repeated Bisection           -rb         The desired k-way clustering solution is computed by performing a sequence of k-1 repeated bisections.
Refined Repeated Bisection   -rbr        Similar to the above method, but at the end the overall solution is globally optimized.
Direct Clustering            -direct     The desired k-way clustering solution is computed by simultaneously finding all k clusters.
Agglomerative Clustering     -agglo      The k-way clustering solution is computed using the agglomerative paradigm, whose goal is to locally optimize (min or max) a particular clustering criterion function.
Graph Partitioning           -graph      The clustering solution is computed by first modeling the objects using a nearest-neighbor graph, and then splitting the graph into k clusters using a min-cut graph partitioning algorithm.
Biased Agglomerative         -bagglo     Similar to the agglo method, but the agglomeration process is biased by a partitional clustering solution that is initially computed on the dataset.

Table 1: Parameter and description of the 6 baseline clustering algorithms used in the experiment

For the agglomerative and biased agglomerative clustering algorithms, we use the traditional UPGMA criterion function, and for the graph partitioning algorithm, we use the cluster-weighted single-link criterion function. The parameters and descriptions of the clustering algorithms are presented in Table 1.

3.2 Cross-Lingual Topic Model Clustering

In Cross-Lingual Topic Model based clustering, we first calculate the word score for each vocabulary word using formula (2) in Section 2. Thus, for each query, there is a probability for each of its vocabulary words on the 5 different topics. Then, we use formula (3) to calculate the probability of each document (each snippet) on the 5 topics. Finally, we find the topic with the highest probability in each document and assign the document to this topic class, which completes the clustering process.

In our evaluation process, we ask 7 evaluators to view the results of the different clustering methods. Each evaluator is given the clustering results for 2 or 3 queries under the 7 different methods (the 6 baseline methods plus CLTM), and is asked to compare the results by giving two scores to each method. During evaluation, they are blind to the names of the clustering methods behind the assigned results. The first score is the "Internal Similarity", which accounts for the similarity of the results clustered into the same class. This score reveals the compactness of each topic class; it ranges from 1 to 10, where 1 means poor compactness and 10 means perfect compactness. The second score is the "External Distinctness", which shows whether the classes are distinct from each other. The range is also 1 to 10: 1 represents poor quality and 10 represents the best performance. The results of the evaluations are shown in Figure 3 and Figure 4.

Figure 3: The Internal Similarity of 7 methods

Figure 4: The External Distinctness of 7 methods

4 Conclusion

In this paper, we introduce the ongoing work of exploiting a kind of topic model, the Cross-Lingual Topic Model (CLTM), to solve the problem of integrating and clustering multilingual search results. The CLTM detects the underlying topics of the results and assigns a distribution to each result. According to this distribution, we cluster each result into the topic class it is mainly about. We give each word a "word score" which represents the distribution of topics for this word, and sum all the term probabilities in a result to obtain the topic distribution for each result document. To evaluate the effectiveness of Cross-Lingual Topic Models, we compare the model with 6 baseline clustering algorithms on the same dataset. The experimental results in terms of "Internal Similarity" and "External Distinctness" scores suggest that the Cross-Lingual Topic Model gives a better performance and provides more reasonable results for clustering multilingual web search documents.

Acknowledgments

The author would like to thank Matthew Scott of Microsoft Research Asia for helpful suggestions and comments. The author also thanks the anonymous reviewers for their insightful feedback.

References

Andreas Faatz. Enrichment Evaluation. Technical Report TR-AF-01-02, Darmstadt University of Technology.

A. V. Leouski and W. B. Croft. 1996. An evaluation of techniques for clustering search results. Technical Report IR-76, Department of Computer Science, University of Massachusetts, Amherst.

Bernard J. Jansen, Amanda Spink, and Tefko Saracevic. 2000. Real life, real users, and real needs: a study and analysis of user queries on the web. Information Processing and Management 36 (2000).

Chi Lang Ngo and Hung Son Nguyen. 2004. A Tolerance Rough Set Approach to Clustering Web Search Results. PKDD 2004, LNAI 3202, pp. 515–517.

David Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. 1992. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th International ACM SIGIR Conference (SIGIR'92), pp. 318–329.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual Topic Models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 880–889, Singapore.

He Xiaoning, Wang Peidong, Qi Haoliang, Yang Muyun, Lei Guohua, and Xue Yong. 2008. Using Google Translation in Cross-Lingual Information Retrieval. Proceedings of NTCIR-7 Workshop Meeting, December 16–19, Tokyo, Japan.

Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, and Jinwen Ma. 2004. Learning to Cluster Web Search Results. SIGIR'04, Sheffield, South Yorkshire, UK.

Liddle, S., Embley, D., Scott, D., Yau, S. 2002. Extracting Data Behind Web. In Proceedings of the Joint Workshop on Conceptual Modeling Approaches for E-business: A Web Service Perspective (eCOMO 2002), pp. 38–49, October 2002.

Murata, M., Ma, Q., and Isahara, H. 2002. Applying multiple characteristics and techniques to obtain high levels of performance in information retrieval. Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Question Answering and Summarization, Tokyo, Japan. NII, Tokyo.

McRoy, S. 1992. Using Multiple Knowledge Sources for Word Sense Discrimination. Computational Linguistics, vol. 18, no. 1.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, volume 19, issue 2.

P.S. Bradley, Usama Fayyad, and Cory Reina. 1998. Scaling Clustering Algorithms to Large Databases. In KDD-98 Proceedings. AAAI.

Raghavan, S. and Garcia-Molina, H. 2001. Crawling the Hidden Web. In Proceedings of the 27th International Conference on Very Large Data Bases, pp. 129–138.

W. B. Croft. 1978. Organizing and searching large files of documents. Ph.D. Thesis, University of Cambridge.

Xiaochuan Ni, Jian-Tao Sun, Jian Hu, and Zheng Chen. 2009. Mining Multilingual Topics from Wikipedia. WWW 2009, Madrid, Spain.

Zamir O. and Etzioni O. 1998. Web Document Clustering: A Feasibility Demonstration. Proceedings of the 19th International ACM SIGIR Conference on Research and Development of Information Retrieval (SIGIR'98), 46–54.

Zamir O. and Etzioni O. 1999. Grouper: A Dynamic Clustering Interface to Web Search Results. In Proceedings of the Eighth International World Wide Web Conference (WWW8), Toronto, Canada.


Soundex-based Translation Correction in Urdu–English Cross-Language Information Retrieval

Manaal Faruqui
Computer Science & Engg.
Indian Institute of Technology, Kharagpur, India
[email protected]

Prasenjit Majumder
Computer Science & Engg.
DAIICT Gandhinagar, Gandhinagar, India
[email protected]

Sebastian Padó
Computational Linguistics
Heidelberg University, Heidelberg, Germany
[email protected]

Abstract

Cross-language information retrieval is difficult for languages with few processing tools or resources, such as Urdu. An easy way of translating content words is provided by Google Translate, but due to lexicon limitations, named entities (NEs) are transliterated letter by letter. The resulting NE errors (zynydyny zdn for Zinedine Zidane) hurt retrieval. We propose to replace English non-words in the translation output. First, we determine phonetically similar English words with the Soundex algorithm. Then, we choose among them with a modified Levenshtein distance that models correct transliteration patterns. This strategy yields an improvement of 4% MAP (from 41.2 to 45.1, monolingual 51.4) on the FIRE-2010 dataset.

1 Introduction

Cross-language information retrieval (CLIR) research is the study of systems that accept queries in one language and return text documents in a different language. CLIR is of considerable practical importance in countries with many languages, like India. One of the most widely used languages is Urdu, the official language of five Indian states as well as the national language of Pakistan. There are around 60 million speakers of Urdu: 48 million in India and 11 million in Pakistan (Lewis, 2009).

Despite this large number of speakers, NLP for Urdu is still at a fairly early stage (Hussain, 2008). Studies have been conducted on POS tagging (Sajjad and Schmid, 2009), corpus construction (Becker and Riaz, 2002), word segmentation (Durrani and Hussain, 2010), lexicographic sorting (Hussain et al., 2007), and information extraction (Mukund et al., 2010). Many other processing tasks are still missing, and the size of the Urdu internet is minuscule compared to English and other major languages, making Urdu a prime candidate as a CLIR source language.

A particular challenge which Urdu poses for CLIR is its writing system. Even though it is a Central Indo-Aryan language and closely related to Hindi, its development was shaped predominantly by Persian and Arabic, and it is written in Perso-Arabic script rather than Devanagari. CLIR with a target language that uses another script needs to transliterate (Knight and Graehl, 1998) any material that cannot be translated (typically out-of-vocabulary items like Named Entities). The difficulties of Perso-Arabic in this respect are (a) that some vowels are represented by letters which are also consonants, and (b) that short vowels are customarily omitted. For example, in the Perso-Arabic spelling of Winona, the first waw (و) is used for the W but the second for the O; also, the i sound after the W is missing entirely.

In this paper, we consider Urdu–English CLIR. Starting from a readily available baseline (using Google Translate to obtain English queries), we show that transliteration of Named Entities, more specifically missing vowels, is indeed a major factor in wrongly answered queries. We reconstruct missing vowels in an unsupervised manner through an approximate string matching procedure based on phonetic similarity and orthographic similarity, using the Soundex code (Knuth, 1975) and Levenshtein distance (Gusfield, 1997) respectively, and find a clear improvement over the baseline.

2 Translation Strategies for Urdu–English

We present a series of strategies for translating Urdu queries into English so that they can be presented to a monolingual English IR system that works on some English document collection. Inspection of the strategies' errors led us to develop a hierarchy of increasingly sophisticated strategies.

2.1 Baseline model (GTR)

As our baseline, we aimed for a model that is state-of-the-art, freely available, and usable without heavy computational machinery. We decided to render the Urdu query into English with the Google Translate web service.1

2.2 Approximate Matching (GTR+SoEx)

Google Translate appears to have a limited Urdu lexicon. Words that are out of vocabulary (OOV) are transliterated letter by letter into the Latin alphabet. Without an attempt to restore the short (unwritten) vowels, these match the actual English terms only very rarely. For example, Singur, the name of a village in India, gets translated to Sngur.

To address this problem, we attempt to map these incomplete transliterations onto well-formed English words using approximate string matching. We use Soundex (Knuth, 1975), an algorithm which is normally used for "phonetic normalization". Soundex maps English words onto their first letter plus three digits which represent equivalence classes over consonants, throwing away all vowels in the process. For example, Ashcraft is mapped onto A261, where 2 stands for the "gutturals" and "sibilants" S and K, 6 for R, and 1 for the "labio-dental" F. All codes beyond the first three are ignored. The same Soundex code would be assigned, for example, to Ashcroft, Ashcrop, or even Azaroff. The two components which make Soundex a well-suited choice for our purposes are exactly (a) the forming of equivalence classes over consonants, which counteracts the variance introduced by one-to-many correspondences between Latin and Arabic letters, and (b) the omission of vowels.

Specifically, we use Soundex as a hash function, mapping all English words from our English document collection onto their Soundex codes. The GTR+SoEx model then attempts to correct all words in the Google Translate output by replacing them with the English word sharing the same Soundex code that has the highest frequency in the English document collection.

1 http://translate.google.com. All queries were translated in the first week of January 2011.
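A sketch of the Soundex hash and of the GTR+SoEx replacement index (the Soundex part is the standard textbook algorithm; vocab_freq, a word-to-frequency map over the document collection, is an assumed input):

```python
def soundex(word):
    # Standard Soundex: first letter plus up to three consonant-class
    # digits; vowels are dropped, h and w do not break a run.
    codes = {}
    for digit, letters in enumerate(
            ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
        for ch in letters:
            codes[ch] = str(digit)
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            out += digit
        if ch not in "hw":
            prev = digit
    return (out + "000")[:4]

def build_index(vocab_freq):
    # Soundex code -> most frequent English word with that code,
    # used to replace non-words in the Google Translate output.
    index = {}
    for word, freq in vocab_freq.items():
        code = soundex(word)
        if code not in index or freq > vocab_freq[index[code]]:
            index[code] = word
    return index

print(soundex("Ashcraft"), soundex("Ashcroft"))  # A261 A261
```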

2.3 NER-centered Approximate Matching (GTR+SoExNER)

An analysis of the output of the GTR+SoEx model showed that the model indeed ensured that all words in the translation were English words, but that it "overcorrected", replacing correctly translated but infrequent English words with more frequent words sharing the same Soundex code. Unfortunately, Google Translate does not indicate which words in its output are out-of-vocabulary.

Recall that our original motivation was to improve coverage specifically for out-of-vocabulary words, virtually all of which are Named Entities. Thus, we decided to apply Soundex matching only to NEs. As a practical and simple way of identifying malformed NEs, we considered those words in the Google Translate output which did not occur in the English document base at all (i.e., which were "non-words"). We manually verified that this heuristic indeed identified malformed Named Entities in our experimental materials (see Section 3 below for details). We found a recall of 100% (all true NEs were identified) and a precision of 96% (a small number of non-NEs was classified as NEs).

The GTR+SoExNER strategy applies Soundex matching to all NEs, but not to other words in the Google Translate output.

2.4 Disambiguation (GTR+SoExNER+LD(mod))

Generally, a word that has been wrongly transliterated from Urdu maps onto the same Soundex code as several English words. The median number of English words per transliteration is 7. This can be seen as a sort of ambiguity, and the strategy adopted by the previous models is to just choose the most frequent candidate, similar to the "predominant" sense baseline in word sense disambiguation (McCarthy et al., 2004). We found however that the most frequent candidate is often wrong, since Soundex conflates fairly different words (cf. Section 2.2). For example, Subhas, the first name of an Indian freedom fighter, receives the Soundex code S120 but is mapped onto the English term Space (freq=7243) instead of Subhas (freq=2853).

We therefore experimented with a more informed strategy that chooses the English candidate based on two variants of Levenshtein distance. The first model, GTR+SoExNER+LD, uses standard Levenshtein distance with a cost of 1 for


each insertion, deletion and substitution. Our final model, GTR+SoExNER+LDmod, uses a modified version of Levenshtein distance which is optimized to model the correspondences that we expect. Specifically, the addition of vowels and the replacement of consonants by vowels come with no cost, to favour the recovery of English vowels that are unexpressed in Urdu or expressed as consonants (cf. Section 1). Thus, the LDmod between zdn and zidane would be zero.
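A small sketch of one way to realize such a modified distance, assuming (our reading of the description above) that only target-side vowel insertions and consonant-to-vowel substitutions are free; the paper's exact cost matrix may differ:

def ld_mod(src, tgt, vowels="aeiou"):
    # src: transliterated form (e.g. "zdn"); tgt: English candidate ("zidane")
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                     # deleting a src letter costs 1
    for j in range(1, m + 1):
        # inserting a vowel of tgt is free, other insertions cost 1
        d[0][j] = d[0][j - 1] + (0 if tgt[j - 1] in vowels else 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ins = d[i][j - 1] + (0 if tgt[j - 1] in vowels else 1)
            dele = d[i - 1][j] + 1
            if src[i - 1] == tgt[j - 1] or tgt[j - 1] in vowels:
                sub = d[i - 1][j - 1]   # match, or consonant replaced by vowel
            else:
                sub = d[i - 1][j - 1] + 1
            d[i][j] = min(ins, dele, sub)
    return d[n][m]

With these costs, ld_mod("zdn", "zidane") returns 0, matching the example in the text.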

3 Experimental Setup

Document Collection and Queries. We use the FIRE-2010 [2] English data, consisting of documents and queries, as our experimental materials. The document collection consists of about 124,000 documents from the English-language newspaper "The Telegraph India" [3] from 2004-07. The average length of a document was 40 words. The FIRE query collection consists of 50 English queries which were of the same domain as that of the document collection. The average number of relevant documents for a query was 76 (with a minimum of 13 and a maximum of 228).

The first author, who has an advanced knowledge of Urdu, translated the English FIRE queries manually into Urdu. One of the resulting Urdu queries is shown in Table 1, together with the Google translations back into English (GTR) which form the basis of the CLIR queries in the simplest model. Every query has a title and a description, both of which we used for retrieval. The bottom rows (entity) show the Google Translate output and the output from the best model (Soundex matching with modified Levenshtein distance). The bold-faced terms correspond to names that are corrected successfully, increasing the query's precision from 49% to 86%.

Cross-lingual IR setup. We implemented the models described in Section 2, using the Terrier IR engine (Ounis et al., 2006) for retrieval from the FIRE-2010 English document collection. We used the PL2 weighting model with the term frequency normalisation parameter of 10.99. The document collection and the queries were stemmed using the Porter Stemmer (Porter, 1980). We applied all translation strategies defined in Section 2 as query expansion modules that enrich the Google Translate output with new relevant query terms. In

[2] http://www.isical.ac.in/~fire/2010/data_download.html

[3] http://www.telegraphindia.com/

a pre-experiment, we experimented with adding either only the single most similar term for each OOV item (1-best) or the best n terms (n-best). We consistently found better results for 1-best and report results for this condition only.

Monolingual model. We also computed a monolingual English model which did not use the translated Urdu queries but the original English ones instead. The result for this model can be seen as an upper bound for Urdu-English CLIR models.

Evaluation. We report two evaluation measures. The first one is Mean Average Precision (MAP), an evaluation measure that is highest when all correct items are ranked at the top (Manning et al., 2008). MAP measures the global quality of the ranked document list; however, improvements in MAP could result from an improved treatment of marginally relevant documents, while it is the quality of the top-ranked documents that is most important in practice and correlates best with extrinsic measures (Scholer and Turpin, 2009). Therefore we also consider P@5, the precision of the five top-ranked documents.
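For readers unfamiliar with the two measures, a short sketch of how they are typically computed for a single query (hypothetical helper functions, not part of the Terrier setup; MAP is the mean of average precision over all queries):

def average_precision(ranked, relevant):
    # ranked: doc ids in ranked order; relevant: set of relevant doc ids
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at each hit
    return sum(precisions) / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k=5):
    # fraction of the top-k documents that are relevant (P@5 for k=5)
    return sum(1 for doc in ranked[:k] if doc in relevant) / k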

4 Results and Discussion

Table 2 shows the results of our experiments. Monolingual English retrieval achieves a MAP of 51.4, while the CLIR baseline (Google Translate only – GTR) is 41.3. We expect the results of our experiments to fall between these two extremes.

When we first extend the baseline model with Soundex matching for all terms in the title and description (GTR+SoEx), we actually obtain a result well below the baseline (MAP=36.7). The reason is that, as discussed in Section 2.2, Soundex is too coarse-grained for non-NEs, grouping words such as red and road into the same equivalence class, thus pulling in irrelevant terms. This analysis is supported by the observation, mentioned above, that 1-best always performs better than n-best.

We are however able to obtain a clear improvement of about 1.5% absolute by limiting Soundex matching to automatically identified Named Entities, up to MAP=43.0 (GTR+SoExNER). However, this model still relies completely on frequency for choosing among competitors with the same Soundex code, leading to errors like the Subhas/Space mixup discussed in Section 2.4. The use of Levenshtein distance, representing a more informed manner of disambiguation, makes


title UR:     [Urdu script not reproduced]
title EN (GTR):  Zynydyny zydan World Cup head butt incident
desc UR:      [Urdu script not reproduced]
desc EN (GTR):   Find these documents from public opinion zdn to mtrzzy, from Italian to zydan about offensive comments, World Cup finals in 2006 head to kill incidents are mentioned
entity EN (GTR):              Zynydyny Zydan zdn Mtrzzy
entity (GTR+SoExNER+LDmod):   zinedine zaydan zidane materazzi

Table 1: A sample query

Model                 MAP    P@5
GTR                   41.3   62.4
GTR+SoEx              36.7   59.2
GTR+SoExNER           43.0   62.4
GTR+SoExNER+LD        45.0   65.2
GTR+SoExNER+LDmod     45.3   65.6
Monolingual English   51.4   71.6

Table 2: Results for Urdu-English CLIR models on the FIRE 2010 collection (Mean Average Precision and Precision of top five documents)

a considerable difference, and leads to a final MAP of 45.33, or about 4% absolute increase, for the GTR+SoExNER+LDmod model. A bootstrap resampling analysis (Efron and Tibshirani, 1994) confirmed that the difference between the GTR+SoExNER+LDmod and GTR models is significant (p<0.05). All models are still significantly worse than the monolingual English model.

The P@5 results are in tandem with the MAP results for all models, showing that the improvement which we obtain for the best model leads to top-5 lists whose precision is on average more than 3% better than the baseline top-5 lists. This difference is not significant, but we attribute the absence of significance to the small sample size (50 queries).

In a qualitative analysis, we found that many remaining low-MAP queries still suffer from missing or incorrect Named Entities. For example, Noida (an industrial area near New Delhi) was transliterated to Nuydh and then incorrectly modified to Nidhi (an Indian name). This case demonstrates the limits of our method, which cannot distinguish well among NEs which differ mainly in their vowels.

5 Related Work

There are several areas of related work. The first is IR in Urdu, where monolingual work has been done (Riaz, 2008). However, to our knowledge, our study is the first one to address Urdu CLIR. The second is machine transliteration, which is a widely researched area (Knight and Graehl, 1998) but which usually requires some sort of bilingual resource. Knight and Graehl (1998) use 8000 English-Japanese place name pairs, and Mandal et al. (2007) hand-code rules for Hindi and Bengali to English. In contrast, our method does not require any bilingual resources. Finally, Soundex codes have been applied to Thai-English CLIR (Suwanvisat and Prasitjutrakul, 1998) and Arabic name search (Aqeel et al., 2006). They have also been found useful for indexing Named Entities (Raghavan and Allan, 2004; Kondrak, 2004) as well as IR more generally (Holmes and McCabe, 2002).

6 Conclusion

In this paper, we have considered CLIR from Urdu into English. With Google Translate as translation system, the biggest hurdle is that most named entities are out-of-vocabulary items and transliterated incorrectly. A simple, completely unsupervised postprocessing strategy that replaces English non-words by phonetically similar words with minimal edit distance is able to recover almost half of the loss in MAP that the cross-lingual setup incurs over a monolingual English one. Directions for future work include monolingual query expansion in Urdu to improve the non-NE part of the query and training a full Urdu-English transliteration system.

Acknowledgments. We thank A. Tripathi and H. Sajjad for invaluable discussions and suggestions.


References

Syed Uzair Aqeel, Steve Beitzel, Eric Jensen, David Grossman, and Ophir Frieder. 2006. On the development of name search techniques for Arabic. Journal of the American Society for Information Science and Technology, 57(6):728–739.

Dara Becker and Kashif Riaz. 2002. A study in Urdu corpus construction. In Proceedings of the 3rd COLING Workshop on Asian Language Resources and International Standardization, pages 1–5, Taipei, Taiwan.

Nadir Durrani and Sarmad Hussain. 2010. Urdu word segmentation. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 528–536, Los Angeles, California.

Bradley Efron and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57. Chapman & Hall.

Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK.

David Holmes and M. Catherine McCabe. 2002. Improving precision and recall for Soundex retrieval. In Proceedings of the International Conference on Information Technology: Coding and Computing, pages 22–27, Washington, DC, USA.

Sarmad Hussain, Sana Gul, and Afifah Waseem. 2007. Developing lexicographic sorting: An example for Urdu. ACM Transactions on Asian Language Information Processing, 6(3):10:1–10:17.

Sarmad Hussain. 2008. Resources for Urdu language processing. In Proceedings of the Workshop on Asian Language Resources at IJCNLP 2008, Hyderabad, India.

Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.

Donald E. Knuth. 1975. Fundamental Algorithms, volume III of The Art of Computer Programming. Addison-Wesley, Reading, MA.

Grzegorz Kondrak. 2004. Identification of confusable drug names: A new approach and evaluation methodology. In Proceedings of the International Conference on Computational Linguistics, pages 952–958, Geneva, Switzerland.

M. Paul Lewis, editor. 2009. Ethnologue – Languages of the World. SIL International, 16th edition.

Debasis Mandal, Mayank Gupta, Sandipan Dandapat, Pratyush Banerjee, and Sudeshna Sarkar. 2007. Bengali and Hindi to English CLIR evaluation. In Proceedings of CLEF, pages 95–102.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 1st edition.

Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. 2004. Finding predominant senses in untagged text. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 280–287, Barcelona, Spain.

Smruthi Mukund, Rohini Srihari, and Erik Peterson. 2010. An information-extraction system for Urdu – a resource-poor language. ACM Transactions on Asian Language Information Processing, 9:15:1–15:43.

Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Christina Lioma. 2006. Terrier: A high performance and scalable information retrieval platform. In Proceedings of the ACM SIGIR'06 Workshop on Open Source Information Retrieval, Seattle, WA, USA.

Martin F. Porter. 1980. An algorithm for suffix stripping. Program, 14(3):130–137.

Hema Raghavan and James Allan. 2004. Using Soundex codes for indexing names in ASR documents. In Proceedings of the HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pages 22–27, Boston, MA.

Kashif Riaz. 2008. Baseline for Urdu IR evaluation. In Proceedings of the 2nd ACM Workshop on Improving Non-English Web Searching, pages 97–100, Napa, CA, USA.

Hassan Sajjad and Helmut Schmid. 2009. Tagging Urdu text with parts of speech: A tagger comparison. In Proceedings of the 12th Conference of the European Chapter of the ACL, pages 692–700, Athens, Greece.

Falk Scholer and Andrew Turpin. 2009. Metric and relevance mismatch in retrieval evaluation. In Information Retrieval Technology, volume 5839 of Lecture Notes in Computer Science, pages 50–62. Springer.

Prayut Suwanvisat and Somboon Prasitjutrakul. 1998. Thai-English cross-language transliterated word retrieval using Soundex technique. In Proceedings of the National Computer Science and Engineering Conference, Bangkok, Thailand.


Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 30–34,Chiang Mai, Thailand, November 8-12, 2011.

Unsupervised Russian POS Tagging with Appropriate Context

Li Yang, Erik Peterson, John Chen, Yana Petrova, and Rohini Srihari
Janya Inc.
1408 Sweet Home Road, Suite 1, Amherst, NY 14228, USA
lyang,epeterson,jchen,ypetrova,[email protected]

Abstract

While adopting the contextualized hidden Markov model (CHMM) framework for unsupervised Russian POS tagging, we investigate the possibility of utilizing the left, right, and unambiguous context in the CHMM framework. We propose a backoff smoothing method that incorporates all three types of context into the transition probability estimation during the expectation-maximization process. The resulting model with this new method achieves overall and disambiguation accuracies comparable to a CHMM using the classic backoff smoothing method for HMM-based POS tagging from (Thede and Harper, 1999).

1 Introduction

A careful review of the work on unsupervised POS tagging in the past two decades reveals that the hidden Markov model (HMM) has been the standard approach since the seminal work of (Kupiec, 1992) and (Merialdo, 1994) and that researchers sought to improve HMM-based unsupervised POS tagging from a variety of perspectives, including exploring dictionary usage, context utilization, sparsity control and modeling, and parameter and model updates tuned to linguistic features. For example, (Banko and Moore, 2004) and (Goldberg et al., 2008) utilized the contextualized HMM (CHMM) to capture rich context. To account for sparsity, (Goldwater and Griffiths, 2007) and (Johnson, 2007) utilized the Dirichlet hyperparameters of the Bayesian HMM. (Berg-Kirkpatrick et al., 2010) integrated the discriminative logistic regression model into the M-step of the standard generative model to allow rich linguistically-motivated features.

Unsupervised systems went beyond the mainstream HMM framework by employing methods such as prototype-driven clustering (Haghighi and Klein, 2006; Abend et al., 2010), Bayesian LDA (Toutanova and Johnson, 2007), integer programming (Ravi and Knight, 2009), and K-means clustering (Lamar et al., 2010).

Despite this large body of work, little effort has been devoted to unsupervised Russian POS tagging. Supervised Russian POS systems emerged in recent years. For example, eleven supervised systems entered the POS track of the 2010 Russian Morphological Parsers Evaluation.[1] Although the top two systems from the 2010 Evaluation achieved near perfect accuracy over the Russian National Corpus, little has been done on unsupervised Russian POS tagging. In this paper, we present our solution to unsupervised Russian POS tagging by adopting the CHMM. Our choice is based on the accuracy and efficiency of CHMM, an identical rationale to that behind (Goldberg et al., 2008).

We aim to achieve two goals. First, we intend to resolve the potential issue of missing useful contextual features by the backoff smoothing scheme in (Thede and Harper, 1999) and (Goldberg et al., 2008) for transition probabilities. Second, we explore the possibility of incorporating unambiguous context into transition probability estimation in an HMM framework. We propose a novel plan to achieve both goals in a unified approach.

In the following, we adopt the CHMM for unsupervised Russian POS tagging in section 2. Section 3 highlights the potential issue of missing useful left context in the backoff scheme by (Thede and Harper, 1999). Section 4 illustrates an updated backoff scheme to resolve this potential issue. This scheme also unifies the left, right, and unambiguous context. The experiments and discussion are presented in section 5. We present conclusions in section 6.

[1] See http://ru-eval.ru/tables index.html


2 CHMM for Russian POS Tagging

Our system is built upon the architecture of a contextualized HMM. Like other existing unsupervised HMM-based POS systems, the task of unsupervised POS tagging for us is to construct an HMM to predict the most likely POS tag sequence in the new data, given only a dictionary listing all possible parts-of-speech of a set of words and a large amount of unlabeled text for training.

Traditionally, the transition probability in a second-order HMM is given by p(ti|ti−2ti−1), and the emission probability by p(wi|ti) (Kriouile, 1990; Banko and Moore, 2004). The CHMM, such as (Banko and Moore, 2004), (Adler, 2007), and (Goldberg et al., 2008), incorporates more context into the transition and emission probabilities. Here, we adopt the transition probability p(ti|ti−1ti+1) of (Adler, 2007) and (Goldberg et al., 2008) and the emission probability p(wi|titi+1) of (Adler, 2007).

Our training corpus consists of all 406,342 words of the plain text for training from the Appen Russian Named Entity Corpus,[2] containing textual documents from a variety of sources. We created a POS dictionary for all 61,020 unique tokens in this corpus, using the output from the Russian lemmatizer.[3] The lemmatizer returns the stems of words and a list of POS tags for each word, relying on the morphology dictionary of the AOT Team.[4] Our tag set consists of 17 tags, comparable to those used in the Russian National Corpus (RNC),[5] with the only addition of the Punct tag for punctuation marks. We relied on the Appen data because we did not have access to the RNC when our project was being developed, but we hope to be able to train and test our system with the RNC in the future.

3 Parameter Estimation and a Potential Issue

Given the model and resources for training described in section 2, we estimate the model parameters for our CHMM by following the standard EM procedures. During pre-processing, the dictionary is consulted, and a list of potential POS tags is provided for each word/token in the training sequence. In case of unknown words, the morphology analyzer built into the Russian lemmatizer suggests a list of tags. If the morphology analyzer does not make any suggestion, a list of open POS tags is assigned to the unknown words.

[2] Licensed from http://www.appen.com.au/
[3] Available at http://lemmatizer.org/en/
[4] See http://aot.ru/
[5] Listed at http://www.ruscorpora.ru

The potential POS tags in the training data provide counts to roughly estimate the initial transition and emission probabilities. (Adler, 2007) initialized transition probabilities using a small portion of the training data. In our work, we initialize the emission probabilities using 20% of the training data with

p(w_i \mid t_i t_{i+1}) = \frac{\#(w_i, t_i, t_{i+1})}{\#(t_i, t_{i+1})}

During the EM process, we use additive smoothing when estimating p(wi|titi+1) (Chen, 1996).

We initialize the transition probabilities p(ti|ti−1ti+1) with a uniform distribution. When re-estimating p(ti|ti−1ti+1), we use the method from (Thede and Harper, 1999) for backoff smoothing in equation (1).

p(t_i \mid t_{i-1} t_{i+1}) = \lambda_3 \frac{N_3}{C_2} + (1-\lambda_3)\,\lambda_2 \cdot \frac{N_2}{C_1} + (1-\lambda_3)(1-\lambda_2) \cdot \frac{N_1}{C_0} \qquad (1)

The λ coefficients are calculated the same way as in (Thede and Harper, 1999), that is, \lambda_2 = \frac{\log(N_2+1)+1}{\log(N_2+1)+2} and \lambda_3 = \frac{\log(N_3+1)+1}{\log(N_3+1)+2}. The counts N_i and C_j are modified for our unsupervised CHMM, as shown in Table 1. Note that N_2 captures the counts of the bi-gram t_i t_{i+1}, consisting of the current state t_i and its right context t_{i+1}.
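As a concrete illustration, equation (1) with its λ weights can be written down directly; the sketch below assumes the reading of the λ formula given above, (log(N+1)+1)/(log(N+1)+2), which keeps each weight between 0 and 1, and it is not the authors' implementation:

from math import log

def lam(n):
    # backoff weight in the style of Thede and Harper (1999);
    # grows towards 1 as the n-gram count n increases
    return (log(n + 1) + 1) / (log(n + 1) + 2)

def transition_prob(n1, n2, n3, c0, c1, c2):
    # equation (1): interpolation of tri-gram, bi-gram and uni-gram estimates
    l2, l3 = lam(n2), lam(n3)
    return (l3 * n3 / c2
            + (1 - l3) * l2 * n2 / c1
            + (1 - l3) * (1 - l2) * n1 / c0)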

(Thede and Harper, 1999) and (Goldberg et al., 2008) show that equation (1) is quite effective in both supervised and unsupervised scenarios. However, in our case where Russian is concerned, there are situations where equation (1) may not give good estimates.

Through RNC's online search tool, we discovered that a word from a specific set of pronouns following a comma is always analyzed as a conjunction, which itself can be followed by a number of possible POS tags. This set includes ambiguous words such as chto and chem. Although the Appen corpus does not come with POS tags, our Russian linguist observed similar linguistic regularities in the corpus. Some examples regarding chto from

N1 = N^e_1   estimated counts of t_{i+1}
N2 = N^e_2   estimated counts of t_i t_{i+1}
N3 = N^e_3   estimated counts of t_{i−1} t_i t_{i+1}
C0 = C^e_0   estimated total # of tags
C1 = C^e_1   estimated counts of t_i
C2 = C^e_2   estimated counts of t_{i−1} t_{i+1}

Table 1: Estimated counts, marked with superscript e.


Appen are listed below.

Example 1: ,(Punct) chto(CONJ) na(PREP)
Gloss: comma and/or/that on

Example 2: ,(Punct) chto(CONJ) gotovy(ADJ)
Gloss: comma and/or/that ready

In the preceding examples, the comma to the left of chto provides a useful clue. However, a potential issue arises when we estimate p(ti|ti−1ti+1) using equation (1). That is, when the tri-gram ti−1titi+1 is rare and the first term of the equation is very small, the second term will affect p(ti|ti−1ti+1) more. The count N2 in the second term is for the bi-gram (chto-CONJ, right word-POS) but not for (left word-comma, chto-CONJ). Therefore, the useful clue in the latter bi-gram is missed. To resolve this, one cannot simply switch to the left context in N2 because there are cases where the right context provides more of a clue. For example, as observed in the Russian National Corpus, adjectival pronouns are only followed by a noun or an adjective and a noun, so the right context of adjectival pronouns is more important for disambiguating them. Several more examples from the Appen data where the left or right context contributes to disambiguation are listed in the Appendix.

4 Incorporating All Three Types of Context

Several systems made use of the information provided in unambiguous POS tag sequences. (Brill, 1995) learned rules from the context of unambiguous words. (Mihalcea, 2003) created equivalence classes from unambiguous words for training. We expected the assumption that unambiguous context helps with disambiguation to hold for Russian as well.

N1 = N^u_1       # of unambiguous counts of t_{i+1}
N^L_2 = N^{uL}_2   # of unamb. bi-grams t_{i−1} t_i with left context t_{i−1}
N^R_2 = N^{uR}_2   # of unamb. bi-grams t_i t_{i+1} with right context t_{i+1}
N3 = N^u_3       # of unamb. tri-grams t_{i−1} t_i t_{i+1}
C0 = C^u_0       total # of unamb. tags
C1 = C^u_1       # of unamb. t_i
C2 = C^u_2       # of unamb. bi-grams t_{i−1} t_{i+1}

Table 2: Counts of unambiguous tri-grams, bi-grams, and unigrams. The superscript u stands for unambiguous counts.

N^u_1 ← N^e_1       estimated counts of t_{i+1}
N^{uL}_2 ← N^{eL}_2   estimated counts of t_{i−1} t_i
N^{uR}_2 ← N^{eR}_2   estimated counts of t_i t_{i+1}
N^u_3 ← N^e_3       estimated counts of t_{i−1} t_i t_{i+1}
C^u_0 ← C^e_0       estimated total # of tags
C^u_1 ← C^e_1       estimated counts of t_i
C^u_2 ← C^e_2       estimated counts of t_{i−1} t_{i+1}

Table 3: Replacement plan for unambiguous counts

In the Appen training corpus, 84% of the words/tokens have a unique POS tag, based on our dictionary and the Russian lemmatizer. We can easily spot examples in the corpus where unambiguous context helps with disambiguation. Again, in our earlier example, ,(Punct) chto(CONJ) na(PREP), the unambiguous left context ',' reveals that chto is a CONJ instead of a PRON. To take advantage of the unambiguous context, we collect the counts for all unambiguous tri-gram and bi-gram sequences in the Appen training corpus and integrate these counts into equation (2) through the equivalences in Table 2.

p(t_i \mid t_{i-1} t_{i+1}) = \lambda_3 \frac{N_3}{C_2} + (1-\lambda_3)\,\lambda_2 \cdot \frac{N^L_2}{C^L_1} \times \frac{N^R_2}{C^R_1} + (1-\lambda_3)(1-\lambda_2) \cdot \frac{N_1}{C_0} \qquad (2)

where \lambda_2 = \frac{\log(N^L_2+1)+1}{\log(N^L_2+1)+2} \times \frac{\log(N^R_2+1)+1}{\log(N^R_2+1)+2} and \lambda_3 = \frac{\log(N_3+1)+1}{\log(N_3+1)+2}. λ2 incorporates both the left and right context. The unambiguous counts are defined in Table 2.

Now that the new backoff smoothing plan combines both the left and right unambiguous bi-gram counts, we extend this plan to cover the cases where the unambiguous tri/bi/uni-grams are not available, by replacing them with the estimated counts from Table 1. Table 3 displays the scheme for replacing an unambiguous count with an estimated count from the EM process.
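A sketch of equation (2) under the same assumptions as the sketch for equation (1); the counts passed in are the unambiguous counts, replaced beforehand by the EM estimates whenever the unambiguous n-gram was never observed (Table 3):

from math import log

def lam(n):
    # backoff weight as in equation (1)
    return (log(n + 1) + 1) / (log(n + 1) + 2)

def transition_prob_unified(n1, n2l, n2r, n3, c0, c1l, c1r, c2):
    # equation (2): the bi-gram term multiplies a left-context and a
    # right-context estimate, and lambda_2 likewise combines both weights
    l2 = lam(n2l) * lam(n2r)
    l3 = lam(n3)
    return (l3 * n3 / c2
            + (1 - l3) * l2 * (n2l / c1l) * (n2r / c1r)
            + (1 - l3) * (1 - l2) * n1 / c0)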

5 Experiments and Results

We designed three experiments to test three combinations of the context, in addition to experimenting with a traditional second-order HMM. The Appen corpus contains a development set and an evaluation set. We passed both sets through the Russian lemmatizer to obtain POS tags for the data and had the tags manually corrected by a Russian linguist. Thus, we have created both development and evaluation data. 14% of words/tokens in both development and evaluation data have multiple POS tags. Table 4 summarizes our experimental settings and results over the evaluation data.

Model & setting(s)                 Overall Accuracy   Disamb. Accuracy
2nd-order HMM                      94.88%             63.42%
CHMM left context                  95.72%             69.42%
CHMM right context                 96.05%             71.78%
CHMM unique ← left/right context   96.06%             71.85%

Table 4: Experiments, overall and disambiguation accuracies over test data

The second-order HMM was trained with the traditional transition probability p(ti|ti−2ti−1) and emission probability p(wi|ti). It gained an overall accuracy of 94.88%, and was able to correctly disambiguate 63.42% of the ambiguous words/tokens.

All three CHMM models were trained with the emission probability p(wi|titi+1) initialized with 20% of the unlabeled training corpus. Model "CHMM left context" considered the left context bi-gram ti−1ti when calculating the second term in equation (1). Model "CHMM right context" considered the right context bi-gram titi+1 when calculating the same term. Model "CHMM unique ← left/right" unified both unambiguous context counts and estimated counts for left and right context from the EM process, using equation (2).

All CHMM models achieved overall accuracies about 1% higher than the HMM, while their disambiguation accuracies are 7–9% higher. This shows that the CHMM models capture more useful context information for Russian POS tagging than the traditional HMM. At the same time, the overall and disambiguation accuracies of "CHMM right context" and "CHMM unique ← left/right" are comparable. Error analyses indicate that a backoff scheme for emission probabilities is also needed to incorporate the left context.

6 Conclusion and Future Work

We adopted the CHMM for unsupervised Russian POS tagging. The CHMM models using either the left or right context were able to outperform the traditional second-order HMM. To resolve the potential issue of missing out on the left context with the classic smoothing scheme in (Thede and Harper, 1999), we experimented with an approach to unifying the information provided in the left, right, and unambiguous contexts. The results from the latter were comparable to a CHMM with the classic backoff smoothing method in (Thede and Harper, 1999), although we expected a more significant improvement. We plan to investigate a backoff scheme for emission probabilities where we will incorporate the left context as well, while currently we only rely on additive smoothing for emission probabilities.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments and suggestions. Our work was partially funded by the Air Force Research Laboratory/RIEH in Rome, New York through contracts FA8750-09-C-0038 and FA8750-10-C-0124.

References

Omri Abend, Roi Reichart, and Ari Rappoport. 2010. Improved unsupervised POS induction through prototype discovery. In Proceedings of the 48th ACL.

Meni Adler. 2007. Hebrew Morphological Disambiguation. Ph.D. thesis, University of the Negev.

Michele Banko and Robert C. Moore. 2004. Part of speech tagging in context. In Proceedings of the 20th International Conference on Computational Linguistics.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proceedings of NAACL 2010.

Eric Brill. 1995. Unsupervised learning of disambiguation rules for part of speech tagging. In Very Large Corpora, pages 1–13. Kluwer Academic Press.

Stanley F. Chen. 1996. Building Probabilistic Models for Natural Language. Ph.D. thesis, Harvard University.

Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM can find pretty good POS taggers (when given a good start). In Proceedings of ACL-08: HLT.

Sharon Goldwater and Tom Griffiths. 2007. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th ACL.

Aria Haghighi and Dan Klein. 2006. Prototype-driven learning for sequence models. In Proceedings of the main conference on HLT-NAACL.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In EMNLP.

Abdelaziz Kriouile. 1990. Some improvements in speech recognition algorithms based on HMM. In Acoustics, Speech, and Signal Processing.

Julian Kupiec. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech & Language, 6:225–242.

Michael Lamar, Yariv Maron, and Elie Bienenstock. 2010. Latent descriptor clustering for unsupervised POS induction. In EMNLP 2010.

Bernard Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20:155–171.

Rada Mihalcea. 2003. The role of non-ambiguous words in natural language disambiguation. In Proceedings of the Conference on RANLP.

Sujith Ravi and Kevin Knight. 2009. Minimized models for unsupervised part-of-speech tagging. In Proceedings of ACL-IJCNLP 2009, pages 504–512.

Scott M. Thede and Mary P. Harper. 1999. A second-order hidden Markov model for part-of-speech tagging. In Proceedings of the 37th Annual Meeting of the ACL.

Kristina Toutanova and Mark Johnson. 2007. A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In Proceedings of NIPS.

Appendix: Linguistic Patterns Observed in Appen

In Section 3, we illustrated how the left context helped to disambiguate chto. In the following we present several more examples from the Appen corpus illustrating helpful left or right context. While the patterns our Russian linguist observed are common in both the RNC and Appen, the counts and statistics regarding each pattern are unavailable for reporting because the RNC was then inaccessible to us and Appen was not tagged with POS tags.

Examples 3 through 7 show that the left context of chem, poka, and kak helps to disambiguate them as conjunctions.

Example 3: ,(Punct) chem(CONJ) v(PREP) stolitse(NOUN)
Gloss: comma and/than in capital

Example 4: ,(Punct) poka(CONJ) eta(PRONOUN)
Gloss: comma yet this

Example 5: ,(Punct) poka(CONJ) Sovet(NOUN)
Gloss: comma yet council

Example 6: ,(Punct) kak(CONJ) dva(NUMERAL) neudachnika(NOUN)
Gloss: comma as two losers

Example 7: ,(Punct) kak(CONJ) on(PRONOUN)
Gloss: comma as he

The next examples show that the right context determines the adjectival tag, PRONOUN P, of the pronouns.

Example 8: obekty(NOUN) svoey(PRONOUN P) sistemy(NOUN)
Gloss: units their/they system

Example 9: esli(CONJ) mnogie(PRONOUN P) mnogie(NOUN)
Gloss: if many/various emigrants


Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 35–43,Chiang Mai, Thailand, November 8-12, 2011.

Extending a multilingual Lexical Resource by bootstrapping Named Entity Classification using Wikipedia's Category System

Johannes Knopp
KR & KM Research Group, Department of Computer Science
Universität Mannheim, B6 26, 68159 Mannheim, Germany
[email protected]

Abstract

Named Entity Recognition and Classification (NERC) is a well-studied NLP task which is typically approached using machine learning algorithms that rely on training data whose creation usually is expensive. The high costs result in the lack of NERC training data for many languages. An approach to create a multilingual NE corpus was presented in Wentland et al. (2008). The resulting resource called HeiNER describes a valuable number of NEs but does not include their types. We present a bootstrap approach based on Wikipedia's category system to classify the NEs contained in HeiNER that is able to classify more than two million named entities to improve the resource's quality.

1 Introduction

For tasks in information extraction NERC is very important, and often supervised machine learning approaches are used to solve it, e.g. Bender et al. (2003) or Szarvas et al. (2006). In A survey of named entity recognition and classification, David Nadeau and Satoshi Sekine conclude:

"When supervised learning is used, a prerequisite is the availability of a large collection of annotated data. Such collections are available from the evaluation forums but remain rather rare and limited in domain and language coverage" (Nadeau and Sekine, 2007)

To overcome the problem of limited language coverage, Wentland et al. (2008) started to create the multilingual Heidelberg Named Entity Resource (HeiNER). In more than 250 languages, HeiNER lists Wikipedia (WP) articles that describe a named entity (NE); in 16 of those languages it contains a collection of textual contexts a NE was unambiguously mentioned in. Those contexts provide useful training material for NE classification, thus the goal of this work is to add NE types to HeiNER's entries.

Unlike the widely used machine learning approaches to NERC, our classification method relies only on WP's category system and thus does not need any language specific information. The idea is to first determine sets of WP categories to identify each NE type. After that, these sets are used to initialize a bootstrapping algorithm that identifies the types for unclassified NEs. NE types follow the CoNLL definition presented by Sang (2002): person (PER), location (LOC), organization (ORG) and miscellaneous (MISC).[1]

The CoNLL types were chosen because HeiNER's evaluation was based on the CoNLL types. The following sections reveal details about HeiNER (section 2), describe the bootstrap approach of NE classification with WP categories (section 3) and show the results in the evaluation section (section 4).

2 HeiNER

As this work builds upon the Heidelberg Named Entity Resource (HeiNER), we will describe the data that HeiNER provides and how they were created, to give the reader an idea about their quality and structure.

HeiNER is a multilingual collection of named entities along with disambiguated context excerpts and a disambiguation dictionary that maps proper names to a set of NEs the proper names may refer to. The resource was created automatically from Wikipedia relying on (i) the heuristic presented in Bunescu and Pasca (2006) to recognize English Wikipedia articles that denote a NE and (ii) Wikipedia's link structure.

[1] cf. http://www.cnts.ua.ac.be/conll2003/ner/annotation.txt


<transDict>
  <namedEntity id='2134'>
    <an>Organizazion d'as Nazions Unitas</an>
    <bs>Ujedinjeni narodi</bs>
    <ga>Náisiúin Aontaithe</ga>
    <gl>ONU</gl>
    <hu>Egyesült Nemzetek Szervezete</hu>
    <lb>Vereent Natiounen</lb>
    <nds>Vereente Natschonen</nds>
    <tr>Birleşmiş Milletler</tr>
    <en>United Nations</en>
    ...
  </namedEntity>
</transDict>

Figure 1: Example of the entry for "United Nations" in the translation dictionary

First, the NER heuristic based on uppercase letters generated a list of English WP articles that denote a NE. This method created more than 1.5 million NEs with a precision of 95%.[2] With help of WP's interlanguage links the available translations for every NE were added to the list, resulting in the translation dictionary shown in figure 1. All of the more than 250 languages available in WP were considered to create the NE translations.

As the NE articles in WP are known from the first step, the disambiguation dictionary is built afterwards using disambiguation and redirect links to map proper names to NEs. Finally the context dataset is created for every NE by storing the paragraphs they are unambiguously mentioned in. This was done for 16 languages. An excerpt of the context dataset is shown in Figure 2 below.

<dataset neID='2134' lang='en' neStr='United Nations'>
  <context id='0'>
    <surfaceForm>United Nations</surfaceForm>
    <leftContext>The World Health Organization (WHO) is a
    specialized agency of the</leftContext>
    <rightContext>(UN) that acts as a coordinating
    authority on international public health.</rightContext>
  </context>
</dataset>

Figure 2: Excerpt from the English context dataset for the NE "United Nations"

The NEs together with disambiguated contexts in different languages can be considered useful data for NE disambiguation, classification or machine translation (e.g. Federmann and Hunsicker (2011)).[3] For this paper the heuristics to create the list of English NEs were run on the more recent WP dump of November 3rd 2009 and resulted in a total of 2,225,193 found NEs, compared to 1,547,586 NEs reported in the original paper. The difference is solely caused by the natural growth of Wikipedia.

[2] Read Wentland et al. (2008) for more details.

3 A Bootstrap Approach to NE Classification with WP Categories

As described in Section 2, HeiNER presents a lot of context information for NEs. To release the full potential of the multilingual data, the NEs need to be annotated with their respective type.

Instead of using a classical NER system, this work concentrates on a language agnostic approach that is based on WP's category structure, which is not only suited for NER but can be used for other classifications based on WP categories as well. In short, the idea is to identify WP categories that correspond to a NE type and then use those categories to classify NEs that are placed in those typed categories. The categories can be interpreted as a signature or footprint of a NE type. The method outline is as follows: First, for every NE type a list of seed categories is created manually. It is enhanced by taking two levels of subcategories into account. The resulting lists of type specific categories are used to classify the articles in HeiNER by looking up if they are placed in one of the seed categories and assigning the respective type. The steps are illustrated in figure 3.

[Figure 3 diagram omitted]

Figure 3: The manually chosen and enhanced seed categories generate the initial list of classified articles. The illustration shows the method for PER; it works in the same way for the other categories.

This leaves most of the NEs in HeiNER unclassified, but the initially classified NEs can be used for the bootstrapping solution that is visualized in figure 4: For every NE type, a NE type vector based on categories is built by looking up all categories of the now classified articles and counting them for each type. The articles are then classified by computing the similarity between their category vector and the four NE type vectors and choosing the most similar one. This is done in ten iterations where each step updates the type vectors with the newly classified articles. The only manual work needed is collecting the seed categories. This can be applied in any language that is available in WP. We use the English version because it is by far the largest edition. Also note that the seeds define the result of the classification. More fine grained types like politician or entertainer (cf. Fleischman and Hovy (2002)) could be easily implemented by choosing other seeds.

[3] HeiNER is available for the scientific community at http://heiner.cl.uni-heidelberg.de/

[Figure 4 diagram omitted]

Figure 4: Bootstrapping loop to classify articles. Type vectors are computed from the categories of classified articles, similarities to unclassified articles are computed, and the best 10% are added to the classified articles.

After this broad overview, the subsections present a more detailed description of the approach. For that we introduce the notation scheme used in this paper: The set of NE types t ∈ T consists of persons PER, locations LOC, organizations ORG and miscellaneous MISC. C denotes the set of all categories in the English Wikipedia. Single categories that are mentioned in the text are written in SMALL CAPS.

3.1 Generating Seed Categories

For every NE type the seed categories hold a set of WP categories such that any NE article that is placed in one of them is considered to be of the type the category is associated with. Because the classification method relies on the seeds' quality, they have to be annotated manually. The goal is to find categories that are broad enough to classify as many NEs as possible but also accurate enough to avoid incorrect classifications.

To find the best seed categories for the NE types person, location, organization and miscellaneous, we started to randomly pick NE articles belonging to one type, then inspect the categories it is placed in and move up in the category tree by following supercategories until the topic range of a category gets too broad for unambiguous classification. The broad-but-accurate categories are added to the seed set of the respective type. Because the subcategories can be considered to be useful for the classification process, we add two levels of subcategories to the initial seed list. The restriction to two levels of subcategories is needed to avoid adding noise, because WP's category system is a graph, not a tree.
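A sketch of this two-level expansion, assuming a hypothetical mapping subcats_of from a category name to its direct subcategories:

def expand_seeds(seeds, subcats_of, levels=2):
    # add `levels` levels of subcategories to the manually chosen seeds;
    # the depth limit avoids noise, since WP's category system is a graph
    typed = set(seeds)
    frontier = set(seeds)
    for _ in range(levels):
        frontier = {s for c in frontier
                    for s in subcats_of.get(c, ())} - typed
        typed |= frontier
    return typed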

An example for the manual creation of seed categories might help at this point: if we are interested in the NE type person, we start with a random WP article about a person, e.g. Jimmy Hendrix. We always follow the most promising supercategories, which leads to the following chain:

1960S SINGERS ⇒ SINGERS BY TIME PERIOD ⇒ PEOPLE BY OCCUPATION AND PERIOD ⇒ PEOPLE BY OCCUPATION ⇒ PEOPLE

The accuracy of each category is checked by inspecting subcategories and articles belonging to it. The category PEOPLE has a subcategory BIBLIOGRAPHY which deals with biographical books. Thus, PEOPLE itself is not accurate enough to find persons. Still most of the subcategories of PEOPLE like PEOPLE BY OCCUPATION or PEOPLE BY RELIGION are added to the seed categories of NE type person.

As a result there are 15 seed categories found for the type person. The same was carried out for the other NE types. All seed categories together with two levels of subcategories form the set of typed categories Ct. The results can be seen in table 1.

The number of seed categories does not necessarily correlate with the number of found subcategories: The types PER and LOC have the same count of seed categories, but CPER is almost 3.5 times bigger than CLOC and has about 1,500 categories more than CORG, which started with 75 seed categories. An explanation would be that persons are supported well and have a very fine grained categorization, while locations can be described with a smaller set of categories. CMISC remains in between the others with 4,747 subcategories.


type t   seed categories   subcategories   typed categories Ct
PER      15                9,625           9,640
LOC      15                2,783           2,798
ORG      75                8,033           8,108
MISC     27                4,747           4,774

Table 1: Numbers of categories found for each NE type, derived from seed categories.

3.2 Initial Named Entity Classification

Starting from the enhanced seed categories the initial list of classified NEs can be created easily: just iterate over every article in HeiNER and check if it is placed in Ct. If this is the case the article can be considered to be of type t and hence is added to the set of classified NE articles NEt. If more than one type was found for an article, it is left unclassified. The results of this initial classification are shown in table 2.
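In the style of the pseudocode in figure 5 below, this lookup amounts to a set intersection per type; a sketch assuming article objects with a .categories attribute:

def initial_classification(articles, typed_categories):
    # typed_categories: dict mapping NE type -> set of categories C_t
    classified = {t: set() for t in typed_categories}
    for article in articles:
        matches = [t for t, cats in typed_categories.items()
                   if cats & set(article.categories)]
        if len(matches) == 1:   # articles matching several types stay unclassified
            classified[matches[0]].add(article)
    return classified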

To point out the generative power of the categories, the last column of table 2 shows the "productivity ratio" NEt/Ct of each type. The earlier assumption that there are more articles of type PER than others is supported by the fact that more than half a million NEs could be initially classified and also by the number of articles found per category. This cannot be solely based on the superior count of PER categories, because the number of ORG related categories is not that far behind, though NEORG is about 4 times smaller than NEPER. Also the PER related categories are about five times more productive than the ones related to MISC. In other words, most of WP's contributors write articles about NEs of the type PER and categorize them studiously. The quality of the results will be discussed in the evaluation in section 4.

Type t   Ct      NEt       NEt/Ct
PER      9,640   502,173   52
LOC      2,798   41,539    15
ORG      8,108   128,433   16
MISC     4,774   47,887    10

Table 2: Number of classified articles derived from seed categories. The last column shows the rounded average number of classifications produced by each category.

3.3 Type Vectors & Bootstrapping

After the initial classification step we can remove the 720,032 classified articles from the NE list with 2,224,472 entries, leaving 1,504,440 yet to classify articles. As the presented method relies on categories, 7,033 articles without any categorization are removed too, which results in a final list of 1,497,407 NEs that need to be classified in the bootstrapping process.

As explained earlier, the categories of the classified articles are used to build a NE type vector consisting of categories associated with NEs of a certain type. The categories of classified articles form the dimensions of the type vectors; their counts define the length in that dimension. The algorithm in figure 5 shows how the vector is created. Note that for the NE type vector all categories are taken into account and not just the ones pointing to NEs that were used in the initial classification step. The intuition behind this is that the aggregated categories form the footprint of a type even if not each of them points to a NE.

def compute_vector(NE_t):
    # store the vector as a dictionary: category name -> count
    category_vector = {}
    for article in NE_t:
        for c in article.categories:
            if c in category_vector:
                category_vector[c] += 1
            else:
                category_vector[c] = 1
    return category_vector

Figure 5: Python pseudocode of a function to build the category vector. The vector is stored in a dictionary where the category name is the key and the count its value.

The algorithm is applied to each NE type in NEt; the results are shown in table 3. The dimensions of the vectors in the third column show the number of unique categories. The fourth column represents the overall count of categories in the articles and the last column shows the average number of categories per article. Again we can see that PER is categorized in more detail while LOC and ORG have a similar ratio. MISC has the lowest categorization rate. We expect our method to work best with articles that are placed in many categories.

type t   NEt       dimensions   category count   categories per NEt
PER      502,173   132,098      4,037,634        7.86
LOC      41,539    35,880       228,468          5.08
ORG      128,433   72,184       694,523          4.94
MISC     47,887    33,110       229,438          4.33

Table 3: Statistics for the NE type vectors that are created for NEt.

The type of an unclassified NE article is determined by converting its categories into a vector, computing similarities to the type vectors, and assigning the type with the highest similarity score. As categories can either be present or not, the category vector of an article is binary. In order to verify the general approach we classify the NEs in two setups using different similarity measures, cosine similarity and the Dice coefficient:

cosine(\vec{x}, \vec{y}) = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^2} \cdot \sqrt{\sum_{k=1}^{n} y_k^2}}

dice(\vec{x}, \vec{y}) = \frac{2 \cdot \sum_{k=1}^{n} (weight_{x_k} \cdot weight_{y_k})}{\sum_{k=1}^{n} weight_{x_k} + \sum_{k=1}^{n} weight_{y_k}}

Cosine similarity computes the angle between the two vectors, taking only the directions of type vectors into account and not their length. Because there are no negative categorizations, the resulting similarities range between zero and one. The Dice coefficient includes the count of shared elements in relation to all elements that are not zero. It considers the weights of the vectors by multiplying the shared elements.[4] The factor 2 keeps the result range between zero and one.
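Since the article vector is binary, both measures reduce to simple sums over the article's categories; a sketch under that assumption (article_cats is a set of category names, type_vector a dictionary as built in figure 5):

from math import sqrt

def cosine(article_cats, type_vector):
    # with a binary article vector, the dot product is the sum of the
    # type-vector weights of the categories the article is placed in
    shared = sum(type_vector.get(c, 0) for c in article_cats)
    norm = sqrt(len(article_cats)) * sqrt(sum(w * w for w in type_vector.values()))
    return shared / norm if norm else 0.0

def dice(article_cats, type_vector):
    # multiplying with the binary vector selects the type-vector weights
    # of the shared categories (cf. the footnote above)
    shared = sum(type_vector.get(c, 0) for c in article_cats)
    denom = len(article_cats) + sum(type_vector.values())
    return 2.0 * shared / denom if denom else 0.0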

In the bootstrapping phase, HeiNER's unclassified NEs are classified as just described. In 10 iterations, the 10% with the highest similarity values are added to their respective set NEt and the type vectors are updated before the next 10% are classified. Figure 6 shows the process for cosine similarity and figure 7 for the Dice coefficient.
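A sketch of this loop, assuming hashable article objects with a .categories attribute and a similarity function like the ones sketched above; batch size and update policy follow the description in the text:

def bootstrap(unclassified, classified, type_vectors, similarity, iterations=10):
    # per iteration: rank all remaining articles by their best type similarity,
    # commit the top 10% of the initial pool, and refresh the type vectors
    batch = len(unclassified) // iterations
    for _ in range(iterations):
        scored = []
        for article in unclassified:
            best_type, best_score = None, -1.0
            for t, vec in type_vectors.items():
                s = similarity(article.categories, vec)
                if s > best_score:
                    best_type, best_score = t, s
            scored.append((best_score, best_type, article))
        scored.sort(key=lambda x: x[0], reverse=True)
        for score, t, article in scored[:batch]:
            classified[t].add(article)
            unclassified.remove(article)
            for c in article.categories:          # update the type vector
                type_vectors[t][c] = type_vectors[t].get(c, 0) + 1
    return classified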

For each NE type the tables list the exact counts of how many NEs were added in each of the 10 iterations. The bar plots beneath the tables visualize these data by stacking the counts of each type in every iteration. As the sum is always 10% of the initially unclassified data, the bars have the same length. The exception at iteration 10 stems from the fact that articles that do not share a category with any of the type vectors cannot be classified. The difference between the last Dice and cosine bar is a result of the different classification decisions made in the bootstrapping process.

[4] As we multiply with a binary vector we just decide whether to add the value of the non-binary vector at that position or not.

run        PER      LOC      ORG      MISC
initial    502,173  41,539   128,433  47,887
Cosine
1          3,999    120,641  23,469   1,631
2          1,216    11,456   42,997   94,071
3          1,414    56,725   38,220   53,381
4          33,664   11,763   39,064   65,249
5          50,990   10,690   17,511   70,549
6          44,166   24,131   22,569   58,874
7          14,924   39,565   33,347   61,904
8          4,482    45,417   37,201   62,640
9          3,392    38,138   38,711   69,499
10         4,057    26,395   38,719   60,913
Bootstrap  162,304  384,921  331,808  598,711
Total      664,477  426,460  460,241  646,598
Plus       32%      927%     258%     1250%

[bar plot omitted]

Figure 6: Bootstrapping using cosine similarity. The bar plot shows the visualization of the NE type classifications in the table above.

Inspecting the results, we can see that the lion's share in the first iteration in both setups is classified as LOC. This indicates that many locations were missed by the enhanced seed categories, but the type vector allowed to find the missed NEs. Following iterations do not show a bias towards LOC, which supports this analysis. Nevertheless, cosine similarity seems to be biased towards MISC because on average about 60,000 articles are added to this type per iteration, resulting in the biggest gain in 8 of the 10 iterations. This could be caused by cosine similarity's ignorance of weights in the type vector, thus preferring articles that share many categories with a type vector over articles with fewer but higher weighted categories. MISC might have thematically wide spread categories supporting that effect. However, the bias towards that type cannot solely be based on this property, because the initialized vector is the one with the least dimensions in comparison to the others.

Bootstrapping using the Dice coefficient tends to be biased towards LOC and ORG, the former showing an overall gain of 1,308 percent.[5] In four of the iterations ORG wins the majority of newly classified articles, LOC is in advantage in five of the iterations, leaving PER one major gain in the fifth run. Because the Dice coefficient takes the counts of categories into account, it is likely that the unclassified articles are placed in some of the categories that have high values for LOC and ORG.

The count of articles added to PER develops remarkably similarly for both measures. They start with few new articles in the first three iterations, rise to many more additions in steps four, five and six, and slow down again in the remaining iterations. In both cases PER eventually is the NE type with the least added articles (cf. lines "Bootstrap"), but still the biggest count when summing it up with the initial count (cf. lines "Total"). No other named entity type shows such a strong correlation between the two different similarity measures. This indicates that most of the articles were already classified in the initialization, proving the seed categories for that type to be of high quality.

In summary, both bootstrapping setups are able to classify almost all of the unclassified NEs, but differ a lot in their results, with the exception of the type PER.

4 Evaluation

Before the bootstrapping phase an evaluation set of NEs was created and excluded from the process. It consists of NEs of each type: 295 PER, 192 LOC, 110 ORG and 122 MISC entries that were annotated manually by one annotator. Both setups are evaluated by classifying the NEs in the same way as in the bootstrapping and investigating the precision of the results.

[5] This growth is narrowed a little bit by the fact that it started with the smallest count of articles.

run PER LOC ORG MISCinitial 502,173 41,539 128,433 47,887

Dice’s coefficient1 5,271 137,051 6,406 1,0122 17 25 138,578 11,1203 1,266 58,780 65,593 24,1014 36,595 16,952 56,017 40,1765 67,975 31,508 25,819 24,4386 38,196 56,745 45,219 9,5807 16,166 67,458 54,813 11,3038 8,969 67,890 52,944 19,9379 5,581 65,655 46,860 31,64410 5,751 41,301 56,864 26,323

Bootstrap 185,787 543,365 549,113 199,634Total 687,960 584,904 677,546 247,521Plus 37% 1,308% 427% 417%


Figure 7: Bootstrapping using Dice's coefficient. The bar plot visualizes the NE type classifications in the table above.

4.1 Initial type vectors

The confusion matrix in table 4 shows the results using the type vectors from the initial NE classifications. The rate of correct classifications varies from 35.25% (MISC, Dice's coefficient) to 81.02% (PER, Dice's coefficient). It is not surprising that PER is the best performing named entity type when we recall the earlier statement that articles of that type are categorized in high detail and that this NE type has by far the highest count of instances after the initialization. This is underlined by the fact that almost no instances in the other evaluation sets were classified incorrectly as a person. Consequently, there is not much confusion between persons and other NE types.



Eval. set    PER           LOC            ORG           MISC          UNCL

Cosine
PER (295)    78.64% (232)   5.76% (17)     8.47% (25)    6.44% (19)    0.68% (2)
LOC (192)     0.00% (0)    60.42% (116)   10.94% (21)    7.29% (14)   21.35% (41)
ORG (110)     0.91% (1)    15.45% (17)    67.27% (74)*   8.18% (9)     8.18% (9)
MISC (122)    0.82% (1)     8.20% (10)    38.52% (47)   37.70% (46)*  14.75% (18)

Dice's coefficient
PER (295)    81.02% (239)*  6.10% (18)     7.80% (23)    4.41% (13)    0.68% (2)
LOC (192)     0.00% (0)    64.06% (123)*   9.90% (19)    4.69% (9)    21.35% (41)
ORG (110)     1.82% (2)    19.09% (21)    64.55% (71)    6.36% (7)     8.18% (9)
MISC (122)    3.28% (4)     9.84% (12)    36.89% (45)   35.25% (43)   14.75% (18)

Table 4: Confusion matrix for the CoNLL named entity types. Members of the evaluation sets for every type were classified by computing similarities to the initialized named entity type vectors. The overall highest values (across cosine and Dice similarity) are marked with an asterisk. The percentages show the fraction of the evaluation set size given in the first column; the numbers in parentheses are the absolute counts.

Considering that 21.35% of the articles were left unclassified, only 18.23% (cosine) and 14.59% (Dice) of LOC were explicitly classified wrong. Unclassified articles occur if none of the instances in the evaluation set LOC has categories that can be found in any of the NE type vectors. This could either mean that the seed categories for this type were not chosen broadly enough or that articles of type LOC are placed in categories that are spread widely over WP's category graph and cannot be grouped easily. The bootstrapping results indicated that the former case is more likely. ORG instances are classified correctly with a chance of 67.27% (cosine) and 64.55% (Dice), leaving an error rate of 24.55% (cosine) and 27.27% (Dice). Cosine outperforms Dice's coefficient in this class.

The CoNLL definition of MISC does not seem to correspond well with WP categories. For the evaluation set of type MISC, more instances were classified as an organization than as MISC in both setups. This indicates a high probability of confusing members of MISC with other NE types, which is not that surprising, recalling that the definition of this type includes "words of which one part is a location, organization, miscellaneous or person" (Sang, 2002). Further investigation would be necessary to judge whether type overlaps are just caused by incorrect classifications or whether the articles really do belong to that class and maybe should be allowed to be classified as both, e.g. MISC and LOC. For example, a book that has a location in its title, like The Restaurant at the End of the Universe, could benefit from a double classification because, depending on the context, it may serve as one or the other.

The results of the initialization step show that in general the MUC-6 named entity types (Grishman and Sundheim, 1996) PER, LOC and ORG can be classified reasonably well with this approach, with 60.42% (LOC, cosine) as a lower and 81.02% (PER, Dice) as an upper bound. This does not work out as well for MISC, but still the lower bound of 35.25% (Dice) beats a baseline with randomly assigned types, which would result in 25% correct classifications. Thus, the initially constructed type vectors are useful for NEC of WP articles. At this point it is not possible to say which of the similarity measures returns better results.

4.2 Bootstrapping Iterations

To evaluate the iterative classification phase, we used the resulting type vectors of every step to classify the evaluation set and again analyzed the percentage of NEs that were classified correctly.6

6 Because the annotated data represent only a fraction of the whole data, we cannot provide reliable recall results.

Figure 8 shows the results per iteration for each type and setup. The continuous lines represent cosine similarity while the dashed lines represent Dice's coefficient. To see which setup works better, the lines marked with the same symbols must be compared. The lines point out how the quality of the type vectors develops.

After every iteration the type vector is refined, which should improve classifications. However, because every classification step only incorporates the best, i.e. most certain, 10% of unclassified NEs and leaves the less clear NEs unclassified (a schematic version of this loop is sketched below), the precision is expected to decrease in later iterations due to introduced noise. Thus a stable line indicates a successful approach.
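The selection loop can be sketched as follows. This is a schematic reconstruction under stated assumptions (a similarity function as in the earlier sketch, a fixed 10% acceptance fraction, and a hypothetical update_type_vector refinement step), not the authors' implementation:

    def update_type_vector(vec, cats):
        """Hypothetical refinement step: increase the weight of every
        category seen on a newly classified article."""
        for c in cats:
            vec[c] = vec.get(c, 0.0) + 1.0

    def bootstrap(unclassified, type_vectors, similarity,
                  iterations=10, fraction=0.10):
        """Schematic bootstrapping loop: per iteration, score every
        unclassified article against each NE type vector, accept only
        the `fraction` most certain assignments, and refine the type
        vectors with the categories of the newly accepted articles."""
        classified = {}  # article title -> NE type
        for _ in range(iterations):
            scored = []
            for article, cats in unclassified.items():
                best_type, best_sim = max(
                    ((t, similarity(vec, cats)) for t, vec in type_vectors.items()),
                    key=lambda pair: pair[1])
                if best_sim > 0:
                    scored.append((best_sim, article, best_type))
            scored.sort(reverse=True)  # most certain assignments first
            for _, article, ne_type in scored[:int(len(scored) * fraction)]:
                classified[article] = ne_type
                update_type_vector(type_vectors[ne_type], unclassified.pop(article))
        return classified

Plugging in dice_weighted from the earlier sketch yields the Dice setup, and cosine_sets the cosine setup.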




Figure 8: Precision of the classification for the iterations in the bootstrapping phase.


If we ignore MISC for a moment, the cosine setup shows an overall decrease in precision relative to its starting point, while the Dice setup is fairly stable or even improves. The difficulty of representing the MISC type with WP categories seems to be the reason for its different behaviour; the broad choice of categories creates the bias of the cosine method. Dice's coefficient is more robust and seems to avoid that noise, making it more suitable for the task. This can be seen after the first iteration: as discussed in section 3.3, the biggest fraction was classified as LOC. While the precision of Dice's coefficient increases by more than 10% in this iteration, the precision of the cosine setup drops by more than 5%, which implies that many NEs were classified wrongly. Finally, the best results after bootstrapping are:

• PER – Dice 78.31% (cosine: 73.22%)

• LOC – Dice 66.67% (cosine: 50%)

• ORG – Dice 74.55% (cosine: 60.91%)

• MISC – cosine 61.48% (Dice: 40.16%)

Dice's coefficient performs better than cosine similarity for three out of four NE types, which implies that taking statistical evidence into account improves the performance of the classification. The numbers indicate that cosine similarity beats Dice's coefficient at the classification of Miscellaneous only because it is biased towards that type.

5 Conclusion

In this paper we have shown a language-agnostic method to classify more than two million NEs in the multilingual lexical resource HeiNER (Wentland et al., 2008) in two steps, adhering to the CoNLL definition of NEs (Sang, 2002; Sang and Meulder, 2003) and relying on structural information only. First, we initialized 700,032 classified NEs utilizing the category system of Wikipedia, starting with a set of 132 manually annotated seed categories. As the method relies only on WP's structure, any classification task that can be represented by WP categories can be approached this way for any language available in WP. Second, the categories of these classified articles were used to create NE type vectors to classify yet unlabelled articles by computing the similarities between the vectors and the unclassified articles' categories. This was done via bootstrapping in two setups that use two similarity measures: cosine similarity and Dice's coefficient. The results were evaluated on manually annotated data and showed that the type vectors created in the initialization step easily outperform a random baseline and that the method is well suited for the NE types used in MUC-6 (Grishman and Sundheim, 1996), but that the additional CoNLL class MISC shows a gap in quality because it is harder to map to Wikipedia categories. The evaluation of the bootstrapping iterations reveals that Dice's coefficient is the better similarity measure for this particular task. This can be attributed to its property of taking the weights of the vectors' values into account, in contrast to cosine's property of only observing the angle between two vectors while ignoring their lengths. In the end, two lists of NEs were created for each of the types PER, LOC, ORG and MISC, one by cosine and one by Dice similarity. Adding NE types to HeiNER makes it a valuable resource for multilingual NERC, providing a fair amount of training material in various languages.

6 Acknowledgements

Thanks to Anette Frank for her suggestions and support for the thesis that is the basis of this paper.



References

Oliver Bender, Franz Josef Och, and Hermann Ney. 2003. Maximum entropy models for named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 148–151, Morristown, NJ, USA. Association for Computational Linguistics.

Razvan Bunescu and Marius Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), Trento, Italy, pages 9–16, April.

Christian Federmann and Sabine Hunsicker. 2011. Stochastic parse tree selection for an existing RBMT system. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 351–357, Edinburgh, Scotland, July. Association for Computational Linguistics.

Michael Fleischman and Eduard Hovy. 2002. Fine grained classification of named entities. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7, Morristown, NJ, USA. Association for Computational Linguistics.

Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference: A brief history. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), pages 466–471. http://acl.ldc.upenn.edu/C/C96/C96-1079.pdf.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3–26, January.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147, Morristown, NJ, USA.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of the Conference on Natural Language Learning.

G. Szarvas, R. Farkas, A. Kocsor, et al. 2006. A multilingual named entity recognition system using boosting and C4.5 decision tree learning algorithms. Lecture Notes in Computer Science, 4265:267.

Wolodja Wentland, Johannes Knopp, Carina Silberer, and Matthias Hartung. 2008. Building a multilingual lexical resource for named entity disambiguation, translation and transliteration. In European Language Resources Association (ELRA), editor, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May.



Author Index

Chen, John, 30

Ding, Duo, 20

Faruqui, Manaal, 25

Knopp, Johannes, 35
Knoth, Petr, 2

Majumder, Prasenjit, 25

Pado, Sebastian, 25
Peterson, Erik, 30
Petrova, Yana, 30

Reddy, Siva, 11

Sharoff, Serge, 11
Srihari, Rohini, 30

Wang, Haifeng, 1

Yang, Li, 30

Zdrahal, Zdenek, 2
Zilka, Lukas, 2
