
PORT2

Document Number: Deliverable D4.3
Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Project URL: http://www.lsi.upc.es/~nlp/meaning/meaning.html
Availability: Public
Authors: Jordi Atserias (UPC), Montse Cuadros (UPC), Eva Naqui (UPC), German Rigau (UPV/EHU)

INFORMATION SOCIETY TECHNOLOGIES

WP4-Deliverable D4.3 Version: FINAL

Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Security (Distribution level): Public
Contractual date of delivery: February 2004
Actual date of delivery: March 16, 2005
Document Number: Deliverable D4.3
Type: Report
Status & version: v FINAL
Number of pages: 62
WP contributing to the deliverable: WP4
WPTask responsible: German Rigau
Authors: Jordi Atserias (UPC), Montse Cuadros (UPC), Eva Naqui (UPC), German Rigau (UPV/EHU)

Other contributors:
Reviewer:
EC Project Officer: Evangelia Markidou
Keywords: Multilingual Central Repository, EuroWordNet, WordNet

Abstract: This document describes the third version of the Multilingual Central Repository (Mcr2) and the third Porting process (PORT2). We describe the knowledge uploaded and integrated into Mcr2, including a brief description of a general Upload/Porting architecture. Finally, we provide a full description of the third Porting process. The current version of the MCR integrates 1.642.389 unique semantic relations between concepts (ILI-records). This is one order of magnitude larger than the Princeton wordnet (138.091 unique semantic relations in WN1.6). Furthermore, the current MCR has also been enriched with 466.972 semantic properties. In fact, the resulting Mcr2 is the largest and richest multilingual lexical knowledge base ever built. In that way, the Mcr produced by Meaning is going to constitute the natural multilingual large-scale linguistic resource for a number of semantic processes that need large amounts of linguistic knowledge to be effective tools.

IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies


Contents

1 Introduction
  1.1 EuroWordNet architecture
  1.2 Meaning
  1.3 Knowledge Integration
  1.4 Uploading Process
  1.5 Integration Process
    1.5.1 Realisation
    1.5.2 Generalisation
  1.6 Porting Process

2 Software of Mcr2
  2.1 WEI: a Web Interface to access the Mcr
  2.2 APIs
  2.3 Import/Export Facilities
  2.4 Advanced Analysis module

3 Content of Mcr2
  3.1 Content of Mcr0
  3.2 Content of Mcr1
  3.3 Content of Mcr2
  3.4 Meaning Inter-Lingual-Index
    3.4.1 EuroWordNet Base Concepts
    3.4.2 EuroWordNet Top Ontology
    3.4.3 WordNet Domains
    3.4.4 Suggested Upper Merged Ontology (Sumo)
  3.5 Local wordnets
    3.5.1 Uploading Princeton WordNets
    3.5.2 WordNet 2.0
    3.5.3 eXtended WordNet
  3.6 Large collections of semantic preferences
  3.7 Instances
  3.8 VerbNet

4 Porting Process
  4.1 Uploading process
  4.2 Integration Process
    4.2.1 Cross–checking
    4.2.2 Conceptual coverage
    4.2.3 Coverage of semantic relations
    4.2.4 Realisation
    4.2.5 Generalisation
  4.3 Porting Process

5 PORT2 Results
  5.1 Final porting results

6 The future of the Mcr
  6.1 Further Uploading
    6.1.1 Improved Selectional Preferences acquired from BNC
    6.1.2 Topic Signatures
    6.1.3 Non subject/object Selectional Preferences
    6.1.4 Large collections of Sense Examples
  6.2 Further Integration
  6.3 Porting Process

7 Mcr2 examples
  7.1 The "Vaso" Example
  7.2 The "Pasta" Example
  7.3 The "Hospital" Example

8 Conclusions



1 Introduction

This document describes Mcr2, the third version of the Meaning Multilingual Central Repository. The Multilingual Central Repository (Mcr) acts as a multilingual interface for integrating and distributing all the knowledge acquired by Meaning.

1.1 EuroWordNet architecture

The Mcr follows the model proposed by the EuroWordNet project. EuroWordNet [Vossen, 1998] is a multilingual lexical database with wordnets for several European languages, which are structured as the Princeton WordNet [Fellbaum, 1998].

The Princeton WordNet contains information about nouns, verbs, adjectives and adverbs in English and is organized around the notion of a synset. A synset is a set of words with the same part-of-speech that can be interchanged in a certain context. For example, <car, auto, automobile, machine, motorcar> form a synset because they can be used to refer to the same concept. A synset is often further described by a gloss: "4-wheeled; usually propelled by an internal combustion engine". Finally, synsets can be related to each other by semantic relations, such as hyponymy (between specific and more general concepts), meronymy (between parts and wholes), cause, etc.
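The notion of a synset can be illustrated with a small data structure. The following is only a minimal sketch in Python: the class name, the field names and the identifiers `syn-car` and `syn-motor-vehicle` are assumptions made for illustration, and do not reflect the actual Mcr database schema or the Princeton file format.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of interchangeable words sharing one part of speech."""
    synset_id: str                 # hypothetical identifier, not a real offset
    pos: str                       # "n", "v", "a" or "r"
    variants: list                 # the words that form the synset
    gloss: str = ""
    relations: dict = field(default_factory=dict)  # relation name -> target ids

    def add_relation(self, relation, target_id):
        """Record a semantic relation (e.g. hypernym) to another synset."""
        self.relations.setdefault(relation, []).append(target_id)


# The example synset from the text, with its gloss.
car = Synset(
    "syn-car", "n",
    ["car", "auto", "automobile", "machine", "motorcar"],
    gloss="4-wheeled; usually propelled by an internal combustion engine",
)
# Illustrative more general concept, linked through a hypernym relation.
motor_vehicle = Synset("syn-motor-vehicle", "n", ["motor vehicle"])
car.add_relation("hypernym", motor_vehicle.synset_id)
```

Keeping relations as identifiers rather than object references is a design choice of this sketch; it makes it easy to store the same structure in a relational table.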

Figure 1 gives a schematic presentation of the EuroWordNet architecture. In the middle, the language-independent structures are given: the Ili, a Domain Ontology and a Top Concept Ontology. The Ili consists of a list of so-called Ili-records which are related to word-meanings in the local wordnets, (possibly) to one or more Top Concepts and (possibly) to domains.

Some language-independent structuring of the Ili is nevertheless provided by two separate ontologies, which may be linked to Ili records:

• the Top Concept ontology, which is a hierarchy of language-independent concepts, reflecting important semantic distinctions, e.g. Object and Substance, Location, Dynamic and Static;

• a hierarchy of domain labels, which are knowledge structures grouping meanings in terms of topics or scripts, e.g. Traffic, Road-Traffic, Air-Traffic, Sports, Hospital, Restaurant;

Both the Ontological properties and the Domain Labels can be transferred via the equivalence relations of the Ili-records to the local wordnet meanings, as is illustrated in Figure 1. The Top Concepts Location and Dynamic are for example directly linked to the Ili-record drive and therefore indirectly also apply to all language-specific concepts related to this Ili-record. Via the local wordnet relations, the Top Concept can be further inherited by all other related language-specific concepts.

The main purpose of the Top Ontology is to provide a common framework for the most important concepts in all the wordnets. It consists of 63 basic semantic distinctions that classify a set of 1.601 Ili-records representing the most important concepts in the different wordnets¹. The classification was verified by the different EuroWordNet partners, so that it holds for all the language-specific wordnets. In section 3.4.2, we will further describe the Top Ontology used in Meaning.

Figure 1: EuroWordNet architecture

The Domain Hierarchy groups concepts in a different way, based on scripts rather than classification. For instance, it groups together nouns that are not hierarchically related, such as hospital, doctor and operation, with concepts from other parts of speech, such as to operate. This is a powerful tool to control the ambiguity problem in Natural Language Processing. In section 3.4.3, we will further describe the Domain hierarchy used in Meaning.

1.2 Meaning

Meaning works with five wordnets corresponding to five European languages (Basque, Catalan, English, Italian and Spanish). The Mcr acts as the sense inventory for nouns, verbs, adjectives and adverbs for all the languages involved in the project. All these languages realise meaning in different ways, and Meaning will benefit from that because these wordnets have been constructed following the model proposed by the EuroWordNet project. That is, the wordnets are linked to an Inter-Lingual-Index (Ili). Via this index, the languages are interconnected so that it is possible to go from the words in one language to similar words in any other connected language. The Ili is a set of meanings, mainly taken from Princeton WordNet. The only purpose of the Ili is to mediate between the synsets of the local wordnets. Each synset in the local wordnets has at least one equivalence relation with a record in this Ili, either directly or indirectly via other related synsets. Language-specific synsets linked to the same Ili-record should thus be equivalent across the languages.

¹ These represent the current set of Base Concepts based on WordNet 1.6. The original set from EuroWordNet, based on WordNet 1.5, had 1.030 Base Concepts.
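The mediating role of the Ili can be sketched as a lookup through a shared record identifier. The Python fragment below is a toy illustration only: the record id `ili-car`, the dictionary layout and the function name `translate` are assumptions, and the three tiny wordnets are hand-made examples rather than data taken from the Mcr.

```python
# Toy local wordnets: each maps a hypothetical Ili record id to its variants.
LOCAL_WORDNETS = {
    "en": {"ili-car": ["car", "automobile"]},
    "es": {"ili-car": ["coche", "automóvil"]},
    "ca": {"ili-car": ["cotxe", "automòbil"]},
}


def translate(word, source, target, wordnets=LOCAL_WORDNETS):
    """Return the target-language variants linked to the same Ili records as `word`."""
    results = []
    for ili_id, variants in wordnets[source].items():
        if word in variants:
            # Follow the shared Ili record into the target wordnet.
            for variant in wordnets[target].get(ili_id, []):
                if variant not in results:
                    results.append(variant)
    return results


print(translate("car", "en", "es"))
print(translate("coche", "es", "ca"))
```

No pair of languages is linked directly; every lookup goes through the Ili record, which is why a wordnet only needs to be aligned once in order to be connected to all the others.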

The development of Meaning is organized in three consecutive cycles. Figure 2 summarises the Meaning data flow. Each Meaning development cycle consists of:

• WP6 (WSD): Word Sense Disambiguation systems (WSD0, WSD1, WSD2) using the local wordnets and the enriched knowledge ported from the Multilingual Central Repository.

• WP5 (Acquisition): Local acquisition of knowledge using specially designed tools and resources, corpora and wordnets (ACQ0, ACQ1, ACQ2).

• WP4 (Knowledge Integration): Uploading the acquired knowledge from each language into the Multilingual Central Repository and porting to the local wordnets (PORT0, PORT1, PORT2).

Meaning will have three consecutive processes for uploading and porting the knowledge acquired from each language to the respective local wordnets: PORT0, PORT1, PORT2. The knowledge acquired locally will be uploaded and ported across the rest of languages via the EuroWordNet Ili, maintaining the compatibility among them. The knowledge acquired from each language during the three cycles will be consistently uploaded into the Mcr, guaranteeing the integrity of all the data produced by the project. After each Meaning cycle, all knowledge acquired and integrated into the Mcr will then be distributed across the local wordnets.

In that way, the Ili structure (including the Top Ontology and the Domain Hierarchy) will act as a natural backbone to transfer the different knowledge acquired from each local wordnet to the rest of wordnets.

Meaning has developed the Mcr to maintain compatibility between wordnets of different languages and versions, past and new. The Ili should itself be connected to newer versions of WordNet or extensions of the Ili [Sofia et al., 2002a; Sofia et al., 2002b] using the technology for the automatic alignment of different large-scale and complex semantic networks [Daude et al., 1999; Daude et al., 2000; Daude et al., 2001] (see also Working Paper WP4.3 Making wordnets compatible).



Figure 2: MEANING data flow

1.3 Knowledge Integration

The third version of the Mcr, Mcr2, integrates five local wordnets (including five versions of the English Princeton WordNet and the eXtended WordNet), the Suggested Upper Merged Ontology, the EuroWordNet Top Ontology, WordNet Domains, and large collections of Selectional Preferences and Instances (see Working Papers WP3.3, WP4.1 and WP4.4 and deliverables D2.1 and D2.2 for a complete description of these knowledge resources). In order to carry out this integration process, several tasks have been performed:

1. the uploading,

2. the integration and finally,

3. the porting of all this knowledge to the local wordnets.

The first two tasks have been extensively described in Working Paper WP4.4 Upload 1. Here we provide a summary.



1.4 Uploading Process

Uploading all this different knowledge correctly into a single multilingual repository requires a very complex and delicate process. Once the first part is finished, that is, uploading the data released by the different partners (just checking errors and inconsistencies), a more complex second part must be performed. This second part consists of the correct integration of every piece of information into the Mcr, that is, correctly linking all this knowledge to the Ili. It involves a complex cross-checking validation process and usually a complex expansion/inference of large amounts of semantic properties and relations through the semantic structure.

Working Papers WP4.2 Upload 0 and WP4.4 Upload 1 explain in detail the two uploading processes performed in the previous two Meaning cycles. Working Paper WP4.6 Upload 2 describes the last uploading process, performed in the third Meaning cycle.

The next sections also describe other new resources which have been uploaded in this third round, e.g. Improved WordNet Domains (2nd release) [Magnini and Cavaglia, 2000], Base Concepts (2nd release) [Vossen, 1998], EuroWordNet Top Concept Ontology (2nd release) [Vossen, 1998], VerbNet [Kipper et al., 2000].

1.5 Integration Process

Once all this data is correctly uploaded into the Mcr, two different processes have been devised: realization and generalization. Both processes seem to be very promising. However, in the third round of Meaning we only performed the realization of the new version of the Top Concept Ontology (see Working Paper WP4.7 for further details). Obviously, both processes require further investigation and exploitation.

1.5.1 Realisation

Once all this data is uploaded into the Mcr, it is possible to perform a full expansion process of the Top Ontology properties through the nominal and verbal hierarchies.

Some of the selectional preferences acquired from SemCor and BNC can also be inherited through the nominal part of the hierarchy. This process also involves a heavy computational effort. In fact, some other WordNet relations can be derived in the same way (for instance, meronym relations, etc.).

By integrating this knowledge, we are making explicit all the knowledge contained in the Mcr. It would be interesting to implement some capabilities to mechanise this process. All inferred relations and knowledge can be rebuilt several times during integration without losing information or consistency.
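The top-down expansion described above can be sketched as property inheritance along hypernym links. The following Python fragment is an illustrative sketch, not the procedure implemented in the Mcr: it assumes a single hypernym per synset and an acyclic hierarchy, and the function name `realise`, the synset names and the property labels in the toy data are chosen only for the example.

```python
def realise(hypernym_of, direct_properties):
    """Expand properties top-down through a hypernym hierarchy.

    hypernym_of: maps each synset to its single hypernym (a simplification).
    direct_properties: maps a synset to the properties assigned to it directly.
    Returns a new mapping and leaves both inputs untouched, so the expansion
    can be rebuilt at any time from the direct assignments.
    """
    expanded = {}

    def properties_for(synset):
        if synset not in expanded:
            inherited = set(direct_properties.get(synset, ()))
            parent = hypernym_of.get(synset)
            if parent is not None:
                inherited |= properties_for(parent)  # inherit from the hypernym
            expanded[synset] = inherited
        return expanded[synset]

    parents = {p for p in hypernym_of.values() if p is not None}
    for synset in set(hypernym_of) | parents | set(direct_properties):
        properties_for(synset)
    return expanded


# Toy hierarchy and property assignments (illustrative only).
TOY_HYPERNYMS = {"car": "motor vehicle", "motor vehicle": "vehicle", "vehicle": "artifact"}
TOY_PROPERTIES = {"artifact": {"Artifact", "Object"}, "vehicle": {"Vehicle"}}

print(sorted(realise(TOY_HYPERNYMS, TOY_PROPERTIES)["car"]))
```

Because the inferred properties are kept apart from the direct ones, running the expansion again after a correction simply overwrites the previous result, which is the sense in which inferred knowledge can be rebuilt without losing information.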

1.5.2 Generalisation

A similar process can be devised in order to expand the knowledge in the Mcr. In this case, rather than expanding the knowledge and properties represented in the Mcr top–down, a bottom–up generalisation mechanism can be performed, in which different knowledge and properties collapse on particular Base Concepts and ontological nodes.
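One possible reading of this bottom-up mechanism is to accumulate observations made on specific synsets at their nearest Base Concept ancestor. The sketch below, in Python, follows that reading under explicit assumptions: a single hypernym per synset, an acyclic hierarchy, and count-valued observations. The function name `generalise`, the labels such as `object-of:drive` and all the figures are invented for illustration and do not come from the Mcr.

```python
from collections import Counter


def generalise(hypernym_of, base_concepts, observations):
    """Collapse per-synset observations onto the nearest Base Concept ancestor.

    observations: maps a synset to a dict of {property: count}.
    Synsets with no Base Concept above them are skipped.
    """
    collapsed = {}
    for synset, counts in observations.items():
        node = synset
        # Climb the hypernym chain until a Base Concept is reached.
        while node is not None and node not in base_concepts:
            node = hypernym_of.get(node)
        if node is None:
            continue
        collapsed.setdefault(node, Counter()).update(counts)
    return collapsed


# Toy data (illustrative only).
toy_hypernyms = {"car": "motor vehicle", "truck": "motor vehicle", "motor vehicle": "vehicle"}
toy_observations = {
    "car": {"object-of:drive": 3},
    "truck": {"object-of:drive": 2, "object-of:load": 1},
}
result = generalise(toy_hypernyms, {"vehicle"}, toy_observations)
print(dict(result["vehicle"]))
```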

1.6 Porting Process

Having all these types of different knowledge and properties completely expanded and covering the whole Mcr, a new set of inference mechanisms can be devised in order to further infer new relations and knowledge. For instance, new relations can be generated when detecting particular semantic patterns occurring for some synsets having certain ontological properties, for particular Domains, etc. That is, new relations can be generated when combining different methods and knowledge, for instance when several relations derived in the integration process have confidence scores greater than certain thresholds.

However, even without this new inference tool (i.e. without having inferred extra knowledge), all the knowledge integrated into the Mcr can be ported (distributed) to the local wordnets in this porting process. That is, this process finishes by producing export XML files for all local wordnets.

Thus, the current Mcr software includes system modules for:

• Uploading the data acquired from one language to the Mcr.

• Porting the knowledge stored into the Mcr to the local wordnets.

• Checking the integrity of the data stored in the Mcr.

The fact that word senses will be linked to concepts in the Mcr will allow for the appropriate representation and storage of the acquired knowledge. In that way, the Mcr produced by Meaning is going to constitute the natural multilingual large-scale linguistic resource for a number of semantic processes that need large amounts of linguistic knowledge to be effective tools (e.g. Web ontologies).

After this introduction to the Mcr and the knowledge integration process, Section 2 gives a general overview of the software components of the Mcr2. Section 3 summarises all knowledge uploaded into the third release of the Mcr. Section 4 summarizes the porting process and Section 5 provides some final figures of this process. Section 6 describes some plans to continue enriching the current Mcr with further knowledge. Section 7 illustrates the current content of the Mcr by providing some examples. Finally, Section 8 provides some concluding remarks.



2 Software of Mcr2

This section provides a brief summary of the different software components which are part of the Mcr. Deliverable D4.1 includes the Database Design. See also Working Paper WP4.1 Basic Design of the Multilingual Central Repository for further details. The current status of the software components is summarized in the next sections: basically, the Web Interface, the different APIs and the Import/Export and Statistical facilities. A complete description is provided in Working Paper WP4.8 MCR software.

2.1 WEI: a Web Interface to access the Mcr

The Mcr database has been implemented using MySQL. The Mcr provides a web interface to the database based on the Web EuroWordNet Interface (WEI) [Benítez et al., 1998]. The interface provides consulting and editing facilities for the data included in the Mcr. Meaning has provided a new release to access the Mcr database².

The basic aim of this tool is to provide flexible access for editing and consulting the Mcr. The Web EuroWordNet Interface (WEI) is a tool that provides the user with all the lexico-semantic information contained in all uploaded WordNets: English (versions 1.5, 1.6, 1.7, 1.7.1, 2.0), Italian, Basque, Catalan and Spanish.

Figure 3: Consulting WEI

WEI allows the user to consult the Mcr using a powerful but very intuitive user interface. WEI provides facilities for a flexible querying of the Mcr. First, the user can select how to enter the Mcr by providing a word, a variant or a synset of any wordnet uploaded into the Mcr. Then, the user must choose one of the wordnets to navigate through some of its semantic relations. Finally, the user selects which information to obtain as the result of the consultation, and from which wordnets. Figure 3 shows the typical consulting interface of WEI, accessing through the English WordNet 1.6 variant house 1 the content of all its hypernyms in the Basque, Spanish, Italian and English 2.0 WordNets, as well as all the information associated to the ILI (SUMO, Domains, Top Concept Ontology, Semantic File, Base Concept).

² http://nipadio.lsi.upc.edu/cgi-bin/wei4/public/wei.consult.perl

2.2 APIs

Three different APIs have been developed. First, a SOAP API allows any remote user to interact with the Mcr; the aim of this API is to make the Mcr as accessible as possible. Next, MCRQuery is an extension of the wnQuery Perl API for the Mcr, which allows Princeton wordnet users to migrate easily to the Mcr and thus make their applications multilingual by means of a general API. Last but not least, there is a fast C++ API for high-performance software.

The MCRQuery module allows us to easily adapt packages developed for the official Princeton wordnets, e.g. the Perl WordNet similarity package, to work with the Mcr (instead of WordNet files). WordNet similarity is a set of Perl modules that implement the semantic relatedness measures described by Leacock and Chodorow [Leacock and Chodorow, 1998], Jiang and Conrath [Jiang and Conrath, 1997], Resnik [Resnik, 1995], Lin [Lin, 1998b], Hirst and St-Onge [Hirst and St-Onge, 1998], Wu and Palmer [Wu and Palmer, 1994], the adapted gloss overlap measure by Banerjee and Pedersen [Banerjee and Pedersen, 2003], and a measure based on context vectors by Patwardhan [Patwardhan, 2003].
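To give an idea of what such relatedness measures compute, the fragment below sketches the Wu and Palmer measure, 2·depth(lcs) / (depth(a) + depth(b)), over a toy hierarchy. It is written in Python for illustration and is not the code of the Perl WordNet similarity package nor of MCRQuery; it assumes a single hypernym per synset and counts the root as depth 1, and the hierarchy itself is invented for the example.

```python
def path_to_root(synset, hypernym_of):
    """Return the chain from `synset` up to the root, both included."""
    path = [synset]
    while synset in hypernym_of:
        synset = hypernym_of[synset]
        path.append(synset)
    return path


def wu_palmer(a, b, hypernym_of):
    """Wu and Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    path_a = path_to_root(a, hypernym_of)
    path_b = path_to_root(b, hypernym_of)
    ancestors_b = set(path_b)
    # The least common subsumer is the first ancestor of `a` shared with `b`.
    lcs = next((node for node in path_a if node in ancestors_b), None)
    if lcs is None:
        return 0.0
    depth_lcs = len(path_to_root(lcs, hypernym_of))  # the root has depth 1
    return 2.0 * depth_lcs / (len(path_a) + len(path_b))


# Toy hierarchy (illustrative only).
TOY = {
    "car": "motor vehicle",
    "truck": "motor vehicle",
    "motor vehicle": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "entity",
}

print(wu_palmer("car", "truck", TOY))    # 0.75
print(wu_palmer("car", "bicycle", TOY))
```

Since the function only needs a way to climb from a synset to its hypernym, the same code could in principle sit on top of any back end that exposes that operation, which is the kind of abstraction the text attributes to MCRQuery.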

MCRQuery is also used as a common abstraction to access the Mcr or WordNet for other software developed inside the project, e.g. the ExRetriever tools [Fernandez et al., 2004] (see WP5.14 Experiment 5.F: Sense Examples (3rd round)).

2.3 Import/Export Facilities

It is not necessary to maintain the defined format of EuroWordNet, as long as a standardized format in XML is agreed with other projects involved in the development of wordnets to keep all data compatible. Meaning will set an XML standard format for wordnet data and will provide methods for integrating data from other current standards: Princeton format [Fellbaum, 1998], EuroWordNet format [Vossen, 1998], VisDic format [Sofia et al., 2002a; Sofia et al., 2002b].
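As a rough illustration of what an XML export of wordnet data might look like, the Python sketch below serialises a few synsets with the standard library. The element and attribute names (`wordnet`, `synset`, `variant`, `relation`, `id`, `pos`, `type`, `target`) are purely hypothetical: this document does not specify the Meaning XML format, and the sketch does not reproduce the Princeton, EuroWordNet or VisDic formats either.

```python
import xml.etree.ElementTree as ET


def export_synsets(synsets):
    """Serialise a list of synset dicts into a (hypothetical) XML layout."""
    root = ET.Element("wordnet")
    for synset in synsets:
        node = ET.SubElement(root, "synset", id=synset["id"], pos=synset["pos"])
        for variant in synset["variants"]:
            ET.SubElement(node, "variant").text = variant
        for relation, target in synset.get("relations", []):
            ET.SubElement(node, "relation", type=relation, target=target)
    return ET.tostring(root, encoding="unicode")


# Toy input with hypothetical identifiers.
xml_text = export_synsets([
    {
        "id": "syn-car",
        "pos": "n",
        "variants": ["car", "auto"],
        "relations": [("hypernym", "syn-motor-vehicle")],
    }
])
print(xml_text)
```

An importer for another format would only need to produce the same list of dictionaries, so conversion between formats can go through one intermediate representation.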

2.4 Advanced Analysis module

Advanced facilities will also be provided to explore the data and to analyse/mine the multilingual relations. This module will focus especially on the multilingual comparison of the data (i.e. including facilities to ease cross-lingual queries).



3 Content of Mcr2

3.1 Content of Mcr0

After PORT0, the first porting process, performed in the first cycle, Mcr0 included the following large–scale resources:

• ILI

– Aligned to WordNet 1.6 [Fellbaum, 1998]

– EuroWordNet Base Concepts [Vossen, 1998]

– EuroWordNet Top Concept Ontology [Vossen, 1998]

– WordNet Domains version 070501 [Magnini and Cavaglia, 2000]

• Local wordnets

– English WordNet 1.5, 1.6, 1.7.1 [Fellbaum, 1998]

– Basque wordnet [Agirre et al., 2002]

– Italian wordnet [Pianta et al., 2002]

– Catalan wordnet [Benítez et al., 1998]

– Spanish wordnet [Atserias et al., 1997]

• Large collections of semantic preferences

– Acquired from SemCor [Agirre and Martinez, 2001; Agirre and Martinez, 2002]

– Acquired from BNC [McCarthy, 2001]

• Instances

– Named Entities [Alfonseca and Manandhar, 2002]

See deliverable D2.1³ (Basic Design of architecture and methodologies) and Working Paper WP4.2⁴ (Upload 0) for an extended summary of each of these components and their uploading process. See deliverable D4.1⁵ (PORT0) for a detailed report of the final figures after the first porting process.

³ http://www.lsi.upc.edu/~nlp/meaning/documentation/D2.1.pdf.gz
⁴ http://www.lsi.upc.edu/~nlp/meaning/documentation/WP4.2.pdf.gz
⁵ http://www.lsi.upc.edu/~nlp/meaning/documentation/D4.1.pdf.gz



3.2 Content of Mcr1

For the second release of the Mcr, we planned to upload several new large-scale semantic resources into the Mcr (see deliverable D2.2⁶ Basic Design of architecture and methodologies (2nd round) for further details):

• Suggested Upper Merged Ontology (Sumo) [Niles and Pease, 2001]

• eXtended WordNet [Mihalcea and Moldovan, 2001]

• WordNet 2.0 [Fellbaum, 1998]

• Improved Selectional Preferences acquired from BNC [McCarthy, 2001]

• Direct dependencies from Parsed SemCor [Agirre and Martinez, 2001]

• Named Entities from Sumo [Niles and Pease, 2001]

• Named Entities from MultiWordNet [Pianta et al., 2002]

The resulting Meaning Mcr1 included:

• Ili


– EuroWordNet Base Concepts [Vossen, 1998]

– EuroWordNet Top Concept Ontology [Vossen, 1998]

– WordNet Domains version 070501 [Magnini and Cavaglia, 2000]

– Suggested Upper Merged Ontology (Sumo) [Niles and Pease, 2001]

• Local wordnets

– English WordNet 1.5, 1.6, 1.7, 1.7.1, 2.0 [Fellbaum, 1998]

– eXtended WordNet 1.7 [Mihalcea and Moldovan, 2001]




– Spanish wordnet [Atserias et al., 1997]


– Direct dependencies from Parsed SemCor [Agirre and Martinez, 2001]


⁶ http://www.lsi.upc.edu/~nlp/meaning/documentation/D2.2.pdf.gz



– Acquired from BNC (2nd release) [McCarthy, 2001]

• Large collections of Sense Examples

– SemCor

• Instances


– Named Entities [Niles and Pease, 2001]

– Named Entities [Pianta et al., 2002]

See deliverable D2.2⁷ (Basic Design of architecture and methodologies) and Working Paper WP4.4⁸ (Upload 1) for an extended summary of each of these components and their uploading process. See deliverable D4.2⁹ (PORT1) for a detailed report of the final figures after the second porting process.

3.3 Content of Mcr2

The next sections describe the new resources which have been uploaded in this third round. In this Meaning cycle we mainly uploaded and integrated new releases of previously uploaded resources (e.g. Improved WordNet Domains (2nd release) [Magnini and Cavaglia, 2000], Base Concepts (2nd release) [Vossen, 1998], EuroWordNet Top Concept Ontology (2nd release) [Vossen, 1998]). However, we have also integrated a new large-scale resource: VerbNet [Kipper et al., 2000].

Since the first version of the Mcr, we decided to integrate into the Mcr only conceptual knowledge (semantic information relating or attached to synsets). This decision had several implications. For instance, not all the large–scale knowledge acquired from WP5 ACQ has been uploaded and ported (e.g. subcategorization frequencies, topic signatures, terminology and domain information). This knowledge is maintained in the local wordnets.

After PORT2, the final content of Mcr2 should include:

• ILI


– Base Concepts (2nd release) [Vossen, 1998]

– Top Concept Ontology (2nd release) [Vossen, 1998]

– MultiWordNet WordNet Domains (2nd release) [Magnini and Cavaglia, 2000]

– Suggested Upper Merged Ontology (Sumo) [Niles and Pease, 2001]

⁷ http://www.lsi.upc.edu/~nlp/meaning/documentation/D2.2.pdf.gz
⁸ http://www.lsi.upc.edu/~nlp/meaning/documentation/WP4.4.pdf.gz
⁹ http://www.lsi.upc.edu/~nlp/meaning/documentation/D4.2.pdf.gz



• Local wordnets

– English WordNet 1.5, 1.6, 1.7, 1.7.1, 2.0 [Fellbaum, 1998]

– eXtended WordNet [Mihalcea and Moldovan, 2001]




– Spanish wordnet [Atserias et al., 2004b]


– Direct dependencies from Parsed SemCor [Agirre and Martinez, 2001]


– Acquired from BNC (2nd release) [McCarthy, 2001]

• Predicate structure

– VerbNet [Kipper et al., 2000]

• Instances


– Named Entities [Niles and Pease, 2001]

– Named Entities [Pianta et al., 2002]

See deliverable D2.3¹⁰ (Basic Design of architecture and methodologies) and Working Paper WP4.6¹¹ (Upload 2) for an extended summary of each of these components and their uploading process.

The next sections provide a short summary of each of the above Mcr components.

3.4 Meaning Inter-Lingual-Index

As in previous cycles, Meaning uses Princeton WordNet 1.6 as Ili for Mcr2. This decision minimizes the effect of porting errors when the knowledge acquired from one language is ported to other wordnets. Initially, most of the knowledge acquired has been derived from WordNet 1.6 (selectional preferences from SemCor and BNC), and the Italian WordNet and WordNet Domains, both developed at IRST, use WordNet 1.6 as Ili [Pianta et al., 2002; Magnini and Cavaglia, 2000].

However, the Ili for the Spanish, Catalan and Basque wordnets was WordNet 1.5 [Atserias et al., 1997; Benítez et al., 1998]. Meaning applied the technology for accurately mapping wordnet versions¹², but some hundreds of links received multiple choices (see Working Paper WP4.3 Making wordnets compatible for further details).

¹⁰ http://www.lsi.upc.edu/~nlp/meaning/documentation/D2.3.pdf.gz
¹¹ http://www.lsi.upc.edu/~nlp/meaning/documentation/WP4.6.pdf.gz

To solve this version gap, and in order to minimize side effects with respect to other European initiatives (Balkanet, EuroTerm, etc.) and wordnet developments around the Global WordNet Association, Meaning provided a revised version of the automatic mapping between WordNet 1.5 and WordNet 1.6.

As the new versions of the Spanish, Catalan and Basque wordnets connected to WordNet 1.6 were not error free, we suggested performing a complete revision of the variant to synset connections. This revision was performed for Spanish and reported in [Atserias et al., 2004b].

However, further research is also needed to automatically locate dubious mapping areas (see Working Paper WP4.3 Making wordnets compatible for further details).
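The situation in which a link "receives multiple choices" can be illustrated by separating confident one-to-one links from ambiguous ones. The Python sketch below is an assumption-laden illustration only: it does not reproduce the mapping technology used in Meaning, and the candidate scores, the `margin` parameter, the function name `split_mapping` and the identifiers such as `wn15-a` are all hypothetical.

```python
def split_mapping(candidates, margin=0.1):
    """Separate confident links from links that may need manual revision.

    candidates: maps a source synset id to a list of (target id, score) pairs.
    A link is confident when it has a single candidate, or when the best
    candidate beats the runner-up by at least `margin`.
    """
    confident, dubious = {}, {}
    for source, scored in candidates.items():
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        if not ranked:
            dubious[source] = []          # no candidate at all
        elif len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
            confident[source] = ranked[0][0]
        else:
            dubious[source] = [target for target, _ in ranked]
    return confident, dubious


# Toy candidates with hypothetical identifiers and scores.
toy_candidates = {
    "wn15-a": [("wn16-a", 0.9), ("wn16-x", 0.2)],
    "wn15-b": [("wn16-b1", 0.5), ("wn16-b2", 0.45)],
    "wn15-c": [("wn16-c", 0.7)],
}
confident, dubious = split_mapping(toy_candidates)
print(confident)
print(dubious)
```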

3.4.1 EuroWordNet Base Concepts

The overall design of the EuroWordNet database made it possible to develop the local wordnets relatively independently while guaranteeing a high level of compatibility. Nevertheless, some specific measures were taken to increase the compatibility of the different resources:

1. The definition of a common set of so-called Base Concepts that was used as a starting point by all the sites to develop the cores of the wordnets. Base Concepts are meanings that play a major role in the wordnets.

2. The classification of the Base Concepts in terms of the Top Ontology.

The main characteristic of the Base Concepts is their importance in wordnets. According to our pragmatic point of view, a concept is important if it is widely used, either directly or as a reference for other widely used concepts. Importance is thus reflected in the ability of a concept to function as an anchor to attach other concepts. This anchoring capability was defined in terms of three operational criteria that can be automatically applied to the available resources:

• the number of relations (general or limited to hyponymy).

• high position of the concept in a hierarchy

• being widely used by several languages

The procedure of selecting the EuroWordNet Base Concepts and the Top Ontology is discussed in [Vossen, 1998]. The final set of common Base Concepts from WordNet 1.5 has also been mapped to WordNet 1.6. We have provided each wordnet developer with a list of the Base Concepts not yet covered.

¹² http://www.lsi.upc.edu/~nlp/tools/mapping.html



We also plan to compare our results not only with our set of Base Concepts (coming from the initial set of EuroWordNet), but also with the set of Base Concepts produced by the BalkaNet Project and the pre-released Base Concepts from Princeton WordNet.

We are also planning to provide an automatically constructed set of Base Concepts. This will require finding formal criteria to detect the synsets which best represent the most important concepts of WordNet. These criteria could be based on:

• Frequency count of the synset in SemCor

• Number of descendants

• Conceptual Density

• Changes of Top Concept Ontology/Domain/SUMO properties of adjacent concepts

Table 1 shows, as an example, the possible Base Concepts that could appropriately represent the different senses of the noun “Church”. Ascending through the hypernym chain for each sense, we can locate the local maxima using different criteria, for instance the number of relations of each synset or its number of occurrences in SemCor. For church 1, the occurrence criterion would select Christianity 2, organisation 2 and group 1, while the relation criterion would select faith 3 and organisation 2. For the second sense, church 2, the occurrence criterion would select church 2, construction 3 and object 1, while the relation criterion would select church 2 and building 1. Finally, for church 3 the occurrence criterion would select service 3 and activity 1, while the relation criterion would select religious ceremony 1 and activity 1. Obviously, different criteria will select different sets of Base Concepts. However, it is important to notice that both criteria produce very similar sets of Base Concepts (in all cases, only one level of difference).
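The local-maximum selection described above can be sketched as follows. This is only an illustrative sketch, not the procedure actually implemented in Meaning: the function name `local_maxima`, the dictionary layout and the treatment of the two ends of the chain (a node needs to beat only the neighbours it actually has) are assumptions. The counts are those given in Table 1 for church 1.

```python
# Hypernym chain of church 1, ordered from the top of the hierarchy down to
# the sense itself. Counts copied from Table 1 (occurrences, relations).
CHURCH_1 = [
    {"synset": "group 1", "occ": 2338, "rel": 18},
    {"synset": "social group 1", "occ": 0, "rel": 19},
    {"synset": "organisation 2", "occ": 729, "rel": 37},
    {"synset": "establishment 2", "occ": 30, "rel": 10},
    {"synset": "faith 3", "occ": 15, "rel": 12},
    {"synset": "Christianity 2", "occ": 62, "rel": 5},
]


def local_maxima(chain, key):
    """Return the synsets whose `key` value is strictly greater than the
    value of each neighbour in the chain (assumed tie/end-point policy)."""
    selected = []
    for i, node in enumerate(chain):
        above = chain[i - 1][key] if i > 0 else float("-inf")
        below = chain[i + 1][key] if i + 1 < len(chain) else float("-inf")
        if node[key] > above and node[key] > below:
            selected.append(node["synset"])
    return selected


print(local_maxima(CHURCH_1, "occ"))  # ['group 1', 'organisation 2', 'Christianity 2']
print(local_maxima(CHURCH_1, "rel"))  # ['organisation 2', 'faith 3']
```

Under these assumptions the sketch reproduces the selections reported in the text for church 1 under both criteria.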

The new set of Base Concepts provided by this method should also be compared with those currently uploaded into the Mcr (or coming from BalkaNet or Princeton). Moreover, we must remark that the Base Concepts are the synsets that are used to assign the Top Concept Ontology properties. For instance, following SUMO, church 1 is a ReligiousOrganization+ (it also has the Top Concept Ontology properties Human+, Group+ and Function+ [footnote 13] and the WordNet Domain Religion); church 3 is a ReligiousProcess+ (having the Top Concept Ontology properties Agentive+, Cause+, Dynamic+ and Purpose= and also the WordNet Domain Religion); however, church 2 is a Building+, not a ReligiousBuilding+ (although this synset has the Top Concept Ontology properties Artifact+, Building+, Object+ and also the WordNet Domain Religion).

We also suggest further investigation into the possibility of automatically obtaining a new set of Base Concepts, of attaching new ontological properties to them and, finally, of characterizing the ontological properties currently uploaded into the Mcr as semantic roles for predicates.

In EuroWordNet, the Base Concepts were classified by the Top Ontology using 63 semantic distinctions. This ontology, which functions as a common framework for all the wordnets, is briefly described in the next section.

13 As in SUMO, = stands for an assigned label and + for an inherited label.



#occur.  #rel.  offset      synset

church 1
2,338    18     00017954-n  group 1, grouping 1
0        19     05962976-n  social group 1
729      37     05997592-n  organisation 2, organization 1
30       10     06002286-n  establishment 2, institution 1
15       12     06023733-n  faith 3, religion 2
62       5      06024357-n  Christianity 2, church 1, Christian church 1

church 2
11       14     00001740-n  entity 1, something 1
51       29     00009457-n  object 1, physical object 1
1        39     00011937-n  artifact 1, artefact 1
68       63     03431817-n  construction 3, structure 1
50       79     02347413-n  building 1, edifice 1
0        11     03135441-n  place of worship 1, house of prayer 1, house of God 1, house of worship 1
59       19     02438778-n  church 2, church building 1

church 3
25       20     00017487-n  act 2, human action 1, human activity 1
611      69     00261466-n  activity 1
2        5      00662816-n  ceremony 3
0        11     00663517-n  religious ceremony 1, religious ritual 1
243      7      00666638-n  service 3, religious service 1, divine service 1
11       1      00666912-n  church 3, church service 1

Table 1: Example of selecting Base concepts for the noun Church



3.4.2 EuroWordNet Top Ontology

The EuroWordNet Top Ontology consists of 63 higher-level concepts, excluding the top. Following [Lyons, 1977], EuroWordNet distinguished three types of entities at the first level:

• 1stOrderEntity Any concrete entity (publicly) perceivable by the senses and located at any point in time, in a three-dimensional space, e.g.: vehicle, animal, substance, object.

• 2ndOrderEntity Any Static Situation (property, relation) or Dynamic Situation, which cannot be grasped, heard, seen or felt as an independent physical thing. They can be located in time and occur or take place rather than exist, e.g.: happen, be, have, begin, end, cause, result, continue, occur.

• 3rdOrderEntity Any unobservable proposition which exists independently of time and space. They can be true or false rather than real. They can be asserted or denied, remembered or forgotten, e.g.: idea, thought, information, theory, plan.

The purpose of the EuroWordNet Top Concept Ontology was to enforce more uniformity and compatibility of the different wordnet developments. However, the EuroWordNet project only performed a complete validation of the consistency of the Top Concept Ontology for the Base Concepts.

Although the classification of WordNet is not always consistent with the Top Concept Ontology, we performed an automatic expansion of the Top Concept properties assigned to the Base Concepts. That is, we enriched the complete Ili structure with features coming from the Base Concepts by inheriting the Top Concept features following the hyponymy relationship (see Working Paper WP4.5 Towards the MEANING Top Ontology).

Assuming (as the builders of Sumo and WordNet Domains have done) that the ontological properties have been correctly assigned to particular synsets and that WordNet defines coherent ontological subsumption chains across taxonomies, an automatic process can consistently inherit all the properties through the whole hierarchy of WordNet, no matter which ontology they come from.

In Meaning we have performed an automatic expansion of the Top Concept Ontology properties assigned to the Base Concepts. That is, we enriched the complete Ili structure with features coming from the Base Concepts (Bc) by inheriting the Top Concept features following the hyponymy relationship.

This way, once the properties are exported to the Ili and inherited through the whole WordNet hierarchy, every concept in a wordnet is assigned a set of semantic features, as in the following example.



lentil 1
WD: gastronomy
SF: food
SUMO: FruitOrVegetable
TCO: Comestible; Plant
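The top–down expansion described in this section can be pictured with a minimal sketch. It is not the Meaning implementation: the function `expand_properties`, the data layout and the toy hierarchy below are hypothetical and do not reproduce the real WordNet structure. Following the convention used in this document, = marks a hand-assigned label and + an inherited one.

```python
from collections import deque


def expand_properties(hyponyms, assigned):
    """Propagate hand-assigned properties down the hyponymy links.

    hyponyms: {synset: [direct hyponyms]}
    assigned: {synset: set of hand-assigned properties}
    Returns {synset: {property: "=" (assigned) or "+" (inherited)}}.
    """
    labels = {s: {p: "=" for p in props} for s, props in assigned.items()}
    queue = deque(assigned)
    while queue:
        parent = queue.popleft()
        for child in hyponyms.get(parent, ()):
            child_labels = labels.setdefault(child, {})
            new = [p for p in labels[parent] if p not in child_labels]
            for prop in new:
                child_labels[prop] = "+"
            if new:
                # The child gained properties, so its own hyponyms
                # must be visited (again).
                queue.append(child)
    return labels


# Toy hierarchy (hypothetical, for illustration only).
HYPONYMS = {
    "plant 1": ["legume 1"],
    "food 1": ["legume 1"],
    "legume 1": ["lentil 1"],
}
ASSIGNED = {"plant 1": {"Plant"}, "food 1": {"Comestible"}}

labels = expand_properties(HYPONYMS, ASSIGNED)
print(labels["lentil 1"])
```

Because a synset may have several hypernyms, the sketch simply accumulates the properties of all of them; it performs no compatibility check on the accumulated set.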

As the classification of WordNet is not always consistent with the Top Concept Ontology, incompatibilities between properties prevented a fully automatic top–down propagation of the Top Concept Ontology properties. The resulting semi-automatic process left a number of synsets with incompatible information. Specifically:

• According to the Top Concept Ontology and its set of incompatibilities, some Top Concept Ontology properties assigned by hand appeared to be incompatible with (a) inherited information, (b) information assigned via equivalence to the Semantic Files (the Lexicographer Files of WordNet) and/or (c) other Top Concept Ontology properties assigned by hand.

• Top Concept Ontology properties, either original or inherited, are suspected of being incompatible with other ontologies currently uploaded into the Mcr.

By examining a subset of synsets, we realised that there are at least the following main sources of errors:

• Erroneous hand-made Top Concept Ontology mappings

• Erroneous statements of equivalence between Top Concept Ontology properties and Semantic Files

• Erroneous ISA links in WordNet, which cause erroneous inheritance

• Multiple inheritance within WordNet, which can cause incompatibilities in the inheritance of properties [Guarino and Welty, 2000]

The following example shows incompatible information: a 3rdOrderEntity cannot coexist with properties attributable only to Events:

00660718 process 1
WD: factotum
SF: act
SUMO: IntentionalProcess
TO: 3rdOrderEntity; Cause; Mental; Purpose
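A check of this kind can be sketched as a lookup against a list of incompatible property pairs. The pairs below are illustrative assumptions only (the full set of incompatibilities used in Meaning is not given in this document), and the name `conflicts` is hypothetical.

```python
# Hypothetical incompatibility pairs, for illustration only.
INCOMPATIBLE = {
    frozenset({"3rdOrderEntity", "Cause"}),
    frozenset({"Artifact", "Natural"}),
}


def conflicts(properties):
    """Return the incompatible pairs that are fully contained in the
    property set of a synset, as sorted tuples."""
    props = set(properties)
    return sorted(tuple(sorted(pair)) for pair in INCOMPATIBLE if pair <= props)


# Properties of 00660718 process 1 as listed in the example above.
print(conflicts(["3rdOrderEntity", "Cause", "Mental", "Purpose"]))
# [('3rdOrderEntity', 'Cause')]
```

A synset with no listed conflict yields an empty list, so the function can be run over every synset after the expansion to collect candidates for manual revision.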



3.4.3 WordNet Domains

The initial EuroWordNet design included a Domain ontology. However, only the Computer Domain was included in the EuroWordNet database.

The information brought by Domain Labels is complementary to what is already in WordNet. First of all, a Domain Label may include synsets of different syntactic categories: for instance, MEDICINE groups together senses from nouns, such as doctor and hospital, and from verbs, such as to operate.

Second, a Domain Label may also contain senses from different WordNet subhierarchies (i.e. deriving from different unique beginners or from different lexicographer files). For example, SPORT contains senses such as athlete, deriving from life form; game equipment, from physical object; sport, from act; and playing field, from location.

Meaning uses WordNet Domains [Magnini and Cavaglia, 2000], which were partially derived from the Dewey Decimal Classification [footnote 14]. WordNet Domains is a hierarchy of 165 Domain Labels associated with WordNet 1.6 synsets.

3.4.4 Suggested Upper Merged Ontology (Sumo)

Sumo [footnote 15] [Niles and Pease, 2001] is being created as part of the IEEE Standard Upper Ontology Working Group. The goal of this Working Group is to develop a standard upper ontology that will promote data interoperability, information search and retrieval, automated inference, and natural language processing. Sumo provides definitions for general purpose terms and is the result of merging different free upper ontologies (e.g. Sowa’s upper ontology, Allen’s temporal axioms, Guarino’s formal mereotopology, etc.). There is a complete set of mappings from WordNet 1.6 synsets to Sumo: nouns, verbs, adjectives, and adverbs.

Sumo consists of a set of concepts, relations, and axioms that formalize an upper ontology. An upper ontology is limited to concepts that are meta, generic, abstract or philosophical, and hence are general enough to address (at a high level) a broad range of domain areas. Concepts specific to particular domains are not included in the upper ontology, but such an ontology does provide a structure upon which ontologies for specific domains (e.g. medicine, finance, engineering, etc.) can be constructed.

The current version of Sumo consists of 1,019 terms (all of them connected to WordNet 1.6 synsets), 4,181 axioms and 822 rules.

We think that further investigation is needed to compare Sumo and the EuroWordNet Top Concept Ontology. For instance, the typology of processes in Sumo was inspired by Beth Levin’s well-received work entitled “Verb Classes and Alternations”. Among other things, this work attempts to classify over 3,000 English verbs into 48 “semantically coherent verb classes”. Some of the verb classes relate to static predicates in the ontology rather than to processes, and some classes are syntactically motivated, e.g. the class of verbs that take predicative complements.

14 http://www.oclc.org/dewey
15 http://ontology.teknowledge.com/



Currently only the SUMO labels and the SUMO ontology hyperonym relations are loaded into the Mcr. We also performed a preliminary cross–checking process between the Top Concept Ontology expansion, the Domain ontology and the SUMO ontology (see Working Papers WP4.5, Towards the MEANING Top Ontology and WP4.7, The MEANING Top Ontology for further details).

3.5 Local wordnets

In Working Paper WP3.3 we describe extensively the initial coverage of the Meaning wordnets before the first uploading. In Working Papers WP4.2 Upload 0, WP4.4 Upload 1 and WP4.6 Upload 2, we describe extensively the current coverage of the Meaning wordnets after uploading the different local wordnets in the different cycles.

New versions of the local wordnets for Spanish, Catalan, Basque and Italian have been integrated in Mcr2. The figures for the number of synsets, variants and relations are similar to those of the previous versions initially integrated into Mcr0. During the last cycle, all local wordnets have been improved and enriched.

3.5.1 Uploading Princeton WordNets

The current version of the Mcr contains most of the information represented in the Princeton WordNets.

The main changes made when uploading the Princeton WordNets to the Mcr are:

• Satellite Adjectives are coded as adjectives (i.e. Part of Speech s is converted to a).

• WordNet Verbal Frames are not loaded.

Uploading local wordnets not based on the Ili from WordNet 1.6 is complex because, between different wordnet versions, synsets can be split (1:N), joined (N:1), added (0:1) or deleted (1:0). Thus, even if we manually check these connections, the information inside the synsets should be modified accordingly for the remaining cases of split or joined synsets. At the moment, no manual checking of the mappings is planned for Princeton WordNets 1.7, 1.7.1 and 2.0.
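The four situations listed above can be told apart by counting how many links each synset has in a version-to-version mapping. The following is a hedged sketch only: the function `classify_mapping`, its input format (pairs of old and new synset identifiers) and the identifiers `s1`…`t5` are hypothetical, not the format of the Meaning mapping tools.

```python
from collections import defaultdict


def classify_mapping(pairs, old_synsets, new_synsets):
    """Label every old synset as unchanged, split, joined or deleted, and
    list the new synsets that have no source (added).

    A synset that is both split and joined is reported as split here.
    """
    targets = defaultdict(set)  # old synset -> new synsets it maps to
    sources = defaultdict(set)  # new synset -> old synsets mapped onto it
    for old, new in pairs:
        targets[old].add(new)
        sources[new].add(old)

    labels = {}
    for old in old_synsets:
        mapped = targets.get(old, set())
        if not mapped:
            labels[old] = "deleted (1:0)"
        elif len(mapped) > 1:
            labels[old] = "split (1:N)"
        elif len(sources[next(iter(mapped))]) > 1:
            labels[old] = "joined (N:1)"
        else:
            labels[old] = "unchanged (1:1)"
    added = sorted(n for n in new_synsets if n not in sources)
    return labels, added


PAIRS = [("s1", "t1"), ("s2", "t2"), ("s2", "t3"), ("s3", "t4"), ("s4", "t4")]
labels, added = classify_mapping(
    PAIRS, ["s1", "s2", "s3", "s4", "s5"], ["t1", "t2", "t3", "t4", "t5"]
)
print(labels)
print(added)
```

Only the split and joined cases require the content of the synsets to be revised; the sketch merely identifies them.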

Mcr2 contains five versions of the Princeton WordNet, and the first version of the eXtended WordNet (aligned to WN1.7). Table 2 shows the overall figures for all these WordNets.

3.5.2 WordNet 2.0

WordNet 2.0 deserves special attention: it includes more than 42,000 new links between nouns and verbs that are morphologically related, a topical organization for many areas that classifies synsets by category, region, or usage, gloss and synset fixes, and new terminology, mostly in the terrorism domain.



                  nouns   verbs   adj     adv    #synsets  #relations
WordNet1.5        51,253  8,847   13,460  3,145  76,705    103,445
WordNet1.6        66,025  12,127  17,915  3,575  99,642    138,741
WordNet1.7        74,488  12,754  18,523  3,612  109,377   151,546
eXtended WordNet  74,488  12,754  18,523  3,612  109,377   551,551
WordNet1.7.1      75,804  13,214  18,576  3,629  111,223   153,781
WordNet2.0        79,689  13,508  18,563  3,664  115,424   204,074

Table 2: Main figures for the English WordNets

In this version, the Princeton team has added links for derivational morphology between nouns and verbs. Furthermore, some synsets have also been organized into topical domains. Domains are always noun synsets; however, synsets from every syntactic category can be connected to them. Each domain is further classified as a category, region, or usage.

In order to upload WordNet 2.0, it has been necessary to represent its new relation types (category, region, usage, related to) and their inverses (category term, region term, usage term) in the Mcr.
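As a minimal sketch of what representing a relation together with its inverse involves, the snippet below closes a set of links under the inverses named above. The identifier spellings, the triple layout and the function `with_inverses` are hypothetical; treating related to as its own inverse is an assumption, since no separate inverse is listed for it.

```python
# Relation names follow the text; the underscore spellings are hypothetical.
INVERSE = {
    "category": "category_term",
    "region": "region_term",
    "usage": "usage_term",
    "related_to": "related_to",  # assumption: treated as symmetric
}


def with_inverses(relations):
    """Return the relations plus one inverse link for each of them."""
    closed = set(relations)
    for source, relation, target in relations:
        closed.add((target, INVERSE[relation], source))
    return closed


LINKS = {("play#3", "category", "music#1")}
print(sorted(with_inverses(LINKS)))
```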

3.5.3 eXtended WordNet

In the eXtended WordNet [footnote 16] [Mihalcea and Moldovan, 2001] the WordNet glosses are syntactically parsed, transformed into logic forms and the content words are semantically disambiguated. The key idea of the eXtended WordNet project is to exploit the rich information contained in the definitional glosses, which is now used primarily by humans to identify correctly the meaning of words. In the first released version of the eXtended WordNet, XWN 0.1, the glosses of WordNet 1.7 are parsed and transformed into logic forms, and the senses of the words are disambiguated. Because they are derived by an automatic process, the disambiguated words in the glosses are assigned a confidence label indicating the quality of the annotation (gold, silver or normal). The quality of the relations derived from XWN has been taken into account when uploading XWN into the Mcr2: first, by associating different confidence scores with the relations according to their quality (gold 1, silver 0.6, and normal 0.3); second, by associating different acquisition methods with the relations (xg, xs and xn respectively).
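The assignment of scores and methods described above amounts to a small lookup table. In the sketch below only the three quality labels, the three scores and the three method codes come from the text; the function `to_mcr_relation`, the record layout and the relation name `gloss` are hypothetical.

```python
# Quality label -> (confidence score, acquisition method), as stated above.
XWN_QUALITY = {
    "gold": (1.0, "xg"),
    "silver": (0.6, "xs"),
    "normal": (0.3, "xn"),
}


def to_mcr_relation(source, target, quality):
    """Build a relation record (hypothetical layout) for one
    disambiguated gloss word."""
    confidence, method = XWN_QUALITY[quality]
    return {
        "source": source,
        "target": target,
        "relation": "gloss",  # hypothetical relation name
        "confidence": confidence,
        "method": method,
    }


print(to_mcr_relation("synset-a", "synset-b", "silver"))
```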

In order to upload the eXtended WordNet coherently into the Mcr, we also needed to upload WordNet 1.7 (integrated in the second cycle, Mcr1) and to build a new mapping between WordNet 1.6 and WordNet 1.7 [footnote 17].

We think that further investigation is also needed with respect to these resources, for instance, trying to derive automatically disambiguated semantic relations between synset glosses [Gangemi et al., 2003].

We discovered several problems when uploading eXtended WordNet 1.0, for instance the non-standard normalization of the variants (capital letters, the use of the character – for compound words instead of white space, etc.). After applying a set of case-insensitive, space-substitution heuristics, 1,755 non-existing senses remain. Sometimes the lemma exists in WordNet with a different PoS; e.g. enrolling exists only as a verb in WordNet 1.7, not as a noun.

16 http://xwn.hlt.utdallas.edu/
17 http://www.lsi.upc.edu/~nlp/tools/mapping.html

As further work, we could upload and integrate the newest version (2.0.1) of the eXtended WordNet, which is aligned to Princeton WordNet 2.0. We are also studying the possibility of uploading our own versions of the eXtended WordNet in the near future (see Working Paper WP6.14 Experiment 6.L: Disambiguating WN Glosses).

3.6 Large collections of semantic preferences

Three large sets of selectional preferences were already uploaded into Mcr0 (see Deliverable D4.1 PORT0).

A total of 958,377 weighted Selectional Preferences (SPs) obtained from three differentcorpora and using different approaches have been uploaded into the Mcr.

The first set [McCarthy, 2001] of weighted SPs was obtained by computing probability distributions over the Wn1.6 noun hierarchy derived from the result of parsing the BNC. This set totals 707,618 semantic relations. Part of these relations correspond to role–agent–bnc (115,542) and role–patient–bnc (95,065) in tables 9 and 10 respectively. The rest (497,011 relations) have been integrated into the Mcr as simple ROLE relations.

The second set [Agirre and Martinez, 2002] was obtained from generalizations of grammatical relations extracted from Semcor. This set totals 203,546 semantic relations. These relations correspond to role–agent–semcor (69,840) and role–patient–semcor (110,102) in tables 9 and 10 respectively.

The third set of Selectional Preferences also comes from SemCor, which has been parsed using a new version of Minipar [Lin, 1998a]. All the subject and object syntactic dependencies between head synsets can be captured and uploaded into the Mcr. This resource allows direct comparisons between word instances and the generalized Selectional Preferences (captured from SemCor and the BNC). Unlike the other two sets, the syntactic dependencies captured by this process are not generalised. This set totals 23,609 semantic relations corresponding to direct subject and object relations coming from the Minipar output. These relations correspond to role–agent–semcor2 (10,196) and role–patient–semcor2 (13,408) in tables 9 and 10 respectively.

The SPs were included in the Mcr as ROLE noun–verb relations [footnote 18]. Although we can distinguish subjects and objects, all of them have been included as a more general ROLE relation.

3.7 Instances

The Mcr2 contains three different sources of named entities and instances:

• 6,961 Named Entities from the work of [Alfonseca and Manandhar, 2002]

• 5,561 Named Entities from Sumo [Niles and Pease, 2001]

• 4,097 Named Entities from MultiWordNet [Pianta et al., 2002]

18 In EuroWordNet, INVOLVED and ROLE relationships are defined symmetrically.

For future versions of the Mcr, we suggest providing a new ontology of Named Entities to support and cover the formal criteria followed by the three approaches. This initiative would be very useful when comparing Named Entities derived using different language processors.

3.8 VerbNet

VerbNet [footnote 19] [Kipper et al., 2000] is a verb lexicon with syntactic and semantic information for English verbs, using Levin verb classes to systematically construct lexical entries. For each syntactic frame in a verb class, there is a set of semantic predicates associated with it. Many of these semantic components are cross-linguistic. The lexical items in each language form natural groupings based on the presence or absence of semantic components and the ability to occur or not occur within particular syntactic frames. The English entries are mapped directly onto English WordNet senses. We hope that this new resource will provide further structure and consistency to the selectional preferences acquired automatically.

19 http://www.cis.upenn.edu/old/verbnet/home.html



4 Porting Process

This section summarises PORT2, which includes the uploading, integrating and porting processes.

4.1 Uploading process

Working Paper WP4.6 Upload 2 provides an extended report of the third upload process. In this round, we uploaded again the new versions available for the Spanish, Catalan and Basque wordnets. Apart from these, new resources have been uploaded in this third round, e.g. Improved WordNet Domains (2nd release) [Magnini and Cavaglia, 2000], Base Concepts (2nd release) [Vossen, 1998], EuroWordNet Top Concept Ontology (2nd release) [Atserias et al., 2004a] and VerbNet [Kipper et al., 2000].

4.2 Integration Process

In Mcr2 we have integrated five different versions of the Princeton WordNet. These WordNets are quite similar, and Meaning has the technology for accurately mapping wordnet versions, making them compatible and thus allowing the reuse of many other valuable semantic resources [footnote 20]. Nevertheless, the Mcr2 has some hundreds of links with multiple choices requiring manual verification (see also Working Paper WP4.3 Making wordnets compatible).

In the second cycle we studied the impact of the transformation of the other WordNet versions to WordNet 1.6 [Atserias et al., 2004b]. We need to explore new techniques to detect automatically the most problematic cases.

This section also provides a partial description of all the knowledge uploaded and integrated into the Mcr2 (see Working Paper WP4.6 Upload 2 for an extended report). This gives some preliminary information useful for the third PORTing. As the overlap between wordnets is a crucial issue when porting the knowledge acquired from one language to another, we provide some figures regarding the number of common synsets (local wordnets are far from the coverage of the Princeton English WordNet), common relations and different views of all the knowledge uploaded (WordNet Domains, WordNet Semantic Files, Base Concepts, EuroWordNet Top Ontology) to measure the qualitative overlap of the local wordnets.

4.2.1 Cross–checking

We think that further investigation is also needed with respect to the resources currently integrated into the Mcr, for instance, trying to derive automatically disambiguated relations between synset glosses. In fact, after uploading all these new resources to the Mcr, a new set of complex integration and porting processes must be studied. This will allow designing sophisticated strategies and metarules for subsequent portings. Obviously, once all these large–scale semantic resources are integrated into a single platform, a complete cross-checking study can be performed. For instance, we can improve both the Sumo labels and the WordNet Domains by simply merging and comparing them.

20 http://www.lsi.upc.edu/~nlp/tools/mapping.html

Synset     Word        SUMO       Domain
00536235n  blow        Breathing  anatomy
00005052v  blow        Breathing  medicine
00003142v  exhale      Breathing  medicine
00899001a  exhaled     Breathing  factotum
00263355a  exhaling    Breathing  factotum
00536039n  expiration  Breathing  anatomy
02849508a  expiratory  Breathing  anatomy
00003142v  expire      Breathing  medicine
02579534a  inhalant    Breathing  anatomy
00536863n  inhalation  Breathing  anatomy
00003763v  inhale      Breathing  medicine
00898664a  inhaled     Breathing  factotum
00263512a  inhaling    Breathing  factotum
00537041n  pant        Breathing  anatomy
00004002v  pant        Breathing  medicine
00535106n  panting     Breathing  anatomy
00264603a  panting     Breathing  factotum
00411482r  pantingly   Breathing  factotum

Table 3: Sumo vs. Domain labels

To illustrate how we can detect errors and inconsistencies between different types of knowledge, the example in table 3 shows that, systematically, the nouns corresponding to the Sumo process Breathing have been labelled with the ANATOMY domain, some verbs with MEDICINE and some adjectives with FACTOTUM, when in fact all these senses correspond to different parts of speech of the same Breathing concept.
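This kind of cross-check can be sketched by grouping the domain labels found under each Sumo label and recording the parts of speech involved. The sketch is illustrative only: the function `inconsistent_labels` is hypothetical, and the rows are a subset of Table 3, with the part of speech read from the last character of the synset identifier.

```python
from collections import defaultdict

# (synset, word, Sumo label, domain label): a subset of the rows of Table 3.
ROWS = [
    ("00536235n", "blow", "Breathing", "anatomy"),
    ("00005052v", "blow", "Breathing", "medicine"),
    ("00899001a", "exhaled", "Breathing", "factotum"),
    ("00536039n", "expiration", "Breathing", "anatomy"),
    ("00003763v", "inhale", "Breathing", "medicine"),
    ("00411482r", "pantingly", "Breathing", "factotum"),
]


def inconsistent_labels(rows):
    """Return {sumo: {domain: [parts of speech]}} for every Sumo label whose
    synsets carry more than one domain label."""
    table = defaultdict(lambda: defaultdict(set))
    for synset, _word, sumo, domain in rows:
        table[sumo][domain].add(synset[-1])  # last character encodes the PoS
    return {
        sumo: {domain: sorted(pos) for domain, pos in domains.items()}
        for sumo, domains in table.items()
        if len(domains) > 1
    }


print(inconsistent_labels(ROWS))
```

A Sumo label reported here is only a candidate for revision: several domain labels under one Sumo label may be legitimate in other cases.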

In order to illustrate the kind of problems we need to face when merging all these semantic resources into a single, common platform, consider the example shown in figure 4. The act playing#1, which is a kind of musical performance#1, is connected by derivational morphological relations to three senses of the verb play. The verb play#3 is connected by a domain relation to the noun music#1, and the verb play#7 is connected to music#1 and music#3. However, play#6, also related to the musical domain, is connected by a domain relation to neither music#1 nor music#3. All three senses of the verb play have the WN Domain MUSIC label and the Sumo music label. However, each verb sense of play behaves differently with respect to category relations. Should the noun playing also be connected by a category relation to both music#1 and music#3? Should this connection be made explicit? Regarding WN Domain labels, why do the musical senses of the verb play and the noun music not also have the FREE TIME label, as the noun act playing does? With respect to Sumo, why do they have different types? Furthermore, since the eXtended WordNet is the result of an automatic process, it also contains wrong disambiguations (play#4 belonging to the THEATRE domain). We think that having all these different sources of knowledge uploaded and integrated into the same framework will allow us to systematically resolve these inconsistencies.

[Figure 4: graph of synsets linked by edges labelled RELATED-TO, CATEGORY and GLOSS. Nodes shown in the figure:]

00093905n playing: the act of playing a musical instrument. DOMAIN free_time music. SUMO &%RecreationOrExercise+

00092967n musical performance: the act of performing music. DOMAIN free_time music. SUMO &%RecreationOrExercise+

01675975v play#3: play on an instrument; "The band played all night long". DOMAIN music. SUMO &%music+

01677078v play#7: perform music on a musical instrument; "He plays the flute"; "Can you play on this old recorder?". DOMAIN music. SUMO &%music+

01675975v play#6, spiel#1: re-play (as a melody); "Play it again, Sam"; "She played the third movement very beautifully". DOMAIN music. SUMO &%music+

06591368n music#1: an artistic form of auditory communication incorporating instrumental or vocal tones in a structured and continuous manner. DOMAIN music. SUMO &%music+

00515842n music#3: musical activity (singing or whistling etc.); "his music was his central interest". DOMAIN music. SUMO &%music+

01670298v act#3, play#4, represent#10: play a role or part; "Gielgud played Hamlet"; ... DOMAIN theatre. SUMO &%Pretender+

Figure 4: Example of noun playing

On the other hand, regarding the top–down expansion of the properties of the EuroWordNet Top Ontology through WordNet (see Working Papers WP4.5 and WP4.7), problematic cases can be detected by cross-checking the different resources in the Mcr.

There are attributes of the Top Concept Ontology that cannot be inferred top–down from the hand-made assignments. Another way of automatically enriching a wordnet with more attributes is to use the semantic file of the synset. For example, the synset 10960967-n first half only has the attribute Part, but its semantic file is noun.time. Thus the associated Top Concept Ontology property Time could be added (note also that its Sumo label is TimeInterval+).
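The enrichment step just described can be sketched as a lookup from semantic file to Top Concept Ontology property. Only the pair noun.time → Time is taken from the text; the function `enrich` and the shape of the mapping table are hypothetical, and a real table would need one entry per semantic file that has a Top Concept Ontology equivalent.

```python
# Semantic file -> Top Concept Ontology property.
# Only this single pair is stated in the text.
SEMANTIC_FILE_TO_TCO = {"noun.time": "Time"}


def enrich(properties, semantic_file):
    """Return the property set extended with the property associated with
    the synset's semantic file, if one is known."""
    enriched = set(properties)
    extra = SEMANTIC_FILE_TO_TCO.get(semantic_file)
    if extra is not None:
        enriched.add(extra)
    return enriched


# 10960967-n first half: attribute Part, semantic file noun.time.
print(sorted(enrich({"Part"}, "noun.time")))  # ['Part', 'Time']
```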

The problems/inconsistencies found can be classified into:

• WordNet hierarchy

The classification of Wn is not always consistent with the Top Concept Ontology.

– Animal vs. Plant

00911639n phytoplankton 1 (SUMO.Plant+) and its direct descendant 00911809n planktonic algae 1 (SUMO.Alga).

– Substance (Liquid, Solid, Gas) vs. Object

For instance, body part 1 is an Object. However, some of its descendants have incompatible properties:

∗ Liquid 04195761n 105 liquid body substance 1 bodily fluid 1 body fluid 1

∗ Substance 4086329n 117 body substance 1 the substance of the body

∗ Solid 06672286n covering 1 natural covering 1 cover 5 any covering for the body or a body part

• Cross-checking resources with different granularities

For instance, the division between Human–Creature–Animal–Hominid:

– Human vs Animal: All the Hominids are considered Animal by the semantic file, but Human by the Top Concept Ontology (SUMO Hominid+)

– Human vs Creature: All the creatures (mainly the descendants of imaginary being 1 imaginary creature 1) are classified as Human by the semantic file.

• Multiple inheritance: piece of leather 1

WordNet is not a tree, and a synset can have more than one direct ancestor. Thus, it can inherit attributes from its multiple ancestors. Figure 5 shows the complicated multiple inheritance of piece of leather (at the top), which inherits Living (which is obviously incorrect) and Natural Attribute (which could be questionable) from skin, but also Part from its other ancestors. Multiple inheritance can also bring up a new type of problem. Figure 6 shows another example where multiple inheritance leads to incompatible inherited attributes: Artifact from artifact 1 and Natural from organic compound 1.



[Figure 5: hypernymy graph of piece_of_leather#1; all edges are has_hyperonym links. Nodes shown in the figure, with their assigned Top Concept Ontology properties:]

03120175-n piece_of_leather#1
03119215-n piece#1
10580693-n leather#1
03090721-n part#4,portion#2: Part=
00009457-n object#1,physical_object#1: Natural=, Object=
00001740-n entity#1,something#1
10579741-n animal_skin#1
04068217-n skin#1,tegument#1,cutis#1: Covering=, Living=, Part=, Solid=
10537753-n animal_product#1
04067708-n body_covering#1: Covering=, Living=, Part=, Solid=
04103288-n connective_tissue#1
06672286-n cover#5,covering#1,natural_covering#1: Covering=, Natural=, Object=
00010123-n natural_object#1: Natural=, Object=
04087907-n animal_tissue#1
04087702-n tissue#1: Living=, Part=, Solid=
04058532-n body_part#1: Living=, Part=
06684175-n part#7,piece#3: Part=
10577352-n animal_material#1
10446867-n material#1,stuff#1: Substance=
00010572-n matter#3,substance#1: Substance=

Figure 5: Multiple inheritance Example



[Figure 6: hypernymy graph of atropine#1; all edges are has_hyperonym links. Nodes shown in the figure, with their assigned Top Concept Ontology properties:]

02221884-n atropine#1
02198410-n antispasmodic#1,spasmolytic#1,antispasmodic_agent#1
10548994-n alkaloid#1
02981307-n medicine#2,medication#1,medicament#1,medicinal_drug#1: Function=
02609065-n drug#1: Substance=
00011937-n artifact#1,artefact#1: Artifact=, Object=
00009457-n object#1,physical_object#1: Natural=, Object=
00001740-n entity#1,something#1
10560207-n organic_compound#1: Natural=, Substance=
10630741-n compound#2,chemical_compound#1: Substance=
00010572-n matter#3,substance#1: Substance=

Figure 6: Multiple inheritance Example



4.2.2 Conceptual coverage

Tables 4, 5, 6 and 7 show the overlap for nouns, verbs, adjectives and adverbs between each wordnet pair.

At the synset level, the noun overlap is quite high and homogeneous between wordnet pairs. The maximum overlap occurs between English and Spanish (38,023) and the lowest between Italian and Catalan (16,360).

       en16    spwn    itwn    cawn    bawn    Total
en16   -       38,023  23,641  32,376  27,390  66,025
spwn   -       -       18,681  32,017  23,837  43,367
itwn   -       -       -       16,360  17,287  26,475
cawn   -       -       -       -       20,807  33,042
bawn   -       -       -       -       -       27,439
Total  66,025  43,367  26,475  33,042  27,439  -

Table 4: Noun overlapping between wordnet pairs
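Figures of this kind can be obtained by intersecting, for each pair of wordnets, the sets of Ili records they cover. The snippet below is a sketch under that assumption; the function `pairwise_overlap` and the toy identifiers are hypothetical and do not correspond to real Ili records.

```python
from itertools import combinations


def pairwise_overlap(wordnets):
    """Return {(a, b): number of shared identifiers} for every pair of
    wordnets, each given as a set of Ili identifiers."""
    return {
        (a, b): len(wordnets[a] & wordnets[b])
        for a, b in combinations(wordnets, 2)
    }


# Toy data (hypothetical identifiers).
TOY = {
    "en16": {"ili-1", "ili-2", "ili-3", "ili-4"},
    "spwn": {"ili-1", "ili-2", "ili-3"},
    "itwn": {"ili-2", "ili-4"},
}

overlap = pairwise_overlap(TOY)
print(overlap)
```

Running the same function separately on the noun, verb, adjective and adverb synsets of each wordnet would give one upper-triangular table per part of speech.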

For verbs, at the synset level, the overlap is also quite high but less uniform between wordnet pairs. The maximum overlap also occurs between English and Spanish (8,830) and the lowest between Italian and Basque (1,977).


       en16    spwn    itwn    cawn    bawn    Total
en16   -       8,830   4,434   5,897   3,290   12,127
spwn   -       -       3,878   5,144   3,217   9,043
itwn   -       -       -       2,948   1,977   4,493
cawn   -       -       -       -       3,130   5,907
bawn   -       -       -       -       -       3,290
Total  12,127  9,043   4,493   5,907   3,290   -

Table 5: Verb overlapping between wordnet pairs

At the synset level, the adjective overlap is not high because some wordnets provide poor coverage of adjectives. While Spanish provides good overlap with English (the maximum overlap, with 14,667 synsets), the Basque wordnet only provides about a hundred adjectives.

       en16    spwn    itwn    cawn    bawn    Total
en16   -       14,667  2,998   4,773   100     17,915
spwn   -       -       2,653   4,735   98      14,941
itwn   -       -       -       904     24      3,034
cawn   -       -       -       -       60      4,773
bawn   -       -       -       -       -       100
Total  17,915  14,941  3,034   4,773   100     -

Table 6: Adjective Overlapping between wordnet pairs

At the synset level, the adverb overlap is not high because some wordnets provide poor coverage of adverbs. While Italian provides good overlap with English (the maximum overlap, with 1,093 synsets), the Catalan and Basque wordnets do not provide adverbs at all.

       en16    spwn    itwn    cawn    bawn    Total
en16   -       42      1,093   0       0       3,575
spwn   -       -       16      0       0       42
itwn   -       -       -       0       0       1,094
cawn   -       -       -       -       0       0
bawn   -       -       -       -       -       0
Total  3,575   42      1,094   0       0       -

Table 7: Adverb Overlapping between wordnet pairs

4.2.3 Coverage of semantic relations

We describe in this section the results of performing some basic comparisons between all wordnets currently integrated into the Mcr. This also provides some preliminary information useful for the third PORTing. In particular, we compare the current coverage of the different semantic relations across all wordnets.

Table 8 summarises the overlapping relations between all wordnets. The local wordnets developed following the EuroWordNet framework (Basque, Spanish and Catalan) share the same amount of relations. Thus, we use Spanish as the model to compare with the Italian WordNet and the English Princeton WordNet. We can also see that the wordnets derived from EuroWordNet represent much richer information (they present a large variety of semantic relations) than those derived from WordNet.

Working Paper WP4.6 Upload 2 provides further analysis of the other knowledge uploaded and integrated into Mcr2 (mainly Domains, Semantic Files, Top Ontology, SUMO, etc.).

4.2.4 Realisation

During the first round we performed a Realization process, expanding all the Top Concept Ontology properties following the WordNet hierarchies. We are now producing a new and consistent version of the Top Concept Ontology (see Working Papers WP4.5 Towards the MEANING Top Ontology and WP4.7 The MEANING Top Ontology). No further Realization was planned for this round.

Once a new version of the Base Concepts and the associated Top Concept Ontology properties was available, the Mcr performed a full expansion process through the nominal and verbal hierarchies.

Some of the selectional preferences acquired from SemCor and BNC can also be inherited through the nominal part of the hierarchy. This process also involves a heavy computational effort. In fact, some other WordNet relations can be derived in the same way (for instance, meronym relations).

By expanding this knowledge, we make explicit all the knowledge contained in the Mcr. The Mcr should implement some capabilities to mechanise this process: it should be possible to rebuild all inferred relations and knowledge several times during the integration process without losing information or consistency.

We started from the version integrated in Mcr1, which had 2.696 TCO features expanded by inheritance to 253.003 features. At this moment, we have reached the figure of 2.756 hand-coded features, which expand to 276.384. Moreover, 52 blocking points have been set.
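The expansion by inheritance described above can be sketched as a top-down propagation of hand-coded features along hyponymy links, stopping wherever a blocking point has been set. The sketch below is only illustrative: the data layout, the function name and the modelling of a blocking point as a (synset, feature) pair are assumptions, not the actual Mcr implementation.

```python
def expand_features(hyponyms, hand_coded, blocked=frozenset()):
    """Propagate hand-coded features top-down through a hyponymy graph.

    hyponyms:   {synset: [direct hyponyms]}
    hand_coded: {synset: {features}}
    blocked:    (synset, feature) pairs where inheritance stops (assumed model)
    """
    expanded = {s: set(f) for s, f in hand_coded.items()}
    # Work list of (synset, feature) pairs still to be pushed downwards.
    stack = [(s, f) for s, feats in hand_coded.items() for f in feats]
    while stack:
        synset, feature = stack.pop()
        for child in hyponyms.get(synset, ()):
            if (child, feature) in blocked:
                continue  # blocking point: do not inherit below this node
            if feature not in expanded.setdefault(child, set()):
                expanded[child].add(feature)
                stack.append((child, feature))
    return expanded


# Toy hierarchy with multiple inheritance (hypothetical labels).
hyponyms = {
    "entity": ["object", "substance"],
    "object": ["artifact"],
    "artifact": ["drug"],
    "substance": ["drug"],
}
hand_coded = {"object": {"Object"}, "substance": {"Substance"}}
expanded = expand_features(hyponyms, hand_coded, blocked={("drug", "Object")})
```

Because a synset is revisited only when it gains a new feature, the procedure also terminates on hierarchies with multiple inheritance.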

Comparing both versions:

1. Both versions share

2.676 hand-coded features (corresponding to 1.013 different synsets)

51.043 expanded features (corresponding to 36.289 different synsets)

2. Differences

The initial version had 201.960 expanded features, belonging to 75.052 synsets, which are no longer present. The new version has 225.341 new expanded features, belonging to 75.295 synsets.
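As a quick consistency check, the shared figures can be recovered from the totals reported for both versions. The snippet only restates the arithmetic with the numbers given above; the variable names are ours.

```python
# Expanded features: total per version minus the features unique to it.
old_total, old_only = 253_003, 201_960
new_total, new_only = 276_384, 225_341
shared_from_old = old_total - old_only
shared_from_new = new_total - new_only

# Hand-coded features: totals per version and the shared part.
old_hand, new_hand, shared_hand = 2_696, 2_756, 2_676
removed_hand = old_hand - shared_hand
added_hand = new_hand - shared_hand
```

Both subtractions give the 51.043 shared expanded features, and the hand-coded figures imply 20 features dropped and 80 added between the two versions.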

See Working Papers WP4.1, WP4.5 and WP4.7 for further details of the Top ConceptOntology currently integrated into the Mcr.

4.2.5 Generalisation

A similar process can be devised in order to expand the knowledge in the Mcr. In this case, rather than expanding top–down the knowledge and properties represented in the Mcr, a full bottom–up generalisation mechanism can be performed, in which different knowledge and properties can collapse onto particular Base Concepts and ontological properties. We suggest further analysis of this possibility. However, no further Generalization was planned for this round.
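One possible reading of such a bottom–up mechanism is to accumulate, on every ancestor, the labels carried by the synsets below it, so that frequent labels surface on higher nodes such as Base Concepts. The sketch below assumes this reading; the function name, the data layout and the labels are hypothetical.

```python
from collections import Counter


def generalise(hypernyms, labels):
    """Count each synset's labels on the synset itself and on all its ancestors.

    hypernyms: {synset: [direct hypernyms]}
    labels:    {synset: [labels attached to that synset]}
    """
    totals = {}
    for synset, synset_labels in labels.items():
        # Collect the synset and every ancestor reachable via hypernym links.
        seen, stack = set(), [synset]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(hypernyms.get(node, ()))
        for node in seen:
            totals.setdefault(node, Counter()).update(synset_labels)
    return totals


# Toy data (hypothetical labels).
hypernyms = {"heroin": ["drug"], "cocaine": ["drug"], "drug": ["substance"]}
labels = {"heroin": ["MEDICINE"], "cocaine": ["MEDICINE"], "drug": ["PHARMACY"]}
totals = generalise(hypernyms, labels)
```

A later step could then decide, for each node, which accumulated labels are frequent enough to be kept as generalised properties.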

4.3 Porting Process

Having all these types of knowledge and properties completely expanded through the whole Mcr, a new set of inference mechanisms can be devised in order to infer further relations and knowledge. For instance, new relations can be generated when detecting particular semantic patterns occurring for synsets having certain ontological properties, for particular Domains, etc. That is, new relations can be generated by combining different methods and knowledge, for instance when several relations derived in the integration process have confidence scores greater than certain thresholds. We also suggest further analysis of this possibility.
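As an illustration of the threshold idea, the sketch below accepts a candidate relation only when at least two independent methods propose it with a confidence above a threshold. The tuple layout, the parameter values and the method names are assumptions made for the example; no such inference tool is described in this document.

```python
def infer_relations(candidates, threshold=0.5, min_sources=2):
    """Keep relations proposed by enough methods with sufficient confidence.

    candidates: iterable of (source, relation, target, method, confidence)
    """
    support = {}
    for source, relation, target, method, confidence in candidates:
        if confidence > threshold:
            support.setdefault((source, relation, target), set()).add(method)
    return sorted(
        triple for triple, methods in support.items() if len(methods) >= min_sources
    )


# Toy candidates (hypothetical methods and scores).
candidates = [
    ("02755829-n", "role_patient", "00849393-v", "semcor", 0.8),
    ("02755829-n", "role_patient", "00849393-v", "bnc", 0.7),
    ("02755829-n", "role_patient", "00201878-v", "semcor", 0.9),
    ("02755829-n", "role_patient", "00201878-v", "bnc", 0.2),
]
inferred = infer_relations(candidates)
```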



As we have greatly improved the quality of the local wordnets and their associated information, we decided to redo the porting process from scratch, keeping track of the source of the information.

In PORT1 and PORT2 we decided to incorporate the new knowledge encoded in the new versions of the Princeton WordNet (1.7, 1.7.1, 2.0 and eXtended WordNet). Thus, we ported to WordNet 1.6 the new relation types (e.g. usage). However, we decided not to port the old relation types already present in version 1.6 (e.g. has_holo_madeof, has_hyponym, etc.), since they would probably be inconsistent with the current content aligned to WordNet 1.6.



Relation            Total    Basque-Spanish  Catalan-Spanish  Catalan-Basque

be_in_state         1,302    38              568              38
causes              240      116             156              110
has_derived         8,504    1               288              1
has_holo_madeof     708      0               297              0
has_holo_member     11,847   0               3,510            0
has_holo_part       6,878    0               2,839            0
has_hyponym         78,293   26,293          35,809           26,293
has_subevent        427      139             213              139
has_xpos_hyponym    81       0               0                0
near_antonym        7,444    1,295           2,405            1,295
near_synonym        10,955   24              4,174            24
role                106      101             0                0
role_agent          516      488             0                1
role_instrument     291      269             0                0
role_location       83       79              0                0
role_patient        6        6               0                0
see_also_wn15       3,280    142             696              142
verb_group          523      73              101              73
xpos_fuzzynym       37       36              0                0
xpos_near_synonym   319      291             3                1

Relation            Italian-English  Italian-Spanish  English-Spanish

be_in_state         372              372              1,174
causes              85               85               170
has_derived         1,337            1,337            2,154
has_holo_madeof     212              212              338
has_holo_member     339              339              5,180
has_holo_part       1,830            1,830            3,902
has_hyponym         19,091           19,091           44,626
has_subevent        170              170              356
near_antonym        1,946            1,946            5,487
near_synonym        384              384              17,471
see_also_wn15       874              874              3,033
verb_group          115              115              195

Table 8: Overlapping of relations



5 PORT2 Results

5.1 Final porting results

As no extra knowledge was inferred in this porting process, all the knowledge integrated into the Mcr has been ported (distributed) to the local wordnets. That is, the process finishes by producing export XML files for all local wordnets.

Tables 9 and 10 summarise the main results before the whole porting process (UPLOAD2) and after the porting process (PORT2). All wordnets gained some kind of new knowledge coming from other wordnets by means of the third porting process. A direct result of the upload/integration/porting effort is that all information associated with the Ilis is automatically ported to the other wordnets. Thus, WordNet Domains are now available to the rest of the local wordnets, the EuroWordNet Top Ontology is also available for the Italian WordNet and for English WordNet 1.6, and the SUMO labels have been ported to Catalan, Italian and Spanish. Moreover, local relations can be ported to the rest of the wordnets. Thus, the Italian and English wordnets can be enriched with the whole new set of relations coming from EuroWordNet. In turn, the Basque, Catalan, Italian and Spanish wordnets can be extensively enriched with the large amounts of relations coming from newer versions of WordNet, eXtended WordNet and the selectional preferences acquired from English.
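Since all information is attached to the Ilis, porting it to a local wordnet amounts to following each local synset's link to its Ili and copying whatever is stored there. The sketch below illustrates this with the Domain labels of two senses of the Spanish noun vaso discussed in Section 7.1; the function name and the dictionary layout are assumptions, not the actual export procedure.

```python
def port_labels(ili_labels, local_to_ili):
    """Return {local synset: labels} for every synset whose Ili carries labels."""
    ported = {}
    for synset, ili in local_to_ili.items():
        labels = ili_labels.get(ili)
        if labels:
            ported[synset] = set(labels)
    return ported


# Domain labels stored on Ili records (values as given in Section 7.1).
ili_domains = {"02755829-n": {"FACTOTUM"}, "04195626-n": {"ANATOMY"}}
# Links from Spanish variants to Ili records; vaso_3 has no label in this toy table.
spanish = {"vaso_1": "02755829-n", "vaso_2": "04195626-n", "vaso_3": "09914390-n"}
ported = port_labels(ili_domains, spanish)
```

The same lookup works for any other Ili-level information (Top Ontology properties, SUMO labels) and for any local wordnet linked to the Ili.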

In these tables, we do not consider hypo/hypernym relations. Links stands for the total number of Domains or Top Ontology labels ported (one synset can have more than one label). Selectional Preferences have been included in the database as noun-to-verb relations (ROLE) instead of verb-to-noun relations (INVOLVED) (footnote 21). Although we can distinguish subjects and objects in the database, all of them have been included as a more general ROLE relation. Role–agent–semcor stands for the subject selectional preferences acquired from SemCor, and Role–patient–semcor for the object selectional preferences acquired from SemCor. Role–agent–semcor2 stands for the subject selectional preferences acquired by directly parsing SemCor, and Role–patient–semcor2 for the object selectional preferences acquired by directly parsing SemCor. Role–agent–bnc stands for the subject selectional preferences acquired from the British National Corpus, and Role–patient–bnc for the typical objects acquired from BNC. In fact, some of them may overlap. A further 497.011 more general ROLE relations are not included in the tables. We need to investigate new inference facilities to enhance the porting process, as suggested before.

Thus, for English, the current Mcr totals more than one and a half million non-hierarchical (non hypo/hypernym) relations. In contrast, WordNet 2.0 has 108,484 non hypo/hypernym relations (footnote 22).
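Storing a selectional preference as a ROLE relation rather than an INVOLVED one only changes the direction of the link. The sketch below shows the conversion for the two cases discussed above, using the EuroWordNet relation names involved_agent/involved_patient and role_agent/role_patient; the tuple layout and the function name are assumptions made for the example.

```python
# Verb-to-noun relation names and their noun-to-verb counterparts.
INVERSE = {
    "involved_agent": "role_agent",      # subject preferences
    "involved_patient": "role_patient",  # object preferences
}


def to_role(relations):
    """Turn (verb, involved_*, noun, weight) into (noun, role_*, verb, weight)."""
    return [
        (noun, INVERSE[relation], verb, weight)
        for verb, relation, noun, weight in relations
    ]


# Example taken from the vaso_1 listing in Section 7.1 (direct object of "polish").
involved = [("00849393-v", "involved_patient", "02755829-n", 0.0074)]
roles = to_role(involved)
```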

In that way, the Mcr produced by Meaning is going to constitute the natural multilingual large-scale linguistic resource for a number of semantic processes that need large amounts of linguistic knowledge to be effective tools (e.g. Web ontologies). The fact that word senses will be linked to concepts in the Mcr will allow for the appropriate representation and storage of the acquired knowledge.

Mcr2 now integrates into the same EuroWordNet framework (using a new version of Base Concepts, the Top Ontology and the WordNet Domains) five local wordnets (with four English WordNet versions) with hundreds of thousands of new semantic relations, instances and properties fully expanded. All wordnets gained some kind of new knowledge coming from other wordnets by means of this porting process. In fact, the resulting Mcr2 is the largest and richest multilingual lexical knowledge base ever built.

21 INVOLVED and ROLE relations are symmetric.
22 Inverse relations are counted just once.



                      Spanish             English             Italian
Relations             UPLOAD    PORT2     UPLOAD    PORT2     UPLOAD    PORT2

be_in_state           589       +1        650       +3        194       +2
causes                189       =         224       +19       85        +15
has_derived           2,120     =         6,052     +2        1,353     =
has_holo_madeof       338       =         709       =         218       =
has_holo_member       5,180     =         11,848    =         343       =
has_holo_part         3,907     =         6,881     =         1,901     =
has_subevent          356       =         427       =         171       =
has_xpos_hyponym      480       +1        0         +225      0         +22
near_antonym          5,493     +5        7,449     +4        1,966     =
near_synonym          17,510    +1        21,858    +21       470       +54
pertains_to           42 +2 +50 2
role                  102       =         0         +106      0         +46
role_agent            504       =         0         +516      0         +227
role_instrument       282       =         0         +291      0         +151
role_location         82        =         0         +83       0         +39
role_patient          6         =         0         +6        0         +3
see_also_wn15         3,033     =         3,286     =         874       =
verb_group            98        =         262       =         58        =
xpos_fuzzynym         36        =         0         +37       0         +23
xpos_near_synonym     307       =         0         +319      0         +181

gloss                 0         +264,038  0         +550,922  0         +115,943
category_term         0         +2,699    0         +4,894    0         +1,314
region_term           0         +420      0         +890      0         +165
related_to            0         +22,320   0         +31,852   0         +12,078
usage_term            0         +321      0         +869      0         +139
Total                 41,353    +289,798  60,557    +586,220  7,679     +129,639

role_agent-semcor2    0         +8,981    10,196    =         0         +5,625
role_agent-semcor     0         +63,931   69,840    =         0         +35,292
role_agent-bnc        0         +99,737   115,542   =         0         +52,212
role_patient-semcor2  0         +11,515   13,408    =         0         +7,062
role_patient-semcor   0         +100,265  110,102   =         0         +42,667
role_patient-bnc      0         +79,443   95,065    =         0         +54,177
Role                  0         +363,872  414,153   =         0         +197,035

Instances             0         +1,599    0         +2,198    791       =
Proper Nouns          1,806     =         17,842    =         2,161     =

Base Concepts         1,169     =         1,535     =         0         +935

Domains Links         0         +55,239   109,621   =         35,174    =
Domains Synsets       0         +48,053   96,067    =         30,607    =

Top Ontology Links    3,438     =         0         +4,148    0         +2,544
Top Ontology Synsets  1,290     =         0         +1,554    0         +946

Table 9: A) PORT2 Results



                      Catalan             Basque
Relations             UPLOAD    PORT2     UPLOAD    PORT2

be_in_state           286       =         38        =
causes                156       =         103       =
has_derived           288       =         1         =
has_holo_madeof       297       =         231       =
has_holo_member       3,510     =         379       =
has_holo_part         2,839     =         2,316     =
has_subevent          213       =         139       =
has_xpos_hyponym      169       =         0         =
near_antonym          2,408     =         1,300     =
near_synonym          4,176     =         24        =
role                  0         =         101       =
role_agent            1         =         479       =
role_instrument       0         =         269       =
role_location         0         =         79        =
role_patient          0         =         6         =
see_also_wn15         696       =         150       =
verb_group            51        =         73        =
xpos_fuzzynym         0         =         0         =
xpos_near_synonym     3         =         0         =

gloss                 0         +164,511  0         +107,217
category_term         0         +1,771    0         +1,324
region_term           0         +138      0         +208
related_to            0         +15,320   0         +12,642
usage_term            0         +200      0         +105
Total                 15,526    +181,940  5,993     +121,496

role_agent-semcor2    0         +6,976    0         +6,137
role_agent-semcor     0         +53,027   0         +44,178
role_agent-bnc        0         +76,287   0         +58,565
role_patient-semcor2  0         +8,927    0         +7,480
role_patient-semcor   0         +82,818   0         +67,578
role_patient-bnc      0         +62,548   0         +47,288
Role                  0         +290,583  0         +231,226

Instances             0         +1,599    0         +365
Proper Nouns          1,119     =         552       =

Base Concepts         1,169     =         1,017     =

Domain Links          0         +40,762   0         +29,817
Domain Synsets        0         +35,177   0         +25,860

Top Ontology Links    3,164     =         3,021     =
Top Ontology Synsets  1,180     =         1,126     =

Table 10: B) PORT2 Results



6 The future of the Mcr

Although this is the final version of the Mcr in Meaning, our plan is to continue exploring and improving this multilingual resource in several ways.

6.1 Further Uploading

6.1.1 Improved Selectional Preferences acquired from BNC

The application and study of these sets of SPs seem to indicate that both methodologies suffer from an overly high level of generalization. In Mcr2 we uploaded a new set of verb-specific Selectional Preferences obtained from SemCor using a new methodology based on protomodels (see Working Paper WP5.8 Experiment 5.G: Selectional Preferences (2nd round)).

The Tree Cut Models (tcms) that we used in the first round of Meaning acquisition for learning selectional preferences from unannotated text often suffered from an overly high level of generalisation; that is, classes which are very high in the WordNet hierarchy are used to represent the preferences. Table 11 shows the volume of the data uploaded. Table 12 shows an example of the kind of information obtained for the noun church. Not only the list of verbs associated directly with a sense (direct) can be retrieved, but also the list associated with the n-th hypernym of the synset (hyper–n).

We are investigating three possibilities to acquire more specific, accurate and intuitive models:

Weighting TCMs: a proposal by [Wagner, 2002] to introduce a weighting factor to counter the effect of data size on the model.

WSD on input data: the use of automatic WSD on the training data to counter the effect of polysemy.

Protomodels: new selectional preference models which aim to cover only a portion of the data, where that portion can be disambiguated and where the disambiguation is performed using a ratio of types in a class, rather than tokens.

                  Object   Subject
#sel. pref.       162,178  149,059
involved verbs    2,047    2,047
involved synsets  5,149    4,582

Table 11: Figures for the new Selectional Preferences

However, the verbs in this new set of high-quality Selectional Preferences should be disambiguated before full integration into the Mcr.



church#n#1  direct
    embrace 0,0055; divide 0,0028; believe 0,0015; force 0,0013; bring 0,0008; like 0,0008; see 0,0008; support 0,0006; give 0,0003

church#n#2  direct
    erect 0,0223; burn 0,0090; surround 0,0085; dedicate 0,0074; enter 0,0063; view 0,0044; round 0,0030; abandon 0,0025; design 0,0025; damage 0,0024; replace 0,0007; like 0,0005; call 0,0004; include 0,0003; see 0,0003; keep 0,0001; know 0,0001; give 0,0000

church#n#3  hyper-1  00666912
    dedicate 0,0743; rebuild 0,0556; attend 0,0353; situate 0,0223; enlarge 0,0207; supply 0,0182; found 0,0169; set up 0,0118; equip 0,0098; maintain 0,0095; retain 0,0069; establish 0,0067; start 0,0066; list 0,0065; begin 0,0063; buy 0,0063; regard 0,0062; clean 0,0056; distribute 0,0055; attack 0,0053; free 0,0053; dominate 0,0050; incorporate 0,0049; serve 0,0049; destroy 0,0046; mention 0,0040; create 0,0039; approach 0,0038; join 0,0038; close 0,0036; pass 0,0036; refer 0,0035; lead 0,0033; leave 0,0032; control 0,0025; form 0,0023; return 0,0023; open 0,0021; examine 0,0020; give 0,0020; help 0,0020; miss 0,0020; complete 0,0018; add 0,0017; reach 0,0016; come 0,0014; hear 0,0011; set 0,0011; follow 0,0010; include 0,0009; replace 0,0009; suggest 0,0009; go 0,0007; keep 0,0006; use 0,0006; know 0,0004; show 0,0004; need 0,0003; provide 0,0003; like 0,0002; get 0,0001; take 0,0001

            hyper-2  00663517
    administer 0,0365; fashion 0,0254; celebrate 0,0247; institute 0,0211; recite 0,0141; alarm 0,0136; progress 0,0132; tax 0,0111; sing 0,0082; hinder 0,0073; adapt 0,0058; practise 0,0058; encompass 0,0052; travel 0,0045; refuse 0,0044; deem 0,0039; point out 0,0039; sit 0,0035; rescue 0,0030; light 0,0019; gather 0,0016; receive 0,0016; install 0,0015; continue 0,0013; hate 0,0013; permit 0,0011; constitute 0,0010; approach 0,0008; lay 0,0007; remain 0,0007; remember 0,0005; hold 0,0003

Table 12: New Selectional Preferences acquired for the noun “church”
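The direct and hyper–n lookups shown in Table 12 can be sketched as a walk up the hypernym chain of a sense, collecting the verb preferences stored at each level. The sketch assumes a single hypernym per synset and assumes that the hyper-2 row of the table belongs to church#n#3; the function name and the data layout are hypothetical, and only a few weights from the table are reproduced.

```python
def preferences(synset, hypernym_of, verb_prefs, max_depth=2):
    """Return [(level, synset, verbs)] for a sense and its first hypernyms."""
    results = []
    node, depth = synset, 0
    while node is not None and depth <= max_depth:
        level = "direct" if depth == 0 else f"hyper-{depth}"
        if node in verb_prefs:
            results.append((level, node, verb_prefs[node]))
        node = hypernym_of.get(node)  # None once the top of the chain is reached
        depth += 1
    return results


# Excerpt of Table 12 (weights written with a decimal point).
hypernym_of = {"church#n#3": "00666912", "00666912": "00663517"}
verb_prefs = {
    "00666912": {"dedicate": 0.0743, "rebuild": 0.0556},
    "00663517": {"administer": 0.0365},
}
found = preferences("church#n#3", hypernym_of, verb_prefs)
```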



6.1.2 Topic Signatures

From ACQ1 we obtained a large set of sense examples acquired automatically from the web (see Working Paper WP5.5 Experiment 5.H a): Publicly available topic signatures for all WordNet nominal senses). These examples have been obtained by querying Google. For each word sense in WordNet, a program builds a complex query including sets of monosemous synonymous relatives. Using this approach, large collections of text can be obtained, representing hundreds of examples per word sense. Using this large-scale resource we generated Topic Signatures (footnote 23) [Agirre and de Lacalle, 2004] for every word sense in WordNet.

In fact, we released this publicly available resource, which comprises both automatically extracted examples for all WordNet 1.6 noun senses and topic signatures built from those examples. We gathered around 700 sentences for each noun in WordNet. When the monosemous relatives are used to build a sense corpus for polysemous words, they comprise an average of around 3,500 sentences per word sense. The size of the topic signatures thus constructed is around 4,500 words per word sense.

Table 13 presents the Topic Signatures (list of words and associated weights) for sense 6 of horse, <heroin, diacetyl morphine, H, horse, junk, scag, shit, smack>, "a morphine derivative".

However, as we mentioned for the improved Selectional Preferences, these Topic Signatures should be disambiguated before full integration into the Mcr.
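The query-building step can be illustrated as follows: among the relatives of a word sense, only those with a single sense are kept, and they are combined into one disjunctive query. This is a simplified sketch, not the program used in ACQ1; the function names, the query syntax and the sense counts in the toy data are assumptions.

```python
def monosemous_relatives(relatives, sense_count):
    """Keep only the relatives that have exactly one sense."""
    return [word for word in relatives if sense_count.get(word, 0) == 1]


def build_query(relatives, sense_count):
    """Combine the monosemous relatives into a single OR query of quoted phrases."""
    terms = monosemous_relatives(relatives, sense_count)
    return " OR ".join(f'"{term.replace("_", " ")}"' for term in terms)


# Synonyms of sense 6 of horse; the sense counts are illustrative only.
relatives = ["heroin", "diacetyl_morphine", "H", "horse", "junk", "scag"]
sense_count = {"heroin": 1, "diacetyl_morphine": 1, "H": 7, "horse": 6, "junk": 4, "scag": 1}
query = build_query(relatives, sense_count)
```

Because the retained relatives are monosemous, every retrieved sentence can be attributed to the intended sense without further disambiguation.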

6.1.3 Non subject/object Selectional Preferences

In the three past Meaning cycles we decided to incorporate only subject/object Selectional Preferences acquired from different corpora and techniques. However, the English parsers used for this task already produce other very valuable dependencies. Our plan is to upload and integrate the rest of the Selectional Preferences captured.

6.1.4 Large collections of Sense Examples

We can also integrate into the Mcr all the sense examples appearing in SemCor. Currently, WordNet glosses contain some usage examples of the described concept. We can also incorporate as usage examples the sentences corresponding to all the sense occurrences in SemCor.

6.2 Further Integration

First, we suggest for future rounds a manual validation of the Top Concept Ontology and a new expansion (Realization) of the properties.

We also suggest a full expansion (Realization) through the nominal part of the hierarchy of the selectional preferences acquired from SemCor and BNC (and possibly other implicit semantic knowledge currently available in WordNet, such as meronymy information).

23 http://ixa.si.ehu.es/Ixa/resources/sensecorpus

drug(467.90) cocaine(377.79) cocain(372.15) scag(159.76) heroin(86.46) marijuana(84.58) addict(52.62) cannabis(46.98) addiction(33.42) addictive(31.95) crack(24.24) alcohol(21.89) coca(20.67) illegal(18.79) stimulant(18.79) arrest(16.91) gateway(15.03) percent(15.03) association(14.54) abuse(13.61) user(13.57) opiate(13.15) powder(13.15) dealer(11.48) lsd(11.27) narcotic(11.27) opium(11.27) tobacco(11.27) government(11.05) law(10.17) amphetamine(09.39) ecstasy(09.39) inject(09.39) substantially(09.39) weed(09.39) epidemic(08.06) netherlands(08.06) effect(07.65) addicted(07.51) cia(07.51) cigarette(07.51) heroine(07.51) methadone(07.51) snort(07.51) consumption(06.91) enforcement(06.91) gram(06.91) decline(06.54) holland(06.54) population(06.54) market(05.95) smoke(05.81) abuser(05.63) admit(05.63) decriminalize(05.63) dependence(05.63) forecast(05.63) frequent(05.63) morphine(05.63) pcp(05.63) prohibitionist(05.63) pusher(05.63) rate(05.52) test(05.52) treatment(05.52) brain(05.08) derive(05.08) dutch(05.08) pot(05.08) usage(05.08) substance(04.67) adolescent(04.60) amsterdam(04.60) slang(04.60) housing(04.36) plant(04.36) smoking(04.36) california(04.20) big(03.82) estimate(03.82) acid(03.75) autopsy(03.75) black-market(03.75) bolivia(03.75) breathe(03.75) bust(03.75) busted(03.75) cancer(03.75) cheat(03.75) coffee(03.75) coincidentally(03.75) coke(03.75) correlation(03.75) credible(03.75) dopamine(03.75) drug-(03.75) fatal(03.75) hallucinogen(03.75) handgun(03.75) harmless(03.75)

Table 13: Initial set of Topic Signatures for sense 6 of horse
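The topic signature of Table 13 is a list of word(weight) entries, which is straightforward to load for further processing. The following sketch parses that textual layout as it is printed in this document; the function name is ours, and the actual distribution format of the resource may differ.

```python
import re

# One entry is a run of non-space characters followed by a weight in parentheses.
ENTRY = re.compile(r"(\S+?)\((\d+(?:\.\d+)?)\)")


def parse_signature(text):
    """Return [(word, weight)] in the order the entries appear."""
    return [(word, float(weight)) for word, weight in ENTRY.findall(text)]


sample = "drug(467.90) cocaine(377.79) black-market(03.75) drug-(03.75)"
signature = parse_signature(sample)
```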

We also suggest further investigation of a full bottom–up expansion (Generalization), rather than merely expanding top–down the knowledge and properties represented in the Mcr. In this case, different knowledge and properties can collapse onto particular Base Concepts, Semantic Files, Domains and/or ontological nodes.

6.3 Porting Process

The consortium also needs to investigate a new set of inference mechanisms in order to infer further relations and knowledge inside the Mcr. For instance, new relations can be generated when detecting particular semantic patterns occurring for synsets having certain ontological properties, for particular Domains, etc. That is, new relations can be generated by combining different methods and knowledge, for instance when several relations derived in the integration process have confidence scores greater than certain thresholds.

However, without this new inference tool (i.e. without having inferred extra knowledge), in this porting process all the knowledge integrated into the Mcr will be ported to the local wordnets.

Mcr2 now integrates into the same EuroWordNet framework (using a new version of Base Concepts, the Top Concept Ontology and the WordNet Domains) five local wordnets (with five English WordNet versions) with hundreds of thousands of new semantic relations, instances and properties fully expanded. In fact, the resulting Mcr2, with more than 1.6 million relations, will be one of the largest and richest multilingual lexical knowledge bases ever built.



7 Mcr2 examples

When all this knowledge is uploaded coherently into the Mcr, a full range of new possibilities appears for improving both Acquisition and WSD problems (and other Semantic Processes). We will illustrate these new capabilities with two simple examples.

7.1 The "Vaso" Example

The Spanish noun vaso has three possible senses. The first one is connected to the same Ili as the English synset <drinking_glass glass>. This Ili record, belonging to the Semantic File ARTIFACT, has no specific WordNet Domain (FACTOTUM). However, the Top Concept Ontology provides further clues about its meaning: it has the properties Form-Object, Origin-Artifact, Function-Container and Function-Instrument. The Sumo type for this synset is also Artifact. Valuable information also comes from the disambiguated glosses included in the eXtended WordNet. This gloss has two 'silver' words (footnote 24) (glass, container) and three 'normal' words (the rest). For instance, hold#VBG#8 corresponds to "contain or hold; have within: 'The jar carries wine'; 'The canteen holds fresh water'; 'This can contains water'". The reverse relation rgloss can be used to explore in which definitions vaso_1 is used: in this case, 36 relations (most of them to nouns, but also three verb senses and two adjective senses). Further, from the Selectional Preferences acquired from SemCor, we know that the typical things that somebody does with this kind of vaso are, for instance, the Spanish equivalents of <polish, shine, smooth, smoothen> or <beautify, embellish, prettify>. After parsing SemCor with Minipar [Lin, 1998a], we also included in the Mcr those synsets appearing as subjects and direct objects, that is, without performing any kind of generalization. Obviously, this information is similar to the generalized classes provided by [Agirre and Martinez, 2002]. In this case, the subjects and direct objects captured when parsing SemCor are the equivalent translations of <pass, hand, reach, pass on, turn over, give> or <offer, proffer>. We also included the improved verbal proto–models acquired from BNC. In this case, the object proto–models are, for instance, "lift" and "roll". WordNet 2.0 also provides a new morphological derivational relation: to glass#v#4 "put in a glass container". Finally, we must add that this also holds for the rest of the connected languages.

24 High confidence

vaso_1 02755829-n

SF: 06-NOUN.ARTIFACT

DOMAIN: FACTOTUM

SUMO: &%Artifact+

TO: 1stOrderEntity-Form-Object

TO: 1stOrderEntity-Origin-Artifact

TO: 1stOrderEntity-Function-Container

TO: 1stOrderEntity-Function-Instrument

EN: drinking_glass glass

IT: bicchiere

BA: edontzi baso edalontzi

CA: got vas

02755829-n drinking_glass glass:

GLOSS: a glass container for holding liquids while drinking

eXtended WordNet:

GLOSS: a glass#NN#2 container#NN#1 for hold#VBG#8 liquid#NNS#1 while drink#VBG#1

RGLOSS:

03262062-n 1.0 {rummer#1}

10840325-n 0.6 {zinc_oxide#1,flowers_of_zinc#1,philosopher’s_wool#1}

10767198-n 0.6 {red_lead#1,minium#1}

10721901-n 0.6 {lithium_carbonate#1,Lithane#1,Lithonate#1}

10633743-n 0.6 {insulator#1,dielectric#1,nonconductor#1}

07445885-n 0.6 {optician#1,lens_maker#1}

03442944-n 0.6 {sun_parlor#1,sun_parlour#1,sun_porch#1,sunroom#1,sun_lounge#1,solarium#1}

03299003-n 0.6 {seidel#1}

02835732-n 0.6 {hotbed#2}

02770699-n 0.6 {nursery#2,greenhouse#1,glasshouse#1}

02759431-n 0.6 {goblet#1}

02755829-n 0.6 {glass#2,drinking_glass#1}

02398768-n 0.6 {cash_bar#1}

02397742-n 0.6 {case#11,showcase#2,display_case#1}

02277068-n 0.6 {beer_glass#1}

01093769-v 0.6 {glass#3,glass_in#1}

01073800-a 0.6 {unglazed#1,glassless#1}

01073688-a 0.6 {glazed#2,glassed#1}

00129338-v 0.6 {glass#4}

10795011-n 0.3 {sodium_carbonate#1,washing_soda#1,sal_soda#1,soda_ash#1,soda#1}

10659058-n 0.3 {potassium_carbonate#1}

09914390-n 0.3 {glass#3,glassful#1}

07292487-n 0.3 {glassmaker#1}

03623779-n 0.3 {wineglass#1}

03567934-n 0.3 {vase#1}

03549887-n 0.3 {tumbler#2}

03361087-n 0.3 {snifter#1,brandy_snifter#1,brandy_glass#1}



03328048-n 0.3 {pony#4,shot_glass#1,jigger#1}

03088905-n 0.3 {parfait_glass#1}

02988452-n 0.3 {mercury_thermometer#1}

02970011-n 0.3 {Mason_jar#1}

02931434-n 0.3 {liqueur_glass#1}

02923710-n 0.3 {bulb#2,light_bulb#1,lightbulb#1,incandescent_lamp#1,electric_light#1,electric-light_bulb#1}

02841849-n 0.3 {hurricane_lamp#1,hurricane_lantern#1,tornado_lantern#1,storm_lantern#1,storm_lamp#1}

02817711-n 0.3 {highball_glass#1}

01594762-v 0.3 {glaze#3,glass#1}

DOBJ SemCor:

02755829 00849393-v 0.0074 polish shine smooth smoothen

02755829 00201878-v 0.0013 beautify embellish prettify

02755829 00826635-v 0.0010 get_hold_of take

02755829 00140937-v 0.0001 ameliorate amend better improve meliorate

02755829 00083947-v 0.0000 alter change

DOBJ Semcor-No Generalization:

02755829 00826635-v get_hold_of take

02755829 00849393-v polish shine smooth smoothen

02755829 01526289-v pass hand reach pass_on turn_over give

02755829 01571054-v offer proffer

Proto-Classes:

lift dobj 02755829 0.0143220878

roll dobj 02755829 0.0056179775

bear dobj 02755829 0.0011655012

turn dobj 02755829 0.0011137408

send dobj 02755829 0.0005092687

hear dobj 02755829 0.0003165225

like subj 02755829 0.0003032475

find subj 02755829 6.85143e-05

WN2.0:

RELATED TO: glass#v#4 (put in a glass container)
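The disambiguated gloss in the listing above uses lemma#POS#sense tokens, with untagged function words left bare. A minimal sketch for extracting the tagged words from that layout follows; it parses the notation as printed in this listing, not the distribution format of the eXtended WordNet, and the function name is ours.

```python
def parse_gloss(gloss):
    """Return [(lemma, pos_tag, sense_number)] for the tagged tokens of a gloss."""
    tagged = []
    for token in gloss.split():
        parts = token.split("#")
        # Untagged words such as "a" or "for" have no '#' fields and are skipped.
        if len(parts) == 3 and parts[2].isdigit():
            tagged.append((parts[0], parts[1], int(parts[2])))
    return tagged


gloss = "a glass#NN#2 container#NN#1 for hold#VBG#8 liquid#NNS#1 while drink#VBG#1"
tagged = parse_gloss(gloss)
```

Each extracted triple identifies a word sense, so every gloss yields gloss links from the defined synset, and the rgloss relation is the same set of links read in the opposite direction.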

The second sense of vaso is the equivalent translation of <vessel, vas>. This Ili record, belonging to the Semantic File BODY, has been assigned a different WordNet Domain (ANATOMY). The EuroWordNet Top Ontology in this case provides the properties Form-Substance-Solid, Origin-Natural-Living, Composition-Part and Function-Container. The Sumo label provides the properties and axioms assigned to BodyVessel. This gloss has two 'gold' words (footnote 25) (tube and circulate) and one 'silver' word (body fluid), and the last word is monosemous. From the Selectional Preferences acquired from SemCor, we know that the typical events applied to this kind of vaso are, for instance, the Spanish equivalents of <inject, shoot> or <administer, dispense>. Observing the rgloss relation, we can see that this sense is related to the verb extravasate#1 or to the nouns vascular_system#1 and blood_vessel#1. In total there are 34 relations (most of them to nouns, but also 7 to verbs and 3 to adjective concepts). In this case, the subjects and direct objects captured when parsing SemCor are the equivalent translations of <follow, travel along> and <be, occur>, and the proto–models are, for instance, "open" and "show". In this case, there are no new relations coming from WordNet 2.0. As before, we must add that this knowledge can also be ported to the rest of the connected languages.

25 Hand corrected

vaso_2 04195626-n

SF: 08-NOUN.BODY

DOMAIN: ANATOMY

SUMO: &%BodyVessel+

TO: 1stOrderEntity-Form-Substance-Solid

TO: 1stOrderEntity-Origin-Natural-Living

TO: 1stOrderEntity-Composition-Part

TO: 1stOrderEntity-Function-Container

EN: vessel vas

IT: vaso dotto canale

BA: hodi baso

CA: vas

04195626-n vessel vas:

GLOSS: a tube in which a body fluid circulates

eXtended WordNet:

GLOSS: a tube#NN#4 in which a body_fluid#NN#1 circulate#VBZ#4

RGLOSS:

04194681-n 1.0 {lymphatic_system#1}

07610890-n 0.6 {stevedore#1,loader#1,longshoreman#1,docker#1,dockhand#1,dock_worker#1,dock-walloper#1,lumper#1}



04280896-n 0.6 {spermatic_cord#1}

04268977-n 0.6 {vascular_system#1}

04267314-n 0.6 {lumbar_plexus#1,plexus_lumbalis#1}

04216701-n 0.6 {lesser_omentum#1}

04207149-n 0.6 {blood_vessel#1}

04063459-n 0.6 {bulb#5}

04061458-n 0.6 {hilum#1,hilus#1}

03542269-n 0.6 {trivet#1}

03521286-n 0.6 {torpedo_tube#1}

03341371-n 0.6 {siphon#1,syphon#1}

02713719-n 0.6 {foremast#1}

02533651-n 0.6 {cup#8,loving_cup#2}

02288257-n 0.6 {bilges#1}

00952199-v 0.6 {strangulate#2}

00942624-v 0.6 {extravasate#1}

00891660-v 0.6 {anchor#2,cast_anchor#1,drop_anchor#1}

00028352-n 0.6 {mooring#3,docking#1,tying_up#1,dropping_anchor#1}

09590888-n 0.3 {dockage#1,docking_fee#1}

06363422-n 0.3 {dockyard#1}

03413654-n 0.3 {still#3}

03412425-n 0.3 {sternpost#1}

03081935-n 0.3 {pan#1}

02311368-n 0.3 {bomb#2,bomb_calorimeter#1}

02188041-n 0.3 {anchor_chain#1}

02160659-n 0.3 {accommodation_ladder#1}

01381363-v 0.3 {ground#5,run_aground#1}

01315463-v 0.3 {bear_down_on#1,bear_down_upon#1}

00982771-a 0.3 {stern#4}

00982045-a 0.3 {bow#1}

00981849-a 0.3 {fore#1}

00481361-v 0.3 {loft#4}

00176545-v 0.3 {bilge#1}

DOBJ SemCor:

04195626 01781222 0.0334 be occur

04195626 00058757 0.0072 inject shoot

04195626 01357963 0.0068 follow travel_along

04195626 00055849 0.0045 administer dispense

04195626 01012352 0.0022 block close_up impede jam obstruct occlude

04195626 00054862 0.0021 care_for treat

04195626 01670590 0.0017 hinder impede

04195626 00401762 0.0011 cognize know



04195626 01253107 0.0005 go locomote move travel

04195626 01669882 0.0003 keep prevent

DOBJ SemCor No-Generalization:

04195626 01357963 follow travel_along

04195626 01781222 be occur

SUBJ SemCor:

04195626 01831830 0.0133 stop terminate

04195626 01357963 0.0127 follow travel_along

04195626 01830886 0.0043 discontinue

04195626 01779664 0.0008 cease end finish terminate

04195626 01832078 0.0003 continue go_along go_on keep keep_on proceed

04195626 01253107 0.0002 go locomote move travel

04195626 01520167 0.0002 transfer

04195626 01505951 0.0002 give

04195626 01590833 0.0002 furnish provide render supply

04195626 01612822 0.0001 act move

04195626 01775973 0.0000 be

Proto-Classes:

open dobj 04195626 0.0006462453

show subj 04195626 0.0001756852

The last sense of vaso is the equivalent translation of <glassful, glass>. This ILI record belongs to the Semantic File QUANTITY and has been assigned a different WordNet Domain (FACTOTUM-NUMBER). In this case, the Top Concept Ontology provides the properties Composition-Part, SituationType-Static and SituationComponent-Quantity. The SUMO label provides the properties and axioms assigned to ConstantQuantity. The gloss has only one 'silver' word from the eXtended WordNet (quantity); the other two are labelled 'normal'. From the Selectional Preferences acquired from SemCor, we know that the typical events applied to this kind of vaso are, for instance, the Spanish equivalent translations of <drink, imbibe> or <consume, have, ingest, take, take in>. Similar information appears for the parsed SemCor, and no direct proto-models have been acquired. In this case, there are no new relations coming from WordNet 2.0. As before, this knowledge can also be ported to the rest of the connected languages.

vaso_3 09914390-n



SF: 23-NOUN.QUANTITY

DOMAIN: NUMBER

SUMO: &%ConstantQuantity+

TO: 1stOrderEntity-Composition-Part

TO: 2ndOrderEntity-SituationType-Static

TO: 2ndOrderEntity-SituationComponent-Quantity

EN: glassful glass

IT: bicchierata bicchiere

BA: basocada

CA: got vas

09914390-n glassful glass:

GLOSS: the quantity a glass will hold

eXtended WordNet:

GLOSS: the quantity#NN#1 a glass#NN#2 will hold#VB#1

DOBJ SemCor:

09914390 00795711 0.0026 drink imbibe

09914390 01530096 0.0009 accept have take

09914390 00786286 0.0009 consume have ingest take take_in

09914390 01513874 0.0001 acquire get

DOBJ SemCor No-Generalization:

09914390 00795711 drink imbibe

09914390 01530096 accept have take

As we can see, we can consistently add a large set of explicit knowledge about each sense of vaso, which can be used to better differentiate and characterize their particular meanings. We expect to devise appropriate ways to exploit this unique resource in the next rounds.
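The selectional-preference lines shown in the records above follow a regular layout: noun concept, verb concept, an optional weight, and the variants of the verb synset. As a minimal sketch of how such lines could be read, assuming whitespace-separated fields (the `Preference` type and `parse_preference` function are hypothetical names introduced here, not part of the Mcr tools):

```python
from typing import NamedTuple, Optional, Tuple


class Preference(NamedTuple):
    noun_synset: str            # offset of the noun concept, e.g. "09914390"
    verb_synset: str            # offset of the preferred verb concept
    weight: Optional[float]     # None in the "No-Generalization" listings
    verbs: Tuple[str, ...]      # variants of the verb synset


def parse_preference(line: str) -> Preference:
    """Parse one line such as '09914390 00795711 0.0026 drink imbibe'."""
    fields = line.split()
    noun, verb, rest = fields[0], fields[1], fields[2:]
    weight = None
    if rest:
        try:
            # The third field is treated as a weight only if it is numeric.
            weight = float(rest[0])
            rest = rest[1:]
        except ValueError:
            pass
    return Preference(noun, verb, weight, tuple(rest))


# DOBJ SemCor lines copied from the vaso_3 record.
dobj_lines = [
    "09914390 00795711 0.0026 drink imbibe",
    "09914390 01530096 0.0009 accept have take",
    "09914390 00786286 0.0009 consume have ingest take take_in",
    "09914390 01513874 0.0001 acquire get",
]
prefs = [parse_preference(line) for line in dobj_lines]
strongest = max(prefs, key=lambda p: p.weight)
print(strongest.verbs)
```

The same function also accepts the unweighted "No-Generalization" lines, in which case `weight` is left as `None`.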

7.2 The “Pasta” Example

We continue illustrating the current content of the Mcr, after porting, with another simple example: the Spanish noun pasta.

The word pasta (see Tables 14 and 15) illustrates how the different classification schemes uploaded into the Mcr (Semantic File, WordNet Domain, Top Concept Ontology, etc.) are consistent and make clear semantic distinctions between the money sense (pasta 6), the general/chemistry sense (pasta 7) and the food senses (all the rest). The food senses of pasta can now be further differentiated by means of explicit Top Concept Ontology properties. All the food senses are descendants of substance 1 and food 1, and inherit the Top Concept attributes Substance and Comestible, respectively.

pasta#n#7 10541786-n paste#1
gloss: any mixture of a soft and malleable consistency
Domain: chemistry-pure science
Semantic File: 27-Substance
SUMO: Substance-SelfConnectedObject-Object-Physical-Entity
Top Concept ontology:
  Natural-Origin-1stOrderEntity
  Substance-Form-1stOrderEntity

pasta#n#6 09640280-n dough#2, bread#2, loot#2, ...
gloss: informal terms for money
Domain: money-economy-soc.science
Semantic File: 21-MONEY
SUMO: CurrencyMeasure-ConstantQuantity-PhysicalQuantity-Quantity-Abstract-Entity
Top Concept ontology:
  Artifact-Origin-1stOrderEntity
  Function-1stOrderEntity
  MoneyRepresentation-Representation-Function-1stOrderEntity

Table 14: Substance and money senses for the Spanish word pasta

Selectional Preferences can also help to distinguish between senses; e.g., only the money sense has the following preferences as object: 1.44 01576902-v {raise#4}, 0.45 01518840-v {take in#5, collect#2}, 0.23 01565625-v {earn#2, garner#1} or 0.12 01564908-v {clear#15, take in#10, make#10, gain#8, realize#4, pull in#2, bring in#2, earn#1}.

Table 16 presents the new selectional preferences acquired for the Spanish word pasta, that is, the prototypical verbs associated with each English equivalent translation or with its hypernyms.
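To make the discriminating role of these preferences concrete, the following sketch looks up which senses of pasta list a given verb as a preferred governor. It is only an illustration under simplifying assumptions: preferences are keyed by verb lemma rather than by verb synset, only a few entries are copied from the text above and from Table 16, and the names `PREFS` and `senses_for_verb` are hypothetical. Weights coming from different sources are not assumed to be comparable.

```python
# Object preferences per sense, keyed by verb lemma (a simplification:
# in the Mcr the preferences are attached to verb synsets).
PREFS = {
    # money sense, weights quoted in the text
    "pasta#n#6": {"raise": 1.44, "take_in": 0.45, "collect": 0.45,
                  "earn": 0.23, "garner": 0.23},
    # two food senses, "direct" rows of Table 16 (partial)
    "pasta#n#3": {"divide": 0.0127, "wrap": 0.0063, "pack": 0.0045,
                  "mix": 0.0044, "eat": 0.0006},
    "pasta#n#4": {"mix": 0.0065, "add": 0.0004},
}


def senses_for_verb(verb):
    """Return (sense, weight) pairs for senses preferring `verb`, strongest first."""
    hits = [(sense, weights[verb])
            for sense, weights in PREFS.items() if verb in weights]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


print(senses_for_verb("earn"))
print(senses_for_verb("mix"))
```

With this partial data, "earn" selects only the money sense, while "mix" is shared by two food senses and does not reach the money sense at all.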

We can also investigate new inference facilities to enhance the integration process. After full expansion (Realization) of the Ewn Top Concept ontology properties, we will perform a full expansion through the noun part of the hierarchy of the selectional preferences acquired from SemCor and BNC (and possibly other implicit semantic knowledge currently available in Wn, such as meronymy information).

We plan further investigation to perform full bottom-up expansion (Generalization), rather than merely expanding knowledge and properties top-down. In this case, different knowledge and properties can collapse on particular Base Concepts, Semantic Files, Domains and/or Top Concepts, which can then automatically become possible semantic roles for predicates.
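The top-down expansion (Realization) can be pictured as property inheritance along hyponymy links. The sketch below is not the project's implementation: it assumes an acyclic hierarchy, uses a deliberately shortened chain in which the pasta senses hang directly from food 1 (intermediate synsets are omitted), and the names `HYPONYMS`, `DIRECT` and `expand_top_down` are hypothetical.

```python
# Simplified hyponymy links (parent -> children); intermediate synsets omitted.
HYPONYMS = {
    "substance#1": ["food#1"],
    "food#1": ["pasta#n#1", "pasta#n#3"],
}

# Top Concept properties asserted directly on a concept.
DIRECT = {
    "substance#1": {"Substance"},
    "food#1": {"Comestible"},
}


def expand_top_down(root, hyponyms, direct):
    """Return concept -> properties after inheriting from every ancestor.

    Assumes the hierarchy is acyclic; a concept reachable through several
    parents receives the union of the properties found on each path.
    """
    expanded = {}
    stack = [(root, frozenset())]
    while stack:
        concept, inherited = stack.pop()
        props = set(inherited) | direct.get(concept, set())
        expanded[concept] = expanded.get(concept, set()) | props
        for child in hyponyms.get(concept, []):
            stack.append((child, frozenset(props)))
    return expanded


properties = expand_top_down("substance#1", HYPONYMS, DIRECT)
print(sorted(properties["pasta#n#3"]))
```

A bottom-up Generalization would traverse the same links in the opposite direction, collecting on each ancestor what its descendants carry; it is not shown here.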



Domain: gastronomy-alimentation-applied science
Semantic File: 13-FOOD
Top Concept ontology: Comestible-Function-1stOrderEntity, Substance-Form-1stOrderEntity

Top Concept ontology: Natural-Origin-1stOrderEntity

Top Concept ontology: Part-composition-1stOrderEntity

pasta#n#4 05886080-n spread#5, paste#3
gloss: a tasty mixture to be spread on bread or crackers

pasta#n#1 05671312-n pastry#1, pastry dough#1
gloss: a dough of flour and water and shortening

pasta#n#3 05739733-n pasta#1, alimentary paste#1
gloss: shaped and dried dough made from flour and water & sometimes egg

pasta#n#5 05889686-n dough#1
gloss: a dough of flour and water and shortenings

Top Concept ontology: Artifact-Origin-1stOrderEntity, Group-Composition-1stOrderEntity

pasta#n#2 05671439-n pie crust#1, pie shell#1
gloss: pastry used to hold pie fillings

Table 15: Food senses for the Spanish word pasta



pasta#n#1
  hyper2 05909338: divide 0,0127 wrap 0,0063 pack 0,0045 mix 0,0044 press 0,0025 check 0,0013 pass 0,0007 add 0,0006 eat 0,0006 make 0,0005 prevent 0,0004 remove 0,0004 produce 0,0002 leave 0,0001 like 0,0001

pasta#n#2
  hyper-1 05670938: eat 0,0017 serve 0,0012 choose 0,0007 include 0,0002 leave 0,0002 take 0,0001
  hyper-2 05670374: dispense 0,0161 crush 0,0137 pop 0,0120 eat 0,0103 bless 0,0102 chew 0,0095 put out 0,0064 tuck 0,0058 freeze 0,0050 clutch 0,0048 transfer 0,0015 fill 0,0014 try 0,0013 avoid 0,0006 buy 0,0006 include 0,0001 make 0,0001

pasta#n#3
  direct: divide 0,0127 wrap 0,0063 pack 0,0045 mix 0,0044 press 0,0025 check 0,0013 pass 0,0007 add 0,0006 eat 0,0006 make 0,0005 prevent 0,0004 remove 0,0004 produce 0,0002 leave 0,0001 like 0,0001

pasta#n#4
  direct: mix 0,0065 add 0,0004
  hyper-1 05844302: mix 0,0142 picture 0,0097 spread 0,0046 accompany 0,0017 serve 0,0016 hate 0,0013 prepare 0,0013 pass 0,0007 do 0,0005 keep 0,0005 include 0,0004 love 0,0004 like 0,0003 hold 0,0002 make 0,0001 produce 0,0001

pasta#n#5
  hyper-1 105909338: divide 0,0127 wrap 0,0063 pack 0,0045 mix 0,0044 press 0,0025 check 0,0013 pass 0,0007 add 0,0006 eat 0,0006 make 0,0005 prevent 0,0004 remove 0,0004 produce 0,0002 leave 0,0001 like 0,0001

Table 16: New Selectional Preferences for Food senses of “pasta”



7.3 The “Hospital” Example

Finally, we conclude this set of examples with the information currently integrated into the MCR for sense 1 of hospital.

Figure 7 shows the set of properties (Domains, Semantic File, SUMO and Top Concept Ontology) associated with each disambiguated concept of the synset gloss (using eXtended WordNet). This definition describes a hospital (sense 1) as a health facility (sense 1) where patients (sense 1) receive (sense 2) treatment (sense 1). Having all the words of the gloss correctly disambiguated inside the Mcr opens a wide range of possibilities to be explored.

The domains of hospital 1 are building industry, medicine and town planning; the Semantic File is artifact; the SUMO label is StationaryArtifact; and the Top Concept Ontology properties, inherited from the hypernym chain, are Artifact, Building and Object. The domains of health facility 1 are building industry, medicine and town planning; the Semantic File is artifact; the SUMO label is Building; and the Top Concept Ontology properties, inherited from the hypernym chain, are Artifact, Building and Object. The domain of patient 1 is medicine; the Semantic File is person; the SUMO label is Patient; and the Top Concept Ontology properties, inherited from the hypernym chain, are Function, Human, Living and Object. receive 2 has no explicit domain assigned; however, its Semantic File is change, its SUMO label is Getting, and its Top Concept Ontology properties are Dynamic and Experience. Finally, the domain of treatment 1 is medicine; the Semantic File is act; the SUMO label is TherapeuticProcess; and the Top Concept Ontology properties are Agentive, Cause, Condition, Dynamic, Purpose, Social and UnboundedEvent. Furthermore, receive 2 and treatment 1 are both Base Concepts.

In fact, hospital 1 is the place where the event occurs in which a patient 1 receive 2 treatment 1. Or, in other words, do patients get therapeutic processes in all health facilities? What kind of further inferencing capabilities can be derived from the knowledge currently integrated into the Mcr?

We want to stress that all this knowledge is also available to the rest of the languages included in the MCR. That is, we can automatically derive partial translations of the English glosses into the rest of the languages integrated into the MCR. For instance, Table 17 presents the Spanish and Italian translations for all the content words of the hospital 1 gloss that appear in the MCR. We stress again that they also share the same properties and relations.

This knowledge could be used to derive complete and more accurate translations of the English glosses (see Working Paper WP3.9, Statistical Machine Translation of WordNet glosses, for further details of this process).

For instance, the inherent knowledge structure presented in this definition matches appropriately the script-like frame structures of FrameNet [Baker et al., 1997]. Table 18 presents the corresponding slots for the frame Treatment.n. In particular, the hospital 1 example seems to fit correctly several slots of this frame: core slot Patient (patient), core slot Treatment (treatment) and peripheral slot Place (hospital). We should devise ways to integrate FrameNet-like structures in future versions of the MCR, perhaps taking advantage of the Top Concept Ontology to derive the appropriate roles.

Figure 7: Hospital example



English          Spanish                          Italian
hospital         hospital                         ospedale
health facility  centro medico, centro de salud   NOLEX
patient          paciente                         paziente
receive          obtener                          ottenere
treatment        tratamiento, terapia             cura, terapia

Table 17: hospital 1 gloss translation equivalences for Spanish and Italian
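The partial gloss translation described above amounts to a lookup from each disambiguated gloss concept to its variants in the target language. The following sketch illustrates the idea with the data of Table 17. It is a simplification: concepts are keyed by their English word instead of by ILI record, and `VARIANTS` and `translate_gloss_words` are hypothetical names.

```python
# Variants per language for each gloss concept, copied from Table 17.
# None marks a concept with no lexicalization (NOLEX) in that language.
VARIANTS = {
    "hospital": {"es": ["hospital"], "it": ["ospedale"]},
    "health facility": {"es": ["centro medico", "centro de salud"], "it": None},
    "patient": {"es": ["paciente"], "it": ["paziente"]},
    "receive": {"es": ["obtener"], "it": ["ottenere"]},
    "treatment": {"es": ["tratamiento", "terapia"], "it": ["cura", "terapia"]},
}


def translate_gloss_words(words, lang):
    """Map each content word to its variants in `lang`, or None if unavailable."""
    return {word: (VARIANTS.get(word) or {}).get(lang) for word in words}


gloss_words = ["health facility", "patient", "receive", "treatment"]
print(translate_gloss_words(gloss_words, "es"))
print(translate_gloss_words(gloss_words, "it"))
```

The Italian result keeps `None` for health facility, which makes the gap explicit instead of silently dropping the word.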

Frame elements   Type
Affliction       Core
Body part        Core
Degree           Peripheral
Duration         Extra-Thematic
Healer           Core
Manner           Peripheral
Medication       Core
Motivation       Extra-Thematic
Patient          Core
Place            Peripheral
Purpose          Extra-Thematic
Time             Peripheral
Treatment        Core

Table 18: Treatment.n frame
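As a rough illustration of how the hospital 1 gloss lines up with the Treatment.n frame, the sketch below records the three slot fillers suggested in the text and reports which core frame elements remain unfilled. The data structures and the `unfilled_core` helper are hypothetical; nothing here reflects an existing FrameNet or Mcr interface.

```python
# Treatment.n frame elements and their types, copied from Table 18.
FRAME = {
    "Affliction": "Core", "Body part": "Core", "Degree": "Peripheral",
    "Duration": "Extra-Thematic", "Healer": "Core", "Manner": "Peripheral",
    "Medication": "Core", "Motivation": "Extra-Thematic", "Patient": "Core",
    "Place": "Peripheral", "Purpose": "Extra-Thematic", "Time": "Peripheral",
    "Treatment": "Core",
}

# Fillers suggested in the text for the hospital 1 gloss.
FILLERS = {
    "Patient": "patient 1",
    "Treatment": "treatment 1",
    "Place": "hospital 1",
}


def unfilled_core(frame, fillers):
    """Return the core frame elements that have no filler, in alphabetical order."""
    unknown = set(fillers) - set(frame)
    if unknown:
        raise ValueError(f"fillers for unknown frame elements: {sorted(unknown)}")
    return sorted(fe for fe, kind in frame.items()
                  if kind == "Core" and fe not in fillers)


print(unfilled_core(FRAME, FILLERS))
```

The remaining core elements are the ones that the gloss alone does not provide and that other knowledge in the repository would have to supply.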



8 Conclusions

This document described the third version of the Multilingual Central Repository (Mcr2) and the third Porting process (PORT2). We described the knowledge uploaded and integrated into Mcr2, including a brief description of a general Upload/Porting architecture. Finally, we provided a full description of the third Porting process.

The current version of the Mcr integrates wordnets from five different languages. The final version of the Mcr contains 1,642,389 unique semantic relations between concepts (ILI-records). This is one order of magnitude larger than the Princeton wordnet (138,091 unique semantic relations in WN1.6). Table 19 summarizes the main sources of the semantic relations integrated into Mcr2.

source                                          #relations
Acquired from Princeton WN1.6                      138,091
Selectional Preferences acquired from SemCor       203,546
Selectional Preferences acquired from BNC          707,618
New relations acquired from Princeton WN2.0         42,212
Gold relations from eXtended WN                     17,185
Silver relations from eXtended WN                  239,249
Normal relations from eXtended WN                  294,488
Total                                            1,642,389

Table 19: Main sources of semantic relations

Furthermore, the current Mcr has also been enriched with 466,972 semantic properties coming from different sources. Table 20 summarizes the main sources of the semantic properties integrated into Mcr2.

source                  #properties
WordNet Domain              110,556
Top Concept Ontology        256,776
SUMO                         99,640
Total                       466,972

Table 20: Main sources of semantic properties
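The totals reported in Tables 19 and 20, and the "one order of magnitude" comparison with WN1.6, can be checked with a short calculation. The figures below are copied from the two tables; the variable names are only illustrative.

```python
# Figures from Table 19 (semantic relations per source).
relations = {
    "Princeton WN1.6": 138_091,
    "Selectional Preferences (SemCor)": 203_546,
    "Selectional Preferences (BNC)": 707_618,
    "Princeton WN2.0 (new relations)": 42_212,
    "eXtended WN gold": 17_185,
    "eXtended WN silver": 239_249,
    "eXtended WN normal": 294_488,
}

# Figures from Table 20 (semantic properties per source).
semantic_properties = {
    "WordNet Domain": 110_556,
    "Top Concept Ontology": 256_776,
    "SUMO": 99_640,
}

total_relations = sum(relations.values())
total_properties = sum(semantic_properties.values())
# How many times larger the Mcr is than WN1.6 in number of relations.
growth = total_relations / relations["Princeton WN1.6"]

print(total_relations, total_properties, round(growth, 1))
```

The sums reproduce the totals given in the tables, and the ratio of roughly 12 supports the order-of-magnitude claim.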

In fact, the resulting Mcr2 is the largest and richest multilingual lexical knowledge base ever built. In this way, the Mcr produced by Meaning will constitute the natural multilingual large-scale linguistic resource for a number of semantic processes that need large amounts of linguistic knowledge to be effective tools.



References

[Agirre and de Lacalle, 2004] Eneko Agirre and Oier Lopez de Lacalle. Publicly available topic signatures for all wordnet nominal senses. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, 2004.

[Agirre and Martinez, 2001] E. Agirre and D. Martinez. Learning class-to-class selectional preferences. In Proceedings of CoNLL01, Toulouse, France, 2001.

[Agirre and Martinez, 2002] E. Agirre and D. Martinez. Integrating selectional preferences in wordnet. In Proceedings of the first International WordNet Conference in Mysore, India, 21-25 January 2002.

[Agirre et al., 2002] E. Agirre, O. Ansa, X. Arregi, J.M. Arriola, A. Diaz de Ilarraza, E. Pociello, and L. Uria. Methodological issues in the building of the basque wordnet: quantitative and qualitative analysis. In Proceedings of the first International WordNet Conference in Mysore, India, 21-25 January 2002.

[Alfonseca and Manandhar, 2002] E. Alfonseca and S. Manandhar. An unsupervised method for general named entity recognition and automated concept discovery. In Proceedings of the 1st International Conference on General WordNet, Mysore, India, 2002.

[Atserias et al., 1997] J. Atserias, S. Climent, X. Farreres, G. Rigau, and H. Rodríguez. Combining multiple methods for the automatic construction of multilingual wordnets. In Proceedings of RANLP’97, pages 143–149, Bulgaria, 1997.

[Atserias et al., 2004a] J. Atserias, S. Climent, and G. Rigau. Towards the meaning top ontology: Sources of ontological meaning. In 4th International Conference on Language Resources and Evaluation (LREC), 2004.

[Atserias et al., 2004b] J. Atserias, G. Rigau, and L. Villarejo. Spanish wordnet 1.6: Porting the spanish wordnet across princeton versions. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 2004.

[Baker et al., 1997] C. Baker, C. Fillmore, and J. Lowe. The berkeley framenet project. In COLING/ACL’98, Montreal, Canada, 1997.

[Banerjee and Pedersen, 2003] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of 18th International Joint Conference on Artificial Intelligence (IJCAI’03), Acapulco, Mexico, 2003.

[Benítez et al., 1998] L. Benítez, S. Cervell, G. Escudero, M. Lopez, G. Rigau, and M. Taule. Methods and tools for building the catalan wordnet. In Proceedings of the ELRA Workshop on Language Resources for European Minority Languages, First International Conference on Language Resources & Evaluation, Granada, Spain, 1998.



[Daude et al., 1999] J. Daude, L. Padro, and G. Rigau. Mapping Multilingual Hierarchies Using Relaxation Labeling. In Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC’99), Maryland, US, 1999.

[Daude et al., 2000] J. Daude, L. Padro, and G. Rigau. Mapping WordNets Using Structural Information. In Proceedings of 38th annual meeting of the Association for Computational Linguistics (ACL’2000), Hong Kong, 2000.

[Daude et al., 2001] J. Daude, L. Padro, and G. Rigau. A complete wn1.5 to wn1.6 mapping. In Proceedings of NAACL Workshop “WordNet and Other Lexical Resources: Applications, Extensions and Customizations”, Pittsburgh, PA, United States, 2001.

[Fellbaum, 1998] C. Fellbaum, editor. WordNet. An Electronic Lexical Database. The MIT Press, 1998.

[Fernandez et al., 2004] J. Fernandez, M. Castillo, G. Rigau, J. Atserias, and J. Turmo. Automatic acquisition of sense examples using exretriever. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 2004.

[Gangemi et al., 2003] A. Gangemi, R. Navigli, and P. Velardi. Axiomatizing wordnet glosses in the ontowordnet project. In Proceedings of 2nd International Semantic Web Conference Workshop on Human Language Technology for the Semantic Web and Web Services, Sanibel Island, Florida, 2003.

[Guarino and Welty, 2000] Nicola Guarino and Christopher A. Welty. A formal ontology of properties. In Proceedings of ECAI’2000 Workshop on Knowledge Acquisition, Modeling and Management, pages 97–112, 2000.

[Hirst and St-Onge, 1998] G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In WordNet: An Electronic Lexical Database and Some of its Applications, Editor C. Fellbaum. MIT Press, 1998.

[Jiang and Conrath, 1997] J. Jiang and D. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan, 1997.

[Kipper et al., 2000] K. Kipper, H. Trang Dang, and M. Palmer. Class-based construction of a verb lexicon. In AAAI-2000 Seventeenth National Conference on Artificial Intelligence, 2000.

[Leacock and Chodorow, 1998] C. Leacock and M. Chodorow. Combining local context and wordnet similarity for word sense identification. In WordNet: An Electronic Lexical Database and Some of its Applications, Editor C. Fellbaum. MIT Press, 1998.



[Lin, 1998a] D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL’1998, Montreal, Canada, 1998.

[Lin, 1998b] D. Lin. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning (ICML’98), Madison, Wisconsin, USA, 1998.

[Lyons, 1977] J. Lyons, editor. Semantics 1. Cambridge University Press, Cambridge, UK, 1977.

[Magnini and Cavaglia, 2000] B. Magnini and G. Cavaglia. Integrating subject field codes into wordnet. In Proceedings of the Second International Conference on Language Resources and Evaluation LREC’2000, Athens, Greece, 2000.

[McCarthy, 2001] D. McCarthy. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. PhD thesis, University of Sussex, 2001.

[Mihalcea and Moldovan, 2001] R. Mihalcea and D. Moldovan. Extended wordnet: Progress report. In Proceedings of NAACL Workshop on WordNet and Other Lexical Resources, Pittsburgh, PA, 2001.

[Niles and Pease, 2001] I. Niles and A. Pease. Towards a standard upper ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), pages 17–19. Chris Welty and Barry Smith, eds, 2001.

[Patwardhan, 2003] S. Patwardhan. Incorporating dictionary and corpus information into a context vector measure of semantic relatedness. Master’s thesis, University of Minnesota, Duluth, 2003.

[Pianta et al., 2002] E. Pianta, L. Bentivogli, and C. Girardi. Multiwordnet: developing an aligned multilingual database. In First International Conference on Global WordNet, Mysore, India, 2002.

[Resnik, 1995] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of 14th International Joint Conference on Artificial Intelligence (IJCAI’95), Montreal, Canada, 1995.

[Sofia et al., 2002a] S. Sofia, N. Alexandros, H. Jeroen, S. Maximiliano, and C. Dimitris. Extending the eurowordnet with domain-specific terminology using an expand model approach. In Proceedings of the 1st Global WordNet Association conference, Mysore, India, 2002.

[Sofia et al., 2002b] S. Sofia, O. Kemal, P. Karel, C. Dimitris, C. Dan, T. Dan, K. Svetla, T. George, D. Dominique, and G. Maria. Balkanet: A multilingual semantic network for the balkan languages. In Proceedings of the 1st Global WordNet Association conference, 2002.



[Vossen, 1998] P. Vossen, editor. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, 1998.

[Wagner, 2002] A. Wagner. Learning thematic role relations for wordnets. In Proceedings of ESSLLI-2002 Workshop on Machine Learning Approaches in Computational Linguistics, Trento, Italy, 2002.

[Wu and Palmer, 1994] Z. Wu and M. Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL’94), Las Cruces, New Mexico, 1994.

